Zoom talk on “Alternatives to Rstudio” from the Grenoble (FR) R user group

The next talk of Grenoble’s R user group will be on December 17th, 2020 at 5PM (FR) and is free and open to all:

Alternatives to Rstudio
RStudio is the most widely used IDE designed to optimize the workflow with the R language. However, there exist many alternatives designed for more specific use cases. Among the most popular are VS Code, Jupyter, Atom and many others. What advantages do they offer? Can they be more useful than RStudio?
Hope to see you there!

Members of the R community: be part of the response to COVID-19 (and future epidemic outbreaks)

Dear R users,

We are thrilled to present to you a tool we have been developing with the R Epidemics Consortium (RECON) thanks to a grant from the R Consortium: the COVID-19 challenge.

It is an online platform whose general goal is to connect members of the R community, R package developers, and field agents working on the response to COVID-19 who use R (such as epidemiologists, statisticians or mathematical modellers) to help them meet their R-related needs. It provides a single place for field agents to give real-time feedback on their analytical needs (such as requesting specific analysis templates, new functions, or new method implementations). These requests are then compiled and organized by order of priority (here) for package developers and (hopefully many!) members of the R community to browse and contribute to.






Many COVID-19 field agents use R to develop their analysis pipelines, but may lack specific knowledge or time to implement some of their needs. That is why involving the R community in helping them could prove very important.

For members of the R community it is not only a great opportunity to contribute to the worldwide response to COVID-19 and apply their skills with direct benefit to the community, but also a chance to encourage free, open and citizen science through the development of free and open-source professional tools that aim to become the new standards in epidemic outbreak response.

These packages have already been successfully used in outbreaks such as the Ebola outbreaks in West Africa (2014-2016) and Eastern Democratic Republic of the Congo (2018-2020), and are currently used by various public health institutions and academic modelling groups in the COVID-19 response.

Although this platform has been developed specifically to contribute to the response to COVID-19 we hope to create a dynamic community that will outlast this epidemic, and become a long term methodological contributor.

If you have any questions or suggestions, feel free to write to me at [email protected]; we welcome all helpful feedback. Also, please share this with your local R user group: the more you help us circulate the word, the more successful the project will be 🙂

Useful links:


Granger-causality without assuming linear regression, enhancements to generalCorr package

Consider the regression Y(t) = a0 + a1 Y(t-1) + ... + ap Y(t-p) + b1 X(t-1) + ... + bp X(t-p) + e(t).

Let (X --g--> Y) denote that the time series X(t) Granger-causes the Y(t) series. The R package `lmtest' has a function grangertest() for testing (X --g--> Y). It tests the Granger non-causality null hypothesis H0: b1 = b2 = ... = bp = 0, that is, that the relevant regression coefficients are all zero. This is a standard procedure in econometrics textbooks and assumes linear regression and the F-test. The F-test is correct only if the underlying distribution of the regression errors e(t) is Normal. Normality is a strong assumption and is easily relaxed by using the bootstrap. generalCorr::bootGcRsq relaxes the Normality assumption and uses kernel regressions, which provide far better fits (higher R-squares). generalCorr::causeSummary(mtx) is a powerful tool for assessing concurrent causality not covered by Granger causality.

Measures of dependence in statistics are symmetric. Why? Dependence relations in nature or data are almost never symmetric. (a) An infant depends on its mother for survival, but the mother's survival does not equally depend on the infant. (b) New York's rainfall depends on its latitude, but the latitude does not depend on New York's rainfall at all. As a measure of dependence, the 100+ year old Pearson correlation coefficient miserably underestimates dependence. For example, if x = 1:10 and y = sin(x), then y perfectly depends on x, so a good measure of dependence should be 1. Instead, the Pearson correlation coefficient of -0.17 underestimates it by 83%. The gmcmtx0(mtx) function in the `generalCorr' package provides a non-symmetric matrix of generalized correlation coefficients with the correct measure of dependence, and depMeas(x, y) gives a correct measure of dependence for a pair of variables.
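To make these ideas concrete, here is a minimal sketch using simulated data; grangertest() is the classical F-test, while the generalCorr functions mentioned above are the kernel-regression and bootstrap alternatives (exact arguments may need adjusting for your own data):

library(lmtest)
library(generalCorr)

set.seed(1)
n <- 100
x <- cumsum(rnorm(n))              # a driving time series
y <- c(0, 0.6 * x[-n]) + rnorm(n)  # y depends on lagged x

grangertest(y ~ x, order = 2)      # H0: lagged x does not help predict y (assumes Normal errors)

# Asymmetric dependence: Pearson correlation vs. the generalized measure
x2 <- 1:10
cor(x2, sin(x2))                   # about -0.17, badly understating the dependence
depMeas(x2, sin(x2))               # generalized (asymmetric) dependence measure

causeSummary(cbind(y, x))          # concurrent causality summary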

Tracking Indonesia’s economic recovery from COVID-19

While help is still on the way, it is important to track a country's progress in recovering from the COVID-19 pandemic.

But the lack of a single platform providing the necessary data in Indonesia poses a challenge to doing just that. The relevant data, such as consumer confidence and confirmed cases and deaths, are publicly accessible but scattered across several websites.

Using R, I developed a website in early November that I hope can overcome that challenge. By aggregating the data on one platform, the website seeks to serve as a sort of recovery tracker for not only the economy but also public health.

You can visit the tracker at https://dzulfiqarfr.github.io/indonesia-recovery-tracker/

A summary table of the economic indicators on the website.

I use R Markdown to build the website. To create the interactive graphs, I use Plotly through the plotly package. I also use the gt package to create the tables on the website.
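As a small illustration of that stack, here is a minimal sketch (with made-up numbers, not the tracker's real data pipeline) of a plotly chart and a gt table like the ones on the website:

library(dplyr)   # for the pipe
library(plotly)
library(gt)

# Illustrative values only
gdp <- data.frame(
  quarter    = c("2020 Q1", "2020 Q2", "2020 Q3"),
  growth_yoy = c(2.97, -5.32, -3.49)
)

# Interactive line chart of year-on-year GDP growth
plot_ly(gdp, x = ~quarter, y = ~growth_yoy,
        type = "scatter", mode = "lines+markers")

# Summary table
gdp %>%
  gt() %>%
  tab_header(title = "GDP growth (% year-on-year)")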

The indicators are as follows:
  • Gross domestic product (GDP);
  • Inflation;
  • Unemployment rate;
  • Poverty rate;
  • Consumer confidence;
  • Retail sales;
  • Movement trends;
  • Confirmed cases and deaths; and
  • Coronavirus tests

A graph of the retail sales index on the website.

The website does not cover all available economic and public health indicators, but I try to make sure it has timely and relevant data to see how the country is doing as the pandemic unfolds.

I may add another indicator to the website in the future, if necessary.

The website is still far from perfect. So if you have any suggestions or find a bug, please let me know!

R Shiny in the Classroom

Three years ago, while I was teaching elementary statistics at a community college in California, finally fed up with the cost and deficiencies of the technology foisted on students by proprietary textbook sites and math homework websites, I decided to create a suite of R Shiny applications which would provide the students with a free computing resource to accomplish three things. It would:

(1) give students the means to perform all basic statistical computations associated with an elementary statistics course. In particular, it would allow students to perform tests and construct confidence intervals starting either from raw data or, as is often the case in textbook problems, from summary statistics,

(2) provide data simulations and interactive demonstrations to illustrate fundamental statistical concepts, and

(3) assist students in deciding what sort of statistical tool/computation is appropriate to use in a given situation through the use of an ‘expert system’ front end.

As matters eventuated, I ended up quitting my teaching and going on a very long bike ride across Western Europe before I could implement my idea (a different and perhaps more interesting tale). But I finally did do so when I returned to the US, deploying my apps on an R Shiny server behind an NGINX web server on a cloud instance of Ubuntu 16.04 which I set up specifically for the purpose. The result is here:

http://104.236.178.75:3838/sort/sorta/

Many of the apps could certainly be refined and stabilized. I am completely self-educated in all things computing, in particular as regards R programming and R Shiny app development. These apps were written as learning projects for me as much as they were intended to address the three goals enumerated above. But as an outgrowth of writing the statistics apps, it became clear to me that Shiny apps are widely adaptable to a host of pedagogical purposes, really across disciplines, although, because of the nature of R, applications for use in the teaching of math, statistics, and the sciences come first to mind.

I went on to create a few dozen Shiny apps for non-statistical but mostly pedagogical applications. Those of vaguely mathematical theme are here


https://www.apclam.com/shinymath.html


whereas an over-arching point of access to all such apps is here

https://www.apclam.com/shinydir.html .

My point here is not so much to draw attention to the apps which I have developed and deployed, which, as I freely admit, are rudimentary in many respects, but rather to call the attention of deans, department chairwomen and chairmen, and faculty members at high schools, colleges and universities to the opportunity to use R Shiny to liberate curricula, and budgets, from the strictures imposed by proprietary websites and software served up by textbook publishers and various other academic hangers-on. Here are some key points militating in favor of doing so:

(1) The ‘pro’ version of the R Shiny server is available to academic institutions at no to very low cost.

(2) The learning curve involved in writing and deploying R Shiny apps is not steep. For people who already have experience writing R scripts it can be fairly characterized as easy. Even for people who have limited programming background, it is a manageable task, particularly if given institutional guidance in the form of workshops or training materials.

(3) Web activities can be created which are tuned to the very specific needs of one's own curriculum, institution and even classroom. In particular, R's extensive base graphics capabilities can be leveraged to create learning tools which integrate numerical, symbolic, and graphical approaches to demonstrating and interacting with key concepts (see the minimal sketch after this list).

(4) Developing and deploying R Shiny apps gives faculty and staff a constructive outlet for creativity which provides opportunities for collaboration within and across disciplines and results in a tangible (OK, internet ‘tangible’) product which can be used by students and faculty throughout the world.
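To give a sense of how little code such an app requires, here is a minimal sketch (not one of my deployed apps) of the kind of base-graphics demonstration mentioned in point (3): students move a slider and watch how the sampling distribution of the mean tightens as the sample size grows.

library(shiny)

ui <- fluidPage(
  titlePanel("Sampling distribution of the mean"),
  sliderInput("n", "Sample size", min = 2, max = 200, value = 30),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # simulate 1000 sample means from a standard normal population
    means <- replicate(1000, mean(rnorm(input$n)))
    hist(means, breaks = 30, col = "grey",
         main = paste("n =", input$n), xlab = "Sample mean")
  })
}

shinyApp(ui, server)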

I would be happy to share my experiences in this realm to help anyone or any institution pursue this path. I think the possibilities in this direction are almost without limit. Feel free to email me at [email protected].

COVID-19 Posts: A Public Dataset Containing 400+ COVID-19 Blog Posts

Over the last few months, we've been collecting hundreds of COVID-19 blog posts from the R community. Today, we are excited to share this dataset publicly, to help bloggers who want to analyze COVID-19 data with R find and learn from what the community has already written.

So far, we have found and recorded 423 COVID posts in English. To encourage others to explore such posts, we've published a Shiny web app which allows users to find the names of the 231 bloggers who wrote those posts, their roles, and their country of focus. The app also lets users interactively search the collection of posts by primary topic, post title, date, and whether the post uses a particular mathematical technique or data source. To learn more about the evolution of this dataset, one of the authors (Rees) has published nine articles on Medium, which you can find here.

We encourage users to submit their own posts (or others' posts) for inclusion, which can be done via this Google Form. Our dataset, as well as the code for the Shiny app, is available on GitHub. If anyone has corrections to the dataset, please write Rees (at) ReesMorrison (dot) com.

The remainder of this post highlights some of the findings from the dataset of COVID-19 posts. As will be made evident by the plots that follow, this is by no means a comprehensive review of every COVID-19 R blog post, but rather an overview of the data that we have found.

Posts Over Time

As the pandemic has progressed, fewer bloggers have engaged with COVID-related data; blog posts peaked in March of 2020.

Some bloggers have been prolific; many more have been one and done. The plot below shows the names and posts of the 23 bloggers who have so far published at least four posts. For an example of how to read the plot, Tim Churches, at the bottom of the y-axis, has published a total of nine posts, but none after early April.

The color of the points corresponds to the work role of the blogger as explained in the legend at the bottom. It is immediately apparent that professors and academic researchers predominate in this group of bloggers. If you include the postgraduate students, universities writ large account for nearly all of the prolific bloggers.

Roles of Authors

The bloggers in our dataset describe their work-day roles in a variety of ways. One of the authors (Rees) standardized these job roles by categorizing the multitude of terms and descriptions, but it is quite possible that this effort misrepresented what some of these bloggers do for a living. We welcome corrections.

We’ve further categorized roles into a broad typology where professions fall into one of five categories: university, corporate, professional, government, and nonprofit. Those broader categories are represented as columns in the following chart.

Data Sources

A greater number of data sources related to COVID-19 will yield richer insights. Combining different datasets can shed new light on an issue, yield improvements, and allow authors to construct better indices and measures. For that reason, one of the authors (Rees) extracted data-source information from our collection of blog posts.

For the most part, bloggers identified the data source they drew on for their analysis. On occasion, we had to apply some effort to standardize the 140 data sources.

By far the most prevalent data source is Johns Hopkins University, which has set the standard for COVID-19 data collection and dissemination to the public early, comprehensively, and consistently.

Blog Post Topics

It may also be the case that readers want a summary of blogs, or to only look at posts that pertain to a certain topic. Assigning each blog post a primary topic introduces a fair amount of subjectivity, to be sure, but the hope is that these broad topics will help researchers find content and colleagues who share similar interests.

Here, a balloon plot shows various categories that the 423 posts address as their primary topic. Topics fall on the y axis and the blogger’s category of employment is on the x axis. The size (and opacity) of each bubble represents the count of posts that match that combination. Epidemiology leads the way, as might be expected, but quite a few posts seem to use COVID data to showcase something else, or apply R in novel ways.
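For readers who want to try a similar plot on the published dataset, a minimal ggplot2 sketch along these lines (shown here with toy stand-in data, since the column names in the real dataset may differ) produces a balloon plot:

library(dplyr)
library(ggplot2)

# Toy stand-in for the real dataset: one row per post
posts <- data.frame(
  topic     = sample(c("Epidemiology", "Data visualization", "Modeling"),
                     200, replace = TRUE),
  role_type = sample(c("University", "Corporate", "Government"),
                     200, replace = TRUE)
)

posts %>%
  count(topic, role_type) %>%
  ggplot(aes(x = role_type, y = topic, size = n, alpha = n)) +
  geom_point() +
  labs(x = "Employment category", y = "Primary topic")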

Concluding Thoughts

We encourage you to use our Shiny application to explore the data for yourself. If you’d like to submit your post to be included, fill out this Google Form.

As we note in the footer of the application, the R community is intelligent and produces interesting content, but not all of us are experts when it comes to COVID-19. Engaging with these posts will allow you to better understand the application of R to our current moment, and perhaps provide feedback to post authors. We do not endorse the findings of any particular author and encourage you to find accurate, relevant, and recent information from reputable sources such as the CDC and the WHO.


Originally published on my blog. Follow the authors Rees and Connor on Twitter.

Most popular on Netflix, Disney+, Hulu and HBOmax. Weekly Tops for last 60 days

A couple of months ago I published "Most popular on Netflix. Daily Tops for last 60 days" – a small piece of research based on daily scraping data that answered the following questions: How many movies (titles) made the Netflix daily Tops? Which movie was #1 on Netflix the longest? For how many days do movies / TV shows stay in the Tops and at #1? etc.
This time I am sharing an analysis of the most popular movies / TV shows across Netflix, Disney+, Hulu and HBO Max on a weekly basis instead of daily, in the hope of catching trends better.

So, let's count how many movies made the Top 5; I assume it is fewer than 5 * 60…
library(tidyverse)
library (gt)

platforms <- c('Disney+','HBOmax', 'Hulu', 'Netflix') # additionally, load CSV data using readr 
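(The file names below are hypothetical, standing in for the scraped data mentioned in the comment above; readr loads them into the raw data frames used in the next step.)

netflix_raw <- readr::read_csv("netflix_top5.csv")
disney_raw  <- readr::read_csv("disney_top5.csv")
hbomax_raw  <- readr::read_csv("hbomax_top5.csv")
hulu_raw    <- readr::read_csv("hulu_top5.csv")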
Wrangle raw data – reverse (fresh date first), take top 5, take last 60 days
# the *_raw data frames are the scraped CSVs loaded above (hypothetical names)
fjune_dt  <- netflix_raw %>% rev () %>% slice (1:5) %>% select (1:60) # Netflix
fdjune_dt <- disney_raw  %>% rev () %>% slice (1:5) %>% select (1:60) # Disney+
hdjune_dt <- hbomax_raw  %>% rev () %>% slice (1:5) %>% select (1:60) # HBO Max
hulu_dt   <- hulu_raw    %>% rev () %>% slice (1:5) %>% select (1:60) # Hulu
Gather it together and count the number of unique titles in Top5 for 60 days
fjune_dt_gathered <- gather (fjune_dt)
fdjune_dt_gathered <- gather (fdjune_dt)
hdjune_dt_gathered <- gather (hdjune_dt)
hulu_dt_gathered <- gather (hulu_dt)
unique_fjune_gathered  <- unique (fjune_dt_gathered$value)  %>% length ()
unique_fdjune_gathered <- unique (fdjune_dt_gathered$value) %>% length ()
unique_hdjune_gathered <- unique (hdjune_dt_gathered$value) %>% length ()
unique_hulu_gathered   <- unique (hulu_dt_gathered$value)   %>% length ()
unique_gathered <- c(unique_fdjune_gathered, unique_hdjune_gathered, unique_hulu_gathered, unique_fjune_gathered)
unique_gathered <- as.data.frame (t(unique_gathered), stringsAsFactors = F)
colnames (unique_gathered) <- platforms
Let's make a nice table for the results
unique_gathered_gt <- gt (unique_gathered) %>%
tab_header(
  title = "Number of unique movies (titles) in Top5")%>%
  tab_style(
    style = list(
      cell_text(color = "purple")),
    locations = cells_column_labels(
      columns = vars(HBOmax)))%>%
  tab_style(
    style = list(
      cell_text(color = "green")),
    locations = cells_column_labels(
      columns = vars(Hulu))) %>%
  tab_style(
    style = list(
      cell_text(color = "red")),
    locations = cells_column_labels(
      columns = vars(Netflix)))
unique_gathered_gt


Using similar code we can count the number of unique titles which were #1 for one or more days


What movie was the longest in Tops / #1?
table_fjune_top5 <- sort (table (fjune_dt_gathered$value), decreasing = T) # Top5
table_fdjune_top5 <- sort (table (fdjune_dt_gathered$value), decreasing = T)
table_hdjune_top5 <- sort (table (hdjune_dt_gathered$value), decreasing = T)
table_hulu_top5 <- sort (table (hulu_dt_gathered$value), decreasing = T)
Plotting the results
bb5fdjune <- barplot (table_fdjune_top5 [1:5], ylim=c(0,62), main = "Days in Top5, Disney+", las = 1, col = 'blue')
text(bb5fdjune,table_fdjune_top5 [1:5] +2,labels=as.character(table_fdjune_top5 [1:5]))
bb5hdjune <- barplot (table_hdjune_top5 [1:5], ylim=c(0,60), main = "Days in Top5, HBO Max", las = 1, col = 'grey', cex.names=0.7)
text(bb5hdjune,table_hdjune_top5 [1:5] +2,labels=as.character(table_hdjune_top5 [1:5]))
bb5hulu <- barplot (table_hulu_top5 [1:5], ylim=c(0,60), main = "Days in Top5, Hulu", las = 1, col = 'green')
text(bb5hulu,table_hulu_top5 [1:5] +2,labels=as.character(table_hulu_top5 [1:5]))
bb5fjune <- barplot (table_fjune_top5 [1:5], ylim=c(0,60), main = "Days in Top5, Netflix", las = 1, col = 'red')
text(bb5fjune,table_fjune_top5 [1:5] +2,labels=as.character(table_fjune_top5 [1:5]))

The same for the movies / TV shows that reached first place in the weekly count


Average days in top distribution
#top 5
ad5_fjune <- as.data.frame (table_fjune_top5, stringsAsFactors = FALSE)
ad5_fdjune <- as.data.frame (table_fdjune_top5, stringsAsFactors = FALSE)
ad5_hdjune <- as.data.frame (table_hdjune_top5, stringsAsFactors = FALSE)
ad5_hulu <- as.data.frame (table_hulu_top5, stringsAsFactors = FALSE)
par (mfcol = c(1,4))
boxplot (ad5_fdjune$Freq, ylim=c(0,20), main = "Days in Top5, Disney+")
boxplot (ad5_hdjune$Freq, ylim=c(0,20), main = "Days in Top5, HBO Max")
boxplot (ad5_hulu$Freq, ylim=c(0,20), main = "Days in Top5, Hulu")
boxplot (ad5_fjune$Freq, ylim=c(0,20), main = "Days in Top5, Netflix")

The same for the movies / TV shows that reached first place in the weekly count (#1)

DN Unlimited 2020: Europe’s largest data science gathering | Nov 18 – 20 online

  • The DN Unlimited Conference will take place online for the first time this year
  • More than 100 speakers from the fields of AI, machine learning, data science, and technology for social impact, including from The New York Times, IBM, Bayer, and Alibaba Cloud
  • Fully remote networking opportunities via a virtual hub

Europe’s largest data science community launches new digital platform for this year’s conference

The Data Natives Conference, Europe's biggest data science gathering, will take place virtually and invite data scientists, entrepreneurs, corporates, academia, and business innovation leaders to connect on November 18-20, 2020. The conference's mission is to connect data experts, inspire them, and let people become part of the equation again. With its digital networking platform, DN Unlimited expects to reach a new record high of 5,000+ participants. Visitors can expect keynotes and panels from industry experts and a unique opportunity to start new collaborations during networking and matchmaking sessions. In 2019, the sold-out Data Natives conference gathered over 3,000 data and technology professionals and decision-makers from over 30 countries, including 29 sponsors, 45 community and media partners, and 176 speakers. The narrative of DN Unlimited Conference 2020 focuses on assisting the digital transformation of businesses, governments, and communities by offering a fresh perspective on data technologies – from empowering organizations to revamp their business models to shedding light on social inequalities and challenges like climate change and healthcare accessibility.

Data science, new business models and the future of our society

In spring 2020, the Data Natives community of 80,000 data scientists mobilised to tackle the challenges brought by the pandemic – from the shortage of medical equipment to remote care – in a series of Hackcorona and EUvsVirus hackathons. Through the collaboration of governments such as the Greek Ministry for Digital Governance, institutions such as the Charité, and experts from all over Europe, over 80 data-driven solutions have been developed. The DN Unlimited conference will continue to facilitate similar cooperation.

The current crisis demonstrates that only through collaboration can businesses thrive.

While social isolation may be limiting traditional networking opportunities, we are more equipped than ever before to make connections online. “…The ability to connect to people and information instantly is so common now. It’s just the beginning of an era of even more profound transformation. We’re living in a time of monumental change. And as the cloud becomes ambiguous, it’s literally rewriting entire industries” – says Gretchen O’Hara, Microsoft VP; DN Unlimited & Humanaize Open Forum speaker.

The crisis has called for a digital realignment from both companies and institutions. Elena Poughia, the Founder of Data Natives, perceives the transformation as follows: “It’s not about deploying new spaces via data or technology – it’s about amplifying human strengths. That’s why we need to continue to connect with each other to pivot and co-create the solutions to the challenges we’re facing. These connections will help us move forward.” 

The DN Unlimited Conference will bring together data & technology leaders from across the globe – Christopher Wiggins (Chief Data Scientist, The New York Times), Lubomila Jordanova (CEO & Founder, Plan A), Angeli Moeller (Bayer AG, Head Global Data Assets), Jessica Graves (Founder & Chief Data Officer, Sefleuria) and many more will take on the virtual stages to talk about the growing urge for global data literacy, resources for improving social inequality and building a data culture for agile business development. 

On stage among others:
  • Erika Cheung, Executive Director, Ethics in Entrepreneurship
  • Cory Doctorow, Science Fiction Author, Activist, and Journalist.
  • Whurley (William Hurley), Eisenhower Fellow, a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE)
  • Alistair Croll, Founder, Solve for Interesting
  • Clare Jones, Chief Commercial Officer, what3words
  • Mark Turrell, Orcasci, CEO
  • And many more

Spatial Data Analysis Using Artificial Neural Networks, Part 2


Richard E. Plant

Additional topic to accompany Spatial Data Analysis in Ecology and Agriculture using R, Second Edition (http://psfaculty.plantsciences.ucdavis.edu/plant/sda2.htm). Chapter and section references are contained in that text, which is referred to as SDA2. Data sets and R code are available in the Additional Topics through the link above.

Foreword

This post originally appeared as an Additional Topic to my book as described above. Because it may be of interest to a wider community than just agricultural scientists and ecologists, I am posting it here as well. However, it is too long to be a single post. I have therefore divided it into two parts, posted successively. I have also removed some of the mathematical derivations. This is the second of these two posts. In the first post we discussed ANNs with a single hidden layer and constructed one to see how it functions. This second post discusses multilayer and radial basis function ANNs, ANNs for multiclass problems, and the determination of predictor variable importance. Some of the graphics and formulas became somewhat blurred when I transferred them from the original document to this post. I apologize for that; again, if you want to see the clearer version you should go to the website listed above and download the pdf file.

3. The Multiple Hidden Layer ANN

Although artificial neural networks with a single hidden layer can provide quite general solution capabilities (Venables and Ripley, 2002, p. 245), preference is sometimes expressed for multiple hidden layers. The most widely used R ANN package providing multiple hidden layer capability is neuralnet (Fritsch et al., 2019). Let’s try it out on our training data set using two hidden layers with four cells each (Fig. 19).

> set.seed(1)
> mod.neuralnet <- neuralnet(QUDO ~ MAT + Precip,
+     data = Set2.Train,
+     hidden = c(4, 4), act.fct = "logistic",
+     linear.output = FALSE)
> Predict.neuralnet <-
+     predict(mod.neuralnet, Set2.Test)
> plotnet(mod.neuralnet, cex = 0.75) # Fig. 19a
> p <- plot.ANN(Set2.Test, Predict.neuralnet, 0.5,
+     "neuralnet Predictions, Two Hidden Layers")

The argument hidden = c(4, 4) specifies two hidden layers of four cells each, and the argument linear.output = FALSE specifies that this is a classification and not a regression problem. As with the single layer ANNs of the previous section, the two layer ANNs of neuralnet() produce a variety of outputs, some tame and some not so tame (Exercise 6).



Figure 19. (a) Diagram of the neuralnet ANN with two hidden layers; (b) prediction map of the ANN.

 
Instead of using Newton’s method, found in the function nlm() and approximated in nnet(), it has become common in ANN work to use an alternative method called gradient descent. The neuralnet package permits the selection of various versions of the gradient descent algorithm. Suppose for definiteness that the objective function J is the sum of squared errors (everything works the same if it is, for example, the cross entropy, but the discussion is more complex). Denote the combined sets of biases and weights by wi, i = 1,…,nw (for our example of Section 2, with one hidden layer and M = 4, nw = 17). If the ANN returns an estimate Ŷ of the vector Y of class labels then we can write the sum of squared errors of the ANN as

J(w1, …, wnw) = Σi [Yi - Ŷi(w1, …, wnw)]².     (7)
The intent of this notation is to indicate explicitly that the vector of estimates Ŷ depends on the biases and weights. Minimizing J(w1,…,wnw) is the unconstrained nonlinear minimization problem to which the gradient descent method is applied. The set of values of J(w1,…,wnw) describes a surface (technically a “hypersurface”) sitting “above” the nw dimensional space of the weights and biases. If you have trouble visualizing this, it might help to imagine the case in which nw = 2, in which case the surface of values of J(w1,w2) lies above the plane of values of (w1,w2). The Wikipedia article discussing the gradient descent method is truly excellent, and I will not try to improve on it. From this article we can see that in order to find the minimum of the objective function J, starting from a particular point in the weight space we move in a direction determined by the gradient

∇J = Σi=1..nw (∂J/∂wi) ei.     (8)
Here the ei are unit vectors in the space of the weights and biases; the negative gradient points in the direction of steepest descent of J (the gradient itself points in the direction of steepest ascent, which would take us in the wrong direction). As described in the Wikipedia article, the algorithm moves in a sequence of steps

w(k+1) = w(k) - γ ∇J(w(k)).     (9)
Here w(k) is the estimate of the vector w of biases and weights at the kth gradient descent step, and γ is the length of the step taken in the direction of the negative gradient (sometimes called the learning rate). All of the partial derivatives in Equation (8) must be estimated numerically, so it is evident that although the problem is much smaller than that involving the computation of the Hessian matrix of second partial derivatives, it still involves a lot of computation. For that reason, ANN algorithms often use an approximate form of gradient descent. The most common of these is stochastic gradient descent. In this method, instead of estimating all nw partial derivatives at each step, a random subset of a fixed size (possibly as small as one) is chosen and the direction is determined based on minimizing over this subset. If you have read the Wikipedia article above then this short description provides a good explanation of stochastic gradient descent, and this discussion provides more detail. In particular, you can see from these articles that the stochastic gradient descent method can actually move past a local minimum and find the global minimum.

In fact, the gradient in Equation (9) is not actually computed directly. Instead, an algorithm called back propagation is generally used to estimate the partial derivatives. The Additional Topic on which this post is based includes a mathematical description of back propagation, but I have decided to remove it from this post, both to keep the post to a reasonable length and to reduce the mathematical complexity. The basic idea of back propagation is to compute the partial derivatives in Equation (8) by repeated applications of the chain rule of differential calculus. You can see the full derivation by downloading the Additional Topic.

The neuralnet() function has available several forms of back propagation. As a default it uses the resilient propagation algorithm (Riedmiller et al., 1993). This algorithm recognizes that the size of the step from w(k) to w(k+1) depends not only on the size of γ but also on the sizes of the components of ∇J, and that this latter dependence may be disadvantageous. For this reason resilient propagation develops a step size that is less dependent on the magnitudes of the components of ∇J, using these values primarily to determine the direction of the step. There are other methods available, and you can see the neuralnet package documentation for details. The package also makes available the cross-entropy error function in addition to the default error sum of squares.

Now that we have an idea of how the function neuralnet() works, let's try it out on the full QUDO data set. We will use two hidden layers of four cells each, similar to Fig. 19a. When I worked on this problem it turned out that neuralnet() failed to converge in several attempts when applied to the full data set. The largest training set for which convergence was obtained had n = 200 data records evenly divided between QUDO = 0 and QUDO = 1. Fig. 20a shows the result of applying neuralnet() to this training data set using a cutoff of 0.73, which was determined from a ROC curve analysis (code not shown). The error rate is 0.225, which is not as good as that obtained in Section 2 using nnet().
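To make the idea concrete, here is a minimal sketch (not taken from SDA2 or the Additional Topic) of plain gradient descent on a toy least-squares objective, with the gradient estimated by finite differences; in a real ANN the partial derivatives are instead obtained by back propagation.

# Toy objective: J(w) = sum of squared errors of a linear fit
set.seed(1)
X <- cbind(1, rnorm(50))
y <- X %*% c(1, 2) + rnorm(50, sd = 0.1)
J <- function(w) sum((y - X %*% w)^2)

# Numerical gradient by central differences (stands in for back propagation)
num_grad <- function(w, h = 1e-6)
  sapply(seq_along(w), function(i) {
    e <- rep(0, length(w)); e[i] <- h
    (J(w + e) - J(w - e)) / (2 * h)
  })

w <- c(0, 0)      # starting point
gamma <- 0.001    # learning rate
for (k in 1:500) w <- w - gamma * num_grad(w)   # Equation (9)
round(w, 2)       # should be close to c(1, 2)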


                          
Figure 20. (a) Plot of prediction regions of neuralnet() with two hidden layers and a training data set of 200 randomly selected records; (b) application of the resulting model to the full set.

  Sections 2 and 3 have given us a general idea of how a multilayer perceptron works. Although the MLP is probably the most commonly used ANN for the type of classification problem discussed here, several alternatives exist. The next section discusses one of them, the radial basis function ANN.

4. The Radial Basis Function ANN

The seemingly obscure term radial basis function ANN is actually not hard to parse out. We will start with the word basis. As you may remember from linear algebra, a basis is a set of vectors in a vector space with the property that any vector in the space can be described as a linear combination of elements of this set. For discussion and a picture see here. If you had a course in Fourier series, you may remember that the Fourier series functions form a basis for a set of functions on an infinite dimensional vector space. Radial basis functions do the same sort of thing. A radial function is a function of the form
φ(x) = φ(||x - c||),     (10)
where c is a center point of the vector space. In other words, a radial function is radially symmetric and depends only on the distance of the vector x from the center. Again we are using the terms “point” and “vector” synonymously. The radial basis function commonly used in ANNs is the Gaussian function (i.e., the normal distribution). When x is a two dimensional quantity, as it is in our case (MAT and Precip), the function φ(||x - c||) can be visualized, and a picture of it is shown here. Thus, as with a Fourier series, radial basis functions form a basis with which to approximate other functions – in particular, in our case, functions associated with the process of classification via an ANN.
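As a quick illustration (a minimal sketch, not from the text), here is the Gaussian radial basis function evaluated on a one-dimensional grid; it depends only on the distance of x from the center c.

# Gaussian radial basis function phi(||x - c||)
phi <- function(x, center = 0, sigma = 1) exp(-((x - center)^2) / (2 * sigma^2))
x <- seq(-4, 4, by = 0.1)
plot(x, phi(x), type = "l", xlab = "x", ylab = "phi(||x - c||)",
     main = "Gaussian radial basis function, c = 0, sigma = 1")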
Figure 21. Architecture of the radial basis function ANN.

  The way a radial basis function (RBF) ANN works is as follows. Suppose again that the cost function J is based on the sum of squared errors. Referring to Fig. 21, which shows a schematic of a radial basis function ANN with a single hidden layer, the input activations for the ith data record are, as with the multilayer perceptron, simply the values Xi,j. We will describe the combined output of all of the hidden cells in terms of the basis function φ. Specifically, for a fixed i define a function f(X) by

f(X) = Σm=1..M jm φ(||X - cm||).     (11)

Here cm is the center of the mth radial basis function and jm measures the size of the contribution of this component to the overall function f(X). If our objective function is the sum of squared errors then we plug Equation (11) into the definition of J to get

J = Σi=1..n [Yi - Σm=1..M jm φ(||Xi - cm||)]².     (12)

Each of the M elements of the sum over m forms one of the cells in the hidden layer of the ANN (Fig. 21) (note that the order of the summation has been reversed from Section 2, so that the summation over i takes place first, in each cell, and then the summation over the m hidden cells). The links between the hidden cells and the output cell hold the weights jm. The weighted outputs of the m cells, given by jm φ(||X - cm||), are summed to determine the activation a(3) of the output cell. The architecture described here and shown in Fig. 21 is the simplest form of a radial basis function ANN. More bells and whistles can be added; for example, biases can be incorporated as constant terms in the sums of Equation (11). In addition, a logistic or other function can be interposed between the output cell's internal value and the activation a(3), similar to that of the MLP (Fig. 8).
The first step in the RBF computational process is to determine the set of centers cm in Equation (11). The determination is made by trying to place the centers so that their locations match as closely as possible the distribution of the data values Xi in the data space. Some RBF systems do this by carrying out a k-means clustering (SDA2 Sec. 12.6.1) of the space of data records, where the number of means k equals the number M of hidden cells (e.g. Nunes da Silva et al., 2017, p. 121). Another possibility is to carry out a Kohonen training process. This is basically an unsupervised clustering algorithm analogous to k-means clustering, although, as is often the case with methods related to ANNs, the terminology is a bit more pretentious: in this case the result is called a “self-organizing map.”

We will be using the function rbf() of the RSNNS package (Bergmeir and Benitez, 2012). It does not use k-means clustering, but the Kohonen training process is available as an option. The default method, which in my experience works the best, is called the “rbf weights” method. Fundamentally, the rbf weights method is, in the words of the creators of the package (Zell et al., 1998, p. 177), “rather simple.” Essentially, the weights are distributed in the data space to match a random sample of the values of the data records. There are a few adjustments that can be used, but this description captures the basics.

Once the values of the centers cm are established, the final step of the training process is to determine the values of the link weights jm, along with the values of the biases if there are any. This is accomplished in rbf() by gradient descent. Similarly to the multilayer perceptron ANNs described in Section 3, the gradient descent algorithm carries out an iterative process in which the link weights are adjusted by a series of small steps of the form

jm(k+1) = jm(k) - γ ∂J/∂jm,     (13)
where γ is a small constant. This brief introduction is sufficient to allow us to try out an RBF ANN.

The package RSNNS is a port to R of the Stuttgart Neural Network Simulator (Zell et al., 1998), or SNNS, which contains implementations of many ANNs. As a segue to the use of the function rbf(), which creates a radial basis function ANN, we will first test the function mlp() from the same package. This provides a multilayer perceptron solution that can be compared with those of Sections 2 and 3. The use of functions in the SNNS package is slightly different from what we have seen in Sections 2 and 3. Working initially with the 50 record Training data set, we first prepare its data records to serve as arguments for the ANN functions.

library(RSNNS)
Train.Values <- Set2.Train[,c("MAT", "Precip")]
Train.Targets <-
+   decodeClassLabels(as.character(Set2.Train$QUDO))
 

We will start by testing the RSNNS multilayer perceptron using a single hidden layer with four cells. The package has a very nice function called plotIterativeError() that permits one to visualize how the error declines as the iterations proceed. We will first run the ANN and check out the output. There are slight differences in how the polymorphic R functions are implemented compared with those of the previous sections.


> set.seed(1)
> mod.SNNS <- mlp(Train.Values, Train.Targets,
+    size = 4)

> Predict.SNNS <- predict(mod.SNNS,
+     Set2.Test[,1:2])

> head(Predict.SNNS)
         [,1]      [,2]
[1,] 0.6385500 0.3543552
[2,] 0.6247426 0.3684126
[3,] 0.6107571 0.3826691
[4,] 0.5966275 0.3970889
[5,] 0.5823891 0.4116347
[6,] 0.5680778 0.4262682
 

For this binary classification problem the output of the function predict() is a matrix with one column per class, whose values in each row sum to approximately 1, reflecting the alternative predictions. Using this information, we can plot the prediction region (Fig. 22a).
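(As a minimal illustration, assuming Set2.Test contains the true QUDO labels, the second column can be thresholded at the cutoff to recover class predictions.)

# Convert the two-column predict() output to 0/1 class labels at cutoff 0.5
pred.class <- ifelse(Predict.SNNS[, 2] >= 0.5, 1, 0)
table(pred.class, Set2.Test$QUDO)   # confusion table against the true labels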

  

Figure 22. Plots of (a) the predicted regions and (b) the iterative error with the default iteration limit.

  Next we plot the iterative error (Fig. 22b).


> plotIterativeError(mod.SNNS) # Fig. 22b
 

The error clearly declines, and it may or may not be leveling off. To check, we will set the argument maxit to a very large value and see what happens.


> set.seed(1)
> start_time <- Sys.time()
> mod.SNNS <- mlp(Train.Values, Train.Targets,
+    size = 4, maxit = 50000)
> stop_time <- Sys.time()
> stop_time - start_time
Time difference of 5.016404 secs
> Predict.SNNS <- predict(mod.SNNS,
+    Set2.Test[,1:2])
> p <- plot.ANN(Set2.Test, Predict.SNNS[,2], 0.5,
+    "RSNNS mlp Prediction, 50,000 Iterations")
> plotIterativeError(mod.SNNS) # Fig. 23b

                    
Figure 23. Plots of (a) the predicted regions and (b) the iterative error with an iteration limit of 50,000.

  The iteration error clearly declines some more. The prediction region does not look like most of those that we have seen earlier, and does not look very intuitive, but of course the machine doesn’t know that. The time to run 50,000 iterations for this problem is not bad at all.
Now we can move on to the radial basis function ANN, implemented via the function rbf(). First we will try it with all arguments set to their default values (Fig. 24).


> set.seed(1)
> mod.SNNS <- rbf(Train.Values, Train.Targets)
> Predict.SNNS <- predict(mod.SNNS, Set2.Test)
> p <- plot.ANN(Set2.Test, Predict.SNNS[,2], 0.5,
+    "RSNNS rbf Prediction, Default Values")

Figure 24. Prediction regions for the model developed using rbf() with default values.

 
The “radial” part of the term radial basis function becomes very apparent! A glance at the documentation via ?rbf reveals that the default value for the number of hidden cells is 5. Many sources discussing radial basis function ANNs point out that a large number of hidden cells is often necessary to correctly model the data. We will try another run with the number of hidden cells increased to 40.


> set.seed(1)
> mod.SNNS <- rbf(Train.Values, Train.Targets,
+    size = 40)
 
The rest of the code is identical to what we have seen before and is not shown; the prediction region is similar to that of Fig. 25a below. The error rate is about what we have seen from the other methods.

> length(which(Set2.Pred$TrueVal !=
+   Set2.Pred$PredVal)) / nrow(Set2.Train)
[1] 0.22
                              
Figure 25. (a) Prediction regions and (b) error estimate for the model developed using rbf() with the full data set.

  The performance of rbf() on the full data set is not particularly spectacular.

> length(which(Set2.Pred$TrueVal !=
+   Set2.Pred$PredVal)) / nrow(data.Set2U)
[1] 0.2324658

Now that we have had a chance to compare several ANNs on a binary selection problem, an inescapable conclusion is that, for the problem we are studying at least, the old workhorse nnet() acquits itself pretty well. In Section 5 we will try out various ANNs on a multi-category classification process. In this case each of the ANNs we are considering has an output cell for each class label.  

5. A Multicategory Classification Problem

The multicategory classification problem that we will study in this chapter is the same one that we studied in the Additional Topic on Comparison of Supervised Classification Methods. A detailed introduction is given there. In brief, the data set is the augmented Data Set 2 from SDA2. It is based on the Wieslander survey of oak species in California (Wieslander, 1935) and contains records of the presence or absence of blue oak (Quercus douglasii, QUDO), coast live oak (Quercus agrifolia, QUAG), canyon oak (Quercus chrysolepis, QUCH), black oak (Quercus kelloggii, QUKE), valley oak (Quercus lobata, QULO), and interior live oak (Quercus wislizeni, QUWI). As implemented it is the data frame data.Set2U; until now we have grouped all of the non-QUDO species together and assigned them the value QUDO = 0. In this section we will create the class label species and assign the appropriate character value to each record.

> species <- character(nrow(data.Set2U))
> species[which(data.Set2U$QUAG == 1)] <- "QUAG"
> species[which(data.Set2U$QUWI == 1)] <- "QUWI"
> species[which(data.Set2U$QULO == 1)] <- "QULO"
> species[which(data.Set2U$QUDO == 1)] <- "QUDO"
> species[which(data.Set2U$QUKE == 1)] <- "QUKE"
> species[which(data.Set2U$QUCH == 1)] <- "QUCH"
> data.Set2U$species <- as.factor(species)
> table(data.Set2U$species)
QUAG QUCH QUDO QUKE QULO QUWI
551   99  731  717   47  122
 

As can be seen, the data set is highly unbalanced. This property always poses a severe challenge for classification algorithms. The data are also highly autocorrelated spatially. In the Additional Topic on Method Comparison we found that 83% of the data records have the same species classification as their nearest neighbor. As in other Additional Topics, we want to be able to take advantage of locational information both through spatial relationships and by including Latitude and Longitude in the set of predictors. I won't repeat the discussion about this since it is covered in the other Additional Topics. We will use the same SpatialPointsDataFrame (Bivand et al., 2011) D of rescaled predictors and species class labels that was used in the Additional Topic on method comparison. Again, if you are not familiar with the SpatialPointsDataFrame object, don't worry about it: you can easily understand what is going on anyway.


> library(spdep)
> coordinates(data.Set2U) <-
+   c("Longitude", "Latitude")
> proj4string(data.Set2U) <-
+    CRS("+proj=longlat +datum=WGS84")
> D <- cbind(coordinates(data.Set2U),
+    data.Set2U@data[,c(2:19, 28:34, 43)])
> for (i in 1:2) D[,i] <- rescale(D[,i])

In this section we will use the predictors selected as optimal in the variable selection process for the support vector machine analysis. That process established as optimal the predictors Precip, CoastDist (the distance from the Pacific Coast), TempR (the average annual temperature range), Elevation, and Latitude. We will first try out nnet() using these predictors (Fig. 26). After playing around a bit with the number of hidden cells and the skip-layer connections, here is a result.

> set.seed(1)
> modf.nnet <- nnet(species ~ Precip +
+    CoastDist + TempR + Elevation + Latitude,
+    data = D, size = 10, skip = 4,
+    maxit = 10000)
# weights:  156
initial  value 3635.078808
iter2190 value 925.764437
     *     *     *     *
final  value 925.760573
converged
> plotnet(modf.nnet, cex = 0.6)

> Predict.nnet <- predict(modf.nnet, D)
> pred.nnet <- apply(Predict.nnet, 1,
+      which.max)

> 1 - (sum(diag(conf.mat)) / nrow(D))
[1] 0.1433613
> print(conf.mat <- table(pred.nnet, D$species))

pred.nnet QUAG QUCH QUDO QUKE QULO QUWI
        1  514    9   29    7   16   24
        2    1   27    2    6    2    1
        3   27    7  661   19   11   22
        4    1   54   27  673    4   20
        5    1    0    4    0   12    0
        6    7    2    8   12    2   55
> print(diag(conf.mat) /
+   table(data.Set2U$species), digits = 2)

QUAG QUCH QUDO QUKE QULO QUWI
0.93 0.27 0.90 0.94 0.26 0.45
 

The overall error rate is about 0.14, and, as usual, the rare species are not predicted particularly well.

Figure 26. Schematic of the multicategory ANN.

  In the Additional Topic on method comparison it was noted that, owing to the high level of spatial autocorrelation of these data, one can obtain an error rate of about 0.18 by simply predicting that a particular data record will be the same species as that of its nearest neighbor. We will try using that to our advantage by adding the species of the nearest and second nearest geographic neighbors to the set of predictors.


> set.seed(1)
> start_time <- Sys.time()
> modf.nnet <- nnet(species ~ Precip +
+    CoastDist + TempR + Elevation + Latitude +
+    nearest1 + nearest2, data = D, size = 12,
+    skip = 2, maxit = 10000)
> stop_time <- Sys.time()
> stop_time - start_time
Time difference of 11.64763 secs
> plotnet(modf.nnet, cex = 0.6)

> Predict.nnet <- predict(modf.nnet, D)
> pred.nnet <- apply(Predict.nnet, 1,
+    which.max)

> print(conf.mat <- table(pred.nnet,
+    D$species))

pred.nnet QUAG QUCH QUDO QUKE QULO QUWI
        1  527    3   13    4   11   12
        2    1   48    0    3    1    0
        3   12    2  695    7    7   18
        4    1   45   17  698    1   16
        5    6    0    2    0   27    0
        6    4    1    4    5    0   76
> 1 - (sum(diag(conf.mat)) / nrow(D))
[1] 0.08645787
> print(diag(conf.mat) /
+     table(data.Set2U$species), digits = 2)

QUAG QUCH QUDO QUKE QULO QUWI
0.96 0.48 0.95 0.97 0.57 0.62
 

The error rate is a little under 9%, which is really quite good. The improvement is particularly good for the rare species. Of course, this only works if one knows the species of the nearest neighbors! Nevertheless, it does show that incorporation of spatial information, if possible, can work to one’s advantage. The neuralnet ANN was not as successful in my limited testing. The best results were obtained when nearest neighbors were not used. The code is not shown, but here are the results.


> stop_time - start_time
Time difference of 2.726899 mins
> print(conf.mat <- table(pred.neuralnet,
+    D$species))

pred.neuralnet QUAG QUCH QUDO QUKE QULO QUWI
             1  499   22   31   16   21   35
             2    0    8    2    2    0    0
             3   51    5  654   27   22   47
             4    1   62   34  672    4   23
             6    0    2   10    0    0   17
> 1 - (sum(diag(conf.mat[1:4,1:4])) +
+         conf.mat[5,6]) / nrow(D)
[1] 0.1839435

No records were predicted to be QULO. The performance of the SNNS MLP was about the same. The RBF ANN performed even worse.


> print(conf.mat <- table(pred.SNNS,
+   D$species))
pred.SNNS QUAG QUCH QUDO QUKE QULO QUWI
        1  471   24   42   18   13   33
        3   80   15  665   63   31   68
        4    0   60   24  636    3   21
> 1 - (conf.mat[1,1] + conf.mat[2,3] +
+      conf.mat[3,4]) / nrow(D)
[1] 0.2183502

Once again, despite being the oldest and arguably the simplest package, nnet has performed the best in our analysis. There may, of course, be problems for which it does not do as well as the alternatives. Nevertheless, we will limit our remaining work to this package alone. The last topic we need to cover is variable selection for ANNs.

6. Variable Selection for ANNs

There have been numerous papers published on variable selection for ANNs (try googling the topic!). Olden and Jackson (2002), Gevery et al. (2003), and Olden et al. (2004) provide, in my opinion, particularly useful ones for the practitioner. Gevery et al. (2003) discuss some stepwise selection methods in which variables enter and leave the model based on the mean squared error (see SDA2 Sec. 8.2.1 for a discussion of stepwise methods in linear regression). These methods actually have much to recommend them, but they are not much different from stepwise selection methods we have discussed in SDA2 and in other Additional Topics and we won’t consider them further here. The unique distinguishing feature of ANNs in terms of variable selection is their connection weights, and several methods for estimating variable importance based on these weights have been developed. The most commonly used methods for ANNs involve a single output variable, implying only binary classification or regression. We will therefore discuss them by returning to the QUDO classification problem of Sections 2-4. We will carry over the data frame D from Section 5. Our model will include three predictors. Two of them are MAT and Precip, as used in Sections 2-4. The third, called Dummy, is just that: a dummy variable consisting of a uniformly distributed set of random numbers having no predictive power at all. A good method should identify Dummy as a prime candidate for elimination.

> set.seed(1)
> D$Dummy <- runif(nrow(D))
> size <- 2
> set.seed(1)  

For simplicity our demonstration ANN will have only two hidden cells (Fig. 27). We will first select the random number seed with the best performance, using the same procedure as that carried out in Section 2. Here is the result.


> seeds
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> sort(error.rate)
[1] 0.2077636 0.2077636 0.2082047 0.2254080 0.2271725 0.2271725 0.2271725
[8] 0.2271725 0.2271725 0.2271725 0.2271725 0.2271725 0.2276136 0.2280547
[15] 0.2284958 0.2289369 0.3224526 0.3224526 0.3224526 0.3224526
> print(best.seed <- which.min(error.rate))

[1] 10
> mod.nnet <- nnet(X.t, Y.t, size = size,
+     maxit = 10000)

> plotnet(mod.nnet, cex_val = 0.75)

Fig. 27 shows the plotnet() plot of this ANN. Here are the numerical values of the biases and weights.


> mod.nnet$wts
 [1] -80.82405907   0.87215226  85.41369339 108.51778012  -4.8336825
 [6]   0.04523967   2.83945431  -0.84126067  -2.81207163  -2.34322065
[11]  71.37131603
 

The order of biases and weights is {B1H1, I1H1, I2H1, I3H1, B1H2, I1H2, I2H2, I3H2, B2O1, H1O1, H2O1}.

Figure 27. plotnet() diagram of the simple ANN.

  Since the biases act on all predictors equally, they are not considered in the algorithm. One of the earliest ANN methods for determining the relative importance of each predictor was developed by Garson (1991) and Goh (1995). We will present this method following the discussion of Olden and Jackson (2002). The following replicates for our current case the set of tables displayed in Box 1 of that paper. First, we compute the matrix of weights.


> print(n.wts <- length(mod.nnet$wts))
[1] 11
> # Remove H1O1, H2O1
> print(M.seq <- seq(1,(n.wts-size)))

[1] 1 2 3 4 5 6 7 8 9
> n.X <- 3

> print(M.out <-  seq(1,(n.wts-size),
+    by = n.X + 1))

[1] 1 5 9
> print(M1 <- matrix(mod.nnet$wts[M.seq[-M.out]],

+   n.X, size), digits = 2)
      [,1]   [,2]
[1,]   0.87  0.045
[2,]  85.41  2.839
[3,] 108.52 -0.841
 

Next we compute the output weights.

> print(O <- matrix(mod.nnet$wts
+   [seq(n.wts-size+1,n.wts)],

+   1, size), digits = 2)
    [,1] [,2]
[1,] -2.3   71
 

You may want to compare the numerical values with the sign and shade of the links in Fig. 27. Table 4 shows the weights and output in tabular form, analogous to the first table in Box 1 of Olden and Jackson (2002) (from this you can see the reason for the convention involving the arrangement of subscripts described in Section 2).

Table 4. Input – hidden – output link weights.

            Hidden H1          Hidden H2
Input I1    w(1)11 = 0.87      w(1)12 = 0.045
Input I2    w(1)21 = 85.41     w(1)22 = 2.839
Input I3    w(1)31 = 108.52    w(1)32 = -0.841
Output      w(2)1 = -2.3       w(2)2 = 71
    If your output is different from mine then of course your table will have different values, but you should be able to check them in the same way by comparing with your own output. Next we compute the matrix C.m, which contains the contribution of each input cell to the output cell (the name C.m is used because C is already the name of a function in R).


> C.m <- matrix(0, n.X, size)
> for (i in 1:n.X) C.m[i,] <- M1[i,] * O[1,]
> print(C.m, digits = 2)
    [,1]  [,2]
[1,]   -2   3.2
[2,] -200 202.7
[3,] -254 -60.0
 

These row sums represent the total weight that each input applies to the computation of the output. Next, we compute the matrix R, which is defined as the ratio of the absolute value of each element of C.m to the corresponding column sum of absolute values. This is considered to be the relative contribution of each input cell to the activation of each output cell. We won't be generalizing this code, and it is easier to write it using the fixed values of n.X and size.


> R <- matrix(0, 3, 2)
> for (i in 1:3) R[i,] <-
+    abs(C.m[i,]) / (abs(C.m[1,]) +
+    abs(C.m[2,]) + abs(C.m[3,]))
> print(R, digits = 2)
      [,1]  [,2]
[1,] 0.0045 0.012
[2,] 0.4385 0.762
[3,] 0.5571 0.226
 

Next, we compute the matrix S, which is the sum of the relative contributions of each input cell to the activation of each output cell.

> S <- matrix(0, 3, 1)
> for (i in 1:3) S[i] <- R[i,1] + R[i,2]
> print(S, digits = 2)
     [,1]
[1,] 0.017
[2,] 1.201
[3,] 0.783
 

Finally, we compute the matrix RI, which is the relative importance of each input cell to the activation of each output cell.

> RI <- matrix(0, 3, 1)
> for (i in 1:3) RI[i,1] <- S[i,1] / sum(S)
> print(RI, 2)
      [,1]
[1,] 0.0083
[2,] 0.6003
[3,] 0.3914

The package NeuralNetTools contains the function garson(), which carries out these computations.


> print(garson(mod.nnet, bar_plot = FALSE),
+   digits = 2)

      rel_imp
Dummy   0.0083
MAT     0.6003
Precip  0.3914
 
The function garson() can also be used to create a barplot of these values (Fig. 29a).

   
Fig. 29. a) barplot of Garson importance values; b) barplot of Olden-Jackson importance values.


  Olden and Jackson (2002) criticized Garson's algorithm on the grounds that, by taking absolute values of the link weights, it ignores the possibility that a large positive input-hidden weight and a large negative hidden-output weight may cancel each other out. In a comparison of various methods for evaluating variable importance, Olden et al. (2004) found that Garson's algorithm fared the worst among the alternatives tested. As an alternative, Olden and Jackson (2002) suggest a permutation test (SDA2 Sec. 3.4); a sketch of such a test is given after the olden() output below. The method provides useful information even without the permutation test, however, and that is how we will present it. The procedure involves the following steps (we will use the same numbering as Olden and Jackson).

Step 1. Test several random seeds and select the one giving the smallest error. This step has already been carried out.
Step 2a. Compute the matrix C.m above for the selected weight set. Again, this step has already been carried out.

> print(C.m, digits = 2)
    [,1]  [,2]
[1,]   -2   3.2
[2,] -200 202.7
[3,] -254 -60.0
 

Step 2b. Compute the row sums of C.m.

> print(C.sum <- apply(C.m, 1, sum),
+      digits = 2)

[1]    1.2    2.5 -314.3  

The package NeuralNetTools has a function olden() to carry out these computations and to plot a barplot (Fig. 29b).

> print(olden(mod.nnet, bar_plot = FALSE), digits = 2)
      importance
Dummy         1.2
MAT           2.5
Precip     -314.3
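Although we do not pursue it here, Olden and Jackson's original proposal embeds these connection-weight sums in a permutation test: the response is repeatedly shuffled, the network refit, and the observed importance compared with the resulting null distribution. The following is only a minimal sketch of that idea, with nnet and NeuralNetTools loaded as above. It assumes (hypothetically) that the rescaled training data are in a data frame dat.scale with response QUDO and predictors Dummy, MAT and Precip, and that the original model was fit with a call of the form nnet(QUDO ~ ., data = dat.scale, size = 2); neither the data frame name nor the exact fitting call is guaranteed to match the objects used earlier.

> set.seed(123)
> n.perm <- 99
> null.imp <- matrix(NA, n.perm, 3,
+    dimnames = list(NULL, c("Dummy", "MAT", "Precip")))
> for (k in 1:n.perm) {
+    dat.perm <- dat.scale
+    dat.perm$QUDO <- sample(dat.perm$QUDO)   # destroy any real association
+    mod.perm <- nnet(QUDO ~ ., data = dat.perm, size = 2,
+       trace = FALSE)
+    # assumes olden() returns the predictors in the same order as the column names above
+    null.imp[k, ] <- olden(mod.perm, bar_plot = FALSE)$importance
+ }
> obs.imp <- olden(mod.nnet, bar_plot = FALSE)$importance
> # permutation p-values based on the absolute importance
> p.perm <- (colSums(sweep(abs(null.imp), 2, abs(obs.imp), ">=")) + 1) /
+    (n.perm + 1)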
 

In this particular example, one might say that the Garson method acquits itself better than the Olden-Jackson method. This shows that one must keep an open mind when playing with these toys. Lek (1996) developed a method that is also discussed by Gevrey et al. (2003). We will follow the discussion in this latter publication as well as the Details section of ?lekprofile. It is easiest to present the method via an example. The package NeuralNetTools contains the function lekprofile(), which does the computations and sets up the graphical output. The default output of the function is a ggplot (Wickham, 2016) object.

> library(ggplot2)
> lekprofile(mod.nnet, steps = 5,
+    group_vals = seq(0, 1, by = 0.5))

The figure is shown below, after the explanation. The procedure divides the range of each predictor into (steps – 1) equally spaced segments, i.e., it evaluates the model at steps points across that range. In our case steps = 5, so there are 4 segments and 5 evaluation points (the default value of steps is 100). It is also possible to obtain the numerical values underlying the plot by adding the argument val_out = TRUE. Here is a portion of the output.

> lekprofile(mod.nnet, steps = 5,
+    group_vals = seq(0, 1, by = 0.5),
+    val_out = TRUE)

[[1]]
    Explanatory resp_name    Response Groups exp_name
1  0.0006052661      QUDO 0.095468020      1    Dummy
2  0.2504365980      QUDO 0.096018057      1    Dummy
3  0.5002679299      QUDO 0.096577123      1    Dummy
4  0.7500992618      QUDO 0.097145396      1    Dummy
5  0.9999305937      QUDO 0.097723053      1    Dummy
6  0.0006052661      QUDO 0.189228401      2    Dummy
7  0.2504365980      QUDO 0.195418722      2    Dummy
8  0.5002679299      QUDO 0.201826448      2    Dummy
9  0.7500992618      QUDO 0.208457333      2    Dummy
10 0.9999305937      QUDO 0.215317020      2    Dummy
11 0.0006052661      QUDO 0.231719048      3    Dummy
         *         *            *           *
30 1.0000000000      QUDO 0.263814961      3      MAT
31 0.0000000000      QUDO 0.095468020      1   Precip
32 0.2500000000      QUDO 0.086684129      1   Precip
33 0.5000000000      QUDO 0.080092615      1   Precip
34 0.7500000000      QUDO 0.017896195      1   Precip
35 1.0000000000      QUDO 0.007309402      1   Precip
36 0.0000000000      QUDO 0.867860787      2   Precip
37 0.2500000000      QUDO 0.213604259      2   Precip
38 0.5000000000      QUDO 0.119027224      2   Precip
39 0.7500000000      QUDO 0.070552909      2   Precip
40 1.0000000000      QUDO 0.045104426      2   Precip
41 0.0000000000      QUDO 0.977067535      3   Precip
42 0.2500000000      QUDO 0.902834305      3   Precip
43 0.5000000000      QUDO 0.720770997      3   Precip
44 0.7500000000      QUDO 0.468084251      3   Precip
45 1.0000000000      QUDO 0.263814961      3   Precip
 
[[2]]
            Dummy       MAT    Precip
0%   0.0006052661 0.0000000 0.0000000
50%  0.4781180343 0.7601916 0.2747424
100% 0.9999305937 1.0000000 1.0000000
 

The numerical output is a list with two elements. The first element is a data frame and the second is a matrix. The explanation is clearest if we start with the second element. The components of the matrix are percentile values of each of the predictors. For example, the fiftieth percentile value of MAT is 0.7601916. Note that the 0 and 100 percentile values of MAT and Precip are 0 and 1 because these variables have been rescaled. The number of percentiles is specified via the argument group_vals, which has the default value seq(0, 1, by = 0.2). In the example we use a coarser sequence to make the interpretation easier.
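The second list element can be reproduced directly from the data used to fit the model. A minimal sketch, assuming (hypothetically) that the rescaled predictors are columns of a data frame called dat.scale:

> apply(dat.scale[, c("Dummy", "MAT", "Precip")], 2,
+    quantile, probs = seq(0, 1, by = 0.5))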
Moving to the first element of the list, the data frame, the first data field is the column labeled Explanatory. Its first five values represent the subdivision of the range of Dummy, from its 0th to its 100th percentile, into five equally spaced values (as specified by the value of steps). For each explanatory variable, identified in the data field exp_name, there are three Groups: the 0, 50 and 100 percentile groups, as specified by the group_vals argument. The inputs to the mod.nnet ANN object are cycled as follows. Consider the first five records, for which exp_name is Dummy and Groups is 1. The other two predictors, MAT and Precip, are held fixed at their 0th percentile values, which are both 0; this is what Group 1 means. Dummy is then set successively to the five step values spanning its range, 0.0006, 0.2504, 0.5003, 0.7501 and 0.9999, which appear in the Explanatory column. The output of the ANN is calculated for each of these input combinations and recorded in the Response data field. Thus, for example, the second data record has Response = 0.096018057, which is the value of QUDO computed by the ANN for the input vector {Dummy, MAT, Precip} ≈ {0.25, 0, 0}. This value of Response is the one encircled by the small black circle in Fig. 30a.
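This reading of the data frame can be checked by evaluating the network directly at the corresponding inputs. The sketch below assumes that mod.nnet was fit with the formula interface on the rescaled predictors, so that predict() accepts a data frame with these column names; the result should agree, up to rounding, with the Response of the second record.

> predict(mod.nnet, newdata = data.frame(Dummy = 0.2504366,
+    MAT = 0, Precip = 0))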

 
Fig. 30. a) Lek plot of the three-variable model; the response to {Dummy, MAT, Precip} ≈ {0.25, 0, 0} is encircled by the small black circle. b) Plot of the data.

  Now we can interpret Fig. 30a. Each panel corresponds to one predictor, and the colored lines show the ANN response as that predictor is varied across its range while the other two predictors are held fixed at the percentile indicated by the group. For Dummy there is, unsurprisingly, very little response. In the Precip panel, when the other predictors are held at their highest values (Group 3, i.e., high MAT), the predicted QUDO remains high until Precip reaches a high value; when they are held at their lowest values (Group 1, i.e., low MAT), the predicted QUDO stays low throughout. Fig. 30b shows (again) the data, which helps in interpreting the plots in Fig. 30a. The Lek profile allows us to see how the predictors interact to generate the value of the class label. In terms of variable selection, the Lek plot would indeed suggest that we eliminate Dummy, since the values of QUDO are virtually unresponsive to it. Depending on how we implement it, the Lek method can be used for stepwise selection, bringing explanatory variables in and out of the model according to how they behave in plots such as Fig. 30, or it can be used to compare all of the explanatory variables at once. The Details section of ?lekprofile describes other uses of the results of a Lek analysis.
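The flatness of the Dummy profiles can also be screened numerically from the val_out output rather than read off the plot. A minimal sketch using the list element computed above: for each predictor and group, compute the range of the Response as that predictor sweeps its steps; predictors whose ranges are all near zero, like Dummy, are candidates for elimination.

> lek.df <- lekprofile(mod.nnet, steps = 5,
+    group_vals = seq(0, 1, by = 0.5), val_out = TRUE)[[1]]
> with(lek.df, tapply(Response, list(exp_name, Groups),
+    function(z) diff(range(z))))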

7. Further Reading

Bishop (1995) is probably the most widely cited introductory book on ANNs. The text by Nunes da Silva et al. (2017) is also a good resource. Venables and Ripley (2002) provide an excellent discussion at a more advanced and theoretical level. A blog post by Hatwell (2016) provides a nice discussion of the variable selection methods in a regression context. The manual by Zell et al. (1998), although specifically developed for the SNNS library, gives a good practical discussion of many aspects of ANNs. Finally, this site (https://playground.tensorflow.org), developed by Daniel Smilkov and Shan Carter, is a “neural net playground” where you can try out various ANN configurations and see the results (I am grateful to Kevin Ryan for pointing this site out to me).

8. Exercises

1) Repeat the analysis of Section 2 using nnet(), but apply it to a data set in which MAT and Precip are normalized rather than rescaled. Determine the difference in predicted value of the class labels.
2) Explore the use of the homemade ANN ANN.1() with differently sized hidden layers in terms of the stability of the prediction.
3) By computing the sum of squared errors for a set of predictions of nnet(), determine whether it is converging to different local minima for different random starting values and numbers of hidden cells.
4) Using a set of predictions of nnet(), compare the error rates when entropy is TRUE with the default of FALSE.
5) If you have access to Venables and Ripley (2002), read about the softmax option. Repeat Exercise (4) for this option.
6) Generate a “bestiary” of predictions of neuralnet() analogous to that of Fig. 9 for ANN.1().

7) Look up the word “beastiary” in the dictionary.

9. References

Beck, M. W. (2018). NeuralNetTools: Visualization and Analysis Tools for Neural Networks. Journal of Statistical Software 85: 1-20. doi: 10.18637/jss.v085.i11 (URL: https://doi.org/10.18637/jss.v085.i11).
Bergmeir, C. and J. M. Benitez (2012). Neural Networks in R Using the Stuttgart Neural Network Simulator: RSNNS. Journal of Statistical Software 46(7): 1-26. URL http://www.jstatsoft.org/v46/i07/
Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford, UK.
Bivand, R. (with contributions by M. Altman, L. Anselin, R. Assunção, O. Berke, A. Bernat, G. Blanchet, E. Blankmeyer, M. Carvalho, B. Christensen, Y. Chun, C. Dormann, S. Dray, R. Halbersma, E. Krainski, N. Lewin-Koh, H. Li, J. Ma, G. Millo, W. Mueller, H. Ono, P. Peres-Neto, G. Piras, M. Reder, M. Tiefelsdorf and D. Yu) (2011). spdep: Spatial dependence: weighting schemes, statistics and models. http://CRAN.R-project.org/package=spdep
Du, K.-L. and M.N.S. Swamy (2014). Neural Networks and Statistical Learning. Springer, London.
Fritsch, S., F. Guenther and M. N. Wright (2019). neuralnet: Training of Neural Networks. R package version 1.44.2. https://CRAN.R-project.org/package=neuralnet
Gevrey, M., I. Dimopoulos and S. Lek (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling 160: 249-264.
Guenther, F. and S. Fritsch (2010). neuralnet: Training of Neural Networks. The R Journal 2: 30-38.
Hatwell, J. (2016). Artificial Neural Networks in R. https://rpubs.com/julianhatwell/annr
Meyer, D., E. Dimitriadou, K. Hornik, A. Weingessel and F. Leisch (2019). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-3. https://CRAN.R-project.org/package=e1071
Nunes da Silva, I.N., D. Hernane Spatti, R. Andrade Flauzino, L.H. Bartocci Liboni and S.F. dos Reis Alves (2017). Artificial Neural Networks: A Practical Course. Springer International, Switzerland.
Riedmiller, M. and H. Braun (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. Proceedings of the IEEE International Conference on Neural Networks: 586-591. San Francisco, CA.
Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6): 386-408. https://doi.org/10.1037/h0042519
Venables, W. N. and B.D. Ripley (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York.
Wieslander, A. E. (1935). A vegetation type map of California. Madroño 3: 140-144.
Zell, A. and 27 others (1998). SNNS Stuttgart Neural Network Simulator User Manual, Version 4.2. IPVR, University of Stuttgart and WSI, University of Tübingen. URL http://www.ra.cs.uni-tuebingen.de/SNNS/