Create a hyper-marketing model using Naïve Bayes

By Huey Fern Tay with Greg Page

Everyone loves an extra income stream – even the super-wealthy owners of luxurious properties that they inhabit for just a few weeks each year.  Offering a property as a short-term rental through a platform like Airbnb can provide a wonderful side hustle.  For some owners, however, the associated hassles could be a powerful deterrent to using the service.  Text messages at 3 a.m. about Wi-Fi passwords, stopped-up toilets, and the lack of water pressure in the shower might be just enough to tip the scales against such an undertaking…especially when such messages are followed up by angry “Why isn’t this fixed yet?” queries just 30 minutes later.

So what if an all-in-one concierge service could take away ALL of those hassles?  If an intermediary service could handle all of the tenant interactions, the marketing, the logistics of the key hand-offs, etc., then suddenly the idle rich jetsetters might be a bit more willing to open up their pieds-à-terre to the unbathed masses.  Such a service would benefit all stakeholders – travelers would have more options, the property owners would earn more income, and the platform would receive more commission fees from the extra transactions. In exchange for a fee paid to the service, willing property owners could have a side hustle that was “all side, no hustle.”  

Let’s imagine that such a service is looking to establish itself, with an initial marketing outreach effort to high-end property owners not already using Airbnb.  Let’s also imagine that this new service is operating on a shoestring budget, and therefore needs to focus its outreach carefully. How can it identify the properties within a city that are most likely to command high values in the short-term rental market? 

The Naïve Bayes classifier is a good candidate for the task at hand because of its simplicity, computational efficiency, and ability to handle categorical variables.  Furthermore, its classification outcomes come with associated probability values – we can use those to identify records that are most likely to belong to some particular group. 

To illustrate how this method could be used to solve the business problem outlined above, I will use Airbnb data for San Francisco listings.

One of the first decisions the modeler must make is how to bin the data. In this case, the question is: which numerical variable should be used to separate the properties? Price, review ratings, or the number of reviews? Each of these variables has a different impact on the outcome and may not be equally effective at separating the classes. If the classes are not well separated, then even a large dataset will not be helpful.

The next important decision is the number of classification categories you would like to create. Also, what should the cut-off be for each group? In other words, should you create four groups and bin them equally? Or should you create three groups by dividing the data according to a 15-70-15 proportion, 20-60-20, 30-40-30, etc.? The decision made at this step has a big impact on the model.
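As a rough sketch of the binning step (the data frame name, column name, and tier labels other than “Above Average and Pricey Digs” are assumptions, not taken from the original analysis), the listing prices could be split into three equal-sized tiers like this:

library(dplyr)

# Hypothetical data frame 'sf_listings' with a numeric 'price' column
sf_listings <- sf_listings %>%
  mutate(price_tier = factor(ntile(price, 3),
                             labels = c("Budget", "Mid-Range", "Above Average and Pricey Digs")))

table(sf_listings$price_tier)   # three roughly equal-sized groups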

Consider the two models below, each created with 3,811 rows of data representing 60% of the total dataset. Both models were built with the same predictor variables, such as the number of bedrooms, bathrooms, property type, location, etc. In Model 1, however, the data was binned into four equal groups, while in Model 2 the data was binned into three equal groups. The summary for Model 1 shows an accuracy of 54.19%, which is good considering that this performance is slightly more than double the No Information Rate (Naïve Rate). Model 2’s accuracy is 65.36%, nearly double its No Information Rate.
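A minimal sketch of how such a model might be fit and evaluated is shown below; the package choice (e1071 and caret), the predictor names, and the train/validation split are assumptions rather than the authors’ exact code.

library(e1071)
library(caret)

set.seed(123)

# 60/40 train/validation split (hypothetical column names)
train_rows <- sample(nrow(sf_listings), size = 0.6 * nrow(sf_listings))
train_df <- sf_listings[train_rows, ]
valid_df <- sf_listings[-train_rows, ]

# Naive Bayes with categorical predictors and the binned price tier as the outcome
nb_model <- naiveBayes(price_tier ~ bedrooms + bathrooms + property_type + neighbourhood,
                       data = train_df)

# Accuracy and the No Information Rate are both reported by confusionMatrix()
preds <- predict(nb_model, newdata = valid_df)
confusionMatrix(preds, valid_df$price_tier)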

These are encouraging results, but in our Airbnb example we are more interested in knowing how well the model performs when it is asked to classify properties into each of the classes used in the model. For this reason, it is worth considering the true positive rate, i.e. ‘sensitivity’. Model 1 is better at predicting the true positives (‘sensitivity’) for classes at opposite ends of the spectrum, which suggests that Model 1 has difficulty reading nuances. Model 2’s performance in this regard is comparatively more balanced.


Naive Bayes model with four classification outcomes



Naive Bayes output with three classification outcomes


But wait – let’s get back to our original goal.  While overall accuracy is good to see, what we are most interested in here is identifying that high-end price group.  Owners of such units will be the best targets for our all-in-one concierge service.  Therefore, let’s dive a bit deeper to examine this model’s suitability for identifying such properties. 

By running the predict() function with the type=’raw’ parameter included, we can view the associated probabilities for each outcome class, and then rank records by probability of belonging to some particular outcome group. 
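A hedged sketch of that ranking step (the model and data frame objects are carried over from the sketch above and are assumptions):

# Class-membership probabilities for each validation record
probs <- predict(nb_model, newdata = valid_df, type = "raw")

# Rank validation records by probability of belonging to the top tier
p_top <- probs[, "Above Average and Pricey Digs"]
ranked <- valid_df[order(p_top, decreasing = TRUE), ]

# How many of the 100 likeliest records truly belong to the top tier?
sum(head(ranked$price_tier, 100) == "Above Average and Pricey Digs")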

Taking this approach with the validation set, we find that among the 100 records identified by the model as most likely to land in the top tier group, 96 truly belonged to “Above Average and Pricey Digs.”   Among the 150 likeliest, 140 units, or 93.33%, actually belonged to that group, and among the 200 likeliest, 185 units, or 92.5%, were truly in that top tier. 

But that’s not all. It is worth going one step further by evaluating the model with lift charts or decile-wise lift charts, because these charts show how effectively our model ‘skims the cream’.
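One way to build such a chart by hand is sketched below (object and column names are assumptions; a dedicated package such as gains could be used instead):

library(dplyr)

probs <- predict(nb_model, newdata = valid_df, type = "raw")
valid_df$p_top <- probs[, "Above Average and Pricey Digs"]
overall_rate <- mean(valid_df$price_tier == "Above Average and Pricey Digs")

# Decile-wise lift: concentration of true top-tier units in the highest-probability deciles
lift_df <- valid_df %>%
  mutate(is_top = price_tier == "Above Average and Pricey Digs") %>%
  arrange(desc(p_top)) %>%
  mutate(decile = ntile(row_number(), 10)) %>%
  group_by(decile) %>%
  summarise(lift = mean(is_top) / overall_rate)

barplot(lift_df$lift, names.arg = lift_df$decile,
        xlab = "Probability decile", ylab = "Lift relative to a random guess")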

The decile-wise lift charts below illustrate how effectively the model can predict membership in the ‘above average and pricey digs’ group. When the model is used to classify the top 27% of properties in this category, its performance is more than 3.5x better than a random guess.

Decile-wise lift chart shows the model is 3.5 times better at classifying the top 27 percent of properties

Another way to assess the model’s ability to identify top-tier rentals is with a two-dimensional lift chart.  Such a chart only works with two-outcome class scenarios, so we start here by collapsing the first and second tiers together, and then labelling that group as “other.”  
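A rough sketch of that collapse and the resulting cumulative gains chart (object names are assumptions) might look like this:

# Collapse the bottom two tiers into "Other", leaving a two-class problem
probs <- predict(nb_model, newdata = valid_df, type = "raw")
p_top <- probs[, "Above Average and Pricey Digs"]
is_top <- valid_df$price_tier == "Above Average and Pricey Digs"   # everything else counts as "Other"

# Cumulative gains: correct top-tier predictions among the n likeliest records
ord <- order(p_top, decreasing = TRUE)
cum_correct <- cumsum(is_top[ord])

plot(seq_along(cum_correct), cum_correct, type = "l",
     xlab = "Records targeted (ranked by predicted probability of the top tier)",
     ylab = "Cumulative correct top-tier predictions")
abline(a = 0, b = mean(is_top), lty = 2)   # benchmark: label every record as top tier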

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

In the entire validation set, there are 780 units that land in the highest price tier.  The values along the x-axis represent all the validation set records, ranked in order of their probability of belonging to the highest-tier class.  The y-axis shows the number of correct predictions.  The solid line represents the model’s performance – it shows us, for instance, that among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group.  The line flattens out at around x=1500, because by that point, the model has already identified nearly all of the records that truly belonged to this outcome class. 

The dotted line, by contrast, shows how effective a model would be if it simply labelled all the records as belonging to the top tier.  Since 33.6% of the validation records belong to this group, each x-axis value here corresponds to a y-axis value that is exactly 33.6% as large.  The gap between the solid line and the dotted line represents the model’s improvement over that naive labelling as the number of cases increases.

What do these results mean for the concierge service?  Let’s revisit our original assumptions: 


  • such a service would most likely appeal to the owners of properties in the highest pricing tier;
  • the initial outreach efforts should be made to owners of properties not already registered with Airbnb; and that
  • the new service has a limited budget, and therefore needs to carefully focus its outreach efforts only on that tier of properties whose owners would be likeliest to use it.
Given these assumptions, our primary interest does not lie with overall model accuracy.  The model’s ability to distinguish between the bottom two pricing tiers is almost immaterial to us; however, we are keenly interested in the answer to this question:  When the model predicts that a property will belong to the highest pricing tier, how often is it correct? 

As demonstrated here, the model delivers quite effectively in this regard, especially when we maintain a relatively narrow focus on the properties that are most likely to be top tier.  Splashy magazine inserts, Super Bowl advertisements, and big-ticket endorsements from celebrities might be in the cards for this service down the road, after it spreads across the globe and prepares for its IPO roadshow.  For now, though, the hyper-specific focus that can come from “skimming the cream” off the top of those Naïve Bayes model probability predictions may be the surest next step for this service’s success. 

Data source: Inside Airbnb
 

Walmart’s (WMT) 7-Year Nonlinear Market Trend at ‘Critical Juncture’

This is a brief market update from my earlier post on October 28, 2021, Walmart’s 7-Year Nonlinear Market Trend using ‘Stealth Curves.’

The prior post indicated the then-current nonlinear market structure of Walmart (WMT) stock.

October 28, 2021 Market View



Fast forward to today …

This market is now resting on its multi-year lower Stealth Support Curve. As the below updated graphic indicates, this represents a critical juncture for this market.

Current Market View (1-27-2022)



This current chart can be produced using the code in my October 2021 post along with updated market data.  The data is freely available from Nasdaq.com here. Just be certain the updated data match the column headings in my ‘reformatted’ GitHub data file.  
 
For those interested in numerous Stealth Curve examples applied to various other markets, simply view this LinkedIn post.  Full details of Stealth Curve equation parameterization are described in my latest Amazon publication, Stealth Curves: The Elegance of ‘Random’ Markets.

Feel free to contact me directly on my LinkedIn site.
 

Brian K. Lee, MBA, PRM, CMA, CFA


Bifurcation diagram for nonautonomous SIR model

I am trying to produce a bifurcation diagram for a nonautonomous SIR two-patch model, but it is not giving the proper result. The model and parameter values are in the code. I solved the model with an ODE solver for a set of bifurcation parameter values and then took the maximum and minimum of the solution for the plot. How should I draw a phase plane analysis for my model? Any suggestions? 

######### Bifurcation Analysis #########
library(deSolve)
library(tidyverse)
vals_betaAbar <- seq(1,70,by=1)  #bifurcation parameter

outdf = NULL

for (i in 1:length(vals_betaAbar)) {

betaAbar = vals_betaAbar[i]


sir.two.patch.model.closed <- function (t, x, params){
#create local variable IA, IB, RA, RB, SA, SB
IA <- x[1]
IB <- x[2]
RA <- x[3]
RB <- x[4]
SA <- x[5]
SB <- x[6]
#we can simplify the code using "with"
#this argument to "with" lets us use the variable names
with(
as.list(params),
{ #the system of rate equations
dIA <- betaAbar*(1+sigA*cos(2*pi*t/Ti))*IA*SA/Na - (muA+gamaA+phiIAB*(1+sigA*cos(2*pi*t/Ti)))*IA + phiIBA*(1+sigA*cos(2*pi*t/Ti))*IB
dIB <- betaBbar*(1+sigA*cos(2*pi*t/Ti))*IB*SB/NB - (muB+gamaB+phiIBA*(1+sigA*cos(2*pi*t/Ti)))*IB + phiIAB*(1+sigA*cos(2*pi*t/Ti))*IA
dRA <- gamaA*IA - (muA+phiRAB*(1+sigA*cos(2*pi*t/Ti)))*RA + phiRBA*(1+sigA*cos(2*pi*t/Ti))*RB
dRB <- gamaB*IB - (muB+phiRBA*(1+sigA*cos(2*pi*t/Ti)))*RB + phiRAB*(1+sigA*cos(2*pi*t/Ti))*RA
dSA <- muA*Na - betaAbar*(1+sigA*cos(2*pi*t/Ti))*IA*SA/Na - (muA+phiSAB*(1+sigA*cos(2*pi*t/Ti)))*SA + phiSBA*(1+sigA*cos(2*pi*t/Ti))*SB
dSB <- muB*NB - betaBbar*(1+sigA*cos(2*pi*t/Ti))*IB*SB/NB - (muB+phiSBA*(1+sigA*cos(2*pi*t/Ti)))*SB + phiSAB*(1+sigA*cos(2*pi*t/Ti))*SA
#combine results into a single vector dx
dx <- c(dIA,dIB,dRA,dRB,dSA,dSB)
#return result as a list
list(dx)
}
)
}
#function seq returns a sequence
times <- seq(0,100,length.out=3000)
#function "c" combines values into a vector
params <- c(Na=100, NB=1000, betaAbar=betaAbar, betaBbar=5,muA=0.06, muB=0.02, gamaA=12, gamaB=12,phiIAB=0.4, phiIBA=0.04, phiSAB=0.4, phiSBA=0.04, phiRAB=0.4, phiRBA=0.04, sigA=0., Ti=1)

#initial conditions
xstart <- c(IA=5,IB=5,RA=0,RB=0,SA=100-5,SB=1000-5)
#result stored in data frame
out <- as.data.frame(ode(xstart,times,sir.two.patch.model.closed,params))

out$betaAbar= vals_betaAbar[i]

out=out[251:300,]

outdf = rbind.data.frame(outdf, out)

}


bifdata=NULL
for (i in 1:length(vals_betaAbar)) {

betaAbar = vals_betaAbar[i]


bifdata_A = NULL
bifdata_A$IA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$IA)
bifdata_A$IA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$IA)
bifdata_A$IB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$IB)
bifdata_A$IB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$IB)

bifdata_A$SA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$SA)
bifdata_A$SA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$SA)
bifdata_A$SB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$SB)
bifdata_A$SB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$SB)

bifdata_A$RA_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$RA)
bifdata_A$RA_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$RA)
bifdata_A$RB_max=max(outdf[outdf$betaAbar==vals_betaAbar[i],]$RB)
bifdata_A$RB_min=min(outdf[outdf$betaAbar==vals_betaAbar[i],]$RB)

bifdata_A$betaAbar= vals_betaAbar[i]

bifdata = rbind.data.frame(bifdata, bifdata_A)
}

dev.new()
ggplot(data=bifdata)+
geom_line(aes(x=betaAbar,y=IA_max))+
geom_point(aes(x=betaAbar,y=IA_min))

#############################

Predicting future recessions

Even if this sounds incredible, yes, we can predict future recessions using a couple of time series, some simple econometric models, and … R

The basic idea is that the slope of the yield curve is somewhat linked to the probability of future recessions. In other words, the difference between the short and the long term rate can be used as a tool for monitoring business cycles. Nothing new about that: Kessel (1965) documented the cyclical behavior of the yield spread, and he showed that the yield spread tended to decline immediately before a recession. This relationship is one of the most famous stylized facts among economists (see Figure 1).

Figure 1: The yield spread and recessions

– So, why don’t people use this model to predict recessions?
– Well, it seems to be related to the fact that (i) they think it only works for the US, and (ii) they don’t feel qualified to run the sophisticated R code needed to estimate this relationship.

This post is about answering these two points: (i) yes, the yield curve does signal recessions, and (ii) yes, it is easy to monitor economic cycles with R using the EWS package!

First, if you have any doubts about the predictive power of the yield spread, please have a look at Hasse and Lajaunie’s (2022) recent paper, published in the Quarterly Review of Economics and Finance. The authors – Quentin and I – reexamine the predictive power of the yield spread across countries and over time. Using a dynamic panel/dichotomous model framework and a unique dataset covering 13 OECD countries over the period 1975–2019, we empirically show that the yield spread signals recessions. This result is robust to different econometric specifications, controlling for recession risk factors and time sampling. The main results are reported in Table 1.

Table 1: Estimation of the predictive power of the yield spread (1975–2019)

– Wait, what does “dichotomous model” mean?
– Don’t be afraid: the academic literature provides a specific econometric framework to predict future recessions.

Estrella and Hardouvelis (1991) and Kauppi and Saikkonen (2008) advanced the use of binary regression models (probit and logit models) to model the relationship between recession dummies (i.e., binary variables) and the yield spread (i.e., a continuous variable). Indeed, classic linear regressions cannot do the job here. If you have specific questions about probit/logit models, you should have a look at Quentin’s PhD dissertation. He is a specialist in nonlinear econometrics.

Now, let’s talk about the EWS R package. In a few words, the package is available on the CRAN repository and it includes the data and code you need to replicate our empirical findings. So you only have to run a few lines of code to estimate the predictive power of the yield spread. Not so bad, eh?

Here is an example focusing on the US: first install and load the package, then we extract the data we need.
# Load the package
library(EWS)

# Load the dataset included in the package
data("data_USA") # Print the first six rows of the dataframe head(data_USA)
Well, now we just have to run these few lines:
# Data process
Var_Y <- as.vector(data_USA$NBER)
Var_X <- as.vector(data_USA$Spread)
# Estimate the logit regression
results <- Logistic_Estimation(Dicho_Y = Var_Y, Exp_X = Var_X, Intercept = TRUE, Nb_Id = 1, Lag = 1, type_model = 1)
# print results
print(results)
The results can be printed… and plotted! Here is an illustration of what you should get:

$Estimation
name Estimate Std.Error zvalue Pr
1 Intercept -1.2194364 0.3215586 -3.792268 0.0001492776
2 1 -0.5243175 0.2062655 -2.541955 0.0110234400
and then what you could plot:

Figure 2: The predictive power of the yield spread in the US (1999-2020)

Nice output. Let’s interpret what we have. First the estimation results: the intercept is equal to -1.21 and is highly significant, and the lagged yield spread is equal to -0.52 and is also highly significant. This basic result illustrates the predictive power of the yield spread.

– But what does the “1” mean, instead of the name of the lagged variable? And what if we choose another lag? And what if we choose model 2 instead of model 1?
– The “1” refers to the number associated with the lagged variable, and you can change the model or the number of lags via the function arguments:
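For example, a hedged sketch of such a call (the argument values are illustrative assumptions, not necessarily the exact specification behind the output shown below):

# Same data, but a different model specification requested via type_model
results_dyn <- Logistic_Estimation(Dicho_Y = Var_Y, Exp_X = Var_X, Intercept = TRUE,
                                   Nb_Id = 1, Lag = 1, type_model = 2)
print(results_dyn)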
$Estimation
name Estimate Std.Error zvalue Pr
1 Intercept 0.08342331 0.101228668 0.8241075 4.098785e-01
2 1 -0.32340655 0.046847136 -6.9034433 5.075718e-12
3 Index_Lag 0.85134073 0.003882198 219.2934980 0.000000e+00

Last but not least, you can choose the best model according to AIC, BIC or R2 criteria:

$AIC
[1] 164.0884
$BIC
[1] 177.8501
$R2
[1] 0.2182592

Everything you need to know about the predictive power of the yield spread is here. These in-sample estimations confirm empirical evidence from the literature for the US. And for those who are interested in out-of-sample forecasting… the EWS package provides what you need. I’ll write another post soon!

References

Estrella, A., & Hardouvelis, G. A. (1991). The term structure as a predictor of real economic activity. The Journal of Finance46(2), 555-576.

Hasse, J. B., & Lajaunie, Q. (2022). Does the yield curve signal recessions? new evidence from an international panel data analysis. The Quarterly Review of Economics and Finance, 84, 9-22.

Hasse, J. B., & Lajaunie, Q. (2020). EWS: Early Warning System. 
R package version 0.1. 0.

Kauppi, H., & Saikkonen, P. (2008). Predicting US recessions with dynamic binary response models. The Review of Economics and Statistics90(4), 777-791.

Kessel, R. A. (1965). The cyclical behavior of the term structure of interest rates. NBER Occasional Paper 91, National Bureau of Economic Research.

Detecting multicollinearity — it’s not that easy sometimes

By Huey Fern Tay with Greg Page

When are two variables too related to one another to be used together in a linear regression model? Should the maximum acceptable correlation be 0.7? Or should the rule of thumb be 0.8? There is actually no single, ‘one-size-fits-all’ answer to this question.

As an alternative to using pairwise correlations, an analyst can examine the variance inflation factor, or VIF, associated with each of the numeric input variables. Sometimes, however, pairwise correlations, and even VIF scores, do not tell the whole story.

Consider this correlation matrix created from a Los Angeles Airbnb dataset.



Two item pairs identified in the correlation matrix above have a strong correlation value:

· Beds + log_accommodates (r= 0.701)

· Beds + bedrooms (r= 0.706)

Based on one school of thought, these correlation values are cause for concern; meanwhile, other sources suggest that such values are nothing to be worried about.

The variance inflation factor, which is used to detect the severity of multicollinearity, does not suggest anything unusual either.

library(car)

# model_test is a linear model fitted with the candidate numeric inputs (fit not shown here)
vif(model_test)

The VIF for each potential input variable is found by making a separate linear regression model, in which the variable being scored serves as the outcome, and the other numeric variables are the predictors. The VIF score for each variable is then found by applying this formula: VIF = 1 / (1 − R²), where R² comes from that variable’s regression on the other inputs.
When the other numeric inputs explain a large percentage of the variability in some other variable, then that variable will have a high VIF. Some sources will tell you that any VIF above 5 means that a variable should not be used in a model, whereas other sources will say that VIF values below 10 are acceptable. None of the vif() results here appear to be problematic, based on standard cutoff thresholds.

Based on the vif() results shown above, plus some algebraic manipulation of the VIF formula, we can deduce that a model that predicts beds as the outcome variable, using log_accommodates, bedrooms, and bathrooms as the inputs, has an r-squared just a little higher than 0.61. That is verified with the model shown below:
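A minimal sketch of that verification (the data frame name la_listings is an assumption):

beds_model <- lm(beds ~ log_accommodates + bedrooms + bathrooms, data = la_listings)
summary(beds_model)$r.squared   # roughly 0.61, i.e. 1 - 1/VIF for beds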



But look at what happens when we build a multi-linear regression model predicting the price of an Airbnb property listing.
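Such a model might be fit as sketched below (the exact input set and data frame name are assumptions):

price_model <- lm(log_price ~ log_accommodates + beds + bedrooms + bathrooms, data = la_listings)
summary(price_model)   # note the negative coefficient on beds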



The model summary hints at a problem because the coefficient for beds is negative. The proper interpretation of each coefficient in this linear model is the change in log_price associated with a one-unit increase in that input, with all other inputs held constant.

Literally, then, this output indicates that having more beds within a house or apartment will make its rental value go down, when all else is held constant. That not only defies our common sense, but it also contradicts something that we already know to be the case — that bed number and log_price are positively associated. Indeed, the correlation matrix shown above indicates a moderately-strong linear relationship between these values, of 0.4816.

After dropping ‘beds’ from the original model, the adjusted R-squared declines only marginally, from 0.4878 to 0.4782.
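Continuing the sketch above, the refit without beds could be done like this:

price_model_2 <- update(price_model, . ~ . - beds)
summary(price_model_2)$adj.r.squared   # declines only slightly relative to the original model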



This tiny decline in adjusted r-squared is not worrisome at all. The very low p-value associated with this model’s F-statistic indicates a highly significant overall model. Moreover, the signs of the coefficients for each of these inputs are consistent with the directionality that we see in the original correlation matrix.

Moreover, we still need to include other important variables that determine real estate pricing e.g. location and property type. After factoring in these categories along with other considerations such as pool availability, cleaning fee, and pet-friendly options, the model’s adjusted R-squared value is pushed to 0.6694. In addition, the residual standard error declines from 0.5276 in the original model to 0.4239.

Long story short: we cannot be completely reliant on rules of thumb, or even cutoff thresholds from textbooks, when evaluating the multicollinearity risk associated with specific numeric inputs. We must also examine model coefficients’ signs. When a coefficient’s sign “flips” from the direction that we should expect, based on that variable’s correlation with the response variable, that can also indicate that our model coefficients are impacted by multicollinearity.

Data source: Inside Airbnb

Natural Gas Prices Fall 42% in 3 Months Following Breach of ‘Nonlinear Stealth Support’


The earlier LinkedIn post below, which identified a well-defined nonlinear Stealth Support curve, indicated that the price evolution of natural gas (UNG) was highly likely on an unsustainable trajectory.



Within a month of the post, the market breached its well-defined Stealth Support Curve.  In the 3 months that followed, this market lost 42% of its value. 



The following R code reproduces the Stealth Curve chart below.




UNG Stealth Curve – R Code
library(tidyverse)
library(readxl)

# Original data source - https://www.nasdaq.com/market-activity/funds-and-etfs/ung/historical

# Download reformatted data (columns/headings) from my github site and save to a local drive
# https://github.com/123blee/Stealth_Curves.io/blob/main/UNG_prices.xlsx

ung <- read_excel("... Insert your local file path here .../UNG_prices.xlsx")
ung

# Convert 'Date and Time' to 'Date' column
ung[["Date"]] <- as.Date(ung[["Date"]])
ung

bars <- nrow(ung)

# Add bar indicator as first tibble column

ung <- ung %>%
  add_column(t = 1:nrow(ung), .before = "Date")
ung

# Add 40 future days to the tibble for projection of the Stealth Curve once added

future <- 40

ung <- ung %>%
  add_row(t = (bars+1):(bars+future))

# Market Pivot Lows using 'Low' Prices

# Chart 'Low' UNG prices

xmin <- 2250
xmax <- bars + future
ymin <- 0
ymax <- 25

plot.new()

background <- c("azure1")

chart_title_low <- c("Natural Gas (UNG) \nDaily Low Prices ($)")

u <- par("usr") 

rect(u[1], u[3], u[2], u[4], col = background) 

par(ann=TRUE)

par(new=TRUE)

t <- ung[["t"]]
Price <- ung[["Low"]]

plot(x=t, y=Price, main = chart_title_low, type="l", col = "blue", 
     
     ylim = c(ymin, ymax) ,
     
     xlim = c(xmin, xmax ) ) 

# Add Stealth Curve to tibble

# Stealth Support Curve parameters
a <-   -444.56  
b <-      6.26  
c <-  -2555.01  


ung <- ung %>%
  mutate(Stealth_Curve_Low = a/(t + c) + b)
ung

# Omit certain Stealth Support Curve values from charting

ung[["Stealth_Curve_Low"]][2550:xmax] <- NA
ung

# Add Stealth Curve to chart

lines(t, ung[["Stealth_Curve_Low"]])
Details of Stealth Curve parameterization are described in my ‘Stealth Curves: The Elegance of ‘Random’ Markets’ text on Amazon.

Brian K. Lee, MBA, PRM, CMA, CFA

Feel free to contact me on LinkedIn.      

SurvCART: Constructing Survival Tree in R

[Author: Madan G. Kundu]

In practice, survival times may be influenced by a combination of baseline variables. Survival trees (i.e., trees with survival data) offer a relatively flexible approach to understanding the effects of covariates, including their interactions, on survival times when the functional form of the association is unknown, and they also have the advantage of easier interpretation. This blog presents the construction of a survival tree following the SurvCART algorithm (Kundu and Ghosh 2021).  Usually, a survival tree is constructed based on heterogeneity in the time-to-event distribution. However, in some applications with marker-dependent censoring (e.g., less education or a better prognosis might lead to early censoring), ignoring the censoring distribution might lead to an inconsistent survival tree (Cui, Zhu, Zhou and Kosorok, 2021).  Therefore, the SurvCART algorithm is flexible enough to construct a survival tree based on heterogeneity in both the time-to-event and censoring distributions. However, it is important to emphasize that the use of censoring heterogeneity in the construction of survival trees is optional.  In this blog, the construction of a survival tree is illustrated with an R dataset in a step-by-step approach, and the results are explained using the functions available in the LongCART package.


Installing and Loading LongCART package
R> install.packages("LongCART")
R> library(LongCART)
The GBSG2 dataset
We will illustrate the SurvCART algorithm with the GBSG2 data to analyze recurrence-free survival (RFS), originating from a prospective randomized clinical trial conducted by the German Breast Cancer Study Group. The purpose is to evaluate the effect of prognostic factors on RFS among node-positive breast cancer patients receiving chemotherapy in the adjuvant setting. RFS is a time-to-event endpoint, defined as the time from mastectomy to the first occurrence of either recurrence, contralateral or secondary tumor, or death. The GBSG2 dataset is included in the LongCART package in R. The dataset contains information from 686 breast cancer patients with the following variables: time (RFS follow-up time in days), cens (censoring indicator: 0 [censored], 1 [event]), horTh (hormonal therapy: yes, no), age (age in years), menostat (menopausal status: pre, post), tsize (tumour size in mm), tgrade (tumour grade: I, II, III), pnodes (number of positive nodes), progrec (level of progesterone receptor [PR] in fmol), and estrec (level of the estrogen receptor in fmol). For details about the conduct of the study, please refer to Schumacher et al. (1994).

Let’s first explore the dataset. 
R> data("GBSG2", package = "LongCART")
R> head(GBSG2)
R> str(GBSG2)
The median RFS time based on the entire 686 patients was 60 months, with a total of 299 reported RFS events.  Does any baseline covariate influence the median RFS time, and if yes, then how? A survival tree is meant to find this answer.

Impact of individual categorical baseline variables on RFS
We would like to start with understanding the impact of individual baseline covariates on the median RFS time. There could be two types of baseline covariates: categorical (e.g., horTH [hormonal therapy]) and continuous (e.g., age).  The impact of an individual categorical variable can be statistically assessed using StabCat.surv() [Note: StabCat.surv() performs a statistical test as described in Section 2.2.1 of Kundu and Ghosh 2021].

Let’s focus only on the heterogeneity in RFS (assuming exponential underlying distribution for RFS), but not on the censoring distribution. 
R> out1<- StabCat.surv(data=GBSG2, timevar="time", censorvar="cens", 
R>                     splitvar="horTh", time.dist="exponential", 
R>                     event.ind=1) 
Stability Test for Categorical grouping variable 
Test.statistic= 8.562, Adj. p-value= 0.003 
Greater evidence of heterogeneity in time-to-event distribution 
R> out1$pval
[1] 0.003431809
The p-value is 0.00343, suggesting an influence of hormonal therapy on RFS. Here we considered the exponential distribution for RFS, but the following distributions can be specified as well: "weibull", "lognormal" or "normal".

Now we illustrate the use of StabCat.surv() considering heterogeneity both in RFS time and censoring time.
R> out1<- StabCat.surv(data=GBSG2, timevar="time", censorvar="cens", 
R>                     splitvar="horTh", time.dist="exponential", 
R>                     cens.dist="exponential", event.ind=1) 
Stability Test for Categorical grouping variable 
Test.statistic= 8.562 0.013, Adj. p-value= 0.007 0.909 
Greater evidence of heterogeneity in time-to-event distribution 
> out1$pval
[1] 0.006863617
Note the two Test.statistic values (8.562 and 0.013): the first corresponds to heterogeneity in the RFS time distribution and the second to heterogeneity in the censoring distribution.  Consequently, we also see two (adjusted) p-values.  The overall p-value is 0.006863617. In comparison to the earlier p-value of 0.00343, this p-value is larger due to the multiplicity adjustment.

Impact of individual continuous baseline variables on RFS
Similarly, we can assess the impact of an individual continuous variable by performing a statistical test as described in Section 2.2.2 of Kundu and Ghosh 2021, which is implemented in StabCont.surv().
R> out1<- StabCont.surv(data=GBSG2, timevar="time", censorvar="cens",
R>                      splitvar="age", time.dist="exponential",
R>                      event.ind=1)
Stability Test for Continuous grouping variable 
Test.statistic= 0.925, Adj. p-value= 0.359 
Greater evidence of heterogeneity in time-to-event distribution 
> out1$pval
[1] 0.3591482
The p-value is 0.3591482, suggesting no impact of age on RFS. Heterogeneity in the censoring distribution can be considered here as well by specifying cens.dist, as shown with StabCat.surv().

Construction of survival tree
A survival tree can be constructed using SurvCART(). Note that the SurvCART() function currently can handle categorical baseline variables with numerical levels only. For nominal variables, please create the corresponding numerically coded dummy variable(s), as has been done below for horTh, tgrade and menostat.
R> GBSG2$horTh1<- as.numeric(GBSG2$horTh)
R> GBSG2$tgrade1<- as.numeric(GBSG2$tgrade)
R> GBSG2$menostat1<- as.numeric(GBSG2$menostat)
We also have to add the subject id if that is not already there.
R> GBSG2$subjid<- 1:nrow(GBSG2)
In SurvCART(), we have to list all the baseline variables of interest and declare whether each is categorical or continuous. Further, we also have to specify the RFS distribution: "exponential" (default), "weibull", "lognormal" or "normal". The censoring distribution can be "NA" (default, i.e., no censoring heterogeneity), "exponential", "weibull", "lognormal" or "normal".

Let’s construct the survival tree with heterogeneity in RFS only (i.e., ignoring censoring heterogeneity):
R> out.tree<- SurvCART(data=GBSG2, patid="subjid", 
R>                censorvar="cens", timevar="time", 
R>                gvars=c('horTh1', 'age', 'menostat1', 'tsize',          
R>                        'tgrade1', 'pnodes', 'progrec', 'estrec'),  
R>                tgvars=c(0,1,0,1,0,1, 1,1), 
R>                time.dist="exponential", 
R>                event.ind=1, alpha=0.05, minsplit=80, minbucket=40, 
R>                print=TRUE)
All the baseline variables are listed in the gvars argument. The tgvars argument accompanies the gvars argument and indicates the type of each partitioning variable (0 = categorical, 1 = continuous). Further, we have considered the exponential distribution for RFS time. 

Within SurvCART(), at any given tree node, each of the baseline variables is tested for heterogeneity using either StabCat.surv() or StabCont.surv(), and the corresponding p-value is obtained. These p-values are then adjusted according to the Hochberg procedure and, subsequently, the baseline variable with the smallest p-value is selected. These details are printed if print=TRUE. This procedure keeps iterating until we hit a terminal node (i.e., no more statistically significant baseline variables, a node size smaller than minsplit, or further splitting would result in a node size smaller than minbucket).

Now let’s view the tree result


In the above output, each row corresponds to a single node, including the 7 terminal nodes identified by TERMINAL=TRUE.  Now let’s visualize the tree result:
R> par(xpd = TRUE)
R> plot(out.tree, compress = TRUE)
R> text(out.tree, use.n = TRUE)
The resultant tree is as follows:

We can also plot the Kaplan-Meier curves for the sub-groups identified by the survival tree as follows:
R> KMPlot.SurvCART(out.tree, scale.time=365.25, type=1)


Last but not least, we can consider heterogeneity in the censoring distribution in SurvCART() by specifying cens.dist.
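A hedged sketch of such a call, mirroring the earlier call with the censoring distribution added (the exponential choice for cens.dist is an assumption):

R> out.tree2<- SurvCART(data=GBSG2, patid="subjid", 
R>                censorvar="cens", timevar="time", 
R>                gvars=c('horTh1', 'age', 'menostat1', 'tsize',          
R>                        'tgrade1', 'pnodes', 'progrec', 'estrec'),  
R>                tgvars=c(0,1,0,1,0,1,1,1), 
R>                time.dist="exponential", cens.dist="exponential", 
R>                event.ind=1, alpha=0.05, minsplit=80, minbucket=40, 
R>                print=TRUE)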

Reference

1. Kundu, M. G., & Ghosh, S. (2021). Survival trees based on heterogeneity in time‐to‐event and censoring distributions using parameter instability test. Statistical Analysis and Data Mining: The ASA Data Science Journal14(5), 466-483. [link]
2. Y. Cui, R. Zhu, M. Zhou, and M. Kosorok, Consistency of survival tree and forest models: Splitting bias and correction. Statistica Sinica Preprint No: SS-2020-0263, 2021.
3. Schumacher, M., Bastert, G., Bojar, H., Hübner, K., Olschewski, M., Sauerbrei, W., … & Rauschecker, H. F. (1994). Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. Journal of Clinical Oncology12(10), 2086-2093. [link]
4. LongCART package v 3.1 or higher, https://CRAN.R-project.org/package=LongCART

Mapping iNaturalist Data in R

This post was originally published as an IGIS Tech Note. Enjoy!

Introduction

iNaturalist is an enormously popular platform for recording and sharing observations of nature. Most people interact with iNaturalist through the Android or iOS phone app, but a little-known fact is that the platform also has an API (Application Programming Interface). That means you can query observations using programming languages like R and Python. And if you can query observations, then you can map them!

The rest of this post demonstrates how you can download iNaturalist observations in R using the rinat package, and then map them using leaflet. We’ll also use functions from sf and dplyr. These are very common R packages in data science, and there are a lot of online resources that teach you how to use them.

Our study area will be Placerita Canyon State Park in Los Angeles County, California. If you want to jump ahead to see the final product, click here.

Import the Park Boundary

First we load the packages we’re going use:
library(rinat)
library(sf)
library(dplyr)
library(tmap)
library(leaflet)

## Load the conflicted package and set preferences
library(conflicted)
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("count", "dplyr", quiet = TRUE)
conflict_prefer("select", "dplyr", quiet = TRUE)
conflict_prefer("arrange", "dplyr", quiet = TRUE)
Now we can import the park boundary, which is saved as a GeoJSON file. We use the sf package to import and transform the boundary to geographic coordinates, and leaflet to plot it:
placertia_sf <- st_read("placertia_canyon_sp.geojson", quiet = TRUE) %>% 
  st_transform(4326)

leaflet(placertia_sf) %>% 
  addTiles() %>% 
  addPolygons()
Placerita Canyon State Park

Get the Bounding Box

Next, we get the bounding box of the study area, so we can pass it to the iNaturalist API to tell it which area we’re interested in:
placertia_bb <- placertia_sf %>% 
  st_bbox()

placertia_bb
##       xmin       ymin       xmax       ymax 
## -118.47216   34.36698 -118.43764   34.37820

Download data

Now we are ready to retrieve iNaturalist observations. rinat makes this easy with get_inat_obs().
Downloading Etiquette

The non-profit that supports iNaturalist doesn’t charge for downloading data, because they believe in open data and supporting citizen science initiatives. They also don’t require registration for simply downloading observations. That being said, downloading data consumes resources that can affect other users, and it’s bad form (and bad programming) to download data unnecessarily.

There are a few best practices for downloading data in ways that won’t cause problems for other users:

  1. Use the server-side filters the API provides to only return results for your area of interest and your taxon(s) of interest. See below for examples of sending spatial areas and a taxon name when calling the API from R.

  2. Always save your results to disk, and don’t download data more than once. Your code should check to see if you’ve already downloaded something before getting it again.

  3. Don’t download more than you need. The API will throttle you (reduce your download speed) if you make more than 100 calls per minute. Spread it out.

  4. The API is not designed for bulk downloads. For that, look at the Export tool.

We’ll use the bounds argument in get_inat_obs() to specify the extent of our search. bounds takes 4 coordinates. Our study area is not a pure rectangle, but we’ll start with the bounding box and then apply an additional mask to remove observations outside the park boundary.

For starters we’ll just download the most recent 100 iNaturalist observations within the bounding box. (There are actually over 3000, but that would make for a very crowded map!). The code below first checks if the data have already been downloaded, to prevent downloading data twice.
search_results_fn <- "placertia_bb_inat.Rdata"

if (file.exists(search_results_fn)) {
  load(search_results_fn)

} else {  
  # Note we have to rearrange the coordinates of the bounding box a 
  # little bit to give get_inat_obs() what it expects
  inat_obs_df <- get_inat_obs(bounds = placertia_bb[c(2,1,4,3)], 
                            taxon_name = NULL,
                            year = NULL,
                            month = NULL,
                            maxresults = 100)
  
  ## Save the data frame to disk
  save(inat_obs_df, file = search_results_fn)
}
See what we got back:
dim(inat_obs_df)
## [1] 100  36
glimpse(inat_obs_df)
## Rows: 100
## Columns: 36
## $ scientific_name                   "Sceloporus occidentalis", "Eriogonum~
## $ datetime                          "2021-09-04 14:01:55 -0700", "2021-08~
## $ description                       "", "", "", "", "", "", "", "", "", "~
## $ place_guess                       "Placerita Canyon State Park, Newhall~
## $ latitude                          34.37754, 34.37690, 34.37691, 34.3764~
## $ longitude                         -118.4571, -118.4598, -118.4597, -118~
## $ tag_list                          "", "", "", "", "", "", "", "", "", "~
## $ common_name                       "Western Fence Lizard", "California b~
## $ url                               "https://www.inaturalist.org/observat~
## $ image_url                         "https://inaturalist-open-data.s3.ama~
## $ user_login                        "bmelchiorre", "ericas3", "ericas3", ~
## $ id                                94526862, 92920757, 92920756, 9291801~
## $ species_guess                     "Western Fence Lizard", "", "", "", "~
## $ iconic_taxon_name                 "Reptilia", "Plantae", "Plantae", "Pl~
## $ taxon_id                          36204, 54999, 52854, 58014, 64121, 55~
## $ num_identification_agreements     2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ num_identification_disagreements  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ observed_on_string                "Sat Sep 04 2021 14:01:55 GMT-0700 (P~
## $ observed_on                       "2021-09-04", "2021-08-29", "2021-08-~
## $ time_observed_at                  "2021-09-04 21:01:55 UTC", "2021-08-2~
## $ time_zone                         "Pacific Time (US & Canada)", "Pacifi~
## $ positional_accuracy               5, 3, 4, 4, 16, 4, 41, 41, 26, 9, NA,~
## $ public_positional_accuracy        5, 3, 4, 4, 16, 4, 41, 41, 26, 9, NA,~
## $ geoprivacy                        "", "", "", "", "", "", "", "", "", "~
## $ taxon_geoprivacy                  "open", "", "", "", "", "", "", "", "~
## $ coordinates_obscured              "false", "false", "false", "false", "~
## $ positioning_method                "", "", "", "", "", "", "", "", "", "~
## $ positioning_device                "", "", "", "", "", "", "", "", "", "~
## $ user_id                           4985856, 404893, 404893, 404893, 3214~
## $ created_at                        "2021-09-12 02:16:03 UTC", "2021-08-2~
## $ updated_at                        "2021-09-12 06:28:20 UTC", "2021-08-2~
## $ quality_grade                     "research", "needs_id", "needs_id", "~
## $ license                           "CC-BY-NC", "CC-BY-NC-SA", "CC-BY-NC-~
## $ sound_url                         NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ oauth_application_id              3, 333, 333, 333, 3, 3, 3, 3, 3, 3, N~
## $ captive_cultivated                "false", "false", "false", "false", "~
Next, in order to spatially filter the observations using the park boundary, we convert the data frame into a sf object. At the same time we’ll select just a few of the columns for our map:
inat_obs_sf <- inat_obs_df %>% 
select(longitude, latitude, datetime, common_name, 
      scientific_name, image_url, user_login) %>% 
  st_as_sf(coords=c("longitude", "latitude"), crs=4326)

dim(inat_obs_sf)
## [1] 100 6
Next, filter out observations that lie outside Placerita Canyon SP. How many are left?
inat_obs_pcsp_sf <- inat_obs_sf %>% st_intersection(placertia_sf)
nrow(inat_obs_pcsp_sf)
## [1] 90
Next, add a column to the attribute table containing HTML code that will appear in the popup windows:
inat_obs_pcsp_popup_sf <- inat_obs_pcsp_sf %>% 
  mutate(popup_html = paste0("<p><b>", common_name, "</b><br/>",
                        "<i>", scientific_name, "</i></p>",
                        "<p>Observed: ", datetime, "<br/>",
                        "User: ", user_login, "</p>",
                        "<p><img src='", image_url, "' style='width:100%;'/></p>")
)

Map the Results

Finally, we have everything we need to map the results with leaflet. Zoom-in and click on an observation to see details about it.
htmltools::p("iNaturalist Observations in Placerita Canyon SP",
             htmltools::br(),
             inat_obs_pcsp_popup_sf$datetime %>% 
               as.Date() %>% 
               range(na.rm = TRUE) %>% 
               paste(collapse = " to "),
             style = "font-weight:bold; font-size:110%;")

leaflet(placertia_sf) %>% 
  addProviderTiles("Esri.WorldStreetMap") %>% 
  addPolygons() %>% 
  addCircleMarkers(data = inat_obs_pcsp_popup_sf,
                   popup = ~popup_html, 
                   radius = 5)

iNaturalist Observations in Placerita Canyon SP
2006-01-04 to 2021-09-04

Leaflet map of all species in Placerita Canyon SP
Click on the map to open in a new tab

Map the Arachnids

The iNaturalist API supports taxon searching. If you pass it the name of a genus, for example, it will restrict results to only observations from that genus. Likewise for family, order, class, etc. Observations that don’t have a scientific name assigned will be left out.

You can do a taxon search from R by passing the taxon_name argument in get_inat_obs(). Below we’ll ask for the Arachnid class, which includes spiders, scorpions, ticks, and other arthropods. This code should all look familiar by now:
search_arachnid_fn <- "placertia_arachnid_obs.Rdata"
if (file.exists(search_arachnid_fn)) {
  load(search_arachnid_fn)
} else {  
  inat_arachnid_df <- get_inat_obs(bounds = placertia_bb[c(2,1,4,3)], 
                            taxon_name = "Arachnida",
                            year = NULL,
                            month = NULL,
                            maxresults = 100)
  save(inat_arachnid_df, file = search_arachnid_fn)
}

inat_arachnid_pcsp_popup_sf <- inat_arachnid_df %>% 
  select(longitude, latitude, datetime, common_name, scientific_name, image_url, user_login) %>%
  st_as_sf(coords=c("longitude", "latitude"), crs=4326) %>% 
  st_intersection(placertia_sf) %>% 
  mutate(popup_html = paste0("<p><b>", common_name, "</b><br/>",
                             "<i>", scientific_name, "</i></p>",
                             "<p>Observed: ", datetime, "<br/>",
                             "User: ", user_login, "</p>",
                             "<p><img src='", image_url, 
                             "' style='width:100%;'/></p>"))

htmltools::p("Arachnids in Placerita Canyon SP", style = "font-weight:bold; font-size:110%;")

leaflet(placertia_sf) %>% 
  addProviderTiles("Esri.WorldStreetMap") %>% 
  addPolygons() %>% 
  addCircleMarkers(data = inat_arachnid_pcsp_popup_sf,
                   popup = ~popup_html, 
                   radius = 5)
Arachnids in Placerita Canyon SP

Leaflet map of Arachnids
Click on the map to open in a new tab

Summary

The iNaturalist API allows you to download observations from a global community of naturalists, which is pretty powerful. This Tech Note showed you how to create a map of iNaturalist observations within a specific area and for a specific taxon using the rinat and leaflet packages for R.

A couple of caveats are worth noting. A limitation of this method is that the data are static. In other words, to get the latest data you have to re-query the API. It isn’t ‘live’ like a feature service.

Also, although leaflet works well for small to medium sized datasets, you wouldn’t want to use this mapping tool for massive amounts of spatial data. With leaflet, all of the geographic information and popup content are saved in the HTML file, and rendering is done on the client. This generally makes it extremely fast because everything is in memory. But if the data overwhelm memory it can slow your browser to a crawl. If you have to map hundreds of thousands of points, for example, you would need a different approach for both downloading and visualizing the data.

The Statsomat Apps with R and Python

The Statsomat project and site (https://statsomat.com) were launched at the beginning of 2021 with the goal of developing and maintaining open-source, web-based apps for automated data analysis. What is special about the reports generated by the Statsomat apps is the annotated output and the human-readable interpretation in natural language. Most of the apps currently available also deliver user-friendly code (R or Python) to reproduce the outputs. 

The Statsomat apps currently available are:

 

Nonlinear Market Forecasting using ‘Stealth Curves’

In 2001, I was fortunate to discover a ‘market characteristic’ that transcends virtually every liquid market. Markets often trend in a nonlinear fashion according to hidden support/resistance curves (Stealth Curves).  Whereas ‘market anomalies’ are transient and overwhelmingly market-specific, Stealth Curves are highly robust across virtually every market and stand the test of time. Using R, I chart several Stealth Curve examples relative to the wheat market (symbol = WEAT). Download the data from my personal GitHub site and read it into a ‘weat’ tibble.

library(tidyverse)
library(readxl)

# Original data source - https://www.nasdaq.com/market-activity/funds-and-etfs/weat/historical

# Download reformatted data (columns/headings) from github
# https://github.com/123blee/Stealth_Curves.io/blob/main/WEAT_nasdaq_com_data_reformatted.xlsx

# Insert your file path in the next line of code 

weat <- read_excel("... Place your file path Here ... /WEAT_nasdaq_com_data_reformatted.xlsx")
weat


# Convert 'Date and Time' to 'Date' column
weat[["Date"]] <- as.Date(weat[["Date"]])
weat

bars <- nrow(weat)

# Add bar indicator as first tibble column

weat <- weat %>%
     add_column(t = 1:nrow(weat), .before = "Date")
weat
View tibble



Prior to developing Stealth Curve charts, it is often advantageous to view the raw data in an interactive chart.

# Interactive Pricing Chart

xmin <- 1              
ymin <- 0   
ymax_close <- ceiling(max(weat[["Close"]]))
ymax_low <- ceiling(max(weat[["Low"]]))
ymax_high <- ceiling(max(weat[["High"]]))

# hPlot() comes from the rCharts package (not on CRAN; install from GitHub, e.g. ramnathv/rCharts)
library(rCharts)

interactive <- hPlot(x = "t", y = "Close", data = weat, type = "line",
               
               ylim = c(ymin, ymax_close),
               
               xlim = c(xmin, bars),
               
               xaxt="n",   # suppress x-axis labels
               
               yaxt="n",   # suppress y-axis labels,
               
               ann=FALSE)  # x and y axis titles

interactive$set(height = 600)

interactive$set(width = 700)

interactive$plotOptions(line = list(color = "green"))

interactive$chart(zoomType = "x")   # Highlight range of chart to zoom in

interactive


Interactive Chart
WEAT Daily Closing Prices 




Highlight any range to zoom data and view price on any bar number.

Interactive Chart
WEAT Daily Closing Prices (Zoomed)




Prior to plotting, add 400 additional rows to the tibble in preparation for extending the calculated Stealth Curve into future periods.

# Add 400 future days to the plot for projection of the Stealth Curve
 
 future <- 400
 
 weat <- weat %>%
   add_row(t = (bars+1):(bars+future))
Chart the WEAT daily closing prices with 400 days of padding. 

 # Chart Closing WEAT prices
 
 plot.new()
 
 chart_title_close <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily Closing Prices ($)")
 
 background <- c("azure1") 
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(new=TRUE)
 
 t <- weat[["t"]]
 Price <- weat[["Close"]]
 
 plot(x=t, y=Price, main = chart_title_close, type="l", col = "blue",        
      
      ylim = c(ymin, ymax_close) ,
      
      xlim = c(xmin, (bars+future )) ) 
The below chart represents 10 years of closing prices for the WEAT market since inception of the ETF on 9/19/2011.  The horizontal axis represents time (t) stated in bar numbers (t=1 = bar 1 = Date 9/19/2011).  This starting ‘t‘ value is entirely arbitrary and does not impact the position of a calculated Stealth Curve on a chart.



The above chart displays a ‘random’ market in decline over much of the decade.

Stealth Curves reveal an entirely different story.  They depict extreme nonlinear order with respect to market pivot lows. In order for Stealth Curves to be robust across most every liquid market, the functional form must not only be simple – it must be extremely simple. The following Stealth Curve equation is charted against the closing price data.
 # Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 7231.88  
 b <-  1.18 
 c <- -7.77 
 
 weat <- weat %>%
   mutate(Stealth_Curve_Close = a/(t + c) + b)
 weat


As the Stealth Curve is negative in bars 1 through 7, these values are ignored in the chart by the use of NA padding.

 # Omit negative Stealth Curve values in charting, if any
 
 z1 <- weat[["Stealth_Curve_Close"]]
 
 weat[["Stealth_Curve_Close"]] <- ifelse(z1 < 0, NA, z1)
 weat


Add the Stealth Curve to the chart.

# Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_Close"]])
Closing Prices with Stealth Curve


Once the Stealth Curve is added to the chart, extreme nonlinear market order is clearly evident.  

I personally refer to this process as overlaying a cheat-sheet on a pricing chart to understand why prices bounced where they did and where they may bounce on a go-forward basis. Stealth Curves may be plotted from [-infinity, + infinity].

The human eye is not adept at discerning the extreme accuracy of this Stealth Curve.  As such, visual aids are added.  This simple curve serves as a strange attractor to WEAT closing prices.  The market closely hugs the Stealth Curve just prior to t=500 (oval) for 3 consecutive months.  The arrows depict 10 separate market bounces off/very near the curve.



While some of the bounces appear ‘small,’ it is important to note the prices are also relatively small.  As an example, the ‘visually small bounce’ at the ‘3rd arrow from the right’ represents a 10%+ market gain. That particular date is of interest, as it is the exact date I personally identified this Stealth trending market for the first time.  I typically do not follow the WEAT market.  Had I done so, I could have identified this Stealth Curve sooner in time.  A Stealth Curve requires only 3 market pivot points for definition.

Reference my real-time LinkedIn post of this Stealth Curve at the ‘3rd arrow’ event here. This multi-year Stealth Curve remains valid to the current date.  By definition, a Stealth Curve remains valid until it is penetrated to the downside in a meaningful manner.  Even then, it often later serves as market resistance. On occasion, a Stealth Curve will serve as support followed by resistance followed by support (reference last 2 charts in this post).   

Next, a Stealth Curve is applied to market pivot lows as defined by high prices.  

Chart high prices.

# Market Pivot Lows using High Prices
 
 # Chart High WEAT prices
 
 plot.new()
 
 chart_title_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily High Prices ($)")
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 t <- weat[["t"]]
 Price <- weat[["High"]]
 
 plot(x=t, y=Price, main = chart_title_high, type="l", col = "blue", 
      
      ylim = c(ymin, ymax_high) ,
      
      xlim = c(xmin, (bars+future )) )  


The parameterized Stealth Curve equation is as follows: Stealth_Curve_High = 7815.16/(t + 37.35) + 1.01.
Add the Stealth Curve to the tibble.


# Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 7815.16  
 b <-    1.01 
 c <-   37.35 
 
 weat <- weat %>%
   mutate(Stealth_Curve_High = a/(t + c) + b)
 
 # Omit negative Stealth Curve values in charting, if any
 
 z2 <- weat[["Stealth_Curve_High"]]
 
 weat[["Stealth_Curve_High"]] <- ifelse(z2 < 0, NA, z2)
 weat



 # Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_High"]])


When backcasted in time, this Stealth Curve transitions from resistance to support, which is truly amazing. When a backcasted Stealth Curve, relative to the data used for curve parameterization, displays additional periods of market support/resistance, additional confidence is placed in its ability to act as a strange attractor.  Based on visual inspection, it is doubtful that any other smooth curve (of the infinitely many possible) could coincide with as many meaningful pivot highs and lows (21 in total) over this 10-year period as this simple Stealth Curve. In order to appreciate the level of accuracy of the backcasted Stealth Curve, a log chart of a zoomed sectional of the data is presented. 

 # Natural Log High Prices 
 
 # Chart High WEAT prices
 
 plot.new()
 
 chart_title_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily High Prices ($)")
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 chart_title_log_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nNatural Log of Daily High Prices ($)")
 
 t <- weat[["t"]]
 Log_Price <- log(weat[["High"]])
 
 plot(x=t, y=Log_Price, main = chart_title_log_high, type="l", col = "blue", 
      
      ylim = c(log(7), log(ymax_high)) ,
      
      xlim = c(xmin, 800) ) # (bars+future )) )  
 
 # Add Log(Stealth Curve) to chart
 
 lines(t, log(weat[["Stealth_Curve_High"]]))
Zoomed Sectional
Log(High Prices) and Log(Stealth Curve)
WEAT



There are 11 successful tests of Stealth Resistance, including the all-time market high of the ETF.


In total, this Stealth Curve exactly or closely identifies 21 total market pivots (Stealth Resistance = 11, Stealth Support = 10).

Lastly, a Stealth Curve is presented based on market pivot lows defined by low prices.

# Market Pivot Lows using Low Prices
 
 # Chart Low WEAT prices
 
 plot.new()
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 chart_title_low <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily Low Prices ($)")
 
 t <- weat[["t"]]
 Price <- weat[["Low"]]
 
 
 plot(x=t, y=Price, main = chart_title_low, type="l", col = "blue", 
      
      ylim = c(ymin, ymax_low) ,
      
      xlim = c(xmin, (bars+future )) )  
Chart the low prices.



The parameterized Stealth Curve equation based on 3 pivot lows is given below: Stealth_Curve_Low = 9022.37/(t + 125.72) + 0.48.
Add the calculated Stealth Curve to the tibble.


 # Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 9022.37  
 b <-    0.48 
 c <-  125.72 
 
 weat <- weat %>%
   mutate(Stealth_Curve_Low = a/(t + c) + b)

 # Omit negative Stealth Curve values in charting, if any
 
 z3 <- weat[["Stealth_Curve_Low"]]
 
 weat[["Stealth_Curve_Low"]] <- ifelse(z3 < 0, NA, z3)
 weat


Add the Stealth Curve to the chart. 


# Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_Low"]])
 



Even without identifying arrows, it is clear that this Stealth Curve identifies the greatest number of pivot lows of the 3 charts presented (close, high, low).

As promised, below are 2 related Stealth Curve charts I posted in 2005 that transitioned from support, to resistance, to support, and then back to resistance (the data no longer exist – only the graphics).  These charts used reverse-adjusted corn futures data.  The Stealth Curve was defined using pivot low prices.  Subsequent market resistance applied to market high prices (Chart 2 of 2). 

Chart 1 of 2




Chart 2 of 2



For those interested in additional Stealth Curve examples applied to various other markets, simply view my LinkedIn post here.  Full details of Stealth Curve model parameterization are described in my latest Amazon publication, Stealth Curves: The Elegance of ‘Random’ Markets.

Brian K. Lee, MBA, PRM, CMA, CFA
[email protected]