Natural Gas Prices Fall 42% in 3 Months Following Breach of ‘Nonlinear Stealth Support’


The earlier LinkedIn post below, which identified a well-defined nonlinear Stealth Support curve, indicated that the price evolution of natural gas (UNG) was highly
likely on an unsustainable trajectory.



Within a month of the post, the market breached its well-defined Stealth Support Curve.  In the 3 months that followed, this market lost 42% of its value. 



The following R code reproduces the Stealth Curve shown below.




UNG Stealth Curve – R Code
library(tidyverse)
library(readxl)

# Original data source - https://www.nasdaq.com/market-activity/funds-and-etfs/ung/historical

# Download reformatted data (columns/headings) from my github site and save to a local drive
# https://github.com/123blee/Stealth_Curves.io/blob/main/UNG_prices.xlsx

ung <- read_excel("... Insert your local file path here .../UNG_prices.xlsx")
ung

# Convert 'Date and Time' to 'Date' column
ung[["Date"]] <- as.Date(ung[["Date"]])
ung

bars <- nrow(ung)

# Add bar indicator as first tibble column

ung <- ung %>%
  add_column(t = 1:nrow(ung), .before = "Date")
ung

# Add 40 future days to the tibble so the Stealth Curve can be projected forward

future <- 40

ung <- ung %>%
  add_row(t = (bars+1):(bars+future))

# Market Pivot Lows using 'Low' Prices

# Chart 'Low' UNG prices

xmin <- 2250
xmax <- bars + future
ymin <- 0
ymax <- 25

plot.new()

background <- c("azure1")

chart_title_low <- c("Natural Gas (UNG) \nDaily Low Prices ($)")

u <- par("usr") 

rect(u[1], u[3], u[2], u[4], col = background) 

par(ann=TRUE)

par(new=TRUE)

t <- ung[["t"]]
Price <- ung[["Low"]]

plot(x=t, y=Price, main = chart_title_low, type="l", col = "blue", 
     
     ylim = c(ymin, ymax) ,
     
     xlim = c(xmin, xmax ) ) 

# Add Stealth Curve to tibble

# Stealth Support Curve parameters
a <-   -444.56  
b <-      6.26  
c <-  -2555.01  


ung <- ung %>%
  mutate(Stealth_Curve_Low = a/(t + c) + b)
ung

# Omit certain Stealth Support Curve values from charting

ung[["Stealth_Curve_Low"]][2550:xmax] <- NA
ung

# Add Stealth Curve to chart

lines(t, ung[["Stealth_Curve_Low"]])
Details of Stealth Curve parameterization are provided in my ‘Stealth Curves: The Elegance of Random Markets’ text on Amazon.

Brian K. Lee, MBA, PRM, CMA, CFA

Feel free to contact me on LinkedIn.      

SurvCART: Constructing Survival Tree in R

[Author: Madan G. Kundu]

In practice, survival times may be influenced by a combination of baseline variables. Survival trees (i.e., trees with survival data) offer a relatively flexible approach to understanding the effects of covariates, including their interactions, on survival times when the functional form of the association is unknown, and they also have the advantage of easier interpretation. This blog presents the construction of a survival tree following the SurvCART algorithm (Kundu and Ghosh, 2021). Usually, a survival tree is constructed based on heterogeneity in the time-to-event distribution. However, in some applications with marker-dependent censoring (e.g., less education or a better prognosis might lead to early censoring), ignoring the censoring distribution might lead to an inconsistent survival tree (Cui et al., 2021). Therefore, the SurvCART algorithm can also construct a survival tree based on heterogeneity in both the time-to-event and censoring distributions. However, it is important to emphasize that the use of censoring heterogeneity in the construction of survival trees is optional. In this blog, the construction of a survival tree is illustrated with an R dataset in a step-by-step approach, and the results are explained using the functions available in the LongCART package.


Installing and Loading LongCART package
R> install.packages("LongCART")
R> library(LongCART)
The GBSG2 dataset
We will illustrate the SurvCART algorithm with the GBSG2 data to analyze recurrence-free survival (RFS), originating from a prospective randomized clinical trial conducted by the German Breast Cancer Study Group. The purpose is to evaluate the effect of prognostic factors on RFS among node-positive breast cancer patients receiving chemotherapy in the adjuvant setting. RFS is a time-to-event endpoint, defined as the time from mastectomy to the first occurrence of either recurrence, contralateral or secondary tumor, or death. The GBSG2 dataset is included in the LongCART package in R. The dataset contains information from 686 breast cancer patients with the following variables: time (RFS follow-up time in days), cens (censoring indicator: 0 [censored], 1 [event]), horTh (hormonal therapy: yes, no), age (age in years), menostat (menopausal status: pre, post), tsize (tumour size in mm), tgrade (tumour grade: I, II, III), pnodes (number of positive nodes), progrec (level of the progesterone receptor [PR] in fmol), and estrec (level of the estrogen receptor in fmol). For details about the conduct of the study, please refer to Schumacher et al. (1994).

Let’s first explore the dataset. 
R> data("GBSG2", package = "LongCART")
R> head(GBSG2)
The median RFS time for the entire set of 686 patients was 60 months, with a total of 299 reported RFS events. Does any baseline covariate influence the median RFS time, and if so, how? A survival tree is meant to answer this question.

Impact of individual categorical baseline variables on RFS
We would like to start with understanding the impact of individual baseline covariates on the median RFS time. There could be two types of baseline covariates: categorical (e.g., horTH [hormonal therapy]) and continuous (e.g., age).  The impact of an individual categorical variable can be statistically assessed using StabCat.surv() [Note: StabCat.surv() performs a statistical test as described in Section 2.2.1 of Kundu and Ghosh 2021].

Let’s focus only on the heterogeneity in RFS (assuming exponential underlying distribution for RFS), but not on the censoring distribution. 
R> out1<- StabCat.surv(data=GBSG2, timevar="time", censorvar="cens", 
R>                     splitvar="horTh", time.dist="exponential", 
R>                     event.ind=1) 
Stability Test for Categorical grouping variable 
Test.statistic= 8.562, Adj. p-value= 0.003 
Greater evidence of heterogeneity in time-to-event distribution 
R> out1$pval
[1] 0.003431809
The p-value is 0.00343, suggesting an influence of hormonal therapy on RFS. Here we considered the exponential distribution for RFS, but the following distributions can be specified as well: "weibull", "lognormal", or "normal".

Now we illustrate the use of StabCat.surv() considering heterogeneity both in RFS time and censoring time.
R> out1<- StabCat.surv(data=GBSG2, timevar="time", censorvar="cens", 
R>                     splitvar="horTh", time.dist="exponential", 
R>                     cens.dist="exponential", event.ind=1) 
Stability Test for Categorical grouping variable 
Test.statistic= 8.562 0.013, Adj. p-value= 0.007 0.909 
Greater evidence of heterogeneity in time-to-event distribution 
> out1$pval
[1] 0.006863617
Note the two Test.statistic values (8.562 and 0.013): the first corresponds to heterogeneity in RFS time and the second to heterogeneity in the censoring distribution. Consequently, we also see two (adjusted) p-values. The overall p-value is 0.006863617. Compared to the earlier p-value of 0.00343, this p-value is greater due to the multiplicity adjustment.

Impact of individual continuous baseline variables on RFS
Similarly, we can assess the impact of an individual continuous variable by performing a statistical test as described in Section 2.2.2 of Kundu and Ghosh (2021), which is implemented in StabCont.surv().
R> out1<- StabCont.surv(data=GBSG2, timevar="time", censorvar="cens",
R>                      splitvar="age", time.dist="exponential",
R>                      event.ind=1)
Stability Test for Continuous grouping variable 
Test.statistic= 0.925, Adj. p-value= 0.359 
Greater evidence of heterogeneity in time-to-event distribution 
> out1$pval
[1] 0.3591482
The p-value is 0.3591482, suggesting no impact of age on RFS. Heterogeneity in the censoring distribution can be considered here as well by specifying cens.dist, as shown with StabCat.surv().
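As a sketch (output not shown, and assuming StabCont.surv() accepts cens.dist in the same way as StabCat.surv() above), such a call would look like:
R> out2<- StabCont.surv(data=GBSG2, timevar="time", censorvar="cens",
R>                      splitvar="age", time.dist="exponential",
R>                      cens.dist="exponential", event.ind=1)
R> out2$pval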

Construction of survival tree
A survival tree can be constructed using SurvCART(). Note that the SurvCART() function currently can handle categorical baseline variables with numerical levels only. For nominal variables, please create the corresponding numerically coded dummy variable(s), as has been done below for horTh, tgrade and menostat.
R> GBSG2$horTh1<- as.numeric(GBSG2$horTh)
R> GBSG2$tgrade1<- as.numeric(GBSG2$tgrade)
R> GBSG2$menostat1<- as.numeric(GBSG2$menostat)
We also have to add the subject id if that is not already there.
R> GBSG2$subjid<- 1:nrow(GBSG2)
In SurvCART(), we have to list all the baseline variables of interest and declare whether each is categorical or continuous. Further, we also have to specify the RFS distribution: "exponential" (default), "weibull", "lognormal", or "normal". The censoring distribution can be "NA" (default, i.e., no censoring heterogeneity), "exponential", "weibull", "lognormal", or "normal".

Let’s construct the survival tree with heterogeneity in RFS only (i.e., ignoring censoring heterogeneity)
R> out.tree<- SurvCART(data=GBSG2, patid="subjid", 
R>                censorvar="cens", timevar="time", 
R>                gvars=c('horTh1', 'age', 'menostat1', 'tsize',          
R>                        'tgrade1', 'pnodes', 'progrec', 'estrec'),  
R>                tgvars=c(0,1,0,1,0,1, 1,1), 
R>                time.dist="exponential", 
R>                event.ind=1, alpha=0.05, minsplit=80, minbucket=40, 
R>                print=TRUE)
All the baseline variables are listed in the gvars argument. The tgvars argument accompanies the gvars argument and indicates the type of each partitioning variable (0 = categorical, 1 = continuous). Further, we have considered an exponential distribution for the RFS time.

Within SurvCART(), at any given tree node, each of the baseline variables is tested for heterogeneity using either StabCat.surv() or StabCont.surv(), and the corresponding p-value is obtained. These p-values are then adjusted according to the Hochberg procedure, and subsequently the baseline variable with the smallest p-value is selected. These details are printed if print=TRUE. This procedure keeps iterating until we hit a terminal node (i.e., there is no more statistically significant baseline variable, the node size is smaller than minsplit, or further splitting would result in a node size smaller than minbucket).
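As a small illustration of the adjustment step (hypothetical p-values, not produced by SurvCART()), the Hochberg procedure is available in base R via p.adjust():
R> pvals <- c(0.003, 0.020, 0.150, 0.400)   # hypothetical per-variable p-values
R> p.adjust(pvals, method = "hochberg")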

Now let’s view the tree result


In the above output, each row corresponds to a single node, including the 7 terminal nodes identified by TERMINAL=TRUE. Now let's visualize the tree result.
R> par(xpd = TRUE)
R> plot(out.tree, compress = TRUE)
R> text(out.tree, use.n = TRUE)
The resultant tree is as follows:

We can also produce a Kaplan-Meier (KM) plot for the sub-groups identified by the survival tree as follows:
KMPlot.SurvCART(out.tree, scale.time=365.25, type=1)


Last but not least, we can consider heterogeneity in the censoring distribution in SurvCART() by specifying cens.dist.
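A sketch of such a call, mirroring the earlier one but with an exponential censoring distribution added (assuming the same cens.dist argument name as in StabCat.surv(); output not shown):
R> out.tree2<- SurvCART(data=GBSG2, patid="subjid", 
R>                censorvar="cens", timevar="time", 
R>                gvars=c('horTh1', 'age', 'menostat1', 'tsize',          
R>                        'tgrade1', 'pnodes', 'progrec', 'estrec'),  
R>                tgvars=c(0,1,0,1,0,1,1,1), 
R>                time.dist="exponential", cens.dist="exponential", 
R>                event.ind=1, alpha=0.05, minsplit=80, minbucket=40, 
R>                print=TRUE)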

Reference

1. Kundu, M. G., & Ghosh, S. (2021). Survival trees based on heterogeneity in time-to-event and censoring distributions using parameter instability test. Statistical Analysis and Data Mining: The ASA Data Science Journal, 14(5), 466-483. [link]
2. Cui, Y., Zhu, R., Zhou, M., & Kosorok, M. (2021). Consistency of survival tree and forest models: Splitting bias and correction. Statistica Sinica Preprint No. SS-2020-0263.
3. Schumacher, M., Bastert, G., Bojar, H., Hübner, K., Olschewski, M., Sauerbrei, W., … & Rauschecker, H. F. (1994). Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. Journal of Clinical Oncology, 12(10), 2086-2093. [link]
4. LongCART package v 3.1 or higher, https://CRAN.R-project.org/package=LongCART

Mapping iNaturalist Data in R

This post was originally published as an IGIS Tech Note. Enjoy!

Introduction

iNaturalist is an enormously popular platform for recording and sharing observations of nature. Most people interact with iNaturalist through the Android or iOS phone app, but a little-known fact is that the platform also has an API (Application Programming Interface). That means you can query observations using programming languages like R and Python. And if you can query observations, then you can map them!

The rest of this post demonstrates how you can download iNaturalist observations in R using the rinat package, and then map them using leaflet. We’ll also use functions from sf and dplyr. These are very common R packages in data science, and there are a lot of online resources that teach you how to use them.

Our study area will be Placerita Canyon State Park in Los Angeles County, California. If you want to jump ahead to see the final product, click here.

Import the Park Boundary

First we load the packages we’re going use:
library(rinat)
library(sf)
library(dplyr)
library(tmap)
library(leaflet)

## Load the conflicted package and set preferences
library(conflicted)
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("count", "dplyr", quiet = TRUE)
conflict_prefer("select", "dplyr", quiet = TRUE)
conflict_prefer("arrange", "dplyr", quiet = TRUE)
Now we can import the park boundary which is saved as a geojson file. We use the sf package to import and transform the boundary to geographic coordinates, and leaflet to plot it:
placertia_sf <- st_read("placertia_canyon_sp.geojson", quiet = TRUE) %>% 
  st_transform(4326)

leaflet(placertia_sf) %>% 
  addTiles() %>% 
  addPolygons()
Placerita Canyon State Park

Get the Bounding Box

Next, we get the bounding box of the study area, so we can pass it to the iNaturalist API to tell it which area we’re interested in:
placertia_bb <- placertia_sf %>% 
  st_bbox()

placertia_bb
##       xmin       ymin       xmax       ymax 
## -118.47216   34.36698 -118.43764   34.37820

Download data

Now we are ready to retrieve iNaturalist observations. rinat makes this easy with get_inat_obs().
Downloading Etiquette

The non-profit that supports iNaturalist doesn’t charge for downloading data, because they believe in open data and supporting citizen science initiatives. They also don’t require registration for simply downloading observations. That being said, downloading data consumes resources that can affect other users, and it’s bad form (and bad programming) to download data unnecessarily.

There are a few best practices for downloading data in ways that won’t cause problems for other users:

  1. Use the server-side filters the API provides to only return results for your area of interest and your taxon(s) of interest. See below for examples of sending spatial areas and a taxon name when calling the API from R.

  2. Always save your results to disk, and don’t download data more than once. Your code should check to see if you’ve already downloaded something before getting it again.

  3. Don’t download more than you need. The API will throttle you (reduce your download speed) if you exceed 100 calls per minute. Spread your requests out (see the short sketch after this list).

  4. The API is not designed for bulk downloads. For that, look at the Export tool.
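As a hypothetical illustration of points 2 and 3 (the taxon names below are arbitrary and not part of the original Tech Note), repeated queries can be paced with Sys.sleep() so they stay well under the rate limit:
## Hypothetical sketch: pacing a handful of taxon queries
taxa <- c("Aves", "Reptilia", "Plantae")   # arbitrary example taxa

obs_lst <- list()
for (taxon in taxa) {
  obs_lst[[taxon]] <- get_inat_obs(bounds = placertia_bb[c(2,1,4,3)],
                                   taxon_name = taxon,
                                   maxresults = 100)
  Sys.sleep(1)   # pause ~1 second between calls to spread the requests out
}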

We’ll use the bounds argument in get_inat_obs() to specify the extent of our search. bounds takes 4 coordinates. Our study area is not a pure rectangle, but we’ll start with the bounding box and then apply an additional mask to remove observations outside the park boundary.

For starters we’ll just download the most recent 100 iNaturalist observations within the bounding box. (There are actually over 3000, but that would make for a very crowded map!). The code below first checks if the data have already been downloaded, to prevent downloading data twice.
search_results_fn <- "placertia_bb_inat.Rdata"

if (file.exists(search_results_fn)) {
  load(search_results_fn)

} else {  
  # Note we have to rearrange the coordinates of the bounding box a 
  # little bit to give get_inat_obs() what it expects
  inat_obs_df <- get_inat_obs(bounds = placertia_bb[c(2,1,4,3)], 
                            taxon_name = NULL,
                            year = NULL,
                            month = NULL,
                            maxresults = 100)
  
  ## Save the data frame to disk
  save(inat_obs_df, file = search_results_fn)
}
See what we got back:
dim(inat_obs_df)
## [1] 100  36
glimpse(inat_obs_df)
## Rows: 100
## Columns: 36
## $ scientific_name                   "Sceloporus occidentalis", "Eriogonum~
## $ datetime                          "2021-09-04 14:01:55 -0700", "2021-08~
## $ description                       "", "", "", "", "", "", "", "", "", "~
## $ place_guess                       "Placerita Canyon State Park, Newhall~
## $ latitude                          34.37754, 34.37690, 34.37691, 34.3764~
## $ longitude                         -118.4571, -118.4598, -118.4597, -118~
## $ tag_list                          "", "", "", "", "", "", "", "", "", "~
## $ common_name                       "Western Fence Lizard", "California b~
## $ url                               "https://www.inaturalist.org/observat~
## $ image_url                         "https://inaturalist-open-data.s3.ama~
## $ user_login                        "bmelchiorre", "ericas3", "ericas3", ~
## $ id                                94526862, 92920757, 92920756, 9291801~
## $ species_guess                     "Western Fence Lizard", "", "", "", "~
## $ iconic_taxon_name                 "Reptilia", "Plantae", "Plantae", "Pl~
## $ taxon_id                          36204, 54999, 52854, 58014, 64121, 55~
## $ num_identification_agreements     2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ num_identification_disagreements  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ observed_on_string                "Sat Sep 04 2021 14:01:55 GMT-0700 (P~
## $ observed_on                       "2021-09-04", "2021-08-29", "2021-08-~
## $ time_observed_at                  "2021-09-04 21:01:55 UTC", "2021-08-2~
## $ time_zone                         "Pacific Time (US & Canada)", "Pacifi~
## $ positional_accuracy               5, 3, 4, 4, 16, 4, 41, 41, 26, 9, NA,~
## $ public_positional_accuracy        5, 3, 4, 4, 16, 4, 41, 41, 26, 9, NA,~
## $ geoprivacy                        "", "", "", "", "", "", "", "", "", "~
## $ taxon_geoprivacy                  "open", "", "", "", "", "", "", "", "~
## $ coordinates_obscured              "false", "false", "false", "false", "~
## $ positioning_method                "", "", "", "", "", "", "", "", "", "~
## $ positioning_device                "", "", "", "", "", "", "", "", "", "~
## $ user_id                           4985856, 404893, 404893, 404893, 3214~
## $ created_at                        "2021-09-12 02:16:03 UTC", "2021-08-2~
## $ updated_at                        "2021-09-12 06:28:20 UTC", "2021-08-2~
## $ quality_grade                     "research", "needs_id", "needs_id", "~
## $ license                           "CC-BY-NC", "CC-BY-NC-SA", "CC-BY-NC-~
## $ sound_url                         NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ oauth_application_id              3, 333, 333, 333, 3, 3, 3, 3, 3, 3, N~
## $ captive_cultivated                "false", "false", "false", "false", "~
Next, in order to spatially filter the observations using the park boundary, we convert the data frame into a sf object. At the same time we’ll select just a few of the columns for our map:
inat_obs_sf <- inat_obs_df %>% 
  select(longitude, latitude, datetime, common_name, 
         scientific_name, image_url, user_login) %>% 
  st_as_sf(coords=c("longitude", "latitude"), crs=4326)

dim(inat_obs_sf)
## [1] 100 6
Next, filter out observations that lie outside Placerita Canyon SP. How many are left?
inat_obs_pcsp_sf <- inat_obs_sf %>% st_intersection(placertia_sf)
nrow(inat_obs_pcsp_sf)
## [1] 90
Next, add a column to the attribute table containing HTML code that will appear in the popup windows:
inat_obs_pcsp_popup_sf <- inat_obs_pcsp_sf %>% 
  mutate(popup_html = paste0("<p><b>", common_name, "</b><br/>",
                        "<i>", scientific_name, "</i></p>",
                        "<p>Observed: ", datetime, "<br/>",
                        "User: ", user_login, "</p>",
                        "<p><img src='", image_url, "' style='width:100%;'/></p>")
)

Map the Results

Finally, we have everything we need to map the results with leaflet. Zoom-in and click on an observation to see details about it.
htmltools::p("iNaturalist Observations in Placerita Canyon SP",
             htmltools::br(),
             inat_obs_pcsp_popup_sf$datetime %>% 
               as.Date() %>% 
               range(na.rm = TRUE) %>% 
               paste(collapse = " to "),
             style = "font-weight:bold; font-size:110%;")

leaflet(placertia_sf) %>% 
  addProviderTiles("Esri.WorldStreetMap") %>% 
  addPolygons() %>% 
  addCircleMarkers(data = inat_obs_pcsp_popup_sf,
                   popup = ~popup_html, 
                   radius = 5)

iNaturalist Observations in Placerita Canyon SP
2006-01-04 to 2021-09-04

Leaflet map of all species in Placerita Canyon SP
Click on the map to open in a new tab

Map the Arachnids

The iNaturalist API supports taxon searching. If you pass it the name of a genus, for example, it will restrict results to only observations from that genus. Likewise for family, order, class, etc. Observations that don’t have a scientific name assigned will be left out.

You can do a taxon search from R by passing the taxon_name argument in get_inat_obs(). Below we’ll ask for the class Arachnida, which includes spiders, scorpions, ticks, and other arachnids. This code should all look familiar by now:
search_arachnid_fn <- "placertia_arachnid_obs.Rdata"
if (file.exists(search_arachnid_fn)) {
  load(search_arachnid_fn)
} else {  
  inat_arachnid_df <- get_inat_obs(bounds = placertia_bb[c(2,1,4,3)], 
                            taxon_name = "Arachnida",
                            year = NULL,
                            month = NULL,
                            maxresults = 100)
  save(inat_arachnid_df, file = search_arachnid_fn)
}

inat_arachnid_pcsp_popup_sf <- inat_arachnid_df %>% 
  select(longitude, latitude, datetime, common_name, scientific_name, image_url, user_login) %>%
  st_as_sf(coords=c("longitude", "latitude"), crs=4326) %>% 
  st_intersection(placertia_sf) %>% 
  mutate(popup_html = paste0("<p><b>", common_name, "</b><br/>",
                             "<i>", scientific_name, "</i></p>",
                             "<p>Observed: ", datetime, "<br/>",
                             "User: ", user_login, "</p>",
                             "<p><img src='", image_url, 
                             "' style='width:100%;'/></p>"))

htmltools::p("Arachnids in Placerita Canyon SP", style = "font-weight:bold; font-size:110%;")

leaflet(placertia_sf) %>% 
  addProviderTiles("Esri.WorldStreetMap") %>% 
  addPolygons() %>% 
  addCircleMarkers(data = inat_arachnid_pcsp_popup_sf,
                   popup = ~popup_html, 
                   radius = 5)
Arachnids in Placerita Canyon SP

Leaflet map of Arachnids
Click on the map to open in a new tab

Summary

The iNaturalist API allows you to download observations from a global community of naturalists, which is pretty powerful. This Tech Note showed you how to create a map of iNaturalist observations within a specific area and for a specific taxon using the rinat and leaflet packages for R.

A couple of caveats are worth noting. A limitation of this method is that the data are static. In other words, to get the latest data you have to query the API again. It isn’t ‘live’ like a feature service.

Also, although leaflet works well for small to medium sized datasets, you wouldn’t want to use this mapping tool for massive amounts of spatial data. With leaflet, all of the geographic information and popup content are saved in the HTML file, and rendering is done on the client. This generally makes it extremely fast because everything is in memory. But if the data overwhelm memory it can slow your browser to a crawl. If you have to map hundreds of thousands of points, for example, you would need a different approach for both downloading and visualizing the data.

The Statsomat Apps with R and Python

The Statsomat project and site (https://statsomat.com) were launched at the beginning of 2021 with the goal of developing and maintaining open-source, web-based apps for automated data analysis. What is special about the reports generated by the Statsomat apps is the annotated output and the human-readable interpretation in natural language. Most of the apps currently available also deliver user-friendly code (R or Python) to reproduce the outputs. 

The Statsomat apps currently available are:

 

Nonlinear Market Forecasting using ‘Stealth Curves’

In 2001, I was fortunate to discover a ‘market characteristic’ that transcends virtually every liquid market: markets often trend in a nonlinear fashion according to hidden support/resistance curves (Stealth Curves). Whereas ‘market anomalies’ are transient and overwhelmingly market-specific, Stealth Curves are highly robust across virtually every market and stand the test of time. Using R, I chart several Stealth Curve examples for the wheat market (symbol = WEAT). Download the data from my personal GitHub site and read it into a ‘weat’ tibble.

library(tidyverse)
library(readxl)

# Original data source - https://www.nasdaq.com/market-activity/funds-and-etfs/weat/historical

# Download reformatted data (columns/headings) from github
# https://github.com/123blee/Stealth_Curves.io/blob/main/WEAT_nasdaq_com_data_reformatted.xlsx

# Insert your file path in the next line of code 

weat <- read_excel("... Place your file path Here ... /WEAT_nasdaq_com_data_reformatted.xlsx")
weat


# Convert 'Date and Time' to 'Date' column
weat[["Date"]] <- as.Date(weat[["Date"]])
weat

bars <- nrow(weat)

# Add bar indicator as first tibble column

weat <- weat %>%
     add_column(t = 1:nrow(weat), .before = "Date")
weat
View tibble



Prior to developing Stealth Curve charts, it is often advantageous to view the raw data in an interactive chart.

# Interactive Pricing Chart

xmin <- 1              
ymin <- 0   
ymax_close <- ceiling(max(weat[["Close"]]))
ymax_low <- ceiling(max(weat[["Low"]]))
ymax_high <- ceiling(max(weat[["High"]]))

# hPlot() below is provided by the 'rCharts' package (not on CRAN; it can be
# installed from GitHub, e.g. devtools::install_github("ramnathv/rCharts"))
library(rCharts)

interactive <- hPlot(x = "t", y = "Close", data = weat, type = "line",
               
               ylim = c(ymin, ymax_close),
               
               xlim = c(xmin, bars),
               
               xaxt="n",   # suppress x-axis labels
               
               yaxt="n",   # suppress y-axis labels,
               
               ann=FALSE)  # x and y axis titles

interactive$set(height = 600)

interactive$set(width = 700)

interactive$plotOptions(line = list(color = "green"))

interactive$chart(zoomType = "x")   # Highlight range of chart to zoom in

interactive


Interactive Chart
WEAT Daily Closing Prices 




Highlight any range to zoom data and view price on any bar number.

Interactive Chart
WEAT Daily Closing Prices (Zoomed)




Prior to plotting, add 400 additional rows to the tibble in preparation for extending the calculated Stealth Curve into future periods.

# Add 400 future days to the plot for projection of the Stealth Curve
 
 future <- 400
 
 weat <- weat %>%
   add_row(t = (bars+1):(bars+future))
Chart the WEAT daily closing prices with 400 days of padding. 

 # Chart Closing WEAT prices
 
 plot.new()
 
 chart_title_close <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily Closing Prices ($)")
 
 background <- c("azure1") 
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(new=TRUE)
 
 t <- weat[["t"]]
 Price <- weat[["Close"]]
 
 plot(x=t, y=Price, main = chart_title_close, type="l", col = "blue",        
      
      ylim = c(ymin, ymax_close) ,
      
      xlim = c(xmin, (bars+future )) ) 
The below chart represents 10 years of closing prices for the WEAT market since inception of the ETF on 9/19/2011. The horizontal axis represents time (t) stated in bar numbers (t = 1 = bar 1 = 9/19/2011). This starting ‘t’ value is entirely arbitrary and does not impact the position of a calculated Stealth Curve on a chart.



The above chart displays a ‘random’ market in decline over much of the decade.

Stealth Curves reveal an entirely different story. They depict extreme nonlinear order with respect to market pivot lows. For Stealth Curves to be robust across virtually every liquid market, the functional form must not only be simple – it must be extremely simple. The following Stealth Curve equation is charted against the closing price data.
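Throughout this post the Stealth Curve takes the simple hyperbolic form Price(t) = a/(t + c) + b. A small helper function (hypothetical name, not part of the original code) makes this explicit and could equally be used inside the mutate() calls below:
 # Hypothetical helper: the Stealth Curve functional form used in this post
 stealth_curve <- function(t, a, b, c) {
   a / (t + c) + b
 }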
 # Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 7231.88  
 b <-  1.18 
 c <- -7.77 
 
 weat <- weat %>%
   mutate(Stealth_Curve_Close = a/(t + c) + b)
 weat


As the Stealth Curve is negative in bars 1 through 7, these values are ignored in the chart by the use of NA padding.

 # Omit negative Stealth Curve values in charting, if any
 
 z1 <- weat[["Stealth_Curve_Close"]]
 
 weat[["Stealth_Curve_Close"]] <- ifelse(z1 < 0, NA, z1)
 weat


Add the Stealth Curve to the chart.

# Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_Close"]])
Closing Prices with Stealth Curve


Once the Stealth Curve is added to the chart, extreme nonlinear market order is clearly evident.  

I personally refer to this process as overlaying a cheat-sheet on a pricing chart to understand why prices bounced where they did and where they may bounce on a go-forward basis. Stealth Curves may be plotted from [-infinity, + infinity].

The human eye is not adept at discerning the extreme accuracy of this Stealth Curve. As such, visual aids are added. This simple curve serves as a strange attractor for WEAT closing prices. The market closely hugs the Stealth Curve just prior to t=500 (oval) for 3 consecutive months. The arrows depict 10 separate market bounces off of, or very near, the curve.



While some of the bounces appear ‘small,’ it is important to note the prices are also relatively small. As an example, the ‘visually small bounce’ at the ‘3rd arrow from the right’ represents a 10%+ market gain. That particular date is of interest, as it is the exact date I personally identified this Stealth trending market for the first time. I typically do not follow the WEAT market. Had I done so, I could have identified this Stealth Curve sooner. A Stealth Curve requires only 3 market pivot points for definition.
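The author's parameterization procedure is described in his book rather than in this post. Purely to illustrate why three pivot points suffice, the three equations p = a/(t + c) + b can be solved exactly for a, b, and c. The helper below is a hypothetical sketch with made-up pivot values, not the author's method:
 # Illustration only: fit p = a/(t + c) + b exactly through three pivot points
 fit_stealth_3pt <- function(t, p) {
   r <- (p[1] - p[2]) / (p[1] - p[3])
   c <- ((t[2] - t[1]) * t[3] - r * (t[3] - t[1]) * t[2]) /
        (r * (t[3] - t[1]) - (t[2] - t[1]))
   a <- (p[1] - p[2]) * (t[1] + c) * (t[2] + c) / (t[2] - t[1])
   b <- p[1] - a / (t[1] + c)
   list(a = a, b = b, c = c)
 }
 
 # Made-up pivot bars and prices (not actual WEAT pivots)
 fit_stealth_3pt(t = c(100, 500, 1500), p = c(12, 7, 5))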

Reference my real-time LinkedIn post of this Stealth Curve at the ‘3rd arrow’ event here. This multi-year Stealth Curve remains valid to the current date.  By definition, a Stealth Curve remains valid until it is penetrated to the downside in a meaningful manner.  Even then, it often later serves as market resistance. On occasion, a Stealth Curve will serve as support followed by resistance followed by support (reference last 2 charts in this post).   

Next, a Stealth Curve is applied to market pivot lows as defined by high prices.  

Chart high prices.

# Market Pivot Lows using High Prices
 
 # Chart High WEAT prices
 
 plot.new()
 
 chart_title_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily High Prices ($)")
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 t <- weat[["t"]]
 Price <- weat[["High"]]
 
 plot(x=t, y=Price, main = chart_title_high, type="l", col = "blue", 
      
      ylim = c(ymin, ymax_high) ,
      
      xlim = c(xmin, (bars+future )) )  


The parameterized Stealth Curve equation is Stealth_Curve_High = 7815.16/(t + 37.35) + 1.01.
Add the Stealth Curve to the tibble.


# Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 7815.16  
 b <-    1.01 
 c <-   37.35 
 
 weat <- weat %>%
   mutate(Stealth_Curve_High = a/(t + c) + b)
 
 # Omit negative Stealth Curve values in charting, if any
 
 z2 <- weat[["Stealth_Curve_High"]]
 
 weat[["Stealth_Curve_High"]] <- ifelse(z2 < 0, NA, z2)
 weat



 # Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_High"]])


When backcasted in time, this Stealth Curve transitions from resistance to support and is truly amazing. When a backcasted Stealth Curve displays additional periods of market support/resistance beyond the data used for curve parameterization, additional confidence is placed in its ability to act as a strange attractor. Based on visual inspection, it is doubtful that any other smooth curve (out of infinitely many) could coincide with as many meaningful pivot highs and lows (21 in total) over this 10-year period as this simple Stealth Curve does. To appreciate the level of accuracy of the backcasted Stealth Curve, a log chart of a zoomed section of the data is presented.

 # Natural Log High Prices 
 
 # Chart High WEAT prices
 
 plot.new()
 
 chart_title_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily High Prices ($)")
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 chart_title_log_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nNatural Log of Daily High Prices ($)")
 
 t <- weat[["t"]]
 Log_Price <- log(weat[["High"]])
 
 plot(x=t, y=Log_Price, main = chart_title_log_high, type="l", col = "blue", 
      
      ylim = c(log(7), log(ymax_high)) ,
      
      xlim = c(xmin, 800) ) # (bars+future )) )  
 
 # Add Log(Stealth Curve) to chart
 
 lines(t, log(weat[["Stealth_Curve_High"]]))
Zoomed Sectional
Log(High Prices) and Log(Stealth Curve)
WEAT



There are 11 successful tests of Stealth Resistance, including the all-time market high of the ETF.


In total, this Stealth Curve exactly or closely identifies 21 market pivots (Stealth Resistance = 11, Stealth Support = 10).

Lastly, a Stealth Curve is presented based on market pivot lows defined by low prices.

# Market Pivot Lows using Low Prices
 
 # Chart Low WEAT prices
 
 plot.new()
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 chart_title_low <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily Low Prices ($)")
 
 t <- weat[["t"]]
 Price <- weat[["Low"]]
 
 
 plot(x=t, y=Price, main = chart_title_low, type="l", col = "blue", 
      
      ylim = c(ymin, ymax_low) ,
      
      xlim = c(xmin, (bars+future )) )  
Chart the low prices.



The parameterized Stealth Curve equation based on 3 pivot lows is Stealth_Curve_Low = 9022.37/(t + 125.72) + 0.48.
Add the calculated Stealth Curve to the tibble.


 # Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 9022.37  
 b <-    0.48 
 c <-  125.72 
 
 weat <- weat %>%
   mutate(Stealth_Curve_Low = a/(t + c) + b)

 # Omit negative Stealth Curve values in charting, if any
 
 z3 <- weat[["Stealth_Curve_Low"]]
 
 weat[["Stealth_Curve_Low"]] <- ifelse(z3 < 0, NA, z3)
 weat


Add the Stealth Curve to the chart. 


# Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_Low"]])
 



Even without identifying arrows, it is clear that this Stealth Curve identifies the greatest number of pivot lows of the 3 charts presented (close, high, low).

As promised, below are 2 related Stealth Curve charts I posted in 2005, in which the curve transitioned from support, to resistance, to support, and then back to resistance (the data no longer exist – only the graphics). These charts used reverse-adjusted corn futures data. The Stealth Curve was defined using pivot low prices. Subsequent market resistance applied to market high prices (Chart 2 of 2). 

Chart 1 of 2




Chart 2 of 2



For those interested in additional Stealth Curve examples applied to various other markets, simply view my LinkedIn post here. Full details of Stealth Curve model parameterization are described in my latest Amazon publication, Stealth Curves: The Elegance of ‘Random’ Markets.

Brian K. Lee, MBA, PRM, CMA, CFA
[email protected] 

 

Download recently published book – Learn Data Science with R

Learn Data Science with R is a book for learning the R language and data science. It is beginner-friendly and easy to follow, and it is available for download on a pay-what-you-want basis: the minimum price is 0 and the suggested contribution is Rs 1300 ($18). Please review the book on Goodreads.

book cover

The book topics are –
  • R Language
  • Data Wrangling with data.table package
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Boosting with lightGBM package
  • Hands-on projects

Function With Special Talent from ‘caret’ package in R — NearZeroVar()

By Xiaotong Ding (Claire), With Greg Page

A practical tool that enables a modeler to identify and remove non-informative variables during the variable selection process of data modeling

In this article, we will introduce a powerful function called nearZeroVar(). This function, which comes from the caret package, is a practical tool that enables a modeler to remove non-informative variables during the variable selection process of data modeling.

For starters, the nearZeroVar() function identifies constants: predictors with only one unique value across samples. In addition, nearZeroVar() diagnoses predictors as having “near-zero variance” when they possess very few unique values relative to the number of samples, and when the ratio of the frequency of the most common value to the frequency of the second most common value is large.
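As a quick toy illustration (made-up data, not the Hawaii Airbnb dataset used below), a constant column is flagged as zero-variance, while a highly imbalanced column is flagged as near-zero-variance:
library(caret)

toy <- data.frame(constant   = rep(1, 100),               # one unique value -> zeroVar
                  imbalanced = c(rep("a", 98), "b", "b"),  # freqRatio 98/2 = 49, 2% unique -> nzv
                  balanced   = rnorm(100))                 # neither flag applies
nearZeroVar(toy, saveMetrics = TRUE)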

Regardless of the modeling process being used, or of the specific purpose for a particular model, the removal of non-informative predictors is a good idea. Leaving such variables in a model only adds extra complexity, without any corresponding payoff in model accuracy or quality.

For this analysis, we will use the dataset hawaii.csv, which contains information about Airbnb rentals in Hawaii. In the code cell below, the dataset is read into R, and blank cells are converted to NA values.

library(dplyr) 
library(caret)
options(scipen=999)  #display decimal values, rather than scientific notation
data = read.csv("/Users/xiaotongding/Desktop/Page BA-WritingProject/hawaii.csv")
dim(data)
## [1] 21523    74
data[data==""] <- NA
nzv_vals <- nearZeroVar(data, saveMetrics = TRUE)
dim(nzv_vals)
## [1] 74  4
  • The code chunk shown above generates a dataframe with 74 rows (one for each variable in the dataset) and four columns. If saveMetrics is set to FALSE, the positions of the zero or near-zero predictors are returned instead.
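For example (sketch only; whether dropping these variables is appropriate depends on your model), the positional output can be used to remove the flagged columns directly:
nzv_idx <- nearZeroVar(data)        # default saveMetrics = FALSE returns column positions
if (length(nzv_idx) > 0) {
  data_reduced <- data[, -nzv_idx]  # drop the zero- and near-zero-variance columns
}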
nzv_sorted <- arrange(nzv_vals, desc(freqRatio))
head(nzv_sorted)
 
                                                freqRatio percentUnique zeroVar   nzv
has_availability                             21522.000000   0.009292385   FALSE  TRUE
calculated_host_listings_count_shared_rooms    521.634146   0.032523347   FALSE  TRUE
host_has_profile_pic                           282.184211   0.009292385   FALSE  TRUE
number_of_reviews_l30d                          26.545337   0.046461924   FALSE  TRUE
calculated_host_listings_count_private_rooms    13.440804   0.097570041   FALSE FALSE
room_type                                        9.244102   0.018584770   FALSE FALSE

The first column, freqRatio, tells us the ratio of frequencies for the most common value over the second most common value for that variable. To see how this is calculated, let’s look at the freqRatio for host_has_profile_pic (282.184):

table(sort(data$host_has_profile_pic, decreasing=TRUE))
## 
##     f     t 
##    76 21446
In the entire dataset, there are 76 ‘f’ values and 21446 ‘t’ values. The frequency ratio of the most common outcome to the second-most common outcome, therefore, is 21446/76, or 282.1842.

The second column, percentUnique, indicates the percentage of unique data points out of the total number of data points. To illustrate how this is determined, let’s examine the ‘license’ variable, which has a percentUnique value of 45.384007806. The length of the output from the unique() function, generated below, indicates that license contains 9768 distinct values throughout the entire dataset (most likely, some are repeated because a single individual may own multiple Airbnb properties).
length(unique(data$license))
## [1] 9768
By dividing the number of unique values by the number of observations, and then multiplying by 100, we arrive back at the percentUnique value shown above:
length(unique(data$license)) / nrow(data) * 100
## [1] 45.38401
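As a further check (sketch), the freqRatio and percentUnique values shown above for has_availability can be reproduced the same way:
freq <- sort(table(data$has_availability), decreasing = TRUE)
freq[1] / freq[2]                                          # freqRatio: 21522 / 1
length(unique(data$has_availability)) / nrow(data) * 100   # percentUnique: ~0.0093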
For predictive modeling with numeric input features, it can be okay to have 100 percent uniqueness, as numeric values exist along a continuous spectrum. Imagine, for example, a medical dataset with the weights of 250 patients, all taken to 5 decimal places of precision – it is quite possible to expect that no two patients’ weights would be identical, yet weight could still carry predictive value in a model focused on patient health outcomes. For non-numeric data, however, 100 percent uniqueness means that the variable will not have any predictive power in a model. If every customer in a bank lending dataset has a unique address, for example, then the ‘customer address’ variable cannot offer us any general insights about default likelihood.

The third column, zeroVar, is a vector of logicals (TRUE or FALSE) that indicate whether the predictor has only one distinct value. Such variables will not yield any predictive power, regardless of their data type.

The fourth column, nzv, is also a vector of logical values, for which TRUE values indicate that the variable is a near-zero variance predictor. For a variable to be flagged as such, it must meet two conditions: (1) its frequency ratio must exceed the freqCut threshold used by the function; AND (2) its percentUnique value must fall below the uniqueCut threshold used by the function. By default, freqCut is set to 95/5 (or 19, if expressed as an integer value), and uniqueCut is set to 10.

Let’s take a look at the variables with the 10 highest frequency ratios:
head(nzv_sorted, 10)
 
 
                                                freqRatio percentUnique zeroVar   nzv
has_availability                             21522.000000   0.009292385   FALSE  TRUE
calculated_host_listings_count_shared_rooms    521.634146   0.032523347   FALSE  TRUE
host_has_profile_pic                           282.184211   0.009292385   FALSE  TRUE
number_of_reviews_l30d                          26.545337   0.046461924   FALSE  TRUE
calculated_host_listings_count_private_rooms    13.440804   0.097570041   FALSE FALSE
room_type                                        9.244102   0.018584770   FALSE FALSE
review_scores_checkin                            7.764874   0.041815732   FALSE FALSE
review_scores_location                           7.632574   0.041815732   FALSE FALSE
maximum_nights_avg_ntm                           7.083577   6.095804488   FALSE FALSE
minimum_maximum_nights                           7.018508   0.715513637   FALSE FALSE
Right now, number_of_reviews_l30d (number of reviews in the last 30 days) is considered an ‘nzv’ variable, with its frequency ratio of 26.54 falling above the default freqCut of 19, and its uniqueness percentage of 0.046 falling below the default uniqueCut of 10. If we adjust the function’s settings in a way that nullifies either of those conditions, it will no longer be considered an nzv variable:
nzv_vals2 <- nearZeroVar(data, saveMetrics = TRUE, uniqueCut = 0.04)
nzv_sorted2 <- arrange(nzv_vals2, desc(freqRatio))
head(nzv_sorted2, 10)
 
 
                                                freqRatio percentUnique zeroVar   nzv
has_availability                             21522.000000   0.009292385   FALSE  TRUE
calculated_host_listings_count_shared_rooms    521.634146   0.032523347   FALSE  TRUE
host_has_profile_pic                           282.184211   0.009292385   FALSE  TRUE
number_of_reviews_l30d                          26.545337   0.046461924   FALSE FALSE
calculated_host_listings_count_private_rooms    13.440804   0.097570041   FALSE FALSE
room_type                                        9.244102   0.018584770   FALSE FALSE
review_scores_checkin                            7.764874   0.041815732   FALSE FALSE
review_scores_location                           7.632574   0.041815732   FALSE FALSE
maximum_nights_avg_ntm                           7.083577   6.095804488   FALSE FALSE
minimum_maximum_nights                           7.018508   0.715513637   FALSE FALSE
Note that with the lower cutoff for percentUnique in place, number_of_reviews_l30d no longer qualifies for nzv status. Adjusting the freqCut threshold to any value above 26.55 would have had a similar effect.

So what is the “correct” setting to use? Like nearly everything else in the world of modeling, this question does not lend itself to a “one-size-fits-all” answer. At times, nearZeroVar() may serve as a handy way to quickly whittle down the size of an enormous dataset. Other times, it might even be used in a nearly opposite way – if a modeler is specifically looking to call attention to anomalous values, this function could be used to flag variables that contain them. Either way, we encourage you to explore this function, and to consider making it part of your Exploratory Data Analysis (EDA) routine, especially when you are faced with a large dataset and looking for places to simplify the task in front of you.

Text processing and stemming for classification tasks in master data management context

author: “Genrikh Ananiev”
Problem description
In business settings, we often come across narrowly specialized tasks that require a special approach because they do not fit into the standard flow of data processing and model building. One of these tasks is the classification of new products in the master data management (MDM) process.

Example 1
You are working in a large company (a supplier) engaged in the production and/or sale of products, including through wholesale intermediaries (distributors). Your distributors often have an obligation to your company to regularly report their own sales of your products – the so-called sell-out. Distributors are not always able to report the sold products using your company's product codes; more often they use their own codes and their own product names, which differ from the names in your system. Accordingly, your database needs a table matching the distributors' product codes to the product codes of your own accounting system. The more distributors, the more variations of the name of the same product. If you have a large assortment portfolio, this becomes a problem that is solved by the manual, labor-intensive maintenance of such matching tables whenever new product name variations arrive for your accounting system.

If we treat the names of such products as document texts, and the codes of your accounting system (to which these variations are tied) as classes, then we obtain a multi-class text classification task. Such a matching table (which operators maintain manually) can be considered a training sample, and a classification model built on it could reduce the operators' workload in classifying the stream of new names of existing products. However, the classic approach of working with the text "as is" will not save you here, as will be discussed just below.

Example 2

Your company's database receives data on sales (or prices) of products from external analytical (marketing) agencies or from the parsing of third-party sites. The same product from each data source will also contain variations in spelling. In this example, the task can be even more difficult than in Example 1, because business users often need to analyze not only your own products but also the assortment of your direct competitors, and accordingly the number of classes (reference products) to which variations are tied increases sharply.

What is specific about this class of tasks?

First, there are a lot of classes (essentially, as many classes as you have products). And if in this process you have to work not only with your company's products but also with competitors' products, new classes can appear every day – so it becomes pointless to train a model once and then reuse it repeatedly to predict new products.

Secondly, the number of documents (different variations of the same product) per class is not well balanced: a class may contain a single document, or it may contain more.

Why does the classic approach to multi-class text classification work poorly?

Consider the shortcomings of the classic text processing approach, step by step:

  • Stop words. In such tasks there are no stop words in the generally accepted sense of any text processing package.

  • Tokenization. In classic packages, out of the box, the division of text into words relies on punctuation or spaces. In this class of tasks (where the length of the input text field is often limited), it is not uncommon to receive product names without spaces, where words are separated only visually by changes of case, digits, or a switch to another language. How would out-of-the-box tokenization in your favorite programming language handle the wine name “Dom.CHRISTIANmoreau0,75LPLtr.EtFilChablis”? (Unfortunately, it’s not a joke.)

  • Stemming. Product names are not texts in the classic sense (such as news articles, service reviews, or newspaper headlines), where a suffix can be identified and discarded. Product names often contain abbreviations and contractions of words for which it is unclear how to identify such a suffix. There are also brand names from other language groups (for example, French or Italian brands) that do not lend themselves to normal stemming.

  • Reducing the document-term matrix. When building a document-term matrix, your language’s package often offers to reduce the sparsity of the matrix by removing words (columns of the matrix) whose frequency is below some minimum threshold. In classic tasks this really does help improve quality and reduce the overhead of training the model, but not here. As noted above, the distribution of classes is strongly unbalanced – a class may easily consist of a single product name (for example, a rare and expensive brand that was sold for the first time and so far appears only once in the training sample). The classic approach of sparsity reduction would therefore degrade the quality of the classifier.

  • Training the model. Usually some model (LibSVM, a naive Bayes classifier, neural networks, or something else) is trained on the texts and then reused repeatedly. In this case, however, new classes can appear daily and a class may contain a single document. Therefore, it makes no sense to spend a long time training one large model – any algorithm with online training is enough, for example a KNN classifier with one nearest neighbor.

Next, we will try to compare the classification of the traditional approach with the classification based on the proposed package. We will use tidytext as an auxiliary package.

Case example

                      devtools::install_github(repo = 'https://github.com/edvardoss/abbrevTexts')
                      library(abbrevTexts)
                      library(tidytext) # text proccessing
                      library(dplyr) # data processing
                      library(stringr) # data processing
                      library(SnowballC) # traditional stemming approach
                      library(tm) #need only for tidytext internal purpose

The package includes 2 datasets with wine names: “rawProducts”, the original wine names from external data sources, and “standardProducts”, the unified wine names written according to the company’s master data standards. The rawProducts table has many spelling variations of the same product; these variations are reduced to one product in standardProducts through a many-to-one relationship on the “StandartId” key column. P.S. The variations in the “rawProducts” table are generated programmatically, but, in my experience, with the maximum possible similarity to how product names arrive from various external sources (although in places I may have overdone it).

                      data(rawProducts, package = 'abbrevTexts')
                      head(rawProducts)




                      data(standardProducts, package = 'abbrevTexts')
                      head(standardProducts)


Train and test split

                      set.seed(1234)
                      trainSample <- sample(x = seq(nrow(rawProducts)),size = .9*nrow(rawProducts))
                      testSample <- setdiff(seq(nrow(rawProducts)),trainSample)
                      testSample
Create dataframes for ‘no stemming mode’ and ‘traditional stemming mode’

                      df <- rawProducts %>% mutate(prodId=row_number(), 
                                                   rawName=str_replace_all(rawName,pattern = '\\.','. ')) %>% 
                        unnest_tokens(output = word,input = rawName) %>% count(StandartId,prodId,word)
                      
                      df.noStem <- df %>% bind_tf_idf(term = word,document = prodId,n = n)
                      
                      df.SnowballStem <- df %>% mutate(wordStm=SnowballC::wordStem(word)) %>% 
                        bind_tf_idf(term = wordStm,document = prodId,n = n)
Create document terms matrix

                      dtm.noStem <- df.noStem %>% 
                        cast_dtm(document = prodId,term = word,value = tf_idf) %>% data.matrix()
                      
                      dtm.SnowballStem <- df.SnowballStem %>% 
                        cast_dtm(document = prodId,term = wordStm,value = tf_idf) %>% data.matrix()
Create knn model for ‘no stemming mode’ and calculate accuracy

                      knn.noStem <- class::knn1(train = dtm.noStem[trainSample,],
                                                test = dtm.noStem[testSample,],
                                                cl = rawProducts$StandartId[trainSample])
                      mean(knn.noStem==rawProducts$StandartId[testSample])
Accuracy is: 0.4761905 (47%)

Create knn model for ‘stemming mode’ and calculate accuracy

                      knn.SnowballStem <- class::knn1(train = dtm.SnowballStem[trainSample,],
                                                     test = dtm.SnowballStem[testSample,],
                                                     cl = rawProducts$StandartId[trainSample])
                      mean(knn.SnowballStem==rawProducts$StandartId[testSample])
Accuracy is: 0.5 (50%)

abbrevTexts primer

Below is an example on the same data, but using the functions from the abbrevTexts package.

Separating words by case

                      df <- rawProducts %>% mutate(prodId=row_number(), 
                                                   rawNameSplitted= makeSeparatedWords(rawName)) %>% 
                              unnest_tokens(output = word,input = rawNameSplitted)
                      print(df)


As you can see, the tokenization of the text was carried out correctly: it takes into account not only transitions between upper and lower case within words written together, but also punctuation marks between words written together without spaces.

Creating a stemming dictionary based on a training sample of words

After a long search among different stemming implementations, I came to the conclusion that traditional methods based on language rules are not suitable for such specific tasks, so I had to look for my own approach. I arrived at a solution based on unsupervised learning that is not sensitive to the language of the text or to the degree of abbreviation of the words in the training sample.

The function takes as input a vector of words, the minimum word length for the training sample, and the minimum share of a parent word's length required to consider a child word an abbreviation of it. It then does the following:

1. Discard words shorter than the set length threshold
2. Discard words consisting of numbers
3. Sort the words in descending order of length
4. For each word in the list:
  4.1 Keep only the words that are shorter than the current word but no shorter than the current word's length multiplied by the minimum share
  4.2 From these filtered words, select those that form the beginning of the current word


Let’s say that we fix min.share = 0.7.
At this intermediate stage (4.2), we get a parent-child table in which examples like these can be found:



                      Note that each line meets the condition that the length of the child’s word is not shorter than 70% of the length of the parent’s word.

However, some of the pairs found cannot be considered genuine abbreviations because different parents are reduced to the same child, for example:



In such cases my function keeps only one pair.

Let’s go back to the example with unambiguous word abbreviations:



But if you look a little more closely, you can see that the word ‘bodeg’ is common to these 2 pairs, and it lets us connect them into one chain of abbreviations without violating our initial condition on word length for treating one word as an abbreviation of another:
bodegas->bodeg->bode
So we arrive at a table of the form:


Such chains can be of arbitrary length, and the found pairs can be assembled into chains recursively. This brings us to the 5th stage: determining the final (terminal) child for every member of a constructed chain of word abbreviations (a simplified sketch of the whole procedure follows step 6 below).

                      5. Recursively iterating through the found pairs to determine the final (terminal) child for all members of chains
                      6. Return the abbreviation dictionary
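
For illustration, below is a minimal, single-threaded R sketch of steps 1–6. The function name sketchAbbrDict and its internals are illustrative only; the actual, parallelized implementation is abbrevTexts::makeAbbrStemDict(), used further below.

library(tidyverse)

sketchAbbrDict <- function(words, min.len = 3, min.share = 0.7) {
  w <- unique(words)
  w <- w[nchar(w) >= min.len]                 # 1. drop words shorter than min.len
  w <- w[!grepl("^[0-9]+$", w)]               # 2. drop purely numeric tokens
  w <- w[order(-nchar(w))]                    # 3. sort by decreasing length

  # 4. candidate children: shorter than the parent, at least min.share of its
  #    length, and forming the beginning (prefix) of the parent word
  pairs <- map_dfr(w, function(parent) {
    cand  <- w[nchar(w) < nchar(parent) & nchar(w) >= min.share * nchar(parent)]
    child <- cand[startsWith(parent, cand)]
    if (length(child) == 0) return(NULL)
    tibble(parent = parent, child = child)
  })

  # Keep the longest child per parent; when several parents collapse onto the
  # same child, keep only one pair (a simplification of the ambiguity rule above)
  pairs <- pairs %>%
    group_by(parent) %>% slice_max(nchar(child), n = 1, with_ties = FALSE) %>% ungroup() %>%
    group_by(child)  %>% slice(1) %>% ungroup()

  # 5. follow chains such as bodegas->bodeg->bode down to a terminal child
  terminal <- setNames(pairs$child, pairs$parent)
  repeat {
    deeper  <- terminal[terminal]                   # child of each child, if any
    changed <- !is.na(deeper) & deeper != terminal
    if (!any(changed)) break
    terminal[changed] <- deeper[changed]
  }

  # 6. return the abbreviation dictionary
  tibble(parent = names(terminal), terminal.child = unname(terminal))
}

The chain-collapsing loop simply keeps following each child until it is no longer the parent of any shorter word, which yields the terminal child described in step 5.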

The makeAbbrStemDict function is automatically parallelized across several threads and will load all processor cores, so keep this in mind for large volumes of text.

                      abrDict <- makeAbbrStemDict(term.vec = df$word,min.len = 3,min.share = .6)
                      head(abrDict) # We can see parent word, intermediate results and total result (terminal child)


Returning the stemming dictionary as a table is also convenient because individual stemming rows can be removed selectively and simply within the dplyr paradigm.

Let’s say that we want to exclude the parent word “abruzz” and the terminal child group “absolu” from the stemming dictionary:

                      abrDict.reduced <- abrDict %>% filter(parent!='abruzz',terminal.child!='absolu')
                      print(abrDict.reduced)


Compare the simplicity and clarity of this solution with what is offered on Stack Overflow:

                      Text-mining with the tm-package – word stemming

Stemming using the abbreviation dictionary

                      df.AbbrStem <- df %>% left_join(abrDict %>% select(parent,terminal.child),by = c('word'='parent')) %>% 
                          mutate(wordAbbrStem=coalesce(terminal.child,word)) %>% select(-terminal.child)
                      print(df.AbbrStem)


                      TF-IDF for stemmed words

                      df.AbbrStem <- df.AbbrStem %>% count(StandartId,prodId,wordAbbrStem) %>% 
                        bind_tf_idf(term = wordAbbrStem,document = prodId,n = n)
                      print(df.AbbrStem)


                      Create document terms matrix

                      dtm.AbbrStem <- df.AbbrStem %>% 
                        cast_dtm(document = prodId,term = wordAbbrStem,value = tf_idf) %>% data.matrix()
                      Create knn model for ‘abbrevTexts mode’ and calculate accuracy

                      knn.AbbrStem <- class::knn1(train = dtm.AbbrStem[trainSample,],
                                                      test = dtm.AbbrStem[testSample,],
                                                      cl = rawProducts$StandartId[trainSample])
                      mean(knn.AbbrStem==rawProducts$StandartId[testSample]) 
                      Accuracy for “abbrevTexts”: 0.8333333 (83%)

As you can see, we have obtained a significant improvement in classification quality on the test sample.
Tidytext is a convenient package for a small corpus of texts, but for a large corpus the abbrevTexts package is also perfectly suitable for preprocessing and normalization, and in such specific tasks it usually gives better accuracy than the traditional approach.

                      edvardoss/abbrevTexts: Functions that will make life less sad when working with abbreviated text for multiclassification tasks (github.com)

                      Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ – Part 4 of 4


                      This represents Part 4 of a 4-part series relative to the calculation of Equity Cash Flow (ECF) using R.  If you missed any of the prior posts, be certain to reference them before proceeding. Content in this section builds on previously described information/data.

                      Part 3 of 4 prior post is located here – Part 3 of 4 .

                      ‘ECF – Method 4’ differs slightly from the prior 3 versions.  Specifically, it represents ECF with an adjustment.  By definition, Equity Value (E) is calculated as the present value of a series of Equity Cash Flow (ECF) discounted at the appropriate discount rate, the cost of levered equity capital, Ke. When using forward rate discounting, the equation for E is as follows:
                      The cost of levered equity capital (Ke) is shown below.
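
Written out, and consistent with the valuation() R function later in this post (the Ke expression is the algebraic rearrangement implied by the e[t-1] line in that function):

E_{t-1} = \frac{E_t + ECF_t}{1 + Ke_t}

Ke_t = Ku_t + \frac{D_{t-1}}{E_{t-1}} \, (Ku_t - Kd_t) \, (1 - T_t)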

Where E is the market value of equity, ECF is Equity Cash Flow, Ke is the levered cost of equity capital, Ku is the unlevered cost of equity capital, Kd is the cost of debt capital, D is the market value of debt, T is the corporate tax rate, and t indexes the period (definitions consistent with the R code that follows).
Many DCF practitioners incorrectly assume the cost of equity capital (Ke) is constant in all periods.  The above equation indicates Ke can easily vary over time even if Ku, Kd, and T are all constant values.  Assuming a constant Ke value when such does not apply violates a basic premise of valuation, the value additivity rule: Debt Value (D) + Equity Value (E) = Asset Value (V). Substituting the cost of equity capital (Ke) into the Equity valuation (E) equation yields the following.

Note that in the above valuation equation, equity value is a function of itself. We require Equity Value (E) in the prior period (t-1) in order to obtain the discount rate (Ke) for the current period t, and that current-period discount rate is then used to calculate the prior period's equity value.  This is clearly a circular calculation, as Equity Value (E) in the prior period (t-1) appears on both sides of the equation.  While Excel solutions built on intentional circular references such as this can be problematic, R handles the iterative solution without issue.  Even so, we can bypass calculation circularity altogether and still arrive at the same value the iterative, circular solution produces. Using simple 8th grade algebra, a noncircular equity valuation equation is derived. This new equation discounts at a noncircular rate (Ku) and uses a noncircular numerator: all calculation circularity is eliminated, and the numerator includes a noncircular adjustment to Equity Cash Flow (ECF).
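
In LaTeX form, this matches the e[t-1] update inside the valuation() function further below:

E_{t-1} = \frac{E_t + ECF_t - D_{t-1}\,(Ku_t - Kd_t)\,(1 - T_t)}{1 + Ku_t}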


                      The 2 noncircular discount rates (Ku, Kd) are calculated using the Capital Asset Pricing Model (CAPM).
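
As implemented in the tibble update below (Rf is the risk-free rate, MRP the market risk premium, and Bd and Bu the debt and unlevered betas, respectively):

Kd = Rf + B_d \cdot MRP
Ku = Rf + B_u \cdot MRP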

                      The noncircular debt valuation (D) equation using forward rate (Kd) discounting is provided below.
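
In LaTeX form, matching the d[t-1] update inside the valuation() function below:

D_{t-1} = \frac{D_t + CFd_t}{1 + Kd_t}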



Reference Part 2 of 4 in this series for the calculation of debt cash flow (CFd).

Update ‘data’ tibble
                      data <- data %>%
                        mutate(Rf = rep(0.03, 6),
                               MRP = rep(0.04, 6),
                               Bd = rep(0.2, 6),
                               Bu = rep(1.1, 6),
                               Kd = Rf + Bd * MRP,
                               Ku = Rf + Bu * MRP,
                               N = np + cpltd + LTD,   # All interest bearing debt
                               CFd = ie - (N - lag(N, default=0)),
                               ECF3 = ni - ii*(1-T_) - ( Ebv - lag(Ebv, default=0) ) + ( MS  - lag(MS, default=0)) ) 
                      
                      
                      View tibble
                      rotate(data)


The R code below calculates Debt Value (D) and Equity Value (E) for each period. The function then sums these 2 values to obtain Asset Value (V).

                      R Code – ‘valuation’ R function
                      valuation <- function(a) {
                        
                        library(tidyverse)
                        
  n <- length(a$Bd) - 1   # 'Bd' (capitalized) matches the column created in the 'data' tibble above
                        
                        Rf  <- a$Rf
                        MRP <- a$MRP
                        Ku  <- a$Ku
                        Kd  <- a$Kd
                        T_   <- a$T_
                        
                        # Flow values
                        
                        CFd <- a$CFd
                        ECF <- a$ECF3
                        
                        # Initialize valuation vectors to zero by Year
                        
                        d <- rep(0, n+1 )  # Initialize debt value to zero each Year
                        e <- rep(0, n+1 )  # Initialize equity value to zero each Year
                        
  # Calculate debt and equity value by period in reverse order using discount rates 'Kd' and 'Ku', respectively
                        
                        for (t in (n+1):2)    # reverse step through loop from period 'n+1' to 2
                        {
                          
                          # Debt Valuation discounting 1-period at the forward discount rate, Kd[t]
                          
                          d[t-1] <- ( d[t] + CFd[t] ) / (1 + Kd[t] )
                          
                           # Equity Valuation discounting 1-period at the forward discount rate, Ku[t]
                          
                          e[t-1] <- ( e[t] + ECF[t] - (d[t-1])*(Ku[t]-Kd[t])*(1-T_[t]) ) / (1 + Ku[t] )
                          
                        }
                        
                        # Asset valuation by Year (Using Value Additivity Equation)
                        v = d + e
                        
                        npv_0 <- round(e[1],0) + round(ECF[1],0)
                        npv_0 <- c(npv_0, rep(NaN,n) )
                        
                        valuation <- as_tibble( cbind(a$Year, T_, Rf, MRP, Ku, Kd, Ku-Kd, ECF,
                                                         -lag(d, default=0)*(1-T_)*(Ku-Kd), ECF - lag(d, default=0)*(1-T_)*(Ku-Kd), 
                                                          d, e, v, d/e, c( ECF[1], rep(NaN,n)), npv_0 ) )  
                        
                        names(valuation) <- c("Year", "T", "Rf", "MRP", "Ku", "Kd", "Ku_Kd", "ECF",
                                                 "ECF_adj", "ADJ_ECF", "D", "E", "V", "D_E_Ratio", "ECF_0", "NPV_0")
                        
                        return(rotate(valuation))
                      }
                      View R output
                      valuation <- valuation( data )
                      
                      round(valuation, 5)


                      This method of noncircular equity valuation (E) is simple and straightforward.  Unfortunately, DCF practitioners tend to incorrectly treat Ke as a noncircular calculation using CAPM.  That widely used approach violates the value additivity rule.
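
As a quick sanity check (a sketch only; it assumes the vectors d, e, ECF, Ku, Kd, and T_ computed inside valuation() have been made available in the workspace, e.g., by returning them from the function), the noncircular solution reproduces the circular definition of equity value period by period:

# Sketch: confirm the noncircular e[] also satisfies the circular definition
# E[t-1] = ( E[t] + ECF[t] ) / ( 1 + Ke[t] ), with
# Ke[t] = Ku[t] + ( d[t-1]/e[t-1] ) * ( Ku[t] - Kd[t] ) * ( 1 - T_[t] )
n1 <- length(e)
e_check <- sapply(2:n1, function(t) {
  Ke <- Ku[t] + (d[t-1] / e[t-1]) * (Ku[t] - Kd[t]) * (1 - T_[t])
  (e[t] + ECF[t]) / (1 + Ke)
})
all.equal(e_check, e[-n1])   # TRUE up to floating point

Algebraically, substituting Ke into the circular equation and clearing the fraction recovers exactly the noncircular form used in the loop.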


                      Additionally, there is a widely held belief the Adjusted Present Value (APV) asset valuation approach is the only one that provides a means of calculating asset value in a noncircular fashion.

                      Citation: Fernandez, Pablo, (August 27, 2020), Valuing Companies by Cash Flow Discounting: Only APV Does Not Require Iteration. 

Though the APV method is almost 50 years old, there is little agreement as to how to correctly calculate one of the model’s 2 primary components – the value of interest expense tax shields.  The above 8th grade approach to equity valuation (E) eliminates the need to use the APV model for asset valuation if calculation by noncircular means is the goal. Simply sum the 2 noncircular valuation equations below (D + E).  They ensure the enforcement of the value-additivity rule (V = D + E).

Valuation Additivity Rule
                      (Assuming debt and equity are the 2 sources of financing)
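
Summing the two noncircular recursions given earlier, in the same notation:

V_{t-1} = D_{t-1} + E_{t-1} = \frac{D_t + CFd_t}{1 + Kd_t} + \frac{E_t + ECF_t - D_{t-1}\,(Ku_t - Kd_t)\,(1 - T_t)}{1 + Ku_t}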

                      In summary, circular equity valuation (E) is entirely eliminated using simple 8th grade math.  Adding this noncircular equity valuation (E) solution to noncircular debt valuation (D) results in noncircular asset valuation (V). There is no need to further academically squabble over the correct methodology for valuing tax shields relative to the noncircular APV asset valuation model.  Tax shields are not separately discounted using the above approach.

                      This example is taken from my newly published textbook, ‘Advanced Discounted Cash Flow (DCF) Valuation using R.’  The above method is discussed in far greater detail, including the requisite 8th grade math, along with development of the integrated financials using R. Included in the text are 40+ advanced DCF valuation models – all of which are value-additivity compliant.

Typical corporate finance texts do not teach this very important concept.  As a result, DCF practitioners often unknowingly violate the immensely important value-additivity rule.  This modeling error is closely akin to violating the accounting equation (Book Assets = Book Liabilities + Book Equity) when constructing pro forma balance sheets used in a DCF valuation. 

                      For some reason, violation of the accounting equation is considered a valuation sin, while violation of the value-additivity rule is a well-established practice in DCF valuation land.  

                      Reference my website for additional details.

                      https://www.leewacc.com/

                      Next up, 10 Different, Mathematically Equivalent Ways to Calculate Free Cash Flow (FCF) …

                      Brian K. Lee, MBA, PRM, CMA, CFA  

                      Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ –  Part 3 of 4


This represents Part 3 of a 4-part series relative to the calculation of Equity Cash Flow (ECF) using R.  If you missed Parts 1 and 2, be certain to reference them before proceeding. Content in this section builds on previously described information/data.

Part 1 prior post is located here – Part 1 of 4 .
Part 2 prior post is located here – Part 2 of 4 .

‘ECF – Method 3’ is defined as follows.  In words, Equity Cash Flow (ECF) equals Net Income (NI) less after-tax Interest Income less the change in the quantity ‘Equity Book Value (Ebv) minus Marketable Securities (MS).’ All terms in the equation are defined in the prior posts except for Ebv. 
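
Written out, matching the ECF3 R function that follows:

ECF_t = NI_t - II_t\,(1 - T) - \big[\,(Ebv_t - MS_t) - (Ebv_{t-1} - MS_{t-1})\,\big]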

                      Reference details of the 5-year capital project’s fully integrated financial statements developed in R at the following link.  The R output is formatted in Excel and produced in a PDF file for ease of viewing.  Zoom the PDF for detail. 

Financial Statements

The Equity Book Value (Ebv) vector is added to the data tibble below.
data <- data %>%
  mutate(Ebv = c(250000, 295204.551, 429491.869, 678425.4966, 988024.52, 0))
                      Though Ebv values are entered as known values, they are calculated in the text noted at the conclusion of this post.

                      View tibble.

R function ECF3 defines the Equity Cash Flow (ECF) equation and its components.

‘ECF – Method 3’ R function
ECF3 <- function(a, b) {   # note: the second argument 'b' is not used by this method
  
  library(tidyverse)       # dplyr::lag() with a 'default' argument is used below
  
  ECF3 <- a$ni - a$ii*(1-a$T_) - ( a$Ebv - lag(a$Ebv, default=0) ) + ( a$MS - lag(a$MS, default=0) )  
                        
                        ECF_3 <-     tibble(T              = a$T_,
                                            ii             = a$ii,
                                            ni             = a$ni,
                                            Year           = c(0:(length(ii)-1)),
                                            ii_AT          = -ii*(1-T),
                                            ni_less_ii_AT   = ni + ii_AT,
                                            Ebv            = a$Ebv,
                                            MS             = -a$MS,
                                            Ebv_less_MS    = Ebv + MS,
                                            chg_Ebv_less_MS  = - (Ebv_less_MS - lag(Ebv_less_MS, default=0) ) ,
                                            ECF_3          = ni_less_ii_AT + chg_Ebv_less_MS )
                        
                        ECF_3 <- rotate(ECF_3)
                        
                        return(ECF_3)
                        
                      }
                       Run the R function and view the output.
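
A minimal usage sketch (the object name is illustrative; since ECF3()'s second argument is unused, it may be omitted):

ECF_method_3 <- ECF3(data)
ECF_method_3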



                      R Output formatted in Excel

                      ECF – Method 3



‘ECF – Method 3’ agrees with the prior published methods each year.  Any differences are due to rounding error.

This ECF calculation example is taken from my newly published textbook, ‘Advanced Discounted Cash Flow (DCF) Valuation Using R.’ The above method is discussed in far greater detail along with development of the integrated financials using R, as well as 40+ advanced DCF valuation models – all of which are value-additivity compliant. Typical corporate finance texts do not teach this very important concept. 

Importantly, the text clearly explains why these ECF calculation methods are mathematically equivalent, though the equations may appear vastly different. 

                      Reference my website for further details.

                      https://www.leewacc.com/

                      Next up, ‘ECF – Method 4‘ …

                      Brian K. Lee, MBA, PRM, CMA, CFA