Nonlinear Market Forecasting using ‘Stealth Curves’

In 2001, I was fortunate to discover a ‘market characteristic’ that transcends virtually every liquid market: markets often trend in a nonlinear fashion along hidden support/resistance curves (Stealth Curves).  Whereas ‘market anomalies’ are transient and overwhelmingly market specific, Stealth Curves are highly robust across virtually every market and stand the test of time. Using R, I chart several Stealth Curve examples for the wheat market (symbol = WEAT). Download the data from my personal GitHub site and read it into a ‘weat’ tibble.

library(tidyverse)
library(readxl)

# Original data source - https://www.nasdaq.com/market-activity/funds-and-etfs/weat/historical

# Download reformatted data (columns/headings) from github
# https://github.com/123blee/Stealth_Curves.io/blob/main/WEAT_nasdaq_com_data_reformatted.xlsx

# Insert your file path in the next line of code 

weat <- read_excel("... Place your file path Here ... /WEAT_nasdaq_com_data_reformatted.xlsx")
weat


# Convert 'Date and Time' to 'Date' column
weat[["Date"]] <- as.Date(weat[["Date"]])
weat

bars <- nrow(weat)

# Add bar indicator as first tibble column

weat <- weat %>%
     add_column(t = 1:nrow(weat), .before = "Date")
weat
View tibble



Prior to developing Stealth Curve charts, it is often advantageous to view the raw data in an interactive chart.

# Interactive Pricing Chart

# Note: hPlot() below is provided by the 'rCharts' package (not on CRAN; it is
# commonly installed with devtools::install_github('ramnathv/rCharts'))
library(rCharts)

xmin <- 1              
ymin <- 0   
ymax_close <- ceiling(max(weat[["Close"]]))
ymax_low <- ceiling(max(weat[["Low"]]))
ymax_high <- ceiling(max(weat[["High"]]))

interactive <- hPlot(x = "t", y = "Close", data = weat, type = "line",
               
               ylim = c(ymin, ymax_close),
               
               xlim = c(xmin, bars),
               
               xaxt="n",   # suppress x-axis labels
               
               yaxt="n",   # suppress y-axis labels,
               
               ann=FALSE)  # x and y axis titles

interactive$set(height = 600)

interactive$set(width = 700)

interactive$plotOptions(line = list(color = "green"))

interactive$chart(zoomType = "x")   # Highlight range of chart to zoom in

interactive


Interactive Chart – WEAT Daily Closing Prices




Highlight any range to zoom data and view price on any bar number.

Interactive Chart – WEAT Daily Closing Prices (Zoomed)




Prior to plotting, add 400 additional rows to the tibble in preparation for extending the calculated Stealth Curve into future periods.

# Add 400 future days to the plot for projection of the Stealth Curve
 
 future <- 400
 
 weat <- weat %>%
   add_row(t = (bars+1):(bars+future))
Chart the WEAT daily closing prices with 400 days of padding. 

 # Chart Closing WEAT prices
 
 plot.new()
 
 chart_title_close <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily Closing Prices ($)")
 
 background <- c("azure1") 
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(new=TRUE)
 
 t <- weat[["t"]]
 Price <- weat[["Close"]]
 
 plot(x=t, y=Price, main = chart_title_close, type="l", col = "blue",        
      
      ylim = c(ymin, ymax_close) ,
      
      xlim = c(xmin, (bars+future )) ) 
The chart below represents 10 years of closing prices for the WEAT market since inception of the ETF on 9/19/2011.  The horizontal axis represents time (t) stated in bar numbers (t = 1 = bar 1 = 9/19/2011).  This starting ‘t’ value is entirely arbitrary and does not impact the position of a calculated Stealth Curve on a chart.
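
Because the x-axis is stated in bar numbers rather than dates, a quick lookup on the tibble (a minimal sketch using the columns built above) maps any bar number back to its calendar date:

# Map selected bar numbers back to their calendar dates
# (bars beyond 'bars' are the future padding rows and have NA dates)
weat %>%
  filter(t %in% c(1, 500, bars)) %>%
  select(t, Date, Close)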



The above chart displays a ‘random’ market in decline over much of the decade.

Stealth Curves reveal an entirely different story.  They depict extreme nonlinear order with respect to market pivot lows. For Stealth Curves to be robust across virtually every liquid market, the functional form must not only be simple – it must be extremely simple. The following Stealth Curve equation, with parameters a, b, and c, is charted against the closing price data:

  Stealth_Curve(t) = a / (t + c) + b
 # Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 7231.88  
 b <-  1.18 
 c <- -7.77 
 
 weat <- weat %>%
   mutate(Stealth_Curve_Close = a/(t + c) + b)
 weat


As the Stealth Curve is negative in bars 1 through 7, these values are ignored in the chart by the use of NA padding.

 # Omit negative Stealth Curve values in charting, if any
 
 z1 <- weat[["Stealth_Curve_Close"]]
 
 weat[["Stealth_Curve_Close"]] <- ifelse(z1 < 0, NA, z1)
 weat


Add the Stealth Curve to the chart.

# Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_Close"]])
Closing Prices with Stealth Curve


Once the Stealth Curve is added to the chart, extreme nonlinear market order is clearly evident.  

I personally refer to this process as overlaying a cheat-sheet on a pricing chart to understand why prices bounced where they did and where they may bounce on a go-forward basis. Stealth Curves may be plotted over the entire range of t, from -infinity to +infinity.

The human eye is not adept at discerning the extreme accuracy of this Stealth Curve, so visual aids are added.  This simple curve serves as a strange attractor for WEAT closing prices.  The market closely hugs the Stealth Curve just prior to t = 500 (oval) for 3 consecutive months.  The arrows depict 10 separate market bounces off, or very near, the curve.



While some of the bounces appear ‘small,’ it is important to note that the prices are also relatively small.  As an example, the ‘visually small bounce’ at the third arrow from the right represents a market gain of more than 10%. That particular date is of interest, as it is the exact date I first identified this Stealth trending market.  I typically do not follow the WEAT market; had I done so, I could have identified this Stealth Curve sooner.  A Stealth Curve requires only 3 market pivot points for definition.
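
To make the three-pivot point concrete, here is a minimal sketch that solves P = a/(t + c) + b exactly through three chosen pivot points. It is illustrative only – the pivot values below are hypothetical, and the author's full parameterization procedure is described in his book.

# Solve P = a/(t + c) + b exactly through three pivots (t1, P1), (t2, P2), (t3, P3)
fit_stealth <- function(t1, P1, t2, P2, t3, P3) {
  r  <- (P1 - P2) / (P2 - P3)
  cc <- (r * (t3 - t2) * t1 - (t2 - t1) * t3) / ((t2 - t1) - r * (t3 - t2))
  aa <- (P1 - P2) * (t1 + cc) * (t2 + cc) / (t2 - t1)
  bb <- P1 - aa / (t1 + cc)
  c(a = aa, b = bb, c = cc)
}

# Hypothetical pivots generated from a = 100, b = 2, c = 5; the fit recovers them
fit_stealth(t1 = 5, P1 = 12, t2 = 15, P2 = 7, t3 = 45, P3 = 4)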

Reference my real-time LinkedIn post of this Stealth Curve at the ‘3rd arrow’ event here. This multi-year Stealth Curve remains valid to the current date.  By definition, a Stealth Curve remains valid until it is penetrated to the downside in a meaningful manner.  Even then, it often later serves as market resistance. On occasion, a Stealth Curve will serve as support followed by resistance followed by support (reference last 2 charts in this post).   

Next, a Stealth Curve is applied to market pivot lows as defined by high prices.  

Chart high prices.

# Market Pivot Lows using High Prices
 
 # Chart High WEAT prices
 
 plot.new()
 
 chart_title_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily High Prices ($)")
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 t <- weat[["t"]]
 Price <- weat[["High"]]
 
 plot(x=t, y=Price, main = chart_title_high, type="l", col = "blue", 
      
      ylim = c(ymin, ymax_high) ,
      
      xlim = c(xmin, (bars+future )) )  


The parameterized Stealth Curve equation is as follows:

  Stealth_Curve_High(t) = 7815.16 / (t + 37.35) + 1.01

Add the Stealth Curve to the tibble.


# Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 7815.16  
 b <-    1.01 
 c <-   37.35 
 
 weat <- weat %>%
   mutate(Stealth_Curve_High = a/(t + c) + b)
 
 # Omit negative Stealth Curve values in charting, if any
 
 z2 <- weat[["Stealth_Curve_High"]]
 
 weat[["Stealth_Curve_High"]] <- ifelse(z2 < 0, NA, z2)
 weat



 # Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_High"]])


When backcasted in time, this Stealth Curve transitions from resistance to support, which is truly remarkable. When a Stealth Curve, backcasted relative to the data used for its parameterization, displays additional periods of market support/resistance, additional confidence is placed in its ability to act as a strange attractor.  Based on visual inspection, it is doubtful that any other smooth curve (of the infinitely many possible) could coincide with as many meaningful pivot highs and lows (21 in total) over this 10-year period as this simple Stealth Curve. To appreciate the level of accuracy of the backcasted Stealth Curve, a log chart of a zoomed section of the data is presented.

 # Natural Log High Prices 
 
 # Chart High WEAT prices
 
 plot.new()
 
 chart_title_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily High Prices ($)")
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 chart_title_log_high <- c("Teucrium Wheat ETF (Symbol = WEAT) \nNatural Log of Daily High Prices ($)")
 
 t <- weat[["t"]]
 Log_Price <- log(weat[["High"]])
 
 plot(x=t, y=Log_Price, main = chart_title_log_high, type="l", col = "blue", 
      
      ylim = c(log(7), log(ymax_high)) ,
      
      xlim = c(xmin, 800) ) # (bars+future )) )  
 
 # Add Log(Stealth Curve) to chart
 
 lines(t, log(weat[["Stealth_Curve_High"]]))
Zoomed Sectional – Log(High Prices) and Log(Stealth Curve), WEAT



There are 11 successful tests of Stealth Resistance, including the all-time market high of the ETF.


In total, this Stealth Curve exactly or closely identifies 21 total market pivots (Stealth Resistance = 11, Stealth Support = 10).

Lastly, a Stealth Curve is presented based on market pivot lows defined by low prices.

# Market Pivot Lows using Low Prices
 
 # Chart Low WEAT prices
 
 plot.new()
 
 u <- par("usr") 
 
 rect(u[1], u[3], u[2], u[4], col = background) 
 
 par(ann=TRUE)
 
 par(new=TRUE)
 
 chart_title_low <- c("Teucrium Wheat ETF (Symbol = WEAT) \nDaily Low Prices ($)")
 
 t <- weat[["t"]]
 Price <- weat[["Low"]]
 
 
 plot(x=t, y=Price, main = chart_title_low, type="l", col = "blue", 
      
      ylim = c(ymin, ymax_low) ,
      
      xlim = c(xmin, (bars+future )) )  
The low prices are charted above.



The parameterized Stealth Curve equation based on 3 pivot lows is given below:

  Stealth_Curve_Low(t) = 9022.37 / (t + 125.72) + 0.48

Add the calculated Stealth Curve to the tibble.


 # Add Stealth Curve to tibble
 
 # Stealth Curve parameters
 a <- 9022.37  
 b <-    0.48 
 c <-  125.72 
 
 weat <- weat %>%
   mutate(Stealth_Curve_Low = a/(t + c) + b)

 # Omit negative Stealth Curve values in charting, if any
 
 z3 <- weat[["Stealth_Curve_Low"]]
 
 weat[["Stealth_Curve_Low"]] <- ifelse(z3 < 0, NA, z3)
 weat


Add the Stealth Curve to the chart. 


# Add Stealth Curve to chart
 
 lines(t, weat[["Stealth_Curve_Low"]])
 



Even without identifying arrows, it is clear that this Stealth Curve identifies the greatest number of pivot lows of the 3 charts presented (close, high, low).

As promised, below are 2 related Stealth Curve charts I posted in 2005 that transitioned from support, to resistance, to support, and then back to resistance (the underlying data no longer exist – only the graphics).  These charts used reverse-adjusted Corn futures data.  The Stealth Curve was defined using pivot low prices; subsequent market resistance applied to market high prices (Chart 2 of 2).

Chart 1 of 2




Chart 2 of 2



For those interested in additional Stealth Curve examples applied to various other markets, simply view my LinkedIn post here.  Full details of Stealth Curve model parameterization are described in my latest Amazon publication, Stealth Curves: The Elegance of ‘Random’ Markets.

Brian K. Lee, MBA, PRM, CMA, CFA
[email protected] 

 

Download recently published book – Learn Data Science with R

Learn Data Science with R is a book for learning the R language and data science. It is beginner-friendly and easy to follow, and it is available for download on a pay-what-you-want basis: the minimum price is $0 and the suggested contribution is Rs 1,300 ($18). Please review the book at Goodreads.


The book covers the following topics:
  • R Language
  • Data Wrangling with data.table package
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Boosting with lightGBM package
  • Hands-on projects

Function With Special Talent from ‘caret’ package in R — NearZeroVar()

By Xiaotong Ding (Claire), with Greg Page

A practical tool that enables a modeler to remove non-informative data points during the variable selection process of data modeling

In this article, we will introduce a powerful function called ‘nearZeroVar()’. This function, which comes from the caret package, is a practical tool that enables a modeler to remove non-informative data points during the variable selection process of data modeling.

For starters, the nearZeroVar() function identifies constants – predictors with a single unique value across all samples. In addition, nearZeroVar() diagnoses predictors as having “near-zero variance” when they possess very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large.

Regardless of the modeling process being used, or of the specific purpose for a particular model, the removal of non-informative predictors is a good idea. Leaving such variables in a model only adds extra complexity, without any corresponding payoff in model accuracy or quality.

For this analysis, we will use the dataset hawaii.csv, which contains information about Airbnb rentals from Hawaii. In the code cell below, the dataset is read into R, and blank cells are converted to NA values.

library(dplyr) 
library(caret)
options(scipen=999)  #display decimal values, rather than scientific notation
data = read.csv("/Users/xiaotongding/Desktop/Page BA-WritingProject/hawaii.csv")
dim(data)
## [1] 21523    74
data[data==""] <- NA
nzv_vals <- nearZeroVar(data, saveMetrics = TRUE)
dim(nzv_vals)
## [1] 74  4
The code chunk shown above generates a dataframe with 74 rows (one for each variable in the dataset) and four columns. If saveMetrics is set to FALSE, the positions of the zero or near-zero predictors are returned instead – for example:
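A minimal sketch of that usage, assuming we simply want to drop every flagged column:

# With the default saveMetrics = FALSE, nearZeroVar() returns column positions,
# which can be used to drop the flagged predictors directly
nzv_idx <- nearZeroVar(data)
data_reduced <- data[, -nzv_idx]
dim(data_reduced)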
nzv_sorted <- arrange(nzv_vals, desc(freqRatio))
head(nzv_sorted)
 
##                                                  freqRatio percentUnique zeroVar   nzv
## has_availability                              21522.000000   0.009292385   FALSE  TRUE
## calculated_host_listings_count_shared_rooms     521.634146   0.032523347   FALSE  TRUE
## host_has_profile_pic                             282.184211   0.009292385   FALSE  TRUE
## number_of_reviews_l30d                            26.545337   0.046461924   FALSE  TRUE
## calculated_host_listings_count_private_rooms      13.440804   0.097570041   FALSE FALSE
## room_type                                          9.244102   0.018584770   FALSE FALSE

The first column, freqRatio, tells us the ratio of frequencies for the most common value over the second most common value for that variable. To see how this is calculated, let’s look at the freqRatio for host_has_profile_pic (282.184):

table(sort(data$host_has_profile_pic, decreasing=TRUE))
## 
##     f     t 
##    76 21446
In the entire dataset, there are 76 ‘f’ values, and 21446 ‘t’ values. The frequency ratio of the most common outcome to the second-most common outcome, therefore, is 21446/76, or 282.1842. The second column, percentUnique, indicates the percentage of unique data points out of the total number of data points. To illustrate how this is determined, let’s examine the ‘license’ variable, which shows a value here of 45.384007806. The length of the output from the unique() function, generated below, indicates that license contains 9768 distinct values throughout the entire dataset (most likely, some are repeated because a single individual may own multiple Airbnb properties).
length(unique(data$license))
## [1] 9768
By dividing the number of unique values by the number of observations, and then multiplying by 100, we arrive back at the percentUnique value shown above:
length(unique(data$license)) / nrow(data) * 100
## [1] 45.38401
For predictive modeling with numeric input features, it can be okay to have 100 percent uniqueness, as numeric values exist along a continuous spectrum. Imagine, for example, a medical dataset with the weights of 250 patients, all taken to 5 decimal places of precision – it is quite possible to expect that no two patients’ weights would be identical, yet weight could still carry predictive value in a model focused on patient health outcomes. For non-numeric data, however, 100 percent uniqueness means that the variable will not have any predictive power in a model. If every customer in a bank lending dataset has a unique address, for example, then the ‘customer address’ variable cannot offer us any general insights about default likelihood.

The third column, zeroVar, is a vector of logicals (TRUE or FALSE) that indicate whether the predictor has only one distinct value. Such variables will not yield any predictive power, regardless of their data type.

The fourth column, nzv, is also a vector of logical values, for which TRUE values indicate that the variable is a near-zero variance predictor. For a variable to be flagged as such, it must meet two conditions: (1) its frequency ratio must exceed the freqCut threshold used by the function; AND (2) its percentUnique value must fall below the uniqueCut threshold used by the function. By default, freqCut is set to 95/5 (or 19, if expressed as an integer value), and uniqueCut is set to 10.

Let’s take a look at the variables with the 10 highest frequency ratios:
head(nzv_sorted, 10)
 
 
##                                                  freqRatio percentUnique zeroVar   nzv
## has_availability                              21522.000000   0.009292385   FALSE  TRUE
## calculated_host_listings_count_shared_rooms     521.634146   0.032523347   FALSE  TRUE
## host_has_profile_pic                             282.184211   0.009292385   FALSE  TRUE
## number_of_reviews_l30d                            26.545337   0.046461924   FALSE  TRUE
## calculated_host_listings_count_private_rooms      13.440804   0.097570041   FALSE FALSE
## room_type                                          9.244102   0.018584770   FALSE FALSE
## review_scores_checkin                              7.764874   0.041815732   FALSE FALSE
## review_scores_location                             7.632574   0.041815732   FALSE FALSE
## maximum_nights_avg_ntm                             7.083577   6.095804488   FALSE FALSE
## minimum_maximum_nights                             7.018508   0.715513637   FALSE FALSE
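
To see the two nzv conditions in action, we can reproduce the flag for number_of_reviews_l30d by hand – an illustrative sketch (the values may differ slightly if the variable contains missing values):

# Recompute freqRatio and percentUnique manually and apply the default cutoffs
tab <- sort(table(data$number_of_reviews_l30d), decreasing = TRUE)
freq_ratio <- as.numeric(tab[1] / tab[2])    # most common / second most common value
pct_unique <- length(unique(data$number_of_reviews_l30d)) / nrow(data) * 100
freq_ratio > 19 & pct_unique < 10            # TRUE -> flagged as near-zero variance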
Right now, number_of_reviews_l30d (number of reviews in the last 30 days) is considered an ‘nzv’ variable, with its frequency ratio of 26.54 falling above the default freqCut of 19, and its uniqueness percentage of 0.046 falling below the default uniqueCut of 10. If we adjust the function’s settings in a way that nullifies either of those conditions, it will no longer be considered an nzv variable:
nzv_vals2 <- nearZeroVar(data, saveMetrics = TRUE, uniqueCut = 0.04)
nzv_sorted2 <- arrange(nzv_vals2, desc(freqRatio))
head(nzv_sorted2, 10)
 
 
##                                                  freqRatio percentUnique zeroVar   nzv
## has_availability                              21522.000000   0.009292385   FALSE  TRUE
## calculated_host_listings_count_shared_rooms     521.634146   0.032523347   FALSE  TRUE
## host_has_profile_pic                             282.184211   0.009292385   FALSE  TRUE
## number_of_reviews_l30d                            26.545337   0.046461924   FALSE FALSE
## calculated_host_listings_count_private_rooms      13.440804   0.097570041   FALSE FALSE
## room_type                                          9.244102   0.018584770   FALSE FALSE
## review_scores_checkin                              7.764874   0.041815732   FALSE FALSE
## review_scores_location                             7.632574   0.041815732   FALSE FALSE
## maximum_nights_avg_ntm                             7.083577   6.095804488   FALSE FALSE
## minimum_maximum_nights                             7.018508   0.715513637   FALSE FALSE
Note that with the lower cutoff for percentUnique in place, number_of_reviews_l30d no longer qualifies for nzv status. Raising the freqCut threshold to any value above 26.55 would have had a similar effect.

So what is the “correct” setting to use? Like nearly everything else in the world of modeling, this question does not lend itself to a “one-size-fits-all” answer. At times, nearZeroVar() may serve as a handy way to quickly whittle down the size of an enormous dataset. Other times, it might even be used in a nearly-opposite way – if a modeler is specifically looking to call attention to anomalous values, this function could be used to flag variables that contain them.

Either way, we encourage you to explore this function, and to consider making it part of your Exploratory Data Analysis (EDA) routine, especially when you are faced with a large dataset and looking for places to simplify the task in front of you.

Text processing and stemming for classification tasks in master data management context

Author: Genrikh Ananiev
Problem description
In business settings, narrowly specialized tasks often arise that require a special approach because they do not fit into the standard flow of data processing and model building. One such task is the classification of new products in a master data management (MDM) process.

Example 1
You work for a large company (a supplier) that produces and/or sells products, including through wholesale intermediaries (distributors). Distributors are often obliged to report their own sales of your products back to your company on a regular basis – the so-called sell-out reporting. Distributors are not always able to report sold products using your company's product codes; more often they use their own codes and their own product names, which differ from the names in your system. Your database therefore needs a table matching distributor product names to the product codes in your own system. The more distributors you have, the more name variations exist for the same product. With a large assortment portfolio, manually maintaining such matching tables becomes a labor-intensive problem every time new product name variations arrive.

If we treat the product names as document texts and the codes of your accounting system (to which these variations are tied) as classes, we obtain a multiclass text classification task. The matching table that operators maintain manually can be treated as a training sample, and a classification model built on it would reduce the effort operators spend classifying the stream of new names of existing products. However, the classic approach of working with the text “as is” will not get you there, as explained below.

Example 2

Your company's database receives sales (or price) data on products from external analytical (marketing) agencies or from scraping third-party sites. The same product will again be written differently in each data source. In this example the task can be even harder than in Example 1, because business users often need to analyze not only your own products but also the assortment of your direct competitors, so the number of classes (reference products) to which variations are tied increases sharply.

What is specific about this class of tasks?

First, there are many classes (as many classes as you have products). And if you work not only with your company's products but also with competitors', new classes can appear every day, so it makes little sense to train a model once and reuse it to predict new products.

Second, the number of documents (different variations of the same product) per class is unbalanced: some classes may contain a single document, others many more.

Why does the classic approach to multiclass text classification work poorly?

Consider the shortcomings of the classic text processing approach step by step:

  • Stop words. In such tasks there are no stop words in the generally accepted sense of any text processing package.

  • Tokenization. In off-the-shelf packages, splitting text into words relies on punctuation or spaces. In this class of tasks (where the length of the input text field is often limited), it is common to receive product names without spaces, in which words are separated only visually – by changes of case, by digits, or by a switch to another language. How would out-of-the-box tokenization in your favorite programming language handle the wine name “Dom.CHRISTIANmoreau0,75LPLtr.EtFilChablis”? (Unfortunately, this is not a joke.)

  • Stemming. Product names are not text in the classic sense (such as news articles, service reviews, or newspaper headlines), where a suffix can be isolated and discarded. Product names are full of abbreviations and truncated words for which it is unclear how to identify a suffix, as well as brand names from other language groups (for example, French or Italian brands) that do not yield to normal stemming.

  • Reducing the document-term matrix. When building a document-term matrix, your language's package typically offers to reduce sparsity by removing words (matrix columns) whose frequency falls below some minimum threshold. In classic tasks this really does help improve quality and reduce training overhead – but not here. As noted above, the class distribution is strongly unbalanced: a class may easily contain a single product name (for example, a rare and expensive brand sold for the first time, appearing only once in the training sample). Classic sparsity reduction therefore degrades the quality of the classifier.

  • Training the model. Usually some model (LibSVM, a naive Bayes classifier, neural networks, or something else) is trained on the texts and then reused repeatedly. In this case, new classes can appear daily and a class may contain a single document, so it makes no sense to spend a long time training one large model – any algorithm with online training, for example a kNN classifier with one nearest neighbor, is enough.

Next, we will compare classification using the traditional approach with classification based on the proposed package. We will use tidytext as an auxiliary package.

                      Case example

                      devtools::install_github(repo = 'https://github.com/edvardoss/abbrevTexts')
                      library(abbrevTexts)
                      library(tidytext) # text proccessing
                      library(dplyr) # data processing
                      library(stringr) # data processing
                      library(SnowballC) # traditional stemming approach
                      library(tm) #need only for tidytext internal purpose

                       The package includes two data sets of wine names: ‘rawProducts’, the original wine names from external data sources, and ‘standardProducts’, the unified wine names written according to the company's master data standards. The rawProducts table contains many spelling variations of the same product; these variations are mapped to a single product in standardProducts through a many-to-one relationship on the ‘StandartId’ key column. (The variations in rawProducts are generated programmatically, but, in my experience, with the closest possible resemblance to how product names arrive from various external sources – although in places I may have overdone it.)

                      data(rawProducts, package = 'abbrevTexts')
                      head(rawProducts)




                      data(standardProducts, package = 'abbrevTexts')
                      head(standardProducts)


                      Train and test split

                      set.seed(1234)
                      trainSample <- sample(x = seq(nrow(rawProducts)),size = .9*nrow(rawProducts))
                      testSample <- setdiff(seq(nrow(rawProducts)),trainSample)
                      testSample
                      Create dataframes for ‘no stemming mode’ and ‘traditional stemming mode’

                      df <- rawProducts %>% mutate(prodId=row_number(), 
                                                   rawName=str_replace_all(rawName,pattern = '\\.','. ')) %>% 
                        unnest_tokens(output = word,input = rawName) %>% count(StandartId,prodId,word)
                      
                      df.noStem <- df %>% bind_tf_idf(term = word,document = prodId,n = n)
                      
                      df.SnowballStem <- df %>% mutate(wordStm=SnowballC::wordStem(word)) %>% 
                        bind_tf_idf(term = wordStm,document = prodId,n = n)
                      Create document terms matrix

                      dtm.noStem <- df.noStem %>% 
                        cast_dtm(document = prodId,term = word,value = tf_idf) %>% data.matrix()
                      
                      dtm.SnowballStem <- df.SnowballStem %>% 
                        cast_dtm(document = prodId,term = wordStm,value = tf_idf) %>% data.matrix()
                      Create knn model for ‘no stemming mode’ and calculate accuracy

                      knn.noStem <- class::knn1(train = dtm.noStem[trainSample,],
                                                test = dtm.noStem[testSample,],
                                                cl = rawProducts$StandartId[trainSample])
                      mean(knn.noStem==rawProducts$StandartId[testSample])
                      Accuracy is: 0.4761905 (47%)

                      Create knn model for ‘stemming mode’ and calculate accuracy

                      knn.SnowballStem <- class::knn1(train = dtm.SnowballStem[trainSample,],
                                                     test = dtm.SnowballStem[testSample,],
                                                     cl = rawProducts$StandartId[trainSample])
                      mean(knn.SnowballStem==rawProducts$StandartId[testSample])
                      Accuracy is: 0.5 (50%)

                      abbrevTexts primer

                       Below is an example using the same data but with functions from the abbrevTexts package.

                      Separating words by case

                      df <- rawProducts %>% mutate(prodId=row_number(), 
                                                   rawNameSplitted= makeSeparatedWords(rawName)) %>% 
                              unnest_tokens(output = word,input = rawNameSplitted)
                      print(df)


                       As you can see, the text was tokenized correctly: the function accounts not only for transitions between upper and lower case in words written together, but also for punctuation marks between words written together without spaces.

                      Creating a stemming dictionary based on a training sample of words

                       After a long search among different stemming implementations, I concluded that traditional methods based on language rules are not suitable for such specific tasks, so I had to find my own approach. I arrived at a solution that amounts to unsupervised learning and is insensitive to the language of the text or to how strongly the words in the training sample are abbreviated.

                       The function takes as input a vector of words, the minimum word length for the training sample, and the minimum fraction for considering a child word an abbreviation of a parent word, and then does the following (a sketch of steps 1-4 appears after the list):

                       1. Discard words shorter than the set length threshold
                       2. Discard words consisting of digits
                       3. Sort the words in descending order of length
                       4. For each word in the list:
                         4.1 Keep only the words that are shorter than the current word but at least as long as the current word's length multiplied by the minimum fraction
                         4.2 From those filtered words, select the ones that are the beginning (prefix) of the current word
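
                       A minimal illustrative sketch of steps 1-4 (this is not the package's actual implementation, just the idea):

                       # Find candidate parent-child abbreviation pairs under the min.len / min.share rules
                       find_abbrev_pairs <- function(words, min.len = 3, min.share = 0.7) {
                         w <- unique(words)
                         w <- w[nchar(w) >= min.len & !grepl("^[0-9]+$", w)]    # steps 1-2
                         w <- w[order(-nchar(w))]                               # step 3
                         out <- lapply(w, function(p) {
                           cand  <- w[nchar(w) < nchar(p) & nchar(w) >= min.share * nchar(p)]  # step 4.1
                           child <- cand[startsWith(p, cand)]                                  # step 4.2
                           if (length(child)) data.frame(parent = p, child = child) else NULL
                         })
                         do.call(rbind, out)
                       }
                       
                       # Hypothetical words: returns the pairs bodegas->bodeg, chablis->chabli and bodeg->bode
                       find_abbrev_pairs(c("bodegas", "bodeg", "bode", "chablis", "chabli", "2015"))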


                       Let's say that we fix min.share = 0.7.
                       At this intermediate stage (step 4.2), we get a parent-child table in which examples like the following can be found:



                      Note that each line meets the condition that the length of the child’s word is not shorter than 70% of the length of the parent’s word.

                       However, some of the pairs found cannot be considered word abbreviations, because in them different parents are reduced to the same child, for example:



                       In such cases my function keeps only one pair.

                       Let's go back to the example with unambiguous word abbreviations.



                       But looking a little more closely, we see that the word ‘bodeg’ is common to these 2 pairs, and this word allows us to connect the pairs into one chain of abbreviations without violating our initial condition on the length of a word for it to count as an abbreviation of another word:
                       bodegas -> bodeg -> bode
                       So we arrive at a table of the form:


                       Such chains can be of arbitrary length, and the found pairs can be assembled into such chains recursively. This brings us to the fifth stage: determining the final child for each member of the constructed chain of word abbreviations.

                       5. Recursively iterate through the found pairs to determine the final (terminal) child for all members of the chains
                       6. Return the abbreviation dictionary

                       The makeAbbrStemDict function automatically parallelizes across several threads, loading all the processor cores, so it is worth keeping this in mind for large volumes of text.

                      abrDict <- makeAbbrStemDict(term.vec = df$word,min.len = 3,min.share = .6)
                      head(abrDict) # We can see parent word, intermediate results and total result (terminal child)


                       Returning the stemming dictionary as a table is also convenient because individual stemming rows can be removed selectively, and simply, in the dplyr paradigm.

                       Let's say that we want to exclude the parent word “abruzz” and the terminal child group “absolu” from the stemming dictionary:

                      abrDict.reduced <- abrDict %>% filter(parent!='abruzz',terminal.child!='absolu')
                      print(abrDict.reduced)


                       Compare the simplicity and clarity of this solution with what is offered on Stack Overflow:

                      Text-mining with the tm-package – word stemming

                       Stemming using the abbreviation dictionary

                      df.AbbrStem <- df %>% left_join(abrDict %>% select(parent,terminal.child),by = c('word'='parent')) %>% 
                          mutate(wordAbbrStem=coalesce(terminal.child,word)) %>% select(-terminal.child)
                      print(df.AbbrStem)


                      TF-IDF for stemmed words

                      df.AbbrStem <- df.AbbrStem %>% count(StandartId,prodId,wordAbbrStem) %>% 
                        bind_tf_idf(term = wordAbbrStem,document = prodId,n = n)
                      print(df.AbbrStem)


                      Create document terms matrix

                      dtm.AbbrStem <- df.AbbrStem %>% 
                        cast_dtm(document = prodId,term = wordAbbrStem,value = tf_idf) %>% data.matrix()
                      Create knn model for ‘abbrevTexts mode’ and calculate accuracy

                      knn.AbbrStem <- class::knn1(train = dtm.AbbrStem[trainSample,],
                                                      test = dtm.AbbrStem[testSample,],
                                                      cl = rawProducts$StandartId[trainSample])
                      mean(knn.AbbrStem==rawProducts$StandartId[testSample]) 
                      Accuracy for “abbrevTexts”: 0.8333333 (83%)

                       As you can see, we obtain a significant improvement in classification quality on the test sample.
                       tidytext is a convenient package for a small corpus of texts, but for a large corpus the abbrevTexts package is equally suitable for preprocessing and normalization, and in such specific tasks it usually gives better accuracy than the traditional approach.

                      edvardoss/abbrevTexts: Functions that will make life less sad when working with abbreviated text for multiclassification tasks (github.com)

                      Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ – Part 4 of 4


                      This represents Part 4 of a 4-part series relative to the calculation of Equity Cash Flow (ECF) using R.  If you missed any of the prior posts, be certain to reference them before proceeding. Content in this section builds on previously described information/data.

                      Part 3 of 4 prior post is located here – Part 3 of 4 .

                       ‘ECF – Method 4’ differs slightly from the prior 3 versions.  Specifically, it represents ECF with an adjustment.  By definition, Equity Value (E) is the present value of a series of Equity Cash Flows (ECF) discounted at the appropriate discount rate, the cost of levered equity capital, Ke.  When using forward rate discounting, the equation for E is:

                         E[t-1] = ( E[t] + ECF[t] ) / ( 1 + Ke[t] )

                       The cost of levered equity capital (Ke) is:

                         Ke[t] = Ku[t] + ( D[t-1] / E[t-1] ) * ( Ku[t] - Kd[t] ) * ( 1 - T[t] )

                       where Ku is the unlevered cost of equity capital, Kd is the cost of debt capital, T is the tax rate, D is Debt Value, and E is Equity Value.
                       Many DCF practitioners incorrectly assume the cost of equity capital (Ke) is constant in all periods.  The above equation indicates Ke can easily vary over time even if Ku, Kd, and T are all constant values.  Assuming a constant Ke value when such does not apply violates a basic premise of valuation, the value additivity rule: Debt Value (D) + Equity Value (E) = Asset Value (V).  Substituting the cost of equity capital (Ke) into the equity valuation (E) equation yields:

                         E[t-1] = ( E[t] + ECF[t] ) / ( 1 + Ku[t] + ( D[t-1] / E[t-1] ) * ( Ku[t] - Kd[t] ) * ( 1 - T[t] ) )

                       Note that in the above valuation equation, equity value is a function of itself: we require Equity Value (E) in the prior period (t-1) in order to obtain the discount rate (Ke) for the current period t, and that current-period discount rate is then used to calculate the prior period's equity value.  This is clearly a circular calculation, as E[t-1] exists on both sides of the equation.  While Excel solutions with intentional circular references such as this can be problematic, R has no difficulty iterating to the proper solution.  Even so, we can bypass the circularity altogether and still arrive at the correct solution.  Using simple 8th grade math, a noncircular equity valuation equation is derived:

                         E[t-1] = ( E[t] + ECF[t] - D[t-1] * ( Ku[t] - Kd[t] ) * ( 1 - T[t] ) ) / ( 1 + Ku[t] )

                       Note this new noncircular equation requires a noncircular discount rate (Ku) and a noncircular numerator term to discount.  All calculation circularity is eliminated from the equity valuation equation; the numerator includes a noncircular adjustment to Equity Cash Flow (ECF).
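
                       As a quick numerical sanity check with made-up inputs (illustrative only), the noncircular form reproduces the circular definition of Ke exactly; the adjustment term mirrors the one used in the valuation() function below.

                       # Illustrative check with hypothetical numbers: the noncircular equity equation is
                       # algebraically identical to discounting at the circular Ke
                       Ku <- 0.074; Kd <- 0.038; T_ <- 0.21                 # hypothetical rates and tax rate
                       E_next <- 900000; ECF <- 120000; D_prev <- 250000    # hypothetical values
                       
                       # Noncircular solution for prior-period equity value
                       E_prev <- (E_next + ECF - D_prev * (Ku - Kd) * (1 - T_)) / (1 + Ku)
                       
                       # Plug E_prev back into the circular definition of Ke and confirm consistency
                       Ke <- Ku + (D_prev / E_prev) * (Ku - Kd) * (1 - T_)
                       all.equal(E_prev, (E_next + ECF) / (1 + Ke))         # TRUE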


                       The 2 noncircular discount rates (Ku, Kd) are calculated using the Capital Asset Pricing Model (CAPM):

                         Ku = Rf + Bu * MRP
                         Kd = Rf + Bd * MRP

                       The noncircular debt valuation (D) equation using forward rate (Kd) discounting is:

                         D[t-1] = ( D[t] + CFd[t] ) / ( 1 + Kd[t] )
                       Reference Part 2 of 4 in this series for the calculation of debt cash flow (CFd).  Update the ‘data’ tibble:
                      data <- data %>%
                        mutate(Rf = rep(0.03, 6),
                               MRP = rep(0.04, 6),
                               Bd = rep(0.2, 6),
                               Bu = rep(1.1, 6),
                               Kd = Rf + Bd * MRP,
                               Ku = Rf + Bu * MRP,
                               N = np + cpltd + LTD,   # All interest bearing debt
                               CFd = ie - (N - lag(N, default=0)),
                               ECF3 = ni - ii*(1-T_) - ( Ebv - lag(Ebv, default=0) ) + ( MS  - lag(MS, default=0)) ) 
                      
                      
                      View tibble
                      rotate(data)


                       The R code below calculates Debt Value (D) and Equity Value (E) for each period.  The function then sums these 2 values to obtain Asset Value (V).

                      R Code – ‘valuation’ R function
                      valuation <- function(a) {
                        
                        library(tidyverse)
                        
                        n <- length(a$bd) - 1
                        
                        Rf  <- a$Rf
                        MRP <- a$MRP
                        Ku  <- a$Ku
                        Kd  <- a$Kd
                        T_   <- a$T_
                        
                        # Flow values
                        
                        CFd <- a$CFd
                        ECF <- a$ECF3
                        
                        # Initialize valuation vectors to zero by Year
                        
                        d <- rep(0, n+1 )  # Initialize debt value to zero each Year
                        e <- rep(0, n+1 )  # Initialize equity value to zero each Year
                        
                         # Calculate debt and equity value by period in reverse order using discount rates 'Kd' and 'Ku', respectively
                        
                        for (t in (n+1):2)    # reverse step through loop from period 'n+1' to 2
                        {
                          
                          # Debt Valuation discounting 1-period at the forward discount rate, Kd[t]
                          
                          d[t-1] <- ( d[t] + CFd[t] ) / (1 + Kd[t] )
                          
                           # Equity Valuation discounting 1-period at the forward discount rate, Ku[t]
                          
                          e[t-1] <- ( e[t] + ECF[t] - (d[t-1])*(Ku[t]-Kd[t])*(1-T_[t]) ) / (1 + Ku[t] )
                          
                        }
                        
                        # Asset valuation by Year (Using Value Additivity Equation)
                        v = d + e
                        
                        npv_0 <- round(e[1],0) + round(ECF[1],0)
                        npv_0 <- c(npv_0, rep(NaN,n) )
                        
                        valuation <- as_tibble( cbind(a$Year, T_, Rf, MRP, Ku, Kd, Ku-Kd, ECF,
                                                         -lag(d, default=0)*(1-T_)*(Ku-Kd), ECF - lag(d, default=0)*(1-T_)*(Ku-Kd), 
                                                          d, e, v, d/e, c( ECF[1], rep(NaN,n)), npv_0 ) )  
                        
                        names(valuation) <- c("Year", "T", "Rf", "MRP", "Ku", "Kd", "Ku_Kd", "ECF",
                                                 "ECF_adj", "ADJ_ECF", "D", "E", "V", "D_E_Ratio", "ECF_0", "NPV_0")
                        
                        return(rotate(valuation))
                      }
                      View R output
                      valuation <- valuation( data )
                      
                      round(valuation, 5)


                      This method of noncircular equity valuation (E) is simple and straightforward.  Unfortunately, DCF practitioners tend to incorrectly treat Ke as a noncircular calculation using CAPM.  That widely used approach violates the value additivity rule.


                       Additionally, there is a widely held belief that the Adjusted Present Value (APV) asset valuation approach is the only one that provides a means of calculating asset value in a noncircular fashion.

                      Citation: Fernandez, Pablo, (August 27, 2020), Valuing Companies by Cash Flow Discounting: Only APV Does Not Require Iteration. 

                       Though the APV method is almost 50 years old, there is little agreement as to how to correctly calculate one of the model's 2 primary components – the value of interest expense tax shields.  The above 8th grade approach to equity valuation (E) eliminates the need to use the APV model for asset valuation if calculation by noncircular means is the goal.  Simply sum the 2 noncircular valuation equations above (D + E).  They ensure the enforcement of the value-additivity rule, V = D + E (assuming debt and equity are the 2 sources of financing).

                       In summary, circular equity valuation (E) is entirely eliminated using simple 8th grade math.  Adding this noncircular equity valuation (E) solution to the noncircular debt valuation (D) results in a noncircular asset valuation (V).  There is no need for further academic squabbling over the correct methodology for valuing tax shields in the noncircular APV asset valuation model; tax shields are not separately discounted using the above approach.

                      This example is taken from my newly published textbook, ‘Advanced Discounted Cash Flow (DCF) Valuation using R.’  The above method is discussed in far greater detail, including the requisite 8th grade math, along with development of the integrated financials using R. Included in the text are 40+ advanced DCF valuation models – all of which are value-additivity compliant.

                       Typical corporate finance texts do not teach this very important concept.  As a result, DCF practitioners often unknowingly violate the immensely important value-additivity rule.  This modeling error is closely akin to violating the accounting equation (Book Assets = Book Liabilities + Book Equity) when constructing pro forma balance sheets used in a DCF valuation.

                      For some reason, violation of the accounting equation is considered a valuation sin, while violation of the value-additivity rule is a well-established practice in DCF valuation land.  

                      Reference my website for additional details.

                      https://www.leewacc.com/

                      Next up, 10 Different, Mathematically Equivalent Ways to Calculate Free Cash Flow (FCF) …

                      Brian K. Lee, MBA, PRM, CMA, CFA  

                      Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ –  Part 3 of 4


                       This represents Part 3 of a 4-part series relative to the calculation of Equity Cash Flow (ECF) using R.  If you missed Parts 1 and 2, be certain to reference them before proceeding, as content in this section builds on previously described information/data.  The Part 1 prior post is located here – Part 1 of 4.
                       The Part 2 prior post is located here – Part 2 of 4.

                       ‘ECF – Method 3’ is defined as follows:

                         ECF3 = NI - II * ( 1 - T ) - Δ( Ebv - MS )

                       In words, Equity Cash Flow (ECF) equals Net Income (NI), less after-tax Interest Income (II), less the change in the quantity ‘Equity Book Value (Ebv) minus Marketable Securities (MS).’  All terms in the equation are defined in the prior posts except for Ebv.

                       Reference details of the 5-year capital project's fully integrated financial statements, developed in R, at the following link.  The R output is formatted in Excel and produced in a PDF file for ease of viewing; zoom the PDF for detail.

                       Financial Statements

                       The Equity Book Value (Ebv) vector is added to the data tibble below.
                      data <- data %>%
                      mutate(Ebv = c(250000, 295204.551, 429491.869, 678425.4966, 988024.52, 0 ))
                      Though Ebv values are entered as known values, they are calculated in the text noted at the conclusion of this post.

                      View tibble.

                       The R function ECF3 below defines the Equity Cash Flow (ECF) equation and its components.

                       ‘ECF – Method 3’ R function
                      ECF3 <- function(a, b) {
                        
                        library(tibble)
                        
                        ECF3 <- a$ni -a$ii*(1-a$T_) - ( a$Ebv -lag(a$Ebv, default=0) ) + ( a$MS  - lag(a$MS, default=0) )  
                        
                        ECF_3 <-     tibble(T              = a$T_,
                                            ii             = a$ii,
                                            ni             = a$ni,
                                            Year           = c(0:(length(ii)-1)),
                                            ii_AT          = -ii*(1-T),
                                            ni_less_ii_AT   = ni + ii_AT,
                                            Ebv            = a$Ebv,
                                            MS             = -a$MS,
                                            Ebv_less_MS    = Ebv + MS,
                                            chg_Ebv_less_MS  = - (Ebv_less_MS - lag(Ebv_less_MS, default=0) ) ,
                                            ECF_3          = ni_less_ii_AT + chg_Ebv_less_MS )
                        
                        ECF_3 <- rotate(ECF_3)
                        
                        return(ECF_3)
                        
                      }
                       Run the R function and view the output.
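
                       A minimal call, assuming the author's rotate() helper used inside ECF3 is available (the unused second argument can simply be omitted):

                       # Run 'ECF – Method 3' against the 'data' tibble
                       ECF_Method_3 <- ECF3(data)
                       ECF_Method_3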



                      R Output formatted in Excel

                      ECF – Method 3



                       ‘ECF – Method 3’ agrees with the prior published methods in each year.  Any differences are due to rounding error.

                       This ECF calculation example is taken from my newly published textbook, ‘Advanced Discounted Cash Flow (DCF) Valuation Using R.’  The above method is discussed there in far greater detail, along with development of the integrated financials using R and 40+ advanced DCF valuation models – all of which are value-additivity compliant.  Typical corporate finance texts do not teach this very important concept.

                       Importantly, the text clearly explains why these ECF calculation methods are mathematically equivalent, even though the equations may appear vastly different.

                      Reference my website for further details.

                      https://www.leewacc.com/

                      Next up, ‘ECF – Method 4‘ …

                      Brian K. Lee, MBA, PRM, CMA, CFA

                      Monitoring systemic risk with R

                       Hi everyone!

                       This is my very first post… many researchers, students, and practitioners from all over the world write to me regularly about my R package SystemicR.  I am glad to contribute to the community, but questions about data management and plotting often come up, so I suspect the package documentation could be improved.  As I receive more and more emails (which I always answer), I have to go a step further.  In order to help the community in the most efficient way, I am launching a blog!  The purpose is to introduce my package through a tutorial.  I hope you'll find what you were looking for!

                       Tutorial: load, estimate and plot

                       First of all, you have to install and load the package SystemicR (available on CRAN, so that's the easy part):

                      # Install and load SystemicR
                      install.packages("SystemicR")
                      library(SystemicR)


                       See?  By the way, I use RStudio with R 3.6.3 – please let me know in the comments if you have problems with more recent versions.
                       Then, we have to deal with data input: (i) the data I used in my research paper entitled “Systemic Risk: a Network Approach”, or (ii) your own data.  Let's begin with my data (included in the package):

                      # Data management
                      data("data_stock_returns")
                      head(data_stock_returns)
                      data("data_state_variables")
                      head(data_state_variables)


                       That's it – nothing too difficult.  But things can be a bit more complicated if you choose to import your own data.  First of all, be careful about the input file: please import a .txt or .csv file; this will avoid 90% of the issues you might face (and write to me about).  Then, you have to be aware of the format of the variables that the functions can take as input.  The first column is named “Date”, but the variable is not a Date object – I used a character format when I created this dataframe.  If you want to import your own data, please use the “dd/mm/yyyy” format.  Furthermore, my advice would be to import the data from a .txt or .csv file, using one of the commands:

                      # Import data
                      df_my_data <- read.table(file = "My_CSV_File", sep = ";")
                      df_my_data <- read.csv(file = "My_CSV_File", sep = ";")

                      And please be careful about sep =
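
                       If your dates are stored as Date objects rather than character strings, a quick way to coerce them to the required “dd/mm/yyyy” character format before calling the package functions is shown below (a hypothetical example – adjust it to your own file):

                       # Convert a Date column to the "dd/mm/yyyy" character format expected by the package
                       df_my_data$Date <- format(as.Date(df_my_data$Date), "%d/%m/%Y")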
                      Once data management is done, the remaining part of the code is straightforward. The package SystemicR is a toolbox that provides R users with useful functions to estimate and plot systemic risk measures.

                      Let’s begin with f_CoVaR_Delta_CoVaR_i_q(). This function computes the CoVaR and the ΔCoVaR of a given financial institution i for a given quantile q. I developed this function following each step of Adrian and Brunnermeier (2016)’s research article :

                      # Compute CoVaR_i_q and Delta_CoVaR_i_q
                      f_CoVaR_Delta_CoVaR_i_q(data_stock_returns)


                       Then, having estimated this static measure, let's move on to the dynamic version using f_CoVaR_Delta_CoVaR_i_q_t().  Again, I developed this function following each step of Adrian and Brunnermeier's (2016) research article:

                      # Compute CoVaR_i_q_t , Delta_CoVaR_i_q_t and Delta_CoVaR_t
                      l_result <- f_CoVaR_Delta_CoVaR_i_q_t(data_stock_returns, data_state_variables)


                       Of course, other systemic risk measures can be estimated.  Following Billio et al. (2012), the function f_correlation_network_measures() estimates degree, closeness centrality, and eigenvector centrality.  This function also estimates SR and volatility as in Hasse (2020):

                      # Compute topological risk measures from correlation-based financial networks
                      l_result <- f_correlation_network_measures(data_stock_returns)

                       Last but not least, we shall now plot the evolution of one of the systemic risk measures using the function f_plot():

                      # Plot Delta_CoVaR_t and SR_t
                      f_plot(l_result$Delta_CoVaR_t)
                      f_plot(l_result$SR)


                       And that's it!  Before the end of the year, I will do my best to propose a new version of this package, including updated data and other systemic risk measures.  Any suggestions are welcome, and you can contact me if needed.  Last, please do not hesitate to share your thoughts or questions in the comments!

                      References

                       Adrian, Tobias, and Markus K. Brunnermeier. “CoVaR”. American Economic Review 106.7 (2016): 1705-1741.

                      Billio, M., Getmansky, M., Lo, A. W., & Pelizzon, L. (2012). Econometric measures of connectedness and systemic risk in the finance and insurance sectors. Journal of Financial Economics, 104(3), 535-559.

                      Hasse, Jean-Baptiste. “Systemic Risk: a Network Approach”. AMSE Working Paper (2020)

                      Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ – Part 2 of 4

                       This represents Part 2 of a 4-part series relative to the calculation of Equity Cash Flow (ECF) using R.  If you missed Part 1, be certain to read that first part before proceeding, as the content builds on previously described information/data.

                       The Part 1 previous post is located here.

                       ‘ECF – Method 2’ is defined as follows:

                         ECF2 = FCFF - after-tax CFd

                       The equation appears innocent enough, though there are many underlying terms that require definition in order to understand the calculation.  In words, ‘ECF – Method 2’ equals Free Cash Flow (FCFF) minus after-tax Debt Cash Flow (CFd).

                      Reference details of the 5-year capital project’s fully integrated financial statements developed in R at the following link.  The R output is formatted in Excel.  Zoom for detail. 

                      https://www.dropbox.com/s/lx3uz2mnei3obbb/financial_statements.pdf?dl=0
                       The first order of business is to define the terms necessary to calculate FCFF.  Per the ECF_2 function below, FCFF is built up as

                         FCFF = gcf - change in OWC - CapX

                       where gross cash flow (gcf) = net income + depreciation + change in net deferred tax liabilities - gain on sale + sale proceeds + after-tax interest expense - after-tax interest income, and operating working capital (OWC) = operating current assets (cash + accounts receivable + inventory + prepaid expenses) - operating current liabilities (accounts payable + wages payable + income taxes payable).

                       Next, pretax Debt Cash Flow (CFd) and its components are defined as follows:

                         CFd = ie - ΔN,   where ie is interest expense and N is all interest-bearing debt (notes payable + current portion of long-term debt + long-term debt); after-tax CFd = ie * ( 1 - T ) - ΔN

                        The following data are added to the ‘data’ tibble from the prior article relative to the financial statements.
                      data <- data %>%
                        mutate(ie    = c(0, 10694, 8158, 527, 627, 717 ),
                               np    = c(31415, 9188, 13875,  16500, 18863, 0),
                               LTD   = c(250000, 184952, 0, 0, 0, 0),
                               cpltd = c(0, 20550, 0, 0, 0, 0),
                               ni    =  c(0, 47584,  141355,  262035, 325894, 511852),
                               bd    =  c(0, 62500,  62500,   62500,   62500,   62500),
                               chg_DTL_net = c(0, 35000,  55000,  35000, -25000, -100000),
                               cash  = c(30500,  61250, 92500, 110000, 125750, 0),
                               ar    = c(0, 61250,  92500,  110000,  125750, 0),
                               inv   = c(30500, 61250, 92500, 110000,  125750, 0),
                               pe    = c(915, 1838, 2775, 3300, 3773, 0),
                               ap    = c(30500, 73500, 111000, 132000, 150900, 0),
                               wp    = c(0, 5513, 8325, 9900, 11318, 0),
                               itp   = c(0, -819.377,  9809,  34923, 60566, 0),
                               CapX  = c(500000,0,0,0,0,0),
                               gain  = c(0,0,0,0,0,162500),
                               sp  = c(0,0,0,0,0,350000))
                      View tibble.



                      All of the above calculations are implemented in the R function ECF_2 below.

                      ‘ECF – Method 2’ R function
                      ECF_2 <- function(a) {
                        
                        ECF2 <- tibble(T_          = a$T_,
                                       ie          = a$ie,
                                       ii          = a$ii,
                                       Year        = c(0:(length(ii) - 1)),
                                       ni          = a$ni,
                                       bd          = a$bd,
                                       chg_DTL_net = a$chg_DTL_net,
                                       gain        = - a$gain,
                                       sp          = a$sp,
                                       ie_AT       = ie*(1 - a$T_),                  # after-tax interest expense
                                       ii_AT       = - ii*(1 - a$T_),                # less after-tax interest income
                                       gcf         = ni + bd + chg_DTL_net + gain + sp + ie_AT + ii_AT,   # gross cash flow
                                       OCA         = a$cash + a$ar + a$inv + a$pe,   # operating current assets
                                       OCL         = a$ap + a$wp + a$itp,            # operating current liabilities
                                       OWC         = OCA - OCL,                      # operating working capital
                                       chg_OWC     = OWC - lag(OWC, default = 0),
                                       CapX        = - a$CapX,
                                       FCFF1       = gcf + CapX - chg_OWC,           # free cash flow (FCFF)
                                       N           = a$LTD + a$cpltd + a$np,         # interest-bearing debt
                                       chg_N       = N - lag(N, default = 0),
                                       CFd_AT      = ie*(1 - T_) - chg_N,            # after-tax debt cash flow
                                       ECF2        = FCFF1 - CFd_AT)
                        
                        ECF2 <- rotate(ECF2)
                        return(ECF2)
                        
                      }

                      Run the R function and view the output.
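
                      Mirroring the call shown in Part 1, the run would look like the following (the object name ECF_method_2 is my own choice, not taken from the original post):

                      ECF_method_2 <- ECF_2(data)
                      ECF_method_2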



                      R Output formatted in Excel
                      Method 2



                      ‘ECF – Method 2’ agrees with the prior results from ‘ECF – Method 1’ in each year.  Any differences are due to rounding error.

                      This ECF calculation example is taken from my newly published textbook, ‘Advanced Discounted Cash Flow (DCF) Valuation using R.’  There it is discussed in far greater detail, along with the development of the integrated financials in R and numerous advanced DCF valuation modeling approaches, some never before published.  Importantly, the text clearly explains why these ECF calculation methods are mathematically equivalent even though their individual components appear vastly different.

                      Reference my website for further details.

                      https://www.leewacc.com/

                      Next up, ‘ECF – Method 3’ …

                      Brian K. Lee, MBA, PRM, CMA, CFA




                       

                      Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ – Part 1 of 4



                      Over the next several days, I will present 4 different methods of correctly calculating Equity Cash Flow (ECF) using R.  The valuation technique of discounted cash flow (DCF) estimates equity value (E) as the present value of forecasted ECF.  The appropriate discount rate for this flow definition is the cost of equity capital (Ke).

                      ‘ECF – Method 1’ is defined as follows: 



                      where



                      Note: ECF is not simply ‘dividends.’  A common misconception is that discounted dividends (DIV) provide equity value.  An example of this is the common ‘dividend growth’ equity valuation model found in many corporate finance texts.  All ‘dividend growth’ models that discount dividends (DIV) at the cost of equity capital (Ke) are incorrect unless 1) forecasted marketable securities (MS) balances are zero, and 2) there is no issuance or repurchase of equity shares.
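
                      Because the equation image is not reproduced here, ‘ECF – Method 1’ can be restated from the ECF_1 function shown further below (reading pic as paid-in capital and ii as interest income is my interpretation of those column names):

                      ECF1 = div - chg_pic + chg_MS - ii*(1 - T_)

                      where chg_pic and chg_MS are the year-over-year changes in paid-in capital and marketable securities, div is dividends, and T_ is the corporate tax rate.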

                      The data assume a 5-year hypothetical capital project.  A single revenue-producing asset is purchased at the end of ‘Year 0‘ and is sold at the end of ‘Year 5.’  The $500,000 asset purchase is financed with 50% debt and 50% equity.   

                      Further, the data used to estimate ECF in this example are taken from fully integrated pro forma financial statements and other relevant assumptions, including the corporate tax rate.  This particular example only requires financial data from the integrated pro forma income statements and balance sheets.  These 2 pro forma financial statements are shown below with the relevant data rows highlighted. 



                      https://www.dropbox.com/s/xwy97flxe99gqr9/financials.pdf?dl=0

                      The above link provides access to a PDF of all financial statement pro forma data and is easily zoomable for viewing purposes.

                      The relevant data used to calculate ECF are initially placed in a tibble.

                      library(tidyverse)
                      
                      
                      # Relevant line items by year: dividends (div), marketable securities (MS),
                      # interest income (ii), paid-in capital (pic), and the corporate tax rate (T_)
                      data <- tibble(Year = c(0:5),
                                    div  = c(0, 2379, 7068, 13102, 16295, 1249876),
                                    MS   = c(0, 0, 7226, 350948, 698648, 0),
                                    ii   = c(0, 0, 0, 253, 12283, 24453),
                                    pic  = c(250000, 250000, 250000, 250000, 250000, 0), 
                                    T_   = c(0.25, 0.40, 0.40, 0.40, 0.40, 0.40))
                      
                      data



                      An R function is created to rotate the data into standard financial presentation format (each line item occupies a single row instead of a column).

                      rotate <- function(r) {
                        
                        p <- t(as.matrix(as_tibble(r)))   # transpose so each line item occupies a row
                        
                        return(p)
                        
                      }

                      View the rotated data

                      rotate(data)



                      An R function reads in the appropriate data, performs the necessary calculations, and outputs the results.  The R output is then placed in a spreadsheet for formatting purposes. 

                      ‘ECF – Method 1’ R function
                      ECF_1 <- function(a) {
                        
                        ECF1 <- tibble(T_             = a$T_,
                                       pic            = a$pic,
                                       chg_pic        = pic - lag(pic, default = 0),   # change in paid-in capital
                                       MS             = a$MS,
                                       ii             = a$ii,
                                       Year           = c(0:(length(T_) - 1)),
                                       div            = a$div,
                                       net_new_equity = -chg_pic,                      # share issues reduce ECF; repurchases increase it
                                       chg_MS         = MS - lag(MS, default = 0),     # change in marketable securities
                                       ii_AT          = -ii*(1 - T_),                  # less after-tax interest income
                                       ECF1           = div + net_new_equity + chg_MS + ii_AT)
                        
                        ECF1 <- rotate(ECF1)
                        
                        return(ECF1)
                        
                      }

                      View R Output

                      ECF_method_1 <- ECF_1(data)
                      ECF_method_1



                      Excel formatting applied to R Output



                      It is quite evident there is far more than just dividends (DIV) involved in the proper calculation of ECF.  Use of a ‘dividend growth’ equity valuation model in this instance would result in significant model error.

                      This ECF calculation example is taken from my newly published textbook, ‘Advanced Discounted Cash Flow (DCF) Valuation using R.’  There it is discussed in far greater detail, along with the development of the integrated financials in R and numerous advanced DCF valuation modeling approaches, some never before published.

                      Reference my website for further details.

                      https://www.leewacc.com/

                      Next up, ‘ECF – Method 2’ …

                      Brian K. Lee, MBA, PRM, CMA, CFA

                      New R textbook for machine learning

                      Mathematics and Programming for Machine Learning with R - Chapter 2: Logic

                      Have a look at the FREE attached PDF of Chapter 2, on Logic and R, from my recently published textbook:

                      Mathematics and Programming for Machine Learning with R: From the Ground Up, by William B. Claster
                      ~430 pages, over 400 exercises.
                      Mathematics and Programming for Machine Learning with R - Chapter 2: Logic
                      We discuss how to code machine learning algorithms in R, starting from scratch. The first 4 chapters cover Logic, Sets, Probability, and Functions. I am sharing Chapter 2, on Logic and R, here and will probably also release Chapters 9 and 10 on the math for neural networks shortly. The text is on sale at Amazon here:
                      https://www.amazon.com/Mathematics-Programming-Machine-Learning-R-dp-0367507854/dp/0367507854/ref=mt_other?_encoding=UTF8&me=&qid=1623663440

                      I will try to add an errata page as well.