Anomaly Detection in R

The World of Anomalies

Imagine you are a credit card company and you know that a particular customer makes a purchase of $25 every week. You assume these purchases are his fixed weekly rations, but one day the same customer makes a purchase of $700. This development will not just startle you but also compel you to talk to the customer and find out the reason before you approve the transaction. The customer's behavior had become so predictable that such a large change was simply not expected. Hence we call this event an anomaly.

Anomalies are hard to detect because they can also be genuine events. Let's say the customer in the example above made the usual purchases while living alone and is starting a family this week; the $700 purchase could then be the first of many of similar magnitude. Or perhaps he is throwing a party and this was a one-time large purchase. In all these cases the customer will be classified as making an 'abnormal' choice, and we as the credit card issuer need to know which of these cases are genuine and which are mistakes that can be corrected by reconfirming with the customer. Detecting such anomalies is especially valuable in the BFSI industry, with the primary use case being credit card transactions, where anomalies can be signs of fraud or theft. Multiple small transactions from the same card, a single transaction a few orders of magnitude larger than the average, or transactions from an unfamiliar location are all patterns that can be caused by fraudsters and must be caught. With that motivation, let's study the ways we can detect anomalies.

Detecting The Pattern To Find Anomalies

Anomalies are essentially the outliers in our data. If something happens regularly, it is not an anomaly but a trend. Things which happen only once or twice and deviate from the usual behavior, whether in isolation or with lags, are anomalies. So it all boils down to the definition of outliers for our data. R provides a lot of packages with different approaches to anomaly detection. We will use the AnomalyDetection package in R to understand the concept of anomalies using one such method. The package needs to be installed from GitHub, which requires the install_github() function from the devtools package. We will also use the Rcpp package, which helps us integrate R with C++ functions. Another GitHub package used in this article is wikipediatrend, which gives us access to Wikipedia page-view statistics and lets us create data for the anomaly detection analysis.

The package is capable of identifying outliers in the presence of seasonality and trend in the data. It uses the Seasonal Hybrid ESD algorithm, which finds outliers both globally and locally in a time series or a vector of data. The package has a number of features, including visualization of results, a choice of anomaly direction (positive or negative) and the ability to specify a window of interest.
#Install the devtools package then github packages
install.packages("devtools")
install.packages("Rcpp")
library(devtools)
install_github("petermeissner/wikipediatrend")
install_github("twitter/AnomalyDetection")

#Loading the libraries
library(Rcpp)
library(wikipediatrend)
library(AnomalyDetection)
The first step is data preparation. We will use the daily page views of the Wikipedia page on FIFA (link: https://en.wikipedia.org/wiki/FIFA) starting from 18th March 2013. The wp_trend() function returns the access statistics for the page and lets us filter the data from within the call. We will use this data to model day-wise page views and understand anomalies in the pattern of those view counts.
#Download wikipedia webpage "fifa" 
fifa_data_wikipedia = wp_trend("fifa", from="2013-03-18", lang = "en")
This gives us a dataset of about 1022 observations and 8 columns. A first look at the data reveals some redundant information.
#First_look
fifa_data_wikipedia
   project   language article access     agent      granularity date       views
197 wikipedia en       Fifa    all-access all-agents daily       2016-01-13 116  
546 wikipedia en       Fifa    all-access all-agents daily       2016-12-27  64  
660 wikipedia en       Fifa    all-access all-agents daily       2017-04-20 100  
395 wikipedia en       Fifa    all-access all-agents daily       2016-07-29  70  
257 wikipedia en       Fifa    all-access all-agents daily       2016-03-13  75  
831 wikipedia en       Fifa    all-access all-agents daily       2017-10-08 194  
229 wikipedia en       Fifa    all-access all-agents daily       2016-02-14  84  
393 wikipedia en       Fifa    all-access all-agents daily       2016-07-27 140  
293 wikipedia en       Fifa    all-access all-agents daily       2016-04-18 105  
420 wikipedia en       Fifa    all-access all-agents daily       2016-08-23 757
We see that project, language, article, access, agent and granularity are the same for all rows and are irrelevant for us. We are only concerned with date and views as the features to work on. Let's plot the views against date.
#Plotting data
library(ggplot2)
ggplot(fifa_data_wikipedia, aes(x=date, y=views, color=views)) + geom_line()
(Plot: daily page views of the FIFA Wikipedia page over time)
We see some huge spikes at different intervals. There are a lot of anomalies in this data. Before we process them further, let’s keep only the relevant columns.
# Keep only date & page views and discard all other variables
columns_to_keep=c("date","views")
fifa_data_wikipedia=fifa_data_wikipedia[,columns_to_keep]
We will now perform anomaly detection using the Seasonal Hybrid ESD test. The technique treats the data as a time series, captures its seasonality and flags points which do not follow the seasonal pattern. The AnomalyDetectionTs() function finds the anomalies in the data. It narrows down the peaks while keeping in mind that, by default, no more than 10% of the data can be anomalies; we can lower this bound through the max_anoms parameter. We can also specify which kind of anomalies should be identified using the direction parameter. Here we ask for positive-direction anomalies only, which means sudden dips in the data are not flagged.
#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction="pos", plot=TRUE)
anomalies$plot
(Plot: page views with the detected positive anomalies highlighted)
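The call above relied on the defaults for the share of anomalies and the significance level. As a hedged sketch (using only arguments documented in the AnomalyDetection package), a stricter run that caps anomalies at 2% of the data and tightens alpha might look like this:
#Sketch: cap anomalies at 2% of points and use a smaller alpha
strict_anomalies = AnomalyDetectionTs(fifa_data_wikipedia, max_anoms = 0.02, alpha = 0.01, direction = "pos", plot = TRUE)
strict_anomalies$plot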

Our data has 5.68% anomalies in the positive direction at the default significance level (alpha = 0.05, i.e. 95% confidence). Since we had a total of 1022 observations, 5.68% amounts to about 58 observations. We can look at the specific dates flagged by the algorithm.
# Look at the anomaly dates
anomalies$anoms
   timestamp anoms
1  2015-07-01   269
2  2015-07-02   233
3  2015-07-04   198
4  2015-07-05   330
5  2015-07-06   582
6  2015-07-07   276
7  2015-07-08   211
8  2015-07-09   250
9  2015-07-10   198
10 2015-07-20   315
11 2015-07-21   209
12 2015-07-25   202
13 2015-07-26   217
14 2015-09-18   278
15 2015-09-25   234
16 2015-09-26   199
17 2015-10-03   196
18 2015-10-07   242
19 2015-10-08   419
20 2015-10-09   240
21 2015-10-11   204
22 2015-10-12   223
23 2015-10-13   237
24 2015-10-18   204
25 2015-10-28   213
26 2015-12-03   225
27 2015-12-21   376
28 2015-12-22   212
29 2016-02-24   240
30 2016-02-26   826
31 2016-02-27   516
32 2016-02-29   199
33 2016-04-04   330
34 2016-05-13   217
35 2016-05-14   186
36 2016-06-10   196
37 2016-06-11   200
38 2016-06-12   258
39 2016-06-13   245
40 2016-06-14   204
41 2016-06-22   232
42 2016-06-27   273
43 2016-06-28   212
44 2016-07-10   221
45 2016-07-11   233
46 2016-08-22   214
47 2016-08-23   757
48 2016-08-24   244
49 2016-09-18   250
50 2016-09-19   346
51 2017-01-10   237
52 2017-03-29   392
53 2017-06-03   333
54 2017-06-21   365
55 2017-10-08   194
56 2017-10-09   208
57 2017-10-11   251
58 2017-10-14   373
We have the exact dates and the anomalous values for each date. In a typical anomaly detection process, each of these dates is examined case by case and the reason for the anomaly is identified. For instance, page views can be higher on these dates if there were FIFA matches or page updates on those particular days; another reason could be big news about FIFA players. However, if the page views on a date do not correspond to any special event, then that day is a true anomaly and should be flagged. In other situations, such as credit card transactions, such anomalies can indicate fraud and quick action must be taken on identification.
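One quick, optional way to do this case-by-case check is to pull the week of page views around each flagged date and eyeball it. The snippet below is only a sketch (it assumes the date column returned by wp_trend is of class Date, as above):
# Sketch: collect the +/- 3 day window of views around each flagged date
anomaly_dates <- as.Date(anomalies$anoms$timestamp)
context <- lapply(anomaly_dates, function(d) {
  fifa_data_wikipedia[fifa_data_wikipedia$date >= d - 3 &
                      fifa_data_wikipedia$date <= d + 3, ]
})
names(context) <- as.character(anomaly_dates)
context[["2016-08-23"]]  # e.g. the 757-view spike seen in the first look at the data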

The ‘Anomaly Way’

Anomalies are a kind of outlier, so SH-ESD (Seasonal Hybrid ESD) is not the only way to detect them, and 'AnomalyDetection' is not the only package we will look at. Let's try the anomalize package, which is available on CRAN. It is still worth updating the package from GitHub, as the authors keep the most recent versions there and it takes time and testing for changes to reach standard repositories such as CRAN. We will first install the package from CRAN so that the dependencies are also installed, then update it using devtools.
#Installing anomalize
install.packages('anomalize')
#Update from github
library(devtools)
install_github("business-science/anomalize")
#Load the package
library(anomalize)
# We will also use tidyverse package for processing and coindeskr to get bitcoin data
library(tidyverse)
library(coindeskr)
I am also using the tidyverse and coindeskr packages. The coindeskr package is used to download bitcoin price data and tidyverse is used for speedy data processing. We will now download bitcoin data starting from 1st January 2017.
#Get bitcoin data from 1st January 2017
bitcoin_data <- get_historic_price(start = "2017-01-01")
This data contains the price for each date. Let's convert it into a time series.
#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>% rownames_to_column() %>% as.tibble() %>% mutate(date = as.Date(rowname)) %>% select(-one_of('rowname'))
In this conversion we are actually turning the data into a tibble, which is what the package requires; we could alternatively have converted it into a tibbletime object. Since it is a time series now, we should also expect seasonality and trend patterns in the data, and it is important to remove them so that the anomaly detection is not affected by them. We will now decompose the series and plot the result.

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality
bitcoin_data_ts %>% time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()
Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days
(Plot: observed data, season, trend and remainder, with anomalies marked in red)
We get some beautiful plots: the first panel is the observed data, the second the seasonal component, the third the trend, and the final panel the remainder analyzed for anomalies, with red points indicating the anomalies found by the anomalize() function. However, this is not quite the plot we want; we only want an anomalies plot with trend and seasonality accounted for. Let's plot the data again, this time recomposed by adding the time_recompose() function to the pipeline.

#Plot the data again by recomposing data
bitcoin_data_ts %>% time_decompose(Price) %>% anomalize(remainder) %>% time_recompose() %>%  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days
(Plot: recomposed bitcoin prices with anomalies and the expected band)
This is a better plot and clearly shows the anomalies; we all know how bitcoin prices shot up at the end of 2017. The grey band shows the expected range. Let's see what these red points are.
#Extract the anomalies
anomalies=bitcoin_data_ts %>% time_decompose(Price) %>%  anomalize(remainder) %>%  time_recompose() %>%  filter(anomaly == 'Yes')
The anomalies dataset now consists of the data points which were identified as anomalies by the algorithm.
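For a quick sanity check we can peek at the flagged rows. The column names below (date, observed, anomaly) follow the anomalize defaults, so treat this as a sketch rather than verified output:
#Peek at the flagged dates and their observed prices
anomalies %>% select(date, observed, anomaly) %>% head()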

Conclusion: Are You An Anomaly?

We have twitter’s anomaly detection package based on Seasonal Hybrid ESD (SH-ESD) as well as CRAN’s anomaly detection package based on factor analysis, Mahalanobis distance, Horn’s parallel analysis or Principal component analysis. We also have TsOutliers package and anomalize packages in R. There are a lot more packages than one could find in R. They all have the same concept but differ in the underlying algorithm which they use to detect anomaly. Hence, one can get a general idea from all such packages: anomalies are data points which do not follow the general trend or do not lie under the expected behavior of the rest of the data. The next question which is raised is the criteria for a data point to be following expected behavior. The rest of the data points are all anomalies. One can also have varying types of anomalies such as direction based anomalies as described by the anomaly detection package (positive or negative) or anomalies not following events such as matches in fifa data. One can similarly pitch in another logic for anomaly classification and treat them accordingly.

Here is the entire code used in this article

#Install the devtools package then github packages
install.packages("devtools")
install.packages("Rcpp")
library(devtools)
install_github("petermeissner/wikipediatrend")
install_github("twitter/AnomalyDetection")

#Loading the libraries
library(Rcpp)
library(wikipediatrend)
library(AnomalyDetection)

# Download wikipedia webpage "fifa"
fifa_data_wikipedia = wp_trend("fifa", from="2013-03-18", lang = "en")
#First_look
fifa_data_wikipedia

# Plotting data
library(ggplot2)
ggplot(fifa_data_wikipedia, aes(x=date, y=views, color=views)) + geom_line()

# Keep only date & page views and discard all other variables
columns_to_keep=c("date","views")
fifa_data_wikipedia=fifa_data_wikipedia[,columns_to_keep]

#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction="pos", plot=TRUE)
anomalies$plot

# Look at the anomaly dates
anomalies$anoms

#Installing anomalize
install.packages('anomalize')
#Update from github
library(devtools)
install_github("business-science/anomalize")
#Load the package
library(anomalize)
# We will also use tidyverse package for processing and coindeskr to get bitcoin data
library(tidyverse)
library(coindeskr)

#Get bitcoin data from 1st January 2017
bitcoin_data = get_historic_price(start = "2017-01-01")

#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>% rownames_to_column() %>% as.tibble() %>% mutate(date = as.Date(rowname)) %>% select(-one_of('rowname'))

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality
bitcoin_data_ts %>% time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()

#Plot the data again by recomposing data
bitcoin_data_ts %>% time_decompose(Price) %>% anomalize(remainder) %>% time_recompose() %>%  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

#Extract the anomalies
anomalies=bitcoin_data_ts %>% time_decompose(Price) %>%  anomalize(remainder) %>%  time_recompose() %>%  filter(anomaly == 'Yes')

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Potuganti, Saneesh Veetil and Chaitanya Sagar contributed to this article. Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.


Do Clustering by “Dimensional Collapse”

Problem

Imagine that someone in a bank wants to find out whether some of the bank's credit card holders are actually the same person. According to his experience, he sets a rule: people who share either the same address or the same phone number can reasonably be regarded as the same person. Here is an example:
library(tidyverse)
a <- data_frame(id = 1:16,
                addr = c("a", "a", "a", "b", "b", "c", "d", "d", "d", "e", "e", "f", "f", "g", "g", "h"),
                phone = c(130L, 131L, 132L, 133L, 134L, 132L, 135L, 136L, 137L, 136L, 138L, 138L, 139L, 140L, 141L, 139L),
                flag = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L))

head(a)

##   id addr phone flag
##    1    a   130    1
##    2    a   131    1
##    3    a   132    1
##    4    b   133    2
##    5    b   134    2
##    6    c   132    1
In the data frame a, the letters in column addr stand for address information, the numbers in column phone stand for phone numbers, and the integers in column flag are what he wants: the cluster flag that marks "really" different persons.

(Plot: the records in the bank.) In the plot, each point stands for an "identity" with an address on the horizontal axis and a phone number on the vertical axis. The red dotted lines represent the "connections" between identities, i.e. a shared address or phone number. The desired result is the blue rectangles that circle out the different flags, which represent really different persons.

Goal

The "finding the same person" thing is typically a clustring process, and I am very sure there are pretty many ways to do it, Such as Disjoint-set data structure. But, I can not help thinking mayby we can make it in a simple way with R. that's my goal.

"Dimensional Collapse"

When I stared at the plot, I asked myself: why not map the x-axis information of each point to the very first x value of its group, according to the y-axis "connections"? When everything goes well, all the grey points are mapped along the red arrows to the first mark of their group, and only 4 marks remain on the x-axis: a, b, d and g, instead of the original 9. The y-axis information, after contributing all the "connection rules", can then be put away, since the remaining x-axis marks are exactly what I want: the final flags. That is why I like to call it "Dimensional Collapse".
Furthermore, in order to take advantage of R's strengths, I also:
1. Treat both dimensions as integers by factoring them.
2. Use "integer subsetting" to map and collapse.
axis_collapse <- function(df, .x, .y) {
    .x <- enquo(.x)
    .y <- enquo(.y)
    
    # Turn the address and phone number into integers.
    df <- mutate(df,
                 axis_x = c(factor(!!.x)),
                 axis_y = c(factor(!!.y)))
    
    oldRule <- seq_len(max(df$axis_x))
    
    mapRule <- df %>%
      select(axis_x, axis_y) %>%
      group_by(axis_y) %>%
      arrange(axis_x, .by_group = TRUE) %>%
      mutate(collapse = axis_x[1]) %>%
      ungroup() %>%
      select(-axis_y) %>%
      distinct() %>%
      group_by(axis_x) %>%
      arrange(collapse, .by_group = TRUE) %>%
      slice(1) %>%  
      ungroup() %>%
      arrange(axis_x) %>%
      pull(collapse)
    
    # Use integer subsetting to collapse x-axis.
    # In case of indirect "connections", we should do it recursively.
    while (TRUE) {
        newRule <- mapRule[oldRule]
        if(identical(newRule, oldRule)) {
            break
        } else {
            oldRule <- newRule
        }
    }
    
    df <- df %>%
      mutate(flag = newRule[axis_x],
             flag = c(factor(flag))) %>%
      select(-starts_with("axis_"))
    
    df
}
Let's see the result.
a %>%
  rename(flag_t = flag) %>% 
  axis_collapse(addr, phone) %>%
  mutate_at(.vars = vars(addr:flag), factor) %>%
  ggplot(aes(factor(addr), factor(phone), shape = flag_t, color = flag)) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", shape = "Target Flag:", color = "Cluster Flag:")
(Plot: the clustering result compared to the target.) Not bad so far.

Calculation Complexity

Let's make a simple test of the time complexity.
test1 <- data_frame(addr = sample(1:1e4, 1e4), phone = sample(1:1e4, 1e4))
test2 <- data_frame(addr = sample(1:1e5, 1e5), phone = sample(1:1e5, 1e5))

bm <- microbenchmark::microbenchmark(n10k = axis_collapse(test1, addr, phone),
                                     n100k = axis_collapse(test2, addr, phone),
                                     times = 30)

summary(bm)

##   expr        min        lq      mean    median        uq       max neval cld
##   n10k    249.2172  259.918  277.0333  266.9297  279.505  379.4292    30   a
##   n100k  2489.1834 2581.731 2640.9394 2624.5741 2723.390 2839.5180    30   b
It seems that the time consumed grows roughly linearly with the size of the data, other conditions unchanged. That is acceptable.

More Dimensions?

Since this method collapses one dimension by transferring its clustering information to the other dimension, it should be usable recursively on more than 2 dimensions. But I am not 100% sure, so let's do a simple test.
a %>%
  # I deliberately add a column which connect group 2 and 4 only.
  mutate(other = c(LETTERS[1:14], "D", "O")) %>%
  # use axis_collapse recursively
  axis_collapse(other, phone) %>%
  axis_collapse(flag, addr) %>%
  ggplot(aes(x = factor(addr), y = factor(phone), color = factor(flag))) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", color = "Cluster Flag:")
(Plot: dimensional collapse with more than 2 dimensions.)

Decision Modelling in R Workshop in The Netherlands!

The Decision Analysis in R for Technologies in Health (DARTH) workgroup is hosting a two-day workshop on decision analysis in R in Leiden, The Netherlands from June 7-8, 2018. A one-day introduction to R course will also be offered the day before the workshop, on June 6th. Decision models are mathematical simulation models that are increasingly being used in health sciences to simulate the impact of policy decisions on population health. New methodological techniques around decision modeling are being developed that rely heavily on statistical and mathematical techniques. R is becoming increasingly popular in decision analysis as it provides a flexible environment where advanced statistical methods can be combined with decision models of varying complexity. Also, the fact that R is freely available improves model transparency and reproducibility. The workshop will guide participants on building probabilistic decision trees, Markov models and microsimulations, creating publication-quality tabular and graphical output, and will provide a basic introduction to value of information methods and model calibration using R. For more information and to register, please visit: http://www.cvent.com/d/mtqth1

Exploratory Factor Analysis in R

Changing Your Viewpoint for Factors

In real life, data tends to follow patterns, but the reasons are not apparent right from the start of the data analysis. Take the common example of a demographics-based survey: many people will answer questions in a particular 'way'. For example, married men will report higher expenses than single men but lower than married men with children. In this case, the driving factor behind the pattern of answers is economic status, but the answers may also depend on other factors such as level of education, salary and locality. It becomes complicated to attribute answers to multiple factors. One option is to map each variable or answer to exactly one factor, but this approach has drawbacks: it requires 'guessing' the number of factors, the manual assignment is heuristic and biased, and it ignores the influence of the other factors on each variable. The variables and answers in the data are defined in a way we can understand and interpret. What if we could change our viewing lens to an alternate representation in which all variables are automatically mapped to new categories, each with a customized weight based on its influence on that category? That is the idea behind factor analysis.

Creating Meaningful Factors

Factor analysis starts with the assumption of hidden latent variables which cannot be observed directly but are reflected in the answers or variables of the data. Initially it assumes there are as many factors as variables, and it transforms the current set of variables into an equal number of new variables, each a weighted combination of the current ones. We are therefore not adding or removing information in this step, only transforming it. A typical way to make this transformation is with eigenvalues and eigenvectors: we transform the data in the direction of each eigenvector, and the corresponding eigenvalue measures the variance of the new variable, or 'factor'. An eigenvalue greater than 1 means the new factor explains more variance than a single original variable. We then sort the factors in decreasing order of the variance they explain, so the first factor is the most influential, followed by the second, and so on. This also opens the door to variable reduction by dropping the last few factors; typically we keep factors which collectively explain 90-99% of the variance, depending on the dataset and problem. Thus there is no need to guess the number of factors required for the data. However, we do need to read into each factor, and the combination of original features from which it is made, to understand what it represents.
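To make the eigenvalue idea concrete, here is a minimal sketch on an arbitrary numeric data frame; survey_data is a placeholder name, not a dataset used elsewhere in this article:
# Eigenvalues of the correlation matrix: each one is the variance a candidate
# factor explains, measured in units of "one original variable"
ev <- eigen(cor(survey_data, use = "pairwise.complete.obs"))$values
sum(ev > 1)            # factors worth more than a single original variable
cumsum(ev) / sum(ev)   # cumulative proportion of variance explained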

Loading Factors With Factor Loadings

As I mentioned, even after the transformation we retain the weight of each original feature that went into a factor. These weights are known as 'factor loadings' and help us understand what the factors represent. Let's take a made-up example of factor loadings for an airline survey. Say the table looks like this:

(Table: factor loadings of the 10 survey features on the first three factors.)

I took 10 features originally, so the analysis generates 10 factors. Say our first three factors are as shown in the table. Looking at the loadings, the first factor may represent customer experience after on-boarding, the second the airline booking experience and related perks, and the third the competitive advantage of the airline's flights compared to its competition. We also have a 10th feature which loads negatively on the second factor, which seems to make sense: a loyal customer will keep booking flights even if they are no longer economical or the frequent flyer program is no longer as generous. We can now see that the most important themes for customers filling in the survey are customer experience, booking experience and competitive advantage. This understanding, however, has to be developed manually by looking at the loadings. It is the factor loadings and their interpretation that make factor analysis so valuable, along with the ability to scale down to a few factors without losing much information.

Exploration – How Many Factors?

Factor analysis can be driven by different motivations. One objective can be verifying the data against what you already know and think about it. This requires a prior intuition about the number of important factors (beyond which loadings will be low overall) as well as an idea of the loadings of each original variable on those factors; this is the confirmatory flavor of factor analysis, where the process is run to confirm one's understanding of the data. A more common approach is to use factor analysis to explore and understand the data: you run the analysis first and then develop a general idea of what the results tell you. How do we stop at a specific number of factors when we are exploring? We use the scree plot, which maps the factors against their eigenvalues; a cut-off point is chosen where there is a sudden change in the slope of the line.
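As a concrete illustration (anticipating the bfi data used in the next section), the psych package provides fa.parallel(), which draws the scree plot and runs a parallel analysis in one call; a minimal sketch:
library(psych)
bfi_complete <- bfi[complete.cases(bfi), ]
# Scree plot of factor eigenvalues plus parallel analysis; the elbow and the
# parallel-analysis cutoff both suggest how many factors to retain
fa.parallel(bfi_complete, fa = "fa")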

Going Practical – The BFI Dataset in R

Let's start with a practical demonstration of factor analysis. We will use the psych package in R, a package for personality, psychometric and psychological research. It contains the bfi dataset, which comprises 25 personality items plus 3 demographic variables for 2800 respondents. The items are already grouped into 5 categories, so their names start with the letters A (Agreeableness), C (Conscientiousness), E (Extraversion), N (Neuroticism) and O (Openness). Let's perform factor analysis to see if we really recover the same association of variables with each factor.
#Installing the Psych package and loading it
install.packages("psych")
library(psych)
#Loading the dataset
bfi_data=bfi
We have NA values in our data which need to be handled, but we will not perform much processing here; to keep things simple, we will only keep the data points with no missing values.
#Remove rows with missing values and keep only complete cases
bfi_data=bfi_data[complete.cases(bfi_data),]
This leaves us with 2236 data points, down from 2800, i.e. a reduction of 564 data points. Since 2236 is still a reasonable number of observations, we can proceed. We will use the fa() function for factor analysis, which needs a correlation matrix as input.
#Create the correlation matrix from bfi_data
bfi_cor <- cor(bfi_data)
The fa() function takes the correlation matrix as r and the number of factors as nfactors. The default of 1 factor is not what we want, so we specify 6 factors for this exercise.
#Factor analysis of the data
factors_data <- fa(r = bfi_cor, nfactors = 6)
#Getting the factor loadings and model analysis
factors_data
Factor Analysis using method =  minres
Call: fa(r = bfi_cor, nfactors = 6)
Standardized loadings (pattern matrix) based upon correlation matrix
            MR2   MR3   MR1   MR5   MR4   MR6    h2   u2 com
A1         0.11  0.07 -0.07 -0.56 -0.01  0.35 0.379 0.62 1.8
A2         0.03  0.09 -0.08  0.64  0.01 -0.06 0.467 0.53 1.1
A3        -0.04  0.04 -0.10  0.60  0.07  0.16 0.506 0.49 1.3
A4        -0.07  0.19 -0.07  0.41 -0.13  0.13 0.294 0.71 2.0
A5        -0.17  0.01 -0.16  0.47  0.10  0.22 0.470 0.53 2.1
C1         0.05  0.54  0.08 -0.02  0.19  0.05 0.344 0.66 1.3
C2         0.09  0.66  0.17  0.06  0.08  0.16 0.475 0.53 1.4
C3         0.00  0.56  0.07  0.07 -0.04  0.05 0.317 0.68 1.1
C4         0.07 -0.67  0.10 -0.01  0.02  0.25 0.555 0.45 1.3
C5         0.15 -0.56  0.17  0.02  0.10  0.01 0.433 0.57 1.4
E1        -0.14  0.09  0.61 -0.14 -0.08  0.09 0.414 0.59 1.3
E2         0.06 -0.03  0.68 -0.07 -0.08 -0.01 0.559 0.44 1.1
E3         0.02  0.01 -0.32  0.17  0.38  0.28 0.507 0.49 3.3
E4        -0.07  0.03 -0.49  0.25  0.00  0.31 0.565 0.44 2.3
E5         0.16  0.27 -0.39  0.07  0.24  0.04 0.410 0.59 3.0
N1         0.82 -0.01 -0.09 -0.09 -0.03  0.02 0.666 0.33 1.1
N2         0.83  0.02 -0.07 -0.07  0.01 -0.07 0.654 0.35 1.0
N3         0.69 -0.03  0.13  0.09  0.02  0.06 0.549 0.45 1.1
N4         0.44 -0.14  0.43  0.09  0.10  0.01 0.506 0.49 2.4
N5         0.47 -0.01  0.21  0.21 -0.17  0.09 0.376 0.62 2.2
O1        -0.05  0.07 -0.01 -0.04  0.57  0.09 0.357 0.64 1.1
O2         0.12 -0.09  0.01  0.12 -0.43  0.28 0.295 0.70 2.2
O3         0.01  0.00 -0.10  0.05  0.65  0.04 0.485 0.52 1.1
O4         0.10 -0.05  0.34  0.15  0.37 -0.04 0.241 0.76 2.6
O5         0.04 -0.04 -0.02 -0.01 -0.50  0.30 0.330 0.67 1.7
gender     0.20  0.09 -0.12  0.33 -0.21 -0.15 0.184 0.82 3.6
education -0.03  0.01  0.05  0.11  0.12 -0.22 0.072 0.93 2.2
age       -0.06  0.07 -0.02  0.16  0.03 -0.26 0.098 0.90 2.0

                       MR2  MR3  MR1  MR5  MR4  MR6
SS loadings           2.55 2.13 2.14 2.03 1.79 0.87
Proportion Var        0.09 0.08 0.08 0.07 0.06 0.03
Cumulative Var        0.09 0.17 0.24 0.32 0.38 0.41
Proportion Explained  0.22 0.18 0.19 0.18 0.16 0.08
Cumulative Proportion 0.22 0.41 0.59 0.77 0.92 1.00

 With factor correlations of 
      MR2   MR3   MR1   MR5   MR4   MR6
MR2  1.00 -0.18  0.24 -0.05 -0.01  0.10
MR3 -0.18  1.00 -0.23  0.16  0.19  0.04
MR1  0.24 -0.23  1.00 -0.28 -0.19 -0.15
MR5 -0.05  0.16 -0.28  1.00  0.18  0.17
MR4 -0.01  0.19 -0.19  0.18  1.00  0.05
MR6  0.10  0.04 -0.15  0.17  0.05  1.00

Mean item complexity =  1.8
Test of the hypothesis that 6 factors are sufficient.

The degrees of freedom for the null model are  378  and the objective function was  7.79
The degrees of freedom for the model are 225  and the objective function was  0.57 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR2  MR3  MR1  MR5  MR4  MR6
Correlation of (regression) scores with factors   0.93 0.89 0.89 0.88 0.86 0.77
Multiple R square of scores with factors          0.86 0.79 0.79 0.77 0.74 0.59
Minimum correlation of possible factor scores     0.72 0.57 0.58 0.54 0.49 0.18
The factor loadings show that the first factor is dominated by the N items, followed by factors for C, E, A and O. In other words, the Neuroticism items explain the largest share of common variance in this data. We also notice that the first five factors line up well with the five categories the data was designed around.

Conclusion: A Deeper Insight

As is apparent from the bfi survey example, factor analysis is helpful for grouping our current features into factors which represent hidden features that are not measured directly. It has the additional advantage of helping reduce the data to a smaller set of features without losing much information. There are a few things to keep in mind before putting factor analysis into action. The first is the size of the factor loadings. We may have datasets where the loadings for all factors are low, lower than 0.5 or even 0.3. A loading below 0.3 for every feature of a factor suggests you are using too many factors and should re-run the analysis with fewer, while loadings around 0.5 are acceptable but indicate weak explanatory power. In practice you can set a threshold and discard factors whose loadings are below it for all features. Factor analysis on dynamic data can also be helpful in tracking changes in the nature of the data: if the data changes significantly, the number of factors found by exploratory factor analysis will also change, prompting you to look into the data and check what has happened. The final consideration is the interpretability of the factors. If you are unable to understand or explain the factor loadings, you are probably using either a very granular or a very generalized set of factors; in that case you need to find the right number of factors and obtain loadings which are both interpretable and useful for analysis. There are many other situations that may arise with factor analysis, and they are all subject to interpretation.
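As an aside on thresholds, the loadings matrix returned by fa() can be printed with small loadings suppressed, which makes the screening described above easier; a quick sketch using the model fitted earlier:
#Hide loadings below 0.3 so only the meaningful feature-factor links show up
print(factors_data$loadings, cutoff = 0.3)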

Keep learning and here is the entire code used in this article.

#Installing the Psych package and loading it
install.packages("psych")
library(psych)
#Loading the dataset
bfi_data=bfi

#Remove rows with missing values and keep only complete cases
bfi_data=bfi_data[complete.cases(bfi_data),]

#Create the correlation matrix from bfi_data
bfi_cor <- cor(bfi_data)

#Factor analysis of the data
factors_data <- fa(r = bfi_cor, nfactors = 6)
#Getting the factor loadings and model analysis
factors_data

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Sai Ram, Saneesh Veetil and Chaitanya Sagar contributed to this article.

Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

An overview of R with a curated learning path

I recently wrote an 80-page guide to getting a programming job without a degree, curated from my experience helping students do just that at Springboard, a leading data science bootcamp. This excerpt is the part where I give an overview of the R programming language.

Description: R is an open-source programming language most often used by statisticians, data scientists, and academics to explore large data sets and distill insights from them. It offers many libraries that are useful for data processing tasks. Created by Ross Ihaka and Robert Gentleman, it is a refined, open-source take on its precursor language S, which John Chambers and colleagues developed at Bell Laboratories.

It has strong libraries for data visualization, time series analysis and a variety of statistical analysis tasks.

It's free software that runs on a variety of operating systems, from UNIX to Windows to OS X. It is released under the GNU General Public License and is thus free to use and adapt.

Salary: Median salaries for R users tend to vary, with the main split being the difference between data analysts who use R to query existing data pipelines and data scientists who build those data pipelines and train different models programmatically on top of larger data sets. The difference can be stark, with around a $90,000 median salary for data scientists who use R, vs about a $60,000 median salary for data analysts who use R.

Uses: R is often used to analyze datasets, especially in an academic context. Most frameworks that have evolved around R focus on different methods of data processing. The ggplot family of libraries has been widely recognized as some of the top programming modules for data visualization.

Differentiation: R is often compared to Python when it comes to data analysis and data science tasks. Its strong data visualization and manipulation libraries along with its data analysis-focused community help make it a strong contender for any data transformation, analysis, and visualization tasks.

Most Popular Github Projects:

1- Mal

Mal is a Clojure-inspired Lisp interpreter with implementations in many languages, including R. With around 4,500 stars, Mal needs one of the lowest star counts to qualify as the top repository for a programming language, which speaks to the fact that much of the open-source work on R happens outside of GitHub.

2- Prophet

Prophet is a library that is able to do rich time series analysis by adjusting forecasts to account for seasonal and non-linear trends. It was created by Facebook and forms a part of the strong corpus of data analysis frameworks and libraries that exist for the R programming language.

3- ggplot2

Ggplot2 is a data visualization library for R that is based on the Grammar of Graphics. It is a library often used by data analysts and data scientists to display their results in charts, heatmaps, and more.

4- H2o-3

H2o-3 is an open source machine learning platform with an R interface, similar in spirit to scikit-learn for Python. It allows R users to run deep learning and other machine learning techniques on their data, an essential capability in an era where data is not nearly as useful without machine learning.

5- Shiny

Shiny is an easy web application framework for R that allows you to build interactive websites in a few lines of code without any JavaScript. It uses an intuitive UI (user interface) component based on Bootstrap. Shiny can take all of the guesswork out of building something for the web with the R programming language.

Example Sites: There are not many websites built with R, which is used more for data analysis tasks and projects that are internal to one computer. However, you can build things with R Markdown and build different webpages. You might also use a web development framework such as Shiny if you wanted to create simple interactive web applications around your data.

Frameworks: The awesome repository comes up again with a great list of different R packages and frameworks you can use. A few that are worth mentioning are packages such as dplyr that help you assemble data in an intuitive tabular fashion, ggplot2 to help with data visualization and plotly to help with interactive web displays of R analysis. R libraries and frameworks are some of the most robust for doing ad hoc data analysis and displaying the results in a variety of formats.

Learning Path: This article helps frame the resources you need to learn R, and how you should learn it, starting from syntax and going to specific packages. It makes for a great introduction to the field, even if you’re an absolute beginner. If you want to apply R to data science projects and different data analysis tasks, Datacamp will help you learn the skills and mentality you need to do just that — you’ll learn everything from machine learning practices with R to how to do proper data visualization of the results.

Resources: R-bloggers is a large community of R practitioners and writers who aim to share knowledge about R with each other. This list of 60+ resources on R can be used in case you ever get lost trying to learn R.

I hope this is helpful for you! Want more material like this? Check out my guide to how to get a programming job without a degree.

R as learning tool: solving integrals




Integrals are so easy that only math teachers could make them difficult. When I was in high school I really disliked math and, with hindsight, I would say it was just because of the prehistoric teaching tools (when I saw this video I thought: I'm not alone). I strongly believe that interaction CAUSES learning (I am using "causes" on purpose, being quite aware of the difference between correlation and causation), that practice should come before theory, and that imagination is not a skill you, as a teacher, can assume in your students. What follows is a short and simple practical explanation of integrals. The only math-thing I will write here is the following: f(x) = x + 7. From now on everything will be coded in R. So, first of all, what is a function? Instead of the complex math philosophy, let's look at it with a programmer's eye: it is a tool that takes something as input and returns something else as output. For example, if we give the tool above a 2 as input, we get a 9 back. Easy peasy. Let's look at the code:
# here we create the tool (called "f")
# it just takes some inputs and add it to 7
f <- function(x){x+7}

# if we apply it to 2 it returns a 9
f(2)
9

Then the second question comes by itself: what is an integral? It is simply the sum of this tool applied to many inputs in a range. Sounds complicated? Let's make it clearer with code:
# first we create the range of inputs
# basically x values go from 4 to 6 
# with a very very small step (0.01)
# seq stands for sequence(start, end, step)


x <- seq(4, 6, 0.01) 
x
4.00 4.01 4.02 4.03 4.04 4.05 4.06 4.07...

x[1]
4

x[2]
4.01
As you see, x has many values and each of them is indexed, so any one of them is easy to find: e.g. the first element is 4 (x[1]). Now that we have many x values (201) within the interval from 4 to 6, we can compute the integral.
# since we said that the integral is 
# just a sum, let's call it IntSum and 
# set it to the start value of 0
# in this way it will work as an accumulator
IntSum = 0
Differently from the theory, in which the calculation of the integral produces a new nonsense formula (just kidding, but this seems to be what math teachers are expected to explain), here the integral simply produces an output, i.e. a number. We find this number by summing the output of the tool for each input value (e.g. 4+7, 4.01+7, 4.02+7, etc.) multiplied by the step between one value and the next (e.g. 4.01-4, 4.02-4.01, 4.03-4.02, etc.). Let's clarify this; look down here:
# for each value of x 
for(i in 2:201){
    
    # we do a very simple thing:
    # we cumulate with a sum
    # the output value of the function f 
    # multiplied by each steps difference
    
    IntSum = IntSum + f(x[i])*(x[i]-x[i-1])
    
    
    # So for example,  
    # with the first and second x values the numbers will be:
    #0.1101 = 0 + (4.01 + 7)*(4.01 - 4)
    
    # with the second and third:
    #0.2203 = 0.1101 + (4.02 + 7)*(4.02 - 4.01)
    
    # with the third and fourth:
    #0.3306 = 0.2203 + (4.03 + 7)*(4.03 - 4.02)
    
    # and so on... with the sum (integral) growing and growing
    # up until the last value
}

IntSum
24.01
Done! We have the integral, but let's have a look at a visualization, because it can be represented in a way that makes everything crystal clear. Let's add a short line of code to save the single number added to the sum at each step. The reason we call it "bin" instead of, for example, "many_sum" will be clear in a moment.
# we need to store 201 calculation and we
# simply do what we did for IntSum but 201 times
bin = rep(0, 201)
bin
0 0 0 0 0 0 0 0 0 0 0 0 ...
Basically, we created a sort of memory to host each of the calculations, as you see down here (note that we reset IntSum before re-running the loop, so the total stays 24.01):
# reset the accumulator before re-running the loop
IntSum = 0

for (i in 2:201){
    
    # the sum as earlier
    IntSum = IntSum + f(x[i])*(x[i]-x[i-1])
    
    # overwrite each zero with this step's contribution
    bin[i] = f(x[i])*(x[i]-x[i-1])
}

IntSum
24.01

bin
0.0000 0.1101 0.1102 0.1103 0.1104 0.1105 ..

sum(bin)
24.01
Now if you look at the plot below you get the whole story: each bin is a tiny bar with a very small area and is the smallest part of the integral (i.e. the sum of all the bins).
# plotting them all
barplot(bin, names.arg=x)
This tells you a lot about the purpose of integrals and the possibility of calculating areas under curves. To get a feel for this, just change the function f to, let's say, sin(x) or log(x). What happens? And what if you increase or decrease the number of bins? Have fun replicating the code, changing some numbers and functions. Integrals should be clearer in the end. That's all folks! #R #rstats #maRche #Rbloggers
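If you want to experiment with other functions without copy-pasting the loop, here is a small helper that wraps the same accumulation; the name riemann_sum is mine, not standard R:
# Right-endpoint Riemann sum, exactly the accumulation done in the loop above
riemann_sum <- function(f, from, to, step = 0.01) {
  x <- seq(from, to, step)
  sum(f(x[-1]) * diff(x))
}

riemann_sum(function(x) x + 7, 4, 6)   # 24.01, as before
riemann_sum(sin, 0, pi)                # close to 2, the exact value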

Using Shiny Dashboards for Financial Analysis

For some time now, I have been trading traditional assets—mostly U.S. equities. About a year ago, I jumped into the cryptocurrency markets to try my hand there as well. In my time in investor Telegram chats and subreddits, I often saw people arguing over which investments had performed better over time, but the reality was that most such statements were anecdotal, and thus unfalsifiable.

Given the paucity of cryptocurrency data available in an easily accessible format, it was quite difficult to say for certain that a particular investment was a good one relative to some alternative, unless you were very familiar with a handful of APIs. Even then, assuming you knew how to get daily OHLC data for a crypto-asset like Bitcoin, in order to compare it to some other asset—say Amazon stock—you would have to eyeball trends from a website like Yahoo finance or scrape that data separately and build your own visualizations and metrics. In short, historical asset performance comparisons in the crypto space were difficult to conduct for all but the most technically savvy individuals, so I set out to build a tool that would remedy this, and the Financial Asset Comparison Tool was born.

In this post, I aim to describe a few key components of the dashboard, and also call out lessons learned from the process of iterating on the tool along the way. Prior to proceeding, I highly recommend that you read the app’s README and take a look at the UI and code base itself, as this will provide the context necessary to understanding the rest of the commentary below.


I'll start by delving into a few principles that I find to be key when designing analytic dashboards, drawing on the asset comparison dashboard as my exemplar, and will end with some discussion of the relative utility of a few packages integral to the app. Overall, my goal is not to focus only on the tool that I built, but to highlight a few best practices for building dashboards for any analysis.

Build the app around the story, not the other way around.


Before ever writing a single line of code for an analytic app, I find it absolutely imperative to have a clear vision of the story that the tool must tell. I do not mean that you should already have conclusions about your data that you will then force the app into telling, but rather that you must know how you want your user to interact with the app in order to glean useful information.

In the case of my asset comparison tool, I wanted to serve multiple audiences: everyone from a casual trader who just wants to see which investment produced the greatest net profit over a period of time, to a more experienced trader with more nuanced questions about risk-adjusted return on investment given varying discount rates. The trick is thus building the app in such a way that it serves all possible audiences without hindering any one type of user in particular.

The way I designed my app to meet this need was to build the UI such that as you descend the various sections vertically, the metrics displayed scale in complexity. My reasoning for this becomes apparent when you consider the two extremes in terms of users—the most basic vs. the most advanced trader.

The most basic user will care only about the assets of interest, the time period they want to examine, and how their initial investment performed over time. As such, they will start with the sidebar, input their assets and time frame of choice, and then use the top right-most input box to modulate their initial investment amount (although some may choose to stick with the default value here). They will then see the first chart change to reflect their choices, and they will see, both visually, and via the summary table below, which asset performed better.

The experienced trader, on the other hand, will start off exactly as the novice did, by choosing assets of interest, a time frame of reference, and an initial investment amount. They may then choose to modulate the LOESS parameters as they see fit, descending the page, looking over the simple returns section, perhaps stopping to make changes to the corresponding inputs there, and finally ending at the bottom of the page—at the Sharpe Ratio visualizations. Here they will likely spend more time—playing around with the time period over which to measure returns and changing the risk-free rate to align with their own personal macroeconomic assumptions.

The point of these two examples is to illustrate that the app by dint of its structure alone guides the user through the analytic story in a waterfall-like manner—building from simple portfolio performance, to relative performance, to the most complicated metrics for risk-adjusted returns. This keeps the novice trader from being overwhelmed or confused, and also allows the most experienced user to follow the same line of thought that they would anyway when comparing assets, while following a logical progression of complexity, as shown via the screenshot below.




Once you think you have a structure that guides all users through the story you want them to experience, test it by asking yourself if the app flows in such a way that you could pose and answer a logical series of questions as you navigate the app without any gaps in cohesion. In the case of this app, the questions that the UI answers as you descend are as follows:



  • How do these assets compare in terms of absolute profit?
  • How do these assets compare in terms of simple return on investment?
  • How do these assets compare in terms of variance-adjusted and/or risk-adjusted return on investment?


Thus, when you string these questions together, you can make statements of the type: "Asset X seemed to outperform Asset Y in terms of absolute profit, and this trend held true as well when it comes to simple return on investment, over varying time frames. That said, when you take into account the variance inherent to Asset X, it seems that Asset Y may have been the better choice, as the excess downside risk associated with Asset X outweighs its excess net profitability."


Too many cooks in the kitchen—the case for a functional approach to app-building.



While the design of the UI of any analytic app is of great importance, it’s important to not forget that the code base itself should also be well-designed; a fully-functional app from the user’s perspective can still be a terrible app to work with if the code is a jumbled, incomprehensible mess. A poorly designed code base makes QC a tiresome, aggravating process, and knowledge sharing all but impossible.

For this reason, I find that sourcing a separate R script file containing all analytic functions necessitated by the app is the best way to go, as done below (you can see Functions.R at my repo here).


# source the Functions.R file, where all analytic functions for the app are stored
source("Functions.R")


Not only does this allow for a more comprehensible and less-cluttered App.R, but it also drastically improves testability and reusability of the code. Consider the example function below, used to create the portfolio performance chart in the app (first box displayed in the UI, center middle).

build_portfolio_perf_chart <- function(data, port_loess_param = 0.33){
  
  port_tbl <- data[,c(1,4:5)]
  
  # grabbing the 2 asset names
  asset_name1 <- sub('_.*', '', names(port_tbl)[2])
  asset_name2 <- sub('_.*', '', names(port_tbl)[3])
  
  # transforms dates into correct type so smoothing can be done
  port_tbl[,1] <- as.Date(port_tbl[,1])
  date_in_numeric_form <- as.numeric((port_tbl[,1]))
  # assigning loess smoothing parameter
  loess_span_parameter <- port_loess_param
  
  # now building the plotly itself
  port_perf_plot <- plot_ly(data = port_tbl, x = ~port_tbl[,1]) %>%
    # asset 1 data plotted
    add_markers(y =~port_tbl[,2],
                marker = list(color = '#FC9C01'),
                name = asset_name1,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,2] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#FC9C01'),
              name = asset_name1,
              showlegend = TRUE) %>%
    # asset 2 data plotted
    add_markers(y =~port_tbl[,3],
                marker = list(color = '#3498DB'),
                name = asset_name2,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,3] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#3498DB'),
              name = asset_name2,
              showlegend = TRUE) %>%
    layout(
      title = FALSE,
      xaxis = list(type = "date",
                   title = "Date"),
      yaxis = list(title = "Portfolio Value ($)"),
      legend = list(orientation = 'h',
                    x = 0,
                    y = 1.15)) %>%
    add_annotations(
      x= 1,
      y= 1.133,
      xref = "paper",
      yref = "paper",
      text = "",
      showarrow = F
    )
  
  return(port_perf_plot)
  
}


Writing this function in the sourced Functions.R file instead of directly within App.R allows the developer to first test the function itself with fake data, i.e. data not gleaned from the reactive inputs. Once it has been tested in this way, it can be integrated into app.R on the server side, as shown below, with very little code.

  output$portfolio_perf_chart <- 
    debounce(
      renderPlotly({
        data <- react_base_data()
        build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
      }), 
      millis = 2000) # sets wait time for debounce


This process allows for better error-identification and troubleshooting. If, for example, you want to change the work accomplished by the analytic function in some way, you can make the changes necessary to the code, and if the app fails to produce the desired outcome, you simply restart the chain: first you test the function in a vacuum outside of the app, and if it runs fine there, then you know that you have a problem with the way the reactive inputs are integrating with the function itself. This is a huge time saver when debugging.
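As a concrete (made-up) example of that "vacuum" test, one can hand the charting function a fake data frame whose layout matches what it expects (a date in column one and the two asset-value series in columns four and five), entirely outside of Shiny; the column names here are invented for illustration:
# Assumes plotly is loaded and Functions.R has been sourced
fake_data <- data.frame(
  date       = seq.Date(as.Date("2021-01-01"), by = "day", length.out = 100),
  filler_a   = NA,   # placeholder columns so the assets land in columns 4 and 5
  filler_b   = NA,
  BTC_value  = 1000 + cumsum(rnorm(100, 5, 50)),
  AMZN_value = 1000 + cumsum(rnorm(100, 5, 20))
)

# If this renders a sensible chart, any breakage inside the app points to the
# reactive plumbing rather than the function itself
build_portfolio_perf_chart(fake_data, port_loess_param = 0.33)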

Lastly, this allows for ease of reproducibility and hand-offs. If, say, one of your functions simply takes in a dataset and produces a chart of some sort, then it can easily be copied from Functions.R and reused elsewhere. I have done this more times than I can count, ripping code from one project and, with a few alterations, instantly applying it in another context. This is easy to do if the functions are written in a manner not dependent on a particular Shiny reactive structure. For all these reasons, it makes sense in most cases to keep the code for the app UI and inputs cleanly separated from the analytic functions via a sourced R script.

Dashboard documentation—both a story and a manual, not one or the other.


When building an app for a customer at work, I never simply send an email with a link and a note saying "here you go!" That would result in, at best, a steep learning curve, and at worst, an app used in an unintended way, producing user frustration or incorrect results. I always meet with the customer, explain the purpose and functionality of the tool, walk through the app live, take feedback, and integrate key takeaways into further iterations.

Even if you are just planning on writing some code to put up on GitHub, you should still consider all of these steps when working on the documentation for your app. In most cases, the README is the epicenter of your documentation—the README is your meeting with the customer.  As you saw when reading the README for the Asset Comparison Tool, I always start my READMEs with a high-level introduction to the purpose of the app—hopefully written or supplemented with visuals (as seen below) that are easy to understand and will capture the attention of browsing passers-by. 






After this introduction, the sections worth including can vary greatly from app to app. In some cases an app is meant to answer one particular question and might offer a variety of filters or supplemental functionalities; one such example can be found here. As can be seen in that README, I spend a great deal of time on the methodology after making the overall purpose clear, calling out additional options along the way. In the case of the README for the Asset Comparison Tool, however, the story is a bit different. Given that there are many questions the app seeks to answer, it makes sense to address each in turn, writing the README so that its progression mirrors the logical flow intended for the app's user.

One should of course not neglect to cover necessary technical information in the README as well. Anything that is not immediately clear from using the app should be clarified in the README—from calculation details to the source of your data, etc. Finally, don’t neglect the iterative component! Mention how you want to interact with prospective users and collaborators in your documentation. For example, I normally call out how I would like people to use the Issues tab on GitHub to propose any changes or additions to the documentation, or the app in general. In short, your documentation must include both the story you want to tell, and a manual for your audience to follow. 

Why Shiny Dashboard?



One of the first things you will notice about the app.R code is that the entire thing is built using Shiny Dashboard as its skeleton. There are two main reasons for this, which I will touch on in turn.

Shiny Dashboard provides the biggest bang for your buck in terms of how much UI complexity and customizability you get out of just a small amount of code.


I can think of few cases where any analyst or developer would prefer longer, more verbose code to a shorter, succinct solution. That said, Shiny Dashboard’s simplicity when it comes to UI manipulation and customization is not just helpful because it saves you time as a coder, but because it is intuitive from the perspective of your audience.

Most of the folks that use the tools I have built to shed insight into economic questions don’t know how to code in R or Python, but they can, with a little help from extensive commenting and detailed documentation, understand the broad structure of an app coded in Shiny Dashboard format. This is, I believe, largely a function of two features of Shiny Dashboard: the colloquial-English-like syntax of the code for UI elements, and the lack of the necessity for in-line or external CSS.

As you can see from the example below, Shiny Dashboard’s system of “boxes” for UI building is easy to follow. Users can see a box in the app and easily tie that back to a particular box in the UI code.

Here is the box as visible to the user:







And here is the code that produces the box:

box(
        title = "Portfolio Performance Inputs",
        status= "primary",
        solidHeader = TRUE,
        h5("This box focuses on portfolio value, i.e., how much an initial investment of the amount specified below (in USD) would be worth over time, given price fluctuations."),
        
        textInput(
          inputId = "initial_investment",
          label = "Enter your initial investment amount ($):",
          value = "1000"),
        
        hr(),
        
        h5("The slider below modifies the", a(href = "https://stats.stackexchange.com/questions/2002/how-do-i-decide-what-span-to-use-in-loess-regression-in-r", "smoothing parameter"), "used in the", a(href = "https://en.wikipedia.org/wiki/Local_regression", "LOESS function"), "that produces the lines on the scatterplot."),
        
        sliderInput(
          inputId = "port_loess_param",
          label = "Smoothing parameter for portfolio chart:",
          min = 0.1,
          max = 2,
          value = .33,
          step = 0.01,
          animate = FALSE
        ),
        
        hr(),
        h5("The table below provides metrics by which we can compare the portfolios. For each column, the asset that performed best by that metric is colored green."),
        
        height = 500, 
        width = 4
      )


Secondly, and somewhat related to the first point, with Shiny Dashboard, much of the coloring and overall UI design comes pre-made via dashboard-wide “skins”, and box-specific “statuses.”

This is great if you are okay sacrificing a bit of control for a significant reduction in code complexity. In my experience dealing with non-coding-proficient audiences, I find that in-line CSS or complicated external CSS makes folks far more uncomfortable with the code in general. Anything you can do to reduce this anxiety and make those using your tools feel as though they understand them better is a good thing, and Shiny Dashboard makes that easier.
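To make the skin/status idea concrete, here is a hedged, minimal skeleton (not the Asset Comparison Tool itself; the titles, tab names, and placeholder plot are illustrative) showing where a dashboard-wide skin and a box-specific status are set:

# Minimal Shiny Dashboard skeleton; names and content are placeholders.
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  skin = "blue",                                   # dashboard-wide skin
  dashboardHeader(title = "Example Dashboard"),
  dashboardSidebar(
    sidebarMenu(menuItem("Overview", tabName = "overview"))
  ),
  dashboardBody(
    tabItems(
      tabItem(tabName = "overview",
              box(title = "Example box",
                  status = "primary",              # box-specific status
                  solidHeader = TRUE,
                  width = 6,
                  plotOutput("placeholder_plot")))
    )
  )
)

server <- function(input, output) {
  output$placeholder_plot <- renderPlot(plot(cars))
}

# shinyApp(ui, server)   # uncomment to run locally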

Shiny Dashboard’s combination of sidebar and boxes makes for easy and efficient data processing when your app has a waterfall-like analytic structure. 


Having written versions of this app both in base Shiny and using Shiny Dashboard, the number one reason I chose Shiny Dashboard was the fact that the analytic questions I sought to solve followed a waterfall-like structure, as explained in the previous section. This works perfectly well with Shiny Dashboard’s combination of sidebar input controls and inputs within UI boxes themselves.  

The inputs of primordial importance to all users are included in the sidebar UI: the two assets to analyze, and the date range over which to compare their performance. These are the only inputs that all users, regardless of experience or intent, must absolutely use, and when they are changed, all views in the dashboard will be affected. All other inputs are stored in the UI Boxes adjacent to the views that they modulate. This makes for a much more intuitive and fluid user experience, as once the initial sidebar inputs have been modulated, the sidebar can be hidden, as all other non-hidden inputs affect only the visualizations to which they are adjacent.
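For reference, those sidebar inputs boil down to something like the following sketch. The input IDs mirror the ones referenced later in the server code (asset_1a, asset_2a, port_dates1a), but the asset choices and default dates shown here are placeholders.

# Hedged sketch of the sidebar inputs; IDs match the server code shown later,
# but the choices and default dates are illustrative only.
library(shiny)
library(shinydashboard)

sidebar <- dashboardSidebar(
  selectInput("asset_1a", "First asset:",
              choices = c("BTC", "ETH", "SPY")),
  selectInput("asset_2a", "Second asset:",
              choices = c("BTC", "ETH", "SPY")),
  dateRangeInput("port_dates1a", "Date range to compare:",
                 start = Sys.Date() - 365, end = Sys.Date())
)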

This waterfall-like structure also makes for more efficient reactive processes on the Shiny back-end. The inputs on the sidebar are parameters that, when changed, force the main reactive function to fire, recreating the base dataset (as can be seen in the code for that base dataset's creation below).

  # utility functions to be used within the server; this enables us to use a textinput for our portfolio values
  exists_as_number <- function(item) {
    !is.null(item) && !is.na(item) && is.numeric(item)
  }
  
  # data-creation reactives (i.e. everything that doesn't directly feed an output)
  
  # first is the main data pull which will fire whenever the primary inputs (asset_1a, asset_2a, initial_investment, or port_dates1a change)
  react_base_data <- reactive({
    if (exists_as_number(as.numeric(input$initial_investment)) == TRUE) {
      # creates the dataset to feed the viz
      return(
        get_pair_data(
          asset_1 = input$asset_1a,
          asset_2 = input$asset_2a, 
          port_start_date = input$port_dates1a[1],
          port_end_date = input$port_dates1a[2],
          initial_investment = (as.numeric(input$initial_investment))
        )
      )
    } else {
      return(
        get_pair_data(
          asset_1 = input$asset_1a,
          asset_2 = input$asset_2a, 
          port_start_date = input$port_dates1a[1],
          port_end_date = input$port_dates1a[2],
          initial_investment = (0)
        )
      )
    }
  })


Each of the visualizations is then produced via its own separate reactive function, each of which takes the main reactive as an input (as shown below). This means that whenever the sidebar inputs are changed, all reactives fire and all visualizations are updated; if, however, all that changes is a single loess smoothing parameter, only the reactive behind that particular parameter-dependent visualization fires, which makes for great computational efficiency.

 # Now the reactives for the actual visualizations
  output$portfolio_perf_chart <- 
    debounce(
      renderPlotly({
        data <- react_base_data()
        build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
      }), 
      millis = 2000) # sets wait time for debounce
  


Why Plotly?


Plotly vs. ggplot is always a fun subject for discussion among folks who build visualizations in R. Sometimes I feel like such discussions just devolve into the same type of argument as R vs. Python for data science (my answer to this question being just pick one and learn it well), but over time I have found that there are actually some circumstances where the plotly vs. ggplot debate can yield cleaner answers.

In particular, I have found in working on this particular type of analytic app that there are two areas where plotly has a bit of an advantage: clickable interactivity, and wide data.

Those familiar with ggplot will know that every good ggplot begins with long data. It is possible, via some functions, to transform wide data into a long format, but that transformation can sometimes be problematic. While there are essentially no circumstances in which it is impossible to transform wide data into long format, there are a handful of cases where it is excessively cumbersome: namely, when dealing with indexed xts objects (as shown below) or time series / OHLC-styled data.




In these cases—either due to the sometimes-awkward way in which you have to handle rowname indexes in xts, or the time and code complexity saved by not having to transform every dataset into long format—plotly offers efficiency gains relative to ggplot.
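The contrast is easiest to see with a small, self-contained sketch (simulated prices and placeholder tickers, not the app's data): plotly can consume the wide xts columns directly, whereas ggplot first needs the row-index dates pulled into an ordinary column and the data pivoted to long format.

# Wide xts plotted directly with plotly vs. pivoted to long format for ggplot.
library(xts)
library(plotly)
library(tidyr)
library(ggplot2)

dates  <- seq(as.Date("2021-01-01"), by = "day", length.out = 90)
prices <- xts(cbind(SPY = 300 + cumsum(rnorm(90)),
                    GLD = 170 + cumsum(rnorm(90))),
              order.by = dates)   # wide: one column per asset, dates in the index

# plotly can take the wide columns as-is
plot_ly() %>%
  add_lines(x = index(prices), y = as.numeric(prices$SPY), name = "SPY") %>%
  add_lines(x = index(prices), y = as.numeric(prices$GLD), name = "GLD")

# ggplot expects long data, so the row index must first become a column
prices_long <- data.frame(date = index(prices), coredata(prices)) %>%
  pivot_longer(-date, names_to = "asset", values_to = "price")

ggplot(prices_long, aes(x = date, y = price, color = asset)) +
  geom_line()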

The aforementioned efficiency gains are a reason to choose plotly in some cases because they make the life of the coder easier, but there are also reasons why it sometimes makes the life of the user easier as well.

If one of the primary utilities of a visualization is to allow the user the ability to seamlessly and intuitively zoom in on, select, or filter the data displayed, particularly in the context of a Shiny App, then plotly should be strongly considered. Sure, ggplotly wrappers can be used to make a ggplot interactive, but with an added layer of abstraction comes an added layer of possible errors. While in most cases a ggplotly wrapper should work seamlessly, I have found that, particularly in cases where auto-sizing and margin size specification is key, ggplotly can require a great deal of added code in order to work correctly in a Shiny context.
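For completeness, here is what the ggplotly wrapper route looks like; the margin and sizing values below are illustrative of the extra hand-tuning that is sometimes needed in a Shiny context, not settings taken from this app.

# Wrapping a ggplot in ggplotly, then tweaking sizing and margins by hand.
library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point()

ggplotly(p) %>%
  layout(autosize = TRUE,
         margin = list(l = 60, r = 20, b = 60, t = 30))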

In summary, when considering when to start with plotly vs. when to start with ggplot, I find one question to be particularly helpful: what do I value most—visual complexity and/or customization, or interactive versatility and/or preserving wide data?




If I choose the former, then ggplot is what I need; otherwise, I go with plotly. More often than not I find that ggplot emerges victorious, but even if you disagree with me in my decision-making calculus, I think it is helpful to at least think through what your personal calculus is. This will save you time when coding, as instead of playing around with various types of viz, you can simply pose the question(s) behind your calculus and know quickly what solution best fits your problem.

Why Formattable?


The case for formattable is, in my opinion, a much easier case to make than arguing for plotly over ggplot. The only question worth asking when deciding whether or not to use formattable in your app is: do I want my table to tell a quick story via added visual complexity within the same cell that contains my data, or is a reference table all I am looking for? If you choose the former, formattable is probably a good way to go. You'll notice as well that the case for formattable is very specific; in most cases there is likely a simpler solution via the DT or kableExtra packages.
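For comparison, if a plain reference table is all you need, something like the DT route below (a minimal sketch using a built-in dataset, not this app's data) is usually simpler:

# A plain, paginated reference table with no in-cell visual encoding
library(DT)
datatable(mtcars, options = list(pageLength = 5))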



The one downside that I have encountered in dealing with formattable code is the amount of code necessary to generate even moderately complicated tables. That said, this problem is easily remedied via a quick function that we can use to kill most of the duplicative coding, as seen in the example below.



First, here is the long form version:


  react_formattable <- reactive({
    return(
      formattable(react_port_summary_table(), 
                  list(
                    "Asset Portfolio Max Worth" = formatter("span",
                                                            style = x ~ style(
                                                              display = "inline-block",
                                                              direction = "rtl",
                                                              "border-radius" = "4px",
                                                              "padding-right" = "2px",
                                                              "background-color" = csscolor("darkslategray"),
                                                              width = percent(proportion(x)),
                                                              color = csscolor(gradient(x, "red", "green"))
                                                            )),
                    "Asset Portfolio Latest Worth" = formatter("span",
                                                               style = x ~ style(
                                                                 display = "inline-block",
                                                                 direction = "rtl",
                                                                 "border-radius" = "4px",
                                                                 "padding-right" = "2px",
                                                                 "background-color" = csscolor("darkslategray"),
                                                                 width = percent(proportion(x)),
                                                                 color = csscolor(gradient(x, "red", "green"))
                                                               )),
                    "Asset Portfolio Absolute Profit" = formatter("span",
                                                                  style = x ~ style(
                                                                    display = "inline-block",
                                                                    direction = "rtl",
                                                                    "border-radius" = "4px",
                                                                    "padding-right" = "2px",
                                                                    "background-color" = csscolor("darkslategray"),
                                                                    width = percent(proportion(x)),
                                                                    color = csscolor(gradient(x, "red", "green"))
                                                                  )),
                    "Asset Portfolio Rate of Return" = formatter("span",
                                                                 style = x ~ style(
                                                                   display = "inline-block",
                                                                   direction = "rtl",
                                                                   "border-radius" = "4px",
                                                                   "padding-right" = "2px",
                                                                   "background-color" = csscolor("darkslategray"),
                                                                   width = percent(proportion(x)),
                                                                   color = csscolor(gradient(x, "red", "green"))
                                                                 ))
                    
                  )
      )
      
    )
  })


This code can easily be shortened via the integration of a custom function, as shown below.


simple_formatter <- function(){
    formatter("span",
              style = x ~ style(
                display = "inline-block",
                direction = "rtl",
                "border-radius" = "4px",
                "padding-right" = "2px",
                "background-color" = csscolor("darkslategray"),
                width = percent(proportion(x)),
                color = csscolor(gradient(x, "red", "green"))
              ))
  }
  
  react_formattable <- reactive({
    return(
      formattable(react_port_summary_table(), 
                  list(
                    "Asset Portfolio Max Worth" = simple_formatter(),
                    "Asset Portfolio Latest Worth" = simple_formatter(),
                    "Asset Portfolio Absolute Profit" = simple_formatter(),
                    "Asset Portfolio Rate of Return" = simple_formatter()
                    )
                  )
      )
    })


As can be seen, formattable allows for a great deal of added complexity in crafting your table—complexity that may not be suited for all apps. That said, if you do want to quickly draw a user’s attention to something in a table, formattable is a great solution, and most of the details of the code can be greatly simplified via a function, as shown.

Conclusions:


That was a lot, I know, but I hope that this commentary, and the Asset Comparison Tool as an exemplar more generally, has helped to inform your understanding of how dashboards can serve as a helpful analytic tool. Furthermore, I hope to have prompted some thoughts as to the best practices to follow when building such a tool. I'll end with a quick tl;dr:


  • Shave complexity wherever possible, and make code as simple as possible by keeping the code for the app’s UI and inner mechanism (inputs, reactives, etc.) separate from the code for the analytic functions and visualizations.
  • Build with the most extreme cases in mind (think of how your most edge-case user might use the app, and ensure that behavior won’t break the app)
  • Document, document, and then document some more. Make your README both a story and a manual.
  • Give Shiny Dashboard a shot if you want an easy-to-construct UI over which you don’t need complete control when it comes to visual design.
  • Pick your visualization packages based on what you want to prioritize for your user, not the other way around (this applies to ggplot, plotly, formattable, etc.).

Thanks for reading!

A New Package (hhi) for Quick Calculation of Herfindahl-Hirschman Index scores

The Herfindahl-Hirschman Index (HHI) is a widely used measure of concentration in a variety of fields, including business, economics, political science, and finance, among many others. Though simple to calculate (the sum of the squared market shares of the firms/actors in a single market/space), calculating the HHI can get onerous, especially as the number of firms/actors increases and the time period grows. Thus, I decided to write a package aimed at streamlining and simplifying the calculation of HHI scores. The package, hhi, calculates the concentration of a market/space based on a supplied vector of values corresponding with the shares of all individual firms/actors acting in that space. The package is available on CRAN.

The purpose of this blog post is to provide a quick overview of the package’s two key functions: hhi (calculation) and plot_hhi (visualization). 

Calculating HHI Scores

As the package is intended for simple, intuitive usage, the function requires only the name of the data frame and then the name of the variable corresponding with the market shares, in quotation marks. With these placed directly in the command, calling the function hhi will generate the HHI score based on the values supplied, following the basic form

HHI = sum_{i = 1}^{N} (MS_i)^2,

where MS_i is the market share of each firm i operating in a single market. Summing the squared market shares across all N firms gives the measure of concentration in that market, the HHI.

Consider a simple application calculating the HHI for the men’s footwear market in the United States in 2017 (see and download the data file, “footwear.txt”, from my GitHub repo). Using market share data for every men’s footwear company operating in the U.S. in 2017 from Euromonitor Passport, we can calculate this market’s HHI with the following code:

 
# First, install the "hhi" package, then load the library
install.packages("hhi")
library(hhi)

# Next, read in data: US Men's Footwear Company Market Shares, 2012-2017
footwear = read.table(".../footwear.txt")

# Now, call the "hhi" command to calculate HHI for 2017
hhi(footwear, "ms.2017") # first the df, then the shares variable in quotes
Calling the function hhi gives us an HHI value of 2009.25 for men's footwear in the U.S. in 2017. You can corroborate this output manually by squaring the 2017 market share of each company in the data file and then summing the squared shares.
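For instance, a one-line check against the same data frame used above (assuming the 2017 shares are stored, as before, in the ms.2017 column) reproduces the score:

# Manual corroboration: square each 2017 market share and sum
sum(footwear$ms.2017^2, na.rm = TRUE)  # approximately 2009.25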

Often, the HHI is used as a measure of competition, with 10,000 equaling perfect monopoly (100^2) and 0.0 equaling perfect competition. As such, we can see that the U.S. men’s footwear industry in 2017 seems relatively competitive. Yet, to say anything substantive about the men’s U.S. footwear market, we really need a comparison of HHI scores for this market over time. This is where the second command comes in.

Visualizing HHI Time Series

The second key function in the package, plot_hhi, allows for quick and simple visualization of a time series of HHI scores. Usage is similarly straightforward, requiring only the name of the data frame, the name of the variable corresponding with the time indicator in quotation marks, and the name of the variable corresponding with the HHI scores, also in quotation marks. The package leverages ggplot2 to provide a visual rendering of the supplied vector of HHI values over the specified range of time. The function supports any measure of time, such as years, quarters, or months. Note that plot_hhi is a relatively inflexible function meant for quick visual rendering of a vector of HHI scores over a period of time. For bigger or more formal projects, users are advised to generate original plots with other plotting functions and packages beyond hhi to allow for greater flexibility in customizing visual output according to specific needs.

Let’s return to our men’s U.S. footwear example to see how the function works in practice. First, we need to calculate the HHI scores for each year in the data file (2012-2017), and store those as objects to make a data frame of HHI scores corresponding to individual years. Then, we simply call the plot_hhi command and generate a simple, pleasing plot of HHI scores over time. This will give us a much better sense of how our 2017 HHI score above compares with other years in this market. See the code below, followed by the output.

# First, calculate and store HHI for each year in the data file (2012-2017)
hhi.12 = hhi(footwear, "ms.2012")
hhi.13 = hhi(footwear, "ms.2013")
hhi.14 = hhi(footwear, "ms.2014")
hhi.15 = hhi(footwear, "ms.2015")
hhi.16 = hhi(footwear, "ms.2016")
hhi.17 = hhi(footwear, "ms.2017")

# Combine and create df for plotting
hhi = rbind(hhi.12, hhi.13, hhi.14, hhi.15, hhi.16, hhi.17)

year = c(2012, 2013, 2014, 2015, 2016, 2017)

hhi.data = data.frame(year, hhi)

# Finally, generate HHI time series plot using the "plot_hhi" command
plot_hhi(hhi.data, "year", "hhi")

These lines of code will give us the following plot of HHI scores for each year in the data set.



Interestingly, the men’s U.S. footwear industry seems to be getting slightly less competitive (higher HHI scores) from 2012 to 2017, on average. To say anything substantive about this trend, though, would obviously require more sophisticated methods as well as a longer time series. Yet, the value of the hhi package is allowing for quick calculation and visualization of HHI scores over time. You can download the package from CRAN or directly from the package installation context in RStudio. And as always, if you have any questions or find any bugs requiring fixing, please feel free to contact me.

As a final note, here are a few references for further reading on the HHI and its original calculation and intuition:

Herfindahl, Orris C. 1950. “Concentration in the steel industry.” Ph.D. dissertation, Columbia University.

Hirschman, Albert O. 1945. “National power and structure of foreign trade.” Berkeley, CA: University of California Press.

Rhoades, Stephen A. 1993. “The herfindahl-hirschman index.” Federal Reserve Bulletin 79: 188.


Thanks and enjoy!

Introducing purging: An R package for addressing mediation effects

A Simple Method for Purging Mediation Effects among Independent Variables

Mediation can occur when one independent variable swamps the effect of another, which suggests high correlation between the two variables. Though there are some great packages for mediation analysis out there, the intuition for when it is needed is often unclear, especially for younger graduate students. Thus, in this blog post, my goal is to give an intuitive overview of mediation and offer a simple method for "purging" variables of mediation effects for their simultaneous use in multivariate analysis. The purging process detailed in this post is available in my recently released R package, purging, which is available on CRAN or at my GitHub.

Let's consider a couple of practical examples from real research contexts. First, suppose we are interested in whether committee membership related to a specific issue domain influences the likelihood of sponsoring issue-specific legislation. In the American context, however, where representational responsibilities permeate legislative behavior, district characteristics in related employment industries likely influence self-selection onto those issue-specific committees in the first place, and we suggest they should also influence the likelihood of related bill sponsorship. Therefore, we have a mediation model, where employment/industry (indirect) -> committee membership (direct) -> sponsorship. Thus, we would want to purge committee membership of the effects of district employment/industry to observe the "pure" effect of committee membership on the likelihood of related sponsorship. This example is from my paper recently published in American Politics Research.

Or consider a second example in a different realm. Let’s say we had a model where women’s level of labor force participation determines their level of contraceptive use, and that the effect of female labor force participation on fertility is indirect, essentially filtered through its impact on contraceptive use. Once we control for contraceptive use, the direct effect of labor force participation may go away. In other words, the effect of labor force participation on fertility is likely indirect, and filtered through contraceptive use, which means the variables are also highly correlated. This second example was borrowed from Scott Basinger’s and Patrick Shea’s (University of Houston) graduate statistics labs, which originally gave me the idea of expanding this out to develop an R package dedicated to addressing this issue in a variety of contexts and for several functional forms.

These two examples offer simple ways of thinking about mediation effects (e.g., labor force (indirect) -> contraception (direct) -> fertility). If we run into this problem, a simple solution is "purging." The steps are: first, regress the direct variable (in the second case, contraceptive use) on the indirect variable (labor force participation); then store the residuals, which represent the direct effect of contraception after accounting for the indirect effect of labor force participation; finally, add the stored residuals to the updated specification as their own "purged" variable (a minimal sketch of these steps appears below). Essentially, this purging process yields a new direct variable that is uncorrelated with the indirect variable. When we do this, we will see that each variable explains unique variance in the DV of interest (you can double check this several ways, such as comparing correlation coefficients, which we do below, or comparing R^2 across specifications).
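Before turning to the package, here is a minimal sketch of those steps done by hand with lm(); the variable names and coefficients are simulated for illustration only, not drawn from a real dataset.

# Purging by hand on simulated data (illustrative variable names)
set.seed(123)
n <- 500
labor_force   <- rnorm(n)                                # indirect variable
contraception <- 0.8 * labor_force + rnorm(n, sd = 0.5)  # direct variable (mediated)
fertility     <- -0.6 * contraception + rnorm(n)         # outcome

# Step 1: regress the direct variable on the indirect variable
first_stage <- lm(contraception ~ labor_force)

# Step 2: store the residuals -- the part of contraception not explained by labor force
contraception_purged <- residuals(first_stage)

# Step 3: use the purged variable alongside the indirect variable in the final model
summary(lm(fertility ~ labor_force + contraception_purged))

cor(labor_force, contraception_purged)  # effectively zero by construction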

An Applied Example

With the intuition behind mediation and the purging solution in mind, let’s walk through a simple example using some fake data. For an example based on the second case described above using real data from the United Nations Human Development Programme, see the code file, purging example.R, at my GitHub repository.

# First, install the MASS package for the "mvrnorm" function
install.packages("MASS")
library(MASS)
# Second, install the purging package directly from CRAN for the "purge.lm" function
install.packages("purging")
library(purging)

# Set some parameters to guide our simulation
n = 5000
rho = 0.9

# Create some fake data
d = mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), nrow = 2), empirical = TRUE)

# Store each correlated variable as its own object 
X = d[, 1]
Y = d[, 2]

# Create a dataframe of your two variables
d = data.frame(X, Y)

# Verify the correlation between these two normally distributed variables is what we set (rho = 0.9)
cor(d$X, d$Y)
plot(d$X, d$Y) 
 

The correlation coefficient between the two variables is exactly what we specified (0.90), and the positive relationship between the two random variables is visible in the plot below.



Now, with our correlated data created, we can call the “purge.lm” command, given that our data are continuous. Note: the package supports a variety of functional forms for continuous (linear), binary (logit and probit), and event count data (Poisson and negative binomial).

The idea behind the package is to generate the new direct-impact variable to be used in the analysis, purged of the effects of the indirect variable. To do so, simply input the name of the data frame first, followed by the name of the direct variable in quotes, and then the indirect variable also in quotes in the function. Calling the function will generate a new object (i.e., the direct variable), which can then be added to a data frame using the $ operator, with the following line of code:
 df$purged.var <- purged.var
Let’s now see the purge command in action using our fake data.
 
# Purge the "direct" variable, Y, of the mediation effects of X 
# (direct/indirect selection will depend on your model specification) 
purge.lm(d, "Y", "X") # df, "direct", "indirect" 
 
# You will get an automatic suggestion message to store the values in the df 
purged = purge.lm(d, "Y", "X") # Store as its own object 
d$purged = purged # Attach to df 
 
# Finally, check the correlation and the plot to see the effects purged from the original "Y" (direct) variable 
cor(d$X, d$purged) 
plot(d$X, d$purged) 

Note the correlation between the original indirect (X) variable and the new direct (Y) variable, purged of the effects of X, is -9.211365e-17, or essentially non-existent. For additional corroboration, let’s see the updated correlation plot between the X and purged-Y variables.



The purge command did as expected, with the correlation between the two variables essentially gone. You can download the package and documentation at CRAN. If you have any questions or find any bugs requiring fixing, please feel free to contact me. As this procedure was first developed and implemented (using the binary/logit iteration discussed above in the first example) in a now-published paper, please cite use of the package as: Waggoner, Philip D. 2018. “Do Constituents Influence Issue-Specific Bill Sponsorship?” American Politics Research, <https://doi.org/10.1177/1532673X18759644>

As a final note, once the intuition is mastered, be sure to check out the great work on mediation from many folks, including Kosuke Imai (Princeton), Luke Keele (Georgetown), and several others. See Imai’s mediation site as a sound starting place with code, papers, and more.

Thanks and enjoy!

Discriminant Analysis: Statistics All The Way

Discriminant analysis is used when the variable to be predicted is categorical in nature. The analysis requires that the rule for assigning data points to the respective categories be known, which makes it different from cluster analysis, where the classification criteria are not known. It works by calculating a score based on all the predictor variables; based on the value of that score, a corresponding class is selected. Hence the name discriminant analysis, which, in simple terms, discriminates between data points and classifies them into categories based on analysis of the predictor variables. This article delves into the linear discriminant analysis function in R and delivers an in-depth explanation of the process and concepts. Before we move further, let us look at the assumptions of discriminant analysis, which are quite similar to those of MANOVA.

  • Since we are dealing with multiple features, one of the first assumptions the technique makes is multivariate normality, meaning the features are normally distributed within each class. This also implies that the technique is susceptible to outliers and is sensitive to group sizes. If there is an imbalance between the group sizes and one of the groups is too small or too large, the technique suffers when classifying data points into that 'outlier' class
  • The second assumption is homoscedasticity: the variance of the features is the same across all classes of the target variable
  • We also assume that the features are sampled randomly
  • The final assumption is about the absence of multicollinearity. If the variables are correlated with each other, the predictive ability will decrease.
Though discriminant analysis can also discriminate non-linearly, linear discriminant analysis is the simpler and more popular methodology. We assume normally distributed conditional probability functions for each class. If y is the class to be predicted, taking two values 1 and 2, and x is the combined set of all the predictor features, we can assume a threshold value T such that x is assigned to class 1 if its discriminant score falls below T and to class 2 otherwise. Mathematically, the condition for class 1 is

(x − μ1)^T Σ1^(−1) (x − μ1) + ln|Σ1| − (x − μ2)^T Σ2^(−1) (x − μ2) − ln|Σ2| < T

where (μ1, Σ1) and (μ2, Σ2) are the means and covariance matrices of x for class 1 and class 2, respectively. We sometimes simplify the calculation by assuming equal covariances for the two classes, which reduces the rule to the linear form w · x > c, where c is a threshold and w is a weight vector applied to x. A small simulated sketch of this simplified rule follows.
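As a quick illustration (simulated data, not the iris example used later), the simplified equal-covariance rule can be computed directly, with w = Σ^(−1)(μ2 − μ1) and c taken at the midpoint between the class means.

# Simulated sketch of the equal-covariance rule w . x > c
library(MASS)   # for mvrnorm()

set.seed(1)
n     <- 200
mu1   <- c(0, 0)
mu2   <- c(2, 1)
Sigma <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)   # shared covariance

x1 <- mvrnorm(n, mu1, Sigma)   # class 1
x2 <- mvrnorm(n, mu2, Sigma)   # class 2

w <- solve(Sigma) %*% (mu2 - mu1)                    # weight vector
c_threshold <- as.numeric(t(w) %*% (mu1 + mu2)) / 2  # midpoint threshold

scores <- rbind(x1, x2) %*% w                        # w . x for every point
pred   <- ifelse(scores > c_threshold, 2, 1)         # class 2 if w . x > c
table(predicted = pred, actual = rep(c(1, 2), each = n))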

Let's now look at Fisher's LDA, which is one of the most popular variants of LDA.

Fisher’s Linear Discriminant analysis – How and when to use it?

Fisher's linear discriminant finds a linear combination of features that can be used to discriminate between the classes of the target variable. In Fisher's LDA, we measure the separation as the ratio of the variance between the classes to the variance within the classes; in other words, the interclass-to-intraclass variance ratio:
S = σ²_between / σ²_within = (w · (μ2 − μ1))² / (w^T (Σ1 + Σ2) w)
Fisher's LDA maximizes this ratio and has many applications. Recent applications include the classification of speech and audio. Past usages include face recognition, where Fisher's LDA is used to create "Fisherfaces" and is often combined with PCA (which on its own produces eigenfaces). Fisher's LDA also finds uses in earth science, biomedical science, bankruptcy prediction, finance, and marketing. That's all on the theoretical side of LDA; let's work through an example in R.

LDA Classification example in R

R's MASS package provides the lda() function. For data, we will use the iris dataset and try to classify the species.
#Load the library containing lda() function
library(MASS)
#Store the dataset
dataset=iris

Before running the lda() function, let’s start with the help documentation of lda()
#Help Documentation
?lda
The description for lda() is minimalistic and simple. We are interested in the details section of the documentation, which describes the process the function uses. As the documentation mentions, the lda() function also tries to detect whether the within-class covariance matrix is singular. We can define a tolerance such that if any variable has within-group variance less than tol^2, the function stops and reports the variable as a constant. Another possible adjustment is the prior probabilities: the prior parameter in lda() specifies the prior probabilities of class membership. If not specified, the function uses the class proportions in the data as the priors. These prior probabilities also affect the rotation of the linear discriminants.
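For instance, both arguments can be supplied directly; the prior weights and tolerance below are purely illustrative, not values recommended for iris.

# Hypothetical call overriding the default priors and tolerance
lda_weighted <- lda(Species ~ ., data = dataset,
                    prior = c(0.5, 0.25, 0.25),   # must sum to 1, ordered by factor level
                    tol   = 1e-4)

With those options noted, let us proceed with performing linear discriminant analysis over the iris dataset using the defaults.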
#Perform LDA over the data
lda_iris=lda(Species~.,data=dataset)
#Prior Probabilities and coefficients of Linear discriminants
lda_iris

Call:
lda(Species ~ ., data = dataset)

Prior probabilities of groups:
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

Proportion of trace:
   LD1    LD2 
0.9912 0.0088 


#Check the accuracy of our analysis
Predictions=predict(lda_iris,dataset)
table(Predictions$class, dataset$Species)
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         1
  virginica       0          2        49
With LDA, we are able to classify all but 3 data points correctly in the iris dataset. This is probably because the iris data is (almost) linearly separable. How do we know whether the data are linearly separable? We can use the pairs() function to look at the scatter plots and judge whether the classes separate.
#Check how easily we can linearly separate the iris dataset
pairs(dataset)

As we can see, one of the classes is completely separate while the other two overlap somewhat. However, LDA is still able to distinguish between the two. A better way to assess lda() is with leave-one-out cross-validation, which can be done by passing CV=TRUE to the lda() function.
#LDA with CV
lda_cv_iris=lda(Species~.,data=dataset,CV=TRUE)

#The predictions are already generated in lda_cv_iris
table(lda_cv_iris$class, dataset$Species)
I didn't print the model output here because, with CV=TRUE, it also contains all of the leave-one-out predictions. As we already know from the summary of lda_iris, the function first calculates the prior probabilities of the classes in the dataset unless they are provided explicitly. The iris dataset has 50 data points for each class, hence the prior probabilities are calculated to be 0.33 each. It then makes the necessary calculations, which involve the means of each class and the overall variance, and obtains the linear discriminants. The function also scales the values of the linear discriminants so that the mean is zero and the variance is one. The final values, the proportions of trace, give the percentage of separation that each discriminant achieves. Thus, the first linear discriminant alone achieves about 99% of the separation. As a final step, we will plot the linear discriminants and visually inspect their ability to distinguish the classes. The ldahist() function helps make the separation plot. For the data argument of ldahist(), we can use Predictions$x[,1] for the first linear discriminant, Predictions$x[,2] for the second, and so on.
#Plot the predictions - first linear discriminant
ldahist(data = Predictions$x[,1], g = dataset$Species)

The data points are almost completely separated by the first linear discriminant and that is why we see the three classes in different ranges of values. To further our understanding, we also see the second linear discriminant.
#Plot the predictions - second linear discriminant
ldahist(data = Predictions$x[,2], g = dataset$Species)

From the plot of the second linear discriminant, we see that we can hardly differentiate between the three groups, which is consistent with the proportion of trace values.

Everything is not linear – quadratic discriminant analysis

The MASS package also contains the qda() function, which performs quadratic discriminant analysis. The idea is simple: if the data can be discriminated using a quadratic function, we use qda() instead of lda(). The rest of the usage is the same for qda() as it was for lda().
#QDA
qda_iris=qda(Species~.,data=dataset)
qda_iris


Call:
qda(Species ~ ., data = dataset)

Prior probabilities of groups:
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

#Check the accuracy of our analysis of qda
Predictions_qda=predict(qda_iris,dataset)
table(Predictions_qda$class, dataset$Species)
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         1
  virginica       0          2        49
Since the classes in this data are essentially linearly separable, qda() arrives at the same classifications and returns similar results.

Conclusion: Evaluating LDA and QDA

Even though LDA can be a tough technique to understand, its implementation in R is simple. As a final step, we will look at another package, the klaR package, which helps create exploratory graphs for LDA or QDA. The package contains the partimat() function, which takes an input similar to the lda() function but returns a plot instead of a model. The function name stands for partition matrix; it plots the ability of the features to partition the target classes, taking two features at a time.
#Using the klaR package
# install.packages("klaR")
library(klaR)
partimat(Species~.,data=dataset,method="lda")

Our data has four features, so we have 4C2 = 6 combinations with which to classify our data. The plots show how the classes are separated based on the two features on the x-axis and y-axis. As a summary, it is important to look at the data first to judge whether it appears linearly separable (or quadratically separable, in the case of qda) before selecting the technique. Since LDA makes assumptions about the data, we also need to preprocess the data and perform univariate analysis to see whether the normality assumption holds for each class. If the normality condition is violated, one can still proceed with LDA or QDA, but the results may not be appropriate and will suffer in accuracy. We also need to analyze whether the features are related to each other; some may need to be omitted from the analysis. The rest is up to the lda() function to calculate and predict. Here is the entire code used in this article:
#Load the library containing lda() function
library(MASS)
#Store the dataset
dataset=iris

#Help Documentation
?lda

#Perform LDA over the data
lda_iris=lda(Species~.,data=dataset)
#Prior Probabilities and coefficients of Linear discriminants
lda_iris

#Check the accuracy of our analysis
Predictions=predict(lda_iris,dataset)
table(Predictions$class, dataset$Species)

#Check how easily we can linearly separate the iris dataset
pairs(dataset)

#LDA with CV
lda_cv_iris=lda(Species~.,data=dataset,CV=TRUE)

#The predictions are already generated in lda_cv_iris
table(lda_cv_iris$class, dataset$Species)

#Plot the predictions - first linear discriminant
ldahist(data = Predictions$x[,1], g = dataset$Species)

#Plot the predictions - second linear discriminant
ldahist(data = Predictions$x[,2], g = dataset$Species)

#QDA
qda_iris=qda(Species~.,data=dataset)
qda_iris

#Check the accuracy of our analysis of qda
Predictions_qda=predict(qda_iris,dataset)
table(Predictions_qda$class, dataset$Species)

#Using the klaR package
# install.packages("klaR")
library(klaR)
partimat(Species~.,data=dataset,method="lda")

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Jyothirmayee Thondamallu and Saneesh Veetil contributed to this article.
Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.