A New Package (hhi) for Quick Calculation of Herfindahl-Hirschman Index scores

The Herfindahl-Hirschman Index (HHI) is a widely used measure of concentration in a variety of fields including, business, economics, political science, finance, and many others. Though simple to calculate (summed squared market shares of firms/actors in a single market/space), calculation of the HHI can get onerous, especially as the number of firms/actors increases and the time period grows. Thus, I decided to write a package aimed at streamlining and simplifying calculation of HHI scores. The package, hhi, calculates the concentration of a market/space based on a supplied vector of values corresponding with shares of all individual firms/actors acting in that space. The package is available on CRAN.

The purpose of this blog post is to provide a quick overview of the package’s two key functions: hhi (calculation) and plot_hhi (visualization). 

Calculating HHI Scores

As the package is intended for simple, intuitive usage, the function requires only the name of the data frame and then the name of the variable corresponding with the market shares in quotation marks. With these placed directly in the command, calling the function hhi will generate the HHI score based on the values supplied, following the basic form,
where MS is the market share of each firm, i, operating in a single market. Summing across all squared market shares for all firms results in the measure of concentration in the given market, HHI

Consider a simple application calculating the HHI for the men’s footwear market in the United States in 2017 (see and download the data file, “footwear.txt”, from my GitHub repo). Using market share data for every men’s footwear company operating in the U.S. in 2017 from Euromonitor Passport, we can calculate this market’s HHI with the following code:

 
# First, install the "hhi" package, then load the library
install.packages("hhi")
library(hhi)

# Next, read in data: US Men's Footwear Company Market Shares, 2012-2017
footwear = read.table(".../footwear.txt")

# Now, call the "hhi" command to calculate HHI for 2017
hhi(footwear, "ms.2017") # first the df, then the shares variable in quotes
Calling the function hhi gives us an HHI index value for men’s footwear in the U.S. in 2017 of 2009.25. You can corroborate this output manually by squaring each market share corresponding with each company in the data file in the year 2017, and then summing over each firm’s squared market share.

Often, the HHI is used as a measure of competition, with 10,000 equaling perfect monopoly (100^2) and 0.0 equaling perfect competition. As such, we can see that the U.S. men’s footwear industry in 2017 seems relatively competitive. Yet, to say anything substantive about the men’s U.S. footwear market, we really need a comparison of HHI scores for this market over time. This is where the second command comes in.

Visualizing HHI Time Series

The second key function in the package, plot_hhi, is a plotting feature allowing for quick and simple visualization of a time series of HHI scores. Usage is similarly straightforward, requiring only the name of the data frame, the name of the variable corresponding with the time indicator in quotation marks, and then the name of the variable corresponding with the market shares also in quotation marks. The package leverages ggplot2 to provide a visual rendering of the supplied vector of HHI values over the specified range of time. The function supports any measure of time, such as, years, quarters, months, etc. Note that plot_hhi is a relatively inflexible function meant for quick visual rendering of a vector of HHI scores over a period of time. For bigger and formal projects, users are advised to generate original plots with other plotting functions and packages beyond hhi to allow for greater flexibility in customizing visual output according to specific needs.

Let’s return to our men’s U.S. footwear example to see how the function works in practice. First, we need to calculate the HHI scores for each year in the data file (2012-2017), and store those as objects to make a data frame of HHI scores corresponding to individual years. Then, we simply call the plot_hhi command and generate a simple, pleasing plot of HHI scores over time. This will give us a much better sense of how our 2017 HHI score above compares with other years in this market. See the code below, followed by the output.

# First, calculate and store HHI for each year in the data file (2012-2017)
hhi.12 = hhi(footwear, "ms.2012")
hhi.13 = hhi(footwear, "ms.2013")
hhi.14 = hhi(footwear, "ms.2014")
hhi.15 = hhi(footwear, "ms.2015")
hhi.16 = hhi(footwear, "ms.2016")
hhi.17 = hhi(footwear, "ms.2017")

# Combine and create df for plotting
hhi = rbind(hhi.12, hhi.13, hhi.14, hhi.15, hhi.16, hhi.17)

year = c(2012, 2013, 2014, 2015, 2016, 2017)

hhi.data = data.frame(year, hhi)

# Finally, generate HHI time series plot using the "plot_hhi" command
plot_hhi(hhi.data, "year", "hhi")

These lines of code will give us the following plot of HHI scores for each year in the data set.



Interestingly, the men’s U.S. footwear industry seems to be getting slightly less competitive (higher HHI scores) from 2012 to 2017, on average. To say anything substantive about this trend, though, would obviously require more sophisticated methods as well as a longer time series. Yet, the value of the hhi package is allowing for quick calculation and visualization of HHI scores over time. You can download the package from CRAN or directly from the package installation context in RStudio. And as always, if you have any questions or find any bugs requiring fixing, please feel free to contact me.

As a final note, here are a few references for further reading on the HHI and its original calculation and intuition:

Herfindahl, Orris C. 1950. “Concentration in the steel industry.” Ph.D. dissertation, Columbia University.

Hirschman, Albert O. 1945. “National power and structure of foreign trade.” Berkeley, CA: University of California Press.

Rhoades, Stephen A. 1993. “The herfindahl-hirschman index.” Federal Reserve Bulletin 79: 188.


Thanks and enjoy!

Introducing purging: An R package for addressing mediation effects

A Simple Method for Purging Mediation Effects among Independent Variables

Mediation can occur when one independent variable swamps the effect of another, suggesting high correlation between the two variables. Though there are some great packages for mediation analysis out there, the simple intuition of its need is often ambiguous, especially for younger graduate students. Thus, in this blog post, it is my goal to introduce an intuitive overview of mediation and offer a simple method for “purging” variables of mediation effects for their simultaneous use in multivariate analysis. The purging process detailed in this blog is available in my recently released R package, purging, which is available on CRAN or at my GitHub.  

Let’s consider a couple practical examples from “real life” research contexts. First, suppose we are interested in whether committee membership relating to a specific issue domain influences the likelihood of sponsoring related issue-specific legislation. However, in the American context as representational responsibilities permeate legislative behavior, district characteristics in similar employment-related industries likely influence self-selection onto the issue-specific committees in the first place, which we also suggest should influence likelihood of related-issue bill sponsorship. Therefore, in this context, we have a mediation model, where employment/industry (indirect) -> committee membership (direct) -> sponsorship. Thus, we would want to purge committee membership of the effects of employment/industry in the district to observe the “pure” effect of committee membership on the likelihood of related sponsorship. This example is from my paper recently published in American Politics Research.

Or consider a second example in a different realm. Let’s say we had a model where women’s level of labor force participation determines their level of contraceptive use, and that the effect of female labor force participation on fertility is indirect, essentially filtered through its impact on contraceptive use. Once we control for contraceptive use, the direct effect of labor force participation may go away. In other words, the effect of labor force participation on fertility is likely indirect, and filtered through contraceptive use, which means the variables are also highly correlated. This second example was borrowed from Scott Basinger’s and Patrick Shea’s (University of Houston) graduate statistics labs, which originally gave me the idea of expanding this out to develop an R package dedicated to addressing this issue in a variety of contexts and for several functional forms.

These two examples offer simple ways of thinking about mediation effects (e.g., labor force (indirect) -> contraception (direct) -> fertility). If we run into this problem, a simple solution is “purging”. The steps to purge are to, first, regress the direct variable (in the second case, “contraceptive use”) on the indirect variable (in the second case, “labor force participation”). Then, store those residuals, which is the direct effect of contraception after accounting for the indirect effect of labor force participation. Then, we add the stored residuals as their own “purged variable” in the updated specification. Essentially, this purging process allows for a new direct variable that is uncorrelated with the indirect variable. When we do this, we will see that each variable is explaining unique variance in the DV of interest (you can double check this several ways, such as comparing correlation coefficients (which we will do below) or by comparing R^2 across specifications).

An Applied Example

With the intuition behind mediation and the purging solution in mind, let’s walk through a simple example using some fake data. For an example based on the second case described above using real data from the United Nations Human Development Programme, see the code file, purging example.R, at my GitHub repository.

# First, install the MASS package for the "mvrnorm" function
install.packages("MASS")
library(MASS)
# Second, install the purging package directly from CRAN for the "purge.lm" function
install.packages("purging")
library(purging)

# Set some paramters to guide our simulation
n = 5000
rho = 0.9

# Create some fake data
d = mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), nrow = 2), empirical = TRUE)

# Store each correlated variable as its own object 
X = d[, 1]
Y = d[, 2]

# Create a dataframe of your two variables
d = data.frame(X, Y)

# Verify the correlation between these two normally distributed variables is what we set (rho = 0.9)
cor(d$X, d$Y)
plot(d$X, d$Y) 
 

In addition to the correlation coefficient between the two variables being exactly as we specified (0.90), see this positive correlation between the two random variables in the plot below.



Now, with our correlated data created, we can call the “purge.lm” command, given that our data are continuous. Note: the package supports a variety of functional forms for continuous (linear), binary (logit and probit), and event count data (Poisson and negative binomial).

The idea behind the package is to generate the new direct-impact variable to be used in the analysis, purged of the effects of the indirect variable. To do so, simply input the name of the data frame first, followed by the name of the direct variable in quotes, and then the indirect variable also in quotes in the function. Calling the function will generate a new object (i.e., the direct variable), which can then be added to a data frame using the $ operator, with the following line of code:
 df$purged.var <- purged.var
Let’s now see the purge command in action using our fake data.
 
# Purge the "direct" variable, Y, of the mediation effects of X 
# (direct/indirect selection will depend on your model specification) 
purge.lm(d, "Y", "X") # df, "direct", "indirect" 
 
# You will get an automatic suggestion message to store the values in the df 
purged = purge.lm(d, "Y", "X") # Store as its own object 
d$purged = purged # Attach to df 
 
# Finally, check the correlation and the plot to see the effects purged from the original "Y" (direct) variable 
cor(d$X, d$purged) 
plot(d$X, d$purged) 

Note the correlation between the original indirect (X) variable and the new direct (Y) variable, purged of the effects of X, is -9.211365e-17, or essentially non-existent. For additional corroboration, let’s see the updated correlation plot between the X and purged-Y variables.



The purge command did as expected, with the correlation between the two variables essentially gone. You can download the package and documentation at CRAN. If you have any questions or find any bugs requiring fixing, please feel free to contact me. As this procedure was first developed and implemented (using the binary/logit iteration discussed above in the first example) in a now-published paper, please cite use of the package as: Waggoner, Philip D. 2018. “Do Constituents Influence Issue-Specific Bill Sponsorship?” American Politics Research, <https://doi.org/10.1177/1532673X18759644>

As a final note, once the intuition is mastered, be sure to check out the great work on mediation from many folks, including Kosuke Imai (Princeton), Luke Keele (Georgetown), and several others. See Imai’s mediation site as a sound starting place with code, papers, and more.

Thanks and enjoy!

Discriminant Analysis: Statistics All The Way

Discriminant analysis is used when the variable to be predicted is categorical in nature. This analysis requires that the way to define data points to the respective categories is known which makes it different from cluster analysis where the classification criteria is not know. It works by calculating a score based on all the predictor variables and based on the values of the score, a corresponding class is selected. Hence, the name discriminant analysis which, in simple terms, discriminates data points and classifies them into classes or categories based on analysis of the predictor variables. This article delves into the linear discriminant analysis function in R and delivers in-depth explanation of the process and concepts. Before we move further, let us look at the assumptions of discriminant analysis which are quite similar to MANOVA.

  • Since we are dealing with multiple features, one of the first assumptions that the technique makes is the assumption of multivariate normality that means the features are normally distributed when separated for each class. This also implies that the technique is susceptible to possible outliers and is also sensitive to the group sizes. If there is an imbalance between the group sizes and one of the groups is too small or too large, the technique suffers when classifying data points into that ‘outlier’ class
  • The second assumption is about homoscedasticity. This states that the variance of the features is same across all the classes of the predictor feature
  • We also assume that the features are sampled randomly
  • The final assumption is about the absence of multicollinearity. If the variables are correlated with each other, the predictive ability will decrease.
Though the discriminant analysis can discriminate features non-linearly as well, linear discriminant analysis is a simpler and more popular methodology. We have normally distributed conditional probability functions for each class. If y is the class to be predicted with two values, 1 and 2 and x is the combined set of all the predictor features, we can assume a threshold value T such that the value which comes as a result of linear combination of features of x belongs to class 1 if it is less than T and belongs to class 2 otherwise. Mathematically,

(x−μ1)TΣ1−1(x−μ1)+ln|Σ1|−(x−μ2)TΣ2−1(x−μ2)−ln|Σ2|T

Where (μ1,Σ1) and(μ2, Σ2) are the respective means and variances of x for class 1 and class 2. We sometimes simplify our calculations by assuming equal variances of the two classes to get a simplified version w.x>c where c is the threshold and w is the weight combined with x.

Let’s understand Fisher’s LDA which is one of the most popular variants of LDA

Fisher’s Linear Discriminant analysis – How and when to use it?

Fisher’s linear discriminant finds out a linear combination of features that can be used to discriminate between the target variable classes. In Fisher’s LDA, we take the separation by the ratio of the variance between the classes to the variance within the classes. To understand it in a different way, it is the interclass variance to intraclass variance ratio
S= 𝛔2between/𝛔2within = (w⋅(μ2−μ1))2/ wT(Σ1+Σ2)w
Fisher’s LDA maximizes this ratio and has a lot of applications. One of the recent applications involve classification of speech and audio. Other past usages include face recognition where Fisher’s LDA is used to create Fisher’s Faces and combined with PCA technique to get eigenfaces. Fisher’s LDA also finds usages in earth science, biomedical science, bankruptcy problems and finance along with in marketing. That’s all on the theoretical aspect of LDA. Let’s understand using an example in R.

LDA Classification example in R

R has a MASS package which has the lda() function. For dataset, we will use the iris dataset and try to classify the classes.
#Load the library containing lda() function
library(MASS)
#Store the dataset
dataset=iris

Before running the lda() function, let’s start with the help documentation of lda()
#Help Documentation
?lda
The description for lda() is minimalistic and simple. We are interested in the details section of the documentation which describes the process which the function uses. As the documentation mentions – the lda() function also tries to detect if the within-class covariance matrix is singular. We can also define a tolerance such that if any variable has within-group variance less than tol^2 it will stop and report the variable as a constant. Another possible adjustment is the prior probabilities. The prior parameter in lda() function is used to specify the prior probabilities. If not specified, the function calculates the prior probabilities to be the same as the distribution of classes in the data. These prior probabilities also affect the rotation of the linear discriminants. Let us proceed with performing linear discriminant analysis over the iris dataset.
#Perform LDA over the data
lda_iris=lda(Species~.,data=dataset)
#Prior Probabilities and coefficients of Linear discriminants
lda_iris

Call:
lda(Species ~ ., data = dataset)

Prior probabilities of groups:
        setosa      versicolor      virginica 
    0.3333333   0.3333333   0.3333333 

Group means:
                Sepal.Length        Sepal.Width         Petal.Length        Petal.Width
setosa              5.006               3.428               1.462               0.246
versicolor          5.936               2.770               4.260               1.326
virginica           6.588               2.974               5.552               2.026

Coefficients of linear discriminants:
                            LD1             LD2
Sepal.Length        0.8293776   0.02410215
Sepal.Width         1.5344731   2.16452123
Petal.Length        -2.2012117  -0.93192121
Petal.Width         -2.8104603      2.83918785

Proportion of trace:
        LD1         LD2 
0.9912  0.0088 


#Check the accuracy of our analysis
Predictions=predict(lda_iris,dataset)
table(Predictions$class, dataset$Species)
                setosa      versicolor  virginica
  setosa            50              0           0
  versicolor        0           48          1
  virginica         0           2           49
With LDA, we are able to classify all but 3 data points correctly in iris dataset. This is probably because the iris data is linearly separable. How do we know whether a data is linearly separable or not? We use the pairs function to see the scatter plots of data and see if they are separable
#Check how easily we can linearly separate the iris dataset
pairs(dataset)

As we can see, one of the classes is completely separate while the other two are somewhat overlapping. However, LDA is still able to distinguish between the two. A better version of using lda is lda() with CV. This can be done by passing the CV=TRUE in the lda function.
#LDA with CV
lda_cv_iris=lda(Species~.,data=dataset,CV=TRUE)

#The predictions are already generated in lda_cv_iris
table(lda_cv_iris$class, dataset$Species)
I didn’t generate the summary for the model as it will also produce all the predictions. As we already know from the summary of lda_iris, the function first calculates the prior probabilities of the classes in the dataset unless provided specifically. The iris dataset had 50 data points for each class hence the prior probabilities are calculated to be 0.33 each. It then makes the necessary calculations which involves means of each class and overall variance and gets the linear discriminant. The function also scales the value of the linear discriminants so that the mean is zero and variance is one. The final value, proportion of trace that we get is the percentage separation that each of the discriminant achieves. Thus, the first linear discriminant is enough and achieves about 99% of the separation. As a final step, we will plot the linear discriminants and visually see the difference in distinguishing ability. The ldahist() function helps make the separator plot. For the data into the ldahist() function, we can use the x[,1] for the first linear discriminant and x[,2] for the second linear discriminant and so on
#Plot the predictions - first linear discriminant
ldahist(data = Predictions$x[,1], g=Species)

The data points are almost completely separated by the first linear discriminant and that is why we see the three classes in different ranges of values. To further our understanding, we also see the second linear discriminant.
#Plot the predictions - second linear discriminant
ldahist(data = Predictions$x[,2], g=Species)

From the plot of the second linear discriminant, we see that we can hardly differentiate between the three groups hence the proportion of trace values.

Everything is not linear – quadratic discriminant analysis

MASS package also contains the qda() function which stands for quadratic discriminant analysis. The idea is simple – if the data can be discriminated using a quadratic function, we can use qda() instead of lda(). The rest of the nuances are the same for qda() as were in lda()
#QDA
qda_iris=qda(Species~.,data=dataset)
qda_iris


Call:
qda(Species ~ ., data = dataset)

Prior probabilities of groups:
            setosa      versicolor      virginica 
        0.3333333   0.3333333   0.3333333 

Group means:
                Sepal.Length        Sepal.Width         Petal.Length        Petal.Width
setosa              5.006               3.428               1.462               0.246
versicolor          5.936               2.770               4.260               1.326
virginica           6.588               2.974               5.552               2.026

#Check the accuracy of our analysis of qda
Predictions_qda=predict(qda_iris,dataset)
table(Predictions_qda$class, dataset$Species)
                setosa      versicolor  virginica
  setosa            50              0           0
  versicolor        0           48          1
  virginica         0           2           49
Since the data has a linear relation, the qda function also applies the same statistics and returns similar results.

Conclusion: Evaluating LDA and QDA

Even though LDA is a tough problem to understand, its implementation in R is simple. As a final step, we will look into another package- the klaR package which helps to create an exploratory graph for LDA or QDA. The package contains the partimat() function which takes a similar input as the lda() function but returns a plot instead of the model. The function stands for partition matrix and plots the ability of the features to partition the target class taking combinations of two at a time.
#Using the klarR package
# install.packages("klaR")
library(klaR)
partimat(Species~.,data=dataset,method="lda")

Our data has four features so we have 4C2 =6 combinations to classify our data. The plot show how different classes are defined based on the two features on x-axis and y-axis. As a summary, it is important to know that one should look at the data first to know whether the data seems to be linearly separable (or quadratically separable in case of qda) before selecting the technique. Since LDA makes some assumptions about the data, we also need to preprocess the data and perform univariate analysis to see if the normality assumption holds for each class of the data. In the absence of normality, that is, in case there is a violation of the normality condition, one can still proceed with LDA or QDA but the results will not be appropriate and will lack in accuracy. We also need to analyze whether the features are related to each other and some of them need to be omitted from our analysis. The rest is up to the lda() function to calculate and make predictions on. Here is the entire code used in this article:
#Load the library containing lda() function
library(MASS)
#Store the dataset
dataset=iris

#Help Documentation
?lda

#Perform LDA over the data
lda_iris=lda(Species~.,data=dataset)
#Prior Probabilities and coefficients of Linear discriminants
lda_iris

#Check the accuracy of our analysis
Predictions=predict(lda_iris,dataset)
table(Predictions$class, dataset$Species)

#Check how easily we can linearly separate the iris dataset
pairs(dataset)

#LDA with CV
lda_cv_iris=lda(Species~.,data=dataset,CV=TRUE)

#The predictions are already generated in lda_cv_iris
table(lda_cv_iris$class, dataset$Species)

#Plot the predictions - first linear discriminant
ldahist(data = Predictions$x[,1], g=Species)

#Plot the predictions - second linear discriminant
ldahist(data = Predictions$x[,2], g=Species)

#QDA
qda_iris=qda(Species~.,data=dataset)
qda_iris

#Check the accuracy of our analysis of qda
Predictions_qda=predict(qda_iris,dataset)
table(Predictions_qda$class, dataset$Species)

#Using the klarR packagew
# install.packages("klaR")
library(klaR)
partimat(Species~.,data=dataset,method="lda")

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Jyothirmayee Thondamallu and Saneesh Veetil contributed to this article.
Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

Steps to Perform Survival Analysis in R

Another way of analysis?

When there are so many tools and techniques of prediction modelling, why do we have another field known as survival analysis? As one of the most popular branch of statistics, Survival analysis is a way of prediction at various points in time. This is to say, while other prediction models make predictions of whether an event will occur, survival analysis predicts whether the event will occur at a specified time. Thus, it requires a time component for prediction and correspondingly, predicts the time when an event will happen. This helps one in understanding the expected duration of time when events occur and provide much more useful information. One can think of natural areas of application of survival analysis which include biological sciences where one can predict the time for bacteria or other cellular organisms to multiple to a particular size or expected time of decay of atoms. Some interesting applications include prediction of the expected time when a machine will break down and maintenance will be required

How hard does it get..

It is not easy to apply the concepts of survival analysis right off the bat. One needs to understand the ways it can be used first. This includes Kaplan-Meier Curves, creating the survival function through tools such as survival trees or survival forests and log-rank test. Let’s go through each of them one by one in R. We will use the survival package in R as a starting example. The survival package has the surv() function that is the center of survival analysis.
# install.packages("survival")
# Loading the package
library("survival")
The package contains a sample dataset for demonstration purposes. The dataset is pbc which contains a 10 year study of 424 patients having Primary Biliary Cirrhosis (pbc) when treated in Mayo clinic. A point to note here from the dataset description is that out of 424 patients, 312 participated in the trial of drug D-penicillamine and the rest 112 consented to have their basic measurements recorded and followed for survival but did not participate in the trial. 6 of these 112 cases were lost. We are particularly interested in ‘time’ and ‘status’ features in the dataset. Time represents the number of days after registration and final status (which can be censored, liver transplant or dead). Since it is survival, we will consider the status as dead or not-dead (transplant or censored). Further details about the dataset can be read from the command:
#Dataset description
?pbc
We start with a direct application of the Surv() function and pass it to the survfit() function. The Surv() function will take the time and status parameters and create a survival object out of it. The survfit() function takes a survival object (the one which Surv() produces) and creates the survival curves.
#Fitting the survival model
survival_func=survfit(Surv(pbc$time,pbc$status == 2)~1)
survival_func

Call: survfit(formula = Surv(pbc$time, pbc$status == 2) ~ 1)

        n   events      median  0.95LCL     0.95UCL 
        418         161         3395        3090        3853
The function gives us the number of values, the number of positives in status, the median time and 95% confidence interval values. The model can also be plotted.
#Plot the survival model
plot(survival_func)

As expected, the plot shows us the decreasing probabilities for survival as time passes. The dashed lines are the upper and lower confidence intervals. In the survfit() function here, we passed the formula as ~ 1 which indicates that we are asking the function to fit the model solely on the basis of survival object and thus have an intercept. The output along with the confidence intervals are actually Kaplan-Meier estimates. This estimate is prominent in medical research survival analysis. The Kaplan – Meier estimates are based on the number of patients (each patient as a row of data) from the total number who survive for a certain time after treatment. (which is the event). We can represent the Kaplan – Meier function by the formula:
Ŝ(t)=∏(1-di/ni) for all i where ti≤t
Here, di the number of events and ni is the total number of people at risk at time ti

What to make of the graph?

Unlike other machine learning techniques where one uses test samples and makes predictions over them, the survival analysis curve is a self – explanatory curve. From the curve, we see that the possibility of surviving about 1000 days after treatment is roughly 0.8 or 80%. We can similarly define probability of survival for different number of days after treatment. At the same time, we also have the confidence interval ranges which show the margin of expected error. For example, in case of surviving 1000 days example, the upper confidence interval reaches about 0.85 or 85% and goes down to about 0.75 or 75%. Post the data range, which is 10 years or about 3500 days, the probability calculations are very erratic and vague and should not be taken up. For example, if one wants to know the probability of surviving 4500 days after treatment, then though the Kaplan – Meier graph above shows a range between 0.25 to 0.55 which is itself a large value to accommodate the lack of data, the data is still not sufficient enough and a better data should be used to make such an estimate.

Alternative models: Cox Proportional Hazard model

The survival package also contains a cox proportional hazard function coxph() and use other features in the data to make a better survival model. Though the data has untreated missing values, I am skipping the data processing and fitting the model directly. In practice, however, one needs to study the data and look at ways to process the data appropriately so that the best possible models are fitted. As the intention of this article is to get the readers acquainted with the function rather than processing, applying the function is the shortcut step which I am taking.
# Fit Cox Model
Cox_model = coxph(Surv(pbc$time,pbc$status==2) ~.,data=pbc)
summary(Cox_model)

Call:
coxph(formula = Surv(pbc$time, pbc$status == 2) ~ ., data = pbc)

  n= 276, number of events= 111 
   (142 observations deleted due to missingness)

                coef    exp(coef)       se(coef)        z   Pr(>|z|)   
id              -2.729e-03      9.973e-01   1.462e-03   -1.866  0.06203 . 
trt             -1.116e-01      8.944e-01   2.156e-01   -0.518  0.60476   
age         3.191e-02   1.032e+00   1.200e-02   2.659   0.00784 **
sexf            -3.822e-01      6.824e-01   3.074e-01   -1.243  0.21378   
ascites     6.321e-02   1.065e+00   3.874e-01   0.163   0.87038   
hepato      6.257e-02   1.065e+00   2.521e-01   0.248   0.80397   
spiders     7.594e-02   1.079e+00   2.448e-01   0.310   0.75635   
edema       8.860e-01   2.425e+00   4.078e-01   2.173   0.02980 * 
bili            8.038e-02   1.084e+00   2.539e-02   3.166   0.00155 **
chol        5.151e-04   1.001e+00   4.409e-04   1.168   0.24272   
albumin     -8.511e-01      4.270e-01   3.114e-01   -2.733  0.00627 **
copper      2.612e-03   1.003e+00   1.148e-03   2.275   0.02290 * 
alk.phos    -2.623e-05      1.000e+00   4.206e-05   -0.624  0.53288   
ast         4.239e-03   1.004e+00   1.941e-03   2.184   0.02894 * 
trig            -1.228e-03      9.988e-01   1.334e-03   -0.920  0.35741   
platelet    7.272e-04   1.001e+00   1.177e-03   0.618   0.53660   
protime     1.895e-01   1.209e+00   1.128e-01   1.680   0.09289 . 
stage       4.468e-01   1.563e+00   1.784e-01   2.504   0.01226 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                exp(coef)   exp(-coef)  lower .95   upper .95
id              0.9973      1.0027      0.9944      1.000
trt             0.8944      1.1181      0.5862      1.365
age             1.0324      0.9686      1.0084      1.057
sexf            0.6824      1.4655      0.3736      1.246
ascites         1.0653      0.9387      0.4985      2.276
hepato          1.0646      0.9393      0.6495      1.745
spiders         1.0789      0.9269      0.6678      1.743
edema           2.4253      0.4123      1.0907      5.393
bili            1.0837      0.9228      1.0311      1.139
chol            1.0005      0.9995      0.9997      1.001
albumin     0.4270      2.3422      0.2319      0.786
copper          1.0026      0.9974      1.0004      1.005
alk.phos        1.0000      1.0000      0.9999      1.000
ast             1.0042      0.9958      1.0004      1.008
trig            0.9988      1.0012      0.9962      1.001
platelet        1.0007      0.9993      0.9984      1.003
protime         1.2086      0.8274      0.9690      1.508
stage           1.5634      0.6397      1.1020      2.218

Concordance= 0.849  (se = 0.031 )
Rsquare= 0.462   (max possible= 0.981 )
Likelihood ratio test= 171.3  on 18 df,   p=0
Wald test            = 172.5  on 18 df,   p=0
Score (logrank) test = 286.1  on 18 df,   p=0
The Cox model output is similar to how a linear regression output comes up. The R2 is only 46% which is not high and we don’t have any feature which is highly significant. The top important features appear to be age, bilirubin (bili) and albumin. Let’s see how the plot looks like.
#Create a survival curve from the cox model
Cox_curve <- survfit(Cox_model)
plot(Cox_curve)

With more data, we get a different plot and this one is more volatile. Compared to the Kaplan – Meier curve, the cox-plot curve is higher for the initial values and lower for the higher values. The major reason for this difference is the inclusion of variables in cox-model. The plots are made by similar functions and can be interpreted the same way as the Kaplan – Meier curve.

Going traditional : Using survival forests

Random forests can also be used for survival analysis and the ranger package in R provides the functionality. However, the ranger function cannot handle the missing values so I will use a smaller data with all rows having NA values dropped. This will reduce my data to only 276 observations.
#Using the Ranger package for survival analysis
Install.packages("ranger")
library(ranger)

#Drop rows with NA values
pbc_nadrop=pbc[complete.cases(pbc), ]
#Fitting the random forest
ranger_model <- ranger(Surv(pbc_nadrop$time,pbc_nadrop$status==2) ~.,data=pbc_nadrop,num.trees = 500, importance = "permutation",seed = 1)

#Plot the death times
plot(ranger_model$unique.death.times,ranger_model$survival[1,], type = "l", ylim = c(0,1),)

Let’s look at the variable importance plot which the random forest model calculates.
#Get the variable importance
data.frame(sort(ranger_model$variable.importance,decreasing = TRUE))
sort.ranger_model.variable.importance..decreasing...TRUE.

bili                                                    0.0762338981
copper                                                  0.0202733989
albumin                                                 0.0165070226
age                                                     0.0130134413
edema                                                   0.0122113704
ascites                                                 0.0115315711
chol                                                    0.0092889960
protime                                                 0.0060215073
id                                                      0.0055867915
ast                                                     0.0049932803
stage                                                   0.0030225398
hepato                                                  0.0029290675
trig                                                    0.0028869184
platelet                                                0.0012958105
sex                                                     0.0010639806
spiders                                                 0.0005210531
alk.phos                                                0.0003291581
trt                                                     -0.0002020952
These numbers may be different for different runs. In my example, we see that bilirubin is the most important feature.

Lessons learned: Conclusion

Though the input data for Survival package’s Kaplan – Meier estimate, Cox Model and ranger model are all different, we will compare the methodologies by plotting them on the same graph using ggplot.
#Comparing models
library(ggplot2)

#Kaplan-Meier curve dataframe
#Add a row of model name
km <- rep("Kaplan Meier", length(survival_func$time))
#Create a dataframe
km_df <- data.frame(survival_func$time,survival_func$surv,km)
#Rename the columns so they are same for all dataframes
names(km_df) <- c("Time","Surv","Model")

#Cox model curve dataframe
#Add a row of model name
cox <- rep("Cox",length(Cox_curve$time))
#Create a dataframe
cox_df <- data.frame(Cox_curve$time,Cox_curve$surv,cox)
#Rename the columns so they are same for all dataframes
names(cox_df) <- c("Time","Surv","Model")

#Dataframe for ranger
#Add a row of model name
rf <- rep("Survival Forest",length(ranger_model$unique.death.times))
#Create a dataframe
rf_df <- data.frame(ranger_model$unique.death.times,sapply(data.frame(ranger_model$survival),mean),rf)
#Rename the columns so they are same for all dataframes
names(rf_df) <- c("Time","Surv","Model")

#Combine the results
plot_combo <- rbind(km_df,cox_df,rf_df)

#Make a ggplot
plot_gg <- ggplot(plot_combo, aes(x = Time, y = Surv, color = Model))
plot_gg + geom_line() + ggtitle("Comparison of Survival Curves")

We see here that the Cox model is the most volatile with the most data and features. It is higher for lower values and drops down sharply when the time increases. The survival forest is of the lowest range and resembles Kaplan-Meier curve. The difference might be because of Survival forest having less rows. The essence of the plots is that there can be different approaches to the same concept of survival analysis and one may choose the technique based on one’s comfort and situation. A better data with processed data points and treated missing values might fetch us a better R2 and more stable curves. At the same time, they will help better in finding time to event cases such as knowing the time when a promotion’s effect dies down, knowing when tumors will develop and become significant and lots of other applications with a significant chunk of them being from medical science. Survival, as the name suggests, relates to surviving objects and is thus related to event occurrence in a completely different way than machine learning. It is important to know this technique to know more and more ways data can help us in solving problems, with time involved in this particular case. Hope this article serves the purpose of giving a glimpse of survival analysis and the feature rich packages available in R.

Here is the complete code for the article:

# install.packages("survival")
# Loading the package
library("survival")

#Dataset description
?pbc

#Fitting the survival model
survival_func=survfit(Surv(pbc$time,pbc$status == 2)~1)
survival_func

#Plot the survival model
plot(survival_func)

# Fit Cox Model
Cox_model = coxph(Surv(pbc$time,pbc$status==2) ~.,data=pbc)
summary(Cox_model)

#Create a survival curve from the cox model
Cox_curve <- survfit(Cox_model)
plot(Cox_curve)

#Using the Ranger package for survival analysis
#install.packages("ranger")
library(ranger)

#Drop rows with NA values
pbc_nadrop=pbc[complete.cases(pbc), ]
#Fitting the random forest
ranger_model <- ranger(Surv(pbc_nadrop$time,pbc_nadrop$status==2) ~.,data=pbc_nadrop,num.trees = 500, importance = "permutation",seed = 1)

#Plot the death times
plot(ranger_model$unique.death.times,ranger_model$survival[1,], type = "l", ylim = c(0,1),)

#Get the variable importance
data.frame(sort(ranger_model$variable.importance,decreasing = TRUE))

#Comparing models
library(ggplot2)

#Kaplan-Meier curve dataframe
#Add a row of model name
km <- rep("Kaplan Meier", length(survival_func$time))
#Create a dataframe
km_df <- data.frame(survival_func$time,survival_func$surv,km)
#Rename the columns so they are same for all dataframes
names(km_df) <- c("Time","Surv","Model")

#Cox model curve dataframe
#Add a row of model name
cox <- rep("Cox",length(Cox_curve$time))
#Create a dataframe
cox_df <- data.frame(Cox_curve$time,Cox_curve$surv,cox)
#Rename the columns so they are same for all dataframes
names(cox_df) <- c("Time","Surv","Model")

#Dataframe for ranger
#Add a row of model name
rf <- rep("Survival Forest",length(ranger_model$unique.death.times))
#Create a dataframe
rf_df <- data.frame(ranger_model$unique.death.times,sapply(data.frame(ranger_model$survival),mean),rf)
#Rename the columns so they are same for all dataframes
names(rf_df) <- c("Time","Surv","Model")

#Combine the results
plot_combo <- rbind(km_df,cox_df,rf_df)

#Make a ggplot
plot_gg <- ggplot(plot_combo, aes(x = Time, y = Surv, color = Model))
plot_gg + geom_line() + ggtitle("Comparison of Survival Curves")

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Vishnu Reddy and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

Whys and Hows of Apply Family of Functions in R

Introduction to Looping system

Imagine you were to perform a simple task, let’s say calculating sum of columns for 3X3 matrix, what do you think is the best way? Calculating it directly using traditional methods such as calculator or even pen and paper doesn’t sound like a bad approach. A lot of us may prefer to just calculate it manually instead of writing an entire piece of code for such a small dataset.

Now, if the dataset is 10X10 matrix, would you do the same? Not sure.

Now, if the dataset is further bigger, let’s say 100X100 matrix or 1000X1000 matrix or 5000X5000 matrix, would you even think of doing it manually? I won’t.

But, let’s not worry ourselves with this task because it is not as big a task as it may look at prima facie. There’s a concept called ‘looping’ which comes to our rescue in such situations. I’m sure whosoever has ever worked on any programming language, they must have encountered loops and looping system. It is one of the most useful concepts in any programming language. ‘Looping system’ is nothing but an iterative system that performs a specific task repeatedly until given conditions are met or it is forced to break. Looping system comes in handy when we have to carry a task iteratively; we may or may not know before-hand how many iterations to be carried out. Instead of writing the same piece of code tens or hundreds or thousands of times, we write a small piece of code using loops and it does the entire task for us.

There are majorly two loops which are used extensively in programming – for loop and while loop. In the case of ‘for loop’ we know before hand as to how many times we want the loop to run or we know before-hand the number of iterations that should be carried out. Let’s take a very simple example of printing number 1 to 10. One way could be we write the code of printing every number from 1 to 10; while, the other and smart way would be to write a two-line code that will do the work for us.
for (i in 1:10) {
  print(i) 
}
The above code should print values from 1 to 10 for us.
> for (i in 1:10) {
+   print(i) 
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
The other very powerful loop is ‘while’ loop. In while loop, we don’t know before-hand as to how many iterations should the loop perform. It works till a certain condition is being met, as soon as the condition is violated the loop breaks.
i = 1
while (i < 10) {
  print(i)
  i = i+1
}
In the above code, we don’t know how many iterations are there, we just know that the loop should work until the value of ‘i’ is less than 10 and it does the same.
> i = 1
> while (i < 10) {
+   print(i)
+   i = i+1
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
Using very basic examples, we have seen how powerful these loops can be. However, there is one disadvantage of using these loops in R language – they make our code run slow. The number of computations that needs to be carried increases and this increases the time that system takes to execute the code.

But we need not worry about this limitation as R offers a very good alternative, vectorization, to using these loops in lot of conditions. Vectorization, as the name suggests, is an operation of converting scalars or plain numbers in to single operation on vectors or matrices. A lot of functions that are performed by loops can be performed through vectorization. Moreover, vectorization makes calculation and running of processes faster because they convert the code into lower language such as C, C++ which further contains loops to execute the operation. User need not worry about these aspects of vectorization and can just do fine with direct functions.

Based on the concept of vectorization is a family of functions in R called ‘apply’ family function. It is a part of base R package. There are multiple functions in the apply family. We will go through them one by one and check their implementation, alongside, in R. The functions in apply family are apply, sapply, lapply, mapply, rapply, tapply and vapply. Their usage depends on the kind of input data we have, kind of output we want to see and the kind of operations we want to perform on data. Let’s go through some of these functions and implement toy examples using them.

Apply Function

Apply function is the most commonly used function in apply family. It works on arrays or matrices. The syntax of apply function is follows:
Apply(x, margin, function, ….)
Where,
  • X refers to an array or matrix on which operation is to be performed
  • Margin refers to how the function is to be applied; margin =1 means function is to be applied on rows, while margin = 2 means function is to be applied on columns. Margin = c(1,2) means function is to be applied on both row and column.
  • Function refers to the operation that is to be performed on the data. It could be predefined functions in R such as Sum, Stddev, ColMeans or it could be user defined function.
Let’s take an example and use the function to see how it can help us.
ApplyFun = matrix(c(1:16), 4,4)

ApplyFun

apply(ApplyFun,2,sum)

apply(ApplyFun,2,mean)

apply(ApplyFun,1,var)

apply(ApplyFun,1,sum)
In the above code, we have applied sum and mean function on columns of the matrix; while variance and sum function on the rows. Let’s see the output of the above code.
> ApplyFun = matrix(c(1:16), 4,4)

> ApplyFun
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
 
> apply(ApplyFun,2,sum)
[1] 10 26 42 58
 
> apply(ApplyFun,2,mean)
[1]  2.5  6.5 10.5 14.5
 
> apply(ApplyFun,1,var)
[1] 26.66667 26.66667 26.66667 26.66667
 
> apply(ApplyFun,1,sum)
[1] 28 32 36 40
Let’s understand the first statement that we executed; others are based on the same logic. We first generated a matrix as below:
> ApplyFun
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
Now, in the second line [apply(ApplyFun,2,sum)], we are trying to calculate the sum of all the columns of the matrix. Here, ‘2’ means that the operation should be performed on the columns and sum is the function that should be executed. The output generated here is a vector.

Lapply Function

Lapply function is similar to apply function but it takes list or data frame as an input and returns list as an output. It has a similar syntax to apply function. Let’s take a couple of examples and see how it can be used.
LapplyFun = list(a = 1:5, b = 10:15, c = 21:25)

LapplyFun

lapply(LapplyFun, FUN = mean)

lapply(LapplyFun, FUN = median)

> LapplyFun = list(a = 1:5, b = 10:15, c = 21:25)
> LapplyFun
$a
[1] 1 2 3 4 5

$b
[1] 10 11 12 13 14 15

$c
[1] 21 22 23 24 25

> lapply(LapplyFun, FUN = mean)
$a
[1] 3

$b
[1] 12.5

$c
[1] 23

> lapply(LapplyFun, FUN = median)
$a
[1] 3

$b
[1] 12.5

$c
[1] 23

Sapply Function

Sapply function is similar to lapply, but it returns a vector as an output instead of list.
set.seed(5)

SapplyFun = list(a = rnorm(5), b = rnorm(5), c = rnorm(5))

SapplyFun

sapply(SapplyFun, FUN = mean)

> set.seed(5)
> 
> SapplyFun = list(a = rnorm(5), b = rnorm(5), c = rnorm(5))
> 
> SapplyFun
$a
[1] -0.84085548  1.38435934 -1.25549186  0.07014277  1.71144087

$b
[1] -0.6029080 -0.4721664 -0.6353713 -0.2857736  0.1381082

$c
[1]  1.2276303 -0.8017795 -1.0803926 -0.1575344 -1.0717600

> 
> sapply(SapplyFun, FUN = mean)
         a          b          c 
 0.2139191 -0.3716222 -0.3767672 
Let’s take another example and see the difference between lapply and sapply in further detail.
X = matrix(1:9,3,3)
X

Y = matrix(11:19,3,3)
Y

Z = matrix(21:29,3,3)
Z

Comp.lapply.sapply = list(X,Y,Z)
Comp.lapply.sapply

> X = matrix(1:9,3,3)
> X
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> 
> Y = matrix(11:19,3,3)
> Y
     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19
> 
> Z = matrix(21:29,3,3)
> Z
     [,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29
> Comp.lapply.sapply = list(X,Y,Z)
> Comp.lapply.sapply
[[1]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[2]]
     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19

[[3]]
     [,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29


lapply(Comp.lapply.sapply,"[", , 2)

lapply(Comp.lapply.sapply,"[", 1, )

lapply(Comp.lapply.sapply,"[", 1, 2)

> lapply(Comp.lapply.sapply,"[", , 2)
[[1]]
[1] 4 5 6

[[2]]
[1] 14 15 16

[[3]]
[1] 24 25 26

> lapply(Comp.lapply.sapply,"[", 1, )
[[1]]
[1] 1 4 7

[[2]]
[1] 11 14 17

[[3]]
[1] 21 24 27

> lapply(Comp.lapply.sapply,"[", 1, 2)
[[1]]
[1] 4

[[2]]
[1] 14

[[3]]
[1] 24

Now, getting the output of last statement using sapply function.
> sapply(Comp.lapply.sapply,"[", 1,2)
[1]  4 14 24
We can see the difference between lapply and sapply in the above example. Lapply returns the list as an output; while sapply returns vector as an output.

Mapply Function

Mapply function is similar to sapply function, which returns vector as an output and takes list as an input. Let’s take an example and understand how it works.
X = matrix(1:9,3,3)
X

Y = matrix(11:19,3,3)
Y

Z = matrix(21:29,3,3)
Z

mapply(sum,X,Y,Z)

> X = matrix(1:9,3,3)
> X
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> 
> Y = matrix(11:19,3,3)
> Y
     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19
> 
> Z = matrix(21:29,3,3)
> Z
     [,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29
> 
> mapply(sum,X,Y,Z)
[1] 33 36 39 42 45 48 51 54 57
The above function adds element by element of each of the matrix and returns a vector as an output.
For e.g., 33 = X[1,1] + Y[1,1] + Z[1,1]
36 = X[2,1] + Y[2,1] + Z[2,1} and so on.

How to decide which apply function to use

Now, comes the part of deciding which apply function should one use and how to decide which apply function will provide the desired results. This is mainly based on the following four parameters:
  1. Input
  2. Output
  3. Intention
  4. Section of Data
As we have discussed in the article above, all the apply family functions work on different types of datasets. Apply function works on arrays or matrices, lapply works on lists, sapply also works on lists and similar other functions. Based on kind of input that we are providing to the function provides us a first level of filter as to which all functions can be used.

Second filter comes from the output that we desire from the function. Lapply and sapply both work on lists; in that case, how to decide which function to use? As we saw above, lapply gives us list as an output while sapply outputs vector. This provides us another level of filter in deciding which function to choose.

Now, comes the intention which is making us use the apply family function. By intention, we mean the kind of functions that we are planning to pass through the apply family. Section of data refers to the subset or part of the data that we want our function to operate on – is it rows or columns or the entire dataset.

These four things can help us figure out which apply function should we choose for our tasks.

After going through the article, I’m sure you will agree with me that these functions are much easier to use than loops, and provides faster and efficient ways to execute codes. However, this doesn’t mean that we should not use loops at all. Loops have their own advantage when doing complex operations. Moreover, other programming languages do not provide any support for apply family function, so we don’t have an option but to use loops. We should keep ourselves open to both the ideas and decide what to use basis the requirements at hand.

Author Bio:

This article was contributed by Perceptive Analytics. Chaitanya Sagar, Jyothirmayee Thondamallu and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

Introducing R-Ladies Remote Chapter

R-Ladies Remote is kicking off and we want YOU! Do you want to be part of the R community but can’t attend meetups? There are many R-Ladies across the globe who love the idea of the organisation, but aren’t able to connect with it easily due to their distance, their work or their caring responsibilities. If child care ends at 6pm, ducking out to a chapter meeting at 6:30 isn’t always easy.

What do you need to join in? An interest in R and to be part of a gender minority in tech, that’s all. We are open to all R users, from new starters to experienced users. Sign up here.

What will RLadies Remote be doing? We’ll be hosting a variety of online events and speakers. We’ll be covering introductions to basic R and more advanced topics, discussions about remote working, independent consulting and seminars from our members.

Do you have an idea for an event, would you like to give a talk or would you like to come along to learn? If so we’d love to hear from you. Please show your interest by filling in our initial survey.

Introducing DataFramed, a Data Science Podcast

[soundcloud url=”https://api.soundcloud.com/tracks/385794143″ params=”color=#ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&show_teaser=true&visual=true” width=”100%” height=”300″ iframe=”true” /]

We are super pumped to be launching a weekly data science podcast called DataFramed, in which Hugo Bowne-Anderson (me), a data scientist and educator at DataCamp, speaks with industry experts about what data science is, what it's capable of, what it looks like in practice and the direction it is heading over the next decade and into the future.

You can check out the podcast here and make sure to subscribe, rate and review!

For a sneak peak, check out the trailer above!

Instead of answering "what is data science?" merely through the lens of related technologies, tools and skill-sets, a methodology commonly invoked to discover what data science is, we have decided to answer this question by delving into what modern data science looks like in practice via in-depth conversations with practitioners. These are the types of conversations we all have over dinner, around the water cooler and at conferences and I am happy to be formalizing them and bringing them to you in podcast form.

We're launching with a bang!

We’ve already released 7 episodes, which were honestly so much fun to record. In these episodes, I speak with:

  • Hilary Mason (VP of Research at Cloudera Fast Forward Labs and Data Scientist in Residence at Accel Partners),
  • Chris Volinksy (Assistant Vice President for Big Data Research at AT&T Labs and a member of the 7-person, 4-country team that won the $1M Netflix Prize),
  • Ben Skrainka (data scientist at Convoy, a company dedicated to revolutionizing the North American trucking industry with data science),
  • Maelle Salmon (Statistician/data scientist in Public Health, Epidemiology, #rstats) and
  • Dave Robinson (DataCamp, previously StackOverflow).
  • Robert Chang (Airbnb).


data science podcast

These interviews will be interspersed with brief segments on "Tales from the Open Source", "Statistical Pitfalls", "Data Science blog post of the week" and "Stack Overflow diaries", to name a few. The mission of these segments is to explain and discuss in brief topics essential to any working data scientist's toolbox.

Future episodes will include interviews with Mike Tamir (Head of Data Science Uber ATG), Mara Averick (Tidyverse Dev Advocate, RStudio), Emily Robinson (Etsy) and Drew Conway (Alluvium).

If you have any suggestions or would like to come on the show, do reach out to me on twitter @hugobowne.

Original music and sounds by The Sticks.

Understanding Naïve Bayes Classifier Using R

The Best Algorithms are the Simplest

The field of data science has progressed from simple linear regression models to complex ensembling techniques but the most preferred models are still the simplest and most interpretable. Among them are regression, logistic, trees and naive bayes techniques. Naive Bayes algorithm, in particular is a logic based technique which is simple yet so powerful that it is often known to outperform complex algorithms for very large datasets. Naive bayes is a common technique used in the field of medical science and is especially used for cancer detection. This article explains the underlying logic behind naive bayes algorithm and example implementation.

How Probability defines Everything

We calculate probability as the proportion of cases where an event happens and call it the probability of the event. Just as there is probability for a single event, we have probability of a group of events as the proportion of cases where the group of events occur together. Another concept in probability is calculating the occurrence of events in a particular sequence, that is, if it is known that something has already happened, what will be the probability that another event happens after that. By logic, one can understand that we are narrowing down our scope to only the case when that something has already happened and then calculating the proportion of cases where our second event occurs. To represent it mathematically, If A is the first event and B is the second event, then P(B|A) is our desired probability of calculating probability of event A after occurrence of event B, P(A n B) is probability of the two events occurring together

P(B | A) = P(B) * P(A | B) / P(A)

This is the foundation pillar for Naive bayes algorithm. Owing to this, Naive Bayes can handle different kind of events which are differentiated by the probabilities of event separately, that is , P(B) and conditional probability P(B|A). If the two probabilities are same, then it means that the occurrence of event A had no effect on event B and the events are known as independent events. If the conditional probability becomes zero, then it means the occurrence of event A implies that event B cannot occur. If the reverse is also true, then the events are known as mutually exclusive events and the occurrence of only one of the events at a time is possible. All other cases are classified as dependent events where the conditional probability can be either lower or higher than the original. In real life, every coin toss is independent of all other coin tosses made previously and thus coin tosses are independent. The outcome of a single coin toss is composed of mutually exclusive events. We cannot have a head and the tails at the same time. When we consider runs of multiple coin tosses, we are talking about dependent events. For a combination of three coin tosses, the final outcome is dependent of the first, second as well as the third coin toss.

How do we Calculate these Probabilities?

It is easy to calculate the probability of a single event. It equates to the number of cases when the event occurs divided by the total number of possible cases. For instance, the probability of a 6 in a single six-faced die roll is ⅙ if all the sides have equal chance of coming. However, one needs to be careful when calculating probabilities of two or more events. Simply knowing the probability of each event separately is not enough to calculate the probability of multiple events happening. If we additionally know that the events are independent, then the probability of them occurring together is the multiplication of each event separately.

We denote this mathematically as follows: P(A and B)=P(A)*P(B) – For independent events

As I already described, each coin toss is independent of other coin tosses. So the probability of having a Heads and a Heads combination in two coin tosses is P(Heads-Heads Combo)=P(Heads in first throw)*P(Heads in second throw)=½ * ½ = ¼

If the events are not independent, we can use the probability of any one event multiplied by the probability of second event after the first has happened
P(A and B)=P(A)*P(B|A) – For dependent events

An example of dependent events can be drawing cards without replacement. If you want to know that the two cards drawn are King and Queen then we know that the probability of the first event is dependent of 52 cards whereas the probability of the second event is dependent on 51 cards.

Thus, P(King and Queen)=P(King)*P(Queen|King)

Here, P(King) is 4/52. After a King is drawn, there are 4 queens out of 51 cards. So, P(Queen|King) is 4/51 P(King and Queen)=4/52*4/51=~0.6%

This is known as general multiplication rule. It also applies to the independent events scenario but since the events are independent, P(B|A) becomes equal to P(B)

The third case is for mutually exclusive events. If the events are mutually exclusive, we know that only one of the events can occur at a time. So the probability of the two events occurring together is zero. We are sometimes interested in probability of one of the events occuring and it is the sum of the individual probabilities in this scenario.

P(A OR B)=P(A)+P(B) – for mutually exclusive events

If we’re talking about a single six faced fair die throw, the probability of any two numbers occurring together is zero. In this case the probability of any prime number occuring is the sum of occurrence of each prime number. In this case P(2)+P(3)+P(5)

Had the events not been mutually exclusive, the probability of one of the events would have counted the probability of both events coming together twice. Hence we subtract this probability. P(A OR B)=P(A)+P(B)-P(A AND B) – for events which are not mutually exclusive

In a single six faced fair dice throw, the probability of throwing a multiple of 2 or 3 describes a scenario of events which are not mutually exclusive since 6 is both a multiple of 2 and 3 and is counted twice.

Thus, P(multiple of 2 or 3)=P(Multiple of 2)+P(Multiple of 3)- P(Multiple of 2 AND 3) =P(2,4,6)+P(3,6)-P(6)=3/6 + 2/6 -1/6 = 4/6 =2/3

This is known as general addition rule and similar to the multiplication rule, it also applies to the mutually exclusive events scenario but in that case, P(A AND B) is zero.

This is all we need to understand how Naive Bayes algorithm works. It takes into account all such scenarios and learns accordingly. Let’s get our hands dirty with a sample dataset.

Naive Bayes – a Not so Naive Algorithm

The reason that Naive Bayes algorithm is called Naive is not because it is simple or stupid. It is because the algorithm makes a very strong assumption about the data having features independent of each other while in reality, they may be dependent in some way. In other words, it assumes that the presence of one feature in a class is completely unrelated to the presence of all other features. If this assumption of independence holds, Naive Bayes performs extremely well and often better than other models. Naive Bayes can also be used with continuous features but is more suited to categorical variables. If all the input features are categorical, Naive Bayes is recommended. However, in case of numeric features, it makes another strong assumption which is that the numerical variable is normally distributed.

R supports a package called ‘e1071’ which provides the naive bayes training function. For this demonstration, we will use the classic titanic dataset and find out the cases which naive bayes can identify as survived.

The Titanic dataset in R is a table for about 2200 passengers summarised according to four factors – economic status ranging from 1st class, 2nd class, 3rd class and crew; gender which is either male or female; Age category which is either Child or Adult and whether the type of passenger survived. For each combination of Age, Gender, Class and Survived status, the table gives the number of passengers who fall into the combination. We will use the Naive Bayes Technique to classify such passengers and check how well it performs. As we know, Bayes theorem is based on conditional probability and uses the formula

P(A | B) = P(A) * P(B | A) / P(B)

We now know how this conditional probability comes from multiplication of events so if we use the general multiplication rule, we get another variation of the theorem that is, using P(A AND B) = P(A) * P(B | A), we can obtain the value for conditional probability: P(B | A) = P(A AND B) / P(A) which is the variation of Bayes theorem.

Since P(A AND B) also equals P(B) * P(A | B), we can substitute it and get back the original formula P(B | A) = P(B) * P(A | B) / P(A) Using this for each of the features among Age, Gender and Economic status, Naive Bayes algorithm will calculate the conditional probability of survival of the combination
#Getting started with Naive Bayes
#Install the package
#install.packages(“e1071”)
#Loading the library
library(e1071)
?naiveBayes #The documentation also contains an example implementation of Titanic dataset
#Next load the Titanic dataset
data(“Titanic”)
#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
We see that there are 32 observations which represent all possible combinations of Class, Sex, Age and Survived with their frequency. Since it is summarised, this table is not suitable for modelling purposes. We need to expand the table into individual rows. Let’s create a repeating sequence of rows based on the frequencies in the table
#Creating data from table
repeating_sequence=rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq) #This will repeat each combination equal to the frequency of each combination
#Create the dataset by row repetition created
Titanic_dataset=Titanic_df[repeating_sequence,]
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL
The data is now ready for Naive Bayes to process. Let’s fit the model
#Fitting the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
#What does the model say? Print the model summary
Naive_Bayes_Model
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes 
0.676965 0.323035 

Conditional probabilities:
     Class
Y          1st          2nd         3rd         Crew
  No    0.08187919  0.11208054  0.35436242  0.45167785
  Yes   0.28551336  0.16596343  0.25035162  0.29817159

     Sex
Y          Male         Female
  No    0.91543624  0.08456376
  Yes   0.51617440  0.48382560

     Age
Y         Child         Adult
  No    0.03489933  0.96510067
  Yes   0.08016878  0.91983122
The model creates the conditional probability for each feature separately. We also have the a-priori probabilities which indicates the distribution of our data. Let’s calculate how we perform on the data.
#Prediction on the dataset
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)
#Confusion matrix to check accuracy
table(NB_Predictions,Titanic_dataset$Survived)
NB_Predictions      No      Yes
                No      1364    362
                Yes     126     349
We have the results! We are able to classify 1364 out of 1490 “No” cases correctly and 349 out of 711 “Yes” cases correctly. This means the ability of Naive Bayes algorithm to predict “No” cases is about 91.5% but it falls down to only 49% of the “Yes” cases resulting in an overall accuracy of 77.8%

Conclusion: Can we Do any Better?

Naive Bayes is a parametric algorithm which implies that you cannot perform differently in different runs as long as the data remains the same. We will, however, learn another implementation of Naive Bayes algorithm using the ‘mlr’ package. Assuming the same session is going on for the readers, I will install and load the package and start fitting a model
#Getting started with Naive Bayes in mlr
#Install the package
#install.packages(“mlr”)
#Loading the library
library(mlr)
The mlr package consists of a lot of models and works by creating tasks and learners which are then trained. Let’s create a classification task using the titanic dataset and fit a model with the naive bayes algorithm.
#Create a classification task for learning on Titanic Dataset and specify the target feature
task = makeClassifTask(data = Titanic_dataset, target = "Survived")
#Initialize the Naive Bayes classifier
selected_model = makeLearner("classif.naiveBayes")
#Train the model
NB_mlr = train(selected_model, task)
The summary of the model which was printed in e3071 package is stored in learner model. Let’s print it and compare
#Read the model learned  
NB_mlr$learner.model
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes 
0.676965 0.323035 

Conditional probabilities:
     Class
Y               1st         2nd         3rd         Crew
    No      0.08187919  0.11208054  0.35436242  0.45167785
    Yes     0.28551336  0.16596343  0.25035162  0.29817159

     Sex
Y               Male        Female
     No     0.91543624  0.08456376
    Yes     0.51617440  0.48382560

     Age
Y           Child       Adult
    No      0.03489933  0.96510067
    Yes     0.08016878  0.91983122
The a-priori probabilities and the conditional probabilities for the model are similar to the one calculated by e3071 package as was expected. This means that our predictions will also be the same.
#Predict on the dataset without passing the target feature
predictions_mlr = as.data.frame(predict(NB_mlr, newdata = Titanic_dataset[,1:3]))

##Confusion matrix to check accuracy
table(predictions_mlr[,1],Titanic_dataset$Survived)
        No      Yes
  No    1364    362
  Yes   126     349
As we see, the predictions are exactly same. The only way to improve is to have more features or more data. Perhaps, if we have more features such as the exact age, size of family, number of parents in the ship and siblings then we may arrive at a better model using Naive Bayes. In essence, Naive Bayes has an advantage of a strong foundation build and is very robust. I know of the ‘caret’ package which also consists of Naive Bayes function but it will also give us the same predictions and probability.

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Vishnu Reddy and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

Here is the Complete Code (used in this article):

#Getting started with Naive Bayes
#Install the package
#install.packages(“e1071”)
#Loading the library
library(e1071)
?naiveBayes #The documentation also contains an example implementation of Titanic dataset
#Next load the Titanic dataset
data("Titanic")
#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
#Creating data from table
repeating_sequence=rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq) #This will repeat each combination equal to the frequency of each combination

#Create the dataset by row repetition created
Titanic_dataset=Titanic_df[repeating_sequence,]
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL

#Fitting the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
#What does the model say? Print the model summary
Naive_Bayes_Model

#Prediction on the dataset
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)
#Confusion matrix to check accuracy
table(NB_Predictions,Titanic_dataset$Survived)

#Getting started with Naive Bayes in mlr
#Install the package
#install.packages(“mlr”)
#Loading the library
library(mlr)

#Create a classification task for learning on Titanic Dataset and specify the target feature
task = makeClassifTask(data = Titanic_dataset, target = "Survived")

#Initialize the Naive Bayes classifier
selected_model = makeLearner("classif.naiveBayes")

#Train the model
NB_mlr = train(selected_model, task)

#Read the model learned  
NB_mlr$learner.model

#Predict on the dataset without passing the target feature
predictions_mlr = as.data.frame(predict(NB_mlr, newdata = Titanic_dataset[,1:3]))

##Confusion matrix to check accuracy
table(predictions_mlr[,1],Titanic_dataset$Survived)

How to implement Random Forests in R

Imagine you were to buy a car, would you just go to a store and buy the first one that you see? No, right? You usually consult few people around you, take their opinion, add your research to it and then go for the final decision. Let’s take a simpler scenario: whenever you go for a movie, do you ask your friends for reviews about the movie (unless, off-course it stars one of your favorite actress)?

Have you ever wondered why do we ask multiple people about their opinions or reviews before going for a movie or before buying a car or may be, before planning a holiday? It’s because review of one person may be biased as per her preference; however, when we ask multiple people we are trying to remove bias that a single person may provide. One person may have a very strong dislike for a specific destination because of her experience at that location; however, ten other people may have very strong preference for the same destination because they have had a wonderful experience there. From this, we can infer that the one person was more like an exceptional case and her experience may be one of a case.

Another example which I am sure all of us have encountered is during the interviews at any company or college. We often have to go through multiple rounds of interviews. Even though the questions asked in multiple rounds of interview are similar, if not same – companies still go for it. The reason is that they want to have views from multiple recruitment leaders. If multiple leaders are zeroing in on a candidate then the likelihood of her turning out to be a good hire is high.

In the world of analytics and data science, this is called ‘ensembling’. Ensembling is a “type of supervised learning technique where multiple models are trained on a training dataset and their individual outputs are combined by some rule to derive the final output.”

Let’s break the above definition and look at it step by step.

When we say multiple models are trained on a dataset, same model with different hyper parameters or different models can be trained on the training dataset. Training observations may differ slightly while sampling; however, overall population remains the same.

“Outputs are combined by some rule” – there could be multiple rules by which outputs are combined. The most common ones are the average (in terms of numerical output) or vote (in terms of categorical output). When different models give us numerical output, we can simply take the average of all the outputs and use the average as the result. In case of categorical output, we can use vote – output occurring maximum number of times is the final output. There are other complex methods of deriving at output also but they are out of scope of this article.

Random Forest is one such very powerful ensembling machine learning algorithm which works by creating multiple decision trees and then combining the output generated by each of the decision trees. Decision tree is a classification model which works on the concept of information gain at every node. For all the data points, decision tree will try to classify data points at each of the nodes and check for information gain at each node. It will then classify at the node where information gain is maximum. It will follow this process subsequently until all the nodes are exhausted or there is no further information gain. Decision trees are very simple and easy to understand models; however, they have very low predictive power. In fact, they are called weak learners.

Random Forest works on the same weak learners. It combines the output of multiple decision trees and then finally come up with its own output. Random Forest works on the same principle as Decision Tress; however, it does not select all the data points and variables in each of the trees. It randomly samples data points and variables in each of the tree that it creates and then combines the output at the end. It removes the bias that a decision tree model might introduce in the system. Also, it improves the predictive power significantly. We will see this in the next section when we take a sample data set and compare the accuracy of Random Forest and Decision Tree.

Now, let’s take a small case study and try to implement multiple Random Forest models with different hyper parameters, and compare one of the Random Forest model with Decision Tree model. (I am sure you will agree with me on this – even without implementing the model, we can say intuitively that Random Forest will give us better results than Decision Tree). The dataset is taken from UCI website and can be found on this link. The data contains 7 variables – six explanatory (Buying Price, Maintenance, NumDoors, NumPersons, BootSpace, Safety) and one response variable (Condition). The variables are self-explanatory and refer to the attributes of cars and the response variable is ‘Car Acceptability’. All the variables are categorical in nature and have 3-4 factor levels in each.

Let’s start the R code implementation and predict the car acceptability based on explanatory variables.
# Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/car/

install.packages("randomForest")
library(randomForest)
# Load the dataset and explore
data1 <- read.csv(file.choose(), header = TRUE)

head(data1)

str(data1)

summary(data1)
> head(data1)
  BuyingPrice Maintenance NumDoors NumPersons BootSpace Safety Condition
1       vhigh       vhigh        2          2     small    low     unacc
2       vhigh       vhigh        2          2     small    med     unacc
3       vhigh       vhigh        2          2     small   high     unacc
4       vhigh       vhigh        2          2       med    low     unacc
5       vhigh       vhigh        2          2       med    med     unacc
6       vhigh       vhigh        2          2       med   high     unacc
> str(data1)
'data.frame':   1728 obs. of  7 variables:
 $ BuyingPrice: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Maintenance: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ NumDoors   : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
 $ NumPersons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
 $ BootSpace  : Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
 $ Safety     : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
 $ Condition  : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
> summary(data1)
 BuyingPrice Maintenance  NumDoors   NumPersons BootSpace    Safety    Condition   
 high :432   high :432   2    :432   2   :576   big  :576   high:576   acc  : 384  
 low  :432   low  :432   3    :432   4   :576   med  :576   low :576   good :  69  
 med  :432   med  :432   4    :432   more:576   small:576   med :576   unacc:1210  
 vhigh:432   vhigh:432   5more:432                                     vgood:  65  
Now, we will split the dataset into train and validation set in the ratio 70:30. We can also create a test dataset, but for the time being we will just keep train and validation set.
# Split into Train and Validation sets
# Training Set : Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(data1), 0.7*nrow(data1), replace = FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]
summary(TrainSet)
summary(ValidSet)
> summary(TrainSet)
 BuyingPrice Maintenance  NumDoors   NumPersons BootSpace    Safety    Condition  
 high :313   high :287   2    :305   2   :406   big  :416   high:396   acc  :264  
 low  :292   low  :317   3    :300   4   :399   med  :383   low :412   good : 52  
 med  :305   med  :303   4    :295   more:404   small:410   med :401   unacc:856  
 vhigh:299   vhigh:302   5more:309                                     vgood: 37  
> summary(ValidSet)
 BuyingPrice Maintenance  NumDoors   NumPersons BootSpace    Safety    Condition  
 high :119   high :145   2    :127   2   :170   big  :160   high:180   acc  :120  
 low  :140   low  :115   3    :132   4   :177   med  :193   low :164   good : 17  
 med  :127   med  :129   4    :137   more:172   small:166   med :175   unacc:354  
 vhigh:133   vhigh:130   5more:123                                     vgood: 28  
Now, we will create a Random Forest model with default parameters and then we will fine tune the model by changing ‘mtry’. We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each stage (mtry). According to Random Forest package description:

Ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.

Mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1
> model1

Call:
 randomForest(formula = Condition ~ ., data = TrainSet, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.64%
Confusion matrix:
      acc good unacc vgood class.error
acc   253    7     4     0  0.04166667
good    3   44     1     4  0.15384615
unacc  18    1   837     0  0.02219626
vgood   6    0     0    31  0.16216216
By default, number of trees is 500 and number of variables tried at each split is 2 in this case. Error rate is 3.6%.
# Fine tuning parameters of Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2
> model2

Call:
 randomForest(formula = Condition ~ ., data = TrainSet, ntree = 500,      mtry = 6, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6

        OOB estimate of  error rate: 2.32%
Confusion matrix:
      acc good unacc vgood class.error
acc   254    4     6     0  0.03787879
good    3   47     1     1  0.09615385
unacc  10    1   845     0  0.01285047
vgood   1    1     0    35  0.05405405
When we have increased the mtry to 6 from 2, error rate has reduced from 3.6% to 2.32%. We will now predict on the train dataset first and then predict on validation dataset.
# Predicting on train set
predTrain <- predict(model2, TrainSet, type = "class")
# Checking classification accuracy
table(predTrain, TrainSet$Condition)  
> table(predTrain, TrainSet$Condition)
         
predTrain acc good unacc vgood
    acc   264    0     0     0
    good    0   52     0     0
    unacc   0    0   856     0
    vgood   0    0     0    37
# Predicting on Validation set
predValid <- predict(model2, ValidSet, type = "class")
# Checking classification accuracy
mean(predValid == ValidSet$Condition)                    
table(predValid,ValidSet$Condition)
> mean(predValid == ValidSet$Condition)                    
[1] 0.9884393
> table(predValid,ValidSet$Condition)
         
predValid acc good unacc vgood
    acc   117    0     2     0
    good    1   16     0     0
    unacc   1    0   352     0
    vgood   1    1     0    28
In case of prediction on train dataset, there is zero misclassification; however, in the case of validation dataset, 6 data points are misclassified and accuracy is 98.84%. We can also use function to check important variables. The below functions show the drop in mean accuracy for each of the variables.
# To check important variables
importance(model2)        
varImpPlot(model2)        
> importance(model2)        
                  acc     good     unacc    vgood MeanDecreaseAccuracy MeanDecreaseGini
BuyingPrice 143.90534 80.38431 101.06518 66.75835            188.10368         71.15110
Maintenance 130.61956 77.28036  98.23423 43.18839            171.86195         90.08217
NumDoors     32.20910 16.14126  34.46697 19.06670             49.35935         32.45190
NumPersons  142.90425 51.76713 178.96850 49.06676            214.55381        125.13812
BootSpace    85.36372 60.34130  74.32042 50.24880            132.20780         72.22591
Safety      179.91767 93.56347 207.03434 90.73874            275.92450        149.74474

> varImpPlot(model2)
Perceptive Analytics
Now, we will use ‘for’ loop and check for different values of mtry.
# Using For loop to identify the right mtry for model
a=c()
i=5
for (i in 3:8) {
  model3 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = i, importance = TRUE)
  predValid <- predict(model3, ValidSet, type = "class")
  a[i-2] = mean(predValid == ValidSet$Condition)
}

a

plot(3:8,a)
> a
[1] 0.9749518 0.9884393 0.9845857 0.9884393 0.9884393 0.9903661
> 
> plot(3:8,a)
Perceptive Analytics
From the above graph, we can see that the accuracy decreased when mtry was increased from 4 to 5 and then increased when mtry was changed to 6 from 5. Maximum accuracy is at mtry equal to 8.

Now, we have seen the implementation of Random Forest and understood the importance of the model. Let’s compare this model with decision tree and see how decision trees fare in comparison to random forest.
# Compare with Decision Tree

install.packages("rpart")
install.packages("caret")
install.packages("e1071")

library(rpart)
library(caret)
library(e1071)
# We will compare model 1 of Random Forest with Decision Tree model

model_dt = train(Condition ~ ., data = TrainSet, method = "rpart")
model_dt_1 = predict(model_dt, data = TrainSet)
table(model_dt_1, TrainSet$Condition)

mean(model_dt_1 == TrainSet$Condition)
> table(model_dt_1, TrainSet$Condition)
          
model_dt_1 acc good unacc vgood
     acc   241   52   132    37
     good    0    0     0     0
     unacc  23    0   724     0
     vgood   0    0     0     0
> 
> mean(model_dt_1 == TrainSet$Condition)
[1] 0.7981803
On the training dataset, the accuracy is around 79.8% and there is lot of misclassification. Now, look at the validation dataset.
# Running on Validation Set
model_dt_vs = predict(model_dt, newdata = ValidSet)
table(model_dt_vs, ValidSet$Condition)

mean(model_dt_vs == ValidSet$Condition)
> table(model_dt_vs, ValidSet$Condition)
           
model_dt_vs acc good unacc vgood
      acc   107   17    58    28
      good    0    0     0     0
      unacc  13    0   296     0
      vgood   0    0     0     0
> 
> mean(model_dt_vs == ValidSet$Condition)
[1] 0.7764933
The accuracy on validation dataset has decreased further to 77.6%.

The above comparison shows the true power of ensembling and the importance of using Random Forest over Decision Trees. Though Random Forest comes up with its own inherent limitations (in terms of number of factor levels a categorical variable can have), but it still is one of the best models that can be used for classification. It is easy to use and tune as compared to some of the other complex models, and still provides us good level of accuracy in the business scenario. You can also compare Random Forest with other models and see how it fares in comparison to other techniques. Happy Random Foresting!!

Author Bio:

This article was contributed by Perceptive Analytics. Chaitanya Sagar, Prudhvi Potuganti and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

How to extract data from a PDF file with R


In this post, taken from the book R Data Mining by Andrea Cirillo, we’ll be looking at how to scrape PDF files using R. It’s a relatively straightforward way to look at text mining – but it can be challenging if you don’t know exactly what you’re doing.

Until January 15th, every single eBook and video by Packt is just $5! Start exploring some of Packt’s huge range of R titles here.

You may not be aware of this, but some organizations create something called a ‘customer card’ for every single customer they deal with. This is quite an informal document that contains some relevant information related to the customer, such as the industry and the date of foundation. Probably the most precious information contained within these cards is the comments they write down about the customers. Let me show you one of them:
My plan was the following—get the information from these cards and analyze it to discover whether some kind of common traits emerge.

As you may already know, at the moment this information is presented in an unstructured way; that is, we are dealing with unstructured data. Before trying to analyze this data, we will have to gather it in our analysis environment and give it some kind of structure.

Technically, what we are going to do here is called text mining, which generally refers to the activity of gaining knowledge from texts. The techniques we are going to employ are the following:

  • Sentiment analysis
  • Wordclouds
  • N-gram analysis
  • Network analysis

Getting a list of documents in a folder

First of all, we need to get a list of customer cards we were from the commercial department. I have stored all of them within the ‘data’ folder on my workspace. Let’s use list.files() to get them:
file_vector <- list.files(path = "data")
Nice! We can inspect this looking at the head of it. Using the following command:
file_vector %>% head()

[1] "banking.xls" "Betasoloin.pdf" "Burl Whirl.pdf" "BUSINESSCENTER.pdf"
 [5] "Buzzmaker.pdf" "BuzzSaw Publicity.pdf"
Uhm… not exactly what we need. I can see there are also .xls files. We can remove them using the grepl() function, which performs partial matches on strings, returning TRUE if the pattern required is found, or FALSE if not. We are going to set the following test here: give me TRUE if you find .pdf in the filename, and FALSE if not:
grepl(".pdf",file_list)

[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[18] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[35] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[52] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[69] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[86] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[103] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
[120] TRUE
As you can see, the first match results in a FALSE since it is related to the .xls file we saw before. We can now filter our list of files by simply passing these matching results to the list itself. More precisely, we will slice our list, selecting only those records where our grepl() call returns TRUE:
pdf_list <- file_vector[grepl(".pdf",file_list)]
Did you understand [grepl(“.pdf”,file_list)] ? It is actually a way to access one or more indexes within a vector, which in our case are the indexes corresponding to “.pdf”, exactly the same as we printed out before. If you now look at the list, you will see that only PDF filenames are shown on it.

Reading PDF files into R via pdf_text()

R comes with a really useful that’s employed tasks related to PDFs. This is named pdftools, and beside the pdf_text function we are going to employ here, it also contains other relevant functions that are used to get different kinds of information related to the PDF file into R. For our purposes, it will be enough to get all of the textual information contained within each of the PDF files. First of all, let’s try this on a single document; we will try to scale it later on the whole set of documents. The only required argument to make pdf_text work is the path to the document. The object resulting from this application will be a character vector of length 1:
pdf_text("data/BUSINESSCENTER.pdf")

[1] "BUSINESSCENTER business profile\nInformation below are provided under non disclosure agreement. date of enquery: 12.05.2017\ndate of foundation: 1993-05-18\nindustry: Non-profit\nshare holders: Helene Wurm ; Meryl Savant ; Sydney Wadley\ncomments\nThis is one of our worst customer. It really often miss payments even if for just a couple of days. We have\nproblems finding useful contact persons. The only person we can have had occasion to deal with was the\nfiscal expert, since all other relevant person denied any kind of contact.\n                                                       1\n"
If you compare this with the original PDF document you can easily see that all of the information is available even if it is definitely not ready to be analyzed. What do you think is the next step needed to make our data more useful? We first need to split our string into lines in order to give our data a structure that is closer to the original one, that is, made of paragraphs. To split our string into separate records, we can use the strsplit() function, which is a base R function. It requires the string to be split and the token that decides where the string has to be split as arguments. If you now look at the string, you’ll notice that where we found the end of a line in the original document, for instance after the words business profile, we now find the \n token. This is commonly employed in text formats to mark the end of a line. We will therefore use this token as a split argument:
pdf_text("data/BUSINESSCENTER.pdf") %>% strsplit(split = "\n")

[[1]]
 [1] "BUSINESSCENTER business profile"
 [2] "Information below are provided under non disclosure agreement. date of enquery: 12.05.2017"
 [3] "date of foundation: 1993-05-18"
 [4] "industry: Non-profit"
 [5] "share holders: Helene Wurm ; Meryl Savant ; Sydney Wadley"
 [6] "comments"
 [7] "This is one of our worst customer. It really often miss payments even if for just a couple of days. We have"
 [8] "problems finding useful contact persons. The only person we can have had occasion to deal with was the"
 [9] "fiscal expert, since all other relevant person denied any kind of contact."
 [10] " 1"
strsplit() returns a list with an element for each element of the character vector passed as argument; within each list element, there is a vector with the split string. Isn’t that better? I definitely think it is. The last thing we need to do before actually doing text mining on our data is to apply those treatments to all of the PDF files and gather the results into a conveniently arranged data frame.

Iteratively extracting text from a set of documents with a for loop

What we want to do here is run trough the list of files and for filename found there, we run the pdf_text() function and then the strsplit() function to get an object similar to the one we have seen with our test. A convenient way to do this is by employing a ‘for’ loop. These structures basically do this to their content: Repeat this instruction n times and then stop. Let me show you a typical ‘for’ loop:
for (i in 1:3){
 print(i)
}
If we run this, we obtain the following:
[1] 1
 [1] 2
 [1] 3
This means that the loop runs three times and therefore repeats the instructions included within the brackets three times. What is the only thing that seems to change every time? It is the value of i. This variable, which is usually called counter, is basically what the loop employs to understand when to stop iterating. When the loop execution starts, the loop starts increasing the value of the counter, going from 1 to 3 in our example. The for loop repeats the instructions between the brackets for each element of the values of the vector following the in clause in the for command. At each step, the value of the variable before in (i in this case) takes one value of the sequence from the vector itself. The counter is also useful within the loop itself, and it is usually employed to iterate within an object in which some kind of manipulation is desired. Take, for instance, a vector defined like this:
vector <- c(1,2,3)
Imagine we want to increase the value of every element of the vector by 1. We can do this by employing a loop such as this:
for (i in 1:3){
 vector[i] <- vector[i]+1
}
If you look closely at the loop, you’ll realize that the instruction needs to access the element of the vector with an index equal to i and modify this value by 1. The counter here is useful because it will allow iteration on all vectors from 1 to 3. Be aware that this is actually not a best practice because loops tend to be quite computationally expensive, and they should be employed when no other valid alternative is available. For instance, we can obtain the same result here by working directly on the whole vector, as follows:
vector_increased <- vector +1
If you are interested in the topic of avoiding loops where they are not necessary, I can share with you some relevant material on this. For our purposes, we are going to apply loops to go through the pdf_list object, and apply the pdf_text function and subsequently the strsplit() function to each element of this list:
corpus_raw <- data.frame("company" = c(),"text" = c())

for (i in 1:length(pdf_list)){
print(i)
 pdf_text(paste("data/", pdf_list[i],sep = "")) %>% 
 strsplit("\n")-> document_text
data.frame("company" = gsub(x =pdf_list[i],pattern = ".pdf", replacement = ""), 
 "text" = document_text, stringsAsFactors = FALSE) -> document

colnames(document) <- c("company", "text")
corpus_raw <- rbind(corpus_raw,document) 
}
Let’s get closer to the loop: we first have a call to the pdf_text function, passing an element of pdf_list as an argument; it is defined as referencing the i position in the list. Once we have done this, we can move on to apply the strsplit() function to the resulting string. We define the document object, which contains two columns, in this way:
  • company, which stores the name of the PDF without the .pdf token; this is the name of the company
  • text, which stores the text resulting from the extraction
This document object is then appended to the corpus object, which we created previously, to store all of the text within the PDF. Let’s have a look a the resulting data frame:
corpus_raw %>% head()

This is a well-structured object, ready for some text mining. Nevertheless, if we look closely at our PDF customer cards, we can see that there are three different kinds of information and they should be handled differently:
  • Repeated information, such as the confidentiality disclosure on the second line and the date of enquiry (12.05.2017)
  • Structured attributes, for instance, date of foundation or industry
  • Strictly unstructured data, which is in the comments paragraph
We are going to address these three kinds of data differently, removing the first group of irrelevant information; we therefore have to split our data frame accordingly into two smaller data frames. To do so, we will leverage the grepl() function once again, looking for the following tokens:

  • 12.05.2017: This denotes the line showing the non-disclosure agreement and the date of inquiry.
  • business profile: This denotes the title of the document, containing the name of the company. We already have this information stored within the company column.
  • comments: This is the name of the last paragraph.
  • 1: This represents the number of the page and is always the same on every card.
We can apply the filter function to our corpus_raw object here as follows:
corpus_raw %>% 
filter(!grepl("12.05.2017",text)) %>% 
filter(!grepl("business profile",text)) %>% 
filter(!grepl("comments",text)) %>%
filter(!grepl("1",text)) -> corpus
Now that we have removed those useless things, we can actually split the data frame into two sub-data frames based on what is returned by the grepl function when searching for the following tokens, which point to the structured attributes we discussed previously:
  • date of foundation
  • industry
  • shareholders

We are going to create two different data frames here; one is called ‘information’ and the other is called ‘comments’:
corpus %>% 
filter(!grepl(c("date of foundation"),text)) %>% 
filter(!grepl(c( "industry"),text)) %>% 
filter(!grepl(c( "share holders"),text)) -> comments

corpus %>% 
filter(grepl(("date of foundation"),text)|grepl(( "industry"),text)|grepl(( "share holders"),text))-> information
As you can see, the two data treatments are nearly the opposite of each other, since the first looks for lines showing none of the three tokens while the other looks for records showing at least one of the tokens. Let’s inspect the results by employing the head function:
information %>% head()
comments %>% head()
Great! We are nearly done. We are now going to start analyzing the comments data frame, which reports all comments from our colleagues. The very last step needed to make this data frame ready for subsequent analyzes is to tokenize it, which basically means separating it into different rows for all the words available within the text column. To do this, we are going to leverage the unnest_tokens function, which basically splits each line into words and produces a new row for each word, having taken care to repeat the corresponding value within the other columns of the original data frame. This function comes from the recent tidytext package by Julia Silge and Davide Robinson, which provides an organic framework of utilities for text mining tasks. It follows the tidyverse framework, which you should already know about if you are using the dplyr package. Let’s see how to apply the unnest_tokens function to our data:
comments %>% 
 unnest_tokens(word,text)-> comments_tidy
If we now look at the resulting data frame, we can see the following:

As you can see, we now have each word separated into a single record.

Thanks for reading! We hope you enjoyed this extract taken from R Data Mining.

Explore $5 R eBooks and videos from Packt until January 15th 2018.