How to Perform Ordinal Logistic Regression in R

In this article, we discuss the basics of ordinal logistic regression and its implementation in R. Ordinal logistic regression is a widely used classification method, with applications in a variety of domains. It is the go-to tool when there is a natural ordering in the dependent variable. For example, a dependent variable with the levels low, medium, and high is a perfect context for applying ordinal logistic regression.

Having a wide range of applicability, ordinal logistic regression is considered one of the most widely used methods in the field of data analytics. The method is also known as the proportional odds model because of the transformations used during estimation and the log odds interpretation of the output. We hope that this article helps our readers understand the basics and implement the model in R.

The article is organized as follows. Section 1 provides a quick review of the theory behind ordinal logistic regression. Section 2 discusses the steps to perform ordinal logistic regression in R and shares the R script; it also covers the basics of interpreting and evaluating the model in R. In section 3, we learn a more intuitive way to interpret the model. Section 4 concludes the article.

Basics of ordinal logistic regression
Ordinal logistic regression is an extension of the simple logistic regression model. In simple logistic regression, the dependent variable is categorical and follows a Bernoulli distribution (for a quick reference, check out this article by Perceptive Analytics: https://www.kdnuggets.com/2017/10/learn-generalized-linear-models-glm-r.html). In ordinal logistic regression, the dependent variable is ordinal, i.e. there is an explicit ordering in the categories. For example, during preliminary testing of a pain relief drug, the participants are asked to express the amount of relief they feel on a five-point Likert scale. Another common example of an ordinal variable is app ratings: on Google Play, customers are asked to rate apps on a scale ranging from 1 to 5. Ordinal logistic regression comes in handy in the aforementioned examples as there is a clear order in the categorical dependent variable.

In simple logistic regression, the log of the odds that an event occurs is modeled as a linear combination of the independent variables. This approach, however, ignores the ordering of the categorical dependent variable. The ordinal logistic regression model overcomes this limitation by using cumulative events for the log odds computation. It means that, unlike simple logistic regression, ordinal logistic models consider the probability of an event together with all the events below it in the ordered hierarchy. For example, the event of interest in ordinal logistic regression would be obtaining an app rating equal to or less than X. The log of the odds for an app rating less than or equal to 1 would then be computed as follows:

LogOdds(rating ≤ 1) = Log(p(rating ≤ 1)/p(rating > 1))  [Eq. 1]

Likewise, the log of odds can be computed for other values of the app rating. The computations for the other ratings are:

LogOdds(rating ≤ 2) = Log(p(rating ≤ 2)/p(rating > 2))  [Eq. 2]
LogOdds(rating ≤ 3) = Log(p(rating ≤ 3)/p(rating > 3))  [Eq. 3]
LogOdds(rating ≤ 4) = Log(p(rating ≤ 4)/p(rating > 4))  [Eq. 4]

Because all the ratings below the focal score are included in the computation, the highest app rating of 5 covers all the ratings below it and does not have a log of odds associated with it. In general, the ordinal regression model can be represented by the following log odds computation:

LogOdds(Y ≤ i) = αi + β1X1 + β2X2 + … + βnXn

where,
Y is the ordinal dependent variable
i indexes the categories and runs from 1 to the number of categories minus 1
X1, X2, …, Xn are independent variables. They can be measured on a nominal, ordinal or continuous measurement scale.
β1, β2, …, βn are the estimated parameters.

One such equation is obtained for each category except the highest, so the number of equations is one less than the number of categories. The proportional odds assumption implies that the effect of the independent variables is identical for each log odds computation. This is not the case for the intercept, which takes a different value in each equation. Besides the proportional odds assumption, the ordinal logistic regression model assumes an ordinal dependent variable and the absence of multicollinearity. Absence of multicollinearity means that the independent variables are not significantly correlated. These assumptions are important, as their violation makes the computed parameters unreliable.
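To make Eq. 1 to Eq. 4 concrete, the cumulative log odds can be computed by hand in R. The counts below are made up purely for illustration and are unrelated to the dataset analyzed in the next section.

#Hypothetical counts of app ratings 1 to 5
ratings <- c(`1` = 50, `2` = 120, `3` = 180, `4` = 100, `5` = 50)
p <- ratings / sum(ratings)            #probability of each rating
cum_p <- cumsum(p)[1:4]                #p(rating <= k) for k = 1, ..., 4
log_odds <- log(cum_p / (1 - cum_p))   #Eq. 1 to Eq. 4
round(log_odds, 2)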

Model building in R
In this section, we describe the dataset and implement ordinal logistic regression in R. We use a simulated dataset for the analysis. The objective of the analysis is to predict the likelihood of each level of customer purchase. The dependent variable is the likelihood of repeated purchase by customers; it is measured on an ordinal scale and takes one of three levels – low probability, medium probability, and high probability. The independent variables are possession of a coupon by the focal customer, recommendation received from peers, and quality of the product. Possession of coupon and peer recommendation are categorical variables, while quality is measured on a scale of 1 to 5.

We discuss the steps for implementing ordinal logistic regression and share the commented R script for the better understanding of the reader. The data is in .csv format and can be downloaded by clicking here. Before starting the analysis, we describe the preliminary steps in short. The first step is to keep the data file in the working directory. The next step is to explicitly define the ordering of the levels in the dependent variable and the relevant independent variables. This step is crucial; ignoring it can lead to meaningless analysis.

#Read data file from working directory
setwd("C:/Users/You/Desktop")
data <- read.csv("data.csv")
#Ordering the dependent variable
data$rpurchase = factor(data$rpurchase, levels = c("low probability", "medium probability", "high probability"), ordered = TRUE)
#Declaring the ordering of the categorical independent variables
data$peers = factor(data$peers, levels = c("0", "1"), ordered = TRUE)
data$coupon = factor(data$coupon, levels = c("0", "1"), ordered = TRUE)
Next, it is essential to perform exploratory data analysis. Exploratory data analysis deals with outliers and missing values, which can induce bias in our model. In this article we do basic exploration by looking at summary statistics and a frequency table. Figure 1 shows the summary statistics: we observe the counts for the ordinal variables and the distribution characteristics of the other variables. We compute the count of rpurchase for different values of coupon in Table 1. Note that the median level of rpurchase changes with coupon: it increases when a coupon is held, indicating that coupon positively affects the likelihood of repeated purchase. The R script for the summary statistics and the frequency table is as follows:

#Exploratory data analysis 
#Summarizing the data
summary(data)
#Making frequency table
table(data$rpurchase, data$coupon)
Figure 1. Summary statistics

Table 1. Count of rpurchase by coupon

Rpurchase / Coupon    With coupon    Without coupon
Low probability       200            20
Medium probability    110            30
High probability      27             13
After defining the order of the levels in the dependent variable and performing the exploratory data analysis, the data is ready to be partitioned into a training and a test set. We will build the model using the training set and validate it using the data in the test set. The partitioning of the data into training and test sets is random. The random sample is generated using the sample() function along with the set.seed() function. The R script for creating the training and test sets is as follows:

#Dividing data into training and test set
#Random sampling 
samplesize = 0.60*nrow(data)
set.seed(100)
index = sample(seq_len(nrow(data)), size = samplesize)
#Creating training and test set 
datatrain = data[index,]
datatest = data[-index,]

Now, we will build the model using the data in the training set. As discussed in the section above, the dependent variable in the model is in the form of log odds. Because of the log odds transformation, it is difficult to interpret the coefficients of the model. Note that in this case the coefficients of the regression cannot be interpreted in terms of marginal effects. The coefficients are called proportional odds and are interpreted in terms of the increase in log odds. The interpretation changes not only for the coefficients but also for the intercept. Unlike simple linear regression, in ordinal logistic regression we obtain n-1 intercepts, where n is the number of categories in the dependent variable. An intercept can be interpreted as the expected log odds of falling in the corresponding category or below when the predictors are zero. Before interpreting the model, we share the relevant R script and the results. The model is fitted with the polr() function from the MASS package. In the R code, we set Hess equal to TRUE, which is the logical argument to return the Hessian matrix. Returning the Hessian matrix is essential for using the summary function and for calculating the variance-covariance matrix of the fitted model.

#Build ordinal logistic regression model
library(MASS)  #polr() comes from the MASS package
model = polr(rpurchase ~ coupon + peers + quality, data = datatrain, Hess = TRUE)
summary(model)
Figure 2. Ordinal logistic regression model on training set

The table displays the values of the coefficients and intercepts together with the corresponding standard errors and t values. The interpretation of the coefficients is as follows: holding everything else constant, an increase in the value of coupon by one unit increases the expected value of rpurchase, in log odds, by 0.96. The coefficients of peers and quality can be interpreted likewise.
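Because the coefficients are reported on the log odds scale, a quick way to make them more familiar is to exponentiate them into odds ratios. The short sketch below reuses the model object fitted above; the 2.6 figure simply follows from the 0.96 reported in the output.

#Convert the estimated coefficients from log odds to odds ratios
exp(coef(model))
#A coefficient of 0.96 on the log odds scale corresponds to an odds ratio of
#about exp(0.96) = 2.6
exp(0.96)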
Note that the ordinal logistic regression outputs multiple intercepts, one for each boundary between adjacent levels of the dependent variable. The intercepts can be interpreted as the expected log odds of falling at or below the corresponding category when the other variables assume a value of zero. For example, the low probability | medium probability intercept takes a value of 2.13, indicating that the expected log odds of identifying in the low probability category, when the other variables assume a value of zero, is 2.13. Using the inverse logit transformation, p = exp(α)/(1 + exp(α)), the intercepts can be interpreted in terms of expected probabilities: exp(2.13)/(1 + exp(2.13)) ≈ 0.89, so the expected probability of identifying in the low probability category, when the other variables assume a value of zero, is 0.89.

After building and interpreting the model, the next step is to evaluate it. The evaluation of the model is conducted on the test dataset. A basic evaluation approach is to compute the confusion matrix and the misclassification error. The R code and the results are as follows:

#Compute confusion table and misclassification error
predictrpurchase = predict(model,datatest)
table(datatest$rpurchase, predictrpurchase)
mean(as.character(datatest$rpurchase) != as.character(predictrpurchase))
Figure 3. Confusion matrix
The confusion matrix shows the performance of the ordinal logistic regression model. For example, it shows that, in the test dataset, the low probability category is identified correctly 76 times, the medium probability category 10 times, and the high probability category 0 times. We observe that the model identifies the high probability category poorly. This happens because of the inadequate representation of the high probability category in the training dataset. Using the confusion matrix, we find that the misclassification error for our model is 46%.

Interpretation using plots

The interpretation of ordinal logistic regression in terms of log odds ratios is not easy to understand. We offer an alternative approach to interpretation using plots. The R code for plotting the effects of the independent variables is as follows:

#Plotting the effects 
library("effects")
Effect(focal.predictors = "quality",model)
plot(Effect(focal.predictors = "coupon",model))
plot(Effect(focal.predictors = c("quality", "coupon"),model))
The plots are intuitive and easy to understand. For example, figure 4 shows that coupon increases the likelihood of classification into the high probability and medium probability classes, while decreasing the likelihood of classification into the low probability class.

Figure 4. Effect of coupon on identification
It is also possible to look at the joint effect of two independent variables. Figure 5 shows the joint effect of quality and coupon on the identification of the category of the dependent variable. Observing the top row of figure 5, we notice that the combination of coupon and higher quality increases the likelihood of identification in the high probability category.

Figure 5. Joint effect of quality and coupon on identification

Conclusion

The article discusses the fundamentals of ordinal logistic regression, builds the model in R, and ends with interpretation and evaluation. Ordinal logistic regression extends the simple logistic regression model to situations where the dependent variable is ordinal, i.e. can be ordered. Ordinal logistic regression has a variety of applications; for example, it is often used in marketing to increase customer lifetime value, where consumers can be categorized into different classes based on their tendency to make repeated purchase decisions. In order to discuss the model in an applied manner, we develop this article around the case of consumer categorization. The independent variables of interest are the coupon held by consumers from a previous purchase, the influence of peers, and the quality of the product.

The article has two key takeaways. First, ordinal logistic regression comes in handy when dealing with a dependent variable that can be ordered. If one uses multinomial logistic regression instead, one ignores the information contained in the ordering of the dependent variable. Second, the coefficients of ordinal logistic regression cannot be interpreted in the same way as the coefficients of ordinary linear regression. Interpreting the coefficients in terms of marginal effects is one of the common mistakes that users make while implementing the ordinal regression model. We again emphasize the use of the graphical method to interpret the coefficients: using it, it is easy to understand the individual and joint effects of the independent variables on the likelihood of classification. This article can be a go-to reference for understanding the basics of ordinal logistic regression and its implementation in R. We have provided commented R code throughout the article.

Download R-code
Credits: Chaitanya Sagar and Aman Asija of Perceptive Analytics. Perceptive Analytics is a marketing analytics and Tableau consulting company. 


A New Release of rIP (v1.2.0) for Detecting Fraud in Online Surveys

We are excited to announce the latest major release of rIP (v1.2.0), which is an R package that detects fraud in online surveys by tracing, scoring, and visualizing IP addresses. Essentially, rIP takes an array of IP addresses, which are always captured in online surveys (e.g., MTurk), and the keys for the services the user wishes to use (IP Hub, IP Intel, and Proxycheck), and passes these to all respective APIs. The output is a dataframe with the IP addresses, country, internet service provider (ISP), labels for non-US IP Addresses, whether a virtual private server (VPS) was used, and then recommendations for blocking the IP address. Users also have the option to visualize the distributions, as discussed below in the updates to v1.2.0.

Especially important in this is the variable “block”, which gives a score indicating whether the IP address is likely from a “server farm” and should be excluded from the data. It is coded 0 if the IP is residential/unclassified (i.e. a safe IP), 1 if the IP is non-residential (hosting provider, proxy, etc. – should likely be excluded, though the decision to do so is left to the researcher), and 2 for non-residential and residential IPs (more stringent, may flag innocent respondents).
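For example, assuming the data frame returned by getIPinfo() has been stored in an object called ip_info (the object name here is ours, not the package’s), respondents flagged by the “block” variable could be dropped before analysis:

# Keep only rows whose IP was scored as residential/unclassified (block == 0)
ip_info_clean <- ip_info[ip_info$block == 0, ]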

Including some great contributions from Bob Rudis, some of the key feature updates included in v1.2.0 of rIP are:

  • Added discrete API endpoints for the three IP services so users can use this as a general purpose utility package as well as for the task-specific functionality currently provided. Each endpoint is tied to an environment variable for the secret info (API key or contact info). This is documented in each function.
  • On-load computed package global .RIP_UA which is an httr user_agent object, given the best practice to use an identifiable user agent when making API calls so the service provider can track usage and also follow up with any issues they see.
  • A new plotting option that, when set to “TRUE”, produces a barplot of the IP addresses checked with color generated via the amerika package.
  • Users can now supply any number of IP service keys they wish to use (1, 2, or all 3), and the function will ping only the preferred IP check services (formerly, the package required all three keys or none to be entered).
  • For those interested in reading more and citing the package in published work, check out our recently published software paper in the Journal of Open Source Software.
Here is a quick demo of the package with some fake (auto-generated) IP addresses:
# Install and load rIP, v1.2.0
install.packages("rIP")
library(rIP)

# Store personal keys (only "IP Hub" used here)
ip_hub_key = "MzI2MTpkOVpld3pZTVg1VmdTV3ZPenpzMmhodkJmdEpIMkRMZQ=="

ipsample = data.frame(rbind(c(1, "15.79.157.16"), c(2, "1.223.176.227"), c(3, "72.167.36.25"), c(4, "5.195.165.176"),
                             c(5, "27.28.25.206"), c(6, "106.118.241.121"), c(7, "231.86.14.33"), c(8, "42.56.9.80"), c(9, "41.42.62.229"),
                             c(10, "183.124.243.176")))
names(ipsample) = c("number", "IPAddress")

# Call the getIPinfo function to check the IPs
getIPinfo(ipsample, "IPAddress", iphub_key = ip_hub_key, plots = TRUE)
Running the code above will generate the following plot, as well as the dataframe mentioned above.

Note that to use the package, users must have valid personal keys for any IP service they wish to call via the function. These can be obtained for free at the corresponding IP check services.

Finally, we welcome contributions and reports of bugs either by opening an issue ticket or a pull request at the corresponding Github repository. Thanks and enjoy the package!

Workshop – R Bootcamp: For Newcomers to R

Sunday, June 16, 2019 in Las Vegas

Two and a half hour workshop: 2:00pm-4:30pm

Intended Audience:
Practitioners who wish to learn the nuts and bolts of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”

Knowledge Level:
Experience with handling data (e.g. spreadsheets) or hands-on familiarity with any programming language.

Workshop Description

This afternoon workshop launches your tenure as a user of R, the well-known open-source platform for data science and machine learning. The workshop stands alone as the perfect way to get started with R, or may serve as preparation for the more advanced full-day hands-on workshop, “R for Predictive Modeling”. Designed for newcomers to the R language, “R Bootcamp” covers the R ecosystem and core elements of the language, so you attain the foundations for becoming an R user. Topics include common tools for data import, manipulation and export. If time allows, other topics will be covered, such as graphical systems in R (e.g. lattice and ggplot2) and automated reporting. The instructor will guide attendees on hands-on execution with R, covering:
  • A working knowledge of the R system
  • The strengths and limitations of the R language
  • Core language features
  • The best tools for merging, processing and arranging data
Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download. Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.

Workshop – R for Machine Learning: A Hands-On Introduction

Monday, June 17, 2019 in Las Vegas

Full-day Workshop: 8:30am – 4:30pm

Intended Audience:
Practitioners who wish to learn how to execute on predictive analytics by way of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”

Knowledge Level:
Either hands-on experience with predictive modeling (without R), or both hands-on familiarity with any programming language (other than R) and basic conceptual knowledge about predictive modeling, is sufficient background and preparation to participate in this workshop. The 2 1/2 hour “R Bootcamp” is recommended preparation for this workshop.
Workshop Description
This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling (machine learning) packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed. The instructor will guide attendees on hands-on execution with R, covering:
  • A working knowledge of the R system
  • The strengths and limitations of the R language
  • Preparing data with R, including splitting, resampling and variable creation
  • Developing predictive models with R, including the use of these machine learning methods: decision trees, ensemble methods, and others.
  • Visualization: Exploratory Data Analysis (EDA), and tools that persuade
  • Evaluating predictive models, including viewing lift curves, variable importance and avoiding overfitting
Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download. Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.

Schedule

  • Workshop starts at 8:30am
  • Morning Coffee Break at 10:30am – 11:00am
  • Lunch provided at 12:30pm – 1:15pm
  • Afternoon Coffee Break at 3:00pm – 3:30pm
  • End of the Workshop: 4:30pm
Coffee breaks and lunch are included.

Instructor

Brennan Lodge, Data Scientist VP, Goldman Sachs

Brennan is a self-proclaimed data nerd. He has been working in the financial industry for the past ten years and is striving to save the world with a little help from our machine friends. He has held cyber security, data scientist, and leadership roles at JP Morgan Chase, the Federal Reserve Bank of New York, Bloomberg, and Goldman Sachs. Brennan holds a Masters in Business Analytics from New York University and participates in the data science community with his non-profit pro-bono work at DataKind, and as a co-organizer for the NYU Data Science and Analytics Meetup. Brennan is also an instructor at the New York Data Science Academy and teaches data science courses in R and Python.

Introduction to statistics with origami (and R examples)

I have written a small booklet (about 120 pages, A5 size) on statistics and origami.
It is a very simple, basic book, useful for introductory courses.
There are several R examples, with the R script for each of them.

You can download the files for free from:
http://www.mariocigada.com/mariodocs/origanova_e.html

In the .zip file you will find:
  • the text in .pdf format
  • a colour cover
  • a (small) dataset
  • a .R file with all the scripts

Here is a link for downloading the book directly, I hope someone can find it useful:

Link for downloading the book: origanova2en

ciao
m


ODSC East 2019 Talks to Expand and Apply R Skills

R programmers are not necessarily data scientists, but rather software engineers. That is why we have an entirely new multi-track focus area that helps engineers learn AI skills – AI for Engineers. This focus area is designed specifically to help programmers get familiar with AI-driven software that utilizes deep learning and machine learning models to enable conversational AI, autonomous machines, machine vision, and other AI technologies that require serious engineering.

For example, some use R in reinforcement learning – a popular topic and arguably one of the best techniques for sequential decision making and control policies. At ODSC East, Leonardo De Marchi of Badoo will present “Introduction to Reinforcement Learning,” where he will go over the fundamentals of RL all the way to creating unique algorithms, and everything in between.

Well, what if you’re missing data? That will mess up your entire workflow – R or otherwise. In “Good, Fast, Cheap: How to do Data Science with Missing Data” with Matt Brems of General Assembly, you will start by visualizing missing data and identifying the three different types of missing data, which will allow you to see how they affect whether we should avoid, ignore, or account for the missing data. Matt will give you practical tips for working with missing data and recommendations for integrating it into your workflow.

How about getting some context? In “Intro to Technical Financial Evaluation with R” with Ted Kwartler of Liberty Mutual and Harvard Extension School, you will learn how to download and evaluate equities with the TTR (technical trading rules) package. You will evaluate an equity according to three basic indicators and be introduced to backtesting for more sophisticated analyses on your own.

There is a widespread belief that the twin modeling goals of prediction and explanation are in conflict. That is, if one desires superior predictive power, then by definition one must pay a price of having little insight into how the model made its predictions. In “Explaining XGBoost Models – Tools and Methods” with Brian Lucena, PhD, Consulting Data Scientist at Agentero, you will work hands-on using XGBoost with real-world data sets to demonstrate how to approach data with the twin goals of prediction and understanding in a manner such that improvements in one area yield improvements in the other.

What about securing your deep learning frameworks? In “Adversarial Attacks on Deep Neural Networks” with Sihem Romdhani, Software Engineer at Veeva Systems, you will tackle questions such as: How do adversarial attacks pose a real-world security threat? How can these attacks be performed? What are the different types of attacks? What defense techniques exist so far, and how can a system be made more robust against adversarial attacks? Get these answers and more here.

Data science is an art. In “Data Science + Design Thinking: A Perfect Blend to Achieve the Best User Experience,” Michael Radwin, VP of Data Science at Intuit, will offer a recipe for how to apply design thinking to the development of AI/ML products. You will learn to go broad to go narrow, focusing on what matters most to customers, and how to get customers involved in the development process by running rapid experiments and quick prototypes.

Lastly, your data and hard work mean nothing if you don’t do anything with them – that’s why the term “data storytelling” is more important than ever. In “Data Storytelling: The Essential Data Science Skill,” Isaac Reyes, TEDx speaker and founder of DataSeer, will discuss some of the most important aspects of data storytelling and visualization, such as what data to highlight, how to use colors to bring attention to the right numbers, the best charts for each situation, and more.

Ready to apply all of your R skills to the above situations? Learn more techniques, applications, and use cases at ODSC East 2019 in Boston this April 30 to May 3!

Save 10% off the public ticket price when you use the code RBLOG10 today. Register Here

More on ODSC:

ODSC East 2019  is one of the largest applied data science conferences in the world. Speakers include some of the core contributors to many open source tools, libraries, and languages. Attend ODSC East in Boston this April 30 to May 3 and learn the latest AI & data science topics, tools, and languages from some of the best and brightest minds in the field.

Coke vs. Pepsi? data.table vs. tidy? Examining Consumption Preferences for Data Scientists

By Beth Milhollin ([email protected]), Russell Zaretzki ([email protected]), and Audris Mockus ([email protected])

Coke and Pepsi have been fighting the cola wars for over one hundred years and, by now, if you are still willing to consume soft drinks, you are either a Coke person or a Pepsi person. How did you choose? Why are you so convinced and passionate about the choice? Which side is winning and why? Similarly, some users of R prefer the enhancements to the data frame provided by data.table, while others lean towards the tidy-related framework. Both popular R packages provide functionality that is lacking in the built-in R data frame. While the competition between them spans much less than one hundred years, can we draw similarities and learn from a deeper understanding of the consumption choices exhibited in the Coke vs Pepsi battle? One of the basic measures of consumer enthusiasm is the so-called Net Promoter Score (NPS).

In the business world, the NPS is widely used for products and services and it can be a leading indicator for market share growth. One simple question is the basis for the NPS – How likely are you to recommend the brand to your friends or colleagues, using a scale from 0 to 10? Respondents who give a score of 0-6 belong to the Detractors group. These unhappy customers will tend to voice their displeasure and can damage a brand. The Passive group gives a score of 7-8, and they are generally satisfied, but they are also open to changing brands. The Promoters are the loyal enthusiasts who give a score of 9-10, and these coveted cheerleaders will fuel the growth of a brand. To calculate the NPS, subtract the percentage of Detractors from the percentage of Promoters. A minimum score of -100 means everyone is a Detractor, and the elusive maximum score of +100 means everyone is a Promoter.
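As a quick illustration of the arithmetic, the NPS can be computed in a few lines of R (the scores below are made up):

# Ten hypothetical 0-10 "would you recommend?" scores
scores <- c(10, 9, 9, 8, 7, 6, 10, 3, 9, 5)
promoters  <- 100 * mean(scores >= 9)   # percentage of promoters
detractors <- 100 * mean(scores <= 6)   # percentage of detractors
promoters - detractors                  # NPS: 50 - 30 = 20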

How does the NPS play out in the cola wars? For 2019 there is a standoff, with Coca-Cola and Pepsi tied at 20. Instinct tells us that an NPS over 0 is good, but how good? The industry average for fast-moving consumer goods is 30 (“Pepsi Net Promoter Score 2019 Benchmarks”, 2019). We will let you be the judge.
In the Fall of 2018 we surveyed R users identified by the first commit that added a dependency on either package to a public repository (a repository hosted on GitHub, BitBucket, or another so-called software forge) containing code written in the R language. The results of the full survey can be seen here. The responding project contributors were identified as “data.table” users or “tidy” users by their inclusion of the corresponding R libraries in their projects. The survey responses were used to obtain a Net Promoter Score.

Unlike the cola wars, the results of the survey indicate that data.table sits near the industry average with an NPS of 28.6, while tidy brings home the gold with 59.4. What makes tidy users so much more enthusiastic about their choice of data structure library? Tune in for the next installment, where we try to track down what factors drive a user's choice.


References: “Pepsi Net Promoter Score 2019 Benchmarks.” Customer.guru, customer.guru/net-promoter-score/pepsi.

R Expands to Machine Learning and Deep Learning at ODSC East

For many, R is the go-to language when it comes to data analysis and predictive analytics. However, many data scientists are also expanding their use of R to include machine learning and deep learning.

These are exciting new topics, and ODSC East — where thousands of data scientists will gather this year in Boston — has several speakers scheduled to lead trainings in R from April 30 to May 3.

Can’t make it to Boston? Sign up now to Livestream the conference in its entirety so as not to miss out on the latest methodologies and newest developments.

Some of the most anticipated talks at this year’s conference include “Mapping Geographic Data in R” and “Modeling the tidyverse,” which are discussed further below. For a full list of speakers, click here.

Highlighted R-Specific Talks:

In “Machine Learning in R” Part 1 and Part 2, with Jared Lander of the Columbia Business School, you will learn everything ranging from the theoretical backing behind ML up through the nitty-gritty technical side of implementation.

Jared will also present “Introduction to R-Markdown in Shiny,” where he will cover everything from formatting and integrating R, to reactive expressions and outputs like text, tables, and plots. He will also give an “Intermediate R-Markdown in Shiny” presentation to go a bit deeper into the subject.

With everything Jared is presenting on, you definitely have quite a lot to work with for R and Shiny. Why not take it a step further and be able to communicate your findings? Alyssa Columbus of Pacific Life will present “Data Visualization with R Shiny,” where she will show you how to build simple yet robust Shiny applications and how to build data visualizations. You’ll learn how to use R to prepare data, run simple analyses, and display the results in Shiny web applications as you get hands-on experience creating effective and efficient data visualizations.

For many businesses, non-profits, educational institutions, and more, location is important in developing the best products, services, and messages. In “Mapping Geographic Data in R” with Joy Payton of the Children’s Hospital of Philadelphia, you will use R to take public data from various sources, combine it to find statistically interesting patterns, and display it in static and dynamic, web-ready maps.

The “tidyverse” collects some of the most versatile R packages: ggplot2, dplyr, tidyr, readr, purrr, and tibble, all working in harmony to clean, process, model, and visualize data. In “Modeling in the tidyverse,” Max Kuhn, PhD, author and creator of the caret R package, will provide a concise overview of the system along with examples and context for implementation.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. Here, Jordan Bakerman, PhD, and Robert Blanchard of SAS will present “Building Recommendation Engines and Deep Learning Models Using Python, R and SAS,” where you will use recurrent neural networks to analyze sequential data and improve the forecast performance of time series data, and use convolutional neural networks for image classification.

Connecting it All

R isn’t a siloed language; new machine and deep learning packages are being added monthly, and data scientists are finding new ways to utilize it in machine learning, deep learning, data visualization, and more. Attend ODSC East this April 30 to May 3 and learn all there is to know about the current state of R, and walk away with tangible experience!

Save 30% off ticket prices for a limited time when you use the code RBLOG10 today.

Register Here

More on ODSC: ODSC East 2019  is one of the largest applied data science conferences in the world. Speakers include some of the core contributors to many open source tools, libraries, and languages. Attend ODSC East in Boston this April 30 to May 3 and learn the latest AI & data science topics, tools, and languages from some of the best and brightest minds in the field.

Benchmarking cast in R from long data frame to wide matrix

In my daily work I often have to transform a long table to a wide matrix to accommodate some function. At some stage in my life I came across the reshape2 package, and I have stuck with that philosophy ever since – I find it makes data wrangling easy and straightforward. I particularly like the tidyverse philosophy where data should be in a long table, where each row is an observation and each column a parameter. It just makes sense.

However, I quite often have to transform the data into another format, a wide matrix, especially for functions of the vegan package, and one day I wondered how to do that in the fastest way.

The code to create the test sets and benchmark the functions is in section ‘Settings and script’ at the end of this document.

I created several data sets that mimic the data I usually work with in terms of size and values. The data sets have 2 to 10 groups, where each group can have up to 50000, 100000, 150000, or 200000 samples. The methods xtabs() from base R, dcast() from data.table, dMcast() from Matrix.utils, and spread() from tidyr were benchmarked using microbenchmark() from the package microbenchmark. Each method was evaluated 10 times on the same data set, which was repeated for 10 randomly generated data sets.
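To make the comparison concrete before looking at the timings, here is a minimal toy example of the four casts side by side. It is only an illustration and is separate from the benchmark scripts given in the ‘Settings and script’ section at the end.

library(data.table)
library(tidyr)
library(Matrix.utils)

long_df <- data.frame(Group    = factor(rep(c("g1", "g2"), each = 3)),
                      Variable = factor(rep(c("a", "b", "c"), times = 2)),
                      Value    = 1:6)

xtabs(Value ~ Group + Variable, data = long_df)                       # base R
dcast(as.data.table(long_df), Group ~ Variable, value.var = "Value")  # data.table
dMcast(long_df, Group ~ Variable, value.var = "Value")                # Matrix.utils (sparse result)
spread(long_df, Variable, Value)                                      # tidyr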

After the 10 x 10 repetitions of casting from long to wide it is clear that spread() is the worst. This is clear when we focus on the set size (figure 1).
Figure 1. Runtime for 100 repetitions of data sets of different size and complexity.
And the complexity (figure 2).
Figure 2. Runtime for 100 repetitions of data sets of different complexity and size.

Close up on the top three methods

Casting from a long table to a wide matrix is clearly slowest with spread(), whereas the remaining methods look somewhat similar. A direct comparison of the methods shows a similarity in their performance, with dMcast() from the package Matrix.utils being better, especially with the large and more complex tables (figure 3).
Figure 3. Direct comparison of set size.
I am aware that it might be too much to assume linearity between the computation times at different set sizes, but I do believe it captures the point – dMcast() and dcast() are similar, with an advantage to dMcast() for large data sets with a large number of groups. It does, however, look like dcast() scales better with the complexity (figure 4).
Figure 4. Direct comparison of number groups.
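A rough way to put numbers on that impression is to fit a simple linear trend of runtime against set size for each method and compare the slopes. The sketch below assumes the combined results have been written to the ./output directory by the main.R script in the next section.

# Read all benchmark results and fit one linear trend per method
files  <- list.files("./output", pattern = "\\.rds$", full.names = TRUE)
bm_all <- do.call(rbind, lapply(files, readRDS))
fits   <- lapply(split(bm_all, bm_all$Method),
                 function(d) lm(`Time (ms)` ~ `Set size`, data = d))
sapply(fits, function(f) coef(f)[2])   # slope: additional milliseconds per unit increase in set size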

Settings and script

Session info

 ## ─ Session info ──────────────────────────────────────────────────────────
 ## setting value 
 ## version R version 3.5.2 (2018-12-20)
 ## os Ubuntu 18.04.1 LTS 
 ## system x86_64, linux-gnu 
 ## ui X11 
 ## language en_GB:en_US 
 ## collate en_DE.UTF-8 
 ## ctype en_DE.UTF-8 
 ## tz Europe/Berlin 
 ## date 2019-02-03 
 ## 
 ## ─ Packages ──────────────────────────────────────────────────────────────
 ## package * version date lib source 
 ## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.2)
 ## bindr 0.1.1 2018-03-13 [1] CRAN (R 3.5.2)
 ## bindrcpp * 0.2.2 2018-03-29 [1] CRAN (R 3.5.2)
 ## cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.2)
 ## colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2)
 ## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2)
 ## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.2)
 ## dplyr * 0.7.8 2018-11-10 [1] CRAN (R 3.5.2)
 ## evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.2)
 ## ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.2)
 ## glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.2)
 ## gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.2)
 ## highr 0.7 2018-06-09 [1] CRAN (R 3.5.2)
 ## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.2)
 ## knitr 1.21 2018-12-10 [1] CRAN (R 3.5.2)
 ## labeling 0.3 2014-08-23 [1] CRAN (R 3.5.2)
 ## lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.2)
 ## magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.2)
 ## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.2)
 ## packrat 0.5.0 2018-11-14 [1] CRAN (R 3.5.2)
 ## pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2)
 ## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.2)
 ## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.2)
 ## purrr 0.2.5 2018-05-29 [1] CRAN (R 3.5.2)
 ## R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.2)
 ## Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2)
 ## rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
 ## rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.2)
 ## scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.2)
 ## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
 ## stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.2)
 ## stringr * 1.3.1 2018-05-10 [1] CRAN (R 3.5.2)
 ## tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2)
 ## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.2)
 ## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.2)
 ## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.2)
 ## xfun 0.4 2018-10-23 [1] CRAN (R 3.5.2)
 ## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.2)

Settings

# settings.yml
set_size: [50000, 100000, 150000, 200000]
num_groups: [2, 3, 4, 5, 6, 7, 8, 9, 10]
benchmark_repetitions: 10
num_test_sets: 10
max_value: 500
word_length: 10

Data creation and benchmarking scripts

# main.R
# Global variables ----------------------------------------------
# Set this to FALSE if you want to run the complete analysis
running_test <- TRUE
vars <- yaml::read_yaml("./settings.yml")
set_size <- vars$set_size
num_groups <- vars$num_groups
benchmark_repetitions <- vars$benchmark_repetitions
num_test_sets <- vars$num_test_sets
max_value <- vars$max_value
word_length <- vars$word_length

# Test variables ------------------------------------------------
if(running_test){
  set_size <- seq.int(0L, 60L, 30L)
  num_groups <- c(2L:3L)
  benchmark_repetitions <- 2L
  num_test_sets <- 2L
}


# Libraries ------------------------------------------------------
suppressPackageStartupMessages(library(foreach))
suppressPackageStartupMessages(library(doParallel))


# Setup parallel -------------------------------------------------
num_cores <- detectCores() - 1

these_cores <- makeCluster(num_cores, type = "PSOCK")
registerDoParallel(these_cores)

# Functions ------------------------------------------------------
run_benchmark <- function(as){
  source("test_cast.R")
  num_groups <- as["num_groups"]
  set_size <- as["set_size"]
  word_length <- as["word_length"]
  max_value <- as["max_value"]
  # Take the number of repetitions from the settings vector so that it is
  # also available on the parallel workers
  benchmark_repetitions <- as["benchmark_repetitions"]

  test_data <- prepare_test_data(set_size, num_groups, word_length, max_value)
  perform_benchmark(test_data, benchmark_repetitions)
}


# Setup and run tests --------------------------------------------
set_size <- set_size[set_size > 0]

analysis_comb <- expand.grid(num_groups, set_size)
analysis_settings <- vector("list", length = nrow(analysis_comb))

for(i in seq_len(nrow(analysis_comb))){
  analysis_settings[[i]] <- c(num_groups = analysis_comb[i, "Var1"],
                              set_size = analysis_comb[i, "Var2"],
                              word_length = word_length,
                              max_value = max_value,
                              benchmark_repetitions = benchmark_repetitions)
}


for(as in analysis_settings){
  num_groups <- as["num_groups"]
  set_size <- as["set_size"]

  report_str <- paste("ng:", num_groups,
                      "- setsize:", set_size, "\n")
  cat(report_str)

  rds_file_name <- paste0("./output/benchmark_setsize-", set_size,
                          "_ng-", num_groups, ".rds")

  bm_res <- foreach(seq_len(num_test_sets), .combine = "rbind") %dopar% {
    run_benchmark(as)
  }

  bm_res <- dplyr::mutate(bm_res, `Number groups` = num_groups,
                          `Set size` = set_size)

  saveRDS(bm_res, rds_file_name)
  report_str
}
# test_cast.R
setup <- function(){
  library(data.table)
  library(tidyr)
  library(dplyr)
  library(Matrix.utils)
  library(tibble)
}

prepare_test_data <- function(set_size, num_groups, word_length, sample_int_n){
  calc_subset_size <- function(){
    subset_size <- 0
    while(subset_size == 0 | subset_size > set_size){
      subset_size <- abs(round(set_size - set_size/(3 + rnorm(1))))
    }
    subset_size
  }

  words <- stringi::stri_rand_strings(set_size, word_length)
  subset_sizes <- replicate(num_groups, calc_subset_size())

  purrr::map_df(subset_sizes, function(subset_size){
    tibble::tibble(Variable = sample(words, subset_size),
                   Value = sample.int(sample_int_n, subset_size, replace = TRUE),
                   Group = stringi::stri_rand_strings(1, word_length))
  })
}

test_tidyr <- function(test_df){
  test_df %>%
    spread(Variable, Value, fill = 0L) %>%
    tibble::column_to_rownames(var = "Group") %>%
    as.matrix.data.frame()
}

test_xtabs <- function(test_df){
  xtabs(Value ~ Group + Variable, data = test_df)
}

test_dMcast <- function(test_df){
  class(test_df) <- "data.frame"
  dMcast(test_df, Group ~ Variable, value.var = "Value")
}

test_dcast <- function(test_df){
  test_df_dt <- data.table(test_df)
  dcast(test_df_dt, Group ~ Variable, value.var = "Value", fill = 0) %>%
    tibble::column_to_rownames(var = "Group") %>%
    as.matrix.data.frame()
}


perform_benchmark <- function(test_df, benchmark_repetitions){
  suppressPackageStartupMessages(setup())
  bm_res <- microbenchmark::microbenchmark(
    Spread = test_tidyr(test_df = test_df),
    Xtabs = test_xtabs(test_df = test_df),
    dMcast = test_dMcast(test_df = test_df),
    dcast = test_dcast(test_df = test_df),
    times = benchmark_repetitions
  )
  class(bm_res) <- "data.frame"

  bm_res %>%
    mutate(time = microbenchmark:::convert_to_unit(time, "ms")) %>%
    rename(Method = expr, `Time (ms)` = time)
}

How GPL makes me leave R for Python :-(


Being a data scientist in a startup, I can program in several languages, but often R is a natural choice.
Recently I wanted my company to build a product based on R. It simply seemed like a perfect fit.

But this turned out to be a slippery slope into the open-source code licensing field, which I wasn’t really aware of before.

Bottom line: legal advice was not to use R!

Was it a single lawyer? No. The company was willing to “play along” with me, and we had a consultation with 4 different software lawyers, one after the other.
What is the issue? R is licensed as GPL 2, and most R packages are also GPL (whether 2 or 3).

GPL is not a permissive license. It is categorized as “strongly protective”.

In layman's terms, if you build your work on a GPL program, it may force you to license your product under the GPL too. In other words, it restrains you from keeping your code proprietary.

Now you say – “This must be wrong”, and “You just don’t understand the license and its meaning”, right? You may also mention that Microsoft and other big companies are using R, and provide R services.

Well, maybe. I do believe there are ways to make your code proprietary, legally. But, when your software lawyers advise to “make an effort to avoid using this program” you do not brush them off 🙁


Now, for some details.

As a private company, our code needs to be proprietary. Our core is not services, but the software itself. We need to avoid handing our source code to a customer. The program itself will be installed on a customer’s server. Most of our customers have sensitive data, and a SaaS model (or a connection to the internet) is out of the question. Can we use R?

The R Core Team addressed the question “Can I use R for commercial purposes?”. But, as lawyers told us, the way it is addressed does not solve much. Any GPL program can be used for commercial purposes. You can offer your services installing the software, or sell a visualization you’ve prepared with ggplot2. But, it does not answer the question – can I write a program in R, and have it licensed with a non-GPL license (or simply – a commercial license)?

The key question we were asked was whether our work is a “derivative work” of R. Now, R is an interpreted programming language. You can write your code in Notepad and it will run perfectly. Logic says that if you do not modify the original software (R) and you do not copy any of its source code, you did not make a derivative work.

As a matter of fact, when you read the FAQ of the GPL license it almost seems that indeed there is no problem. Here is a paragraph from the Free Software Foundation (https://www.gnu.org/licenses/gpl-faq.html#IfInterpreterIsGPL):

If a programming language interpreter is released under the GPL, does that mean programs written to be interpreted by it must be under GPL-compatible licenses?

When the interpreter just interprets a language, the answer is no. The interpreted program, to the interpreter, is just data; a free software license like the GPL, based on copyright law, cannot limit what data you use the interpreter on. You can run it on any data (interpreted program), any way you like, and there are no requirements about licensing that data to anyone.

Problem solved? Not quite. The next paragraph shuffles the cards:

However, when the interpreter is extended to provide “bindings” to other facilities (often, but not necessarily, libraries), the interpreted program is effectively linked to the facilities it uses through these bindings. So if these facilities are released under the GPL, the interpreted program that uses them must be released in a GPL-compatible way. The JNI or Java Native Interface is an example of such a binding mechanism; libraries that are accessed in this way are linked dynamically with the Java programs that call them. These libraries are also linked with the interpreter. If the interpreter is linked statically with these libraries, or if it is designed to link dynamically with these specific libraries, then it too needs to be released in a GPL-compatible way.

Another similar and very common case is to provide libraries with the interpreter which are themselves interpreted. For instance, Perl comes with many Perl modules, and a Java implementation comes with many Java classes. These libraries and the programs that call them are always dynamically linked together.

A consequence is that if you choose to use GPLed Perl modules or Java classes in your program, you must release the program in a GPL-compatible way, regardless of the license used in the Perl or Java interpreter that the combined Perl or Java program will run on.

This is commonly interpreted as “You can use R, as long as you don’t call any library”.


Now, can you think of using R without, say, the tidyverse package? Tidyverse is a GPL library. And if you want to create a Shiny web app, you still use the shiny library (also GPL). Even if you purchase a Shiny Server Pro commercial license, this still does not resolve the shiny library itself being licensed as GPL.

Furthermore, we often use quite a lot of R libraries – and almost all are GPL. Same goes for a shiny app, in which you are likely to use many GPL packages to make your product look and behave as you want it to.

Is it legal to use R after all?

I think it is. The term “library” may be the cause of the confusion. As Perl is mentioned specifically in the GPL FAQ quoted above, it is worth noting that Perl addressed the issue of a GPL-licensed interpreter running proprietary scripts head on (https://dev.perl.org/licenses/): “my interpretation of the GNU General Public License is that no Perl script falls under the terms of the GPL unless you explicitly put said script under the terms of the GPL yourself.

Furthermore, any object code linked with perl does not automatically fall under the terms of the GPL, provided such object code only adds definitions of subroutines and variables, and does not otherwise impair the resulting interpreter from executing any standard Perl script.”



There may also be a hidden explanation by which most libraries are fine to use. As said above, it is possible the confusion is caused by the use of the term “library” in different ways.

Linking/binding is a technical term for what occurs when compiling software together. This is not what happens with most R packages, as may be understood when reading the following question and answer: Does an Rcpp-dependent package require a GPL license?

The question explains why (due to GPL) one should NOT use the Rcpp R library. Can we infer from it that it IS ok to use most other libraries?

“This is not legal advice”

As we’ve seen, what is and is not legal to do with R, being GPL, is far from clear.

Everything that is written on the topic is also marked as “not legal advice”. While this may not be surprising, one has a hard time convincing a lawyer to be permissive when the software owners are not clear about it. For example, the FAQ “Can I use R for commercial purposes?” mentioned above begins with “R is released under the GNU General Public License (GPL), version 2. If you have any questions regarding the legality of using R in any particular situation you should bring it up with your legal counsel” and ends with “None of the discussion in this section constitutes legal advice. The R Core Team does not provide legal advice under any circumstances.”

In between, the information is not very decisive, either. So at the end of the day, it is unclear what the actual legal situation is.

Another thing one of the software lawyers told us is that investors do not like GPL. In other words, even if it turns out that it is legal to use R with its libraries – a venture capital investor may be reluctant. If true, this may cause delays and may also require additional work convincing the potential investor that what you are doing is indeed flawless. Hence, the lawyers told us, it is best if you can find an alternative that is not GPL at all.

What makes Python better?

Most of the “R vs. Python” articles are pure junk, IMHO. They express nonsense commonly written in the spirit of “Python is a general-purpose language with a readable syntax. R, however, is built by statisticians and encompasses their specific language.” Far away from the reality as I see it.

 
Chart: Most Common Licenses for Python Packages (source: https://snyk.io/blog/over-10-of-python-packages-on-pypi-are-distributed-without-any-license/)
But Python has a permissive license. You can distribute it, you can modify it, and you do not have to worry your code will become open-source, too. This truly is a great advantage.

Is there anything in between a permissive license and a GPL?

Yes there is.

For example, there is the Lesser GPL (LGPL). As described in Wikipedia: “The license allows developers and companies to use and integrate a software component released under the LGPL into their own (even proprietary) software without being required by the terms of a strong copyleft license to release the source code of their own components. However, any developer who modifies an LGPL-covered component is required to make their modified version available under the same LGPL license.” Isn’t this exactly what the R choice of a license was aiming at?

Others use an exception. Javascript, for example, is also GPL. But they added the following exception: “As a special exception to GPL, any HTML file which merely makes function calls to this code, and for that purpose includes it by reference shall be deemed a separate work for copyright law purposes. In addition, the copyright holders of this code give you permission to combine this code with free software libraries that are released under the GNU LGPL. You may copy and distribute such a system following the terms of the GNU GPL for this code and the LGPL for the libraries. If you modify this code, you may extend this exception to your version of the code, but you are not obligated to do so. If you do not wish to do so, delete this exception statement from your version.

R is not LGPL. R has no written exceptions.

  The fact that R and most of its libraries use a GPL license is a problem. At the very least it is not clear if it is really legal to use R to write proprietary code.

Even if it is legal, Python still has an advantage being a permissive license, which means “no questions asked” by potential customers and investors.

  It would be good if the R core team, as well as people releasing packages, were clearer about the proper use of the software, as they see it. They could take Perl as an example.

It would be even better if the license were changed: at least by adding an exception, or by moving to the LGPL or (best) a permissive license.
