Web data acquisition: from database to dataframe for data analysis and visualization (Part 4)

The previous post described how the deeply nested JSON data on flights were parsed and stored in an R-friendly database structure. However, looking into the data, the information is not yet ready for statistical analysis and visualization, and some further processing is necessary before extracting insights and producing nice plots. The parsed batch clearly shows the redundant structure of the data, with the flight id repeated for each segment of each flight. This is also confirmed by the following simple check: the dataframe has more rows than there are unique values in the id column.
dim(data_items)
[1] 397  15

length(unique(data_items$id))
201

# real time changes of data could produce different results
This implies that the information on each segment of each flight has to be aggregated and merged into a dataset in which each row is a single observation for a statistical analysis of, for example, price against distance. First, a unique primary key for each observation has to be used as the reference variable that uniquely identifies each element of the dataset.
library(plyr) # sql like functions
library(readr) # parse numbers from strings
 
data_items <- data.frame(data_items)
 
# id (primary key)
data <- data.frame(unique(data_items$id))
colnames(data) <- c('id')
 
# n° of segment
n_segment <- aggregate(data_items['segment.id'], by=data_items['id'], length)
data <- join(data, n_segment, by='id', type='left', match='first') # sql left join
# mileage
mileage <- aggregate(data_items['segment.leg.mileage'], by=data_items['id'], sum)
data <- join(data, mileage, by='id', type='left', match='first') # sql left join
# price
price <- data.frame('id'=data_items$id, 'price'=parse_number(data_items$saleTotal))
data <- join(data, price, by='id', type='left', match='first') # sql left join
# dataframe
colnames(data) <- c('id','segment', 'mileage', 'price')
head(data)

The aggregation of mileage and price using the unique primary key allows us to set up a dataframe ready for statistical analysis and data visualization. The current data tell us that there is a maximum of three segments in the connection between FCO and LHR, with a minimum price of around EUR 122 and a median of around EUR 600.

# descriptive statistics
summary(data)
 
 
library(ggplot2)   # plotting
library(gridExtra) # for grid.arrange
library(ggExtra)   # for ggMarginal

# histogram price & distance
g1 <- ggplot(data, aes(x=price)) + 
  geom_histogram(bins = 50) +  
  ylab("Distribution of the Price (EUR)") +
  xlab("Price (EUR)") 
 
g2 <- ggplot(data, aes(x=mileage)) + 
  geom_histogram(bins = 50) +  
  ylab("Distribution of the Distance") +
  xlab("Distance (miles)")
 
grid.arrange(g1, g2)
# price - distance relationship
s0 <- ggplot(data = data, aes(x = mileage, y = price)) +
    geom_smooth(method = "lm", se=FALSE, color="black") +
    geom_point() + labs(x = "Distance in miles", y = "Price (EUR)")
s0 <- ggMarginal(s0, type = "histogram", binwidth = 30)
s0

Of course, plenty of other analyses and graphical representations of flight features are possible, given the large set of variables available in the QPX Express API and the availability of data in real time.
To conclude the 4-step (flight) trip from data acquisition to data analysis, let's recap the most important concepts described in each of the posts: 1) Client-Server connection 2) POST request in R 3) Data parsing and structuring 4) Data analysis and visualization
That's all folks! #R #rstats #maRche #json #curl #qpxexpress #Rbloggers This post is also shared on www.r-bloggers.com

Shiny app to explore ggplot2

Do you struggle with learning ggplot2? Do you have problems with understanding what aesthetics actually do and how manipulating them change your plots?

Here is the solution! Explore 33 ggplot2 geoms on one website! I created this ggplot2 explorer to help all R learners understand how to plot beautiful/useful charts using the most popular visualization package, ggplot2. It won’t teach you how to write the code, but it will definitely show you what ggplot2 geoms look like and how manipulating their arguments changes the visualization. You can find my app here

And you can find the code on my GitHub
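For instance, the kind of before/after comparison the app makes visual can be reproduced in a few lines of plain ggplot2 (the dataset and argument values below are my own illustration, not taken from the app):

library(ggplot2)

# same data, same geom: only the arguments change
p <- ggplot(mtcars, aes(x = wt, y = mpg))
p + geom_point()                                              # defaults
p + geom_point(size = 3, alpha = 0.5, colour = "steelblue")   # size, transparency, colour
p + geom_point(shape = 21, fill = "orange", stroke = 1.5)     # a different point shape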

Example

Shiny: data presentation with an extra

A Shiny app with three tabs presenting different sections of the same data.
Shiny
is a web application framework based on R/RStudio which enables interactive exploration of data through a dashboard with drop-down lists and checkboxes—programming-free. The apps can be useful to both the data analyst and the public.

Shiny apps are web-based: this allows for private consultation of the data in one’s own browser as well as for online publication. Free apps can handle quite a lot of data, and that capacity can be increased with a subscription.

The target user of Shiny is extremely broad. Let’s take science—open science. At a time when openly archiving all data is becoming standard practice (e.g., OSF.io, Figshare.com, Github.com), Shiny can be used to walk the extra mile by letting people tour the data at once without programming. It’s the right tool for acknowledging all aspects of the data. Needless to say, these apps do not replace raw data archiving.

The apps simply add to it. For instance, the data in the lower app is a little noisy, right? Indeed, it shows none of the succession of waves that characterizes word reading. The app helped me in identifying this issue. Instead of trawling through a host of participants and electrodes with a heavy analysis script, I found that the drop-down lists of the app let me seamlessly surf the data. By Participant 7, though, my wave dropped me…
 

Those data were very poor—systematically poorer than those of any other participant. I then checked the EEG preprocessing logs and confirmed that those data had to be discarded. So much for the analysis utility of such an app. As for the open science utility, what I did on discovering the fault was keep the discarded data in the app, with a note, so that colleagues and reviewers could consider it too. Now, although this example of use concerns a rather salient trait in the data, some other Shiny app might help to spot subtler patterns, such as individual differences or third variables.

Building a Shiny app is not difficult. Apps basically draw on some data presentation code (tables, plots) that you already have. Then just add a couple of scripts to the folder: one for the user interface (named ui.R), one for the processing (named server.R), and perhaps another one compiling the commands for deploying the app and checking for errors.
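As a rough sketch of that structure (the data set and the single input below are placeholders, not the EEG data discussed in this post), a minimal two-file app could look like this:

# ui.R
library(shiny)
fluidPage(
  titlePanel("Minimal data explorer"),
  sidebarLayout(
    sidebarPanel(selectInput("var", "Variable:", choices = names(mtcars))),
    mainPanel(plotOutput("hist"))
  )
)

# server.R
library(shiny)
function(input, output) {
  output$hist <- renderPlot({
    hist(mtcars[[input$var]], main = input$var, xlab = input$var)
  })
}

Running runApp() in the folder containing these two files (or clicking Run App in RStudio) starts the app locally; deployApp() then publishes it.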

The steps to start a Shiny app from scratch are:

1: Tutorials. Shiny being open-source software, the best manuals are freely available through a Google search.

2: User token. Signing up and reading in your private key—just once.

3: GO. Plunge into the ui and server scripts, and deployApp().

4: Bugs and logs. They are not bugs in fact—rather fancies. For instance, some special characters have to get even more special (technically, UTF-8 encoding). For a character such as μ, Shiny prefers Âμ. Just cling to error logs by calling:

showLogs(appPath = getwd(), appFile = NULL, appName = NULL, account = NULL, entries = 50, streaming = FALSE)

The log output will mention any typos and unaccepted characters, pointing to specific lines in your code.

It may take a couple of intense days to get a first app running. As usual with programming, it’s easy to run into the traps which are there to spice up the way. The app’s been around for years, so tips and tricks abound on the Internet. For greater companionship, there are dedicated Google groups, and then there’s StackExchange etc., where you can post any needs/despair. Post your code, log, and explanation, and you’ll be rescued out in a couple of days. Long live those contributors.

It will often be enough to upload a bare app, but you might then think it can look better.

5 (optional): Pro up.
Use tabs to combine multiple apps in one, use different widgets, etc. Tutorials like this one on Youtube can take you there, especially those that provide the code, as in the description of that video. Use those scripts as templates. For example, see this script in which the function conditionalPanel() is used to modify the app’s sidebar based on which tab is selected. The utility of tabs is illustrated in the upper cover of this article and in the app shown in the text: When having multiple data sections, the tabs allow you to have all in one (cover screenshot), instead of linking to other apps in each (screenshot in text).
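As an illustration of that idea (the tab names, inputs, and data below are made up for the sketch, not taken from the apps in the screenshots), a sidebar can react to the selected tab like this:

library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # this slider is only shown while the "Plot" tab is active
      conditionalPanel(
        condition = "input.tabs == 'Plot'",
        sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
      )
    ),
    mainPanel(
      tabsetPanel(id = "tabs",
                  tabPanel("Data", tableOutput("table")),
                  tabPanel("Plot", plotOutput("plot")))
    )
  )
)

server <- function(input, output) {
  output$table <- renderTable(head(faithful))
  output$plot  <- renderPlot(hist(faithful$eruptions, breaks = input$bins))
}

shinyApp(ui, server)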

Time for logistics. You can include any text in your app’s webpage, such as explanations of any length, web links, and author information. Also importantly: a Shiny app allows for the presentation of data in any of the forms available in R—notably, plots and tables. Costs: Shiny is free to use, in principle. You may only have to pay if your app(s) are used a lot—commonly starting from the first month—in which case you may pay 9 euros a month. There are different subscription plans. The free plan allows 25 hours of use per month, distributed among up to five apps.

There do exist alternatives to Shiny. One is fairly similar: it’s called Tableau. A nice comparison of the two is available here. Then there is a more advanced option called D3.js, a JavaScript library which allows for lush graphics but proves more restrictive to newbies.

In sum, if you already program in R or even in Python, but have not tried online data presentation yet, consider it.

Feel free to share your ideas, experiences, or doubts in a comment on the original post.

New online datacamp course: Forecasting in R

Forecasting in R is taught by Rob J. Hyndman, author of the forecast package

Forecasting involves making predictions about the future. It is required in many situations: deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (as for capital investments) or only a few minutes beforehand (as for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning. This course provides an introduction to time series forecasting using R.
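To give a flavour of the workflow the course builds up (this snippet uses the built-in AirPassengers series and the forecast package; it is not taken from the course materials):

library(forecast)

# plot the series, fit an automatically selected ARIMA model, forecast two years ahead
autoplot(AirPassengers)
fit <- auto.arima(AirPassengers)
fc <- forecast(fit, h = 24)
autoplot(fc)
accuracy(fit)   # in-sample accuracy measures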

What You’ll Learn
Chapter 1: Exploring and Visualizing Time Series in R
The first thing to do in any data analysis task is to plot the data.
Chapter 2: Benchmark Methods and Forecast Accuracy
In this chapter, you will learn general tools that are useful for many different forecasting situations.
Chapter 3: Exponential Smoothing
This framework generates reliable forecasts quickly and for a wide range of time series.
Chapter 4: Forecasting with ARIMA Models
ARIMA models provide another approach to time series forecasting.
Chapter 5: Advanced Methods
In this chapter, you will look at some methods that handle more complicated seasonality.

You can start the first chapter of Forecasting in R for free.

It can be easy to explore data generating mechanisms with the simstudy package

I learned statistics and probability by simulating data. Sure, I battled my way through proofs, but I never believed the results until I saw them in a simulation. I guess I have it backwards, but it worked for me. And now that I do this for a living, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure – I’ll use the occasional formula, but I always feel the need to check it with simulation. It’s just the way I am.

Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called simstudy, or simulating study data. My goal here is to introduce the basic idea behind simstudy, and provide a relatively simple example that comes from a question posed by a user who was trying to generate correlated longitudinal data.

The basic idea

Simulation using simstudy has two essential steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved. Simulation can include observed or randomized treatment assignment/exposures, survival data, longitudinal/panel data, multi-level/hierarchical data, data sets with correlated variables based on a specified covariance structure, and data sets with missing data resulting from any sort of missingness pattern.

The key to simulating data in simstudy is the creation of a series of data definition tables. Here’s the code used to generate such a definition, which is stored as a data.table:

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")
To create a simple data set based on these definitions, all one needs to do is execute a single genData command. In this example, we generate 500 records that are based on the definition in the def table:

dt <- genData(500, def)

dt
##      idnum nr       x1       y1  y2 xCat          g1 a1
##   1:     1  7 10.74929 30.01273 123    3 13310.84731  0
##   2:     2  7 18.56196 44.77329  17    1   395.41606  0
##   3:     3  7 16.96044 43.76427  42    3   522.45258  0
##   4:     4  7 19.51511 45.14214  22    3  3045.06826  0
##   5:     5  7 10.79791 27.25012 128    1   406.88647  0
##  ---                                                   
## 496:   496  7 19.96636 52.05377  21    3   264.85557  1
## 497:   497  7 15.97957 39.62428  44    2    40.59823  0
## 498:   498  7 19.74036 47.32292  21    3  3410.54787  0
## 499:   499  7 19.71597 48.26259  25    3   206.90961  1
## 500:   500  7 14.60405 28.94185  53    1    97.43068  0
There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want to give a little more of an introduction by way of an example that came in from across the globe a couple of days ago. (I’d say the best thing about building an R package is hearing from folks literally all over the world and getting to talk to them about statistics and R.)

Going a bit further: simulating a prospective cohort study with repeated measures

The question was, can we simulate a study with two arms, say a control and treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? The answer is, of course.

Drawing on her original code, we wanted a scenario that included two treatment arms or exposures and three measurements per individual. The change over time was supposed to be greater for one of the groups. This was what I sent back to my correspondent:

# Define the outcome

ydef <- defDataAdd(varname = "Y", dist = "normal", 
                   formula = "5 + 2.5*period + 1.5*T + 3.5*period*T", 
                   variance = 3)

# Generate a 'blank' data.table with 24 observations and assign them to groups

set.seed(1234)

indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

# Create a longitudinal data set of 3 records for each id

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# Let's look at the data

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")
My correspondent quickly pointed out that I hadn’t really provided her with the full solution, since the measurements for each individual in the example above are not correlated across the time periods. If we generate a data set based on 1,000 individuals and estimate a model, we see that the parameter estimates are quite good, though the estimate of alpha (at the bottom of the output), approximately 0.02, shows that there is little within-individual correlation:
# Fit a GEE model to the data
library(geepack)

fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)          4.98268 0.07227 4753.4   <2e-16 ***
## factor(T)Yes         1.48555 0.10059  218.1   <2e-16 ***
## period               2.53946 0.05257 2333.7   <2e-16 ***
## factor(T)Yes:period  3.51294 0.07673 2096.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)    2.952 0.07325
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha  0.01737 0.01862
## Number of clusters:   1000   Maximum cluster size: 3

One way to generate correlated data

To address this shortcoming, there are at least two ways we can go about it. The first is to use the simstudy function genCorData. In this example, we generate correlated errors that are normally distributed with mean 0, variance 3, and a common correlation coefficient of 0.7. With this approach, the underlying data generation process is a bit cryptic, because we are acknowledging that we don’t know what is driving the relationship between the three outcomes, just that they have some common cause or other relationship that results in a strong association. However, it does the trick:
# define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + e")

# define the correlated errors

mu <- c(0, 0, 0)
sigma <- rep(sqrt(3), 3)

# generate correlated data for each id and assign treatment

dtCor <- genCorData(24, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")

# create longitudinal data set and generate outcome based on definition

longData <- addPeriods(dtCor, nPeriods = 3, idvars = "id", timevars = c("V1", 
    "V2", "V3"), timevarName = "e")
longData <- addColumns(ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# look at the data, outcomes should appear more correlated, lines a bit straighter

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")
Again, we recover the true parameters. And this time, if we look at the estimated correlation, we see that the outcomes are indeed correlated within each individual. The estimate is 0.71, very close to the “true” value of 0.7.
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")

summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           5.0636  0.0762 4411   <2e-16 ***
## factor(T)Yes          1.4945  0.1077  192   <2e-16 ***
## period                2.4972  0.0303 6798   <2e-16 ***
## factor(T)Yes:period   3.5204  0.0426 6831   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     3.07   0.117
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.711  0.0134
## Number of clusters:   1000   Maximum cluster size: 3

Another way to generate correlated data

A second way to generate correlated data is through an individual-level random effect or random intercept. This could be considered some unmeasured characteristic of the individuals (which happens to have a convenient normal distribution with mean zero). This random effect contributes equally to all instances of an individual’s outcomes, but the outcomes for a particular individual deviate slightly from the hypothetical straight line as a result of unmeasured noise.
ydef1 <- defData(varname = "randomEffect", dist = "normal", formula = 0, variance = sqrt(3))
ydef2 <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + randomEffect", 
    variance = 1)

indData <- genData(24, ydef1)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

indData[1:6]
##    id T randomEffect
## 1:  1 0      -1.3101
## 2:  2 1       0.3423
## 3:  3 0       0.5716
## 4:  4 1       2.6723
## 5:  5 0      -0.9996
## 6:  6 1      -0.0722
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef2, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           4.9230  0.0694 5028   <2e-16 ***
## factor(T)Yes          1.4848  0.1003  219   <2e-16 ***
## period                2.5310  0.0307 6793   <2e-16 ***
## factor(T)Yes:period   3.5076  0.0449 6104   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     2.63  0.0848
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.619  0.0146
## Number of clusters:   1000   Maximum cluster size: 3
I sent all this back to my correspondent, but I haven’t heard yet if either of these solutions meets her needs. I certainly hope so. Speaking of which, if there are specific topics you’d like me to discuss related to simstudy, definitely get in touch, and I will try to write something up.

Three-Way Analysis of Variance: Simple Second-Order Interaction Effects and Simple Main Effects

In this article we will show how to run a three-way analysis of variance when both the third-order interaction effect and the second-order interaction effects are statistically significant. This type of analysis can become pretty tedious, especially when our factors have many levels, so we will try to explain it here as clearly as possible. (If you want to watch me doing these analyses live, get my free course on statistical analysis with R here.)

First of all, let’s present the fictitious data we are going to work with. Let’s suppose that a pharmaceutical company is planning to launch a new vitamin that allegedly improves the employees’ resistance to effort. The vitamin is tested on a sample of 720 employees, divided into three groups: employees who take a placebo (the control group), employees who take the vitamin in a low dose and employees who take the vitamin in a high dose. Half of the employees are male, and half are female. Moreover, we have both blue collar employees and white collar employees in our sample. The resistance to effort is measured on an arbitrary scale from 1 to 30 (30 being the highest resistance). Our goal is to determine whether the effort resistance is influenced by three factors: dose of vitamin (placebo, low dose and high dose), gender (male, female) and type of employee (blue collar, white collar). You can find the experiment data in CSV format here.

Third-order interaction effect

First of all, let’s check whether the third-order interaction effect is significant. We are going to run the analysis using the aov function in the stats package (our data frame is called vitamin).

aov1 <- aov(effort~dose*gender*type, data=vitamin)
summary(aov1)

In the formula above the interaction effect is, of course, dose*gender*type. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).

                 Df Sum Sq Mean Sq F value   Pr(>F)
dose:gender:type  2    187    93.4  22.367 3.81e-10
The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects.

Simple second-order interaction effects

The simple second-order interaction effects are the effects of each pair of factors at each level of the third factor. Specifically, we have to compute the following effects:
  1. the interaction effect of dose and type of employee, for each gender category (male and female)
  2. the interaction effect of gender and type of employee, at each dose level (placebo, low and high)
  3. the interaction effect of dose and gender, for each type of employee (blue collar and white collar).
The total number of simple second-order interaction effects is given by the sum of the factor levels. In our case we have 7 effects (3+2+2). We will not analyze all of them, because this article would become too long. We will only focus on the first set of effects, leaving the others for you as an exercise. So let’s investigate the interaction effect of dose and type of employee for each gender group. First we have to create two separate data frames, for male and female employees. We do that with the filter command in the dplyr package (though you can also use brackets or subsets).

library(dplyr)

vitamin_male <- filter(vitamin, gender=="male")
vitamin_female <- filter(vitamin, gender=="female")

Now we perform a two-way analysis of variance on each data frame (the factors being dose and type, of course).

aov1 <- aov(effort~dose*type, data=vitamin_male)
summary(aov1)

aov2 <- aov(effort~dose*type, data=vitamin_female)
summary(aov2)

The results of the analyses are shown below (we have only retained the lines with the interaction effects).
                Df Sum Sq Mean Sq F value   Pr(>F)    
dose:type     2    249   124.7   28.42 3.57e-12            
                   Df Sum Sq Mean Sq F value   Pr(>F)   
dose:type     2  137.2    68.6   17.31 6.74e-08

We can notice that both simple second-order interaction effects are significant (p<0.01). So we are dealing with a combined influence of the factors dose and type of employee in both the male and the female groups. In this situation, we have to examine the simple main effects for each factor. This is what we are going to do in the next section.

Simple main effects

Let’s compute the main effect for the factor dose of vitamin, which is the most important (after all, the company wants to demonstrate that the vitamin does affect the resistance to effort). You will be able to compute the other simple main effects yourself, using this as an example. Now we must create four separate data frames, one for each combination of the factors gender and type of employee: male – blue collar, male – white collar, female – blue collar, female – white collar.

vitamin_male_blue <- filter(vitamin, gender=="male", type=="blue collar")
vitamin_male_white <- filter(vitamin, gender=="male", type=="white collar")
vitamin_female_blue <- filter(vitamin, gender=="female", type=="blue collar")
vitamin_female_white <- filter(vitamin, gender=="female", type=="white collar")

Next we perform a one-way ANOVA for each data frame. Let’s do it for the first group, male – blue collar.

aov1 <- aov(effort~dose, data=vitamin_male_blue)
summary(aov1)

                 Df Sum Sq Mean Sq F value Pr(>F)    
dose          2 2943.5  1471.8   349.9 <2e-16

The simple main effect of the factor dose on this group is statistically significant (p<0.01). In other words, there is a significant difference in resistance to effort between the placebo, low dose and high dose levels within the male – blue collar employee category. To find out how big the differences are, we use the TukeyHSD function to compute the test with the same name.
TukeyHSD(aov1)


                         diff        lwr       upr p adj
low dose-high dose -2.528333  -3.413363 -1.643303     0
placebo-high dose  -9.558333 -10.443363 -8.673303     0
placebo-low dose   -7.030000  -7.915030 -6.144970     0

By inspecting the table we conclude that the differences in effort resistance between the dose groups are significant (p<0.01). The highest difference, in absolute value, is that between the high dose and placebo levels: about 9.5 points. So the employees who took a high dose present a higher resistance to effort than those who just took a placebo. One more example: the simple main effect of the variable dose of vitamin on the female – blue collar group.

aov1 <- aov(effort~dose, data=vitamin_female_blue)
summary(aov1)

                 Df Sum Sq Mean Sq F value Pr(>F)   
dose          2  399.6  199.81   45.57 <2e-16  
TukeyHSD(aov1)


                         diff        lwr       upr     p adj
low dose-high dose  1.083333  0.1797508  1.986916 0.0141485
placebo-high dose  -2.476667 -3.3802492 -1.573084 0.0000000
placebo-low dose   -3.560000 -4.4635826 -2.656417 0.0000000
The simple main effect is statistically significant, as shown in the first table. Furthermore, all the differences between dose levels are significant. The highest difference is that between low dose and placebo (about 3.6 points). To learn more about data analysis in R, check the free “Statistics with R” video course here.

Everyone knows that loops in R are to be avoided but vectorization is not always possible

It goes without saying that there are always many ways to solve a problem in R, but clearly some ways are better (for example, faster) than others. Recently, I found myself in a situation where I could not find a way to avoid using a loop, and I was immediately concerned, knowing that I would want this code to be flexible enough to run with a very large number of observations, possibly over many replications. Two tools immediately came to mind: data.table and Rcpp. This brief post explains the background of the simulation problem I was working on and walks through the evolution of ideas I used to address the problems I ran into. In particular, when I tried to simulate a very large number of individuals, say over 1 million, running the simulation overnight wasn’t enough.

Setting up the problem

The task in question here is not the focus, but it needs a little explanation to understand what is motivating the programming issue. I am conducting a series of simulations that involve generating an individual-level stochastic (Markov) process for any number of individuals. For the data generation, I am using the simstudy package, developed to help facilitate simulating data. The functions defDataAdd and genData are both from simstudy. The first part of the simulation involves specifying the transition matrix P that determines a state I am calling status, and then defining the probability of an event based on the particular status level at a particular time point. For each individual, I generate 36 months of data, with a status and an event for each month.
library(data.table)
library(simstudy)

set.seed(123)

P <- matrix(c(0.985, 0.015, 0.000, 0.000, 
              0.000, 0.950, 0.050, 0.000,
              0.000, 0.000, 0.850, 0.150,
              0.000, 0.000, 0.000, 1.000), 
            nrow = 4, byrow = TRUE)

form <- "(status == 1) * 0.02 + (status == 2) * 0.10 + (status == 3) * 0.20"

dtDef <- defDataAdd(varname = "event", 
            formula = form, 
            dist = "binary",
            link = "identity")

N = 5000
did <- genData(N)
In order to simulate the Markov process, I decided immediately that Rcpp would be most appropriate because I knew I could not avoid looping. Since each state of a Markov process depends on the state immediately preceding, states need to be generated sequentially, which means no obvious way to vectorize (if someone has figured that out, let me know.)
#include <RcppArmadilloExtensions/sample.h>

// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]

IntegerVector MCsim( unsigned int nMonths, NumericMatrix P, 
                     int startStatus, unsigned int startMonth ) {
  
  IntegerVector sim( nMonths );
  IntegerVector m( P.ncol());
  NumericVector currentP;
  IntegerVector newstate;
  
  unsigned int q = P.ncol();
  
  m = Rcpp::seq(1, q);
  
  sim[startMonth - 1] = startStatus;
  
  for (unsigned int i = startMonth; i < nMonths; i++) {
    
    newstate = RcppArmadillo::sample(m, 1, TRUE, P.row(sim(i-1) - 1));
    sim(i) = newstate(0);
    
  }
  
  return sim;
}
The process is simulated for each individual using the Rcpp function MCsim, but is done in the context of a data.table statement. The key here is that each individual is processed separately through the keyby = id statement. This obviates the requirement to loop through individuals even though I still need to loop within individuals for the stochastic process. This algorithm is quite fast, even with very large numbers of individuals and large numbers of observations (in this case months) per individual.
dt <- did[, .(status = MCsim(36, P, 1, 1)), 
          keyby = id]
dt[, month := 1 : .N, keyby = id]
dt <- addColumns(dtDefs = dtDef, dtOld = dt)

dt
##           id status month event
##      1:    1      1     1     0
##      2:    1      1     2     0
##      3:    1      1     3     0
##      4:    1      1     4     0
##      5:    1      1     5     0
##     ---                        
## 179996: 5000      4    32     0
## 179997: 5000      4    33     0
## 179998: 5000      4    34     0
## 179999: 5000      4    35     0
## 180000: 5000      4    36     0

This is where things begin to slow down

It is the next phase of the simulation that started to cause me problems. For the simulation, I need to assign individuals to a group or cohort which is defined by a month and is based on several factors: (1) whether an event occurred in that month, (2) whether the status of that individual in that month exceeded a value of 1, and (3) whether or not the individual experienced 2 or more events in the prior 12 months. An individual might be eligible for more than one cohort, but will be assigned to the first possible cohort (i.e. the earliest month where all three criteria are met). Again, the specifics of the simulation are not important here. What is important is the notion that the problem requires looking through individual data sequentially, something R is generally not so good at when the sequences get particularly long and must be repeated a large number of times. My first, naïve, approach was to create an R function that loops through all the individuals and loops within each individual until a cohort is found:
rAssignCohortID <- function(id, month, status, 
                            event, nInds, 
                            startMonth, thresholdNum) {
  
  cohort <- rep(0, length(id));

  for (j in (1 : nInds))  {
    
    idMonth <- month[id == j];
    idEvent <- event[id == j];
    idStatus <- status[id == j];
    
    endMonth <- length(idMonth);
    
    done <- FALSE;
    i <- max(startMonth - idMonth[1], 13);
    
    while (i <= endMonth & !done) {
      
      if (idEvent[i] == 1 & idStatus[i] > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[begin:end]);
        
        if (sumED >= thresholdNum) {
          
          cohort[id == j] <- i - 1 + month[1];
          done = TRUE;
        }
      }    
      i = i + 1;
    }    
  }
  
  return(cohort);
} 
system.time(dt[, cohort1 := rAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])
##    user  system elapsed 
##   10.92    1.81   12.89

Working through possible solutions

The naïve approach works, but can we do better? I thought Rcpp might be a solution, because we know that loops in C++ are much more efficient. However, things did not turn out so well after I translated the function into C++; in fact, they got a little worse.

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector cAssignCohortID( IntegerVector id, 
                    IntegerVector month, 
                    IntegerVector status, 
                    IntegerVector event,
                    int nInds, 
                    int startMonth, 
                    int thresholdNum) {
    
  IntegerVector cohort(id.length(), 0);
  
  IntegerVector idMonth;
  IntegerVector idEvent;
  IntegerVector idStatus;
  
  for (int j = 0; j < nInds; j++) {
    
    idMonth = month[id == j+1];
    idEvent = event[id == j+1];
    idStatus = status[id == j+1];

    int endMonth = idMonth.length();
    int sumED;
    bool done = FALSE;
    int i = std::max(startMonth - idMonth(0), 12);
    int begin;
    int end;

    while (i < endMonth & !done) {
      
      if (idEvent(i) == 1 & idStatus(i) > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[Rcpp::seq(begin, end)]);
        
        if (sumED >= thresholdNum) {
          cohort[id == j + 1] = i + month(0);
          done = TRUE;
        }
      }    
      i += 1;
    }    
  }
   
  return(cohort);
}
system.time(dt[, cohort2 := cAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])
## user  system elapsed 
## 13.88   2.03   16.05

I know that the function cAssignCohortID bogs down not in the loop, but in each phase where I need to subset the data set to work on a single id. For example, I need to execute the statement idMonth = month[id == j+1] for each id, and this apparently uses a lot of resources. I tried variations on this theme, alternative ways to subset the data set within the Rcpp function, but could get no improvement. But a light bulb went off in my head (dim as it might be), telling me that this is one of the many things data.table is particularly good at. In fact, I used this trick earlier in generating the stochastic process data. So, rather than subsetting the data within the function, I created a regular R function that handles only a single individual id at a time, and let data.table do the hard work of splitting up the data set to process by individual. As you can see, things got markedly faster.
rAssignCohort <- function(id, month, status, event, 
                   nInds, startMonth, thresholdNum) {
  
  cohort <- 0
  
  endMonth = length(month);
    
  done = FALSE;
  i = max(startMonth - month[1], 13);
  
  while (i <= endMonth & !done) {
    
    if (event[i] == 1 & status[i] > 1) {
        
      begin = i-12;
      end = i-1;
        
      sumED = sum(event[begin:end]);
        
      if (sumED >= thresholdNum) {
          
        cohort <- i - 1 + month[1];
        done = TRUE;
      }
    }    
    i = i + 1;
  }
  
  return(cohort)
}
system.time(dt[, cohort3 := rAssignCohort(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2), 
               keyby=id])
##    user  system elapsed 
##     0.2     0.0     0.2
Finally, it occurred to me that an Rcpp function that is not required to subset the data might offer yet more improvement in speed. So, for the last iteration, I combined the strengths of looping in Rcpp with the strengths of subsetting in data.table to create a formidable combination. (Even when sample sizes exceed 1 million, the data are generated in a flash.)
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int cAssignCohort( IntegerVector month, 
                  IntegerVector status,
                  IntegerVector event,
                  int startMonth, int thresholdNum) {
  
  int endMonth = month.length();
  int sumED;
  int cohort = 0;
  bool done = FALSE;
  int i = std::max(startMonth - month(0), 12);
  int begin;
  int end;
  
  while (i < endMonth & !done) {
    
    if (event(i) == 1 & status(i) > 1) {
      
      begin = i-12;
      end = i-1;
      
      sumED = sum(event[Rcpp::seq(begin, end)]);
      
      if (sumED >= thresholdNum) {
        cohort = i + month(0);
        done = TRUE;
      }
    }    
    i += 1;
  }
  return(cohort);
}
system.time(dt[, cohort4 := cAssignCohort(month, status, event, 
                               startMonth=13,  thresholdNum = 2), keyby=id])
##    user  system elapsed 
##    0.01    0.00    0.01
For a more robust comparison, let’s use the benchmark function in package rbenchmark, and you can see how well data.table performs and how much Rcpp can add when used efficiently.
library(rbenchmark)

benchmark(
  dt[, cohort1 := rAssignCohortID(id, month, status, event,       # Naïve approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort2 := cAssignCohortID(id, month, status, event,        # Rcpp approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort3 := rAssignCohort(id, month, status, event,    # data.table approach
                      nInds = N, startMonth = 13, thresholdNum = 2), keyby=id],
  dt[, cohort4 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
    replications = 5,
    columns = c("replications", "elapsed", "relative"))
##   replications elapsed relative
## 1            5   46.18    461.8
## 2            5   68.01    680.1
## 3            5    0.96      9.6
## 4            5    0.10      1.0

Postscript

I shared all of this with the incredibly helpful folks who created data.table, and they offered a data.table-only solution that avoids all looping, which I will share here for completeness. While it is an improvement over the third approach presented above (R function with a data.table keyby statement), it is still no match for the fastest solution. (But this all just goes to show that there will always be new approaches to consider, and I don’t claim to have come anywhere near to trying them all out.)
dtfunc <- function(dx) {
  
  dx[, prev12 := Reduce(`+`, shift(event, 1:12)), by=id]
  map <- CJ(id=1:N, start=13L, end=36L, event=1L, statusx=1L, prev12x=1L)
  ans <- dx[map, on=.(id, event, status > statusx, prev12 > prev12x, month >= start, month <= end), 
            .I, allow=TRUE, by=.EACHI, nomatch=0L][, .(id, I)]
  minans <- ans[, .(I=min(I)), by=id]
  
  dx <- dx[, cohort5 := 0L][minans, cohort5 := min(month) - 1L + dx$month[I], on="id", by=.EACHI]

  return(dx)
}

system.time(dtfunc(dt))
##    user  system elapsed 
##    0.18    0.00    0.17
And here is a more complete comparison of the fastest version with this additional approach:
benchmark(
  dt[, cohort6 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
  dt2 <- dtfunc(dt),
  replications = 5,
  columns = c("replications", "elapsed", "relative"))
##   replications elapsed relative
## 1            5    0.10      1.0
## 2            5    0.85      8.5

Sinew: an R Package to create self-populating roxygen2 skeletons

Sinew is an R package that generates roxygen2 skeletons populated with information scraped from within the function script. The goal of the package is to automate nearly all of the mundane tasks needed to document functions, properly set up the import fields for oxygenation, and make it easier to attain documentation consistency across functions.

Functionality

  • makeOxygen: Create a skeleton for roxygen2 documentation populated with information scraped from within the package function scripts.
  • makeImport: Create import(s) calls for DESCRIPTION, NAMESPACE, and roxygen2.
  • makeDictionary: Create an R file of all the unique roxygen2 parameter fields in a package R subdirectory.

Installation

#CRAN
install.packages('sinew')

#DEV
devtools::install_github('metrumresearchgroup/sinew')

Example Output

makeOxygen(lm)
#' @title FUNCTION_TITLE
#' @description FUNCTION_DESCRIPTION
#' @param formula PARAM_DESCRIPTION
#' @param data PARAM_DESCRIPTION
#' @param subset PARAM_DESCRIPTION
#' @param weights PARAM_DESCRIPTION
#' @param na.action PARAM_DESCRIPTION
#' @param method PARAM_DESCRIPTION, Default: 'qr'
#' @param model PARAM_DESCRIPTION, Default: TRUE
#' @param x PARAM_DESCRIPTION, Default: FALSE
#' @param y PARAM_DESCRIPTION, Default: FALSE
#' @param qr PARAM_DESCRIPTION, Default: TRUE
#' @param singular.ok PARAM_DESCRIPTION, Default: TRUE
#' @param contrasts PARAM_DESCRIPTION, Default: NULL
#' @param offset PARAM_DESCRIPTION
#' @param ... PARAM_DESCRIPTION
#' @return OUTPUT_DESCRIPTION
#' @importFrom stats model.frame

In more detail…

For new developers, getting a package ready for building and submitting to CRAN is an expletive-filled, head-scratching experience, to say the least. Trying to figure out the basics of what goes in Depends and what goes in Imports is a lost afternoon most of us would like back. Once that is understood, filling in the relevant information for each field is a mundane task even for a well-polished package developer. The out-of-the-box roxygen2 skeleton supplied by RStudio gives the bare-bones road map of what should be part of function documentation:
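It looks roughly like this (as inserted by RStudio’s Insert Roxygen Skeleton command; the function body here is just a placeholder):

#' Title
#'
#' @param x
#'
#' @return
#' @export
#'
#' @examples
my_function <- function(x) {
  x
}

Every field is left for you to fill in by hand, which is exactly the gap Sinew’s makeOxygen is meant to close.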

Machine Learning Using Support Vector Machines

Support Vector Machines (SVM) is a data classification method that separates data using hyperplanes. The concept of SVM is very intuitive and easily understandable. If we have labeled data, SVM can be used to generate multiple separating hyperplanes such that the data space is divided into segments and each segment contains only one kind of data. The SVM technique is generally useful for data with non-regularity, i.e., data whose distribution is unknown.
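As a quick, self-contained taste of SVM as a classifier (this uses the built-in iris data and the e1071 package, and is separate from the regression example developed below):

library(e1071)

# fit an SVM classifier on labeled data
svm_fit <- svm(Species ~ ., data = iris)

# compare fitted classes with the true classes
table(predicted = predict(svm_fit, iris), actual = iris$Species)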

Let’s take a simple example to understand how SVM works. Say we have only two kinds of values, represented as in the figure. This data is simple to classify, and one can see that it is clearly separated into two segments. Any line that separates the red and blue items can be used to classify the data. Had this data been multi-dimensional, any plane could separate and successfully classify it. However, we want to find the “most optimal” solution. What, then, is the characteristic of this most optimal line? We have to remember that this is just the training data and that further data points can lie anywhere in the subspace. If our line is too close to any of the data points, noisy test data is more likely to get classified in the wrong segment. We have to choose the line which lies between the groups and is at the farthest distance from each of the segments. To find this classifier line, we write it as y=ax+b and make it equidistant from the data points of each class closest to the line, so that we maximize the margin m. We know that all points on the line ax+b=0 satisfy the equation. If we draw two parallel lines, ax+b=-1 for one segment and ax+b=1 for the other, such that each line passes through the data point in its segment nearest to our line, then the distance between these two lines is our margin, m=2/||a||. Looked at another way, if we have a training dataset (x1,x2,x3…xn) and want to produce an outcome y that is either -1 or 1 (depending on which segment the data point belongs to), then our classifier should correctly classify data points in the form y=ax+b, which also means that y(ax+b)>=1 for all data points. Since we have to maximize the margin, we solve the problem under that constraint by maximizing 2/||a|| or, equivalently, minimizing ||a||^2/2. Solving this can be easy or complex depending on the dimensionality of the data, but we can do it quickly with R on a sample dataset.

To use SVM in R, I created random data with two features, x and y, in Excel. I took the values of x as a sequence from 1 to 20 and derived the corresponding values of y using the formula y(t)=y(t-1) + r(-1,9), where r(a,b) generates a random integer between a and b, and I took y(1) as 3. The following code in R illustrates a set of sample generated values:

x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)

#Create a data frame of the data
train=data.frame(x,y)

Let’s see what our data looks like. For this we use the plot function.

#Plot the dataset
plot(train,pch=16)
The data seem fairly good. Looking at the plot, it also seems like a linear regression should be a good fit. We’ll use both models and compare them. First, the code for linear regression:

#Linear regression
model <- lm(y ~ x, train)

#Plot the model using abline
abline(model)

SVM

To use SVM in R, we have the package e1071. The package is not preinstalled, so one needs to run install.packages("e1071") to install it and then load it with the library command. The syntax of the svm function is quite similar to that of linear regression.

#SVM
library(e1071)

#Fit a model. The function syntax is very similar to the lm function
model_svm <- svm(y ~ x , train)

#Use the predictions on the data
pred <- predict(model_svm, train)

#Plot the predictions over the previous plot to see our model fit
points(train$x, pred, col = "blue", pch=4)
The points follow the actual values much more closely than the abline. Can we verify that? Let’s calculate the RMSE for both models.

#Linear model has a residuals part which we can extract and directly calculate rmse
error <- model$residuals
lm_error <- sqrt(mean(error^2))   # 3.832974

#For svm, we have to manually calculate the difference between actual values (train$y) and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2))   # 2.696281

In this case, the RMSE for the linear model is ~3.83 whereas our SVM model has a lower error of ~2.7. A straightforward implementation of SVM already has a higher accuracy than the linear regression model. However, the SVM model goes far beyond that. We can further improve our SVM model and tune it so that the error is even lower. We will now go deeper into the svm function and the tune function. We can specify values for the cost parameter and for epsilon, which is 0.1 by default. A simple way is to try every value of epsilon between 0 and 1 (I will take steps of 0.01) and similarly try the cost parameter from 4 to 2^9 (I will take exponential steps of 2 here). I am taking 101 values of epsilon and 8 values of the cost parameter, so I will be testing 808 models to see which one performs best. The code may take a short while to run all the models and find the best version.

svm_tune <- tune(svm, y ~ x, data = train,
                 ranges = list(epsilon = seq(0,1,0.01), cost = 2^(2:9)))
print(svm_tune)

#Printing gives the output:
#Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
#- best parameters:
# epsilon cost
#0.09 256
#- best performance: 2.569451
#This best performance denotes the MSE. The corresponding RMSE is 1.602951, which is the square root of the MSE.

An advantage of tuning in R is that it lets us extract the best model directly from the svm_tune list. We can now see the improvement by calculating the RMSE of this best model using the following code.

#The best model
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)
error_best_mod <- train$y - best_mod_pred

# this value can be different on your computer
# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2))   # 1.290738

This tuning method is known as grid search. R runs all the models with all the possible values of epsilon and cost in the specified ranges and gives us the one with the lowest error. We can also plot the tuning object to see the performance of all the models together.

plot(svm_tune)

This plot shows the performance of the various models using color coding. Darker regions imply better accuracy. The plot is used to determine the range to which we can narrow down our search and try further tuning if required. For instance, this plot shows that I can run tuning for epsilon in the new range of 0 to 0.2 and, while I’m at it, move in even smaller steps (say 0.002), but going further may lead to overfitting, so I can stop at this point. With this model, we have improved on our initial error of 2.69 and come as close as 1.29, which is about half of our original SVM error. We have come very far in our model accuracy. Let’s see what the best model looks like when plotted.
plot(train,pch=16) points(train$x, best_mod_pred, col = "blue", pch=4) Perceptive Analytics Visually, the points predicted by our tuned model almost follow the data. This is the power of SVM and we are just seeing this for data with two features. Imagine the abilities of the model with more number of complex features!   Summary SVM is a powerful technique and especially useful for data whose distribution is unknown (also known as non-regularity in data). Because the example considered here consisted of only two features, the SVM fitted by R here is also known as linear SVM. SVM is powered by a kernel for dealing with various kinds of data and its kernel can also be set during model tuning. Some such examples include gaussian and radial. Hence, SVM can also be used for non-linear data and does not require any assumptions about its functional form. Because we separate data with the maximum possible margin, the model becomes very robust and can deal with incongruencies such as noisy test data or biased train data. We can also interpret the results produced by SVM through visualization. A common disadvantage with SVM is associated with its tuning. The level of accuracy in predicting over the training data has to be defined in our data. Because our example was custom generated data, we went ahead and tried to get our model accuracy as high as possible by reducing the error. However, in business situations when one needs to train the model and continually predict over test data, SVM may fall into the trap of overfitting. This is the reason SVM needs to be carefully modeled – otherwise the model accuracy may not be satisfactory. As I did in the example, SVM technique is closely related to regression technique. For linear data, we can compare SVM with linear regression while non-linear SVM is comparable to logistic regression. As the data becomes more and more linear in nature, linear regression becomes more and more accurate. At the same time, tuned SVM can also fit the data. However, noisiness and bias can severely impact the ability of regression. In such cases, SVM comes in really handy!   With this article, I hope one may understand the basics of SVM and its implementation in R. One can also deep dive into the SVM technique by using model tuning. The full R code used in the article is laid out as under: x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20) y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60) #Create a data frame of the data train=data.frame(x,y) #Plot the dataset plot(train,pch=16) #Linear regression model <- lm(y ~ x, train) #Plot the model using abline abline(model) #SVM library(e1071) #Fit a model. 
Because we separate data with the maximum possible margin, the model becomes very robust and can deal with incongruencies such as noisy test data or biased training data. We can also interpret the results produced by SVM through visualization. A common disadvantage of SVM is its tuning: we have to decide how closely the model should fit the training data. Because our example used custom-generated data, we went ahead and pushed the model accuracy as high as possible by reducing the error. However, in business situations, when one needs to train the model and then continually predict over new test data, SVM may fall into the trap of overfitting. This is why SVM needs to be carefully modelled; otherwise the accuracy may not be satisfactory. As this example shows, the SVM technique is closely related to regression. For linear data we can compare SVM with linear regression, while non-linear SVM is comparable to logistic regression. As the data becomes more and more linear in nature, linear regression becomes more and more accurate, and a tuned SVM can also fit the data. However, noise and bias can severely impact the ability of regression; in such cases, SVM comes in really handy!

With this article, I hope you understand the basics of SVM and its implementation in R, and how model tuning lets you dive deeper into the technique. The full R code used in the article is laid out below:

x = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y = c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)

#Create a data frame of the data
train = data.frame(x, y)

#Plot the dataset
plot(train, pch=16)

#Linear regression
model <- lm(y ~ x, train)

#Plot the model using abline
abline(model)

#SVM
library(e1071)

#Fit a model. The function syntax is very similar to the lm function
model_svm <- svm(y ~ x, train)

#Use the predictions on the data
pred <- predict(model_svm, train)

#Plot the predictions on top of the data to see our model fit
points(train$x, pred, col = "blue", pch=4)

#The linear model has a residuals component which we can extract and use directly to calculate the RMSE
error <- model$residuals
lm_error <- sqrt(mean(error^2)) # 3.832974

#For the SVM, we manually compute the difference between the actual values (train$y) and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2)) # 2.696281

# perform a grid search
svm_tune <- tune(svm, y ~ x, data = train,
                 ranges = list(epsilon = seq(0, 1, 0.01), cost = 2^(2:9)))
print(svm_tune)

#Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
#- best parameters:
#  epsilon cost
#        0    8
#- best performance: 2.872047

#The best model
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)
error_best_mod <- train$y - best_mod_pred
# this value can be different on your computer
# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2)) # 1.290738

plot(svm_tune)
plot(train, pch=16)
points(train$x, best_mod_pred, col = "blue", pch=4)

Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.

Creating Interactive Charts with R, Shiny, MySQL and AnyChart JS via Template

How to create interactive charts with R, Shiny, MySQL and AnyChart JS via a template

Data visualization and charting are an actively evolving and increasingly important part of web development. In fact, people perceive information much better when it is represented graphically rather than numerically as raw data. As a result, business intelligence apps, reports, and so on widely use graphs and charts to visualize and clarify data and, consequently, to speed up and facilitate its analysis for further decision making. While there are many ways to handle data visualization in R, today let's see how to create interactive charts with the help of the popular JavaScript (HTML5) charting library AnyChart. It has recently received an official R, Shiny and MySQL template that makes the whole process pretty straightforward and easy. (Disclaimer: I am the CTO at the AnyChart team. The template I am talking about here is released under the Apache 2.0 license. The library itself is open on GitHub and can be used for free in any personal, educational and other non-profit projects; for commercial purposes it requires a commercial license, though it is fully functional when taken on a free trial.) In this step-by-step tutorial, we will take a closer look at the template and the basic pie chart example it includes, and then I will show you how to quickly modify it to get a different data visualization, e.g. a 3D column (vertical bar) chart.

Briefly about AnyChart

AnyChart is a flexible, cross-browser JS charting library for adding interactive charts to websites and web apps in quite a simple way. Basically, it does not require any installation and works with any platform and database. Among its other features, templates for popular technology stacks, like R, Shiny and MySQL in the present case, further facilitate AnyChart's integration.

Getting started

First of all, let's make sure the R language is installed. If not, you can visit the official R website and follow the instructions. If you have worked with R before, most likely you already have RStudio; in that case, you are welcome to create a project in it now, because the R part of the work can be done there. If you do not currently have RStudio, you are welcome to install it from the official RStudio website. Actually, using RStudio is not mandatory: a plain text editor such as Notepad will be enough in our case. After that, we should check that MySQL is properly installed. To do that, open a terminal window and enter the following command:

$ mysql --version
mysql Ver 14.14 Distrib 5.7.16, for Linux (x86_64) using EditLine wrapper

You should receive the above response (or a similar one) to be sure all is well. If you do not have MySQL at the moment, please follow these instructions to install it. Now that all the required components have been installed, we are ready to write some code for our example.

Basic template

First, to download the R, Shiny and MySQL template for AnyChart, type the following command in the terminal:

$ git clone https://github.com/anychart-integrations/r-shiny-mysql-template.git

The folder you get here features the following structure:

r-shiny-mysql-template/
    www/
        css/
            style.css        # css style
    app.R                    # main application code
    database_backup.sql      # MySQL database dump
    LICENSE
    README.md
    index.html               # html template

Let's take a look at the project files and examine how this sample works. We'll run the example first. Open the terminal and go to the repository folder:

$ cd r-shiny-mysql-template

Set up the MySQL database (to specify your username and password, make use of the -u and -p flags):

$ mysql < database_backup.sql

Then run the R command line:

$ R

And install the Shiny and RMySQL packages, as well as initialize the Shiny library at the end:

> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)

If you face any problems during the installation of these dependencies, read the error messages carefully; e.g. you might need sudo apt-get install libmysqlclient-dev for installing RMySQL. Finally, run the application:

> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/r-shiny-mysql-template")

A new tab should open in your browser, showing the example included in the template: an interactive pie chart created with R and the AnyChart JS charting library (the basic sample from the R, Shiny and MySQL integration template).

Basic template: code

Now, let's go back to the folder with our template to see how it works. The LICENSE and README.md files contain information about the license and the template (how to run it, technologies, structure, etc.) respectively. They are not functionally important to our project, so we will not explore them here; please check them yourself for a general understanding. The style.css file is responsible for the styles of the page. The database_backup.sql file contains the code for creating the MySQL table and user and for writing data to the table. You can use your own table or change the data in this one.

Let's move on to the code. First, open the app.R file. This file establishes the connection to the MySQL database, reads the data, and passes it to the index.html file, which contains the main code using AnyChart. The following part of the app.R code contains the htmlTemplate function; here we specify the name of the file the data will be passed to, the titles of our page and chart, and the JSON-encoded chart data from the MySQL database:

htmlTemplate("index.html",
    title = "Anychart R Shiny template",
    chartTitle = shQuote("Top 5 fruits"),
    chartData = toJSON(loadData())
)

The main thing here is the index.html file, which is where the template for creating charts actually lives. The first part of this file simply links all the necessary files to the code, including the AnyChart library, the CSS file with styles, and so on. I'll skip this for now and proceed directly to the script tag and the anychart.onDocumentReady(function() {...}) function:

anychart.onDocumentReady(function() {
    var chart = anychart.pie({{ chartData }});
    chart.title({{ chartTitle }});
    chart.container("container");
    chart.draw();
});

This pattern works as follows. We create a pie chart by using the pie() function and pass it the data that have already been read and prepared in the R code. Please note that the names of the variables containing the data are the same in the app.R and index.html files. Then we set the chart title via chart.title({{ chartTitle }}) and specify the ID of the element that will contain the chart, which is a div with id = "container" in this case. To render everything, we call chart.draw().
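For reference, here is a minimal sketch of what a loadData() function along these lines could look like; the connection details are the ones used later in ui.R, and the template's actual query may differ (for instance, it may keep only the top five rows, matching the "Top 5 fruits" title):

library(RMySQL)

# Hypothetical sketch of loadData(); adjust credentials and query to your setup
loadData <- function() {
    db <- dbConnect(MySQL(), dbname = "anychart_db", host = "localhost",
                    port = 3306, user = "anychart_user", password = "anychart_pass")
    on.exit(dbDisconnect(db))
    dbGetQuery(db, "SELECT name, value FROM fruits")
}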

Modifying the template to create a custom chart

Now that we've explored the basic example included in the template, we can move forward and create our own custom interactive chart. To do that, we simply need to change the template a little bit and add some features if needed. Let's see how it works. First, we create all the necessary files ourselves or make a new project using RStudio. Second, we add a project folder named anychart. Its structure should look as illustrated below; some differences are possible (and acceptable) if you are using a new project in RStudio.

anychart/
    www/
        css/
            style.css        # css style
    ui.R                     # main application code
    server.R                 # sub code
    database_backup.sql      # data set
    index.html               # html template

Now you know what files you need. If you've made the project with RStudio, the ui.R and server.R files are created automatically. If you've made it yourself, just create empty files with the same names and extensions as specified above. The main difference from the original example included in the template is that we should change the index.html file and divide app.R into parts. You can copy the rest of the files or create new ones for your own chart.

Please take a look at the server.R file. If you've made the project using RStudio, it was created automatically and you don't need to change anything in it. However, if you've made it yourself, open it in a text editor such as Notepad and add the code shown further below, which is standard for the Shiny framework. You can read more about that here. The file structure of ui.R is similar to that of app.R, so you can copy app.R from the template and change/add the following lines:

loadData = dbGetQuery(db, "SELECT name, value FROM fruits")

#data preparation
data1 <- character()
for(var in 1:nrow(loadData)){
    c = c(as.character(loadData[var, 1]), loadData[var, 2])
    data1 <- c(data1, c)
}
data = matrix(data1, nrow=nrow(loadData), ncol=2, byrow=TRUE)

ui = function(){
    htmlTemplate("index.html",
        title = "Anychart R Shiny template",
        chartTitle = shQuote("Fruits"),
        chartData = toJSON(data)
    )}

Since we are going to change the chart type from pie to 3D vertical bar (column), the data needs some preparation before being passed to index.html. The main difference is that we will use the entire data set from the database, not just the top 5 positions. We will slightly modify and expand the basic template. Let's look at the resulting code of index.html first (the script tag) and then explore it:

anychart.onDocumentReady(function() {
    var chart = anychart.column3d({{ chartData }});
    chart.title({{ chartTitle }});
    chart.animation(true);
    var xAxis = chart.xAxis();
    xAxis.title("fruits");
    var yAxis = chart.yAxis();
    yAxis.title("pounds, t");
    var yScale = chart.yScale();
    yScale.minimum(0);
    yScale.maximum(120);
    chart.container("container");
    chart.draw();
});

With var chart = anychart.column3d({{ chartData }}), we are creating a 3D column chart by using the column3d() function. Here you can choose any other chart type you need; consider getting help from Chartopedia if you are unsure which one works best in your situation. Next, we add animation to the column chart via chart.animation(true) to make it appear gradually on page load. In the following section, we create two variables, xAxis and yAxis. These are required if you want to give the coordinate axes of the chart captions: you create the variables that correspond to the X and Y axes and then pass the caption text you want to see to their title() functions.
The next unit is basically optional: we explicitly specify the minimum and maximum values for the Y axis, or else AnyChart will calculate these values itself. You can do the same for the X axis. And that's it! Our 3D column chart is ready, and everything is in place to run the code successfully. The only thing left to do before that is to change the data in the MySQL table so that it looks as follows:

('apple', 100), ('orange', 58), ('banana', 81), ('lemon', 42), ('melon', 21), ('kiwi', 66),
('mango', 22), ('pear', 48), ('coconut', 29), ('cherries', 65), ('grapes', 31), ('strawberries', 76)

To see what you've got, follow the same steps as for running the R, Shiny and MySQL template example, but do not forget to change the path and the folder name to anychart. So, let's open the terminal and run the following:

$ cd anychart
$ mysql < database_backup.sql
$ R
> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)
> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/anychart")

The result is an interactive 3D column chart made with R and the AnyChart JS charting library, based on the R, Shiny and MySQL integration template. For consistency purposes, I am including the code of ui.R and server.R below. The full source code of this example can be found on GitHub.

ui.R:

library(shiny)
library(RMySQL)
library(jsonlite)

data1 <- character()

db = dbConnect(MySQL(), dbname = "anychart_db", host = "localhost", port = 3306,
               user = "anychart_user", password = "anychart_pass")
loadData = dbGetQuery(db, "SELECT name, value FROM fruits")

#data preparation
for(var in 1:nrow(loadData)){
    c = c(as.character(loadData[var, 1]), loadData[var, 2])
    data1 <- c(data1, c)
}
data = matrix(data1, nrow=nrow(loadData), ncol=2, byrow=TRUE)

server = function(input, output){}

ui = function(){
    htmlTemplate("index.html",
        title = "Anychart R Shiny template",
        chartTitle = shQuote("Fruits"),
        chartData = toJSON(data)
    )}

shinyApp(ui = ui, server = server)
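As a side note, the row-by-row loop above could be replaced by a vectorized equivalent that produces the same two-column character matrix (a sketch, assuming the same loadData result as in the code above):

# Equivalent data preparation without the explicit loop
data <- cbind(as.character(loadData$name), as.character(loadData$value))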

server.R:

library(shiny)

shinyServer(function(input, output) {
    output$distPlot <- renderPlot({
        # generate bins based on input$bins from ui.R
        x <- faithful[, 2]
        bins <- seq(min(x), max(x), length.out = input$bins + 1)
        # draw the chart with the specified number of bins
        hist(x, breaks = bins, col = 'darkgray', border = 'white')
    })
})

Conclusion

When your technology stack includes R, Shiny and MySQL, using AnyChart JS with the integration template discussed in this tutorial requires no big effort and lets you add beautiful, interactive JavaScript-based charts to your web apps quite quickly. It is also worth mentioning that you can customize the look and feel of charts created this way as deeply as needed by using the library's numerous out-of-the-box features: add or remove axis labels, change the background color and the way an axis is positioned, leverage interactivity, and so on. The scope of this tutorial is actually even broader, because the process described here not only applies to the AnyChart JS charting library but is also mostly the same for its sister libraries AnyMap (geovisualization in maps), AnyStock (date/time graphs), and AnyGantt (charts for project management). All of them are free for non-profit projects but, I must put it clearly here again just in case, require a special license for commercial use. I hope you find this article helpful in your work on interactive data visualization in R. Now, please ask your questions, if any.