Three-Way Analysis of Variance: Simple Second-Order Interaction Effects and Simple Main Effects

In this article we will show how to run a three-way analysis of variance when both the third-order interaction effect and the second-order interaction effects are statistically significant. This type of analysis can become pretty tedious, especially when our factors have many levels, so we will try to explain it here as clearly as possible. (If you want to watch me doing these analyses live, get my free course on statistical analysis with R here.)

First of all, let’s present the fictitious data we are going to work with. Let’s suppose that a pharmaceutical company is planning to launch a new vitamin that allegedly improves the employees’ resistance to effort. The vitamin is tested on a sample of 720 employees, divided into three groups: employees who take a placebo (the control group), employees who take the vitamin in a low dose and employees who take the vitamin in a high dose. Half of the employees are male, and half are female. Moreover, we have both blue collar employees and white collar employees in our sample. The resistance to effort is measured on an arbitrary scale from 1 to 30 (30 being the highest resistance). Our goal is to determine whether the effort resistance is influenced by three factors: dose of vitamin (placebo, low dose, and high dose), gender (male, female) and type of employee (blue collar, white collar). You can find the experiment data in CSV format here.

Third-order interaction effect

First, let’s check whether the third-order interaction effect is significant. We are going to run the analysis using the aov function in the stats package (our data frame is called vitamin).

aov1 <- aov(effort~dose*gender*type, data=vitamin)
summary(aov1)

In the formula above the interaction term is, of course, dose:gender:type. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).

                                  Df Sum Sq Mean Sq F value   Pr(>F)


dose:gender:type   2    187    93.4  22.367 3.81e-10
The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects.

Simple second-order interaction effects

The simple second-order interaction effects are the effects of each pair of factors at each level of the third factor. Specifically, we have to compute the following effects:
  1. the interaction effect of dose and type of employee, for each gender category (male and female)
  2. the interaction effect of gender and type of employee, at each dose level (placebo, low and high)
  3. the interaction effect of dose and gender, for each type of employee (blue collar and white collar).
The total number of simple second-order interaction effects is given by the sum of the factor levels. In our case we have 7 effects (3+2+2). We will not analyze all of them, because this article would become too long. We will only focus on the first set of effects, leaving the others for you as an exercise.

So let’s investigate the interaction effect of dose and type of employee for each gender group. First we have to create two separate data frames, for male and female employees. We do that with the filter command in the dplyr package (though you can also use brackets or subsets).

vitamin_male <- filter(vitamin, gender=="male")
vitamin_female <- filter(vitamin, gender=="female")

Now we perform a two-way analysis of variance on each data frame (the factors being dose and type, of course).

aov1 <- aov(effort~dose*type, data=vitamin_male)
summary(aov1)

aov2 <- aov(effort~dose*type, data=vitamin_female)
summary(aov2)

The results of the analyses are shown below (we have only retained the lines with the interaction effects).
Male employees:
                Df Sum Sq Mean Sq F value   Pr(>F)    
dose:type     2    249   124.7   28.42 3.57e-12            

Female employees:
                   Df Sum Sq Mean Sq F value   Pr(>F)   
dose:type     2  137.2    68.6   17.31 6.74e-08

We can notice that both simple second-order interaction effects are significant (p<0.01). So we are dealing with a combined influence of the factors dose and type of employee in both the male and female groups. In this situation, we have to examine the simple main effects for each factor. This is what we are going to do in the next section.

Simple main effects

Let’s compute the simple main effect for the factor dose of vitamin, which is the most important (after all, the company wants to demonstrate that the vitamin does affect the resistance to effort). You will be able to compute the other simple main effects yourself, using this as an example. Now we must create four separate data frames, one for each combination of the factors gender and type of employee: male – blue collar, male – white collar, female – blue collar, female – white collar.

vitamin_male_blue <- filter(vitamin, gender=="male", type=="blue collar")
vitamin_male_white <- filter(vitamin, gender=="male", type=="white collar")
vitamin_female_blue <- filter(vitamin, gender=="female", type=="blue collar")
vitamin_female_white <- filter(vitamin, gender=="female", type=="white collar")

Next we perform a one-way ANOVA for each data frame. Let’s do it for the first group, male – blue collar.

aov1 <- aov(effort~dose, data=vitamin_male_blue)
summary(aov1)

                 Df Sum Sq Mean Sq F value Pr(>F)    
dose          2 2943.5  1471.8   349.9 <2e-16

The simple main effect of the factor dose on this group is statistically significant (p<0.01). In other words, there is a significant difference between the placebo, low dose and high dose levels within the male – blue collar employees category, regarding the resistance to effort. To find out how big the differences are, we use the TukeyHSD function to compute the test of the same name.
TukeyHSD(aov1)


                                              diff        lwr       upr   p adj
low dose-high dose -2.528333  -3.413363 -1.643303     0
placebo-high dose  -9.558333 -10.443363 -8.673303     0
placebo-low dose   -7.030000  -7.915030 -6.144970     0

By inspection of the table we conclude that the differences in effort resistance between the dose groups are all significant (p<0.01). The largest difference, in absolute value, is the one between the placebo and high dose levels: about 9.56 points. So the employees who took a high dose present a higher resistance to effort than those who just took a placebo. One more example: the simple main effect of the variable dose of vitamin on the female – blue collar group.

aov1 <- aov(effort~dose, data=vitamin_female_blue)
summary(aov1)

                 Df Sum Sq Mean Sq F value Pr(>F)   
dose          2  399.6  199.81   45.57 <2e-16  
TukeyHSD(aov1)


                                            diff        lwr       upr     p adj
low dose-high dose  1.083333  0.1797508  1.986916 0.0141485
placebo-high dose  -2.476667 -3.3802492 -1.573084 0.0000000
placebo-low dose   -3.560000 -4.4635826 -2.656417 0.0000000
The simple main effect is statistically significant, as shown in the first table. Furthermore, all the differences between dose levels are significant. The largest difference is the one between placebo and low dose (about 3.56 points). To learn more about data analysis in R, check out the free “Statistics with R” video course here.
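As a closing hint for the exercise left to the reader, the remaining simple main effects can be run in one pass. The sketch below is my own addition rather than part of the original analysis, and it assumes the vitamin data frame and the factor labels used throughout this article:

# Loop over the four gender/type subgroups and run the one-way ANOVA on dose
for (g in c("male", "female")) {
  for (typ in c("blue collar", "white collar")) {
    sub_df <- subset(vitamin, gender == g & type == typ)
    fit <- aov(effort ~ dose, data = sub_df)
    cat("\n--- gender:", g, "| type:", typ, "---\n")
    print(summary(fit))
    print(TukeyHSD(fit))
  }
}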

Everyone knows that loops in R are to be avoided, but vectorization is not always possible

It goes without saying that there are always many ways to solve a problem in R, but clearly some ways are better (for example, faster) than others. Recently, I found myself in a situation where I could not find a way to avoid using a loop, and I was immediately concerned, knowing that I would want this code to be flexible enough to run with a very large number of observations, possibly over many replications. Two tools immediately came to mind: data.table and Rcpp. This brief description explains the background of the simulation problem I was working on and walks through the evolution of ideas to address the problems I ran into when I tried to simulate a large number of individuals. In particular, when I tried to simulate a very large number of individuals, say over 1 million, running the simulation overnight wasn’t enough.

Setting up the problem

The task in question here is not the focus, but needs a little explanation to understand what is motivating the programming issue. I am conducting a series of simulations that involve generating an individual-level stochastic (Markov) process for any number of individuals. For the data generation, I am using the simstudy package, which was developed to help facilitate data simulation. The functions defDataAdd and genData are both from simstudy. The first part of the simulation involves specifying the transition matrix P that determines a state I am calling status, and then defining the probability of an event that is based on a particular status level at a particular time point. For each individual, I generate 36 months of data, with a status and an event for each month.
library(data.table)
library(simstudy)

set.seed(123)

P <- matrix(c(0.985, 0.015, 0.000, 0.000, 
              0.000, 0.950, 0.050, 0.000,
              0.000, 0.000, 0.850, 0.150,
              0.000, 0.000, 0.000, 1.000), 
            nrow = 4, byrow = TRUE)

form <- "(status == 1) * 0.02 + (status == 2) * 0.10 + (status == 3) * 0.20"

dtDef <- defDataAdd(varname = "event", 
            formula = form, 
            dist = "binary",
            link = "identity")

N = 5000
did <- genData(N)
In order to simulate the Markov process, I decided immediately that Rcpp would be most appropriate because I knew I could not avoid looping. Since each state of a Markov process depends on the state immediately preceding, states need to be generated sequentially, which means no obvious way to vectorize (if someone has figured that out, let me know.)
#include <RcppArmadilloExtensions/sample.h>

// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]

IntegerVector MCsim( unsigned int nMonths, NumericMatrix P, 
                     int startStatus, unsigned int startMonth ) {
  
  IntegerVector sim( nMonths );
  IntegerVector m( P.ncol());
  NumericVector currentP;
  IntegerVector newstate;
  
  unsigned int q = P.ncol();
  
  m = Rcpp::seq(1, q);
  
  sim[startMonth - 1] = startStatus;
  
  for (unsigned int i = startMonth; i < nMonths; i++) {
    
    newstate = RcppArmadillo::sample(m, 1, TRUE, P.row(sim(i-1) - 1));
    sim(i) = newstate(0);
    
  }
  
  return sim;
}
The process is simulated for each individual using the Rcpp function MCsim, but is done in the context of a data.table statement. The key here is that each individual is processed separately through the keyby = id statement. This obviates the requirement to loop through individuals even though I still need to loop within individuals for the stochastic process. This algorithm is quite fast, even with very large numbers of individuals and large numbers of observations (in this case months) per individual.
dt <- did[, .(status = MCsim(36, P, 1, 1)), 
          keyby = id]
dt[, month := 1 : .N, keyby = id]
dt <- addColumns(dtDefs = dtDef, dtOld = dt)

dt
##           id status month event
##      1:    1      1     1     0
##      2:    1      1     2     0
##      3:    1      1     3     0
##      4:    1      1     4     0
##      5:    1      1     5     0
##     ---                        
## 179996: 5000      4    32     0
## 179997: 5000      4    33     0
## 179998: 5000      4    34     0
## 179999: 5000      4    35     0
## 180000: 5000      4    36     0

This is where things begin to slow down

It is the next phase of the simulation that started to cause me problems. For the simulation, I need to assign individuals to a group or cohort which is defined by a month and is based on several factors: (1) whether an event occurred in that month, (2) whether the status of that individual in that month exceeded a value of 1, and (3) whether or not the individual experienced 2 or more events in the prior 12 months. An individual might be eligible for more than one cohort, but will be assigned to the first possible cohort (i.e. the earliest month where all three criteria are met). Again, the specifics of the simulation are not important here. What is important is the notion that the problem requires looking through individual data sequentially, something R is generally not so good at when the sequences get particularly long and must be repeated a large number of times. My first, naïve approach was to create an R function that loops through all the individuals and loops within each individual until a cohort is found:
rAssignCohortID <- function(id, month, status, 
                            event, nInds, 
                            startMonth, thresholdNum) {
  
  cohort <- rep(0, length(id));

  for (j in (1 : nInds))  {
    
    idMonth <- month[id == j];
    idEvent <- event[id == j];
    idStatus <- status[id == j];
    
    endMonth <- length(idMonth);
    
    done <- FALSE;
    i <- max(startMonth - idMonth[1], 13);
    
    while (i <= endMonth & !done) {
      
      if (idEvent[i] == 1 & idStatus[i] > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[begin:end]);
        
        if (sumED >= thresholdNum) {
          
          cohort[id == j] <- i - 1 + month[1];
          done = TRUE;
        }
      }    
      i = i + 1;
    }    
  }
  
  return(cohort);
} 
system.time(dt[, cohort1 := rAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])
##    user  system elapsed 
##   10.92    1.81   12.89

Working through possible solutions

The naïve approach works, but can we do better? I thought Rcpp might be a solution, because we know that loops in C++ are much more efficient. However, things did not turn out so well after I translated the function into C++; in fact, they got a little worse.

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector cAssignCohortID( IntegerVector id, 
                    IntegerVector month, 
                    IntegerVector status, 
                    IntegerVector event,
                    int nInds, 
                    int startMonth, 
                    int thresholdNum) {
    
  IntegerVector cohort(id.length(), 0);
  
  IntegerVector idMonth;
  IntegerVector idEvent;
  IntegerVector idStatus;
  
  for (int j = 0; j < nInds; j++) {
    
    idMonth = month[id == j+1];
    idEvent = event[id == j+1];
    idStatus = status[id == j+1];

    int endMonth = idMonth.length();
    int sumED;
    bool done = FALSE;
    int i = std::max(startMonth - idMonth(0), 12);
    int begin;
    int end;

    while (i < endMonth & !done) {
      
      if (idEvent(i) == 1 & idStatus(i) > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[Rcpp::seq(begin, end)]);
        
        if (sumED >= thresholdNum) {
          cohort[id == j + 1] = i + month(0);
          done = TRUE;
        }
      }    
      i += 1;
    }    
  }
   
  return(cohort);
}
system.time(dt[, cohort2 := cAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])
## user  system elapsed 
## 13.88   2.03   16.05

I know that the function cAssignCohortID bogs down not in the loop, but in each phase where I need to subset the data set to work on a single id. For example, I need to execute the statement idMonth = month[id == j+1] for each id, and this apparently uses a lot of resources. I tried variations on this theme, looking for alternative ways to subset the data set within the Rcpp function, but could get no improvement. But a light bulb went off in my head (dim as it might be), telling me that this is one of the many things data.table is particularly good at. In fact, I used this trick earlier in generating the stochastic process data. So, rather than subsetting the data within the function, I created a regular R function that handles only a single individual id at a time, and let data.table do the hard work of splitting up the data set to process by individual. As you can see, things got markedly faster.
rAssignCohort <- function(id, month, status, event, 
                   nInds, startMonth, thresholdNum) {
  
  cohort <- 0
  
  endMonth = length(month);
    
  done = FALSE;
  i = max(startMonth - month[1], 13);
  
  while (i <= endMonth & !done) {
    
    if (event[i] == 1 & status[i] > 1) {
        
      begin = i-12;
      end = i-1;
        
      sumED = sum(event[begin:end]);
        
      if (sumED >= thresholdNum) {
          
        cohort <- i - 1 + month[1];
        done = TRUE;
      }
    }    
    i = i + 1;
  }
  
  return(cohort)
}
system.time(dt[, cohort3 := rAssignCohort(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2), 
               keyby=id])
##    user  system elapsed 
##     0.2     0.0     0.2
Finally, it occurred to me that an Rcpp function that is not required to subset the data might offer yet more improvement in speed. So, for the last iteration, I combined the strengths of looping in Rcpp with the strengths of subsetting in data.table to create a formidable combination. (Even when sample sizes exceed 1 million, the data are generated in a flash.)
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int cAssignCohort( IntegerVector month, 
                  IntegerVector status,
                  IntegerVector event,
                  int startMonth, int thresholdNum) {
  
  int endMonth = month.length();
  int sumED;
  int cohort = 0;
  bool done = FALSE;
  int i = std::max(startMonth - month(0), 12);
  int begin;
  int end;
  
  while (i < endMonth & !done) {
    
    if (event(i) == 1 & status(i) > 1) {
      
      begin = i-12;
      end = i-1;
      
      sumED = sum(event[Rcpp::seq(begin, end)]);
      
      if (sumED >= thresholdNum) {
        cohort = i + month(0);
        done = TRUE;
      }
    }    
    i += 1;
  }
  return(cohort);
}
system.time(dt[, cohort4 := cAssignCohort(month, status, event, 
                               startMonth=13,  thresholdNum = 2), keyby=id])
##    user  system elapsed 
##    0.01    0.00    0.01
For a more robust comparison, let’s use the benchmark function in the rbenchmark package; you can see how well data.table performs and how much Rcpp can add when used efficiently.
library(rbenchmark)

benchmark(
  dt[, cohort1 := rAssignCohortID(id, month, status, event,       # Naïve approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort2 := cAssignCohortID(id, month, status, event,        # Rcpp approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort3 := rAssignCohort(id, month, status, event,    # data.table approach
                      nInds = N, startMonth = 13, thresholdNum = 2), keyby=id],
  dt[, cohort4 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
    replications = 5,
    columns = c("replications", "elapsed", "relative"))
##   replications elapsed relative
## 1            5   46.18    461.8
## 2            5   68.01    680.1
## 3            5    0.96      9.6
## 4            5    0.10      1.0

Postscript

I shared all of this with the incredibly helpful folks who have created data.table, and they offered a data.table-only solution that avoids all looping, which I will share here for completeness. While it is an improvement over the third approach presented above (an R function with a data.table keyby statement), it is still no match for the fastest solution. (But this all just goes to show that there will always be new approaches to consider, and I don’t claim to have come anywhere near trying them all out.)
dtfunc <- function(dx) {
  
  dx[, prev12 := Reduce(`+`, shift(event, 1:12)), by=id]
  map <- CJ(id=1:N, start=13L, end=36L, event=1L, statusx=1L, prev12x=1L)
  ans <- dx[map, on=.(id, event, status > statusx, prev12 > prev12x, month >= start, month <= end), 
            .I, allow=TRUE, by=.EACHI, nomatch=0L][, .(id, I)]
  minans <- ans[, .(I=min(I)), by=id]
  
  dx <- dx[, cohort5 := 0L][minans, cohort5 := min(month) - 1L + dx$month[I], on="id", by=.EACHI]

  return(dx)
}

system.time(dtfunc(dt))
##    user  system elapsed 
##    0.18    0.00    0.17
And here is a more complete comparison of the fastest version with this additional approach:
benchmark(
  dt[, cohort6 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
  dt2 <- dtfunc(dt),
  replications = 5,
  columns = c("replications", "elapsed", "relative"))
##   replications elapsed relative
## 1            5    0.10      1.0
## 2            5    0.85      8.5

Sinew: an R Package to create self-populating roxygen2 skeletons

Sinew is an R package that generates roxygen2 skeletons populated with information scraped from within the function script. The goal of the package is to automate nearly all of the mundane tasks needed to document functions, properly set up the import fields for oxygenation, and make it easier to attain documentation consistency across functions.

Functionality

  • makeOxygen: Create a skeleton for roxygen2 documentation populated with information scraped from within the package function scripts.
  • makeImport: Create import(s) calls for DESCRIPTION, NAMESPACE, and roxygen2.
  • makeDictionary: Create an R file of all the unique roxygen2 parameter fields in a package R subdirectory.

Installation

#CRAN
install.packages('sinew')

#DEV
devtools::install_github('metrumresearchgroup/sinew')

Example Output

makeOxygen(lm)
#' @title FUNCTION_TITLE
#' @description FUNCTION_DESCRIPTION
#' @param formula PARAM_DESCRIPTION
#' @param data PARAM_DESCRIPTION
#' @param subset PARAM_DESCRIPTION
#' @param weights PARAM_DESCRIPTION
#' @param na.action PARAM_DESCRIPTION
#' @param method PARAM_DESCRIPTION, Default: 'qr'
#' @param model PARAM_DESCRIPTION, Default: TRUE
#' @param x PARAM_DESCRIPTION, Default: FALSE
#' @param y PARAM_DESCRIPTION, Default: FALSE
#' @param qr PARAM_DESCRIPTION, Default: TRUE
#' @param singular.ok PARAM_DESCRIPTION, Default: TRUE
#' @param contrasts PARAM_DESCRIPTION, Default: NULL
#' @param offset PARAM_DESCRIPTION
#' @param ... PARAM_DESCRIPTION
#' @return OUTPUT_DESCRIPTION
#' @importFrom stats model.frame
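
As a companion to the makeOxygen example above, makeImport can be pointed at a script to generate the import lines. The calls below are my own sketch rather than package output, and the file path and argument values are assumptions, so check ?makeImport for the exact interface:

# Scan a script and print roxygen2-style @importFrom lines for the functions it uses
makeImport("R/my_function.R", format = "oxygen")

# The same scan, formatted for the Imports field of the DESCRIPTION file instead
makeImport("R/my_function.R", format = "description")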

In more detail…

For new developers, getting a package ready for building and submitting to CRAN is an expletive-filled, head-scratching experience, to say the least. Trying to figure out the basics of what goes in Depends and what goes in Imports is a lost afternoon most of us would like back. Once that is understood, filling in the relevant information for each field is a mundane task even for a well-polished package developer. The out-of-the-box roxygen2 skeleton supplied by RStudio gives a bare-bones road map of what should be part of function documentation.
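That default skeleton (inserted in RStudio via Code > Insert Roxygen Skeleton) looks roughly like the following for a function with a single argument x; the exact fields may vary slightly between RStudio versions:

#' Title
#'
#' @param x
#'
#' @return
#' @export
#'
#' @examples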

Machine Learning Using Support Vector Machines

Support Vector Machines (SVM) is a data classification method that separates data using hyperplanes. The concept of SVM is very intuitive and easily understandable. If we have labeled data, SVM can be used to generate multiple separating hyperplanes such that the data space is divided into segments and each segment contains only one kind of data. The SVM technique is generally useful for data with non-regularity, that is, data whose distribution is unknown.

Let’s take a simple example to understand how SVM works. Say you have only two kinds of values, which we can represent as in the figure below.

[Figure: two classes of points (red and blue) in a two-dimensional space]

This data is simple to classify, and one can see that the data is clearly separated into two segments. Any line that separates the red and blue items can be used to classify the above data. Had this data been multi-dimensional, any plane could separate and successfully classify the data. However, we want to find the “most optimal” solution. What, then, is the characteristic of this most optimal line? We have to remember that this is just the training data and we can have more data points which can lie anywhere in the subspace. If our line is too close to any of the data points, noisy test data is more likely to get classified in the wrong segment. We have to choose the line which lies between these groups and is at the farthest distance from each of the segments. To find this classifier line, we write the line as y=ax+b and make it equidistant from the respective data points closest to the line. So we want to maximize the margin (m).

[Figure: the separating line with its margin between the two classes]

We know that all points on the line ax+b=0 will satisfy the equation. If we draw two parallel lines, ax+b=-1 for one segment and ax+b=1 for the other segment, such that these lines pass through the datapoint in each segment which is nearest to our line, then the distance between these two lines will be our margin. Hence, our margin will be m=2/||a||. Looked at another way, if we have a training dataset (x1,x2,x3…xn) and want to produce an outcome y such that y is either -1 or 1 (depending on which segment the datapoint belongs to), then our classifier should correctly classify data points in the form of y=ax+b. This also means that y(ax+b)>=1 for all data points. Since we have to maximize the margin, we have to solve this problem under that constraint, maximizing 2/||a|| or, equivalently, minimizing ||a||^2/2, which is an easier form of the same optimization problem. Solving this can be easy or complex depending upon the dimensionality of the data. However, we can do this quickly with R and try it out on a sample dataset.

To use SVM in R, I just created random data with two features x and y in Excel. I took all the values of x as just a sequence from 1 to 20, and the corresponding values of y as derived using the formula y(t)=y(t-1) + r(-1:9), where r(a,b) generates a random integer between a and b. I took y(1) as 3. The following code in R illustrates a set of sample generated values:

x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)
#Create a data frame of the data
train=data.frame(x,y)

Let’s see how our data looks. For this we use the plot function.

#Plot the dataset
plot(train,pch=16)

[Figure: scatter plot of the training data]
The data seems fairly well behaved. Looking at the plot, it also seems like a linear regression should be a good fit to the data. We’ll use both models and compare them. First, the code for linear regression:

#Linear regression
model <- lm(y ~ x, train)
#Plot the model using abline
abline(model)

[Figure: the training data with the fitted regression line]

SVM

To use SVM in R, we have the e1071 package. The package is not preinstalled, hence one needs to run install.packages("e1071") to install it and then load it using the library command. The syntax of the svm function is quite similar to that of linear regression.

#SVM
library(e1071)
#Fit a model. The function syntax is very similar to the lm function
model_svm <- svm(y ~ x , train)
#Use the model to predict on the training data
pred <- predict(model_svm, train)
#Add the predictions to the plot to see our model fit
points(train$x, pred, col = "blue", pch=4)
[Figure: SVM predictions (blue crosses) overlaid on the training data and the regression line]

The points follow the actual values much more closely than the regression line. Can we verify that? Let’s calculate the RMSE for both models:

#The linear model has a residuals component which we can extract and use to calculate the RMSE directly
error <- model$residuals
lm_error <- sqrt(mean(error^2)) # 3.832974
#For svm, we have to manually calculate the difference between the actual values (train$y) and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2)) # 2.696281

In this case, the RMSE for the linear model is ~3.83, whereas our SVM model has a lower error of ~2.7. A straightforward implementation of SVM already has a higher accuracy than the linear regression model. However, the SVM model goes far beyond that. We can further improve our SVM model and tune it so that the error is even lower. We will now go deeper into the svm function and the tune function. We can specify values for the cost parameter and for epsilon, which is 0.1 by default. A simple way is to try each value of epsilon between 0 and 1 (I will take steps of 0.01) and similarly try the cost parameter from 4 to 2^9 (I will take exponential steps of 2 here). I am taking 101 values of epsilon and 8 values of the cost parameter. I will thus be testing 808 models to see which one performs best. The code may take a short while to run all the models and find the best version. The corresponding code will be:

svm_tune <- tune(svm, y ~ x, data = train,
                 ranges = list(epsilon = seq(0,1,0.01), cost = 2^(2:9)))
print(svm_tune)
#Printing gives the output:
#Parameter tuning of 'svm':
#- sampling method: 10-fold cross validation
#- best parameters:
# epsilon cost
#    0.09  256
#- best performance: 2.569451
#This best performance denotes the MSE. The corresponding RMSE is 1.602951, the square root of the MSE.

An advantage of tuning in R is that it lets us extract the best model directly: we don’t have to do anything other than pull it out of the svm_tune object. We can now see the improvement in our model by calculating its RMSE using the following code:

#The best model
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)
error_best_mod <- train$y - best_mod_pred
# this value can be different on your computer
# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2)) # 1.290738

This tuning method is known as grid search. R runs all the models with all the possible values of epsilon and cost in the specified ranges and gives us the model which has the lowest error. We can also plot the tuning object to see the performance of all the models together.

plot(svm_tune)

[Figure: tuning performance of all 808 models, color-coded by error]

This plot shows the performance of the various models using color coding. Darker regions imply better accuracy. The use of this plot is to determine the possible range we can narrow our search down to and try further tuning if required. For instance, this plot shows that I can run tuning for epsilon in the new range of 0 to 0.2 and, while I’m at it, move in even smaller steps (say 0.002), but going further may lead to overfitting, so I can stop at this point. With this model, we have improved on our initial error of 2.69 and come as close as 1.29, which is about half of our original SVM error. We have come very far in our model accuracy. Let’s see how the best model looks when plotted.
plot(train,pch=16)
points(train$x, best_mod_pred, col = "blue", pch=4)

[Figure: the training data with the tuned SVM predictions]

Visually, the points predicted by our tuned model almost follow the data. This is the power of SVM, and we are just seeing this for data with two features. Imagine the abilities of the model with a greater number of complex features!

Summary

SVM is a powerful technique, especially useful for data whose distribution is unknown (also known as non-regularity in data). Because the example considered here consisted of only two features, the SVM fitted by R here is also known as a linear SVM. SVM is powered by a kernel for dealing with various kinds of data, and the kernel can also be set during model tuning; examples include the Gaussian and radial kernels. Hence, SVM can also be used for non-linear data and does not require any assumptions about its functional form. Because we separate the data with the maximum possible margin, the model becomes very robust and can deal with incongruencies such as noisy test data or biased training data. We can also interpret the results produced by SVM through visualization. A common disadvantage of SVM is associated with its tuning: the level of accuracy we demand when predicting over the training data has to be chosen by us. Because our example was custom-generated data, we went ahead and tried to get our model accuracy as high as possible by reducing the error. However, in business situations, when one needs to train the model and continually predict over test data, SVM may fall into the trap of overfitting. This is the reason SVM needs to be carefully modeled; otherwise the model accuracy may not be satisfactory. As I did in the example, the SVM technique is closely related to regression. For linear data, we can compare SVM with linear regression, while non-linear SVM is comparable to logistic regression. As the data becomes more and more linear in nature, linear regression becomes more and more accurate. At the same time, a tuned SVM can also fit the data. However, noisiness and bias can severely impact the ability of regression. In such cases, SVM comes in really handy!

With this article, I hope one may understand the basics of SVM and its implementation in R. One can also dive deeper into the SVM technique through model tuning. The full R code used in the article is laid out below:

x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)
#Create a data frame of the data
train=data.frame(x,y)
#Plot the dataset
plot(train,pch=16)
#Linear regression
model <- lm(y ~ x, train)
#Plot the model using abline
abline(model)
#SVM
library(e1071)
#Fit a model. The function syntax is very similar to the lm function
model_svm <- svm(y ~ x , train)
#Use the model to predict on the training data
pred <- predict(model_svm, train)
#Add the predictions to the plot to see our model fit
points(train$x, pred, col = "blue", pch=4)
#The linear model has a residuals component which we can extract and use to calculate the RMSE directly
error <- model$residuals
lm_error <- sqrt(mean(error^2)) # 3.832974
#For svm, we have to manually calculate the difference between the actual values (train$y) and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2)) # 2.696281
# perform a grid search
svm_tune <- tune(svm, y ~ x, data = train,
                 ranges = list(epsilon = seq(0,1,0.01), cost = 2^(2:9)))
print(svm_tune)
#Parameter tuning of 'svm':
#- sampling method: 10-fold cross validation
#- best parameters:
# epsilon cost
#       0    8
#- best performance: 2.872047
#The best model
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)
error_best_mod <- train$y - best_mod_pred
# this value can be different on your computer
# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2)) # 1.290738
plot(svm_tune)
plot(train,pch=16)
points(train$x, best_mod_pred, col = "blue", pch=4)

Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.

Creating Interactive Charts with R, Shiny, MySQL and AnyChart JS via Template

How to create interactive charts with R, Shiny, MySQL and AnyChart JS via a template

Data visualization and charting are actively evolving as a more and more important field of web development. In fact, people perceive information much better when it is represented graphically rather than numerically as raw data. As a result, various business intelligence apps, reports, and so on widely implement graphs and charts to visualize and clarify data and, consequently, to speed up and facilitate its analysis for further decision making. While there are many ways you can follow to handle data visualization in R, today let’s see how to create interactive charts with the help of the popular JavaScript (HTML5) charting library AnyChart. It has recently got an official R, Shiny and MySQL template that makes the whole process pretty straightforward and easy. (Disclaimer: I am the CTO at the AnyChart team. The template I am talking about here is released under the Apache 2.0 license; the library itself can be used for free in any personal, educational and other non-profit projects and is open on GitHub, but for commercial purposes it requires a commercial license, though it is fully functional when taken on a free trial.) In this step-by-step tutorial, we will take a closer look at the template and the basic pie chart example it includes, and then I will show you how to quickly modify it to get a different data visualization, e.g. a 3D column (vertical bar) chart.

Briefly about AnyChart

AnyChart is a flexible, cross-browser JS charting library for adding interactive charts to websites and web apps in quite a simple way. Basically, it does not require any installation and works with any platform and database, and it offers a number of further features beyond the scope of this post. Templates for popular technology stacks, like R, Shiny and MySQL in the present case, can further facilitate AnyChart’s integration.

Getting started

First of all, let’s make sure the R language is installed. If not, you can visit the official R website and follow the instructions. If you have worked with R before, most likely you already have RStudio. Then you are welcome to create a project in it now, because the part devoted to R can be done there. If you do not currently have RStudio, you are welcome to install it from the official RStudio website. But, actually, using RStudio is not mandatory, and a plain text editor will be enough in our case. After that, we should check if MySQL is properly installed. To do that, you can open a terminal window and enter the next command:

$ mysql --version
mysql Ver 14.14 Distrib 5.7.16, for Linux (x86_64) using EditLine wrapper

You should receive the above response (or a similar one) to be sure all is well. Please follow these instructions to install MySQL if you do not have it at the moment. Now that all the required components have been installed, we are ready to write some code for our example.

Basic template

First, to download the R, Shiny and MySQL template for AnyChart, type the next command in the terminal:

$ git clone https://github.com/anychart-integrations/r-shiny-mysql-template.git

The folder you are getting here features the following structure:

r-shiny-mysql-template/
  www/
    css/
      style.css        # css style
  app.R                # main application code
  database_backup.sql  # MySQL database dump
  LICENSE
  README.md
  index.html           # html template

Let’s take a look at the project files and examine how this sample works. We’ll run the example first. Open the terminal and go to the repository folder:

$ cd r-shiny-mysql-template

Set up the MySQL database. To specify your username and password, make use of the -u and -p flags:

$ mysql < database_backup.sql

Then run the R command line, using the command below:

$ R

And install the Shiny and RMySQL packages, as well as load the Shiny library at the end:

> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)

If you face any problems during the installation of these dependencies, read the error messages carefully; e.g. you might need sudo apt-get install libmysqlclient-dev for installing RMySQL. Finally, run the application:

> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/r-shiny-mysql-template")

The new tab that should have just opened in your browser shows you the example included in the template:

[Figure: interactive pie chart created with R and the AnyChart JS charting library; the basic sample from the R, Shiny and MySQL integration template]

Basic template: code

Now, let’s go back to the folder with our template to see how it works. The files LICENSE and README.md contain information about the license and the template (how to run it, technologies, structure, etc.) respectively. They are not functionally important to our project, and therefore we will not explore them here. Please check these files by yourself for a general understanding. The style.css file is responsible for the styles of the page. The database_backup.sql file contains code for the MySQL table and user creation and for writing data to the table. You can use your own table or change the data in this one.

Let’s move on to the code. First, open the app.R file. This file ensures the connection to the MySQL database, reads data, and passes it to the index.html file, which contains the main code for using AnyChart. The following is the part of the app.R code which contains the htmlTemplate function; here we specify the name of the file the data will be transmitted to, the names of our page and chart, and the JSON-encoded chart data from the MySQL database.

htmlTemplate("index.html",
  title = "Anychart R Shiny template",
  chartTitle = shQuote("Top 5 fruits"),
  chartData = toJSON(loadData())
)

The main thing here is the index.html file, which is actually where the template for creating charts is. As you see, the first part of this file simply connects all the necessary files to the code, including the AnyChart library, the CSS file with styles, and so on. I’ll skip this for now and proceed directly to the script tag and the anychart.onDocumentReady(function() {...}) function.

anychart.onDocumentReady(function() {
  var chart = anychart.pie({{ chartData }});
  chart.title({{ chartTitle }});
  chart.container("container");
  chart.draw();
});

This pattern works as follows. We create a pie chart by using the function pie() and pass it the data that have already been read and prepared using the R code. Please note that the names of the variables containing data are the same in the app.R and index.html files. Then we set the chart title via chart.title({{ chartTitle }}) and specify the ID of the element that will contain the chart, which is a div with id = "container" in this case. To render everything we have coded, we use chart.draw().

Modifying the template to create a custom chart

Now that we’ve explored the basic example included in the template, we can move forward and create our own custom interactive chart. To do that, we simply need to change the template a little bit and add some features if needed. Let’s see how it works. First, we create all the necessary files ourselves or make a new project using RStudio. Second, we add a project folder named anychart. Its structure should look like the one illustrated below. Please note that some differences are possible (and acceptable) if you are using a new project in RStudio.

anychart/
  www/
    css/
      style.css        # css style
  ui.R                 # main application code
  server.R             # sub code
  database_backup.sql  # data set
  index.html           # html template

Now you know what files you need. If you’ve made a project with RStudio, the ui.R and server.R files are created automatically. If you’ve made the project by yourself, just create empty files with the same names and extensions as specified above. The main difference from the original example included in the template is that we should change the index.html file and divide app.R into parts. You can copy the rest of the files or create new ones for your own chart.

Please take a look at the server.R file. If you’ve made a project using RStudio, it was created automatically and you don’t need to change anything in it. However, if you’ve made it by yourself, open it in a text editor and add the code shown further below, which is standard for the Shiny framework. You can read more about that here.

The file structure of ui.R is similar to the one of app.R, so you can copy app.R from the template and change/add the following lines:

loadData = dbGetQuery(db, "SELECT name, value FROM fruits")
data1 <- character()

#data preparation
for(var in 1:nrow(loadData)){
  c = c(as.character(loadData[var, 1]), loadData[var, 2])
  data1 <- c(data1, c)
}
data = matrix(data1, nrow=nrow(loadData), ncol=2, byrow=TRUE)

ui = function(){
  htmlTemplate("index.html",
    title = "Anychart R Shiny template",
    chartTitle = shQuote("Fruits"),
    chartData = toJSON(data)
  )}

Since we are going to change the chart type from pie to 3D vertical bar (column), the data needs some preparation before being passed to index.html. The main difference is that we will use the entire data set from the database, not simply the top 5 positions. We will slightly modify and expand the basic template. Let’s see the resulting code of index.html first (the script tag) and then explore it.

anychart.onDocumentReady(function() {
  var chart = anychart.column3d({{ chartData }});
  chart.title({{ chartTitle }});
  chart.animation(true);
  var xAxis = chart.xAxis();
  xAxis.title("fruits");
  var yAxis = chart.yAxis();
  yAxis.title("pounds, t");
  var yScale = chart.yScale();
  yScale.minimum(0);
  yScale.maximum(120);
  chart.container("container");
  chart.draw();
});

With the help of var chart = anychart.column3d({{ chartData }}), we are creating a 3D column chart by using the function column3d(). Here you can choose any other chart type you need; consider getting help from Chartopedia if you are unsure which one works best in your situation. Next, we are adding animation to the column chart via chart.animation(true) to make it appear gradually on page load. In the following section, we create two variables, xAxis and yAxis. Including these is required if you want to provide the coordinate axes of the chart with captions: you create variables that will hold the captions for the X and Y axes, and then pass the values you want to see to their title functions.
The next unit is basically optional. We are explicitly specifying the maximum and minimum values for the Y axis; otherwise AnyChart will calculate these values itself. You can do the same for the X axis. And that’s it! Our 3D column chart is ready, and all seems to be fine for successfully running the code. The only thing left to do before that is to change the MySQL table to make it look as follows:

('apple',100),
('orange',58),
('banana',81),
('lemon',42),
('melon',21),
('kiwi',66),
('mango',22),
('pear',48),
('coconut',29),
('cherries',65),
('grapes',31),
('strawberries',76),

To see what you’ve got, follow the same steps as for running the R, Shiny and MySQL template example, but do not forget to change the path and the folder name to anychart. So, let’s open the terminal and run the following commands:

$ cd anychart
$ mysql < database_backup.sql
$ R
> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)
> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/anychart")

[Figure: interactive 3D column chart made with R and the AnyChart JS charting library, based on the R, Shiny and MySQL integration template]

For consistency purposes, I am including the code of ui.R and server.R below. The full source code of this example can be found on GitHub.

ui.R:

library(shiny)
library(RMySQL)
library(jsonlite)

data1 <- character()

db = dbConnect(MySQL(),
  dbname = "anychart_db",
  host = "localhost",
  port = 3306,
  user = "anychart_user",
  password = "anychart_pass")

loadData = dbGetQuery(db, "SELECT name, value FROM fruits")

#data preparation
for(var in 1:nrow(loadData)){
  c = c(as.character(loadData[var, 1]), loadData[var, 2])
  data1 <- c(data1, c)
}
data = matrix(data1, nrow=nrow(loadData), ncol=2, byrow=TRUE)

server = function(input, output){}

ui = function(){
  htmlTemplate("index.html",
    title = "Anychart R Shiny template",
    chartTitle = shQuote("Fruits"),
    chartData = toJSON(data)
  )}

shinyApp(ui = ui, server = server)

server.R:

library(shiny)

shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    # generate bins based on input$bins from ui.R
    x <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    # draw the chart with the specified number of bins
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
})

Conclusion

When your technology stack includes R, Shiny and MySQL, using AnyChart JS with the integration template we have been talking about in this tutorial requires no big effort and allows you to add beautiful interactive JavaScript-based charts to your web apps quite quickly. It is also worth mentioning that you can customize the look and feel of charts created this way as deeply as needed by using some of the library’s numerous out-of-the-box features: add or remove axis labels, change the background color and how the axis is positioned, leverage interactivity, and so on. The scope of this tutorial is actually even broader, because the process described here not only applies to the AnyChart JS charting library, but is also mostly the same for its sister libraries AnyMap (geovisualization in maps), AnyStock (date/time graphs), and AnyGantt (charts for project management). All of them are free for non-profit projects but, I must put it clearly here again just in case, require a special license for commercial use. I hope you find this article helpful in your activities when it comes to interactive data visualization in R. If you have any questions, please ask.

Generating Dynamic Nomograms using DynNom

Nomograms are useful computational tools for model visualisation, graphical assessment of variable importance and the calculation of predicted values. The nomogram function in the rms package is a popular way of creating (static) nomograms for a variety of regression models. For example, the following code will generate a nomogram from a logistic regression model used to model the probability of survival on the Titanic (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html).

library(PASWR)
data(titanic3)
library("rms")

t.data <- datadist(titanic3)
options(datadist = 't.data')

fit <- lrm(formula = survived ~ age + pclass + sex, data = titanic3)
plot(nomogram(fit, fun = function(x) plogis(x)))

[Figure: static nomogram for the Titanic logistic regression model]


This nomogram is a graphical calculator of the probability of surviving the Titanic, based on three explanatory variables: sex, age and passenger class. The nomogram can become more cumbersome to read and use when higher-order interaction terms and smoothers are present in the model. For example, consider the nomogram of the model containing all possible interactions of the three main effects.
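The corresponding figure is not reproduced here, but a sketch of the code that would produce such a nomogram (my own reconstruction, mirroring the lrm call above) is:

fit.int <- lrm(survived ~ (age + pclass + sex) ^ 3, data = titanic3)
plot(nomogram(fit.int, fun = function(x) plogis(x)))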

The DynNom package (Jalali, A, Newell, J), built on shiny, allows the creation of dynamic nomograms from any Generalised Linear model or Cox Proportional hazards model. DynNom supports model objects created by the lm, glm and coxph functions. It also supports models generated using the ols, Glm, lrm and cph functions in the rms package, which allows the inclusion of smoothing splines. For example, the following R code will build a dynamic nomogram for the corresponding higher-order logistic regression model using the DynNom package.

fit2 <- glm(survived ~ (age + pclass + sex) ^ 3, titanic3, family = "binomial")

library(DynNom)
DynNom(fit2, titanic3)

The resulting dynamic nomogram allows the prediction of the probability of survival (and corresponding 95% Confidence interval) for any chosen set of values of the explanatory variables.  The shiny tabs display the corresponding predicted values graphically and numerically and the underlying model summary.

 

[Figure: the DynNom dynamic nomogram for the Titanic model, as rendered in the Shiny app]

The ability to interact with a model, in terms of the effect of changes in the explanatory variables on the predicted response, is a useful translational tool. In theory, all published model summaries could be accompanied by a hyperlink directing the reader to the corresponding model nomogram. In order to do this, the DNbuilder function within DynNom will generate the necessary ui.R, server.R and global.R files needed to share the nomogram online on any server (e.g. http://shinyapps.io).
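As a rough sketch of how this might look (my own example; the call mirrors the DynNom call above, so check ?DNbuilder for the exact arguments):

# Generate the ui.R, server.R and global.R files for the fitted model
DNbuilder(fit2, titanic3)
# The files are written to the working directory and can then be deployed,
# for example with the rsconnect package, to a server such as shinyapps.io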

For example, the dynamic nomogram created for the titanic dataset, published using the DNbuilder function, is available at https://amir.shinyapps.io/titanic.

 

 

Composing reproducible manuscripts using R Markdown

The technology is available for researchers to create a reproducible manuscript whereby calculations and graphs are generated computationally – thereby saving researcher time and avoiding human error. eLife is exploring ways to support life and biomedical scientists who wish to communicate their research in this way. In this post, Chris Hartgerink, a metascience researcher at Tilburg University, the Netherlands, describes how he composes a reproducible manuscript using R Markdown. For any researchers interested in submitting a manuscript to eLife in R Markdown, or a similar format, please let us know by email to [email protected].

Blog post by Chris Hartgerink, Tilburg University

This week, a paper that was almost three years in the making finally got published (Hartgerink et al., 2017). I feel confident about the paper and the results in it, not because it took three years to write, but because it’s the output of a dynamic document that I produced using R Markdown. A dynamic document? This means that I no longer had to manually enter all results from the data into tables or the text – the computer did it for me. All I had to do was point it in the right direction. Figures? The same! It saved me tons of time after I made the initial investment to learn how to use it. (Something else that saved me time was git version control, but that’s for another time.)

Why is this important? We are all human, and we make mistakes. And that’s okay! What matters is how we try to remedy those mistakes if they do occur. Even more importantly, if we can change the way we work in order to prevent some of them, that can help us tremendously. I think dynamic documents like R Markdown help us do so.

Markdown is a simple document language, which you can create in any text editor, including Windows Notepad and Notes for Mac. Markdown is supported in multiple tools designed to make life and work easier, including Reddit, Hypothes.is, Github and several blogging and commenting tools. The purpose of Markdown is simply to define the document structure and formatting. For example, the writer can define header levels using hash signs, and easily format text (Figure 1).

Figure 1. Basic syntax of Markdown

This makes formatting as you write quick – no more buttons or keyboard shortcuts. Subsequently, the text file can be exported in multiple file formats, including PDF, html and even as a Word file to satisfy co-authors who prefer to use Word’s track-changes function.

R Markdown is a flavour of Markdown: it allows you to include R code either as a code chunk between paragraphs of text, or as code within the text. This code is executable. As such, you can do analyses, make figures and format results automatically. One problem I’ve worked on previously is checking the accuracy of p-values in the literature. My colleagues and I found that p-values can be mistyped or miscalculated, leading to inaccurate reporting of results, however unintentionally, in half of all papers, and leading to potential changes in the conclusions in one out of eight papers (in psychology; Nuijten et al. 2015). Enabling researchers to insert p-values via direct computation instead of manually copying results from statistical programmes will resolve this issue, for a start.

R Markdown is simple and easy to learn and use. I will show just one exciting aspect here. More extensive step-by-step guides are available. If you want to follow along, you will need to follow the steps described in Table 1.
Step 1 Install R from https://cran.r-project.org/ and Rstudio from https://www.rstudio.com/products/rstudio/download/
Step 2 In RStudio, install the R Markdown package by typing in the console: > install.packages('rmarkdown') You will need to select a CRAN mirror local to you, from which the package will be downloaded.
Step 3 Once R Markdown is installed, go to File > New File > R Markdown file. This loads a new .Rmd file with a basic template already set out for you.
Table 1. Getting started with R Markdown
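
As a minimal illustration of what goes inside such a file (my own sketch, not from the original post; the t-value and degrees of freedom match the example discussed below), the body of an .Rmd document mixing a code chunk and inline code could look like this:

## Results

```{r}
pval <- round(pt(q = 1.95, df = 69, lower.tail = FALSE), 3)
```

The difference was not significant, t(69) = 1.95, p = `r pval`.

When the document is knitted, the chunk runs and the inline `r pval` is replaced by the correctly rounded value in the rendered text.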
 
Usually, we tend to type results in the running text ourselves, as depicted in Figure 2. By clicking “Knit”, R Markdown creates a document (right) from a simple plain-text file (left). This is the basic function of Markdown and its associated flavours.
 
Figure 2. Using R Markdown to write a document

However, in this text we have a p-value that is calculated based on a t-value and the number of degrees of freedom. Let’s make this dynamic to ensure we have the rounding correct (Figure 3).

Figure 3. Using R Markdown to generate a document with dynamic results, to ensure accuracy of reporting and reproducibility of results

As we see, the original contained a mistake (p = 0.027 has become p = 0.028) – using R Markdown allowed us to catch that error by using R code to generate and properly round the p-value (i.e., round(pt(q = 1.95, df = 69, lower.tail = FALSE), 3) calculates the p-value and rounds the result to three decimal places). So, there are no more mistakes, and we can be confident in our reporting. Disclaimer: of course, you can still input wrong code – garbage in, garbage out.

Further, you can generate plots from your research data within your dynamic document. Figure 4 demonstrates this using an example dataset pre-loaded in R (you can find more example datasets by typing data() into the console).

Figure 4. Using R Markdown to dynamically generate plots within a document

Finally, you can even use citations and alter the citation style throughout the whole document without any problem; my experience is that it’s easier with R Markdown than with EndNote or Mendeley. These are just simple examples. You can write entire manuscripts in this way. That’s what I did for our Collabra manuscript: you can view the raw manuscript at https://github.com/chartgerink/2014tgtbf/blob/master/submission/manuscript.Rnw. (Note: I preferred LaTeX for that project and used Sweave, which plays the same role as R Markdown but for LaTeX.)

Using R Markdown presents the opportunity to link results and graphs in the article directly to the data and code that produce them. All it takes is some initial time investment to learn how to work with it (Markdown can be learned in five minutes) and to change your workflow to accommodate this modern approach to writing manuscripts. A downside to working this way is that raw R Markdown files are not currently supported as a submission format by journals, meaning that we need to export our manuscripts to traditional Word or PDF formats before submitting – stripping out that rich, reproducible layer of information R Markdown facilitates. I hope dynamic documents will become more and more widespread in the future, both in how often they’re used by authors and in how publishers support this type of document, to truly innovate how scholarly information is communicated and consumed. Imagine interacting with a statistical result or graph in a research article and being presented with the underlying code – this would allow you to more directly evaluate the methods in a paper and empower you as a reader to be critical of what you are presented with.

References:

Hartgerink, C.H.J., Wicherts, J.M. and van Assen, M.A.L.M. (2017). Too Good to be False: Nonsignificant Results Revisited. Collabra: Psychology, 3(1), p.9. https://doi.org/10.1525/collabra.71

Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M., Epskamp, S. and Wicherts, J.M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), pp.1205–1226. https://doi.org/10.3758/s13428-015-0664-2

This blog post is adapted from a post published by Chris Hartgerink at https://onsnetwork.org/chartgerink/2017/03/30/reproducible-manuscripts-are-the-future/.

slickR

We are happy to bring the slick JavaScript library to R. It is self-described as “the last carousel you’ll ever need”. The carousel places each of its elements inside a div HTML tag, which makes it very versatile in what it can display: images, plots, tables, GIFs, videos, iframes and even htmlwidgets.

This tool helps you review multiple outputs efficiently and saves much-needed space in documents and Shiny applications, while creating a user-friendly experience. These carousels can be used directly from the R console, from RStudio, in Shiny apps and in R Markdown documents.

Installation

CRAN
install.packages('slickR')
Github (dev)
devtools::install_github('metrumresearchgroup/slickR')

Examples

There are many options within the slick carousel. To get started with the basics, we will show a few examples in the rest of the post. If you want to try out any of the examples, you can go to this site, where they are rendered and can be tested out.

library(svglite)
library(lattice)
library(ggplot2)
library(rvest) 
library(reshape2)
library(dplyr)
library(htmlwidgets)
library(slickR)

Images

Some web scraping for the images example…
#NBA Team Logos
nbaTeams=c("ATL","BOS","BKN","CHA","CHI","CLE","DAL","DEN","DET","GSW",
    "HOU","IND","LAC","LAL","MEM","MIA","MIL","MIN","NOP","NYK",
    "OKC","ORL","PHI","PHX","POR","SAC","SAS","TOR","UTA","WAS")
teamImg=sprintf("https://i.cdn.turner.com/nba/nba/.element/img/4.0/global/logos/512x512/bg.white/svg/%s.svg",nbaTeams)

#Player Images
a1=read_html('http://www.espn.com/nba/depth')%>%html_nodes(css = '#my-teams-table a')
a2=a1%>%html_attr('href')
a3=a1%>%html_text()
team_table=read_html('http://www.espn.com/nba/depth')%>%html_table()
team_table=team_table[[1]][-c(1,2),]
playerTable=team_table%>%melt(,id='X1')%>%arrange(X1,variable)
playerName=a2[grepl('[0-9]',a2)]
playerId=do.call('rbind',lapply(strsplit(playerName,'[/]'),function(x) x[c(8,9)]))
playerId=playerId[playerId[,1]!='phi',]
playerTable$img=sprintf('http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/%s.png&w=350&h=254',playerId[,1])
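Before grouping anything, the simplest carousel just takes a character vector of image sources. Here is a minimal sketch using the team logos scraped above (the slide id 'ex1' is only an illustrative label, not from the original post):

#Basic image carousel built from a character vector of image URLs
slickR(
  obj = teamImg,
  slideId = 'ex1',
  slickOpts = list(dots = T),
  height = 100, width = '100%'
)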

Grouped Images

Each team has five starting players, so let’s group the starting five together and have each dot correspond to a team:
slickR(
  obj = playerTable$img,
  slideId = c('ex2'),
  slickOpts = list(
    initialSlide = 0,
    slidesToShow = 5,
    slidesToScroll = 5,
    focusOnSelect = T,
    dots = T
  ),
  height = 100,width='100%'
)

Replacing the dots

Sometimes the dots won’t be informative enough, so we can replace them with another HTML object, such as text or other images. To do this, we pass JavaScript code to the customPaging argument via the htmlwidgets::JS function.

Replace dots with text

cP1=JS("function(slick,index) {return '<a>'+(index+1)+'</a>';}")

slickR(
  obj = playerTable$img,
  slideId = 'ex3',
  dotObj = teamImg,
  slickOpts = list(
    initialSlide = 0,
    slidesToShow = 5,
    slidesToScroll = 5,
    focusOnSelect = T,
    dots = T,
    customPaging = cP1
  ),
  height=100,width='100%'
)

Replace dots with Images

cP2=JS("function(slick,index) {return '<a><img src= ' + dotObj[index] + '  width=100% height=100%></a>';}")


#Replace dots with Images
s1 <- slickR(
  obj = playerTable$img,
  dotObj = teamImg,
  slickOpts = list(
    initialSlide = 0,
    slidesToShow = 5,
    slidesToScroll = 5,
    focusOnSelect = T,
    dots = T,
    customPaging = cP2
  ),height = 100,width='100%'
)

#Putting it all together in one slickR call
s2 <- htmltools::tags$script(
  sprintf("var dotObj = %s", 
          jsonlite::toJSON(teamImg))
)

htmltools::browsable(htmltools::tagList(s2, s1))

Plots

To place plots directly into slickR, we use svglite’s xmlSVG function to render each plot as SVG and then convert the result into a character object:
plotsToSVG=list(
  #Standard Plot
    xmlSVG({plot(1:10)},standalone=TRUE),
  #lattice xyplot
    xmlSVG({print(xyplot(x~x,data.frame(x=1:10),type="l"))},standalone=TRUE),
  #ggplot
    xmlSVG({show(ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,colour=Species))+
                   geom_point())},standalone=TRUE), 
  #lattice dotplot
    xmlSVG({print(dotplot(variety ~ yield | site , data = barley, groups = year,
                          key = simpleKey(levels(barley$year), space = "right"),
                          xlab = "Barley Yield (bushels/acre) ",
                          aspect=0.5, layout = c(1,6), ylab=NULL))
            },standalone=TRUE) 
)

#make the plot self contained SVG to pass into slickR 
s.in=sapply(plotsToSVG,function(sv){paste0("data:image/svg+xml;utf8,",as.character(sv))})
slickR(s.in,slideId = 'ex4',slickOpts = list(dots=T), height = 200,width = '100%')

Synching Carousels

There are instances when you have many outputs at once and do not want to page through all of them, so you can combine two synched carousels: one for viewing and one for searching.
slickR(rep(s.in,each=5),slideId = c('ex5up','ex5down'),
       slideIdx = list(1:20,1:20),
       synchSlides = c('ex5up','ex5down'),
       slideType = rep('img',4),
       slickOpts = list(list(slidesToShow=1,slidesToScroll=1),
                        list(dots=F,slidesToScroll=1,slidesToShow=5,
                             centerMode=T,focusOnSelect=T)
                        ),
       height=100,width = '100%'
       )

Iframes

Since the carousel can accept any HTML element, we can place iframes in it. There are times when you may want to put different DOM elements in slick rather than an image; using slideType you can specify which DOM element is used in the slides. For example, let’s put the help (Rd) files from ggplot2 into a slider using the helper function getHelp. (Don’t forget to open the output in a browser to view the iframe contents.)
geom_filenames=ls("package:ggplot2",pattern = '^geom')

slickR(unlist(lapply(geom_filenames,getHelp,pkg = 'ggplot2')),slideId = 'ex6',slideType = 'iframe',height = '400px',width='100%',slickOpts = list(dots=T,slidesToShow=2,slidesToScroll=2))

htmlwidgets

Finally, we can really leverage R and place self-contained htmlwidgets (like leaflet maps and plotly graphs) in iframes and use them in a carousel. This solves the problem of how to run many htmlwidgets at once outside of Shiny.
library(leaflet)
library(plotly)

l <- leaflet() %>% addTiles()
htmlwidgets::saveWidget(l,'leaflet.html')

p <- iris%>%ggplot(aes(x=Sepal.Length,y=Sepal.Width))+geom_point()
pL= ggplotly(p)
htmlwidgets::saveWidget(pL,'ggplotly.html')

slickR(c(rep(paste0(readLines('leaflet.html'),collapse='\n'),4),
         rep(paste0(readLines('ggplotly.html'),collapse='\n'),4)),
       slideId = c('leaf','plot'),
       slideIdx = list(1:4,5:8),
       slideType = rep('iframe',2),
       slickOpts = list(list(dots=T,slidesToShow=2,slidesToScroll=2),
                        list(dots=T,slidesToShow=2,slidesToScroll=2)),
       height='200px',width='100%')

Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings. To learn more about additional open-source software packages developed by Metrum Research Group, please visit the Metrum website. Contact: for questions and comments, feel free to email me at: [email protected] or open an issue for bug fixes or enhancements on GitHub.

R Function Call with Ellipsis Trap/Pitfall

Objective of this post: alerting all users to double-check the case and spelling of all function parameters.

I am a newbie in R and was trying the RSNNS mlp function, and I wasted a lot of time due to some typos.

RSNNS mlp function silently ignores misspelled keywords
Example:
library(RSNNS)
model <- mlp(iris[,1:4], decodeClassLabels(iris[,5]), hize=7, mazit=100)
I intentionally misspelled size as hize and maxit as mazit.
There are no warnings or errors.
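For reference, the call I meant to write, with the argument names spelled correctly (size and maxit are the parameters that hize and mazit failed to match), is:

# Corrected call: size and maxit are now spelled correctly and actually used
model <- mlp(iris[,1:4], decodeClassLabels(iris[,5]), size=7, maxit=100)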

I think that many packages may have the same problem, because package writers do not always validate ellipsis arguments. I made a small spelling mistake and got puzzling results because there was no parameter validation, even though I expected such well-established packages to be robust and help users recover from typos. Let us see what happens with no ellipsis:
> square <- function(num){
+   return(num*num)
+ }
>
> square(num=4)
[1] 16
> square(numm=4)
Error in square(numm = 4) : unused argument (numm = 4)

# With ellipsis added
> square <- function(num, ...){
+   print(names(list(...)))
+   return(num*num)
+ }
>
> square(num=3,bla=4,kla=9)
[1] "bla" "kla"
[1] 9
As you can see, names(list(...)) does give access to the names of the parameters passed through the ellipsis. Ellipsis function calls exist for flexibility, but package writers should take extra care to throw an "unused argument" error when function parameters are misspelled.
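As a minimal sketch of what such validation could look like (my own illustrative code, not taken from any package), a function author can compare the names arriving through the ellipsis against a whitelist and stop on anything unrecognised:

# Illustrative only: reject unrecognised arguments passed through the ellipsis
square_strict <- function(num, ...) {
  extras  <- names(list(...))
  allowed <- character(0)                 # this function accepts no extra arguments
  unknown <- setdiff(extras, allowed)
  if (length(unknown) > 0) {
    stop("unused argument(s): ", paste(unknown, collapse = ", "))
  }
  return(num*num)
}
square_strict(num=4)     # [1] 16
square_strict(numm=4)    # Error: unused argument(s): numm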

This, to my mind, is a major weakness of the R ecosystem. Most parameters have defaults, and a small case or spelling mistake can lead to really wrong conclusions!

RSNNS is fantastic; it is given here simply as an example of the ellipsis function call trap. I hope other newbies benefit and learn to avoid the trap of wrong arguments.

Jayanta Narayan Choudhuri
Kolkata India

ggedit 0.2.0 is now on CRAN

Jonathan Sidi, Metrum Research Group

We are pleased to announce the release of the ggedit package on CRAN. To install the package you can call the standard R command:
install.packages('ggedit')
The source version is still tracked on GitHub, where the repository has been reorganized to be easier to navigate. To install the dev version:
devtools::install_github('metrumresearchgroup/ggedit')

What is ggedit?

ggedit is an R package that facilitates ggplot formatting. With ggedit, R users of all experience levels can easily move from creating ggplots to refining aesthetic details, all while maintaining portability for further reproducible research and collaboration. ggedit is run from an R console or as a reactive object in any Shiny application. The user inputs a ggplot object or a list of objects; the application populates Bootstrap modals with all of the elements found in each layer, scale, and theme of the ggplot objects. The user can then edit these elements and interact with the plot as changes occur. During editing, a comparison of the script is logged, which can be directly copied and shared. The application output is a nested list containing the edited layers, scales, and themes in both object and script form, so you can apply the edited objects independently of the original plot using regular ggplot2 grammar.

Why does it matter?

ggedit promotes efficient collaboration. You can share your plots with team members to make formatting changes, and they can then send any objects they’ve edited back to you for implementation. No more email chains to change a circle to a triangle!

Updates in ggedit 0.2.0:

  • The layer modal (popups) elements have been reorganized for less clutter and easier navigation.
  • The S3 method written to plot and compare themes has been removed from the package, but can still be found on the repo, see plot.theme.

Deploying

  • call from the console: ggedit(p)
  • call from the addin toolbar: highlight the script of a plot object in the RStudio source editor and run ggedit from the Addins toolbar.
  • call as part of Shiny: use the Shiny module syntax to call the ggEdit UI elements (a minimal app sketch follows this list).
    • server: callModule(ggEdit,'pUI',obj=reactive(p))
    • ui: ggEditUI('pUI')
  • if you have installed the package you can see an example of a Shiny app by executing runApp(system.file('examples/shinyModule.R',package = 'ggedit'))
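As a minimal sketch of the Shiny deployment (my own illustrative single-file app; it assumes only the ggEdit/ggEditUI module functions listed above and an existing ggplot object p):

# Illustrative single-file Shiny app wiring up the ggedit module
library(shiny)
library(ggplot2)
library(ggedit)

p <- ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()

ui <- fluidPage(ggEditUI('pUI'))

server <- function(input, output, session) {
  callModule(ggEdit, 'pUI', obj = reactive(p))
}

shinyApp(ui, server)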

Outputs

ggedit returns a list containing 8 elements, either to the global environment or as a reactive output in Shiny. (A short sketch of using this list from the console follows the element descriptions below.)
  • updatedPlots
    • List containing updated ggplot objects
  • updatedLayers
    • For each plot a list of updated layers (ggproto) objects
    • Portable object
  • updatedLayersElements
    • For each plot, a list of elements and their values in each layer
    • Can be used to update the new values in the original code
  • updatedLayerCalls
    • For each plot a list of scripts that can be run directly from the console to create a layer
  • updatedThemes
    • For each plot a list of updated theme objects
    • Portable object
    • If the user doesn’t edit the theme, updatedThemes will not be returned
  • updatedThemeCalls
    • For each plot a list of scripts that can be run directly from the console to create a theme
  • updatedScales
    • For each plot a list of updated scales (ggproto) objects
    • Portable object
  • updatedScaleCalls
    • For each plot a list of scripts that can be run directly from the console to create a scale
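A minimal sketch of consuming this list from the console (illustrative; the element names are the ones listed above, but the exact nesting may differ slightly depending on your ggedit version):

# Illustrative: edit a plot, then pull pieces out of the returned list
pOut <- ggedit(p)                # edit interactively, then close the app
pOut$updatedPlots[[1]]           # the edited ggplot object
pOut$updatedLayerCalls[[1]]      # script that recreates the edited layer(s)
pOut$updatedThemes               # only present if the theme was edited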
 

Short Clip to use ggedit in Shiny


Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings. To learn more about additional open-source software packages developed by Metrum Research Group, please visit the Metrum website. Contact: for questions and comments, feel free to email me at: [email protected] or open an issue for bug fixes or enhancements on GitHub.