It can be easy to explore data generating mechanisms with the simstudy package

I learned statistics and probability by simulating data. Sure, I battled my way through proofs, but I never believed the results until I saw them in a simulation. I guess I have it backwards, but it has worked for me. And now that I do this for a living, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure – I’ll use the occasional formula, but I always feel the need to check it with simulation. It’s just the way I am.

Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called simstudy, or simulating study data. My goal here is to introduce the basic idea behind simstudy, and provide a relatively simple example that comes from a question posed by a user who was trying to generate correlated longitudinal data.

The basic idea

Simulation using simstudy has two essential steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved. Simulation can include observed or randomized treatment assignment/exposures, survival data, longitudinal/panel data, multi-level/hierarchical data, data sets with correlated variables based on a specified covariance structure, and data sets with missing data resulting from any sort of missingness pattern.

The key to simulating data in simstudy is the creation of a series of data definition tables, each stored as a data.table with one row per variable. Here’s the code that is used to generate the definitions for this example:

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")
To create a simple data set based on these definitions, all one needs to do is execute a single genData command. In this example, we generate 500 records that are based on the definition in the def table:

dt <- genData(500, def)

dt
##      idnum nr       x1       y1  y2 xCat          g1 a1
##   1:     1  7 10.74929 30.01273 123    3 13310.84731  0
##   2:     2  7 18.56196 44.77329  17    1   395.41606  0
##   3:     3  7 16.96044 43.76427  42    3   522.45258  0
##   4:     4  7 19.51511 45.14214  22    3  3045.06826  0
##   5:     5  7 10.79791 27.25012 128    1   406.88647  0
##  ---                                                   
## 496:   496  7 19.96636 52.05377  21    3   264.85557  1
## 497:   497  7 15.97957 39.62428  44    2    40.59823  0
## 498:   498  7 19.74036 47.32292  21    3  3410.54787  0
## 499:   499  7 19.71597 48.26259  25    3   206.90961  1
## 500:   500  7 14.60405 28.94185  53    1    97.43068  0
There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want to give a little more of an introduction by way of an example that came in from across the globe a couple of days ago. (I’d say the best thing about building an R package is hearing from folks literally all over the world and getting to talk to them about statistics and R.)

Going a bit further: simulating a prospective cohort study with repeated measures

The question was, can we simulate a study with two arms, say a control and a treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? The answer is, of course, yes.

Drawing on her original code, we wanted a scenario that included two treatment arms (or exposures) and three measurements per individual, with the change over time greater in one of the groups. This is what I sent back to my correspondent:

# Load the packages used below

library(simstudy)
library(ggplot2)
library(geepack)

# Define the outcome

ydef <- defDataAdd(varname = "Y", dist = "normal", 
                   formula = "5 + 2.5*period + 1.5*T + 3.5*period*T", 
                   variance = 3)

# Generate a 'blank' data.table with 24 observations and assign them to groups

set.seed(1234)

indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

# Create a longitudinal data set of 3 records for each id

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# Let's look at the data

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")
My correspondent quickly pointed out that I hadn’t really provided her with the full solution, since the measurements for each individual in the example above are not correlated across the time periods. If we generate a data set based on 1,000 individuals and fit a linear model using GEE, we see that the parameter estimates are quite good, though the estimate of alpha (at the bottom of the output), approximately 0.02, tells us that there is little within-individual correlation.
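To generate the larger data set, we can reuse the same ydef definition, this time with 1,000 individuals; a sketch along the lines of the earlier code would be:

# Re-generate the data with 1,000 individuals, reusing ydef from above

indData <- genData(1000)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]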
# Fit a GEE model to the data

fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)          4.98268 0.07227 4753.4   <2e-16 ***
## factor(T)Yes         1.48555 0.10059  218.1   <2e-16 ***
## period               2.53946 0.05257 2333.7   <2e-16 ***
## factor(T)Yes:period  3.51294 0.07673 2096.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)    2.952 0.07325
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha  0.01737 0.01862
## Number of clusters:   1000   Maximum cluster size: 3

One way to generate correlated data

To address this shortcoming, there are at least two ways we can go about it. The first is to use the simstudy function genCorData. In this example, we generate correlated errors that are normally distributed with mean 0, variance 3, and a common correlation coefficient of 0.7. With this approach, the underlying data generation process is a bit cryptic: we are acknowledging that we don’t know what is driving the relationship between the three outcomes, just that they share some common cause or other source of association. However, it does the trick:
# define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + e")

# define the correlated errors

mu <- c(0, 0, 0)
sigma <- rep(sqrt(3), 3)

# generate correlated data for each id and assign treatment

dtCor <- genCorData(24, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")

# create longitudinal data set and generate outcome based on definition

longData <- addPeriods(dtCor, nPeriods = 3, idvars = "id", timevars = c("V1", 
    "V2", "V3"), timevarName = "e")
longData <- addColumns(ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# look at the data, outcomes should appear more correlated, lines a bit straighter

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")
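Before fitting the model, we can take a quick (optional) look at the empirical correlation of the simulated errors to confirm the compound symmetry structure; with only 24 individuals the estimates will be noisy, but they should hover around 0.7:

# Empirical correlation of the error terms generated by genCorData
round(cor(as.matrix(dtCor[, .(V1, V2, V3)])), 2)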
Again, fitting the model to a data set based on 1,000 individuals, we recover the true parameters. And this time, if we look at the estimated correlation, we see that the outcomes are indeed correlated within each individual. The estimate is 0.71, very close to the “true” value of 0.7.
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")

summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           5.0636  0.0762 4411   <2e-16 ***
## factor(T)Yes          1.4945  0.1077  192   <2e-16 ***
## period                2.4972  0.0303 6798   <2e-16 ***
## factor(T)Yes:period   3.5204  0.0426 6831   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     3.07   0.117
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.711  0.0134
## Number of clusters:   1000   Maximum cluster size: 3

Another way to generate correlated data

A second way to generate correlated data is through an individual-level random effect or random intercept. This could be considered some unmeasured characteristic of the individuals (which happens to have a convenient normal distribution with mean zero). The random effect contributes equally to all of an individual’s outcomes, while unmeasured noise causes the outcomes for a particular individual to deviate slightly from that individual’s hypothetical straight line.
ydef1 <- defData(varname = "randomEffect", dist = "normal", formula = 0, variance = sqrt(3))
ydef2 <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + randomEffect", 
    variance = 1)

indData <- genData(24, ydef1)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

indData[1:6]
##    id T randomEffect
## 1:  1 0      -1.3101
## 2:  2 1       0.3423
## 3:  3 0       0.5716
## 4:  4 1       2.6723
## 5:  5 0      -0.9996
## 6:  6 1      -0.0722
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef2, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, 
    group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           4.9230  0.0694 5028   <2e-16 ***
## factor(T)Yes          1.4848  0.1003  219   <2e-16 ***
## period                2.5310  0.0307 6793   <2e-16 ***
## factor(T)Yes:period   3.5076  0.0449 6104   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     2.63  0.0848
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.619  0.0146
## Number of clusters:   1000   Maximum cluster size: 3
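As a rough check, the within-individual correlation implied by this random-intercept setup is the ratio of the random-effect variance to the total variance. With the random-effect variance set to sqrt(3) and the residual variance set to 1, that works out to about 0.63, in line with the estimated alpha of 0.62:

# Implied intraclass correlation:
# var(randomEffect) / (var(randomEffect) + residual variance)
sqrt(3) / (sqrt(3) + 1)
## [1] 0.6339746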
I sent all this back to my correspondent, but I haven’t heard yet if either of these solutions meets her needs. I certainly hope so. Speaking of which, if there are specific topics you’d like me to discuss related to simstudy, definitely get in touch, and I will try to write something up.

Everyone knows that loops in R are to be avoided, but vectorization is not always possible

It goes without saying that there are always many ways to solve a problem in R, but clearly some ways are better (for example, faster) than others. Recently, I found myself in a situation where I could not find a way to avoid using a loop, and I was immediately concerned, knowing that I would want this code to be flexible enough to run with a very large number of individuals, each with many observations. Two tools immediately came to mind: data.table and Rcpp. This brief description explains the background of the simulation problem I was working on and walks through the evolution of ideas to address the problems I ran into when I tried to simulate a large number of individuals. In particular, when I tried to simulate a very large number of individuals, say over 1 million, running the simulation overnight wasn’t enough.

Setting up the problem

The task in question here is not the focus, but it needs a little explanation to understand what is motivating the programming issue. I am conducting a series of simulations that involve generating an individual-level stochastic (Markov) process for any number of individuals. For the data generation, I am using the simstudy package, which was developed to help facilitate simulating data. The functions defDataAdd and genData are both from simstudy. The first part of the simulation involves specifying the transition matrix P that determines a state I am calling status, and then defining the probability of an event based on the status level at a particular time point. For each individual, I generate 36 months of data, with a status and an event indicator for each month.
library(data.table)
library(simstudy)

set.seed(123)

P <- matrix(c(0.985, 0.015, 0.000, 0.000, 
              0.000, 0.950, 0.050, 0.000,
              0.000, 0.000, 0.850, 0.150,
              0.000, 0.000, 0.000, 1.000), 
            nrow = 4, byrow = TRUE)

form <- "(status == 1) * 0.02 + (status == 2) * 0.10 + (status == 3) * 0.20"

dtDef <- defDataAdd(varname = "event", 
            formula = form, 
            dist = "binary",
            link = "identity")

N = 5000
did <- genData(N)
In order to simulate the Markov process, I decided immediately that Rcpp would be most appropriate, because I knew I could not avoid looping. Since each state of a Markov process depends on the state immediately preceding it, states need to be generated sequentially, which means there is no obvious way to vectorize (if someone has figured that out, let me know).
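For reference, a plain R version of the same per-individual loop might look something like the sketch below (the function name rMCsim is just for illustration); the Rcpp version that follows implements the same logic:

# A plain-R sketch of the per-individual Markov loop (for illustration only)
rMCsim <- function(nMonths, P, startStatus = 1, startMonth = 1) {
  sim <- integer(nMonths)
  sim[startMonth] <- startStatus
  for (i in (startMonth + 1):nMonths) {
    # the current state depends only on the state in the previous month
    sim[i] <- sample(seq_len(ncol(P)), 1, prob = P[sim[i - 1], ])
  }
  sim
}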
#include <RcppArmadilloExtensions/sample.h>

// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]

IntegerVector MCsim( unsigned int nMonths, NumericMatrix P, 
                     int startStatus, unsigned int startMonth ) {
  
  IntegerVector sim( nMonths );
  IntegerVector m( P.ncol());
  NumericVector currentP;
  IntegerVector newstate;
  
  unsigned int q = P.ncol();
  
  m = Rcpp::seq(1, q);                  // possible states 1..q
  
  sim[startMonth - 1] = startStatus;    // initialize the chain
  
  for (unsigned int i = startMonth; i < nMonths; i++) {
    
    // sample the next state using the row of P corresponding to the
    // state in the previous month
    newstate = RcppArmadillo::sample(m, 1, TRUE, P.row(sim(i-1) - 1));
    sim(i) = newstate(0);
    
  }
  
  return sim;
}
The process is simulated for each individual using the Rcpp function MCsim, but it is done in the context of a data.table statement. The key here is that each individual is processed separately through the keyby = id statement. This obviates the requirement to loop through individuals in R, even though I still need to loop within individuals for the stochastic process. This algorithm is quite fast, even with very large numbers of individuals and large numbers of observations (in this case, months) per individual.
dt <- did[, .(status = MCsim(36, P, 1, 1)), 
          keyby = id]
dt[, month := 1 : .N, keyby = id]
dt <- addColumns(dtDefs = dtDef, dtOld = dt)

dt
##           id status month event
##      1:    1      1     1     0
##      2:    1      1     2     0
##      3:    1      1     3     0
##      4:    1      1     4     0
##      5:    1      1     5     0
##     ---                        
## 179996: 5000      4    32     0
## 179997: 5000      4    33     0
## 179998: 5000      4    34     0
## 179999: 5000      4    35     0
## 180000: 5000      4    36     0
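Because P has no entries below the diagonal, status can never decrease over time (and status 4 is absorbing). A quick optional check of the simulated data confirms this:

# status should be non-decreasing within each individual, since P only allows
# transitions to the same or a higher status
dt[, .(nondecreasing = all(diff(status) >= 0)), keyby = id][, all(nondecreasing)]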

This is where things begin to slow down

It is the next phase of the simulation that started to cause me problems. For the simulation, I need to assign individuals to a group or cohort, which is defined by a month and is based on several factors: (1) whether an event occurred in that month, (2) whether the status of that individual in that month exceeded a value of 1, and (3) whether or not the individual experienced 2 or more events in the prior 12 months. An individual might be eligible for more than one cohort, but will be assigned to the first possible cohort (i.e. the earliest month where all three criteria are met). Again, the specifics of the simulation are not important here. What is important is the notion that the problem requires looking through individual data sequentially, something R is generally not so good at when the sequences get particularly long and they must be repeated a large number of times. My first, naïve, approach was to create an R function that loops through all the individuals and loops within each individual until a cohort is found:
rAssignCohortID <- function(id, month, status, 
                            event, nInds, 
                            startMonth, thresholdNum) {
  
  cohort <- rep(0, length(id));

  for (j in (1 : nInds))  {
    
    idMonth <- month[id == j];
    idEvent <- event[id == j];
    idStatus <- status[id == j];
    
    endMonth <- length(idMonth);
    
    done <- FALSE;
    i <- max(startMonth - idMonth[1], 13);
    
    while (i <= endMonth & !done) {
      
      if (idEvent[i] == 1 & idStatus[i] > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[begin:end]);
        
        if (sumED >= thresholdNum) {
          
          cohort[id == j] <- i - 1 + month[1];
          done = TRUE;
        }
      }    
      i = i + 1;
    }    
  }
  
  return(cohort);
} 
system.time(dt[, cohort1 := rAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])
##    user  system elapsed 
##   10.92    1.81   12.89

Working through possible solutions

The naïve approach works, but can we do better? I thought Rcpp might be a solution, because we know that loops in C++ are much more efficient. However, things did not turn out so well after I translated the function into C++; in fact, they got a little worse.

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector cAssignCohortID( IntegerVector id, 
                    IntegerVector month, 
                    IntegerVector status, 
                    IntegerVector event,
                    int nInds, 
                    int startMonth, 
                    int thresholdNum) {
    
  IntegerVector cohort(id.length(), 0);
  
  IntegerVector idMonth;
  IntegerVector idEvent;
  IntegerVector idStatus;
  
  for (int j = 0; j < nInds; j++) {
    
    // subset the full vectors to the records for a single id
    idMonth = month[id == j+1];
    idEvent = event[id == j+1];
    idStatus = status[id == j+1];

    int endMonth = idMonth.length();
    int sumED;
    bool done = FALSE;
    int i = std::max(startMonth - idMonth(0), 12);
    int begin;
    int end;

    while (i < endMonth & !done) {
      
      if (idEvent(i) == 1 & idStatus(i) > 1) {
        
        begin = i-12;
        end = i-1;
        
        sumED = sum(idEvent[Rcpp::seq(begin, end)]);
        
        if (sumED >= thresholdNum) {
          cohort[id == j + 1] = i + month(0);
          done = TRUE;
        }
      }    
      i += 1;
    }    
  }
   
  return(cohort);
}
system.time(dt[, cohort2 := cAssignCohortID(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2)])
## user  system elapsed 
## 13.88   2.03   16.05

I know that the function cAssignCohortID bogs down not in the loop, but in each phase where I need to subset the data set to work on a single id. For example, I need to execute the statement idMonth = month[id == j+1] for each id, and this apparently uses a lot of resources. I tried variations on this theme, including alternative ways to subset the data set within the Rcpp function, but got no improvement. But a light bulb went off in my head (dim as it might be), telling me that this is one of the many things data.table is particularly good at. In fact, I used this trick earlier in generating the stochastic process data. So, rather than subsetting the data within the function, I created a regular R function that handles only a single individual id at a time, and let data.table do the hard work of splitting up the data set to process by individual. As you can see, things got markedly faster.
rAssignCohort <- function(id, month, status, event, 
                   nInds, startMonth, thresholdNum) {
  
  cohort <- 0
  
  endMonth = length(month);
    
  done = FALSE;
  i = max(startMonth - month[1], 13);
  
  while (i <= endMonth & !done) {
    
    if (event[i] == 1 & status[i] > 1) {
        
      begin = i-12;
      end = i-1;
        
      sumED = sum(event[begin:end]);
        
      if (sumED >= thresholdNum) {
          
        cohort <- i - 1 + month[1];
        done = TRUE;
      }
    }    
    i = i + 1;
  }
  
  return(cohort)
}
system.time(dt[, cohort3 := rAssignCohort(id, month, status, event, 
                    nInds = N, startMonth = 13, thresholdNum = 2), 
               keyby=id])
##    user  system elapsed 
##     0.2     0.0     0.2
Finally, it occurred to me that an Rcpp function that is not required to subset the data might offer yet more improvement in speed. So, for the last iteration, I combined the strengths of looping in Rcpp with the strengths of subsetting in data.table to create a formidable combination. (Even when sample sizes exceed 1 million, the data are generated in a flash.)
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int cAssignCohort( IntegerVector month, 
                  IntegerVector status,
                  IntegerVector event,
                  int startMonth, int thresholdNum) {
  
  int endMonth = month.length();
  int sumED;
  int cohort = 0;
  bool done = FALSE;
  int i = std::max(startMonth - month(0), 12);
  int begin;
  int end;
  
  while (i < endMonth & !done) {
    
    if (event(i) == 1 & status(i) > 1) {
      
      begin = i-12;
      end = i-1;
      
      sumED = sum(event[Rcpp::seq(begin, end)]);
      
      if (sumED >= thresholdNum) {
        cohort = i + month(0);
        done = TRUE;
      }
    }    
    i += 1;
  }
  return(cohort);
}
system.time(dt[, cohort4 := cAssignCohort(month, status, event, 
                               startMonth=13,  thresholdNum = 2), keyby=id])
##    user  system elapsed 
##    0.01    0.00    0.01
For a more robust comparison, let’s use the benchmark function in package rbenchmark, and you can see how well data.table performs and how much Rcpp can add when used efficiently.
library(rbenchmark)

benchmark(
  dt[, cohort1 := rAssignCohortID(id, month, status, event,       # Naïve approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort2 := cAssignCohortID(id, month, status, event,        # Rcpp approach
                      nInds = N, startMonth = 13, thresholdNum = 2)],
  dt[, cohort3 := rAssignCohort(id, month, status, event,    # data.table approach
                      nInds = N, startMonth = 13, thresholdNum = 2), keyby=id],
  dt[, cohort4 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
    replications = 5,
    columns = c("replications", "elapsed", "relative"))
##   replications elapsed relative
## 1            5   46.18    461.8
## 2            5   68.01    680.1
## 3            5    0.96      9.6
## 4            5    0.10      1.0
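One more check worth running (it should return TRUE, since all four implementations are meant to encode the same rule) is to confirm that the cohort assignments agree:

# the four approaches should assign every individual to the same cohort
dt[, all(cohort1 == cohort2 & cohort1 == cohort3 & cohort1 == cohort4)]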

Postscript

I shared all of this with the incredibly helpful folks who have created data.table, and they offered a data.table-only solution that avoids all looping, which I will share here for completeness. While it is an improvement over the third approach presented above (an R function with a data.table keyby statement), it is still no match for the fastest solution. (But this all just goes to show that there will always be new approaches to consider, and I don’t claim to have come anywhere near trying them all out.)
dtfunc <- function(dx) {
  
  dx[, prev12 := Reduce(`+`, shift(event, 1:12)), by=id]
  map <- CJ(id=1:N, start=13L, end=36L, event=1L, statusx=1L, prev12x=1L)
  ans <- dx[map, on=.(id, event, status > statusx, prev12 > prev12x, month >= start, month <= end), 
            .I, allow=TRUE, by=.EACHI, nomatch=0L][, .(id, I)]
  minans <- ans[, .(I=min(I)), by=id]
  
  dx <- dx[, cohort5 := 0L][minans, cohort5 := min(month) - 1L + dx$month[I], on="id", by=.EACHI]

  return(dx)
}

system.time(dtfunc(dt))
##    user  system elapsed 
##    0.18    0.00    0.17
And here is a more complete comparison of the fastest version with this additional approach:
benchmark(
  dt[, cohort6 := cAssignCohort(month, status, event,   # combined data.table/Rcpp
                      startMonth=13,  thresholdNum = 2), keyby=id],
  dt2 <- dtfunc(dt),
  replications = 5,
  columns = c("replications", "elapsed", "relative"))
##   replications elapsed relative
## 1            5    0.10      1.0
## 2            5    0.85      8.5