It can be easy to explore data generating mechanisms with the simstudy package

I learned statistics and probability by simulating data. Sure, I battled my way through proofs, but I never believed the results until I saw them in a simulation. I guess I have it backwards, but it worked for me. And now that I do this for a living, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure, I'll use the occasional formula, but I always feel the need to check it with simulation. It's just the way I am.

Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called simstudy, or simulating study data. My goal here is to introduce the basic idea behind simstudy, and provide a relatively simple example that comes from a question posed by a user who was trying to generate correlated longitudinal data.

The basic idea

Simulation using simstudy has two essential steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved. Simulation can include observed or randomized treatment assignment/exposures, survival data, longitudinal/panel data, multi-level/hierarchical data, data sets with correlated variables based on a specified covariance structure, and data sets with missing data resulting from any sort of missingness pattern.

The key to simulating data in simstudy is the creation of a series of data definition tables. Here's the code used to generate such a definition, which is stored as a data.table:

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")
To create a simple data set based on these definitions, all one needs to do is execute a single genData command. In this example, we generate 500 records that are based on the definition in the def table:

dt <- genData(500, def)

dt
##      idnum nr       x1       y1  y2 xCat          g1 a1
##   1:     1  7 10.74929 30.01273 123    3 13310.84731  0
##   2:     2  7 18.56196 44.77329  17    1   395.41606  0
##   3:     3  7 16.96044 43.76427  42    3   522.45258  0
##   4:     4  7 19.51511 45.14214  22    3  3045.06826  0
##   5:     5  7 10.79791 27.25012 128    1   406.88647  0
##  ---                                                   
## 496:   496  7 19.96636 52.05377  21    3   264.85557  1
## 497:   497  7 15.97957 39.62428  44    2    40.59823  0
## 498:   498  7 19.74036 47.32292  21    3  3410.54787  0
## 499:   499  7 19.71597 48.26259  25    3   206.90961  1
## 500:   500  7 14.60405 28.94185  53    1    97.43068  0
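One of the capabilities mentioned earlier, missing data, follows the same define-then-generate pattern. Here is a minimal sketch using the dt just created, with simstudy's defMiss, genMiss, and genObs functions (the 15% missingness rate is just an illustrative assumption):

# define a missingness mechanism: x1 is missing completely at random 15% of the time
mdef <- defMiss(varname = "x1", formula = 0.15)

# generate missingness indicators, then the 'observed' data set with NAs
dtMiss <- genMiss(dt, mdef, idvars = "idnum")
dtObs <- genObs(dt, dtMiss, idvars = "idnum")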
There's a lot more functionality in the package, and I'll be writing about that in the future. But here, I just want to give a little more of an introduction by way of an example that came in from across the globe a couple of days ago. (I'd say the best thing about building an R package is hearing from folks literally all over the world and getting to talk to them about statistics and R.)

Going a bit further: simulating a prospective cohort study with repeated measures

The question was, can we simulate a study with two arms, say a control and a treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? The answer is, of course, yes.

Drawing on her original code, we wanted a scenario that included two treatment arms (or exposures) and three measurements per individual, with the change over time greater in one of the groups. This is what I sent back to my correspondent:

# Define the outcome

ydef <- defDataAdd(varname = "Y", dist = "normal", 
                   formula = "5 + 2.5*period + 1.5*T + 3.5*period*T", 
                   variance = 3)

# Generate a 'blank' data.table with 24 observations and assign them to groups

set.seed(1234)

indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

# Create a longitudinal data set of 3 records for each id

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, T := factor(T, labels = c("No", "Yes"))]

# Let's look at the data

ggplot(data = longData, aes(x = factor(period), y = Y)) +
    geom_line(aes(color = T, group = id)) +
    scale_color_manual(values = c("#e38e17", "#8e17e3")) +
    xlab("Time")
My correspondent quickly pointed out that I hadn't really provided her with the full solution: the measurements for each individual in the example above are not correlated across the time periods. If we generate a data set based on 1,000 individuals and fit a marginal linear model using GEE, we see that the parameter estimates are quite good, but the estimate of alpha (at the bottom of the output), approximately 0.02, confirms that there is little within-individual correlation.
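The plot above used only 24 individuals so the lines would be legible; for the model fit, the same steps were re-run with 1,000 individuals. A minimal sketch, reusing ydef from above:

# regenerate with 1,000 individuals before fitting
indData <- genData(1000)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, T := factor(T, labels = c("No", "Yes"))]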
# Fit a GEE model to the data

fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)          4.98268 0.07227 4753.4   <2e-16 ***
## factor(T)Yes         1.48555 0.10059  218.1   <2e-16 ***
## period               2.53946 0.05257 2333.7   <2e-16 ***
## factor(T)Yes:period  3.51294 0.07673 2096.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)    2.952 0.07325
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha  0.01737 0.01862
## Number of clusters:   1000   Maximum cluster size: 3

One way to generate correlated data

To address this shortcoming, there are at least two ways we can go about it. The first is to use the simstudy function genCorData. In this example, we generate correlated errors that are normally distributed with mean 0, variance 3, and a common correlation coefficient of 0.7. With this approach, the underlying data generation process is a bit cryptic: we are acknowledging that we don't know what drives the relationship between the three outcomes, only that they share some common cause or other source of association. However, it does the trick.
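For reference, the "cs" (compound symmetry) correlation structure used below implies that the three errors share a common variance and a common pairwise covariance. A quick sketch of the implied covariance matrix:

# implied covariance matrix under compound symmetry:
# sigma^2 * ((1 - rho) * I + rho * J), with sigma^2 = 3 and rho = 0.7
3 * ((1 - 0.7) * diag(3) + 0.7 * matrix(1, 3, 3))
##      [,1] [,2] [,3]
## [1,]  3.0  2.1  2.1
## [2,]  2.1  3.0  2.1
## [3,]  2.1  2.1  3.0

Here is the full data generation: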
# define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + e")

# define the correlated errors

mu <- c(0, 0, 0)
sigma <- rep(sqrt(3), 3)

# generate correlated data for each id and assign treatment

dtCor <- genCorData(24, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")

# create longitudinal data set and generate outcome based on definition

longData <- addPeriods(dtCor, nPeriods = 3, idvars = "id", timevars = c("V1", 
    "V2", "V3"), timevarName = "e")
longData <- addColumns(ydef, longData)

longData[, T := factor(T, labels = c("No", "Yes"))]

# look at the data, outcomes should appear more correlated, lines a bit straighter

ggplot(data = longData, aes(x = factor(period), y = Y)) +
    geom_line(aes(color = T, group = id)) +
    scale_color_manual(values = c("#e38e17", "#8e17e3")) +
    xlab("Time")
Again, we recover the true parameters. And this time, if we look at the estimated correlation, we see that the outcomes are indeed correlated within each individual. The estimate is 0.71, very close to the "true" value of 0.7.
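As with the first fit, the estimates below are based on a regenerated data set of 1,000 individuals; a minimal sketch:

# same steps as above, but starting from 1,000 individuals
dtCor <- genCorData(1000, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")
# ... followed by the same addPeriods, addColumns, and factor steps as before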
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")

summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           5.0636  0.0762 4411   <2e-16 ***
## factor(T)Yes          1.4945  0.1077  192   <2e-16 ***
## period                2.4972  0.0303 6798   <2e-16 ***
## factor(T)Yes:period   3.5204  0.0426 6831   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     3.07   0.117
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.711  0.0134
## Number of clusters:   1000   Maximum cluster size: 3

Another way to generate correlated data

A second way to generate correlated data is through an individual-level random effect, or random intercept. This could be considered some unmeasured characteristic of the individuals (which happens to have a convenient normal distribution with mean zero). The random effect contributes equally to all of an individual's outcomes, while each outcome deviates slightly from that individual's hypothetical straight line as a result of unmeasured noise.
# define the individual-level random effect and the outcome
ydef1 <- defData(varname = "randomEffect", dist = "normal", formula = 0, variance = sqrt(3))
ydef2 <- defDataAdd(varname = "Y", dist = "normal", 
    formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + randomEffect", 
    variance = 1)

indData <- genData(24, ydef1)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

indData[1:6]
##    id T randomEffect
## 1:  1 0      -1.3101
## 2:  2 1       0.3423
## 3:  3 0       0.5716
## 4:  4 1       2.6723
## 5:  5 0      -0.9996
## 6:  6 1      -0.0722
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef2, longData)

longData[, T := factor(T, labels = c("No", "Yes"))]

ggplot(data = longData, aes(x = factor(period), y = Y)) +
    geom_line(aes(color = T, group = id)) +
    scale_color_manual(values = c("#e38e17", "#8e17e3")) +
    xlab("Time")
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), 
    data = longData, id = id, corstr = "exchangeable")
summary(fit)
## 
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, 
##     family = gaussian(link = "identity"), data = longData, id = id, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##                     Estimate Std.err Wald Pr(>|W|)    
## (Intercept)           4.9230  0.0694 5028   <2e-16 ***
## factor(T)Yes          1.4848  0.1003  219   <2e-16 ***
## period                2.5310  0.0307 6793   <2e-16 ***
## factor(T)Yes:period   3.5076  0.0449 6104   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     2.63  0.0848
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.619  0.0146
## Number of clusters:   1000   Maximum cluster size: 3
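One final check: under this random-intercept specification, the implied within-individual (exchangeable) correlation is the random effect variance divided by the total variance, which lines up with the estimated alpha of 0.619:

# implied exchangeable correlation:
# var(randomEffect) / (var(randomEffect) + residual variance)
sqrt(3) / (sqrt(3) + 1)
## [1] 0.6339746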
I sent all this back to my correspondent, but I haven’t heard yet if either of these solutions meets her needs. I certainly hope so. Speaking of which, if there are specific topics you’d like me to discuss related to simstudy, definitely get in touch, and I will try to write something up.

Published by

Keith Goldfeld

I am an Assistant Professor of Biostatistics in the Department of Population Health at the New York University School of Medicine.

2 thoughts on “It can be easy to explore data generating mechanisms with the simstudy package”

  1. I am having trouble understanding how the binary variable a1 is created. xCat takes 3 values: 1,2,3 so there is a one-to-one relationship between xCat and “-3 + xCat”. I cannot figure out the decision rule for the creation of the binary variable.

    def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")

    I have run the code for 1,000 obs and obtained the following proportions of binary responses for each value of xCat:

    xCat    zero   one
    1        .38   .11
    2        .23   .14
    3        .39   .75
    total    .66   .34

    1) Don't think I understand what the formula does in this case.
    2) Don't understand the binary decision rule

    Really like the package!
    Thanks.
