R-posts.com

How to generate data from a model – Part 2



Summary

Traditionally, data scientists have built models from data. This article shows how to do the exact opposite, i.e., generate data from a model. It is the second in a series of articles on generating data from a model.
You can find part 1 here


Broadly speaking, there are three steps to generating data from a model, as given below.

Step 1: Register for the API


Step 2: Install R Package Conjurer

Install the latest version of the package from CRAN as follows.
install.packages("conjurer")

Step 3: Generate data from model

The function used to generate data from a model is buildModelData(numOfObs, numOfVars, key, modelObj).
The components of this function are as follows.


Generate completely random data using the code below.
library(conjurer)
uncovrJson <- buildModelData(numOfObs = 1000, numOfVars = 3, key = "input your subscription key here")
df <- extractDf(uncovrJson=uncovrJson)


Generate data based on the model object provided. For this example, a simple linear regression model is used.
library(conjurer)
library(datasets)

data(cars)
m <- lm(formula = dist ~ speed, data = cars)
uncovrJson <- buildModelData(numOfObs=100, numOfVars=1, key="insert subscription key here", modelObj = m)
df <- extractDf(uncovrJson=uncovrJson)
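
Hardcoding the subscription key in a script makes it easy to leak. A minimal sketch of one alternative is shown below. It assumes the key is stored in an environment variable named UNCOVR_KEY, which is a hypothetical name chosen for this example and not something the package defines. The request is attempted only when the key is set and the package is installed.

```r
# Read the subscription key from an environment variable.
# UNCOVR_KEY is a hypothetical name chosen for this sketch.
key <- Sys.getenv("UNCOVR_KEY", unset = "")

# Fit the same linear model as in the example above.
m <- lm(formula = dist ~ speed, data = datasets::cars)

df <- NULL
if (nzchar(key) && requireNamespace("conjurer", quietly = TRUE)) {
  # Same calls as in the example above, with the key taken from the environment.
  uncovrJson <- conjurer::buildModelData(numOfObs = 100, numOfVars = 1,
                                         key = key, modelObj = m)
  df <- conjurer::extractDf(uncovrJson = uncovrJson)
} else {
  message("Set UNCOVR_KEY and install conjurer to request synthetic data.")
}
```

With this pattern the script itself can be shared or committed without exposing the key.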

Interpretation of results

The data frame df (in the code above) will have two columns named iv1 and dv. Columns with the prefix iv are the independent variables, while dv is the dependent variable. You can rename them to suit your needs. In the example above, iv1 is speed and dv is distance. The details of the model formula and its estimated performance can be inspected as follows.

A simple comparison of the synthetic data with the original data can be made with the following code.

summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00


summary(df)
      iv1               dv
 Min.   : 4.080   Min.   :-38.76
 1st Qu.: 8.915   1st Qu.: 29.35
 Median :16.844   Median : 47.03
 Mean   :15.405   Mean   : 46.13
 3rd Qu.:20.461   3rd Qu.: 75.83
 Max.   :24.958   Max.   :127.66
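
Beyond the summary statistics, one way to check the synthetic data is to rename its columns, refit the same formula, and compare the coefficients with those of the original model. The sketch below is illustrative only. Because calling the API needs a subscription key, it builds a local stand-in for df from the fitted model, which is an assumption made for this example and not how uncovr generates data. It also counts negative values of the dependent variable, since the summary of df shows a minimum of -38.76 while a real stopping distance cannot be negative.

```r
# Original model, as in the example above.
m <- lm(dist ~ speed, data = datasets::cars)

# Stand-in for the data frame returned by extractDf(), built locally so the
# sketch runs without a subscription key. This is NOT how uncovr generates data.
set.seed(42)
iv1 <- runif(100, min = 4, max = 25)
df <- data.frame(
  iv1 = iv1,
  dv  = coef(m)[["(Intercept)"]] + coef(m)[["speed"]] * iv1 +
        rnorm(100, sd = sigma(m))
)

# Rename the generic columns to match the original variables.
names(df)[names(df) == "iv1"] <- "speed"
names(df)[names(df) == "dv"]  <- "dist"

# Refit the same formula on the synthetic data and compare coefficients.
mSynth <- lm(dist ~ speed, data = df)
comparison <- rbind(original = coef(m), synthetic = coef(mSynth))
print(comparison)

# Count physically implausible values, such as negative stopping distances.
nNegative <- sum(df$dist < 0)
print(nNegative)
```

To run the same check on real output, replace the stand-in with the df returned by extractDf.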


Limitations and Future Work

Some of the known limitations of this algorithm are as follows.
These limitations will be addressed in future versions. More specifically, the distributions of the independent variables and error terms will be further engineered in future versions.

Concluding Remarks

The underlying API, uncovr, is under development by FOYI. As new functionality is released, the R package conjurer will be updated to reflect those changes. Your feedback is valuable. For any feature requests or bug reports, please follow the contribution guidelines on the GitHub repository. If you would like to follow future releases and news, please follow our LinkedIn page.