How to generate data from a model – Part 2

Summary

Traditionally, data scientists have built models based on data. This article details how to do the exact opposite, i.e. generate data based on a model. It is the second in a series of articles on building data from a model.
You can find part 1 here.


Broadly speaking, there are three steps to generating data from a model, as given below.

Step 1: Register for the API

    • Head over to the API developer portal at https://foyi.developer.azure-api.net/
    • Click on the Sign up button on the home page.
    • Register your details such as email, password etc.
    • You will receive an email asking you to verify your email address.
    • Once you have verified your email, head over to the Products section of the developer portal. You can find it in the menu at the top right-hand corner of the web page.
    • Select the product Starter by clicking on it. This takes you to the product page, where you will find the section Your subscriptions. Enter a name for your subscription and hit Subscribe.
    • After subscribing, go to your profile page and, under the subscriptions section, click on show next to the Primary key. That is the subscription key you will need to access the API. Congratulations, you can now access the API.
    • If you have any issues with the signup, please email [email protected]

Step 2: Install the R package conjurer

Install the latest version of the package from CRAN as follows.
install.packages("conjurer")

Step 3: Generate data from model

The function used to generate data from a model is buildModelData(numOfObs, numOfVars, key, modelObj).
The arguments of buildModelData are as follows.

    • numOfObs is the number of observations, i.e. rows of data, that you would like to generate. Please note that the current version allows you to generate between 100 and 10,000 observations.
    • numOfVars is the number of independent variables, i.e. columns in the data. Please note that the current version allows you to generate between 1 and 100 variables.
    • key is the Primary key that you obtained in Step 1.
    • modelObj is the model object. In the current version 1.7.1, this accepts either an lm or a glm model object built using the stats package. This is an optional parameter: if it is not specified, the function generates the data randomly. If a model object is provided, the intercept, the coefficients and the range of the independent variables are sourced from it.
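
The limits listed above can be checked locally before a request is sent. The following is a minimal sketch in base R, not part of conjurer: the helper name checkModelDataArgs and its error messages are hypothetical, and it only encodes the ranges and model classes stated in this article.

```r
# Hypothetical helper (not part of the conjurer package): validates the
# arguments against the limits described in this article before calling the API.
checkModelDataArgs <- function(numOfObs, numOfVars, key, modelObj = NULL) {
  # TRUE only for a single, non-missing whole number
  isCount <- function(x) {
    is.numeric(x) && length(x) == 1 && !is.na(x) && x == round(x)
  }
  if (!isCount(numOfObs) || numOfObs < 100 || numOfObs > 10000) {
    stop("numOfObs must be a whole number between 100 and 10000")
  }
  if (!isCount(numOfVars) || numOfVars < 1 || numOfVars > 100) {
    stop("numOfVars must be a whole number between 1 and 100")
  }
  if (!is.character(key) || length(key) != 1 || !nzchar(key)) {
    stop("key must be a non-empty string")
  }
  # glm objects also carry the class "lm", so one inherits() call covers both
  if (!is.null(modelObj) && !inherits(modelObj, "lm")) {
    stop("modelObj must be an lm or glm object")
  }
  invisible(TRUE)
}

# Example: passes silently for a valid combination
m <- lm(dist ~ speed, data = cars)
checkModelDataArgs(numOfObs = 100, numOfVars = 1, key = "your key", modelObj = m)
```

Failing early in this way avoids spending an API call on a request that is outside the documented limits.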

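Generate completely random data using the code below.
library(conjurer)
# Request 1,000 rows with 3 independent variables; no model object is passed, so the data is random
uncovrJson <- buildModelData(numOfObs = 1000, numOfVars = 3, key = "input your subscription key here")
# Convert the API response into a data frame
df <- extractDf(uncovrJson=uncovrJson)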


Generate data based on a model object. For this example, a simple linear regression model is used.
library(conjurer)
library(datasets)

# Fit a simple linear regression of stopping distance on speed
data(cars)
m <- lm(formula = dist ~ speed, data = cars)
# Request 100 rows with 1 independent variable, based on the fitted model
uncovrJson <- buildModelData(numOfObs=100, numOfVars=1, key="insert subscription key here", modelObj = m)
# Convert the API response into a data frame
df <- extractDf(uncovrJson=uncovrJson)

Interpretation of results

The data frame df (in the code above) will have two columns named iv1 and dv. The columns with the prefix iv are the independent variables, while dv is the dependent variable. You can rename them to suit your needs. In the example above, iv1 is speed and dv is distance. The details of the model formula and its estimated performance can be inspected as follows.
    • To begin with, you can inspect the JSON data received from the API by using the code str(uncovrJson). This displays all the components of the JSON response. The attributes prefixed with slope are the coefficients of the model formula, and the number in the name identifies the variable. For example, slope1 is the coefficient corresponding to iv1, i.e. independent variable 1.
    • The regression formula used to construct the data for the example data frame is as follows.
      dv = intercept + (slope1*iv1) + error
      Please note that the formula takes the form Y = mX + C. If there are multiple variables, the component (slope1*iv1) is repeated for each independent variable, e.g. (slope2*iv2).
    • While the slopes, i.e. the coefficients, are at the variable level, the error is at the level of each observation. These errors can be accessed as uncovrJson$error
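
The formula above can be reproduced locally to see how each component contributes to dv. This is only an illustrative sketch in base R and does not call the API: drawing iv1 uniformly over the observed range of speed and drawing the errors from a standard normal distribution are assumptions made for the example, not a description of how uncovr generates its values.

```r
# Local illustration of dv = intercept + (slope1*iv1) + error.
# Assumption: uniform iv1 and normal errors are chosen for the example only;
# the API's actual sampling scheme is not documented in this article.
set.seed(42)

m <- lm(dist ~ speed, data = cars)
intercept <- unname(coef(m)[1])   # C in Y = mX + C
slope1    <- unname(coef(m)[2])   # m in Y = mX + C

n     <- 100
iv1   <- runif(n, min = min(cars$speed), max = max(cars$speed))
error <- rnorm(n, mean = 0, sd = 1)   # one error term per observation

dv <- intercept + (slope1 * iv1) + error
localDf <- data.frame(iv1 = iv1, dv = dv)

head(localDf)
```

The slope is a single number shared by every row, whereas error is a vector with one entry per row, which mirrors the distinction between variable-level and observation-level components described above.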
The following code gives a simple comparison of the synthetic data with the original data.

summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00


summary(df)
      iv1               dv
 Min.   : 4.080   Min.   :-38.76
 1st Qu.: 8.915   1st Qu.: 29.35
 Median :16.844   Median : 47.03
 Mean   :15.405   Mean   : 46.13
 3rd Qu.:20.461   3rd Qu.: 75.83
 Max.   :24.958   Max.   :127.66


Limitation and Future Work

Some of the known limitations of this algorithm are as follows.
    • It can be observed from the above comparison that the range of the independent variable in the synthetic data, i.e. iv1, is close to the range in the original data, i.e. speed. However, the range of the dependent variable in the synthetic data, i.e. dv, is very different from the original data, i.e. dist. This is because the error terms of the synthetic data are entirely random and are not sourced from the model object.

    • While the range of the independent variable is similar across the original and synthetic datasets, a simple visual inspection of the distributions using the histogram plots hist(df$iv1) and hist(cars$speed) shows a drastic difference. This is because the distribution of the independent variable is random and not sourced from the model object.

    • Additionally, if the same lm model is fitted to the synthetic data, the formula will be similar but the p-values, R-squared etc. can differ substantially. This is again because the error terms are random and not sourced from the model object.

These limitations will be addressed in future versions. More specifically, the distributions of the independent variables and the error terms will be further engineered.
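
One way to see the last limitation is to refit the model on synthetic data and place the estimates side by side. The sketch below does not call the API: it uses a locally simulated stand-in for df, and the uniform iv1 and the error standard deviation of 30 are assumptions chosen purely to illustrate the comparison. With data from the API, the stand-in can be replaced by df.

```r
# Compare a fit on the original data with a fit on synthetic data.
# Assumption: `synthetic` is a local stand-in for the API output; the noise
# level (sd = 30) is arbitrary and only illustrates the effect of random errors.
set.seed(1)

originalFit <- lm(dist ~ speed, data = cars)

synthetic <- data.frame(iv1 = runif(100, min = 4, max = 25))
synthetic$dv <- coef(originalFit)[[1]] +
  coef(originalFit)[[2]] * synthetic$iv1 +
  rnorm(100, mean = 0, sd = 30)

syntheticFit <- lm(dv ~ iv1, data = synthetic)

comparison <- data.frame(
  fit       = c("original", "synthetic"),
  intercept = c(coef(originalFit)[[1]], coef(syntheticFit)[[1]]),
  slope     = c(coef(originalFit)[[2]], coef(syntheticFit)[[2]]),
  rSquared  = c(summary(originalFit)$r.squared, summary(syntheticFit)$r.squared)
)
print(comparison)
```

The p-values of each fit can be inspected in the same way with summary(originalFit)$coefficients and summary(syntheticFit)$coefficients.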

Concluding Remarks

The underlying API, uncovr, is under development by FOYI. As new functionality is released, the R package conjurer will be updated to reflect those changes. Your feedback is valuable. For any feature requests or bug reports, please follow the contribution guidelines on the GitHub repository. To follow future releases and news, please follow our LinkedIn page.

Published by

Sidharth Macherla

Data Scientist @ FOYI (https://www.foyi.co.nz/)
