Summary
Traditionally, data scientists have built models from data. This article details how to do the exact opposite, i.e. generate data from a model. It is the first in a series of articles on building data from a model.
Motivation & Practical Applications
Businesses across industries are embracing solutions driven by Artificial Intelligence (AI) and Machine Learning (ML). The recent growth of cloud-based Machine Learning Operations (ML Ops) tools such as Azure ML has made AI/ML solutions more accessible, easier to deploy and, in some cases, more affordable. The use of Auto ML and feature engineering packages is also increasing. These approaches reduce manual intervention during the model build and retraining stages. Since the focus is now predominantly on building ML pipelines rather than building models manually, the robustness of those pipelines needs to be inspected. This is still an evolving field, currently handled by model observability tools. This article proposes one such method of observability. Its purpose is best expressed as a question:
What if we built the underlying data distributions, the outliers and the dependent variable ourselves, and then put that data through the ML Ops pipeline? Wouldn’t we know where the pipeline worked well and where it could have done better?
This question motivated the build of a Software as a Service (SaaS) product called uncovr. The product can now be accessed through the R package conjurer by following the steps outlined below.
Data from Model Using R
Step 1: Register for the API
- Head over to the API developer portal at https://foyi.developer.azure-api.net/.
- Click on the Sign up button on the home page.
- Register your details such as email, password etc.
- You will receive an email to verify your email id.
- Once you verify your email id, your account will be set up and you will receive a confirmation email.
- Once your account is set up, please head over to the products section on the developer portal and select the product starter. Currently, this is the only subscription available. Give your subscription a name, read and accept the terms and click Subscribe.
- On your profile page, under the subscriptions section, click on show next to the Primary key. That is the subscription key you will need to access the API.
Step 2: Install R Package Conjurer
Install the latest version of the package from CRAN as follows.
install.packages("conjurer")
Step 3: Generate data
Generate the data using the code below.
library(conjurer)
uncovrJson <- buildModelData(numOfObs = 1000, numOfVars = 3, key = "input your subscription key here")
df <- extractDf(uncovrJson=uncovrJson)
The above code has two steps. The first step is to connect to the API and source the data in JSON format. The second step is to convert the JSON format to an R dataframe.
The components of the function buildModelData are as follows.
- numOfObs is the number of observations i.e. rows of data that you would like to generate. Please note that the current version allows you to generate from a minimum of 100 observations to a maximum of 10,000.
- numOfVars is the number of independent variables i.e. columns in the data. Please note that the current version allows you to generate from a minimum of 1 variable to a maximum of 100.
- key is the Primary key that you have sourced from the earlier step.
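The limits listed above can be checked locally before a request is sent to the API. The following is a minimal sketch, not part of conjurer: checkUncovrArgs is a hypothetical helper that only encodes the ranges stated in this article, so it would need updating if a later version of the service changes them.

```r
# Hypothetical helper (NOT part of the conjurer package): validates the
# arguments against the limits quoted in this article before calling the API.
checkUncovrArgs <- function(numOfObs, numOfVars, key) {
  # TRUE only for a single, non-missing whole number
  isWholeNumber <- function(x) {
    is.numeric(x) && length(x) == 1 && !is.na(x) && x == round(x)
  }
  if (!isWholeNumber(numOfObs) || numOfObs < 100 || numOfObs > 10000) {
    stop("numOfObs must be a whole number between 100 and 10,000")
  }
  if (!isWholeNumber(numOfVars) || numOfVars < 1 || numOfVars > 100) {
    stop("numOfVars must be a whole number between 1 and 100")
  }
  if (!is.character(key) || length(key) != 1 || is.na(key) || !nzchar(key)) {
    stop("key must be a non-empty string")
  }
  invisible(TRUE)
}

# Passes silently for the values used in this article.
checkUncovrArgs(numOfObs = 1000, numOfVars = 3,
                key = "input your subscription key here")
```

Calling the helper just before buildModelData turns an out-of-range request into an immediate, readable error instead of a failed API call.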
The model used in the current version to generate the data is a linear regression model. The details of the model formula and its estimated performance can be inspected as follows.
- To begin with, you can inspect the JSON data received from the API by running str(uncovrJson). This should display all the components of the JSON response. The attributes prefixed with slope are the coefficients of the model formula, numbered to match the variables. For example, slope1 is the coefficient corresponding to iv1, i.e. independent variable 1.
- The regression formula used to construct the data for the example data frame is as follows. Please note that the formula takes the form of Y = mX + C, extended to three independent variables plus an error term: dv = intercept + (slope1*iv1) + (slope2*iv2) + (slope3*iv3) + error.
- Please note that while the slopes, i.e. the coefficients, are at the variable level, the error is at the observation level. These errors can be accessed as uncovrJson$error.
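To make the formula concrete, the sketch below rebuilds a dependent variable locally from made-up coefficients and then checks that a linear regression recovers them. It does not call the API. The values of intercept, slope1 to slope3 and error, the distributions of the independent variables, and the column names iv1 to iv3 and dv are all assumptions for illustration, chosen only to mirror the names used above.

```r
set.seed(42)  # reproducible mock data

n <- 1000

# Mock independent variables (assumed distributions, not those of the API).
iv1 <- rnorm(n)
iv2 <- runif(n)
iv3 <- rnorm(n, mean = 5)

# Mock model components, named after the attributes described above.
intercept <- 2
slope1 <- 1.5
slope2 <- -0.7
slope3 <- 0.3
error <- rnorm(n, sd = 0.1)  # one error value per observation

# The formula from the article, applied to every row at once.
dv <- intercept + (slope1 * iv1) + (slope2 * iv2) + (slope3 * iv3) + error

# Fit a linear regression and compare the estimates with the known truth.
fit <- lm(dv ~ iv1 + iv2 + iv3)
estimated <- unname(coef(fit))
truth <- c(intercept, slope1, slope2, slope3)
print(round(cbind(truth, estimated), 3))
```

Because the true coefficients are known up front, any gap between truth and estimated is attributable to the fitting step. That is the idea behind testing a pipeline with data built from a model.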