How to generate data from a model – Part 2


Summary

Traditionally, data scientists have built models based on data. This article details how to do the exact opposite, i.e. generate data based on a model. This article is the second in a series on building data from a model.
You can find part 1 here.


Broadly speaking, there are three steps to generating data from a model, as given below.

Step 1: Register for the API

    • Head over to the API developer portal at https://foyi.developer.azure-api.net/.
    • Click on the Sign up button on the home page.
    • Register your details such as email, password etc.
    • You will receive an email to verify your email id.
    • Once you verify your email, please head over to the Products section of the developer portal. You can find it on the menu at the top right hand corner of the web page.
    • Please select the product Starter by clicking it. It will take you to the product page where you will find the section Your subscriptions. Please enter a name for your subscription and hit Subscribe.
    • Post subscribing, on your profile page, under the subscriptions section, click on Show next to the Primary key. That is the subscription key you will need to access the API. Congratulations, you can now access the API.
    • If you have any issues with the signup, please email [email protected]

Step 2: Install R Package Conjurer

Install the latest version of the package from CRAN as follows.
install.packages("conjurer")

Step 3: Generate data from model

The function used to generate data from a model is buildModelData(numOfObs, numOfVars, key, modelObj). The components of this function are as follows.

    • numOfObs is the number of observations i.e. rows of data that you would like to generate. Please note that the current version allows you to generate from a minimum of 100 observations to a maximum of 10,000. 
    • numOfVars is the number of independent variables i.e. columns in the data. Please note that the current version allows you to generate from a minimum of 1 variable to a maximum of 100.
    • key is the Primary key that you have sourced from the earlier step.
    • modelObj is the model object. In the current version 1.7.1, this accepts either an lm or a glm model object built using the stats package. This is an optional parameter, i.e. if this parameter is not specified, then the function generates the data randomly. However, if the model object is provided, then the intercept, coefficients and the independent variable ranges are sourced from it.

Generate completely random data using the code below.
library(conjurer)
uncovrJson <- buildModelData(numOfObs = 1000, numOfVars = 3, key = "input your subscription key here")
df <- extractDf(uncovrJson=uncovrJson)


Generate data based on the model object provided. For this example, a simple linear regression model is used.
library(conjurer)
library(datasets)

data(cars)
m <- lm(formula = dist ~ speed, data = cars)
uncovrJson <- buildModelData(numOfObs=100, numOfVars=1, key="insert subscription key here", modelObj = m)
df <- extractDf(uncovrJson=uncovrJson)

Interpretation of results

The data frame df (in the code above) will have two columns with the names iv1 and dv. The columns with prefix iv are the independent variables while the dv is the dependent variable. You can rename them to suit your needs. In the example above iv1 is speed and dv is distance. The details of the model formula and its estimated performance can be inspected as follows. 
    • To begin with, you can inspect the JSON data that is received from the API by using the code  str(uncovrJson). This would display all the components of the JSON file. The attributes prefixed as slope are the coefficients of the model formula corresponding to the number. For example, slope1 is the coefficient corresponding to iv1 i.e. independent variable 1. 
    • The regression formula used to construct the data for the example data frame is as follows.
      dv = intercept + (slope1*iv1) + error
      Please note that the formula takes the form Y = mX + C. If there are multiple independent variables, then the component (slope1*iv1) is repeated for each one.
    • While the slopes, i.e. the coefficients, are at the variable level, the error is at each observation level. These errors can be accessed as uncovrJson$error.
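As a quick sketch, the regression formula above can be verified numerically against the JSON components. Note that the component names intercept, slope1 and error are assumptions here; confirm the actual names with str(uncovrJson) before relying on them.

```r
# Continuing from the cars example above. The component names
# intercept, slope1 and error are assumptions -- confirm them
# with str(uncovrJson).
reconstructed <- uncovrJson$intercept +
  uncovrJson$slope1 * df$iv1 +
  uncovrJson$error

# If the assumed names are correct, the difference should be
# approximately zero for every observation.
summary(reconstructed - df$dv)
```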
A simple comparison can be made to see how the synthetic data generated compares to the original data with the following code. 

summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00


summary(df)
      iv1               dv
 Min.   : 4.080   Min.   :-38.76
 1st Qu.: 8.915   1st Qu.: 29.35
 Median :16.844   Median : 47.03
 Mean   :15.405   Mean   : 46.13
 3rd Qu.:20.461   3rd Qu.: 75.83
 Max.   :24.958   Max.   :127.66


Limitation and Future Work

Some of the known limitations of this algorithm are as follows.
    • It can be observed from the above comparison that the independent variable range in synthetic data generated i.e. iv1 is close to the range of the original data i.e. speed. However, the range of the dependent variable i.e. dv in synthetic data is very different from the original data i.e. dist. This is on account of the error terms of the synthetic data being totally random and not sourced from the model object. 

    • While the range of the independent variable is similar across the original and synthetic datasets, a simple visual inspection of the distribution using a histogram plot hist(df$iv1) and hist(cars$speed) shows a drastic difference. This is because the independent data distribution is random and not sourced from the model object.

    • Additionally, if the same lm model is used to fit the synthetic data, the formula will be similar but the p-values, R², etc. will differ substantially. This is on account of the error terms being random and not sourced from the model object.
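This last limitation can be seen concretely by refitting the same model on the synthetic data and comparing the fits. This is a sketch continuing from the cars example above, where m is the original model and df is the synthetic data frame.

```r
# Refit the same linear model on the synthetic data.
mSynthetic <- lm(formula = dv ~ iv1, data = df)

# The coefficients should be broadly similar...
coef(m)
coef(mSynthetic)

# ...but the fit quality typically differs a lot, because the
# synthetic error terms are random rather than model-sourced.
summary(m)$r.squared
summary(mSynthetic)$r.squared
```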
These limitations will be addressed in the future versions. To be more specific, the distribution of the independent variables and error terms will be further engineered in the future versions.

Concluding Remarks

The underlying API uncovr is under development by FOYI. As new functionality is released, the R package conjurer will be updated to reflect those changes. Your feedback is valuable. For any feature requests or bug reports, please follow the contribution guidelines in the GitHub repository. If you would like to follow future releases and news, please follow our LinkedIn page.

How to generate data from a model – Part 1


Summary

Traditionally, data scientists have built models based on data. This article details how to do the exact opposite, i.e. generate data based on a model. This article is the first in a series on building data from a model.


Motivation & Practical Applications

Businesses across various industry domains are embracing Artificial Intelligence (AI) and Machine Learning (ML) driven solutions. Furthermore, the recent increase in cloud based Machine Learning Operations (MLOps) tools such as Azure ML has made AI/ML solutions relatively more accessible, easier to deploy and, in some cases, more affordable. Additionally, there is an increase in the usage of AutoML and feature engineering packages. These approaches reduce manual intervention during the model build and retraining stages. Since the focus is predominantly on building ML pipelines, as opposed to the traditional approach of building models manually, the robustness of the pipelines needs to be inspected. This is still an evolving field and is currently handled by model observability tools. This article proposes one such method of observability. The purpose of this method can be best represented in the form of a question, as given below.

What if we built the underlying data distributions, the outliers, the dependent variable and then put it through the ML Ops pipeline?  Wouldn’t we know where the pipeline worked well and where it could have done better?
This question motivated the build of a Software as a Service (SaaS) product called uncovr. This product can now be accessed through an R package conjurer by following the steps outlined below.


Data from Model Using R

Step 1: Register for the API

    • Head over to the API developer portal at https://foyi.developer.azure-api.net/.
    • Click on the Sign up button on the home page.
    • Register your details such as email, password etc.
    • You will receive an email to verify your email id.
    • Once you verify your email id, your account will be setup and you will receive a confirmation email.
    • Once your account is set up, please head over to the Products section on the developer portal and select the product Starter. Currently, this is the only subscription available. Give your subscription a name, read and accept the terms and click Subscribe.
    • On your profile page, under the subscriptions section, click on show next to the Primary key. That is the subscription key you will need to access the API.

Step 2: Install R Package Conjurer

Install the latest version of the package from CRAN as follows.
install.packages("conjurer")

Step 3: Generate data

Generate the data using the code below.

library(conjurer)
uncovrJson <- buildModelData(numOfObs = 1000, numOfVars = 3, key = "input your subscription key here")
df <- extractDf(uncovrJson=uncovrJson)

The above code has two steps. The first step is to connect to the API and source the data in JSON format. The second step is to convert the JSON format to an R data frame.

The components of the function buildModelData are as follows.
    • numOfObs is the number of observations i.e. rows of data that you would like to generate. Please note that the current version allows you to generate from a minimum of 100 observations to a maximum of 10,000. 
    • numOfVars is the number of independent variables i.e. columns in the data. Please note that the current version allows you to generate from a minimum of 1 variable to a maximum of 100.
    • key is the Primary key that you have sourced from the earlier step.
The data frame df (in the code above) will have four columns named iv1, iv2, iv3 and dv. The columns with the prefix iv are the independent variables, while dv is the dependent variable. You can rename them to suit your needs.
The model used in the current version to generate the data is a linear regression model. The details of the model formula and its estimated performance can be inspected as follows. 
    • To begin with, you can inspect the JSON data that is received from the API by using the code  str(uncovrJson). This should display all the components of the JSON file. The attributes prefixed as slope are the coefficients of the model formula corresponding to the number. For example, slope1 is the coefficient corresponding to iv1 i.e. independent variable 1. 
    • The regression formula used to construct the data for the example data frame is as follows.
      dv = intercept + (slope1*iv1) + (slope2*iv2) + (slope3*iv3) + error
      Please note that the formula takes the form Y = mX + C.
    • Please note that while the slopes, i.e. the coefficients, are at the variable level, the error is at each observation level. These errors can be accessed as uncovrJson$error.
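As a sketch, the three-variable formula above can be checked numerically. The component names intercept, slope1 to slope3 and error are assumptions here; confirm the actual names with str(uncovrJson) first.

```r
# Continuing from the code above. The component names intercept,
# slope1..slope3 and error are assumptions -- confirm them with
# str(uncovrJson).
reconstructed <- uncovrJson$intercept + uncovrJson$error
for (i in 1:3) {
  slope <- uncovrJson[[paste0("slope", i)]]
  reconstructed <- reconstructed + slope * df[[paste0("iv", i)]]
}

# If the assumed names are correct, this should be ~0 for every row.
summary(reconstructed - df$dv)
```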

Concluding Remarks

The underlying API uncovr is under development and, as new functionality is released, the R package conjurer will be updated to reflect those changes. For any feature requests or bug reports, please follow the contribution guidelines in the GitHub repository. If you would like to follow future releases and news, please follow our LinkedIn page.

Generate names using posterior probabilities

If you are building synthetic data and need to generate people names, this article will be a helpful guide. This article is part of a series of articles regarding the R package conjurer. You can find the first part of this series here.

Steps to generate people names


1. Installation


Install conjurer package by using the following code. 
install.packages("conjurer")

2. Training data Vs default data


The package conjurer provides two options to generate names.
    • The first option is to provide custom training data.
    • The second option is to use the default training data provided by the package.
If you are interested in generating people's names, you are better off using the default training data. However, if you would like to generate names of items or products (for example, pharmaceutical drug names), it is recommended that you build your own training data.
The function that helps in generating names is buildNames. Let us understand the inputs of the function. This function takes the form as given below.
buildNames(dframe, numOfNames, minLength, maxLength)
In this function,
dframe is a dataframe. This dataframe must be a single column dataframe where each row contains a name. These names must only contain English letters (upper or lower case) from A to Z, with no special characters such as ";" and no non-ASCII characters. If you do not pass this argument to the function, the function uses the default prior probabilities to generate the names.

numOfNames is a numeric. This specifies the number of names to be generated. It must be a non-zero natural number. 

minLength is a numeric. This specifies the minimum number of alphabets in the name. It must be a non-zero natural number.

maxLength is a numeric. This specifies the maximum number of alphabets in the name. It must be a non-zero natural number.

3. Example


Let us run this function with an example to see how it works. Let us use the default matrix of prior probabilities for this example. The output would be a list of names as given below.
library(conjurer)
peopleNames <- buildNames(numOfNames = 3, minLength = 5, maxLength = 7)
print(peopleNames)
[1] "ellie"   "bellann" "netar" 
Please note that since this is a random generator, you may get names other than those displayed in the above example.
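If you need the same names across runs, you can set R's random seed before each call. This sketch assumes that buildNames draws its random numbers from R's built-in random number generator.

```r
library(conjurer)

# Setting the seed makes the random draws repeatable, assuming the
# package relies on R's built-in RNG.
set.seed(42)
namesA <- buildNames(numOfNames = 3, minLength = 5, maxLength = 7)

set.seed(42)
namesB <- buildNames(numOfNames = 3, minLength = 5, maxLength = 7)

# Should be TRUE if the RNG assumption holds.
identical(namesA, namesB)
```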

4. Consolidated code


Following is the consolidated code for your convenience.
#install latest version
install.packages("conjurer") 

#invoke library
library(conjurer)

#generate names
peopleNames <- buildNames(numOfNames = 3, minLength = 5, maxLength = 7) 

#inspect the names generated
print(peopleNames) 

5. Concluding remarks


In this article, we have learnt how to use the R package conjurer to generate names. Since the algorithm relies on prior probabilities, the names that are output may not look exactly like real human names but will sound phonetically like human names. So, go ahead and give it a try. If you would like to understand the underlying code that generates these names, you can explore the GitHub repository here. If you are interested in what's coming next in this package, you can find it in the issues section here.

Generate synthetic data using R

If you are building data science applications and need some data to demonstrate the prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generating synthetic data using the R package ‘conjurer’. 

Steps to build synthetic data


1. Installation


Install conjurer package by using the following code. Since the package uses base R functions, it does not have any dependencies.
install.packages("conjurer")

2. Build customers


A customer is identified by a unique customer identifier (ID). A customer ID is alphanumeric, with the prefix “cust” followed by a numeric. This numeric ranges from 1 to the number of customers provided as the argument to the function. For example, if there are 100 customers, then the customer IDs will range from cust001 to cust100. This ensures that the customer ID is always of the same length. Let us build a group of customer IDs using the following code. For simplicity, let us assume that there are 100 customers. Customer IDs are built using the function buildCust. This function takes one argument, numOfCust, which specifies the number of customer IDs to be built.
library(conjurer)
customers <- buildCust(numOfCust =  100)
print(head(customers))
#[1] "cust001" "cust002" "cust003" "cust004" "cust005" "cust006"

3. Build products


The next step is to build some products. A product is identified by a product ID. Similar to a customer ID, a product ID is alphanumeric, with the prefix “sku” signifying a stock keeping unit. This prefix is followed by a numeric ranging from 1 to the number of products provided as the argument to the function. For example, if there are 10 products, then the product IDs will range from sku01 to sku10. This ensures that the product ID is always of the same length. Besides the product ID, the product price range must be specified. Let us build a group of products using the following code. For simplicity, let us assume that there are 10 products and that the price range for them is from 5 dollars to 50 dollars. Products are built using the function buildProd. This function takes 3 arguments, as given below.
    • numOfProd. This defines the number of product IDs to be generated.
    • minPrice. This is the minimum value of the price range.
    • maxPrice. This is the maximum value of the price range.
library(conjurer)
products <- buildProd(numOfProd = 10, minPrice = 5, maxPrice = 50)
print(head(products))
#     SKU Price
# 1 sku01 43.60
# 2 sku02 48.56
# 3 sku03 36.16
# 4 sku04 19.02
# 5 sku05 17.19
# 6 sku06 25.35

4. Build transactions


Now that a group of customer IDs and Products are built, the next step is to build transactions. Transactions are built using the function genTrans. This function takes 5 arguments. The details of them are as follows.
    • cycles. This represents the cyclicality of data. It can take the following values:
      • "y". If cycles is set to the value "y", it means that there is only one instance of a high number of transactions during the entire year. This is a very common situation for some retail clients, where the highest number of sales occurs during the holiday period in December.
      • "q". If cycles is set to the value "q", it means that there are 4 instances of a high number of transactions. This is generally noticed in the financial services industry, where financial statements are revised every quarter and have an impact on equity transactions in the secondary market.
      • "m". If cycles is set to the value "m", it means that there are 12 instances of a high number of transactions in a year. The number of transactions increases once every month and then subsides for the rest of the month.
    • spike. This represents the seasonality of data. It can take any value from 1 to 12. These numbers represent months in a year, from January to December respectively. For example, if spike is set to 12, it means that December has the highest number of transactions.
    • trend. This represents the slope of data distribution. It can take a value of 1 or -1.
      • If the trend is set to value 1, then the aggregated monthly transactions will exhibit an upward trend from January to December and vice versa if it is set to -1.
    • outliers. This signifies the presence of outliers. If set to value 1, then outliers are generated randomly. If set to value 0, then no outliers are generated. The presence of outliers is a very common occurrence and hence setting the outliers to 1 is recommended. However, there are instances where outliers are not needed. For example, if the objective of data generation is solely for visualization purposes then outliers may not be needed.
    • transactions. This represents the number of transactions to be generated.
Let us build transactions using the following code.
transactions <- genTrans(cycles = "y", spike = 12, outliers = 1, transactions = 10000)
Visualize the generated transactions by using the following code.
TxnAggregated <- aggregate(transactions$transactionID, by = list(transactions$dayNum), length)
plot(TxnAggregated, type = "l", ann = FALSE)
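To check that the spike lands in the intended month, the transactions can also be aggregated by month. This sketch assumes the genTrans output contains the mthNum column (1 for January through 12 for December), as described in the final-data interpretation later in this article.

```r
# Aggregate the transaction count by month; mthNum is assumed to be
# present in the genTrans output (1 = January ... 12 = December).
monthly <- aggregate(transactions$transactionID,
                     by = list(month = transactions$mthNum),
                     FUN = length)

# With spike = 12, month 12 (December) should have the largest count.
monthly[which.max(monthly$x), ]
```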

5. Build final data


Bringing customers, products and transactions together is the final step of generating synthetic data. This process entails 3 steps as given below.

5.1 Allocate customers to transactions


The allocation of transactions is achieved with the help of the buildPareto function. This function takes 3 arguments, as detailed below.
    • factor1 and factor2. These are factors to be mapped to each other. As the name suggests, they must be of data type factor.
    • pareto. This defines the percentage allocation and is a numeric data type. This argument takes the form c(x, y), where x and y are numeric and their sum is 100. If we set pareto to c(80, 20), it then allocates 80 percent of factor1 to 20 percent of factor2. This is based on the well-known Pareto principle.
Let us now allocate transactions to customers first by using the following code.
customer2transaction <- buildPareto(customers, transactions$transactionID, pareto = c(80,20))
Assign readable names to the output by using the following code.
names(customer2transaction) <- c('transactionID', 'customer')

#inspect the output
print(head(customer2transaction))
#   transactionID customer
# 1     txn-91-11  cust072
# 2    txn-343-25  cust089
# 3    txn-264-08  cust076
# 4    txn-342-07  cust030
# 5      txn-2-19  cust091
# 6    txn-275-06  cust062

5.2 Allocate products to transactions


Now, using a similar step as above, allocate transactions to products using the following code.
product2transaction <- buildPareto(products$SKU,transactions$transactionID,pareto = c(70,30))
names(product2transaction) <- c('transactionID', 'SKU')

#inspect the output
print(head(product2transaction))
#   transactionID   SKU
# 1    txn-182-30 sku10
# 2    txn-179-21 sku01
# 3    txn-179-10 sku10
# 4    txn-360-08 sku01
# 5     txn-23-09 sku01
# 6    txn-264-20 sku10

5.3 Final data


Finally, merge the customer and product allocations with the transactions data to build the final dataset, using the following code.
df1 <- merge(x = customer2transaction, y = product2transaction, by = "transactionID")

dfFinal <- merge(x = df1, y = transactions, by = "transactionID", all.x = TRUE)

#inspect the output
print(head(dfFinal))
#   transactionID customer   SKU dayNum mthNum
# 1      txn-1-01  cust076 sku03      1      1
# 2      txn-1-02  cust062 sku04      1      1
# 3      txn-1-03  cust087 sku07      1      1
# 4      txn-1-04  cust010 sku04      1      1
# 5      txn-1-05  cust039 sku01      1      1
# 6      txn-1-06  cust010 sku01      1      1
Thus, we have the final dataset with transactions, customers and products.

Interpret the results

The column names of the final data frame can be interpreted as follows.
    • Each row is a transaction and the data frame has all the transactions for a year i.e 365 days.
    • transactionID is the unique identifier for that transaction.
    • customer is the unique customer identifier. This is the customer who made that transaction.
    • SKU is the product that was bought in that transaction.
    • dayNum is the day number in the year. There would be 365 unique dayNum in the data frame.
    • mthNum is the month number. This ranges from 1 to 12 and represents January to December respectively.
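As an illustration of how the final dataset can be used, the product prices can be merged back in to compute, say, total spend per customer. This is a sketch continuing from the objects built above (dfFinal and products).

```r
# Join the product prices onto the final data and total up spend
# per customer (continuing from the objects built above).
dfRevenue <- merge(x = dfFinal, y = products, by = "SKU", all.x = TRUE)

spendPerCust <- aggregate(dfRevenue$Price,
                          by = list(customer = dfRevenue$customer),
                          FUN = sum)
names(spendPerCust) <- c("customer", "totalSpend")

# Under the pareto = c(80, 20) allocation, a small share of customers
# should account for a large share of the total spend.
head(spendPerCust[order(-spendPerCust$totalSpend), ])
```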

Summary & concluding remarks


In this article, we started by building customers, products and transactions. Later on, we also understood how to bring them all together into a final dataset. At the time of writing this article, the package is predominantly focused on building the basic dataset and there is room for improvement. If you are interested in contributing to this package, please find the details at contributions.