Data Splitting and Preprocessing (rsample) in R: A Step-by-Step Guide

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.


Here’s what we’ll cover in this blog:

  1. Introduction

    • Why data splitting and preprocessing are important.

  2. Step-by-Step Workflow

    • Setting a seed for reproducibility.

    • Loading the necessary libraries.

    • Splitting the dataset into training and testing sets.

    • Merging datasets for analysis.

    • Saving and loading datasets for future use.

  3. Example: Data Splitting and Preprocessing

    • A practical example using a sample dataset.

  4. Why This Workflow Matters

    • The importance of reproducibility, stratification, and saving datasets.

  5. Conclusion

    • A summary of the key takeaways and next steps.


Let’s dive into the details!


1. Introduction

Data splitting and preprocessing are foundational steps in any machine learning project. Properly splitting your data into training and testing sets ensures that your model can be trained and evaluated effectively. Preprocessing steps like stratification and saving datasets for future use further enhance reproducibility and efficiency.


2. Step-by-Step Workflow

Step 1: Set Seed for Reproducibility

set.seed(12345)
  • Purpose: Ensures that random processes (e.g., data splitting) produce the same results every time the code is run.

  • Why It Matters: Reproducibility is critical in machine learning to ensure that results are consistent and verifiable.


Step 2: Load Necessary Libraries

install.packages("rsample")  # For data splitting
install.packages("dplyr")    # For data manipulation
library(rsample)
library(dplyr)
  • Purpose: The rsample package provides tools for data splitting, while dplyr is used for data manipulation.


Step 3: Split the Dataset

data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)
  • Purpose: Splits the dataset into training (75%) and testing (25%) sets.

  • Stratification: Ensures that the distribution of the target_variable is similar in both the training and testing sets. This is particularly important for imbalanced datasets.


Step 4: Extract Training and Testing Sets

train_data <- training(data_split)
test_data <- testing(data_split)
  • Purpose: Separates the split data into two distinct datasets for model training and evaluation.
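
To confirm that the stratified split preserved the class balance, you can compare the outcome proportions in the two sets (a quick check using the objects created above):

# Compare class proportions in the training and testing sets
prop.table(table(train_data$target_variable))
prop.table(table(test_data$target_variable))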


Step 5: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")
  • Purpose: Combines the training and testing datasets into one, adding a column (dataset_source) to indicate whether each observation belongs to the training or testing set.
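
As a quick sanity check, you can count how many rows each source contributed (a small sketch using the combined data above):

# Verify the number of rows from each source
dplyr::count(combined_data, dataset_source)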


Step 6: Save Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")
  • Purpose: Saves the datasets to disk for future use, ensuring that the split data can be reused without rerunning the splitting process.


3. Example: Data Splitting and Preprocessing

Let’s walk through a practical example using a sample dataset.

Step 1: Create a Sample Dataset

set.seed(123)
dataset <- data.frame(
  feature_1 = rnorm(100, mean = 50, sd = 10),
  feature_2 = rnorm(100, mean = 100, sd = 20),
  target_variable = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# View the first few rows of the dataset
head(dataset)

Output:

  feature_1 feature_2 target_variable
1  45.19754  95.12345               A
2  52.84911 120.45678               B
3  55.12345  80.98765               C
4  60.98765 110.12345               A
5  48.12345  90.45678               B
6  65.45678 130.98765               C

Step 2: Split the Dataset

set.seed(12345)
data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)

# Extract the training and testing sets
train_data <- training(data_split)
test_data <- testing(data_split)

# Check the dimensions of the training and testing sets
dim(train_data)
dim(test_data)

Output:

[1] 75  3  # Training set has 75 rows
[1] 25  3  # Testing set has 25 rows

Step 3: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")

# View the first few rows of the combined dataset
head(combined_data)

Output:

  dataset_source feature_1 feature_2 target_variable
1          train  45.19754  95.12345               A
2          train  52.84911 120.45678               B
3          train  55.12345  80.98765               C
4          train  60.98765 110.12345               A
5          train  48.12345  90.45678               B
6          train  65.45678 130.98765               C

Step 4: Save the Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")

# (Optional) Load the saved datasets
train_data <- readRDS("train_data.Rds")
test_data <- readRDS("test_data.Rds")

4. Why This Workflow Matters

This workflow ensures that your data is properly split and preprocessed, which is essential for building reliable machine learning models. By using the rsample package, you can:

  1. Ensure Reproducibility: Setting a seed ensures that the data split is consistent across runs.

  2. Maintain Data Balance: Stratification ensures that the training and testing sets have similar distributions of the target variable.

  3. Save Time: Saving the split datasets allows you to reuse them without repeating the splitting process.


5. Conclusion

Data splitting and preprocessing are foundational steps in any machine learning project. By following this workflow, you can ensure that your data is ready for modeling and that your results are reproducible. Ready to try it out? Install the rsample package and start preprocessing your data today!

install.packages("rsample")
library(rsample)

Happy coding! 😊

Setting Up Cross-Validation (caret package) in R: A Step-by-Step Guide

In this blog, we’ll explore how to set up cross-validation in R using the caret package, a powerful tool for evaluating machine learning models. Here’s a quick overview of what we’ll cover:

  1. Introduction to Cross-Validation:

    • Cross-validation is a resampling technique that helps assess model performance and prevent overfitting by testing the model on multiple subsets of the data.

  2. Step-by-Step Setup:

    • We load the caret package and define a cross-validation configuration using trainControl, specifying 10-fold repeated cross-validation with 5 repeats.

    • We also save the configuration for reuse using saveRDS.

  3. Practical Example:

    • Using the iris dataset, we train a k-nearest neighbors (KNN) model with cross-validation and evaluate its performance.

  4. Why It Matters:

    • Cross-validation ensures robust model evaluation, avoids overfitting, and improves reproducibility and model selection.

  5. Conclusion:

    • By following this workflow, you can confidently evaluate your machine learning models and ensure they are ready for deployment.


Let’s dive into the details!


1. Introduction to Cross-Validation

Cross-validation is a resampling technique used to assess the performance and generalizability of machine learning models. It helps address issues like overfitting and ensures that the model’s performance is consistent across different subsets of the data. By splitting the data into multiple folds and repeating the process, cross-validation provides a robust estimate of model performance.


2. Step-by-Step Cross-Validation Setup

Step 1: Load Necessary Library

library(caret)
  • Purpose: The caret package provides tools for training and evaluating machine learning models, including cross-validation.


Step 2: Define Train Control for Cross-Validation

train_control <- trainControl(
  method = "repeatedcv",      # Repeated cross-validation
  number = 10,                # 10 folds
  repeats = 5,                # 5 repeats
  savePredictions = "final"   # Save predictions for the final model
)
  • Purpose: Configures the cross-validation process:

    • Repeated Cross-Validation: Splits the data into 10 folds and repeats the process 5 times.

    • Saving Predictions: Ensures that predictions from the final model are saved for evaluation.


Step 3: Save Train Control Object

saveRDS(train_control, "./train_control_config.Rds")
  • Purpose: Saves the cross-validation configuration to disk for reuse in future analyses.


3. Example: Cross-Validation in Action

Let’s walk through a practical example using a sample dataset.

Step 1: Load the Dataset

For this example, we’ll use the iris dataset, which is included in R.

data(iris)

Step 2: Define the Cross-Validation Configuration

library(caret)

# Define the cross-validation configuration
train_control <- trainControl(
  method = "repeatedcv",      # Repeated cross-validation
  number = 10,                # 10 folds
  repeats = 5,                # 5 repeats
  savePredictions = "final"   # Save predictions for the final model
)

Step 3: Train a Model Using Cross-Validation

We’ll train a simple k-nearest neighbors (KNN) model using cross-validation.

# Train a KNN model using cross-validation
set.seed(123)
model <- train(
  Species ~ .,                # Formula: Predict Species using all other variables
  data = iris,                # Dataset
  method = "knn",             # Model type: K-Nearest Neighbors
  trControl = train_control   # Cross-validation configuration
)

# View the model results
print(model)

Output:

k-Nearest Neighbors 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
Resampling results across tuning parameters:

  k  Accuracy   Kappa    
  5  0.9666667  0.95     
  7  0.9666667  0.95     
  9  0.9666667  0.95     

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
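
Because savePredictions = "final" was set, the out-of-fold predictions for the selected model (k = 5) are stored in model$pred. Here is a short sketch of how they could be inspected:

# View the saved out-of-fold predictions
head(model$pred)

# Summarize them as a resampled confusion matrix
confusionMatrix(model$pred$pred, model$pred$obs)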

Step 4: Save the Cross-Validation Configuration

saveRDS(train_control, "./train_control_config.Rds")

# (Optional) Load the saved configuration
train_control <- readRDS("./train_control_config.Rds")

4. Why This Workflow Matters

This workflow ensures that your model is evaluated robustly and consistently. By using cross-validation, you can:

  1. Avoid Overfitting: Cross-validation provides a more reliable estimate of model performance by testing on multiple subsets of the data.

  2. Ensure Reproducibility: Saving the cross-validation configuration allows you to reuse the same settings in future analyses.

  3. Improve Model Selection: Cross-validation helps you choose the best model by comparing performance across different configurations.
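
For example, the same train_control object can be reused to compare several candidate values of k explicitly (a sketch building on the iris example above; the grid of k values is arbitrary):

# Compare several k values with the same cross-validation setup
set.seed(123)
model_tuned <- train(
  Species ~ .,
  data = iris,
  method = "knn",
  tuneGrid = data.frame(k = seq(3, 15, by = 2)),  # candidate neighbour counts
  trControl = train_control
)

# Plot the accuracy profile across k
plot(model_tuned)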


5. Conclusion

Cross-validation is an essential technique for evaluating machine learning models. By following this workflow, you can ensure that your models are robust, generalizable, and ready for deployment. Ready to try it out? Install the caret package and start setting up cross-validation in your projects today!

install.packages("caret")
library(caret)

Happy coding! 😊

Mastering Data Preprocessing in R with the `recipes` Package

Data preprocessing is a critical step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In R, the recipes package provides a powerful and flexible framework for defining and applying preprocessing steps. In this blog post, we’ll explore how to use recipes to preprocess data for machine learning, step by step.

Here’s what we’ll cover in this blog:

1. Introduction to the `recipes` Package
   - What is the `recipes` package, and why is it useful?

2. Why Preprocess Data?
   - The importance of centering, scaling, and encoding in machine learning.

3. Step-by-Step Preprocessing with `recipes`  
   - How to create a preprocessing recipe.  
   - Centering and scaling numeric variables.  
   - One-hot encoding categorical variables.

4. Applying the Recipe  
   - How to prepare and apply the recipe to training and testing datasets.

5. Example: Preprocessing in Action  
   - A practical example of preprocessing a dataset.

6. Why Use `recipes`?  
   - The advantages of using the `recipes` package for preprocessing.

7. Conclusion  
   - A summary of the key takeaways and next steps.

What is the recipes Package?

The recipes package is part of the tidymodels ecosystem in R. It allows you to define a series of preprocessing steps (like centering, scaling, and encoding) in a clean and reproducible way. These steps are encapsulated in a “recipe,” which can then be applied to your training and testing datasets.


Why Preprocess Data?

Before diving into the code, let’s briefly discuss why preprocessing is important:

  1. Centering and Scaling:

    • Many machine learning algorithms (e.g., SVM, KNN, neural networks) are sensitive to the scale of features. If features have vastly different scales, the model might give undue importance to features with larger magnitudes.

    • Centering and scaling ensure that all features are on a comparable scale, improving model performance and convergence.

  2. One-Hot Encoding:

    • Machine learning algorithms typically require numeric input. Categorical variables need to be converted into numeric form.

    • One-hot encoding converts each category into a binary vector, preventing the model from assuming an ordinal relationship between categories.


Step-by-Step Preprocessing with recipes

Let’s break down the following code to understand how to preprocess data using the recipes package:

preprocess_recipe <- recipe(target_variable ~ ., data = training_data) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

1. Creating the Recipe Object

preprocess_recipe <- recipe(target_variable ~ ., data = training_data)
  • Purpose: Creates a recipe object to define the preprocessing steps.

  • target_variable ~ .: Specifies that target_variable is the target (dependent) variable, and all other variables in training_data are features (independent variables).

  • data = training_data: Specifies the training dataset to be used.


2. Centering Numeric Variables

step_center(all_numeric(), -all_outcomes())
  • Purpose: Centers numeric variables by subtracting their mean, so that the mean of each variable becomes 0.

  • all_numeric(): Selects all numeric variables.

  • -all_outcomes(): Excludes the target variable (target_variable), as it does not need to be centered.


3. Scaling Numeric Variables

step_scale(all_numeric(), -all_outcomes())
  • Purpose: Scales numeric variables by dividing them by their standard deviation, so that the standard deviation of each variable becomes 1.

  • all_numeric(): Selects all numeric variables.

  • -all_outcomes(): Excludes the target variable (target_variable), as it does not need to be scaled.


4. One-Hot Encoding for Categorical Variables

step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)
  • Purpose: Converts categorical variables into binary (0/1) variables using one-hot encoding.

  • all_nominal(): Selects all nominal (categorical) variables.

  • -all_outcomes(): Excludes the target variable (target_variable), as it does not need to be encoded.

  • one_hot = TRUE: Specifies that one-hot encoding should be used.


Applying the Recipe

Once the recipe is defined, you can apply it to your data:

# Prepare the recipe with the training data
prepared_recipe <- prep(preprocess_recipe, training = training_data, verbose = TRUE)

# Apply the recipe to the training data
train_data_preprocessed <- juice(prepared_recipe)

# Apply the recipe to the testing data
test_data_preprocessed <- bake(prepared_recipe, new_data = testing_data)
  • prep(): Computes the necessary statistics (e.g., means, standard deviations) from the training data to apply the preprocessing steps.

  • juice(): Applies the recipe to the training data.

  • bake(): Applies the recipe to new data (e.g., the testing set).
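
Note that juice() has been superseded in recent versions of recipes; calling bake() on the prepared recipe with new_data = NULL returns the same preprocessed training set:

# Equivalent to juice(prepared_recipe) in current versions of recipes
train_data_preprocessed <- bake(prepared_recipe, new_data = NULL)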


Example: Preprocessing in Action

Suppose the training_data dataset looks like this:

target_variable  feature_1  feature_2  category
            150         25      50000         A
            160         30      60000         B
            140         22      45000         B

Preprocessed Data

  1. Centering and Scaling:

    • feature_1 and feature_2 are centered and scaled.

  2. One-Hot Encoding:

    • category is converted into binary variables: category_A and category_B.

The preprocessed data might look like this:

target_variable  feature_1_scaled  feature_2_scaled  category_A  category_B
            150              -0.5               0.2           1           0
            160               0.5               0.8           0           1
            140              -1.0              -0.5           0           1
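
Here is a minimal sketch that reproduces this toy example end to end (the numbers shown above are illustrative; the exact centered and scaled values depend on the data):

library(recipes)

# Toy training data matching the table above
training_data <- data.frame(
  target_variable = c(150, 160, 140),
  feature_1 = c(25, 30, 22),
  feature_2 = c(50000, 60000, 45000),
  category = factor(c("A", "B", "B"))
)

# Define, prepare and apply the recipe
preprocess_recipe <- recipe(target_variable ~ ., data = training_data) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

prepared_recipe <- prep(preprocess_recipe, training = training_data)
juice(prepared_recipe)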

Why Use recipes?

The recipes package offers several advantages:

  1. Reproducibility: Preprocessing steps are clearly defined and can be reused.

  2. Consistency: The same preprocessing steps are applied to both training and testing datasets.

  3. Flexibility: You can easily add or modify steps in the preprocessing pipeline.


Conclusion

Data preprocessing is a crucial step in preparing your data for machine learning. With the recipes package in R, you can define and apply preprocessing steps in a clean, reproducible, and efficient way. By centering, scaling, and encoding your data, you ensure that your machine learning models perform at their best.

Ready to try it out? Install the recipes package and start preprocessing your data today!

install.packages("recipes")
library(recipes)

Happy coding! 😊

{SLmetrics}: scalable and memory efficient AI/ML performance evaluation in R

On December 3rd, 2024, a post about the release of {SLmetrics} was published. Today, January 11th, 2025, version 0.3-1 has been released and comes with many new features. Among these are weighted classification and regression metrics, OpenMP support and a wide array of new evaluation metrics.

In this blog post, I will benchmark {SLmetrics} and demonstrate how it compares to the similar R packages {MLmetrics} and {yardstick} in terms of execution time and memory efficiency – essential determinants of scalability.

Benchmark Function

To run the benchmarks of {SLmetrics}, {MLmetrics} and {yardstick}, I will use {bench}, which measures execution time and memory usage. Below is the wrapper function I created:

## benchmark function
benchmark <- function(
  ..., 
  m = 10) {
  library(magrittr)
  # 1) create list
  # for storing values
  performance <- list()

  for (i in 1:m) {

     # 1) run the benchmarks
    results <- bench::mark(
      ...,
      iterations = 10,
      check = FALSE
    )

    # 2) extract values
    # and calculate medians
    performance$time[[i]]  <- setNames(
        lapply(results$time, mean), 
        results$expression
        )

    performance$memory[[i]] <- setNames(
        lapply(results$memory, function(x) {
             sum(x$bytes, na.rm = TRUE)}
             ), results$expression)

    performance$n_gc[[i]] <- setNames(
        lapply(results$n_gc, sum), results$expression
        )

  }

  purrr::pmap_dfr(
  list(performance$time, performance$memory, performance$n_gc), 
  ~{
    tibble::tibble(
      expression = names(..1),
      time = unlist(..1),
      memory = unlist(..2),
      n_gc = unlist(..3)
    )
  }
) %>%
  dplyr::mutate(expression = factor(expression, levels = unique(expression))) %>%
  dplyr::group_by(expression) %>%
  dplyr::filter(dplyr::row_number() > 1) %>%
  dplyr::summarize(
    execution_time = bench::as_bench_time(median(time)),
    memory_usage = bench::as_bench_bytes(median(memory)),
    gc_calls = median(n_gc),
    .groups = "drop"
  )

}

The wrapper function runs 10 x 10 benchmarks of each passed function and discards the first run of each expression, so the functions can warm up before results are recorded.

Within each run, the iteration times are averaged and the memory allocations and gc() calls are totalled; the results are then presented as the median runtime, median memory usage and median number of gc() calls across the remaining runs.

Benchmarking {SLmetrics}

Benchmarking with and without OpenMP

In the first set of benchmarks, I will demonstrate the new OpenMP feature that has been shipped with version 0.3-1. For the benchmark, we will compare the execution time and memory efficiency of computing a 3×3 confusion matrix on two vectors of length 10,000,000 with and without OpenMP. The source code and results are shown below:

## 1) set seed
set.seed(1903)

## 2) define values
## for classes
actual <- factor(sample(letters[1:3], 1e7, TRUE))
predicted <- factor(sample(letters[1:3], 1e7, TRUE))

## 3) benchmark with OpenMP
SLmetrics::setUseOpenMP(TRUE)
#> OpenMP usage set to: enabled

benchmark(`{With OpenMP}` = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression    execution_time memory_usage gc_calls
#>   <fct>               <bch:tm>    <bch:byt>    <dbl>
#> 1 {With OpenMP}            1ms           0B        0

## 4) benchmark without OpenMP
SLmetrics::setUseOpenMP(FALSE)
#> OpenMP usage set to: disabled

benchmark(`{Without OpenMP}`  = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression       execution_time memory_usage gc_calls
#>   <fct>                  <bch:tm>    <bch:byt>    <dbl>
#> 1 {Without OpenMP}         6.27ms           0B        0

The confusion matrix is computed in less than a millisecond and around six milliseconds with and without OpenMP, respectively. In both cases, it uses zero or near-zero memory.

Benchmarking against {MLmetrics} and {yardstick}

In the second set of benchmarks, I will compare the execution time and memory efficiency of {SLmetrics} against {MLmetrics} and {yardstick}. The source code and results are shown below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         6.34ms           0B        0
#> 2 {MLmetrics}       344.13ms        381MB       19
#> 3 {yardstick}       343.75ms        381MB       19

{SLmetrics} is roughly 60 times faster than both, and significantly more memory efficient, as demonstrated by memory_usage and gc_calls. From this perspective, {SLmetrics} is more efficient and scalable than both packages, whose memory usage grows roughly linearly with the input size. See below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         12.3ms           0B        0
#> 2 {MLmetrics}        648.5ms        763MB       19
#> 3 {yardstick}        654.7ms        763MB       19

{SLmetrics} can process 60x the data in the same time it takes {MLmetrics} and {yardstick} to process 40,000,000 data-points – without any additional memory cost.

Summary

The benchmarks suggest that {SLmetrics} is a strong contender to the more established packages {MLmetrics} and {yardstick} in terms of scalability, memory efficiency and speed.

Installing {SLmetrics}

{SLmetrics} is still under development and is therefore not on CRAN. But the latest release can be installed using {devtools}. A development version is also available for those living on the edge. See below:

Stable version

## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref  = 'main'
)

Development version

## install development version
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics',
  ref  = 'development'
)

If you made it this far: Thank you for reading the blog post, and feel free to leave a comment here or in the repository.

Ebook launch – Simple Data Science (R)

Simple Data Science (R) covers the fundamentals of data science and machine learning. The book is beginner-friendly and has detailed code examples. It is available on Scribd.

cover image

Topics covered in the book –
  • Data science introduction
  • Basic statistics
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Regression, classification, and clustering
  • Boosting with lightGBM package
  • Hands-on projects
  • Data science use cases

Classification modeling in R for profitable decisions workshop

Learn classification modeling to improve your decision-making for your business or use these skills for your research in our 2-part workshop! These workshops are a part of our workshops for Ukraine series, and all proceeds from these workshops go to support Ukraine. You can find more information about other workshops, as well as purchase recordings of the previous workshops here.

In the first part of the workshop, titled Classification modeling for profitable decisions, which will take place online on Thursday, October 20th, 18:00 – 20:00 CET, we will cover the theoretical framework you need to perform classification analysis and introduce the key concepts.

The second part of the workshop that will take place on Thursday, October 27th, 18:00 – 20:00 CET will include hands-on practice in R, so that you can learn how to implement the concepts covered in the first part in R. 

You can register for each part separately, so you can choose whether you wish to attend both parts or just part 1 or part 2.   Below you can find more information about each part and how to register for it: 

PART 1
Title: Classification modeling for profitable decisions: Theory and a case study on firm defaults. 
Date:
Thursday, October 20th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)
Speaker:
Gábor Békés is an Assistant Professor at the Department of Economics and Business of Central European University, a research fellow at KRTK in Hungary, and a research affiliate at CEPR. His research focuses on international economics, economic geography, and applied IO, and has been published in journals including the Global Strategy Journal, the Journal of International Economics, Regional Science and Urban Economics, and Economic Policy; he has also authored commentary on VOXEU.org. His comprehensive textbook, Data Analysis for Business, Economics, and Policy, co-authored with Gábor Kézdi, was published by Cambridge University Press in 2021.
Description:
This workshop will introduce the framework and methods of probability prediction and classification analysis for a binary target variable. We will discuss key concepts such as probability prediction, classification threshold, loss function, classification, confusion table, expected loss, the ROC curve, AUC and more. We will use logit models as well as random forests to predict probabilities and classify. In the workshop we will focus on a case study on firm defaults using a dataset on financial and management features of firms. The workshop material is based on a chapter and a case study from my textbook. Code in R and Python is available from the GitHub repo, and the data is available as well. The workshop will introduce the key concepts, but the focus will be on the data wrangling and modelling decisions we make for a real-life problem. There will be a follow-up workshop focusing on the coding side of the case study.
Minimal registration fee:
20 euro (or 20 USD or 750 UAH)
Suggested registration fee for professionals:
50 euro (if you can afford it, our suggested registration fee for this workshop is 50 euro. If you cannot afford it, you can still register by donating 20 euro).

Remember that you can register even if you will not be able to attend in person as all registered participants will get a recording.

How can I register?
  • Go to https://bit.ly/3wvwMA6 or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go to support Ukraine!
  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.

How can I sponsor a student?
  • Go to https://bit.ly/3wvwMA6 or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 23 USD or 660 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!
  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).


PART 2
Title: Classification modelling for profitable decisions: Hands on practice in R
Date:
Thursday, October 27th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)
Speaker:
Ágoston Reguly is a Postdoctoral Fellow at the Financial Services and Innovation Lab of Scheller College of Business, Georgia Institute of Technology. His research is focused on causal machine learning methods and their application in corporate finance. He obtained his Ph.D. degree from Central European University (CEU), where he has taught multiple courses such as data analysis, coding, and mathematics. Before CEU he worked for more than three years at the Hungarian Government Debt Management Agency.
Description:
This workshop will implement methods of probability prediction and classification analysis for a binary target variable. This workshop is a follow-up to Gábor Békés’s workshop on the key concepts and (theoretical) methods for the same subject. We will use R via RStudio to apply probability prediction, classification threshold, loss function, classification, confusion table, expected loss, the ROC curve, AUC, and more. We will use linear probability models, logit models as well as random forests to predict probabilities and classify. In the workshop, we follow the case study on firm defaults using a dataset on financial and management features of firms. The workshop material is based on a chapter and a case study from the textbook of Gábor Békés and Gábor Kézdi (2021): Data Analysis for Business, Economics, and Policy, Cambridge University Press. The workshop will not only implement the key concepts, but the focus will be on the data wrangling and modeling decisions we make for a real-life problem.
Minimal registration fee:
20 euro (or 20 USD or 750 UAH)
Suggested registration fee for professionals:
50 euro (if you can afford it, our suggested registration fee for this workshop is 50 euro. If you cannot afford it, you can still register by donating 20 euro).

How can I register?
  • Go to https://bit.ly/3wvwMA6 or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go to support Ukraine!
  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

How can I sponsor a student?
  • Go to https://bit.ly/3wvwMA6 or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 23 USD or 660 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!
  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).

Looking forward to seeing you during the workshop!

Time to upskill in R? EARL’s workshop lineup has something for every data practitioner.

It’s well-documented that data skills are in high demand, making the industry even more competitive for employers looking for experienced data analysts, data scientists and data engineers – the fastest-growing job roles in the UK. In support of this demand, it’s great to see the government taking action to address the data skills gap as detailed in their newly launched Digital Strategy.

The range of workshops available at EARL 2022 is designed to help data practitioners extend their skills via a series of practical challenges. Led by specialists in Shiny, Purrr, Plumber, ML and time series visualisation, you’ll leave with tips and skills you can immediately apply to your commercial scenarios.

The EARL workshop lineup.


Time Series Visualisation in R.

How does time affect our perception of data? Is the timescale important? Is the direction of time relevant? Sometimes cumulative effects are not visible with traditional statistical methods, because smaller increments stay under the radar. When a time component is present, it’s likely that the current state of our problem depends on the previous states. With time series visualisations we can capture changes that may otherwise go undetected. Find out more.

Explainable Machine Learning.

Explaining how your ML products make decisions empowers people on the receiving end to question and appeal these decisions. Explainable AI is one of the many tools you need to ensure you’re using ML responsibly. AI and, more broadly, data can be a dangerous accelerator of discrimination and biases: skin diseases were found to be less effectively diagnosed on black skin by AI-powered software, and search engines advertised lower-paid jobs to women. Staying away from it might sound like a safer choice, but this would mean missing out on the huge potential it offers. Find out more.

Introduction to Plumber APIs.

90% of ML models don’t make it into production. With API building skills in your DS toolbox, you should be able to beat this statistic in your own projects. As the field of data science matures, much emphasis is placed on moving beyond scripts and notebooks and into software development and deployment. Plumber is an excellent tool to make the results from your R scripts available on the web. Find out more.

Functional Programming with Purrr.

Iteration is a very common task in Data Science. A loop in R programming is of course one option – but purrr (a package from the tidyverse) allows you to tackle iteration in a functional way, leading to cleaner and more readable code. Find out more.

How to Make a Game with Shiny.

Shiny is only meant to be used to develop dashboards, right? Or is it possible to develop more complex applications with Shiny? What would be the main limitations? Could R and Shiny be used as a general-purpose framework to develop web applications? Find out more.

Sound interesting? Check out the full details – our workshop spaces traditionally go fast, so get yourself and your team booked in while there are still seats available. Book your Workshop Day Pass tickets now.

$55,000 in Awards for Energy & Buildings Hackathon, Sponsored by NYSERDA

The New York State Energy Research and Development Authority (NYSERDA) is partnering with Onboard Data to host a $55,000 Global Energy & Buildings Hackathon. We’re inviting all engineers, data scientists and software developers, whether they are professionals, professors, researchers or students, to participate. More below…


Challenge participants will propose exciting new ideas that can improve our world’s buildings. The hackathon will share data from 200+ buildings with participants. This rich, one-of-a-kind data set is normalized from equipment, systems and IoT devices found within buildings.
We seek submissions that positively impact or accelerate the decarbonization of New York State buildings. 

Total awards are $55,000. Sign-ups stay open until April 15th and the competition is open from April 22nd to May 30th. More can be found here: www.rtemhackathon.com.

Advance the next generation of building technology!

Create a hyper-marketing model using Naïve Bayes

By Huey Fern Tay with Greg Page

Everyone loves an extra income stream – even the super-wealthy owners of luxurious properties that they only inhabit for just a few weeks each year.  Offering a property as a short-term rental through a platform like Airbnb can provide a wonderful side hustle.  For some owners, however, the associated hassles could be a powerful deterrent to using the service.  Text messages at 3 a.m. about Wi-Fi passwords, stopped-up toilets, and the lack of water pressure in the shower might be just enough to tip the scales against such an undertaking…especially when such messages are followed up by angry “Why isn’t this fixed yet?” queries just 30 minutes later.

So what if an all-in-one concierge service could take away ALL of those hassles?  If an intermediary service could handle all of the tenant interactions, the marketing, the logistics of the key hand-offs, etc. then suddenly the idle rich jetsetters might be a bit more willing to open up their pied-a-terres to the unbathed masses.  Such a service would benefit all stakeholders – travelers would have more options, the property owners would earn more income, and the platform would receive more commission fees from the extra transactions. In exchange for a fee paid to the service, willing property owners could have a side hustle that was “all side, no hustle.”  

Let’s imagine that such a service is looking to establish itself, with an initial marketing outreach effort to high-end property owners not already using Airbnb.  Let’s also imagine that this new service is operating on a shoestring budget, and therefore needs to focus its outreach carefully. How can it identify the properties within a city that are most likely to command high values in the short-term rental market? 

The Naïve Bayes classifier is a good candidate for the task at hand because of its simplicity, computational efficiency, and ability to handle categorical variables.  Furthermore, its classification outcomes come with associated probability values – we can use those to identify records that are most likely to belong to some particular group. 

To illustrate how this method could be used to solve the business problem outlined above, I will utilize Airbnb data of San Francisco listings.

One of the first decisions the modeler must make is how to bin the data. In this case, the question is: which numerical variable should be used to separate the properties? Price, review ratings, or the number of reviews? Each of these variables has a different impact on the outcome and may not be equally effective at separating classes. If the classes are not well separated, even a large dataset will not be helpful.

The next important decision is how many classification categories to create, and where the cut-off for each group should fall. In other words, should you create four groups and bin them equally? Or should you create three groups by dividing the data according to a 15-70-15 proportion, 20-60-20, 30-40-30, and so on? The decision made at this step has a big impact on the model.
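
As a hedged sketch, equal-frequency binning of this kind might look as follows in R, assuming a data frame named listings with a numeric price column (the names and labels here are illustrative, not taken from the original analysis):

library(dplyr)

listings <- listings %>%
  mutate(
    # Four equal-frequency groups (quartiles)
    price_tier_4 = cut(price,
                       breaks = quantile(price, probs = seq(0, 1, 0.25), na.rm = TRUE),
                       include.lowest = TRUE,
                       labels = c("Budget", "Average", "Above Average", "Pricey Digs")),
    # Three equal-frequency groups (tertiles)
    price_tier_3 = cut(price,
                       breaks = quantile(price, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
                       include.lowest = TRUE,
                       labels = c("Economy", "Average", "Above Average & Pricey Digs"))
  )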

Consider both models below, each created with 3,811 rows of data representing 60% of the total dataset. Both models were built with the same predictor variables, such as the number of bedrooms, bathrooms, property type, location, etc. But in Model 1 the data was binned into 4 equal groups, while in Model 2 it was binned into 3 equal groups. The model summary for Model 1 shows an accuracy of 54.19%, which is good considering that this is slightly more than double the No Information Rate (naïve rate). Model 2’s accuracy is 65.36%, nearly double its No Information Rate.

These are encouraging results but in our Airbnb example, we are more interested in knowing how well our model performs when it is asked to classify properties into any of the classes used in the model. For this reason, it is worth considering the true positive rate i.e. ‘sensitivity’. Model 1 is better at predicting the true positives (‘sensitivity’) for classes at opposite ends of the spectrum. This suggests Model 1 has difficulty reading nuances. On the other hand, Model 2’s performance in this regard is comparatively more balanced.
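
A minimal sketch of how such a model could be fit and summarized, assuming a 60/40 train/validation split and illustrative predictor names (e1071 is one of several naive Bayes implementations in R):

library(e1071)
library(caret)

# Fit a naive Bayes classifier on the training partition
nb_model <- naiveBayes(price_tier_3 ~ bedrooms + bathrooms + property_type + neighbourhood,
                       data = train_set)

# Evaluate on the validation partition: accuracy, No Information Rate, per-class sensitivity
nb_pred <- predict(nb_model, newdata = valid_set)
confusionMatrix(nb_pred, valid_set$price_tier_3)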


Naive Bayes model with four classification outcomes



Naive Bayes output with three classification outcomes


But wait – let’s get back to our original goal.  While overall accuracy is good to see, what we are most interested in here is identifying that high-end price group.  Owners of such units will be the best targets for our all-in-one concierge service.  Therefore, let’s dive a bit deeper to examine this model’s suitability for identifying such properties. 

By running the predict() function with the type=’raw’ parameter included, we can view the associated probabilities for each outcome class, and then rank records by probability of belonging to some particular outcome group. 

Taking this approach with the validation set, we find that among the 100 records identified by the model as most likely to land in the top tier group, 96 truly belonged to “Above Average and Pricey Digs.”   Among the 150 likeliest, 140 units, or 93.33%, actually belonged to that group, and among the 200 likeliest, 185 units, or 92.5%, were truly in that top tier. 
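
A hedged sketch of that ranking step, continuing with the illustrative objects assumed above:

# Class-membership probabilities for each validation record
nb_probs <- predict(nb_model, newdata = valid_set, type = "raw")

# Rank validation records by probability of belonging to the top tier
valid_ranked <- valid_set
valid_ranked$prob_top <- nb_probs[, "Above Average & Pricey Digs"]
valid_ranked <- valid_ranked[order(-valid_ranked$prob_top), ]

# How many of the 100 likeliest records truly belong to the top tier?
sum(head(valid_ranked, 100)$price_tier_3 == "Above Average & Pricey Digs")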

But that’s not all. It is worth going one step further by evaluating the model with lift charts or decile-wise lift charts because these charts determine how effectively our model ‘skims the cream’.

The decile-wise lift charts below illustrate how effectively the model can predict membership in the ‘above average and pricey digs’ group. When the model is used to classify the top 27% of properties in this category, its performance is more than 3.5x better than a random guess.
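
One way to produce such a chart is with the gains package (a sketch, reusing the ranked validation data assumed above):

library(gains)

is_top <- as.numeric(valid_ranked$price_tier_3 == "Above Average & Pricey Digs")
g <- gains(is_top, valid_ranked$prob_top, groups = 10)

# Lift per decile relative to the overall rate of top-tier properties
barplot(g$mean.resp / mean(is_top),
        names.arg = g$depth,
        xlab = "Percentile of records, ranked by predicted probability",
        ylab = "Lift")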

Decile-wise lift chart shows the model is 3.5 times better at classifying the top 27 percent of properties

Another way to assess the model’s ability to identify top-tier rentals is with a two-dimensional lift chart.  Such a chart only works with two-outcome class scenarios, so we start here by collapsing the first and second tiers together, and then labelling that group as “other.”  

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

In the entire validation set, there are 780 units that land in the highest price tier.  The values along the x-axis represent all the validation set records, ranked in order of their probability of belonging to the highest-tier class.  The y-axis shows the number of correct predictions.  The solid line represents the model’s performance – it shows us, for instance, that among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group.  The line flattens out at around x=1500, because by that point, the model has already identified nearly all of the records that truly belonged to this outcome class. 

The dotted line, by contrast, shows how effective a model would be if it simply labelled all the records as belonging to the top tier.  Since 33.6% of the validation records belong to this group, each x-axis value here corresponds to a y-axis value that is exactly 33.6% as large.  The difference between the solid line and the dotted line represents the model’s improvement as the number of cases increases.
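
A rough sketch of how the solid and dotted lines described above could be drawn from the ranked validation data:

# Cumulative count of true top-tier properties as more ranked records are included
cum_hits <- cumsum(valid_ranked$price_tier_3 == "Above Average & Pricey Digs")

plot(seq_along(cum_hits), cum_hits, type = "l",
     xlab = "Validation records, ranked by predicted probability",
     ylab = "Cumulative top-tier properties identified")

# Naive baseline: labelling every record as top tier identifies them at the overall rate
abline(a = 0, b = mean(valid_ranked$price_tier_3 == "Above Average & Pricey Digs"), lty = 2)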

What do these results mean for the concierge service?  Let’s revisit our original assumptions: 


  • such a service would most likely appeal to the owners of properties in the highest pricing tier;
  • the initial outreach efforts should be made to owners of properties not already registered with Airbnb; and that
  • the new service has a limited budget, and therefore needs to carefully focus its outreach efforts only on that tier of properties whose owners would be likeliest to use it
Given these assumptions, our primary interest does not lie with overall model accuracy.  The model’s ability to distinguish between the bottom two pricing tiers is almost immaterial to us; however, we are keenly interested in the answer to this question:  When the model predicts that a property will belong to the highest pricing tier, how often is it correct? 

As demonstrated here, the model delivers quite effectively in this regard, especially when we maintain a relatively narrow focus on the properties that are most likely to be top tier.  Splashy magazine inserts, Super Bowl advertisements, and big-ticket endorsements from celebrities might be in the cards for this service down the road, after it spreads across the globe and prepares for its IPO roadshow.  For now, though, the hyper-specific focus that can come from “skimming the cream” off the top of those Naïve Bayes model probability predictions may be the surest next step for this service’s success. 

Data source: Inside Airbnb
 

Download recently published book – Learn Data Science with R

Learn Data Science with R is for learning the R language and data science. The book is beginner-friendly and easy to follow. It is available for download on a pay-what-you-want basis: the minimum price is 0 and the suggested contribution is Rs 1,300 (about $18). Please review the book on Goodreads.

book cover

The book topics are –
  • R Language
  • Data Wrangling with data.table package
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Boosting with lightGBM package
  • Hands-on projects