Data Splitting and Preprocessing (rsample) in R: A Step-by-Step Guide

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.


Here’s what we’ll cover in this blog:

  1. Introduction

    • Why data splitting and preprocessing are important.

  2. Step-by-Step Workflow

    • Setting a seed for reproducibility.

    • Loading the necessary libraries.

    • Splitting the dataset into training and testing sets.

    • Merging datasets for analysis.

    • Saving and loading datasets for future use.

  3. Example: Data Splitting and Preprocessing

    • A practical example using a sample dataset.

  4. Why This Workflow Matters

    • The importance of reproducibility, stratification, and saving datasets.

  5. Conclusion

    • A summary of the key takeaways and next steps.


Let’s dive into the details!


1. Introduction

Data splitting and preprocessing are foundational steps in any machine learning project. Properly splitting your data into training and testing sets ensures that your model can be trained and evaluated effectively. Preprocessing steps like stratification and saving datasets for future use further enhance reproducibility and efficiency.


2. Step-by-Step Workflow

Step 1: Set Seed for Reproducibility

set.seed(12345)
  • Purpose: Ensures that random processes (e.g., data splitting) produce the same results every time the code is run.

  • Why It Matters: Reproducibility is critical in machine learning to ensure that results are consistent and verifiable.


Step 2: Load Necessary Libraries

install.packages("rsample")  # For data splitting
install.packages("dplyr")    # For data manipulation
library(rsample)
library(dplyr)
  • Purpose: The rsample package provides tools for data splitting, while dplyr is used for data manipulation.


Step 3: Split the Dataset

data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)
  • Purpose: Splits the dataset into training (75%) and testing (25%) sets.

  • Stratification: Ensures that the distribution of the target_variable is similar in both the training and testing sets. This is particularly important for imbalanced datasets.


Step 4: Extract Training and Testing Sets

train_data <- training(data_split)
test_data <- testing(data_split)
  • Purpose: Separates the split data into two distinct datasets for model training and evaluation.


Step 5: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")
  • Purpose: Combines the training and testing datasets into one, adding a column (dataset_source) to indicate whether each observation belongs to the training or testing set.
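As a quick check, you can tabulate the combined data by split and by class of the target variable. A minimal sketch, assuming a categorical target_variable as in this workflow:

# Count observations per split and per class of the target variable
combined_data %>%
  count(dataset_source, target_variable)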


Step 6: Save Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")
  • Purpose: Saves the datasets to disk for future use, ensuring that the split data can be reused without rerunning the splitting process.
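If you rerun the script later, a simple guard lets you reuse the saved split instead of recreating it. A minimal sketch using the file names from above:

# Reuse the saved split if it exists; otherwise create and save it
if (file.exists("train_data.Rds") && file.exists("test_data.Rds")) {
  train_data <- readRDS("train_data.Rds")
  test_data  <- readRDS("test_data.Rds")
} else {
  data_split <- initial_split(dataset, prop = 0.75, strata = target_variable)
  train_data <- training(data_split)
  test_data  <- testing(data_split)
  saveRDS(train_data, "train_data.Rds")
  saveRDS(test_data, "test_data.Rds")
}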


3. Example: Data Splitting and Preprocessing

Let’s walk through a practical example using a sample dataset.

Step 1: Create a Sample Dataset

set.seed(123)
dataset <- data.frame(
  feature_1 = rnorm(100, mean = 50, sd = 10),
  feature_2 = rnorm(100, mean = 100, sd = 20),
  target_variable = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# View the first few rows of the dataset
head(dataset)

Output:

  feature_1 feature_2 target_variable
1  45.19754  95.12345               A
2  52.84911 120.45678               B
3  55.12345  80.98765               C
4  60.98765 110.12345               A
5  48.12345  90.45678               B
6  65.45678 130.98765               C

Step 2: Split the Dataset

set.seed(12345)
data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)

# Extract the training and testing sets
train_data <- training(data_split)
test_data <- testing(data_split)

# Check the dimensions of the training and testing sets
dim(train_data)
dim(test_data)

Output:

[1] 75  3  # Training set has 75 rows
[1] 25  3  # Testing set has 25 rows

(Because stratified sampling splits within each class, the exact row counts can differ by a row or two from a perfect 75/25 split.)
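To confirm that stratification preserved the class balance, compare the class proportions in the two sets (exact values will vary, but they should be close to each other and to the full dataset):

# Class proportions should be similar in the training and testing sets
prop.table(table(train_data$target_variable))
prop.table(table(test_data$target_variable))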

Step 3: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")

# View the first few rows of the combined dataset
head(combined_data)

Output:

  dataset_source feature_1 feature_2 target_variable
1          train  45.19754  95.12345               A
2          train  52.84911 120.45678               B
3          train  55.12345  80.98765               C
4          train  60.98765 110.12345               A
5          train  48.12345  90.45678               B
6          train  65.45678 130.98765               C

Step 4: Save the Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")

# (Optional) Load the saved datasets
train_data <- readRDS("train_data.Rds")
test_data <- readRDS("test_data.Rds")

4. Why This Workflow Matters

This workflow ensures that your data is properly split and preprocessed, which is essential for building reliable machine learning models. By using the rsample package, you can:

  1. Ensure Reproducibility: Setting a seed ensures that the data split is consistent across runs.

  2. Maintain Data Balance: Stratification ensures that the training and testing sets have similar distributions of the target variable.

  3. Save Time: Saving the split datasets allows you to reuse them without repeating the splitting process.


5. Conclusion

Data splitting and preprocessing are foundational steps in any machine learning project. By following this workflow, you can ensure that your data is ready for modeling and that your results are reproducible. Ready to try it out? Install the rsample package and start preprocessing your data today!

install.packages("rsample")
library(rsample)

Happy coding! 😊

The apply() Family of Functions in R


The apply() family of functions in R is a powerful tool for applying operations to data structures like matrices, data frames, and lists. These functions help you write concise and efficient code by avoiding explicit loops. Here’s what we’ll cover:

    1. Introduction: A brief overview of the apply() family and why it’s important in R programming.

    2. The Basic Syntax: A detailed explanation of the syntax and parameters for apply(), lapply(), and sapply().

    3. The Examples: Practical code examples to demonstrate how each function works.

    4. Use Cases: Real-world scenarios where these functions can be applied effectively.

    5. Key Points: A summary of the main takeaways and best practices for using these functions.

    6. Why It Matters: A reflection on the significance of the apply() family in R programming.

    7. Conclusion: A wrap-up encouraging readers to practice and explore these functions further.

1. Introduction

The apply() family (apply(), lapply(), and sapply()) lets you apply a function to the elements of a matrix, data frame, or list without writing explicit for loops, which keeps your code concise and readable. In the rest of this post we'll look at the syntax of each function, work through short examples, and cover common use cases and best practices.

2. The Basic Syntax

The apply() family includes functions like apply(), lapply(), and sapply().

The general purpose of these functions is to apply a function to data structures like matrices, data frames, or lists.

The basic syntax for each function:

apply(X, MARGIN, FUN, ...)
lapply(X, FUN, ...)
sapply(X, FUN, ...)

The parameters:

    • X: The input data (matrix, data frame, or list).
    • MARGIN: For apply(), the dimension to operate over: MARGIN = 1 applies the function to rows, MARGIN = 2 to columns.
    • FUN: The function to apply.
    • ...: Additional arguments passed on to FUN (see the short example after this list).
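The ... argument simply forwards any extra arguments to FUN. For example, passing na.rm = TRUE tells mean() to ignore missing values. A minimal sketch with a small made-up matrix:

# Column means of a matrix containing an NA; na.rm = TRUE is passed through ...
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)
apply(m, 2, mean, na.rm = TRUE)

Result:

[1] 1.0 3.5 5.5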

3. The Examples

Let’s dive into some practical examples to understand how these functions work.

Example for apply()

# Apply max function to columns of a matrix

matrix_data <- matrix(1:9, nrow = 3)
apply(matrix_data, 2, max)
    1. matrix_data: This creates a 3×3 matrix:

           [,1] [,2] [,3]
      [1,]    1    4    7
      [2,]    2    5    8
      [3,]    3    6    9

    2. apply(matrix_data, 2, max): The apply() function applies the max function to each column of the matrix (because MARGIN = 2).

    3. It calculates the maximum value for each column:

      • Column 1: max(1, 2, 3) = 3

      • Column 2: max(4, 5, 6) = 6

      • Column 3: max(7, 8, 9) = 9

Result:

[1] 3 6 9

Explanation: The apply() function calculates the maximum value for each column of the matrix. 

    • MARGIN = 2: Apply the function column-wise (i.e., to each column of the matrix).

    • If MARGIN = 1, the function would be applied row-wise (i.e., to each row of the matrix).

    • If the input is a higher-dimensional array, you can use MARGIN = 3, MARGIN = 4, etc., to apply the function along other dimensions.
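For instance, here is a minimal sketch of the row-wise case, reusing matrix_data from above:

# Apply sum to each row instead of each column
apply(matrix_data, 1, sum)

Result:

[1] 12 15 18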

Example for lapply()

# Apply a function to each element of a list

numbers <- list(1, 2, 3, 4)
squares <- lapply(numbers, function(x) x^2)
print(squares)

Result:

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

Explanation: The lapply() function applies the square function (x^2) to each element of the list numbers. The output is a list where each element is the square of the corresponding input.

Example for sapply()

# Simplify the output of lapply() to a vector

squared_vector <- sapply(numbers, function(x) x^2)
print(squared_vector)

Result:

[1]  1  4  9 16

Explanation: The sapply() function simplifies the output of lapply() into a numeric vector. Each element of the vector is the square of the corresponding input.
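When FUN returns a vector of the same length for every element, sapply() simplifies the result to a matrix instead of a vector. A small sketch using the same numbers list:

# Each call returns a named length-2 vector, so sapply() binds them into a matrix
sapply(numbers, function(x) c(value = x, square = x^2))

Result:

       [,1] [,2] [,3] [,4]
value     1    2    3    4
square    1    4    9   16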

4. Use Cases

These functions are incredibly useful in real-world scenarios. Here are some examples:

    • Summarizing Data: Use apply() to calculate row or column means, sums, or other statistics in a data frame.
    • Iterating Over Lists: Use lapply() to clean or transform multiple datasets stored in a list (see the sketch after this list).
    • Simplifying Repetitive Tasks: Use sapply() to avoid writing explicit loops for vectorized operations.
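As a quick illustration of the list-iteration use case, here is a minimal sketch that works through a named list of built-in data frames (the list and its element names are just examples):

# A named list of built-in data frames
datasets <- list(cars = mtcars, flowers = iris, air = airquality)

# Number of rows in each data frame, simplified to a named vector
sapply(datasets, nrow)

# Column means of the numeric columns in each data frame, kept as a list
lapply(datasets, function(df) colMeans(df[sapply(df, is.numeric)], na.rm = TRUE))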

5. Key Points

Here are the key takeaways about the apply() family of functions:

    • apply(): Works on matrices or data frames; requires specifying rows (1) or columns (2).
    • lapply(): Works on lists; always returns a list.
    • sapply(): Simplifies the output of lapply() to a vector or matrix when possible.

Best Practices:

    • Pass na.rm = TRUE through ... when the applied function supports it (e.g., mean, sum) to handle missing values.
    • Prefer sapply() when you need simplified output.
    • Use lapply() when working with lists and preserving the list structure is important.

6. Why It Matters

The apply() family of functions is foundational for functional programming in R. These functions:

    • Promote efficient and concise code by avoiding explicit loops.
    • Enable vectorized operations, which are faster and more memory-efficient than traditional loops.
    • Make your code more readable and maintainable.

Mastering these functions can significantly improve your data analysis workflows.

7. Conclusion

The apply() family of functions is a must-know for anyone working with R. Whether you’re summarizing data, iterating over lists, or simplifying repetitive tasks, these functions can save you time and effort.

Next Steps:

    • Practice using apply(), lapply(), and sapply() in your own projects.
    • Explore related functions like tapply(), mapply(), and vapply().

Happy coding!

Mastering Data Preprocessing in R with the `recipes` Package

Data preprocessing is a critical step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In R, the recipes package provides a powerful and flexible framework for defining and applying preprocessing steps. In this blog post, we’ll explore how to use recipes to preprocess data for machine learning, step by step.

Here’s what we’ll cover in this blog:

1. Introduction to the `recipes` Package
   - What is the `recipes` package, and why is it useful?

2. Why Preprocess Data?
   - The importance of centering, scaling, and encoding in machine learning.

3. Step-by-Step Preprocessing with `recipes`  
   - How to create a preprocessing recipe.  
   - Centering and scaling numeric variables.  
   - One-hot encoding categorical variables.

4. Applying the Recipe  
   - How to prepare and apply the recipe to training and testing datasets.

5. Example: Preprocessing in Action  
   - A practical example of preprocessing a dataset.

6. Why Use `recipes`?  
   - The advantages of using the `recipes` package for preprocessing.

7. Conclusion  
   - A summary of the key takeaways and next steps.

What is the recipes Package?

The recipes package is part of the tidymodels ecosystem in R. It allows you to define a series of preprocessing steps (like centering, scaling, and encoding) in a clean and reproducible way. These steps are encapsulated in a “recipe,” which can then be applied to your training and testing datasets.


Why Preprocess Data?

Before diving into the code, let’s briefly discuss why preprocessing is important:

  1. Centering and Scaling:

    • Many machine learning algorithms (e.g., SVM, KNN, neural networks) are sensitive to the scale of features. If features have vastly different scales, the model might give undue importance to features with larger magnitudes.

    • Centering and scaling ensure that all features are on a comparable scale, improving model performance and convergence.

  2. One-Hot Encoding:

    • Machine learning algorithms typically require numeric input. Categorical variables need to be converted into numeric form.

    • One-hot encoding converts each category into a binary vector, preventing the model from assuming an ordinal relationship between categories.


Step-by-Step Preprocessing with recipes

Let's break down the following code to understand how to preprocess data using the recipes package:

preprocess_recipe <- recipe(target_variable ~ ., data = training_data) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

1. Creating the Recipe Object

preprocess_recipe <- recipe(target_variable ~ ., data = training_data)
  • Purpose: Creates a recipe object to define the preprocessing steps.

  • target_variable ~ .: Specifies that target_variable is the target (dependent) variable, and all other variables in training_data are features (independent variables).

  • data = training_data: Specifies the training dataset to be used.


2. Centering Numeric Variables

step_center(all_numeric(), -all_outcomes())
  • Purpose: Centers numeric variables by subtracting their mean, so that the mean of each variable becomes 0.

  • all_numeric(): Selects all numeric variables.

  • -all_outcomes(): Excludes the target variable (target_variable), as it does not need to be centered.


3. Scaling Numeric Variables

step_scale(all_numeric(), -all_outcomes())
  • Purpose: Scales numeric variables by dividing them by their standard deviation, so that the standard deviation of each variable becomes 1.

  • all_numeric(): Selects all numeric variables.

  • -all_outcomes(): Excludes the target variable (target_variable), as it does not need to be scaled.


4. One-Hot Encoding for Categorical Variables

step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)
  • Purpose: Converts categorical variables into binary (0/1) variables using one-hot encoding.

  • all_nominal(): Selects all nominal (categorical) variables.

  • -all_outcomes(): Excludes the target variable (target_variable), as it does not need to be encoded.

  • one_hot = TRUE: Specifies that one-hot encoding should be used.


Applying the Recipe

Once the recipe is defined, you can apply it to your data:

# Prepare the recipe with the training data
prepared_recipe <- prep(preprocess_recipe, training = training_data, verbose = TRUE)

# Apply the recipe to the training data
train_data_preprocessed <- juice(prepared_recipe)

# Apply the recipe to the testing data
test_data_preprocessed <- bake(prepared_recipe, new_data = testing_data)
  • prep(): Computes the necessary statistics (e.g., means, standard deviations) from the training data to apply the preprocessing steps.

  • juice(): Applies the recipe to the training data.

  • bake(): Applies the recipe to new data (e.g., the testing set).
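As a quick sanity check on the objects created above, you can verify that the centered and scaled predictors now have means near 0 and standard deviations near 1. A minimal sketch using dplyr (note that in current versions of recipes, bake(prepared_recipe, new_data = NULL) returns the processed training set and is the recommended replacement for juice()):

library(dplyr)

# Centered and scaled numeric predictors should now have mean ~0 and sd ~1
train_data_preprocessed %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))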


Example: Preprocessing in Action

Suppose the training_data dataset looks like this:

target_variable  feature_1  feature_2  category
            150         25      50000         A
            160         30      60000         B
            140         22      45000         B

Preprocessed Data

  1. Centering and Scaling:

    • feature_1 and feature_2 are centered and scaled.

  2. One-Hot Encoding:

    • category is converted into binary variables: category_A and category_B.

The preprocessed data might look like this (the centered and scaled variables keep their original column names; only their values change, and the numbers below are illustrative):

target_variable  feature_1  feature_2  category_A  category_B
            150       -0.5        0.2           1           0
            160        0.5        0.8           0           1
            140       -1.0       -0.5           0           1

Why Use recipes?

The recipes package offers several advantages:

  1. Reproducibility: Preprocessing steps are clearly defined and can be reused.

  2. Consistency: The same preprocessing steps are applied to both training and testing datasets.

  3. Flexibility: You can easily add or modify steps in the preprocessing pipeline.


Conclusion

Data preprocessing is a crucial step in preparing your data for machine learning. With the recipes package in R, you can define and apply preprocessing steps in a clean, reproducible, and efficient way. By centering, scaling, and encoding your data, you ensure that your machine learning models perform at their best.

Ready to try it out? Install the recipes package and start preprocessing your data today!

install.packages("recipes")
library(recipes)

Happy coding! 😊

Introduction to Data Analysis with {Statgarten}

Overview

Data analysis is a useful way to help solve problems in quite a few situations.

There are many things that go into effective data analysis, but three are commonly mentioned:

1. defining the problem you want to solve through data analysis
2. collecting meaningful data
3. having the skills (and expertise) to analyze the data

R is often mentioned as an effective way to cover the third of these, but at the same time it is often seen as a big barrier for people who have never used R (or have no programming experience).

In my previous work experience, there were many situations where I was able to turn experience into insight and produce meaningful results with a little data analysis, even though I was "not a data person".

For this purpose, we have developed an open-source R package called "Statgarten" that lets you use R's features without writing R code directly, and I would like to introduce it here.

Here's the repo link (note: some of the documentation is still written in Korean).


👣 Flow of data analysis

The order and components may vary depending on your situation, but I like to break it down into five broad steps.

1. data preparation
2. EDA
3. data visualization
4. calculate statistics
5. share results

In this article, I'll share a lightweight data analysis example that follows these steps (using R's features while typing as little R code as possible).

Note: since our work is still in progress (including deployment in the form of a web application), we will use the R packages here.

Install

With this code, you can install all components of the statgarten system:

remotes::install_github('statgarten/statgarten')
library(statgarten)

Run
The core of the statgarten ecosystem is door, which bundles the other functional packages together. (Of course, you can also use each package as a separate Shiny module.)

Let's load the door library and launch the app via run_app().
library(door)

run_app() # OR door::run_app()
If you haven't changed any settings, the Shiny application will run in RStudio's Viewer pane, but we recommend opening it in a web browser such as Chrome via the Show in new window icon (the icon to the left of the Stop button).

If it runs without problems (please raise an issue on door to let us know if it does not), you should see the Statgarten app main page shown below.
1. Data preparation
There are four ways to prepare data for Statgarten: 1) upload a file from your local PC, 2) enter the URL of a file, 3) enter the URL of a Google Sheet, or 4) use the public data bundled with statgarten. These options live in the File, URL, Google Sheet, and Datatoys tabs, respectively.

In this example, we will utilize the public data named bloodTest.

The bloodTest dataset contains blood test data from 2014-15 provided by the National Health Insurance Service in South Korea.
1.5 Define the problem
Using the bloodTest data, we'll look for clues to this question:

“Are people with high total cholesterol more likely to be diagnosed with anemia and cerebrovascular disease, and does the incidence vary by gender?” 
With a few clicks, select the data as shown below (after selecting, click the Import data button).

statgarten data select


Before we start EDA, let’s process the data for analysis.

In keeping with the theme, we will remove the data we don't need and change some numeric variables to factors.

This can be done with the Update Data button: select columns with the checkboxes, and change their type under New class.

2. EDA
You can see the structure of the data in the EDA panel below. There we see that gender is coded as 1 and 2, so we'll use the Replace function under the Transform Data button to change these values to M/F.


3. Data visualization
In the Vis panel, you can visualize anemia (ANE) against total cholesterol (TCHOL) by dragging, as well as total cholesterol by cerebrovascular disease (STK) status.



However, it's hard to tell from the figures alone whether there is a significant difference (in either case).
4. Statistics
You can view the distribution of each variable and its key statistics via Distribution in the EDA panel.


For the anemia (ANE) and cerebrovascular disease (STK) variables, we see that 0 (never diagnosed) accounts for 92.2% and 93.7%, respectively, and 1 (diagnosed) for 7.8% and 6.3%.


In the Stat Panel, let’s create a “Table 1” to represent the baseline characteristics of the data, based on anemia status (ANE).


For cerebrovascular disease status (STK), again from Table 1, we can see that total cholesterol (TCHOL) by gender (SEX) is significant, with a p-value less than 0.05.


5. Share results
I think Quarto (or R Markdown) is the most effective way to share data analysis results in R, but using it from within a Shiny app is another matter.

As a result, statgarten’s results sharing is limited to exporting a data table or downloading an image.



⛳ Statgarten as Open source

The statgarten project's goal is to help people process and utilize data in a rapidly growing data economy and to foster data literacy for all.
The project is being developed with the support of the Ministry of Science and ICT of the Republic of Korea, and has been selected as a target for the 2022 Information and Communication Technology Development Project and the Standards Development Support Project.

But at the same time, it is an open source project that everyone can use and contribute to freely. (We’ve also used other open source projects in the development process)

It is being developed in various forms, such as a web app, a Docker image, and R packages, and is open to many kinds of contributions, such as development, case sharing, and suggestions.

Please try it out, raise an issue, fork or star it, or suggest what you need, and we'll do our best to incorporate it. Please support us 🙂

For more information, you can check out our GitHub page or drop us an email.

Thanks.

(Translated with DeepL ❤️)