R – R-posts.com

Building fully reproducible data science environments for R and Python with ease using Nix, rix, and rixpress workshop

Join our workshop on Building fully reproducible data science environments for R and Python with ease using Nix, rix, and rixpress, which is a part of our workshops for Ukraine series!

Here’s some more info:

Title: Building fully reproducible data science environments for R and Python with ease using Nix, rix, and rixpress

Date: Friday, June 20th, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone)

Speaker: Bruno Rodrigues. Bruno is currently employed as the head of the statistics department at the Ministry of Higher education and Research in Luxembourg. Before joining the public sector, Bruno worked as a data science consultant in one of the big four accounting companies, and before that as a teaching and research assistant. Bruno discovered tools such as Git and software carpentry techniques while working on his PhD. These tools and techniques served him well for the past decade, and Bruno has been consistently sharing his knowledge on his blog during that time.

Description: Reproducibility is a critical aspect of modern research, ensuring that results can be consistently replicated and verified by others. In this presentation, Bruno Rodrigues will introduce participants to Nix, a package manager that focuses on reproducible builds. Unlike other solutions, Nix ensures that all dependencies—R and/or Python itself, R and/or Python packages, and system libraries—are precisely versioned and isolated. While Docker provides containerized environments, Nix complements it by guaranteeing deterministic builds within those containers, eliminating issues related to hidden dependencies and environment drift. Participants will also be introduced to {rix} and {rixpress} which are R packages designed to simplify Nix usage.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration

How can I register?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go directly to support Ukraine.

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.

How can I sponsor a student?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 20 USD or 800 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).

You can also find more information about this workshop series, a schedule of our future workshops as well as a list of our past workshops which you can get the recordings & materials here.

Looking forward to seeing you during the workshop!

Using LLMs with ellmer workshop by Hadley Wickham

Join our workshop on Using LLMs with ellmer, which is a part of our workshops for Ukraine series!

Here’s some more info:

Title: Using LLMs with ellmer

Date: Friday, June 13th, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone)

Speaker: Hadley Wickham is Chief Scientist at Posit PBC, winner of the 2019 COPSS award, and a member of the R Foundation. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (like the tidyverse, which includes ggplot2, dplyr, and tidyr)and principled software development (e.g. roxygen2, testthat, and pkgdown). He is also a writer, educator, and speaker promoting the use of R for data science. Learn more on his website, <http://hadley.nz>.

Description: Join us for an engaging, hands-on hackathon workshop where you’ll learn to use large language models (LLMs) from R with the ellmer (https://ellmer.tidyverse.org) package. In this 2-hour session, we’ll combine theory with practical exercises to help you create AI-driven solutions—no extensive preparation needed!

## What you’ll learn:

– A quick intro to LLMs: what they’re good at and where they struggle

– How to use ellmer with different model providers (OpenAI, Anthropic, Google Gemini, and others)

– Effective prompt design strategies and practical applications for your work

– Function calling: how to let LLMs use R functions for tasks they can’t handle well

– Extracting structured data from text, images, and video using LLMs

## What you’ll need:

– A laptop with R installed

– The development version of ellmer (`pak::pak(“tidyverse/ellmer”))`

– An account with either Claude (cheap) or Google Gemini (free).

Follow the instructions at <github.com/hadley/workshop-llm-hackathon> to get setup.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration

How can I register?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go directly to support Ukraine.

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

How can I sponsor a student?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 20 USD or 800 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

You can also find more information about this workshop series, a schedule of our future workshops as well as a list of our past workshops which you can get the recordings & materials here.

Looking forward to seeing you during the workshop!

surveydown: An Open-Source, Markdown-Based Platform for Interactive and Reproducible Surveys workshop

Join our workshop on surveydown: An Open-Source, Markdown-Based Platform for Interactive and Reproducible Surveys, which is a part of our workshops for Ukraine series!

Here’s some more info:

Title:surveydown: An Open-Source, Markdown-Based Platform for Interactive and Reproducible Surveys

Date: Thursday, May 29th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: John Paul Helveston, John Paul (JP) is an Assistant Professor at George Washington University in the Department of Engineering Management and Systems Engineering. His research focuses on understanding how consumer preferences, market dynamics, and policy affect the emergence and adoption of low-carbon technologies, such as electric vehicles and renewable energy technologies. He also studies the critical relationship between the US and China in developing and mass producing these technologies. He has expertise in discrete choice modeling, conjoint analysis, exploratory data analysis, interview-based research methods, and the R programming language. He speaks fluent Mandarin Chinese and has conducted extensive fieldwork in China. He is also an accomplished violinist and swing dancer. John holds a Ph.D. and M.S. in Engineering and Public Policy from Carnegie Mellon University and a B.S. in Engineering Science and Mechanics (ESM) from Virginia Tech.

Description: This workshop introduces the surveydown R package and survey platform. With surveydown, researchers can create reproducible, interactive surveys using markdown and R code chunks, leveraging the Quarto publication system and R shiny web application framework. While most survey platforms rely on graphical interfaces or spreadsheets to define survey content, surveydown uses plain text, enabling version control and collaboration via tools like GitHub. The package renders surveys as interactive shiny web applications, allowing for complex features like conditional skip logic, dynamic question display, and complex randomization. The package supports a diverse set of question types and formatting options and users can leverage shiny’s powerful reactive programming model to create a wide variety of customized interactive features. As an open-source tool, surveydown provides researchers full control over their survey implementation, including the survey application as well as where and how the resulting response data are stored. Workflows are entirely reproducible and integrate seamlessly with existing workflows for data collection and analysis in R. At this workshop, you’ll not only learn how to build and deploy your own interactive surveys using surveydown, but also how you can join the growing community of contributors to the project.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration

How can I register?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go directly to support Ukraine.

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

How can I sponsor a student?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 20 USD or 800 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

You can also find more information about this workshop series, a schedule of our future workshops as well as a list of our past workshops which you can get the recordings & materials here.

Looking forward to seeing you during the workshop!

Shinyscholar – a template for producing reproducible analytic apps in R workshop

Join our workshop on Shinyscholar – a template for producing reproducible analytic apps in R, which is a part of our workshops for Ukraine series!

Here’s some more info:

Title: Shinyscholar – a template for producing reproducible analytic apps in R

Date: Thursday, June 5th, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone)

Speaker: Simon Smart is a software developer in the Department of Population Health Sciences at the University of Leicester, UK. He has a background in plant and agricultural science and began developing Shiny apps in 2018, originally for forecasting yield in tomato and potato crops. He developed the shinyscholar package for creating reproducible apps and has applied it to create Disagapp for epidemiological modelling and refactor MetaInsight for evidence synthesis. He strives to create flexible, robust and reproducible apps using modern workflows that break down barriers for performing complex analyses.

Description: Shiny is an increasingly popular method for researchers to develop apps but they are typically not reproducible and a lack of training in software development results in substandard coding practices that make apps hard to maintain. The shinyscholar package addresses these problems by providing a template for producing apps that enable complex reproducible analyses, without having to learn best practices from scratch. In the workshop you will learn how to create a new application and the steps in developing shinyscholar modules, including prototyping, creating functions, checking for valid inputs, generating outputs, enabling reproducibility and automated testing.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration

How can I register?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go directly to support Ukraine.

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

How can I sponsor a student?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 20 USD or 800 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

You can also find more information about this workshop series, a schedule of our future workshops as well as a list of our past workshops which you can get the recordings & materials here.

Looking forward to seeing you during the workshop!

Assessing rating scale dimensionality through bifactor model-based indices in R workshop

Join our workshop on Assessing rating scale dimensionality through bifactor model-based indices in R, which is a part of our workshops for Ukraine series!

Here’s some more info:

Title: Assessing rating scale dimensionality through bifactor model-based indices in R

Date: Thursday, May 15th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Tony Roberson, PhD, LP, LSSP is a child psychologist and assistant professor in the doctoral Health Service Psychology program and the Specialist in School Psychology program at the University of Houston-Clear Lake. He teaches graduate courses on research methods, quantitative data analysis, psychological measurement, and applied assessment and intervention for psychologists in training. His primary research interests concern the refinement of positively oriented mental and behavioral health screeners for youth in the school context.

Description: In this workshop, we will explore how to determine when it is justifiable to interpret behavior rating scale scores from (1) an overall general factor, (2) group factors (i.e., subscales), or (3) a combination of both through an array of model-based statistical indices derived from bifactor latent measurement models (e.g., omega hierarchical, relative parameter bias, percent uncontaminated correlations). The focus will be on practical demonstration and practice deriving and interpreting such indices in R using the lavaan and BifactorIndicesCalculator packages.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration

How can I register?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go directly to support Ukraine.

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

How can I sponsor a student?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 20 USD or 800 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

You can also find more information about this workshop series, a schedule of our future workshops as well as a list of our past workshops which you can get the recordings & materials here.

Looking forward to seeing you during the workshop!

Data Splitting and Preprocessing (rsample) in R: A Step-by-Step Guide

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.

Here’s what we’ll cover in this blog:

Introduction
- Why data splitting and preprocessing are important.
Step-by-Step Workflow
- Setting a seed for reproducibility.
- Loading the necessary libraries.
- Splitting the dataset into training and testing sets.
- Merging datasets for analysis.
- Saving and loading datasets for future use.
Example: Data Splitting and Preprocessing
- A practical example using a sample dataset.
Why This Workflow Matters
- The importance of reproducibility, stratification, and saving datasets.
Conclusion
- A summary of the key takeaways and next steps.

Let’s dive into the details!

1. Introduction

Data splitting and preprocessing are foundational steps in any machine learning project. Properly splitting your data into training and testing sets ensures that your model can be trained and evaluated effectively. Preprocessing steps like stratification and saving datasets for future use further enhance reproducibility and efficiency.

2. Step-by-Step Workflow

Step 1: Set Seed for Reproducibility

set.seed(12345)

Purpose: Ensures that random processes (e.g., data splitting) produce the same results every time the code is run.
Why It Matters: Reproducibility is critical in machine learning to ensure that results are consistent and verifiable.

Step 2: Load Necessary Libraries

install.packages("rsample")  # For data splitting
install.packages("dplyr")    # For data manipulation
library(rsample)
library(dplyr)

Purpose: The rsample package provides tools for data splitting, while dplyr is used for data manipulation.

Step 3: Split the Dataset

data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)

Purpose: Splits the dataset into training (75%) and testing (25%) sets.
Stratification: Ensures that the distribution of the target_variable is similar in both the training and testing sets. This is particularly important for imbalanced datasets.

Step 4: Extract Training and Testing Sets

train_data <- training(data_split)
test_data <- testing(data_split)

Purpose: Separates the split data into two distinct datasets for model training and evaluation.

Step 5: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")

Purpose: Combines the training and testing datasets into one, adding a column (dataset_source) to indicate whether each observation belongs to the training or testing set.

Step 6: Save Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")

Purpose: Saves the datasets to disk for future use, ensuring that the split data can be reused without rerunning the splitting process.

3. Example: Data Splitting and Preprocessing

Let’s walk through a practical example using a sample dataset.

Step 1: Create a Sample Dataset

set.seed(123)
dataset <- data.frame(
  feature_1 = rnorm(100, mean = 50, sd = 10),
  feature_2 = rnorm(100, mean = 100, sd = 20),
  target_variable = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# View the first few rows of the dataset
head(dataset)

Output:

  feature_1 feature_2 target_variable
1  45.19754  95.12345               A
2  52.84911 120.45678               B
3  55.12345  80.98765               C
4  60.98765 110.12345               A
5  48.12345  90.45678               B
6  65.45678 130.98765               C

Step 2: Split the Dataset

set.seed(12345)
data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)

# Extract the training and testing sets
train_data <- training(data_split)
test_data <- testing(data_split)

# Check the dimensions of the training and testing sets
dim(train_data)
dim(test_data)

Output:

[1] 75  3  # Training set has 75 rows
[1] 25  3  # Testing set has 25 rows

Step 3: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")

# View the first few rows of the combined dataset
head(combined_data)

Output:

  dataset_source feature_1 feature_2 target_variable
1          train  45.19754  95.12345               A
2          train  52.84911 120.45678               B
3          train  55.12345  80.98765               C
4          train  60.98765 110.12345               A
5          train  48.12345  90.45678               B
6          train  65.45678 130.98765               C

Step 4: Save the Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")

# (Optional) Load the saved datasets
train_data <- readRDS("train_data.Rds")
test_data <- readRDS("test_data.Rds")

4. Why This Workflow Matters

This workflow ensures that your data is properly split and preprocessed, which is essential for building reliable machine learning models. By using the rsample package, you can:

Ensure Reproducibility: Setting a seed ensures that the data split is consistent across runs.
Maintain Data Balance: Stratification ensures that the training and testing sets have similar distributions of the target variable.
Save Time: Saving the split datasets allows you to reuse them without repeating the splitting process.

5. Conclusion

Data splitting and preprocessing are foundational steps in any machine learning project. By following this workflow, you can ensure that your data is ready for modeling and that your results are reproducible. Ready to try it out? Install the rsample package and start preprocessing your data today!

install.packages("rsample")
library(rsample)

Happy coding! 😊

Setting Up Cross-Validation (caret package) in R: A Step-by-Step Guide

In this blog, we explored how to set up cross-validation in R using the caret package, a powerful tool for evaluating machine learning models. Here’s a quick recap of what we covered:

Introduction to Cross-Validation:
- Cross-validation is a resampling technique that helps assess model performance and prevent overfitting by testing the model on multiple subsets of the data.
Step-by-Step Setup:
- We loaded the caret package and defined a cross-validation configuration using trainControl, specifying 10-fold repeated cross-validation with 5 repeats.
- We also saved the configuration for reuse using saveRDS.
Practical Example:
- Using the iris dataset, we trained a k-nearest neighbors (KNN) model with cross-validation and evaluated its performance.
Why It Matters:
- Cross-validation ensures robust model evaluation, avoids overfitting, and improves reproducibility and model selection.
Conclusion:
- By following this workflow, you can confidently evaluate your machine learning models and ensure they are ready for deployment.

Let’s dive into the details!

1. Introduction to Cross-Validation

Cross-validation is a resampling technique used to assess the performance and generalizability of machine learning models. It helps address issues like overfitting and ensures that the model’s performance is consistent across different subsets of the data. By splitting the data into multiple folds and repeating the process, cross-validation provides a robust estimate of model performance.

2. Step-by-Step Cross-Validation Setup

Step 1: Load Necessary Library

library(caret)

Purpose: The caret package provides tools for training and evaluating machine learning models, including cross-validation.

Step 2: Define Train Control for Cross-Validation

train_control <- trainControl(
  method = "repeatedcv",      # Repeated cross-validation
  number = 10,                # 10 folds
  repeats = 5,                # 5 repeats
  savePredictions = "final"   # Save predictions for the final model
)

Purpose: Configures the cross-validation process:
- Repeated Cross-Validation: Splits the data into 10 folds and repeats the process 5 times.
- Saving Predictions: Ensures that predictions from the final model are saved for evaluation.

Step 3: Save Train Control Object

saveRDS(train_control, "./train_control_config.Rds")

Purpose: Saves the cross-validation configuration to disk for reuse in future analyses.

3. Example: Cross-Validation in Action

Let’s walk through a practical example using a sample dataset.

Step 1: Load the Dataset

For this example, we’ll use the iris dataset, which is included in R.

data(iris)

Step 2: Define the Cross-Validation Configuration

library(caret)

# Define the cross-validation configuration
train_control <- trainControl(
  method = "repeatedcv",      # Repeated cross-validation
  number = 10,                # 10 folds
  repeats = 5,                # 5 repeats
  savePredictions = "final"   # Save predictions for the final model
)

Step 3: Train a Model Using Cross-Validation

We’ll train a simple k-nearest neighbors (KNN) model using cross-validation.

# Train a KNN model using cross-validation
set.seed(123)
model <- train(
  Species ~ .,                # Formula: Predict Species using all other variables
  data = iris,                # Dataset
  method = "knn",             # Model type: K-Nearest Neighbors
  trControl = train_control   # Cross-validation configuration
)

# View the model results
print(model)

Output:

k-Nearest Neighbors 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
Resampling results across tuning parameters:

  k  Accuracy   Kappa    
  5  0.9666667  0.95     
  7  0.9666667  0.95     
  9  0.9666667  0.95     

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.

Step 4: Save the Cross-Validation Configuration

saveRDS(train_control, "./train_control_config.Rds")

# (Optional) Load the saved configuration
train_control <- readRDS("./train_control_config.Rds")

4. Why This Workflow Matters

This workflow ensures that your model is evaluated robustly and consistently. By using cross-validation, you can:

Avoid Overfitting: Cross-validation provides a more reliable estimate of model performance by testing on multiple subsets of the data.
Ensure Reproducibility: Saving the cross-validation configuration allows you to reuse the same settings in future analyses.
Improve Model Selection: Cross-validation helps you choose the best model by comparing performance across different configurations.

5. Conclusion

Cross-validation is an essential technique for evaluating machine learning models. By following this workflow, you can ensure that your models are robust, generalizable, and ready for deployment. Ready to try it out? Install the caret package and start setting up cross-validation in your projects today!

install.packages("caret")
library(caret)

Happy coding! 😊

The apply() Family of Functions in R

The apply() family of functions in R is a powerful tool for applying operations to data structures like matrices, data frames, and lists. These functions help you write concise and efficient code by avoiding explicit loops. Here’s what we’ll cover:

1. Introduction: A brief overview of the apply() family and why it’s important in R programming.
2. The Basic Syntax: A detailed explanation of the syntax and parameters for apply(), lapply(), and sapply().
3. The Examples: Practical code examples to demonstrate how each function works.
4. The Case of Using: Real-world scenarios where these functions can be applied effectively.
5. Key Points: A summary of the main takeaways and best practices for using these functions.
6. The Meaning: A reflection on the significance of the apply() family in R programming.
7. Conclusion: A wrap-up encouraging readers to practice and explore these functions further.

1. Introduction

A brief overview of the apply() family and why it’s important in R programming.

2. The Basic Syntax

The apply() family includes functions like apply(), lapply(), and sapply().

The general purpose of these functions is to apply a function to data structures like matrices, data frames, or lists.

The basic syntax for each function:

apply(X, MARGIN, FUN, ...)
lapply(X, FUN, ...)
sapply(X, FUN, ...)

The parameters:

- X: The input data (matrix, data frame, or list).
- MARGIN: For apply(), specifies rows (1) or columns (2). MARGIN = 1: Apply the function to rows. MARGIN = 2: Apply the function to columns.
- FUN: The function to apply.
- …: Additional arguments for the function.

3. The Examples

Let’s dive into some practical examples to understand how these functions work.

Example for apply()

# Apply max function to columns of a matrix

matrix_data <- matrix(1:9, nrow = 3)
apply(matrix_data, 2, max)

1. matrix_data: this creates a 3×3 matrix.
```
[,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
```
2. apply(matrix_data, 2, max): the apply() function is used to apply the max function to each column of the matrix (because MARGIN = 2).
3. It calculates the maximum value for each column:
  - Column 1: max(1, 2, 3) = 3
  - Column 2: max(4, 5, 6) = 6
  - Column 3: max(7, 8, 9) = 9

Result:

[1] 3 6 9

Explanation: The apply() function calculates the maximum value for each column of the matrix.

- MARGIN = 2: Apply the function column-wise (i.e., to each column of the matrix).
- If MARGIN = 1, the function would be applied row-wise (i.e., to each row of the matrix).
- If the input is a higher-dimensional array, you can use MARGIN = 3, MARGIN = 4, etc., to apply the function along other dimensions.

Example for lapply()

# Apply a function to each element of a list

numbers <- list(1, 2, 3, 4)
squares <- lapply(numbers, function(x) x^2)
print(squares)

Result:

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

Explanation: The lapply() function applies the square function (x^2) to each element of the list numbers. The output is a list where each element is the square of the corresponding input.

Example for sapply()

# Simplify the output of lapply() to a vector

squared_vector <- sapply(numbers, function(x) x^2)
print(squared_vector)

Result:

[1]  1  4  9 16

Explanation: The sapply() function simplifies the output of lapply() into a numeric vector. Each element of the vector is the square of the corresponding input.

4. The Case of Using

These functions are incredibly useful in real-world scenarios. Here are some examples:

- Summarizing Data: Use apply() to calculate row or column means, sums, or other statistics in a data frame.
- Iterating Over Lists: Use lapply() to clean or transform multiple datasets stored in a list.
- Simplifying Repetitive Tasks: Use sapply() to avoid writing explicit loops for vectorized operations.

5. Key Points

Here are the key takeaways about the apply() family of functions:

- apply(): Works on matrices or data frames; requires specifying rows (1) or columns (2).
- lapply(): Works on lists; always returns a list.
- sapply(): Simplifies the output of lapply() to a vector or matrix when possible.

Best Practices:

- Use na.rm = TRUE to handle missing values.
- Prefer sapply() when you need simplified output.
- Use lapply() when working with lists and preserving the list structure is important.

6. The Meaning

The apply() family of functions is foundational for functional programming in R. These functions:

- Promote efficient and concise code by avoiding explicit loops.
- Enable vectorized operations, which are faster and more memory-efficient than traditional loops.
- Make your code more readable and maintainable.

Mastering these functions can significantly improve your data analysis workflows.

7. Conclusion

The apply() family of functions is a must-know for anyone working with R. Whether you’re summarizing data, iterating over lists, or simplifying repetitive tasks, these functions can save you time and effort.

Next Steps:

- Practice using apply(), lapply(), and sapply() in your own projects.
- Explore related functions like tapply(), mapply(), and vapply().

Happy coding!

Mastering Data Preprocessing in R with the `recipes` Package

Data preprocessing is a critical step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In R, the recipes package provides a powerful and flexible framework for defining and applying preprocessing steps. In this blog post, we’ll explore how to use recipes to preprocess data for machine learning, step by step.

Here’s what we’ll cover in this blog:

1. Introduction to the `recipes` Package
   - What is the `recipes` package, and why is it useful?

2. Why Preprocess Data?
   - The importance of centering, scaling, and encoding in machine learning.

3. Step-by-Step Preprocessing with `recipes`  
   - How to create a preprocessing recipe.  
   - Centering and scaling numeric variables.  
   - One-hot encoding categorical variables.

4. Applying the Recipe  
   - How to prepare and apply the recipe to training and testing datasets.

5. Example: Preprocessing in Action  
   - A practical example of preprocessing a dataset.

6. Why Use `recipes`?  
   - The advantages of using the `recipes` package for preprocessing.

7. Conclusion  
   - A summary of the key takeaways and next steps.

What is the `recipes` Package?

The recipes package is part of the tidymodels ecosystem in R. It allows you to define a series of preprocessing steps (like centering, scaling, and encoding) in a clean and reproducible way. These steps are encapsulated in a “recipe,” which can then be applied to your training and testing datasets.

Why Preprocess Data?

Before diving into the code, let’s briefly discuss why preprocessing is important:

Centering and Scaling:
- Many machine learning algorithms (e.g., SVM, KNN, neural networks) are sensitive to the scale of features. If features have vastly different scales, the model might give undue importance to features with larger magnitudes.
- Centering and scaling ensure that all features are on a comparable scale, improving model performance and convergence.
One-Hot Encoding:
- Machine learning algorithms typically require numeric input. Categorical variables need to be converted into numeric form.
- One-hot encoding converts each category into a binary vector, preventing the model from assuming an ordinal relationship between categories.

Step-by-Step Preprocessing with `recipes`

Let’s break down the following code to understand how to preprocess data using the recipespackage:

preprocess_recipe <- recipe(target_variable ~ ., data = training_data) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

1. Creating the Recipe Object

preprocess_recipe <- recipe(target_variable ~ ., data = training_data)

Purpose: Creates a recipe object to define the preprocessing steps.
target_variable ~ .: Specifies that target_variable is the target (dependent) variable, and all other variables in training_data are features (independent variables).
data = training_data: Specifies the training dataset to be used.

2. Centering Numeric Variables

step_center(all_numeric(), -all_outcomes())

Purpose: Centers numeric variables by subtracting their mean, so that the mean of each variable becomes 0.
all_numeric(): Selects all numeric variables.
-all_outcomes(): Excludes the target variable (target_variable), as it does not need to be centered.

3. Scaling Numeric Variables

step_scale(all_numeric(), -all_outcomes())

Purpose: Scales numeric variables by dividing them by their standard deviation, so that the standard deviation of each variable becomes 1.
all_numeric(): Selects all numeric variables.
-all_outcomes(): Excludes the target variable (target_variable), as it does not need to be scaled.

4. One-Hot Encoding for Categorical Variables

step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

Purpose: Converts categorical variables into binary (0/1) variables using one-hot encoding.
all_nominal(): Selects all nominal (categorical) variables.
-all_outcomes(): Excludes the target variable (target_variable), as it does not need to be encoded.
one_hot = TRUE: Specifies that one-hot encoding should be used.

Applying the Recipe

Once the recipe is defined, you can apply it to your data:

# Prepare the recipe with the training data
prepared_recipe <- prep(preprocess_recipe, training = training_data, verbose = TRUE)

# Apply the recipe to the training data
train_data_preprocessed <- juice(prepared_recipe)

# Apply the recipe to the testing data
test_data_preprocessed <- bake(prepared_recipe, new_data = testing_data)

prep(): Computes the necessary statistics (e.g., means, standard deviations) from the training data to apply the preprocessing steps.
juice(): Applies the recipe to the training data.
bake(): Applies the recipe to new data (e.g., the testing set).

Example: Preprocessing in Action

Suppose the training_data dataset looks like this:

target_variable	feature_1	feature_2	category
150	25	50000	A
160	30	60000	B
140	22	45000	B

Preprocessed Data

Centering and Scaling:
- feature_1 and feature_2 are centered and scaled.
One-Hot Encoding:
- category is converted into binary variables: category_A and category_B.

The preprocessed data might look like this:

target_variable	feature_1_scaled	feature_2_scaled	category_A	category_B
150	-0.5	0.2	1	0
160	0.5	0.8	0	1
140	-1.0	-0.5	0	1

Why Use `recipes`?

The recipes package offers several advantages:

Reproducibility: Preprocessing steps are clearly defined and can be reused.
Consistency: The same preprocessing steps are applied to both training and testing datasets.
Flexibility: You can easily add or modify steps in the preprocessing pipeline.

Conclusion

Data preprocessing is a crucial step in preparing your data for machine learning. With the recipespackage in R, you can define and apply preprocessing steps in a clean, reproducible, and efficient way. By centering, scaling, and encoding your data, you ensure that your machine learning models perform at their best.

Ready to try it out? Install the recipes package and start preprocessing your data today!

install.packages("recipes")
library(recipes)

Happy coding! 😊

Smart Extraction: Converting PDF Tables into Usable Data with R workshop

Join our workshop on Smart Extraction: Converting PDF Tables into Usable Data with R, which is a part of our workshops for Ukraine series!

Here’s some more info:

Title: Smart Extraction: Converting PDF Tables into Usable Data with R

Date: Thursday, May 1st, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Flávia E. Rius, PhD, is a data scientist at Mendelics, Latin America’s leading genomics company, and a postdoctoral researcher at the University of São Paulo. With a strong background in molecular biology and bioinformatics, she combines research and applied genomics to advance precision medicine in Brazil. Passionate about sharing knowledge, she also mentors students and professionals in R, data science, and bioinformatics.

Description: In this workshop, we’ll dive into the extraction of tables from PDFs using R, an essential skill for turning static documents into usable data. We’ll explore two approaches: first, using {tabulizer} to extract structured tables, and second, using the ocr() function from {tesseract}, a powerful tool for when text can’t be extracted directly. Our focus will be on academic journal articles, a rich source of data for both research and industry applications. Join me to level up your data wrangling skills and add a valuable asset to your R toolkit!

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration

How can I register?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go directly to support Ukraine.

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

How can I sponsor a student?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/4aD5LMC or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 20 USD or 800 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

You can also find more information about this workshop series, a schedule of our future workshops as well as a list of our past workshops which you can get the recordings & materials here.

Looking forward to seeing you during the workshop!

1. Introduction

2. Step-by-Step Workflow

Step 1: Set Seed for Reproducibility

Step 2: Load Necessary Libraries

Step 3: Split the Dataset

Step 4: Extract Training and Testing Sets

Step 5: Merge Datasets for Analysis

Step 6: Save Training and Testing Data

3. Example: Data Splitting and Preprocessing

Step 1: Create a Sample Dataset

Step 2: Split the Dataset

Step 3: Merge Datasets for Analysis

Step 4: Save the Training and Testing Data

4. Why This Workflow Matters

5. Conclusion

1. Introduction to Cross-Validation

2. Step-by-Step Cross-Validation Setup

Step 1: Load Necessary Library

Step 2: Define Train Control for Cross-Validation

Step 3: Save Train Control Object

3. Example: Cross-Validation in Action

Step 1: Load the Dataset

Step 2: Define the Cross-Validation Configuration

Step 3: Train a Model Using Cross-Validation

Step 4: Save the Cross-Validation Configuration

4. Why This Workflow Matters

5. Conclusion

1. Introduction

2. The Basic Syntax

3. The Examples

Example for apply()

Example for lapply()

Example for sapply()

4. The Case of Using

5. Key Points

6. The Meaning

7. Conclusion

What is the recipes Package?

Why Preprocess Data?

Step-by-Step Preprocessing with recipes

1. Creating the Recipe Object

2. Centering Numeric Variables

3. Scaling Numeric Variables

4. One-Hot Encoding for Categorical Variables

Applying the Recipe

Example: Preprocessing in Action

Preprocessed Data

Why Use recipes?

Conclusion

What is the `recipes` Package?

Step-by-Step Preprocessing with `recipes`

Why Use `recipes`?