Benchmarking cast in R from long data frame to wide matrix

Interested in publishing a one-time post on R-bloggers.com? Press here to learn how.
In my daily work I often have to transform a long table to a wide matrix so accommodate some function. At some stage in my life I came across the reshape2 package, and I have been with that philosophy ever since – I find it makes data wrangling easy and straight forward. I particularly like the tidyverse philosophy where data should be in a long table, where one row is an observation, and a column a parameter. It just makes sense.

However, I quite often have to transform the data into another format, a wide matrix especially for functions of the vegan package, and one day I wondering how to do that in the fastest way.

The code to create the test sets and benchmark the functions is in section ‘Settings and script’ at the end of this document.

I created several data sets that mimic the data I usually work with in terms of size and values. The data sets have 2 to 10 groups, where each group can have up to 50000, 100000, 150000, or 200000 samples. The methods xtabs() from base R, dcast() from data.table, dMcast() from Matrix.utils, and spread() from tidyr were benchmarked using microbenchmark() from the package microbenchmark. Each method was evaluated 10 times on the same data set, which was repeated for 10 randomly generated data sets.

After the 10 x 10 repetitions of casting from long to wide it is clear the spread() is the worst. This is clear when we focus on the size (figure 1).
Figure 1. Runtime for 100 repetitions of data sets of different size and complexity.
And the complexity (figure 2).
Figure 2. Runtime for 100 repetitions of data sets of different complexity and size.

Close up on the top three methods

Casting from a long table to a wide matrix is clearly slowest with spread(), where as the remaining look somewhat similar. A direct comparison of the methods show a similarity in their performance, with dMcast() from the package Matrix.utils being better — especially with the large and more complex tables (figure 3).
Figure 3. Direct comparison of set size.
I am aware, that it might be to much to assume linearity, between the computation times at different set sizes, but I do believe it captures the point – dMcast() and dcast() are similar, with advantage to dMcast() for large data sets with large number of groups. It does, however, look like dcast() scales better with the complexity (figure 4).
Figure 4. Direct comparison of number groups.

Settings and script

Session info

 ## ─ Session info ──────────────────────────────────────────────────────────
 ## setting value 
 ## version R version 3.5.2 (2018-12-20)
 ## os Ubuntu 18.04.1 LTS 
 ## system x86_64, linux-gnu 
 ## ui X11 
 ## language en_GB:en_US 
 ## collate en_DE.UTF-8 
 ## ctype en_DE.UTF-8 
 ## tz Europe/Berlin 
 ## date 2019-02-03 
 ## 
 ## ─ Packages ──────────────────────────────────────────────────────────────
 ## package * version date lib source 
 ## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.2)
 ## bindr 0.1.1 2018-03-13 [1] CRAN (R 3.5.2)
 ## bindrcpp * 0.2.2 2018-03-29 [1] CRAN (R 3.5.2)
 ## cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.2)
 ## colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2)
 ## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2)
 ## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.2)
 ## dplyr * 0.7.8 2018-11-10 [1] CRAN (R 3.5.2)
 ## evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.2)
 ## ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.2)
 ## glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.2)
 ## gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.2)
 ## highr 0.7 2018-06-09 [1] CRAN (R 3.5.2)
 ## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.2)
 ## knitr 1.21 2018-12-10 [1] CRAN (R 3.5.2)
 ## labeling 0.3 2014-08-23 [1] CRAN (R 3.5.2)
 ## lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.2)
 ## magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.2)
 ## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.2)
 ## packrat 0.5.0 2018-11-14 [1] CRAN (R 3.5.2)
 ## pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2)
 ## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.2)
 ## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.2)
 ## purrr 0.2.5 2018-05-29 [1] CRAN (R 3.5.2)
 ## R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.2)
 ## Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2)
 ## rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
 ## rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.2)
 ## scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.2)
 ## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
 ## stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.2)
 ## stringr * 1.3.1 2018-05-10 [1] CRAN (R 3.5.2)
 ## tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2)
 ## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.2)
 ## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.2)
 ## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.2)
 ## xfun 0.4 2018-10-23 [1] CRAN (R 3.5.2)
 ## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.2)

Settings

# settings.yml
set_size: [50000, 100000, 150000, 200000]
num_groups: [2, 3, 4, 5, 6, 7, 8, 9, 10]
benchmark_repetitions: 10
num_test_sets: 10
max_value: 500
word_length: 10

Data creation and benchmarking scripts

# main.R
# Global variables ----------------------------------------------
# Set this to FALSE if you want to run the complete analysis
running_test <- TRUE
vars <- yaml::read_yaml("./settings.yml")
set_size <- vars$set_size
num_groups <- vars$num_groups
benchmark_repetitions <- vars$benchmark_repetitions
num_test_sets <- vars$num_test_sets
max_value <- vars$max_value
word_length <- vars$word_length

# Test variables ------------------------------------------------
if(running_test){
set_size <- seq.int(0L, 60L, 30L)
num_groups <- c(2L:3L)
benchmark_repetitions <- 2L
num_test_sets <- 2L
}


# Libraries ----------------------------------------------------- 
suppressPackageStartupMessages(library(foreach))
suppressPackageStartupMessages(library(doParallel))


# Setup parallel ------------------------------------------------
num_cores <- detectCores() - 1

these_cores <- makeCluster(num_cores, type = "PSOCK")
registerDoParallel(these_cores)

# Functions -----------------------------------------------------
run_benchmark <- function(as){
source("test_cast.R")
num_groups <- as["num_groups"]
set_size <- as["set_size"]
num_test_sets <- as["num_test_sets"]
word_length <- as["word_length"]
max_value <- as["max_value"]
 
test_data <- prepare_test_data(set_size, num_groups, word_length, max_value)
perform_benchmark(test_data, benchmark_repetitions)
}


# Setup and run tests -------------------------------------------
set_size <- set_size[set_size > 0]

analysis_comb <- expand.grid(num_groups, set_size)
analysis_settings <- vector("list", length = nrow(analysis_comb))

for(i in seq_len(nrow(analysis_comb))){
analysis_settings[[i]] <- c(num_groups =analysis_comb[i, "Var1"],
set_size = analysis_comb[i, "Var2"],
word_length = word_length,
max_value = max_value,
benchmark_repetitions = benchmark_repetitions)
}


for(as in analysis_settings){
num_groups <- as["num_groups"]
set_size <- as["set_size"]

report_str <- paste("ng:", num_groups,
"- setsize:", set_size, "\n")
cat(report_str)
 
rds_file_name <- paste0("./output/benchmark_setsize-", set_size,
"_ng-", num_groups, ".rds")
 
bm_res <- foreach(seq_len(num_test_sets), .combine = "rbind") %dopar% {
run_benchmark(as)
 }
 
bm_res <- dplyr::mutate(bm_res, `Number groups` = num_groups,
 `Set size` = set_size)
 
saveRDS(bm_res, rds_file_name)
report_str
}
# test_cast.R
setup <- function(){
library(data.table)
library(tidyr)
library(dplyr)
library(Matrix.utils)
library(tibble)
}

prepare_test_data <- function(set_size, num_groups, word_length, sample_int_n){
calc_subset_size <- function(){
subset_size <- 0
while(subset_size == 0 | subset_size > set_size){
subset_size <- abs(round(set_size - set_size/(3 + rnorm(1))))
 }
subset_size
 }
 
words <- stringi::stri_rand_strings(set_size, word_length)
subset_sizes <- replicate(num_groups, calc_subset_size())
 
 purrr::map_df(subset_sizes, function(subset_size){
 tibble::tibble(Variable = sample(words, subset_size),
Value = sample.int(sample_int_n, subset_size, replace = TRUE),
Group = stringi::stri_rand_strings(1, word_length))
 })
}

test_tidyr <- function(test_df){
test_df %>% 
spread(Variable, Value, fill = 0L) %>% 
 tibble::column_to_rownames(var = "Group") %>% 
as.matrix.data.frame()
}

test_xtabs <- function(test_df){
xtabs(Value ~ Group + Variable, data = test_df) 
}

test_dMcast <- function(test_df){
class(test_df) <- "data.frame"
dMcast(test_df, Group ~ Variable, value.var = "Value")
}

test_dcast <- function(test_df){
test_df_dt <- data.table(test_df)
dcast(test_df_dt, Group ~ Variable, value.var = "Value", fill = 0) %>% 
 tibble::column_to_rownames(var = "Group") %>% 
as.matrix.data.frame()
}


perform_benchmark <- function(test_df, benchmark_repetitions){
suppressPackageStartupMessages(setup())
bm_res <- microbenchmark::microbenchmark(
Spread = test_tidyr(test_df = test_df), 
Xtabs = test_xtabs(test_df = test_df), 
dMcast = test_dMcast(test_df = test_df), 
dcast = test_dcast(test_df = test_df), 
times = benchmark_repetitions
 )
class(bm_res) <- "data.frame"
 
bm_res %>% 
mutate(time = microbenchmark:::convert_to_unit(time, "ms")) %>% 
rename(Method = expr, `Time (ms)` = time)
}

9 thoughts on “Benchmarking cast in R from long data frame to wide matrix”

  1. Nice comparison. There are, however, other aspects too beside speed. I wrote my little wrapper `mefa4::Xtab` for xtabs (when it was still in the Matrix package) because my computer 10 years ago could not fit the wide format table in memory. So if you worry about object size and memory usage, sparse=TRUE becomes a really important consideration. My other reason for writing the wrapper was that xtabs returns a table, which is a tricky object class. I wanted plain old dgCMatrix.

    1. True, that wide matrix’ can cause memory problems. For my particular use case I have no other choice, and the matrix is in its nature sparse.

      The xtabs is indeed a matrix underneath, but for some reason there is no as.matrix function. This will do the trick:


      as.matrix.xtabs <- function(x){ attr(x, "class") <- NULL attr(x, "call") <- NULL x }

  2. Thanks, this is great.

    I am at present working on casting long to wide. I tried dcast.data.table but it gave me a max interger error. I have used sparseMatrix to obtain this. I’ll try dMcast.

    My question is: how do you save to a tab delimited or a csv file?

    I’ve tried write.matrix from the MASS package and it doesnt work for the dimensions of my sparse matrix.

    1. Glad to hear it is useful.

      I don’t have the need to save the wide matrix, but if I did, I think I would just save them as rds.

      For large data.frames I generally save in the feather format (using the library feather), but I don’t know it it will do any good for the wide matrix. The feather is really fast but for data.frame like structures only.

      It makes no sense to save as csv or similar, since reading the file will turn it into a data.frame which is insanely slow when there are many columns.

Leave a Reply to Egor Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.