Benchmarking API Performance: R-Native and Plumber in Data Extraction and Sending

R, a language best known for its prowess in statistical analysis and data science, might not be the first choice that comes to mind when thinking about building APIs. However, rapid prototyping, scalability, seamless integration with data analysis, and ease of debugging are my reasons for encapsulating API functionality within R packages. In doing so, I like to distinguish between two approaches:



The R-native approach (blue): API functionality is provided by functions of the relevant R packages, which are installed directly into the R session.

The Plumber approach (green): The Plumber package allows R code to be exposed as a RESTful web service through special decorator comments. As a result, API functionality is accessed by sending REST calls (GET, POST, …) rather than by calling functions of R packages directly.

#* This endpoint is a dummy for loading data
#* @get /load_from_db
#* @serializer unboxedJSON
#* @param row_limit :int
#* @response 200 OK with result
#* @response 201 OK without result
#* @response 401 Client Error: Invalid Request - Class
#* @response 402 Client Error: Invalid Request - Type
#* @response 403 Client Error: Missing or invalid parameter
#* @response 500 Server Error: Unspecified Server error
#* @response 501 Server Error: Database writing failed
#* @response 502 Server Error: Database reading failed
#* @tag demo
function(res, row_limit = 10000) {
   # load data from db 
   data <- dplyr::tibble()
   res$status <- 200
   res$body <- as.list(data)
   return(res$body)
}

Which approach to use?

Both approaches have their strengths and limitations. It should be no surprise that, in terms of execution time, CPU utilization and format consistency, the R-native approach is likely to be the first choice, as code and data are processed within one context. The approach also offers flexibility for complex data manipulations, but it can be challenging to maintain, especially when it comes to credential management and propagating new package releases to all relevant processes. To the best of my knowledge, there is no automated way of re-installing new releases directly into all related R packages, even when using the POSIT package manager, so this easily becomes tedious.

In contrast, the Plumber API encourages a modular design that enhances code organization and facilitates integration with a wide array of platforms and systems. It streamlines package updates while ensuring a consistent interface: interacting with a Plumber API remains separate from the underlying code logic provided by the endpoint. This not only improves version management but also introduces a clear separation between client and server. In general, decoupling functionality through a RESTful API makes it easier to divide tasks among separate development teams and thus offers a higher degree of flexibility and external support. Additionally, I found distributing a Plumber API notably more straightforward than handing over a raw R package.

The primary goal of this blog post is to quantify the performance difference between the two approaches when it comes to getting data in and out of a database. Such a benchmark can be particularly valuable for ETL (Extract, Transform, Load) processes, as it sheds light on the threshold at which the advantages of the Plumber approach cease to justify its constraints. In doing so, I hope to provide information to developers who are faced with the decision of whether it makes sense to provide or access R functionalities via Plumber APIs.

Experimental Setup
The experimental setup consisted of a virtual machine with 64GB RAM and an Intel(R) Xeon(R) Gold 6152 CPU clocked at 2.1GHz with 8 cores, running Ubuntu 22.04 LTS and R version 4.2.1, and hosting the POSIT Workbench and Connect server (the latter for hosting the Plumber API). Both POSIT services were granted identical access permissions to the virtual machine’s computational resources.

Both approaches are evaluated in terms of execution time, simply measured with system.time(), and maximal observed CPU load, the latter being an especially important indicator of how much data can be extracted and sent at once. For each fixed number of data rows, ranging from 10^4 to 10^7, 10 trials are conducted and the results are shown in a jittered beeswarm plot. To assess the CPU load during the benchmark, I built a separate function that returns a new session object, within which the output of NCmisc::top(CPU = FALSE) is appended to a file every 10 seconds.

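# Start a background R session that samples the system load every `interval`
# seconds and saves the readings collected so far to an .rda file under `root`.
get_cpu_load <- function(interval = 10, root, name, nrow) {
  rs_cpu <- callr::r_session$new()
  rs_cpu$call(function(nrow, root, name, interval) {
    # Count existing result files for this name and row count
    # so that each trial writes to a new file index
    files <- list.files(root)
    n_files <- sum(stringr::str_detect(files, sprintf("%s_%s_", name, format(nrow, scientific = FALSE))))
    l <- c()
    while (TRUE) {
      # CPU = FALSE skips the CPU summary; the value stored below
      # is the used RAM reported by top()
      ret <- NCmisc::top(CPU = FALSE)
      l <- c(l, ret$RAM$used * 1000000)
      save(l, file = sprintf("%s/%s_%s_%s.rda", root, name, format(nrow, scientific = FALSE), n_files + 1))
      Sys.sleep(interval)
    }
  }, args = list(nrow, root, name, interval))
  # The loop never ends on its own; the caller has to stop the session,
  # e.g. with rs_cpu$close()
  return(rs_cpu)
}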
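The timing side of the setup described above can be sketched in a similar way. The helper name time_trials and the stand-in workload dummy_task are hypothetical and not part of the original benchmark; in the real benchmark the workload would be the data extraction or sending call for a fixed number of rows.

```r
# Hypothetical helper: run a zero-argument function `f` `n_trials` times,
# time each run with system.time() and return the elapsed time in minutes.
time_trials <- function(f, n_trials = 10) {
  vapply(seq_len(n_trials), function(i) {
    elapsed_sec <- system.time(f())[["elapsed"]]
    elapsed_sec / 60
  }, numeric(1))
}

# Stand-in workload (assumption): replace with the actual extraction
# or sending call that should be benchmarked.
dummy_task <- function() sum(sqrt(seq_len(1e5)))

timings <- time_trials(dummy_task, n_trials = 10)
summary(timings)
```

Keeping the individual trial timings, rather than only their mean, is what makes a beeswarm plot of all 10 trials per row count possible.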


Result
Execution time: As subfigure A) of the following figure shows, data extraction is considerably slower with the Plumber API than with the R-native approach across all dataset sizes, by up to a factor of roughly 10.


(y-axis in logarithmic scale)

Both approaches display a linear increase in execution time on a logarithmic time scale. Since the row counts increase by powers of ten, this corresponds to power-law growth, here roughly in proportion to the number of rows. Specifically, the mean execution times for R-native and Plumber start at 0.00078 and 0.00456 minutes, respectively, and rise to 0.286 and 2.61 minutes. It is reasonable to assume that this trend persists for larger datasets, potentially resulting in execution times exceeding half an hour for very large tables (> 100 million rows) when using Plumber.
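The reported means allow a quick check of how the two approaches scale. The following is a minimal sketch that uses only the four mean values quoted above; the helper loglog_slope is a hypothetical name, and with only two points per approach the slope is a rough indication rather than a fitted trend.

```r
# Mean execution times in minutes at 10^4 and 10^7 rows, as reported in the text
rows    <- c(1e4, 1e7)
native  <- c(0.00078, 0.286)
plumber <- c(0.00456, 2.61)

# Slowdown factor of Plumber relative to R-native at both ends of the range
slowdown <- plumber / native

# Slope between the two points on a log-log scale: a value near 1 means
# that time grows roughly in proportion to the number of rows
loglog_slope <- function(time, n) diff(log10(time)) / diff(log10(n))
slope_native  <- loglog_slope(native, rows)
slope_plumber <- loglog_slope(plumber, rows)

round(slowdown, 1)
round(c(native = slope_native, plumber = slope_plumber), 2)
```

With these inputs the slowdown comes out at roughly 5.8 for 10^4 rows and 9.1 for 10^7 rows, and both slopes lie slightly below 1 (about 0.85 and 0.92).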

Conversely, subfigure B) shows the execution time for sending data and illustrates that both approaches provide rather comparable performance, particularly with larger numbers of rows. While for 10,000 rows the R-native approach is still almost twice as fast (average of 0.0023 minutes) as Plumber (0.00418 minutes), the advantage of being in one context diminishes as the number of rows increases. At 10 million rows, the Plumber approach is even faster than the R-native approach (1.88 minutes), averaging 1.7 minutes. Once again, the execution time grows roughly in proportion to the number of rows.

CPU Load: In examining maximum observable CPU load during data receiving and sending, notable differences emerge between the Plumber API and the R-native approach. 

(y-axis in logarithmic scale)

A) For data extraction up to 1 million rows, CPU utilization remains below 10% for both approaches. However, the utilization patterns diverge as row counts increase. Notably, the R-native approach maintains relatively consistent CPU usage (averaging 5.53%, 5.48%, 5.47%) up to 1 million rows, whereas the Plumber approach already shows a noticeable increase (5.97%, 6.05%, 8.6%). When extracting 10 million rows, CPU usage surpasses 30% for Plumber, while R-native extraction needs only about one fifth of that.

B) In contrast to execution time, a clear difference in CPU utilization is also evident when sending data. The R-native approach consistently requires at most half as much CPU as Plumber across all row counts. For 10,000,000 rows, the Plumber approach even consumes more than three times as much (43.2% vs. 13.1%). This amounts to almost 30GB in absolute terms.

Conclusion

The Plumber approach, while offering several advantages, encounters clear limitations when dealing with large datasets, be it tables with a substantial number of rows or many columns. Data extraction becomes up to roughly ten times slower, with CPU utilization being up to five and three times higher when getting data out and in, respectively. A closer look suggests that this gap likely results from the need to convert data into JSON format in a web-based architecture. Plumber cannot transfer R data frames directly, which is why a serializer has to be used when sending data to and retrieving data from an endpoint. Even with plenty of RAM, this conversion can lead to execution errors in practice, as the JSON representation may exceed the maximum byte size of an R character string.

>jsonlite::toJSON(dataframe)
Error in collapse_object(objnames, tmp, indent):
R character strings are limited to 2^31-1 bytes

The only viable workaround in such scenarios involves breaking down tables into smaller chunks based on certain identifiers.
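A minimal sketch of such a chunking step is shown below. The helper split_into_chunks is a hypothetical name, and splitting by row position is an assumption made for illustration; an identifier column could be passed to split() instead.

```r
# Hypothetical helper: split a data frame into a list of chunks with at
# most `chunk_size` rows each, so that every chunk can be serialized alone.
split_into_chunks <- function(df, chunk_size) {
  stopifnot(is.data.frame(df), chunk_size >= 1)
  if (nrow(df) == 0) return(list(df))
  # Assign consecutive rows to chunk 1, 2, 3, ...
  chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
  unname(split(df, chunk_id))
}

# Toy example: 10 rows in chunks of at most 4 rows
df <- data.frame(id = 1:10, value = letters[1:10])
chunks <- split_into_chunks(df, chunk_size = 4)
vapply(chunks, nrow, integer(1))
```

Each chunk could then be serialized and sent on its own, for example with jsonlite::toJSON(), and the pieces combined again on the receiving side with do.call(rbind, chunks).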

Providing a precise table size limitation where the Plumber approach remains suitable proves challenging, as it hinges on a multitude of factors, including the number of rows, columns, and cell content within the dataset. Personally, I will stick to using the Plumber API for scenarios with limited data traffic, such as querying terminology or a statistical summary, as I generally prioritize code encapsulation and ease of maintenance over maximizing performance.

Micha Christ
Bosch Health Campus Centrum für Medizinische Datenintegration

Benchmarking cast in R from long data frame to wide matrix

In my daily work I often have to transform a long table into a wide matrix to accommodate some function. At some stage in my life I came across the reshape2 package, and I have been with that philosophy ever since: I find it makes data wrangling easy and straightforward. I particularly like the tidyverse philosophy where data should be in a long table, where one row is an observation, and a column a parameter. It just makes sense.

However, I quite often have to transform the data into another format, a wide matrix, especially for functions of the vegan package, and one day I wondered how to do that in the fastest way.

The code to create the test sets and benchmark the functions is in section ‘Settings and script’ at the end of this document.

I created several data sets that mimic the data I usually work with in terms of size and values. The data sets have 2 to 10 groups, where each group can have up to 50000, 100000, 150000, or 200000 samples. The methods xtabs() from base R, dcast() from data.table, dMcast() from Matrix.utils, and spread() from tidyr were benchmarked using microbenchmark() from the package microbenchmark. Each method was evaluated 10 times on the same data set, which was repeated for 10 randomly generated data sets.
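For readers unfamiliar with the operation being benchmarked, here is a minimal sketch of a long-to-wide cast with xtabs() from base R. The toy table is made up for illustration and only borrows the column names Group, Variable and Value from the benchmark script.

```r
# Toy long table: one row per (Group, Variable) observation
long_df <- data.frame(
  Group    = c("g1", "g1", "g2", "g2", "g2"),
  Variable = c("a",  "b",  "a",  "b",  "c"),
  Value    = c(1L,   2L,   3L,   4L,   5L),
  stringsAsFactors = FALSE
)

# Cast to wide: one row per group, one column per variable;
# combinations missing from the long table end up as 0
wide <- xtabs(Value ~ Group + Variable, data = long_df)

# Drop the xtabs/table classes and the stored call to get a plain matrix
wide_mat <- unclass(wide)
attr(wide_mat, "call") <- NULL
wide_mat
```

The result has one row per group and one column per variable, which is the shape that matrix-oriented functions typically expect.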

After the 10 x 10 repetitions of casting from long to wide, it is clear that spread() is the slowest. This is clear when we focus on the size (figure 1).
Figure 1. Runtime for 100 repetitions of data sets of different size and complexity.
And the complexity (figure 2).
Figure 2. Runtime for 100 repetitions of data sets of different complexity and size.

Close up on the top three methods

Casting from a long table to a wide matrix is clearly slowest with spread(), whereas the remaining methods look somewhat similar. A direct comparison of the methods shows a similarity in their performance, with dMcast() from the package Matrix.utils being better, especially with the large and more complex tables (figure 3).
Figure 3. Direct comparison of set size.
I am aware that it might be too much to assume linearity between the computation times at different set sizes, but I do believe it captures the point: dMcast() and dcast() are similar, with an advantage to dMcast() for large data sets with a large number of groups. It does, however, look like dcast() scales better with the complexity (figure 4).
Figure 4. Direct comparison of number of groups.

Settings and script

Session info

 ## ─ Session info ──────────────────────────────────────────────────────────
 ## setting value 
 ## version R version 3.5.2 (2018-12-20)
 ## os Ubuntu 18.04.1 LTS 
 ## system x86_64, linux-gnu 
 ## ui X11 
 ## language en_GB:en_US 
 ## collate en_DE.UTF-8 
 ## ctype en_DE.UTF-8 
 ## tz Europe/Berlin 
 ## date 2019-02-03 
 ## 
 ## ─ Packages ──────────────────────────────────────────────────────────────
 ## package * version date lib source 
 ## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.2)
 ## bindr 0.1.1 2018-03-13 [1] CRAN (R 3.5.2)
 ## bindrcpp * 0.2.2 2018-03-29 [1] CRAN (R 3.5.2)
 ## cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.2)
 ## colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2)
 ## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2)
 ## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.2)
 ## dplyr * 0.7.8 2018-11-10 [1] CRAN (R 3.5.2)
 ## evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.2)
 ## ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.2)
 ## glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.2)
 ## gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.2)
 ## highr 0.7 2018-06-09 [1] CRAN (R 3.5.2)
 ## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.2)
 ## knitr 1.21 2018-12-10 [1] CRAN (R 3.5.2)
 ## labeling 0.3 2014-08-23 [1] CRAN (R 3.5.2)
 ## lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.2)
 ## magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.2)
 ## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.2)
 ## packrat 0.5.0 2018-11-14 [1] CRAN (R 3.5.2)
 ## pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2)
 ## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.2)
 ## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.2)
 ## purrr 0.2.5 2018-05-29 [1] CRAN (R 3.5.2)
 ## R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.2)
 ## Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2)
 ## rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
 ## rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.2)
 ## scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.2)
 ## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
 ## stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.2)
 ## stringr * 1.3.1 2018-05-10 [1] CRAN (R 3.5.2)
 ## tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2)
 ## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.2)
 ## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.2)
 ## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.2)
 ## xfun 0.4 2018-10-23 [1] CRAN (R 3.5.2)
 ## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.2)

Settings

# settings.yml
set_size: [50000, 100000, 150000, 200000]
num_groups: [2, 3, 4, 5, 6, 7, 8, 9, 10]
benchmark_repetitions: 10
num_test_sets: 10
max_value: 500
word_length: 10

Data creation and benchmarking scripts

# main.R
# Global variables ----------------------------------------------
# Set this to FALSE if you want to run the complete analysis
running_test <- TRUE
vars <- yaml::read_yaml("./settings.yml")
set_size <- vars$set_size
num_groups <- vars$num_groups
benchmark_repetitions <- vars$benchmark_repetitions
num_test_sets <- vars$num_test_sets
max_value <- vars$max_value
word_length <- vars$word_length

# Test variables ------------------------------------------------
if (running_test) {
  set_size <- seq.int(0L, 60L, 30L)
  num_groups <- c(2L:3L)
  benchmark_repetitions <- 2L
  num_test_sets <- 2L
}


# Libraries ----------------------------------------------------- 
suppressPackageStartupMessages(library(foreach))
suppressPackageStartupMessages(library(doParallel))


# Setup parallel ------------------------------------------------
num_cores <- detectCores() - 1

these_cores <- makeCluster(num_cores, type = "PSOCK")
registerDoParallel(these_cores)

# Functions -----------------------------------------------------
run_benchmark <- function(as) {
  source("test_cast.R")
  num_groups <- as["num_groups"]
  set_size <- as["set_size"]
  word_length <- as["word_length"]
  max_value <- as["max_value"]
  # Take the repetitions from the settings vector so the function does not
  # rely on a global variable being visible on the parallel workers
  benchmark_repetitions <- as["benchmark_repetitions"]

  test_data <- prepare_test_data(set_size, num_groups, word_length, max_value)
  perform_benchmark(test_data, benchmark_repetitions)
}


# Setup and run tests -------------------------------------------
set_size <- set_size[set_size > 0]

analysis_comb <- expand.grid(num_groups, set_size)
analysis_settings <- vector("list", length = nrow(analysis_comb))

for (i in seq_len(nrow(analysis_comb))) {
  analysis_settings[[i]] <- c(num_groups = analysis_comb[i, "Var1"],
                              set_size = analysis_comb[i, "Var2"],
                              word_length = word_length,
                              max_value = max_value,
                              benchmark_repetitions = benchmark_repetitions)
}


for (as in analysis_settings) {
  num_groups <- as["num_groups"]
  set_size <- as["set_size"]

  report_str <- paste("ng:", num_groups,
                      "- setsize:", set_size, "\n")
  cat(report_str)

  rds_file_name <- paste0("./output/benchmark_setsize-", set_size,
                          "_ng-", num_groups, ".rds")

  # Run the benchmark on num_test_sets independently generated data sets
  bm_res <- foreach(seq_len(num_test_sets), .combine = "rbind") %dopar% {
    run_benchmark(as)
  }

  bm_res <- dplyr::mutate(bm_res, `Number groups` = num_groups,
                          `Set size` = set_size)

  saveRDS(bm_res, rds_file_name)
}

# Release the worker processes once all benchmarks are done
stopCluster(these_cores)

# test_cast.R
setup <- function() {
  library(data.table)
  library(tidyr)
  library(dplyr)
  library(Matrix.utils)
  library(tibble)
}

prepare_test_data <- function(set_size, num_groups, word_length, sample_int_n) {
  # Draw a random group size between 1 and set_size
  calc_subset_size <- function() {
    subset_size <- 0
    while (subset_size == 0 || subset_size > set_size) {
      subset_size <- abs(round(set_size - set_size / (3 + rnorm(1))))
    }
    subset_size
  }

  words <- stringi::stri_rand_strings(set_size, word_length)
  subset_sizes <- replicate(num_groups, calc_subset_size())

  # One long table per group: sampled variables with random integer values
  purrr::map_df(subset_sizes, function(subset_size) {
    tibble::tibble(Variable = sample(words, subset_size),
                   Value = sample.int(sample_int_n, subset_size, replace = TRUE),
                   Group = stringi::stri_rand_strings(1, word_length))
  })
}

test_tidyr <- function(test_df) {
  test_df %>%
    spread(Variable, Value, fill = 0L) %>%
    tibble::column_to_rownames(var = "Group") %>%
    as.matrix.data.frame()
}

test_xtabs <- function(test_df) {
  xtabs(Value ~ Group + Variable, data = test_df)
}

test_dMcast <- function(test_df) {
  class(test_df) <- "data.frame"
  dMcast(test_df, Group ~ Variable, value.var = "Value")
}

test_dcast <- function(test_df) {
  test_df_dt <- data.table(test_df)
  dcast(test_df_dt, Group ~ Variable, value.var = "Value", fill = 0) %>%
    tibble::column_to_rownames(var = "Group") %>%
    as.matrix.data.frame()
}


perform_benchmark <- function(test_df, benchmark_repetitions) {
  suppressPackageStartupMessages(setup())
  bm_res <- microbenchmark::microbenchmark(
    Spread = test_tidyr(test_df = test_df),
    Xtabs = test_xtabs(test_df = test_df),
    dMcast = test_dMcast(test_df = test_df),
    dcast = test_dcast(test_df = test_df),
    times = benchmark_repetitions
  )
  class(bm_res) <- "data.frame"

  # convert_to_unit() is an internal microbenchmark helper (note the `:::`)
  bm_res %>%
    mutate(time = microbenchmark:::convert_to_unit(time, "ms")) %>%
    rename(Method = expr, `Time (ms)` = time)
}