{SLmetrics}: scalable and memory efficient AI/ML performance evaluation in R

On December 3rd, 2024, a post about the release of {SLmetrics} was published. Today, January 11th, 2025, version 0.3-1 has been released and comes with many new features. Among these are weighted classification and regression metrics, OpenMP support and a wide array of new evaluation metrics.

In this blog post, I will benchmark {SLmetrics} and demonstrate how it compares to the similar R packages {MLmetrics} and {yardstick} in terms of execution time and memory efficiency – essential determinants of scalability and efficiency.

Benchmark Function

To run the benchmark of {SLmetrics}, {MLmetrics} and {yardstick}, I will use {bench} which measures the median execution time and memory efficiency. Below I have created a wrapper function:

## benchmark function
benchmark <- function(
  ..., 
  m = 10) {
  library(magrittr)
  # 1) create list
  # for storing values
  performance <- list()

  for (i in 1:m) {

     # 1) run the benchmarks
    results <- bench::mark(
      ...,
      iterations = 10,
      check = FALSE
    )

    # 2) extract values
    # and summarise them per run
    performance$time[[i]]  <- setNames(
        lapply(results$time, mean), 
        results$expression
        )

    performance$memory[[i]] <- setNames(
        lapply(results$memory, function(x) {
             sum(x$bytes, na.rm = TRUE)}
             ), results$expression)

    performance$n_gc[[i]] <- setNames(
        lapply(results$n_gc, sum), results$expression
        )

  }

  purrr::pmap_dfr(
  list(performance$time, performance$memory, performance$n_gc), 
  ~{
    tibble::tibble(
      expression = names(..1),
      time = unlist(..1),
      memory = unlist(..2),
      n_gc = unlist(..3)
    )
  }
) %>%
  dplyr::mutate(expression = factor(expression, levels = unique(expression))) %>%
  dplyr::group_by(expression) %>%
  dplyr::filter(dplyr::row_number() > 1) %>%
  dplyr::summarize(
    execution_time = bench::as_bench_time(median(time)),
    memory_usage = bench::as_bench_bytes(median(memory)),
    gc_calls = median(n_gc),
    .groups = "drop"
  )

}

The wrapper function runs 10 x 10 benchmarks of each passed function and discards the first run, allowing the functions to warm up before the benchmarks are recorded.

The remaining values are averaged within each run and then presented as the median execution time, median memory usage and median number of gc()-calls across runs.

Benchmarking {SLmetrics}

Benchmarking with and without OpenMP

In the first set of benchmarks, I will demonstrate the new OpenMP feature that has been shipped with version 0.3-1. For the benchmark, we will compare the execution time and memory efficiency of computing a 3×3 confusion matrix on two vectors of length 10,000,000 with and without OpenMP. The source code and results are shown below:

## 1) set seed
set.seed(1903)

## 2) define values
## for classes
actual <- factor(sample(letters[1:3], 1e7, TRUE))
predicted <- factor(sample(letters[1:3], 1e7, TRUE))

## 3) benchmark with OpenMP
SLmetrics::setUseOpenMP(TRUE)
#> OpenMP usage set to: enabled

benchmark(`{With OpenMP}` = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression    execution_time memory_usage gc_calls
#>   <fct>               <bch:tm>    <bch:byt>    <dbl>
#> 1 {With OpenMP}            1ms           0B        0

## 4) benchmark without OpenMP
SLmetrics::setUseOpenMP(FALSE)
#> OpenMP usage set to: disabled

benchmark(`{Without OpenMP}`  = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression       execution_time memory_usage gc_calls
#>   <fct>                  <bch:tm>    <bch:byt>    <dbl>
#> 1 {Without OpenMP}         6.27ms           0B        0

The confusion matrix is computed in roughly one millisecond with OpenMP and around six milliseconds without. In both cases, it uses zero or near-zero additional memory.

Benchmarking against {MLmetrics} and {yardstick}

In the second set of benchmarks, I will compare the execution time and memory efficiency of {SLmetrics} against {MLmetrics} and {yardstick}. The source code and results are shown below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         6.34ms           0B        0
#> 2 {MLmetrics}       344.13ms        381MB       19
#> 3 {yardstick}       343.75ms        381MB       19

{SLmetrics} is more than 50 times faster than both packages and significantly more memory efficient, as shown by memory_usage and gc_calls. From this perspective, {SLmetrics} is more efficient and scalable than both packages, whose memory usage grows roughly linearly with the input size. See below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         12.3ms           0B        0
#> 2 {MLmetrics}        648.5ms        763MB       19
#> 3 {yardstick}        654.7ms        763MB       19

{SLmetrics} can process roughly 50x the data in the time it takes {MLmetrics} and {yardstick} to process 40,000,000 data points – without any additional memory cost.

Summary

The benchmarks suggest that {SLmetrics} is a strong contender to the more established packages {MLmetrics} and {yardstick} in terms of scalability, memory efficiency and speed.

Installing {SLmetrics}

{SLmetrics} is still under development and therefore not yet on CRAN, but the latest release can be installed using {devtools}. A development version is also available for those living on the edge. See below:

Stable version

## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref  = 'main'
)

Development version

## install development version
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics',
  ref  = 'development'
)

If you made it this far: Thank you for reading the blog post, and feel free to leave a comment here or in the repository.

Is round(0.5) 0 or 1?


Actually, both are possible

This article was originally published on YOZM-IT in Korean.

Various ways of doing data science

There are many programming languages in the world, and much software built on top of them, and they all play an important role in data science.

For example, if you’re using funnel analysis to improve your product, you might want to 

  • Compare the bounce rates of funnel stages before and after an event,
  • And perform a proportion test to assess their statistical significance.
Image by the author

Meanwhile, data scientists come from various career backgrounds and experiences, so they tend to use the tools they are comfortable with, including Python, R, SAS and more.

We see this quite a bit because, in most cases, the software you use does not make much of a difference at the business level.

But what happens when different software produces different results?

The following image shows the results of running a proportion test in R, Python, and STATA with the example mentioned above.

Image from the author and CAMIS project

You can see that even though we used the same values of 1000 and 123, the p-value, which indicates the significance of the proportion test, is slightly different for each method.

There are many reasons why the calculated value differs depending on the tool used, such as:

  • Different algorithms in the core logic of the programming language,
  • Different default values for the parameters used in the function.

In the example above, if you set the parameter correct in R to FALSE – which turns off the Yates continuity correction – you can see that the result is the same as in STATA.

Image from CAMIS project
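
To make the comparison concrete, below is a minimal sketch in R of the two variants, assuming a one-sample test of 123 successes out of 1,000 trials against a null proportion of 0.5 (the null value and the exact setup are my assumptions for illustration; the CAMIS example may be configured differently):

## one-sample proportion test: 123 successes out of 1,000 trials
## (the null proportion p = 0.5 is assumed here for illustration)

## R's default applies the Yates continuity correction
prop.test(x = 123, n = 1000, p = 0.5)

## turning the correction off gives the plain chi-squared/z-test,
## which is what some other tools report by default
prop.test(x = 123, n = 1000, p = 0.5, correct = FALSE)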

Rounding

Next, I’ll introduce rounding for more general data analysis. 

Image by the author

Similarly, you can see that the result of rounding changes depending on the software.
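
For example, base R follows the IEC 60559 (IEEE 754) "round half to even" rule, while many other tools round halves up. A small sketch (the round_half_up() helper below is my own illustration, not a base R function):

## base R: round half to even
round(c(0.5, 1.5, 2.5))
#> [1] 0 2 2

## a hand-rolled "round half up" helper for comparison
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  floor(x * scale + 0.5) / scale
}
round_half_up(c(0.5, 1.5, 2.5))
#> [1] 1 2 3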

If the fee on some large financial transaction is “0.5 billion”, the rounded cost could be zero or 1 billion, depending on how the rounding is calculated.

Another case is logistic regression, where a predicted probability of exactly 0.5 can be rounded to either class, so the rounding rule can reverse the prediction.

Image from Wikipedia, edited by the author

Why is round different?

Let’s talk a little more about why rounding differs.

Rounding as we usually perceive it means changing 0–4 to 0 and 5–9 to 10, as shown in the image below.

In decimal terms, it means rounding to the nearest whole number by mapping .0–.4999… to 0 and .5–.9999… to 1.

However, there are a number of mathematical interpretations of what to do when the value is exactly 0.5, and of how to handle negative numbers.

Image from the Learning corner

For example, should round(-23.5) produce -23 or -24?

Both are possible depending on the mathematical interpretation; these are called rounding half up and rounding half down, respectively. We can take this a step further and round halves toward zero for both positive and negative numbers, or away from zero.

This means that round(-23.5) and round(23.5) would produce -23 and 23, or -24 and 24. These variants are called rounding half toward zero and rounding half away from zero, respectively.

Finally, there are methods called rounding half to even and rounding half to odd, which round halves to the nearest even or odd integer, respectively.
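
These tie-breaking rules can be written as small R helpers to see exactly where they disagree (the helpers below are illustrative sketches, not library functions):

## illustrative helpers for the tie-breaking rules (whole numbers only)
half_up          <- function(x) floor(x + 0.5)
half_down        <- function(x) ceiling(x - 0.5)
half_toward_zero <- function(x) sign(x) * ceiling(abs(x) - 0.5)
half_away_zero   <- function(x) sign(x) * floor(abs(x) + 0.5)
half_to_even     <- function(x) round(x)  # base R's default

x <- c(23.5, -23.5)
half_up(x)           #>  24 -23
half_down(x)         #>  23 -24
half_toward_zero(x)  #>  23 -23
half_away_zero(x)    #>  24 -24
half_to_even(x)      #>  24 -24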

In particular, the Rounding half to even method also goes by the names Convergent rounding, Statistician’s rounding, Dutch rounding, Gaussian rounding, and Bankers’ rounding, and is one of the official standard methods according to IEEE 754.

Bankers’ rounding

Bankers’ rounding is the default method in R, so let’s look at it in a bit more detail.

The image below shows the result of rounding from 0.0 to 2.0.

Image from the author

While this may look reasonable, there is actually a problem: because .5 is always rounded up to the next integer, the results are systematically biased towards larger values.

I don’t know the exact reason for this, but one theory is that the US IRS used to collect taxes with this rounding and was sued for unfairly profiting by collecting more tax from people whose amounts ended in .5; after losing the case, it switched to rounding halves to the nearest even (or odd) number.

This means that by modifying the rounding as shown below, we can avoid the bias that was previously occurring.
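
A quick simulation illustrates the bias argument: rounding a long run of values ending in exactly .5 with "half up" drifts upwards on average, while R's "half to even" stays close to zero (floor(x + 0.5) is used here as a stand-in for half-up rounding):

## values ending in exactly .5
x <- seq(0.5, 100.5, by = 1)

mean(floor(x + 0.5) - x)  # half up: average error of +0.5
mean(round(x) - x)        # half to even: average error near 0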


The problem with different results

In recent years, industries in various domains, including pharmaceuticals and finance, have been trying to switch from “commercial” software such as SPSS, SAS and STATA to “open source” software such as Python, R and Julia.

And, as with the rounding discussed earlier, the issue of different results across software has also been raised, which can create problems in terms of reproducibility, uncertainty, accuracy, and traceability.

So if you are using multiple software packages, you should be aware of why they produce different results and how to use them properly.

CAMIS project

Image from CAMIS project

CAMIS stands for Comparing Analysis Method Implementations in Software. 

This project compares the differences between software packages (or programming languages) and develops standards for producing the same results.

The core area of the project is the “statistical computation” part, so most contributions come from data science leaders with a strong understanding of it.

But CAMIS is also an open-source project, open to anyone and maintained by a variety of people through regular discussions, collaboration, and sharing of project progress.

Below is one of the comparisons published on the CAMIS project’s webpage, which reviews how a one-sample t-test is run in each software package, what the results are, and how compatible the results are with each other.

Image from CAMIS project

The CAMIS project was started by members interested in “SAS to R” migration in the medical and pharmaceutical industry, so it mainly focuses on R and SAS for major statistical analyses, but recently it has also been working on how to use Python for data science across a broader range of industries.

Not only classical methods such as hypothesis tests and regression analysis, but also modern data science methods such as Bayesian statistics, causal inference, and novel implementations of existing methods (e.g. MMRM) are topics of interest in the project.

Sessions about the project are increasingly appearing at data science conferences, where researchers and contributors are encouraged to promote it, contribute to it, and use it as a reference.

Finally, the CAMIS project is also collaborating with academia beyond the data science industry, as similar topics have been published in The American Statistician and Drug Information Association, among others.

Image from The American Statistician

The project is also currently working with students on a thesis entitled “A comparison of MMRM methodology in SAS and R software” and is open to collaborations and suggestions on other topics.

Summary

Various software is used in data science. Depending on the domain, the libraries or software used by an organization may be tied to a particular language, and these are sometimes mixed with personally preferred tools (in many cases, this does not make much of a difference at the business level).

However, if you’re not careful, the methods you use can lead to different results.

In this article, I’ve given you some examples of and reasons for differences in the methods used by different software for calculations, and introduced the CAMIS project, a research project that aims to minimize them to ensure consistency in data analysis.

If you use different software in your data analytics work, it’s a good idea to take a look at them to understand the differences and try to find the optimal method for your purposes.

And if you work in data science in the field, I highly recommend that you take an interest in, or contribute to, the CAMIS project for a global collaborative experience.

{cryptoQuotes}: Open access to cryptocurrency market data in R (Update)

The {cryptoQuotes}-package has been updated to version 1.3.0. This update comes with many new features and some breaking changes. Prior to version 1.3.0, the package used camelCase (see, for example, this post) with no particular style guide. The package now uses the tidyverse style guide which, in turn, has deprecated a few core functions.

Note: Only the styling is affected; the returned market data is still xts/zoo-objects.

Among the many new features and enhancements are dark and light themed charting, and a wide array of new sentiment indicators. The full documentation can be found on pkgdown.

In this blog post, the new charting features will be showcased using hourly Bitcoin OHLC-V data and long-short ratios from the last two days (at the time of writing).

Cryptocurrency market data in R

# 0) load library
library(cryptoQuotes)

To extract the Bitcoin OHLC-V data, the get_quote()-function [previously getQuote()] is used as shown below,

# 1) extract last two
# days of Bitcoin on the
# hourly chart
tail(
  BTC <- get_quote(
    ticker   = "BTCUSDT",
    source   = "binance",
    interval = "1h",
    from     = Sys.Date() - 2
  )
)
#>                        open    high     low   close    volume
#> 2024-06-05 02:00:00 70580.0 70954.1 70462.8 70820.1  7593.081
#> 2024-06-05 03:00:00 70820.2 71389.8 70685.9 71020.7 11466.934
#> 2024-06-05 04:00:00 71020.7 71216.0 70700.0 70892.1  7824.993
#> 2024-06-05 05:00:00 70892.2 71057.0 70819.1 70994.0  5420.481
#> 2024-06-05 06:00:00 70994.0 71327.9 70875.9 71220.2  7955.595
#> 2024-06-05 07:00:00 71220.2 71245.0 70922.0 70988.8  3500.795

The long-short ratio on Bitcoin over the same hourly interval is retrieved using the get_lsratio()-function [previously getLSRatio()] as shown below,

# 2) extract last two days
# of long-short ratio on
# Bitcoin
tail(
  BTC_LS <- get_lsratio(
    ticker   = "BTCUSDT",
    source   = "binance",
    interval = "1h",
    from     = Sys.Date() - 2
  )
)
#>                       long  short  ls_ratio
#> 2024-06-05 02:00:00 0.4925 0.5075 0.9704433
#> 2024-06-05 03:00:00 0.4938 0.5062 0.9755038
#> 2024-06-05 04:00:00 0.4942 0.5058 0.9770660
#> 2024-06-05 05:00:00 0.4901 0.5099 0.9611689
#> 2024-06-05 06:00:00 0.4884 0.5116 0.9546521
#> 2024-06-05 07:00:00 0.4823 0.5177 0.9316206

Prior to version 1.3.0, all charting with indicators was done with the magrittr pipe operator, both internally and externally. This came with an overhead in both efficiency and readability (opinionated, I know). The charting has been reworked in terms of layout and syntax.

Below is an example of a dark-themed chart with the long-short ratio alongside simple moving averages, bollinger bands and volume indicators,

# 3) dark-themed
# chart
chart(
  ticker = BTC,
  main   = kline(),
  indicator = list(
    bollinger_bands(),
    sma(n = 7),
    sma(n = 14)

  ),
  sub = list(
    volume(),
    lsr(ratio = BTC_LS)
  )
)

The light-themed chart has been reworked and has received some extra love, so that it differs from the default colors provided by the {plotly}-package,

# 4) light-themed
# chart
chart(
  ticker = BTC,
  main   = kline(),
  indicator = list(
    bollinger_bands(),
    sma(n = 7),
    sma(n = 14)
  ),
  sub = list(
    volume(),
    lsr(ratio = BTC_LS)
  ),
  options = list(
    dark = FALSE
  )
)

About the {cryptoQuotes}-package

The {cryptoQuotes}-package is a high-level API-client that interacts with public market data endpoints from major cryptocurrency exchanges using the {curl}-package.

The endpoints, which are publicly accessible and maintained by the exchanges themselves, ensure consistent and reliable access to high-quality cryptocurrency market data in R.

Installing {cryptoQuotes}

The {cryptoQuotes}-package can be installed via CRAN,

# installing {cryptoQuotes}
install.packages(
  pkgs ="cryptoQuotes",
  dependencies = TRUE
)

Created on 2024-06-05 with reprex v2.1.0

Making tables in R with the gt package workshop

Join our workshop on Making tables in R with the gt package, which is a part of our workshops for Ukraine series! 

Here’s some more info: 

Title: Making tables in R with the gt package

Date: Thursday, June 13th, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone) 

Speaker: Rich Iannone. Rich is a software engineer who enjoys working with R and Python. He likes to create packages that help people accomplish things. While Rich very clearly digs programming, he enjoys other things as well! Examples include: playing and listening to music, reading books, watching films, meeting up with friends, and wandering through the many valleys and ravines of the Greater Toronto Area.

Description: The goal of the {gt} package is to make building tables for publication a hassle-free process while giving you the freedom to be creative. If you’re familiar with {ggplot2}, the feel of working with {gt} isn’t too far off. Join this workshop for background on the goals of {gt} and an extensive tour of its features by the package developer. There are a lot of functions in this package but we’ll go through the most important ones and learn how to make beautiful tables!

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)



How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, as well as a list of our past workshops, for which you can get the recordings & materials here.


Looking forward to seeing you during the workshop!

Gauging Cryptocurrency Market Sentiment in R

Navigating the volatile world of cryptocurrencies requires a keen understanding of market sentiment. This blog post explores some of the essential tools and techniques for analyzing the mood of the crypto market, using the cryptoQuotes-package.

The Cryptocurrency Fear and Greed Index in R

The Fear and Greed Index is a market sentiment tool that measures investor emotions, ranging from 0 (extreme fear) to 100 (extreme greed). It analyzes data like volatility, market momentum, and social media trends to indicate potential overvaluation or undervaluation of cryptocurrencies. This index helps investors identify potential buying or selling opportunities by gauging the market’s emotional extremes.

This index can be retrieved by using the cryptoQuotes::getFGIndex()-function, which returns the daily index within a specified time-frame,

## Fear and Greed Index
## from the last 14 days
tail(
  FGI <- cryptoQuotes::getFGIndex(
    from = Sys.Date() - 14
  )
)
#>            FGI
#> 2024-01-03  70
#> 2024-01-04  68
#> 2024-01-05  72
#> 2024-01-06  70
#> 2024-01-07  71
#> 2024-01-08  71

The Long-Short Ratio of a Cryptocurrency Pair in R

The Long-Short Ratio is a financial metric indicating market sentiment by comparing the number of long positions (bets on price increases) against short positions (bets on price decreases) for an asset. A higher ratio signals bullish sentiment, while a lower ratio suggests bearish sentiment, guiding traders in making informed decisions.

The Long-Short Ratio can be retrieved by using the cryptoQuotes::getLSRatio()-function, which returns the ratio within a specified time-frame and granularity. Below is an example using the Daily Long-Short Ratio on Bitcoin (BTC),

## Long-Short Ratio
## from the last 14 days
tail(
  LSR <- cryptoQuotes::getLSRatio(
    ticker = "BTCUSDT",
    interval = '1d',
    from = Sys.Date() - 14
  )
)
#>              Long  Short LSRatio
#> 2024-01-03 0.5069 0.4931  1.0280
#> 2024-01-04 0.6219 0.3781  1.6448
#> 2024-01-05 0.5401 0.4599  1.1744
#> 2024-01-06 0.5499 0.4501  1.2217
#> 2024-01-07 0.5533 0.4467  1.2386
#> 2024-01-08 0.5364 0.4636  1.1570

Putting it all together

Even though cryptoQuotes::getLSRatio() is an asset-specific sentiment indicator and cryptoQuotes::getFGIndex() is a general sentiment indicator, there is much insight to be gained by combining the two.

This information can be visualized by using the various charting-functions in the cryptoQuotes-package,

## get the BTCUSDT
## pair from the last 14 days
BTCUSDT <- cryptoQuotes::getQuote(
  ticker = "BTCUSDT",
  interval = "1d",
  from = Sys.Date() - 14
)
## chart the BTCUSDT
## pair with sentiment indicators
cryptoQuotes::chart(
  slider = FALSE,
  chart = cryptoQuotes::kline(BTCUSDT) %>%
    cryptoQuotes::addFGIndex(FGI = FGI) %>% 
    cryptoQuotes::addLSRatio(LSR = LSR)
)

Bitcoin (BTC) plotted with the Fear and Greed Index alongside the Long-Short Ratio

Installing cryptoQuotes

Installing via CRAN

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Installing via Github

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Note: The latest price may vary depending on the time of publication relative to the rendering time of the document. This document was rendered at 2024-01-08 23:30 CET.

Cryptocurrency Market Data in R

Getting cryptocurrency OHLCV data in R without having to depend on low-level coding using, for example, curl or httr2, has not been easy for the R community.

There is now a high-level API Client available on CRAN which fetches all the market data without having to rely on web-scrapers, API keys or low-level coding.

Bitcoin Prices in R (Example)

This high-level API-client has one main function, getQuote(), which returns cryptocurrency market data as xts/zoo objects. The returned objects contain Open, High, Low, Close and Volume data at different levels of granularity from the currently supported exchanges.

In this blog post I will show how to get hourly Bitcoin (BTC) prices in R using the getQuote()-function. See the code below,

# 1) getting hourly BTC
# from the last 3 days

BTC <- cryptoQuotes::getQuote(
 ticker   = "BTCUSDT", 
 source   = "binance", 
 futures  = FALSE, 
 interval = "1h", 
 from     = as.character(Sys.Date() - 3)
)
Bitcoin (BTC) OHLC-prices (Output from the getQuote()-function)

Index                Open      High      Low       Close     Volume
2023-12-23 19:00:00  43787.69  43821.69  43695.03  43703.81  547.96785
2023-12-23 20:00:00  43703.82  43738.74  43632.77  43711.33  486.4342
2023-12-23 21:00:00  43711.33  43779.71  43661.81  43772.55  395.6197
2023-12-23 22:00:00  43772.55  43835.94  43737.85  43745.86  577.03505
2023-12-23 23:00:00  43745.86  43806.38  43701.1   43702.16  940.55167
2023-12-24           43702.15  43722.25  43606.18  43716.72  773.85301

The returned Bitcoin prices from getQuote() are compatible with quantmod and TTR without further programming. Let me demonstrate this using chartSeries(), addBBands() and addMACD() from these powerful libraries,

# charting BTC
# using quantmod
quantmod::chartSeries(
 x = BTC,
 TA = c(
    # add bollinger bands
    # to the chart
    quantmod::addBBands(), 
    # add MACD indicator
    # to the chart
    quantmod::addMACD()
 ), 
 theme = quantmod::chartTheme("white")
)

Charting Bitcoin prices using quantmod and TTR

Installing cryptoQuotes

Stable version

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Development version

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Using ChatGPT for Exploratory Data Analysis with Python, R and prompting

Join our workshop on Using ChatGPT for Exploratory Data Analysis with Python, R and prompting, which is a part of our workshops for Ukraine series! 


Here’s some more info: 


Title: Using ChatGPT for Exploratory Data Analysis with Python, R and prompting


Date: Thursday, November 30th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)


Speaker: Gábor Békés is an Associate Professor at the Department of Economics and Business of Central European University, a research fellow at KRTK in Hungary, and a research affiliate at CEPR. His research focuses on international economics, economic geography and applied IO, and has been published by, among others, the Global Strategy Journal, the Journal of International Economics, Regional Science and Urban Economics and Economic Policy; he has also authored commentary on VOXEU.org. His comprehensive textbook, Data Analysis for Business, Economics, and Policy, written with Gábor Kézdi, was published by Cambridge University Press in 2021.


Description: How can GenAI, like ChatGPT augment and speed up data exploration? Is it true that we no longer need coding skills? Or, instead, does ChatGPT hallucinate too much to be taken seriously? I will do a workshop with live prompting and coding to investigate.  I will experiment with two datasets shared ahead of the workshop.  The first comes from my Data Analysis textbook and is about football managers. Here I’ll see how close working with AI will get to what we have in the textbook, and compare codes written by us vs the machine. Second, I’ll work with a dataset I have no/little experience with and see how far it takes me. In this case, we will look at descriptive statistics, make graphs and tables, and work to improve a textual variable.  It will generate code and reports, and I’ll then check them on my laptop to see if they work. The process starts with Python but then I’ll proceed with R.


Minimal registration fee: 20 euro (or 20 USD or 800 UAH)




How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, as well as a list of our past workshops, for which you can get the recordings & materials here.


Looking forward to seeing you during the workshop!

Visualising connection in R workshop

Join our workshop on Visualising connection in R, which is a part of our workshops for Ukraine series! 

Here’s some more info: 

Title: Visualising connection in R
Date: Thursday, November 9th, 19:00 – 21:00 CET (Rome, Berlin, Paris timezone)
Speaker: Rita Giordano is a freelance data visualisation consultant and scientific illustrator based in the UK. By training, she is a physicist who holds a PhD in statistics applied to structural biology. She has extensive experience in research and data science. Furthermore, she has over fourteen years of professional experience working with R. She is also a LinkedIn instructor. You can find her course “Build Advanced Charts with R” on LinkedIn Learning.
Description: How to show connection? It depends on the connection we want to visualise. We could use a network, chord, or Sankey diagram.  The workshop will focus on how to visualise connections using chord diagrams. We will explore how to create a chord diagram with the {circlize} package.  In the final part of the workshop, I will briefly mention how to create a Sankey diagram with networkD3. Attendees need to have installed the {circlize} and {networkD3} packages.  
Minimal registration fee: 20 euro (or 20 USD or 800 UAH)


How can I register?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?

  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).


You can also find more information about this workshop series, a schedule of our future workshops, as well as a list of our past workshops, for which you can get the recordings & materials here.


Looking forward to seeing you during the workshop!

EARL your way: book your tickets now!

Your EARL tickets are now live to purchase here. Offering you every possible EARL ticket combination, here is a quick summary of what you can expect. You can simply choose a 3-day jam-packed conference pass or a 1 or 2-day option to customise an itinerary that works for you.

Grab your EARLy bird tickets right away – available for a period of 2 weeks and 2 weeks only. We are delighted to be offering an unlimited number of tickets with discounts ranging from 15-25% on all ticket options, depending on whether you are NHS, not-for-profit or academic.

Team networking.

Why not bring your colleagues along for a much needed team social at the largest commercial R event in the UK? Offering lots of networking opportunities from brands in similar markets – there will be plenty of time to swap market experiences, over coffee, at lunch or at our evening reception. We are certainly proud to be a part of such an enthusiastic community.

Full or half day workshop on day 1.

We are running a 1-day series of workshops to kick off EARL on 6th September, covering all areas of R: from explainable machine learning, time series visualisation, functional programming with purrr and an introduction to plumber APIs, to having some fun and making games in Shiny. There is plenty of choice on the agenda, with both morning and afternoon sessions.

Full conference pass.

Our all-access pass to EARL gives you full access to a 1-day workshop, a full 2-day conference pass and access to the evening reception at the unforgettable Drapers Hall on day 2 – the former home of Henry VIII. We have got an impressive line-up of keynotes including mathematician, science presenter and all-round badass – Hannah Fry, Top 100 Global Innovator in Data & Analytics – Harry Powell and the unmissable Financial Times columnist John Burn-Murdoch. To add to this excitement, we have approved use cases from Bumble, Samaritans, BBC, Meta, Bank of England, Dogs Trust, NHS, and partners RStudio alongside many more.

1 or 2-day conference pass.

If you would like access to the keynotes, session talks and abundance of networking opportunities, you can choose from a 1 or 2-day pass aligned to your areas of interest. The 2-day conference pass gives you access to the main evening reception.

Evening reception.

This year we have opted for an unforgettable experience at Drapers Hall (the former home of Henry VIII), where you will be able to network with colleagues, delegates and speakers over drinks, canapes, and dinner in unforgettable surroundings. Transport is provided via a private London red bus transfer. This year promises a heavyweight line-up, use cases from leading brands and the opportunity, at last, to share and network to your heart’s content. We look forward to meeting you. Book your tickets now.

Create a hyper-marketing model using Naïve Bayes

By Huey Fern Tay with Greg Page

Everyone loves an extra income stream – even the super-wealthy owners of luxurious properties that they only inhabit for just a few weeks each year.  Offering a property as a short-term rental through a platform like Airbnb can provide a wonderful side hustle.  For some owners, however, the associated hassles could be a powerful deterrent to using the service.  Text messages at 3 a.m. about Wi-Fi passwords, stopped-up toilets, and the lack of water pressure in the shower might be just enough to tip the scales against such an undertaking…especially when such messages are followed up by angry “Why isn’t this fixed yet?” queries just 30 minutes later.

So what if an all-in-one concierge service could take away ALL of those hassles?  If an intermediary service could handle all of the tenant interactions, the marketing, the logistics of the key hand-offs, etc. then suddenly the idle rich jetsetters might be a bit more willing to open up their pied-a-terres to the unbathed masses.  Such a service would benefit all stakeholders – travelers would have more options, the property owners would earn more income, and the platform would receive more commission fees from the extra transactions. In exchange for a fee paid to the service, willing property owners could have a side hustle that was “all side, no hustle.”  

Let’s imagine that such a service is looking to establish itself, with an initial marketing outreach effort to high-end property owners not already using Airbnb. Let’s also imagine that this new service is operating on a shoestring budget, and therefore needs to focus its outreach carefully. How can it identify the properties within a city that are most likely to command high values in the short-term rental market?

The Naïve Bayes classifier is a good candidate for the task at hand because of its simplicity, computational efficiency, and ability to handle categorical variables.  Furthermore, its classification outcomes come with associated probability values – we can use those to identify records that are most likely to belong to some particular group. 
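
As a brief reminder (my notation, not the author’s), naïve Bayes scores each class by combining the class prior with the per-feature likelihoods, assuming the features are conditionally independent given the class; the predicted probabilities used below are these scores normalised across classes:

$$
P(C_k \mid x_1, \ldots, x_p) \;\propto\; P(C_k)\prod_{j=1}^{p} P(x_j \mid C_k)
$$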

To illustrate how this method could be used to solve the business problem outlined above, I will utilize Airbnb data of San Francisco listings.

One of the first decisions the modeler must make is deciding how to bin the data. In this case, the question is which numerical variable would you use to separate the properties? Would you use price, review ratings, or number of reviews? Each of these variables has a different impact on the outcome and may not be equally effective at separating classes. If the classes are not well separated, then even a large dataset will not be helpful.

The next important decision is to determine the number of classification categories you would like to create. Also, what would the cut-off be for each group? In other words, should you create four groups and bin them equally? Or should you create three groups by dividing the data according to a 15-70-15 proportion, 20-60-20, 30-40-30, etc.? The decision made at this step has a big impact on the model.
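
As a concrete illustration of the binning step, here is a minimal sketch on simulated prices (the data, the column names and the 30-40-30 split are assumptions for illustration, not the post’s actual choices):

## hypothetical example: bin simulated listing prices into three tiers
set.seed(1)
listings <- data.frame(price = rlnorm(1000, meanlog = 5, sdlog = 0.6))

listings$price_tier <- cut(
  listings$price,
  breaks = quantile(listings$price, probs = c(0, 0.3, 0.7, 1)),
  labels = c("budget", "mid-range", "above average and pricey digs"),
  include.lowest = TRUE
)

table(listings$price_tier)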

Consider both models below, each created with 3,811 rows of data representing 60% of the total dataset. Both models were created with the same predictor variables, such as the number of bedrooms, bathrooms, property type, location, etc., but in Model 1 the data was binned into 4 equal groups, while in Model 2 it was binned into 3 equal groups. The model summary for Model 1 showed an accuracy of 54.19%, which is good considering that this is slightly more than double the No Information Rate (naïve rate). Model 2’s accuracy is 65.36%, nearly double its No Information Rate.

These are encouraging results but in our Airbnb example, we are more interested in knowing how well our model performs when it is asked to classify properties into any of the classes used in the model. For this reason, it is worth considering the true positive rate i.e. ‘sensitivity’. Model 1 is better at predicting the true positives (‘sensitivity’) for classes at opposite ends of the spectrum. This suggests Model 1 has difficulty reading nuances. On the other hand, Model 2’s performance in this regard is comparatively more balanced.


Naive Bayes model with four classification outcomes

Naive Bayes output with three classification outcomes


But wait – let’s get back to our original goal.  While overall accuracy is good to see, what we are most interested in here is identifying that high-end price group.  Owners of such units will be the best targets for our all-in-one concierge service.  Therefore, let’s dive a bit deeper to examine this model’s suitability for identifying such properties. 

By running the predict() function with the type=’raw’ parameter included, we can view the associated probabilities for each outcome class, and then rank records by probability of belonging to some particular outcome group. 
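
The post does not say which naïve Bayes implementation was used; as a sketch of this ranking step, here is how it looks with e1071::naiveBayes() on a stand-in dataset:

## sketch: rank records by predicted class probability
## (stand-in model and data; the post's Airbnb model is not reproduced here)
library(e1071)

fit   <- naiveBayes(Species ~ ., data = iris)
probs <- predict(fit, newdata = iris, type = "raw")

## order records by their probability of belonging to one target class,
## analogous to ranking listings by P("above average and pricey digs")
ranked <- order(probs[, "virginica"], decreasing = TRUE)
head(probs[ranked, ])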

Taking this approach with the validation set, we find that among the 100 records identified by the model as most likely to land in the top tier group, 96 truly belonged to “Above Average and Pricey Digs.”   Among the 150 likeliest, 140 units, or 93.33%, actually belonged to that group, and among the 200 likeliest, 185 units, or 92.5%, were truly in that top tier. 

But that’s not all. It is worth going one step further by evaluating the model with lift charts or decile-wise lift charts because these charts determine how effectively our model ‘skims the cream’.

The decile-wise lift charts below illustrate how effectively the model can predict membership in the ‘above average and pricey digs’ group. When the model is used to classify the top 27% of properties in this category, its performance is more than 3.5x better than a random guess.

Decile-wise lift chart showing the model is 3.5 times better at classifying the top 27 percent of properties

Another way to assess the model’s ability to identify top-tier rentals is with a two-dimensional lift chart.  Such a chart only works with two-outcome class scenarios, so we start here by collapsing the first and second tiers together, and then labelling that group as “other.”  

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

In the entire validation set, there are 780 units that land in the highest price tier.  The values along the x-axis represent all the validation set records, ranked in order of their probability of belonging to the highest-tier class.  The y-axis shows the number of correct predictions.  The solid line represents the model’s performance – it shows us, for instance, that among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group.  The line flattens out at around x=1500, because by that point, the model has already identified nearly all of the records that truly belonged to this outcome class. 

The dotted line, by contrast, shows how effective a model would be if it simply labelled all the records as belonging to the top tier. Since 33.6% of the validation records belong to this group, each x-axis value here corresponds to a y-axis value that is exactly 33.6% as large. The difference between the solid line and the dotted line represents the model’s improvement as the number of cases increases.
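
For readers who want to reproduce this kind of chart, here is a minimal sketch on simulated probabilities and labels (the post’s validation data is not included, so the numbers are illustrative only; 2,321 is roughly the size implied by 780 records making up 33.6% of the validation set):

## sketch of a cumulative gains curve for a two-class setup
set.seed(42)
p_top  <- runif(2321)               # predicted P(top tier), simulated
is_top <- rbinom(2321, 1, p_top)    # 1 = truly in the top tier, simulated

ord      <- order(p_top, decreasing = TRUE)
cum_hits <- cumsum(is_top[ord])

plot(seq_along(cum_hits), cum_hits, type = "l",
     xlab = "records ranked by predicted probability",
     ylab = "cumulative correct predictions")
abline(a = 0, b = mean(is_top), lty = 2)  # naive "label everything" baseline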

What do these results mean for the concierge service?  Let’s revisit our original assumptions: 


  • such a service would most likely appeal to the owners of properties in the highest pricing tier;
  • the initial outreach efforts should be made to owners of properties not already registered with Airbnb; and that
  • the new service has a limited budget, and therefore needs to carefully focus its outreach efforts only on that tier of properties whose owners would be likeliest to use it
Given these assumptions, our primary interest does not lie with overall model accuracy.  The model’s ability to distinguish between the bottom two pricing tiers is almost immaterial to us; however, we are keenly interested in the answer to this question:  When the model predicts that a property will belong to the highest pricing tier, how often is it correct? 

As demonstrated here, the model delivers quite effectively in this regard, especially when we maintain a relatively narrow focus on the properties that are most likely to be top tier.  Splashy magazine inserts, Super Bowl advertisements, and big-ticket endorsements from celebrities might be in the cards for this service down the road, after it spreads across the globe and prepares for its IPO roadshow.  For now, though, the hyper-specific focus that can come from “skimming the cream” off the top of those Naïve Bayes model probability predictions may be the surest next step for this service’s success. 

Data source: Inside Airbnb