{SLmetrics}: scalable and memory efficient AI/ML performance evaluation in R

On December 3rd, 2024, a post about the release of {SLmetrics} was published. Today, January 11th, 2025, version 0.3-1 has been released and comes with many new features. Among these are weighted classification and regression metrics, OpenMP support and a wide array of new evaluation metrics.

In this blog post, I will benchmark {SLmetrics} and demonstrate how it compares to the similar R packages {MLmetrics} and {yardstick} in terms of execution time and memory efficiency, two essential determinants of scalability.

Benchmark Function

To run the benchmark of {SLmetrics}, {MLmetrics} and {yardstick}, I will use {bench}, which measures execution time and memory allocation. Below I have created a wrapper function:

## benchmark function
benchmark <- function(
  ..., 
  m = 10) {
  library(magrittr)
  # 1) create list
  # for storing values
  performance <- list()

  for (i in 1:m) {

    # 2) run the benchmarks
    results <- bench::mark(
      ...,
      iterations = 10,
      check = FALSE
    )

    # 3) extract values and
    # aggregate within each run
    performance$time[[i]]  <- setNames(
        lapply(results$time, mean), 
        results$expression
        )

    performance$memory[[i]] <- setNames(
        lapply(results$memory, function(x) {
             sum(x$bytes, na.rm = TRUE)}
             ), results$expression)

    performance$n_gc[[i]] <- setNames(
        lapply(results$n_gc, sum), results$expression
        )

  }

  purrr::pmap_dfr(
  list(performance$time, performance$memory, performance$n_gc), 
  ~{
    tibble::tibble(
      expression = names(..1),
      time = unlist(..1),
      memory = unlist(..2),
      n_gc = unlist(..3)
    )
  }
) %>%
  dplyr::mutate(expression = factor(expression, levels = unique(expression))) %>%
  dplyr::group_by(expression) %>%
  dplyr::filter(dplyr::row_number() > 1) %>%
  dplyr::summarize(
    execution_time = bench::as_bench_time(median(time)),
    memory_usage = bench::as_bench_bytes(median(memory)),
    gc_calls = median(n_gc),
    .groups = "drop"
  )

}

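The wrapper function runs 10 × 10 benchmarks of each passed function: bench::mark() is called m = 10 times with 10 iterations each, and the first run is discarded to allow the functions to warm up before the benchmarks are recorded.

Within each run, the execution time is averaged over the iterations, while the memory allocations and gc()-calls are summed. The results are then presented as the median runtime, median memory usage and median number of gc()-calls across the remaining runs.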
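
The warm-up logic described above can be sketched in base R. This is only an illustration of the aggregation step, not part of the wrapper itself: summarise_runs() is a hypothetical helper, and the timings are made-up numbers chosen so that the first run is visibly slower.

```r
# Aggregate per-run timings the way the wrapper does:
# drop the first (warm-up) run, then take the median of the rest.
# summarise_runs() is a hypothetical helper for illustration only.
summarise_runs <- function(run_means) {
  stopifnot(length(run_means) > 1)
  median(run_means[-1])
}

# made-up per-run mean timings in milliseconds;
# the first run is slower because nothing is warmed up yet
run_means <- c(50, 10, 12, 11, 13)

summarise_runs(run_means)
#> [1] 11.5
```

Using the median across runs keeps a single slow run from dominating the reported value.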

Benchmarking {SLmetrics}

Benchmarking with and without OpenMP

In the first set of benchmarks, I will demonstrate the new OpenMP feature that has been shipped with version 0.3-1. For the benchmark, we will compare the execution time and memory efficiency of computing a 3×3 confusion matrix on two vectors of length 10,000,000 with and without OpenMP. The source code and results are shown below:

## 1) set seed
set.seed(1903)

## 2) define values
## for classes
actual <- factor(sample(letters[1:3], 1e7, TRUE))
predicted <- factor(sample(letters[1:3], 1e7, TRUE))

## 3) benchmark with OpenMP
SLmetrics::setUseOpenMP(TRUE)
#> OpenMP usage set to: enabled

benchmark(`{With OpenMP}` = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression    execution_time memory_usage gc_calls
#>   <fct>               <bch:tm>    <bch:byt>    <dbl>
#> 1 {With OpenMP}            1ms           0B        0

## 4) benchmark without OpenMP
SLmetrics::setUseOpenMP(FALSE)
#> OpenMP usage set to: disabled

benchmark(`{Without OpenMP}`  = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression       execution_time memory_usage gc_calls
#>   <fct>                  <bch:tm>    <bch:byt>    <dbl>
#> 1 {Without OpenMP}         6.27ms           0B        0

With OpenMP, the confusion matrix is computed in about one millisecond; without OpenMP, it takes around six milliseconds. In both cases, the reported memory usage is 0B.

Benchmarking against {MLmetrics} and {yardstick}

In the second set of benchmarks, I will compare the execution time and memory efficiency of {SLmetrics} against {MLmetrics} and {yardstick}. The source code and results are shown below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         6.34ms           0B        0
#> 2 {MLmetrics}       344.13ms        381MB       19
#> 3 {yardstick}       343.75ms        381MB       19

{SLmetrics} is more than 50 times faster than both, and significantly more memory efficient as demonstrated by memory_usage and gc_calls. In this perspective, {SLmetrics} is more efficient and scalable than both packages: its reported memory usage stays at 0B, while the memory usage of {MLmetrics} and {yardstick} grows roughly linearly with the input size. See below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         12.3ms           0B        0
#> 2 {MLmetrics}        648.5ms        763MB       19
#> 3 {yardstick}        654.7ms        763MB       19

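Doubling the length of the vectors roughly doubles the execution time of all three packages and the memory usage of {MLmetrics} and {yardstick}, while {SLmetrics} still reports 0B. At these rates, {SLmetrics} can process roughly 50 times the data in the time it takes {MLmetrics} and {yardstick} to process 40,000,000 data points, without any additional memory cost.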
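
The relative speed can be checked with a few lines of arithmetic. The sketch below only re-uses the median execution times printed in the two tables above; the variable names are my own.

```r
# median execution times in milliseconds, copied from the tables above
times_1e7 <- c(SLmetrics = 6.34, MLmetrics = 344.13, yardstick = 343.75)
times_2e7 <- c(SLmetrics = 12.3, MLmetrics = 648.5, yardstick = 654.7)

# speed-up of {SLmetrics} relative to the other two packages
speedup_1e7 <- times_1e7[-1] / times_1e7[["SLmetrics"]]
speedup_2e7 <- times_2e7[-1] / times_2e7[["SLmetrics"]]
round(speedup_1e7, 1)
round(speedup_2e7, 1)

# growth in execution time when the input length doubles
round(times_2e7 / times_1e7, 2)
```

Both sets of ratios land between 50 and 55, and the growth factors are all close to 2, which is what linear scaling in the input length would look like.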

Summary

The benchmarks suggest that {SLmetrics} is a strong contender to the more established packages {MLmetrics} and {yardstick} in terms of scalability, memory efficiency and speed.

Installing {SLmetrics}

{SLmetrics} is still under development and is therefore not on CRAN. But the latest release can be installed using {devtools}. A development version is also available for those living on the edge. See below:

Stable version

## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref  = 'main'
)

Development version

## install development version
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics',
  ref  = 'development'
)

If you made it this far: Thank you for reading the blog post, and feel free to leave a comment here or in the repository.

{cryptoQuotes}: Open access to cryptocurrency market data in R (Update)

The {cryptoQuotes}-package has been updated to version 1.3.0. This update comes with many new features and breaking changes. Prior to version 1.3.0, the package was using camelCase (see for example this post), with no particular style guide. The package now uses the tidyverse style guide which, in turn, has deprecated a few core functions.

Note: Only the styling is affected; the returned market data are still xts/zoo-objects.

The many new features and enhancements include dark- and light-themed charting and a wide array of new sentiment indicators. The full documentation can be found on pkgdown.

In this blog post, the new charting features will be showcased using hourly Bitcoin OHLC-V and long-short ratios from the last two days (at the time of writing this draft).

Cryptocurrency market data in R

# 0) load library
library(cryptoQuotes)
To extract the Bitcoin OHLC-V, the get_quote()-function [previously getQuote()] is used as shown below,

# 1) extract last two
# days of Bitcoin on the
# hourly chart
tail(
  BTC <- get_quote(
    ticker   = "BTCUSDT",
    source   = "binance",
    interval = "1h",
    from     = Sys.Date() - 2
  )
)
#>                        open    high     low   close    volume
#> 2024-06-05 02:00:00 70580.0 70954.1 70462.8 70820.1  7593.081
#> 2024-06-05 03:00:00 70820.2 71389.8 70685.9 71020.7 11466.934
#> 2024-06-05 04:00:00 71020.7 71216.0 70700.0 70892.1  7824.993
#> 2024-06-05 05:00:00 70892.2 71057.0 70819.1 70994.0  5420.481
#> 2024-06-05 06:00:00 70994.0 71327.9 70875.9 71220.2  7955.595
#> 2024-06-05 07:00:00 71220.2 71245.0 70922.0 70988.8  3500.795
The long-short ratios on Bitcoin in the same hourly interval are retrieved using the get_lsratio()-function [previously getLSRatio()] as shown below,

# 2) extract last two days
# of long-short ratio on
# Bitcoin
tail(
  BTC_LS <- get_lsratio(
    ticker   = "BTCUSDT",
    source   = "binance",
    interval = "1h",
    from     = Sys.Date() - 2
  )
)
#>                       long  short  ls_ratio
#> 2024-06-05 02:00:00 0.4925 0.5075 0.9704433
#> 2024-06-05 03:00:00 0.4938 0.5062 0.9755038
#> 2024-06-05 04:00:00 0.4942 0.5058 0.9770660
#> 2024-06-05 05:00:00 0.4901 0.5099 0.9611689
#> 2024-06-05 06:00:00 0.4884 0.5116 0.9546521
#> 2024-06-05 07:00:00 0.4823 0.5177 0.9316206
Prior to version 1.3.0, all charting with indicators was done with the magrittr-pipe operator, both internally and externally. This came with an overhead on both efficiency and readability (opinionated, I know). The charting has been reworked in terms of layout and syntax.

Below is an example of a dark-themed chart with the long-short ratio alongside simple moving averages, Bollinger bands and volume indicators,

# 3) dark-themed
# chart
chart(
  ticker = BTC,
  main   = kline(),
  indicator = list(
    bollinger_bands(),
    sma(n = 7),
    sma(n = 14)

  ),
  sub = list(
    volume(),
    lsr(ratio = BTC_LS)
  )
)

The light-themed chart has been reworked and has received some extra love, such that it's different from the default colors provided by the {plotly}-package,

# 4) light-themed
# chart
chart(
  ticker = BTC,
  main   = kline(),
  indicator = list(
    bollinger_bands(),
    sma(n = 7),
    sma(n = 14)
  ),
  sub = list(
    volume(),
    lsr(ratio = BTC_LS)
  ),
  options = list(
    dark = FALSE
  )
)

About the {cryptoQuotes}-package

The {cryptoQuotes}-package is a high-level API-client that interacts with public market data endpoints from major cryptocurrency exchanges using the {curl}-package.

The endpoints, which are publicly accessible and maintained by the exchanges themselves, ensure consistent and reliable access to high-quality cryptocurrency market data with R.

Installing {cryptoQuotes}
The {cryptoQuotes}-package can be installed via CRAN,

# installing {cryptoQuotes}
install.packages(
  pkgs ="cryptoQuotes",
  dependencies = TRUE
)

Created on 2024-06-05 with reprex v2.1.0

Gauging Cryptocurrency Market Sentiment in R

Navigating the volatile world of cryptocurrencies requires a keen understanding of market sentiment. This blog post explores some of the essential tools and techniques for analyzing the mood of the crypto market, using the cryptoQuotes-package.

The Cryptocurrency Fear and Greed Index in R

The Fear and Greed Index is a market sentiment tool that measures investor emotions, ranging from 0 (extreme fear) to 100 (extreme greed). It analyzes data like volatility, market momentum, and social media trends to indicate potential overvaluation or undervaluation of cryptocurrencies. This index helps investors identify potential buying or selling opportunities by gauging the market’s emotional extremes.

This index can be retrieved by using the cryptoQuotes::getFGIndex()-function, which returns the daily index within a specified time-frame,

## Fear and Greed Index
## from the last 14 days
tail(
  FGI <- cryptoQuotes::getFGIndex(
    from = Sys.Date() - 14
  )
)
#>            FGI
#> 2024-01-03  70
#> 2024-01-04  68
#> 2024-01-05  72
#> 2024-01-06  70
#> 2024-01-07  71
#> 2024-01-08  71

The Long-Short Ratio of a Cryptocurrency Pair in R

The Long-Short Ratio is a financial metric indicating market sentiment by comparing the number of long positions (bets on price increases) against short positions (bets on price decreases) for an asset. A higher ratio signals bullish sentiment, while a lower ratio suggests bearish sentiment, guiding traders in making informed decisions.
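
As a small worked example, the ratio can be reproduced by hand. The sketch below assumes that the reported ratio is simply the long share divided by the short share, which is consistent with the getLSRatio() output printed further down in this post; the numbers are copied from the first three rows of that output.

```r
# share of long and short positions, copied from the first
# three rows of the getLSRatio() output shown in this post
long  <- c(0.5069, 0.6219, 0.5401)
short <- c(0.4931, 0.3781, 0.4599)

# assumption: the ratio is the long share divided by the short share
ls_ratio <- long / short
round(ls_ratio, 4)
#> [1] 1.0280 1.6448 1.1744

# values above 1 mean more long than short positions
ls_ratio > 1
#> [1] TRUE TRUE TRUE
```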

The Long-Short Ratio can be retrieved by using the cryptoQuotes::getLSRatio()-function, which returns the ratio within a specified time-frame and granularity. Below is an example using the Daily Long-Short Ratio on Bitcoin (BTC),

## Long-Short Ratio
## from the last 14 days
tail(
  LSR <- cryptoQuotes::getLSRatio(
    ticker = "BTCUSDT",
    interval = '1d',
    from = Sys.Date() - 14
  )
)
#>              Long  Short LSRatio
#> 2024-01-03 0.5069 0.4931  1.0280
#> 2024-01-04 0.6219 0.3781  1.6448
#> 2024-01-05 0.5401 0.4599  1.1744
#> 2024-01-06 0.5499 0.4501  1.2217
#> 2024-01-07 0.5533 0.4467  1.2386
#> 2024-01-08 0.5364 0.4636  1.1570

Putting it all together

Even though cryptoQuotes::getLSRatio() is an asset-specific sentiment indicator, and cryptoQuotes::getFGIndex() is a general sentiment indicator, much can be learned by combining the two.

Both can be visualized by using the various charting-functions in the cryptoQuotes-package,

## get the BTCUSDT
## pair from the last 14 days
BTCUSDT <- cryptoQuotes::getQuote(
  ticker = "BTCUSDT",
  interval = "1d",
  from = Sys.Date() - 14
)
## chart the BTCUSDT
## pair with sentiment indicators
cryptoQuotes::chart(
  slider = FALSE,
  chart = cryptoQuotes::kline(BTCUSDT) %>%
    cryptoQuotes::addFGIndex(FGI = FGI) %>% 
    cryptoQuotes::addLSRatio(LSR = LSR)
)
[Figure: Bitcoin (BTC) plotted with the Fear and Greed Index alongside the Long-Short Ratio]

Installing cryptoQuotes

Installing via CRAN

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Installing via Github

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Note: The latest price may vary depending on time of publication relative to the rendering time of the document. This document was rendered at 2024-01-08 23:30 CET

Cryptocurrency Market Data in R

Getting cryptocurrency OHLCV data in R without having to depend on low-level coding using, for example, curl or httr2, has not been easy for the R community.

There is now a high-level API Client available on CRAN which fetches all the market data without having to rely on web-scrapers, API keys or low-level coding.

Bitcoin Prices in R (Example)

This high-level API-client has one main function, getQuote(), which returns cryptocurrency market data as xts- and zoo-objects. The returned objects contain Open, High, Low, Close and Volume data with different granularity, from the currently supported exchanges.

In this blog post I will show how to get hourly Bitcoin (BTC) prices in R using the getQuote()-function. See the code below,
# 1) getting hourly BTC
# from the last 3 days

BTC <- cryptoQuotes::getQuote(
 ticker   = "BTCUSDT", 
 source   = "binance", 
 futures  = FALSE, 
 interval = "1h", 
 from     = as.character(Sys.Date() - 3)
)
Bitcoin (BTC) OHLC-prices (Output from getQuote-function)
Index                   Open     High      Low    Close    Volume
2023-12-23 19:00:00 43787.69 43821.69 43695.03 43703.81 547.96785
2023-12-23 20:00:00 43703.82 43738.74 43632.77 43711.33  486.4342
2023-12-23 21:00:00 43711.33 43779.71 43661.81 43772.55  395.6197
2023-12-23 22:00:00 43772.55 43835.94 43737.85 43745.86 577.03505
2023-12-23 23:00:00 43745.86 43806.38  43701.1 43702.16 940.55167
2023-12-24          43702.15 43722.25 43606.18 43716.72 773.85301

The returned Bitcoin prices from getQuote() are compatible with quantmod and TTR, without further programming. Let me demonstrate this using chartSeries(), addBBands() and addMACD() from these powerful libraries,

# charting BTC
# using quantmod
quantmod::chartSeries(
 x = BTC,
 TA = c(
    # add bollinger bands
    # to the chart
    quantmod::addBBands(), 
    # add MACD indicator
    # to the chart
    quantmod::addMACD()
 ), 
 theme = quantmod::chartTheme("white")
)
[Figure: Charting Bitcoin prices using quantmod and TTR]

Installing cryptoQuotes

Stable version

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Development version

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Visualising connection in R workshop

Join our workshop on Visualising connection in R, which is a part of our workshops for Ukraine series! 

Here’s some more info: 

Title: Visualising connection in R
Date: Thursday, November 9th, 19:00 – 21:00 CET (Rome, Berlin, Paris timezone)
Speaker: Rita Giordano is a freelance data visualisation consultant and scientific illustrator based in the UK. By training, she is a physicist who holds a PhD in statistics applied to structural biology. She has extensive experience in research and data science. Furthermore, she has over fourteen years of professional experience working with R. She is also a LinkedIn instructor. You can find her course “Build Advanced Charts with R” on LinkedIn Learning.
Description: How do we show connections? It depends on the connection we want to visualise. We could use a network, chord, or Sankey diagram. The workshop will focus on how to visualise connections using chord diagrams. We will explore how to create a chord diagram with the {circlize} package. In the final part of the workshop, I will briefly mention how to create a Sankey diagram with {networkD3}. Attendees need to have installed the {circlize} and {networkD3} packages.
Minimal registration fee: 20 euro (or 20 USD or 800 UAH)


How can I register?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?

  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).


You can also find more information about this workshop series, a schedule of our future workshops, as well as a list of our past workshops, for which you can get the recordings & materials, here.


Looking forward to seeing you during the workshop!





Ebook launch – Simple Data Science (R)

Simple Data Science (R) covers the fundamentals of data science and machine learning. The book is beginner-friendly and has detailed code examples. It is available at Scribd.

Topics covered in the book –
  • Data science introduction
  • Basic statistics
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Regression, classification, and clustering
  • Boosting with lightGBM package
  • Hands-on projects
  • Data science use cases

Why this is the year you should take the stage at EARL 2022…

EARL is Europe’s largest R community event dedicated to showcasing commercial applications of the R language. As a conference, it has always lived up to its promise of connecting and inspiring R users with creative suggestions and solutions, sparking new ideas, solving problems and sharing perspectives to advance the community. 

2022 marks the return of face-to-face EARL (6th – 8th September at the Tower Hotel in London), now run by Ascent, the new home of Mango Solutions. Over the past eight years, EARL has attracted some fascinating presentations from some engaging, authentic speakers, both experienced and first timers. This year, we’re keen to understand how recent global events and trends that have disrupted our view of ‘normal’ have impacted, changed or driven your R projects: from inspirational innovation to reducing operational cost and creating richer customer experiences.

If you have an interesting application of R, our call for abstracts is now open and we’re inviting you to share your synopsis with us. Deadline for submissions is Thursday 30th June. Maybe you’ve built a Shiny app that helps detect bias, or you’ve been on a data journey you’d like to share. Perhaps you’ve built a data science syllabus for young minds or created an NLP tool to automate clinical processes. If you are searching for inspiration, potential applications of R might come under the following categories:
  • Responding to global events with R
  • The role of R in the business data science toolbox
  • Overcoming the challenges of using R commercially
  • Efficient R: dealing with huge data
  • Sustainable R / R for good
  • R tools & packages (e.g. Shiny, purrr)
  • Building your R community
  • Women in R
  • The future of R in enterprise: 2022 and beyond
We are also looking for short form submissions: 10-minute lightning talks on a wide range of applications.

What’s presenting at EARL really like?  

We asked our 2019 presenters what prompted their decision to speak at our last in-person EARL and their advice to others who may be considering submitting an abstract for EARL 2022.

For Mitchell Stirling, Capacity and Modelling Manager at Heathrow Airport, the opportunity to present helped fulfil a professional ambition. “I discussed with my line manager, slightly tongue in cheek, that it should be an ambition in 2019 when he signed off a conference attendance in Scotland the previous year. As the work I’d been doing developed in 2019 and the opportunity presented itself, I started to think “why not?” – this is interesting and if I can show it interestingly, hopefully others would agree. I was slightly wary of the technical nature of the event, with my exposure to coding in R still better measured in minutes than hours (never mind days) but a reassurance that people would be interested in the ‘what’ and ‘why’ as well as the ‘how’, won me over.”

Dr Zhanna Mileeva, a Data Scientist at NBrown Group, confirmed that making a contribution to the data science community was an important factor in her decision to submit an abstract: “After some research I found the EARL conference as a great cross-sector forum for R users to share Data Science, AI and ML engineering knowledge, discuss modern business problems and pathways to solutions. It was a fantastic opportunity to contribute to this community, learn from it and re-charge with some fresh ideas.”

In past years EARL has attracted speakers from across the globe and last year, Harold Selman, Lead Data Scientist at Ordina (NL) came from the Netherlands to speak at the conference. “I knew the EARL conference as a visitor and had given some presentations in The Netherlands, so I decided to give it a shot. The staff of the EARL conference are very helpful and open to questions, which made being a speaker very pleasant.”

Some of our presenters have enjoyed the experience so much they have presented more than once. Chris Billingham, Lead Data Scientist at Manchester Airport Group’s Digital Agency MAG-O, is one such speaker. “I’ve had the good fortune to present twice at EARL. I saw it as an opportunity to challenge myself to present at the biggest R conference in the UK.”

How to submit your abstract. 

Feeling inspired? You can find the abstract submission form on our website. Here are our recommendations for a successful submission.
  • Topic: Your topic can relate to any real-world application of R. We aim to represent a range of industry sectors and a balance of technical and strategic content.
  • Clarity: The talk synopsis should provide an overview of the topic and why you believe it will be of interest or resonate with the audience. We suggest an introduction or problem statement alongside any supporting facts that determine the talk objectives or expected takeaways.
  • Storytelling: Aim to demonstrate how the tools and techniques you used helped to transform and translate value with a clear and compelling narrative.
  • Approval: Before you submit, it’s a good idea to ensure your application has been approved by your wider organisation and/or team.
  • Novel: Is the application particularly new or innovative? If your application of R is new or distinctive and not widely written about in the industry, please provide as much supporting information as you can for review purposes.
  • Target audience: 34% of our attendees are R practitioners and 46% of delegates typically have senior or leadership roles – consider the alignment of your proposal with these audiences.
We hope these hints and tips have been helpful, but feel free to get in touch if you have any questions.

EARL your way: book your tickets now!

Your EARL tickets are now live to purchase here. Offering you every possible EARL ticket combination, here is a quick summary of what you can expect. You can simply choose a 3-day jam-packed conference pass or a 1 or 2-day option to customise an itinerary that works for you.

Grab your EARLy bird tickets right away. For a period of 2 weeks only, we are delighted to be offering an unlimited number of tickets at a 15-25% discount on all ticket options, depending on whether you are NHS, not-for-profit or an academic.

Team networking.

Why not bring your colleagues along for a much needed team social at the largest commercial R event in the UK? Offering lots of networking opportunities from brands in similar markets – there will be plenty of time to swap market experiences, over coffee, at lunch or at our evening reception. We are certainly proud to be a part of such an enthusiastic community.

Full or half day workshop on day 1.

We are running a 1-day series of workshops to kick off EARL on 6th September, covering all areas of R from explainable machine learning, to time series visualisation, functional programming with purrr, an introduction to plumber APIs to having some fun and making games in Shiny. There is plenty of choice, with morning and afternoon sessions on the agenda.

Full conference pass.

Our all-access pass to EARL gives you full access to a 1-day workshop, a full 2-day conference pass and access to the evening reception on day 2 at the unforgettable Drapers Hall, the former home of Henry VIII. We have got an impressive line-up of keynotes, including mathematician, science presenter and all-round badass Hannah Fry, Top 100 Global Innovator in Data & Analytics Harry Powell, and the unmissable Financial Times columnist John Burn-Murdoch. To add to this excitement, we have approved use cases from Bumble, Samaritans, BBC, Meta, Bank of England, Dogs Trust, NHS, and partners RStudio, alongside many more.

1 or 2-day conference pass.

If you would like access to the keynotes, session talks and abundance of networking opportunities, you can choose from a 1 or 2-day pass aligned to your areas of interest. The 2-day conference pass gives you access to the main evening reception.

Evening reception.

This year we have opted for an unforgettable experience at Drapers Hall (the former home of Henry VIII), where you will be able to network with colleagues, delegates and speakers over drinks, canapes, and dinner in unforgettable surroundings. Transport is provided by a London red bus transfer. This year promises an unforgettable experience, with a heavyweight line-up, use cases from leading brands and the opportunity at last to share and network to your heart’s content. We look forward to meeting you. Book your tickets now.

Detecting multicollinearity — it’s not that easy sometimes

By Huey Fern Tay with Greg Page

When are two variables too related to one another to be used together in a linear regression model? Should the maximum acceptable correlation be 0.7? Or should the rule of thumb be 0.8? There is actually no single, ‘one-size-fits-all’ answer to this question.

As an alternative to using pairwise correlations, an analyst can examine the variance inflation factor, or VIF, associated with each of the numeric input variables. Sometimes, however, pairwise correlations, and even VIF scores, do not tell the entire picture.

Consider this correlation matrix created from a Los Angeles Airbnb dataset.



Two item pairs identified in the correlation matrix above have a strong correlation value:

· Beds + log_accommodates (r= 0.701)

· Beds + bedrooms (r= 0.706)

Based on one school of thought, these correlation values are cause for concern; meanwhile, other sources suggest that such values are nothing to be worried about.

The variance inflation factor, which is used to detect the severity of multicollinearity, does not suggest anything unusual either.

library(car)
vif(model_test) 

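The VIF for each potential input variable is found by making a separate linear regression model, in which the variable being scored serves as the outcome, and the other numeric variables are the predictors. The VIF score for each variable is found by applying this formula:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the r-squared of the regression model that predicts variable $j$ from the other numeric inputs.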

When the other numeric inputs explain a large percentage of the variability in a given variable, that variable will have a high VIF. Some sources will tell you that any VIF above 5 means that a variable should not be used in a model, whereas other sources will say that VIF values below 10 are acceptable. None of the vif() results here appear to be problematic, based on standard cutoff thresholds.
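
The calculation described above (regress one input on the others, then take 1 / (1 - R²)) can be sketched in base R. This is a minimal illustration, not the code behind the results in this post: the built-in mtcars data stands in for the Airbnb data, the choice of columns is arbitrary, and vif_by_hand() is a hypothetical helper.

```r
# assumption: mtcars (shipped with R) stands in for the Airbnb data
inputs <- mtcars[, c("disp", "hp", "wt", "drat")]

# VIF of one input: regress it on the remaining inputs,
# then apply 1 / (1 - R^2)
vif_by_hand <- function(data, variable) {
  others <- setdiff(names(data), variable)
  fit <- lm(reformulate(others, response = variable), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(inputs), function(v) vif_by_hand(inputs, v))
vifs

# the formula can be inverted to recover each R^2 from its VIF
r_squared <- 1 - 1 / vifs
r_squared
```

For a linear model with only these numeric inputs, car::vif() should report the same values. The last step is the algebraic manipulation that turns a reported VIF back into the r-squared of the corresponding auxiliary regression.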

Based on the vif() results shown above, plus some algebraic manipulation of the VIF formula, we can know that a model that predicts beds as the outcome variable, using log_accommodates, bedrooms, and bathrooms as the inputs, has an r-squared of just a little higher than 0.61. That is verified with the model shown below:



But look at what happens when we build a multi-linear regression model predicting the price of an Airbnb property listing.



The model summary hints at a problem because the coefficient for beds is negative. The proper interpretation for each coefficient in this linear model is the way that log_price will be impacted by a one-unit increase in that input, with all other inputs held constant.

Literally, then, this output indicates that having more beds within a house or apartment will make its rental value go down, when all else is held constant. That not only defies our common sense, but it also contradicts something that we already know to be the case — that bed number and log_price are positively associated. Indeed, the correlation matrix shown above indicates a moderately-strong linear relationship between these values, of 0.4816.

After dropping ‘beds’ from the original model, the adjusted R-squared declines only marginally, from 0.4878 to 0.4782.



This tiny decline in adjusted r-squared is not worrisome at all. The very low p-value associated with this model’s F-statistic indicates a highly significant overall model. Moreover, the signs of the coefficients for each of these inputs are consistent with the directionality that we see in the original correlation matrix.

Moreover, we still need to include other important variables that determine real estate pricing e.g. location and property type. After factoring in these categories along with other considerations such as pool availability, cleaning fee, and pet-friendly options, the model’s adjusted R-squared value is pushed to 0.6694. In addition, the residual standard error declines from 0.5276 in the original model to 0.4239.

Long story short: we cannot be completely reliant on rules of thumb, or even cutoff thresholds from textbooks, when evaluating the multicollinearity risk associated with specific numeric inputs. We must also examine model coefficients’ signs. When a coefficient’s sign “flips” from the direction that we should expect, based on that variable’s correlation with the response variable, that can also indicate that our model coefficients are impacted by multicollinearity.
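
A sign check of this kind can be sketched in base R. The example below uses simulated data rather than the Airbnb data, and all variable names are hypothetical. By construction, x2 is strongly correlated with x1 and is positively correlated with the response, yet it gets a negative coefficient once x1 is in the model.

```r
set.seed(42)
n <- 1000

# simulated data (not the Airbnb data): x2 is strongly correlated with x1
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)
y  <- x1 - 0.3 * x2 + 0.5 * rnorm(n)
sim <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = sim)

# compare the sign of each coefficient with the sign of the
# input's pairwise correlation with the response
check <- data.frame(
  correlation = sapply(sim[c("x1", "x2")], cor, y = sim$y),
  coefficient = coef(fit)[c("x1", "x2")]
)
check$sign_flip <- sign(check$correlation) != sign(check$coefficient)
check
```

A flagged row is a prompt to look closer, not proof of a problem on its own. In this simulation the negative partial effect is built into the data, whereas with real data the analyst has to judge whether the flipped sign is plausible.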

Data source: Inside Airbnb

Download recently published book – Learn Data Science with R

Learn Data Science with R is for learning the R language and data science. The book is beginner-friendly and easy to follow. It is available for download as pay what you want. The minimum price is 0 and the suggested contribution is rs 1300 ($18). Please review the book at Goodreads.

The book topics are –
  • R Language
  • Data Wrangling with data.table package
  • Graphing with ggplot2 package
  • Exploratory Data Analysis
  • Machine Learning with caret package
  • Boosting with lightGBM package
  • Hands-on projects