r – R-posts.com

RStudio AI That Doesn’t Cost a Penny: llmcoder vs. Posit AI Assistant

Introduction

If you’re an R user, you’ve probably experienced these moments:

You’re writing code and forgot the exact syntax for a function
Your code throws an error and you’re staring at a confusing error message
You have a block of code but want to understand what it does in plain English
You want to chat with an AI assistant about your data analysis, but don’t want to leave RStudio

llmcoder is an RStudio addin that solves all of these problems by integrating Large Language Model (LLM) assistance directly into your RStudio workflow, and more importantly, it’s FREE!

In this post, I’ll show you how llmcoder can speed up your R coding and make your workflow smoother.

Watch a quick demo of llmcoder in action:

https://youtu.be/SRzjaURbKCw

Installation

You can install llmcoder from GitHub:

# Install remotes if you haven't already
install.packages("remotes")

# Install llmcoder
remotes::install_github("ShiyangZheng/llmcoder")

Load the package:

library(llmcoder)

Feature 1: Generate R Code from Inline Comments

Ever wish you could just type what you want in plain English and get R code instantly?

How to use:

Type a comment describing what you want
Place your cursor on that line
Use the Addins menu and select “Generate Code from Comment”

Example:

# Load the mtcars dataset and create a scatter plot of mpg vs wt, colored by number of cylinders

After running the addin, the comment is replaced with:

library(ggplot2)
data(mtcars)
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.8) +
  labs(
    title = "Fuel Efficiency vs Weight by Cylinder Count",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()

No more switching to ChatGPT or copying code from Stack Overflow!

Feature 2: Fix Console Errors with LLM Assistance

We’ve all been there – a cryptic error message and you’re not sure what went wrong.

How to use:

Run code that produces an error
The error appears in the console
Use the Addins menu and select “Fix Error with LLM”

Example:

library(dplyr)
data %>%
  filter(cyl == 4) %>%
  summary()
# Error: object 'data' not found

llmcoder captures the error and sends it to the LLM, which returns an explanation and suggests:

mtcars %>% filter(cyl == 4) %>% summary()

Feature 3: Explain Selected Code in Plain English

Sometimes you inherit code from a colleague or find a Stack Overflow answer and want to understand what it does.

How to use:

Select a block of code in the editor
Use the Addins menu and select “Explain Code”

Example:

mtcars %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg, na.rm = TRUE),
    sd_mpg = sd(mpg, na.rm = TRUE),
    count = n()
  ) %>%
  arrange(desc(mean_mpg))

llmcoder returns:

Takes the built-in mtcars dataset
Groups the data by the number of cylinders (cyl)
Calculates the mean and standard deviation of miles per gallon (mpg) for each group
Arranges the results in descending order of mean fuel efficiency

Feature 4: Multi-Turn Chat Panel with Session Context

This is the flagship feature. llmcoder includes a Chat Panel that understands your current R session.

How to open: Use the Addins menu and select “Open Chat Panel”

What makes it special?

The Chat Panel is session-aware:

It knows which packages you have loaded
It knows what objects are in your global environment
It can read the contents of your current script
It has access to your recent console history

Example conversation:

You: What’s the correlation between mpg and wt in mtcars?

AI: The correlation between mpg and wt in the mtcars dataset is -0.87, indicating a strong negative relationship. As weight increases, fuel efficiency decreases.

cor(mtcars$mpg, mtcars$wt, use = "complete.obs")

Want to see the Chat Panel in action? Watch this demo:
https://youtu.be/zP-RuCN3q14

Supported LLM Providers

llmcoder supports multiple LLM providers – you can choose the one that works best for you:

Provider	API Key	Notes
OpenAI (GPT-4/3.5)	Yes	Most popular
Anthropic (Claude)	Yes	Great for long conversations
DeepSeek	Yes	Cost-effective
Groq	Yes	Very fast inference
Together AI	Yes	Open-source models
OpenRouter	Yes	Access multiple models
Ollama	No	Fully local, no API key!
Custom endpoint	Yes	LM Studio, vLLM, llama.cpp

Privacy note: If you use Ollama, all processing happens locally on your machine. No data is sent to external servers.

Customization: Choose Your Prompt Style

The Chat Panel allows you to select different prompt styles:

General Assistant: Best for general questions
R Code Helper: Focuses on writing clean, idiomatic R code
Statistics Advisor: Helps with statistical concepts and test selection
Research (Psycho): Tailored for psycholinguistics researchers

Why llmcoder?

There are many AI coding assistants out there (Copilot, Cursor, etc.), so why llmcoder?

Native RStudio integration: No need to switch to another app or browser tab
Session-aware: The LLM knows what you’re working on
Multiple LLM providers: Choose the one you prefer (or use a local model for privacy)
Open source: MIT license, free to use and modify
Designed for R users: Not a generic coding assistant – it understands R-specific workflows

Call to Action

Ready to try llmcoder?

remotes::install_github("ShiyangZheng/llmcoder")

GitHub: https://github.com/ShiyangZheng/llmcoder

If you encounter any bugs or have feature requests, please file an issue: https://github.com/ShiyangZheng/llmcoder/issues

Star the repo if you find it useful!

About the Author

Shiyang Zheng is a PhD student in Psycholinguistics at the University of Nottingham. His research focuses on idiom acquisition and computational modeling. He built llmcoder to make R coding easier for himself and the R community.

GitHub: @ShiyangZheng
Academic website: shiyangzheng.top
ORCID: 0000-0003-0511-4683

{SLmetrics}: scalable and memory efficient AI/ML performance evaluation in R

On December 3rd, 2024, a post about the release of {SLmetrics} was published. Today, January 11th, 2025, version 0.3-1 has been released and comes with many new features. Among these are weighted classification and regression metrics, OpenMP support and a wide array of new evaluation metrics.

In this blog post, I will benchmark {SLmetrics} and demostrate how it compares to the similar R packages {MLmetrics} and {yardstick} in terms execution time and memory efficiency – essential determinants for scalability and efficiency.

Benchmark Function

To run the benchmark of {SLmetrics}, {MLmetrics} and {yardstick}, I will use {bench} which measures the median execution time and memory efficiency. Below I have created a wrapper function:

## benchmark function
benchmark <- function(
  ..., 
  m = 10) {
  library(magrittr)
  # 1) create list
  # for storing values
  performance <- list()

  for (i in 1:m) {

     # 1) run the benchmarks
    results <- bench::mark(
      ...,
      iterations = 10,
      check = FALSE
    )

    # 2) extract values
    # and calculate medians
    performance$time[[i]]  <- setNames(
        lapply(results$time, mean), 
        results$expression
        )

    performance$memory[[i]] <- setNames(
        lapply(results$memory, function(x) {
             sum(x$bytes, na.rm = TRUE)}
             ), results$expression)

    performance$n_gc[[i]] <- setNames(
        lapply(results$n_gc, sum), results$expression
        )

  }

  purrr::pmap_dfr(
  list(performance$time, performance$memory, performance$n_gc), 
  ~{
    tibble::tibble(
      expression = names(..1),
      time = unlist(..1),
      memory = unlist(..2),
      n_gc = unlist(..3)
    )
  }
) %>%
  dplyr::mutate(expression = factor(expression, levels = unique(expression))) %>%
  dplyr::group_by(expression) %>%
  dplyr::filter(dplyr::row_number() > 1) %>%
  dplyr::summarize(
    execution_time = bench::as_bench_time(median(time)),
    memory_usage = bench::as_bench_bytes(median(memory)),
    gc_calls = median(n_gc),
    .groups = "drop"
  )

}

The wrapper function runs 10 x 10 benchmarks of each passed function – it discards the first run to allow the functions to warm up, before the benchmarks are recorded.

All values are averaged across runs and then presented as the median runtime, median memory usage and median number of gc()-calls during the benchmark.

Benchmarking {SLmetrics}

Bechmarking with and without OpenMP

In the first set of benchmarks, I will demonstrate the new OpenMP feature that has been shipped with version 0.3-1. For the benchmark, we will compare the execution time and memory efficiency of computing a 3×3 confusion matrix on two vectors of length 10,000,000 with and without OpenMP. The source code and results are shown below:

## 1) set seed
set.seed(1903)

## 2) define values
## for classes
actual <- factor(sample(letters[1:3], 1e7, TRUE))
predicted <- factor(sample(letters[1:3], 1e7, TRUE))

## 3) benchmark with OpenMP
SLmetrics::setUseOpenMP(TRUE)
#> OpenMP usage set to: enabled

benchmark(`{With OpenMP}` = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression    execution_time memory_usage gc_calls
#>   <fct>               <bch:tm>    <bch:byt>    <dbl>
#> 1 {With OpenMP}            1ms           0B        0

## 4) benchmark without OpenMP
SLmetrics::setUseOpenMP(FALSE)
#> OpenMP usage set to: disabled

benchmark(`{Without OpenMP}`  = SLmetrics::cmatrix(actual, predicted))
#> # A tibble: 1 × 4
#>   expression       execution_time memory_usage gc_calls
#>   <fct>                  <bch:tm>    <bch:byt>    <dbl>
#> 1 {Without OpenMP}         6.27ms           0B        0

The confusion matrix is computed in less than a millisecond and around six milliseconds with and without OpenMP, respectively. In both cases, it uses zero or near-zero memory.

Benchmarking against {MLmetrics} and {yardstick}

In the second set of benchmarks, I will compare the execution time and memory efficiency of {SLmetrics} against {MLmetrics} and {yardstick}. The source code and results are shown below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 1e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         6.34ms           0B        0
#> 2 {MLmetrics}       344.13ms        381MB       19
#> 3 {yardstick}       343.75ms        381MB       19

{SLmetrics} is roughly 60 times faster than both, and significantly more memory efficient as demonstrated by memory_usage and gc_calls. In this perspective, {SLmetrics} is more efficient and scalable than both packages as the memory usage is basically linear. See below:

## 1) define classes
set.seed(1903)
fct_actual    <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))
fct_predicted <- factor(sample(letters[1:3], size = 2e7, replace = TRUE))

## 2) perform benchmark
benchmark(
    `{SLmetrics}` = SLmetrics::cmatrix(fct_actual, fct_predicted),
    `{MLmetrics}` = MLmetrics::ConfusionMatrix(fct_predicted, fct_actual),
    `{yardstick}` = yardstick::conf_mat(table(fct_actual, fct_predicted))
)
#> # A tibble: 3 × 4
#>   expression  execution_time memory_usage gc_calls
#>   <fct>             <bch:tm>    <bch:byt>    <dbl>
#> 1 {SLmetrics}         12.3ms           0B        0
#> 2 {MLmetrics}        648.5ms        763MB       19
#> 3 {yardstick}        654.7ms        763MB       19

{SLmetrics} can process 60x the data in the same time it takes {MLmetrics} and {yardstick} to process 40,000,000 data-points – without any additional memory cost.

Summary

The benchmarks suggests that {SLmetrics} is a strong contender to the more established packages {MLmetrics} and {yardstick} in terms of scalability, memory efficiency and speed.

Installing {SLmetrics}

{SLmetrics} is still under development and is therefore not on CRAN. But the latest release can be installed using {devtools}. A development version is also available for those living on the edge. See below:

Stable version

## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref  = 'main'
)

Development version

## install development version
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics',
  ref  = 'development'
)

If you made it this far: Thank you for reading the blog post, and feel free to leave a comment here or in the repository.

Add shiny in quarto blog with shinylive

Shiny, without server

In previous article, I introduced method to share shiny application in static web page (github page)

At the core of this method is a technology called WASM, which is a way to load and utilize R and Shiny-related libraries and files that have been converted for use in a web browser. The main problem with wasm is that it is difficult to configure, even for R developers.

Of course, there was a way called shinylive, but unfortunately it was only available in python at the time.

Fortunately, after a few months, there is an R package that solves this configuration problem, and I will introduce how to use it to add a shiny application to a static page.

shinylive

shinylive is R package to utilize wasm above shiny. and now it has both Python and R version, and in this article will be based on the R version.

shinylive is responsible for generating HTML, Javascript, CSS, and other elements needed to create web pages, as well as wasm-related files for using shiny.

You can see examples created with shinylive at this link.

Install shinylive

While shinylive is available on CRAN, it is recommended to use the latest version from github as it may be updated from time to time, with the most recent release being 0.1.1. Additionally, pak is the recently recommended R package for installing R packages in posit, and can replace existing functions like install.packages() and remotes::install_github().

# install.packages("pak")

pak::pak("posit-dev/r-shinylive")

You can think of shinylive as adding a wasm to an existing shiny application, which means you need to create a shiny application first.

For the example, we’ll use the code provided by shiny package (which you can also see by typing shiny::runExample("01_hello") in the Rstudio console).

library(shiny)

ui <- fluidPage(

titlePanel("Hello Shiny!"),
  sidebarLayout(
  sidebarPanel(
    sliderInput(
      inputId = "bins",
      label = "Number of bins:",
      min = 1,
      max = 50,
      value = 30
    )
  ),
  mainPanel(
    plotOutput(outputId = "distPlot")
    )
  )
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
  x <- faithful$waiting
  bins <- seq(min(x), max(x), length.out = input$bins + 1)
  hist(x,
    breaks = bins, col = "#75AADB", border = "white",
    xlab = "Waiting time to next eruption (in mins)",
    main = "Histogram of waiting times"
  )
  })
}

shinyApp(ui = ui, server = server)

This code creates a simple shiny application that creates a number of histograms in response to the user’s input, as shown below.

There are two ways to create a static page with this code using shinylive, one is to create it as a separate webpage (like previous article) and the other is to embed it as internal content on a quarto blog page .

First, here’s how to create a separate webpage.

shinylive via web page

To serve shiny on a separate static webpage, you’ll need to convert your app.R to a webpage using the shinylive package you installed earlier.

Based on creating a folder named shinylive in my Documents(~/Documents) and saving `app.R` inside it, here’s an example of how the export function would look like

shinylive::export('~/Documents/shinylive', '~/Documents/shinylive_out')

When you run this code, it will create a new folder called shinylive_out in the same location as shinylive, (i.e. in My Documents), and inside it, it will generate the converted wasm version of shiny code using the shinylive package.

If you check the contents of this shinylive_out folder, you can see that it contains the webr, service worker, etc. mentioned in the previous post.

More specifically, the export function is responsible for adding the files from the local PC’s shinylive package assets, i.e. the library files related to shiny, to the out directory on the local PC currently running R studio.

Now, if you create a github page or something based on the contents of this folder, you can serve a static webpage that provides shiny, and you can preview the result with the command below.

httpuv::runStaticServer("~/Documents/shinylive_out")

shinylive in quarto blog

To add a shiny application to a quarto blog, you need to use a separate extension. The quarto extension is a separate package that extends the functionality of quarto, similar to using R packages to add functionality to basic R.

First, we need to add the quarto extension by running the following code in the terminal (not a console) of Rstudio.

quarto add quarto-ext/shinylive

You don’t need to create a separate file to plant shiny in your quarto blog, you can use a code block called {shinylive-r}. Additionally, you need to set shinylive in the yaml of your index.qmd.

filters: 
- shinylive

Then, in the {shinylive-r} block, write the contents of the app.R we created earlier.

#| standalone: true
#| viewerHeight: 800
library(shiny)
ui <- fluidPage(
  titlePanel("Hello Shiny!"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(
        inputId = "bins",
        label = "Number of bins:",
        min = 1,
        max = 50,
        value = 30
      )
    ),
    mainPanel(
      plotOutput(outputId = "distPlot")
    )
  )
)
server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- faithful$waiting
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x,
      breaks = bins, col = "#75AADB", border = "white",
      xlab = "Waiting time to next eruption (in mins)",
      main = "Histogram of waiting times"
    )
  })
}
shinyApp(ui = ui, server = server)

after add this in quarto blog, you may see working shiny application.

You can see working example in this link

Summary

shinylive is a feature that utilizes wasm to run shiny on static pages, such as GitHub pages or quarto blogs, and is available as an R package and quarto extension, respectively.

Of course, since it is less than a year old, not all features are available, and since it uses static pages, there are disadvantages compared to utilizing a separate shiny server.

However, it is very popular for introducing shiny usage and simple statistical analysis, and you can practice it right on the website without installing R, and more features are expected to be added in the future.

The code used in blog (previous example link) can be found at the link.

Author: jhk0530

Build serverless shiny application via Github page

Simple guide for simple shiny application

TL;DR

I made shiny application in github page with quarto.

You can check code in my github repository and result, result2

How we use shiny

Shiny is R package to make user utilize R with web browser without install it.

So my company utilizes shiny to provide statistical analysis for doctors (who don’t know R but need statistics).

Behind shiny

As you know, shiny is consisted with 2 part. UI and Server

You may think just UI is channel to both get input (data) from user and return calculated output (result) to user.
and server is just calculator

It means, server requires dynamic calculation that may change, not fixed contents (it called as static web page)

To achieve dynamic calculation, there are several options.

We can use shinyapps.io, posit connect, or deploy own shiny server in other cloud like AWS / azure / GCP …

These options can be categorized into two main categories: free but with limited features, or feature-rich but paid.

There is no single right answer, but I use shinyapps.io in see toy level project or deploy using shiny server in company’s cloud server which is not just toy level.

The rise of webR

Recent, webassembly (wasm) has emerged. that is use programming language in web browser (like Chrome) without install it (via javascript)

As far as I know, webR (R version of wasm) is built from late 2022 and some Examples are being shared to make R available on the web.

I understand logic for webR like just below figure. (but understanding is not necessary to run)

Shiny with wasm

For shiny, there is already wasm application called shinylive. but it utilizes shiny for Python.

Personally, I’m not familiar with this. Since I used R for a long time
so I wasn’t interested in this.

but very recent, The article has been shared with appsilon’s shiny weekly newsletter.

and Leemput explains how to implant webR and shiny application in WordPress very kindly.

Since wordpress provides a static page service, this means that shiny can be planted using the github page I’m familiar with. (There are examples with netlify. so I think static page service like Vercel, Notion, Firebase or even medium may use webR)

The logic is just below. (as I understand)

note, Main difference with webR and shiny wasm is service worker

Let’s Build it

To build serverless shiny application with github page, we need 3 + 1 things.

1. HTML contents (button to show status of wasm and iframe for shiny applicaiton)

2. shiny code (app.R, we’ll utilize pre-made and publicly avaiable app)

3. javascript to run service worker (web worker + serivce worker)

and hard thing.

4. configuration for github page to utilize service worker via proxy.

we can utilize the resources provided by Leemput. (HTML contents and javascript code)

so let’s make index.qmd like below (you can check in my repo too)

Important: Change html to “{=html}”. I changed it since it breaks wordpress site.

---
title: "serverless shiny with github page"
include-in-header:
text: | 
  <script> type='application/javascript' src = 'enable-threads.js' </script>
---

```html
<button class="btn btn-success btn-sm" type="button" style="background-color: dodgerblue" id="statusButton">
  <i class="fas fa-spinner fa-spin"></i>
  Loading webR...
</button>
<div id="iframeContainer"></div>
<script defer src="<https://use.fontawesome.com/releases/v5.15.4/js/all.js>" integrity="sha384-rOA1PnstxnOBLzCLMcre8ybwbTmemjzdNlILg8O7z1lUkLXozs4DHonlDtnE7fpc" crossorigin="anonymous"></script>
<script type="module">
  import { WebR } from '<https://webr.r-wasm.org/latest/webr.mjs>';
  const webR = new WebR();
  // TODO
  const shinyScriptURL = '<https://raw.githubusercontent.com/rstudio/shiny/main/inst/examples/01_hello/app.R>'
  const shinyScriptName = 'app.R'
  let webSocketHandleCounter = 0;
  let webSocketRefs = {};
  const loadShiny = async () => {
    try {
      document.getElementById('statusButton').innerHTML = `
        <i class="fas fa-spinner fa-spin"></i>
        Setting up websocket proxy and register service worker`;
      class WebSocketProxy {
        url;
        handle;
        bufferedAmount;
        readyState;
        constructor(_url) {
          this.url = _url
          this.handle = webSocketHandleCounter++;
          this.bufferedAmount = 0;
          this.shelter = null;
          webSocketRefs[this.handle] = this;
          webR.evalRVoid(`
                        onWSOpen <- options('webr_httpuv_onWSOpen')[[1]]
                        if (!is.null(onWSOpen)) {
                          onWSOpen(${this.handle},list(handle = ${this.handle}))
                        }`)
          setTimeout(() => {
            this.readyState = 1;
            this.onopen()},
            0);
        }
        async send(msg) {
          webR.evalRVoid(`
          onWSMessage <- options('webr_httpuv_onWSMessage')[[1]]
          if (!is.null(onWSMessage)) {onWSMessage(${this.handle}, FALSE, '${msg}')}
          `)
        }
      }
      await webR.init();
      console.log('webR ready');
      (async () => {
        for (; ;) {
          const output = await webR.read();
          switch (output.type) {
            case 'stdout':
              console.log(output.data)
              break;
            case 'stderr':
              console.log(output.data)
              break;
            case '_webR_httpuv_TcpResponse':
              const registration = await navigator.serviceWorker.getRegistration();
              registration.active.postMessage({
                type: "wasm-http-response",
                uuid: output.uuid,
                response: output.data,
              });
              break;
            case '_webR_httpuv_WSResponse':
              const event = { data: output.data.message };
              webSocketRefs[output.data.handle].onmessage(event);
              console.log(event)
              break;
          }
        }
      })();
      // TODO
      const registration = await navigator.serviceWorker.register('/wasmR/httpuv-serviceworker.js', { scope: '/wasmR/' }).catch((error) => {
      console.error('Service worker registration error:', error);
      });
      if ('serviceWorker' in navigator) {
        navigator.serviceWorker.getRegistration()
          .then((registration) => {
            if (registration) {
              const scope = registration.scope;
              console.log('Service worker scope:', scope);
            } else {
              console.log('No registered service worker found.');
            }
          })
          .catch((error) => {
            console.error('Error retrieving service worker registration:', error);
          });
      } else {
        console.log('Service workers not supported.');
      }
      await navigator.serviceWorker.ready;
      window.addEventListener('beforeunload', async () => {
        await registration.unregister();
      });
      console.log("service worker registered");
      document.getElementById('statusButton').innerHTML = `
        <i class="fas fa-spinner fa-spin"></i>
        Downloading R script...
      `;
      await webR.evalR("download.file('" + shinyScriptURL + "', '" + shinyScriptName + "')");
      console.log("file downloaded");
      document.getElementById('statusButton').innerHTML = `
        <i class="fas fa-spinner fa-spin"></i>
        Installing packages...
      `;
      await webR.installPackages(["shiny", "jsonlite"])
      document.getElementById('statusButton').innerHTML = `
        <i class="fas fa-spinner fa-spin"></i>
        Loading app...
      `;
      webR.writeConsole(`
          library(shiny)
          runApp('` + shinyScriptName + `')
      `);
      // Setup listener for service worker messages
      navigator.serviceWorker.addEventListener('message', async (event) => {
        if (event.data.type === 'wasm-http-fetch') {
          var url = new URL(event.data.url);
          var pathname = url.pathname.replace(/.*\\/__wasm__\\/([0-9a-fA-F-]{36})/, "");
          var query = url.search.replace(/^\\?/, '');
          webR.evalRVoid(`
                     onRequest <- options("webr_httpuv_onRequest")[[1]]
                     if (!is.null(onRequest)) {
                       onRequest(
                         list(
                           PATH_INFO = "${pathname}",
                           REQUEST_METHOD = "${event.data.method}",
                           UUID = "${event.data.uuid}",
                           QUERY_STRING = "${query}"
                         )
                       )
                     }
                     `);
        }
      });
      // Register with service worker and get our client ID
      const clientId = await new Promise((resolve) => {
        navigator.serviceWorker.addEventListener('message', function listener(event) {
          if (event.data.type === 'registration-successful') {
            navigator.serviceWorker.removeEventListener('message', listener);
            resolve(event.data.clientId);
            console.log("event data:")
            console.log(event.data)
          }
        });
        registration.active.postMessage({ type: "register-client" });
      });
      console.log('I am client: ', clientId);
      console.log("serviceworker proxy is ready");
      // Load the WASM httpuv hosted page in an iframe
      const containerDiv = document.getElementById('iframeContainer');
      let iframe = document.createElement('iframe');
      iframe.id = 'app';
      iframe.src = `./__wasm__/${clientId}/`;
      iframe.frameBorder = '0';
      iframe.style.width = '100%';
      iframe.style.height = '600px'; // Adjust the height as needed
      iframe.style.overflow = 'auto';
      containerDiv.appendChild(iframe);
      // Install the websocket proxy for chatting to httpuv
      iframe.contentWindow.WebSocket = WebSocketProxy;
      document.getElementById('statusButton').innerHTML = `
          <i class="fas fa-check-circle"></i>
          App loaded!
      `;
      document.getElementById('statusButton').style.backgroundColor = 'green';
      console.log("App loaded!");
    } catch (error) {
      console.log("Error:", error);
      document.getElementById('statusButton').innerHTML = `
        <i class="fas fa-times-circle"></i>
        Something went wrong...
      `;
      document.getElementById('statusButton').style.backgroundColor = 'red';
    }
  };
  loadShiny();
</script>
```

note that, there is 4 code that you must notice.

Line 3–4 add header to enable-thread.js : github page has some permissions-policy (CORB / COOP / COEP) that blocks resource from other source page. so add this to enable it.
Line 7 add “=html” to HTML code in quarto (not just show it)
First TODO set app.R code via URL: I tried to include in repo and call it like repo/app.R, but it didn’t work. so upload app.R in your repo and call it with raw file URL
Last TODO register service worker along your github page:
in registration, above code use /wasmR/httpuv-serviceworker.js and scope /wasmR/ but you must change to wasmR as your repository name.

after complete index.qmd. render it to index.html

but result will not show in localhost.

Remain step is so easy.

just commit your work to repository,
and build github page with that.
In page setting, do not use /docs, just / (root) only worked for me. even set quarto project to render output in /docs

Final repository structure

/repo
  - index.qmd
  - index.html
  - /index_files 
  - enable-thread.js
  - httpuv-serviceworker.js
  - …. (readme.md and so on)

bold is essential file and js file should download from link 1, 2.

Summary

We’ve seen how to deploy shiny as a github page with a simple example.

Let’s summarize some of the pros and cons of this method.

Pros

You can deploy simple shinyapp with static page (github page)
You don’t need to consider cost / scale / performance since wasm uses client (User) ‘s PC.
You can extend your shiny app with other framework like react, vue, tailwind… since shinyapp only requires just iframe and javascript code.
You don’t need to consider deploy. (Github will do that)

Cons

webR is in really really really earlier stage. so it doesn’t have much references, resources to refer.
wasm shiny application requires time to initiate webR and shiny in chrome (this is critical, it takes so much time sometime randomly)
(as Leemput already mentioned) why shiny? we can just use common web framework as UI and input, then utilize just webR not shiny.
Heavier work (like file I/O) doesn’t supports yet in wasm shiny.

Future work?

I think some can be improved.

use javascript as separate file (in qmd’s module script) : so quarto only requires iframe and button
use app.R via repo not URL
render quarto page to /docs not root: with this, quarto blog can use wasm shiny application well.
research about what can be done or not via wasm shiny application. )I checked file upload / download can’t)

Other ways to use webR

You may note that, there are other options to build shiny webR using golem framework. (I’ll not brief them)

There is quarto template use webR (not supports shiny yet)

https://github.com/coatless/quarto-webr

Thanks to community

George Stagg for httpuv-serviceworker.js
Joseph rocca for enable-thread.js
Veerle van Leemput for kind introduction and help

If you have question or some ideas. Let’s talk!

An Introduction to Bayesian A/B Testing in Stan, R, and Python workshop

Join our workshop on Introduction to An Introduction to Bayesian A/B Testing in Stan, R, and Python, which is a part of our workshops for Ukraine series!

Here’s some more info:

Title: An Introduction to Bayesian A/B Testing in Stan, R, and Python

Date: Thursday, September 14th, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone)

Speaker: Jordan Nafa is a Data Scientist at Game Data Pros, Inc. where his work centers around Bayesian A/B Testing, stochastic optimization, and applied causal inference for revenue optimization, promotional pricing, and personalized targeting in video games. He is also a Ph.D. Candidate in Political Science at the University of North Texas where he previously taught undergraduate courses in causal inference, applied statistics, and American political behavior.

Description: This workshop will cover a basic introduction to Bayesian inference, A/B Testing, and decision theory for the analysis of large-scale field experiments in industry settings. After introducing the foundations of the Bayesian approach to A/B Testing, we will work through real-world examples using the probabilistic programming language Stan along with its R and Python interfaces.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

How can I register?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/3PFxtNA and donate at least 20 euro. Feel free to donate more if you can, all proceeds go directly to support Ukraine.

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.

How can I sponsor a student?

Go to https://bit.ly/3wvwMA6 or https://bit.ly/3PFxtNA and donate at least 20 euro (or 17 GBP or 20 USD or 800 UAH). Feel free to donate more if you can, all proceeds go to support Ukraine!

Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).

You can also find more information about this workshop series, a schedule of our future workshops as well as a list of our past workshops which you can get the recordings & materials here.

Looking forward to seeing you during the workshop!

REDCapDM: a package to access and manage REDCap data

Garcia-Lerma E, Carmezim J, Satorra P, Peñafiel J, Pallares N, Santos N, Tebé C.
Biostatistics Unit, Bellvitge Biomedical Research Institute (IDIBELL)

The REDCapDM package allows users to read data exported directly from REDCap or via an API connection. It also allows users to process the previously downloaded data, create reports of queries and track the identified issues.

The diagram below shows the data management cycle: from data entry in REDCap to obtain data ready for the analysis.

The package structure can be divided into three main components: reading raw data, processing data and identifying queries. Typically, after collecting data in REDCap, we will have to follow these three components in order to have a final validated dataset for analysis. We will provide a user guide on how to perform each one of these steps using the package’s functions. For data processing and query identification, we will use the COVICAN data as an example (see the package vignette for more information about this built-in dataset).

Read data: redcap_data

The redcap_data function allows users to easily import data from a REDCap project into R for analysis.

To read exported data from REDCap, use the arguments data_path and dic_path to, respectively, describe the path of the R file and the REDCap project’s dictionary:

dataset <- redcap_data(data_path="C:/Users/username/example.r",
                       dic_path="C:/Users/username/example_dictionary.csv")

Note: The R and CSV files exported from REDCap must be located in the same directory.

If the REDCap project is longitudinal (contains more than one event) then a third element should be specified with the correspondence of each event with each form of the project. This csv file can be downloaded in the REDCap of the project following these steps: Project Setup < Designate Instruments for My Events < Download instrument-event mappings (CSV).

dataset <- redcap_data(data_path="C:/Users/username/example.r",
                       dic_path="C:/Users/username/example_dictionary.csv",
                       event_path="C:/Users/username/events.csv")

Note: if the project is longitudinal and the event-form file is not provided using the event_path argument, some steps of the processment can not be performed.

Another way to read data exported from a REDCap project is using an API connection. To do this, we can use the arguments uri and token which respectively refer to the uniform resource identifier of the REDCap project and the user-specific string that serves as the password:

dataset_api <- redcap_data(uri ="https://redcap.idibell.cat/api/",
                           token = "55E5C3D1E83213ADA2182A4BFDEA")

In this case, there is no need to specify the event-form file since the function will download it automatically using the API connection, if the project is longitudinal.

Remember that the token would give anyone access to all the project’s information. You should be careful about who you give this information to.

This function returns a list with 3 elements (imported data, dictionary and event-form mapping) which can then be used for further analysis or visualization.

Data process: rd_transform

The main function involved in the processing of the data is rd_transform. This function is used to process the REDCap data read into R using the redcap_data, as described above. Using the arguments of the function we can perform different type of transformations of our data.

As previously stated, we will use the built-in dataset covican as an example.

The only necessary elements that must be provided are the dataset to be transformed and the corresponding dictionary. If the project is longitudinal, as in the case of covican, also the event-form dataset should be specified. These elements can be specified directly using the output of the redcap_data function or separately in different arguments.

#Option A: list object 
covican_transformed <- rd_transform(covican)

#Option B: separately with different arguments
covican_transformed <- rd_transform(data = covican$data, 
                                    dic = covican$dictionary, 
                                    event_form = covican$event_form)

#Print the results of the transformation
covican_transformed$results

1. Recalculating calculated fields and saving them as '[field_name]_recalc'

| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------:|:----------------:|:-----------------------:|
|         2         |      0 (0%)      |         1 (50%)         |


|     field_name      | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
|         age         |     Yes      |   FALSE   |
| screening_fail_crit |     Yes      |   TRUE    |

2. Transforming checkboxes: changing their values to No/Yes and changing their names to the names of its options. For checkboxes that have a branching logic, when the logic is missing their values will be set to missing

Table: Checkbox variables advisable to be reviewed

| Variables without any branching logic |
|:-------------------------------------:|
|        type_underlying_disease        |

3. Replacing original variables for their factor version

4. Deleting variables that contain some patterns

This function will return a list with the transformed dataset, dictionary and the output of the results of the transformation.

As we can see, there are 4 steps in the transformation and they are briefly explained in the output of the function. This four steps are:

1. 1. 1. Recalculation of REDCap calculated fields
    2. Checkbox transformation
    3. Replacement of the original variable by its factor version
    4. Elimination of variables containing some pattern

In addition, we can change the final structure of the transformed dataset by specifying in the final_format argument whether we want our data to be split by event or by form.

For more examples and information on extra arguments, see the vignette.

Queries

Queries are very important to ensure the accuracy and reliability of a REDCap dataset. The collected data may contain missing values, inconsistencies, or other potential errors that need to be identified in order to correct them later.

For all the following examples we will use the raw transformed data: covican_transformed.

rd_query

The rd_query function allows users to generate queries by using a specific expression. It can be used to identify missing values, values that fall outside the lower and upper limit of a variable and other types of inconsistencies.

Missings

If we want to identify missing values in the variables copd and age in the raw transformed data, a list of required arguments needs to be supplied.

example <- rd_query(covican_transformed,
                    variables = c("copd", "age"),
                    expression = c("%in%NA", "%in%NA"),
                    event = "baseline_visit_arm_1")

# Printing results
example$results

Report of queries
Variables	Description	Event	Query	Total
copd	Chronic obstructive pulmonary disease	Baseline visit	The value should not be missing	6
age	Age	Baseline visit	The value should not be missing	5

Expressions

The rd_query function is also able to identify outliers or observations that fulfill a specific condition.

example <- rd_query(variables="age",
                    expression=">70",
                    event="baseline_visit_arm_1",
                    dic=covican_transformed$dictionary,
                    data=covican_transformed$data)

# Printing results
example$results

Report of queries
Variables	Description	Event	Query	Total
age	Age	Baseline visit	The value should not be >70	76

More examples of both functions can be seen at the vignette.

Output

When the rd_query function is executed, it returns a list that includes a data frame with all the queries identified and a second element with a summary of the number of generated queries in each specified variable for each expression applied:

Identifier	DAG	Event	Instrument	Field	Repetition	Description	Query	Code
100-58	Hospital 11	Baseline visit	Comorbidities	copd	·	Chronic obstructive pulmonary disease	The value is NA and it should not be missing	100-58-1

Report of queries
Variables	Description	Event	Query	Total
copd	Chronic obstructive pulmonary disease	Baseline visit	The value should not be missing	6

The data frame is designed to aid users in locating each query in their REDCap project. It includes information such as the record identifier, the Data Access Group (DAG), the event in which each query can be found, along with the name and the description of the analyzed variable and a brief description of the query.

check_queries

Once the process of identifying queries is complete, the typical approach would be to adress them by modifying the original dataset in REDCap and re-run the query identification process generating a new query dataset.

The check_queries function compares the previous query dataset with the new one by using the arguments old and new, respectively. The output remains a list with 2 items, but the data frame containing the information for each query will now have an additional column (“Modification”) indicating which queries are new, which have been modified, which have been corrected, and which remain pending. Besides, the summary will show the number of queries in each one of these categories:

check <- check_queries(old = example$queries, 
                       new = new_example$queries)

# Print results
check$results

Comparison report
State	Total
Pending	7
Solved	3
Miscorrected	1
New	1

There are 7 pending queries, 3 solved queries, 1 miscorrected query, and 1 new query between the previous and the new query dataset.

Note: The “Miscorrected” category includes queries that belong to the same combination of record identifier and variable in both the old and new reports, but with a different reason. For instance, if a variable had a missing value in the old report, but in the new report shows a value outside the established range, it would be classified as “Miscorrected”.

Query control output:

Identifier	DAG	Event	Instrument	Field	Repetition	Description	Query	Code	Modification
100-58	Hospital 11	Baseline visit	Comorbidities	copd	·	Chronic obstructive pulmonary disease	The value is NA and it should not be missing	100-58-1	Pending
100-79	Hospital 11	Baseline visit	Comorbidities	copd	·	Chronic obstructive pulmonary disease	The value is NA and it should not be missing	100-79-1	New
102-113	Hospital 24	Baseline visit	Demographics	age	·	Age	The value is NA and it should not be missing	102-113-1	Pending
105-11	Hospital 5	Baseline visit	Comorbidities	copd	·	Chronic obstructive pulmonary disease	The value is NA and it should not be missing	105-11-1	Pending

Future improvements

In the short term, we would like to make some improvements to the query identification and tracking process to minimise errors and cover a wide range of possible structures. We would also like to extend the scope of the data processing to cover up specific transformations of the data that may be required in some specific scenarios. As a long-term plan, we would like to complement this package with the development of a shiny application to facilitate the use of the package and make it as user-friendly as possible.

How do I count thee? Let me count the ways?

How do I count thee? Let me count the ways? by Jerry Tuttle In Major League Baseball, a player who hits 50 home runs in a single season has hit a lot of home runs. Suppose I want to count the number of 50 homer seasons by team, and also the number of 50 homer seasons by New York Yankees. (I will count Maris and Mantle in 1961 as two.) Here is the data including Aaron Judge’s 62 in 2022 :

You would think base R would have a count function such as count(df$Team) and count(df$Team == “NYY”) but this gives the error “could not find function ‘count'”. Base R does not have a count function. Base R has at last four ways to perform a count: 1. The table function will count items in a vector. table(df$Team) presents results horizontally, and data.frame(table(df$Team)) presents results vertically. table(df$Team == “NYY”) displays results 37 false and 10 true, while table(df$Team == “NYY”)[2] just displays the result 10 true. 2. The sum function can be used to count the number of rows meeting a condition. sum(df$Team == “NYY”) displays the result 10. Here df$Team == “NYY” is creating a logical vector, and sum is summing the number of true = 1. 3. Similar to sum, nrow(df[df$Team == “NYY”, ]) counts the number of rows meeting the NYY condition. 4. The length function counts the number of elements in an R object. length(which(df$Team == “NYY”)) , length(df$Team[df$Team == “NYY”]) , and length(grep(“NYY”, df[ , “Team”])) are all ways that will count the 10 Yankees. The more direct solution to counting uses the count function in the dplyr library. Note that dplyr’s count function applies to a data frame or tibble, but not to a vector. After loading library(dplyr) , 1. df %>% count(Team) lists the count for each team. 2. df %>% filter(Team = “NYY”) lists each Yankee, and you can see there are 10. 3. df %>% count(Team == “NYY”) displays 37 false and 10 true, while df %>% filter(Team == “NYY”) %>% count() just displays the 10 true. The following is a bar chart of the results by team for teams with at least 1 50 homer season:

Finally, “How do I count thee? Let me count the ways?” is of course adapted from Elizabeth Barrett Browning’s poem “How do I love thee? Let me count the ways?” But in her poem, just how would we count the number of times “love” is mentioned? The tidytext library makes counting words fairly easy, and the answer is ten, the same number of 50 homer Yankee seasons. Coincidence? The following is all the R code. Happy counting!


  library(dplyr) 

  library(ggplot2) 

  library(tidytext) 




  df <- data.frame(

     Player=c('Ruth','Ruth','Ruth','Ruth','Wilson','Foxx','Greenberg','Foxx','Kiner','Mize','Kiner','Mays','Mantle','Maris',
  'Mantle','Mays','Foster','Fielder','Belle','McGwire','Anderson','McGwire','Griffey','McGwire','Sosa','Griffey',
  'Vaughn','McGwire','Sosa','Sosa','Bonds','Sosa','Gonzalez','Rodriguez','Rodriguez','Thome','Jones','Howard','Ortiz',
  'Rodriguez','Fielder','Bautista','Davis','Stanton','Judge','Alonso','Judge'),
  

     Year=c(1920,1921,1927,1928,1930,1932,1938,1938,1947,1947,1949,1955,1956,1961,1961,1965,1977,1990,1995,1996,1996,1997,1997,
  1998,1998,1998,1998,1999,1999,2000,2001,2001,2001,2001,2002,2002,2005,2006,2006,2007,2007,2010,2013,2017,2017,2019,2022),
  


     Homers=c(54,59,60,54,56,58,58,50,51,51,54,51,52,61,54,52,52,51,50,52,50,58,56,70,66,56,50,65,63,50,73,64,57,52,57,52,51,
  58,54,54,50,54,53,59,52,53,62), 

     Team=c('NYY','NYY','NYY','NYY','CHC','PHA','DET','BOS','PIT','NYG','PIT','NYG','NYY','NYY','NYY','SF','CIN','DET','CLE',
  'OAK','BAL','OAK/SLC','SEA','SLC','CHC','SEA','SD','SLC','CHC','CHC','SF','CHC','ARI','TEX','TEX','CLE','ATL','PHP',
  'BOR','NYY','MIL','TOR','BAL','MIA','NYY','NYM','NYY'))


head(df)

# base R ways to count:


  table(df$Team)    # shows results horizontally 

  data.frame(table(df$Team))    #shows results vertically 

  table(df$Team == "NYY")    # displays 37 false and 10 true
  

  table(df$Team == "NYY")[2]


sum(df$Team == "NYY")    # displays the result 10.


  nrow(df[df$Team == "NYY", ])    # counts the number of rows
  meeting the NYY condition.



  length(which(df$Team == "NYY"))     # which returns a vector of
  indices which are true 

  length(df$Team[df$Team == "NYY"]) 

  length(grep("NYY", df[ , "Team"]))     # grep returns a vector
  of indices that match the pattern


# dplyr R ways to count; remember to load library(dplyr):

df %>% count(Team)    # lists the count for each team.


  df %>% filter(Team == "NYY")    # lists each Yankee, and you
  can see there are 10.



  df %>% count(Team == "NYY")    # displays 37 false and 10 true,
  while 
 df %>% filter(Team == "NYY") %>% count()    #
  just displays the 10 true.



  # barplot of all teams with at least 1 50 homer season; remember to load
  library(ggplot2)



  df %>% 

      group_by(Team) %>% 

      summarise(count = n()) %>% 

      ggplot(aes(x=reorder(Team, count), y=count, fill=Team)) +
  

      geom_bar(stat = 'identity') + 

      ggtitle("Count of 50 Homer Seasons") + 

      xlab("Team") + 

      scale_y_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10)) + 

      coord_flip() + 

      theme(plot.title = element_text(face="bold", size=18)) +
  

      theme(axis.title.y = element_text(face="bold")) + 

      theme(axis.title.x = element_blank()) + 

      theme(axis.text.x = element_text(size=12, face="bold"),
  

      axis.text.y = element_text(size=12, face="bold")) + 

      theme(legend.position="none") 




  # count number of times "love" is mentioned in Browning's poem; remember to
  load library(tidytext)



  textfile <- c("How do I love thee? Let me count the ways.", 

  "I love thee to the depth and breadth and height", 

  "My soul can reach, when feeling out of sight", 

  "For the ends of being and ideal grace.", 

  "I love thee to the level of every day's", 

  "Most quiet need, by sun and candle-light.", 

  "I love thee freely, as men strive for right.", 

  "I love thee purely, as they turn from praise.", 

  "I love thee with the passion put to use", <br 

  "In my old griefs, and with my childhood's faith.", 

  "I love thee with a love I seemed to lose", 

  "With my lost saints. I love thee with the breath,", <br 

  "Smiles, tears, of all my life; and, if God choose,", 

  "I shall but love thee better after death.") 




  df<-data.frame(line=1:length(textfile), text=textfile) <br <br 

  df_words % unnest_tokens(word, text) <br 

  cleaned_words % anti_join(get_stopwords()) 

  cleaned_words %>% count(word, sort = TRUE) %>% head(6) 

  cleaned_words %>% filter(word == "love") %>% count()

Create a hyper-marketing model using Naïve Bayes

By Huey Fern Tay with Greg Page

Everyone loves an extra income stream – even the super-wealthy owners of luxurious properties that they only inhabit for just a few weeks each year. Offering a property as a short-term rental through a platform like Airbnb can provide a wonderful side hustle. For some owners, however, the associated hassles could be a powerful deterrent to using the service. Text messages at 3 a.m. about Wi-Fi passwords, stopped-up toilets, and the lack of water pressure in the shower might be just enough to tip the scales against such an undertaking…especially when such messages are followed up by angry “Why isn’t this fixed yet?” queries just 30 minutes later.

So what if an all-in-one concierge service could take away ALL of those hassles? If an intermediary service could handle all of the tenant interactions, the marketing, the logistics of the key hand-offs, etc. then suddenly the idle rich jetsetters might be a bit more willing to open up their pied-a-terres to the unbathed masses. Such a service would benefit all stakeholders – travelers would have more options, the property owners would earn more income, and the platform would receive more commission fees from the extra transactions. In exchange for a fee paid to the service, willing property owners could have a side hustle that was “all side, no hustle.”

Let’s imagine that such a service is looking to establish itself, with an initial marketing outreach effort to high-end property owners not already using Airbnb. Let’s also imagine that this new service is operating on a shoestring budget, and therefore needs to. How can it identify the properties within a city that are most likely to command high values in the short-term rental market?

The Naïve Bayes classifier is a good candidate for the task at hand because of its simplicity, computational efficiency, and ability to handle categorical variables. Furthermore, its classification outcomes come with associated probability values – we can use those to identify records that are most likely to belong to some particular group.

To illustrate how this method could be used to solve the business problem outlined above, I will utilize Airbnb data of San Francisco listings.

One of the first decisions the modeler must make is deciding how to bin the data. In this case, the question is which numerical variable would you use to separate the properties? Would you use price, review ratings, or number of reviews? Each of these variables has a different impact on the outcome and may not be equally effective at separating classes. If the classes are not well separated, then even a large dataset will not be helpful.

The next important decision is to determine the number of classification categories you would like to create. Also, what would the cut-off be for each group? In other words, should you create four groups and bin them equally? Or should you create three groups by dividing the data according to a 15-70-15 proportion, 20-40-20, or 30-40-30…etc? The decision made at this step has a big impact on the model.

Consider both models below which were each created with 3811 rows of data representing 60% of the total dataset. Both models were created with the same predictor variables, such as the number of bedrooms, bathrooms, property type, location, etc. But in Model 1, the data was binned into 4 equal groups while in Model 2, the data was binned into 3 equal groups. The model summary for Model 1 showed the model has an accuracy of 54.19%, which is good considering that this performance is slightly more than double the No Information Rate (Naïve Rate). Model 2’s accuracy’s level is at 65.36%, a level which is nearly double its No Information Rate.

These are encouraging results but in our Airbnb example, we are more interested in knowing how well our model performs when it is asked to classify properties into any of the classes used in the model. For this reason, it is worth considering the true positive rate i.e. ‘sensitivity’. Model 1 is better at predicting the true positives (‘sensitivity’) for classes at opposite ends of the spectrum. This suggests Model 1 has difficulty reading nuances. On the other hand, Model 2’s performance in this regard is comparatively more balanced.

Naive Bayes model with four classification outcome

Naive Bayes model with four classification outcome

Naive Bayes output with three classification outcome

But wait – let’s get back to our original goal. While overall accuracy is good to see, what we are most interested in here is identifying that high-end price group. Owners of such units will be the best targets for our all-in-one concierge service. Therefore, let’s dive a bit deeper to examine this model’s suitability for identifying such properties.

By running the predict() function with the type=’raw’ parameter included, we can view the associated probabilities for each outcome class, and then rank records by probability of belonging to some particular outcome group.

Taking this approach with the validation set, we find that among the 100 records identified by the model as most likely to land in the top tier group, 96 truly belonged to “Above Average and Pricey Digs.” Among the 150 likeliest, 140 units, or 93.33%, actually belonged to that group, and among the 200 likeliest, 185 units, or 92.5%, were truly in that top tier.

But that’s not all. It is worth going one step further by evaluating the model with lift charts or decile-wise lift charts because these charts determine how effectively our model ‘skims the cream’.

The decile-wise lift charts below illustrate how effectively the model can predict membership in the ‘above average and pricey digs’ group. When the model is used to classify the top 27% properties in this category, its performance is more than 3.5x better than a random guess.

Decile-wise lift chart shows the model is 3.5 times better at classifying the top 27 percent properties

Decile-wise lift chart shows the model is 3.5 times better at classifying the top 27 percent properties

Another way to assess the model’s ability to identify top-tier rentals is with a two-dimensional lift chart. Such a chart only works with two-outcome class scenarios, so we start here by collapsing the first and second tiers together, and then labelling that group as “other.”

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

Gains chart shows among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group

In the entire validation set, there are 780 units that land in the highest price tier. The values along the x-axis represent all the validation set records, ranked in order of their probability of belonging to the highest-tier class. The y-axis shows the number of correct predictions. The solid line represents the model’s performance – it shows us, for instance, that among the 500 records that the model says are most likely to land in the Above Average & Pricey Digs Tier, just over 400 truly did belong to that group. The line flattens out at around x=1500, because by that point, the model has already identified nearly all of the records that truly belonged to this outcome class.

The dotted line, by contrast, shows how effective a model would be if it simply labelled all the records as belonging to the top tier. Since 33.6%, of the validation records belong to this group, each x-axis value here corresponds to a y-axis value that is exactly 33.6% as large. The difference between the solid line and the dotted line represents the model’s improvement as the number of cases increases.

What do these results mean for the concierge service? Let’s revisit our original assumptions:

such a service would most likely appeal to the owners of properties in the highest pricing tier;
the initial outreach efforts should be made to owners of properties not already registered with Airbnb; and that
the new service has a limited budget, and therefore needs to carefully focus its outreach efforts only on that tier of properties whose owners would be likeliest to use it

Given these assumptions, our primary interest does not lie with overall model accuracy. The model’s ability to distinguish between the bottom two pricing tiers is almost immaterial to us; however, we are keenly interested in the answer to this question: When the model predicts that a property will belong to the highest pricing tier, how often is it correct?

As demonstrated here, the model delivers quite effectively in this regard, especially when we maintain a relatively narrow focus on the properties that are most likely to be top tier. Splashy magazine inserts, Super Bowl advertisements, and big-ticket endorsements from celebrities might be in the cards for this service down the road, after it spreads across the globe and prepares for its IPO roadshow. For now, though, the hyper-specific focus that can come from “skimming the cream” off the top of those Naïve Bayes model probability predictions may be the surest next step for this service’s success.

Data source: Inside Airbnb

Mapping iNaturalist Data in R

This post was originally published from IGIS as a Tech Note. Enjoy!

Introduction

iNaturalist is an enormously popular platform for recording and sharing observations of nature. Most people interact with iNaturalist through the Android or iOS phone app, but a little known fact is the platform also has an API (Application Programming Interface). That means that you can query observations using programming languages like R and Python. And if you can query observations, then you can map them!

The rest of this post demonstrates how you can download iNaturalist observations in R using the rinat package, and then map them using leaflet. We’ll also use functions from sf and dplyr. These are very common R packages in data science, and there are lot of online resources that teach you how to use them.

Our study area will be Placerita Canyon State Park in Los Angeles County, California. If you want to jump ahead to see the final product, click here.

Import the Park Boundary

First we load the packages we’re going use:

library(rinat)
library(sf)
library(dplyr)
library(tmap)
library(leaflet)

## Load the conflicted package and set preferences
library(conflicted)
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("count", "dplyr", quiet = TRUE)
conflict_prefer("select", "dplyr", quiet = TRUE)
conflict_prefer("arrange", "dplyr", quiet = TRUE)

Now we can import the park boundary which is saved as a geojson file. We use the sf package to import and transform the boundary to geographic coordinates, and leaflet to plot it:

placertia_sf <- st_read("placertia_canyon_sp.geojson", quiet = TRUE) %>% 
  st_transform(4326)

leaflet(placertia_sf) %>% 
  addTiles() %>% 
  addPolygons()

Get the Bounding Box

Next, we get the bounding box of the study area, so we can pass it to the iNaturalist API to tell it which area we’re interested in:

placertia_bb <- placertia_sf %>% 
  st_bbox()

placertia_bb

##       xmin       ymin       xmax       ymax 
## -118.47216   34.36698 -118.43764   34.37820

Download data

Now we are ready to retrieve iNaturalist observations. rinat makes this easy with get_inat_obs().

Downloading Etiquette

The non-profit that supports iNaturalist doesn’t charge for downloading data, because they believe in open data and supporting citizen science initiatives. They also don’t require registration for simply downloading observations. That being said, downloading data consumes resources that can affect other users, and it’s bad form (and bad programming) to download data unnecessarily.

There are a few best practices for downloading data in ways that won’t cause problems for other users:

Use the server-side filters the API provides to only return results for your area of interest and your taxon(s) of interest. See below for examples of sending spatial areas and a taxon name when calling the API from R.
Always save your results to disk, and don’t download data more than once. Your code should check to see if you’ve already downloaded something before getting it again.
Don’t download more than you need. The API will throttle you (reduce your download speed) if you exceed more than 100 calls per minute. Spread it out.
The API is not designed for bulk downloads. For that, look at the Export tool.

We’ll use the bounds argument in get_inat_obs() to specify the extent of our search. bounds takes 4 coordinates. Our study area is not a pure rectangle, but we’ll start with the bounding box and then apply an additional mask to remove observations outside the park boundary.

For starters we’ll just download the most recent 100 iNaturalist observations within the bounding box. (There are actually over 3000, but that would make for a very crowded map!). The code below first checks if the data have already been downloaded, to prevent downloading data twice.

search_results_fn <- "placertia_bb_inat.Rdata"

if (file.exists(search_results_fn)) {
  load(search_results_fn)

} else {  
  # Note we have to rearrange the coordinates of the bounding box a 
  # little bit to give get_inat_obs() what it expects
  inat_obs_df <- get_inat_obs(bounds = placertia_bb[c(2,1,4,3)], 
                            taxon_name = NULL,
                            year = NULL,
                            month = NULL,
                            maxresults = 100)
  
  ## Save the data frame to disk
  save(inat_obs_df, file = search_results_fn)
}

See what we got back:

dim(inat_obs_df)

## [1] 100  36

glimpse(inat_obs_df)

## Rows: 100
## Columns: 36
## $ scientific_name                   "Sceloporus occidentalis", "Eriogonum~
## $ datetime                          "2021-09-04 14:01:55 -0700", "2021-08~
## $ description                       "", "", "", "", "", "", "", "", "", "~
## $ place_guess                       "Placerita Canyon State Park, Newhall~
## $ latitude                          34.37754, 34.37690, 34.37691, 34.3764~
## $ longitude                         -118.4571, -118.4598, -118.4597, -118~
## $ tag_list                          "", "", "", "", "", "", "", "", "", "~
## $ common_name                       "Western Fence Lizard", "California b~
## $ url                               "https://www.inaturalist.org/observat~
## $ image_url                         "https://inaturalist-open-data.s3.ama~
## $ user_login                        "bmelchiorre", "ericas3", "ericas3", ~
## $ id                                94526862, 92920757, 92920756, 9291801~
## $ species_guess                     "Western Fence Lizard", "", "", "", "~
## $ iconic_taxon_name                 "Reptilia", "Plantae", "Plantae", "Pl~
## $ taxon_id                          36204, 54999, 52854, 58014, 64121, 55~
## $ num_identification_agreements     2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ num_identification_disagreements  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ observed_on_string                "Sat Sep 04 2021 14:01:55 GMT-0700 (P~
## $ observed_on                       "2021-09-04", "2021-08-29", "2021-08-~
## $ time_observed_at                  "2021-09-04 21:01:55 UTC", "2021-08-2~
## $ time_zone                         "Pacific Time (US & Canada)", "Pacifi~
## $ positional_accuracy               5, 3, 4, 4, 16, 4, 41, 41, 26, 9, NA,~
## $ public_positional_accuracy        5, 3, 4, 4, 16, 4, 41, 41, 26, 9, NA,~
## $ geoprivacy                        "", "", "", "", "", "", "", "", "", "~
## $ taxon_geoprivacy                  "open", "", "", "", "", "", "", "", "~
## $ coordinates_obscured              "false", "false", "false", "false", "~
## $ positioning_method                "", "", "", "", "", "", "", "", "", "~
## $ positioning_device                "", "", "", "", "", "", "", "", "", "~
## $ user_id                           4985856, 404893, 404893, 404893, 3214~
## $ created_at                        "2021-09-12 02:16:03 UTC", "2021-08-2~
## $ updated_at                        "2021-09-12 06:28:20 UTC", "2021-08-2~
## $ quality_grade                     "research", "needs_id", "needs_id", "~
## $ license                           "CC-BY-NC", "CC-BY-NC-SA", "CC-BY-NC-~
## $ sound_url                         NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ oauth_application_id              3, 333, 333, 333, 3, 3, 3, 3, 3, 3, N~
## $ captive_cultivated                "false", "false", "false", "false", "~

Next, in order to spatially filter the observations using the park boundary, we convert the data frame into a sf object. At the same time we’ll select just a few of the columns for our map:

inat_obs_sf <- inat_obs_df %>% 
select(longitude, latitude, datetime, common_name, 
      scientific_name, image_url, user_login) %>% 
  st_as_sf(coords=c("longitude", "latitude"), crs=4326)

dim(inat_obs_sf)

## [1] 100 6

Next, filter out observations that lie outside Placerita Canyon SP. How many are left?

inat_obs_pcsp_sf <- inat_obs_sf %>% st_intersection(placertia_sf)
nrow(inat_obs_pcsp_sf)

## [1] 90

Next, add a column to the attribute table containing HTML code that will appear in the popup windows:

inat_obs_pcsp_popup_sf <- inat_obs_pcsp_sf %>% 
  mutate(popup_html = paste0("<p><b>", common_name, "</b><br/>",
                        "<i>", scientific_name, "</i></p>",
                        "<p>Observed: ", datetime, "<br/>",
                        "User: ", user_login, "</p>",
                        "<p><img src='", image_url, "' style='width:100%;'/></p>")
)

Map the Results

Finally, we have everything we need to map the results with leaflet. Zoom-in and click on an observation to see details about it.

htmltools::p("iNaturalist Observations in Placerita Canyon SP",
             htmltools::br(),
             inat_obs_pcsp_popup_sf$datetime %>% 
               as.Date() %>% 
               range(na.rm = TRUE) %>% 
               paste(collapse = " to "),
             style = "font-weight:bold; font-size:110%;")

leaflet(placertia_sf) %>% 
  addProviderTiles("Esri.WorldStreetMap") %>% 
  addPolygons() %>% 
  addCircleMarkers(data = inat_obs_pcsp_popup_sf,
                   popup = ~popup_html, 
                   radius = 5)

iNaturalist Observations in Placerita Canyon SP
2006-01-04 to 2021-09-04

Click on the map to open in a new tab

Map the Arachnids

The iNaturalist API supports taxon searching. If you pass it the name of a genus, for example, it will restrict results to only observations from that genus. Likewise for family, order, class, etc. Observations that don’t have a scientific name assigned will be left out.

You can do a taxon search from R by passing the taxon_name argument in get_inat_obs(). Below we’ll ask for the Arachnid class, which includes spiders, scorpions, ticks, and other arthropods. This code should all look familiar by now:

search_arachnid_fn <- "placertia_arachnid_obs.Rdata"
if (file.exists(search_arachnid_fn)) {
  load(search_arachnid_fn)
} else {  
  inat_arachnid_df <- get_inat_obs(bounds = placertia_bb[c(2,1,4,3)], 
                            taxon_name = "Arachnida",
                            year = NULL,
                            month = NULL,
                            maxresults = 100)
  save(inat_arachnid_df, file = search_arachnid_fn)
}

inat_arachnid_pcsp_popup_sf <- inat_arachnid_df %>% 
  select(longitude, latitude, datetime, common_name, scientific_name, image_url, user_login) %>%
  st_as_sf(coords=c("longitude", "latitude"), crs=4326) %>% 
  st_intersection(placertia_sf) %>% 
  mutate(popup_html = paste0("<p><b>", common_name, "</b><br/>",
                             "<i>", scientific_name, "</i></p>",
                             "<p>Observed: ", datetime, "<br/>",
                             "User: ", user_login, "</p>",
                             "<p><img src='", image_url, 
                             "' style='width:100%;'/></p>"))

htmltools::p("Arachnids in Placerita Canyon SP", style = "font-weight:bold; font-size:110%;")

leaflet(placertia_sf) %>% 
  addProviderTiles("Esri.WorldStreetMap") %>% 
  addPolygons() %>% 
  addCircleMarkers(data = inat_arachnid_pcsp_popup_sf,
                   popup = ~popup_html, 
                   radius = 5)

Arachnids in Placerita Canyon SP

Click on the map to open in a new tab

Summary

The iNaturalist API allows you to download observations from a global community of naturalists, which is pretty powerful. This Tech Note showed you how to create a map of iNaturalist observations within a specific area and for a specific taxon using the rinat and leaflet packages for R.

A couple caveats are worth noting. A limitation of this method is that the data are static. In other words, to get the latest data your have to re-query the API again. It isn’t ‘live’ like a feature service.

Also, although leaflet works well for small to medium sized datasets, you wouldn’t want to use this mapping tool for massive amounts of spatial data. With leaflet, all of the geographic information and popup content are saved in the HTML file, and rendering is done on the client. This generally makes it extremely fast because everything is in memory. But if the data overwhelm memory it can slow your browser to a crawl. If you have to map hundreds of thousands of points, for example, you would need a different approach for both downloading and visualizing the data.

NPL Markets and R Shiny

Our mission at NPL Markets is to enable smart decision-making through innovative trading technology, advanced data analytics and a new comprehensive trading ecosystem. We focus specifically on the illiquid asset market, a market that is partially characterized by its unstructured data.

NPL Markets fully embraces R and Shiny to create an interactive platform where sellers and buyers of illiquid credit portfolios can interact with each other and use sophisticated tooling to structure and analyse credit data.

Creating such a platform with R Shiny was a challenge, while R Shiny is extremely well suited to analyzing data, it is not entirely a generalist web framework. Perhaps it was not intended as such, but our team at NPL Markets has managed to create an extremely productive setup.

Our development setup has a self-build ‘hot reload’ library, which we may release to a wider audience once it is sufficiently tested internally. This library allows us to update front-end and server code on the fly without restarting the entire application.

In addition to our development setup, our production environment uses robust error handling and reporting, preventing crashes that require an application to restart.

More generally, we use continuous integration and deployment and automatically create unit tests for any newly created R functions, allowing us to quickly iterate.

If you would like to know more about what we built with R and Shiny, reach out to us via our website www.nplmarkets.com