WhatsR – An R-Package for processing exported WhatsApp Chat Logs

WhatsApp is one of the most heavily used mobile instant messaging applications around the world. It is especially popular for everyday communication with friends and family and most users communicate on a daily or a weekly basis through the app. Interestingly, it is possible for WhatsApp users to extract a log file from each of their chats. This log file contains all textual communication in the chat that was not manually deleted or is not too far in the past.

This logging of digital communication is interesting for researchers investigating interpersonal communication, social relationships, and linguistics on the one hand, and for individuals seeking to learn more about their own chatting behavior (or their social relationships) on the other.

The WhatsR R-package enables users to transform exported WhatsApp chat logs into a usable data frame object with one row per sent message and multiple variables of interest. In this blog post, I will demonstrate how the package can be used to process and visualize chat log files.


Installing the Package
The package can either be installed via CRAN or via GitHub for the most up-to-date version. I recommend installing the GitHub version for the most recent features and bugfixes.
# from CRAN
# install.packages("WhatsR")

# from GitHub
devtools::install_github("gesiscss/WhatsR")
The package also needs to be attached before it can be used. For creating nicer plots, I recommend also installing and attaching the patchwork package.
# installing patchwork package
install.packages("patchwork")

# attaching packages
library(WhatsR)
library(patchwork)
Obtaining a Chat Log
You can export one of your own chat logs from your phone to your email address as explained in this tutorial. If you do this, I recommend using the “without media” export option, as this allows you to export more messages.

If you don’t want to use one of your own chat logs, you can create an artificial chat log with the same structure as a real one but with made up text using the WhatsR package!
## creating chat log for demonstration purposes

# setting seed for reproducibility
set.seed(1234)

# simulating chat log
# (and saving it automatically as a .txt file in the working directory)
create_chatlog(n_messages = 20000,
               n_chatters = 5,
               n_emoji = 5000,
               n_diff_emoji = 50,
               n_links = 999,
               n_locations = 500,
               n_smilies = 2500,
               n_diff_smilies = 10,
               n_media = 999,
               n_sdp = 300,
               startdate = "01.01.2019",
               enddate = "31.12.2023",
               language = "english",
               time_format = "24h",
               os = "android",
               path = getwd(),
               chatname = "Simulated_WhatsR_chatlog")
Parsing Chat Log File
Once you have a chat log on your device, you can use the WhatsR package to import the chat log and parse it into a usable data frame structure.
data <- parse_chat("Simulated_WhatsR_chatlog.txt", verbose = TRUE)
Checking the Parsed Chat Log
You should now have a data frame object with one row per sent message and 19 variables with information extracted from the individual messages. For a detailed overview of what each column contains and how it is computed, you can check the related open-source publication for the package. We also add a tabular overview here.
## Checking the chat log
dim(data)
colnames(data)

Column Name – Description
DateTime – Timestamp for date and time the message was sent. Formatted as yyyy-mm-dd hh:mm:ss
Sender – Name of the sender of the message as saved in the contact list of the exporting phone, or their telephone number. Messages inserted by WhatsApp into the chat are coded with “WhatsApp System Message”
Message – Text of user-generated messages with all information contained in the exported chat log
Flat – Simplified version of the message with emojis, numbers, punctuation, and URLs removed. Better suited for some text mining or machine learning tasks
TokVec – Tokenized version of the Flat column. Instead of one text string, each cell contains a list of individual words. Better suited for some text mining or machine learning tasks
URL – A list of all URLs or domains contained in the message body
Media – A list of all media attachment filenames contained in the message body
Location – A list of all shared location URLs or indicators in the message body, or indicators for shared live locations
Emoji – A list of all emoji glyphs contained in the message body
EmojiDescriptions – A list of all emojis as textual representations contained in the message body
Smilies – A list of all smileys contained in the message body
SystemMessage – Messages that are inserted by WhatsApp into the conversation and not generated by users
TokCount – Amount of user-generated tokens per message
TimeOrder – Order of messages as per the timestamps on the exporting phone
DisplayOrder – Order of messages as they appear in the exported chat log

Checking Descriptives of Chat Logs
Now, you can have a first look at the overall statistics of the chat log. You can check the number of messages, sent tokens, number of chat participants, date of first message, date of last message, the timespan of the chat, and the number of emoji, smilies, links, media files, as well as locations in the chat log.
# Summary statistics
summarize_chat(data, exclude_sm = TRUE)
We can also check the distribution of the number of tokens per message and a set of summary statistics for each individual chat participant.
# Summary statistics
summarize_tokens_per_person(data, exclude_sm = TRUE)
Visualizing Chat Logs
The chat characteristics can now be visualized using the custom functions from the WhatsR package. These functions are essentially wrappers around ggplot2 with some options for customizing the plots. Most plots have multiple ways of visualizing the data. For the visualizations, we can exclude the WhatsApp system messages using ‘exclude_sm = TRUE’. Let’s try it out:

Amount of sent messages
# Plotting amount of messages
p1 <- plot_messages(data, plot = "bar", exclude_sm = TRUE)
p2 <- plot_messages(data, plot = "cumsum", exclude_sm = TRUE)
p3 <- plot_messages(data, plot = "heatmap", exclude_sm = TRUE)
p4 <- plot_messages(data, plot = "pie", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p3) | free(p2)) / (free(p1) | free(p4))
Four different ways of visualizing the amount of sent messages in a WhatsApp chat log.

Amount of sent tokens
# Plotting amount of tokens
p5 <- plot_tokens(data, plot = "bar", exclude_sm = TRUE)
p6 <- plot_tokens(data, plot = "box", exclude_sm = TRUE)
p7 <- plot_tokens(data, plot = "violin", exclude_sm = TRUE)
p8 <- plot_tokens(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p5) | free(p6)) / (free(p7) | free(p8))
Four different ways of visualizing the amount of sent tokens in a WhatsApp chat log.

Amount of sent tokens over time
# Plotting amount of tokens over time
p9 <- plot_tokens_over_time(data,
                            plot = "year",
                            exclude_sm = TRUE)

p10 <- plot_tokens_over_time(data,
                             plot = "day",
                             exclude_sm = TRUE)

p11 <- plot_tokens_over_time(data,
                             plot = "hour",
                             exclude_sm = TRUE)

p12 <- plot_tokens_over_time(data,
                             plot = "heatmap",
                             exclude_sm = TRUE)

p13 <- plot_tokens_over_time(data,
                             plot = "alltime",
                             exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p9) | free(p10)) / (free(p11) | free(p12))
Four different ways of visualizing the amount of sent tokens over time in a WhatsApp chat log.

Amount of sent links
# Plotting amount of links
p14 <- plot_links(data, plot = "bar", exclude_sm = TRUE)
p15 <- plot_links(data, plot = "splitbar", exclude_sm = TRUE)
p16 <- plot_links(data, plot = "heatmap", exclude_sm = TRUE)
p17 <- plot_links(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p14) | free(p15)) / (free(p16) | free(p17))
Four different ways of visualizing the amount of sent links in a WhatsApp chat log.

Amount of sent smilies
# Plotting amount of smilies
p18 <- plot_smilies(data, plot = "bar", exclude_sm = TRUE)
p19 <- plot_smilies(data, plot = "splitbar", exclude_sm = TRUE)
p20 <- plot_smilies(data, plot = "heatmap", exclude_sm = TRUE)
p21 <- plot_smilies(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p18) | free(p19)) / (free(p20) | free(p21))
Four different ways of visualizing the amount of sent smileys in a WhatsApp chat log.

Amount of sent emoji
# Plotting amount of emoji
p22 <- plot_emoji(data,
                  plot = "bar",
                  min_occur = 300,
                  exclude_sm = TRUE,
                  emoji_size = 5)

p23 <- plot_emoji(data,
                  plot = "splitbar",
                  min_occur = 70,
                  exclude_sm = TRUE,
                  emoji_size = 5)

p24 <- plot_emoji(data,
                  plot = "heatmap",
                  min_occur = 300,
                  exclude_sm = TRUE)

p25 <- plot_emoji(data,
                  plot = "cumsum",
                  min_occur = 300,
                  exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p22) | free(p23)) / (free(p24) | free(p25))
Four different ways of visualizing the amount of sent emoji in a WhatsApp chat log.

Distribution of reaction times
# Plotting distribution of reaction times
p26 <- plot_replytimes(data,
                       type = "replytime",
                       exclude_sm = TRUE)
p27 <- plot_replytimes(data,
                       type = "reactiontime",
                       exclude_sm = TRUE)

# Printing plots with patchwork package
free(p26) | free(p27)
Average response times and times it takes to answer messages for each individual chat participant in a WhatsApp chat log.

Lexical Dispersion
A lexical dispersion plot is a visualization of where specific words occur within a text corpus. Because the simulated chat log in this example is using lorem ipsum text where all words occur similarly often, we add the string “testword” to a random subsample of messages. For visualizing real chat logs, this would of course not be necessary.
# Adding "testword" to random subset of messages for demonstration         # purposes
set.seed(12345)
word_additions <- sample(dim(data)[1],50)
data$TokVec[word_additions]
sapply(data$TokVec[word_additions],function(x){c(x,"testword")})
data$Flat[word_additions] <- sapply(data$Flat[word_additions],
  function(x){x <- paste(x,"testword");return(x)})
Now you can create the lexical dispersion plot:
# Plotting lexical dispersion plot
plot_lexical_dispersion(data,
                        keywords = c("testword"),
                        exclude_sm = TRUE)
Lexical dispersion plot for the occurrence of the word “testword” in the simulated WhatsApp chat log.

Response Networks
# Plotting response network
plot_network(data,
             edgetype = "n",
             collapse_sessions = TRUE,
             exclude_sm = TRUE)
Network graph showing how often each chat participant directly responded to the previous messages (a subsequent message is counted as a “response” here).

Issues and long-term availability
Unfortunately, WhatsApp chat logs are a moving target when it comes to parsing and visualization. The structure of exported WhatsApp chat logs keeps changing from time to time. On top of that, the structure of chat logs is different for chats exported from different operating systems (Android & iOS) and for different time (am/pm vs. 24h format) and language (e.g. English & German) settings on the exporting phone. When the structure changes, the WhatsR package can be limited in its functionality or become completely dysfunctional until it is updated and tested. Should you encounter any issues, all reports on the GitHub issues page are welcome. Should you want to contribute to improving or maintaining the package, pull requests and collaborations are also welcome!

Creating R packages for data analysis and reproducible research workshop

Join our workshop on Creating R packages for data analysis and reproducible research, which is a part of our workshops for Ukraine series! 


Here’s some more info: 


Title: Creating R packages for data analysis and reproducible research

Date: Thursday, February 29th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Fred Boehm is a biostatistics and translational medicine researcher living in Michigan, USA. His research focuses on statistical questions that arise in human genetics studies and their applications to clinical medicine and public health. He has extensive teaching experience as a statistics lecturer at the University of Wisconsin-Madison (https://www.wisc.edu) and as a workshop instructor for The Carpentries (https://carpentries.org/index.html). He enjoys spending time with his nieces and nephews and his two dogs. He also blogs (occasionally) at https://fboehm.us/blog/.

Description: Participants will learn to use functions from several packages, including `devtools` and `rrtools`, in the R ecosystem, while learning and adhering to practices to promote reproducible research. Participants will learn to create their own R packages for software development or data analysis. We will also motivate the need to follow reproducible research practices and will discuss strategies and open source tools.


Minimal registration fee: 20 euro (or 20 USD or 800 UAH)



How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, as well as a list of our past workshops with recordings & materials here.


Looking forward to seeing you during the workshop!


Call for Speakers: ShinyConf 2024 by Appsilon

Excitement is building as we approach ShinyConf 2024, organized by Appsilon. We are thrilled to announce the Call for Speakers. This is a unique opportunity for experts, industry leaders, and enthusiasts to disseminate their knowledge, insights, and expertise to a diverse and engaged audience.

Why Speak at ShinyConf?

Becoming a speaker at ShinyConf is not just about sharing your expertise; it’s about enriching the community, networking with peers, and contributing to the growth and innovation in your field. It’s an experience that extends beyond the conference, fostering a sense of camaraderie and collaboration among professionals.

Conference Tracks

ShinyConf 2024 features several tracks, each tailored to different aspects of our industry. Our track chairs, experts in their respective fields, will guide these sessions.

  • Shiny Innovation Hub – Led by Jakub Nowicki, Lab Lead at Appsilon, this track focuses on the latest developments and creative applications within the R Shiny framework. We’re looking for talks on advanced Shiny programming techniques, case studies, and how Shiny drives data communication advancements​.

  • Shiny in Enterprise – Chaired by Maria Grycuk, Senior Delivery Manager at Appsilon. This track delves into R Shiny’s role in shaping business outcomes, including case studies, benefits and challenges in enterprise environments, and integration strategies​.

  • Shiny in Life Sciences – Guided by Eric Nantz, a Statistician/Developer/Podcaster. This track focuses on R Shiny’s application in data science and life sciences, including interactive visualization, drug discovery, and clinical research​​.


  • Shiny for Good – Overseen by Jon Harmon, Data Science Leader and Expert R Programmer. This track highlights R Shiny’s impact on social good, community initiatives, and strategies for engaging diverse communities​.

Submission Guidelines

  • Topics of Interest: Tailored to each track, ranging from advanced programming techniques to real-world applications in life sciences, social good and enterprise.
  • Submission Types:
    • Talks (20 min)
    • Shiny app showcases (5 min)
    • Tutorials (40 min)
  • Who Can Apply: Open to both seasoned and new speakers. Unsure about your idea? Submit it anyway!
Looking for inspiration? Check out these sessions from ShinyConf 2023.

Important Dates

  • Submission Deadline: February 4
  • Speaker Selection Notification: March 1
  • Event Dates: April 17-19, all virtual

How to Apply

Submit your proposal on the Shiny Conf website: https://www.shinyconf.com/call-for-speakers

Conclusion

Join us at the Shiny Conf as a speaker and shine! We look forward to receiving your submissions and creating an inspiring and educational event together.

Follow us on social media (LinkedIn and Twitter) for updates. Registration opens this month! Contact us at [email protected] for any queries.


Gauging Cryptocurrency Market Sentiment in R

Navigating the volatile world of cryptocurrencies requires a keen understanding of market sentiment. This blog post explores some of the essential tools and techniques for analyzing the mood of the crypto market, using the cryptoQuotes-package.

The Cryptocurrency Fear and Greed Index in R

The Fear and Greed Index is a market sentiment tool that measures investor emotions, ranging from 0 (extreme fear) to 100 (extreme greed). It analyzes data like volatility, market momentum, and social media trends to indicate potential overvaluation or undervaluation of cryptocurrencies. This index helps investors identify potential buying or selling opportunities by gauging the market’s emotional extremes.

This index can be retrieved by using the cryptoQuotes::getFGIndex()-function, which returns the daily index within a specified time-frame,

## Fear and Greed Index
## from the last 14 days
tail(
  FGI <- cryptoQuotes::getFGIndex(
    from = Sys.Date() - 14
  )
)
#>            FGI
#> 2024-01-03  70
#> 2024-01-04  68
#> 2024-01-05  72
#> 2024-01-06  70
#> 2024-01-07  71
#> 2024-01-08  71

The Long-Short Ratio of a Cryptocurrency Pair in R

The Long-Short Ratio is a financial metric indicating market sentiment by comparing the number of long positions (bets on price increases) against short positions (bets on price decreases) for an asset. A higher ratio signals bullish sentiment, while a lower ratio suggests bearish sentiment, guiding traders in making informed decisions.

The Long-Short Ratio can be retrieved by using the cryptoQuotes::getLSRatio()-function, which returns the ratio within a specified time-frame and granularity. Below is an example using the Daily Long-Short Ratio on Bitcoin (BTC),

## Long-Short Ratio
## from the last 14 days
tail(
  LSR <- cryptoQuotes::getLSRatio(
    ticker = "BTCUSDT",
    interval = '1d',
    from = Sys.Date() - 14
  )
)
#>              Long  Short LSRatio
#> 2024-01-03 0.5069 0.4931  1.0280
#> 2024-01-04 0.6219 0.3781  1.6448
#> 2024-01-05 0.5401 0.4599  1.1744
#> 2024-01-06 0.5499 0.4501  1.2217
#> 2024-01-07 0.5533 0.4467  1.2386
#> 2024-01-08 0.5364 0.4636  1.1570

Putting it all together

Even though cryptoQuotes::getLSRatio() is an asset-specific sentiment indicator and cryptoQuotes::getFGIndex() is a general sentiment indicator, there is much insight to be gained by combining the two.

This information can be visualized by using the various charting functions in the cryptoQuotes-package,

## get the BTCUSDT
## pair from the last 14 days
BTCUSDT <- cryptoQuotes::getQuote(
  ticker = "BTCUSDT",
  interval = "1d",
  from = Sys.Date() - 14
)
## chart the BTCUSDT
## pair with sentiment indicators
cryptoQuotes::chart(
  slider = FALSE,
  chart = cryptoQuotes::kline(BTCUSDT) %>%
    cryptoQuotes::addFGIndex(FGI = FGI) %>% 
    cryptoQuotes::addLSRatio(LSR = LSR)
)
Bitcoin (BTC) plotted with the Fear and Greed Index alongside the Long-Short Ratio.

Installing cryptoQuotes

Installing via CRAN

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Installing via Github

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Note: The latest price may vary depending on the time of publication relative to the rendering time of the document. This document was rendered at 2024-01-08 23:30 CET.

Factor Analysis in R workshop

Join our workshop on Factor Analysis in R, which is a part of our workshops for Ukraine series! 

Here’s some more info: 

Title: Factor Analysis in R

Date: Thursday, February 1st, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Gagan Atreya is a quantitative social scientist and data science consultant based in Los Angeles, California. He has graduate degrees in Experimental Psychology and Quantitative Political Science from The College of William & Mary in Virginia and The University of Minnesota respectively. He has multiple years of experience in data analysis and visualization in the social sciences – both as a researcher and a consultant with faculty and researchers around the world. You can find him in Bluesky at @gaganatreya.bsky.social.

Description: This workshop will go through the basics of Exploratory and Confirmatory Factor Analysis in the R programming language. Factor Analysis is a valuable statistical technique widely used in Psychology, Economics, Political Science, and related disciplines that allows us to uncover the underlying structure of our data by reducing it to coherent factors. The workshop will heavily (but not exclusively) utilize the “psych” and “lavaan” packages in R. Although open to everyone, a beginner level familiarity with R and some background/interest in survey data analysis will be ideal to make the most out of this workshop.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)


How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, as well as a list of our past workshops with recordings & materials here.


Looking forward to seeing you during the workshop!



Cryptocurrency Market Data in R

Getting cryptocurrency OHLCV data in R without having to depend on low-level coding using, for example, curl or httr2, has not been easy for the R community.

There is now a high-level API Client available on CRAN which fetches all the market data without having to rely on web-scrapers, API keys or low-level coding.

Bitcoin Prices in R (Example)

This high-level API client has one main function, getQuote(), which returns cryptocurrency market data as xts and zoo objects. The returned objects contain Open, High, Low, Close and Volume data at different granularities from the currently supported exchanges.

In this blog post I will show how to get hourly Bitcoin (BTC) prices in R
using the getQuote()-function. See the code below,
# 1) getting hourly BTC
# from the last 3 days

BTC <- cryptoQuotes::getQuote(
 ticker   = "BTCUSDT", 
 source   = "binance", 
 futures  = FALSE, 
 interval = "1h", 
 from     = as.character(Sys.Date() - 3)
)
Bitcoin (BTC) OHLC-prices (Output from getQuote-function)
Index                 Open       High       Low        Close      Volume
2023-12-23 19:00:00   43787.69   43821.69   43695.03   43703.81   547.96785
2023-12-23 20:00:00   43703.82   43738.74   43632.77   43711.33   486.4342
2023-12-23 21:00:00   43711.33   43779.71   43661.81   43772.55   395.6197
2023-12-23 22:00:00   43772.55   43835.94   43737.85   43745.86   577.03505
2023-12-23 23:00:00   43745.86   43806.38   43701.1    43702.16   940.55167
2023-12-24            43702.15   43722.25   43606.18   43716.72   773.85301

The returned Bitcoin prices from getQuote() are compatible with quantmod and TTR without further programming. Let me demonstrate this using chartSeries(), addBBands() and addMACD() from these powerful libraries,

# charting BTC
# using quantmod
quantmod::chartSeries(
 x = BTC,
 TA = c(
    # add bollinger bands
    # to the chart
    quantmod::addBBands(), 
    # add MACD indicator
    # to the chart
    quantmod::addMACD()
 ), 
 theme = quantmod::chartTheme("white")
)
Charting Bitcoin prices using quantmod and TTR.

Installing cryptoQuotes

Stable version

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Development version

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Automating updates to dashboards on Shiny Server workshop

Join our workshop on Automating updates to dashboards on Shiny Server, which is a part of our workshops for Ukraine series! 

Here’s some more info: 

Title: Automating updates to dashboards on Shiny Server

Date: Thursday, January 25th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Clinton Oyogo David is a data scientist with 7 years of experience, currently working with Oxford Policy Management (OPM) in the Research and Evidence, data innovations team. Prior to joining OPM, he was working at the World Agroforestry Centre as a junior data scientist in the Spatial Data Science and Applied Learning Lab.

Description: In this workshop, we will talk about the configurations and setups needed to automate updates to R Shiny dashboards deployed on a Shiny server. The talk will touch on GitHub webhooks, APIs (Django) and bash scripting. With the setup in place, you will not need to manually update the code on the Shiny server; a push event to GitHub will be enough to have your changes reflected on the dashboard in a matter of seconds.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

How can I register?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?

  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).


You can also find more information about this workshop series, a schedule of our future workshops, as well as a list of our past workshops with recordings & materials here.


Looking forward to seeing you during the workshop!

Use google’s Gemini in R with R package “gemini.R”

Introduction

A few days ago, Google presented its own multimodal LLM named “Gemini”.


There was also an article named “How to Integrate google’s gemini AI model into R” that briefly explains how to use the Gemini API in R.

Thanks to Deepanshu Bhalla (the writer of the above article), I got a lot of inspiration and did some research on how to use the Gemini API further. I’m glad to share the results with you.

In this article, I want to highlight how to use Gemini with R and Shiny via an R package for the Gemini API.

(You can see the result and contribute in the GitHub repository: gemini.R)

Gemini API


As of today (2023-12-26), the Gemini API mainly consists of 4 parts. You can see more details in the official docs.

1. Gemini Pro: takes text and returns text
2. Gemini Pro Vision: takes text and an image and returns text
3. Gemini Pro Multi-turn: chat (multi-turn conversations)
4. Embedding: for NLP

and I’ll use 1 & 2. 

You can get API keys in Google AI Studio

However, the official docs don’t describe how to use the Gemini API in R. (How sad.)
But we can handle it as a REST API (I’ll explain later).

Shiny application

I made a very simple concept of a Shiny application that sends an image and a text prompt (maybe “Explain this picture”) to the Gemini API and shows the answer from Gemini.

(The numbers indicate the expected user flow.)

This UI consists of 5 components.

1. fileInput for uploading an image
2. imageOutput for showing the uploaded image
3. textInput for the prompt
4. actionButton for sending the request to Gemini
5. textOutput for showing the response from Gemini

And this is the resulting Shiny / R code (again, you can see it in the GitHub repository):



library(shiny)
library(gemini.R)

ui <- fluidPage(
  sidebarLayout(
    NULL,
    mainPanel(
      fileInput(
        inputId = "file",
        label = "Choose file to upload",
      ),
      div(
        style = 'border: solid 1px blue;',
        imageOutput(outputId = "image1"),
      ),
      textInput(
        inputId = "prompt",
        label = "Prompt",
        placeholder = "Enter Prompts Here"
      ),
      actionButton("goButton", "Ask to gemini"),
      div(
        style = 'border: solid 1px blue; min-height: 100px;',
        textOutput("text1")
      )
    )
  )
)

server <- function(input, output) {
  observeEvent(input$file, {
    path <- input$file$datapath
    output$image1 <- renderImage({
      list( src = path )
    }, deleteFile = FALSE) })

  observeEvent(input$goButton, {
    output$text1 <- renderText({
      gemini_image(input$prompt, input$file$datapath)
    })
  })
}

shinyApp(ui = ui, server = server)


gemini.R package

You may be wondering: “What is the gemini_image function?”

It is a function that sends a request to the Gemini server and returns the result.

It consists of 3 main parts:

1. Model query
2. API key
3. Content

I used the gemini_image function in the example, but I’ll explain the gemini function first (which is the function that sends text and gets text back).


Gemini’s example API usage (as a REST API) looks like below.

This can be transformed into R like below.


Also, the Gemini API key must be set before use with the Sys.setenv() function.

Anyway, you should note that the request body for the API is mainly constructed as a (nested) list.
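
For illustration, a minimal sketch of what a text-only gemini() call could look like is shown below. This is my own sketch using the httr2 package and the publicly documented v1beta REST endpoint; the actual gemini.R implementation may differ in its details.

library(httr2)

# Sketch of a text-in / text-out request to Gemini Pro.
# Assumes the API key was stored beforehand, e.g. via
# Sys.setenv(GEMINI_API_KEY = "your-key").
gemini_sketch <- function(prompt) {
  resp <- request("https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent") |>
    req_url_query(key = Sys.getenv("GEMINI_API_KEY")) |>
    req_body_json(list(                       # the body is a nested list
      contents = list(
        list(parts = list(list(text = prompt)))
      )
    )) |>
    req_perform()

  # extract the generated text from the JSON response
  resp_body_json(resp)$candidates[[1]]$content$parts[[1]]$text
}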

Similarly, the gemini_image function for the Gemini Pro Vision API looks like below.

Note that the image must be encoded as base64 using the base64encode function and provided as a separate list element.
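
As an illustration of this point, the request body for the vision model could be built like below. Again, this is a sketch (the file name and MIME type are placeholders), not necessarily the exact gemini.R code.

library(base64enc)

# base64-encode the image file (hypothetical file name)
image_b64 <- base64encode("picture.jpg")

# the image goes into its own "inline_data" list element,
# separate from the text part
body <- list(
  contents = list(
    list(parts = list(
      list(text = "Explain this picture"),
      list(inline_data = list(
        mime_type = "image/jpeg",
        data = image_b64
      ))
    ))
  )
)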


Example 

So, with the Shiny application and the gemini.R package, you can now run the example application to ask Gemini about an image.


Summary 

I made a very basic R package, “gemini.R”, to use the Gemini API.

It provides 2 functions: gemini and gemini_image.

There are still many possibilities for developing this package further, like a chat feature (like Bard) or NLP embeddings.

Finally, I want to hear feedback or receive contributions from you. (Really!)


Thanks. 

* P.S. I think just using Bard / ChatGPT / Copilot is much better for personal usage (unless you want to provide an AI service via R).

Learning inferential statistics using R

Imagine you need to find the average height of 20-year-olds. One way is to go around and measure each person individually. But that seems like quite a bit of work, doesn’t it? Luckily, there’s a better way. Inferential statistics allows us to use samples to draw conclusions about the population. In other words, we can get a small group of people and use their characteristics to estimate the characteristics of the entire group.
To see how this works in practice, let’s take a look at a dataset from Kaggle. This platform provides a wealth of datasets from various fields, each offering unique challenges for R users. Here, we’ll be using a dataset on cardiovascular diseases compiled by Jocelyn Dumlao.
This dataset originates from a renowned multispecialty hospital situated in India, encompassing a comprehensive array of health-related information. Comprising 1000 rows and 14 columns, this dataset plays a pivotal role in the early detection of diseases.
Let us see how to import this into RStudio. The dataset is imported using the read.csv function (this only applies if the dataset is in .csv format). Replace “File path” with the path of your downloaded dataset.
library(readr)
cardio <- read.csv("File path")
Just type in the name of the variable you used to import the dataset so that you can view the entire dataset in RStudio.
cardio


The first 6 rows of the dataset can be viewed using the ‘head’ function.
top_6=head(cardio)
top_6

Similarly, the last 6 rows of the dataset can be viewed using the ‘tail’ function.
bottom_6=tail(cardio)
bottom_6

The dimension of the dataset (number of rows and columns) can be found using the ‘dim’ function.
dimension=dim(cardio)
dimension

The entire dataset can be treated as the population, and all the population parameters can be easily computed. The mean of a target variable in the population is calculated by the ‘mean’ function. Below, we choose serumcholestrol as the target variable.
mean_chol=mean(cardio$serumcholestrol)
mean_chol

So, we can infer that the average serumcholestrol level in the patient population taken from the hospital is 311.447.
There also exists a function to calculate the standard deviation of a dataset.

std_chol=sd(cardio$serumcholestrol)
std_chol


From this value, it can be understood that the values of serumcholestrol typically lie about 132.4438 above or below the mean level.
We take a random sample of size 100, where our target variable is serumcholestrol. If you want to take a random sample with replacement, set the third argument to TRUE. Here, we’re taking a sample without replacement.

sample_1=sample(cardio$serumcholestrol,100,FALSE)
sample_1
mean_sample_chol=mean(sample_1)
mean_sample_chol

The mean of the sample that we selected is 317.51. This mean can be used to calculate a test statistic, which in turn can be used to make decisions about the null hypothesis (whether to reject it or not).
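For example, a one-sample t-test computes such a test statistic from the sample; the hypothesised mean of 300 below is an arbitrary illustrative value, not part of the original analysis.
t.test(sample_1, mu = 300)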


Calculating the standard deviation and standard error of the sample


Getting the standard deviation of a dataset gives us many insights: the standard deviation describes the spread of the data around the mean. The standard deviation of the sampling distribution of the mean is called the standard error; it can be estimated from a single sample as the sample standard deviation divided by the square root of the sample size. First, the sample standard deviation:
std_dev_1=sd(sample_1)
std_dev_1
The mean and the standard deviation of the sample are close to the population mean and standard deviation.
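To obtain the standard error of the mean from this sample, divide the sample standard deviation by the square root of the sample size:
std_error_1=sd(sample_1)/sqrt(length(sample_1))
std_error_1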

Plotting the sampling distribution as a histogram, with cholesterol levels on the x-axis and frequency on the y-axis.

To get a sampling distribution, we repeatedly take samples 1000 times. This is done using the replicate function, which repeatedly evaluates an expression a given number of times.
samp_dist_1=replicate(1000,mean(sample(cardio$serumcholestrol,100,replace=TRUE)))
samp_dist_1
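The histogram can then be drawn with the hist function (mirroring the call used later for sample 2, with cholesterol levels on the x-axis and frequency on the y-axis):
hist(samp_dist_1,main="Sampling distribution of serum_cholestrol",xlab = "Cholesterol Levels",ylab = "Frequency", col = "skyblue")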

The obtained graph is similar to a normal distribution. That is, values near the mean occur more frequently than values far from the mean. Now let's calculate the variance of the sampling distribution using the var function.
variance_sample_1=var(samp_dist_1)
variance_sample_1

Now let us see how increasing the sample size affects the variance of the sampling distribution.
Increasing the sample size to 200:
sample_2=sample(cardio$serumcholestrol,200,FALSE)
sample_2
Calculating the mean of sample 2
mean_sample_chol=mean(sample_2)
mean_sample_chol

The mean of sample 2, with sample size 200, is 308.875.

Calculating the standard deviation of sample 2
std_dev_2=sd(sample_2)
std_dev_2

The standard deviation of sample 2 is 135.9615.
We repeat the previous steps to obtain a sampling distribution.
samp_dist_2=replicate(1000,mean(sample(cardio$serumcholestrol,200,replace=TRUE)))
samp_dist_2
Now we plot it like before.
hist(samp_dist_2,main="Sampling distribution of serum_cholestrol",xlab = "Cholesterol Levels",ylab = "Frequency", col = "skyblue")
variance_sample_2=var(samp_dist_2)
variance_sample_2
The variance of the sampling distribution for sample size 200 is 84.513; that is, the variance obtained with sample size 100 is greater than the one obtained with sample size 200. Hence we can conclude that as the sample size increases, the variance (and with it the standard error) decreases. In other words, precision increases with an increase in sample size.

Authors: Aadith Joseph Mathew, Amrutha Paalathara, Devika S Vinod, Jyosna Philip

DICOM Parsing with R

Abstract

This blog post is to describe how to parse medically relevant non-image meta information from DICOM files using the programming language R. The resulting structure of the whole parsing process is an R data frame in form of a name – value table that is both easy to handle and flexible.

We first describe the general structure of DICOM files and which kind of information they contain. Following this, our DicomParseR module in R is explained in detail. The package has been developed as part of practical DICOM parsing from our hospital’s cardiac magnetic resonance (CMR) modalities in order to populate a scientific database. The given examples hence refer to CMR information parsing; however, due to its generic nature, DicomParseR may be used to parse information from any type of DICOM file.

The following graph illustrates the use of DicomParseR in our use case as an example:

Structure of CMR DICOM files

At the top level, a DICOM file generated by a CMR modality consists of a header (hdr) section and the image (img) information. In between, an XML part can be found.

The hdr section mainly contains baseline information about the patient, including name, birth date and system ID. It also contains contextual information about the observation, such as date and time, ID of the modality and the observation protocol.

The XML section contains quantified information generated by the modality’s embedded AI, e. g. regarding myocardial blood flow (MBF). All information is stored between specifically named sub tags. These tags will serve to search for specific information. For further information on DICOM files, please refer to dicomstandard.org.

The heterogeneous structure of DICOM files as described above requires the use of distinct submodules to compose a technically harmonized name – value table. The information from the XML section will be extended by information from the hdr section. The key benefit of our DicomParseR module is to parse these syntactically distinct sections, which will be described in the following.

Technical Approach

To extract information from the sub tags in the XML section and any additional relevant meta information from the hdr section of the DICOM file, following steps are performed:

    1. Check if a DICOM file contains desired XML tag
    2. If the desired tag is present, extract and transform baseline information from hdr part
    3. If step 1 applied, extract and transform desired tag information from XML part
    4. Combine the two sets of information into an integrated R data frame
    5. Write the data frame into a suitable database for future scientific analysis

The steps mentioned above will be explained in detail in the following.

Step 1: Check if a DICOM file contains the desired XML tag

At the beginning of processing, DicomParseR will check whether a certain tag is present in the DICOM file, in our case <ismrmrdMeta>. In case that tag exists, quantified medical information will be stored here. Please refer to ISMRMRD for further information about the ismrmrd data format.

For this purpose, DicomParseR offers the function file_has_content() that expects the file and a search tag as parameters. The function will use base::readLines() to read in the file and stringr::str_detect() to detect whether the given tag is present in the file. Performance tests with the help of the package microbenchmark have proven stringr’s outstanding processing speed in this context. If the given tag was found, TRUE is returned, otherwise FALSE.

Any surrounding application may hence call

if (DicomParseR::file_has_content(file, "ismrmrdMeta")) {…}

to only continue parsing DICOM files that contain the desired tag.
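
For illustration, a minimal version of such a check could look like the following; this is a sketch based on the description above, not necessarily the exact DicomParseR code.

file_has_content <- function(file, tag) {
  # read the raw lines of the DICOM file and return TRUE if the tag occurs anywhere
  file_content <- readLines(file, encoding = "UTF-16LE", skipNul = TRUE)
  any(stringr::str_detect(file_content, tag))
}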

It is important to note that the information generated by the CMR modality is actually not a single DICOM file but rather a composition of a multitude of files. These files may or may not contain the desired XML tag(s). If step 1 were omitted, our parsing module would import many more files than necessary.

Step 2: Extract and transform baseline hdr information

Step 2 will extract hdr information from the file. For this purpose, DicomParseR uses the function readDICOMFile() provided by package oro.dicom. By calling

oro.dicom::readDICOMFile(dicom_file)[["hdr"]]

the XML and image part are removed. The hdr section contains information such as patient’s name, sex and birthdate as well as meta information about the observation, such as date, time and contrast bolus. DicomParseR will save the hdr part as a data frame (in the following called df_hdr) in this step and later append it to the data frame that is generated in the next step.
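
A minimal sketch of this step could look like the following (assumed for illustration, not the exact DicomParseR code):

# read only the hdr section and keep it as a name - value data frame
hdr <- oro.dicom::readDICOMFile(dicom_file)[["hdr"]]
df_hdr <- data.frame(name = hdr$name, value = hdr$value, stringsAsFactors = FALSE)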

Note that the oro.dicom package provides functionality to extract header and image data from a DICOM file as shown in the code snippet. However, it does not provide an out-of-the-box solution to extract the XML section and return it as an R data frame. For this purpose, the DicomParseR wraps the extra functionality required around existing packages for DICOM processing.

Step 3: Extract and transform information from XML part

In this step, the data within the provided XML tag is extracted and transformed into a data frame.

The following snippet shows an example of how myocardial blood flow numbers are stored in the respective DICOM files (values modified for data privacy reasons):

<ismrmrdMeta>
                …
                 <meta>
                               <name>GADGETRON_FLOW_ENDO_S_1</name>
                               <value>1.95</value>
                               <value>0.37</value>
                               <value>1.29</value>
                               <value>3.72</value>
                               <value>1.89</value>
                               <value>182</value>
                </meta>
                …
</ismrmrdMeta>

Within each meta tag, “name” specifies the context of the observation and “value” stores the myocardial blood flow data. The different data points between the value tags correspond to different descriptive metrics, such as mean, median, minimum and maximum values. Other meta tags may be structured differently. In order to stay flexible, the final extraction of a concrete value is done in the last step of data processing, see step 5.

Now, to extract and transform the desired information from the DICOM file, DicomParseR will first use its function extract_xml_from_file() for extraction and subsequently the function convert_xml2df() for transformation. With

extract_xml_from_file <- function(file, tag) {
  file_content <- readLines(file, encoding = "UTF-16LE", skipNul = TRUE)
  indeces <- grep(tag, file_content)
  xml_part <- paste(file_content[indeces[[1]]:indeces[[2]]], collapse = "\n")
  return(xml_part)
}

and “ismrmrdMeta” as tag, the function will return a string in XML structure. That string is then converted to an R data frame in the form of a name – value table by convert_xml2df(). Based on our example above, the resulting data frame will look like this:

name                      value   [index]
GADGETRON_FLOW_ENDO_S_1   1.95    [1]
GADGETRON_FLOW_ENDO_S_1   0.37    [2]

That data frame is called df_ismrmrdMeta in the following. A specific value can be accessed with the combination of name and index, see the example in step 5.
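
For illustration, a possible implementation of convert_xml2df() based on the xml2 package could look like this; it assumes that each <meta> tag holds one <name> and several <value> children, and the actual DicomParseR code may differ.

convert_xml2df <- function(xml_part) {
  doc   <- xml2::read_xml(xml_part)
  metas <- xml2::xml_find_all(doc, ".//meta")
  # flatten each <meta> block into name - value rows with a running index
  do.call(rbind, lapply(metas, function(m) {
    name   <- xml2::xml_text(xml2::xml_find_first(m, "./name"))
    values <- xml2::xml_text(xml2::xml_find_all(m, "./value"))
    data.frame(name = name, value = values, index = seq_along(values))
  }))
}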

Step 4: Integrate hdr and XML data frames

At this point in time, two data frames have resulted from processing the original DICOM file: df_hdr and df_ismrmrdMeta.

In this step, those two data frames are combined into one single data frame called df_filtered. This is done by using base::rbind().

For example, executing

df_filtered <- rbind(c("Pat_Weight", df_hdr$value[df_hdr$name=="PatientsWeight"][1]), df_ismrmrdMeta)

will extend the data frame df_ismrmrdMeta by the patient’s weight. The result is returned in form of the target data frame df_filtered. As with df_ismrmrdMeta, df_filtered will be a name – value table. This design has been chosen in order to stay as flexible as possible when it comes to subsequent data analysis.

Step 5: Populate scientific database

The data frame df_filtered contains all information from the DICOM file as a name – value table. In the final step 5, df_filtered may now be split again as required to match the use case specific schema of the scientific database.

For example, in our use case, the table “cmr_data” in the scientific database is dedicated to persist MBF values. An external program (in this case, an R Shiny application providing a GUI for end-user interaction) will call its function transform_input_to_cmr_data() to generate a data frame in format of the “cmr_data” table. By calling

transform_input_to_cmr_data <- function(df) {
  mbf_endo_s1 = as.double(df$value[df$name=="GADGETRON_FLOW_ENDO_S_1"][1])
  mbf_endo_s2 = ...
}

with df_filtered as parameter, the mean MBF values of the heart segments are extracted and can now be sent to the database. Another sub step would be to call transform_input_to_baseline_data() to persist baseline information in the database.

Summary and Outlook

This blog post has described the way DICOM files from CMR observations can be processed with R in order to extract quantified myocardial blood flow values for scientific analysis. Apart from R, different approaches by other institutes have been discussed publicly as well, e. g. by using MATLAB. Interested readers may refer to NIH, among others, for further information.

The chosen approach tries to respect both the properties of DICOM files, that is, their heterogeneous inner structure, the different types of information and their file size, as well as the specific data requirements by our institute’s use cases. With a single R data frame in form of a name – value table, the result of the process is easy to handle for further data analysis. At the same time, due to its flexible setup, DicomParseR may serve as a module in any kind of DICOM-related use case.

Thomas Schröder
Centrum für medizinische Datenintegration BHC Stuttgart