Creating R packages for data analysis and reproducible research workshop

Join our workshop on Creating R packages for data analysis and reproducible research, which is a part of our workshops for Ukraine series! 


Here’s some more info: 


Title: Creating R packages for data analysis and reproducible research

Date: Thursday, February 29th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Fred Boehm is a biostatistics and translational medicine researcher living in Michigan, USA. His research focuses on statistical questions that arise in human genetics studies and their applications to clinical medicine and public health. He has extensive teaching experience as a statistics lecturer at the University of Wisconsin-Madison (https://www.wisc.edu) and as a workshop instructor for The Carpentries (https://carpentries.org/index.html). He enjoys spending time with his nieces and nephews and his two dogs. He also blogs (occasionally) at https://fboehm.us/blog/.

Description: Participants will learn to use functions from several packages in the R ecosystem, including `devtools` and `rrtools`, while learning and adhering to practices that promote reproducible research. Participants will learn to create their own R packages for software development or data analysis. We will also motivate the need to follow reproducible research practices and discuss strategies and open-source tools that support them.


Minimal registration fee: 20 euro (or 20 USD or 800 UAH)



How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring the participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or leave it up to us to allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, and a list of our past workshops (for which you can get the recordings & materials) here.


Looking forward to seeing you during the workshop!


Call for Speakers: ShinyConf 2024 by Appsilon

Excitement is building as we approach ShinyConf 2024, organized by Appsilon. We are thrilled to announce the Call for Speakers. This is a unique opportunity for experts, industry leaders, and enthusiasts to disseminate their knowledge, insights, and expertise to a diverse and engaged audience.

Why Speak at ShinyConf?

Becoming a speaker at ShinyConf is not just about sharing your expertise; it’s about enriching the community, networking with peers, and contributing to the growth and innovation in your field. It’s an experience that extends beyond the conference, fostering a sense of camaraderie and collaboration among professionals.

Conference Tracks

ShinyConf 2024 features several tracks, each tailored to different aspects of our industry. Our track chairs, experts in their respective fields, will guide these sessions.

  • Shiny Innovation Hub – Led by Jakub Nowicki, Lab Lead at Appsilon, this track focuses on the latest developments and creative applications within the R Shiny framework. We’re looking for talks on advanced Shiny programming techniques, case studies, and how Shiny drives data communication advancements.

  • Shiny in Enterprise – Chaired by Maria Grycuk, Senior Delivery Manager at Appsilon. This track delves into R Shiny’s role in shaping business outcomes, including case studies, benefits and challenges in enterprise environments, and integration strategies.

  • Shiny in Life Sciences – Guided by Eric Nantz, a Statistician/Developer/Podcaster. This track focuses on R Shiny’s application in data science and life sciences, including interactive visualization, drug discovery, and clinical research.


  • Shiny for Good – Overseen by Jon Harmon, Data Science Leader and Expert R Programmer. This track highlights R Shiny’s impact on social good, community initiatives, and strategies for engaging diverse communities.

Submission Guidelines

  • Topics of Interest: Tailored to each track, ranging from advanced programming techniques to real-world applications in life sciences, social good and enterprise.
  • Submission Types:
    • Talks (20 min)
    • Shiny app showcases (5 min)
    • Tutorials (40 min)
  • Who Can Apply: Open to both seasoned and new speakers. Unsure about your idea? Submit it anyway!
Looking for inspiration? Check out these sessions from ShinyConf 2023.

Important Dates

  • Submission Deadline: February 4
  • Speaker Selection Notification: March 1
  • Event Dates: April 17-19, all virtual

How to Apply

Submit your proposal on the Shiny Conf website: https://www.shinyconf.com/call-for-speakers

Conclusion

Join us at the Shiny Conf as a speaker and shine! We look forward to receiving your submissions and creating an inspiring and educational event together.

Follow us on social media (LinkedIn and Twitter) for updates. Registration opens this month! Contact us at [email protected] for any queries.


Gauging Cryptocurrency Market Sentiment in R

Navigating the volatile world of cryptocurrencies requires a keen understanding of market sentiment. This blog post explores some of the essential tools and techniques for analyzing the mood of the crypto market, using the cryptoQuotes-package.

The Cryptocurrency Fear and Greed Index in R

The Fear and Greed Index is a market sentiment tool that measures investor emotions, ranging from 0 (extreme fear) to 100 (extreme greed). It analyzes data like volatility, market momentum, and social media trends to indicate potential overvaluation or undervaluation of cryptocurrencies. This index helps investors identify potential buying or selling opportunities by gauging the market’s emotional extremes.

This index can be retrieved by using the cryptoQuotes::getFGIndex()-function, which returns the daily index within a specified time-frame,

## Fear and Greed Index
## from the last 14 days
tail(
  FGI <- cryptoQuotes::getFGIndex(
    from = Sys.Date() - 14
  )
)
#>            FGI
#> 2024-01-03  70
#> 2024-01-04  68
#> 2024-01-05  72
#> 2024-01-06  70
#> 2024-01-07  71
#> 2024-01-08  71

The Long-Short Ratio of a Cryptocurrency Pair in R

The Long-Short Ratio is a financial metric indicating market sentiment by comparing the number of long positions (bets on price increases) against short positions (bets on price decreases) for an asset. A higher ratio signals bullish sentiment, while a lower ratio suggests bearish sentiment, guiding traders in making informed decisions.

The Long-Short Ratio can be retrieved by using the cryptoQuotes::getLSRatio()-function, which returns the ratio within a specified time-frame and granularity. Below is an example using the Daily Long-Short Ratio on Bitcoin (BTC),

## Long-Short Ratio
## from the last 14 days
tail(
  LSR <- cryptoQuotes::getLSRatio(
    ticker = "BTCUSDT",
    interval = '1d',
    from = Sys.Date() - 14
  )
)
#>              Long  Short LSRatio
#> 2024-01-03 0.5069 0.4931  1.0280
#> 2024-01-04 0.6219 0.3781  1.6448
#> 2024-01-05 0.5401 0.4599  1.1744
#> 2024-01-06 0.5499 0.4501  1.2217
#> 2024-01-07 0.5533 0.4467  1.2386
#> 2024-01-08 0.5364 0.4636  1.1570

Putting it all together

Even though cryptoQuotes::getLSRatio() is an asset-specific sentiment indicator and cryptoQuotes::getFGIndex() is a general sentiment indicator, there is much to be gained by combining the two.

This information can be visualized by using the various charting functions in the cryptoQuotes-package,

## get the BTCUSDT
## pair from the last 14 days
BTCUSDT <- cryptoQuotes::getQuote(
  ticker = "BTCUSDT",
  interval = "1d",
  from = Sys.Date() - 14
)
## chart the BTCUSDT
## pair with sentiment indicators
library(magrittr) # provides the %>% pipe used below
cryptoQuotes::chart(
  slider = FALSE,
  chart = cryptoQuotes::kline(BTCUSDT) %>%
    cryptoQuotes::addFGIndex(FGI = FGI) %>% 
    cryptoQuotes::addLSRatio(LSR = LSR)
)
Bitcoin (BTC) charted with the Fear and Greed Index alongside the Long-Short Ratio using R

Installing cryptoQuotes

Installing via CRAN

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Installing via GitHub

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Note: The latest price may vary depending on the time of publication relative to the rendering time of the document. This document was rendered at 2024-01-08 23:30 CET.

Factor Analysis in R workshop

Join our workshop on Factor Analysis in R, which is a part of our workshops for Ukraine series! 

Here’s some more info: 

Title: Factor Analysis in R

Date: Thursday, February 1st, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Gagan Atreya is a quantitative social scientist and data science consultant based in Los Angeles, California. He has graduate degrees in Experimental Psychology and Quantitative Political Science from The College of William & Mary in Virginia and The University of Minnesota respectively. He has multiple years of experience in data analysis and visualization in the social sciences – both as a researcher and a consultant with faculty and researchers around the world. You can find him on Bluesky at @gaganatreya.bsky.social.

Description: This workshop will go through the basics of Exploratory and Confirmatory Factor Analysis in the R programming language. Factor Analysis is a valuable statistical technique widely used in Psychology, Economics, Political Science, and related disciplines that allows us to uncover the underlying structure of our data by reducing it to coherent factors. The workshop will heavily (but not exclusively) utilize the “psych” and “lavaan” packages in R. Although open to everyone, a beginner level familiarity with R and some background/interest in survey data analysis will be ideal to make the most out of this workshop.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)


How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring the participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or leave it up to us to allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, and a list of our past workshops (for which you can get the recordings & materials) here.


Looking forward to seeing you during the workshop!



Cryptocurrency Market Data in R

Getting cryptocurrency OHLCV data in R without having to depend on low-level coding using, for example, curl or httr2, has not been easy for the R community.

There is now a high-level API client available on CRAN that fetches all the market data without having to rely on web scrapers, API keys or low-level coding.

Bitcoin Prices in R (Example)

This high-level API client has one main function, getQuote(), which returns cryptocurrency market data as xts- and zoo-class objects. The returned object contains Open, High, Low, Close and Volume data at different granularities, from the currently supported exchanges.

In this blog post I will show how to get hourly Bitcoin (BTC) prices in R
using the getQuote()-function. See the code below,
# 1) getting hourly BTC
# from the last 3 days

BTC <- cryptoQuotes::getQuote(
 ticker   = "BTCUSDT", 
 source   = "binance", 
 futures  = FALSE, 
 interval = "1h", 
 from     = as.character(Sys.Date() - 3)
)
Bitcoin (BTC) OHLC prices (output from the getQuote()-function):

Index                    Open      High      Low       Close     Volume
2023-12-23 19:00:00  43787.69  43821.69  43695.03  43703.81  547.96785
2023-12-23 20:00:00  43703.82  43738.74  43632.77  43711.33  486.4342
2023-12-23 21:00:00  43711.33  43779.71  43661.81  43772.55  395.6197
2023-12-23 22:00:00  43772.55  43835.94  43737.85  43745.86  577.03505
2023-12-23 23:00:00  43745.86  43806.38  43701.1   43702.16  940.55167
2023-12-24 00:00:00  43702.15  43722.25  43606.18  43716.72  773.85301

The returned Bitcoin prices from getQuote() are compatible with quantmod and TTR without further programming. Let me demonstrate this using chartSeries(), addBBands() and addMACD() from these powerful libraries,

# charting BTC
# using quantmod
quantmod::chartSeries(
 x = BTC,
 TA = c(
    # add bollinger bands
    # to the chart
    quantmod::addBBands(), 
    # add MACD indicator
    # to the chart
    quantmod::addMACD()
 ), 
 theme = quantmod::chartTheme("white")
)
Charting Bitcoin prices using quantmod and TTR

Installing cryptoQuotes

Stable version

# install from CRAN
install.packages(
  pkgs = 'cryptoQuotes',
  dependencies = TRUE
)

Development version

# install from github
devtools::install_github(
  repo = 'https://github.com/serkor1/cryptoQuotes/',
  ref = 'main'
)

Automating updates to dashboards on Shiny Server workshop

Join our workshop on Automating updates to dashboards on Shiny Server, which is a part of our workshops for Ukraine series! 

Here’s some more info: 

Title: Automating updates to dashboards on Shiny Server

Date: Thursday, January 25th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Clinton Oyogo David is a data scientist with 7 years of experience, currently working with Oxford Policy Management (OPM) in the Research and Evidence data innovations team. Prior to joining OPM, he worked at the World Agroforestry Centre as a junior data scientist in the Spatial Data Science and Applied Learning Lab.

Description: In this workshop, we will talk about the configurations and set-ups needed to automate updates to R Shiny dashboards deployed on a Shiny server. The talk will touch on GitHub webhooks, APIs (Django), and bash scripting. With the set-up in place, you will not need to manually update the code on the Shiny server; a push event to GitHub will be enough to have your code changes reflected on the dashboard in a matter of seconds.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)

How can I register?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring the participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or leave it up to us to allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?

  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)
  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.

If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).


You can also find more information about this workshop series, a schedule of our future workshops, and a list of our past workshops (for which you can get the recordings & materials) here.


Looking forward to seeing you during the workshop!

Use google’s Gemini in R with R package “gemini.R”

Introduction

A few days ago, Google presented its own multimodal LLM, named “Gemini”.


There was also an article named “How to Integrate google’s gemini AI model into R” that briefly explains how to use the Gemini API in R.

Thanks to Deepanshu Bhalla (writer of the above article), I drew a lot of inspiration and did some research on how to utilize the Gemini API further. I’m glad to share the results with you.

In this article, I want to highlight how to use Gemini with R and Shiny via an R package for the Gemini API.

(You can see the result and contribute in the GitHub repository: gemini.R)

Gemini API


As of today (2023-12-26), the Gemini API mainly consists of 4 things. You can see more details in the official docs.

1. Gemini Pro: takes text and returns text
2. Gemini Pro Vision: takes text and an image and returns text
3. Gemini Pro Multi-turn: just chat
4. Embedding: for NLP

I’ll use 1 & 2.

You can get API keys in Google AI Studio

However, the official docs don’t describe how to use the Gemini API in R. (How sad.)
But we can handle it as a REST API (I’ll explain later).

Shiny application

I made a very brief concept of a Shiny application that uses the Gemini API: it takes an image and text (maybe “Explain this picture”) and returns the answer from Gemini.

(The numbers indicate the expected user flow.)

This UI consists of 5 components:

1. fileInput to upload an image
2. imageOutput to show the uploaded image
3. textInput for the prompt
4. actionButton to send the request to Gemini
5. textOutput to show the value returned from Gemini

And this is the resulting Shiny and R code (again, you can see it in the GitHub repository):



library(shiny)
library(gemini.R)

ui <- fluidPage(
  sidebarLayout(
    NULL,
    mainPanel(
      fileInput(
        inputId = "file",
        label = "Choose file to upload"
      ),
      div(
        style = 'border: solid 1px blue;',
        imageOutput(outputId = "image1")
      ),
      textInput(
        inputId = "prompt",
        label = "Prompt",
        placeholder = "Enter Prompts Here"
      ),
      actionButton("goButton", "Ask to gemini"),
      div(
        style = 'border: solid 1px blue; min-height: 100px;',
        textOutput("text1")
      )
    )
  )
)

server <- function(input, output) {
  observeEvent(input$file, {
    path <- input$file$datapath
    output$image1 <- renderImage({
      list(src = path)
    }, deleteFile = FALSE)
  })

  observeEvent(input$goButton, {
    output$text1 <- renderText({
      gemini_image(input$prompt, input$file$datapath)
    })
  })
}

shinyApp(ui = ui, server = server)


gemini.R package

You may be wondering: “What is the gemini_image function?”

It is a function that sends an API request to the Gemini server and returns the result.

It consists of 3 main parts:

1. Model query
2. API key
3. Content

I used the gemini_image function in the example, but I’ll explain the gemini function first (which is the function that sends text and gets text back).


Gemini’s example API usage in the official docs is a REST call, which can be transformed into R as shown below.
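A minimal sketch of the gemini function using the httr package (an assumed implementation for illustration; the package’s actual internals may differ):

library(httr)

gemini <- function(prompt) {
  # model query: the v1beta REST endpoint for Gemini Pro;
  # API key: read from an environment variable (see Sys.setenv below)
  url <- paste0(
    "https://generativelanguage.googleapis.com/v1beta/models/",
    "gemini-pro:generateContent?key=", Sys.getenv("GEMINI_API_KEY")
  )
  # content: the request body is built as nested lists, encoded as JSON
  response <- POST(
    url = url,
    body = list(
      contents = list(
        list(parts = list(list(text = prompt)))
      )
    ),
    encode = "json"
  )
  # extract the generated text from the first candidate
  content(response)$candidates[[1]]$content$parts[[1]]$text
}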


Also, the Gemini API key must be set before use, with the Sys.setenv() function.
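For example (the environment variable name here is an assumption of this sketch):

Sys.setenv(GEMINI_API_KEY = "your-api-key")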

Anyway, you should note that the request body for the API mainly consists of nested lists.

Similarly, the gemini_image function for the Gemini Pro Vision API can be sketched as below.



Note that the image must be encoded as base64, using a base64encode() function (e.g. from the base64enc package), and provided as a separate list element.
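A matching sketch of gemini_image (again an assumed implementation, not necessarily the package’s actual code):

library(httr)
library(base64enc)

gemini_image <- function(prompt, image_path) {
  url <- paste0(
    "https://generativelanguage.googleapis.com/v1beta/models/",
    "gemini-pro-vision:generateContent?key=", Sys.getenv("GEMINI_API_KEY")
  )
  response <- POST(
    url = url,
    body = list(
      contents = list(
        list(parts = list(
          list(text = prompt),
          # the image travels as a separate base64-encoded list element;
          # the mime type should match the uploaded file
          list(inline_data = list(
            mime_type = "image/png",
            data = base64encode(image_path)
          ))
        ))
      )
    ),
    encode = "json"
  )
  content(response)$candidates[[1]]$content$parts[[1]]$text
}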


Example 

So, with the Shiny application and the gemini.R package, you can now run the example application and ask Gemini about an image.


Summary 

I made a very basic R package, “gemini.R”, to use the Gemini API.

It provides 2 functions: gemini and gemini_image.

There are still many possibilities to develop this package further, like a chat feature (like Bard) or NLP embeddings.

Finally, I want to hear feedback and receive contributions from you. (Really.)


Thanks. 

* P.S. I think just using Bard / ChatGPT / Copilot is much better for personal usage (unless you want to provide an AI service via R).

Learning inferential statistics using R

Imagine you need to find the average height of 20-year-olds. One way is to go around and measure each person individually. But that seems quite a bit of work, doesn’t it? Luckily, there’s a better way. Inferential statistics allows us to use samples to draw conclusions about the population. In other words, we can get a small group of people and use their characteristics to estimate the characteristics of the entire group.
 To see how this works in practice, let’s take a look at a dataset from Kaggle. This platform provides a wealth of data sets from various fields, each offering unique challenges for R users. Here, we’ll be using a dataset on Cardiovascular diseases compiled by Jocelyn Dumlao.
This dataset originates from a renowned multispecialty hospital situated in India, encompassing a comprehensive array of health-related information. Comprising an extensive structure of 1000 rows and 14 columns, this dataset plays a pivotal role in the early detection of diseases.
Let us see how to import this into RStudio. If the dataset is in .csv format, it can be imported with the read.csv function from base R. Replace “File path” with the path of your downloaded dataset.
cardio <- read.csv("File path")
Just type in the name of the variable you used to import the dataset so that you can view the entire dataset in RStudio.
cardio


The first 6 rows of the dataset can be viewed using the ‘head’ function.
top_6=head(cardio)
top_6

Similarly, the last 6 rows of the dataset can be viewed using the ‘tail’ function.
bottom_6=tail(cardio)
bottom_6

The dimension of the dataset (number of rows and columns) can be found out using the ‘dim’ function.
dimension=dim(cardio)
dimension

The entire dataset can be termed the population, and all the population parameters can be found easily. The mean of a target variable in the population is calculated by the ‘mean’ function. Below, we choose serumcholestrol as the target variable.
mean_chol=mean(cardio$serumcholestrol)
mean_chol

So, we can infer that the average serumcholestrol level in the patient population from the hospital is 311.447.
There also exists a function to calculate the standard deviation of a dataset.

std_chol=sd(cardio$serumcholestrol)
std_chol


From this value, it can be understood that the values of serumcholestrol typically lie 132.4438 above or below the mean level.
We take a random sample of size 100, where our target variable is serumcholestrol. If you want to take a random sample with replacement, give the third argument as TRUE. Here, we’re taking a sample without replacement.

sample_1=sample(cardio$serumcholestrol,100,FALSE)
sample_1
mean_sample_chol=mean(sample_1)
mean_sample_chol

The mean of the sample that we selected is 317.51. This mean can be used to calculate a test statistic, which in turn can be used to make decisions about the null hypothesis (whether to reject it or not).
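For instance (an illustrative addition using base R), a one-sample t-test compares the sample mean against the population mean:

t.test(sample_1, mu = mean_chol)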


Calculating the standard error of the sample


Getting the standard deviation of a dataset gives us many insights. The standard deviation measures the spread of the data around the mean, and the standard deviation of the sampling distribution of the mean is called the standard error; it can be estimated from a single sample as the sample standard deviation divided by the square root of the sample size.
std_sample_1=sd(sample_1)
std_sample_1
std_error=sd(sample_1)/sqrt(length(sample_1))
std_error
The mean and the standard deviation of the sample are close to the population mean and standard deviation.

Next, we plot the sampling distribution as a histogram, with cholesterol levels on the x-axis and frequency on the y-axis.

To get a sampling distribution, we repeatedly take samples 1000 times. This is done using the replicate function, which repeatedly evaluates an expression a given number of times.
samp_dist_1=replicate(1000,mean(sample(cardio$serumcholestrol,100,replace=TRUE)))
samp_dist_1
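The histogram itself can be drawn with the hist function; this call mirrors the one used for sample 2 further below:
hist(samp_dist_1,main="Sampling distribution of serum_cholestrol",xlab = "Cholestrol Levels",ylab = "Frequency", col = "skyblue")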

The obtained graph is similar to a normal distribution curve; that is, values near the mean occur more frequently than values far from the mean. Now let's calculate the variance of the sampling distribution using the var function.
variance_sample_1=var(samp_dist_1)
variance_sample_1

Now let us see how increasing the sample size affects the variance of the sample.
Increasing the sample size to 200:
sample_2=sample(cardio$serumcholestrol,200,FALSE)
sample_2
Calculating the mean of sample 2:
mean_sample_chol=mean(sample_2)
mean_sample_chol

The mean of sample 2, with sample size 200, is 308.875.

Calculating the standard deviation and standard error of sample 2:
std_sample_2=sd(sample_2)
std_sample_2
std_error2=std_sample_2/sqrt(length(sample_2))
std_error2

The standard deviation of sample 2 is 135.9615.
We repeat the previous steps to obtain a sampling distribution.
samp_dist_2=replicate(1000,mean(sample(cardio$serumcholestrol,200,replace=TRUE)))
samp_dist_2
Now we plot it like before.
hist(samp_dist_2,main="Sampling distribution of serum_cholestrol",xlab = "Cholestrol Levels",ylab = "Frequency", col = "skyblue")
variance_sample_2=var(samp_dist_2)
variance_sample_2
The variance of sample 2, with sample size 200, is 84.513. That is, the variance of sample 1 (size 100) is greater than that of sample 2. Hence we can conclude that as the sample size increases, the variance and the standard error of the sampling distribution decrease; in other words, precision increases with an increase in sample size.

Authors: Aadith Joseph Mathew, Amrutha Paalathara, Devika S Vinod, Jyosna Philip

DICOM Parsing with R

Abstract

This blog post describes how to parse medically relevant non-image meta information from DICOM files using the programming language R. The result of the parsing process is an R data frame in the form of a name – value table that is both easy to handle and flexible.

We first describe the general structure of DICOM files and the kind of information they contain. Following this, our DicomParseR module in R is explained in detail. The module has been developed for practical DICOM parsing from our hospital’s cardiac magnetic resonance (CMR) modalities in order to populate a scientific database. The given examples hence refer to CMR information parsing; however, due to its generic nature, DicomParseR may be used to parse information from any type of DICOM file.

The following graph illustrates the use of DicomParseR in our use case as an example:

Structure of CMR DICOM files

On the top level, a DICOM file generated by a CMR modality consists of a header (hdr) section and the image (img) information. In between, an XML part can be found.

The hdr section mainly contains baseline information about the patient, including name, birth date and system ID. It also contains contextual information about the observation, such as date and time, ID of the modality and the observation protocol.

The XML section contains quantified information generated by the modality’s embedded AI, e. g. regarding myocardial blood flow (MBF). All information is stored between specifically named sub tags. These tags will serve to search for specific information. For further information on DICOM files, please refer to dicomstandard.org.

The heterogeneous structure of DICOM files as described above requires the use of distinct submodules to compose a technically harmonized name – value table. The information from the XML section will be extended by information from the hdr section. The key benefit of our DicomParseR module is to parse these syntactically distinct sections, which will be described in the following.

Technical Approach

To extract information from the sub tags in the XML section and any additional relevant meta information from the hdr section of the DICOM file, the following steps are performed:

    1. Check if a DICOM file contains desired XML tag
    2. If the desired tag is present, extract and transform baseline information from hdr part
    3. If step 1 applied, extract and transform desired tag information from XML part
    4. Combine the two sets of information into an integrated R data frame
    5. Write the data frame into a suitable database for future scientific analysis

The steps mentioned above will be explained in detail in the following.

Step 1: Check if a DICOM file contains the desired XML tag

At the beginning of processing, DicomParseR will check whether a certain tag is present in the DICOM file, in our case <ismrmrdMeta>. In case that tag exists, quantified medical information will be stored here. Please refer to ISMRMRD for further information about the ismrmrd data format.

For this purpose, DicomParseR offers the function file_has_content(), which expects the file and a search tag as parameters. The function will use base::readLines() to read in the file and stringr::str_detect() to detect whether the given tag is present in the file. Performance tests with the help of the package microbenchmark have proven stringr’s outstanding processing speed in this context. If the given tag was found, TRUE is returned, otherwise FALSE.
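A minimal sketch of what such a check might look like (assumed code; the module’s actual implementation may differ):

file_has_content <- function(file, tag) {
  # read the raw file, skipping embedded NUL bytes
  file_content <- readLines(file, encoding = "UTF-16LE", skipNul = TRUE)
  # TRUE if any line contains the search tag
  any(stringr::str_detect(file_content, tag))
}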

Any surrounding application may hence call

if (DicomParseR::file_has_content(file, "ismrmrdMeta")) {…}

to only continue parsing DICOM files that contain the desired tag.

It is important to note that the information generated by the CMR modality is actually not a single DICOM file but rather a composition of a multitude of files. These files may or may not contain the desired XML tag(s). If step 1 were omitted, our parsing module would import many more files than necessary.

Step 2: Extract and transform baseline hdr information

Step 2 will extract hdr information from the file. For this purpose, DicomParseR uses the function readDICOMFile() provided by package oro.dicom. By calling

oro.dicom::readDICOMFile(dicom_file)[["hdr"]]

the XML and image part are removed. The hdr section contains information such as patient’s name, sex and birthdate as well as meta information about the observation, such as date, time and contrast bolus. DicomParseR will save the hdr part as a data frame (in the following called df_hdr) in this step and later append it to the data frame that is generated in the next step.

Note that the oro.dicom package provides functionality to extract header and image data from a DICOM file as shown in the code snippet. However, it does not provide an out-of-the-box solution to extract the XML section and return it as an R data frame. For this purpose, the DicomParseR wraps the extra functionality required around existing packages for DICOM processing.

Step 3: Extract and transform information from XML part

In this step, the data within the provided XML tag is extracted and transformed into a data frame.

The following snippet shows an example of how myocardial blood flow numbers are stored in the respective DICOM files (values modified for data privacy):

<ismrmrdMeta>
  …
  <meta>
    <name>GADGETRON_FLOW_ENDO_S_1</name>
    <value>1.95</value>
    <value>0.37</value>
    <value>1.29</value>
    <value>3.72</value>
    <value>1.89</value>
    <value>182</value>
  </meta>
  …
</ismrmrdMeta>

Within each meta tag, “name” specifies the context of the observation and “value” stores the myocardial blood flow data. The different data points between the value tags correspond to different descriptive metrics, such as mean, median, minimum and maximum values. Other meta tags may be structured differently. In order to stay flexible, the final extraction of a concrete value is done in the last step of data processing, see step 5.

Now, to extract and transform the desired information from the DICOM file, DicomParseR will first use its function extract_xml_from_file() for extraction and subsequently the function convert_xml2df() for transformation. With

extract_xml_from_file <- function(file, tag) {
  file_content <- readLines(file, encoding = "UTF-16LE", skipNul = TRUE)
  # the lines of the opening and closing tag delimit the XML section
  indices <- grep(tag, file_content)
  xml_part <- paste(file_content[indices[[1]]:indices[[2]]], collapse = "\n")
  return(xml_part)
}

and “ismrmrdMeta” as tag, the function will return a string in XML structure. That string is then converted to an R data frame in the form of a name – value table by convert_xml2df(). Based on our example above, the resulting data frame will look like this:

name                     value  [index]
GADGETRON_FLOW_ENDO_S_1  1.95   [1]
GADGETRON_FLOW_ENDO_S_1  0.37   [2]

That data frame is called df_ismrmrdMeta in the following. A specific value can be accessed with the combination of name and index; see the example in step 5.
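The post does not list convert_xml2df() itself; a minimal sketch of such a conversion using the xml2 package could look like this (assumed code, not the module’s actual implementation):

convert_xml2df <- function(xml_part) {
  doc <- xml2::read_xml(xml_part)
  metas <- xml2::xml_find_all(doc, ".//meta")
  # one row per <value>, indexed within its surrounding <meta> tag
  do.call(rbind, lapply(metas, function(meta) {
    name <- xml2::xml_text(xml2::xml_find_first(meta, "./name"))
    values <- xml2::xml_text(xml2::xml_find_all(meta, "./value"))
    data.frame(name = name, value = values, index = seq_along(values))
  }))
}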

Step 4: Integrate hdr and XML data frames

At this point in time, two data frames have resulted from processing the original DICOM file: df_hdr and df_ismrmrdMeta.

In this step, those two data frames are combined into one single data frame called df_filtered. This is done by using base::rbind().

For example, executing

df_filtered <- rbind(c("Pat_Weight", df_hdr$value[df_hdr$name=="PatientsWeight"][1]), df_ismrmrdMeta)

will extend the data frame df_ismrmrdMeta by the patient’s weight. The result is returned in form of the target data frame df_filtered. As with df_ismrmrdMeta, df_filtered will be a name – value table. This design has been chosen in order to stay as flexible as possible when it comes to subsequent data analysis.

Step 5: Populate scientific database

The data frame df_filtered contains all information from the DICOM file as a name – value table. In the final step 5, df_filtered may now be split again as required to match the use case specific schema of the scientific database.

For example, in our use case, the table “cmr_data” in the scientific database is dedicated to persisting MBF values. An external program (in this case, an R Shiny application providing a GUI for end-user interaction) will call its function transform_input_to_cmr_data() to generate a data frame in the format of the “cmr_data” table. By calling

transform_input_to_cmr_data <- function(df) {
  mbf_endo_s1 = as.double(df$value[df$name=="GADGETRON_FLOW_ENDO_S_1"][1])
  mbf_endo_s2 = ...
}

with df_filtered as parameter, the mean MBF values of the heart segments are extracted and can now be sent to the database. Another sub step would be to call transform_input_to_baseline_data() to persist baseline information in the database.

Summary and Outlook

This blog post has described the way DICOM files from CMR observations can be processed with R in order to extract quantified myocardial blood flow values for scientific analysis. Apart from R, different approaches by other institutes have been discussed publicly as well, e. g. by using MATLAB. Interested readers may refer to NIH, among others, for further information.

The chosen approach tries to respect both the properties of DICOM files, that is, their heterogeneous inner structure, the different types of information and their file size, as well as the specific data requirements by our institute’s use cases. With a single R data frame in form of a name – value table, the result of the process is easy to handle for further data analysis. At the same time, due to its flexible setup, DicomParseR may serve as a module in any kind of DICOM-related use case.

Thomas Schröder
Centrum für medizinische Datenintegration BHC Stuttgart

Customizing slides and documents using Quarto extensions workshop

Join our workshop on Customizing slides and documents using Quarto extensions, which is a part of our workshops for Ukraine series!


Here’s some more info: 


Title: Customizing slides and documents using Quarto extensions

Date: Thursday, January 11th, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone)

Speaker: Nicola Rennie is a Lecturer in Health Data Science based within the Centre for Health Informatics, Computing, and Statistics at Lancaster Medical School. Her research interests include applications of statistics and machine learning to healthcare and medicine, communicating data through visualisation, and understanding how we teach statistical concepts. Nicola also has experience in data science consultancy and collaborates closely with external research partners. She can often be found at data science meetups, presenting at conferences, and is the R-Ladies Lancaster chapter organiser.


Description: Quarto is an open-source scientific and technical publishing system that allows you to combine text with code to create fully reproducible documents in a variety of formats. The addition of custom styling to documents can make them look more professional and recognisable. In the first half of this workshop, we’ll look at ways to customise HTML outputs (including documents and revealjs slides) using CSS, and ways to customise PDF documents using LaTeX. In the second half, we’ll discuss the use of Quarto extensions as a way of sharing customised templates with others, demonstrate how to install and use extensions, and show the process of building your own custom style extension.


Minimal registration fee: 20 euro (or 20 USD or 800 UAH)


How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring the participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or leave it up to us to allocate the sponsored place to students who have signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, and a list of our past workshops (for which you can get the recordings & materials) here.


Looking forward to seeing you during the workshop!