Overview
Data analysis is a useful way to solve problems in many situations. Many things go into effective data analysis, but three are commonly mentioned:
1. defining the problem you want to solve through data analysis
2. collecting meaningful data
3. having the skills (and expertise) to analyze the data
R is often mentioned as a way to effectively fill the third of these, but at the same time, it’s often seen as a big barrier for people who haven’t used R before (or have no programming experience).
In my previous work experience, there were many situations where I was able to turn experiences into insights and produce meaningful results with a little data analysis, even if I was “not a data person”.
For this purpose, we have developed an open source R package called “Statgarten” that lets you use the features of R without having to write R directly, and I would like to introduce it.
Here’s the repo link (note: some of the documentation is still written only in Korean)
👣 Flow of data analysis
The order and components may vary depending on your situation, but I like to break it down into five broad steps:
1. data preparation
2. EDA
3. data visualization
4. calculate statistics
5. share results
In this article, I’ll share a lightweight data analysis example that follows these steps (while utilizing R’s features and not typing R code whenever possible).
Note: since our work is still in progress, including deployment as a web application, this example uses the R packages.
Install
With this code, you can install all components of the statgarten system.
remotes::install_github('statgarten/statgarten')
library(statgarten)
Run
The core of the statgarten ecosystem is door, which allows you to bundle other functional packages together. (Of course, you can also use each package as a separate shiny module.)
Let’s load the door library and run it via run_app.
library(door)
run_app()
# OR
door::run_app()
If you didn’t set anything, the shiny application will run in RStudio’s viewer panel, but we recommend running it in a web browser like Chrome via the Show in new window icon (the icon to the left of the Stop button).
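If you would rather skip the icon click, a minimal sketch is to set Shiny's standard `shiny.launch.browser` option before starting the app. I'm assuming here that door's run_app starts the app through Shiny's usual launch mechanism and therefore honours this option; I have not checked that against door itself.

```r
# Ask Shiny to open apps in the system default browser instead of the
# RStudio viewer. `shiny.launch.browser` is a standard Shiny option.
options(shiny.launch.browser = TRUE)

# Only start the app when door is installed and the session is interactive.
# Assumption: door::run_app() respects the option set above.
if (interactive() && requireNamespace("door", quietly = TRUE)) {
  door::run_app()
}
```

The option only lasts for the current R session, so it does not change your RStudio settings permanently.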
If you don’t have any problems running it (please raise an issue on DOOR to let us know if you do), you should see the screen below.
1. Data preparation
There are four ways to prepare data for Statgarten: 1) upload a file from your local PC, 2) enter the URL of a file, 3) enter the URL of a Google Sheet, or 4) use the public data included in statgarten. These can be found in the tabs File, URL, Google Sheet, and Datatoys, respectively.
In this example, we will use the public data named bloodTest.
bloodTest contains blood test data from 2014-15 provided by the National Health Insurance Service in South Korea.
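If you want to peek at the same data from the R console, a minimal sketch could look like this. I'm assuming that bloodTest ships in the datatoys package behind the Datatoys tab and can be loaded with data(); the block simply skips the step if the package is not installed.

```r
# Assumption: the bloodTest dataset is provided by the datatoys package.
if (requireNamespace("datatoys", quietly = TRUE)) {
  # Load the dataset into the current environment under its own name.
  data("bloodTest", package = "datatoys", envir = environment())
  str(bloodTest)         # column names and types
  print(head(bloodTest)) # first few rows
} else {
  message("datatoys is not installed, so bloodTest was not loaded.")
}
```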
1.5 Define the problem
Using the bloodTest data, we’ll look for clues to this question:
“Are people with high total cholesterol more likely to be diagnosed with anemia and cerebrovascular disease, and does the incidence vary by gender?”
With a few clicks, select the data as shown below. (After selection, click the Import data button.)
Before we start EDA, let’s process the data for analysis.
In keeping with the question, we will “remove” the columns that are not needed and change some numeric columns to the factor type.
This can be done with the Update Data button: columns are selected with the checkboxes, and the type can be changed under New class.
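For readers curious what these clicks correspond to in R, here is a rough base R sketch. It uses a small simulated data frame rather than the real bloodTest: the values are made up, HGB is only a placeholder for "a column we don't need", and only the names SEX, TCHOL, ANE and STK come from this article.

```r
# Simulated stand-in for bloodTest (made-up values, illustrative only).
set.seed(1)
n <- 200
blood <- data.frame(
  SEX   = sample(rep(1:2, each = n / 2)),
  HGB   = round(rnorm(n, mean = 14, sd = 1.5), 1),  # placeholder extra column
  TCHOL = round(rnorm(n, mean = 195, sd = 35)),
  ANE   = sample(rep(c(0, 1), times = c(184, 16))),
  STK   = sample(rep(c(0, 1), times = c(188, 12)))
)

# "Remove" what is not needed by keeping only the columns we want.
keep <- c("SEX", "TCHOL", "ANE", "STK")
blood <- blood[, keep]

# Change the coded numeric columns to the factor type.
factor_cols <- c("SEX", "ANE", "STK")
blood[factor_cols] <- lapply(blood[factor_cols], factor)

str(blood)
```

TCHOL stays numeric because it is a measurement, while SEX, ANE and STK are codes, which is why they become factors.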
2. EDA
You can see the organization of the data in the EDA pane below, where we see that the genders are 1 and 2, so we’ll use the Replace function on the Transform Data button to change them to M/F.
3. Data visualization
In the Vis Panel, you can visualize total cholesterol (TCHOL) by anemia (ANE) status by dragging the variables, and likewise total cholesterol by cerebrovascular disease (STK) status.
However, it’s hard to tell from the figures whether there is a significant difference (in either case).
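Outside the app, a comparable picture can be sketched in base R. The block below uses simulated data (made-up values, same column names as above), first applies the 1/2 to M/F replacement from the EDA step, and then draws boxplots. Two things are my assumptions: that 1 means M and 2 means F, and that a boxplot is a fair stand-in for whatever chart you build in the Vis Panel.

```r
# Simulated stand-in for bloodTest (made-up values, illustrative only).
set.seed(1)
n <- 200
blood <- data.frame(
  SEX   = sample(rep(1:2, each = n / 2)),
  TCHOL = round(rnorm(n, mean = 195, sd = 35)),
  ANE   = factor(sample(rep(c(0, 1), times = c(184, 16)))),
  STK   = factor(sample(rep(c(0, 1), times = c(188, 12))))
)

# Replace step from the EDA section. Assumption: 1 = M and 2 = F.
blood$SEX <- factor(blood$SEX, levels = c(1, 2), labels = c("M", "F"))

# Two boxplots side by side: TCHOL by ANE and TCHOL by STK.
op <- par(mfrow = c(1, 2))
bp_ane <- boxplot(TCHOL ~ ANE, data = blood,
                  main = "TCHOL by anemia (ANE)",
                  xlab = "ANE", ylab = "TCHOL")
bp_stk <- boxplot(TCHOL ~ STK, data = blood,
                  main = "TCHOL by cerebrovascular disease (STK)",
                  xlab = "STK", ylab = "TCHOL")
par(op)  # restore the previous plot layout
```

As in the app, overlapping boxes alone do not tell you whether a difference is significant, which is what the statistics step is for.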
4. Statistics
You can view the distribution of values and key statistics for each variable via Distribution in the EDA panel.
For the anemia (ANE) and cerebrovascular disease (STK) variables, we see that 0 (never diagnosed) accounts for 92.2% and 93.7%, respectively, and 1 (diagnosed) for 7.8% and 6.3%, respectively.
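Percentages like these are simply the share of each level in a column. A minimal base R sketch, again on simulated data with made-up values (so the numbers it prints are not the ones quoted above):

```r
# Simulated stand-in for bloodTest (made-up values, illustrative only).
set.seed(1)
n <- 200
blood <- data.frame(
  TCHOL = round(rnorm(n, mean = 195, sd = 35)),
  ANE   = sample(rep(c(0, 1), times = c(184, 16))),
  STK   = sample(rep(c(0, 1), times = c(188, 12)))
)

# Share of each level (0 = never diagnosed, 1 = diagnosed), in percent.
ane_pct <- round(100 * prop.table(table(blood$ANE)), 1)
stk_pct <- round(100 * prop.table(table(blood$STK)), 1)
print(ane_pct)
print(stk_pct)

# Key statistics for a numeric column.
print(summary(blood$TCHOL))
```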
In the Stat Panel, let’s create a “Table 1” to represent the baseline characteristics of the data, based on anemia status (ANE).
For cerebrovascular disease status (STK), Table 1 again shows that the difference in total cholesterol (TCHOL) by gender (SEX) is significant, with a p-value below 0.05.
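To give a feel for where the p-values in such a table can come from, here is a rough base R sketch on simulated data. The helper p_by_group is hypothetical, and the choice of tests (Welch t-test for numeric columns, chi-squared test for factors) is my assumption for illustration, not necessarily what the Stat Panel runs. Because the values are random, the p-values it prints say nothing about the real bloodTest data.

```r
# Simulated stand-in for bloodTest (made-up values, illustrative only).
set.seed(1)
n <- 200
blood <- data.frame(
  SEX   = factor(sample(rep(c("M", "F"), each = n / 2))),
  TCHOL = round(rnorm(n, mean = 195, sd = 35)),
  ANE   = factor(sample(rep(c(0, 1), times = c(184, 16)))),
  STK   = factor(sample(rep(c(0, 1), times = c(188, 12))))
)

# Hypothetical helper: one p-value per variable, compared across `group`.
p_by_group <- function(data, group, vars) {
  sapply(vars, function(v) {
    if (is.numeric(data[[v]])) {
      # Numeric variable: Welch two-sample t-test between the two groups.
      d <- data.frame(y = data[[v]], g = data[[group]])
      t.test(y ~ g, data = d)$p.value
    } else {
      # Categorical variable: chi-squared test on the cross table.
      chisq.test(table(data[[v]], data[[group]]))$p.value
    }
  })
}

p_ane <- p_by_group(blood, "ANE", c("TCHOL", "SEX"))
p_stk <- p_by_group(blood, "STK", c("TCHOL", "SEX"))
print(round(rbind(ANE = p_ane, STK = p_stk), 3))
```

Each row of the printed matrix corresponds to one grouping variable, which mirrors building Table 1 once by ANE and once by STK.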
5. Share results
I think Quarto (or R Markdown) is the most effective way to share data analysis results in R, but using it inside a shiny app is another matter.
As a result, statgarten’s result sharing is limited to exporting a data table or downloading an image.
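If you do have R at hand, sharing a table is a one-liner. The sketch below writes the diagnosis shares quoted in the Statistics section to a CSV file and reads it back; the file name and the column names are placeholders I made up.

```r
# Small result table, using the percentages reported in the Statistics section.
result <- data.frame(
  variable = c("ANE", "STK"),
  pct_0    = c(92.2, 93.7),  # never diagnosed
  pct_1    = c(7.8, 6.3)     # diagnosed
)

# Write the table to a temporary CSV file (hypothetical file name).
out_file <- file.path(tempdir(), "diagnosis_share.csv")
write.csv(result, out_file, row.names = FALSE)

# Whoever receives the file can read it straight back into R.
shared <- read.csv(out_file)
print(shared)
```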
⛳ Statgarten as Open source
The goal of the statgarten project is to help people process and utilize data in a rapidly growing data economy and to foster data literacy for all.
The project is being developed with the support of the Ministry of Science and ICT of the Republic of Korea, and has been selected for the 2022 Information and Communication Technology Development Project and the Standards Development Support Project.
But at the same time, it is an open source project that everyone can use and contribute to freely. (We’ve also used other open source projects in the development process)
It is being developed in various forms such as web app, docker, and R package, and is open to various forms of contributions such as development, case sharing, and suggestions.
Please try it out, raise an issue, fork or star it, or suggest what you need. We’ll do our best to incorporate it, so please support us 🙂
For more information, you can check out our GitHub page or drop us an email.
Thanks.
(Translated with DeepL ❤️)