Is round(0.5) 0 or 1?


Actually, it’s both possible

This Article was originally published before on YOZM-IT as Korean

Various way of data science 

There are many programming languages in the world and software that utilizes them. And those play an important role in “Data science”.

For example, if you’re using funnel analysis to improve your product, you might want to 

  • Compare the bounce rates of funnel stages before and after an event,
  • And perform a ratio test to calculate their statistical significance.
Image by the author

Meanwhile, data scientists have various career backgrounds and experiences. So They tend to use the methods they’re comfortable with, including Python, R, SAS and more.

We see this quite a bit, because in most cases, the software you use at the level of business doesn’t make much of a difference.

But what happens if you “produce different results by the software used?

The following image shows the results of running a proportion test in R, Python, and STATA with example mentioned.

Image from the author and CAMIS project

You can see that even though we used the same values of 1000 and 123, the p-value, which indicates the significance of the proportion test, is slightly different for each method.

There are many reasons why the calculation value is different depending on the method used, such as 

  • Different algorithms in the core logic of the programming language 
  • Different default values of the parameters used in the function.

In the example above, if you change the value of the parameter correct in R and apply “Continuity correction” as using “correct = F” , you can see that the result is the same as in STATA.

Image from CAMIS project

Rounding

Next, I’ll introduce rounding for more general data analysis. 

Image by the author

Similarly, you can see that the round changes its value depending on software.

If the fee is “0.5 billion” in some large financial transaction in business, the rounded cost could be zero or 1 billion, depending on how you calculate the rounding.

Another case could be Logistic regression, which various round can be reverse prediction.

Image from Wikipedia, edited by the author

Why is round different?

Let’s talk a little more about why this round is different. 

Rounding as we usually perceive it means changing 0 ~ 4 to 0, and 5 ~ 9 to 10, as shown below image.

And in decimal units, is rounding to the nearest whole number by changing .0 ~ .4999.. to 0 and .5 ~ .9999.. to 1

However, there are a number of mathematical interpretations of when exactly 0.5 , and when it is a negative number.

Image from the Learning corner

For example, round(-23.5) should produce -23 or -24?

Both are possible, depending on the mathematical interpretation and it’s called as rounding half up and rounding half down respectively. We can take this a step further and round both positive and negative numbers closer to zero, or vice versa.

This means that round(-23.5) will round to -23, and round(23.5) will round to 23, or round to -24 and 24, respectively. These are represented by the names Rounding half toward zero, Rounding half away from zero, respectively.

Finally, there are methods called Rounding half to even and Rounding half to odd, which mean that we want to consider the nearest integers to be even and odd, respectively.

In particular, the Rounding half to even method also goes by the names Convergent rounding, Statistician’s rounding, Dutch rounding, Gaussian rounding, and Bankers’ rounding, and is one of the official standard methods according to IEEE 754.

Bankers’ rounding

Bankers’s rounding, is default method in R , so Let’s breif a little bit more.

The image below shows the result of rounding from 0.0 to 2.0.

Image from the author

While this may seem like a good idea, there is actually a problem. Because .5 is unconditionally rounded to the next integer, there is an unconditional bias towards rounding to a “+ value”.

I don’t know the exact reason for this, but one theory is that the US IRS used to use this rounding to collect taxes and was sued for unfairly profiting by collecting more taxes from people who were .5 off, so they lost the case and changed to rounding to the nearest even (or odd) number to match the .5 rounding.

This means that by modifying the rounding as shown below, we can avoid the bias that was previously occurring.


The problem with different results

In recent years, industries in various domains, including pharmaceuticals and finance, have been trying to switch from “commercial” software such as SPSS, SAS and STATA to “open source” software such as Python, R and Julia . 

And as rounding mentioned earlier, diffrent result issue by software has been also raised which can create problems in terms of reproducibility, uncertainty, accuracy, and traceability.

So if you’re utilizing multiple softwares, you should be aware of why they produce different results, and how you can use them to properly

CAMIS project

Image from CAMIS project

CAMIS stands for Comparing Analysis Method Implementations in Software. 

This project compares the differences in softwares (or programming languages) and make standards to produce the same results.

The core area of the project is the “statistical computation” part, so most contributions come from the data science leaders who have strong understanding with it.

But CAMIS is also an open source project, that is not restricted and maintained with various people through regular discussions, collaboration, and sharing of project progress.

Below is one of the comparisons published on the CAMIS project’s webpage, which reviews how a one sample t-test is run with each software, what the results are, and how the results are compatible with each other.

Image from CAMIS project

The CAMIS project was started by members who interested in “SAS to R” in the medical and pharmaceutical industry. So it mainly focuses on R and SAS along major statistical data analysis, but recently it’s also working on how to use Python for data science in a broader domain of the industry.

Not only clasiccal methods such as Hypothesis tests, Regression analysis, but modern methods in data science such as Bayesian statistics, Causal inference and novel implementations of existing methods (e.g. MMRM) are topic of interest in project.

Sessions are increasingly appearing at multiple data science conferences, where many researchers and contributors are encouraged to promote, contribute and utilize it as a reference.

Finally, the CAMIS project is also collaborating with academia beyond the data science industry, as similar topics have been published in The American Statistician and Drug Information Association, among others.

Image from The American Statistician
The project is also currently working with students on a thesis entitled “A comparison of MMRM methodology in SAS and R software” and is open to collaborations and suggestions on other topics.

Summary

Various software used in data science. As the domain, the libraries or software used by an organization may be dependent on a particular language, which can sometimes be mixed with personal preferred methods. (in many cases, this doesn’t vary much at the level of the business)

However, if you’re not careful, the methods you use can lead to different results.

In this article, I’ve given you some examples of and reasons for differences in the methods used by different software for calculations, and introduced the CAMIS project, a research project that aims to minimize them to ensure consistency in data analysis.

If you use different software in your data analytics work, it’s a good idea to take a look at them to understand the differences and try to find the optimal method for your purposes,

And if you work in data science in the field, I highly recommend that you take an interstate in or contribute to the CAMIS project for a global collaborative experience.

Add shiny in quarto blog with shinylive

Shiny, without server

In previous article, I introduced method to share shiny application in static web page (github page)

At the core of this method is a technology called WASM, which is a way to load and utilize R and Shiny-related libraries and files that have been converted for use in a web browser. The main problem with wasm is that it is difficult to configure, even for R developers.

Of course, there was a way called shinylive, but unfortunately it was only available in python at the time.

Fortunately, after a few months, there is an R package that solves this configuration problem, and I will introduce how to use it to add a shiny application to a static page.

shinylive

shinylive is R package to utilize wasm above shiny. and now it has both Python and R version, and in this article will be based on the R version.

shinylive is responsible for generating HTML, Javascript, CSS, and other elements needed to create web pages, as well as wasm-related files for using shiny.

You can see examples created with shinylive at this link.





Install shinylive

While shinylive is available on CRAN, it is recommended to use the latest version from github as it may be updated from time to time, with the most recent release being 0.1.1. Additionally, pak is the recently recommended R package for installing R packages in posit, and can replace existing functions like install.packages() and remotes::install_github().

# install.packages("pak")

pak::pak("posit-dev/r-shinylive")


You can think of shinylive as adding a wasm to an existing shiny application, which means you need to create a shiny application first.

For the example, we’ll use the code provided by shiny package (which you can also see by typing shiny::runExample("01_hello") in the Rstudio console).
library(shiny)

ui <- fluidPage(

titlePanel("Hello Shiny!"),
  sidebarLayout(
  sidebarPanel(
    sliderInput(
      inputId = "bins",
      label = "Number of bins:",
      min = 1,
      max = 50,
      value = 30
    )
  ),
  mainPanel(
    plotOutput(outputId = "distPlot")
    )
  )
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
  x <- faithful$waiting
  bins <- seq(min(x), max(x), length.out = input$bins + 1)
  hist(x,
    breaks = bins, col = "#75AADB", border = "white",
    xlab = "Waiting time to next eruption (in mins)",
    main = "Histogram of waiting times"
  )
  })
}

shinyApp(ui = ui, server = server)

This code creates a simple shiny application that creates a number of histograms in response to the user’s input, as shown below.


There are two ways to create a static page with this code using shinylive, one is to create it as a separate webpage (like previous article) and the other is to embed it as internal content on a quarto blog page .

First, here’s how to create a separate webpage.

shinylive via web page

To serve shiny on a separate static webpage, you’ll need to convert your app.R to a webpage using the shinylive package you installed earlier.

Based on creating a folder named shinylive in my Documents(~/Documents) and saving `app.R` inside it, here’s an example of how the export function would look like

shinylive::export('~/Documents/shinylive', '~/Documents/shinylive_out')


When you run this code, it will create a new folder called shinylive_out in the same location as shinylive, (i.e. in My Documents), and inside it, it will generate the converted wasm version of shiny code using the shinylive package.

If you check the contents of this shinylive_out folder, you can see that it contains the webr, service worker, etc. mentioned in the previous post.


More specifically, the export function is responsible for adding the files from the local PC’s shinylive package assets, i.e. the library files related to shiny, to the out directory on the local PC currently running R studio.


Now, if you create a github page or something based on the contents of this folder, you can serve a static webpage that provides shiny, and you can preview the result with the command below.

httpuv::runStaticServer("~/Documents/shinylive_out")

shinylive in quarto blog


To add a shiny application to a quarto blog, you need to use a separate extension. The quarto extension is a separate package that extends the functionality of quarto, similar to using R packages to add functionality to basic R.

First, we need to add the quarto extension by running the following code in the terminal (not a console) of Rstudio.

quarto add quarto-ext/shinylive

You don’t need to create a separate file to plant shiny in your quarto blog, you can use a code block called {shinylive-r}. Additionally, you need to set shinylive in the yaml of your index.qmd.

filters: 
- shinylive

Then, in the {shinylive-r} block, write the contents of the app.R we created earlier. 

#| standalone: true
#| viewerHeight: 800
library(shiny)
ui <- fluidPage(
  titlePanel("Hello Shiny!"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(
        inputId = "bins",
        label = "Number of bins:",
        min = 1,
        max = 50,
        value = 30
      )
    ),
    mainPanel(
      plotOutput(outputId = "distPlot")
    )
  )
)
server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- faithful$waiting
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x,
      breaks = bins, col = "#75AADB", border = "white",
      xlab = "Waiting time to next eruption (in mins)",
      main = "Histogram of waiting times"
    )
  })
}
shinyApp(ui = ui, server = server)

after add this in quarto blog, you may see working shiny application.

You can see working example in this link

Summary

shinylive is a feature that utilizes wasm to run shiny on static pages, such as GitHub pages or quarto blogs, and is available as an R package and quarto extension, respectively.

Of course, since it is less than a year old, not all features are available, and since it uses static pages, there are disadvantages compared to utilizing a separate shiny server.

However, it is very popular for introducing shiny usage and simple statistical analysis, and you can practice it right on the website without installing R, and more features are expected to be added in the future.

The code used in blog (previous example link) can be found at the link.

Author: jhk0530

Use google’s Gemini in R with R package “gemini.R”

Introduction

Few days ago, Google presented their own multimodal-LLM named as “Gemini”.


Also there was article named “How to Integrate google’s gemini AI model into R” that tells us how to use gemini API in R brieflly.

Thanks to Deepanshu Bhalla (writer of above article), I’ve many inspirations and made some research to utilize Gemini API more. And I’m glad to share the results with you.

In this article, I want to highlight to How to use gemini with R and Shiny via R package for Gemini API

(You can see result and contribute in github repository: gemini.r)

Gemini API


As today (23.12.26), Gemini API is mainly consisted with 4 things. you can see more details in official docs.

1. Gemini Pro: Is get Text and returns Text 
2. Gemini Pro Vision: Is get Text and Image  and returns Text
3. Gemini Pro Multi-turn: Just chat
4. Embedding: for NLP

and I’ll use 1 & 2. 

You can get API keys in Google AI Studio

However, offical docs doesn’t describe for how to use Gemini API in R. (How sad)
But we can handle it as “REST API” ( I’ll explain it later)

Shiny application

I made very brief concept of Shiny application that uses Gemini API for get Image and Text (maybe “Explain this picture”) and returns Answer from Gemini

(Number is expected user flow)

This UI, is consisted 5 components.

1. fileInput for upload image
2. imageOutput for show uploaded Image
3. textInput for prompt 
4. actionButton for send API to Gemini
5. textOutput for show return value from Gemini

And this is result of shiny and R code (Again, you can see it in github repository)



library(shiny)
library(gemini.R)

ui
<- fluidPage(

  sidebarLayout(
    NULL,
    mainPanel(
      fileInput(
        inputId = “file”,
        label = “Choose file to upload”,
      ),
      div(
        style = ‘border: solid 1px blue;’,
        imageOutput(outputId = “image1”),
      ),
      textInput(
        inputId = “prompt”,
        label = “Prompt”,
        placeholder = “Enter Prompts Here”
      ),
      actionButton(“goButton”, “Ask to gemini”),
      div(
        style = ‘border: solid 1px blue; min-height: 100px;’,            textOutput(“text1”)
      )
    )
  )
)

server <- function(input, output) {
  observeEvent(input$file, {
    path <- input$file$datapath
    output$image1 <- renderImage({
      list( src = path )
    }, deleteFile = FALSE) })

  observeEvent(input$goButton, {
    output$text1 <- renderText({
      gemini_image(input$prompt, input$file$datapath)
    })
  })
}

shinyApp(ui = ui, server = server)


gemini.R package

I think you may think “What is gemini_image function?”

It is function to send API to Gemini server and return result.

and it consisted with 3 main part.

1. Model query
2. API key
3. Content

I used gemini_image function in example. but I’ll gemini function first (which is function to send text and get text)


Gemini’s API example usage is looks like below. (for REST API)

Which can be transformed like below in R


Also, gemini API key must set before use with “Sys.setenv” function

Anyway, I think you should note, body for API is mainly consisted with list.

Similarly, gemini_image function for Gemini Pro Vision API looks like below


is 

Note that, image must encoded as base64 using base64encode function and provided as separated list.


Example 

So with Shiny application and gemini.r package.

You now can run example application to ask image to Gemini.


Summary 

I made very basic R package “gemini.R” to use Gemini API. 

Which provides 2 function: gemini and gemini_image.

And still there’s many possiblity for develop this package. 

like feature to Chat like bard or provide NLP Embedding

and finally, I want to hear feedback or contribution from you. (Really)


Thanks. 

* P.S, I think just using bard / chatGPT / copilot is much better for personal usage. (unless you don’t want to provide AI service via R)

Build serverless shiny application via Github page

Simple guide for simple shiny application 

TL;DR

I made shiny application in github page with quarto.

You can check code in my github repository and result, result2


How we use shiny

Shiny is R package to make user utilize R with web browser without install it.

So my company utilizes shiny to provide statistical analysis for doctors (who don’t know R but need statistics).


Behind shiny

As you know, shiny is consisted with 2 part. UI and Server

You may think just UI is channel to both get input (data) from user and return calculated output (result) to user.
and server is just calculator

It means, server requires dynamic calculation that may change, not fixed contents (it called as static web page)

To achieve dynamic calculation, there are several options.

We can use shinyapps.io, posit connect, or deploy own shiny server in other cloud like AWS / azure / GCP …

These options can be categorized into two main categories: free but with limited features, or feature-rich but paid.

There is no single right answer, but I use shinyapps.io in see toy level project or deploy using shiny server in company’s cloud server which is not just toy level.

Creating Standalone Apps from Shiny with Electron [2023, macOS M1]

💡 I assume that…

    • You’re famillar with R / shiny 
    • You know basic terminal command.
    • You’ve heard about node.js, npm, or Javascript something…

1. Why standalone shiny app?

First, let’s talk about the definition of a standalone app.

I’m going to define it this way

An app that can run independently without any external help.

“External” here is probably a web browser, which means software that can be installed and run without an internet connection.


Rstudio can also be seen as a kind of standalone, as you can’t install packages or update them if you’re not connected to a network, but you can still use them.

Creating a standalone app with shiny is really unfamiliar and quite complicated. What are the benefits, and why should you develop it anyway?

I can think of at least two.


1. better user experience
Regardless of the deployment method, when using Shiny as a web app, you have to turn on your web browser, enter a URL, and run it.

The process of sending and receiving data over the network can affect the performance of your app.


However, a standalone app can run without a browser and use the OS’s resources efficiently, resulting in a slightly faster and more stable execution.

Of course, the advantage is that it can be used without an internet connection.

2. Improved security
Shiny apps run through a web browser anyway, so if a “legendary” hacker had their way, they could pose a threat to the security of Shiny apps.

However, standalone is a bit immune to this problem, as long as they don’t physically break into your PC.

2. Very short introduction of electron

Electron (or more precisely, electron.js) is a technology that allows you to embed chromium and node.js in binary form to utilize the (shiny!) technologies used in web development: html, css, and javascript, to quote a bit from the official page.

It’s a story I still don’t fully understand, but fortunately, there have been numerous attempts by people to make shiny standalone with electron.js before, and their dedication has led to the sharing of templates that remove the “relatively complex” process.

The article I referenced was “turn a shiny application into a tablet or desktop app” by r-bloggers, written in 2020, but times have changed so quickly that the stuff from then doesn’t work (at least not on my M1 MAC).

After a considerable amount of wandering, I found a github repository that I could at least understand. Unfortunately, the repository was archived in April 2022. There were some things that needed to be updated for March 23.

Eventually, I was able to make the shiny app work as a standalone app.

And I’m going to leave some footprints for anyone else who might wander in the future.

3. Packaging shiny app with Electron

It’s finally time to get serious and package shiny as an electron.

I’ll describe the steps in a detailed, follow-along way where possible, but if you run into any problems, please let me know by raising an issue in the repository.

(I’ve actually seen it package as a standalone app utilizing the template by following the steps below)


1. the first thing you need to do is install npm.js, npm, and electron forge using Rstudio’s terminal. (I’ll skip these)

2. fork/clone (maybe even star ⭐) the template below

https://github.com/zarathucorp/shiny-electron-template-m1-2023


3. open the cloned project in Rstudio (.Rproj)

4. if you get something like below, except for the version, you are good to go.



Now start at line 6 of readmd (of template).


Let’s name the standalone app we want to create (obviously) “helloworld”

💡 I’ll format directory like /this

5. Run npx create-electron-app helloworld in the terminal to create the standalone app package. This will create a directory called /helloworld, delete /helloworld/src.

6. move the template’s files below to /helloworld and set the working directory to /helloworld.

    • start_shiny.R
    • add_cran-binary_pkgs.R
    • get-r-mac.sh
    • /shiny
    • /src
7. in the console, use version to check the version of R installed on your PC. Then run the shell script sh ./get-r-mac.sh in the terminal to install R for electron. (The version on your PC and the version of R in sh should be the same)

8. Once you see that the /r-mac directory exists, install the automagic R package from the console

9. modify the package.json (change the author name of course) The parts that should look like the image are the dependencies, repository, and devDependencies parts.



10. develop a shiny app (assuming you’re familiar with shiny, I’ll skip this part)

11. install the R package for electron by running Rscript add-cran-binary-pkgs.R in the terminal.

12. in a terminal, update the package.json for electron with npm install (this is a continuation of 9)

13. in a terminal, verify that the standalone app is working by running electron-forge start

If, like me in the past, the electron app still won’t run, exit the app, restart your R session in Rstudio, and then run the standalone app again. (It seems to be an environment variable issue, such as R’s shiny port.



14. once you’ve verified that start is running fine, create a working app with electron-forge make.


🥳 Voila, you have successfully made shiny a standalone app using electron.

4. Summary

If I’ve succeeded in my intentions, you should be able to use the
template to make shiny a standalone app using electron in 2023 on an m1 mac.

That app (delivered as a zip file) now makes
    • the power of R / Shiny available to people with little experience
    • without installing or using R.
    • Or even in a “closed environment” with no network connection
Since electron is technically electron.js, my biggest challenge in creating a standalone app with electron was utilizing Javascript (which I have limited skills in compared to R).

Fortunately, I was able to do so by making some improvements to the templates that the pioneers had painstakingly created.

Thank you L. Abigail Walter, Travis Hinkelman, and Dirk Shumacher
I’ll end this post with a template that I followed up with that I hope you’ll find useful.

Thank you.

(Translated with DeepL ❤️)

Introduction to data analysis with {Statgarten}.





Overview

Data analysis is a useful way to help solve problems in quite a few situations.

There are many things that go into effective data analysis, but three are commonly mentioned

1. defining the problem you want to solve through data analysis
2. meaningful data collected
3. the skills (and expertise) to analyze the data

R is often mentioned as a way to effectively fill the third of these, but at the same time, it’s often seen as a big barrier for people who haven’t used R before (or have no programming experience).

In my previous work experience, there were many situations where I was able to turn experiences into insights and produce meaningful results with a little data analysis, even if I was “not a data person”.

For this purpose, We have developed an open source R package called “Statgarten” that allows you to utilize the features of R without having to use R directly, and I would like to introduce it.

Here’s the repo link (Note, some description is written in Korean yet)


👣 Flow of data analysis

The order and components may vary depending on your situation, but I like to define it as five broad flows.

1. data preparation
2. EDA
3. data visualization
4. calculate statistics
5. share results

In this article, I’ll share a lightweight data analysis example that follows these steps (while utilizing R’s features and not typing R code whenever possible).

Note, Since our work is still in progress, including deployment in the form of a web application, we will utilize R packages.
Install
With this code, you can install all components of statgarten system. 
remotes::install_github('statgarten/statgarten')
library(statgarten)
Run
The core of the statgarten ecosystem is door, which allows you to bundle other functional packages together. (Of course, you can also use each package as a separate shiny module)

Let’s load the door library, and run it via run_app.
library(door)

run_app() # OR door::run_app()
If you didn’t set anything, the shiny application will run in Rstudio’s viewer panel, but we recommend running it in a web browser like Chrome via the Show in new window icon (Icon to the left of the Stop button)

Statgarten app main pageIf you don’t have any problems running it (please raise an issue on DOOR to let us know if you do), you should see the screen below.
1. Data preparation
There are four ways to prepare data for Statgarten. 1) Upload a file from your local PC, 2) Enter the URL of a file, 3) Enter the URL of a Google Sheet, or 4) Finally, utilize the public data included in statgarten, which can be found in the tabs File, URL, Google Sheet, and Datatoys respectively.

In this example, we will utilize the public data named bloodTest.

bloodTest
contains blood test data from 2014-15 provided by the National Health Insurance Service in South Korea.
1.5 Define the problem
Utilizing bloodtest data, we’ll try to see clues for this question

“Are people with high total cholesterol more likely to be diagnosed with anemia and cerebrovascular disease, and does the incidence vary by gender?” 
With a few clicks, select the data as shown below. (after selection, click Import data button)

statgarten data select


Before we start EDA, let’s process the data for analysis.

In keeping with the theme, we will “remove” data that is not needed and change some numeric values to the type of factor.

This can be done with the Update Data button, where data selection is done with the checkbox. The type can be changed in the New class.

2. EDA
You can see the organization of the data in the EDA pane below, where we see that the genders are 1 and 2, so we’ll use the Replace function on the Transform Data button to change them to M/F.


3. Data visualization
In the Vis Panel, you can also visualize anemia (ANE) and total cholesterol (TCHOL) by dragging, as well as total cholesterol by cerebrovascular disease  (STK) status. 



However, it’s hard to tell from the figure if there is a significant difference (in both case).
4. Statistics
You can view the distribution of values by data and key statistics via Distribution in the EDA panel.


For the anemia (ANE) and cerebrovascular disease variables (STK), we see that 0 (never diagnosed) is 92.2% and 93.7%, respectively, and 1 (diagnosed) is 7.8% and 6.3%, respectively.


In the Stat Panel, let’s create a “Table 1” to represent the baseline characteristics of the data, based on anemia status (ANE).


Cerebrovascular disease status(STK) , again from Table 1, we can see that the value of total cholesterol (TCHOL) by gender (SEX) is significant with a Pvalue less than 0.05.


5. Share result
I think quarto (or Rmarkdown) is the most effective way to share data analysis results in R, but utilizing it in a shiny app is another matter.

As a result, statgarten’s results sharing is limited to exporting a data table or downloading an image.



⛳ Statgarten as Open source

The statgarten project has goal for

In order to help process and utilize data in a rapidly growing data economy and foster data literacy for all.
The project is being developed with the support of the Ministry of Science and ICT of the Republic of Korea, and has been selected as a target for the 2022 Information and Communication Technology Development Project and the Standards Development Support Project.

But at the same time, it is an open source project that everyone can use and contribute to freely. (We’ve also used other open source projects in the development process)

It is being developed in various forms such as web app, docker, and R package, and is open to various forms of contributions such as development, case sharing, and suggestions.

Please try it out, raise an issue, fork or stargaze it, or suggest what you need, and we’ll do our best to incorporate it, so please support us 🙂

For more information, you can check out our github page or drop us an email.

Thanks.

(Translated with DeepL ❤️)

Basic data analysis with palmerpenguins




Introduction

In June 17, nice article for introducing new trial dataset were uploaded via R-bloggers.

iris, one of commonly used dataset for simple data analysis. but there is a little issue for using it. 

Too good.

Every data has well-structured and most of analysis method works with iris very well.

In reality, most of dataset is not pretty and requires a lot of pre-process to just start. This can be possible works in pre-process

Remove NAs.
Select meaningful features
Handle duplicated or inconsistent values.
or even, just loading the dataset. if is not well-structured like Flipkart-products

However, in this penguin dataset, you can try for this work. also there’s pre-processed data too.

For more information, see the page of palmerpenguins

There is a routine for me with brief data analysis. and today, I want to share them with this lovely penguins. 


Contents

0. Load dataset and library on workspace.
library(palmerpenguins) # for data
library(dplyr) # for data-handling
library(corrplot) # for correlation plot
library(GGally) # for parallel coordinate plot
library(e1071) # for svm

data(penguins) # load pre-processed penguins 
palmerpenguins have 2 data penguins, penguins_raw , and as you can see from their name, penguins is pre-processed data.  

1. See the summary  and plot of Dataset
summary(penguins)
plot(penguins)


It seems speciesisland  and sex is categorical features.
and remaining for numerical features.

2. Set the format of feature
penguins$species <- as.factor(penguins$species)
penguins$island <- as.factor(penguins$island)
penguins$sex <- as.factor(penguins$sex)

summary(penguins)
plot(penguins)
and see summary and plot again. note that result of plot is same. 



There’s unwanted NA and . values in some features.

3. Remove not necessary datas ( in this tutorial, NA)
penguins <- penguins %>% filter(sex == 'MALE' | sex == 'FEMALE')
summary(penguins)
And here, I additionally defined color values for each penguins to see better plot result
# Green, Orange, Purple
pCol <- c('#057076', '#ff8301', '#bf5ccb')
names(pCol) <- c('Gentoo', 'Adelie', 'Chinstrap')
plot(penguins, col = pCol[penguins$species], pch = 19)

Now, plot results are much better to give insights.

Note that, other pre-process step may requires for different datasets.

4. See relation of categorical features

My first purpose of analysis this penguin is species
So, I will try to see relation between species and other categorical values

4-1. species, island
table(penguins$species, penguins$island)
chisq.test(table(penguins$species, penguins$island)) # meaningful difference

ggplot(penguins, aes(x = island, y = species, color = species)) +
  geom_jitter(size = 3) + 
  scale_color_manual(values = pCol) 






Wow, there’s strong relationship between species and island

Adelie lives in every island
Gentoo lives in only Biscoe
Chinstrap lives in only Dream

4-2 & 4.3.
However, species and sex or sex and island did not show any meaningful relation.
You can try following codes. 
# species vs sex
table(penguins$sex, penguins$species)
chisq.test(table(penguins$sex, penguins$species)[-1,]) # not meaningful difference 0.916

# sex vs island
table(penguins$sex, penguins$island) # 0.9716
chisq.test(table(penguins$sex, penguins$island)[-1,]) # not meaningful difference 0.9716
5. See with numerical features

I will select numerical features. 
and see correlation plot and parallel coordinate plots.
# Select numericals
penNumeric <- penguins %>% select(-species, -island, -sex)

# Cor-relation between numerics

corrplot(cor(penNumeric), type = 'lower', diag = FALSE)

# parallel coordinate plots

ggparcoord(penguins, columns = 3:6, groupColumn = 1, order = c(4,3,5,6)) + 
  scale_color_manual(values = pCol)

plot(penNumeric, col = pCol[penguins$species], pch = 19)

and below are result of them.








lucky, every numeric features (even only 4) have meaningful correlation and there is trend with  their combination for species (See parallel coordinate plot)

6. Give statistical work on dataset.

In this step, I usually do linear modeling or svm to predict

6.1 linear modeling

species is categorical value, so it needs to be change to numeric value
set.seed(1234)
idx <- sample(1:nrow(penguins), size = nrow(penguins)/2)

# as. numeric
speciesN <- as.numeric(penguins$species)
penguins$speciesN <- speciesN

train <- penguins[idx,]
test <- penguins[-idx,]


fm <- lm(speciesN ~ flipper_length_mm + culmen_length_mm + culmen_depth_mm + body_mass_g, train)

summary(fm) 


It shows that, body_mass_g is not meaningful feature as seen in plot above ( it may explain gentoo, but not other penguins )

To predict, I used this code. however, numeric predict generate not complete value (like 2.123 instead of 2) so I added rounding step.
predRes <- round(predict(fm, test))
predRes[which(predRes>3)] <- 3
predRes <- sort(names(pCol))[predRes]

test$predRes <- predRes
ggplot(test, aes(x = species, y = predRes, color = species))+ 
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)

table(test$predRes, test$species)





Accuracy of basic linear modeling is 94.6%

6-2 svm

using svm is also easy step.
m <- svm(species ~., train)

predRes2 <- predict(m, test)
test$predRes2 <- predRes2

ggplot(test, aes(x = species, y = predRes2, color = species)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)

table(test$species, test$predRes2)
and below are result of this code.





Accuracy of svm is 100%. wow.



Conclusion

Today I introduced simple routine for EDA and statistical analysis with penguins.
That is not difficult that much, and shows good performances.

Of course, I skipped a lot of things like processing raw-dataset.
However I hope this trial gives inspiration for further data analysis.

Thanks.