![](https://allisonhorst.github.io/palmerpenguins/articles/articles/img/palmerpenguins.png)
Introduction
On June 17, a nice article introducing a new practice dataset was posted on R-bloggers.
`iris` is one of the most commonly used datasets for simple data analysis, but there is one small issue with it.
It is too good.
All of its data is well structured, and most analysis methods work very well with `iris`.
In reality, most datasets are not pretty and require a lot of pre-processing before you can even start. Possible pre-processing work includes:
- Removing `NA`s.
- Selecting meaningful features.
- Handling `duplicated` or `inconsistent` values.
- Or even just `loading` the dataset, if it is not well structured (like Flipkart-products).

With this penguin dataset, however, you can practice this kind of work, and pre-processed data is included too.
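As a rough sketch of what those pre-processing steps can look like in base R, here is an example on a small data frame. The data frame `messy` and its columns are invented for illustration and have nothing to do with the penguins.

```r
# Toy data invented for illustration; not taken from any real dataset.
messy <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  group = c("a", "B", "B", "a", NA),
  value = c(1.5, 2.0, 2.0, NA, 3.1),
  stringsAsFactors = FALSE
)

# 1. Handle duplicated rows.
clean <- messy[!duplicated(messy), ]

# 2. Handle inconsistent values (here: mixed upper/lower case labels).
clean$group <- tolower(clean$group)

# 3. Remove rows that contain any NA.
clean <- clean[complete.cases(clean), ]

# 4. Select the features of interest.
clean <- clean[, c("group", "value")]

print(clean)
```

The order of the steps is a choice made for this sketch; on a real dataset it usually depends on what the columns mean.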
For more information, see the palmerpenguins page.
I have a routine for brief data analysis, and today I want to share it using these lovely penguins.
Contents
0. Load the dataset and libraries into the workspace.

    library(palmerpenguins) # for data
    library(dplyr)          # for data handling
    library(corrplot)       # for correlation plot
    library(GGally)         # for parallel coordinate plot
    library(e1071)          # for svm
    data(penguins)          # load pre-processed penguins

`palmerpenguins` has 2 datasets, `penguins` and `penguins_raw`, and as you can see from their names, `penguins` is the pre-processed one.

1. See the `summary` and `plot` of the dataset

    summary(penguins)
    plot(penguins)

![](https://user-images.githubusercontent.com/6457691/85227767-49e33d80-b41a-11ea-9d2e-e037a37338ac.png)
![](https://user-images.githubusercontent.com/6457691/85227772-51a2e200-b41a-11ea-8568-00a41d431455.png)
It seems that `species`, `island` and `sex` are categorical features, and the remaining ones are numerical.

2. Set the format of the features

    penguins$species <- as.factor(penguins$species)
    penguins$island <- as.factor(penguins$island)
    penguins$sex <- as.factor(penguins$sex)
    summary(penguins)
    plot(penguins)

Then see `summary` and `plot` again. Note that the result of `plot` is the same.

![](https://user-images.githubusercontent.com/6457691/85227892-0e953e80-b41b-11ea-91ee-6a2d7437e967.png)
There are unwanted `NA` and `.` values in some features.

3. Remove unnecessary data (in this tutorial, `NA`)

    penguins <- penguins %>% filter(sex == 'MALE' | sex == 'FEMALE')
    summary(penguins)

Here, I additionally defined a color for each penguin species to get a better `plot` result.

    # Green, Orange, Purple
    pCol <- c('#057076', '#ff8301', '#bf5ccb')
    names(pCol) <- c('Gentoo', 'Adelie', 'Chinstrap')
    # index by name; indexing with the factor itself would use its integer codes
    plot(penguins, col = pCol[as.character(penguins$species)], pch = 19)

![](https://user-images.githubusercontent.com/6457691/85228081-3f29a800-b41c-11ea-84eb-03ffd46b4589.png)
Now the plot results give much better insight.
Note that other pre-processing steps may be required for different datasets.
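Before filtering, it can help to count what actually needs removing. The sketch below uses a made-up data frame (`toy`, with a `.` placeholder similar to the one mentioned above) and base R only; the column names are assumptions for illustration.

```r
# Toy data invented for illustration; the "." entry mimics a placeholder value.
toy <- data.frame(
  sex  = factor(c("MALE", "FEMALE", ".", NA, "MALE")),
  mass = c(3750, 3800, NA, 4200, 4500)
)

# Number of missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Frequency of every level, including NA, to spot placeholders such as ".".
print(table(toy$sex, useNA = "ifany"))

# Keep only rows whose sex is one of the expected labels,
# then drop the levels that no longer occur.
kept <- toy[toy$sex %in% c("MALE", "FEMALE"), ]
kept$sex <- droplevels(kept$sex)
print(levels(kept$sex))
```

Without `droplevels()`, a filtered factor keeps its unused levels, and they still show up as empty rows or columns in `table()`.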
4. See the relation between categorical features
My main interest in analyzing these penguins is `species`, so I will look at the relation between `species` and the other categorical features.

4-1. `species`, `island`

    table(penguins$species, penguins$island)
    chisq.test(table(penguins$species, penguins$island)) # meaningful difference
    ggplot(penguins, aes(x = island, y = species, color = species)) +
      geom_jitter(size = 3) +
      scale_color_manual(values = pCol)

![](https://user-images.githubusercontent.com/6457691/85228247-67fe6d00-b41d-11ea-9f7f-219ed9da1946.png)
![](https://user-images.githubusercontent.com/6457691/85228240-5f0d9b80-b41d-11ea-9aee-35eaa2f30cd2.png)
Wow, there is a strong relationship between `species` and `island`:
- `Adelie` lives on every island
- `Gentoo` lives only on Biscoe
- `Chinstrap` lives only on Dream
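To make the idea behind `table` and `chisq.test` concrete, here is a minimal sketch on made-up labels. The vectors `group` and `place` are invented for illustration and are not the penguin counts.

```r
# Toy labels invented for illustration; they are not the penguin counts.
group <- c(rep("x", 30), rep("y", 30))
place <- c(rep("north", 25), rep("south", 5),   # x is mostly in the north
           rep("north", 5),  rep("south", 25))  # y is mostly in the south

# Contingency table of the two categorical variables.
tab <- table(group, place)
print(tab)

# Share of each place within each group (each row sums to 1).
print(prop.table(tab, margin = 1))

# Pearson chi-squared test of independence between the two factors.
res <- chisq.test(tab)
print(res$p.value)
```

A small p-value suggests the two variables are not independent. The row proportions show in which direction they differ.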
4-2 & 4-3.
However, `species` and `sex`, or `sex` and `island`, did not show any meaningful relation. You can try the following code.

    # species vs sex
    table(penguins$sex, penguins$species)
    chisq.test(table(penguins$sex, penguins$species)[-1,]) # not meaningful difference 0.916
    # sex vs island
    table(penguins$sex, penguins$island)
    chisq.test(table(penguins$sex, penguins$island)[-1,]) # not meaningful difference 0.9716

5. Look at the numerical features
I will select the numerical features and look at a correlation plot and parallel coordinate plots.

    # Select numericals
    penNumeric <- penguins %>% select(-species, -island, -sex)
    # Correlation between numerics
    corrplot(cor(penNumeric), type = 'lower', diag = FALSE)
    # parallel coordinate plots
    ggparcoord(penguins, columns = 3:6, groupColumn = 1, order = c(4,3,5,6)) +
      scale_color_manual(values = pCol)
    # index by name; indexing with the factor itself would use its integer codes
    plot(penNumeric, col = pCol[as.character(penguins$species)], pch = 19)

Below are the results.
![](https://user-images.githubusercontent.com/6457691/85228373-14405380-b41e-11ea-968f-0982662a4614.png)
![](https://user-images.githubusercontent.com/6457691/85228393-320db880-b41e-11ea-913d-b44a510a8813.png)
![](https://user-images.githubusercontent.com/6457691/85228396-35a13f80-b41e-11ea-91be-345171f16e50.png)
Luckily, all the numeric features (even though there are only 4) are meaningfully correlated, and their combination shows a trend for each `species` (see the parallel coordinate plot).

6. Do statistical work on the dataset.
In this step, I usually use `linear modeling` or `svm` for prediction.

6-1. `linear modeling`
`species` is a categorical value, so it needs to be changed to a numeric value.

    set.seed(1234)
    idx <- sample(1:nrow(penguins), size = nrow(penguins)/2)
    # as.numeric
    speciesN <- as.numeric(penguins$species)
    penguins$speciesN <- speciesN
    train <- penguins[idx,]
    test <- penguins[-idx,]
    fm <- lm(speciesN ~ flipper_length_mm + culmen_length_mm + culmen_depth_mm + body_mass_g, train)
    summary(fm)

![](https://user-images.githubusercontent.com/6457691/85228609-38506480-b41f-11ea-98a0-b57563e43ef2.png)
It shows that `body_mass_g` is not a meaningful feature, as seen in the `plot` above (it may explain `gentoo`, but not the other penguins).

To predict, I used this code. However, numeric prediction generates non-integer values (like 2.123 instead of 2), so I added a rounding step.

    predRes <- round(predict(fm, test))
    predRes[which(predRes > 3)] <- 3
    predRes[which(predRes < 1)] <- 1 # keep the index within 1..3
    predRes <- sort(names(pCol))[predRes] # sorted names follow the factor level order
    test$predRes <- predRes
    ggplot(test, aes(x = species, y = predRes, color = species)) +
      geom_jitter(size = 3) +
      scale_color_manual(values = pCol)
    table(test$predRes, test$species)

![](https://user-images.githubusercontent.com/6457691/85228661-a1d07300-b41f-11ea-8892-84611c42e18c.png)
![](https://user-images.githubusercontent.com/6457691/85228660-9aa96500-b41f-11ea-80b4-3cccf8503f34.png)
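For reference, the accuracy of a prediction table like this can be computed as the share of cases on the diagonal. The sketch below uses made-up `truth` and `pred` labels, not the penguin results.

```r
# Toy labels invented for illustration; they are not the results shown above.
truth <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"))
pred  <- factor(c("A", "A", "B", "C", "C", "C", "C", "B"),
                levels = levels(truth))

# Confusion table: rows are predictions, columns are true labels.
conf <- table(pred, truth)
print(conf)

# Accuracy = correctly classified cases (diagonal) / all cases.
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```

The diagonal only lines up when both factors share the same levels in the same order, which is why `pred` is built with `levels = levels(truth)`.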
Accuracy of basic `linear modeling` is 94.6%.

6-2. `svm`
Using `svm` is also an easy step.

    m <- svm(species ~., train)
    predRes2 <- predict(m, test)
    test$predRes2 <- predRes2
    ggplot(test, aes(x = species, y = predRes2, color = species)) +
      geom_jitter(size = 3) +
      scale_color_manual(values = pCol)
    table(test$species, test$predRes2)

Below are the results of this code.
![](https://user-images.githubusercontent.com/6457691/85228760-713d0900-b420-11ea-8c57-ae7de1dff2f1.png)
![](https://user-images.githubusercontent.com/6457691/85228762-7437f980-b420-11ea-888a-20225ae3c541.png)
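A formula such as `species ~ .` uses every other column of the data as a predictor. In the code above, `speciesN` was added to `penguins` before the split, so `train` still contains it when `svm` is fitted. The sketch below, on a made-up data frame `toy`, shows one way to check what the dot expands to and to drop such a column first.

```r
# Toy data invented for illustration.
toy <- data.frame(
  label  = factor(c("a", "b", "a", "b")),
  labelN = c(1, 2, 1, 2),          # numeric copy of the label
  x      = c(0.1, 0.4, 0.2, 0.5)
)

# Which predictors does `label ~ .` expand to?
predictors <- attr(terms(label ~ ., data = toy), "term.labels")
print(predictors)

# Dropping the numeric copy first keeps it out of the model.
toy_clean <- toy[, setdiff(names(toy), "labelN")]
predictors_clean <- attr(terms(label ~ ., data = toy_clean), "term.labels")
print(predictors_clean)
```

Listing the predictors explicitly in the formula, as in the `lm` call earlier, is another way to avoid this.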
Accuracy of `svm` is 100%. Wow.

Conclusion
Today I introduced a simple routine for EDA and statistical analysis with penguins.
It is not that difficult, and it shows good performance.
Of course, I skipped a lot of things, like processing the raw dataset.
However, I hope this trial gives you inspiration for further data analysis.
Thanks.