New R Course: Introduction to the Tidyverse!

Hi! Big announcement today: we just launched the Introduction to the Tidyverse R course by David Robinson!

This is an introduction to the programming language R, focused on a powerful set of tools known as the “tidyverse”. In the course you’ll learn the intertwined processes of data manipulation and visualization through the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. This gives a taste both of the value of exploratory data analysis and the power of tidyverse tools. This is a suitable introduction for people who have no previous experience in R and are interested in learning to perform data analysis.

Take me to chapter 1! Introduction to the Tidyverse features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a Tidyverse expert!



What you’ll learn

1. Data wrangling
In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.

2. Data visualization
You’ve already been able to answer some questions about the data through dplyr, but you’ve engaged with them just as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.

3. Grouping and summarizing
So far you’ve been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group_by and summarize verbs, which collapse large datasets into manageable summaries.
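As a taste, a grouped summary like the one described above might look like the following sketch (the course description doesn’t name the dataset; the gapminder package is an assumption here):
# A hypothetical sketch of the grouped summary described above,
# assuming the gapminder dataset of historical country data
library(dplyr)
library(gapminder)
gapminder %>%
  group_by(year) %>%
  summarize(avg_life_exp = mean(lifeExp))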

4. Types of visualizations
You’ve learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.

Master the Tidyverse with our course Introduction to the Tidyverse

Using Microsoft’s Azure Face API to analyze videos (in R)

Microsoft had a cool API called “Emotion API”. With it you could submit the URL of a video, and the API would return a JSON file with the faces and emotions expressed in the video (per frame). However, that API never matured from preview mode, and in fact was deprecated on October 30th, 2017 (it no longer works).

I stumbled upon the Emotion API when I read a post by Kan Nishida from exploratory.io, which analyzed the facial expressions of Trump and Clinton during a presidential debate last year. However, that tutorial (like many others) no longer works, since it used the old Emotion API.

Lately I needed a tool for an analysis I did at work on the facial expressions in TV commercials. I had a list of videos showing faces, and I needed to code these faces into emotions.

I noticed that Microsoft still offers a simpler “Face API”. This API doesn’t work with videos; it only runs on still images (e.g. jpegs). I decided to use it, and here are the results (bottom line – you can use it for videos after some prep work).

By the way, AWS and Google have similar APIs (for images), called Amazon Rekognition (not a typo) and Vision API, respectively.

Here is a guide on how to do a batch analysis of videos and turn them into a single data frame (or tibble) of the emotions displayed in the videos, per frame.

To use the API you need a key – if you don’t already have one, register for Microsoft’s Azure API and create a new service of type Face API. You get an initial “gratis” credit of about $200 (1,000 images cost $1.50, so $200 is more than enough).

Preparations

First, we’ll load the packages we’re going to use: httr to send our requests to the server, and tidyverse (mostly for ggplot2, dplyr, tidyr, and tibble). Also, let’s define the API access point we will use, and the appropriate key. My Face API service was hosted in West Europe (hence the URL starts with westeurope.api).
# ==== Load required libraries ====
library(tidyverse)
library(httr)
# ==== Microsoft's Azure Face API ====
end.point <- "https://westeurope.api.cognitive.microsoft.com/face/v1.0/detect"
key1 <- "PUT YOUR SECRET KEY HERE"
To get things going, let’s check that the API and key work. We’ll send a simple image (of me) to the API and see what comes out.
sample.img.simple <- POST(url = end.point,
                          add_headers(.headers = c("Ocp-Apim-Subscription-Key" = key1)),
                          body = '{"url":"http://www.sarid-ins.co.il/files/TheTeam/Adi_Sarid.jpg"}',
                          query = list(returnFaceAttributes = "emotion"),
                          accept_json())
This is the simplest form of the API, which returns only emotions of the faces depicted in the image. You can ask for a lot of other features by setting them in the query parameter. For example, to get the emotions, age, gender, hair, makeup and accessories use returnFaceAttributes = "emotion,age,gender,hair,makeup,accessories".
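For illustration, here is a minimal sketch of the same request with that extended attribute list (the object name is my own; adjust the attribute string to your needs):
sample.img.full <- POST(url = end.point,
                        add_headers(.headers = c("Ocp-Apim-Subscription-Key" = key1)),
                        body = '{"url":"http://www.sarid-ins.co.il/files/TheTeam/Adi_Sarid.jpg"}',
                        query = list(returnFaceAttributes = "emotion,age,gender,hair,makeup,accessories"),
                        accept_json())
# Inspect the top level of the returned attributes for the first face
str(content(sample.img.full)[[1]]$faceAttributes, max.level = 1)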

Here’s a full documentation of what you can get.

Later on we’ll change the body parameter of the POST from a JSON containing a URL to a local image file, which will be uploaded to Microsoft’s servers.

For now, let’s look at the response of this query (notice that the reference is to the first identified face, [[1]]; for an image with more faces, face i will appear in location [[i]]).
as_tibble(content(sample.img.simple)[[1]]$faceAttributes$emotion) %>% t()
##            [,1]
## anger     0.001
## contempt  0.062
## disgust   0.002
## fear      0.000
## happiness 0.160
## neutral   0.774
## sadness   0.001
## surprise  0.001
You can see that the API is pretty sure I’m showing a neutral face (0.774), but I might also be showing a little bit of happiness (0.160). These are weights (probabilities, to be exact), so they will always sum to 1. If you want a single classification, you should probably choose the highest weight as the classified emotion. Other results, such as gender and hair color, work similarly.
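For example, a minimal sketch of picking the highest-weighted emotion as the single classification, using the response object from above:
emotions <- as_tibble(content(sample.img.simple)[[1]]$faceAttributes$emotion)
# The name of the column with the largest weight is the classified emotion
names(emotions)[which.max(unlist(emotions))]
## [1] "neutral"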

Now we’re ready to start working with videos. We’ll be building a number of functions to automate the process.

Splitting a video to individual frames

To split the video file into individual frames (images which we can send to the API), I’m going to (locally) use ffmpeg by calling it from R (it is run externally by the system – I’m using Windows for this). Assume that file.url contains the location of the video (it can be online or local), and that id.number is a unique string identifier for the video.
movie2frames <- function(file.url, id.number){
  base.dir <- "d:/temp/facial_coding/"
  # create a subdirectory for this video's frames
  dir.create(paste0(base.dir, id.number))
  # call ffmpeg externally: extract frames as jpg files at 2 frames per second
  system(
    paste0(
      "ffmpeg -i ", file.url,
      " -vf fps=2 ", base.dir,
      id.number, "/image%07d.jpg")
  )
}
The parameter fps=2 in the command means that we are extracting two frames per second (for my needs that was a reasonable frame rate, assuming that emotions don’t change that much within half a second).
Be sure to change the directory location (base.dir, currently d:/temp/facial_coding/) to whatever you need. This function will create a subdirectory within base.dir, containing all the frames extracted by ffmpeg. Now we’re ready to send these frames to the API.
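A hypothetical call (the file name and id here are made up for illustration) would create d:/temp/facial_coding/119/image0000001.jpg, image0000002.jpg, and so on:
movie2frames("d:/temp/videos/commercial_119.mp4", "119")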

A function for sending a (single) image and reading back emotions

Now, I defined a function for sending an image to the API and getting back the results. You’ll notice that I’m only using a very small portion of what the API has to offer (only the emotions). For the simplicity of the example, I’m reading only the first face (there might be more than one in a single image).
send.face <- function(filename) {
  face_res <- POST(url = end.point,
                   add_headers(.headers = c("Ocp-Apim-Subscription-Key" = key1)),
                   body = upload_file(filename, "application/octet-stream"),
                   query = list(returnFaceAttributes = "emotion"),
                   accept_json())
 
  # if at least one face was detected, keep the emotions of the first face;
  # otherwise return a row of NAs with the same emotion columns
  if(length(content(face_res)) > 0){
    ret.expr <- as_tibble(content(face_res)[[1]]$faceAttributes$emotion)
  } else {
    ret.expr <- tibble(anger = NA,
                       contempt = NA,
                       disgust = NA,
                       fear = NA,
                       happiness = NA,
                       neutral = NA,
                       sadness = NA,
                       surprise = NA)
  }
  return(ret.expr)
}

A function to process a batch of images

As I mentioned, in my case I had videos, so I had to work with a batch of images (each image representing a frame in the original video). After splitting the video, we now have a directory full of jpgs that we want to send for analysis. Thus, another function is required to automate the use of send.face() (the function we just defined).
extract.from.frames <- function(directory.location){
  base.dir <- "d:/temp/facial_coding/"
  # enter directory location without ending "/"
  face.analysis <- dir(directory.location) %>%
    as_tibble() %>%
    mutate(filename = paste0(directory.location,"/", value)) %>%
    group_by(filename) %>%
    do(send.face(.$filename)) %>%
    ungroup() %>%
    mutate(frame.num = 1:NROW(filename)) %>%
    mutate(origin = directory.location)
  
  # Save temporary data frame for later use (so as not to lose data if do() stops/fails)
  temp.filename <- tail(stringr::str_split(directory.location, stringr::fixed("/"))[[1]],1)
  write_excel_csv(x = face.analysis, path = paste0(base.dir, temp.filename, ".csv"))
  
  return(face.analysis)
}
The second part of the function (starting from “# Save temporary data frame for later use…”) is not mandatory. I wanted the results saved per frame batch into a file, since I used this function on a lot of movies (and you don’t want to lose everything if something temporarily fails). Again, if you do want the function to save its results to a file, be sure to change base.dir in extract.from.frames as well, to suit your own location.

By the way, note the use of do(). I could probably use one of purrr’s map functions instead, but do() has the benefit of showing a nice progress bar while it processes the data.
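For example, a rough purrr-based sketch of the same batch step (a hypothetical alternative, not the code I actually ran) could look like this:
extract.from.frames.purrr <- function(directory.location){
  files <- file.path(directory.location, dir(directory.location))
  # send each frame to the API and row-bind the resulting one-row tibbles
  purrr::map_df(files, send.face) %>%
    mutate(filename = files,
           frame.num = seq_along(files),
           origin = directory.location)
}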

Once you call results.many.frames <- extract.from.frames("c:/some/directory"), you will get a nice tibble that looks like this one:
## Observations: 796
## Variables: 11
## $ filename     d:/temp/facial_coding/119/image00001.jpg, d:/temp...
## $ anger        0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0...
## $ contempt     0.001, 0.001, 0.000, 0.001, 0.002, 0.001, 0.001, 0...
## $ disgust      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ fear         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ happiness    0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0...
## $ neutral      0.998, 0.998, 0.999, 0.993, 0.996, 0.997, 0.997, 0...
## $ sadness      0.000, 0.000, 0.000, 0.001, 0.001, 0.000, 0.000, 0...
## $ surprise     0.001, 0.001, 0.000, 0.005, 0.001, 0.002, 0.001, 0...
## $ frame.num    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ origin       d:/temp/facial_coding/119, d:/temp/facial_coding/...

Visualization

Here is a visualization of the emotion classifications detected in this movie, as a function of the frame number.
res.for.gg <- results.many.frames %>%
  select(anger:frame.num) %>%
  gather(key = emotion, value = intensity, -frame.num)
glimpse(res.for.gg)
## Observations: 6,368
## Variables: 3
## $ frame.num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ emotion   <chr> "anger", "anger", "anger", "anger", "anger", "anger"...
## $ intensity <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.0...
ggplot(res.for.gg, aes(x = frame.num, y = intensity, color = emotion)) + geom_point()


Since most of the frames show neutrality with a very high probability, the graph is not very informative. Just for show, let’s drop neutrality and focus on all the other emotions. We can see that there is a small probability of some contempt during the video. The next plot shows only the points, and the third plot shows only the smoothed version (without the points).
ggplot(res.for.gg %>% filter(emotion != "neutral"),
 aes(x = frame.num, y = intensity, color = emotion)) + geom_point()
ggplot(res.for.gg %>% filter(emotion != "neutral"),
 aes(x = frame.num, y = intensity, color = emotion)) + stat_smooth()

Conclusions

Though the Emotion API, which used to analyze a complete video, has been deprecated, the Face API can be used for this kind of video analysis, with the addition of splitting the video file into individual frames.

The possibilities with the face API are endless and can fit a variety of needs. Have fun playing around with it, and let me know if you found this tutorial helpful, or if you did something interesting with the API.

New DataCamp Course: Working with Web Data in R

Hi there! We just launched Working with Web Data in R by Oliver Keyes and Charlotte Wickham, our latest R course!

Most of the useful data in the world, from economic data to news content to geographic information, lives somewhere on the internet – and this course will teach you how to access it. You’ll explore how to work with APIs (computer-readable interfaces to websites), access data from Wikipedia and other sources, and build your own simple API client. For those occasions where APIs are not available, you’ll find out how to use R to scrape information out of web pages. In the process, you’ll learn how to get data out of even the most stubborn website, and how to turn it into a format ready for further analysis. The packages you’ll use and learn your way around are rvest, httr, xml2 and jsonlite, along with particular API client packages like WikipediR and pageviews.

Take me to chapter 1!

Working with Web Data in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in getting information from the Internet!



What you’ll learn

1. Downloading Files and Using API Clients
Sometimes getting data off the internet is very, very simple – it’s stored in a format that R can handle and just lives on a server somewhere, or it’s in a more complex format and perhaps part of an API but there’s an R package designed to make using it a piece of cake. This chapter will explore how to download and read in static files, and how to use APIs when pre-existing clients are available.

2. Using httr to interact with APIs directly
If an API client doesn’t exist, it’s up to you to communicate directly with the API. But don’t worry, the package httr makes this really straightforward. In this chapter, you’ll learn how to make web requests from R, how to examine the responses you get back and some best practices for doing this in a responsible way.

3. Handling JSON and XML
Sometimes data is a TSV or nice plaintext output. Sometimes it’s XML and/or JSON. This chapter walks you through what JSON and XML are, how to convert them into R-like objects, and how to extract data from them. You’ll practice by examining the revision history for a Wikipedia article retrieved from the Wikipedia API using httr, xml2 and jsonlite.

4. Web scraping with XPATHs
Now that we’ve covered the low-hanging fruit (“it has an API, and a client”, “it has an API”) it’s time to talk about what to do when a website doesn’t have any access mechanisms at all – when you have to rely on web scraping. This chapter will introduce you to the rvest web-scraping package, and build on your previous knowledge of XML manipulation and XPATHs.

5. CSS Web Scraping and Final Case Study
CSS path-based web scraping is a far-more-pleasant alternative to using XPATHs. You’ll start this chapter by learning about CSS, and how to leverage it for web scraping. Then, you’ll work through a final case study that combines everything you’ve learnt so far to write a function that queries an API, parses the response and returns data in a nice form.

Master web data in R with our course Working with Web Data in R!

Cyber Week Only: Save 50% on DataCamp!

For Cyber Week only, DataCamp offers the readers of R-bloggers over $150 off for unlimited access to its data science library. That’s over 90 courses and 5,200 exercises of which a large chunk are R-focused, plus access to the mobile app, Practice Challenges, and hands-on Projects. All by expert instructors such as Hadley Wickham (RStudio), Matt Dowle (data.table), Garrett Grolemund (RStudio), and Max Kuhn (caret)!
Claim your offer now! Offer ends 12/5!


Career tracks:

  • Data Analyst with R

    • Learn how to translate numbers into plain English for businesses. Whether it’s sales figures, market research, logistics, or transportation costs, get the most out of your data.

  • Data Scientist with R

    • Learn how to combine statistical and machine learning techniques with R programming to analyze and interpret complex data.

  • Quantitative Analyst with R

    • Learn how to ensure portfolios are risk balanced, help find new trading opportunities, and evaluate asset prices using mathematical models.

Skill tracks:

  • Data Visualization in R

    • Communicate the most important features of your data by creating beautiful visualizations using ggplot2 and base R graphics.

  • Importing and Cleaning Data

    • Learn how to parse data in any format. Whether it’s flat files, statistical software, databases, or web data, you’ll learn to handle it all.

  • Statistics with R

    • Learn key statistical concepts and techniques like exploratory data analysis, correlation, regression, and inference.

  • Applied Finance

    • Apply your R skills to financial data, including bond valuation, financial trading, and portfolio analysis.

Individual courses:

A word about DataCamp
For those of you who don’t know DataCamp: it’s the most intuitive way out there to learn data science thanks to its combination of short expert videos and immediate hands-on-the-keyboard exercises as well as its instant personalized feedback on every exercise. They focus only on data science to offer the best learning experience possible and rely on expert instructors to teach. 

How to identify risky bank loans using C.50 decision trees

This tutorial has been taken from Machine Learning with R, Second Edition by Brett Lantz. Use the code MLR250RB at the checkout to save 50% on the RRP.

Or pick up this title with 4 others for just $50 – get the R-Bloggers bundle.

The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit was limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans. Decision trees are widely used in the banking industry due to their high accuracy and ability to formulate a statistical model in plain language. Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while another was approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory. It is likely that automated credit scoring models are employed to instantly approve credit applications on the telephone and web. In this tutorial, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.

Step 1 – collecting data

The idea behind our credit model is to identify factors that are predictive of higher risk of default. Therefore, we need to obtain data on a large number of past bank loans and whether the loan went into default, as well as information on the applicant. Data with these characteristics is available in a dataset donated to the UCI Machine Learning Data Repository by Hans Hofmann of the University of Hamburg. The data set contains information on loans obtained from a credit agency in Germany.
The data set presented in this tutorial has been modified slightly from the original in order to eliminate some preprocessing steps. To follow along with the examples, download the credit.csv file from Packt’s website and save it to your R working directory. Simply click here and then click ‘code files’ beneath the cover image.
The credit data set includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default. Let’s see whether we can determine any patterns that predict this outcome.

Step 2 – exploring and preparing the data


As we did previously, we will import data using the read.csv() function. We will ignore the stringsAsFactors option and, therefore, use the default value of TRUE, as the majority of the features in the data are nominal:
> credit <- read.csv("credit.csv")

The first several lines of output from the str() function are as follows:
> str(credit)
'data.frame':1000 obs. of  17 variables:
 $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..
 $ months_loan_duration: int  6 48 12 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..
 $ purpose             : Factor w/ 6 levels "business","car",..
 $ amount              : int  1169 5951 2096 ...
We see the expected 1,000 observations and 17 features, which are a combination of factor and integer data types. Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default. The applicant’s checking and savings account balance are recorded as categorical variables:
> table(credit$checking_balance)
    < 0 DM   > 200 DM 1 - 200 DM    unknown 
       274         63        269        394
> table(credit$savings_balance)
     < 100 DM > 1000 DM  100 - 500 DM 500 - 1000 DM   unknown 
          603        48           103            63       183
The checking and savings account balance may prove to be important predictors of loan default status. Note that since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM). Some of the loan’s features are numeric, such as its duration and the amount of credit requested:
> summary(credit$months_loan_duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    18.0    20.9    24.0    72.0 
> summary(credit$amount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18420
The loan amounts ranged from 250 DM to 18,420 DM across terms of 4 to 72 months, with a median duration of 18 months and a median amount of 2,320 DM. The default vector indicates whether the loan applicant was unable to meet the agreed payment terms and went into default. A total of 30 percent of the loans in this dataset went into default:
> table(credit$default)
 no yes 
700 300
A high rate of default is undesirable for a bank, because it means that the bank is unlikely to fully recover its investment. If we are successful, our model will identify applicants that are at high risk to default, allowing the bank to refuse credit requests.

Data preparation: Creating random training and test data sets

In this example, we’ll split our data into two portions: a training dataset to build the decision tree and a test dataset to evaluate the performance of the model on new data. We will use 90 percent of the data for training and 10 percent for testing, which will provide us with 100 records to simulate new applicants.

We’ll use a random sample of the credit data for training. A random sample is simply a process that selects a subset of records at random. In R, the sample() function is used to perform random sampling. However, before putting it into action, a common practice is to set a seed value, which causes the randomization process to follow a sequence that can be replicated later on if desired. It may seem that this defeats the purpose of generating random numbers, but there is a good reason for doing it this way: providing a seed value via the set.seed() function ensures that if the analysis is repeated in the future, an identical result is obtained.

The following commands use the sample() function to select 900 values at random out of the sequence of integers from 1 to 1000. Note that the set.seed() function uses the arbitrary value 123. Omitting this seed will cause your training and testing split to differ from those shown in the remainder of this tutorial:
> set.seed(123)
> train_sample <- sample(1000, 900)
As expected, the resulting train_sample object is a vector of 900 random integers:
> str(train_sample)
 int [1:900] 288 788 409 881 937 46 525 887 548 453 ... 
By using this vector to select rows from the credit data, we can split it into the 90 percent training and 10 percent test datasets we desired. Recall that the dash operator used in the selection of the test records tells R to select records that are not in the specified rows; in other words, the test data includes only the rows that are not in the training sample.
> credit_train <- credit[train_sample, ]
> credit_test  <- credit[-train_sample, ]
If all went well, we should have about 30 percent of defaulted loans in each of the datasets:
> prop.table(table(credit_train$default))
       no       yes 
0.7033333 0.2966667 

> prop.table(table(credit_test$default))
  no  yes 
0.67 0.33
This appears to be a fairly even split, so we can now build our decision tree.

Tip: If your results do not match exactly, ensure that you ran the command set.seed(123) immediately prior to creating the train_sample vector.

Step 3: Training a model on the data

We will use the C5.0 algorithm in the C50 package to train our decision tree model. If you have not done so already, install the package with install.packages("C50") and load it into your R session using library(C50). Compared to the machine learning approaches we used previously, the C5.0 algorithm offers many more ways to tailor the model to a particular learning problem. Once the C50 package has been loaded, the ?C5.0Control command displays the help page with more details on how to finely tune the algorithm.

For the first iteration of our credit approval model, we’ll use the default C5.0 configuration, as shown in the following code. The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame, but supply it as the target factor vector for classification:
> credit_model <- C5.0(credit_train[-17], credit_train$default)
The credit_model object now contains a C5.0 decision tree. We can see some basic data about the tree by typing its name:
> credit_model

Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Tree size: 57 

Non-standard options: attempt to group attributes
The preceding text shows some simple facts about the tree, including the function call that generated it, the number of features (labeled predictors), and examples (labeled samples) used to grow the tree. Also listed is the tree size of 57, which indicates that the tree is 57 decisions deep—quite a bit larger than the example trees we’ve considered so far! To see the tree’s decisions, we can call the summary() function on the model:
> summary(credit_model)
This results in the following output:

The preceding output shows some of the first branches in the decision tree. The first three lines could be represented in plain language as:

If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.” Otherwise, if the checking account balance is less than zero DM or between one and 200 DM, and the credit history is perfect or very good, then classify as “likely to default.”

The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly classified by the decision. For instance, on the first line, 412/50 indicates that of the 412 examples reaching the decision, 50 were incorrectly classified as not likely to default. In other words, 50 applicants actually defaulted, in spite of the model’s prediction to the contrary.

Tip: Sometimes a tree results in decisions that make little logical sense. For example, why would an applicant whose credit history is very good be likely to default, while those whose checking balance is unknown are not likely to default? Contradictory rules like this occur sometimes. They might reflect a real pattern in the data, or they may be a statistical anomaly. In either case, it is important to investigate such strange decisions to see whether the tree’s logic makes sense for business use.
After the tree, the summary(credit_model) output displays a confusion matrix, which is a cross-tabulation that indicates the model’s incorrectly classified records in the training data:
Evaluation on training data (900 cases):

      Decision Tree   
    ----------------  
    Size      Errors  
      56  133(14.8%)   <<

     (a)   (b)    <-classified as
    ----  ----
     598    35    (a): class no
      98   169    (b): class yes
The Errors output notes that the model correctly classified all but 133 of the 900 training instances for an error rate of 14.8 percent. A total of 35 actual no values were incorrectly classified as yes (false positives), while 98 yes values were misclassified as no (false negatives). Decision trees are known for having a tendency to overfit the model to the training data. For this reason, the error rate reported on training data may be overly optimistic, and it is especially important to evaluate decision trees on a test data set.

Step 4: Evaluating model performance

To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
> credit_pred <- predict(credit_model, credit_test)
This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package. Setting the prop.c and prop.r parameters to FALSE removes the column and row percentages from the table. The remaining percentage (prop.t) indicates the proportion of records in the cell out of the total number of records:
> library(gmodels)
> CrossTable(credit_test$default, credit_pred,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
This results in the following table:



Out of the 100 test loan application records, our model correctly predicted that 59 did not default and 14 did default, resulting in an accuracy of 73 percent and an error rate of 27 percent. This is somewhat worse than its performance on the training data, but not unexpected, given that a model’s performance is often worse on unseen data. Also note that the model only correctly predicted 14 of the 33 actual loan defaults in the test data, or 42 percent. Unfortunately, this type of error is a potentially very costly mistake, as the bank loses money on each default. Let’s see if we can improve the result with a bit more effort.

Step 5: Improving model performance

Our model’s error rate is likely to be too high to deploy it in a real-time credit scoring application. In fact, if the model had predicted “no default” for every test case, it would have been correct 67 percent of the time—a result not much worse than our model’s, but requiring much less effort! Predicting loan defaults from 900 examples seems to be a challenging problem. Making matters even worse, our model performed especially poorly at identifying applicants who do default on their loans. Luckily, there are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly type of mistakes.
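As a quick sanity check, here is a minimal sketch of that majority-class baseline, computed on the test set created earlier (the object name baseline_pred is my own):
# Predict "no" (no default) for every test case and measure accuracy
baseline_pred <- factor(rep("no", nrow(credit_test)), levels = levels(credit_test$default))
mean(baseline_pred == credit_test$default)
## [1] 0.67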

Boosting the accuracy of decision trees

One way the C5.0 algorithm improved upon the C4.5 algorithm was through the addition of adaptive boosting. This is a process in which many decision trees are built and the trees vote on the best class for each example. Boosting is essentially rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone. Each of the models has a unique set of strengths and weaknesses and they may be better or worse in solving certain problems. Using a combination of several learners with complementary strengths and weaknesses can therefore dramatically improve the accuracy of a classifier.

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team. The trials parameter sets an upper limit; the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We’ll start with 10 trials, a number that has become the de facto standard, as research suggests that this reduces error rates on test data by about 25%:
> credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
                         trials = 10)
While examining the resulting model, we can see that some additional lines have been added, indicating the changes:
> credit_boost10
Number of boosting iterations: 10 
Average tree size: 47.5
Across the 10 iterations, our average tree size shrunk. If you would like, you can see all 10 trees by typing summary(credit_boost10) at the command prompt. The summary also lists the model’s performance on the training data:
> summary(credit_boost10)

     (a)   (b)    <-classified as
    ----  ----
     629     4    (a): class no
      30   237    (b): class yes
The classifier made 34 mistakes on 900 training examples for an error rate of 3.8 percent. This is quite an improvement over the 14.8 percent training error rate we noted before adding boosting! However, it remains to be seen whether we see a similar improvement on the test data. Let’s take a look:
> credit_boost_pred10 <- predict(credit_boost10, credit_test)
> CrossTable(credit_test$default, credit_boost_pred10,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
The resulting table is as follows:



Here, we reduced the total error rate from 27 percent prior to boosting down to 18 percent in the boosted model. It may not seem like a large gain, but it is in fact larger than the 25 percent reduction we expected. On the other hand, the model is still not doing well at predicting defaults, getting only 20/33 = 61% correct. The lack of an even greater improvement may be a function of our relatively small training dataset, or it may just be a very difficult problem to solve.

This said, if boosting can be added this easily, why not apply it by default to every decision tree? The reason is twofold. First, if building a decision tree once takes a great deal of computation time, building many trees may be computationally impractical. Secondly, if the training data is very noisy, then boosting might not result in an improvement at all. Still, if greater accuracy is needed, it’s worth giving it a try.

Making some mistakes costlier than others

Giving a loan out to an applicant who is likely to default can be an expensive mistake. One solution to reduce the number of false negatives may be to reject a larger number of borderline applicants, under the assumption that the interest the bank would earn from a risky loan is far outweighed by the massive loss it would incur if the money is not paid back at all. The C5.0 algorithm allows us to assign a penalty to different types of errors, in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is, relative to any other prediction.

To begin constructing the cost matrix, we need to start by specifying the dimensions. Since the predicted and actual values can both take two values, yes or no, we need to describe a 2 x 2 matrix, using a list of two vectors, each with two values. At the same time, we’ll also name the matrix dimensions to avoid confusion later on:
> matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
> names(matrix_dimensions) <- c("predicted", "actual")
Examining the new object shows that our dimensions have been set up correctly:
> matrix_dimensions
$predicted
[1] "no"  "yes"

$actual
[1] "no"  "yes"
Next, we need to assign the penalty for the various types of errors by supplying four values to fill the matrix. Since R fills a matrix by filling columns one by one from top to bottom, we need to supply the values in a specific order:

  • Predicted no, actual no
  • Predicted yes, actual no
  • Predicted no, actual yes
  • Predicted yes, actual yes
Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our penalty values could then be defined as:
> error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
    dimnames = matrix_dimensions)
This creates the following matrix:
> error_cost
         actual
predicted no yes
      no   0   4
      yes  1   0
As defined by this matrix, there is no cost assigned when the algorithm classifies a no or yes correctly, but a false negative has a cost of 4 versus a false positive’s cost of 1. To see how this impacts classification, let’s apply it to our decision tree using the costs parameter of the C5.0() function. We’ll otherwise use the same steps as we did earlier:
> credit_cost <- C5.0(credit_train[-17], credit_train$default,
                            costs = error_cost)
> credit_cost_pred <- predict(credit_cost, credit_test)
> CrossTable(credit_test$default, credit_cost_pred,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
This produces the following confusion matrix:



Compared to our boosted model, this version makes more mistakes overall: a 37 percent error rate here versus 18 percent in the boosted case. However, the types of mistakes are very different. Where the previous models correctly classified only 42 and 61 percent of the actual defaults, this model correctly predicted 79 percent of them. This trade-off, reducing false negatives at the expense of increasing false positives, may be acceptable if our cost estimates were accurate.

This tutorial has been taken from Machine Learning with R, Second Edition by Brett Lantz. Use the code MLR250RB at the checkout to save 50% on the RRP.

Big Data Analytics World Championship (30th September 2017) – Free Entry Code and details

Invitation to the 3rd Big Data Analytics World Championships 2017

We invite members of the R-Bloggers/R-Users community to participate in the 2017 TEXATA Big Data Analytics World Championships (www.texata.com). Your free discount code is: 2017RUSERS (normally $30 for a fully paid entry). Here are the important dates for the TEXATA educational business competition:

  • Round 1 on 30th September 2017 (Online)
  • Round 2 on 14th October 2017 (Online)
  • World Finals on 15th-16th November 2017 (Austin)

About TEXATA
The competition is a celebration of big data analytics skills and communities. It targets students and young professionals looking to learn more and showcase their creative business skills in big data analytics. It is ideal for people working, studying, or interested in Technology, Computer Science, Statistics and Data, Data Science, IT, Software Engineering, and Analytical, Quantitative and Engineering disciplines. The competition is a great way to learn more about leading Companies, Communities, Universities and Institutions in the big data and business analytics industries. All participants will receive updates throughout 2017 and 2018 on free offers, upcoming conferences and educational events of our community partners as part of joining the TEXATA event. Best of all, it is free with this discount code thanks to R-Bloggers/R-Users.

How It Works
The structure involves two Online Rounds with a Live World Finals in Austin, Texas. Each of the Online Rounds lasts 4 hours. The Top World Finalists will be flown from around the world to compete in the case-study based finals challenge with the Finals Judges. The organizers hold similar world championship events in other professional services industries – including Finance and Financial Modeling (www.modeloff.com) and High IQ World Championships (www.hiqora.com) – which gives participants a good idea of the educational, fun and challenging nature of these elite business and professional league competitions. Visit www.texata.com for more information.

Naive Principal Component Analysis (using R)

Post from Pablo Bernabeu’s blog.

Principal Component Analysis (PCA) is a technique used to find the core components that underlie different variables. It comes in very useful whenever doubts arise about the true origin of three or more variables. There are two main methods for performing a PCA: naive or less naive. In the naive method, you first check some conditions in your data which will determine the essentials of the analysis. In the less naive method, you set those yourself, based on whatever prior information or purposes you have. I will tackle the naive method, mainly by following the guidelines in Field, Miles, and Field (2012), with updated code where necessary. This lecture material was also useful. The ‘naive’ approach is characterized by a first stage that checks whether the PCA should actually be performed with your current variables, or whether some should be removed. The variables that are accepted are taken to a second stage, which identifies the number of principal components that seem to underlie your set of variables. I ascribe these stages to the ‘naive’ or formal approach because either or both could potentially be skipped in exceptional circumstances, where the purpose is not scientific, or where enough information exists in advance.

STAGE 1.  Determine whether PCA is appropriate at all, considering the variables

  • Variables should be inter-correlated enough but not too much. Field et al. (2012) provide some thresholds, suggesting that no variable should have many correlations below .30, or any correlation at all above .90. Thus, in the example here, variable Q06 should probably be excluded from the PCA.
  • Bartlett’s test, on the nature of the intercorrelations, should be significant. Significance suggests that the variables are not an ‘identity matrix’ in which correlations are a sampling error.
  • KMO (Kaiser-Meyer-Olkin), a measure of sampling adequacy based on common variance (so similar purpose as Bartlett’s). As Field et al. review, ‘values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb’ (p. 761). There’s a general score as well as one per variable. The general one will often be good, whereas the individual scores may more likely fail. Any variable with a score below .5 should probably be removed, and the test should be run again.
  • Determinant: A formula about multicollinearity. The result should preferably fall below .00001.
Note that some of these tests are run on the dataframe and others on a correlation matrix of the data, as distinguished below.
 # Necessary libraries
library(ltm)
library(lattice)
library(psych)
library(car)
library(pastecs)
library(scales)
library(ggplot2)
library(arules)
library(plyr)
library(Rmisc)
library(GPArotation)
library(gdata)
library(MASS)
library(qpcR)
library(dplyr)
library(gtools)
library(Hmisc)

# Select only your variables of interest for the PCA
dataset = mydata[, c('select_var1','select_var2',
'select_var3','select_var4','select_var5',
'select_var6','select_var7')]

# Create matrix: some tests will require it
data_matrix = cor(dataset, use = 'complete.obs')

# See intercorrelations
round(data_matrix, 2)

# Bartlett's
cortest.bartlett(dataset)

# KMO (Kaiser-Meyer-Olkin)
KMO(data_matrix)

# Determinant
det(data_matrix)

STAGE 2.  Identify number of components (aka factors)

In this stage, principal components (formally called ‘factors’ at this stage) are identified among the set of variables.
  • The identification is done through a basic, ‘unrotated’ PCA. The number of components set a priori must equal the number of variables that are being tested.
# Start off with unrotated PCA
pc1 = psych::principal(dataset, nfactors = length(dataset), rotate = "none")
pc1
Below, an example result:
 ## Principal Components Analysis
## Call: psych::principal(r = eng_prop, nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           PC1   PC2  PC3 h2       u2 com
## Aud_eng -0.89  0.13 0.44  1 -2.2e-16 1.5
## Hap_eng  0.64  0.75 0.15  1  1.1e-16 2.0
## Vis_eng  0.81 -0.46 0.36  1 -4.4e-16 2.0
## 
##                        PC1  PC2  PC3
## SS loadings           1.87 0.79 0.34
## Proportion Var        0.62 0.26 0.11
## Cumulative Var        0.62 0.89 1.00
## Proportion Explained  0.62 0.26 0.11
## Cumulative Proportion 0.62 0.89 1.00
## 
## Mean item complexity =  1.9
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1 

Among the columns, there are first the correlations between variables and components, followed by a column (h2) with the ‘communalities’. If fewer factors than variables had been selected, communality values would be below 1. Then there is the uniqueness column (u2): uniqueness is equal to 1 minus the communality. Next is ‘com’, which reflects the complexity with which a variable relates to the principal components. The components themselves are described below that.

The first row contains the sums of squared loadings, or eigenvalues, namely the total variance explained by each linear component. This value corresponds to the number of units explained out of all possible factors (which were three in the above example). The rows below all cut from the same cloth. Proportion Var is the variance explained over a total of 1; it is the result of dividing the eigenvalue by the number of components. Multiply by 100 and you get the percentage of total variance explained, which becomes useful. In the example, 99% of the variance has been explained. Aside from the meddling maths, we should actually expect 100% there, because the number of factors equaled the number of variables. Cumulative Var is the variance added consecutively up to the last component. Proportion Explained is the variance explained over what has actually been explained (only when variables = factors is this the same as Proportion Var). Cumulative Proportion is the actually explained variance added consecutively up to the last component.

Two sources will determine the number of components to select for the next stage:
  • Kaiser’s criterion: components with SS loadings > 1. In our example, only PC1.

A more lenient alternative is Jolliffe’s criterion: SS loadings > .7.

  • Scree plot: the number of points after point of inflexion. For this plot, call:
 plot(pc1$values, type = 'b') 
Imagine a straight line from the first point on the right. Once this line bends considerably, count the points after the bend and up to the last point on the left. The number of points is the number of components to select. The example here is probably the most complicated (two components were finally chosen), but normally it’s not difficult. Based on both criteria, go ahead and select the definitive number of components.
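As a small illustration (a sketch using the pc1 object from the unrotated PCA above), the eigenvalues are stored in pc1$values, so both criteria can be checked directly:
# Eigenvalues (SS loadings) of the unrotated solution
pc1$values
# Components retained under Kaiser's criterion (eigenvalue > 1)
which(pc1$values > 1)
# Components retained under Jolliffe's more lenient criterion (> .7)
which(pc1$values > 0.7)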

STAGE 3.  Run definitive PCA

Run a very similar command as you did before, but now with a more advanced method. The first PCA, a heuristic one, worked essentially on the inter-correlations. The definitive PCA, in contrast, will implement a prior shuffling known as ‘rotation’, to ensure that the result is robust enough (just like cards are shuffled). Explained variance is captured better this way. The go-to rotation method is the orthogonal, or ‘varimax’ (though others may be considered too).
 # Now with varimax rotation, Kaiser-normalized 
# by default:
pc2 = psych::principal(dataset, nfactors=2, 
rotate = "varimax", scores = TRUE)
pc2
pc2$loadings

# Healthcheck
pc2$residual
pc2$fit
pc2$communality 
We would want:
  • Less than half of residuals with absolute values > 0.05
  • Model fit > .9
  • All communalities > .7
If any of this fails, consider changing the number of factors. Next, the rotated components that have been ‘extracted’ from the core of the set of variables can be added to the dataset. This would enable the use of these components as new variables that might prove powerful and useful (as in this research).
 dataset = cbind(dataset, pc2$scores)
summary(dataset$RC1, dataset$RC2) 

STAGE 4.  Determine ascription of each variable to components

Check the main summary by just calling pc2, and see how each variable correlates with the rotated components. This is essential because it reveals how variables load on each component, or in other words, to which component a variable belongs. For instance, the table shown here belongs to a study about meaning of words. These results suggest that the visual and haptic modalities of words are quite related, whereas the auditory modality is relatively unique. When the analysis works out well, a cut-off point of r = .8 may be applied for considering a variable as part of a component.
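A convenient way to see this (a minimal sketch using the pc2 object from above) is to print the loadings with a cut-off, so that only loadings above the chosen threshold are displayed:
# Show only loadings above the cut-off to see which component each variable belongs to
print(pc2$loadings, cutoff = 0.8, sort = TRUE)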

STAGE 5.  Enjoy the plot

The plot is perhaps the coolest part about PCA. It really makes an awesome illustration of the power of data analysis.
 ggplot(dataset,
  aes(RC1, RC2, label = as.character(main_eng))) +
  aes (x = RC1, y = RC2, by = main_eng) + stat_density2d(color = "gray87")+
  geom_text(size = 7) +
    ggtitle ('English properties') +
    theme_bw() +
    theme(  
    plot.background = element_blank()
   ,panel.grid.major = element_blank()
   ,panel.grid.minor = element_blank()
   ,panel.border = element_blank()
  ) +
  theme(axis.line = element_line(color = 'black')) + 
    theme(axis.title.x = element_text(colour = 'black', size = 23, 
    margin=margin(15,15,15,15)),
         axis.title.y = element_text(colour = 'black', size = 23, 
    margin=margin(15,15,15,15)),
         axis.text.x  = element_text(size=16),
       axis.text.y  = element_text(size=16)) +
  labs(x = "", y = "Varimax-rotated Principal Component 2") +
    theme(plot.title = element_text(hjust = 0.5, size = 32, face = "bold",
    margin=margin(15,15,15,15))) 

Below is an example combining PCA plots with code similar to the above. These plots illustrate something further with regard to the relationships among modalities. In property words, the different modalities spread out more clearly than they do in concept words. This makes sense because in language, properties define concepts (see more).

An example of this code in use is available here (with data here).

References
Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.

Feel free to comment below or on the original post.

R in the Data Science Stack at ODSC



Register now for ODSC West in San Francisco, November 2-4, and save 60% with code RB60 until September 1st.

R continues to hold its own in the data science landscape thanks in no small part to its flexibility.  That flexibility allows R to integrate with some of the most popular data science tools available.

Given R’s memory bounds, it’s no surprise that deep learning tools like TensorFlow are on that list. Comprehensive, intuitive, and well documented, TensorFlow has quickly become one of the most popular deep learning platforms, and RStudio has released a package to integrate with the TensorFlow API. Not to be outdone, MXNet, another popular and powerful deep learning framework, has native support for R with an API interface.

It doesn’t stop with deep learning. Data science is moving to real-time, and the streaming analytics platform Apache Kafka is rapidly gaining traction with the community. The kafka package allows one to use the Kafka messaging queue via R. Spark is now one of the dominant machine learning platforms, and thus we see multiple R integrations in the form of the sparklyr package and the SparkR package. The list will continue to grow, with package integrations released for H2O.ai, Druid, and others, and more on the way.

At the  Open Data Science Conference, R has long been one of the most popular data science languages and ODSC West 2017 is no exception. We have a strong lineup this year that includes:

  • R Tools for Data Science
  • Modeling Big Data with R, sparklyr, and Apache Spark
  • Machine Learning with R
  • Introduction to Data Science with R
  • Modern Time-Series with Prophet
  • R4ML: A Scalable and Distributed framework in R for Machine Learning
  • Databases Using R
  • Geo-Spatial Data Visualization using R
From an R user perspective, one of the most exciting things about ODSC West 2017 is that it offers an excellent opportunity to do a deep dive into some of the most popular data science tools you can now leverage with R. Talks and workshops on the conference schedule include:

  • Deep learning from Scratch WIth Tensorflow
  • Apache Kafka for Real-time analytics
  • Deep learning with MXNet
  • Effective TensorFlow
  • Building an Open Source Analytics Solution with Kafka and Druid
  • Deep Neural Networks with Keras
  • Robust Data Pipelines with Apache Airflow
  • Apache Superset – A Modern, Enterprise-Ready Business Intelligence Web Application
Over 3 packed days, ODSC West 2017 also offers a great opportunity to brush up on your modeling skills, including predictive analytics, time series, NLP, machine learning, image recognition, deep learning, autonomous vehicles, and AI chatbot assistants. Here are just a few of the data science workshops and talks scheduled:

  • Feature Selection from High Dimensions
  • Interpreting Predictions from Complex Models
  • Deep Learning for Recommender Systems
  • Natural Language Processing in Practice – Do’s and Don’ts
  • Machine Imaging recognition
  • Training a Prosocial Chatbot
  • Anomaly Detection Using Deep Learning
  • Myths of Data Science: Practical Issues You Can and Can Not Ignore.
  • Playing Detective with CNNs
  • Recommendation System Architecture and Algorithms
  • Driver and Occupants Monitoring AI for Autonomous Vehicles
  • Solving Impossible Problems by Collaborating with an AI
  • Dynamic Risk Networks: Mapping Risk in the Financial System
With over 20 full training sessions, 50 workshops, and 100 speakers, ODSC West 2017 is ideal for everyone from beginners to experts looking to understand the latest in R tools and topics in data science and AI.

Register now and save 60% with code RB60 until September 1st.


Sheamus McGovern, CEO of ODSC

Shinydashboards from right to left (localizing a shinydashboard to Hebrew)

Post by Adi Sarid (Sarid Institute for Research Services LTD.)

Lately I’ve been working a lot with the shinydashboard library.
Like shiny, it allows any R programmer to harness the power of R and create professional-looking interactive apps. The thing about shinydashboard is that it makes wonderful-looking dashboards.

What I’ve been doing with the dashboards is creating dedicated dashboards for my customers. Since most of my customers speak, read, and write in Hebrew, I needed to fit that into my shinydashboard apps (i.e., fully localize the app). See an example of such a localized dashboard I made here.

Making a shinydashboard localized turned out to be simpler than I thought. 

Since the average R programmer doesn’t necessarily know and understand CSS, I thought I’d post my solution. This should fit any Hebrew or Arabic dashboard to work from right to left, including the sidebar and all other objects (though I only tested it in Hebrew).

If you want the short version:
(1) Download the following css file;
(2) Put it in a “www” subfolder of your shinydashboard app;
(3) In your dashboardBody command (within the ui.R section of your shiny app) add the following code:
dashboardBody(
  tags$head(
    tags$link(rel = "stylesheet", type = "text/css", href = "bootstrap-rtl.css")
  ),
  ...
)



Here are the few insights and steps which lead me to this solution:

Insight #1:
Any shiny app (dashboard or otherwise) can be customized using CSS. That’s no secret. However, the adaptation to RTL isn’t that simple when you have so many objects and mobile responsiveness to worry about.

Insight #2:
Shinydashboard is based on the AdminLTE theme, which in turn is based on Bootstrap 3. AdminLTE is a great theme, and even though it doesn’t officially support RTL, mmdsharifi provides a solution on his GitHub page. The same goes for Bootstrap 3, which has an RTL customization by morteza (also on GitHub).

Insight #3:
What I did to make this work was take bootstrap-rtl.css from morteza and concatenate onto it the AdminLTE-rtl.css file by mmdsharifi. Voilà! (Simple, isn’t it?)
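As a minimal sketch of that concatenation step (assuming you have downloaded the two files from the GitHub pages mentioned above into your app directory; the file names follow those repositories):
#Append the AdminLTE RTL rules to the Bootstrap RTL stylesheet and save the
#combined file into the www folder, where shiny serves static assets
rtl_css = c(readLines("bootstrap-rtl.css"), readLines("AdminLTE-rtl.css"))
writeLines(rtl_css, "www/bootstrap-rtl.css")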

Here’s the resulting css file.

Thanks to 0xOri for suggesting and testing insight #3.





Exploring Assumptions of K-means Clustering using R

K-Means clustering is a well-known technique based on unsupervised learning. As the name suggests, it forms ‘K’ clusters over the data using the means of the data points. Unsupervised algorithms are a class of algorithms one should tread on carefully: using the wrong algorithm will give completely botched-up results, and all the effort will go down the drain. Unlike supervised learning, where one can often get away with treating parts of the method as a black box, here it is essential to know the technique inside out, from the assumptions it makes to the process, the methods of optimization, and its uses. So let us begin step by step, starting from the assumptions. I will then explain the process and give a hands-on illustration using R.

Assumptions and Process
Why do we make assumptions in the first place? Because assumptions simplify problems, and simplified problems can be solved accurately. To divide a dataset into clusters, one must define the criteria of a cluster, and those criteria are the assumptions of the technique. K-means makes two assumptions about the clusters: first, that the clusters are spherical, and second, that the clusters are of similar size. The spherical assumption helps separate the clusters as the algorithm works through the data and forms clusters; if it is violated, the clusters formed may not be what one expects. The similar-size assumption, on the other hand, helps in deciding the boundaries of each cluster and in judging roughly how many data points each cluster should contain. It also brings a practical advantage: since each cluster in k-means is defined by the mean of its data points, under these assumptions one can start the cluster centers almost anywhere, and the algorithm will still tend to converge to the same final clusters as when the centers start as far apart as possible.

Now let’s understand how the algorithm works. The first step is to assign initial cluster centers; you can specify them yourself or let the algorithm pick them randomly. The algorithm then works in iterations. In every iteration, each data point is assigned to the cluster whose center is nearest to it. Once all points are assigned, the cluster centers are updated: the new center of each cluster is the mean (centroid) of all the points within it. This is repeated, iteration after iteration, until the cluster assignment of the data points no longer changes. Several choices in this algorithm are not fixed; for example, one can decide how the distance between a data point and a cluster center is defined, the most familiar choice being the Euclidean distance. The algorithm is straightforward and easy to understand, but using the technique is not as easy as it looks. Let’s try out some examples in R.
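Before turning to R’s built-in kmeans() function, here is a minimal, illustrative sketch of these two steps in base R (the function names assign_step and update_step are mine, not part of any package; X and centers are assumed to be numeric matrices with the same number of columns):
#Illustrative only: the assignment step and the update step of k-means
assign_step=function(X, centers){
  k=nrow(centers)
#Euclidean distance from every point to every center; pick the nearest one
  d=as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  max.col(-d)
}
update_step=function(X, cluster, k){
#New centers are the coordinate-wise means of the points in each cluster
  t(sapply(1:k, function(j) colMeans(X[cluster == j, , drop = FALSE])))
}
Alternating assign_step() and update_step() until the assignments stop changing is exactly the loop described above; kmeans() does all of this (and more) for us.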

K-Means Starter
To understand how K-Means works, we start with an example where all our assumptions hold. R includes a dataset about waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park known as ‘faithful’. The dataset consists of 272 observations of 2 features.
#Viewing the Faithful dataset
plot(faithful) 
Looking at the plot, we can see two clusters. I will now use the kmeans() function in R to form the clusters. Let’s see how k-means clustering works on this data.
#Specify 2 centers
k_clust_start=kmeans(faithful, centers=2)
#Plot the data using clusters
plot(faithful, col=k_clust_start$cluster,pch=2)
Since this is a small dataset, the clusters are formed almost instantaneously, but how do we see the clusters, their centers, or their sizes? The k_clust_start object contains information on both the centers and the sizes of the clusters. Let’s check them out.
#Use the centers to find the cluster centers
k_clust_start$centers

     eruptions      waiting
1       4.29793     80.28488
2       2.09433     54.75000

#Use the size to find the cluster sizes
k_clust_start$size
[1] 172 100
This means the first cluster consists of 172 members and is centered at an eruptions value of 4.29793 and a waiting value of 80.28488. Similarly, the second cluster consists of 100 members, with an eruptions value of 2.09433 and a waiting value of 54.75. Now this information is golden! We know that these centers are the cluster means, so eruptions typically last either ~2 minutes or ~4.3 minutes, and for the longer eruptions the waiting time is also longer.
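To dig a little further (an illustrative addition, not part of the original output), you can attach each observation’s cluster label back to the data:
#Attach the cluster assignment to the original data for further inspection
faithful_clustered=cbind(faithful, cluster=k_clust_start$cluster)
head(faithful_clustered)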

Getting into the depths
Imagine a dataset with clusters that one can clearly identify but that k-means cannot; in other words, a dataset that does not satisfy the assumptions. A common example is a dataset representing two concentric circles. Let’s generate one and see what it looks like.
#The following code will generate different plots for you but they will be similar
library(plyr)
library(dplyr)
#Generate random data which will be first cluster
clust1 = data_frame(x = rnorm(200), y = rnorm(200))
#Generate the second cluster which will ‘surround’ the first cluster
clust2 =data_frame(r = rnorm(200, 15, .5), theta = runif(200, 0, 2 * pi),
                 x = r * cos(theta), y = r * sin(theta)) %>%
  dplyr::select(x, y)
#Combine the data
dataset_cir= rbind(clust1, clust2)
#see the plot
plot(dataset_cir)

Simple, isn’t it? There are two clusters: one in the middle and another circling the first. However, this violates the assumption that the clusters are spherical; the inner cluster is spherical while the outer circle is not. Even though the clustering will not be good, let’s see how k-means performs on this data.
#Fit the k-means model
k_clust_spher1=kmeans(dataset_cir, centers=2)
#Plot the data and clusters
plot(dataset_cir, col=k_clust_spher1$cluster,pch=2)
How do we solve this problem? There are clearly 2 clusters, but k-means is not working well. A simple fix in this case is to transform our data into polar coordinates. Let’s convert it and plot it.
#Using a function for transformation
cart2pol=function(x,y){
#This is r, the distance from the origin
  newx=sqrt(x^2 + y^2)
#This is theta, the angle (atan2 handles all four quadrants correctly)
  newy=atan2(y,x)
  x_y=cbind(newx,newy)
  return(x_y)
}
dataset_cir2=cart2pol(dataset_cir$x,dataset_cir$y)
plot(dataset_cir2)
Now we run the k-means model on the transformed data:
k_clust_spher2=kmeans(dataset_cir2, centers=2)
#Plot the data and clusters
plot(dataset_cir2, col=k_clust_spher2$cluster,pch=2)
This time the k-means algorithm works well and clusters the data correctly. We can also view the clusters on the original data to double-check this.
plot(dataset_cir, col=k_clust_spher2$cluster,pch=2)
By transforming our data into polar coordinates and fitting the k-means model on the transformed data, we satisfy the spherical-cluster assumption and the data is accurately clustered. Now let’s look at data where the clusters are not of similar sizes. Similar size does not mean the clusters have to be exactly equal; it simply means that no cluster should have ‘too few’ members.
#Make the first cluster with 1000 random values
clust1 = data_frame(x = rnorm(1000), y = rnorm(1000))
#Keep 10 values together to make the second cluster
clust2=data_frame(x=c(5,5.1,5.2,5.3,5.4,5,5.1,5.2,5.3,5.4),y=c(5,5,5,5,5,4.9,4.9,4.9,4.9,4.9))
#Combine the data
dataset_uneven=rbind(clust1,clust2)
plot(dataset_uneven)
Here again, we have two clear clusters, but they do not satisfy the similar-size requirement of the k-means algorithm.
k_clust_spher3=kmeans(dataset_uneven, centers=2)
plot(dataset_uneven, col=k_clust_spher3$cluster,pch=2)
Why did this happen? K-means tries to minimize the within-cluster distances and create ‘tight’ clusters, and with this data it reduces the total within-cluster error by reassigning some points from the first (large) cluster to the second cluster. This makes the clustering inaccurate.
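As a quick sanity check (an illustrative addition), compare the fitted cluster sizes with the clusters we actually generated, which had 1000 and 10 points; the discrepancy makes the misassignment explicit.
#Inspect the sizes of the fitted clusters (we generated clusters of 1000 and 10 points)
k_clust_spher3$size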

How to decide the value of K in data

The datasets I worked with in this article are all simple, and it is easy to identify the clusters by plotting them. Complicated datasets do not offer this luxury. The elbow method is a popular way of finding a suitable value of ‘K’ for k-means clustering. This method computes the within-group SSE (sum of squared errors) for different values of k and plots them. Using this plot, we choose the ‘k’ at which the SSE shows an abrupt change, creating an ‘elbow’ effect. I will show an illustration on the iris dataset using petal width and petal length.
#Create a vector for storing the sse
sse=vector('numeric')
for(i in 2:15){
#the kmeans() output contains a component, withinss, which stores the SSE for each cluster
  sse[i-1]=sum(kmeans(iris[,3:4],centers=i)$withinss)
}
#Converting the sse to a data frame and storing corresponding value of k
sse=as.data.frame(sse)
sse$k=seq.int(2,15)
#Making the plot. This plot is also known as a scree plot
plot(sse$k,sse$sse,type="b")
In this plot, the first elbow forms at k=3, so the method suggests that we should use 3 clusters for this data.
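As a quick follow-up (an illustrative addition, not part of the original article), we can fit k=3 on the same two features and cross-tabulate the resulting clusters against the known species labels; the variable name k_clust_iris is mine:
#Fit 3 clusters on petal length and width and compare them with the species
k_clust_iris=kmeans(iris[,3:4], centers=3)
table(k_clust_iris$cluster, iris$Species)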

Conclusion
K-means clustering is one of the first and most basic techniques that comes to mind when one thinks of unsupervised clustering. This technique is not just powerful; it also teaches the importance of understanding the data in unsupervised learning. If any of the assumptions are violated, the clusters are not formed properly. Similarly, while determining the value of ‘k’, using an improper or arbitrary value may lead to improper clusters; in this way, k-means also relies on a sensible choice of k to cluster the data accurately. Unsupervised learning is fun, but it is not a shot in the dark. One must be clear about all aspects of the algorithm and its assumptions before implementing it, rather than treating it as a black box. This article illustrates, on simple datasets that do not satisfy the algorithm’s assumptions, the various ways clustering can go wrong. The full code used in this article is given below.
#Viewing the Faithful dataset
plot(faithful)

#Specify 2 centers
k_clust_start=kmeans(faithful, centers=2)
#Plot the data using clusters
plot(faithful, col=k_clust_start$cluster,pch=2)

#Use the centers to find the cluster centers
k_clust_start$centers
#Use the size to find the cluster sizes
k_clust_start$size
#The following code will generate different plots for you but they will be similar

library(plyr)
library(dplyr)
#Generate random data which will be first cluster
clust1 = data_frame(x = rnorm(200), y = rnorm(200))
#Generate the second cluster which will ‘surround’ the first cluster
clust2 =data_frame(r = rnorm(200, 15, .5), theta = runif(200, 0, 2 * pi),
                   x = r * cos(theta), y = r * sin(theta)) %>%
  dplyr::select(x, y)
#Combine the data
dataset_cir= rbind(clust1, clust2)
#see the plot
plot(dataset_cir)

#Fit the k-means model
k_clust_spher1=kmeans(dataset_cir, centers=2)
#Plot the data and clusters
plot(dataset_cir, col=k_clust_spher1$cluster,pch=2)

#Using a function for transformation
cart2pol=function(x,y){
  #This is r, the distance from the origin
  newx=sqrt(x^2 + y^2)
  #This is theta, the angle (atan2 handles all four quadrants correctly)
  newy=atan2(y,x)
  x_y=cbind(newx,newy)
  return(x_y)
}
dataset_cir2=cart2pol(dataset_cir$x,dataset_cir$y)
plot(dataset_cir2)

k_clust_spher2=kmeans(dataset_cir2, centers=2)
#Plot the data and clusters
plot(dataset_cir2, col=k_clust_spher2$cluster,pch=2)

plot(dataset_cir, col=k_clust_spher2$cluster,pch=2)

#Make the first cluster with 1000 random values
clust1 = data_frame(x = rnorm(1000), y = rnorm(1000))
#Keep 10 values together to make the second cluster
clust2=data_frame(x=c(5,5.1,5.2,5.3,5.4,5,5.1,5.2,5.3,5.4),y=c(5,5,5,5,5,4.9,4.9,4.9,4.9,4.9))
#Combine the data
dataset_uneven=rbind(clust1,clust2)
plot(dataset_uneven)

k_clust_spher3=kmeans(dataset_uneven, centers=2)
plot(dataset_uneven, col=k_clust_spher3$cluster,pch=2)

#Create a vector for storing the sse
sse=vector('numeric')
for(i in 2:15){
#the kmeans() output contains a component, withinss, which stores the SSE for each cluster
  sse[i-1]=sum(kmeans(iris[,3:4],centers=i)$withinss)
}
#Converting the sse to a data frame and storing corresponding value of k
sse=as.data.frame(sse)
sse$k=seq.int(2,15)
#Making the plot. This plot is also known as scree-plot
plot(sse$k,sse$sse,type="b")

Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.