New DataCamp Course: Working with Web Data in R

Hi there! We just launched Working with Web Data in R by Oliver Keyes and Charlotte Wickham, our latest R course!

Most of the useful data in the world, from economic data to news content to geographic information, lives somewhere on the internet – and this course will teach you how to access it. You’ll explore how to work with APIs (computer-readable interfaces to websites), access data from Wikipedia and other sources, and build your own simple API client. For those occasions where APIs are not available, you’ll find out how to use R to scrape information out of web pages. In the process, you’ll learn how to get data out of even the most stubborn website, and how to turn it into a format ready for further analysis. The packages you’ll use and learn your way around are rvest, httr, xml2 and jsonlite, along with particular API client packages like WikipediR and pageviews.

Take me to chapter 1!

Working with Web Data in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in getting information from the Internet!



What you’ll learn

1. Downloading Files and Using API Clients
Sometimes getting data off the internet is very, very simple – it’s stored in a format that R can handle and just lives on a server somewhere, or it’s in a more complex format and perhaps part of an API but there’s an R package designed to make using it a piece of cake. This chapter will explore how to download and read in static files, and how to use APIs when pre-existing clients are available.

2. Using httr to interact with APIs directly
If an API client doesn’t exist, it’s up to you to communicate directly with the API. But don’t worry, the package httr makes this really straightforward. In this chapter, you’ll learn how to make web requests from R, how to examine the responses you get back and some best practices for doing this in a responsible way.
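As a small taste of what that looks like (this snippet is not from the course; httpbin.org is just a public test endpoint used here for illustration):

library(httr)
resp <- GET("https://httpbin.org/get", query = list(greeting = "hello"))
status_code(resp)                 # 200 means the request succeeded
str(content(resp, as = "parsed")) # inspect the parsed response body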

3. Handling JSON and XML
Sometimes data is a TSV or nice plaintext output. Sometimes it’s XML and/or JSON. This chapter walks you through what JSON and XML are, how to convert them into R-like objects, and how to extract data from them. You’ll practice by examining the revision history for a Wikipedia article retrieved from the Wikipedia API using httr, xml2 and jsonlite.

4. Web scraping with XPATHs
Now that we’ve covered the low-hanging fruit (“it has an API, and a client”, “it has an API”) it’s time to talk about what to do when a website doesn’t have any access mechanisms at all – when you have to rely on web scraping. This chapter will introduce you to the rvest web-scraping package, and build on your previous knowledge of XML manipulation and XPATHs.

5. CSS Web Scraping and Final Case Study
CSS path-based web scraping is a far-more-pleasant alternative to using XPATHs. You’ll start this chapter by learning about CSS, and how to leverage it for web scraping. Then, you’ll work through a final case study that combines everything you’ve learnt so far to write a function that queries an API, parses the response and returns data in a nice form.

Master web data in R with our course Working with Web Data in R!

Cyber Week Only: Save 50% on DataCamp!

For Cyber Week only, DataCamp offers the readers of R-bloggers over $150 off for unlimited access to its data science library. That’s over 90 courses and 5,200 exercises of which a large chunk are R-focused, plus access to the mobile app, Practice Challenges, and hands-on Projects. All by expert instructors such as Hadley Wickham (RStudio), Matt Dowle (data.table), Garrett Grolemund (RStudio), and Max Kuhn (caret)!
Claim your offer now! Offer ends 12/5!


Skills track:

  • Data Analyst with R

    • Learn how to translate numbers into plain English for businesses, whether it’s sales figures, market research, logistics, or transportation costs, and get the most out of your data.

  • Data Scientist with R

    • Learn how to combine statistical and machine learning techniques with R programming to analyze and interpret complex data.

  • Quantitative Analyst with R

    • Learn how to ensure portfolios are risk balanced, help find new trading opportunities, and evaluate asset prices using mathematical models.

Skills track:

  • Data Visualization in R

    • Communicate the most important features of your data by creating beautiful visualizations using ggplot2 and base R graphics.

  • Importing and Cleaning Data

    • Learn how to parse data in any format. Whether it’s flat files, statistical software, databases, or web data, you’ll learn to handle it all.

  • Statistics with R

    • Learn key statistical concepts and techniques like exploratory data analysis, correlation, regression, and inference.

  • Applied Finance

    • Apply your R skills to financial data, including bond valuation, financial trading, and portfolio analysis.


A word about DataCamp
For those of you who don’t know DataCamp: it’s the most intuitive way out there to learn data science thanks to its combination of short expert videos and immediate hands-on-the-keyboard exercises as well as its instant personalized feedback on every exercise. They focus only on data science to offer the best learning experience possible and rely on expert instructors to teach. 

How to identify risky bank loans using C5.0 decision trees

This tutorial has been taken from Machine Learning with R, Second Edition by Brett Lantz. Use the code MLR250RB at the checkout to save 50% on the RRP.

Or pick up this title with 4 others for just $50 – get the R-Bloggers bundle.

The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit became limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans. Decision trees are widely used in the banking industry due to their high accuracy and their ability to formulate a statistical model in plain language. Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory, and it is likely that similar automated credit scoring models are employed to instantly approve credit applications over the telephone and web.

In this tutorial, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.

Step 1 – collecting data

The idea behind our credit model is to identify factors that are predictive of higher risk of default. Therefore, we need to obtain data on a large number of past bank loans and whether the loan went into default, as well as information on the applicant. Data with these characteristics is available in a dataset donated to the UCI Machine Learning Data Repository by Hans Hofmann of the University of Hamburg. The data set contains information on loans obtained from a credit agency in Germany.
The data set presented in this tutorial has been modified slightly from the original in order to eliminate some preprocessing steps. To follow along with the examples, download the credit.csv file from Packt’s website and save it to your R working directory. Simply click here and then click ‘code files’ beneath the cover image.
The credit data set includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default. Let’s see whether we can determine any patterns that predict this outcome.

Step 2 – exploring and preparing the data


As we did previously, we will import data using the read.csv() function. We will ignore the stringsAsFactors option and, therefore, use the default value of TRUE, as the majority of the features in the data are nominal:
> credit <- read.csv("credit.csv")

The first several lines of output from the str() function are as follows:
> str(credit)
'data.frame':1000 obs. of  17 variables:
 $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..
 $ months_loan_duration: int  6 48 12 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..
 $ purpose             : Factor w/ 6 levels "business","car",..
 $ amount              : int  1169 5951 2096 ...
We see the expected 1,000 observations and 17 features, which are a combination of factor and integer data types. Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default. The applicant’s checking and savings account balance are recorded as categorical variables:
> table(credit$checking_balance)
    < 0 DM   > 200 DM 1 - 200 DM    unknown 
       274         63        269        394
> table(credit$savings_balance)
     < 100 DM > 1000 DM  100 - 500 DM 500 - 1000 DM   unknown 
          603        48           103            63       183
The checking and savings account balance may prove to be important predictors of loan default status. Note that since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM). Some of the loan’s features are numeric, such as its duration and the amount of credit requested:
> summary(credit$months_loan_duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    18.0    20.9    24.0    72.0 
> summary(credit$amount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18420
The loan amounts ranged from 250 DM to 18,420 DM across terms of 4 to 72 months with a median duration of 18 months and an amount of 2,320 DM. The default vector indicates whether the loan applicant was unable to meet the agreed payment terms and went into default. A total of 30 percent of the loans in this dataset went into default:
> table(credit$default)
 no yes 
700 300
A high rate of default is undesirable for a bank, because it means that the bank is unlikely to fully recover its investment. If we are successful, our model will identify applicants that are at high risk to default, allowing the bank to refuse credit requests.

Data preparation: Creating random training and test data sets

In this example, we’ll split our data into two portions: a training dataset to build the decision tree and a test dataset to evaluate the performance of the model on new data. We will use 90 percent of the data for training and 10 percent for testing, which will provide us with 100 records to simulate new applicants.

We’ll use a random sample of the credit data for training. A random sample is simply a process that selects a subset of records at random. In R, the sample() function is used to perform random sampling. However, before putting it into action, a common practice is to set a seed value, which causes the randomization process to follow a sequence that can be replicated later on if desired. It may seem that this defeats the purpose of generating random numbers, but there is a good reason for doing it this way: providing a seed value via the set.seed() function ensures that if the analysis is repeated in the future, an identical result is obtained.

The following commands use the sample() function to select 900 values at random out of the sequence of integers from 1 to 1000. Note that the set.seed() function uses the arbitrary value 123. Omitting this seed will cause your training and testing split to differ from those shown in the remainder of this tutorial:
> set.seed(123)
> train_sample <- sample(1000, 900)
As expected, the resulting train_sample object is a vector of 900 random integers:
> str(train_sample)
 int [1:900] 288 788 409 881 937 46 525 887 548 453 ... 
By using this vector to select rows from the credit data, we can split it into the 90 percent training and 10 percent test datasets we desired. Recall that the dash operator used in the selection of the test records tells R to select records that are not in the specified rows; in other words, the test data includes only the rows that are not in the training sample.
> credit_train <- credit[train_sample, ]
> credit_test  <- credit[-train_sample, ]
If all went well, we should have about 30 percent of defaulted loans in each of the datasets:
> prop.table(table(credit_train$default))
       no       yes 
0.7033333 0.2966667 

> prop.table(table(credit_test$default))
  no  yes 
0.67 0.33
This appears to be a fairly even split, so we can now build our decision tree.

Tip: If your results do not match exactly, ensure that you ran the command set.seed(123) immediately prior to creating the train_sample vector.

Step 3: Training a model on the data

We will use the C5.0 algorithm in the C50 package to train our decision tree model. If you have not done so already, install the package with install.packages("C50") and load it into your R session with library(C50). The following syntax box lists some of the most commonly used commands to build decision trees. Compared to the machine learning approaches we used previously, the C5.0 algorithm offers many more ways to tailor the model to a particular learning problem, and even more options are available. Once the C50 package has been loaded, the ?C5.0Control command displays the help page with more details on how to fine-tune the algorithm.
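For reference, those setup steps look like this (install.packages() only needs to be run once):

# Install (only once) and load the C50 package
install.packages("C50")
library(C50)
# Help page describing the options for fine-tuning the algorithm
?C5.0Control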

For the first iteration of our credit approval model, we’ll use the default C5.0 configuration, as shown in the following code. The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame, but supply it as the target factor vector for classification:
> credit_model <- C5.0(credit_train[-17], credit_train$default)
The credit_model object now contains a C5.0 decision tree. We can see some basic data about the tree by typing its name:
> credit_model

Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Tree size: 57 

Non-standard options: attempt to group attributes
The preceding text shows some simple facts about the tree, including the function call that generated it, the number of features (labeled predictors), and examples (labeled samples) used to grow the tree. Also listed is the tree size of 57, which indicates that the tree is 57 decisions deep—quite a bit larger than the example trees we’ve considered so far! To see the tree’s decisions, we can call the summary() function on the model:
> summary(credit_model)
This results in the following output:

The preceding output shows some of the first branches in the decision tree. The first three lines could be represented in plain language as:

If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.” Otherwise, if the checking account balance is less than zero DM or between one and 200 DM, and the credit history is perfect or very good, then classify as “likely to default.”

The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly classified by the decision. For instance, on the first line, 412/50 indicates that of the 412 examples reaching the decision, 50 were incorrectly classified as not likely to default. In other words, 50 applicants actually defaulted, in spite of the model’s prediction to the contrary.

Tip: Sometimes a tree results in decisions that make little logical sense. For example, why would an applicant whose credit history is very good be likely to default, while those whose checking balance is unknown are not likely to default? Contradictory rules like this occur sometimes. They might reflect a real pattern in the data, or they may be a statistical anomaly. In either case, it is important to investigate such strange decisions to see whether the tree’s logic makes sense for business use.
After the tree, the summary(credit_model) output displays a confusion matrix, which is a cross-tabulation that indicates the model’s incorrectly classified records in the training data:
Evaluation on training data (900 cases):

      Decision Tree   
    ----------------  
    Size      Errors  
      56  133(14.8%)   <<

     (a)   (b)    <-classified as
    ----  ----
     598    35    (a): class no
      98   169    (b): class yes
The Errors output notes that the model correctly classified all but 133 of the 900 training instances for an error rate of 14.8 percent. A total of 35 actual no values were incorrectly classified as yes (false positives), while 98 yes values were misclassified as no (false negatives). Decision trees are known for having a tendency to overfit the model to the training data. For this reason, the error rate reported on training data may be overly optimistic, and it is especially important to evaluate decision trees on a test data set.

Step 4: Evaluating model performance

To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
> credit_pred <- predict(credit_model, credit_test)
This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package. Setting the prop.c and prop.r parameters to FALSE removes the column and row percentages from the table. The remaining percentage (prop.t) indicates the proportion of records in the cell out of the total number of records:
> library(gmodels)
> CrossTable(credit_test$default, credit_pred,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
This results in the following table:



Out of the 100 test loan application records, our model correctly predicted that 59 did not default and 14 did default, resulting in an accuracy of 73 percent and an error rate of 27 percent. This is somewhat worse than its performance on the training data, but not unexpected, given that a model’s performance is often worse on unseen data. Also note that the model only correctly predicted 14 of the 33 actual loan defaults in the test data, or 42 percent. Unfortunately, this type of error is a potentially very costly mistake, as the bank loses money on each default. Let’s see if we can improve the result with a bit more effort.
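As a quick cross-check (not in the original text), the overall test-set accuracy can also be computed directly from the objects created above:

# Proportion of test cases where the prediction matches the actual value
mean(credit_pred == credit_test$default)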

Step 5: Improving model performance

Our model’s error rate is likely to be too high to deploy it in a real-time credit scoring application. In fact, if the model had predicted “no default” for every test case, it would have been correct 67 percent of the time—a result not much worse than our model’s, but requiring much less effort! Predicting loan defaults from 900 examples seems to be a challenging problem. Making matters even worse, our model performed especially poorly at identifying applicants who do default on their loans. Luckily, there are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly type of mistakes.
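For comparison, the “always predict no default” baseline mentioned above can be computed the same way (again, a small addition to the original text):

# Accuracy of naively predicting "no" for every test case
mean(credit_test$default == "no")   # about 0.67 with this particular split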

Boosting the accuracy of decision trees

One way the C5.0 algorithm improved upon the C4.5 algorithm was through the addition of adaptive boosting. This is a process in which many decision trees are built and the trees vote on the best class for each example. Boosting is essentially rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone. Each of the models has a unique set of strengths and weaknesses and they may be better or worse in solving certain problems. Using a combination of several learners with complementary strengths and weaknesses can therefore dramatically improve the accuracy of a classifier. The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team. The trials parameter sets an upper limit; the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We’ll start with 10 trials, a number that has become the de facto standard, as research suggests that this reduces error rates on test data by about 25%:
> credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
                         trials = 10)
While examining the resulting model, we can see that some additional lines have been added, indicating the changes:
> credit_boost10
Number of boosting iterations: 10 
Average tree size: 47.5
Across the 10 iterations, our tree size shrunk. If you would like, you can see all 10 trees by typing summary(credit_boost10) at the command prompt. It also lists the model’s performance on the training data:
> summary(credit_boost10)

     (a)   (b)    <-classified as
    ----  ----
     629     4    (a): class no
      30   237    (b): class yes
The classifier made 34 mistakes on 900 training examples for an error rate of 3.8 percent. This is quite an improvement over the 14.8 percent training error rate we noted before adding boosting! However, it remains to be seen whether we see a similar improvement on the test data. Let’s take a look:
> credit_boost_pred10 <- predict(credit_boost10, credit_test)
> CrossTable(credit_test$default, credit_boost_pred10,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
The resulting table is as follows:



Here, we reduced the total error rate from 27 percent prior to boosting down to 18 percent in the boosted model. It may not seem like a large gain, but it is in fact greater than the 25 percent reduction we expected. On the other hand, the model is still not doing well at predicting defaults, getting only 20/33 = 61 percent of them right. The lack of an even greater improvement may be a function of our relatively small training dataset, or it may just be a very difficult problem to solve.
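To make the comparison concrete (a small addition), the relative reduction in test error can be worked out directly:

# Relative reduction in the test error rate achieved by boosting
(0.27 - 0.18) / 0.27   # roughly 0.33, i.e. about a 33 percent reduction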

This said, if boosting can be added this easily, why not apply it by default to every decision tree? The reason is twofold. First, if building a decision tree once takes a great deal of computation time, building many trees may be computationally impractical. Secondly, if the training data is very noisy, then boosting might not result in an improvement at all. Still, if greater accuracy is needed, it’s worth giving it a try.

Making some mistakes costlier than others

Giving a loan out to an applicant who is likely to default can be an expensive mistake. One solution to reduce the number of false negatives may be to reject a larger number of borderline applicants, under the assumption that the interest the bank would earn from a risky loan is far outweighed by the massive loss it would incur if the money is not paid back at all.

The C5.0 algorithm allows us to assign a penalty to different types of errors, in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is, relative to any other prediction.

To begin constructing the cost matrix, we need to start by specifying the dimensions. Since the predicted and actual values can both take two values, yes or no, we need to describe a 2 x 2 matrix, using a list of two vectors, each with two values. At the same time, we’ll also name the matrix dimensions to avoid confusion later on:
> matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
> names(matrix_dimensions) <- c("predicted", "actual")
Examining the new object shows that our dimensions have been set up correctly:
> matrix_dimensions
$predicted
[1] "no"  "yes"

$actual
[1] "no"  "yes"
Next, we need to assign the penalty for the various types of errors by supplying four values to fill the matrix. Since R fills a matrix by filling columns one by one from top to bottom, we need to supply the values in a specific order:

  • Predicted no, actual no
  • Predicted yes, actual no
  • Predicted no, actual yes
  • Predicted yes, actual yes
Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our penalty values could then be defined as:
> error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
    dimnames = matrix_dimensions)
This creates the following matrix:
> error_cost
         actual
predicted no yes
      no   0   4
      yes  1   0
As defined by this matrix, there is no cost assigned when the algorithm classifies a no or yes correctly, but a false negative has a cost of 4 versus a false positive’s cost of 1. To see how this impacts classification, let’s apply it to our decision tree using the costs parameter of the C5.0() function. We’ll otherwise use the same steps as we did earlier:
> credit_cost <- C5.0(credit_train[-17], credit_train$default,
                            costs = error_cost)
> credit_cost_pred <- predict(credit_cost, credit_test)
> CrossTable(credit_test$default, credit_cost_pred,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
This produces the following confusion matrix:



Compared to our boosted model, this version makes more mistakes overall: 37 percent error here versus 18 percent in the boosted case. However, the types of mistakes are very different. Where the previous models correctly classified only 42 and 61 percent of defaults, in this model 79 percent of the actual defaults were correctly predicted to be defaults. This trade-off, resulting in a reduction of false negatives at the expense of increased false positives, may be acceptable if our cost estimates were accurate.
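As a quick check (not in the original text), the share of actual defaults caught by the cost-sensitive model can be computed from the objects above; with the same seed it should be close to the 79 percent quoted:

# Proportion of actual defaulters that the cost-sensitive model flags as "yes"
mean(credit_cost_pred[credit_test$default == "yes"] == "yes")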

This tutorial has been taken from Machine Learning with R, Second Edition by Brett Lantz. Use the code MLR250RB at the checkout to save 50% on the RRP.

Big Data Analytics World Championship (30th September 2017) – Free Entry Code and details

Invitation to the 3rd Big Data Analytics World Championships 2017

We invite members of the R-Bloggers/R-Users community to participate in the 2017 TEXATA Big Data Analytics World Championships (www.texata.com). Your free discount code is: 2017RUSERS (normally $30 for a fully paid entry). Here are the important dates for the TEXATA educational business competition:

  • Round 1 on 30th September 2017 (Online)
  • Round 2 on 14th October 2017 (Online)
  • World Finals on 15th-16th November 2017 (Austin)

About TEXATA
The competition is a celebration of big data analytics skills and communities. It targets students and young professionals looking to learn more and showcase their creative business skills in big data analytics. It’s ideal for people working, studying, or interested in Technology, Computer Science, Statistics and Data, Data Science, IT, Software Engineering, and Analytical, Quantitative and Engineering disciplines. The competition is a great way to learn more about leading companies, communities, universities and institutions in the big data and business analytics industries. All participants will receive updates throughout 2017 and 2018 on free offers, upcoming conferences and educational events from our community partners as part of joining the TEXATA event. Best of all, it’s all free with this discount code thanks to R-Bloggers/R-Users.

How It Works
The structure involves two Online Rounds followed by a Live World Finals in Austin, Texas. Each of the Online Rounds lasts 4 hours. The top World Finalists will be flown from around the world to compete in the case-study-based finals challenge before the Finals Judges. The organizers hold similar world championship events in other professional services industries, including Finance and Financial Modeling (www.modeloff.com) and the High IQ World Championships (www.hiqora.com), which gives participants a good idea of the educational, fun and challenging nature of these elite business and professional competitions. Visit www.texata.com for more information.

Naive Principal Component Analysis (using R)

Post from Pablo Bernabeu’s blog.

Principal Component Analysis (PCA) is a technique used to find the core components that underlie different variables. It comes in very useful whenever doubts arise about the true origin of three or more variables. There are two main methods for performing a PCA: naive or less naive. In the naive method, you first check some conditions in your data which will determine the essentials of the analysis. In the less-naive method, you set those yourself, based on whatever prior information or purposes you had. I will tackle the naive method, mainly by following the guidelines in Field, Miles, and Field (2012), with updated code where necessary. This lecture material was also useful.

The ‘naive’ approach is characterized by a first stage that checks whether the PCA should actually be performed with your current variables, or whether some should be removed. The variables that are accepted are taken to a second stage, which identifies the number of principal components that seem to underlie your set of variables. I ascribe these stages to the ‘naive’ or formal approach because either or both could potentially be skipped in exceptional circumstances, where the purpose is not scientific, or where enough information exists in advance.

STAGE 1.  Determine whether PCA is appropriate at all, considering the variables

  • Variables should be inter-correlated enough but not too much. Field et al. (2012) provide some thresholds, suggesting that no variable should have many correlations below .30, or any correlation at all above .90. Thus, in the example here, variable Q06 should probably be excluded from the PCA.
  • Bartlett’s test, on the nature of the intercorrelations, should be significant. Significance suggests that the correlation matrix is not an ‘identity matrix’, in which the correlations would be due to sampling error alone.
  • KMO (Kaiser-Meyer-Olkin), a measure of sampling adequacy based on common variance (so similar purpose as Bartlett’s). As Field et al. review, ‘values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb’ (p. 761). There’s a general score as well as one per variable. The general one will often be good, whereas the individual scores may more likely fail. Any variable with a score below .5 should probably be removed, and the test should be run again.
  • Determinant: an index related to multicollinearity. The result should preferably be greater than .00001; smaller values point to problematic multicollinearity among the variables.
Note that some of these tests are run on the dataframe and others on a correlation matrix of the data, as distinguished below.
 # Necessary libraries
library(ltm)
library(lattice)
library(psych)
library(car)
library(pastecs)
library(scales)
library(ggplot2)
library(arules)
library(plyr)
library(Rmisc)
library(GPArotation)
library(gdata)
library(MASS)
library(qpcR)
library(dplyr)
library(gtools)
library(Hmisc)

# Select only your variables of interest for the PCA
dataset = mydata[, c('select_var1','select_var2',
'select_var3','select_var4','select_var5',
'select_var6','select_var7')]

# Create matrix: some tests will require it
data_matrix = cor(dataset, use = 'complete.obs')

# See intercorrelations
round(data_matrix, 2)

# Bartlett's
cortest.bartlett(dataset)

# KMO (Kaiser-Meyer-Olkin)
KMO(data_matrix)

# Determinant
det(data_matrix)

STAGE 2.  Identify number of components (aka factors)

In this stage, principal components (formally called ‘factors’ at this stage) are identified among the set of variables.
  • The identification is done through a basic, ‘unrotated’ PCA. The number of components set a priori must equal the number of variables that are being tested.
 # Start off with unrotated PCA

pc1 = psych::principal(dataset, nfactors = 
length(dataset), rotate="none")
pc1 
Below, an example result:
 ## Principal Components Analysis
## Call: psych::principal(r = eng_prop, nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           PC1   PC2  PC3 h2       u2 com
## Aud_eng -0.89  0.13 0.44  1 -2.2e-16 1.5
## Hap_eng  0.64  0.75 0.15  1  1.1e-16 2.0
## Vis_eng  0.81 -0.46 0.36  1 -4.4e-16 2.0
## 
##                        PC1  PC2  PC3
## SS loadings           1.87 0.79 0.34
## Proportion Var        0.62 0.26 0.11
## Cumulative Var        0.62 0.89 1.00
## Proportion Explained  0.62 0.26 0.11
## Cumulative Proportion 0.62 0.89 1.00
## 
## Mean item complexity =  1.9
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1 

Among the columns, there are first the correlations between variables and components, followed by a column (h2) with the ‘communalities’. If fewer factors than variables had been selected, communality values would be below 1. Then there is the uniqueness column (u2): uniqueness is equal to 1 minus the communality. Next is ‘com’, which reflects the complexity with which a variable relates to the principal components. Those components are found below.

The first row contains the sums of squared loadings, or eigenvalues, namely the total variance explained by each linear component. This value corresponds to the number of units explained out of all possible factors (which were three in the above example). The rows below all cut from the same cloth. Proportion var = variance explained over a total of 1; this is the result of dividing the eigenvalue by the number of components (here equal to the number of variables). Multiply by 100 and you get the percentage of total variance explained, which becomes useful. In the example, 99% of the variance has been explained; aside from rounding in the output, we should actually expect 100% there, because the number of factors equaled the number of variables. Cumulative var: variance added consecutively up to the last component. Proportion explained: variance explained over what has actually been explained (only when variables = factors is this the same as Proportion var). Cumulative proportion: the actually explained variance added consecutively up to the last component.

Two sources will determine the number of components to select for the next stage:
  • Kaiser’s criterion: components with SS loadings > 1. In our example, only PC1.

A more lenient alternative is Joliffe’s criterion, SS loadings > .7.

  • Scree plot: the number of points after point of inflexion. For this plot, call:
 plot(pc1$values, type = 'b') 
Imagine a straight line from the first point on the right. Once this line bends considerably, count the points after the bend and up to the last point on the left. The number of points is the number of components to select. The example here is probably the most complicated (two components were finally chosen), but normally it’s not difficult. Based on both criteria, go ahead and select the definitive number of components.

STAGE 3.  Run definitive PCA

Run a very similar command as you did before, but now with a more advanced method. The first PCA, a heuristic one, worked essentially on the inter-correlations. The definitive PCA, in contrast, will implement a prior shuffling known as ‘rotation’, to ensure that the result is robust enough (just like cards are shuffled). Explained variance is captured better this way. The go-to rotation method is the orthogonal, or ‘varimax’ (though others may be considered too).
 # Now with varimax rotation, Kaiser-normalized 
# by default:
pc2 = psych::principal(dataset, nfactors=2, 
rotate = "varimax", scores = TRUE)
pc2
pc2$loadings

# Healthcheck
pc2$residual
pc2$fit
pc2$communality 
We would want:
  • Less than half of residuals with absolute values > 0.05
  • Model fit > .9
  • All communalities > .7
If any of this fails, consider changing the number of factors. Next, the rotated components that have been ‘extracted’ from the core of the set of variables can be added to the dataset. This would enable the use of these components as new variables that might prove powerful and useful (as in this research).
 dataset = cbind(dataset, pc2$scores)
summary(dataset[, c('RC1', 'RC2')])

STAGE 4.  Determine ascription of each variable to components

Check the main summary by just calling pc2, and see how each variable correlates with the rotated components. This is essential because it reveals how variables load on each component, or in other words, to which component a variable belongs. For instance, the table shown here belongs to a study about meaning of words. These results suggest that the visual and haptic modalities of words are quite related, whereas the auditory modality is relatively unique. When the analysis works out well, a cut-off point of r = .8 may be applied for considering a variable as part of a component.
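One convenient way to inspect this (a small addition, not from the original post) is to print the loadings with a cut-off, so that only the strong variable-component correlations are displayed:

# Show only loadings above the chosen cut-off
print(pc2$loadings, cutoff = 0.8)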

STAGE 5.  Enjoy the plot

The plot is perhaps the coolest part about PCA. It really makes an awesome illustration of the power of data analysis.
 ggplot(dataset,
  aes(RC1, RC2, label = as.character(main_eng))) +
  aes (x = RC1, y = RC2, by = main_eng) + stat_density2d(color = "gray87")+
  geom_text(size = 7) +
    ggtitle ('English properties') +
    theme_bw() +
    theme(  
    plot.background = element_blank()
   ,panel.grid.major = element_blank()
   ,panel.grid.minor = element_blank()
   ,panel.border = element_blank()
  ) +
  theme(axis.line = element_line(color = 'black')) + 
    theme(axis.title.x = element_text(colour = 'black', size = 23, 
    margin=margin(15,15,15,15)),
         axis.title.y = element_text(colour = 'black', size = 23, 
    margin=margin(15,15,15,15)),
         axis.text.x  = element_text(size=16),
       axis.text.y  = element_text(size=16)) +
  labs(x = "", y = "Varimax-rotated Principal Component 2") +
    theme(plot.title = element_text(hjust = 0.5, size = 32, face = "bold",
    margin=margin(15,15,15,15))) 

Below is an example combining PCA plots with code similar to the above. These plots illustrate something further with regard to the relationships among modalities. In property words, the different modalities spread out more clearly than they do in concept words. This makes sense because in language, properties define concepts (see more).

An example of this code in use is available here (with data here).

References
Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.

Feel free to comment below or on the original post.

R in the Data Science Stack at ODSC



Register now for ODSC West in San Francisco, November 2-4 and save 60% with code RB60 until September 1st.

R continues to hold its own in the data science landscape thanks in no small part to its flexibility.  That flexibility allows R to integrate with some of the most popular data science tools available.

Given R’s memory bounds, it’s no surprise that deep learning tools like TensorFlow are on that list. Comprehensive, intuitive, and well documented, TensorFlow has quickly become one of the most popular deep learning platforms, and RStudio has released a package to integrate with the TensorFlow API. Not to be outdone, MXNet, another popular and powerful deep learning framework, has native support for R with an API interface.

It doesn’t stop with deep learning. Data science is moving real-time, and the streaming analytics platform Apache Kafka is rapidly gaining traction with the community. The kafka package allows one to use the Kafka messaging queue from R. Spark is now one of the dominant machine learning platforms, and thus we see multiple R integrations in the form of the sparklyr and SparkR packages. The list will continue to grow, with package integrations released for H2O.ai, Druid, and others, and more on the way.

At the  Open Data Science Conference, R has long been one of the most popular data science languages and ODSC West 2017 is no exception. We have a strong lineup this year that includes:

  • R Tools for Data Science
  • Modeling Big Data with R, sparklyr, and Apache Spark
  • Machine Learning with R
  • Introduction to Data Science with R
  • Modern Time-Series with Prophet
  • R4ML: A Scalable and Distributed framework in R for Machine Learning
  • Databases Using R
  • Geo-Spatial Data Visualization using R
From an R user perspective, one of the most exciting things about ODSC West 2017 is that it offers an excellent opportunity to do a deep dive into some of the most popular data science tools you can now leverage with R. Talks and workshops on the conference schedule include:

  • Deep Learning from Scratch with TensorFlow
  • Apache Kafka for Real-time analytics
  • Deep learning with MXNet
  • Effective TensorFlow
  • Building an Open Source Analytics Solution with Kafka and Druid
  • Deep Neural Networks with Keras
  • Robust Data Pipelines with Apache Airflow
  • Apache Superset – A Modern, Enterprise-Ready Business Intelligence Web Application
Over 3 packed days, ODSC West 2017 also offers a great opportunity to brush up on your modeling skills, including predictive analytics, time series, NLP, machine learning, image recognition, deep learning, autonomous vehicles, and AI chatbot assistants. Here are just a few of the data science workshops and talks scheduled:

  • Feature Selection from High Dimensions
  • Interpreting Predictions from Complex Models
  • Deep Learning for Recommender Systems
  • Natural Language Processing in Practice – Do’s and Don’ts
  • Machine Imaging recognition
  • Training a Prosocial Chatbot
  • Anomaly Detection Using Deep Learning
  • Myths of Data Science: Practical Issues You Can and Can Not Ignore.
  • Playing Detective with CNNs
  • Recommendation System Architecture and Algorithms
  • Driver and Occupants Monitoring AI for Autonomous Vehicles
  • Solving Impossible Problems by Collaborating with an AI
  • Dynamic Risk Networks: Mapping Risk in the Financial System
With over 20 full training sessions, 50 workshops, and 100 speakers, ODSC West 2017 is ideal for everyone from beginners to experts looking to understand the latest in R tools and topics in data science and AI.

Register now and save 60% with code RB60 until September 1st.


Sheamus McGovern, CEO of ODSC

Shinydashboards from right to left (localizing a shinydashboard to Hebrew)

Post by Adi Sarid (Sarid Institute for Research Services LTD.)

Lately I’ve been working a lot with the shinydashboard library.
Like shiny, it allows any R programmer to harness the power of R and create professional-looking interactive apps. The thing about shinydashboard is that it makes wonderful-looking dashboards.

What I’ve been doing with the dashboards is to create dedicated dashboards for my customers. Since most of my customers speak, read, and write in Hebrew I needed to fit that into my shinydashboard apps (i.e., fully localize the app). See an example for such a localized dashboard I made here.

Making a shinydashboard localized turned out to be simpler than I thought. 

Since the average R programmer doesn’t necessarily know and understand CSS, I thought I’d post my solution. This should make any Hebrew or Arabic dashboard work from right to left, including the sidebar and all other objects (though I only tested it in Hebrew).

If you want the short version:
(1) Download the following css file;
(2) Put it in a subfolder “/www” of your shinydashboard app;
(3) In your dashboardBody command (within the ui.R section of your shiny app) add the following code:
dashboardBody(
  tags$head(
    tags$link(rel = "stylesheet", type = "text/css", href = "bootstrap-rtl.css")
  ),
  ...



Here are the few insights and steps which led me to this solution:

Insight #1:
any shiny app (dashboard or otherwise) can be customized using CSS. That’s no secret. However, the adaptation to RTL isn’t that simple when you have so many objects, mobile responsiveness to worry about, etc.

Insight #2:
Shinydashboard is based on the AdminLTE theme, which in turn is based on Bootstrap 3. AdminLTE is a great theme, and even though it doesn’t officially support RTL, mmdsharifi provided a solution on his GitHub page. The same goes for Bootstrap 3, which has an RTL customization by morteza (also on GitHub).

Insight #3:
 What I did in order to make this work was to take the bootstrap-rtl.css from morteza, and then concatenate the AdminLTE-rtl.css file by mmdsharifi. Voilà! (simple, isn’t it?)

Here’s the resulting css file.
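If you prefer to build the combined stylesheet yourself, here is a minimal sketch (my own addition, assuming bootstrap-rtl.css and AdminLTE-rtl.css have been downloaded to your working directory):

# Concatenate the two RTL stylesheets into the app's www/ folder
rtl_css <- c(readLines("bootstrap-rtl.css"), readLines("AdminLTE-rtl.css"))
writeLines(rtl_css, "www/bootstrap-rtl.css")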

Thanks to 0xOri for suggesting and testing insight #3.





Exploring Assumptions of K-means Clustering using R

K-Means Clustering is a well-known technique based on unsupervised learning. As the name suggests, it forms ‘K’ clusters over the data using the means of the data. Unsupervised algorithms are a class of algorithms one should tread on carefully. Using the wrong algorithm will give completely botched-up results and all the effort will go down the drain. Unlike supervised learning algorithms, where one can still get around keeping some parts as an unknown black box, knowing the technique inside out, from the assumptions made to the process, methods of optimization, and uses, is essential. So let us begin step by step, starting from the assumptions. I will then explain the process and give a hands-on illustration using R.

Assumptions and Process
Why do we assume in the first place? The answer is that making assumptions helps simplify problems, and simplified problems can then be solved accurately. To divide your dataset into clusters, one must define the criteria of a cluster, and those criteria make up the assumptions of the technique. K-Means clustering relies on two assumptions regarding the clusters: first, that the clusters are spherical, and second, that the clusters are of similar size.

The spherical assumption helps in separating the clusters when the algorithm works on the data and forms clusters. If this assumption is violated, the clusters formed may not be what one expects. The assumption about the size of clusters, on the other hand, helps in deciding the boundaries of the clusters and in estimating how many data points each cluster should have. This assumption also gives an advantage: clusters in K-means are defined by taking the mean of all the data points in the cluster, and when the assumptions hold, one can start with the cluster centers almost anywhere and the algorithm will usually converge to the same final clusters as it would if the centers were kept as far apart as possible.

Now let’s understand how the algorithm works. The first step is to assign initial clusters. You can specify K initial centers or let the algorithm assign them randomly. The algorithm works in iterations: in every iteration, all the data points are assigned to one of the clusters based on the nearest distance from the centers. After all points are assigned to one of the clusters, the cluster centers are updated. The new centers are the centroid means of all the points within each cluster. This is repeated, iteration after iteration, until there is no change in the cluster assignment of any of the data points. There are, however, choices in this algorithm that are not fixed. For example, one can decide how the distance of each data point from a cluster center is defined; the most familiar choice is the Euclidean distance. The algorithm is straightforward and easy to understand, but using the technique is not as easy as it looks. Let’s try out some examples in R.
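Before that, here is a minimal, illustrative sketch (not from the original article) of the two steps inside a single k-means iteration; in practice kmeans() handles all of this, and more, internally:

# Assign every point in X (a numeric matrix) to its nearest center
assign_points <- function(X, centers) {
  k <- nrow(centers)
  d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
  max.col(-d)   # column index of the smallest distance = nearest center
}
# Recompute each center as the mean of the points currently assigned to it
update_centers <- function(X, cluster, k) {
  t(sapply(1:k, function(j) colMeans(X[cluster == j, , drop = FALSE])))
}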

K-Means Starter
To understand how K-Means works, we start with an example where all our assumptions hold. R includes a dataset about waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park known as ‘faithful’. The dataset consists of 272 observations of 2 features.
#Viewing the Faithful dataset
plot(faithful) 
Looking at the dataset, we can notice two clusters. I will now use the kmeans() function in R to form clusters. Let’s see how K-Means clustering works on the data
#Specify 2 centers
k_clust_start=kmeans(faithful, centers=2)
#Plot the data using clusters
plot(faithful, col=k_clust_start$cluster,pch=2)
Being a small dataset, clusters are formed almost instantaneously but how do we see the clusters, their centers or sizes? The k_clust_start variable I used contains information on both centers and the size of clusters. Let’s check them out
#Use the centers to find the cluster centers
k_clust_start$centers

     eruptions      waiting
1       4.29793     80.28488
2       2.09433     54.75000

#Use the size to find the cluster sizes
k_clust_start$size
[1] 172 100
This means the first cluster consists of 172 members and is centered at an eruption value of 4.29793 and a waiting value of 80.28488. Similarly, the second cluster consists of 100 members, with an eruption value of 2.09433 and a waiting value of 54.75. Now this information is golden! We know that these centers are the cluster means. So, the eruptions typically last either ~2 minutes or ~4.3 minutes, and for the longer eruptions, the waiting time is also longer.

Getting into the depths
Imagine a dataset with clusters that one can clearly identify but k-means cannot. I’m talking about a dataset which does not satisfy the assumptions. A common example is a dataset which represents two concentric circles. Let’s generate it and see what it looks like
#The following code will generate different plots for you but they will be similar
library(plyr)
library(dplyr)
#Generate random data which will be first cluster
clust1 = data_frame(x = rnorm(200), y = rnorm(200))
#Generate the second cluster which will ‘surround’ the first cluster
clust2 =data_frame(r = rnorm(200, 15, .5), theta = runif(200, 0, 2 * pi),
                 x = r * cos(theta), y = r * sin(theta)) %>%
  dplyr::select(x, y)
#Combine the data
dataset_cir= rbind(clust1, clust2)
#see the plot
plot(dataset_cir)

Simple, isn’t it? There are two clusters: one in the middle and the other circling the first. However, this violates the assumption that the clusters are spherical. The inner data is spherical while the outer circle is not. Even though the clustering will not be good, let’s see how k-means performs on this data
#Fit the k-means model
k_clust_spher1=kmeans(dataset_cir, centers=2)
#Plot the data and clusters
plot(dataset_cir, col=k_clust_spher1$cluster,pch=2)
How do we solve this problem? There are clearly 2 clusters but k-means is not working well. A simple way in this case is to transform our data into polar format. Let’s convert it and plot it.
#Using a function for transformation
cart2pol=function(x,y){
#This is r
  newx=sqrt(x^2 + y^2)
#This is theta
  newy=atan(y/x)
  x_y=cbind(newx,newy)
  return(x_y)
}
dataset_cir2=cart2pol(dataset_cir$x,dataset_cir$y)
plot(dataset_cir2)
Now we run the k-means model on this data
k_clust_spher2=kmeans(dataset_cir2, centers=2)
#Plot the data and clusters
plot(dataset_cir2, col=k_clust_spher2$cluster,pch=2)
This time the k-means algorithm works well and correctly identifies the two clusters in the transformed data. We can also view the clusters on the original data to double-check this.
plot(dataset_cir, col=k_clust_spher2$cluster,pch=2)
By transforming our data into polar coordinates and fitting k-means model on the transformed data, we fulfil the spherical data assumption and data is accurately clustered. Now let’s look at a data where the clusters are not of similar sizes. Similar size does not mean that the clusters have to be exactly equal. It simply means that no cluster should have ‘too few’ members.
#Make the first cluster with 1000 random values
clust1 = data_frame(x = rnorm(1000), y = rnorm(1000))
#Keep 10 values together to make the second cluster
clust2=data_frame(x=c(5,5.1,5.2,5.3,5.4,5,5.1,5.2,5.3,5.4),y=c(5,5,5,5,5,4.9,4.9,4.9,4.9,4.9))
#Combine the data
dataset_uneven=rbind(clust1,clust2)
plot(dataset_uneven)
Here again, we have two clear clusters but they do not satisfy similar size requirement of k-means algorithm.
k_clust_spher3=kmeans(dataset_uneven, centers=2)
plot(dataset_uneven, col=k_clust_spher3$cluster,pch=2)
Why did this happen? K-means tries to minimize the within-cluster distances and create ‘tight’ clusters of comparable size. In this process, it incorrectly assigns some data points belonging to the first cluster to the second cluster. This makes the clustering inaccurate.
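To see this numerically (a small addition to the article), we can inspect how many points each cluster received and how tight each one is:

# Cluster sizes and within-cluster sums of squares for the uneven example
k_clust_spher3$size       # number of points in each cluster
k_clust_spher3$withinss   # within-cluster sum of squares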

How to decide the value of K in data

The datasets I worked on in this article are all simple, and it is easy to identify the clusters by plotting them. Complicated datasets, however, do not offer this luxury. The Elbow method is popular for finding a suitable value of ‘K’ for k-means clustering. This method calculates the within-group sum of squared errors (SSE) for different values of k and plots them. From this plot, we choose the ‘k’ at which the SSE shows an abrupt change, creating an ‘elbow effect’. I will show an illustration on the iris dataset using petal width and petal length.
#Create a vector for storing the sse
sse=vector('numeric')
for(i in 2:15){
#k-means function in R has a feature withinss which stores sse for each cluster group
  sse[i-1]=sum(kmeans(iris[,3:4],centers=i)$withinss)
}
#Converting the sse to a data frame and storing corresponding value of k
sse=as.data.frame(sse)
sse$k=seq.int(2,15)
#Making the plot. This plot is also known as screeplot
plot(sse$k,sse$sse,type="b")
In this plot, the first elbow is formed at k=3. This method suggests that we should have 3 clusters for this data.
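As a quick follow-up (not part of the original article), we can fit k-means with the suggested k = 3 and compare the resulting clusters with the actual species labels; the seed is an arbitrary choice added here for reproducibility:

# Fit k-means with k = 3 on the petal measurements and cross-tabulate with species
set.seed(123)
k_iris <- kmeans(iris[, 3:4], centers = 3)
table(k_iris$cluster, iris$Species)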

Conclusion
K-means clustering is one of the first and most basic techniques that comes to mind when one thinks of unsupervised clustering. However, the technique is not just powerful; it also teaches the importance of understanding the data in unsupervised learning. If any of the assumptions are violated, the clusters are not formed properly. Similarly, while determining the value of ‘k’, using improper or arbitrary values may lead to improper clusters. In a way, k-means clustering relies on the correct value of k for clustering the data accurately. Unsupervised learning is fun, but not a wild attempt: one must be clear about all aspects of the algorithm and its assumptions before implementing it, rather than treating it as a black box and shooting in the dark. This article illustrates the various shortcomings of improperly clustering data on simple datasets which do not satisfy the assumptions of the algorithm. The full code used in this article is given below.
#Viewing the Faithful dataset
plot(faithful)

#Specify 2 centers
k_clust_start=kmeans(faithful, centers=2)
#Plot the data using clusters
plot(faithful, col=k_clust_start$cluster,pch=2)

#Use the centers to find the cluster centers
k_clust_start$centers
#Use the size to find the cluster sizes
k_clust_start$size
#The following code will generate different plots for you but they will be similar

library(plyr)
library(dplyr)
#Generate random data which will be first cluster
clust1 = data_frame(x = rnorm(200), y = rnorm(200))
#Generate the second cluster which will ‘surround’ the first cluster
clust2 =data_frame(r = rnorm(200, 15, .5), theta = runif(200, 0, 2 * pi),
                   x = r * cos(theta), y = r * sin(theta)) %>%
  dplyr::select(x, y)
#Combine the data
dataset_cir= rbind(clust1, clust2)
#see the plot
plot(dataset_cir)

#Fit the k-means model
k_clust_spher1=kmeans(dataset_cir, centers=2)
#Plot the data and clusters
plot(dataset_cir, col=k_clust_spher1$cluster,pch=2)

#Using a function for transformation
cart2pol=function(x,y){
  #This is r
  newx=sqrt(x^2 + y^2)
  #This is theta
  newy=atan(y/x)
  x_y=cbind(newx,newy)
  return(x_y)
}
dataset_cir2=cart2pol(dataset_cir$x,dataset_cir$y)
plot(dataset_cir2)

k_clust_spher2=kmeans(dataset_cir2, centers=2)
#Plot the data and clusters
plot(dataset_cir2, col=k_clust_spher2$cluster,pch=2)

plot(dataset_cir, col=k_clust_spher2$cluster,pch=2)

#Make the first cluster with 1000 random values
clust1 = data_frame(x = rnorm(1000), y = rnorm(1000))
#Keep 10 values together to make the second cluster
clust2=data_frame(x=c(5,5.1,5.2,5.3,5.4,5,5.1,5.2,5.3,5.4),y=c(5,5,5,5,5,4.9,4.9,4.9,4.9,4.9))
#Combine the data
dataset_uneven=rbind(clust1,clust2)
plot(dataset_uneven)

k_clust_spher3=kmeans(dataset_uneven, centers=2)
plot(dataset_uneven, col=k_clust_spher3$cluster,pch=2)

#Create a vector for storing the sse
sse=vector('numeric')
for(i in 2:15){
#k-means function in R has a feature withinss which stores sse for each cluster group
  sse[i-1]=sum(kmeans(iris[,3:4],centers=i)$withinss)
}
#Converting the sse to a data frame and storing corresponding value of k
sse=as.data.frame(sse)
sse$k=seq.int(2,15)
#Making the plot. This plot is also known as scree-plot
plot(sse$k,sse$sse,type="b")

Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.

Implementing Parallel Processing in R

If a task takes less time when done in parallel, why not do it that way and save time? Modern laptops and PCs have multi-core processors and plenty of memory, which you can use to generate output quickly. Parallelizing your code has numerous advantages: instead of waiting minutes or hours for a task to complete, you can restructure the code, obtain the output within seconds or minutes, and make it more efficient at the same time. Code efficiency is one of the most sought-after skills in the industry today, yet relatively few people write efficient code. Once you learn how to parallelize your code, you will only regret not learning it sooner.

Parallelizing your code in R is simple, and there are various methods and packages for it. Let’s look at some of the functions available in R.

lapply() and sapply() functions
The lapply() function applies a specified function to each element of a vector or list. Its output is always a list.
lapply(1:5, function(x) x^2) 
#input is 1,2,3,4,5 and output is square of the input
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

We can also generate multiple outputs, say x^2 and x^3

lapply(1:5, function(x) c(x^2,x^3)) 
#The output should be square and cube of input

[[1]]
[1] 1 1

[[2]]
[1] 4 8

[[3]]
[1]  9 27

[[4]]
[1] 16 64

[[5]]
[1]  25 125

This output is verbose, but a list is sometimes exactly what you need.
An alternative is the sapply() function, which simplifies the result to a vector, matrix or array where possible.

sapply(1:5, function(x) x^2) #This output is a vector
[1]  1  4  9 16 25
sapply(1:5, function(x) c(x^2,x^3)) #This outputs a matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    9   16   25
[2,]    1    8   27   64  125

sapply() also provides two additional parameters: simplify and USE.NAMES. If both are set to FALSE, the output is the same as that of lapply().
sapply(1:5, function(x) x^2, simplify = FALSE, USE.NAMES = FALSE) 
#Output is same as for lapply()
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

The lapply() and sapply() functions compute each element independently of the others, which makes them natural candidates for parallel execution. However, they still run sequentially on a single core. There are various packages in R that provide true parallelization (a quick timing comparison appears after the parLapply() example below).

“parallel” Package

The parallel package in R can perform tasks in parallel by allocating cores to R. The workflow is to find the number of cores in the system and allocate all of them, or a subset, to make a cluster. We can then run the parallel version of various functions, passing the cluster as an additional argument. A word of caution: it is important to stop the cluster at the end of the execution step so that the cores are released.
#Load the parallel package; it ships with base R, so no separate installation is needed
library(parallel)
 
# Use the detectCores() function to find the number of cores in system
no_cores <- detectCores()
 
# Setup cluster
clust <- makeCluster(no_cores) #This line will take time

#The parallel version of lapply() is parLapply() and needs an additional cluster argument.
parLapply(clust,1:5, function(x) c(x^2,x^3))
[[1]]
[1] 1 1

[[2]]
[1] 4 8

[[3]]
[1]  9 27

[[4]]
[1] 16 64

[[5]]
[1]  25 125

stopCluster(clust)
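To see the benefit, here is a quick timing comparison. This is a toy sketch, not part of the original article: slow_square is a made-up function whose only job is to be slow, and the cluster size of 4 assumes at least four cores are available.
library(parallel)
slow_square <- function(x) { Sys.sleep(1); x^2 }   #pretend this is an expensive computation
system.time(lapply(1:4, slow_square))              #sequential: roughly 4 seconds
clust <- makeCluster(4)                            #assumes at least 4 cores
system.time(parLapply(clust, 1:4, slow_square))    #parallel: roughly 1 second
stopCluster(clust)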

If we want a similar output but with sapply(), we use the parSapply() function

#Load the parallel package; it ships with base R, so no separate installation is needed
library(parallel)

# Use the detectCores() function to find the number of cores in system
no_cores <- detectCores()

# Setup cluster
clust <- makeCluster(no_cores) #This line will take time

#Setting a base variable 
base <- 4
#Note that this line is required so that all cores in cluster have this variable available
clusterExport(clust, "base")

#Using the parSapply() function
parSapply(clust, 1:5, function(exponent) base^exponent)
[1]    4   16   64  256 1024

stopCluster(clust)
Notice the clusterExport() function above? It is needed because of variable scope in parallel execution: a variable such as ‘base’ is not normally visible on the worker cores, so clusterExport() sends it to every core in the cluster, which is why we pass both the cluster object and the name of the variable to export. Changing base after the export has no effect, because the cores never see that change. Just as variables have scope, libraries also need to be loaded on all cores in order to be usable there. If a task requires a library, I load it on the cluster with the clusterEvalQ() function:
clusterEvalQ(clust,library(randomForest)) 
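For example, here is a minimal sketch combining both ideas. It is a hypothetical workflow, not from the original article, and it assumes the randomForest package is installed; n_trees is an illustrative variable.
clust <- makeCluster(detectCores())
clusterEvalQ(clust, library(randomForest))   #load the package on every worker
n_trees <- 100                               #hypothetical variable the workers will need
clusterExport(clust, "n_trees")              #make the variable visible on every worker
forests <- parLapply(clust, 1:4, function(i) {
  randomForest(Species ~ ., data = iris, ntree = n_trees)
})
stopCluster(clust)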

“foreach” Package

If you know basic programming, you are familiar with for loops; the foreach package builds on the same idea, which makes it easy to use. The foreach package also needs the doParallel package to run in parallel, via its registerDoParallel() function. The starting code looks like this:
library(foreach)
library(doParallel)

registerDoParallel(makeCluster(no_cores))
Once this is done, I can execute commands with the foreach() function and run them in parallel using the %dopar% operator. The foreach() function has a .combine parameter which specifies the kind of output needed: .combine=c gives a vector, .combine=rbind creates a matrix, .combine=list (together with .multicombine=TRUE) returns a list similar to lapply(), and .combine=data.frame produces a data frame. Note that these examples assume a numeric variable base already exists in the workspace (foreach exports it to the workers automatically); the outputs below were produced in separate runs with different values of base.
#Vector output
foreach(exponent = 1:5, .combine = c)  %dopar%  base^exponent

[1]   3   9  27  81 243

#Matrix output
foreach(exponent = 1:5, .combine = rbind)  %dopar%  base^exponent

         [,1]
result.1    3
result.2    9
result.3   27
result.4   81
result.5  243


#List output
foreach(exponent = 1:5, .combine = list, .multicombine=TRUE)  %dopar%  base^exponent

[[1]]
[1] 3

[[2]]
[1] 9

[[3]]
[1] 27

[[4]]
[1] 81

[[5]]
[1] 243

#Data Frame output
foreach(exponent = 1:5, .combine = data.frame)  %dopar%  base^exponent
  result.1 result.2 result.3 result.4 result.5
1        2        4        8       16       32

stopImplicitCluster()

Variations and Controlling Memory Usage

There are a number of different ways to do the tasks shown in the code above. For instance, the registerDoParallel() function can create an implicit cluster, so we do not need to call makeCluster() ourselves.
#This also works
registerDoParallel(no_cores)
Using an implicit cluster means there is no clust variable to pass around. To close the cluster, we need a different function, stopImplicitCluster(), at the end of the parallelization task.
stopImplicitCluster()
With foreach() and doParallel, variables defined locally (in the environment where foreach() is called) are available on all cores by default, so no clusterExport() is needed for them. However, variables that are not defined locally (such as those belonging to a parent function) and libraries still need to be made available on all cores. For this, foreach() has the .export and .packages parameters.
#using .export parameter
registerDoParallel(no_cores)

base <- 2 #Declaring this variable outside the scope of foreach() function

sample_func <- function() {
  #Using the .export parameter here to make the base variable available inside foreach()
  foreach(exponent = 1:5, .combine = c,.export = "base")  %dopar%  base^exponent
}
sample_func()
[1]  2  4  8 16 32
stopImplicitCluster()

#using .packages parameter
library(dplyr)
registerDoParallel(no_cores)
foreach(i = 1:5, .combine=c, .packages="dplyr") %dopar% {
  iris[i, ] %>% select(-Species) %>% sum
}
[1] 10.2  9.5  9.4  9.4 10.2
stopImplicitCluster()


With parallel processing comes the need to manage memory carefully, or your system may crash. The first consideration is whether the workers share the parent process’s memory (a FORK cluster) or get separate memory spaces (a PSOCK cluster). By default, makeCluster() creates a PSOCK cluster; we can switch to FORK by passing the type argument:
clust<-makeCluster(no_cores, type="FORK")
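Note that FORK clusters are only available on Unix-like systems. Here is a minimal sketch, not from the original article, that picks the cluster type based on the operating system:
cluster_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
clust <- makeCluster(no_cores, type = cluster_type)
#With FORK, workers share the parent process's memory, so large read-only objects
#are visible on the workers without clusterExport()
stopCluster(clust)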
Unless PSOCK is specifically required, it is very useful to use a FORK cluster to save memory and running time. The makeCluster() function also has an outfile parameter to specify an output file for debugging purposes:
registerDoParallel(makeCluster(no_cores, outfile="debug_file.txt"))
foreach(x=list(1:5, "a"))  %dopar%  print(x)
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "a"
Contents of the debug_file.txt
starting worker pid=6696 on localhost:11363 at 17:34:36.808
starting worker pid=13168 on localhost:11363 at 17:34:37.421
starting worker pid=11536 on localhost:11363 at 17:34:38.047
starting worker pid=9572 on localhost:11363 at 17:34:38.675
starting worker pid=10612 on localhost:11363 at 17:34:43.467
starting worker pid=8864 on localhost:11363 at 17:34:44.078
starting worker pid=5144 on localhost:11363 at 17:34:44.683
starting worker pid=12012 on localhost:11363 at 17:34:45.286
[1] 1 2a" 3 4 5
Notice that "a" is printed in the middle of the first result in my output; this is a race between the workers writing to the same file. Using a separate debug file for each task is a better debugging option:
registerDoParallel(makeCluster(no_cores, outfile="debug_file.txt"))
foreach(x=list(1,2,3,4,5, "a"))  %dopar%  cat(dput(x), file = paste0("debug_file_", x, ".txt"))

In this case, six files are created, one for each element in the list, along with debug_file.txt. Another way to debug code is with the tryCatch() function.

registerDoParallel(makeCluster(no_cores))
foreach(x=list(1, 2, "a"))  %dopar%  
{
  tryCatch({
    c(1/x) #Should give an error when x is "a"
  }, error = function(e) return(paste0("Error occurred for '", x, "'", 
                                       " The error is '", e, "'")))
}

[[1]]
[1] 1

[[2]]
[1] 0.5

[[3]]
[1] "Error occurred for 'a' The error is 'Error in 1/x: non-numeric argument to binary operator\n'"

Debugging helps find the reasons for errors, but what if the error comes from running out of memory? For this, R provides the rm() and gc() functions. The rm() function removes a variable from the environment; if you are sure you no longer need a variable, it is better to free up the memory with rm().
base=4 #Create a variable base whose value is 4
base_copy=base #Make a copy of the variable 
rm(base) #I can now remove the base variable and free up memory
To clean up the entire environment, use rm(list=ls()). It removes all variables but does not detach loaded packages.
rm(list=ls())
The gc() function is R’s garbage collector and runs automatically when needed. However, in a parallel environment it is useful to call gc() explicitly so that memory is returned to the system regularly.
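As a minimal sketch (a toy example, not from the original article, assuming foreach, doParallel and no_cores are set up as above), rm() and gc() can be called inside each worker after a memory-heavy step:
registerDoParallel(no_cores)
means <- foreach(i = 1:5, .combine = c) %dopar% {
  big_vec <- rnorm(1e6)   #hypothetical memory-heavy object
  result <- mean(big_vec)
  rm(big_vec)             #remove the large object from the worker's environment
  gc()                    #return the freed memory before the next task
  result                  #the value collected by foreach
}
stopImplicitCluster()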

Summary

Parallel programming may seem complex at first, but the time saved by executing tasks in parallel makes it worth the effort. Functions such as lapply() and sapply() are great alternatives to slow explicit loops, while the parallel, foreach and doParallel packages are great starting points for running tasks in parallel. These parallel constructs are function-based and modular. However, with great power comes a risk of crashes, so it is necessary to control memory usage and handle errors carefully. It is not necessary to parallelize every piece of code you write: you can write sequential code and parallelize only the parts that take a significant amount of time. This helps reduce out-of-memory situations and keeps the code robust and fast. The use of parallel programming is growing, and many packages now ship with parallel implementations. With this article, one can dive into the world of parallel programming and make full use of the available memory and processing power to generate output quickly. The full code for this article is as follows.
lapply(1:5, function(x) x^2) #input is 1,2,3,4,5 and output is square of the input

lapply(1:5, function(x) c(x^2,x^3)) #The output should be square and cube of input

sapply(1:5, function(x) x^2) #This output is a vector

sapply(1:5, function(x) c(x^2,x^3)) #This outputs a matrix

sapply(1:5, function(x) x^2, simplify = FALSE, USE.NAMES = FALSE) #Output is same as for lapply()

#Load the parallel package; it ships with base R, so no separate installation is needed
library(parallel)

# Use the detectCores() function to find the number of cores in system
no_cores <- detectCores()

# Setup cluster
clust <- makeCluster(no_cores) #This line will take time

#The parallel version of lapply() is parLapply() and needs an additional cluster argument.
parLapply(clust,1:5, function(x) c(x^2,x^3))
stopCluster(clust)

#Load the parallel package; it ships with base R, so no separate installation is needed
library(parallel)

# Use the detectCores() function to find the number of cores in system
no_cores <- detectCores()

# Setup cluster
clust <- makeCluster(no_cores) #This line will take time

#Setting a base variable 
base <- 4
#Note that this line is required so that all cores in cluster have this variable available
clusterExport(clust, "base")

#Using the parSapply() function
parSapply(clust, 1:5, function(exponent) base^exponent)
stopCluster(clust)

clusterEvalQ(clust,library(randomForest))

library(foreach)
library(doParallel)

registerDoParallel(makeCluster(no_cores))

#Vector output
foreach(exponent = 1:5, .combine = c)  %dopar%  base^exponent

#Matrix output
foreach(exponent = 1:5, .combine = rbind)  %dopar%  base^exponent

#List output
foreach(exponent = 1:5, .combine = list, .multicombine=TRUE)  %dopar%  base^exponent

#Data Frame output
foreach(exponent = 1:5, .combine = data.frame)  %dopar%  base^exponent


#This also works
registerDoParallel(no_cores)

stopImplicitCluster()

#using .export parameter
registerDoParallel(no_cores)

base <- 2 #Declaring this variable outside the scope of foreach() function

sample_func <- function() {
  #Using the .export parameter here to make the base variable available inside foreach()
  foreach(exponent = 1:5, .combine = c,.export = "base")  %dopar%  base^exponent
}
sample_func()
stopImplicitCluster()

#using .packages parameter
library(dplyr)
registerDoParallel(no_cores)
foreach(i = 1:5, .combine=c, .packages="dplyr") %dopar% {
  iris[i, ] %>% select(-Species) %>% sum
}
stopImplicitCluster()

clust<-makeCluster(no_cores, type="FORK")

registerDoParallel(makeCluster(no_cores, outfile="debug_file.txt"))
foreach(x=list(1:5, "a"))  %dopar%  print(x)

registerDoParallel(makeCluster(no_cores, outfile="debug_file.txt"))
foreach(x=list(1,2,3,4,5, "a"))  %dopar%  cat(dput(x), file = paste0("debug_file_", x, ".txt"))

registerDoParallel(makeCluster(no_cores))
foreach(x=list(1, 2, "a"))  %dopar%  
{
  tryCatch({
    c(1/x) #Should give an error when x is "a"
  }, error = function(e) return(paste0("Error occurred for '", x, "'", 
                                       " The error is '", e, "'")))
}

base=4 #Create a variable base whose value is 4
base_copy=base #Make a copy of the variable 
rm(base) #I can now remove the base variable and free up memory

rm(list=ls())


Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.

HebRew (using Hebrew in R)

Adi Sarid (Tel Aviv university and Sarid Research Institute LTD.)

July-2017

Background

A while back I participated in an R workshop at the annual convention of the Israeli Association for Statistics. I had the pleasure of talking with Tal Galili and Jonathan Rosenblatt, who mentioned that a lot of Israeli R users run into difficulties when using Hebrew in R. My firm opinion is that it’s best to keep everything in English, but sometimes you simply don’t have a choice. For example, I had to prepare a number of R Shiny dashboards for Hebrew-speaking clients, so Hebrew was the way to go, in a kind of English-Hebrew “Mishmash” (mix). I have run into a lot of such difficulties in the past, so here are a few pointers to get you started when working with Hebrew in R. This post deals with reading and writing files which contain Hebrew characters. Note that there is also a bit to say about Shiny apps that contain Hebrew, right-to-left layout in Shiny apps, and Hebrew variable names; these work with some care, but I won’t cover them here. If you have any other questions you’d like to see answered, feel free to contact me at [email protected].

Reading and writing files with Hebrew characters

R can read and write files in many formats. The common formats for small to medium data sets include comma separated values (*.csv) and Excel files (*.xlsx, *.xls). Each read/write action uses some kind of “encoding”. An encoding, in simple terms, is a definition of a character set which helps your operating system interpret and represent characters as they should appear (for example, Hebrew characters: תווים בעברית). There are a number of relevant character sets (encodings) where Hebrew is concerned:
  • UTF-8
  • ISO 8859-8
  • Windows-1255
When you try to read files which contain Hebrew characters, I usually recommend trying the encodings in that order: UTF-8 is commonly used by many applications, since it covers many languages. (A small helper for comparing them is sketched below.)
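Here is a minimal sketch of such a helper (a hypothetical function, not from the original post). Note that a wrong encoding often produces garbled text rather than an error, so the helper returns every attempt and lets you inspect which one renders the Hebrew correctly:
read_csv_encodings <- function(path,
                               encodings = c("UTF-8", "ISO-8859-8", "Windows-1255")) {
  #Read the file once per encoding and collect the results (or the error)
  attempts <- lapply(encodings, function(enc) {
    tryCatch(read.csv(path, fileEncoding = enc), error = function(e) e)
  })
  names(attempts) <- encodings
  attempts
}
#Example usage with a hypothetical local file: str(read_csv_encodings("hebrew_data.csv"))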

Using csv files with Hebrew characters

Here’s an example of something that can go wrong, and a possible solution. In this case I’ve prepared a csv file encoded with UTF-8. When using R’s standard read.csv() function, this is what happens:
sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")
sample.data
##    ן...... X......        X............
## 1 ׳¨׳•׳ ׳™      25             ׳—׳™׳₪׳”
## 2 ׳ž׳•׳˜׳™      77         ׳”׳¨׳¦׳œ׳™׳”
## 3   ׳“׳ ׳™      13 ׳×׳œ-׳׳‘׳™׳‘ ׳™׳₪׳•
## 4 ׳¨׳¢׳•׳×      30  ׳§׳¨׳™׳× ׳©׳ž׳•׳ ׳”
## 5   ׳“׳ ׳”      44        ׳‘׳™׳× ׳©׳׳Ÿ
Oh boy, that’s probably not what the file’s author had in mind. Let’s try to instruct read.csv to use a different encoding.
sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
                        encoding = "UTF-8")
sample.data
##   X.U.FEFF.שם גיל      מגורים
## 1        רוני  25        חיפה
## 2        מוטי  77      הרצליה
## 3         דני  13 תל-אביב יפו
## 4        רעות  30  קרית שמונה
## 5         דנה  44     בית שאן
A bit better, isn’t it? However, it’s not perfect: we can read the Hebrew, but there is a weird “X.U.FEFF” prefix in the header (this is the UTF-8 byte order mark, which read.csv does not strip). A better way to read and write files (for much more than just encoding; it is also quicker at reading large files) is the readr package, which is part of the tidyverse. On a side note, if you haven’t already, run install.packages("tidyverse"); it’s a must. It includes readr and a lot more goodies (read on). Now, for some of the tools you get with readr:
library(readr)
locale("he")
## <locale>
## Numbers:  123,456.78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   יום ראשון (יום א׳), יום שני (יום ב׳), יום שלישי (יום ג׳), יום
##         רביעי (יום ד׳), יום חמישי (יום ה׳), יום שישי (יום ו׳), יום
##         שבת (שבת)
## Months: ינואר (ינו׳), פברואר (פבר׳), מרץ (מרץ), אפריל (אפר׳), מאי (מאי),
##         יוני (יוני), יולי (יולי), אוגוסט (אוג׳), ספטמבר (ספט׳),
##         אוקטובר (אוק׳), נובמבר (נוב׳), דצמבר (דצמ׳)
## AM/PM:  לפנה״צ/אחה״צ
guess_encoding("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")
## # A tibble: 2 × 2
##   encoding confidence
##      <chr>      <dbl>
## 1    UTF-8       1.00
## 2   KOI8-R       0.98
First we used locale(), which knows the date format and default encoding for the language (UTF-8 in this case). On its own, locale() does nothing more than print the specs of the locale, but when used in conjunction with read_csv() it tells read_csv() everything it needs to know. Also note the use of guess_encoding(), which reads the first “few” lines of a file (10,000 by default) and helps us, well… guess the encoding of the file. You can see that readr is quite confident we need UTF-8 here (and 98% confident we need KOI8-R, a Cyrillic encoding, but the first option wins here…).
sample.data <- read_csv(file = "http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
                        locale = locale(date_names = "he", encoding = "UTF-8"))
## Parsed with column specification:
## cols(
##   שם = col_character(),
##   גיל = col_integer(),
##   מגורים = col_character()
## )
sample.data
## # A tibble: 5 × 3
##      שם   גיל      מגורים
##   <chr> <int>       <chr>
## 1  רוני    25        חיפה
## 2  מוטי    77      הרצליה
## 3   דני    13 תל-אביב יפו
## 4  רעות    30  קרית שמונה
## 5   דנה    44     בית שאן
Awesome, isn’t it? Note that the resulting sample.data is a tibble and not a data.frame (read about tibbles). The readr package has tons of functions and features to help with reading (and writing) files and controlling the encoding, so I definitely recommend it. By the way, try using read_csv() without setting the locale parameter and see what happens.

What about files saved by Excel?

Excel files are not the best choice for storing datasets, but the format is extremely common for obvious reasons.

CSV files which were saved by excel

In the past, I have run into a lot of difficulties trying to load CSV files which were saved by Excel into R. Excel seems to save them in either “Windows-1255” or “ISO-8859-8” instead of “UTF-8”. With the default read_csv() settings you might get gibberish instead of “שלום”, and in other cases a “multibyte error”. Just make sure you check the “Windows-1255” or “ISO-8859-8” encodings if standard UTF-8 doesn’t work well (i.e., use read_csv(file, locale = locale(encoding = "ISO-8859-8"))).
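As a minimal sketch (hypothetical file names, not from the original post), here is the fallback for reading an Excel-saved CSV, plus write_excel_csv() from readr for the reverse direction; it adds a UTF-8 byte order mark so that Excel detects the encoding when opening the file:
library(readr)
hebrew_data <- read_csv("my_excel_export.csv",
                        locale = locale(encoding = "Windows-1255"))  #or "ISO-8859-8"
write_excel_csv(hebrew_data, "hebrew_for_excel.csv")                 #UTF-8 with a BOM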

Reading directly from excel

Also, if the original data is in Excel, you might want to consider reading it directly from the Excel file (skipping CSVs entirely). There are a number of packages for reading Excel files; I recommend readxl, where read_xlsx() or read_xls() will do the trick (depending on the file format). You don’t even have to specify the encoding: Hebrew characters will be read as they should be.
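A minimal sketch with a hypothetical file name, not from the original post:
library(readxl)
hebrew_data <- read_xlsx("hebrew_data.xlsx")  #Hebrew text comes back correctly, no encoding argument needed
hebrew_data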

Summary

For reading csv files with Hebrew characters, it’s very convenient to use readr. The package has a lot of utilities for encoding and localization, like guess_encoding() and locale(). If the original data is in Excel, you might want to skip the csv and read the data directly from the Excel format using the readxl package. Sometimes reading files involves a lot of trial and error, but eventually it will work.

Don’t give up!