Advice to Young (and Old) Programmers: A Conversation with Hadley Wickham

I recently had the wonderful opportunity to chat with Hadley Wickham. He is an immensely prolific, yet humble guy who has not only contributed heavily to the advancement and development of R as a language and environment, but who also cares and has thought a lot about the process of doing data science the right way.

As a result, he has given many interviews on this “process” and his approach to data science and programming in R, mostly on the technical side of things. So, when I spoke with him, I wanted to frame the conversation for a broader audience, in large part due to the rapid expansion of people who are using R, most of whom engage very little with the programming wing of the community. Though this expansion of R users is a great thing on many dimensions, it has the potential to create a cohort of frustrated, self-taught programmers.

So, I am sharing Hadley’s responses to my questions in this blog for three main reasons: first and foremost, in an effort to offer a life-raft of sorts to people just getting started with (and frustrated by) R; second, to try and bridge the applied side of R users with the programming side of R users; and third, in the spirit of the open source foundation of R, to speak openly and plainly about the basics of approaching programming in R, and then how to keep going when you run into problems.

Importantly, though this conversation is intentionally non-technical and aimed at beginning programmers and R users, there are many great points and ideas for all programmers to consider, regardless of levels of proficiency and comfort with R. I have highlighted (via bold text) several of these particularly practical and valuable points throughout.

A final technical note: I edited a few places to make the conversation more amenable to reading and for the purpose of a blog post. I did not, however, change any of the substantive content of Hadley’s responses, as you will note from the conversational style of the responses, virtually all of which are verbatim. Enough of me. Here is some advice from Hadley Wickham to young (and old) programmers.

  • First, what is your role at R-Studio?

I am the chief scientist at R-Studio. I don’t really know what that means, or what chief scientists do. But I basically lead the teams that look after the Tidyverse, which is a set of packages for doing data science in R. So, my teams have a mix of roles, but there is some research in thinking what we should be working on, a bunch of programming, and also a bunch of education, like helping people understand how things work, webinars, books, talks, and Tweets.

  • Why R? How did you come to it and why should other people be convinced?

When I started learning R, the reason was simple: it was the only open source programming language for statistics. That’s obviously changed today, with programs and languages like Python, Java Script, and Scala.

So why R today? When you talk about choosing programming languages, I always say you shouldn’t pick them based on technical merits, but rather pick them based on the community. And I think the R community is like really, really strong, vibrant, free, welcoming, and embraces a wide range of domains. So, if there are like people like you using R, then your life is going to be much easier. That’s the first reason.

And the second reason, which is both a huge strength of R and a bit of a weakness, is that R is not just a programming language. It was designed from day 1 to be an environment that can do data analysis. So, compared to the other options like Python, you can get up and running in R doing data science, learning much, much less about programming to get started. And that generally makes it like easier to get up and running if you don’t have formal training in computer science or software engineering.

  • Let’s transition to Tidyverse, as you just mentioned. First, could you explain a bit behind the approach of the Tidyverse, from processing and management, to analysis in R?

The Tidyverse is a collection of R packages with the goal being, once you’ve learned one package in the collection, learning the other packages should be much easier. And what that means is that there is a deep, underlying philosophy and unity where you can learn things in one package and apply the same ideas elsewhere. So, it just means that your naïve ideas about what a function is going to do or how to tackle a problem should be fairly good, because you can draw on your experiences. Things are designed consistently in such a way where your experiences with other functions apply to new functions. So, for example, one of the ideas that underlies many, many packages in the Tidyverse is this idea of “tidy data,” which is a really simple idea, where when you are dealing with data science-y kind of data, you want to make sure that every variable is in a column, and then naturally every observation or case becomes a row. And if you put your data in that format once when you are doing data tidying, then you don’t have to hassle with it multiple times throughout the process.

  • In an interview in 2014 at UseR, you said one of your main goals was streamlining the process of getting from raw data to visualization quickly and efficiently, with “tidying” up the data being a key aspect of that. Presumably, you were talking about development of the Tidyverse. Do you think you’re there on that goal? Is there more to be expected? Did you meet it?

Yeah, we have made a lot of progress towards that goal. And 2014 was well before the idea of the Tidyverse existed. And the biggest change in my thinking since then is thinking of the Tidyverse as a thing and not just individual packages, and then being consistent across packages. That’s playing off one of the things my team has been focusing on this year, and that’s consistency within the Tidyverse; not adding a bunch of new features or creating new packages. But just thinking, “how can we make sure everything fits together as well as it possibly can?”

So, we are making good progress and there’s always more to do. And one of the things I find really rewarding is people sharing their experiences, like getting started with data and having the first really enjoyable experience when you go from a new dataset to some cool visualization with as little pain as possible. A neat illustration of that is the “TidyTuesday” hashtag [#tidytuesday] on Twitter. Every Tuesday they post a little data set and challenge, and then people tackle it with R and other tools and Tweet their results. Its really cool. [And anybody can do this, I guess?] Yeah exactly. Totally community run and driven.

  • Let’s shift and talk a little more conceptually about R and programming. There seems to be a ton of resources out there for R programmers given that its open source. This is naturally a wonderful thing. But I am wondering about beginning R programmers. Given that there is so much out there, how would you counsel a beginning programmer to sift through the resources to distinguish the signal from the noise?

So, I’m obviously biased in this recommendation, but I would say start with my book, “R for Data Science,” just because this is what it was designed for. It’s not going to teach you everything about R, and it’s by no means perfect, but I think it’s a really good way to get started. And it focuses relentlessly on giving you useful tools to help you understand data. It seems to be pretty popular and people seem to like it. And it’s free. You can buy a book if you want, but it’s free online.

After that, the other thing I would say is to try and find an R learning community. It’s much easier to learn and stay motivated when you are working with other people. And I think there’s lots of ways of doing that: look for a local meet up, like an R meet up or an R Ladies meet up in your area. There’s also the R-Studio community site.

Just find some way to find people like you who also are learning, because you can share your successes and your trials and your failures. It makes it much more likely that you will stick it out to the point where you will do something really useful.

  • I’ve noticed a theme in your work and how you approach package and resource development is addressing and fixing common problems in R. I’m curious how you hone in on these problems.

Yeah, I am curious too. I seem to be able to do it. I don’t really know exactly how. Part of it is I just talk to people and I travel a lot. I talk to people in different areas working on different problems, and I interact with a bunch of people on Twitter. And somehow that all feeds into my brain, and then ideas come out in a way I don’t fully understand. But it seems to work. I don’t want to break it. Also, I talk to people who are actually struggling with data analysis problems, and I also read a lot of other programming languages, computer science, and software engineering, because there’s basically nothing that I have done that’s not been done in some way, somewhere else before. So, it’s just finding a right idea that someone else has come up with and then applying it in a new domain; it’s tremendously valuable.

  • The use of R has seemed to explode within the past decade or so, moving far beyond the smaller world of computer programmers, spilling into many applied fields such as medicine, engineering, and even my own, political science. Having learned mostly on my own, I feel like there are two conversations going on, with a big gap, leaving the beginners to try and sift through complex worlds and tradeoffs. Specifically, in applied data analysis, the debate or tradeoff seems to be over using R versus another package like Stata. But from what I have observed, in the programming world, the debate seems to be over R versus other languages, such as Python, like you mentioned. So how should an applied analyst, someone who is not a programmer by training, navigate this tradeoff?

I think the tradeoff between Stata and R is: do you want a point-and-click interface, or do you want a programming interface? Point-and-click interfaces are great, because they lay out all of your options in front of you, and you don’t have to remember anything. You can navigate through the set of pre-supplied options. And that’s also it’s greatest weakness, because first of all, you are constrained into what the developer thought you should be able to do. And secondly, because your primary interaction is with a mouse, it’s very difficult to record what you did. And I think that’s a problem for science, because ideally you want to say how you actually got these results. And then simply do that reliably and have other people critique you on that. But it’s also really hard when you are learning, because when you have a problem, how do you communicate that problem to someone else? You basically have to say, “I clicked here, then I clicked here, then I clicked here, and I did this.” Or you make a screen cast, and it’s just clunky.

So, the advantages of programming languages like R or Python, is that the primary mechanism for communicating with the computer is text. And that is scary because there’s nothing like this blinking cursory in front of you; it doesn’t tell you what to do next. But it means you are unconstrained, because you can do anything you can imagine. And you have all these advantages of text, where if you have a problem with your code, you can copy and paste it into an email, you can Google it, you can check it and put it on GitHub, or you can share it by Twitter. There’s just so many advantages to the fact that the primary way you relate with a programming language is through code, which is just text. And so, as long as you are doing data analysis fairly regularly, I think all the advantages outweigh a point and click interface like Stata.

For R and Python, Python is first and foremost a programming language. And that has a lot of good features, but it tends to mean, that if you are going to do data science in Python, you have to first learn how to program in Python. Whereas I think you are going to get up and running faster with R, than with Python because there’s just a bunch more stuff built in and you don’t have to learn as many programming concepts. You can focus on being a great political scientist or whatever you do and learning enough R that you don’t have to become an expert programmer as well to get stuff done.

  • As people develop in programming, could you talk a little about the tradeoff between technical complexity and simplicity and usability?

That’s a big question. People naturally go through a few phases. When you start out, you don’t have many tips and techniques at your disposal. So, you are forced to do the simplest thing possible using the simplest ideas. And sometimes you face problems that are really hard to solve, because you don’t know quite the right techniques yet. So, the very earliest phase, you’ve got a few techniques that you understand really well, and you apply them everywhere because those are the techniques you know.

And the next stage that a lot of people go through, is that you learn more techniques, and more complex ways of solving problems, and then you get excited about them and start to apply them everywhere possible. So instead of using the simplest possible solution, you end up creating something that’s probably overly complex or uses some overly general formulation.

And then eventually you get past that and it’s about understanding, “what are the techniques at my disposal? Which techniques fit this problem most naturally? How can I express myself as clearly as possible, so I can understand what I am doing, and so other people can understand what I am doing?” I talk about this a lot but think explicitly about code as communication. You are obviously telling the computer what to do, but ideally you want to write code to express what it means or what it is trying to do as well, so when others read it and when you in the future reads it, you can understand some of the reasoning.

  • Any parting words of wisdom for R programmers or the community?

It’s easy when you start out programming to get really frustrated and think, “Oh it’s me, I’m really stupid,” or, “I’m not made out to program.” But, that is absolutely not the case. Everyone gets frustrated. I still get frustrated occasionally when writing R code. It’s just a natural part of programming. So, it happens to everyone and gets less and less over time. Don’t blame yourself. Just take a break, do something fun, and then come back and try again later.

GitLab CI for R-package development

— A basic R-phile introduction to continuous integration on GitLab

I have been using the GitLab repository for some time for mainly two reasons: I can have private projects at no monetary costs (I later came to realise that I as an academic can have the same on GitHub), and most importantly GitLab has so far gone under the radar of our IT department, meaning I can access it from my work computer. GitHub on the other hand is flagged as file sharing.

A simple CI config

Most of my time with R is spend trying to make heads and tails of various kinds of data, and I have so far just authored one R-package. While I can see the benefits of a continuous integration (CI) work flow, I just never bothered to actually set it up. Now where I am putting together code in smaller packages for internal use, it seemed like the right time to learn a little.

The Internet gives a few pointers on how to go about setting up CI on GitLab; one of the resources is the blog post Docker, GitLab CI and Developing R Packages by Mustafa Hasanbulli, who gives a simple .gitlab-ci.yml for testing packages. Mustafa’s solution make use of the rocker/tidyverse Docker image and install the dependency packages before running check() from devtools. It’s a good solution and combining with the .gitlab-ci.yml shared as a gist on Github by Artem Klevtsov, I managed to get the coverage badge I though nice to have. The .gitlab-ci.yml for a smaller package can be along the lines of:

image: rocker/tidyverse

stages:
  - check
  - coverage

check_pkg:
  stage: check
  script:
    - R -e 'install.packages(c())'
    - R -e 'devtools::check()'

coverage:
   stage: coverage
   script:
     - R -e 'covr::package_coverage(type = c("tests", "examples"))'

To extract the coverage to the coverage badge, add Coverage: \d+.\d+%$ to the section ‘Test coverage parsing’ in Settings -> CI/CD -> General pipelines.

Introducing cache

For my package, each of the two stages took about 45 minutes to complete, and I realized that the wast majority of the time was spent on downloading and especially installing packages. This was mainly do to the Bioconductor packages I rely on.

If only there would be a way to pass the installed packages between the stages, or even between runs of the CI pipeline. There is – GitLab 9.0 saw the option to specify a cache. The next problem is that the cache must be a directory of the cloned project directory. Since R prefers to install packages in /usr/lib/R/library in the Docker images, the .libpaths() must be changed. In addition you would have to remember to add any new package to the .gitlab-ci.yml. Which I for one would always forget, and therefore painstakingly have to figure out which packages to add.

A much simpler solution is to use packrat – something you anyway should consider to use. It also allows you to use the rocker/r-base image and just the packages actually required for your CI. How much of a win in terms of traffic rocker/r-base is over rocker/tidyverse probably depends on the packages you have to add. The .gitlab-ci.yml caching packages could look like this:

image: rocker/r-base

stages:
  - setup
  - test

cache:
  # Ommit key to use the same cache across all pipelines and branches
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - packrat/lib/

setup:
  stage: setup
  script:
    - R -e 'source("ci.R"); ci_setup()'

check:
  stage: test
  dependencies:
    - setup
  when: on_success
  script:
    - R -e 'source("ci.R"); ci_check()'

coverage:
  stage: test
  dependencies:
    - setup
  when: on_success
  only:
    - master
  script:
    - R -e 'source("ci.R"); ci_coverage()'

with the ci.R looking like this:

install_if_needed <- function(package_to_install){
  package_path <- find.package(package_to_install, quiet = TRUE)

  if(length(package_path) == 0){
    # Only install if not present
    install.packages(package_to_install)
  }
}

ci_setup <- function(){
  install_if_needed("packrat")
  packrat::restore()
}

ci_check <- function(){
  install_if_needed("devtools")
  devtools::check()
}

ci_coverage <- function(){
  install_if_needed("covr")
  covr::package_coverage(type = c("tests", "examples"))
}

The cache key $CI_COMMIT_REF_SLUG gives you the advantage of different cache for different branches. Using $CI_COMMIT_SHA will give you a separate cache for each commit.

Adding the packrat subdirectories src and lib* to the .gitignore will keep your repository small – and I find it quite useful to commit just the packrat.lock whenever I add or remove a package. But then again, I am the only one working with my repositories, and there might be advantages I don’t know of.

I have noticed that the stages after the setup stage sometimes fail in the first run. If this happens because of the cache, rerunning the failed stage makes everything well.

Using the above for my package, the first run of the pipeline took about 45 minutes, but the second run only about 8 minutes. A considerable reduction in time.

I hope .gitlab-ci.yml and ci.R outlined here will help you getting started on caching your R-packages in your CI. The two modules are quite simple, and if you are loking for something more sophisticated, I can recommend looking Matt Dowle works on data.table and of course the GitLab Runner help pages.

Dealing with The Problem of Multicollinearity in R

Imagine a situation where you are asked to predict the tourism revenue for a country, let’s say India. In this case, your output or dependent or response variable will be total revenue earned (in USD) in a given year. But, what about independent or predictor variables?

You have been provided with two sets of predictor variables and you have to choose one of the sets to predict your output. The first set consists of three variables:

  • X1 = Total number of tourists visiting the country
  • X2 = Government spending on tourism marketing
  • X3 = a*X1 + b*X2 + c, where a, b and c are some constants

The second set also consists of three variables:

  • X1 = Total number of tourists visiting the country
  • X2 = Government spending on tourism marketing
  • X3 = Average currency exchange rate

Which of the two sets do you think provides us more information in predicting our output?

I am sure, you will agree with me that the second set provides us more information in predicting the output because the second set has three variables which are different from each other and each of the variables provides different information (we can infer this intuitively at this moment). Moreover, none of the three variables is directly derived from the other variables in the system. Alternatively, we can also say that none of the variables is a linear combination of other variables in the system.

In the first set of variables, only two variables provide us relevant information; while, the third variable is nothing but a linear combination of other two variables. If we were to directly develop a model without including this variable, our model would have considered this combination and estimated coefficients accordingly.

Now, this effect in the first set of variables is called multicollinearity. Variables in the first set are strongly correlated to each other (if not all, at least some variables are correlated with other variables). Model developed using the first set of variables may not provide as accurate results as the second one because we are missing out on relevant variables/information in the first set. Therefore, it becomes important to study multicollinearity and the techniques to detect and tackle its effect in regression models.

According to Wikipedia, “Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, X1 and X2 are perfectly collinear if there exist parameters λ0 and λ1 such that, for all observations i, we have

X2i = λ0 + λ1 * X1i

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.”

We saw an example of exactly what the Wikipedia definition is describing.

Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, to the set of independent variables. Now, this new variable, Z, does not add any significant or different value than provided by X or Y. The model can adjust itself to set the parameters that this combination is taken care of while determining the coefficients.

Multicollinearity may arise from several factors. Inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. The other reason could be the usage of derived variables, i.e., one variable is computed from other variables in the system. This is similar to the example we took at the beginning of the article. The other reason could be taking variables which are similar in nature or which provide similar information or the variables which have very high correlation among each other.

Multicollinearity may not possess problem at an overall level, but it strongly impacts the individual variables and their predictive power. You may not be able to identify which are statistically significant variables in your model. Moreover, you will be working with a set of variables which provide you similar output or variables which are redundant with respect to other variables.

  • It becomes difficult to identify statistically significant variables. Since your model will become very sensitive to the sample you choose to run the model, different samples may show different statistically significant variables.
  • Because of multicollinearity, regression coefficients cannot be estimated precisely because the standard errors tend to be very high. Value and even sign of regression coefficients may change when different samples are chosen from the data.
  • Model becomes very sensitive to addition or deletion of any independent variable. If you add a variable which is orthogonal to the existing variable, your variable may throw completely different results. Deletion of a variable may also significantly impact the overall results.
  • Confidence intervals tend to become wider because of which we may not be able to reject the NULL hypothesis. The NULL hypothesis states that the true population coefficient is zero.

Now, moving on to how to detect the presence of multicollinearity in the system.

There are multiple ways to detect the presence of multicollinearity among the independent or explanatory variables.

  • The first and most rudimentary way is to create a pair-wise correlation plot among different variables. In most of the cases, variables will have some bit of correlation among each other, but high correlation coefficient may be a point of concern for us. It may indicate the presence of multicollinearity among variables.
  • Large variations in regression coefficients on addition or deletion of new explanatory or independent variables can indicate the presence of multicollinearity. The other thing could be significant change in the regression coefficients from sample to sample. With different samples, different statistically significant variables may come out.
  • The other method can be to use tolerance or variance inflation factor (VIF).

VIF = 1 / Tolerance

VIF = 1/ (1 – R square)

VIF of over 10 indicates that the variables have high correlation among each other. Usually, VIF value of less than 4 is considered good for a model.

  • The model may have very high R-square value but most of the coefficients are not statistically significant. This kind of a scenario may reflect multicollinearity in the system.
  • Farrar-Glauber test is one of the statistical test used to detect multicollinearity. This comprises of three further tests. The first, Chi-square test, examines whether multicollinearity is present in the system. The second test, F-test, determines which regressors or explanatory variables are collinear. The third test, t-test, determines the type or pattern of multicollinearity.

We will now use some of these techniques and try their implementation in R.

We will use CPS_85_Wages data which consists of a random sample of 534 persons from the CPS (Current Population Survey). The data provides information on wages and other characteristics of the workers. (Linkhttp://lib.stat.cmu.edu/datasets/CPS_85_Wages). You can go through the data details on the link provided.

In this data, we will predict wages from other variables in the data.

> data1 = read.csv(file.choose(), header = T)> head(data1)  Education South Sex Experience Union  Wage Age Race Occupation Sector Marr1         8 0 1         21 0 5.10 35  2 6 1 12         9 0 1         42 0 4.95 57  3 6 1 13        12 0  0 1     0 6.67 19 3         6 1 04        12 0  0 4     0 4.00 22 3         6 0 05        12 0  0 17     0 7.50 35 3         6 0 16        13 0  0 9     1 13.07 28 3         6 0 0> str(data1)’data.frame’: 534 obs. of  11 variables: $ Education : int  8 9 12 12 12 13 10 12 16 12 … $ South     : int 0 0 0 0 0 0 1 0 0 0 … $ Sex       : int 1 1 0 0 0 0 0 0 0 0 … $ Experience: int  21 42 1 4 17 9 27 9 11 9 … $ Union     : int 0 0 0 0 0 1 0 0 0 0 … $ Wage      : num 5.1 4.95 6.67 4 7.5 … $ Age       : int 35 57 19 22 35 28 43 27 33 27 … $ Race      : int 2 3 3 3 3 3 3 3 3 3 … $ Occupation: int  6 6 6 6 6 6 6 6 6 6 … $ Sector    : int 1 1 1 0 0 0 0 0 1 0 … $ Marr      : int 1 1 0 0 1 0 0 0 1 0 …

The above results show the sample view of data and the variables present in the data. Now, let’s fit the linear regression model and analyze the results.

> fit_model1 = lm(log(data1$Wage) ~., data = data1)> summary(fit_model1) Call:lm(formula = log(data1$Wage) ~ ., data = data1) Residuals:     Min     1Q Median       3Q Max -2.16246 -0.29163 -0.00469  0.29981 1.98248  Coefficients:             Estimate Std. Error t value Pr(>|t|)    (Intercept)  1.078596 0.687514   1.569 0.117291 Education    0.179366 0.110756   1.619 0.105949 South       -0.102360 0.042823  -2.390 0.017187 * Sex         -0.221997 0.039907  -5.563 4.24e-08 ***Experience   0.095822 0.110799   0.865 0.387531 Union        0.200483 0.052475   3.821 0.000149 ***Age         -0.085444 0.110730  -0.772 0.440671 Race         0.050406 0.028531   1.767 0.077865 . Occupation  -0.007417 0.013109  -0.566 0.571761 Sector       0.091458 0.038736   2.361 0.018589 * Marr         0.076611 0.041931   1.827 0.068259 . —Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4398 on 523 degrees of freedomMultiple R-squared:  0.3185, Adjusted R-squared:  0.3054 F-statistic: 24.44 on 10 and 523 DF,  p-value: < 2.2e-16

The linear regression results show that the model is statistically significant as the F-statistic has high value and p-value for model is less than 0.05. However, on closer examination we observe that four variables – Education, Experience, Age and Occupation are not statistically significant; while, two variables Race and Marr (martial status) are significant at 10% level. Now, let’s plot the model diagnostics to validate the assumptions of the model.


> plot(fit_model1)
Hit <Return> to see next plot:

Hit to see next plot:

Hit to see next plot:

Hit to see next plot:

The diagnostic plots also look fine. Let’s investigate further and look at pair-wise correlation among variables.

library(corrplot)
> cor1 = cor(data1)
> corrplot.mixed(cor1, lower.col = “black”, number.cex = .7)

The above correlation plot shows that there is high correlation between experience and age variables. This might be resulting in multicollinearity in the model.

Now, let’s move a step further and try Farrar-Glauber test to further investigate this. The ‘mctest’ package in R provides the Farrar-Glauber test in R.

install.packages(‘mctest’)library(mctest)

We will first use omcdiag function in mctest package. According to the package description, omcdiag (Overall Multicollinearity Diagnostics Measures) computes different overall measures of multicollinearity diagnostics for matrix of regressors.

> omcdiag(data1[,c(1:5,7:11)],data1$Wage) Call:omcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)  Overall Multicollinearity Diagnostics                        MC Results detectionDeterminant |X’X|:         0.0001 1Farrar Chi-Square:      4833.5751 1Red Indicator:             0.1983 0Sum of Lambda Inverse: 10068.8439         1Theil’s Method:            1.2263 1Condition Number:        739.7337 1 1 –> COLLINEARITY is detected by the test 0 –> COLLINEARITY is not detected by the test

The above output shows that multicollinearity is present in the model. Now, let’s go a step further and check for F-test in in Farrar-Glauber test.

> imcdiag(data1[,c(1:5,7:11)],data1$Wage) Call:imcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)  All Individual Multicollinearity Diagnostics Result                  VIF TOL   Wi Fi Leamer      CVIF KleinEducation   231.1956 0.0043  13402.4982 15106.5849 0.0658  236.4725 1South         1.0468 0.9553     2.7264 3.0731 0.9774    1.0707 0Sex           1.0916 0.9161     5.3351 6.0135 0.9571    1.1165 0Experience 5184.0939 0.0002 301771.2445 340140.5368 0.0139 5302.4188     1Union         1.1209 0.8922     7.0368 7.9315 0.9445    1.1464 0Age        4645.6650 0.0002 270422.7164 304806.1391 0.0147 4751.7005     1Race          1.0371 0.9642     2.1622 2.4372 0.9819    1.0608 0Occupation    1.2982 0.7703    17.3637 19.5715 0.8777    1.3279 0Sector        1.1987 0.8343    11.5670 13.0378 0.9134    1.2260 0Marr          1.0961 0.9123     5.5969 6.3085 0.9551    1.1211 0 1 –> COLLINEARITY is detected by the test 0 –> COLLINEARITY is not detected by the test Education , South , Experience , Age , Race , Occupation , Sector , Marr , coefficient(s) are non-significant may be due to multicollinearity R-square of y on all x: 0.2805  * use method argument to check which regressors may be the reason of collinearity===================================

The above output shows that Education, Experience and Age have multicollinearity. Also, the VIF value is very high for these variables. Finally, let’s move to examine the pattern of multicollinearity and conduct t-test for correlation coefficients.

> pcor(data1[,c(1:5,7:11)],method = “pearson”)$estimate              Education   South Sex  Experience Union         Age Race OccupationEducation   1.000000000 -0.031750193  0.051510483 -0.99756187 -0.007479144  0.99726160 0.017230877 0.029436911South      -0.031750193  1.000000000 -0.030152499 -0.02231360 -0.097548621  0.02152507 -0.111197596 0.008430595Sex         0.051510483 -0.030152499  1.000000000 0.05497703 -0.120087577 -0.05369785  0.020017315 -0.142750864Experience -0.997561873 -0.022313605  0.054977034 1.00000000 -0.010244447 0.99987574  0.010888486 0.042058560Union      -0.007479144 -0.097548621 -0.120087577 -0.01024445  1.000000000 0.01223890 -0.107706183 0.212996388Age         0.997261601 0.021525073 -0.053697851  0.99987574 0.012238897 1.00000000 -0.010803310 -0.044140293Race        0.017230877 -0.111197596  0.020017315 0.01088849 -0.107706183 -0.01080331  1.000000000 0.057539374Occupation  0.029436911 0.008430595 -0.142750864  0.04205856 0.212996388 -0.04414029 0.057539374  1.000000000Sector     -0.021253493 -0.021518760 -0.112146760 -0.01326166 -0.013531482  0.01456575 0.006412099 0.314746868Marr       -0.040302967  0.030418218 0.004163264 -0.04097664  0.068918496 0.04509033 0.055645964 -0.018580965                 Sector MarrEducation  -0.021253493 -0.040302967South      -0.021518760  0.030418218Sex        -0.112146760  0.004163264Experience -0.013261665 -0.040976643Union      -0.013531482  0.068918496Age         0.014565751 0.045090327Race        0.006412099 0.055645964Occupation  0.314746868 -0.018580965Sector      1.000000000 0.036495494Marr        0.036495494 1.000000000 $p.value           Education    South Sex Experience        Union Age Race Occupation       SectorEducation  0.0000000 0.46745162 0.238259049  0.0000000 8.641246e-01 0.0000000 0.69337880 5.005235e-01 6.267278e-01South      0.4674516 0.00000000 0.490162786  0.6096300 2.526916e-02 0.6223281 0.01070652 8.470400e-01 6.224302e-01Sex        0.2382590 0.49016279 0.000000000  0.2080904 5.822656e-03 0.2188841 0.64692038 1.027137e-03 1.005138e-02Experience 0.0000000 0.60962999 0.208090393  0.0000000 8.146741e-01 0.0000000 0.80325456 3.356824e-01 7.615531e-01Union      0.8641246 0.02526916 0.005822656  0.8146741 0.000000e+00 0.7794483 0.01345383 8.220095e-07 7.568528e-01Age        0.0000000 0.62232811 0.218884070  0.0000000 7.794483e-01 0.0000000 0.80476248 3.122902e-01 7.389200e-01Race       0.6933788 0.01070652 0.646920379  0.8032546 1.345383e-02 0.8047625 0.00000000 1.876376e-01 8.833600e-01Occupation 0.5005235 0.84704000 0.001027137  0.3356824 8.220095e-07 0.3122902 0.18763758 0.000000e+00 1.467261e-13Sector     0.6267278 0.62243025 0.010051378  0.7615531 7.568528e-01 0.7389200 0.88336002 1.467261e-13 0.000000e+00Marr       0.3562616 0.48634504 0.924111163  0.3482728 1.143954e-01 0.3019796 0.20260170 6.707116e-01 4.035489e-01                MarrEducation  0.3562616South      0.4863450Sex        0.9241112Experience 0.3482728Union      0.1143954Age        0.3019796Race       0.2026017Occupation 0.6707116Sector     0.4035489Marr       0.0000000 $statistic              Education South         Sex Experience Union        Age Race Occupation SectorEducation     0.0000000 -0.7271618  1.18069629 -327.2105031 -0.1712102  308.6803174 0.3944914 0.6741338 -0.4866246South        -0.7271618 0.0000000 -0.69053623   -0.5109090 -2.2436907 0.4928456 -2.5613138  0.1929920 -0.4927010Sex           1.1806963 -0.6905362  0.00000000 1.2603880 -2.7689685   -1.2309760 0.4583091 -3.3015287 -2.5834540Experience -327.2105031 -0.5109090  1.26038801 0.0000000 -0.2345184 1451.9092015  0.2492636 0.9636171 -0.3036001Union        -0.1712102 -2.2436907 -2.76896848   -0.2345184 0.0000000 0.2801822 -2.4799336  4.9902208 -0.3097781Age         308.6803174 0.4928456 -1.23097601 1451.9092015  0.2801822 0.0000000 -0.2473135 -1.0114033 0.3334607Race          0.3944914 -2.5613138  0.45830912 0.2492636 -2.4799336   -0.2473135 0.0000000 1.3193223 0.1467827Occupation    0.6741338 0.1929920 -3.30152873    0.9636171 4.9902208 -1.0114033 1.3193223  0.0000000 7.5906763Sector       -0.4866246 -0.4927010 -2.58345399   -0.3036001 -0.3097781 0.3334607 0.1467827  7.5906763 0.0000000Marr         -0.9233273 0.6966272  0.09530228 -0.9387867  1.5813765 1.0332156 1.2757711 -0.4254112  0.8359769                  MarrEducation  -0.92332727South       0.69662719Sex         0.09530228Experience -0.93878671Union       1.58137652Age         1.03321563Race        1.27577106Occupation -0.42541117Sector      0.83597695Marr        0.00000000 $n[1] 534 $gp[1] 8 $method[1] “pearson”

As we saw earlier in the correlation plot, partial correlation between age-experience, age-education and education-experience is statistically significant. There are other pairs also which are statistically significant. Thus, Farrar-Glauber test helps us in identifying the variables which are causing multicollinearity in the model.

There are multiple ways to overcome the problem of multicollinearity. You may use ridge regression or principal component regression or partial least squares regression. The alternate way could be to drop off variables which are resulting in multicollinearity. You may drop of variables which have VIF more than 10. In our case, since age and experience are highly correlated, you may drop one of these variables and build the model again. Try building the model again by removing experience or age and check if you are getting better results. Share your experiences in the comments section below.

Author Bio:

This article was contributed by Perceptive Analytics. Jyothirmayee Thondamallu, Chaitanya Sagar and Saneesh Veetil contributed to this article.

Perceptive Analytics is a marketing analytics company and it also provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

Longitudinal heat plots

During our research on the effect of prednisone consumption during pregency on health outcomes of the baby (Palmsten K, Rolland M, Hebert MF, et al., Patterns of prednisone use during pregnancy in women with rheumatoid arthritis: Daily and cumulative dose. Pharmacoepidemiol Drug Saf. 2018 Apr;27(4):430-438. https://www.ncbi.nlm.nih.gov/pubmed/29488292) we developed a custom plot to visualize for each patient their daily and cumulative consumption of prednisone during pregenancy. Since the publication these plots have raised some interest so here is the code used to produce them. Data needs to be in the following format: 1 line per patient 1 column for the patient ID (named id) and then 1 column per unit of time (here days) reporting the measure for that day To illustrate the type of data we dealt with I first generate a random dataset containing 25 patients followed for n days (n different for each patient, randomly chosen between 50 and 200) with a daily consumption value randomly selected between 0 and a maximum dose (randomly determined for each patient between 10 and 50 ). Then we compute the cumulated consumption ie sum of all previous days.
# initial parameters for simulating data
n_indiv <- 25
min_days <- 50
max_days <- 200
min_dose <- 10
max_dose <- 50

# list of ids
id_list <- str_c("i", 1:n_indiv)

# intializing empty table
my_data <- as.data.frame(matrix(NA, n_indiv, (max_days+1)))
colnames(my_data) <- c("id", str_c("d", 1:max_days))
my_data$id <- id_list

# daily simulated data
set.seed(113)
for(i in 1:nrow(my_data)){
# n days follow up
n_days <- round(runif(1, min_days, max_days))
# maximum dose
dose <- round(runif(1, min_dose, max_dose))
# random daily value
my_data[i,2:ncol(my_data)] <- c(runif(n_days, 0, max_dose), rep(NA,(max_days-n_days)))
}

# cumulative simulated data
my_cum_data <- my_data
for(i in 3:ncol(my_cum_data)){
my_cum_data[[i]] <- my_cum_data[[i]] + my_cum_data[[i-1]] 
}
Our plots use the legend.col function found here: https://aurelienmadouasse.wordpress.com/2012/01/13/legend-for-a-continuous-color-scale-in-r/ Here is the longitudinal heat plot function:
 # Color legend on the top
long_heat_plot <- function(my_data, cutoff, xmax) {
# my_data: longitudinal data with one line per individual, 1st column with id, and then one column per unit of time
# cutoff: cutoff value for color plot, all values above cutoff are same color
# xmax: x axis max value
n_lines <- nrow(my_data)
line_count <- 1
# color scale
COLS <- rev(heat.colors(cutoff))
# plotting area
par(oma=c(1,1,4,1), mar=c(2, 2, 2, 4), xpd=TRUE)
# plot init
plot(1,1,xlim=c(0,xmax), ylim=c(1, n_lines), pch='.', ylab='Individual',xlab='Time unit',yaxt='n', cex.axis = 0.8)
# plot line for each woman one at a time
for (i in 1 : n_lines) { 
# get id
id1 <- my_data$id[i]
# get trajectory data maxed at max_val
id_traj <- my_data[my_data$id == id1, 2:ncol(my_data)]
# get last day
END <- max(which(!is.na(id_traj)))
# plot dotted line
x1 <- 1:xmax
y1 <- rep(i,xmax)
lines(x1, y1, lty=3)
for (j in 1 : (ncol(my_data) - 1)) {
# trim traj to max val
val <- min(id_traj[j], cutoff)
# plot traj
points(j, i, col=COLS[val], pch=20, cex=1)
}
# add limit line
points((END+1), i, pch="|", cex=0.9)
}
# add legend
legend.col(col = COLS, lev = 1:cutoff)
mtext(side=3, line = 3, 'unit of measurement')
opar <- par()
text_size <- opar$cex.main
}
Then we generate the corresponding plots:
 long_heat_plot(my_data, 50, 200) 

 long_heat_plot(my_cum_data, 5000, 200) 
And here is what we did in our study:

Stencila – an office suite for reproducible research

Stencila launches the first version of its (open source) word processor and spreadsheet editor designed for researchers.

By Michael Aufreiter, Substance, and Aleksandra Pawlik and Nokome Bentley, Stencila

Stencila is an open source office suite designed for researchers. It allows the authoring of interactive, data-driven publications in visual interfaces, similar to those in conventional office suites, but is built from the ground up for reproducibility.

Stencila aims to make it easier for researchers with differing levels of computational skills to collaborate on the same research article. Researchers used to tools like Microsoft Word and Excel will find Stencila’s interfaces intuitive and familiar. And those who use tools such as Jupyter Notebook or R Markdown are still able to embed code for data analysis within their research articles. Once published, Stencila documents are self-contained, interactive and reusable, containing all the text, media, code and data needed to fully support the narrative of research discovery.


Source: https://stenci.la

The Stencila project aims to be part of the wider vision to enable the next generation of research article – all the way from authoring through to publication as a reproducible, self-contained webpage. A key limitation of the current research publishing process is that conventional document formats (e.g. Word, PDF and LaTeX) do not support the inclusion of reproducible research elements, nor do they produce content in the structured format used for science publishing and dissemination (XML). Stencila aims to remove the need for manual conversion of content from source documents to XML and web (HTML) publishing formats, whilst enabling the inclusion of source data and computational methods within the manuscript. We hope that establishing a digital-first, reproducible archive format for publications will facilitate research communication that is faster and more open, and which lowers the barrier for collaboration and reuse. The development of Stencila is driven by community needs and in coordination with the goals of the Reproducible Document Stack, an initiative started by eLife, Substance and Stencila.

A word processor for creating journal-ready scientific manuscripts

Stencila’s article editor builds on Texture, an open source editor built for visually editing JATS XML documents (a standard widely used by scientific journals). Supporting all elements of a standardised research article, the editor features semantic content-oriented editing that allows the user to focus on the research without worrying about layout information, which is normally stripped during the publishing process. While Texture implements all static elements (abstract, figures, references, citations and so on), Stencila extends Texture with code cells which enable computed, data-driven figures.

Spreadsheets for source data and analysis

In Stencila, datasets are an integral part of the publication. They live as individual spreadsheet documents holding structured data. This data can then be referenced from the research article to drive analysis and plots. As within Excel, cells can contain formulas and function calls to run computations directly in a spreadsheet. But not only can users enter simple expressions, they can also add and execute code in a variety of supported programming languages (at the moment R, Python, SQL and Javascript).

A walk-through of some of the features of Stencila, using this Stencila Article. Source: YouTube; video CC-BY Stencila.

Code evaluation in the browser and beyond

Stencila’s user interfaces build on modern web technology and run entirely in the browser – making them available on all major operating systems. The predefined functions available in Stencila use Javascript for execution so they can be run directly in the editor. For example, the plotly() function generates powerful, interactive visualizations solely using Plotly’s Javascript library.



Stencila can also connect to R, Python and SQL sessions, allowing more advanced data analysis and visualization capabilities. Stencila’s execution engine keeps a track of the dependency between code cells, enabling a reactive, spreadsheet-like programming experience both in Stencila Articles and Sheets.

An example of using R within a Stencila Sheet. Source: YouTube; video CC-BY Stencila.

Reproducible Document Archive (Dar)

Stencila stores projects in an open file archive format called Dar. A Dar is essentially a folder with a number of files encompassing the manuscript itself (usually one XML per document) and all associated media.



The Dar format is open source: inspect it and provide feedback at https://github.com/substance/dar

Dar uses existing standards when possible. For instance, articles are represented as JATS XML, the standard preferred by a number of major publishers. The Dar format is a separate effort from Stencila, and aims to establish a strict standard for representing self-contained reproducible publications, which can be submitted directly to publishers. Any other tool should be able to easily read and write such archives, either by supporting it directly or by implementing converters.

Interoperability with existing tools and workflows

Stencila is developed not to replace existing tools, but to complement them. Interoperability is at the heart of the project, with the goal of supporting seamless collaboration between users of Jupyter Notebooks, R Markdown and spreadsheet applications. We are working closely with the communities of existing open source tools to improve interoperability. For instance, we are working with the Jupyter team on tools to turn notebooks into journal submissions. We are also evaluating whether the Stencila editor could be used as another interface to edit Jupyter Notebooks or R Markdown files: we hope this could help researchers who use existing tools to collaborate with peers who are used to other office tools, such as Word and Excel, and thus encourage wider adoption of reproducible computational research practises.

State of development

Over the past two years, we’ve built Stencila from the ground up as a set of modular components that support community-driven open standards for publishing and computation. Stencila Desktop is our prototype of a ‘researcher’s office suite’, built by combining these components into an integrated application. During this beta phase of the project, we are working to address bugs and add missing features, and welcome your feedback and suggestions (see below).

One of our next priorities will be to develop a toolset for generating a web page from a reproducible article in the Dar format. Using progressive enhancement, the reader should be able to reproduce a scientific article right from the journal’s website in various forms, ranging from a traditional static representation of the manuscript and its figures to a fully interactive, executable publication.

We will continue working on Stencila’s various software components, such as the converter module and execution contexts for R and Python, towards improved integration and interoperability with other tools in the open science toolbox (e.g. Jupyter, RStudio and Binder).

Get involved

We’d love to get your input to help shape Stencila. Download Stencila Desktop and take it for a test drive. You could also try porting an existing manuscript over to Stencila using the Stencila command line tool. Give us your feedback and contribute ideas on our community forum or in our chat channel, or drop us an email at [email protected] or [email protected].

Acknowledgments

Development of Stencila has been generously supported by the Alfred P. Sloan Foundation and eLife.

This post was originally published on eLife Labs.

eLife welcomes comments, questions and feedback. Please annotate publicly on the article or contact us at innovation [at] elifesciences [dot] org.

Lyric Analysis with NLP and Machine Learning using R: Part One – Text Mining

June 22
By Debbie Liske

This is Part One of a three part tutorial series originally published on the DataCamp online learning platform in which you will use R to perform a variety of analytic tasks on a case study of musical lyrics by the legendary artist, Prince. The three tutorials cover the following:


Musical lyrics may represent an artist’s perspective, but popular songs reveal what society wants to hear. Lyric analysis is no easy task. Because it is often structured so differently than prose, it requires caution with assumptions and a uniquely discriminant choice of analytic techniques. Musical lyrics permeate our lives and influence our thoughts with subtle ubiquity. The concept of Predictive Lyrics is beginning to buzz and is more prevalent as a subject of research papers and graduate theses. This case study will just touch on a few pieces of this emerging subject.



Prince: The Artist

To celebrate the inspiring and diverse body of work left behind by Prince, you will explore the sometimes obvious, but often hidden, messages in his lyrics. However, you don’t have to like Prince’s music to appreciate the influence he had on the development of many genres globally. Rolling Stone magazine listed Prince as the 18th best songwriter of all time, just behind the likes of Bob Dylan, John Lennon, Paul Simon, Joni Mitchell and Stevie Wonder. Lyric analysis is slowly finding its way into data science communities as the possibility of predicting “Hit Songs” approaches reality.

Prince was a man bursting with music – a wildly prolific songwriter, a virtuoso on guitars, keyboards and drums and a master architect of funk, rock, R&B and pop, even as his music defied genres. – Jon Pareles (NY Times)
In this tutorial, Part One of the series, you’ll utilize text mining techniques on a set of lyrics using the tidy text framework. Tidy datasets have a specific structure in which each variable is a column, each observation is a row, and each type of observational unit is a table. After cleaning and conditioning the dataset, you will create descriptive statistics and exploratory visualizations while looking at different aspects of Prince’s lyrics.

Check out the article here!




(reprint by permission of DataCamp online learning platform)

Anomaly Detection in R

The World of Anomalies

Imagine you are a credit card selling company and you know about a particular customer who makes a purchase of 25$ every week. You guessed this purchase is his fixed weekly rations but one day, this customer makes a different purchase of 700$. This development will not just startle you but also compel you to talk to the customer and find out the reason so you can approve the transaction. This is because, the behavior of the customer had become fixed and the change was so different that it was not expected. Hence we call this event as an anomaly.

Anomalies are hard to detect because they can also be real phenomenon. Let’s say that the customer in the example above made the usual purchases while he was living alone and will be starting his family this week. This will mean that this should be the first of his future purchases of similar magnitude or he is throwing a party this week and this was a one-time large purchase. In all these cases, the customer will be classified as making an ‘abnormal’ choice. We as the credit card seller need to know which of these cases are genuine and which are mistakes which can be corrected if they reconfirm the same with the customer. The usefulness of detecting such anomalies are very useful especially in BFSI industry with the primary use in credit card transactions. Such anomalies can be signs of fraud or theft. Someone making multiple transactions of small amounts from the same credit card, making one very large transaction which is a few order of magnitudes larger than the average, making transactions from an unfamiliar location are such examples that can caused by fraudsters and must be caught. With the popularity of adoption, let’s study the ways we can detect anomalies.

Detecting The Pattern To Find Anomalies

Anomalies are essentially the outliers in our data. If something happens regularly, it is not an anomaly but a trend. Things which happen once or twice and are deviant from the usual behavior, whether continuously or with lags are all anomalies. So it all boils down to the definition of outliers for our data. R provides a lot of packages with different approaches to anomaly detection. We will use the AnomalyDetection package in R to understand the concept of anomalies using one such method. However, the package needs to be installed specially from github. This requires the install_github() function in devtools package. We will also use the Rcpp package which helps us to integrate R with C++ functions. Another github package to be used in this article is the wikipedia trend package which contains the API to access wikipedia and create data for anomaly detection analysis.

The package is capable of identifying outliers in the presence of seasonality and trend in the data. The package uses an algorithm known as Seasonal Hybrid ESD algorithm which finds outliers globally as well as locally in time series or a vector of data. The package has a lot of features, some of which include visualization graphs, type of anomalies (positive or negative) and specifying the window of interest.
#Install the devtools package then github packages
install.packages("devtools")
install.packages("Rcpp")
library(devtools)
install_github("petermeissner/wikipediatrend")
install_github("twitter/AnomalyDetection")

#Loading the libraries
library(Rcpp)
library(wikipediatrend)
library(AnomalyDetection)
The first step is data preparation. We will use the page views on wikipedia page marked on fifa data starting from date 18th March 2013. (link: https://en.wikipedia.org/wiki/FIFA). The wp_trend function gives us the access statistics for the page with the ability to filter data from within the function. We will use this data to model day wise page views and understand anomalies in the pattern of those view numbers
#Download wikipedia webpage "fifa" 
fifa_data_wikipedia = wp_trend("fifa", from="2013-03-18", lang = "en")
This gives us a dataset of about 1022 observations and 8 columns. Looking at the data reveals some redundant information captured
#First_look
fifa_data_wikipedia
   project   language article access     agent      granularity date       views
197 wikipedia en       Fifa    all-access all-agents daily       2016-01-13 116  
546 wikipedia en       Fifa    all-access all-agents daily       2016-12-27  64  
660 wikipedia en       Fifa    all-access all-agents daily       2017-04-20 100  
395 wikipedia en       Fifa    all-access all-agents daily       2016-07-29  70  
257 wikipedia en       Fifa    all-access all-agents daily       2016-03-13  75  
831 wikipedia en       Fifa    all-access all-agents daily       2017-10-08 194  
229 wikipedia en       Fifa    all-access all-agents daily       2016-02-14  84  
393 wikipedia en       Fifa    all-access all-agents daily       2016-07-27 140  
293 wikipedia en       Fifa    all-access all-agents daily       2016-04-18 105  
420 wikipedia en       Fifa    all-access all-agents daily       2016-08-23 757
We see that project, language, article, access, agent and granularity appear to be same for all rows and are irrelevant for us. We are only concerned with date and views as the features to work on. Let’s plot the views against date
#Plotting data
library(ggplot2)
ggplot(fifa_data_wikipedia, aes(x=date, y=views, color=views)) + geom_line()
Anomaly detection
We see some huge spikes at different intervals. There are a lot of anomalies in this data. Before we process them further, let’s keep only the relevant columns.
# Keep only date & page views and discard all other variables
columns_to_keep=c("date","views")
fifa_data_wikipedia=fifa_data_wikipedia[,columns_to_keep]
We will now perform anomaly detection using Seasonal Hybrid ESD Test. The technique maps data as a series and captures seasonality while pointing out data which does not follow the seasonality pattern. The AnomalyDetectionTs() function finds the anomalies in the data. It will basically narrow down all the peaks keeping in mind that not more than 10% of data can be anomalies (by default). We can also reduce this number by changing the max_anoms parameter in the data. We can also specify which kind of anomalies are to be identified using the direction parameter. Here, we are going to specify only positive direction anomalies to be identified. That means that sudden dips in the data are not considered.
#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction="pos", plot=TRUE)
anomalies$plot
Anomaly Detection 2

Our data has 5.68% anomalies in positive direction if we take a level of significance (alpha) to be 95%. Since we had a total of 1022 observations, 5.68% of the number is about 58 observations. We can look at the specific dates which are pointed out by the algorithm.
# Look at the anomaly dates
anomalies$anoms
   timestamp anoms
1  2015-07-01   269
2  2015-07-02   233
3  2015-07-04   198
4  2015-07-05   330
5  2015-07-06   582
6  2015-07-07   276
7  2015-07-08   211
8  2015-07-09   250
9  2015-07-10   198
10 2015-07-20   315
11 2015-07-21   209
12 2015-07-25   202
13 2015-07-26   217
14 2015-09-18   278
15 2015-09-25   234
16 2015-09-26   199
17 2015-10-03   196
18 2015-10-07   242
19 2015-10-08   419
20 2015-10-09   240
21 2015-10-11   204
22 2015-10-12   223
23 2015-10-13   237
24 2015-10-18   204
25 2015-10-28   213
26 2015-12-03   225
27 2015-12-21   376
28 2015-12-22   212
29 2016-02-24   240
30 2016-02-26   826
31 2016-02-27   516
32 2016-02-29   199
33 2016-04-04   330
34 2016-05-13   217
35 2016-05-14   186
36 2016-06-10   196
37 2016-06-11   200
38 2016-06-12   258
39 2016-06-13   245
40 2016-06-14   204
41 2016-06-22   232
42 2016-06-27   273
43 2016-06-28   212
44 2016-07-10   221
45 2016-07-11   233
46 2016-08-22   214
47 2016-08-23   757
48 2016-08-24   244
49 2016-09-18   250
50 2016-09-19   346
51 2017-01-10   237
52 2017-03-29   392
53 2017-06-03   333
54 2017-06-21   365
55 2017-10-08   194
56 2017-10-09   208
57 2017-10-11   251
58 2017-10-14   373
We have the exact dates and the anomaly values for each date. In a typical anomaly detection process, each of these dates are looked case by case and the reason for anomalies is identified. For instance, the page views can be higher on these dates if there had been fifa matches or page updates on these particular days. Another reason could be big news about fifa players. However, if page views on any of the dates does not correspond to any special event, then those days are true anomalies and should be flagged. In other situations such as credit card transactions, such anomalies can indicate fraud and quick action must be taken on identification.

The ‘Anomaly Way’

Anomalies are a kind of outlier so SH-ESD (Seasonal Hybrid ESD) is not the only way to detect anomalies. Moreover, ‘AnomalyDetection’ is not the only package we will look upon. Let’s try the anomalize package which is available in CRAN. However, it is always recommended to update the package using github as the owners keep the most recent package versions there and it takes time and testing for the changes to move into standard repositories such as CRAN. We will first install the package from CRAN so that the dependencies are also installed then update the package using devtools
#Installing anomalize
install.packages('anomalize')
#Update from github
library(devtools)
install_github("business-science/anomalize")
#Load the package
library(anomalize)
# We will also use tidyverse package for processing and coindeskr to get bitcoin data
library(tidyverse)
library(coindeskr)
I am also using the tidyverse package (Link) and coindeskr package (Link). The coindeskr package is used to download the bitcoin data and tidyverse is used for speedy data processing. We will now download bitcoin data from 1st January 2017
#Get bitcoin data from 1st January 2017
bitcoin_data <- get_historic_price(start = "2017-01-01")
This data indicates the price per date. Let’s convert it into a time series
#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>% rownames_to_column() %>% as.tibble() %>% mutate(date = as.Date(rowname)) %>% select(-one_of('rowname'))
In the time series conversion, we are actually converting the data to a tibble_df which the package requires. We could have alternatively converted the data into tibbletime object. Since it is a time series now, we should also see the seasonality and trend patterns in the data. It is important to remove them so that anomaly detection is not affected. We will now decompose the series. We will also plot the series

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality
bitcoin_data_ts %>% time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()
Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days
Anomaly Detection 2
We have some beautiful plots with the first plot being overall observed data, second being season, third being trend and the final plot analyzed for anomalies. The red points indicate anomalies according to the anomalize function. However, we are not looking for this plot. We only want the anomalies plot with trend and seasonality removed. Let’s plot the data again with recomposed data. This can be done by setting the time_recomposed() function

#Plot the data again by recomposing data
bitcoin_data_ts %>% time_decompose(Price) %>% anomalize(remainder) %>% time_recompose() %>%  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days
Anomaly Detection 3
This is a better plot and shows the anomalies. We all know how bitcoin prices shot up in 2018. The grey portion explains the expected trend. Let’s see what these red points are.
#Extract the anomalies
anomalies=bitcoin_data_ts %>% time_decompose(Price) %>%  anomalize(remainder) %>%  time_recompose() %>%  filter(anomaly == 'Yes')
Now the anomalies dataset consists of the data points which were identified as anomalies by the algorithm

Conclusion: Are You An Anomaly?

We have twitter’s anomaly detection package based on Seasonal Hybrid ESD (SH-ESD) as well as CRAN’s anomaly detection package based on factor analysis, Mahalanobis distance, Horn’s parallel analysis or Principal component analysis. We also have TsOutliers package and anomalize packages in R. There are a lot more packages than one could find in R. They all have the same concept but differ in the underlying algorithm which they use to detect anomaly. Hence, one can get a general idea from all such packages: anomalies are data points which do not follow the general trend or do not lie under the expected behavior of the rest of the data. The next question which is raised is the criteria for a data point to be following expected behavior. The rest of the data points are all anomalies. One can also have varying types of anomalies such as direction based anomalies as described by the anomaly detection package (positive or negative) or anomalies not following events such as matches in fifa data. One can similarly pitch in another logic for anomaly classification and treat them accordingly.

Here is the entire code used in this article

#Install the devtools package then github packages
install.packages("devtools")
install.packages("Rcpp")
library(devtools)
install_github("petermeissner/wikipediatrend")
install_github("twitter/AnomalyDetection")

#Loading the libraries
library(Rcpp)
library(wikipediatrend)
library(AnomalyDetection)

# Download wikipedia webpage "fifa"
fifa_data_wikipedia = wp_trend("fifa", from="2013-03-18", lang = "en")
#First_look
fifa_data_wikipedia

# Plotting data
library(ggplot2)
ggplot(fifa_data_wikipedia, aes(x=date, y=views, color=views)) + geom_line()

# Keep only date & page views and discard all other variables
columns_to_keep=c("date","views")
fifa_data_wikipedia=fifa_data_wikipedia[,columns_to_keep]

#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction="pos", plot=TRUE)
anomalies$plot

# Look at the anomaly dates
anomalies$anoms

#Installing anomalize
install.packages('anomalize')
#Update from github
library(devtools)
install_github("business-science/anomalize")
#Load the package
library(anomalize)
# We will also use tidyverse package for processing and coindeskr to get bitcoin data
library(tidyverse)
library(coindeskr)

#Get bitcoin data from 1st January 2017
bitcoin_data = get_historic_price(start = "2017-01-01")

#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>% rownames_to_column() %>% as.tibble() %>% mutate(date = as.Date(rowname)) %>% select(-one_of('rowname'))

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality
bitcoin_data_ts %>% time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()

#Plot the data again by recomposing data
bitcoin_data_ts %>% time_decompose(Price) %>% anomalize(remainder) %>% time_recompose() %>%  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

#Extract the anomalies
anomalies=bitcoin_data_ts %>% time_decompose(Price) %>%  anomalize(remainder) %>%  time_recompose() %>%  filter(anomaly == 'Yes')

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Potuganti, Saneesh Veetil and Chaitanya Sagar contributed to this article. Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.


Do Clustering by “Dimensional Collapse”

Problem

Image that someone in a bank want to find out whether some of bank’s credit card holders are acctually the same person, so according to his experience, he set a rule: the people share either the same address or the same phone number can be reasonably regarded as the same person. Just as the example:
library(tidyverse)
a <- data_frame(id = 1:16,
                addr = c("a", "a", "a", "b", "b", "c", "d", "d", "d", "e", "e", "f", "f", "g", "g", "h"),
                phone = c(130L, 131L, 132L, 133L, 134L, 132L, 135L, 136L, 137L, 136L, 138L, 138L, 139L, 140L, 141L, 139L),
                flag = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L))

head(a)

## id   addr    phone   flag
## 1    a   130 1
## 2    a   131 1
## 3    a   132 1
## 4    b   133 2
## 5    b   134 2
## 6    c   132 1 
In the dataframe a, the letters in column addr stand for address information, the numbers in column phone stand for phone numbers, and the integers in column flag is what he want: the CLUSTER flag which means "really" different persons.

The records in a bank In the above plot, each point stand for a "identity" who has a address which you can tell according to horizontal axis , and a phone number which you can see in vertical axis. The red dotted line present the "connections" betweent identities, which actually means the same address or phone number. So the wanted result is the blue rectangels to circle out different flags which reprent really different persons.

Goal

The "finding the same person" thing is typically a clustring process, and I am very sure there are pretty many ways to do it, Such as Disjoint-set data structure. But, I can not help thinking mayby we can make it in a simple way with R. that's my goal.

"Dimensional Collapse"

When I stared at the plot, I ask myself, why not map the x-axis information of the points to the very first one according to the y-axis "connections". When everything goes well and all done, all the grey points should be mapped along the red arrows to the first marks of the groups, and there should be only 4 marks leave on x-axis: a, b, d and g, instead of 9 marks in the first place. And the y-axis information, after contributing all the "connection rules", can be put away now, since the left x-axis marks are exactly what I want: the final flags. It is why I like to call it "Dimensional Collapse".
Furthermore, in order to take advantage of R properties, I also:
1. Treat both dimensions as integers by factoring them.
2. Use "integer subsetting" to map and collapse.
Dimensional Collapse
axis_collapse <- function(df, .x, .y) {
    .x <- enquo(.x)
    .y <- enquo(.y)
    
    # Turn the address and phone number into integers.
    df <- mutate(df,
                 axis_x = c(factor(!!.x)),
                 axis_y = c(factor(!!.y)))
    
    oldRule <- seq_len(max(df$axis_x))
    
    mapRule <- df %>%
      select(axis_x, axis_y) %>%
      group_by(axis_y) %>%
      arrange(axis_x, .by_group = TRUE) %>%
      mutate(collapse = axis_x[1]) %>%
      ungroup() %>%
      select(-axis_y) %>%
      distinct() %>%
      group_by(axis_x) %>%
      arrange(collapse, .by_group = TRUE) %>%
      slice(1) %>%  
      ungroup() %>%
      arrange(axis_x) %>%
      pull(collapse)
    
    # Use integer subsetting to collapse x-axis.
    # In case of indirect "connections", we should do it recursively.
    while (TRUE) {
        newRule <- mapRule[oldRule]
        if(identical(newRule, oldRule)) {
            break
        } else {
            oldRule <- newRule
        }
    }
    
    df <- df %>%
      mutate(flag = newRule[axis_x],
             flag = c(factor(flag))) %>%
      select(-starts_with("axis_"))
    
    df
}
Let see the result.
a %>%
  rename(flag_t = flag) %>% 
  axis_collapse(addr, phone) %>%
  mutate_at(.vars = vars(addr:flag), factor) %>%
  ggplot(aes(factor(addr), factor(phone), shape = flag_t, color = flag)) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", shape = "Target Flag:", color = "Cluster Flag:")
The Clustering result compared to the target Not bad so far.

Calculation Complexity

Let make a simple test about time complexity.
test1 <- data_frame(addr = sample(1:1e4, 1e4), phone = sample(1:1e4, 1e4))
test2 <- data_frame(addr = sample(1:1e5, 1e5), phone = sample(1:1e5, 1e5))

bm <- microbenchmark::microbenchmark(n10k = axis_collapse(test1, addr, phone),
                                     n100k = axis_collapse(test2, addr, phone),
                                     times = 30)

summary(bm)

## expr min lq  mean    median  uq  max neval   cld
## n10k     249.2172    259.918     277.0333    266.9297    279.505     379.4292    30  a
## n100k    2489.1834   2581.731    2640.9394   2624.5741   2723.390    2839.5180   30  b 
It seems that the growth of consumed time is in linear relationship with data increase holding the other conditions unchanged. That is acceptable.

More Dimensions?

To me, since this method collapse one dimension by transfering their clustering information to the other dimension, so the method should can be used resursively on more than 2 dimensions. But I am not 100% sure. Let do a simple test.
a %>%
  # I deliberately add a column which connect group 2 and 4 only.
  mutate(other = c(LETTERS[1:14], "D", "O")) %>%
  # use axis_collapse recursively
  axis_collapse(other, phone) %>%
  axis_collapse(flag, addr) %>%
  ggplot(aes(x = factor(addr), y = factor(phone), color = factor(flag))) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", color = "Cluster Flag:")
Dimensional Collapse when more than 2 dimensions

Decision Modelling in R Workshop in The Netherlands!

The Decision Analysis in R for Technologies in Health (DARTH) workgroup is hosting a two-day workshop on decision analysis in R in Leiden, The Netherlands from June 7-8, 2018. A one-day introduction to R course will also be offered the day before the workshop, on June 6th. Decision models are mathematical simulation models that are increasingly being used in health sciences to simulate the impact of policy decisions on population health. New methodological techniques around decision modeling are being developed that rely heavily on statistical and mathematical techniques. R is becoming increasingly popular in decision analysis as it provides a flexible environment where advanced statistical methods can be combined with decision models of varying complexity. Also, the fact that R is freely available improves model transparency and reproducibility. The workshop will guide participants on building probabilistic decision trees, Markov models and microsimulations, creating publication-quality tabular and graphical output, and will provide a basic introduction to value of information methods and model calibration using R. For more information and to register, please visit: http://www.cvent.com/d/mtqth1

Exploratory Factor Analysis in R

Changing Your Viewpoint for Factors

In real life, data tends to follow some patterns but the reasons are not apparent right from the start of the data analysis. Taking a common example of a demographics based survey, many people will answer questions in a particular ‘way’. For example, all married men will have higher expenses than single men but lower than married men with children. In this case, the driving factor which makes them answer following a pattern is the economic status but these answers may also depend on other factors such as level of education, salary and locality or area. It becomes complicated to assign answers related to multiple factors. One option is to map variables or answers to one of the factors. This process has a lot of drawbacks such as the requirement to ‘guess’ the number of factors, heuristic based or biased manual assignment and non-consideration of influence of other factors to the variable. We have variables and answers in the data defined in a way such that we can understand them and interpret. What if we change our view lens in an alternate reality where all variables are automatically mapped into different new categories with customized weight based on their influence on that category. This is the idea behind factor analysis.

Creating Meaningful Factors

Factor analysis starts with the assumption of hidden latent variables which cannot be observed directly but are reflected in the answers or variables of the data. It also makes the assumption that there are as many factors as there are variables. We then transform our current set of variables into an equal number of variables such that each new variable is a combination of the current ones in some weightage. Hence, we are not essentially adding or removing information in this step but only transforming it. A typical way to make this transformation is to use eigenvalues and eigenvectors. We transform our data in the direction of each eigenvector and represent all the new variables or ‘factors’ using the eigenvalues. An eigenvalue more than 1 will mean that the new factor explains more variance than one original variable. We then sort the factors in decreasing order of the variances they explain. Thus, the first factor will be the most influential factor followed by the second factor and so on. This also helps us think of variable reduction by removing the last few factors. Typically we take factors which collectively explain 90-99% of the variance depending upon the dataset and problem. Thus, there is no need to guess the number of factors required for the data. However, we need to read into each of the factors and the combination of original features out of which they are made of to understand what they represent.

Loading Factors With Factor Loadings

As I mentioned, even after transformation, we retain the weight of each original feature that were combined to obtain the factor. These weights are known as ‘factor loadings’ and help understand what the factors represent. Let’s take a cooked up example of factor loadings for an airlines survey. Let’s say the table looks like this:

Perceptive Analytics

I took 10 features originally so it should generate 10 factors. Let’s say our first three factors are as shown in the table. Looking at the values of the factors, the first factor may represent customer experience post on-boarding. The second factor reflects the airline booking experience and related perks. The third factor shows the flight competitive advantage of the airline compared to its competition. We also have a 10th feature which is negatively affecting the second factor (which seems to make sense as a loyal customer will book flights even if they are no longer economic or don’t have as great frequent flyer programs). Thus, we can now understand that the top most important factors for customers filling the survey are customer experience, booking experience and competitive advantage. However, this understanding needs to be developed manually by looking at the loadings. It is the factor loadings and their understanding which are the prime reason which makes factor analysis of such importance followed by the ability to scale down to a few factors without losing much information.

Exploration – How Many Factors?

Factor analysis can be driven by different motivations. One objective of factor analysis can be verifying with the data with what you already know and think about it. This requires a prior intuition on the number of important factors after which the loadings will be low overall as well as an idea of the loadings of each original variable in those factors. This is the confirmatory way of factor analysis where the process is run to confirm with understanding of the data. A more common approach is to understand the data using factor analysis. In this case, you perform factor analysis first and then develop a general idea of what you get out of the results. How do we stop at a specific number of factor in factor analysis when we are exploring? We use the scree plot in this case. The scree plot maps the factors with their eigenvalues and a cut-off point is determined wherever there is a sudden change in the slope of the line.

Going Practical – The BFI Dataset in R

Let’s start with a practical demonstration of factor analysis. We will use the Psych package in R which is a package for personality, psychometric, and psychological research. It consists a dataset – the bfi dataset which represents 25 personality items with 3 additional demographics for 2800 data points. The columns are already classified into 5 factors thus their names start with letters A (Agreeableness), C (Conscientiousness), E (Extraversion), N (Neuroticism) and O (Openness). Let’s perform factor analysis to see if we really have the same association of the variables with each factor.
#Installing the Psych package and loading it
install.packages("psych")
library(psych)
#Loading the dataset
bfi_data=bfi
Though we have NA values in our data which need to be handled but we will not perform much processing on the data. To make things simple, we will only take those data points where there are no missing values.
#Remove rows with missing values and keep only complete cases
bfi_data=bfi_data[complete.cases(bfi_data),]
This leaves us with 2236 data points down from 2800 which means a reduction of 564 data points. Since 2236 is still a plausible number of data points, we can proceed further. We will use the fa() function for factor analysis and need to calculate the correlation matrix for the function
#Create the correlation matrix from bfi_data
bfi_cor <- cor(bfi_data)
The fa() function needs correlation matrix as r and number of factors. The default value is 1 which is undesired so we will specify the factors to be 6 for this exercise.
#Factor analysis of the data
factors_data <- fa(r = bfi_cor, nfactors = 6)
#Getting the factor loadings and model analysis
factors_data
Factor Analysis using method =  minres
Call: fa(r = bfi_cor, nfactors = 6)
Standardized loadings (pattern matrix) based upon correlation matrix
            MR2   MR3   MR1   MR5   MR4   MR6    h2   u2 com
A1         0.11  0.07 -0.07 -0.56 -0.01  0.35 0.379 0.62 1.8
A2         0.03  0.09 -0.08  0.64  0.01 -0.06 0.467 0.53 1.1
A3        -0.04  0.04 -0.10  0.60  0.07  0.16 0.506 0.49 1.3
A4        -0.07  0.19 -0.07  0.41 -0.13  0.13 0.294 0.71 2.0
A5        -0.17  0.01 -0.16  0.47  0.10  0.22 0.470 0.53 2.1
C1         0.05  0.54  0.08 -0.02  0.19  0.05 0.344 0.66 1.3
C2         0.09  0.66  0.17  0.06  0.08  0.16 0.475 0.53 1.4
C3         0.00  0.56  0.07  0.07 -0.04  0.05 0.317 0.68 1.1
C4         0.07 -0.67  0.10 -0.01  0.02  0.25 0.555 0.45 1.3
C5         0.15 -0.56  0.17  0.02  0.10  0.01 0.433 0.57 1.4
E1        -0.14  0.09  0.61 -0.14 -0.08  0.09 0.414 0.59 1.3
E2         0.06 -0.03  0.68 -0.07 -0.08 -0.01 0.559 0.44 1.1
E3         0.02  0.01 -0.32  0.17  0.38  0.28 0.507 0.49 3.3
E4        -0.07  0.03 -0.49  0.25  0.00  0.31 0.565 0.44 2.3
E5         0.16  0.27 -0.39  0.07  0.24  0.04 0.410 0.59 3.0
N1         0.82 -0.01 -0.09 -0.09 -0.03  0.02 0.666 0.33 1.1
N2         0.83  0.02 -0.07 -0.07  0.01 -0.07 0.654 0.35 1.0
N3         0.69 -0.03  0.13  0.09  0.02  0.06 0.549 0.45 1.1
N4         0.44 -0.14  0.43  0.09  0.10  0.01 0.506 0.49 2.4
N5         0.47 -0.01  0.21  0.21 -0.17  0.09 0.376 0.62 2.2
O1        -0.05  0.07 -0.01 -0.04  0.57  0.09 0.357 0.64 1.1
O2         0.12 -0.09  0.01  0.12 -0.43  0.28 0.295 0.70 2.2
O3         0.01  0.00 -0.10  0.05  0.65  0.04 0.485 0.52 1.1
O4         0.10 -0.05  0.34  0.15  0.37 -0.04 0.241 0.76 2.6
O5         0.04 -0.04 -0.02 -0.01 -0.50  0.30 0.330 0.67 1.7
gender     0.20  0.09 -0.12  0.33 -0.21 -0.15 0.184 0.82 3.6
education -0.03  0.01  0.05  0.11  0.12 -0.22 0.072 0.93 2.2
age       -0.06  0.07 -0.02  0.16  0.03 -0.26 0.098 0.90 2.0

                       MR2  MR3  MR1  MR5  MR4  MR6
SS loadings           2.55 2.13 2.14 2.03 1.79 0.87
Proportion Var        0.09 0.08 0.08 0.07 0.06 0.03
Cumulative Var        0.09 0.17 0.24 0.32 0.38 0.41
Proportion Explained  0.22 0.18 0.19 0.18 0.16 0.08
Cumulative Proportion 0.22 0.41 0.59 0.77 0.92 1.00

 With factor correlations of 
      MR2   MR3   MR1   MR5   MR4   MR6
MR2  1.00 -0.18  0.24 -0.05 -0.01  0.10
MR3 -0.18  1.00 -0.23  0.16  0.19  0.04
MR1  0.24 -0.23  1.00 -0.28 -0.19 -0.15
MR5 -0.05  0.16 -0.28  1.00  0.18  0.17
MR4 -0.01  0.19 -0.19  0.18  1.00  0.05
MR6  0.10  0.04 -0.15  0.17  0.05  1.00

Mean item complexity =  1.8
Test of the hypothesis that 6 factors are sufficient.

The degrees of freedom for the null model are  378  and the objective function was  7.79
The degrees of freedom for the model are 225  and the objective function was  0.57 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR2  MR3  MR1  MR5  MR4  MR6
Correlation of (regression) scores with factors   0.93 0.89 0.89 0.88 0.86 0.77
Multiple R square of scores with factors          0.86 0.79 0.79 0.77 0.74 0.59
Minimum correlation of possible factor scores     0.72 0.57 0.58 0.54 0.49 0.18
The factor loadings show that the first factor represents N followed by C,E,A and O. This means most of the members in the data have Neuroticism in the data. We also notice that the first five factors adequately represent the factor categories as the data is meant for.

Conclusion: A Deeper Insight

As apparent from the bfi survey example, factor analysis is helpful in classifying our current features into factors which represent hidden features not measured directly. It also has an additional advantage of helping reduce our data into a smaller set of features without losing much information. There are a few things to keep in mind before putting factor analysis into action. The first is about the values of factor loadings. We may have datasets where the factor loadings for all factors are low – lower than 0.5 or 0.3. While a factor loading lower than 0.3 means that you are using too many factors and need to re-run the analysis with lesser factors. A range of loadings around 0.5 is satisfactory but indicates poor predicting ability. You should later keep thresholds and discard factors which have a loading lower than the threshold for all features. Factor analysis on dynamic data can also be helpful in tracking changes in the nature of data. In case the data changes significantly, the number of factors in exploratory factor analysis will also change and indicate you to look into the data and check what changes have occurred. The final one of importance is the interpretability of factors. In case you are unable to understand or explain the factor loadings, you are either using a very granular or very generalized set of factors. In this case, you need to find the right number of factors and obtain loadings to features which are both interpretable and beneficial for analysis. There can be a variety of other situations that may occur with factor analysis and are all subject to interpretation.

Keep learning and here is the entire code used in this article.

#Installing the Psych package and loading it
install.packages("psych")
library(psych)
#Loading the dataset
bfi_data=bfi

#Remove rows with missing values and keep only complete cases
bfi_data=bfi_data[complete.cases(bfi_data),]

#Create the correlation matrix from bfi_data
bfi_cor <- cor(bfi_data)

#Factor analysis of the data
factors_data <- fa(r = bfi_cor, nfactors = 6)
#Getting the factor loadings and model analysis
factors_data

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Sai Ram, Saneesh Veetil and Chaitanya Sagar contributed to this article.

Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.