
Dealing with The Problem of Multicollinearity in R


Imagine a situation where you are asked to predict the tourism revenue for a country, let’s say India. In this case, your output or dependent or response variable will be total revenue earned (in USD) in a given year. But, what about independent or predictor variables?

You have been provided with two sets of predictor variables and you have to choose one of the sets to predict your output. The first set consists of three variables:

The second set also consists of three variables:

Which of the two sets do you think provides us more information in predicting our output?

I am sure you will agree that the second set provides more information for predicting the output, because its three variables are distinct from each other and each provides different information (we can infer this intuitively at this point). Moreover, none of the three variables is directly derived from the other variables in the system. In other words, none of the variables is a linear combination of the other variables in the system.

In the first set of variables, only two variables provide relevant information; the third is nothing but a linear combination of the other two. Even if we developed a model without including this third variable, the information it carries would already be captured by the other two variables and reflected in their estimated coefficients.

This effect in the first set of variables is called multicollinearity: the variables in the first set are strongly correlated with each other (if not all of them, at least some variables are correlated with others). A model developed using the first set may not provide results as accurate as one built with the second set, because the first set is missing relevant variables/information. Therefore, it becomes important to study multicollinearity and the techniques to detect and tackle its effect in regression models.

According to Wikipedia, “Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, X1 and X2 are perfectly collinear if there exist parameters λ0 and λ1 such that, for all observations i, we have

X2i = λ0 + λ1 * X1i

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.”

We saw an example of exactly what the Wikipedia definition is describing.

Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, suppose you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, to the set of independent variables. The new variable Z does not add any information beyond what X and Y already provide; the model simply adjusts its coefficients to absorb this combination.
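As a quick illustration (a minimal sketch on made-up data, not from the original example), R's lm() detects an exactly collinear column, drops it, and reports its coefficient as NA:

# Hypothetical data: Z is an exact linear combination of X and Y
set.seed(1)
X <- rnorm(100)
Y <- rnorm(100)
Z <- 2 * X + 3 * Y                 # perfectly collinear with X and Y
outcome <- 1 + X + Y + rnorm(100)
summary(lm(outcome ~ X + Y + Z))   # the coefficient for Z is reported as NA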

Multicollinearity may arise from several factors. The inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. Another reason could be the use of derived variables, i.e., a variable computed from other variables in the system; this is similar to the example we took at the beginning of the article. A third reason could be including variables that are similar in nature, provide similar information, or have very high correlation with each other.
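For instance (a hypothetical sketch, not part of the original example), the classic dummy-variable trap arises when a dummy is included for every level of a categorical variable alongside the intercept, because the dummies then sum to one, exactly like the intercept column:

# Hypothetical dummy-variable trap: male + female always equals 1
sex <- factor(sample(c("male", "female"), 100, replace = TRUE))
male <- as.numeric(sex == "male")
female <- as.numeric(sex == "female")
y <- rnorm(100)
summary(lm(y ~ male + female))   # one dummy is dropped (NA coefficient)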

Multicollinearity may not pose a problem at the overall model level, but it strongly affects the individual variables and their predictive power: the standard errors of the affected coefficients become inflated, so you may not be able to identify which variables in your model are statistically significant. Moreover, you will be working with variables that provide similar information, i.e., variables that are redundant with respect to one another.

Now, let's move on to how to detect the presence of multicollinearity in the system.

There are multiple ways to detect the presence of multicollinearity among the independent or explanatory variables, such as examining pairwise correlations among the predictors, computing the variance inflation factor (VIF), and running formal tests such as the Farrar-Glauber test. The VIF of a predictor is based on the R-squared obtained by regressing that predictor on all the other predictors; the quantity 1 – R-squared is called the tolerance, so that

VIF = 1 / Tolerance = 1 / (1 – R-squared)

A VIF of over 10 indicates that the variable is highly correlated with the other predictors. Usually, a VIF of less than 4 is considered good for a model.
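As a small sketch of how this works in practice (it refers to the data1 data frame and the fit_model1 model created later in this post, so run it after those steps), the VIF of a single predictor can be computed by hand from an auxiliary regression, or all VIFs can be obtained at once with the vif() function from the 'car' package (the use of 'car' here is my addition, not part of the original post):

# VIF of Age computed by hand: regress Age on the other predictors
aux <- lm(Age ~ Education + South + Sex + Experience + Union + Race +
            Occupation + Sector + Marr, data = data1)
1 / (1 - summary(aux)$r.squared)   # VIF for Age

# Or compute VIFs for all predictors of a fitted model at once
# install.packages("car")
library(car)
vif(fit_model1)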

We will now use some of these techniques and try their implementation in R.

We will use the CPS_85_Wages data, which consists of a random sample of 534 persons from the CPS (Current Population Survey). The data provides information on wages and other characteristics of the workers (link: http://lib.stat.cmu.edu/datasets/CPS_85_Wages). You can go through the data details at the link provided.

Using this data, we will predict wages from the other variables.

> data1 = read.csv(file.choose(), header = T)
> head(data1)
  Education South Sex Experience Union  Wage Age Race Occupation Sector Marr
1         8     0   1         21     0  5.10  35    2          6      1    1
2         9     0   1         42     0  4.95  57    3          6      1    1
3        12     0   0          1     0  6.67  19    3          6      1    0
4        12     0   0          4     0  4.00  22    3          6      0    0
5        12     0   0         17     0  7.50  35    3          6      0    1
6        13     0   0          9     1 13.07  28    3          6      0    0
> str(data1)
'data.frame': 534 obs. of  11 variables:
 $ Education : int  8 9 12 12 12 13 10 12 16 12 ...
 $ South     : int  0 0 0 0 0 0 1 0 0 0 ...
 $ Sex       : int  1 1 0 0 0 0 0 0 0 0 ...
 $ Experience: int  21 42 1 4 17 9 27 9 11 9 ...
 $ Union     : int  0 0 0 0 0 1 0 0 0 0 ...
 $ Wage      : num  5.1 4.95 6.67 4 7.5 ...
 $ Age       : int  35 57 19 22 35 28 43 27 33 27 ...
 $ Race      : int  2 3 3 3 3 3 3 3 3 3 ...
 $ Occupation: int  6 6 6 6 6 6 6 6 6 6 ...
 $ Sector    : int  1 1 1 0 0 0 0 0 1 0 ...
 $ Marr      : int  1 1 0 0 1 0 0 0 1 0 ...

The above results show the sample view of data and the variables present in the data. Now, let’s fit the linear regression model and analyze the results.

> fit_model1 = lm(log(data1$Wage) ~ ., data = data1)
> summary(fit_model1)

Call:
lm(formula = log(data1$Wage) ~ ., data = data1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.16246 -0.29163 -0.00469  0.29981  1.98248 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.078596   0.687514   1.569 0.117291    
Education    0.179366   0.110756   1.619 0.105949    
South       -0.102360   0.042823  -2.390 0.017187 *  
Sex         -0.221997   0.039907  -5.563 4.24e-08 ***
Experience   0.095822   0.110799   0.865 0.387531    
Union        0.200483   0.052475   3.821 0.000149 ***
Age         -0.085444   0.110730  -0.772 0.440671    
Race         0.050406   0.028531   1.767 0.077865 .  
Occupation  -0.007417   0.013109  -0.566 0.571761    
Sector       0.091458   0.038736   2.361 0.018589 *  
Marr         0.076611   0.041931   1.827 0.068259 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4398 on 523 degrees of freedom
Multiple R-squared:  0.3185, Adjusted R-squared:  0.3054 
F-statistic: 24.44 on 10 and 523 DF,  p-value: < 2.2e-16

The linear regression results show that the model is statistically significant: the F-statistic is large and the p-value for the model is less than 0.05. However, on closer examination we observe that four variables (Education, Experience, Age and Occupation) are not statistically significant, while two variables, Race and Marr (marital status), are significant only at the 10% level. Now, let's plot the model diagnostics to validate the assumptions of the model.


> plot(fit_model1)

[Diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

The diagnostic plots also look fine. Let’s investigate further and look at pair-wise correlation among variables.

> library(corrplot)
> cor1 = cor(data1)
> corrplot.mixed(cor1, lower.col = "black", number.cex = .7)

The above correlation plot shows that there is high correlation between experience and age variables. This might be resulting in multicollinearity in the model.
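To confirm what the plot suggests, you can also look at the pairwise correlation directly (a small additional check, not shown in the original post):

# Correlation of the suspect pair of predictors
cor(data1$Age, data1$Experience)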

Now, let's move a step further and use the Farrar-Glauber test to investigate this. The 'mctest' package in R provides this test.

install.packages('mctest')
library(mctest)

We will first use omcdiag function in mctest package. According to the package description, omcdiag (Overall Multicollinearity Diagnostics Measures) computes different overall measures of multicollinearity diagnostics for matrix of regressors.

> omcdiag(data1[,c(1:5,7:11)], data1$Wage)

Call:
omcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)

Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:         0.0001         1
Farrar Chi-Square:      4833.5751         1
Red Indicator:             0.1983         0
Sum of Lambda Inverse: 10068.8439         1
Theil's Method:            1.2263         1
Condition Number:        739.7337         1

1 --> COLLINEARITY is detected by the test 
0 --> COLLINEARITY is not detected by the test

The above output shows that multicollinearity is present in the model. Now, let's go a step further and check the F-test for individual regressors in the Farrar-Glauber test.

> imcdiag(data1[,c(1:5,7:11)], data1$Wage)

Call:
imcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)

All Individual Multicollinearity Diagnostics Result

                 VIF    TOL          Wi          Fi Leamer      CVIF Klein
Education   231.1956 0.0043  13402.4982  15106.5849 0.0658  236.4725     1
South         1.0468 0.9553      2.7264      3.0731 0.9774    1.0707     0
Sex           1.0916 0.9161      5.3351      6.0135 0.9571    1.1165     0
Experience 5184.0939 0.0002 301771.2445 340140.5368 0.0139 5302.4188     1
Union         1.1209 0.8922      7.0368      7.9315 0.9445    1.1464     0
Age        4645.6650 0.0002 270422.7164 304806.1391 0.0147 4751.7005     1
Race          1.0371 0.9642      2.1622      2.4372 0.9819    1.0608     0
Occupation    1.2982 0.7703     17.3637     19.5715 0.8777    1.3279     0
Sector        1.1987 0.8343     11.5670     13.0378 0.9134    1.2260     0
Marr          1.0961 0.9123      5.5969      6.3085 0.9551    1.1211     0

1 --> COLLINEARITY is detected by the test 
0 --> COLLINEARITY is not detected by the test

Education , South , Experience , Age , Race , Occupation , Sector , Marr , coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.2805 

* use method argument to check which regressors may be the reason of collinearity

The above output shows that Education, Experience and Age suffer from multicollinearity, and the VIF values for these variables are very high. Finally, let's examine the pattern of multicollinearity by computing partial correlations and conducting the t-test for the correlation coefficients.
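The pcor() function used below is not part of base R; it comes from the 'ppcor' package (the original post does not show this step, so the install/load below is an assumed setup):

install.packages('ppcor')
library(ppcor)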

> pcor(data1[,c(1:5,7:11)], method = "pearson")
$estimate
              Education        South          Sex   Experience        Union          Age         Race   Occupation
Education   1.000000000 -0.031750193  0.051510483 -0.997561873 -0.007479144  0.997261601  0.017230877  0.029436911
South      -0.031750193  1.000000000 -0.030152499 -0.022313605 -0.097548621  0.021525073 -0.111197596  0.008430595
Sex         0.051510483 -0.030152499  1.000000000  0.054977034 -0.120087577 -0.053697851  0.020017315 -0.142750864
Experience -0.997561873 -0.022313605  0.054977034  1.000000000 -0.010244447  0.999875740  0.010888486  0.042058560
Union      -0.007479144 -0.097548621 -0.120087577 -0.010244447  1.000000000  0.012238897 -0.107706183  0.212996388
Age         0.997261601  0.021525073 -0.053697851  0.999875740  0.012238897  1.000000000 -0.010803310 -0.044140293
Race        0.017230877 -0.111197596  0.020017315  0.010888486 -0.107706183 -0.010803310  1.000000000  0.057539374
Occupation  0.029436911  0.008430595 -0.142750864  0.042058560  0.212996388 -0.044140293  0.057539374  1.000000000
Sector     -0.021253493 -0.021518760 -0.112146760 -0.013261665 -0.013531482  0.014565751  0.006412099  0.314746868
Marr       -0.040302967  0.030418218  0.004163264 -0.040976643  0.068918496  0.045090327  0.055645964 -0.018580965
                 Sector         Marr
Education  -0.021253493 -0.040302967
South      -0.021518760  0.030418218
Sex        -0.112146760  0.004163264
Experience -0.013261665 -0.040976643
Union      -0.013531482  0.068918496
Age         0.014565751  0.045090327
Race        0.006412099  0.055645964
Occupation  0.314746868 -0.018580965
Sector      1.000000000  0.036495494
Marr        0.036495494  1.000000000

$p.value
           Education      South          Sex Experience        Union       Age       Race   Occupation       Sector
Education  0.0000000 0.46745162  0.238259049  0.0000000 8.641246e-01 0.0000000 0.69337880 5.005235e-01 6.267278e-01
South      0.4674516 0.00000000  0.490162786  0.6096300 2.526916e-02 0.6223281 0.01070652 8.470400e-01 6.224302e-01
Sex        0.2382590 0.49016279  0.000000000  0.2080904 5.822656e-03 0.2188841 0.64692038 1.027137e-03 1.005138e-02
Experience 0.0000000 0.60962999  0.208090393  0.0000000 8.146741e-01 0.0000000 0.80325456 3.356824e-01 7.615531e-01
Union      0.8641246 0.02526916  0.005822656  0.8146741 0.000000e+00 0.7794483 0.01345383 8.220095e-07 7.568528e-01
Age        0.0000000 0.62232811  0.218884070  0.0000000 7.794483e-01 0.0000000 0.80476248 3.122902e-01 7.389200e-01
Race       0.6933788 0.01070652  0.646920379  0.8032546 1.345383e-02 0.8047625 0.00000000 1.876376e-01 8.833600e-01
Occupation 0.5005235 0.84704000  0.001027137  0.3356824 8.220095e-07 0.3122902 0.18763758 0.000000e+00 1.467261e-13
Sector     0.6267278 0.62243025  0.010051378  0.7615531 7.568528e-01 0.7389200 0.88336002 1.467261e-13 0.000000e+00
Marr       0.3562616 0.48634504  0.924111163  0.3482728 1.143954e-01 0.3019796 0.20260170 6.707116e-01 4.035489e-01
                Marr
Education  0.3562616
South      0.4863450
Sex        0.9241112
Experience 0.3482728
Union      0.1143954
Age        0.3019796
Race       0.2026017
Occupation 0.6707116
Sector     0.4035489
Marr       0.0000000

$statistic
              Education      South         Sex   Experience      Union          Age       Race Occupation     Sector
Education     0.0000000 -0.7271618  1.18069629 -327.2105031 -0.1712102  308.6803174  0.3944914  0.6741338 -0.4866246
South        -0.7271618  0.0000000 -0.69053623   -0.5109090 -2.2436907    0.4928456 -2.5613138  0.1929920 -0.4927010
Sex           1.1806963 -0.6905362  0.00000000    1.2603880 -2.7689685   -1.2309760  0.4583091 -3.3015287 -2.5834540
Experience -327.2105031 -0.5109090  1.26038801    0.0000000 -0.2345184 1451.9092015  0.2492636  0.9636171 -0.3036001
Union        -0.1712102 -2.2436907 -2.76896848   -0.2345184  0.0000000    0.2801822 -2.4799336  4.9902208 -0.3097781
Age         308.6803174  0.4928456 -1.23097601 1451.9092015  0.2801822    0.0000000 -0.2473135 -1.0114033  0.3334607
Race          0.3944914 -2.5613138  0.45830912    0.2492636 -2.4799336   -0.2473135  0.0000000  1.3193223  0.1467827
Occupation    0.6741338  0.1929920 -3.30152873    0.9636171  4.9902208   -1.0114033  1.3193223  0.0000000  7.5906763
Sector       -0.4866246 -0.4927010 -2.58345399   -0.3036001 -0.3097781    0.3334607  0.1467827  7.5906763  0.0000000
Marr         -0.9233273  0.6966272  0.09530228   -0.9387867  1.5813765    1.0332156  1.2757711 -0.4254112  0.8359769
                  Marr
Education  -0.92332727
South       0.69662719
Sex         0.09530228
Experience -0.93878671
Union       1.58137652
Age         1.03321563
Race        1.27577106
Occupation -0.42541117
Sector      0.83597695
Marr        0.00000000

$n
[1] 534

$gp
[1] 8

$method
[1] "pearson"

As we saw earlier in the correlation plot, the partial correlations between age and experience, age and education, and education and experience are statistically significant. Some other pairs are statistically significant as well. Thus, the Farrar-Glauber test helps us identify the variables that are causing multicollinearity in the model.

There are multiple ways to overcome the problem of multicollinearity. You may use ridge regression, principal component regression or partial least squares regression. Alternatively, you can drop the variables that are causing multicollinearity, for example those with a VIF greater than 10. In our case, since age and experience are highly correlated, you may drop one of these variables and build the model again (see the sketch below). Try rebuilding the model after removing experience or age and check whether you get better results. Share your experiences in the comments section below.
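As a starting point, here is a minimal sketch of that remedial step: refit the model without Age (dropping Experience instead would work equally well) and re-check the collinearity diagnostics. The model name fit_model2 is my own, and re-running imcdiag() is just one way to verify the fix:

# Refit the model without Age, which was highly collinear with Experience and Education
fit_model2 = lm(log(Wage) ~ . - Wage - Age, data = data1)
summary(fit_model2)

# Re-check multicollinearity among the remaining predictors
# (columns 1:5 and 8:11 exclude Wage, column 6, and Age, column 7)
imcdiag(data1[, c(1:5, 8:11)], data1$Wage)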

Author Bio:

This article was contributed by Perceptive Analytics. Jyothirmayee Thondamallu, Chaitanya Sagar and Saneesh Veetil contributed to this article.

Perceptive Analytics is a marketing analytics company and it also provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.
