9 Multicollinearity

Getting Started As always, insert a chunk for the R packages (and SLU functions) we will need.

library(tidyverse)
source("~/rstudioshared/IRamler/scripts/slunova.R") # for ANOVA test
library(GGally) # for correlation plots
library(car) # for VIF

Multicollinearity means that two or more predictor variables used in the model are correlated.

9.1 What’s so bad about Multicollineary? Example 1

Import the dataset Cars99.CSV.

cars99 = read_csv("data/Cars99.CSV")

This dataset contains information on various cars as reported in the Consumer Reports 1999 New Car Buying Guide and, while old, is still a very useful data set for illustrating regression topics. The three variables currently of interest to us are: Highway_MPG (the EPA’s estimated miles per gallon for highway driving), Acceleration_060 (time, in seconds, to accelerate from zero to 60 mph), and Time_for_QtrMile (time, in seconds, to go 1/4 mile from a standing start).

  1. Fit a model that uses both Acceleration_060 and Time_for_QtrMile to predict Highway_MPG. Report both the population model and the fitted model below.
mod1 <- lm(Highway_MPG ~ Acceleration_060 + Time_for_QtrMile, data = cars99)
summary(mod1)
  1. Is there evidence that the model is useful? Conduct the appropriate hypothesis test, including all of the details.
slunova(mod1)
  1. Which predictors in the model are useful?
summary(mod1)
cor(cars99$Acceleration_060, cars99$Time_for_QtrMile)

9.2 What’s so bad about Multicollineary? Example 2

Now import the dataset BodyFat78.csv.

bodyfat = read_csv("data/BodyFat78.csv")

This dataset contains information on percent body fat, weight, and other easily obtained body measurements for 78 men. Note: adiposity in this dataset is indeed the same thing as BMI (body mass index).

First you will consider three different simple linear regression models. For each model, write the population model and the fitted model in the space provided.

  1. Model 1: Weight (lbs) as a predictor of percent body fat
mod_wgt <- lm(BodyFat ~ Weight, data = bodyfat)
summary(mod_wgt)
  1. Model 2: Adiposity (lbs/in2) as a predictor of percent body fat
mod_bmi <- lm(BodyFat ~ Adiposity, data = bodyfat)
summary(mod_bmi)
  1. Model 3: Neck circumference (cm) as a predictor of percent body fat
mod_neck <- lm(BodyFat ~ Neck, data = bodyfat)
summary(mod_neck)
  1. Now consider a model that uses all three predictors (weight, adiposity, and neck circumference) as predictors of percent body fat. Write the population model and the fitted prediction equation in the space provided.

Model 4: Weight, Adiposity, and Neck Circumference as predictors of percent body fat

mod_mlr1 <- lm(BodyFat ~ Weight + Adiposity + Neck, 
               data = bodyfat)
summary(mod_mlr1)
  1. Is there anything surprising about your results? (In particular, pay attention to the slope coefficients.)

  2. Construct a correlation matrix to investigate the pairwise correlations. What do you notice?

ggcorr(bodyfat, # update data here
       
       #suggestion: keep these options the same
       label = TRUE, label_size = 3, 
       label_round = 1, label_alpha = TRUE,
       hjust = 0.75, layout.exp = 1
       )
  1. “By hand”,” calculate the VIFs for each of the explanatory variables in Model 4. Do they indicate a problem with Multicollinearity?
modad = lm(Adiposity ~ Weight + Neck, data=bodyfat)
modwt = lm(Weight ~ Adiposity + Neck, data=bodyfat)
modneck = lm(Neck ~ Adiposity + Weight, data=bodyfat)
summary(modad)
summary(modwt)
summary(modneck)
  1. Use the vif function (from the car package) to get the VIF values for Model 4.
vif(mod_mlr1)

Back to Example 1

  1. Construct a correlation matrix to investigate the pairwise correlations.
ggcorr(cars99, 
       
       #suggestion: keep these options the same
       label = TRUE, label_size = 3, 
       label_round = 1, label_alpha = TRUE,
       hjust = 0.75, layout.exp = 1
       )
  1. Compute VIF by hand for the explanatory variables in this model (modCars). Verify your answers by using the vif function.
modacc = lm(Acceleration_060 ~ Time_for_QtrMile, data=cars99)
modtime = lm(Time_for_QtrMile ~ Acceleration_060, data=cars99)

Notice these models have same \(R^2\), unlike previous example. Why is that? There is only one other variable. Talk about relationship between \(R^2\) and correlation for SLR. (Shortcut here)

vif(mod1)
1/ (1 - cor(cars99$Acceleration_060, cars99$Time_for_QtrMile) ^ 2)