8 Intro to Multiple Linear Regression

Getting Started

As always, insert a chunk for the R packages (and SLU functions) we will need.

library(tidyverse)
source("~/rstudioshared/IRamler/scripts/slunova.R") # function that makes MLR ANOVA tables more convenient

Data

We have data (HomesNNY.csv) on 48 recently sold homes in Canton and Potsdam.

HomesNNY = read_csv("data/HomesNNY.csv") %>%
  rename(Lot_SqFt_1k = Lot2)

8.1 Motivation 1: Influential Observations

Suppose we propose a simple linear model using lot area (in thousand square feet) to explain the price (in thousand dollars) of area homes. We will call this “Model 1.”

Write out the expression for Model 1 below, then examine the residual plots associated with the fitted model. What do you notice?

Model 1:

mod1 <- lm(Price_thousands ~ Lot_SqFt_1k, data = HomesNNY)
summary(mod1)
plot(mod1)

In a previous example, we used a transformation to lessen the impact of an influential observation. However, there are other ways of dealing with unusual and influential observations. Often, they arise because other important variables are not included in the model. One such important variable is the size of the house.

Propose a model that uses both size (in thousand square feet) and lot area (in thousand square feet) to explain selling price (in thousand dollars). We will call this “Model 2.” Fit Model 2 and examine residual plots. Are there still any concerns about influential observations?

Model 2:

mod2 <- lm(Price_thousands ~ Lot_SqFt_1k + Size_sqft_1k, 
           data = HomesNNY)
summary(mod2)
plot(mod2)
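The last of the default plot() panels for an lm fit draws Cook's distance contours, but it can also help to look at the Cook's distance values directly. The sketch below is only an illustration on simulated data: the data frame sim, its columns, and the coefficient values used to generate it are all made up, not taken from HomesNNY. It plants one home with a very large lot and a very large house, then compares the largest Cook's distance when size is left out of the model and when it is included.

```r
# Simulated stand-in for the housing data (NOT the real HomesNNY values)
set.seed(1)
n <- 48
size <- runif(n, 1, 3)      # house size, thousand square feet
lot  <- runif(n, 5, 40)     # lot area, thousand square feet

# Plant one unusual home: very large lot and very large house
size[n] <- 6
lot[n]  <- 150

price <- 20 + 60 * size + 0.5 * lot + rnorm(n, sd = 15)
sim <- data.frame(price, size, lot)

fit_lot  <- lm(price ~ lot, data = sim)         # analogue of Model 1
fit_both <- lm(price ~ lot + size, data = sim)  # analogue of Model 2

# Cook's distance: one value per home; larger means more influential
cd_lot  <- cooks.distance(fit_lot)
cd_both <- cooks.distance(fit_both)

# Largest Cook's distance in each model, and which row it belongs to
c(lot_only = max(cd_lot), lot_and_size = max(cd_both))
c(lot_only = unname(which.max(cd_lot)),
  lot_and_size = unname(which.max(cd_both)))
```

The same two calls, cooks.distance(mod1) and cooks.distance(mod2), should work on the fitted models above.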

8.2 Motivation 2: Partitioning Variation

Previously, we proposed and fit a model that uses size (in thousand square feet) to predict selling price (in thousand dollars). We’ll call this Model 3.

Model 3:

mod3 <- lm(Price_thousands ~ Size_sqft_1k, data = HomesNNY)
ggiraphExtra::ggPredict(mod3, se = TRUE)
summary(mod3)
anova(mod3)
anova(mod2) # looks at Sums of Squares by model component

Use the “slunova” function to obtain the ANOVA table for Model 3.

slunova(mod3)

As we discussed in class, there are factors other than the size of the home that will also impact price. Lot area, discussed earlier, is one such variable. Above, we fit a model that uses both size (in thousand square feet) and lot area (in thousand square feet) to explain selling price (in thousand dollars), as part of Motivation 1 (this was Model 2).

Use the “slunova” function to obtain the ANOVA table for Model 2.

slunova(mod3)
print('--------------------')
slunova(mod2)

What similarities and differences do you notice in the ANOVA tables for Models 2 and 3?
Conduct the ANOVA hypothesis test for Model 2.
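
As a reference for setting up the test, the overall ANOVA test for a model with two predictors can be sketched as follows. The symbols $\beta_1$ and $\beta_2$ (the coefficients on lot area and size) and the labels SSModel, SSE, MSModel, and MSE are notation assumed here and may not match the row labels that slunova prints. The degrees of freedom assume all 48 homes are used in the fit.

$$
H_0: \beta_1 = \beta_2 = 0 \qquad \text{vs.} \qquad H_a: \text{at least one } \beta_j \neq 0
$$

$$
F = \frac{MSModel}{MSE} = \frac{SSModel / 2}{SSE / (48 - 2 - 1)}, \qquad \text{df} = (2,\ 45)
$$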

8.3 Quantities Related to the ANOVA Table

Use the ANOVA table for Model 2 to compute the residual standard error for the model. Check your answer by examining the model summary.

slunova(mod2)
summary(mod2)
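
For the hand calculation, one common way to write the residual standard error is shown below. Here SSE is the error (residual) sum of squares, $n$ is the number of homes, and $k$ is the number of predictors. These symbols are assumed notation, and the last step assumes $n = 48$ and $k = 2$ for Model 2.

$$
\hat{\sigma}_\epsilon = \sqrt{MSE} = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{\frac{SSE}{48 - 2 - 1}}
$$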

Compute $R^2$ for Model 2. Check your answer by examining the model summary.
$R^2$ never decreases when we add predictors to the model, and in practice it almost always increases, even if those predictors are not useful. We are going to consider an extreme example of this by creating three new variables in our dataset that are just random values completely unrelated to the selling price of homes in the area. We will then add them to our model and observe the fitted model’s $R^2$ value.

set.seed(852)
HomesNNY$Random1 = rnorm(48, 10, 2)    # 48 random values from a Normal with mean 10, sd 2
HomesNNY$Random2 = rnorm(48, 100, 10)  # 48 random values from a Normal with mean 100, sd 10
HomesNNY$Random3 = rnorm(48, 25, 3)    # 48 random values from a Normal with mean 25, sd 3

mod_random = lm(Price_thousands ~ Size_sqft_1k + Lot_SqFt_1k + Random1 + Random2 + Random3,
                data = HomesNNY)

slunova(mod_random)
summary(mod_random)
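
The same behavior can be checked by computing $R^2$ directly from the sums of squares. The sketch below uses simulated data, so the data frame sim, the columns noise1 and noise2, and the helper r2() are hypothetical names introduced only for illustration. It computes $R^2 = 1 - SSE/SSTotal$ for a model with and without two pure-noise predictors.

```r
# Simulated data (not HomesNNY): price depends on size only
set.seed(2)
n <- 48
sim <- data.frame(size = runif(n, 1, 3))
sim$price  <- 30 + 70 * sim$size + rnorm(n, sd = 20)
sim$noise1 <- rnorm(n)   # unrelated to price by construction
sim$noise2 <- rnorm(n)   # unrelated to price by construction

fit_small <- lm(price ~ size, data = sim)
fit_big   <- lm(price ~ size + noise1 + noise2, data = sim)

# Hypothetical helper: R^2 = 1 - SSE / SSTotal
r2 <- function(fit) {
  y <- model.response(model.frame(fit))
  sse     <- sum(resid(fit)^2)
  sstotal <- sum((y - mean(y))^2)
  1 - sse / sstotal
}

# R^2 for the larger model is at least as big as for the smaller one
c(small = r2(fit_small), big = r2(fit_big))

# The hand computation should agree with the model summary
summary(fit_big)$r.squared
```

Because the larger model can always reproduce the smaller model's fit by setting the extra coefficients to zero, its SSE cannot be larger, which is why $R^2$ cannot go down.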

In multiple linear regression, adjusted $R^2$ is a better choice because it penalizes the model for including predictors that are not useful. Use the values from the ANOVA table to compute adjusted $R^2$ for Model 2. Check your answer by examining the model summary.

slunova(mod2)
summary(mod2)

summary(mod_random)
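
For reference, adjusted $R^2$ is usually written in terms of the same ANOVA quantities, with each sum of squares divided by its degrees of freedom. The notation (SSE, SSTotal, $n$, $k$) is assumed here rather than taken from the slunova output.

$$
R^2_{adj} = 1 - \frac{SSE / (n - k - 1)}{SSTotal / (n - 1)}
$$

With $n = 48$ homes, the divisors would be 45 and 47 for Model 2 ($k = 2$), and 42 and 47 for the model with the three random predictors ($k = 5$). Adding a predictor lowers SSE but also lowers $n - k - 1$, so adjusted $R^2$ can go down when the new predictor explains little.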

8.4 Inferences about Individual Predictors in the Model

Write out the details of the t tests associated with each of the individual predictors in Model 2.

summary(mod2)
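
To see where the numbers in the coefficient table come from, the t statistics and p-values can be rebuilt by hand. This is a sketch on simulated data: the data frame sim and its columns are made up, and only base R functions are used. Each test is of $H_0: \beta_j = 0$ against $H_a: \beta_j \neq 0$, given that the other predictor is in the model, with $n - k - 1$ degrees of freedom.

```r
# Simulated data (not HomesNNY) with two predictors
set.seed(3)
n <- 48
sim <- data.frame(size = runif(n, 1, 3), lot = runif(n, 5, 40))
sim$price <- 20 + 60 * sim$size + 0.5 * sim$lot + rnorm(n, sd = 15)

fit <- lm(price ~ lot + size, data = sim)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
tab <- coef(summary(fit))

# t statistic = estimate divided by its standard error
t_stat <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value from the t distribution with n - k - 1 df
p_val <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

cbind(t_stat, p_val)   # compare with the last two columns of tab
```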

8.5 Predictions and Interpretations of Estimated Coefficients

Now, let’s consider a model that uses size (in thousand square feet), lot area (in thousand square feet), and age (in years) to explain the selling price of area homes. We’ll call this Model 4.

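# For reference before fitting Model 4: Model 2 estimates and 99% confidence intervals for its coefficients
summary(mod2)
confint(mod2, level = 0.99)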

Write out the proposed model.

Model 4:

Fit Model 4 and report that fitted equation below. Notice the t test associated with lot area. What does it say here?

mod4 <- lm(Price_thousands ~ Lot_SqFt_1k + Size_sqft_1k + Age,
           data = HomesNNY)
summary(mod4)

For simplicity, propose a new model (Model 5) that uses just size (in thousand square feet) and age (in years) to predict the selling price (in thousand dollars) of area homes. What are some reasons this model might be preferred over the previous model?

Model 5:

Now predict the price for houses with the following characteristics:
1. Size = 1.9 thousand square feet and 50 years old
2. Size = 1.9 thousand square feet and 51 years old
3. Size = 2 thousand square feet and 51 years old

Compare these predictions.

mod5 <- lm(Price_thousands ~ Size_sqft_1k + Age, data = HomesNNY)
summary(mod5)

newx = data.frame(Size_sqft_1k = c(1.9, 1.9, 2), Age = c(50, 51, 51)) # look at newx in your environment
predict(mod5, newdata = newx) # predicted price (in thousand dollars) for each of the three homes

Interpret the estimated coefficients on both size (in thousand square feet) and age (in years).
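
Comparing predictions for homes that differ in only one predictor is a direct way to see what each coefficient means. The sketch below uses simulated data (the data frame sim and the generating coefficients are made up) with the same three size and age combinations as above. The difference between two predictions equals the coefficient times the change in that predictor, holding the other predictor fixed.

```r
# Simulated data (not HomesNNY) with size and age as predictors
set.seed(4)
n <- 48
sim <- data.frame(size = runif(n, 1, 3), age = runif(n, 1, 120))
sim$price <- 50 + 65 * sim$size - 0.4 * sim$age + rnorm(n, sd = 15)

fit <- lm(price ~ size + age, data = sim)

newx <- data.frame(size = c(1.9, 1.9, 2), age = c(50, 51, 51))
yhat <- predict(fit, newdata = newx)

# Home 2 vs. home 1: one more year of age, same size
age_diff <- unname(yhat[2] - yhat[1])
c(difference = age_diff, coefficient = unname(coef(fit)["age"]))

# Home 3 vs. home 2: 0.1 thousand square feet larger, same age
size_diff <- unname(yhat[3] - yhat[2])
c(difference = size_diff, tenth_of_coefficient = 0.1 * unname(coef(fit)["size"]))
```
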
Back to my neighbor’s house: Recall that my neighbors are selling a 2.3 thousand square foot house that is 87 years old (built in 1932). Find:

90% confidence interval for the mean price of all 87-year-old, 2.3 thousand square foot homes in the area.

# making prediction and confidence intervals for MLR
newx2 = data.frame(Size_sqft_1k = 2.3, Age = 87)
predict.lm(mod5, newx2) # point prediction (y-hat) only, no interval
predict.lm(mod5, newx2, 
           interval = "confidence",
           level = 0.9
           )

90% prediction interval for the price of a single 87-year-old, 2.3 thousand square foot home.

# making prediction and confidence intervals for MLR
predict.lm(mod5, newx2, 
           interval = "prediction",
           level = 0.9
           )

# remove the new data option to
# get intervals for all rows in 
# the dataset that was used to build the model

predict.lm(mod5, 
           interval = "prediction",
           level = 0.9
           )
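
To compare the two kinds of interval side by side, the sketch below fits a model to simulated data (the data frame sim and the object names conf_int and pred_int are made up for illustration) and computes both intervals for the same hypothetical home. Both are centered at the same fitted value, but the prediction interval is wider because it also accounts for the variation of an individual home around the mean.

```r
# Simulated data (not HomesNNY) with size and age as predictors
set.seed(5)
n <- 48
sim <- data.frame(size = runif(n, 1, 3), age = runif(n, 1, 120))
sim$price <- 50 + 65 * sim$size - 0.4 * sim$age + rnorm(n, sd = 15)

fit <- lm(price ~ size + age, data = sim)

new_home <- data.frame(size = 2.3, age = 87)

# Interval for the MEAN price of all such homes
conf_int <- predict(fit, new_home, interval = "confidence", level = 0.9)

# Interval for the price of ONE such home
pred_int <- predict(fit, new_home, interval = "prediction", level = 0.9)

# Same center (fit), different lower and upper limits
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])

# Widths of the two intervals
conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
c(confidence = conf_width, prediction = pred_width)
```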