8 Intro to Multiple Linear Regression
Getting Started
As always, insert a chunk for the R packages (and SLU functions) we will need.
library(tidyverse)
source("~/rstudioshared/IRamler/scripts/slunova.R") # function that makes MLR ANOVA tables more convenient
Data
We have data (HomesNNY.csv) on 48 recently sold homes in Canton and Potsdam.
HomesNNY = read_csv("data/HomesNNY.csv") %>%
  rename(Lot_SqFt_1k = Lot2)
8.1 Motivation 1: Influential Observations
Suppose we propose a simple linear model using lot area (in thousand square feet) to explain the price (in thousand dollars) of area homes. We will call this “Model 1.”
- Write out the expression for Model 1 below, then examine the residual plots associated with the fitted model. What do you notice?
Model 1:
mod1 <- lm(Price_thousands ~ Lot_SqFt_1k, data = HomesNNY)
summary(mod1)
plot(mod1)
In a previous example, we used a transformation to lessen the impact of an influential observation. However, there are other ways of dealing with unusual and influential observations. Often, they arise because other important variables are not included in the model. One such important variable is the size of the house.
- Propose a model that uses both size (in thousand square feet) and lot area (in thousand square feet) to explain selling price (in thousand dollars). We will call this “Model 2.” Fit Model 2 and examine residual plots. Are there still any concerns about influential observations?
Model 2:
mod2 <- lm(Price_thousands ~ Lot_SqFt_1k + Size_sqft_1k,
           data = HomesNNY)
summary(mod2)
plot(mod2)
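The "Residuals vs Leverage" panel of plot() is built from leverage values and Cook's distances, and these can also be pulled out as numbers. The sketch below is only an illustration under stated assumptions: it uses simulated data (not the HomesNNY data), and the names sim, fit_lot, fit_both, and diagnostics are made up for this example. One simulated home is both very large and on a very large lot, so it is an unusual point when lot area is the only predictor.

```r
# Simulated data for illustration only (NOT the HomesNNY data)
set.seed(1)
n <- 40
size <- runif(n, min = 1, max = 3)    # house size, thousand sq ft
lot  <- runif(n, min = 5, max = 30)   # lot area, thousand sq ft
size[n] <- 6                          # one very large house ...
lot[n]  <- 150                        # ... on a very large lot
price <- 20 + 60 * size + 0.5 * lot + rnorm(n, sd = 10)
sim <- data.frame(price, size, lot)

fit_lot  <- lm(price ~ lot, data = sim)          # analogue of Model 1
fit_both <- lm(price ~ lot + size, data = sim)   # analogue of Model 2

# Leverage and Cook's distance for each home under both models
diagnostics <- data.frame(
  leverage_lot  = hatvalues(fit_lot),
  cooks_lot     = cooks.distance(fit_lot),
  leverage_both = hatvalues(fit_both),
  cooks_both    = cooks.distance(fit_both)
)
tail(diagnostics, 3)   # the unusual home is the last row
```

Comparing the two Cook's distance columns for the last row is one way to see whether adding the second predictor changes how much that home pulls on the fit.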
8.2 Motivation 2: Partitioning Variation
Previously, we proposed and fit a model that uses size (in thousand square feet) to predict selling price (in thousand dollars). We’ll call this Model 3.
Model 3:
mod3 <- lm(Price_thousands ~ Size_sqft_1k, data = HomesNNY)
ggiraphExtra::ggPredict(mod3, se = TRUE)
summary(mod3)
anova(mod3)
anova(mod2) # looks at Sums of Squares by model component
- Use the “slunova” function to obtain the ANOVA table for Model 3.
slunova(mod3)
As we discussed in class, there are factors other than the size of the home that will also impact price. Lot area, discussed earlier, is one such variable. Above, we fit a model that uses both size (in thousand square feet) and lot area (in thousand square feet) to explain selling price (in thousand dollars), as part of Motivation 1 (this was Model 2).
- Use the “slunova” function to obtain the ANOVA table for Model 2.
slunova(mod3)
print('--------------------')
slunova(mod2)
What similarities and differences do you notice in the ANOVA tables for Models 2 and 3?
Conduct the ANOVA hypothesis test for Model 2.
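The pieces of an MLR ANOVA table can be rebuilt by hand, which may help when writing out the test. Since slunova() is a course-specific helper, this sketch uses only base R, and it uses the built-in mtcars data as a stand-in (an assumption for illustration, not the HomesNNY data). With k = 2 predictors, the hypotheses are H0: beta1 = beta2 = 0 versus Ha: at least one slope is nonzero.

```r
# Stand-in example with the built-in mtcars data (NOT HomesNNY)
fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
n <- nrow(mtcars)   # number of observations
k <- 2              # number of predictors in the model

# Partition the total variation: SSTotal = SSModel + SSE
ss_total <- sum((y - mean(y))^2)
ss_error <- sum(residuals(fit)^2)
ss_model <- ss_total - ss_error

# Mean squares, F statistic, and p-value for the overall test
ms_model <- ss_model / k
ms_error <- ss_error / (n - k - 1)
f_stat   <- ms_model / ms_error
p_value  <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(SSModel = ss_model, SSE = ss_error, SSTotal = ss_total)
c(F = f_stat, p = p_value)
```

The F statistic computed this way should agree with the one printed on the last line of summary(fit), and SSModel should equal the sum of the predictor rows in anova(fit).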
8.4 Inferences about Individual Predictors in the Model
- Write out the details of the t tests associated with each of the individual predictors in Model 2.
summary(mod2)
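Each row of the coefficient table in summary() is a t test of H0: beta_j = 0 versus Ha: beta_j != 0, with n - k - 1 degrees of freedom. As a rough sketch, the same numbers can be computed by hand. The built-in mtcars data is used here as a stand-in (an assumption for illustration, not the HomesNNY data), and the object names are made up for this example.

```r
# Stand-in example with the built-in mtcars data (NOT HomesNNY)
fit <- lm(mpg ~ wt + hp, data = mtcars)

est <- coef(fit)               # estimated coefficients
se  <- sqrt(diag(vcov(fit)))   # their standard errors
df  <- df.residual(fit)        # n - k - 1

# t statistic and two-sided p-value for each coefficient
t_stat  <- est / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
cbind(est, se, t_stat, p_value)

# The same ingredients give a 99% confidence interval for each coefficient
t_crit <- qt(0.995, df = df)
ci_99  <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci_99
```

These values should match the "t value" and "Pr(>|t|)" columns of summary(fit) and the output of confint(fit, level = 0.99).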
8.5 Predictions and Interpretations of Estimated Coefficients
Now, let’s consider a model that uses size (in thousand square feet), lot area (in thousand square feet), and age (in years) to explain the selling price of area homes. We’ll call this Model 4.
summary(mod2)
confint(mod2, level = 0.99)
- Write out the proposed model.
Model 4:
- Fit Model 4 and report that fitted equation below. Notice the t test associated with lot area. What does it say here?
mod4 <- lm(Price_thousands ~
             Lot_SqFt_1k + Size_sqft_1k + Age, data = HomesNNY)
summary(mod4)
- For simplicity, propose a new model (Model 5) that uses just size (in thousand square feet) and age (in years) to predict the selling price (in thousand dollars) of area homes. What are some reasons this model might be preferred over the previous model?
Model 5:
- Now predict the price for houses with the following characteristics:
- Size = 1.9 thousand square feet and 50 years old
- Size = 1.9 thousand square feet and 51 years old
- Size = 2 thousand square feet and 51 years old
Compare these predictions.
mod5 <- lm(Price_thousands ~
             Size_sqft_1k + Age, data = HomesNNY)
summary(mod5)
newx = data.frame(Size_sqft_1k = c(1.9, 1.9, 2), Age = c(50, 51, 51)) # look at newx in your environment
predict.lm(mod5, newx) # predicted prices (in thousand dollars) for the three homes
Interpret the estimated coefficients on both size (in thousand square feet) and age (in years).
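One way to see what a coefficient means is to compare predictions that differ in only one predictor. The sketch below mirrors that idea with the built-in mtcars data as a stand-in (an assumption for illustration, not the HomesNNY data); new_cars and the chosen values are hypothetical.

```r
# Stand-in example with the built-in mtcars data (NOT HomesNNY)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Three hypothetical cars: rows 1 and 2 differ only in hp (by 1 unit),
# rows 2 and 3 differ only in wt (by 0.1 units)
new_cars <- data.frame(wt = c(3.0, 3.0, 3.1), hp = c(100, 101, 101))
pred <- predict(fit, newdata = new_cars)

# Holding wt fixed, a 1-unit increase in hp changes the prediction
# by exactly the hp coefficient
pred[2] - pred[1]
coef(fit)["hp"]

# Holding hp fixed, a 0.1-unit increase in wt changes the prediction
# by 0.1 times the wt coefficient
pred[3] - pred[2]
0.1 * coef(fit)["wt"]
```

This is why each coefficient is interpreted as the change in the predicted response for a one-unit increase in that predictor, with the other predictors in the model held fixed.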
Back to my neighbor’s house: Recall that my neighbors are selling a 2.3 thousand square foot house that is 87 years old (built in 1932). Find:
- a 90% confidence interval for the mean price of all 87-year-old, 2.3 thousand square foot homes in the area.
# making prediction and confidence intervals for MLR
newx2 = data.frame(Size_sqft_1k = 2.3, Age = 87)
predict.lm(mod5, newx2) # just give yhat
predict.lm(mod5, newx2,
interval = "confidence",
level = 0.9
)
- a 90% prediction interval for a single 87-year-old, 2.3 thousand square foot home.
# making prediction and confidence intervals for MLR
predict.lm(mod5, newx2,
interval = "prediction",
level = 0.9
)
# remove the new data option to
# get intervals for all rows in
# the dataset that was used to build the model
predict.lm(mod5,
interval = "prediction",
level = 0.9
)
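To see why the prediction interval is wider than the confidence interval at the same predictor values, both can be rebuilt from the standard error of the fitted mean and the residual standard error. This is a minimal sketch using the built-in mtcars data as a stand-in (an assumption for illustration, not the HomesNNY data); new_car and the other object names are hypothetical.

```r
# Stand-in example with the built-in mtcars data (NOT HomesNNY)
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)   # hypothetical new observation

ci_90 <- predict(fit, new_car, interval = "confidence", level = 0.9)
pi_90 <- predict(fit, new_car, interval = "prediction", level = 0.9)

# Rebuild both intervals by hand
pred      <- predict(fit, new_car, se.fit = TRUE)
t_crit    <- qt(0.95, df = pred$df)          # 90% two-sided critical value
se_mean   <- pred$se.fit                     # SE of the estimated mean response
sigma_hat <- pred$residual.scale             # residual standard error
se_pred   <- sqrt(se_mean^2 + sigma_hat^2)   # SE for a single new response

ci_hand <- pred$fit + c(-1, 1) * t_crit * se_mean
pi_hand <- pred$fit + c(-1, 1) * t_crit * se_pred

rbind(confidence = ci_hand, prediction = pi_hand)
```

Both intervals are centered at the same fitted value; the prediction interval adds the variability of an individual response around the mean, so it is always the wider of the two.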