17 Model Building

Getting Started

As always, insert a chunk for the R packages (and SLU functions) we will need.

library(tidyverse)
library(GGally) # for pairs plot

library(leaps)  # for Best Subsets
source("~/rstudioshared/IRamler/scripts/ShowSubsets.R") # for Best Subsets Summary

Often, the goal of statistical modeling is to find the “best” model possible for making predictions. Today we will explore some model building strategies for the “colleges” dataset.

collegesTraining = read_csv("data/Colleges_Training.CSV") # Used to identify a small number of potential "best" models
collegesHoldout = read_csv("data/Colleges_Holdout.CSV") # Used to help make a final decision between candidate "best" models

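Optionally, a quick structural check can confirm that the data read in as expected (a minimal sketch, assuming the column names match the list below).

glimpse(collegesTraining)  # check variable names and types before modeling
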
The available variables include

  • TotalCost (in thousands of dollars) - the total cost of attending the college; this includes tuition, room & board, and other fees

  • Type - the type of school (private or public)

  • ACT - the median ACT score for students attending the college

  • ACTQ3 - the 75th percentile ACT score for students attending the college

  • PctTop10 - the percentage of students who were in the top 10% of their high school class

  • StudFac - the student/faculty ratio for the college

  • GradRate - the four-year graduation rate for the college

  • Enroll - the total enrollment for the college

Goal: Build a “best” model for predicting the total cost of a college

ggpairs(collegesTraining, aes(color=Type), columns=5:11)  # pairs plot of selected columns, colored by Type

bestmods = regsubsets(TotalCost ~ ACT + ACTQ3 + PctTop10 + StudFac + GradRate + Enroll + Type, data=collegesTraining, nbest=2)
ShowSubsets(bestmods)

# a potential option to investigate...not as nice as the previous chunk
plot(bestmods, scale=c("bic", "Cp", "adjr2", "r2")[2])  # the [2] selects "Cp"; change the index to view a different criterion

Discuss Potential Strategies for Using this Information to Find a “Best” Model

After checking the four metrics, the models labeled 3(1), 3(2), and 2(1) (the two best three-predictor models and the best two-predictor model) consistently show up as good choices; there is no single clear best model, but several good ones.

These models would make a good starting point for further analysis (e.g., fitting each candidate model and assessing how well it meets the regression conditions).
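For instance, one could fit each candidate and compare training-sample fit statistics. The predictor sets below are only placeholders; substitute the variables actually identified in the ShowSubsets output above. A rough sketch:

# Fit the candidate models from Best Subsets (predictor sets are placeholders)
cand2_1 = lm(TotalCost ~ ACT + Type, data=collegesTraining)
cand3_1 = lm(TotalCost ~ ACT + Type + GradRate, data=collegesTraining)
cand3_2 = lm(TotalCost ~ ACTQ3 + Type + GradRate, data=collegesTraining)

# Compare training-sample fit
summary(cand2_1)$adj.r.squared
summary(cand3_1)$adj.r.squared
summary(cand3_2)$adj.r.squared
AIC(cand2_1, cand3_1, cand3_2)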

Crossvalidation: Using a Holdout Sample

Crossvalidation means using the fitted equation from the training sample to predict responses from the holdout sample and then analyzing the residuals for those predictions.

modex1 = lm(TotalCost ~ ACT + Type, data=collegesTraining)
# modex1a = lm(log(TotalCost) ~ ACT + Type, data=collegesTraining)
modex1_HOpreds = predict(modex1, collegesHoldout)           # predictions for the holdout colleges
modex1_HOres = collegesHoldout$TotalCost - modex1_HOpreds   # holdout residuals (actual - predicted)

plot(modex1)  # standard residual diagnostic plots for the training fit

mean(modex1_HOres)  # mean prediction error (residuals)
sd(modex1_HOres)    # sd of residuals (typical prediction error)
max(abs(modex1_HOres)) # worst prediction
cor(modex1_HOpreds, collegesHoldout$TotalCost)  # cross validation correlation

cor(modex1_HOpreds, collegesHoldout$TotalCost)^2  # cross validation R-squared (squared correlation)
#modex1a_HOpreds = exp(predict(modex1a, collegesHoldout))   # back-transform log-scale predictions to the original scale
#modex1a_HOres = collegesHoldout$TotalCost - modex1a_HOpreds


#mean(modex1a_HOres)  # mean prediction error (residuals)
#sd(modex1a_HOres)    # sd of residuals (typical prediction error)
#max(abs(modex1a_HOres)) # worst prediction
#cor(modex1a_HOpreds, collegesHoldout$TotalCost)  # cross validation correlation
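
A common follow-up, sketched below rather than taken directly from the worksheet, is to compare the training R-squared to the cross validation R-squared computed above; a large drop (shrinkage) is a warning that the model is overfitting the training sample.

# Shrinkage: training R-squared minus cross validation R-squared
shrinkage = summary(modex1)$r.squared - cor(modex1_HOpreds, collegesHoldout$TotalCost)^2
shrinkage   # small values are reassuring; a large drop suggests overfitting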