19 Logistic Regression - Part 2
library(tidyverse)
library(bestglm)
Import the Colleges.CSV dataset.
#colleges = read_csv(here::here("data/Colleges.CSV"))
colleges = read_csv("data/Colleges.CSV")
# indicator response: 1 = private college, 0 = otherwise
colleges$PrivateIND = ifelse(colleges$Type=="Private", 1, 0)
19.1 Model 1: Interpretations and Inference in Simple Logistic Regression
Consider a model for predicting the probability of being a private college using only student/faculty ratio.
Write out both forms (probability form and logit form) of the population model for using student/faculty ratio to model the probability of being a private college.
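As a reference point for this question, the two forms can be written as follows. This is a sketch in one common notation: the symbols \(\pi\), \(\beta_0\), and \(\beta_1\) are assumed here and may differ from the notation used in lecture.

```latex
% pi = P(PrivateIND = 1 | StudFac), the probability that a college is private

% Probability form
\[
\pi = \frac{e^{\beta_0 + \beta_1\,\text{StudFac}}}{1 + e^{\beta_0 + \beta_1\,\text{StudFac}}}
\]

% Logit (log-odds) form
\[
\log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1\,\text{StudFac}
\]
```

The two forms describe the same model: solving the logit form for \(\pi\) gives the probability form.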
Use R to fit the model and report it below.
mod1 = glm( PrivateIND ~ StudFac ,
            data=colleges,
            family = "binomial")
summary(mod1)
- Use the fitted model to “predict” the probability of being a private college for a student-faculty ratio = 14 and a student-faculty ratio = 15.
newx = data.frame(StudFac = c(14,15, 16))
predict.glm(mod1,newx,type="link")      # predicted log-odds
predict.glm(mod1,newx,type="response")  # predicted probabilities
exp(0.9348735) / 4.532                  # odds at 15 divided by odds at 14
Compare the predicted odds for the two cases predicted above.
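The comparison can be sketched with plain arithmetic. The numbers below are copied from the working in this section and are assumed to be output from `mod1`: 0.9348735 as the predicted log-odds at a ratio of 15, 4.532 as the predicted odds at a ratio of 14, and -0.5774 as the estimated slope (taken from the interval arithmetic later in this model's section).

```r
# Values copied from the worked lines in this section (assumed mod1 output)
log_odds_15 <- 0.9348735   # predicted log-odds at StudFac = 15
odds_14     <- 4.532       # predicted odds at StudFac = 14
slope       <- -0.5774     # estimated coefficient on StudFac

odds_15    <- exp(log_odds_15)     # odds = exp(log-odds)
odds_ratio <- odds_15 / odds_14    # odds at 15 relative to odds at 14
prob_15    <- plogis(log_odds_15)  # probability = odds / (1 + odds)

# The ratio of predicted odds should match exp(slope) up to rounding
c(ratio_of_odds = odds_ratio, exp_slope = exp(slope))
```

Because the model is linear on the log-odds scale, the ratio of odds for any one-unit increase in student/faculty ratio is the same, which is why it matches `exp(slope)`.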
Provide an interpretation of the estimated coefficient on student/faculty ratio (in terms of the odds ratio).
Construct and interpret a 95% confidence interval for the odds ratio for student/faculty ratio.
summary(mod1)
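-0.5774 - 1.96*0.1329    # lower endpoint for the slope (estimate - 1.96*SE)
-0.5774 + 1.96*0.1329    # upper endpoint for the slope (estimate + 1.96*SE)
exp( c(-.838, -0.317) )  # endpoints on the odds-ratio scale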
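The interval arithmetic above can be written as one short calculation. This is a sketch that assumes -0.5774 and 0.1329 are the slope estimate and its standard error from `summary(mod1)`, and that a Wald (normal-based) interval is the intended method.

```r
est <- -0.5774        # slope estimate (assumed from summary(mod1))
se  <-  0.1329        # its standard error (assumed from summary(mod1))
z   <- qnorm(0.975)   # about 1.96 for a 95% interval

ci_beta <- est + c(-1, 1) * z * se  # interval for the slope (log-odds scale)
ci_or   <- exp(ci_beta)             # interval for the odds ratio

ci_beta
ci_or
```

Both endpoints of `ci_or` are below 1, so the interval is consistent with the odds of being private decreasing as student/faculty ratio increases.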
19.2 Model 2: Multiple Logistic Regression
Now consider a model that uses both student/faculty ratio and graduation rate to predict the probability that a college is private.
Write out the appropriate population model (in logit form).
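One way to write this model, using assumed notation (\(\pi\) for the probability that a college is private, \(\beta_0, \beta_1, \beta_2\) for the population coefficients):

```latex
\[
\log\!\left(\frac{\pi}{1-\pi}\right)
  = \beta_0 + \beta_1\,\text{StudFac} + \beta_2\,\text{GradRate}
\]
```

Here \(e^{\beta_1}\) is the odds multiplier for a one-unit increase in student/faculty ratio with graduation rate held fixed.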
Use R to fit the model and report the fitted equation below.
mod2 = glm(PrivateIND ~ StudFac + GradRate ,
           data=colleges,
           family = "binomial")
summary(mod2)
Interpret the estimated coefficient on student/faculty ratio (in terms of the odds ratio).
Is there evidence that graduation rate is a useful predictor in this model? Justify with the details of the appropriate hypothesis test.
summary(mod2)
- Is there evidence that this model is useful for predicting the probability that a college is private? State the hypotheses associated with this research question and conduct the appropriate hypothesis test.
summary(mod2)
19.3 Best Model
Let’s do “best subsets” to find the “best” model for predicting the probability of being a private college.
- What is the “best” model we can find for predicting the probability of being a private college?
# the response variable needs to be the last column
# in the bestglm's data.frame
colleges_bestglm_format <- data.frame( colleges[ ,c(6,7,8,9,10,11, 12) ] )
bestmods = bestglm( colleges_bestglm_format ,
                    family=binomial) # bestglm assumes last column is the response
summary(bestmods$BestModel) # lowest BIC model
bestmods$BestModels # 5 models with lowest BIC
- What is the “best” model we can find for predicting the probability of being a private college, if we use \(\sqrt{Enroll}\) as a predictor rather than \(Enroll\)?
colleges$sqrtEnroll = sqrt(colleges$Enroll)
bestmods2 = bestglm( data.frame( colleges[,c(6:10,13, 12)] ),
                     family=binomial) # bestglm assumes last column is the response
bestmods2$BestModels
summary(bestmods2$BestModel)
- Does adding an interaction between student/faculty ratio and enrollment result in a better model?
modbest = glm(PrivateIND ~ StudFac + sqrtEnroll + StudFac * sqrtEnroll ,
              data = colleges,
              family = binomial)
summary(modbest)
BIC(modbest)
car::vif(modbest)
# no surprise that we have multicollinearity issues since
# the interaction uses the same information
# as the other terms
Inference for our “Best” Model
Is there evidence that the overall model is useful? Include all details of the appropriate hypothesis test.
Is there evidence that the interaction term is useful in this model? Include all details of the appropriate hypothesis test.
Is there evidence that at least one of the terms involving enrollment significantly improves the model? Address with a single hypothesis test (include all details).
modsf = glm(PrivateIND ~ StudFac, data=colleges, family="binomial")
summary(modsf)
52.424-19.416
anova(modsf,modbest, test = "Chisq")
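The subtraction above is the drop-in-deviance (likelihood ratio) statistic that `anova()` reports. A by-hand version might look like the sketch below, assuming 52.424 and 19.416 are the residual deviances of `modsf` and `modbest`, as the working suggests. The degrees of freedom are 2 because `modbest` adds two terms (`sqrtEnroll` and the interaction).

```r
dev_reduced <- 52.424  # residual deviance of modsf (assumed from its summary)
dev_full    <- 19.416  # residual deviance of modbest (assumed from its summary)

G  <- dev_reduced - dev_full  # drop-in-deviance test statistic
df <- 2                       # number of extra terms in the full model

# Upper-tail chi-square probability gives the p-value
p_value <- pchisq(G, df = df, lower.tail = FALSE)

c(G = G, df = df, p_value = p_value)
```

A small p-value here is evidence that at least one of the two enrollment terms has a nonzero population coefficient.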
- Consider a college with an enrollment of 2400. Provide an interpretation (in terms of the odds multiplier) of the coefficient relating student/faculty ratio to the probability of being a private college for such a school.
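# fitted logit at Enroll = 2400, written in terms of SF (student/faculty ratio):
# log_odds = 15.67 - 0.83*SF
exp(-0.83)      # odds multiplier for a one-unit increase in SF
exp(-0.83) - 1  # proportional change in the odds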
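The reasoning behind this calculation can be laid out as a short derivation. The notation (\(\pi\), \(\beta_0, \dots, \beta_3\)) is assumed, and the values 15.67 and -0.83 are taken from the working above rather than recomputed.

```latex
% Interaction model on the logit scale
\[
\log\!\left(\frac{\pi}{1-\pi}\right)
  = \beta_0 + \beta_1\,\text{StudFac} + \beta_2\sqrt{\text{Enroll}}
    + \beta_3\,\text{StudFac}\cdot\sqrt{\text{Enroll}}
\]

% Fix Enroll = 2400, so sqrt(Enroll) is about 48.99, and collect terms in StudFac
\[
\log\!\left(\frac{\pi}{1-\pi}\right)
  = \left(\beta_0 + 48.99\,\beta_2\right)
    + \left(\beta_1 + 48.99\,\beta_3\right)\text{StudFac}
\]

% With the fitted values from the working above
\[
\widehat{\text{log-odds}} = 15.67 - 0.83\,\text{StudFac},
\qquad e^{-0.83} \approx 0.436,
\qquad e^{-0.83} - 1 \approx -0.564
\]
```

So for a college with an enrollment of 2400, each one-unit increase in student/faculty ratio multiplies the estimated odds of being private by about 0.436, a decrease of roughly 56%. With an interaction in the model, this multiplier changes with enrollment.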
Your Turn: Logistic Regression Inference Summary
Now consider a model that uses student/faculty ratio, graduation rate, ACTQ3, and PctTop10 to predict the probability of being a private college.
mod3 = glm(PrivateIND ~ ACTQ3 + PctTop10 + StudFac + GradRate,
           data=colleges, family="binomial")
- Is there evidence that this model is useful? Include all details of the appropriate hypothesis test.
summary(mod3)
92.105-47.876
- Conduct a single hypothesis test to determine if there is evidence that at least one of graduation rate (GradRate) and Percent Top 10 (PctTop10) is useful in this model. Include all details of the appropriate hypothesis test.
mod3_reduced = glm(PrivateIND ~ ACTQ3 + StudFac,
                   data=colleges, family="binomial")
summary(mod3_reduced)
50.508-47.876
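Both deviance subtractions in this part (92.105 - 47.876 for the overall test and 50.508 - 47.876 for the nested test) can be turned into p-values the same way. The sketch below uses a hypothetical helper, `drop_in_deviance_test()`, which is not part of any package. It assumes 92.105 is the null deviance of `mod3`, 47.876 is its residual deviance, and 50.508 is the residual deviance of `mod3_reduced`.

```r
# Hypothetical helper: drop-in-deviance test from two deviances
drop_in_deviance_test <- function(dev_reduced, dev_full, df) {
  G <- dev_reduced - dev_full                        # test statistic
  p <- pchisq(G, df = df, lower.tail = FALSE)        # upper-tail p-value
  c(G = G, df = df, p_value = p)
}

# Overall test: null model vs. mod3 (4 predictors, so df = 4)
overall <- drop_in_deviance_test(92.105, 47.876, df = 4)

# Nested test: mod3_reduced vs. mod3 (GradRate and PctTop10 dropped, so df = 2)
nested <- drop_in_deviance_test(50.508, 47.876, df = 2)

rbind(overall, nested)
```

If the assumed deviances are right, the overall test gives a very small p-value while the nested test does not, which would suggest the model is useful overall but that GradRate and PctTop10 add little once ACTQ3 and StudFac are included. The same nested comparison should also be available from `anova(mod3_reduced, mod3, test = "Chisq")`.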
- Is there evidence that ACTQ3 is a useful predictor in the model? Include all details of the appropriate hypothesis test.
summary(mod3)
- Construct a 95% confidence interval for the population coefficient on student/faculty ratio. Use that confidence interval to construct and interpret a 95% confidence interval for the odds multiplier associated with student/faculty ratio.
exp( confint(mod3, level = 0.95) )
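For a `glm` fit, `confint()` returns profile-likelihood intervals, which can differ slightly from the "estimate plus or minus 1.96 standard errors" (Wald) interval computed by hand. The sketch below shows the Wald version using `confint.default()`. Since the colleges data are not reproduced here, it uses a stand-in model on R's built-in `mtcars` data, so the variables `am` and `wt` are placeholders and not part of this exercise.

```r
# Stand-in model on a built-in dataset (NOT the colleges data)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Wald interval for each coefficient: estimate +/- z * SE
wald_ci <- confint.default(fit, level = 0.95)

# Exponentiate the slope's interval to get an interval for the odds multiplier
or_ci <- exp(wald_ci["wt", ])

wald_ci
or_ci
```

For the exercise itself, the same two steps would apply to the `StudFac` row of the interval for `mod3`.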