18 Logistic Regression Intro
library(tidyverse)
library(ggiraphExtra)
#library(here)
Can we model (predict) whether a college is private or public based on characteristics of the college (but not tuition, since that's just too easy!)?
Import the Colleges.CSV dataset.
#colleges = read_csv(here("data/Colleges.CSV"))
colleges = read_csv("data/Colleges.CSV")
Suppose we use the techniques we know and use a simple linear regression model to predict Private or Public based on the college’s student/faculty ratio:
- Comment on the appropriateness of this statistical model.
colleges$TypeIND = ifelse(colleges$Type=="Private", 1, 0)
colleges$PublicIND = ifelse(colleges$Type=="Public", 1, 0)
colleges
ggplot(data = colleges, aes(x = StudFac, y = TypeIND) ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, color = "red")
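One way to probe the appropriateness question is to compare a straight-line fit with a logistic fit on a small 0/1 response. The sketch below uses made-up toy vectors (`x` and `y` are hypothetical, not the Colleges data) and only illustrates the general concern that a linear fit to an indicator variable is not constrained to the interval from 0 to 1.

```r
# Toy data (hypothetical, not the Colleges dataset): a 0/1 response that
# tends to switch from 0 to 1 as x increases, with some overlap.
x <- 1:20
y <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1)

# Straight-line fit treats the 0/1 indicator as a numeric response
lin_fit <- lm(y ~ x)

# Logistic fit models the probability that y equals 1
log_fit <- glm(y ~ x, family = "binomial")

# Compare the range of fitted values from each model
range(fitted(lin_fit))   # not restricted to the interval [0, 1]
range(fitted(log_fit))   # fitted probabilities stay between 0 and 1
```

If the linear fit returns values below 0 or above 1, they cannot be read as probabilities, which is one argument for the logistic model used in the rest of this section.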
Write out the population model for using student/faculty ratio to model the probability of being a private college.
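As a sketch of one common way to write this model, let $\pi$ denote the probability that a college is private. The symbols $\pi$, $\beta_0$, and $\beta_1$ are notation assumed here, not taken from the course materials. The log-odds form and the equivalent probability form are:

$$
\log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1\,\text{StudFac}
\quad\Longleftrightarrow\quad
\pi = \frac{e^{\beta_0 + \beta_1\,\text{StudFac}}}{1 + e^{\beta_0 + \beta_1\,\text{StudFac}}}
$$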
Use R to fit the model and report it below.
mod1 = glm(TypeIND ~ StudFac , data=colleges, family="binomial")
summary(mod1)
mod_public = glm(PublicIND ~ StudFac , data=colleges, family="binomial")
summary(mod_public)
- Use the fitted model to “predict” for a student-faculty ratio = 10 and a student-faculty ratio = 20. What do these predicted values represent?
exp( 10.1734 - 0.5774*10 ) / (1+ exp( 10.1734 - 0.5774*10 ))
newx = data.frame(StudFac = c(10,20))
predict.glm(mod1,newx, type="response" )
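The hand calculation and the `predict.glm()` call above should agree up to rounding. The sketch below redoes the arithmetic for both ratios using `plogis()`, base R's inverse-logit function. The coefficients are copied from the hand calculation in this section and are assumed to be the rounded estimates from `mod1`; the variable names are illustrative.

```r
# Coefficients copied from the hand calculation in this section
# (assumed to be the rounded estimates from mod1)
b0 <- 10.1734
b1 <- -0.5774

# Student/faculty ratios of interest
stud_fac <- c(10, 20)

# Linear predictor = estimated log-odds of being private
log_odds <- b0 + b1 * stud_fac

# plogis() is the inverse logit: exp(z) / (1 + exp(z))
p_private <- plogis(log_odds)

# Same quantity written out by hand, for comparison
p_by_hand <- exp(log_odds) / (1 + exp(log_odds))

data.frame(StudFac = stud_fac, log_odds, p_private, p_by_hand)
```

Each value in `p_private` is an estimated probability that a college with that student/faculty ratio is private, not a predicted 0/1 label.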
- Plot the model
ggplot(data = colleges, aes(x = StudFac, y = TypeIND) ) +
  geom_point() +
  # geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method="glm",
              method.args=list(family="binomial"),
              se=FALSE)
- Now consider a model that uses both student/faculty ratio and graduation rate to predict the probability that a college is private.
- Write out the appropriate population model (in probability form).
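A sketch of the two-predictor model in probability form, writing $\pi$ for the probability that a college is private (the symbols $\pi$ and $\beta_0, \beta_1, \beta_2$ are notation assumed here):

$$
\pi = \frac{e^{\beta_0 + \beta_1\,\text{StudFac} + \beta_2\,\text{GradRate}}}{1 + e^{\beta_0 + \beta_1\,\text{StudFac} + \beta_2\,\text{GradRate}}}
$$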
- Use R to fit the model and report the fitted equation below.
mod2 = glm(TypeIND ~ StudFac + GradRate,
           data=colleges, family="binomial")
summary(mod2)
- Use the model to predict the probability of being a private college for a school with a student/faculty ratio of 12 and a graduation rate of 68%.
newx2 = data.frame(StudFac = 12, GradRate=68)
predict.glm(mod2,newx2,type="response")
predict.glm(mod2,newx2,type = "link") # log-odds output
?predict.glm
exp( 7.0776 - 0.558*12 + 0.0455*68 ) # estimated odds of being private, about 32.3
32.3/33.3 # probability = odds / (1 + odds)
- Is there evidence that this model is useful for predicting the probability that a college is private? State the hypotheses associated with this research question and conduct the appropriate hypothesis test.
summary(mod2)
# Test Statistic = Null Deviance - Residual Deviance
# df is Null df - Residual df (Should match number of betas in null hypothesis)
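The two comments above describe a drop-in-deviance (likelihood ratio) test of the null hypothesis that all slope coefficients are zero. The sketch below shows how those quantities can be pulled from a fitted `glm` object and turned into a chi-square p-value. It runs on simulated stand-in data (`x1`, `x2`, and `y` are hypothetical), so the numbers it prints say nothing about the Colleges data; the same lines should apply to `mod2` by substituting it for `fit`.

```r
# Simulated stand-in data (hypothetical; not the Colleges dataset)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x1 - 0.8 * x2))

fit <- glm(y ~ x1 + x2, family = "binomial")

# Test statistic = Null Deviance - Residual Deviance
G  <- fit$null.deviance - fit$deviance

# Degrees of freedom = Null df - Residual df (number of slopes tested)
df <- fit$df.null - fit$df.residual

# Upper-tail chi-square p-value
p_value <- pchisq(G, df = df, lower.tail = FALSE)

c(statistic = G, df = df, p.value = p_value)
```

A small p-value is evidence against the null hypothesis, suggesting that at least one predictor is useful for modeling the probability.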
- Compute the AIC for both models considered. Which model would be preferred?
# AIC = Residual Deviance + 2*(k+1)
AIC(mod1)
AIC(mod2)
- Compute the BIC for both models considered. Which model would be preferred?
# BIC = Residual Deviance + ln(n)*(k+1)
BIC(mod1)
BIC(mod2)
summary(mod1)
summary(mod2)
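The AIC and BIC formulas given in the comments above can be checked against R's built-in `AIC()` and `BIC()` functions. The sketch below does this on simulated stand-in data (all variable names and the helper `ic_by_hand()` are hypothetical). It assumes an ungrouped 0/1 response, where the residual deviance equals minus twice the log-likelihood, so the deviance-based formulas match R's values.

```r
# Simulated stand-in data (hypothetical; not the Colleges dataset)
set.seed(2)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.3 + 1.0 * x1))

fit_small <- glm(y ~ x1,      family = "binomial")
fit_large <- glm(y ~ x1 + x2, family = "binomial")

# Hypothetical helper: information criteria from the residual deviance,
# where k is the number of predictors (slopes) in the model
ic_by_hand <- function(fit) {
  k <- length(coef(fit)) - 1
  n <- nobs(fit)
  c(AIC = deviance(fit) + 2 * (k + 1),
    BIC = deviance(fit) + log(n) * (k + 1))
}

# Hand-computed values for both models
rbind(small = ic_by_hand(fit_small), large = ic_by_hand(fit_large))

# Built-in values for comparison
c(AIC(fit_small), AIC(fit_large))
c(BIC(fit_small), BIC(fit_large))
```

For either criterion, the model with the smaller value is preferred. BIC penalizes each additional coefficient by ln(n) rather than 2, so it favors smaller models more strongly once n exceeds about 7.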