18 Logistic Regression Intro
library(tidyverse)
library(ggiraphExtra)
#library(here)
Can we model (predict) whether a college is private or public based on characteristics of the college (but not tuition – that’s just too easy!)?
Import the Colleges.CSV dataset.
#colleges = read_csv(here("data/Colleges.CSV"))
colleges = read_csv("data/Colleges.CSV")
Suppose we use the techniques we already know and fit a simple linear regression model to predict Private or Public based on the college’s student/faculty ratio:
- Comment on the appropriateness of this statistical model.
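To see why a straight line is questionable for a 0/1 response, the sketch below fits both a linear model and a logistic model to simulated data. The data, the seed, and the names `sim`, `lin_fit`, and `logit_fit` are illustrative assumptions; this is not the Colleges dataset.

```r
# Simulated 0/1 outcome (an assumption for illustration; not the Colleges data)
set.seed(1)
sim <- data.frame(x = seq(2, 40, length.out = 200))
sim$y <- rbinom(nrow(sim), size = 1, prob = plogis(10 - 0.58 * sim$x))

# Linear regression treats the 0/1 indicator as a continuous response
lin_fit <- lm(y ~ x, data = sim)

# Logistic regression models the probability that y equals 1
logit_fit <- glm(y ~ x, data = sim, family = "binomial")

# Fitted values from the straight line can leave the [0, 1] range
range(fitted(lin_fit))

# Fitted probabilities from the logistic model stay within [0, 1]
range(fitted(logit_fit))
```

If the first range extends below 0 or above 1, the linear fit is producing values that cannot be interpreted as probabilities.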
colleges$TypeIND = ifelse(colleges$Type=="Private", 1, 0)
colleges$PublicIND = ifelse(colleges$Type=="Public", 1, 0)
ggplot(data = colleges, aes(x = StudFac, y = TypeIND) ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, color = "red")
Write out the population model for using student/faculty ratio to model the probability of being a private college.
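One common way to write this population model is shown below, as a sketch. The notation is an assumption made here: `\pi` is the probability that a college is private, and `\beta_0` and `\beta_1` are the population intercept and slope. The left form is on the log-odds scale and the right form is the equivalent probability form.

```latex
\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1\,\text{StudFac}
\quad\Longleftrightarrow\quad
\pi = \frac{e^{\beta_0 + \beta_1\,\text{StudFac}}}{1 + e^{\beta_0 + \beta_1\,\text{StudFac}}}
```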
Use R to fit the model and report it below.
mod1 = glm(TypeIND ~ StudFac , data=colleges, family="binomial")
summary(mod1)
mod_public = glm(PublicIND ~ StudFac , data=colleges, family="binomial")
summary(mod_public)
- Use the fitted model to “predict” for a student-faculty ratio = 10 and a student-faculty ratio = 20. What do these predicted values represent?
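As a sketch of what these predictions mean, the block below plugs both ratios into the fitted equation, using the coefficient values that appear in the hand calculation in these notes (10.1734 and -0.5774). The names `b0`, `b1`, `log_odds`, and `prob_private` are illustrative. Each result is an estimated probability that a college with that student/faculty ratio is private.

```r
# Coefficients copied from the hand calculation in these notes
b0 <- 10.1734
b1 <- -0.5774

stud_fac <- c(10, 20)

# Linear predictor: estimated log-odds of being private
log_odds <- b0 + b1 * stud_fac

# Inverse logit: exp(eta) / (1 + exp(eta)), the same as plogis(eta)
prob_private <- exp(log_odds) / (1 + exp(log_odds))
prob_private
```

With these coefficients the estimated probability is close to 0.99 at a ratio of 10 and close to 0.20 at a ratio of 20, which should agree with `predict.glm(..., type = "response")` up to rounding of the coefficients.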
exp( 10.1734 - 0.5774*10 ) / (1+ exp( 10.1734 - 0.5774*10 ))
newx = data.frame(StudFac = c(10,20))
predict.glm(mod1,newx, type="response" )
- Plot the model
ggplot(data = colleges, aes(x = StudFac, y = TypeIND) ) +
  geom_point() +
  # geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method="glm",
              method.args=list(family="binomial"),
              se=FALSE)
- Now consider a model that uses both student/faculty ratio and graduation rate to predict the probability that a college is private.
- Write out the appropriate population model (in probability form).
- Use R to fit the model and report the fitted equation below.
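A sketch of the two-predictor population model in probability form follows. The notation is an assumption made here: `\pi` is the probability that a college is private, and `\beta_0`, `\beta_1`, `\beta_2` are the population intercept and slopes.

```latex
\pi = \frac{e^{\beta_0 + \beta_1\,\text{StudFac} + \beta_2\,\text{GradRate}}}
           {1 + e^{\beta_0 + \beta_1\,\text{StudFac} + \beta_2\,\text{GradRate}}}
```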
mod2 = glm(TypeIND ~ StudFac + GradRate,
data=colleges, family="binomial")
summary(mod2)
- Use the model to predict the probability of being a private college for a school with a student/faculty ratio of 12 and a graduation rate of 68%.
newx2 = data.frame(StudFac = 12, GradRate=68)
predict.glm(mod2,newx2,type="response")
predict.glm(mod2,newx2,type = "link") # log-odds output
?predict.glm
# Odds by hand from the fitted coefficients
exp( 7.0776 - 0.558*12 + 0.0455*68 )
# Probability = odds / (1 + odds)
32.3/33.3
- Is there evidence that this model is useful for predicting the probability that a college is private? State the hypotheses associated with this research question and conduct the appropriate hypothesis test.
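One way to set this up, as a sketch: the null hypothesis is that both slopes are zero (H0: beta1 = beta2 = 0) and the alternative is that at least one slope is nonzero. The drop-in-deviance statistic is compared to a chi-square distribution whose degrees of freedom equal the number of slopes tested. The block uses simulated data, and the names `sim`, `fit`, `G`, `df`, and `p_value` are illustrative assumptions.

```r
# Simulated data (an assumption for illustration; not the Colleges data)
set.seed(2)
n <- 300
sim <- data.frame(x1 = runif(n, 5, 30), x2 = runif(n, 30, 100))
sim$y <- rbinom(n, size = 1, prob = plogis(4 - 0.4 * sim$x1 + 0.04 * sim$x2))

fit <- glm(y ~ x1 + x2, data = sim, family = "binomial")

# Test statistic: null deviance minus residual deviance
G <- fit$null.deviance - fit$deviance

# Degrees of freedom: number of slopes set to zero under the null
df <- fit$df.null - fit$df.residual

# Upper-tail chi-square p-value
p_value <- pchisq(G, df = df, lower.tail = FALSE)
c(statistic = G, df = df, p_value = p_value)
```

For the model in these notes, the same quantities are available as `mod2$null.deviance`, `mod2$deviance`, `mod2$df.null`, and `mod2$df.residual`.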
mod2
summary(mod2)
# Test Statistic = Null Deviance - Residual Deviance
# df is Null df - Residual df (Should match number of betas in null hypothesis)
- Compute the AIC for both models considered. Which model would be preferred?
# AIC = Residual Deviance + 2*(k+1)
AIC(mod1)
AIC(mod2)
- Compute the BIC for both models considered. Which model would be preferred?
# BIC = Residual Deviance + ln(n)*(k+1)
BIC(mod1)
BIC(mod2)
summary(mod1)
summary(mod2)
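The AIC and BIC formulas in the comments above can be checked by hand against `AIC()` and `BIC()`. The sketch below does so on simulated data. It relies on the fact that, for a 0/1 response, the residual deviance equals -2 times the log-likelihood. The data and the helper names `aic_by_hand` and `bic_by_hand` are hypothetical, introduced only for illustration.

```r
# Simulated data (an assumption for illustration; not the Colleges data)
set.seed(3)
n <- 300
sim <- data.frame(x1 = runif(n, 5, 30), x2 = runif(n, 30, 100))
sim$y <- rbinom(n, size = 1, prob = plogis(4 - 0.4 * sim$x1 + 0.04 * sim$x2))

fit1 <- glm(y ~ x1, data = sim, family = "binomial")
fit2 <- glm(y ~ x1 + x2, data = sim, family = "binomial")

# k + 1 = number of estimated coefficients (slopes plus intercept)
aic_by_hand <- function(fit) deviance(fit) + 2 * length(coef(fit))
bic_by_hand <- function(fit) deviance(fit) + log(nobs(fit)) * length(coef(fit))

# Each pair should agree; smaller values indicate the preferred model
c(by_hand = aic_by_hand(fit1), builtin = AIC(fit1))
c(by_hand = aic_by_hand(fit2), builtin = AIC(fit2))
c(by_hand = bic_by_hand(fit1), builtin = BIC(fit1))
c(by_hand = bic_by_hand(fit2), builtin = BIC(fit2))
```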