18 Logistic Regression Intro

library(tidyverse)
library(ggiraphExtra)
#library(here)

Can we predict whether a college is private or public from its characteristics (other than tuition, which would be too easy)?

Import the Colleges.CSV dataset.

#colleges = read_csv(here("data/Colleges.CSV"))
colleges = read_csv("data/Colleges.CSV")

Suppose we apply a technique we already know, simple linear regression, to predict whether a college is private or public from its student/faculty ratio:

  1. Comment on the appropriateness of this statistical model.
colleges$TypeIND = ifelse(colleges$Type=="Private", 1, 0)
colleges$PublicIND = ifelse(colleges$Type=="Public", 1, 0)

ggplot(data = colleges, aes(x = StudFac, y = TypeIND) ) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, color = "red")
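To see one concrete problem with the straight-line fit, the sketch below uses a small made-up data set (not the Colleges data; `toy`, `x`, `y`, and `lin_mod` are hypothetical names) in which the 0/1 response switches from 1 to 0 as `x` grows. A least-squares line fitted to such a response can give fitted values below 0 or above 1, which cannot be interpreted as probabilities.

```r
# Toy data (hypothetical, for illustration only): y is 1 for small x, 0 for large x
toy <- data.frame(x = 5:30)
toy$y <- as.numeric(toy$x < 17)

# Ordinary least-squares line fitted to the 0/1 response
lin_mod <- lm(y ~ x, data = toy)

# Fitted values at the ends of the x range fall outside [0, 1]
range(fitted(lin_mod))
```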
  1. Write out the population model for using student/faculty ratio to model the probability of being a private college.

  2. Use R to fit the model and report it below.

mod1 = glm(TypeIND ~ StudFac , data=colleges, family="binomial")
summary(mod1)
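One common way to write the population model that `mod1` estimates is sketched below, in both log-odds and probability form. The symbols are assumptions made for this sketch: $\pi$ denotes the probability that a college is private, and $\beta_0$ and $\beta_1$ are the unknown population intercept and slope.

```latex
\[
\log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1\,\text{StudFac}
\qquad\Longleftrightarrow\qquad
\pi = \frac{e^{\beta_0 + \beta_1\,\text{StudFac}}}{1 + e^{\beta_0 + \beta_1\,\text{StudFac}}}
\]
```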

mod_public = glm(PublicIND ~ StudFac , data=colleges, family="binomial")
summary(mod_public)
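The two fits above differ only in which category is coded as 1. Assuming every college is either Private or Public, `PublicIND` equals `1 - TypeIND`, so the two models should describe the same probabilities with the signs of the coefficients flipped. The sketch below checks this on simulated data (not the Colleges data; `sim`, `y1`, `y0`, `fit1`, and `fit0` are hypothetical names).

```r
# Simulated data (hypothetical; not the Colleges data)
set.seed(1)
sim <- data.frame(x = runif(200, 5, 30))
sim$y1 <- rbinom(200, size = 1, prob = plogis(6 - 0.4 * sim$x))
sim$y0 <- 1 - sim$y1   # indicator for the complementary category

fit1 <- glm(y1 ~ x, data = sim, family = "binomial")
fit0 <- glm(y0 ~ x, data = sim, family = "binomial")

# The two coefficient vectors are negatives of each other
cbind(fit1 = coef(fit1), fit0 = coef(fit0))

# The two fitted probabilities add up to 1 for every observation
range(fitted(fit1) + fitted(fit0))
```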
  1. Use the fitted model to “predict” for student/faculty ratios of 10 and 20. What do these predicted values represent?
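# By hand for StudFac = 10: p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))
exp( 10.1734 - 0.5774*10 ) / (1+ exp( 10.1734 - 0.5774*10 ))
# Same calculation via predict() for StudFac = 10 and 20
newx = data.frame(StudFac = c(10,20))
predict(mod1, newdata = newx, type = "response")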
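The by-hand calculation above is the inverse-logit transformation, which base R provides as `plogis()`. The sketch below reuses the rounded coefficients quoted in the by-hand line (10.1734 and -0.5774) and applies them to both ratios; `b0`, `b1`, `ratio`, `log_odds`, and `prob` are names introduced here for illustration. Each result is an estimated probability that a college with that student/faculty ratio is private.

```r
# Rounded coefficients taken from the by-hand calculation in the text
b0 <- 10.1734
b1 <- -0.5774
ratio <- c(10, 20)            # student/faculty ratios of interest

log_odds <- b0 + b1 * ratio   # linear predictor (log-odds of being private)
prob <- plogis(log_odds)      # inverse logit: exp(eta) / (1 + exp(eta))
prob
```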
  1. Plot the fitted model.
ggplot(data = colleges, aes(x = StudFac, y = TypeIND) ) + 
  geom_point() + 
#  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE)
  1. Now consider a model that uses both student/faculty ratio and graduation rate to predict the probability that a college is private.
  1. Write out the appropriate population model (in probability form).
  2. Use R to fit the model and report the fitted equation below.
mod2 = glm(TypeIND ~ StudFac + GradRate, 
           data=colleges, family="binomial")
summary(mod2)
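For reference, one way to write the two-predictor population model in probability form is sketched below. As an assumption of this sketch, $\pi$ is the probability that a college is private and $\beta_0$, $\beta_1$, $\beta_2$ are the unknown population coefficients.

```latex
\[
\pi = \frac{e^{\beta_0 + \beta_1\,\text{StudFac} + \beta_2\,\text{GradRate}}}
           {1 + e^{\beta_0 + \beta_1\,\text{StudFac} + \beta_2\,\text{GradRate}}}
\]
```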
  1. Use the model to predict the probability of being a private college for a school with a student/faculty ratio of 12 and a graduation rate of 68%.
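newx2 = data.frame(StudFac = 12, GradRate=68)
predict(mod2, newdata = newx2, type = "response")  # predicted probability
predict(mod2, newdata = newx2, type = "link")      # predicted log-odds
?predict.glm
exp( 7.0776 - 0.558*12 + 0.0455*68 )  # odds by hand, about 32.3
32.3/33.3                             # probability = odds / (1 + odds)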
  1. Is there evidence that this model is useful for predicting the probability that a college is private? State the hypotheses associated with this research question and conduct the appropriate hypothesis test.
mod2
summary(mod2)
# Test Statistic = Null Deviance - Residual Deviance


# df = Null df - Residual df (should equal the number of betas set to zero in the null hypothesis)
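The comments above describe a drop-in-deviance (likelihood ratio) test of H0: all slopes are zero against Ha: at least one slope is nonzero. Under H0 the test statistic is approximately chi-square with df equal to the number of slopes tested. A minimal sketch of the calculation follows, using simulated data rather than the Colleges data; `sim`, `full`, `null_mod`, `G`, `df_G`, and `p_value` are hypothetical names.

```r
# Simulated data (hypothetical; not the Colleges data)
set.seed(2)
n <- 300
sim <- data.frame(x1 = runif(n, 5, 30), x2 = runif(n, 20, 100))
sim$y <- rbinom(n, size = 1, prob = plogis(3 - 0.3 * sim$x1 + 0.03 * sim$x2))

full <- glm(y ~ x1 + x2, data = sim, family = "binomial")

# Test statistic: null deviance minus residual deviance
G <- full$null.deviance - full$deviance
# Degrees of freedom: null df minus residual df (here 2, one per slope)
df_G <- full$df.null - full$df.residual
# Upper-tail chi-square p-value
p_value <- pchisq(G, df = df_G, lower.tail = FALSE)
c(G = G, df = df_G, p_value = p_value)

# Cross-check against the built-in comparison with the intercept-only model
null_mod <- glm(y ~ 1, data = sim, family = "binomial")
anova(null_mod, full, test = "Chisq")
```

The same three lines for `G`, `df_G`, and `p_value` should work with `mod2` in place of `full`.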
  1. Compute the AIC for both models considered. Which model would be preferred?
# AIC = Residual Deviance + 2*(k+1), where k is the number of predictors
AIC(mod1)
AIC(mod2)
  1. Compute the BIC for both models considered. Which model would be preferred?
# BIC = Residual Deviance + ln(n)*(k+1), where n is the number of observations
BIC(mod1)
BIC(mod2)
summary(mod1)
summary(mod2)
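The formulas in the comments above can be checked against `AIC()` and `BIC()`. The sketch below does so on simulated data (not the Colleges data; `sim`, `fit`, `k`, `aic_by_hand`, and `bic_by_hand` are hypothetical names). The identity relies on the response being coded 0/1, in which case the residual deviance equals -2 times the log-likelihood.

```r
# Simulated data (hypothetical; not the Colleges data)
set.seed(3)
n <- 250
sim <- data.frame(x1 = runif(n, 5, 30), x2 = runif(n, 20, 100))
sim$y <- rbinom(n, size = 1, prob = plogis(3 - 0.3 * sim$x1 + 0.03 * sim$x2))

fit <- glm(y ~ x1 + x2, data = sim, family = "binomial")
k <- length(coef(fit)) - 1   # number of predictors (coefficients minus intercept)

# By-hand versions of the two criteria
aic_by_hand <- deviance(fit) + 2 * (k + 1)
bic_by_hand <- deviance(fit) + log(n) * (k + 1)

# Compare with the built-in functions
c(by_hand = aic_by_hand, built_in = AIC(fit))
c(by_hand = bic_by_hand, built_in = BIC(fit))
```

For both criteria, the model with the smaller value is preferred.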