Statistical models describe data, and relationships within data, in a simple way. Different models are required for different types of data. Last workshop we looked at the bivariate regression. But what do we do to model data where the outcome is categorical, such as a yes or no response?
To deal with these types of data we need to go beyond the bivariate regression.
Deciding which model to use is challenging. Thankfully, statistical textbooks often have decision trees to guide you. I rather like the book Discovering Statistics using R and I have uploaded the decision tree at https://files.warwick.ac.uk/jamestripp/files/DecisionTree-DiscoveringStatisticsUsingR.pdf for your viewing pleasure.
Note: This decision tree is from a textbook written by a psychologist and so you can replace ANOVA with linear regression in most cases.
In this document, we will use a logistic regression. The decision tree above tells us that logistic regressions are suitable for data with a categorical outcome and one or more predictors which are either categorical or continuous (e.g., predicting yes/no responses using Likert scale responses).
Logistic regression is used when the outcome is categorical - such as a yes or no answer - and the predictor is continuous - such as age or a Likert scale response.
In a logistic regression, the linear regression predictions are passed through a logistic function. That sounds tricky. In reality, the logistic function simply squashes values to between 0 and 1 in an S-shaped (sigmoidal) way. Its inverse, the logit, is called the link function - it links the regression predictions to the data.
To remind ourselves, this is the linear regression equation with one predictor and outcome:
\[ outcome_i = \alpha + \beta \cdot predictor_i + \epsilon_i \]
For a logistic regression this becomes,
\[ P(outcome_i) = \frac{1}{1 + e^{-(\alpha + \beta \cdot predictor_i)}} \]
where the probability of the outcome being 1 is the result of the linear regression passed through a logistic function. Note that there is no additive error term here: the noise enters through the binomial distribution of the outcome itself, which we will meet when we fit the model.
So, what is a logistic function? Simply put, it takes in any value and gives us a result between 0 and 1, following an S shape.
logistic <- function(x){
  1/(1 + exp(-x))
}
# Plot the S-shaped curve over the range -10 to 10
plot(-10:10, logistic(-10:10), type = 'l', xlab = 'x', ylab = 'logistic(x)')
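As an aside, base R already provides this function as plogis(), the cumulative distribution function of the logistic distribution. A quick check that it matches the logistic() function we just defined:
all.equal(logistic(-10:10), plogis(-10:10))  # should be TRUE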
I like to think of it as a quick fix someone put together - possibly in a pub - to make a regression work with those darned binary (0 and 1) variables.
We are going to fit and explore a basic logistic regression model. We need data for this.
Let’s return to our ‘drinking approach’ from the linear regression workshop. Last time we asked our fictional people to play a computer game. I think we were too kind to them. This time, we will ask them to solve a complex maths question. They can either fail or succeed.
Let’s clear our R environment.
rm(list=ls())
and define our hypothetical data.
df.alcohol <- data.frame(alcohol = c(10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100),
result = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0))
What do we have? We can print out the data
df.alcohol
but it probably makes more sense to tidy it up. Those 0s and 1s should be labelled Fail or Succeed; the factor() function lets us do this.
df.alcohol$recode <- factor(df.alcohol$result, labels = c('Fail', 'Succeed'))
We should double-check that worked
df.alcohol
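Another way to check is to look at the structure of the data frame; str() shows each column's type, including the new factor:
str(df.alcohol)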
We can also plot the data.
plot(df.alcohol$alcohol, df.alcohol$result)
or to look at it another way,
table(df.alcohol$alcohol, df.alcohol$recode)
##
## Fail Succeed
## 10 0 2
## 20 0 2
## 30 0 2
## 40 0 2
## 50 0 2
## 60 2 0
## 70 0 2
## 80 2 0
## 90 2 0
## 100 2 0
What do you think is happening? Does alcohol free the mind and enable us to overcome the complexity of maths? Or is it an inhibitor of all things numeric?
A logistic regression is a generalized linear model. We have taken our basic linear model and generalized it by putting the predictions through a logistic function. Unsurprisingly, the function we use for a logistic model is glm (generalized linear model). We can fit the model like so:
m1 <- glm(result~alcohol, data = df.alcohol, family = 'binomial')
You may - quite reasonably - be wondering what the family argument does. The logistic function gives us values between 0 and 1. The binomial probability distribution allows us to find the probability of the observed results given these values. In essence, it is what glm uses to fit the model. Exactly why goes beyond this workshop.
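To give a flavour of this, here is a quick sketch using the m1 model fitted above: the log likelihood that glm maximises is simply the sum of the binomial (size 1, i.e. Bernoulli) log probabilities of each observed 0/1 result, given the model's fitted success probabilities.
# Binomial log probability of each observed result, given the fitted probabilities
sum(dbinom(df.alcohol$result, size = 1, prob = fitted(m1), log = TRUE))
logLik(m1)  # should agree with the value above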
Does alcohol help bring out the genius in us all?
summary(m1)
##
## Call:
## glm(formula = result ~ alcohol, family = "binomial", data = df.alcohol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.46407 -0.27804 0.05822 0.27672 1.45970
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.42521 3.95534 2.130 0.0332 *
## alcohol -0.12954 0.05975 -2.168 0.0301 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.920 on 19 degrees of freedom
## Residual deviance: 10.028 on 18 degrees of freedom
## AIC: 14.028
##
## Number of Fisher Scoring iterations: 6
Alcohol does make a significant difference! A few points to make about this output: the coefficients are on the log-odds (logit) scale, and the negative alcohol coefficient tells us that more alcohol lowers the odds of succeeding. If we want, we can pass the coefficients back through the logistic function to put them on the probability scale:
# Convert log-odds (logits) into probabilities
logit2prob <- function(logit){
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}
logit2prob(coef(m1))
## (Intercept) alcohol
## 0.9997808 0.4676593
Though most people do not do this, and the resulting values are tricky to interpret.
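A more common transformation, for what it is worth, is to exponentiate the coefficients to obtain odds ratios; the exponentiated alcohol coefficient is the multiplicative change in the odds of succeeding for each extra ml of alcohol.
exp(coef(m1))  # odds ratios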
Note: You are very welcome to skip the next few paragraphs and go on to the plots.
We can calculate something like R squared for logistic regression. The statistic is called McFadden's R squared. We can use the code shown here and, for those geeks like me who like lists, there is a list of the commonly encountered pseudo R squared values here.
We take the ratio of a ‘measure of model fit’ (the log likelihood; not exactly a measure of fit, but close) for our model with and without alcohol as a predictor. If this ratio is 1 then the two models fit the data equally well, and 1 minus the ratio is 0. Better fits produce log likelihoods closer to zero, so a better fitting model (vs the null model) makes the ratio smaller and gives us values approaching 1.
mod <- glm(result~alcohol, data = df.alcohol, family = binomial)  # model with alcohol as a predictor
nullmod <- glm(result~1, data = df.alcohol, family = binomial)    # intercept-only (null) model
1 - logLik(mod)/logLik(nullmod)                                   # McFadden's R squared
## 'log Lik.' 0.6275107 (df=2)
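As a small cross-check: for 0/1 data the deviance is just -2 times the log likelihood, so the same value can be read off from the deviances reported by summary(m1).
1 - m1$deviance/m1$null.deviance  # should match the value above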
Above we talked about the equation for a logistic regression. Putting in our variable names it should be,
\[ P(succeed_i) = \frac{1}{1 + e^{-(\alpha + \beta \cdot alcohol_i)}} \]
So we will put our money where our mouth is, so to speak. The predictions of the model are
m1$fitted.values
## 1 2 3 4 5 6
## 0.99919974 0.99919974 0.99708316 0.99708316 0.98942777 0.98942777
## 7 8 9 10 11 12
## 0.96243735 0.96243735 0.87523077 0.87523077 0.65759268 0.65759268
## 13 14 15 16 17 18
## 0.34460204 0.34460204 0.12583590 0.12583590 0.03791607 0.03791607
## 19 20
## 0.01067452 0.01067452
and we can recreate these by putting the coefficients into our logistic regression equation by hand,
coef(m1)
## (Intercept) alcohol
## 8.4252071 -0.1295437
df.alcohol$pred <- 1/(1+exp(-((-0.1295437*df.alcohol$alcohol) + 8.4252071)))
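As a quick sanity check, we can recompute the predictions using the full-precision coefficients from coef(m1) (pred.check below is just a temporary variable for the comparison) and confirm they match glm's fitted values:
# Recompute the predictions from the coefficients and compare to the fitted values
pred.check <- 1/(1 + exp(-(coef(m1)["(Intercept)"] + coef(m1)["alcohol"] * df.alcohol$alcohol)))
all.equal(unname(pred.check), unname(m1$fitted.values))  # should be TRUE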
Success. We have the correct logistic regression equation.
Enough nerding out, let’s create some nice plots of our data, model and confidence intervals.
Like the standard regression, we can generate our predicted data and confidence interval.
df.new <- data.frame(alcohol = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100))
preds <- predict(m1, df.new, type="response", se.fit=TRUE)
#For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use qnorm(0.975) if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code. - taken from https://druedin.com/2016/01/16/predicted-probabilities-in-r/
predf <- preds$fit # predicted
lower <- preds$fit - (1.96*preds$se.fit) # lower bounds
upper <- preds$fit + (1.96*preds$se.fit) # upper bounds
plot(df.new$alcohol, preds$fit, type="l", ylim=c(0, 1), ylab="Predicted probability of succeeding on the maths question", xlab="Alcohol (ml)", bty="n")
lines(df.new$alcohol, lower, lty=2, col = 'red')  # lower bound of the 95% confidence interval
lines(df.new$alcohol, upper, lty=2, col = 'red')  # upper bound
points(df.alcohol$alcohol, df.alcohol$result, col = 'blue')  # observed 0/1 results