knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache=TRUE)
require(ggplot2)
require(lme4)
Let’s read in the data.
lines <-
"Game Scored N.Attempts
1 4 5
2 5 11
3 5 14
4 5 12
5 2 7
6 7 10
7 6 14
8 9 15
9 4 12
10 1 4
11 13 27
12 5 17
13 6 12
14 9 9
15 7 12
16 3 10
17 8 12
18 1 6
19 18 39
20 3 13
21 10 17
22 1 6
23 3 12"
con <- textConnection(lines)
shaq <- read.csv(con, sep="")
Let’s now fit two models: one model with just a single intercept, and one model with a separate intercept for each game.
shaq$Game <- as.factor(shaq$Game)
fit <- glm(cbind(Scored, N.Attempts-Scored)~1, family=binomial(), data=shaq)
fit2 <- glm(cbind(Scored, N.Attempts-Scored)~Game, family=binomial(), data=shaq)
Now, run a Likelihood-Ratio Test:
anova(fit2, fit, test="LRT")
## Analysis of Deviance Table
##
## Model 1: cbind(Scored, N.Attempts - Scored) ~ Game
## Model 2: cbind(Scored, N.Attempts - Scored) ~ 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 0 0.000
## 2 22 40.021 -22 -40.021 0.01075 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Hypothesis: The probability of Shaq successfully scoring from a free throw does not differ across game days; i.e., the intercept-only model is equivalent to the model adjusting for game day.
Based on the p-value being <0.05 (p=0.01), we reject the null hypothesis. There is evidence to suggest that the probability of Shaq successfully scoring from a free throw differs significantly across game days; i.e., he has both good and bad days.
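As a sanity check, the p-value can be recomputed by hand from the anova table: under the null, the drop in deviance (40.021) is approximately chi-squared distributed with degrees of freedom equal to the difference in model df (22).

```r
# Hand-computed LRT from the deviance drop reported in the anova table.
dev.drop <- 40.021
df.drop  <- 22
pchisq(dev.drop, df = df.drop, lower.tail = FALSE)  # matches the table's p = 0.01075
```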
First, let’s compute the no-pooling estimates. (Note: there are other ways of doing this, and the error bars would look different in that case.)
fit.nopool <- glm(cbind(Scored, N.Attempts-Scored)~Game, family=binomial(link="logit"), data=shaq)
predict.nopool <- predict(fit.nopool, type="response", se.fit=TRUE)
probs.nopool <- data.frame( game = seq(1,23),
probs = predict.nopool$fit,
LB = pmax(0, predict.nopool$fit-predict.nopool$se.fit),
UB = pmin(1, predict.nopool$fit+predict.nopool$se.fit))
ggplot(probs.nopool, aes(x = game, y = probs)) +
geom_point() +
geom_errorbar(aes(ymin=LB, ymax=UB))
Now, let’s compute the complete pooling estimate. Complete pooling means we are estimating just one number – the probability of scoring in any game. Again, other options exist here.
total.scored <- sum(shaq$Scored)
total.attempted <- sum(shaq$N.Attempts)
prob <- total.scored/total.attempted
prob.se <- sqrt(prob*(1-prob)/total.attempted) # binomial SE: sqrt(p(1-p)/n), with n = total attempts
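For a pooled binomial proportion, the standard error is \(\sqrt{p(1-p)/n}\), where n is the total number of attempts across all games. A quick numeric sketch using the column totals from the table (135 scored out of 296 attempted):

```r
# Binomial-proportion standard error for the complete-pooling estimate:
#   se = sqrt(p * (1 - p) / n), with n = total number of attempts.
p  <- 135 / 296                      # total scored / total attempted
se <- sqrt(p * (1 - p) / 296)
round(c(p = p, se = se), 3)          # p ~ 0.456, se ~ 0.029
```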
Let’s display this in the same way as before.
probs.completepool <- data.frame( game = seq(1,23),
probs = rep(prob, 23),
LB = prob-prob.se,
UB = prob+prob.se)
ggplot(probs.completepool, aes(x = game, y = probs)) +
geom_point() +
geom_errorbar(aes(ymin=LB, ymax=UB)) +
ylim(0, 1)
Finally, let’s do partial pooling. Note that glmer doesn’t directly provide standard errors for the random-effect estimates, so we plot point estimates without error bars.
fit.partial <- glmer(cbind(Scored, N.Attempts-Scored)~(1|Game), family=binomial(), data=shaq)
# ranef() returns deviations from the fixed intercept, so add fixef() before inverting the logit
partial.probs <- plogis(fixef(fit.partial) + ranef(fit.partial)$Game$`(Intercept)`)
probs.partial <- data.frame( game = seq(1,23),
probs = partial.probs)
ggplot(probs.partial, aes(x = game, y = probs)) +
geom_point() +
ylim(0, 1)
To make things clearer, let’s plot the error bars for the no-pooling estimates together with the partial pooling estimates.
ggplot(probs.nopool, aes(x = game, y = probs)) +
geom_errorbar(aes(ymin=LB, ymax=UB)) +
geom_point(data=probs.partial, aes(x = game, y = probs))
The no-pooling method estimated Shaq’s free throw success rate independently for each game; we can see that on one day (Game 14) Shaq scored on every attempt, so the no-pooling estimate of his scoring probability was 100%, which is implausible. Partial pooling shrank the estimates toward the overall probability (about 0.46) and made them less variable than the non-pooled estimates. For example, the estimate corresponding to Shaq’s particularly successful shooting day (Game 14) moved closer to the other estimates; all estimates were pulled substantially toward 46%. Complete pooling produced a single estimated probability, 46%.
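The shrinkage of Game 14 can be checked directly. A self-contained sketch, rebuilding the data from the table above and using fixef() plus each game’s random deviation to get fitted probabilities on the response scale:

```r
library(lme4)

# Per-game counts copied from the table above.
scored   <- c(4, 5, 5, 5, 2, 7, 6, 9, 4, 1, 13, 5, 6, 9, 7, 3, 8, 1, 18, 3, 10, 1, 3)
attempts <- c(5, 11, 14, 12, 7, 10, 14, 15, 12, 4, 27, 17, 12, 9, 12, 10, 12, 6, 39, 13, 17, 6, 12)
d <- data.frame(Game = factor(1:23), Scored = scored, N.Attempts = attempts)

m <- glmer(cbind(Scored, N.Attempts - Scored) ~ (1 | Game),
           family = binomial(), data = d)

# Fitted probability per game: fixed intercept plus the game's random deviation.
p.partial <- plogis(fixef(m)[1] + ranef(m)$Game[, 1])
p.partial[14]   # well below the no-pooling estimate of 9/9 = 1
```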
I (Michael) used my ROSI transcript from undergrad to come up with an example. The measurements are my grades in STA, MAT, and CSC classes. The model is
\[grade_i\sim N(\alpha_{j[i]}, \sigma_{grade})\] \[\alpha_j\sim N(\mu_\alpha, \sigma_\alpha)\]
Here, \(grade_i\) is the grade I received in course i. \(\alpha_1\) is the (pooled estimate of the) mean for MAT, \(\alpha_2\) for STA, and \(\alpha_3\) for CSC.
Since I have all the data, I can simply compute \(\sigma_\alpha\) (5.0, mostly due to the Math Specialist courses…), \(\sigma_{grade}\) (8.1), and \(\mu_\alpha\) (left to your imagination :-) ).
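For this normal–normal model, the partial-pooling estimate of \(\alpha_j\) has a closed form: a precision-weighted average of the group mean and \(\mu_\alpha\). A sketch using the variance components quoted above (\(\sigma_\alpha = 5.0\), \(\sigma_{grade} = 8.1\)); the group size n.j = 10 is a made-up illustration, not my actual course count:

```r
# Shrinkage weight for group j with n.j observations:
#   w = (n.j / sigma.grade^2) / (n.j / sigma.grade^2 + 1 / sigma.alpha^2)
# and alpha.hat.j = w * ybar.j + (1 - w) * mu.alpha.
sigma.alpha <- 5.0
sigma.grade <- 8.1
n.j <- 10    # hypothetical number of courses in subject j
w <- (n.j / sigma.grade^2) / (n.j / sigma.grade^2 + 1 / sigma.alpha^2)
round(w, 2)  # ~0.79: with 10 courses, the group mean gets most of the weight
```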