Overdispersed Poisson and Binomial GLM review

Overdispersed Poisson Regression (Qausi-Poisson Regression)

require(Sleuth3)
require(ggplot2)

elephants <- case2201

We can run Quasi-Poisson regression by using family=quasipoisson. This is the same as Poisson regression, but we also estimate the overdispersion

fit <- glm(Matings ~ Age, family= "poisson", data= elephants)
summary(fit)

## 
## Call:
## glm(formula = Matings ~ Age, family = "poisson", data = elephants)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## Age          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

fit <- glm(Matings ~ Age, family= "quasipoisson", data= elephants)
summary(fit)

## 
## Call:
## glm(formula = Matings ~ Age, family = "quasipoisson", data = elephants)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.58201    0.58590  -2.700   0.0102 *  
## Age          0.06869    0.01479   4.645 3.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.157334)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

As you can see, the standard errors are inflated by \(\sqrt{1.15}=1.07\).

Causes of Overdispersion

One possibility is that the distribution simply isn’t Poisson. Let’s generate a distribution with a lot more zeros than you’d see in a Poisson distribution.

ind <- rbinom(100, size=1, prob=.5)
y <- ind*rpois(100, lambda=4)
qplot(y)

summary(glm(y~1, family="quasipoisson"))

## 
## Call:
## glm(formula = y ~ 1, family = "quasipoisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1071  -2.1071  -0.9192   1.0725   2.9918  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7975     0.1106   7.209 1.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.717245)
## 
##     Null deviance: 312.41  on 99  degrees of freedom
## Residual deviance: 312.41  on 99  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

(Note: in this case, it’s more appropriate to use what’s called Zero-Inflated Poisson, but we won’t cover it here)

Another possibility is that we have unobserved covariates, so that what we see is actually a mixture of two Poisson distributions

ind <- rbinom(100, size=1, prob=.5)
y <- ind*rpois(100, lambda=4)+(1-ind)*rpois(100, lambda=6)
qplot(y)

summary(glm(y~1, family="quasipoisson"))

## 
## Call:
## glm(formula = y ~ 1, family = "quasipoisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0984  -0.8832  -0.1427   0.9392   2.0687  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.56862    0.04549   34.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.9932799)
## 
##     Null deviance: 105.37  on 99  degrees of freedom
## Residual deviance: 105.37  on 99  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Binomial family regression

krunnit <- case2101

In the Krunnit data, we have the total number of species found in 1958, and the total number of species found in 1968. When using glm with the Binomial family, we want to give the number of successes and failures, so we use

fit <- glm(cbind(Extinct, AtRisk-Extinct)~log(Area), family=binomial(), data=krunnit)

Aside: Bernoulli(pi) means Prob(success) = pi Prob(failure) = 1-pi We can switch the variables around and the following instead.

fit1 <- glm(cbind(AtRisk-Extinct, Extinct)~log(Area), family=binomial(), data=krunnit)

This would cause the interpretation of coefficients to change.

Previously, we interpreted AtRisk as the number of species that survived, in which case

fit <- glm(cbind(Extinct, AtRisk)~log(Area), family=binomial(), data=krunnit)

would be appropriate.