GLMs – Gaussian(link=identity)

Here’s R’s built-in cars dataset: 50 observations of car speed (in mph) and the distance taken to stop (in ft).

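library(ggplot2)
# qplot() is deprecated in recent ggplot2 releases; ggplot() + geom_point() draws the same scatterplot
ggplot(cars, aes(x = speed, y = dist)) + geom_point()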

Let’s now run a linear regression on it:

summary(lm(dist~speed, data=cars))
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

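The same model, fit as a GLM with a Gaussian family and identity link, gives identical coefficient estimates, standard errors and residual quantiles: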

summary(glm(dist~speed, data=cars, family=gaussian(link=identity)))
## 
## Call:
## glm(formula = dist ~ speed, family = gaussian(link = identity), 
##     data = cars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -29.069   -9.525   -2.272    9.215   43.201  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 236.5317)
## 
##     Null deviance: 32539  on 49  degrees of freedom
## Residual deviance: 11354  on 48  degrees of freedom
## AIC: 419.16
## 
## Number of Fisher Scoring iterations: 2
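The agreement between the two fits can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming only base R and the built-in cars data; the variable names (fit_lm, fit_glm, same_coef, same_rss) are illustrative and not part of the original post.

```r
# Fit the same model both ways on the built-in cars data
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, data = cars, family = gaussian(link = "identity"))

# Coefficients should agree up to numerical tolerance
same_coef <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(same_coef)

# For the Gaussian family, the residual deviance is the residual sum of squares
same_rss <- isTRUE(all.equal(deviance(fit_glm), sum(residuals(fit_lm)^2)))
print(same_rss)
```

Using all.equal() rather than == allows for small floating-point differences between the least-squares solution in lm and the iteratively reweighted fit in glm.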

Here’s another way to get the dispersion parameter, which for the Gaussian family is simply the estimated residual variance:

anova(lm(dist~speed, data=cars))
## Analysis of Variance Table
## 
## Response: dist
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## speed      1  21186 21185.5  89.567 1.49e-12 ***
## Residuals 48  11354   236.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

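As you can see, the MSE (the mean squared error, i.e. the residual sum of squares divided by the residual degrees of freedom: 11354 / 48 ≈ 236.5) is the same as the estimated dispersion parameter. It is also the square of the residual standard error reported by lm (15.38² ≈ 236.5).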
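To make the arithmetic explicit, the same quantity can be computed by hand and compared with what the two model summaries report. This is a sketch under the assumption that both models are fit exactly as above; the names rss, df, mse, disp and sigma2 are introduced here for illustration only.

```r
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, data = cars, family = gaussian(link = "identity"))

# Residual sum of squares and residual degrees of freedom (n minus 2 coefficients)
rss <- sum(residuals(fit_lm)^2)
df  <- df.residual(fit_lm)

# Mean squared error: RSS adjusted for the degrees of freedom used by the fit
mse <- rss / df

# Dispersion estimate from the GLM summary, and squared residual standard error from lm
disp   <- summary(fit_glm)$dispersion
sigma2 <- summary(fit_lm)$sigma^2

print(c(mse = mse, dispersion = disp, sigma2 = sigma2))
```

All three values should match the 236.5317 shown in the GLM summary above.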