Here’s R’s built-in cars dataset (the speed of cars and the distances taken to stop):
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + geom_point()
Let’s now run a linear regression on it:
summary(lm(dist~speed, data=cars))
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The same model, this time fitted as a GLM with a Gaussian family and an identity link:
summary(glm(dist~speed, data=cars, family=gaussian(link=identity)))
##
## Call:
## glm(formula = dist ~ speed, family = gaussian(link = identity),
## data = cars)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 236.5317)
##
## Null deviance: 32539 on 49 degrees of freedom
## Residual deviance: 11354 on 48 degrees of freedom
## AIC: 419.16
##
## Number of Fisher Scoring iterations: 2
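The two summaries above report the same coefficient table. As a rough sketch of how to confirm that programmatically rather than by eye, the snippet below refits both models and compares them; the object names fit_lm and fit_glm are introduced here only for illustration.

```r
# Refit both models so they can be compared directly.
# fit_lm and fit_glm are illustrative names, not from the original text.
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, data = cars, family = gaussian(link = "identity"))

# Point estimates should agree up to numerical tolerance.
same_coef <- all.equal(coef(fit_lm), coef(fit_glm))

# The full coefficient tables (estimate, std. error, t value, p-value)
# should agree as well.
same_table <- all.equal(summary(fit_lm)$coefficients,
                        summary(fit_glm)$coefficients)

print(same_coef)
print(same_table)
```

all.equal() is used instead of identical() because the two fitting routines take different numerical paths (QR least squares versus iteratively reweighted least squares), so tiny floating-point differences are possible.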
Here’s another way to get the dispersion parameter:
anova(lm(dist~speed, data=cars))
## Analysis of Variance Table
##
## Response: dist
##           Df Sum Sq Mean Sq F value   Pr(>F)
## speed      1  21186 21185.5  89.567 1.49e-12 ***
## Residuals 48  11354   236.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
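As you can see, the MSE (the mean squared error, i.e. the residual sum of squares divided by the residual degrees of freedom, 11354 / 48 ≈ 236.5) is the same as the dispersion parameter estimated by glm (236.5317). It is also the square of the residual standard error reported by lm (15.38² ≈ 236.5).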
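The same comparison can be done by hand. The following is a minimal sketch, assuming the models are refitted under the illustrative names fit_lm and fit_glm: it computes the MSE from the raw residuals and checks it against the dispersion stored in the GLM summary.

```r
# Refit both models (fit_lm and fit_glm are illustrative names).
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, data = cars, family = gaussian(link = "identity"))

# Residual sum of squares and residual degrees of freedom from the lm fit.
rss <- sum(residuals(fit_lm)^2)
df  <- df.residual(fit_lm)

# MSE: residual sum of squares adjusted for the degrees of freedom used.
mse <- rss / df

# Dispersion as estimated by summary.glm() for the Gaussian family.
disp <- summary(fit_glm)$dispersion

# For a Gaussian GLM the residual deviance is the residual sum of squares,
# and sigma() returns the residual standard error of the lm fit.
print(all.equal(mse, disp))
print(all.equal(rss, deviance(fit_glm)))
print(all.equal(mse, sigma(fit_lm)^2))
```

Each comparison should print TRUE, which ties together three numbers from the outputs above: the residual deviance (11354), the dispersion (236.5317), and the residual standard error (15.38).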