Question 1(a)

An experiment was conducted to determine whether interviewers display bias against persons with disabilities. An actor recorded 5 different videos of a job interview, with identical scripts, except that in the first video, he appeared in a wheelchair; in the second video, he appeared on crutches; in the third video, he appeared hearing-impaired; in the fourth video, he appeared to have one leg amputated; and in the fifth video, he appeared to have no disabilities.

For each video, 14 different undergraduate students watched it and rated the job applicant on his qualifications (70 students in total participated in the experiment, each providing one score/rating). The scores are summarized in a boxplot below.

library(Sleuth3)
library(lattice)

disc <- case0601
disc$Handicap <- relevel(disc$Handicap, ref="None")  # make "None" the reference level
bwplot(Handicap ~ Score, data = disc)

What are the conditions under which we can perform a One-Way ANOVA F-test? List all the conditions, state whether they seem to be satisfied for the discrimination experiment data based on the boxplot, and briefly state why. If you cannot determine whether a condition was satisfied, say so and briefly explain why.

Solution

  • The responses (within each group) are normally distributed. Yes, probably (arguable): the boxes look symmetric, with most of the probability mass concentrated at the centre.

  • The variances are the same in every group. Yes: the boxplots are all approximately the same size.

  • The measurements are independent. Can't tell: independence cannot be judged from a boxplot; the only way to really make sure the measurements are independent is to properly design the experiment.

Question 1(b)

Here is some R output for the discrimination experiment.

lm(Score~Handicap, data=disc)
## 
## Call:
## lm(formula = Score ~ Handicap, data = disc)
## 
## Coefficients:
##        (Intercept)     HandicapAmputee    HandicapCrutches  
##             4.9000             -0.4714              1.0214  
##    HandicapHearing  HandicapWheelchair  
##            -0.8500              0.4429

Write down the formula for predicting the rating for a new video of an interview, where it is known what disability (if any) the actor in the video appears to have. Define all variables.

Solution

\(y_{new} = 4.9-0.47I_{amp}^{new}+1.02I_{Crutch}^{new}-0.85I_{Hear}^{new}+0.44I_{Wheel}^{new}\)

Definitions:

  • \(y_{new}\) is the predicted rating for the new datapoint

  • \(I_{amp}^{new}\) is 1 if the person appears to be an amputee and 0 otherwise

  • \(I_{Crutch}^{new}\) is 1 if the person appears to be on crutches and 0 otherwise

  • \(I_{Hear}^{new}\) is 1 if the person appears to be hearing-impaired and 0 otherwise

  • \(I_{Wheel}^{new}\) is 1 if the person appears to be using a wheelchair and 0 otherwise

(There were no marks taken off for this, but it’s good to mention that \(I_{amp}^{new}+I_{Crutch}^{new}+I_{Hear}^{new}+I_{Wheel}^{new}\) should be either 1 or 0 for the model to be valid, since we haven’t observed what happens when people have multiple disabilities.)
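As a sanity check, the same prediction can be obtained with predict() (a sketch, using disc as loaded in part (a)):

m <- lm(Score ~ Handicap, data = disc)
predict(m, newdata = data.frame(Handicap = "Crutches"))  # 4.9 + 1.0214 = 5.9214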

Question 1(c)

What was the average rating for the videos where it appeared that the actor is an amputee? Show your work.

Solution

This is the same as the prediction: \(4.9-0.4714 = 4.4286\)
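This can be verified directly from the data (a sketch, again using disc from part (a)):

mean(disc$Score[disc$Handicap == "Amputee"])  # approximately 4.4286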

Question 1(d)

Here is some R output for the discrimination experiment, with some of the values omitted:

> anova(lm(Score~Handicap, data=disc))
Analysis of Variance Table

Response: Score
          Df  Sum Sq Mean Sq F value  Pr(>F)  
Handicap   A  30.521  B        C          D      *
Residuals  E 173.321  G                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Compute all the omitted values. You may use functions such as qt, qf, pf, pnorm, etc. where necessary; if you use them, you don’t have to provide a numerical final answer. Briefly show how you obtained the answers.

Solution

A = Ngroups - 1 = 5 - 1 = 4

B = 30.521/A = 7.63025

E = Npoints - Ngroups = 70 - 5 = 65

G = 173.321/E = 2.6664769

C = B/G = 2.8615788

D = 1 - pf(C, df1=A, df2=E) = pf(C, df1=A, df2=E, lower.tail=F) (either form is acceptable)
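Putting this together in R (a quick sketch; the sums of squares are taken from the table):

A <- 5 - 1                # Ngroups - 1
E <- 70 - 5               # Npoints - Ngroups
B <- 30.521 / A           # Mean Sq for Handicap
G <- 173.321 / E          # Mean Sq for Residuals
C <- B / G                # F value
D <- pf(C, df1 = A, df2 = E, lower.tail = FALSE)  # p-value
c(A, B, E, G, C, D)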

The full ANOVA table:

anova(lm(Score~Handicap, data=disc))
## Analysis of Variance Table
## 
## Response: Score
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## Handicap   4  30.521  7.6304  2.8616 0.03013 *
## Residuals 65 173.321  2.6665                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 1(e)

What conclusion can you draw from the F-test for which the p-value is computed in the ANOVA table? (Note the asterisk in the table, which indicates that the p-value was smaller than 0.05.) Be precise, and state the conclusion without referring to the model – just use English.

Solution

Conclusion: there is evidence that appearing to have (or not to have) different kinds of disabilities makes a difference to the rating at least some of the time.

(Note: we accepted something like “The means are not all the same” as well.)

Question 2(a)

Suppose the experiment described in Question 1 is repeated. State one way in which the ANOVA assumptions may be violated for the data obtained, and provide a plausible scenario (i.e., a story about the actor, raters, etc.) which would lead to the violation of the ANOVA assumptions.

Solution

Lots of options here.

  • Option 1 (non-normality): One of the raters personally knew the actor and really disliked him. The rater gave the actor a rating of 0, which was highly unusual.

  • Option 2 (non-normality, but more interesting): Some of the raters were Arts students. Arts students tend to disagree with each other a lot about everything. They get the rating right on average, but the ratings Arts students give are distributed according to \(N(\mu, 5^2)\), where \(\mu\) is the correct rating. Other raters were Statistics students, who tend to be very precise. The ratings they give were distributed according to \(N(\mu, 0.5^2)\). The resulting distribution of ratings is heavy-tailed:

library(ggplot2)
stats_grades <- rnorm(1000, mean=60, sd=.5)  # precise raters: small variance
arts_grades <- rnorm(1000, mean=60, sd=5)    # high-disagreement raters: large variance
qplot(c(stats_grades, arts_grades))          # histogram of the mixture

qqnorm(c(stats_grades, arts_grades))         # Q-Q plot shows the heavy tails

  • Option 3 (different variances): the actor appeared tired in one of the videos. That made some raters sympathetic (resulting in higher scores) and some raters annoyed (resulting in lower scores).

  • Option 4 (non-independence): the raters were all recruited at the same location on campus, and all belonged to the same program, where the grades are always high. The raters therefore also rated the interviewee highly.

Question 3(a)

The following is a description of a survey that was recently conducted by prvote.com:

We used Google Consumer Surveys to poll 1500 respondents on one of three different phrasings of a possible referendum question, resulting in 500 responses to each question.

The results are as follows:

[Figure: Results from prvote.com]

For the purposes of the following questions, you can ignore the fact that the results were weighted by age and gender.

Error bars (and the precise numerical size of the bars) are shown in the figure on the previous page. Do they represent 70% or 95% confidence intervals? Explain precisely how you arrived at the answer based on the numbers shown in the graph.

Solution

The insight here is that the bar charts display estimates of the parameters of Bernoulli distributions. For example, for Question A (for convenience, let’s relabel the questions in the survey to A, B, and C), a person’s response is distributed according to \(Y_i^{A}\sim Bernoulli(\theta_A)\). The estimate of a Bernoulli distribution parameter is just \(\bar{Y}^A\).

Now, the error bars represent a CI, so each side likely extends either \(1.96\sqrt{Var(\bar{Y}^A)}\) (for a 95% CI) or \(1.04\sqrt{Var(\bar{Y}^A)}\) (for a 70% CI, since qnorm(0.85) \(\approx 1.04\)).

To figure out which it is, compute \(Var(\bar{Y}^A) \approx \bar{Y}^A(1-\bar{Y}^A)/n_A = 0.583(1-0.583)/388 \approx 0.0006\), where \(n_A = 388\) is the effective number of responses for Question A, so that \(SD(\bar{Y}^A)=\sqrt{Var(\bar{Y}^A)}\approx 2.5\%\).

So the error bars for Question 1/Question A, which extend about 5% on either side, are about 2 standard deviations wide on each side, which means we’re looking at 95% CIs.

(The computation for the other two questions is similar. The results for them are 2.6% and 2.5%)
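In R, the computation for Question A looks like this (a sketch; the proportion and the effective sample size are read off the graph):

p_hat <- 0.583                 # proportion answering "yes" to Question A
n <- 388                       # effective number of responses for Question A
sqrt(p_hat * (1 - p_hat) / n)  # SD of the sample proportion: about 0.025, i.e. 2.5%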

Question 3(b)

After conducting the polls, prvote.com analyzed the data, and claimed that Question 1 is different from Question 3, and Question 2 is different from Question 3 (i.e., the probability of saying “yes” was different for the different questions). At 95% confidence, state precisely how you would evaluate the claims based on the graph and numbers on the previous page using only R functions such as pnorm, qnorm, etc. Your answer need not be a valid R program, but it should be very clear how to evaluate the claim in R.

Solution

First, how would you decide whether two means are close? Given the two sample means (and their variances), you compute the variance of the difference between the independent sample means:

\[Var(\bar{Y}^A-\bar{Y}^B) = Var(\bar{Y}^A)+Var(\bar{Y}^B)\]

To construct a 95% CI for the difference, assuming normality, we would use

\[(\bar{Y}^A-\bar{Y}^B)\pm 1.96\sqrt{Var(\bar{Y}^A-\bar{Y}^B)}\]

Imagine first that we only have two means to compare, Question A and Question B. Then we can compute

lower <- (.583-.471)-1.96*sqrt(0.025^2+0.026^2)
higher <- (.583-.471)+1.96*sqrt(0.025^2+0.026^2)
c(lower, higher)
## [1] 0.04130402 0.18269598

and then see if the interval (lower, higher) contains 0 (it doesn’t).

It goes without saying (although you can check) that doing this for Questions B and C would not lead to rejecting the hypothesis that Questions B and C are the same.

But there is a problem here: the study has been conducted, and we’re looking at it after the fact. Just because we saw a significant difference doesn’t mean we get to report it. We need to adjust somehow to reduce the probability of at least one Type I error. Let’s do a Bonferroni adjustment.

We have 3 comparisons, which means we need to construct \(100(1-0.05/3)\%\) CIs. Let’s do that, and compare Questions A and C, the ones that seem the most different:

lower <- (.583-.458)-qnorm(1-0.05/(2*3))*sqrt(0.025^2+0.025^2)
higher <- (.583-.458)+qnorm(1-0.05/(2*3))*sqrt(0.025^2+0.025^2)
c(lower, higher)
## [1] 0.04036003 0.20963997

What about Questions A and B?

lower <- (.583-.471)-qnorm(1-0.05/(2*3))*sqrt(0.025^2+0.026^2)
higher <- (.583-.471)+qnorm(1-0.05/(2*3))*sqrt(0.025^2+0.026^2)
c(lower, higher)
## [1] 0.02565064 0.19834936

Both intervals still exclude 0, so the differences remain significant even after the Bonferroni adjustment.

Question 4(a)

Using only rnorm, write R code to generate a sample from \(\chi^2(3).\)

Solution

This one is straight from the study guide (and also Piazza).

sum(rnorm(3)^2)  # sum of squares of 3 independent N(0,1) draws is distributed as chi-squared(3)

will do it. If you want to be a little bit more elaborate,

rnorm(1)^2+rnorm(1)^2+rnorm(1)^2

also works.

Question 4(b)

Suppose that you observe the measurements \(X_1, X_2, ..., X_{10}\). Assume that \(X_i\sim N(\mu, \sigma^2)\) and that the observations are independent. Write R code to test the Null Hypothesis that \(\sigma = 3\).

Solution

This one was done in lecture. Here is a slightly more formal (and more cumbersome) approach than the one in lecture.

The idea is that \(\sum_{i=1}^{10} \left(\frac{X_i-\bar{X}}{\sigma}\right)^2\sim \chi^2(9)\) (note that \(\sum_{i=1}^{10}(X_i-\bar{X})^2\) is the SSE if you just have the one group.)

So now we have

\[P(x_{.025}(9) \leq \sum_{i=1}^{10}\left(\frac{X_i-\bar{X}}{\sigma}\right)^2 \leq x_{.975}(9))=95\%.\]

Here, \(x_{.025}(9)\) is just qchisq(0.025, df=9).

Invert everything, and get

\[P\left(\frac{1}{x_{.975}(9)} \leq \frac{\sigma^2}{\sum_{i=1}^{10}(X_i-\bar{X})^2} \leq \frac{1}{x_{.025}(9)}\right)=95\%\]

So the 95% CI for \(\sigma^2\) is (using \(\sum_{i=1}^{10}(X_i-\bar{X})^2 = (N-1)s_X^2\))

\[\left(\frac{(N-1)s_X^2}{x_{.975}(9)}, \frac{(N-1)s_X^2}{x_{.025}(9)}\right)\]

Compute it in R:

N <- 10
lower <- (N-1)*var(X)/qchisq(.975, df=9)
higher <- (N-1)*var(X)/qchisq(.025, df=9)
c(lower, higher)  # 95% CI for sigma^2

And now check whether \(3^2 = 9\) is in that CI (equivalently, take square roots and check whether 3 is in the resulting CI for \(\sigma\)).

The informal approach involves moving the \(\chi^2(9)\) around.

\[\sum_{i=1}^{10} \left(\frac{X_i-\bar{X}}{\sigma}\right)^2\sim \chi^2(9)\]

So, kind of:

\[\sum_{i=1}^{10} (X_i-\bar{X})^2/\chi^2(9) "\sim" \sigma^2\]

From there, just plug in qchisq(0.025, df=9) and qchisq(0.975, df=9) to obtain the CI

\[\left(\sum_{i=1}^{10} (X_i-\bar{X})^2/x_{.975}(9),\ \sum_{i=1}^{10} (X_i-\bar{X})^2/x_{.025}(9)\right)\]

Same thing, but perhaps faster.
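Putting the whole test together (a minimal sketch; the data X are simulated here purely for illustration):

set.seed(1)
N <- 10
X <- rnorm(N, mean = 5, sd = 3)   # in practice, X is the observed data

lower <- (N - 1) * var(X) / qchisq(.975, df = N - 1)
higher <- (N - 1) * var(X) / qchisq(.025, df = N - 1)
sqrt(c(lower, higher))            # 95% CI for sigma

Reject the Null Hypothesis that \(\sigma = 3\) at the 5% level if and only if 3 falls outside this interval.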

Question 5(a)

Suppose that \(Y_1, Y_2, ..., Y_{10}\) are i.i.d and \(Y_i\sim Bernoulli(\theta)\). What is the likelihood function of \(\theta\) given \(Y_1, Y_2, ..., Y_{10}\)?

Solution

See lecture
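For reference, a sketch of the lecture derivation: by independence, the likelihood is the product of the individual Bernoulli probability mass functions,

\[L(\theta) = \prod_{i=1}^{10}\theta^{Y_i}(1-\theta)^{1-Y_i} = \theta^{\sum_{i=1}^{10}Y_i}(1-\theta)^{10-\sum_{i=1}^{10}Y_i}.\]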

Question 5(b)

Prove that the Maximum Likelihood Estimate of \(\theta\) is \(\bar{Y}\).

Solution

See lecture
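For reference, a sketch of the lecture argument: maximize the log-likelihood

\[\ell(\theta) = \left(\sum_{i=1}^{10}Y_i\right)\log\theta + \left(10-\sum_{i=1}^{10}Y_i\right)\log(1-\theta).\]

Setting \(\ell'(\theta) = \frac{\sum_i Y_i}{\theta} - \frac{10-\sum_i Y_i}{1-\theta} = 0\) and solving gives \(\hat{\theta} = \frac{1}{10}\sum_{i=1}^{10}Y_i = \bar{Y}\), and \(\ell''(\theta) < 0\) confirms this is a maximum.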

Question 6

Explain how leave-one-out cross-validation is used in order to select the best model for the data. In your answer, describe at least two cost functions, and state precisely how to compute them and use them in the context of leave-one-out cross-validation.

Solution

See lecture
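For reference, a minimal LOOCV sketch in R (the data frame df, the response y, and the candidate formulas are illustrative assumptions, not from the exam): for each candidate model, leave out one point at a time, fit on the remaining points, predict the held-out point, and average the prediction errors under a cost function; pick the model with the smallest cross-validated cost.

loocv <- function(formula, df, cost) {
  errs <- sapply(1:nrow(df), function(i) {
    fit <- lm(formula, data = df[-i, ])        # fit without point i
    pred <- predict(fit, newdata = df[i, ])    # predict the held-out point
    cost(df$y[i], pred)                        # cost of the prediction error
  })
  mean(errs)
}
sq_cost  <- function(y, yhat) (y - yhat)^2     # squared-error cost (cross-validated MSE)
abs_cost <- function(y, yhat) abs(y - yhat)    # absolute-error cost

For example, compare loocv(y ~ x, df, sq_cost) to loocv(y ~ poly(x, 2), df, sq_cost), and keep the model with the smaller cross-validated cost.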

Question 7

Recall that in the Pygmalion Effect dataset, we had platoons, each of which belonged to a company, and each of which either received or did not receive the Pygmalion Effect treatment. In the context of a dataset similar to the Pygmalion Effect dataset, describe a scenario (i.e., a story about the data) where the MSR (Mean Square Regression) is expected to be larger than the MSE (Mean Square Error). You may use any reasonable simplification (for example, you can assume that there are only two companies). Write code to randomly generate a dataset where you expect the MSR to be larger than the MSE. The ANOVA assumptions must still hold.

Solution

Remember that the MSR is the between-group variance, and the MSE is the within-group variance. All we need to do is set up a dataset where the within-group variance is small compared to the between-group variance.

To make things even simpler, imagine that the Pygmalion Effect doesn’t work. One company is very good, perhaps because the soldiers were recruited first, when the army was selective. The other company is not so good, because the soldiers there were recruited at the end of the recruiting season, when the army needed more people and couldn’t be as selective.

Imagine also that all soldiers in both companies learn at more or less the same rate, so that the within-group variance is small.

The easiest way to generate the dataset is using the quick way shown in the Project 1 solutions:

Company <- c("C1", "C1", "C2", "C2")
Treat <- c("Pyg", "Control", "Pyg", "Control")
Score <- c(rnorm(2, mean=90, sd=1), rnorm(2, mean=30, sd=1))

df <- data.frame(Company=Company, Treat=Treat, Score=Score)
df
##   Company   Treat    Score
## 1      C1     Pyg 91.24805
## 2      C1 Control 88.98585
## 3      C2     Pyg 28.60290
## 4      C2 Control 29.71208

Make sure that this worked:

anova(lm(Score~Company, data=df))
## Analysis of Variance Table
## 
## Response: Score
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Company    1 3716.1  3716.1  2341.6 0.0004268 ***
## Residuals  2    3.2     1.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can generate it similarly to how we did in Project 1. No loop is necessary since we need to enter so little data.

dat <- data.frame(matrix(ncol=3, nrow=0))
colnames(dat) <- c("Company", "Treat", "Score")

new_dat <- data.frame(matrix(ncol=3, nrow=1))
colnames(new_dat) <- c("Company", "Treat", "Score")

new_dat["Company"] <- "C1"
new_dat["Treat"] <- "Pyg"
new_dat["Score"] <- rnorm(1, mean=90, sd=1)
dat <- rbind(dat, new_dat)

new_dat["Company"] <- "C1"
new_dat["Treat"] <- "Control"
new_dat["Score"] <- rnorm(1, mean=90, sd=1)
dat <- rbind(dat, new_dat)

new_dat["Company"] <- "C2"
new_dat["Treat"] <- "Pyg"
new_dat["Score"] <- rnorm(1, mean=90, sd=1)
dat <- rbind(dat, new_dat)

new_dat["Company"] <- "C2"
new_dat["Treat"] <- "Control"
new_dat["Score"] <- rnorm(1, mean=90, sd=1)
dat <- rbind(dat, new_dat)

Question 8(a)

Recall that in the Titanic dataset, we predict the survival based on characteristics like sex and class. Recall that the log-odds in Logistic Regression are defined as

\[\log(\frac{\pi_i}{1-\pi_i})=\beta_0+\beta_1 x^{(i)}_1 + ... + \beta_k x^{(i)}_k\]

Consider the R output below:

titanic <- read.csv("titanic.csv")
titanic <- titanic[!is.na(titanic$age),]

m1 <- glm(survived~sex, family=binomial, data=titanic)
m1
## 
## Call:  glm(formula = survived ~ sex, family = binomial, data = titanic)
## 
## Coefficients:
## (Intercept)      sexmale  
##       1.112       -2.467  
## 
## Degrees of Freedom: 1045 Total (i.e. Null);  1044 Residual
## Null Deviance:       1415 
## Residual Deviance: 1102  AIC: 1106
m2 <- glm(survived~sex+pclass, family=binomial, data=titanic)
m2
## 
## Call:  glm(formula = survived ~ sex + pclass, family = binomial, data = titanic)
## 
## Coefficients:
## (Intercept)      sexmale       pclass  
##      3.0043      -2.5278      -0.8575  
## 
## Degrees of Freedom: 1045 Total (i.e. Null);  1043 Residual
## Null Deviance:       1415 
## Residual Deviance: 1014  AIC: 1020

Without using anova or similar functions, write R code to perform a Likelihood Ratio Test to determine whether pclass is useful for predicting survival. Make it as clear as possible how you would use the output of the code to draw conclusions.

Solution

Just subtract the residual deviances and the residual degrees of freedom. Under the Null Hypothesis that pclass is not useful, the difference in deviances is distributed according to \(\chi^2(df_{diff})\), where \(df_{diff} = 1044-1043 = 1\):

1-pchisq(1102-1014, df=1)
## [1] 0

The p-value is effectively 0, far below 0.05, so we reject the Null Hypothesis and conclude that pclass is useful for predicting survival.

Question 8(b)

What is the interpretation of the coefficient that corresponds to pclass?

Solution

Keeping sex constant, if you go from first class to second class, or from second class to third class, the log-odds of survival go down by 0.8575. (Equivalently, the odds of survival are multiplied by \(\exp(-0.8575)\approx 0.42\).)
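A quick check of the odds multiplier:

exp(-0.8575)  # each one-unit increase in pclass multiplies the odds of survival by about 0.42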

(Note: as I implied during the midterm, if you assume pclass is categorical and the interpretation is reasonable, you won’t lose marks.)

Question 8(c)

There are 658 men and 388 women in the dataset. How many people in the dataset survived? Show your work.

Solution

We need to work out the probabilities of survival for men and women. Those can be obtained by looking at the logistic regression where the only covariate is sex.

The probability for men is \(1/(1+\exp(-1.112+2.467))=0.2050541\) (the log-odds for men are \(1.112-2.467=-1.355\)), and the probability for women is \(1/(1+\exp(-1.112))=0.7525018\). Now, multiply by the numbers of men and women:

388*1/(1+exp(-1.112))+658*1/(1+exp(-1.112+2.467))
## [1] 426.8963

Verify this:

sum(titanic$survived==1)
## [1] 427

The imprecision is because we only used the log-odds to within 4 significant digits.