\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 2101 Assignment 6}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/2101f19} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/2101f19}}}
\vspace{1 mm}
\end{center}

\noindent The questions on this assignment are not to be handed in. They are practice for Quiz Six on Friday November 1st.

% \vspace{3mm}

\begin{enumerate}
\item In the following regression model, the explanatory variables $X_1$ and $X_2$ are random variables. The true model is
\begin{displaymath}
Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i,
\end{displaymath}
independently for $i= 1, \ldots, n$, where $\epsilon_i \sim N(0,\sigma^2)$. The mean and covariance matrix of the explanatory variables are given by
\begin{displaymath}
E\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right) =
\left( \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right)
\mbox{~~ and ~~}
Var\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right) =
\left( \begin{array}{rr} \phi_{11} & \phi_{12} \\
                         \phi_{12} & \phi_{22} \end{array} \right).
\end{displaymath}
The explanatory variables $X_{i,1}$ and $X_{i,2}$ are independent of $\epsilon_i$. Unfortunately $X_{i,2}$, which has an impact on $Y_i$ and is correlated with $X_{i,1}$, is not part of the data set. Since $X_{i,2}$ is not observed, it is absorbed by the intercept and error term, as follows.
\begin{eqnarray*}
Y_i &=& \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i \\
    &=& (\beta_0 + \beta_2\mu_2) + \beta_1 X_{i,1} + (\beta_2 X_{i,2} - \beta_2 \mu_2 + \epsilon_i) \\
    &=& \beta^\prime_0 + \beta_1 X_{i,1} + \epsilon^\prime_i.
\end{eqnarray*}
The primes just denote a new $\beta_0$ and a new $\epsilon_i$. It was necessary to add and subtract $\beta_2 \mu_2$ in order to obtain $E(\epsilon^\prime_i)=0$. And of course there could be more than one omitted variable. They would all get swallowed by the intercept and error term, the garbage bins of regression analysis.
\begin{enumerate}
\item What is $Cov(X_{i,1},\epsilon^\prime_i)$?
\item Calculate the variance-covariance matrix of $(X_{i,1},Y_i)$ under the true model. Is it possible to have non-zero covariance between $X_{i,1}$ and $Y_i$ when $\beta_1=0$?
\item Suppose we want to estimate $\beta_1$. The usual least squares estimator is
\begin{displaymath}
\widehat{\beta}_1 = \frac{\sum_{i=1}^n(X_{i,1}-\overline{X}_1)(Y_i-\overline{Y})}
                         {\sum_{i=1}^n(X_{i,1}-\overline{X}_1)^2}.
\end{displaymath}
You may just use this formula; you don't have to derive it. Is $\widehat{\beta}_1$ a consistent estimator of $\beta_1$ if the true model holds? Answer Yes or No and show your work. You may use the consistency of the sample variance and covariance without proof.
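You are not asked to hand in any computer work, but a small simulation can suggest the answer before you prove it. Here is a minimal sketch in R; the sample size and parameter values below are arbitrary choices for illustration, not part of the problem.
\begin{verbatim}
# Simulate the true model, then fit the regression that omits X2
set.seed(9999)
n <- 100000
beta0 <- 1; beta1 <- 1; beta2 <- 1
x1 <- rnorm(n)                      # Var(X1) = 1
x2 <- 0.5*x1 + sqrt(0.75)*rnorm(n)  # Var(X2) = 1, Cov(X1,X2) = 0.5
epsilon <- rnorm(n)
y <- beta0 + beta1*x1 + beta2*x2 + epsilon
coef(lm(y ~ x1))   # Compare the slope for x1 to beta1
\end{verbatim}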
\item Are there \emph{any} points in the parameter space for which $\widehat{\beta}_1 \stackrel{p}{\rightarrow} \beta_1$ when the true model holds?
\end{enumerate}

\item Independently for $i = 1, \ldots, n$, let $Y_i = \beta X_i + \epsilon_i$, where $X_i \sim N(\mu,\sigma^2_x)$ and $\epsilon_i \sim N(0,\sigma^2_\epsilon)$. Because of omitted variables that influence both $X_i$ and $Y_i$, we have $Cov(X_i,\epsilon_i) = c \neq 0$.
\begin{enumerate}
\item The least squares estimator of $\beta$ is $\frac{\sum_{i=1}^n X_iY_i}{\sum_{i=1}^n X_i^2}$. Is this estimator consistent? Answer Yes or No and prove your answer.
\item Give the parameter space for this model. There are some constraints on $c$.
\item First consider points in the parameter space where $\mu \neq 0$. Give an estimator of $\beta$ that converges almost surely to the right answer for that part of the parameter space. If you are not sure how to proceed, try calculating the expected value and covariance matrix of $(X_i,Y_i)$.
\item What happens in the rest of the parameter space --- that is, where $\mu=0$? Is a consistent estimator possible there? So we see that parameters may be identifiable in some parts of the parameter space but not all.
\end{enumerate}

\item We know that omitted explanatory variables are a big problem, because they induce non-zero covariance between the explanatory variables and the error terms $\epsilon_i$. The residuals have a lot in common with the $\epsilon_i$ terms in a regression model, though they are not the same thing. A reasonable idea is to check for correlation between explanatory variables and the $\epsilon_i$ values by looking at the correlation between the residuals and explanatory variables. Accordingly, for a multiple regression model with an intercept so that $\sum_{i=1}^ne_i=0$, calculate the sample correlation $r$ between explanatory variable $j$ and the residuals $e_1, \ldots, e_n$. Use this formula for the correlation:
$r = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}
          {\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2} \sqrt{\sum_{i=1}^n (y_i-\overline{y})^2}}$.
Simplify. What can the sample correlations between residuals and $x$ variables tell you about the correlation between $\epsilon$ and the $x$ variables?

% \pagebreak
\item This question explores the consequences of ignoring measurement error in the response variable. Independently for $i=1, \ldots, n$, let
\begin{eqnarray}
Y_i &=& \beta_0 + \beta_1 X_i + \epsilon_i \nonumber \\
V_i &=& Y_i + e_i, \nonumber
\end{eqnarray}
where $Var(X_i)=\phi$, $E(X_i) = \mu_x$, $Var(e_i)=\omega$, $Var(\epsilon_i)=\psi$, and $X_i, e_i, \epsilon_i$ are all independent. The explanatory variable $X_i$ is observable, but the response variable $Y_i$ is latent. Instead of $Y_i$, we can see $V_i$, which is $Y_i$ plus a piece of random noise. Call this the \emph{true model}.
\begin{enumerate}
\item Make a path diagram of the true model.
\item Strictly speaking, the distributions of $X_i, e_i$ and $\epsilon_i$ are unknown parameters because they are unspecified. But suppose we are interested in identifying just the Greek-letter parameters. Does the true model pass the test of the Parameter Count Rule? Answer Yes or No and give the numbers.
\item Calculate the variance-covariance matrix of the observable variables as a function of the model parameters. Show your work.
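Again, no computer work is required, but you can check your covariance matrix numerically with a simulation along these lines. This is only a sketch; the parameter values are arbitrary choices for illustration.
\begin{verbatim}
# Measurement error in the response: we observe X and V = Y + e
set.seed(9999)
n <- 100000
beta0 <- 1; beta1 <- 2
phi <- 4; mu_x <- 10; psi <- 1; omega <- 9
x <- rnorm(n, mean = mu_x, sd = sqrt(phi))
epsilon <- rnorm(n, mean = 0, sd = sqrt(psi))
e <- rnorm(n, mean = 0, sd = sqrt(omega))
y <- beta0 + beta1*x + epsilon   # Latent response
v <- y + e                       # What we actually observe
var(cbind(x, v))   # Compare to your formula for the covariance matrix
\end{verbatim}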
\item Suppose that the analyst assumes that $V_i$ is the same thing as $Y_i$, and fits the naive model $V_i = \beta_0 + \beta_1 X_i + \epsilon_i$, in which
\begin{displaymath}
\widehat{\beta}_1 = \frac{\sum_{i=1}^n(X_i-\overline{X})(V_i-\overline{V})}
                         {\sum_{i=1}^n(X_i-\overline{X})^2}.
\end{displaymath}
Assuming the \emph{true} model (not the naive model), is $\widehat{\beta}_1$ a consistent estimator of $\beta_1$? Answer Yes or No and show your work.
\item Why does this prove that $\beta_1$ is identifiable?
\end{enumerate}

\item \label{randiv} This question explores the consequences of ignoring measurement error in the explanatory variable when there is only one explanatory variable. Independently for $i = 1, \ldots, n$, let
\begin{eqnarray*}
Y_i & = & \beta X_i + \epsilon_i \\
W_i & = & X_i + e_i
\end{eqnarray*}
where all random variables are normal with expected value zero, $Var(X_i)=\phi>0$, $Var(\epsilon_i)=\psi>0$, $Var(e_i)=\omega>0$ and $\epsilon_i$, $e_i$ and $X_i$ are all independent. The variables $W_i$ and $Y_i$ are observable, while $X_i$ is latent. Error terms are never observable.
\begin{enumerate}
\item What is the parameter vector $\boldsymbol{\theta}$ for this model?
\item Denote the covariance matrix of the observable variables by $\boldsymbol{\Sigma} = [\sigma_{ij}]$. The unique $\sigma_{ij}$ values are the moments, and there is a covariance structure equation for each one. Calculate the variance-covariance matrix $\boldsymbol{\Sigma}$ of the observable variables, expressed as a function of the model parameters. You now have the covariance structure equations.
\item Does this model pass the test of the parameter count rule? Answer Yes or No and give the numbers.
\item Are there any points in the parameter space where the parameter $\beta$ is identifiable? Are there infinitely many, or just one point?
\item The naive estimator of $\beta$ is
\begin{displaymath}
\widehat{\beta}_n = \frac{\sum_{i=1}^n W_i Y_i}{\sum_{i=1}^n W_i^2}.
\end{displaymath}
Is $\widehat{\beta}_n$ a consistent estimator of $\beta$? Answer Yes or No. To what does $\widehat{\beta}_n$ converge?
\item Are there any points in the parameter space for which $\widehat{\beta}_n$ converges to the right answer? Compare your answer to the set of points where $\beta$ is identifiable.
\item Suppose the reliability of $W_i$ were known\footnote{As a reminder, the reliability of an observed measurement is the proportion of its variance that comes from the ``true'' latent variable it is measuring. Here, the reliability of $W_i$ is $\frac{\phi}{\phi+\omega}$.}, or to be more realistic, suppose that a good estimate of the reliability were available; call it $r^2_{wx}$. How could you use $r^2_{wx}$ to improve $\widehat{\beta}_n$? Give the formula for an improved estimator of $\beta$.
% \item Because of correlated measurement error, one suspects that many published estimates of reliability are too high. Suppose $r^2_{wx}$ is an overestimate of the true reliability $\rho^2_{wx}$. What effect does this have on your improved estimate of $\beta$?
\end{enumerate}

% The core of this was on the 2015 final, but it's enhanced in 2017.
\item The improved version of $\widehat{\beta}_n$ in the last question is an example of \emph{correction for attenuation} (weakening) caused by measurement error. Here is the version that applies to correlation.
Independently for $i=1, \ldots, n$, let
% Need eqnarray inside a parbox to make it the cell of a table
\begin{tabular}{ccc}
\parbox[m]{1.5in} {
\begin{eqnarray*}
D_{i,1} &=& F_{i,1} + e_{i,1} \\
D_{i,2} &=& F_{i,2} + e_{i,2} \\
&&
\end{eqnarray*}
} % End parbox
&
$cov\left( \begin{array}{c} F_{i,1} \\ F_{i,2} \end{array} \right) =
 \left( \begin{array}{c c} \phi_{11} & \phi_{12} \\
                           \phi_{12} & \phi_{22} \end{array} \right)$
&
$cov\left( \begin{array}{c} e_{i,1} \\ e_{i,2} \end{array} \right) =
 \left( \begin{array}{c c} \omega_1 & 0 \\
                           0 & \omega_2 \end{array} \right)$
\end{tabular}

\noindent To make this concrete, it would be natural for psychologists to be interested in the correlation between intelligence and self-esteem, but what they want to know is the correlation between \emph{true} intelligence and \emph{true} self-esteem, not just the correlation between the score on an IQ test and the score on a self-esteem questionnaire. So for subject $i$, let $F_{i,1}$ represent true intelligence and $F_{i,2}$ represent true self-esteem, while $D_{i,1}$ is the subject's score on an intelligence test and $D_{i,2}$ is the score on a self-esteem questionnaire.
\begin{enumerate}
\item Make a path diagram of this model.
\item Show that $|Corr(D_{i,1},D_{i,2})| \leq |Corr(F_{i,1},F_{i,2})|$. That is, measurement error weakens (attenuates) the correlation.
\item Suppose the reliability of $D_{i,1}$ is $\rho^2_1$ and the reliability of $D_{i,2}$ is $\rho^2_2$. How could you apply $\rho^2_1$ and $\rho^2_2$ to $Corr(D_{i,1},D_{i,2})$, to obtain $Corr(F_{i,1},F_{i,2})$?
\item You obtain a sample correlation between IQ score and self-esteem score of $r = 0.25$, which is disappointingly low. From other data, the estimated reliability of the IQ test is $r^2_1 = 0.90$, and the estimated reliability of the self-esteem scale is $r^2_2 = 0.75$. Give an estimate of the correlation between true intelligence and true self-esteem. The answer is a number.
% 0.25 / sqrt(0.9*0.75) = 0.3042903
\end{enumerate}

% 2015 Final
\item This is a simplified version of the situation where one is attempting to ``control'' for explanatory variables that are measured with error. People do this all the time, and it doesn't work. Independently for $i=1, \ldots, n$, let
\begin{eqnarray*}
Y_i &=& \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i \\
W_i &=& X_{i,1} + e_i,
\end{eqnarray*}
where $V\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right) =
\left( \begin{array}{c c} \phi_{11} & \phi_{12} \\
                          \phi_{12} & \phi_{22} \end{array} \right)$,
$V(\epsilon_i) = \psi$, $V(e_i) = \omega$, all the expected values are zero, and the error terms $\epsilon_i$ and $e_i$ are independent of one another, and also independent of $X_{i,1}$ and $X_{i,2}$. The variable $X_{i,1}$ is latent, while the variables $W_i$, $Y_i$ and $X_{i,2}$ are observable. What people usually do in situations like this is fit a model like $Y_i = \beta_1 W_i + \beta_2 X_{i,2} + \epsilon_i$, and test $H_0: \beta_2 = 0$. That is, they ignore the measurement error in variables for which they are ``controlling.''
\begin{enumerate}
\item Suppose $H_0: \beta_2 = 0$ is true. Does the ordinary least squares estimator
\begin{displaymath}
\widehat{\beta}_2 = \frac{\sum_{i=1}^nW_i^2 \sum_{i=1}^nX_{i,2}Y_i - \sum_{i=1}^nW_iX_{i,2}\sum_{i=1}^nW_iY_i}
                         {\sum_{i=1}^n W_i^2 \sum_{i=1}^n X_{i,2}^2 - (\sum_{i=1}^nW_iX_{i,2})^2 }
\end{displaymath}
converge to the true value of $\beta_2 = 0$ as $n \rightarrow \infty$ everywhere in the parameter space? Answer Yes or No and show your work.
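No computer work is asked for here either, but a simulation can suggest what the algebra will show. Here is a minimal sketch in R with $\beta_2 = 0$; the parameter values, including the large measurement error variance, are arbitrary choices for illustration.
\begin{verbatim}
# "Controlling" for X1 when only W = X1 + e is observable, with beta2 = 0
set.seed(9999)
n <- 100000
beta1 <- 1; beta2 <- 0
x1 <- rnorm(n)                      # Latent explanatory variable
x2 <- 0.5*x1 + sqrt(0.75)*rnorm(n)  # Correlated with x1
w  <- x1 + rnorm(n, sd = 2)         # x1 measured with error
y  <- beta1*x1 + beta2*x2 + rnorm(n)
coef(lm(y ~ -1 + w + x2))   # Is the slope for x2 close to zero?
\end{verbatim}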
\item Under what conditions (that is, for what values of other parameters) does $\widehat{\beta}_2 \stackrel{p}{\rightarrow} 0$ when $\beta_2 = 0$?
\end{enumerate}

% \pagebreak
% 2015 HW
\item Finally we have a solution, though as usual there is a little twist. Independently for $i=1, \ldots, n$, let
\begin{eqnarray*}
Y_{i~~} &=& \beta X_i + \epsilon_i \\
V_{i~~} &=& Y_i + e_i \\
W_{i,1} &=& X_i + e_{i,1} \\
W_{i,2} &=& X_i + e_{i,2}
\end{eqnarray*}
where
\begin{itemize}
\item $Y_i$ is a latent variable.
\item $V_i$, $W_{i,1}$ and $W_{i,2}$ are all observable variables.
\item $X_i$ is a normally distributed \emph{latent} variable with mean zero and variance $\phi>0$.
\item $\epsilon_i$ is normally distributed with mean zero and variance $\psi>0$.
\item $e_{i}$ is normally distributed with mean zero and variance $\omega>0$.
\item $e_{i,1}$ is normally distributed with mean zero and variance $\omega_1>0$.
\item $e_{i,2}$ is normally distributed with mean zero and variance $\omega_2>0$.
\item $X_i$, $\epsilon_i$, $e_i$, $e_{i,1}$ and $e_{i,2}$ are all independent of one another.
\end{itemize}
\begin{enumerate}
\item Make a path diagram of this model.
\item What is the parameter vector $\boldsymbol{\theta}$ for this model?
\item Does the model pass the test of the Parameter Count Rule? Answer Yes or No and give the numbers.
\item Calculate the variance-covariance matrix of the observable variables as a function of the model parameters. Show your work.
\item Is the parameter vector identifiable at every point in the parameter space? Answer Yes or No and prove your answer.
\item Some parameters are identifiable, while others are not. Which ones are identifiable?
\item If $\beta$ (the parameter of main interest) is identifiable, propose a Method of Moments estimator for it and prove that your proposed estimator is consistent.
\item Suppose the sample variance-covariance matrix $\widehat{\boldsymbol{\Sigma}}$ is
\begin{verbatim}
         W1    W2     V
   W1 38.53 21.39 19.85
   W2 21.39 35.50 19.00
   V  19.85 19.00 28.81
\end{verbatim}
Give a reasonable estimate of $\beta$. There is more than one right answer. The answer is a number. (Is this the Method of Moments estimate you proposed? It does not have to be.) \textbf{Circle your answer.}
\item Describe how you could re-parameterize this model to make the parameters all identifiable, allowing you to do maximum likelihood.
\end{enumerate}
\end{enumerate} % End of all the questions

\end{document}

% Ran out of time and went back to birds.
\item \label{bweight} Please start this question by downloading and installing the MASS package in R. Then, \texttt{help(birthwt)} gives some information about the variables in the Birth Weight data set. The cases (rows) are mothers who recently had babies. For the zero-one variables, 1 means Yes and 0 means No. The response variable is \texttt{low}; babies who weigh less than 2.5 kg.~at birth are at risk for a variety of medical problems.
\begin{enumerate}
\item Look at \texttt{summary(birthwt)}. What percentage of babies had low birth weight? What percentage of mothers smoked during pregnancy?
\item
\end{enumerate}