% 431Assignment7.tex Regression with measurement error, some identifiability
\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 2101f19 Assignment Seven}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/2101f19} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/2101f19}}}
\vspace{1 mm}
\end{center}

\noindent This assignment is about the double measurement design. See lecture Slide Sets 18, 19 and 20, and Section 0.11 (pages 63-104) in Chapter Zero. The non-computer questions on this assignment are for practice, and will not be handed in. For the R part of this assignment (Question~\ref{Rpig}), please bring a hard copy of your input and output to the quiz.

\begin{enumerate}
\item The point of this question is that when the parameters of a model are identifiable, the number of covariance structure equations minus the number of parameters equals the number of model-induced equality constraints on $\boldsymbol{\Sigma}$. It is these equality constraints that are being tested by the chi-squared test for goodness of fit.
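To see the counting principle in a toy example (this is \emph{not} the model in this assignment), suppose a single latent variable is measured twice with independent measurement errors of equal variance: $W_1 = X + e_1$ and $W_2 = X + e_2$, with $Var(X) = \phi$ and $Var(e_1) = Var(e_2) = \omega$. Then
\begin{displaymath}
\boldsymbol{\Sigma} = \left( \begin{array}{cc} \phi+\omega & \phi \\
                                               \phi & \phi+\omega \end{array} \right),
\end{displaymath}
so there are three covariance structure equations, two parameters, and one model-induced equality constraint ($\sigma_{11} = \sigma_{22}$): $3 - 2 = 1$. The parameters are identifiable, since $\phi = \sigma_{12}$ and $\omega = \sigma_{11} - \sigma_{12}$.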
In the lecture notes, look at the matrix formulation and discussion of double measurement regression starting on Slide 6 of lecture unit 19. The latent vector $\mathbf{X}_i$ is $p \times 1$, and the latent vector $\mathbf{Y}_i$ is $q \times 1$. As usual, expected values and intercepts are not identifiable, so confine your attention to $\boldsymbol{\Sigma} = [\sigma_{ij}]$, the covariance matrix of the observable data.
% In this question, watch out for Stage 1 and Stage 2. I intended to switch the order (2013). But I didn't: 2015, at least not in lecture. It's okay in 2017.
\begin{enumerate}
\item Here's something that will help with the calculations in this problem. If a covariance matrix is $n \times n$,
\begin{enumerate}
\item How many unique covariances are there?
\item How many unique variances and covariances are there in total? Factor and simplify.
\end{enumerate}
\item For the present problem, what are the dimensions of $\boldsymbol{\Sigma}$? Give the number of rows and the number of columns. It's an expression in $p$ and $q$.
\item \label{howmanysigmas} How many unique variances and covariances ($\sigma_{ij}$ quantities) are there in $\boldsymbol{\Sigma}$ when there are no model-induced constraints? The answer is an expression in $p$ and $q$.
% \item List the parameter matrices that appear in $\boldsymbol{\Sigma}$.
\item Denoting $cov(\mathbf{F}_i)$ by $\boldsymbol{\Phi} = [\phi_{ij}]$, how many unique variances and covariances ($\phi_{ij}$ quantities) are there in $\boldsymbol{\Phi} = cov(\mathbf{F}_i)$ if there are no model-induced equality constraints? The answer is an expression in $p$ and $q$.
\item In total, how many unknown parameters are there in the Stage One parameter matrices $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$? The answer is an expression in $p$ and $q$. Is this the same as your last answer?
If so, it means that at the first stage, if the parameters are identifiable from $\boldsymbol{\Phi}$, they are \emph{just identifiable} from $\boldsymbol{\Phi}$.
\item Still in Stage One (the latent variable model), show the details of how the parameter matrices $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$ can be recovered from $\boldsymbol{\Phi}$. Start by calculating $\boldsymbol{\Phi}$ as a function of $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$. Once you have done this, you will have shown that the function relating $(\boldsymbol{\Phi}_x, \boldsymbol{\beta}_1, \boldsymbol{\Psi})$ to $\boldsymbol{\Phi}$ is one-to-one (injective).
\item In Stage Two (the measurement model), the parameters are in the matrices $\boldsymbol{\Phi}$, $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$. How many unique parameters are there? The answer is an expression in $p$ and $q$.
\item \label{nconstr} By inspecting the expression for $\boldsymbol{\Sigma}$ on Slide 11 of lecture unit 19, state the number of equality constraints that are imposed on $\boldsymbol{\Sigma}$ by the model. The answer is an expression in $p$ and $q$.
\item Show that the number of parameters plus the number of constraints is equal to the number of unique variances and covariances in $\boldsymbol{\Sigma}$. This is a brief calculation using your answers to~\ref{howmanysigmas} and the last two questions.
\end{enumerate}
% \newpage
\item Here is a one-stage formulation of the double measurement regression model. % See the text for some discussion.
Independently for $i=1, \ldots, n$, let
\begin{eqnarray*}
\mathbf{W}_{i,1} & = & \mathbf{X}_i + \mathbf{e}_{i,1} \\
\mathbf{V}_{i,1} & = & \mathbf{Y}_i + \mathbf{e}_{i,2} \\
\mathbf{W}_{i,2} & = & \mathbf{X}_i + \mathbf{e}_{i,3} \\
\mathbf{V}_{i,2} & = & \mathbf{Y}_i + \mathbf{e}_{i,4} \\
\mathbf{Y}_i & = & \boldsymbol{\beta} \mathbf{X}_i + \boldsymbol{\epsilon}_i
\end{eqnarray*}
where
\begin{itemize}
\item[] $\mathbf{Y}_i$ is a $q \times 1$ random vector of latent response variables. Because $q$ can be greater than one, the regression is multivariate.
\item[] $\boldsymbol{\beta}$ is a $q \times p$ matrix of unknown constants. These are the regression coefficients, with one row for each response variable and one column for each explanatory variable.
\item[] $\mathbf{X}_i$ is a $p \times 1$ random vector of latent explanatory variables, with expected value zero and variance-covariance matrix $\boldsymbol{\Phi}_x$, a $p \times p$ symmetric and positive definite matrix of unknown constants.
\item[] $\boldsymbol{\epsilon}_i$ is the error term of the latent regression. It is a $q \times 1$ random vector with expected value zero and variance-covariance matrix $\boldsymbol{\Psi}$, a $q \times q$ symmetric and positive definite matrix of unknown constants.
\item[] $\mathbf{W}_{i,1}$ and $\mathbf{W}_{i,2}$ are $p \times 1$ observable random vectors, each representing $\mathbf{X}_i$ plus random error.
\item[] $\mathbf{V}_{i,1}$ and $\mathbf{V}_{i,2}$ are $q \times 1$ observable random vectors, each representing $\mathbf{Y}_i$ plus random error.
\item[] $\mathbf{e}_{i,1}, \ldots, \mathbf{e}_{i,4}$ are the measurement errors in $\mathbf{W}_{i,1}, \mathbf{V}_{i,1}, \mathbf{W}_{i,2}$ and $\mathbf{V}_{i,2}$ respectively.
Collecting the vectors of measurement errors into a single long vector $\mathbf{e}_i$, we may write its covariance matrix as a partitioned matrix:
\begin{equation*}
cov(\mathbf{e}_i) = cov\left(\begin{array}{c} \mathbf{e}_{i,1} \\ \mathbf{e}_{i,2} \\ \mathbf{e}_{i,3} \\ \mathbf{e}_{i,4} \end{array}\right)
= \left( \begin{array}{c|c|c|c}
\boldsymbol{\Omega}_{11} & \boldsymbol{\Omega}_{12} & \mathbf{0} & \mathbf{0} \\ \hline
\boldsymbol{\Omega}_{12}^\top & \boldsymbol{\Omega}_{22} & \mathbf{0} & \mathbf{0} \\ \hline
\mathbf{0} & \mathbf{0} & \boldsymbol{\Omega}_{33} & \boldsymbol{\Omega}_{34} \\ \hline
\mathbf{0} & \mathbf{0} & \boldsymbol{\Omega}_{34}^\top & \boldsymbol{\Omega}_{44}
\end{array} \right) = \boldsymbol{\Omega}.
\end{equation*}
\item[] In addition, the matrices of covariances between $\mathbf{X}_i, \boldsymbol{\epsilon}_i$ and $\mathbf{e}_i$ are all zero.
\end{itemize}
Collecting $\mathbf{W}_{i,1}$, $\mathbf{V}_{i,1}$, $\mathbf{W}_{i,2}$ and $\mathbf{V}_{i,2}$ into a single long data vector $\mathbf{D}_i$, we write its variance-covariance matrix as a partitioned matrix:
\begin{displaymath}
\boldsymbol{\Sigma} = \left( \begin{array}{c|c|c|c}
\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} & \boldsymbol{\Sigma}_{13} & \boldsymbol{\Sigma}_{14} \\ \hline
 & \boldsymbol{\Sigma}_{22} & \boldsymbol{\Sigma}_{23} & \boldsymbol{\Sigma}_{24} \\ \hline
 & & \boldsymbol{\Sigma}_{33} & \boldsymbol{\Sigma}_{34} \\ \hline
 & & & \boldsymbol{\Sigma}_{44}
\end{array} \right),
\end{displaymath}
where the covariance matrix of $\mathbf{W}_{i,1}$ is $\boldsymbol{\Sigma}_{11}$, the covariance matrix of $\mathbf{V}_{i,1}$ is $\boldsymbol{\Sigma}_{22}$, the matrix of covariances between $\mathbf{W}_{i,1}$ and $\mathbf{V}_{i,1}$ is $\boldsymbol{\Sigma}_{12}$, and so on.
\begin{enumerate}
\item \label{sigoftheta} Write the elements of the partitioned matrix $\boldsymbol{\Sigma}$ in terms of the parameter matrices of the model. Be able to show your work for each one.
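As an example of the kind of calculation involved (this is only one block; the others use the same technique), the model equations and the zero covariances between $\mathbf{X}_i$, $\boldsymbol{\epsilon}_i$ and $\mathbf{e}_i$ yield
\begin{displaymath}
\boldsymbol{\Sigma}_{11} = cov(\mathbf{W}_{i,1}) = cov(\mathbf{X}_i + \mathbf{e}_{i,1})
= cov(\mathbf{X}_i) + cov(\mathbf{e}_{i,1}) = \boldsymbol{\Phi}_x + \boldsymbol{\Omega}_{11}.
\end{displaymath}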
\item Prove that all the model parameters are identifiable by solving the covariance structure equations.
\item Give a Method of Moments estimator of $\boldsymbol{\Phi}_x$. Remember, your estimator cannot be a function of any unknown parameters. For a particular sample, will your estimate be in the parameter space? Mine is.
\item Give a Method of Moments estimator for $\boldsymbol{\beta}$. Remember, your estimator cannot be a function of any unknown parameters. There is more than one correct answer. How do you know your estimator is consistent?
% Use $\widehat{\boldsymbol{\Sigma}} \stackrel{p}{\rightarrow} \boldsymbol{\Sigma}$.
\end{enumerate}
% that is \emph{not} the MLE added after the question was assigned in 2013. But in 2015 I specified MOM instead.
\item Question \ref{Rpig} (the R part of this assignment) will use the \emph{Pig Birth Data}. As part of a much larger study, farmers filled out questionnaires about various aspects of their farms. Some questions were asked twice, on two different questionnaires several months apart. Buried in all the questions were
\begin{itemize}
\item Number of breeding sows (female pigs) at the farm on June 1st
\item Number of sows giving birth later that summer.
\end{itemize}
There are two readings of these variables, one from each questionnaire. We will assume (maybe incorrectly) that because the questions were buried in a lot of other material and were asked months apart, errors of measurement are independent between the two questionnaires. However, errors of measurement might be correlated within a questionnaire.
\begin{enumerate}
\item Propose a reasonable model for these data, using the usual notation. Give all the details. You may assume normality if you wish. Remember, measurement error could be correlated within questionnaires.
\item Make a path diagram of the model you have proposed.
\item Of course it is hopeless to identify the expected values and intercepts, so we will concentrate on the covariance matrix.
Calculate the covariance matrix of one observable data vector $\mathbf{D}_i$. Compare your answer to~\ref{sigoftheta}.
\item Even though you have a general result that applies to this case, prove that all the parameters in the covariance matrix are identifiable.
\item If there are any equality constraints on the covariance matrix, say what they are.
\item Based on your answer to the last question, how many degrees of freedom should there be in the chi-squared test for model fit? Does this agree with your answer to Question~\ref{nconstr}?
\item \label{mombetahat} Give a consistent estimator of $\beta$ that is \emph{not} the MLE, and explain why it's consistent. You may use the consistency of sample variances and covariances without proof. Your estimator \emph{must not} be a function of any unknown parameters.
\end{enumerate}
% \pagebreak
\item \label{Rpig} The Pig Birth Data are given in the file \href{http://www.utstat.toronto.edu/~brunner/data/legal/openpigs.data.txt} {\texttt{openpigs.data.txt}}. No doubt you will have to edit the data file to strip off the information at the top. There are $n=114$ farms; please verify that you are reading the correct number of cases.
\begin{enumerate}
\item Start by reading the data and producing a correlation matrix of all the observable variables.
\item Use \texttt{lavaan} to fit your model, and look at \texttt{summary}. If you experience numerical problems, you are doing something differently from the way I did it. When I fit a good model, everything was fine. When I fit a poor model, there was trouble.
% With no covariances between error terms, psi tried to go negative.
Just to ensure we are fitting the same model, my log likelihood (obtained with the \texttt{logLik} function) was \texttt{-1901.717}.
\item Does your model fit the data adequately? Answer Yes or No and give three numbers: a chi-squared statistic, the degrees of freedom, and a $p$-value.
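Here is a minimal sketch of one way to start, covering the reading of the data and a \texttt{lavaan} fit. It is not the official solution: the variable names \texttt{w1}, \texttt{v1}, \texttt{w2}, \texttt{v2} are assumptions (check the header of the data file for the real ones), and the model shown is just one double-measurement specification along the lines of the previous question.
\begin{verbatim}
# Sketch only. Variable names w1, v1, w2, v2 are assumptions;
# w = breeding sows, v = sows giving birth, 1 and 2 = questionnaires.
library(lavaan)
pigs <- read.table("openpigs.data.txt", header = TRUE) # after stripping top matter
dim(pigs)            # Should show 114 rows
round(cor(pigs), 3)  # Correlation matrix of the observable variables
dmodel <- 'X =~ 1*w1 + 1*w2    # Latent true number of breeding sows
           Y =~ 1*v1 + 1*v2    # Latent true number giving birth
           Y ~ X               # Latent regression
           w1 ~~ v1            # Correlated measurement error, questionnaire 1
           w2 ~~ v2            # Correlated measurement error, questionnaire 2
          '
fit <- sem(dmodel, data = pigs)
summary(fit)
logLik(fit)          # Compare to the log likelihood quoted above
\end{verbatim}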
% G^2 = 0.087, df = 1, p = 0.768
\item \label{betahat} For each additional breeding sow present in June, the estimated number giving birth that summer goes up by \underline{\hspace{10mm}}. Your answer is a single number from \texttt{summary}. It is not an integer.
% betahat = 0.757
\item Using your answer to Question~\ref{mombetahat}, give a \emph{numerical} version of your consistent estimate of $\beta$. How does it compare to the MLE?
% MOM = 0.764, vs. MLE = 0.757. Pretty good!
\item Give a large-sample confidence interval for your answer to \ref{betahat}. Note that $\sqrt{n}$ is already built into the inverse of the Hessian, so you don't need to multiply by it again. Using all the accuracy available, my lower confidence limit is \texttt{0.6510766}.
\item Recall that the reliability of a measurement is the proportion of its variance that does \emph{not} come from measurement error. What is the estimated reliability of the number of breeding sows from questionnaire two? The answer is a number, which you could get with a calculator and the output of \texttt{summary}.
% phi / (phi+omega33) = 357.073 /(357.073+416.368) = 0.462
% My old answer using SAS output must have been wrong.
\item Is there evidence of correlated measurement error within questionnaires? Answer Yes or No and give some numbers from the results file to support your conclusion.
%
\item The answer to that last question was based on two separate tests. Though it is already pretty convincing, conduct a \emph{single} Wald (not likelihood ratio) test of the two null hypotheses simultaneously. Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value. What do you conclude? Is there evidence of correlated measurement error, or not?
% W df p-value
% 45.8184924416383410 2.0000000000000000 0.0000000001123676
\item The double measurement design allows the measurement error covariance matrices $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$ to be unequal.
Carry out a Wald test to see whether the two covariance matrices are equal or not.
\begin{enumerate}
\item Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value. What do you conclude? Is there evidence that the two measurement error covariance matrices are unequal?
% W = 41.69941 , df=3, p < 0.0001
\item There is evidence that one of the measurements is less accurate on one questionnaire than the other. Which one is it? Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value.
% Number of pigs giving birth is more accurate on questionnaire 2 (estimated error variance = 93.023 compared to 321.227): W = 9.666 , df=1, p = 0.0019
% I did two tests here; only one of them was significant.
\end{enumerate}
\end{enumerate}
\end{enumerate}

\vspace{10mm}

\noindent \textbf{Bring a printout with your R input and output to the quiz. Please remember that while the questions may appear in comment statements, answers and interpretation may not, except for numerical answers generated by R.}

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%