% 260s20Assignment3.tex                 Confidence intervals
\documentclass[12pt]{article} 
%\usepackage{amsbsy} % for \boldsymbol and \pmb 
%\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{comment}
%\usepackage{fullpage}
\oddsidemargin=0in                  % Good for US Letter paper
\evensidemargin=0in
\textwidth=6.3in
\topmargin=-1in
\headheight=0.2in
\headsep=0.5in
\textheight=9.4in

%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt} 

\begin{center}   
{\Large \textbf{STA 260s20 Assignment Three: Confidence Intervals Part One}}\footnote{Copyright information is at the end of the last page.}
%\vspace{1 mm}
\end{center}

\noindent
These homework problems are not to be handed in.  They are preparation for Quiz 3 (Week of Jan.~27) and Term Test 1. \textbf{Please try each question before looking at the solution}.

%\vspace{5mm}
\begin{enumerate} 

\item The ``$p$th quantile'' or ``$p$ quantile'' of a probability distribution is the point with probability $p$ at or below it. That is, if $x_p$ is the $p$ quantile of the distribution of the random variable $X$, then $F_{_X}(x_p) = p$. Here are some numbers we will need to construct confidence intervals. Let $z_p$ denote the $p$ quantile of the standard normal distribution, and let $Z \sim N(0,1)$.
    \begin{enumerate}
        \item What is $P(-z_{0.975} < Z < z_{0.975})$?  I think it helps to draw a picture.
        \item What is $z_{0.975}$? The answer is a number from the table of the standard normal distribution, now included in the formula sheet. % 1.96
        \item What is $P(-z_{0.995} < Z < z_{0.995})$?  
        \item What is $z_{0.995}$? This time you will need to interpolate in the table. % 2.575
        \item Let $\alpha$ denote a small probability; the values $\alpha=0.05$ and  $\alpha=0.01$ are common. What is $P(-z_{1-\alpha/2} < Z < z_{1-\alpha/2})$?
    \end{enumerate}
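\par\smallskip
\noindent A small illustration of how the quantile notation is used, with a level that is not asked about above. Here $\Phi$ denotes the cumulative distribution function of $Z$ (notation introduced only for this sketch), and $-z_{0.95} = z_{0.05}$ by the symmetry of the standard normal density about zero:
\begin{align*}
P(-z_{0.95} < Z < z_{0.95}) &= \Phi(z_{0.95}) - \Phi(-z_{0.95}) \\
                            &= 0.95 - 0.05 \\
                            &= 0.90.
\end{align*}
From the standard normal table, $z_{0.95} \approx 1.645$, so about 90\% of the probability lies between $-1.645$ and $1.645$.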

\item \label{derive} Let $X_1, \ldots, X_n$ be independent random variables from a distribution with expected value $\mu$ and variance $\sigma^2$, where $\mu$ and $\sigma^2$ are both unknown. Let $\widehat{\sigma}^2_n$ be a consistent estimator of $\sigma^2$.  Using the Central Limit Theorem, derive a $(1-\alpha)100\%$ confidence interval for $\mu$. This yields a 95\% confidence interval if $\alpha=0.05$, or a 99\% confidence interval if $\alpha=0.01$. ``Derive'' means show all the high school algebra. Your final answer is a pair of formulas, a formula for the lower confidence limit $L(X_1, \ldots, X_n)$ and a formula for the upper confidence limit $U(X_1, \ldots, X_n)$. Use the notation $z_{1-\alpha/2}$ for the $1-\alpha/2$ quantile of the standard normal distribution.
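\par\smallskip
\noindent One possible starting point, offered as a hint rather than a full solution. It assumes $\sigma^2 > 0$ and uses Slutsky's theorem to replace $\sigma$ with $\widehat{\sigma}_n$; the name $Z_n$ is introduced here only for convenience.
\begin{align*}
Z_n &= \frac{\sqrt{n}\,(\overline{X}_n - \mu)}{\widehat{\sigma}_n} \stackrel{d}{\rightarrow} Z \sim N(0,1), \\
1-\alpha &\approx P\left(-z_{1-\alpha/2} < Z_n < z_{1-\alpha/2}\right) \quad \text{for large } n.
\end{align*}
The remaining work is to rearrange the inequalities until $\mu$ is alone in the middle.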

\item The label on the peanut butter jar says peanuts, partially hydrogenated peanut oil, salt and sugar.  But we all know there is other stuff in there too. There is very good reason to assume that the number of rat hairs in a jar of peanut butter has a Poisson distribution with mean $\lambda$, because it's easy to justify a Poisson process model for how the hairs get into the jars (technical details omitted).  A sample of thirty 500g jars yields $\overline{X}_n=9.2$.
    \begin{enumerate}
        \item Give a point estimate and a 95\% confidence interval for $\lambda$. Indicate why your chosen $\widehat{\sigma}^2_n$ is consistent. Show a little work. Your answer is a pair of numbers, a lower confidence limit and an upper confidence limit. % 9.2 +- 1.09 = (8.11 10.29)
        \item There is a government standard that says the true expected number of rat hairs in a 500g jar may not exceed 8. Do these results suggest that the company may be in violation of the regulation?
    \end{enumerate}
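\par\smallskip
\noindent To illustrate the arithmetic without giving away the answer, here is a sketch with made-up numbers: suppose a hypothetical sample of $n=50$ jars had $\overline{X}_n = 4.0$. For a Poisson distribution the mean and variance are both $\lambda$, so one natural choice (an assumption of this sketch, not the only possibility) is $\widehat{\sigma}^2_n = \overline{X}_n$. The approximate 95\% interval would then be
\begin{align*}
\overline{X}_n \pm z_{0.975}\sqrt{\frac{\overline{X}_n}{n}}
  &= 4.0 \pm 1.96\sqrt{\frac{4.0}{50}} \\
  &\approx 4.0 \pm 0.554,
\end{align*}
that is, roughly $(3.45, 4.55)$.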

 \item In a coffee taste test, 100 coffee drinkers tasted coffee made with two different blends of coffee beans, the old standard blend and a new blend. We will adopt a Bernoulli model for these data, with $\theta$ denoting the probability that a customer will prefer the new blend. Suppose 60 out of 100 consumers preferred the new blend of coffee beans. 
    \begin{enumerate}
        \item Assuming that the participants in the study are a random sample of coffee drinkers (which they almost certainly are not), give a point estimate and a 95\% confidence interval for the \emph{percentage} of coffee drinkers who would prefer the new blend of coffee beans. % CI for the proportion is 
% me = sqrt(.6*.4/100)*1.96; round(me,3) = 0.096
% ci = round(c(.6-me,.6+me),3); ci = 0.504 0.696 -> (50.5%,69.6%)
        \item The research department is afraid that management will not be satisfied with a margin of error of almost 10\%. They would like to say ``These results are expected to be accurate within three percentage points, 19 times out of 20.''
                \begin{enumerate}
                    \item Suppose that the true proportion of customers who would like the new blend is actually 0.60. What sample size $n$ would be required to achieve the desired margin of error? % 1.96^2*0.6*0.4/0.0009 = 1024.427.
                    \item Suppose the true proportion equals 0.7. What sample size is required now? % 1.96^2*0.7*0.3/0.0009 = 896.3733
                    \item Suppose the true proportion equals 0.9. What sample size is required now? 
                    \item Suppose the true proportion equals 0.3. What sample size is required now? 
                    \item Now suppose that the researchers are cautious, and want to guarantee that the margin of error will be 3\% or less regardless of what the true value of $\theta$ might be. What is the required sample size?  % 1.96^2*0.5*0.5/0.0009 = 1067.111 -> 1068
                \end{enumerate}
Clearly, these people are going to over-spend their budget. They are in deep trouble. 
    \end{enumerate} % End of coffee question 
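\par\smallskip
\noindent A sketch of the sample size calculation, using made-up values so as not to give away the answers above. Write $m$ for the desired margin of error (a symbol introduced only for this sketch). Setting the margin of error equal to $m$ and solving for $n$,
\[
m = z_{0.975}\sqrt{\frac{\theta(1-\theta)}{n}}
\quad \Longleftrightarrow \quad
n = \frac{z_{0.975}^2 \, \theta(1-\theta)}{m^2}.
\]
For a hypothetical $\theta = 0.2$ and $m = 0.05$, this gives $n = 1.96^2 (0.2)(0.8)/0.05^2 \approx 245.9$, which is rounded up to $n = 246$.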

\item A sample of $n=64$ high school students takes a standardized multiple choice vocabulary test. We obtain a sample mean of $\overline{x}_n = 105$ and a sample variance of $s^2=256$. Give a 95\% confidence interval for $\mu$. Your answer is a pair of numbers, the lower confidence limit and the upper confidence limit. % 105 +- 3.92 = (101.08 to 108.92).   

\item \label{volcano} The physics of volcanic eruptions suggests that the intervals between eruptions at a particular site should be independent exponential random variables, with the unknown value of the parameter $\lambda$ depending on the site. Fifteen eruptions have been recorded at the West Thumb Geyser Basin during historic times. The 14 intervals between eruptions (in years) are: \texttt{28.83  9.25  3.00  8.58  0.50  2.08  0.50 26.25 33.08  2.17  3.08  0.58  2.42  5.08}. 
    \begin{enumerate} 
        \item Estimating the expected interval between eruptions is easy. Give a reasonable point estimate. The answer is a number. In terms of $\lambda$, what are you estimating? % 8.957 estimates 1/lambda
        \item We want a confidence interval, but $n=14$ is too small to be comfortable using the Central Limit Theorem. However, an exact (not asymptotic) confidence interval is within reach. 
                \begin{enumerate}
                    \item Find the distribution of $\overline{X}_n$. Show your work. 
                    \item Show that $Y = 2\lambda n\overline{X}_n$ has a chi-squared distribution. For reasons that may become clear later in the course, the parameter $\nu$ of the chi-squared distribution is called the ``degrees of freedom.'' What are the degrees of freedom of $Y$?
                    \item Using the quantiles $\chi^2_{0.025}$ and $\chi^2_{0.975}$, derive a 95\% confidence interval for the expected time between eruptions.
                    \item Unfortunately, the chi-squared table in our text does not give quantiles for the lower part of the distribution. The free, open-source R software can do it easily, though. In \texttt{qchisq}, the \texttt{q} stands for quantile.
\begin{verbatim}
> qchisq(0.025,28)
[1] 15.30786
> qchisq(0.975,28)
[1] 44.46079
\end{verbatim}
So with $\nu=28$ degrees of freedom, $\chi^2_{0.025} = 15.31$ and $\chi^2_{0.975}=44.46$. Give the 95\% confidence interval. Your answer is a pair of numbers, the lower confidence limit and the upper confidence limit. % c(2*14*8.957/44.46,2*14*8.957/15.31) = (5.64, 16.38)
                \end{enumerate}
                    \item It's not justified because of the small sample size, but go ahead and use the Central Limit Theorem to produce a 95\% confidence interval for
                        \begin{enumerate}
                            \item The expected time between eruptions.
                            \item The parameter $\lambda$.
                        \end{enumerate}
    \end{enumerate} % End of the eruption.
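\par\smallskip
\noindent Some standard facts that may help with the exact interval, stated as hints rather than a solution. They assume the exponential distribution is parameterized so that $E(X_i) = 1/\lambda$, and that moment-generating functions (written $M(t)$ here, notation introduced only for this sketch) are the tool of choice; other routes are possible.
\begin{align*}
X \sim \text{Exponential}(\lambda): \quad & M_X(t) = \frac{\lambda}{\lambda - t}, \quad t < \lambda, \\
W \sim \chi^2(\nu): \quad & M_W(t) = (1-2t)^{-\nu/2}, \quad t < \tfrac{1}{2}, \\
\text{for a constant } a: \quad & M_{aX}(t) = M_X(at), \\
\text{for independent } X_1, \ldots, X_n: \quad & M_{\sum_{i=1}^n X_i}(t) = \prod_{i=1}^n M_{X_i}(t).
\end{align*}
Matching the moment-generating function of $Y$ to one of these forms identifies its distribution.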

% \pagebreak

\item % Uniform asymptotic -- a very rich question for the makeup final.
This question mirrors the development of confidence intervals based on the Central Limit Theorem, except that most of the technical details are easier. 
% So following this argument can help clarify a tool that is used more widely in practice. 
Let $X_1, \ldots, X_n$ be independent Uniform $(0,\theta)$ random variables.
    \begin{enumerate}
        \item Let $T_n$ be the maximum $X_i$ value. Obtain the cumulative distribution function of $T_n$. ``Obtain'' could be as simple as copying your answer to Question 5c of Assignment 2.
        \item Let $Y_n = n\left(1-\frac{T_n}{\theta}\right)$. 
                \begin{enumerate}
                    \item Find the support of $Y_n$, or equivalently, the shortest interval $A$ for which $P(Y_n \in A) = 1$. Show your work.
                    \item Derive the cumulative distribution function of $Y_n$, and write it using indicator functions.
                    \item Using the definition of convergence in distribution, show that $Y_n \stackrel{d}{\rightarrow}Y \sim $ Exponential(1). % Using the definition means taking $\lim_{n \rightarrow \infty}F_{_{Y_n}}(y)$ for an arbitrary $y>0$. 
                \end{enumerate}
        \item Using the last result, derive a general $(1-\alpha)$ confidence interval for $\theta$.  Note that generally speaking, it's a good idea to seek confidence intervals that are as short as possible, though for Problem~\ref{volcano} it would have been more trouble than it's worth. Here, it matters. Because the exponential density is decreasing, the shortest interval starts with zero. So, begin the derivation of the confidence interval with $1-\alpha = P(0 < Y < y_{1-\alpha})$, where $y_{1-\alpha}$ is the $1-\alpha$ quantile of the standard exponential distribution. 
        \item Is it possible for the true value of $\theta$ to be below your lower confidence limit? Answer Yes or No.
        \item Unlike the quantiles of the normal and chi-squared distributions, the quantiles of the exponential can be obtained with an ordinary scientific calculator. Give an explicit formula for $y_{1-\alpha}$. Show your work. 
        \item In the following R session, I simulate $n=30$ observations from a Uniform(0,4) distribution, and calculate the mean and maximum. So the true parameter value $\theta=4$ is actually known. This never happens in real-world applications, but when you're developing an estimation method, it's good to try it on simulated data where you know the truth, to see how the estimates behave. The statistics are rounded to 3 decimal places.
\begin{verbatim}
> x = runif(30,0,4)
> round(mean(x),3)
[1] 1.785
> round(max(x),3)
[1] 3.905
\end{verbatim}
                \begin{enumerate}
                    \item Using the simulated data above, calculate your 95\% confidence  interval. The answer is a pair of numbers, a lower confidence limit and an upper confidence limit. 
                    % Commenting out with the comment package:
\begin{comment}
> q95 = -log(0.05); q95
[1] 2.995732
> n = 30
> c(3.905,3.905*n/(n-q95))
[1] 3.905000 4.338203
> -(3.905-3.905*n/(n-q95)) # Width of CI
[1] 0.4332032
\end{comment}

                    \item For comparison, calculate a 95\% confidence interval based on the Central Limit Theorem.
                    % Commenting out with the comment package:
\begin{comment}
> # First a ci for E(X) then double.
> thetahat = 2*1.785; thetahat
[1] 3.57
> se = sqrt(thetahat^2/12)/sqrt(30); se
[1] 0.1881555
> xbar = 1.785
> me = 1.96*se
> ci = c(xbar-me,xbar+me) # CI for E(X) = \theta/2
> ci = 2*ci; ci
[1] 2.83243 4.30757
> 4.30757 - 2.83243 # Width of CI
[1] 1.47514
\end{comment}

                    \item Which confidence interval do you like more, and why?
                \end{enumerate}
    \end{enumerate} % End of the uniform CI.
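\par\smallskip
\noindent The limit needed for the convergence in distribution rests on the standard fact that $\lim_{n \rightarrow \infty}\left(1+\frac{x}{n}\right)^n = e^x$ for any real $x$. As a rough numerical illustration of how close the two sides are at the sample size used above, take $x=-1$ and $n=30$ (the value $x=-1$ is chosen arbitrarily for this sketch):
\[
\left(1 - \frac{1}{30}\right)^{30} \approx 0.362, \qquad e^{-1} \approx 0.368.
\]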

\item In the construction of confidence intervals, a \emph{pivotal quantity}, or \emph{pivot}, is a random variable that is a function of the parameter, but whose distribution does not depend on the value of the parameter. You construct an interval based on the random variable, and then manipulate the inequalities until the parameter is alone in the middle. What are the pivots in this assignment?
% Z_n in Q2, Y in Q6, Y_n in Q7
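\par\smallskip
\noindent For orientation, here is a textbook pivot that does not appear in this assignment. It assumes $X_1, \ldots, X_n$ are independent $N(\mu,\sigma^2)$ with $\sigma^2$ \emph{known}, which is not the setting of any question above. Then $Z = \sqrt{n}(\overline{X}_n-\mu)/\sigma$ is a function of $\mu$, but $Z \sim N(0,1)$ exactly, whatever the value of $\mu$, and
\begin{align*}
1-\alpha &= P\left(-z_{1-\alpha/2} < \frac{\sqrt{n}(\overline{X}_n-\mu)}{\sigma} < z_{1-\alpha/2}\right) \\
         &= P\left(\overline{X}_n - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \overline{X}_n + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right).
\end{align*}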

\end{enumerate}

\vspace{12mm}

\noindent
\begin{center}\begin{tabular}{l}
\hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by  \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Mathematical and Computational Sciences, University of Toronto. It is licensed under a 
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
     {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: 
     
\begin{center}
\href{http://www.utstat.toronto.edu/~brunner/oldclass/260s20} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/260s20}}
\end{center}

\end{document}

