Introduction

In this project, you will be working with the Statlog German Credit Data dataset (click on “Data Folder”; you can use either german.data-numeric or german.data, as you see fit, but I recommend sticking to german.data; see the R tips on the course website). Financial institutions routinely use statistics to make decisions. One type of decision is whether to grant someone credit. The dataset was collected by soliciting expert opinion about whether each customer was creditable. Using this dataset, you will explore a hypothesis about the data, and build a statistical model that automatically predicts whether a customer is creditable based on their attributes.

As with Project 1, your analysis for this project is required to be reproducible: you will submit a PDF file containing your report, as well as an R Markdown file that the TAs can use to generate your report using knitr.

Your report should be well-written and readable: you will lose marks if your report is hard to understand. In particular, it should be clear what every variable means and how to interpret your results. I recommend renaming the columns, as well as the values used in the dataset, to make your analysis clearer.
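For example, here is a minimal sketch of loading german.data and giving a few columns readable names; the file path and the chosen names are assumptions you should adapt, and the 1 = good / 2 = bad coding of the outcome is the one described in the dataset documentation, so check it against your copy.

# Load the space-separated raw file; column 21 is the outcome
german <- read.table("german.data", header = FALSE, stringsAsFactors = TRUE)
# Rename a couple of columns (names here are illustrative)
colnames(german)[c(1, 2, 21)] <- c("checking.status", "duration.months", "creditable")
# Recode the outcome into a readable factor (1 = good, 2 = bad in the raw file)
german$creditable <- factor(german$creditable, levels = c(1, 2),
                            labels = c("good", "bad"))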

Problem 1 (15%)

Come up with a plausible hypothesis about the data, which involves at least one possible interaction between two covariates. State your hypothesis, and briefly explain why the hypothesis is plausible and why there might be an interaction between the two covariates.

Treat the data as an observational study. Determine whether you need to include the interaction you posited in the model, and state your conclusion concerning the hypothesis.
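One way to check whether the interaction is needed is to compare nested logistic regressions with a likelihood-ratio test. In the sketch below, age and housing are hypothetical covariate names standing in for whatever your hypothesis involves, and german is the data frame from the loading sketch above.

m.main <- glm(creditable ~ age + housing, family = binomial, data = german)
m.int  <- glm(creditable ~ age * housing, family = binomial, data = german)  # adds the interaction
anova(m.main, m.int, test = "Chisq")   # likelihood-ratio test for the interaction term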

Problem 2 (15%)

Attribute 17 was converted to an integer in german.data-numeric. Why would it be reasonable to do that? Plot the logits of the probability of a customer being creditable against both variants of Attribute 17, and use cross-validation to assess whether predicting whether a customer is creditable should be done using the categorical variable or the ordered (integer) variable. Comment on both the plot and your cross-validation experiments: which version seems more appropriate?
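A sketch of both pieces follows; A17.cat and A17.int are assumed column names for the categorical and integer codings, which you will have to create and name yourself.

# Empirical logits of P(creditable) at each level of the categorical coding
p <- tapply(german$creditable == "good", german$A17.cat, mean)
plot(log(p / (1 - p)), xlab = "Attribute 17 (level index)", ylab = "logit P(creditable)")

# 10-fold cross-validation for the two codings
library(boot)
m.cat <- glm(creditable ~ A17.cat, family = binomial, data = german)
m.int <- glm(creditable ~ A17.int, family = binomial, data = german)
set.seed(0)
cv.glm(german, m.cat, K = 10)$delta
cv.glm(german, m.int, K = 10)$delta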

Problem 3 (25%)

In this problem, you will build a statistical model that predicts whether a customer is creditable or not, and minimizes the cost of the prediction.

As a first step, split your data into two equal parts: a training set and a test set. You will use the training set to fit the model, and you will evaluate the model on the test set. You need to randomly select which rows go into the training set and which rows go into the test set. Why can’t you simply take the first 500 rows and use them as the training set, and the second 500 rows and use them as the test set?
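A minimal sketch of a random 50/50 split, assuming the data frame is called german and has an even number of rows:

set.seed(0)
train.idx <- sample(nrow(german), size = nrow(german) / 2)  # half the row indices at random
train <- german[train.idx, ]
test  <- german[-train.idx, ]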

Now, build several logistic regression models, and decide using cross-validation which of them works the best at minimizing the cost of prediction. Once you have come up with the model that seems to work the best on the training set, evaluate that model on the test set, and report the results.

Make sure to use the cost matrix that comes with the dataset.
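As a sketch, the cost matrix can be wired into cross-validation through cv.glm’s cost argument. The 5/1 values below are the ones listed in the dataset documentation (approving a bad customer costs 5, rejecting a good customer costs 1); check them against your copy, and note that the 0.5 threshold is itself a modelling choice.

credit.cost <- function(y, p.hat) {
  # y is 0/1 with 1 = "bad" (the second factor level); p.hat is the fitted P(bad)
  pred.bad <- p.hat > 0.5
  mean(ifelse(y == 1 & !pred.bad, 5,        # bad customer predicted creditable
       ifelse(y == 0 &  pred.bad, 1, 0)))   # good customer predicted not creditable
}
library(boot)
m <- glm(creditable ~ ., family = binomial, data = train)   # train from the split above
cv.glm(train, m, cost = credit.cost, K = 10)$delta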

Problem 4 (15%)

In this problem, you will demonstrate the issues with overfitting. Randomly divide the dataset into a small training set and a test set of size 500. Fit a model using many of the covariates as predictors, and demonstrate that both the classification cost (0 for a correct classification, 1 for an incorrect classification) and the negative log-likelihood cost are larger on the test set than on the training set. State the false positive and false negative rates for both the training set and the test set.

You should make the training set small enough for the overfitting to be evident, but large enough so that you don’t get error messages or warnings when running the model.
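A sketch of computing the two costs on either part of the split; m.big, small.train, and test500 are assumed names for your fitted model and the two data sets.

avg.costs <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")   # fitted P(bad)
  y <- as.numeric(data$creditable == "bad")
  c(misclass = mean((p > 0.5) != y),                             # 0/1 classification cost
    nll      = -mean(y * log(p) + (1 - y) * log(1 - p)))         # negative log-likelihood cost
}
avg.costs(m.big, small.train)
avg.costs(m.big, test500)
# Counts from which the false positive and false negative rates can be read off
table(predicted.bad = predict(m.big, newdata = test500, type = "response") > 0.5,
      actual.bad    = test500$creditable == "bad")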

Problem 5 (15%)

Free throws are an important part of the game of basketball. Here, we will use a dataset of the free throws that Shaquille O’Neal, one of the greatest basketball players of the last few decades, who was also known for his poor free-throw shooting, took in 23 different games. For each game, the total number of attempted free throws is recorded, as well as the number of those attempts that were successful.

# Free-throw data: one row per game (free throws scored and attempted)
lines <-
"Game   Scored  N.Attempts
1   4   5
2   5   11
3   5   14
4   5   12
5   2   7
6   7   10
7   6   14
8   9   15
9   4   12
10  1   4
11  13  27
12  5   17
13  6   12
14  9   9
15  7   12
16  3   10
17  8   12
18  1   6
19  18  39
20  3   13
21  10  17
22  1   6
23  3   12"
con <- textConnection(lines)
shaq <- read.csv(con, sep="")
shaq
##    Game Scored N.Attempts
## 1     1      4          5
## 2     2      5         11
## 3     3      5         14
## 4     4      5         12
## 5     5      2          7
## 6     6      7         10
## 7     7      6         14
## 8     8      9         15
## 9     9      4         12
## 10   10      1          4
## 11   11     13         27
## 12   12      5         17
## 13   13      6         12
## 14   14      9          9
## 15   15      7         12
## 16   16      3         10
## 17   17      8         12
## 18   18      1          6
## 19   19     18         39
## 20   20      3         13
## 21   21     10         17
## 22   22      1          6
## 23   23      3         12

Problem 5(a) (5%)

Conduct a likelihood-ratio test to determine whether the probability of successfully scoring a free throw differed from game to game (i.e., whether Shaq had “good days” and “bad days”). State the null hypothesis, state whether it is rejected, and state the conclusion in English.
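One way to set this up is to compare a binomial GLM with a single common probability to one with a separate probability for each game; a sketch:

m.null <- glm(cbind(Scored, N.Attempts - Scored) ~ 1,            family = binomial, data = shaq)
m.game <- glm(cbind(Scored, N.Attempts - Scored) ~ factor(Game), family = binomial, data = shaq)
anova(m.null, m.game, test = "Chisq")   # likelihood-ratio test for game-to-game differences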

Problem 5(b) (5%)

We are interested in determining the true probability of successfully scoring a free throw in each game. Compute the no-pooling, complete-pooling, and partial-pooling estimates of those probabilities, and display them graphically. Where available, confidence intervals should be displayed as well. It is up to you how to display the probabilities, but one possibility is to use a chart like the one here.

You should use glmer to obtain the partial-pooling estimates. glmer works just like lmer, except that it fits generalized linear mixed models: you specify the family exactly as you would with glm.
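A sketch of the three fits (glmer is in the lme4 package; the random intercept gives one partially pooled success probability per game):

library(lme4)
m.pp <- glmer(cbind(Scored, N.Attempts - Scored) ~ (1 | Game), family = binomial, data = shaq)
plogis(coef(m.pp)$Game[, "(Intercept)"])                    # partial-pooling probabilities
m.cp <- glm(cbind(Scored, N.Attempts - Scored) ~ 1, family = binomial, data = shaq)
plogis(coef(m.cp))                                          # complete-pooling probability
m.np <- glm(cbind(Scored, N.Attempts - Scored) ~ factor(Game) - 1, family = binomial, data = shaq)
plogis(coef(m.np))                                          # no-pooling probabilities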

Problem 5(c) (5%)

Comment on the difference between the complete-pooling, no-pooling, and partial-pooling estimates for this particular dataset.

Problem 6

Problem 6(a) (15%)

Come up with a plausible scenario, substantially different from what we saw in lecture, in which the following is a good model for the data \(y_i\):

\[\alpha_{j}\sim N(\mu_\alpha, \sigma^{2}_\alpha)\]
\[y_i\sim N(\alpha_{j[i]}, \sigma^{2}_y)\]

For your scenario, pick plausible values for \(\mu_\alpha\), \(\sigma_\alpha\), and \(\sigma_y\), and give a brief explanation for why the values you picked make sense in your scenario.

Problem 6(b) (5% bonus, submit separately)

Repeatedly (say 1000 times) generate data using the values you picked, and estimate the means \(\alpha_j\) using no pooling, complete pooling, and partial pooling. Each time, compute the sum of the squared distances between the estimated and true values of the \(\alpha_j\)’s, for each method. Compare how well you are estimating the \(\alpha_j\)’s using no pooling, complete pooling, and partial pooling.
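A skeleton for the simulation; the number of groups, the group size, and the parameter values below are placeholders for whatever you chose in 6(a), and lmer provides the partial-pooling fit since the model here is normal rather than a GLM.

library(lme4)
sim.once <- function(J = 10, n = 20, mu.a = 0, sigma.a = 1, sigma.y = 1) {
  alpha <- rnorm(J, mu.a, sigma.a)                                # true group means
  d <- data.frame(g = factor(rep(1:J, each = n)),
                  y = rnorm(J * n, rep(alpha, each = n), sigma.y))
  est.np <- tapply(d$y, d$g, mean)                                # no pooling: per-group means
  est.cp <- rep(mean(d$y), J)                                     # complete pooling: grand mean
  est.pp <- coef(lmer(y ~ (1 | g), data = d))$g[, "(Intercept)"]  # partial pooling
  c(np = sum((est.np - alpha)^2),
    cp = sum((est.cp - alpha)^2),
    pp = sum((est.pp - alpha)^2))
}
set.seed(0)
res <- replicate(1000, sim.once())
rowMeans(res)   # average sum of squared errors for each method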