Project 2 – German Credit Dataset

Let’s read in the data and rename the columns and values to something more readable data (note: you didn’t have to rename the values.)

german <- read.csv("german.data", sep = " ", header=F)

attr_info <- "Attribute 1: (qualitative) 
AccountStatus
A11 : ... < 0 DM 
A12 : 0 <= ... < 200 DM 
A13 : ... >= 200 DM / salary assignments for at least 1 year 
A14 : no checking account 

Attribute 2: (numerical) 
Duration

Attribute 3: (qualitative) 
CreditHistory
A30 : no credits taken/ all credits paid back duly 
A31 : all credits at this bank paid back duly 
A32 : existing credits paid back duly till now 
A33 : delay in paying off in the past 
A34 : critical account/ other credits existing (not at this bank) 

Attribute 4: (qualitative) 
Purpose
A40 : car (new) 
A41 : car (used) 
A42 : furniture/equipment 
A43 : radio/television 
A44 : domestic appliances 
A45 : repairs 
A46 : education 
A48 : retraining 
A49 : business 
A410 : others 

Attribute 5: (numerical) 
CreditAmount

Attribute 6: (qualitative) 
Account
A61 : ... < 100 DM 
A62 : 100 <= ... < 500 DM 
A63 : 500 <= ... < 1000 DM 
A64 : .. >= 1000 DM 
A65 : unknown/ no savings account 

Attribute 7: (qualitative) 
EmploymentSince
A71 : unemployed 
A72 : ... < 1 year 
A73 : 1 <= ... < 4 years 
A74 : 4 <= ... < 7 years 
A75 : .. >= 7 years 

Attribute 8: (numerical) 
PrecentOfIncome

Attribute 9: (qualitative) 
PersonalStatus
A91 : male - divorced/separated 
A92 : female - divorced/separated/married 
A93 : male - single 
A94 : male - married/widowed 
A95 : female - single 

Attribute 10: (qualitative) 
OtherDebtors
A101 : none 
A102 : co-applicant 
A103 : guarantor 

Attribute 11: (numerical) 
ResidenceSince

Attribute 12: (qualitative) 
Property
A121 : real estate 
A122 : building society savings agreement/ life insurance 
A123 : car or other, not in attribute Account 
A124 : unknown / no property 

Attribute 13: (numerical) 
Age

Attribute 14: (qualitative) 
OtherInstallPlans
A141 : bank 
A142 : stores 
A143 : none 

Attribute 15: (qualitative) 
Housing
A151 : rent 
A152 : own 
A153 : for free 

Attribute 16: (numerical) 
NumExistingCredits

Attribute 17: (qualitative) 
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/highly qualified employee/ officer

Attribute 18: (numerical) 
NumMaintenance

Attribute 19: (qualitative) 
Telephone
A191 : none 
A192 : yes, registered under the customers name 

Attribute 20: (qualitative) 
ForeignWorker
A201 : yes 
A202 : no 

Attribute 21: (quantitative)
GoodCredit
1: 1
2: 0
"

attr_list <- as.list(as.vector(strsplit(attr_info, "Attribute ")[[1]])[2:22])

attr_names <- sapply(attr_list, function(attr_desc) strsplit(attr_desc, "\n")[[1]][2])
colnames(german) <- attr_names

for(i in 1:21){
  attr_desc <- attr_list[[i]]
  if(length(grep("qualitative",attr_desc))>0){
    lines_all <- as.vector(strsplit(attr_desc, "\n")[[1]])
    lines <- lines_all[3:(length(lines_all)-1)]
    str_vals <- sapply(lines, function(line) strsplit(line, " : ")[[1]][2])
    german[,i] <-as.numeric(german[,i])
    
    for (val_i in 1:length(lines)){
      german[,i] <- ifelse(german[,i]==val_i, str_vals[val_i], german[,i])
    }
  }
}

Here are the first three lines of german now:

german[c(1, 2, 3),]

##          AccountStatus Duration
## 1          ... < 0 DM         6
## 2   0 <= ... < 200 DM        48
## 3 no checking account        12
##                                                  CreditHistory
## 1 critical account/ other credits existing (not at this bank) 
## 2                    existing credits paid back duly till now 
## 3 critical account/ other credits existing (not at this bank) 
##                Purpose CreditAmount                      Account
## 1 domestic appliances          1169 unknown/ no savings account 
## 2 domestic appliances          5951                ... < 100 DM 
## 3          retraining          2096                ... < 100 DM 
##       EmploymentSince PrecentOfIncome                       PersonalStatus
## 1      .. >= 7 years                4                       male - single 
## 2 1 <= ... < 4 years                2 female - divorced/separated/married 
## 3 4 <= ... < 7 years                2                       male - single 
##   OtherDebtors ResidenceSince     Property Age OtherInstallPlans Housing
## 1        none               4 real estate   67             none     own 
## 2        none               2 real estate   22             none     own 
## 3        none               3 real estate   49             none     own 
##   NumExistingCredits                         Job NumMaintenance
## 1                  2 skilled employee / official              1
## 2                  1 skilled employee / official              1
## 3                  1        unskilled - resident              2
##                                   Telephone ForeignWorker GoodCredit
## 1 yes, registered under the customers name           yes           1
## 2                                     none           yes           2
## 3                                     none           yes           1

Now, let’s set things up so that Good Credit is 1, and Bad Credit is 0.

german[, "GoodCredit"] <- ifelse(german[, "GoodCredit"]==1, 1, 0)

Problem 1

Main hypothesis: the age of the customer influences the probability of the loan’s being good, controlling for the purpose of the loan (defined as essential or non-essential, with domestic appliances, repairs, education, and retraining considered essential) and the loan amount . In the analysis, we will also include the purpose of the loan. It is possible that loans are more likely to be bad if a young person is taking out a loan for something like a car or a TV, because the young person may be irresponsible and buy an expensive car/TV without knowing how they’ll repay the loan. On the other hand, for essential things like education and repairs, we don’t expect to see that kind of difference between old and young people. Let’s try to fit the model with an interaction and without an interaction, and decide whether an interaction should be included.

german$PurposeEssential <- ifelse(german$Purpose %in% c("domestic appliances ", "repairs ", "education ", "retraining "), "Essential", "Non-Essential")
fit1 <- glm(GoodCredit~Age+PurposeEssential + CreditAmount + Age*PurposeEssential, family=binomial, data=german)
fit2 <- glm(GoodCredit~Age+PurposeEssential + CreditAmount, family=binomial, data=german)
fit1$aic

## [1] 1197.345

fit2$aic

## [1] 1196

The lower AIC for fit2 indicates that we should select the model without the interaction and proceed.

summary(fit2)

## 
## Call:
## glm(formula = GoodCredit ~ Age + PurposeEssential + CreditAmount, 
##     family = binomial, data = german)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0088  -1.3693   0.7574   0.8527   1.4856  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    6.202e-01  2.590e-01   2.394  0.01664 *  
## Age                            2.031e-02  6.582e-03   3.085  0.00203 ** 
## PurposeEssentialNon-Essential -1.583e-01  1.507e-01  -1.050  0.29367    
## CreditAmount                  -1.116e-04  2.416e-05  -4.619 3.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.7  on 999  degrees of freedom
## Residual deviance: 1188.0  on 996  degrees of freedom
## AIC: 1196
## 
## Number of Fisher Scoring iterations: 4

We can reject the hypothesis that the age does not determine whether the credit is good or not at 95% confidence. In other words, there is evidence that older people have slightly better credit, controlling for the credit amount, and the purpose of the loan.

Problem 2

Plotting the logits

First, let’s add in the ordinal variable

words_to_num <- function(word){
  return(switch(word, "unemployed/ unskilled - non-resident"={1}, "unskilled - resident"={2},  "skilled employee / official"={3}, "management/ self-employed/highly qualified employee/ officer"={4}))
}
german$JobOrdinal <- as.numeric(sapply(german$Job, words_to_num))

To obtain the logits of the probabilities, one possibility is to fit a logistic regression model with a 0 intercept (another possibility is to obtain the probabilities directly.)

cleanup_name <- function(s){
  return(gsub(" ", "", gsub("Job", "", strsplit(s, "/")[[1]][1])))
}

fit.job.categ <- glm(GoodCredit~0+Job, family=binomial, data=german)
categ.coefs <- as.data.frame(summary(fit.job.categ)$coefficients)
rownames(categ.coefs) <- as.character(lapply(rownames(categ.coefs), cleanup_name))


orig_order <- c("unemployed", "unskilled-resident", "skilledemployee", "management")

categ.coefs$job.category <- factor(rownames(categ.coefs), levels=orig_order)
categ.coefs["Logit"] <- categ.coefs["Estimate"]

ggplot(categ.coefs, aes(x = job.category, y = Logit)) +  
  geom_errorbar(aes(ymin=Estimate-`Std. Error`, ymax=Estimate+`Std. Error`))

It’s worth pointing out that there isn’t really strong evidence that any of those estimates are different.

anova(glm(GoodCredit~Job, family=binomial, data=german), test="LRT")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: GoodCredit
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                   999     1221.7         
## Job   3    1.854       996     1219.9   0.6033

We cannot reject the hypothesis the the job category doesn’t influence creditworthiness.

Now let’s also produce a the logits obtained using the ordinal version of the predictor.

fit.job.ordinal <- glm(GoodCredit~JobOrdinal, family=binomial, data=german)

ggdata <- data.frame(JobOrdinal = german$JobOrdinal,
                     LogitOrd   = predict(fit.job.ordinal, type="link"))

ggplot(ggdata, aes(x = JobOrdinal, y= LogitOrd)) +
            geom_line(colour="blue", size=1)

Now, let’s overlay the two plots (you didn’t have to do that)

ggplot(categ.coefs, aes(x = job.category, y = Logit)) +  
  geom_errorbar(aes(ymin=Estimate-`Std. Error`, ymax=Estimate+`Std. Error`)) + 
  geom_line(data = ggdata, aes(x=JobOrdinal, y=LogitOrd), colour="blue", size=1)

As we can see, it’s not really possible to conclude from the plot which version would work better – in fact, the predictions agree pretty closely for both versions.

Prediction and NLL Cost

Let’s now compare the models using cross-validation.

cost_classification <- function(r, pi) mean(abs(r-pi) > 0.5)
cost_negloglikelihood <- function(r, pi) -sum(r*log(pi)+(1-r)*log(1-pi))

cv.glm(data=german, glmfit=fit.job.categ, cost=cost_negloglikelihood)$delta[1]

## [1] 0.6140526

cv.glm(data=german, glmfit=fit.job.ordinal, cost=cost_negloglikelihood)$delta[1]

## [1] 0.6123785

cv.glm(data=german, glmfit=fit.job.categ, cost=cost_classification)$delta[1]

## [1] 0.3

cv.glm(data=german, glmfit=fit.job.ordinal, cost=cost_classification)$delta[1]

## [1] 0.3

We observe that the costs are very close – in fact, the classification costs are identical, since in both cases the prediction is always “good credit,” resulting in mistakes in exactly 30% of the cases. The NLL is slightly smaller for the ordinal version. This is possibly because we get a slightly better estimate for the unemployed category, using information from the other categories there (although it’s really hard to be sure of anything there.)

Conclusions

If we had to guess, we would go with the ordinal version based on the cross-validation results (though they make clear that there is not much difference.) It appears at first from the plot that the logits for the different categories do not form a straight line, which would mean that the ordinal version is not appropriate, since a model assumption is that the logits are linear in all variables. However, looking at the error bars, there is no evidence that the logits are not linear in the ordinal version of the job category.

Problem 3

First things first:

idx <- sample(1000, 500)
train_set <- german[idx,]
test_set <- german[-idx,]

Why not take the first 500 rows for the train set?

It’s important to randomly select the training and the test set. It might be that the dataset was assembled in a particular way, which might bias are results. In the worst case, all the loans in the first 500 rows would be good, which would make as always predict that the loan is good. Another situation is that it’s easier to predict the first 500 loans than the second 500 loans (perhaps they were arranged in order of uncertainty about whether the customer is good or not.) That would mean that the performance on the test set will be misleadingly bad.

Let’s try two big models and one small one. We’ll select the best model by looking at the best NLL. We’ll create a validation set on which to compute the NLL.

fit_nothing <- glm(GoodCredit ~ 1, family=binomial, data=train_set)

fit_few <- glm(GoodCredit ~ CreditAmount + Age, family=binomial, data=train_set)

fit_lots <- glm(GoodCredit~ CreditAmount + Age + Duration + Account + OtherDebtors + Job +OtherDebtors + CreditHistory + EmploymentSince + Purpose , family=binomial, data=train_set)

fit_lots_sq <- glm(GoodCredit~ CreditAmount + Age + I(Age^2) + Duration + I(Duration^2) + Account + OtherDebtors + OtherDebtors + CreditHistory + EmploymentSince + Purpose , family=binomial, data=train_set)


cv.glm(data=train_set, glmfit=fit_nothing, cost=cost_negloglikelihood)$delta[1]

## [1] 0.585267

cv.glm(data=train_set, glmfit=fit_few, cost=cost_negloglikelihood)$delta[1]

## [1] 0.5764255

cv.glm(data=train_set, glmfit=fit_lots, cost=cost_negloglikelihood)$delta[1]

## [1] 0.5467152

cv.glm(data=train_set, glmfit=fit_lots_sq, cost=cost_negloglikelihood)$delta[1]

## [1] 0.5438146

We see that the smallest NLL cost is obtained when using fit_lots_sq. The benefit from the variables we selected seems small (as evident by the NLL per instance for fit_nothing – but that’s probably because the people who obviously shouldn’t get a loan aren’t in the dataset).

Now, let’s compute the cost on the test set. We need to make a new cost function for that (though you could simply compute the cost directly as well.)

cost_classification_matrix <- function(r, pi)   mean(5*(r==0)*(abs(r-pi)>0.5)+ 1*(r==1)*(abs(r-pi)>0.5))

cost_classification_matrix(train_set$GoodCredit, predict(fit_lots_sq, newdata=train_set, type="response"))

## [1] 0.814

cost_classification_matrix(test_set$GoodCredit, predict(fit_lots_sq, newdata=test_set, type="response"))

## [1] 1.186

cost_classification_matrix(train_set$GoodCredit, predict(fit_nothing, newdata=train_set, type="response"))

## [1] 1.35

cost_classification_matrix(test_set$GoodCredit, predict(fit_nothing, newdata=test_set, type="response"))

## [1] 1.65

We get a pretty decent win here. It makes sense to try to adjust the threshold, since it might very well make sense to predict “bad” even if the probability of bad credit is a bit lower than 50%, but that’s left as an exercise.

Problem 4

Our task here is to demonstrate overfitting. The key is to make the training set small, and use as many variables as possible.

train_small <- train_set[100:140,]
fit_overfit <- glm(GoodCredit~ CreditAmount + Age + I(Age^2) + Duration + I(Duration^2) + Account , family=binomial, data=train_small)

cost_negloglikelihood(test_set$GoodCredit, predict(fit_overfit, newdata=test_set, type="response"))/dim(test_set)[1]

## [1] 1.350608

cost_negloglikelihood(train_small$GoodCredit, predict(fit_overfit, newdata=train_small, type="response"))/dim(train_small)[1]

## [1] 0.4580464

As we can see the average negative log-likelihood is much smaller on the small training set than on the test set.