Let’s read in the data and rename the columns and values to something more readable data (note: you didn’t have to rename the values.)
german <- read.csv("german.data", sep = " ", header=F)
attr_info <- "Attribute 1: (qualitative)
AccountStatus
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration
Attribute 3: (qualitative)
CreditHistory
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
CreditAmount
Attribute 6: (qualitative)
Account
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
EmploymentSince
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years
Attribute 8: (numerical)
PrecentOfIncome
Attribute 9: (qualitative)
PersonalStatus
A91 : male - divorced/separated
A92 : female - divorced/separated/married
A93 : male - single
A94 : male - married/widowed
A95 : female - single
Attribute 10: (qualitative)
OtherDebtors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
ResidenceSince
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : building society savings agreement/ life insurance
A123 : car or other, not in attribute Account
A124 : unknown / no property
Attribute 13: (numerical)
Age
Attribute 14: (qualitative)
OtherInstallPlans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
NumExistingCredits
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/highly qualified employee/ officer
Attribute 18: (numerical)
NumMaintenance
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customers name
Attribute 20: (qualitative)
ForeignWorker
A201 : yes
A202 : no
Attribute 21: (quantitative)
GoodCredit
1: 1
2: 0
"
attr_list <- as.list(as.vector(strsplit(attr_info, "Attribute ")[[1]])[2:22])
attr_names <- sapply(attr_list, function(attr_desc) strsplit(attr_desc, "\n")[[1]][2])
colnames(german) <- attr_names
for(i in 1:21){
attr_desc <- attr_list[[i]]
if(length(grep("qualitative",attr_desc))>0){
lines_all <- as.vector(strsplit(attr_desc, "\n")[[1]])
lines <- lines_all[3:(length(lines_all)-1)]
str_vals <- sapply(lines, function(line) strsplit(line, " : ")[[1]][2])
german[,i] <-as.numeric(german[,i])
for (val_i in 1:length(lines)){
german[,i] <- ifelse(german[,i]==val_i, str_vals[val_i], german[,i])
}
}
}
Here are the first three lines of german
now:
german[c(1, 2, 3),]
## AccountStatus Duration
## 1 ... < 0 DM 6
## 2 0 <= ... < 200 DM 48
## 3 no checking account 12
## CreditHistory
## 1 critical account/ other credits existing (not at this bank)
## 2 existing credits paid back duly till now
## 3 critical account/ other credits existing (not at this bank)
## Purpose CreditAmount Account
## 1 domestic appliances 1169 unknown/ no savings account
## 2 domestic appliances 5951 ... < 100 DM
## 3 retraining 2096 ... < 100 DM
## EmploymentSince PrecentOfIncome PersonalStatus
## 1 .. >= 7 years 4 male - single
## 2 1 <= ... < 4 years 2 female - divorced/separated/married
## 3 4 <= ... < 7 years 2 male - single
## OtherDebtors ResidenceSince Property Age OtherInstallPlans Housing
## 1 none 4 real estate 67 none own
## 2 none 2 real estate 22 none own
## 3 none 3 real estate 49 none own
## NumExistingCredits Job NumMaintenance
## 1 2 skilled employee / official 1
## 2 1 skilled employee / official 1
## 3 1 unskilled - resident 2
## Telephone ForeignWorker GoodCredit
## 1 yes, registered under the customers name yes 1
## 2 none yes 2
## 3 none yes 1
Now, let’s set things up so that Good Credit is 1, and Bad Credit is 0.
german[, "GoodCredit"] <- ifelse(german[, "GoodCredit"]==1, 1, 0)
Main hypothesis: the age of the customer influences the probability of the loan’s being good, controlling for the purpose of the loan (defined as essential or non-essential, with domestic appliances, repairs, education, and retraining considered essential) and the loan amount . In the analysis, we will also include the purpose of the loan. It is possible that loans are more likely to be bad if a young person is taking out a loan for something like a car or a TV, because the young person may be irresponsible and buy an expensive car/TV without knowing how they’ll repay the loan. On the other hand, for essential things like education and repairs, we don’t expect to see that kind of difference between old and young people. Let’s try to fit the model with an interaction and without an interaction, and decide whether an interaction should be included.
german$PurposeEssential <- ifelse(german$Purpose %in% c("domestic appliances ", "repairs ", "education ", "retraining "), "Essential", "Non-Essential")
fit1 <- glm(GoodCredit~Age+PurposeEssential + CreditAmount + Age*PurposeEssential, family=binomial, data=german)
fit2 <- glm(GoodCredit~Age+PurposeEssential + CreditAmount, family=binomial, data=german)
fit1$aic
## [1] 1197.345
fit2$aic
## [1] 1196
The lower AIC for fit2
indicates that we should select the model without the interaction and proceed.
summary(fit2)
##
## Call:
## glm(formula = GoodCredit ~ Age + PurposeEssential + CreditAmount,
## family = binomial, data = german)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0088 -1.3693 0.7574 0.8527 1.4856
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.202e-01 2.590e-01 2.394 0.01664 *
## Age 2.031e-02 6.582e-03 3.085 0.00203 **
## PurposeEssentialNon-Essential -1.583e-01 1.507e-01 -1.050 0.29367
## CreditAmount -1.116e-04 2.416e-05 -4.619 3.85e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 1188.0 on 996 degrees of freedom
## AIC: 1196
##
## Number of Fisher Scoring iterations: 4
We can reject the hypothesis that the age does not determine whether the credit is good or not at 95% confidence. In other words, there is evidence that older people have slightly better credit, controlling for the credit amount, and the purpose of the loan.
First, let’s add in the ordinal variable
words_to_num <- function(word){
return(switch(word, "unemployed/ unskilled - non-resident"={1}, "unskilled - resident"={2}, "skilled employee / official"={3}, "management/ self-employed/highly qualified employee/ officer"={4}))
}
german$JobOrdinal <- as.numeric(sapply(german$Job, words_to_num))
To obtain the logits of the probabilities, one possibility is to fit a logistic regression model with a 0 intercept (another possibility is to obtain the probabilities directly.)
cleanup_name <- function(s){
return(gsub(" ", "", gsub("Job", "", strsplit(s, "/")[[1]][1])))
}
fit.job.categ <- glm(GoodCredit~0+Job, family=binomial, data=german)
categ.coefs <- as.data.frame(summary(fit.job.categ)$coefficients)
rownames(categ.coefs) <- as.character(lapply(rownames(categ.coefs), cleanup_name))
orig_order <- c("unemployed", "unskilled-resident", "skilledemployee", "management")
categ.coefs$job.category <- factor(rownames(categ.coefs), levels=orig_order)
categ.coefs["Logit"] <- categ.coefs["Estimate"]
ggplot(categ.coefs, aes(x = job.category, y = Logit)) +
geom_errorbar(aes(ymin=Estimate-`Std. Error`, ymax=Estimate+`Std. Error`))
It’s worth pointing out that there isn’t really strong evidence that any of those estimates are different.
anova(glm(GoodCredit~Job, family=binomial, data=german), test="LRT")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: GoodCredit
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 999 1221.7
## Job 3 1.854 996 1219.9 0.6033
We cannot reject the hypothesis the the job category doesn’t influence creditworthiness.
Now let’s also produce a the logits obtained using the ordinal version of the predictor.
fit.job.ordinal <- glm(GoodCredit~JobOrdinal, family=binomial, data=german)
ggdata <- data.frame(JobOrdinal = german$JobOrdinal,
LogitOrd = predict(fit.job.ordinal, type="link"))
ggplot(ggdata, aes(x = JobOrdinal, y= LogitOrd)) +
geom_line(colour="blue", size=1)
Now, let’s overlay the two plots (you didn’t have to do that)
ggplot(categ.coefs, aes(x = job.category, y = Logit)) +
geom_errorbar(aes(ymin=Estimate-`Std. Error`, ymax=Estimate+`Std. Error`)) +
geom_line(data = ggdata, aes(x=JobOrdinal, y=LogitOrd), colour="blue", size=1)
As we can see, it’s not really possible to conclude from the plot which version would work better – in fact, the predictions agree pretty closely for both versions.
Let’s now compare the models using cross-validation.
cost_classification <- function(r, pi) mean(abs(r-pi) > 0.5)
cost_negloglikelihood <- function(r, pi) -sum(r*log(pi)+(1-r)*log(1-pi))
cv.glm(data=german, glmfit=fit.job.categ, cost=cost_negloglikelihood)$delta[1]
## [1] 0.6140526
cv.glm(data=german, glmfit=fit.job.ordinal, cost=cost_negloglikelihood)$delta[1]
## [1] 0.6123785
cv.glm(data=german, glmfit=fit.job.categ, cost=cost_classification)$delta[1]
## [1] 0.3
cv.glm(data=german, glmfit=fit.job.ordinal, cost=cost_classification)$delta[1]
## [1] 0.3
We observe that the costs are very close – in fact, the classification costs are identical, since in both cases the prediction is always “good credit,” resulting in mistakes in exactly 30% of the cases. The NLL is slightly smaller for the ordinal version. This is possibly because we get a slightly better estimate for the unemployed category, using information from the other categories there (although it’s really hard to be sure of anything there.)
If we had to guess, we would go with the ordinal version based on the cross-validation results (though they make clear that there is not much difference.) It appears at first from the plot that the logits for the different categories do not form a straight line, which would mean that the ordinal version is not appropriate, since a model assumption is that the logits are linear in all variables. However, looking at the error bars, there is no evidence that the logits are not linear in the ordinal version of the job category.
First things first:
idx <- sample(1000, 500)
train_set <- german[idx,]
test_set <- german[-idx,]
It’s important to randomly select the training and the test set. It might be that the dataset was assembled in a particular way, which might bias are results. In the worst case, all the loans in the first 500 rows would be good, which would make as always predict that the loan is good. Another situation is that it’s easier to predict the first 500 loans than the second 500 loans (perhaps they were arranged in order of uncertainty about whether the customer is good or not.) That would mean that the performance on the test set will be misleadingly bad.
Let’s try two big models and one small one. We’ll select the best model by looking at the best NLL. We’ll create a validation set on which to compute the NLL.
fit_nothing <- glm(GoodCredit ~ 1, family=binomial, data=train_set)
fit_few <- glm(GoodCredit ~ CreditAmount + Age, family=binomial, data=train_set)
fit_lots <- glm(GoodCredit~ CreditAmount + Age + Duration + Account + OtherDebtors + Job +OtherDebtors + CreditHistory + EmploymentSince + Purpose , family=binomial, data=train_set)
fit_lots_sq <- glm(GoodCredit~ CreditAmount + Age + I(Age^2) + Duration + I(Duration^2) + Account + OtherDebtors + OtherDebtors + CreditHistory + EmploymentSince + Purpose , family=binomial, data=train_set)
cv.glm(data=train_set, glmfit=fit_nothing, cost=cost_negloglikelihood)$delta[1]
## [1] 0.585267
cv.glm(data=train_set, glmfit=fit_few, cost=cost_negloglikelihood)$delta[1]
## [1] 0.5764255
cv.glm(data=train_set, glmfit=fit_lots, cost=cost_negloglikelihood)$delta[1]
## [1] 0.5467152
cv.glm(data=train_set, glmfit=fit_lots_sq, cost=cost_negloglikelihood)$delta[1]
## [1] 0.5438146
We see that the smallest NLL cost is obtained when using fit_lots_sq
. The benefit from the variables we selected seems small (as evident by the NLL per instance for fit_nothing
– but that’s probably because the people who obviously shouldn’t get a loan aren’t in the dataset).
Now, let’s compute the cost on the test set. We need to make a new cost function for that (though you could simply compute the cost directly as well.)
cost_classification_matrix <- function(r, pi) mean(5*(r==0)*(abs(r-pi)>0.5)+ 1*(r==1)*(abs(r-pi)>0.5))
cost_classification_matrix(train_set$GoodCredit, predict(fit_lots_sq, newdata=train_set, type="response"))
## [1] 0.814
cost_classification_matrix(test_set$GoodCredit, predict(fit_lots_sq, newdata=test_set, type="response"))
## [1] 1.186
cost_classification_matrix(train_set$GoodCredit, predict(fit_nothing, newdata=train_set, type="response"))
## [1] 1.35
cost_classification_matrix(test_set$GoodCredit, predict(fit_nothing, newdata=test_set, type="response"))
## [1] 1.65
We get a pretty decent win here. It makes sense to try to adjust the threshold, since it might very well make sense to predict “bad” even if the probability of bad credit is a bit lower than 50%, but that’s left as an exercise.
Our task here is to demonstrate overfitting. The key is to make the training set small, and use as many variables as possible.
train_small <- train_set[100:140,]
fit_overfit <- glm(GoodCredit~ CreditAmount + Age + I(Age^2) + Duration + I(Duration^2) + Account , family=binomial, data=train_small)
cost_negloglikelihood(test_set$GoodCredit, predict(fit_overfit, newdata=test_set, type="response"))/dim(test_set)[1]
## [1] 1.350608
cost_negloglikelihood(train_small$GoodCredit, predict(fit_overfit, newdata=train_small, type="response"))/dim(train_small)[1]
## [1] 0.4580464
As we can see the average negative log-likelihood is much smaller on the small training set than on the test set.