
To lend or not to lend: Good risk for the bank?

  • Writer: Vesna Lukic
  • Feb 7, 2020
  • 5 min read

Updated: Feb 15, 2020

We download the German Credit Data dataset from the UCI Machine Learning Repository.


The data is available in two different formats: german.data, which contains a mix of categorical and numerical variables, and german.data-numeric, in which all variables have been converted to numerical form. We focus on the former, german.data.


There are 20 attributes for a set of 1000 individuals, as well as an outcome variable indicating whether providing them with the requested loan is a good or bad risk for the bank. A few examples of the attributes include the status of their existing checking account, the duration of the loan, credit history and purpose of the loan.


We chose the Python language to do the analysis.


1. Reading in and cleaning the data


We read the data into a pandas dataframe. Next, we shuffle the rows to remove any ordering correlation between them, if it exists. We check each column for missing entries and find there are none.
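These steps can be sketched in pandas as follows. The file path and the tiny stand-in dataframe are assumptions for illustration; the real read call is shown in a comment.

```python
import pandas as pd

# The UCI file is whitespace-separated with no header row; reading it
# would look like this (path is an assumption):
# df = pd.read_csv("german.data", sep=r"\s+", header=None)

# Tiny illustrative stand-in frame so the steps below run as-is.
df = pd.DataFrame({"checking": ["A11", "A12", "A14", "A11"],
                   "duration": [6, 48, 12, 42],
                   "risk":     [1, 2, 1, 1]})

# Shuffle the rows to remove any ordering effects, then re-index.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Column-wise count of missing entries (all zeros for this dataset).
missing = df.isnull().sum()
```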


2. Exploratory data analysis


The data has 1000 non-empty rows, with a mix of categorical and numerical variables. Next we find how many unique entries there are across the columns, and plot the result as a bar graph.
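Counting unique entries per column is a one-liner with `DataFrame.nunique`; a minimal sketch on a small stand-in frame (the data values here are assumptions):

```python
import pandas as pd

# Illustrative stand-in for the full 1000-row dataset.
df = pd.DataFrame({"checking": ["A11", "A12", "A14", "A11"],
                   "duration": [6, 48, 12, 42],
                   "amount":   [1169, 5951, 2096, 7882]})

# Number of unique entries in each column.
unique_counts = df.nunique()

# The bar graph of the result (matplotlib assumed available):
# unique_counts.plot(kind="bar")
```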



[Figure: bar graph of the number of unique entries per column]

The 5th column (credit amount; numerical) has the highest number of unique values. The next highest is column 13 (age in years; numerical), followed by column 2 (duration of loan; numerical) and column 4 (purpose of loan; categorical).


The final column is another categorical variable, indicating whether the loan is a good risk (700 cases) or bad risk (300 cases) for the bank.


We can also draw pie charts of good and bad risk versus the checking account status, which takes four categorical values.
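One way to build such charts is to compute the share of each checking-account category within the good and bad groups via `pd.crosstab`. The sample rows below are invented for illustration:

```python
import pandas as pd

# Toy sample: checking-account status vs the risk outcome.
df = pd.DataFrame({"checking": ["A11", "A14", "A13", "A14", "A11", "A12"],
                   "risk":     ["good", "good", "good", "good", "bad", "bad"]})

# Share of each checking-account category within each risk class.
shares = pd.crosstab(df["risk"], df["checking"], normalize="index")

# One pie chart per risk class (matplotlib assumed available):
# shares.loc["good"].plot(kind="pie")
# shares.loc["bad"].plot(kind="pie")
```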


[Figure: pie charts of good and bad risk by checking account status]

The pie charts indicate that, in terms of checking account status, a loan is likely to be a good risk if there is a higher percentage of individuals with more than 200 DM in the checking account or with salary assignments for at least one year (A13), and a lower percentage of individuals with no checking account (A14). The other categories, which indicate a negative or relatively low account balance, are weaker indicators.


Next we draw a histogram of good and bad risk with respect to the credit amount.



[Figure: histograms of credit amount for good and bad risk]

The histogram indicates that a loan is likely to be a good risk for the bank when the credit amount is low: lower credit amounts make up a higher proportion of the good-risk cases, while higher credit amounts make up a larger proportion of the bad-risk cases than of the good-risk ones. Therefore, the bank is better off lending smaller credit amounts overall.
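A quick sketch of this comparison, on invented amounts and labels; the overlaid histograms are shown as comments, and the per-class medians make the same point numerically:

```python
import pandas as pd

# Toy credit amounts with their risk labels (values are assumptions).
df = pd.DataFrame({"amount": [800, 1200, 2500, 9000, 12000, 1500, 700, 11000],
                   "risk":   ["good", "good", "good", "bad", "bad",
                              "good", "good", "bad"]})

# Overlaid histograms of credit amount per class (matplotlib assumed):
# df[df.risk == "good"]["amount"].plot(kind="hist", alpha=0.5)
# df[df.risk == "bad"]["amount"].plot(kind="hist", alpha=0.5)

# The same comparison numerically: median amount per class.
medians = df.groupby("risk")["amount"].median()
```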


3. Making models


Now that we've explored the data, we can start to make models. Given the relatively low dimensionality of the problem, we will experiment with the following machine learning algorithms: decision trees (DT), gradient boosted trees (GBM), random forest (RF) and support vector machines (SVM).


A decision tree uses a tree-like structure consisting of questions (represented in the nodes) and their possible outcomes (represented in the branches). For example, in the Titanic: Machine Learning from Disaster dataset, one can predict the chances of survival by using decision trees to ask questions such as those shown:


[Figure: example decision tree for the Titanic survival dataset]

The left and right numbers below the nodes represent the probability of survival and percentage of observations falling into the category respectively. Decision trees can be used in both classification and regression tasks.


Random forests use a collection of decision trees for training, and the overall outcome assigned is the most frequently appearing outcome across all trees (for classification) or the mean value (in terms of regression).
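The majority-vote behaviour can be seen with scikit-learn's `RandomForestClassifier`. The data here is a synthetic two-feature toy problem, an assumption for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem: label is 1 when the
# two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Each of the 50 trees votes; the forest predicts the majority class.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = rf.predict([[2.0, 2.0]])  # a point deep in the positive region
```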




First we will use the default parameter setups in Python to get a baseline model of performance. Given that some machine learning algorithms cannot handle categorical data, we convert the categorical variables to numerical form. There are 20 predictor variables and 1 outcome variable (risk).
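One common way to do this conversion is one-hot encoding with `pd.get_dummies` (an ordinal encoding per column is an alternative); a sketch on an invented two-column slice:

```python
import pandas as pd

# Toy slice with one categorical and one numerical predictor.
df = pd.DataFrame({"checking": ["A11", "A12", "A14", "A11"],
                   "duration": [6, 48, 12, 42]})

# One-hot encode the categorical column so any algorithm can use it;
# numerical columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["checking"])
```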


We show the relative feature importance for the algorithms of choice:



[Figure: relative feature importances for each algorithm]

We can see that, when using the baseline parameters, four variables (0, 1, 4 and 8, representing duration in month, credit amount, age in years and status of current checking account, respectively) are common to the different machine learning methods as the ones with the highest influence on risk.
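The tree-based models expose these values through the `feature_importances_` attribute (SVMs do not provide one). A minimal sketch on synthetic data, where the outcome is driven by feature 0 alone so that feature dominates the importances:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: four features, outcome depends only on feature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1; random forests and gradient boosted trees
# expose the same attribute.
importances = tree.feature_importances_
```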


Now, choosing 80% of the data for training and 20% for validation and testing, we can compare the accuracies, area under the curve (AUC), mean cross-validation score (using 4-fold cross-validation) and confusion matrices of the chosen algorithms, in terms of baseline performance. Performance metrics can vary somewhat from run to run, so we run each model 5 times and report the mean +/- standard deviation across all 5 runs.
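A sketch of computing these metrics for one model, here the gradient boosted classifier on a synthetic stand-in dataset (the data and split seed are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the encoded credit data.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 80/20 train/test split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
acc = gbm.score(X_te, y_te)                              # accuracy
auc = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])  # AUC
cm = confusion_matrix(y_te, gbm.predict(X_te))           # confusion matrix
cv = cross_val_score(gbm, X, y, cv=4).mean()             # mean 4-fold CV
```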



The metrics show that the gradient boosted model performs best overall, obtaining a mean cross-validation score of 78%. The support vector machine model performs the poorest, although it should be noted that its performance suffers when the dataset is imbalanced (700 cases vs 300 cases). The models all display some level of overfitting, given the gap between the training and validation scores: the training scores are generally much higher, while performance on the test sets is much lower on average. This could be remedied by tuning the hyper-parameters, since the defaults are unlikely to be optimal for the particular dataset in question.


We therefore perform hyper-parameter optimisation via grid search, in which we scan through different combinations of parameters of choice; for example when using decision trees, the different parameters include the maximum depth of the tree, minimum number of samples at each data split, and minimum number of samples in a leaf, to name a few. We can use GridSearchCV from the sklearn library to perform the grid search.
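A minimal sketch of such a search over the decision-tree parameters named above; the grid values and the synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Illustrative grid over the parameters mentioned in the text.
param_grid = {"max_depth": [2, 4, 6],
              "min_samples_split": [2, 10],
              "min_samples_leaf": [1, 5]}

# GridSearchCV tries every combination with 4-fold cross-validation
# and keeps the best-scoring one.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=4)
search.fit(X, y)
best = search.best_params_
```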


After updating the parameters such that the optimised ones are used, we obtain the following table.


We see that, upon performing a grid search and using the optimised hyper-parameters, we obtain a general improvement in validation set performance. The models show less overfitting than the baseline models (the training scores are lower and closer in value to the validation scores). Across all validation metrics, the gradient boosted tree model still outperforms the others.


In summary, we first explored the data: the variable types (categorical and numerical) and their numbers of unique values, and the outcome variable (good versus bad risk) against checking account status and credit amount. We then compared several machine learning models (decision trees, random forests, gradient boosted trees and support vector machines), examining which features most influence the outcome variable under the baseline performances (default parameter selection in Python). After performing a grid search to find the optimised parameters for this dataset, we showed a general improvement in metrics across the machine learning methods, and found that the gradient boosted model performs the best overall.



