Deloitte Presents Machine Learning Challenge: Predict Loan Defaulters
Problem Description (from MachineHack):
Banks incur losses when customers do not repay their loans on time. Every year these losses run into crores of rupees and weigh on the country's economic growth. In this hackathon, we look at attributes such as funded amount, location, and loan balance to predict whether a person will default on a loan.
To frame the problem, MachineHack has created a training dataset of 67,463 rows and 35 columns and a testing dataset of 28,913 rows and 34 columns. The hackathon calls for a few prerequisite skills: working with a large dataset, managing the trade-off between underfitting and overfitting, and optimising log loss so that the model generalises well on unseen data.
Evaluation Metric: Log Loss
Ranking: 108 out of 461. First-place log loss was 0.34099; mine was 0.34845.
Models: Neural Network
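Since submissions are scored on log loss, it helps to compute the metric locally exactly as the leaderboard does; a minimal sketch with scikit-learn and toy values:
from sklearn.metrics import log_loss
import numpy as np

# toy example: true 0/1 labels and predicted probabilities of default
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.10, 0.80, 0.65, 0.30])

# log loss = -mean(y*log(p) + (1-y)*log(1-p)); lower is better
print(log_loss(y_true, y_prob))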
Data Processing
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
# load the MachineHack train and test files
train = pd.read_csv('../Data/train.csv')
test = pd.read_csv('../Data/test.csv')
# drop the ID and other unused columns; the test file also carries a 'Loan Status' column, which is removed
train = train.drop(columns=['ID', 'Accounts Delinquent'])
test = test.drop(columns=['ID', 'Accounts Delinquent', 'Loan Status'])
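The column groupings and transforms further down are the kind usually chosen to tame skewed distributions; a quick optional look at missing values and skewness (a sketch, using only the train DataFrame loaded above) helps confirm that:
# inspect dtypes, missing values and skewness of the numeric columns
train.info()
print(train.isna().sum().sort_values(ascending=False).head())
print(train.skew(numeric_only=True).sort_values(ascending=False).head(10))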
Heatmap
# rows with Loan Status == 0, kept for ad-hoc inspection
tester = train[train['Loan Status'] == 0]
# correlation of each numeric column with the target
sns.heatmap(train.corr()[['Loan Status']])
# same correlations, sorted and annotated for easier reading
plt.figure()
heatmap = sns.heatmap(train.corr()[['Loan Status']].sort_values(by='Loan Status', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
# categorical columns (one-hot encoded in the ColumnTransformer below)
cat_cols = [
'Batch Enrolled',
'Employment Duration',
'Verification Status',
'Payment Plan',
'Loan Title',
'Initial List Status',
'Application Type',
'Term',
'Collection 12 months Medical'
]
# numeric columns (standard-scaled in the ColumnTransformer below)
num_cols = [
'Loan Amount',
'Funded Amount',
'Funded Amount Investor',
'Term',
'Interest Rate',
'Home Ownership',
'Debit to Income',
'Delinquency - two years',
'Inquires - six months',
'Open Account',
'Public Record',
'Revolving Balance',
'Revolving Utilities',
'Total Accounts',
'Total Received Interest',
'Total Received Late Fee',
'Recoveries',
'Collection Recovery Fee',
'Last week Pay',
'Total Collection Amount',
'Total Current Balance',
'Total Revolving Credit Limit',
'Grade',
'Sub Grade'
]
# columns given a cube-root transform
cube_root = [
'Home Ownership',
'Revolving Balance',
'Total Current Balance',
'Total Received Interest',
'Interest Rate',
'Total Revolving Credit Limit',
'Inquires - six months',
'Public Record',
'Last week Pay'
]
# columns given a log(x + 1) transform
log = [
'Total Collection Amount',
'Total Received Late Fee',
'Collection Recovery Fee',
'Recoveries',
'Total Accounts'
]
# columns given a square-root transform
root = [
'Open Account',
'Funded Amount Investor',
'Delinquency - two years'
]
# columns kept on their original scale
normal = [
'Funded Amount',
'Loan Amount',
'Revolving Utilities',
'Debit to Income'
]
# letter-grade columns, ordinal-encoded as integers below
ord_cols = ['Grade', 'Sub Grade']
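As a quick visual check of these choices (a sketch, using 'Recoveries' from the log group as the example), the raw and transformed histograms can be compared before the mapping is applied:
# compare one column's distribution before and after the log(x + 1) transform
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(train['Recoveries'], bins=50)
axes[0].set_title('Recoveries (raw)')
axes[1].hist(np.log1p(train['Recoveries']), bins=50)
axes[1].set_title('Recoveries (log1p)')
plt.show()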
Transformations
# skew-reducing transforms, applied identically to train and test
train[cube_root] = np.cbrt(train[cube_root])
test[cube_root] = np.cbrt(test[cube_root])
train[log] = np.log(train[log] + 1)
test[log] = np.log(test[log] + 1)
train[root] = np.sqrt(train[root])
test[root] = np.sqrt(test[root])
# recode the 0/1 medical-collections flag so it is treated as a categorical feature
train['Collection 12 months Medical'] = train['Collection 12 months Medical'].map({
0: 'No',
1: 'Yes'
})
test['Collection 12 months Medical'] = test['Collection 12 months Medical'].map({
0: 'No',
1: 'Yes'
})
# separate the target from the feature matrix
labels = np.asarray(train['Loan Status'])
train = train.drop(columns=['Loan Status'])
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
# encode the letter grades as ordered integers before scaling
enc = OrdinalEncoder()
train[ord_cols] = enc.fit_transform(train[ord_cols])
test[ord_cols] = enc.transform(test[ord_cols])
# scale numeric columns, one-hot encode categoricals, pass the rest through;
# sparse_threshold=0 forces a dense output array for the neural network
ct = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
], remainder='passthrough', sparse_threshold=0)
trans_train = ct.fit_transform(train)
trans_test = ct.transform(test)
trans_train.shape
(67463, 190)
trans_test.shape
(28913, 190)
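For traceability, the 190 output columns can be mapped back to readable names (assuming a scikit-learn release where ColumnTransformer.get_feature_names_out is available, which the get_params() output below suggests):
# recover the names of the transformed feature columns
feature_names = ct.get_feature_names_out()
print(len(feature_names))
print(feature_names[:5])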
ct.get_params()
{'n_jobs': None,
'remainder': 'passthrough',
'sparse_threshold': 0,
'transformer_weights': None,
'transformers': [('num',
StandardScaler(),
['Loan Amount',
'Funded Amount',
'Funded Amount Investor',
'Term',
'Interest Rate',
'Home Ownership',
'Debit to Income',
'Delinquency - two years',
'Inquires - six months',
'Open Account',
'Public Record',
'Revolving Balance',
'Revolving Utilities',
'Total Accounts',
'Total Received Interest',
'Total Received Late Fee',
'Recoveries',
'Collection Recovery Fee',
'Last week Pay',
'Total Collection Amount',
'Total Current Balance',
'Total Revolving Credit Limit',
'Grade',
'Sub Grade']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['Batch Enrolled',
'Employment Duration',
'Verification Status',
'Payment Plan',
'Loan Title',
'Initial List Status',
'Application Type',
'Term',
'Collection 12 months Medical'])],
'verbose': False,
'verbose_feature_names_out': True,
'num': StandardScaler(),
'cat': OneHotEncoder(handle_unknown='ignore'),
'num__copy': True,
'num__with_mean': True,
'num__with_std': True,
'cat__categories': 'auto',
'cat__drop': None,
'cat__dtype': numpy.float64,
'cat__handle_unknown': 'ignore',
'cat__max_categories': None,
'cat__min_frequency': None,
'cat__sparse': 'deprecated',
'cat__sparse_output': True}
# confirm that every remaining feature column is covered by num_cols or cat_cols
for x in train.columns:
    if (x not in num_cols) and (x not in cat_cols):
        print(x)
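The modelling notebooks below rebuild these arrays through a DataProcessing helper in ../Scripts, which is not shown here. An alternative, sketched with hypothetical file names, is to persist the fitted transformer and arrays so downstream notebooks can reload them directly:
import joblib

# hypothetical output paths; the actual project reconstructs the data via Data_Processing.py instead
joblib.dump(ct, '../Models/column_transformer.joblib')
np.save('../Data/trans_train.npy', trans_train)
np.save('../Data/trans_test.npy', trans_test)
np.save('../Data/labels.npy', labels)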
Neural Network
import sys
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
sys.path.append('../Scripts')
from Data_Processing import DataProcessing
import joblib
from tensorflow import keras
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras import backend as K
from keras.callbacks import EarlyStopping
from datetime import datetime
from sklearn.metrics import roc_auc_score, log_loss
def evaluate(model, X_test, y_test):
    # flatten the (n, 1) sigmoid outputs into a 1-D probability vector
    predictions = model.predict(X_test).ravel()
    errors = abs(predictions - y_test)
    # note: MAPE is ill-defined when y_test contains zeros; AUC is the informative number here
    mape = 100 * np.mean(errors / y_test)
    accuracy = 100 - mape
    roc = roc_auc_score(y_test, predictions)
    print('Model Performance')
    print(f'AUC = {roc}')
    return accuracy
# rebuild the processed arrays via the shared script, then hold out the first 1,000 rows for validation
trans_train, trans_test, labels = DataProcessing()
X_valid = trans_train[:1000]
y_valid = labels[:1000]
X = trans_train[1000:]
y = labels[1000:]
# checkpoint the weights with the best validation loss and stop if it has not improved for 100 epochs
mc = ModelCheckpoint('../Models/Neural_Network_v2.h5', monitor='val_loss', mode='min', verbose=1, save_best_only=True)
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=100,
    verbose=0,
    mode='auto',
    baseline=None,
    restore_best_weights=True)
# fully connected network (first hidden layer sigmoid, the rest SELU) with batch
# normalisation between layers, ending in a sigmoid output for the probability of default
model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(184,)),
    keras.layers.Dense(200, activation='sigmoid', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(400, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(150, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(350, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(50, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(25, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(15, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1, activation='sigmoid')
])
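The hard-coded input width (184) has to match the number of columns produced by DataProcessing(), which may differ from the 190 seen in the exploratory notebook above if the script encodes the data slightly differently. A quick defensive check, not part of the original code:
# verify the network's input width matches the processed feature matrix
assert trans_train.shape[1] == model.input_shape[-1], (
    f'expected {model.input_shape[-1]} features, got {trans_train.shape[1]}')
model.summary()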
# Adam with a modest learning rate; binary cross-entropy is exactly the competition's log loss
optimizer = keras.optimizers.Adam(learning_rate=3e-4)
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])
# train with a 30% validation split; early stopping decides the effective number of epochs
history = model.fit(
    X,
    y,
    batch_size=32,
    epochs=5000,
    validation_split=0.3,
    callbacks=[mc, early_stopping],
    shuffle=True,
)
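Keras records the loss curves in history; plotting them shows where early stopping restored the best weights (an optional sketch, with its own matplotlib import since this notebook does not load it above):
import matplotlib.pyplot as plt

# training vs. validation log loss over epochs
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy (log loss)')
plt.legend()
plt.show()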
# reload the checkpointed best model and score it on the held-out 1,000 rows
nn_model = load_model('../Models/Neural_Network_v2.h5')
evaluate(nn_model, X_valid, y_valid)
y_pred = nn_model.predict(X_valid)
log_loss(y_valid, y_pred)
Submission
import sys
import pandas as pd
import numpy as np
sys.path.append('../Scripts')
from Data_Processing import DataProcessing
import joblib
from tensorflow import keras
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from datetime import datetime
# rebuild the processed test matrix, load the best checkpoint and write out predicted probabilities
trans_train, trans_test, labels = DataProcessing()
nn_model = load_model('../Models/Neural_Network_v2.h5')
predictions = nn_model.predict(trans_test)
submission = pd.DataFrame(predictions, columns=['Loan Status'])
today = datetime.today().strftime('%m-%d-%y %H-%M')
submission.to_csv(f'../Submissions/Neural_Network {today}.csv', index=False)
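Before uploading, a quick check that the file has the expected 28,913 rows and that every prediction is a valid probability (a sketch, not part of the original notebook):
# confirm shape and value range of the submission
print(submission.shape)  # expected: (28913, 1)
assert submission['Loan Status'].between(0, 1).all()
print(submission['Loan Status'].describe())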