Deloitte Presents Machine Learning Challenge: Predict Loan Defaulters
Problem Description (from MachineHack):
Banks incur losses when customers do not repay their loans on time. Every year these losses run into crores of rupees and weigh on the country's economic growth. In this hackathon, we look at attributes such as funded amount, location, and loan balance to predict whether a person will default on a loan.
To frame the problem, MachineHack has created a training dataset of 67,463 rows and 35 columns and a testing dataset of 28,913 rows and 34 columns. The hackathon calls for a few prerequisite skills: working with a large dataset, managing the trade-off between underfitting and overfitting, and optimising log loss so that the model generalises well on unseen data.
Evaluation Metric: Log Loss
Ranking: 108 out of 461. First-place log loss was 0.34099; mine was 0.34845.
Models: Neural Network
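Since submissions are scored on log loss, it helps to compute the metric locally exactly as the leaderboard does; a minimal sketch with scikit-learn and toy values:
from sklearn.metrics import log_loss
import numpy as np

# toy example: true 0/1 labels and predicted probabilities of default
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.10, 0.80, 0.65, 0.30])

# log loss = -mean(y*log(p) + (1-y)*log(1-p)); lower is better
print(log_loss(y_true, y_prob))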
Data Processing
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
# load the MachineHack train and test files
train = pd.read_csv('../Data/train.csv')
test = pd.read_csv('../Data/test.csv')
# drop the ID and other unused columns; the test file also carries a 'Loan Status' column, which is removed
train = train.drop(columns=['ID', 'Accounts Delinquent'])
test = test.drop(columns=['ID', 'Accounts Delinquent', 'Loan Status'])
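The column groupings and transforms further down are the kind usually chosen to tame skewed distributions; a quick optional look at missing values and skewness (a sketch, using only the train DataFrame loaded above) helps confirm that:
# inspect dtypes, missing values and skewness of the numeric columns
train.info()
print(train.isna().sum().sort_values(ascending=False).head())
print(train.skew(numeric_only=True).sort_values(ascending=False).head(10))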
Heatmap
# rows with Loan Status == 0, kept for ad-hoc inspection
tester = train[train['Loan Status'] == 0]
# correlation of each numeric column with the target
sns.heatmap(train.corr()[['Loan Status']])
# same correlations, sorted and annotated for easier reading
plt.figure()
heatmap = sns.heatmap(train.corr()[['Loan Status']].sort_values(by='Loan Status', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
# categorical columns (one-hot encoded in the ColumnTransformer below)
cat_cols = [
'Batch Enrolled',
'Employment Duration',
'Verification Status',
'Payment Plan',
'Loan Title',
'Initial List Status',
'Application Type',
'Term',
'Collection 12 months Medical'
]
# numeric columns (standard-scaled in the ColumnTransformer below)
num_cols = [
'Loan Amount',
'Funded Amount',
'Funded Amount Investor',
'Term',
'Interest Rate',
'Home Ownership',
'Debit to Income',
'Delinquency - two years',
'Inquires - six months',
'Open Account',
'Public Record',
'Revolving Balance',
'Revolving Utilities',
'Total Accounts',
'Total Received Interest',
'Total Received Late Fee',
'Recoveries',
'Collection Recovery Fee',
'Last week Pay',
'Total Collection Amount',
'Total Current Balance',
'Total Revolving Credit Limit',
'Grade',
'Sub Grade'
]
# columns given a cube-root transform
cube_root = [
'Home Ownership',
'Revolving Balance',
'Total Current Balance',
'Total Received Interest',
'Interest Rate',
'Total Revolving Credit Limit',
'Inquires - six months',
'Public Record',
'Last week Pay'
]
# columns given a log(x + 1) transform
log = [
'Total Collection Amount',
'Total Received Late Fee',
'Collection Recovery Fee',
'Recoveries',
'Total Accounts'
]
# columns given a square-root transform
root = [
'Open Account',
'Funded Amount Investor',
'Delinquency - two years'
]
# columns kept on their original scale
normal = [
'Funded Amount',
'Loan Amount',
'Revolving Utilities',
'Debit to Income'
]
# letter-grade columns, ordinal-encoded as integers below
ord_cols = ['Grade', 'Sub Grade']
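As a quick visual check of these choices (a sketch, using 'Recoveries' from the log group as the example), the raw and transformed histograms can be compared before the mapping is applied:
# compare one column's distribution before and after the log(x + 1) transform
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(train['Recoveries'], bins=50)
axes[0].set_title('Recoveries (raw)')
axes[1].hist(np.log1p(train['Recoveries']), bins=50)
axes[1].set_title('Recoveries (log1p)')
plt.show()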
Transformations
# skew-reducing transforms, applied identically to train and test
train[cube_root] = np.cbrt(train[cube_root])
test[cube_root] = np.cbrt(test[cube_root])
train[log] = np.log(train[log] + 1)
test[log] = np.log(test[log] + 1)
train[root] = np.sqrt(train[root])
test[root] = np.sqrt(test[root])
# recode the 0/1 medical-collections flag so it is treated as a categorical feature
train['Collection 12 months Medical'] = train['Collection 12 months Medical'].map({
0: 'No',
1: 'Yes'
})
test['Collection 12 months Medical'] = test['Collection 12 months Medical'].map({
0: 'No',
1: 'Yes'
})
# separate the target from the feature matrix
labels = np.asarray(train['Loan Status'])
train = train.drop(columns=['Loan Status'])
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
# encode the letter grades as ordered integers before scaling
enc = OrdinalEncoder()
train[ord_cols] = enc.fit_transform(train[ord_cols])
test[ord_cols] = enc.transform(test[ord_cols])
# scale numeric columns, one-hot encode categoricals, pass the rest through;
# sparse_threshold=0 forces a dense output array for the neural network
ct = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
], remainder='passthrough', sparse_threshold=0)
trans_train = ct.fit_transform(train)
trans_test = ct.transform(test)
trans_train.shape
(67463, 190)
trans_test.shape
(28913, 190)
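For traceability, the 190 output columns can be mapped back to readable names (assuming a scikit-learn release where ColumnTransformer.get_feature_names_out is available, which the get_params() output below suggests):
# recover the names of the transformed feature columns
feature_names = ct.get_feature_names_out()
print(len(feature_names))
print(feature_names[:5])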
ct.get_params()
{'n_jobs': None,
'remainder': 'passthrough',
'sparse_threshold': 0,
'transformer_weights': None,
'transformers': [('num',
StandardScaler(),
['Loan Amount',
'Funded Amount',
'Funded Amount Investor',
'Term',
'Interest Rate',
'Home Ownership',
'Debit to Income',
'Delinquency - two years',
'Inquires - six months',
'Open Account',
'Public Record',
'Revolving Balance',
'Revolving Utilities',
'Total Accounts',
'Total Received Interest',
'Total Received Late Fee',
'Recoveries',
'Collection Recovery Fee',
'Last week Pay',
'Total Collection Amount',
'Total Current Balance',
'Total Revolving Credit Limit',
'Grade',
'Sub Grade']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['Batch Enrolled',
'Employment Duration',
'Verification Status',
'Payment Plan',
'Loan Title',
'Initial List Status',
'Application Type',
'Term',
'Collection 12 months Medical'])],
'verbose': False,
'verbose_feature_names_out': True,
'num': StandardScaler(),
'cat': OneHotEncoder(handle_unknown='ignore'),
'num__copy': True,
'num__with_mean': True,
'num__with_std': True,
'cat__categories': 'auto',
'cat__drop': None,
'cat__dtype': numpy.float64,
'cat__handle_unknown': 'ignore',
'cat__max_categories': None,
'cat__min_frequency': None,
'cat__sparse': 'deprecated',
'cat__sparse_output': True}
# confirm that every remaining feature column is covered by num_cols or cat_cols
for x in train.columns:
    if (x not in num_cols) and (x not in cat_cols):
        print(x)
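The modelling notebooks below rebuild these arrays through a DataProcessing helper in ../Scripts, which is not shown here. An alternative, sketched with hypothetical file names, is to persist the fitted transformer and arrays so downstream notebooks can reload them directly:
import joblib

# hypothetical output paths; the actual project reconstructs the data via Data_Processing.py instead
joblib.dump(ct, '../Models/column_transformer.joblib')
np.save('../Data/trans_train.npy', trans_train)
np.save('../Data/trans_test.npy', trans_test)
np.save('../Data/labels.npy', labels)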
Neural Network
import sys
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
sys.path.append('../Scripts')
from Data_Processing import DataProcessing
import joblib
from tensorflow import keras
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras import backend as K
from keras.callbacks import EarlyStopping
from datetime import datetime
from sklearn.metrics import roc_auc_score, log_loss
def evaluate(model, X_test, y_test):
    # flatten the (n, 1) sigmoid outputs into a 1-D probability vector
    predictions = model.predict(X_test).ravel()
    errors = abs(predictions - y_test)
    # note: MAPE is ill-defined when y_test contains zeros; AUC is the informative number here
    mape = 100 * np.mean(errors / y_test)
    accuracy = 100 - mape
    roc = roc_auc_score(y_test, predictions)
    print('Model Performance')
    print(f'AUC = {roc}')
    return accuracy
# rebuild the processed arrays via the shared script, then hold out the first 1,000 rows for validation
trans_train, trans_test, labels = DataProcessing()
X_valid = trans_train[:1000]
y_valid = labels[:1000]
X = trans_train[1000:]
y = labels[1000:]
# checkpoint the weights with the best validation loss and stop if it has not improved for 100 epochs
mc = ModelCheckpoint('../Models/Neural_Network_v2.h5', monitor='val_loss', mode='min', verbose=1, save_best_only=True)
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=100,
    verbose=0,
    mode='auto',
    baseline=None,
    restore_best_weights=True)
# fully connected network (first hidden layer sigmoid, the rest SELU) with batch
# normalisation between layers, ending in a sigmoid output for the probability of default
model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(184,)),
    keras.layers.Dense(200, activation='sigmoid', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(400, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(150, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(350, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(50, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(25, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(15, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1, activation='sigmoid')
])
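The hard-coded input width (184) has to match the number of columns produced by DataProcessing(), which may differ from the 190 seen in the exploratory notebook above if the script encodes the data slightly differently. A quick defensive check, not part of the original code:
# verify the network's input width matches the processed feature matrix
assert trans_train.shape[1] == model.input_shape[-1], (
    f'expected {model.input_shape[-1]} features, got {trans_train.shape[1]}')
model.summary()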
# Adam with a modest learning rate; binary cross-entropy is exactly the competition's log loss
optimizer = keras.optimizers.Adam(learning_rate=3e-4)
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])
# train with a 30% validation split; early stopping decides the effective number of epochs
history = model.fit(
    X,
    y,
    batch_size=32,
    epochs=5000,
    validation_split=0.3,
    callbacks=[mc, early_stopping],
    shuffle=True,
)
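Keras records the loss curves in history; plotting them shows where early stopping restored the best weights (an optional sketch, with its own matplotlib import since this notebook does not load it above):
import matplotlib.pyplot as plt

# training vs. validation log loss over epochs
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy (log loss)')
plt.legend()
plt.show()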
# reload the checkpointed best model and score it on the held-out 1,000 rows
nn_model = load_model('../Models/Neural_Network_v2.h5')
evaluate(nn_model, X_valid, y_valid)
y_pred = nn_model.predict(X_valid)
log_loss(y_valid, y_pred)
Submission
import sys
import pandas as pd
import numpy as np
sys.path.append('../Scripts')
from Data_Processing import DataProcessing
import joblib
from tensorflow import keras
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from datetime import datetime
# rebuild the processed test matrix, load the best checkpoint and write out predicted probabilities
trans_train, trans_test, labels = DataProcessing()
nn_model = load_model('../Models/Neural_Network_v2.h5')
predictions = nn_model.predict(trans_test)
submission = pd.DataFrame(predictions, columns=['Loan Status'])
today = datetime.today().strftime('%m-%d-%y %H-%M')
submission.to_csv(f'../Submissions/Neural_Network {today}.csv', index=False)
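Before uploading, a quick check that the file has the expected 28,913 rows and that every prediction is a valid probability (a sketch, not part of the original notebook):
# confirm shape and value range of the submission
print(submission.shape)  # expected: (28913, 1)
assert submission['Loan Status'].between(0, 1).all()
print(submission['Loan Status'].describe())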