Swiss Re: Predict Accident Risk Score for Unique Postcode
Problem Description (from Machine Hack):
Swiss Re is one of the largest reinsurers in the world, headquartered in Zurich with offices in over 25 countries. Swiss Re’s core expertise is underwriting in the life, health, and property and casualty insurance space, while its tech strategy focuses on developing smarter, more innovative solutions for clients’ value chains by leveraging data and technology.
The company’s vision is to make the world more resilient. Swiss Re believes in applying fresh perspectives, knowledge and capital to anticipate and manage risk, create smarter solutions, and help the world rebuild, renew and move forward. About 1,300 professionals working in the Swiss Re Global Business Solutions Center (BSC), Bangalore combine experience, expertise and out-of-the-box thinking to bring Swiss Re's core business to life by creating new business opportunities.
According to IBEF, “Domestic automobiles production increased at 2.36% CAGR between FY16-20 with 26.36 million vehicles being manufactured in the country in FY20. Overall, domestic automobiles sales increased at 1.29% CAGR between FY16-FY20 with 21.55 million vehicles being sold in FY20”. The rise in the number of vehicles on the road brings multiple challenges and makes roads more vulnerable to accidents. Increased accident rates also lead to more insurance claims and higher payouts for insurance companies.
To pre-emptively plan for these losses, insurance firms leverage accident data to understand risk across geographical units, e.g. postcode or district.
Evaluation Metric: Root Mean Square Error
Ranking: 94 out of 281. First place had an RMSE of 0.63602 and mine was 0.63802.
Models: XGBoost
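For reference, the leaderboard metric can be reproduced locally with scikit-learn; the snippet below is only an illustration and uses made-up placeholder arrays (y_true_example, y_pred_example), not competition data.
# Minimal RMSE sketch with placeholder numbers, not competition data
import numpy as np
from sklearn.metrics import mean_squared_error
y_true_example = np.array([0.5, 1.0, 2.0])
y_pred_example = np.array([0.6, 0.9, 2.2])
rmse = np.sqrt(mean_squared_error(y_true_example, y_pred_example))  # RMSE is the square root of the MSE
print(rmse)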
import pandas as pd
import numpy as np
from Data_Processing import DataProcessing
from sklearn.model_selection import train_test_split
import joblib
from sklearn.ensemble import RandomForestRegressor
from datetime import datetime
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')
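# Load the population statistics and the competition train/test files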
pop = pd.read_csv('../Data/population.csv')
train = pd.read_csv('../Data/train.csv')
test = pd.read_csv('../Data/test.csv')
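# DataProcessing (from the local Data_Processing module) takes the raw train, test and population
# frames and returns the feature matrix X, the target y, and the processed test set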
X, y, test = DataProcessing(train, test, pop)
#pca = PCA(n_components=200)
#X = pca.fit_transform(X)
y = y.ravel()
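# Hold out 10% of the rows as a validation split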
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train.shape
Baseline
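A default RandomForestRegressor served as a quick baseline; the cell is kept commented out for reference.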
# rf = RandomForestRegressor(n_jobs=-1)
# rf.fit(X_train, y_train)
# y_true = y_test
# y_pred = rf.predict(X_test)
# mean_squared_error(y_true, y_pred)
XGBoost
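XGBoost is tuned with Bayesian optimization (bayes_opt), using cross-validated RMSE as the objective.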
#Importing the libraries needed for XGBoost and Bayesian hyperparameter tuning
import xgboost as xgb
from bayes_opt import BayesianOptimization
#Converting the training data into XGBoost's DMatrix object
dtrain = xgb.DMatrix(X_train, label=y_train)
#Bayesian Optimization objective function for XGBoost
#specify the parameters you want to tune as keyword arguments
def bo_tune_xgb(max_depth, gamma, n_estimators, learning_rate):
    params = {'max_depth': int(max_depth),
              'gamma': gamma,
              'n_estimators': int(n_estimators),
              'learning_rate': learning_rate,
              'subsample': 0.8,
              'eta': 0.1,
              'eval_metric': 'rmse'}
    #Cross-validate with the sampled parameters: 5 folds, 50 boosting rounds
    #(xgb.cv controls the number of trees via num_boost_round, so n_estimators
    # only takes effect later when fitting the final XGBRegressor)
    cv_result = xgb.cv(params, dtrain, num_boost_round=50, nfold=5)
    #Return the negative RMSE, since BayesianOptimization maximizes the objective
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]
#Invoking the Bayesian Optimizer with the specified parameters to tune
xgb_bo = BayesianOptimization(bo_tune_xgb, {'max_depth': (1, 8),
                                            'gamma': (0, 0.1),
                                            'n_estimators': (10, 100),
                                            'learning_rate': (0, 0.1)})
xgb_bo.maximize(n_iter=10, init_points=8, acq='ei')
params = xgb_bo.max['params']
print(params)
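# BayesianOptimization returns all parameters as floats, so cast the integer-valued ones back to int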
params['max_depth']= int(params['max_depth'])
params['n_estimators']= int(params['n_estimators'])
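# Refit a final XGBRegressor on the training split with the tuned parameters and score it on the held-out split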
from xgboost import XGBRegressor
xgb1_model = XGBRegressor(**params).fit(X_train, y_train)
y_true = y_test
y_pred = xgb1_model.predict(X_test)
mean_squared_error(y_true, y_pred)  # validation MSE; the competition metric (RMSE) is the square root of this value
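# Sanity check: average the row-level predictions per postcode on the training data
# and compare them against the true per-postcode means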
train_df = pd.read_csv('../Data/train.csv')
test_df = pd.read_csv('../Data/test.csv')
train_zips = train_df['postcode']
y_true = y
y_pred = xgb1_model.predict(X)
df = pd.DataFrame({'postcode': train_zips, 'y_true': y_true, 'y_pred': y_pred}, columns=['postcode', 'y_true', 'y_pred'])
df = df.groupby('postcode').mean()
df
y_pred = xgb1_model.predict(test)
#y_pred = model.predict(pca.transform(test))
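# The submission needs one Accident_risk_index per postcode, so row-level test predictions are averaged by postcode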
test_zips = test_df['postcode']
test_df = pd.DataFrame({'postcode': test_zips, 'Accident_risk_index': y_pred.flatten()}, columns=['postcode', 'Accident_risk_index'])
submission = test_df.groupby('postcode').mean().reset_index()
submission.to_csv('../Submissions/XGBoost.csv', index=False)
submission