Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines
Problem Description (from DrivenData):
Can you predict whether people got H1N1 and seasonal flu vaccines using information they shared about their backgrounds, opinions, and health behaviors?
In this challenge, we will take a look at vaccination, a key public health measure used to fight infectious diseases. Vaccines provide immunization for individuals, and enough immunization in a community can further reduce the spread of diseases through "herd immunity."
As of the launch of this competition, vaccines for the COVID-19 virus are still under development and not yet available. The competition will instead revisit the public health response to a different recent major respiratory disease pandemic. Beginning in spring 2009, a pandemic caused by the H1N1 influenza virus, colloquially named "swine flu," swept across the world. Researchers estimate that in the first year, it was responsible for between 151,000 and 575,000 deaths globally.
A vaccine for the H1N1 flu virus became publicly available in October 2009. In late 2009 and early 2010, the United States conducted the National 2009 H1N1 Flu Survey. This phone survey asked respondents whether they had received the H1N1 and seasonal flu vaccines, in conjunction with questions about themselves. These additional questions covered their social, economic, and demographic background, opinions on risks of illness and vaccine effectiveness, and behaviors towards mitigating transmission. A better understanding of how these characteristics are associated with personal vaccination patterns can provide guidance for future public health efforts.
Your goal is to predict how likely individuals are to receive their H1N1 and seasonal flu vaccines. Specifically, you'll be predicting two probabilities: one for h1n1_vaccine and one for seasonal_vaccine.
Evaluation Metric: Area Under the Receiver Operating Characteristic (AUROC / AUC)
Rank: 298 out of 1,697. The first-place AUC was 0.8658; mine was 0.8618.
Models: CatBoost
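Since the leaderboard score is the AUROC macro-averaged over the two targets, it helps to see the computation once. A minimal sketch with made-up labels and probabilities (illustrative values only, not competition data):
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])                 # columns: h1n1, seasonal
y_prob = np.array([[0.2, 0.9], [0.7, 0.4], [0.1, 0.3], [0.8, 0.6]])
# With 2-D inputs, scikit-learn macro-averages the per-column AUCs,
# which is exactly how the competition computes the overall score
print(roc_auc_score(y_true, y_prob, average='macro'))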
Data Processing
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings("ignore")
# Functions
def evaluate(model, X_test, y_test):
    """Report holdout accuracy and AUC for a fitted binary classifier."""
    predictions = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # AUC needs scores, not hard labels
    errors = abs(predictions - np.ravel(y_test))
    accuracy = 100 * (1 - np.mean(errors))
    roc = roc_auc_score(y_test, proba)
    print('Model Performance')
    print('Average Error: {:0.4f}'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%'.format(accuracy))
    print(f'AUC = {roc:.4f}')
    return accuracy
train = pd.read_csv('../Data/training_set_features.csv', index_col='respondent_id')
test = pd.read_csv('../Data/test_set_features.csv', index_col ='respondent_id')
labels = pd.read_csv('../Data/training_set_labels.csv', index_col='respondent_id')
# Domain-informed fill: respondents 65+ with a missing employment status are assumed retired
train.loc[(train['age_group'] == '65+ Years') & (train['employment_status'].isnull()), 'employment_status'] = 'Not in Labor Force'
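A quick distribution check (a hedged sanity print, not part of the original pipeline) shows whether this fill is reasonable:
print(train.loc[train['age_group'] == '65+ Years', 'employment_status']
      .value_counts(dropna=False))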
num_cols = list(train.select_dtypes('number').columns)
cat_cols = [
    'race',
    'sex',
    'marital_status',
    'rent_or_own',
    'hhs_geo_region',
    'census_msa',
    'employment_industry',
    'employment_occupation'
]
ord_cols = [
    'age_group',
    'education',
    'income_poverty',
    'employment_status'
]
# Impute: flag missing numerics with -1 and give missing categoricals their own 'None' level
for col in num_cols:
    train[col] = train[col].fillna(value=-1)
    test[col] = test[col].fillna(value=-1)
for col in (cat_cols + ord_cols):
    train[col] = train[col].fillna(value='None')
    test[col] = test[col].fillna(value='None')
test_labels = labels.copy()
# Encode ordinal features as integers, preserving their natural order
train['age_group'] = train['age_group'].map({
    '18 - 34 Years': 1,
    '35 - 44 Years': 2,
    '45 - 54 Years': 3,
    '55 - 64 Years': 4,
    '65+ Years': 5
})
train['education'] = train['education'].map({
    '< 12 Years': 1,
    '12 Years': 2,
    'Some College': 3,
    'College Graduate': 4,
    'None': -1
})
train['income_poverty'] = train['income_poverty'].map({
    'None': -1,
    'Below Poverty': 1,
    '<= $75,000, Above Poverty': 2,
    '> $75,000': 3
})
train['employment_status'] = train['employment_status'].map({
    'None': -1,
    'Unemployed': 1,
    'Employed': 2,
    'Not in Labor Force': 3
})
# Apply the same encodings to the test set
test['age_group'] = test['age_group'].map({
    '18 - 34 Years': 1,
    '35 - 44 Years': 2,
    '45 - 54 Years': 3,
    '55 - 64 Years': 4,
    '65+ Years': 5
})
test['education'] = test['education'].map({
    '< 12 Years': 1,
    '12 Years': 2,
    'Some College': 3,
    'College Graduate': 4,
    'None': -1
})
test['income_poverty'] = test['income_poverty'].map({
    'None': -1,
    'Below Poverty': 1,
    '<= $75,000, Above Poverty': 2,
    '> $75,000': 3
})
test['employment_status'] = test['employment_status'].map({
    'None': -1,
    'Unemployed': 1,
    'Employed': 2,
    'Not in Labor Force': 3
})
# Sanity check: confirm each ordinal column now holds the expected integer codes
for x in train[ord_cols].columns:
    print(x, train[x].unique())
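The train and test mapping blocks above are duplicated by hand. A hedged refactor sketch (the ordinal_maps dict below is hypothetical, not from the original notebook) would apply one shared set of encodings to both frames instead:
ordinal_maps = {
    'age_group': {'18 - 34 Years': 1, '35 - 44 Years': 2, '45 - 54 Years': 3,
                  '55 - 64 Years': 4, '65+ Years': 5},
    'education': {'< 12 Years': 1, '12 Years': 2, 'Some College': 3,
                  'College Graduate': 4, 'None': -1},
    'income_poverty': {'None': -1, 'Below Poverty': 1,
                       '<= $75,000, Above Poverty': 2, '> $75,000': 3},
    'employment_status': {'None': -1, 'Unemployed': 1, 'Employed': 2,
                          'Not in Labor Force': 3},
}
for df in (train, test):
    for col, mapping in ordinal_maps.items():
        df[col] = df[col].map(mapping)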
all_cols = train.columns
train_test = train.copy()  # unscaled copy of the processed training features, reused later when scoring on the full training set
h1n1_labels = labels[['h1n1_vaccine']]
seas_labels = labels[['seasonal_vaccine']]
Transformation
cat_cols = train.select_dtypes('object').columns
h1n1_train = train.copy()
seas_train = train.copy()
h1n1_scaler = StandardScaler()
h1n1_train[num_cols] = h1n1_scaler.fit_transform(h1n1_train[num_cols])
seas_scaler = StandardScaler()
seas_train[num_cols] = seas_scaler.fit_transform(seas_train[num_cols])
h1n1_train_trans = h1n1_train
seas_train_trans = seas_train
# Column indices CatBoost should treat as categorical: everything that is not a float
categorical_features_indices = np.where(train.dtypes != float)[0]
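Translating the indices back to names (a small check, not in the original) confirms the categorical and ordinal columns are the ones picked up:
print([train.columns[i] for i in categorical_features_indices])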
H1N1
CatBoost and Optuna
X = h1n1_train_trans
y = h1n1_labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
from catboost import Pool, cv
import optuna
train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=categorical_features_indices)
# Optuna objective: 7-fold CatBoost CV, maximizing mean test AUC
def objective(trial):
    param = {
        'iterations': trial.suggest_categorical('iterations', [100, 200, 300, 500, 1000, 1200, 1500, 1700, 2000]),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.3),
        'random_strength': trial.suggest_int('random_strength', 1, 10),
        'bagging_temperature': trial.suggest_int('bagging_temperature', 0, 10),
        'max_bin': trial.suggest_categorical('max_bin', [4, 5, 6, 8, 10, 20, 30]),
        'grow_policy': trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 10),
        'od_type': 'Iter',
        'od_wait': 100,
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-8, 100, log=True),
        'one_hot_max_size': trial.suggest_categorical('one_hot_max_size', [5, 10, 12, 25, 100, 500, 1024]),
        'custom_metric': ['AUC'],
        'loss_function': 'Logloss',
        'auto_class_weights': trial.suggest_categorical('auto_class_weights', ['Balanced', 'SqrtBalanced']),
    }
    scores = cv(train_dataset,
                param,
                fold_count=7,
                early_stopping_rounds=8,
                plot=False, verbose=False)
    return scores['test-AUC-mean'].max()
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=75)
trial = study.best_trial
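Optuna records the best cross-validated score alongside the winning parameters; a quick inspection:
print(f'Best CV AUC: {study.best_value:.4f}')
print(trial.params)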
# Fit the tuned model on the 80% training split for a holdout check
final_model = CatBoostClassifier(verbose=False,
                                 cat_features=categorical_features_indices,
                                 **trial.params)
final_model.fit(X_train, y_train)
final_h1n1_model = final_model
params = trial.params
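The helper defined earlier gives a quick holdout read on the tuned model (a usage sketch; the 20% split acts as validation):
evaluate(final_model, X_test, y_test)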
# Refit on the full training set with the best parameters from the search above
final_h1n1_model = CatBoostClassifier(cat_features=categorical_features_indices,
                                      verbose=False,
                                      iterations=1200,
                                      learning_rate=0.023793510396254353,
                                      random_strength=1,
                                      bagging_temperature=6,
                                      max_bin=8,
                                      grow_policy='Lossguide',
                                      min_data_in_leaf=8,
                                      max_depth=9,
                                      l2_leaf_reg=89.35313522855303,
                                      one_hot_max_size=500,
                                      auto_class_weights='Balanced').fit(h1n1_train_trans, h1n1_labels)
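With the refit model in hand, CatBoost's built-in feature importances offer a quick look at which survey answers matter most (an optional inspection, not in the original):
importances = final_h1n1_model.get_feature_importance(prettified=True)
print(importances.head(10))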
Seasonal
CatBoost and Optuna
X = seas_train_trans
y = seas_labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=categorical_features_indices)
# Same Optuna search as for H1N1, now over the seasonal target
def objective(trial):
    param = {
        'iterations': trial.suggest_categorical('iterations', [100, 200, 300, 500, 1000, 1200, 1500, 1700, 2000]),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.3),
        'random_strength': trial.suggest_int('random_strength', 1, 10),
        'bagging_temperature': trial.suggest_int('bagging_temperature', 0, 10),
        'max_bin': trial.suggest_categorical('max_bin', [4, 5, 6, 8, 10, 20, 30]),
        'grow_policy': trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 10),
        'od_type': 'Iter',
        'od_wait': 100,
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-8, 100, log=True),
        'one_hot_max_size': trial.suggest_categorical('one_hot_max_size', [5, 10, 12, 25, 100, 500, 1024]),
        'custom_metric': ['AUC'],
        'loss_function': 'Logloss',
        'auto_class_weights': trial.suggest_categorical('auto_class_weights', ['Balanced', 'SqrtBalanced']),
    }
    scores = cv(train_dataset,
                param,
                fold_count=7,
                early_stopping_rounds=8,
                plot=False, verbose=False)
    return scores['test-AUC-mean'].max()
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=75)
trial = study.best_trial
# Fit the tuned model on the 80% training split for a holdout check
final_model = CatBoostClassifier(verbose=False,
                                 cat_features=categorical_features_indices,
                                 **trial.params)
final_model.fit(X_train, y_train)
print(trial.params)
params = trial.params
# Refit on the full training set with the best parameters from the search above
final_seas_model = CatBoostClassifier(cat_features=categorical_features_indices,
                                      verbose=False,
                                      iterations=2000,
                                      learning_rate=0.0647020282001146,
                                      random_strength=2,
                                      bagging_temperature=0,
                                      max_bin=30,
                                      grow_policy='Depthwise',
                                      min_data_in_leaf=4,
                                      max_depth=5,
                                      l2_leaf_reg=50.04446925880975,
                                      one_hot_max_size=10,
                                      auto_class_weights='SqrtBalanced').fit(seas_train_trans, seas_labels)
Train AUC
h1n1_train_data = train_test.copy()
h1n1_train_data[num_cols] = h1n1_scaler.transform(h1n1_train_data[num_cols])
seas_train_data = train_test.copy()
seas_train_data[num_cols] = seas_scaler.transform(seas_train_data[num_cols])
y_predicted_h1n1 = final_h1n1_model.predict_proba(h1n1_train_data)[:,1].reshape(-1,1)
y_predicted_seas = final_seas_model.predict_proba(seas_train_data)[:,1].reshape(-1,1)
y_true = np.array(labels)
y_predicted = np.concatenate((y_predicted_h1n1, y_predicted_seas), axis=1)
print(roc_auc_score(y_true, y_predicted))
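Passing both label columns at once macro-averages the two AUCs, which matches the competition's scoring. For verification, the per-target AUCs computed separately should average to the same number:
auc_h1n1 = roc_auc_score(labels['h1n1_vaccine'], y_predicted_h1n1.ravel())
auc_seas = roc_auc_score(labels['seasonal_vaccine'], y_predicted_seas.ravel())
print((auc_h1n1 + auc_seas) / 2)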
Submission
test = pd.read_csv('../Data/test_set_features.csv', index_col='respondent_id')
full_test = test.copy()
num_cols = list(test.select_dtypes('number').columns)
cat_cols = [
    'race',
    'sex',
    'marital_status',
    'rent_or_own',
    'hhs_geo_region',
    'census_msa',
    'employment_industry',
    'employment_occupation'
]
ord_cols = [
    'age_group',
    'education',
    'income_poverty',
    'employment_status'
]
# Impute Test: same strategy as for train
for col in num_cols:
    test[col] = test[col].fillna(value=-1)
for col in (cat_cols + ord_cols):
    test[col] = test[col].fillna(value='None')
test['age_group'] = test['age_group'].map({
    '18 - 34 Years': 1,
    '35 - 44 Years': 2,
    '45 - 54 Years': 3,
    '55 - 64 Years': 4,
    '65+ Years': 5
})
test['education'] = test['education'].map({
    '< 12 Years': 1,
    '12 Years': 2,
    'Some College': 3,
    'College Graduate': 4,
    'None': -1
})
test['income_poverty'] = test['income_poverty'].map({
    'None': -1,
    'Below Poverty': 1,
    '<= $75,000, Above Poverty': 2,
    '> $75,000': 3
})
test['employment_status'] = test['employment_status'].map({
    'None': -1,
    'Unemployed': 1,
    'Employed': 2,
    'Not in Labor Force': 3
})
test_h1n1 = test.copy()
test_seas = test.copy()
test_h1n1[num_cols] = h1n1_scaler.transform(test_h1n1[num_cols])
test_seas[num_cols] = seas_scaler.transform(test_seas[num_cols])
y_h1n1 = final_h1n1_model.predict_proba(test_h1n1)[:,1].reshape(-1,1)
y_seas = final_seas_model.predict_proba(test_seas)[:,1].reshape(-1,1)
y_comb = np.concatenate((y_h1n1, y_seas), axis=1)
results = pd.DataFrame(y_comb, columns=['h1n1_vaccine', 'seasonal_vaccine'], index=test.index)
submission = pd.concat([full_test, results], axis=1)
submission = submission[['h1n1_vaccine', 'seasonal_vaccine']]
today = datetime.today().date()
submission.to_csv(f'../Submissions/CatBoost Submission {today}.csv')
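Before uploading, a hedged sanity check against the competition's submission_format.csv (assumed to sit in ../Data, as DrivenData provides it) catches column and index mismatches:
fmt = pd.read_csv('../Data/submission_format.csv', index_col='respondent_id')
assert list(submission.columns) == list(fmt.columns)
assert (submission.index == fmt.index).all()
print(submission.head())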