Google Cloud Data Engineering Summit: Data Engineering Championship
Problem Description (from Machine Hack):
Data Engineering Summit 2022, presented by Google Cloud and organised by Analytics India Magazine, is India's first conference dedicated to the high-demand and impactful field of data engineering. This virtual conference, to be held on April 30, 2022, will focus on data engineering innovation and give attendees direct access to top engineers and innovators working in leading tech companies.
This will be a golden opportunity for attendees to learn about the software deployment architecture of machine learning systems, how to produce the latest data frameworks and solutions for business use cases from the very best in the field.
Data Engineering Championship by MachineHack
MachineHack is organising a data engineering hackathon for data scientists & data engineers to participate.
Data engineering consists of collecting, provisioning and maintaining excellent quality data to get insights. In order to do that, a data engineer needs to design and develop a scalable data architecture, set up processes that pool data from multiple sources, check the data quality, and eliminate corrupt data. In addition, exploratory data analysis (EDA) and extract, transform, and load (ETL) techniques are required to access and use downstream to solve business problems.
Evaluation Metric: Mean Absolute Error
Ranking: 7 out of 86. First Place MAE was 1.17521 and mine was 1.53923
import pandas as pd
from datetime import datetime
dataset_df = pd.read_csv('../Data/participants_dataset_DES.csv')
def DataProcessing(df):
#Departure Time Split
df[['LOW', 'HIGH']] = df['DEP_TIME_BLK'].str.split('-', expand=True)
#Fill in Missing Years and Months. All data points were January 2020
df['YEAR'] = df['YEAR'].fillna(2020).astype(int)
df['MONTH'] = df['MONTH'].fillna(1).astype(int)
df['YEAR_STR'] = df['YEAR'].astype(str)
df['MONTH_STR'] = df['MONTH'].astype(str)
df['DAY_STR'] = df['DAY_OF_WEEK'].astype(str)
df['DATE_STR'] = df['YEAR_STR'] + '-' + df['MONTH_STR'] + '-' + df['DAY_STR']
df['DATE'] = pd.to_datetime(df['DATE_STR'])
#Get list of airlines
airlines = df['CARRIER_NAME'].unique()
#Get averages for each airline, since airline plane size, number of flights, etc. differs
for airline in airlines:
airline_df = df[df['CARRIER_NAME'] == airline]
plane_age_mean = airline_df['PLANE_AGE'].mean()
seats_mean = airline_df['NUMBER_OF_SEATS'].mean()
flights_month_mean = airline_df['AIRLINE_AIRPORT_FLIGHTS_MONTH'].mean()
conc_flights_mean = airline_df['CONCURRENT_FLIGHTS'].mean()
df.loc[(df['CARRIER_NAME'] == airline) & (df['PLANE_AGE'].isnull()), 'PLANE_AGE'] = plane_age_mean
df.loc[(df['CARRIER_NAME'] == airline) & (df['NUMBER_OF_SEATS'].isnull()), 'NUMBER_OF_SEATS'] = seats_mean
df.loc[(df['CARRIER_NAME'] == airline) & (df['AIRLINE_AIRPORT_FLIGHTS_MONTH'].isnull()), 'AIRLINE_AIRPORT_FLIGHTS_MONTH'] = flights_month_mean
df.loc[(df['CARRIER_NAME'] == airline) & (df['CONCURRENT_FLIGHTS'].isnull()), 'CONCURRENT_FLIGHTS'] = conc_flights_mean
#Get average weather conditions
df['PRCP'] = df['PRCP'].fillna(df['PRCP'].mean())
df['SNOW'] = df['SNOW'].fillna(df['SNOW'].mean())
df['TMAX'] = df['TMAX'].fillna(df['TMAX'].mean())
df['AWND'] = df['AWND'].fillna(df['AWND'].mean())
#Logically fill nulls for weather conditions and calculate ratios
for index, row in df.iterrows():
if row['SNOW'] == 0:
df.at[index, 'PRCP_SNOW_RATIO'] = 0
else:
df.at[index, 'PRCP_SNOW_RATIO'] = row['PRCP'] / row['SNOW']
if (row['TMAX'] == 0):
df.at[index, 'WIND_CHILL'] = 0
else:
df.at[index, 'WIND_CHILL'] = row['TMAX'] - (row['AWND'] * 0.7)
if (row['AIRLINE_AIRPORT_FLIGHTS_MONTH'] == 0):
df.at[index, 'PLANE_AGE_AIRLINE_AIRPORT_FLIGHTS_MONTH_RATIO'] = 0
else:
df.at[index, 'PLANE_AGE_AIRLINE_AIRPORT_FLIGHTS_MONTH_RATIO'] = row['PLANE_AGE'] / row['AIRLINE_AIRPORT_FLIGHTS_MONTH']
hour = int(row['LOW'][:2])
minute = int(row['LOW'][-2:])
df.at[index, 'TIMESTAMP'] = row['DATE'].replace(hour=hour, minute=minute)
#Distribution calculation
df['SEAT_DISTRIBUTION'] = df['NUMBER_OF_SEATS'] / df['CONCURRENT_FLIGHTS']
#Normalization calculation
seat_dist_min = df['SEAT_DISTRIBUTION'].min()
seat_dist_max = df['SEAT_DISTRIBUTION'].max()
df['SEAT_DISTRIBUTION_NORMALISED'] = (df['SEAT_DISTRIBUTION'] - seat_dist_min) / (seat_dist_max - seat_dist_min)
#Getting used columns
df = df[[
'DATE',
'LOW',
'HIGH',
'TIMESTAMP',
'WIND_CHILL',
'PRCP_SNOW_RATIO',
'PLANE_AGE_AIRLINE_AIRPORT_FLIGHTS_MONTH_RATIO',
'SEAT_DISTRIBUTION',
'SEAT_DISTRIBUTION_NORMALISED'
]]
return df
transformed_df = DataProcessing(dataset_df)
transformed_df.to_csv('../Submissions/DES22.csv', index=False)