Google Cloud Data Engineering Summit: Data Engineering Championship

Problem Description (from Machine Hack):

Data Engineering Summit 2022, presented by Google Cloud and organised by Analytics India Magazine, is India's first conference dedicated to the high-demand and impactful field of data engineering. This virtual conference, to be held on April 30, 2022, will focus on data engineering innovation and give attendees direct access to top engineers and innovators working in leading tech companies. 

This will be a golden opportunity for attendees to learn about the software deployment architecture of machine learning systems, how to produce the latest data frameworks and solutions for business use cases from the very best in the field.

Data Engineering Championship by MachineHack

MachineHack is organising a data engineering hackathon for data scientists & data engineers to participate.

Data engineering consists of collecting, provisioning and maintaining excellent quality data to get insights. In order to do that, a data engineer needs to design and develop a scalable data architecture, set up processes that pool data from multiple sources, check the data quality, and eliminate corrupt data. In addition, exploratory data analysis (EDA) and extract, transform, and load (ETL) techniques are required to access and use downstream to solve business problems.

Evaluation Metric: Mean Absolute Error

Ranking: 7 out of 86. First Place MAE was 1.17521 and mine was 1.53923

import pandas as pd
from datetime import datetime

dataset_df = pd.read_csv('../Data/participants_dataset_DES.csv')

def DataProcessing(df):
    #Departure Time Split
    df[['LOW', 'HIGH']] = df['DEP_TIME_BLK'].str.split('-', expand=True)

    #Fill in Missing Years and Months.  All data points were January 2020
    df['YEAR'] = df['YEAR'].fillna(2020).astype(int)
    df['MONTH'] = df['MONTH'].fillna(1).astype(int)
    df['YEAR_STR'] = df['YEAR'].astype(str)
    df['MONTH_STR'] = df['MONTH'].astype(str)
    df['DAY_STR'] = df['DAY_OF_WEEK'].astype(str)
    df['DATE_STR'] = df['YEAR_STR'] + '-' + df['MONTH_STR'] + '-' + df['DAY_STR']
    df['DATE'] = pd.to_datetime(df['DATE_STR'])


    #Get list of airlines
    airlines = df['CARRIER_NAME'].unique()

    #Get averages for each airline, since airline plane size, number of flights, etc. differs
    for airline in airlines:
        airline_df = df[df['CARRIER_NAME'] == airline]

        plane_age_mean = airline_df['PLANE_AGE'].mean()
        seats_mean = airline_df['NUMBER_OF_SEATS'].mean()
        flights_month_mean = airline_df['AIRLINE_AIRPORT_FLIGHTS_MONTH'].mean()
        conc_flights_mean = airline_df['CONCURRENT_FLIGHTS'].mean()

        df.loc[(df['CARRIER_NAME'] == airline) & (df['PLANE_AGE'].isnull()), 'PLANE_AGE'] = plane_age_mean
        df.loc[(df['CARRIER_NAME'] == airline) & (df['NUMBER_OF_SEATS'].isnull()), 'NUMBER_OF_SEATS'] = seats_mean
        df.loc[(df['CARRIER_NAME'] == airline) & (df['AIRLINE_AIRPORT_FLIGHTS_MONTH'].isnull()), 'AIRLINE_AIRPORT_FLIGHTS_MONTH'] = flights_month_mean
        df.loc[(df['CARRIER_NAME'] == airline) & (df['CONCURRENT_FLIGHTS'].isnull()), 'CONCURRENT_FLIGHTS'] = conc_flights_mean

    #Get average weather conditions
    df['PRCP'] = df['PRCP'].fillna(df['PRCP'].mean())
    df['SNOW'] = df['SNOW'].fillna(df['SNOW'].mean())
    df['TMAX'] = df['TMAX'].fillna(df['TMAX'].mean())
    df['AWND'] = df['AWND'].fillna(df['AWND'].mean())

    #Logically fill nulls for weather conditions and calculate ratios
    for index, row in df.iterrows():
        if row['SNOW'] == 0:
            df.at[index, 'PRCP_SNOW_RATIO'] = 0
        else:
            df.at[index, 'PRCP_SNOW_RATIO'] = row['PRCP'] / row['SNOW']


        if (row['TMAX'] == 0):
            df.at[index, 'WIND_CHILL'] = 0
        else:
            df.at[index, 'WIND_CHILL'] = row['TMAX'] - (row['AWND'] * 0.7)

        if (row['AIRLINE_AIRPORT_FLIGHTS_MONTH'] == 0):
            df.at[index, 'PLANE_AGE_AIRLINE_AIRPORT_FLIGHTS_MONTH_RATIO'] = 0
        else:
            df.at[index, 'PLANE_AGE_AIRLINE_AIRPORT_FLIGHTS_MONTH_RATIO'] = row['PLANE_AGE'] / row['AIRLINE_AIRPORT_FLIGHTS_MONTH']

        hour = int(row['LOW'][:2])
        minute = int(row['LOW'][-2:])

        df.at[index, 'TIMESTAMP'] = row['DATE'].replace(hour=hour, minute=minute)

    #Distribution calculation
    df['SEAT_DISTRIBUTION'] = df['NUMBER_OF_SEATS'] / df['CONCURRENT_FLIGHTS']


    #Normalization calculation
    seat_dist_min = df['SEAT_DISTRIBUTION'].min()
    seat_dist_max = df['SEAT_DISTRIBUTION'].max()

    df['SEAT_DISTRIBUTION_NORMALISED'] = (df['SEAT_DISTRIBUTION'] - seat_dist_min) / (seat_dist_max - seat_dist_min)

    #Getting used columns
    df = df[[
        'DATE',
        'LOW',
        'HIGH',
        'TIMESTAMP',
        'WIND_CHILL',
        'PRCP_SNOW_RATIO',
        'PLANE_AGE_AIRLINE_AIRPORT_FLIGHTS_MONTH_RATIO',
        'SEAT_DISTRIBUTION',
        'SEAT_DISTRIBUTION_NORMALISED'
    ]]

    return df


transformed_df = DataProcessing(dataset_df)

transformed_df.to_csv('../Submissions/DES22.csv', index=False)
Previous
Previous

Swiss Re: Predict Accident Risk Score for Unique Postcode

Next
Next

Deloitte Presents Machine Learning Challenge: Predict Loan Defaulters