import requests # for making standard html requests
from bs4 import BeautifulSoup # magical tool for parsing html data
import pandas as pd # for data manipulation
import numpy as np # for data manipulation
import seaborn as sns # for data visualization
import matplotlib.pyplot as plt # for data visualization
from sklearn.metrics import mean_absolute_percentage_error # for model evaluation
import statsmodels.api as sm # for data exploration
import scipy.stats as stats # for statistical analysis
from sklearn.linear_model import SGDClassifier # for classification
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, train_test_split # for cross validation and hyperparameter tuning
from statsmodels.formula.api import glm # for classification
from sklearn.preprocessing import OrdinalEncoder, StandardScaler # for data preprocessing
from sklearn.pipeline import Pipeline # for data preprocessing
from sklearn.metrics import classification_report, accuracy_score # for model evaluation
from sklearn.tree import DecisionTreeClassifier # for classification
from xgboost import XGBClassifier # for classification
- merge airports and runways and create airport_run using the ident column in airports, and the airport_ident column in runways. We are doing a left join i.e. all of airports columns and only the matching values from runways will be imported.
- count the number of runways each airport has. Here we will group the rows based on the common airport_ident values while keep a count of the non-null values in id_y which has the runway entries of a particular airport.
- create a new dataset containing airport iata code, type of airport, elevation of airport, and the number of runways. All of which could play a role in delays.
- add info about the AirportFrom values by combining the airlines and air_run datasets based on the AirportFrom and iata_code columns, this would give us the count of runways, elevation, and iata_code of the airports from where flights take off. Something that could be a factor in delays.
- add similar info about the destination airports. This would also be a factor in flights delays. For eg: if the destination airport has only one runway, flights would have to park themselves and wait for their turn to land resulting in delays. But this time we will use combined_data instead of airlines since we have all the info from airlines dataset from previous steps.
- As is with every profession, the amount of experience we have plays an important role in our performance. Applying that logic to airline delays, let's extract experience info about each airline in our dataset from "https://en.wikipedia.org/wiki/List_of_airlines_of_the_United_States" using Beautiful Soup class from bs4 and requests lib.
- Apart from the experience, the traffic of the airport is also a factor in delays. If your source/destination airport sees a lot of traffic, naturally the take-off and landing times will be affected. For this step, we will extract info from https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States But this wikipedia page was updated recently, so I am using a previous version of the wikipedia page of 13 April 2023 from this URL: https://en.wikipedia.org/w/index.php?title=List_of_the_busiest_airports_in_the_United_States&oldid=1149689977
- Collate the dataset into combined_data_traffic: ultimate dataset with all the relevant info from previous tables.
- Treat missing values: id 0 Airline 0 Flight 0 AirportFrom 0 AirportTo 0 DayOfWeek 0 Time 0 Length 0 Delay 0 type_source_airport 31 elevation_ft_source_airport 31 runway_count_source_airport 31 type_dest_airport 31 elevation_ft_dest_airport 31 runway_count_dest_airport 31 iata_code_source 83582 data_2021_source_airport 83582 iata_code_dest 83531 data_2021_dest_airport 83531 Founded 83601
- For type, elevation_ft, and runway_count values, we will find the relevant info for 'CYS' i.e 'Cheyenne Regional Jerry Olson Field' from previous datasets and add them to the combined_data_traffic dataset.
- We will remove the iata_code columns for source and destination airport since the AirportFrom and AirportTo columns have the same info.
- For Founded column we can extract the information about the founding year of airlines (US, EV, and CO) from the internet and add the values in the missing rows.
- For traffic data 2021 of source and destination airports, we will first impute missing values with the median() value of the airport. And remove the remaining left after the process.
According to data provided, around 70% of the flights are delayed for Southwest Airlines. Visualize to compare the same for other airlines.
- Calculate the % of delays for SouthWest Airlines and cross-check the stats in first task. round((cdt[cdt.Airline == southwestid].Delay.sum()/ cdt[cdt.Airline == southwestid].Delay.size)*100)
- Plot a bar chart % delay for each airline:
-
- As we can see in the above bar chart, SouthWest airlines has the most delayed flgihts i.e. 69.8% to be exact. Mesa is the airlines with least delays.
Which days of the week are the safest to travel?
-
Calcuate the % of flights that are delayed for each weekday by grouping the dataset by DayOfWeek column.
-
Plot a bar chart to visualize the % delay every day of the week:
From the bar chart and pie-chart above, we can see that Friday and Saturday have the shortest bars for % delays, hence these two days are the safest to travel.
Which airlines to recommend for short, medium, and long length travel?
-
Create a dataset 'duration' with 'Airline', 'Length' and 'Delay' columns of the main dataset (combined_data_traffic) or cdt.
-
Create a new column duration['travel_time'] which contains values by creating 3 equal width bings of the 'Length' column into 'short', 'medium', and 'long'.
-
Group by the dataset using the 'Airline' column and applying a delay_percent function that calculates the % of each airline delay in short, medium, and long distance travels.
-
Find the names of the airlines with 0% delays in long and medium flights and 24.37% delays in short distance travel using the min() function to the columns. Here are the recommended airlines with minimum delays that are safest to travel for each kind of travel distance:
-
Long flights (0% delay): Endeavor, Alaska, Jetblue, ExpressJet, Frontier, Hawaiian, Envoy, PSA, Skywest, PSA (initially US Airway Express), Southwest, JSX, Mesa
-
Medium flights (0% delays): Endeavor
-
Short flights (24.37% delays): Mesa
Do you observe any pattern in the time of departure of flights of long duration?
-
Create a cross-tab of the 'Time' and 'duration' column for 'long' duration column values.
-
Create a dataset filtered_data using the .loc[] function and filtering values for long duration travel.
-
Use .describe() function to get the following values: count 559.000000 mean 840.635063 std 221.020092 min 540.000000 25% 670.000000 50% 717.000000 75% 1045.000000 max 1310.000000
-
Departure Time Range: The departure times range from the minimum value of 540 minutes (after midnight) to the maximum value of 1310 minutes. This suggests that the flights in the dataset have departure times spanning throughout the day.
-
Average Departure Time: The average departure time, represented by the mean of approximately 840.64 minutes, indicates that, on average, flights tend to depart around 8:40 AM (assuming midnight as the starting point).
-
Variability of Departure Times: The standard deviation of approximately 221.02 minutes suggests that the departure times have a moderate amount of variability or spread around the mean. This indicates that departure times are not tightly clustered around a specific time but rather exhibit some degree of dispersion.
-
Quartiles: The 25th percentile (first quartile) value of 670 minutes and the 75th percentile (third quartile) value of 1045 minutes indicate that 25% of the flights depart before 6:10 AM (approximately) and 75% of the flights depart before 5:45 PM (approximately). This information helps understand the distribution of departure times
-
Maximum number of flights are departing at ~700 minutes, i.e. around 11:40 AM.
-
There are two empty time zones, ~(900-1000) and ~(1160-1240) when no long distances flights depart.
-
Maxium number of flights depart between ~(660-710) minutes from midnight.
-
The two empty zones identified above are not ture. There are some flights departing at almost every time stamp.
-
You can also plot a boxplot:
Observations till now:
-
75% of all the long duration flights depart before 6:00 PM in the evening.
-
50% of flights depart before 12:00 PM in the afternoon.
-
The earliest long duration flight departs at 9:00 AM.
-
The last long duration flight departs at 10:50 PM.
-
You can also perform t-test and ANOVA test T-test: T-statistic = 3.335989828914716 p-value = 0.0008500226450162412 ANOVA: F-statistic = 4299971.900179982 p-value = 0.0
-
T-test Result: The t-statistic of 3.336 indicates a significant difference in the means of departure times between long-duration flights and other-duration flights. This suggests that there is a distinct pattern or variation in the departure times of long-duration flights compared to other-duration flights.
-
ANOVA Result: The ANOVA test indicates a significant difference in departure times across different flight durations, including long-duration flights. This suggests that the departure times of long-duration flights are not only different from other-duration flights but also show variation within the group of long-duration flights themselves. This further supports the existence of patterns or systematic differences in the departure times of long-duration flights.
How large hubs compare to medium hubs in terms of count of delayed flights? Use appropriate visualization to represent your findings.
- Bar chart visualization:
- Stacked-bar chart showing both delayed and non-delayed flights for large and medium airport:
-
The plot shows us that there is a significant difference between the number of flights large and medium airports handle. Medium airports have a higher ratio of non-delayed flights. This could mean that given the traffic each type of hub handles, medium airports are better at ensuring timely take-offs.
-
We can also use mean to answer this question:
-
Large airports have a 46% delay where as medium airports have 39% delay. Therefore, we can conclude that large_airports have a higher chance of a delayed flights than medium_airports.
For large hubs, forecast the number of passengers for 2022 using simple moving average method.
- Plotting a time-series for every airport:
- Seasonal decomposition charts for the top four airports with max traffic:
Observations:
-
The trend for all four airports is that the traffic is on an uptrend after a major drop in 2019 due to COVID.
-
The seasonality shows that there are no recurring patterns in the traffic data.
-
The residual shows that there are no abnormnal patterns in the traffic data.
-
Calculate rolling mean forecasts for different window sizes and select the best window size based on the minimum MAPE.
As we saw in the line chart earlier, the airports: ATL, DFW, DEN,and ORD will witness the maximum traffic in 2022.