Inference: Using a cat plot, we draw box plots of flight price for each airline. Jet Airways has the most outliers in terms of price.
Inference: Using a cat plot again, we draw box plots of flight price for each source, i.e. the place from which passengers travel to the destination. Banglore as the source location has the most outliers, while Chennai has the least.
Inference: Here we draw box plots, again with a cat plot, of flight price for each destination to which passengers are travelling. New Delhi has the most outliers and Kolkata has the least.
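To make "most outliers" concrete, here is a minimal sketch of how outliers can be counted per group, assuming the usual box plot convention that a point is an outlier when it lies more than 1.5 × IQR beyond the quartiles. The function name and the toy prices are hypothetical and only for illustration; they are not taken from the dataset.

```python
from statistics import quantiles

def count_outliers(values, whisker=1.5):
    """Count points outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return sum(1 for v in values if v < lower or v > upper)

# Hypothetical prices per airline, for illustration only (not real data)
prices_by_airline = {
    "Airline A": [4000, 4200, 4500, 4700, 5000, 5200, 30000],
    "Airline B": [3000, 3100, 3300, 3500, 3600],
}
outlier_counts = {name: count_outliers(p) for name, p in prices_by_airline.items()}
print(outlier_counts)
```

Comparing these counts across groups gives a number to back up what the box plots show visually.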
Let’s see our processed data first
train_df.head()
Output:
Duration: Here we convert the flight duration from hours and minutes (e.g. "2h 50m") into total minutes.
# "2h 50m" becomes "2*60+50*1", which eval() then computes as 170
train_df['Duration'] = train_df['Duration'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
test_df['Duration'] = test_df['Duration'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
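Since eval executes whatever string it is given, one may prefer to parse the duration explicitly. The following is only a sketch of that alternative, assuming durations look like "2h 50m", "19h" or "45m"; the helper name duration_to_minutes is hypothetical and not part of the original code.

```python
import re

def duration_to_minutes(text):
    """Convert strings such as '2h 50m', '19h' or '45m' to total minutes."""
    # Optional hours part followed by an optional minutes part
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognised duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print(duration_to_minutes("2h 50m"))
```

With a DataFrame, such a helper could be applied as train_df['Duration'].apply(duration_to_minutes), and malformed values would raise an error instead of being evaluated as code.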
Date_of_Journey: Here we split the date of journey into separate day and month columns so that the model can use them as numeric features.
train_df["Journey_day"] = train_df['Date_of_Journey'].str.split('/').str[0].astype(int)
train_df["Journey_month"] = train_df['Date_of_Journey'].str.split('/').str[1].astype(int)
train_df.drop(["Date_of_Journey"], axis = 1, inplace = True)
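Splitting on '/' does not check that the value is a real date. As a small sketch of a stricter approach, assuming the dates are written day/month/year with a four-digit year (the code above only tells us that day comes first and month second), the standard library can parse and validate each value. The helper name journey_day_month is hypothetical.

```python
from datetime import datetime

def journey_day_month(date_text):
    """Return (day, month) from a 'day/month/year' string, validating the date."""
    # strptime raises ValueError for impossible dates such as 31/02/2019
    parsed = datetime.strptime(date_text, "%d/%m/%Y")
    return parsed.day, parsed.month

print(journey_day_month("24/03/2019"))
```

The design choice here is simply to fail loudly on bad input rather than to produce a day or month that does not exist.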
Dep_Time: Here we are converting departure time into hours and minutes
train_df["Dep_hour"] = pd.to_datetime(train_df["Dep_Time"]).dt.hour
train_df["Dep_min"] = pd.to_datetime(train_df["Dep_Time"]).dt.minute
train_df.drop(["Dep_Time"], axis = 1, inplace = True)
Arrival_Time: Similarly we are converting the arrival time into hours and minutes.
train_df["Arrival_hour"] = pd.to_datetime(train_df.Arrival_Time).dt.hour
train_df["Arrival_min"] = pd.to_datetime(train_df.Arrival_Time).dt.minute
train_df.drop(["Arrival_Time"], axis = 1, inplace = True)
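To show what the hour and minute extraction does for a single value, here is a plain Python sketch. It assumes each time string begins with a clock time in HH:MM form and ignores any text after it; both the helper name hour_minute and the sample inputs are hypothetical.

```python
def hour_minute(time_text):
    """Return (hour, minute) from text that begins with an 'HH:MM' clock time."""
    clock = time_text.split()[0]  # keep only the first whitespace-separated token
    hour_text, minute_text = clock.split(":")
    hour, minute = int(hour_text), int(minute_text)
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError(f"Not a valid clock time: {time_text!r}")
    return hour, minute

print(hour_minute("22:20"))
```

The same helper would work for both the departure and the arrival column, since only the clock part of the string is read.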
Now after final preprocessing let’s see our dataset
train_df.head()
Output:
Plotting Bar chart for Months (Duration) vs Number of Flights
plt.figure(figsize = (10, 5))
plt.title('Count of flights month wise')
ax=sns.countplot(x = 'Journey_month', data = train_df)
plt.xlabel('Month')
plt.ylabel('Count of flights')
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va='bottom', color= 'black')
Output:
Inference: The count plot above shows the number of flights in each journey month, and we can see that May has the most flights.
Plotting Bar chart for Types of Airline vs Number of Flights
plt.figure(figsize = (20,5))
plt.title('Count of flights with different Airlines')
ax=sns.countplot(x = 'Airline', data =train_df)
plt.xlabel('Airline')
plt.ylabel('Count of flights')
plt.xticks(rotation = 45)
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va='bottom', color= 'black')
Output:
Inference: The graph above shows the count of flights for each airline, and we can see that Jet Airways has the most flights.
Plotting Ticket Prices VS Airlines
plt.figure(figsize = (15,4))
plt.title('Price VS Airlines')
plt.scatter(train_df['Airline'], train_df['Price'])
plt.xlabel('Airline')
plt.ylabel('Price of ticket')
plt.xticks(rotation = 90)
Output:
Correlation between all Features
Plotting Correlation
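plt.figure(figsize = (15,15))
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(train_df.select_dtypes(include='number').corr(), annot = True, cmap = "RdYlGn")
plt.show()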
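Each cell of the heatmap is a Pearson correlation coefficient between two columns (Pearson is the default method of corr() in pandas). As a rough sketch of what that number means, the coefficient can be computed by hand for two short lists. The function name and the toy values below are hypothetical and not taken from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: price rises linearly with duration
durations = [60, 120, 180, 240]
prices = [3000, 5000, 7000, 9000]
r = pearson(durations, prices)
print(round(r, 3))
```

A value close to 1 means the two features rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.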
Output:
Dropping the Price column from the features, as it is the target we want to predict
data = train_df.drop(["Price"], axis=1)
Dealing with Categorical Data and Numerical Data
train_categorical_data = data.select_dtypes(exclude=['int64', 'float','int32'])
train_numerical_data = data.select_dtypes(include=['int64', 'float','int32'])
test_categorical_data = test_df.select_dtypes(exclude=['int64', 'float','int32'])
test_numerical_data = test_df.select_dtypes(include=['int64', 'float','int32'])
train_categorical_data.head()
Output:
Label Encoding for Categorical Columns
# A fresh LabelEncoder is fitted on each column
# Note: train and test are fitted separately here, so the same category may get different codes in each
train_categorical_data = train_categorical_data.apply(LabelEncoder().fit_transform)
test_categorical_data = test_categorical_data.apply(LabelEncoder().fit_transform)
train_categorical_data.head()
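When the train and test sets are encoded independently, a category can end up with one code in training and another in testing. The sketch below shows one way to keep the codes consistent in plain Python, assuming the mapping is learned from the training values only and unseen test categories get a placeholder code of -1. It is an illustration of the idea and not scikit-learn's LabelEncoder; the function names and the sample airline lists are hypothetical.

```python
def fit_label_mapping(train_values):
    """Map each distinct training category to an integer, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(train_values)))}

def encode(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen categories get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

# Hypothetical sample values, for illustration only
train_airlines = ["IndiGo", "Air India", "Jet Airways", "IndiGo"]
test_airlines = ["Jet Airways", "IndiGo", "SpiceJet"]

mapping = fit_label_mapping(train_airlines)
train_codes = encode(train_airlines, mapping)
test_codes = encode(test_airlines, mapping)
print(mapping, train_codes, test_codes)
```

Here "Jet Airways" receives the same code in both lists, and "SpiceJet", which never appears in training, is flagged with -1 instead of silently reusing another category's code.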
Output:
Concatenating both Categorical Data and Numerical Data
X = pd.concat([train_categorical_data, train_numerical_data], axis=1)
y = train_df['Price']
test_set = pd.concat([test_categorical_data, test_numerical_data], axis=1)
X.head()
Output:
y.head()
Output:
0     3897
1     7662
2    13882
3     6218
4    13302
Name: Price, dtype: int64
Conclusion
In this article we went through a complete EDA process: getting insights from the data, feature engineering, and data visualization. After these steps, one can move on to building a machine learning model for prediction.
Here’s the repo link to this article. Hope you liked my article on flight fare prediction using machine learning. If you have any opinions or questions, then comment below.
About Me
Greetings to everyone. I'm currently working at TCS, and previously I worked as a Data Science Analyst at Zorba Consulting India. Alongside my full-time work, I have an immense interest in Data Science and in related areas of Artificial Intelligence such as Computer Vision, Machine Learning, and Deep Learning. Feel free to collaborate with me on any project in the domains mentioned above (LinkedIn).
Here you can access my other articles, which are published on Analytics Vidhya as a part of the Blogathon (link).
The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.
Source: https://www.analyticsvidhya.com/blog/2022/01/flight-fare-prediction-using-machine-learning/