Flight ticket prices can be hard to guess: the price we see today may well be different when we check the same flight tomorrow. Travelers often say that flight ticket prices are unpredictable. As data scientists, we want to show that, given the right data, even these prices can be predicted. The dataset provides flight ticket prices for various airlines and cities between March and June 2019. Size of training set: 10683 records.
The Airline Flight Fare Prediction project predicts airline flight fares across Indian cities. The dataset is taken from Kaggle and is time-stamped, so building the model required extensive pre-processing, especially on the date-time columns, to arrive at an ML model that could effectively predict fares between the cities. Many features had to be pre-processed and transformed into new parameters so that the fare-prediction web application could have a cleaner, simpler layout. The dataset contains the following independent features, plus the target Price:
- Airline: The name of the airline.
- Date_of_Journey: The date of the journey.
- Source: The source from which the service begins.
- Destination: The destination where the service ends.
- Route: The route taken by the flight to reach the destination.
- Dep_Time: The time when the journey starts from the source.
- Arrival_Time: Time of arrival at the destination.
- Duration: Total duration of the flight.
- Total_Stops: Total stops between the source and destination.
- Additional_Info: Additional information about the flight.
- Price: The price of the ticket.
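The date-time pre-processing mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy rows, the assumed value formats (day/month/year dates and 24-hour times), and the new column names `Journey_day`, `Journey_month`, `Dep_hour` and `Dep_min` are all assumptions made for the example.

```python
import pandas as pd

# Toy rows; the value formats (dd/mm/yyyy dates, HH:MM times) are assumed.
df = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019"],
    "Dep_Time": ["22:20", "05:50", "09:25"],
})

# Split the journey date into numeric day and month columns.
journey = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
df["Journey_day"] = journey.dt.day
df["Journey_month"] = journey.dt.month

# Split the departure time into hour and minute columns.
dep = pd.to_datetime(df["Dep_Time"], format="%H:%M")
df["Dep_hour"] = dep.dt.hour
df["Dep_min"] = dep.dt.minute

# Drop the raw text columns once their numeric parts have been extracted.
df = df.drop(columns=["Date_of_Journey", "Dep_Time"])
print(df)
```

The same pattern would apply to Arrival_Time, provided its values are first reduced to a plain time string.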
The code is written in Python 3.6.10. If you don't have Python installed, you can download it from python.org. If you are using an older version of Python, upgrade Python itself first (pip installs packages, not Python), and make sure you have the latest version of pip. To install the required packages and libraries, check the project file in the project directory after cloning the repository.
I needed to clean the data so that it was usable for the model. I made the following changes and created the following variables:
- Calculated the total flight duration
- Removed the null values
- Removed the outliers
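The three cleaning steps above could look roughly like this. It is a sketch under stated assumptions: the Duration strings are assumed to look like `2h 50m`, the helper `duration_to_minutes` and the column `Duration_mins` are hypothetical names, the rows are made up, and the 1.5 * IQR rule is one common way to drop outliers, not necessarily the method used in the project.

```python
import re
import pandas as pd

def duration_to_minutes(text):
    """Convert a string such as '2h 50m', '19h' or '45m' into total minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

# Made-up rows: one missing Duration and one extreme Price.
df = pd.DataFrame({
    "Duration": ["2h 50m", "19h", "45m", "5h 25m", None, "2h 30m"],
    "Price": [3897, 7662, 3100, 6218, 4500, 79512],
})

# 1. Remove rows that contain null values.
df = df.dropna().reset_index(drop=True)

# 2. Calculate the total flight duration in minutes.
df["Duration_mins"] = df["Duration"].apply(duration_to_minutes)

# 3. Remove price outliers with the 1.5 * IQR rule (an assumed choice).
q1, q3 = df["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
keep = df["Price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[keep].reset_index(drop=True)
print(df)
```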
First, I transformed the categorical variables into dummy variables. I then split the data into train and test sets with a test size of 30%.
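A minimal sketch of the encoding and the 70/30 split, using `pandas.get_dummies` and scikit-learn's `train_test_split`. The rows below are made up for illustration, and `random_state=42` is an assumed value added only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo", "SpiceJet",
                "Air India", "Jet Airways", "IndiGo", "SpiceJet", "Air India"],
    "Source": ["Delhi", "Kolkata", "Delhi", "Mumbai", "Kolkata",
               "Delhi", "Mumbai", "Kolkata", "Delhi", "Mumbai"],
    "Total_Stops": [0, 2, 1, 0, 1, 2, 1, 0, 0, 1],
    "Price": [3897, 7662, 13882, 6218, 3873, 8907, 11087, 4174, 3100, 9663],
})

# One dummy column per category; numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns=["Price"]), columns=["Airline", "Source"])
y = df["Price"]

# Hold out 30% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```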
I tried six different models and evaluated them using r2_score.
- GradientBoostingRegressor: 0.581642
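For reference, fitting a `GradientBoostingRegressor` and scoring it with `r2_score` can be sketched as below. The data here is synthetic and generated only for the example, and the default hyperparameters and `random_state` values are assumptions, so the printed score will not match the figure reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3000 + 500 * X[:, 0] + 200 * X[:, 1] ** 2 + rng.normal(0, 100, size=200)

# Same 70/30 split as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training set, then score predictions on the held-out test set.
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(f"GradientBoostingRegressor r2_score: {score:.6f}")
```

The other models can be compared the same way by swapping in a different regressor and keeping the split fixed.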