Project Members:
Amogh Kori A20491465 (Voice of the team)
Kajol Tanesh Shah A20496724
Vikas Pathak A20460927
The project aimed to predict the selling price of cars using big data technologies. The team used a large car dataset containing features of new and used car listings. Apache Spark and MLlib were employed to analyze the dataset, visualize the data, and build machine learning models for prediction.
The main objective of the project was to study a large car dataset and gain meaningful insights using Apache Spark and ML libraries. The project aimed to provide management with a better understanding of the pricing dynamics of the car market.
In addition to the main objective, the project addressed specific questions related to the dataset:
- How long does it take for a car to be sold?
- Which cars are the most popular?
- What is the most preferred color for cars?
- How much does the final price differ from the Manufacturer's Suggested Retail Price (MSRP) on average?
The project used Apache Spark, a fast, general-purpose cluster computing system, along with its machine learning library, MLlib. Spark is known for its speed, ease of use, and compatibility with various platforms. MLlib provides scalable and easy-to-use machine learning capabilities in Java, Scala, Python, and R.
The team worked with a large car dataset extracted from AutoDealerData.com, containing approximately 5.7 million entries of new and used vehicle listings. Data processing involved cleaning, selecting relevant features, and standardizing the dataset. Missing values were handled, and necessary data transformations were performed to prepare the dataset for analysis and modeling.
Exploratory data analysis was conducted to gain insights from the car dataset. The team visualized various parameters and performed statistical analyses to understand the relationships between different features. Correlation analysis and deductive reasoning were used to identify patterns and make data-driven decisions.
Several machine learning algorithms available in MLlib were applied to the dataset for model training. The team implemented linear regression, decision tree regression, random forest regression, gradient-boosted tree regression, and isotonic regression. These algorithms were used to predict the selling price of cars based on the given features.
The project successfully used Apache Spark and MLlib to analyze a large car dataset and predict the selling price of cars. The machine learning models developed demonstrated the potential to provide valuable insights into the pricing dynamics of the car market. Further improvements and future work were discussed.
Future work could involve exploring additional machine learning algorithms, fine-tuning the existing models, and incorporating more features into the prediction process. Further analysis of specific car brands or models could also provide deeper insights into market dynamics.
The dataset used in this project was obtained from AutoDealerData.com and contained new and used car listings from dealers in Illinois. The dataset is available at link to dataset.
The source code for this project is available in the accompanying files. The team utilized Jupyter Notebook with PySpark, along with various libraries such as NumPy, Matplotlib, Pandas, and Seaborn, for data analysis and model development.
The team referred to various online resources, documentation, and tutorials related to Apache Spark, MLlib, and the other libraries used in the project. Details of these sources can