This project uses Apache Spark MLlib to analyze a large car dataset, predict car selling prices, and gain insights into the car market.

amoghkori/Working-with-Apache-Spark-MLLib

CSP554: Big Data Technologies

Project Members:

Amogh Kori A20491465 (Voice of the team)

Kajol Tanesh Shah A20496724

Vikas Pathak A20460927

Abstract

The project aimed to predict the selling price of cars using big data technologies. The team used a large car dataset containing various features of new and used car listings. Apache Spark and MLlib were used to analyze the dataset, visualize the data, and build machine learning models for prediction.

Objective

The main objective of the project was to study a large car dataset and gain meaningful insights using Apache Spark and ML libraries. The project aimed to provide management with a better understanding of the pricing dynamics of the car market.

Specific Questions

In addition to the main objective, the project addressed specific questions related to the dataset:

  1. How long does it take for a car to be sold?
  2. Which cars are the most popular?
  3. What is the most preferred color for cars?
  4. How much does the final price differ from the Manufacturer's Suggested Retail Price (MSRP) on average?
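
As an illustration of question 4, the average gap between selling price and MSRP can be computed both in currency units and as a percentage of MSRP. The sketch below is plain Python rather than the team's PySpark code; the field names `msrp` and `price` and the three sample listings are hypothetical and are not taken from the dataset.

```python
from statistics import mean

# Hypothetical listings; field names and values are illustrative only.
listings = [
    {"msrp": 30000.0, "price": 28500.0},
    {"msrp": 22000.0, "price": 22000.0},
    {"msrp": 45000.0, "price": 46800.0},
]


def average_msrp_gap(rows):
    """Return (mean of price - MSRP, mean percentage difference from MSRP)."""
    # Skip rows with a missing or zero MSRP, or a missing price.
    usable = [r for r in rows if r.get("msrp") and r.get("price") is not None]
    abs_gap = mean(r["price"] - r["msrp"] for r in usable)
    pct_gap = mean((r["price"] - r["msrp"]) / r["msrp"] * 100 for r in usable)
    return abs_gap, pct_gap


gap, pct = average_msrp_gap(listings)
print(f"mean gap: {gap:.2f}, mean percentage gap: {pct:.2f}%")
```

The two averages can disagree in sign, as they do for this sample, because the percentage version weights every listing equally regardless of its price level. Which one answers the question better depends on how the result is meant to be read.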

Introduction

The project used Apache Spark, a fast, general-purpose cluster computing system, along with its machine learning library, MLlib. Spark is known for its speed, ease of use, and compatibility with various platforms. MLlib provides scalable machine learning algorithms with APIs in Java, Scala, Python, and R.

Data Processing

The team worked with a large car dataset extracted from AutoDealerData.com, containing approximately 5.7 million entries of new and used vehicle listings. Data processing involved cleaning, selecting relevant features, and standardizing the dataset. Missing values were handled, and necessary data transformations were performed to prepare the dataset for analysis and modeling.
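
The README does not say which imputation or scaling method was applied, so the following is only a minimal plain-Python sketch of two common choices: filling missing values with the median and rescaling to z-scores. The `mileage` column and its values are hypothetical.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical column with two missing entries.
mileage = [12000.0, None, 48000.0, 30000.0, None]
filled = impute_median(mileage)
scaled = standardize(filled)
print(filled)
print(scaled)
```

In a Spark pipeline the same two steps would typically be expressed with DataFrame operations or feature transformers rather than Python lists, but the arithmetic is the same.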

Data Analysis

Exploratory data analysis was conducted to gain insights from the car dataset. The team visualized various parameters and performed statistical analyses to understand the relationships between different features. Correlation analysis and deduction were used to identify patterns and make data-driven decisions.
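
To make the correlation step concrete, here is a small sketch of the Pearson correlation coefficient in plain Python. It assumes Pearson correlation is the measure in question, which the README does not state, and the mileage and price values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: price falls linearly as mileage rises.
mileage = [10000, 20000, 30000, 40000]
price = [30000, 27000, 24000, 21000]
r = pearson(mileage, price)
print(round(r, 6))
```

A coefficient near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; it says nothing about non-linear dependence.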

Model Training

Several machine learning algorithms available in MLlib were applied to the dataset for model training. The team implemented linear regression, decision tree regression, random forest regression, gradient-boosted tree regression, and isotonic regression. These algorithms were used to predict the selling price of cars based on the given features.
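
Of the algorithms listed, isotonic regression is compact enough to sketch directly. It fits the non-decreasing sequence closest to the observed targets in the least-squares sense. The code below is a plain-Python pool-adjacent-violators sketch with equal weights. It is not MLlib's implementation and not the team's code, and the sample prices are hypothetical.

```python
def isotonic_fit(ys):
    """Least-squares non-decreasing fit to ys (equal weights, PAVA)."""
    # Each block stores [sum of values, count]; its fitted value is sum / count.
    blocks = []
    for y in ys:
        blocks.append([float(y), 1])
        # Pool with the previous block while the ordering is violated.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical prices, already sorted by an increasing feature such as model year.
prices = [8000, 9500, 9000, 12000]
fitted_prices = isotonic_fit(prices)
print(fitted_prices)
```

The inputs must already be ordered by the feature the fit should be monotonic in. Adjacent values that break the ordering are replaced by their shared average, which is why the two middle prices end up equal.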

Conclusion

The project used Apache Spark and MLlib to analyze a large car dataset and predict the selling price of cars. The machine learning models developed demonstrated the potential to provide valuable insights into the pricing dynamics of the car market. Further improvements and future work were discussed.

Future Work

The project's future work could involve exploring additional machine learning algorithms, fine-tuning the existing models, and incorporating more features into the prediction process. Additionally, conducting further analysis on specific car brands or models could provide deeper insights into the market dynamics.

Data Sources

The dataset used in this project was obtained from AutoDealerData.com and contained new and used car listings from dealers in Illinois. The dataset is available at link to dataset.

Source Code

The source code for this project is available in the accompanying files. The team utilized Jupyter Notebook with PySpark, along with various libraries such as NumPy, Matplotlib, Pandas, and Seaborn, for data analysis and model development.

Bibliography

The team referred to various online resources, documentation, and tutorials related to Apache Spark, MLlib, and other libraries used in the project. Details of these sources can
