"Distributed Data Analysis and Mining" Class' Team Project - MSc in Data Science and Business Informatics @ University of Pisa
End of Course Project A.Y. 2023/24
The dataset covers a single month (March 2020) of behavior data from a large multi-category online store.
Each row in the file represents one event. Every event links a product to a user, so taken together the events form a many-to-many relation between products and users.
There are different types of events. Each event can be read as follows:
User user_id, during session user_session, added to the shopping cart (property event_type is equal to cart) the product product_id of brand brand and category category_code, with price price, at event_time.
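As an illustration of this reading, here is a minimal Python sketch that turns one event row into the sentence above. The field names come from the description; the field types, the sample values, the `Event` class, the `ACTIONS` table, and the `describe` helper are all assumptions made for the example, not part of the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One row of the dataset. Field types are assumed, not documented."""
    event_time: str
    event_type: str
    product_id: int
    category_code: str
    brand: str
    price: float
    user_id: int
    user_session: str


# Only the "cart" event type is named in the text; other types fall back
# to a generic phrase (hypothetical wording).
ACTIONS = {"cart": "added to shopping cart"}


def describe(event: Event) -> str:
    """Render an event using the sentence template given above."""
    action = ACTIONS.get(event.event_type, f"performed '{event.event_type}' on")
    return (
        f"User {event.user_id} during session {event.user_session} "
        f"{action} product {event.product_id} of brand {event.brand} "
        f"of category {event.category_code} with price {event.price} "
        f"at {event.event_time}."
    )


# Made-up sample values, for illustration only.
sample = Event(
    event_time="2020-03-01 00:00:00 UTC",
    event_type="cart",
    product_id=1001,
    category_code="electronics.smartphone",
    brand="acme",
    price=9.99,
    user_id=42,
    user_session="s-1",
)

print(describe(sample))
```

Because the same user_id can appear with many product_id values and vice versa, grouping such rows by either key recovers one side of the many-to-many relation.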
This project analyzes a large volume of eCommerce data in order to predict users' behavior, using data mining techniques and Hadoop (Spark) tools.
The project is divided into four parts:
- Data Reduction
- Understanding & Preparation
- Features Extraction
- Classification