UC Berkeley Team: Jack Vasylenko, Chitwan Kaudan, Anith Patel, Tyler Larsen and William Wang.
This project is a song recommendation system built on Spark MLlib's Alternating Least Squares (ALS) collaborative filtering algorithm and trained on one million playlists open-sourced by Spotify.
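As a rough illustration of the approach, the sketch below treats each playlist as a "user" and each track as an "item", then fits an implicit-feedback ALS model with PySpark. The helper names (`build_interactions`, `train_als`), the column names, and the hyperparameter values are assumptions made for this example; they are not taken from the project's notebooks.

```python
from typing import Dict, List, Tuple


def build_interactions(
    playlists: Dict[int, List[str]],
) -> Tuple[List[Tuple[int, int, float]], Dict[str, int]]:
    """Turn {playlist_id: [track_uri, ...]} into (playlist_id, track_id, 1.0) rows.

    Spark's ALS expects integer ids, so each track URI gets a dense index.
    """
    track_index: Dict[str, int] = {}
    rows: List[Tuple[int, int, float]] = []
    for playlist_id, track_uris in playlists.items():
        # dict.fromkeys de-duplicates while preserving order, so a track that
        # appears twice in a playlist counts once.
        for uri in dict.fromkeys(track_uris):
            track_id = track_index.setdefault(uri, len(track_index))
            rows.append((playlist_id, track_id, 1.0))
    return rows, track_index


def train_als(rows: List[Tuple[int, int, float]], rank: int = 10, top_n: int = 5):
    """Fit implicit-feedback ALS and return top-N track ids per playlist.

    Requires pyspark; it is imported here so the helper above stays usable
    without a Spark installation. Hyperparameters are placeholder values.
    """
    from pyspark.ml.recommendation import ALS
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("als-sketch").getOrCreate()
    df = spark.createDataFrame(rows, ["playlist_id", "track_id", "rating"])
    als = ALS(
        userCol="playlist_id",
        itemCol="track_id",
        ratingCol="rating",
        implicitPrefs=True,        # playlist membership is implicit feedback
        rank=rank,
        maxIter=10,
        regParam=0.1,
        coldStartStrategy="drop",
    )
    model = als.fit(df)
    return model.recommendForAllUsers(top_n)


# Hypothetical toy input: two playlists sharing one track.
toy = {0: ["spotify:track:a", "spotify:track:b"], 1: ["spotify:track:b", "spotify:track:c"]}
rows, track_index = build_interactions(toy)
# recommendations = train_als(rows)  # uncomment where pyspark is available
```

Setting `implicitPrefs=True` reflects the fact that a playlist only records which tracks were added, not an explicit rating for each one.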
The Million Playlist Dataset (MPD) contains one million user-generated playlists created between January 2010 and October 2017. Each playlist includes a title, the track list (with track metadata), editing information (last edit time and number of edits), and other miscellaneous information about the playlist.
Follow these steps to download Spotify's dataset (33 GB) and convert it into a more memory-efficient format (~5 GB) for use on the Databricks platform:
- Download Spotify's official dataset and place the 'data' folder into the root folder of the project.
- Run the following command:
python restructureData.py
This script populates the data_csv folder with data that can be loaded into a Databricks table.
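The restructuring step can be pictured roughly as follows. This is a minimal sketch and not the contents of `restructureData.py`: the function names `slice_to_csv_rows` and `write_csv`, the chosen output columns, and the JSON field names (`playlists`, `pid`, `tracks`, `pos`, `track_uri`) are assumptions about the slice layout. It shows one plausible way to shrink the data, by keeping a single narrow row per playlist-track pair and dropping repeated metadata.

```python
import csv
import io
import json
from typing import Iterable, Iterator, List, TextIO


def slice_to_csv_rows(slice_json: str) -> Iterator[List[str]]:
    """Yield one [pid, pos, track_uri] row per track in an MPD-style JSON slice."""
    data = json.loads(slice_json)
    for playlist in data.get("playlists", []):
        for track in playlist.get("tracks", []):
            yield [str(playlist["pid"]), str(track["pos"]), track["track_uri"]]


def write_csv(rows: Iterable[List[str]], out: TextIO) -> None:
    """Write a header line followed by the flattened rows."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["pid", "pos", "track_uri"])
    writer.writerows(rows)


# Tiny hand-written sample (hypothetical data) standing in for a real slice file.
sample = json.dumps({
    "playlists": [
        {"pid": 0, "name": "road trip", "tracks": [
            {"pos": 0, "track_uri": "spotify:track:a", "track_name": "A"},
            {"pos": 1, "track_uri": "spotify:track:b", "track_name": "B"},
        ]}
    ]
})

buffer = io.StringIO()
write_csv(slice_to_csv_rows(sample), buffer)
print(buffer.getvalue())
```

In practice the output would go to a file under the data_csv folder rather than an in-memory buffer; the buffer is used here only to keep the example self-contained.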
EDA.ipynb
Neural-Collaborative-Filtering.ipynb
Spark-MLib-ALS.ipynb
Usage of the Million Playlist Dataset is subject to these license terms