The following is the documentation for Katelyn Stringer and Alex Riley's course project for STAT 689: Statistical Computing w/ R & Python. Movie Matchmaker is an application of collaborative filtering methods to the MovieLens dataset. See the accompanying report for more detail on the methodology used.
These instructions should get you a copy of the project up and running on your local machine for development and testing purposes.
You will need the following (mostly standard) Python packages: numpy, pandas, matplotlib, and sklearn (distributed under the package name scikit-learn). All of these should be installable from your preferred package installer, for example with conda:
conda install <package>
To install, navigate to where you want the project located on your machine and run git clone:
git clone https://github.com/stringkm/movie-matchmaker.git
To check that the project can find your Python packages, run utils.py:
cd movie-matchmaker
python utils.py
This will import all of the required packages and define some functions useful for the project. If you get the message "Everything looks good!" then you can assume things are working.
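For readers who want to run a similar check in isolation, the following is a minimal sketch of a dependency check. It is not the contents of utils.py; the function name missing_packages and the REQUIRED list are hypothetical and only mirror the packages named above.

```python
import importlib.util

# Import names of the packages listed above (scikit-learn imports as "sklearn").
# This list is an assumption for illustration, not taken from utils.py.
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("Everything looks good!")
```

Any name this reports as missing can then be installed with your package installer before running the project.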
After this, you should be able to move into the directory
cd path/to/movie-matchmaker
and run ratemovies.py as follows. Define your name using the "userId" keyword, use the "cutoff" keyword to set how many movies you want to rate, specify the output filename using "output", and set "method" to "cosine" or "pearson" to choose which similarity metric is used for the ratings.
python ratemovies.py --userId=test --cutoff=5 --output=test.csv --method=cosine
You should see a prompt similar to the one below (the movie will very likely be different):
For each movie, type a numeric rating (0-5) or <Enter> if you haven't seen it.
What is your rating for "Toy Story (1995)":
Exit the process with either exit or ^C:
For each movie, type a numeric rating (0-5) or <Enter> if you haven't seen it.
What is your rating for "Toy Story (1995)": exit
Exiting recommendation program
If you've made it to this point everything is probably set up correctly.
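The two values accepted by "method" correspond to two ways of measuring how alike two users' ratings are. The sketch below shows one common variant of each, computed over the movies both users have rated. It is dependency-free for illustration, and the function names and the choice to restrict to co-rated movies are assumptions; the project's own definitions are in the notebooks and the report.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two users' ratings on co-rated movies.

    a and b are dicts mapping movie title -> numeric rating.
    """
    common = a.keys() & b.keys()
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no overlap, or all-zero ratings
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Pearson correlation of two users' ratings on co-rated movies."""
    common = a.keys() & b.keys()
    if len(common) < 2:
        return 0.0  # correlation is undefined for fewer than two points
    mean_a = sum(a[m] for m in common) / len(common)
    mean_b = sum(b[m] for m in common) / len(common)
    cov = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    var_a = sum((a[m] - mean_a) ** 2 for m in common)
    var_b = sum((b[m] - mean_b) ** 2 for m in common)
    if var_a == 0 or var_b == 0:
        return 0.0  # a user who gives every movie the same rating
    return cov / sqrt(var_a * var_b)


# Made-up ratings: bob rates everything one point lower than alice,
# carol has the opposite taste.
alice = {"A": 5, "B": 3, "C": 1}
bob = {"A": 4, "B": 2, "C": 0}
carol = {"A": 1, "B": 3, "C": 5}
```

Pearson subtracts each user's mean first, so it treats alice and bob as perfectly alike even though bob rates lower across the board, while cosine similarity is sensitive to that offset.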
This project contains the latest small development version of the MovieLens dataset as of May 2018, containing ~100,000 ratings. The interested developer might wish to apply this package to the full stable benchmark version of ~20 million ratings. To do so, download the dataset from the linked website, unzip it, and modify the FULL_DATA parameter in 0_explore_data.ipynb to point to the folder containing the data. To apply this to any other point in the analysis, you will need to modify the files to point to that version of the ratings.csv and movies.csv files (see the 0_*.ipynb files for more information on the contents of the dataset).
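As a rough illustration of how ratings.csv and movies.csv fit together, the sketch below joins the two files on movieId using only the standard library. The column names follow the MovieLens file layout; the function name load_ratings and the sample rows are made up for this example, and the notebooks may well load the data differently (for instance with pandas).

```python
import csv
import io


def load_ratings(ratings_file, movies_file):
    """Return (userId, title, rating) tuples from two open CSV files."""
    # Map each movieId to its title so ratings can be labelled.
    titles = {row["movieId"]: row["title"] for row in csv.DictReader(movies_file)}
    return [
        (row["userId"], titles[row["movieId"]], float(row["rating"]))
        for row in csv.DictReader(ratings_file)
    ]


# Tiny in-memory stand-ins for ratings.csv and movies.csv (made-up rows).
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "2,1,3.5,964982931\n"
)
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation\n"
)
rows = load_ratings(ratings_csv, movies_csv)
```

To try this on a real download, replace the two StringIO objects with the ratings.csv and movies.csv files opened from the folder that holds the dataset.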
For further detail on methodology, read the project report.
- 0_explore_data.ipynb: data exploration of the full stable benchmark version
- 0_explore_ratings.ipynb: further data exploration focused on the ratings of the small development version
- 1_pearson.ipynb: implementation of collaborative filtering with Pearson correlation coefficient weights
- 2_cosine.ipynb: implementation of collaborative filtering with vector cosine similarity weights
- 3_top_k.ipynb: implementation of top-k collaborative filtering
- ratemovies.py: API to rate randomly selected movies, save those ratings, and compute (using either weight method) the top 5 and bottom 5 predicted rated movies
- utils.py: defines useful functions used throughout the project
- data/: contents of the small development version of the MovieLens dataset as of May 2018. See 0_explore_data.ipynb for an exploration of the full stable benchmark version of these files, which is quite similar to the small version. The actual analysis uses the small dataset throughout
- docs/: this README and other project documentation (proposal, presentation slides, and final report)
- figures/: figures from the analysis included in the project report
- processed/: saved files created in 0_explore_data.ipynb and practice ratings generated by the authors
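To give a feel for what top-k collaborative filtering computes, here is a minimal sketch of a user-based prediction: the rating for an unseen movie is estimated as a similarity-weighted average over the k most similar users who rated it. The function names, the use of cosine similarity on co-rated movies, and the rule of dropping non-positive weights are all assumptions for illustration, not the implementation in 3_top_k.ipynb or ratemovies.py.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two users (dicts movie -> rating) on co-rated movies."""
    common = a.keys() & b.keys()
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_rating(target, others, movie, similarity=cosine, k=2):
    """Weighted mean of the ratings given to `movie` by the k users most similar to `target`.

    Returns None when no positively similar user has rated the movie.
    """
    scored = [
        (similarity(target, other), other[movie])
        for other in others
        if movie in other
    ]
    # Keep only positive weights, then take the k largest.
    neighbours = sorted((pair for pair in scored if pair[0] > 0), reverse=True)[:k]
    total = sum(weight for weight, _ in neighbours)
    if total == 0:
        return None
    return sum(weight * rating for weight, rating in neighbours) / total


# Made-up example: the target has not rated movie "C".
target = {"A": 5, "B": 1}
others = [
    {"A": 5, "B": 1, "C": 4},  # same taste as the target
    {"A": 1, "B": 5, "C": 2},  # opposite taste
    {"A": 4, "B": 2, "C": 5},  # similar taste
]
prediction = predict_rating(target, others, "C", k=2)
```

Sorting such predictions across all unrated movies and taking the first and last five would give lists like the top 5 and bottom 5 described for ratemovies.py.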
- Katelyn Stringer - Pearson correlation - stringkm
- Alex Riley - Cosine similarity, top-k filtering - ahriley
This project was created as part of the Spring 2018 course STAT 689: Statistical Computing with R and Python taught by Dr. James Long at Texas A&M University.
We acknowledge the helpful advice contained in the following sources that helped us design and implement our algorithms:
- Michael Ekstrand, "Similarity Functions for User-User Collaborative Filtering," Grouplens (blog), October 24, 2013.
- Suresh Kumar Gorakala, Building Recommendation Engines (Birmingham, UK: Packt Publishing Ltd), 2016.
- James Long, "Netflix Prize and Collaborative Filtering" (lecture, Statistical Computing in R and Python, Texas A&M University, College Station, TX), March 8, 2018.
- "Netflix Prize", Netflix, Inc., accessed April 30, 2018.
- Ethan Rosenthal, "Intro to Recommender Systems: Collaborative Filtering," Data Piques (blog), November 2, 2015.