movie-matchmaker

The following is the documentation for Katelyn Stringer and Alex Riley's course project for STAT 689: Statistical Computing w/ R & Python. Movie Matchmaker is an application of collaborative filtering methods to the MovieLens dataset. See the accompanying report for more detail on the methodology used.

Getting started

These instructions should get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You will need the (mostly standard) Python packages numpy, pandas, matplotlib, and scikit-learn (imported as sklearn). All of these can be installed with your preferred package manager, for example with conda:

conda install <package>

Installing

To install, simply navigate to where you want the project located on your machine and perform a git clone:

git clone https://github.com/stringkm/movie-matchmaker.git

To check that the project can find your Python packages, run utils.py:

cd movie-matchmaker
python utils.py

This will import all of the required packages and define some functions useful for the project. If you see the message Everything looks good!, you can assume things are working.
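If utils.py fails, it can help to find out which package is missing before digging further. The following is a minimal standalone sketch, not part of the project: the function name missing_packages is hypothetical, and the list of import names is taken from the Prerequisites section.

```python
import importlib.util

# Import names for the packages listed under Prerequisites.
# scikit-learn is imported under the name "sklearn".
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only checks that each package can be located; it does not import them or check their versions.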

Testing

After this, you should be able to move into the directory

cd path/to/movie-matchmaker

and run ratemovies.py as shown below. Set your name with the "userId" keyword, use "cutoff" to choose how many movies you want to rate, specify the output filename with "output", and set "method" to "cosine" or "pearson" to choose the similarity metric used for the predicted ratings.

python ratemovies.py --userId=test --cutoff=5 --output=test.csv --method=cosine

You should see a prompt similar to the one below (the movie will very likely be different):

For each movie, type a numeric rating (0-5) or <Enter> if you haven't seen it.
What is your rating for "Toy Story (1995)":

Simply exit the process with either exit or ^C:

For each movie, type a numeric rating (0-5) or <Enter> if you haven't seen it.
What is your rating for "Toy Story (1995)": exit

Exiting recommendation program

If you've made it to this point everything is probably set up correctly.
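For readers curious how a command-line interface like this can be put together, here is a rough sketch using argparse. It is not the code in ratemovies.py: the function names build_parser and parse_reply, the default values, and the help texts are all assumptions made for illustration. Only the four keywords, the 0-5 rating range, the empty reply for unseen movies, and the exit keyword come from the example above.

```python
import argparse


def build_parser():
    """Argument parser mirroring the four keywords in the example command."""
    parser = argparse.ArgumentParser(description="Rate movies and get recommendations.")
    parser.add_argument("--userId", required=True, help="name to store the ratings under")
    # The defaults below are illustrative assumptions, not the project's values.
    parser.add_argument("--cutoff", type=int, default=5, help="number of movies to rate")
    parser.add_argument("--output", default="ratings.csv", help="file to save the ratings to")
    parser.add_argument("--method", choices=["cosine", "pearson"], default="cosine",
                        help="similarity metric")
    return parser


def parse_reply(reply):
    """Interpret one reply to the rating prompt.

    Returns "exit" to stop, None for an unseen movie (empty reply),
    or a float rating between 0 and 5. Raises ValueError otherwise.
    """
    reply = reply.strip()
    if reply.lower() == "exit":
        return "exit"
    if reply == "":
        return None
    rating = float(reply)  # raises ValueError for non-numeric text
    if not 0 <= rating <= 5:
        raise ValueError("rating must be between 0 and 5")
    return rating


# Parse the same arguments as the example command.
args = build_parser().parse_args(
    ["--userId=test", "--cutoff=5", "--output=test.csv", "--method=cosine"]
)
print(args.userId, args.cutoff, args.output, args.method)
```

Restricting "method" with choices means a misspelled metric is rejected before any ratings are requested.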

Optional download

This project contains the latest small development version of the MovieLens dataset as of May 2018, containing ~100,000 ratings. The interested developer might wish to apply this package to the full stable benchmark version of ~20 million ratings. To do so, download the dataset from the linked website, unzip it, and modify the FULL_DATA parameter in 0_explore_data.ipynb to point to the folder containing the data. To use the full dataset at any other point in the analysis, you will need to modify the relevant files to point to that version's ratings.csv and movies.csv files (see the 0_*.ipynb files for more information on the contents of the dataset).
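Before pointing the notebooks at a new folder, it may be worth confirming that the folder contains the two files with the expected column headers. The sketch below is a standalone check and not part of the project; the function name check_dataset_folder is hypothetical, and the column lists assume the standard MovieLens CSV layout.

```python
import csv
from pathlib import Path

# Column headers assumed for the two MovieLens files used by this project.
EXPECTED_COLUMNS = {
    "ratings.csv": ["userId", "movieId", "rating", "timestamp"],
    "movies.csv": ["movieId", "title", "genres"],
}


def check_dataset_folder(folder):
    """Return a list of problems found in `folder` (empty if it looks usable)."""
    problems = []
    for filename, expected in EXPECTED_COLUMNS.items():
        path = Path(folder) / filename
        if not path.is_file():
            problems.append(f"{filename} not found in {folder}")
            continue
        # Read only the header row, so this stays fast on the full dataset.
        with open(path, newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])
        if header != expected:
            problems.append(f"{filename} has unexpected columns: {header}")
    return problems


if __name__ == "__main__":
    # "data" is the folder holding the small dataset in this repository.
    for problem in check_dataset_folder("data"):
        print(problem)
```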

Contents

For further detail on methodology, read the project report.

Code

  • 0_explore_data.ipynb: data exploration of the full stable benchmark version
  • 0_explore_ratings.ipynb: further data exploration focused on the ratings of the small development version
  • 1_pearson.ipynb: implementation of collaborative filtering with Pearson correlation coefficient weights
  • 2_cosine.ipynb: implementation of collaborative filtering with vector cosine similarity weights
  • 3_top_k.ipynb: implementation of top-k collaborative filtering
  • ratemovies.py: command-line script to rate randomly selected movies, save those ratings, and compute (using either weight method) the 5 movies with the highest and the 5 with the lowest predicted ratings
  • utils.py: defines useful functions used throughout the project
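To give a feel for what the three notebooks compute, here is a small sketch of user-based collaborative filtering with Pearson and cosine weights and an optional top-k cutoff. It is written in plain Python so it runs without extra packages, and it follows a common textbook formulation (user mean plus a weighted average of neighbours' deviations from their own means). The notebooks and the report may differ in their details; all function names and the toy ratings are made up for illustration.

```python
import math


def co_rated(a, b):
    """Movies rated by both users; a and b map movieId -> rating."""
    return sorted(set(a) & set(b))


def cosine_weight(a, b):
    """Vector cosine similarity over the co-rated movies."""
    common = co_rated(a, b)
    num = sum(a[m] * b[m] for m in common)
    den = (math.sqrt(sum(a[m] ** 2 for m in common))
           * math.sqrt(sum(b[m] ** 2 for m in common)))
    return num / den if den else 0.0


def pearson_weight(a, b):
    """Pearson correlation coefficient over the co-rated movies."""
    common = co_rated(a, b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[m] for m in common) / len(common)
    mean_b = sum(b[m] for m in common) / len(common)
    num = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    den = (math.sqrt(sum((a[m] - mean_a) ** 2 for m in common))
           * math.sqrt(sum((b[m] - mean_b) ** 2 for m in common)))
    return num / den if den else 0.0


def predict(target, others, movie, weight, k=None):
    """Predict target's rating for `movie` from users who have rated it.

    If k is given, only the k neighbours with the largest absolute
    weight are used (top-k filtering).
    """
    neighbours = [(weight(target, o), o) for o in others if movie in o]
    neighbours = [(w, o) for w, o in neighbours if w != 0]
    if k is not None:
        neighbours = sorted(neighbours, key=lambda p: abs(p[0]), reverse=True)[:k]
    target_mean = sum(target.values()) / len(target)
    norm = sum(abs(w) for w, _ in neighbours)
    if norm == 0:
        return target_mean  # no usable neighbours: fall back to the user's mean
    deviation = sum(w * (o[movie] - sum(o.values()) / len(o)) for w, o in neighbours)
    return target_mean + deviation / norm


# Toy ratings: alice has not rated movie 4.
alice = {1: 5, 2: 3, 3: 4}
bob = {1: 4, 2: 2, 3: 3, 4: 5}
carol = {1: 1, 2: 5, 3: 2, 4: 1}

print("pearson:", pearson_weight(alice, bob), pearson_weight(alice, carol))
print("cosine: ", cosine_weight(alice, bob), cosine_weight(alice, carol))
print("prediction for movie 4:", predict(alice, [bob, carol], 4, pearson_weight))
print("top-1 prediction:      ", predict(alice, [bob, carol], 4, pearson_weight, k=1))
```

Predictions from this formula can fall outside the 0-5 rating scale, so a real implementation might clip them. Note also that cosine similarity on raw ratings is never negative here, whereas Pearson weights can be, which is one reason the two methods can rank the same movies differently.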

Folders

  • data/: contents of the small development version of the MovieLens dataset as of May 2018. See 0_explore_data.ipynb for an exploration of the full stable benchmark version of these files, which is quite similar to the small version. The actual analysis uses the small dataset throughout
  • docs/: this README and other project documentation (proposal, presentation slides, and final report)
  • figures/: figures from the analysis included in the project report
  • processed/: saved files created in 0_explore_data.ipynb and practice ratings generated by the authors

Authors

  • Katelyn Stringer - Pearson correlation - stringkm
  • Alex Riley - Cosine similarity, top-k filtering - ahriley

Acknowledgements

This project was created as part of the Spring 2018 course STAT 689: Statistical Computing with R and Python taught by Dr. James Long at Texas A&M University.

We acknowledge the helpful advice contained in the following sources that helped us design and implement our algorithms:
