A Python-based hybrid recommendation system built from scratch.
- Python 2.7.*
- MongoDB
- Python modules:
  - pyprind
  - pymongo
  - tmdbsimple (optional)
- ml-100k dataset from http://www.grouplens.org/datasets/movielens/
Make a copy of .config.example in the root directory as .config, then add your database information and, if you want to use it, your TMDB API key.
Run python install.py -p <path to extracted ml-100k>. To include TMDB information, pass the --with-tmdb option. WARNING! Loading TMDB information takes quite a bit of time.
This script:
- Downloads u.data.tmdb from https://gist.github.com/amitab/7869d7336b80dfc3c4e8 to match movie IDs with TMDB IDs, if the --with-tmdb option is provided.
- Loads metadata, user data and movie data into MongoDB, in the database and respective collections provided in the .config file.
- Prepares the movie deviation matrix and imports it into the collection provided in the .config file.
- Prepares the user similarity matrix and imports it into the collection provided in the .config file.
There are two components:
- The User Collaborative filter
- The Item Collaborative filter
The user similarities and item deviations are computed ahead of time and updated on user registration, on user likes, and on adding a new movie. To recommend a movie to a user, the high-level steps are:
- Use cosine similarity to get the k nearest neighbours.
- Find common movies between these neighbours, excluding the movies the user has already rated.
- Predict the user's rating of these movies using the Slope One algorithm and the similarity of the users.
- Return a sorted list of movies.
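The first two steps can be sketched roughly as follows. This is only an illustration: plain dictionaries stand in for the MongoDB collections, the function names are hypothetical and not taken from Recommendare.py, and treating "common movies" as the pooled set of the neighbours' movies is an assumption.

```python
import heapq


def k_nearest_neighbours(similarities, k):
    """Return the ids of the k users with the highest similarity.

    `similarities` maps neighbour id -> precomputed cosine similarity
    to the target user (hypothetical stand-in for the stored matrix).
    """
    return heapq.nlargest(k, similarities, key=similarities.get)


def candidate_movies(ratings, target, neighbours):
    """Movies rated by any neighbour that the target has not rated yet.

    `ratings` maps user id -> {movie id: rating}.
    """
    seen = set(ratings.get(target, {}))
    pooled = set()
    for n in neighbours:
        pooled.update(ratings.get(n, {}))
    return pooled - seen


similarities = {2: 0.91, 3: 0.40, 4: 0.87}
ratings = {
    1: {10: 4, 11: 3},
    2: {10: 5, 12: 4},
    3: {13: 2},
    4: {11: 4, 14: 5},
}
neighbours = k_nearest_neighbours(similarities, 2)
candidates = candidate_movies(ratings, 1, neighbours)
```

Here users 2 and 4 are the two nearest neighbours of user 1, and the candidates are the movies they rated that user 1 has not.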
To calculate cosine similarity between two users, we build a vector to represent each user.
This vector contains the normalized age, gender, occupation and genre interests. It is generated in user_wrapper.py:40.
Example:
Let's consider a user with ID 1:
{
"age" : 24,
"sex" : "M",
"likes" : [
"Mystery", "Romance", "Sci-Fi",
"Family", "Horror", "Film-Noir",
"Crime", "Drama", "Children's",
"Musical", "Animation", "Adventure",
"Action", "Comedy", "Documentary",
"War", "Thriller", "Western"
],
"occupation" : "technician",
"id" : 1,
"zip_code" : "85711"
}
We would build the user vector for user ID 1 as:
[
# The age normalized to a value between 0 and 1
0.25757575757575757,
# Male or female
1, 0,
# So many occupations to choose from
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0,
# He likes so many genres of Movies
1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1,
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 1, 1, 0
]
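A minimal sketch of how such a vector could be assembled is shown below. It is not the code from user_wrapper.py. The age range of 7 to 73 is an assumption (it happens to reproduce the normalized value above for age 24), the occupation order is assumed to follow the ml-100k u.occupation file, and the genre list is shortened for illustration.

```python
# Assumed to match the order of ml-100k's u.occupation file.
OCCUPATIONS = [
    "administrator", "artist", "doctor", "educator", "engineer",
    "entertainment", "executive", "healthcare", "homemaker", "lawyer",
    "librarian", "marketing", "none", "other", "programmer", "retired",
    "salesman", "scientist", "student", "technician", "writer",
]
# Shortened, hypothetical genre list; the real one is longer.
GENRES = ["Action", "Comedy", "Drama", "Sci-Fi"]
# Assumed minimum and maximum user age used for min-max scaling.
MIN_AGE, MAX_AGE = 7, 73


def build_user_vector(user):
    """Concatenate scaled age, one-hot sex, one-hot occupation, genre flags."""
    age = (user["age"] - MIN_AGE) / float(MAX_AGE - MIN_AGE)
    sex = [1, 0] if user["sex"] == "M" else [0, 1]
    occupation = [1 if o == user["occupation"] else 0 for o in OCCUPATIONS]
    likes = set(user.get("likes", []))
    genres = [1 if g in likes else 0 for g in GENRES]
    return [age] + sex + occupation + genres


user = {
    "age": 24,
    "sex": "M",
    "occupation": "technician",
    "likes": ["Drama", "Sci-Fi"],
}
vector = build_user_vector(user)
```

One-hot encoding keeps every categorical field on the same 0 to 1 scale as the normalized age, so no single field dominates the similarity.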
We would construct another vector for the user we are comparing against and find the cosine of the angle between the two n-dimensional vectors. There are lots of 0's in our vectors, so cosine similarity seems like the best measure to use:
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
The numerator is just the dot product of the two vectors. The denominator is the product of the individual lengths of the vectors. These values are calculated ahead of time, updated each time a user updates their interests, and used to fetch the k nearest neighbours.
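As a small illustration of the computation (a sketch, not the project's own implementation), cosine similarity over two plain Python lists can be written as below. Returning 0.0 for an all-zero vector is a convention chosen here, not something the project specifies.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: no direction, no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


u = [0.25, 1, 0, 1, 1, 0]
v = [0.50, 1, 0, 0, 1, 1]
similarity = cosine_similarity(u, v)
```

Because all entries are non-negative, the result always falls between 0 (nothing in common) and 1 (same direction).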
Once we have these neighbours, we pick some of the movies between these users and remove the movies already rated by the target user. Then we predict a rating for each of these movies for the target user based on the movies the user has already rated, using both the SlopeOne and the Cosine Similarity algorithm.
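The similarity-based prediction is simple. We already have the cosine similarity of the neighbours. The rating of a movie $m$ is given by:
$$\hat{r}_{u,m} = \frac{\sum_{n \in N} \mathrm{sim}(u, n) \, r_{n,m}}{\sum_{n \in N} \mathrm{sim}(u, n)}$$
Where:
- $N$ is the set of all neighbours
- $u$ is the target user
- $n$ is the neighbour user
- $r_{n,m}$ is the neighbour user's rating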
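One common way to turn neighbour similarities into a predicted rating is a similarity-weighted average, sketched below. The normalization by the sum of similarities and the choice to skip neighbours who have not rated the movie are assumptions of this sketch; the function name is hypothetical.

```python
def predict_by_similarity(neighbour_sims, neighbour_ratings):
    """Similarity-weighted average of the neighbours' ratings for one movie.

    `neighbour_sims` maps neighbour id -> cosine similarity to the target.
    `neighbour_ratings` maps neighbour id -> rating of the movie.
    Returns None when no neighbour with a rating contributes any weight.
    """
    numerator = 0.0
    denominator = 0.0
    for neighbour, sim in neighbour_sims.items():
        if neighbour in neighbour_ratings:
            numerator += sim * neighbour_ratings[neighbour]
            denominator += sim
    return numerator / denominator if denominator else None


sims = {2: 0.9, 3: 0.6, 4: 0.3}
movie_ratings = {2: 5, 3: 3}  # neighbour 4 has not rated the movie
prediction = predict_by_similarity(sims, movie_ratings)
```

In this example the prediction is (0.9 * 5 + 0.6 * 3) / (0.9 + 0.6), so the closer neighbour pulls the result towards their rating.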
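Slope One works in two phases. First Phase: We calculate the deviation between every pair of movies, $i$ and $j$, and store them in the deviation collection.
If $X$ is the entire set of ratings, $u_i$ is user $u$'s rating of movie $i$, and $card(S_{i,j}(X))$ is the number of users who have rated both movies $i$ and $j$, then:
$$dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))}$$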
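The first phase can be sketched as below, assuming the ratings are held in memory as a dictionary of user id to {movie id: rating}. The function name and data layout are illustrative only and are not taken from the project's code.

```python
from collections import defaultdict
from itertools import permutations


def compute_deviations(ratings):
    """Return (dev, count) keyed by ordered movie pairs (i, j).

    dev[(i, j)] is the average of (rating of i - rating of j) over the
    users who rated both; count[(i, j)] is the number of such users.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        # Every ordered pair of distinct movies this user has rated.
        for i, j in permutations(user_ratings, 2):
            totals[(i, j)] += user_ratings[i] - user_ratings[j]
            counts[(i, j)] += 1
    dev = {pair: totals[pair] / counts[pair] for pair in totals}
    return dev, dict(counts)


ratings = {
    "a": {1: 5, 2: 3},
    "b": {1: 4, 2: 2, 3: 5},
}
dev, count = compute_deviations(ratings)
```

Both users rated movie 1 two points above movie 2, so dev[(1, 2)] is 2.0 with a count of 2, and dev[(2, 1)] is its negative.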
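Second Phase: To predict a rating for a movie $j$ by a user $u$, we find the list of all the movies this user has rated, $S(u)$:
$$P(u)_j = \frac{\sum_{i \in S(u) \setminus \{j\}} (dev_{j,i} + u_i) \, c_{j,i}}{\sum_{i \in S(u) \setminus \{j\}} c_{j,i}}$$
where $c_{j,i}$ is the number of users who have rated both $i$ and $j$.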
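A sketch of the second phase, using the weighted form of Slope One, is given below. It assumes the deviations and co-rating counts are available as dictionaries keyed by ordered movie pairs; that layout and the function name are assumptions for illustration.

```python
def slope_one_predict(user_ratings, movie, dev, count):
    """Weighted Slope One prediction of `movie` for one user.

    `user_ratings` maps movie id -> this user's rating.
    `dev[(j, i)]` is the average deviation of movie j from movie i and
    `count[(j, i)]` the number of users who rated both.
    Returns None when no rated movie shares raters with `movie`.
    """
    numerator = 0.0
    denominator = 0
    for i, rating in user_ratings.items():
        pair = (movie, i)
        if i == movie or pair not in dev:
            continue
        numerator += (dev[pair] + rating) * count[pair]
        denominator += count[pair]
    return numerator / denominator if denominator else None


dev = {(3, 1): 1.0, (3, 2): 0.5}
count = {(3, 1): 2, (3, 2): 1}
user_ratings = {1: 4, 2: 3}
prediction = slope_one_predict(user_ratings, 3, dev, count)
```

Weighting by the co-rating count means deviations backed by more users contribute more to the prediction.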
The hybrid rating is just the average of the two predictions. Nothing special.
There are 3 types of recommenders built in, differing in the way the movies are selected for recommendation. In each, k is the number of neighbours to look for and count is the number of movies to recommend.
- The Fast Recommender - Pick the k nearest neighbours, pick count of their highly rated movies and run them through the hybrid rating predictor.
- The Best Recommender - Pick all the movies of the k neighbours, run them through the hybrid rating predictor and pick the count best rated movies.
- The Serendipity Recommender - Always returns random best results. Pick the k nearest neighbours, pick count random movies from their highly rated list and run them through the hybrid rating predictor.
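The difference between the three selection strategies can be sketched as below. All names are hypothetical and not taken from Recommendare.py, the "highly rated" threshold of 4 is an assumption, and `predict` stands in for the hybrid rating predictor. Filtering out movies the target user has already rated is omitted for brevity.

```python
import random


def best_neighbour_ratings(ratings, neighbours):
    """Map each movie to the highest rating any neighbour gave it."""
    best = {}
    for n in neighbours:
        for movie, rating in ratings.get(n, {}).items():
            best[movie] = max(rating, best.get(movie, rating))
    return best


def fast_candidates(ratings, neighbours, count):
    """Fast: take the `count` movies the neighbours rated highest."""
    best = best_neighbour_ratings(ratings, neighbours)
    return sorted(best, key=lambda m: (-best[m], m))[:count]


def serendipity_candidates(ratings, neighbours, count, threshold=4, rng=random):
    """Serendipity: take `count` random movies from the highly rated ones."""
    best = best_neighbour_ratings(ratings, neighbours)
    liked = sorted(m for m, r in best.items() if r >= threshold)
    return rng.sample(liked, min(count, len(liked)))


def best_recommendations(ratings, neighbours, count, predict):
    """Best: score every neighbour movie, keep the `count` best predictions."""
    best = best_neighbour_ratings(ratings, neighbours)
    return sorted(best, key=lambda m: (-predict(m), m))[:count]


ratings = {2: {10: 5, 11: 2, 12: 4}, 3: {12: 3, 13: 5}}
neighbours = [2, 3]
fast = fast_candidates(ratings, neighbours, 2)
lucky = serendipity_candidates(ratings, neighbours, 2)
predicted = {10: 3.0, 11: 4.5, 12: 4.0, 13: 2.0}
best = best_recommendations(ratings, neighbours, 2, predicted.get)
```

The fast and serendipity variants narrow the list before predicting, so they run the predictor on at most count movies, while the best variant predicts for every neighbour movie and trims afterwards.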
You can see the implementation of each of these in Recommendare.py.
The main file is Recommendare.py. The usage is as described below.
r = Recommendare()
# Function signature is the same for fast, best, serendipity recommenders
r.fast_recommender(
    1,   # user id
    10,  # number of movies to recommend
    3    # number of neighbours to look for
)
To register a user:
r = Recommendare()
# All are required, except 'likes'. It is built as the user rates movies.
r.register_user({
'zip_code': 3848920,
'age': 99,
'gender': 'F',
'likes': ['Drama'], # Optional
'occupation': 'Student'
})
To update a user's likes:
r = Recommendare()
r.update_user_likes(
    1,         # user id
    ['Drama']  # new likes
)
To rate a Movie:
r = Recommendare()
r.user_rate_movie(
1, # user id
2, # movie id
4.2 # rating
)
To predict a rating for a Movie by a User:
r = Recommendare()
r.predict_rating(
1, # user id
2, # movie id
3, # neighbours to consult
)