The provided code snippet recommends movies based on their metadata, using the TMDB Movie Metadata dataset from Kaggle. It covers the following steps:
- Libraries and Dependencies
- Data Loading and Preprocessing
- TF-IDF Vectorization
- Cosine Similarity
- Movie Recommendations
- Data Parsing and Cleaning
- Feature Extraction
- Data Cleaning and Transformation
- Feature Combination
- CountVectorizer and Cosine Similarity
- Movie Recommendation and Indexing
- Final Recommendation
Here's a brief explanation of each step:
- Libraries and Dependencies: The required libraries such as pandas, numpy, sklearn, and pickle are imported.
- Data Loading and Preprocessing: Two CSV files, 'tmdb_5000_credits.csv' and 'tmdb_5000_movies.csv', are read into DataFrames df1 and df2, respectively. The columns of df1 are renamed, and df1 is then merged with df2 on the 'id' column.
- TF-IDF Vectorization: The TfidfVectorizer from scikit-learn is used to create a TF-IDF vectorizer object named tfidf. Any missing values (NaN) in the 'overview' column of df2 are replaced with an empty string. Then, the TF-IDF matrix is constructed by fitting and transforming the 'overview' data using tfidf.
- Cosine Similarity: The cosine similarity matrix is computed with the linear_kernel function from sklearn.metrics.pairwise. Because TfidfVectorizer L2-normalizes each row by default, the plain dot product that linear_kernel computes equals the cosine similarity. The matrix holds the pairwise similarity between movies based on the TF-IDF representation of their overviews.
- Movie Recommendations: The get_recommendations function takes a movie title as input and returns the 10 most similar movies, ranked by cosine similarity score. It looks up the index of the input movie in a reverse mapping from movie titles to indices, then uses that movie's row of the cosine similarity matrix to retrieve the most similar movies.
- Data Parsing and Cleaning: The literal_eval function from the ast module is applied to parse stringified features ('cast', 'crew', 'keywords', 'genres') into their corresponding Python objects.
- Feature Extraction: Functions are defined to extract useful information from the parsed features. For example, the get_director function retrieves the director's name from the 'crew' feature, and the get_list function returns a list of the top three elements from a given feature.
- Data Cleaning and Transformation: The clean_data function is applied to the 'cast', 'keywords', 'director', and 'genres' features. It lowercases all strings and removes the spaces inside names, so that a multi-word name becomes a single token.
- Feature Combination: A new feature named 'soup' is created by combining the cleaned 'keywords', 'cast', 'director', and 'genres' features into a single string, separated by spaces. This step helps to capture more relevant information for similarity computation.
- CountVectorizer and Cosine Similarity: The CountVectorizer from sklearn.feature_extraction.text is imported, and a count matrix is created by fitting and transforming the 'soup' data. The cosine similarity matrix is computed based on the count matrix, representing the similarity between movies using the count-based approach.
- Movie Recommendation and Indexing: The index of the main DataFrame is reset, and a reverse mapping from movie titles to indices is constructed. The recommendation function uses this mapping to look up the index of the input movie.
- Final Recommendation: The get_recommendations function is called with a specific movie title, and it returns the top 10 most similar movies based on either TF-IDF or count-based cosine similarity, depending on the cosine similarity matrix used.
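
The TF-IDF Vectorization, Cosine Similarity, and Movie Recommendations steps can be sketched roughly as follows. This is a minimal illustration, not the original snippet: the four-row DataFrame is a made-up stand-in for the data loaded from 'tmdb_5000_movies.csv', and the `stop_words="english"` setting and the `top_n` parameter are assumptions added for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data standing in for df2 (the real rows come from the CSV).
df2 = pd.DataFrame({
    "title": ["Space Saga", "Star Voyage", "Kitchen Tales", "Untitled"],
    "overview": [
        "A crew of astronauts explores a distant galaxy",
        "Astronauts travel across the galaxy to save their crew",
        "A chef opens a small restaurant in Paris",
        None,  # a missing overview, handled by fillna below
    ],
})

# Replace missing overviews with empty strings, then build the TF-IDF matrix.
df2["overview"] = df2["overview"].fillna("")
tfidf = TfidfVectorizer(stop_words="english")  # stop-word removal is an assumption
tfidf_matrix = tfidf.fit_transform(df2["overview"])

# TF-IDF rows are L2-normalized by default, so the dot product
# computed by linear_kernel is the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Reverse mapping: movie title -> row index.
indices = pd.Series(df2.index, index=df2["title"]).drop_duplicates()


def get_recommendations(title, cosine_sim=cosine_sim, top_n=10):
    """Return up to top_n titles most similar to the given title."""
    idx = indices[title]
    # Pair every movie index with its similarity to the input movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda pair: pair[1], reverse=True)
    # Drop the input movie itself, then keep the best matches.
    sim_scores = [pair for pair in sim_scores if pair[0] != idx][:top_n]
    movie_indices = [i for i, _ in sim_scores]
    return df2["title"].iloc[movie_indices]


print(get_recommendations("Space Saga"))
```

Passing a different similarity matrix through the `cosine_sim` argument lets the same function serve both the overview-based and the metadata-based approach.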
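
To make the Data Parsing and Cleaning, Feature Extraction, Data Cleaning and Transformation, and Feature Combination steps concrete, here is a small sketch that runs them on a single hand-written record. The record, its names, and the `create_soup` helper name are hypothetical; the function bodies are one plausible reading of the description above, not the original code, and they work on plain dictionaries rather than a DataFrame to stay short.

```python
from ast import literal_eval

# Hypothetical record mimicking the stringified list-of-dict columns.
row = {
    "cast": '[{"name": "Ann Lee"}, {"name": "Bob Ray"}, '
            '{"name": "Cy Fox"}, {"name": "Dee Kim"}]',
    "crew": '[{"job": "Editor", "name": "Al Poe"}, '
            '{"job": "Director", "name": "Eve Moss"}]',
    "keywords": '[{"name": "space travel"}, {"name": "future"}]',
    "genres": '[{"name": "Action"}, {"name": "Science Fiction"}]',
}


def get_director(crew):
    """Return the name of the first crew member whose job is Director."""
    for member in crew:
        if member.get("job") == "Director":
            return member["name"]
    return None


def get_list(items, limit=3):
    """Return the names of the first `limit` entries, or [] if not a list."""
    if isinstance(items, list):
        return [item["name"] for item in items][:limit]
    return []


def clean_data(value):
    """Lowercase strings and strip spaces so each name becomes one token."""
    if isinstance(value, list):
        return [v.replace(" ", "").lower() for v in value]
    if isinstance(value, str):
        return value.replace(" ", "").lower()
    return ""


def create_soup(features):
    """Join keywords, cast, director, and genres into one space-separated string."""
    parts = features["keywords"] + features["cast"]
    parts = parts + [features["director"]] + features["genres"]
    return " ".join(parts)


# Parse the stringified columns into Python objects.
parsed = {name: literal_eval(text) for name, text in row.items()}

# Extract the director and the top three entries of each list feature.
features = {
    "director": get_director(parsed["crew"]),
    "cast": get_list(parsed["cast"]),
    "keywords": get_list(parsed["keywords"]),
    "genres": get_list(parsed["genres"]),
}

# Clean every feature, then combine them into the soup.
features = {name: clean_data(value) for name, value in features.items()}
soup = create_soup(features)
print(soup)
```

Removing the spaces matters for the later vectorization step: without it, two different people who share a first name would contribute a common token and look more similar than they are.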
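
The CountVectorizer and Cosine Similarity step might look like the sketch below. The three 'soup' strings are invented for illustration, and the use of `cosine_similarity` from sklearn.metrics.pairwise is an assumption, since the description does not name the function. Raw counts are not normalized, so a plain dot product would not give the cosine here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 'soup' strings, one per movie.
soups = [
    "spacetravel future annlee bobray evemoss action sciencefiction",
    "alien future annlee danfox evemoss action adventure",
    "cooking paris kimroe lexhart ivyday comedy romance",
]

# Build the token-count matrix: one row per movie, one column per token.
count = CountVectorizer()
count_matrix = count.fit_transform(soups)

# Pairwise cosine similarity between the count vectors.
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

# Movies 0 and 1 share a cast member, a director, a keyword, and a genre,
# so they score higher than movies 0 and 2, which share nothing.
print(cosine_sim2[0, 1], cosine_sim2[0, 2])
```

A matrix like `cosine_sim2` can then be handed to the recommendation function in place of the TF-IDF based matrix.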