This repository contains the semester project for the course ETSP (source repository). The code is written in Python 3.12.
Main packages: `annoy`, `sentence_transformers`, `torch`, `numpy`, `sklearn` (`TfidfVectorizer`), `scipy`, `pandas`, `matplotlib`, `tqdm`.
readAlike is a book recommendation system that suggests books similar to a given input book. It combines TF-IDF vectorization, Sentence-BERT embeddings, and Approximate Nearest Neighbors (ANN) search to generate content-based recommendations from title, description, author, and category data (dataset).
To run the recommendation engine:
- Open the command line and execute `pip install -r requirements.txt`.
- Execute `main.py`. Given an example book, the program prints the top five recommended books for each of three methods: TF-IDF, SBERT, and ANN.
Project structure:
- preprocessing/: Manages data preprocessing.
  - `Preprocessor`: Handles data cleaning and formatting from a CSV file of book data.
- core/: Contains the core modules for recommendation.
  - `Library` and `Book`: Model the library of books and individual book data.
  - `Vectorizer`: Converts book text data into numerical vectors using TF-IDF and Sentence-BERT.
  - `DimensionalityReducer`: Reduces the dimensionality of TF-IDF vectors using Truncated SVD.
  - `Ann`: Creates an Approximate Nearest Neighbors model for efficient similarity search.
  - `Recommender`: Main recommendation engine that integrates the above components to provide recommendations.
- `config.py`: Configuration file with column names for title, description, authors, and categories.
- `main.py`: Main entry point for running the recommendation pipeline.
The pipeline runs in the following steps:
- Preprocessing: The `Preprocessor` class reads the dataset and performs data cleaning.
- Library Initialization: `Library` is initialized with the cleaned dataset, storing each book as a `Book` object.
- Vectorization: `Vectorizer` creates TF-IDF and Sentence-BERT embeddings for each book.
- Dimensionality Reduction: `DimensionalityReducer` reduces the dimensionality of the TF-IDF vectors to improve ANN performance.
- ANN Construction: `Ann` constructs an ANN model based on the reduced vectors.
- Recommendation: `Recommender` groups the results by method (TF-IDF, SBERT, and ANN) and outputs the most similar books for each.
`Preprocessor` class:
- Attributes:
  - `df`: DataFrame containing cleaned book data.
- Methods:
  - `preprocess_data()`: Cleans and formats the data.
  - `drop_items_with_short_entries()`, `drop_duplicates()`, `convert_strings_into_lists()`: Helper functions to clean the dataset.
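The cleaning helpers listed above might look roughly like the following sketch. The column names, the minimum description length, the choice to deduplicate by title, and the assumption that list-valued columns are stored as strings such as `"['A', 'B']"` are illustrative guesses, not taken from the repository.

```python
import ast

import pandas as pd

# Hypothetical column names; the project keeps the real ones in config.py.
TITLE, DESCRIPTION, AUTHORS, CATEGORIES = "title", "description", "authors", "categories"


def drop_items_with_short_entries(df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
    """Drop rows whose description is missing or shorter than min_chars (assumed threshold)."""
    lengths = df[DESCRIPTION].fillna("").str.len()
    return df[lengths >= min_chars]


def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first row for each title (assumed deduplication key)."""
    return df.drop_duplicates(subset=[TITLE], keep="first")


def convert_strings_into_lists(df: pd.DataFrame) -> pd.DataFrame:
    """Parse strings such as "['A', 'B']" into Python lists (assumed storage format)."""
    df = df.copy()
    for col in (AUTHORS, CATEGORIES):
        # Non-string values (for example NaN) become an empty list.
        df[col] = df[col].apply(lambda s: ast.literal_eval(s) if isinstance(s, str) else [])
    return df


def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Run the helpers in sequence and renumber the remaining rows."""
    df = drop_items_with_short_entries(df)
    df = drop_duplicates(df)
    df = convert_strings_into_lists(df)
    return df.reset_index(drop=True)
```

Calling `preprocess_data(pd.read_csv(path))` on the raw CSV would then yield a frame that is ready to be turned into book objects.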
`Library` class:
- Attributes:
  - `books`: List of `Book` objects.
- Methods:
  - `get_combined_data()`: Concatenates title, description, authors, and categories into a single string per book.
  - `get_book_idx()`: Retrieves the index of a book within the library.
`Book` class:
- Attributes:
  - `title`, `description`, `authors`, `categories`: Fields describing the book.
- Methods:
  - `get_combined_data()`: Combines title, description, authors, and categories into a single string.
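As a rough illustration of how the library and book models described above could fit together, here is a minimal sketch using dataclasses. The field types, the space-separated concatenation, and looking a book up by its title are assumptions; the actual classes in core/ may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    title: str
    description: str
    authors: list[str] = field(default_factory=list)
    categories: list[str] = field(default_factory=list)

    def get_combined_data(self) -> str:
        """Join all text fields into one string for vectorization."""
        parts = [self.title, self.description, *self.authors, *self.categories]
        return " ".join(p for p in parts if p)


@dataclass
class Library:
    books: list[Book] = field(default_factory=list)

    def get_combined_data(self) -> list[str]:
        """Return one combined string per book, in library order."""
        return [book.get_combined_data() for book in self.books]

    def get_book_idx(self, title: str) -> int:
        """Return the position of the first book with this title (assumed lookup key)."""
        for idx, book in enumerate(self.books):
            if book.title == title:
                return idx
        raise ValueError(f"Book not found: {title!r}")
```

Keeping the combined string on `Book` means a single new book can be vectorized the same way as the whole library.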
`Vectorizer` class:
- Attributes:
  - `tfidf_matrix`: Sparse matrix of TF-IDF vectors.
  - `sbert_embeddings`: Sentence-BERT embeddings for each book.
- Methods:
  - `tfidf_vectorize()`: Vectorizes a book using TF-IDF.
  - `sbert_vectorize()`: Vectorizes a book or library using SBERT.
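The two vectorization paths described above can be sketched with scikit-learn and sentence_transformers. The function signatures, the `stop_words` setting, and the model name `all-MiniLM-L6-v2` are assumptions made for illustration; the repository may use different settings. The SBERT import is deferred so that the TF-IDF part runs without downloading a model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_vectorize(texts):
    """Fit TF-IDF on the library texts and return the fitted vectorizer and sparse matrix."""
    vectorizer = TfidfVectorizer(stop_words="english")  # assumed setting
    tfidf_matrix = vectorizer.fit_transform(texts)
    return vectorizer, tfidf_matrix


def sbert_vectorize(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts with a Sentence-BERT model (model name is an assumption)."""
    # Imported lazily: loading the model downloads weights on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts)
```

A single new book can be mapped into the same TF-IDF space by calling `vectorizer.transform([text])` on the fitted vectorizer rather than refitting.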
`DimensionalityReducer` class:
- Attributes:
  - `reduced_matrix`: Dimensionality-reduced version of the TF-IDF matrix.
- Methods:
  - `reduce()`: Reduces a TF-IDF vector to the lower dimension.
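The reduction step described above can be illustrated with scikit-learn's `TruncatedSVD`, which accepts sparse TF-IDF input directly. The toy texts and the choice of two components are assumptions made to keep the example small; a real library would use far more components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the combined book texts.
texts = [
    "desert planet spice empire",
    "desert planet emperor prophecy",
    "english village comedy manners",
    "english estate marriage manners",
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(texts)

# Fit the SVD on the whole library and keep the reduced matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced_matrix = svd.fit_transform(tfidf_matrix)

# Reduce a single new TF-IDF vector with the already fitted SVD.
query_tfidf = tfidf.transform(["desert planet adventure"])
query_reduced = svd.transform(query_tfidf)
```

The same fitted `svd` object must be reused for queries so that library vectors and query vectors live in the same reduced space.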
`Ann` class:
- Attributes:
  - `ann_indices`: ANN model for similarity search.
- Methods:
  - `get_nearest_neighbors_by_index()`, `get_nearest_neighbors_by_vector()`: Retrieve nearest neighbors by item index or vector.
`Recommender` class:
- Attributes:
  - `lib`: Library of books.
  - `vectorizer`: Vectorizer instance.
  - `reducer`: Dimensionality reducer instance.
  - `ann`: ANN instance.
- Methods:
  - `recommend()`: Provides top recommendations based on TF-IDF, SBERT, and ANN.
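To make the TF-IDF branch of the recommendation step concrete, here is a minimal sketch that ranks books by cosine similarity to the query book. It is a standalone function with a hypothetical signature, not the project's `recommend()` method, and it covers only the TF-IDF method; the SBERT branch would rank embeddings the same way, and the ANN branch would query the index instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(titles, texts, query_title, top_n=5):
    """Return the titles of the top_n books most similar to query_title (TF-IDF only)."""
    matrix = TfidfVectorizer().fit_transform(texts)
    idx = titles.index(query_title)

    # Cosine similarity between the query book and every book in the library.
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    scores[idx] = -1.0  # exclude the query book itself

    best = np.argsort(scores)[::-1][:top_n]
    return [titles[i] for i in best]
```

For example, with a three-book library where two texts share most of their words, asking for one recommendation for either of those books returns the other.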