
# Reginx

Reginx is short for "recommendation engine X". I plan to build most parts of a modern recommendation engine from scratch.
The initial plan includes:

  1. Popular machine learning models such as CF, FM, XGBoost, TwoTower, W&D, DeepFM, DCN, MaskNet, SASRec, BERT4Rec, Transformer, etc.
  2. An online inference service written in Go, including the candidate generator, ranking, and re-ranking layers
  3. Feature engineering and preprocessing, covering both the online and offline parts
  4. Diversity approaches such as MMR and DPP (see the sketch after this list)
  5. Deduplication approaches such as LSH and Bloom filters
  6. A training data pipeline
  7. Model registry, monitoring, and versioning
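
As a concrete illustration of item 4, here is a minimal NumPy sketch of MMR (maximal marginal relevance) re-ranking. The function and argument names are illustrative, not identifiers from this repo:

```python
import numpy as np

def mmr(scores, embeddings, k, lam=0.7):
    """Maximal Marginal Relevance re-ranking (illustrative sketch).

    scores:      (n,) relevance scores from the ranking layer
    embeddings:  (n, d) item embeddings, assumed L2-normalized
    k:           number of items to keep
    lam:         trade-off between relevance and diversity
    """
    sim = embeddings @ embeddings.T  # cosine similarity matrix
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        if not selected:
            best = max(candidates, key=lambda i: scores[i])
        else:
            # relevance minus redundancy w.r.t. already-selected items
            best = max(
                candidates,
                key=lambda i: lam * scores[i]
                - (1 - lam) * max(sim[i][j] for j in selected),
            )
        selected.append(best)
        candidates.remove(best)
    return selected
```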

## Supported models

TensorFlow 2 and Google Cloud are used for model training and performance tracking. The conda environment config is here.
I have a personal blog on Substack explaining the models, and the corresponding links are in the table below.

| Model | Paper | Code | Blog |
| --- | --- | --- | --- |
| Factorization Machines | Factorization Machines | Code | Post |
| DeepFM | DeepFM: A Factorization-Machine based Neural Network for CTR Prediction | Code | Post |
| XDeepFM | xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems | Code | Post |
| AutoInt | AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks | Code | Post |
| DCN | Deep & Cross Network for Ad Click Predictions | Code | Post |
| DCN V2 | DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems | Code | Post |
| DLRM | Deep Learning Recommendation Model for Personalization and Recommendation Systems | Code | Post |
| FinalMLP | FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction | DualMLP, FinalMLP | Post |
| MaskNet | MaskNet: Introducing Feature-Wise Multiplication to CTR Ranking Models by Instance-Guided Mask | Code | Post |
| TwoTower | Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations; Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations | Code | Post1, Post2, Post3 |
| Wide and Deep | Wide & Deep Learning for Recommender Systems | Code | Post |
| Transformer | Attention Is All You Need | Code | Post |
| BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Code | Post |
| SASRec | Self-Attentive Sequential Recommendation | Code | Post |
| BERT4Rec | BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer | Code | Post |
| ESMM | Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate | Code | Post |
| MMoE | Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts | Code | Post |
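
Several of the models above (FM, DeepFM, xDeepFM, DLRM) build on the factorization-machine pairwise-interaction term. As a quick reference, here is the standard O(kn) formulation in TensorFlow; the helper below is a sketch, not code from this repo:

```python
import tensorflow as tf

def fm_pairwise_interaction(emb):
    """Second-order FM term from field embeddings (illustrative sketch).

    emb: (batch, num_fields, k) embedding vectors for the active features.
    Uses the identity sum_{i<j} <v_i, v_j> =
        0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2), computed per latent dim.
    """
    square_of_sum = tf.square(tf.reduce_sum(emb, axis=1))   # (batch, k)
    sum_of_square = tf.reduce_sum(tf.square(emb), axis=1)   # (batch, k)
    return 0.5 * tf.reduce_sum(
        square_of_sum - sum_of_square, axis=1, keepdims=True
    )  # (batch, 1)
```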

## Local Training

Here is an example of training a two-tower model on a local machine.

### Setup Conda

Set up your conda environment using the conda config here:

```bash
conda env create -f environment.yml
conda activate tf
```

Set your PYTHONPATH to the root folder of this project, or add it to your .bashrc:

```bash
export PYTHONPATH=/your_project_folder/reginx
```

### Prepare MovieLens Training Data

You can run this script to generate the meta and training data in your local directory. By default, it uses the MovieLens 1M dataset from TensorFlow Datasets.
Then copy the dataset files to your local /tmp/train, /tmp/test, and /tmp/item folders. Note that the TwoTower model implementation requires three kinds of files: train files for training, test files for evaluation, and item files for mixing in global negative samples.
If you want to use a dataset other than MovieLens, prepare it yourself and save it to your local directory.
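
For orientation, here is a hedged sketch of what such a preparation step can look like with TensorFlow Datasets; the dataset keys, split ratio, and save format are assumptions and may differ from the actual script:

```python
import tensorflow_datasets as tfds

# Ratings become the (user, movie) training interactions;
# the movies split supplies candidates for mixed negative sampling.
ratings = tfds.load("movielens/1m-ratings", split="train")
movies = tfds.load("movielens/1m-movies", split="train")

# Assumed 90/10 train/test split; the real script may shuffle or
# extract features before writing.
n = ratings.cardinality().numpy()
train = ratings.take(int(n * 0.9))
test = ratings.skip(int(n * 0.9))

# tf.data.Dataset.save is available in TF >= 2.10
train.save("/tmp/train")
test.save("/tmp/test")
movies.save("/tmp/item")
```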

### Check Config File

There is an example config file for candidate-retriever training.
If you want to use a dataset other than MovieLens, you will need to provide your own query and candidate embedding classes.

```yaml
model:
  temperature: 0.05
  # specify training model under models folder
  base_model: TwoTower
  # specify query embedding model under models/features folder
  query_emb: MovieLensQueryEmb
  # specify candidate embedding model under models/features folder
  candidate_emb: MovieLensCandidateEmb
  # specify the unique key for candidates
  item_id_key: movie_id

train:
  # specify task under tasks folder
  task_name: CandidateRetrieverTrain
  epochs: 1
  batch_size: 256
  mixed_negative_batch_size: 128
  learning_rate: 0.05
  train_data: movielens/data/ratings_train
  test_data: movielens/data/ratings_test
  candidate_data: movielens/data/movies
  meta_data: trainer/meta/movie_lens.json
  model_dir: trainer/saved_models/movielens_cr
  log_dir: logs
```
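
The name-based fields (base_model, query_emb, candidate_emb) suggest classes resolved by name at runtime. The snippet below is a hypothetical sketch of that pattern; the module paths and the config file path are assumptions, not necessarily the repo's actual loader:

```python
import importlib
import yaml  # PyYAML

# Hypothetical config location; adjust to wherever the configs live.
with open("trainer/configs/movielens_candidate_retriever.yml") as f:
    cfg = yaml.safe_load(f)

def resolve(module_path, class_name):
    """Look up a class by name, e.g. 'TwoTower' under an assumed models package."""
    return getattr(importlib.import_module(module_path), class_name)

base_model_cls = resolve("models", cfg["model"]["base_model"])
query_emb_cls = resolve("models.features", cfg["model"]["query_emb"])
candidate_emb_cls = resolve("models.features", cfg["model"]["candidate_emb"])
```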

### Training

Simply run the script below in your activated conda environment, specifying the config file:

```bash
python trainer/local_train.py -c movielens_candidate_retriever
```

By default, training metrics are reported once every 1000 training steps to keep training fast. You can change this by tuning the steps_per_execution argument when compiling the model.
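
For reference, steps_per_execution is a standard Keras compile argument; a minimal self-contained example (the toy model and optimizer choice are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.05),
    loss="mse",
    # execute 1000 training steps per call into the compiled graph;
    # metrics are then only reported at that granularity
    steps_per_execution=1000,
)
```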
After training, evaluation runs on the test dataset, and you should see metrics like:

```
391/391 [==============================] - 50s 129ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0036 - factorized_top_k/top_5_categorical_accuracy: 0.0181 - factorized_top_k/top_10_categorical_accuracy: 0.0349 - factorized_top_k/top_50_categorical_accuracy: 0.1428 - factorized_top_k/top_100_categorical_accuracy: 0.2409 - loss: 1406.8086 - regularization_loss: 7.9244 - total_loss: 1414.7329
```
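
These factorized_top_k/* metrics are the ones produced by TensorFlow Recommenders. After training, a brute-force index over the candidate tower is a common way to sanity-check retrieval; in the sketch below, `model`, `candidates`, and `query_features` are placeholders, and the `query_model` / `candidate_model` attribute names are assumptions about the trained object:

```python
import tensorflow_recommenders as tfrs

# model:      the trained two-tower model (placeholder)
# candidates: the item dataset, e.g. the one saved to /tmp/item (placeholder)
# Build an exact (brute-force) nearest-neighbor index over candidate embeddings.
index = tfrs.layers.factorized_top_k.BruteForce(model.query_model)
index.index_from_dataset(
    candidates.batch(128).map(lambda x: (x["movie_id"], model.candidate_model(x)))
)

# query_features: a batch of raw query features fed to the query tower (placeholder)
scores, ids = index(query_features, k=100)
```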
