RecLearn

简体中文 | English

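RecLearn (Recommender Learning) summarizes and organizes the contents of the master branch of Recommender System with TF2.0. It is a recommendation learning framework built on Python and TensorFlow 2.x, intended for students and beginners. If you are more used to the master branch and want to modify or update its contents, you can clone the whole package and use it directly. The implemented recommendation algorithms are grouped by the two application stages used in industry: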

  • matching recommendation stage
  • ranking recommendation stage

Update

23/04/2022: updated all matching models.

Installation

Package

RecLearn has been uploaded to PyPI and can be installed with pip:

pip install reclearn

Dependencies:

  • python3.8+
  • Tensorflow2.5-GPU+/Tensorflow2.5-CPU+
  • sklearn0.23+

Local

You can also clone Reclearn to your local machine:

git clone -b reclearn git@github.com:ZiyaoGeng/RecLearn.git

Quick Start

The example directory provides a demo for each recommendation model.

Matching

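1. Split the dataset

Specify the path of the dataset:

file_path = 'data/ml-1m/ratings.dat'

Split the dataset into training, validation, and test sets. If you use the movielens-1m, Amazon-Beauty, Amazon-Games, or STEAM dataset, you can call the methods in data/datasets/* of Reclearn directly to do the split: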

train_path, val_path, test_path, meta_path = ml.split_seq_data(file_path=file_path)

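Here meta_path is the path of the meta file, which stores the maximum user index and the maximum item index.

2. Load the data

Read the training, validation, and test sets, and generate several negative samples for each positive sample by random sampling. The data is a dictionary: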

data = {'pos_item':, 'neg_item': , ['user': , 'click_seq': ,...]}
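To make the dictionary format concrete, here is a minimal sketch of random negative sampling. It is not Reclearn's loader: the function name sample_negatives, the interacted argument, and the convention that item ids run from 1 to max_item_num (with 0 reserved for padding) are assumptions made for illustration.

```python
import numpy as np


def sample_negatives(pos_items, interacted, max_item_num, neg_num, seed=0):
    """Draw `neg_num` random negative items for every positive item.

    pos_items:  list of positive item ids, one per sample.
    interacted: list of sets; interacted[i] holds the items already seen by
                the user of sample i (assumed not to cover every item).
    Item ids are assumed to lie in [1, max_item_num].
    """
    rng = np.random.default_rng(seed)
    neg_items = np.zeros((len(pos_items), neg_num), dtype=np.int64)
    for i, seen in enumerate(interacted):
        for j in range(neg_num):
            candidate = int(rng.integers(1, max_item_num + 1))
            # Resample until the candidate is an item the user has not seen.
            while candidate in seen:
                candidate = int(rng.integers(1, max_item_num + 1))
            neg_items[i, j] = candidate
    return {
        "pos_item": np.asarray(pos_items, dtype=np.int64),
        "neg_item": neg_items,
    }


data = sample_negatives(
    pos_items=[3, 7], interacted=[{1, 3}, {7}], max_item_num=10, neg_num=4
)
print(data["pos_item"].shape, data["neg_item"].shape)
```

Each row of neg_item lines up with the positive item at the same position, which is the shape a pairwise loss needs.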

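If you are building a sequential recommendation model, you also need to include the click sequence. For the four datasets above, Reclearn provides methods for loading the data: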

# general recommendation model
train_data = ml.load_data(train_path, neg_num, max_item_num)
# sequence recommendation model, and use the user feature.
train_data = ml.load_seq_data(train_path, "train", seq_len, neg_num, max_item_num, contain_user=True)
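Sequence models usually need every click sequence to have the same length seq_len. The sketch below shows one common way to do that: keep the most recent clicks and left-pad short histories with 0. The helper name pad_click_seq and the left-padding convention are assumptions; load_seq_data may handle this differently.

```python
def pad_click_seq(click_seq, seq_len, pad_value=0):
    """Keep the most recent `seq_len` clicks and left-pad shorter histories."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    recent = list(click_seq)[-seq_len:]
    return [pad_value] * (seq_len - len(recent)) + recent


short_history = pad_click_seq([5, 9], seq_len=4)             # [0, 0, 5, 9]
long_history = pad_click_seq([1, 2, 3, 4, 5, 6], seq_len=4)  # [3, 4, 5, 6]
print(short_history, long_history)
```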

3. Set the hyperparameters

Each model needs its hyperparameters specified. Taking the BPR model as an example:

model_params = {
    'user_num': max_user_num + 1,
    'item_num': max_item_num + 1,
    'embed_dim': FLAGS.embed_dim,
    'use_l2norm': FLAGS.use_l2norm,
    'embed_reg': FLAGS.embed_reg
}

4. Build and compile the model

Select or build the model you need and compile it. Taking BPR as an example:

model = BPR(**model_params)
model.compile(optimizer=Adam(learning_rate=FLAGS.learning_rate))

If you have questions about the structure of the model, call the summary method after compiling to print it:

model.summary()
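For intuition about what a BPR model optimizes, the following NumPy sketch computes the pairwise loss from the BPR paper, -log(sigmoid(pos_score - neg_score)), with dot-product scores. It illustrates the objective only; Reclearn's BPR class may differ in details such as the embed_reg regularization or the use_l2norm normalization, and the function name bpr_loss is hypothetical.

```python
import numpy as np


def bpr_loss(user_emb, pos_emb, neg_emb):
    """Mean BPR loss over a batch.

    user_emb: (batch, dim) user embeddings.
    pos_emb:  (batch, dim) embeddings of the positive items.
    neg_emb:  (batch, neg_num, dim) embeddings of the sampled negatives.
    """
    pos_score = np.sum(user_emb * pos_emb, axis=-1, keepdims=True)  # (batch, 1)
    neg_score = np.einsum("bd,bnd->bn", user_emb, neg_emb)          # (batch, neg_num)
    diff = pos_score - neg_score
    # -log(sigmoid(x)) == log(1 + exp(-x)); logaddexp keeps this numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff)))


rng = np.random.default_rng(0)
user = rng.normal(size=(2, 8))
neg = rng.normal(size=(2, 4, 8))
# A positive item aligned with the user scores higher than one opposed to it.
aligned_loss = bpr_loss(user, user, neg)
opposed_loss = bpr_loss(user, -user, neg)
print(aligned_loss < opposed_loss)
```

The loss shrinks as the positive item's score moves above the negatives' scores, which is the ranking behaviour the model is trained towards.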

5. Train and predict

from time import time

for epoch in range(1, epochs + 1):
    t1 = time()
    model.fit(
        x=train_data,
        epochs=1,
        validation_data=val_data,
        batch_size=batch_size
    )
    t2 = time()
    eval_dict = eval_pos_neg(model, test_data, ['hr', 'mrr', 'ndcg'], k, batch_size)
    print('Iteration %d Fit [%.1f s], Evaluate [%.1f s]: HR = %.4f, MRR = %.4f, NDCG = %.4f'
          % (epoch, t2 - t1, time() - t2, eval_dict['hr'], eval_dict['mrr'], eval_dict['ndcg']))
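The loop above reports HR, MRR, and NDCG at a cutoff k. As a rough guide to how these numbers are usually defined when one positive item is ranked against its sampled negatives, here is a small sketch. The internals of eval_pos_neg are not shown in this README, so the function rank_metrics below is an assumption based on the standard definitions, not Reclearn's implementation.

```python
import numpy as np


def rank_metrics(pos_score, neg_scores, k):
    """HR@k, MRR@k and NDCG@k for one positive item among sampled negatives."""
    # 0-based rank of the positive item: how many negatives score strictly higher.
    rank = int(np.sum(np.asarray(neg_scores) > pos_score))
    if rank >= k:
        return {"hr": 0.0, "mrr": 0.0, "ndcg": 0.0}
    return {
        "hr": 1.0,
        "mrr": 1.0 / (rank + 1),
        "ndcg": float(1.0 / np.log2(rank + 2)),
    }


# One negative (0.9) outranks the positive (0.8), so the positive is in 2nd place.
metrics = rank_metrics(pos_score=0.8, neg_scores=[0.9, 0.5, 0.3], k=10)
print(metrics)  # hr = 1.0, mrr = 0.5, ndcg = 1 / log2(3)
```

Averaging these per-sample values over the test set gives the dataset-level metrics.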

Ranking

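For the Criteo dataset, two data-processing approaches are used: train the model on a portion of the data, or split the dataset so that all of the data can be used for training. For the first approach, see example/train_small_criteo_demo.py. For the second, see example/r_deepfm_demo.py; the steps are as follows:

1. Split the dataset

Call reclearn.data.datasets.criteo.get_split_file_path(parent_path, dataset_path, sample_num) to split the dataset. sample_num sets the number of samples in each subset, and all subsets are saved under the path of the dataset. If the dataset was split earlier and the location of the subsets has not changed, the subsets can be read directly; alternatively, pass parent_path.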

sample_num = 4600000
split_file_list = get_split_file_path(dataset_path=file, sample_num=sample_num)
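The idea behind splitting can be sketched in plain Python: stream the source file once and start a new output file every sample_num lines. The function split_file, the part_N.txt naming, and the toy input below are hypothetical; get_split_file_path may name and place its files differently.

```python
import os
import tempfile


def split_file(dataset_path, out_dir, sample_num):
    """Write consecutive chunks of at most `sample_num` lines; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths, out = [], None
    with open(dataset_path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % sample_num == 0:
                # Close the previous chunk and open the next one.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part_%d.txt" % (i // sample_num))
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return paths


# Toy run: 10 rows split into chunks of 4 rows.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "toy.txt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.writelines("row %d\n" % i for i in range(10))
    chunk_paths = split_file(src_path, os.path.join(tmp, "split"), sample_num=4)
    chunk_sizes = []
    for p in chunk_paths:
        with open(p, encoding="utf-8") as f:
            chunk_sizes.append(sum(1 for _ in f))

print(chunk_sizes)  # [4, 4, 2]
```

Streaming line by line keeps memory use flat, which matters for a dataset too large to load at once.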

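2. Build the feature map

After splitting, map all features across the entire dataset (a static Embedding layer needs a fixed size), and bucket the dense features to turn them into discrete features. Call get_fea_map(fea_map_path, split_file_list); the resulting map is saved as fea_map.pkl. If this step has already been completed, pass the fea_map_path parameter instead.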

# If you want to make feature map.
fea_map = get_fea_map(split_file_list=split_file_list)
# Or if you want to load feature map.
# fea_map = get_fea_map(fea_map_path='data/criteo/split/fea_map.pkl')
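A feature map of this kind can be pictured as one global index space: every distinct sparse value gets an id, and every bucket of a dense feature gets an id. The sketch below uses quantile bucketing purely as an assumption; the README does not say which bucketing rule get_fea_map applies, and the names build_fea_map and encode_dense are hypothetical.

```python
import numpy as np


def build_fea_map(sparse_columns, dense_columns, n_bins=4):
    """Assign one global index to each sparse value and each dense-feature bucket."""
    fea_map, next_index = {}, 0
    for name, values in sparse_columns.items():
        for value in sorted(set(values)):
            fea_map[(name, value)] = next_index
            next_index += 1
    bin_edges = {}
    for name, values in dense_columns.items():
        # Interior quantile edges: n_bins buckets need n_bins - 1 edges.
        bin_edges[name] = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
        for bucket in range(n_bins):
            fea_map[(name, bucket)] = next_index
            next_index += 1
    return fea_map, bin_edges, next_index


def encode_dense(name, value, fea_map, bin_edges):
    """Map a dense value to the global index of the bucket it falls into."""
    bucket = int(np.searchsorted(bin_edges[name], value, side="right"))
    return fea_map[(name, bucket)]


fea_map, bin_edges, total = build_fea_map(
    sparse_columns={"C1": ["a", "b", "a"]},
    dense_columns={"I1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]},
)
print(total)  # 6: two sparse values plus four buckets
print(encode_dense("I1", 0.5, fea_map, bin_edges))
print(encode_dense("I1", 100.0, fea_map, bin_edges))
```

Because the map is built over all split files before training, the total index count is known up front, which is what a fixed-size Embedding layer requires.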

3. Load the test set

Use the last subset as the test set.

feature_columns, test_data = create_criteo_dataset(split_file_list[-1], fea_map)

4. Build the model

model = FM(feature_columns=feature_columns, **model_params)
model.summary()
model.compile(loss=binary_crossentropy, optimizer=Adam(learning_rate=learning_rate),
              metrics=[AUC()])

5. Train iteratively and validate

for file in split_file_list[:-1]:
    print("load %s" % file)
    _, train_data = create_criteo_dataset(file, fea_map)
    # TODO: Fit
    model.fit(
        x=train_data[0],
        y=train_data[1],
        epochs=1,
        batch_size=batch_size,
        validation_split=0.1
    )
    # TODO: Test
    print('test AUC: %f' % model.evaluate(x=test_data[0], y=test_data[1], batch_size=batch_size)[1])
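The ranking results are reported as Log Loss and AUC. For readers who want to check such numbers by hand on a small sample, the sketch below computes both exactly with NumPy. The function names are hypothetical, and the pairwise AUC uses memory proportional to positives times negatives, so it is only meant for small arrays. A thresholded AUC metric such as the Keras one approximates the same quantity, so values can differ slightly.

```python
import numpy as np


def log_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


def auc(y_true, y_prob):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every (positive, negative) pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
# Three of the four (positive, negative) pairs are ordered correctly.
toy_auc = auc(y_true, y_prob)
toy_log_loss = log_loss(y_true, y_prob)
print(toy_auc, toy_log_loss)
```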

Results

The experimental environment designed by Reclearn differs from that of some papers, so the results may deviate somewhat. See experiment for details.

Matching

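| Model | ml-1m HR@10 | ml-1m MRR@10 | ml-1m NDCG@10 | Beauty HR@10 | Beauty MRR@10 | Beauty NDCG@10 | STEAM HR@10 | STEAM MRR@10 | STEAM NDCG@10 |
|---|---|---|---|---|---|---|---|---|---|
| BPR | 0.5768 | 0.2392 | 0.3016 | 0.3708 | 0.2108 | 0.2485 | 0.7728 | 0.4220 | 0.5054 |
| NCF | 0.5834 | 0.2219 | 0.3060 | 0.5448 | 0.2831 | 0.3451 | 0.7768 | 0.4273 | 0.5103 |
| DSSM | 0.5498 | 0.2148 | 0.2929 | - | - | - | - | - | - |
| YoutubeDNN | 0.6737 | 0.3414 | 0.4201 | - | - | - | - | - | - |
| GRU4Rec | 0.7969 | 0.4698 | 0.5483 | 0.5211 | 0.2724 | 0.3312 | 0.8501 | 0.5486 | 0.6209 |
| Caser | 0.7916 | 0.4450 | 0.5280 | 0.5487 | 0.2884 | 0.3501 | 0.8275 | 0.5064 | 0.5832 |
| SASRec | 0.8103 | 0.4812 | 0.5605 | 0.5230 | 0.2781 | 0.3355 | 0.8606 | 0.5669 | 0.6374 |
| AttRec | 0.7873 | 0.4578 | 0.5363 | 0.4995 | 0.2695 | 0.3229 | - | - | - |
| FISSA | 0.8106 | 0.4953 | 0.5713 | 0.5431 | 0.2851 | 0.3462 | 0.8635 | 0.5682 | 0.6391 |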

Ranking

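| Model | Criteo (5M samples) Log Loss | Criteo (5M samples) AUC | Criteo Log Loss | Criteo AUC |
|---|---|---|---|---|
| FM | 0.4765 | 0.7783 | 0.4762 | 0.7875 |
| FFM | - | - | - | - |
| WDL | 0.4684 | 0.7822 | 0.4692 | 0.7930 |
| Deep Crossing | 0.4670 | 0.7826 | 0.4693 | 0.7935 |
| PNN | - | 0.7847 | - | - |
| DCN | - | 0.7823 | 0.4691 | 0.7929 |
| NFM | 0.4773 | 0.7762 | 0.4723 | 0.7889 |
| AFM | 0.4819 | 0.7808 | 0.4692 | 0.7871 |
| DeepFM | - | 0.7828 | 0.4650 | 0.8007 |
| xDeepFM | 0.4690 | 0.7839 | 0.4696 | 0.7919 |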

Replicated Papers

Matching models (Top-K recommendation)

| Paper | Model | Published | Author |
|---|---|---|---|
| BPR: Bayesian Personalized Ranking from Implicit Feedback | MF-BPR | UAI, 2009 | Steffen Rendle |
| Neural Collaborative Filtering | NCF | WWW, 2017 | Xiangnan He |
| Learning Deep Structured Semantic Models for Web Search using Clickthrough Data | DSSM | CIKM, 2013 | Po-Sen Huang |
| Deep Neural Networks for YouTube Recommendations | YoutubeDNN | RecSys, 2016 | Paul Covington |
| Session-based Recommendations with Recurrent Neural Networks | GRU4Rec | ICLR, 2016 | Balázs Hidasi |
| Self-Attentive Sequential Recommendation | SASRec | ICDM, 2018 | UCSD |
| Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding | Caser | WSDM, 2018 | Jiaxi Tang |
| Next Item Recommendation with Self-Attentive Metric Learning | AttRec | AAAI, 2019 | Shuai Zhang |
| FISSA: Fusing Item Similarity Models with Self-Attention Networks for Sequential Recommendation | FISSA | RecSys, 2020 | Jing Lin |

Ranking models (CTR prediction)

| Paper | Model | Published | Author |
|---|---|---|---|
| Factorization Machines | FM | ICDM, 2010 | Steffen Rendle |
| Field-aware Factorization Machines for CTR Prediction | FFM | RecSys, 2016 | Criteo Research |
| Wide & Deep Learning for Recommender Systems | WDL | DLRS, 2016 | Google Inc. |
| Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features | Deep Crossing | KDD, 2016 | Microsoft Research |
| Product-based Neural Networks for User Response Prediction | PNN | ICDM, 2016 | Shanghai Jiao Tong University |
| Deep & Cross Network for Ad Click Predictions | DCN | ADKDD, 2017 | Stanford University, Google Inc. |
| Neural Factorization Machines for Sparse Predictive Analytics | NFM | SIGIR, 2017 | Xiangnan He |
| Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks | AFM | IJCAI, 2017 | Zhejiang University, National University of Singapore |
| DeepFM: A Factorization-Machine based Neural Network for CTR Prediction | DeepFM | IJCAI, 2017 | Harbin Institute of Technology, Noah’s Ark Research Lab, Huawei |
| xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems | xDeepFM | KDD, 2018 | University of Science and Technology of China |
| Deep Interest Network for Click-Through Rate Prediction | DIN | KDD, 2018 | Alibaba Group |

Discussion

If you have any suggestions or questions about the project, leave a message in Issues.