Personalized Category Suggestions for eCommerce Type-Ahead
This repo contains working code from our blog post Building personalized category suggestions with Ludwig. By leveraging Ludwig's capabilities, we implement an encoder-decoder architecture that provides personalized and dynamic category suggestions to augment a type-ahead API.
A typical type-ahead experience looks like this:
What we are trying to build is a smarter system, one that suggests different categories depending on contextual factors as well (e.g. the products the user has interacted with):
The blog post and code are inspired by our research paper presented at ACL 2020: How to Grow a (Product) Tree.
The code has been written for Python 3.7: the provided requirements.txt can be used with a virtualenv to run the project in a separate virtual environment. Credentials and global parameters can be set with the standard .env file (.env.local is provided as a template), and they are available in the pipeline script through dotenv.
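As a minimal sketch of this setup (assuming python-dotenv, with a hypothetical variable name), loading the .env file in a script could look like this:

```python
# Minimal sketch of loading credentials from a .env file with python-dotenv.
# The variable name below (MY_WAREHOUSE_USER) is a hypothetical placeholder.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root into the environment

# after loading, parameters are available through standard os.environ lookups
warehouse_user = os.environ.get("MY_WAREHOUSE_USER")
```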
We provide two main scripts to test out our models for category prediction in type-ahead:
- a simplified but realistic end-to-end "stateless" pipeline, creating all input features and a Ludwig-friendly dataset from scratch from raw data;
- a stand-alone folder with a minimal Ludwig script, in case you already have embeddings and data rows ready for the model.
By running model_pipeline.py, a local Luigi pipeline executes a DAG comprising four tasks:
- prod2vec training: product embeddings are trained from browsing data and stored locally as text in the Glove format;
- dataset preparation: extract data from search logs and prepare a csv with three columns, "query" (the input query), "skus_in_session" (product identifiers for in-session interactions: view, add, etc.), "path" (the target taxonomy path). "skus_in_session" and "path" are sequences, so they are saved as tokens separated by a space;
- Ludwig training: define the deep learning model and feed it to Ludwig for training and local persistence (a minimal config sketch is shown after this list);
- Ludwig testing: load the model from storage, test it on held-out data and print out summary statistics.
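To make the Ludwig training step more concrete, a model definition along these lines could be fed to Ludwig. This is only a sketch: the feature names match the dataset columns above, but the encoder/decoder choices, the embeddings path and the exact keys are assumptions that may differ from the repo and across Ludwig versions.

```python
# Sketch of a possible Ludwig model definition for the encoder-decoder setup
# described above. Encoder/decoder settings, the embeddings path and the exact
# training call are assumptions, not the repo's actual configuration.
from ludwig.api import LudwigModel

model_definition = {
    "input_features": [
        # the typed query is encoded as text
        {"name": "query", "type": "text", "encoder": "rnn"},
        # in-session product identifiers are a space-separated sequence,
        # initialized with the prod2vec embeddings trained in the first task
        {
            "name": "skus_in_session",
            "type": "sequence",
            "encoder": "rnn",
            "pretrained_embeddings": "prod2vec_embeddings.txt",  # hypothetical path
        },
    ],
    "output_features": [
        # the target taxonomy path is decoded one node at a time
        {"name": "path", "type": "sequence", "decoder": "generator"}
    ],
}

model = LudwigModel(model_definition)
# depending on the Ludwig version, training is e.g. model.train(dataset="dataset.csv")
```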
By using Luigi, we wrap this DAG in a convenient flow that saves us time if we need to re-run the pipeline from a particular step, and ensure consistency if we perform a clean run.
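For reference, such a DAG can be expressed with standard Luigi idioms; the sketch below uses hypothetical task names, targets and file paths, and does not mirror model_pipeline.py exactly.

```python
# Hedged sketch of how the four tasks could be chained with Luigi.
# Task and target names are hypothetical; see model_pipeline.py for the real DAG.
import luigi


class TrainProd2Vec(luigi.Task):
    def output(self):
        return luigi.LocalTarget("prod2vec_embeddings.txt")

    def run(self):
        pass  # train product embeddings from browsing data and save them as text


class PrepareDataset(luigi.Task):
    def requires(self):
        return TrainProd2Vec()

    def output(self):
        return luigi.LocalTarget("dataset.csv")

    def run(self):
        pass  # build the query / skus_in_session / path CSV from search logs


class TrainLudwigModel(luigi.Task):
    def requires(self):
        return PrepareDataset()

    def output(self):
        return luigi.LocalTarget("results")

    def run(self):
        pass  # define the Ludwig model and train it on the prepared dataset


class TestLudwigModel(luigi.Task):
    def requires(self):
        return TrainLudwigModel()

    def run(self):
        pass  # load the trained model, score held-out data, print summary stats


if __name__ == "__main__":
    # a local scheduler is enough to run the whole DAG end-to-end; tasks whose
    # outputs already exist are skipped, which is what lets us resume mid-pipeline
    luigi.build([TestLudwigModel()], local_scheduler=True)
```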
Please note that the data retrieval functions in data_service.py and prod2vec_train.py are just stubs: in our original repository they contained our Snowflake-based code to load behavioral and search data from our warehouse. Modify them with your own logic to extract behavioral and search data so that downstream tasks can run seamlessly (we left a small Snowflake client in the repo for convenience).
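If you don't have a warehouse handy, a drop-in replacement for a stub could simply read local files; the function name and return shape below are assumptions, so adapt them to the actual signatures in data_service.py.

```python
# Hypothetical replacement for a data retrieval stub: instead of querying
# Snowflake, read search interactions from a local CSV. The function name and
# the returned row format are assumptions, not the repo's actual interface.
import csv


def load_search_interactions(path="search_logs.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        # each row is expected to contain a query, the products interacted
        # with in the session, and the clicked product's identifier
        return list(csv.DictReader(f))
```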
The folder ludwig_playground contains *.local files that show sample datasets and sample ancillary files.
The folder data contains catalog.csv.local, a sample CSV file representing product information (identifiers, images, taxonomy path): it may be useful to have a product lookup if your search logs (e.g. products clicked after a search) report product identifiers and you need to join products with paths to prepare the final dataset.
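As a sketch of that join (the column names "sku", "path" and "clicked_sku" are assumptions, so check catalog.csv.local and your own logs for the real schema), pandas makes the lookup straightforward:

```python
# Sketch of joining search logs with the catalog to recover taxonomy paths.
# Column names and file paths are assumptions for illustration only.
import pandas as pd

catalog = pd.read_csv("data/catalog.csv.local")  # product identifiers -> taxonomy path
search_logs = pd.read_csv("search_logs.csv")     # hypothetical export of your search logs

# attach the taxonomy path of the clicked product to each search row
dataset = search_logs.merge(
    catalog[["sku", "path"]], left_on="clicked_sku", right_on="sku", how="inner"
)
dataset[["query", "skus_in_session", "path"]].to_csv("dataset.csv", index=False)
```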
If you already have embeddings ready (stored in a tab-separated text file, as in the "Glove format") and a dataset file, you can put them in the ludwig_playground folder and play directly with the Ludwig code with no other dependency: ludwig_playground.py has some global variables you can set to re-run training, or to just run a trained model on new input rows. The *.local files in the folder show the accepted format for a dataset and an embedding file to run the Ludwig code.
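For instance, scoring new rows with an already trained model follows Ludwig's standard load-and-predict pattern; the sketch below assumes Ludwig's Python API with hypothetical paths and column values, and the exact predict arguments vary with the Ludwig version.

```python
# Sketch of running a trained model on new input rows with Ludwig's Python API.
# Paths and column values are hypothetical; exact argument names depend on the
# Ludwig version you installed.
import pandas as pd
from ludwig.api import LudwigModel

model = LudwigModel.load("results/model")  # directory produced by training

new_rows = pd.DataFrame(
    [{"query": "running shoes", "skus_in_session": "sku_123 sku_456"}]
)
predictions = model.predict(new_rows)  # in some versions: model.predict(dataset=new_rows)
print(predictions)
```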
This repo is a joint effort of Jacopo, Bingqing and Marie.
We wish to thank our friend Piero Molino, Ludwig's creator, for showing us how to re-write our model (SessionPath) with Ludwig.
If you find this repo (and the ideas in it) useful for your research, please cite our work:
@inproceedings{tagliabue-etal-2020-grow,
title = "How to Grow a (Product) Tree: Personalized Category Suggestions for e{C}ommerce Type-Ahead",
author = "Tagliabue, Jacopo and
Yu, Bingqing and
Beaulieu, Marie",
booktitle = "Proceedings of The 3rd Workshop on e-Commerce and NLP",
month = jul,
year = "2020",
address = "Seattle, WA, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.ecnlp-1.2",
doi = "10.18653/v1/2020.ecnlp-1.2",
pages = "7--18",
abstract = "In an attempt to balance precision and recall in the search page, leading digital shops have been effectively nudging users into select category facets as early as in the type-ahead suggestions. In this work, we present SessionPath, a novel neural network model that improves facet suggestions on two counts: first, the model is able to leverage session embeddings to provide scalable personalization; second, SessionPath predicts facets by explicitly producing a probability distribution at each node in the taxonomy path. We benchmark SessionPath on two partnering shops against count-based and neural models, and show how business requirements and model behavior can be combined in a principled way.",
}
The arXiv version is available here.
The code in this repo is freely available and provided "as is" as covered by the MIT License.