Leonid Boytsov edited this page Oct 20, 2018 · 24 revisions

Overview

This repository hosts a learning-to-rank retrieval (and re-ranking) pipeline, which is part of the PGraph project, where we study the applicability of k-nearest neighbor (k-NN) search methods in IR and QA applications. This project is supported primarily by NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond".

Currently, this repository hosts:

  1. Software to reproduce our CIKM'16 paper: L. Boytsov, D. Novak, Y. Malkov, E. Nyberg (2016). Off the Beaten Path: Let’s Replace Term-Based Retrieval with k-NN Search. Detailed instructions are provided on this page.

  2. Software used in the dissertation of Leonid Boytsov: "Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching". A summary of this work is given in the following blog post.

Research objectives

The proposed research aims to improve the retrieval component of an extractive QA system, which traditionally relies on a filter-and-refine approach. The cheap filtering step uses full-text retrieval to obtain a list of candidates. The refining step applies a more expensive re-ranking model. Usually, the full-text index includes only lexical entries, which may lead to reduced recall due to vocabulary mismatch.
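The filter-and-refine approach described above can be sketched in a few lines. This is an illustrative toy, not part of the actual pipeline: the corpus, the term-overlap filter, and the Jaccard "re-ranker" are stand-ins for a real full-text retriever and an expensive re-ranking model.

```python
def cheap_score(query, doc):
    # Filtering step: raw term overlap, standing in for full-text retrieval.
    q, d = set(query.split()), set(doc.split())
    return len(q & d)

def expensive_score(query, doc):
    # Refining step: Jaccard similarity, standing in for a costly re-ranker.
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / (len(q | d) or 1)

def filter_and_refine(query, corpus, n_candidates=3, top_k=2):
    # 1. Cheap filtering: keep the n_candidates highest-scoring documents.
    cands = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    cands = cands[:n_candidates]
    # 2. Expensive refinement: re-rank only the surviving candidates.
    return sorted(cands, key=lambda d: expensive_score(query, d), reverse=True)[:top_k]

corpus = ["cats and dogs", "deep learning for ranking",
          "question answering with dogs", "dogs"]
print(filter_and_refine("dogs", corpus))  # -> ['dogs', 'cats and dogs']
```

Note that any document the cheap filter misses (e.g., one that matches only semantically, with no shared terms) can never be recovered by the re-ranker; this is exactly the vocabulary-mismatch recall problem mentioned above.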

The main research hypothesis is that a high-accuracy similarity (k-NN) search powered by a proximity graph can provide a significant improvement over the classic retrieval approach based on inverted indices. In addition, the similarity search can efficiently handle more sophisticated document representations such as vectorial document embeddings, which are not supported by classic inverted indices.

Existing state-of-the-art k-NN search algorithms based on proximity graphs work well in relatively simple domains with a metric distance function (e.g. the Euclidean distance), but they are not sufficiently accurate for complex non-symmetric and non-metric distances. Our central goal is to improve these search methods and adapt them to the NLP domain.
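The core routine behind proximity-graph methods (such as those in NMSLIB) is a greedy walk: start at some node and repeatedly hop to the neighbor closest to the query until no neighbor improves the distance. A minimal sketch, with an illustrative toy graph and the Euclidean distance (exactly the easy metric case noted above; non-metric distances require more care):

```python
import math

# Toy 2-D points and a hand-built proximity graph (edges link near neighbors).
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0),
          3: (1.0, 1.0), 4: (2.0, 2.0)}
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

def greedy_search(query, start=0):
    """Walk the graph greedily toward the query; stop at a local minimum."""
    cur = start
    while True:
        nxt = min(graph[cur], key=lambda n: math.dist(points[n], query))
        if math.dist(points[nxt], query) >= math.dist(points[cur], query):
            return cur  # no neighbor is closer: a (possibly local) optimum
        cur = nxt

print(greedy_search((1.9, 2.1)))  # walks 0 -> 2 -> 3 -> 4
```

The greedy walk can get stuck in a local minimum, so practical systems restart from several entry points or keep a priority queue of candidates; making such searches accurate for non-symmetric and non-metric distances is the hard part targeted here.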

License

The code written by us (both this repository and NMSLIB) is distributed under the Apache 2 license. We also use Apache Lucene, Apache UIMA, Apache Thrift, DKPro Core, and Clear NLP, which carry the same license. The RankLib library uses a BSD license. Finally, we use Stanford NLP as well as a modified version of Manaal Faruqui's retrofitting script. These modules are distributed under GNU licenses.

We also use free software components of unknown license, in particular GIZA++, Sparse Hash, and trec_eval.

Last but not least, a number of auxiliary packages (see the POM file) may have their own licenses that differ from Apache 2.
