
Commit

Update OpenAI-ada2 documentation to explain the jank (#2163)
lintool authored Aug 11, 2023
1 parent a628a3f commit e59f9e7
Showing 1 changed file with 32 additions and 28 deletions.
60 changes: 32 additions & 28 deletions docs/experiments-msmarco-passage-openai-ada2.md
# Anserini: OpenAI-ada2 Embeddings for MS MARCO Passage Ranking

**Model**: OpenAI-ada2 (using pre-encoded queries) with HNSW indexes

This guide explains how to reproduce experiments with OpenAI-ada2 embeddings on the [MS MARCO passage ranking task](https://github.com/microsoft/MSMARCO-Passage-Ranking).
In these experiments, we are using pre-encoded queries (i.e., cached query embeddings).

## Corpus Download

Let's start off by downloading the corpus.
To be clear, the "corpus" here refers to the embedding vectors generated by OpenAI's ada2 embedding endpoint.

Download the tarball containing embedding vectors and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-openai-ada2.tar -P collections/
tar xvf collections/msmarco-passage-openai-ada2.tar -C collections/
```

The tarball is 109 GB and has an MD5 checksum of `a4d843d522ff3a3af7edbee789a63402`.
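Given the size of the download, it's worth verifying it against the published checksum before unpacking. A quick check with `md5sum` (assuming a GNU/Linux environment; on macOS, `md5 -q` plays the same role):

```shell
# Exits non-zero if the tarball's MD5 doesn't match the published value.
echo "a4d843d522ff3a3af7edbee789a63402  collections/msmarco-passage-openai-ada2.tar" | md5sum -c -
```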

## Indexing

Indexing is a bit tricky because the HNSW implementation in Lucene restricts vectors to 1024 dimensions, which is not sufficient for OpenAI's 1536-dimensional embeddings.
This issue is described [here](https://github.com/apache/lucene/issues/11507).
The resolution is to make vector dimensions configurable on a per `Codec` basis, as in [this patch](https://github.com/apache/lucene/pull/12436) in Lucene.
However, as of early August 2023, there is no public release of Lucene that has these features folded in.
Thus, no public release of Lucene can directly index OpenAI's ada2 embedding vectors.

However, we were able to hack around this limitation in [this pull request](https://github.com/castorini/anserini/pull/2161).
Our workaround is incredibly janky, which is why we're leaving it on a branch and _not_ merging it into trunk.
The sketch of the solution is as follows: we copy the relevant Lucene source files directly into our source tree; when we build the fatjar, the class files of our "local versions" take precedence and hence override the vector size limitations.
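The branch-and-build step can be sketched as follows. The local branch name is arbitrary; we use GitHub's `pull/<N>/head` ref to fetch the PR branch, and we assume the standard Maven build produces the fatjar:

```shell
# Fetch the workaround branch from PR #2161 and build the fatjar.
git fetch origin pull/2161/head:openai-ada2-workaround   # local branch name is arbitrary
git checkout openai-ada2-workaround
mvn clean package -Dmaven.test.skip=true                 # fatjar lands under target/
```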

So, to get the indexing working, we'll need to pull the above branch, build, and index with the following command:

```bash
java -cp target/anserini-0.21.1-SNAPSHOT-fatjar.jar io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/msmarco-passage-openai-ada2 \
-index indexes/lucene-hnsw.msmarco-passage-openai-ada2/ \
-generator LuceneDenseVectorDocumentGenerator \
-threads 16 -M 16 -efC 100 \
>& logs/log.msmarco-passage-openai-ada2 &
```

Note that we're _not_ using `target/appassembler/bin/IndexHnswDenseVectors`.
Instead, we directly rely on the fatjar.

The indexing job takes around three hours on our `orca` server.
Upon completion, we should have an index with 8,841,823 documents.

<!-- For additional details, see explanation of [common indexing options](common-indexing-options.md). -->

## Retrieval

Other than the indexing trick, retrieval and evaluation are straightforward.

Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
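If you cloned Anserini without its submodules, the `tools/` directory may be empty; initializing it is a standard one-liner:

```shell
# Pull in the tools/ submodule (topics, qrels, trec_eval) if it's missing.
git submodule update --init --recursive
```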

After indexing has completed, you should be able to perform retrieval as follows using HNSW indexes, replacing `{SETTING}` with the desired setting out of [`msmarco-passage.dev-subset.openai-ada2`, `dl19-passage.openai-ada2`, `dl20-passage.openai-ada2`, `dl19-passage.openai-ada2-hyde`, `dl20-passage.openai-ada2-hyde`]:

```bash
target/appassembler/bin/SearchHnswDenseVectors \
  ...
```

Evaluation uses `tools/eval/trec_eval.9.0.4/trec_eval` with options such as `-c -l 2 -m recall.1000`.

With the above commands, you should be able to reproduce the following results:

```
# msmarco-passage.dev-subset.openai-ada2
recip_rank all 0.3434
recall_1000 all 0.9841
```
```
# dl19-passage.openai-ada2
map all 0.4786
ndcg_cut_10 all 0.7035
recall_1000 all 0.8625
```
```
# dl20-passage.openai-ada2
map all 0.4771
ndcg_cut_10 all 0.6759
recall_1000 all 0.8705
```
```
# dl19-passage.openai-ada2-hyde
map all 0.5124
ndcg_cut_10 all 0.7163
recall_1000 all 0.8968
```
```
# dl20-passage.openai-ada2-hyde
map all 0.4938
ndcg_cut_10 all 0.6666
recall_1000 all 0.8919
```
