diff --git a/docs/regressions/regressions-msmarco-passage-cohere-embed-english-v3.md b/docs/regressions/regressions-msmarco-passage-cohere-embed-english-v3.md
new file mode 100644
index 0000000000..f745943d64
--- /dev/null
+++ b/docs/regressions/regressions-msmarco-passage-cohere-embed-english-v3.md
@@ -0,0 +1,84 @@
+# Anserini Regressions: MS MARCO Passage Ranking
+
+**Model**: [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) with HNSW indexes (using pre-encoded queries)
+
+This page describes regression experiments, integrated into Anserini's regression testing framework, using the [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) model on the [MS MARCO passage ranking task](https://github.com/microsoft/MSMARCO-Passage-Ranking).
+
+In these experiments, we are using pre-encoded queries (i.e., cached results of query encoding).
+
+The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml).
+Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and then run `bin/build.sh` to rebuild the documentation.
+
+## Corpus Download
+
+Download the corpus and unpack into `collections/`:
+
+```bash
+wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cohere-embed-english-v3.tar -P collections/
+tar xvf collections/msmarco-passage-cohere-embed-english-v3.tar -C collections/
+```
+
+To confirm, `msmarco-passage-cohere-embed-english-v3.tar` is 38 GB and has MD5 checksum `6b7d9795806891b227378f6c290464a9`.
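Reviewer note: the checksum stated in the Corpus Download section can be confirmed before unpacking. The following is a minimal sketch, assuming GNU `md5sum` is available (on macOS the equivalent tool is `md5`); `verify_md5` is a hypothetical helper and not part of Anserini.

```shell
# verify_md5 FILE EXPECTED
# Hypothetical helper (not part of Anserini): compares the MD5 digest of FILE
# with the EXPECTED hex string and returns non-zero on mismatch.
# Assumes GNU md5sum is on the PATH.
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(md5sum "$file" | awk '{ print $1 }')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}
```

For the corpus in this patch, the call would be `verify_md5 collections/msmarco-passage-cohere-embed-english-v3.tar 6b7d9795806891b227378f6c290464a9`. Hashing a 38 GB file takes a while, so it is worth running once right after the download rather than after a failed `tar`.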
+
+## Indexing
+
+Sample indexing command, building HNSW indexes:
+
+```bash
+target/appassembler/bin/IndexHnswDenseVectors \
+  -collection JsonDenseVectorCollection \
+  -input /path/to/msmarco-passage-cohere-embed-english-v3 \
+  -generator HnswDenseVectorDocumentGenerator \
+  -index indexes/lucene-hnsw.msmarco-passage-cohere-embed-english-v3/ \
+  -threads 16 -M 16 -efC 100 \
+  >& logs/log.msmarco-passage-cohere-embed-english-v3 &
+```
+
+The path `/path/to/msmarco-passage-cohere-embed-english-v3/` should point to the corpus downloaded above.
+Upon completion, we should have an index with 8,841,823 documents.
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+The regression experiments here evaluate on the 6980 dev set questions; see [this page](../../docs/experiments-msmarco-passage.md) for more details.
+
+After indexing has completed, you should be able to perform retrieval as follows using HNSW indexes:
+
+```bash
+target/appassembler/bin/SearchHnswDenseVectors \
+  -index indexes/lucene-hnsw.msmarco-passage-cohere-embed-english-v3/ \
+  -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.gz \
+  -topicReader JsonIntVector \
+  -output runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt \
+  -generator VectorQueryGenerator -topicField vector -threads 16 -hits 1000 -efSearch 1000 &
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```bash
+target/appassembler/bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+target/appassembler/bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+target/appassembler/bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+| **nDCG@10** | **cohere-embed-english-v3**|
+|:-------------------------------------------------------------------------------------------------------------|-----------|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking) | 0.428 |
+| **AP@1000** | **cohere-embed-english-v3**|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking) | 0.371 |
+| **RR@10** | **cohere-embed-english-v3**|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking) | 0.365 |
+| **R@1000** | **cohere-embed-english-v3**|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking) | 0.974 |
+
+Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
+Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml).
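Reviewer note: the "within 0.005" statement in the Effectiveness section can be checked mechanically after a run. The sketch below is only an illustration of that comparison, not how Anserini's regression framework implements it; `within_tolerance` is a hypothetical helper, and the default tolerance of 0.005 is taken from the text above.

```shell
# within_tolerance OBSERVED REFERENCE [TOLERANCE]
# Hypothetical helper (not part of Anserini): succeeds when
# |OBSERVED - REFERENCE| <= TOLERANCE. TOLERANCE defaults to 0.005,
# the bound quoted in the Effectiveness section.
within_tolerance() {
  observed="$1"
  reference="$2"
  tolerance="${3:-0.005}"
  # awk handles the floating-point arithmetic; exit status 0 means "within".
  awk -v o="$observed" -v r="$reference" -v t="$tolerance" \
    'BEGIN { d = o - r; if (d < 0) d = -d; exit !(d <= t) }'
}
```

For example, `within_tolerance 0.4291 0.4275` succeeds against the nDCG@10 reference of 0.4275 recorded in the YAML file in this patch, while `within_tolerance 0.4400 0.4275` fails.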
+
+## Reproduction Log[*](../../docs/reproducibility.md)
+
+To add to this reproduction log, modify [this template](../../src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template) and run `bin/build.sh` to rebuild the documentation.
diff --git a/src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template b/src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template
new file mode 100644
index 0000000000..e3a1e1af99
--- /dev/null
+++ b/src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template
@@ -0,0 +1,62 @@
+# Anserini Regressions: MS MARCO Passage Ranking
+
+**Model**: [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) with HNSW indexes (using pre-encoded queries)
+
+This page describes regression experiments, integrated into Anserini's regression testing framework, using the [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) model on the [MS MARCO passage ranking task](https://github.com/microsoft/MSMARCO-Passage-Ranking).
+
+In these experiments, we are using pre-encoded queries (i.e., cached results of query encoding).
+
+The exact configurations for these regressions are stored in [this YAML file](${yaml}).
+Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and then run `bin/build.sh` to rebuild the documentation.
+
+## Corpus Download
+
+Download the corpus and unpack into `collections/`:
+
+```bash
+wget ${download_url} -P collections/
+tar xvf collections/${corpus}.tar -C collections/
+```
+
+To confirm, `${corpus}.tar` is 38 GB and has MD5 checksum `${download_checksum}`.
+
+## Indexing
+
+Sample indexing command, building HNSW indexes:
+
+```bash
+${index_cmds}
+```
+
+The path `/path/to/${corpus}/` should point to the corpus downloaded above.
+Upon completion, we should have an index with 8,841,823 documents.
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+The regression experiments here evaluate on the 6980 dev set questions; see [this page](${root_path}/docs/experiments-msmarco-passage.md) for more details.
+
+After indexing has completed, you should be able to perform retrieval as follows using HNSW indexes:
+
+```bash
+${ranking_cmds}
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```bash
+${eval_cmds}
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+${effectiveness}
+
+Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
+Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](${yaml}).
+
+## Reproduction Log[*](${root_path}/docs/reproducibility.md)
+
+To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation.
diff --git a/src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml b/src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml
new file mode 100644
index 0000000000..7c283b1748
--- /dev/null
+++ b/src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml
@@ -0,0 +1,65 @@
+---
+corpus: msmarco-passage-cohere-embed-english-v3
+corpus_path: collections/msmarco/msmarco-passage-cohere-embed-english-v3/
+
+download_url: https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cohere-embed-english-v3.tar
+download_checksum: 6b7d9795806891b227378f6c290464a9
+
+index_path: indexes/lucene-hnsw.msmarco-passage-cohere-embed-english-v3/
+index_type: hnsw
+collection_class: JsonDenseVectorCollection
+generator_class: HnswDenseVectorDocumentGenerator
+index_threads: 16
+index_options: -M 16 -efC 100
+
+metrics:
+  - metric: nDCG@10
+    command: target/appassembler/bin/trec_eval
+    params: -c -m ndcg_cut.10
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+  - metric: AP@1000
+    command: target/appassembler/bin/trec_eval
+    params: -c -m map
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+  - metric: RR@10
+    command: target/appassembler/bin/trec_eval
+    params: -c -M 10 -m recip_rank
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+  - metric: R@1000
+    command: target/appassembler/bin/trec_eval
+    params: -c -m recall.1000
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+
+topic_reader: JsonIntVector
+topics:
+  - name: "[MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking)"
+    id: dev
+    path: topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.gz
+    qrel: qrels.msmarco-passage.dev-subset.txt
+
+models:
+  - name: cohere-embed-english-v3
+    display: cohere-embed-english-v3
+    type: hnsw
+    params: -generator VectorQueryGenerator -topicField vector -threads 16 -hits 1000 -efSearch 1000
+    results:
+      nDCG@10:
+        - 0.4275
+      AP@1000:
+        - 0.3706
+      RR@10:
+        - 0.3648
+      R@1000:
+        - 0.9735
diff --git a/tools b/tools
index 589db040bb..fe0d5e7776 160000
--- a/tools
+++ b/tools
@@ -1 +1 @@
-Subproject commit 589db040bba15557a4b6b509ae55e3142e7379ab
+Subproject commit fe0d5e7776da48ea8a2ea12acc9bfc474cea7a17
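Reviewer note: each `metrics` entry in the YAML configuration above pairs `separator: "\t"` with `parse_index: 2`. A plausible reading, stated here as an assumption rather than confirmed from the regression script, is that a `trec_eval` output line is split on tabs and the zero-based field 2 holds the score. The sketch below illustrates that reading; `extract_score` is a hypothetical helper, and the sample line imitates the three-column `trec_eval` output (measure, query id, value).

```shell
# extract_score SEPARATOR INDEX
# Hypothetical helper (not part of Anserini): reads lines on stdin, splits
# each on SEPARATOR, and prints the field at zero-based position INDEX.
# This mirrors an assumed interpretation of `separator` and `parse_index`.
extract_score() {
  sep="$1"
  index="$2"
  # awk fields are one-based, so shift the zero-based index by one.
  awk -F"$sep" -v i="$index" '{ print $(i + 1) }'
}

# Sample line in the measure<TAB>query<TAB>value layout.
printf 'ndcg_cut_10\tall\t0.4275\n' | extract_score "$(printf '\t')" 2
```

Under that assumption, the extracted value is what gets compared with the `results` entries recorded for each model.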