Refactor tolerance settings for MS MARCO dense vector regressions (#2541)

Continuation of #2538 - refactor tolerance values for HNSW indexes, calibrate wrt flat index scores.
castorini · Jul 8, 2024 · 3885b5c
1 parent 5eb46b9
commit 3885b5c
Showing 109 changed files with 616 additions and 490 deletions.
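
To make the calibration idea concrete, here is a minimal sketch of a one-sided tolerance check against a flat-index reference score. It is not Anserini's actual regression code: the function name `within_tolerance`, the default tolerance of 0.01, and the choice to accept scores above the reference are assumptions made for illustration. The example scores are the R@1000 values for BGE-base-en-v1.5 that appear in the diff below.

```python
# Hypothetical helper, not part of Anserini: checks whether an observed
# HNSW score falls no more than `tolerance` below the flat-index reference.
def within_tolerance(observed: float, reference: float, tolerance: float = 0.01) -> bool:
    # A small epsilon guards against floating-point rounding at the boundary.
    # Assumption: scores above the reference are accepted (one-sided check).
    return observed >= reference - tolerance - 1e-9


# Flat-index reference for R@1000 (BGE-base-en-v1.5), taken from the diff.
flat_r1000 = 0.847

# Previously documented HNSW int8 scores: cached queries and ONNX encoding.
print(within_tolerance(0.843, flat_r1000))  # 0.004 below the reference
print(within_tolerance(0.836, flat_r1000))  # 0.011 below the reference
```

Under these assumptions, a run such as the second one would need a larger per-metric tolerance, which matches the "sometimes more" caveat in the updated documentation text.
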
diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.cached.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.cached.md
@@ -101,16 +101,17 @@ With the above commands, you should be able to reproduce the following results:
 
 | **AP@1000**                                                                                                  | **BGE-base-en-v1.5**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.443     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.444     |
 | **nDCG@10**                                                                                                  | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.708     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.706     |
 | **R@100**                                                                                                    | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.614     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.617     |
 | **R@1000**                                                                                                   | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.843     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.847     |
 
-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
 
 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.onnx.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.onnx.md
@@ -103,14 +103,15 @@ With the above commands, you should be able to reproduce the following results:
 |:-------------------------------------------------------------------------------------------------------------|-----------|
 | [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.444     |
 | **nDCG@10**                                                                                                  | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.702     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.706     |
 | **R@100**                                                                                                    | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.609     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.617     |
 | **R@1000**                                                                                                   | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.836     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.847     |
 
-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw-int8.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
 
 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.cached.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.cached.md
@@ -99,16 +99,17 @@ With the above commands, you should be able to reproduce the following results:
 
 | **AP@1000**                                                                                                  | **BGE-base-en-v1.5**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.442     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.444     |
 | **nDCG@10**                                                                                                  | **BGE-base-en-v1.5**|
 | [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.706     |
 | **R@100**                                                                                                    | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.616     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.617     |
 | **R@1000**                                                                                                   | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.842     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.847     |
 
-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on non-quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).
 
 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.onnx.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.onnx.md
@@ -99,16 +99,17 @@ With the above commands, you should be able to reproduce the following results:
 
 | **AP@1000**                                                                                                  | **BGE-base-en-v1.5**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.447     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.444     |
 | **nDCG@10**                                                                                                  | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.701     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.706     |
 | **R@100**                                                                                                    | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.607     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.617     |
 | **R@1000**                                                                                                   | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.837     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.847     |
 
-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on non-quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).
 
 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).

diff --git a/...ressions/regressions-dl19-passage.cohere-embed-english-v3.0.hnsw-int8.cached.md b/...ressions/regressions-dl19-passage.cohere-embed-english-v3.0.hnsw-int8.cached.md
@@ -94,16 +94,17 @@ With the above commands, you should be able to reproduce the following results:
 
 | **AP@1000**                                                                                                  | **cohere-embed-english-v3.0**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.487     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.488     |
 | **nDCG@10**                                                                                                  | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.690     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.696     |
 | **R@100**                                                                                                    | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.647     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.648     |
 | **R@1000**                                                                                                   | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.850     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.863     |
 
-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.cohere-embed-english-v3.0.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
 
 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).

diff --git a/docs/regressions/regressions-dl19-passage.cohere-embed-english-v3.0.hnsw.cached.md b/docs/regressions/regressions-dl19-passage.cohere-embed-english-v3.0.hnsw.cached.md
@@ -94,16 +94,17 @@ With the above commands, you should be able to reproduce the following results:
 
 | **AP@1000**                                                                                                  | **cohere-embed-english-v3.0**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.486     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.488     |
 | **nDCG@10**                                                                                                  | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.690     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.696     |
 | **R@100**                                                                                                    | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.645     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.648     |
 | **R@1000**                                                                                                   | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.851     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.863     |
 
-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.cohere-embed-english-v3.0.hnsw.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on non-quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).
 
 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).

diff --git a/docs/regressions/regressions-dl19-passage.cos-dpr-distil.hnsw-int8.cached.md b/docs/regressions/regressions-dl19-passage.cos-dpr-distil.hnsw-int8.cached.md
@@ -101,16 +101,17 @@ With the above commands, you should be able to reproduce the following results:
 
 | **AP@1000**                                                                                                  | **cosDPR-distil**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.458     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.466     |
 | **nDCG@10**                                                                                                  | **cosDPR-distil**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.717     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.725     |
 | **R@100**                                                                                                    | **cosDPR-distil**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.605     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.617     |
 | **R@1000**                                                                                                   | **cosDPR-distil**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.805     |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html)                                                   | 0.820     |
 
-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.cos-dpr-distil.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
 
 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).