Refactor tolerance settings for MS MARCO dense vector regressions (#2541)

Continuation of #2538 - refactor tolerance values for HNSW indexes, calibrate wrt flat index scores.
lintool committed Jul 8, 2024
1 parent 5eb46b9 commit 3885b5c
Showing 109 changed files with 616 additions and 490 deletions.
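
The commit message describes calibrating HNSW tolerances against flat-index scores. A minimal sketch of what such a check could look like follows; the function name, the one-sided rule (only scores *below* the reference are penalized), and the default tolerance of 0.01 are assumptions made for illustration and are not taken from the project's regression code. The sample values come from the diff below, reading each replaced value as an earlier HNSW result and its replacement as the flat-index reference.

```python
def within_tolerance(observed: float, reference: float, tolerance: float = 0.01) -> bool:
    """Return True if `observed` is at most `tolerance` below `reference`.

    Hypothetical helper: scores above the reference always pass.
    """
    # Round the gap to suppress floating-point noise (e.g. 0.706 - 0.696).
    gap = round(reference - observed, 6)
    return gap <= tolerance


# BGE-base-en-v1.5, R@1000: flat reference 0.847, replaced value 0.843.
print(within_tolerance(0.843, 0.847))
# cosDPR-distil, R@1000: flat reference 0.820, replaced value 0.805.
print(within_tolerance(0.805, 0.820))
```

Under these assumptions the first call passes (gap of 0.004) and the second fails (gap of 0.015), which matches the "up to 0.01, sometimes more" wording in the updated documentation.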
@@ -101,16 +101,17 @@ With the above commands, you should be able to reproduce the following results:

 | **AP@1000** | **BGE-base-en-v1.5**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.443 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.444 |
 | **nDCG@10** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.708 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.706 |
 | **R@100** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.614 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.617 |
 | **R@1000** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.843 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.847 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).

 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
@@ -103,14 +103,15 @@ With the above commands, you should be able to reproduce the following results:
 |:-------------------------------------------------------------------------------------------------------------|-----------|
 | [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.444 |
 | **nDCG@10** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.702 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.706 |
 | **R@100** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.609 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.617 |
 | **R@1000** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.836 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.847 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw-int8.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).

 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
@@ -99,16 +99,17 @@ With the above commands, you should be able to reproduce the following results:

 | **AP@1000** | **BGE-base-en-v1.5**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.442 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.444 |
 | **nDCG@10** | **BGE-base-en-v1.5**|
 | [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.706 |
 | **R@100** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.616 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.617 |
 | **R@1000** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.842 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.847 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on non-quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).

 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
@@ -99,16 +99,17 @@ With the above commands, you should be able to reproduce the following results:

 | **AP@1000** | **BGE-base-en-v1.5**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.447 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.444 |
 | **nDCG@10** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.701 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.706 |
 | **R@100** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.607 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.617 |
 | **R@1000** | **BGE-base-en-v1.5**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.837 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.847 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.hnsw.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on non-quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).

 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
@@ -94,16 +94,17 @@ With the above commands, you should be able to reproduce the following results:

 | **AP@1000** | **cohere-embed-english-v3.0**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.487 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.488 |
 | **nDCG@10** | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.690 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.696 |
 | **R@100** | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.647 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.648 |
 | **R@1000** | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.850 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.863 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.cohere-embed-english-v3.0.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).

 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
@@ -94,16 +94,17 @@ With the above commands, you should be able to reproduce the following results:

 | **AP@1000** | **cohere-embed-english-v3.0**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.486 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.488 |
 | **nDCG@10** | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.690 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.696 |
 | **R@100** | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.645 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.648 |
 | **R@1000** | **cohere-embed-english-v3.0**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.851 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.863 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.cohere-embed-english-v3.0.hnsw.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on non-quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).

 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
@@ -101,16 +101,17 @@ With the above commands, you should be able to reproduce the following results:

 | **AP@1000** | **cosDPR-distil**|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.458 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.466 |
 | **nDCG@10** | **cosDPR-distil**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.717 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.725 |
 | **R@100** | **cosDPR-distil**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.605 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.617 |
 | **R@1000** | **cosDPR-distil**|
-| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.805 |
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.820 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/dl19-passage.cos-dpr-distil.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results are likely to differ; scores may be lower by up to 0.01, sometimes more.
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).

 ❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
 For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
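
The note repeated in each hunk says that nDCG keeps all relevance grades, while other metrics such as AP treat grade 1 as not relevant via `-l 2`. As a rough illustration, the sketch below builds a `trec_eval` argument list per metric. Only the `-l 2` option comes from the text above; the helper name, the file names, and the choice of measure names (`map`, `ndcg_cut.10`, `recall.100`, passed through the standard `-m` option) are assumptions for the example, not the project's own commands.

```python
def trec_eval_args(metric: str, qrels: str, run: str) -> list[str]:
    """Build a trec_eval command line for one metric (hypothetical helper)."""
    args = ["trec_eval"]
    # nDCG uses every relevance grade; other metrics count only grade >= 2.
    if not metric.startswith("ndcg"):
        args += ["-l", "2"]
    args += ["-m", metric, qrels, run]
    return args


# Placeholder file names, not paths from the repository.
for metric in ["map", "ndcg_cut.10", "recall.100", "recall.1000"]:
    print(" ".join(trec_eval_args(metric, "qrels.dl19-passage.txt", "run.dl19.txt")))
```

The resulting list can be passed to `subprocess.run` once `trec_eval` is on the path and the run file holds 1000 hits per query.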