forked from castorini/anserini
Consolidate metadata for pre-built indexes (castorini#805)
Showing 15 changed files with 206 additions and 0 deletions.
15 changes: 15 additions & 0 deletions
pyserini/resources/index-metadata/index-msmarco-doc-20201117-f87c94-readme.txt
@@ -0,0 +1,15 @@
This index was generated on 2020/11/17 at commit f87c945fd1c1e4174468194c72e3c05688dc45dd Mon Nov 16 16:17:20 2020 -0500
with the following command:

sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-doc \
-index index-msmarco-doc-20201117-f87c94 -threads 1 -storeRaw -optimize

Note that to reduce index size:

+ positions are not indexed (so no phrase queries)
+ document vectors are not stored (so no query expansion)

However, the raw documents are stored, so they can be fetched and fed to further downstream reranking components.

index-msmarco-doc-20201117-f87c94.tar.gz MD5 checksum = ac747860e7a37aed37cc30ed3990f273
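Each readme ends with an MD5 checksum for the archive, so a downloaded tarball can be checked against it before unpacking. The following is a minimal Python sketch using only the standard library. The function names `md5_of_file` and `verify_archive` are illustrative assumptions and are not part of Anserini or Pyserini.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Return True when the file's MD5 matches the value listed in the readme."""
    # Normalize the expected value so stray whitespace or upper case is tolerated.
    return md5_of_file(path) == expected_md5.strip().lower()
```

Assuming the archive has been downloaded to the working directory, `verify_archive("index-msmarco-doc-20201117-f87c94.tar.gz", "ac747860e7a37aed37cc30ed3990f273")` would be expected to return `True` for an intact copy.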
14 changes: 14 additions & 0 deletions
...ni/resources/index-metadata/index-msmarco-doc-expanded-per-doc-20201126-1b4d0a-readme.txt
@@ -0,0 +1,14 @@
This index was generated on 2020/11/26 at

+ docTTTTTquery commit d2704c025c2bf6db652b4b27f49c4e59714ba898 (2020/11/24).
+ anserini commit 1b4d0a29879a867ca5d1f003f924acc3279455ba (2020/11/25).

with the following command:

sh anserini/target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 1 \
-input msmarco-doc-expanded -index index-msmarco-doc-expanded-per-doc-20201126-1b4d0a -optimize

Note that this index does not store any "extras" (positions, document vectors, raw documents, etc.).

index-msmarco-doc-expanded-per-doc-20201126-1b4d0a.tar.gz MD5 checksum = f7056191842ab77a01829cff68004782
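The last line of each readme follows the same `<archive> MD5 checksum = <hash>` shape, so the archive name and hash can be pulled out mechanically, for instance to build a manifest of all pre-built indexes. The sketch below assumes that shape holds for every readme. The regular expression and the function name `parse_checksum_line` are hypothetical and not taken from the repository.

```python
import re

# Assumed line shape: "<archive-name> MD5 checksum = <32 hex characters>".
CHECKSUM_LINE = re.compile(
    r"^(?P<name>\S+)\s+MD5 checksum\s*=\s*(?P<md5>[0-9a-fA-F]{32})$"
)


def parse_checksum_line(readme_text):
    """Return (archive_name, md5) from the first checksum line, or None."""
    for line in readme_text.splitlines():
        match = CHECKSUM_LINE.match(line.strip())
        if match:
            return match.group("name"), match.group("md5").lower()
    return None
```

Returning `None` rather than raising leaves it to the caller to decide how to treat a readme without a checksum line.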
14 changes: 14 additions & 0 deletions
...esources/index-metadata/index-msmarco-doc-expanded-per-passage-20201126-1b4d0a-readme.txt
@@ -0,0 +1,14 @@
This index was generated on 2020/11/26 at

+ docTTTTTquery commit d2704c025c2bf6db652b4b27f49c4e59714ba898 (2020/11/24).
+ anserini commit 1b4d0a29879a867ca5d1f003f924acc3279455ba (2020/11/25).

with the following command:

sh anserini/target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 1 \
-input msmarco-doc-expanded-passage -index index-msmarco-doc-expanded-per-passage-20201126-1b4d0a -optimize

Note that this index does not store any "extras" (positions, document vectors, raw documents, etc.).

index-msmarco-doc-expanded-per-passage-20201126-1b4d0a.tar.gz MD5 checksum = 54ea30c64515edf3c3741291b785be53
19 changes: 19 additions & 0 deletions
pyserini/resources/index-metadata/index-msmarco-doc-per-passage-20201204-f50dcc-readme.txt
@@ -0,0 +1,19 @@
This index was generated on 2020/12/04 at

+ docTTTTTquery commit 5be1af130b4657ea117781f761c4e5d15c77cb42 (2020/12/01).
+ anserini commit f50dcceb6cd0ec3403c1e77066aa51bb3275d24e (2020/12/04).

with the following command:

sh anserini/target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 1 \
-input msmarco-doc-passage -index index-msmarco-doc-per-passage-20201204-f50dcc -storeRaw -optimize

Note that to reduce index size:

+ positions are not indexed (so no phrase queries)
+ document vectors are not stored (so no query expansion)

However, the raw documents are stored, so they can be fetched and fed to further downstream reranking components.

index-msmarco-doc-per-passage-20201204-f50dcc.tar.gz MD5 checksum = 797367406a7542b649cefa6b41cf4c33
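The notes in these readmes tie each capability to an indexing flag: positions enable phrase queries, document vectors enable query expansion, and stored raw text lets documents be fetched for reranking. The sketch below reads those flags out of an `IndexCollection` command line as recorded in a readme. It inspects only the flag spellings that appear in these readmes (`-storePositions`, `-storeDocvectors`, `-storeRaw`, and the older `-storeRawDocs`). The mapping tables and function names are assumptions for illustration, not an Anserini API.

```python
import shlex

# Flags as they appear in the readmes; -storeRawDocs is the spelling used
# in the robust04 readme (Anserini v0.7.0).
FEATURE_FLAGS = {
    "-storePositions": "positions",
    "-storeDocvectors": "docvectors",
    "-storeRaw": "raw",
    "-storeRawDocs": "raw",
}

# What each stored "extra" enables, per the notes in the readmes.
CAPABILITIES = {
    "positions": "phrase queries",
    "docvectors": "query expansion",
    "raw": "fetching raw text for reranking",
}


def stored_extras(command):
    """Return the set of extras enabled by an IndexCollection command line."""
    # Join backslash-continued lines before tokenizing.
    joined = command.replace("\\\n", " ")
    return {FEATURE_FLAGS[tok] for tok in shlex.split(joined) if tok in FEATURE_FLAGS}


def supported_features(command):
    """Return a sorted list of features the resulting index would support."""
    return sorted(CAPABILITIES[extra] for extra in stored_extras(command))
```

Applied to the command above, which passes only `-storeRaw`, `supported_features` would report raw-text fetching but neither phrase queries nor query expansion, in line with the readme's note.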
14 changes: 14 additions & 0 deletions
...ni/resources/index-metadata/index-msmarco-doc-per-passage-slim-20201204-f50dcc-readme.txt
@@ -0,0 +1,14 @@
This index was generated on 2020/12/04 at

+ docTTTTTquery commit 5be1af130b4657ea117781f761c4e5d15c77cb42 (2020/12/01).
+ anserini commit f50dcceb6cd0ec3403c1e77066aa51bb3275d24e (2020/12/04).

with the following command:

sh anserini/target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 1 \
-input msmarco-doc-passage -index index-msmarco-doc-per-passage-slim-20201204-f50dcc -optimize

This minimal index does not store any "extras" (positions, document vectors, raw documents, etc.).

index-msmarco-doc-per-passage-slim-20201204-f50dcc.tar.gz MD5 checksum = 77c2409943a8c9faffabf57cb6adca69
10 changes: 10 additions & 0 deletions
pyserini/resources/index-metadata/index-msmarco-doc-slim-20201202-ab6e28-readme.txt
@@ -0,0 +1,10 @@
This index was generated on 2020/12/02 at commit ab6e280b06a7a6476d001a5eb2319c191010c0e1 (2020/12/01)
with the following command:

sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-doc \
-index index-msmarco-doc-slim-20201202-ab6e28 -threads 1 -optimize

This minimal index does not store any "extras" (positions, document vectors, raw documents, etc.).

index-msmarco-doc-slim-20201202-ab6e28.tar.gz MD5 checksum = c56e752f7992bf6149761097641d515a
15 changes: 15 additions & 0 deletions
pyserini/resources/index-metadata/index-msmarco-passage-20201117-f87c94-readme.txt
@@ -0,0 +1,15 @@
This index was generated on 2020/11/17 at commit f87c945fd1c1e4174468194c72e3c05688dc45dd Mon Nov 16 16:17:20 2020 -0500
with the following command:

sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-passage/collection_jsonl \
-index index-msmarco-passage-20201117-f87c94 -threads 9 -storeRaw -optimize

Note that to reduce index size:

+ positions are not indexed (so no phrase queries)
+ document vectors are not stored (so no query expansion)

However, the raw passages are stored, so they can be fetched and fed to further downstream reranking components.

index-msmarco-passage-20201117-f87c94.tar.gz MD5 checksum = 1efad4f1ae6a77e235042eff4be1612d
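Most index names in these readmes appear to follow the pattern `index-<collection>-<YYYYMMDD>-<commit prefix>`, where the suffix is the leading characters of the Anserini commit hash recorded in the readme (for example `f87c94` for commit `f87c945f...`). The sketch below checks that correspondence. The pattern is inferred from the names shown here rather than from a documented convention, and `name_matches_commit` is a hypothetical helper.

```python
import re

# Inferred pattern: index-<collection>-<8-digit date>-<6+ hex chars of commit>.
INDEX_NAME = re.compile(
    r"^(?P<stem>index-.+)-(?P<date>\d{8})-(?P<prefix>[0-9a-f]{6,})$"
)


def name_matches_commit(index_name, commit_hash):
    """Return True if the index name's suffix is a prefix of the commit hash."""
    match = INDEX_NAME.match(index_name)
    if match is None:
        # Names without a commit suffix (such as index-robust04-20191213)
        # cannot be checked this way.
        return False
    return commit_hash.lower().startswith(match.group("prefix"))
```

A `False` result means either that the name carries no commit suffix or that the suffix and the hash disagree, so a caller that needs to tell these apart would have to inspect the name separately.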
14 changes: 14 additions & 0 deletions
pyserini/resources/index-metadata/index-msmarco-passage-expanded-20201121-e127fb-readme.txt
@@ -0,0 +1,14 @@
This index was generated on 2020/11/21 at

+ docTTTTTquery commit 701ea0a72beeb8db46aa409352a72ba52cd2c36b Tue Nov 17 07:13:27 2020 -0500
+ anserini commit e127fbea6f5515d60eb7c325cd866657dbf13cc6 Sat Nov 21 07:58:03 2020 -0500

with the following command:

sh anserini/target/appassembler/bin/IndexCollection \
-collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-input msmarco-passage-expanded -index index-msmarco-passage-expanded-20201121-e127fb -threads 9 -optimize

Note that this index does not store any "extras" (positions, document vectors, raw documents, etc.).

index-msmarco-passage-expanded-20201121-e127fb.tar.gz MD5 checksum = e5762e9e065b6fe5000f9c18da778565
11 changes: 11 additions & 0 deletions
pyserini/resources/index-metadata/index-msmarco-passage-ltr-20210519-e25e33f-readme.txt
@@ -0,0 +1,11 @@
This index was generated on 2021/05/19 at commit e25e33f4a06e9c1ab4d795908cae4474fa019643 2021-05-17 21:48:48 -0400
with the following command:

sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-ltr-passage/ltr_collection_jsonl \
-index index-msmarco-passage-ltr-20210519-e25e33f -threads 9 -storeRaw -optimize -storePositions -storeDocvectors -pretokenized

Note that the -pretokenized option is used to preserve the preprocessed tokenization.
This was built with spaCy 3.0.6.

index-msmarco-passage-ltr-20210519-e25e33f MD5 checksum = a5de642c268ac1ed5892c069bdc29ae3
10 changes: 10 additions & 0 deletions
pyserini/resources/index-metadata/index-msmarco-passage-slim-20201202-ab6e28-readme.txt
@@ -0,0 +1,10 @@
This index was generated on 2020/12/02 at commit ab6e280b06a7a6476d001a5eb2319c191010c0e1 (2020/12/01)
with the following command:

sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-passage/collection_jsonl \
-index index-msmarco-passage-slim-20201202-ab6e28 -threads 9 -optimize

This minimal index does not store any "extras" (positions, document vectors, raw documents, etc.).

index-msmarco-passage-slim-20201202-ab6e28.tar.gz MD5 checksum = 5e11da4cebd2e8dda2e73c589ffb0b4c
7 changes: 7 additions & 0 deletions
pyserini/resources/index-metadata/index-robust04-20191213-readme.txt
@@ -0,0 +1,7 @@
This index was generated on 12/13/2019 with Anserini v0.7.0, with the following command:

sh target/appassembler/bin/IndexCollection -collection TrecCollection \
-input /tuna1/collections/newswire/disk45/ -index index-robust04-20191213 \
-generator JsoupGenerator -threads 16 -storePositions -storeDocvectors -storeRawDocs -optimize

index-robust04-20191213.tar.gz MD5 checksum = 15f3d001489c97849a010b0a4734d018
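The robust04 readme writes its generation date as `12/13/2019` (month/day/year), while the other readmes use `YYYY/MM/DD` and the index names use `YYYYMMDD`. A tool that processes these readmes might normalize both spellings before comparing them. The sketch below is one way to do that with the standard library. `parse_readme_date` is a hypothetical helper, and it assumes only these two formats occur.

```python
from datetime import datetime

# Formats seen in the readmes: "2020/11/17" and "12/13/2019".
DATE_FORMATS = ("%Y/%m/%d", "%m/%d/%Y")


def parse_readme_date(text):
    """Parse a readme date in either supported format into a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def as_index_date(text):
    """Render a readme date in the YYYYMMDD form used in index names."""
    return parse_readme_date(text).strftime("%Y%m%d")
```

The date embedded in an index name does not always equal the generation date (the wikipedia-kilt-doc readme gives 2021/04/22 for an index named with 20210421), so a mismatch is not necessarily an error.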
18 changes: 18 additions & 0 deletions
pyserini/resources/index-metadata/index-wikipedia-dpr-20210120-d1b9e6-readme.txt
@@ -0,0 +1,18 @@
This index was generated on 2021/01/20 at

+ anserini commit d1b9e67928aa60fa557113ace5d209b0c58e994c (2021/01/19).

with the following command:

sh anserini/target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 22 \
-input wikipedia-dpr-jsonl -index index-wikipedia-dpr-20210120-d1b9e6 -storeRaw -optimize

Note that to reduce index size:

+ positions are not indexed (so no phrase queries)
+ document vectors are not stored (so no query expansion)

However, the raw documents are stored, so they can be fetched and fed to further downstream reranking components.

index-wikipedia-dpr-20210120-d1b9e6.tar.gz MD5 checksum = c28f3a56b2dfcef25bf3bf755c264d04
13 changes: 13 additions & 0 deletions
pyserini/resources/index-metadata/index-wikipedia-dpr-slim-20210120-d1b9e6-readme.txt
@@ -0,0 +1,13 @@
This index was generated on 2021/01/20 at

+ anserini commit d1b9e67928aa60fa557113ace5d209b0c58e994c (2021/01/19).

with the following command:

sh anserini/target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 22 \
-input wikipedia-dpr-jsonl -index index-wikipedia-dpr-slim-20210120-d1b9e6 -optimize

This minimal index does not store any "extras" (positions, document vectors, raw documents, etc.).

index-wikipedia-dpr-slim-20210120-d1b9e6.tar.gz MD5 checksum = 7d40604a824b5df37a1ae9d25ea38071
18 changes: 18 additions & 0 deletions
pyserini/resources/index-metadata/index-wikipedia-kilt-doc-20210421-f29307-readme.txt
@@ -0,0 +1,18 @@
This index was generated on 2021/04/22 at

+ anserini commit f29307a9fb162ec7bef4919a164929a673d2304e (2021/04/21).

with the following command:

python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 40 -input collections/wikipedia-kilt-doc \
-index indexes/index-wikipedia-kilt-doc-20210421-f29307 -storeRaw -optimize

Note that to reduce index size:

+ positions are not indexed (so no phrase queries)
+ document vectors are not stored (so no query expansion)

However, the raw documents are stored, so they can be fetched and fed to further downstream reranking components.

index-wikipedia-kilt-doc-20210421-f29307.tar.gz MD5 checksum = b8ec8feb654f7aaa86f9901dc6c804a8