Commit

Updated documentation about pre-built indexes (#288)
lintool authored Jan 2, 2021
1 parent d9bf263 commit 7caedfc
Showing 2 changed files with 40 additions and 3 deletions.
40 changes: 37 additions & 3 deletions docs/prebuilt-indexes.md
@@ -1,13 +1,47 @@
# Pyserini: Prebuilt Indexes

Pre-built Anserini indexes are hosted at the University of Waterloo's [GitLab](https://git.uwaterloo.ca/jimmylin/anserini-indexes) and mirrored on Dropbox.
The following methods will list available pre-built indexes:

```python
from pyserini.search import SimpleSearcher
SimpleSearcher.list_prebuilt_indexes()

from pyserini.index import IndexReader
IndexReader.list_prebuilt_indexes()
```

It's easy to initialize a searcher from a pre-built index:

```python
searcher = SimpleSearcher.from_prebuilt_index('robust04')
```

You can use this simple Python one-liner to download the pre-built index:

```
python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('robust04')"
```

The downloaded index will be in `~/.cache/pyserini/indexes/`.
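
To check which indexes have already been downloaded, you can list the contents of that cache directory. The following is a minimal sketch using only the standard library; `list_cached_indexes` is a hypothetical helper (not part of Pyserini), and it assumes the default cache location mentioned above.

```python
from pathlib import Path


def list_cached_indexes(cache_dir=None):
    """Return the sorted names of entries in the index cache directory.

    Hypothetical helper, not a Pyserini API. When cache_dir is None, the
    default location ~/.cache/pyserini/indexes/ is assumed.
    """
    if cache_dir is None:
        root = Path.home() / ".cache" / "pyserini" / "indexes"
    else:
        root = Path(cache_dir)
    # Nothing has been downloaded yet if the directory does not exist.
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())
```

Calling `list_cached_indexes()` with no arguments returns an empty list until at least one pre-built index has been downloaded.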

It's similarly easy to initialize an index reader from a pre-built index:

```python
index_reader = IndexReader.from_prebuilt_index('robust04')
index_reader.stats()
```

The output will be:

```
{'total_terms': 174540872, 'documents': 528030, 'non_empty_documents': 528030, 'unique_terms': 923436}
```

Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
Nope, that's not a bug.
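
If downstream code consumes these statistics, it may be worth guarding against the `-1` value explicitly. Here is a small sketch that operates on a plain dictionary shaped like the output above; `unique_terms_or_none` is a hypothetical helper name, and treating a missing key the same as `-1` is an assumption made for illustration.

```python
def unique_terms_or_none(stats):
    """Return the unique-term count, or None when it is reported as -1.

    Hypothetical helper: `stats` is a dict shaped like the output of
    index_reader.stats(). A missing key is treated like -1 (an assumption).
    """
    value = stats.get('unique_terms', -1)
    return None if value == -1 else value


# The robust04 statistics shown above, copied in as a plain dict.
stats = {'total_terms': 174540872, 'documents': 528030,
         'non_empty_documents': 528030, 'unique_terms': 923436}
print(unique_terms_or_none(stats))  # 923436
```

With a real index reader, the same helper could be applied to the result of `index_reader.stats()`.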

Below is a summary of the pre-built indexes that are currently available.

## MS MARCO Indexes

3 changes: 3 additions & 0 deletions docs/usage-indexreader.md
@@ -162,3 +162,6 @@ Output is something like this:
'non_empty_documents': 528030,
'unique_terms': 923436}
```

Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
Nope, that's not a bug.
