More onboarding doc updates (#2151)
lintool authored Jul 21, 2023
1 parent 0e759fd commit 4b8f051
Showing 2 changed files with 21 additions and 4 deletions.
18 changes: 16 additions & 2 deletions docs/experiments-msmarco-passage.md
@@ -14,6 +14,17 @@ If you're a Waterloo student traversing the [onboarding path](https://github.com
+ Be able to evaluate the retrieved results above.
+ Understand the MRR metric.

What's Anserini?
Well, it's the repo that you're in right now.
Anserini is a toolkit (in Java) for reproducible information retrieval research built on the [Lucene search library](https://lucene.apache.org/).
The Lucene search library provides components of the popular [Elasticsearch](https://www.elastic.co/) platform.

Think of it this way: Lucene provides a "kit of parts".
Elasticsearch provides an "assembly of parts" targeted to production search applications, with a REST-centric API.
Anserini provides an alternative way of composing the same core components together, targeted at information retrieval researchers.
By building on Lucene, Anserini aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications.
That is, most things done with Anserini can be "translated" into Elasticsearch quite easily.

## Data Prep

In this guide, we're just going through the mechanical steps of data prep.
@@ -263,8 +274,9 @@ We can find the MRR@10 for `qid` 1048585 above:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \
    collections/msmarco-passage/qrels.dev.small.trec \
    runs/run.msmarco-passage.dev.small.trec | grep 1048585

recip_rank 1048585 1.0000
```
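
To make the metric concrete, here is a minimal Python sketch of how a reciprocal rank at a cutoff of 10 could be computed from TREC-style qrels and run lines. It is an illustration only, not how `trec_eval` is implemented: the function names, the toy query ids (`q1`, `q2`, `q3`), and the toy docids are all made up for this example, and tie-breaking between equal scores is simplified. It assumes qrels lines of the form `qid iteration docid relevance` and run lines of the form `qid Q0 docid rank score tag`.

```python
from collections import defaultdict


def load_qrels(lines):
    """Map each qid to the set of docids judged relevant (relevance > 0)."""
    relevant = defaultdict(set)
    for line in lines:
        qid, _iteration, docid, rel = line.split()
        if int(rel) > 0:
            relevant[qid].add(docid)
    return relevant


def load_run(lines):
    """Map each qid to its docids, ordered by descending score."""
    scored = defaultdict(list)
    for line in lines:
        qid, _q0, docid, _rank, score, _tag = line.split()
        scored[qid].append((float(score), docid))
    return {
        qid: [docid for _score, docid in sorted(pairs, key=lambda p: -p[0])]
        for qid, pairs in scored.items()
    }


def reciprocal_rank(ranked_docids, relevant_docids, cutoff=10):
    """Return 1/rank of the first relevant hit within the cutoff, else 0."""
    for rank, docid in enumerate(ranked_docids[:cutoff], start=1):
        if docid in relevant_docids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(qrels, run, cutoff=10):
    """Average over every query in the qrels; queries missing from the run score 0."""
    per_query = {
        qid: reciprocal_rank(run.get(qid, []), docids, cutoff)
        for qid, docids in qrels.items()
    }
    return sum(per_query.values()) / len(per_query), per_query


# Hypothetical toy data, not taken from MS MARCO.
qrels = load_qrels(["q1 0 d3 1", "q2 0 d9 1", "q3 0 d5 1"])
run = load_run([
    "q1 Q0 d3 1 12.5 toy",
    "q1 Q0 d7 2 11.0 toy",
    "q2 Q0 d4 1 9.0 toy",
    "q2 Q0 d9 2 8.5 toy",
])

mrr, per_query = mean_reciprocal_rank(qrels, run)
print(per_query)  # q1 -> 1.0, q2 -> 0.5, q3 -> 0.0
print(mrr)        # 0.5
```

In this toy run, `q1` has its relevant passage at rank 1 (reciprocal rank 1.0), `q2` at rank 2 (0.5), and `q3` returns nothing (0.0), so the mean is 0.5. Averaging over all queries in the qrels is meant to mirror the `-c` flag used above, and the cutoff mirrors `-M 10`.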

@@ -280,6 +292,8 @@ In short, it's complicated.
At this time, look back through the learning outcomes again and make sure you're good.
As a next step in the onboarding path, you basically [do the same thing again in Python with Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) (as opposed to Java with Anserini here).

Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text.

## BM25 Tuning

This section is **not** part of the onboarding path, so feel free to skip.
7 changes: 5 additions & 2 deletions docs/start-here.md
@@ -20,10 +20,11 @@ What's the problem we're trying to solve?

This is the definition I typically give:

> Given an information need expressed as a query _q_, the text retrieval task is to return a ranked list of _k_ texts {_d<sub>1</sub>_, _d<sub>2</sub>_ ... _d<sub>k</sub>_} from an arbitrarily large but finite collection
of texts _C_ = {_d<sub>i</sub>_} that maximizes a metric of interest, for example, nDCG, AP, etc.

This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, the top-_k_ document retrieval problem, etc.
In most contexts, "ranking" and "retrieval" are used interchangeably.
Basically, this is what _search_ (i.e., information retrieval) is all about.
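
As a rough illustration of the definition, the Python sketch below returns the top _k_ texts from a tiny in-memory collection for a query. Everything in it is an assumption made for the example: the `score` function is a toy term-overlap count (not BM25 or anything Anserini uses), the collection and docids are invented, and the sketch says nothing about how well the ranking does on a metric such as nDCG or AP.

```python
import heapq
from collections import Counter


def score(query, text):
    """Toy scoring function: count how often the query terms occur in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def retrieve(query, collection, k):
    """Return the k highest-scoring (docid, score) pairs, best first."""
    scored = ((docid, score(query, text)) for docid, text in collection.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical collection C = {d_i}, keyed by docid.
collection = {
    "d1": "lucene is a search library",
    "d2": "anserini is a search toolkit",
    "d3": "cooking with butter",
}

print(retrieve("lucene search", collection, k=2))  # [('d1', 2), ('d2', 1)]
```

A real system replaces the exhaustive scan with an inverted index and the toy scorer with a proper ranking model, but the input and output have the same shape: a query and a collection go in, a ranked list of _k_ texts comes out.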

Let's try to unpack the definition a bit.
@@ -276,5 +277,7 @@ By now you should be able to connect the concepts we introduced to how they mani
From here, you're now ready to proceed to try and reproduce the [BM25 Baselines for MS MARCO Passage Ranking](experiments-msmarco-passage.md).
Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text.

## Reproduction Log[*](reproducibility.md)
