Add Mr.TyDi regressions for Arabic (#1685)
lintool authored Dec 5, 2021
1 parent 3c81c4f commit aee51ad
Showing 18 changed files with 17,183 additions and 3 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -79,6 +79,7 @@ For the most part, these runs are based on [_default_ parameter settings](https:
+ Regressions for [CLEF 2006 Monolingual French](docs/regressions-clef06-fr.md)
+ Regressions for [TREC 2002 Monolingual Arabic](docs/regressions-trec02-ar.md)
+ Regressions for FIRE 2012: [Monolingual Bengali](docs/regressions-fire12-bn.md), [Monolingual Hindi](docs/regressions-fire12-hi.md), [Monolingual English](docs/regressions-fire12-en.md)
+ Regressions for Mr. TyDi: [ar](docs/regressions-mrtydi-v1.1-ar.md)

## Reproduction Guides

66 changes: 66 additions & 0 deletions docs/regressions-mrtydi-v1.1-ar.md
@@ -0,0 +1,66 @@
# Anserini: Regressions for [Mr. TyDi (Arabic)](https://github.com/castorini/mr.tydi)

This page documents regression experiments for [Mr. TyDi (Arabic)](https://github.com/castorini/mr.tydi).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/mrtydi-v1.1-ar.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/mrtydi-v1.1-ar.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection MrTyDiCollection \
-input /path/to/mrtydi-v1.1-ar \
-index indexes/lucene-index.mrtydi-v1.1-arabic.pos+docvectors+raw \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -language ar \
>& logs/log.mrtydi-v1.1-ar &
```

See [this page](https://github.com/castorini/mr.tydi) for more details about the Mr. TyDi corpus.
For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.mrtydi-v1.1-arabic.pos+docvectors+raw \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ar.train.txt.gz \
-output runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.train.txt.gz \
-language ar -bm25 -hits 100 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.mrtydi-v1.1-arabic.pos+docvectors+raw \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ar.dev.txt.gz \
-output runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.dev.txt.gz \
-language ar -bm25 -hits 100 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.mrtydi-v1.1-arabic.pos+docvectors+raw \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-ar.test.txt.gz \
-output runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.test.txt.gz \
-language ar -bm25 -hits 100 &
```
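
The `-topicreader TsvInt` option tells `SearchCollection` how to parse the topic files. As a rough illustration only, assuming each topic line consists of an integer query id, a tab, and the query text (an assumption about the file layout, not something this page states), the parsing step could look like the sketch below. `TsvTopicsSketch` and its `parse` method are hypothetical names used for illustration and are not part of Anserini.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not an Anserini class.
final class TsvTopicsSketch {
  // Assumed layout: one topic per line, "<integer id><TAB><query text>".
  static Map<Integer, String> parse(String tsv) {
    Map<Integer, String> topics = new LinkedHashMap<>();
    for (String line : tsv.split("\n")) {
      if (line.isBlank()) {
        continue; // tolerate empty lines
      }
      String[] parts = line.split("\t", 2); // split on the first tab only
      if (parts.length < 2) {
        throw new IllegalArgumentException("Expected id<TAB>query, got: " + line);
      }
      topics.put(Integer.valueOf(parts[0].trim()), parts[1].trim());
    }
    return topics;
  }

  public static void main(String[] args) {
    // Made-up sample topics, not taken from Mr. TyDi.
    String sample = "1\tfirst example query\n2\tsecond example query\n";
    parse(sample).forEach((id, query) -> System.out.println(id + " -> " + query));
  }
}
```

Keying the map by integer id keeps lookups simple when the ids are later matched against the qrels files.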

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -M 100 -m recip_rank -c -m recall.100 -c src/main/resources/topics-and-qrels/qrels.mrtydi-v1.1-ar.train.txt runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.train.txt.gz
tools/eval/trec_eval.9.0.4/trec_eval -M 100 -m recip_rank -c -m recall.100 -c src/main/resources/topics-and-qrels/qrels.mrtydi-v1.1-ar.dev.txt runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.dev.txt.gz
tools/eval/trec_eval.9.0.4/trec_eval -M 100 -m recip_rank -c -m recall.100 -c src/main/resources/topics-and-qrels/qrels.mrtydi-v1.1-ar.test.txt runs/run.mrtydi-v1.1-ar.bm25.topics.mrtydi-v1.1-ar.test.txt.gz
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

MRR@100 | BM25 |
:---------------------------------------|-----------|
[Mr. TyDi (Arabic): train](https://github.com/castorini/mr.tydi)| 0.3356 |
[Mr. TyDi (Arabic): dev](https://github.com/castorini/mr.tydi)| 0.3462 |
[Mr. TyDi (Arabic): test](https://github.com/castorini/mr.tydi)| 0.3682 |


R@100 | BM25 |
:---------------------------------------|-----------|
[Mr. TyDi (Arabic): train](https://github.com/castorini/mr.tydi)| 0.7944 |
[Mr. TyDi (Arabic): dev](https://github.com/castorini/mr.tydi)| 0.7872 |
[Mr. TyDi (Arabic): test](https://github.com/castorini/mr.tydi)| 0.7928 |
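
The two tables report reciprocal rank and recall, both cut off at rank 100 and averaged over the queries in each split. For a single query, the two quantities can be sketched as follows under their usual definitions. This is a simplified illustration rather than `trec_eval`'s implementation; `RankMetricsSketch` and the document ids are made up.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of Anserini or trec_eval.
final class RankMetricsSketch {
  // 1 / rank of the first relevant document within the cutoff, or 0.0 if none is found.
  static double reciprocalRank(List<String> ranked, Set<String> relevant, int cutoff) {
    int limit = Math.min(cutoff, ranked.size());
    for (int i = 0; i < limit; i++) {
      if (relevant.contains(ranked.get(i))) {
        return 1.0 / (i + 1);
      }
    }
    return 0.0;
  }

  // Fraction of the relevant documents that appear within the cutoff.
  static double recallAt(List<String> ranked, Set<String> relevant, int cutoff) {
    if (relevant.isEmpty()) {
      return 0.0;
    }
    int limit = Math.min(cutoff, ranked.size());
    long found = ranked.subList(0, limit).stream()
        .distinct()
        .filter(relevant::contains)
        .count();
    return (double) found / relevant.size();
  }

  public static void main(String[] args) {
    List<String> ranked = List.of("d3", "d1", "d7"); // system ranking, best first
    Set<String> relevant = Set.of("d1", "d9");       // judged relevant documents
    System.out.println(reciprocalRank(ranked, relevant, 100)); // prints 0.5
    System.out.println(recallAt(ranked, relevant, 100));       // prints 0.5
  }
}
```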
149 changes: 149 additions & 0 deletions src/main/java/io/anserini/collection/MrTyDiCollection.java
@@ -0,0 +1,149 @@
/*
* Anserini: A Lucene toolkit for reproducible information retrieval research
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.anserini.collection;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.zip.GZIPInputStream;

public class MrTyDiCollection extends DocumentCollection<MrTyDiCollection.Document> {
  private static final Logger LOG = LogManager.getLogger(MrTyDiCollection.class);

  public MrTyDiCollection(Path path) {
    this.path = path;
  }

  @SuppressWarnings("unchecked")
  @Override
  public FileSegment<MrTyDiCollection.Document> createFileSegment(Path p) throws IOException {
    return new Segment(p);
  }

  /**
   * A file in a corpus for Mr. TyDi.
   */
  public static class Segment<T extends Document> extends FileSegment<T> {
    private JsonNode node = null;
    private Iterator<JsonNode> iter = null; // iterator for JSON document array
    private MappingIterator<JsonNode> iterator; // iterator for JSON line objects

    public Segment(Path path) throws IOException {
      super(path);

      if (path.toString().endsWith(".gz")) {
        InputStream stream = new GZIPInputStream(Files.newInputStream(path, StandardOpenOption.READ), BUFFER_SIZE);
        bufferedReader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
      } else {
        // Decode as UTF-8 explicitly; FileReader would use the platform default charset.
        InputStream stream = Files.newInputStream(path, StandardOpenOption.READ);
        bufferedReader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
      }

      // Read the input as a sequence of JSON values and buffer the first one.
      ObjectMapper mapper = new ObjectMapper();
      iterator = mapper.readerFor(JsonNode.class).readValues(bufferedReader);
      if (iterator.hasNext()) {
        node = iterator.next();
      }
    }

    @SuppressWarnings("unchecked")
    @Override
    public void readNext() throws NoSuchElementException {
      if (node == null) {
        throw new NoSuchElementException("JsonNode is empty");
      } else if (node.isObject()) {
        bufferedRecord = (T) createNewDocument(node);
        if (iterator.hasNext()) { // if bufferedReader contains JSON line objects, we parse the next JSON into node
          node = iterator.next();
        } else {
          atEOF = true; // there is no more JSON object in the bufferedReader
        }
      } else {
        LOG.error("Error: invalid JsonNode type");
        throw new NoSuchElementException("Invalid JsonNode type");
      }
    }

    protected Document createNewDocument(JsonNode json) {
      // Build the document from the argument rather than from the node field.
      return new Document(json);
    }
  }

  /**
   * A document in a corpus for Mr. TyDi.
   */
  public static class Document implements SourceDocument {
    private String id;
    private String raw;
    private Map<String, String> fields;

    public Document(JsonNode json) {
      this.raw = json.toPrettyString();
      this.fields = new HashMap<>();

      // "docid" becomes the document id; every other field is kept as text.
      json.fields().forEachRemaining(e -> {
        if ("docid".equals(e.getKey())) {
          this.id = json.get("docid").asText();
        } else {
          this.fields.put(e.getKey(), e.getValue().asText());
        }
      });
    }

    @Override
    public String id() {
      if (id == null) {
        throw new RuntimeException("Document does not have the required \"docid\" field!");
      }
      return id;
    }

    @Override
    public String contents() {
      if (!fields.containsKey("title") || !fields.containsKey("text")) {
        throw new RuntimeException("Document is missing required fields!");
      }

      // Indexed contents are the title and the text, separated by a newline.
      return fields.get("title") + "\n" + fields.get("text");
    }

    @Override
    public String raw() {
      return raw;
    }

    @Override
    public boolean indexable() {
      return true;
    }
  }
}
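
To make the field handling in `Document` concrete, here is a small self-contained sketch that mirrors it without Jackson: a plain `Map<String, String>` stands in for the parsed JSON object. `MrTyDiDocSketch`, the sample id, and the sample strings are made up for illustration and are not part of this commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for MrTyDiCollection.Document, using a Map instead of a JsonNode.
final class MrTyDiDocSketch {
  private final String id;
  private final Map<String, String> fields = new LinkedHashMap<>();

  MrTyDiDocSketch(Map<String, String> json) {
    String docid = null;
    for (Map.Entry<String, String> e : json.entrySet()) {
      if ("docid".equals(e.getKey())) {
        docid = e.getValue();                 // "docid" becomes the document id
      } else {
        fields.put(e.getKey(), e.getValue()); // everything else is an ordinary field
      }
    }
    this.id = docid;
  }

  String id() {
    if (id == null) {
      throw new IllegalStateException("Document does not have the required \"docid\" field!");
    }
    return id;
  }

  String contents() {
    if (!fields.containsKey("title") || !fields.containsKey("text")) {
      throw new IllegalStateException("Document is missing required fields!");
    }
    return fields.get("title") + "\n" + fields.get("text");
  }

  public static void main(String[] args) {
    // Made-up record; real Mr. TyDi ids and text will differ.
    MrTyDiDocSketch doc = new MrTyDiDocSketch(
        Map.of("docid", "example-0", "title", "Sample title", "text", "Sample passage text."));
    System.out.println(doc.id());
    System.out.println(doc.contents());
  }
}
```

A record that lacks `title` or `text` fails only when `contents()` is called, which matches the lazy checks in the class above.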
7 changes: 6 additions & 1 deletion src/main/java/io/anserini/eval/Qrels.java
@@ -16,6 +16,8 @@

package io.anserini.eval;

import io.anserini.search.topicreader.TsvIntTopicReader;

public enum Qrels {
TREC1_ADHOC("topics-and-qrels/qrels.adhoc.51-100.txt"),
TREC2_ADHOC("topics-and-qrels/qrels.adhoc.101-150.txt"),
@@ -64,7 +66,10 @@ public enum Qrels {
COVID_ROUND5("topics-and-qrels/qrels.covid-round5.txt"),
TREC2018_BL("topics-and-qrels/qrels.backgroundlinking18.txt"),
TREC2019_BL("topics-and-qrels/qrels.backgroundlinking19.txt"),
- TREC2020_BL("topics-and-qrels/qrels.backgroundlinking20.txt");
+ TREC2020_BL("topics-and-qrels/qrels.backgroundlinking20.txt"),
+ MRTYDI_V11_AR_TRAIN("topics-and-qrels/qrels.mrtydi-v1.1-ar.train.txt"),
+ MRTYDI_V11_AR_DEV("topics-and-qrels/qrels.mrtydi-v1.1-ar.dev.txt"),
+ MRTYDI_V11_AR_TEST("topics-and-qrels/qrels.mrtydi-v1.1-ar.test.txt");

public final String path;

1 change: 1 addition & 0 deletions src/main/java/io/anserini/index/IndexCollection.java
@@ -649,6 +649,7 @@ public IndexCollection(IndexArgs args) throws Exception {
LOG.info("CollectionClass: " + args.collectionClass);
LOG.info("Generator: " + args.generatorClass);
LOG.info("Threads: " + args.threads);
LOG.info("Language: " + args.language);
LOG.info("Stemmer: " + args.stemmer);
LOG.info("Keep stopwords? " + args.keepStopwords);
LOG.info("Stopwords: " + args.stopwords);
5 changes: 4 additions & 1 deletion src/main/java/io/anserini/search/topicreader/Topics.java
@@ -89,7 +89,10 @@ public enum Topics {
DPR_CURATED_TEST(DprJsonlTopicReader.class, "topics-and-qrels/topics.dpr.curated.test.txt"),
DPR_SQUAD_TEST(DprJsonlTopicReader.class, "topics-and-qrels/topics.dpr.squad.test.txt"),
NQ_DEV(DprNqTopicReader.class, "topics-and-qrels/topics.nq.dev.txt"),
- NQ_TEST(DprNqTopicReader.class, "topics-and-qrels/topics.nq.test.txt");
+ NQ_TEST(DprNqTopicReader.class, "topics-and-qrels/topics.nq.test.txt"),
+ MRTYDI_V11_AR_TRAIN(TsvIntTopicReader.class, "topics-and-qrels/topics.mrtydi-v1.1-ar.train.txt.gz"),
+ MRTYDI_V11_AR_DEV(TsvIntTopicReader.class, "topics-and-qrels/topics.mrtydi-v1.1-ar.dev.txt.gz"),
+ MRTYDI_V11_AR_TEST(TsvIntTopicReader.class, "topics-and-qrels/topics.mrtydi-v1.1-ar.test.txt.gz");

public final String path;
public final Class readerClass;
37 changes: 37 additions & 0 deletions src/main/resources/docgen/templates/mrtydi-v1.1-ar.template
@@ -0,0 +1,37 @@
# Anserini: Regressions for [Mr. TyDi (Arabic)](https://github.com/castorini/mr.tydi)

This page documents regression experiments for [Mr. TyDi (Arabic)](https://github.com/castorini/mr.tydi).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/mrtydi-v1.1-ar.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/mrtydi-v1.1-ar.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/castorini/mr.tydi) for more details about the Mr. TyDi corpus.
For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}
74 changes: 74 additions & 0 deletions src/main/resources/regression/mrtydi-v1.1-ar.yaml
@@ -0,0 +1,74 @@
---
name: mrtydi-v1.1-ar
index_command: target/appassembler/bin/IndexCollection
index_utils_command: target/appassembler/bin/IndexReaderUtils
search_command: target/appassembler/bin/SearchCollection
topic_root: src/main/resources/topics-and-qrels/
qrels_root: src/main/resources/topics-and-qrels/
index_root:
ranking_root:
collection: MrTyDiCollection
generator: DefaultLuceneDocumentGenerator
threads: 1
index_options:
  - -storePositions
  - -storeDocvectors
  - -storeRaw
  - -language ar
search_options:
  - -language ar
topic_reader: TsvInt
evals:
  - command: tools/eval/trec_eval.9.0.4/trec_eval
    params:
      - -M 100
      - -m recip_rank
      - -c
    separator: "\t"
    parse_index: 2
    metric: MRR@100
    metric_precision: 4
    can_combine: true
  - command: tools/eval/trec_eval.9.0.4/trec_eval
    params:
      - -m recall.100
      - -c
    separator: "\t"
    parse_index: 2
    metric: R@100
    metric_precision: 4
    can_combine: true
input_roots:
  - /tuna1/ # on tuna
  - /store/ # on orca
  - /scratch2/ # on damiano
input: collections/mr-tydi-corpus/mrtydi-v1.1-arabic
index_path: indexes/lucene-index.mrtydi-v1.1-arabic.pos+docvectors+raw
index_stats:
  documents: 2106586
  documents (non-empty): 2106586
  total terms: 92529014
topics:
  - name: "[Mr. TyDi (Arabic): train](https://github.com/castorini/mr.tydi)"
    path: topics.mrtydi-v1.1-ar.train.txt.gz
    qrel: qrels.mrtydi-v1.1-ar.train.txt
  - name: "[Mr. TyDi (Arabic): dev](https://github.com/castorini/mr.tydi)"
    path: topics.mrtydi-v1.1-ar.dev.txt.gz
    qrel: qrels.mrtydi-v1.1-ar.dev.txt
  - name: "[Mr. TyDi (Arabic): test](https://github.com/castorini/mr.tydi)"
    path: topics.mrtydi-v1.1-ar.test.txt.gz
    qrel: qrels.mrtydi-v1.1-ar.test.txt
models:
  - name: bm25
    display: BM25
    params:
      - -bm25 -hits 100
    results:
      MRR@100:
        - 0.3356
        - 0.3462
        - 0.3682
      R@100:
        - 0.7944
        - 0.7872
        - 0.7928
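
The `separator` and `parse_index` settings under `evals` describe where the score sits in a line of `trec_eval` output. The following is a minimal sketch, assuming the summary line is tab-separated as metric name, `all`, value, and assuming the regression script splits on the separator and reads the field at `parse_index` (the script itself is not part of this commit). `TrecEvalLineSketch` and `extractScore` are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the separator / parse_index settings; not Anserini code.
final class TrecEvalLineSketch {
  // Splits one output line on the literal separator and parses the field at parseIndex.
  static double extractScore(String line, String separator, int parseIndex) {
    String[] parts = line.split(Pattern.quote(separator));
    if (parseIndex >= parts.length) {
      throw new IllegalArgumentException("No field " + parseIndex + " in line: " + line);
    }
    return Double.parseDouble(parts[parseIndex].trim());
  }

  public static void main(String[] args) {
    // Assumed line shape: "<metric name, padded>\tall\t<value>".
    String line = "recip_rank            \tall\t0.3356";
    System.out.println(extractScore(line, "\t", 2)); // prints 0.3356
  }
}
```

With `separator: "\t"` and `parse_index: 2`, the extracted value is what would be compared against the expected numbers listed under `results`.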