General refactoring #2269

Merged (6 commits) on Nov 25, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -207,7 +207,7 @@ Key:
+ MF = "multifield" baseline (Lucene analyzer)
+ U1 = uniCOIL (noexp)
+ S1 = SPLADE-distill CoCodenser-medium
- + S2 = SPLADE++ (CoCondenser-EnsembleDistil)
+ + S2 = SPLADE++ CoCondenser-EnsembleDistil

| Corpus | F1 | F2 | MF | U1 | S1 | S2 |
|-------------------------|:-----------------------------------------------------------------------------:|:--------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------:|
@@ -76,7 +76,7 @@ target/appassembler/bin/SearchInvertedDenseVectors \
-topics tools/topics-and-qrels/topics.dl19-passage.cos-dpr-distil.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-fw-40.topics.dl19-passage.cos-dpr-distil.jsonl.txt \
- -topicField vector -encoding fw -fw.q 40 -hits 1000 &
+ -topicField vector -threads 16 -encoding fw -fw.q 40 -hits 1000 &
```

Evaluation can be performed using `trec_eval`:
@@ -76,7 +76,7 @@ target/appassembler/bin/SearchInvertedDenseVectors \
-topics tools/topics-and-qrels/topics.dl19-passage.cos-dpr-distil.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-lexlsh-600.topics.dl19-passage.cos-dpr-distil.jsonl.txt \
- -topicField vector -encoding lexlsh -lexlsh.b 600 -hits 1000 &
+ -topicField vector -threads 16 -encoding lexlsh -lexlsh.b 600 -hits 1000 &
```

Evaluation can be performed using `trec_eval`:
@@ -76,7 +76,7 @@ target/appassembler/bin/SearchInvertedDenseVectors \
-topics tools/topics-and-qrels/topics.dl20.cos-dpr-distil.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-fw-40.topics.dl20.cos-dpr-distil.jsonl.txt \
- -topicField vector -encoding fw -fw.q 40 -hits 1000 &
+ -topicField vector -threads 16 -encoding fw -fw.q 40 -hits 1000 &
```

Evaluation can be performed using `trec_eval`:
@@ -76,7 +76,7 @@ target/appassembler/bin/SearchInvertedDenseVectors \
-topics tools/topics-and-qrels/topics.dl20.cos-dpr-distil.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-lexlsh-600.topics.dl20.cos-dpr-distil.jsonl.txt \
- -topicField vector -encoding lexlsh -lexlsh.b 600 -hits 1000 &
+ -topicField vector -threads 16 -encoding lexlsh -lexlsh.b 600 -hits 1000 &
```

Evaluation can be performed using `trec_eval`:
4 changes: 2 additions & 2 deletions docs/regressions/regressions-msmarco-doc-docTTTTTquery.md
@@ -138,7 +138,7 @@ For default parameters (`k1=0.9`, `b=0.4`):
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-doc-docTTTTTquery.bm25-default.txt \
-format msmarco \
-bm25 -hits 100
@@ -159,7 +159,7 @@ For tuned parameters (`k1=4.68`, `b=0.87`):
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-doc-docTTTTTquery/ \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-doc-docTTTTTquery.bm25-tuned.txt \
-format msmarco \
-bm25 -bm25.k1 4.68 -bm25.b 0.87 -hits 100
@@ -109,7 +109,7 @@ The MaxP passage retrieval functionality is available in `SearchCollection`.
To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above:

```bash
- $ sh target/appassembler/bin/SearchCollection -topicReader TsvString \
+ $ sh target/appassembler/bin/SearchCollection -topicreader TsvString \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
-index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \
-output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-default.txt -format msmarco \
@@ -131,7 +131,7 @@ Note that the above command uses `-format msmarco` to directly generate a run in
To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above:

```bash
- $ sh target/appassembler/bin/SearchCollection -topicReader TsvString \
+ $ sh target/appassembler/bin/SearchCollection -topicreader TsvString \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
-index indexes/lucene-index.msmarco-doc-segmented-docTTTTTquery/ \
-output runs/run.msmarco-doc-segmented-docTTTTTquery.bm25-tuned.txt -format msmarco \
@@ -117,7 +117,7 @@ The following command generates a comparable run:
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-doc-segmented-unicoil/ \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.unicoil.tsv.gz \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-doc-segmented-unicoil.msmarco-doc.dev.txt \
-format msmarco \
-impact -pretokenized -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100
4 changes: 2 additions & 2 deletions docs/regressions/regressions-msmarco-doc-segmented.md
@@ -109,7 +109,7 @@ The MaxP passage retrieval functionality is available in `SearchCollection`.
To generate an MS MARCO submission with the BM25 default parameters, corresponding to "BM25 (default)" above:

```bash
- $ target/appassembler/bin/SearchCollection -topicReader TsvString \
+ $ target/appassembler/bin/SearchCollection -topicreader TsvString \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
-index indexes/lucene-index.msmarco-doc-segmented/ \
-output runs/run.msmarco-doc-segmented.bm25-default.txt -format msmarco \
@@ -132,7 +132,7 @@ Note that the above command uses `-format msmarco` to directly generate a run in
To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above:

```bash
- $ target/appassembler/bin/SearchCollection -topicReader TsvString \
+ $ target/appassembler/bin/SearchCollection -topicreader TsvString \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
-index indexes/lucene-index.msmarco-doc-segmented/ \
-output runs/run.msmarco-doc-segmented.bm25-tuned.txt -format msmarco \
4 changes: 2 additions & 2 deletions docs/regressions/regressions-msmarco-doc.md
@@ -173,7 +173,7 @@ For default parameters (`k1=0.9`, `b=0.4`):
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-doc/ \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-doc.bm25-default.txt \
-format msmarco \
-bm25 -hits 100
@@ -194,7 +194,7 @@ For tuned parameters (`k1=4.46`, `b=0.82`):
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-doc/ \
-topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-doc.bm25-tuned.txt \
-format msmarco \
-bm25 -bm25.k1 4.46 -bm25.b 0.82 -hits 100
@@ -72,7 +72,7 @@ target/appassembler/bin/SearchInvertedDenseVectors \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-fw-40.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt \
- -topicField vector -encoding fw -fw.q 40 -hits 1000 &
+ -topicField vector -threads 16 -encoding fw -fw.q 40 -hits 1000 &
```

Evaluation can be performed using `trec_eval`:
@@ -72,7 +72,7 @@ target/appassembler/bin/SearchInvertedDenseVectors \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-lexlsh-600.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt \
- -topicField vector -encoding lexlsh -lexlsh.b 600 -hits 1000 &
+ -topicField vector -threads 16 -encoding lexlsh -lexlsh.b 600 -hits 1000 &
```

Evaluation can be performed using `trec_eval`:
4 changes: 2 additions & 2 deletions docs/regressions/regressions-msmarco-passage-docTTTTTquery.md
@@ -165,7 +165,7 @@ For parameters `k1=0.82`, `b=0.68`:
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-passage-docTTTTTquery.1 \
-format msmarco \
-bm25 -bm25.k1 0.82 -bm25.b 0.68
@@ -185,7 +185,7 @@ For parameters `k1=2.18`, `b=0.86`:
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage-docTTTTTquery/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-passage-docTTTTTquery.2 \
-format msmarco \
-bm25 -bm25.k1 2.18 -bm25.b 0.86
4 changes: 2 additions & 2 deletions docs/regressions/regressions-msmarco-passage.md
@@ -142,7 +142,7 @@ For default parameters (`k1=0.9`, `b=0.4`):
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-passage.bm25.default.tsv \
-format msmarco \
-bm25
@@ -162,7 +162,7 @@ For tuned parameters (`k1=0.82`, `b=0.68`):
$ sh target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
- -topicReader TsvInt \
+ -topicreader TsvInt \
-output runs/run.msmarco-passage.bm25.tuned.tsv \
-format msmarco \
-bm25 -bm25.k1 0.82 -bm25.b 0.68
105 changes: 71 additions & 34 deletions src/main/java/io/anserini/search/SearchHnswDenseVectors.java
@@ -47,6 +47,9 @@
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
+ import java.util.ArrayList;
+ import java.util.Arrays;
+ import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
@@ -113,6 +116,9 @@ public static class Args {
@Option(name ="-encoder", metaVar = "[encoder]", usage = "Dense encoder to use.")
public String encoder = null;

+ @Option(name = "-options", usage = "Print information about options.")
+ public Boolean options = false;
+
// ---------------------------------------------
// Simple built-in support for passage retrieval
// ---------------------------------------------
@@ -146,32 +152,36 @@ public static class Args {
private final IndexSearcher searcher;
private final VectorQueryGenerator generator;
private final DenseEncoder queryEncoder;
+ private final SortedMap<K, String> queries = new TreeMap<>();
private final ConcurrentSkipListMap<K, String> results = new ConcurrentSkipListMap<>();

public SearchHnswDenseVectors(Args args) throws IOException {
this.args = args;
- Path indexPath = Paths.get(args.index);
-
- if (!Files.exists(indexPath) || !Files.isDirectory(indexPath) || !Files.isReadable(indexPath)) {
- throw new IllegalArgumentException(String.format("Index path '%s' does not exist or is not a directory.", args.index));
- }

LOG.info("============ Initializing HNSW Searcher ============");
- LOG.info("Index: " + indexPath);
+ LOG.info("Index: " + args.index);
+ LOG.info("Topics: " + Arrays.toString(args.topics));
LOG.info("Query generator: " + args.queryGenerator);
LOG.info("Encoder: " + args.encoder);
LOG.info("Threads: " + args.threads);

- this.reader = DirectoryReader.open(FSDirectory.open(indexPath));
+ // We might not be able to successfully create a reader for a variety of reasons, anything from path doesn't exist
+ // to corrupt index. Gather all possible exceptions together as an unchecked exception to make initialization and
+ // error reporting clearer.
+ try {
+ this.reader = DirectoryReader.open(FSDirectory.open(Paths.get(args.index)));
+ } catch (IOException e) {
+ throw new IllegalArgumentException(String.format("\"%s\" does not appear to be a valid index.", args.index));
+ }
+
this.searcher = new IndexSearcher(this.reader);

try {
this.generator = (VectorQueryGenerator) Class
.forName(String.format("io.anserini.search.query.%s", args.queryGenerator))
.getConstructor().newInstance();
} catch (Exception e) {
- e.printStackTrace();
- throw new IllegalArgumentException("Unable to load QueryGenerator: " + args.queryGenerator);
+ throw new IllegalArgumentException(String.format("Unable to load QueryGenerator \"%s\".", args.queryGenerator));
}

if (args.encoder != null) {
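The hunk above replaces the up-front `Files.exists` / `Files.isDirectory` checks with a single `try`/`catch` that turns any `IOException` from opening the index into an `IllegalArgumentException`. A minimal, stdlib-only sketch of that pattern follows. `IndexOpenSketch` and `countIndexFiles` are hypothetical names, and listing a directory stands in for Lucene's `DirectoryReader.open`. Unlike the PR, the sketch also passes the original exception along as the cause, which is one possible variation rather than what the PR does.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IndexOpenSketch {
  // Hypothetical stand-in for opening an index: lists the directory and counts its entries.
  static int countIndexFiles(String index) {
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get(index))) {
      int count = 0;
      for (Path ignored : stream) {
        count++;
      }
      return count;
    } catch (IOException e) {
      // Missing path, non-directory, and unreadable directory all surface here as IOException
      // subclasses, so one unchecked exception covers every failure mode.
      throw new IllegalArgumentException(
          String.format("\"%s\" does not appear to be a valid index.", index), e);
    }
  }

  public static void main(String[] args) {
    try {
      countIndexFiles("no/such/index");
    } catch (IllegalArgumentException e) {
      System.err.printf("Error: %s\n", e.getMessage());
    }
  }
}
```

Letting the open call fail and translating the exception avoids a check-then-use gap and keeps the caller's error handling to a single `catch (IllegalArgumentException e)`.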
@@ -180,51 +190,65 @@ public SearchHnswDenseVectors(Args args) throws IOException {
.forName(String.format("io.anserini.encoder.dense.%sEncoder", args.encoder))
.getConstructor().newInstance();
} catch (Exception e) {
- e.printStackTrace();
- throw new IllegalArgumentException("Unable to load encoder: " + args.encoder);
+ throw new IllegalArgumentException(String.format("Unable to load Encoder \"%s\".", args.encoder));
}
} else {
queryEncoder = null;
}

- }
-
- @Override
- public void close() throws IOException {
- reader.close();
- }
-
- @SuppressWarnings("unchecked")
- @Override
- public void run() {
+ // Same as above: we might not be able to successfully read topics for a variety of reasons. Gather all possible
+ // exceptions together as an unchecked exception to make initialization and error reporting clearer.
SortedMap<K, Map<String, String>> topics = new TreeMap<>();
- for (String file : args.topics) {
- Path topicsFilePath = Paths.get(file);
+ for (String singleTopicsFile : args.topics) {
+ Path topicsFilePath = Paths.get(singleTopicsFile);
if (!Files.exists(topicsFilePath) || !Files.isRegularFile(topicsFilePath) || !Files.isReadable(topicsFilePath)) {
- throw new IllegalArgumentException("Topics file : " + topicsFilePath + " does not exist or is not a (readable) file.");
+ throw new IllegalArgumentException(String.format("\"%s\" does not appear to be a valid topics file.", topicsFilePath));
}
try {
+ @SuppressWarnings("unchecked")
TopicReader<K> tr = (TopicReader<K>) Class
.forName(String.format("io.anserini.search.topicreader.%sTopicReader", args.topicReader))
.getConstructor(Path.class).newInstance(topicsFilePath);

topics.putAll(tr.read());
} catch (Exception e) {
- e.printStackTrace();
- throw new IllegalArgumentException("Unable to load topic reader: " + args.topicReader);
+ throw new IllegalArgumentException(String.format("Unable to load topic reader \"%s\".", args.topicReader));
}
}

+ // Now iterate through all the topics to pick out the right field with proper exception handling.
+ try {
+ for (Map.Entry<K, Map<String, String>> entry : topics.entrySet()) {
+ K qid = entry.getKey();
+ String query = entry.getValue().get(args.topicField);
+ assert query != null;
+
+ this.queries.put(qid, query);
+ }
+ } catch (AssertionError|Exception e) {
+ throw new IllegalArgumentException(String.format("Unable to read topic field \"%s\".", args.topicField));
+ }
+ }
+
+ @Override
+ public void close() throws IOException {
+ reader.close();
+ }
+
+ @SuppressWarnings("unchecked")
+ @Override
+ public void run() {
LOG.info("============ Launching Search Threads ============");
final ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(args.threads);
final AtomicInteger cnt = new AtomicInteger();

final long start = System.nanoTime();
- for (Map.Entry<K, Map<String, String>> entry : topics.entrySet()) {
+ for (Map.Entry<K, String> entry : queries.entrySet()) {
K qid = entry.getKey();

// This is the per-query execution, in parallel.
executor.execute(() -> {
- String queryString = entry.getValue().get(args.topicField);
+ String queryString = entry.getValue();
ScoredDocuments docs;

try {
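The hunk above moves topic loading into the constructor and flattens each topic to the single field named by `args.topicField`, using `assert query != null` inside a `try` that catches `AssertionError`. Java only evaluates `assert` statements when assertions are enabled (for example with `-ea`), so a missing field may otherwise go undetected until search time. The sketch below shows the same flattening step with an explicit null check. `TopicFieldSketch` and `extractQueries` are hypothetical names, and this is an illustrative alternative under that assumption, not the code in the PR.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class TopicFieldSketch {
  // Picks one field out of every topic, failing fast if any topic lacks it.
  static <K extends Comparable<K>> SortedMap<K, String> extractQueries(
      SortedMap<K, Map<String, String>> topics, String topicField) {
    SortedMap<K, String> queries = new TreeMap<>();
    for (Map.Entry<K, Map<String, String>> entry : topics.entrySet()) {
      String query = entry.getValue().get(topicField);
      if (query == null) {
        // Explicit check: works whether or not the JVM runs with assertions enabled.
        throw new IllegalArgumentException(
            String.format("Unable to read topic field \"%s\".", topicField));
      }
      queries.put(entry.getKey(), query);
    }
    return queries;
  }

  public static void main(String[] args) {
    SortedMap<Integer, Map<String, String>> topics = new TreeMap<>();
    topics.put(2, Map.of("title", "second query"));
    topics.put(1, Map.of("title", "first query"));
    System.out.println(extractQueries(topics, "title")); // prints {1=first query, 2=second query}
  }
}
```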
@@ -259,9 +283,9 @@ public void run() {
}
final long durationMillis = TimeUnit.MILLISECONDS.convert(System.nanoTime() - start, TimeUnit.NANOSECONDS);

- LOG.info(topics.size() + " queries processed in " +
+ LOG.info(queries.size() + " queries processed in " +
DurationFormatUtils.formatDuration(durationMillis, "HH:mm:ss") +
- String.format(" = ~%.2f q/s", topics.size()/(durationMillis/1000.0)));
+ String.format(" = ~%.2f q/s", queries.size()/(durationMillis/1000.0)));

// Now we write the results to a run file.
try {
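The log line in the hunk above reports throughput as the number of queries divided by the elapsed time in seconds. A small worked sketch of that arithmetic follows, using the same nanosecond-to-millisecond conversion. `ThroughputSketch` and its method names are hypothetical, and the duration formatting from `DurationFormatUtils` is left out to keep the example stdlib-only.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

public class ThroughputSketch {
  // Queries per second; dividing by a double means a zero duration yields Infinity, not an exception.
  static double queriesPerSecond(int numQueries, long durationMillis) {
    return numQueries / (durationMillis / 1000.0);
  }

  static String summary(int numQueries, long elapsedNanos) {
    long durationMillis = TimeUnit.MILLISECONDS.convert(elapsedNanos, TimeUnit.NANOSECONDS);
    return String.format(Locale.ROOT, "%d queries processed = ~%.2f q/s",
        numQueries, queriesPerSecond(numQueries, durationMillis));
  }

  public static void main(String[] args) {
    // 43 queries in 2 seconds: 43 / 2.0 = 21.50 q/s.
    System.out.println(summary(43, 2_000_000_000L)); // prints 43 queries processed = ~21.50 q/s
  }
}
```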
@@ -300,9 +324,22 @@ public static void main(String[] args) throws Exception {
try {
parser.parseArgument(args);
} catch (CmdLineException e) {
- System.err.println(e.getMessage());
- parser.printUsage(System.err);
- System.err.println("Example: SearchHnswDenseVectors" + parser.printExample(OptionHandlerFilter.REQUIRED));
+ if (searchArgs.options) {
+ System.err.printf("Options for %s:\n\n", SearchHnswDenseVectors.class.getSimpleName());
+ parser.printUsage(System.err);
+
+ List<String> required = new ArrayList<>();
+ parser.getOptions().forEach((option) -> {
+ if (option.option.required()) {
+ required.add(option.option.toString());
+ }
+ });
+
+ System.err.printf("\nRequired options are %s\n", required);
+ } else {
+ System.err.printf("Error: %s. For help, use \"-options\" to print out information about options.\n", e.getMessage());
+ }
+
return;
}

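In the hunk above, the `-options` branch sits inside the `catch (CmdLineException e)` block, so as written the usage listing is only reached when argument parsing has already failed. The sketch below imitates the two branches without the argument-parsing library. `OptionsHelpSketch`, its option table, and which options are marked required are all assumptions for illustration, not the tool's real option set.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OptionsHelpSketch {
  // Hypothetical option table: option name -> whether it is required.
  static final Map<String, Boolean> OPTIONS = new LinkedHashMap<>();
  static {
    OPTIONS.put("-index", true);
    OPTIONS.put("-topics", true);
    OPTIONS.put("-threads", false);
    OPTIONS.put("-options", false);
  }

  // Collects the names of required options, in declaration order.
  static List<String> requiredOptions() {
    List<String> required = new ArrayList<>();
    OPTIONS.forEach((name, isRequired) -> {
      if (isRequired) {
        required.add(name);
      }
    });
    return required;
  }

  // Mirrors the two branches taken after a parse failure.
  static String report(boolean optionsFlag, String parseError) {
    if (optionsFlag) {
      return String.format("Required options are %s", requiredOptions());
    }
    return String.format(
        "Error: %s. For help, use \"-options\" to print out information about options.", parseError);
  }

  public static void main(String[] args) {
    System.err.println(report(true, "ignored"));   // prints Required options are [-index, -topics]
    System.err.println(report(false, "Option \"-index\" is required"));
  }
}
```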
@@ -315,7 +352,7 @@ public static void main(String[] args) throws Exception {
searcher.run();
searcher.close();
} catch (IllegalArgumentException e) {
- System.err.println(e.getMessage());
+ System.err.printf("Error: %s\n", e.getMessage());
return;
}

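After this change, `run()` in the diff above iterates over the pre-extracted `queries` map and submits one task per query to a fixed-size thread pool. A minimal sketch of that pattern follows, assuming results are collected in a `ConcurrentSkipListMap` so they come back ordered by query id. `ParallelQuerySketch` and `runAll` are hypothetical names, and upper-casing the query string is a placeholder for the actual search.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelQuerySketch {
  static SortedMap<Integer, String> runAll(SortedMap<Integer, String> queries, int threads) {
    ConcurrentSkipListMap<Integer, String> results = new ConcurrentSkipListMap<>();
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    for (Map.Entry<Integer, String> entry : queries.entrySet()) {
      // Per-query work runs in parallel; the "search" here is only a placeholder.
      executor.execute(() ->
          results.put(entry.getKey(), entry.getValue().toUpperCase(Locale.ROOT)));
    }
    executor.shutdown();
    try {
      // Wait for all submitted tasks to finish before reading the results.
      if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("Search threads did not finish in time.");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for search threads.", e);
    }
    return results;
  }

  public static void main(String[] args) {
    SortedMap<Integer, String> queries = new TreeMap<>();
    queries.put(2, "second");
    queries.put(1, "first");
    System.out.println(runAll(queries, 4)); // prints {1=FIRST, 2=SECOND}
  }
}
```

Because the map is both concurrent and sorted, worker threads can write results in any order and the run file can still be written in query-id order afterwards.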