Skip to content

hltcoe/PSQ

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Efficient Implementation of Probabilistic Structured Queries

This package is an implementation of the Probablistic Structured Queries algorithm for cross-langauge information retrieval. It leverages alignment table from statistical machine translation to translate the document bag-of-tokens into the query language.

Raw translation tables are available on Huggingface Models hltcoe/psq_translation_tables

Get started

fast_psq is available on PyPI.

pip install fast_psq

Alternatively, you can also install directly from the GitHub main branch by using the following command.

pip install pip@git+https://github.com/hltcoe/PSQ

fast_psq works with ir_datasets and ir_measures quite well for accessing IR evaluation collections and evaluating results. You can install the two packages with the following command.

pip install ir_datasets ir_measures

Indexing

The indexing script takes a translation table (i.e., alignment matrix) and a document jsonl file. We release a number of them on Huggingface Model, which can be automatically downloaded in the script by placing the path in the --psq_file flag in the format of {repo_id}:{flie_path}. Alternatively, you can also pass in a local .json.gz file that contains a dictionary of dictionaries, mapping from source tokens (string) to target tokens (string) to alignment probabilities. However, the default tokenizer in the script uses mosestokenier, which may not match the one in your own alignment matrix. You should either use mosestokenier when aligning the bitext or replace the tokenizer with yours.

The document file should be a jsonl file with one document in each row. You can specify the field for document id, title, and body text by passing in the field name in the file through --docid, --title, and --body respectively. Alternatively, you can also use --doc_source with irds: as prefix to use a dataset in ir_datasets.

The following is an example indexing command.

python -m fast_psq.index \
--doc_file irds:neuclir/1/zh/trec-2022 \
--lang zh \
--psq_file hltcoe/psq_translation_tables:zh.table.dict.gz \
--min_translation_prob 0.00010 \
--max_translation_alternatives 64 \
--max_translation_cdf 0.99 \
--docid doc_id \
--title title \
--body text \
--min_translation_prob 1e-4 \
--max_translation_alternatives 64 \
--output_dir ./indexes/neuclir-zh.f32/ \
--compression \
--nworkers 64

Please use python -m fast_psq.index --help for more information about the arguments.

Searching

The search script takes the index and a tsv query file and output a TREC style result file. Similarly, we support ir_datasets as well with irds: prefix in both --query_source and --qrels arguments.

The following command is an example.

python -m fast_psq.search \
--query_source irds:neuclir/1/zh/trec-2022 \
--query_field title \
--index_dir ./indexes/neuclir-zh.f32/ \
--qrels irds:neuclir/1/zh/trec-2022 \
--query_lang en \
--output_file ./neuclir-zh.en.title.f32.trec

Please use python -m fast_psq.search --help for more information about the arguments.

Citation

@article{psq-repro,
    title = {Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval},
    author = {Eugene Yang and Suraj Nair and Dawn Lawrie and James Mayfield and Douglas W. Oard and Kevin Duh},
    journal = {arXiv preprint arXiv},
    year = {2024}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages