Code for the paper A Large Harvested Corpus of Location Metonymy published in LREC 2020.
- Generate samples
The scripts are available in the directory harvest/
.
First, generate metonymic pairs with the command:
$ python -u gen_metpairs.py -disamb_file ./disambiguation_page_titles -vehicles 'PopulatedPlace' -targets 'Q3918'
where disamb_file
is a file consisting of titles, one per line, of Wikipedia disambiguation pages.
This command extracts metonymic pairs of the form <vehicle>-for-<target>
from the offline version (XML dumps) and the online version (MediaWiki).
Check out
here,
here
and
here for different types of categories that can be used as vehicles and targets.
Then generate samples using the command:
$ python gen_samples.py -directory ./
where directory
denotes the directory having the output of list of metonymic pairs processed by process-pairs.sh
.
This command generates the annotated samples in XML format.
- Run IMM and PreWin baselines
The baseline implementation is based on Minimalist Location Metonymy Resolution published at ACL 2017. The scripts are available in the directories glove/
and bert/
.
First create pickle files for each annotated file with the command:
$ python get_pickle.py -c imm -f filepath
Then train and test the LSTM model using the command:
$ python get_results.py -c imm -w 5 -d directorypath
where directorypath
denotes the path to the directory containing the pickle files. Repeat the same for PreWin for each word embedding. We have provided a few annotated files alongside to play with. Check Minimalist Location Metonymy Resolution on how get GloVe embeddings. We use pytorch-pretrained-bert v0.4.0 for generating BERT embeddings.
@inproceedings{lrec20-wimcor,
author = {Mathews, Kevin Alex and Strube, Michael},
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2020)},
publisher = {European Languages Resources Association (ELRA)},
title = {A Large Harvested Corpus of Location Metonymy},
year = {2020}
}