Wikipedia Metonymy Corpus

Code for the paper "A Large Harvested Corpus of Location Metonymy", published at LREC 2020.

Data

WiMCor

Run the code

Generate samples

The scripts are available in the directory harvest/.

First, generate metonymic pairs with the command:

$ python -u gen_metpairs.py -disamb_file ./disambiguation_page_titles -vehicles 'PopulatedPlace' -targets 'Q3918'

where disamb_file is a file containing the titles of Wikipedia disambiguation pages, one per line. This command extracts metonymic pairs of the form <vehicle>-for-<target> from both the offline version (XML dumps) and the online version (MediaWiki) of Wikipedia. Check out here, here and here for the different types of categories that can be used as vehicles and targets.
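As a rough illustration of the expected input, the sketch below reads a file in the one-title-per-line layout described above. The helper name `load_titles` is hypothetical and not part of the repository; skipping blank lines and dropping duplicates are assumptions made for the example, not documented behaviour of gen_metpairs.py.

```python
from pathlib import Path


def load_titles(path):
    """Return the titles in `path`, one per line, in file order.

    Assumption: blank lines are ignored and repeated titles are kept
    only once. This is an illustrative helper, not repository code.
    """
    seen = set()
    titles = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        title = line.strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles
```

Running such a check before calling gen_metpairs.py can catch an empty or wrongly encoded title file early.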

Then generate samples using the command:

$ python gen_samples.py -directory ./

where directory is the directory containing the list of metonymic pairs after it has been processed by process-pairs.sh. This command generates the annotated samples in XML format.

Run IMM and PreWin baselines

The baseline implementation is based on "Minimalist Location Metonymy Resolution", published at ACL 2017. The scripts are available in the directories glove/ and bert/.

First create pickle files for each annotated file with the command:

$ python get_pickle.py -c imm -f filepath

Then train and test the LSTM model using the command:

$ python get_results.py -c imm -w 5 -d directorypath

where directorypath is the path to the directory containing the pickle files. Repeat the same steps for PreWin and for each word embedding. We have provided a few annotated files alongside the code to experiment with. See Minimalist Location Metonymy Resolution for how to get the GloVe embeddings. We use pytorch-pretrained-bert v0.4.0 to generate the BERT embeddings.
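Since get_pickle.py is run once per annotated file, the step can be scripted. The sketch below only builds the command lines, using the `-c` and `-f` flags shown above. The helper name `pickle_commands` is hypothetical, and the `*.xml` pattern is an assumption based on the samples being generated in XML format.

```python
import sys
from pathlib import Path


def pickle_commands(directory, classifier="imm", pattern="*.xml"):
    """Build one get_pickle.py command per annotated file in `directory`.

    Assumptions: annotated files match `pattern`, and the script is
    started from the directory that contains get_pickle.py.
    """
    commands = []
    for path in sorted(Path(directory).glob(pattern)):
        commands.append(
            [sys.executable, "get_pickle.py", "-c", classifier, "-f", str(path)]
        )
    return commands


# Possible usage (not executed here):
#   import subprocess
#   for cmd in pickle_commands("./annotated"):
#       subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to print and inspect them first.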

Cite the paper

@inproceedings{lrec20-wimcor,
author    = {Mathews, Kevin Alex and Strube, Michael},
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2020)},
publisher = {European Language Resources Association (ELRA)},
title     = {A Large Harvested Corpus of Location Metonymy},
year      = {2020}
}

License

GNU GPLv3
