Resources

Several external resources have been employed in our experiments.

Table of contents

  • Download resources from Zenodo
  • Obtain remaining resources
  • Resources file structure
  • Additional information on shared resources

Download resources from Zenodo

Many of the resources we use for our experiments can be downloaded from here.

Download the compressed file resources.zip and unzip it. Our code assumes the following directory structure:

station-to-station/
├── ...
├── resources/
│   ├── deezymatch/
│   ├── geonames/
│   ├── geoshapefiles/
│   ├── quicks/
│   ├── ranklib/
│   ├── wikidata/
│   ├── wikigaz/
│   └── wikipedia/
└── ...
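
If you prefer to script this step, the following is a minimal sketch of downloading and unpacking the archive from the repository root; `RESOURCES_URL` is a placeholder for the Zenodo download link above, not a real URL.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Placeholder: substitute the resources.zip link from the Zenodo record above.
RESOURCES_URL = "https://zenodo.org/record/<record-id>/files/resources.zip"

# Run from the station-to-station repository root, so that resources/
# ends up alongside the code as in the tree above.
zip_path = Path("resources.zip")
urlretrieve(RESOURCES_URL, str(zip_path))
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(".")
```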

Some of the directories will be empty, because we cannot share all the resources we used in our experiments. Please follow the instructions below to obtain the remaining files and store them in the right location.

Obtain remaining resources

Geonames

Download the GB table, and store the unzipped file (GB.txt) under resources/geonames/.

For reference, we have used the 2021-04-26 09:01 version in our experiments.

Download the alternateNamesV2 table, and store the unzipped files (alternateNamesV2.txt and iso-languagecodes.txt) under resources/geonames/.

For reference, we have used the 2021-04-26 09:11 version in our experiments.
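
Both GeoNames tables are plain tab-separated files without a header row. As a rough illustration, they can be loaded as below; the column names are taken from the GeoNames readme and should be treated as an assumption about your dump.

```python
import csv
import pandas as pd

# Column names as documented in the GeoNames readme (assumption: unchanged in your dump).
GB_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]
ALTNAMES_COLUMNS = [
    "alternateNameId", "geonameid", "isolanguage", "alternate_name",
    "isPreferredName", "isShortName", "isColloquial", "isHistoric", "from", "to",
]

# quoting=csv.QUOTE_NONE avoids breaking on stray quote characters in place names.
gb = pd.read_csv("resources/geonames/GB.txt", sep="\t", header=None,
                 names=GB_COLUMNS, quoting=csv.QUOTE_NONE, low_memory=False)
altnames = pd.read_csv("resources/geonames/alternateNamesV2.txt", sep="\t", header=None,
                       names=ALTNAMES_COLUMNS, quoting=csv.QUOTE_NONE, low_memory=False)
print(gb.shape, altnames.shape)
```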

Geoshapefiles

Download the Boundary-Line™ ESRI Shapefile from https://osdatahub.os.uk/downloads/open/BoundaryLine (see licence). Unzip it and copy the following files under the resources/geoshapefiles/ folder (a minimal check of the copied files is sketched after the list):

  • Data/Supplementary_Country/country_region.dbf
  • Data/Supplementary_Country/country_region.prj
  • Data/Supplementary_Country/country_region.shp
  • Data/Supplementary_Country/country_region.shx
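
To sanity-check the copied shapefile, a minimal sketch using geopandas (an example library choice, not a stated dependency of this step) would be:

```python
import geopandas as gpd

# Reads country_region.shp together with its .dbf/.prj/.shx sidecar files.
countries = gpd.read_file("resources/geoshapefiles/country_region.shp")
print(countries.crs)     # Boundary-Line data is normally in British National Grid (EPSG:27700)
print(countries.head())
```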

Ranklib

Download the RankLib .jar file from the Lemur project RankLib page and store it in resources/ranklib/. In our experiments, we have used version 2.13, available here. If this version is no longer available, we suggest getting the most recent binary from here.
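
RankLib is a Java tool invoked as `java -jar`; the snippet below only checks that the jar is in place and runnable (it prints RankLib's usage message) and assumes a Java runtime is on your PATH.

```python
import subprocess

# Running the jar without arguments prints the RankLib usage message.
subprocess.run(["java", "-jar", "resources/ranklib/RankLib-2.13.jar"], check=False)
```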

Wikidata

Download a full Wikidata dump from here and store the latest-all.json.bz2 file in resources/wikidata/.
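
The dump is very large (tens of GB compressed), so it is normally processed as a stream rather than loaded in one go. The following is a minimal sketch of iterating over it, not our actual parsing code:

```python
import bz2
import json

path = "resources/wikidata/latest-all.json.bz2"

# The dump is one large JSON array with one entity per line; skip the
# surrounding brackets and strip trailing commas before parsing each line.
with bz2.open(path, "rt") as f:
    for i, line in enumerate(f):
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
        if i > 5:  # only peek at the first few entities
            break
```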

Resources file structure

After following the steps above, you should have the following file structure:

station-to-station/
├── ...
├── resources/
│   ├── deezymatch/
│   │   ├── characters_v001.vocab
│   │   └── input_dfm.yaml
│   ├── geonames/
│   │   ├── alternateNamesV2.txt
│   │   ├── GB.txt
│   │   └── iso-languagecodes.txt
│   ├── geoshapefiles/
│   │   ├── country_region.dbf
│   │   ├── country_region.prj
│   │   ├── country_region.shp
│   │   └── country_region.shx
│   ├── quicks/
│   │   ├── annotations.tsv
│   │   ├── companies.tsv
│   │   ├── index2map.tsv
│   │   ├── quicks_altname_dev.tsv
│   │   ├── quicks_altname_test.tsv
│   │   ├── quicks_dev.tsv
│   │   └── quicks_test.tsv
│   ├── ranklib/
│   │   ├── features.txt
│   │   └── RankLib-2.13.jar
│   ├── wikidata/
│   │   └── latest-all.json.bz2
│   ├── wikigaz/
│   │   └── wikigaz_en_basic.pkl
│   └── wikipedia/
│       └── overall_entity_freq.pickle
└── ...

Additional information on shared resources

In this section, we provide additional information on the resources that we share via Zenodo.

DeezyMatch

The DeezyMatch input file (input_dfm.yaml) and vocabulary file (characters_v001.vocab) have been adapted from the original files, which can be found in the DeezyMatch GitHub repository.
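
For orientation only, this is roughly how the shared input file would be passed to DeezyMatch's train function; the dataset path and model name below are placeholders, not the values used in our pipeline.

```python
from DeezyMatch import train as dm_train

# Sketch only: trains a DeezyMatch model using the shared input file.
# dataset_path and model_name are placeholders for illustration.
dm_train(
    input_file_path="resources/deezymatch/input_dfm.yaml",
    dataset_path="path/to/string_pairs_dataset.txt",
    model_name="example_model",
)
```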

Quicks

We provide the following datasets, used in the experiments on parsing the Chronology and linking it to Wikidata (a minimal loading sketch follows the list):

  • annotations.tsv: this file contains the manual annotations performed by experts in our team.
  • companies.tsv: this file is also manually curated; it links companies (free-text strings) identified in the Chronology to their Wikidata ID.
  • index2map.tsv: this file contains the mapping between the Chronology map id and the places represented in the maps (manually obtained from an appendix in the Chronology).
  • quicks_dev.tsv: the development set, consisting of 217 entries from the Chronology that have been parsed and manually annotated with the corresponding Wikidata entry ID (obtained from running the code to parse the Chronology document).
  • quicks_test.tsv: the test set, consisting of 219 entries from the Chronology that have been parsed and manually annotated with the corresponding Wikidata entry ID (obtained from running the code to parse the Chronology document).
  • quicks_altname_dev.tsv: additional alternate names found in the Chronology for the entries in quicks_dev.tsv (obtained from running the code to parse the Chronology document).
  • quicks_altname_test.tsv: additional alternate names found in the Chronology for the entries in quicks_test.tsv (obtained from running the code to parse the Chronology document).
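
The files are plain tab-separated tables. A minimal way to inspect them with pandas, assuming each TSV has a header row and without assuming particular column names:

```python
import pandas as pd

# Load the development set and its additional alternate names, and inspect
# the columns rather than assuming them.
quicks_dev = pd.read_csv("resources/quicks/quicks_dev.tsv", sep="\t")
altnames_dev = pd.read_csv("resources/quicks/quicks_altname_dev.tsv", sep="\t")
print(quicks_dev.shape, list(quicks_dev.columns))
print(altnames_dev.shape, list(altnames_dev.columns))
```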

Wikigaz

We share a minimal version of the English WikiGazetteer (wikigaz_en_basic.pkl). You can generate the complete WikiGazetteer from scratch by following the instructions here, and obtain the minimal version used in our experiments by running the code here.
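
The gazetteer is shared as a pickle. A minimal loading sketch is shown below; it does not assume a particular structure, so inspect the loaded object before relying on specific columns or keys.

```python
import pandas as pd

# read_pickle loads any pickled Python object, DataFrame or otherwise;
# check the type before assuming a tabular structure.
wikigaz = pd.read_pickle("resources/wikigaz/wikigaz_en_basic.pkl")
print(type(wikigaz))
```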

Wikipedia inlinks

We share a pickled Counter object (overall_entity_freq.pickle) that maps Wikipedia pages to the number of inlinks (e.g. Archway, London has 64 inlinks and London has 75678 inlinks), a common measure of entity relevance.
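
Since the file is a pickled collections.Counter keyed by page title, it can be queried directly; the titles below are the examples mentioned above, and the exact key format (spaces vs. underscores) is an assumption worth checking on your copy.

```python
import pickle

with open("resources/wikipedia/overall_entity_freq.pickle", "rb") as f:
    inlink_counts = pickle.load(f)  # collections.Counter: page title -> number of inlinks

# Assumption: keys are page titles spelled as above; they may use underscores instead of spaces.
print(inlink_counts["London"])
print(inlink_counts["Archway, London"])
print(inlink_counts.most_common(5))  # most heavily inlinked pages
```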

You can generate this table with our code for processing a Wikipedia dump from scratch, which extracts and structures pages, mention/entity statistics, and in-/out-link information; follow the instructions here.