Several external resources have been employed in our experiments.
- Download resources from Zenodo
- Obtain remaining resources
- Resources file structure
- Additional information on shared resources
## Download resources from Zenodo

Many of the resources we use for our experiments can be downloaded from here. Download the compressed file `resources.zip` and unzip it. Our code assumes the following directory structure:
station-to-station/
├── ...
├── resources/
│ ├── deezymatch/
│ ├── geonames/
│ ├── geoshapefiles/
│ ├── quicks/
│ ├── ranklib/
│ ├── wikidata/
│ ├── wikigaz/
│ └── wikipedia/
└── ...
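As a rough guide, the download-and-unzip step can be scripted as in the sketch below. The Zenodo URL is a placeholder (use the actual link given above), and we assume the archive contains the top-level `resources/` folder, so it should be extracted from the repository root.

```python
import urllib.request
import zipfile

# Placeholder URL: substitute the actual Zenodo link from the section above.
ZENODO_URL = "https://zenodo.org/record/<record-id>/files/resources.zip"

# Download the archive into the station-to-station/ repository root.
urllib.request.urlretrieve(ZENODO_URL, "resources.zip")

# Assumption: the archive contains the top-level resources/ folder, so extracting
# it here yields station-to-station/resources/.
with zipfile.ZipFile("resources.zip") as zf:
    zf.extractall(".")
```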
Some of the directories will be empty, because we cannot share all the resources we used in our experiments. Please follow the instructions below to obtain the remaining files and store them in the right location.
## Obtain remaining resources

Download the GeoNames GB table, and store the unzipped file (`GB.txt`) under `resources/geonames/`. For reference, we have used the 2021-04-26 09:01 version in our experiments.
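If you prefer to script this step, the sketch below fetches the table from what we assume is the standard GeoNames dump location (double-check against the GeoNames download page) and extracts `GB.txt` into place.

```python
import urllib.request
import zipfile

# Assumption: the GB table lives at the standard GeoNames dump location.
url = "https://download.geonames.org/export/dump/GB.zip"
urllib.request.urlretrieve(url, "GB.zip")

# Keep only GB.txt, stored under resources/geonames/.
with zipfile.ZipFile("GB.zip") as zf:
    zf.extract("GB.txt", path="resources/geonames/")
```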
Download the alternateNamesV2 table, and store the unzipped files (`alternateNamesV2.txt` and `iso-languagecodes.txt`) under `resources/geonames/`. For reference, we have used the 2021-04-26 09:11 version in our experiments.
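The same approach works for the alternate names table (again assuming the standard GeoNames dump URL):

```python
import urllib.request
import zipfile

# Assumption: standard GeoNames dump location for the alternate names table.
url = "https://download.geonames.org/export/dump/alternateNamesV2.zip"
urllib.request.urlretrieve(url, "alternateNamesV2.zip")

# Extract the two files the code expects into resources/geonames/.
with zipfile.ZipFile("alternateNamesV2.zip") as zf:
    zf.extract("alternateNamesV2.txt", path="resources/geonames/")
    zf.extract("iso-languagecodes.txt", path="resources/geonames/")
```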
Download the Boundary-Line™ ESRI Shapefile from https://osdatahub.os.uk/downloads/open/BoundaryLine (see licence). Unzip it and copy the following files under the `resources/geoshapefiles/` folder:
- `Data/Supplementary_Country/country_region.dbf`
- `Data/Supplementary_Country/country_region.prj`
- `Data/Supplementary_Country/country_region.shp`
- `Data/Supplementary_Country/country_region.shx`
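Copying the four shapefile components can be scripted as below; `BOUNDARY_LINE_DIR` is a placeholder for wherever you unzipped the Boundary-Line download.

```python
import shutil
from pathlib import Path

# Placeholder: the folder where you unzipped the Boundary-Line download.
BOUNDARY_LINE_DIR = Path("path/to/boundary-line")

target = Path("resources/geoshapefiles")
target.mkdir(parents=True, exist_ok=True)

# Copy the four country_region components listed above.
for ext in ("dbf", "prj", "shp", "shx"):
    src = BOUNDARY_LINE_DIR / "Data" / "Supplementary_Country" / f"country_region.{ext}"
    shutil.copy(src, target / src.name)
```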
Download the RankLib `.jar` file from the Lemur project RankLib page and store it in `resources/ranklib/`. In our experiments, we have used version 2.13, available here. If this version is no longer available, we suggest getting the most recent binary from here.
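If you want to script this step too, a minimal sketch is given below; `RANKLIB_URL` is a placeholder for the actual download link from the Lemur project page.

```python
import urllib.request
from pathlib import Path

# Placeholder: the RankLib-2.13 download link from the Lemur project page.
RANKLIB_URL = "https://example.org/RankLib-2.13.jar"

target = Path("resources/ranklib")
target.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(RANKLIB_URL, str(target / "RankLib-2.13.jar"))
```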
Download a full Wikidata dump from here and store the `latest-all.json.bz2` file in `resources/wikidata/`.
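A minimal sketch of this step, assuming the standard Wikimedia dumps URL for the latest full Wikidata dump (note that the compressed file is tens of GB, so this will take a while):

```python
import urllib.request
from pathlib import Path

# Assumption: the standard Wikimedia dumps URL for the latest full Wikidata dump.
url = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

target = Path("resources/wikidata")
target.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(url, str(target / "latest-all.json.bz2"))
```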
## Resources file structure

After following the steps above, you should have the following file structure:
station-to-station/
├── ...
├── resources/
│ ├── deezymatch/
│ │ ├── characters_v001.vocab
│ │ └── input_dfm.yaml
│ ├── geonames/
│ │ ├── alternateNamesV2.txt
│ │ ├── GB.txt
│ │ └── iso-languagecodes.txt
│ ├── geoshapefiles/
│ │ ├── country_region.dbf
│ │ ├── country_region.prj
│ │ ├── country_region.shp
│ │ └── country_region.shx
│ ├── quicks/
│ │ ├── annotations.tsv
│ │ ├── companies.tsv
│ │ ├── index2map.tsv
│ │ ├── quicks_altname_dev.tsv
│ │ ├── quicks_altname_test.tsv
│ │ ├── quicks_dev.tsv
│ │ └── quicks_test.tsv
│ ├── ranklib/
│ │ ├── features.txt
│ │ └── RankLib-2.13.jar
│ ├── wikidata/
│ │ └── latest-all.json.bz2
│ ├── wikigaz/
│ │ └── wikigaz_en_basic.pkl
│ └── wikipedia/
│ └── overall_entity_freq.pickle
└── ...
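As a quick sanity check, the following sketch (based only on the file names in the tree above) verifies that everything is in place:

```python
from pathlib import Path

# Expected files, relative to resources/, taken from the directory tree above.
expected = [
    "deezymatch/characters_v001.vocab",
    "deezymatch/input_dfm.yaml",
    "geonames/alternateNamesV2.txt",
    "geonames/GB.txt",
    "geonames/iso-languagecodes.txt",
    "geoshapefiles/country_region.dbf",
    "geoshapefiles/country_region.prj",
    "geoshapefiles/country_region.shp",
    "geoshapefiles/country_region.shx",
    "quicks/annotations.tsv",
    "quicks/companies.tsv",
    "quicks/index2map.tsv",
    "quicks/quicks_altname_dev.tsv",
    "quicks/quicks_altname_test.tsv",
    "quicks/quicks_dev.tsv",
    "quicks/quicks_test.tsv",
    "ranklib/features.txt",
    "ranklib/RankLib-2.13.jar",
    "wikidata/latest-all.json.bz2",
    "wikigaz/wikigaz_en_basic.pkl",
    "wikipedia/overall_entity_freq.pickle",
]

missing = [f for f in expected if not (Path("resources") / f).exists()]
print("All resources in place!" if not missing else f"Missing: {missing}")
```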
## Additional information on shared resources

In this section, we provide additional information on the resources that we share via Zenodo.
The DeezyMatch input file and vocabulary file have been adapted from the original files (which can be found in the DeezyMatch GitHub repository).
We are providing the following datasets used for the experiments on parsing the Chronology and linking it to Wikidata:
- `annotations.tsv`: this file contains the manual annotations performed by experts in our team.
- `companies.tsv`: this file is also manually curated; it links companies (free-text strings) identified in the Chronology to their Wikidata ID.
- `index2map.tsv`: this file contains the mapping between the Chronology map ID and the places represented in the maps (manually obtained from an appendix in the Chronology).
- `quicks_dev.tsv`: the development set, consisting of 217 entries from the Chronology that have been parsed and manually annotated with the corresponding Wikidata entry ID (obtained from running the code to parse the Chronology document).
- `quicks_test.tsv`: the test set, consisting of 219 entries from the Chronology that have been parsed and manually annotated with the corresponding Wikidata entry ID (obtained from running the code to parse the Chronology document).
- `quicks_altname_dev.tsv`: additional alternate names found in the Chronology for the entries in `quicks_dev.tsv` (obtained from running the code to parse the Chronology document).
- `quicks_altname_test.tsv`: additional alternate names found in the Chronology for the entries in `quicks_test.tsv` (obtained from running the code to parse the Chronology document).
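A minimal sketch for inspecting the dev and test splits, assuming they are standard tab-separated files with a header row:

```python
import pandas as pd

# Assumption: tab-separated files with a header row; adjust read_csv arguments otherwise.
dev = pd.read_csv("resources/quicks/quicks_dev.tsv", sep="\t")
test = pd.read_csv("resources/quicks/quicks_test.tsv", sep="\t")

print(dev.shape, test.shape)   # expected: 217 and 219 entries respectively
print(dev.columns.tolist())    # inspect the available columns
```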
We share a minimal version of the English WikiGazetteer (`wikigaz_en_basic.pkl`). You can generate the complete WikiGazetteer from scratch following the instructions here, and obtain the minimal version used in our experiments by running the code here.
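A minimal loading sketch, assuming the `.pkl` file unpickles into a pandas DataFrame (`pd.read_pickle` loads arbitrary pickled objects, so it also works if it is something else):

```python
import pandas as pd

# Assumption: the pickle holds a pandas DataFrame with the minimal gazetteer.
wikigaz = pd.read_pickle("resources/wikigaz/wikigaz_en_basic.pkl")
print(type(wikigaz))
```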
We share a pickled `Counter` object (`overall_entity_freq.pickle`) that maps Wikipedia pages to their number of inlinks (e.g. `Archway, London` has 64 inlinks and `London` has 75678 inlinks), a common measure of entity relevance.
You can generate this file from scratch by following our code to process a Wikipedia dump, which extracts and structures pages, mention/entity statistics, and in-/out-link information; see the instructions here.
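A minimal sketch for loading and querying the counts; the exact key format (page titles with spaces vs. underscores) is an assumption, so inspect a few keys to confirm:

```python
import pickle

with open("resources/wikipedia/overall_entity_freq.pickle", "rb") as f:
    entity_freq = pickle.load(f)  # a collections.Counter: Wikipedia page -> inlink count

print(entity_freq.most_common(5))          # most-linked pages overall
# Assumption about key format: the examples above suggest plain page titles.
print(entity_freq.get("Archway, London"))
```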