MELArt. A Multimodal Entity Linking Dataset for Art

Code for the generation of MELArt, a Multimodal Entity Linking Dataset for Art.

The code for the baseline experiments and for generating model-specific versions of the dataset can be found here

Pre-requisites

Configuration

  1. Create a .env file (you can use .env_sample as a template) and set the access token for the Wikimedia API and the user agent (a sketch of reading these settings appears after this list).
  2. Install the required libraries. The easiest way is to use the provided conda environment (environment.yaml), e.g. conda env create -f environment.yaml.
  3. Install the spaCy English model: python -m spacy download en_core_web_sm
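As a minimal sketch of how these settings can be read at runtime (the key names below are assumptions; check .env_sample for the actual ones used by the scripts):

```python
# Minimal sketch: load the .env file and read the expected settings.
# The key names are hypothetical; .env_sample defines the actual ones.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

wikimedia_token = os.environ["WIKIMEDIA_ACCESS_TOKEN"]  # hypothetical key
user_agent = os.environ["WIKIMEDIA_USER_AGENT"]         # hypothetical key

# Headers for authenticated Wikimedia API requests.
headers = {
    "Authorization": f"Bearer {wikimedia_token}",
    "User-Agent": user_agent,
}
```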

Input files and services

  1. Download the Artpedia dataset from https://aimagelab.ing.unimore.it/imagelab/page.asp?IdPage=35 and place the artpedia.json file in the input_files/ folder.
  2. Set up a QLever instance and import the Wikidata dumps. For further information on how to set up QLever, please refer to the QLever documentation. To reproduce our results, use the following dumps:
    • latest-all.ttl.bz2 with the timestamp 2024-09-02T23:00:01Z
    • latest-lexemes.ttl.bz2 with the timestamp 2024-09-06T23:00:01Z
  3. Configure the QLever HTTP URL (e.g. http://localhost:7001) in the .env file (a query sketch is shown after this list).
  4. Download the English Wikipedia dumps for these two tables and place them in the input_files/enwiki folder:
    • enwiki-20240901-page.sql.gz
    • enwiki-20240901-redirect.sql.gz
  5. Set up a Solr instance and create a core that accepts autoCreateFields. To reproduce our results, use Solr version 9.7.0. Typically the core is created with the following command:
solr create -c <core_name>
  6. Configure the Solr core URL (e.g. http://localhost:8983/solr/<core_name>) in the .env file (a connectivity check is sketched below).
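As a hedged sketch of how the configured QLever endpoint can be queried (the query and entity are purely illustrative, not the queries used by the pipeline):

```python
# Minimal sketch: run a SPARQL query against the local QLever endpoint.
import requests

QLEVER_URL = "http://localhost:7001"  # as configured in the .env file

# Illustrative query: the English label of the Mona Lisa (wd:Q12418).
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  wd:Q12418 rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""

resp = requests.get(
    QLEVER_URL,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["label"]["value"])
```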

You can also avoid having the input_files/ folder by adjusting the paths in the paths.py script.
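Before running the pipeline, it can help to verify that the Solr core is reachable. A minimal sketch (the core name is a placeholder):

```python
# Minimal sketch: check that the Solr core configured in .env responds.
import requests

SOLR_CORE_URL = "http://localhost:8983/solr/melart"  # hypothetical core name

resp = requests.get(f"{SOLR_CORE_URL}/admin/ping", params={"wt": "json"})
resp.raise_for_status()
print(resp.json()["status"])  # prints "OK" when the core is healthy
```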

Dataset generation

Execute the following scripts in sequence to generate the dataset.

  1. convert_wikipedia_tables.sh: Converts the Wikipedia SQL table dumps to CSV files. The output is stored in the aux_files/ folder.

  2. art_merging.py: Matches Artpedia paintings to Wikidata entities using the Wikipedia title, and extracts painting information from Wikidata.

  3. text_matcher.py: Matches the labels of the depicted entities in the visual and contextual sentences (a matching sketch is shown after this list).

  4. get_candidates.py: Gets the candidates for the depicted entities in the visual and contextual sentences, using Solr as a full-text search engine (a query sketch is also shown after this list). It creates a mention-candidates dictionary in the aux_files/dict_candidates.json file and, for each candidate, a JSON file with its information in the aux_files/el_candidates folder.

  5. get_img_urls.py: Lists all the Wikimedia Commons file names or Wikipedia HTTP URLs needed to download the images.

  6. crawl_images.py: Crawls the images from Wikimedia Commons and Wikipedia based on the imgs_url.txt file (produced by get_img_urls.py).

  7. filter_candidate_images.py: Removes the candidate images that correspond to the paintings in MELArt.

  8. combine_curated_annotations.py: Combines the automatically generated annotations with the manually curated annotations to produce the final dataset in the output_files/melart_annotations.json file.

  9. concat_candidates.py: Concatenates all the candidate files into a single el_candidates.jsonl file in the output_files folder.
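The exact matching logic of text_matcher.py is not described here; as a hedged illustration of the general technique, a spaCy PhraseMatcher can locate entity labels inside a sentence (the labels and the sentence are invented examples):

```python
# Minimal sketch: find depicted-entity labels inside a caption sentence.
# The labels and the sentence are invented; text_matcher.py may differ.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

labels = ["Virgin Mary", "John the Baptist"]  # illustrative entity labels
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("DEPICTED", [nlp.make_doc(label) for label in labels])

doc = nlp("The painting shows the Virgin Mary with John the Baptist.")
for _, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, span.start_char, span.end_char)
```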
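For the candidate retrieval step, a minimal sketch of a Solr full-text query (the core URL, the query field, and the mention are assumptions; the fields available depend on what was indexed into the core):

```python
# Minimal sketch: retrieve entity candidates for a mention via Solr.
# The core name and the "label" field are hypothetical.
import requests

SOLR_CORE_URL = "http://localhost:8983/solr/melart"  # hypothetical core name

def search_candidates(mention: str, rows: int = 10) -> list[dict]:
    """Full-text search for candidate entities matching a mention string."""
    resp = requests.get(
        f"{SOLR_CORE_URL}/select",
        params={"q": f'label:"{mention}"', "rows": rows, "wt": "json"},
    )
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

print(search_candidates("Virgin Mary"))
```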
