This is the repository for our benchmark Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations.
Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. The bulk of the evaluation of these models is, however, performed with English text only: the costly creation of language-specific image-caption datasets has limited multilingual VL benchmarks to a handful of high-resource languages.
We introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of 1000 ImageNet labels to over 100 languages, built without resorting to machine translation (MT) or requiring manual annotation. We instead automatically obtain reliable translations of ImageNet concepts by linking them -- via shared WordNet synsets -- to BabelNet, a massively multilingual lexico-semantic network.
We evaluate 13 different publicly available multilingual CLIP models on zero-shot image classification (ZS-IC) for each of the 100 Babel-ImageNet languages chosen for analysis, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance on Babel-ImageNet highly correlates with their performance in image-text retrieval, validating that Babel-ImageNet is suitable for estimating the quality of the multilingual VL representation spaces for the vast majority of languages that lack gold image-text data.
We benchmarked (pretty much) all public multilingual CLIP models on Babel-ImageNet and on three multilingual image-text retrieval datasets. Raw results are here.
We prepared a notebook for easy browsing and analysis of the results.
We list the required packages in requirements.txt. Both newer and older versions will probably work, but fall back to the specified versions if you run into problems.
We release the Babel-ImageNet labels here. The JSON is a dictionary mapping each ISO language code to a tuple with 1) the indices of classes as they appear in ImageNet-1k and 2) the class label names.
We also release the prompts translated with NLLB-1.3b-distilled here.
Babel-ImageNet includes only the labels for the ImageNet classes - you need to download the images yourself.
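For illustration, here is a minimal sketch of loading and inspecting the labels described above (the file path is an assumption; point it to wherever you saved the released JSON):

import json

# Load the released Babel-ImageNet labels (path is a placeholder).
with open("babel_imagenet.json", "r", encoding="utf-8") as f:
    babel_imagenet = json.load(f)

# Each entry maps an ISO language code to (ImageNet-1k class indices, class label names).
class_idxs, class_labels = babel_imagenet["de"]  # e.g., German
print(f"German labels cover {len(class_idxs)} of the 1000 ImageNet classes")
print(class_idxs[0], class_labels[0])  # an ImageNet-1k class index and its German label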
Labels and prompts can be used in your code as a (nearly) drop-in replacement for standard ImageNet zero-shot evaluation with OpenAI's labels and prompts. You only need to take care to process only the images belonging to the subset of classes covered by a given language - see the class BabelImageNet for an example of how to do this with the torchvision ImageNet dataset.
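To illustrate the idea, here is a small sketch of such a subset dataset (this is not the repo's actual BabelImageNet class; class_idxs is the index list loaded from the labels JSON):

from torchvision.datasets import ImageNet

class BabelImageNetSubset(ImageNet):
    # Illustrative only: restrict torchvision's ImageNet to the classes available
    # for one Babel-ImageNet language and remap the targets so they align with
    # the order of that language's label list.
    def __init__(self, root, class_idxs, split="val", **kwargs):
        super().__init__(root, split=split, **kwargs)
        keep = set(class_idxs)
        remap = {orig: new for new, orig in enumerate(class_idxs)}
        self.samples = [(path, remap[target]) for path, target in self.samples if target in keep]
        self.targets = [target for _, target in self.samples]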
We offer the option to evaluate models on image-text retrieval with XTD, XM3600, and xFlickrCo, but you need to download the images (MSCOCO, Flickr30k, XM3600) yourself.
run_eval.py and run_retrieval.py are simple CLI tools to evaluate your model:
python run_eval.py --imagenet_folder=$imagenet_folder --prompts="label,nllb_dist13b_prompts" --languages="298" \
--num_workers=4 --batch_size=512 \
--source="openclip" --from_pretrained="xlm-roberta-base-ViT-B-32@laion5b_s13b_b90k" \
--out_file="results/babel-imagenet/openclip-xlmrb-vitb32"
python run_retrieval.py --image_folder=$image_folder --dataset_file="./data/xm3600.json"\
--num_workers=4 --batch_size=512 \
--source="openclip" --from_pretrained="xlm-roberta-base-ViT-B-32@laion5b_s13b_b90k" \
--out_file="results/retrieval/xm3600/openclip-xlmrb-vitb32"
In evaluation_scripts, we provide scripts to replicate the evaluation of all models we tested.
Our code is easy to extend for new models:
- HuggingFace and open_clip models are supported out of the box with --source="huggingface"|"openclip" and --from_pretrained="$huggingface_model"|"$openclip_model@$pretrained".
- Other models have to implement the CLIP interface and add themselves to get_model() here; see the sketch below for the general idea.
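The actual interface may differ, so check the linked code, but a wrapper for a new model could look roughly like this (the method names and the registration in get_model() are assumptions, not the repo's API):

import torch

class MyCLIPWrapper:
    # Hypothetical wrapper around a custom dual-encoder model.
    def __init__(self, model, tokenizer, preprocess):
        self.model = model            # your vision-and-language model
        self.tokenizer = tokenizer    # text tokenizer
        self.preprocess = preprocess  # image transform used by the dataloaders

    @torch.no_grad()
    def encode_text(self, texts, device="cuda"):
        tokens = self.tokenizer(texts).to(device)
        features = self.model.encode_text(tokens)
        return features / features.norm(dim=-1, keepdim=True)

    @torch.no_grad()
    def encode_image(self, images, device="cuda"):
        features = self.model.encode_image(images.to(device))
        return features / features.norm(dim=-1, keepdim=True)

You would then return an instance of this wrapper from get_model() for a new --source value (again an assumption about the registration mechanism; see the linked code for the actual signature).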
Labels
Our data creation script can be found here.
We use the RPC mode of BabelNet; see here for more details on how to request the data and set up the environment.
If you want to create labels for additional languages, simply adapt the language list used in the script.
Prompts
For machine-translated prompts, see this script on how the prompts were translated.
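If you want to translate prompts yourself, a rough sketch with the HuggingFace transformers implementation of NLLB could look like this (this is not the repo's script; the target language code and generation settings are just examples):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate one English prompt template to German (deu_Latn) as an example.
inputs = tokenizer("a photo of a {}.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])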
We have currently no plans to release training code because we use an internal, not-yet-released framework. However, we are happy to help if you have questions about implementation details - simply open an issue.
Babel-ImageNet is a processed version of BabelNet v5.2 downloaded from https://babelnet.org, made available under the BabelNet Non-Commercial License (see https://babelnet.org/full-license).
Our code is licensed under the MIT license.
If you find this benchmark helpful, please cite the following publication:
@article{geigle2023babelimagenet,
author = {Gregor Geigle and
Radu Timofte and
Goran Glava\v{s}},
title = {{B}abel-{I}mage{N}et: Massively Multilingual Evaluation of Vision-and-Language Representations},
journal = {arXiv},
volume = {abs/2306.08658},
year = {2023},
url = {https://arxiv.org/abs/2306.08658},
eprinttype = {arXiv},
eprint = {2306.08658},
}
Also consider citing the following:
@inproceedings{babelnet,
author = {Roberto Navigli and
Simone Paolo Ponzetto},
editor = {Jan Hajic and
Sandra Carberry and
Stephen Clark},
title = {BabelNet: Building a Very Large Multilingual Semantic Network},
booktitle = {{ACL} 2010, Proceedings of the 48th Annual Meeting of the Association
for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden},
pages = {216--225},
publisher = {The Association for Computer Linguistics},
year = {2010},
url = {https://aclanthology.org/P10-1023/},
}
@inproceedings{imagenet,
author = {Jia Deng and
Wei Dong and
Richard Socher and
Li{-}Jia Li and
Kai Li and
Li Fei{-}Fei},
title = {ImageNet: {A} large-scale hierarchical image database},
booktitle = {2009 {IEEE} Computer Society Conference on Computer Vision and Pattern
Recognition {(CVPR} 2009), 20-25 June 2009, Miami, Florida, {USA}},
pages = {248--255},
publisher = {{IEEE} Computer Society},
year = {2009},
url = {https://doi.org/10.1109/CVPR.2009.5206848},
doi = {10.1109/CVPR.2009.5206848},
}