The FooDI-ML dataset is offered under the BY-NC-SA license.
The FooDI-ML dataset is hosted in an S3 bucket on AWS, so the AWS CLI is needed to download it. Our dataset is composed of:
- One DataFrame (glovo-foodi-ml-dataset) stored as a csv file containing all text information and the image paths in S3. The size of this CSV file is 540 MB.
- A set of images listed in the DataFrame. The disk space required to store all images is 316.1 GB.
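Since the images alone need 316.1 GB, it can help to check the free space on the target disk before starting the download. The following is a minimal sketch: the helper name `has_enough_space`, the 5% safety margin, and the reading of MB/GB as decimal units are assumptions made for illustration, not part of the dataset tooling.

```python
import shutil

# Sizes quoted in the dataset description, read as decimal units (assumption).
CSV_SIZE_BYTES = 540 * 10**6         # 540 MB
IMAGES_SIZE_BYTES = 316_100 * 10**6  # 316.1 GB


def has_enough_space(free_bytes: int, margin: float = 0.05) -> bool:
    """Return True if free_bytes covers the CSV plus the images.

    The default 5% margin is an arbitrary assumption, not a figure
    given by the dataset authors.
    """
    required = (CSV_SIZE_BYTES + IMAGES_SIZE_BYTES) * (1 + margin)
    return free_bytes >= required


# "." is a stand-in; point this at the disk that holds ENTER_DESTINATION_PATH.
free = shutil.disk_usage(".").free
print("Enough space for FooDI-ML:", has_enough_space(free))
```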
If you do not have AWS CLI already installed, please download the latest version of AWS CLI for your operating system.
- Run the following command to download the DataFrame into the ENTER_DESTINATION_PATH directory. The example below downloads it into the directory /mnt/data/foodi-ml/.
  aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv ENTER_DESTINATION_PATH --no-sign-request
  Example:
  aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv /mnt/data/foodi-ml/ --no-sign-request
- Run the following command to download the images into the ENTER_DESTINATION_PATH/dataset directory (please note the appended /dataset). This places the images in a dataset folder inside ENTER_DESTINATION_PATH.
  aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset ENTER_DESTINATION_PATH/dataset --no-sign-request --quiet
  Example:
  aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset /mnt/data/foodi-ml/dataset --no-sign-request --quiet
- Run the script rename_images.py. This script updates the image path column of the DataFrame so that it points to the images in the location you specified with ENTER_DESTINATION_PATH/dataset.
  pip install pandas
  python scripts/rename_images.py --output-dir ENTER_DESTINATION_PATH
- Run the script scripts/dataset_preprocess.py in order to filter the dataset:
  python scripts/dataset_preprocess.py --dataset-path <ENTER_PATH_TO_DATASET_FOLDER>
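The two download steps above can also be assembled from a script, which avoids retyping the bucket name and the /dataset suffix. This is a minimal sketch rather than part of the repository: `build_download_commands` is a hypothetical helper, and it only reuses the bucket, file names, and flags shown in the commands above. It prints the commands instead of running them.

```python
import shlex
from typing import List

# Bucket name copied from the download commands in the steps above.
BUCKET = "s3://glovo-products-dataset-d1c9720d"


def build_download_commands(destination: str) -> List[List[str]]:
    """Return the two download commands as argument lists (hypothetical helper)."""
    dest = destination.rstrip("/")
    csv_cmd = [
        "aws", "s3", "cp",
        f"{BUCKET}/glovo-foodi-ml-dataset.csv", f"{dest}/",
        "--no-sign-request",
    ]
    images_cmd = [
        "aws", "s3", "cp", "--recursive",
        f"{BUCKET}/dataset", f"{dest}/dataset",
        "--no-sign-request", "--quiet",
    ]
    return [csv_cmd, images_cmd]


for cmd in build_download_commands("/mnt/data/foodi-ml/"):
    # Printing only; to actually run a command, pass it to
    # subprocess.run(cmd, check=True) with the AWS CLI installed.
    print(shlex.join(cmd))
```

Keeping each command as an argument list rather than a single string means a destination path containing spaces does not need manual quoting.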
Our dataset is managed by the DataFrame glovo-foodi-ml-dataset.csv. This DataFrame contains the following columns:
- country_code: This column contains 37 unique country codes, as explained in our paper. These codes are:
  'ES', 'PL', 'CI', 'PT', 'MA', 'IT', 'AR', 'BG', 'KZ', 'BR', 'ME', 'TR', 'PE', 'SI', 'GE', 'EG', 'RS', 'RO', 'HR', 'UA', 'DO', 'KG', 'CR', 'UY', 'EC', 'HN', 'GH', 'KE', 'GT', 'CL', 'FR', 'BA', 'PA', 'UG', 'MD', 'NG', 'PR'
- city_code: Name of the city where the store is located.
- store_name: Name of the store selling the product. If store_name is equal to AS_XYZ, it represents an auxiliary store. This means that while the samples it contains are for the most part valid, the store name can't be used in learning tasks.
- product_name: Name of the product. All products have a product_name, so this column does not contain any NaN values.
- collection_section: Name of the section of the product, used for organizing the store menu. Common values are "drinks", "our pizzas", "desserts". All products have an associated collection_section, so this column does not contain any NaN values.
- product_description: A detailed description of the product, describing its ingredients and components. Not all products in our data have a description, so this column contains NaN values that researchers must remove as a preprocessing step.
- subset: Categorical variable indicating whether the sample belongs to the training, validation or test set. The respective values in the DataFrame are ["train", "val", "test"].
- HIER: Boolean variable indicating whether the store name can be used to retrieve product information, that is, whether store_name is not an auxiliary store (with code AS_XYZ).
- s3_path: Path of the image of the product in the disk location you chose.
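The column descriptions above translate into a few pandas operations: select a split through subset, drop rows whose product_description is NaN, and optionally keep only rows where HIER is true. The sketch below is illustrative only. `prepare_split` is a hypothetical helper (not one of the repository scripts), the toy rows are invented, and the code assumes pandas parses HIER as a boolean column.

```python
import pandas as pd


def prepare_split(
    df: pd.DataFrame,
    split: str,
    require_description: bool = True,
    require_store_name: bool = False,
) -> pd.DataFrame:
    """Select one split and drop rows a task cannot use (hypothetical helper)."""
    out = df[df["subset"] == split]
    if require_description:
        # product_description contains NaN values that must be removed.
        out = out.dropna(subset=["product_description"])
    if require_store_name:
        # HIER is True when store_name is not an auxiliary store.
        # Assumes the column was parsed as booleans, not as strings.
        out = out[out["HIER"].astype(bool)]
    return out.reset_index(drop=True)


# Toy rows standing in for the real file; with the real data one would use
# pd.read_csv on the downloaded glovo-foodi-ml-dataset.csv instead.
toy = pd.DataFrame(
    {
        "store_name": ["Pizzeria A", "AS_XYZ", "Pizzeria A"],
        "product_name": ["Margherita", "Cola", "Tiramisu"],
        "product_description": ["Tomato, mozzarella", None, None],
        "subset": ["train", "train", "val"],
        "HIER": [True, False, True],
    }
)

train = prepare_split(toy, "train")
print(train[["product_name", "product_description"]])
```

Tasks that use only product_name can pass `require_description=False` to keep the rows without a description.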
A notebook analyzing several dataset statistics is provided in notebooks/FooDI-ML Dataset Stats Analytics.ipynb.
Our paper includes 3 benchmarks: Text to Image/Image to Text Retrieval, and Conditional Image Generation.
You can cite our paper on arXiv: https://arxiv.org/abs/2110.02035