wikipedia-image-caption-matching

This is the 3rd place solution code for the Wikipedia - Image/Caption Matching Competition on Kaggle.

This repo consists of two main parts:

  • notebooks

    This folder contains notebooks carrying out

    • data preparation,
    • filtering, and
    • ranking.

    Some of the notebooks presented here process only a part of the data; they should serve as templates for processing the rest and are marked with 🧩.

  • wikimatcher

    This package provides functionality for filtering by searching for possible matches between images and text candidates. The filtering procedure is based on several heuristic rankers that estimate matching rates.

This stage is an offline pipeline for precomputing various data used later for filtering and feature engineering.

Translation is performed by GoogleTranslator from deep-translator.
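
A minimal sketch of such a translation call with deep-translator; the target language and the sample text are illustrative, not the exact settings used in the notebooks:

```python
from deep_translator import GoogleTranslator

# Translate a page title / caption to English; 'auto' lets the service
# detect the source language.
translator = GoogleTranslator(source="auto", target="en")
text_en = translator.translate("Tour Eiffel illuminée la nuit")
print(text_en)  # e.g. "Eiffel Tower illuminated at night"
```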

Sentence Embeddings (SEs) and Image-Text Embeddings (ITEs) are computed by pre-trained models from SentenceTransformers.
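
For illustration, a sketch of computing both kinds of embeddings with SentenceTransformers; the checkpoint names below are plausible choices rather than a statement of the exact models used in this solution:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# Sentence Embeddings (SEs) for textual candidates
text_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
se = text_model.encode(["Eiffel Tower at night", "A view of Paris"])

# Image-Text Embeddings (ITEs) in a shared CLIP space
clip_model = SentenceTransformer("clip-ViT-B-32")
img_emb = clip_model.encode(Image.open("image.jpg"))
txt_emb = clip_model.encode("Eiffel Tower at night")
```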

Basic Preprocessing (train | test)

The base training dataset contains 69,480 records for images and 301,664 records for their textual candidates, built from page_title and caption_reference_description.

Image Data Preparation


Candidate Data Preparation


In this stage, for each image, candidate filtering is performed. The filtering procedure uses several heuristic algorithms computing ranks based on string similarity, named entity matching, number matching, and cosine similarity between various embeddings.

To estimate the degree of string similarity, RapidFuzz is used.
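
As an illustration of the string-similarity part, a minimal RapidFuzz call; the specific scorers used by the rankers may differ:

```python
from rapidfuzz import fuzz

title = "Eiffel Tower"
candidate = "The Eiffel Tower at night, Paris"

# Token-based scorers are robust to word order and extra tokens;
# scores are in the range 0-100.
print(fuzz.token_set_ratio(title, candidate))
print(fuzz.partial_ratio(title, candidate))
```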

The ranks computed here are used for candidate selection and feature engineering.

For this filtering procedure, recall is approximately 0.95, and selected quantiles of the number of candidates per image are shown in the following table.

| Q25 | Q50 | Q75 | Q90 |
|------|------|------|------|
| 2800 | 3200 | 3600 | 4000 |

After this stage, all the images from the base training dataset, together with their candidates and features, are divided into 72 parts of data.

Now, for each image, the matching problem reduces to ranking its candidates with XGBRanker.
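
A minimal sketch of fitting and applying XGBRanker on grouped (per-image) candidate features; the feature matrix, labels, and hyperparameters below are placeholders, not the actual configuration of the base models:

```python
import numpy as np
from xgboost import XGBRanker

# X: candidate features, y: 1 for the true caption, 0 otherwise,
# groups: number of candidates per image (in the same row order as X).
X = np.random.rand(300, 8)
y = np.random.randint(0, 2, size=300)
groups = [50] * 6  # 6 images with 50 candidates each

ranker = XGBRanker(objective="rank:pairwise", n_estimators=100)
ranker.fit(X, y, group=groups)

scores = ranker.predict(X)  # higher score = better candidate
```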

After filtering, the prepared data is split into training, validation, and holdout datasets as follows.

| Dataset | Parts |
|---|---|
| Training | 0–58, 60–68 |
| Validation | 70, 71 |
| Holdout | 59, 69 |
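
A hypothetical way to encode this split as a configuration mapping part indices to dataset roles:

```python
# Hypothetical mapping of the 72 data parts to dataset roles
SPLIT = {
    "training":   list(range(0, 59)) + list(range(60, 69)),  # parts 0-58, 60-68
    "validation": [70, 71],
    "holdout":    [59, 69],
}
```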

As shown in the table below, the training dataset is, in turn, divided into 7 ranges, each intended for training a separate base model.

| Base Model | Part Range |
|---|---|
| model-00 | 0–9 |
| model-01 | 10–19 |
| model-02 | 20–29 |
| model-03 | 30–39 |
| model-04 | 40–49 |
| model-05 | 50–58 |
| model-06 | 60–68 |

Following the stacking technique, the final model is trained on the ranks produced by the base models.
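
A rough sketch of this stacking step, assuming each trained base model exposes a predict method and that meta-level labels and group sizes are available; the names below are hypothetical:

```python
import numpy as np
from xgboost import XGBRanker

def stack_ranks(base_models, X_meta):
    # Each base model's ranks become one feature column for the final model.
    return np.column_stack([m.predict(X_meta) for m in base_models])

# base_models: the 7 fitted XGBRanker instances (model-00 ... model-06)
# X_meta, y_meta, groups_meta: features, labels, and per-image group sizes
# reserved for the final model.
#
# meta_features = stack_ranks(base_models, X_meta)
# final_model = XGBRanker(objective="rank:pairwise")
# final_model.fit(meta_features, y_meta, group=groups_meta)
```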

The training and inference procedures are depicted visually in the following diagram.

[Diagram: the training and inference procedures]

Training Pipeline

  • Preprocessing 🧩

    This notebook performs the final preparation of the filtered data, forming the datasets for the base models.

  • Training Base Models

    These notebooks fit XGBRanker to a specified training dataset and differ only in hyperparameters and the path to the folder containing training data.

  • Calculating Ranks 🧩

    This notebook uses model-00 to produce its ranks for the validation and holdout datasets. For each image, the ranks obtained are used to determine the 50 candidates with the highest rank, while the rest are rejected (see the minimal selection sketch after this list). The best candidates and their ranks subsequently form the training and validation data for the final model.

  • Rank Stacking

  • Final Model Training
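
A minimal pandas sketch of the top-50 candidate selection described in the Calculating Ranks step above; the column names ('image_id', 'rank_score') are hypothetical:

```python
import pandas as pd

# One row per (image, candidate) pair with a rank score from model-00.
df = pd.DataFrame({
    "image_id":   [1, 1, 1, 2, 2, 2],
    "candidate":  ["a", "b", "c", "d", "e", "f"],
    "rank_score": [0.9, 0.1, 0.5, 0.3, 0.8, 0.2],
})

# Keep the 50 highest-ranked candidates per image, reject the rest.
top = (
    df.sort_values("rank_score", ascending=False)
      .groupby("image_id")
      .head(50)
)
```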

Inference Pipeline