wikipedia-image-caption-matching

This is the 3rd place solution code for the Wikipedia - Image/Caption Matching Competition on Kaggle.

This repo consists of two main parts:

  • notebooks

    This folder contains notebooks carrying out

    • data preparation,
    • filtering, and
    • ranking.

    Some of the notebooks presented here process only a part of the data; they should serve as templates for processing the rest and are marked with 🧩.

  • wikimatcher

    This package provides functionality for filtering by searching for possible matches between images and text candidates. The filtering procedure is based on several heuristic rankers that estimate matching rates.

This stage is an offline pipeline for precomputing various data used later for filtering and feature engineering.

Translation is performed by GoogleTranslator from deep-translator.
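
A minimal sketch of such a translation call with deep-translator; the target language and the sample text are illustrative, not the exact settings used in the notebooks:

```python
from deep_translator import GoogleTranslator

# Translate a page title / caption to English; 'auto' lets the service
# detect the source language.
translator = GoogleTranslator(source="auto", target="en")
text_en = translator.translate("Tour Eiffel illuminée la nuit")
print(text_en)  # e.g. "Eiffel Tower illuminated at night"
```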

Sentence Embeddings (SEs) and Image-Text Embeddings (ITEs) are computed by pre-trained models from SentenceTransformers.
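
For illustration, a sketch of computing both kinds of embeddings with SentenceTransformers; the checkpoint names below are plausible choices rather than a statement of the exact models used in this solution:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# Sentence Embeddings (SEs) for textual candidates
text_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
se = text_model.encode(["Eiffel Tower at night", "A view of Paris"])

# Image-Text Embeddings (ITEs) in a shared CLIP space
clip_model = SentenceTransformer("clip-ViT-B-32")
img_emb = clip_model.encode(Image.open("image.jpg"))
txt_emb = clip_model.encode("Eiffel Tower at night")
```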

Basic Preprocessing (train | test)

The base training dataset contains 69,480 records for images and 301,664 records for their textual candidates, built from page_title and caption_reference_description.

Image Data Preparation


Candidate Data Preparation


In this stage, for each image, candidate filtering is performed. The filtering procedure uses several heuristic algorithms computing ranks based on string similarity, named entity matching, number matching, and cosine similarity between various embeddings.

To estimate the degree of string similarity, RapidFuzz is used.
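
As an illustration of the string-similarity part, a minimal RapidFuzz call; the specific scorers used by the rankers may differ:

```python
from rapidfuzz import fuzz

title = "Eiffel Tower"
candidate = "The Eiffel Tower at night, Paris"

# Token-based scorers are robust to word order and extra tokens;
# scores are in the range 0-100.
print(fuzz.token_set_ratio(title, candidate))
print(fuzz.partial_ratio(title, candidate))
```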

The ranks computed here are used for candidate selection and feature engineering.

For this filtering procedure, recall is approximately 0.95, and selected quantiles of the number of candidates per image are shown in the following table.

| Q25 | Q50 | Q75 | Q90 |
|------|------|------|------|
| 2800 | 3200 | 3600 | 4000 |

After this stage, all the images from the base training dataset, together with their candidates and features, are divided into 72 parts of data.

Now, for each image, the matching problem reduces to ranking its candidates with XGBRanker.
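
A minimal sketch of fitting and applying XGBRanker on grouped (per-image) candidate features; the feature matrix, labels, and hyperparameters below are placeholders, not the actual configuration of the base models:

```python
import numpy as np
from xgboost import XGBRanker

# X: candidate features, y: 1 for the true caption, 0 otherwise,
# groups: number of candidates per image (in the same row order as X).
X = np.random.rand(300, 8)
y = np.random.randint(0, 2, size=300)
groups = [50] * 6  # 6 images with 50 candidates each

ranker = XGBRanker(objective="rank:pairwise", n_estimators=100)
ranker.fit(X, y, group=groups)

scores = ranker.predict(X)  # higher score = better candidate
```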

After filtering, the prepared data is split into training, validation, and holdout datasets as follows.

| Dataset | Parts |
|---|---|
| Training | 0–58, 60–68 |
| Validation | 70, 71 |
| Holdout | 59, 69 |
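
A hypothetical way to encode this split as a configuration mapping part indices to dataset roles:

```python
# Hypothetical mapping of the 72 data parts to dataset roles
SPLIT = {
    "training":   list(range(0, 59)) + list(range(60, 69)),  # parts 0-58, 60-68
    "validation": [70, 71],
    "holdout":    [59, 69],
}
```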

As shown in the table below, the training dataset is, in turn, divided into 7 ranges, each intended for training a separate base model.

| Base Model | Part Range |
|---|---|
| model-00 | 0–9 |
| model-01 | 10–19 |
| model-02 | 20–29 |
| model-03 | 30–39 |
| model-04 | 40–49 |
| model-05 | 50–58 |
| model-06 | 60–68 |

Following the stacking technique, the final model is trained on the ranks produced by the base models.
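
A rough sketch of this stacking step, assuming each trained base model exposes a predict method and that meta-level labels and group sizes are available; the names below are hypothetical:

```python
import numpy as np
from xgboost import XGBRanker

def stack_ranks(base_models, X_meta):
    # Each base model's ranks become one feature column for the final model.
    return np.column_stack([m.predict(X_meta) for m in base_models])

# base_models: the 7 fitted XGBRanker instances (model-00 ... model-06)
# X_meta, y_meta, groups_meta: features, labels, and per-image group sizes
# reserved for the final model.
#
# meta_features = stack_ranks(base_models, X_meta)
# final_model = XGBRanker(objective="rank:pairwise")
# final_model.fit(meta_features, y_meta, group=groups_meta)
```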

The training and inference procedures are depicted visually in the following diagram.

[Diagram: the training and inference procedures]

Training Pipeline

  • Preprocessing 🧩

    This notebook performs the final preparation of the filtered data, forming the datasets for the base models.

  • Training Base Models

    These notebooks fit XGBRanker to a specified training dataset and differ only in hyperparameters and the path to the folder containing training data.

  • Calculating Ranks 🧩

    This notebook uses model-00 to produce its ranks for the validation and holdout datasets. For each image, the ranks obtained are used to determine the 50 candidates with the highest rank, while the rest are rejected (see the minimal selection sketch after this list). The best candidates and their ranks subsequently form the training and validation data for the final model.

  • Rank Stacking

  • Final Model Training
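
A minimal pandas sketch of the top-50 candidate selection described in the Calculating Ranks step above; the column names ('image_id', 'rank_score') are hypothetical:

```python
import pandas as pd

# One row per (image, candidate) pair with a rank score from model-00.
df = pd.DataFrame({
    "image_id":   [1, 1, 1, 2, 2, 2],
    "candidate":  ["a", "b", "c", "d", "e", "f"],
    "rank_score": [0.9, 0.1, 0.5, 0.3, 0.8, 0.2],
})

# Keep the 50 highest-ranked candidates per image, reject the rest.
top = (
    df.sort_values("rank_score", ascending=False)
      .groupby("image_id")
      .head(50)
)
```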

Inference Pipeline