It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods. Moreover, we extend this understanding by demonstrating that these detectors can be utilized to improve the original network, paving the way for further advancements. To accomplish this, we train the detectors on top of the network output instead of the image data and apply suitable loss backpropagation. Our findings reveal a significant improvement in phrase grounding for the “what is where by looking” task, as well as various methods of unsupervised object discovery.
To get started, follow these steps:

- Clone the repository:

```bash
git clone https://github.com/eyalgomel/box-based-refinement.git
cd box-based-refinement
```

- Create and activate a new Conda environment, then install the package:

```bash
conda env create -f environment.yml
conda activate bbr
pip install .
```
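After installation, a quick sanity check that the package resolves can save debugging time later. This is an illustrative sketch, not part of the repository; it assumes the package installs under the name `bbr`:

```python
import importlib.util

def is_installed(package: str) -> bool:
    """Return True when the package can be resolved by the current interpreter."""
    return importlib.util.find_spec(package) is not None

# After `pip install .`, we expect the package (assumed name: bbr) to resolve:
# is_installed("bbr")
```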
We follow and adapt the dataset formats based on this GitHub repository. Below are examples of the expected data structure for each dataset.
**COCO**

```
root/
├── annotations/
│   ├── captions_train2014.json
│   ├── captions_val2014.json
│   ├── ...
├── images/
└── labels/
```

**Visual Genome (VG)**

```
root/
├── VG_Annotations/
└── VG_Images/
```

**Flickr30k**

```
root/
├── flickr30k_entities/
├── flickr30k_images/
├── train.txt
├── val.txt
└── test.txt
```

**ReferIt**

```
root/
├── annotations/
└── ReferIt_Images/
```

**Pascal VOC**

```
root/
├── Annotations/
├── ImageSets/
├── JPEGImages/
├── SegmentationClass/
└── SegmentationObject/
```
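To catch path mistakes early, it can help to verify that the expected top-level entries exist under each dataset root. This is an illustrative helper, not part of the repository; the entry names are taken directly from the trees above, while the function and variable names are ours:

```python
from pathlib import Path

# Expected top-level entries for each dataset layout shown above.
EXPECTED_LAYOUTS = {
    "coco": ["annotations", "images", "labels"],
    "vg": ["VG_Annotations", "VG_Images"],
    "flickr30k": ["flickr30k_entities", "flickr30k_images",
                  "train.txt", "val.txt", "test.txt"],
    "referit": ["annotations", "ReferIt_Images"],
    "voc": ["Annotations", "ImageSets", "JPEGImages",
            "SegmentationClass", "SegmentationObject"],
}

def missing_entries(root: str, dataset: str) -> list:
    """Return the expected top-level entries absent under the dataset root."""
    root_path = Path(root)
    return [name for name in EXPECTED_LAYOUTS[dataset]
            if not (root_path / name).exists()]
```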
For each training script, you can adjust training parameters either through the yaml configuration file or by passing them directly as command line arguments. Configuration files may be found here.
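The dotted command line arguments (e.g. `training.gpu_num=4`) override the corresponding nested keys in the yaml configuration. As an illustration of that convention only (this is not the repository's actual code), a minimal resolver for dotted overrides could look like:

```python
from typing import Any, Dict

def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested config value from a dotted key like 'training.gpu_num'."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value

# Illustrative config values; the real defaults live in the yaml files.
config = {"training": {"gpu_num": 1}, "data": {"train": {"dataset": "coco"}}}
apply_override(config, "training.gpu_num", 4)
apply_override(config, "data.train.data_path", "/datasets/coco")
```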
For weakly supervised phrase grounding training, run the command below:
```bash
python bbr/train/wsg.py \
    training.heatmap_model=MODEL_PATH \
    data.train.dataset=TRAIN_DATASET \
    data.train.data_path=TRAIN_DATASET_PATH \
    data.val.dataset=VAL_DATASET \
    data.val.data_path=VAL_DATASET_PATH \
    training.gpu_num=GPUS_NUMBER
```
Make sure to replace the following placeholders:

- `MODEL_PATH`: The path to the original model checkpoint. You can download it from this repository.
- `TRAIN_DATASET`: One of the available training datasets: `coco`, `vg`.
- `TRAIN_DATASET_PATH`: The path to the training dataset.
- `VAL_DATASET`: One of the available validation datasets: `flickr`, `referit`, `vg`.
- `VAL_DATASET_PATH`: The path to the validation dataset.
- `GPUS_NUMBER`: The number of GPUs to use.
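If you launch many runs, the placeholder substitution above can be scripted. This is a hypothetical convenience helper, not part of the repository; the example paths and values are purely illustrative:

```python
import shlex

def build_wsg_command(model_path: str, train_dataset: str, train_path: str,
                      val_dataset: str, val_path: str, gpus: int = 1) -> str:
    """Assemble the phrase grounding training command with its dotted overrides."""
    args = [
        "python", "bbr/train/wsg.py",
        f"training.heatmap_model={model_path}",
        f"data.train.dataset={train_dataset}",
        f"data.train.data_path={train_path}",
        f"data.val.dataset={val_dataset}",
        f"data.val.data_path={val_path}",
        f"training.gpu_num={gpus}",
    ]
    return shlex.join(args)

# Illustrative values only:
cmd = build_wsg_command("checkpoints/model.pth", "coco", "/datasets/coco",
                        "flickr", "/datasets/flickr30k", gpus=2)
```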
For single object discovery training, run one of the commands below, depending on the underlying method (lost | tokencut | move):
```bash
python bbr/train/od_{method}.py \
    data.train.dataset=TRAIN_DATASET \
    data.train.data_path=TRAIN_DATASET_PATH \
    data.val.dataset=VAL_DATASET \
    data.val.data_path=VAL_DATASET_PATH \
    training.gpu_num=GPUS_NUMBER \
    training.move.model_path=MOVE_MODEL_PATH  # relevant for MOVE only
```
Make sure to replace the following placeholders:

- `TRAIN_DATASET`: One of the available training datasets: `VOC07`, `VOC12`, `coco`.
- `TRAIN_DATASET_PATH`: The path to the training dataset.
- `VAL_DATASET`: One of the available validation datasets: `VOC07`, `VOC12`, `coco`.
- `VAL_DATASET_PATH`: The path to the validation dataset.
- `GPUS_NUMBER`: The number of GPUs to use.
For training MOVE, you should download the adapted version of the original model weights (`MOVE_MODEL_PATH`), which can be found here.
To evaluate our method, run the following command:
```bash
python bbr/inference/run_inference.py \
    task=TASK \
    data.dataset=DATASET \
    model_path=MODEL_PATH \
    data.val_path=DATA_PATH
```
Replace `TASK` with one of `od_lost`, `od_tokencut`, `od_move`, or `grounding`. Likewise, replace `DATASET` with one of `flickr`, `referit`, `vg`, `VOC07`, `VOC12`, or `coco20k`, and replace `DATA_PATH` with the respective dataset path. `MODEL_PATH` should be the path to the model weights you intend to evaluate. You can refer to the next section, Pretrained models weights, for instructions on obtaining pretrained weights for all tasks. Other evaluation parameters can be found in the configuration file.
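Before launching an evaluation, it can be useful to check that the task and dataset are a sensible pair. The mapping below is our own inference from the training sections above (grounding on the phrase grounding datasets, the `od_*` tasks on the object discovery datasets), not an official table from the repository:

```python
# Hypothetical task-to-dataset mapping, inferred from the sections above.
VALID_DATASETS = {
    "grounding": {"flickr", "referit", "vg"},
    "od_lost": {"VOC07", "VOC12", "coco20k"},
    "od_tokencut": {"VOC07", "VOC12", "coco20k"},
    "od_move": {"VOC07", "VOC12", "coco20k"},
}

def check_eval_config(task: str, dataset: str) -> bool:
    """Return True when the task/dataset pair looks consistent."""
    return dataset in VALID_DATASETS.get(task, set())
```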
Here, you'll find the pertinent links to access pretrained model weights for various tasks:
Phrase Grounding
Single Object Discovery

Each folder includes three models, for the VOC07, VOC12, and COCO20K datasets.
This repository builds on some fantastic methods and repositories, and we truly appreciate their contributions:
CLIP, DETR, DINO, LOST, TokenCut, MOVE, BLIP
The source code for each individual method within this repository is subject to its respective original license.
If you find our work inspirational or use our codebase in your research, please consider giving a star ⭐ and a citation.
```bibtex
@inproceedings{gomel2023boxbasedrefinement,
    title={Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks},
    author={Gomel, Eyal and Shaharbany, Tal and Wolf, Lior},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    year={2023}
}
```