# VideoSAUR

This is the code release for the paper *Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities* (NeurIPS 2023), by Andrii Zadaianchuk, Maximilian Seitzer and Georg Martius.

*Figure: Temporal Feature Similarities*

## Summary

Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.

## Usage

### Setup

First, set up the Python environment. We use Poetry for this:

```bash
poetry install
```

Then you can run a test configuration to check that everything works:

```bash
poetry run python -m videosaur.train tests/configs/test_dummy_image.yml
```

Second, to download the datasets used in this work, follow the instructions in `data/README.md`. By default, datasets are expected to be located in the folder `./data`. You can point the code to the folder that actually contains your data by setting the environment variable `VIDEOSAUR_DATA_PATH`, or by running `train.py` with the `--data-dir` option.
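
For example, if your datasets live under `/mnt/datasets/videosaur` (a placeholder path), either of the following should work:

```bash
# Option 1: set the data root once via the environment variable
export VIDEOSAUR_DATA_PATH=/mnt/datasets/videosaur
poetry run python -m videosaur.train tests/configs/test_dummy_image.yml

# Option 2: pass the data directory explicitly for a single run
poetry run python -m videosaur.train --data-dir /mnt/datasets/videosaur tests/configs/test_dummy_image.yml
```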

### Training

Run one of the configurations in `configs/videosaur`, for example:

```bash
poetry run python -m videosaur.train configs/videosaur/movi_c.yml
```

The results are stored in a folder created under the log root folder (by default `./logs`, changeable with the `--log-dir` argument). If you want to continue training from a previous run, you can use the `--continue` argument, as in the following command:

```bash
poetry run python -m videosaur.train --continue <path_to_log_dir_or_checkpoint_file> configs/videosaur/movi_c.yml
```
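
The command-line options mentioned above can be combined; for example, a resumed run with custom log and data directories might look like this (all paths are placeholders):

```bash
poetry run python -m videosaur.train \
    --continue ./logs/my_previous_movi_c_run \
    --log-dir ./logs \
    --data-dir /mnt/datasets/videosaur \
    configs/videosaur/movi_c.yml
```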

### Inference

If you want to run one of the released checkpoints (see below) on your own video, you can use the inference script with the corresponding config file:

```bash
poetry run python -m videosaur.inference --config configs/inference/movi_c.yml
```

In the released config, change `checkpoint: path/to/videosaur-movi-c.ckpt` to the actual path of your checkpoint. For other video formats, you may need to adapt the corresponding transformations in the `build_inference_transform` function.
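
For example, assuming the MOVi-C checkpoint was downloaded to `./checkpoints/` (a placeholder location), you could update the config in place and run inference like this:

```bash
# Replace the placeholder checkpoint path in the released config (paths are examples)
sed -i 's|path/to/videosaur-movi-c.ckpt|./checkpoints/videosaur-movi-c.ckpt|' configs/inference/movi_c.yml

# Run inference with the updated config
poetry run python -m videosaur.inference --config configs/inference/movi_c.yml
```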

## Results

### VideoSAUR

We list the results you should roughly be able to obtain with the configs included in this repository:

| Dataset | Model Variant | Video ARI | Video mBO | Config | Checkpoint Link |
|---------|---------------|-----------|-----------|--------|-----------------|
| MOVi-C | ViT-B/8, DINO | 64.8 | 38.9 | `videosaur/movi_c.yml` | Checkpoint |
| MOVi-E | ViT-B/8, DINO | 73.9 | 35.6 | `videosaur/movi_e.yml` | Checkpoint |
| YT-VIS 2021 | ViT-B/16, DINO | 39.5 | 29.1 | `videosaur/ytvis.yml` | Checkpoint |
| YT-VIS 2021 | ViT-B/14, DINOv2 | 39.7 | 35.6 | `videosaur/ytvis_dinov2.yml` | Checkpoint |

### DINOSAUR

We also include a configuration for the DINOSAUR model from our previous paper Bridging the gap to real-world object-centric learning. This configuration yields improved results compared to the DINOSAUR model in the original paper (mainly due to using DINOv2 pre-trained features). Note that there might be minor differences in the metrics, as the numbers here are computed for 224x224 masks, compared to 320x320 masks in the DINOSAUR paper.

| Dataset | Model Variant | Image ARI | Image mBO | Config | Checkpoint |
|---------|---------------|-----------|-----------|--------|------------|
| COCO | ViT-B/14, DINOv2 | 45.6 | 29.6 | `dinosaur/coco_base14_dinov2.yml` | Checkpoint |
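
The same training entry point should also work for this configuration, e.g.:

```bash
poetry run python -m videosaur.train configs/dinosaur/coco_base14_dinov2.yml
```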

## Citation

If you make use of this repository, please use the following BibTeX entry to cite us:

```bibtex
@inproceedings{zadaianchuk2023objectcentric,
    title={Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities},
    author={Zadaianchuk, Andrii and Seitzer, Maximilian and Martius, Georg},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)},
    year={2023},
}
```

## License

This codebase is released under the MIT license. Some parts of the codebase were adapted from other codebases. A comment was added to the code where this is the case. Those parts are governed by their respective licenses.