Official code for the ICASSP 2024 paper that implements a one-stage DETR-based network for zero-shot HOI detection boosted with vision-language transfer.
This repository contains the official PyTorch implementation of our ICASSP 2024 paper: *Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer*, a work by Sandipan Sarma, Pradnesh Kalkar, and Arijit Sur at the Indian Institute of Technology Guwahati.
- Human-Object Interaction (HOI) detection is a crucial task that involves localizing interactive human-object pairs and identifying the actions being performed. In this work, our primary focus is improving HOI detection in images, particularly in zero-shot scenarios.
- The query vectors in our DETR-based framework play a vital role: they encode "what" visual information to look for in human-object pairs, while each vector element suggests "where" in the image to look for these pairs. Since the final task is to detect human-object pairs, unified query vectors that jointly represent each pair are important.
- Even when certain actions or objects are unavailable during training (as in the UA and UO settings), our method is better at detecting unseen interactions in these challenging settings.
- For dataset preparation, check out these instructions.
- Download the `params` folder, which contains the DETR-based pretrained models, and place it at the repository root (outside all other folders).
- At the repository root, create a new folder named `ckpt` and download the pretrained CLIP model (CLIP50x16) into it.
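After the steps above, the repository root would look roughly like this (the root folder name is illustrative; only `params`, `ckpt`, `models`, and `scripts` are taken from these instructions):

```
<repo-root>/
├── params/                          # DETR-based pretrained models
├── ckpt/                            # pretrained CLIP model (CLIP50x16)
├── models/
│   └── generate_clip_semantics.py
└── scripts/
    └── train_DETR_CLIP_UA.sh
```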
Generate the object, action, and interaction CLIP semantics for offline use by running:

```shell
cd models
python generate_clip_semantics.py
```
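The offline-caching pattern behind this step can be sketched as follows. The actual script encodes category prompts with CLIP's text encoder; this self-contained sketch substitutes a hypothetical stub encoder (`fake_text_encoder`, illustration only) so the encode-once, save-to-disk, load-at-train-time logic is runnable on its own:

```python
import hashlib
import pickle


def fake_text_encoder(prompt: str, dim: int = 8) -> list[float]:
    """Stand-in for a CLIP text encoder (hypothetical): maps a prompt
    to a deterministic pseudo-embedding for illustration only."""
    digest = hashlib.sha256(prompt.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]


def cache_semantics(names, template, path):
    """Encode every class name once and save the embeddings to disk,
    so training runs can load them without re-running the encoder."""
    semantics = {n: fake_text_encoder(template.format(n)) for n in names}
    with open(path, "wb") as f:
        pickle.dump(semantics, f)
    return semantics


# Tiny illustrative vocabularies (not the real HICO-DET/V-COCO lists).
objects = ["bicycle", "umbrella"]
actions = ["ride", "hold"]

obj_sem = cache_semantics(objects, "a photo of a {}", "object_semantics.pkl")
act_sem = cache_semantics(actions, "a photo of a person {} something",
                          "action_semantics.pkl")

# At training time, load the cached embeddings instead of re-encoding.
with open("object_semantics.pkl", "rb") as f:
    loaded = pickle.load(f)
assert loaded == obj_sem
```

Caching the text embeddings once means the CLIP text encoder never has to run inside the training loop.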
Run the scripts from the `scripts` folder whose file names contain `DETR_CLIP`. For example, to train the model for the UA setting, run:

```shell
cd scripts
sh train_DETR_CLIP_UA.sh
```
If you use our work in your research, kindly star ⭐ our repository and consider citing it using the following BibTeX:
```bibtex
@INPROCEEDINGS{10445910,
  author={Sarma, Sandipan and Kalkar, Pradnesh and Sur, Arijit},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer},
  year={2024},
  pages={6355-6359},
  keywords={Visualization;Semantics;Detectors;Transformers;Feature extraction;Task analysis;Speech processing;Human-object interaction;transformer;CLIP;zero-shot learning},
  doi={10.1109/ICASSP48485.2024.10445910}
}
```