Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer (ICASSP 2024)

Official code for the ICASSP 2024 paper that implements a one-stage DETR-based network for zero-shot HOI detection boosted with vision-language transfer.

👓 At a glance

This repository contains the official PyTorch implementation of our ICASSP 2024 paper, Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer, by Sandipan Sarma, Pradnesh Kalkar, and Arijit Sur at the Indian Institute of Technology Guwahati.

  • Human-Object Interaction (HOI) detection is a crucial task that involves localizing interactive human-object pairs and identifying the actions being performed. In this work, our primary focus is improving HOI detection in images, particularly in zero-shot scenarios.
  • The query vectors in our DETR-based framework play a key role: they encode “what” visual information about human-object pairs to look for, while their elements suggest “where” in the image these pairs are likely to appear. Since the final task is to detect human-object pairs, unified query vectors that represent each pair jointly are important (see the illustrative sketch after this list).
  • Even when certain actions or objects are entirely unseen during training, as in the unseen-action (UA) and unseen-object (UO) settings, our method remains better at detecting unseen interactions in these challenging scenarios.
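Conceptually, each unified query corresponds to one candidate human-object pair, and the decoded query embedding is read out into human/object boxes and class scores. The snippet below is a minimal, hypothetical PyTorch sketch of that idea only; it is not this repository's implementation, and all module names, dimensions, and class counts are assumptions.

import torch
import torch.nn as nn

class UnifiedHOIQueries(nn.Module):
    # Hypothetical illustration of unified human-object pair queries in a
    # DETR-style decoder; not the actual ZSHOI-VLT code.
    def __init__(self, num_queries=64, hidden_dim=256,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()
        # One learnable query vector per candidate human-object pair ("what" to look for)
        self.pair_queries = nn.Embedding(num_queries, hidden_dim)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        # Each decoded pair embedding is read out into boxes ("where") and class scores
        self.human_box = nn.Linear(hidden_dim, 4)
        self.object_box = nn.Linear(hidden_dim, 4)
        self.object_cls = nn.Linear(hidden_dim, num_obj_classes)
        self.verb_cls = nn.Linear(hidden_dim, num_verb_classes)

    def forward(self, image_memory):
        # image_memory: (batch, num_tokens, hidden_dim) features from a DETR-style encoder
        batch = image_memory.size(0)
        queries = self.pair_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        pair_emb = self.decoder(queries, image_memory)
        return {
            "human_boxes": self.human_box(pair_emb).sigmoid(),
            "object_boxes": self.object_box(pair_emb).sigmoid(),
            "object_logits": self.object_cls(pair_emb),
            "verb_logits": self.verb_cls(pair_emb),
        }

In the paper's framework, the classification side is coupled with CLIP-derived semantics for objects, actions, and interactions (generated offline in the step below), which is what enables recognizing unseen categories.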

Figure: The framework

💪 Datasets and Pre-trained models

  • For dataset preparation, check out these instructions.
  • Download the params folder, which holds the DETR-based pretrained models, and place it at the repository root (outside all other folders).
  • Also at the repository root, create a new folder called ckpt and download the pretrained CLIP model (CLIP50x16) into it. The expected layout is sketched below.
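Assuming the folder names above, the repository root would then look roughly like this (an illustrative layout, not an exhaustive listing):

ZSHOI-VLT/
├── ckpt/       # pretrained CLIP model (CLIP50x16)
├── params/     # DETR-based pretrained models
├── models/     # includes generate_clip_semantics.py
├── scripts/    # training scripts (e.g., train_DETR_CLIP_UA.sh)
└── ...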

📝 Generating semantics

Generate the object, action, and interaction CLIP semantics for offline use by running:

cd models
python generate_clip_semantics.py

🚄 Training and evaluation

Run the scripts in the scripts folder whose file names contain DETR_CLIP. For example, to train the model for the UA setting, run:

cd scripts
sh train_DETR_CLIP_UA.sh

🏆 Zero-Shot Results on HICO-DET

Figure: Zero-shot results on HICO-DET

🎁 Citation

If you use our work in your research, kindly star ⭐ our repository and consider citing it using the following BibTeX:

@INPROCEEDINGS{10445910,
  author={Sarma, Sandipan and Kalkar, Pradnesh and Sur, Arijit},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer}, 
  year={2024},
  volume={},
  number={},
  pages={6355-6359},
  keywords={Visualization;Semantics;Detectors;Transformers;Feature extraction;Task analysis;Speech processing;Human-object interaction;transformer;CLIP;zero-shot learning},
  doi={10.1109/ICASSP48485.2024.10445910}}

🙏 Acknowledgments

This work partially borrows code from CDN and ConsNet.
