Official code for the ICASSP 2024 paper that implements a one-stage DETR-based network for zero-shot HOI detection boosted with vision-language transfer.
This repository contains the official PyTorch implementation of our ICASSP 2024 paper: *Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer*, a work by Sandipan Sarma, Pradnesh Kalkar, and Arijit Sur at the Indian Institute of Technology Guwahati.
- Human-Object Interaction (HOI) detection is a crucial task that involves localizing interactive human-object pairs and identifying the actions being performed. In this work, our primary focus is improving HOI detection in images, particularly in zero-shot scenarios.
- The query vectors in our DETR-based framework play a vital role: they encode "what" visual information to look for in human-object pairs, while each vector element suggests "where" in the image to look for these pairs. Since the final task is to detect human-object pairs, unified query vectors that jointly represent each pair are important.
- Even when certain actions or objects are unavailable during training (as in the UA and UO settings), our method is better at detecting unseen interactions in these challenging settings.
- For dataset preparation, check out these instructions.
- Download the `params` folder, which contains the DETR-based pretrained models, and place it at the repository root (outside all other folders).
- At the repository root, create a new folder named `ckpt` and download the pretrained CLIP model (CLIP50x16) into it.
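After the steps above, the repository root would look roughly like this (the root folder name is illustrative; only `params`, `ckpt`, `models`, and `scripts` are taken from these instructions):

```
<repo-root>/
├── params/                          # DETR-based pretrained models
├── ckpt/                            # pretrained CLIP model (CLIP50x16)
├── models/
│   └── generate_clip_semantics.py
└── scripts/
    └── train_DETR_CLIP_UA.sh
```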
Generate the object, action, and interaction CLIP semantics for offline use by running:

```shell
cd models
python generate_clip_semantics.py
```
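The offline-caching pattern behind this step can be sketched as follows. The actual script encodes category prompts with CLIP's text encoder; this self-contained sketch substitutes a hypothetical stub encoder (`fake_text_encoder`, illustration only) so the encode-once, save-to-disk, load-at-train-time logic is runnable on its own:

```python
import hashlib
import pickle


def fake_text_encoder(prompt: str, dim: int = 8) -> list[float]:
    """Stand-in for a CLIP text encoder (hypothetical): maps a prompt
    to a deterministic pseudo-embedding for illustration only."""
    digest = hashlib.sha256(prompt.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]


def cache_semantics(names, template, path):
    """Encode every class name once and save the embeddings to disk,
    so training runs can load them without re-running the encoder."""
    semantics = {n: fake_text_encoder(template.format(n)) for n in names}
    with open(path, "wb") as f:
        pickle.dump(semantics, f)
    return semantics


# Tiny illustrative vocabularies (not the real HICO-DET/V-COCO lists).
objects = ["bicycle", "umbrella"]
actions = ["ride", "hold"]

obj_sem = cache_semantics(objects, "a photo of a {}", "object_semantics.pkl")
act_sem = cache_semantics(actions, "a photo of a person {} something",
                          "action_semantics.pkl")

# At training time, load the cached embeddings instead of re-encoding.
with open("object_semantics.pkl", "rb") as f:
    loaded = pickle.load(f)
assert loaded == obj_sem
```

Caching the text embeddings once means the CLIP text encoder never has to run inside the training loop.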
Run the scripts from the `scripts` folder whose file names contain `DETR_CLIP`. For example, to train the model for the UA setting, run:

```shell
cd scripts
sh train_DETR_CLIP_UA.sh
```
If you use our work in your research, kindly star ⭐ our repository and consider citing it using the following BibTeX:
```bibtex
@INPROCEEDINGS{10445910,
  author={Sarma, Sandipan and Kalkar, Pradnesh and Sur, Arijit},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer},
  year={2024},
  pages={6355-6359},
  keywords={Visualization;Semantics;Detectors;Transformers;Feature extraction;Task analysis;Speech processing;Human-object interaction;transformer;CLIP;zero-shot learning},
  doi={10.1109/ICASSP48485.2024.10445910}
}
```