This repository holds the code for the paper "PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points", accepted at NeurIPS 2022.
[Paper Link] [Zhihu]
[Jan. 10, 2023] Fixed some bugs and typos; updated the best checkpoints for both multi-label benchmarks.
[Dec. 13, 2022] Released the code and checkpoints for MultiTHUMOS and Charades.
This paper presents PointTAD, a query-based framework for multi-label temporal action detection that leverages a set of learnable query points to capture both action boundary frames and semantic keyframes for finer action representation. The model takes RGB input only and forms an end-to-end trainable pipeline for easy deployment. PointTAD surpasses previous multi-label TAD methods by a large margin under detection-mAP and achieves comparable results under segmentation-mAP.
Dependencies: PyTorch 1.8.1 or higher, opencv-python, scipy, terminaltables, ruamel-yaml, ffmpeg. Run

```
pip install -r requirements.txt
```

to install the dependencies.
To prepare the RGB frames and corresponding annotations:

- Clone the repository and create the data folder: `cd PointTAD; mkdir data`.
- For MultiTHUMOS:
  - Download the raw videos of THUMOS14 from here and put them into `/data/thumos14_videos`;
  - Extract the RGB frames from the raw videos with `util/extract_frames.py`. The frames will be placed in `/data/multithumos_frames` (a conceptual sketch of this step is given right after this list);
  - Generate `multithumos_frames.json` for the extracted frames with `/util/generate_frame_dict.py` and put the json file into the `/datasets` folder (see the sketch after the directory tree below).
- For Charades:
  - Download the RGB frames of Charades from here, and place the frames at `/data/charades_v1_rgb`.
- Replace the frame folder path or image tensor path in `/datasets/dataset_cfg.yml`.
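The repository's `util/extract_frames.py` handles frame extraction end to end; the snippet below is only a rough, hypothetical sketch of what that step does, using the ffmpeg binary listed in the dependencies. The output naming pattern and paths here are assumptions, so defer to the actual script for the exact behavior.

```python
# Hypothetical sketch of the frame-extraction step; the real logic lives in util/extract_frames.py.
import os
import subprocess

VIDEO_ROOT = "data/thumos14_videos"      # raw THUMOS14 videos (training/ and testing/)
FRAME_ROOT = "data/multithumos_frames"   # destination for extracted RGB frames

for split in ("training", "testing"):
    video_dir = os.path.join(VIDEO_ROOT, split)
    for video in sorted(os.listdir(video_dir)):
        name = os.path.splitext(video)[0]
        out_dir = os.path.join(FRAME_ROOT, split, name)
        os.makedirs(out_dir, exist_ok=True)
        # Decode every frame to JPEG; the repo script controls the actual fps and file naming.
        subprocess.run(
            ["ffmpeg", "-i", os.path.join(video_dir, video), os.path.join(out_dir, "img_%05d.jpg")],
            check=True,
        )
```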
The structure of `data/` is as follows:

```
|-- data
|   |-- thumos14_videos
|   |   |-- training
|   |   |-- testing
|   |-- multithumos_frames
|   |   |-- training
|   |   |-- testing
|   |-- charades_v1_rgb
```
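The exact schema of `multithumos_frames.json` is defined by `/util/generate_frame_dict.py`. Purely for illustration, the sketch below assumes a simple mapping from video name to the number of extracted frames, walking the `data/multithumos_frames` layout shown above; check the script for the real format before relying on it.

```python
# Hypothetical sketch of building a frame dictionary; use util/generate_frame_dict.py for the real schema.
import json
import os

FRAME_ROOT = "data/multithumos_frames"

frame_dict = {}
for split in ("training", "testing"):
    split_dir = os.path.join(FRAME_ROOT, split)
    for video in sorted(os.listdir(split_dir)):
        # Assumed format: {video_name: number_of_extracted_frames}
        frame_dict[video] = len(os.listdir(os.path.join(split_dir, video)))

with open("datasets/multithumos_frames.json", "w") as f:
    json.dump(frame_dict, f)
```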
[Optional] Once you have the raw frames, you can convert them into tensors with `/util/frames2tensor.py` to speed up IO. By enabling `--img_tensor` in `train.sh` and `test.sh`, the model takes image tensors as input instead of frames.
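Conceptually, this conversion packs each video's JPEG frames into a single serialized tensor so that training reads one file per video instead of thousands of small images. The sketch below illustrates the idea under the assumption of one `.pt` file per video and a hypothetical output folder name; the actual resizing, dtype, and file layout are determined by `/util/frames2tensor.py`.

```python
# Hypothetical sketch of converting extracted frames into per-video tensors to speed up IO;
# the actual preprocessing is defined in util/frames2tensor.py.
import os
import cv2
import torch

FRAME_DIR = "data/multithumos_frames/training"
TENSOR_DIR = "data/multithumos_tensors/training"  # assumed output location, not from the repo
os.makedirs(TENSOR_DIR, exist_ok=True)

for video in sorted(os.listdir(FRAME_DIR)):
    frame_files = sorted(os.listdir(os.path.join(FRAME_DIR, video)))
    frames = []
    for fname in frame_files:
        img = cv2.imread(os.path.join(FRAME_DIR, video, fname))          # HxWx3, BGR, uint8
        frames.append(torch.from_numpy(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)))
    # Stack to a (T, H, W, 3) uint8 tensor; keeping uint8 keeps the files compact.
    torch.save(torch.stack(frames), os.path.join(TENSOR_DIR, f"{video}.pt"))
```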
The best checkpoint for each benchmark is provided in the links below. Error bars for both benchmarks are reported in the supplementary material of our paper.
| Dataset | mAP@0.2 | mAP@0.5 | mAP@0.7 | Avg-mAP | Checkpoint |
|---|---|---|---|---|---|
| MultiTHUMOS | 39.70% | 24.90% | 12.04% | 23.46% | Link |
| Charades | 17.45% | 13.46% | 9.14% | 12.13% | Link |
Use `test.sh` to evaluate.

- MultiTHUMOS:

  ```
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=11302 --use_env main.py --dataset multithumos --eval --load multithumos_best.pth
  ```

- Charades:

  ```
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=11302 --use_env main.py --dataset charades --eval --load charades_best.pth
  ```
Use `train.sh` to train PointTAD.

- MultiTHUMOS:

  ```
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=11302 --use_env main.py --dataset multithumos
  ```

- Charades:

  ```
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=11302 --use_env main.py --dataset charades
  ```
The codebase is built on top of RTD-Net, DETR, Sparse R-CNN, AFSD and E2ETAD; we thank the authors for providing useful code.
If you find our work useful, please consider citing our paper:
```bibtex
@inproceedings{
tan2022pointtad,
title={Point{TAD}: Multi-Label Temporal Action Detection with Learnable Query Points},
author={Jing Tan and Xiaotong Zhao and Xintian Shi and Bin Kang and Limin Wang},
booktitle={Advances in Neural Information Processing Systems},
editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
year={2022},
url={https://openreview.net/forum?id=_r8pCrHwq39}
}
```
Jing Tan: jtan@smail.nju.edu.cn