Showing 27 changed files with 2,486 additions and 4 deletions.
@@ -1,6 +1,115 @@
# ETAD: A Unified Framework for Efficient Temporal Action Detection

This repo holds the official PyTorch implementation of the paper
["ETAD: A Unified Framework for Efficient Temporal Action Detection"](https://openaccess.thecvf.com/content/CVPR2023W/ECV/papers/Liu_ETAD_Training_Action_Detection_End_to_End_on_a_Laptop_CVPRW_2023_paper.pdf)
(also on [arXiv](https://arxiv.org/abs/2205.07134)), which was accepted at a CVPR 2023 workshop.

Authors: Shuming Liu, Mengmeng Xu, Chen Zhao, Xu Zhao, and Bernard Ghanem

> Temporal action detection (TAD) with end-to-end training often suffers from the pain of huge demand for computing resources due to long video duration. In this work, we propose an efficient temporal action detector (ETAD) that can train directly from video frames with extremely low GPU memory consumption. Our main idea is to minimize and balance the heavy computation among features and gradients in each training iteration. We propose to sequentially forward the snippet frame through the video encoder, and backward only a small necessary portion of gradients to update the encoder. To further alleviate the computational redundancy in training, we propose to dynamically sample only a small subset of proposals during training. Moreover, various sampling strategies and ratios are studied for both the encoder and detector. ETAD achieves state-of-the-art performance on TAD benchmarks with remarkable efficiency. On ActivityNet-1.3, training ETAD in 18 hours can reach 38.25% average mAP with only 1.3 GB memory consumption per video under end-to-end training.
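
The core memory trick in the abstract is to encode a long video chunk by chunk rather than in one batched pass. Below is a minimal, hypothetical sketch of that idea; the names `encoder`, `frames`, and `chunk_size` are illustrative and not this repo's API (the real logic lives in `train_one_epoch` later in this commit).

```
import torch

@torch.no_grad()
def encode_sequentially(encoder, frames, chunk_size=4):
    """Encode [T, ...] snippet clips chunk by chunk so only one chunk of
    activations sits on the GPU at a time. Illustrative sketch only."""
    feats = [encoder(frames[i : i + chunk_size]) for i in range(0, len(frames), chunk_size)]
    return torch.cat(feats, dim=0)  # [T, D] feature sequence for the detector
```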

## Updates

- 12/03/2023: We have released our code and pretrained models for the ActivityNet experiments.

## Installation

**Step 1.** Clone the repository
```
git clone git@github.com:sming256/ETAD.git
cd ETAD
```

**Step 2.** Install PyTorch 2.0.1, Python 3.10.12, and CUDA 11.8

```
conda create -n etad python=3.10.12
source activate etad
conda install pytorch=2.0.1 torchvision=0.15.2 pytorch-cuda=11.8 -c pytorch -c nvidia
```

**Step 3.** Install mmaction2 for end-to-end training

```
pip install openmim
mim install mmcv==2.0.1
mim install mmaction2==1.1.0
pip install numpy==1.23.5
```
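
Optionally, you can sanity-check the environment before moving on. This quick check is not part of the repo's tooling, just a convenience:

```
# optional sanity check: confirm package versions and GPU visibility
import torch, mmcv, mmaction
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(mmcv.__version__, mmaction.__version__)
```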

## To Reproduce Our Results on ActivityNet 1.3

### End-to-End Experiment

**Download the ActivityNet videos**
- Note that we are not allowed to redistribute the videos without a license agreement. You can download the ActivityNet raw videos from the [official website](https://docs.google.com/forms/d/e/1FAIpQLSeKaFq9ZfcmZ7W0B0PbEhfbTHY41GeEgwsa7WobJgGUhn4DTQ/viewform).
- We downsample the videos to 15 fps and resize the shorter side to 256. If you find it hard to prepare the videos, you can send an email to shuming.liu@kaust.edu.sa to get the videos under a license agreement.
- Change the [VIDEO_PATH](configs/anet/e2e_anet_tsp_snippet0.3.py#L26) to the path of your videos (a quick decoding sanity check is sketched below).
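
A minimal sketch for checking that one prepared video decodes correctly and roughly matches the expected 15 fps / short-side-256 format. It uses `decord`, the decoder the data pipeline is configured with (`DecordInit`); install it with pip if it is missing. The file name below is hypothetical.

```
# assumes decord is installed; the video file name is a placeholder
from decord import VideoReader

vr = VideoReader("data/anet/raw_data/Anet_videos_15fps_short256/v_example.mp4")
frame = vr[0].asnumpy()
print(len(vr), vr.get_avg_fps(), frame.shape)  # expect ~15 fps and a short side of 256
```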

**Download the backbone weights**
- Download the pretrained [weights](https://github.com/HumamAlwassel/TSP/releases/download/model_weights/r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth) for the R(2+1)D backbone and move it to `pretrained/r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth`.

**Training**
- `python tools/train.py configs/anet/e2e_anet_tsp_snippet0.3.py 1`
- The trailing `1` means training with 1 GPU.
- The end-to-end experiment takes 18 hours and no more than 10 GB of memory for training.

**Inference**
- `python tools/test.py configs/anet/e2e_anet_tsp_snippet0.3.py 1`
- Testing takes around 45 minutes.

**Evaluation**
- `python tools/post.py configs/anet/e2e_anet_tsp_snippet0.3.py`

### Feature-based Experiment

**Download the TSP features**
- You can download the TSP features from [ActionFormer](https://github.com/happyharrycn/actionformer_release#to-reproduce-our-results-on-activitynet-13), or directly from this [Google Drive](https://drive.google.com/file/d/1VW8px1Nz9A17i0wMVUfxh6YsPCLVqL-S/view?usp=sharing).
- Change the [FEATURE_PATH](configs/anet/feature_anet_tsp.py#L7) to the path of your features.

**Training**
- `python tools/train.py configs/anet/feature_anet_tsp.py 1`
- The feature-based experiment is fast (6 minutes on my workstation).

**Testing and Evaluation**
- `python tools/test.py configs/anet/feature_anet_tsp.py 1 && python tools/post.py configs/anet/feature_anet_tsp.py`

### Pretrained Models

You can download the pretrained models from this [link](https://github.com/sming256/ETAD/releases/).
If you want to run inference with our checkpoint, you can simply run

```
python tools/test.py configs/anet/e2e_anet_tsp_snippet0.3.py 1 --checkpoint e2e_anet_snippet0.3_bs4_92e98.pth.pth
python tools/post.py configs/anet/e2e_anet_tsp_snippet0.3.py
```
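
If you want to peek inside a downloaded checkpoint before running inference, a quick inspection like the following usually suffices. The key layout shown in the comment is an assumption based on `save_checkpoint` in the training code added by this commit; if the released file holds only a bare state dict, `ckpt.keys()` will list parameter names instead.

```
# hypothetical inspection of a downloaded checkpoint
import torch

ckpt = torch.load("e2e_anet_snippet0.3_bs4_92e98.pth.pth", map_location="cpu")
print(ckpt.keys())  # expected: epoch / state_dict / scheduler / optimizer
```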

The results on ActivityNet (with the CUHK classifier) should be:

| mAP at tIoUs         | 0.5   | 0.75  | 0.95  | Avg   |
| -------------------- | ----- | ----- | ----- | ----- |
| ETAD - TSP - Feature | 54.96 | 39.06 | 9.21  | 37.80 |
| ETAD - TSP - E2E     | 56.22 | 39.93 | 10.23 | 38.73 |

You can also download our **logs and results** from [Google Drive](https://drive.google.com/drive/folders/1prknt8Ujsf_Wcpo6Z0ZU1NdXuEkK4d5j?usp=sharing).

## Contact

If you have any questions about our work, please contact Shuming Liu (shuming.liu@kaust.edu.sa).

## References

If you are using our code, please consider citing our paper.
```
@inproceedings{liu2023etad,
  title={ETAD: Training Action Detection End to End on a Laptop},
  author={Liu, Shuming and Xu, Mengmeng and Zhao, Chen and Zhao, Xu and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={4524--4533},
  year={2023}
}
```

If you are using TSP features, please cite

```
@inproceedings{alwassel2021tsp,
  title={{TSP}: Temporally-sensitive pretraining of video encoders for localization tasks},
  author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  pages={3173--3183},
  year={2021}
}
```

configs/anet/e2e_anet_tsp_snippet0.3.py
@@ -0,0 +1,76 @@
EXP_NAME = "e2e_anet_tsp_snippet0.3_bs4_lr5e-7"

E2E_SETTING = dict(
    mode=True,
    chunk_size=4,  # snippet number of each chunk
    model=dict(
        type="Recognizer3D",
        backbone=dict(
            type="ResNet2Plus1d_TSP",
            layers=[3, 4, 6, 3],
            pretrained="pretrained/r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth",
            frozen_stages=2,
            norm_eval=True,
        ),
        data_preprocessor=dict(
            type="ActionDataPreprocessor",
            mean=[110.2008, 100.63983, 95.99475],
            std=[58.14765, 56.46975, 55.332195],
            format_shape="NCTHW",
        ),
    ),
)

# DATASET SETTING
DATASET = dict(name="anet_1_3", tscale=128, dscale=128)
VIDEO_PATH = "data/anet/raw_data/Anet_videos_15fps_short256"
TRAIN_PIPELINE = [
    dict(type="DecordInit", num_threads=4),
    dict(type="SampleFrames", clip_len=16, num_clips=128, test_mode=True),
    dict(type="DecordDecode"),
    dict(type="Resize", scale=(171, 128), keep_ratio=False),
    dict(type="RandomCrop", size=112),
    dict(type="Flip", flip_ratio=0.5),
    dict(type="ImgAug", transforms="default"),
    dict(type="ColorJitter"),
    dict(type="FormatShape", input_format="NCTHW"),
]
TEST_PIPELINE = [
    dict(type="DecordInit", num_threads=4),
    dict(type="SampleFrames", clip_len=16, num_clips=128, test_mode=True),
    dict(type="DecordDecode"),
    dict(type="Resize", scale=(171, 128), keep_ratio=False),
    dict(type="CenterCrop", crop_size=112),
    dict(type="FormatShape", input_format="NCTHW"),
]

# MODEL SETTINGS
MODEL = dict(in_channels=512, roi_size=24, stage=[0.7, 0.8, 0.9], extend_ratio=0.5)

# SAMPLING SETTINGS
SAMPLING_RATIO = dict(snippet=0.3, proposal=0.06)
SAMPLING_STRATEGY = dict(proposal="random", snippet="random")

# SOLVER SETTING
SOLVER = dict(
    tal_lr=5.0e-4,
    backbone_lr=5.0e-7,
    step_size=5,
    gamma=0.1,
    batch_size=4,
    workers=4,
    epoch=6,  # total epoch
    infer=5,  # infer epoch: 5 is the last epoch
)

# LOSS SETTING
LOSS = dict(
    log_interval=200,
    pos_thresh=0.9,
    coef_pem_cls=1,
    coef_pem_reg=5,
    coef_pem_bnd=10,
)

# POST PROCESS SETTING
DETECTION_POST = dict(iou_threshold=0, sigma=0.35)  # soft nms
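
The `DETECTION_POST` entry above configures Gaussian soft-NMS. For reference, here is a generic, self-contained sketch of that scoring rule using the same parameter names (`iou_threshold`, `sigma`); it illustrates what the two values control and is not the repo's exact post-processing code.

```
import numpy as np

def soft_nms(segments, scores, iou_threshold=0.0, sigma=0.35):
    """Gaussian soft-NMS for 1D segments [[start, end], ...]; returns kept segments and decayed scores."""
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    keep_seg, keep_score = [], []
    remaining = list(range(len(scores)))
    while remaining:
        best = max(remaining, key=lambda i: scores[i])   # highest-scored remaining proposal
        remaining.remove(best)
        keep_seg.append(segments[best])
        keep_score.append(scores[best])
        s1, e1 = segments[best]
        for i in remaining:
            s2, e2 = segments[i]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            iou = inter / union if union > 0 else 0.0
            if iou > iou_threshold:                      # decay overlapping proposals instead of discarding them
                scores[i] *= np.exp(-(iou ** 2) / sigma)
    return np.array(keep_seg), np.array(keep_score)
```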

configs/anet/feature_anet_tsp.py
@@ -0,0 +1,38 @@
EXP_NAME = "feature_anet_tsp"

E2E_SETTING = dict(mode=False)

# DATASET SETTING
DATASET = dict(name="anet_1_3", tscale=128, dscale=128)
FEATURE = dict(path="data/anet/features/tsp_features", online_resize=True)

# MODEL SETTINGS
MODEL = dict(in_channels=512, roi_size=24, stage=[0.7, 0.8, 0.9], extend_ratio=0.5)

# SAMPLING SETTINGS
SAMPLING_RATIO = dict(snippet=0, proposal=0.06)  # set snippet=0 for all feature based experiments
SAMPLING_STRATEGY = dict(proposal="random")

# SOLVER SETTING
SOLVER = dict(
    tal_lr=1.0e-3,
    backbone_lr=0,
    step_size=5,
    gamma=0.1,
    batch_size=16,
    workers=8,
    epoch=6,  # total epoch
    infer=5,  # infer epoch: 5 is the last epoch
)

# LOSS SETTING
LOSS = dict(
    log_interval=200,
    pos_thresh=0.9,
    coef_pem_cls=1,
    coef_pem_reg=5,
    coef_pem_bnd=10,
)

# POST PROCESS SETTING
DETECTION_POST = dict(iou_threshold=0, sigma=0.35)  # soft nms
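
These configs are plain Python modules consumed by `tools/train.py` / `tools/test.py`. If you only want to inspect one outside the training scripts, a generic loader like the following works; this is a convenience sketch, not necessarily how the repo parses its configs.

```
import importlib.util

def load_config(path):
    # load a Python config file as a module and return it
    spec = importlib.util.spec_from_file_location("cfg", path)
    cfg = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(cfg)
    return cfg

cfg = load_config("configs/anet/feature_anet_tsp.py")
print(cfg.EXP_NAME, cfg.SOLVER["batch_size"], cfg.DETECTION_POST)
```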

@@ -0,0 +1,49 @@
import torch
import os
import tqdm
import pickle
from lib.utils.misc import reg_to_anchors


def inference(model, data_loader, logger, cfg):
    output_path = "./exps/{}/output/".format(cfg.EXP_NAME)

    for video_info, video_data, anchors_init in tqdm.tqdm(data_loader):
        batch_size = video_data.shape[0]
        video_data = video_data.cuda()
        anchors_init = anchors_init.cuda()

        with torch.no_grad():
            (tem_out, stage_out) = model(video_data, anchors_init=anchors_init)

        # get anchors and ious: average the regressed anchors and IoU predictions over all stages
        anchors = torch.stack([reg_to_anchors(out[0], out[2]) for out in stage_out], dim=0).mean(dim=0)
        ious = torch.stack([out[1] for out in stage_out], dim=0).mean(dim=0)
        ious = ious.view(batch_size, -1, ious.shape[1])

        for jdx in range(batch_size):
            # get snippet info
            video_name = video_info["video_name"][jdx]
            video_snippets = video_info["indices"][jdx].numpy()
            start = video_snippets[0]
            end = video_snippets[-1]

            # detach results and move them to the CPU
            pred_anchors = anchors[jdx].cpu().detach().numpy()
            pred_start = tem_out[jdx, 0, :].cpu().detach().numpy()
            pred_end = tem_out[jdx, 1, :].cpu().detach().numpy()
            pred_iou = ious[jdx].cpu().detach().numpy()

            result = [video_snippets, pred_anchors, pred_start, pred_end, pred_iou]

            # save result: one file per video for ActivityNet/HACS, one file per snippet window for THUMOS
            if cfg.DATASET.name in ["anet_1_3", "hacs"]:
                file_path = os.path.join(output_path, "{}.pkl".format(video_name))
            elif cfg.DATASET.name == "thumos_14":
                output_folder = os.path.join(output_path, video_name)
                if not os.path.exists(output_folder):
                    os.mkdir(output_folder)
                file_path = os.path.join(output_folder, "{}_{}.pkl".format(start, end))

            with open(file_path, "wb") as outfile:
                pickle.dump(result, outfile, pickle.HIGHEST_PROTOCOL)
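
Each saved `.pkl` holds the `result` list from above, so reading one back for an ActivityNet run looks roughly like this; the experiment name and video name below are placeholders.

```
import pickle

# hypothetical paths; field order follows the `result` list built in `inference`
with open("exps/e2e_anet_tsp_snippet0.3_bs4_lr5e-7/output/v_example.pkl", "rb") as f:
    snippets, anchors, start_scores, end_scores, ious = pickle.load(f)
print(anchors.shape, ious.shape)
```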

@@ -0,0 +1,107 @@
import time, copy
import os
import torch
import datetime
import pickle
from ..utils.metric_logger import MetricLogger


def train_one_epoch(model, criterion, data_loader, logger, cfg, optimizer=None, scheduler=None):
    model.train()

    meters = MetricLogger(delimiter=" ")
    end = time.time()

    max_iteration = len(data_loader)
    max_epoch = cfg.SOLVER.epoch
    last_epoch_iteration = (max_epoch - cfg.epoch - 1) * max_iteration  # iterations left after this epoch, for the ETA

    for idx, (video_info, video_data, anchors_init, video_gt) in enumerate(data_loader):
        video_data = video_data.cuda()
        anchors_init = anchors_init.cuda()

        video_gt = [_gt.cuda() for _gt in video_gt]

        if not cfg.E2E_SETTING.mode:
            # feature-based training: a single forward/backward pass over the detector
            pred = model(video_data, anchors_init=anchors_init)
            cost, loss_dict = criterion(pred, video_gt)

            optimizer.zero_grad()
            cost.backward()
            optimizer.step()

        else:
            # stage 1: sequentially forward the backbone to extract snippet features
            video_feat = model(video_data, stage=1)

            # stage 2: forward and backward the detector on top of the features
            video_feat.requires_grad = True
            video_feat.retain_grad()
            det_pred = model(video_feat, anchors_init=anchors_init, stage=2)
            cost, loss_dict = criterion(det_pred, video_gt)

            # backward the detector
            optimizer.zero_grad()
            cost.backward()
            optimizer.step()

            # stage 3: sequentially backward the backbone with sampled snippets
            if cfg.SAMPLING_RATIO.snippet > 0:
                # copy the feature's gradient saved by retain_grad()
                feat_grad = copy.deepcopy(video_feat.grad.detach())  # [B,C,T]

                # sample snippets and sequentially backward
                optimizer.zero_grad()
                model(video_data, feat_grad=feat_grad, stage=3)
                optimizer.step()

        batch_time = time.time() - end
        end = time.time()

        meters.update(time=batch_time)
        meters.update(**loss_dict)

        eta_seconds = meters.time.avg * (max_iteration - idx - 1 + last_epoch_iteration)
        eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))

        if ((idx % cfg.LOSS.log_interval == 0) and idx != 0) or (idx == max_iteration - 1):
            logger.info(
                meters.delimiter.join(
                    [
                        "{mode}: [E{epoch}/{max_epoch}]",
                        "iter: {iteration}/{max_iteration}",
                        "eta: {eta}",
                        "{meters}",
                        "max_mem: {memory:.2f}GB",
                    ]
                ).format(
                    mode="Train",
                    eta=eta_string,
                    epoch=cfg.epoch,
                    max_epoch=max_epoch - 1,
                    iteration=idx,
                    max_iteration=max_iteration - 1,
                    meters=str(meters),
                    memory=torch.cuda.max_memory_allocated() / 1024.0 / 1024.0 / 1024.0,
                )
            )

    scheduler.step()
    save_checkpoint(model, cfg.epoch, cfg, scheduler, optimizer)


def save_checkpoint(model, epoch, cfg, scheduler, optimizer):
    exp_name = cfg.EXP_NAME

    state = {
        "epoch": epoch,
        "state_dict": model.module.state_dict(),
        "scheduler": scheduler.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    checkpoint_dir = "./exps/%s/checkpoint/" % (exp_name)

    if not os.path.exists(checkpoint_dir):
        os.system("mkdir -p %s" % (checkpoint_dir))
    checkpoint_path = os.path.join(checkpoint_dir, "epoch_%d.pth.tar" % epoch)
    torch.save(state, checkpoint_path)
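
To make stage 3 above concrete: conceptually, only a sampled fraction of snippets is re-forwarded through the encoder with gradients enabled, and the feature gradient stored in stage 2 is injected at those positions. The following is a self-contained illustration of that idea under assumed shapes, not the actual `model(..., stage=3)` implementation.

```
import torch

def backward_sampled_snippets(encoder, frames, feat_grad, ratio=0.3):
    """Illustrative stage-3 step: frames [B, T, ...], feat_grad [B, C, T] from stage 2."""
    T = frames.shape[1]
    num = max(1, int(T * ratio))                      # e.g. SAMPLING_RATIO.snippet = 0.3
    idx = torch.randperm(T, device=frames.device)[:num]
    for t in idx:                                     # sequential, so only one snippet's graph is alive
        feat_t = encoder(frames[:, t])                # re-forward a single sampled snippet, [B, C]
        feat_t.backward(feat_grad[:, :, t])           # push the detector's gradient into the encoder
```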