activitynet release
sming256 committed Dec 2, 2023
1 parent 8d9dd34 commit b973a5f
Showing 27 changed files with 2,486 additions and 4 deletions.
117 changes: 113 additions & 4 deletions README.md
@@ -1,6 +1,115 @@
# ETAD
This repo holds the official pytorch implementation of paper: ["ETAD: A Unified Framework for Efficient Temporal Action Detection"](https://arxiv.org/abs/2205.07134)
- Author: Shuming Liu, Mengmeng Xu, Chen Zhao, Xu Zhao, and Bernard Ghanem
# ETAD: A Unified Framework for Efficient Temporal Action Detection
This repo holds the official implementation of the paper:
["ETAD: A Unified Framework for Efficient Temporal Action Detection"](https://openaccess.thecvf.com/content/CVPR2023W/ECV/papers/Liu_ETAD_Training_Action_Detection_End_to_End_on_a_Laptop_CVPRW_2023_paper.pdf), which was accepted at the CVPR 2023 workshops.

> Temporal action detection (TAD) with end-to-end training often suffers from the pain of huge demand for computing resources due to long video duration. In this work, we propose an efficient temporal action detector (ETAD) that can train directly from video frames with extremely low GPU memory consumption. Our main idea is to minimize and balance the heavy computation among features and gradients in each training iteration. We propose to sequentially forward the snippet frame through the video encoder, and backward only a small necessary portion of gradients to update the encoder. To further alleviate the computational redundancy in training, we propose to dynamically sample only a small subset of proposals during training. Moreover, various sampling strategies and ratios are studied for both the encoder and detector. ETAD achieves state-of-the-art performance on TAD benchmarks with remarkable efficiency. On ActivityNet-1.3, training ETAD in 18 hours can reach 38.25% average mAP with only 1.3 GB memory consumption per video under end-to-end training.
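
As a rough illustration of the training scheme described above (and of the end-to-end branch in `lib/core/trainer.py` later in this commit), here is a minimal PyTorch sketch. The encoder, detector, shapes, and loss are toy stand-ins, not the repository's code; it only shows the idea: encode snippets chunk by chunk without gradients, train the detector on the cached features, then re-forward a random subset of snippets to backpropagate the cached feature gradient into the encoder.

```
import torch
import torch.nn as nn

# Toy stand-ins (hypothetical): a per-snippet encoder and a detection head.
encoder = nn.Conv1d(3, 512, kernel_size=1)
detector = nn.Linear(512, 2)
optimizer = torch.optim.SGD(list(encoder.parameters()) + list(detector.parameters()), lr=1e-3)

video = torch.randn(1, 3, 128)   # 1 video, 128 snippets (spatial dims collapsed for brevity)
target = torch.randn(1, 128, 2)  # dummy detection targets
chunk_size, snippet_ratio = 4, 0.3

# Stage 1: forward all snippets chunk by chunk without storing activations.
with torch.no_grad():
    feat = torch.cat(
        [encoder(video[:, :, i:i + chunk_size]) for i in range(0, video.shape[-1], chunk_size)],
        dim=2,
    )

# Stage 2: train the detector on the cached features, keeping the feature gradient.
feat.requires_grad_(True)
loss = nn.functional.mse_loss(detector(feat.transpose(1, 2)), target)  # placeholder loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
feat_grad = feat.grad.detach()

# Stage 3: re-forward only a sampled subset of snippets with gradients enabled and
# backpropagate the cached feature gradient into the encoder.
idx = torch.randperm(video.shape[-1])[: int(video.shape[-1] * snippet_ratio)]
optimizer.zero_grad()
encoder(video[:, :, idx]).backward(feat_grad[:, :, idx])
optimizer.step()
```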

## Updates
Code will be released soon!
- 12/03/2023: We have released our code and pretrained models for the ActivityNet experiments.

## Installation

**Step 1.** Clone the repository
```
git clone git@github.com:sming256/ETAD.git
cd ETAD
```

**Step 2.** Install PyTorch 2.0.1 (Python 3.10.12, CUDA 11.8)

```
conda create -n etad python=3.10.12
conda activate etad
conda install pytorch=2.0.1 torchvision=0.15.2 pytorch-cuda=11.8 -c pytorch -c nvidia
```

**Step 3.** Install mmaction2 for end-to-end training
```
pip install openmim
mim install mmcv==2.0.1
mim install mmaction2==1.1.0
pip install numpy==1.23.5
```

## To Reproduce Our Results on ActivityNet 1.3

### End-to-End Experiment

**Download the ActivityNet videos**
- Note that we are not allowed to redistribute the videos without a license agreement. You can download the raw ActivityNet videos from the [official website](https://docs.google.com/forms/d/e/1FAIpQLSeKaFq9ZfcmZ7W0B0PbEhfbTHY41GeEgwsa7WobJgGUhn4DTQ/viewform).
- We downsample the videos to 15 fps and resize the shorter side to 256 (see the example script after this list). If you find it hard to prepare the videos, you can send an email to shuming.liu@kaust.edu.sa to get the videos under a license agreement.
- Change the [VIDEO_PATH](configs/anet/e2e_anet_tsp_snippet0.3.py#L26) to the path of your videos.
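
For reference, a possible preprocessing script is sketched below. It is not part of this repo; the folder names and the exact ffmpeg filter are assumptions. It re-encodes each raw video to 15 fps with the shorter side resized to 256.

```
# Hypothetical preprocessing script: downsample to 15 fps and resize the shorter side to 256.
import pathlib
import subprocess

SRC = pathlib.Path("data/anet/raw_videos")  # assumed folder with the downloaded videos
DST = pathlib.Path("data/anet/raw_data/Anet_videos_15fps_short256")
DST.mkdir(parents=True, exist_ok=True)

for video in SRC.glob("*.mp4"):
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", str(video),
            "-r", "15",  # downsample to 15 fps
            # scale the shorter side to 256 while keeping the aspect ratio (even dimensions)
            "-vf", "scale='if(lt(iw,ih),256,-2)':'if(lt(iw,ih),-2,256)'",
            "-c:a", "copy",
            str(DST / video.name),
        ],
        check=True,
    )
```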

**Download the backbone weights**
- Download the pretrained [weights](https://github.com/HumamAlwassel/TSP/releases/download/model_weights/r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth) for the R(2+1)D backbone and move them to `pretrained/r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth`.

**Training**
- `python tools/train.py configs/anet/e2e_anet_tsp_snippet0.3.py 1`
- The trailing `1` means training with 1 GPU.
- The end-to-end experiment takes about 18 hours and no more than 10 GB of GPU memory for training.

**Inference**
- `python tools/test.py configs/anet/e2e_anet_tsp_snippet0.3.py 1`
- Testing takes around 45 minutes.

**Evaluation**
- `python tools/post.py configs/anet/e2e_anet_tsp_snippet0.3.py`

### Feature-based Experiment

**Download the TSP features**
- You can download TSP feature from [ActionFormer](https://github.com/happyharrycn/actionformer_release#to-reproduce-our-results-on-activitynet-13), or directly from this [Google drive](https://drive.google.com/file/d/1VW8px1Nz9A17i0wMVUfxh6YsPCLVqL-S/view?usp=sharing).
- Change the [FEATURE_PATH](configs/anet/feature_anet_tsp.py#L7) to the path of your features.

**Training**
- `python tools/train.py configs/anet/feature_anet_tsp.py 1`
- The feature-based experiment is fast (about 6 minutes on my workstation).

**Testing and Evaluation**
- `python tools/test.py configs/anet/feature_anet_tsp.py 1 && python tools/post.py configs/anet/feature_anet_tsp.py`


### Pretrained Models
You can download the pretrained models from the [releases page](https://github.com/sming256/ETAD/releases/).
If you want to run inference with our checkpoint, simply run:

```
python tools/test.py configs/anet/e2e_anet_tsp_snippet0.3.py 1 --checkpoint e2e_anet_snippet0.3_bs4_92e98.pth
python tools/post.py configs/anet/e2e_anet_tsp_snippet0.3.py
```

The results on ActivityNet (with the CUHK classifier) should be:

| mAP at tIoUs | 0.5 | 0.75 | 0.95 | Avg |
| -------------------- | ----- | ----- | ----- | ----- |
| ETAD - TSP - Feature | 54.96 | 39.06 | 9.21 | 37.80 |
| ETAD - TSP - E2E | 56.22 | 39.93 | 10.23 | 38.73 |

You can also download our **logs and results** from [Google Drive](https://drive.google.com/drive/folders/1prknt8Ujsf_Wcpo6Z0ZU1NdXuEkK4d5j?usp=sharing).


## Contact
If you have any questions about our work, please contact Shuming Liu (shuming.liu@kaust.edu.sa).

## References
If you are using our code, please consider citing our paper.
```
@inproceedings{liu2023etad,
  title={ETAD: Training Action Detection End to End on a Laptop},
  author={Liu, Shuming and Xu, Mengmeng and Zhao, Chen and Zhao, Xu and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={4524--4533},
  year={2023}
}
```

If you are using TSP features, please cite
```
@inproceedings{alwassel2021tsp,
  title={{TSP}: Temporally-sensitive pretraining of video encoders for localization tasks},
  author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  pages={3173--3183},
  year={2021}
}
```
76 changes: 76 additions & 0 deletions configs/anet/e2e_anet_tsp_snippet0.3.py
@@ -0,0 +1,76 @@
EXP_NAME = "e2e_anet_tsp_snippet0.3_bs4_lr5e-7"

E2E_SETTING = dict(
    mode=True,
    chunk_size=4,  # snippet number of each chunk
    model=dict(
        type="Recognizer3D",
        backbone=dict(
            type="ResNet2Plus1d_TSP",
            layers=[3, 4, 6, 3],
            pretrained="pretrained/r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth",
            frozen_stages=2,
            norm_eval=True,
        ),
        data_preprocessor=dict(
            type="ActionDataPreprocessor",
            mean=[110.2008, 100.63983, 95.99475],
            std=[58.14765, 56.46975, 55.332195],
            format_shape="NCTHW",
        ),
    ),
)

# DATASET SETTING
DATASET = dict(name="anet_1_3", tscale=128, dscale=128)
VIDEO_PATH = "data/anet/raw_data/Anet_videos_15fps_short256"
TRAIN_PIPELINE = [
    dict(type="DecordInit", num_threads=4),
    dict(type="SampleFrames", clip_len=16, num_clips=128, test_mode=True),
    dict(type="DecordDecode"),
    dict(type="Resize", scale=(171, 128), keep_ratio=False),
    dict(type="RandomCrop", size=112),
    dict(type="Flip", flip_ratio=0.5),
    dict(type="ImgAug", transforms="default"),
    dict(type="ColorJitter"),
    dict(type="FormatShape", input_format="NCTHW"),
]
TEST_PIPELINE = [
    dict(type="DecordInit", num_threads=4),
    dict(type="SampleFrames", clip_len=16, num_clips=128, test_mode=True),
    dict(type="DecordDecode"),
    dict(type="Resize", scale=(171, 128), keep_ratio=False),
    dict(type="CenterCrop", crop_size=112),
    dict(type="FormatShape", input_format="NCTHW"),
]

# MODEL SETTINGS
MODEL = dict(in_channels=512, roi_size=24, stage=[0.7, 0.8, 0.9], extend_ratio=0.5)

# SAMPLING SETTINGS
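# snippet: fraction of snippets re-forwarded with gradients to update the encoder (stage 3 in lib/core/trainer.py)
# proposal: fraction of proposals sampled for the detector loss in each training iteration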
SAMPLING_RATIO = dict(snippet=0.3, proposal=0.06)
SAMPLING_STRATEGY = dict(proposal="random", snippet="random")

# SOLVER SETTING
SOLVER = dict(
    tal_lr=5.0e-4,
    backbone_lr=5.0e-7,
    step_size=5,
    gamma=0.1,
    batch_size=4,
    workers=4,
    epoch=6,  # total epoch
    infer=5,  # infer epoch: 5 is the last epoch
)

# LOSS SETTING
LOSS = dict(
    log_interval=200,
    pos_thresh=0.9,
    coef_pem_cls=1,
    coef_pem_reg=5,
    coef_pem_bnd=10,
)

# POST PROCESS SETTING
DETECTION_POST = dict(iou_threshold=0, sigma=0.35) # soft nms
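
For clarity, `DETECTION_POST` above configures Gaussian soft-NMS over the predicted temporal segments. Below is a generic, minimal sketch of that post-processing step; it is a standalone illustration, not the repository's implementation, and the function name and `top_k` cap are made up.

```
import numpy as np

def soft_nms_1d(segments, scores, sigma=0.35, iou_threshold=0.0, top_k=100):
    """Generic Gaussian soft-NMS for 1-D temporal segments given as [[start, end], ...]."""
    segments, scores = np.asarray(segments, float), np.asarray(scores, float)
    keep_segs, keep_scores = [], []
    while scores.size > 0 and len(keep_segs) < top_k:
        best = int(np.argmax(scores))
        seg = segments[best]
        keep_segs.append(seg)
        keep_scores.append(scores[best])
        segments = np.delete(segments, best, axis=0)
        scores = np.delete(scores, best)
        if scores.size == 0:
            break
        # temporal IoU between the kept segment and all remaining segments
        inter = np.maximum(0.0, np.minimum(segments[:, 1], seg[1]) - np.maximum(segments[:, 0], seg[0]))
        union = (segments[:, 1] - segments[:, 0]) + (seg[1] - seg[0]) - inter
        iou = inter / np.maximum(union, 1e-8)
        # Gaussian decay: overlapping segments are down-weighted instead of removed
        scores = np.where(iou > iou_threshold, scores * np.exp(-(iou ** 2) / sigma), scores)
    return np.stack(keep_segs), np.array(keep_scores)
```
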
38 changes: 38 additions & 0 deletions configs/anet/feature_anet_tsp.py
@@ -0,0 +1,38 @@
EXP_NAME = "feature_anet_tsp"

E2E_SETTING = dict(mode=False)

# DATASET SETTING
DATASET = dict(name="anet_1_3", tscale=128, dscale=128)
FEATURE = dict(path="data/anet/features/tsp_features", online_resize=True)

# MODEL SETTINGS
MODEL = dict(in_channels=512, roi_size=24, stage=[0.7, 0.8, 0.9], extend_ratio=0.5)

# SAMPLING SETTINGS
SAMPLING_RATIO = dict(snippet=0, proposal=0.06)  # set snippet=0 for all feature-based experiments
SAMPLING_STRATEGY = dict(proposal="random")

# SOLVER SETTING
SOLVER = dict(
    tal_lr=1.0e-3,
    backbone_lr=0,
    step_size=5,
    gamma=0.1,
    batch_size=16,
    workers=8,
    epoch=6,  # total epoch
    infer=5,  # infer epoch: 5 is the last epoch
)

# LOSS SETTING
LOSS = dict(
    log_interval=200,
    pos_thresh=0.9,
    coef_pem_cls=1,
    coef_pem_reg=5,
    coef_pem_bnd=10,
)

# POST PROCESS SETTING
DETECTION_POST = dict(iou_threshold=0, sigma=0.35) # soft nms
49 changes: 49 additions & 0 deletions lib/core/inferer.py
@@ -0,0 +1,49 @@
import torch
import os
import tqdm
import pickle
from lib.utils.misc import reg_to_anchors


def inference(model, data_loader, logger, cfg):
    output_path = "./exps/{}/output/".format(cfg.EXP_NAME)

    for video_info, video_data, anchors_init in tqdm.tqdm(data_loader):
        batch_size = video_data.shape[0]
        video_data = video_data.cuda()
        anchors_init = anchors_init.cuda()

        with torch.no_grad():
            (tem_out, stage_out) = model(video_data, anchors_init=anchors_init)

        # get anchors and ious
        anchors = torch.stack([reg_to_anchors(out[0], out[2]) for out in stage_out], dim=0).mean(dim=0)
        ious = torch.stack([out[1] for out in stage_out], dim=0).mean(dim=0)
        ious = ious.view(batch_size, -1, ious.shape[1])

        for jdx in range(batch_size):
            # get snippet info
            video_name = video_info["video_name"][jdx]
            video_snippets = video_info["indices"][jdx].numpy()
            start = video_snippets[0]
            end = video_snippets[-1]

            # detach result
            pred_anchors = anchors[jdx].cpu().detach().numpy()
            pred_start = tem_out[jdx, 0, :].cpu().detach().numpy()
            pred_end = tem_out[jdx, 1, :].cpu().detach().numpy()
            pred_iou = ious[jdx].cpu().detach().numpy()

            result = [video_snippets, pred_anchors, pred_start, pred_end, pred_iou]

            # save result
            if cfg.DATASET.name in ["anet_1_3", "hacs"]:
                file_path = os.path.join(output_path, "{}.pkl".format(video_name))
            elif cfg.DATASET.name == "thumos_14":
                output_folder = os.path.join(output_path, video_name)
                if not os.path.exists(output_folder):
                    os.mkdir(output_folder)
                file_path = os.path.join(output_folder, "{}_{}.pkl".format(start, end))

            with open(file_path, "wb") as outfile:
                pickle.dump(result, outfile, pickle.HIGHEST_PROTOCOL)
107 changes: 107 additions & 0 deletions lib/core/trainer.py
@@ -0,0 +1,107 @@
import time, copy
import os
import torch
import datetime
import pickle
from ..utils.metric_logger import MetricLogger


def train_one_epoch(model, criterion, data_loader, logger, cfg, optimizer=None, scheduler=None):
    model.train()

    meters = MetricLogger(delimiter=" ")
    end = time.time()

    max_iteration = len(data_loader)
    max_epoch = cfg.SOLVER.epoch
    last_epoch_iteration = (max_epoch - cfg.epoch - 1) * max_iteration

    for idx, (video_info, video_data, anchors_init, video_gt) in enumerate(data_loader):
        video_data = video_data.cuda()
        anchors_init = anchors_init.cuda()

        video_gt = [_gt.cuda() for _gt in video_gt]

        if not cfg.E2E_SETTING.mode:
            pred = model(video_data, anchors_init=anchors_init)
            cost, loss_dict = criterion(pred, video_gt)

            optimizer.zero_grad()
            cost.backward()
            optimizer.step()

        else:
            # stage 1: sequentially forward the backbone
            video_feat = model(video_data, stage=1)

            # stage 2: forward and backward the detector
            video_feat.requires_grad = True
            video_feat.retain_grad()
            det_pred = model(video_feat, anchors_init=anchors_init, stage=2)
            cost, loss_dict = criterion(det_pred, video_gt)

            # backward the detector
            optimizer.zero_grad()
            cost.backward()
            optimizer.step()

            # stage 3: sequentially backward the backbone with sampled data
            if cfg.SAMPLING_RATIO.snippet > 0:
                # copy the feature's gradient
                feat_grad = copy.deepcopy(video_feat.grad.detach())  # [B,C,T]

                # sample snippets and sequentially backward
                optimizer.zero_grad()
                model(video_data, feat_grad=feat_grad, stage=3)
                optimizer.step()

        batch_time = time.time() - end
        end = time.time()

        meters.update(time=batch_time)
        meters.update(**loss_dict)

        eta_seconds = meters.time.avg * (max_iteration - idx - 1 + last_epoch_iteration)
        eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))

        if ((idx % cfg.LOSS.log_interval == 0) and idx != 0) or (idx == max_iteration - 1):
            logger.info(
                meters.delimiter.join(
                    [
                        "{mode}: [E{epoch}/{max_epoch}]",
                        "iter: {iteration}/{max_iteration}",
                        "eta: {eta}",
                        "{meters}",
                        "max_mem: {memory:.2f}GB",
                    ]
                ).format(
                    mode="Train",
                    eta=eta_string,
                    epoch=cfg.epoch,
                    max_epoch=max_epoch - 1,
                    iteration=idx,
                    max_iteration=max_iteration - 1,
                    meters=str(meters),
                    memory=torch.cuda.max_memory_allocated() / 1024.0 / 1024.0 / 1024.0,
                )
            )

    scheduler.step()
    save_checkpoint(model, cfg.epoch, cfg, scheduler, optimizer)


def save_checkpoint(model, epoch, cfg, scheduler, optimizer):
    exp_name = cfg.EXP_NAME

    state = {
        "epoch": epoch,
        "state_dict": model.module.state_dict(),
        "scheduler": scheduler.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    checkpoint_dir = "./exps/%s/checkpoint/" % (exp_name)

    if not os.path.exists(checkpoint_dir):
        os.system("mkdir -p %s" % (checkpoint_dir))
    checkpoint_path = os.path.join(checkpoint_dir, "epoch_%d.pth.tar" % epoch)
    torch.save(state, checkpoint_path)
