Official implementation of MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction (ICCV 2021 paper)
[Paper] [Supp] [Poster] [Slides] [Video]
Notes:
- 2022.09.06: For stochastic long-term human motion prediction, which aims to produce future sequences with high plausibility and diversity, our new work Diverse Human Motion Prediction via Gumbel-Softmax Sampling from an Auxiliary Space (ACMMM 2022) is available.
- Lingwei Dang, School of Computer Science and Engineering, South China University of Technology, China, levondang@163.com
- Yongwei Nie, School of Computer Science and Engineering, South China University of Technology, China, nieyongwei@scut.edu.cn
- Chengjiang Long, JD Finance America Corporation, USA, cjfykx@gmail.com
- Qing Zhang, School of Computer Science and Engineering, Sun Yat-sen University, China, zhangqing.whu.cs@gmail.com
- Guiqing Li, School of Computer Science and Engineering, South China University of Technology, China, ligq@scut.edu.cn
Human motion prediction is a challenging task due to the stochasticity and aperiodicity of future poses. Recently, graph convolutional networks (GCNs) have proven very effective at learning dynamic relations among pose joints, which is helpful for pose prediction. On the other hand, one can abstract a human pose recursively to obtain a set of poses at multiple scales. As the abstraction level increases, the motion of the pose becomes more stable, which also benefits pose prediction. In this paper, we propose a novel multi-scale residual Graph Convolution Network (MSR-GCN) for the human pose prediction task in an end-to-end manner. The GCNs are used to extract features from fine to coarse scale and then from coarse to fine scale. The extracted features at each scale are then combined and decoded to obtain the residuals between the input and target poses. Intermediate supervision is imposed on all the predicted poses, which forces the network to learn more representative features. Our proposed approach is evaluated on two standard benchmark datasets, i.e., the Human3.6M dataset and the CMU Mocap dataset. Experimental results demonstrate that our method outperforms the state-of-the-art approaches.
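The two ideas at the end of the abstract, predicting a residual on top of the input pose and supervising the prediction at every scale, can be illustrated with a small sketch. This is plain Python for readability, not the code in this repository: `mpjpe`, `add_residual` and `multi_scale_loss` are hypothetical helper names, and the choice of an unweighted sum of mean per-joint position errors over scales is an assumption made for illustration.

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance between corresponding 3D joints of two poses."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

def add_residual(input_pose, residual):
    """Predicted pose = input pose + decoded residual, joint by joint."""
    return [[a + r for a, r in zip(joint, delta)]
            for joint, delta in zip(input_pose, residual)]

def multi_scale_loss(preds_by_scale, targets_by_scale):
    """Intermediate supervision: add up the error at every scale (assumed unweighted)."""
    return sum(mpjpe(p, t) for p, t in zip(preds_by_scale, targets_by_scale))

# Toy example with two joints at the fine scale and one joint at the coarse scale.
last_input = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
residual = [[0.0, 3.0, 4.0], [0.0, 0.0, 0.0]]
fine_pred = add_residual(last_input, residual)
fine_target = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
coarse_pred = [[0.5, 1.5, 2.0]]
coarse_target = [[0.5, 0.0, 0.0]]

loss = multi_scale_loss([fine_pred, coarse_pred], [fine_target, coarse_target])
print(loss)
```

With a residual formulation, a network that outputs zeros simply repeats the input pose, which is a reasonable starting point for short-horizon prediction.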
- PyTorch 1.7.0+cu110
- Python 3.8.5
- Nvidia RTX 3090
Human3.6M in exponential-map format can be downloaded from here.
CMU Mocap was obtained from the repository of the ConvSeq2Seq paper.
Human3.6M
- A pose in H3.6M has 32 joints, of which we select 22 and build the multi-scale representation by grouping joints as 22 -> 12 -> 7 -> 4.
- We use S5 / S11 as the test / validation sets and the rest as the training set. Testing is done on each of the 15 actions separately, using all data for each action instead of 8 randomly selected samples.
- Some of the original 32 joints share the same position.
- The input / output length is 10 / 25 frames.
CMU Mocap dataset
- A pose in CMU has 38 joints, of which we select 25 and build the multi-scale representation by grouping joints as 25 -> 12 -> 7 -> 4.
- CMU has no validation set. Testing is done on each of the 8 actions separately, using all data for each action instead of 8 randomly selected samples.
- Some of the original 38 joints share the same position.
- The input / output length is 10 / 25 frames.
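The joint-count reductions listed for both datasets (22 -> 12 -> 7 -> 4 and 25 -> 12 -> 7 -> 4) can be sketched as repeated group averaging. The sketch below is an illustration under stated assumptions: the contiguous index grouping produced by `contiguous_groups` is a placeholder (the actual groups presumably follow body parts and are defined in the repository code), averaging group members is assumed as the pooling rule, and all function names are hypothetical.

```python
from typing import List, Sequence

def contiguous_groups(n_fine: int, n_coarse: int) -> List[List[int]]:
    """Split range(n_fine) into n_coarse contiguous, near-equal groups.

    Placeholder grouping for illustration only; not the repository's joint groups.
    """
    base, extra = divmod(n_fine, n_coarse)
    groups, start = [], 0
    for g in range(n_coarse):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def downsample_pose(pose: Sequence[Sequence[float]],
                    groups: List[List[int]]) -> List[List[float]]:
    """Replace each group of joints by the mean of its 3D positions."""
    return [[sum(pose[j][axis] for j in grp) / len(grp) for axis in range(3)]
            for grp in groups]

def build_scales(pose: Sequence[Sequence[float]],
                 sizes: Sequence[int] = (22, 12, 7, 4)) -> List[List[List[float]]]:
    """Return the pose at every scale, from finest to coarsest."""
    assert len(pose) == sizes[0], "pose must match the finest scale"
    scales = [[list(joint) for joint in pose]]
    for n_fine, n_coarse in zip(sizes, sizes[1:]):
        groups = contiguous_groups(n_fine, n_coarse)
        scales.append(downsample_pose(scales[-1], groups))
    return scales

# Dummy H3.6M-sized pose: 22 joints laid out along the x axis.
h36m_pose = [[float(i), 0.0, 1.0] for i in range(22)]
h36m_scales = build_scales(h36m_pose)
print([len(s) for s in h36m_scales])  # [22, 12, 7, 4]

# Dummy CMU-sized pose: 25 joints.
cmu_pose = [[float(i), 0.0, 1.0] for i in range(25)]
cmu_scales = build_scales(cmu_pose, sizes=(25, 12, 7, 4))
```

Averaging many joints into one smooths out fast local motion, which matches the observation that coarser poses move more stably.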
- Train on Human3.6M:
python main.py --exp_name=h36m --is_train=1 --output_n=25 --dct_n=35 --test_manner=all
- Train on CMU Mocap:
python main.py --exp_name=cmu --is_train=1 --output_n=25 --dct_n=35 --test_manner=all
- Evaluate on Human3.6M:
python main.py --exp_name=h36m --is_load=1 --model_path=ckpt/pretrained/h36m_in10out25dctn35_best_err57.9256.pth --output_n=25 --dct_n=35 --test_manner=all
- Evaluate on CMU Mocap:
python main.py --exp_name=cmu --is_load=1 --model_path=ckpt/pretrained/cmu_in10out25dctn35_best_err37.2310.pth --output_n=25 --dct_n=35 --test_manner=all
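The values `--output_n=25` and `--dct_n=35` in the commands above are consistent with 10 observed frames plus 25 predicted frames, giving a 35-frame trajectory per joint coordinate. In LearnTrajDep-style pipelines, the observed sequence is padded by repeating its last frame and then transformed along the time axis with a DCT. Whether `main.py` does exactly this is an assumption; the sketch below only illustrates the padding and an orthonormal DCT-II round trip, with hypothetical helper names.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (row k = k-th basis function)."""
    rows = []
    for k in range(n):
        w = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([w * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows

def pad_with_last(observed, output_n):
    """Extend the observed trajectory by repeating its last value output_n times."""
    return list(observed) + [observed[-1]] * output_n

def dct(seq, mat):
    """Forward transform: coefficients = mat @ seq."""
    return [sum(row[i] * seq[i] for i in range(len(seq))) for row in mat]

def idct(coeffs, mat):
    """Inverse transform: mat is orthonormal, so the inverse is its transpose."""
    n = len(coeffs)
    return [sum(mat[k][i] * coeffs[k] for k in range(n)) for i in range(n)]

input_n, output_n = 10, 25
dct_n = input_n + output_n  # 35, matching --dct_n=35

# One coordinate of one joint over the 10 observed frames (synthetic values).
observed = [math.sin(0.3 * t) for t in range(input_n)]
padded = pad_with_last(observed, output_n)

mat = dct_matrix(dct_n)
coeffs = dct(padded, mat)
restored = idct(coeffs, mat)
max_err = max(abs(a - b) for a, b in zip(padded, restored))
print(len(padded), max_err < 1e-9)
```

Keeping all 35 coefficients makes the transform lossless; using fewer coefficients would act as a temporal low-pass filter.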
H3.6M-10/25/35-all | 80 ms | 160 ms | 320 ms | 400 ms | 560 ms | 1000 ms | mean |
---|---|---|---|---|---|---|---|
walking | 12.16 | 22.65 | 38.65 | 45.24 | 52.72 | 63.05 | - |
eating | 8.39 | 17.05 | 33.03 | 40.44 | 52.54 | 77.11 | - |
smoking | 8.02 | 16.27 | 31.32 | 38.15 | 49.45 | 71.64 | - |
discussion | 11.98 | 26.76 | 57.08 | 69.74 | 88.59 | 117.59 | - |
directions | 8.61 | 19.65 | 43.28 | 53.82 | 71.18 | 100.59 | - |
greeting | 16.48 | 36.95 | 77.32 | 93.38 | 116.24 | 147.23 | - |
phoning | 10.10 | 20.74 | 41.51 | 51.26 | 68.28 | 104.36 | - |
posing | 12.79 | 29.38 | 66.95 | 85.01 | 116.26 | 174.33 | - |
purchases | 14.75 | 32.39 | 66.13 | 79.63 | 101.63 | 139.15 | - |
sitting | 10.53 | 21.99 | 46.26 | 57.80 | 78.19 | 120.02 | - |
sittingdown | 16.10 | 31.63 | 62.45 | 76.84 | 102.83 | 155.45 | - |
takingphoto | 9.89 | 21.01 | 44.56 | 56.30 | 77.94 | 121.87 | - |
waiting | 10.68 | 23.06 | 48.25 | 59.23 | 76.33 | 106.25 | - |
walkingdog | 20.65 | 42.88 | 80.35 | 93.31 | 111.87 | 148.21 | - |
walkingtogether | 10.56 | 20.92 | 37.40 | 43.85 | 52.93 | 65.91 | - |
Average | 12.11 | 25.56 | 51.64 | 62.93 | 81.13 | 114.18 | 57.93 |
Results using the same metric as MotionMixer (IJCAI 2022):
H3.6M-10/25/35-256 | <=80 ms | <=160 ms | <=320 ms | <=400 ms | <=560 ms | <=1000 ms |
---|---|---|---|---|---|---|
walking | 9.54 | 15.36 | 24.89 | 28.89 | 35.24 | 44.99 |
eating | 5.88 | 9.94 | 17.76 | 21.48 | 28.58 | 44.71 |
smoking | 6.39 | 10.66 | 18.78 | 22.58 | 29.43 | 44.23 |
discussion | 8.81 | 15.55 | 29.81 | 36.66 | 49.06 | 74.06 |
directions | 6.68 | 12.2 | 24.78 | 31.05 | 42.2 | 65.19 |
greeting | 11.35 | 19.83 | 37.69 | 46.1 | 60.98 | 89.2 |
phoning | 7.56 | 12.69 | 22.91 | 27.92 | 37.57 | 60.16 |
posing | 8.77 | 16.11 | 32.94 | 41.69 | 58.66 | 99.05 |
purchases | 10.96 | 19.39 | 36.22 | 43.9 | 57.6 | 85.08 |
sitting | 7.96 | 13.47 | 25.34 | 31.2 | 42.38 | 67.88 |
sittingdown | 13.2 | 21.52 | 37.02 | 44.3 | 58.25 | 89.99 |
takingphoto | 7.18 | 12.45 | 23.81 | 29.5 | 40.95 | 68.61 |
waiting | 7.63 | 13.14 | 25.19 | 31.07 | 41.76 | 64.19 |
walkingdog | 14.97 | 25.66 | 44.8 | 52.61 | 66.25 | 93.61 |
walkingtogether | 8.04 | 13.5 | 23.17 | 27.39 | 34.66 | 47.19 |
average | 8.99 | 15.43 | 28.34 | 34.42 | 45.57 | 69.21 |
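One reading of the `<=` columns in the table above is that each entry averages the per-frame error over all predicted frames up to that horizon, rather than reporting the error at that single frame. The sketch below illustrates that reading only. The per-frame errors are synthetic placeholders, the 40 ms frame interval is inferred from 25 output frames spanning 1000 ms, and `error_at` and `error_up_to` are hypothetical helpers rather than functions from this repository.

```python
FRAME_MS = 40  # assumed frame interval: 25 output frames cover 1000 ms
HORIZONS_MS = (80, 160, 320, 400, 560, 1000)

def error_at(per_frame_err, horizon_ms):
    """Error at the single frame that corresponds to the horizon."""
    return per_frame_err[horizon_ms // FRAME_MS - 1]

def error_up_to(per_frame_err, horizon_ms):
    """Mean error over all frames from the first one up to the horizon."""
    k = horizon_ms // FRAME_MS
    return sum(per_frame_err[:k]) / k

# Synthetic per-frame errors for 25 predicted frames: 4, 8, ..., 100.
per_frame = [4.0 * (i + 1) for i in range(25)]

for h in HORIZONS_MS:
    print(h, error_at(per_frame, h), error_up_to(per_frame, h))
```

Because prediction error usually grows with the horizon, the averaged variant yields smaller numbers than the at-frame variant for the same model, so the two kinds of table are not directly comparable.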
CMU-10/25/35-all | 80 ms | 160 ms | 320 ms | 400 ms | 560 ms | 1000 ms | mean |
---|---|---|---|---|---|---|---|
basketball | 10.24 | 18.64 | 36.94 | 45.96 | 61.12 | 86.24 | - |
basketball_signal | 3.04 | 5.62 | 12.49 | 16.60 | 25.43 | 49.99 | - |
directing_traffic | 6.13 | 12.60 | 29.37 | 39.22 | 60.46 | 114.56 | - |
jumping | 15.19 | 28.85 | 55.97 | 69.11 | 92.38 | 126.16 | - |
running | 13.17 | 20.91 | 29.88 | 33.37 | 38.26 | 43.62 | - |
soccer | 10.92 | 19.40 | 37.41 | 47.00 | 65.25 | 101.85 | - |
walking | 6.38 | 10.25 | 16.88 | 20.05 | 25.48 | 36.78 | - |
washwindow | 5.41 | 10.93 | 24.51 | 31.79 | 45.13 | 70.16 | - |
Average | 8.81 | 15.90 | 30.43 | 37.89 | 51.69 | 78.67 | 37.23 |
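The value in the last column of the Average rows of the two "-all" tables (57.93 for H3.6M, 37.23 for CMU) is consistent with the mean of the six per-horizon averages in the same row. The check below uses only numbers copied from those rows. It is an observation about the reported figures, not a statement about how the evaluation code computes them.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Average rows copied from the H3.6M and CMU "-all" tables (80 ... 1000 ms).
h36m_avg = [12.11, 25.56, 51.64, 62.93, 81.13, 114.18]
cmu_avg = [8.81, 15.90, 30.43, 37.89, 51.69, 78.67]

h36m_overall = mean(h36m_avg)
cmu_overall = mean(cmu_avg)

# Both agree with the reported last-column values to within rounding.
print(abs(h36m_overall - 57.93) < 0.01, abs(cmu_overall - 37.23) < 0.01)
```

These two values also agree, to within rounding, with the `best_err` suffixes in the pretrained checkpoint file names.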
If you use our code, please cite our work:
@InProceedings{Dang_2021_ICCV,
author = {Dang, Lingwei and Nie, Yongwei and Long, Chengjiang and Zhang, Qing and Li, Guiqing},
title = {MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021},
pages = {11467-11476}
}
Some of our evaluation and data-processing code was adapted/ported from LearnTrajDep by Wei Mao.
MIT