Human-Centric Multimodal Fusion Network for Robust Action Recognition
In this work, we propose an innovative skeleton-guided multimodal data fusion methodology that transforms depth, RGB, and optical flow modalities into human-centric images (HCI) based on keypoint sequences. Building upon this foundation, we introduce a human-centric multimodal fusion network (HCMFN), which can comprehensively extract the action patterns of different modalities. Our model can enhance performance in combination with skeleton-based techniques, delivering significant improvements with rapid inference speed.
Extensive experiments on two large-scale multimodal datasets, namely NTU RGB+D and NTU RGB+D 120, validate the capacity of HCMFN to bolster the robustness of skeleton-based methods in two challenging HAR tasks:
(1) discriminating between actions with subtle inter-class differences, and
(2) recognizing actions from varying viewpoints.
This work has been accepted by Expert Systems with Applications (ESWA). Comments and suggestions are welcome.
Date | Status |
---|---|
May 31, 2023 | Manuscript submitted to journal |
Aug 04, 2023 | Revised |
Aug 24, 2023 | Revision submitted to journal |
Oct 04, 2023 | Revised |
Oct 11, 2023 | Revision submitted to journal |
Oct 21, 2023 | Accepted |
Oct 31, 2023 | Article available online Link |
This work is based on the following three works:
MMNet, TPAMI 2022, Original code
PoseC3D, CVPR 2022, Original code
MS-G3D, CVPR 2020, Original code
Thanks to the original authors for their work! Although our work represents only a modest improvement upon existing studies, we hope it can offer useful insights to others.
We are also very grateful to the creators of the two datasets, NTU RGB+D and NTU RGB+D 120. Their selfless work is a great contribution to the computer vision community!
Last but not least, we are very grateful to the reviewers for their selfless and constructive suggestions.
If you need to download the files via Google Drive, please contact me. You can now also download our preprocessed features for NTU RGB+D 60 from Baidu Cloud.
The detailed ablation studies on NTU RGB+D (CS/CV) and NTU RGB+D 120 (XSub/XSet) are as follows.
Number | Input | Backbone | CS(%) | CV(%) | XSub(%) | XSet(%) |
---|---|---|---|---|---|---|
#1 | | PoseC3D | 93.7 | 96.5 | 85.9 | 89.7 |
#2 | | PoseC3D | 93.4 | 96.0 | 85.9 | 89.7 |
#3 | | PoseC3D | 94.1 | 96.8 | 86.6 | 90.2 |
#4 | | MS-G3D | 88.4 | 94.2 | 81.5 | 82.3 |
#5 | | MS-G3D | 89.5 | 94.0 | 85.0 | 86.4 |
#6 | #4+#5 | | 90.9 | 95.2 | 86.4 | 87.6 |
#7 | | ResNet18 | 77.8 | 84.3 | 70.2 | 70.3 |
#8 | | ResNet18 | 63.7 | 69.2 | 54.5 | 55.6 |
#9 | | ResNet18 | 78.4 | 76.6 | 72.6 | 71.2 |
#10 | #3+#7 | | 94.8 | 97.7 | 88.4 | 91.8 |
#11 | #3+#9 | | 94.5 | 97.2 | 88.2 | 91.5 |
#12 | #6+#7 | | 92.8 | 96.7 | 89.5 | 90.5 |
#13 | #6+#9 | | 92.2 | 96.0 | 89.1 | 90.0 |
#14 | #7+#8 | | 83.4 | 88.3 | 77.0 | 81.6 |
#15 | #7+#9 | | 85.5 | 88.3 | 81.2 | 83.8 |
#16 | #3+#7+#8 | | 94.9 | 97.9 | 88.9 | 92.0 |
#17 | #3+#7+#9 | | 95.0 | 97.9 | 89.7 | 92.5 |
#18 | #6+#7+#8 | | 93.0 | 96.9 | 89.8 | 90.9 |
#19 | #6+#7+#9 | | 93.3 | 96.9 | 90.5 | 91.5 |
#20 | #7+#8+#9 (HCMFN) | | 87.7 | 90.5 | 83.4 | 83.8 |
#21 | #3+#7+#8+#9 (HCMFN) | | 95.2 | 98.0 | 89.9 | 92.7 |
#22 | #6+#7+#8+#9 (HCMFN) | | 93.5 | 97.1 | 90.7 | 91.7 |
To reproduce this work, you need to complete the following steps:
Step 1, Build environment
Step 2, Download dataset and preprocess
Step 3, Train and test model
Step 4, Ensemble results
If you only want to quickly reproduce the experimental results in the article, you can jump directly to Step 4.
HCMFN builds on code from PoseC3D, MS-G3D, and MMNet, so we run the experiments in two separate environments. Specifically, we use the MMLab environment (used by PoseC3D) to process the 3D heatmap volume input, while the MMNet environment handles the HCI and spatiotemporal skeleton graph (MS-G3D) inputs.
We conduct experiments on two large multimodal action datasets, namely NTU RGB+D and NTU RGB+D 120. Download the datasets first, and then preprocess them to generate the mid-level features.
Request permission at RoseLab to download both datasets. Link
Download data for these modalities: Skeleton, Masked depth maps, RGB videos.
The 3D heatmap volumes can be downloaded from the PoseC3D repository. For convenience, we provide HCI for the depth map, RGB video, and optical flow modalities. Additionally, our code allows you to generate HCI for different modalities as needed.
The spatiotemporal skeleton graph can be generated with the source code of MS-G3D. If you want to test the view-invariant property of the model, please change the 'training_cameras' setting in the 'ntu_gendata.py' file (see the sketch below).
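For reference, here is a minimal sketch of the relevant split settings in MS-G3D's `ntu_gendata.py`. The lists follow the standard NTU RGB+D cross-subject/cross-view protocols; please verify them against the upstream script before regenerating data.

```python
# Sketch of the benchmark split settings in MS-G3D's data_gen/ntu_gendata.py.
# Values follow the standard NTU RGB+D protocols; check the upstream script
# before use.

# Cross-subject (CS / XSub): subject IDs whose samples go into the training set.
training_subjects = [
    1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25,
    27, 28, 31, 34, 35, 38,
]

# Cross-view (CV): cameras 2 and 3 are used for training, camera 1 for testing.
# Change this list to test the model under a different view split.
training_cameras = [2, 3]
```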
NTU RGB+D HCI
HCI modalities | Baidu Cloud Link | Google Drive Link |
---|---|---|
RGB HCI | Link | - |
Optical flow HCI | Link | - |
Depth HCI | Link | - |
Rock-paper-scissors. Left: RGB video. Right: RGB HCI
Throw. Left: Masked depth maps. Right: Depth HCI
Because extracting frames from every RGB video produces a huge amount of intermediate data, we process the videos one at a time: the frames are extracted, the RGB and optical flow HCI are constructed, and the intermediate frames are then deleted.
Although NTU RGB+D has two benchmarks, note that we store all training samples and testing samples in the 'train' folder.
Since we use FlowNet2 to generate the optical flow HCI, you must first download its pretrained model (173 MB). Baidu Cloud
For example, you can generate RGB and optical flow HCI:
`python ntu60_gen_HCI.py`
Remember to modify the file paths.
File path parameter | Description |
---|---|
skeletons_path | Raw skeleton data |
frames_path | Raw rgb video |
ignored_sample_path | Samples that need to be ignored (.txt) |
out_folder | Output folder for RGB HCI |
out_folder_opt | Output folder for optical flow HCI |
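To illustrate how the paths in the table above fit together, here is a hypothetical sketch; whether the script exposes them as command-line arguments or as plain variables depends on `ntu60_gen_HCI.py` itself, and the example paths are placeholders for your local dataset locations.

```python
# Hypothetical path configuration for ntu60_gen_HCI.py; the parameter names
# match the table above, and the default paths are placeholders.
import argparse

parser = argparse.ArgumentParser(
    description='Generate RGB and optical flow HCI for NTU RGB+D')
parser.add_argument('--skeletons_path',
                    default='./data/ntu60/nturgb+d_skeletons/',
                    help='Raw skeleton data')
parser.add_argument('--frames_path',
                    default='./data/ntu60/nturgb+d_rgb/',
                    help='Raw RGB videos')
parser.add_argument('--ignored_sample_path',
                    default='./data/ntu60/samples_with_missing_skeletons.txt',
                    help='Samples that need to be ignored (.txt)')
parser.add_argument('--out_folder',
                    default='./data/ntu60/rgb_hci/',
                    help='Output folder for RGB HCI')
parser.add_argument('--out_folder_opt',
                    default='./data/ntu60/flow_hci/',
                    help='Output folder for optical flow HCI')
args = parser.parse_args()
```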
Additionally, you can generate depth HCI:
`python ntu60_gen_depth_HCI.py`
Each single-stream model is trained first, and the learned model parameters are then used for testing. Each data stream needs to be trained separately.
For RGB HCI:
`python main.py`
We use the official code of 2s-AGCN or MS-G3D to generate the label files for the different benchmarks. You can also use these labels directly (a loading sketch is provided after the training commands below).
Remember to modify the file paths.
File path parameter | Description |
---|---|
data_path | Label file path (default: 'data') |
dataset | Dataset of the label file, i.e., ntu or ntu120 |
dataset_type | Benchmark of the label file, i.e., xsub or xview |
output | Output file |
rgb_images_path | RGB HCI file |
For optical flow HCI:
`python main_flow.py`
For depth HCI:
`python main_depth.py`
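As a reference for the label files mentioned above, the data-generation scripts of 2s-AGCN/MS-G3D store labels as a pickled pair of sample names and class indices. A minimal loading sketch, assuming that format (the path is a placeholder):

```python
# Inspect a benchmark label file produced by the 2s-AGCN / MS-G3D data scripts.
# Assumes the pickled (sample_names, labels) format; the path is a placeholder.
import pickle

label_path = './data/ntu/xsub/val_label.pkl'  # placeholder path

with open(label_path, 'rb') as f:
    sample_names, labels = pickle.load(f)

print(f'{len(labels)} samples, e.g. {sample_names[0]} -> class {labels[0]}')
```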
Perform weighted score fusion. Here we fuse the best score obtained by each single-stream (per-modality) model. Note that fusing these best individual scores does not necessarily yield the best ensemble result, so please experiment with other combinations yourself.
You can quickly reproduce the experimental results in the article based on the content of this part only.
Due to GitHub's upload file size limit (25 MB), we store the ensemble-related files (368 MB) on Baidu Cloud.
The files are arranged as follows:
-ensemble\
  -ntu60
  -ntu120
-ensemble60_xsub.py
-ensemble60_xview.py
-ensemble120_xset.py
-ensemble120_xsub.py
The four .py files correspond to the score fusion of the four benchmarks. You can change the alpha to adjust the weights for different modalities.
For example, you can ensemble the results for XSub, one of the benchmarks of NTU RGB+D:
`python ensemble60_xsub.py`
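For reference, the fusion itself is just a weighted sum of per-class scores followed by an argmax. Below is a minimal sketch in the style of the 2s-AGCN/MS-G3D ensemble scripts, assuming each stream's scores are stored as a pickled dict mapping sample names to score vectors and the label file uses the (sample_names, labels) format shown earlier; the file names and the alpha value are placeholders.

```python
# Minimal weighted score fusion sketch (two streams; more streams simply add
# further weighted terms). File names and weights below are placeholders.
import pickle
import numpy as np

alpha = 1.0  # relative weight of the second stream

with open('./ensemble/ntu60/xsub_label.pkl', 'rb') as f:
    sample_names, labels = pickle.load(f)
with open('./ensemble/ntu60/stream1_score.pkl', 'rb') as f:
    scores_1 = pickle.load(f)  # dict: sample name -> per-class score vector
with open('./ensemble/ntu60/stream2_score.pkl', 'rb') as f:
    scores_2 = pickle.load(f)

correct = 0
for name, label in zip(sample_names, labels):
    fused = np.asarray(scores_1[name]) + alpha * np.asarray(scores_2[name])
    correct += int(np.argmax(fused) == int(label))

print(f'Top-1 accuracy: {correct / len(labels):.4f}')
```

You can extend the sum with one weighted term per additional stream and tune the weights on the validation split.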
@article{hu2023human,
title={Human-centric multimodal fusion network for robust action recognition},
author={Hu, Zesheng and Xiao, Jian and Li, Le and Liu, Cun and Ji, Genlin},
journal={Expert Systems with Applications},
pages={122314},
year={2023},
publisher={Elsevier}
}
If any part of the above description is unclear, or you encounter other issues while running the experiments, please leave a message on GitHub.
Feel free to contact me via email:
`zeshenghu@njnu.edu.cn`