When training ArcFace models with millions of IDs, we may run into some training-efficiency problems.
=====
P1: There are too many classes for my GPUs to handle.
Solutions:
To reduce the memory usage of the classification layer, model parallelism and partial-fc are good options (see the sampling sketch after this list).
Enabling FP16 can further reduce GPU memory usage and also brings acceleration on modern NVIDIA GPUs. For example, we can enable FP16 training with a simple fp16-scale parameter (loss scaling; a sketch follows this list), or change the following setting in the partial-fc MXNet implementation:
config.fp16 = True
Use distributed training.
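For intuition on why partial-fc saves memory, here is a minimal numpy sketch of its class-center sampling: every class that appears in the batch is kept, and the remaining budget is filled with random negatives, so the softmax runs over only a fraction of the full class set. The function name and signature below are illustrative, not the repo's actual API.

```python
import numpy as np

def sample_class_centers(batch_labels, num_classes, sample_rate=0.1, rng=None):
    """Sample a subset of class centers for one batch, partial-fc style.

    Keeps all positive classes in the batch, pads with randomly chosen
    negative classes up to sample_rate * num_classes, and remaps the batch
    labels into the selected subset.
    """
    rng = rng or np.random.default_rng()
    num_sample = max(1, int(num_classes * sample_rate))
    positives = np.unique(batch_labels)
    if len(positives) >= num_sample:
        selected = positives
    else:
        pool = np.setdiff1d(np.arange(num_classes), positives)
        negatives = rng.choice(pool, num_sample - len(positives), replace=False)
        selected = np.sort(np.concatenate([positives, negatives]))
    remap = {c: i for i, c in enumerate(selected)}
    new_labels = np.array([remap[c] for c in batch_labels])
    return selected, new_labels

# usage: pick centers for a batch of 4 labels out of 1,000,000 classes
# selected, new_labels = sample_class_centers([3, 42, 42, 7], 1_000_000)
```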
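The fp16-scale parameter refers to static loss scaling. Below is a minimal Gluon-style sketch of that idea only, not the repo's actual training loop; the network, scale value, and step details are placeholders.

```python
import mxnet as mx
from mxnet import autograd, gluon

# placeholders for illustration; the real training loop lives in the repo
net = gluon.model_zoo.vision.resnet18_v1()
net.cast('float16')                       # store weights/activations in fp16
net.initialize(ctx=mx.cpu())
criterion = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
fp16_scale = 128.0                        # hypothetical scale value

def train_step(data, label):
    with autograd.record():
        out = net(data.astype('float16'))
        # compute the loss in fp32 and scale it up so that small
        # gradients do not underflow the fp16 range during backward
        loss = criterion(out.astype('float32'), label).mean() * fp16_scale
    loss.backward()
    # divide the scale back out of the gradients before the update
    for param in net.collect_params().values():
        if param.grad_req != 'null':
            for grad in param.list_grad():
                grad[:] /= fp16_scale
    trainer.step(1)                       # loss is already averaged
```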
=====
P2: The training dataset is huge and the IO cost is high, which leads to very low training speed.
Solutions:
Use a sequential data loader instead of random access.
Right now the default face recognition datasets (*.rec) are indexed key-value databases, called MXIndexedRecordIO, so the data loader has to randomly access items in the dataset during training. The performance is acceptable only if the data sits on a RAM filesystem or a very fast SSD. For ordinary hard disks, we must use an alternative method that avoids random access:
a. Use recognition/common/rec2shufrec.py to convert any indexed '.rec' dataset to a shuffled sequential one, called MXRecordIO.
b. In ArcFace, set is_shuffled_rec=True in the config file to use the converted shuffled dataset. Please check the get_face_image_iter() function in image_iter.py for details.
c. The shuffled dataset loader requires sequential scanning only, and provides data shuffling in a small in-memory buffer (see the sketch after this list).
d. The shuffled dataset also benefits from the C++ runtime of the MXNet record reader, which accelerates image processing.
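To make step c concrete, here is a minimal sketch of a buffered-shuffle reader over a converted sequential .rec file. mx.recordio.MXRecordIO and mx.recordio.unpack are real MXNet APIs, but the helper name and buffer size are illustrative, and the actual implementation in image_iter.py may differ.

```python
import random
import mxnet as mx

def shuffled_records(rec_path, buffer_size=8192):
    """Yield records from a sequential MXRecordIO file in shuffled order.

    Reads strictly sequentially from disk, keeps a small in-memory buffer,
    and yields a random element each time the buffer is full, so we get
    (local) shuffling without any random disk access.
    """
    reader = mx.recordio.MXRecordIO(rec_path, 'r')
    buf = []
    while True:
        raw = reader.read()
        if raw is None:                        # end of file
            break
        buf.append(raw)
        if len(buf) >= buffer_size:
            i = random.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]  # swap a random item to the end
            yield buf.pop()
    random.shuffle(buf)                        # drain what is left
    yield from buf

# usage: decode one record into a label and an image
# for raw in shuffled_records('train_shuffled.rec'):
#     header, img_bytes = mx.recordio.unpack(raw)
#     img = mx.image.imdecode(img_bytes)       # NDArray in HWC layout
#     label = header.label
```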
=====
Any question or discussion can be left in this thread.