How can I enable GPU profiling for MLPerf dlrm for training result v2.0? #340

dejay-vu · 2022-07-05T10:09:55Z

I set the PROFILING_DIR environment variable and -DENABLE_PROFILING=ON in the Dockerfile, then ran run_with_docker.sh from https://github.com/mlcommons/training_results_v2.0/blob/main/NVIDIA/benchmarks/dlrm/implementations/hugectr. After that, the model can no longer be trained and the run aborts with the error below. Do I also need to change any model parameters in the Python training script?

[HUGECTR][08:02:06][INFO][RANK0]: Use non-epoch mode with number of iterations: 75868
[HUGECTR][08:02:06][INFO][RANK0]: Training batchsize: 55296, evaluation batchsize: 1769472
[HUGECTR][08:02:06][INFO][RANK0]: Evaluation interval: 3793, snapshot interval: 2000000
[HUGECTR][08:02:06][INFO][RANK0]: Dense network trainable: True
[HUGECTR][08:02:06][INFO][RANK0]: Sparse embedding sparse_embedding1 trainable: True
[HUGECTR][08:02:06][INFO][RANK0]: Use mixed precision: True, scaler: 1024.000000, use cuda graph: False
[HUGECTR][08:02:06][INFO][RANK0]: lr: 24.000000, warmup_steps: 2750, end_lr: 0.000000
[HUGECTR][08:02:06][INFO][RANK0]: decay_start: 49315, decay_steps: 27772, decay_power: 2.000000
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_TRAIN_EVAL_MODE : train
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_DIR: /profiling
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using WARMUP_ITERS: 10
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using WARMUP_AFTER_CUDAGRAPH_REINIT: 10
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using cuda graph: 1
[HUGECTR][08:02:06][INFO][RANK0]: Profiler Warning. 'extra_info' arg in the PROFILE_RECORD maybe ignored, if the event is executed in cuda graph.
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_MODE: one_shot
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_REPEAT_ITERS: 1000
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_RECORD_EVERY_N: 5
[HUGECTR][08:02:06][INFO][RANK0]: Profiler activate: DataReaderOneShotProfiler
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_DATA_READER_ONE_SHOT_CUEVENT_NUM: 5
[HUGECTR][08:02:06][INFO][RANK0]: Training source file: /raid/datasets/criteo/mlperf/40m.limit_preshuffled/train_data.bin
[HUGECTR][08:02:06][INFO][RANK0]: Evaluation source file: /raid/datasets/criteo/mlperf/40m.limit_preshuffled/test_data.bin
[80038.35, train_epoch_start, 0, ]
Event fused_relu_bias_fully_connected.bprop.cublasGemmEx_2 has stop but no start
terminate called after throwing an instance of 'HugeCTR::internal_runtime_error'
what(): Event fused_relu_bias_fully_connected.bprop.cublasGemmEx_2 has stop but no start
[dgx-hq-01:00171] *** Process received signal ***
[dgx-hq-01:00171] Signal: Aborted (6)
[dgx-hq-01:00171] Signal code: (-6)
[dgx-hq-01:00171] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f10af5023c0]
[dgx-hq-01:00171] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f10af1d903b]
[dgx-hq-01:00171] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f10af1b8859]
[dgx-hq-01:00171] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f10ac7c9911]
[dgx-hq-01:00171] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f10ac7d538c]
[dgx-hq-01:00171] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7f10ac7d4369]
[dgx-hq-01:00171] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7f10ac7d4d21]
[dgx-hq-01:00171] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7f10ac6debef]
[dgx-hq-01:00171] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7f10ac6df5aa]
[dgx-hq-01:00171] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x2c08fa)[0x7f10ad5638fa]
[dgx-hq-01:00171] [10] /usr/local/hugectr/lib/libhuge_ctr_shared.so(ZN7HugeCTR12GraphWrapper7captureESt8functionIFvP11CUstream_stEES3+0x9e)[0x7f10ada3015e]
[dgx-hq-01:00171] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xacb702)[0x7f10add6e702]
[dgx-hq-01:00171] [12] /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0x1a78e)[0x7f10ac70378e]
[dgx-hq-01:00171] [13] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f10af4f6609]
[dgx-hq-01:00171] [14] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f10af2b5163]
[dgx-hq-01:00171] *** End of error message ***

minseokl · 2022-07-20T23:16:21Z

Hi @regnnighe, we are deprecating the inline profiler, and it will be removed completely in a future release. I recommend using Nsight Systems instead, which offers richer functionality.
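
For anyone landing here later, the following is a minimal sketch of wrapping a training command with the Nsight Systems CLI (`nsys profile`). It is not the official MLPerf or HugeCTR procedure. It assumes `nsys` is installed inside the container. The script name `train.py` and the report name `dlrm_profile` are hypothetical placeholders; substitute whatever command run_with_docker.sh actually launches.

```python
import shlex


def build_nsys_command(train_cmd, output="dlrm_profile", traces=("cuda", "nvtx")):
    """Return an argv list that runs train_cmd under `nsys profile`.

    train_cmd: the training command as a list of arguments (placeholder here).
    output:    base name of the report file nsys writes (hypothetical name).
    traces:    APIs to trace, joined into a comma-separated --trace value.
    """
    return [
        "nsys", "profile",
        "--output", output,            # report file base name
        "--trace", ",".join(traces),   # e.g. CUDA API calls and NVTX ranges
        "--force-overwrite", "true",   # replace an existing report of the same name
        *train_cmd,
    ]


# Hypothetical training command; replace with the real launch command.
cmd = build_nsys_command(["python3", "train.py"])
print(shlex.join(cmd))
```

The printed command line can be pasted into the container shell, or the list can be passed to `subprocess.run(cmd, check=True)`. With this approach the HugeCTR build would presumably not need -DENABLE_PROFILING=ON, since the profiling happens outside the library.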

dejay-vu added the question Further information is requested label Jul 5, 2022

minseokl self-assigned this Jul 6, 2022

dejay-vu closed this as completed Jul 21, 2022
