Hi @regnnighe, we are deprecating the inline profiler, and it will be completely removed in a future release. I recommend using Nsight Systems instead, which offers richer functionality.
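For anyone switching over, here is a minimal sketch of wrapping a training launch in `nsys profile`. It only builds and prints the command line, so it can be adapted to a container entry point. The script name `train.py`, the report name `hugectr_dlrm`, and the helper `build_nsys_command` are hypothetical placeholders, not part of HugeCTR or the MLPerf scripts. The sketch also assumes HugeCTR is built without -DENABLE_PROFILING=ON so that the inline profiler stays out of the way.

```python
import shlex


def build_nsys_command(train_cmd, report_name="hugectr_dlrm"):
    """Wrap a training command line in `nsys profile`.

    train_cmd is the usual launch command as a list of strings.
    report_name is a hypothetical output name for the report file.
    """
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx",        # record CUDA API/kernel activity and NVTX ranges
        f"--output={report_name}",  # base name of the generated report
        "--force-overwrite=true",   # replace a report left by a previous run
        *train_cmd,
    ]


if __name__ == "__main__":
    # "train.py" stands in for the actual Python training script.
    cmd = build_nsys_command(["python3", "train.py"])
    print(shlex.join(cmd))
```

The printed line can be pasted into the run script in place of the plain `python3` launch; the resulting report can then be opened in the Nsight Systems GUI.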
I set the environment variable PROFILING_DIR and added -DENABLE_PROFILING=ON in the Dockerfile, then ran run_with_docker.sh from https://github.com/mlcommons/training_results_v2.0/blob/main/NVIDIA/benchmarks/dlrm/implementations/hugectr. After that, training fails with the following error. Do I also need to change any model parameters in the Python training script?
[HUGECTR][08:02:06][INFO][RANK0]: Use non-epoch mode with number of iterations: 75868
[HUGECTR][08:02:06][INFO][RANK0]: Training batchsize: 55296, evaluation batchsize: 1769472
[HUGECTR][08:02:06][INFO][RANK0]: Evaluation interval: 3793, snapshot interval: 2000000
[HUGECTR][08:02:06][INFO][RANK0]: Dense network trainable: True
[HUGECTR][08:02:06][INFO][RANK0]: Sparse embedding sparse_embedding1 trainable: True
[HUGECTR][08:02:06][INFO][RANK0]: Use mixed precision: True, scaler: 1024.000000, use cuda graph: False
[HUGECTR][08:02:06][INFO][RANK0]: lr: 24.000000, warmup_steps: 2750, end_lr: 0.000000
[HUGECTR][08:02:06][INFO][RANK0]: decay_start: 49315, decay_steps: 27772, decay_power: 2.000000
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_TRAIN_EVAL_MODE : train
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_DIR: /profiling
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using WARMUP_ITERS: 10
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using WARMUP_AFTER_CUDAGRAPH_REINIT: 10
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using cuda graph: 1
[HUGECTR][08:02:06][INFO][RANK0]: Profiler Warning. 'extra_info' arg in the PROFILE_RECORD maybe ignored, if the event is executed in cuda graph.
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_MODE: one_shot
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_REPEAT_ITERS: 1000
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_RECORD_EVERY_N: 5
[HUGECTR][08:02:06][INFO][RANK0]: Profiler activate: DataReaderOneShotProfiler
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_DATA_READER_ONE_SHOT_CUEVENT_NUM: 5
[HUGECTR][08:02:06][INFO][RANK0]: Training source file: /raid/datasets/criteo/mlperf/40m.limit_preshuffled/train_data.bin
[HUGECTR][08:02:06][INFO][RANK0]: Evaluation source file: /raid/datasets/criteo/mlperf/40m.limit_preshuffled/test_data.bin
[80038.35, train_epoch_start, 0, ]
Event fused_relu_bias_fully_connected.bprop.cublasGemmEx_2 has stop but no start
terminate called after throwing an instance of 'HugeCTR::internal_runtime_error'
what(): Event fused_relu_bias_fully_connected.bprop.cublasGemmEx_2 has stop but no start
[dgx-hq-01:00171] *** Process received signal ***
[dgx-hq-01:00171] Signal: Aborted (6)
[dgx-hq-01:00171] Signal code: (-6)
[dgx-hq-01:00171] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f10af5023c0]
[dgx-hq-01:00171] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f10af1d903b]
[dgx-hq-01:00171] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f10af1b8859]
[dgx-hq-01:00171] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f10ac7c9911]
[dgx-hq-01:00171] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f10ac7d538c]
[dgx-hq-01:00171] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7f10ac7d4369]
[dgx-hq-01:00171] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7f10ac7d4d21]
[dgx-hq-01:00171] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7f10ac6debef]
[dgx-hq-01:00171] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7f10ac6df5aa]
[dgx-hq-01:00171] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x2c08fa)[0x7f10ad5638fa]
[dgx-hq-01:00171] [10] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR12GraphWrapper7captureESt8functionIFvP11CUstream_stEES3_+0x9e)[0x7f10ada3015e]
[dgx-hq-01:00171] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xacb702)[0x7f10add6e702]
[dgx-hq-01:00171] [12] /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0x1a78e)[0x7f10ac70378e]
[dgx-hq-01:00171] [13] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f10af4f6609]
[dgx-hq-01:00171] [14] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f10af2b5163]
[dgx-hq-01:00171] *** End of error message ***