How can I enable GPU profiling for MLPerf dlrm for training result v2.0? #340

Closed
dejay-vu opened this issue Jul 5, 2022 · 1 comment
Labels
question Further information is requested

Comments


dejay-vu commented Jul 5, 2022

I set the PROFILING_DIR environment variable and added -DENABLE_PROFILING=ON in the Dockerfile, then ran run_with_docker.sh from https://github.com/mlcommons/training_results_v2.0/blob/main/NVIDIA/benchmarks/dlrm/implementations/hugectr. With these changes the model no longer trains, and the run aborts with the error below. Do I also need to change any model parameters in the Python training script?

[HUGECTR][08:02:06][INFO][RANK0]: Use non-epoch mode with number of iterations: 75868
[HUGECTR][08:02:06][INFO][RANK0]: Training batchsize: 55296, evaluation batchsize: 1769472
[HUGECTR][08:02:06][INFO][RANK0]: Evaluation interval: 3793, snapshot interval: 2000000
[HUGECTR][08:02:06][INFO][RANK0]: Dense network trainable: True
[HUGECTR][08:02:06][INFO][RANK0]: Sparse embedding sparse_embedding1 trainable: True
[HUGECTR][08:02:06][INFO][RANK0]: Use mixed precision: True, scaler: 1024.000000, use cuda graph: False
[HUGECTR][08:02:06][INFO][RANK0]: lr: 24.000000, warmup_steps: 2750, end_lr: 0.000000
[HUGECTR][08:02:06][INFO][RANK0]: decay_start: 49315, decay_steps: 27772, decay_power: 2.000000
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_TRAIN_EVAL_MODE : train
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_DIR: /profiling
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using WARMUP_ITERS: 10
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using WARMUP_AFTER_CUDAGRAPH_REINIT: 10
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using cuda graph: 1
[HUGECTR][08:02:06][INFO][RANK0]: Profiler Warning. 'extra_info' arg in the PROFILE_RECORD maybe ignored, if the event is executed in cuda graph.
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_MODE: one_shot
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_REPEAT_ITERS: 1000
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_RECORD_EVERY_N: 5
[HUGECTR][08:02:06][INFO][RANK0]: Profiler activate: DataReaderOneShotProfiler
[HUGECTR][08:02:06][INFO][RANK0]: Profiler using PROFILING_DATA_READER_ONE_SHOT_CUEVENT_NUM: 5
[HUGECTR][08:02:06][INFO][RANK0]: Training source file: /raid/datasets/criteo/mlperf/40m.limit_preshuffled/train_data.bin
[HUGECTR][08:02:06][INFO][RANK0]: Evaluation source file: /raid/datasets/criteo/mlperf/40m.limit_preshuffled/test_data.bin
[80038.35, train_epoch_start, 0, ]
Event fused_relu_bias_fully_connected.bprop.cublasGemmEx_2 has stop but no start
terminate called after throwing an instance of 'HugeCTR::internal_runtime_error'
what(): Event fused_relu_bias_fully_connected.bprop.cublasGemmEx_2 has stop but no start
[dgx-hq-01:00171] *** Process received signal ***
[dgx-hq-01:00171] Signal: Aborted (6)
[dgx-hq-01:00171] Signal code: (-6)
[dgx-hq-01:00171] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f10af5023c0]
[dgx-hq-01:00171] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f10af1d903b]
[dgx-hq-01:00171] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f10af1b8859]
[dgx-hq-01:00171] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f10ac7c9911]
[dgx-hq-01:00171] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f10ac7d538c]
[dgx-hq-01:00171] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7f10ac7d4369]
[dgx-hq-01:00171] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7f10ac7d4d21]
[dgx-hq-01:00171] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7f10ac6debef]
[dgx-hq-01:00171] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7f10ac6df5aa]
[dgx-hq-01:00171] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x2c08fa)[0x7f10ad5638fa]
[dgx-hq-01:00171] [10] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR12GraphWrapper7captureESt8functionIFvP11CUstream_stEES3_+0x9e)[0x7f10ada3015e]
[dgx-hq-01:00171] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xacb702)[0x7f10add6e702]
[dgx-hq-01:00171] [12] /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0x1a78e)[0x7f10ac70378e]
[dgx-hq-01:00171] [13] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f10af4f6609]
[dgx-hq-01:00171] [14] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f10af2b5163]
[dgx-hq-01:00171] *** End of error message ***

@dejay-vu dejay-vu added the question Further information is requested label Jul 5, 2022
@minseokl minseokl self-assigned this Jul 6, 2022
@minseokl
Collaborator

Hi @regnnighe, we are deprecating the inline profiler, and it will be completely removed in a future release. I recommend using Nsight Systems instead, which offers richer functionality.
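
As a rough illustration of the Nsight Systems route, the sketch below builds an `nsys profile` command line around a training command. It assumes the `nsys` CLI is installed inside the container and that the image is built without -DENABLE_PROFILING=ON. The entry point `train.py` and the report name `dlrm_train` are hypothetical placeholders, not names taken from the MLPerf repository; substitute whatever run_with_docker.sh actually launches.

```python
import shlex
import shutil
import subprocess


def build_nsys_command(train_cmd, report_path, traces=("cuda", "nvtx", "osrt")):
    """Return an argv list that runs `train_cmd` under `nsys profile`.

    train_cmd   -- the training command as a list of strings
    report_path -- where nsys should write its report (extension added by nsys)
    traces      -- API domains to trace, passed to --trace
    """
    return [
        "nsys", "profile",
        "--trace=" + ",".join(traces),   # which APIs to record
        "--output=" + report_path,       # report file location
        "--force-overwrite=true",        # replace a report from an earlier run
        *train_cmd,
    ]


def run_profile(cmd):
    """Run the profiling command, failing early if nsys is not on PATH."""
    if shutil.which("nsys") is None:
        raise RuntimeError("nsys not found; install Nsight Systems in the container")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical entry point and report name; adjust to the real script.
    cmd = build_nsys_command(["python3", "train.py"], "/profiling/dlrm_train")
    print(shlex.join(cmd))
```

Calling `run_profile(cmd)` inside the container would then launch the profiled run, and the resulting report can be opened in the Nsight Systems GUI. Since a full run here is 75868 iterations, it is probably worth profiling a shortened run to keep the report size manageable.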
