cuDNN non-persistent bidirectional RNN dgrad sync fix #16391
Merged
Conversation
ptrendx reviewed on Oct 8, 2019
ptrendx reviewed on Oct 8, 2019
ptrendx approved these changes on Oct 8, 2019:
LGTM
DickJC123 force-pushed the gpu_rnn_dgrad_sync branch from 65246f6 to 57c8156 on October 10, 2019 at 01:08
I've rebased the PR's commits onto the latest master in an attempt to avoid pylint failures, as suggested by @reminisce. If this succeeds, does it suggest a bug in the way the CI creates the repo-under-test?
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request on Oct 16, 2019:
* Alter test_lstm_bidirectional to demo fast-fail with optional wgrad.
* Fix cuDNN RNN dgrad sync.
* Simplify gpu activity sync sequence.
* Remove repeated running of now-passing test.
* Trigger CI
haojin2 reviewed on Oct 20, 2019:
if (CUDNN_VERSION <= 7604 && dgrad_sync_needed_) {
  // Without blocking the CPU, create a synchronization point of all current GPU activity. No
  // need to call cudaStreamWaitEvent; cudaEventRecord on the legacy default stream suffices.
  CUDA_CALL(cudaEventRecord(dgrad_sync_event_, cudaStreamLegacy));
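For context, dgrad_sync_event_ would be created once elsewhere in the operator; a plausible sketch of that setup (the cudaEventDisableTiming flag is an assumption, not taken from this diff):

```cpp
// Sketch only: one-time creation of the sync event recorded above.
cudaEvent_t dgrad_sync_event_;
CUDA_CALL(cudaEventCreateWithFlags(&dgrad_sync_event_, cudaEventDisableTiming));
```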
Hi @DickJC123, I'm encountering a cudaErrorInvalidResourceHandle error here when trying to run this notebook and this notebook from the Dive into Deep Learning textbook. Could you help with a fix?
Description
Background: A non-deterministic failure of test_operator_gpu.py:test_lstm_bidirectional, observed just once in our own CI, was found to be due to cuDNN's RNN dgrad implementation. As part of RNN dgrad, cuDNN launches many kernels into auxiliary streams that are different from the primary user-settable stream of the cuDNN handle. A final kernel launched by cuDNN into one of these aux streams was not being synchronized (via events) back to the handle's stream. When MXNet's RNN::Backward() returns, the GPU worker can launch a gradient summation kernel into its main stream, and this kernel's execution could overlap, or even precede, that of the final cuDNN RNN dgrad kernel. Because MXNet calls cuDNN's wgrad immediately after dgrad, this data-race failure is exceedingly rare; in fact, we discovered the problem by code inspection, not by reproducing the original CI failure.
The first commit of this PR will demonstrate the failure. The approach is to have MXNet skip the RNN wgrad operation when grad_req = {'data':'add', 'parameters':'none'}, and to expand the test_lstm_bidirectional test to invoke this case. Skipping wgrad aggravates the data race by letting MXNet launch the gradient summation kernel sooner after the RNN dgrad kernels, with no intervening wgrad GPU activity. A backend-level sketch of this follows.
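As an illustration of the skip (all helper names here are hypothetical; only the gradient-request kinds mirror MXNet's real OpReqType values, onto which Python's grad_req strings map, with 'add' becoming kAddTo and 'none'/'null' becoming kNullOp):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-ins, not MXNet's real code, for the cuDNN launches.
void LaunchCudnnRnnDgrad() { /* dgrad kernels go into cuDNN's aux streams */ }
void LaunchCudnnRnnWgrad() { /* wgrad kernels follow on the handle's stream */ }

// Simplified version of MXNet's gradient-request kinds.
enum OpReqType { kNullOp, kWriteTo, kAddTo };

// Sketch only: skip wgrad when the parameter gradient is not requested,
// mirroring grad_req = {'data': 'add', 'parameters': 'none'} in the test.
void RNNBackwardSketch(const std::vector<OpReqType>& req, std::size_t kParams) {
  LaunchCudnnRnnDgrad();
  if (req[kParams] != kNullOp) {
    LaunchCudnnRnnWgrad();
  }
  // With wgrad skipped, the worker's gradient-summation kernel can launch
  // immediately after the dgrad kernels, widening the race window.
}
```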
Once the failure is solidly demonstrated, a follow-up commit will supply the fix. The fix uses CUDA events recorded on the legacy default stream to ensure that all dgrad GPU activity is complete before wgrad or other kernels begin. Unlike a cudaDeviceSynchronize(), it does not block the CPU. nvprof-based timing analysis of the fix shows no measurable timing difference for the single-RNN case analyzed.
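To make the mechanism concrete, here is a sketch of the legacy-default-stream trick (illustrative names, not the PR's code; it assumes every stream involved was created without the cudaStreamNonBlocking flag, so each synchronizes with the legacy default stream):

```cpp
#include <cuda_runtime.h>

void SyncDgradSketch() {
  // Sketch only: a single event record on the legacy default stream orders
  // all prior and subsequent GPU work without blocking the host.
  cudaEvent_t sync_event;
  cudaEventCreateWithFlags(&sync_event, cudaEventDisableTiming);

  // This record cannot begin until all work previously launched into any
  // blocking stream (including cuDNN's aux dgrad streams) has finished, and
  // any kernel launched afterwards into a blocking stream must wait for it
  // to complete. Unlike cudaDeviceSynchronize(), the CPU never stalls here.
  cudaEventRecord(sync_event, cudaStreamLegacy);
}
```

This is also why the reviewed snippet needs only cudaEventRecord and no matching cudaStreamWaitEvent: the legacy default stream's implicit two-way synchronization with blocking streams supplies the cross-stream ordering.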
The fix is needed for all versions of cuDNN supported by MXNet (so up to the current v7.6.4) and is needed only for non-persistent bidirectional RNNs.