cuda/cuDNN lib version checking. Force cuDNN v7 usage. #15449

DickJC123 · 2019-07-03T02:34:21Z

This PR addresses two issues:
- rnn.cc of mxnet v1.5 does not compile against cudnn v6. This PR enforces systems that rebuild mxnet to have cudnn v7, and improves the error message for compiling against v6.
- We are accumulating stale code that references no-longer-supported cuda/cudnn versions. This PR provides a means for cleaning out this code.

This PR introduces both runtime and compile-time cuda and cuDNN version checking. The compile time checks are based on new macros: STATIC_ASSERT_CUDNN_VERSION_GE(min_version) and STATIC_ASSERT_CUDA_VERSION_GE(min_version). Example usage:
Before PR:

#if MXNET_USE_CUDNN
#if CUDNN_VERSION >= 7000
    <impl 1>
#elif CUDNN_VERSION >= 6000
    <impl 2>
#else
    LOG(FATAL) << "cuDNN too old.";
#endif
#endif  // MXNET_USE_CUDNN

After PR (given the assumption that we're now requiring cuDNN v7):

#if MXNET_USE_CUDNN
STATIC_ASSERT_CUDNN_VERSION_GE(7000);
<impl 1>
#endif  // MXNET_USE_CUDNN

Discussion continues in the comments section.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
[X ] Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
[X ] Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
[X ] To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

This PR improves the compile-time message to a user trying to build MXNet 1.5 against cuDNN v6.
Before PR, only the missing library entrypoint is mentioned:

g++ ... -c src/operator/operator.cc -o build/src/operator/operator.o
src/operator/rnn.cc: In function 'std::vector<mxnet::ResourceRequest> mxnet::op::RNNResourceEx(const nnvm::NodeAttrs&, int, mxnet::DispatchMode)':    src/operator/rnn.cc:179:28: error: 'kCuDNNDropoutDesc' is not a member of 'mxnet::ResourceRequest'                                                           request.emplace_back(ResourceRequest::kCuDNNDropoutDesc);

After the PR, the error mentions the library version issue directly:

g++ ... -c src/operator/optimizer_op.cc -o build/src/operator/optimizer_op.o
In file included from src/operator/././operator_common.h:42:0,
                 from src/operator/./rnn-inl.h:45,
                 from src/operator/rnn.cc:29:
src/operator/rnn.cc: In function 'std::vector<mxnet::ResourceRequest> mxnet::op::RNNResourceEx(const nnvm::NodeAttrs&, int, mxnet::DispatchMode)':
src/operator/././../common/cuda_utils.h:467:3: error: static assertion failed: Compiled-against cuDNN version 6021 is too old, please upgrade system to version 7000 or later.
   static_assert(CUDNN_VERSION >= min_version, "Compiled-against cuDNN version " \
   ^
src/operator/rnn.cc:175:5: note: in expansion of macro 'STATIC_ASSERT_CUDNN_VERSION_GE'
     STATIC_ASSERT_CUDNN_VERSION_GE(7000);
     ^
src/operator/rnn.cc:180:28: error: 'kCuDNNDropoutDesc' is not a member of 'mxnet::ResourceRequest'
       request.emplace_back(ResourceRequest::kCuDNNDropoutDesc);
                            ^

This PR provides 2 runtime checks and issues a warning:
- when the compiled-against cuda or cuDNN library version does not match the linked-against version, and
- when the library versions are old w.r.t. the versions tested against by the MXNet CI.

I built the PR against cuda 9 and cuDNN v7.1.4. Running any model will emit the warning:

[01:05:03] src/common/cuda_utils.cc:50: Upgrade advisory: this mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (10000).  Set MXNET_CUDA_VERSION_CHECKING=0 to quiet this warning.
[01:05:03] src/common/cuda_utils.cc:89: Upgrade advisory: this mxnet has been built against cuDNN library version 7104, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_VERSION_CHECKING=0 to quiet this warning.

I then upgraded cuDNN to v7.2.1 without recompiling. The following additional message now appears when running models:

[01:11:11] src/common/cuda_utils.cc:85: cuDNN library mismatch: linked-against version 7201 != compiled-against version 7104.  Set MXNET_CUDNN_VERSION_CHECKING=0 to quiet this warning.

As stated, the user can choose to kill the warning messages with environment variables settings.

…nst CUDNN_VERSION < 7000.

DickJC123 · 2019-07-03T02:44:32Z

Versioning issues were recently discussed in the dev forum: https://lists.apache.org/thread.html/96d4a46a0a3c98ea1f3a3237de713ef5f40967fcb0817d661c18e950@%3Cdev.mxnet.apache.org%3E

Although the PR as it stands does not preclude CUDA8, I propose to add STATIC_ASSERT_CUDA_VERSION_GE(9000) given enough consensus.

Tagging @ptrendx @KellenSunderland @marcoabreu @larroy

src/common/cuda_utils.cc

DickJC123 · 2019-07-03T22:32:59Z

I propose that this PR be accepted in its fairly minimal form. Once merged, I would then follow-up with a PR with sizeable cleanup/removal of cuda and cudnn code using the mechanisms set up here. I can envision a couple of approaches. The first (which I'm not recommending) I'd call the simple, but dangerous approach. In cuda_utils.h, I'd put the lines:

STATIC_ASSERT_CUDA_VERSION_GE(9000);
STATIC_ASSERT_CUDNN_VERSION_GE(7000);

Then I'd rip out all code dealing with earlier versions in other files that included cuda_utils.h. I'm worried this approach would make it too easy for users to experiment with going back to old versions. Perhaps the code removed involved bug work-arounds for older lib versions. This approach loses information about the support level of the various operator implementations. My preference then is:

As we remove old code from a file, and thereby add a minimum version requirement, we add the appropriate STATIC_ASSERT_ macro to the file. Thus the protection travels with the file. No version support information is lost and it would be hard to get in trouble when mixing versions as in:

git checkout old_mxnet_version
git checkout newer_mxnet_version -- my_favorite_op-inl.h

In the above scenario, if my_favorite_op-inl.h was counting on version protections present in its own-version cuda_utils.h, those may now be different and more lax (i.e. allowing older libs). Best if the STATIC_ASSERT_ statements live in my_favorite_op-inl.h where the version requirement exists. Comments welcome.

larroy

LGTM % some symbol visibility comments

larroy · 2019-07-03T23:22:29Z

src/common/cuda_utils.cc

+
+
+// Start-up check that the version of cuda compiled-against matches the linked-against version.
+bool CudaVersionChecks() {


should this be static or anon ns?

larroy · 2019-07-03T23:28:46Z

src/common/cuda_utils.cc

+
+// Dynamic initialization here will emit a warning if runtime and compile-time versions mismatch.
+// Also if the user has recompiled their source to a version no longer tested by upstream CI.
+bool cudnn_version_ok = CuDNNVersionChecks();


symbol visibility: can be static or anon ns? if you are just forcing static initialization. Also maybe we should start thinking about having a single place to do static initialization on library load.

Also in this case a static object and the version check inside ctor would save the memory for this variable.

As a partial response to what you're advocating, I've removed the CuDNNVersionChecks() and CudaVersionChecks() functions from the namespace, using instead immediately-invoked function expressions. I like the simplicity of it now, and I don't think the variable memory (8 bytes?) is much of an issue. If you feel otherwise, send me a pointer to an example of the programming pattern you think would be an improvement.

Up to you. I was thinking that using a static object with no members should not use any memory when initialized in global context as your boolean variable but that depends on the implementation. I agree is not a big deal. I was thinking on something like:

struct Initializer { Initializer() { // your code here } }; static Initializer initializer;

Who calls the version check now? I see a lambda but not where is called.

At the end of each lambda, there's a null argument list '()', turning the lambdas into 'immediately invoked function expressions'.

Thanks for the code snippet- it's pretty similar in effect to what I've got now, so if it's OK with you, I'd prefer to stick with what I've already verified. Actually, my current solution has slightly fewer code lines and less names in the namespace.

Sure makes sense, thanks.

DickJC123 · 2019-07-04T02:25:29Z

For my recent "Move STATIC_ASSERT_..." commit, I ended up moving the compile-time check to resource.cc, where the real version dependency is, instead of in rnn.cc, while also simplifying the #if logic. The error message now properly reveals that it's cudnnRestoreDropoutDescriptor() that's missing from cuDNN v6:

g++ ... -c src/resource.cc -o build/src/resource.o
In file included from src/resource.cc:37:0:
src/resource.cc: In member function 'void mxnet::Resource::get_cudnn_dropout_desc(cudnnDropoutStruct**, mshadow::Stream<mshadow::gpu>*, float, uint64_t) const':
src/./common/cuda_utils.h:467:3: error: static assertion failed: Compiled-against cuDNN version 6021 is too old, please upgrade system to version 7000 or later.
   static_assert(CUDNN_VERSION >= min_version, "Compiled-against cuDNN version " \
   ^
src/resource.cc:446:5: note: in expansion of macro 'STATIC_ASSERT_CUDNN_VERSION_GE'
     STATIC_ASSERT_CUDNN_VERSION_GE(7000);
     ^
src/resource.cc:451:50: error: 'cudnnRestoreDropoutDescriptor' was not declared in this scope
                                              seed));
                                                  ^
src/./common/cuda_utils.h:473:24: note: in definition of macro 'CUDNN_CALL'
     cudnnStatus_t e = (func);                                                 \
                        ^

@szha Since you introduced the change that included the dependency, maybe you'd like to weigh in?

roywei · 2019-07-08T15:15:29Z

@mxnet-label-bot add [CUDA]

DickJC123 · 2019-07-08T17:59:05Z

Gentle ping now that the holiday weekend is over.

DickJC123 · 2019-07-09T17:50:27Z

Not sure if the timing permits this, but I'd think this might be a useful PR to backport to v1.5

KellenSunderland · 2019-07-11T02:15:14Z

src/common/cuda_utils.cc

+    int linkedAgainstCudaVersion = 0;
+    CUDA_CALL(cudaRuntimeGetVersion(&linkedAgainstCudaVersion));
+    if (linkedAgainstCudaVersion != CUDA_VERSION)
+      LOG(WARNING) << "cuda library mismatch: linked-against version " << linkedAgainstCudaVersion


Just want to make sure I'm understanding this one. If a user runs with CUDA 10.2, but the library was linked against 10.1 would this issue a warning? I tend to do that fairly often, is it against best practices?

So 'yes' there would be a warning if the user built against 10.1, but ran with 10.2. These warnings can be turned off with an environment variable setting MXNET_CUDA_VERSION_CHECKING=0. The idea behind the 'advisory' is that the user may want to rebuild to get the new functionality present in 10.2, or perhaps to avoid work-arounds for any issues of 10.1. It's probably more useful with the CUDNN version checks, where we have far more compile guards based on version minor numbers. Do you feel these warnings would be unwelcome to users?

I think the question is what can be the issues when linking against a smaller cuda, leaving performance gains on the table? I think you guys are the experts, I was getting some info from here: https://docs.nvidia.com/deploy/cuda-compatibility/#binary-compatibility
Does this warning indicate a real problem or will it confuse users, when there's nothing wrong on running with a newer cuda.

After studying the link supplied by @larroy (thanks!), I need to retract what I said above. Based on a new understanding, I have removed the runtime check of the cuda-runtime library version. The check is unnecessary since (per the link) the major.minor of the cuda runtime must match for the libmxnet.so lib to load. It was instructive for me to do a ldd libmxnet.so:

libcudart.so.9.2 => /usr/local/cuda/lib64/libcudart.so.9.2 (0x00007f361cc64000) libcudnn.so.7 => /usr/lib/x86_64-linux-gnu/libcudnn.so.7 (0x00007f35f9cdf000) libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f35f3703000)

Note the extra '.minor' number on libcudart.so. So while a compiled-against cudnn 7.2, might run against a cudnn 7.6, a compiled-against cuda 10.1 won't run against a cuda 10.2. Now, keep in mind we're talking about the cuda runtime library, so libcudart.so as set up by the toolkit install. Your experience on 10.1 vs. 10.2 @KellenSunderland was probably based on upgrading the driver to a higher version, while leaving the toolkit install the same.

Let me know if the PR is now to your liking. I've left in the test of the cuda runtime version against the threshold MXNET_CI_OLDEST_CUDA_VERSION. The idea is that once we no longer test against a particular cuda version, then bugs will creep in with new PRs. We'd prefer users to not be on the front line of bug finding, so we should encourage them to upgrade.

Back to your original question- there will be no warning for upgrading the driver to a newer version (e.g. 10.2) while leaving the toolkit at 10.1

Thanks for the clarification, I think your change brings value to align users and what's tested in CI/CD.

Thanks for the updates and clarification Dick.

… load.

larroy · 2019-07-15T23:50:52Z

src/common/cuda_utils.cc

+// Also if the user has recompiled their source to a version no longer tested by upstream CI.
+bool cuda_version_check_performed = []() {
+  // Don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
+  if (dmlc::GetEnv("MXNET_CUDA_VERSION_CHECKING", true) && Context::GetGPUCount() > 0) {


@DickJC123 we have detected an error when running a GPU compiled MXNet in a CPU machine, when building mxnet is loaded to generate the operator bindings. My colleague will fill a ticket about this. Would be great to have your guidance if the underlying cudaGetDeviceCount can run without driver, as the call is failing. Our thinking is that before we were not calling this cuda function on load time. I think a possible solution is to add a function that checks if GPUs are available if the GPU count can't be called without GPUs which is a bit puzzling.

filed a issue on the same, this is breaking our internal build flows, where our buildfarm does not have GPU enabled machines, the GPU builds are also done on CPU machines, with CUDA installed on them, for build purposes.

Add STATIC_ASSERT_{CUDA,CUDNN}_VERSION_GE macros. Protect rnn.cc agai…

7192fc8

…nst CUDNN_VERSION < 7000.

DickJC123 requested a review from szha as a code owner July 3, 2019 02:34

KellenSunderland reviewed Jul 3, 2019

View reviewed changes

src/common/cuda_utils.cc Outdated Show resolved Hide resolved

Omit cuda/cudnn lib version checks when no visible gpu devices.

ea410f7

larroy approved these changes Jul 3, 2019

View reviewed changes

DickJC123 added 2 commits July 3, 2019 18:37

Move STATIC_ASSERT_... to resource.cc.

ab5112f

Remove function names in cuda/cudnn version check impl.

59cd7e0

DickJC123 requested review from anirudh2290 and eric-haibin-lin as code owners July 4, 2019 02:08

marcoabreu added the CUDA label Jul 8, 2019

ptrendx approved these changes Jul 9, 2019

View reviewed changes

KellenSunderland reviewed Jul 11, 2019

View reviewed changes

Remove runtime cuda lib check- major.minor already needed for program…

9ec6097

… load.

KellenSunderland approved these changes Jul 12, 2019

View reviewed changes

KellenSunderland merged commit 6a564be into apache:master Jul 12, 2019

larroy reviewed Jul 15, 2019

View reviewed changes

vrakesh mentioned this pull request Jul 16, 2019

MXNet GPU build on CPU machine fails #15548

Closed

DickJC123 mentioned this pull request Jul 16, 2019

Bypass cuda/cudnn checks if no driver. #15551

Merged

5 tasks

alsrgv mentioned this pull request Jul 18, 2019

Pin tf-nightly-gpu and mxnet-cu100 --pre horovod/horovod#1227

Merged

DickJC123 mentioned this pull request Aug 9, 2019

cuDNN support cleanup #15812

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda/cuDNN lib version checking. Force cuDNN v7 usage. #15449

cuda/cuDNN lib version checking. Force cuDNN v7 usage. #15449

DickJC123 commented Jul 3, 2019 •

edited

Loading

DickJC123 commented Jul 3, 2019

DickJC123 commented Jul 3, 2019

larroy left a comment

larroy Jul 3, 2019

larroy Jul 3, 2019 •

edited

Loading

DickJC123 Jul 4, 2019

larroy Jul 9, 2019

larroy Jul 9, 2019

DickJC123 Jul 10, 2019

larroy Jul 11, 2019

DickJC123 commented Jul 4, 2019

roywei commented Jul 8, 2019

DickJC123 commented Jul 8, 2019

DickJC123 commented Jul 9, 2019

KellenSunderland Jul 11, 2019

DickJC123 Jul 11, 2019

larroy Jul 11, 2019

DickJC123 Jul 11, 2019

larroy Jul 11, 2019

KellenSunderland Jul 12, 2019

larroy Jul 15, 2019

larroy Jul 15, 2019

vrakesh Jul 16, 2019



		// Start-up check that the version of cuda compiled-against matches the linked-against version.
		bool CudaVersionChecks() {

cuda/cuDNN lib version checking. Force cuDNN v7 usage. #15449

cuda/cuDNN lib version checking. Force cuDNN v7 usage. #15449

Conversation

DickJC123 commented Jul 3, 2019 • edited Loading

Checklist

Essentials

Changes

Comments

DickJC123 commented Jul 3, 2019

DickJC123 commented Jul 3, 2019

larroy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

larroy Jul 3, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DickJC123 commented Jul 4, 2019

roywei commented Jul 8, 2019

DickJC123 commented Jul 8, 2019

DickJC123 commented Jul 9, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DickJC123 commented Jul 3, 2019 •

edited

Loading

larroy Jul 3, 2019 •

edited

Loading