-
Notifications
You must be signed in to change notification settings - Fork 6.8k
cuda/cuDNN lib version checking. Force cuDNN v7 usage. #15449
Changes from 2 commits
7192fc8
ea410f7
ab5112f
59cd7e0
9ec6097
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
|
||
/*! | ||
* Copyright (c) 2019 by Contributors | ||
* \file cuda_utils.cc | ||
* \brief CUDA debugging utilities. | ||
*/ | ||
|
||
#include <mxnet/base.h> | ||
#include "cuda_utils.h" | ||
|
||
#if MXNET_USE_CUDA == 1 | ||
|
||
namespace mxnet { | ||
namespace common { | ||
namespace cuda { | ||
|
||
// The oldest version of cuda used in upstream MXNet CI testing, both for unix and windows. | ||
// Users that have rebuilt MXNet against older versions will we advised with a warning to upgrade | ||
// their systems to match the CI level. Minimally, users should rerun the CI locally. | ||
#if defined(_MSC_VER) | ||
#define MXNET_CI_OLDEST_CUDA_VERSION 9020 | ||
#else | ||
#define MXNET_CI_OLDEST_CUDA_VERSION 10000 | ||
#endif | ||
|
||
|
||
// Start-up check that the version of cuda compiled-against matches the linked-against version. | ||
bool CudaVersionChecks() { | ||
// Don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="") | ||
if (dmlc::GetEnv("MXNET_CUDA_VERSION_CHECKING", true) && Context::GetGPUCount() > 0) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @DickJC123 we have detected an error when running a GPU compiled MXNet in a CPU machine, when building mxnet is loaded to generate the operator bindings. My colleague will fill a ticket about this. Would be great to have your guidance if the underlying cudaGetDeviceCount can run without driver, as the call is failing. Our thinking is that before we were not calling this cuda function on load time. I think a possible solution is to add a function that checks if GPUs are available if the GPU count can't be called without GPUs which is a bit puzzling. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. filed a issue on the same, this is breaking our internal build flows, where our buildfarm does not have GPU enabled machines, the GPU builds are also done on CPU machines, with CUDA installed on them, for build purposes. |
||
int linkedAgainstCudaVersion = 0; | ||
CUDA_CALL(cudaRuntimeGetVersion(&linkedAgainstCudaVersion)); | ||
KellenSunderland marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if (linkedAgainstCudaVersion != CUDA_VERSION) | ||
LOG(WARNING) << "cuda library mismatch: linked-against version " << linkedAgainstCudaVersion | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just want to make sure I'm understanding this one. If a user runs with CUDA 10.2, but the library was linked against 10.1 would this issue a warning? I tend to do that fairly often, is it against best practices? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So 'yes' there would be a warning if the user built against 10.1, but ran with 10.2. These warnings can be turned off with an environment variable setting MXNET_CUDA_VERSION_CHECKING=0. The idea behind the 'advisory' is that the user may want to rebuild to get the new functionality present in 10.2, or perhaps to avoid work-arounds for any issues of 10.1. It's probably more useful with the CUDNN version checks, where we have far more compile guards based on version minor numbers. Do you feel these warnings would be unwelcome to users? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the question is what can be the issues when linking against a smaller cuda, leaving performance gains on the table? I think you guys are the experts, I was getting some info from here: https://docs.nvidia.com/deploy/cuda-compatibility/#binary-compatibility There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After studying the link supplied by @larroy (thanks!), I need to retract what I said above. Based on a new understanding, I have removed the runtime check of the cuda-runtime library version. The check is unnecessary since (per the link) the major.minor of the cuda runtime must match for the libmxnet.so lib to load. It was instructive for me to do a
Note the extra '.minor' number on libcudart.so. So while a compiled-against cudnn 7.2, might run against a cudnn 7.6, a compiled-against cuda 10.1 won't run against a cuda 10.2. Now, keep in mind we're talking about the cuda Let me know if the PR is now to your liking. I've left in the test of the cuda runtime version against the threshold MXNET_CI_OLDEST_CUDA_VERSION. The idea is that once we no longer test against a particular cuda version, then bugs will creep in with new PRs. We'd prefer users to not be on the front line of bug finding, so we should encourage them to upgrade. Back to your original question- there will be no warning for upgrading the driver to a newer version (e.g. 10.2) while leaving the toolkit at 10.1 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the clarification, I think your change brings value to align users and what's tested in CI/CD. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the updates and clarification Dick. |
||
<< " != compiled-against version " << CUDA_VERSION << "." | ||
<< "Set MXNET_CUDA_VERSION_CHECKING=0 to quiet this warning."; | ||
if (CUDA_VERSION < MXNET_CI_OLDEST_CUDA_VERSION) | ||
LOG(WARNING) << "Upgrade advisory: this mxnet has been built against cuda library version " | ||
<< CUDA_VERSION << ", which is older than the oldest version tested by CI (" | ||
<< MXNET_CI_OLDEST_CUDA_VERSION << "). " | ||
<< "Set MXNET_CUDA_VERSION_CHECKING=0 to quiet this warning."; | ||
} | ||
return true; | ||
} | ||
|
||
// Dynamic initialization here will emit a warning if runtime and compile-time versions mismatch. | ||
// Also if the user has recompiled their source to a version no longer tested by upstream CI. | ||
bool cuda_version_ok = CudaVersionChecks(); | ||
|
||
} // namespace cuda | ||
} // namespace common | ||
} // namespace mxnet | ||
|
||
#endif // MXNET_USE_CUDA | ||
|
||
#if MXNET_USE_CUDNN == 1 | ||
|
||
namespace mxnet { | ||
namespace common { | ||
namespace cudnn { | ||
|
||
// The oldest version of CUDNN used in upstream MXNet CI testing, both for unix and windows. | ||
// Users that have rebuilt MXNet against older versions will we advised with a warning to upgrade | ||
// their systems to match the CI level. Minimally, users should rerun the CI locally. | ||
#if defined(_MSC_VER) | ||
#define MXNET_CI_OLDEST_CUDNN_VERSION 7600 | ||
#else | ||
#define MXNET_CI_OLDEST_CUDNN_VERSION 7600 | ||
#endif | ||
|
||
// Start-up check that the version of cudnn compiled-against matches the linked-against version. | ||
// Also if the user has recompiled their source to a version no longer tested by upstream CI. | ||
bool CuDNNVersionChecks() { | ||
// Don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="") | ||
if (dmlc::GetEnv("MXNET_CUDNN_VERSION_CHECKING", true) && Context::GetGPUCount() > 0) { | ||
size_t linkedAgainstCudnnVersion = cudnnGetVersion(); | ||
if (linkedAgainstCudnnVersion != CUDNN_VERSION) | ||
LOG(WARNING) << "cuDNN library mismatch: linked-against version " << linkedAgainstCudnnVersion | ||
<< " != compiled-against version " << CUDNN_VERSION << ". " | ||
<< "Set MXNET_CUDNN_VERSION_CHECKING=0 to quiet this warning."; | ||
if (CUDNN_VERSION < MXNET_CI_OLDEST_CUDNN_VERSION) | ||
LOG(WARNING) << "Upgrade advisory: this mxnet has been built against cuDNN library version " | ||
<< CUDNN_VERSION << ", which is older than the oldest version tested by CI (" | ||
<< MXNET_CI_OLDEST_CUDNN_VERSION << "). " | ||
<< "Set MXNET_CUDNN_VERSION_CHECKING=0 to quiet this warning."; | ||
} | ||
return true; | ||
} | ||
|
||
// Dynamic initialization here will emit a warning if runtime and compile-time versions mismatch. | ||
// Also if the user has recompiled their source to a version no longer tested by upstream CI. | ||
bool cudnn_version_ok = CuDNNVersionChecks(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. symbol visibility: can be static or anon ns? if you are just forcing static initialization. Also maybe we should start thinking about having a single place to do static initialization on library load. Also in this case a static object and the version check inside ctor would save the memory for this variable. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As a partial response to what you're advocating, I've removed the CuDNNVersionChecks() and CudaVersionChecks() functions from the namespace, using instead immediately-invoked function expressions. I like the simplicity of it now, and I don't think the variable memory (8 bytes?) is much of an issue. If you feel otherwise, send me a pointer to an example of the programming pattern you think would be an improvement. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Up to you. I was thinking that using a static object with no members should not use any memory when initialized in global context as your boolean variable but that depends on the implementation. I agree is not a big deal. I was thinking on something like:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Who calls the version check now? I see a lambda but not where is called. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At the end of each lambda, there's a null argument list '()', turning the lambdas into 'immediately invoked function expressions'. Thanks for the code snippet- it's pretty similar in effect to what I've got now, so if it's OK with you, I'd prefer to stick with what I've already verified. Actually, my current solution has slightly fewer code lines and less names in the namespace. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure makes sense, thanks. |
||
|
||
} // namespace cudnn | ||
} // namespace common | ||
} // namespace mxnet | ||
|
||
#endif // MXNET_USE_CUDNN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be static or anon ns?