{lib}[foss/2023a] TensorFlow v2.13.0 w/ CUDA 12.1.1 #19182
Conversation
Test report by @VRehnberg
The build error could easily be because you should have
Hmm, yes, that would be what I expected if it weren't for the fact that the non-CUDA TensorFlow seems to work without it. https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.13.0-foss-2023a.eb Additionally, the header that is "missing" should come with JsonCpp, which is a dependency. https://github.com/open-source-parsers/jsoncpp/blob/master/include/json/json.h
@Flamefire Any thoughts on why the test installation would fail with "
@VRehnberg Any chance you have a custom easyblock in place for TensorFlow in
]
dependencies = [
    ('CUDA', '12.1.1', '', SYSTEM),
    ('cuDNN', '8.9.2.26', versionsuffix, SYSTEM),
@VRehnberg It seems like this cuDNN version is causing trouble, I'm getting:
In file included from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:37,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_OperationGraph.h:36,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Heuristics.h:31,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend.h:101,
from tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:56:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h: In member function int64_t cudnn_frontend::PointWiseDesc_v8::getPortCount() const:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h:69:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
69 | switch (mode) {
| ^
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h: In member function cudnn_frontend::Operation_v8&& cudnn_frontend::OperationBuilder_v8::build_pointwise_op():
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:413:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
413 | switch (m_operation.pointwise_mode) {
| ^
cc1plus: some warnings being treated as errors
see also tensorflow/tensorflow#60832, where they suggest downgrading to an older cuDNN
(ugh...)
There are no easyconfigs yet that use a 2023a toolchain and have a cuDNN dependency, so we still have the freedom to stick to cuDNN 8.6.* here...
We also need to stick to CUDA 11.8 though, since cuDNN 8.6 only seems to be paired with CUDA 10.2 and 11.8; see https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/
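For concreteness, a pinned-down dependency block along those lines might look as sketched below. This is only an illustration: the cuDNN patch level (8.6.0.163) is an assumption based on the installer listing above, not something settled in this thread.
versionsuffix = '-CUDA-%(cudaver)s'
dependencies = [
    # assumed pairing: CUDA 11.8.0 with a cuDNN 8.6.0.x build, per the NVIDIA listing above
    ('CUDA', '11.8.0', '', SYSTEM),
    ('cuDNN', '8.6.0.163', versionsuffix, SYSTEM),
]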
And CUDA 11.8 is a problem with GCC 12.x, hitting this when installing NCCL on top of CUDA 11.8.0 with GCCcore/12.3.0:
unsupported GNU version! gcc versions later than 11 are not supported!
So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
Meh, I'll close this one then.
Hmm, or go with another CUDA version, I suppose; that's what the CUDA version suffix is for, I guess. For CUDA 12.3 I can't find anything about compatible GCC versions, but extrapolating from what I could find, GCC 12 will probably work with CUDA 12.3. CUDA 12.3 isn't listed for cuDNN 8.9.6, but it could possibly work.
unsupported GNU version! gcc versions later than 11 are not supported!
So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
We can work around this by forcing NVCC to accept the "incompatible" compiler: https://github.com/easybuilders/easybuild-easyconfigs/pull/18853/files#diff-c0833191974a98d7eddf20cecac9d27ec670e369f43f75f3a4bafb2261b1135fR27
Of course there is a risk that the compiler really is incompatible...
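For reference, a minimal sketch of how such an override can be expressed in an easyconfig is shown below. It illustrates the general mechanism (nvcc's -allow-unsupported-compiler flag, picked up via the NVCC_APPEND_FLAGS environment variable), not necessarily the exact change made in PR #18853.
# Sketch only: relax nvcc's host compiler version check so a GCC 12.x host
# compiler is accepted when building against CUDA 11.8; whether the combination
# really works is exactly the risk mentioned above.
prebuildopts = 'export NVCC_APPEND_FLAGS="-allow-unsupported-compiler" && '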
Oh, good catch, we actually did. We almost never have any custom ones, so I forgot to check, but it's removed now at least.
There seem to be some compatibility issues between cuDNN, CUDA and GCC (see boegel's comments). Probably easier to just go with foss-2022a for now; I'll close this.
Test report by @VRehnberg
I was wondering what the right way forward is here, if this means 'we cannot have TensorFlow in 2023a'. I had a small discussion with @boegel. This page of tested build configurations lists TF 2.15 + cuDNN 8.8 + CUDA 12.2 as a tested combination (and who knows, it might even work with cuDNN 8.9 too). The best way forward is probably to
That's what I usually do too, and it's why there's no PyTorch 2.x with CUDA yet, as there are still remaining issues with the CPU version.
(created using eb --new-pr)