{lib}[foss/2023a] TensorFlow v2.13.0 w/ CUDA 12.1.1 #19182

VRehnberg
Contributor

(created using eb --new-pr)

@VRehnberg
Contributor Author

Test report by @VRehnberg
FAILED
Build succeeded for 14 out of 15 (1 easyconfigs in total)
alvis2-02 - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 1 x NVIDIA Tesla T4, 535.104.05, Python 3.6.8
See https://gist.github.com/VRehnberg/f4440b196cb6b0337902b6d580f83bff for a full test report.

@schiotz
Contributor

schiotz commented Nov 10, 2023

The build error could easily be because you should have ('Python-bundle-PyPI', '2023.06') as a dependency. Most python packages that used to be in the main Python module have been moved there.
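As a rough illustration of that suggestion, the dependency list would gain one entry. This is only a sketch in plain Python, not the PR's actual easyconfig: `SYSTEM` is a stand-in for the constant EasyBuild provides, the `versionsuffix` value is assumed, the helper `has_dependency` is hypothetical, and all other dependencies are left out.

```python
# Easyconfig-style dependency list, written as plain Python so it runs standalone.
# SYSTEM is a placeholder for the constant EasyBuild defines inside easyconfigs.
SYSTEM = "SYSTEM"
versionsuffix = "-CUDA-12.1.1"  # assumed value; the thread only shows the variable name

dependencies = [
    ("CUDA", "12.1.1", "", SYSTEM),
    ("cuDNN", "8.9.2.26", versionsuffix, SYSTEM),
    # Suggested addition: packages that used to ship with the main Python module
    ("Python-bundle-PyPI", "2023.06"),
]


def has_dependency(deps, name):
    """Return True if a dependency tuple with the given name is listed."""
    return any(dep[0] == name for dep in deps)


print(has_dependency(dependencies, "Python-bundle-PyPI"))
```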

@VRehnberg
Contributor Author

> The build error could easily be because you should have ('Python-bundle-PyPI', '2023.06') as a dependency. Most python packages that used to be in the main Python module have been moved there.

Hmm, yes, that would be what I expected if it weren't for the fact that the non-CUDA TensorFlow seems to work without it: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.13.0-foss-2023a.eb

Additionally, the header that is "missing" should come with JsonCpp, which is a dependency: https://github.com/open-source-parsers/jsoncpp/blob/master/include/json/json.h

@boegel added the update label Nov 14, 2023
@boegel added this to the 4.x milestone Nov 14, 2023
@boegel
Member

boegel commented Nov 16, 2023

@Flamefire Any thoughts on why the test installation would fail with "fatal error: json/config.h: No such file or directory" while TensorFlow-2.13.0-foss-2023a.eb doesn't show that problem?

@boegel
Member

boegel commented Nov 16, 2023

@VRehnberg Any chance you have a custom easyblock in place for TensorFlow in /apps/c3se-easyblocks/ that is causing trouble?

]
dependencies = [
    ('CUDA', '12.1.1', '', SYSTEM),
    ('cuDNN', '8.9.2.26', versionsuffix, SYSTEM),
Member

@VRehnberg It seems like this cuDNN version is causing trouble, I'm getting:

In file included from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:37,
                 from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_OperationGraph.h:36,
                 from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Heuristics.h:31,
                 from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend.h:101,
                 from tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:56:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h: In member function int64_t cudnn_frontend::PointWiseDesc_v8::getPortCount() const:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h:69:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
   69 |         switch (mode) {
      |                ^
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h: In member function cudnn_frontend::Operation_v8&& cudnn_frontend::OperationBuilder_v8::build_pointwise_op():
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:413:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
  413 |         switch (m_operation.pointwise_mode) {
      |                ^
cc1plus: some warnings being treated as errors

See also tensorflow/tensorflow#60832, where they suggest downgrading to an older cuDNN (ugh...)

Member

There are no easyconfigs yet that use a 2023a toolchain and have a cuDNN dependency, so we still have the freedom to stick to cuDNN 8.6.* here...

Member

We would also need to stick to CUDA 11.8 though, since cuDNN 8.6 seems to be paired only with CUDA 10.2 and 11.8, see https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/

Member

And CUDA 11.8 is a problem with GCC 12.x: I'm hitting this when installing NCCL on top of CUDA 11.8.0 with GCCcore/12.3.0:

unsupported GNU version! gcc versions later than 11 are not supported!

So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
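The chain of constraints in the comments above can be written down as a small check. This is a sketch that only encodes what is reported in this thread, not an authoritative support matrix; the GCC 11.3.0 entry for foss/2022a is not stated in the thread and is filled in as an assumption, and the function names are hypothetical.

```python
# Constraints as reported in this thread (not an official compatibility table).
CUDNN_TO_CUDA = {"8.6": ["11.8"]}          # relevant CUDA pairing for cuDNN 8.6
MAX_GCC_MAJOR_FOR_CUDA = {"11.8": 11}      # from nvcc: "gcc versions later than 11 are not supported"
TOOLCHAIN_GCC = {
    "foss/2022a": "11.3.0",                # assumption, not stated in the thread
    "foss/2023a": "12.3.0",                # GCCcore/12.3.0, as mentioned above
}


def gcc_ok(toolchain, cuda):
    """Check the toolchain's GCC major version against the limit known for a CUDA version."""
    gcc_major = int(TOOLCHAIN_GCC[toolchain].split(".")[0])
    limit = MAX_GCC_MAJOR_FOR_CUDA.get(cuda)
    # No recorded limit means "unknown", which this sketch treats as acceptable.
    return limit is None or gcc_major <= limit


def usable_cuda(toolchain, cudnn):
    """List the CUDA versions paired with a cuDNN series that the toolchain's GCC can use."""
    return [cuda for cuda in CUDNN_TO_CUDA[cudnn] if gcc_ok(toolchain, cuda)]


for toolchain in ("foss/2022a", "foss/2023a"):
    print(toolchain, usable_cuda(toolchain, "8.6"))
```

With these inputs, foss/2023a ends up with no usable CUDA version for cuDNN 8.6, which is the dead end described above.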

Contributor Author

Meh, I'll close this one then.

Contributor Author

@VRehnberg VRehnberg Nov 17, 2023

Hmm, or go with another CUDA version, I suppose. That's what the CUDA version suffix is for, I guess. I can't find anything about which GCC versions are compatible with CUDA 12.3, but extrapolating from what I could find, GCC 12.x will probably work with it. CUDA 12.3 isn't listed for cuDNN 8.9.6, but it could possibly work.

Contributor

@Flamefire Flamefire Nov 21, 2023

> unsupported GNU version! gcc versions later than 11 are not supported!
>
> So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?

We can work around this by forcing NVCC to accept the "incompatible" compiler: https://github.com/easybuilders/easybuild-easyconfigs/pull/18853/files#diff-c0833191974a98d7eddf20cecac9d27ec670e369f43f75f3a4bafb2261b1135fR27
Of course there is a risk that the compiler really is incompatible...
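The linked change is not reproduced here. One generic way to get this effect is nvcc's `-allow-unsupported-compiler` option, which disables the host compiler version check. The sketch below passes it through the `NVCC_APPEND_FLAGS` environment variable, assuming a CUDA release that honors that variable; whether the linked PR does it this way is not stated in this thread, and the helper name is hypothetical.

```python
import os


def env_allowing_unsupported_compiler(base_env=None):
    """Return a copy of the environment that asks nvcc to skip its host compiler check."""
    env = dict(os.environ if base_env is None else base_env)
    flag = "-allow-unsupported-compiler"
    # Keep flags that are already set and avoid adding the option twice.
    flags = env.get("NVCC_APPEND_FLAGS", "").split()
    if flag not in flags:
        flags.append(flag)
    env["NVCC_APPEND_FLAGS"] = " ".join(flags)
    return env


# Example with a minimal, made-up base environment.
env = env_allowing_unsupported_compiler({"PATH": "/usr/bin"})
print(env["NVCC_APPEND_FLAGS"])
```

The returned mapping could then be handed to a build step, for example through the `env` argument of `subprocess.run`. The flag only silences the check; it does not make the compiler any more compatible.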

@VRehnberg
Contributor Author

> @VRehnberg Any chance you have a custom easyblock in place for TensorFlow in /apps/c3se-easyblocks/ that is causing trouble?

Oh, good catch, we actually did. We almost never have custom ones, so I forgot to check, but it's removed now.

@VRehnberg
Contributor Author

There seem to be some compatibility issues between cuDNN, CUDA and GCC (see boegel's comments). It's probably easier to just go with foss/2022a, so I'll close this for now.

@VRehnberg closed this Nov 17, 2023
@VRehnberg
Contributor Author

Test report by @VRehnberg
FAILED
Build succeeded for 14 out of 15 (1 easyconfigs in total)
alvis3-18 - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 2 x NVIDIA NVIDIA A100-SXM4-40GB, 535.104.05, Python 3.6.8
See https://gist.github.com/VRehnberg/d104660df9528635d1f28dd4bd59540c for a full test report.

@casparvl
Contributor

I was wondering what the right way forward is here, if this means 'we can not have a TensorFlow in 2023a'. Had a small discussion with @boegel . This page of tested build configurations lists TF 2.15 + cuDNN 8.8 + CUDA 12.2 as a tested combination (and who knows, it might even work with cuDNN 8.9 too).

The best way forward is probably to

  • first create a CPU build for TF 2.15 with foss/2023a
  • then create a GPU build for TF 2.15. I'd probably start with cuDNN 8.8 and CUDA 12.2 since that is the tested config. If that works out of the box, I'd try switching to cuDNN 8.9 and CUDA 12.1.1. If that works too, that would be my preference since it means we can use what's already there in 2023a.

@Flamefire
Contributor

> first create a CPU build for TF 2.15 with foss/2023a

That's what I usually do too, and it's why there's no PyTorch 2.x with CUDA yet: there are still unresolved issues with the CPU version.
So I support that plan.
Note that we additionally have the freedom to have both cuDNN versions with different CUDA versions due to the versionsuffix, which we can use as an escape hatch if we have to.
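To make the versionsuffix point concrete, here is a small sketch of how two cuDNN installations for different CUDA versions end up with distinct module names. It assumes the `-CUDA-<version>` suffix convention and a `name/version<suffix>` naming scheme; the helper functions are hypothetical, and the second pair uses the truncated versions mentioned in this thread rather than real release numbers.

```python
def cuda_versionsuffix(cuda_version):
    """Build the version suffix that ties a package to one CUDA version."""
    return f"-CUDA-{cuda_version}"


def module_name(name, version, versionsuffix=""):
    """Compose a module name as name/version followed by the optional suffix."""
    return f"{name}/{version}{versionsuffix}"


# (cuDNN version, CUDA version) pairs discussed in this thread.
candidates = [
    ("8.9.2.26", "12.1.1"),
    ("8.8", "12.2"),  # truncated versions as given above, not full release numbers
]

modules = [module_name("cuDNN", cudnn, cuda_versionsuffix(cuda)) for cudnn, cuda in candidates]
for module in modules:
    print(module)
```

Because the CUDA version is part of the name, both installations can coexist without clashing.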
