{numlib,chem,toolchain}[NVHPC/23.7-CUDA-12.1.1] nvompi-2023a + QuantumESPRESSO-7.3.1 (GPU enabled) #20364
Comparison of code efficiency when linked to EB numlibs (no prefix) VS linked to NVHPC math_libs (
Thanks for putting all of this together! Our site is interested in a GPU enabled QuantumESPRESSO build, so we've been testing this. Were you able to get around the "other error"s that occur in LAPACK testing when building OpenBLAS? Using the OpenBLAS_0.3.24-NVHPC-23.7-CUDA-12.1.1.eb EasyConfig as provided gives us 55 other errors:
I saw that you had done some work on OpenBLAS issue #4652 to get some of the numerical failures down, but was wondering if you were ever able to get rid of the other errors that stop EasyBuild from finishing.
@cgross95
What hardware are you trying this on? I think I was still getting some other errors as well with 0.3.27, but I didn't investigate much further as I was aiming at 0.3.24 for this release (in that case I was getting 14 errors related to the ZHSQR and ZGEEV routines failing to find all eigenvalues). The logs should give you further details on which LAPACK routine failed and with what error code (each function should document the meaning of its error codes as comments in the source/documentation).
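For reading those error codes, a minimal sketch in Python of the general LAPACK `INFO` convention may help. The helper name `describe_lapack_info` is hypothetical and not part of OpenBLAS or EasyBuild; the exact meaning of a positive `INFO` differs per routine, so the routine's own header comments remain the reference.

```python
def describe_lapack_info(routine, info):
    """Interpret a LAPACK INFO return code using the general convention.

    Hypothetical helper for illustration only: INFO == 0 is success,
    INFO < 0 flags an illegal argument, and INFO > 0 is a routine-specific
    failure (for eigenvalue solvers, typically not all eigenvalues converged).
    """
    if info == 0:
        return f"{routine}: successful exit"
    if info < 0:
        # By convention, INFO = -i means the i-th argument was invalid.
        return f"{routine}: argument {-info} had an illegal value"
    # Positive values are routine-specific; consult the routine's header.
    return f"{routine}: routine-specific failure (INFO={info})"


if __name__ == "__main__":
    for code in (0, -3, 5):
        print(describe_lapack_info("ZGEEV", code))
```

A positive code from an eigenvalue routine such as ZGEEV is the case that matches the "failing to find all eigenvalues" errors described above.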
I'm compiling on a V100S with an Intel Xeon Skylake on Ubuntu 22.04. We also have some A100 cards, but we're in the midst of transferring everything in our cluster to Ubuntu, so they're not easily accessible at the moment. I'll dig into the LAPACK testing logs and see if I can produce some more useful debugging information.
I finally got access to our A100 cards, and can report that there were no "other error"s in the LAPACK tests. I ended up with 152 numerical errors, so increased the |
Hi, I'm compiling on a V100S with an Intel Xeon Skylake on Ubuntu 22.04. What other changes do you think I should make to be able to use QuantumESPRESSO (GPU enabled)? When I use this PR with eb --from-pr 20364 -r, I get a checksum error in libxc, which I fixed; after that I get an error in OpenBLAS/0.3.24-NVHPC-23.7-CUDA-12.1.1.... The error I get is
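For the libxc checksum error mentioned above, one way to check a downloaded source file by hand is to compute its SHA256 digest and compare it with the value listed under `checksums` in the easyconfig. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the easyconfig lists a SHA256 checksum. EasyBuild also provides `eb --inject-checksums` to regenerate the checksums in an easyconfig.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected):
    """Compare the file's digest with the (hypothetical) expected value."""
    return sha256_of_file(path) == expected.strip().lower()
```

If the digests differ, the download may be corrupted or upstream may have re-released the tarball, so it is worth confirming which before editing the easyconfig.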
@beeebiii The other error you are reporting seems related to OpenBLAS. In my tests on an A100 with an AMD zen2 CPU I did not encounter failures in the compilation (only some failures in the test suite). One weird thing is I am not sure |
Yeah, you are right, I think gcc is being used instead of nvcc.
If you look at the OpenBLAS easyconfig, only NVHPC should be used, and EasyBuild should not be aware of other compiler toolchains in that instance.
This comment was marked as off-topic.
That is basically EasyBuild reporting the easyconfig file being used; the debug logging adds much more.
I would venture to guess the problem is in either of them.
To summarize the problem @beeebiii was having, the |
This reverts commit 706e9d1.
Updated software
Added easyconfig files for nvofbf toolchain + QE 7.3.1
local compilers:
GCC/12.3.0
CUDA/12.1.1
Added toolchain/numlib
nvofbf-2023a
nvompi-2023a
NVHPC-23.7-CUDA-12.1.1
OpenMPI-4.1.5
FlexiBLAS-3.3.1
OpenBLAS-0.3.24
FFTW-3.3.10
FFTW.MPI-3.3.10
ScaLAPACK-2.2.0-fb
Added easyconfigs
HDF5-1.14.0-nvompi-2023a-CUDA-12.1.1.eb
libxc-6.2.2-NVHPC-23.7-CUDA-12.1.1.eb
QuantumESPRESSO-7.3.1-nvompi-2023a-CUDA-12.1.1.eb
NOTES:
cuda compilers which require a specified compute capability (CC), while QE uses hpc-sdk compilers which, if not specified, compile for all supported CCs
Solved issues:
v0.3.24
v0.3.27
Open issue:
ZHEEV BLAS routine
cuda-gdb
nvompi
linking directly to OpenBLAS and the error was not present
RMM-DIS diagonalization with k points other than GAMMA, most likely a QE bug (https://gitlab.com/QEF/q-e/-/issues/675)
CMAKE and only experimental with autotools, and also not a really widely used feature of QE, it is ok to not have the libxc routines run on GPU
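To illustrate the compute capability (CC) point in the notes, here is a small Python sketch that turns a comma-separated CC list into the `-gencode` flags that `nvcc` expects. The function name `gencode_flags` is hypothetical and this is not how the easyconfigs in this PR set their flags. The V100 is CC 7.0 and the A100 is CC 8.0. In EasyBuild, such a list can be given with `--cuda-compute-capabilities`.

```python
def gencode_flags(capabilities):
    """Translate e.g. "7.0,8.0" into a list of nvcc -gencode flags.

    Hypothetical helper for illustration; each entry must look like
    "<major>.<minor>".
    """
    flags = []
    for cc in capabilities.split(","):
        major, minor = cc.strip().split(".")
        sm = f"{int(major)}{int(minor)}"  # "7.0" -> "70"
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    return flags


if __name__ == "__main__":
    # V100 (CC 7.0) and A100 (CC 8.0), the cards mentioned in the thread.
    print(" ".join(gencode_flags("7.0,8.0")))
```

Compilers that build for all supported CCs by default, as described for hpc-sdk above, do not need such a list.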