Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Atlas-based variant using BlockStructured FunctionSpace #54

Merged
merged 47 commits into from
Dec 4, 2023

Conversation

sbrdar
Copy link
Collaborator

@sbrdar sbrdar commented May 31, 2023

  • add an optional use of Atlas data structure (atlas::fieldset) as backend for allocating and accessing field data
  • gives identical results in the provided validation
  • is of the same performance as the original code

co-authored with W. Deconinck, B. Reuter, M. Lange

@sbrdar sbrdar requested review from mlange05 and reuterbal May 31, 2023 12:16
@FussyDuck
Copy link

FussyDuck commented May 31, 2023

CLA assistant check
All committers have signed the CLA.

@reuterbal reuterbal changed the base branch from main to develop May 31, 2023 12:22
Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long to look at this and many thanks, this is great! Nothing much to comment on the Fortran side of things, which has been done very cleanly!

I have left a few comments on the bundle integration. To test the variant, I implemented a good number of these suggestions already and will file them as a PR against this branch.

With that, I have built and run this with GNU 11.2.0 on Atos, which worked well.

Unfortunately, when using MPI there seems to be some kind of problem, which I haven't been able to figure out, yet. It throws a SIGFPE inside the compute kernel but I don't know if this is maybe due to uninitialized values or something else:

==== backtrace (tid:1080635) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000040fbf5 cloudsc_()  /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc.F90:750
 2 0x000000000040954d __cloudsc_driver_mod_MOD_cloudsc_driver()  /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc_driver_mod.F90:112
 3 0x000000000041680a MAIN__()  dwarf_cloudsc_atlas.F90:0
 4 0x0000000000403043 main()  ???:0
 5 0x000000000003acf3 __libc_start_main()  ???:0
 6 0x000000000040309e _start()  ???:0
=================================

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
BFD: Dwarf Error: Can't find .debug_ranges section.
==== backtrace (tid:1080636) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000040fbf5 cloudsc_()  /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc.F90:750
 2 0x000000000040954d __cloudsc_driver_mod_MOD_cloudsc_driver()  /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc_driver_mod.F90:112
 3 0x000000000041680a MAIN__()  dwarf_cloudsc_atlas.F90:0
 4 0x0000000000403043 main()  ???:0
 5 0x000000000003acf3 __libc_start_main()  ???:0
 6 0x000000000040309e _start()  ???:0
=================================

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x14ed97fa4cdf in ???
#0  0x14dca11becdf in ???
#1  0x40fbf5 in cloudsc_
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc.F90:750
#1  0x40fbf5 in cloudsc_
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc.F90:750
#2  0x40954c in __cloudsc_driver_mod_MOD_cloudsc_driver
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc_driver_mod.F90:112
#3  0x416809 in dwarf_cloudsc
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/dwarf_cloudsc_atlas.F90:86
#4  0x403042 in main
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/dwarf_cloudsc_atlas.F90:12
#2  0x40954c in __cloudsc_driver_mod_MOD_cloudsc_driver
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/cloudsc_driver_mod.F90:112
#3  0x416809 in dwarf_cloudsc
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/dwarf_cloudsc_atlas.F90:86
#4  0x403042 in main
        at /etc/ecmwf/nfs/dh1_home_a/nabr/dwarf-p-cloudsc/nab_atlas_structures/source/cloudsc-dwarf/src/cloudsc_fortran_atlas/dwarf_cloudsc_atlas.F90:12

Would you mind taking a look at that?

src/cloudsc_fortran_atlas/CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
src/cloudsc_fortran_atlas/CMakeLists.txt Show resolved Hide resolved
src/cloudsc_fortran_atlas/CMakeLists.txt Outdated Show resolved Hide resolved
src/cloudsc_fortran_atlas/dwarf_cloudsc_atlas.F90 Outdated Show resolved Hide resolved
@sbrdar
Copy link
Collaborator Author

sbrdar commented Aug 23, 2023

The parallelisation issues should now be resolved in the branch: nab_atlas_structures.

@sbrdar sbrdar requested a review from reuterbal August 23, 2023 16:55
@reuterbal
Copy link
Collaborator

OK, I think this is almost ready. I fixed the versions of the new dependencies to the respective latest releases, and changed some of the ENABLE_CUDA handling to make sure this feature is enabled only for the relevant projects in the bundle.

I have also temporarily disabled the Atlas variant when building single precision, as the expand_mod for Atlas is currently only capable of double precision. @sbrdar, can you fix this please in a subsequent PR on a new branch?

There is also an issue when using the atlas variant with Serialbox: @sbrdar could you please fix that on the current branch?

Separate from this we seem to have broken the CUF transformation in Loki, which is only being tested when building with NVHPC (same failure occurs also on develop). That is the reason for the failing NVHPC tests.

@reuterbal
Copy link
Collaborator

Fix for Loki PR'd: ecmwf-ifs/loki#157

@sbrdar
Copy link
Collaborator Author

sbrdar commented Sep 27, 2023

@reuterbal I think that the things are fine regarding the previous issues with the single precision and SerialBox. Something with nvhpc is problem, can you check that?

@reuterbal
Copy link
Collaborator

OK, sorry for letting this lie around for so long - Getting NVHPC builds and tests to reliably pass was challenging with the growing software stack complexity. I think now the CI setup is OK, though.
Please take a look if you agree, otherwise this is ready to be merged from my point of view.

Copy link
Collaborator

@wdeconinck wdeconinck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me! Thanks @reuterbal for patching up the testing! Thanks @sbrdar for this nice work!

@reuterbal reuterbal changed the title use Atlas structures New Atlas-based variant using BlockStructured FunctionSpace Dec 1, 2023
@reuterbal reuterbal merged commit 0a8f41c into develop Dec 4, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants