large scale run failure with gt4py cpu backend #93

Open
xyuan opened this issue Aug 29, 2024 · 0 comments
xyuan commented Aug 29, 2024

When we scale out the gt4py CPU backend using the pace/example test case to up to 384 MPI ranks on Gaea, with or without OpenMP support, the run crashes with the following errors:

/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/cpu_kfirst.hpp(78): error: no instance of overloaded function "gridtools::sid::shift" matches the argument list
argument types are: (ptr_diff_t, gridtools::sid::default_stride, )
sid::shift(
^
/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/frontend/../../sid/concept.hpp(658): note: this candidate was rejected because at least one template argument could not be deduced
using concept_impl_::shift;

With OpenMP support, we get the following error:

/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/cpu_ifirst/loops.hpp(131): warning #16219: Some OpenMP processing was skipped to constrain compile time. Consider overriding limits (-qoverride-limits).
srun: error: c5n0890: task 90: Exited with exit code 1
srun: error: c5n1563: task 260: Killed
srun: error: c5n0890: task 111: Exited with exit code 1
srun: error: c5n1563: tasks 279,282: Killed
srun: error: c5n0890: task 98: Exited with exit code 1
srun: error: c5n1563: tasks 258,261,264,267,272,274,287,294,297: Killed
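To make the failure pattern easier to read, here is a minimal sketch that groups the `srun` error lines above by node and exit status. The helper names (`expand_tasks`, `group_failures`) are hypothetical and not part of pace or gt4py, and the range syntax (`10-12`) is handled only as an assumption about how `srun` may abbreviate task lists.

```python
import re
from collections import defaultdict

# Matches lines such as "srun: error: c5n1563: tasks 279,282: Killed"
SRUN_LINE = re.compile(
    r"srun: error: (?P<node>\S+): tasks? (?P<tasks>[\d,\-]+): (?P<status>.+)"
)


def expand_tasks(spec):
    """Expand a task spec such as '279,282' or '10-12' into a list of ints."""
    tasks = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            tasks.extend(range(int(lo), int(hi) + 1))
        else:
            tasks.append(int(part))
    return tasks


def group_failures(log_text):
    """Return {(node, status): sorted task ids} for every srun error line."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = SRUN_LINE.match(line.strip())
        if match:
            key = (match["node"], match["status"].strip())
            grouped[key].extend(expand_tasks(match["tasks"]))
    return {key: sorted(tasks) for key, tasks in grouped.items()}


# The srun lines copied from the log in this issue.
LOG = """\
srun: error: c5n0890: task 90: Exited with exit code 1
srun: error: c5n1563: task 260: Killed
srun: error: c5n0890: task 111: Exited with exit code 1
srun: error: c5n1563: tasks 279,282: Killed
srun: error: c5n0890: task 98: Exited with exit code 1
srun: error: c5n1563: tasks 258,261,264,267,272,274,287,294,297: Killed
"""

summary = group_failures(LOG)
for (node, status), tasks in sorted(summary.items()):
    print(f"{node}: {status}: {len(tasks)} task(s) {tasks}")
```

In this excerpt, every task that exited with code 1 is on c5n0890 and every task that was killed is on c5n1563. Whether the kills are a side effect of the first failures or a separate problem on that node (for example, memory pressure) is not something this log confirms.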

However, the same case runs fine with the DaCe backend.

System environment:
The following modules were loaded for the test:
(base) Xingqiu.Yuan@gaea56:/gpfs/f5/gfdl_f/scratch/Xingqiu.Yuan/pace> module list

Currently Loaded Modules:

  1) craype-x86-rome                         7) cray-mpich/8.1.25      13) TimeZoneEDT/default      19) uberftp/2_8          25) cray-netcdf/4.9.0.3
  2) craype-network-ofi                      8) cray-libsci/23.02.1.1  14) DefApps/default          20) gcp/2.3              26) intel-oneapi/2023.1.0
  3) perftools-base/23.03.0                  9) PrgEnv-intel/8.3.3     15) nccmp/1.9.0.1            21) hsm/1.3.0            27) boost/1.79.0
  4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta  10) cray-pmi/6.1.10        16) nco/5.0.1                22) perlbrew/5.28.0
  5) craype/2.7.20                          11) darshan-runtime/3.4.0  17) fre-nctools/2024.03      23) fre/bronx-22
  6) cray-dsmml/0.2.2                       12) CmrsEnv/default        18) gridcf-gct/6.2.20220524  24) cray-hdf5/1.12.2.3

When we switch to the GCC compiler, we get a similar error.

@xyuan xyuan self-assigned this Aug 29, 2024
@xyuan xyuan added the bug Something isn't working label Aug 29, 2024