e3sm-2.1 mvapich2-2.3.4 mpirun error #6520
gwkwak started this conversation in E3SM model help
Replies: 2 comments
-
run command is mpirun --machinefile /var/spool/torque/aux//57.e3sm00 -np 96 /home/build/e3sm.exe
-
Looks like it failed immediately with the first MPI command. Have you verified you can run any MPI program on that machine?
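To rule out the model itself, it may help to first confirm that a trivial MPI job runs across all three nodes under the same Torque allocation. A minimal sketch (the source file name is a placeholder; `$PBS_NODEFILE` is Torque's machinefile for the job):

```bash
# Build and run a bare-bones MPI hello world with the same MVAPICH2 stack.
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello_mpi.c -o hello_mpi

# Use the same launcher and machinefile as the failing run.
mpirun --machinefile "$PBS_NODEFILE" -np 96 ./hello_mpi
```

If this also aborts inside PMPI calls, the problem is in the MVAPICH2/InfiniBand stack rather than in E3SM.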
-
System info:
OS = CentOS 7.9
compilers = Intel 19, 20, 22
anaconda3 = 2023.09
mvapich = mvapich2-2.3.4
InfiniBand EDR, OFED = 4.9-7.1.0.0
libraries = HDF5 1.10.5 (parallel build), PnetCDF 1.11.2, netCDF-C 4.6.1 (parallel build), netCDF-Fortran 4.4.5
cores per node = 32, nodes = 3, total cores = 96
Modified cime/CIME/XML/env_mach_pes.py: value = -3 * value * max_mpitasks_per_node
Modified cime_config/machines/config_machines.xml: <MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
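To confirm what CIME actually resolved from those edits, the stock case tools can be queried from the case directory (xmlquery and preview_run ship with CIME; the path below matches the test case):

```bash
# Inspect the task/node values CIME resolved for this case.
cd /home/tests/XS_2x5_ndays/case_scripts
./xmlquery MAX_TASKS_PER_NODE,MAX_MPITASKS_PER_NODE
./xmlquery NTASKS          # MPI tasks per component

# Show the exact mpirun command case.submit will issue.
./preview_run
```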
PBS Torque = 2.5.12
added to .bashrc: ulimit -S unlimited; ulimit -s unlimited
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1029554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1029554
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
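Since the launcher spawns ranks on remote nodes, the .bashrc ulimit settings only matter if they actually take effect on every compute node. A quick check, one rank per node (assumes the job's $PBS_NODEFILE as machinefile):

```bash
# Print the stack limit each remote MPI rank actually inherits.
mpirun --machinefile "$PBS_NODEFILE" -np 3 \
    bash -c 'echo "$(hostname): stack=$(ulimit -s)"'
```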
Run script: run_e3sm.template.sh
# TO DO:
#  - custom pelayout

main() {

# For debugging, uncomment line below
#set -x
set -x

# --- Configuration flags ----

# Machine and project
readonly MACHINE=xxxx
readonly PROJECT="e3sm"

# Simulation
readonly COMPSET="WCYCL1850"
readonly RESOLUTION="ne30pg2_EC30to60E2r2"
# BEFORE RUNNING : CHANGE the following CASE_NAME to desired value
readonly CASE_NAME="${PROJECT}.${COMPSET}.${RESOLUTION}"
# If this is part of a simulation campaign, ask your group lead about using a case_group label
readonly CASE_GROUP=""

# Code and compilation
readonly CHECKOUT="20210702"
readonly BRANCH="master"
readonly CHERRY=( )
readonly DEBUG_COMPILE=false
#readonly DEBUG_COMPILE=true

# Run options
readonly MODEL_START_TYPE="initial"  # 'initial', 'continue', 'branch', 'hybrid'
readonly START_DATE="0001-01-01"

# Additional options for 'branch' and 'hybrid'
readonly GET_REFCASE=FALSE

# Set paths
readonly CODE_ROOT="${HOME}/model/E3SM-2.1.0"
readonly CASE_ROOT="${HOME}/Model/${CASE_NAME}/${CHECKOUT}"

# Sub-directories
readonly CASE_BUILD_DIR=${CASE_ROOT}/build
readonly CASE_ARCHIVE_DIR=${CASE_ROOT}/archive

# Define type of run
#  short tests: 'XS_2x5_ndays', 'XS_1x10_ndays', 'S_1x10_ndays',
#               'M_1x10_ndays', 'M2_1x10_ndays', 'M80_1x10_ndays', 'L_1x10_ndays'
#  or 'production' for full simulation
readonly run='XS_2x5_ndays'
if [ "${run}" != "production" ]; then
Error output:
conf/eamconf/chem_mech.in -> /home/tests/XS_2x5_ndays/case_scripts/CaseDocs
2024-07-22 13:03:14 NAMELIST CREATION HAS FINISHED
2024-07-22 13:03:14 PRE_RUN_CHECK HAS FINISHED
run command is mpirun --machinefile /var/spool/torque/aux//57.e3sm00 -np 36 /home/build/e3sm.exe
2024-07-22 13:03:14 SAVE_PRERUN_PROVENANCE BEGINS HERE
Deprecated "arg" node detected in /home/tests/XS_2x5_ndays/case_scripts/env_batch.xml, check files /home/cime_config/machines/config_batch.xml
copying /home/tests/XS_2x5_ndays/run/preview_run.log -> /home/tests/XS_2x5_ndays/run/preview_run.log.57.e3sm00.240722-130314
2024-07-22 13:03:14 SAVE_PRERUN_PROVENANCE HAS FINISHED
2024-07-22 13:03:14 MODEL EXECUTION BEGINS HERE
2024-07-22 13:03:15 MODEL EXECUTION HAS FINISHED
ERROR: RUN FAIL: Command 'mpirun --machinefile /var/spool/torque/aux//57.e3sm00 -np 36 /home/build/e3sm.exe
See log file for details: /home/tests/XS_2x5_ndays/run/e3sm.log.57.e3sm00.240722-13031
Contents of e3sm.log.57.e3sm00.240722-13031:
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 20
(t_initf) profile_detail_limit= 12
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
[cli_0]: aborting job:
Fatal error in PMPI_Waitall:
Other MPI error, error stack:
PMPI_Waitall(419).................: MPI_Waitall(count=35, req_array=0x7ffc43ce90a0, status_array=0x1) failed
MPIR_Waitall_impl(248)............:
MPIDI_CH3I_Progress(284)..........:
handle_read(1349).................:
handle_read_individual(1407)......:
MPIDI_CH3I_MRAIL_Parse_header(404): Control shouldn't reach here in prototype, header %d
(errno 101)
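This abort happens in MVAPICH2's InfiniBand channel (the MRAIL layer) before the model does any real work, which again points at the MPI/fabric stack rather than E3SM itself. Some first checks, hedged against the MVAPICH2 2.3.4 user guide for the exact MV2_* semantics:

```bash
# Verify the HCA and link state on each node (OFED tools).
ibv_devinfo | grep -E 'hca_id|state'
ibstat | grep -E 'State|Rate'

# Confirm the executable links against the intended MVAPICH2 build.
mpiname -a                          # MVAPICH2 version/configure report
ldd /home/build/e3sm.exe | grep -i mpi

# Re-run with extra MVAPICH2 diagnostics.
MV2_SHOW_ENV_INFO=1 MV2_DEBUG_SHOW_BACKTRACE=1 \
    mpirun --machinefile "$PBS_NODEFILE" -np 36 /home/build/e3sm.exe
```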