
Exec format error when running Optimas with HiPACE++ on Maxwell #235

lboult opened this issue Jul 24, 2024 · 8 comments

Comments


lboult commented Jul 24, 2024

Hi all,

I have been trying to do some simple grid scans with Optimas and HiPACE++ on Maxwell and am encountering an error that seems to be related to the submission script that Optimas (or libEnsemble?) generates, e.g.:

2024-07-24 11:06:59,278 libensemble.executors.mpi_executor (WARNING): task libe_task_sim_worker2_0 submit command failed on try 2 with error [Errno 8] Exec format error: './libe_task_sim_worker2_0_run.sh'

I checked, and this also seems to happen with the examples provided in the docs.

Here is an example of one of the submission scripts that are generated:

source /etc/profile.d/modules.sh     # make sure the module command is available

module purge
module load maxwell gcc/9.3 openmpi/4
module load maxwell cuda/11.8
module load hdf5/1.10.6
# pick correct GPU setting (this may differ for V100 nodes)
export GPUS_PER_SOCKET=2
export GPUS_PER_NODE=4
# optimize CUDA compilation for A100
export AMREX_CUDA_ARCH=8.0

export OMP_NUM_THREADS=1


mpirun -hosts max-mpag002 -np 2 --ppn 2 /home/lboulton/src/hipace/build/bin/hipace template_simulation_script

I tried submitting the same script as a single batch job, which also fails, but the error then suggests that --ppn isn't a valid parameter. Is this a bug, or perhaps something to do with having the wrong version of Open MPI? Or maybe even a peculiarity of Maxwell...

Any advice is appreciated!

Cheers,
Lewis


shuds13 commented Jul 24, 2024

@lboult

I think it is most likely picking up MPICH, while your subprocess is using Open MPI (assuming you are using an env_script). The quickest way to force it is probably to specify Open MPI explicitly.

In libEnsemble this would be either:

exctr = MPIExecutor(custom_info={"mpi_runner": "openmpi"})

or as a platform spec:

libE_specs["platform_specs"] = {
    "mpi_runner": "openmpi",
}

In Optimas, if your exploration object is called exp you can probably do:

exp.libE_specs["platform_specs"] = {
    "mpi_runner": "openmpi",
}


shuds13 commented Jul 24, 2024

Hang on, I just realised Optimas has a more direct option to set this in your calling script:

ev = TemplateEvaluator(
    env_mpi='openmpi',
)
If you already have this and it's failing, let me know.
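For context, here is a rough sketch of where that argument sits in a full Optimas calling script, following the pattern in the Optimas docs. The template path, executable path, analysis function, and parameter ranges below are hypothetical placeholders; only the env_mpi / env_script arguments are the point.

```python
# Hypothetical sketch of an Optimas calling script -- paths, parameter
# ranges, and the analysis function are placeholders, not from this thread.
from optimas.core import VaryingParameter, Objective
from optimas.generators import GridSamplingGenerator
from optimas.evaluators import TemplateEvaluator
from optimas.explorations import Exploration

var = VaryingParameter("plasma_density", 1e22, 1e23)  # placeholder range
obj = Objective("f", minimize=True)

gen = GridSamplingGenerator(
    varying_parameters=[var],
    objectives=[obj],
    n_steps=[8],
)
ev = TemplateEvaluator(
    sim_template="template_simulation_script",     # placeholder template
    analysis_func=lambda sim_dir, output: None,    # placeholder analysis
    executable="/path/to/hipace",                  # placeholder binary
    env_script="env_maxwell.sh",                   # the module-loading script
    env_mpi="openmpi",                             # force the Open MPI runner
)
exp = Exploration(generator=gen, evaluator=ev, max_evals=8, sim_workers=2)
exp.run()
```

The key point is that env_script alone only changes the environment inside the subprocess, while env_mpi tells libEnsemble's executor which MPI runner syntax to emit in the generated run script.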


lboult commented Jul 25, 2024

Hey,

Thanks for the fast response. You're right that the env_mpi='openmpi' argument is necessary; I use it now, but there still seems to be some issue that throws the same 'Exec format error'.

Here's what the submission file that Optimas generates looks like now:

source /etc/profile.d/modules.sh     # make sure the module command is available

module purge
module load maxwell gcc/9.3 openmpi/4
module load maxwell cuda/11.8
module load hdf5/1.10.6
# pick correct GPU setting (this may differ for V100 nodes)
export GPUS_PER_SOCKET=2
export GPUS_PER_NODE=4
# optimize CUDA compilation for A100
export AMREX_CUDA_ARCH=8.0

export OMP_NUM_THREADS=1


mpirun -machinefile machinefile_autogen_for_worker_2_task_0 -np 2 -npernode 2 /home/lboulton/src/hipace/build/bin/hipace template_simulation_script

The machine file that it refers to also seems to be generated okay (I think?)

Unfortunately, there's no indication in the .err or .out files of what the exact issue might be now... Let me know if I can provide more details.

Cheers,
Lewis


shuds13 commented Jul 26, 2024

What happens now if you run:

./libe_task_sim_worker2_0_run.sh

by itself?

You could try without the machinefile and/or in an interactive session, setting the node name in the machinefile to the node you're on. I wonder if on your system the file needs to start with something like:

#!/bin/bash

I would see if you can get that file to run on your system. Let me know what it gives you.

You could also see if sourcing the file makes a difference.
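For reference, "[Errno 8] Exec format error" (ENOEXEC) is exactly what the kernel returns when a script with no shebang line is exec'd directly, which is why adding #!/bin/bash or naming the interpreter explicitly fixes it. A minimal Python sketch of that failure mode, using a throwaway script name of my own:

```python
import os
import stat
import subprocess

# Write a shell script WITHOUT a shebang line (hypothetical file name).
path = "no_shebang_demo.sh"
with open(path, "w") as f:
    f.write("echo hello\n")
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)

# Executing it directly: execve() cannot identify an interpreter for the
# file, so it fails with ENOEXEC ("[Errno 8] Exec format error").
try:
    subprocess.run(["./" + path], check=True)
except OSError as e:
    print("errno:", e.errno)  # ENOEXEC is 8 on Linux

# Naming the interpreter explicitly (or adding '#!/bin/bash' as the
# first line of the script) avoids the problem.
result = subprocess.run(["/bin/bash", path], capture_output=True, text=True)
print(result.stdout.strip())
os.remove(path)
```

This matches the behaviour seen in the thread: the same script that fails when invoked directly runs fine once a shell is put in front of it.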


lboult commented Jul 29, 2024

Hey,

Sorry for the delayed reply; I was doing some investigating. Submitting the file as a batch job works fine (I submit specifically to the node mentioned in the machine file as well). But when I run interactively I get these errors:

/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /software/gcc/9.3.0/openmpi/4.0.4/lib/libmpi_cxx.so.40)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /software/gcc/9.3.0/openmpi/4.0.4/lib/libmpi_cxx.so.40)

So it seems like something dodgy is going on with the environments that I'm yet to understand?


shuds13 commented Jul 29, 2024

On some systems there are differences between running interactively and in batch, such as whether your .bashrc is sourced; this varies from one system to another. It seems hipace is picking up the wrong C++ standard library.

You could echo your LD_LIBRARY_PATH in batch and try to replicate it.

You said you had your original issue with the example script from the docs. Perhaps check whether that works now, to see if libEnsemble is still the issue.


shuds13 commented Aug 1, 2024

Should be fixed by Libensemble/libensemble#1392, which runs user scripts in a shell.


lboult commented Aug 5, 2024

Sorry for the slow reply; I was away for the last half of last week.

I've just installed the version of libEnsemble from the branch you referenced and can confirm that this seems to have solved my issue.

Thanks very much for your help :)
