Failing IMPI sanity check due to UCX_TLS #3386

Open
Flamefire opened this issue Jul 8, 2024 · 2 comments

@Flamefire
Contributor

On a Rocky 8.7 system with Intel Cascade Lake CPUs, the sanity check of IMPI/2021.10.0 fails:

Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1548)..............:
MPIDI_OFI_mpi_init_hook(1662):
get_ep_names(419)............: OFI get endpoint name failed (ofi_init.c:419:get_ep_names:Invalid argument)

I traced this down to $UCX_TLS=all being set. If I unset only this variable, the check passes.

The same failure occurs when running (at least) this hello world program manually with that module loaded (installed by forcing the sanity check to be skipped, for testing).
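
A minimal sketch of that comparison, for anyone who wants to repeat it on their own system. The helper names `env_without` and `run_both_ways` are made up for illustration, and the command to run (for example `mpirun -n 2 ./mpi_test`) has to be supplied by the caller:

```python
import os
import subprocess


def env_without(name, base=None):
    """Return a copy of the environment with only `name` removed."""
    env = dict(os.environ if base is None else base)
    env.pop(name, None)
    return env


def run_both_ways(cmd, name="UCX_TLS", value="all"):
    """Run `cmd` once with name=value set and once with name unset.

    Returns a dict mapping 'set' / 'unset' to the exit code of `cmd`.
    """
    without_var = env_without(name)
    with_var = dict(without_var, **{name: value})
    return {
        "set": subprocess.run(cmd, env=with_var).returncode,
        "unset": subprocess.run(cmd, env=without_var).returncode,
    }

# Hypothetical usage, with the impi module already loaded:
#   run_both_ways(["mpirun", "-n", "2", "./mpi_test"])
```

All other environment variables are passed through unchanged, so a difference between the two exit codes points at `UCX_TLS` alone.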

This contradicts the comment in the easyblock:

# set environment variable UCX_TLS to 'all', this works in all hardware configurations
# needed with UCX regardless of the transports available (even without a Mellanox HCA)
# more information in easybuilders/easybuild-easyblocks#2253

While there is a way to set it to a different value in the easyconfig (modextravars takes precedence), there is no way to NOT set it. Hence we might need to rethink that.
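
For reference, the override via the easyconfig looks roughly like the fragment below. The value `tcp` is only a placeholder chosen for illustration, not a recommendation, and the rest of the easyconfig is omitted:

```python
# Fragment of an impi easyconfig (easyconfigs use Python syntax).
# Entries in modextravars end up as environment variables in the generated
# module file and, as noted above, take precedence over the UCX_TLS value
# that the easyblock sets.
modextravars = {
    # Placeholder value for illustration only; pick the transports that
    # match the actual hardware.
    'UCX_TLS': 'tcp',
}
```

This replaces the value but still defines the variable, which is exactly the limitation described above.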

I'll do some more tests with the IMPI versions we have access to next week.

boegel added this to the 4.x milestone on Jul 31, 2024
@boegel
Member

boegel commented Jul 31, 2024

@Flamefire Any updates here?

If our assumption about UCX_TLS is wrong, would it be sufficient to add an easy way to use a different value, for example via a custom easyconfig parameter for the impi easyblock?
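
A rough sketch of what such a parameter could look like. Everything here is hypothetical: neither the function name `ucx_tls_module_vars` nor a `ucx_tls` parameter exists in the impi easyblock today. The idea is that the default keeps the current behaviour, while `None` or an empty string means "do not set the variable at all":

```python
def ucx_tls_module_vars(ucx_tls='all'):
    """Return the extra module environment variables for UCX_TLS.

    `ucx_tls` stands in for a hypothetical custom easyconfig parameter:
    a non-empty string sets UCX_TLS to that value, while None or ''
    leaves the variable undefined in the module.
    """
    if not ucx_tls:
        return {}
    return {'UCX_TLS': ucx_tls}
```

With this shape, existing easyconfigs would be unaffected and a site that hits the failure could opt out with `ucx_tls = None`.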

@Flamefire
Contributor Author

I did a larger test run:

The 7 impi-2018* ECs fail in the sanity check with an unrelated segfault.

On a RHEL 8.9 system with "Intel(R) Xeon(R) Platinum 8470" or "AMD EPYC 7702" CPUs, I don't see the issue in any of the 14 impi ECs I tested on each.

I only see it on a RHEL 8.7 system with "Intel(R) Xeon(R) Platinum 8276M" CPUs.
Of those 14 ECs, 13 succeed when I unset UCX_TLS in the sanity check step. The remaining one (impi-2019.9.304-iccifortcuda-2020b.eb) fails with a different, strange error:

$ mpirun -n 32 /dev/shm/easybuild-tmp/eb-l9s3j6ko/tmpjum3rdnw/mpi_test
ucp_worker.c:1835 UCX  ERROR too many ep configurations: 16 (max: 16)
Abort(1091215) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........: 
MPID_Init(1139)..............: 
MPIDI_OFI_mpi_init_hook(1728): OFI get address vector map failed

That happens when using more than 16 processes (mpirun -n 16 works) and seems to be a known issue in UCX < 1.11.
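
The observation above can be written down as a small check. This is only a heuristic built from what I saw on this one system (more than 16 processes, UCX older than 1.11); the function names are made up and the thresholds are assumptions, not documented UCX limits:

```python
def parse_version(text):
    """Turn a dotted version such as '1.10.1' into (1, 10, 1)."""
    return tuple(int(part) for part in text.split('.'))


def may_hit_ep_config_limit(ucx_version, nprocs, limit=16, fixed_in='1.11'):
    """Guess whether 'too many ep configurations' could show up.

    Assumes the error needs more than `limit` processes and a UCX
    version older than `fixed_in`; both defaults come from the single
    observation in this comment.
    """
    too_old = parse_version(ucx_version) < parse_version(fixed_in)
    return nprocs > limit and too_old
```

Versions are compared as integer tuples so that 1.9 sorts before 1.11.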

I also tried setting $UCX_TLS to each individual value from the list in https://ucx-py.readthedocs.io/en/latest/configuration.html#ucx-tls. None of them reproduced the issue; only "all" does.
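
The sweep was along these lines. This is a sketch rather than the exact script: `sweep_ucx_tls` is a made-up helper, and both the command and the list of transport values are left to the caller:

```python
import os
import subprocess


def sweep_ucx_tls(cmd, values):
    """Run `cmd` once per UCX_TLS value and return {value: exit code}."""
    results = {}
    for value in values:
        # Copy the current environment and override only UCX_TLS.
        env = dict(os.environ, UCX_TLS=value)
        results[value] = subprocess.run(cmd, env=env).returncode
    return results

# Hypothetical usage; the transport names are examples only:
#   sweep_ucx_tls(["mpirun", "-n", "2", "./mpi_test"], ["all", "tcp", "sm"])
```

A non-zero exit code only for the `all` entry would match what I observed.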
