On a Rocky 8.7 system with Intel Cascade Lake CPUs, the sanity check of IMPI/2021.10.0 fails:
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1548)..............:
MPIDI_OFI_mpi_init_hook(1662):
get_ep_names(419)............: OFI get endpoint name failed (ofi_init.c:419:get_ep_names:Invalid argument)
I traced this down to $UCX_TLS=all being set. If I unset only this variable, the check passes.
The same failure occurs when I run (at least) this hello world program manually with that module loaded (after forcing the sanity check to be skipped, for testing).
This is in contrast to the comment in the easyblock:
# set environment variable UCX_TLS to 'all', this works in all hardware configurations
# needed with UCX regardless of the transports available (even without a Mellanox HCA)
# more information in easybuilders/easybuild-easyblocks#2253
While there is a way to set it to a different value in the easyconfig (modextravars takes precedence), there is no way to NOT set it. Hence we might need to rethink that.
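For illustration, overriding the value from an easyconfig might look like the fragment below. The transport list is only a placeholder assumption, not a tested recommendation, and this approach can only replace the value, not remove the variable.

```python
# Easyconfig fragment (easyconfigs use Python syntax).
# modextravars takes precedence over the value set by the impi easyblock.
# NOTE: 'self,sm,tcp' is an illustrative placeholder, not a tested value.
modextravars = {
    'UCX_TLS': 'self,sm,tcp',
}
```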
I'll do some more tests with the IMPI versions we have access to next week.
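The comparison described above (same test binary, with and without UCX_TLS) could be scripted roughly as follows. This is a minimal sketch: the helper names are made up for this example, and it assumes the impi module is already loaded so that mpirun is on the PATH.

```python
import os
import subprocess


def env_without(name, base=None):
    """Return a copy of the environment with variable `name` removed."""
    env = dict(os.environ if base is None else base)
    env.pop(name, None)  # no error if the variable is not set
    return env


def run_mpi_test(binary, nranks=2, unset_ucx_tls=False):
    """Run `mpirun -n <nranks> <binary>` and return its exit code.

    With unset_ucx_tls=True, only UCX_TLS is dropped from the environment;
    everything else set by the module is left untouched.
    """
    env = env_without('UCX_TLS') if unset_ucx_tls else dict(os.environ)
    result = subprocess.run(['mpirun', '-n', str(nranks), binary], env=env)
    return result.returncode


# Example usage (the path to the compiled hello world binary is up to you):
#   run_mpi_test('./mpi_test')                      # UCX_TLS as set by the module
#   run_mpi_test('./mpi_test', unset_ucx_tls=True)  # UCX_TLS removed
```

A non-zero exit code only in the first call would match the behaviour reported here.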
If our assumption about UCX_TLS is wrong, would it be sufficient to add an easy way to use a different value, for example via a custom easyconfig parameter for the impi easyblock?
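One possible shape for such a parameter, sketched as plain Python rather than as actual easyblock code: the parameter name `ucx_tls` and the helper function are hypothetical, and the convention that None means "do not set the variable" is an assumption of this sketch.

```python
def ucx_tls_module_vars(ucx_tls='all'):
    """Return the extra environment variables the module would define.

    `ucx_tls` stands in for a hypothetical custom easyconfig parameter:
    a string is used as the value of UCX_TLS (default 'all', matching the
    current behaviour), while None means UCX_TLS is not set at all.
    """
    if ucx_tls is None:
        return {}
    return {'UCX_TLS': ucx_tls}
```

Keeping 'all' as the default would leave existing easyconfigs unchanged, while sites that hit the failure could opt out or pick another value.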
On a RHEL 8.9 system with "Intel(R) Xeon(R) Platinum 8470" or "AMD EPYC 7702", I don't see an issue with any of the 14 impi ECs I tested on each.
I only see it on a RHEL 8.7 system with "Intel(R) Xeon(R) Platinum 8276M".
Of those 14 ECs, 13 succeed when I unset UCX_TLS in the sanity check step. The last one (impi-2019.9.304-iccifortcuda-2020b.eb) gives me a different, strange error:
$ mpirun -n 32 /dev/shm/easybuild-tmp/eb-l9s3j6ko/tmpjum3rdnw/mpi_test
ucp_worker.c:1835 UCX ERROR too many ep configurations: 16 (max: 16)
Abort(1091215) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1139)..............:
MPIDI_OFI_mpi_init_hook(1728): OFI get address vector map failed
That happens when using more than 16 processes (mpirun -n 16 works) and seems to be a known issue in UCX < 1.11.
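As a rough guard, the two conditions from this observation (UCX older than 1.11 and more than 16 processes) can be combined into a small check. This is only a sketch under those assumptions: the function names are made up, the thresholds are taken from this report rather than from UCX documentation, and the version string has to be supplied by the caller.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.10.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split('.'))


def may_hit_ep_config_limit(ucx_version, nranks, limit=16, fixed_in='1.11'):
    """Return True if a run matches the failing pattern reported above.

    Assumption (from this thread, not verified against UCX itself):
    UCX versions before `fixed_in` fail once more than `limit` processes
    are started on the node.
    """
    too_old = parse_version(ucx_version) < parse_version(fixed_in)
    return too_old and nranks > limit
```

For example, `may_hit_ep_config_limit('1.10.1', 32)` flags the failing case above, while the same call with 16 processes does not.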
Easyblock comment referenced in the report: easybuild-easyblocks/easybuild/easyblocks/i/impi.py, lines 387 to 389 in 064e3e2.