
prov/shm: Add unmap_region function #10364

Open · wants to merge 1 commit into main
Conversation

zachdworkin
Contributor

This function is mainly for the niche case where, during progress_connreq, a peer is added to the map and its region is mapped, and only after mapping it do we discover that the newly mapped peer's process has died. In that case we need to unmap the region and free any resources that were opened for communicating with the peer.

With this function in place, smr_map_del can be reworked to use it as common code. This requires changes to smr_av.c where smr_map_del is called. smr_map_del is now usable as an iteration callback, which lets smr_map_cleanup use ofi_rbmap_foreach and only clean up peers that actually exist.
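For illustration, a minimal sketch of the failure path described above; the helper names and signatures (smr_map_to_region, smr_peer_is_alive, smr_unmap_region as used here) are assumptions chosen to show the flow, not the actual patch:

/*
 * Sketch of the progress_connreq failure path: a peer was just added to
 * the map and its region mapped, but its process has since died.
 */
static int smr_handle_new_peer_sketch(struct smr_map *map, int64_t id)
{
	int ret;

	ret = smr_map_to_region(map, id);	/* map the peer's shm region */
	if (ret)
		return ret;

	if (!smr_peer_is_alive(map, id)) {	/* hypothetical liveness check */
		/* peer died after mapping: unmap and free its resources */
		smr_unmap_region(map, id);
		return -FI_ENOENT;
	}
	return FI_SUCCESS;
}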

Resolved review threads (some outdated): prov/shm/src/smr_av.c, prov/shm/src/smr_util.c
@shijin-aws
Contributor

bot:aws:retest

node = ofi_rbmap_find(&smr_av->smr_map.rbmap,
&smr_av->smr_map.peers[shm_id].peer.name);
assert(node);
ret = smr_map_del(&smr_av->smr_map, node);
Contributor


Now that both places where you call smr_map_del do a find and pass in the node, let's have the call take the id instead and do the find in there. You can also use the rbmap_find_delete call to simplify.
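One possible shape of that suggestion, as a sketch: the rbmap accesses mirror the snippet above, but the function body and the choice to keep find + delete as separate calls (rather than the suggested rbmap_find_delete) are assumptions for illustration only:

/* Sketch: smr_map_del() takes the id and does the lookup itself, so
 * callers no longer need to find the node first. Cleanup body elided. */
static void smr_map_del_sketch(struct smr_map *map, int64_t id)
{
	struct ofi_rbnode *node;

	node = ofi_rbmap_find(&map->rbmap, &map->peers[id].peer.name);
	if (!node)
		return;

	/* ... unmap the peer's region and release its resources ... */
	ofi_rbmap_delete(&map->rbmap, node);
}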

This function is mainly for the niche case where, during
progress_connreq, a peer is added to the map and its region is mapped,
and only after mapping it do we discover that the newly mapped peer's
process has died. In that case we need to unmap the region and free
any resources that were opened for communicating with the peer.

With this function in place, smr_map_del can be reworked to use it as
common code. This requires changes to smr_av.c where smr_map_del is
called. smr_map_del is now usable as an iteration callback, which lets
smr_map_cleanup use ofi_rbmap_foreach and only clean up peers that
actually exist.

Remove the lock from the map_to_region and unmap_region functions and
require callers to acquire it before calling them. This is necessary
because on the AV removal path the map would be double locked if these
functions also locked it themselves. The map_to_region function is
updated to mirror this policy.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
@zachdworkin
Contributor Author

@shijin-aws what's the AWS failure?

@shijin-aws
Contributor

Lots of failures in the Open MPI RMA tests; one example is:

--------------------------------- Captured Err ---------------------------------
2024-09-25 20:25:14,589 - INFO - utils - Running on 2 nodes, 36 processes per node
2024-09-25 20:25:14,589 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.70.196 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/tx_bytes"
2024-09-25 20:25:14,817 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.70.196 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/rx_bytes"
2024-09-25 20:25:15,074 - INFO - utils - Executing command: export PATH=/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin:$PATH;export LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib;/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin/mpirun --wdir . -n 2 --hostfile /home/ec2-user/PortaFiducia/hostfile --map-by ppr:1:node --timeout 1800 -x LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib -x PATH  /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency   2>&1 | tee /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/run/one-sided/osu_get_acc_latency/node2-ppn1.txt
2024-09-25 20:27:23,143 - INFO - utils - mpirun output:
# OSU MPI_Get_accumulate latency Test v7.0-lrbison3
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
1                     115.87
2                     115.86
4                     115.92
8                     115.92
16                    116.45
32                    116.40
64                    116.48
128                   116.48
256                   116.81
512                   117.18
1024                  118.10
2048                  119.34
4096                  121.69
8192                  164.23
16384                 250.02
32768                 423.76
65536                 840.90
131072               1673.14
262144               3347.70
524288               6700.41
1048576             13422.45
2097152             26848.66
4194304             53889.38
[ip-172-31-70-196:09307] *** Process received signal ***
[ip-172-31-70-196:09307] Signal: Segmentation fault (11)
[ip-172-31-70-196:09307] Signal code: Address not mapped (1)
[ip-172-31-70-196:09307] Failing at address: 0x7fd84d852048
[ip-172-31-70-196:09307] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7fd860e488e0]
[ip-172-31-70-196:09307] [ 1] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd241)[0x7fd84fb64241]
[ip-172-31-70-196:09307] [ 2] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd4c6)[0x7fd84fb644c6]
[ip-172-31-70-196:09307] [ 3] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbb6a8)[0x7fd84fb626a8]
[ip-172-31-70-196:09307] [ 4] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x69fcf)[0x7fd84fb10fcf]
[ip-172-31-70-196:09307] [ 5] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a6ce)[0x7fd84fb116ce]
[ip-172-31-70-196:09307] [ 6] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a7a5)[0x7fd84fb117a5]
[ip-172-31-70-196:09307] [ 7] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_finalize+0xbd)[0x7fd85415419d]
[ip-172-31-70-196:09307] [ 8] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(+0x6fee4)[0x7fd86059aee4]
[ip-172-31-70-196:09307] [ 9] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7fd860582ecd]
[ip-172-31-70-196:09307] [10] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7fd860582ecd]
[ip-172-31-70-196:09307] [11] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libmpi.so.40(ompi_mpi_finalize+0x749)[0x7fd8610a6819]
[ip-172-31-70-196:09307] [12] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x4029dc]
[ip-172-31-70-196:09307] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7fd860aab13a]
[ip-172-31-70-196:09307] [14] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x402ada]
[ip-172-31-70-196:09307] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-70-196 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@shijin-aws
Contributor

I can try reproducing locally to get more logs, but it seems to me the segfault happened in one of the locking steps.

@zachdworkin
Contributor Author

I can try reproducing locally to get more logs, but it seems to me the segfault happened in one of the locking steps.

I can test it locally. We don't test OMPI in our CI, so those tests weren't run on our end.

@shijin-aws
Contributor

Can any provider other than efa access shm inside an Open MPI run right now? Open MPI cannot run with shm by itself, since shm doesn't support FI_REMOTE_COMM.

@zachdworkin
Contributor Author

I have a "hack" for OMPI that forces it to run shm without needing another provider: I just remove that dependency and then run the tests that don't need it. I'm pretty sure these failures are all related, so fixing one of those tests will likely fix most or all of them.

@shijin-aws
Contributor

Thanks. The interesting thing is that the error only happens for our inter-node tests; in that case we only touch shm during control interface operations (av_insert, ep open, mr reg), with no data transfer.

@zachdworkin
Contributor Author

Do you have a stack trace showing the av_insert path? I removed the lock inside map_to_region and pushed it out to the callers, but I only moved the locks in prov/shm. I need to take a look at where it is called in prov/efa; that's likely the bug.
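The caller-side locking policy being described would look roughly like this. This is a sketch only: the map lock field, its ofi_spin_t type, and the helper signature are assumptions used to illustrate the policy; any prov/efa call site still relying on the old internal locking would need the same treatment.

/* Sketch: the caller now owns the map lock around the (lock-free)
 * map_to_region call; unmap_region would follow the same rule. */
static int smr_map_peer_locked_sketch(struct smr_map *map, int64_t id)
{
	int ret;

	ofi_spin_lock(&map->lock);		/* lock taken by the caller */
	ret = smr_map_to_region(map, id);	/* no longer locks internally */
	ofi_spin_unlock(&map->lock);
	return ret;
}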

@shijin-aws
Contributor

The bug is triggered at the MPI_Finalize stage, so I think it's in the cleanup path.

@zachdworkin
Contributor Author

I wonder if there is a path we don't test in Intel CI that removes nodes from the rbtree after they were already removed by the updated call. I'll take a look there.

@zachdworkin
Contributor Author

@shijin-aws I have been unable to reproduce the same bug locally. Do you know which call is causing the segfault? I assume it's in smr_ep_close.

I also found a missed lock in smr_ep.c: the call at line 227 needs to be locked before and unlocked afterwards.
