
prov/shm: Add unmap_region function #10364

Open · wants to merge 1 commit into main
Conversation

zachdworkin
Contributor

This function is mainly for the niche case where, during progress_connreq, a peer is added to the map and its region is mapped, and only after mapping it do we discover that the newly mapped peer's process has died. In that case we need to unmap the region and free any resources that were opened for communicating with the peer.

With this function in place, smr_map_del can be reworked to use it as common code. This requires changes to smr_av.c where smr_map_del is called. smr_map_del is now usable as an iteration callback, which lets smr_map_cleanup use ofi_rbmap_foreach and only clean up peers that actually exist.
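For illustration, a minimal sketch of the failure path described above; the helper names and signatures (smr_map_to_region, smr_peer_is_alive, smr_unmap_region as used here) are assumptions chosen to show the flow, not the actual patch:

/*
 * Sketch of the progress_connreq failure path: a peer was just added to
 * the map and its region mapped, but its process has since died.
 */
static int smr_handle_new_peer_sketch(struct smr_map *map, int64_t id)
{
	int ret;

	ret = smr_map_to_region(map, id);	/* map the peer's shm region */
	if (ret)
		return ret;

	if (!smr_peer_is_alive(map, id)) {	/* hypothetical liveness check */
		/* peer died after mapping: unmap and free its resources */
		smr_unmap_region(map, id);
		return -FI_ENOENT;
	}
	return FI_SUCCESS;
}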

Resolved review threads (some outdated): prov/shm/src/smr_av.c, prov/shm/src/smr_util.c
@shijin-aws
Contributor

bot:aws:retest

node = ofi_rbmap_find(&smr_av->smr_map.rbmap,
&smr_av->smr_map.peers[shm_id].peer.name);
assert(node);
ret = smr_map_del(&smr_av->smr_map, node);
Contributor


Now that both places where you call smr_map_del do a find and pass in the node, let's have the call take the id instead and do the find in there. You can also use the rbmap_find_delete call to simplify.
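One possible shape of that suggestion, as a sketch: the rbmap accesses mirror the snippet above, but the function body and the choice to keep find + delete as separate calls (rather than the suggested rbmap_find_delete) are assumptions for illustration only:

/* Sketch: smr_map_del() takes the id and does the lookup itself, so
 * callers no longer need to find the node first. Cleanup body elided. */
static void smr_map_del_sketch(struct smr_map *map, int64_t id)
{
	struct ofi_rbnode *node;

	node = ofi_rbmap_find(&map->rbmap, &map->peers[id].peer.name);
	if (!node)
		return;

	/* ... unmap the peer's region and release its resources ... */
	ofi_rbmap_delete(&map->rbmap, node);
}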

This function is mainly for the niche case where, during
progress_connreq, a peer is added to the map and its region is mapped,
and only after mapping it do we discover that the newly mapped peer's
process has died. In that case we need to unmap the region and free
any resources that were opened for communicating with the peer.

With this function in place, smr_map_del can be reworked to use it as
common code. This requires changes to smr_av.c where smr_map_del is
called. smr_map_del is now usable as an iteration callback, which lets
smr_map_cleanup use ofi_rbmap_foreach and only clean up peers that
actually exist.

Remove the lock from the map_to_region and unmap_region functions and
require callers to acquire it before calling them. This is necessary
because on the AV removal path the map would be double locked if these
functions also locked it themselves. The map_to_region function is
updated to mirror this policy.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
@zachdworkin
Contributor Author

@shijin-aws what's the AWS failure?

@shijin-aws
Contributor

Lots of failures in the Open MPI RMA tests; one example is:

--------------------------------- Captured Err ---------------------------------
2024-09-25 20:25:14,589 - INFO - utils - Running on 2 nodes, 36 processes per node
2024-09-25 20:25:14,589 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.70.196 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/tx_bytes"
2024-09-25 20:25:14,817 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.70.196 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/rx_bytes"
2024-09-25 20:25:15,074 - INFO - utils - Executing command: export PATH=/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin:$PATH;export LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib;/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin/mpirun --wdir . -n 2 --hostfile /home/ec2-user/PortaFiducia/hostfile --map-by ppr:1:node --timeout 1800 -x LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib -x PATH  /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency   2>&1 | tee /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/run/one-sided/osu_get_acc_latency/node2-ppn1.txt
2024-09-25 20:27:23,143 - INFO - utils - mpirun output:
# OSU MPI_Get_accumulate latency Test v7.0-lrbison3
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
1                     115.87
2                     115.86
4                     115.92
8                     115.92
16                    116.45
32                    116.40
64                    116.48
128                   116.48
256                   116.81
512                   117.18
1024                  118.10
2048                  119.34
4096                  121.69
8192                  164.23
16384                 250.02
32768                 423.76
65536                 840.90
131072               1673.14
262144               3347.70
524288               6700.41
1048576             13422.45
2097152             26848.66
4194304             53889.38
[ip-172-31-70-196:09307] *** Process received signal ***
[ip-172-31-70-196:09307] Signal: Segmentation fault (11)
[ip-172-31-70-196:09307] Signal code: Address not mapped (1)
[ip-172-31-70-196:09307] Failing at address: 0x7fd84d852048
[ip-172-31-70-196:09307] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7fd860e488e0]
[ip-172-31-70-196:09307] [ 1] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd241)[0x7fd84fb64241]
[ip-172-31-70-196:09307] [ 2] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd4c6)[0x7fd84fb644c6]
[ip-172-31-70-196:09307] [ 3] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbb6a8)[0x7fd84fb626a8]
[ip-172-31-70-196:09307] [ 4] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x69fcf)[0x7fd84fb10fcf]
[ip-172-31-70-196:09307] [ 5] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a6ce)[0x7fd84fb116ce]
[ip-172-31-70-196:09307] [ 6] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a7a5)[0x7fd84fb117a5]
[ip-172-31-70-196:09307] [ 7] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_finalize+0xbd)[0x7fd85415419d]
[ip-172-31-70-196:09307] [ 8] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(+0x6fee4)[0x7fd86059aee4]
[ip-172-31-70-196:09307] [ 9] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7fd860582ecd]
[ip-172-31-70-196:09307] [10] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7fd860582ecd]
[ip-172-31-70-196:09307] [11] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libmpi.so.40(ompi_mpi_finalize+0x749)[0x7fd8610a6819]
[ip-172-31-70-196:09307] [12] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x4029dc]
[ip-172-31-70-196:09307] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7fd860aab13a]
[ip-172-31-70-196:09307] [14] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x402ada]
[ip-172-31-70-196:09307] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-70-196 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@shijin-aws
Contributor

I can try reproducing locally to get more logs, but it seems to me the segfault happened in one of the locking steps.

@zachdworkin
Contributor Author

I can try reproducing locally to get more logs, but it seems to me the segfault happened in one of the locking steps.

I can test it locally. We don't test OMPI in our CI, so those tests weren't run on our end.

@shijin-aws
Contributor

Can any provider other than efa access shm inside an Open MPI run right now? Open MPI cannot run with shm by itself, since shm doesn't support FI_REMOTE_COMM.

@zachdworkin
Contributor Author

I have a "hack" for OMPI that forces it to run shm without needing another provider: I just remove that dependency and then run the tests that don't need it. I'm pretty sure these failures are all related, so fixing one of those tests will likely fix most or all of them.

@shijin-aws
Contributor

Thanks. The interesting thing is that the error only happens for our inter-node tests; in that case we only touch shm during control interface operations (av_insert, ep open, mr reg), with no data transfer.

@zachdworkin
Contributor Author

Do you have a stack trace showing the av_insert path? I removed the lock inside map_to_region and pushed it out to the callers, but I only moved the locks in prov/shm. I need to take a look at where it is called in prov/efa; that's likely the bug.
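The caller-side locking policy being described would look roughly like this. This is a sketch only: the map lock field, its ofi_spin_t type, and the helper signature are assumptions used to illustrate the policy; any prov/efa call site still relying on the old internal locking would need the same treatment.

/* Sketch: the caller now owns the map lock around the (lock-free)
 * map_to_region call; unmap_region would follow the same rule. */
static int smr_map_peer_locked_sketch(struct smr_map *map, int64_t id)
{
	int ret;

	ofi_spin_lock(&map->lock);		/* lock taken by the caller */
	ret = smr_map_to_region(map, id);	/* no longer locks internally */
	ofi_spin_unlock(&map->lock);
	return ret;
}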

@shijin-aws
Contributor

The bug is triggered at the MPI_Finalize stage, so I think it's in the cleanup path.

@zachdworkin
Contributor Author

I wonder if there is a path we don't test in Intel CI that removes nodes from the rbtree after they were already removed by the updated call. I'll take a look there.

@zachdworkin
Contributor Author

@shijin-aws I have been unable to reproduce the same bug locally. Do you know which call is causing the segfault? I assume it's in smr_ep_close.

I also found a missed lock in smr_ep.c: the call at line 227 needs to be locked before and unlocked afterwards.
