Correctness failure when using BTL RDMA #3685

Closed
vspetrov opened this issue Jun 9, 2017 · 21 comments
@vspetrov

vspetrov commented Jun 9, 2017

Thank you for taking the time to submit an issue!

Background information

Possibly related to the "patcher" memory framework

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v2.0.x
v2.x
master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source, cloned from GitHub

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  • Computer hardware: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
  • Network type: Mellanox infiniband, CIB

Details of the problem

The multithreaded correctness test (attached: mt_stress.zip) fails with OMPI. Reproduced on 2 nodes.

shell$ mpirun -np 3 --map-by node -mca pml ob1 -mca btl openib,self   `nif 50`    -mca coll ^hcoll ./mt_stress 1497004394

...
Splitting id 124
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 2, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 0, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 1, count 15045, comm_size 3, color 1
Splitting id 125
Splitting id 126
...

This is an allreduce failure. After some debugging I narrowed it down to a single p2p exchange inside the allreduce. One rank sends the data to the other side, but the data is received corrupted for some reason.

The test passes with "-mca mpi_leave_pinned 0", or if Open MPI is built without memory-manager support (--without-memory-manager). This is why my suspicion falls on the "patcher" memory framework.
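
For illustration only (this is not the attached mt_stress test), here is a hedged sketch of the kind of send/free/realloc pattern that can expose a stale registration-cache entry when mpi_leave_pinned is enabled; the message size and iteration count below are made up:

```c
/* Hedged sketch only -- not the attached mt_stress test.  It shows the
 * send/free/realloc pattern that can expose a stale entry in the
 * registration cache when mpi_leave_pinned is enabled. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int iter = 0; iter < 1000; ++iter) {
        int count = 15045;                     /* large enough for the RDMA path */
        int *buf = malloc(count * sizeof(int));

        memset(buf, rank + 1, count * sizeof(int));

        if (0 == rank) {
            MPI_Send(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (1 == rank) {
            MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Freeing the buffer may hand its pages back to the kernel (e.g. via
         * madvise(MADV_DONTNEED) inside the allocator).  If the MPI library
         * does not observe that, a cached registration for this virtual
         * address range can survive and be reused for a later allocation
         * that lands at the same address -- producing corrupted receive
         * data like the errors shown above. */
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```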

Additionally, the same issue is observed with pml yalla (Mellanox MXM-based p2p). Again, disabling memory notifications (MXM_MEM_ON_DEMAND_MAP=n) helps.

Since "patcher" was not present in ompi_v1.10 i wanted to try test with that version. btl openib wouldn't work since it didn't support mpi_thread_multiple in 1.10 however, pml yalla works w/o errors with 1.10.

@vspetrov vspetrov added the bug label Jun 12, 2017
@jladd-mlnx
Member

@markalle @jjhursey @gpaulsen is this something you've observed in your testing?

@gpaulsen
Member

Yes, we're struggling with some issues around patcher right now as well on ppc64le. Still trying to root cause.

@jladd-mlnx
Member

Thanks, @gpaulsen. This was actually reported to us by an IBM test engineer who thought it was an HCOLL issue. @vspetrov's report is what we've been able to make of it: it's most likely patcher-related and not an HCOLL bug.

@alsrgv

alsrgv commented Jul 17, 2017

Thanks @vspetrov for reporting this issue!

I am seeing exactly the same issue with MPI_Allreduce, except that in my case all MPI calls are made from a single thread (the application is otherwise multi-threaded and uses jemalloc). I was also able to trace it down to a single p2p call where correct data is sent and corrupted data is received.

It reproduces with openib and does not reproduce with TCP. It also does not reproduce with -mca mpi_leave_pinned 0 or --without-memory-manager, as above.

Unfortunately, with -mca mpi_leave_pinned 0, openib performance is worse than TCP, which defeats the purpose of RDMA.

I also tried Open MPI 1.10.7, and it works correctly. Performance is not as good as 2.1.1 with the memory manager (which gives the corruption), but it is definitely better than 2.1.1 without pinned memory.

I have a ring-reduce implementation based on pure MPI_Send()/MPI_Recv() that works well and does not lead to corruption, but I'm worried that it may do so under certain circumstances. Are there any pointers on what's special about MPI_Allreduce() and why it corrupts the data?

@alsrgv

alsrgv commented Jul 18, 2017

UPDATE: I discovered that I can reproduce this issue with my ring-reduce implementation too, if I modify it to use a malloc'd buffer. It currently uses a pinned memory buffer and avoids the corruption.
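
For context, a hedged sketch of the contrast described above: a staging buffer obtained from MPI_Alloc_mem (which the library can keep registered for the buffer's lifetime) versus a plain malloc'd buffer whose pages may be returned to the kernel on free and later reused. The helper names and the use_pinned flag are illustrative, not part of the actual ring-reduce code:

```c
/* Sketch only: contrasting a malloc'd staging buffer with one obtained
 * from MPI_Alloc_mem.  The actual ring-reduce code is not shown here. */
#include <mpi.h>
#include <stdlib.h>

void *get_staging_buffer(size_t bytes, int use_pinned)
{
    if (use_pinned) {
        void *p = NULL;
        /* Open MPI can keep this allocation registered for its lifetime,
         * so the registration cache never sees the pages disappear. */
        MPI_Alloc_mem((MPI_Aint) bytes, MPI_INFO_NULL, &p);
        return p;
    }
    /* A plain malloc'd buffer may be freed and its pages returned to the
     * kernel between iterations, which is the case that reproduced the
     * corruption in this issue. */
    return malloc(bytes);
}

void put_staging_buffer(void *p, int use_pinned)
{
    if (use_pinned) {
        MPI_Free_mem(p);
    } else {
        free(p);
    }
}
```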

@jsquyres
Member

jsquyres commented Aug 1, 2017

Per discussion on 1 Aug Webex:

  • This issue was originally reported with MXM+patcher
  • @alsrgv reported it with openib (presumably +patcher)
  • IBM has seen similar problems in PAMI+patcher.

@vspetrov Does this happen with v3.0.x? (we're assuming you didn't test v3.0.x -- can you clarify?)

@hjelmn Can you chime in here? It's apparently reproducible with openib, and a test case was attached in the original description.

@jsquyres jsquyres added this to the v2.1.2 milestone Aug 1, 2017
@hjelmn
Member

hjelmn commented Aug 1, 2017

Probably a race that isn't triggered in the slower path used when mpi_leave_pinned is disabled. I can try to see whether this happens with Aries later this week.

@hppritcha
Member

@vspetrov could you check the v3.0.x branch?

@hjelmn
Member

hjelmn commented Aug 3, 2017

Nothing to do with the patcher. Looks like it might be a consistency problem with rcache/grdma. Investigating now.

@hjelmn
Member

hjelmn commented Aug 3, 2017

Think I have a fix. Testing now. Want to make sure there are no regressions and that the problem isn't just being masked.

@hjelmn hjelmn changed the title from "Multithread stress test correctness failures" to "Correctness failure when using BTL RDMA" Aug 4, 2017
@hjelmn
Member

hjelmn commented Aug 4, 2017

This bug is the result of a regression fix gone bad. I am working on a way to fix this without re-introducing the original bug. I am hoping to have a workaround today, with the long-term fix being a new registration cache.

Note, I changed the title to reflect that this is not a threading issue but a more general issue with how we ensure consistency in the registration cache.

@hjelmn
Member

hjelmn commented Aug 4, 2017

@bosilca The regression fix that introduced this was for a memory hook deadlock you identified. Can you provide the reproducer for the deadlock?

@bosilca
Member

bosilca commented Aug 4, 2017

I first got it with MADNESS but then I could reproduce it with any C++ multi-threaded application.

@hjelmn
Member

hjelmn commented Aug 4, 2017

@bosilca Ok, I think the attached app may be sufficient to trigger it as well. I am working on a patch that uses the deadlock "plan B", which uses a free list for the vma items. It seems to be working and is probably a good enough solution until I finish the new registration cache. I will post the PR shortly for testing.

hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 4, 2017
The current VMA cache implementation backing rcache/grdma can run into
a deadlock situation in multi-threaded code when madvise is hooked and
the C library uses locks. In this case we may run into the following
situation:

Thread 1:

    ...
    free ()           <- Holding libc lock
    madvise_hook ()
    vma_iteration ()  <- Blocked waiting for vma lock

Thread 2:
    ...
    vma_insert ()     <- Holding vma lock
    vma_item_new ()
    malloc ()         <- Blocked waiting for libc lock

To fix this problem we chose to remove the madvise () hook, but that
fix is causing issue open-mpi#3685. This commit aims to greatly reduce the
chance that the deadlock will be hit by putting vma items into a free
list. This moves the allocation outside the vma lock. In general there
are a relatively small number of vma items, so the default is to
allocate 2048 vma items. This default is configurable, but it is likely
that the number is too large rather than too small.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
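
A generic sketch of the free-list idea described in the commit message above (illustrative names only, not the actual rcache/grdma code): pre-allocating the vma items means the insert path never calls malloc() while the vma lock is held, so it cannot form the hold-and-wait cycle shown in the two thread stacks.

```c
/* Illustrative only -- not the actual rcache/grdma code.  Items are
 * pre-allocated outside the lock; the insert path just pops one off the
 * list, so no malloc() can happen while vma_lock is held. */
#include <pthread.h>
#include <stdlib.h>

typedef struct vma_item {
    struct vma_item *next;
    /* ... registration/range bookkeeping would live here ... */
} vma_item_t;

static pthread_mutex_t vma_lock = PTHREAD_MUTEX_INITIALIZER;
static vma_item_t *vma_free_list = NULL;

/* Called once at startup, before any allocator hooks can fire. */
void vma_free_list_init(size_t count)            /* e.g. 2048 by default */
{
    for (size_t i = 0; i < count; ++i) {
        vma_item_t *item = calloc(1, sizeof(*item));
        item->next = vma_free_list;
        vma_free_list = item;
    }
}

/* Safe to call while holding vma_lock: no allocator calls here. */
static vma_item_t *vma_item_get(void)
{
    vma_item_t *item = vma_free_list;
    if (NULL != item) {
        vma_free_list = item->next;
    }
    return item;
}

void vma_insert(void /* range arguments omitted */)
{
    pthread_mutex_lock(&vma_lock);
    vma_item_t *item = vma_item_get();           /* from the pre-allocated pool */
    /* ... link item into the VMA tree for the new range ... */
    (void) item;
    pthread_mutex_unlock(&vma_lock);
}
```
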
hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 4, 2017
It is not possible to use the patcher based memory hooks without
hooking madvise (MADV_DONTNEED). This commit updates the patcher
memory hooks to always hook madvise. This should be safe with recent
rcache updates.

References open-mpi#3685. Close when merged into v2.0.x, v2.x, and v3.0.x.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Member

hjelmn commented Aug 4, 2017

Ok, fix is up in PR #4026. I have a simple reproducer that will go into MTT once this ticket is closed.

hjelmn added a commit that referenced this issue Aug 7, 2017
(rcache/grdma VMA free-list commit; same message as the Aug 4 commit above)
hjelmn added a commit that referenced this issue Aug 7, 2017
(patcher madvise-hook commit; same message as the Aug 4 commit above)
hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 7, 2017
(rcache/grdma VMA free-list commit; same message as above; cherry picked from commit d0c4538)
hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 7, 2017
(patcher madvise-hook commit; same message as above; cherry picked from commit 0176000)
hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 7, 2017
(rcache/grdma VMA free-list commit; same message as above; cherry picked from commit d0c4538)
hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 7, 2017
(patcher madvise-hook commit; same message as above; cherry picked from commit 0176000)
@jsquyres
Member

jsquyres commented Aug 8, 2017

Per discussion on 8 Aug 2017 webex: @hjelmn explains that this happened because we didn't hook madvise() when we created patcher, solely because we didn't think we needed to. This issue and Nathan's further testing show that we do need to hook madvise, because there are cases where you can end up in a deadlock scenario (even in a single-threaded test).
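
To make "hooking madvise" concrete, here is a hedged sketch using an LD_PRELOAD-style interposer; Open MPI's patcher component achieves the same effect by patching the function in place rather than by symbol interposition, and rcache_invalidate_range() below is an invented placeholder for the registration-cache eviction:

```c
/* Illustrative interposer only; Open MPI's patcher binary-patches madvise
 * rather than relying on symbol interposition.  rcache_invalidate_range()
 * is an invented stand-in for the registration-cache eviction logic. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/mman.h>

static void rcache_invalidate_range(void *addr, size_t len)
{
    /* Placeholder: the real hook would evict any cached RDMA registration
     * overlapping [addr, addr + len), because the kernel may discard those
     * pages and back the range with fresh zeroed pages later. */
    (void) addr;
    (void) len;
}

int madvise(void *addr, size_t len, int advice)
{
    static int (*real_madvise)(void *, size_t, int);

    if (NULL == real_madvise) {
        real_madvise = (int (*)(void *, size_t, int)) dlsym(RTLD_NEXT, "madvise");
    }
    if (MADV_DONTNEED == advice) {
        /* Without this notification, a cached registration can keep pointing
         * at pages the kernel has thrown away -- the corruption in this issue. */
        rcache_invalidate_range(addr, len);
    }
    return real_madvise(addr, len, advice);
}
```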

The issue is a race between simultaneously freeing and mallocing memory -- a classic hold-and-wait. @hjelmn will add a single-threaded test to ompi-tests that will always trigger this scenario.

Background: the problem is in our VMA tree insert. We really should not be allocating or freeing memory inside the VMA allocation functions. However, we can't know the size we need to alloc until we get deep inside the VMA allocation functions, which led to the "we'll just alloc deep inside these routines" implementation. Somehow we need to fix this -- @hjelmn is looking into alternate data structures and/or algorithms. I.e., the real fix is to replace and/or redesign our VMA data structures.

Meaning: @hjelmn's PRs are band-aids: they greatly reduce the possibility of the issue happening by increasing the default size of the reg cache to 2K items. This significantly decreases the probability that the reg cache will need to be expanded (which, in turn, can cause the problem described above, where the VMA cache would need to allocate more items). Again, the real fix is to replace/redesign the VMA data structures, but this fix should hold us over while @hjelmn works on it.

Note that v2.x and v3.0.x PRs are already filed. v2.0.x will require a back port (because the rcache is still inside the mpool in 2.0.x -- the individual patches should apply, but files have moved between the v2.0.x and v2.x trees, and that requires human intervention -- @hjelmn is working on it).

@jsquyres
Member

@vspetrov @jladd-mlnx Can you test again on master / v2.x and see if you can reproduce the issue? According to @hjelmn, the band-aid fix should be good enough.

@vspetrov
Author

vspetrov commented Sep 4, 2017

@jsquyres Sorry for not replying for so long; the GitHub notifications were going to the wrong mail address. Anyway, I've tried the latest ompi/master and it works, so the workaround does resolve (hide) the problem.

@jsquyres
Member

This has been fixed (worked around) in v2.x, v3.0.x, and master. Waiting for #4078 for v2.0.x.

@jsquyres jsquyres modified the milestones: v2.0.4, v2.1.2 Sep 11, 2017
@alsrgv

alsrgv commented Sep 12, 2017

This patch also fixes the correctness issue I was observing in my distributed TensorFlow use case (https://github.com/uber/horovod). Looking forward to the official release!

hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 12, 2017
(rcache/grdma VMA free-list commit; same message as above; back-port of open-mpi/ompi@60ad9d1, open-mpi/ompi@b6bf3f4, and open-mpi/ompi@b870d15)
hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 12, 2017
(patcher madvise-hook commit; same message as above; cherry picked from commit 8137623)
@hppritcha
Member

PRs to 2.0.x, 2.x and 3.0.x to fix this issue have been merged. Closing.
