opal_progress: check timer only once per 8 calls #4697
Conversation
@bwbarrett @bosilca @hjelmn This is an interesting PR -- it basically switches atomics to a static volatile. Thoughts? Based on the comment in the code, let me ask a crazy question: is there value in removing volatile? I.e., a) decrease the performance penalty even more, because b) we don't really care if the number is not wholly accurate. |
bot:mellanox:retest |
My only issue with this is the situation where there is a low-priority event and the calling code enters MPI only very infrequently. I have thought about making this a simple counter but keep running into how to force it to trigger on every call in that case. The timer handles this case. I know this is not a common case, but it is a situation worth thinking about. |
@jsquyres The variable doesn't even need to be volatile for this to work. In either case the low-priority calls will still trigger, but their timing will be non-deterministic. I don't think this is an issue though. The atomic just ensures that it triggers exactly once every 8 calls. |
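To make the trade-off concrete, here is a minimal sketch (hypothetical names, not the actual Open MPI code) of a plain per-call counter that fires the expensive path once every 8 calls; without an atomic increment, racing threads can occasionally skip or duplicate a trigger, so "exactly once per 8" degrades to "roughly once per 8", which is the non-determinism being discussed:

```c
#include <stdint.h>

/* Hypothetical sketch: gate the expensive work to roughly once per 8 calls.
 * The counter is deliberately not atomic; concurrent callers may race on
 * the increment, but triggers keep happening. */
static volatile uint32_t call_count = 0;

static inline int should_do_expensive_work(void)
{
    return ((++call_count & 0x7) == 0);
}
```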
@hjelmn regarding the infrequent calls to MPI - since we are doing the event loop every @jsquyres removing volatile seemed a bit too dangerous since that variable really can be accessed by multiple threads. |
@yosefe Removing volatile will likely not change much besides timing. It likely will not even change the performance. I don't think it really matters either way. Looking at what I wrote, it isn't completely correct. For low-priority events it is assumed they can be delayed almost indefinitely, and that if they need to be processed the user should be calling into MPI. The same isn't necessarily true about libevent events. I would have to take a look and see if there is a problem. As is, I think I am ok with this change though. I can see how there might be an improvement by removing the atomic but not necessarily the timer call. The overhead of the rdtsc instruction is ~40 cycles. What is the overhead of the aarch64 mrs instruction? |
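For reference, both cycle counters being compared are a single-instruction read. A minimal sketch using GCC/Clang inline assembly (not taken from the opal timer headers):

```c
#include <stdint.h>

#if defined(__x86_64__)
/* x86_64: read the time-stamp counter with rdtsc. */
static inline uint64_t read_cycle_counter(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#elif defined(__aarch64__)
/* aarch64: read the generic-timer virtual count register with mrs. */
static inline uint64_t read_cycle_counter(void)
{
    uint64_t val;
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}
#endif
```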
I plan to test this change with some RMA-MT benchmarks and see how it affects other code paths. |
@hjelmn |
Nack; this patch cannot go in. It will adversely impact TCP by not entering the event library every time we call opal_progress().
What if there were two options for doing the progress, and the PML or BTL could select between them? |
That's essentially what the old code did; happy to see a PR for an improved version, but not polling the event library on every call when there are consumers (i.e., when it should be polled every iteration) is not an acceptable change. |
The old code is slower than the new one by 5-10%, so what I suggest is to have two opal_progress functions and use a pointer to select the proper one. Using a pointer may introduce some overhead, and I can't estimate how much from my experience. |
I'd have some concerns about the pointer swap (i.e., want to see the code) for how we turn on/off fast polling of the event library for the TCP BTL. But I think a pointer could work. And I wouldn't be concerned about the pointer overhead for the TCP BTL, so if it works for UCX, it's probably a good plan. |
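A sketch of the function-pointer idea (all names here are hypothetical placeholders, not symbols from the tree): components that need the event library polled on every iteration, such as the TCP BTL, would flip the pointer to the every-call variant.

```c
typedef int (*progress_fn_t)(void);

/* Stand-in bodies: one variant polls the event library on every call,
 * the other only once per N calls. */
static int progress_every_call(void) { /* ... poll libevent ... */ return 0; }
static int progress_counted(void)    { /* ... poll once per N calls ... */ return 0; }

/* Default to the cheaper variant. */
static progress_fn_t progress_fn = progress_counted;

int progress(void)
{
    return progress_fn();   /* one indirect call per progress iteration */
}

void progress_set_fast_event_polling(int enable)
{
    progress_fn = enable ? progress_every_call : progress_counted;
}
```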
Someone also needs to check the other BTLs, plus the Portals and OFI MTLs - I don't know what else might be impacted, but a thorough check of non-UCX paths seems in order before proceeding. |
@rhc54, sure, but if it makes UCX faster than before, it almost certainly will for others. I actually have no objections to the proposal from a general BTL-impact standpoint (it should only help). The two places it will hurt are the TCP BTL (because you could go 7 extra entries into the MPI library before calling into the event library to actually send/receive the packet) and out-of-band at time periods other than INIT and FINALIZE. |
I agree that it should be okay, but we've been burned before by making that assumption - best to ensure the right people know about it and check. @matcabral @tkordenbrock Just want to ensure you take a look and are okay with this. |
We could condition this path on the event users counter being 0. The TCP BTL increments that counter when it is needed. If the counter is non-zero we can always check the time delta. |
From code inspection, I think this could indeed improve the PSM2 MTL (most likely OFI also). Strictly from the timer standpoint, in the past I saw lots of time spent in rdtsc, so lowering that would be beneficial. However, I can't say how reducing the event-library calls may impact things; I can run a few tests. |
A few comments. One is a separate issue: on aarch64 I would recommend going back to the arch timer: https://github.com/open-mpi/ompi/blob/master/opal/include/opal/sys/arm64/timer.h#L24 |
I don't think this will have a negative impact on any of the Portals4 components. I'll run some benchmarks today. |
@bwbarrett @hjelmn Added a check for num_event_users and removed volatile. Attached are performance results of osu_bw/osu_latency on a single node, Xeon(R) CPU E5-2680 v4 @ 2.40GHz. |
Per 2018-01-16 webex, we all agree that the idea of this PR is good. We just can't kill TCP performance in doing so. Mellanox will work on updating this PR, but can't commit on a timeframe. |
@jsquyres IMHO the latest update should avoid killing TCP performance. |
@bwbarrett Yossi added a check for num_event_users |
Merge the two commits into one and add a better description of the change (in particular, the why of the change) and I'm happy.
@jladd-mlnx & @yosefe, my bad, I missed what Yossi was saying with his update yesterday. I'm happy with the code changes; I just want the commit-message and number-of-commits changes and I'll approve. |
Reading the system clock on every call to opal_progress() is an expensive operation on most architectures, and it can negatively affect performance, for example in message-rate benchmarks. We change opal_progress() to read the clock once per 8 calls, unless there are active users of the event mechanism. Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
e713a28 to 7cee603
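Pulling the thread together, the behavior described in the commit message above is roughly the following (a sketch only; num_event_users and the once-per-8-calls gate are the mechanisms described in this thread, while the function and variable names are placeholders):

```c
#include <stdint.h>
#include <stdbool.h>

/* Incremented by components (e.g. the TCP BTL) that need the event
 * library polled on every iteration. */
static volatile int32_t num_event_users = 0;

/* Placeholder for the native tick read (rdtsc, cntvct_el0, ...). */
static uint64_t get_native_ticks(void)
{
    static uint64_t fake_ticks = 0;
    return fake_ticks += 100;   /* stub so the sketch is self-contained */
}

static bool time_to_poll_events(uint64_t event_interval_ticks)
{
    static uint32_t call_count = 0;
    static uint64_t last_poll  = 0;

    /* Skip the relatively expensive clock read on 7 of 8 calls, unless
     * someone explicitly requires low-latency event polling. */
    if (num_event_users <= 0 && (++call_count & 0x7) != 0) {
        return false;
    }

    uint64_t now = get_native_ticks();
    if (now - last_poll >= event_interval_ticks) {
        last_poll = now;
        return true;
    }
    return false;
}
```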
@bwbarrett done |
Looks good! |
@shamisp We should be using the fastest available timers internally. The timers used by opal_progress() are the ones in opal/include/opal/sys/*/timer.h. The only thing that should be using clock_gettime() is MPI_Wtime(). |
Hmm, except when using builtin atomics! That is a problem.. I think. |
From what I see, if clock_gettime is available it will be used in all places, including progress. Probably we should open a separate issue for this? |
@shamisp Nope. See: https://github.com/open-mpi/ompi/blob/master/opal/runtime/opal_progress.c#L193 We use the native timer if it is available. If there are cases where we have the native timer and we are not using it, that needs to be fixed. As for MPI_Wtime(): there was a long discussion on this some time ago. The problem is that no one has implemented the proper function to get the rdtsc timer frequency in Open MPI, so we decided to fall back on clock_gettime(). If someone wants to open a pull request to get the correct frequency, we can revisit the issue. |
@shamisp MPI_Wtime requires a monotonic timer. We fall back on clock_gettime only if the architecture timer is not monotonic and the user has not turned off the monotonic requirement. |
@bosilca That is what we used to do until we ran into the frequency issue. We now only use clock_gettime(). See: https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c. I would love to see a fix, but we need to implement the code to get the Intel rdtsc frequency. |
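For context, the clock_gettime()-based fallback amounts to something like this (a sketch; the real wtime.c has additional configury around it):

```c
#include <time.h>

/* MPI_Wtime-style wall clock based on a monotonic POSIX clock,
 * returned as seconds in a double. */
static double wtime_clock_gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}
```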
@hjelmn Unless I'm missing something, both of those call clock_gettime() since it is available. |
@shamisp On x86_64 (and aarch64) OPAL_PROGRESS_ONLY_USEC_NATIVE is false so we call L196 not L194. This path does not call clock_gettime(). |
@hjelmn both are mapped to clock_gettime whenever it is available. |
@shamisp That is a bug if it is using clock_gettime(). It is supposed to be using the native timer :-/. |
@shamisp This is where we re-map it to use the native timer: https://github.com/open-mpi/ompi/blob/master/opal/mca/timer/linux/timer_linux_component.c#L204 |
Hmm... seems like it is supposed to take the right path, but this is not what has been reported in the performance measurements. Probably I have to take additional steps to confirm it. As for https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c |
@shamisp We took the easy path for now. Don't think any of my users care about the performance of MPI_Wtime(). Outside benchmarks I don't think anyone cares. This will probably change if we add a new timer to MPI. There is a proposal from the tools working group to add a call to get cycles. |
@shamisp That being said, pull requests are always appreciated... 😄 |
@hjelmn In certain builds (like static), the vDSO will fall back to a system call, so the impact on performance and measurements might be quite substantial. Why can we not go back to what it was before, and just have it disabled if a monotonic timer is not there? Essentially use the same path that opal is supposed to take (in theory?). |
Variable clock rate? |
@bosilca ifdef (x86_64)? |
@shamisp Would love to see the PR :). Someone with cycles for this needs to do the work. I don't, as my users don't care. Too many other things to do. |
@hjelmn are you okay with the patch where only x86_64 takes the clock_gettime() path and the rest falls back to the original flow? |
"Someone with cycles..." Hah! Very punny. 👏 |
https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c#L53
If everybody agrees on `s/if 0/OPAL_ASSEMBLY_ARCH != OPAL_X86_64/`, I will go ahead and submit it. |
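The proposed substitution would change the guard in wtime.c roughly as follows (a sketch of the intent, not the actual diff): every architecture except x86_64 would take the previously disabled native-timer branch, while x86_64 keeps the clock_gettime() path.

```c
/* Before: the native-timer branch is compiled out entirely. */
#if 0
    /* native opal timer based implementation */
#else
    /* clock_gettime()/gettimeofday() based implementation */
#endif

/* After s/if 0/OPAL_ASSEMBLY_ARCH != OPAL_X86_64/: only x86_64 keeps
 * the clock_gettime() path; everything else uses the native timer. */
#if OPAL_ASSEMBLY_ARCH != OPAL_X86_64
    /* native opal timer based implementation */
#else
    /* clock_gettime()/gettimeofday() based implementation */
#endif
```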
I'd rather we not do that until someone quantifies the impact the change would have on x86 systems. We spent a ton of time debating this before, and had all organizations that wanted to participate spend time benchmarking the options before deciding on the current course. Frankly, I'm a tad annoyed to suddenly find us resetting back to square one. @matcabral Would you have time to re-run the benchmarks with the proposed change? I personally consider this low-priority given all that has already transpired, but would appreciate it if you took a look as time permits. |
My bad - I'm told that this proposed change will only take everyone but x86 down a currently unused code path. I got lost in the exchange and thought something else was being proposed. Objection removed - feel free. |
I will still test the new patch ;) |
@shamisp Go ahead and open the PR. I will review. |
This PR improves osu_bw and osu_mbw_mr performance of UCX on ARM (30-50%) and x86_64 (10-20%) architectures.
@shamisp @hppritcha