opal_progress: check timer only once per 8 calls #4697
Conversation
@bwbarrett @bosilca @hjelmn This is an interesting PR -- it basically switches atomics to a static volatile. Thoughts? Based on the comment in the code, let me ask a crazy question: is there value in removing volatile? I.e., a) decrease the performance penalty even more, because b) we don't really care if the number is not wholly accurate. |
bot:mellanox:retest |
My only issue with this is the situation where there is a low-priority event and the calling code enters MPI only very infrequently. I have thought about making this a simple counter but keep running into how to force it to trigger on every call in that case. The timer handles this case. I know this is not a common case, but it is a situation worth thinking about. |
@jsquyres The variable doesn't even need to be volatile for this to work. In either case the low-priority calls will still trigger, but their timing will be non-deterministic. I don't think this is an issue though. The atomic just ensures that it triggers exactly once every 8 calls. |
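To make the trade-off concrete, here is a minimal sketch (hypothetical names, not the actual Open MPI code) of a plain per-call counter that fires the expensive path once every 8 calls; without an atomic increment, racing threads can occasionally skip or duplicate a trigger, so "exactly once per 8" degrades to "roughly once per 8", which is the non-determinism being discussed:

```c
#include <stdint.h>

/* Hypothetical sketch: gate the expensive work to roughly once per 8 calls.
 * The counter is deliberately not atomic; concurrent callers may race on
 * the increment, but triggers keep happening. */
static volatile uint32_t call_count = 0;

static inline int should_do_expensive_work(void)
{
    return ((++call_count & 0x7) == 0);
}
```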
@hjelmn regarding the infrequent calls to MPI - since we are doing the event loop every @jsquyres removing volatile seemed a bit too dangerous since that variable really can be accessed by multiple threads. |
@yosefe Removing volatile will likely not change much besides timing. It likely will not even change the performance. I don't think it really matters either way. Looking at what I wrote, it isn't completely correct. For low-priority events it is assumed they can be delayed almost indefinitely, and that if they need to be processed the user should be calling into MPI. The same isn't necessarily true about libevent events. I would have to take a look and see if there is a problem. As is, I think I am ok with this change though. I can see how there might be an improvement by removing the atomic but not necessarily the timer call. The overhead of the rdtsc instruction is ~40 cycles. What is the overhead of the aarch64 mrs instruction? |
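For reference, both cycle counters being compared are a single-instruction read. A minimal sketch using GCC/Clang inline assembly (not taken from the opal timer headers):

```c
#include <stdint.h>

#if defined(__x86_64__)
/* x86_64: read the time-stamp counter with rdtsc. */
static inline uint64_t read_cycle_counter(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#elif defined(__aarch64__)
/* aarch64: read the generic-timer virtual count register with mrs. */
static inline uint64_t read_cycle_counter(void)
{
    uint64_t val;
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}
#endif
```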
I plan to test this change with some RMA-MT benchmarks and see how it affects other code paths. |
@hjelmn |
Nack; this patch cannot go in. It will adversely impact TCP by not entering the event library every time we call opal_progress().
What if there were two options for doing the progress, and the PML or BTL could select between them? |
That's essentially what the old code did; happy to see a PR for an improved version, but not polling the event library on every call when there are consumers (i.e., when it should be polled every iteration) is not an acceptable change. |
The old code is slower than the new one by 5-10%, so what I suggest is to have two opal_progress functions and use a pointer to select the proper one. Using a pointer may introduce some overhead, and I can't estimate how much from my experience. |
I'd have some concerns about the pointer swap (i.e., want to see the code) for how we turn on/off fast polling of the event library for the TCP BTL. But I think a pointer could work. And I wouldn't be concerned about the pointer overhead for the TCP BTL, so if it works for UCX, it's probably a good plan. |
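A sketch of the function-pointer idea (all names here are hypothetical placeholders, not symbols from the tree): components that need the event library polled on every iteration, such as the TCP BTL, would flip the pointer to the every-call variant.

```c
typedef int (*progress_fn_t)(void);

/* Stand-in bodies: one variant polls the event library on every call,
 * the other only once per N calls. */
static int progress_every_call(void) { /* ... poll libevent ... */ return 0; }
static int progress_counted(void)    { /* ... poll once per N calls ... */ return 0; }

/* Default to the cheaper variant. */
static progress_fn_t progress_fn = progress_counted;

int progress(void)
{
    return progress_fn();   /* one indirect call per progress iteration */
}

void progress_set_fast_event_polling(int enable)
{
    progress_fn = enable ? progress_every_call : progress_counted;
}
```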
Someone also needs to check the other BTLs, plus the Portals and OFI MTLs - I don't know what else might be impacted, but a thorough check of non-UCX paths seems in order before proceeding. |
@rhc54, sure, but if it makes UCX faster than before, it almost certainly will for others. I actually have no objections to the proposal from a general BTL-impact standpoint (it should only help). The two places it will hurt are the TCP BTL (because you could go 7 extra entries into the MPI library before calling into the event library to actually send/receive the packet) and out-of-band at time periods other than INIT and FINALIZE. |
I agree that it should be okay, but we've been burned before by making that assumption - best to ensure the right people know about it and check. @matcabral @tkordenbrock Just want to ensure you take a look and are okay with this. |
We could condition this path on the event users counter being 0. The TCP BTL increments that counter when it is needed. If the counter is non-zero we can always check the time delta. |
From code inspection, I think this could indeed improve the PSM2 MTL (most likely OFI also). Strictly from the timer standpoint, in the past I saw lots of time spent in rdtsc, so lowering that would be beneficial. However, I can't say how reducing the event-library calls may impact things; I can run a few tests. |
A few comments. One is a separate issue: on aarch64 I would recommend going back to the arch timer: https://github.com/open-mpi/ompi/blob/master/opal/include/opal/sys/arm64/timer.h#L24 |
I don't think this will have a negative impact on any of the Portals4 components. I'll run some benchmarks today. |
@bwbarrett @hjelmn Added a check for num_event_users and removed volatile. Attached are performance results of osu_bw/osu_latency on a single node, Xeon(R) CPU E5-2680 v4 @ 2.40GHz. |
Per 2018-01-16 webex, we all agree that the idea of this PR is good. We just can't kill TCP performance in doing so. Mellanox will work on updating this PR, but can't commit on a timeframe. |
@jsquyres IMHO the latest update should avoid killing TCP performance. |
@bwbarrett Yossi added a check for num_event_users |
Merge the two commits into one and add a better description of the change (in particular, the why of the change) and I'm happy.
@jladd-mlnx & @yosefe, my bad, I missed what Yossi was saying with his update yesterday. I'm happy with the code changes; I just want the commit-message and number-of-commits changes and I'll approve. |
Reading the system clock on every call to opal_progress() is an expensive operation on most architectures, and it can negatively affect performance, for example in message-rate benchmarks. We change opal_progress() to read the clock once per 8 calls, unless there are active users of the event mechanism. Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
e713a28 to 7cee603
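Pulling the thread together, the behavior described in the commit message above is roughly the following (a sketch only; num_event_users and the once-per-8-calls gate are the mechanisms described in this thread, while the function and variable names are placeholders):

```c
#include <stdint.h>
#include <stdbool.h>

/* Incremented by components (e.g. the TCP BTL) that need the event
 * library polled on every iteration. */
static volatile int32_t num_event_users = 0;

/* Placeholder for the native tick read (rdtsc, cntvct_el0, ...). */
static uint64_t get_native_ticks(void)
{
    static uint64_t fake_ticks = 0;
    return fake_ticks += 100;   /* stub so the sketch is self-contained */
}

static bool time_to_poll_events(uint64_t event_interval_ticks)
{
    static uint32_t call_count = 0;
    static uint64_t last_poll  = 0;

    /* Skip the relatively expensive clock read on 7 of 8 calls, unless
     * someone explicitly requires low-latency event polling. */
    if (num_event_users <= 0 && (++call_count & 0x7) != 0) {
        return false;
    }

    uint64_t now = get_native_ticks();
    if (now - last_poll >= event_interval_ticks) {
        last_poll = now;
        return true;
    }
    return false;
}
```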
@bwbarrett done |
Looks good! |
@shamisp We should be using the fastest available timers internally. The timers used by opal_progress() are the ones in opal/include/opal/sys/*/timer.h. The only thing that should be using clock_gettime() is MPI_Wtime(). |
Hmm, except when using builtin atomics! That is a problem.. I think. |
From what I see, if clock_gettime is available it will be used in all places, including progress. Probably we should open a separate issue for this? |
@shamisp Nope. See: https://github.com/open-mpi/ompi/blob/master/opal/runtime/opal_progress.c#L193 We use the native timer if it is available. If there are cases where we have the native timer and we are not using it, that needs to be fixed. As for MPI_Wtime(): there was a long discussion on this some time ago. The problem is that no one has implemented the proper function to get the rdtsc timer frequency in Open MPI, so we decided to fall back on clock_gettime(). If someone wants to open a pull request to get the correct frequency, we can revisit the issue. |
@shamisp MPI_Wtime requires a monotonic timer. We fall back on clock_gettime only if the architecture timer is not monotonic and the user has not turned off the monotonic requirement. |
@bosilca That is what we used to do until we ran into the frequency issue. We now only use clock_gettime(). See: https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c. I would love to see a fix, but we need to implement the code to get the Intel rdtsc frequency. |
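For context, the clock_gettime()-based fallback amounts to something like this (a sketch; the real wtime.c has additional configury around it):

```c
#include <time.h>

/* MPI_Wtime-style wall clock based on a monotonic POSIX clock,
 * returned as seconds in a double. */
static double wtime_clock_gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}
```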
@hjelmn Unless I'm missing something, both of those call clock_gettime() since it is available. |
@shamisp On x86_64 (and aarch64) OPAL_PROGRESS_ONLY_USEC_NATIVE is false so we call L196 not L194. This path does not call clock_gettime(). |
@hjelmn both are mapped to clock_gettime whenever it is available. |
@shamisp That is a bug if it is using clock_gettime(). It is supposed to be using the native timer :-/. |
@shamisp This is where we re-map it to use the native timer: https://github.com/open-mpi/ompi/blob/master/opal/mca/timer/linux/timer_linux_component.c#L204 |
Hmm... seems like it is supposed to take the right path, but this is not what has been reported in the performance measurements. Probably I have to take additional steps to confirm it. As for https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c |
@shamisp We took the easy path for now. Don't think any of my users care about the performance of MPI_Wtime(). Outside benchmarks I don't think anyone cares. This will probably change if we add a new timer to MPI. There is a proposal from the tools working group to add a call to get cycles. |
@shamisp That being said, pull requests are always appreciated... 😄 |
@hjelmn In certain builds (like static), the vDSO will fall back to a system call, so the impact on performance and measurements might be quite substantial. Why can we not go back to what it was before, and just have it disabled if a monotonic timer is not there? Essentially use the same path that opal is supposed to take (in theory?). |
Variable clock rate? |
@bosilca ifdef (x86_64)? |
@shamisp Would love to see the PR :). Someone with cycles for this needs to do the work. I don't, as my users don't care. Too many other things to do. |
@hjelmn are you okay with the patch where only x86_64 takes the clock_gettime() path and the rest falls back to the original flow? |
"Someone with cycles..." Hah! Very punny. 👏 |
https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c#L53
If everybody agrees on `s/if 0/OPAL_ASSEMBLY_ARCH != OPAL_X86_64/`, I will go ahead and submit it. |
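The proposed substitution would change the guard in wtime.c roughly as follows (a sketch of the intent, not the actual diff): every architecture except x86_64 would take the previously disabled native-timer branch, while x86_64 keeps the clock_gettime() path.

```c
/* Before: the native-timer branch is compiled out entirely. */
#if 0
    /* native opal timer based implementation */
#else
    /* clock_gettime()/gettimeofday() based implementation */
#endif

/* After s/if 0/OPAL_ASSEMBLY_ARCH != OPAL_X86_64/: only x86_64 keeps
 * the clock_gettime() path; everything else uses the native timer. */
#if OPAL_ASSEMBLY_ARCH != OPAL_X86_64
    /* native opal timer based implementation */
#else
    /* clock_gettime()/gettimeofday() based implementation */
#endif
```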
I'd rather we not do that until someone quantifies the impact the change would have on x86 systems. We spent a ton of time debating this before, and had all organizations that wanted to participate spend time benchmarking the options before deciding on the current course. Frankly, I'm a tad annoyed to suddenly find us resetting back to square one. @matcabral Would you have time to re-run the benchmarks with the proposed change? I personally consider this low-priority given all that has already transpired, but would appreciate it if you took a look as time permits. |
My bad - I'm told that this proposed change will only take everyone but x86 down a currently unused code path. I got lost in the exchange and thought something else was being proposed. Objection removed - feel free. |
I will still test the new patch ;) |
@shamisp Go ahead and open the PR. I will review. |
This PR improves osu_bw and osu_mbw_mr performance of UCX on ARM (30-50%) and x86_64 (10-20%) architectures.
@shamisp @hppritcha