Fix first Reactor sleep taking at least 30-50ms #653

vlovich · 2024-04-16T20:33:47Z

Noticed this problem in unit tests where the very first timer sleep would take a very long time because the membarrier was being registered and tests have more than 1 thread.

Initializing the membarrier strategy in Reactor::new seems like a good idea: almost any use of the Reactor involves going to sleep, and I can't imagine a use case that avoids it unless you're not interacting with anything in Glommio in the first place.

With this change we see that the very first timer now completes within 11ms (I added a grace window of an extra ms in case of CI).
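A minimal sketch of the kind of grace-window check described above, using only the standard library. The helper names `time_it` and `within_grace` are hypothetical and are not part of Glommio or of this PR's test; `std::thread::sleep` stands in for the Glommio timer.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` and returns how long it took.
fn time_it<F: FnOnce()>(f: F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}

/// Hypothetical helper: true if `elapsed` is no longer than the
/// requested `target` plus the allowed `grace` window.
fn within_grace(elapsed: Duration, target: Duration, grace: Duration) -> bool {
    elapsed <= target + grace
}
```

With a 10ms target and a 1ms grace window, an 11ms measurement passes, while the 60ms observation for a 5ms sleep mentioned in the motivation would fail.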

What does this PR do?

The first construction of the IO uring reactor now registers the membarrier.
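To illustrate the idea of front-loading a one-time cost into construction, here is a minimal sketch using `std::sync::Once`. It is not Glommio's actual code: `Reactor`, `ensure_registered`, and the counter are hypothetical stand-ins, and the real registration call is replaced by a counter increment.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

// Counts how many times the (stand-in) registration actually ran.
static REGISTRATIONS: AtomicUsize = AtomicUsize::new(0);
static INIT: Once = Once::new();

/// Runs the one-time registration if it has not happened yet.
fn ensure_registered() {
    INIT.call_once(|| {
        // Assumption: a real implementation would perform the expensive
        // membarrier registration here.
        REGISTRATIONS.fetch_add(1, Ordering::SeqCst);
    });
}

/// Hypothetical reactor that pays the registration cost up front.
struct Reactor;

impl Reactor {
    fn new() -> Self {
        ensure_registered(); // cost is paid during construction
        Reactor
    }

    fn sleep(&self) {
        ensure_registered(); // already done, so this is cheap
    }
}

fn registration_count() -> usize {
    REGISTRATIONS.load(Ordering::SeqCst)
}
```

In this sketch the first `Reactor::new()` triggers the registration, and later calls to `sleep` or further constructions do not repeat it.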

Motivation

I had tests that were flakily failing because a very short sleep was taking an almost unbounded amount of time (observed up to 60ms even for a 5ms sleep). I think it's cleaner to front-load the membarrier initialization into Reactor construction rather than pay that penalty on the very first .await later, where it's less obvious and more surprising.

Related issues

Fixes #652

Additional Notes

I believe the cost of registration when there's > 1 thread running got worse sometime after Linux 6.6, because tests in my project that do something similar started regularly taking > 30ms after upgrading to 6.8, whereas before they mostly finished in < 30ms (30ms was my grace window for how long a 10ms sleep could take).

Checklist

[X] I have added unit tests to the code I am submitting
[X] My unit tests cover both failure and success scenarios
[ ] If applicable, I have discussed my architecture

vlovich · 2024-04-24T19:47:12Z

Ping on this in case you missed it in your review batch @glommer

glommer · 2024-04-24T19:52:20Z

I had missed it, indeed.

glommer · 2024-04-24T19:53:25Z

Can we add a comment on the code explaining why we're doing this, so nobody else removes that in the future?

Noticed this problem in unit tests where the very first timer sleep would take a very long time because the membarrier was being registered and tests have more than 1 thread. Initializing the membarrier strategy before the BlockingThreadPool is constructed seems like a good idea because it tries to elide the registration cost if we haven't created any other threads yet. Even if there are threads, the cost is front-loaded eagerly so it's part of the cost of creating the executor rather than appearing as a random delay going to sleep on the io_uring the first time (assuming it could otherwise wake before the 30-80ms cost observed). I believe the cost of registration got worse sometime after Linux 6.6 because tests in my project that do something similar started regularly taking >30ms after upgrading to 6.8 whereas before they were mostly succeeding < 30ms (it was my grace window for how long a 10ms sleep could take). With this change we see that the very first timer now completes within 11ms (I added a grace window of an extra ms in case of CI).

vlovich · 2024-04-24T20:57:36Z

Good note. I've added much more extensive documentation for the subtleties. I also realized that its placement was incorrect, as it needed to precede the construction of threads in BlockingThreadPool. I also realized it's necessary as an external API (or we can add a dependency on the ctor crate to do this transparently) for 2 reasons:

The user may start their own threads before glommio
The user may sandbox their app before glommio & thus glommio will get a privilege denial for trying to register the membarrier

I don't know the policy on adding dependencies on third-party crates, so I went with an explicit API that users can call if they need it (most don't). ctor magic would be more user-friendly, although portability may be a concern: I think it's well supported on tier 1 platforms, but I'm sure there are always subtleties that can appear with that kind of low-level stuff. If you'd prefer magic, I'm happy to change the PR and add a dependency on ctor. Let me know.
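A rough sketch of what an explicit early-init call buys, assuming a process-wide strategy that is decided once and then reused. Everything here is hypothetical: `Strategy`, `init_strategy`, and the injected `register` closure are illustrative names, not the API added in this PR, and the closure stands in for the real registration attempt.

```rust
use std::io;
use std::sync::OnceLock;

/// Hypothetical outcome of the one-time registration attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    Membarrier,
    Fallback,
}

static STRATEGY: OnceLock<Strategy> = OnceLock::new();

/// Decides the strategy on the first call only. `register` stands in for
/// the real registration attempt, which may fail once the process has
/// been sandboxed. Later calls return the stored result without
/// running `register` again.
fn init_strategy<F>(register: F) -> Strategy
where
    F: FnOnce() -> io::Result<()>,
{
    *STRATEGY.get_or_init(|| match register() {
        Ok(()) => Strategy::Membarrier,
        Err(_) => Strategy::Fallback,
    })
}
```

Calling `init_strategy` at the top of `main`, before spawning threads or sandboxing, fixes the outcome while registration is still cheap and permitted; a later attempt that would be denied never runs.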

glommer · 2024-04-25T13:17:19Z

As I get older, I dislike magic more and more.

vlovich force-pushed the fix-timer branch 2 times, most recently from f3e7916 to f6f3963 · April 16, 2024 20:42

vlovich force-pushed the fix-timer branch from f6f3963 to 9a1d9e2 · April 24, 2024 19:21

vlovich force-pushed the fix-timer branch from 2a70e5d to 02b9814 · April 24, 2024 20:54

glommer approved these changes Apr 25, 2024

Glauber Costa added 2 commits April 25, 2024 09:17

Merge branch 'master' into fix-timer

d76f7ab

Merge branch 'master' into fix-timer

b6aa012

glommer merged commit 2418fa3 into DataDog:master Apr 25, 2024
4 of 5 checks passed
