Skip to content

Commit

Permalink
Fix first sleep taking at least 30-50ms
Browse files Browse the repository at this point in the history
Noticed this problem in unit tests where the very first timer sleep
would take a very long time because the membarrier was being registered
and tests have more than 1 thread.

Initializing the membarrier strategy before the BlockingThreadPool is
constructed seems like a good idea because it tries to elide the
registration cost if we haven't created any other threads yet. Even if
there are threads, the cost is front-loaded eagerly so it's part of the
cost of creating the executor rather than appearing as a random delay
going to sleep on the io_uring the first time (assuming it could
otherwise wake before the 30-80ms cost observed).

I believe the cost of registration got worse sometime after Linux 6.6
because tests in my project that do something similar started regularly
taking >30ms after upgrading to 6.8 whereas before they were mostly
succeeding < 30ms (it was my grace window for how long a 10ms sleep
could take).

With this change we see that the very first timer now completes within
11ms (I added a grace window of an extra ms in case of CI).
  • Loading branch information
vlovich committed Apr 24, 2024
1 parent 85b115d commit 02b9814
Show file tree
Hide file tree
Showing 6 changed files with 75 additions and 2 deletions.
30 changes: 30 additions & 0 deletions glommio/src/executor/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,36 @@ pub fn executor() -> ExecutorProxy {
ExecutorProxy {}
}

/// You probably don't need to call this explicitly unless you have created threads before
/// you constructed an executor. You may also want to call this explicitly if you're sandboxing
/// your app and dropping permissions required to use privated expedited membarrier commands to
/// improve Glommio performance.
///
/// This run some early initialization that may save ~30-80ms if you create threads before you
/// tough Glommio code & don't otherwise register with membarrier. This is an idempotent call
/// that can be invoked as many times, but only really makes sense to do early in main before
/// you've constructed any threads. Additionally, the explicit membarrier registration this
/// attempts is also useful if you're later dropping the privilege required to using private
/// expedited commands so that the membarrier synchronization strategy can be used within
/// the executor hot loop.
///
/// The detailed motivation is that internally glommio prefers to use [membarrier](https://man7.org/linux/man-pages/man2/membarrier.2.html)
/// as a high performance synchronization barrier of the underlying io_uring memory
/// shared with the kernel (if the feature is available). If there exist any threads in the program,
/// registering the barrier has been observed to take a relatively long and highly variable amount
/// of time (as high as 30-80ms). If we register before any threads are constructed, this takes almost no time.
/// That's why this can be useful if you're integrating Glommio into an existing application to move this to
/// the front of main.
///
/// The suspicion is that the membarrier registration makes a synchronous IPI to enable the use of cheap
/// asynchronous IPI within the reactor hot loop. This means the startup cost will depend on the specific
/// number of cores; 30-80ms was observed on a 32-core machine running a high end Intel consumer CPU.
/// That's probably on the high end for consumer machines and smartphones today and on the low-end for
/// server-class hardware.
pub fn early_init() {
sys::initialize_membarrier_strategy();
}

pub(crate) fn executor_id() -> Option<usize> {
#[cfg(not(feature = "native-tls"))]
{
Expand Down
4 changes: 2 additions & 2 deletions glommio/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -446,8 +446,8 @@ pub use crate::{
ResourceType, Result,
},
executor::{
allocate_dma_buffer, allocate_dma_buffer_global, executor, spawn_local, spawn_local_into,
spawn_scoped_local, spawn_scoped_local_into,
allocate_dma_buffer, allocate_dma_buffer_global, early_init, executor, spawn_local,
spawn_local_into, spawn_scoped_local, spawn_scoped_local_into,
stall::{DefaultStallDetectionHandler, StallDetection, StallDetectionHandler},
yield_if_needed, CpuSet, ExecutorJoinHandle, ExecutorProxy, ExecutorStats, LocalExecutor,
LocalExecutorBuilder, LocalExecutorPoolBuilder, Placement, PoolPlacement,
Expand Down
10 changes: 10 additions & 0 deletions glommio/src/sys/blocking.rs
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@ use std::{
thread::JoinHandle,
};

use super::membarrier;

// So hard to copy/clone io::Error, plus need to send between threads. Best to
// do all i64.
macro_rules! raw_syscall {
Expand Down Expand Up @@ -193,6 +195,14 @@ impl BlockingThreadPool {
placement: PoolPlacement,
sleep_notifier: Arc<SleepNotifier>,
) -> crate::Result<Self, ()> {
// Make sure to initialize the membarrier before the thread pool is constructed so that registration
// is much cheaper. Additionally, even if we take the expensive slow path, we do it here instead
// of at the reactor hot loop which synchronously blocks the io_uring loop to register the membarrier.
// https://github.com/DataDog/glommio/issues/652
// An alternative solution is to pull in https://crates.io/crates/ctor as a dependency and run this
// that way - then the public API to call early_init can be elided wholesale.
membarrier::initialize_strategy();

let (in_tx, in_rx) = flume::bounded(4 << 10);
let (out_tx, out_rx) = flume::bounded(4 << 10);
let in_rx = Arc::new(in_rx);
Expand Down
15 changes: 15 additions & 0 deletions glommio/src/sys/membarrier.rs
Original file line number Diff line number Diff line change
Expand Up @@ -226,3 +226,18 @@ pub(crate) fn heavy() {
Fallback => atomic::fence(atomic::Ordering::SeqCst),
}
}

/// The membarrier strategy is expensive to initialize. Expose that initialization so
/// that [crate::executor::early_init] can call it (it also gets intentionally called
/// by [crate::executor::LocalExecutor::new] in case the user didn't).
///
/// I believe the cost of registration got worse sometime after Linux 6.6 because tests
/// in my project that do a sleep on a timer as the first thing started taking >30ms
/// after upgrading to 6.8 whereas before they were mostly succeeding < 30ms (it was
/// my hacky attempt at dealing with setting a timeout on how long a timer could take
/// before I dug into the problem once 6.8 made it unbearable & waiting up to 100ms
/// for a 10ms test seemed silly & was hard to write assertions around).
#[inline]
pub(crate) fn initialize_strategy() {
let _ = *STRATEGY;
}
1 change: 1 addition & 0 deletions glommio/src/sys/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,7 @@ pub(crate) fn sendmsg_syscall(

mod dma_buffer;
mod membarrier;
pub(crate) use membarrier::initialize_strategy as initialize_membarrier_strategy;
pub(crate) mod source;
pub(crate) mod sysfs;
mod uring;
Expand Down
17 changes: 17 additions & 0 deletions glommio/src/timer/timer_impl.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1195,4 +1195,21 @@ mod test {

handle.join().unwrap();
}

#[test]
fn first_timer_finishes_with_expected_duration() {
test_executor!(async move {
let start = Instant::now();
Timer::new(Duration::from_millis(3)).await;
let elapsed = start.elapsed();
assert!(
elapsed >= Duration::from_millis(3),
"Timer expired too soon: {elapsed:?}"
);
assert!(
elapsed <= Duration::from_millis(5),
"Timer took way too long to run: {elapsed:?}"
);
});
}
}

0 comments on commit 02b9814

Please sign in to comment.