Remove LocalWaker and simplify the RawWakerVTable #16
Conversation
Is it correct that this makes it impossible to write an executor that is single-threaded and uses neither atomics nor thread-locals? Having recently spent some time manually implementing 0.3 futures for Quinn, I don't understand why calling …
Responding to a couple points in the blog post:
How do you safely handle a call on another thread without using TLS here too? Is there data to support the claim that there's never significant performance detriment?
As a manual future author, I don't seem to be exposed to a significant amount of complexity by the current API; I call …
```rust
/// [`Waker`]: ../task/struct.Waker.html
fn poll(self: Pin<&mut Self>, lw: &LocalWaker) -> Poll<Self::Output>;
/// [`Waker::into_waker`]: ../task/struct.Waker.html#method.into_waker
```
This line should also be removed, since `Waker::into_waker()` does not exist anymore.
To demonstrate the impact of atomic operations I created a simple benchmark comparing atomic additions, found here. I ran the benchmark on a 2015 MacBook (2.2 GHz Intel Core i7) and here are the results:
Note that the non-atomic result is in ps, the atomic one in ns! Doing some quick maths, this simple atomic addition is 30x slower on a (fairly) recent Intel processor. To me that is not a zero-cost abstraction, especially if you're building a (mainly) single-threaded executor. Regarding @withoutboats' blog posts, I thought they were well written, and part 1 should be integrated into the docs somehow. Quote from blog post part 2:
The part about probably wanting a … Furthermore, the main question/unclear point I get from the blog post is "when/whether to convert a …"
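The kind of measurement @Thomasdezeeuw describes can be reproduced with a plain timing loop. The sketch below (standard library only; it is not the linked gist, and the function names are made up for illustration) contrasts ordinary additions with sequentially consistent atomic additions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

const N: u64 = 1_000_000;

// Plain additions: black_box keeps the compiler from folding the loop away.
fn bench_plain() -> (u64, Duration) {
    let start = Instant::now();
    let mut counter = 0u64;
    for _ in 0..N {
        counter = std::hint::black_box(counter + 1);
    }
    (counter, start.elapsed())
}

// Atomic additions with the strongest ordering; each one is a locked
// read-modify-write on x86, which is what the 30x figure is measuring.
fn bench_atomic() -> (u64, Duration) {
    let start = Instant::now();
    let counter = AtomicU64::new(0);
    for _ in 0..N {
        counter.fetch_add(1, Ordering::SeqCst);
    }
    (counter.load(Ordering::SeqCst), start.elapsed())
}

fn main() {
    let (plain, plain_time) = bench_plain();
    let (atomic, atomic_time) = bench_atomic();
    assert_eq!(plain, N);
    assert_eq!(atomic, N);
    println!("plain:  {:?} total for {} adds", plain_time, N);
    println!("atomic: {:?} total for {} adds", atomic_time, N);
}
```

Exact numbers vary by CPU and by how much contention there is; an uncontended `fetch_add` on one core is the cheapest possible case.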
Yes, this needs to use TLS (or at least the ThreadId API, but probably you would just access your executor as a TLS'd …). It's pretty difficult to prove a negative about performance characteristics, but I would point out two things: …
Given that we know with the simpler API we can provide excellent support for a wide range of applications from embedded hardware to device drivers to network services to parallelized computing, if there is really a niche use case for which this API is inadequate, I am comfortable considering that outside of the scope of the async/await feature and having to use a different manner of implementing its state machines.
There's not really a connection between futures being `Send` and … But I also want to point out that it's very easy to assume that there is a connection between which waker you "need" and whether your future is Send or not - your belief here wasn't unreasonable at all. That's a part of the complexity burden of this API: there are several orthogonal distinctions that are not at all obvious and that become easy to conflate in the mind of someone trying to understand it.
Just to provide some additional context on Tokio, executors, reactors, …

Before this PR landed, Tokio's default runtime used to have N worker threads that execute tasks and 1 dedicated thread for the reactor. There were N+1 threads total. That means every time the reactor wanted to wake a task, it would have to send the notification to one of the N worker threads.

Today, the situation is different. Tokio now has N worker threads that execute tasks and N reactors, one per thread. That means there are only N threads total, and every thread is responsible for executing tasks and driving its reactor. When a reactor wants to wake a task, in most cases it will send the notification to the same thread. On relatively rare occasions, however, it may have to send the notification to a different thread (because idle threads steal tasks from working threads).

Tokio has another, single-threaded runtime, also known as …

My opinion is that removing …

Related issue on generic executors: tokio-rs/tokio#625
@stjepang My recollection was mistaken, I thought that tokio's default now was to have 2N threads (a separate reactor thread and executor thread, rather than running the reactor on the same thread as the executor).
It seems perverse that single-threaded-only executors are required to use TLS when others aren't, particularly considering the change to pass in wakers without requiring TLS in the first place.
I agree with you that it's important to support non-atomic queues and other forms of …
No, but you do need some other mechanism for queuing a message to the executor. One way to do this would be to track the thread ID of the executor and only allow queuing onto it when the ID of the task being woken is the same as the thread ID of the executor, ensuring that the wake queue can be safely accessed. There are a number of different variations on this design, but all involve some sort of thread-ID or thread-local, or else some way to ensure that accessing a nonatomic queue is safe inside of …
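A minimal sketch of that thread-ID-gated design (all names are hypothetical; the `unsafe impl Sync` is the point where the safety argument lives, and it is only sound because every access to the unsynchronized queue is guarded by the thread-ID check):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread::{self, ThreadId};

// Hypothetical single-thread-biased executor; names are illustrative.
struct Executor {
    home_thread: ThreadId,
    local_queue: RefCell<VecDeque<usize>>, // fast path, no synchronization
    remote_queue: Mutex<VecDeque<usize>>,  // slow path for foreign threads
}

// SAFETY (sketch): the non-Sync `local_queue` is only touched after the
// thread-ID check in `wake` has confirmed we are on `home_thread`.
unsafe impl Sync for Executor {}

impl Executor {
    fn new() -> Self {
        Executor {
            home_thread: thread::current().id(),
            local_queue: RefCell::new(VecDeque::new()),
            remote_queue: Mutex::new(VecDeque::new()),
        }
    }

    fn wake(&self, task_id: usize) {
        if thread::current().id() == self.home_thread {
            // Same thread as the executor: the unsynchronized queue is safe.
            self.local_queue.borrow_mut().push_back(task_id);
        } else {
            // Cross-thread wakeup: fall back to the mutex-protected queue.
            self.remote_queue.lock().unwrap().push_back(task_id);
        }
    }
}

fn main() {
    let exec = Executor::new();
    exec.wake(1); // taken on the local, unsynchronized path
    thread::scope(|s| {
        s.spawn(|| exec.wake(2)); // taken on the synchronized remote path
    });
    assert_eq!(exec.local_queue.borrow_mut().pop_front(), Some(1));
    assert_eq!(exec.remote_queue.lock().unwrap().pop_front(), Some(2));
}
```

The per-wakeup cost on the fast path is one `ThreadId` comparison; a real executor would also need a way to drain `remote_queue` and to wake the home thread when it is parked.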
I just did a small experiment to see how much overhead … With … Results on my x86-64 machine:
Note that this represents the worst possible case. In real-world scenarios the difference is expected to be smaller. These benchmarks are just heavy stress tests on the executor and the reactor; no other work is going on.
I think a key point here is that this last frontier of single-threaded-only … Could this conflict thus be resolved not by relying on TLS or atomics, but by implementing @cramertj's thread-ID approach as a debug-only assertion? This bends the rules around safety a little, but in practice it is unlikely to matter, as such a scenario won't be using multiple threads anyway.
I think that's an important point--this doesn't make it impossible to implement the niche case of purely single-threaded wakeups with an absolute minimum of overhead, just hazardous, and even then still likely less hazardous than any existing solution to those requirements.
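The debug-only assertion idea could look roughly like this (hypothetical names; `debug_assert_eq!` compiles to nothing in release builds, so the fast path pays nothing, while debug builds panic loudly on a cross-thread wake):

```rust
use std::cell::RefCell;
use std::thread::{self, ThreadId};

// Sketch of a single-thread-only wake queue guarded by a debug-only check.
struct LocalWakeQueue {
    home_thread: ThreadId,
    ready: RefCell<Vec<usize>>,
}

impl LocalWakeQueue {
    fn new() -> Self {
        LocalWakeQueue {
            home_thread: thread::current().id(),
            ready: RefCell::new(Vec::new()),
        }
    }

    fn wake(&self, task_id: usize) {
        // Only checked in debug builds; zero cost with --release.
        debug_assert_eq!(
            thread::current().id(),
            self.home_thread,
            "single-thread-only waker used from a foreign thread"
        );
        self.ready.borrow_mut().push(task_id);
    }
}

fn main() {
    let q = LocalWakeQueue::new();
    q.wake(3);
    assert_eq!(*q.ready.borrow(), vec![3]);
}
```

The "bends the rules" caveat is real: in release builds a misuse would be silent unsynchronized access, which is why this could only ever be an opt-in, clearly documented escape hatch rather than the default API.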
```rust
fn poll(self: Pin<&mut Self>, lw: &LocalWaker) -> Poll<Self::Output>;
/// [`Waker::into_waker`]: ../task/struct.Waker.html#method.into_waker
/// [`Waker::wake`]: ../task/struct.Waker.html#method.wake
fn poll(self: Pin<&mut Self>, lw: &Waker) -> Poll<Self::Output>;
```
`lw` => `waker`
@Thomasdezeeuw, @stjepang Thanks for providing actual data/measurements! That is always helpful! What I think would also be helpful is an implementation which actually shows the common case (a single-threaded executor which also supports cross-thread wakeups) with the new design and the proposed optimizations.
This isn't an actual concern-- waking extra times is always fine, and if the task is already gone then there's no need to wake it.
@cramertj I meant if you transfer the task as a raw pointer or … We could adjust the …
@Matthias247 I've created another gist with a complete implementation of a …
That is a 10x slowdown for the thread-safe waker. Also note that the …
@Thomasdezeeuw We have to be careful here. This might be more of a benchmark for crossbeam-channel than for the waker concept itself. I think the idea of the proposed simplification is to do something like this:

```rust
unsafe fn waker_wake(data: *const ()) {
    let executor = /* get the executor reference somehow */;
    executor.wake(data /* or some other task identifier */);
}

impl Executor {
    fn wake(&self, task_id: TaskId) {
        if self.thread_id == thread::current().id() {
            // Since wake was called from the thread that drives the executor,
            // we don't have to perform synchronization.
            self.local_tasks_ready_to_run.push(task_id);
            self.loop_more = true;
        } else {
            // We are on a different thread and have to perform a remote wakeup.
            // The queue for tasks pushed from here must be synchronized.
            {
                let mut task_queue = self.remote_tasks_ready_to_run.lock().unwrap();
                task_queue.push(task_id);
            }
            self.wakeup(); // executed via a write to a pipe or channel, waking up a condition variable, etc.
        }
    }
}
```

As is visible in the code, the … In the end the only options might be either atomically reference-counted tasks (with pointers to reference-counted executors), or static global executors. An approach where there is no `Arc` manipulation involved as long as we are only doing local things, and where we would move to …

Btw: if we check the code above, we can see that the only performance-relevant change compared to the old …
@Matthias247 You're right, the code effectively benchmarks adding to a … As for the initialisation, it actually uses an …
You're right, so hard in fact that I decided not to bother with it. The executor picks up notifications from the …
If you're limiting the executor to wakeups on a single thread, it's not hard to get a reference to the executor-- you can just use TLS.
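A sketch of what that looks like: the executor lives in a `thread_local!` slot, so a single-thread-only waker needs no back-pointer to it at all (all names here are illustrative, not from any real executor crate):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// Hypothetical single-thread-only executor reached through TLS.
#[derive(Default)]
struct Executor {
    ready: RefCell<VecDeque<usize>>,
}

thread_local! {
    // One executor per thread; wakers never carry an executor pointer.
    static EXECUTOR: Executor = Executor::default();
}

// A waker for task `task_id` just looks the executor up in TLS.
fn wake(task_id: usize) {
    EXECUTOR.with(|ex| ex.ready.borrow_mut().push_back(task_id));
}

fn main() {
    wake(7);
    EXECUTOR.with(|ex| assert_eq!(ex.ready.borrow_mut().pop_front(), Some(7)));
}
```

The cost per wakeup is one TLS lookup plus a `RefCell` borrow, which is the overhead being weighed against atomics throughout this thread.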
For what it's worth, a somewhat similar discussion was taking place 1.5 years ago in tokio-rs/tokio-rfcs#3 (comment)
@AZon8 That discussion is quite different since it introduced the requirement that the Tokio IO primitives themselves be threadsafe. This change does not add or remove any requirements around thread-safety, but trades off the ability to easily create a non-threadsafe executor without using TLS or statics for improved ergonomics and comprehensibility of the API in the more common case.
I guess the main outstanding question would be: Would we worry that for all executors that support remote wakeup the Wakers must be … If one does have a reasonable design for an executor with remote wakeup that doesn't use …
I'm going to go ahead and merge this change, as it seems like conversation has generally settled down. Though this change makes fewer direct affordances for single-thread-only executors, it is still possible to write an efficient single-thread-only executor with this design, and in return the resulting API is much simpler and easier to understand and implement.
@cramertj I disagree with your decision and I'm also disappointed in the way this discussion has been going. I've shown an implementation that is 10x faster with … Is there anything that could change the minds of people that only use …
@Thomasdezeeuw I haven't claimed that atomics have no overhead. I'm saying that atomics aren't necessary in order to write a single-threaded executor with this API.
@cramertj I'm sorry. You're right, you didn't make the claim that atomics have no overhead. @withoutboats made that claim in his blog post and I, incorrectly, assumed you agreed. @cramertj your PR mentions TLS, but that also has a cost. A microbenchmark shows the following on macOS:
Not as bad as an atomic add, but still not great. But if the majority agrees that having …
@Thomasdezeeuw Yup, there is a small overhead to using TLS. You can do better with …
Testing locally, I get 417.29ps for raw add, 2.478ns for …
Does Rust rely on libc to enable TLS?
Question: It seems to me most of the time when … The problem is the following: the method on …

```rust
impl Waker {
    fn wake_and_drop(self);
}
```
@stjepang Yeah, we debated having that be the default behavior of the …
@cramertj Do you think we should add it before stabilization? (my gut says "yes" because I expect it to be uncontroversial - it's a really easy performance win)
@stjepang Sure-- it seems like an obvious addition. Certainly less worrisome than the existing round of conversation XD Perhaps we could make …
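The saving a by-value `wake` buys can be illustrated with plain `Arc` reference counts (a toy sketch, not the actual `Waker` API: `wake_by_ref`/`wake_by_value` are hypothetical names standing in for `wake(&self)` versus `wake(self)`):

```rust
use std::sync::Arc;

struct Task;

// `&self` version: to hand an owned handle to the executor it must clone
// (one atomic increment), and the caller's copy is later dropped
// (one atomic decrement).
fn wake_by_ref(task: &Arc<Task>) -> Arc<Task> {
    Arc::clone(task)
}

// `self` version: the caller's reference is consumed and reused directly,
// so no atomic reference-count traffic is needed at all.
fn wake_by_value(task: Arc<Task>) -> Arc<Task> {
    task
}

fn main() {
    let t = Arc::new(Task);
    let owned = wake_by_ref(&t);
    assert_eq!(Arc::strong_count(&owned), 2); // clone bumped the count
    drop(t);
    let owned = wake_by_value(owned);
    assert_eq!(Arc::strong_count(&owned), 1); // by-value wake added nothing
}
```

Since most wakes are the last thing done with a `Waker`, consuming it in `wake_and_drop` turns the common clone-then-drop pair into a no-op, which is the "really easy performance win" mentioned above.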