
Remove LocalWaker and simplify the RawWakerVTable #16

Merged Jan 23, 2019 (1 commit)

Conversation

cramertj
Collaborator

@cramertj cramertj commented Jan 8, 2019

No description provided.

@Ralith

Ralith commented Jan 14, 2019

Is it correct that this makes it impossible to write an executor that is single-threaded and uses neither atomics nor thread-locals?

Having recently spent some time manually implementing 0.3 futures for Quinn, I don't understand why calling into_waker() as needed is deemed to be unreasonably high friction, though as I'm not implementing a custom executor I haven't had to touch the rest of the affected API.

@Ralith

Ralith commented Jan 14, 2019

Responding to a couple points in the blog post:

implement a singlethreaded event loop ... using atomic reference counts. Since your application is single threaded, on x86 at least this should have essentially no overhead over using nonatomic reference counts.

How do you safely handle a call on another thread without using TLS here too? Is there data to support the claim that there's never significant performance detriment?

is forcing every author of a manual future to deal with this complexity and unintuitiveness worth it to allow one particular of the multiple implementation strategy for a niche executor use case?

As a manual future author, I don't seem to be exposed to a significant amount of complexity by the current API; I call into_waker() every time I want to capture a Waker and that's that.

/// [`Waker`]: ../task/struct.Waker.html
fn poll(self: Pin<&mut Self>, lw: &LocalWaker) -> Poll<Self::Output>;
/// [`Waker::into_waker`]: ../task/struct.Waker.html#method.into_waker

This line should also be removed, since Waker::into_waker() does not exist anymore.

@Thomasdezeeuw

To demonstrate the impact of atomic operations I created a simple benchmark comparing atomic additions, found here. I ran the benchmark on a 2015 MacBook (2.2 GHz Intel Core i7); here are the results:

Add/Non atomic  time: [336.94 ps 338.56 ps 340.38 ps]
Add/atomic      time: [11.412 ns 11.465 ns 11.523 ns]

Note that non atomic is in ps, atomic in ns!

Doing some quick maths, this simple atomic addition is 30x slower on a (fairly) recent Intel processor. To me that is not a zero cost abstraction, especially if you're building a (mainly) single-threaded executor.

Regarding @withoutboats' blog posts, I thought they were well written, and part 1 should be integrated into the docs somehow.

Quote from the blog post part 2:

You receive a LocalWaker from the executor, rather than a Waker. It’s unclear without a lot of explanation whether you’re supposed to convert it to a Waker (the thing you probably really want) early or not.

The part about probably wanting a Waker led me to believe that you're mostly dealing with Send futures. If that is a correct understanding, then I'm not very surprised that having a LocalWaker type seems like an inconvenience. However, I for example deal mainly with non-Send futures, where not having to pay a heavy price (see above) for atomic operations is a huge relief.

Furthermore, the main question/unclear point I get from the blog post is "when/whether to convert a LocalWaker to a Waker?". To me this is just a documentation issue: if this question can be answered in the documentation, would that still warrant this change? My answer to the question would be: convert it late, and only when necessary.

@withoutboats
Collaborator

How do you safely handle a call on another thread without using TLS here too? Is there data to support the claim that there's never significant performance detriment?

Yes, this needs to use TLS (or at least the ThreadId API, but probably you would just access your executor as a TLS'd Option).

It's pretty difficult to prove a negative about performance characteristics, but I would point out two things:

  • I haven't seen anyone provide convincing evidence that using TLS is too expensive for their use case.
  • All of our excellent benchmarks showing futures are fast enough have involved using TLS in the poll/wait/wake cycle.

Given that we know with the simpler API we can provide excellent support for a wide range of applications from embedded hardware to device drivers to network services to parallelized computing, if there is really a niche use case for which this API is inadequate, I am comfortable considering that outside of the scope of the async/await feature and having to use a different manner of implementing its state machines.

The part about probably wanting a Waker let me to be believe that you're mostly dealing with Send futures. If that is a correct understanding that I'm not very surprised that having a LocalWaker type seems like an inconvenience.

There's not really a connection between futures being Send and this aspect of the system. Even if your future isn't Send, using the default setup (e.g. tokio in any configuration), your reactor will be waking the executor from a different thread. This means you have to have a thread-safe waker to use the tokio reactor. Only a very tightly coupled implementation providing both a reactor and an executor would fall outside of this thread-safety requirement, because the reactor and executor would need to share state (beyond just the waker API) to operate on a single thread.

But I also want to point out that it's very easy to assume that there is a connection between which waker you "need" and whether your future is Send or not - your belief here wasn't unreasonable at all. That's a part of the complexity burden of this API: there are several orthogonal distinctions that are not at all obvious and that become easy to conflate in the mind of someone trying to understand it.

@ghost

ghost commented Jan 15, 2019

@withoutboats

Even if your future isn't Send, using the default set up (e.g. tokio in any configuration), your reactor will be waking the executor from a different thread.

Just to provide some additional context on Tokio, executors, reactors, !Send futures, and Rc vs Arc...

Before this PR landed, Tokio's default runtime used to have N worker threads that execute tasks and 1 dedicated thread for the reactor. There were N+1 threads total. That means every time the reactor wanted to wake a task, it would have to send the notification to one of the N worker threads.

Today, the situation is different. Tokio now has N worker threads that execute tasks and N reactors, one per thread. That means there are only N threads total, and every thread is responsible for executing tasks and driving its reactor. When a reactor wants to wake a task, in most cases it will send the notification to the same thread. On relatively rare occurrences, however, it may have to send the notification to a different thread (because idle threads steal tasks from working threads).

Tokio has another, single-threaded runtime, also known as runtime::current_thread. Unlike the default runtime, this one can spawn !Send futures. It executes tasks and drives its reactor on the same thread it is run on. Interestingly, spawned tasks are wrapped into an Arc rather than Rc. The reason is that in futures 0.1 we don't have the distinction between waker and local wakers. If we had one, we'd probably make the reactor generic (maybe as in Reactor<W: Wake>) and... I don't know, things would get messy real quick.

My opinion is that removing LocalWaker is a good call. I'm sad that we only have OK rather than great support for the !Send world, but handling both Send and !Send generically at the same time is a really tough problem to solve. It's akin to having a big framework that pervasively uses reference counting and then making it generic over Rc<T> and Arc<T>. It's just really hard.

Related issue on generic executors: tokio-rs/tokio#625

@withoutboats
Collaborator

@stjepang My recollection was mistaken; I thought that tokio's default was now to have 2N threads (a separate reactor thread and executor thread, rather than running the reactor on the same thread as the executor).

@Ralith

Ralith commented Jan 15, 2019

Yes, this needs to use TLS (or at least the ThreadId API, but probably you would just access your executor as a TLS'd Option).

It seems perverse that single-threaded-only executors are required to use TLS when others aren't, particularly considering the change to pass in wakers without requiring TLS in the first place.

@cramertj
Collaborator Author

@Thomasdezeeuw

I agree with you that it's important to support non-atomic queues and other forms of !Send executors. However, I think the LocalWaker/Waker split is intrusive enough to the common case, both for allowing Wakers to be sent across threads and for implementing and understanding the RawWaker interface, that it's worth instead requiring TLS/statics/etc. in order to get the fully-optimized single-threaded case (of which I give an example in this PR).

@cramertj
Collaborator Author

@Ralith

Is it correct that this makes it impossible to write an executor that is single-threaded and uses neither atomics nor thread-locals?

No, but you do need some other mechanism for queuing a message to the executor. One way to do this would be to track the thread ID of the executor and only allow queuing onto it when the ID of the thread doing the waking is the same as the thread ID of the executor, ensuring that the wake queue can be safely accessed. There are a number of different variations on this design, but all involve some sort of thread-ID or thread-local, or else some way to ensure that accessing a nonatomic queue is safe inside of wake.

@cramertj
Collaborator Author

@Ralith

Having recently spent some time manually implementing 0.3 futures for Quinn, I don't understand why calling into_waker() as needed is deemed to be unreasonably high friction, though as I'm not implementing a custom executor I haven't had to touch the rest of the affected API.

into_waker() isn't so bad when you know what to do, but the additional type in the API that doesn't add any capabilities in the common case (where wake can occur on multiple threads, whether or not the executor itself is multithreaded) is really unfortunate, and it adds a lot of complexity to the RawWaker API.

@ghost

ghost commented Jan 15, 2019

I just did a small experiment to see how much overhead Arc<Task> has over Rc<Task>. The experiment adds two benchmarks for tokio-current-thread and then changes all relevant Arcs into Rcs inside tokio-current-thread.

With Arc<Node>: https://github.com/stjepang/tokio/tree/experiment-arc
With Rc<Node>: https://github.com/stjepang/tokio/tree/experiment-rc
Benchmarks: https://github.com/stjepang/tokio/blob/experiment-arc/tokio-current-thread/benches/basic.rs
Run: cd tokio-current-thread; cargo bench

Results on my x86-64 machine:

name             arc ns/iter  rc ns/iter  diff ns/iter   diff %  speedup
reactor_driving  40,422,283   38,254,291    -2,167,992   -5.36%   x 1.06
task_switching   10,134,551   9,025,576     -1,108,975  -10.94%   x 1.12

Note that this represents the worst possible case. In real-world scenarios the difference is expected to be smaller. These benchmarks are just heavy stress tests on the executor and the reactor; no other work is going on. task_switching is yielding a bunch of tasks a bunch of times, while reactor_driving is the same in addition to polling a dummy I/O handle on every yield.

@rpjohnst

I think a key point here is that this last frontier of single-threaded-only Wakers should never need any synchronization. (This is by definition, since any scenario that does use synchronization/threads already appears to be supported quite well, as described in @withoutboats's second blog post.)

Could this conflict thus be resolved not by relying on TLS or atomics, but by implementing @cramertj's thread ID approach as a debug-only assertion? This bends the rules around safety a little, but in practice it is unlikely to matter as such a scenario won't be using multiple threads anyway.

@Ralith

Ralith commented Jan 16, 2019

I think that's an important point: this doesn't make it impossible to implement the niche case of purely single-threaded wakeups with an absolute minimum of overhead, just hazardous, and even then still likely less hazardous than any existing solution to those requirements.

fn poll(self: Pin<&mut Self>, lw: &LocalWaker) -> Poll<Self::Output>;
/// [`Waker::into_waker`]: ../task/struct.Waker.html#method.into_waker
/// [`Waker::wake`]: ../task/struct.Waker.html#method.wake
fn poll(self: Pin<&mut Self>, lw: &Waker) -> Poll<Self::Output>;

lw => waker

@Matthias247

@Thomasdezeeuw , @stjepang Thanks for providing actual data/measurements! That is always helpful!

What I think would also be helpful is an implementation which actually shows the common case (a single-threaded executor which also supports cross-thread wakeups) with the new design and the proposed optimizations.

I suspect it might not look quite like what is expected. E.g. if thread-locals are available, they can only be used to determine whether the current thread is the original main loop, and whether e.g. a Mutex lock for re-enqueuing a task can be avoided. But they don't help with finding the actual executor, or with making sure the task ID is actually valid long enough. I suspect that whatever gets stored inside Wakers must be an Arc, independent of whether the task gets woken from the main thread (likely case) or from another thread (unlikely case). The futures concept, which allows stale Wakers after the task has already run to completion, combined with the fact that a Waker might at any point in time be transferred to another thread without the possibility for manipulation, seems to require this.

This is essentially more expensive than some classical event loops (e.g. libuv) which make the common wakeup case cheap and only require more overhead for the less common remote-wakeup case. But Rust's approach also provides more interoperability between executors and different IO sources, and it supports cancellation at any time versus having strict run-to-completion semantics. So the comparison is not fully fair.

@cramertj
Collaborator Author

making sure the task-id is actually valid long enough

This isn't an actual concern-- waking extra times is always fine, and if the task is already gone then there's no need to wake it.

@Matthias247

@cramertj I meant that if you transfer the task as a raw pointer or Rc, it might not be valid long enough. You are right that it doesn't matter in the case of raw IDs, but then the question is still how to find the associated executor. It likely must be global, since we can't stuff two pointers (one for the task ID and one for the executor) into the RawWaker.

We could adjust the RawWaker to do that, but I'm not yet sure if it makes sense. Typically in C one type-erased pointer is good enough. But to keep that one always valid, we would require it to be an Arc (or global). Maybe the fact that Rust's system requires one to deal with outdated tasks justifies two pointers, even though it seems most current executors just use one Arc.

@Thomasdezeeuw

@Matthias247 I've created another gist with a complete implementation of a LocalWaker and Waker here: https://gist.github.com/Thomasdezeeuw/8ca4df953d00d68f7d1960a6255ccd8c. The results I got on my machine:

Waker/Wake local        time:   [5.3191 ns 5.3286 ns 5.3399 ns]
Waker/wake              time:   [53.726 ns 53.834 ns 53.955 ns]

That is a 10x slowdown for the thread-safe waker. Also note that the Waker implementation isn't even complete, as it doesn't wake other threads, it just adds events. To actually wake other threads it would have to write to a Unix pipe or Linux's eventfd, meaning it would add a write system call, making it even slower.

@Matthias247

@Thomasdezeeuw We have to be careful here. This might be more of a benchmark for crossbeam-channel than for the waker concept itself.

I think the idea of the proposed simplification is to do something like this:

unsafe fn wake_wake(data: *const ()) {
    let executor = /* get the executor reference somehow */;
    executor.wake(data /* or some other task identifier */);
}

impl Executor {
    fn wake(&self, task_id: TaskId) {
        if self.thread_id == thread::current().id() {
            // Since wake was called from the thread that drives the executor,
            // we don't have to perform synchronization.
            self.local_tasks_ready_to_run.push(task_id);
            self.loop_more = true;
        } else {
            // We are on a different thread and have to perform a remote wakeup.
            // The queue for tasks woken from here must be synchronized.
            {
                let mut task_queue = self.remote_tasks_ready_to_run.lock().unwrap();
                task_queue.push(task_id);
            }
            self.wakeup(); // Executed via a write to a pipe or channel, waking up a condition variable, etc.
        }
    }
}

As is visible in the code, the // get the executor reference somehow part is still tricky.
It's easy to implement if all tasks (and thereby data pointers) are atomically reference-counted, but otherwise it raises the question of how to get to that data. You could use a global array, which seems like a possibility. But in a dynamic system that also raises the question of how an executor can be dynamically registered in this array, and whether it could ever be removed; maybe through some atomic swap of that executor reference to a no-op executor.

In the end the only options might be either atomically reference-counted tasks (with pointers to reference-counted executors), or static global executors.
If we had two pointers we could store a refcounted executor plus a task ID. But that doesn't sound better yet: we are still doing Arc manipulations on each Waker clone. Since it's then even the executor Arc that gets manipulated, it might be even worse, since more tasks access the same pointer atomically.

An approach where there is no Arc manipulation involved as long as we are only doing local-things, and where we would move to Arc once things get remote might sound ideal. But that would be quite complicated to implement and maintain, and is only possible with APIs which differentiate between LocalWaker and Waker.

Btw: if we check the code above we can see that the only performance-relevant change relative to the old local_wake mechanism is the additional check of thread_id (or the read of a thread-local variable). That's arguably not too bad for the 99% use case, which is single-threaded executors that support remote wakeup, plus multithreaded executors.

@Thomasdezeeuw

@Matthias247 You're right, the code effectively benchmarks adding to a Vec vs sending over a crossbeam channel. But this is a more realistic implementation of LocalWaker compared to the benchmark I provided earlier, and it is how I would implement waking right now (if the VTable-related changes made it into the standard library).

As for the initialisation, it actually uses an AtomicUsize with which the main thread can "allocate" slots for the worker threads it starts. Deallocation doesn't happen, as the system is expected to run for as long as the application is running, so even if a worker thread crashes its Vec and Sender stay in the array.

As it's visible in the code, the // get the executor reference somehow is still tricky.

You're right, so hard in fact that I decided not to bother with it. The executor picks up notifications from the Vec/channel, but the waker never has access to the executor.

@cramertj
Collaborator Author

If you're limiting the executor to wakeups on a single thread, it's not hard to get a reference to the executor-- you can just use TLS.

@AZon8

AZon8 commented Jan 22, 2019

For what it's worth, a somewhat similar discussion took place 1.5 years ago: tokio-rs/tokio-rfcs#3 (comment)

@cramertj
Collaborator Author

@AZon8 That discussion is quite different, since it introduced the requirement that the Tokio IO primitives themselves be thread-safe. This change does not add or remove any requirements around thread-safety, but trades away the ability to easily create a non-thread-safe executor without using TLS or statics, in exchange for ergonomics and comprehensibility of the API in the more common case.

@Matthias247

If you're limiting the executor to wakeups on a single thread, it's not hard to get a reference to the executor-- you can just use TLS.

That works only for the pure single-threaded use case. For single-threaded with remote wakeups it doesn't, since the remote thread won't see the executor in its TLS.

I guess the main outstanding question would be: should we worry that, for all executors that support remote wakeup, the Wakers must be Arcs in all reasonable systems? The more I think about it, the less I worry, because in terms of allocation overhead Arc vs Box doesn't make a huge difference, and refcount changes also only happen when the task gets blocked and a context switch needs to happen.

If someone does have a reasonable design for an executor with remote wakeup that uses neither Arc nor globals, I would be interested to see it.

@cramertj
Collaborator Author

I'm going to go ahead and merge this change, as it seems like conversation has generally settled down. Though this change makes fewer direct affordances for single-thread-only executors, it is still possible to write an efficient single-thread-only executor with this design, and in return the resulting API is much simpler and easier to understand and implement.

@cramertj cramertj merged commit 05db198 into aturon:future Jan 23, 2019
@cramertj cramertj deleted the less-waker branch January 23, 2019 17:45
@Thomasdezeeuw

@cramertj I disagree with your decision and I'm also disappointed in the way this discussion has gone. I've shown an implementation that is 10x faster with LocalWaker, but I feel it was ignored, and the general consensus of "atomics have no overhead" remained, which I also feel I've disproved.

Is there anything that could change the minds of people who only use Waker (not LocalWaker), to show that having LocalWaker is a clear benefit for people who mainly use single-threaded executors?

@cramertj
Collaborator Author

@Thomasdezeeuw I haven't claimed that atomics have no overhead. I'm saying that atomics aren't necessary in order to write a singlethreaded executor with this API.

@Thomasdezeeuw

@cramertj I'm sorry. You're right, you didn't make the claim that atomics have no overhead. @withoutboats made that claim in his blog post and I, incorrectly, assumed you agreed.

@cramertj your PR mentions TLS, but that also has a cost. A micro-benchmark shows the following on macOS:

Add/Non TLS             time:   [336.16 ps 337.33 ps 338.86 ps]
Add/TLS                 time:   [3.6553 ns 3.6748 ns 3.6959 ns]

Not as bad as an atomic add, but still not great.

But if the majority agrees that having LocalWaker and Waker is too high a complexity price to pay, then we can end the discussion.

@cramertj
Collaborator Author

@Thomasdezeeuw Yup, there is a small overhead to using TLS. You can do better with #[thread_local] than thread_local!, but it requires extra caution and is unstable.

@cramertj
Collaborator Author

Testing locally, I get 417.29 ps for a raw add, 2.478 ns for thread_local!, and 1.6039 ns for #[thread_local].

@kvinwang

kvinwang commented Jan 25, 2019

Does Rust rely on libc to enable TLS?
uClibc, which is widely used in network/communication devices, does not support TLS. Should that be considered?

@ghost

ghost commented Mar 19, 2019

Question: it seems to me that most of the time when wake() is called, the waker is immediately dropped. Should the vtable also have a wake_and_drop() function?

The problem is the following: wake() will usually increment the refcount for the future and then push it into the queue of futures to poll, while dropping Waker usually decrements the refcount, which means we're wasting two atomic operations.

The method on Waker might look like this:

impl Waker {
    fn wake_and_drop(self);
}

@cramertj
Collaborator Author

@stjepang Yeah, we debated about having that be the default behavior of the wake function, actually. It seems like it would be a fine thing to add (and can be added backwards-compatibly if my future-proofing PR lands).

@ghost

ghost commented Mar 19, 2019

@cramertj Do you think we should add it before stabilization? (my gut says "yes" because I expect it to be uncontroversial - it's a really easy performance win)

@cramertj
Collaborator Author

cramertj commented Mar 19, 2019

@stjepang Sure-- it seems like an obvious addition. Certainly less worrisome than the existing round of conversation XD

Perhaps we could make wake() do that by default, but wake_ref() or wake_borrowed() do the other (more expensive) thing?
