Better send buffer management: await sock.writeable() and TCP_NOTSENT_LOWAT #83
Okay, after reading up on it a bit more, it sounds like …
Wow, this is pretty interesting. I have two immediate questions: I wonder how asyncio handles writes? I wonder how this would work in synchronous thread-based coding? Brief thought: I wonder if this could be fixed via inheritance or some other specialization of the socket class? For example, one implementation that tries the send first. Another implementation that awaits first. Definitely going to think about this...
By my reading of the asyncio source, it seems that it also performs a send() immediately followed by a check for blocking. For example, in asyncio/selector_events.py:
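As a rough paraphrase of the pattern in the transport's write() method (not the verbatim asyncio source; the attribute names here are schematic):

    # Rough paraphrase, not the verbatim asyncio source: try the send
    # optimistically, and only start polling for writability if the kernel
    # refuses the data or accepts only part of it.
    def write(self, data):
        if not self._buffer:
            try:
                n = self._sock.send(data)
            except (BlockingIOError, InterruptedError):
                n = 0
            data = data[n:]
            if not data:
                return   # everything fit; we never asked "is it writeable?"
            # Partial/failed send: now register for writability notifications.
            self._loop.add_writer(self._sock.fileno(), self._write_ready)
        self._buffer.extend(data)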
Good point, I just filed a bug on asyncio too to let them know :-)
My first reaction is that this sounds really overcomplicated -- is it really so bad to make it Just Work?
Adding an await _write_wait() introduces about an 85% performance penalty on the Curio echo-server benchmark.

In the big picture, in what scenarios is this TCP_NOTSENT_LOWAT option being used? I can imagine a lot of situations where I would want curio to behave as it does now. For example, services where it's all based on a request/response cycle like HTTP, RPC, etc. TCP_NOTSENT_LOWAT seems like a pretty special case to me. It should definitely be possible for Curio to support it in some way if someone wanted, but I'm not sure I'd want to add the penalty of the extra write wait to everything in order to do it. One option would be to add a new socket method for explicit write waiting if you needed it. For example, something like the sketch below.
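A sketch of what such an explicit, opt-in write wait might look like in application code (using the writeable() spelling adopted later in this thread; the push() helper is purely illustrative):

    # Opt-in write waiting: the application asks for the wait explicitly,
    # so send() itself can keep its current try-first behaviour.
    async def push(sock, data):
        await sock.writeable()    # wait until the kernel reports the socket writeable
        await sock.sendall(data)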
Another option might be to have Curio intercept setsockopt() and look for TCP_NOTSENT_LOWAT. Based on that, it could enable the extra wait implicitly. Or this could be turned into some kind of more general method/configuration option for sockets that makes Curio wait first.
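A minimal sketch of what that interception could look like (the wrapper class and the _wait_before_send flag are assumptions for illustration, not Curio internals):

    import socket

    class WrappedSocket:
        def __init__(self, raw_sock):
            self._socket = raw_sock
            self._wait_before_send = False   # consulted by send(): wait first if True

        def setsockopt(self, level, optname, value):
            # Notice the application turning on TCP_NOTSENT_LOWAT and switch
            # this socket into wait-before-send mode implicitly.
            lowat = getattr(socket, "TCP_NOTSENT_LOWAT", None)
            if level == socket.IPPROTO_TCP and lowat is not None and optname == lowat:
                self._wait_before_send = True
            return self._socket.setsockopt(level, optname, value)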
Having just skimmed, I do have the same question -- when is this option actually useful?
Oof. Well, good to know; thanks for checking. Did you also try the version where we do a quick synchronous check first?

I have mixed feelings about using the echo-server benchmark as a guide in cases like this, because its overhead is so incredibly low that I think it can easily push in the wrong direction. If the echo server starts out at 40 us to handle a "request/response cycle", then that's such an incredibly small baseline that it really constrains what you can do. Doing anything at all extra will make that number much worse (going from 40 us -> 50 us is terrible!), while trying to optimize it to post more impressive numbers requires really heroic feats because there's so little headroom to work with -- pretty soon the only way to improve is to start cutting features. But any real protocol has much higher protocol overhead than that, just to do anything.

Still... I get that these benchmarks are important and that 85% is a lot! (I've thought about this in particular in the context of …)
The tl;dr is that TCP_NOTSENT_LOWAT is basically always at least as good as the alternative and should probably be the default. I'm not sure why it isn't (though I note that the people who implemented it did provide a global knob so that you can make it the system-wide default, suggesting that they agree that this isn't ridiculous -- my guess is that the main reason is that there's a parameter you need to set and they don't have any auto-tuning for it yet). And TCP_NOTSENT_LOWAT's killer use-case is HTTP/2, where you really want it. Full explanation follows; if you want to skip it, then scroll down to the next bit of quoted text :-)

So, background (maybe review, maybe not -- I don't know how much you keep up with network engineering drama :-)). A major problem in using TCP currently is the presence of too much buffering (aka "bufferbloat"). The point of a buffer is to smooth out bursts, so the ideal buffer should fill up when a burst arrives, then trickle down so that it reaches empty just before the next burst arrives -- that way there's always data to send, but no unnecessary delays. The problem with TCP is that if you're sending data fast enough that the network is your bottleneck, then it will happily fill up all the buffers in its path completely full. At that point, they're not buffers, they're just delay lines.

(Think of a grocery store checkout where people arrive at a rate of exactly 1/minute, and where the clerk can process exactly 1/minute. If there's no queue, then each person walks up, gets handled, and leaves 1 minute later, and that's a steady state. If there's a queue of 100 people, then everything still moves at the same speed, and it's still a steady state, but each individual person has to wait 100 minutes. Which is super frustrating! You have the capacity to get everyone through without waiting; it's just the "buffer" that's killing you. I feel like I often encounter telephone service centers that work on this model.)

This is a multi-faceted problem because these kinds of buffers show up in all sorts of places (the per-socket send buffer, your kernel's network driver, routers out on the internet, ...), and everyone suddenly realized back in 2009 that oops, no-one was paying attention and all of them are broken in the same way. Since then there's been a concerted effort to fix these, and things have been getting a lot better. (Mostly irrelevant but super interesting tangent: just last month Google released a major rework of TCP's flow control algorithms that they're hoping to get deployed everywhere to fix a lot of these problems.)

TCP_NOTSENT_LOWAT is aimed at addressing a particular one of these buffers: the per-socket send buffer. The problem with this buffer is that it actually serves two different purposes: it holds data that the application has written to the socket but that is still waiting to be sent on the network, and it also holds data that has already been sent, but that hasn't been acknowledged yet, so the kernel has to hold onto it in case it needs to be resent. And this is a problem because you generally need only a small buffer for the unsent data, and a large buffer for the sent-but-unacknowledged data. (This is because the unsent-data buffer just needs to hold enough so that it doesn't run dry between each iteration of the process's event loop, so like, a few milliseconds worth of data at most.
The unacknowledged-data buffer OTOH needs to hold at least one round-trip-time worth of data, maybe more if conditions are bad, so hundreds or even thousands of milliseconds worth of data. Totally totally different things.)

But since the kernel doesn't distinguish between these two kinds of data, traditionally you just get one giant buffer for both, which means your unsent data ends up spilling over into the space that really should be used for sent-but-unacknowledged data, and if you just keep dumping data into it until it fills up, then now you've got one of those queues where your data will be waiting in line for hundreds of milliseconds for no reason. Turning on TCP_NOTSENT_LOWAT fixes this: it effectively tells the kernel to keep track of unsent data separately from sent-but-unacknowledged data, and then you can use an appropriately sized buffer instead of a ludicrously oversized one. (In some experimenting with simple rate-throttled proxy servers over loopback on my laptop, I was getting ~3-5 second latencies, measured as the time from when I sent data from one process to when it was received in another. Over loopback. After turning on TCP_NOTSENT_LOWAT that drops down to milliseconds or better.)

So that's what I mean about TCP_NOTSENT_LOWAT being basically a Good Thing: it's fixing a bug. But, of course, there are plenty of cases where this bug is still a bug, but not one that really matters. If you're not saturating the pipe (think: IRC, or interactive ssh), then queues don't form anyway, so this doesn't matter. Or if you're doing a bulk transfer where no-one's paying attention (think: bittorrent), then latency doesn't really matter much (except for the minor issue that each of these buffers can end up wasting a few megabytes of kernel memory for no reason).

OTOH, an example of where this really matters is HTTP/2. The big idea of HTTP/2 is that web pages are made out of lots of parts (HTML, CSS, JS, images, ...), and instead of making lots of separate TCP connections to fetch these like in HTTP/1.1, we're going to make a single connection and then multiplex all the downloads over that one connection. But not all of these parts are created equal -- browsers go to a lot of trouble to try and fetch the important parts of the page first, because this has a direct effect on perceived web page speed. (If a web page shows you the main content after 200 ms then you don't care if it's still loading some images down at the bottom of the page; but if it loads those images first before the rest of the page then that's terrible.) So to handle this, HTTP/2 has a sophisticated system for prioritizing which resources get sent first.

The way this ends up working is, each time the socket becomes writeable, you look around and find the highest priority resource that's ready to transmit, and you send a chunk of that. But once you've passed it off to the kernel, then you're committed. So you want to delay this decision as long as possible, because otherwise you risk committing to sending some low priority resource that's ready early, and then a high priority resource becomes ready but it's too late to take that back. In the supermarket analogy again, imagine that they put new items on sale at random times, but once you get in line you can't change what's in your cart.
If there's a long line, you run the risk that things you want to buy come on sale while you're waiting in line and it's too late to get them; if the line is short, then you can wait until right before you check out before grabbing what you want. Here's a war story of this bug biting Google Maps. (And this also explains why Google is throwing engineers at fixing TCP in general -- they care a lot about HTTP/2.)

Another example where you could see this kind of thing: if you use ssh in persistent connection mode, then you can end up with a single ssh connection that's multiplexing a bulk file transfer with scp and a regular interactive terminal at the same time. If the bulk file transfer fills up your buffer, then now all your interactive keystrokes have to wait in that queue, and it can basically become unusable.
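For concreteness, turning the option on from Python looks roughly like this (a sketch: the fallback constant is the Linux value and the 16 KiB low-water mark is just an illustrative choice):

    import socket

    # socket.TCP_NOTSENT_LOWAT isn't exported on every Python/OS combination;
    # 25 is the Linux value (an assumption -- check your platform's headers).
    TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

    sock = socket.create_connection(("example.com", 80))
    # Ask to be woken up once the *unsent* backlog drops below ~16 KiB,
    # instead of being allowed to stuff the whole (much larger) send buffer.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 16384)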
I guess there are two use cases:
I think the proposal of enabling it for some sockets as an extra step works fine for the first group (they're already used to jumping through hoops to set up sockets). So that would certainly be a step forward. It Would Be Nice(tm) if both cases just worked. Unfortunately, though, I just checked and it doesn't look like there's any reliable way to query a socket to find out whether TCP_NOTSENT_LOWAT is enabled (at least as of Linux 4.6.0) :-(. You can do …
This presentation from Apple (slides -- starting at page 66, video / transcript) has some useful further information. I'll summarize for posterity:

They point out another example of where better controlling the socket send buffer is critically important: applications that can dynamically adjust the data they send to match the network. E.g., streaming video where if you're short on bandwidth you can lower the video quality, or something like VNC where you can drop the refresh rate to preserve interactivity. (Their demo is Apple Screen Sharing having 3 seconds of lag on a connection with a 35 ms ping.) xpra is a Python remote display app that works like this. Basically the core loop goes like:

    while True:
        await sock.writeable()
        # at the last possible moment, take a screenshot
        data = take_screenshot()
        await sock.sendall(data)

(Notice: an app like this actually needs …)

What's interesting about this example is it explains why the designers decided to make …

And finally, they argue that TCP_NOTSENT_LOWAT is basically always a win, it's just a question of how much. In fact, I'll go ahead and quote:
That all makes perfect sense to me, so I've changed my mind :-). Methods like …

I'll update the issue title to match.
Wow! Thanks for writing this up. This is really interesting. If anything, it reconfirms my view that Curio shouldn't be doing its own buffering. That writeable() method can definitely be added (I'm thinking the select() approach will be much more efficient, but will need to experiment).
but is that realistic for typical network servers/clients?
I've added a socket.writeable() method that waits until a socket is writeable. It uses select(). The performance impact of it seems almost negligible when used. In light of earlier discussion, it seems that it should still be separate from send() though.
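A minimal sketch of that shape (not the actual Curio implementation; _write_wait is the trap mentioned earlier in the thread and its exact signature here is an assumption):

    import select

    async def writeable(self):
        # Poll with a zero timeout first; only suspend the task if the
        # kernel does not currently consider the socket writeable.
        while True:
            _, writable, _ = select.select([], [self._socket], [], 0)
            if writable:
                return
            await _write_wait(self._socket)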
On further investigation, it turns out that my original statements were actually wrong/incomplete. On Linux, TCP_NOTSENT_LOWAT does actually affect send() as well, not just select()-style writability polling.

However! AFAICT, on macOS, it is actually true that TCP_NOTSENT_LOWAT affects the writability polling but not send() itself. Anyway, none of this really affects any of the conclusions here, I just wanted to correct that misinformation in case someone stumbles across this thread in the future.
Closing this for now.
The case where select and send disagree hints at the one time where TCP_NOTSENT_LOWAT could hurt -- if the application already has the data, and sending it to the kernel now will let the app do some cleanup (or exit, or make further progress while the CPU is otherwise idle). That said, using asynchronous output for an app like this is enough of a corner case that TCP_NOTSENT_LOWAT should probably be the default.
Update: based on the discussion below, I now think that the resolution is that Curio should ideally:

- provide await sock.writeable() (or whatever spelling is preferred), and
- enable TCP_NOTSENT_LOWAT on sockets whenever possible/convenient, with a buffer size of ~8-16 KiB. This only provides full benefits for code that uses sock.writeable(), but it provides some benefits regardless.

Original report follows, though note that a bunch of my early statements are incomplete/wrong:
So I just stumbled across a rather arcane corner of the Linux/OS X socket API while trying to understand why I'm seeing weird buffering behavior in some complicated curio/kernel interaction.
As we know, calling socket.send doesn't immediately dump data onto the network; instead the kernel just sticks it into the socket's send buffer, to be trickled out to the network as and when possible. Or, if the send buffer is full, then the kernel will reject your data and tell you to try again later. (Assuming non-blocking mode, of course.)

But for various reasons, it turns out that the kernel's send buffer is usually way larger than you actually want it to be, which means you can end up queueing up a huge amount of data that will take forever to trickle out, introducing latency and causing various problems.

So at least Linux and OS X have introduced the euphoniously named and terribly documented TCP_NOTSENT_LOWAT feature. Basically what it does is let you use setsockopt to tell the kernel -- hey, I know that you're willing to buffer, like, 5 MiB of data on this socket. But I don't actually want to do that. Can you only wake me up when the amount of data that's actually buffered drops below, like, 128 KiB, and I'll just top it up to there? (This is a bit of a simplification because there are some subtleties about how you budget for data that's queued to send vs. data that's been sent-but-not-yet-acked, but it's good enough to go on.)

But it turns out that TCP_NOTSENT_LOWAT only affects polling-for-writability. So you absolutely can have a socket where select and friends say "not writeable", but at the same time send is happy to let you write lots and lots of data. And this is bad, because it turns out literally the only way the kernel is willing to give you the information you need to avoid over-buffering is with that "not writeable" signal.

So if you want to avoid over-buffering, then you have to always call select-or-whatever before you call send, and only proceed if the socket is claimed to be writeable. And unfortunately, right now, curio never does this: it always tries calling send first, and then only if that fails does it block waiting for writeability.
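Curio's current behaviour, schematically (a sketch, not the actual source; _write_wait is the trap mentioned elsewhere in this thread and the attribute names are assumptions):

    async def send(self, data, flags=0):
        while True:
            try:
                return self._socket.send(data, flags)   # optimistic: try the send first
            except BlockingIOError:
                await _write_wait(self._socket)         # only wait if the kernel said "full"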
socket.send
, check that the kernel thinks the socket is writeable.The obvious way to do this would be to replace the current implementation of
Socket.send
with something like:(and similarly for the other methods. I guess
sendall
could usefully be rewritten in terms ofsend
, and I'm not sure what if anything would need to be done forsendmsg
andsendto
. TCP_NOTSENT_LOWAT doesn't apply to UDP, so forsendto
it maybe doesn't matter, but I guess might be better to be safe? And I don't remember whatsendmsg
is for at all.)The one potential downside of this strategy that I can see is that right now,
send
never blocks unless the write actually fails, and if we add anawait _write_wait
then it will generally suspend the coroutine for one "tick" before doing the actual write, even when the write could have been done without blocking. I guess this might actually be a good thing in that it could promote fairness (think of a coroutine that's constantly writing to a socket with a fast reader, so the writes always succeed and it ends up starving everyone else...), but it might have some performance implications too.The alternative, which preserves the current semantics, would be to do a quick synchronous check up front, like:
How I managed to confirm this for myself (Linux specific, and mostly recording this for reference, not really any need to read it):
Set the system-wide low-water mark:

    echo 128000 | sudo tee /proc/sys/net/ipv4/tcp_notsent_lowat

Run this in one terminal:

    socat TCP-LISTEN:4002 STDOUT

Start filling up our send buffer. At first the data goes into our send buffer and then immediately drains into the other side's receive buffer, but since the other side is asleep, eventually this stops and our send buffer starts filling up:
Okay, there are 121366 bytes enqueued in our send buffer. That's a little bit below the TCP_NOTSENT_LOWAT that we set, so our socket should still be writeable:
Put some more data in, pushing it over the 128000 limit:
Now if we check using select, it's not writeable:

BUT if we call send, then no problem, we can definitely write more data to this "non writeable" socket:
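For readers who want to reproduce the last two checks, they look roughly like this (a sketch: the port matches the socat listener above, the byte count is arbitrary, and the results in the comments are the ones described in the prose):

    import select
    import socket

    sock = socket.create_connection(("127.0.0.1", 4002))
    sock.setblocking(False)

    # ... queue up data with sock.send(...) until past the 128000-byte
    # TCP_NOTSENT_LOWAT mark set above ...

    _, writable, _ = select.select([], [sock], [], 0)
    print(writable)                  # reported above: [] -- select says "not writeable"
    print(sock.send(b"x" * 10000))   # reported above: send still accepts the data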