Drop highWaterMark entirely?? #13
hwm can be completely userland. The code below has obvious buffering bugs because it was written in 10 minutes, but in theory:

```js
function HighWaterMark(source, highWaterMark) {
  // should really be backed by an ArrayBuffer
  var buffer = []
  var pending = null
  var reading = false

  // create a new stream; pull out of source on start
  var stream = new ReadableStream(function () {
    fill()
  })

  // intercept read; assume it never gets called in parallel
  stream.read = function read() {
    // store the resolvers for this read somewhere
    pending = defer()
    if (buffer.length > 0) drain()
    // if not already reading, pre-emptively fetch
    if (!reading && buffer.length < highWaterMark) fill()
    return pending.promise
  }

  return stream

  // mark that we are reading, read from source, store the chunk in
  // the buffer, drain, then fill again if below the water mark
  function fill() {
    reading = true
    source.read().then(function (chunk) {
      reading = false
      buffer.push(chunk)
      drain()
      if (buffer.length < highWaterMark) {
        fill()
      }
    }, function (err) {
      if (pending) {
        pending.reject(err)
        pending = null
      }
    })
  }

  function drain() {
    if (pending) {
      pending.resolve(buffer.shift())
      pending = null
    }
  }

  // minimal deferred helper (a bare `new Promise()` would throw)
  function defer() {
    var d = {}
    d.promise = new Promise(function (resolve, reject) {
      d.resolve = resolve
      d.reject = reject
    })
    return d
  }
}
```
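For what it's worth, the same idea can be made runnable today without any stream class at all, by wrapping a plain object that has a promise-returning `read()`. Everything here (the `bufferedReader` name, the mock source) is illustrative, not part of any proposal:

```javascript
// Userland high-water-mark buffering around a pull source.
// `source.read()` returns a promise for the next chunk.
function bufferedReader(source, highWaterMark) {
  const buffer = []
  let pending = null
  let reading = false

  function fill() {
    reading = true
    source.read().then(chunk => {
      reading = false
      buffer.push(chunk)
      drain()
      // keep pre-fetching until we hit the high water mark
      if (buffer.length < highWaterMark) fill()
    }, err => {
      if (pending) { pending.reject(err); pending = null }
    })
  }

  function drain() {
    if (pending && buffer.length > 0) {
      pending.resolve(buffer.shift())
      pending = null
    }
  }

  fill() // start pre-fetching immediately

  return {
    read() {
      return new Promise((resolve, reject) => {
        pending = { resolve, reject }
        drain()
        if (!reading && buffer.length < highWaterMark) fill()
      })
    }
  }
}

// Mock source producing increasing integers.
let n = 0
const source = { read: () => Promise.resolve(n++) }

const reader = bufferedReader(source, 3)
reader.read().then(chunk => console.log('first chunk:', chunk)) // → first chunk: 0
```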
High water marks will default to zero, but be easily configurable via a buffering strategy. Tracking the definition of that in #12.
Reopening, since yesterday @isaacs again brought up that HWM might not be a good idea. It would be a nice simplification of the spec if we were able to move custom buffering strategies into userland entirely, or let them be built on top of the base primitives (e.g. as in #24). @isaacs, let's discuss details. If you have "no high water mark," that means one of two things: an infinite high water mark (no backpressure ever), or a zero high water mark (backpressure once any data is in the buffer at all). I assume zero is what you mean. My concern about a zero high water mark is that it would cause continual pausing and resuming of the underlying source's data flow. For example, consider the adapting-a-push-based-data-source example, which is based on the Node.js docs' SourceWrapper example. With a zero high water mark, wouldn't you constantly be sending readStop/readStart calls to the source? Maybe that's not a problem at all, and continually adjusting the flow rate between start and stop is actually fine?
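To make the worry concrete, here is a toy tick-based model (readStart/readStop are modeled loosely on the SourceWrapper example; the tick loop and rates are invented) counting how often the source's flow gets toggled when the consumer keeps pace on average but occupancy is transiently nonzero:

```javascript
// Toy model of a push source with readStart/readStop flow control
// (names borrowed from the SourceWrapper example; nothing here is a
// real API). While flowing, the source pushes one chunk per tick; the
// consumer also reads one chunk per tick, so buffer occupancy is only
// ever transient.
function countToggles(hwm, ticks) {
  const buffer = []
  let flowing = true
  let toggles = 0
  const readStop = () => { flowing = false; toggles++ }
  const readStart = () => { flowing = true; toggles++ }

  for (let i = 0; i < ticks; i++) {
    if (flowing) {
      buffer.push(i)                        // source pushes a chunk
      if (buffer.length > hwm) readStop()   // backpressure kicks in
    }
    if (buffer.length > 0) buffer.shift()   // consumer keeps pace
    if (!flowing && buffer.length <= hwm) readStart()
  }
  return toggles
}

console.log('hwm=0:', countToggles(0, 100))  // → hwm=0: 200
console.log('hwm=8:', countToggles(8, 100))  // → hwm=8: 0
```

With a zero HWM, every transient chunk trips the mark, so the flow is toggled twice per tick; with any positive HWM the same traffic causes no toggling at all.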
But is this behavior not the same with or without a hwm? Once you have reached the hwm, most likely the stream will constantly stop/restart. In one of my current implementations (browsers with WebSockets) there is a hwm; when it is reached, the app keeps spending its time checking whether the buffer size is below the hwm, then either adding data to the buffer or polling again with a theoretical 0-timeout/postMessage. I don't know how to do this better for now; the problem is that you accumulate polling attempts that do nothing most of the time, because the buffer is still above the hwm. Should not [[bufferSize]] be observable instead? Another issue I have is on the readable side: flow control uses a usual window, where the receiver tells the sender to send more data when the window gets empty. Problem: [[bufferSize]] can still be increasing, or not yet be null, when the window gets empty; in that case you should wait before telling the sender to send more data, otherwise you are sure that the receiver will get overloaded. And what you should wait for is the bufferSize to reach 0, so same question.
This is not the case, because you wait for the buffer to drain completely before restarting the flow of data. You do not restart the data immediately after dropping below the high water mark. (Actually, you could introduce a low water mark wherein once you drop below that you restart the flow of data. I am describing a low water mark of zero. You seem to be describing a low water mark equal to the high water mark, or infinitesimally smaller.)
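The two restart policies can be compared with a toy simulation (all numbers and the step model are invented for illustration): a pausable producer, a consumer that removes one chunk every other tick, and a restart threshold `lwm`:

```javascript
// Toy tick-based model: a pausable producer adds one chunk per tick,
// a slower consumer removes one chunk every other tick. `lwm` is the
// level at which a paused producer is restarted: lwm = hwm - 1 means
// "resume as soon as we dip below the mark", lwm = 0 means "wait for
// a complete drain". Counts pause/resume transitions.
function simulate(hwm, lwm, ticks) {
  let buffer = 0
  let running = true
  let transitions = 0
  for (let t = 0; t < ticks; t++) {
    if (running) buffer++                     // producer pushes
    if (t % 2 === 0 && buffer > 0) buffer--   // slow consumer
    if (running && buffer >= hwm) { running = false; transitions++ }
    if (!running && buffer <= lwm) { running = true; transitions++ }
  }
  return transitions
}

console.log('resume below mark:', simulate(4, 3, 1000))
console.log('complete drain  :', simulate(4, 0, 1000))
```

Under this model the resume-immediately policy transitions far more often than the complete-drain policy, which is the trade-off being debated here (fewer transitions vs. the delay of refilling from empty).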
I am not sure it is most efficient to wait for the buffer to be empty before restarting; maybe a low water mark would be better. Indeed, the virtual lwm is close to the hwm in my case, because I have no way to know the state of the buffer, nor can I know when the 0-timeout will fire, so when data can be sent it refills up to the hwm. What about the receiver case? Since I started working on this, I have been wondering why nobody specified events associated with bufferedAmount for WS.
This is where implementation experience comes in handy. Implementation experience shows draining the buffer to be a much better strategy. |
Do you have a stream implementation experience inside browsers and did you experience a draining strategy inside browsers? |
Yes, for example see http://thlorenz.github.io/stream-viz/ |
I was talking about network streams in a real environment. With what is currently available inside browsers the draining strategy did not appear to be the best for me. |
Ah. I also was talking about network streams in a real environment, and showed them to you simulated in a browser context so you can see the effect in browsers. |
I said in a real environment; testing, defining and specifying streams in a perfect world is nonsense. A real environment means unexpected network events, busy browsers, unexpected browser behavior, hazardous response times, bandwidth changes, shitty nodes in the route, etc. I have experienced all of this. Draining the buffer will necessarily cause delays between the time it gets empty and gets refilled; the right method is for the flow to stick around the mark.
Yes, I have experienced all of that as well. Draining the buffer works really well in those situations. It seems we simply have different perspectives borne out by similar experiences. At this point I imagine other implementer experience (for example from Node.js implementers) would be good to have in the thread. The fact remains that the most successful streams implementations in JavaScript, viz. Node.js streams and browserified streams, drain the buffer. An oscillating model has not been tested on the same scale as a buffer-draining one. |
Where? What projects? How could you use the draining method inside browsers? |
You updated your post after my answer... The fact is that today you have zero examples of stream implementations inside browsers; you have examples of JS stream implementations, and some browserification experiments of those examples, but still no example of a living stream implementation inside browsers. Except mine: you can try http://www.peersm.com. On the client side it uses the browser; on the server side (or Peersm clients) Node, and it's different. It's not possible today to use a draining-style method inside browsers without decreasing performance a lot. Maybe it will become possible, but it has been difficult to adjust the flow control inside browsers, and the model that works better than the others is the oscillating one.
I think you are ignoring successful examples by dismissing them as experiments. |
Regardless, this bug is not the place to discuss whether your stream implementation is the best model for browser streams. It is the place to discuss whether we want to include a high water mark or not. If you want to open a separate issue to discuss low water marks please do so. But please do not let all of your arguments be from authority, based on your product. In general, comments of the sort "why aren't we doing things the way my product already does them" are not very helpful. Instead argue on technical merits with real illustrations of e.g. performance numbers for no-low-water-mark vs. with-low-water-mark, and code samples, preferably in the context of the existing proposal. I hope we can use this bug to discuss the question from the original post now. |
You are really misunderstanding my point, I am not saying that my implementation is the best, because it can not be with what we have today inside browsers, and it's not possible to demonstrate anything neither. Anyway it seems like you are not receptive, any spec not taking care of real implementations feedback will end up as usual as "Yeah, we will correct this in V2" |
I am hopeful we have both misunderstood each other and can have more productive dialogues in the future :). I welcome some specific feedback about high or low water marks based on real data. |
Yes, OK, I will try to compare the different methods again and provide data. Now, coming back to the topic of this bug: the no-hwm method is equivalent to the oscillating one, no? And to make sure I understand the draining method correctly, let's take a use case: streaming a file piped to a heavy_encryption process piped to the network. When the network buffer starts draining, the heavy_encryption process is stopped; it restarts when the buffer gets empty, so necessarily you have to wait for the heavy_encryption process to send new data, decreasing performance. Correct or not?
To some extent, yes, although it starts oscillating immediately. I am having trouble understanding how this is desirable, but am guessing it's likely I'm missing something fundamental...
Sorry, when which buffer is drained? Which buffer gets empty? (Remember there are three streams here, and thus three buffers.) Who has to wait? This sounds like a good concrete use case to analyze, but I can't quite understand it.
I don't think the immediate oscillation is desirable, because in that case you don't really need buffers and what they were invented for, which seems not extremely efficient. I was talking about the network buffer getting drained: if I follow the logic, the preceding streams are stopped, so the heavy_encryption stream has to wait for the network buffer to be empty before restarting, and then before new data is piped to the network you must wait for heavy_encryption to process it. That's why I like the oscillating model around a non-null mark: you give the heavy_encryption process a chance to fill in data while the network keeps sending data until the network buffer is empty, which should never happen if all of this works correctly; the flow should continuously stick to the mark. My opinion... maybe incorrect, but if so I am interested to know why.
How about thinking of a source and a sink where
For simplicity, please ignore the buffer in the kernel. While the network is vacant, the sink drains data at max speed and the buffer stays empty almost all the time. When congestion avoidance starts working, the sink drains data at a very low speed such as 1MB/s. The buffer fills up to the HWM. Then we stop pulling from the source. The source keeps producing data at its low speed. When the network becomes vacant again before the buffer becomes empty, the sink drains the buffered data more rapidly than the source fills it (50MB < 100MB). It depends, but I think I want the HWM to be big enough that the buffer can keep accepting data from the low-speed source, for faster delivery of data to the peer. So I want the buffer to be kept filled while there's congestion. Not sure, but this is what I came up with.
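This scenario can be sketched numerically (every rate, phase boundary, and the hwm values below are made up purely for illustration of the shape of the argument):

```javascript
// Illustrative model: a source producing at a steady rate, a sink
// whose drain rate alternates between "vacant network" (fast) and
// "congested" (slow). The producer is only pulled from while the
// buffer is under the high water mark.
function run(hwm) {
  const sourceRate = 10            // units per tick (invented)
  let buffer = 0
  let delivered = 0
  for (let t = 0; t < 100; t++) {
    const congested = t >= 30 && t < 70
    const sinkRate = congested ? 2 : 50
    if (buffer < hwm) buffer += sourceRate   // pull from source
    const out = Math.min(buffer, sinkRate)   // sink drains
    buffer -= out
    delivered += out
  }
  return delivered
}

// A bigger HWM keeps accepting data during congestion, so more is
// ready to flush when the network frees up again.
console.log('hwm=20 :', run(20))
console.log('hwm=200:', run(200))
```

In this toy model the larger HWM delivers more total data over the same window, matching the intuition that the buffer should keep filling from the slow source while the network is congested.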
I think this is the same as what I have described, and I would say I agree with the conclusion. A correction to my last comment above: of course, if the network is faster than the source, the buffer stays empty (and does not stick to the mark as I wrote).
So, the landing point could be to change |
I had a chance to discuss this all with @isaacs. Our discussion was kind of inconclusive. Some takeaways:

He was initially optimistic about a HWM of zero on the write side, as he envisioned that for slow destinations that would lead to: "write once, write again; OK it's full; wait until it drains; now write everything we accumulated in the meantime from the fast source; now write one more time; OK it's full, wait for drain again..." Which is fairly efficient and does what you want. But this isn't how our current piping model works, because we don't write everything we accumulated (i.e. we don't smash together all of the accumulated chunks into a single write). I just had a thought: with built-in writev support (#27), this might become possible.

When asked about a HWM of zero on the read side, he noted that this would not be a problem for slow-to-fast pipes, but only for fast-to-slow ones, which are a bit rarer. (E.g. most transforms are faster than the I/O operations generating the data.) I was not convinced that this is a sufficient argument.

He noted that in Node, nobody changes the defaults for HWMs. They made them configurable in the streams2 rewrite, and nobody actually uses that configurability. This is a strong point to me, and makes me unsure whether we want to give it a prominent role in our API as we are currently doing. For reference, see the Node defaults for the writable and readable sides.
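The "smash together accumulated writes" idea, in writev terms, might look like the following sketch (`dest.writev` is a hypothetical API here, standing in for whatever #27 ends up specifying):

```javascript
// While the destination is busy, queue chunks; when it signals
// completion, flush the whole accumulated queue with one
// writev-style call.
function coalescingWriter(dest) {
  let queue = []
  let busy = false

  function flush() {
    if (queue.length === 0 || busy) return
    const batch = queue
    queue = []
    busy = true
    dest.writev(batch).then(() => {
      busy = false
      flush() // drain whatever accumulated meanwhile
    })
  }

  return {
    write(chunk) {
      queue.push(chunk)
      flush()
    }
  }
}

// Demo destination that records how the chunks were batched.
const batches = []
const dest = {
  writev(chunks) {
    batches.push(chunks)
    return Promise.resolve()
  }
}

const w = coalescingWriter(dest)
w.write('a')          // goes out alone: destination was idle
w.write('b')          // queued while busy...
w.write('c')          // ...and coalesced with 'b' on the next flush
Promise.resolve().then(() => console.log(batches)) // → [ [ 'a' ], [ 'b', 'c' ] ]
```

The point is that a slow destination naturally batches the fast source's writes, which is the efficiency @isaacs envisioned for a zero write-side HWM.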
He noted that low water marks are actively harmful, as you always want to pass along data that is available, even if it's just one byte. This makes sense to me. We ended up concluding that what really needs to happen is an assembly of canonical examples that we can test against various configurations.
I am also still wondering whether we could build a HWMTransform in userland (#24) on top of a ZeroHWMStream and punt this problem to specific streams or use cases. E.g. you could start with
You read all available data from the source readableStream and then write it in one shot to the destination?

By "fast-to-slow", do you mean a fast source wrapped with a BRS and a slow consumer reading from the BRS?

Could you please elaborate on this? Are you talking about the inefficiency of transferring data in small chunks?

These two are basically separate decisions. Does pull of a BRS have any implication that the source must push any available data ASAP? The source could choose to push data only after some reasonable amount has accumulated in it.
Right, that's the idea.
Yes.
Hmm. Upon reflection I don't recall how this argument was framed. It doesn't seem to match my understanding of low water marks:
Neither of these seems related to "low water marks are actively harmful, as you always want to pass along data that is available, even if it's just one byte." @isaacs, can you help me remember what you were saying?
It turns out I was very confused about what @isaacs meant by low water marks, and we were talking past each other! His definition, translated into whatwg/streams terms, was:
You can see how both of these are actively harmful. Which brings up the question of WTF my previous "low water marks" mean, or should be called. Apparently I have been misunderstanding things for a while: Node streams (at least) resume the flow of data immediately after dipping underneath the HWM, i.e. they don't wait for a complete drain. I feel pretty incompetent now, as I was arguing strongly against @Ayms earlier that a complete drain was good, based on it being what Node streams did. I apologize, @Ayms. I am still not sure I understand the complexities here entirely. I believe the writable side might become even more important in pipe chains; e.g. the one-shot drain-everything-from-readable-into-writable might be the really important part here. I need to dig into these issues more, and am somewhat disappointed in myself now that I realize some of the areas I'm missing understanding in.
I think we ought to solve this problem empirically. Here's what I'm proposing: set up a pipe chain like

```js
readableStream.pipe(transformStream).pipe(writableStream)
```

and then measure, while varying things such as:

a) the readableStream underlying source's data rate
b) the time for writableStream to consume all the data

A graph of these would be great. The benefit of writing such a test is that we'd be able to settle the question conclusively. Does that sound like a worthwhile thing to do?
|
@domenic that's OK, no problem, that's what discussions are made for. We can do @isaacs's tests, but finally it seems like we all agree here, no? And from what I have seen, Node's streams are much more efficient now. Maybe the model used is a bit confusing: we always talk about slow-to-fast or fast-to-slow, and maybe interactions between streams "outside of the streams" are not really taken into account. For example, the fast side can be slowed down by the receiver: even if the source is fast, it will stop sending data until the receiver tells it to restart. So, since the transform takes some time, the stream has to resume from its buffer to be efficient (if the buffer is not empty, it means the transform is faster than the network, but it still takes time to resume with no HWM). The mechanisms between the sender and the receiver are in userland, but the fact that the stream resumes efficiently is on the spec side. So, I think the HWM model is the good one.
I think that's likely as well. But it does certainly introduce a complication. A little bit of empirical research would go a long way toward explaining the behavior here.
As mentioned above, implementers seem not to have needed to change the HWM for Node (which probably means they are happy with how it is). But indeed this does not mean we shouldn't gather more data: should the suggested tests not be performed end to end, with at least something like a basic window flow control at the app level and the possibility of changing the network rate?
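A minimal end-to-end harness along these lines could be simulated without any real stream implementation (the chunk counts, delays, polling sleep, and parameter names are all invented; the point is only the shape of the measurement):

```javascript
// Simulated pipe chain: source -> transform -> sink, each with a
// configurable per-chunk delay, measuring total wall-clock time for
// a fixed number of chunks under a given high water mark.
const delay = ms => new Promise(r => setTimeout(r, ms))

async function runPipeline({ chunks, sourceMs, transformMs, sinkMs, hwm }) {
  const buffer = []
  let produced = 0
  const start = Date.now()

  // producer: respects the high water mark
  const producer = (async () => {
    while (produced < chunks) {
      if (buffer.length >= hwm) { await delay(1); continue }
      await delay(sourceMs)
      buffer.push(produced++)
    }
  })()

  // consumer: transform + sink cost per chunk
  const consumer = (async () => {
    let consumed = 0
    while (consumed < chunks) {
      if (buffer.length === 0) { await delay(1); continue }
      buffer.shift()
      await delay(transformMs + sinkMs)
      consumed++
    }
  })()

  await Promise.all([producer, consumer])
  return Date.now() - start
}

runPipeline({ chunks: 50, sourceMs: 1, transformMs: 2, sinkMs: 2, hwm: 4 })
  .then(ms => console.log('elapsed ms:', ms))
```

Sweeping `hwm`, the rates, and adding a window-style flow-control term on the sink would give the kind of graphs being asked for above.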
I definitely agree we need empirical research. I'll give it a try, or maybe one of the implementation helpers (@othiym23, @Raynos, or anyone else) would be willing to set something up. @thlorenz has already done some good work with http://thlorenz.github.io/stream-viz/. One thing that would also help clarify for me is what exactly the start/stop mechanism is in various cases. E.g. in TCP, do we/should we at any point lower the window size? That seems expensive. Whereas if we just stop reading from the kernel buffer, and let the kernel take care of the rest, then that seems cheaper. As for whether the HWM model is a good one, I think it's less clear whether that's true on the writable side. |
This is definitely something that's in my wheelhouse, and my ambient insanity level is maybe dropping enough that I can take a pass at implementing @isaacs's proposal over the weekend (it sounds fun! for what that's worth). I'm especially concerned that our experience with Node streams not overdetermine the spec, because I have a feeling there are many use cases (some of them outlined by @Ayms and others) for the browser that just plain behave differently than on the server side. |