streams: Readable highWaterMark is measured in bytes *or* characters #6798
Comments
My initial take on this is that measuring characters is a bug. |
@jasnell I agree! This is a nice issue for anybody who wants to start contributing to streams and to core. cc @nodejs/streams. |
I'd be interested in digging into this to learn more about streams.... will take a look in the morning |
One other issue to think about with this is when someone calls […] |
I think we should just use byte length for highWaterMark. Converting values... sounds tricky. |
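For illustration (a minimal sketch, not part of the thread): once a chunk has been decoded to a string, `chunk.length` counts characters, whereas `Buffer.byteLength(chunk, encoding)` would recover the byte count that `highWaterMark` is documented to refer to.

```js
// Hypothetical illustration: the same chunk measured two ways.
const chunk = 'a€'; // 2 characters, but 1 + 3 = 4 bytes in UTF-8

console.log(chunk.length);                      // 2 (characters)
console.log(Buffer.byteLength(chunk, 'utf8'));  // 4 (bytes)

// Counting state.length with chunk.length therefore under-counts
// multi-byte data once the chunk has been decoded to a string.
```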
Hi! It seems like it's been a while since anyone's mentioned anything on this issue. Does anyone mind if I take a go at this? |
@jessicaquynh please do! Ping @nodejs/streams if you need any help. |
@mcollina Thanks! Should I also include a test for this change? |
@jessicaquynh yes please! |
Great, will do! Thanks @cjihrig! |
@nodejs/streams would love some guidance! The implementation that I made has caused a side-effect and I am unsure what to do to fix it. I have changed this line to write the […]. However, the most recent version of […]. This […]. Anyhow, any help or feedback would be greatly appreciated! Thanks! :) |
I do not think so, no. The best approach is to write a new unit test that exhibits the described issue, and push it to a branch on your profile. I would like to have a closer look at both the change and the test. As a side note, […] |
@mcollina That was a very complicated process, but I ended up learning a lot along the way! Would love your advice and feedback! I wrote a test for this at a lower byte level using the em dash and the euro character, as their character length and byte length are not the same. |
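For reference, the property that such a test relies on (sketch only; the actual test is not shown here): the em dash and the euro sign are each one character but three bytes in UTF-8, so character counts and byte counts diverge.

```js
// Em dash (U+2014) and euro sign (U+20AC): one character, three bytes each.
for (const ch of ['—', '€']) {
  console.log(ch, ch.length, Buffer.byteLength(ch, 'utf8')); // 1 3
}
```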
@jessicaquynh nice one! I think the problem is here: https://github.com/jessicaquynh/node/blob/issue%236798/lib/_stream_readable.js#L253. That […]. Let me know how it goes! |
@mcollina hey! I am not sure if I implemented your fix correctly, but it's still failing a few other tests. I wasn't exactly sure how to reproduce beyond introducing the […]. Nonetheless, here's an updated test that reproduces the error. If you push in an exact buffer source, it works as expected. But should you split the buffer and the split is not exact, it attempts to return the empty buffer head! |
Some notes. I fear this is definitely not a good first contribution; I was wrong in tagging it as such. Anyway, we are progressing: we also need to update https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884, as it is making its calculations on the length, not the bytes. There is still a thing that worries me: characters are unsplittable, yet we can send the first half of a UTF-8 sequence through when the consumer expects it to be a string. Not sure if it's going to be an issue code-wise. |
@mcollina Ahah :) I will admit it's been a daunting (although rewarding) task. Thank you for your aid in this process. I have implemented your recommended changes. There is still one error that happens on the parallel tests, and that has to do with a […]. From my interpretation it looks like the byte order mark is causing the length to be off? |
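If the byte order mark is involved, the mismatch it introduces would look like this (an illustrative sketch, not the failing test itself): the UTF-8 BOM is three bytes on the wire but a single character once decoded.

```js
const { StringDecoder } = require('string_decoder');

const bom = Buffer.from([0xef, 0xbb, 0xbf]); // UTF-8 byte order mark
const decoder = new StringDecoder('utf8');
const decoded = decoder.write(bom);

console.log(bom.length);      // 3 bytes
console.log(decoded.length);  // 1 character (U+FEFF)
```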
@jessicaquynh can you push your branch? master...jessicaquynh:issue#6798 does not have all the fixes (especially the […]). The problem is related to https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884. As you can see, it's still accessing the length without passing through […]. |
@mcollina Sorry about that! I forgot to push up last night with […]. Additionally, I am wondering if I could ask for clarification:
Is this in regards to instances in the test when the Buffer splits the euro/emdash characters between different chunks? |
Yes! If we count only characters, what should we do? If we have 15 bytes and then a 3-byte UTF-8 sequence, we can either read 15 or 18 bytes; we cannot read 16 or 17. This is very likely what is happening here. |
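To make the 15-plus-3 example concrete (a sketch under the same assumptions): when a 3-byte UTF-8 sequence straddles a chunk boundary, the `StringDecoder` holds the partial bytes back until the rest of the character arrives, so a 16- or 17-byte read cannot yield a complete string.

```js
const { StringDecoder } = require('string_decoder');
const decoder = new StringDecoder('utf8');

// 15 ASCII bytes followed by a 3-byte euro sign, split after byte 16.
const data = Buffer.from('aaaaaaaaaaaaaaa€', 'utf8'); // 18 bytes in total
const first = data.slice(0, 16);  // 15 'a's plus the first byte of '€'
const second = data.slice(16);    // the remaining 2 bytes of '€'

console.log(decoder.write(first).length);  // 15 — the partial byte is buffered
console.log(decoder.write(second).length); // 1  — '€' completes on the next chunk
```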
(Maybe I am wrong about that UTF-8 story, this change is really hard.) I'm seeing two tests failing: […]
In case of 1., there is something wrong going on with hex encoding.

[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '61616161616161616161',
  '' ]

[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161' ]

All the data is there, but organized differently. There might be an issue in […] |
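For context on the listings above (an assumption about the test data, since the test itself is not shown): the byte 0x61 is the letter 'a', so each five-byte chunk hex-encodes to '6161616161'; both listings contain the same 100 bytes, only chunked differently.

```js
// 'a' is 0x61, so five 'a's hex-encode to '6161616161'.
console.log(Buffer.from('aaaaa').toString('hex')); // '6161616161'

// 18 five-byte chunks plus one ten-byte chunk (and an empty tail) and
// 20 five-byte chunks both add up to the same 100 bytes of data;
// only the chunk boundaries differ.
```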
Maybe this is just too hard to fix in general? The use case of piping with setEncoding is kinda edge-casy anyway. |
@mafintosh probably, at this point I am wondering if it's feasible, and if it decreases throughput. Maybe we are just worrying for nothing. |
Hey @mcollina, I've added your suggestion but it actually turns up more errors with the tests.
I noticed in my first changes that this same error was happening when I was using the state's encoding in the […]. I am not exactly sure what the issue is. So, I tried experimenting without setting the encoding in […], though I am not exactly sure why, because the failing […]. Nonetheless, […] is still failing with this error. |
@jessicaquynh sorry it's taking a bit of time :( this week and the next one will be a bit busy, I will try to answer as quickly as I can. |
@mcollina no worries! I appreciate all your help :) |
ping @nodejs/streams |
I think the best approach for this issue is to update the documentation. An actual fix proved to be way harder, and it might slow things down considerably. Would anyone update the docs? Maybe @jessicaquynh? |
@mcollina definitely can; are you thinking about putting it in as a "gotcha" statement? |
@jessicaquynh yes, I think that would work, let's see what other people think when you fire the PR. |
This commit documents an edge-case behavior in readable streams. It is expected that non-object streams are measured in bytes against the highWaterMark. However, it was discovered in issue nodejs#6798 that after calling .setEncoding() on the stream, it will thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Yes, I think this could be closed. |
Currently the Readable streams documentation states that `highWaterMark` is measured in bytes for non-object mode streams. However, when `.setEncoding()` is called on the Readable stream, then `highWaterMark` is measured in characters and not bytes, since `state.length += chunk.length` happens after `chunk` is overwritten with a decoded string, and `state.length` is what is compared with `highWaterMark`.

This seems to be an issue since at least v0.10, so I wasn't sure if we should just update the documentation to note this behavior or if we should change the existing behavior to match the documentation.
/cc @nodejs/streams
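A minimal reproduction sketch of the reported behavior (the `_readableState` property is a Node internal and subject to change; the values are illustrative, not taken from the thread):

```js
const { Readable } = require('stream');

// highWaterMark is documented as a byte count for non-object-mode streams.
const r = new Readable({ read() {}, highWaterMark: 10 });
r.setEncoding('utf8');

// '€€€€' is 4 characters but 12 bytes in UTF-8.
r.push(Buffer.from('€€€€', 'utf8'));

// After setEncoding(), the buffered length is the decoded string's length,
// i.e. characters (4) rather than bytes (12), and that is what gets
// compared against highWaterMark.
console.log(r._readableState.length); // 4
```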