Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

streams: Readable highWaterMark is measured in bytes *or* characters #6798

Closed
mscdex opened this issue May 17, 2016 · 32 comments
Closed

streams: Readable highWaterMark is measured in bytes *or* characters #6798

mscdex opened this issue May 17, 2016 · 32 comments
Labels
stream Issues and PRs related to the stream subsystem.

Comments

@mscdex
Copy link
Contributor

mscdex commented May 17, 2016

  • Version: all
  • Platform: all
  • Subsystem: streams

Currently the Readable streams documentation states that highWaterMark is measured in bytes for non-object mode streams. However, when .setEncoding() is called on the Readable stream, then highWaterMark is measured in characters and not bytes since state.length += chunk.length happens after chunk is overwritten with a decoded string and state.length is what is compared with highWaterMark.

This seems to be an issue since at least v0.10, so I wasn't sure if we should just update the documentation to note this behavior or if we should change the existing behavior to match the documentation.

/cc @nodejs/streams

@mscdex mscdex added the stream Issues and PRs related to the stream subsystem. label May 17, 2016
@jasnell
Copy link
Member

jasnell commented May 17, 2016

My initial take on this is that measuring characters is a bug.

@mcollina
Copy link
Member

@jasnell I agree! this is a nice issue if anybody wants to start contributing on streams and on core cc @nodejs/streams.

@mscdex mscdex added the good first issue Issues that are suitable for first-time contributors. label May 17, 2016
@MylesBorins
Copy link
Contributor

I'd be interested in digging into this to learn more about streams.... will take a look in the morning

@MylesBorins MylesBorins self-assigned this May 17, 2016
@mscdex
Copy link
Contributor Author

mscdex commented Jun 16, 2016

One other issue to think about with this is when someone calls setEncoding() when state.length > 0, what to do then? Convert all the buffered data to strings and update state.length or just loop through the buffer using Buffer.byteLength(chunk, encoding) to update state.length ?

@mcollina
Copy link
Member

I think we should just use byte length for highWaterMark. Converting values... sounds tricky.

@jalafel
Copy link
Contributor

jalafel commented Oct 16, 2016

Hi! It seems like it's been awhile since anyone's mentioned anything on this issue. Does anyone mind if I take a go at this?

@mcollina
Copy link
Member

@jessicaquynh please do! Ping @nodejs/streams if you need any help.

@jalafel
Copy link
Contributor

jalafel commented Oct 18, 2016

@mcollina Thanks! Should I also include a test for this change?

@cjihrig
Copy link
Contributor

cjihrig commented Oct 18, 2016

@jessicaquynh yes please!

@jalafel
Copy link
Contributor

jalafel commented Oct 18, 2016

Great, will do! Thanks @cjihrig!

@jalafel
Copy link
Contributor

jalafel commented Oct 20, 2016

@nodejs/streams would love some guidance! The implementation that I made has caused a side-effect and I am unsure what to do to fix it.

I have changed this line to write the chunk.length (if not in objectMode) to be a converted Buffer.byteLength(chunk, encoding).

However, the most recent version of _stream_readable.js has this line that conflicts with the change.

This if statement gets entered pre-emptively when the byteLength is quite low. Say, 2 or 4 and it throws the error that state.buffer.head is undefined. The difference is in these instances, the string chunk length was 0. I notice that the file on master and the file in this issue's reference contain a different howMuchToRead function. So perhaps the new function got rewritten without this bug in mind?

Anyhow, any help or feedback would be greatly appreciated! Thanks! :)

@mcollina
Copy link
Member

So perhaps the new function got rewritten without this bug in mind?

I do not think so, no.

The best approach is to write a new unit test, that exhibit the described issue, and push it on branch on your profile. I would like to have a closer look at both the change and the test.

As a side note, Buffer.byteLength(str) calls str.toString(), which might be the issue here.

@jalafel
Copy link
Contributor

jalafel commented Oct 22, 2016

@mcollina That was a very complicated process, but I ended up learning a lot along the way! Would love your advice and feedback!

I wrote a test for this for on a lower byte level using mdash and the euro character, as their character length and byte length are not the same.

https://github.com/jessicaquynh/node/tree/issue%236798

@mcollina
Copy link
Member

@jessicaquynh nice one! I think the problem is here: https://github.com/jessicaquynh/node/blob/issue%236798/lib/_stream_readable.js#L253. That state.buffer.head.data can be both a Buffer or a String, which means that we need to use Buffer.byteLength.

Let me know how it goes!

@jalafel
Copy link
Contributor

jalafel commented Oct 26, 2016

@mcollina hey! I am not sure if I implemented your fix correctly, but it's still failing a few other tests. I wasn't exactly sure how to reproduce beyond introducing the process.on('uncaughtException') listener. Let me know if this is incorrect or bad practice!

Nonetheless, here's an updated test that reproduces the error. If you push in an exact buffer source, it works as expected. But should you split the buffer and the split is not exact, it attempts to return the empty buffer head!

@mcollina
Copy link
Member

Some notes. I fear this is definitely not a good first contribution. I was wrong in tagging this as is.

Anyway, we are progressing: we need also to update https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884. As it's making its calculations on the length, not the bytes.

There is still a thing that worries me: characters are unsplittable, we can send the first half of a utf8 sequence through when the consumer expect it to be a string. Not sure if it's going to be an issue code-wise.

@jalafel
Copy link
Contributor

jalafel commented Oct 27, 2016

@mcollina Ahah :) I will admit it's been a daunting (although rewarding) task. Thank you for your aid in this process.

I have implemented your recommended changes. There is still one error that happens on the parallel tests, and that has to do with a ReadLine implementation. Namely, this test which is throwing the empty buffer head error.

From my interpretation it looks like the byte order mark is causing the length to be off?

@mcollina
Copy link
Member

@jessicaquynh can you push your branch? master...jessicaquynh:issue#6798 does not have all the fixes (especially the howMuchToRead ones). I want to run the tests on your branch too.

The problem is related to https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884. As you can see, it's still accessing the length without passing through Buffer.byteLength.

@jalafel
Copy link
Contributor

jalafel commented Oct 27, 2016

@mcollina Sorry about that! I forgot to push up last night with fromListPartial fix. It has now been updated and pushed to be the same as my local branch.

Additionally, I am wondering I could ask for clarification:

There is still a thing that worries me: characters are unsplittable, we can send the first half of a utf8 sequence through when the consumer expect it to be a string. Not sure if it's going to be an issue code-wise.

Is this in regards to instances in the test when the Buffer splits the euro/emdash characters between different chunks?

@mcollina
Copy link
Member

Is this in regards to instances in the test when the Buffer splits the euro/emdash characters between different chunks?

Yes!
In current master, the logic is exact, and it treats bytes and chars in the same way.
Meaning that, if you try to .read(16), it will read either 16 bytes or characters.

If we count only characters, what should we do? If we have 15 bytes, and then a 3-bytes utf sequence, we can either read 15 or 18 bytes, we cannot read 16 or 17. This is very likely what is happening here.

@mcollina
Copy link
Member

(Maybe I am wrong about that utf story, this change is really hard).

I'm seeing two tests failing:

  1. test/parallel/test-stream2-set-encoding.js
  2. test/parallel/test-stream-preprocess.js

In case of 1., there is something wrong going on with hex encoding.

[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '61616161616161616161',
  '' ]
[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161' ]

All the data is there, but organized differently. There might be an issue in onEofChunk, again accessing chunk.length: https://github.com/jessicaquynh/node/blob/cedc393221c305eb87fc50fde2191664bddac697/lib/_stream_readable.js#L405

@mafintosh
Copy link
Member

Maybe this is just too hard to fix in general? The use case of piping with setEncoding is kinda edge-casy anyway.

@mcollina
Copy link
Member

@mafintosh probably, at this point I am wondering if it's feasible, and if it decreases throughput. Maybe we are just worrying for nothing.

@mcollina mcollina removed the good first issue Issues that are suitable for first-time contributors. label Oct 27, 2016
@jalafel
Copy link
Contributor

jalafel commented Oct 28, 2016

Hey @mcollina, I've added your suggestion but it actually crops up more errors with the tests.

I'm seeing two tests failing:
test/parallel/test-stream2-set-encoding.js

I noticed in my first changes that this same error was happening when I was using the state's encoding in the readableAddChunk func, here.

I am not exactly sure what the issue is. So, I tried experimenting without setting the encoding in fromListPartial, here and that seemed to work.

Though, I am not exactly sure why because the failing test/parallel/test-stream2-set-encoding.js is checking for hex encoding, and with the empty string it defaults to utf8. If you wouldn't mind explaining this difference in behaviour, I'd very much appreciate it :)

Nonetheless

test/parallel/test-stream-preprocess.js

is still failing with this error.

return Buffer.byteLength(state.buffer.head.data)

@mcollina
Copy link
Member

mcollina commented Nov 2, 2016

@jessicaquynh sorry it's taking a bit of time :( this week and the next one will be a bit busy, I will try to answer as quickly as I can.

@jalafel
Copy link
Contributor

jalafel commented Nov 2, 2016

@mcollina no worries! I appreciate all your help :)

@jasnell
Copy link
Member

jasnell commented May 30, 2017

ping @nodejs/streams

@mcollina
Copy link
Member

I think the best approach about this issue is to update the documentation. An actual fix proved to be way harder, and it might slow things down considerably.

Would anyone update the docs? Maybe @jessicaquynh?

@jalafel
Copy link
Contributor

jalafel commented May 31, 2017

@mcollina definitely can; are you thinking about putting it in as a "gotcha" statement?

@mcollina
Copy link
Member

mcollina commented Jun 1, 2017

@jessicaquynh yes, I think that would work, let's see what other people think when you fire the PR.

jalafel added a commit to jalafel/node that referenced this issue Aug 31, 2017
This commit documents and edge-case behavior in readable streams. It is
expected that non-object streams are measured in bytes against the
highWaterMark. However, it was discovered in issue nodejs#6798 that after calling
.setEncoding() on the stream, it will thereafter begin to measure the buffer's
length in characters.
BridgeAR pushed a commit that referenced this issue Sep 8, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
MylesBorins pushed a commit that referenced this issue Sep 10, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
MylesBorins pushed a commit that referenced this issue Sep 11, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
MylesBorins pushed a commit that referenced this issue Sep 12, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
addaleax pushed a commit to addaleax/node that referenced this issue Sep 13, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: nodejs#13442
Refs: nodejs#6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
MylesBorins pushed a commit that referenced this issue Sep 20, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
MylesBorins pushed a commit that referenced this issue Sep 21, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
MylesBorins pushed a commit that referenced this issue Sep 26, 2017
This commit documents and edge-case behavior in readable streams.
It is expected that non-object streams are measured in bytes
against the highWaterMark. However, it was discovered in issue
thereafter begin to measure the buffer's length in characters.

PR-URL: #13442
Refs: #6798
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
@gireeshpunathil
Copy link
Member

#13442 is landed, so close-able?
ping @mscdex

@mcollina
Copy link
Member

mcollina commented Apr 9, 2018

Yes, I think this could be closed.

@mcollina mcollina closed this as completed Apr 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stream Issues and PRs related to the stream subsystem.
Projects
None yet
Development

No branches or pull requests

8 participants