
feat: buffered writer #70

Merged: 19 commits from feat/buffer-writer into master, Mar 31, 2022
Conversation

Gozala (Contributor) commented Mar 25, 2022

Implement #69

lib/buffer-writer.js (outdated review thread, resolved)
  capacity = count * DEFAULT_CID_SIZE
) => {
  // Number of bytes added per root
  const rootsExtra = count * ROOT_EXTRA_SIZE
rvagg (Member):

This'll be off when you go above 24 roots, and then again at 256: CBOR will be adding extra bytes to fit in the array length.

You could fix it by adding 1 byte for >=24 and another byte for >=256. There's another boundary at 65,536 where you'd need to add 2 more bytes, but maybe by that time you should be doing a throw.

There's also a difference in encoding for different sizes of CIDs at the same boundaries - a tiny CID under 24 bytes is going to be 1 byte more compact, and a huge one over 256 is going to add another byte. These are going to be uncommon, but not out of the question. Unfortunately your arguments won't account for that; maybe the documentation just needs to make it clear that this estimate is only good for sensible defaults.

It seems that the cost of a bad estimate is the huge content shuffle in close(), which could be invisibly very expensive. Documenting that would be good too.

So, some solid tests for this function for various CIDs might be good!
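
A quick sketch of the boundary rules in question (a hypothetical helper, not code from this PR; the numbers come from how CBOR encodes lengths):

```js
// Extra bytes CBOR needs, beyond the initial major-type byte, to encode
// a length of n - the same rule covers arrays and byte strings (CIDs).
const lengthExtraBytes = (n) => {
  if (n < 24) return 0     // length is packed into the major-type byte
  if (n < 256) return 1    // a uint8 follows
  if (n < 65536) return 2  // a uint16 follows
  return 4                 // a uint32 follows (64-bit lengths take 8)
}
```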

rvagg (Member):

I added an EncodedLength() calculator to the Go dag-cbor, which lets you figure out exactly how many bytes an object will take up once encoded. We could do the same for JS because it's not too hard, but it may not end up being that much cheaper than just doing a dag-cbor encode of a faked object and checking the byte count.
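
For reference, the brute-force JS version of that might look like this (a sketch; it assumes the CARv1 header shape { version: 1, roots } with a varint length prefix, and is not code from this PR):

```js
import * as dagCbor from '@ipld/dag-cbor'
import varint from 'varint'

// Encode a real header for the given roots and measure it - exact, but
// pays for a full dag-cbor encode instead of a cheap estimate.
const exactHeaderLength = (roots) => {
  const headerBytes = dagCbor.encode({ version: 1, roots })
  return varint.encodingLength(headerBytes.length) + headerBytes.length
}
```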

Gozala (Contributor, author):

> This'll be off when you go above 24 roots, and then again at 256: CBOR will be adding extra bytes to fit in the array length.

> You could fix it by adding 1 byte for >=24 and another byte for >=256. There's another boundary at 65,536 where you'd need to add 2 more bytes, but maybe by that time you should be doing a throw.

I ended up publishing before I did a last push, so I'm not sure if that comment is from before the fix I added to account for the varintSize. If so, I assume that last change should address this, or is that something else?

> There's also a difference in encoding for different sizes of CIDs at the same boundaries - a tiny CID under 24 bytes is going to be 1 byte more compact, and a huge one over 256 is going to add another byte. These are going to be uncommon, but not out of the question. Unfortunately your arguments won't account for that; maybe the documentation just needs to make it clear that this estimate is only good for sensible defaults.

Yeah, I'm not sure there is a good way to account for all of that. The only thing I have considered is overestimating a bit, which I think is better than underestimating. Alternatively, we could return a range and let the user decide which end to go with.

> It seems that the cost of a bad estimate is the huge content shuffle in close(), which could be invisibly very expensive. Documenting that would be good too.

I do not think that overhead is huge; as far as I remember, browsers optimize the case of moving bytes within the same buffer.

> So, some solid tests for this function for various CIDs might be good!

> I added an EncodedLength() calculator to the Go dag-cbor, which lets you figure out exactly how many bytes an object will take up once encoded. We could do the same for JS because it's not too hard, but it may not end up being that much cheaper than just doing a dag-cbor encode of a faked object and checking the byte count.

Yeah, I was thinking about that as well, but decided it was not worth the effort. However, I think it would be great to have one.

rvagg (Member):

> If so, I assume that last change should address this, or is that something else?

Something else - this is all about CBOR encoding compression. Arrays have their length included at the start: if the length is <24 then that number is included in the single byte at the start that says "this is an array". Between 24 and 256 it's in its own byte, so the array takes 2 bytes to get started; 256 to 65,536 takes another byte, and so on. The same rule applies to byte arrays, so CIDs have this property too: a CID that's less than 24 bytes long will have one less byte in its prelude, and a CID greater than 256 bytes long will have an extra one. This is why a size estimator would be handy, because the rules are specific and get a bit weird and surprising at some edges. But the rules are fixed, so if you lock them down then you're set.

Unfortunately if you start down this road then you're not far off making a full estimator for arbitrary objects!

So maybe just get this right for the obvious cases without easy failure modes, and we can get a proper estimator into the dag-cbor codec to use later.
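
Concretely, the array preludes at those boundaries look like this (byte values per the CBOR spec):

```
0x97              array(23)   length packed into the single type byte
0x98 0x18         array(24)   one extra length byte
0x99 0x01 0x00    array(256)  two extra length bytes
```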

lib/buffer-writer.js (outdated review thread, resolved)
rvagg (Member) commented Mar 28, 2022

Yeah, this is neat - I like the API, I think, and I like that you could use the class directly to write into your own buffer too. It's just the issues around size estimation that are a concern. That offset shuffle is a bit nasty, and it would be nice if we didn't have to have it, since this code is going to so much effort to get it right. In fact, I'd also be fine with making the API strictly require you to estimate correctly ahead of time, to avoid this invisible nastiness entirely.

The other issue I see with mis-estimation is that if you want to use the class directly on your own buffers, then if you overestimate your header size you get back a subarray that doesn't start at the beginning of your buffer - which may be unexpected. Not a big deal, but maybe you want to write a CAR directly up against some other data already in your buffer, and to avoid the padding that an overestimation will create you'll have to get your estimates spot-on - and it'd be tricky to tell if you got it wrong. Another case for just removing the possibility of getting it wrong: throw if you misestimate.

Gozala (Contributor, author) commented Mar 28, 2022

> Yeah, this is neat - I like the API, I think, and I like that you could use the class directly to write into your own buffer too. It's just the issues around size estimation that are a concern. That offset shuffle is a bit nasty, and it would be nice if we didn't have to have it, since this code is going to so much effort to get it right. In fact, I'd also be fine with making the API strictly require you to estimate correctly ahead of time, to avoid this invisible nastiness entirely.

Yeah, moving bytes isn't ideal indeed. Maybe there needs to be a separate API for one use case we have, which is:

  1. We encode user input into CAR packets of 256MiB or less.
  2. We do not know how many roots there will be, as it depends on input

With the current API, the thinking is you could make a rough guess at the number of roots and start writing; if you end up with fewer roots you may be able to make space for more blocks, and if you end up with more roots you may be able to make room for them by reducing the number of blocks.

Unfortunately I cannot think of any better way to do this, other than writing blocks from the end of the buffer, but that is not great either because the order in which we encode is the natural order. One way we might go about it is via a single synthetic root linking to the desired roots; that way the synthetic root could be written last and we could make pretty accurate estimates.

Anyway, if you think it's best to avoid moving bytes in close, I can drop that and leave it up to user space to deal with it. In that case it would be nice to have a helper function that can at least copy stuff into a larger buffer without having to re-encode everything on a wrong estimate.

Gozala (Contributor, author) commented Mar 28, 2022

> The other issue I see with mis-estimation is that if you want to use the class directly on your own buffers, then if you overestimate your header size you get back a subarray that doesn't start at the beginning of your buffer - which may be unexpected.

I am not sure it is that bad, although I agree it can be unexpected. Alternatively we could slide the blocks towards the head, but I thought I'd leave it up to the user to decide. What do you think?

> Not a big deal, but maybe you want to write a CAR directly up against some other data already in your buffer, and to avoid the padding that an overestimation will create you'll have to get your estimates spot-on - and it'd be tricky to tell if you got it wrong.

I'm not sure it would be tricky - you just have to compare the byteOffset - but I do agree that one may not expect it to be different. For what it's worth, that was the reason you pass an ArrayBuffer and get back a Uint8Array; the hope was that this would make things clearer.
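
For example, a caller could detect the padding like this (a sketch; writer and close as discussed in this thread):

```js
const bytes = writer.close()
// A non-zero offset into the backing ArrayBuffer means the header
// estimate was larger than needed and the CAR starts past byte 0.
const wasOverestimated = bytes.byteOffset > 0
```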

> Another case for just removing the possibility of getting it wrong: throw if you misestimate.

I don't like the throwing idea here, because you did mutate the buffer, but then you threw and left the buffer dirty.

I think there are several alternative options to consider:

  1. We pad with a raw block just to align things.
  2. We move the blocks instead to align.
  3. We move the header to the front to align.

I felt the third option was best, as it has the least overhead and the user could still fall back to 1 or 2 if desired.

rvagg (Member) commented Mar 29, 2022

> I don't like the throwing idea here, because you did mutate the buffer, but then you threw and left the buffer dirty.

Yeah, good point, but that might be an argument for ensuring that you can properly get your header length estimate right, or at least never underestimate. If we can make the API solid in those terms - i.e. if you use it properly then you'll never underestimate - then it becomes a user problem: don't give bad estimation values up-front or you'll end up in a bad place (in which case I still don't mind throwing; you made a boo-boo, consider it programmer error and therefore reasonably fatal).

Gozala (Contributor, author) commented Mar 29, 2022

> Yeah, good point, but that might be an argument for ensuring that you can properly get your header length estimate right, or at least never underestimate. If we can make the API solid in those terms - i.e. if you use it properly then you'll never underestimate - then it becomes a user problem: don't give bad estimation values up-front or you'll end up in a bad place (in which case I still don't mind throwing; you made a boo-boo, consider it programmer error and therefore reasonably fatal).

OK, so I have spent a bit more time on this and I think I have a reasonable solution:

  1. I have addressed the header estimate issues you caught (thanks for pointing those out).
  2. I made close throw an exception if the estimate is incorrect.
  3. I added an option, close({ align: true }), which if passed will move bytes when possible and ensure that the header starts at the beginning of the buffer and is directly followed by the blocks.

I think that is a reasonable compromise, as it fails by default but provides a way to recover when possible.
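
Put together, usage would look something like this (a sketch; createWriter and the option names follow this thread and may differ from the final API, and byteLength, root, and block are placeholders):

```js
import * as CarBufferWriter from '@ipld/car/buffer-writer'

const writer = CarBufferWriter.createWriter(new ArrayBuffer(byteLength), {
  roots: [root] // roots known up-front; the header size is estimated from them
})
writer.write(block)
// close() throws if the header estimate was wrong; with { align: true }
// it instead moves bytes so the header starts at offset 0.
const bytes = writer.close({ align: true })
```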

const totalByteLength = (cids) => {
  let total = 0
  for (const cid of cids) {
    total += cid.bytes.byteLength
  }
  return total
}
rvagg (Member):

We could fix this too now that you have the nice arrayLengthEncodeSize(), which will work for CIDs too. I think if you subtract 2 from ROOT_EXTRA_SIZE (the byte array is >=24 and <256, so it requires a two-byte CBOR prelude, <bytes><length>; the extra bits in ROOT_EXTRA_SIZE are to do with tags, mainly) and then add it back in with + arrayLengthEncodeSize(cid.length) here, then you get a more accurate CID length. The problem with doing that is that your API allows for an optional rootsByteLength, but that could also probably be fixed by adding those 2 to DEFAULT_CID_SIZE.

If you add bafkqaaia and bafkqbbacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa to your tests then you should see the impact of the bad sizes here.

Identity CIDs are the most likely way we're going to end up here; it would be silly to use one as a root, but it's allowed, and it's not out of the question that someone finds a use case for it.
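
A sketch of the adjustment being suggested, reusing the arrayLengthEncodeSize() helper referenced above (the exact call shape is an assumption):

```js
const totalByteLength = (cids) => {
  let total = 0
  for (const cid of cids) {
    const byteLength = cid.bytes.byteLength
    // charge each root its actual CBOR byte-string prelude rather than
    // the fixed 2-byte prelude baked into ROOT_EXTRA_SIZE
    total += arrayLengthEncodeSize(byteLength) + byteLength
  }
  return total
}
```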

rvagg (Member) commented Mar 29, 2022

👌 nice

Re my last comment about CID sizing - it seems like low-hanging fruit now that you have the logic partially done, but I wouldn't consider it essential; we should just identify this as something to fix later on, when we get a proper CBOR sizer and can add it in here (I wrote one today for cborg but ran into a nasty bug in the cbor encoder, triggered in the test suite, that got me sidetracked).

Gozala marked this pull request as ready for review, March 29, 2022 07:40
Gozala requested a review from rvagg, March 29, 2022 07:40
lib/buffer-writer.js (two outdated review threads, resolved)
Gozala (Contributor, author) commented Mar 30, 2022

@rvagg I have merged your #71 and refactored this PR to remove the prior calculations. I have also changed estimateHeaderSize to estimateHeaderLength(rootCount: number, rootByteLength?: number): number as you suggested.

I think with that, all the issues should be addressed now.

Comment on lines 106 to 113
/**
 * @param {Block} block
 * @returns {number}
 */
export const blockEncodeSize = ({ cid, bytes }) => {
  const size = cid.bytes.byteLength + bytes.byteLength
  return varint.encodingLength(size) + size
}
rvagg (Member):

This appears to be unused (although tested); perhaps remove it to avoid the export noise and API surface area. Or, if you think it's going to be genuinely useful, add some docs using the pattern I added for the other documented statics (make sure to include a @name!).

@@ -54,10 +54,14 @@
    "./writer": {
      "browser": "./lib/writer-browser.js",
      "import": "./lib/writer.js"
    },
    "./buffer-writer": {
rvagg (Member):

You're going to need to add an equivalent block in typesVersions so consumers can find the types for it (i.e. avoiding the lib indirection).
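
For reference, the kind of typesVersions entry meant, assuming the package's existing pattern of mapping export subpaths to their lib/*.d.ts files (a sketch, not the exact diff):

```json
"typesVersions": {
  "*": {
    "buffer-writer": [
      "lib/buffer-writer.d.ts"
    ]
  }
}
```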

rvagg (Member) left a review:

Yep, looks good. I've pushed another commit that compiles the docs into the README - there's a script in package.json that just needed the file added (I add the files manually to control the ordering). Run npm run docs to regenerate.

I've named the static members of that file as CarBufferWriter.*, assuming the pattern is import * as CarBufferWriter from '@ipld/car/buffer-writer'; it also makes the TOC cleaner not to have ungrouped static functions.

The logic it uses to pull docs out is that it ignores JSDoc blocks where there are no actual docs. When you add some notes above a function, they'll appear in the README. Have a look over what I've added and consider whether you want to document more of the exported functions, or change the docs that are already there.

Giving you a 👍 for this so I can switch off today and you can get this merged yourself. A few things, though:

  • the typesVersions addition
  • maybe remove blockEncodeSize
  • any other doc tweaks you think would contribute to making the README more informative
  • please squash away the merge commit before merging - I don't really mind how many commits this gets merged as (although the commit messages aren't very informative as they are now), I just really don't want a merge commit in master. I'm happy for you to squash my commits into yours and make a single feat: commit if you like. I think Squash and Merge will take care of it fine if you want to use that.

@rvagg
Copy link
Member

rvagg commented Mar 30, 2022

Oh, the TS compiler doesn't like some of my @param additions for docs. Those can be removed, but it might be nice if you could figure out a compromise to get the params documented a little more nicely.

lib/buffer-writer.js (three outdated review threads, resolved)
Gozala merged commit b1dd34b into master, Mar 31, 2022
Gozala deleted the feat/buffer-writer branch, March 31, 2022 00:58
github-actions bot pushed a commit that referenced this pull request Mar 31, 2022
## [4.1.0](v4.0.0...v4.1.0) (2022-03-31)

### Features

* buffered writer ([#70](#70)) ([b1dd34b](b1dd34b))
github-actions (bot):

🎉 This PR is included in version 4.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
