Validate size in the DagReaders #4680
Conversation
No blocksizes. File size bigger than all data. All links write to file.
Nice! As @ivan386 says, we should check if …

Edit: Alternatively, we could do the same zero-extend/truncate thing.
I wish there were a simpler way to do this but I'm pretty sure there isn't.
	r.offset += int64(n)
	return n, err
}
for ; n < len(p) && r.offset+int64(n) < r.size; n++ { // pad
Nit: This may confuse the optimizer. I'd compute the bounds up-front. Possibly something like:
remaining := int(r.size - (r.offset + int64(n)))
if remaining > len(p)-n {
	remaining = len(p) - n
}
for i := range p[n : n+remaining] {
	p[n+i] = 0
}
n += remaining
Ok.
I can still do this, but I am reluctant to due to the increase in code size.
Entirely up to you. I seriously doubt this will be a bottleneck.
unixfs/io/sizeadj.go
Outdated
}
n, err := r.base.Read(p)
if err == nil {
	_, err = r.base.Read(nil)
I'd just allow the short read and force the outer reader to loop. Unfortunately, a lot of readers don't like read attempts with empty buffers (and I'd rather not do two reads per outer read).
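For context, a caller that needs a full buffer can simply wrap the reader with io.ReadFull, which loops over short reads; a minimal sketch using only the standard library:

package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	// A reader may legally return fewer bytes than requested; io.ReadFull
	// keeps calling Read until the buffer is full or an error occurs, so
	// short reads from an inner reader are absorbed by the outer loop.
	r := strings.NewReader("hello world")
	buf := make([]byte, 11)
	n, err := io.ReadFull(r, buf)
	fmt.Println(n, err, string(buf)) // 11 <nil> hello world
}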
I did try that initially and it broke some code; I will look into it more closely.
That's not good...
unixfs/io/sizeadj.go
Outdated
}
// Its easier just to always use io.SeekStart rather than
// correctly adjust offset for io.SeekCurrent and io.SeekEnd.
return r.base.Seek(r.offset, io.SeekStart)
Shouldn't this return r.offset? base.Seek could return a short offset.
Yes.
unixfs/io/sizeadj.go
Outdated
	r.offset = r.size + offset
}
if r.offset < 0 {
	return -1, errors.New("Seek will result in negative position")
Shouldn't this rollback the offset (technically)?
Yes, will fix.
@Stebalien, two questions …

IMO, yes.

Yes. I consider this situation to be the same as (1).
For test: QmRqh1ckCLG6nXoE1aJQT2CHrByk5DWji8v3s4bHcmN975
Here zero size raw link: zb2rhmy65F3REf8SZp7De11gxtECBGgUKaLdiDj7MCGCHxbDW

I think that aligning blocks to the end of the file is the best solution. Less data in block.
(force-pushed from 85203e7 to 055026c)
test/sharness/t0110-gateway.sh
Outdated
@@ -188,7 +188,7 @@ test_expect_success "Add compact blocks" '
	printf "foofoo" > expected
'

-test_expect_success "GET compact blocks succeeds" '
+test_expect_failure "GET compact blocks succeeds" '
These sorts of changes make me nervous. What's the technical reason for the swap?
Because the test uses a unixfs node that does not define any blocksizes. Per earlier discussion:
@Stebalien, two questions
- If there are links defined but no corresponding blocksizes, should we consider the block ill-formed and fail with an error? After all, seeking is impossible. (This came up in one of the test cases on a dag that was likely constructed manually, QmTTDNxNJUqF5PZicJqjDjsUed1bHaPWwFdEMvb1iY93Ur in the t0110-gateway.sh test "Add compact blocks".)
IMO, yes.
That's not good... I assumed that all of the files we produced had blocksizes listed. @whyrusleeping did we change this at some point?
@Stebalien I think this test is testing a specifically constructed block.
I think we should add a decent number of unit tests for stuff like this. Testing this sort of specific detailed logic via sharness feels wrong.
@whyrusleeping thoughts?
@kevina here is the background on this testcase: #4286 (comment)
Specifically:
Additionally - the way my project chunks data, my average blocksize is around 4096 bytes ( with a hard-limit on block sizes at 64k ), so forgoing any optional metadata is critical for me in order to reduce the size of my objects ( as the overhead adds up incredibly fast when one deals with hundreds of millions of objects in a DAG ).
The sizes are optional
@kevina please do not break this relied-upon feature ( neither in PB- nor in the future CBOR-UnixFS )
So, this is a problem. Unix programs expect files to be seekable and will break horribly if they aren't (and I'm not going to break every single unix program ever written). However, I also don't want to break existing uses if I can help it and this problem doesn't seem insurmountable.
We should be able to fall back on dumb seeking (although we may want to disable this from the gateway). Basically, the logic would be as follows:
- If blocksizes are defined, use them.
- If a block with children has no blocksizes defined (none at all, not some, not half, not incorrect ones, none), dumb seek through that portion of the file.
This actually gives us the best of both worlds. Files with lots of tiny blocks can put blocksizes at the top layers of the DAG and omit them in the lower levels.
In terms of implementation, we can punt on dumb seeking for now. Simply validate blocksizes if present, and error on seeking when they're not.
@kevina how hard would this be?
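A minimal sketch of that interim rule (seek only when blocksizes are available, refuse otherwise); the type and field names here are assumptions for illustration, not the PR's actual reader:

package main

import (
	"errors"
	"fmt"
	"io"
)

// seekableDagReader is a sketch: seeking uses blocksizes when they exist and
// is refused, for now, when they don't (rather than "dumb seeking").
type seekableDagReader struct {
	blocksizes []uint64
	offset     int64
}

func (r *seekableDagReader) Seek(offset int64, whence int) (int64, error) {
	if whence != io.SeekStart {
		return 0, errors.New("sketch supports io.SeekStart only")
	}
	if len(r.blocksizes) == 0 {
		// Punt on dumb seeking: without blocksizes a byte offset cannot be
		// mapped onto a child link, so report an error instead.
		return 0, errors.New("unixfs: cannot seek, node has no blocksizes")
	}
	r.offset = offset
	return offset, nil
}

func main() {
	r := &seekableDagReader{}
	_, err := r.Seek(42, io.SeekStart)
	fmt.Println(err) // unixfs: cannot seek, node has no blocksizes
}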
I will see what I can do.
Done.
unixfs/io/pbdagreader.go
Outdated
curLinks := getLinkCids(n)
data := pb.GetData()

// validate
I'd like to have this logic broken out separately and have a few unit tests for it
Can you clarify what you mean here?
The validate logic that has been newly added: it should live in its own function, and then we should add a few unit tests for that function.
I'll see what I can do.
Okay this is done now. Sorry for the delay.
It adds complexity. I don't want to have one case where we zero-pad the end and another where we zero-pad the beginning.
That's a bug. Would you mind reporting it? Also, we technically have an "identity" hash function so, eventually, we should be able to encode this empty block as 4 bytes.

Note: The empty block is zb2rhWm2M1wXXqtqU6pHfovz3DZQ7D54ZD2xN3ynwankHCBCn (can add it with …).
Reported: #4688
Real zero length raw block is:
Convert it to base58: z2oTmZ
test_dag_identity.json:
Report new bug? PS:
PPS: Same error
Good catch.
The fact that we choke on a zero-length identity multihash is a bug in the multihash library. The fact that we don't currently support "inline blocks" is more of a missing feature (we only recently added the identity multihash and haven't added support for extracting blocks from it yet). However, feel free to report both (we want inline blocks eventually anyways).

0x55 versus 0x37: You're right, the code is 0x55. I converted 55 into hex (but it's already hex).
progress here?
Assuming the tests pass, I think this should be good to go. We can clean up the failed test case later.
(force-pushed from b1c5ffb to bf51ef3)
As @Stebalien has done most of the review here, I'd like to get his 👍 on merging.
@whyrusleeping see #4680 (comment) ~Kubuxu
@whyrusleeping, @kevina see my comment here (unless I missed some change and this is no longer an issue). (Also, @kevina, sorry for giving you bad information. I was unaware of @whyrusleeping's promise.)
@Stebalien I pushed a commit which I think should address your concern.
(force-pushed from 8e412b3 to 1c63543)
@Stebalien I fixed the test case, in case that was what you were waiting for. @magik6k a review from you would also be helpful.
}
// Its easier just to always use io.SeekStart rather than
// correctly adjust offset for io.SeekCurrent and io.SeekEnd.
_, err := r.base.Seek(newOffset, io.SeekStart)
What happens if r.size == 1000, but r.base.Size() == 500, and we try to call r.Seek(700)?
It seems like this would error out, when it should succeed and subsequent reads should return up to 300 bytes of zeros. I guess I'm not sure what errors Seek returns when you try to seek past its bounds.
Actually it won't error out, from the documentation (https://golang.org/pkg/io/#Seeker):
Seeking to an offset before the start of the file is an error. Seeking to any positive offset is legal, but the behavior of subsequent I/O operations on the underlying object is implementation-dependent.
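To make the 1000/500/700 scenario concrete, here is a rough standalone sketch of a reader that records seeks past the end of its base and serves zeros from the padded region. It illustrates the behaviour discussed above and is not the actual sizeadj.go code:

package main

import (
	"bytes"
	"fmt"
	"io"
)

// zeroExtendReader is an illustrative reader (not the real implementation)
// that presents base as if it were exactly size bytes, zero-filling past its end.
type zeroExtendReader struct {
	base   *bytes.Reader // underlying data, possibly shorter than size
	size   int64         // logical size reported to callers
	offset int64         // current logical position
}

func (r *zeroExtendReader) Seek(offset int64, whence int) (int64, error) {
	if whence != io.SeekStart {
		return 0, fmt.Errorf("sketch supports io.SeekStart only")
	}
	if offset < 0 {
		return 0, fmt.Errorf("seek to negative position")
	}
	// Per io.Seeker, seeking past the end is legal: record the logical
	// position and clamp the underlying seek to the base's real size.
	r.offset = offset
	baseOff := offset
	if baseOff > r.base.Size() {
		baseOff = r.base.Size()
	}
	_, err := r.base.Seek(baseOff, io.SeekStart)
	return r.offset, err
}

func (r *zeroExtendReader) Read(p []byte) (int, error) {
	if r.offset >= r.size {
		return 0, io.EOF
	}
	if remaining := r.size - r.offset; int64(len(p)) > remaining {
		p = p[:remaining]
	}
	n, err := r.base.Read(p)
	if err == io.EOF {
		err = nil // anything past the base is part of the zero padding
	}
	for i := n; i < len(p); i++ { // zero-fill the padded region
		p[i] = 0
	}
	r.offset += int64(len(p))
	return len(p), err
}

func main() {
	r := &zeroExtendReader{base: bytes.NewReader(make([]byte, 500)), size: 1000}
	pos, err := r.Seek(700, io.SeekStart)
	buf := make([]byte, 400)
	n, _ := r.Read(buf)
	fmt.Println(pos, err, n) // 700 <nil> 300 (the 300 bytes read are zeros)
}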
unixfs/io/sizeadj.go
Outdated
func (w *truncWriter) Write(p []byte) (int, error) {
	truncC := 0
	if int64(len(p)) > w.size-w.offset {
		truncC = int(int64(len(p)) - w.size - w.offset)
I think your order of operations here is wrong.
Let's say we try to write 10 bytes to a thing that has a size of 15 and a current offset of 12:
10 - 15 - 12 == -17
You want: int64(len(p)) - (w.size - w.offset)
unixfs/io/sizeadj.go
Outdated
	truncC := 0
	if int64(len(p)) > w.size-w.offset {
		truncC = int(int64(len(p)) - w.size - w.offset)
		p = p[:w.size]
I think you mean p = p[:w.size-w.offset] here.
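Folding both of the fixes above into one place, a rough sketch of what the corrected Write could look like; the struct fields are assumed from the diff context rather than copied from the final code:

package main

import (
	"bytes"
	"fmt"
	"io"
)

// truncWriter mirrors the writer in the diff: it drops anything written past
// size while reporting the full length back to the caller (sketch only).
type truncWriter struct {
	base   io.Writer
	size   int64
	offset int64
}

func (w *truncWriter) Write(p []byte) (int, error) {
	truncC := 0
	if int64(len(p)) > w.size-w.offset {
		// Parenthesize the remaining space before subtracting it.
		truncC = int(int64(len(p)) - (w.size - w.offset))
		p = p[:w.size-w.offset]
	}
	n, err := w.base.Write(p)
	w.offset += int64(n)
	if err != nil {
		return n, err
	}
	return n + truncC, nil
}

func main() {
	var buf bytes.Buffer
	w := &truncWriter{base: &buf, size: 15, offset: 12}
	n, err := w.Write(make([]byte, 10)) // only 3 bytes actually fit
	fmt.Println(n, err, buf.Len())      // 10 <nil> 3
}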
}

// Write implemented Write method as defined by io.Writer
func (w *truncWriter) Write(p []byte) (int, error) {
Add tests for this function once the issues here are fixed, please.
Okay, done. Sorry about that. I wrote this code several months ago and thought it was ready for review; I guess I was wrong, as I did not double-check things carefully enough.
(force-pushed from 14856b3 to 77149fd)
unixfs/unixfs.go
Outdated
for _, blocksize := range pb.Blocksizes {
	total += blocksize
}
if total != pb.GetFilesize() {
So, one case here is pb.Filesize == nil. Is Filesize required?
I'll double-check, and if not I will add a check for that.
Note: the function pb.GetFilesize() returns a uint64; if pb.Filesize == nil it will return 0.
Sorry, my question was more: If Filesize is nil, that means it wasn't specified. Do we require Filesize to be specified? Should we calculate it from the blocksizes if left unspecified?
@Stebalien given there must be a source of size, and given that blocksizes are optional, my hunch would be to just make Filesize hard-required.
Hm. I'd like @whyrusleeping and @diasdavid to 👍 this.
Will all IPFS files include a Filesize?
I think it's probably fine to assume that any correctly formatted ipfs/unixfs file will contain a filesize. Seems pretty hard to do a lot of things without it.
Okay I added a test to make sure it is not nil.
unixfs/unixfs.go
Outdated
// ValidatePB validates a unixfs protonode.
func ValidatePB(n *dag.ProtoNode, pb *pb.Data) error {
	if len(pb.Blocksizes) == 0 { // special case
		return nil
There's also the case where there are no blocks but we still have inline data.
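Pulling the points above together (Filesize required, blocksizes must account for the file, blocksize-less nodes treated as a special case), a hedged sketch of the kind of checks being discussed; it is an editorial illustration, not the PR's actual ValidatePB:

package main

import (
	"errors"
	"fmt"
)

// data mirrors the relevant unixfs protobuf fields for this sketch.
type data struct {
	Filesize   *uint64  // optional in the protobuf, treated as required here
	Blocksizes []uint64 // one entry per child link, may be empty
}

// validate is a sketch: Filesize must be set, and when blocksizes are
// present they must cover every link and sum to Filesize.
func validate(numLinks int, d *data) error {
	if d.Filesize == nil {
		return errors.New("unixfs: file node missing Filesize")
	}
	if len(d.Blocksizes) == 0 {
		// Special case from the review: no blocksizes at all (e.g. a node
		// carrying only inline data); skip the per-link accounting.
		return nil
	}
	if len(d.Blocksizes) != numLinks {
		return fmt.Errorf("unixfs: %d links but %d blocksizes", numLinks, len(d.Blocksizes))
	}
	var total uint64
	for _, bs := range d.Blocksizes {
		total += bs
	}
	if total != *d.Filesize {
		return fmt.Errorf("unixfs: blocksizes sum to %d but Filesize is %d", total, *d.Filesize)
	}
	return nil
}

func main() {
	size := uint64(6)
	fmt.Println(validate(2, &data{Filesize: &size, Blocksizes: []uint64{3, 3}})) // <nil>
}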
(force-pushed from 2c59305 to bbc034e)
If a node is too large, it is truncated; if it is too small, it is zero extended. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
One failed test due to the use of an ill-formed unixfs node. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
(force-pushed from bbc034e to b563ea5)
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
}
if len(dr.pbdata.Blocksizes) > 0 {
	// if blocksizes are set, ensure that we read exactly that amount of data
	dr.buf, err = newSizeAdjReadSeekCloser(dr.buf, dr.pbdata.Blocksizes[linkPos])
You know, I think we can do something simpler here. Given that we know the Filesize of the buffer, we can require that buf export a Size() method. Then, we can just have a check here:
- buf.Size() > blocksize -> make a limit seeker
- buf.Size() == blocksize -> return as-is
- buf.Size() < blocksize -> zero extend

That way, we can do all of these checks up-front. Thoughts? Too late?
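A rough sketch of that up-front dispatch; the interfaces and the two wrapper constructors are placeholders standing in for whatever sizeadj.go ends up exposing (the diff above already has newSizeAdjReadSeekCloser playing a similar role):

package sizeadj

import "io"

// ReadSeekCloser and SizedReadSeekCloser are assumed interfaces for this sketch.
type ReadSeekCloser interface {
	io.Reader
	io.Seeker
	io.Closer
}

type SizedReadSeekCloser interface {
	ReadSeekCloser
	Size() uint64
}

// Placeholder constructors; real wrappers would implement the limiting and
// zero-extending behaviour.
func newLimitReadSeekCloser(r SizedReadSeekCloser, size int64) ReadSeekCloser { panic("sketch only") }

func newZeroExtendReadSeekCloser(r SizedReadSeekCloser, size int64) ReadSeekCloser {
	panic("sketch only")
}

// adjustToBlocksize does the up-front check: compare the buffer's known size
// against the expected blocksize and wrap it accordingly.
func adjustToBlocksize(buf SizedReadSeekCloser, blocksize uint64) ReadSeekCloser {
	switch {
	case buf.Size() > blocksize:
		// Too much data: expose only the first blocksize bytes.
		return newLimitReadSeekCloser(buf, int64(blocksize))
	case buf.Size() == blocksize:
		// Exactly right: hand the buffer back untouched.
		return buf
	default:
		// Too little data: zero-extend reads out to blocksize.
		return newZeroExtendReadSeekCloser(buf, int64(blocksize))
	}
}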
What currently happens if there are no blocksizes but the filesize doesn't match? Both with and without inline data? We may need to truncate/zero extend in this case as well. Current rules:
One thing is a bit inconsistent. If no blocksizes are specified, the filesize is used to truncate/zero extend. If they are specified, the filesize can't do this. We could also just truncate/extend always (i.e., don't check that …).

So, I know this is kind of cold feet at the 11th hour; however, after seeing all the weird corner-cases that come up in practice, I'm starting to wonder if we should fail more. That is:
The downsides are:
I'm not sure what the status on this currently is, but:
This needs to be worked out at the spec level first. We jumped into coding too quickly, IMO.
If a node is too large, it is truncated; if it is too small, it is zero extended.
Closes #4540. Closes #4667.