Leverage multipart blob uploads to remove need for sharding / reassembly #20
Comments
Also, this article describes a concatenation strategy using multipart uploads, where existing objects in S3 can be used as parts without having to re-upload them. I think it is really interesting as it may allow us to create append-only logs, which could be the basis for queues and a whole lot of other things relevant in distributed systems.
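For reference, a minimal sketch of what that concatenation strategy could look like with the AWS SDK v3: existing objects are stitched into a new object via UploadPartCopy, so nothing is re-uploaded. Bucket and key names are illustrative placeholders, not anything from this repo.

```ts
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCopyCommand,
  CompleteMultipartUploadCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// Concatenate existing objects into destKey via server-side copies.
// Note: every part except the last must be at least 5 MiB.
async function concatObjects(bucket: string, sourceKeys: string[], destKey: string) {
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: destKey })
  );

  const parts = [];
  for (const [i, key] of sourceKeys.entries()) {
    // Each existing object becomes a part via server-side copy (no re-upload).
    const res = await s3.send(
      new UploadPartCopyCommand({
        Bucket: bucket,
        Key: destKey,
        UploadId,
        PartNumber: i + 1,
        CopySource: `${bucket}/${key}`,
      })
    );
    parts.push({ PartNumber: i + 1, ETag: res.CopyPartResult?.ETag });
  }

  await s3.send(
    new CompleteMultipartUploadCommand({
      Bucket: bucket,
      Key: destKey,
      UploadId,
      MultipartUpload: { Parts: parts },
    })
  );
}
```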
My primary question @Gozala -- what's the Filecoin strategy? We have to shard at that level for super large uploads. I always understood 4GB to be a limit that was safe to fit in a single Filecoin piece.
Currently we submit aggregates that are 16GiB, which we chose over 32GiB (both options recommended by Spade) to move content faster through the pipeline. There is some overhead to be considered for the segment index, but otherwise blobs up to the whole aggregate size minus the segment index size could be used without a problem. You are correct, however, that blobs larger than an aggregate would need to be sharded for the Filecoin pipeline, but even then we could provide them to SPs as byte ranges within our large blob. It is worth calling out the motivating benefits. Currently we create much smaller shards, around 127MiB, that are CARs, so for large uploads (e.g. a video stream) we'd end up with a lot of shards, then an index mapping blocks across those shards, and when serving we'd need to reverse all of that. That is a lot of overhead, and you lose the ability to perform (efficient) skip forward/backward on a video stream just to satisfy the first iteration over hash-addressed content. In contrast, if we retain content as is, we remove a lot of the overhead associated with sharding, transcoding and then reconstructing, and gain the ability to skip forward/backward. Better yet, the actual blob could be read right out of the copy that an SP holds.
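To make the "byte range within our large blob" idea concrete, here is a hedged sketch using a standard ranged GetObject. The offsets and names are illustrative only; the real offsets would come from whatever index maps a shard to its position in the blob.

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// Read one logical "shard" as a byte range of a single large blob,
// instead of fetching a separately stored CAR file.
async function readShard(bucket: string, key: string, offset: number, length: number) {
  const res = await s3.send(
    new GetObjectCommand({
      Bucket: bucket,
      Key: key,
      // Standard HTTP range semantics: inclusive byte range.
      Range: `bytes=${offset}-${offset + length - 1}`,
    })
  );
  return res.Body; // stream containing just the requested slice
}
```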
Wow, ok. I don't know where I got 4GB as the shard size. 127MiB is super small. And the shard as an index into a larger blob is SUPER interesting. Alright, now I really like this. It makes 100% sense that we should apply sharding only when needed, and specific to the situation where it's needed, rather than have fixed sharding all the way through for both Filecoin and the upload part. So essentially, multipart upload is "temporary sharding" for the upload process, while byte range indexes into a blob for an SP are sharding tailored to Filecoin. That's pretty cool. And we don't really have to go all in on Blake3 to do it -- we could just stop sharding into multiple CAR files. Ok, well, don't hate me, but while I think this is a great thing to do, I also don't think it goes in Write To R2. I am imagining an endeavor, at a later point, we can call "simplifying storage format" which would cover this as well as Blake3 instead of CARs for flat files.
Looks like R2 also supports multipart uploads: https://developers.cloudflare.com/r2/objects/multipart-objects/ What is important to look at is whether presigned URLs support multipart uploads in S3/R2, at least while we use them.
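One known pattern worth checking here: individual UploadPart requests can be presigned, so the client uploads parts directly without credentials while the multipart upload itself is created server-side. A hedged sketch (upload IDs, part counts and the "auto" region are illustrative; whether this fits our presigned-URL flow is exactly the open question above):

```ts
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
} from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "auto" });

// Create the multipart upload server-side, then hand out one presigned
// URL per part; the client PUTs its bytes to each URL.
async function presignParts(bucket: string, key: string, partCount: number) {
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: key })
  );

  const urls = await Promise.all(
    Array.from({ length: partCount }, (_, i) =>
      getSignedUrl(
        s3,
        new UploadPartCommand({ Bucket: bucket, Key: key, UploadId, PartNumber: i + 1 }),
        { expiresIn: 3600 }
      )
    )
  );
  return { uploadId: UploadId, urls };
}
```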
We submit aggregates that are 26GiB minimum actually :) It is an env VAR in seed.run
4GB is the biggest size we support for a shard, for two main reasons:
127MiB is the client's default chunking. Users can pass a parameter to make it bigger or smaller; we could actually discuss whether we want more on it. I think we decided on a lower value to make retries on failure faster, allow parallel chunk writes, etc. I think @alanshaw may have thoughts here
Ya, 127MB is the default in the client but is configurable up to ~4GB. Tuning the shard size can allow resuming a failed upload to be quicker. Amongst other historical reasons - can elaborate if required... ℹ️ Note: uploaded parts persist in the bucket until a complete/abort request is sent at the end. If the user never sends that final request then I think by default the parts hang around indefinitely. That said, I imagine there are lifecycle rules you can set to clean them up, but we'd need to check. It does make things a bit more complicated though...
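On the clean-up point: S3 does support a bucket lifecycle rule that aborts incomplete multipart uploads after a number of days. A hedged sketch against the S3 API (rule ID, day count and bucket name are illustrative; whether R2 exposes the same rule through its S3 API is one of the things we'd need to verify):

```ts
import {
  S3Client,
  PutBucketLifecycleConfigurationCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// Abort multipart uploads that were never completed within 7 days,
// so abandoned parts don't linger in the bucket indefinitely.
async function setAbortRule(bucket: string) {
  await s3.send(
    new PutBucketLifecycleConfigurationCommand({
      Bucket: bucket,
      LifecycleConfiguration: {
        Rules: [
          {
            ID: "abort-stale-multipart-uploads",
            Status: "Enabled",
            Filter: { Prefix: "" }, // apply to the whole bucket
            AbortIncompleteMultipartUpload: { DaysAfterInitiation: 7 },
          },
        ],
      },
    })
  );
}
```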
S3 has multipart uploads that allow users to shard content and upload it in parts, just like we shard into CARs and upload them.
In our case, however, we have to do a whole lot of bookkeeping and reassemble content on reads. This will make even less sense if we transition from UnixFS into the BAO / Blake3 world, where sharding will offer no value, just overhead.
I think it would be interesting to explore how multipart uploads could be leveraged to avoid sharding, and whether multipart uploads are supported by other storage providers. A rough sketch of the flow is below.
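A hedged sketch of the plain multipart flow that would stand in for client-side CAR sharding: create, upload parts, complete, and let the service reassemble the object so there is no read-time bookkeeping. Names and the 100 MiB part size are illustrative only.

```ts
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });
const PART_SIZE = 100 * 1024 * 1024; // every part except the last must be >= 5 MiB

async function uploadBlob(bucket: string, key: string, blob: Uint8Array) {
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: key })
  );

  // "Temporary sharding": slice the blob into parts purely for transport.
  const parts = [];
  for (let offset = 0, n = 1; offset < blob.length; offset += PART_SIZE, n++) {
    const { ETag } = await s3.send(
      new UploadPartCommand({
        Bucket: bucket,
        Key: key,
        UploadId,
        PartNumber: n,
        Body: blob.subarray(offset, offset + PART_SIZE),
      })
    );
    parts.push({ PartNumber: n, ETag });
  }

  // The service stitches the parts back into a single object.
  await s3.send(
    new CompleteMultipartUploadCommand({
      Bucket: bucket,
      Key: key,
      UploadId,
      MultipartUpload: { Parts: parts },
    })
  );
}
```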