Discuss: improving ipfs.add
#3033
This sounds a lot like the MFS interface, but operating on an arbitrary DAG instead of the MFS DAG. It's neat & I like it because it uses familiar filesystem-y sounding operations. (In js-land at least) a lot of the MFS commands operate on IPFS paths as well as MFS paths, so you could easily implement the above by wrapping the MFS API with a function that retains the (ever changing) CID at the root of the DAG you are manipulating.

Anyway, some thoughts:

```ts
/**
 * Returns a Promise that resolves when the desired size of the stream's internal queue
 * transitions from non-positive to positive, signaling that it is no longer applying backpressure
 */
ready: Promise<void>,
```

This doesn't seem like a great API to me. The producer should not be asking "is it ok for me to push?", rather they should be told "ok, now give me more". How does the producer change their mind here? If they don't want to wait any more, they should be able to stop waiting for this promise to resolve. It might be a mistake, but it looks like `ready` is a mutable property that gets swapped out for a fresh promise. If instead you just returned a promise from the write call itself, the producer could await it, or not.

You can of course call these API methods without waiting for the promise to resolve. We have a read/write lock mechanism in the MFS to prevent these calls from overlapping in this case. Reads happen in parallel; writes happen in series after any reads complete and prevent other writes or reads from starting.

```ts
type FileContent =
  | Blob
  | Buffer
  | ArrayBuffer
  | TypedArray
  | string
```

This seems a bit simplistic. What if my application was generating enormous amounts of data - think editing video or high quality audio, arbitrary amounts of encrypted data, downloading massive files from the web, stuff like that - not simply files dragged & dropped by the user, but transformed data streams being emitted by the application.

If, instead of taking all of these different data types, you just accept something that streams chunks of bytes, then you can handle any datatype at any size; it just has to be converted to the format you expect before you start to process it. You can either require the user to do that (bad - why would they care what your internal datatypes are?), or you can do it for them.

```ts
writeFile(path: string, content: FileContent, etc: WriteFile): Promise<Stat>
write(path: string, content: FileContent, etc: Write): Promise<void>
```

I don't understand why you would have both of these. Better to just accept:

```ts
writeFile(path: string, content: FileContent, etc: Write): Promise<Stat>
write(content: FileContent, etc: Write): Promise<Stat>
```

If you accept that streaming chunked byte streams of arbitrary length is necessary (see comment above about accepting application data, not just files from the user), you end up with:

```ts
writeFile(path: string, content: FileContent, etc: Write): Promise<Stat>
write(content: FileContent, etc: Write): Promise<Stat>
writeByteStream(content: ReadableStream<Bytes>, etc: Write): Promise<Stat>
writeFileStream(content: ReadableStream<{ content: FileContent, path: string }>, etc: Write): Promise<Stat>
writeFileStreamOfByteStreams(content: ReadableStream<{ content: ReadableStream<Bytes>, path: string }>, etc: Write): Promise<Stat>
```

Though this is leading to an absolute method explosion. This is, I think, fundamentally why `ipfs.add` accepts so many input types through a single method. You either push all that complexity onto the user ("oh my, which function do I call?") or accept whatever they give you and then shoulder the burden of transforming it.

Could the API be more ergonomic? Sure, of course! The MFS API uses familiar FS concepts, so it would be great to have more of those semantics as core IPFS operations. There have been many attempts to unify the MFS and files API; I don't think anyone is truly in love with the API as it stands. This issue and all the linked issues are a good way to get some context: ipfs/specs#98

I'm going to stop here, this has turned into quite the ramble.
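One way to "accept whatever they give you and shoulder the burden of transforming it" is a single normalization step that converts every accepted input shape into one internal representation. A minimal sketch - the `toByteChunks` helper and the narrowed input union are illustrative, not the actual js-ipfs implementation:

```ts
// Illustrative input union - narrower than the real ipfs.add matrix
type AnyContent =
  | string
  | Uint8Array
  | ArrayBuffer
  | Iterable<Uint8Array>
  | AsyncIterable<Uint8Array>

// Normalize every accepted shape into one stream of byte chunks, so the
// rest of the pipeline only ever deals with Uint8Array chunks.
async function* toByteChunks (content: AnyContent): AsyncGenerator<Uint8Array> {
  if (typeof content === 'string') {
    yield new TextEncoder().encode(content)
  } else if (content instanceof Uint8Array) {
    yield content
  } else if (content instanceof ArrayBuffer) {
    yield new Uint8Array(content)
  } else {
    // sync or async iterables of chunks pass straight through
    yield* content
  }
}
```

With a helper like this in place, a single `write(content)` can take all of the above shapes without the method explosion, because every overload collapses into "an async stream of bytes".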
---

The intended purpose of returning a promise from `write` is to signal backpressure to the producer, so that a slow consumer naturally throttles a fast producer:

```ts
for await (const chunk = readHugeDataset(url) {
  await writer.write(path, chunk)
}
const { cid } = await writer.flush(path)
```

That is also why `flush(path)` returns the `CID`: the producer learns the result directly, without mapping outputs back to inputs.
---

This was mostly inspired by the `ready` promise on writers in the WHATWG streams API. I just recall (could be incorrectly) you mentioning that the consumer could pull multiple files to perform concurrent writes. If that is the case, then awaiting each individual write would serialize them, while `ready` would not.

I don't like mutable promise properties myself; it just seemed to make sense to keep it similar to the readable stream stuff.

---
Fun fact: this fell out of my attempt to make `ipfs.add` work across threads (the cross-thread RPC effort in #3022).

---
In a nutshell, when importing a file we chunk it up, then pull batches of chunks and process them (e.g. hash each one, turn it into a block, put the block into the datastore, figure out its place in the file DAG) in parallel. When importing directories full of files, we pull batches of files and process them in parallel (as well as processing their chunks in parallel).

Could you expand a bit on why:

```ts
for await (const chunk = readHugeDataset(url) { // <- did you mean for await..of here?
  await writer.write(path, chunk)
}
const { cid } = await writer.flush(path)
```

is better than:

```ts
for await (const { cid } of ipfs.add({ path, content: readHugeDataset(url) })) {
  // do something with cid
}
```

Also, could you explain why this is a good idea?

---

I ask the last question in particular because there are bugs open against every major browser to implement the async iterable contract method/properties on browser ReadableStreams. At that point we'll be able to take browser ReadableStreams, node ReadableStreams, buffers, arrays, typed arrays, array buffers, all that stuff, because they all are (or will be) async iterables. This is great for the user because they can just pass in whatever they have and don't have much opportunity to choose the wrong API method, and this is great for us because our API surface area remains small.

https://bugs.chromium.org/p/chromium/issues/detail?id=929585

---
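Until browsers ship the async iterable contract on `ReadableStream`, the gap can be bridged with a small adapter built on the reader API. A sketch, assuming a WHATWG `ReadableStream` (available as a global in modern Node.js):

```ts
// Adapt a WHATWG ReadableStream that may lack Symbol.asyncIterator into a
// plain async iterable by pulling from its default reader.
async function* iterateStream<T> (stream: ReadableStream<T>): AsyncGenerator<T> {
  const reader = stream.getReader()

  try {
    while (true) {
      const { done, value } = await reader.read()

      if (done) {
        return
      }

      yield value as T
    }
  } finally {
    // always release, so the stream can be cancelled or read elsewhere
    reader.releaseLock()
  }
}
```

An API that accepts async iterables can then treat a wrapped browser stream exactly like a node stream or an array, keeping the surface area small.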
So please correct me if I'm wrong, but that implies the following code would prevent such a thing from happening:

```ts
for (const { name, content } of files) {
  await writer.writeFile(name, content)
}
```

Because the producer awaits for one file to be written and only then writes the second, the consumer is unable to pull multiple files at a time. This is also what I thought `ready` was for:

```ts
for (const { name, content } of files) {
  writer.writeFile(name, content)
  await writer.ready
}
```

---
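To make the two loops above concrete, here is a toy model of the `ready` pattern under discussion (all names are illustrative; this is not the proposed js-ipfs API). The writer swaps in a fresh unresolved `ready` promise whenever its queue fills, so the producer can fire off several writes and pauses only when told to:

```ts
class SketchWriter {
  // resolves when the queue has capacity; replaced while backpressure applies
  ready: Promise<void> = Promise.resolve()

  private queued = 0
  private signal: (() => void) | null = null

  constructor (private highWaterMark = 2) {}

  writeFile (name: string, content: Uint8Array): Promise<void> {
    this.queued++

    if (this.queued >= this.highWaterMark && this.signal === null) {
      // queue full: swap in a fresh unresolved promise to pause the producer
      this.ready = new Promise(resolve => { this.signal = resolve })
    }

    // pretend to persist the content asynchronously, then free a queue slot
    return Promise.resolve().then(() => {
      this.queued--

      if (this.queued < this.highWaterMark && this.signal !== null) {
        this.signal()
        this.signal = null
      }
    })
  }
}
```

A producer using the second loop above fires writes without awaiting each one and awaits only `writer.ready` between iterations, which is what allows the consumer to process several files concurrently.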
I would not claim one is better than the other. I would, however, suggest that the two make different tradeoffs. I think the proposed API trades a bit of convenience (that is, just pass a thing and it does the right thing) in favor of greater control of the flow. Here is, personally, why I end up making that choice:

Smaller API ≠ Simpler API

This ended up being far too long, but hopefully constructive (even if whiny at times), so I think I'll stop here.

---
I should point out, just in case I left a wrong impression, that I LOVE async generators and readable streams! I'm very glad we have them! And that we use them! It's just that they're not perfect for everything, and that is ok.

---
js-ipfs is being deprecated in favor of Helia. You can follow the progress in #4336 and read the migration guide. Please feel free to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterward (see #4336). I believe the @helia/unixfs package resolves a lot of concerns with the existing interface; please let us know in https://github.com/ipfs/helia-unixfs if it doesn't solve your needs!

---
As I'm dissecting `ipfs.add`, I'm recognizing some ambiguities that I would like to discuss and hopefully improve over time. In the context of #3022 I've continued thinking about how to do RPC of `ipfs.add` as simply as possible while avoiding buffering on the client. Here are some notes / observations / confusions.

`ipfs.add` lets you pass collections of `FileContent` and collections of `FileObject` (that have associated paths). This leads to some behaviors:

- `n` `FileContent`s produce `n` results.
- `1` `FileContent` produces `1` result.
- `1` `FileObject` produces:
  - `1` result if `path` contains `1` fragment (e.g. `'foo'`)
  - `n` results if `path` contains `n` fragments (e.g. `'foo/bar/baz'` -> `[foo/bar/baz, foo/bar, foo]`)
- `n` `FileObject`s produce:
  - `n` + max of path fragments, when paths nest
  - `Error: detected more than one root`, when paths don't nest
- `n` `FileContent` and `m` `FileObject` produce:
  - `n + m + maxPathFragments`, when paths nest
  - `Error: detected more than one root`, when paths don't nest

If you consider the user of `ipfs.add`, this API is really brittle if you want to do anything with the produced results other than print them out or pick the last one. The latter only makes sense when `FileObject`s are used and the path contains `>1` fragments. I would really like to understand what the goals of this API are, and hopefully improve the documentation so it can be better understood by other users.

I also remember having a similar discussion with @mikeal about `DagWriter`, and I'm wondering if we can learn something from his work to improve this API as well. As I understand it, `ipfs.add` attempts to provide an API for submitting a directory structure and returning a tree of corresponding stats. Which makes me wonder if something like a `FilesystemWriter` (as in the writable part of MFS) could provide a cleaner and more intuitive API to work with. Here is a sketch of what I mean. I believe it would be able to do everything that `ipfs.add` does today, but with a more convenient API that avoids some of the confusions described above:

- `write` returns a promise, signaling backpressure to the producer.
- `stat` can be used to get a `Stat`, and it's sync because completed write stats are accumulated and held by the writer.
- Intermediate directories are created implicitly (as with `{parents: true}` in write calls).
- The `CID` is made available through `writeFile()`, so there is no need to do a backwards mapping.
- `flush(path)` will return the `CID` for the written file; `flush()` / `flush('/')` provides that for the root.
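Based only on the bullet points above, a hypothetical `FilesystemWriter` interface might look roughly like this. Every name and signature here is inferred, not the author's actual sketch, and `CID` is a string stand-in for a real CID instance:

```ts
interface Stat {
  cid: CID
  size: number
  type: 'file' | 'directory'
}

type CID = string // stand-in; the real thing would be a CID instance

interface FilesystemWriter {
  // the returned promise signals backpressure; the Stat carries the CID,
  // so no backwards mapping from results to inputs is needed
  writeFile (path: string, content: Uint8Array): Promise<Stat>
  // append a chunk to the file at `path`
  write (path: string, chunk: Uint8Array): Promise<void>
  // sync, because completed write stats are accumulated and held by the writer
  stat (path: string): Stat
  // CID of the written file; flush() / flush('/') returns the root CID
  flush (path?: string): Promise<CID>
}
```

A toy in-memory implementation would then let each file's result be consumed right where it is produced, instead of being fished out of a single result stream.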