This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Discuss: improving ipfs.add #3033

Closed
Gozala opened this issue May 14, 2020 · 10 comments

Comments

@Gozala
Contributor

Gozala commented May 14, 2020

As I'm dissecting ipfs.add I'm recognizing some ambiguities that I would like to discuss and hopefully improve over time.

In the context of #3022 I've continued thinking about how to make the RPC for ipfs.add as simple as possible while avoiding buffering on the client. Here are some notes / observations / confusions:

ipfs.add lets you pass collections of FileContent and collections of FileObject (that have associated paths). This leads to some behaviors (illustrated in the sketch after this list):

  • Passing n FileContents produces n results.
  • Passing 1 FileContent produces 1 result.
  • Passing 1 FileObject produces:
    • 1 result if the path contains 1 fragment (e.g. 'foo')
    • n results if the path contains n fragments (e.g. 'foo/bar/baz' -> [foo/bar/baz, foo/bar, foo])
  • Passing n FileObjects produces:
    • n + max number of path fragments results - when paths nest
    • Error: detected more than one root - when paths don't nest
  • Passing n FileContents and m FileObjects produces:
    • n + m + maxPathFragments results - when paths nest
    • Error: detected more than one root - when paths don't nest
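
For example, a sketch of the nested-path case (illustrative only; hashes omitted):

for await (const result of ipfs.add({
  path: 'foo/bar/baz',
  content: 'hello world'
})) {
  console.log(result.path)
}
// logs foo/bar/baz, foo/bar and foo - one result per path fragment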

If you consider the user of ipfs.add, this API is really brittle if you want to do anything with the produced results other than print them out or pick the last one. The latter only makes sense when FileObjects are used and the path contains >1 fragments.

I would really like to understand what the goals of this API are and hopefully improve the documentation so it can be better understood by other users.

I also remember having a similar discussion with @mikeal about DagWriter and I'm wondering if we can learn something from his work to improve this API as well. As I understand it, ipfs.add attempts to provide an API for submitting a directory structure and returning a tree of corresponding stats. Which makes me wonder if something like FilesystemWriter (as in the writable part of MFS) could provide a cleaner and more intuitive API to work with. Here is a sketch of what I mean:

interface FilesystemWriter {
  /**
   * Returns a Promise that resolves when the desired size of the stream's internal queue
   * transitions from non-positive to positive, signaling that it is no longer applying backpressure
   */
  ready:Promise<void>,
  /**
   * Make a directory.
   */
  makeDirectory(path:string, options?:MakeDirectory):Promise<Stat>,
  /**
   * Similar to fs.writeFile in node
   */
  writeFile(path:string, content:FileContent, etc:WriteFile):Promise<Stat>
  /**
   * Similar to fs.write in node
   */
  write(path:string, content:FileContent,  etc:Write):Promise<void>
  stat(path, options):Stat
  /**
   * Fulfills with the `Stat` of the root once all the writes are successfully flushed,
   * or rejects with an error if a problem was encountered during the process.
   */
  flush(path?:string):Promise<Stat>
}

type FileContent =
  | Blob
  | Buffer
  | ArrayBuffer
  | TypedArray
  | string

type MakeDirectory = {
  parents?: boolean,
  hashAlg?: string,
  flush?: boolean,
  mode?: Mode,
  mtime?: Time|Date
}

type WriteFile = {
  create?:boolean,
  truncate?:boolean,
  parents?:boolean,
  length?:number,
  rawLeaves?:boolean,
  cidVersion?:number,
  mode?:Mode,
  mtime?:Time|Date
}

type Write = WriteFile & {
    offset?:number
}

type Mode = number | string

type Time = {
  sec:number,
  nsecs?:number
}

I believe it would be able to do everything that ipfs.add does today, but with a more convenient API that avoids some of the confusions described above (a usage sketch follows the list below).

  • Each write returns a promise, signaling backpressure to the producer.
  • The writer can queue writes if backpressure is disregarded by the producer ((highMark, lowMark) style).
  • stat can be used to get a Stat and it's sync because completed write stats are accumulated and held by the writer.
  • The user can create directories explicitly or implicitly (via {parents:true} in write calls).
  • The individual file CID is made available through writeFile() so there is no need to do backwards mapping.
  • Streaming writes could be accomplished by calling write with an offset. flush(path) will return the CID for the written file.
  • If the user just cares about the top CID, flush() / flush('/') provides that.
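
For illustration, here is how the proposed writer might be used (hypothetical - ipfs.files.writer(), readHugeDataset() and the byte sources are assumptions, not existing APIs):

const writer = ipfs.files.writer()

// explicit directory, then a file inside it
await writer.makeDirectory('photos')
const { cid: photoCid } = await writer.writeFile('photos/cat.png', catBytes)

// streaming write of a large file via write() with offsets
let offset = 0
for await (const chunk of readHugeDataset(url)) {
  await writer.write('dataset.bin', chunk, { offset })
  offset += chunk.byteLength
}
const { cid: datasetCid } = await writer.flush('dataset.bin')

// CID of the whole tree once everything has been flushed
const { cid: rootCid } = await writer.flush()
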
@Gozala Gozala added the need/triage Needs initial labeling and prioritization label May 14, 2020
@achingbrain
Member

This sounds a lot like the MFS interface, but operating on an arbitrary DAG instead of the MFS DAG. It's neat & I like it because it uses familiar filesystem-y sounding operations.

(In js-land at least) a lot of the MFS commands operate on IPFS paths as well as MFS paths, so you could easily implement the above by wrapping the MFS API with a function that retains the (ever changing) CID at the root of the DAG you are manipulating.
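
Something along these lines, as a rough sketch wrapping the existing MFS methods (ipfs.files.mkdir/write/stat/flush are the real js-ipfs MFS API; the wrapper itself is only an illustration and error handling is omitted):

function filesystemWriter (ipfs) {
  return {
    async makeDirectory (path, options = {}) {
      await ipfs.files.mkdir(`/${path}`, { parents: true, ...options })
      return ipfs.files.stat(`/${path}`)
    },
    async writeFile (path, content, options = {}) {
      await ipfs.files.write(`/${path}`, content, { create: true, parents: true, ...options })
      return ipfs.files.stat(`/${path}`)
    },
    async flush (path = '/') {
      // flush persists pending changes, stat returns the (ever changing) root CID
      await ipfs.files.flush(path)
      return ipfs.files.stat(path)
    }
  }
}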

You could also implement it by treating ipfs.add as a low-level API, and wrapping it in a function that hands it an async generator that yields {content, path} objects in response to the makeDirectory, writeFile etc methods above and completes when flush is called. I guess you'd have to hold the directory structure in memory as you create it for stat to work; I'll leave that as an exercise for the reader.

Anyway, some thoughts:

/**
 * Returns a Promise that resolves when the desired size of the stream's internal queue
 * transitions from non-positive to positive, signaling that it is no longer applying backpressure
 */
ready:Promise<void>,

This doesn't seem like a great API to me, the producer should not be asking 'is it ok for me to push?', rather they should be told 'ok, now give me more'.

How does the producer change their mind here? If they don't want to wait any more they should be able to stop waiting for this promise to resolve.

It might be a mistake, but it looks like ready here is a property? It should be a function that returns a promise, because it can't un-resolve if we want to await on it again.

If instead you just await on the promise returned by each operation, the promise will resolve when it's complete, meaning that the consumer is ready for more data. No external backpressure mechanism is necessary; the await is the backpressure.
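
For example (illustrative):

// The await itself throttles the producer: the next write only starts once
// the previous one has been accepted by the consumer
for (const { path, content } of files) {
  await writer.writeFile(path, content)
}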

You can of course call these API methods without waiting for the promise to resolve. We have a read/write lock mechanism in the MFS to prevent these calls from overlapping in this case. Reads happen in parallel, writes happen in series after any reads complete and prevent other writes or reads from starting.

type FileContent =
  | Blob
  | Buffer
  | ArrayBuffer
  | TypedArray
  | string

This seems a bit simplistic. Blob is the only type here capable of representing data that is greater than the available memory+swap of the machine and that is only through its .stream() method, which is only really usable when the browser creates the Blob (e.g. if I do new Blob(content), content is an array-like that has to be in memory, or another blob, but if it is, please recurse to the beginning of this sentence).

What if my application was generating enormous amounts of data - think editing video or high quality audio, arbitrary amounts of encrypted data, downloading massive files from the web, stuff like that - not simply files dragged & dropped by the user but transformed data streams being emitted by the application.

If instead of taking all of these different data types, you just accept something that streams chunks of bytes, then you can handle any datatype at any size, it just has to be converted to the format you expect before you start to process it.

You can either require the user to do that (bad, why would they care what your internal datatypes are?), or you can do it for them.
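
As a rough sketch of the 'do it for them' approach (illustrative; the helper names are made up):

// Normalise whatever the caller gives us into an async iterable of Uint8Array chunks
async function * toByteChunks (input) {
  if (typeof input === 'string') {
    yield new TextEncoder().encode(input)
  } else if (input instanceof Uint8Array) {
    yield input
  } else if (input instanceof ArrayBuffer) {
    yield new Uint8Array(input)
  } else if (typeof Blob !== 'undefined' && input instanceof Blob) {
    // Blob.stream() lets us read without pulling the whole blob into memory
    yield * readableStreamToIterable(input.stream())
  } else if (input != null && typeof input[Symbol.asyncIterator] === 'function') {
    yield * input
  } else {
    throw new Error('Unsupported input')
  }
}

async function * readableStreamToIterable (stream) {
  const reader = stream.getReader()
  try {
    while (true) {
      const { done, value } = await reader.read()
      if (done) return
      yield value
    }
  } finally {
    reader.releaseLock()
  }
}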

writeFile(path:string, content:FileContent, etc:WriteFile):Promise<Stat>
write(path:string, content:FileContent,  etc:Write):Promise<void>

I don't understand why you would have both of these. Better to just accept offset as an option to both and have write be without a path for when you're just adding data without the context of a containing directory (which itself is a bit of a footgun if you just look through the discuss issues where people are basically saying "I added a file to IPFS, what happened to the filename?"). You're also then just reimplementing ipfs.add but split across two functions and without being able to stream into it:

writeFile(path:string, content:FileContent, etc:Write):Promise<Stat>
write(content:FileContent,  etc:Write):Promise<Stat>

If you accept that streaming chunked byte streams of arbitrary length is necessary (see comment above about accepting application data, not just files from <input type='file' /> or DataTransfer objects), you could split it out further:

writeFile(path:string, content:FileContent, etc:Write):Promise<Stat>
write(content:FileContent,  etc:Write):Promise<Stat>
writeByteStream(content:ReadableStream<Bytes>,  etc:Write):Promise<Stat>
writeFileStream(content:ReadableStream<{ content: FileContent, path: string }>,  etc:Write):Promise<Stat>
writeFileStreamOfByteStreams(content:ReadableStream<{ content: ReadableStream<Bytes>, path: string }>,  etc:Write):Promise<Stat>

Though this is leading to an absolute method explosion. This is, I think, fundamentally why ipfs.add is what it is. One method that takes things or streams of things, where things can be bytes or bytes with metadata (e.g. path, mode, mtime) or just metadata (e.g. a directory), and maybe the bytes can be browser things or node things or both.

You either push all that complexity onto the user ('oh my, which function do I call') or accept whatever they give you and then shoulder the burden of transforming it.

Could the API be more ergonomic, sure, of course! The MFS API uses familiar FS concepts so it would be great to have more of those semantics as core IPFS operations. There have been many attempts to unify the MFS and files API, I don't think anyone is truly in love with the API as it stands. This issue and all the linked issues are a good way to get some context: ipfs/specs#98

I'm going to stop here, this has turned into quite the ramble.


Error: detected more than one root - you can avoid this by passing wrapWithDirectory: true but this error is probably something we should get rid of. go-ipfs can import multiple files without a containing directory, and so should we - you just get two CIDs for two unrelated DAGs back. It should just be a case of removing this check and adding some tests to sanity check the behaviour.
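
For example (illustrative; hashes omitted):

// Two sibling paths normally trigger 'detected more than one root';
// wrapWithDirectory: true adds a containing directory so there is a single root
for await (const result of ipfs.add([
  { path: 'a.txt', content: 'aaa' },
  { path: 'b.txt', content: 'bbb' }
], { wrapWithDirectory: true })) {
  console.log(result.path)
}
// logs a.txt, b.txt and '' (the wrapping directory)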

@achingbrain achingbrain added exploration and removed need/triage Needs initial labeling and prioritization labels May 14, 2020
@Gozala
Contributor Author

Gozala commented May 14, 2020

I don't understand why you would have both of these.

The intended purpose of write was to provide a streaming interface without passing (async) iterables, streams etc. E.g.:

for await (const chunk = readHugeDataset(url) {
   await writer.write(path, chunk)
}
const {cid} = await writer.flush(path)

That is also why

type FileContent =
 | Blob
 | Buffer
 | ArrayBuffer
 | TypedArray
 | string

This seems a bit simplistic.

@Gozala
Contributor Author

Gozala commented May 14, 2020

/**
* Returns a Promise that resolves when the desired size of the stream's internal queue
* transitions from non-positive to positive, signaling that it is no longer applying backpressure
*/
ready:Promise<void>,

This doesn't seem like a great API to me, the producer should not be asking 'is it ok for me to push?', rather they should be told 'ok, now give me more'.

This was mostly inspired by ready from the WritableStream API (see: https://developer.mozilla.org/en-US/docs/Web/API/WritableStreamDefaultWriter/ready). But maybe it is unnecessary.

I just recall (could be incorrect) you mentioning that the consumer could pull multiple files to perform concurrent writes. If that is the case then await writer.writeFile(path, content) is going to prevent such things from happening. Which is why I thought writer.ready could signal it better.

It might be a mistake, but it looks like ready here is a property? It should be a function that returns a promise, because it can't un-resolve if we want to await on it again.

I don't like mutable promise properties myself, it just seemed to make sense to keep it similar to readable stream stuff.

@Gozala
Contributor Author

Gozala commented May 14, 2020

You could also implement it by treating ipfs.add as a low-level API, and wrapping it in a function that hands it an async generator that yields {content, path} objects in response to the makeDirectory, writeFile etc methods above and completes when flush is called. I guess you'd have to hold the directory structure in memory as you create it for stat to work; I'll leave that as an exercise for the reader.

Fun fact: this fell out of my attempt to do cross-thread ipfs.add - basically I ended up creating a writer, and the inputs to add are written with that writer, which ended up simplifying the RPC API.

@achingbrain
Member

achingbrain commented May 14, 2020

I just recall (could be incorrect) you mentioning that the consumer could pull multiple files to perform concurrent writes.

In a nutshell, when importing a file we chunk it up, then pull batches of chunks and process them (e.g. hash each one, turn it into a block, put the block into the datastore, figure out its place in the file DAG) in parallel. When importing directories full of files, we pull batches of files and process them in parallel (as well as processing their chunks in parallel).
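
Roughly this kind of pattern (not the actual importer code, just a sketch of the batched parallelism):

// Pull a batch of items from the source, process the whole batch in
// parallel, then pull the next batch
async function * batchedParallel (source, worker, batchSize = 8) {
  let batch = []
  for await (const item of source) {
    batch.push(item)
    if (batch.length === batchSize) {
      yield * await Promise.all(batch.map(worker))
      batch = []
    }
  }
  if (batch.length > 0) {
    yield * await Promise.all(batch.map(worker))
  }
}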

Could you expand a bit on why:

for await (const chunk = readHugeDataset(url) {  // <- did you mean for await..of here?
   await writer.write(path, chunk)
}
const {cid} = await writer.flush(path)

is better than:

for await (const { cid } of ipfs.add({ path, content: readHugeDataset(url) })) {
  // do something with cid
}

The intended purpose of write was to provide streaming interface without passing (async)iterables streams etc....

Also could you explain why this is a good idea?

@achingbrain
Member

I ask the last question in particular because there are bugs open against every major browser to implement the async iterable contract method/properties on browser ReadableStreams. At that point we'll be able to take browser ReadableStreams, node ReadableStreams, buffers, arrays, typed arrays, array buffers, all that stuff, because they all are (or will be) async iterables.

This is great for the user because they can just pass in whatever they have and don't have much opportunity to choose the wrong API method, and this is great for us because our API surface area remains small.

https://bugs.chromium.org/p/chromium/issues/detail?id=929585
https://bugzilla.mozilla.org/show_bug.cgi?id=1525852
https://bugs.webkit.org/show_bug.cgi?id=194379
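
At that point the same consumption code works for all of them (illustrative):

// Once browser ReadableStreams implement Symbol.asyncIterator, a fetch
// response body can be consumed like any other async iterable
const response = await fetch(url)

for await (const chunk of response.body) {
  console.log(chunk.byteLength) // chunk is a Uint8Array
}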

@Gozala
Contributor Author

Gozala commented May 14, 2020

When importing directories full of files, we pull batches of files and process them in parallel (as well as processing their chunks in parallel).

So please correct me if I'm wrong, but that implies that the following code would prevent such a thing from happening:

for (const {name, content} of files) {
  await writer.writeFile(name, content)
}

Because the producer awaits for one file to be written and only then writes the second, which is to say the consumer is unable to pull multiple files at a time.

This is also what I thought ready was able to provide instead:

for (const {name, content} of files) {
  writer.writeFile(name, content)
  await writer.ready
}

@Gozala
Contributor Author

Gozala commented May 14, 2020

Could you expand a bit on why:

for await (const chunk = readHugeDataset(url) {  // <- did you mean for await..of here?
  await writer.write(path, chunk)
}
const {cid} = await writer.flush(path)

is better than:

for await (const { cid } of ipfs.add({ path, content: readHugeDataset(url) })) {
 // do something with cid
}

The intended purpose of write was to provide streaming interface without passing (async)iterables streams etc....

Also could you explain why this is a good idea?

I would not claim one is better than the other. I would however suggest that the two make different tradeoffs. I think the proposed API trades a bit of convenience (that is, just pass a thing and it does the right thing) in favor of:

Greater control of the flow

Here is why I personally end up choosing the ipfs.files API over ipfs.add in many of my projects.

  1. When I add multiple files I want to get the corresponding CIDs back; in a nutshell, mapping a collection of path->content to path->cid. There is a set of complications I encounter when trying to accomplish this with ipfs.add:

    1. I need to for await and collect results to be able to map them back. However that is complicated a bit by the fact that:
      • input paths don't necessarily map to the output paths (they're normalized)
      • the number of inputs does not necessarily match the number of outputs.
      • the order of inputs does not match the order of outputs, so you either need to collect then pick / map, or deal with the complications of doing it as things come in.
      • What if the 3rd write out of 7 fails?
      • What if one of the paths is missing? I understand that it should not happen, but the API forces you to consider / handle that case.
      • What if the 3rd file takes too long and I want to skip it and move to the next input?

    The proposed API makes this case pretty straightforward and avoids all of the stuff above:

    for await (const input of inputs) {
       const { cid } = await writer.writeFile(input.path, input.content)
       // Note how you can use that CID right from the loop and it corresponds to your input
       // no need to do backwards mapping
    }

    It is also worth considering that ipfs.add does not necessarily free users from the for await loop, it just puts it on the other end.

  2. It is a lot easier to implement a convenience API on top of a more primitive API than the other way round. For example, if you want to implement the equivalent of ipfs.add it is a straightforward task:

    const add = async function * (inputs, options) {
       const writer = ipfs.files.writer(options)
       for await (const input of inputs) {
          yield await writer.writeFile(input.path, input.content) 
       }
       yield await writer.flush()
    }

    The above is obviously not fully compatible with ipfs.add as it doesn't do input normalization etc., however I hope it does get the point across. Additionally I wanted to illustrate the fact that:

    1. You can customize it to meet your use case (yield only things you care about, or yield [input, CID] pairs).
    2. It is trivial to specialize (e.g. if I have an array of inputs and I need an array of outputs I can just Promise.all(inputs.map(({path, content}) => writer.writeFile(path, content))) and not deal with all the collecting / mapping / ordering etc.).

Smaller API ≠ Simpler API

  1. I would argue that while the ipfs.add API appears smaller on the surface (as in fewer functions) it is a more complex API to use (except maybe in a handful of cases).

  2. I would argue that a general writer interface is really convenient and enables some really neat things that I have not pointed out yet, e.g. you can have an API-compatible version such as a FormDataWriter. In fact it could even stream (at least in node) as the user is writing into it. Put differently, it can encode the writes into different things.

  3. As was pointed out, the proposed API is similar to MFS. That was deliberate.

  4. Async Iterators are amazing on the surface (that is when used with for await) but under the hood not so much. I really want to get this point across because many of the problems that seem exotic (as in why would anyone ever do that) can't be overlooked in the multi-thread context.

    Caution: Longish rant, please feel free to skip ahead.
    A consumer could call .next() many times, far more times than there will be results, producing many promises for incoming data. Once there's no more data left, all the pending nexts will need to be completed as done. If an error occurs, do all pending nexts get errored, or does the one that fails get errored and the others get doned? If the consumer (the end calling next) isn't pulling but the producer in the other thread produces a result, that result needs to be queued. An argument could be made that the producer should stop if the consumer isn't consuming, but once you start considering actually parallel tasks (threads, routines, processes) such tight coordination is far from ideal (CSP is a good resource for details on why; fun fact, that is what go / rust channels are based on). That is also why node ended up bolting highMark / lowMark onto streams.

    Put more simply, complexity is greatly reduced by having primitive read / write APIs and layering streams on top (that is, unless the language has built-in concurrency primitives like go or rust channels). That is also why node's fs.createWriteStream is sugar over the more primitive fs.write. For the same reasons it's far less complex to implement cross-thread add with something like this writer under the hood.

  5. Normalization / generalization has costs, both computational and cognitive. E.g. currently add turns trivial input that could be transferred over threads into nested async loops. In a nutshell the proposed interface just replaces yields from within the normalizer with writes. The immediate benefit is that if you have a primitive input you no longer have to produce a nested iterator thingy that will later be iterated over, you just do a write, e.g. the following code from the current implementation (included for convenience)

    function normaliseInput (input) {
      // must give us something
      if (input === null || input === undefined) {
        throw errCode(new Error(`Unexpected input: ${input}`), 'ERR_UNEXPECTED_INPUT')
      }
    
      // String
      if (typeof input === 'string' || input instanceof String) {
        return (async function * () { // eslint-disable-line require-await
          yield toFileObject(input)
        })()
      }
    
    
      // Buffer|ArrayBuffer|TypedArray
      // Blob|File
      if (isBytes(input) || isBloby(input)) {
        return (async function * () { // eslint-disable-line require-await
          yield toFileObject(input)
        })()
      }
    
      // ....
    }

    Turns into something along these lines. Notice how there are no more generator allocations or invocations, which in turn produce async iterables that need to be awaited once in the outer loop and then in the inner loop.

     function pipeInput (input, writer) {
       // must give us something
       if (input === null || input === undefined) {
         throw errCode(new Error(`Unexpected input: ${input}`), 'ERR_UNEXPECTED_INPUT')
       }
    
       // String
       if (typeof input === 'string' || input instanceof String) {
         writer.writeFile('', input)
       }
    
    
       // Buffer|ArrayBuffer|TypedArray
       // Blob|File
       if (isBytes(input) || isBloby(input)) {
         writer.writeFile('', input)
       }
    
       // ....
     }

    For cross-thread work it also allows really helpful specializations, that is, if you write a TypedArray, ArrayBuffer or Blob it no longer needs to create sub-sessions with all the backpressure handling etc. Things do get more complicated with actual stream inputs, but even that is greatly simplified with writer.write, and again it avoids async loops when sync iterators are passed, all while keeping the same simplicity as iterating over normalized input.

This ended up being far too long, but hopefully constructive (even if whiny at times), so I think I'll stop here.

@Gozala
Contributor Author

Gozala commented May 14, 2020

I should point out, just in case I left the wrong impression, that I LOVE async generators and readable streams! I'm very glad we have them! And that we use them! It's just that they're not perfect for everything and that is ok.

@lidel lidel changed the title Discuss: ipfs.add more than one root Discuss: improving ipfs.add May 26, 2020
@SgtPooki SgtPooki self-assigned this May 17, 2023
@SgtPooki
Member

js-ipfs is being deprecated in favor of Helia. You can follow #4336 and read the migration guide.

Please feel free to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterward (see #4336).

I believe the @helia/unixfs package resolves a lot of concerns with the existing interface, please let us know in https://github.com/ipfs/helia-unixfs if it doesn't solve your needs!
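
For reference, a minimal sketch of adding bytes with @helia/unixfs (a sketch only; check the current Helia docs for exact APIs):

import { createHelia } from 'helia'
import { unixfs } from '@helia/unixfs'

const helia = await createHelia()
const fs = unixfs(helia)

// add a small file and get its CID back directly
const cid = await fs.addBytes(new TextEncoder().encode('hello world'))
console.log(cid.toString())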
