A more Featureful Approach to Storage APIs #265
This will eventually rival and replace linking.BlockWriteOpener and linking.BlockReadOpener, being more primitive than them, and easier to implement without importing any go-ipld-prime types. (The eventual migration story should be smooth, though; we'll probably support both styles at once for a while.)

It also explicitly does not acknowledge the interfaces from the github.com/ipfs/go-datastore module. We're seeking several key improvements over those, including:
- slimmer core; more feature-detection
- only standard library types needed to implement the API
- context everywhere
- clear support for streaming operation
- separation between read and write directions of operation
- a consistent strategy for feature-detection and optional fastpaths
- key type should be much simpler (and not do strange internal manipulations that cause useless allocations!)

We'll offer wrappers that bridge back to the go-datastore style API, because there are lots of things implemented today using it. However, these objectives for a new clean API are not reasonably feasible to reach by incrementally evolving that older interface, so we have to start anew.

In this commit: we have a first draft of the most essential interfaces, as well as always-available functions at package scope which will "do the right thing" in the most efficient way possible, by using feature detection internally on your behalf, for ease-of-use.
This is a pretty big change if anyone else out there is implementing new link types. If you're using the linking/cid package, the change is already done. I'm going to consider this a minimally breaking change because I don't think we've got a lot of diverse Link implementations out in the wild. (And even if so: this should be a pretty easy addition to make.)

I'm anticipating using this as part of a good layering of APIs regarding storage. (We should be able to use binary, dense things in any storage substrate that allows it. Some common storage systems, like flatfs, don't work with binary. But any escaping having to do with that should be associated with the storage system -- not pushed to any other layer. Having binary access here helps do that.)
Fixes import cycles that I would otherwise be about to experience starting in the next commit.
…fined by the storage package.

This means we now have a one-liner call that can:
- rig up a storage system...
- which can have been written while using zero types from this repo...
- and it'll DTRT.
- All while it's also become much easier to write storage implementations, as well as extend them gracefully, because of the new storage package APIs.

This seems good.
…into the new storage API.
// (For a concrete example: if using CIDs, this method will return a binary string that includes
// the cid version indicator, the multicodec and multihash indicators, etc, in addition to the hash itself --
// whereas the LinkPrototype.BuildLink function still expects to receive only the hash itself alone.)
Binary() string
wouldn't this be more naturally a []byte?
I would pay [REDACTED] for an immutable byte slice type in golang.
[REDACTED], Will. Truly. [REDACTED].
i would rather we eat that up by making a copy than having all users be forced to cast every time they deal with the core key type in the system
I have to side with Will here; strings are technically like read-only byte slices, but using them as such in return values is a bad idea. I think they're only reasonable in cases where the bytes being shoved into a string are necessarily short-lived state, such as map[string]T or string(input) == string(expected).
Like Will says, if the objective here is preventing corruption/races, make a copy - converting to a string makes a copy anyway. Converting to a string is arguably worse in all scenarios - if the user wants to use the bytes like a byte slice, they'll need to make a second copy for the conversion back.
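The "converting to a string makes a copy anyway" behavior is easy to verify in plain Go; this is a standalone demo, not code from the PR:

```go
package main

import "fmt"

func main() {
	b := []byte("original")
	s := string(b) // this conversion copies the bytes

	b[0] = 'X' // mutate the source slice after the conversion

	fmt.Println(string(b)) // the slice now reads "Xriginal"...
	fmt.Println(s)         // ...but the string is unaffected: "original"

	// Converting back to []byte also copies, so a caller who wants
	// the bytes as a slice pays for a second copy:
	b2 := []byte(s)
	b2[0] = 'Y'
	fmt.Println(s) // still "original"
}
```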
The natural resident form of this data is usually string. I think we would be encountering significantly more conversions (and probably, heap-escaping allocations) if this interface required []byte. (I was being flippant about the mutability thing; while that does bother me, it's not the most essential reason for this.)
the internal storing of string is for similar reasons though, right? but the thing you do to get the binary form of a cid, in order to access it, is a []byte
Taking a step back, I'm fairly certain this should be just https://pkg.go.dev/encoding#BinaryMarshaler. We should have a very strong reason to use a different interface (and type!) for exactly the same concept, and "save a bit of overhead with go-cid's internal representation" doesn't seem particularly strong to me :)
I'm not at all convinced I care about any of this.
There is no reason to be interested in implementing BinaryMarshaler that I can see. (It's not that it's a bad idea; it's just that I don't see a reason to care.) There is also no possible error return.
I want this to be efficient, do the thing, get out of the way, and not provoke allocations. None of these arguments about style are punching on the same level?
The original move of go-cid to use string was pretty highly thoughtful, planned for a long time, and once executed, hasn't provoked regrets. I'd need quite a lot of convincing to move in any other direction except the same one, here.
(Sorry if I'm being a little terse, and for the initial flippant replies -- I'm actually very surprised we're talking about this.)
I get that your error would always be nil, and that the string-byte conversion is not ideal, but that's the kind of tradeoff you have with generic APIs and interoperability. I would have assumed that the point of an IPLD library in Go is to be easy to use and interoperable, not to squeeze every last bit of performance at the cost of usability :)
As a note, I also disagree that you can predict that you'll never need to return an error. For example, see ipni/storetheindex#94.
For an end-to-end example of how this works if you want to use it through
storage/api.go
Outdated
// but by contrast, ReadableStorage is expected to return a safe copy.
// PeekableStorage can be used when the caller knows they will not mutate the returned slice.
type PeekableStorage interface {
	Peek(ctx context.Context, key string) ([]byte, error)
todo.self: @whyrusleeping has pointed out that this is a silly interface, and it should probably be something more like Peek(ctx context.Context, key string, callback func([]byte) error) error.
The value of this is that if the storage system wants to reuse that buffer, then it gets a signal to know it may do so by the return of the callback.
Another option, if you want to avoid callbacks, is also returning an io.Closer to signal that you're done using the returned bytes: https://pkg.go.dev/github.com/cockroachdb/pebble#Batch.Get
Agree, I almost certainly like that better. I'm gonna update this to returning an io.Closer.
The callback approach can be nice, stylistically, but I dislike forcing callbacks as the baseline. Also, the idea of the Peek function blindly proxying back whatever error the callback returns did not strike me as ideal for clear reasoning about error attribution.
I think the callback is significantly better. As a counter-example to the pebble code, here's badger: https://dgraph.io/docs/badger/get-started/#read-only-transactions
Implementing the callback API on top of pebble is simple; implementing the []byte, io.Closer API on top of the callback API is really tricky. Plus, we've already been using a callback-style API in our blockstore interfaces for quite some time now (cc @raulk), e.g.: https://github.com/filecoin-project/go-bs-lmdb/blob/master/blockstore.go#L352-L377
Request: please review the API of candidate backing implementations (like the one @whyrusleeping posted) before choosing a direction here. Getting creative with interfaces, while fun, may lead to poor choices. io.Closer is an interface that I would despise using, as it precludes the calling code from using closures.
Actually, I got confused by the mention of pebble (I think @whyrusleeping was referring to the CockroachDB example). I see that io.Closer is being suggested as a return value to signal disposal.
Agree with @whyrusleeping that a callback is narrower as it limits the usage of the value to the scope of the callback, and therefore can accommodate "wider" interfaces like the io.Closer, but the other way around would be terrible.
Most embedded DBs use disk-mmapped buffers, for which the callback interface is much better as it strictly restricts/demarcates the lifetime. I suspect the io.Closer works for Cockroach because they're using buffer pools, not mmaps (they're a distributed DB, which is questionable inspiration for IPLD).
fwiw, my main goal here is actually to become less opinionated about this, by virtue of making it all feature detection, all the time. :) So though this is merged, it's not a very strongly chosen direction. We can do multiple things here, and I'm not at all stuck on any particular approach.
Between these two styles?
I see these as roughly six-of-one, half-a-dozen-of-the-other.
I think the Closer approach is moderately better to ask the implementation on the bottom to provide, because it's very easy to write a package-scope function that detects that feature, and offers to turn it into a closure-style for you, giving the end user the choice of both while the implementer only needs to provide one. It's relatively hard to write something that's efficiently going the other way.
But can we do both? Sure. Why not? More PRs always welcome.
(We should, at least, definitely add the callback style to the package-scope functions offered, I'd agree. That's a gimme.)
@@ -0,0 +1,5 @@
module github.com/ipld/go-ipld-prime/storage/dsadapter
Why an extra module? If the reason is that you don't want ipld-prime to depend on go-datastore, then perhaps put this code in go-datastore.
If I want to use ipld-prime with datastore, I want to find the wrapper solution in either module - not have to discover an extra module. The IPLD/IPFS ecosystem already has enough Go modules as it is, too :)
I just read the README, which confirms my suspicion. I still think this code belongs in either of the two modules.
🤔 I guess I can't really defend this, other than I think it's going to be operationally easier for me to deal with it here, and I kind of like the idea of having a few other things also in-repo here as sibling packages, so I can say "look in storage/* for implementations" in the docs.

(I have another freshly written filesystem-based storage system to add in a sibling directory, shortly, which would make for three of them here, including this one.)

You're right that this code could technically go in the go-datastore repo and module, due to the success of the key design goal here: the new API doesn't require any types outside of the standard library.
You should still be able to refer to "fully qualified" names from other modules, like:
Look in the github.com/ipfs/go-datastore/foobar package for...
also not a fan of module proliferation
Alright, if someone wants to push this into go-datastore, be my guest -- but I want to see the PR landed in <24 hours, and then you also get to deal with whatever else comes up as I try to add benchmarks in the coming days that reach across old and new code 🙃 Sounds like the road of greater hassle to me, but I dunno. 🤷
Otherwise, I'd suggest we keep it here for now, and revisit in the future if it causes pain or discovery failure. I don't think it should be difficult to move (by copying to a new home, and eventually removing the older location, after a transition period), if we find it necessary. Leaving it here is a very "type 2 / revisitable" decision, in other words.
// --- basics --->

type ReadableStorage interface {
	Get(ctx context.Context, key string) ([]byte, error)
What's the precedent for using key string rather than key []byte? Basically every key-value database in Go (badger, bolt, pebble, etc.) uses []byte keys, precisely because it helps clarify that they can contain anything -- not just valid utf8.
If the point is to be more like go-datastore, perhaps that's enough precedent. But I would hope that we would make the interface be what is nicest for Go and as a generic abstraction layer, not just for IPFS in particular.
Ultimately I guess my strongest argument is that I'm afraid our applications as a whole will have more heap allocations if we design to []byte, because a nontrivial amount of the application already uses structs that wrap strings (go-cid does this), because in turn those structs cared about being usable as map keys. And in turn, I'm unable to argue that those structs which use strings for immutability+mapkeyability are wrong, so I'm inclined to pivot other code to favor that style.

(I realize it is also now true that when golang sees map[string]foo being assigned or looked up via theMap[string(somebytes)], it does some compiler magic to make that not escape. However, that only works in very specific cases; and in particular, doesn't help when there's a struct wrapping a thing, which is the situation we're in with go-cid. Yes, we're also no longer beholden to that at this level of interface -- but it's still in the neighborhood of considerations, because our choice here can either end up forcing a heap escape somewhere in the glue between here and go-cid, or not.)

More broadly: I do see it as perhaps a bit vanguard, but otherwise unsurprising, to see a binary API taking string parameters in golang. Maybe I'm a minority in this, but in my opinion, as long as the API documents it clearly, using string to handle immutable bytes is perfectly fine. There are so many mechanical-sympathy advantages to using strings for immutable bytes that it doesn't surprise me if an API admits it.
I'm not sure how strongly held my belief is on this. Maybe I should just hush about that one heap escape.
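The compiler special case mentioned here can be observed with testing.AllocsPerRun: a direct map lookup keyed by string(someBytes) avoids the string allocation, while routing the same conversion through a string-wrapping struct key typically does not. A small standalone demo (not from the PR; the exact allocation count for the struct case may vary by compiler version):

```go
package main

import (
	"fmt"
	"testing"
)

// wrapped mimics a struct like go-cid's, which stores its bytes in a string.
type wrapped struct{ s string }

func main() {
	m := map[string]int{"abc": 1}
	mw := map[wrapped]int{{"abc"}: 1}
	b := []byte("abc")

	// The compiler recognizes m[string(b)] and skips allocating the
	// temporary string: this reliably measures 0 allocations.
	direct := testing.AllocsPerRun(100, func() {
		_ = m[string(b)]
	})

	// Wrapping the conversion in a struct defeats that optimization,
	// so the string copy generally escapes to the heap.
	viaStruct := testing.AllocsPerRun(100, func() {
		_ = mw[wrapped{string(b)}]
	})

	fmt.Println(direct, viaStruct)
}
```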
key type discussion aside, I think we should really attempt to get away from any sort of interface that straight up returns an array of bytes.
Opens the way to a storage implementation that makes reuse of buffers.
Most usages change from a one-liner to another one-liner. Some tests get a bit more involved, because they were actually using the callback structure to hook introspections up.
I've added more documentation to the key type situation, but (with admittedly some trepidation) still want to move forward with use of strings there. I can't get over the mechanical sympathy arguments from our current world. (E.g., using

If golang in the future adds some new types which are meant to describe immutable bytes, I'd presumably jump for those in a heartbeat. (And hope/expect that it would come with a zero-cost conversion path from string types, so we could build another generation of bridge-forward code that's also zero-cost.)

@mvdan and I also briefly discussed if there would be any utility to making a
I added another commit which finishes changing

I guess there's no surprise here. It was purposeful and intended to not do anything about changing those callback types in this PR. But it may be interesting to note that this kind of usage exists, if there is any future work that does try to touch them.
(This is a PR of the "big, fun one" sort.)
I've revisited storage APIs.
In this PR, I set out some new interfaces, which I think improve on limitations of both the very old go-datastore.Datastore API, and the slightly less old linking.BlockWriteOpener and linking.BlockReadOpener.

It also provides a very smooth compatibility road between all three (both of the old ones, and the new one) -- smoother than I think we've ever had all in one place before. It's now easier than ever to connect either brand new storage implementations to a LinkSystem, or an older go-datastore implementation up to LinkSystem instead.

And it provides a lot (lot) of improvements in DX, dependency minimization, extensibility, usability, and, well... just about everything. Let's see the whole list:
list of improvements
- go-datastore ⇨ slimmer core; more feature-detection.
- linking.Block*Opener ⇨ more feature-detection.
- go-datastore and linking.Block*Opener ⇨ feature-detection design means it's easy for end-users to pick the feature they need and use the package-scope functions -- and rest assured that the operation will work regardless of the backing implementation; it's just a matter of whether it gets fastpaths or not.
- go-datastore and linking.Block*Opener ⇨ a consistent strategy for feature-detection and optional fastpaths means it's easier to offer performance specializations than ever before.
- linking.Block*Opener ⇨ a consistent strategy for feature-detection means it's now clear where we'd expect storage system implementations to offer other features (e.g. size estimates, or other non-core behaviors). (Previously this was very very difficult, since the linking.Block*Opener types were function types, not interfaces -- meaning you can't feature-detect on them at all!)
- go-datastore and linking.Block*Opener ⇨ only standard library types are needed to implement the API -- meaning it's much easier to write (and maintain) storage implementations without getting tangled in a complex dependency web.
- go-datastore ⇨ golang's modern standard context.Context is used everywhere. (go-datastore predated Context; times have changed.)
- go-datastore ⇨ clear support for streaming operation. (go-datastore just didn't have this, at all.)
- linking.Block*Opener ⇨ clear support for non-streaming operation. (heh.)
- linking.Block*Opener ⇨ easier to implement! Creating a minimum viable thing is now possible with just the most basic Put/Get methods (vs previously streaming operation was the minimum, which would generally be a bit more work).
- go-datastore ⇨ the read and write directions of operation are separated. You can implement both features on one type if you like, but you're not required to. (That means a read-only system can just not have write methods, rather than being required to have them, but then being stuck having to error when they're used.) Whenever storage handles are provided as configuration to some other system, the read and the write handles are distinct, and you can choose to set both or just one or the other (this is the same as how linking.Block*Opener already worked; it was good, we're keeping that part).
- go-datastore ⇨ the key type is much, much simpler (and does not do strange internal manipulations that cause useless allocations!)
- go-datastore ⇨ the key type is now binary and that's that. If the storage implementation can't handle binary keys, the storage implementation should have some configuration system that accepts a function for escaping. (In many go-datastore implementations, these boundaries were unclear, and some implementations would just reject some keys, forcing higher levels of application to deal with the problem, which could be confusing to reason about.)
- go-datastore ⇨ conceptually clearer and more explicit about intent to be used with one-write-per-key (because that's what a content-addressed system needs)... which results in implementations being allowed to make design choices that drastically simplify things and often increase performance.

the new style of usage
As a user, if you're just doing IPLD things -- e.g. working mostly at the data model level -- then you can now use LinkSystem.SetReadStorage and LinkSystem.SetWriteStorage to set up storage.

As a user, if you want to work with storage directly, you should usually pass around the storage.ReadableStorage and storage.WritableStorage types, and use the always-available functions at package scope to actually perform operations on them. (These package-scope functions mirror the ones available on the storage implementations themselves, and will "do the right thing" in the most efficient way possible, by using feature detection internally on your behalf, for ease-of-use.)

As an implementer: start off implementing storage.ReadableStorage or storage.WritableStorage first... and then pick and choose any of the additional feature-detection interfaces in the storage package that you think you can do especially well, or more optimally or more resource-efficiently than the fallback behaviors, and implement those too at your leisure.

construction vs wiring
All of the APIs here are about connecting storage systems.
There are (still) no APIs attempting to standardize configuration for storage systems, nor standardize constructor funcs, on offer here. (That was the story for go-ipld-prime/linking.Block*Opener, too, and also the story for go-datastore before that.)

(I might probe towards such configuration systems in https://github.com/ipld/go-ipldtool, but I don't think it's going to be appropriate to try to over-standardize that here -- I don't want to hem things in too much; letting things evolve is important.)
connection to go-datastore

Notice that the new interfaces explicitly do not acknowledge the interfaces from the github.com/ipfs/go-datastore module. We'll offer wrappers that bridge back to the go-datastore style API, because there are lots of things implemented today using it, but we won't be letting any of the concrete types from that old API leak through the new API.

(Was a new API necessary to achieve the above improvements? The answer is: yes, absolutely. It would not have been feasible to reach the successes above by incrementally evolving that older interface; it's vastly less difficult to get there by simply starting anew.)
The bridge to using go-datastore code as the new go-ipld-prime/storage API is already done: it's in the storage/dsadapter folder. It's in this repo (because the code is so small!) but in a new go module -- which means you don't have to pull it (or its transitive dependency baggage) into your project unless you need it.

connection to go-ipld-prime/linking.Block*Opener
linking.BlockWriteOpener and linking.BlockReadOpener stay as they are for now -- these are still the functions you use to configure a LinkSystem.

The new storage.ReadableStorage and storage.WritableStorage APIs can be glued to a LinkSystem easily, though: use LinkSystem.SetReadStorage and LinkSystem.SetWriteStorage. With this, configuration is a one-liner. (Or a two-liner, if you are setting up both read and write.)

Note that the new storage APIs do not receive the linking.LinkContext type. That means: if you did want to do something like "please store things of type 'foo' in a different place" or "if I got here by a path that contains 'frobnoz', load info from a different place", you can't do that with these new storage APIs: you can't see the LinkNode or LinkPath anymore like you could in the linking.BlockWriteOpener and linking.BlockReadOpener functions. If you need that? You can still use the linking.BlockWriteOpener and linking.BlockReadOpener APIs... and maybe just shell out to the new storage.* functions inside of those.

(It's possible there will be further revisits to linking.BlockWriteOpener and linking.BlockReadOpener in the future. However, it's up in the air -- due to things like the user story in the previous paragraph -- for now. Regardless of what happens: the eventual migration story should be smooth; we'll probably support both styles at once for a while, and graceful transitions can be done behind the LinkSystem abstraction, which should remain unaffected.)

whew
I think this represents a pretty significant improvement to the state of storage APIs. There's a few things to polish here, but the direction feels good. Overall I hope, and think, this feels like a more comprehensive and inclusive solution than what we've had before. Feedback and refinements welcome.
I'm currently fuelling this by turning around and immediately using it again downstream in the go-ipldtool repo. So far, so good. (This does make me a bit more averse to rebasing much, though -- the go mod file in that repo is referring to some of these commit hashes.)
If this goes well: I think this API might justify the tagging of a v0.14.0 release after it lands.