Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Codegen of unions, and their keyed representations #60

Merged
merged 12 commits into from
Jul 8, 2020
Merged

Conversation

warpfork
Copy link
Collaborator

@warpfork warpfork commented Jul 8, 2020

Unions! 🎉

Unions were one of the last major groups of functionality in the feature table that we had no codegen implementations of yet, so this is pretty huge news overall. Nothing broke; no abstractions shattered. Whew. (Enums are the only thing left, but those are pretty easy-peasy. Unions, being a container/recursive type kind rather than a leaf, were pretty heavy-duty.)

We have two things here, as always: the "type-level" and the "representation-level" implementations of the Node and NodeAssembler interfaces.

  • the "type-level" Node for a Union always acts like a map, and the keys are always the names of the member Type.
  • the "representation-level" Node behavior may (of course) vary, but for "keyed" representations... it's roughly the same as the type-level semantics; it's just that the keys are the strings defined by the schema author, rather than being the member type's name.

The "keyed" representations are the only representation implemented so far, as those are by far the easiest to ship, but adding the rest should just be a Small Matter of Programming. (Which is not to say easy or instantaneous to scratch out, but at least it's not theoretically challenging, nor likely to encounter Unknown Unknowns.)

Individual commits have more details in their messages.


Two styles of internal memory layout in the generated code are supported! One is "embedAll"; the other is "interface".

  • In "embedAll" memory layout, it... does what it say on the tin: all possible member types are embedded into the union type. Only one of them is actually initialized with data; which one that is is marked by a 'tag' field which adds another word of memory. Overall this means the resident memory size can be potentially rather large, but, the allocation count when using these is zero. (If we had language-level support for "unions" that share memory in Golang, the resident memory size could be much less, as well. Alas; this is not available (unless one resorts to 'unsafe', which as a general rule I avoid).)
  • In "interface" memory layout, the generated golang structure for the union type will contain just one field, which is an interface. The type info is thus in the golang native interface type info; simple. These interfaces are smaller in resident memory (and probably the right choice by default), but in contrast to "embedAll", constructing them does necessarily cause heap allocations.

(All this talk about performance differences of these two memory layout strategies may sound theoretical! And perhaps for applications that don't care about performance, that is true. However, we do also have examples of both modes elsewhere in the go-ipld-prime codebase already, and they were absolutely the result of profiling which indicated it made a serious, make-or-break difference when those types were involved in "hot" loops. Check out the ipld.PathSegment type, for example, which resembles the "embed" strategy.)

In either case, an interface type is also generated, so that if the user of the generated code wants to do a native type-switch of their own, they can call AsInterface() and proceed to do so.


There are still substantial "todos" regarding the quality of error handling. The types used sometimes carry too much information, or too little; or the information that they carry is inconvenient (e.g. may require processing) to actually have on hand in some cases; and the general consistency of the error type ontology... well, isn't. I've started accumulating more comments and notes about this, but a massive pass over the entire suite of relevant packages will be required at some point, once we've gathered enough experiences to make final strategic decisions about the subject.

(One example thought on the subject of errors is: perhaps we should start attaching NodePrototype values to them, and also introduce a Stringer requirement to that interface. This would serve up type name info for typed/codegen'd nodes, while also being clearly defineable for untyped nodes; this would help resolve a lot of awkward sticking points currently showing up around error info handling. But: this is a biggish change; it would live in its own PR.)


Looking forward...

In the the near future: one of the reasons having unions be generable now is that a lot of schemas have at least one of them... including, for example, the schema-schema itself (the "self-hosting" document that describes the schema information as a schema). This means we're getting really close to being able to generate code that we could then use to parse our own schema ASTs! That, in turn, would let us throw out a great deal of placeholder code which was waiting for this day; and also means we'd get e.g. JSON-unmarshalling of Schema info "for free"... which in turn will mean we're dang near ready to make an end-user CLI tool for all this codegen stuff that you can feed declarative schema documents into and go!

(... but, one step at a time. First, it turns out we'll need to fix the "placeholder" schema info types one more time: it's not currently possible to construct values with "cycles" in them, and we actually need that to describe the schema-schema. Heh.)

(... and, the schema-schema currently uses inline unions, as well as keyed unions. I might actually change the schema-schema. It would make boostrapping to self-hosting easier, yes. But there are other reasons as well: I've become much more averse to inline unions through the course of this work, and seeing more and more closely how their performance will (necessarily!) suck. I also suspect the ease-of-implementation of keyed unions will be a recurring theme for people building IPLD + Schema tooling in other languages, and therefore, making the schema-schema use only keyed unions will be a bootstrap simplifying move for them, too. All in all, such a change seems likely worth pursuing.)

Tl;dr much more excitement soon to come :3

Expect this to evolve slowly over and the course of *many* commits.
Unions are exceptionally tricky, and their various representations
are by far the most complicated parts of IPLD Schemas.

In particular, the fact that how to interpret data *inside* a union
may vary, whilst some representations may not actually tell us which
union member we have until later in the stream... is going to produce
a lot of absolutely fabulous complexity.  Look forward to it.
It's kind of dark.  I've been leaving remarks strewn around github
for a while now to the general effect of "please please please
prefer keyed unions whenever you can; the others can be Not Fast".
But it's even more nasty in terms of details like error handling
than I think I had yet realized.  Sigh.

Oh well.  Remember: we exist to describe data that people *have made*,
not just the data that we want to encourage them to make in the future.
The appearance of the word "any" there has started to perturb me;
"any" is a concept that we also need to describe in schemas, and that
type has nothing to do with it.  It's more of a base mixin.
Also make some of the placeholder construction functions.
(Not all of them.  Just the ones I intend to use first.)

There was some very unfinished placeholder stuff going on there.
(And ironically, it wasn't written in a very clear union-like way;
it gets a lot less icky when it's rewritten so that it's uniony.)
Golang templates are mostly pretty smooth to get started with,
but this particular issue is really a bizarre tripwire, imo.

Will clean up the existing "range modifies dot unhelpfully" messes
sometime over the rainbow; I just refuse to make more of them today.
Generates the type structure itself, and a marker interface.
Not much else yet.

Wired test harnesses and plugged out one example.

It's a bummer we don't really have a great way to poke part of it into
running until the whole thing satisfies the generator interface.
But it'll come soon enough.
…ayouts.

Outline for keyed representation, but only enough to get some more
things compilable and start getting all the templates exercised.

Still can't really e2e test the things we've got without getting
the rest of it and its builders finished.  It's coming along pretty
reasonably, though, it seems.
These are turning out to be *very* fun.

In the process of detailing how child assemblers work for unions,
it has become time to consider how separately-allocated child
assemblers are handled, and what their existence implies about
the rest of our logical rules how all the assemblers behave in concert.
Like so many things in this endeavor... the far-reachingness of
the implications of even seemingly very mild choices is... high.

I think it's quite likely that the current claims about union
assemblers not needing extra state for set/focus might turn out to be
redacted in upcoming commits, and for the strangest reason: we might
actually need that state to be persisted (outside of 'w') so that
the *reset* mechanism, of all things, can be efficient.
We'll see; still sort of tumbling that thought around in my mind.
There are various options.
... for the type-level assemblers, anyway.

We *still* can't do e2e tests on this that include compiling the
generated code; not until we get at least one representation strategy
implemented.  All the templates exercise, though (even though you
might have to uncomment some of the test "skip" lines to do it).

I'll move on to doing the keyed representation next, and that will
be the thing we use to prove out all of this stuff.  It should be
the easiest one to do now, too, since it's semantically very very
similar to how we made the type-level behaviors work.
I can't quite claim tests passed on the *first* shot... but,
the first shot after mostly syntactical (rather than semantic) fixes?
Yeah, actually.  That was pretty fun.

Snuck in a bit of DRY'ing up.  The repetition in BeginMap methods
got to me, and was low hanging fruit, so I extract that from unions
and also backported it to structs.

Errors got some work in this commit, because it turns out I've
straightjacketed myself by not allowing "fmt.Errorf" due to imports.
There's a lot to do there, and I only tackled what was directly
critical to get this commit about unions across the finish line,
but there's a few remarks in comments about where to go next.

Some more comments about future work on the type info holder types
also appears; I'm starting to skid on those placeholders and their
issues more and more.  I really hope we can get to replace those
sooner than later.

And... also, yes, the idea of not having a "focus" state field in
assemblers really bit it, hard, as speculated in the previous
commit message.  I ended up using 'ca' in more switches than I
expected, simply because it's easier to use that than have the
conditonal templating branches that would be necessary to use the other
tagging mechanisms that do also have sufficient info.

One big fixme in the core interfaces for nodebuilders (wince):
the ValuePrototype method can error sometimes, and that hasn't been
accounted for.  Need to make a decision about what to do there.
It's not really an exercised path in practice, but it shouldn't
contain caltrops, regardless of how frequently used it is.
(This probably would've come up earlier, except there's a bunch of
stubs about ValuePrototype in other parts of codegen already; all of
them need backfill, but haven't yet made it to top of the todo heap.)

But despite all the fixmes accumulated, this does bring unions
to be a usable thing, and definitively proves out that the design
still works, even for what turns out to be one of the most complicated
parts of the schema system!  It's very, *very* exciting to add the
checkmarks to this part of the feature table -- it's one of the places
I most feared "unknown unknowns", now it's put to bed.  Woot!
This is a clear and simple solution to the problem.

(I think I may have gotten so optimistic about type systems for a while
that I forgot that returning nils is an option.  mmmm.)

The docs on NodeBuilder already even nodded towards this possibility,
so I've just further clarified that.

After staring at this method for a while, I begin to wonder if it even
is all that useful.  To use it sensibly on structs or unions (for any
purpose other than checking if something is *not* a field/member),
you have to figure out which keys you can ask it about... which...
means you'd need the type info, to be able to enumerate that!  At which
point you'd just be able to look at the type info, and wouldn't really
need to ask the builder about ValuePrototype in this way at all,
rendering the whole thing moot.

But removing it seems a little drastic, too.  And would leave questions
about how to do those inspections on untyped things.  So.  Despite the
nibbles of oddity around this, perhaps this is still the least-bad
design that can make these situations legible.
@warpfork
Copy link
Collaborator Author

warpfork commented Jul 8, 2020

Charging forward: I want to start new branches to work on the cycle issues, and thence onto that self-hosting milestone. (I know there's going to be more fun speedbumps on the way to the latter, so I want to encounter them as soon as possible, on the figuring they'll also force a lot of other stuff to get straightened out...)

@warpfork warpfork merged commit 32d5243 into master Jul 8, 2020
@warpfork warpfork deleted the more-codegen branch July 8, 2020 16:46
@aschmahmann aschmahmann mentioned this pull request Feb 18, 2021
73 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant