Codegen update -- Assemblers, and many new representations #52
Merged
In research: I found that there are more low-cost ways to switch which methods are available to call on a value than I thought. Also, these techniques work even for methods of the same name. This is going to improve some code in NodeAssemblers significantly -- there are several situations where this will let us reuse existing pieces of memory instead of needing to allocate new ones; even the basicnode package already needs updates to improve on this. It's also going to make returning representation nodes from our typed nodes *significantly* easier and lower in performance cost. (Before finding methodsets are in fact so feasible, I was afraid this was going to require our typed nodes to embed yet another small struct with a pointer back to themselves, so we could have amortized availability of a value that contains the representation's logic for the Node interface... which, while it certainly would've worked, would've definitely made me sigh deeply.) Quite exciting for several reasons; only wish I'd noticed this earlier.

Also in research: I found a novel way to make it (I believe) impossible to create zero values of a type, whilst also making a symbol available for it in other packages, so that we can do type assertions, etc, with that symbol. This is neat. We're gonna use this to make sure that types in your schema package can never be created without passing through any validation logic that the user applies.

In codegen: lots of files disappear. I'm doing a tabula rasa workflow. (A bunch of the old files stick around in my working directory, and are being... "inspirational"... but everything is getting whitelisted before any of it ports over to the new commits. This is an effective way to force myself to do things like naming consistency re-checks across the board. And there's *very* little that's getting zero change, since the changes to pointer strategy and assembler interface are so sweeping, so... there's very little reason *not* to tabula rasa.)
Strings are reimplemented already. *With* representations. Most of the codegen interfaces stay roughly the same so far. I've exported more things this time around. Lots of "mixins" based on lessons learned in the prior joust. (Also, a bunch of those kind-based rejections look *much* nicer now, since we also made those standard across the other node packages.) Some parts of the symbol munging are still up in the air a bit. I think I'm going to go for getting all the infrastructure in place for allowing symbol-rename adjunct configuration this time. (I doubt I'll wire it all the way up to real usable configuration yet, but it'll be nice to get as many of the interventions as possible into topologically the right places to minimize future effort required.) There's a HACKME_wip.md file which contains some other notes on priorities/goals/lessons-learned-now-being-applied in this rewrite, which may contain some information about what's changing at a higher level than trying to track the diffs. (But, caveat: I'm not really writing it for an audience; more my own tracking. So, it comes with no guarantee it will make sense or be useful.)
I *think* the notes on Maybes are now oscillating increasingly closely around a consistent centroid. But getting here has been one heck of a noodle-scratcher.
Compiles and runs; but the results don't. Some small issues; some big issues. Just needed a checkpoint. It's feeling like it's time to deal with breaking down the assembler generators in the same way we did for all the node components, which is going to be kind of a big shakeup.
…values. (That line kind of says a lot about why this project is tricky, doesn't it.)
This cleans up a lot of stuff and reduces the amount of boilerplate content that's just generating monomorphizing error method stubs. The nodeGenerator mixins now also use the node mixins. I don't know why I failed to do that from the start; I certainly meant to. It results in shorter generated code, and the compiler turns it into identical assembly.
…sts AssignNode validation. The HACKME_builderBehaviors.md file in the repo root says we should be able to proceed from these repeated-key conditions. The AssignNode systems now use Finish methods more reasonably. In the absence of actual validation logic, I don't know if this really structurally mattered, but it feels better now. There's one further comment in basicnode.Map and basicnode.List's AssignNode method about semantics when it's used after *some* data has already been added; this should probably be addressed at some point. I'm pushing it off the "today" plate, though, since it's not a critical bug that violates immutability or some such. All three of these issues were noticed while building the related semantics in codegen for structs.
Lots of tasty bitshuffling in this commit. Remaining: creating, embedding, and returning child value assemblers. This might involve a full type per field: doing so would be direct, have pretty excellent verbosity in debugging symbols, and pointers back to the parent would suffer no interface indirection; overall, it's probably going to have the fastest results. However, it would also mean: we can't save resident memory size if we have more than one field in a struct that has the same type; and we can't avoid increasing binary size if the same types are used as fields in several places. Going to pause to think about this for a while.
We'll see how this works out.
…ed using a pointer; attempt the finishCallback system for reusable child assemblers. This... almost appears to be on a dang nice track. Except one fairly critical thing: it doesn't work correctly if your maybe IS implemented as containing a pointer. Because that pointer starts life as nil in the parent structure. And that's hard to fix. We can't have a pointer to a pointer in the assembler; that'd cause the type bifurcation we've been trying to avoid. (We can give up and do that, of course; it would be correct. Just... more code and larger final assembly size. And if we did this, a lot of this diff would be rewritten.) We could have the parent structure allocate a blank of the field, and put a pointer to it in the maybe and also in the child assembler. Except... this is a wasted allocation if it turns out to be null. Rock and hard place detected in a paired configuration. Might have to back out of this approach entirely. Not sure.
Easier than previous commit message seemed to think. Since we already have finishCallback functions in play in any scene where maybes can also be in play, we can just make those responsible for copying out a pointer from the child assembler to the parent value. That makes the child assembler able to create a new value late in its lifecycle and still expect it can land in a proper home. Tests needed badly; this is *oodles* of edge cases. Unrelated fix needed first. (Might make that elsewhere and rebase this to include that, to reduce the carnage ever so slightly.)
Sanity check level: gen result compiles. Next: time to target tests at all this jazz that actually exercise and run the generated code.
I'm almost a little staggered. We've got structures with maybes both using pointers and not, and everything seems fine. A few fixes were necessary, but they were mostly "d'oh" grade. (The finishCallback logic is a bit hairy, but appears correct now.) More cases for more variations in presence coming up next.
I started doing double-duty in tests to keep the volume down: the block for nullables tests both 'nullable' and 'nullable optional'; and the optionals similarly. Will make failures require more careful reading to discern, but probably a fine trade.
Seems to be on track after all.
I was particularly worried about trailing optionals. Happily, they appear to work fine. Representation iterators and smooth stopping included. Starting to wonder how best to organize these tests to continue. 333 lines already? Questionable scalability.
Still have a lot of uncertainty to resolve around nullable and optional in general. But writing that up elsewhere -- trying to treat it from the spec angle without getting the implementation too mixed up in it.
Some touches to ImplicitValue came along for the ride, but aren't exercised yet. Tests are getting unwieldy. I think something must be done about this before proceeding much further, or it's going to start resulting in increasingly significant velocity loss.
Using golang's plugin feature, we can... well, *do* this. To date, testing codegen has involved running the "test" in the gen package to get it to emit code; and then switching to the emitted package and _manually_ running the tests there. Now, running `go test` in the gen package is sufficient to do *everything*: both the generation and the compiling of the result; and we can even write tests against the interfaces and run those, all in one step.

There's also lots of stuff that becomes possible now that we can easily generate multiple separate packages with various codegen outputs:

- Overall: everything is granular. We can test selections of things, rather than needing to have everything fall into place at once.
- Generally more organized results.
- We can more easily inspect the size of generated code.
- We can more easily inspect the size of the compiled result of gen! (Okay, not really. I'm seeing a '.so' file that's 4MB coming out from 200sloc of "String". I don't think that's exactly informative. Some constant factor is thoroughly obscuring the data of interest. Nice idea in theory, though, and maybe we'll get there later.)
- We can diff the generated type for variations in adjunct config! (Probably not something that will end up tested, but neat to be able to do during dev.)

Doing this with go plugins seemed like the neatest way to do this. It's certainly not the only way, though. And in particular, I will confess that this will probably make developing from a windows host pretty painful: go plugins aren't supported on windows. Mind, this doesn't mean you can't *use* codegen or its results on windows. It just means the tests won't work. So, someone doing development _on the codegen itself_ would have to wait for the CI server to run the tests on their behalf. Hopefully this is survivable.

(There's also a fun wee wiggle in that loading a plugin has the requirement that it be built with the same "runtime".
The definition of "runtime" here happens to include whether or not things have been built in "race" mode. So, the '-race' flag disappears from our CI config file in this diff; otherwise, CI will get the (rather confusing) error "plugin was built with a different version of package runtime". This isn't really worrying to ditch, though. I'm... not even sure why the '-race' was in our CI script in the first place. Must've arrived via cargo cult; we don't _have_ any concurrency in this library.)

An alternative way to go about all this would be to have the tests for gen invoke `go test` (rather than `go build` in plugin mode) on each of the generated packages. It strikes me as similar but worse. We still have to invoke the go tools from inside the test; we'd have *more* work to do to either copy tests into the gen'd package or else generate calls back to the parent package for test functions (which still have to be written against interfaces, so that they can compile even when the gen isn't done, as it indeed isn't when you freshly check out the repo -- exact same as with the plugin approach); and in exchange for the extra work, we get markedly worse output ('go test' doesn't nest nicely, afaict), we can't separate the compiling of the generated code from the evaluation of tests on it, and we'd have no possibility of passing information via closures should we wish to develop table-driven tests where this would be useful. tl;dr: Highest cost, uglier, and worse.

No matter which way we go about this, there *is* a speed trade-off. Invoking the compiler per package adds at least a half second of time for each package, in practice. Worth it, though. And on the off chance that this plugin approach does burn later, and we do want to switch to child-process 'go test' invocations... the good news is: we shouldn't actually have to rewrite very much.
The everything-starts-from-NodeStyle-and-tests-against-Node work is exactly the same for making the plugin approach work, and will trivially pivot to working fine for child 'go test' approaches, should we find it necessary to do so in the future. So! With our butts covered: a plugin'ing we shall go! Some of the code here still needs cleanup; this is a proof-of-concept checkpointing commit. (The real thing probably won't have such function names as "TestFancier".) But we do get to see here: plugins work; more than one in the same process works; and they work even when the same type names are in the generated packages. All good.
In the parent commit, we tried using plugins. Turns out there's one big, big drawback to plugins: building one invokes 'gcc'. Works; as long as you... have gcc, and don't mind having that as a dependency that you now have to track and potentially worry about. I don't consider that minor.

So, okay. What's using 'go test' child invocations look like? Well, bad. It looks... bad. The output is a mess. Every child process starts its output as if it's a top-level thing. Why this happens is perfectly understandable and hard to be mad at, but nonetheless the result is maddeningly illegible. Cleaning this up might be possible, but will require some additional effort to be invested.

There are some more irritating problems than that, though. Notice the entirely new package that's sprouted. Not being able to use closures in the test assertions is unfortunate. Having to put the test assertions in *entirely different files* than their setup... now that's really painful. (And yeah, sure, we can move the setup out to another package, too. It's not pleasing, though. Then what of the test is actually in the package it's testing? The entrypoint? So then, I'd end up editing the test setup and assertions in one package (but running tests on that package would actually no-op; syke!) and then have to switch to the other package to actually run it? There's no angle this can be sliced that results in any sort of _nice_.)

Flags like '-v' aren't inherited now, either. Hooray. We could try to hack that in, I suppose. Detect the '-test.v' arg. But it's just another thing that needs patching up to be shipshape. Speaking of which, the '-run' arg isn't even *remotely* manageable without massive effort. The thing can contain regexps, for goodness' sake -- how can we possibly semantically modify those for passdown?

I'm not sure where to go with this. Depending on gcc in tests is undesirable. But all these other effects from trying to use 'go test' this way seem even *more* undesirable.
'go test' just really wasn't designed for this. There's no first-class, API-driven way to interact with it at all. Even the 'go test -json' capabilities, despite being built-in, appear to be strap-ons rather than logically central, consistent abstractions. Where things can be parsed at all (and in general, they can't -- not without significant potential for ambiguity slipping in from some angle or another), trying to output them again in a normalized form is seriously tricky business: a great deal of re-buffering is required. (This diff doesn't have it: I started off far too optimistic about the simplicity of the whole thing, and am putting it down rather than continuing. I get the feeling I could tumble down the rabbit hole for quite some time and still only ever be *approaching* correctness asymptotically -- never actually *getting* there.)

In addition, all the problems enumerated in the previous commit ('-v' needs to be passed down; '-run' is more or less impossible; and all the sadness about closures and more helper files split across whole package boundaries just for maximum distraction) are still here.

Yeah... let's... let's put this one back in the box. I'm going to merge-ignore this and keep the plugin approach instead. As much as I *really* don't want gcc as a test dep, all this stuff... this seems worse. (And maybe, someday, Go can be improved so plugins don't have the gcc dep. I'm going to hope for that.)
…degen-using-plugins Both commits on the merged branch contain long comments explaining why that approach was not satisfying. I'm feeling a bit salty about the whole thing, truth be told. I really expected there to be APIs somewhere in this vicinity.
Output is fairly fabulous. Really happy with this. Diff size is nice. Took the opportunity to do some cleanups and the net result is 193 insertions and 351 deletions counted by line. Most importantly, though: running 'go test .' in the gengo package actually generates, compiles, and tests **everything**. The test file inside the "_test" dir that you previously had to jankily run manually is gone. Hooray! And already, we've got three distinct packages being generated and tested. (Try 'diff _test/stroct*/tStroct.go'. It's neat.) The time overhead is acceptable so far. The total time is up to 0.9sec on my current machine. Not as bad as feared. Overall: nice. I think this reduces the test friction to a point that will significantly enable further progress.
…his whole wrangle.
Wanting this as I work on list representations, and making target-of-opportunity improvements as I encounter them.
Trying to minimize gsloc here as much as feasible -- a lot of logic *is* in common with the type-level node. We'll see how this goes, though. If we inspect the performance and the assembly output at the end of this, and it's nuts, we might reconsider how to approach this substantially.
A lot of the comments in earlier commit messages about choosing a path and cutting through it roughly and leaving DRY for later still apply (see f1eeeaf). The hardest part of writing this was giving up on having shared code in the output. It just can't be done along the current edges without giving up some performance at runtime, and that's generally not a trade we want to make. Maybe there are different cuts for function boundaries that would do better -- but we'll leave that for later fun work. I did at least get a bunch of templates to be textually shared. Comments abound in the diff for future possibilities, as well as what's not possible to go further on without making undesired trades.
…ives. Would cause the same memory to get reused inappropriately; need the nil value there to kick the child assembler to do the new allocation on the next round.
Having a nil pointer dereference cause a panic *during* error stringing is really unpleasant. We shouldn't "ever" have this problem, when the library is done. But right now we certainly do.
…for both assembly and traversal. It does.
Correct representational creation mode still coming up, as you can see in the fixme comment. Tests will come in the next commit along with the creation mode.
…e when they recurse. This is pretty similar to the extractions that made it fly for lists. I've added a few more comments to the "genparts*" files, because on further reflection, a lot of those methods are... less reusable than the name might seem to claim. We'll see how those evolve. Altered TestMapsContainingMaps quite a bit to now also test that representation mode is tracked correctly across both read and creation recursions, in a similar way to how TestListsContainingLists does (e.g., uses some structs towards the leaf nodes, so it can use the rename directives to create visible differences).
A fair amount of stuff is finished. A fair amount of stuff isn't. These are all the type kinds and representation strategies outlined in https://specs.ipld.io/schemas/representations.html .
Some of them still lived in the "gendemo" package, and those are now moved over here to the proper gen package. Linked to more of the other documents from the main HACKME doc.
Fewer comments emitted in gen result. Touch up the 'do not edit' comment.
Going to move it over to replace the (currently hand-written) gendemo package shortly... but spread that over a few commits, in case the diffs turn out interesting to look at.
Previously, it was manually written prototypes of what gen "would" look like. Now it's the real deal :3
Includes one code fix, as well: generated maps now return 'nil' for the value of lookups when returning an error for not-found. (I waffled on this for a while, because it's arguable that `ipld.Undef` is a perfectly reasonable sentinel value for this, and it's *arguable* that any sentinel value is better than nil. However... no. A much, much clearer rule is "if a method returns an error, the value is nil" -- with the converse also holding -- and that's what we're now going with and sticking with.)
All "genparts*" files are reusable bits of template; all other "gen[A-Z](.*)" files are per kind.
I spent some time poking around at other applications that I'd want to be able to use some of this stuff on, and the lines about "minimalist" come from that. I suspect that, in ideal outcomes, I'd sometimes like to be able to generate the type structures and native/typed APIs for a Schema definition... but perhaps not generate all the stuff necessary for the interface monomorphization, because perhaps I just want an enum purely for internal program purposes; surely we should be able to just do that. We'll see, though. Maybe that's scope creep as much as anything else.
... almost. In attempting to put the sections about absent value handling into place, I realized there's a decently sizable bug there already. I'm going to document this for now, move on, and come back to it later... so it will continue to take up space in a "wip" doc. The description of the situation is at least cleaned up now.
warpfork force-pushed the codegen-oppdatering branch from 982012d to befa498 on June 15, 2020, 20:45.
(It's a small diff -- just reflecting the fix for map returning absent.)
Okay -- I think it's about time to sync this with master and merge it. It's still far from perfect, but 99 commits of work is far enough divergence. Work will continue on new (hopefully mostly smaller) branches after this.
Warning: large diff.
This is a big update and rework of codegen.
Codegen was briefly left behind entirely by the winter update to the core interfaces which moved us to "assembler"-oriented operation -- wherein we mostly assign data into already-bulk-allocated memory -- over the previous "builders". As a result, this is very nearly a total reboot. The use of textual templating as the primary mechanism remains the same; and the overall pattern of {node, assembler, reprNode, reprAssembler} remains the same; but most of the actual code has churned.
Some of the highlights:

- Fields that are `nullable` or `optional` can use pointers internally, or not. (This might be something an advanced user legitimately wants to tune for performance reasons.)

This still has a decent number of TODOs, FIXMEs, and REVIEWPLZ comments in it, but may nonetheless already be an interesting read.
There's also a lot of code which is not yet even remotely "DRY". I've been intentionally not DRY'ing things out on the first pass: it's been reliably non-obvious so far what parts are actually going to turn out reusable and at what granularity and boundaries, so it seems better (and less frankly brain-melting) to do things with high redundancy now, and factor it back out later once it's clear what works and what doesn't.
Despite the roughness, I might land this back on master soon. It's still not recommended for incautious general use... but it's getting done enough to kick it around if you're feeling experimental, and it's certainly closer to useful than what's on master presently, so it's probably more useful to move forward than to twiddle thumbs.