Use fewer allocations when building descriptors in the linker package #290

jhump · 2024-04-22T17:09:14Z

The idea here is to use fewer, larger allocations. This change also eliminates many allocations related to computing fully-qualified-names.

Previously, when adding the file's elements to the symbol table, we had to traverse the hierarchy of descriptor protos, because the wrapper descriptor implementations hadn't yet been created. But this meant redundantly computing the full name for every element two extra times: once in the first pass, which checked for collisions, and again in the second pass, which committed the items to the symbol table.

The new flow instead instantiates the descriptor wrappers first. So we compute the fully-qualified names once, for that step, and no more. Adding the names to the symbol table can then traverse those descriptor wrappers, which does not incur the cost of allocating/computing full names again. It wasn't originally implemented this way simply because the cost of iterating the descriptor protos wasn't clear -- it has turned out to be a leaky abstraction.
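To make the new flow concrete, here is a minimal sketch of the two-pass symbol import walking wrappers that already carry their full names. Everything in it (`descriptor`, `symbolTable`, `walk`, `importAll`) is a hypothetical stand-in for illustration, not the actual types or functions in the linker package.

```go
package main

import "fmt"

// descriptor is a hypothetical wrapper. Its full name was computed once,
// when the wrapper was constructed.
type descriptor struct {
	fullName string
	children []*descriptor
}

// walk visits d and all of its descendants, stopping at the first error.
func walk(d *descriptor, fn func(*descriptor) error) error {
	if err := fn(d); err != nil {
		return err
	}
	for _, c := range d.children {
		if err := walk(c, fn); err != nil {
			return err
		}
	}
	return nil
}

// symbolTable is a hypothetical table that maps full names to descriptors.
type symbolTable map[string]*descriptor

// importAll makes two passes over the wrappers. The first pass only checks
// for collisions; the second commits. Neither pass builds a name string:
// both read the name stored on the wrapper.
func (s symbolTable) importAll(roots []*descriptor) error {
	seen := make(map[string]struct{})
	for _, r := range roots {
		err := walk(r, func(d *descriptor) error {
			_, inTable := s[d.fullName]
			_, inFile := seen[d.fullName]
			if inTable || inFile {
				return fmt.Errorf("symbol %q already defined", d.fullName)
			}
			seen[d.fullName] = struct{}{}
			return nil
		})
		if err != nil {
			return err // nothing has been committed yet
		}
	}
	for _, r := range roots {
		_ = walk(r, func(d *descriptor) error {
			s[d.fullName] = d
			return nil
		})
	}
	return nil
}

func main() {
	inner := &descriptor{fullName: "acme.v1.Outer.Inner"}
	outer := &descriptor{fullName: "acme.v1.Outer", children: []*descriptor{inner}}

	st := symbolTable{}
	fmt.Println(st.importAll([]*descriptor{outer}), len(st))
	// Importing the same file again collides and leaves the table untouched.
	fmt.Println(st.importAll([]*descriptor{outer}), len(st))
}
```

The point of the sketch is only that both passes read `fullName` from the wrapper rather than rebuilding it from a parent prefix and a short name.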

Another significant part of this branch is that when we create the descriptor wrappers, we allocate them in bulk using large, flattened slices. This means many fewer calls to the allocator.

I don't have benchmark results just for this change. I was working on benchmarks in the bufbuild/buf repo, and this was an optimization I happened to spot. While it did provide measurable improvements, they were quite small (both in terms of total number of allocations and in terms of throughput). I was hoping for a more dramatic improvement, but despite only minor gains this still seems like a keeper to me.

- create hierarchy descriptor first, so we only need a single traversal where fully-qualified names must be computed/allocated
- subsequent traversals to add to symbol table can use descriptors instead of descriptor protos (which is allocation-free)
- when creating hierarchy, first count all items and allocate slices in bulk instead of single allocation per element

linker/descriptors.go

linker/resolve.go

linker/pool.go

After measuring the impact of #286, #287, and #290 and seeing it to be too modest, I decided to use a memory profiler, and it found "the good stuff". These changes had the largest impact on allocations and performance. When linking inputs that come from descriptor protos (as opposed to inputs that are compiled from sources and have ASTs), this resulted in a 23% reduction in latency and a 70% reduction in allocations.

This change features the following improvements:

1. `ast.NoSourceNode` now has a pointer receiver, so wrapping one in an `ast.Node` interface value doesn't incur an allocation to put the value on the heap. This also updates `parser.ParseResult` to refer to a single `*ast.NoSourceNode` when it has no AST, instead of allocating one in each call to get a node value.
2. The `NoSourceNode`'s underlying type is now `ast.FileInfo` so that it can be allocation-free, even for the `NodeInfo` method (which previously was allocating a new `FileInfo` each time).
3. Don't allocate a slice to hold the set of checked files for each element being resolved. Instead, we allocate a single slice up front, and re-use that throughout.
4. Don't proactively allocate strings that are only used for error messages; instead defer construction of the string until the error itself is constructed.
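A small sketch of three of these patterns (a shared pointer placeholder behind an interface, a re-used scratch slice, and an error message built only on the failure path) might look like the following. All of the names here (`node`, `noSourceNode`, `result`, `resolver`) are hypothetical stand-ins, not the real `ast` or `parser` APIs, and whether any particular conversion allocates in practice also depends on the compiler's escape analysis.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an interface such as ast.Node.
type node interface {
	fileName() string
}

// noSourceNode is a hypothetical placeholder for files that have no AST.
type noSourceNode struct {
	name string
}

// With a pointer receiver, a *noSourceNode can be stored in a node interface
// value directly, without copying the struct to the heap on each conversion.
func (n *noSourceNode) fileName() string { return n.name }

// result is a hypothetical parse result that has no AST.
type result struct {
	placeholder *noSourceNode // allocated once, in newResult
}

func newResult(name string) *result {
	return &result{placeholder: &noSourceNode{name: name}}
}

// nodeFor returns the same shared placeholder for every element.
func (r *result) nodeFor(_ string) node { return r.placeholder }

// resolver is a hypothetical resolver that keeps one scratch slice.
type resolver struct {
	known   map[string]bool
	checked []string // re-used across calls instead of allocated per element
}

func (r *resolver) resolve(file, name string) error {
	r.checked = r.checked[:0] // keep the backing array, drop the contents
	r.checked = append(r.checked, file)
	if r.known[name] {
		return nil // the success path never builds a message
	}
	// The descriptive text is assembled only here, on the failure path.
	return fmt.Errorf("%s: unknown symbol %q", file, name)
}

func main() {
	res := newResult("a.proto")
	fmt.Println(res.nodeFor("Foo") == res.nodeFor("Bar"))

	r := &resolver{known: map[string]bool{"Foo": true}, checked: make([]string, 0, 8)}
	fmt.Println(r.resolve("a.proto", "Foo"))
	fmt.Println(r.resolve("a.proto", "Missing"))
}
```

A memory profile or `testing.AllocsPerRun` would be the way to confirm that patterns like these actually remove allocations in a given code path.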

…bufbuild#290) This uses many fewer, larger allocations, to allocate all of the descendant descriptors in a file in as few chunks as possible (by allocating slices of flattened structs). This change also eliminates many allocations related to computing fully-qualified-names by moving the construction of the descriptor hierarchy (which includes computing and storing full names) to *before* we import the names into the symbol table (which previously had to compute/allocate the full names twice: once when checking for collisions, and again in a second pass to store the names when no collisions were found). (cherry picked from commit 93923d2)

After measuring the impact of bufbuild#286, bufbuild#287, and bufbuild#290 and seeing it to be too modest, I decided to use a memory profiler, and it found "the good stuff". These changes had the largest impact on allocations and performance. When linking inputs that come from descriptor protos (as opposed to inputs that are compiled from sources and have ASTs), this resulted in a 23% reduction in latency and a 70% reduction in allocations.

This change features the following improvements:

1. `ast.NoSourceNode` now has a pointer receiver, so wrapping one in an `ast.Node` interface value doesn't incur an allocation to put the value on the heap. This also updates `parser.ParseResult` to refer to a single `*ast.NoSourceNode` when it has no AST, instead of allocating one in each call to get a node value.
2. The `NoSourceNode`'s underlying type is now `ast.FileInfo` so that it can be allocation-free, even for the `NodeInfo` method (which previously was allocating a new `FileInfo` each time).
3. Don't allocate a slice to hold the set of checked files for each element being resolved. Instead, we allocate a single slice up front, and re-use that throughout.
4. Don't proactively allocate strings that are only used for error messages; instead defer construction of the string until the error itself is constructed.

(cherry picked from commit 016b009)


jhump requested a review from emcfarlane April 22, 2024 17:11

emcfarlane reviewed Apr 22, 2024

linker/descriptors.go (resolved)

jhump mentioned this pull request Apr 22, 2024

Use a profiler to improve linker performance #291

Merged

emcfarlane reviewed Apr 22, 2024

linker/resolve.go (outdated, resolved)

linker/pool.go (resolved)

emcfarlane approved these changes Apr 22, 2024

jhump mentioned this pull request Apr 22, 2024

Flatten slices #292

Closed

review feedback: ctor and init slices early instead of lazily

3158ec5

jhump merged commit 93923d2 into main Apr 22, 2024
8 checks passed

jhump deleted the jh/flattened-slices branch April 22, 2024 19:07
