-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use fewer allocations when building descriptors in the linker package #290
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
create hierarchy descriptor first, so we only need a single traversal where fully-qualified names must be computed/allocated subsequent traversals to add to symbol table can use descriptors instead of descriptor protos (which is allocation-free) when creating hierarchy, first count all items and allocate slices in bulk instead of single allocation per element.
emcfarlane
reviewed
Apr 22, 2024
emcfarlane
reviewed
Apr 22, 2024
emcfarlane
approved these changes
Apr 22, 2024
Closed
jhump
added a commit
that referenced
this pull request
Apr 22, 2024
After measuring the impact of #286, #287, and #290 and seeing it to be too modest. I decided to use a memory profiler, and it found "the good stuff". These changes had the largest impact on allocations and performance. When linking inputs that come from descriptor protos (as opposed to inputs that are compiled from sources and have ASTs), this resulted in a 23% reduction in latency and 70% reduction in allocations. This change features the following improvements: 1. `ast.NoSourceNode` now has a pointer receiver, so wrapping one in an `ast.Node` interface value doesn't incur an allocation to put the value on the heap. This also updates `parser.ParseResult` to refer to a single `*ast.NoSourceNode` when it has no AST, instead of allocating one in each call to get a node value. The `NoSourceNode`'s underlying type is now `ast.FileInfo` so that it can be allocation-free, even for the `NodeInfo` method (which previously was allocating a new `FileInfo` each time). 3. Don't allocate a slice to hold the set of checked files for each element being resolved. Instead, we allocate a single slice up front, and re-use that throughout. 4. Don't pro-actively allocate strings that only are used for error messages; instead defer construction of the change to the construction of the error.
kralicky
pushed a commit
to kralicky/protocompile
that referenced
this pull request
May 19, 2024
…bufbuild#290) This uses many fewer, larger allocations, to allocate all of the descendant descriptors in a file in as few chunks as possible (by allocating slices of flattened structs). This change also eliminates many allocations related to computing fully-qualified-names by moving the construction of the descriptor hierarchy (which includes computing and storing full names) to *before* we import the names into the symbol table (which previously had to compute/allocate the full names twice: once when checking for collisions, and again in a second pass to store the names when no collisions were found). (cherry picked from commit 93923d2)
kralicky
pushed a commit
to kralicky/protocompile
that referenced
this pull request
May 19, 2024
After measuring the impact of bufbuild#286, bufbuild#287, and bufbuild#290 and seeing it to be too modest. I decided to use a memory profiler, and it found "the good stuff". These changes had the largest impact on allocations and performance. When linking inputs that come from descriptor protos (as opposed to inputs that are compiled from sources and have ASTs), this resulted in a 23% reduction in latency and 70% reduction in allocations. This change features the following improvements: 1. `ast.NoSourceNode` now has a pointer receiver, so wrapping one in an `ast.Node` interface value doesn't incur an allocation to put the value on the heap. This also updates `parser.ParseResult` to refer to a single `*ast.NoSourceNode` when it has no AST, instead of allocating one in each call to get a node value. The `NoSourceNode`'s underlying type is now `ast.FileInfo` so that it can be allocation-free, even for the `NodeInfo` method (which previously was allocating a new `FileInfo` each time). 3. Don't allocate a slice to hold the set of checked files for each element being resolved. Instead, we allocate a single slice up front, and re-use that throughout. 4. Don't pro-actively allocate strings that only are used for error messages; instead defer construction of the change to the construction of the error. (cherry picked from commit 016b009)
kralicky
pushed a commit
to kralicky/protocompile
that referenced
this pull request
Jun 8, 2024
…bufbuild#290) This uses many fewer, larger allocations, to allocate all of the descendant descriptors in a file in as few chunks as possible (by allocating slices of flattened structs). This change also eliminates many allocations related to computing fully-qualified-names by moving the construction of the descriptor hierarchy (which includes computing and storing full names) to *before* we import the names into the symbol table (which previously had to compute/allocate the full names twice: once when checking for collisions, and again in a second pass to store the names when no collisions were found). (cherry picked from commit 93923d2)
kralicky
pushed a commit
to kralicky/protocompile
that referenced
this pull request
Jun 8, 2024
After measuring the impact of bufbuild#286, bufbuild#287, and bufbuild#290 and seeing it to be too modest. I decided to use a memory profiler, and it found "the good stuff". These changes had the largest impact on allocations and performance. When linking inputs that come from descriptor protos (as opposed to inputs that are compiled from sources and have ASTs), this resulted in a 23% reduction in latency and 70% reduction in allocations. This change features the following improvements: 1. `ast.NoSourceNode` now has a pointer receiver, so wrapping one in an `ast.Node` interface value doesn't incur an allocation to put the value on the heap. This also updates `parser.ParseResult` to refer to a single `*ast.NoSourceNode` when it has no AST, instead of allocating one in each call to get a node value. The `NoSourceNode`'s underlying type is now `ast.FileInfo` so that it can be allocation-free, even for the `NodeInfo` method (which previously was allocating a new `FileInfo` each time). 3. Don't allocate a slice to hold the set of checked files for each element being resolved. Instead, we allocate a single slice up front, and re-use that throughout. 4. Don't pro-actively allocate strings that only are used for error messages; instead defer construction of the change to the construction of the error. (cherry picked from commit 016b009)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The idea here is to use fewer, larger allocations. This change also eliminates many allocations related to computing fully-qualified-names.
Previously, when adding the file's elements to the symbol table, we had to traverse the hierarchy of descriptor protos, because the wrapper descriptor implementations hadn't yet been created. But this incurred redundantly computing the full name for every element an extra two times -- once for the first pass which checked for collisions and again in the second pass which commits the items to the symbol table.
The new flow instead instantiates the descriptor wrappers first. So we compute the fully-qualified names for that, but no more. Adding the names to the symbol table can then traverse those descriptor wrappers, which does not incur the cost of allocating/computing full names again. It wasn't originally implemented this way simply because the cost of iterating the descriptor protos wasn't clear -- it's turned out to be a leaky abstraction.
Another significant part of this branch is that when we create the descriptor wrappers, we allocate them in bulk using large, flattened slices. This means many fewer calls to the allocator.
I don't have benchmark results just for this change. I was working on benchmarks in the
bufbuild/buf
repo, and this was an optimization I happened to spot. While it did provide measurable improvements, they were quite small (both in terms of total number of allocations and in terms of throughput). I was hoping for a more dramatic improvement, but despite only minor gains this still seems like a keeper to me.