Improve Span benchmarks #2402
Comments
I put together a messy demonstration of all of the above-mentioned improvements. The three bullets above lead to substantial improvements; what was left after that was a lot of time being spent validating tracestate. See the branch with all three fixes and the heavyweight tracestate checking commented out: sdk/trace benchmarks before the change
and after:
For the sampler statistical tests mentioned above, running a single test case testing 13% sampling on 1M spans (i.e., creating approximately 130,000 spans, repeated 20 times):
I believe this demonstrates that performance gains are relatively easy, possibly easy enough that I can leave the statistical tests in #1379 as-is.
One way this branch gains performance is by not using an LRU structure to decide which attributes to remove. The OTel specification does not say we should do this, and it seems to be optimizing for the atypical case. I do not believe it is worth the cost of maintaining the LRU so that users who do reach the limit have control over which attributes are dropped.
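The simpler drop-excess policy (keep attributes up to the limit, just count the overflow) can be illustrated with a minimal sketch; the `KeyValue` and `limitedAttrs` types here are stand-ins, not the SDK's actual implementation:

```go
package main

import "fmt"

// KeyValue stands in for the SDK's attribute.KeyValue type (assumption:
// the real type lives in go.opentelemetry.io/otel/attribute).
type KeyValue struct {
	Key, Value string
}

// limitedAttrs appends attributes up to a fixed capacity and simply counts
// the overflow, instead of maintaining an LRU to choose which to evict.
type limitedAttrs struct {
	limit   int
	attrs   []KeyValue
	dropped int
}

func (l *limitedAttrs) add(kvs ...KeyValue) {
	for _, kv := range kvs {
		if len(l.attrs) < l.limit {
			l.attrs = append(l.attrs, kv)
		} else {
			l.dropped++
		}
	}
}

func main() {
	l := &limitedAttrs{limit: 2}
	l.add(KeyValue{"a", "1"}, KeyValue{"b", "2"}, KeyValue{"c", "3"})
	fmt.Println(len(l.attrs), l.dropped) // 2 1
}
```

The trade-off the comment describes is visible here: no bookkeeping per access, but a user at the limit cannot influence which attributes survive.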
Honestly, this is great. I like seeing the improvements here. Logistically it looks like there are three changes that need to be made:
For the first two, can we get a few PRs? They should have some slight improvements on their own.
@MadVikingGod Yes that's right, or maybe 4.
One idea for the tracestate performance hit is to offer two variations, one that checks syntax and one that does not. I wanted to check with the group before going further, especially with regard to (2) and (4). These are both changes that relate to the user's expectations of the SDK. I believe as a general rule we should "trust the user" and not go out of our way to protect them from themselves. For (2), the existing attributes map has a complicated mechanism to help the user when they are confused about which attributes they're setting and are possibly setting too many of them => not worth the cost of protecting users from themselves. Similarly for (4), the existing tracestate mechanism assumes a defensive position at significant cost. Why would a user bother to configure code that generates invalid tracestate values? => not worth the cost of protecting users from themselves. (*) there's an anti-pattern inside
I don't know if either of these are designed to protect the user; they are more targeted at spec compliance.
Re 2: There was at some point the idea that attributes should be like a circular buffer, where when you fill up the oldest falls off. The spec currently says the opposite,
So I think it would be totally reasonable to do away with any complexity that is no longer needed by our interpretation of the spec.
Re 4: The spec is very specific about when and where you have to validate tracestate. If it is causing a lot of problems I think we should explore options that optimize that, but I don't think we are going to move away from validating tracestate whenever it is modified.
Re: the cost of tracestate validation specifically, I will take up the issue as part of merging the probability sampler itself. Perhaps it will take a slight revision to the specification. As I have it in open-telemetry/opentelemetry-go-contrib#1379, the tracestate is validated when it is read, modified, and serialized in a way that cannot lead to invalid tracestate. If the specification supported a dedicated interface for setting the OTel tracestate, that could work around the performance hit. In this repository, we'd be able to re-use the implementation in open-telemetry/opentelemetry-go-contrib#1379 (which goes out of its way to avoid allocations in the common case).
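For reference, per-member tracestate validation along the lines the spec requires can be sketched as below. The regular expressions are a simplified rendering of the W3C Trace Context grammar (the real grammar also allows `tenant@vendor` keys and caps the total list length), and `validMember` is a hypothetical helper, not the SDK's API:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Simplified patterns after the W3C Trace Context spec: lowercase keys,
// printable values excluding ',' and '=', no trailing space.
var (
	keyRe   = regexp.MustCompile(`^[a-z][a-z0-9_\-*/]{0,255}$`)
	valueRe = regexp.MustCompile(`^[\x20-\x2b\x2d-\x3c\x3e-\x7e]{0,255}[\x21-\x2b\x2d-\x3c\x3e-\x7e]$`)
)

// validMember reports whether a single "key=value" tracestate list member
// passes the (simplified) syntax check.
func validMember(member string) bool {
	k, v, ok := strings.Cut(member, "=")
	if !ok {
		return false
	}
	return keyRe.MatchString(k) && valueRe.MatchString(v)
}

func main() {
	fmt.Println(validMember("ot=p:8;r:62")) // true
	fmt.Println(validMember("BAD=x"))       // false: keys must be lowercase
}
```

Running two regexp matches per member on every read/modify/serialize cycle is exactly the kind of per-span cost the comments above are weighing against spec compliance.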
Looking at the proposed replacement for the span attribute field, it does not seem to follow the requirements of the OpenTelemetry specification.
The proposed change truncates the last attribute of the lexicographically sorted set. This alone can be resolved pretty easily by just dropping complete or partial calls to
It will first require a look-up in some data type to see if the arguments should be used to update values, or be appended or dropped based on capacity. This would seem to require modifying the proposed algorithm to effectively be an LRU caching scheme without preserving order. We need to investigate whether that modified algorithm would still provide the desired performance enhancements.
I created #2554 to track the SDK's non-compliance with the attribute drop order mentioned in this comment.
🎉 |
Problem Statement
Span creation is very expensive.
Proposed Solution
We can significantly reduce the number of allocations needed for a new span in the `sdk/trace` package. In particular:
- The `evictedQueue` pointers can be replaced with structs; the extra space used looks far less expensive than two compulsory allocations. I.e., the change to structs will remove two allocations.
- The `attributesMap` type is heavyweight and IMO should be discarded. There's no need to dynamically compute the current set of attributes in order to calculate the number of attributes dropped. These can be lazily computed, so that new span construction costs less. It is still necessary to compute the distinct set of attributes, but this can be done using the `attribute.NewSet()` call, which is optimized to avoid constructing a map. (It uses a stable sort and linear scan to de-duplicate.) This can eliminate three allocations. See `attribute.NewSetWithSortable`; if the `recordingSpan` stores an `attribute.Sortable` in the first place, this avoids another alloc.

Alternatives
See how in open-telemetry/opentelemetry-go-contrib#1379 I have trouble with test timeouts for the probability sampler. Instead of improving span creation, we could have slow tests with increased timeouts. With the optimizations described in this issue we should be able to make those tests 4x faster.