Improve Span benchmarks #2402

Closed
jmacd opened this issue Nov 17, 2021 · 9 comments · Fixed by #2576
Labels
enhancement New feature or request

Comments

@jmacd
Contributor

jmacd commented Nov 17, 2021

Problem Statement

Span creation is very expensive.

Proposed Solution

We can significantly reduce the number of allocations needed for a new span in the sdk/trace package. In particular:

  1. The two evictedQueue pointers can be replaced with struct fields; the extra space looks far less expensive than two compulsory allocations. Changing these to structs removes two allocations per span:
	events evictedQueue
	links evictedQueue
  2. The attributesMap type is heavyweight and IMO should be discarded. There is no need to dynamically compute the current set of attributes in order to calculate the number of attributes dropped; that can be computed lazily, so that new span construction costs less. It is still necessary to compute the distinct set of attributes, but this can be done with attribute.NewSet(), which is optimized to avoid constructing a map (it uses a stable sort and a linear scan to de-duplicate). This eliminates three allocations. See also attribute.NewSetWithSortable: if the recordingSpan stores an attribute.Sortable in the first place, this avoids another allocation. (A sketch of items 1 and 2 follows this list.)
  3. The fixes described in Avoid unnecessary heap allocation in NewSpanStartConfig #2065
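
Roughly, items 1 and 2 could look like the following sketch (the type layout and names are illustrative, not the actual SDK code):

```go
package sketch // illustrative only

import "go.opentelemetry.io/otel/attribute"

// evictedQueue stands in for the SDK's internal type of the same name.
type evictedQueue struct {
	queue        []interface{}
	capacity     int
	droppedCount int
}

// recordingSpan with the queues stored by value (item 1) and a reusable
// attribute.Sortable (item 2): two fewer allocations per span, plus one
// fewer per attribute de-duplication.
type recordingSpan struct {
	events   evictedQueue
	links    evictedQueue
	sortable attribute.Sortable
}

// distinctAttributes de-duplicates without building a map: NewSetWithSortable
// sorts (stably) and scans linearly, reusing s.sortable as scratch space.
func (s *recordingSpan) distinctAttributes(kvs []attribute.KeyValue) attribute.Set {
	return attribute.NewSetWithSortable(kvs, &s.sortable)
}
```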

Alternatives

In open-telemetry/opentelemetry-go-contrib#1379 I am having trouble with test timeouts for the probability sampler. Instead of improving span creation, we could keep the slow tests and increase their timeouts. With the optimizations described in this issue, those tests should become roughly 4x faster.

@jmacd jmacd added the enhancement New feature or request label Nov 17, 2021
@jmacd
Contributor Author

jmacd commented Nov 18, 2021

I put together a messy demonstration of all of the above-mentioned improvements. The three items above lead to substantial improvements; most of the time remaining after them is spent validating TraceState entries (which feels wasteful, as they are surely well formed).

See the branch with all three fixes and the heavyweight tracestate checking commented out:
https://github.com/open-telemetry/opentelemetry-go/compare/main...jmacd:jmacd/faster?expand=1

sdk/trace benchmarks before the change:

goos: darwin
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/trace
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkAttributesMapToKeyValue-12      	  308713	      4984 ns/op
BenchmarkStartEndSpan/AlwaysSample-12    	 1000000	      1626 ns/op	     848 B/op	       9 allocs/op
BenchmarkStartEndSpan/NeverSample-12     	 2673973	       459.6 ns/op	     224 B/op	       3 allocs/op
BenchmarkSpanWithAttributes_4/AlwaysSample-12         	  699889	      2592 ns/op	    1584 B/op	      17 allocs/op
BenchmarkSpanWithAttributes_4/NeverSample-12          	 1857534	       630.7 ns/op	     416 B/op	       4 allocs/op
BenchmarkSpanWithAttributes_8/AlwaysSample-12         	  337555	      3657 ns/op	    2112 B/op	      23 allocs/op
BenchmarkSpanWithAttributes_8/NeverSample-12          	 1000000	      1035 ns/op	     608 B/op	       4 allocs/op
BenchmarkSpanWithAttributes_all/AlwaysSample-12       	  357660	      3314 ns/op	    1936 B/op	      21 allocs/op
BenchmarkSpanWithAttributes_all/NeverSample-12        	 1486611	       684.1 ns/op	     544 B/op	       4 allocs/op
BenchmarkSpanWithAttributes_all_2x/AlwaysSample-12    	  380109	      3113 ns/op	    3236 B/op	      32 allocs/op
BenchmarkSpanWithAttributes_all_2x/NeverSample-12     	 2019628	       591.1 ns/op	     864 B/op	       4 allocs/op
BenchmarkTraceID_DotString-12                         	17526044	        66.62 ns/op
BenchmarkSpanID_DotString-12                          	22330292	        52.48 ns/op
PASS
ok  	go.opentelemetry.io/otel/sdk/trace	20.636s

and after:

goos: darwin
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/trace
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkStartEndSpan/AlwaysSample-12    	 2108943	       567.7 ns/op	     528 B/op	       2 allocs/op
BenchmarkStartEndSpan/NeverSample-12     	 4608898	       256.2 ns/op	     128 B/op	       2 allocs/op
BenchmarkSpanWithAttributes_4/AlwaysSample-12         	 1249711	       953.9 ns/op	    1168 B/op	       6 allocs/op
BenchmarkSpanWithAttributes_4/NeverSample-12          	 3389326	       349.9 ns/op	     320 B/op	       3 allocs/op
BenchmarkSpanWithAttributes_8/AlwaysSample-12         	  922864	      1244 ns/op	    1872 B/op	       7 allocs/op
BenchmarkSpanWithAttributes_8/NeverSample-12          	 2788951	       427.6 ns/op	     512 B/op	       3 allocs/op
BenchmarkSpanWithAttributes_all/AlwaysSample-12       	  931443	      1207 ns/op	    1808 B/op	       7 allocs/op
BenchmarkSpanWithAttributes_all/NeverSample-12        	 2963410	       402.0 ns/op	     448 B/op	       3 allocs/op
BenchmarkSpanWithAttributes_all_2x/AlwaysSample-12    	  671433	      1671 ns/op	    3152 B/op	       8 allocs/op
BenchmarkSpanWithAttributes_all_2x/NeverSample-12     	 2213091	       539.5 ns/op	     768 B/op	       3 allocs/op
BenchmarkTraceID_DotString-12                         	17788945	        66.34 ns/op
BenchmarkSpanID_DotString-12                          	22966808	        51.49 ns/op
PASS
ok  	go.opentelemetry.io/otel/sdk/trace	18.187s

For the sampler statistical tests mentioned above, running a single test case testing 13% sampling on 1M spans (i.e., creating approximately 130,000 spans, repeated 20 times):

BEFORE:        35.27 real        42.88 user         1.91 sys
AFTER:        12.68 real        18.15 user         0.99 sys

I believe this demonstrates that performance gains are relatively easy, possibly easy enough that I can leave the statistical tests in 1379 as-is.

@jmacd
Contributor Author

jmacd commented Nov 18, 2021

One way this branch gains performance is by not using an LRU structure to decide which attributes to remove. The OTel specification does not say we should do this, and it seems to be optimizing for the atypical case. I do not believe it is worth the cost of maintaining the LRU so that users who do reach the limit have control over which attributes are dropped.

@MadVikingGod
Contributor

Honestly, this is great. I like seeing the improvements here.

Logistically it looks like there are three changes that need to be made:

  • Removal of the Attribute map, going to a simpler list struct
  • Removal of the tracestate checking
  • Changing StartSpanOptions from apply(*config) to apply(config) config

For the first two, can we get a few PRs? They should have some slight improvements on their own.
For the last one, this moves away from our current guidance on how to do options. If it is really causing a number of allocations and slowing down the system, I would like to see that change on its own. I would also want to at least understand the scope of work required to move everything else to this kind of option (not necessarily block on it), just so we don't end up with a confusing mess of different standards.
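
For reference, a rough sketch of the two option shapes being compared (names are illustrative, not the current API):

```go
// Pointer-based (current guidance): the shared config tends to escape to the
// heap because each option mutates it through a *spanConfig.
type spanStartOptionPtr interface {
	apply(*spanConfig)
}

// Value-based (what the branch uses): the config is passed and returned by
// value, so it can stay on the stack when the compiler sees the call chain.
type spanStartOption interface {
	apply(spanConfig) spanConfig
}

type spanConfig struct {
	newRoot bool
}

type withNewRoot struct{}

func (withNewRoot) apply(c spanConfig) spanConfig {
	c.newRoot = true
	return c
}

func newSpanStartConfig(opts ...spanStartOption) spanConfig {
	var c spanConfig
	for _, o := range opts {
		c = o.apply(c)
	}
	return c
}
```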

@jmacd
Contributor Author

jmacd commented Nov 18, 2021

@MadVikingGod Yes that's right, or maybe 4.

  1. stop allocating evictedQueue (*)
  2. differently-optimized attributes
  3. large (and error-prone) change of implementation behind options
  4. something about tracestate validation

One idea for the tracestate performance hit is to offer two variations, one that checks syntax and one that does not.
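
Purely as an illustration of that idea (these functions and the local traceState type do not exist in the otel API, which is more involved):

```go
package sketch // illustrative only

import (
	"errors"
	"strings"
)

type traceState struct{ members []string }

// parseTraceState checks that each list-member looks like key=value
// (standing in here for full W3C tracestate grammar validation).
func parseTraceState(s string) (traceState, error) {
	ts := parseTraceStateUnchecked(s)
	for _, m := range ts.members {
		if !strings.Contains(m, "=") {
			return traceState{}, errors.New("tracestate: invalid member " + m)
		}
	}
	return ts, nil
}

// parseTraceStateUnchecked trusts the caller and only splits the header,
// for values the SDK itself produced and already knows are well formed.
func parseTraceStateUnchecked(s string) traceState {
	return traceState{members: strings.Split(s, ",")}
}
```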

I wanted to check with the group before going further, especially with regards to (2) and (4). These are both changes that relate to the user's expectations of the SDK. I believe as a general rule we should "trust the user" and not go out of our way to protect them from themselves. For 2, the existing attributes map has a complicated mechanism to help users who are confused about which attributes they're setting and are possibly setting too many of them => not worth the cost of protecting users from themselves. Similarly for 4, the existing tracestate mechanism assumes a defensive position at a significant cost. Why would a user bother to configure code that generates invalid tracestate values? => not worth the cost of protecting users from themselves.

(*) There's an anti-pattern inside evictedQueue that prevents memory from being re-used the way it should be. It enforces the item limit (as the specification requires) but hurts memory performance:

eq.queue = eq.queue[1:]
eq.queue = append(eq.queue, ...)
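
A minimal sketch (not the SDK code) of a fixed-capacity queue that reuses its backing array instead of repeatedly re-slicing and appending:

```go
type ringQueue struct {
	buf     []interface{}
	head    int // index of the oldest element
	count   int
	dropped int
}

func newRingQueue(capacity int) *ringQueue {
	return &ringQueue{buf: make([]interface{}, capacity)}
}

func (q *ringQueue) add(v interface{}) {
	if q.count < len(q.buf) {
		q.buf[(q.head+q.count)%len(q.buf)] = v
		q.count++
		return
	}
	// Full: overwrite the oldest slot in place; no growth, no reallocation.
	q.buf[q.head] = v
	q.head = (q.head + 1) % len(q.buf)
	q.dropped++
}
```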

@MadVikingGod
Contributor

I don't know if either of these is designed to protect the user; they are more targeted at spec compliance.

Re 2. There was at some point the idea that attributes should behave like a circular buffer, where the oldest falls off when it fills up. The spec currently says the opposite:

for each unique attribute key, addition of which would result in exceeding the limit, SDK MUST discard that key/value pair

So I think it would be totally reasonable to do away with any complexity that is no longer needed by our interpretation of the spec.

Re 4. The spec is very specific about when and where you have to validate tracestate. If it is causing a lot of problems I think we should explore options that optimize that, but I don't think we are going to move away from validating tracestate whenever it is modified.

@jmacd
Contributor Author

jmacd commented Nov 30, 2021

Re: the cost of tracestate validation specifically

I will take up the issue as part of merging the probability sampler itself. Perhaps it will take a slight revision to the specification. As I have it in open-telemetry/opentelemetry-go-contrib#1379, the tracestate is validated when it is read, then modified and serialized in a way that cannot produce an invalid tracestate. If the specification supported a dedicated interface for setting the OTel tracestate, that could work around the performance hit. In this repository, we'd be able to re-use the implementation in open-telemetry/opentelemetry-go-contrib#1379 (which goes out of its way to avoid allocations in the common case).

@MrAlias
Contributor

MrAlias commented Jan 25, 2022

Looking at the proposed replacement for the span attribute field, it does not seem to follow the requirements of the OpenTelemetry specification.

for each unique attribute key, addition of which would result in exceeding the limit, SDK MUST discard that key/value pair.

The proposed change truncates the last attribute of the lexicographically sorted set. That alone could be resolved fairly easily by dropping the complete or partial calls to SetAttributes that would exceed the limit, but the specification also states the following.

Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.

Complying with both requirements first requires a look-up in some data structure to determine whether the arguments should update existing values, be appended, or be dropped based on capacity. This would seem to require modifying the proposed algorithm into, effectively, an LRU caching scheme without preserving order.

We need to investigate whether that modified algorithm would still provide the desired performance enhancements.
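
A rough sketch of the look-up the spec seems to require (not the SDK implementation; a map index is just one possibility):

```go
package sketch // illustrative only

import "go.opentelemetry.io/otel/attribute"

// spanAttrs keeps insertion order, overwrites existing keys, and drops new
// keys once the limit is reached, matching the quoted spec requirements.
type spanAttrs struct {
	limit   int
	kvs     []attribute.KeyValue
	index   map[attribute.Key]int // key -> position in kvs
	dropped int
}

func newSpanAttrs(limit int) *spanAttrs {
	return &spanAttrs{limit: limit, index: make(map[attribute.Key]int, limit)}
}

func (a *spanAttrs) set(kv attribute.KeyValue) {
	if i, ok := a.index[kv.Key]; ok {
		a.kvs[i].Value = kv.Value // same key: overwrite the existing value
		return
	}
	if len(a.kvs) >= a.limit {
		a.dropped++ // new key would exceed the limit: discard it
		return
	}
	a.index[kv.Key] = len(a.kvs)
	a.kvs = append(a.kvs, kv)
}
```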

@MrAlias
Contributor

MrAlias commented Jan 26, 2022

I created #2554 to track the SDK's non-compliance with the attribute drop order mentioned in this comment.

@jmacd
Contributor Author

jmacd commented Feb 8, 2022

🎉

@pellared pellared added this to the untracked milestone Nov 8, 2024