Simple span processor perf #502
Conversation
Currently the simple span processor is _highly_ sensitive to any latency in its underlying exporter, as it blocks the current thread for the full export duration each time a span ends. This patch addresses this by moving span exporting to a separate thread and communicating via channels. Spans are still exported as soon as they end, and shutdown waits for all spans to be successfully exported, preserving the simple span processor semantics of immediate exports and never dropping data.
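A minimal sketch of the pattern this PR describes, using plain `std` channels and a blocking stand-in exporter (the `SpanData`/`Exporter` types here are illustrative placeholders, not the SDK's real types; the actual patch drives an async exporter with `block_on`):

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative stand-ins for the SDK's span and exporter types.
struct SpanData;
struct Exporter;

impl Exporter {
    fn export(&mut self, _batch: Vec<SpanData>) {
        // In the real SDK this is a (potentially slow) network call.
    }
}

// Spawn the dedicated export thread and hand back the sending half of the
// channel. `send` returns immediately, so ending a span no longer blocks
// for the duration of the export.
fn spawn_worker(
    mut exporter: Exporter,
) -> (mpsc::Sender<Option<SpanData>>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::Builder::new()
        .name("opentelemetry-exporter".to_string())
        .spawn(move || {
            // `None` is the shutdown signal; spans drain in FIFO order, so
            // nothing queued before shutdown is ever dropped.
            while let Ok(Some(span)) = rx.recv() {
                exporter.export(vec![span]);
            }
        })
        .expect("failed to spawn exporter thread");
    (tx, handle)
}

fn main() {
    let (tx, handle) = spawn_worker(Exporter);
    tx.send(Some(SpanData)).unwrap(); // on span end: non-blocking hand-off
    tx.send(None).unwrap();           // on shutdown: signal the worker...
    handle.join().unwrap();           // ...and wait for all exports to finish
}
```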
The implementation looks good. I wonder if this is a good idea, though.
That said, I'm really not sure either way. There are certainly benefits in exporting in a different thread. I wouldn't object to merging this.
Yeah, I'm open to alternatives as well. I think there are always going to be drawbacks to the simple strategy; the benefit of this approach is that it should be better for the cases where the simple strategy makes sense (sending one-by-one, never dropping spans). If you are producing spans faster than you can export them, you should probably be using the batch span processor to do load shedding. The main alternative seems to be to close this and keep the existing behavior, which avoids the OOM by simply blocking the progress of the current thread, but that seems to conflict with open-telemetry/opentelemetry-specification#1555, which adds a non-blocking requirement to ending spans.
I don't know of good alternatives. And you're right. The spec suggests that we shouldn't block, which the simple span processor currently does.
.name("opentelemetry-exporter".to_string()) | ||
.spawn(move || { | ||
while let Ok(Some(span)) = span_rx.recv() { | ||
if let Err(err) = executor::block_on(exporter.export(vec![span])) { |
Should we consider adding a timeout here to prevent the exporter from blocking for too long?
I was considering that, but the spec doesn't have any config for it, and if you are creating spans faster than you are exporting them to the point where this would be a big issue, you should probably just use the batch processor, as your app may OOM or do other bad things with the simple processor anyway.
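Not part of this PR, but for reference, a sketch of what such a timeout could look like by racing the export future against a timer (the `futures_timer::Delay` dependency and the 30-second value are assumptions for illustration):

```rust
use std::time::Duration;
use futures::{executor, future::{self, Either}};
use futures_timer::Delay; // assumed extra dependency, not part of this PR

// Stand-in for the exporter's async export call.
async fn export_batch() -> Result<(), String> {
    Ok(())
}

fn export_with_timeout() {
    // Race the export future against a timer so a stuck exporter cannot
    // stall the worker thread indefinitely; 30s is an illustrative value.
    let export = Box::pin(export_batch());
    let timeout = Delay::new(Duration::from_secs(30));
    match executor::block_on(future::select(export, timeout)) {
        Either::Left((Ok(()), _)) => {} // export finished in time
        Either::Left((Err(err), _)) => eprintln!("export failed: {}", err),
        Either::Right(((), _)) => eprintln!("export timed out"),
    }
}
```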
To that point, we should probably warn users that using the simple span processor may cause OOM.
Yeah, I think a follow-up docs PR on the difference between the two and when to choose one vs. the other would be a good idea 👍
Overall LGTM. I think we may also want to consider somehow supporting users without an async runtime in the batch span processor; that may be helpful, as the batch span processor generally has better performance.
ForceFlush seems to have been left behind in #502. With those changes, the processing is not really synchronous anymore, i.e. OnEnd now only sends the span down the pipe to be processed in the separate thread as soon as possible. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#forceflush-1 says:

> In particular, if any SpanProcessor has any associated exporter, it SHOULD try to call the exporter's Export with all spans for which this was not already done and then invoke ForceFlush on it.

As the comment states, all spans previously got exported synchronously right away, so that no such spans existed, but now they might be anywhere between the channel and (the end of) the export call. Doing nothing in ForceFlush even violates the specification, as...

> The built-in SpanProcessors MUST do so.

Awaiting all open tasks from the channel on ForceFlush fixes this. Previous discussions regarding parts of the specification that this does not tackle, in line with Shutdown:

> ForceFlush SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.

#358 (comment)

> ForceFlush SHOULD complete or abort within some timeout.

https://github.com/open-telemetry/opentelemetry-rust/pull/502/files#r603722431

This brings the simple processor a step closer to the batch processor, with the obvious main difference of batches and the (not so obvious, also see #502 (comment)) difference that it works without a presumed async runtime.
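A minimal sketch of one way to implement that flush, assuming the worker channel carries an enum of messages (the `Message` type, `SpanData` stand-in, and ack-channel wiring here are illustrative, not necessarily the shape of the actual fix):

```rust
use std::sync::mpsc;

// Illustrative stand-in for the SDK's span type.
struct SpanData;

// Hypothetical channel message: either a span to export or a flush
// marker carrying a one-shot ack sender.
enum Message {
    ExportSpan(SpanData),
    Flush(mpsc::Sender<()>),
}

fn force_flush(tx: &mpsc::Sender<Message>) {
    // Enqueue a flush marker behind all previously sent spans. The worker
    // handles messages in FIFO order and acks the marker only after every
    // span queued before it has been exported, so receiving the ack means
    // the flush is complete.
    let (ack_tx, ack_rx) = mpsc::channel();
    if tx.send(Message::Flush(ack_tx)).is_ok() {
        let _ = ack_rx.recv();
    }
}
```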