Simple span processor perf #502

Merged
jtescher merged 3 commits into main from simple-span-processor-perf on Mar 31, 2021

Conversation

jtescher
Member

Currently the simple span processor is highly sensitive to any latency in its underlying exporter, as it blocks the current thread for the full export duration when each span ends.

This patch addresses this by moving span exporting to a separate thread, and communicating via channels.

Spans are still exported as soon as they end, and shutdown will wait for all spans to be successfully exported, preserving the simple span processor semantics of immediate exports, and never dropping data.
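
For readers skimming the diff, a minimal sketch of the channel-plus-thread shape (illustrative only: the helper name is made up, and the PR's actual channel type, message shape, and error handling live in span_processor.rs):

    // Minimal sketch, kept generic so it stands alone; the real processor wires
    // this up to the SDK's SpanData and span exporter.
    use std::sync::mpsc;
    use std::thread;

    fn spawn_exporter_thread<T, F>(mut export_one: F) -> (mpsc::Sender<Option<T>>, thread::JoinHandle<()>)
    where
        T: Send + 'static,
        F: FnMut(T) + Send + 'static,
    {
        let (tx, rx) = mpsc::channel::<Option<T>>();
        let handle = thread::Builder::new()
            .name("opentelemetry-exporter".to_string())
            .spawn(move || {
                // Export each span as soon as it arrives; `None` is the shutdown signal.
                while let Ok(Some(span)) = rx.recv() {
                    export_one(span);
                }
            })
            .expect("failed to spawn exporter thread");
        (tx, handle)
    }

With that shape, on_end reduces to a non-blocking tx.send(Some(span)), and shutdown sends None and joins the handle, which is what preserves the export-immediately, never-drop-data semantics described above.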

@jtescher jtescher requested a review from a team March 29, 2021 16:34
@frigus02
Member

The implementation looks good.

I wonder if this is a good idea, though.

  • At the moment the simple span processor is highly sensitive to any latency in its underlying exporter, but its behavior is easily predictable.
  • If I have a program using the simple span processor where the program logic is faster than the exporter, then with the current implementation my program runs slower, but otherwise just fine. With the change it can build up a very large backlog of spans, which consumes memory. The bigger the difference between exporter latency and business logic speed, the smaller the overall benefit of making the requests in another thread.

That said, I'm really not sure either way. There are certainly benefits in exporting in a different thread. I wouldn't object to merging this.

@jtescher
Member Author

Yeah, I'm open to alternatives as well. I think there are always going to be drawbacks to the simple strategy; the benefit of this approach is that it should be better for the cases where the simple strategy makes sense (sending spans one-by-one, never dropping them). If you are producing spans faster than you can export them, you should probably be using the batch span processor to do load shedding.

The main alternative seems to be to close this and keep the existing behavior, which avoids the OOM by simply blocking the progress of the current thread. But that seems to conflict a bit with open-telemetry/opentelemetry-specification#1555, which adds a non-blocking requirement to ending spans.

@jtescher jtescher mentioned this pull request Mar 29, 2021
@frigus02
Member

I don't know of good alternatives. And you're right. The spec suggests that we shouldn't block, which the simple span processor currently does.

opentelemetry/src/sdk/trace/span_processor.rs (resolved review thread)
opentelemetry/Cargo.toml (resolved review thread)

opentelemetry/src/sdk/trace/span_processor.rs (excerpt under review):

    .name("opentelemetry-exporter".to_string())
    .spawn(move || {
        while let Ok(Some(span)) = span_rx.recv() {
            if let Err(err) = executor::block_on(exporter.export(vec![span])) {
Contributor

Should we consider adding a timeout around here to prevent the exporter block for too long?
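
(For context, a bound like the one suggested here could look roughly like the helper below. It is hypothetical, not something this PR adds, and it assumes the futures and futures-timer crates.)

    use std::time::Duration;
    use futures::executor;
    use futures::future::{select, Either};
    use futures_timer::Delay;

    /// Drive `fut` to completion on the current thread, but give up once
    /// `timeout` elapses; returns `None` if the deadline wins the race.
    fn block_on_with_timeout<F: std::future::Future>(fut: F, timeout: Duration) -> Option<F::Output> {
        match executor::block_on(select(Box::pin(fut), Delay::new(timeout))) {
            Either::Left((out, _)) => Some(out),
            Either::Right(((), _)) => None,
        }
    }

The export loop could then call block_on_with_timeout(exporter.export(vec![span]), some_duration) and treat None as a timed-out export; note the spec defines no such duration, so any value would be our own choice.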

Member Author

I was considering that, but the spec doesn't have any config for it and if you are creating faster than you are exporting to the point where it would be a big issue you should probably just use the batch processor as your app may OOM or do other bad things as well with the simple processor.

Contributor

To that point, we should probably warn users that using the simple span processor may cause OOM.

Member Author

Yeah, I think a follow-up docs PR on the difference between the two and when to choose one vs the other would be a good idea 👍

@TommyCpp
Contributor

Overall LGTM. I think we may also want to consider somehow supporting users without an async runtime in using the batch span processor; that could be helpful, as I think the batch span processor generally has better performance.

@jtescher jtescher merged commit 4af37e1 into main Mar 31, 2021
@jtescher jtescher deleted the simple-span-processor-perf branch March 31, 2021 05:14
cschramm added a commit to cschramm/opentelemetry-rust that referenced this pull request Apr 17, 2023
ForceFlush seems to have been left behind in open-telemetry#502. With those changes, the processing is not really synchronous anymore, i.e. OnEnd now only sends the span down the pipe to be processed in the separate thread as soon as possible.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#forceflush-1 says:

> In particular, if any SpanProcessor has any associated exporter, it SHOULD try to call the exporter's Export with all spans for which this was not already done and then invoke ForceFlush on it.

As the comment states, all spans previously got exported synchronously right away, so that no such spans existed, but now they might be anywhere between the channel and (the end of) the export call. Doing nothing in ForceFlush even violates the specification as...

> The built-in SpanProcessors MUST do so.

Awaiting all open tasks from the channel on ForceFlush fixes this.

Previous discussions regarding parts of the specification that this does not tackle, in line with Shutdown:

> ForceFlush SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.

open-telemetry#358 (comment)

> ForceFlush SHOULD complete or abort within some timeout.

https://github.com/open-telemetry/opentelemetry-rust/pull/502/files#r603722431

This brings the simple processor a step closer to the batch processor with the obvious main difference of batches and the (not so obvious, also see open-telemetry#502 (comment)) difference that it works without a presumed async runtime.
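
(A rough sketch of the ack-channel idea for ForceFlush, with made-up names rather than the crate's actual types: the processor sends a flush marker carrying a one-shot sender, and the exporter thread acknowledges it only after everything queued ahead of it has been exported.)

    use std::sync::mpsc;

    enum Message<T> {
        Export(T),
        Flush(mpsc::Sender<()>),
        Shutdown,
    }

    fn exporter_loop<T>(rx: mpsc::Receiver<Message<T>>, mut export_one: impl FnMut(T)) {
        for msg in rx {
            match msg {
                Message::Export(span) => export_one(span),
                // Every span sent before this marker has already been exported,
                // so acknowledging here is enough for force_flush to return.
                Message::Flush(ack) => { let _ = ack.send(()); }
                Message::Shutdown => break,
            }
        }
    }

    fn force_flush<T>(tx: &mpsc::Sender<Message<T>>) {
        let (ack_tx, ack_rx) = mpsc::channel();
        if tx.send(Message::Flush(ack_tx)).is_ok() {
            // Block until the exporter thread has drained everything ahead of the marker.
            let _ = ack_rx.recv();
        }
    }

Because the channel preserves ordering, acknowledging the marker guarantees that every span sent before force_flush was called has already gone through export.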
cschramm added a commit to cschramm/opentelemetry-rust that referenced this pull request Apr 24, 2023 (same commit message as above)
jtescher pushed a commit that referenced this pull request Apr 24, 2023 (same commit message as above)