Batch span processor support for multiple pending exports #2434
Comments
I went looking into this at one point as well and found that the OTLP spec actually does support pipelining requests: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlpgrpc-concurrent-requests and https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlphttp-concurrent-requests

At first I thought this conflicted with how the batch processor says: "Export() will never be called concurrently for the same exporter instance. Export() can be called again only after the current call returns." But pretty sure I realized So there may just need to be a note in the batch processor spec to say that

Edit: Also wanted to note there should be a PR to add an env variable for the max concurrent requests. The OTLP spec specifies it should be configurable but does not give a name to an environment variable for everyone to follow. And "concurrent sending" is in the spec matrix https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md#exporters ... which now I'm confused since Java has a
In the C++ SDK, the export is sequential, so the batch processor waits for the ongoing request to complete before creating a new one. During our tests, we observed a significant performance/throughput improvement with concurrent HTTP requests and multiple ongoing exports (open-telemetry/opentelemetry-cpp#1209 (comment)). The changes couldn't be incorporated because they seem to deviate from the spec: "Export() will never be called concurrently for the same exporter instance. Export() can be called again only after the current call returns". Adding a configurable number of ongoing exports would definitely be useful. @owent @DebajitDas - fyi
The Go SDK explicitly states that calls to the exporter from the BSP will be done synchronously based on what the specification stated. The Go SDK and its exporters would be able to support this concurrency pattern, but there would need to be some flag as mentioned. The existing exporters we provide do not offer any concurrency guarantees and would need to be refactored. If there was no flag and the behavior was introduced it would indeed cause race conditions based on the current implementations.
@MrAlias but can it be done by only refactoring the exporters and not touching the batch processor? As with retry logic, which is not allowed in the batch processor and must be done in the exporter itself, I think it makes the most sense to leave the batch processor spec as it is (except for adding a note similar to how it says retries must be done in the exporter). I guess the line about
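To illustrate the exporter-side approach discussed here, below is a minimal Python sketch (not taken from any SDK). The names `PipeliningExporterSketch`, `max_in_flight` and `fake_send` are hypothetical. The processor still calls `export()` one call at a time, but each call only waits for a free slot and then hands the request off, so the I/O of several batches can overlap inside the exporter.

```python
import asyncio


class PipeliningExporterSketch:
    """Hypothetical exporter: export() is called sequentially, but the
    requests it starts may overlap, up to max_in_flight at a time."""

    def __init__(self, send, max_in_flight: int = 4):
        self._send = send  # async callable that performs the actual I/O
        self._slots = asyncio.Semaphore(max_in_flight)
        self._requests = set()

    async def export(self, batch):
        # Waits only while every slot is taken; this is the back-pressure.
        await self._slots.acquire()
        task = asyncio.ensure_future(self._send_and_release(batch))
        self._requests.add(task)
        task.add_done_callback(self._requests.discard)

    async def _send_and_release(self, batch):
        try:
            await self._send(batch)  # retry logic would also live here
        finally:
            self._slots.release()

    async def shutdown(self):
        # Wait for every request that is still outstanding.
        if self._requests:
            await asyncio.wait(set(self._requests))


async def demo(max_in_flight: int = 2, batches: int = 5):
    active = 0
    peak = 0
    sent = []

    async def fake_send(batch):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for a network round trip
        sent.append(batch)
        active -= 1

    exporter = PipeliningExporterSketch(fake_send, max_in_flight)
    for i in range(batches):
        await exporter.export(i)  # never called concurrently
    await exporter.shutdown()
    return peak, sorted(sent)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the concurrency limit and any retries stay inside the exporter, and the processor needs no changes.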
My attempt to improve the spec for this: #2452
Applications generating significant span volume can end up dropping data due to the synchronous export step. According to the opentelemetry spec,

> This function will never be called concurrently for the same exporter instance. It can be called again only after the current call returns.

However, it does not place a restriction on concurrent I/O or anything of that nature. There is an [ongoing discussion] about tweaking the language to make this more clear.

With that in mind, this commit makes the exporters return a future that can be spawned concurrently. Unfortunately, this means that the `export()` method can no longer be async while taking `&mut self`. The latter is desirable to enforce the no concurrent calls line of the spec, so the choice is made here to return a future instead with the lifetime decoupled from self. This resulted in a bit of additional verbosity, but for the most part the async code can still be shoved into an async fn for the ergonomics.

The main exception to this is the `jaeger` exporter which internally requires a bunch of mutable references. I plan to discuss with the opentelemetry team the overall goal of this PR and get buy-in before making more invasive changes to support this in the jaeger exporter.

[ongoing discussion]: open-telemetry/opentelemetry-specification#2434
* Add support for concurrent exports

  Applications generating significant span volume can end up dropping data due to the synchronous export step. According to the opentelemetry spec,

  > This function will never be called concurrently for the same exporter instance. It can be called again only after the current call returns.

  However, it does not place a restriction on concurrent I/O or anything of that nature. There is an [ongoing discussion] about tweaking the language to make this more clear.

  With that in mind, this commit makes the exporters return a future that can be spawned concurrently. Unfortunately, this means that the `export()` method can no longer be async while taking `&mut self`. The latter is desirable to enforce the no concurrent calls line of the spec, so the choice is made here to return a future instead with the lifetime decoupled from self. This resulted in a bit of additional verbosity, but for the most part the async code can still be shoved into an async fn for the ergonomics.

  The main exception to this is the `jaeger` exporter which internally requires a bunch of mutable references. I plan to discuss with the opentelemetry team the overall goal of this PR and get buy-in before making more invasive changes to support this in the jaeger exporter.

  [ongoing discussion]: open-telemetry/opentelemetry-specification#2434

* SpanProcessor directly manages concurrent exports

  Prior, export tasks were run in "fire and forget" mode with runtime::spawn. SpanProcessor now manages tasks directly using FuturesUnordered. This enables limiting overall concurrency (and thus memory footprint). Additionally, flush and shutdown logic now spawn an additional task for any unexported spans and wait on _all_ outstanding tasks to complete before returning.

* Add configuration for BSP max_concurrent_exports

  Users may desire to control the level of export concurrency in the batch span processor. There are two special values:

  - max_concurrent_exports = 0: no bound on concurrency
  - max_concurrent_exports = 1: no concurrency, makes everything synchronous on the messaging task.

* Implement new SpanExporter API for Jaeger

  Key points

  - decouple exporter from uploaders via channel and spawned task
  - some uploaders are a shared I/O resource and cannot be multiplexed
  - necessitates a task queue
  - eg, HttpClient will spawn many I/O tasks internally, AgentUploader is a single I/O resource. Different level of abstraction.
  - Synchronous API not supported without a Runtime argument. I updated the API to thread one through, but maybe this is undesirable. I'm also exploiting the fact in the Actix examples that it uses Tokio under the hood to pass through the Tokio runtime token.
  - Tests pass save for a couple of flakey environment ones which is likely a race condition.

* Reduce dependencies on futures

  The minimal necessary futures library (core, util, futures proper) is now used in all packages touched by the concurrent exporters work.

* Remove runtime from Jaeger's install_simple

  To keep the API _actually_ simple, we now leverage a thread to run the jaeger exporter internals.

* Add Arc lost in a rebase

* Fix OTEL_BSP_MAX_CONCURRENT_EXPORTS name and value

  Per PR feedback, the default should match the previous behavior of 1 batch at a time.

* Fix remaining TODOs

  This finishes the remaining TODOs on the concurrent-exports branch. The major change included here adds shutdown functionality to the jaeger exporter which ensures the exporter has finished its tasks before exiting.

* Restore lint.sh script

  This was erroneously committed.

* Make max concurrent exports env configurable

  OTEL_BSP_MAX_CONCURRENT_EXPORTS may now be specified in the environment to configure the number of max concurrent exports. This configurable now has parity with the other options of the span_processor.
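The processor-managed concurrency described in the commit message above can be sketched as follows. This is a rough Python/asyncio analogue, not the Rust implementation: `BatchProcessorSketch`, `max_concurrent_exports_from_env` and `fake_export` are hypothetical names, a plain set of tasks stands in for `FuturesUnordered`, and falling back to the default on an unparsable value is an assumption. The environment variable name, the default of 1, and the meaning of 0 and 1 come from the commit message.

```python
import asyncio
import os


def max_concurrent_exports_from_env(default: int = 1) -> int:
    """Read OTEL_BSP_MAX_CONCURRENT_EXPORTS, falling back to the default.

    Assumption: missing, unparsable or negative values use the default.
    """
    raw = os.environ.get("OTEL_BSP_MAX_CONCURRENT_EXPORTS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 0 else default


class BatchProcessorSketch:
    """Hypothetical processor that tracks its own in-flight export tasks."""

    def __init__(self, export, max_concurrent_exports: int = 1):
        self._export = export  # async callable taking one batch
        self._limit = max_concurrent_exports
        self._in_flight = set()
        self.peak_in_flight = 0  # recorded only for the demo below

    async def submit(self, batch):
        if self._limit == 1:
            # No concurrency: run the export on the calling task itself.
            self.peak_in_flight = max(self.peak_in_flight, 1)
            await self._export(batch)
            return
        # A limit of 0 means unbounded; otherwise wait for a free slot.
        while self._limit and len(self._in_flight) >= self._limit:
            _done, self._in_flight = await asyncio.wait(
                self._in_flight, return_when=asyncio.FIRST_COMPLETED
            )
        # Forget tasks that have already finished.
        self._in_flight = {t for t in self._in_flight if not t.done()}
        self._in_flight.add(asyncio.ensure_future(self._export(batch)))
        self.peak_in_flight = max(self.peak_in_flight, len(self._in_flight))

    async def flush(self):
        # Wait on all outstanding export tasks before returning.
        if self._in_flight:
            await asyncio.wait(self._in_flight)
            self._in_flight = set()


async def demo(limit: int, batches: int = 6) -> int:
    async def fake_export(batch):
        await asyncio.sleep(0.01)  # stands in for network I/O

    processor = BatchProcessorSketch(fake_export, limit)
    for i in range(batches):
        await processor.submit([i])
    await processor.flush()
    return processor.peak_in_flight


if __name__ == "__main__":
    print(asyncio.run(demo(max_concurrent_exports_from_env())))
```

Because the processor owns the task set, it can both cap the number of in-flight exports and wait for all of them during flush, which is the property the commit message attributes to managing tasks directly instead of spawning them in "fire and forget" mode.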
@tsloughter @reyang I don't think #2452 resolved this
Agree, this is not resolved with #2452.
@trask because it doesn't define a setting for the max number of pending exports? I'd argue against it, but in doing so I'd be arguing to also remove the returning of success/failure and the existing timeout :). So it is a tough one. Adding a pending limit adds further concurrency control to the processor when it is meant to be in the exporter, along with retry logic. The fact that retries are in the exporter is another reason I've not seen the need for success/failure being passed to the processor: what can it do with this information? I guess log it. My plan had been to have
What are you trying to achieve?
Higher throughput from the batch span processor when using an asynchronous exporter (especially when talking to a remote backend as opposed to a local collector).
Having discussed this with the Java folks, I believe this boils down to:
`maxPendingExports` in the spec to support multiple pending exports.

Additional context

The Java SDK's span exporter returns a promise, and the batch span processor waits up to `exportTimeoutMillis` for the promise to complete before proceeding to export the next batch.

It would be great to hear if other language SDKs have had any different interpretation and have implemented the batch span processor to already allow multiple pending exports, since that will help in planning a path forward here.
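As a rough illustration of the Java behavior described above, here is a Python/asyncio sketch (not Java SDK code). `run_sequential`, `export_timeout_s` and `fake_export` are hypothetical names. Whether an export that exceeds the timeout is cancelled is not stated above; this sketch assumes it is simply left running.

```python
import asyncio


async def run_sequential(batches, export, export_timeout_s: float):
    """Export batches one at a time, waiting at most export_timeout_s each."""
    log = []
    for batch in batches:
        task = asyncio.ensure_future(export(batch))
        # asyncio.wait does not cancel on timeout, so a slow export keeps
        # running while the loop moves on (an assumption of this sketch).
        done, _pending = await asyncio.wait({task}, timeout=export_timeout_s)
        log.append((batch, "completed" if task in done else "timed out"))
    return log


async def demo():
    delays = {"a": 0.0, "b": 10.0, "c": 0.0}  # seconds of simulated I/O

    async def fake_export(batch):
        await asyncio.sleep(delays[batch])

    return await run_sequential(["a", "b", "c"], fake_export, 0.1)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With at most one pending export, throughput is bounded by the round-trip time of each request, which is the limitation a `maxPendingExports` setting would relax.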
@open-telemetry/cpp-approvers @open-telemetry/dotnet-approvers @open-telemetry/erlang-approvers @open-telemetry/go-approvers @open-telemetry/javascript-approvers @open-telemetry/python-approvers @open-telemetry/ruby-approvers