Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch span processor support for multiple pending exports #2434

Closed
trask opened this issue Mar 23, 2022 · 8 comments · Fixed by #2452
Closed

Batch span processor support for multiple pending exports #2434

trask opened this issue Mar 23, 2022 · 8 comments · Fixed by #2452
Assignees
Labels
spec:trace Related to the specification/trace directory

Comments

@trask
Copy link
Member

trask commented Mar 23, 2022

What are you trying to achieve?

Higher throughput from the batch span processor when using an asynchronous exporter (especially when talking to a remote backend as opposed to a local collector).

Having discussed this with the Java folks, I believe this boils down to:

  • Clarity on whether the batch span processor should allow multiple pending exports, and if not, I'd like to propose a new configuration parameter maxPendingExports in the spec to support multiple pending exports.

Additional context

The Java SDK's span exporter returns a promise, and the batch span processor waits up to exportTimeoutMillis for the promise to complete before proceeding to export the next batch.

It would be great to hear if other language SDKs have had any different interpretation and have implemented the batch span processor to already allow multiple pending exports, since that will help in planning a path forward here.

@open-telemetry/cpp-approvers @open-telemetry/dotnet-approvers @open-telemetry/erlang-approvers @open-telemetry/go-approvers @open-telemetry/javascript-approvers @open-telemetry/python-approvers @open-telemetry/ruby-approvers

@trask trask added the spec:trace Related to the specification/trace directory label Mar 23, 2022
@tsloughter
Copy link
Member

tsloughter commented Mar 24, 2022

I went looking into this at one point as well and found that the OTLP spec actually does support pipelining requests: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlpgrpc-concurrent-requests and https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlphttp-concurrent-requests

At first I thought this conflicted with how the batch processor says: "Export() will never be called concurrently for the same exporter instance. Export() can be called again only after the current call returns." But pretty sure I realized Export technically can return before the request it sent has gotten a response, and was how I planned to implement it -- similar to what you mention with the Java Export implementation returning a promise.

So there may just need to be a note in the batch processor spec to say that Export not being called concurrently is not meant to mean a restriction on concurrent sending of batches.

--

Edit: Also wanted to note there should be a PR to add an env variable for the max concurrent requests. The OTLP spec specifies it should be configurable but does not give a name to an environment variable for everyone to follow.

And "concurrent sending" is in the spec matrix https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md#exporters ... which now I'm confused since Java has a + there :)

@lalitb
Copy link
Member

lalitb commented Mar 24, 2022

In C++ SDK, the export is sequential, so the batch processor waits for the completion of the ongoing request before creating a new one.

And during our tests, significant performance/throughput improvement was observed with concurrent HTTP requests, and multiple ongoing exports (open-telemetry/opentelemetry-cpp#1209 (comment)). Changes couldn't be incorporated as it seems to deviates from specs - "Export() will never be called concurrently for the same exporter instance. Export() can be called again only after the current call returns"

Adding a configurable number of ongoing exports would be definitely useful.

@owent @DebajitDas - fyi

@MrAlias
Copy link
Contributor

MrAlias commented Mar 28, 2022

The Go SDK explicitly states that calls to the exporter from the BSP will be done synchronously based on what the specification stated. The Go SDK and its exporters would be able to support this concurrency pattern, but there would need to be some flag as mentioned. The existing exporters we provide do not offer any concurrency guarantees and would need to be refactored. If there was no flag and the behavior was introduced it would indeed cause race conditions based on the current implementations.

@tsloughter
Copy link
Member

@MrAlias but can it be done with only refactoring the exporters and not touching the batch processor?

Like with retry logic that is not allowed in the batch processor but must be done in the exporter itself I think it makes the most sense to leave the batch processor spec as it is (except for adding a note similar to how it says retries must be done in the exporter).

I guess the line about Export not being called concurrently should also be clarified in the spec but I don't think it has to or should change. If all the responsibility for concurrent sending is in the exporter then the batch processor should be limited in how it calls Export -- the same as you wouldn't want the batch processor to start doing retries if the exporter has that logic as well.

@tsloughter
Copy link
Member

My attempt to improve the spec for this: #2452

jwilm added a commit to OneSignal/opentelemetry-rust that referenced this issue Apr 13, 2022
Applications generating significant span volume can end up dropping data
due to the synchronous export step. According to the opentelemetry spec,

    This function will never be called concurrently for the same exporter
    instance. It can be called again only after the current call returns.

However, it does not place a restriction on concurrent I/O or anything
of that nature. There is an [ongoing discussion] about tweaking the
language to make this more clear.

With that in mind, this commit makes the exporters return a future that
can be spawned concurrently. Unfortunately, this means that the
`export()` method can no longer be async while taking &mut self. The
latter is desirable to enforce the no concurrent calls line of the spec,
so the choice is made here to return a future instead with the lifetime
decoupled from self. This resulted in a bit of additional verbosity, but
for the most part the async code can still be shoved into an async fn
for the ergonomics.

The main exception to this is the `jaeger` exporter which internally
requires a bunch of mutable references. I plan to discuss with the
opentelemetry team the overall goal of this PR and get buy-in before
making more invasive changes to support this in the jaeger exporter.

[ongoing discussion]: open-telemetry/opentelemetry-specification#2434
jwilm added a commit to OneSignal/opentelemetry-rust that referenced this issue Apr 20, 2022
Applications generating significant span volume can end up dropping data
due to the synchronous export step. According to the opentelemetry spec,

    This function will never be called concurrently for the same exporter
    instance. It can be called again only after the current call returns.

However, it does not place a restriction on concurrent I/O or anything
of that nature. There is an [ongoing discussion] about tweaking the
language to make this more clear.

With that in mind, this commit makes the exporters return a future that
can be spawned concurrently. Unfortunately, this means that the
`export()` method can no longer be async while taking &mut self. The
latter is desirable to enforce the no concurrent calls line of the spec,
so the choice is made here to return a future instead with the lifetime
decoupled from self. This resulted in a bit of additional verbosity, but
for the most part the async code can still be shoved into an async fn
for the ergonomics.

The main exception to this is the `jaeger` exporter which internally
requires a bunch of mutable references. I plan to discuss with the
opentelemetry team the overall goal of this PR and get buy-in before
making more invasive changes to support this in the jaeger exporter.

[ongoing discussion]: open-telemetry/opentelemetry-specification#2434
jwilm added a commit to OneSignal/opentelemetry-rust that referenced this issue Apr 20, 2022
Applications generating significant span volume can end up dropping data
due to the synchronous export step. According to the opentelemetry spec,

    This function will never be called concurrently for the same exporter
    instance. It can be called again only after the current call returns.

However, it does not place a restriction on concurrent I/O or anything
of that nature. There is an [ongoing discussion] about tweaking the
language to make this more clear.

With that in mind, this commit makes the exporters return a future that
can be spawned concurrently. Unfortunately, this means that the
`export()` method can no longer be async while taking &mut self. The
latter is desirable to enforce the no concurrent calls line of the spec,
so the choice is made here to return a future instead with the lifetime
decoupled from self. This resulted in a bit of additional verbosity, but
for the most part the async code can still be shoved into an async fn
for the ergonomics.

The main exception to this is the `jaeger` exporter which internally
requires a bunch of mutable references. I plan to discuss with the
opentelemetry team the overall goal of this PR and get buy-in before
making more invasive changes to support this in the jaeger exporter.

[ongoing discussion]: open-telemetry/opentelemetry-specification#2434
jwilm added a commit to OneSignal/opentelemetry-rust that referenced this issue May 5, 2022
Applications generating significant span volume can end up dropping data
due to the synchronous export step. According to the opentelemetry spec,

    This function will never be called concurrently for the same exporter
    instance. It can be called again only after the current call returns.

However, it does not place a restriction on concurrent I/O or anything
of that nature. There is an [ongoing discussion] about tweaking the
language to make this more clear.

With that in mind, this commit makes the exporters return a future that
can be spawned concurrently. Unfortunately, this means that the
`export()` method can no longer be async while taking &mut self. The
latter is desirable to enforce the no concurrent calls line of the spec,
so the choice is made here to return a future instead with the lifetime
decoupled from self. This resulted in a bit of additional verbosity, but
for the most part the async code can still be shoved into an async fn
for the ergonomics.

The main exception to this is the `jaeger` exporter which internally
requires a bunch of mutable references. I plan to discuss with the
opentelemetry team the overall goal of this PR and get buy-in before
making more invasive changes to support this in the jaeger exporter.

[ongoing discussion]: open-telemetry/opentelemetry-specification#2434
TommyCpp pushed a commit to open-telemetry/opentelemetry-rust that referenced this issue May 17, 2022
* Add support for concurrent exports

Applications generating significant span volume can end up dropping data
due to the synchronous export step. According to the opentelemetry spec,

    This function will never be called concurrently for the same exporter
    instance. It can be called again only after the current call returns.

However, it does not place a restriction on concurrent I/O or anything
of that nature. There is an [ongoing discussion] about tweaking the
language to make this more clear.

With that in mind, this commit makes the exporters return a future that
can be spawned concurrently. Unfortunately, this means that the
`export()` method can no longer be async while taking &mut self. The
latter is desirable to enforce the no concurrent calls line of the spec,
so the choice is made here to return a future instead with the lifetime
decoupled from self. This resulted in a bit of additional verbosity, but
for the most part the async code can still be shoved into an async fn
for the ergonomics.

The main exception to this is the `jaeger` exporter which internally
requires a bunch of mutable references. I plan to discuss with the
opentelemetry team the overall goal of this PR and get buy-in before
making more invasive changes to support this in the jaeger exporter.

[ongoing discussion]: open-telemetry/opentelemetry-specification#2434

* SpanProcessor directly manages concurrent exports

Prior, export tasks were run in "fire and forget" mode with
runtime::spawn. SpanProcessor now manages tasks directly using
FuturesUnordered. This enables limiting overall concurrency (and thus
memory footprint). Additionally, flush and shutdown logic now spawn an
additional task for any unexported spans and wait on _all_ outstanding
tasks to complete before returning.

* Add configuration for BSP max_concurrent_exports

Users may desire to control the level of export concurrency in the batch
span processor. There are two special values:

    max_concurrent_exports = 0: no bound on concurrency
    max_concurrent_exports = 1: no concurrency, makes everything
    synchronous on the messaging task.

* Implement new SpanExporter API for Jaeger

Key points
- decouple exporter from uploaders via channel and spawned task
- some uploaders are a shared I/O resource and cannot be multiplexed
    - necessitates a task queue
    - eg, HttpClient will spawn many I/O tasks internally, AgentUploader
      is a single I/O resource. Different level of abstraction.
- Synchronous API not supported without a Runtime argument. I updated
  the API to thread one through, but maybe this is undesirable. I'm also
  exploiting the fact in the Actix examples that it uses Tokio under the
  hood to pass through the Tokio runtime token.
- Tests pass save for a couple of flakey environment ones which is
  likely a race condition.

* Reduce dependencies on futures

The minimal necessary futures library (core, util, futures proper) is
now used in all packages touched by the concurrent exporters work.

* Remove runtime from Jaeger's install_simple

To keep the API _actually_ simple, we now leverage a thread to run the
jaeger exporter internals.

* Add Arc lost in a rebase

* Fix OTEL_BSP_MAX_CONCURRENT_EXPORTS name and value

Per PR feedback, the default should match the previous behavior of 1
batch at a time.

* Fix remaining TODOs

This finishes the remaining TODOs on the concurrent-exports branch. The
major change included here adds shutdown functionality to the jaeger
exporter which ensures the exporter has finished its tasks before
exiting.

* Restore lint.sh script

This was erroneously committed.

* Make max concurrent exports env configurable

OTEL_BSP_MAX_CONCURRENT_EXPORTS may now be specified in the environment
to configure the number of max concurrent exports. This configurable now
has parity with the other options of the span_processor.
@trask
Copy link
Member Author

trask commented Jun 23, 2022

@tsloughter @reyang I don't think #2452 resolved this

@lalitb
Copy link
Member

lalitb commented Jul 5, 2022

Agree, this is not resolved with #2452.

@tsloughter
Copy link
Member

@trask because it doesn't define a setting for the max number of pending exports?

I'd argue against it but in doing so I'd be arguing to also remove the returning of sucess/failure and the existing timeout :). So it is a tough one.

Adding a pending limit further adds concurrency control to the processor when it is meant to be in the exporter, along with retry logic. The fact retries are in the exporter is another reason I've not seen the need for success/failure being passed to the processor, what can it do with this information? I guess log it.

My plan had been to have export block when the exporter has reached its max pending requests, so the batch processor would not need to know about it.

Gmanboy added a commit to Gmanboy/opentelemetry-rust that referenced this issue Aug 12, 2024
* Add support for concurrent exports

Applications generating significant span volume can end up dropping data
due to the synchronous export step. According to the opentelemetry spec,

    This function will never be called concurrently for the same exporter
    instance. It can be called again only after the current call returns.

However, it does not place a restriction on concurrent I/O or anything
of that nature. There is an [ongoing discussion] about tweaking the
language to make this more clear.

With that in mind, this commit makes the exporters return a future that
can be spawned concurrently. Unfortunately, this means that the
`export()` method can no longer be async while taking &mut self. The
latter is desirable to enforce the no concurrent calls line of the spec,
so the choice is made here to return a future instead with the lifetime
decoupled from self. This resulted in a bit of additional verbosity, but
for the most part the async code can still be shoved into an async fn
for the ergonomics.

The main exception to this is the `jaeger` exporter which internally
requires a bunch of mutable references. I plan to discuss with the
opentelemetry team the overall goal of this PR and get buy-in before
making more invasive changes to support this in the jaeger exporter.

[ongoing discussion]: open-telemetry/opentelemetry-specification#2434

* SpanProcessor directly manages concurrent exports

Prior, export tasks were run in "fire and forget" mode with
runtime::spawn. SpanProcessor now manages tasks directly using
FuturesUnordered. This enables limiting overall concurrency (and thus
memory footprint). Additionally, flush and shutdown logic now spawn an
additional task for any unexported spans and wait on _all_ outstanding
tasks to complete before returning.

* Add configuration for BSP max_concurrent_exports

Users may desire to control the level of export concurrency in the batch
span processor. There are two special values:

    max_concurrent_exports = 0: no bound on concurrency
    max_concurrent_exports = 1: no concurrency, makes everything
    synchronous on the messaging task.

* Implement new SpanExporter API for Jaeger

Key points
- decouple exporter from uploaders via channel and spawned task
- some uploaders are a shared I/O resource and cannot be multiplexed
    - necessitates a task queue
    - eg, HttpClient will spawn many I/O tasks internally, AgentUploader
      is a single I/O resource. Different level of abstraction.
- Synchronous API not supported without a Runtime argument. I updated
  the API to thread one through, but maybe this is undesirable. I'm also
  exploiting the fact in the Actix examples that it uses Tokio under the
  hood to pass through the Tokio runtime token.
- Tests pass save for a couple of flakey environment ones which is
  likely a race condition.

* Reduce dependencies on futures

The minimal necessary futures library (core, util, futures proper) is
now used in all packages touched by the concurrent exporters work.

* Remove runtime from Jaeger's install_simple

To keep the API _actually_ simple, we now leverage a thread to run the
jaeger exporter internals.

* Add Arc lost in a rebase

* Fix OTEL_BSP_MAX_CONCURRENT_EXPORTS name and value

Per PR feedback, the default should match the previous behavior of 1
batch at a time.

* Fix remaining TODOs

This finishes the remaining TODOs on the concurrent-exports branch. The
major change included here adds shutdown functionality to the jaeger
exporter which ensures the exporter has finished its tasks before
exiting.

* Restore lint.sh script

This was erroneously committed.

* Make max concurrent exports env configurable

OTEL_BSP_MAX_CONCURRENT_EXPORTS may now be specified in the environment
to configure the number of max concurrent exports. This configurable now
has parity with the other options of the span_processor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:trace Related to the specification/trace directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants