
Flush export queue when it reaches max_export_batch_size #1521

Merged
2 commits merged into open-telemetry:main on Feb 12, 2021

Conversation

anton-ryzhov
Contributor

Description

Currently, BatchExportSpanProcessor skips its sleep if the queue already holds max_export_batch_size items (512 by default), but it interrupts an in-progress sleep only once the queue collects max_queue_size // 2 items (1024 by default).

It should export immediately when there are max_export_batch_size items in the queue.
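For context, here is a minimal, self-contained sketch of the two code paths involved (the worker's sleep and the on_end notify), assuming the default sizes above; the names mirror the SDK, but this is an abbreviation, not the SDK source:

```python
import collections
import threading

max_queue_size = 2048
max_export_batch_size = 512
queue = collections.deque(maxlen=max_queue_size)
condition = threading.Condition()

def worker(timeout: float) -> None:
    # The worker skips its sleep entirely when a full batch is already queued...
    with condition:
        if len(queue) < max_export_batch_size:
            condition.wait(timeout)  # sleep until notified or timeout
    # ...then exports up to max_export_batch_size spans (omitted).

def on_end(span) -> None:
    queue.appendleft(span)
    # ...but before this PR, an in-progress sleep was interrupted only once
    # max_queue_size // 2 == 1024 spans had accumulated:
    if len(queue) >= max_queue_size // 2:  # the fix uses max_export_batch_size
        with condition:
            condition.notify()
```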

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Manually, by creating spans and analysing the logs.
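For illustration, a hypothetical reproduction along those lines, using the pre-1.0 SDK class names this PR touches (BatchExportSpanProcessor was renamed in later releases); the exporter choice and span count are assumptions, not from the PR:

```python
# Hypothetical reproduction script, not part of the PR itself.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchExportSpanProcessor,
    ConsoleSpanExporter,
)

provider = TracerProvider()
provider.add_span_processor(
    BatchExportSpanProcessor(ConsoleSpanExporter(), max_export_batch_size=512)
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# With the fix, ending 512 spans triggers an immediate export instead of
# waiting for 1024 queued spans or the schedule delay.
for i in range(512):
    with tracer.start_as_current_span(f"span-{i}"):
        pass
```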

Does This PR Require a Contrib Repo Change?

Answer the following question based on these examples of changes that would require a Contrib Repo Change:

  • The OTel specification has changed, prompting this PR to update the method interfaces of opentelemetry-api/ or opentelemetry-sdk/

  • The method interfaces of opentelemetry-instrumentation/ have changed

  • The method interfaces of test/util have changed

  • Scripts in scripts/ that were copied over to the Contrib repo have changed

  • Configuration files that were copied over to the Contrib repo have changed (when consistency between repositories is applicable), such as:

    • pyproject.toml
    • isort.cfg
    • .flake8
  • When a new .github/CODEOWNERS entry is added

  • Major changes to project information, such as in:

    • README.md
    • CONTRIBUTING.md
  • Yes. - Link to PR:

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@anton-ryzhov anton-ryzhov marked this pull request as ready for review January 11, 2021 16:37
@anton-ryzhov anton-ryzhov requested review from a team, toumorokoshi and hectorhdzg and removed request for a team January 11, 2021 16:37
Member

@toumorokoshi toumorokoshi left a comment

Thanks! Code looks good; one blocking design question below.

```diff
@@ -194,7 +194,7 @@ def on_end(self, span: Span) -> None:

         self.queue.appendleft(span)

-        if len(self.queue) >= self.max_queue_size // 2:
+        if len(self.queue) >= self.max_export_batch_size:
```
Member

Thanks for the change! I wonder if it would be better to modify the exporter to flush the queue at something less than the max batch size.

I think this code was written to notify a flush well before the queue filled, to handle race conditions. I haven't looked at the code super closely recently, but IIRC there is a possibility for multiple spans to end at the same time, and as a result:

  • the first span would notify the flush
  • the second span would be dropped, since the flush has already been notified but the export_batch_size has been reached.

So maybe take something a little more cautious, like flushing when the queue fills to 80%, and have that be the condition for both the notify and the flush itself (refactoring both into a single function call would prevent the two code paths from diverging in the future; see the sketch below).
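A hypothetical sketch of that refactor, purely to illustrate the suggestion; the helper name and the 0.8 factor are not from the PR, and this approach was not adopted:

```python
# Hypothetical helper, not part of the merged change: one function owns the
# threshold, so the notify in on_end and the worker's skip-the-sleep check
# can never drift apart.
def should_flush(queue_len: int, max_export_batch_size: int) -> bool:
    return queue_len >= int(0.8 * max_export_batch_size)
```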

Contributor Author

Queue overflow and span drops: that's exactly the issue I'm addressing here.

There are two different numbers: max_queue_size = 2048 and max_export_batch_size = 512.

I don't think it's a good idea to flush half-full (or 80%-full) batches. That would only make effective_export_batch_size = 0.8 * max_export_batch_size.

The second span in your example won't be dropped as long as max_queue_size > max_export_batch_size, which should always hold.

Maybe, as an extra check, the code should validate that max_queue_size > 2 * max_export_batch_size or so; see the sketch below.
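A minimal sketch of that validation, assuming it would run at processor construction time; the exact bound and message are illustrative, not part of the merged change:

```python
# Hypothetical constructor check, not part of the merged change.
def validate_sizes(max_queue_size: int, max_export_batch_size: int) -> None:
    if max_queue_size <= 2 * max_export_batch_size:
        raise ValueError(
            "max_queue_size should be more than twice max_export_batch_size "
            "so spans that end during an export are not dropped"
        )
```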

Member

Got it, makes sense. Sorry about that; I misread the code the first time.

I think the checks that are already there are good (albeit they might require more documentation).

Member

@toumorokoshi toumorokoshi left a comment

LGTM! Good catch, thanks for the change.

Base automatically changed from master to main January 29, 2021 16:49
@codeboten codeboten merged commit 5967374 into open-telemetry:main Feb 12, 2021