
Collect subpolicy sampling data from composite policy #20849

Closed
drewby opened this issue Apr 11, 2023 · 7 comments
Labels
closed as inactive, enhancement, processor/tailsampling, Stale

Comments

@drewby (Member) commented Apr 11, 2023

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

The composite policy allows the user to allocate a certain share of the total spans per second to each subpolicy. For example, one might configure "error" traces to get 90% of the total allocated spans per second, but "normal" traces to get only 10%. However, as traffic increases, it's impossible to know what the effective sampling rate is (e.g., how many error traces are being sampled versus how many normal traces). This information is important for understanding the characteristics of traffic in a distributed system and for detecting when the number of errors (or other conditions) is rising.
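For concreteness, here is a minimal sketch of such a configuration, modeled on the composite-policy example in the tailsamplingprocessor README; the policy names, rates, and limits below are illustrative:

```yaml
processors:
  tail_sampling:
    policies:
      - name: composite-policy
        type: composite
        composite:
          # Illustrative cap on total sampled spans per second.
          max_total_spans_per_second: 1000
          policy_order: [error-traces, normal-traces]
          composite_sub_policy:
            - name: error-traces
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: normal-traces
              type: always_sample
          rate_allocation:
            # 90% of the span budget for error traces, 10% for the rest.
            - policy: error-traces
              percent: 90
            - policy: normal-traces
              percent: 10
```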

Describe the solution you'd like

There are two places I'd like to see this information show up.

  1. In metrics, there should be an additional dimension on "count_traces_sampled" called "subpolicy", or a new counter called "composite_count_traces_sampled".
  2. This information is also important when inspecting a trace. I'd like two attributes added to the root span (or the first span, if the root is missing) of a trace:
    a) "trace.sampling_policy" - the name of the subpolicy that triggered the sampling decision
    b) "trace.sampling_rate" - the effective sampling rate averaged over a time window (sampled_traces_for_subpolicy / total_traces_for_subpolicy)

The first will allow for monitoring of trends. The second is important when inspecting a trace to understand how often that particular type of trace shows up relative to total traffic.
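As a rough illustration of point 2, a minimal sketch using the collector's pdata API; the attribute names come from this proposal, while the helper function, its inputs, and the windowed counts are hypothetical:

```go
package sampling

import (
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// annotateTrace is a hypothetical helper: it stamps the proposed
// attributes onto the first span of a sampled trace. policyName is
// the subpolicy that triggered the decision; sampled and total are
// trace counts for that subpolicy over the current time window.
func annotateTrace(td ptrace.Traces, policyName string, sampled, total int64) {
	if total == 0 || td.ResourceSpans().Len() == 0 {
		return
	}
	scopeSpans := td.ResourceSpans().At(0).ScopeSpans()
	if scopeSpans.Len() == 0 || scopeSpans.At(0).Spans().Len() == 0 {
		return
	}
	// Root span if present; otherwise simply the first span of the trace.
	span := scopeSpans.At(0).Spans().At(0)
	span.Attributes().PutStr("trace.sampling_policy", policyName)
	// Effective sampling rate averaged over the window, as proposed above.
	span.Attributes().PutDouble("trace.sampling_rate", float64(sampled)/float64(total))
}
```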

Describe alternatives you've considered

There are potentially other places to record the sampling policy and sampling rate for a trace, but it seems the root span is the best option.

If traces are dropped due to other issues, such as memory constraints, it would impact the accuracy of the metrics and the "sampling_rate" attribute on the trace.

Additional context

No response

@drewby added the enhancement and needs triage labels Apr 11, 2023
@github-actions bot added the processor/tailsampling label Apr 11, 2023
@github-actions (bot) commented

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme removed the needs triage label Apr 12, 2023
@jpkrohling (Member) commented

I agree this is important. Perhaps we could have someone from SIG Sampling help review the names and algorithms here. I remember seeing a few OTEPs on this topic.

cc @jmacd

@drewby (Member, Author) commented Apr 19, 2023

@jpkrohling, thanks. Is that a conversation you could start, as I'm new to the community? Or, if you can point me to the SIG (is it in the Slack channels?), I'd be happy to start the conversation.

@drewby (Member, Author) commented May 8, 2023

An update here. After a discussion in #otel-sampling, this information would likely be recorded as "sampler.adjusted_count" according to https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md. However, that convention assumes a power-of-two sampling rate, and the tailsampling composite policy needs a non-power-of-two rate.

The non-power-of-two case is being discussed in a current OTEP PR here: open-telemetry/oteps#226

The metrics part of this issue could be solved today, but recording the "adjusted count" in the trace/span data should wait until the spec is updated.
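For background, OTEP 170's adjusted count is the inverse of the sampling probability, with probabilities restricted to powers of two; a small sketch of the mismatch (the function names here are illustrative):

```go
package sampling

import "math"

// Per OTEP 170, a trace sampled with probability p represents 1/p
// traces (its "adjusted count"), and p is encoded as a power of two
// (p = 2^-s), so the adjusted count is the integer 2^s.
func powerOfTwoAdjustedCount(s int) float64 {
	return math.Exp2(float64(s))
}

// The composite policy's effective rate (sampled/total per window) is
// generally not a power of two, so its adjusted count cannot be
// expressed under the current convention; hence the wait for the
// non-power-of-two OTEP mentioned above.
func compositeAdjustedCount(sampledTraces, totalTraces int64) float64 {
	return float64(totalTraces) / float64(sampledTraces)
}
```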

@jpkrohling (Member) commented

I'm sorry for losing track of this. It looks like you found the right places to discuss it! Let me know when you think this is ready to move forward.

@github-actions (bot) commented Aug 7, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label Aug 7, 2023
@github-actions (bot) commented Oct 6, 2023

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions bot closed this as not planned Oct 6, 2023