
Collect subpolicy sampling data from composite policy #20849

Closed
drewby opened this issue Apr 11, 2023 · 7 comments
Labels
closed as inactive, enhancement, processor/tailsampling, Stale

Comments

@drewby (Member) commented Apr 11, 2023

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

The composite policy allows the user to allocate a certain share of the total spans per second to each subpolicy. For example, one might configure "error" traces to get 90% of the total allocated spans per second, but "normal" traces to get only 10%. However, as traffic increases, it's impossible to know what the effective sampling rate is (e.g., how many error traces are being sampled versus how many normal traces). This information is important for understanding the characteristics of traffic in a distributed system and for detecting when the number of errors (or other conditions) is rising.
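For concreteness, here is a minimal sketch of such a configuration, modeled on the composite-policy example in the tailsamplingprocessor README; the policy names, rates, and limits below are illustrative:

```yaml
processors:
  tail_sampling:
    policies:
      - name: composite-policy
        type: composite
        composite:
          # Illustrative cap on total sampled spans per second.
          max_total_spans_per_second: 1000
          policy_order: [error-traces, normal-traces]
          composite_sub_policy:
            - name: error-traces
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: normal-traces
              type: always_sample
          rate_allocation:
            # 90% of the span budget for error traces, 10% for the rest.
            - policy: error-traces
              percent: 90
            - policy: normal-traces
              percent: 10
```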

Describe the solution you'd like

There are two places I'd like to see this information show up.

  1. In metrics, there should be an additional dimension on "count_traces_sampled" called "subpolicy", or a new counter called "composite_count_traces_sampled".
  2. This information is also important when inspecting a trace. I'd like two attributes added to the root span (or the first span, if the root is missing) of a trace:
    a) "trace.sampling_policy" - the name of the subpolicy that triggered the sampling decision
    b) "trace.sampling_rate" - the effective sampling rate averaged over a time window (sampled_traces_for_subpolicy / total_traces_for_subpolicy)

The first will allow for monitoring of trends. The second is important when inspecting a trace to understand how often that particular type of trace shows up relative to total traffic.
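As a rough illustration of point 2, a minimal sketch using the collector's pdata API; the attribute names come from this proposal, while the helper function, its inputs, and the windowed counts are hypothetical:

```go
package sampling

import (
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// annotateTrace is a hypothetical helper: it stamps the proposed
// attributes onto the first span of a sampled trace. policyName is
// the subpolicy that triggered the decision; sampled and total are
// trace counts for that subpolicy over the current time window.
func annotateTrace(td ptrace.Traces, policyName string, sampled, total int64) {
	if total == 0 || td.ResourceSpans().Len() == 0 {
		return
	}
	scopeSpans := td.ResourceSpans().At(0).ScopeSpans()
	if scopeSpans.Len() == 0 || scopeSpans.At(0).Spans().Len() == 0 {
		return
	}
	// Root span if present; otherwise simply the first span of the trace.
	span := scopeSpans.At(0).Spans().At(0)
	span.Attributes().PutStr("trace.sampling_policy", policyName)
	// Effective sampling rate averaged over the window, as proposed above.
	span.Attributes().PutDouble("trace.sampling_rate", float64(sampled)/float64(total))
}
```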

Describe alternatives you've considered

There are potentially other places to record the sampling policy and sampling rate for a trace, but it seems the root span is the best option.

If traces are dropped due to other issues, such as memory constraints, it would impact the accuracy of the metrics and the "sampling_rate" attribute on the trace.

Additional context

No response

@drewby added the enhancement and needs triage labels Apr 11, 2023
@github-actions bot added the processor/tailsampling label Apr 11, 2023
@github-actions (bot) commented

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme removed the needs triage label Apr 12, 2023
@jpkrohling (Member) commented

I agree this is important. Perhaps we could have someone from SIG Sampling help review the names and algorithms here. I remember seeing a few OTEPs on this topic.

cc @jmacd

@drewby (Member, Author) commented Apr 19, 2023

@jpkrohling, thanks. Is that a conversation you could start, as I'm new to the community? Or, if you can point me to the SIG (is it in the Slack channels?), I'd be happy to start the conversation.

@drewby (Member, Author) commented May 8, 2023

An update here. After a discussion in #otel-sampling, this information would likely be recorded as "sampler.adjusted_count" according to https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md. However, that convention assumes a power-of-two sampling rate, and the tailsampling composite policy needs a non-power-of-two rate.

The non-power-of-two case is being discussed in a current OTEP PR here: open-telemetry/oteps#226

The metrics part of this issue could be solved today, but recording the "adjusted count" in the trace/span data should wait until the spec is updated.
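For background, OTEP 170's adjusted count is the inverse of the sampling probability, with probabilities restricted to powers of two; a small sketch of the mismatch (the function names here are illustrative):

```go
package sampling

import "math"

// Per OTEP 170, a trace sampled with probability p represents 1/p
// traces (its "adjusted count"), and p is encoded as a power of two
// (p = 2^-s), so the adjusted count is the integer 2^s.
func powerOfTwoAdjustedCount(s int) float64 {
	return math.Exp2(float64(s))
}

// The composite policy's effective rate (sampled/total per window) is
// generally not a power of two, so its adjusted count cannot be
// expressed under the current convention; hence the wait for the
// non-power-of-two OTEP mentioned above.
func compositeAdjustedCount(sampledTraces, totalTraces int64) float64 {
	return float64(totalTraces) / float64(sampledTraces)
}
```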

@jpkrohling (Member) commented

I'm sorry for losing track of this. It looks like you found the right places to discuss it! Let me know when you think this is ready to move forward.

@github-actions (bot) commented Aug 7, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label Aug 7, 2023
@github-actions (bot) commented Oct 6, 2023

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions bot closed this as not planned Oct 6, 2023