-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collect subpolicy sampling data from composite policy #20849
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
I agree this is important. Perhaps we could have someone from SIG Sampling help review the names and algorithms here. I remember seeing a few OTEPs on this topic. cc @jmacd |
@jpkrohling, thanks. Is that a conversation you could start as I'm new to the community. Or if you can point me to the SIG (is it in the slack channels?), I'd be happy to start the conversation. |
An update here. After a discussion in #otel-sampling, this information would likely be recording as "sampler.adjusted_count" according to https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md. However, this convention assumes a power-of-two sampling rate and the tailsampling/composite policy needs a non-power-of-two number. The non-power-of-two case is being discussed in a current OTEP PR here: open-telemetry/oteps#226 The metrics part of this issue could be solved today, but recording the "adjusted count" in the trace/span data should wait until the spec is updated. |
I'm sorry for losing track of this. It looks like you found the right places to discuss it! Let me know when you think this is ready to move forward. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Component(s)
processor/tailsampling
Is your feature request related to a problem? Please describe.
The composite policy allows the user to allocate a certain amount of total spans per second to each subpolicy. For example, one might configure "error" traces to get 90%, but "normal" traces to only get 10% of the total allocated spans per second. However, as traffic increases, its impossible to know what the effective sampling rate is (ex, how many error traces are being sampled, how many normal traces are being sampled). This information is important for understanding the characteristics of traffic in a distributed system and detecting when the amount of errors (or other conditions is rising).
Describe the solution you'd like
There are two places I'd like to see this information show up.
a) "trace.sampling_policy" - the name of the subpolicy which triggered the sampling decision
b) "trace.sampling_rate" - the effective sampling rate averaged over a time window (sampled_traces_for_subpolicy/total_traces_for_subpolicy)
The first will allow for monitoring of trends. The second is important when inspecting a trace to understand how often that particular type of trace shows up relative to total traffic.
Describe alternatives you've considered
There are potentially other places to record the sampling policy and sampling rate for a trace, but it seems the root span is the best option.
If traces are dropped due to other issues, such as memory constraints, it would impact the accuracy of the metrics and the "sampling_rate" attribute on the trace.
Additional context
No response
The text was updated successfully, but these errors were encountered: