-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the requirement for the probability sampler #570
Add the requirement for the probability sampler #570
Conversation
…ribute describing its probability.
specification/trace/sdk.md
Outdated
@@ -119,6 +119,9 @@ These are the default samplers implemented in the OpenTelemetry SDK: | |||
that are root spans (no parent) and Spans with remote parent. However there | |||
should be configuration to change this to "root spans only", or "all spans". | |||
* Description MUST be `ProbabilitySampler{0.000100}`. | |||
* If the probability is applied (as opposed to using the parent's `SampledFlag` value), the set of span Attributes | |||
returned in the result MUST be a single attribute with the key `sampling.probability` and a value of the probability | |||
being used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Jaeger clients the samplers return two tags (at minimum), indicating the type of sampler used, and the parameter value (like probability, or rate for rate limiting sampler). It's functionally equivalent, but makes it easier to determine the type of sampler used than matching on potentially different tag names (e.g. Jaeger collector adds sampler type as a label on the metric of how many traces it receives).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be all in favor of having the Decision
include the sampler name, as well. I think that's a different PR than this one, though. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this very similar to the component=http attribute, which we decided against?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a little different to me. This is an internal detail of the sampling itself, rather than something that is coming from outside the sampler (like the component).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a new MUST requirement some reasoning would be nice. Would the probability be used by the backend to apply a multiplicity to sampled spans?
Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
I agree. @pavolloffay raised this issue in the java project, so I'll let him add additional context: open-telemetry/opentelemetry-java#1087 |
Regarding the line length: #564 |
@@ -119,6 +119,10 @@ These are the default samplers implemented in the OpenTelemetry SDK: | |||
that are root spans (no parent) and Spans with remote parent. However there | |||
should be configuration to change this to "root spans only", or "all spans". | |||
* Description MUST be `ProbabilitySampler{0.000100}`. | |||
* If the probability is applied (as opposed to using the parent's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's surprising that this is a MUST since downstream tracers don't typically need to know the parent's sampling procedure, just its decision. This looks more like a SHOULD-level semantic convention to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we want to have the Probability Sampler work the same everywhere. So, if we want to go forward with this change (and I'm relying on @pavolloffay for the justification/need), then I think that we should require them to all behave the same way. Hence, the use of MUST.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@c24t this has no effect on downstream tracers, since the probability is merely recorded as span attribute, out of band.
Justification for capturing probability and type of sampler: Jaeger backend implements an adaptive control loop (think "PID controller") that continuously recalculates desired sampling probabilities for ingress endpoints and pushes them back to the SDKs. The calculation needs to know what kind of sampler was used for any given trace, because only probabilistically sampled traffic should be counted, whereas SDKs also have other sampling mechanisms, like "each endpoint at least once a minute". For this reason, Jaeger SDKs add |
Is it clearer to use the sample probability, or inverse-probability e.g. representing the number of spans/traces that this span represents? Some discussion recently: https://gitter.im/open-telemetry/opentelemetry-specification?at=5ea6fe381eb8bd3978fb91a0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to propagating in the span data, we also need to propagate in the trace headers so that subsequent sampling can be done (e.g. two probabilitysamplers each running with probability 0.2 should produce spans with probability 0.04 in net at the leaf, rather than 0.2 being applied as a fixed number)
It feels like this is a bigger change that this PR was intending to address, since this particular part of the spec is only related to the attributes being provided by the Sampler, and not how these get propagated in headers. |
I'd like to generalize this further using inverse-probability, or just call it For example, let's use Jaeger's probabilistic and rate limiting samplers applied at the start of the trace. If Jaeger's probabilistic sampler was applied with If we just use |
A slightly-more unambiguous term that I like is "adjusted count" because it implies the input weight is 1, so I might propose So this is a vote for |
* If the probability is applied (as opposed to using the parent's | ||
`SampledFlag` value), the set of span Attributes returned in the result MUST | ||
include an attribute with the key `sampling.probability` and a value of the | ||
probability being used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sampling.probability would have the value that is calculated by the algorithm? It is not very clear what is value of the probability being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the concern. A probability-based sampler has a probability that it is configured to use. This is the sampling probability.
@jmacd personally, I don't fully understand all the use cases where you can try to extrapolate other statistics based on the sampling probability, and that makes me reluctant making it a first-order field in the span. Probabilistic sampling is just one form of sampling, other forms of sampling cannot be expressed as a probability or even an effective weight/count. For example a token-bucket-based rate limiting sampler can be modified to count how many traces it didn't sample, but that count is completely wrong to use as a weight of the positive sampling decision (e.g. a traffic burst that exhausts all tokens for a while would make the next positive sampling decision disproportionately weighted, skewing any distribution metrics you may want to extrapolate from the sampled trace). |
That's fair that a token bucket based sampler wouldn't be taking a statistical sample, and thus the sample rate would be inapplicable. However, for probability based sampling, even where you may have multiple probabilities depending upon subpopulation (e.g. trying to sample the tails 1:1 but the body 1:50 or 1:1000), it is possible to reconstruct the distribution from re-weighting the received events. See https://youtu.be/QFx63Ukn4QM?t=1505 for more information (and not from Lightstep or Honeycomb! but from SignalFx!) |
@lizthegrey yeah, I saw that talk before, but the leftovers of a scientist in me were disappointed, because it showed neither math nor any literature references to prove the claim, only empirical observations from synthetic simulations. I would love to see something more robust. That aside, I do not dispute that capturing the probability is useful, Jaeger SDK always did that, but out-of-band. |
Sampling as a topic has come up for Metrics in both open-telemetry/opentelemetry-proto#159 and #625. To be fair, the talk slides list some ways to compute samples: Sampling of traces--of the kind discussed in this issue--is the same as sampling from GROUPING metric data points, where each event (that may become a sample item) has a real count of 1 (but may end up with representative Computing sampling probabilities for ADDING metric data points is a simpler problem than for GROUPING metric data points, generally. I like to start with a weighted reserver sampling algorithm, here are two: https://dl.acm.org/doi/10.1145/1314690.1314696 It's not directly obvious how to get from a solution to sampling ADDING points to sampling GROUPING points, but there are a bunch of ways. I want to highlight the simplest. Using a weighted sampling algorithm such as these requires batching data (spans or metric events), for the sampling weight or probability isn't known until a complete frame of data has been processed. Once you have batch of data, take two passes over the data:
The output weight divided by the input weight equals This procedure outputs a fixed size sample with information that allows computing In the OpenCensus metric protocol, Histogram buckets contain one non-statistical exemplar each. Using the procedure above, we can compute instead a statistical sample of raw data points. Using the same procedure, an exporter can compute If you read all this and still just want to see an open-source implementation, something is bound to happen. :-) If you read all this, hopefully you also agree we should have a SpanData field named |
I think the SpanData discussion pertains to open-telemetry/oteps#107 in which we discuss out of process communication of the sample rate/priority via SpanData; the in-process communication of that information could be done with an Attribute (but as per above is really a metadata property best sent in Data rather than Attribute). So Data addresses a superset of the Attribute case, but is more disruptive because it involves changes to fields that we propagate. |
@lizthegrey I thought it was the opposite. I'm not trying to say anything about propagation in-band (which open-telemetry/oteps#107 does), only about what gets recorded in the span (which is I think the point of this PR). |
To be fair, those references are algorithms for doing approximate distribution/quantile computations without storing all the data. None of them, to my knowledge, discuss the possibility of doing the same on already sampled data. The sampling discussion starts right after that, and is only backed by empirical experiments, not math. He even jokes "it's all good in practice, but how does it work in theory?" |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Java already implements this, so I think we should make a decision here other than letting the PR go stale. CC @pmm-sumo |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
It must include a span attribute describing its probability.