
Introduce sampling score and propagate it with the trace #135

Closed
wants to merge 26 commits

Conversation

@lmolkova (Contributor) commented Aug 21, 2020

This is a reincarnation of #107.

The spec describes consistent sampling between existing tracing tools that use vendor-specific sampling algorithms and OTel, and enables an upgrade path that vendors can use to onboard customers onto OTel without breaking cross-tracing-tool traces.

The delta OTEP introduces is relatively small:

  1. Add a `SamplingResult.Tracestate` field: a sampler should be able to assign a
    new tracestate for the to-be-created span.
  2. Add a convention for a `sampling.score` attribute on the span (let's discuss attribute vs. field).
    [Update after review]
  3. Add the notion of a `SamplingScoreGenerator` that is capable of calculating a float score from sampling parameters.
    It has `TraceIdRatioGenerator`, `RandomGenerator`, and possibly other implementations.
    • Change the `TraceIdRatioBased` sampler to use the corresponding generator and serve as a generic probability sampler with a configurable score-generation approach.
  4. Add an `ExternalScoreSampler` implementation of `Sampler`. It is created with a probability value and an implementation of `SamplingScoreGenerator`.
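To make the moving parts concrete, here is a rough Python sketch of the four items above. The names (`SamplingScoreGenerator`, `TraceIdRatioGenerator`, `RandomGenerator`, `ExternalScoreSampler`, `sampling.score`) come from the OTEP itself, but the exact signatures and the dict-based tracestate are illustrative assumptions, not the spec's API:

```python
import abc
import random


class SamplingScoreGenerator(abc.ABC):
    """Item 3: calculates a float score in [0, 1) from sampling parameters."""

    @abc.abstractmethod
    def score(self, trace_id: bytes) -> float: ...


class TraceIdRatioGenerator(SamplingScoreGenerator):
    def score(self, trace_id: bytes) -> float:
        # Deterministic score from the lower 8 bytes of the trace id,
        # roughly what TraceIdRatioBased effectively does today.
        return int.from_bytes(trace_id[8:16], "big") / 2**64


class RandomGenerator(SamplingScoreGenerator):
    def score(self, trace_id: bytes) -> float:
        return random.random()


class SamplingResult:
    # Item 1: the new Tracestate field, so a sampler can hand back
    # a modified tracestate for the to-be-created span.
    def __init__(self, sampled: bool, attributes=None, tracestate=None):
        self.sampled = sampled
        self.attributes = attributes or {}
        self.tracestate = tracestate


class ExternalScoreSampler:
    """Item 4: samples on a score found in tracestate, else generates one."""

    def __init__(self, probability: float, generator: SamplingScoreGenerator):
        self.probability = probability
        self.generator = generator

    def should_sample(self, trace_id: bytes, tracestate: dict) -> SamplingResult:
        score = tracestate.get("score")
        if score is None:
            score = self.generator.score(trace_id)
            tracestate = {**tracestate, "score": score}
        # Item 2: expose the score as a span attribute.
        return SamplingResult(
            float(score) < self.probability,
            {"sampling.score": float(score)},
            tracestate,
        )
```

A downstream span started with the returned tracestate reuses the same score, which is the whole interop point.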

With this OTEP we are trying to come up with a long-term plan for interop between Microsoft SDKs and services and OpenTelemetry-enabled apps. We want to make sure this solution exists; the implementation can wait.

The specification changes can wait until post-GA; however, I've heard several requests for item 1 (the ability to modify tracestate from a Sampler) before GA - open-telemetry/opentelemetry-specification#856.

@bogdandrutu (Member) left a comment

I like the overall direction of this, and one idea from this OTEP inspired me.

We could have only one "ProbabilitySampler" that allows configuring how the score is calculated:

  • using a deterministic hash, as the current TraceIdRatioBased does;
  • using a probability score generated at the root of the trace and propagated via tracestate.

What I would do differently:

  • I would encourage rewriting the initial motivation part to clarify why using the trace ID as the source of the score is not good enough.
  • I propose having a "ProbabilitySampler" that allows customizing how the score is calculated and propagated (supporting TraceIdRatioBased as well as the newly proposed way).

Comment on lines 50 to 52
Score is also exposed through span attributes. Vendors can leverage it
to sort traces based on their completeness: the lower the value of the score,
the higher the chance it was sampled by each component.
A member left a comment:

Do we always need to have a score? How do we calculate the score if a custom sampler like AlwaysOn is used? How does the score interact with custom samplers that do rate-based sampling (one trace every second)? How does the backend know whether the score was used for the sampling decision or just ignored?

@lmolkova (Contributor, Author) commented Sep 2, 2020

We don't always need to have a score. We need it when there are multiple possible sampling algorithms in the same app.
An example where we need it:

Service A uses Azure Monitor SDK, uses hash1(trace-id), assigns score  -> Service B uses OTel with hash2(trace-id)
                                                                       -> Service C uses Azure Monitor with hash1(trace-id)

Services A and C have used the Azure Monitor SDK for years, and it's unreasonable to ask them to upgrade to OTel.
It is reasonable to ask them to update to the new version of Azure Monitor that supports the score.

We don't need the score when only OTel (or the same algorithm) is used everywhere.

So, we'll tell our customers to configure `ExternalScoreSampler` and fall back to the OTel `TraceIdRatioBased` algorithm if the score is not in the tracestate. The default case for OTel users does not change; this is opt-in behavior.

Custom samplers that sample one trace every second are not compatible with the score - they don't pursue the consistent-sampling goal anyway, and applying them together with `ExternalScoreSampler` seems to be a misconfiguration.
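The A/B/C scenario above can be sketched in a few lines of Python. `hash1` and `hash2` are hypothetical stand-ins for the vendor-specific hashes (nothing here is the real Azure Monitor or OTel algorithm); the point is only that once service A writes the score into tracestate, downstream decisions agree regardless of the local hash:

```python
import hashlib


def hash1(trace_id: str) -> float:
    """Hypothetical stand-in for the legacy Azure Monitor hash."""
    return int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 2**32


def hash2(trace_id: str) -> float:
    """Hypothetical stand-in for OTel's TraceIdRatioBased hash."""
    return int(hashlib.md5(trace_id.encode()).hexdigest()[:8], 16) / 2**32


def sample(probability, tracestate, local_hash, trace_id):
    # Use the score from tracestate if an upstream service set it;
    # otherwise fall back to the local (vendor-specific) hash.
    score = tracestate.get("score")
    if score is None:
        score = local_hash(trace_id)
        tracestate = {**tracestate, "score": score}
    return score < probability, tracestate


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
p = 0.5

# Service A (hash1) makes the decision and propagates the score:
decision_a, ts = sample(p, {}, hash1, trace_id)
# Services B (hash2) and C (hash1) reuse the propagated score, so all
# three decisions agree even though their local hashes differ:
decision_b, _ = sample(p, ts, hash2, trace_id)
decision_c, _ = sample(p, ts, hash1, trace_id)
assert decision_a == decision_b == decision_c
```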

Comment on lines 165 to 166
- if it's not there, invokes `ProbabilitySampler`, which calculates score
and populates it on the attributes
A member commented:

I like this.

lmolkova (Contributor, Author) replied:

So I've tried to follow this approach in the new version. To make it clean, I suggest separating score generation from sampling.

Samplers can use score generation, and can also attach attributes and change the tracestate.
If we allow falling back to another sampler:

  • we need to propagate the score back from one sampler to the `ExternalScore` one
  • we need to coordinate possibly multiple changes of tracestate and attributes

If we have a layer responsible for score calculation (random or deterministic of any sort), we can make it much simpler and cleaner.
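One way to read the "layer responsible for score calculation" idea: the SDK resolves the score exactly once per span start, and every sampler in the chain (including fallbacks) receives that same value, so no score back-propagation or tracestate coordination between samplers is needed. A rough sketch under that assumption; all names here are illustrative, not proposed API:

```python
def compute_score(tracestate, generator, trace_id):
    # Score layer: resolve the score exactly once per span start,
    # preferring a score propagated from upstream.
    score = tracestate.get("score")
    if score is None:
        score = generator(trace_id)
    return score


def start_span(trace_id, tracestate, samplers, generator):
    score = compute_score(tracestate, generator, trace_id)
    for sampler in samplers:
        decision = sampler(score)  # every sampler sees the same score
        if decision is not None:   # None means "defer to the next sampler"
            return decision, {**tracestate, "score": score}
    return False, tracestate


# Two samplers sharing one score: a 10% sampler that defers to a 50% fallback.
primary = lambda s: True if s < 0.1 else None
fallback = lambda s: s < 0.5

decision, ts = start_span("abc", {}, [primary, fallback], lambda _: 0.3)
assert decision is True and ts["score"] == 0.3
```

Because the score is computed before the chain runs, the fallback sampler never needs to hand anything back to the primary one.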

@lmolkova (Contributor, Author) commented Sep 2, 2020

Thanks for the review, @bogdandrutu. I believe I addressed your questions/comments in the spec.

@Oberon00 (Member) commented Sep 3, 2020

I think

  1. Add SamplingResult.Tracestate field: sampler should be able to assign a
    new tracestate for to-be-created span

which you already created open-telemetry/opentelemetry-specification#856 for, should be done separately. That change is IMHO the most important one (it is breaking in many interface-based languages); it is also the simplest one, and the other parts can be added later.

@lmolkova (Contributor, Author) commented Sep 3, 2020

I think

  1. Add SamplingResult.Tracestate field: sampler should be able to assign a
    new tracestate for to-be-created span

which you already created open-telemetry/opentelemetry-specification#856 for, should be done separately. That change is IMHO the most important one (it is breaking in many interface-based languages); it is also the simplest one, and the other parts can be added later.

I agree that the other parts could be added later implementation-wise.
At the same time, we are making decisions now on how to support sampling interop between current Azure-specific algorithms and OpenTelemetry without breaking existing customers. We want consensus on this approach going forward so we can make such decisions today.
The ultimate goal is to have this mechanism in vanilla OTel and avoid shipping Azure-specific packages (except exporters). It also seems generally useful for interop and sampling-orchestration scenarios.
So I'd like to move the OTEP forward; I can mark it as post-GA if that helps. Certainly, the implementation can wait till post-GA.

@lmolkova (Contributor, Author) commented:

@specs-trace-approvers what could be the next steps to move this spec forward?

@Oberon00 (Member) commented:

I think we should definitely do open-telemetry/opentelemetry-specification#856 before GA, but I suspect that most don't want to take the time to read and understand the whole OTEP before GA.
On the other hand, there were some worries voiced in yesterday's SIG meeting that the sampling SDK may not be in a GA-ready state yet, because no one has tried writing custom samplers yet. So this OTEP might be a good chance to test the sampling SDK.

CC @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers

@Oberon00 (Member) commented:

@lmolkova You could maybe move this forward by making a spec PR for open-telemetry/opentelemetry-specification#856. Then after that is merged, you can make the OTEP a bit shorter 😃

@bogdandrutu (Member) left a comment

Overall I am OK with the direction of this OTEP.

tigrannajaryan pushed a commit to open-telemetry/opentelemetry-specification that referenced this pull request Sep 25, 2020
Fixes #856

## Changes
Added `Tracestate` to `SamplingResult`

Related [oteps](https://github.com/open-telemetry/oteps) open-telemetry/oteps#135
`tracestate` so downstream services can re-use it to make their sampling
decisions *instead of* re-calculating score as a function of trace-id
(or trace-flags). This allows to configure sampling algorithm on the first
service ans avoid coordination of algorithms when multiple tracing tools are

Typo: ans -> and

@ericmustin commented:

I was wondering what the status of this is? @lmolkova @bogdandrutu, is it blocked by #148?

@jmacd (Contributor) commented May 11, 2021

@ericmustin I remember the history of this thread, but I believe @lmolkova has moved on from OpenTelemetry.

The conclusion from #148 is that encoding inclusion probability (how we count sampled events) is different from how we ensure that traces are complete when sampled by some scheme. OTel's current specification includes only one scheme for ensuring that sampled traces are complete: the W3C "is-sampled" flag (part of TraceFlags, in SpanContext).

See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spancontext

May we close this issue?

@ericmustin commented:

@jmacd gotcha ty for the followup. makes sense, I’ll do my bike shedding(haha jk...unless?) over on #148 , looks like there’s some thoughtful work on inclusion probability there that would address the use case I had in mind.

@oertl (Contributor) commented May 12, 2021

@jmacd I prefer to keep this issue open. Propagating the sample rate or adjusted count from the root span, where the sampling decision was made, allows setting the adjusted count for a span properly, as already mentioned in #148 (comment). This would allow extrapolation of individual spans without having to retrieve this information from the root span.

@jmacd (Contributor) commented May 12, 2021

@oertl OK, let's keep discussing. My understanding of the proposal in this PR is that it does not propagate inclusion probability. I am definitely in favor of solutions for propagating inclusion probability. The proposal in this PR talks about using an explicit random number for coordinating different sampling schemes without addressing inclusion probability. I'm not familiar with the system @lmolkova describes here.

@oertl (Contributor) commented May 17, 2021

@jmacd, you are right, this is a different issue. However, it tackles a still-unsolved problem with regard to sampling.

The trace-ID-ratio-based sampler only works properly if the trace ID is uniformly distributed, which is, AFAIK, not well specified. To get sampling right when you do not know the distribution of the trace ID, and only know that it is unique, you need to hash the trace ID. A high-quality hash function transforms the trace ID into a uniformly distributed hash value, which can then be used for ratio-based sampling. A list of fast, non-cryptographic, high-quality hash functions can be found at https://github.com/rurban/smhasher#summary. Apart from the overhead, a further problem is that the hash function needs to be implemented consistently across all supported languages.
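A small sketch of why the hashing matters: if trace IDs are not uniformly distributed (say, sequential), ratio sampling directly on the ID gives a wildly wrong rate, while hashing first restores a uniform score. SHA-256 stands in here for the fast non-cryptographic hashes linked above; the 64-bit IDs and function names are illustrative:

```python
import hashlib


def ratio_sample_raw(trace_id: int, probability: float) -> bool:
    # Treats the (64-bit) trace id itself as the score; only correct
    # if trace ids are uniformly distributed over the 64-bit range.
    return trace_id / 2**64 < probability


def ratio_sample_hashed(trace_id: int, probability: float) -> bool:
    # Hash first, so even sequential trace ids yield a uniform score.
    digest = hashlib.sha256(trace_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < probability


# Sequential (non-uniform) trace ids with a 10% target rate:
ids = range(100_000)
raw = sum(ratio_sample_raw(i, 0.1) for i in ids)        # samples all 100,000
hashed = sum(ratio_sample_hashed(i, 0.1) for i in ids)  # close to 10,000
```

The raw variant keeps every sequential ID (they are all tiny fractions of 2^64), while the hashed variant lands near the intended 10%.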

To save hashing costs, the root span may calculate the hash value and propagate it with the trace ID, so that child spans do not need to hash the trace ID again and again. If the hash value is calculated only once at the root span and never again for child spans (because they always use the precalculated value from the parent), the hash value computed at the root could be replaced by any random number. This is exactly what is proposed here, where the generated random value is called the sampling score. It is essentially nothing other than a secondary trace ID that is uniformly distributed.

By the way, if ratio-based sampling is restricted to sample rates that are powers of one half, as proposed (see #148 (comment)), it would be sufficient to propagate just the number of leading zeros of the sampling score, because that is the only information needed for the sampling decision in this case.

@lmolkova (Contributor, Author) commented:

I'm going to close this OTEP as I'm not working on sampling anymore, and it doesn't look like there is much interest. If anyone is interested, feel free to take any parts of it; I'm happy to share any context.

lmolkova closed this Sep 16, 2021

@jmacd (Contributor) commented Sep 16, 2021

The goals of this OTEP are, I believe, addressed in #168.
