
Introduce sampling score and propagate it with the trace #135

Closed
wants to merge 26 commits

Conversation

@lmolkova (Contributor) commented Aug 21, 2020

This is a reincarnation of #107.

The spec describes consistent sampling between existing tracing tools that use vendor-specific sampling algorithms and OTel, and enables an upgrade path that vendors can use to onboard customers onto OTel without breaking cross-tracing-tool traces.

The delta OTEP introduces is relatively small:

  1. Add a `SamplingResult.Tracestate` field: a sampler should be able to assign a
    new tracestate for the to-be-created span.
  2. Add a convention for a `sampling.score` attribute on the span (let's discuss attribute vs. field).
    [Update after review]
  3. Add the notion of a `SamplingScoreGenerator` that is capable of calculating a float score from sampling parameters.
    It has `TraceIdRatioGenerator`, `RandomGenerator`, and possibly other implementations.
    • Change the `TraceIdRatioBased` sampler to use the corresponding generator and serve as a generic probability sampler with a configurable score-generation approach.
  4. Add an `ExternalScoreSampler` implementation of `Sampler`. It is created with a probability value and an implementation of `SamplingScoreGenerator`.
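To make the moving parts concrete, here is a rough Python sketch of the four items above. The names (`SamplingScoreGenerator`, `TraceIdRatioGenerator`, `RandomGenerator`, `ExternalScoreSampler`, `sampling.score`) come from the OTEP itself, but the exact signatures and the dict-based tracestate are illustrative assumptions, not the spec's API:

```python
import abc
import random


class SamplingScoreGenerator(abc.ABC):
    """Item 3: calculates a float score in [0, 1) from sampling parameters."""

    @abc.abstractmethod
    def score(self, trace_id: bytes) -> float: ...


class TraceIdRatioGenerator(SamplingScoreGenerator):
    def score(self, trace_id: bytes) -> float:
        # Deterministic score from the lower 8 bytes of the trace id,
        # roughly what TraceIdRatioBased effectively does today.
        return int.from_bytes(trace_id[8:16], "big") / 2**64


class RandomGenerator(SamplingScoreGenerator):
    def score(self, trace_id: bytes) -> float:
        return random.random()


class SamplingResult:
    # Item 1: the new Tracestate field, so a sampler can hand back
    # a modified tracestate for the to-be-created span.
    def __init__(self, sampled: bool, attributes=None, tracestate=None):
        self.sampled = sampled
        self.attributes = attributes or {}
        self.tracestate = tracestate


class ExternalScoreSampler:
    """Item 4: samples on a score found in tracestate, else generates one."""

    def __init__(self, probability: float, generator: SamplingScoreGenerator):
        self.probability = probability
        self.generator = generator

    def should_sample(self, trace_id: bytes, tracestate: dict) -> SamplingResult:
        score = tracestate.get("score")
        if score is None:
            score = self.generator.score(trace_id)
            tracestate = {**tracestate, "score": score}
        # Item 2: expose the score as a span attribute.
        return SamplingResult(
            float(score) < self.probability,
            {"sampling.score": float(score)},
            tracestate,
        )
```

A downstream span started with the returned tracestate reuses the same score, which is the whole interop point.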

With this OTEP we are trying to come up with a long-term plan for interop between Microsoft SDKs and services and OpenTelemetry-enabled apps. We want to make sure this solution exists; the implementation can wait.

The specification changes can wait until post-GA; however, I've heard several requests for item 1 (the ability to modify tracestate from a Sampler) before GA - open-telemetry/opentelemetry-specification#856.

@bogdandrutu (Member) left a comment

I like the overall direction of this, and one idea from this OTEP inspired me.

We could have only one "ProbabilitySampler" that allows configuring how the score is calculated:

  • using a deterministic hash, as the current TraceIdRatioBased does;
  • using a probability score generated at the root of the trace and propagated via tracestate.

What I would do differently:

  • I would encourage rewriting the initial motivation part to clarify why using the trace ID as the source of the score is not good enough.
  • I propose having a "ProbabilitySampler" that allows customizing how the score is calculated and propagated (supporting TraceIdRatioBased as well as the newly proposed way).

Comment on lines 50 to 52
Score is also exposed through span attributes. Vendors can leverage it
to sort traces based on their completeness: the lower the value of the score,
the higher the chance it was sampled by each component.
A member left a comment:

Do we always need to have a score? How do we calculate the score if a custom sampler like AlwaysOn is used? How does the score interact with custom samplers that do rate-based sampling (one trace every second)? How does the backend know whether the score was used for the sampling decision or just ignored?

@lmolkova (Contributor, Author) commented Sep 2, 2020

We don't always need to have a score. We need it when there are multiple possible sampling algorithms in the same app.
An example where we need it:

Service A uses Azure Monitor SDK, uses hash1(trace-id), assigns score  -> Service B uses OTel with hash2(trace-id)
                                                                       -> Service C uses Azure Monitor with hash1(trace-id)

Services A and C have used the Azure Monitor SDK for years, and it's unreasonable to ask them to upgrade to OTel.
It is reasonable to ask them to update to the new version of Azure Monitor that supports the score.

We don't need the score when only OTel (or the same algorithm) is used everywhere.

So, we'll tell our customers to configure `ExternalScoreSampler` and fall back to the OTel `TraceIdRatioBased` algorithm if the score is not in the tracestate. The default case for OTel users does not change; this is opt-in behavior.

Custom samplers that sample one trace every second are not compatible with the score - they don't pursue the consistent-sampling goal anyway, and applying them together with `ExternalScoreSampler` seems to be a misconfiguration.
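The A/B/C scenario above can be sketched in a few lines of Python. `hash1` and `hash2` are hypothetical stand-ins for the vendor-specific hashes (nothing here is the real Azure Monitor or OTel algorithm); the point is only that once service A writes the score into tracestate, downstream decisions agree regardless of the local hash:

```python
import hashlib


def hash1(trace_id: str) -> float:
    """Hypothetical stand-in for the legacy Azure Monitor hash."""
    return int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 2**32


def hash2(trace_id: str) -> float:
    """Hypothetical stand-in for OTel's TraceIdRatioBased hash."""
    return int(hashlib.md5(trace_id.encode()).hexdigest()[:8], 16) / 2**32


def sample(probability, tracestate, local_hash, trace_id):
    # Use the score from tracestate if an upstream service set it;
    # otherwise fall back to the local (vendor-specific) hash.
    score = tracestate.get("score")
    if score is None:
        score = local_hash(trace_id)
        tracestate = {**tracestate, "score": score}
    return score < probability, tracestate


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
p = 0.5

# Service A (hash1) makes the decision and propagates the score:
decision_a, ts = sample(p, {}, hash1, trace_id)
# Services B (hash2) and C (hash1) reuse the propagated score, so all
# three decisions agree even though their local hashes differ:
decision_b, _ = sample(p, ts, hash2, trace_id)
decision_c, _ = sample(p, ts, hash1, trace_id)
assert decision_a == decision_b == decision_c
```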

Comment on lines 165 to 166
- if it's not there, invokes `ProbabilitySampler`, which calculates score
and populates it on the attributes
A member commented:

I like this.

lmolkova (Contributor, Author) replied:

So I've tried to follow this approach in the new version. To make it clean, I suggest separating score generation from sampling.

Samplers can use score generation, and can also attach attributes and change the tracestate.
If we allow falling back to another sampler:

  • we need to propagate the score back from one sampler to the `ExternalScore` one
  • we need to coordinate possibly multiple changes of tracestate and attributes

If we have a layer responsible for score calculation (random or deterministic of any sort), we can make it much simpler and cleaner.
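One way to read the "layer responsible for score calculation" idea: the SDK resolves the score exactly once per span start, and every sampler in the chain (including fallbacks) receives that same value, so no score back-propagation or tracestate coordination between samplers is needed. A rough sketch under that assumption; all names here are illustrative, not proposed API:

```python
def compute_score(tracestate, generator, trace_id):
    # Score layer: resolve the score exactly once per span start,
    # preferring a score propagated from upstream.
    score = tracestate.get("score")
    if score is None:
        score = generator(trace_id)
    return score


def start_span(trace_id, tracestate, samplers, generator):
    score = compute_score(tracestate, generator, trace_id)
    for sampler in samplers:
        decision = sampler(score)  # every sampler sees the same score
        if decision is not None:   # None means "defer to the next sampler"
            return decision, {**tracestate, "score": score}
    return False, tracestate


# Two samplers sharing one score: a 10% sampler that defers to a 50% fallback.
primary = lambda s: True if s < 0.1 else None
fallback = lambda s: s < 0.5

decision, ts = start_span("abc", {}, [primary, fallback], lambda _: 0.3)
assert decision is True and ts["score"] == 0.3
```

Because the score is computed before the chain runs, the fallback sampler never needs to hand anything back to the primary one.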

@lmolkova (Contributor, Author) commented Sep 2, 2020

Thanks for the review, @bogdandrutu. I believe I addressed your questions/comments in the spec.

@Oberon00 (Member) commented Sep 3, 2020

I think

  1. Add SamplingResult.Tracestate field: sampler should be able to assign a
    new tracestate for to-be-created span

which you already created open-telemetry/opentelemetry-specification#856 for, should be done separately. That change is IMHO the most important one (it is breaking in many interface-based languages); it is also the simplest one, and the other parts can be added later.

@lmolkova (Contributor, Author) commented Sep 3, 2020

I think

  1. Add SamplingResult.Tracestate field: sampler should be able to assign a
    new tracestate for to-be-created span

which you already created open-telemetry/opentelemetry-specification#856 for, should be done separately. That change is IMHO the most important one (it is breaking in many interface-based languages); it is also the simplest one, and the other parts can be added later.

I agree that the other parts could be added later implementation-wise.
At the same time, we are making decisions now on how to support sampling interop between current Azure-specific algorithms and OpenTelemetry without breaking existing customers. We want consensus on this approach going forward so we can make such decisions today.
The ultimate goal is to have this mechanism in vanilla OTel and avoid shipping Azure-specific packages (except exporters). It also seems generally useful for interop and sampling-orchestration scenarios.
So I'd like to move the OTEP forward; I can mark it as post-GA if that helps. Certainly, the implementation can wait till post-GA.

@lmolkova (Contributor, Author) commented:

@specs-trace-approvers what could be the next steps to move this spec forward?

@Oberon00 (Member) commented:

I think we should definitely do open-telemetry/opentelemetry-specification#856 before GA, but I suspect that most don't want to take the time to read and understand the whole OTEP before GA.
On the other hand, there were some worries voiced in yesterday's SIG meeting that the sampling SDK may not be in a GA-ready state yet, because no one has tried writing custom samplers yet. So this OTEP might be a good chance to test the sampling SDK.

CC @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers

@Oberon00 (Member) commented:

@lmolkova You could maybe move this forward by making a spec PR for open-telemetry/opentelemetry-specification#856. Then after that is merged, you can make the OTEP a bit shorter 😃

@bogdandrutu (Member) left a comment

Overall I am OK with the direction of this OTEP.

tigrannajaryan pushed a commit to open-telemetry/opentelemetry-specification that referenced this pull request Sep 25, 2020
Fixes #856

## Changes
Added `Tracestate` to `SamplingResult`

Related [oteps](https://github.com/open-telemetry/oteps) open-telemetry/oteps#135
`tracestate` so downstream services can re-use it to make their sampling
decisions *instead of* re-calculating score as a function of trace-id
(or trace-flags). This allows to configure sampling algorithm on the first
service ans avoid coordination of algorithms when multiple tracing tools are

Typo: ans -> and

@ericmustin commented:

I was wondering what the status of this is? @lmolkova @bogdandrutu, is it blocked by #148?

@jmacd (Contributor) commented May 11, 2021

@ericmustin I remember the history of this thread, but I believe @lmolkova has moved on from OpenTelemetry.

The conclusion from #148 is that encoding inclusion probability (how we count sampled events) is different from how we ensure that traces are complete when sampled by some scheme. OTel's current specification includes only one scheme for ensuring that sampled traces are complete: the W3C "is-sampled" flag (part of TraceFlags, in SpanContext).

See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spancontext

May we close this issue?

@ericmustin commented:

@jmacd gotcha ty for the followup. makes sense, I’ll do my bike shedding(haha jk...unless?) over on #148 , looks like there’s some thoughtful work on inclusion probability there that would address the use case I had in mind.

@oertl (Contributor) commented May 12, 2021

@jmacd I prefer to keep this issue open. Propagating the sample rate or adjusted count from the root span, where the sampling decision was made, allows setting the adjusted count for a span properly, as already mentioned in #148 (comment). This would allow extrapolation of individual spans without having to retrieve this information from the root span.

@jmacd (Contributor) commented May 12, 2021

@oertl OK, let's keep discussing. My understanding of the proposal in this PR is that it does not propagate inclusion probability. I am definitely in favor of solutions for propagating inclusion probability. The proposal in this PR talks about using an explicit random number for coordinating different sampling schemes without addressing inclusion probability. I'm not familiar with the system @lmolkova describes here.

@oertl (Contributor) commented May 17, 2021

@jmacd, you are right, this is a different issue. However, it tackles a still-unsolved problem with regard to sampling.

The trace-ID-ratio-based sampler only works properly if the trace ID is uniformly distributed, which is, AFAIK, not well specified. To get sampling right when you do not know the distribution of the trace ID, and only know that it is unique, you need to hash the trace ID. A high-quality hash function transforms the trace ID into a uniformly distributed hash value, which can then be used for ratio-based sampling. A list of fast, non-cryptographic, high-quality hash functions can be found at https://github.com/rurban/smhasher#summary. Apart from the overhead, a further problem is that the hash function needs to be implemented consistently across all supported languages.
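A small sketch of why the hashing matters: if trace IDs are not uniformly distributed (say, sequential), ratio sampling directly on the ID gives a wildly wrong rate, while hashing first restores a uniform score. SHA-256 stands in here for the fast non-cryptographic hashes linked above; the 64-bit IDs and function names are illustrative:

```python
import hashlib


def ratio_sample_raw(trace_id: int, probability: float) -> bool:
    # Treats the (64-bit) trace id itself as the score; only correct
    # if trace ids are uniformly distributed over the 64-bit range.
    return trace_id / 2**64 < probability


def ratio_sample_hashed(trace_id: int, probability: float) -> bool:
    # Hash first, so even sequential trace ids yield a uniform score.
    digest = hashlib.sha256(trace_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < probability


# Sequential (non-uniform) trace ids with a 10% target rate:
ids = range(100_000)
raw = sum(ratio_sample_raw(i, 0.1) for i in ids)        # samples all 100,000
hashed = sum(ratio_sample_hashed(i, 0.1) for i in ids)  # close to 10,000
```

The raw variant keeps every sequential ID (they are all tiny fractions of 2^64), while the hashed variant lands near the intended 10%.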

To save hashing costs, the root span may calculate the hash value and propagate it with the trace ID, so that child spans do not need to hash the trace ID again and again. If the hash value is calculated only once at the root span and never again for child spans (because they always use the precalculated value from the parent), the hash value computed at the root could be replaced by any random number. This is exactly what is proposed here, where the generated random value is called the sampling score. It is essentially nothing other than a secondary trace ID that is uniformly distributed.

By the way, if ratio-based sampling is restricted to sample rates that are powers of one half, as proposed (see #148 (comment)), it would be sufficient to propagate just the number of leading zeros of the sampling score, because that is the only information needed for the sampling decision in this case.

@lmolkova (Contributor, Author) commented:

I'm going to close this OTEP as I'm not working on sampling anymore, and it doesn't look like there is much interest. If anyone is interested, feel free to take any parts of it; I'm happy to share any context.

lmolkova closed this Sep 16, 2021

@jmacd (Contributor) commented Sep 16, 2021

The goals of this OTEP are, I believe, addressed in #168.
