-
Notifications
You must be signed in to change notification settings - Fork 164
Proposal to separate context propagation from observability #42
Conversation
|
||
Because the application and context propagation layers are separated, it is possible to create new distributed applications which do not depend on either the Observability or Baggage APIs. | ||
|
||
**GetPropagator(type) inject, extract** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the text that follows, I expected a RegisterPropagatator
API here. Somehow this should reference the actual registration function described later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the propagation layer provides the RegisterPropagator function. Applications provide, and applications provide the Propagators to be registered.
Is there a better way to explain this in the GetPropagator description? Right now it says "To register with the propagation system, the [BLANK] API provides a set of propagation functions for every propagation type."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the Registry concept, hopefully this is clearer.
|
||
## What about complex propagation behavior? | ||
|
||
Some OpenTelemetry proposals have called for more complex propagation behavior. For example, having a fallback to extracting B3 headersif Trace-Context headers are not found. Chained propagators and other complex behavior can be modeled as implementation details behind the Propagator interface. Therefore, the propagation system itself does not need to provide chained propagators or other additional facilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the propagation system itself does not need to provide chained propagators
So what happens when multiple propagators are registered for the same type? I think chained propagators may be cumbersome to provide on top of an API that allows only one propagator, since it requires cooperation between everyone who wants to add a propagator to the chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen code that does this, by using a stack-alike Propagator
that tries to inject/extract from a list of propagators (order is important) - for extraction, it returns with the first successful attempt, and for injection it simply tries to inject all formats. And yes, we should have a test case for this scenario, so we know where work fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, here is the nuance: there is basic chaining provided at the API level by RegisterPropagator. All propagators are run for every type, in the order in which they are registered. This is where the cooperation happens between independent applications.
More complex propagation behavior usually ends up being specific to each application. So, the OTel SDK can provide the kind of fallback W3C -> B3 chaining for observability, described here. That sort of behavior is presented as a single, complex propagator, from the point of view of the Propagation API, since the fallback behavior is internal to a single application. Does that make sense?
One thing to keep in mind is that in some languages there's an implicitly current For this purpose, I'd imagine for some languages make |
Some OpenTelemetry proposals have called for more complex propagation behavior. For example, having a fallback to extracting B3 headersif Trace-Context headers are not found. Chained propagators and other complex behavior can be modeled as implementation details behind the Propagator interface. Therefore, the propagation system itself does not need to provide chained propagators or other additional facilities. | ||
|
||
|
||
## Did you add a context parameter to every API call because Go has infected your brain? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, is this just for fun?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, well I wanted to address what I perceive to be a common question, along the lines of "why is this context parameter everywhere? Is it because this is a golang project? Is it required that every language must pass context this way?"
But if the humor gets in the way of learning, I can change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it. You can also stress that it's not just a Go thing - in the old versions of Node there was no CLS (or it was extremely inefficient) and explicitly passing the context was how the propagation was achieved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I will emphasize that this issue exists in multiple languages.
Co-Authored-By: Christian Neumüller <christian+github@neumueller.me>
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
Co-Authored-By: Christian Neumüller <christian+github@neumueller.me>
Co-Authored-By: Christian Neumüller <christian+github@neumueller.me>
So, I am generally happy with where this is landing. But there are two major issues which must be worked out. Named Tracers and MetricsWith the advent of named tracers, propagation now depends on having a handle to the "current" tracer. For example, propagation may be disabled, or the headers used may be changed, depending on the named tracer being used. At least, that is my read on how named tracers will be used. Since propagation behavior now depends on which tracer instance is used, how a single inject call can inject multiple independent propagators needs to be thought about a bit more. Named tracers must be addressed as part of this context propagation change. From talking with @Oberon00, I think this is a big issue. Active Span / Context SwitchingRight now, context management occurs at the span level, by setting a current span active in a thread. If there is more than one kind of context which must track the execution flow, then context switching must be managed at the context level, not the span level. When a span is moved from one thread to another, or the current thread is swapping the active span because there is an async layer on top of it (python gevent, for example), the whole context must be moved, not just the span. This makes plenty of logical sense, but I think it may be a big change in practice. I don't want to hand wave about it; I think we need to spike this in code in order to get a handle on what this would mean in practice. |
OpenTelemetry is separated into an **application layer** and a **context propagation layer**. In this architecture, multiple distributed applications - including the observability and baggage systems provided by OpenTelemetry - share the same underlying context propagation system. | ||
|
||
|
||
# Application Layer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This RFC doesn't really explain how different applications would work, rather it goes into the implementation details of metrics and tracing "observability systems". It would be nice to have a better definition of what an application is in the application layer. Are all applications required to share a common interface?
|
||
Distributed tracing is an example of a cross-cutting concern, which requires non-local, transaction-level context propagation in order to execute correctly. Transaction-level context propagation can also be useful for other cross-cutting concerns, e.g., for security, versioning, and network switching. We refer to these types of cross-cutting concerns as **distributed applications**. | ||
|
||
OpenTelemetry is separated into an **application layer** and a **context propagation layer**. In this architecture, multiple distributed applications - including the observability and baggage systems provided by OpenTelemetry - share the same underlying context propagation system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
niiiiice
**GetHTTPExtractor() -> extractor** | ||
To deserialize the state of the system sent from the the prior upstream process, the Baggage API provides a function which returns a HTTPExtract function. | ||
|
||
**GetHTTPInjector() -> injector** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's misleading to combine injectors/extractors with the API for manipulating the baggage. I am not even sure there should be "get" methods for those - who would call that? The inter-process propagation layer (see my diagram above) sits below specific contexts, so they can register with it, but it shouldn't need to "call up"
To receive data injected by prior upstream processes, the Propagation API provides a function which takes a context and an HTTP request, and returns context which represents the state of the upstream system. | ||
|
||
**ChainHTTPInjector(injector, injector) -> injector** | ||
To allow multiple distributed applications to inject their context into the same request, the Propagation API provides a function which takes two injectors, and returns a single injector which calls the two original injectors in order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is irrelevant, the application does not need to know if context systems are chained or what not, it only needs to say "I want to inject context". This is a low-level implementation detail of the propagation later.
|
||
Baggage values, on the other hand, are explicitly added in order to be accessed by downstream by other application code. Therefore, Baggage Context must be readable, and reliably propagated in-band in order to accomplish this goal. | ||
|
||
There may be cases where a key-value pair is propagated as a Correlation for observability and as a Baggage item for application-specific use. AB testing is one example of such use case. This would result in extra overhead, as the same key-value pair would be present in two separate headers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not clear why we're making this point. Can't we allow telemetry sub-systems to access baggage? The metrics exported can be configured to read AB testing labels from baggage, not just from correlations - seems preferable to transmitting the same data twice.
|
||
There may be cases where a key-value pair is propagated as a Correlation for observability and as a Baggage item for application-specific use. AB testing is one example of such use case. This would result in extra overhead, as the same key-value pair would be present in two separate headers. | ||
|
||
Solving this issue is not worth having semantic confusion with dual purpose. However, because all observability functions take the complete context as input – and baggage is not sampled – it may still be possible to use baggage values as labels for observability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the first reference to sampling, and I find it highly confusing. Context is never sampled. Telemetry context may be dropped due to bandwidth limitations (which is not mentioned here), but that's completely different from sampling.
The name Why is it not just a boolean named Edit: Nevermind, the |
@yurishkuro @Oberon00 I've created a second draft of this proposal, based on all of this great feedback. Please find it here: #66 |
This proposal provides a high level description of OpenTelemetry, which shows how context propagation, observability, and baggage could be cleanly decoupled.
A couple notes on terminology:
TagMap
asCorrelations
andlabels
. There can be a separate discussion for what to name this, but since Tag has repeatedly come up as a contentious name for this feature, I tried to use something "similar but different" to avoid this debate. I actually feel that the term Correlation contains a more accurate meaning than Tag, so it may be an improvement.TTL
is referred to ashoplimit
, to reflect the clearer name of this feature in IPv6. It is represented as enum rather than an integer, since only two values are defined.The goal with this spec proposal is to find a way to describe these public APIs as simple as possible, but no simpler. This required touching on most of the public APIs for OpenTelemetry. This description is meant to be high level, but not inaccurate. If a detail appears to be missing, it is intentional - presumably handled at the SDK layer. Possibly, it should be present. Please review this document with this in mind.