Skip to content
This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Proposal to separate context propagation from observability #42

Closed
wants to merge 36 commits into from

Conversation

tedsuo
Copy link
Contributor

@tedsuo tedsuo commented Sep 10, 2019

This proposal provides a high level description of OpenTelemetry, which shows how context propagation, observability, and baggage could be cleanly decoupled.

A couple notes on terminology:

  • the terms "distributed application" and "application layer" used in this document refer to systems which leverage the context propagation layer, like observability and baggage. They do not refer to the "user applications" these systems are embedded in.
  • In this doc, I refer to TagMap as Correlations and labels. There can be a separate discussion for what to name this, but since Tag has repeatedly come up as a contentious name for this feature, I tried to use something "similar but different" to avoid this debate. I actually feel that the term Correlation contains a more accurate meaning than Tag, so it may be an improvement.
  • TTL is referred to as hoplimit, to reflect the clearer name of this feature in IPv6. It is represented as enum rather than an integer, since only two values are defined.

The goal with this spec proposal is to find a way to describe these public APIs as simple as possible, but no simpler. This required touching on most of the public APIs for OpenTelemetry. This description is meant to be high level, but not inaccurate. If a detail appears to be missing, it is intentional - presumably handled at the SDK layer. Possibly, it should be present. Please review this document with this in mind.

@tedsuo
Copy link
Contributor Author

tedsuo commented Sep 10, 2019

This proposal relates to #37 and #36.

text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved
text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved
text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved
text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved

Because the application and context propagation layers are separated, it is possible to create new distributed applications which do not depend on either the Observability or Baggage APIs.

**GetPropagator(type) inject, extract**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the text that follows, I expected a RegisterPropagatator API here. Somehow this should reference the actual registration function described later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the propagation layer provides the RegisterPropagator function. Applications provide, and applications provide the Propagators to be registered.

Is there a better way to explain this in the GetPropagator description? Right now it says "To register with the propagation system, the [BLANK] API provides a set of propagation functions for every propagation type."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the Registry concept, hopefully this is clearer.

text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved
text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved
text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved
text/0000-separate-context-propagation.md Outdated Show resolved Hide resolved

## What about complex propagation behavior?

Some OpenTelemetry proposals have called for more complex propagation behavior. For example, having a fallback to extracting B3 headersif Trace-Context headers are not found. Chained propagators and other complex behavior can be modeled as implementation details behind the Propagator interface. Therefore, the propagation system itself does not need to provide chained propagators or other additional facilities.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the propagation system itself does not need to provide chained propagators

So what happens when multiple propagators are registered for the same type? I think chained propagators may be cumbersome to provide on top of an API that allows only one propagator, since it requires cooperation between everyone who wants to add a propagator to the chain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen code that does this, by using a stack-alike Propagator that tries to inject/extract from a list of propagators (order is important) - for extraction, it returns with the first successful attempt, and for injection it simply tries to inject all formats. And yes, we should have a test case for this scenario, so we know where work fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, here is the nuance: there is basic chaining provided at the API level by RegisterPropagator. All propagators are run for every type, in the order in which they are registered. This is where the cooperation happens between independent applications.

More complex propagation behavior usually ends up being specific to each application. So, the OTel SDK can provide the kind of fallback W3C -> B3 chaining for observability, described here. That sort of behavior is presented as a single, complex propagator, from the point of view of the Propagation API, since the fallback behavior is internal to a single application. Does that make sense?

@carlosalberto
Copy link
Contributor

One thing to keep in mind is that in some languages there's an implicitly current Context object. In Java you can fetch/copy the current Context, whereas in Python (both through contextvars and the current OpenTelemetry Python implementation) you can only either set values on it (thread-local alike) or copy the entire thing.

For this purpose, I'd imagine for some languages make Context and optional parameter, falling back to the current one.

Co-Authored-By: Christian Neumüller <christian+github@neumueller.me>
Some OpenTelemetry proposals have called for more complex propagation behavior. For example, having a fallback to extracting B3 headersif Trace-Context headers are not found. Chained propagators and other complex behavior can be modeled as implementation details behind the Propagator interface. Therefore, the propagation system itself does not need to provide chained propagators or other additional facilities.


## Did you add a context parameter to every API call because Go has infected your brain?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, is this just for fun?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, well I wanted to address what I perceive to be a common question, along the lines of "why is this context parameter everywhere? Is it because this is a golang project? Is it required that every language must pass context this way?"

But if the humor gets in the way of learning, I can change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. You can also stress that it's not just a Go thing - in the old versions of Node there was no CLS (or it was extremely inefficient) and explicitly passing the context was how the propagation was achieved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I will emphasize that this issue exists in multiple languages.

tedsuo and others added 4 commits September 10, 2019 13:39
Co-Authored-By: Christian Neumüller <christian+github@neumueller.me>
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
@tedsuo
Copy link
Contributor Author

tedsuo commented Oct 16, 2019

So, I am generally happy with where this is landing. But there are two major issues which must be worked out.

Named Tracers and Metrics

With the advent of named tracers, propagation now depends on having a handle to the "current" tracer. For example, propagation may be disabled, or the headers used may be changed, depending on the named tracer being used. At least, that is my read on how named tracers will be used.

Since propagation behavior now depends on which tracer instance is used, how a single inject call can inject multiple independent propagators needs to be thought about a bit more. Named tracers must be addressed as part of this context propagation change. From talking with @Oberon00, I think this is a big issue.

Active Span / Context Switching

Right now, context management occurs at the span level, by setting a current span active in a thread. If there is more than one kind of context which must track the execution flow, then context switching must be managed at the context level, not the span level. When a span is moved from one thread to another, or the current thread is swapping the active span because there is an async layer on top of it (python gevent, for example), the whole context must be moved, not just the span.

This makes plenty of logical sense, but I think it may be a big change in practice. I don't want to hand wave about it; I think we need to spike this in code in order to get a handle on what this would mean in practice.

OpenTelemetry is separated into an **application layer** and a **context propagation layer**. In this architecture, multiple distributed applications - including the observability and baggage systems provided by OpenTelemetry - share the same underlying context propagation system.


# Application Layer

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This RFC doesn't really explain how different applications would work, rather it goes into the implementation details of metrics and tracing "observability systems". It would be nice to have a better definition of what an application is in the application layer. Are all applications required to share a common interface?


Distributed tracing is an example of a cross-cutting concern, which requires non-local, transaction-level context propagation in order to execute correctly. Transaction-level context propagation can also be useful for other cross-cutting concerns, e.g., for security, versioning, and network switching. We refer to these types of cross-cutting concerns as **distributed applications**.

OpenTelemetry is separated into an **application layer** and a **context propagation layer**. In this architecture, multiple distributed applications - including the observability and baggage systems provided by OpenTelemetry - share the same underlying context propagation system.
Copy link
Member

@yurishkuro yurishkuro Oct 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

niiiiice

**GetHTTPExtractor() -> extractor**
To deserialize the state of the system sent from the the prior upstream process, the Baggage API provides a function which returns a HTTPExtract function.

**GetHTTPInjector() -> injector**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's misleading to combine injectors/extractors with the API for manipulating the baggage. I am not even sure there should be "get" methods for those - who would call that? The inter-process propagation layer (see my diagram above) sits below specific contexts, so they can register with it, but it shouldn't need to "call up"

To receive data injected by prior upstream processes, the Propagation API provides a function which takes a context and an HTTP request, and returns context which represents the state of the upstream system.

**ChainHTTPInjector(injector, injector) -> injector**
To allow multiple distributed applications to inject their context into the same request, the Propagation API provides a function which takes two injectors, and returns a single injector which calls the two original injectors in order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is irrelevant, the application does not need to know if context systems are chained or what not, it only needs to say "I want to inject context". This is a low-level implementation detail of the propagation later.


Baggage values, on the other hand, are explicitly added in order to be accessed by downstream by other application code. Therefore, Baggage Context must be readable, and reliably propagated in-band in order to accomplish this goal.

There may be cases where a key-value pair is propagated as a Correlation for observability and as a Baggage item for application-specific use. AB testing is one example of such use case. This would result in extra overhead, as the same key-value pair would be present in two separate headers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not clear why we're making this point. Can't we allow telemetry sub-systems to access baggage? The metrics exported can be configured to read AB testing labels from baggage, not just from correlations - seems preferable to transmitting the same data twice.


There may be cases where a key-value pair is propagated as a Correlation for observability and as a Baggage item for application-specific use. AB testing is one example of such use case. This would result in extra overhead, as the same key-value pair would be present in two separate headers.

Solving this issue is not worth having semantic confusion with dual purpose. However, because all observability functions take the complete context as input – and baggage is not sampled – it may still be possible to use baggage values as labels for observability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the first reference to sampling, and I find it highly confusing. Context is never sampled. Telemetry context may be dropped due to bandwidth limitations (which is not mentioned here), but that's completely different from sampling.

@tsloughter
Copy link
Member

tsloughter commented Nov 1, 2019

The namehopLimit makes it sound like an integer.

Why is it not just a boolean named propagate?

Edit: Nevermind, the hopLimit will be removed completely since only baggage propagates.

@tedsuo
Copy link
Contributor Author

tedsuo commented Nov 16, 2019

@yurishkuro @Oberon00 I've created a second draft of this proposal, based on all of this great feedback. Please find it here: #66

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.