This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Proposal to separate context propagation from observability #42

Closed
wants to merge 36 commits into from
36 commits
dff8df9
Proposal to separate context propagation from observability
tedsuo Sep 8, 2019
5ad7d1c
cleanup description for Extract
tedsuo Sep 10, 2019
1dc3c7b
commas
tedsuo Sep 10, 2019
58248e6
Update text/0000-separate-context-propagation.md
tedsuo Sep 10, 2019
68cb0ba
RFC proposal: A layered approach to data formats
tedsuo Aug 13, 2019
3dc6a76
whitespace
tedsuo Aug 22, 2019
459435e
Capitalization
tedsuo Aug 22, 2019
c9c64f4
whitespace
tedsuo Aug 22, 2019
c3c7c24
CleanBaggage -> ClearBaggage
tedsuo Sep 10, 2019
4588096
move function descriptions to new line
tedsuo Sep 10, 2019
2d80dae
Add Optional subheader
tedsuo Sep 10, 2019
7a73210
cleanup rough edits
tedsuo Sep 10, 2019
0d8e41b
clean up advice on pre-existing context implementations
tedsuo Sep 10, 2019
aad5605
Better context descriptions
tedsuo Sep 10, 2019
4a930eb
remove data format file
tedsuo Sep 11, 2019
e1ef61f
remove git diff message
tedsuo Sep 11, 2019
f949435
improved code sytnax
tedsuo Sep 11, 2019
1cb155e
stop stuttering
tedsuo Sep 11, 2019
7b9e861
Update text/0000-separate-context-propagation.md
tedsuo Sep 11, 2019
07eb397
spacing
tedsuo Sep 11, 2019
0ebeb6c
Refine propagation
tedsuo Sep 25, 2019
147d6b0
Add RFC ID number from PR
tedsuo Oct 1, 2019
72d4651
remove RFC status line
tedsuo Oct 1, 2019
1472197
slight calrification for GetHTTPExtractor
tedsuo Oct 1, 2019
18a37d4
add global propagators
tedsuo Oct 1, 2019
7ea1834
Clean up motivation
tedsuo Oct 15, 2019
7317747
Clean up explanbation intro
tedsuo Oct 15, 2019
43ba8fd
Clarify context types
tedsuo Oct 15, 2019
d7d6f1c
Fix ChainHTTPInjector and ChainHTTPExtractor
tedsuo Oct 15, 2019
3a817a2
typo
tedsuo Oct 15, 2019
3381e0f
Reference Trace-Context, not just traceparent
tedsuo Oct 15, 2019
c15a107
Bagge context cleanup
tedsuo Oct 15, 2019
310e8d5
stronger language around context access
tedsuo Oct 15, 2019
f59fc27
Update text/0042-separate-context-propagation.md
tedsuo Oct 15, 2019
153b9aa
clean up tradeoffs
tedsuo Oct 15, 2019
f70855a
Update text/0042-separate-context-propagation.md
tedsuo Oct 15, 2019
217 changes: 217 additions & 0 deletions text/0042-separate-context-propagation.md
@@ -0,0 +1,217 @@
# Proposal: Separate Layer for Context Propagation

Design OpenTelemetry as a set of separate applications which operate on a shared context propagation mechanism.


## Motivation

Based on prior art, we know that fusing the observability system and the context propagation system together creates issues. Observability systems have special rules for propagating information, such as sampling. This can create difficulty for other systems, which may want to leverage the same context propagation mechanism, but have different rules and requirements regarding the data they are sending. The Baggage system within OpenTelemetry is one such example.

This RFC addresses the following topics:

**Separation of concerns**
* Remove the Tracer dependency from context propagation mechanisms.
* Handle user data (Baggage) and observability data (Correlations) separately.

**Extensibility**
* Allow users to create new applications for context propagation. For example: A/B testing, encrypted or authenticated data, and new, experimental forms of observability.

## Explanation

# OpenTelemetry Layered Architecture

![drawing](img/context_propagation_explanation.png)

Distributed tracing is an example of a cross-cutting concern, which requires non-local, transaction-level context propagation in order to execute correctly. Transaction-level context propagation can also be useful for other cross-cutting concerns, e.g., for security, versioning, and network switching. We refer to these types of cross-cutting concerns as **distributed applications**.

OpenTelemetry is separated into an **application layer** and a **context propagation layer**. In this architecture, multiple distributed applications - including the observability and baggage systems provided by OpenTelemetry - share the same underlying context propagation system.
Member

@yurishkuro yurishkuro Oct 18, 2019


Contributor Author


niiiiice



# Application Layer


This RFC doesn't really explain how different applications would work, rather it goes into the implementation details of metrics and tracing "observability systems". It would be nice to have a better definition of what an application is in the application layer. Are all applications required to share a common interface?


## Observability API

OpenTelemetry currently contains two observability systems - Tracing and Metrics – and may be extended over time. These separate systems are bound into a unified Observability API through sharing labels – a mechanism for correlating independent observations – and through sharing propagators.
Member


Are the propagators themselves really shared though? I think it is merely the API that signals propagation points that is shared. That might be more hair-splitting than is good for easy understandability though.

Contributor Author


I am thinking that the Observability system does not automatically have separate propagators for tracing and for metrics, but I see your point that Trace-Context and Correlation-Context are separate, and right now it's not clear if Tracing would use Correlation-Context.


**Observe(context, labels…, observations...) -> context**
Member


Some introducing sentence like "The following general forms of APIs exist:" would be good here. Although then the HTTP API's don't seem to fit, which really makes me wonder: What does this list of APIs actually enumerate? Or would it be better to reformat this like "The general form for all observability APIs is a function with the following signature: Observe(...). That is, it takes a Context, label keys, ...".

Contributor Author


Yeah, I should definitely remove the "all" here.

In this RFC, I'm trying to define the Context and Propagation APIs. I want it to be clear enough about how we intend to leverage the context and propagation APIs that we can catch any design errors in those APIs. If I enumerate all of the details of Tracing and Metrics, I've hauled in too much. If I define nothing but GetHTTPExtractor and GetHTTPInjector in the Observability API, that will be insufficient for review.

I am trying to show how DistributedContext can be cleanly split into Baggage and Correlations. So I feel like Correlate should be present, along with Observe. But, again, this is a general form. I will try to be clearer that these two definitions - Observe and Correlate - are more abstract than the rest. I would still like feedback to the tune of "even as a general form, these functions are not correct."

I suggest that the details about how exactly these changes will affect our Tracing and Metrics APIs should be worked out in specification PRs, not in an RFC.

The general form for all observability APIs is a function which takes a Context, label keys, and observations as input, and returns an updated Context.

**Correlate(context, label, value, hoplimit) -> context**
To set the label values used by all observations in the current transaction, the Observability API provides a function which takes a context, a label key, a value, and a hoplimit, and returns an updated context. If the hoplimit is set to NO_PROPAGATION, the label will only be available to observability functions in the same process. If the hoplimit is set to UNLIMITED_PROPAGATION, it will be available to all downstream services.
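
As a rough illustration of the hop-limit behavior described above, here is a minimal Python sketch. It assumes the context is a plain dictionary that is copied rather than mutated, and it only models the two hop-limit values named in the text. The constant values, the `CORRELATIONS_KEY` name, and the `local_labels` and `propagated_labels` helpers are hypothetical and are not defined by this RFC.

```python
# Hypothetical sentinel values: the RFC names these constants but does not define them.
NO_PROPAGATION = 0
UNLIMITED_PROPAGATION = -1

CORRELATIONS_KEY = "correlations"  # hypothetical context key


def correlate(context, label, value, hoplimit):
    """Return a new context in which `label` maps to (value, hoplimit)."""
    correlations = dict(context.get(CORRELATIONS_KEY, {}))
    correlations[label] = (value, hoplimit)
    return {**context, CORRELATIONS_KEY: correlations}


def local_labels(context):
    """Labels visible to observability functions in this process."""
    return {k: v for k, (v, _) in context.get(CORRELATIONS_KEY, {}).items()}


def propagated_labels(context):
    """Labels an injector would forward to the next downstream process."""
    return {
        k: v
        for k, (v, hoplimit) in context.get(CORRELATIONS_KEY, {}).items()
        if hoplimit != NO_PROPAGATION
    }


ctx = correlate({}, "region", "eu-west", UNLIMITED_PROPAGATION)
ctx = correlate(ctx, "debug_id", "abc123", NO_PROPAGATION)
```

In this sketch both labels are available locally, but only `region` would be handed to an injector.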

**GetHTTPExtractor() -> extractor**
Member


I am not sure of the value of calling out HTTP explicitly. There are plenty of practical examples where people use non-HTTP transports

Member


Although the "HTTP"-propagators can really be used to propagate over any protocol that supports ASCII key-value pairs as metadata/headers.

Contributor Author


My intent is just to show that I am not proposing a "generic" propagator: HTTP, BINARY, etc, are each handled by a separate function. I didn't call this out before and it confused some people into thinking we were going back to the OpenTracing way of doing things.

We could call it TEXTMAP instead of HTTP, or HTTPText. Don't really have an opinion on that front. HTTP is the shortest though. :)

Member


The notion of extractor/injector is irrelevant to the API for the application (we didn't have them in OpenTracing). Injector/extractor are only needed when registering them.

Member


Since these have the same name as their "Baggage" counterparts, it seems they'll have to be namespaced in some way. There is no explicit observability concept today though, only meters and tracers. Should there be a separate propagator API for each of these? If not, why does Baggage have a separate one?

Contributor Author


@Oberon00 I actually assumed that we would not have separate propagators for Metrics and Tracing as something in the API layer. So I was namespacing by splitting things into an Observability API and a Baggage API.

Member


there can be no separation in the propagation API for any propagators, not just metrics/tracing, but any custom/baggage propagators. The only thing the application needs to know is "I want to inject/extract context in this format".

To deserialize the state of the system sent from the prior upstream process, the Observability API provides a function which returns an HTTPExtract function.

**GetHTTPInjector() -> injector**
To serialize the current state of the observability system and send it to the next downstream process, the Observability API provides a function which returns an HTTPInject function.


## Baggage API

In addition to observability, OpenTelemetry provides a simple mechanism for propagating arbitrary data, called Baggage. This allows new distributed applications to be implemented without having to create new propagators.

To manage the state of a distributed application, the Baggage API provides a set of functions which read, write, and remove data.

**SetBaggage(context, key, value) -> context**
To record the distributed state of an application, the Baggage API provides a function which takes a context, a key, and a value as input, and returns an updated context which contains the new value.

**GetBaggage(context, key) -> value**
To access the distributed state of an application, the Baggage API provides a function which takes a context and a key as input, and returns a value.

**RemoveBaggage(context, key) -> context**
To delete distributed state from an application, the Baggage API provides a function which takes a context and a key as input, and returns an updated context which no longer contains a value for that key.

**ClearBaggage(context) -> context**
To avoid sending baggage to an untrusted downstream process, the Baggage API provides a function which removes all baggage from a context.
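
The four functions above could look roughly like the following Python sketch. It assumes the context is a plain dictionary that is never mutated in place, and that all baggage lives under a single context key; the `BAGGAGE_KEY` name and the choice to return `None` for a missing key are assumptions, not part of this RFC.

```python
BAGGAGE_KEY = "baggage"  # hypothetical: the single context key owned by the Baggage API


def set_baggage(context, key, value):
    """Return a new context whose baggage contains key -> value."""
    baggage = dict(context.get(BAGGAGE_KEY, {}))
    baggage[key] = value
    return {**context, BAGGAGE_KEY: baggage}


def get_baggage(context, key):
    """Return the baggage value for key, or None if it is absent."""
    return context.get(BAGGAGE_KEY, {}).get(key)


def remove_baggage(context, key):
    """Return a new context whose baggage no longer contains key."""
    baggage = {k: v for k, v in context.get(BAGGAGE_KEY, {}).items() if k != key}
    return {**context, BAGGAGE_KEY: baggage}


def clear_baggage(context):
    """Return a new context with all baggage dropped, keeping other applications' data."""
    return {k: v for k, v in context.items() if k != BAGGAGE_KEY}


root = {"other_app": "state"}
with_experiment = set_baggage(root, "experiment", "B")
without_experiment = remove_baggage(with_experiment, "experiment")
```

Because every function returns a new context, `with_experiment` still carries the value after `remove_baggage` is called, which is the behavior the `-> context` signatures suggest.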

**GetHTTPExtractor() -> extractor**
To deserialize the state of the system sent from the prior upstream process, the Baggage API provides a function which returns an HTTPExtract function.

**GetHTTPInjector() -> injector**
Member


I think it's misleading to combine injectors/extractors with the API for manipulating the baggage. I am not even sure there should be "get" methods for those - who would call that? The inter-process propagation layer (see my diagram above) sits below specific contexts, so they can register with it, but it shouldn't need to "call up"

To serialize the current state of the system and send it to the next downstream process, the Baggage API provides a function which returns an HTTPInject function.


## Additional APIs

Because the application and context propagation layers are separated, it is possible to create new distributed applications which do not depend on either the Observability or Baggage APIs.

**GetHTTPExtractor() -> extractor**
To deserialize the state of the system in the prior upstream process, all additional APIs provide a function which returns an HTTPExtract function.

**GetHTTPInjector() -> injector**
To serialize the current state of the system and send it to the next downstream process, all additional APIs provide a function which returns an HTTPInject function.


# Context Propagation Layer

## Context API

Distributed applications access data in-process using a shared context object. Each distributed application sets a single key in the context, containing all of the data for that system.
Member


I.e., value can be an arbitrary object, instead of just a scalar value.


**SetValue(context, key, value) -> context**
To record the local state of an application, the Context API provides a function which takes a context, a key, and a value as input, and returns an updated context which contains the new value.

**GetValue(context, key) -> value**
To access the local state of an application, the Context API provides a function which takes a context and a key as input, and returns a value.
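
A minimal Python sketch of these two functions follows, assuming an immutable-by-convention `Context` class and one opaque key object per distributed application. The class, the `default` parameter, and the key names are hypothetical and only illustrate the "single key per application" idea.

```python
class Context:
    """Hypothetical context object: a read-only bag of per-application values."""

    def __init__(self, values=None):
        self._values = dict(values or {})


def set_value(context, key, value):
    """Return a new Context containing key -> value; the input is left unchanged."""
    values = dict(context._values)
    values[key] = value
    return Context(values)


def get_value(context, key, default=None):
    """Return the value stored under key, or default if the key is absent."""
    return context._values.get(key, default)


# Each distributed application owns exactly one key (hypothetical names).
BAGGAGE_APP_KEY = object()
TRACING_APP_KEY = object()

root = Context()
ctx = set_value(root, BAGGAGE_APP_KEY, {"tenant": "acme"})
ctx = set_value(ctx, TRACING_APP_KEY, {"trace_id": "abc"})
```

Using distinct key objects rather than strings is one way to keep applications from accidentally reading or overwriting each other's state.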

### Optional: Automated Context Management
When possible, the OpenTelemetry context should automatically be associated with the program execution context. Note that some languages do not provide any facility for setting and getting a current context. In these cases, the user is responsible for managing the current context.

**SetCurrent(context)**
To associate a context with program execution, the Context API provides a function which takes a Context.

**GetCurrent() -> context**
To access the context associated with program execution, the Context API provides a function which takes no arguments and returns a Context.
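
In Python, one plausible way to back these functions is the standard library's `contextvars` module. The sketch below is only an assumption about how a Python implementation might do it; the variable name and the empty-dictionary fallback are hypothetical.

```python
import contextvars

# Holds the OpenTelemetry-style context for the current thread or async task.
_CURRENT = contextvars.ContextVar("current_context", default=None)


def set_current(context):
    """Associate context with the current program execution."""
    _CURRENT.set(context)


def get_current():
    """Return the context associated with program execution (empty if unset)."""
    context = _CURRENT.get()
    return {} if context is None else context


def worker():
    # Changes made here stay inside the copied execution context.
    set_current({"request": "inner"})
    return get_current()


set_current({"request": "outer"})
inner = contextvars.copy_context().run(worker)
outer = get_current()
```

Running `worker` inside a copied execution context shows that the inner `set_current` call does not leak back to the caller.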


## Propagation API

Distributed applications propagate their state to downstream processes via propagators: injector and extractor functions which write application context into RPC requests and read it back out. Each distributed application creates a set of propagators for every type of supported medium - currently only HTTP requests.

**HTTPInject(context, request)**
To send the data for all distributed applications downstream to the next process, the Propagation API provides a function which takes a context and an HTTP request, and mutates the HTTP request to include an HTTP Header representation of the context.

**HTTPExtract(context, request) -> context**
To receive data injected by prior upstream processes, the Propagation API provides a function which takes a context and an HTTP request, and returns a context which represents the state of the upstream system.
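
To make the inject and extract signatures concrete, here is a Python sketch of a propagator pair for a single application (baggage). It assumes the request is a dictionary with a `"headers"` mapping and the context is a plain dictionary. The header name `x-example-baggage` and the comma-separated, percent-encoded wire format are made up for illustration; no such header is defined by this RFC.

```python
from urllib.parse import quote, unquote

BAGGAGE_KEY = "baggage"               # hypothetical context key
BAGGAGE_HEADER = "x-example-baggage"  # hypothetical header name


def http_inject(context, request):
    """Mutate request so its headers carry the baggage held in context."""
    baggage = context.get(BAGGAGE_KEY, {})
    if baggage:
        request["headers"][BAGGAGE_HEADER] = ",".join(
            f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in baggage.items()
        )


def http_extract(context, request):
    """Return a new context holding any baggage found in the request headers."""
    raw = request["headers"].get(BAGGAGE_HEADER, "")
    baggage = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep:  # skip empty or malformed entries
            baggage[unquote(key)] = unquote(value)
    return {**context, BAGGAGE_KEY: baggage} if baggage else context


outgoing = {"headers": {}}
http_inject({BAGGAGE_KEY: {"tenant": "a,b=c"}}, outgoing)
received = http_extract({}, outgoing)
```

Percent-encoding keys and values keeps the `,` and `=` separators unambiguous, so a value such as `a,b=c` survives the round trip.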

**ChainHTTPInjector(injector, injector) -> injector**
To allow multiple distributed applications to inject their context into the same request, the Propagation API provides a function which takes two injectors, and returns a single injector which calls the two original injectors in order.
Member


this is irrelevant, the application does not need to know if context systems are chained or what not, it only needs to say "I want to inject context". This is a low-level implementation detail of the propagation layer.


**ChainHTTPExtractor(extractor, extractor) -> extractor**
To allow multiple distributed applications to extract their context from the same request, the Propagation API provides a function which takes two extractors, and returns a single extractor which calls the two original extractors in order.
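
The two chaining functions can be sketched in a few lines of Python. This assumes injectors have the shape `(context, request)` and extractors have the shape `(context, request) -> context`, as in the definitions above; the toy propagators and header names used for the demonstration are hypothetical.

```python
def chain_http_injector(first, second):
    """Return an injector that calls first, then second, on the same request."""
    def injector(context, request):
        first(context, request)
        second(context, request)
    return injector


def chain_http_extractor(first, second):
    """Return an extractor that feeds the context from first into second."""
    def extractor(context, request):
        return second(first(context, request), request)
    return extractor


# Toy propagators for two independent applications (hypothetical headers).
def inject_trace(context, request):
    request["headers"]["x-example-trace"] = context["trace"]

def inject_tenant(context, request):
    request["headers"]["x-example-tenant"] = context["tenant"]

def extract_trace(context, request):
    return {**context, "trace": request["headers"]["x-example-trace"]}

def extract_tenant(context, request):
    return {**context, "tenant": request["headers"]["x-example-tenant"]}


inject = chain_http_injector(inject_trace, inject_tenant)
extract = chain_http_extractor(extract_trace, extract_tenant)

request = {"headers": {}}
inject({"trace": "t1", "tenant": "acme"}, request)
restored = extract({}, request)
```

Since a chained propagator has the same shape as its parts, longer chains can be built by chaining the result again.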


### Optional: Global Propagators
Member


Please avoid global state. It can always be added later, but cannot be removed.

Member


I guess most languages already crossed the Rubicon here.

Contributor Author


I tried to make it optional, but I'm now realizing this has to be rethought, given named tracers.

It is often convenient to create a chain of propagators during program initialization, and then access these combined propagators later in the program. To facilitate this, global injectors and extractors are optionally available. However, there is no requirement to use this feature.

**SetHTTPInjector(injector)**
To update the global injector, the Propagation API provides a function which takes an injector.

**GetHTTPInjector() -> injector**
To access the global injector, the Propagation API provides a function which returns an injector.

**SetHTTPExtractor(extractor)**
To update the global extractor, the Propagation API provides a function which takes an extractor.

**GetHTTPExtractor() -> extractor**
To access the global extractor, the Propagation API provides a function which returns an extractor.
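
A small Python sketch of the optional global accessors is shown below. It assumes module-level state and no-op defaults (an injector that writes nothing and an extractor that returns the context unchanged); those defaults and the demonstration header are assumptions, not something this RFC specifies.

```python
def _noop_injector(context, request):
    """Default injector: writes nothing to the request."""


def _identity_extractor(context, request):
    """Default extractor: returns the context unchanged."""
    return context


_injector = _noop_injector
_extractor = _identity_extractor


def set_http_injector(injector):
    global _injector
    _injector = injector


def get_http_injector():
    return _injector


def set_http_extractor(extractor):
    global _extractor
    _extractor = extractor


def get_http_extractor():
    return _extractor


# During initialization: register a (hypothetical) injector once.
def inject_marker(context, request):
    request["headers"]["x-example-marker"] = "1"

unchanged = get_http_extractor()({"a": 1}, {"headers": {}})
set_http_injector(inject_marker)

# Later, anywhere in the program: use whatever was registered.
request = {"headers": {}}
get_http_injector()({}, request)
```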

# Internal details

![drawing](img/context_propagation_details.png)

## Context details
OpenTelemetry currently implements three types of context propagation.

**Span Context -** The serializable portion of a span, which is injected and extracted. The readable attributes are defined to match those found in the [W3C Trace Context specification](https://www.w3.org/TR/trace-context/).

**Correlation Context -** Correlation Context contains a map of labels and values, to be shared between metrics and traces. This allows observability data to be indexed and dimensionalized in a variety of ways. Note that correlations can quickly add overhead when propagated in-band. But because this data is write-only, it may be possible to optimize how it is transmitted.

**Baggage -** Transaction-level application data, meant to be shared with downstream components. This data is readable, and must be propagated in-band. Because of this, Baggage should be used sparingly, to avoid ballooning the size of all downstream requests.

Note that OpenTelemetry API calls should *always* be given access to the entire context object, and never just a subset of the context, such as the value in a single key. This allows the SDK to make improvements and leverage additional data that may be available, without changes to all of the call sites.


## Context Management and in-process propagation

In order for Context to function, it must always remain bound to the execution of code it represents. By default, this means that the programmer must pass a Context down the call stack as a function parameter. However, many languages provide automated context management facilities, such as thread locals. OpenTelemetry should leverage these facilities when available, in order to provide automatic context management.
Member


Suggested change
- In order for Context to function, it must always remain bound to the execution of code it represents. By default, this means that the programmer must pass a Context down the call stack as a function parameter. However, many languages provide automated context management facilities, such as thread locals. OpenTelemetry should leverage these facilities when available, in order to provide automatic context management.
+ For Context to function, it must always remain bound to the execution of code it represents. By default, this means that the programmer must pass a Context down the call stack as a function parameter. However, many languages provide automated context management facilities, such as thread locals. OpenTelemetry should leverage these facilities when available, in order to provide automatic context management.

Member


⚠️ This seems to indicate that we must change the way we think about "Span" activation: We should really activate a whole context not a Span.

Contributor Author


Not for setting the active span in the current context, but when moving work from one thread to another, or context switching in async/nonblocking runtimes, this needs to be addressed.

In general, I'm concerned that there will be important details which present themselves better when we attempt to implement this. We've gotten a bit of a pass so far because we did so much work in Java at the very beginning of this process, before RFCs, etc. FWIW, in OpenTracing, we required a working implementation before committing a change to the spec.


## Pre-existing Context implementations

In some languages, a single, widely used Context implementation exists. In other languages, there may be too many implementations, or none at all. For example, Go has the `context.Context` object, and widespread conventions for how to pass it down the call stack.

In the cases where an extremely clear, pre-existing option is not available, OpenTelemetry should provide its own Context implementation.

## Default Propagators

When available, OpenTelemetry defaults to propagating via HTTP header definitions which have been standardized by the W3C.


# Trade-offs and mitigations

## Why separate Baggage from Correlations?

Since Baggage Context and Correlation Context appear very similar, why have two?

First and foremost, the intended uses for Baggage and Correlations are completely different. Secondly, the propagation requirements diverge significantly.

Correlation values are solely to be used as labels for metrics and traces. By making Correlation data write-only, how and when it is transmitted remains undefined. This leaves the door open to optimizations, such as propagating some data out-of-band, and situations where sampling decisions may remove the need to propagate correlation context any further.

Baggage values, on the other hand, are explicitly added in order to be accessed downstream by other application code. Therefore, Baggage Context must be readable, and reliably propagated in-band in order to accomplish this goal.

There may be cases where a key-value pair is propagated as a Correlation for observability and as a Baggage item for application-specific use. A/B testing is one example of such a use case. This would result in extra overhead, as the same key-value pair would be present in two separate headers.
Member


I am not clear why we're making this point. Can't we allow telemetry sub-systems to access baggage? The metrics exported can be configured to read AB testing labels from baggage, not just from correlations - seems preferable to transmitting the same data twice.


Solving this issue is not worth the semantic confusion of giving one mechanism a dual purpose. However, because all observability functions take the complete context as input – and baggage is not sampled – it may still be possible to use baggage values as labels for observability.
Member


This is not the first reference to sampling, and I find it highly confusing. Context is never sampled. Telemetry context may be dropped due to bandwidth limitations (which is not mentioned here), but that's completely different from sampling.



## What about complex propagation behavior?

Some OpenTelemetry proposals have called for more complex propagation behavior. For example, falling back to extracting B3 headers if W3C Trace-Context headers are not found. Chained propagators and other complex behavior can be modeled as implementation details behind the Propagator interface. Therefore, the propagation system itself does not need to provide chained propagators or other additional facilities.
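
As an illustration of hiding fallback logic behind a single extractor, the Python sketch below tries the W3C `traceparent` header first and falls back to the B3 multi-header format. It is deliberately simplified: it only accepts `traceparent` version `00`, ignores trace flags and `tracestate`, and does not reject all-zero IDs. The context layout (a `"span"` entry in a dictionary) and the request shape are assumptions.

```python
import re

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract_span_context(context, request):
    """Extract a span context, preferring Trace-Context and falling back to B3."""
    headers = {k.lower(): v for k, v in request["headers"].items()}

    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match:
        trace_id, span_id, _flags = match.groups()
        return {**context, "span": {"trace_id": trace_id, "span_id": span_id}}

    # Fallback: B3 multi-header format.
    trace_id = headers.get("x-b3-traceid")
    span_id = headers.get("x-b3-spanid")
    if trace_id and span_id:
        # B3 allows 64-bit trace IDs; left-pad them to 128 bits.
        padded = trace_id.lower().rjust(32, "0")
        return {**context, "span": {"trace_id": padded, "span_id": span_id.lower()}}

    return context  # nothing to extract


w3c_request = {"headers": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}}
b3_request = {"headers": {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
}}
from_w3c = extract_span_context({}, w3c_request)
from_b3 = extract_span_context({}, b3_request)
```

Callers see one extractor with the usual `(context, request) -> context` shape, so the fallback never surfaces at the API level.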
Member


I agree with this, and therefore do not understand why earlier section explicitly mentions Chained API functions. Chaining can be achieved through composition of SDK objects without cluttering the API that the end-user is exposed to.

Member


If there is no solution for this at the API-level (or at least SDK-level), then there must be only one dedicated component (e.g. "the application") that sets all propagators as there is no shared protocol for how to coordinate on setting propagators.

Contributor Author


Sorry, this language should be updated, now that the API has simple chaining. But, at the API level, we are only chaining together propagators for different applications. Complex details within one application – such as checking for Trace-Context and falling back to B3 if it is not present – do not need to be handled at the API level. I was getting comments and questions about that kind of behavior, so I felt compelled to add this...

But, with the addition of named tracers, how a single inject/extract call can leverage multiple independent propagators needs to be thought about a bit more, since propagation behavior now depends on which tracer instance is used - or at least, I think that is what named tracers implies.

Member


on chaining, I think it doesn't belong here (see earlier comment in the API section)

on named tracers - I would propose to ignore those. Propagators, like exporters, are 1-1 with TracerFactory, not with Tracer, i.e. all differently named tracers share the propagator. And the application that ultimately invokes the propagation API does it at the generic propagation layer, not via tracers.



## Did you add a context parameter to every API call because Go has infected your brain?

No. The concept of an explicit context is fundamental to a model where independent distributed applications share the same context propagation layer. How this context appears or is expressed is language specific, but it must be present in some form.


# Prior art and alternatives

Prior art:
* OpenTelemetry distributed context
* OpenCensus propagators
* OpenTracing spans
* gRPC context
Comment on lines +200 to +203
Member


Please make these links.

Contributor Author


will do!


# Open questions

Related work on HTTP propagation has not been completed yet:

* [W3C Trace-Context](https://www.w3.org/TR/trace-context/) candidate is not yet accepted
* Work on [W3C Correlation-Context](https://w3c.github.io/correlation-context/) has begun, but was halted to focus on Trace-Context.
* No work has begun on a theoretical W3C Baggage-Context.

Given that we must ship with working propagators, and the W3C specifications are not yet complete, how should we move forwards with implementing context propagation?

# Future possibilities

Cleanly splitting OpenTelemetry into an Application and Context Propagation layer may allow us to move the Context Propagation layer into its own, stand-alone project. This may facilitate adoption, by allowing us to share Context Propagation with gRPC and other projects.
Binary file added text/img/context_propagation_details.png
Binary file added text/img/context_propagation_explanation.png