Add Suppress Tracing context key #1653

Conversation

@dyladan (Member) commented Apr 27, 2021

Fixes #530

As discussed in the specification SIG today, this adds a predefined context key for suppressing instrumentation.

@dyladan dyladan requested review from a team April 27, 2021 16:39
@dyladan dyladan force-pushed the suppress-instrumentation branch 2 times, most recently from 4a162f3 to 4f3c671 on April 27, 2021 16:44
@carlosalberto (Contributor)

@open-telemetry/specs-approvers Take a look at this please.

@Oberon00 previously approved these changes Apr 28, 2021
Two review comment threads on specification/context/context.md (outdated, resolved)
@yurishkuro (Member) left a comment

I am -1 on this change in its current form. It feels rushed to me (it lacks extensibility), and there is no clearly documented use case.

@Oberon00 (Member)

@yurishkuro

it feels rushed to me

The issue and general solution design here are quite old, see #530 (@dyladan, probably you should add "Fixes #530" to the PR description)

@yurishkuro (Member)

The issue and general solution design here are quite old, see #530

#530 is mostly about stopping tracing, not stopping all instrumentation. This change would make a lot more sense to me if it was about stopping tracing only, but even then a single boolean is a crude design that cannot be extended further.

@dyladan (Member Author) commented Apr 28, 2021

The issue and general solution design here are quite old, see #530

#530 is mostly about stopping tracing, not stopping all instrumentation. This change would make a lot more sense to me if it was about stopping tracing only, but even then a single boolean is a crude design that cannot be extended further.

I don't understand the need to be able to extend this key in the future. The whole point of having context with opaque keys was to be able to create keys that mean different things. We can make one for tracing and in the future if we want another signal disabled (or all signals) we can just create a new key.
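For illustration, a minimal sketch of what per-signal keys could look like in JS using the existing Context API. The key names and the suppressTracing/isTracingSuppressed helpers here are hypothetical, not part of this PR's text:

```ts
import { context, createContextKey, Context } from '@opentelemetry/api';

// Hypothetical per-signal keys; the names are illustrative only.
const SUPPRESS_TRACING_KEY = createContextKey('OpenTelemetry Context Key SUPPRESS_TRACING');
const SUPPRESS_METRICS_KEY = createContextKey('OpenTelemetry Context Key SUPPRESS_METRICS');

export function suppressTracing(ctx: Context): Context {
  return ctx.setValue(SUPPRESS_TRACING_KEY, true);
}

export function isTracingSuppressed(ctx: Context): boolean {
  return ctx.getValue(SUPPRESS_TRACING_KEY) === true;
}

// Usage: run a block with tracing suppressed; the flag is scoped to this context.
context.with(suppressTracing(context.active()), () => {
  // any instrumentation that checks the key skips span creation here
});
```

Adding another signal later would just mean adding another key and helper pair, without touching the existing ones.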

@dyladan dyladan changed the title Add Suppress Instrumentation context key Add Suppress Tracing context key Apr 28, 2021
@dyladan (Member Author) commented Apr 28, 2021

@yurishkuro I modified it to only suppress tracing, by creating non-recording spans and preventing injection. Extract and metrics were removed from the wording.
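A rough sketch of the tracer-side behavior, assuming the hypothetical isTracingSuppressed helper from the earlier sketch (the spec wording only requires returning a non-recording Span; the function and module names here are illustrative):

```ts
import { Context, INVALID_SPAN_CONTEXT, Span, trace } from '@opentelemetry/api';
import { isTracingSuppressed } from './suppress'; // hypothetical helper from the earlier sketch

// Sketch: the check an SDK tracer could run before its normal sampling logic.
function startSpanRespectingSuppression(name: string, ctx: Context): Span {
  if (isTracingSuppressed(ctx)) {
    // Return a non-recording Span that only carries the parent SpanContext
    // (or an invalid one), the same mechanism used when no SDK is installed.
    const parentSpanContext = trace.getSpanContext(ctx) ?? INVALID_SPAN_CONTEXT;
    return trace.wrapSpanContext(parentSpanContext);
  }
  // Otherwise defer to the real tracer (sampling, attribute collection, etc.).
  return trace.getTracer('example').startSpan(name, {}, ctx);
}
```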

@blumamir (Member)

I want to add that I use this mechanism in a few instrumentation libraries and it has proven to be very handy.

It is useful for reducing the visual and performance overhead of collecting uninteresting low-level implementation details of high-level libraries, such as:

  • the HTTP RPC part of aws-sdk requests;
  • the HTTP part of Elasticsearch requests;
  • the database driver spans where an ORM is used (suppress mongodb driver spans for operations generated by mongoose, suppress mysql driver spans for operations generated by typeorm, etc.).

You can read more about it here

@iNikem (Contributor) commented Apr 28, 2021

I'm probably missing something fundamental here, but why can't #530 be solved by using Samplers? Especially point 1, about health checks and such.

@yurishkuro (Member)

I'm probably missing something fundamental here, but why can't #530 be solved by using Samplers? Especially point 1, about health checks and such.

+1

I don't understand the need to be able to extend this key in the future. The whole point of having context with opaque keys was to be able to create keys that mean different things. We can make one for tracing and in the future if we want another signal disabled (or all signals) we can just create a new key.

My problem is not with the opaque key, but with growing the surface of the API. For comparison, the OpenTracing API had 6 methods in total: inject, extract, start, finish, setTag, addLog. Here we're proposing two new methods to the public API surface to account for a relatively obscure use case. And we're not even making it extensible, so that the cognitive overhead of a wider API surface could be amortized over more use cases in the future (back to my original #1653 (comment)).

@dyladan (Member Author) commented Apr 28, 2021

I'm probably missing something fundamental here, but why can't #530 be solved by using Samplers? Especially point 1, about health checks and such.

Haha, I think I might also be missing something, because I can't see how samplers could fix this issue in an automatic way without using context. How does the exporter/span processor signal to the sampler that this is an export span that shouldn't be sampled?

@dyladan (Member Author) commented Apr 28, 2021

Sorry for the accidental close and reopen while trying to comment.

My problem is not with the opaque key, but with growing the surface of the API. For comparison, the OpenTracing API had 6 methods in total: inject, extract, start, finish, setTag, addLog.

This is not really a new API; it simply leverages an existing mechanism. And even if it were a new API, I don't think we can use "growing the API" as a counterargument, because then no new APIs would ever be added.

Here we're proposing two new methods to the public API surface to account for a relatively obscure use case.

I'm not sure I agree that the use case is obscure. Suppressing tracing is used by at least every exporter that uses an instrumented transmission channel like HTTP. Additionally, @blumamir already cited several examples where he found this useful.

And we're not even making it extensible, so that the cognitive overhead of a wider API surface could be amortized over more use cases in the future (back to my original #1653 (comment)).

I must not be understanding what you mean by extensibility. To me it seems like it would be easy to add another key to the list of predefined keys in the future. If you're asking for this one key to be able to disable multiple signals in a configurable way, I would argue that is more confusing than just having multiple keys.

@dyladan dyladan closed this Apr 28, 2021
@dyladan dyladan reopened this Apr 28, 2021
@blumamir (Member)

I am not sure that suppressing underlying spans is the right thing to do. E.g. take a look at #1360. I am pretty sure that in most cases we still want information from the underlying transport libraries.

In my opinion, the transport spans can be interesting in some cases (debugging low-level issues, performance issues, etc.), but most of the time they just add noise to the trace and distract attention from the logical operation. Allowing the user to turn this instrumentation feature on and off via a config option can be very handy for supporting a wide range of users with different needs.

Before this mechanism was introduced, we had to "clean" the trace in the backend, which is possible but more tedious.

@@ -183,6 +183,8 @@ When asked to create a Span, the SDK MUST act as if doing the following in order
A non-recording span MAY be implemented using the same mechanism as when a
`Span` is created without an SDK installed or as described in
[wrapping a SpanContext in a Span](api.md#wrapping-a-spancontext-in-a-span).
If the [`SuppressTracing`](./api.md#suppress-tracing) `Context` flag is set,
the newly created `Span` should be a non-recording `Span`.
(Contributor)

Can there be a situation where it might be desired to suppress tracing but then re-enable it later in the same trace? For example, a chain of 5 middlewares (all instrumented by a common instrumentation) where a user suppresses tracing in middleware 2 and then enables it again in middleware 4. If we generate non-recording spans, the spans from middlewares 4 and 5 will not be children of the span generated by middleware 2. Not recording any spans when tracing is suppressed would instead allow this case to generate correct traces.

This could also cause issues with context propagation. If tracing is suppressed on a code path that ends up making a network call, and there is user/instrumentation code that injects trace context into the outgoing call, it'll end up either not injecting context or injecting the non-recording span's span ID, depending on the implementation of the non-recording span.

Maybe there are other such cases.

If we were to actually suppress tracing, i.e., not record any spans at all instead of creating non-recording spans, both of these cases would work. Neither solution is perfect, but not recording anything would probably result in traces that are "less broken".

@dyladan (Member Author)

Can there be a situation where it might be desired to suppress tracing but then re-enable it later in the same trace?

If we generate non-recording spans, the spans from middlewares 4 and 5 will not be children of the span generated by middleware 2. Not recording any spans when tracing is suppressed would instead allow this case to generate correct traces.

Sure, I think that's a valid use case. It's definitely not solved by this, though, for the reasons you stated. It's not easy to solve either, because this puts the burden on the caller to not create spans. If an instrumentation doesn't check the context key before calling startSpan, then it can't be suppressed.

This could also cause issues with context propagation. If tracing is suppressed on a code path that ends up making a network call, and there is user/instrumentation code that injects trace context into the outgoing call, it'll end up either not injecting context or injecting the non-recording span's span ID, depending on the implementation of the non-recording span.

Yes, it was my intention to suppress injection. For the use cases described, I think this is the desired behavior.
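A sketch of the injection side under the same assumption (the guard function and the isTracingSuppressed helper are hypothetical; only propagation.inject is real API):

```ts
import { Context, propagation } from '@opentelemetry/api';
import { isTracingSuppressed } from './suppress'; // hypothetical helper from the earlier sketch

// Sketch: skip injecting trace context into outgoing requests when the
// suppression flag is present on the context.
function injectUnlessSuppressed(ctx: Context, carrier: Record<string, string>): void {
  if (isTracingSuppressed(ctx)) {
    return; // leave the carrier untouched: no traceparent header is added
  }
  propagation.inject(ctx, carrier);
}
```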

If we were to actually suppress tracing, i.e., not record any spans at all instead of creating non-recording spans, both of these cases would work. Neither solution is perfect, but not recording anything would probably result in traces that are "less broken".

The spec states startSpan MUST return something implementing the Span interface, so I don't really see any option other than returning a non-recording span. Open to ideas if you have them.

(Contributor)

I wonder if the context API could act like a stack in this case. Suppressing instrumentation could store a reference to the active span, and undoing it later would re-activate that span again. That'd solve both cases, but it would probably bring up more issues and complicate the implementation too much.

(Member)

The context API already works like a stack: since context is immutable, you need a new copy for each nested scope.
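For example (a small sketch with the JS API; the function and variable names are illustrative):

```ts
import { context, trace, Span } from '@opentelemetry/api';

// Nested context.with() calls behave like a stack: each call derives a new
// immutable Context, and the previous active Context is restored on return.
function demo(outerSpan: Span, innerSpan: Span) {
  context.with(trace.setSpan(context.active(), outerSpan), () => {
    // outerSpan is active here
    context.with(trace.setSpan(context.active(), innerSpan), () => {
      // innerSpan is active here
    });
    // back to outerSpan automatically, nothing to "unset"
  });
}
```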

@owais (Contributor) commented Apr 30, 2021

Sure, I think that's a valid use case. It's definitely not solved by this, though, for the reasons you stated. It's not easy to solve either, because this puts the burden on the caller to not create spans. If an instrumentation doesn't check the context key before calling startSpan, then it can't be suppressed.

True, but if we make it part of the spec or semantic conventions, then it's very likely that they will.

@dyladan (Member Author)

True, but if we make it part of the spec or semantic conventions, then it's very likely that they will.

In my experience this is not true. Instrumentations created by SIGs will, but the community is very unlikely to read the spec that closely.

@weyert commented Apr 30, 2021

Interesting issue. This might help solve a long-standing issue I have at work, where a lot of spans are created for RPC and HTTP by instrumentation libraries. I don't really care about those in some situations (when calling a specific function/action), but I do care about them in all other cases.

For example, I am using Datastore, which uses RPC for its calls, which means I get spans for DNS (for the host), RPC connect, my database client span, my database SQL builder span, etc. In this case I want to suppress the RPC connect, NET connect, and DNS connect spans, but only in one case where I am making many calls to Datastore or calling many underlying REST services in one class/function, while keeping the usual behaviour everywhere else.

E.g.

  RequestController()
       handleGetRequest()
            - fetchData1()
            - triggerFetch()
                   - getDataFromOtherService()
                             - getDataFromOtherService2()
                                 // SUPPRESS SPAN ALL RPC/NET INSTRUMENTATION
                                  - processDataFromService2()
                                               - moreProcessDataFromService2()
                                // UNSUPPRESS SPAN
                   - getDataFromOtherService3()

@Oberon00 Oberon00 dismissed their stale review April 30, 2021 11:16

This seems to be less trivial than I initially thought.

@Oberon00 (Member) commented Apr 30, 2021

After reading all the feedback on samplers, I think this could indeed also work: we provide, in the SDK, a sampler SuppressingSampler(innerSampler) that checks the context key, and a corresponding SDK method to set the flag on the context. Then we increase only the SDK's surface, not the API's surface. The major disadvantage, though, is that performance will be worse because initial span attributes will still be collected, but that is #620.

EDIT: The other difference with the sampler solution is that the current PR suggests not injecting at all if tracing is suppressed, while with the current sampling decisions we would still inject a span context, just with the sampled flag set to zero. I created #1663 because I think this is an idea worth thinking about.
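A sketch of what such a wrapper could look like, assuming the Sampler interface exported by the JS SDK and the hypothetical isTracingSuppressed helper from earlier:

```ts
import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
import { Sampler, SamplingDecision, SamplingResult } from '@opentelemetry/sdk-trace-base';
import { isTracingSuppressed } from './suppress'; // hypothetical helper

// Wraps any sampler and forces a NOT_RECORD decision when the suppression
// flag is present on the context; otherwise defers to the inner sampler.
export class SuppressingSampler implements Sampler {
  constructor(private readonly inner: Sampler) {}

  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[]
  ): SamplingResult {
    if (isTracingSuppressed(context)) {
      return { decision: SamplingDecision.NOT_RECORD };
    }
    return this.inner.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  }

  toString(): string {
    return `SuppressingSampler(${this.inner.toString()})`;
  }
}
```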

@Oberon00 (Member) commented Apr 30, 2021

@weyert

Interesting issue. This might help solve a long-standing issue I have at work, where a lot of spans are created for RPC and HTTP by instrumentation libraries.

That use case might be better solved by providing samplers access to a readable parent span (I think this already works in some languages with some casting, since the sampler already gets the full parent context).

@cijothomas (Member)

We provide, in the SDK, a sampler SuppressingSampler(innerSampler) that checks the context key, and a corresponding SDK method to set the flag on the context. Then we increase only the SDK's surface, not the API's surface.

If the method to set the flag on the context is in the SDK, then this cannot be leveraged by instrumentations to suppress underlying instrumentation, right? (Unless the instrumentation takes an SDK dependency.)

@dyladan (Member Author) commented Apr 30, 2021

I would like to avoid instrumentations taking on the SDK as a dependency, since this is essentially the primary reason for the strict API/SDK split. If a library author builds OpenTelemetry into their library, they should only have to depend on the API.

@pauldraper commented May 2, 2021

Here we're proposing two new methods to the public API surface to account for a relatively obscure use case

To quote #530

There are circumstances for instrumented actions where:

  1. The action is frequent and of low interest: a healthcheck, polling a message queue, etc.
  2. An OpenTracing exporter uses libraries that themselves may be instrumented (risking infinite tracing).
  3. If the current layer (e.g. RPC) happens to offer sufficiently detailed tracing, lower HTTP/DNS/TCP/UDP layers do not need to be traced when invoked with this library. This need is heightened by the use of server/client spans, which AFAIK require removing the spans in between them. (SpanKind with layers #526)

(1) has come up in #173

(2) has been a problem in opentelemetry-js HTTP-based exporters (open-telemetry/opentelemetry-js#332), since the Node.js stdlib is instrumented globally. The solution there was to add a special HTTP header, x-opentelemetry-outgoing-request, that the http instrumentation ignores. In essence it uses HTTP headers as a poor man's context API. (This may not scale to other APIs.)

Please address alternative solutions to these issues.

The problem is so non-obscure and non-rushed that the Python and Ruby implementations independently created this solution over a year ago.

@dyladan (Member Author) commented May 3, 2021

@pauldraper Are you asking me to address alternatives?

@pauldraper

Sorry, @yurishkuro

@dyladan (Member Author) commented May 4, 2021

Discussed this in the spec SIG today. This PR boils down to 3 use cases:

  1. Prevent infinite loop with exporters
  2. Allow instrumentations to suppress lower level instrumentations
  3. Suppress frequent noise like polling and health checks

Prevent Exporter Loops

This can be solved in the SDK only with no API change and would not require this API to exist.
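For example, the SDK could wrap the export call itself in a suppressed context; a sketch assuming the hypothetical suppressTracing helper from earlier (the function name here is illustrative):

```ts
import { context } from '@opentelemetry/api';
import { ReadableSpan, SpanExporter } from '@opentelemetry/sdk-trace-base';
import { suppressTracing } from './suppress'; // hypothetical helper

// Sketch: the SDK runs the export call in a suppressed context so that the
// instrumented HTTP/gRPC client used by the exporter does not create new
// spans, which would otherwise be exported again (the infinite loop).
function exportWithoutSelfTracing(exporter: SpanExporter, spans: ReadableSpan[]): void {
  context.with(suppressTracing(context.active()), () => {
    exporter.export(spans, (result) => {
      // handle the ExportResult (success / failure) here
    });
  });
}
```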

Instrumentations Suppress Lower Levels

This is a special case of a larger question: how should we handle cases where an instrumentation wants to produce the "final" span? This PR suggests that lower-level spans be suppressed and not traced, but there may be other solutions to this problem.

Suppress Noisy Spans

In some ways this is an extension of (2), but it was not my original target in this PR, and it is possibly better solved with other mechanisms like smarter sampling.

Takeaway

For now, the specification SIG suggested that the JS SIG should move this functionality out of the API into some sort of API extensions package. Instrumentations can rely on this package, which our SDK will respect, but third-party SDKs will be under no official obligation to respect it.

I am going to leave this open because I still believe the mechanism has value, and I have not yet seen a proposal for solving (2) that is obviously better than this. The linked issue is currently one of our oldest open issues, and the fact that multiple languages have already implemented this at the SDK level seems to me to show the value of this mechanism.

@pauldraper commented May 5, 2021

This can be solved in the SDK only with no API change and would not require this API to exist.

How?

Example situation:

gRPC module uses the http module.

The gRPC module is instrumented. The http module is instrumented.

The gRPC instrumentation creates a client span. The http module creates a client span.

Client spans are now doubled up, which AFAIK is a problem for systems with two-sided spans like Zipkin.

To prevent doubled-up client spans, the gRPC instrumentation (which uses only the API) needs a way to tell the http instrumentation (which uses only the API) to not trace.


This is a very common situation.

Am I missing something?
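To make that concrete, this is roughly what a gRPC instrumentation could do with the proposed key, using only the API (suppressTracing is the hypothetical helper sketched earlier; the function names are illustrative):

```ts
import { context, trace, SpanKind } from '@opentelemetry/api';
import { suppressTracing } from './suppress'; // hypothetical helper

const tracer = trace.getTracer('example-grpc-instrumentation');

// Sketch: the gRPC instrumentation records its own CLIENT span, then runs the
// underlying HTTP call with tracing suppressed so the http instrumentation
// (which also only uses the API) does not create a second CLIENT span.
function tracedGrpcCall<T>(method: string, doHttpRequest: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(method, { kind: SpanKind.CLIENT }, async (span) => {
    try {
      return await context.with(suppressTracing(context.active()), doHttpRequest);
    } finally {
      span.end();
    }
  });
}
```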

@jkwatson (Contributor) commented May 5, 2021

This can be solved in the SDK only with no API change and would not require this API to exist.

How?

Example situation:

gRPC module uses the http module.

The gRPC module is instrumented. The http module is instrumented.

The gRPC instrumentation creates a client span. The http module creates a client span.

Client spans are now doubled up, which AFAIK is a problem for systems with two-sided spans like Zipkin.

To prevent doubled-up client spans, the gRPC instrumentation (which uses only the API) needs a way to tell the http instrumentation (which uses only the API) to not trace.

This is a very common situation.

Am I missing something?

Your example isn't an exporter loop, which is the context you snipped that quote from.

@Oberon00 (Member) commented May 5, 2021

@pauldraper

To prevent doubled-up client spans, the gRPC instrumentation (which uses only the API) needs a way to tell the http instrumentation (which uses only the API) to not trace.

This is only one possible solution. Another theoretically (though not currently) possible solution is that the SDK user / application owner configures the SDK to suppress any local child spans of spans with type=CLIENT that have an rpc.system=grpc attribute. Or, any CLIENT spans that are children of another CLIENT span could be suppressed.

IMHO, having the grpc instrumentation decide whether or not the child spans should be suppressed is wrong. The HTTP child spans are potentially useful, e.g., if you have a specific HTTP-related issue you want to debug.

@pauldraper commented May 5, 2021

Your example isn't an exporter loop, which is the context you snipped that quote from.

Ah, yes. My bad. Regardless, I would like to understand the group's answer to that.

IMHO, having the grpc instrumentation decide whether or not the child spans should be suppressed is wrong. The HTTP child spans are potentially useful, e.g., if you have a specific HTTP-related issue you want to debug.

FWIW I agree; I think client/server spans are terrible. Zipkin two-sided spans are terrible. Protocols are inherently nested and OTel seems to ignore that (#526).

But under the current OpenTelemetry approach, it seems that doubled-up client spans are bad, yet there is no good solution to prevent them.

@github-actions (bot)

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added Stale and removed Stale labels May 13, 2021
@carlosalberto (Contributor)

@dyladan As the related issue will still exist, are you OK with closing this PR?

@dyladan (Member Author) commented May 17, 2021

I don't mind if we close this PR as long as we eventually solve this use case. The issue has been around for over a year with no movement.

@carlosalberto (Contributor)

Oh definitely, let's keep #530 open until we finally solve it for all SIGs in a uniform way. Thanks!


Successfully merging this pull request may close: Use Context to stop tracing (#530)