Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanups to OpenTelemetry tracing. #9

Closed

Conversation

anuraaga
Copy link
Contributor

@anuraaga anuraaga commented Mar 8, 2021

Hello, I'm an OpenTelemetry maintainer - thanks for using OpenTelemetry for tracing in this project :)

I've applied some cleanups to this project to match some of the guidelines we have in https://github.com/open-telemetry/opentelemetry-java-instrumentation

  • Accept OpenTelemetry instance to instrumentation instead of Tracer. Tracer name should be determined by the library, not by the user
  • Don't call setNoParent when couchbase didn't provide a parent span. Couchbase spans still need to parent to, e.g. a spring-boot server span
  • Don't call makeCurrent - it is to make a span active in a threadlocal for a block of code, but immediately calling close is effectively an expensive no-op, and it's not needed for Span.end() either. Since you propagate parent RequestSpan within the SDK, you don't need to ever use makeCurrent I think.

One big remaining issue seems to be the span kind - couchbase spans corresponding to a user operation should be CLIENT, not INTERNAL. But I think that would involve a mapping from operationName to kind which I'm not familiar enough to solve myself.

@cb-sdk-robot
Copy link

Thanks for the pull request!! To ensure quality review, Couchbase employs a code review system based on Gerrit to manage the workflow of changes in addition to tracking our contributor agreements.

To get this change in and collaborate in code review, please register on Gerrit and accept our CLA. The easiest way to do this is to follow the link below, sign in with your GitHub account and then follow through the steps provided on that page to sign an 'Individual' agreement: http://review.couchbase.org/#/settings/new-agreement.

Keep in mind that the emails we are seeing on the commits are: aa****ag@amaz****o.jp

Note: Please contact us if you have any issues registering with Gerrit! If you have not signed our CLA within 7 days, the Pull Request will be automatically closed.

::SDKBOT/PR:no_cla

@anuraaga
Copy link
Contributor Author

anuraaga commented Mar 8, 2021

Signed the CLA - assuming the PR will autosync to gerrit at some point but let me know if there's any other steps.

*/
public static OpenTelemetryRequestTracer wrap(final Tracer tracer) {
return new OpenTelemetryRequestTracer(tracer);
public static OpenTelemetryRequestTracer wrap(final OpenTelemetry openTelemetry) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was always under the impression (maybe this is legacy knowledge from OpenTracing) that there is only one global tracer for each app, so that we would just whatever tracer the user has configured. Does this mean that with Otel you'll have one tracer instance for each library instrumented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup OTel is a bit different, the "Tracer" from OpenTracing is called "TracerProvider" in OTel. Each library indeed has its own Tracer with information about that library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense then!

private OpenTelemetryRequestTracer(Tracer tracer) {
this.tracer = tracer;
private OpenTelemetryRequestTracer(OpenTelemetry openTelemetry) {
this.tracer = openTelemetry.getTracer("com.couchbase.client");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a scheme we should follow on the naming (I saw an example around otel.contrib?). Also will this show up in the actual UIs of the servers? If so maybe we should do com.couchbase.client. like com.couchbase.client.java, com.couchbase.client.dotnet etc..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't think I've seen a guidelines aside from what's here

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#get-a-tracer

Putting java here definitely makes sense, updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added java but now I'm wondering whether any of jvm, java-core, java-core-io are more appropriate since this isn't related to the java-client artifact

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you set the default to com.couchbase.client.jvm? I'm going to add subsequent options so that each higher level sdk (i.e. java, kotlin, scala) can override that "jvm" bit so we have a more specific namespace if we can get it but a sane default if not.

span -> span
.hasName("get")
.hasParentSpanId(parent.getSpanContext().getSpanId())
.hasKind(SpanKind.INTERNAL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we send SpanKind.CLIENT on all of the couchbase-related spans that are created in the SDK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i.e. also on the dispatch_to_server)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're not really allowed to have nested CLIENT spans because the definition in the spec says a single CLIENT span corresponds to a single SERVER span. In our instrumentation, we generally suppress transport spans, so for example in AWS SDK instrumentation, we generate a CLIENT span for the SDK call, but no span for the Apache HTTP Client request itself. In this case, I think that would be similar to suppressing the dispatch_to_server span. However, it's not entirely bad modeling to have dispatch_to_server as an INTERNAL span either, but we'd generally only expect the "API Method" to be a CLIENT span - imagine it as the name on client and server should be very similar. Ideally some day Couchbase itself has OpenTelemetry server instrumentation too and would have a get span with type SERVER :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question: how do you deal with "multiple" requests? So for example we have "get from replicas" which will internally send N requests to the active and all replicas. Would the outer parent span that covers all the individual requests also be a CLIENT? or really only ever the individual requests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't really settled on a great way to model this, this spec issue is related.

open-telemetry/opentelemetry-specification#1360

If one of the replicas fails, does the entire request fail? I suppose not since we're expecting one replica to succeed - in that case I'd consider this to be a single CLIENT span, how many requests are made in the background is still an implementation detail. Tracing backends generally assume CLIENT span failures are indicative of a problem so it's good for those to be 1:1 if possible.

It can still be important for debugging though, so one idea we've had while discussing that issue is possibly modeling as events, modeling as INTERNAL spans, or maybe allowing nested CLIENT spans but we haven't come to a great conclusion yet unfortunately.

@@ -78,9 +78,7 @@ public void addEvent(String name, Instant timestamp) {

@Override
public void end() {
try (Scope scope = span.makeCurrent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the span is completed on a different thread than it was created, it doesn't need to made current again?

Copy link
Contributor Author

@anuraaga anuraaga Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just to call .end

If in this method, if we had a call like grpcClient.executeRpc, then yes it would need to be made current so the gRPC client picked it up as a parent. Making a span current is only to allow implicit propagation between two different libraries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that makes sense, thanks

@daschl
Copy link
Contributor

daschl commented Mar 8, 2021

@anuraaga thank you very much for your PR! I've posted a couple comments because I have some follow up questions if you don't mind.

@brett19
Copy link
Member

brett19 commented Mar 8, 2021

Hey @anuraaga,

It looks like the email that this commit was created with (aaag@amazo.jp) isn't linked to your Gerrit account. You can add it here: http://review.couchbase.org/settings/#EmailAddresses and then this PR should be pulled to Gerrit.

Cheers, Brett

@anuraaga anuraaga force-pushed the tweak-opentelemetry-tracing branch from bd562f5 to 4b5f194 Compare March 9, 2021 02:59
@cb-sdk-robot
Copy link

Your changes (commit: 4b5f194) have been pushed to the Couchbase Review Site:
http://review.couchbase.org/147979

::SDKBOT/PR:created

@cb-sdk-robot
Copy link

Your changes (commit: d47105f) have been pushed to the Couchbase Review Site:
http://review.couchbase.org/147979

Note: As your pull request contains multiple commits, we have performed an automatic squash of these commits into a single change-set. If this is not the desired behaviour, please consider submitting a pull request per discreet feature.

::SDKBOT/PR:updated

@anuraaga
Copy link
Contributor Author

anuraaga commented Mar 9, 2021

@brett19 Thanks for the advice! As this is under individual CLA I needed to change my email address now looks like it's good :)

if (parent != null) {
spanBuilder.setParent(Context.current().with(castSpan(parent)));
} else {
spanBuilder.setNoParent();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually one more question here too: I assume you are refering to the implicit propagation through active and thread locals? What about where we are working under async / reactive environments where the one in the thread local might not actually be the "real" parent?

Btw, not sure if you noticed, we actually have an explicit propagation mechanism. On every request the user makes, in the option block, you can provide the parent explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah for async environments, we still expect the context to be propagated through callbacks, for example wiring up using something like

Executor contextExecutor = Context.current().wrap(Runnable::run);
asyncClient.execute(request).thenComposeAsync(response -> asyncClient.execute(request2), contextExecutor)
...;

Or registering the executor in a dagger graph, or as a reactive subscriber, or maybe use reactive hooks via some instrumentation like https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/reactor-3.1/library, or using the java agent which also does automatic context propagation in many cases.

The reality is very few libraries support explicitly passing in a context like this couchbase one does, so propagating the context into threadlocals across async bounds is basically a requirement in these environments, e.g. if you're using gRPC or really any other library you need to be propagating context. On the bright side, this means that a couchbase user doesn't have to worry about couchbase specifically, they can treat it as any other client in their app and expect it to use the implicit parent when there's no explicit one.

Of course, it's up to you whether this is something you want to support - on the flip side, you could just have a couchbase sdk policy saying "always pass in the parent, we don't use implicit propagation at all". I wouldn't recommend it though since it'd be a jarring experience for users compared to other libraries.

@daschl
Copy link
Contributor

daschl commented Mar 9, 2021

@anuraaga thanks again, I have a couple more follow ups but then we should be good to go.

@cb-sdk-robot
Copy link

Your changes (commit: 84e0767) have been pushed to the Couchbase Review Site:
http://review.couchbase.org/147979

Note: As your pull request contains multiple commits, we have performed an automatic squash of these commits into a single change-set. If this is not the desired behaviour, please consider submitting a pull request per discreet feature.

::SDKBOT/PR:updated

@anuraaga
Copy link
Contributor Author

@daschl Sorry for the delay, updated the tracer name to be jvm believe I hit all the comments

@daschl
Copy link
Contributor

daschl commented Mar 15, 2021

@anuraaga yes thanks very much - I'll complete the rest of it over at gerrit. If you are okay I'll amend your commit with a jira version etc so it aligns with our commit messages.

thanks much for your contribution again!

@daschl
Copy link
Contributor

daschl commented Mar 15, 2021

@anuraaga can you close the ticket here? ( I can't unfortunately )

@anuraaga
Copy link
Contributor Author

@daschl Do you mean closing this PR?

@daschl
Copy link
Contributor

daschl commented Mar 15, 2021

Yes, since it's already in gerrit.

@anuraaga
Copy link
Contributor Author

Got it. I might recommend either 1) Gerrit sync closes PR too or 2) Stop Gerrit since Github PR UI has made huge progress since back in the day.

Not my project though ;)

@anuraaga anuraaga closed this Mar 15, 2021
@daschl
Copy link
Contributor

daschl commented Mar 15, 2021

@anuraaga couchbase exclusively uses gerrit for all its projects - we just use github as mirrors since people are more comfortable with it.

@anuraaga
Copy link
Contributor Author

@daschl Yup - I'm just suggesting it's probably time to challenge that given how things have changed these past few years. I personally won't be coming back to this repo since I believe in contributor-first (Gerrit + sync + PR close is contributor-second). Just food for thought, please digest it since I won't be commenting anymore. Will be happy to continue to consume the binaries though since we have committed through open-telemetry/opentelemetry-java-instrumentation#2524 (via GitHub -- pardon my last snark ;).

Cheers!

ns-codereview pushed a commit that referenced this pull request Sep 17, 2024
When selecting a query error to propagate as exception,
prefer non-retriable errors.

Change-Id: I3e27530e94e462dda27e62161d5c114861497a1a
Reviewed-on: https://review.couchbase.org/c/couchbase-jvm-clients/+/216262
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Graham Pople <graham.pople@couchbase.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants