Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanups to OpenTelemetry tracing. #9

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -24,37 +24,36 @@
import com.couchbase.client.test.ClusterAwareIntegrationTest;
import com.couchbase.client.test.Services;
import com.couchbase.client.test.TestNodeConfig;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.testing.exporter.InMemorySpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Scope;
import io.opentelemetry.sdk.testing.junit5.OpenTelemetryExtension;
import java.time.Duration;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

import java.time.Duration;
import org.junit.jupiter.api.extension.RegisterExtension;

import static com.couchbase.client.java.ClusterOptions.clusterOptions;
import static com.couchbase.client.test.Util.waitUntilCondition;
import static io.opentelemetry.sdk.testing.assertj.OpenTelemetryAssertions.assertThat;

class OpenTelemetryIntegrationTest extends ClusterAwareIntegrationTest {

private static ClusterEnvironment environment;
private static Cluster cluster;
private static Collection collection;
private static final InMemorySpanExporter exporter = InMemorySpanExporter.create();
private static SdkTracerProvider tracerProvider;

@RegisterExtension
static final OpenTelemetryExtension otelTesting = OpenTelemetryExtension.create();

@BeforeAll
static void beforeAll() {
tracerProvider = SdkTracerProvider.builder().addSpanProcessor(SimpleSpanProcessor.create(exporter)).build();

TestNodeConfig config = config().firstNodeWith(Services.KV).get();

environment = ClusterEnvironment
.builder()
.requestTracer(OpenTelemetryRequestTracer.wrap(tracerProvider.get("integrationTest")))
.requestTracer(OpenTelemetryRequestTracer.wrap(otelTesting.getOpenTelemetry()))
.build();

cluster = Cluster.connect(
Expand All @@ -72,11 +71,51 @@ static void beforeAll() {
static void afterAll() {
cluster.disconnect();
environment.shutdown();
tracerProvider.shutdown();
}

@Test
void capturesTraceSpans() {
Span parent = otelTesting.getOpenTelemetry().getTracer("integrationtest")
.spanBuilder("test")
.setSpanKind(SpanKind.SERVER)
.startSpan();
try (Scope ignored = parent.makeCurrent()) {
collection.get("myid");
} catch (DocumentNotFoundException ignored) {
// expected
} finally {
parent.end();
}

waitUntilCondition(() -> {
otelTesting.assertTraces()
.hasTracesSatisfyingExactly(
trace -> trace.hasSpansSatisfyingExactly(
span -> span
.hasName("test")
.hasKind(SpanKind.SERVER),
span -> span
.hasName("get")
.hasParentSpanId(parent.getSpanContext().getSpanId())
.hasKind(SpanKind.INTERNAL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we send SpanKind.CLIENT on all of the couchbase-related spans that are created in the SDK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i.e. also on the dispatch_to_server)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're not really allowed to have nested CLIENT spans because the definition in the spec says a single CLIENT span corresponds to a single SERVER span. In our instrumentation, we generally suppress transport spans, so for example in AWS SDK instrumentation, we generate a CLIENT span for the SDK call, but no span for the Apache HTTP Client request itself. In this case, I think that would be similar to suppressing the dispatch_to_server span. However, it's not entirely bad modeling to have dispatch_to_server as an INTERNAL span either, but we'd generally only expect the "API Method" to be a CLIENT span - imagine it as the name on client and server should be very similar. Ideally some day Couchbase itself has OpenTelemetry server instrumentation too and would have a get span with type SERVER :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question: how do you deal with "multiple" requests? So for example we have "get from replicas" which will internally send N requests to the active and all replicas. Would the outer parent span that covers all the individual requests also be a CLIENT? or really only ever the individual requests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't really settled on a great way to model this, this spec issue is related.

open-telemetry/opentelemetry-specification#1360

If one of the replicas fails, does the entire request fail? I suppose not since we're expecting one replica to succeed - in that case I'd consider this to be a single CLIENT span, how many requests are made in the background is still an implementation detail. Tracing backends generally assume CLIENT span failures are indicative of a problem so it's good for those to be 1:1 if possible.

It can still be important for debugging though, so one idea we've had while discussing that issue is possibly modeling as events, modeling as INTERNAL spans, or maybe allowing nested CLIENT spans but we haven't come to a great conclusion yet unfortunately.

.hasAttributesSatisfying(attributes -> assertThat(attributes)
.containsEntry("db.system", "couchbase")
.containsEntry("db.operation", "get")
.containsEntry("db.couchbase.service", "kv")
.containsEntry("db.couchbase.collection", "_default")
.containsEntry("db.couchbase.scope", "_default")),
span -> span
.hasName("dispatch_to_server")
.hasKind(SpanKind.INTERNAL)
.hasAttributesSatisfying(attributes -> assertThat(attributes)
.containsEntry("db.system", "couchbase"))
));
return true;
});
}

@Test
void stressTest() {
int numRequests = 100;
for (int i = 0; i < 100; i++) {
try {
Expand All @@ -86,7 +125,10 @@ void capturesTraceSpans() {
}
}

waitUntilCondition(() -> exporter.getFinishedSpanItems().size() >= numRequests);
waitUntilCondition(() -> {
otelTesting.assertTraces().hasSizeGreaterThanOrEqualTo(numRequests);
return true;
});
}

}
Original file line number Diff line number Diff line change
Expand Up @@ -78,9 +78,7 @@ public void addEvent(String name, Instant timestamp) {

@Override
public void end() {
try (Scope scope = span.makeCurrent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the span is completed on a different thread than it was created, it doesn't need to made current again?

Copy link
Contributor Author

@anuraaga anuraaga Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just to call .end

If in this method, if we had a call like grpcClient.executeRpc, then yes it would need to be made current so the gRPC client picked it up as a parent. Making a span current is only to allow implicit propagation between two different libraries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that makes sense, thanks

span.end();
}
span.end();
}

@Override
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@

import com.couchbase.client.core.cnc.RequestSpan;
import com.couchbase.client.core.cnc.RequestTracer;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanBuilder;
import io.opentelemetry.api.trace.Tracer;
Expand All @@ -33,23 +34,23 @@
public class OpenTelemetryRequestTracer implements RequestTracer {

/**
* Holds the actual OT tracer.
* Holds the actual OTel tracer.
*/
private final Tracer tracer;

/**
* Wraps the OpenTelemetry tracer and returns a datatype that can be passed into the requestTracer method of the
* Wraps OpenTelemetry and returns a datatype that can be passed into the requestTracer method of the
* environment.
*
* @param tracer the tracer to wrap.
* @return the wrapped tracer ready to be passed in.
* @param openTelemetry the OpenTelemetry instance to wrap.
* @return the wrapped OpenTelemetry ready to be passed in.
*/
public static OpenTelemetryRequestTracer wrap(final Tracer tracer) {
return new OpenTelemetryRequestTracer(tracer);
public static OpenTelemetryRequestTracer wrap(final OpenTelemetry openTelemetry) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was always under the impression (maybe this is legacy knowledge from OpenTracing) that there is only one global tracer for each app, so that we would just whatever tracer the user has configured. Does this mean that with Otel you'll have one tracer instance for each library instrumented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup OTel is a bit different, the "Tracer" from OpenTracing is called "TracerProvider" in OTel. Each library indeed has its own Tracer with information about that library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense then!

return new OpenTelemetryRequestTracer(openTelemetry);
}

private OpenTelemetryRequestTracer(Tracer tracer) {
this.tracer = tracer;
private OpenTelemetryRequestTracer(OpenTelemetry openTelemetry) {
this.tracer = openTelemetry.getTracer("com.couchbase.client.jvm");
}

private Span castSpan(final RequestSpan requestSpan) {
Expand All @@ -68,19 +69,17 @@ private Span castSpan(final RequestSpan requestSpan) {
* Returns the inner OpenTelemetry tracer.
*/
public Tracer tracer() {
return tracer;
return tracer;
}

@Override
public RequestSpan requestSpan(String operationName, RequestSpan parent) {
SpanBuilder spanBuilder = tracer.spanBuilder(operationName);
Context parentContext = Context.current();
if (parent != null) {
spanBuilder.setParent(Context.current().with(castSpan(parent)));
} else {
spanBuilder.setNoParent();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually one more question here too: I assume you are refering to the implicit propagation through active and thread locals? What about where we are working under async / reactive environments where the one in the thread local might not actually be the "real" parent?

Btw, not sure if you noticed, we actually have an explicit propagation mechanism. On every request the user makes, in the option block, you can provide the parent explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah for async environments, we still expect the context to be propagated through callbacks, for example wiring up using something like

Executor contextExecutor = Context.current().wrap(Runnable::run);
asyncClient.execute(request).thenComposeAsync(response -> asyncClient.execute(request2), contextExecutor)
...;

Or registering the executor in a dagger graph, or as a reactive subscriber, or maybe use reactive hooks via some instrumentation like https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/reactor-3.1/library, or using the java agent which also does automatic context propagation in many cases.

The reality is very few libraries support explicitly passing in a context like this couchbase one does, so propagating the context into threadlocals across async bounds is basically a requirement in these environments, e.g. if you're using gRPC or really any other library you need to be propagating context. On the bright side, this means that a couchbase user doesn't have to worry about couchbase specifically, they can treat it as any other client in their app and expect it to use the implicit parent when there's no explicit one.

Of course, it's up to you whether this is something you want to support - on the flip side, you could just have a couchbase sdk policy saying "always pass in the parent, we don't use implicit propagation at all". I wouldn't recommend it though since it'd be a jarring experience for users compared to other libraries.

parentContext = parentContext.with(castSpan(parent));
}
Span span = spanBuilder.startSpan();
span.makeCurrent().close();
Span span = spanBuilder.setParent(parentContext).startSpan();
return OpenTelemetryRequestSpan.wrap(span);
}

Expand Down