Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kotlin coroutines instrumentation #4405

Merged

Conversation

monosoul
Copy link
Contributor

@monosoul monosoul commented Dec 8, 2022

What Does This Do

Provides a proper instrumentation for Kotlin coroutines. Solves #931 , Solves #1123 .

Introduces:

  • ScopeState (probably might have a better name) <-- an abstraction for holding and managing current scope state.
  • ContinuableManagedScope <-- implementation that holds a ScopeStack. On activate it sets the scope stack it holds into the thread local scope stack of ContinuableScopeManager. On fetchFromActive it fetches the value of thread local scope stack into it's local variable.
  • ScopeStateAware <-- provides a method to get a new scope state. Extended by AgentScopeManager and TracerAPI
  • ScopeStateCoroutineContext <-- coroutine context element that allows to store/restore scope state on suspension/continuation.
  • AbstractCoroutineInstrumentation <-- instrumentation for AbstractCoroutine, adds 3 advices:
    • AbstractCoroutineConstructorAdvice <-- advice around AbstractCoroutine constructor that creates a new instance of ScopeStateCoroutineContext and puts it into the CoroutineContext ensuring each coroutine has it's own instance of the context element and it is never inherited. If the coroutine is not lazily started, then also captures the current active scope.
    • AbstractCoroutineOnStartAdvice <-- advice around AbstractCoroutine#onStart to capture active scope on if/when a coroutine starts. Will do nothing for eagerly started coroutines (scope already got captured there on construction), only applicable to lazily started coroutines.
    • JobSupportAfterCompletionInternalAdvice <-- advice around AbstractCoroutine#onCompletionInternal to guarantee the scope captured by coroutine is closed.
  • CoroutineContextHelper <-- a set of helper functions to access elements of CoroutineContext.

Basically, what it does - is makes each coroutine have it's own scope stack.

Motivation

At the moment DataDog agent doesn't work properly with Kotlin coroutines, especially in the case when a coroutine gets suspended first and then gets continued. This is because ContinuableScopeManager stores the scope stack in a thread local variable.
Imagine this situation:

  1. We have a coroutine dispatcher that is an executor pool of 2 threads.
  2. We start a coroutine on the thread #1 and then it gets suspended.
  3. When it gets continued, the thread #1 is busy with another coroutine, so our coroutine gets scheduled to the thread #2
  4. Because ContinuableScopeManager uses a thread local scope stack, all the scopes that we have activated in our coroutine got left on the thread #1. Moreover, the thread #2 might also have some leftovers in the scope stack from another coroutine, leading to missing spans and broken span hierarchy.

Having this instrumentation will make DD customers using Kotlin coroutines way happier with the product.

Additional Notes

The test case to demonstrate the issue can be found in this commit: df6d722 (for easier review).

You can find more information about coroutines in Kotlin here:

Feature flags

The feature is disabled by default.

  • System property dd.integration.kotlin_coroutine.experimental.enabled
  • Environment variable DD_INTEGRATION_KOTLIN_COROUTINE_EXPERIMENTAL_ENABLED

Either of the 2 should be set to true to enable the feature.

Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
@monosoul monosoul force-pushed the feature/kotlin-coroutines-instrumentation branch 2 times, most recently from 35df53a to b06ac78 Compare December 9, 2022 13:36
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
@monosoul monosoul force-pushed the feature/kotlin-coroutines-instrumentation branch from b06ac78 to a3512c8 Compare December 9, 2022 13:38
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
@monosoul monosoul force-pushed the feature/kotlin-coroutines-instrumentation branch 2 times, most recently from 5088230 to 8530f70 Compare December 9, 2022 15:22
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
@monosoul monosoul force-pushed the feature/kotlin-coroutines-instrumentation branch from 8530f70 to 079ab2c Compare December 9, 2022 15:37
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
@monosoul monosoul marked this pull request as ready for review December 9, 2022 19:20
@monosoul monosoul requested a review from a team as a code owner December 9, 2022 19:20
@vsingal-p
Copy link

Thanks for raising this 💯

@monosoul
Copy link
Contributor Author

Hey @mcculls @bantonsson , sorry for pinging, but would you mind having a look here? I believe having a proper instrumentation for coroutines will make lots of DD customers (including myself) happy. In case if you think the changes here are a bit too intrusive, there's also an option to do it like this: c7b4468 , with no changes to the API.
Thanks!

@bantonsson
Copy link
Contributor

Thanks @monosoul. I'll try to review this tomorrow.

…utines-instrumentation

Signed-off-by: monosoul <Kloz.Klaud@gmail.com>

# Conflicts:
#	dd-trace-core/src/main/java/datadog/trace/core/scopemanager/ContinuableScopeManager.java
@bantonsson
Copy link
Contributor

@monosoul I need to dig a bit deeper into this. The Scope instances are not really intended to be shared across threads and I want to make sure that I understand the underlying mechanisms in Kotlin.

@monosoul
Copy link
Contributor Author

monosoul commented Dec 15, 2022

@bantonsson yeah, the change here basically makes sure that each coroutine has it's own scope that doesn't interfere with the thread's scope and other coroutines. I.e. while the thread runs the coroutine - the scope manager will use the scope stored in the ScopeStateCoroutineContext instance unique for each coroutine. Once the coroutine is suspended or finished - the original thread's scope will be restored. Without this change multiple coroutines might share the same scope which leads to lost spans and span inheritance issues. You can think of ThreadContextElements as of ThreadLocals of coroutine world. Not sure if that explanation makes it any more clear 😅

@bantonsson
Copy link
Contributor

Hey @monosoul. I've been reading up on this a bit. This whole solution would need to use continuations and activate/deactivate scopes in a different way to handle separating the creation of coroutines and the running of them.

I have a changed test that splits the reported spans into two traces which is not correct and I'll open a PR against your branch with it.

Would love to have a bigger discussion about how one would expect the UI to show these coroutines, since I think there is where we need to start to get this right.

@vsingal-p
Copy link

@bantonsson Why would we need a change in the UI? The main problem is that traces getting split with new trace_ids because scope/span are not propagated correctly with coroutines. If we do, the flow would look exactly same as it does now (in a normal/other app).

@monosoul
Copy link
Contributor Author

Hey @bantonsson , thanks for looking into it!

I have a changed test that splits the reported spans into two traces which is not correct and I'll open a PR against your branch with it.

I'm not sure if I got that part, but I guess it will be easier to discuss when you open the PR.

Here are my expectations from coroutines instrumentation:

  • All calls of instrumented methods from within coroutines should work the same way as they work outside of coroutines. I.e. if I start a span (let's call it span A) before calling a coroutine and then inside the coroutine I call a method that was instrumented automatically, for example fetch messages with Kafka client, then each span created with Kafka client call should have span A as it's parent regardless of the thread the coroutine got scheduled to. If there are multiple parallel calls, they all should have span A as it's parent.
  • The instrumentation should just work without any changes to coroutines invocations.
  • And of course running coroutines should not mess up other spans.

As for how it looks in the UI - I'd also expect the span hierarchy to be respected there. If I got it right - you want to use continuations so that the spans will have breaks in them when coroutine gets suspended, right?

Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
Signed-off-by: monosoul <Kloz.Klaud@gmail.com>
@bantonsson
Copy link
Contributor

Happy Holidays @monosoul. I just wanted to say that I've been looking over this code and working on some more tests as well as digging into the Kotlin Coroutine machinery. I'll get back with an update next week.

@bantonsson
Copy link
Contributor

@monosoul I've opened a new PR against your branch monosoul#2 that has all my changes, both tests and implementation. I've also tested it against CI here #4499

@monosoul
Copy link
Contributor Author

monosoul commented Jan 5, 2023

@bantonsson thanks! I'll check it tomorrow

@monosoul
Copy link
Contributor Author

monosoul commented Jan 6, 2023

@bantonsson I've left a few comments there. Tbh, now I think it's probably impossible/hard to achieve the behavior for lazily started coroutines I mentioned here. First of all because there are no guarantees a lazily started coroutine will ever be started or cancelled after it was instantiated.

@vsingal-p
Copy link

Hello folks. Is there any update on this PR?
I have gone through the comments and looks like lazy coroutines are somewhat tricky to handle

Can we keep them out of scope and get the base fix merged? We can mark lazy coroutines as a known bug, but atleast we would be able to solve a lot of other ones (everything basically)

@bantonsson
Copy link
Contributor

@vsingal-p I've reviewed the fixes that @monosoul have done in his repo, and I think that with those fixes done, we could move this PR forward if we have the instrumentation as an opt-in until it's been tested and we know what the performance implications are.

@vsingal-p
Copy link

vsingal-p commented Jan 25, 2023

@vsingal-p I've reviewed the fixes that @monosoul have done in his repo, and I think that with those fixes done, we could move this PR forward if we have the instrumentation as an opt-in until it's been tested and we know what the performance implications are.

Sounds like a plan. @monosoul Let's get all the changes in and get this fix launched 🤞 And huge thanks for all your efforts 🙇

@monosoul
Copy link
Contributor Author

Sorry for the delay on my end this time, been a bit busy lately. I'll try to implement the changes suggested by @bantonsson by the end of this week.

@monosoul
Copy link
Contributor Author

@bantonsson I've updated the PR

@monosoul monosoul requested a review from bantonsson January 27, 2023 12:25
Copy link
Contributor

@bantonsson bantonsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @monosoul for a great contribution.

@bantonsson
Copy link
Contributor

@monosoul Would you mind rewording your PR description to what the state of the instrumentation is now, and mention the flags for turning on the instrumentation? Also, could you switch out the use of fixes #... to solves #... since github auto-close will close the issues on merge and not on release.

@monosoul
Copy link
Contributor Author

@bantonsson Done!

@kwazii1231
Copy link

is it working correctly?
i mean, i'm hard to find out what i'm missing
i use RUN curl -Lso dd-java-agent.jar https://dtdg.co/latest-java-tracer
to use latest dd-trace-java for my application as datadog agent
and i use implementation("com.datadoghq:dd-trace-api:1.22.0") as gradle dependency.
i don't know what is the root cause that is reason my application can not correctly trace code that i use async of coroutine
anything other is same as what u describe.
if you have any clue to solve my situation.

@monosoul
Copy link
Contributor Author

monosoul commented Nov 9, 2023

@kwazii1231 you need to also enable the feature, since it's behind a flag atm

@kwazii1231
Copy link

kwazii1231 commented Nov 9, 2023

@kwazii1231 you need to also enable the feature, since it's behind a flag atm

@monosoul

really thanks to reply very quickly
i set up DD_INTEGRATION_KOTLIN_COROUTINE_EXPERIMENTAL_ENABLED=true
to my container env_variable
and my docker file is seems like belows

FROM adoptopenjdk:11.0.11_9-jdk-hotspot as app

ARG APP_PROFILE=local
ENV APP_PROFILE=${APP_PROFILE}
TZ=Asia/Seoul

WORKDIR /app
RUN curl -Lso dd-java-agent.jar https://dtdg.co/latest-java-tracer

WORKDIR /app
COPY build/libs/* /app/

ENTRYPOINT java
-XX:+UseContainerSupport
-XX:InitialRAMPercentage=75
-XX:MaxRAMPercentage=75
-XshowSettings:vm
-Dfile.encoding=UTF-8
-Djava.net.preferIPv4Stack=true
-jar /app/api.jar
--spring.profiles.active=${APP_PROFILE}

is there any other things to do?
some trace seems like trace correctly but almost trace is not working correctly

@monosoul
Copy link
Contributor Author

monosoul commented Nov 9, 2023

@kwazii1231 you need to add the agent to your jvm cmd line: -javaagent:/path/to/dd-java-agent.jar

See this documentation: https://docs.datadoghq.com/tracing/trace_collection/dd_libraries/java/?tab=springboot

@kwazii1231
Copy link

kwazii1231 commented Nov 9, 2023

@kwazii1231 you need to also enable the feature, since it's behind a flag atm

also did at common template of my helm chart like this

  • name: JAVA_TOOL_OPTIONS
    value: -javaagent:/app/dd-java-agent.jar

only async statement for grpc api call trace is not correctly collected

@kwazii1231
Copy link

kwazii1231 commented Nov 10, 2023

@kwazii1231 you need to add the agent to your jvm cmd line: -javaagent:/path/to/dd-java-agent.jar

See this documentation: https://docs.datadoghq.com/tracing/trace_collection/dd_libraries/java/?tab=springboot

Hi @monosoul
finally i found the reason, and root cause is not from your implementation
but from our custom span append code that was written when this agent didn't support kotlin coroutine tracing.
so i used activeSpan from globalTracer for appending custom client span.
finally it works!
really thansk to reply very quickly that make me try to find out reason from our code.
have a nice day!

@monosoul monosoul deleted the feature/kotlin-coroutines-instrumentation branch December 18, 2023 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inst: others All other instrumentations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants