Parent can only be null in a local root #1814
Comments
@FranPregernik Could you please provide a simple sample project that sometimes can repro the issue? |
Hi @jonatan-ivanov! The demo configures the Flowable ProcessEngine with the TraceableExecutorService here https://github.com/FranPregernik/spring-cloud-sleuth-1814/blob/master/src/main/java/com/example/demo/ProcessEngineConfig.java#L156 The test then reads in the BPMN process and starts it. The process calls OrderWorkflowServiceImpl methods (e.g. https://github.com/FranPregernik/spring-cloud-sleuth-1814/blob/master/src/main/java/com/example/demo/OrderWorkflowServiceImpl.java#L33), which I have turned into fakes that allow the test (i.e. the BPMN process) to complete fully. Each method checks for the presence of the MDC context, sleeps for a bit (to simulate the real-world scenario) and does some logic to make the test pass. I ran the test continuously but could not make it fail. The exception reported above is obviously thrown before any of my asserts are called. Hope it helps. |
@jonatan-ivanov we can do a check for |
@marcingrzejszczak Did you mean a I don't know enough about Brave but if I were to do something I would add a Brave check to see if there is a pending Trace available or something like that. Or maybe just a try catch around the initialization of Bottom line ... the delegate must be run. |
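The "delegate must be run" idea above can be sketched as follows. This is only an illustration under assumptions: `FailSafeTraceRunnable` and its `spanStarter` parameter are hypothetical names, not Sleuth or Brave types, and the span handling is reduced to a callback so the example stays self-contained.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper (not the Sleuth TraceRunnable): tracing failures never block the delegate. */
final class FailSafeTraceRunnable implements Runnable {

    // Assumption: starts a span and returns a callback that finishes it.
    private final Supplier<Runnable> spanStarter;
    private final Runnable delegate;

    FailSafeTraceRunnable(Supplier<Runnable> spanStarter, Runnable delegate) {
        this.spanStarter = spanStarter;
        this.delegate = delegate;
    }

    @Override
    public void run() {
        Runnable finishSpan = null;
        try {
            finishSpan = spanStarter.get();
        } catch (RuntimeException | AssertionError e) {
            // Span creation failed; fall through and run the delegate untraced.
        }
        try {
            delegate.run();
        } finally {
            if (finishSpan != null) {
                finishSpan.run();
            }
        }
    }
}
```

The trade-off in this sketch is that a failed span start is swallowed silently, so a real implementation would probably want to log it.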
What I mean is more or less the following:
Span childSpan = this.parent != null ? this.tracer.nextSpan(this.parent).name(this.spanName) : this.tracer.nextSpan().name(this.spanName); |
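A minimal, library-free sketch of the null check suggested above. `ChildSpanChooser`, `childOf` and `newRoot` are hypothetical names standing in for the tracer calls; this only illustrates the branching, and nothing here confirms that it fixes the reported error.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper: pick "child of parent" when a parent exists, otherwise start a new root. */
final class ChildSpanChooser {

    private ChildSpanChooser() {
    }

    static <P, S> S nextSpan(P parent, Function<P, S> childOf, Supplier<S> newRoot) {
        // Mirrors the ternary in the comment: only pass the parent along when it is non-null.
        return parent != null ? childOf.apply(parent) : newRoot.get();
    }
}
```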
@FranPregernik I checked your demo project, but since it does not reproduce the issue and it is huge, I was not really able to get any lead from it. I checked the stack trace, though, which seems interesting (see below). @marcingrzejszczak I'm not sure calling
To me this seems like a concurrency issue in Brave, here are a few interesting details:
|
You're right - I mistook it for the case where you put a
I think we should wait for a simple case of a demo project that will allow us to replicate this issue. WDYT @jonatan-ivanov ? |
Hey @FranPregernik, any chance for a smaller reproducer? |
@marcingrzejszczak, @jonatan-ivanov I can try. I will need to investigate the internals a bit. I don't know how/where/when the LazySpan is initialized. Any pointers? |
@FranPregernik To get a better understanding of what could have gone wrong and what Brave is doing under the hood, try to follow the stack trace (here's some help in case you need it). Let me walk through the section that could help:
In the sample above, you use a As you move forward in the stacktrace, you can see that Span childSpan = this.tracer.nextSpan(this.parent).name(this.spanName); This also shows where the Please let me know if this makes sense, also, since I'm not really able to repro this, this is still a guess. |
Thank you for the feedback and mini tutorial. The stack trace was what confused me initially: it didn't show me (right away) where this LazySpan was coming from. I should have looked more closely. Fingers crossed...
Update 2021-01-14 19:46 CEST
Almost certain that none of the LazySpans are shared. The context is shared. The bug only triggers if a parent trace exists in BraveTracer.java:52. This bug never triggers if there is no sampler set up. In my project I had this sampler and in the demo I did not.
@Bean
Sampler sampler() {
    // necessary to prevent noop spans brave.Tracer.java:407
    return CountingSampler.create(0.02f);
}
Still trying to trigger the assert in PendingSpans.java:89 |
@FranPregernik Please let us know if you were able to repro the issue. |
I am getting nowhere. I have one more thing to try... I need to instrument the spring cloud sleuth code with some special logging in my project to be able to catch constructor and run methods of TraceRunnable and capture the contexts of them. I am hoping this will let me see how to reproduce the conditions in the demo project. |
@FranPregernik Please let us know how it went. |
I ran into a similar issue when using the LazyTraceExecutor, with the exact same difficulty: the problem is rather hard to reproduce. I need to test it more over the coming days; however, bumping the io.zipkin.brave dependencies from 5.13.2 to 5.13.3 seems to have fixed it in my case |
@rbieniek Thanks for the info. @FranPregernik, could you please check if upgrading brave helps? |
I upgraded the brave deps in the parent pom @jonatan-ivanov, @rbieniek I never thought I would witness a real live Heisenbug 🙄 . I then ran the test suite for 4 hours non-stop. No error... I then removed the TraceRunnable and clean/compiled everything again. Ran the test suite and sure enough the error pops up again (assert breakpoint triggered). So frustrating... Any suggestions? |
And I gave it another shot and caught it. I did not modify the Please double check my reasoning here. So the regular (passing) case is when the context does not have the parent trace id - it is the local root: With the AssertError caught the context is different - it is not the local root: Then I went up the call stack to the So I called manually the I tested the Finally I tested the So some race condition is happening but might not be the same as what we are expecting (unprotected My TraceRunnable: https://gist.github.com/FranPregernik/0d0e0ddff5f0f642f8f12313f42667e1 |
So do you think that the problem is actually in Sleuth? I mean the only thing that we're doing is delegating work to Brave 🤷 |
Not sure... could be. |
You can file an issue in Brave and maybe link this one? |
@FranPregernik Let me close this issue, please @mention me or Marcin if you want us to reopen. |
FYI - I've submitted a fix for this to Brave in openzipkin/brave#1306 In the meantime, here's a hack for anyone else impacted to resolve it until that change can be merged and released:
|
Thank you @andylintner |
Describe the bug
I am using Spring Cloud Sleuth 3.0.1-SNAPSHOT (as of 2020-12-31).
The bug is really sporadic but does cause the workflow engine (Flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.
The error is this:
I think it is on this line in TraceRunnable:
I am of the opinion that TraceRunnable should not prevent the delegate Runnable from running in any way. We might benefit from a check before calling the nextSpan method, or from catching this exception somehow.
Sample
I can't reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.
Flowable can be set up with a custom implementation of a JobExecutor. I have it set up like so:
P.S. I am really grateful to you guys for the libraries you produce. They make my job easier!