Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parent can only be null in a local root #1814

Closed
FranPregernik opened this issue Dec 31, 2020 · 25 comments
Closed

Parent can only be null in a local root #1814

FranPregernik opened this issue Dec 31, 2020 · 25 comments

Comments

@FranPregernik
Copy link

Describe the bug

I am using Spring Cloud Sleuth 3.0.1-SNAPSHOT (as of 2020-12-31).
The bug is really sporadic but does cause the worfkflow engine (flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.

The error is this:

Exception in thread "orders-flowable-async-job-executor-thread-4" java.lang.AssertionError: Bug (or unexpected call to internal code): parent can only be null in a local root!
	at brave.internal.recorder.PendingSpans.getOrCreate(PendingSpans.java:89)
	at brave.Tracer._toSpan(Tracer.java:410)
	at brave.Tracer.toSpan(Tracer.java:382)
	at brave.Tracer.toSpan(Tracer.java:376)
	at brave.LazySpan.span(LazySpan.java:141)
	at brave.LazySpan.context(LazySpan.java:40)
	at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
	at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I think it is on this line in TraceRunable:
image

I am of the opinion that the TraceRunanble should not prevent the delegate Runnable in any way. We might benefit from a check before calling the nextSpan method or catching this exception somehow.

Sample

I can't reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.

Flowable can be set up with a custom implementation of a JobExectuor. I have it set up like so:

 private AsyncExecutor initAsyncExecutor(String tennantId, final int maxConcurrent) {
        final DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
        asyncExecutor.setAutoActivate(false);

        // other setup, irrelevant to the case

        // customized default async job executor initialization with tracing info
        // org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor.initAsyncJobExecutionThreadPool
        asyncExecutor.setThreadPoolQueue(new ArrayBlockingQueue<>(queueSize));
        asyncExecutor.setExecutorService(
            new TraceableExecutorService(
                this.applicationContext,
                new ThreadPoolExecutor(
                    asyncExecutor.getCorePoolSize(),
                    asyncExecutor.getMaxPoolSize(),
                    asyncExecutor.getKeepAliveTime(), TimeUnit.MILLISECONDS,
                    asyncExecutor.getThreadPoolQueue(),
                    new BasicThreadFactory.Builder()
                        .namingPattern(tennantId + "-flowable-async-job-executor-thread-%d")
                        .build()
                )
            )
        );

        return asyncExecutor;
    }

P.S. I am really grateful to you guys for the libraries you produce. They make my job easier!

@jonatan-ivanov
Copy link
Member

@FranPregernik Could you please provide a simple sample project that sometimes can repro the issue?

@FranPregernik
Copy link
Author

Hi @jonatan-ivanov!

Run the test https://github.com/FranPregernik/spring-cloud-sleuth-1814/blob/master/src/test/java/com/example/demo/OrderProcessIT.java#L245

The way it is set up is that it configures the Flowable ProcessEngine with the TraceableExecutorService here https://github.com/FranPregernik/spring-cloud-sleuth-1814/blob/master/src/main/java/com/example/demo/ProcessEngineConfig.java#L156

The test then reads in the BPMN process and starts it. The process calls to OrderWorkflowServiceImpl methods (e.g. https://github.com/FranPregernik/spring-cloud-sleuth-1814/blob/master/src/main/java/com/example/demo/OrderWorkflowServiceImpl.java#L33) which I have turned into fakes that allow the test (i.e. BPMN process) to complete fully. Each method will check the presence of MDC context and sleep for a bit (to simulate the real world scenario) and do some logic to make the test pass.

I ran the test continuously but could not make it fail. The exception reported above is before any of my asserts are called obviously.

Hope it helps.

@marcingrzejszczak
Copy link
Contributor

@jonatan-ivanov we can do a check for null there and call nextSpan() if parent is null.

@FranPregernik
Copy link
Author

@marcingrzejszczak Did you mean a null check on this.tracer or something else? If so, I don't think tracer is null.

I don't know enough about Brave but if I were to do something I would add a Brave check to see if there is a pending Trace available or something like that. Or maybe just a try catch around the initialization of childSpan and then null check of childSpan later on. Something like that anyway.

Bottom line ... the delegate must be run.

@marcingrzejszczak
Copy link
Contributor

What I mean is more or less the following:

Span childSpan = this.parent != null ? this.tracer.nextSpan(this.parentSpan).name(this.spanName) : this.tracer.nextSpan().name(this.spanName);

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Jan 5, 2021

@FranPregernik I checked your demo project but since it does not reproduces the issue and it is huge, I was not really able to get any lead on it. I checked the stacktrace though which seems interesting (see below).

@marcingrzejszczak I'm not sure calling nextSpan() if the parent is null and nextSpan(parent) otherwise will help, here's why:

  • According to the javadoc of Tracer#nextSpan(Span), it acts as nextSpan() if the parent span parameter is null
  • BraveTracer#nextSpan(Span) complies this, it literally calls nextSpan() if the parent span is null
  • parent was not null when this error was thrown; nextSpan(Span) in BraveTracer checks it, see:
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)

To me this seems like a concurrency issue in Brave, here are a few interesting details:

  1. parent seems to be a LazySpan, see:
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
  1. LazySpan is not tread-safe, see its javadoc, the delegate, and the context fields

  2. I believe it can happen that two threads are calling the span() method on LazySpan, both of them pass the null check and the second thread uses the context in delegate = tracer.toSpan(context); created by the first thread (in the next line). This context fails the check in PendingSpans: assert parent != null || context.isLocalRoot() : "Bug (or unexpected call to internal code): parent can only be null in a local root!";

@marcingrzejszczak
Copy link
Contributor

BraveTracer#nextSpan(Span) complies this, it literally calls nextSpan() if the parent span is null

You're right - I mistook it for the case where you put a null span in scope. Sorry.

@FranPregernik I checked your demo project but since it does not reproduces the issue and it is huge, I was not really able to get any lead on it. I checked the stacktrace though which seems interesting (see below).

I think we should wait for a simple case of a demo project that will allow us to replicate this issue. WDYT @jonatan-ivanov ?

@jonatan-ivanov
Copy link
Member

@marcingrzejszczak 👍🏼

@marcingrzejszczak
Copy link
Contributor

Hey @FranPregernik, any chance for a smaller reproducer?

@FranPregernik
Copy link
Author

FranPregernik commented Jan 12, 2021

@marcingrzejszczak, @jonatan-ivanov I can try. I will need to investigate the internals a bit.

I don't know how/where/when the LazySpan is initialized.
I don't understand how one instance of a LazySpan could be shared among the threads in the executor. Are two TraceRunnables executing in one thread at the same time? I don't think this should happen at all, ever.

Any pointers?

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Jan 12, 2021

@FranPregernik To have a better understanding about what could have gone wrong and what Brave is doing under the hood, try to follow the stacktrace (here's some help in case you need it). Let me help walking through the section that could help:

	at brave.LazySpan.span(LazySpan.java:141)
	at brave.LazySpan.context(LazySpan.java:40)
	at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
	at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)

In the sample above, you use a TraceableExecutorService as an ExecutorService implementation. If you look into that class, you will see that it wraps your Runnable into a TraceRunnable. So TraceRunnable.run is called because TraceableExecutorService is used.

As you move forward in the stacktrace, you can see that BraveTracer.nextSpan is called from TraceRunnable.run(TraceRunnable.java:61):

Span childSpan = this.tracer.nextSpan(this.parent).name(this.spanName);

This also shows where the LazySpan is coming from: this.parent = tracer.currentSpan();.
So if tracer.currentSpan() gives you the same instance of Span for two different TraceRunnable instances, through BraveTracer.nextSpan(BraveTracer.java:52), BraveSpan.context(BraveSpan.java:48), and LazySpan.context(LazySpan.java:40), LazySpan.span can be called concurrently.

Please let me know if this makes sense, also, since I'm not really able to repro this, this is still a guess.

@FranPregernik
Copy link
Author

FranPregernik commented Jan 14, 2021

Thank you for the feedback and mini tutorial.

The stack trace was the issue that confused me initially - it didn't show me (right away) where this LazySpan was coming from. Should have looked better.

Fingers crossed...

Update 2021-01-14 19:46 CEST

Almost certain that none of the LazySpans are shared. The context is shared.

The bug only triggers if there is a parent trace exists in BraveTracer.java:52

This bug never triggers if there is no sampler set up. In my project I had this sampler and in the demo I did not.

   @Bean
    Sampler sampler() {
        // necessary to prevent noop spans brave.Tracer.java:407
        return CountingSampler.create(0.02f);
    }

Still trying to trigger the assert in PendingSpans.java:89

@jonatan-ivanov
Copy link
Member

@FranPregernik Please let us know if you were able to repro the issue.

@FranPregernik
Copy link
Author

I am getting nowhere.

I have one more thing to try... I need to instrument the spring cloud sleuth code with some special logging in my project to be able to catch constructor and run methods of TraceRunnable and capture the contexts of them. I am hoping this will let me see how to reproduce the conditions in the demo project.

@jonatan-ivanov
Copy link
Member

@FranPregernik Please let us know how it went.

@rbieniek
Copy link

rbieniek commented Jan 24, 2021

I ran into a similar isuue when using the LazyTraceExecutor. With the exact same difficulties of having that problem being rather hard to reproduce.

I gotta test it more over the coming days, however bumping the dependency io.zipkin.brave dependencies from 5.13.2 to 5.13.3 seems to have fixed it in my case

@jonatan-ivanov
Copy link
Member

@rbieniek Thank for the info. @FranPregernik, could you please check if upgrading brave helps?

@FranPregernik
Copy link
Author

I upgraded the brave deps in the parent pom <dependancyManagement> block. Maven dep tree showed I was using 5.13.3 throughout. Then I ran the test suite continuously and sure enough I hit the assert brakepoint in PendingSpans class, so I conclude the issue exists in that code.
Maybe it has better internal error handling so the error does not pop up all the way to TraceRunnable... I will test that.

@jonatan-ivanov, @rbieniek I never thought I would witness a real live Heisenbug 🙄 .
Back to my original idea... I copied the source of the TraceRunnable into my project under the same package name and verified that because of the classpath config my class was loaded/used instead of the one in the lib, it was all good, constructor called, run called. I added simple instrumentation that writes the TraceRunnable hashCode in a map (key) and an object containing the 'this.parent' context and the current constructor stack trace in order to correlate the run method exception with the point of construction.

I then ran the test suite for 4 hours non-stop. No error...

I then removed the TraceRunnable and clean/compiled everything again. Ran the test suite and sure enough the error pops up again (assert breakpoint triggered).

So frustrating...

Any suggestions?

@FranPregernik
Copy link
Author

FranPregernik commented Jan 26, 2021

And I gave it another shot and caught it. I did not modify the TraceRunnable.run method in any way (like before) and I just logged the stack trace in the constructor.

Please double check my reasoning here.

So the regular (passing) case is when the context does not have the parent trace id - it is the local root:
01_regular_passing

With the AssertError caught the context is different - it is not the local root:
02_assert_will_be_thrown

Then I went up the call stack to the LazySpan.span() method (focus on the debuger variables):
03_walking_up_the_stack_to_span_method

So I called manually the tracer.toSpan(context) line - it returns a RealSpan without the AssertError:
04_calling_tracer_toSpan_in_debug

I tested the tracer.toSpan(context).context() call as well to see - also OK:
05_calling_tracer_toSpan_context_in_debug

Finally I tested the this.span() call just for the fun of it - No AssertError the second time around:
06_calling_this_span_no_error

So some race condition is happening but might not be the same as what we are expecting (unprotected this.delegate =).
I went to see the captured stack trace of the TraceRunnable constructor and I see that here we have a case where the TraceExecutorService runs one TraceRunnable (the root) and that one later on execues another TraceRunnable and here is where the error happens:
07_interesting_stack_trace

My TraceRunnable: https://gist.github.com/FranPregernik/0d0e0ddff5f0f642f8f12313f42667e1

@marcingrzejszczak
Copy link
Contributor

So do you think that the problem is actually in Sleuth? I mean the only thing that we're doing is delegating work to Brave 🤷

@FranPregernik
Copy link
Author

Not sure... could be.
Do you know any Brave devs that can check this issue here as a first step?

@marcingrzejszczak
Copy link
Contributor

You can file an issue in Brave and maybe link this one?

@jonatan-ivanov
Copy link
Member

@FranPregernik Let me close this issue, please @mention me or Marcin if you want us to reopen.

@andylintner
Copy link

FYI - I've submitted a fix for this to Brave in openzipkin/brave#1306

In the meantime, here's a hack for anyone else impacted to resolve it until that change can be merged and released:

@Aspect
class BraveTracerNextSpanRetryAspect {

    /*
     * Stupid hack to retry BraveTracer::nextSpan(span) because of a race condition wherein a span being loaded concurrent with the
     * span being flushed will throw an AssertionError. Since the flush will have completed, retrying succeeds.
     */
    @Around("execution(org.springframework.cloud.sleuth.Span org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(..))")
    public Object wrapNextSpan(ProceedingJoinPoint pjp) throws Throwable {
        try {
            return pjp.proceed();
        } catch (AssertionError e) {
            return pjp.proceed();
        }
    }
}

@FranPregernik
Copy link
Author

Thank you @andylintner

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants