
Parent can only be null in a local root #1295

Open
FranPregernik opened this issue Mar 24, 2021 · 5 comments
@FranPregernik

Describe the bug

related issue: spring-cloud/spring-cloud-sleuth#1814

I am using Spring Cloud Sleuth 3.0.1 and Brave 5.3.12.
The bug is really sporadic, but it does cause the workflow engine (Flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.

The error is this:

Exception in thread "orders-flowable-async-job-executor-thread-4" java.lang.AssertionError: Bug (or unexpected call to internal code): parent can only be null in a local root!
	at brave.internal.recorder.PendingSpans.getOrCreate(PendingSpans.java:89)
	at brave.Tracer._toSpan(Tracer.java:410)
	at brave.Tracer.toSpan(Tracer.java:382)
	at brave.Tracer.toSpan(Tracer.java:376)
	at brave.LazySpan.span(LazySpan.java:141)
	at brave.LazySpan.context(LazySpan.java:40)
	at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
	at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I think it is on this line in TraceRunnable:
(screenshot: the tracer call in TraceRunnable.run)

I am of the opinion that TraceRunnable should not prevent the delegate Runnable from running in any way. We might benefit from a check before calling the nextSpan method, or from catching this exception somehow.
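A defensive wrapper along these lines could guarantee the delegate always runs. This is a hypothetical sketch, not Sleuth's actual TraceRunnable; the tracing step is modelled as a plain Runnable so the failure can be simulated:

```java
// Sketch: a runnable wrapper that never lets a tracing failure stop the
// wrapped task. SafeTraceRunnable and tracingSetup are illustrative names,
// not part of Sleuth or Brave.
public class SafeTraceRunnable implements Runnable {
    private final Runnable tracingSetup; // stands in for the nextSpan(...) call
    private final Runnable delegate;

    public SafeTraceRunnable(Runnable tracingSetup, Runnable delegate) {
        this.tracingSetup = tracingSetup;
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            tracingSetup.run(); // may throw AssertionError in the buggy case
        } catch (AssertionError e) {
            // Losing a span is preferable to never executing the next BPMN task.
            System.err.println("tracing failed, continuing: " + e.getMessage());
        }
        delegate.run();
    }

    public static void main(String[] args) {
        boolean[] ran = {false};
        new SafeTraceRunnable(
            () -> { throw new AssertionError("parent can only be null in a local root!"); },
            () -> ran[0] = true
        ).run();
        System.out.println("delegate ran: " + ran[0]); // prints: delegate ran: true
    }
}
```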

Sample

I can't reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.

Flowable can be set up with a custom implementation of a JobExecutor. I have it set up like so:

private AsyncExecutor initAsyncExecutor(String tenantId, final int maxConcurrent) {
    final DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
    asyncExecutor.setAutoActivate(false);

    // other setup, irrelevant to the case

    // customized default async job executor initialization with tracing info
    // org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor.initAsyncJobExecutionThreadPool
    asyncExecutor.setThreadPoolQueue(new ArrayBlockingQueue<>(queueSize));
    asyncExecutor.setExecutorService(
        new TraceableExecutorService(
            this.applicationContext,
            new ThreadPoolExecutor(
                asyncExecutor.getCorePoolSize(),
                asyncExecutor.getMaxPoolSize(),
                asyncExecutor.getKeepAliveTime(), TimeUnit.MILLISECONDS,
                asyncExecutor.getThreadPoolQueue(),
                new BasicThreadFactory.Builder()
                    .namingPattern(tenantId + "-flowable-async-job-executor-thread-%d")
                    .build()
            )
        )
    );

    return asyncExecutor;
}
@FranPregernik (Author)

From the related bug report ...

And I gave it another shot and caught it. I did not modify the TraceRunnable.run method in any way (like before) and I just logged the stack trace in the constructor.

Please double check my reasoning here.

So the regular (passing) case is when the context does not have the parent trace id, i.e. it is the local root:
(screenshot: 01_regular_passing)

With the AssertionError caught, the context is different: it is not the local root:
(screenshot: 02_assert_will_be_thrown)

Then I went up the call stack to the LazySpan.span() method (focus on the debugger variables):
(screenshot: 03_walking_up_the_stack_to_span_method)

So I manually called the tracer.toSpan(context) line; it returns a RealSpan without the AssertionError:
(screenshot: 04_calling_tracer_toSpan_in_debug)

I tested the tracer.toSpan(context).context() call as well to see; also OK:
(screenshot: 05_calling_tracer_toSpan_context_in_debug)

Finally I tested the this.span() call just for the fun of it: no AssertionError the second time around:
(screenshot: 06_calling_this_span_no_error)

So some race condition is happening, but it might not be the same one we were expecting (the unprotected `this.delegate =` assignment).
I went through the captured stack trace of the TraceRunnable constructor and I see a case where the TraceableExecutorService runs one TraceRunnable (the root), that one later executes another TraceRunnable, and that is where the error happens:
(screenshot: 07_interesting_stack_trace)

My TraceRunnable: https://gist.github.com/FranPregernik/0d0e0ddff5f0f642f8f12313f42667e1
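The nesting described above (a task on a wrapped executor submitting a second task to the same executor) can be reproduced in shape with plain JDK primitives. This sketch does not use Sleuth or Brave; it only models the submit-from-within-a-task pattern that triggers the error:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Model of the execution shape: the "root" task (first TraceRunnable)
// submits a nested task (second TraceRunnable) to the same pool.
public class NestedSubmit {
    static boolean runNested() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch done = new CountDownLatch(1);

        // Root task submits the nested task from inside the pool.
        pool.execute(() -> pool.execute(done::countDown));

        boolean finished = done.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("nested task finished: " + runNested()); // prints: nested task finished: true
    }
}
```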

@jcchavezs (Contributor)

Any chance you can check it out @anuraaga?

@FranPregernik (Author)

Hi @jcchavezs, @anuraaga!

Any update on this?

@r-lysak commented May 26, 2021

Hi @jcchavezs, @anuraaga, @FranPregernik

I have the same issue. Do you know how to fix it, or can you advise a workaround?

@andylintner (Contributor)

I'm impacted by the same issue. After digging in, it appears to happen due to a race condition within brave.Tracer wherein the span is available in the pendingSpans collection when the first line of toSpan completes, but is no longer present when _toSpan checks the collection.

I've added a test case illustrating the issue here: andylintner@86a9086
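The race described here is a classic check-then-act problem. The following is a simplified model, not Brave's actual Tracer code: one thread observes an entry in a concurrent map, another thread removes (flushes) it, and a second lookup then contradicts the first:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified model of the check-then-act race: a first lookup succeeds,
// a concurrent "flush" removes the entry, and a second lookup fails.
// Code that assumes the first check still holds hits the "impossible"
// branch, analogous to the AssertionError in PendingSpans.
public class CheckThenActRace {
    static String demo() throws InterruptedException {
        ConcurrentMap<String, String> pending = new ConcurrentHashMap<>();
        pending.put("span-1", "pending");

        // First lookup succeeds (the first line of toSpan in the report).
        boolean presentAtCheck = pending.containsKey("span-1");

        // Meanwhile another thread flushes the entry.
        Thread flusher = new Thread(() -> pending.remove("span-1"));
        flusher.start();
        flusher.join();

        // Second lookup now sees a different state.
        String atUse = pending.get("span-1");
        return presentAtCheck + " then " + (atUse == null ? "gone" : atUse);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // prints: true then gone
    }
}
```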

andylintner pushed a commit to andylintner/brave that referenced this issue Sep 7, 2021
Fixes openzipkin#1295
Previous to this fix, a call to Tracer::toSpan concurrent with a call
to flush the span from pendingSpans could result in an assertion error.

We now only fetch the pendingSpan once for a single call to toSpan.
codefromthecrypt pushed a commit to andylintner/brave that referenced this issue Dec 15, 2023
Fixes openzipkin#1295
Previous to this fix, a call to Tracer::toSpan concurrent with a call
to flush the span from pendingSpans could result in an assertion error.

We now only fetch the pendingSpan once for a single call to toSpan.
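The "fetch once" pattern the commit message describes can be sketched as follows. This is illustrative, not Brave's actual Tracer code: instead of checking the map and then re-reading it (two reads that can straddle a concurrent flush), the fixed shape reads once and acts on that single snapshot:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustration of the fix: one read, one decision, so a concurrent
// removal between "check" and "act" can no longer be observed.
public class FetchOnce {
    static final ConcurrentMap<String, String> pending = new ConcurrentHashMap<>();

    // Racy shape: containsKey and get may see different states.
    static String racy(String id) {
        if (pending.containsKey(id)) {
            return pending.get(id); // may be null if flushed in between
        }
        return "recreated:" + id;
    }

    // Fixed shape: a single read of the map, then act on the snapshot.
    static String fetchOnce(String id) {
        String value = pending.get(id);
        return value != null ? value : "recreated:" + id;
    }

    public static void main(String[] args) {
        System.out.println(fetchOnce("span-1")); // prints: recreated:span-1
        pending.put("span-1", "pending");
        System.out.println(fetchOnce("span-1")); // prints: pending
    }
}
```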