
Parent can only be null in a local root #1295

Open
FranPregernik opened this issue Mar 24, 2021 · 5 comments
@FranPregernik

Describe the bug

related issue: spring-cloud/spring-cloud-sleuth#1814

I am using Spring Cloud Sleuth 3.0.1 and Brave 5.3.12.
The bug is really sporadic, but it does cause the workflow engine (Flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.

The error is this:

Exception in thread "orders-flowable-async-job-executor-thread-4" java.lang.AssertionError: Bug (or unexpected call to internal code): parent can only be null in a local root!
	at brave.internal.recorder.PendingSpans.getOrCreate(PendingSpans.java:89)
	at brave.Tracer._toSpan(Tracer.java:410)
	at brave.Tracer.toSpan(Tracer.java:382)
	at brave.Tracer.toSpan(Tracer.java:376)
	at brave.LazySpan.span(LazySpan.java:141)
	at brave.LazySpan.context(LazySpan.java:40)
	at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
	at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I think it is on this line in TraceRunnable:
(screenshot: the tracer call in TraceRunnable.run)

I am of the opinion that TraceRunnable should not prevent the delegate Runnable from running in any way. We might benefit from a check before calling the nextSpan method, or from catching this exception somehow.
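A defensive wrapper along these lines could guarantee the delegate always runs. This is a hypothetical sketch, not Sleuth's actual TraceRunnable; the tracing step is modelled as a plain Runnable so the failure can be simulated:

```java
// Sketch: a runnable wrapper that never lets a tracing failure stop the
// wrapped task. SafeTraceRunnable and tracingSetup are illustrative names,
// not part of Sleuth or Brave.
public class SafeTraceRunnable implements Runnable {
    private final Runnable tracingSetup; // stands in for the nextSpan(...) call
    private final Runnable delegate;

    public SafeTraceRunnable(Runnable tracingSetup, Runnable delegate) {
        this.tracingSetup = tracingSetup;
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            tracingSetup.run(); // may throw AssertionError in the buggy case
        } catch (AssertionError e) {
            // Losing a span is preferable to never executing the next BPMN task.
            System.err.println("tracing failed, continuing: " + e.getMessage());
        }
        delegate.run();
    }

    public static void main(String[] args) {
        boolean[] ran = {false};
        new SafeTraceRunnable(
            () -> { throw new AssertionError("parent can only be null in a local root!"); },
            () -> ran[0] = true
        ).run();
        System.out.println("delegate ran: " + ran[0]); // prints: delegate ran: true
    }
}
```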

Sample

I can't reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.

Flowable can be set up with a custom implementation of a JobExecutor. I have it set up like so:

private AsyncExecutor initAsyncExecutor(String tenantId, final int maxConcurrent) {
    final DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
    asyncExecutor.setAutoActivate(false);

    // other setup, irrelevant to the case

    // customized default async job executor initialization with tracing info
    // org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor.initAsyncJobExecutionThreadPool
    asyncExecutor.setThreadPoolQueue(new ArrayBlockingQueue<>(queueSize));
    asyncExecutor.setExecutorService(
        new TraceableExecutorService(
            this.applicationContext,
            new ThreadPoolExecutor(
                asyncExecutor.getCorePoolSize(),
                asyncExecutor.getMaxPoolSize(),
                asyncExecutor.getKeepAliveTime(), TimeUnit.MILLISECONDS,
                asyncExecutor.getThreadPoolQueue(),
                new BasicThreadFactory.Builder()
                    .namingPattern(tenantId + "-flowable-async-job-executor-thread-%d")
                    .build()
            )
        )
    );

    return asyncExecutor;
}
@FranPregernik (Author)

From the related bug report ...

And I gave it another shot and caught it. I did not modify the TraceRunnable.run method in any way (like before) and I just logged the stack trace in the constructor.

Please double check my reasoning here.

So the regular (passing) case is when the context does not have the parent trace id, i.e. it is the local root:
(screenshot: 01_regular_passing)

With the AssertionError caught, the context is different: it is not the local root:
(screenshot: 02_assert_will_be_thrown)

Then I went up the call stack to the LazySpan.span() method (focus on the debugger variables):
(screenshot: 03_walking_up_the_stack_to_span_method)

So I manually called the tracer.toSpan(context) line; it returns a RealSpan without the AssertionError:
(screenshot: 04_calling_tracer_toSpan_in_debug)

I tested the tracer.toSpan(context).context() call as well to see; also OK:
(screenshot: 05_calling_tracer_toSpan_context_in_debug)

Finally I tested the this.span() call just for the fun of it: no AssertionError the second time around:
(screenshot: 06_calling_this_span_no_error)

So some race condition is happening, but it might not be the same one we were expecting (the unprotected `this.delegate =` assignment).
I went through the captured stack trace of the TraceRunnable constructor and I see a case where the TraceableExecutorService runs one TraceRunnable (the root), that one later executes another TraceRunnable, and that is where the error happens:
(screenshot: 07_interesting_stack_trace)

My TraceRunnable: https://gist.github.com/FranPregernik/0d0e0ddff5f0f642f8f12313f42667e1
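The nesting described above (a task on a wrapped executor submitting a second task to the same executor) can be reproduced in shape with plain JDK primitives. This sketch does not use Sleuth or Brave; it only models the submit-from-within-a-task pattern that triggers the error:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Model of the execution shape: the "root" task (first TraceRunnable)
// submits a nested task (second TraceRunnable) to the same pool.
public class NestedSubmit {
    static boolean runNested() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch done = new CountDownLatch(1);

        // Root task submits the nested task from inside the pool.
        pool.execute(() -> pool.execute(done::countDown));

        boolean finished = done.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("nested task finished: " + runNested()); // prints: nested task finished: true
    }
}
```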

@jcchavezs (Contributor)

Any chance you can check it out @anuraaga?

@FranPregernik (Author)

Hi @jcchavezs, @anuraaga!

Any update on this?

@r-lysak commented May 26, 2021

Hi @jcchavezs, @anuraaga, @FranPregernik

I have the same issue. Do you know how to fix it, or can you advise a workaround?

@andylintner (Contributor)

I'm impacted by the same issue. After digging in, it appears to happen due to a race condition within brave.Tracer wherein the span is available in the pendingSpans collection when the first line of toSpan completes, but is no longer present when _toSpan checks the collection.

I've added a test case illustrating the issue here: andylintner@86a9086
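The race described here is a classic check-then-act problem. The following is a simplified model, not Brave's actual Tracer code: one thread observes an entry in a concurrent map, another thread removes (flushes) it, and a second lookup then contradicts the first:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified model of the check-then-act race: a first lookup succeeds,
// a concurrent "flush" removes the entry, and a second lookup fails.
// Code that assumes the first check still holds hits the "impossible"
// branch, analogous to the AssertionError in PendingSpans.
public class CheckThenActRace {
    static String demo() throws InterruptedException {
        ConcurrentMap<String, String> pending = new ConcurrentHashMap<>();
        pending.put("span-1", "pending");

        // First lookup succeeds (the first line of toSpan in the report).
        boolean presentAtCheck = pending.containsKey("span-1");

        // Meanwhile another thread flushes the entry.
        Thread flusher = new Thread(() -> pending.remove("span-1"));
        flusher.start();
        flusher.join();

        // Second lookup now sees a different state.
        String atUse = pending.get("span-1");
        return presentAtCheck + " then " + (atUse == null ? "gone" : atUse);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // prints: true then gone
    }
}
```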

andylintner pushed a commit to andylintner/brave that referenced this issue Sep 7, 2021
Fixes openzipkin#1295
Previous to this fix, a call to Tracer::toSpan concurrent with a call
to flush the span from pendingSpans could result in an assertion error.

We now only fetch the pendingSpan once for a single call to toSpan.
codefromthecrypt pushed a commit to andylintner/brave that referenced this issue Dec 15, 2023
Fixes openzipkin#1295
Previous to this fix, a call to Tracer::toSpan concurrent with a call
to flush the span from pendingSpans could result in an assertion error.

We now only fetch the pendingSpan once for a single call to toSpan.
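The "fetch once" pattern the commit message describes can be sketched as follows. This is illustrative, not Brave's actual Tracer code: instead of checking the map and then re-reading it (two reads that can straddle a concurrent flush), the fixed shape reads once and acts on that single snapshot:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustration of the fix: one read, one decision, so a concurrent
// removal between "check" and "act" can no longer be observed.
public class FetchOnce {
    static final ConcurrentMap<String, String> pending = new ConcurrentHashMap<>();

    // Racy shape: containsKey and get may see different states.
    static String racy(String id) {
        if (pending.containsKey(id)) {
            return pending.get(id); // may be null if flushed in between
        }
        return "recreated:" + id;
    }

    // Fixed shape: a single read of the map, then act on the snapshot.
    static String fetchOnce(String id) {
        String value = pending.get(id);
        return value != null ? value : "recreated:" + id;
    }

    public static void main(String[] args) {
        System.out.println(fetchOnce("span-1")); // prints: recreated:span-1
        pending.put("span-1", "pending");
        System.out.println(fetchOnce("span-1")); // prints: pending
    }
}
```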