
Guarantee ForceFlush of BatchSpanProcessor will export all ended spans #2080

Closed
garrettwegan opened this issue Jul 13, 2021 · 4 comments · Fixed by #2335
Labels
enhancement New feature or request

Comments

@garrettwegan

Problem Statement

Currently, calling ForceFlush on a BatchSpanProcessor exports all spans that have been ended AND already consumed into the BSP's batch. However, spans ended before the call to ForceFlush are not guaranteed to have been consumed into the batch yet, so they may or may not be exported by ForceFlush as desired.

Proposed Solution

I would like ForceFlush to guarantee that any spans ended before the call are indeed exported once it returns.

I have attempted to modify ForceFlush myself to process the queue before exporting but have achieved only mixed results.

Prior Art

Using runtime.Gosched() either within ForceFlush or in user code before calling ForceFlush to yield execution and allow consumption to occur naturally is very likely to populate the batch with any ended spans, but is not guaranteed to do so.

Additional Context

A simple test case which would pass after such a change is as follows:

package bsp_test // hypothetical package name; declaration and imports added to make the example self-contained

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// Sometimes passes
func TestBSPForce(t *testing.T) {
	ctx := context.Background()

	exp := tracetest.NewInMemoryExporter()

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
	)

	tracer := tp.Tracer("tracer")

	_, span := tracer.Start(ctx, "span")
	span.End()

	// runtime.Gosched() - with this uncommented I have yet to see this test
	// fail, but I would like a stronger guarantee.
	err := tp.ForceFlush(ctx)
	assert.NoError(t, err)

	// Expect well-defined behavior due to calling ForceFlush.
	assert.Len(t, exp.GetSpans(), 1)
}

Intuitively, I would already expect this test case to pass; however, the current behavior of the BatchSpanProcessor makes its result inconsistent.

@wangzlei

wangzlei commented Jul 15, 2021

Yes, it seems to be a multi-threading bug; some data may still be in bsp.queue when ForceFlush() is called:

case sd := <-bsp.queue:
	bsp.batchMutex.Lock()
	bsp.batch = append(bsp.batch, sd)

@wangzlei

I guess the issue is caused by

case sd := <-bsp.queue:

The span is in sd but not yet in bsp.batch, while at the same time ForceFlush holds the batch lock.

@garrettwegan garrettwegan changed the title Guaranteed ForceFlush of BatchSpanProcessor will export all ended spans Guarantee ForceFlush of BatchSpanProcessor will export all ended spans Jul 15, 2021
@jlisee

jlisee commented Aug 2, 2021

If anyone else hits this issue: I was able to get the example above working by using WithSyncer instead of WithBatcher (or by directly using NewSimpleSpanProcessor). That produces reliable tests because the exporter is called synchronously as each span ends.
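For reference, a minimal sketch of that workaround, reusing the setup from the test above (WithSyncer registers the exporter behind a simple, synchronous processor):

// Workaround sketch: with a synchronous processor each span is exported as
// soon as End() is called, so no flush is needed for the test to be reliable.
exp := tracetest.NewInMemoryExporter()
tp := sdktrace.NewTracerProvider(
	sdktrace.WithSyncer(exp),
)

_, span := tp.Tracer("tracer").Start(context.Background(), "span")
span.End()

// exp.GetSpans() now deterministically contains the ended span.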


The issue is that we have two buffers: the bsp.queue channel and the batch currently being built up in bsp.batch. There is a goroutine that moves data from bsp.queue to bsp.batch, but ForceFlush only concerns itself with data in bsp.batch. The options I see for solving this:

  • Signal the background goroutine to empty bsp.queue, then wait for that to complete (maybe use a new pair of channels here)
  • Have the foreground ForceFlush empty bsp.queue itself while properly managing the locks

The former option seems the most Go-like, but it does increase the complexity of the processor a bit, especially during shutdown, to ensure nothing deadlocks waiting on a flush.
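For illustration, here is a rough sketch of the second option. The type and field names mirror the snippets quoted above rather than the actual SDK source (it assumes the surrounding sdk/trace package, i.e. ReadOnlySpan and a "sync" import are in scope), and as noted in the comments it does not fully close the race on its own:

// Simplified stand-in for the real batchSpanProcessor; only the fields used by
// this sketch are shown.
type batchSpanProcessor struct {
	queue      chan ReadOnlySpan
	batch      []ReadOnlySpan
	batchMutex sync.Mutex
}

// drainQueue pulls everything currently buffered in bsp.queue into bsp.batch;
// a ForceFlush implementing the second option would call it before exporting.
func (bsp *batchSpanProcessor) drainQueue() {
	for {
		select {
		case sd := <-bsp.queue:
			bsp.batchMutex.Lock()
			bsp.batch = append(bsp.batch, sd)
			bsp.batchMutex.Unlock()
		default:
			// The queue looks empty from this goroutine's point of view, but a
			// span the background goroutine has already dequeued and not yet
			// appended to the batch can still be missed, which is exactly the
			// window described earlier in this thread.
			return
		}
	}
}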

@pkwarren
Contributor

What about something like this?

Define the following type:

type forceFlushSpan struct {
	flushed chan struct{}
}

Then, in ForceFlush(), before we export the spans, add this logic:

		flushCh := make(chan struct{})
		select {
		case bsp.queue <- forceFlushSpan{flushCh}:
		case <-ctx.Done():
			return ctx.Err()
		}

		select {
		case <-flushCh:
			// Processed any items in queue prior to ForceFlush being called
		case <-ctx.Done():
			return ctx.Err()
		}

Then, in processQueue, handle this special type of read-only span as follows:

			if ffs, ok := sd.(forceFlushSpan); ok {
				close(ffs.flushed)
				return
			}

This way we ensure that anything in the queue at the time ForceFlush() is called will be written out, but we don't get into complex race conditions if traces continue to come through.

The main negative of this approach is another implementation of ReadOnlySpan (with a lot of boilerplate).
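That boilerplate could likely be trimmed by embedding the interface instead of re-implementing every method; a sketch of one possibility, not necessarily what the merged fix does:

// Embedding ReadOnlySpan promotes its method set, so forceFlushSpan satisfies
// the interface without hand-written stubs. The embedded value is never used;
// the struct only acts as a marker flowing through bsp.queue.
type forceFlushSpan struct {
	ReadOnlySpan
	flushed chan struct{}
}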

Aneurysm9 added a commit that referenced this issue Nov 5, 2021
* Empty queued spans when ForceFlush called

Update the implementation of ForceFlush() to first ensure that all spans
which are queued are added to the batch before calling export spans.
Create a small ReadOnlySpan implementation which can be used as a marker
that ForceFlush has been invoked and used to notify when all spans are
ready to be exported.

Fixes #2080.

* Add a changelog entry.

* Update CHANGELOG.md

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

* Update sdk/trace/batch_span_processor.go

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

* Improve test case to verify multiple flushes.

* Refactor code to use enqueue.

* Be more defensive on waiting for queue.

Update the handling of the force flush span so we only wait on the
channel if we were able to enqueue the span to the queue.

* Fix linter.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
@pellared pellared added this to the untracked milestone Nov 8, 2024