This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

Flush any pending buffers on close() #54

Closed
wants to merge 1 commit into master from flush-on-close

Conversation

@ringerc (Contributor) commented Feb 5, 2018

The RemoteReporter buffers spans but did not flush them on close(), so any
spans Finish()ed between the last flush interval and the Close() of the
Tracer would be lost.

Fixes #52

Signed-off-by: Craig Ringer craig@2ndquadrant.com

@codecov (bot) commented Feb 5, 2018

Codecov Report

Merging #54 into master will decrease coverage by 0.5%.
The diff coverage is 29.41%.


@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   88.46%   87.96%   -0.51%     
==========================================
  Files          93       93              
  Lines        2246     2260      +14     
==========================================
+ Hits         1987     1988       +1     
- Misses        259      272      +13
Impacted Files                                     Coverage Δ
src/jaegertracing/reporters/RemoteReporter.h       100% <ø> (ø) ⬆️
src/jaegertracing/reporters/Reporter.h             66.66% <0%> (-33.34%) ⬇️
src/jaegertracing/reporters/RemoteReporter.cpp     66.17% <30%> (-7.16%) ⬇️
src/jaegertracing/Tracer.h                         82% <33.33%> (-4.32%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c36adcf...866782d

@rnburn (Contributor) commented Feb 5, 2018

BTW, is calling close() automatically like this in the reporter's destructor really a good idea?
https://github.com/jaegertracing/cpp-client/blob/master/src/jaegertracing/reporters/RemoteReporter.h#L45
It can cause the process to take much longer to exit than it otherwise would.
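
For reference, the pattern being questioned is roughly the following. This is a minimal sketch with illustrative member names, not the actual RemoteReporter code:

#include <condition_variable>
#include <mutex>
#include <thread>

class ReporterSketch {
public:
    ~ReporterSketch() { close(); }  // destructor forces a full close

    void close() noexcept
    {
        // Stop the background flush thread and join it. If spans are
        // still queued, the join waits until they have been sent,
        // which can noticeably delay process exit.
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _running = false;
        }
        _cv.notify_one();
        if (_thread.joinable()) {
            _thread.join();
        }
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _running = true;
    std::thread _thread;
};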

@ringerc (Contributor, Author) commented Feb 5, 2018 via email

@ringerc ringerc force-pushed the flush-on-close branch 2 times, most recently from 1102a98 to 6ac58e8 on February 7, 2018 at 07:41
@ringerc (Contributor, Author) commented Feb 7, 2018

OK, so I'm addressing #53 instead, with an extension Tracer.Flush() method to allow explicit flushes.

@ringerc (Contributor, Author) commented Feb 7, 2018

Hrm, that doesn't seem to do the job; I'm still losing spans. The general approach seems sensible: forcing a blocking flush via a separate API call. I just haven't figured out how to do it yet through all the layers involved.

@ringerc (Contributor, Author) commented Feb 7, 2018

Maybe something like

void RemoteReporter::flush()
{
    async_flush();
    std::unique_lock<std::mutex> lock(_mutex);
    _cv.wait(lock, [this]() {
        return !_running || _queue.empty();
    });
}

?

@isaachier (Contributor) commented

Actually, I see the issue is that I didn't add a "poison pill" to the end of the queue and instead used a boolean running flag, which means the shutdown is immediate. I will patch and submit a new PR. Thanks for the help here, but I will be taking a different approach, the one used in the Go client.
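
For illustration, the poison-pill pattern looks roughly like this. This is a sketch with made-up names, not the actual Go or C++ client code: the worker keeps draining the queue until it dequeues a sentinel, so spans enqueued before shutdown are still sent, whereas a plain boolean flag lets the loop stop with items still queued.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

struct Item {
    bool poison;       // sentinel marking the end of work
    std::string span;  // stand-in for a real serialized span
};

class WorkQueue {
public:
    void push(Item item)
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _items.push_back(std::move(item));
        }
        _cv.notify_one();
    }

    // Runs on the worker thread: sends every item queued before the pill.
    void drain()
    {
        for (;;) {
            std::unique_lock<std::mutex> lock(_mutex);
            _cv.wait(lock, [this]() { return !_items.empty(); });
            Item item = std::move(_items.front());
            _items.pop_front();
            lock.unlock();
            if (item.poison) {
                return;  // everything enqueued before the pill was sent
            }
            send(item.span);
        }
    }

    void shutdown() { push({true, {}}); }

private:
    void send(const std::string&) { /* emit over the transport */ }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<Item> _items;
};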

@isaachier (Contributor) commented

Actually, I see that we do check for an empty queue in the thread's main loop, so I'm not sure why this isn't effectively the same thing. Also, now that I've read #53, I see this issue has further-reaching goals than just fixing this local problem. It might be worth adding an eager flush mechanism as you suggested.

@isaachier isaachier reopened this Feb 7, 2018

@isaachier (Contributor) left a review comment:

This makes sense to me, but I want to discuss the overall idea of eager flushing with the Jaeger devs. Hopefully we will merge soon. Thanks for the help.

@@ -194,6 +194,11 @@ class Tracer : public opentracing::Tracer,

void close() noexcept { Close(); }

void Flush()

Contributor commented on this diff:

I only use title-case function names for opentracing-cpp methods. I'd prefer flush here unless it is added to the OT implementation and this is an override.

Contributor Author replied:

Fixed in latest

@ringerc (Contributor, Author) commented Feb 8, 2018

I'm still seeing intermittent lost spans with this patch, though less frequently.

I just did some test runs with Wireshark observing the UDP communications. The lost spans definitely never seem to get sent on the wire; they're not just being misplaced in the collector. The span name / operation name never appears on the wire at all.

The cause is uncertain as yet. I'm having trouble debugging effectively and may have to write a small test program to see if I can reproduce it.

In case it's relevant, my spans seem to be almost exactly 10s long if they do arrive. This is ... aberrant. Tomorrow I will examine whether the app is doing something odd there or whether the span timestamps are bogus.

Extend the OpenTracing API with an explicit jaegertracing::Tracer::flush()
method to force spans to be flushed eagerly without closing the tracer. It
returns only when the spans are flushed.

To support this, a new condition variable is introduced in the reporter to allow
the main thread to wait on notification from the reporter flush thread.

Fixes jaegertracing#53

Call flush() from Close(), but not from the Tracer dtor, so we follow the spec
and ensure buffers are flushed only on an explicit Close.

Fixes jaegertracing#52

Signed-off-by: Craig Ringer <craig@2ndquadrant.com>
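
For context, here is a rough caller-side sketch of what this commit enables. The setup code and the dynamic_cast are assumptions for illustration; flush() is the extension described in the commit message above:

#include <memory>

#include <jaegertracing/Tracer.h>
#include <opentracing/tracer.h>

void doShortLivedWork()
{
    auto tracer = opentracing::Tracer::Global();
    auto span = tracer->StartSpan("short-lived-work");
    span->Finish();

    // flush() is not part of the opentracing-cpp interface, so the
    // concrete jaegertracing::Tracer type is needed to call it. It
    // returns only once the buffered spans have been flushed.
    if (auto* jaegerTracer =
            dynamic_cast<jaegertracing::Tracer*>(tracer.get())) {
        jaegerTracer->flush();
    }
}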
@ringerc (Contributor, Author) commented Feb 9, 2018

OK, modified as discussed. It now flushes buffers on explicit Close() only.

My implementation of RemoteReporter::flush() was pretty much completely wrong. It is fixed now: flush() waits on its own condition variable, and the reporter thread signals that condition variable after each wakeup (whether or not a span was flushed).
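
In outline, that corrected flush looks something like this. This is a sketch of the description above rather than the exact patch; _condFlush is an illustrative name for the reporter's dedicated flush condition variable:

void RemoteReporter::flush()
{
    std::unique_lock<std::mutex> lock(_mutex);
    _cv.notify_one();  // wake the reporter thread if it is sleeping
    // Wait on the dedicated flush condition variable, which the
    // reporter thread notifies after every wakeup (whether or not a
    // span was actually sent), until the queue drains or we stop.
    _condFlush.wait(lock, [this]() {
        return !_running || _queue.empty();
    });
}

// In the reporter thread's main loop, after each wakeup and send attempt:
//     _condFlush.notify_all();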

@ringerc (Contributor, Author) commented Feb 9, 2018

@isaachier Should be ready for review/merge

@isaachier (Contributor) left a review comment:

I think a lot of this is overkill. Instead of adding an explicit flush, maybe we can just add an option to flush everything synchronously. That way we can avoid spawning a thread altogether.
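
One way to read that suggestion is a reporter mode that sends each span as it is reported, so nothing is ever left buffered and no background thread exists. This is a sketch with stand-in types, not the actual jaeger-client-cpp Reporter interface or the eventual fix:

struct Span {};

struct Sender {
    void append(const Span&) { /* marshal the span into the current batch */ }
    void flush() { /* emit the batch over UDP/HTTP */ }
};

class SyncReporter {
public:
    explicit SyncReporter(Sender& sender) : _sender(sender) {}

    void report(const Span& span)
    {
        _sender.append(span);  // may block on the transport
        _sender.flush();       // nothing stays buffered between spans
    }

    void close() { _sender.flush(); }

private:
    Sender& _sender;
};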

{
    std::unique_lock<std::mutex> lock(_mutex);
    _cv_flush.wait(lock, [this]() {
        async_flush();

Contributor commented on this excerpt:

This definitely looks wrong. The lambda here is only meant to express a boolean condition that guards against spurious wakeups. For example, instead of while (!running) { cv.wait(lock); } you can write cv.wait(lock, [&running] { return running; });. Doing real work such as async_flush() inside the predicate will potentially lead to many bugs in the program.
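
For comparison, a correct predicate-based wait only tests state inside the lambda. This is a self-contained sketch; the done flag is illustrative:

#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool done = false;

void waitWithExplicitLoop()
{
    std::unique_lock<std::mutex> lock(mtx);
    while (!done) {  // the loop guards against spurious wakeups
        cv.wait(lock);
    }
}

void waitWithPredicate()
{
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, []() { return done; });  // equivalent to the loop above
}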

@isaachier (Contributor) commented

Fixed in #59.

@isaachier isaachier closed this Feb 12, 2018
@ringerc (Contributor, Author) commented Feb 13, 2018 via email

Successfully merging this pull request may close these issues:

cpp-client loses spans on short-lived processes; Close() not flushing buffers?
3 participants