Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

forceFlush() and shutdown() operations hang, not honoring timeout when there is no network connection for whole session #449

Closed
dpasirst opened this issue Aug 5, 2023 · 9 comments · Fixed by #454
Assignees

Comments

@dpasirst
Copy link

dpasirst commented Aug 5, 2023

When attempting to either forceFlush() or shutdown() the traceProvider instance or LogRecordProcessors such that no network connection to the OpenTelemetry Collector endpoint was present from launch to exit, the respective function calls seemingly hang and do not honor the configured or specified timeouts as called for in the OpenTelemetry specification.

Environment:
macOS 13.4.1, OpenTelemetry endpoint is configured with https/TLS. OpenTelemetry-swift 1.5.1. Xcode building native macOS application.
Either - network is disabled on the MacBook from application launch through application exit (or network is operational but collector endpoint is unreachable from application launch through exit).

Note: if the collector was reachable when the application launched and then became unreachable upon application exit, the functions appear to behave as expected.

Simplified Example:

 OpenTelemetry.registerTracerProvider(tracerProvider: TracerProviderBuilder()
   .add(spanProcessor:spanProcessor)  // SimpleSpanProcessor 
   .with(resource: DefaultResources().get())
   .build())
logProcessors = [
    BatchLogRecordProcessor(
        logRecordExporter:OtlpLogExporter(channel: grpcChannel)
    )
]
OpenTelemetry.registerLoggerProvider(loggerProvider: LoggerProviderBuilder()
    .with(processors: logProcessors!)
        .with(resource: DefaultResources().get())
        .build())

...... <send some spans>
...... <send some logs>

let tracerProviderSDK = OpenTelemetry.instance.tracerProvider as? TracerProviderSdk
tracerProviderSDK?.forceFlush(timeout: 3) // <-- hangs - explicit timeout
tracerProviderSDK?.shutdown()  // <-- hangs - default timeout
logProcessors?.forEach { _ = $0.shutdown() }. // <--hangs 

Actual Result:
The application hangs and fails to exit.

Expected result:
The timeout (default or explicit) should be honored per the specification for shutdown() and forceFlush().

Other Notes:
I have not tried or tested Metrics and do not know if that is also impacted.

@bryce-b bryce-b self-assigned this Aug 10, 2023
@bryce-b
Copy link
Member

bryce-b commented Aug 24, 2023

Hi @dpasirst,
Apologies for not replying more quickly. I believe you're experiencing this issue because you are using the SimpleSpanProcessor, which isn't meant for production use. Can you try replacing it with the BatchSpanProcessor and see if the problem persists? I don't believe it will.

@dpasirst
Copy link
Author

dpasirst commented Aug 24, 2023

Hi @bryce-b ,

Short version: it still hangs with BatchSpanProcessor, so unfortunately, that did not work and the problem is not limited to the SimpleSpanProcessor.

Some additional details:
You are correct as I have code as follows:

#if DEBUG
            let spanProcessor = SimpleSpanProcessor(spanExporter: traceExporter)
#else
            let spanProcessor = BatchSpanProcessor(spanExporter: traceExporter)
#endif

However, modifying it to use BatchSpanProcessor for the DEBUG build did not fix the issue. Additionally, per the original ticket, even if the Span processing is disabled, the log processor also hangs and for that one, I'm using the BatchLogRecordProcessor as shared in the original ticket.

BTW:
If attempting to reproduce on macOS, you may need or want the following in the app delegate to force immediate termination - so you don't end up with the window disappearing but the process hanging/zombie'd in the background:

    func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
        return true
    }

@bryce-b
Copy link
Member

bryce-b commented Aug 25, 2023

Thanks for bringing this to our attention. I'll try to get this fixed asap.

@bryce-b
Copy link
Member

bryce-b commented Sep 7, 2023

@dpasirst I believe pr #454 fixed this issue. Can you confirm?

@dpasirst
Copy link
Author

dpasirst commented Sep 7, 2023

Negative, this issue should not be closed. It is not fixed.

I just tested using main and can confirm that it still hangs indefinitely with both BatchSpanProcessor and the BatchLogRecordProcessor.

span shutdown
testing from main branch with commit verified as present in the build:
 Sources/OpenTelemetrySdk/Trace/SpanProcessors/BatchSpanProcessor.swift
line 44 shutdown() is called which line 46 calls into the BatchWork shutdown (same file) line 117. Then on line 118, it calses forceFlush() (note: it will never get back to calling line 119). Line 122 is forceFlush() which on line 129 calls exportBatch(…). Line 132 is exportBatch with I see line 140 calls timeoutTimer.activate() then it gets to line 142 queue.waitUntilAllOperationsAreFinished() and hangs indefinitely - the timeoutTimer does not appear to do anything.

log shutdown
testing from main branch with commit verified as present in the build:
Sources/OpenTelemetrySdk/Logs/Processors/BatchLogRecordProcessor.swift
line 116 is func shutdown() and 117 calls fourceFlush passing the export timeout, which goes to line 105 to 112 calling exportBatch()

In exportBatch(), line 129 is the timeoutTimer.activate(), but when the execution processes line 131 queue.waitUntilAllOperationsAreFinished(), it hangs indefinetely. The timeoutTimer is not effective.

btw: it would be much nicer if in addition to this working, if func shutdown() for both spans and logs allowed an explicitTimeout override to be specified just like forceFlush(explicitTimeout: TimeInterval?) It would also make testing a fix go much faster.

@bryce-b bryce-b reopened this Sep 13, 2023
@bryce-b
Copy link
Member

bryce-b commented Sep 14, 2023

I've discovered that this is only an issue on macos, and not iOS, which is why I initially thought it solved.

@bryce-b
Copy link
Member

bryce-b commented Sep 19, 2023

@dpasirst, I believe this PR fixes the issue. Can you take a look?
#459

@dpasirst
Copy link
Author

@bryce-b
Partially Confirmed:
Specifically in my tests:
BatchSpanProcessor and BatchLogRecordProcessor achieved timeout in about 10 seconds
Even SimpleSpanProcessor achieved timeout in approx 30 seconds

However - those numbers did not change when a timeout was specified.
For example:
using BatchSpanProcessor

        let tracerProviderSDK = OpenTelemetry.instance.tracerProvider as? TracerProviderSdk
        tracerProviderSDK?.forceFlush(timeout: 3)

resulted in taking about 10 seconds, the same amount of time as calling tracerProviderSDK?.shutdown() without specifying a timeout. The same was trust using SimpleSpanProcessor where the forceFlush(timeout: 3) still took approx 30 seconds.

Thus, it does not appear to be honoring the timeout passed into forceFlush.

That said, this is a huge leap forward and while I would like to see this fully functioning as expected before closing this issue, I would happily take a release with this behavior over what is released today.

@bryce-b
Copy link
Member

bryce-b commented Sep 21, 2023

Alright, I've expanded the exporters to allow for explicitTimeouts.

@bryce-b bryce-b closed this as completed Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants