`RecordingStream`'s brittle shutdown/flush behavior
#5335
Labels: 🪳 bug, 🧑‍💻 dev experience, 💬 discussion, 🪵 Log & send APIs
Context
Facts:
- `RecordingStream`s are ref-counted, `Send` & `Sync`.
- When `Drop`ped, they `disconnect()`, causing them to flush pending batches, join pending threads, flush sinks, etc.
These two facts combined make for a pretty brittle shutdown behavior.
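To make the interaction of these two facts concrete, here is a minimal sketch using only the standard library. `Inner`, `StreamHandle` and `flush_counts` are hypothetical stand-ins invented for illustration, not the actual `RecordingStream` internals; the "flush" is just a counter bumped in `Drop`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical stand-in for the shared state behind a recording stream.
struct Inner {
    flushes: Arc<AtomicUsize>,
}

impl Drop for Inner {
    // Runs only once the *last* handle is released.
    fn drop(&mut self) {
        self.flushes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical ref-counted handle, cheap to clone and send across threads.
type StreamHandle = Arc<Inner>;

/// Returns the number of flushes observed
/// (while a worker still holds a clone, after the worker released it).
fn flush_counts() -> (usize, usize) {
    let flushes = Arc::new(AtomicUsize::new(0));
    let stream: StreamHandle = Arc::new(Inner {
        flushes: Arc::clone(&flushes),
    });

    // The worker keeps its clone alive until it is told to let go.
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let worker_stream = Arc::clone(&stream);
    let worker = thread::spawn(move || {
        release_rx.recv().unwrap();
        drop(worker_stream);
    });

    // The "main" handle goes away, but nothing is flushed yet.
    drop(stream);
    let while_worker_alive = flushes.load(Ordering::SeqCst);

    release_tx.send(()).unwrap();
    worker.join().unwrap();
    let after_worker_done = flushes.load(Ordering::SeqCst);

    (while_worker_alive, after_worker_done)
}
```

Calling `flush_counts()` from `main` shows that dropping the main handle does nothing while the worker still holds a clone. The sketch joins the worker so the flush eventually runs; in a real program where `main` returns first, the process exits and the remaining handle's `Drop` never runs.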
It's very easy to clone and send `RecordingStream`s to a bunch of background threads. If any of these threads were to outlive the main thread, then the `Drop` implementation of `RecordingStream` will turn into a no-op (because `strong_count() > 0`), and user data will be lost.
A very nasty manifestation of this is e.g. sending a `RecordingStream` into an ephemeral thread that is meant to compute something heavy and then log it (e.g. calling into a dataloader).
Another example is `RecordingStream`s being stored in process- and thread-locals.
Python
In Python, the issue is mitigated by the fact that the SDK is in itself a kind of entity with its very own lifetime, and the recordings' respective lifetimes are themselves tied to the SDK's lifetime.
When the SDK shuts down, all the recordings tied to it are shut down too, triggering all the flushing mechanisms.
Python SDK's shutdown logic:
C++
In C and C++, the issue is actually made somewhat worse since the SDK keeps its own refcount of recording streams, and implements a destructor to automatically flush them.
I.e. C++ users have to worry about the refcount issues on both the Rust and C++ side.
C++ SDK's destructor:
Proposal
The SDK should always be an actual thing that gets instantiated (whether directly by the user or automagically behind the scenes), and all recordings created with that SDK should have their lifetimes tied to it.
This SDK object can never outlive main.
All SDKs rely on this object via FFI, so lifetime management is only implemented once on the Rust side.
On shutdown, all recordings tied to that SDK are flushed.
In the future, we might want to make this shutdown behavior configurable (NO_WAIT, WAIT_FOR_FLUSH, etc.).
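The core of the proposal could look roughly like the sketch below: a single object owns a registry of every recording it created and flushes them all when it shuts down, regardless of how many clones are still held elsewhere. All names here (`Sdk`, `Recording`, `new_recording`, `shutdown`, `flushed_after_sdk_drop`) are hypothetical and chosen for illustration; this is not the existing SDK API.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical per-recording state: a pending batch, plus what reached the sink.
#[derive(Default)]
struct Recording {
    pending: Mutex<Vec<String>>,
    flushed: Mutex<Vec<String>>,
}

impl Recording {
    fn log(&self, msg: &str) {
        self.pending.lock().unwrap().push(msg.to_owned());
    }

    /// Moves everything pending into the (pretend) sink.
    fn flush(&self) {
        let mut pending = self.pending.lock().unwrap();
        self.flushed.lock().unwrap().extend(pending.drain(..));
    }
}

/// Hypothetical SDK object: the registry of all recordings it created.
#[derive(Default)]
struct Sdk {
    recordings: Mutex<Vec<Arc<Recording>>>,
}

impl Sdk {
    fn new_recording(&self) -> Arc<Recording> {
        let rec = Arc::new(Recording::default());
        self.recordings.lock().unwrap().push(Arc::clone(&rec));
        rec
    }

    /// Flushes every recording tied to this SDK, whatever its refcount.
    fn shutdown(&self) {
        for rec in self.recordings.lock().unwrap().iter() {
            rec.flush();
        }
    }
}

impl Drop for Sdk {
    fn drop(&mut self) {
        self.shutdown();
    }
}

/// Logs one message, leaves a clone alive elsewhere, then drops the SDK.
fn flushed_after_sdk_drop() -> Vec<String> {
    let sdk = Sdk::default();
    let rec = sdk.new_recording();
    let held_elsewhere = Arc::clone(&rec); // e.g. stuck in a background thread
    rec.log("hello");
    drop(rec);
    drop(sdk); // flush happens here, even though a clone is still alive
    let out = held_elsewhere.flushed.lock().unwrap().clone();
    out
}
```

The sketch only covers the "flush everything on shutdown" part. How the SDK object is prevented from outliving main, and how a policy such as NO_WAIT or WAIT_FOR_FLUSH would be passed to `shutdown`, are left open here.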