`RecordingStream`'s brittle shutdown/flush behavior #5335

teh-cmc · 2024-02-28T14:23:44Z

Context

Facts:

RecordingStreams are ref-counted, Send & Sync.
When they get Dropped, they disconnect(), causing them to flush pending batches, join pending threads, flush sinks, etc.

These two facts combined together make for a pretty brittle shutdown behavior.

It's very easy to clone and send RecordingStreams in a bunch of background threads. If any of these threads were to outlive the main thread, then the Drop implementation of RecordingStream will turn into a noop (because strong_count() > 0), and user data will be lost.
A very nasty manifestation of this is e.g. sending a RecordingStream into an ephemeral thread that is meant to compute something heavy and then log it (e.g. calling into a dataloader).

Another example is RecordingStreams being stored in process- and thread- locals.

Python

In Python, the issue is mitigated by the fact that the SDK is in itself a kind of entity with its very own lifetime, and the recordings' respective lifetimes are themselves tied to the SDK's lifetime.

When the SDK shut downs, all the recordings tied to it are shut down too, triggering the all the flushing mechanisms.

Python SDK's shutdown logic:

#[pyfunction]
fn shutdown(py: Python<'_>) {
    re_log::debug!("Shutting down the Rerun SDK");
    // Release the GIL in case any flushing behavior needs to cleanup a python object.
    py.allow_threads(|| {
        for (_, recording) in all_recordings().drain() {
            recording.disconnect();
        }
        flush_garbage_queue();
    });
}

C++

In C and C++, the issue is actually made somewhat worse since the SDK keeps its own refcount of recording streams, and implements a destructor to automatically flush them.

I.e. C++ users have to worry about the refcount issues on both the Rust and C++ side.

C++ SDK's destructor:

RecordingStream::~RecordingStream() {
    if (_id != RR_REC_STREAM_CURRENT_RECORDING && _id != RR_REC_STREAM_CURRENT_BLUEPRINT) {
        rr_recording_stream_free(this->_id);
    }
}

Proposal

The SDK should always be an actual thing that gets instantiated (whether directly by the user or automagically behind the scenes), and all recordings created with that SDK should have their lifetimes tied to it.

This SDK object can never outlive main.
All SDKs rely on this object via FFI, so lifetime management is only implemented once on the Rust side.

On shutdown, all recordings tied to that SDK are flushed.
In the future, we might want to make it configurable (NO_WAIT, WAIT_FOR_FLUSH, etc).

The text was updated successfully, but these errors were encountered:

teh-cmc · 2024-02-28T14:40:22Z

Even worse, sometimes that one RecordingStream in a rogue thread somewhere is actually expected to be the sole surviving instance of that recording; which means strong/weak handle based schemes are out of the picture... unless we make sure to keep at least one strong handle alive for long enough, somehow.

Exposes raw `DataLoader`s to the Rust SDK through 2 new methods: `RecordingStream::log_file_from_path` & `RecordingStream::log_file_from_contents`. Everything streams asynchronously, as expected. There's no way to configure the behavior of the `DataLoader` at all, yet. That comes next. ```rust rec.log_file_from_path(filepath)?; ``` ![image](https://github.com/rerun-io/rerun/assets/2910679/fac1514a-a00b-40c3-8950-df076080e1fc) This highlighted some brittleness on the shutdown path, see: - #5335 --- Part of series of PR to expose configurable `DataLoader`s to our SDKs: - #5327 - #5328 - #5330 - #5337

Introduces `RecordingStream::clone_weak`, so `DataLoader` threads started from the SDK don't prevent the recording from flushing once `main` goes out of scope, all the while making sure that `DataLoader`s run to completion. ![image](https://github.com/rerun-io/rerun/assets/2910679/94bf7b16-cf9b-40ce-a190-d255328af3f8) - Mitigates #5335 --- Part of series of PR to expose configurable `DataLoader`s to our SDKs: - #5327 - #5328 - #5330 - #5337 - #5351 - #5355

teh-cmc added 🪳 bug Something isn't working 💬 discussion 🧑‍💻 dev experience developer experience (excluding CI) 🪵 Log & send APIs Affects the user-facing API for all languages labels Feb 28, 2024

This was referenced Feb 28, 2024

SDK DataLoaders 4: working around shutdown brittleness #5337

Merged

SDK DataLoaders 1: barebones Rust support #5327

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`RecordingStream`'s brittle shutdown/flush behavior #5335

`RecordingStream`'s brittle shutdown/flush behavior #5335

teh-cmc commented Feb 28, 2024

teh-cmc commented Feb 28, 2024 •

edited

Loading

RecordingStream's brittle shutdown/flush behavior #5335

RecordingStream's brittle shutdown/flush behavior #5335

Comments

teh-cmc commented Feb 28, 2024

Context

Python

C++

Proposal

teh-cmc commented Feb 28, 2024 • edited Loading

`RecordingStream`'s brittle shutdown/flush behavior #5335

`RecordingStream`'s brittle shutdown/flush behavior #5335

teh-cmc commented Feb 28, 2024 •

edited

Loading