Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordingStream's brittle shutdown/flush behavior #5335

Open
teh-cmc opened this issue Feb 28, 2024 · 1 comment
Open

RecordingStream's brittle shutdown/flush behavior #5335

teh-cmc opened this issue Feb 28, 2024 · 1 comment
Labels
🪳 bug Something isn't working 🧑‍💻 dev experience developer experience (excluding CI) 💬 discussion 🪵 Log & send APIs Affects the user-facing API for all languages

Comments

@teh-cmc
Copy link
Member

teh-cmc commented Feb 28, 2024

Context

Facts:

  • RecordingStreams are ref-counted, Send & Sync.
  • When they get Dropped, they disconnect(), causing them to flush pending batches, join pending threads, flush sinks, etc.

These two facts combined together make for a pretty brittle shutdown behavior.

It's very easy to clone and send RecordingStreams in a bunch of background threads. If any of these threads were to outlive the main thread, then the Drop implementation of RecordingStream will turn into a noop (because strong_count() > 0), and user data will be lost.
A very nasty manifestation of this is e.g. sending a RecordingStream into an ephemeral thread that is meant to compute something heavy and then log it (e.g. calling into a dataloader).

Another example is RecordingStreams being stored in process- and thread- locals.

Python

In Python, the issue is mitigated by the fact that the SDK is in itself a kind of entity with its very own lifetime, and the recordings' respective lifetimes are themselves tied to the SDK's lifetime.

When the SDK shut downs, all the recordings tied to it are shut down too, triggering the all the flushing mechanisms.

Python SDK's shutdown logic:

#[pyfunction]
fn shutdown(py: Python<'_>) {
    re_log::debug!("Shutting down the Rerun SDK");
    // Release the GIL in case any flushing behavior needs to cleanup a python object.
    py.allow_threads(|| {
        for (_, recording) in all_recordings().drain() {
            recording.disconnect();
        }
        flush_garbage_queue();
    });
}

C++

In C and C++, the issue is actually made somewhat worse since the SDK keeps its own refcount of recording streams, and implements a destructor to automatically flush them.

I.e. C++ users have to worry about the refcount issues on both the Rust and C++ side.

C++ SDK's destructor:

RecordingStream::~RecordingStream() {
    if (_id != RR_REC_STREAM_CURRENT_RECORDING && _id != RR_REC_STREAM_CURRENT_BLUEPRINT) {
        rr_recording_stream_free(this->_id);
    }
}

Proposal

The SDK should always be an actual thing that gets instantiated (whether directly by the user or automagically behind the scenes), and all recordings created with that SDK should have their lifetimes tied to it.

This SDK object can never outlive main.
All SDKs rely on this object via FFI, so lifetime management is only implemented once on the Rust side.

On shutdown, all recordings tied to that SDK are flushed.
In the future, we might want to make it configurable (NO_WAIT, WAIT_FOR_FLUSH, etc).

@teh-cmc teh-cmc added 🪳 bug Something isn't working 💬 discussion 🧑‍💻 dev experience developer experience (excluding CI) 🪵 Log & send APIs Affects the user-facing API for all languages labels Feb 28, 2024
@teh-cmc
Copy link
Member Author

teh-cmc commented Feb 28, 2024

Even worse, sometimes that one RecordingStream in a rogue thread somewhere is actually expected to be the sole surviving instance of that recording; which means strong/weak handle based schemes are out of the picture... unless we make sure to keep at least one strong handle alive for long enough, somehow.

teh-cmc added a commit that referenced this issue Feb 29, 2024
Exposes raw `DataLoader`s to the Rust SDK through 2 new methods:
`RecordingStream::log_file_from_path` &
`RecordingStream::log_file_from_contents`.

Everything streams asynchronously, as expected.

There's no way to configure the behavior of the `DataLoader` at all,
yet. That comes next.

```rust
rec.log_file_from_path(filepath)?;
```

![image](https://github.com/rerun-io/rerun/assets/2910679/fac1514a-a00b-40c3-8950-df076080e1fc)

This highlighted some brittleness on the shutdown path, see:
- #5335 

---

Part of series of PR to expose configurable `DataLoader`s to our SDKs:
- #5327 
- #5328 
- #5330
- #5337
teh-cmc added a commit that referenced this issue Feb 29, 2024
Introduces `RecordingStream::clone_weak`, so `DataLoader` threads
started from the SDK don't prevent the recording from flushing once
`main` goes out of scope, all the while making sure that `DataLoader`s
run to completion.


![image](https://github.com/rerun-io/rerun/assets/2910679/94bf7b16-cf9b-40ce-a190-d255328af3f8)


- Mitigates #5335 

---

Part of series of PR to expose configurable `DataLoader`s to our SDKs:
- #5327 
- #5328 
- #5330
- #5337
- #5351
- #5355
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🪳 bug Something isn't working 🧑‍💻 dev experience developer experience (excluding CI) 💬 discussion 🪵 Log & send APIs Affects the user-facing API for all languages
Projects
None yet
Development

No branches or pull requests

1 participant