Discussion: support for throttling write based on LSM Tree stats #7997
Comments
In general it is doable, but we need to be very careful to constrain the blast radius of stalling, because blocking ingest_batch means blocking the actor task. Under the current architecture, our system may not function well if the actor task is blocked for a long time. For example:
Stalling when the memtable (previously called the shared buffer) reaches a threshold is okay, since we know some progress (e.g. flush/spill) will be made for sure and the stalling won't last long. However, if we stall when compaction is lagging behind, I think it is harder to guarantee that there will be progress. In this case, we should make sure the stalling only affects the problematic MVs/tables, not others. Another option is not to stall at all but to pause/throttle the source instead.
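To make the "blocking the actor task" concern concrete, here is a highly simplified sketch (hypothetical types and names, not RisingWave's actual executor code): the actor handles messages one at a time, so a long stall inside ingest_batch also keeps it from collecting barriers queued behind the current chunk.

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::sleep};

// Hypothetical message type; real actors carry stream chunks and barriers.
enum Message {
    Chunk(Vec<u64>),
    Barrier(u64),
}

// Stand-in for the state store write path. If the stall condition (e.g. L0
// pressure) never clears, this await never completes.
async fn ingest_batch(_rows: Vec<u64>) {
    sleep(Duration::from_millis(10)).await;
}

async fn actor_loop(mut rx: mpsc::Receiver<Message>) {
    while let Some(msg) = rx.recv().await {
        match msg {
            // A stall here blocks the whole loop...
            Message::Chunk(rows) => ingest_batch(rows).await,
            // ...so barriers behind the chunk are not collected, which in turn
            // stalls checkpointing and recovery.
            Message::Barrier(epoch) => println!("collected barrier at epoch {epoch}"),
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let actor = tokio::spawn(actor_loop(rx));
    tx.send(Message::Chunk(vec![1, 2, 3])).await.unwrap();
    tx.send(Message::Barrier(42)).await.unwrap();
    drop(tx); // close the channel so the actor loop exits
    actor.await.unwrap();
}
```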
I agree that pausing/throttling the source is better than stalling.
Yes... I missed these drawbacks (again). Pausing/throttling the source: +1
Hmmm, even pausing the source is dangerous. The streaming operators can also trigger pausing due to scaling out or failure recovery, and the two mechanisms together sound very error-prone. I think the only correct way to do throttling is probably backpressure, and "currently ingest_batch will stall if shared buffer size threshold is exceeded" looks good to me. May I ask the reason for improving that? It's the most natural solution I can think of. Actually, there are very few things we can do when the system is overloaded.
Zheng Wang does not plan to fix "currently ingest_batch will stall if shared buffer size threshold is exceeded".
The implementation in #8049 ensures that pause/resume issued by barriers (scaling/recovery) has higher priority than the throttler, i.e. the throttler cannot resume a source that was paused by a barrier. (Actually, the current barrier-latency-based throttler in the main branch has this bug.)
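A minimal sketch of that priority rule (hypothetical names, not the actual code in #8049): the source is paused if either the barrier or the throttler has paused it, and a throttler resume only clears the throttler's own flag, so it can never resume a source that was paused by a barrier.

```rust
/// Sketch of pause/resume priority. Tracking the two reasons separately is
/// what prevents the throttler from cancelling a barrier-issued pause.
#[derive(Default, Debug)]
struct SourcePauseState {
    paused_by_barrier: bool,   // pause/resume issued by barrier (scaling/recovery)
    paused_by_throttler: bool, // pause/resume issued by the LSM-stats throttler
}

impl SourcePauseState {
    fn is_paused(&self) -> bool {
        self.paused_by_barrier || self.paused_by_throttler
    }

    fn barrier_pause(&mut self) {
        self.paused_by_barrier = true;
    }

    fn barrier_resume(&mut self) {
        self.paused_by_barrier = false;
    }

    fn throttler_pause(&mut self) {
        self.paused_by_throttler = true;
    }

    fn throttler_resume(&mut self) {
        // Only clears its own flag; a barrier-issued pause stays in effect.
        self.paused_by_throttler = false;
    }
}

fn main() {
    let mut state = SourcePauseState::default();
    state.barrier_pause();
    state.throttler_pause();
    state.throttler_resume();
    // Still paused: the throttler cannot override the barrier.
    assert!(state.is_paused());
    state.barrier_resume();
    assert!(!state.is_paused());
}
```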
We have seen cases where the compactor is abnormal and tens of thousands of SST files have accumulated in level 0 of the LSM tree by the time we notice it. By throttling earlier we can keep the LSM tree in good shape.
IIRC @Little-Wallace has also emphasized the risk of throttling based on LSM tree shape once. Any comments?
Yeah, I agree with the motivation. But instead of throttling from the source, can we do it locally? For example, stall the ingest_batch call.
Well, it's very likely to be correct now, but I'm afraid this may cause potential deep bugs in the future. Before this, the call hierarchy was clear.
But @hzxa21 thinks that there are some disadvantages if we do it locally: stalling ingest_batch directly means blocking the actor task.
👀 +1 for stalling locally. If we make it special for Source, should we also apply it to DML and even Chain? It's not easy to keep them in sync if we stall them manually, whereas this comes naturally with the back-pressure of local stalling.
How can we define "related"? Our system provides global snapshot consistency, so all tables that can be joined together should be considered related. Not caring about the freshness of some specific sources seems to make some queries less meaningful. Maybe we should implement per-database checkpointing and totally isolate them?
Do you mean that stalling
IMO the latency of a streaming job is a physical property and will always be there, as long as we have some sort of back-pressure mechanism. We're overloaded -> the barrier is stalled -> barrier latency increases -> more in-flight barriers -> less per-epoch data. This chain of effects should be natural and reasonable to the system. 🤔
Good point. After a second thought, blocking ingest_batch locally sounds acceptable to me. However, we still need to be careful about the condition of blocking ingest_batch.
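For the blocking condition itself, one possible shape (a sketch under assumed names, not RisingWave's actual API) is to have the write path await a watch channel carrying the latest L0 stats of the target LSM tree, so the stall is released as soon as compaction makes progress instead of relying on polling or a fixed sleep:

```rust
use tokio::sync::watch;

// Wait until the published L0 file count of the target LSM tree drops back
// below the threshold. `l0_file_count` is assumed to be refreshed whenever a
// new HummockVersion is observed.
async fn wait_for_l0_pressure(mut l0_file_count: watch::Receiver<usize>, max_l0_files: usize) {
    while *l0_file_count.borrow() > max_l0_files {
        // Wakes up when the sender publishes a new value (or is dropped).
        if l0_file_count.changed().await.is_err() {
            break; // publisher is gone; don't block the write path forever
        }
    }
}

#[tokio::main]
async fn main() {
    // Pretend L0 currently has 12k files and the threshold is 1k.
    let (tx, rx) = watch::channel(12_000usize);
    let blocked_write = tokio::spawn(wait_for_l0_pressure(rx, 1_000));

    // Compaction catches up and the observer publishes fresh stats.
    tx.send(500).unwrap();
    blocked_write.await.unwrap();
    println!("write unblocked");
}
```

An event-driven wait like this keeps the stall bounded by compaction progress rather than a fixed backoff; some timeout or cancellation path would presumably still be needed so that recovery is never blocked indefinitely.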
Currently ingest_batch will stall if shared buffer size threshold is exceeded.
We can add another similar stall trigger calculated from the HummockVersion stats, e.g. the L0 total file count or the L0 total sub-level count. This helps in cases where compaction is lagging behind; otherwise, too many files may already have been added to the LSM tree by the time we realize the issue.
Note that we may have multiple LSM trees. Since we know the table_id in StateTable, we can determine the target LSM tree too.
@Little-Wallace @Li0k @soundOfDestiny @wenym1 @hzxa21
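As a rough sketch of how the proposed trigger could be scoped per LSM tree (all names below are illustrative, not the real HummockVersion fields): look up the compaction group that the table writes to and compare only that group's L0 stats against the thresholds, so a lagging compaction group does not stall unrelated tables.

```rust
use std::collections::HashMap;

// Illustrative L0 statistics per LSM tree (compaction group), which would be
// derived from the observed HummockVersion.
#[derive(Clone, Copy)]
struct L0Stats {
    total_file_count: usize,
    total_sub_level_count: usize,
}

// Illustrative thresholds for the additional stall trigger.
struct StallConfig {
    max_l0_files: usize,
    max_l0_sub_levels: usize,
}

// Decide whether writes of `table_id` should stall, using only the stats of
// the LSM tree that this table writes to.
fn should_stall_write(
    table_id: u32,
    table_to_group: &HashMap<u32, u64>,
    group_l0_stats: &HashMap<u64, L0Stats>,
    config: &StallConfig,
) -> bool {
    let Some(group_id) = table_to_group.get(&table_id) else {
        return false; // unknown table: don't stall
    };
    let Some(stats) = group_l0_stats.get(group_id) else {
        return false; // no stats observed yet
    };
    stats.total_file_count > config.max_l0_files
        || stats.total_sub_level_count > config.max_l0_sub_levels
}

fn main() {
    let table_to_group = HashMap::from([(1001_u32, 2_u64), (1002, 3)]);
    let group_l0_stats = HashMap::from([
        (2_u64, L0Stats { total_file_count: 20_000, total_sub_level_count: 800 }),
        (3, L0Stats { total_file_count: 40, total_sub_level_count: 6 }),
    ]);
    let config = StallConfig { max_l0_files: 1_000, max_l0_sub_levels: 64 };

    // Only the table in the overloaded group is stalled.
    assert!(should_stall_write(1001, &table_to_group, &group_l0_stats, &config));
    assert!(!should_stall_write(1002, &table_to_group, &group_l0_stats, &config));
}
```

The same per-group decision could instead feed the source throttler discussed above rather than stalling ingest_batch locally; the scoping by compaction group applies either way.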