
Discussion: support for throttling write based on LSM Tree stats #7997

Closed
zwang28 opened this issue Feb 17, 2023 · 14 comments · Fixed by #8383

zwang28 (Contributor) commented Feb 17, 2023

Currently, ingest_batch will stall if the shared buffer size threshold is exceeded.
We can add another, similar stall trigger calculated from the HummockVersion stats, e.g. the total number of L0 files or the total number of L0 sub-levels. This helps when compaction is lagging behind; otherwise, by the time we eventually notice the issue, too many files may already have been added to the LSM tree.

Note that we may have multiple LSM trees. Since we know the table_id in StateTable, we can also determine the target LSM tree.
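
A minimal sketch of what such a stats-based trigger could look like (the type and field names here are illustrative placeholders, not the actual Hummock/HummockVersion API):

```rust
// Hypothetical sketch only: LsmStats and WriteLimit are illustrative names,
// not real RisingWave/Hummock types.
#[derive(Debug, Clone, Copy)]
struct LsmStats {
    l0_file_count: usize,
    l0_sub_level_count: usize,
}

#[derive(Debug, Clone, Copy)]
struct WriteLimit {
    max_l0_file_count: usize,
    max_l0_sub_level_count: usize,
}

/// Returns true if writes to this LSM tree (identified elsewhere by table_id)
/// should be throttled, based on the latest version stats.
fn should_throttle(stats: &LsmStats, limit: &WriteLimit) -> bool {
    stats.l0_file_count > limit.max_l0_file_count
        || stats.l0_sub_level_count > limit.max_l0_sub_level_count
}

fn main() {
    let stats = LsmStats { l0_file_count: 12_000, l0_sub_level_count: 400 };
    let limit = WriteLimit { max_l0_file_count: 1_000, max_l0_sub_level_count: 300 };
    // Compaction is far behind here, so the trigger fires.
    assert!(should_throttle(&stats, &limit));
    println!("throttle: {}", should_throttle(&stats, &limit));
}
```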

@Little-Wallace @Li0k @soundOfDestiny @wenym1 @hzxa21

github-actions bot added this to the release-0.1.18 milestone on Feb 17, 2023
zwang28 added the component/storage label on Feb 17, 2023
hzxa21 (Collaborator) commented Feb 17, 2023

In general it is doable, but we need to be very careful to constrain the blast radius of stalling, because blocking ingest_batch means blocking the actor task. Under the current architecture, our system may not function well if the actor task is blocked for a long time. For example:

  • barrier cannot flow through the streaming graph if an actor is blocked
  • checkpoint of non-related MV/Tables will be blocked
  • recovery won't work if there are any other failures
  • ...

Stalling when the memtable (previously called the shared buffer) reaches a threshold is okay, since we know for sure that some progress (e.g. flush/spill) can be made and the stall won't last long. However, if we stall when compaction is lagging behind, I think it is harder to guarantee that there will be progress. In this case, we should make sure the stalling only affects the problematic MV/Tables, not others. Another option is not to stall ingest_batch but to pause/throttle the source (without pausing the barrier and the actor task).
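
For illustration, a per-table decision like the one described above could look roughly like this (hypothetical types; the real mapping from table_id to LSM tree/compaction group is not shown):

```rust
use std::collections::{HashMap, HashSet};

// Illustrative sketch only; TableId and L0Stats are placeholders, not actual
// RisingWave types.
type TableId = u32;

struct L0Stats {
    file_count: usize,
}

/// Pick out only the problematic tables, so that throttling (e.g. pausing
/// their sources) does not affect unrelated MVs/tables.
fn tables_to_throttle(stats: &HashMap<TableId, L0Stats>, max_l0_files: usize) -> HashSet<TableId> {
    stats
        .iter()
        .filter(|(_, s)| s.file_count > max_l0_files)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    let mut stats = HashMap::new();
    stats.insert(1, L0Stats { file_count: 50 });
    stats.insert(2, L0Stats { file_count: 5_000 });
    let throttled = tables_to_throttle(&stats, 1_000);
    // Only table 2 is behind on compaction, so only it gets throttled.
    assert!(throttled.contains(&2) && !throttled.contains(&1));
    println!("throttled tables: {:?}", throttled);
}
```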

soundOfDestiny (Contributor) commented:

In general it is doable, but we need to be very careful to constrain the blast radius of stalling, because blocking ingest_batch means blocking the actor task. [...] Another option is not to stall ingest_batch but to pause/throttle the source (without pausing the barrier and the actor task).

I agree that pausing/throttling the source is better than stalling ingest_batch.

zwang28 changed the title from "Discussion: support for stalling ingest_batch based on LSM Tree stats" to "Discussion: support for throttling write based on LSM Tree stats" on Feb 17, 2023
zwang28 (Contributor, Author) commented Feb 17, 2023

if we stall when compaction is lagging behind, I think it is harder to guarantee that there will be progress

barrier cannot flow through the streaming graph if an actor is blocked

Yes, I missed these drawbacks (again).

+1 for pausing/throttling the source.

fuyufjh (Member) commented Feb 22, 2023

Hmmm, even pausing the source is dangerous. The streaming operators can trigger pausing due to scaling out or failure recovery. The 2 mechanisms together sound very error-prone.

I think the only correct way to do throttling is probably by backpressure, and "currently ingest_batch will stall if the shared buffer size threshold is exceeded" looks good to me. May I ask the reason for improving that? That's the most natural solution I can think of. Actually, there are very few things we can do when the system is overloaded.

soundOfDestiny (Contributor) commented:

I think the only correct way to do throttling is probably by backpressure, and "currently ingest_batch will stall if the shared buffer size threshold is exceeded" looks good to me. May I ask the reason for improving that?

Zheng Wang does not plan to change the existing behavior ("currently ingest_batch will stall if the shared buffer size threshold is exceeded").
He means that he wants ingest_batch to stall when either the shared buffer size threshold or the L0 SST size threshold is exceeded.

zwang28 (Contributor, Author) commented Feb 22, 2023

The streaming operators can trigger pausing due to scaling out or failure recovery. The 2 mechanisms together sound very error-prone.

The implementation in #8049 ensures that a pause/resume issued by a barrier (scaling/recovery) has higher priority than the throttler, i.e. the throttler cannot resume a source that was paused by a barrier. (Actually, the current barrier-latency-based throttler in the main branch has this bug.)
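
A toy sketch of that priority rule (illustrative names only, not the actual #8049 implementation): a barrier-issued pause and a throttler-issued pause are tracked separately, and the source stays paused until both have been lifted.

```rust
// Hypothetical sketch: a throttler resume alone cannot wake up a source that
// a barrier (scaling/recovery) has paused.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PauseReason {
    Barrier,
    Throttler,
}

#[derive(Default)]
struct SourceState {
    paused_by_barrier: bool,
    paused_by_throttler: bool,
}

impl SourceState {
    fn pause(&mut self, reason: PauseReason) {
        match reason {
            PauseReason::Barrier => self.paused_by_barrier = true,
            PauseReason::Throttler => self.paused_by_throttler = true,
        }
    }

    /// A resume only clears the pause issued for the same reason; the source
    /// remains paused while any higher-priority pause is still outstanding.
    fn resume(&mut self, reason: PauseReason) {
        match reason {
            PauseReason::Barrier => self.paused_by_barrier = false,
            PauseReason::Throttler => self.paused_by_throttler = false,
        }
    }

    fn is_paused(&self) -> bool {
        self.paused_by_barrier || self.paused_by_throttler
    }
}

fn main() {
    let mut source = SourceState::default();
    source.pause(PauseReason::Barrier);
    source.pause(PauseReason::Throttler);
    // The throttler alone cannot resume a source paused by a barrier.
    source.resume(PauseReason::Throttler);
    assert!(source.is_paused());
    source.resume(PauseReason::Barrier);
    assert!(!source.is_paused());
}
```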

zwang28 (Contributor, Author) commented Feb 22, 2023

"currently ingest_batch will stall if the shared buffer size threshold is exceeded" looks good to me. May I ask the reason

We have seen cases where the compactor is abnormal and tens of thousands of SST files have accumulated in level 0 of the LSM tree by the time we notice it. By throttling earlier, we can keep the LSM tree in good shape.

zwang28 (Contributor, Author) commented Feb 22, 2023

IIRC @Little-Wallace has also emphasized the risk of throttling based on LSM tree shape before. Any comments?

fuyufjh (Member) commented Feb 22, 2023

We have seen cases where the compactor is abnormal and tens of thousands of SST files have accumulated in level 0 of the LSM tree by the time we notice it. By throttling earlier, we can keep the LSM tree in good shape.

Yeah, I agree with the motivation. But instead of throttling from the source, can we do it locally? For example, stall ingest_batch itself, so that the backpressure gradually propagates and slows down the whole streaming job.
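
A rough sketch of such a local stall (placeholder names, not the real ingest_batch code path): the write path blocks until the observed L0 file count drops back under the limit, and that blocking is what propagates backpressure upstream.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Illustrative sketch only; L0Gate is a hypothetical helper, not a real
// storage type. The writer stalls until compaction reduces the L0 file count.
struct L0Gate {
    file_count: Mutex<usize>,
    changed: Condvar,
    max_files: usize,
}

impl L0Gate {
    /// Blocks the caller while the L0 file count exceeds the limit.
    fn wait_until_writable(&self) {
        let mut count = self.file_count.lock().unwrap();
        while *count > self.max_files {
            count = self.changed.wait(count).unwrap();
        }
    }

    /// Called when a new version is observed (e.g. compaction made progress).
    fn update(&self, new_count: usize) {
        *self.file_count.lock().unwrap() = new_count;
        self.changed.notify_all();
    }
}

fn main() {
    let gate = Arc::new(L0Gate {
        file_count: Mutex::new(2_000),
        changed: Condvar::new(),
        max_files: 1_000,
    });

    let g = Arc::clone(&gate);
    let writer = std::thread::spawn(move || {
        g.wait_until_writable(); // stalls until compaction catches up
        println!("write admitted");
    });

    // Simulate compaction making progress.
    gate.update(500);
    writer.join().unwrap();
}
```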

fuyufjh (Member) commented Feb 22, 2023

The implementation in #8049 ensures that a pause/resume issued by a barrier (scaling/recovery) has higher priority than the throttler, i.e. the throttler cannot resume a source that was paused by a barrier.

Well, it's very likely to be correct now, but I'm afraid this may cause deep bugs in the future. Before this, the call hierarchy was clear: compute -> storage, but this PR breaks that convention by introducing a reversed control flow. I just feel terrible about it :(

soundOfDestiny (Contributor) commented:

Yeah, I agree with the motivation. But instead of throttling from the source, can we do it locally? For example, stall ingest_batch itself, so that the backpressure gradually propagates and slows down the whole streaming job.

But @hzxa21 thinks that there are some disadvantages if we do it locally:

  • barrier cannot flow through the streaming graph if an actor is blocked
  • checkpoint of non-related MV/Tables will be blocked
  • recovery won't work if there are any other failures
  • ...

hzxa21 (Collaborator) commented Feb 22, 2023

Yeah, I agree with the motivation. But instead of throttling from the source, can we do it locally? For example, stall ingest_batch itself, so that the backpressure gradually propagates and slows down the whole streaming job.

Stalling ingest_batch will stall the actor task, which can affect barrier/checkpoint/recovery and potentially other things. I remember this has caused some deadlock issues before. cc @wenym1

BugenZhao (Member) commented Feb 24, 2023

👀 +1 for stalling locally. If we make it special for Source, should we also apply it to DML and even Chain? It's not easy to keep them in sync if we stall them manually, whereas this comes naturally with the back-pressure of local stalling.

checkpoint of non-related MV/Tables will be blocked

How can we define "related"? Our system provides global snapshot consistency, so all tables that can be joined together should be considered related. Not caring about the freshness of some specific sources seems to make some queries less meaningful. Maybe we should implement per-database checkpointing and isolate them completely?

recovery won't work if there are any other failures

Do you mean that stalling ingest_batch will make us less responsive to recovery? In this case, I guess we can make the stalled ingest_batch fail immediately by sending some messages when the meta service decides to trigger the recovery process.
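
One way to sketch this "fail fast on recovery" idea (hypothetical flow, assuming a tokio runtime; not how the meta service actually signals recovery): race the unblock condition against a recovery signal, so a blocked write returns an error as soon as recovery is triggered.

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::{watch, Notify};

// Hypothetical sketch: the stalled write waits on its unblock condition, but a
// recovery signal can abort it immediately instead of leaving it blocked.
async fn stalled_write(
    unblocked: Arc<Notify>,
    mut recovery: watch::Receiver<bool>,
) -> Result<(), String> {
    tokio::select! {
        _ = unblocked.notified() => Ok(()),
        _ = recovery.changed() => Err("aborted by recovery".to_string()),
    }
}

#[tokio::main]
async fn main() {
    let unblocked = Arc::new(Notify::new());
    let (recovery_tx, recovery_rx) = watch::channel(false);

    let task = tokio::spawn(stalled_write(unblocked.clone(), recovery_rx));

    // Simulate the meta service triggering recovery while the write is stalled.
    tokio::time::sleep(Duration::from_millis(10)).await;
    recovery_tx.send(true).unwrap();

    // The blocked write fails fast instead of holding up recovery.
    assert!(task.await.unwrap().is_err());
}
```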

which can affect barrier/checkpoint

IMO the latency of a streaming job is a physical property and will always be there, as long as we have some sort of back-pressure mechanism. We're overloaded -> barriers are stalled -> barrier latency increases -> more in-flight barriers -> less per-epoch data; this chain of effects should be natural and reasonable for the system. 🤔

hzxa21 (Collaborator) commented Feb 27, 2023

In this case, I guess we can make the stalled ingest_batch fail immediately by sending some messages when the meta service decides to trigger the recovery process.

Good point. On second thought, blocking ingest_batch is an issue only when the unblock condition relies on the actor task being polled. A full memtable and too many L0 files have nothing to do with polling the actor task, so it is indeed more natural to leverage backpressure in this case. cc @zwang28

However, we still need to be careful about the condition for blocking ingest_batch, to make sure that progress can be made and it can be unblocked (as mentioned here).
