
Discussion: support for throttling write based on LSM Tree stats #7997

Closed
zwang28 opened this issue Feb 17, 2023 · 14 comments · Fixed by #8383

zwang28 (Contributor) commented Feb 17, 2023

Currently, ingest_batch will stall if the shared buffer size threshold is exceeded.
We can add another, similar stall trigger calculated from the HummockVersion stats, e.g. the total number of L0 files or the total number of L0 sub-levels. This helps when compaction is lagging behind; otherwise, by the time we eventually notice the issue, too many files may already have been added to the LSM tree.

Note that we may have multiple LSM trees. Since we know the table_id in StateTable, we can also determine the target LSM tree.
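
A minimal sketch of what such a stats-based trigger could look like (the type and field names here are illustrative placeholders, not the actual Hummock/HummockVersion API):

```rust
// Hypothetical sketch only: LsmStats and WriteLimit are illustrative names,
// not real RisingWave/Hummock types.
#[derive(Debug, Clone, Copy)]
struct LsmStats {
    l0_file_count: usize,
    l0_sub_level_count: usize,
}

#[derive(Debug, Clone, Copy)]
struct WriteLimit {
    max_l0_file_count: usize,
    max_l0_sub_level_count: usize,
}

/// Returns true if writes to this LSM tree (identified elsewhere by table_id)
/// should be throttled, based on the latest version stats.
fn should_throttle(stats: &LsmStats, limit: &WriteLimit) -> bool {
    stats.l0_file_count > limit.max_l0_file_count
        || stats.l0_sub_level_count > limit.max_l0_sub_level_count
}

fn main() {
    let stats = LsmStats { l0_file_count: 12_000, l0_sub_level_count: 400 };
    let limit = WriteLimit { max_l0_file_count: 1_000, max_l0_sub_level_count: 300 };
    // Compaction is far behind here, so the trigger fires.
    assert!(should_throttle(&stats, &limit));
    println!("throttle: {}", should_throttle(&stats, &limit));
}
```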

@Little-Wallace @Li0k @soundOfDestiny @wenym1 @hzxa21

github-actions bot added this to the release-0.1.18 milestone on Feb 17, 2023
zwang28 added the component/storage label on Feb 17, 2023
hzxa21 (Collaborator) commented Feb 17, 2023

In general it is doable, but we need to be very careful to constrain the blast radius of stalling, because blocking ingest_batch means blocking the actor task. Under the current architecture, our system may not function well if the actor task is blocked for a long time. For example:

  • barrier cannot flow through the streaming graph if an actor is blocked
  • checkpoint of non-related MV/Tables will be blocked
  • recovery won't work if there are any other failures
  • ...

Stalling when the memtable (previously called the shared buffer) reaches a threshold is okay, since we know for sure that some progress (e.g. flush/spill) can be made and the stall won't last long. However, if we stall when compaction is lagging behind, I think it is harder to guarantee that there will be progress. In this case, we should make sure the stalling only affects the problematic MV/Tables, not others. Another option is not to stall ingest_batch but to pause/throttle the source (without pausing the barrier and the actor task).
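
For illustration, a per-table decision like the one described above could look roughly like this (hypothetical types; the real mapping from table_id to LSM tree/compaction group is not shown):

```rust
use std::collections::{HashMap, HashSet};

// Illustrative sketch only; TableId and L0Stats are placeholders, not actual
// RisingWave types.
type TableId = u32;

struct L0Stats {
    file_count: usize,
}

/// Pick out only the problematic tables, so that throttling (e.g. pausing
/// their sources) does not affect unrelated MVs/tables.
fn tables_to_throttle(stats: &HashMap<TableId, L0Stats>, max_l0_files: usize) -> HashSet<TableId> {
    stats
        .iter()
        .filter(|(_, s)| s.file_count > max_l0_files)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    let mut stats = HashMap::new();
    stats.insert(1, L0Stats { file_count: 50 });
    stats.insert(2, L0Stats { file_count: 5_000 });
    let throttled = tables_to_throttle(&stats, 1_000);
    // Only table 2 is behind on compaction, so only it gets throttled.
    assert!(throttled.contains(&2) && !throttled.contains(&1));
    println!("throttled tables: {:?}", throttled);
}
```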

soundOfDestiny (Contributor) commented:

In general it is doable, but we need to be very careful to constrain the blast radius of stalling, because blocking ingest_batch means blocking the actor task. [...] Another option is not to stall ingest_batch but to pause/throttle the source (without pausing the barrier and the actor task).

I agree that pausing/throttling the source is better than stalling ingest_batch.

zwang28 changed the title from "Discussion: support for stalling ingest_batch based on LSM Tree stats" to "Discussion: support for throttling write based on LSM Tree stats" on Feb 17, 2023
zwang28 (Contributor, Author) commented Feb 17, 2023

if we stall when compaction is lagging behind, I think it is harder to guarantee that there will be progress

barrier cannot flow through the streaming graph if an actor is blocked

Yes, I missed these drawbacks (again).

+1 for pausing/throttling the source.

fuyufjh (Member) commented Feb 22, 2023

Hmmm, even pausing the source is dangerous. The streaming operators can trigger pausing due to scaling out or failure recovery. The 2 mechanisms together sound very error-prone.

I think the only correct way to do throttling is probably by backpressure, and "currently ingest_batch will stall if the shared buffer size threshold is exceeded" looks good to me. May I ask the reason for improving that? That's the most natural solution I can think of. Actually, there are very few things we can do when the system is overloaded.

soundOfDestiny (Contributor) commented:

I think the only correct way to do throttling is probably by backpressure, and "currently ingest_batch will stall if the shared buffer size threshold is exceeded" looks good to me. May I ask the reason for improving that?

Zheng Wang does not plan to change the existing behavior ("currently ingest_batch will stall if the shared buffer size threshold is exceeded").
He means that he wants ingest_batch to stall when either the shared buffer size threshold or the L0 SST size threshold is exceeded.

zwang28 (Contributor, Author) commented Feb 22, 2023

The streaming operators can trigger pausing due to scaling out or failure recovery. The 2 mechanisms together sound very error-prone.

The implementation in #8049 ensures that a pause/resume issued by a barrier (scaling/recovery) has higher priority than the throttler, i.e. the throttler cannot resume a source that was paused by a barrier. (Actually, the current barrier-latency-based throttler in the main branch has this bug.)
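
A toy sketch of that priority rule (illustrative names only, not the actual #8049 implementation): a barrier-issued pause and a throttler-issued pause are tracked separately, and the source stays paused until both have been lifted.

```rust
// Hypothetical sketch: a throttler resume alone cannot wake up a source that
// a barrier (scaling/recovery) has paused.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PauseReason {
    Barrier,
    Throttler,
}

#[derive(Default)]
struct SourceState {
    paused_by_barrier: bool,
    paused_by_throttler: bool,
}

impl SourceState {
    fn pause(&mut self, reason: PauseReason) {
        match reason {
            PauseReason::Barrier => self.paused_by_barrier = true,
            PauseReason::Throttler => self.paused_by_throttler = true,
        }
    }

    /// A resume only clears the pause issued for the same reason; the source
    /// remains paused while any higher-priority pause is still outstanding.
    fn resume(&mut self, reason: PauseReason) {
        match reason {
            PauseReason::Barrier => self.paused_by_barrier = false,
            PauseReason::Throttler => self.paused_by_throttler = false,
        }
    }

    fn is_paused(&self) -> bool {
        self.paused_by_barrier || self.paused_by_throttler
    }
}

fn main() {
    let mut source = SourceState::default();
    source.pause(PauseReason::Barrier);
    source.pause(PauseReason::Throttler);
    // The throttler alone cannot resume a source paused by a barrier.
    source.resume(PauseReason::Throttler);
    assert!(source.is_paused());
    source.resume(PauseReason::Barrier);
    assert!(!source.is_paused());
}
```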

zwang28 (Contributor, Author) commented Feb 22, 2023

"currently ingest_batch will stall if the shared buffer size threshold is exceeded" looks good to me. May I ask the reason

We have seen cases where the compactor is abnormal and tens of thousands of SST files have accumulated in level 0 of the LSM tree by the time we notice it. By throttling earlier, we can keep the LSM tree in good shape.

zwang28 (Contributor, Author) commented Feb 22, 2023

IIRC @Little-Wallace has also emphasized the risk of throttling based on LSM tree shape before. Any comments?

fuyufjh (Member) commented Feb 22, 2023

We have seen cases where the compactor is abnormal and tens of thousands of SST files have accumulated in level 0 of the LSM tree by the time we notice it. By throttling earlier, we can keep the LSM tree in good shape.

Yeah, I agree with the motivation. But instead of throttling from the source, can we do it locally? For example, stall ingest_batch itself, so that the backpressure gradually propagates and slows down the whole streaming job.
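
A rough sketch of such a local stall (placeholder names, not the real ingest_batch code path): the write path blocks until the observed L0 file count drops back under the limit, and that blocking is what propagates backpressure upstream.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Illustrative sketch only; L0Gate is a hypothetical helper, not a real
// storage type. The writer stalls until compaction reduces the L0 file count.
struct L0Gate {
    file_count: Mutex<usize>,
    changed: Condvar,
    max_files: usize,
}

impl L0Gate {
    /// Blocks the caller while the L0 file count exceeds the limit.
    fn wait_until_writable(&self) {
        let mut count = self.file_count.lock().unwrap();
        while *count > self.max_files {
            count = self.changed.wait(count).unwrap();
        }
    }

    /// Called when a new version is observed (e.g. compaction made progress).
    fn update(&self, new_count: usize) {
        *self.file_count.lock().unwrap() = new_count;
        self.changed.notify_all();
    }
}

fn main() {
    let gate = Arc::new(L0Gate {
        file_count: Mutex::new(2_000),
        changed: Condvar::new(),
        max_files: 1_000,
    });

    let g = Arc::clone(&gate);
    let writer = std::thread::spawn(move || {
        g.wait_until_writable(); // stalls until compaction catches up
        println!("write admitted");
    });

    // Simulate compaction making progress.
    gate.update(500);
    writer.join().unwrap();
}
```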

fuyufjh (Member) commented Feb 22, 2023

The implementation in #8049 ensures that a pause/resume issued by a barrier (scaling/recovery) has higher priority than the throttler, i.e. the throttler cannot resume a source that was paused by a barrier.

Well, it's very likely to be correct now, but I'm afraid this may cause deep bugs in the future. Before this, the call hierarchy was clear: compute -> storage, but this PR breaks that convention by introducing a reversed control flow. I just feel terrible about it :(

soundOfDestiny (Contributor) commented:

Yeah, I agree with the motivation. But instead of throttling from the source, can we do it locally? For example, stall ingest_batch itself, so that the backpressure gradually propagates and slows down the whole streaming job.

But @hzxa21 thinks that there are some disadvantages if we do it locally:

  • barrier cannot flow through the streaming graph if an actor is blocked
  • checkpoint of non-related MV/Tables will be blocked
  • recovery won't work if there are any other failures
  • ...

hzxa21 (Collaborator) commented Feb 22, 2023

Yeah, I agree with the motivation. But instead of throttling from the source, can we do it locally? For example, stall ingest_batch itself, so that the backpressure gradually propagates and slows down the whole streaming job.

Stalling ingest_batch will stall the actor task, which can affect barrier/checkpoint/recovery and potentially other things. I remember this has caused some deadlock issues before. cc @wenym1

BugenZhao (Member) commented Feb 24, 2023

👀 +1 for stalling locally. If we make it special for Source, should we also apply it to DML and even Chain? It's not easy to keep them in sync if we stall them manually, whereas this comes naturally with the back-pressure of local stalling.

checkpoint of non-related MV/Tables will be blocked

How can we define "related"? Our system provides global snapshot consistency, so all tables that can be joined together should be considered related. Not caring about the freshness of some specific sources seems to make some queries less meaningful. Maybe we should implement per-database checkpointing and isolate them completely?

recovery won't work if there are any other failures

Do you mean that stalling ingest_batch will make us less responsive to recovery? In this case, I guess we can make the stalled ingest_batch fail immediately by sending some messages when the meta service decides to trigger the recovery process.
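
One way to sketch this "fail fast on recovery" idea (hypothetical flow, assuming a tokio runtime; not how the meta service actually signals recovery): race the unblock condition against a recovery signal, so a blocked write returns an error as soon as recovery is triggered.

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::{watch, Notify};

// Hypothetical sketch: the stalled write waits on its unblock condition, but a
// recovery signal can abort it immediately instead of leaving it blocked.
async fn stalled_write(
    unblocked: Arc<Notify>,
    mut recovery: watch::Receiver<bool>,
) -> Result<(), String> {
    tokio::select! {
        _ = unblocked.notified() => Ok(()),
        _ = recovery.changed() => Err("aborted by recovery".to_string()),
    }
}

#[tokio::main]
async fn main() {
    let unblocked = Arc::new(Notify::new());
    let (recovery_tx, recovery_rx) = watch::channel(false);

    let task = tokio::spawn(stalled_write(unblocked.clone(), recovery_rx));

    // Simulate the meta service triggering recovery while the write is stalled.
    tokio::time::sleep(Duration::from_millis(10)).await;
    recovery_tx.send(true).unwrap();

    // The blocked write fails fast instead of holding up recovery.
    assert!(task.await.unwrap().is_err());
}
```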

which can affect barrier/checkpoint

IMO the latency of a streaming job is a physical property and will always be there, as long as we have some sort of back-pressure mechanism. We're overloaded -> barriers are stalled -> barrier latency increases -> more in-flight barriers -> less per-epoch data; this chain of effects should be natural and reasonable for the system. 🤔

hzxa21 (Collaborator) commented Feb 27, 2023

In this case, I guess we can make the stalled ingest_batch fail immediately by sending some messages when the meta service decides to trigger the recovery process.

Good point. On second thought, blocking ingest_batch is an issue only when the unblock condition relies on the actor task being polled. A full memtable and too many L0 files have nothing to do with polling the actor task, so it is indeed more natural to leverage backpressure in this case. cc @zwang28

However, we still need to be careful about the condition for blocking ingest_batch, to make sure that progress can be made and it can be unblocked (as mentioned here).
