Tablet throttler: non-PRIMARY tablets report back to PRIMARY throttler when they've been 'check'ed #13177


shlomi-noach
Contributor

Description

#13175 describes a starvation scenario, where a streamer happens to read from a REPLICA and nothing on the PRIMARY checks the throttler. This may happen when:

  • using --heartbeat_on_demand_duration
  • as mentioned, streamer reads from non-PRIMARY

and when:

  • The writes happen on another cluster (e.g. MoveTables, resharding)
  • or, the writes happen on the same cluster, but the writer in charge does not consult the throttler.

Until now, throttler checks on a replica did not trigger on-demand heartbeats on the source cluster. This can lead to a "starvation" scenario in which the app is continuously throttled: nothing writes heartbeats on the source cluster, the checks on the replica do not trigger heartbeats, the throttler thinks there is lag and stalls reads, no writes take place, and the entire flow is at a standstill.

As of this PR, a replica's throttler service takes note when someone (other than the throttler mechanism itself) performs a check. It records "I have been recently checked". This note expires after 1 to 2 seconds.
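A minimal sketch of how such an expiring note could work, assuming a once-per-second ticker and a stored tick value. The names `recentCheckTracker`, `NoteCheck`, `Tick`, and `RecentlyChecked` are hypothetical and are not taken from the Vitess code base. Comparing tick values rather than timestamps is one way to get an expiry that falls somewhere between 1 and 2 seconds after the check.

```go
package main

import "sync/atomic"

// recentCheckTracker is a hypothetical helper for illustration only;
// it is not the type used in the Vitess throttler.
type recentCheckTracker struct {
	tick      atomic.Int64 // advanced once per second by a ticker (assumption)
	lastCheck atomic.Int64 // tick value observed at the last external check
}

// newRecentCheckTracker starts in the "not recently checked" state.
func newRecentCheckTracker() *recentCheckTracker {
	t := &recentCheckTracker{}
	t.lastCheck.Store(-2) // far enough in the past to count as expired
	return t
}

// Tick is meant to be called once per second.
func (t *recentCheckTracker) Tick() {
	t.tick.Add(1)
}

// NoteCheck records a check, unless the throttler itself performed it.
func (t *recentCheckTracker) NoteCheck(selfCheck bool) {
	if selfCheck {
		return
	}
	t.lastCheck.Store(t.tick.Load())
}

// RecentlyChecked reports whether an external check happened within the
// current or the previous tick. A check recorded at tick n stops counting
// once tick n+2 arrives, which is between 1 and 2 seconds later.
func (t *recentCheckTracker) RecentlyChecked() bool {
	return t.tick.Load()-t.lastCheck.Load() <= 1
}
```

In a running service, `Tick` would be driven by something like a `time.Ticker` with a one-second period. The sketch leaves that out so the expiry logic can be exercised deterministically.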

As the PRIMARY throttler probes the replicas, it also collects this new information. If it sees a replica which has "just been recently checked", it asks for a heartbeat lease renewal.

With this PR, any check on the throttler of any tablet in the cluster will trigger a heartbeat lease renewal.
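The PRIMARY side of this flow can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `probeResult`, `renewLeaseIfChecked`, and the `renew` callback are hypothetical names, and the real throttler collects this information as part of its regular replica probes.

```go
package main

// probeResult is a hypothetical summary of what the PRIMARY throttler
// learns from probing one tablet. Field names are assumptions.
type probeResult struct {
	Alias           string  // tablet alias, for illustration
	LagSeconds      float64 // replication lag reported by the probe
	RecentlyChecked bool    // the tablet's "I have been recently checked" note
}

// renewLeaseIfChecked calls renew once if the PRIMARY itself was checked,
// or if any probed tablet reports a recent check. It returns whether a
// renewal was requested.
func renewLeaseIfChecked(results []probeResult, primaryChecked bool, renew func()) bool {
	if !primaryChecked {
		for _, r := range results {
			if r.RecentlyChecked {
				primaryChecked = true
				break
			}
		}
	}
	if primaryChecked {
		renew()
	}
	return primaryChecked
}
```

With this shape, a check that lands only on a replica still reaches the heartbeat writer on the PRIMARY, which is what breaks the starvation loop described above.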

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

…r when they've been 'check'ed

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@vitess-bot
Contributor

vitess-bot bot commented May 28, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test explaining the expected behavior and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot bot added the NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work), NeedsIssue (A linked issue is missing for this Pull Request), and NeedsWebsiteDocsUpdate (What it says) labels on May 28, 2023
@github-actions github-actions bot added this to the v17.0.0 milestone May 28, 2023
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

Implicitly merged via #13195

@shlomi-noach shlomi-noach deleted the throttler-replica-propagate-check branch June 1, 2023 04:21
Labels
Component: TabletManager, Component: VReplication, NeedsDescriptionUpdate, NeedsIssue, NeedsWebsiteDocsUpdate, Type: Bug
Development

Successfully merging this pull request may close these issues.

VReplication Workflows and VStream API gets stuck in the copy phase if tablet type is set only as a replica
1 participant