Feature Request: Reduce number of topo calls from vtgate healthcheck #14277

Closed
deepthi opened this issue Oct 13, 2023 · 2 comments · Fixed by #14412 or #14693
Labels
Type: Feature, Type: RFC (Request For Comment)

Comments

@deepthi
Member

deepthi commented Oct 13, 2023

Feature Description

In production environments, we observed that vtgates impose a significant amount of load on the topo server, because they read each tablet record once per tablet_refresh_interval, which defaults to 1 minute. It is desirable to lower this interval (say, to 30 seconds) to discover replica tablet changes in a more timely fashion, but that increases the load on topo even more.
There is one option currently available to tune this: the tablet_refresh_known_tablets flag. However, it has a downside: it is only usable in static environments where a tablet's host never changes after first deployment. Clearly, this is not usable in a k8s environment, where tablets might be rescheduled to a new node at any time for any reason.

There is a way to improve this behavior and reduce the load on the topo: fetch all the tablet records for a cell in one go, instead of one tablet at a time. This is what the GetTablets vtctld RPC already does.
However, there is a tradeoff here. Each supported topo server (etcd and zookeeper right now) has a limit on the request/response size. In fact, zookeeper does not actually support the List function used by GetTablets, so there we fall back to the one-tablet-at-a-time method. If there are enough tablets to hit this limit, the topo call will error out and that call will have been wasted. This tradeoff is probably worth it, because of the stark reduction in the number of network calls and in the amount of load on the topo.
In the best case, the number of topo calls drops from O(n) to O(1), where n is the total number of tablets being watched by vtgate. In the worst case, we add 1 additional topo call to what is already O(n).
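
The call counts above can be illustrated with a minimal Go sketch. This is not the Vitess implementation: `fakeTopo`, `ListTablets`, `GetTablet`, `refreshTablets`, and `errTooLarge` are hypothetical names for an in-memory stand-in that counts calls, and the size limit is modeled as a simple cap on the number of records in one batch response.

```go
package main

import (
	"errors"
	"fmt"
)

// Tablet is a hypothetical, simplified stand-in for a tablet record.
type Tablet struct {
	Alias string
	Host  string
}

// errTooLarge is a hypothetical error for a batch response over the size limit.
var errTooLarge = errors.New("topo: response too large")

// fakeTopo is an in-memory stand-in for a topo server that counts calls.
type fakeTopo struct {
	tablets  map[string]Tablet
	aliases  []string
	maxBatch int // batch reads fail above this many records; 0 means unsupported
	calls    int
}

// newFakeTopo builds a fake topo holding n tablets with the given batch limit.
func newFakeTopo(n, maxBatch int) (*fakeTopo, []string) {
	f := &fakeTopo{tablets: make(map[string]Tablet, n), maxBatch: maxBatch}
	for i := 0; i < n; i++ {
		alias := fmt.Sprintf("cell1-%04d", i)
		f.aliases = append(f.aliases, alias)
		f.tablets[alias] = Tablet{Alias: alias, Host: fmt.Sprintf("host-%d", i)}
	}
	return f, f.aliases
}

// ListTablets returns every tablet record in one call, or fails if the
// batch is unsupported or too large.
func (f *fakeTopo) ListTablets() ([]Tablet, error) {
	f.calls++
	if f.maxBatch == 0 || len(f.tablets) > f.maxBatch {
		return nil, errTooLarge
	}
	out := make([]Tablet, 0, len(f.aliases))
	for _, a := range f.aliases {
		out = append(out, f.tablets[a])
	}
	return out, nil
}

// GetTablet returns a single tablet record.
func (f *fakeTopo) GetTablet(alias string) (Tablet, error) {
	f.calls++
	t, ok := f.tablets[alias]
	if !ok {
		return Tablet{}, fmt.Errorf("tablet %s not found", alias)
	}
	return t, nil
}

// refreshTablets tries the batch read first and falls back to one read per
// tablet if the batch read fails.
func refreshTablets(topo *fakeTopo, aliases []string) ([]Tablet, error) {
	if tablets, err := topo.ListTablets(); err == nil {
		return tablets, nil
	}
	out := make([]Tablet, 0, len(aliases))
	for _, a := range aliases {
		t, err := topo.GetTablet(a)
		if err != nil {
			return nil, err
		}
		out = append(out, t)
	}
	return out, nil
}

func main() {
	best, aliases := newFakeTopo(100, 1000)
	refreshTablets(best, aliases)
	fmt.Println("best case calls:", best.calls) // best case calls: 1

	worst, aliases := newFakeTopo(100, 10)
	refreshTablets(worst, aliases)
	fmt.Println("worst case calls:", worst.calls) // worst case calls: 101
}
```

With 100 tablets, the batch path costs 1 call, while the fallback path costs 101: the wasted batch attempt plus one read per tablet.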

Use Case(s)

Vitess clusters with a large number of tablets.

deepthi added the Type: Feature, Needs Triage, and Type: RFC labels, then removed the Needs Triage label, on Oct 13, 2023
@deepthi
Member Author

deepthi commented Oct 13, 2023

Note: we can also implement a similar optimization for vtorc.

@arthurschreiber
Contributor

I reopened this because I don't think this was actually fixed / implemented.
