Feature Request: Reduce number of topo calls from vtgate healthcheck #14277

deepthi · 2023-10-13T20:43:05Z

Feature Description

In production environments, we have observed that vtgates impose a significant load on the topo server, because they read every tablet record once per tablet_refresh_interval, which defaults to 1 minute. It is desirable to lower this interval (say, to 30 seconds) to discover replica tablet changes sooner, but that increases the load on the topo even more.
There is one option currently available to tune this: the tablet_refresh_known_tablets flag. However, it has a downside: it is only usable in static environments where a tablet's host never changes after first deployment. That rules it out in a k8s environment, where tablets might be rescheduled to a new node at any time for any reason.

There is a way to improve this behavior and reduce the load on the topo: fetch all the tablet records for a cell in one go, instead of one tablet at a time. This is what the GetTablets vtctld RPC already does.
However, there is a tradeoff. Each supported topo server (etcd and zookeeper right now) has a limit on the request/response size. In fact, zookeeper does not support the List function used by GetTablets at all, so there we fall back to the one-tablet-at-a-time method. If there are enough tablets to hit this limit, the List call will error out and that call is wasted. This tradeoff is probably worth it, because of the stark reduction in the number of network calls and in the load on the topo.
In the best case, the number of topo calls drops from O(n) to O(1), where n is the total number of tablets being watched by vtgate. In the worst case, we add 1 additional topo call to what is already O(n).
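
To make the call-count tradeoff concrete, here is a minimal Go sketch of the "list first, fall back to per-tablet reads" idea. It is not the Vitess implementation: `TopoReader`, `Tablet`, `fakeTopo`, `errListTooLarge`, and `fetchTablets` are hypothetical names invented for illustration, and the size limit is simulated with a simple tablet count.

```go
package main

import (
	"errors"
	"fmt"
)

// Tablet is a hypothetical stand-in for a tablet record.
type Tablet struct {
	Alias string
	Host  string
}

// errListTooLarge is a hypothetical error simulating a topo size limit.
var errListTooLarge = errors.New("list response exceeds size limit")

// TopoReader is a hypothetical interface, not the real Vitess topo API.
type TopoReader interface {
	ListTablets(cell string) ([]Tablet, error)   // one call for the whole cell
	TabletAliases(cell string) ([]string, error) // one call for the aliases
	GetTablet(alias string) (Tablet, error)      // one call per tablet
}

// fetchTablets tries the bulk read first and falls back to one read per
// tablet if the bulk read fails for any reason.
func fetchTablets(t TopoReader, cell string) ([]Tablet, error) {
	if tablets, err := t.ListTablets(cell); err == nil {
		return tablets, nil
	}
	aliases, err := t.TabletAliases(cell)
	if err != nil {
		return nil, err
	}
	tablets := make([]Tablet, 0, len(aliases))
	for _, alias := range aliases {
		tablet, err := t.GetTablet(alias)
		if err != nil {
			return nil, err
		}
		tablets = append(tablets, tablet)
	}
	return tablets, nil
}

// fakeTopo is an in-memory topo that counts every call it receives.
type fakeTopo struct {
	tablets   []Tablet
	listLimit int // ListTablets fails above this many tablets
	calls     int
}

func (f *fakeTopo) ListTablets(cell string) ([]Tablet, error) {
	f.calls++
	if len(f.tablets) > f.listLimit {
		return nil, errListTooLarge
	}
	return f.tablets, nil
}

func (f *fakeTopo) TabletAliases(cell string) ([]string, error) {
	f.calls++
	aliases := make([]string, len(f.tablets))
	for i, t := range f.tablets {
		aliases[i] = t.Alias
	}
	return aliases, nil
}

func (f *fakeTopo) GetTablet(alias string) (Tablet, error) {
	f.calls++
	for _, t := range f.tablets {
		if t.Alias == alias {
			return t, nil
		}
	}
	return Tablet{}, fmt.Errorf("tablet %q not found", alias)
}

// makeTablets builds n placeholder tablet records.
func makeTablets(n int) []Tablet {
	out := make([]Tablet, n)
	for i := range out {
		out[i] = Tablet{Alias: fmt.Sprintf("zone1-%d", i), Host: fmt.Sprintf("host-%d", i)}
	}
	return out
}

func main() {
	under := &fakeTopo{tablets: makeTablets(5), listLimit: 10}
	fetchTablets(under, "zone1")
	fmt.Println("calls when the list fits:", under.calls)

	over := &fakeTopo{tablets: makeTablets(5), listLimit: 3}
	fetchTablets(over, "zone1")
	fmt.Println("calls when the list fails:", over.calls)
}
```

With 5 tablets, the first case costs a single call. The second costs 7: the failed list, one alias lookup, and 5 per-tablet reads, which is exactly one more than the per-tablet path alone.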

Use Case(s)

Vitess clusters with a large number of tablets.


deepthi · 2023-10-13T20:47:01Z

Note: we can also implement a similar optimization for vtorc.

arthurschreiber · 2023-11-15T11:03:13Z

I reopened this because I don't think this was actually fixed / implemented.

deepthi added the Type: Feature, Needs Triage, and Type: RFC labels and removed the Needs Triage label on Oct 13, 2023

deepthi mentioned this issue Oct 31, 2023

tx throttler: remove unused topology watchers #14412

Merged

4 tasks

GuptaManan100 closed this as completed in #14412 Nov 1, 2023

arthurschreiber reopened this Nov 15, 2023

deepthi mentioned this issue Dec 6, 2023

Use GetTabletsByCell in healthcheck #14693

Merged

4 tasks

deepthi closed this as completed in #14693 Dec 12, 2023
