Feature Description

In production environments, we have observed that vtgates impose significant load on the topo server, because they read every tablet record once per tablet_refresh_interval, which defaults to 1 minute. It is desirable to lower this to a smaller value (say, 30 seconds) to discover replica tablet changes more promptly, but that increases the load on the topo server even further.
There is currently one option available to tune this: the tablet_refresh_known_tablets flag. However, it has a downside: it is only usable in static environments where a tablet's host never changes after the initial deployment. Clearly, that rules out Kubernetes environments, where tablets may be rescheduled to a new node at any time, for any reason.
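For reference, both flags discussed above are set on the vtgate command line. A sketch of an invocation (other required vtgate flags omitted):

```shell
# Poll the topo server for tablet changes every 30s instead of the 1m default.
# tablet_refresh_known_tablets=false skips re-reading records for tablets
# vtgate already knows about, but is only safe when tablet hosts never change.
vtgate \
  --tablet_refresh_interval=30s \
  --tablet_refresh_known_tablets=false
```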
There is a way to improve this behavior and reduce the load on the topo server: fetch all the tablet records for a cell in a single call, instead of one tablet at a time. This is what the GetTablets vtctld RPC already does.
However, there is a tradeoff. Each supported topo server (currently etcd and ZooKeeper) has a limit on the request/response size. In fact, ZooKeeper does not support the List function used by GetTablets at all, so there we fall back to the one-tablet-at-a-time method. If there are enough tablets to hit the size limit, the topo call errors out and is wasted. The tradeoff is probably still worth it, given the stark reduction in the number of network calls and in the load on the topo server.
In the best case, the number of topo calls drops from O(n) to O(1), where n is the total number of tablets being watched by the vtgate. In the worst case, we add one extra topo call to what is already O(n).
Use Case(s)
Vitess clusters with a large number of tablets.