Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport 14693 - Use GetTabletsByCell in healthcheck #514

Merged
merged 6 commits into from
Oct 4, 2024

Conversation

ejortegau
Copy link
Collaborator

This backports upstream PR vitessio#14693, with a few minor changes to make it work with the Go
version we are using and a small change to topology_watcher.go so that test cases reflect
and test for the same behavior as the upstream code. The description of the original PR
follows:

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers
in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once.

This PR does a few more things:

  • GetTabletsForCell now handles the case where the response size violates gRPC limits by
    falling back to one tablet at a time in case of error.
  • Previously, the one tablet at a time method had unlimited concurrency. In this PR we
    introduce a configuration option for concurrency.
  • We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
  • The behavior of --refresh_known_tablets flag is different now. Previously we would not
    read those tablets at all, now we do read them, but ignore any changes if they are
    already known.

The basic fix has already been tried in production and shown to reduce the number of Get
calls from vtgate -> topo from O(n) to O(1).

We can consider deprecating and deleting --refresh_known_tablets in a future release.
The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching
all tablets in one call to the topo.

This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go
version we are using and a small change to topology_watcher.go so that test cases reflect
and test for the same behavior as the upstream code. The description of the original PR
follows:

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers
in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once.

This PR does a few more things:

* GetTabletsForCell now handles the case where the response size violates gRPC limits by
  falling back to one tablet at a time in case of error.
* Previously, the one tablet at a time method had unlimited concurrency. In this PR we
  introduce a configuration option for concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
* The behavior of --refresh_known_tablets flag is different now. Previously we would not
  read those tablets at all, now we do read them, but ignore any changes if they are
  already known.

The basic fix has already been tried in production and shown to reduce the number of Get
calls from vtgate -> topo from O(n) to O(1).

We can consider deprecating and deleting --refresh_known_tablets in a future release.
The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching
all tablets in one call to the topo.
@github-actions github-actions bot added this to the v15.0.5 milestone Sep 16, 2024
@timvaillancourt timvaillancourt self-requested a review October 2, 2024 09:18
@ejortegau ejortegau merged commit f552c6f into slack-15.0 Oct 4, 2024
194 checks passed
@ejortegau ejortegau deleted the txthrottler_backport_14693 branch October 4, 2024 12:32
ejortegau added a commit that referenced this pull request Oct 10, 2024
vitessio#16871 added filtering in the health check,
but later #514 undid that. So aiming at
fixing it here.

With this change, essentially, the filters passed in NewHealthCheck() get actually
applied to the topology watchers, instead of it using a newly instantiated, empty
filter as it's currently happening.

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>
timvaillancourt added a commit that referenced this pull request Oct 10, 2024
* Fix healthcheck filters

vitessio#16871 added filtering in the health check,
but later #514 undid that. So aiming at
fixing it here.

With this change, essentially, the filters passed in NewHealthCheck() get actually
applied to the topology watchers, instead of it using a newly instantiated, empty
filter as it's currently happening.

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

* Improve tests by getting them from vitessio#16871

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

---------

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>
Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants