Backport 14693 - Use GetTabletsByCell in healthcheck #514

Open
wants to merge 5 commits into
base: slack-15.0

ejortegau
Collaborator

This backports upstream PR vitessio#14693, with a few minor changes to make it work with the Go
version we are using, plus a small change to topology_watcher.go so that the test cases
exercise the same behavior as the upstream code. The description of the original PR
follows:

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers
in a cell. Instead, we can use GetTabletsByCell to fetch all tablets for a cell at once.

This PR does a few more things:

  • GetTabletsByCell now handles the case where the response size violates gRPC limits by
    falling back to reading one tablet at a time if the batch read returns an error.
  • Previously, the one-tablet-at-a-time method had unlimited concurrency. This PR
    introduces a configuration option that bounds its concurrency.
  • We pass topoReadConcurrency from healthcheck into GetTabletsByCell.
  • The behavior of the --refresh_known_tablets flag is different now. Previously we would
    not read known tablets at all; now we do read them, but ignore any changes to tablets
    that are already known.
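
The fallback and the concurrency bound described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `topoReader`, `fakeTopo`, `errTooLarge`, and the lowercase `getTabletsByCell` are hypothetical names, the `concurrency` parameter stands in for topoReadConcurrency, and the choice to fall back only on a size-limit error is an assumption made for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Tablet is a hypothetical stand-in for a topo tablet record.
type Tablet struct{ Alias string }

// errTooLarge is a hypothetical sentinel for an oversized gRPC response.
var errTooLarge = errors.New("response exceeds gRPC message size limit")

// topoReader is a hypothetical minimal view of the topo, not the Vitess API.
type topoReader interface {
	ListAliases(cell string) ([]string, error)
	GetTablet(alias string) (Tablet, error)
	GetTabletsBatch(cell string) ([]Tablet, error)
}

// getTabletsByCell tries one batch read for the cell. If that fails with
// errTooLarge, it reads tablets one at a time, at most `concurrency` at once.
func getTabletsByCell(r topoReader, cell string, concurrency int) ([]Tablet, error) {
	tablets, err := r.GetTabletsBatch(cell)
	if !errors.Is(err, errTooLarge) {
		return tablets, err // success, or an error this sketch cannot recover from
	}
	aliases, err := r.ListAliases(cell)
	if err != nil {
		return nil, err
	}
	if concurrency < 1 {
		concurrency = 1 // a zero-capacity semaphore would block forever
	}
	out := make([]Tablet, len(aliases))
	errs := make([]error, len(aliases))
	sem := make(chan struct{}, concurrency) // counting semaphore
	var wg sync.WaitGroup
	for i, alias := range aliases {
		i, alias := i, alias // per-iteration copies (needed before Go 1.22)
		sem <- struct{}{}    // blocks while `concurrency` reads are in flight
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			out[i], errs[i] = r.GetTablet(alias) // each goroutine owns index i
		}()
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

// fakeTopo is an in-memory fake that records peak concurrent GetTablet calls.
type fakeTopo struct {
	aliases               []string
	batchErr              error
	mu                    sync.Mutex
	inFlight, maxInFlight int
}

func (f *fakeTopo) ListAliases(string) ([]string, error) { return f.aliases, nil }

func (f *fakeTopo) GetTabletsBatch(string) ([]Tablet, error) {
	if f.batchErr != nil {
		return nil, f.batchErr
	}
	out := make([]Tablet, len(f.aliases))
	for i, a := range f.aliases {
		out[i] = Tablet{Alias: a}
	}
	return out, nil
}

func (f *fakeTopo) GetTablet(alias string) (Tablet, error) {
	f.mu.Lock()
	f.inFlight++
	if f.inFlight > f.maxInFlight {
		f.maxInFlight = f.inFlight
	}
	f.mu.Unlock()
	time.Sleep(time.Millisecond) // simulate topo latency
	f.mu.Lock()
	defer f.mu.Unlock()
	f.inFlight--
	return Tablet{Alias: alias}, nil
}

func main() {
	f := &fakeTopo{aliases: []string{"a", "b", "c", "d", "e", "f"}, batchErr: errTooLarge}
	tablets, err := getTabletsByCell(f, "cell1", 2)
	fmt.Println(len(tablets), err, f.maxInFlight)
}
```

A buffered channel is used as the semaphore here because it needs only the standard library; the peak number of in-flight reads recorded by the fake never exceeds the configured bound.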

The basic fix has already been tried in production and shown to reduce the number of Get
calls from vtgate -> topo from O(n) to O(1) per cell, where n is the number of tablets in
the cell.
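
To make the call-count arithmetic concrete, here is a small sketch with a counting in-memory fake. Everything in it (`countingTopo`, `loadPerAlias`, `loadBatch`) is hypothetical and only models the two access patterns described above; it does not reproduce the Vitess topo API or the production measurement.

```go
package main

import "fmt"

// Tablet is a hypothetical stand-in for a topo tablet record.
type Tablet struct{ Alias string }

// countingTopo is a hypothetical in-memory topo that counts read calls.
type countingTopo struct {
	tablets []Tablet
	calls   int
}

func newCountingTopo(n int) *countingTopo {
	t := &countingTopo{}
	for i := 0; i < n; i++ {
		t.tablets = append(t.tablets, Tablet{Alias: fmt.Sprintf("cell1-%04d", i)})
	}
	return t
}

// listAliases costs one call and returns only the aliases.
func (t *countingTopo) listAliases() []string {
	t.calls++
	aliases := make([]string, len(t.tablets))
	for i, tab := range t.tablets {
		aliases[i] = tab.Alias
	}
	return aliases
}

// getTablet costs one call per tablet.
func (t *countingTopo) getTablet(alias string) (Tablet, bool) {
	t.calls++
	for _, tab := range t.tablets {
		if tab.Alias == alias {
			return tab, true
		}
	}
	return Tablet{}, false
}

// getTabletsByCell costs one call for the whole cell.
func (t *countingTopo) getTabletsByCell() []Tablet {
	t.calls++
	return append([]Tablet(nil), t.tablets...)
}

// loadPerAlias models the old flow: list aliases, then one read per alias.
func loadPerAlias(t *countingTopo) []Tablet {
	var out []Tablet
	for _, alias := range t.listAliases() {
		if tab, ok := t.getTablet(alias); ok {
			out = append(out, tab)
		}
	}
	return out
}

// loadBatch models the new flow: a single read for the cell.
func loadBatch(t *countingTopo) []Tablet {
	return t.getTabletsByCell()
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		old, batch := newCountingTopo(n), newCountingTopo(n)
		loadPerAlias(old)
		loadBatch(batch)
		fmt.Printf("n=%d per-alias calls=%d batch calls=%d\n", n, old.calls, batch.calls)
	}
}
```

In this model the per-alias flow issues n+1 calls (one list plus n reads), while the batch flow issues one call regardless of n.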

We can consider deprecating and deleting --refresh_known_tablets in a future release.
The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching
all tablets in one call to the topo.
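
The changed --refresh_known_tablets behavior described earlier can be sketched as a merge step over the freshly read tablet list. This is an illustration under assumptions: `mergeTablets` and its `refreshKnown` parameter are hypothetical names, `Hostname` stands for any mutable tablet field, the "ignore changes" branch is assumed to apply when the flag is disabled, and removal of tablets that disappear from the topo is not modeled.

```go
package main

import "fmt"

// Tablet is a hypothetical stand-in; Hostname represents any mutable field.
type Tablet struct {
	Alias    string
	Hostname string
}

// mergeTablets applies a freshly read tablet list to the known set.
// Tablets that are not yet known are always added. For tablets that are
// already known, changes are applied only when refreshKnown is true;
// otherwise the fetched record is read but ignored.
func mergeTablets(known map[string]Tablet, fetched []Tablet, refreshKnown bool) (added, updated []string) {
	for _, t := range fetched {
		old, ok := known[t.Alias]
		switch {
		case !ok:
			known[t.Alias] = t
			added = append(added, t.Alias)
		case refreshKnown && old != t:
			known[t.Alias] = t
			updated = append(updated, t.Alias)
		}
	}
	return added, updated
}

func main() {
	known := map[string]Tablet{
		"cell1-0001": {Alias: "cell1-0001", Hostname: "old-host"},
	}
	fetched := []Tablet{
		{Alias: "cell1-0001", Hostname: "new-host"},
		{Alias: "cell1-0002", Hostname: "host-2"},
	}
	added, updated := mergeTablets(known, fetched, false)
	fmt.Println(added, updated, known["cell1-0001"].Hostname)
}
```

With `refreshKnown` set to false, the new tablet is added while the hostname change on the known tablet is ignored, even though both records arrived in the same read.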

github-actions bot added this to the v15.0.5 milestone on Sep 16, 2024