Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout Fixes and VTOrc Improvement #11881

Merged
merged 24 commits into from
Dec 16, 2022

Conversation

GuptaManan100
Copy link
Member

@GuptaManan100 GuptaManan100 commented Dec 5, 2022

Description

This PR implements the ideas discussed in #11662 (comment) and #11662 (comment).

As discussed, this PR makes the following changes -

  1. Remove the LockShardTimeout from VTOrc as it doesn't provide any value.
  2. Add parallelism to VTOrc refresh phase so that we wait at most RemoteOperationTimeout seconds before proceeding.
  3. Introduce a new flag lock-timeout which defaults to 45 seconds instead of using RemoteOperationTimeout for the shard lock duration. The default for RemoteOperationTimeout has also been changed to 15 seconds.
  4. Fix the usage of WaitReplicasTimeout in ERS code-path. This makes it so that we are able to handle multiple failures with the set default values.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

…de-downgrade tests too

Signed-off-by: Manan Gupta <manan@planetscale.com>
…th default values of flags

Signed-off-by: Manan Gupta <manan@planetscale.com>
…ation-timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>
…ndling multiple failures

Signed-off-by: Manan Gupta <manan@planetscale.com>
…elism to refresh of tablets

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Dec 5, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 added Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Cluster management Component: VTorc Vitess Orchestrator integration release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) NeedsWebsiteDocsUpdate What it says labels Dec 5, 2022
@GuptaManan100
Copy link
Member Author

GuptaManan100 commented Dec 5, 2022

Even with these changes, VTOrc is unable to fix the failure of a single vttablet. i.e. if the primary's vttablet crashes, VTOrc doesn't run an ERS.
The issues still pending are -

  1. vttablets in their tm.Close() function, delete the MySQL hostname and some more information from the vttablet record. This causes VTOrc to think that the vttablet isn't part of the topology anymore. This assumption is however incorrect. Until the tablet's record is removed from the topo-server, it should be considered part of the topology.
  2. The above only happens sometimes on the vttablets. The reason is that the onclose_timeout defaults to 1ns, so the tm.Close() function runs successfully only sometimes. This makes the VTOrc behavior unpredictable.
  3. When a vttablet fails, VTOrc thinks it is unreachable, but doesn't think it's dead. The reason is that for MySQL failures, VTOrc tries to take in information from the replicas as well to see if they are able to reach the primary MySQL server. If the replication is working fine for the replicas, then VTOrc assumes that it is network partitioned from the MySQL server and doesn't run an ERS.
    For vttablet failures, this however becomes a problem! Since vttablets don't contact each other, if the MySQL servers are working fine, VTOrc doesn't run an ERS. We would need some way to know if the vttablet is actually down, or just network partitioned from VTOrc. We could potentially use the vtgate for it 🤷‍♂️, but this is just an idea. It needs more thought on how we want to handle this.

Irrespective of the problems listed above, this PR is still an improvement for both VTOrc and ERS, since it does allow ERS to handle multiple failures with the default values and it allows VTOrc to handle the situations where both vttablet and MySQL server is down. A test for this situation has been added to verify that this indeed does work after the changes.

…ltiple times

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 removed the release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) label Dec 13, 2022
@GuptaManan100
Copy link
Member Author

I have now fixed the default timeout of onclose_timeout flag. Do we need to include this in the release notes summary too?
I have also patched VTOrc not to forget vttablet instances with an empty hostname as long as they are present in the topo server. A test has been added for the same.

@GuptaManan100 GuptaManan100 marked this pull request as ready for review December 13, 2022 11:03
…ents to allow vttablets to setup replication when restarted

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@deepthi
Copy link
Member

deepthi commented Dec 14, 2022

I have now fixed the default timeout of onclose_timeout flag. Do we need to include this in the release notes summary too?

I doubt anyone is actually using it, so this should be fine without a release note.

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work cleaning up a whole bunch of incorrect timeout handling in this code.
Please address all feedback before merging.

Comment on lines +48 to +49
During upgrades, if the users want to preserve the same behaviour as previous releases, then they should provide the `remote_operation_timeout` flag explicitly before upgrading.
After the upgrade, they should then alter their configuration to also specify `lock-timeout` explicitly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying both remote_operation_timeout and lock-timeout to 30s will expose users to the bug that is being fixed in this PR.
Instead we should recommend that if (and only if) they get timeouts during PRS or ERS, they may need to increase remote_operation_timeout and lock-timeout at the same time while making sure that lock-timeout is at least 15 seconds more than remote_operation_timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the defaults are 15 and 45 seconds, and we don't want lock-timeout to be at least 15 seconds more than remote_operation_timeout, we want it to be atleaset 2 times it and a littlle more.

Well in the previous release this bug exists, and only after upgrading and specifying both the flags (or specifying neither) can they really get rid of the bug. So the recommended process is to specify the remote-operation_timeout first so that while upgrading their behaviour remains the same. Once ugpraded they can choose whatever values they want.

@@ -1,5 +1,5 @@
Usage of vtctld:
--action_timeout duration time to wait for an action before resorting to force (default 2m0s)
--action_timeout duration time to wait for an action before resorting to force (default 1m0s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changing? Can you also check the blame so that we can see whether it changed recently?
There's an action_timeout that defaults to 1h on the vtctlclient and vtctldclient binaries. How does this relate to those?

Copy link
Member Author

@GuptaManan100 GuptaManan100 Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a effect of

DefaultActionTimeout = topo.RemoteOperationTimeout * 4
and
actionTimeout = wrangler.DefaultActionTimeout

Apparently there are two action_timeouts 🤣. That action_timeout is a parameter for vtctldclient and vtctlclient. This one is taken by vtctld and applies only to ApplyKeyspaceAction, ApplyShardAction and ApplyTabletAction. This is the commit that introduced it - facdb18

@@ -23,14 +23,14 @@ Usage of vtorc:
--keep_logs duration keep logs for this long (using ctime) (zero to keep forever)
--keep_logs_by_mtime duration keep logs for this long (using mtime) (zero to keep forever)
--lameduck-period duration keep running at least this long after SIGTERM before stopping (default 50ms)
--lock-shard-timeout duration Duration for which a shard lock is held when running a recovery (default 30s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to deprecate this flag?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have deprecated it -

fs.Duration("lock-shard-timeout", 30*time.Second, "Duration for which a shard lock is held when running a recovery")
_ = fs.MarkDeprecated("lock-shard-timeout", "Please use lock-timeout instead.")

Deprecated flags don't show up in the flags output either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to the summary too.

@@ -70,6 +70,17 @@ func Parse(fs *flag.FlagSet) {
flag.Parse()
}

// IsFlagPassed returns if the given flag has been provided by the user explicitly or not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: IsFlagProvided
WDYT? also @ajm188

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong opinions. Whatever we feel is best. I'll change it to IsFlagProvided for now.

go/test/endtoend/reparent/plannedreparent/reparent_test.go Outdated Show resolved Hide resolved
ctx = trace.CopySpan(context.TODO(), ctx)
ctx, cancel := context.WithTimeout(ctx, defaultLockTimeout)
ctx, cancel := context.WithTimeout(ctx, RemoteOperationTimeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I have missed something why we are not using getLockedTimeout during unlock?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlock is just a single remote operation. It doesn't need more time than a RemoteOperationTimeout. getLockTimeout is the time for which we want to lock the shard for.

ctx = trace.CopySpan(context.TODO(), ctx)
ctx, cancel := context.WithTimeout(ctx, defaultLockTimeout)
ctx, cancel := context.WithTimeout(ctx, RemoteOperationTimeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

go/vt/topo/locks_test.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletmanager/tm_init.go Outdated Show resolved Hide resolved
Signed-off-by: Manan Gupta <manan@planetscale.com>
Copy link
Contributor

@rsajwani rsajwani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM

@@ -129,7 +129,7 @@ func setupCluster(ctx context.Context, t *testing.T, shardName string, cells []s
// the replication manager to silently fix the replication in case ERS or PRS mess up. All the
// tests in this test suite should work irrespective of this flag. Each run of ERS, PRS should be
// setting up the replication correctly.
"--disable_active_reparents")
"--disable-replication-manager")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the context of this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is required because in the tests we test that a tablet that is restarted starts replicating from the primary again. So we want the replication initialization code to run. disable_active_reparents blocks both the replication manager and the replication initialization, but disable-replication-manager only blocks the first.

We required this change after increasing the onclose timeout. Earlier, because onclose timeout was 1ns, the vttablets when shutting down wouldn't have the time to remove the hostname, port from the vttablet records. This caused a call to ERS to have a hanging SetReplicationSource. This call would succeed when the tablet was up again. But this is incorrect and not want the test really wanted to test.

After increasing the onclose timeout, the vttablets now remove their hostname, port information from the tablet records. This causes the SetReplicationSource call to be directed at a tablet with address :0. So that call doesn't setup the replication. Instead, we require the replication initialization to do it properly. This is the reason for this change.

These changes resolve flakiness in the ERS tests which showed itself as vttablet timed out after 10s failing to reach the Serving state, in the few cases that onclose function did complete before timing out.

// we deliberatly test the flag to make sure it's not accidently set to a
// high value.
if finished, want := fireOnCloseHooks(onCloseTimeout), false; finished != want {
if finished, want := fireOnCloseHooks(1*time.Nanosecond), false; finished != want {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1ns is here to check backward compact. given the default value is change now ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change, because otherwise the test would need to run for more than 10 seconds. The intent of the test is to verify that fireOnCloseHooks respects the timeout provided to it and that it returns after that much time. If we give the default here, we would need a sleep for like 11 seconds or something in the Onclose function and that would just slow down the test.

ctx = trace.CopySpan(context.TODO(), ctx)
ctx, cancel := context.WithTimeout(ctx, defaultLockTimeout)
ctx, cancel := context.WithTimeout(ctx, RemoteOperationTimeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I have missed something why we are not using getLockedTimeout during unlock?

go/vt/topo/locks_test.go Show resolved Hide resolved
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 merged commit 46d63e9 into vitessio:main Dec 16, 2022
@GuptaManan100 GuptaManan100 deleted the timeout-fixes branch December 16, 2022 09:02
@deepthi deepthi restored the timeout-fixes branch January 11, 2023 23:15
@deepthi deepthi mentioned this pull request Jan 11, 2023
3 tasks
@deepthi deepthi deleted the timeout-fixes branch January 11, 2023 23:34
@deepthi deepthi restored the timeout-fixes branch January 12, 2023 00:46
maxenglander pushed a commit to planetscale/vitess that referenced this pull request Jun 26, 2023
* refactor: move tests out of newfeaturestest so that they run on upgrade-downgrade tests too

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add failing ers test for handling multiple vttablet failures with default values of flags

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add a new lock-timeout flag and use that instead of remote-operation-timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: augment DownPrimary test to reproduce the issue of VTOrc not handling multiple failures

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove LockShardTimeout configuration from VTOrc and add parallelism to refresh of tablets

Signed-off-by: Manan Gupta <manan@planetscale.com>

* log: add more logging lines around ers in vtorc

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: get the test to work

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix usage of wait for replicas timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags expected output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix race in test now that the function is called in parallel multiple times

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix default of onCloseTimeout to 1 second

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add failing unit test to refreshTabletsInKeyspaceShard

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix vtorc to not forget a tablet which has been deleted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix backward compatibility, add tests and release notes docs

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: use disable-replication-manager instead of disable-active-reparents to allow vttablets to setup replication when restarted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flaky test by not checking for an error

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: handle the case of empty hostname in tablet initialization

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: update onclose timeout to 10 seconds

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix unit test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: address review comments

Signed-off-by: Manan Gupta <manan@planetscale.com>

* docs: add comments explaining the test functions

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add summary docs for 'lock-shard-timeout' deprecation

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Max Englander <max@planetscale.com>
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request Sep 11, 2023
* refactor: move tests out of newfeaturestest so that they run on upgrade-downgrade tests too

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add failing ers test for handling multiple vttablet failures with default values of flags

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add a new lock-timeout flag and use that instead of remote-operation-timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: augment DownPrimary test to reproduce the issue of VTOrc not handling multiple failures

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove LockShardTimeout configuration from VTOrc and add parallelism to refresh of tablets

Signed-off-by: Manan Gupta <manan@planetscale.com>

* log: add more logging lines around ers in vtorc

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: get the test to work

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix usage of wait for replicas timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags expected output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix race in test now that the function is called in parallel multiple times

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix default of onCloseTimeout to 1 second

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add failing unit test to refreshTabletsInKeyspaceShard

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix vtorc to not forget a tablet which has been deleted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix backward compatibility, add tests and release notes docs

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: use disable-replication-manager instead of disable-active-reparents to allow vttablets to setup replication when restarted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flaky test by not checking for an error

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: handle the case of empty hostname in tablet initialization

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: update onclose timeout to 10 seconds

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix unit test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: address review comments

Signed-off-by: Manan Gupta <manan@planetscale.com>

* docs: add comments explaining the test functions

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add summary docs for 'lock-shard-timeout' deprecation

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request Mar 15, 2024
* refactor: move tests out of newfeaturestest so that they run on upgrade-downgrade tests too

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add failing ers test for handling multiple vttablet failures with default values of flags

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add a new lock-timeout flag and use that instead of remote-operation-timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: augment DownPrimary test to reproduce the issue of VTOrc not handling multiple failures

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove LockShardTimeout configuration from VTOrc and add parallelism to refresh of tablets

Signed-off-by: Manan Gupta <manan@planetscale.com>

* log: add more logging lines around ers in vtorc

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: get the test to work

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix usage of wait for replicas timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags expected output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix race in test now that the function is called in parallel multiple times

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix default of onCloseTimeout to 1 second

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add failing unit test to refreshTabletsInKeyspaceShard

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix vtorc to not forget a tablet which has been deleted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix backward compatibility, add tests and release notes docs

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: use disable-replication-manager instead of disable-active-reparents to allow vttablets to setup replication when restarted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flaky test by not checking for an error

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: handle the case of empty hostname in tablet initialization

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: update onclose timeout to 10 seconds

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix unit test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: address review comments

Signed-off-by: Manan Gupta <manan@planetscale.com>

* docs: add comments explaining the test functions

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add summary docs for 'lock-shard-timeout' deprecation

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
timvaillancourt added a commit to slackhq/vitess that referenced this pull request May 16, 2024
* VTOrc running PRS when database_instance empty bug fix. (vitessio#12019)

* feat: convert join with database_instance to a left join and prevent fixes from running if the information from database_instance is unavailable

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add tests to verify the fix works

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

* Timeout Fixes and VTOrc Improvement (vitessio#11881)

* refactor: move tests out of newfeaturestest so that they run on upgrade-downgrade tests too

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add failing ers test for handling multiple vttablet failures with default values of flags

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add a new lock-timeout flag and use that instead of remote-operation-timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: augment DownPrimary test to reproduce the issue of VTOrc not handling multiple failures

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove LockShardTimeout configuration from VTOrc and add parallelism to refresh of tablets

Signed-off-by: Manan Gupta <manan@planetscale.com>

* log: add more logging lines around ers in vtorc

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: get the test to work

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix usage of wait for replicas timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags expected output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix race in test now that the function is called in parallel multiple times

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix default of onCloseTimeout to 1 second

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add failing unit test to refreshTabletsInKeyspaceShard

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix vtorc to not forget a tablet which has been deleted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix backward compatibility, add tests and release notes docs

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: use disable-replication-manager instead of disable-active-reparents to allow vttablets to setup replication when restarted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flaky test by not checking for an error

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: handle the case of empty hostname in tablet initialization

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: update onclose timeout to 10 seconds

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix unit test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: address review comments

Signed-off-by: Manan Gupta <manan@planetscale.com>

* docs: add comments explaining the test functions

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add summary docs for 'lock-shard-timeout' deprecation

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

* log: also log error in DiscoverInstance when force discovery is specified (vitessio#11936)

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

* VTOrc Code Cleanup - generate_base, replace cluster_name with keyspace and shard. (vitessio#12012)

* feat: refactor generate commands of VTOrc to be in a single file

Signed-off-by: Manan Gupta <manan@planetscale.com>

* refactor: cleanup create table formatting

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: cleanup the usage of IsSQLite and IsMySQL

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused minimal instance

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused table cluster_domain_name

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix vtorc database to store keyspace and shard instead of cluster

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused attributes

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused cluster domain

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: change GetClusterName to GetKeyspaceAndShardName

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix insertion into database_instance

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix SnapshotTopologies

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove inject unseen primary and inject seed

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove ClusterName from Instance

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix Audit operations

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add Keyspace and Shard to cluster information to replace ClusterName

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix attempt failure detection registeration

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix blocked topology recoveries

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix topology recovery

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: reading recovery instances

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix get replication and analysis

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix bug in query

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add tests to check that filtering by keyspace works for APIs

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove remaining usages of ClusterName

Signed-off-by: Manan Gupta <manan@planetscale.com>

* refactor: fix comment explaining sleep in the test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add code to prevent filtering just by shard and add tests for it

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

* Fix insert query of blocked_recovery table in VTOrc (vitessio#12091)

* feat: add failing test and fix the query of insertion

Signed-off-by: Manan Gupta <manan@planetscale.com>

* empty-commit

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

* Fix: VTOrc forgetting old instances (vitessio#12089)

* test: add a failing test for the case where the port changes for a tablet

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix the issue by adding alias as a unique field

Signed-off-by: Manan Gupta <manan@planetscale.com>

* empty-commit

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

* Move vtorc from go-sqlite3 to modernc.org/sqlite (vitessio#12214)

* Move vtorc from go-sqlite3 to modernc.org/sqlite

This moves vtorc from the go-sqlite3 library that uses CGO, to use
modernc.org/sqlite which is a pure Go implementation.

vtorc is the only component we have to build with CGO but it's causing
pain for releases since we need to build it against an old Linux for
linking against glibc.

Using modernc.org/sqlite allows for using Go only again and makes all
Vitess components buildable without CGO.

In
https://datastation.multiprocess.io/blog/2022-05-12-sqlite-in-go-with-and-without-cgo.html
someone ran some basic benchmarks. It shows that the pure Go version can
be twice as slow, but the usage of vtorc is very limited and we operate
on small datasets, so I think the performance impact purely of a
somewhat slower sqlite implementation is negligable.

None of this is in a hot query serving path or anything like that, so I
have little concern performance wise.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Fix error handling in RowToArray

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

---------

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* see if CI passes on v14.0.5 as previous release

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Revert "see if CI passes on v14.0.5 as previous release"

This reverts commit 53a0e0c.

---------

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com>
Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Cluster management Component: VTorc Vitess Orchestrator integration NeedsWebsiteDocsUpdate What it says Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: VTOrc repairs fail when vttablet is unreachable
3 participants