Skip to content

Commit

Permalink
Improve VTOrc config handling to support dynamic variables (#17218)
Browse files Browse the repository at this point in the history
Signed-off-by: Manan Gupta <manan@planetscale.com>
  • Loading branch information
GuptaManan100 authored Dec 3, 2024
1 parent 1ee4e97 commit 1058621
Show file tree
Hide file tree
Showing 37 changed files with 667 additions and 556 deletions.
24 changes: 23 additions & 1 deletion changelog/22.0/22.0.0/summary.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
- **[Major Changes](#major-changes)**
- **[RPC Changes](#rpc-changes)**
- **[Prefer not promoting a replica that is currently taking a backup](#reparents-prefer-not-backing-up)**
- **[VTOrc Config File Changes](#vtorc-config-file-changes)**


## <a id="major-changes"/>Major Changes</a>
Expand All @@ -25,4 +26,25 @@ For planned reparents, hosts taking backups with a backup engine other than `bui
valid candidates. This means they will never get promoted - not even if there's no other candidates.

Note that behavior for `builtin` backups remains unchanged: a replica that is currently taking a `builtin` backup will
never be promoted, neither by planned nor by emergency reparents.
never be promoted, neither by planned nor by emergency reparents.

### <a id="vtorc-config-file-changes"/>VTOrc Config File Changes</a>

The configuration file for VTOrc has been updated to now support dynamic fields. The old `--config` parameter has been removed. The alternative is to use the `--config-file` parameter. The configuration can now be provided in json, yaml or any other format that [viper](https://github.com/spf13/viper) supports.

The following fields can be dynamically changed -
1. `instance-poll-time`
2. `prevent-cross-cell-failover`
3. `snapshot-topology-interval`
4. `reasonable-replication-lag`
5. `audit-to-backend`
6. `audit-to-syslog`
7. `audit-purge-duration`
8. `wait-replicas-timeout`
9. `tolerable-replication-lag`
10. `topo-information-refresh-duration`
11. `recovery-poll-duration`
12. `allow-emergency-reparent`
13. `change-tablets-with-errant-gtid-to-drained`

To upgrade to the newer version of the configuration file, first switch to using the flags in your current deployment before upgrading. Then you can switch to using the configuration file in the newer release.
4 changes: 3 additions & 1 deletion examples/common/scripts/vtorc-up.sh
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,9 @@ vtorc \
$TOPOLOGY_FLAGS \
--logtostderr \
--alsologtostderr \
--config="${script_dir}/../vtorc/config.json" \
--config-path="${script_dir}/../vtorc/" \
--config-name="config.yaml" \
--config-type="yml" \
--port $port \
> "${log_dir}/vtorc.out" 2>&1 &

Expand Down
4 changes: 0 additions & 4 deletions examples/common/vtorc/config.json

This file was deleted.

13 changes: 13 additions & 0 deletions examples/common/vtorc/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
instance-poll-time: 1s
prevent-cross-cell-failover: false
snapshot-topology-interval: 0h
reasonable-replication-lag: 10s
audit-to-backend: false
audit-to-syslog: false
audit-purge-duration: 168h
wait-replicas-timeout: 30s
tolerable-replication-lag: 0s
topo-information-refresh-duration: 15s
recovery-poll-duration: 1s
allow-emergency-reparent: true
change-tablets-with-errant-gtid-to-drained: false
2 changes: 1 addition & 1 deletion examples/operator/101_initial_cluster.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ spec:
cpu: 100m
memory: 128Mi
extraFlags:
recovery-period-block-duration: 5s
instance-poll-time: 1s
partitionings:
- equal:
parts: 1
Expand Down
2 changes: 1 addition & 1 deletion examples/operator/201_customer_tablets.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ spec:
cpu: 100m
memory: 128Mi
extraFlags:
recovery-period-block-duration: 5s
instance-poll-time: 1s
partitionings:
- equal:
parts: 1
Expand Down
2 changes: 1 addition & 1 deletion examples/operator/302_new_shards.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ spec:
cpu: 100m
memory: 128Mi
extraFlags:
recovery-period-block-duration: 5s
instance-poll-time: 1s
partitionings:
- equal:
parts: 1
Expand Down
2 changes: 1 addition & 1 deletion examples/operator/306_down_shard_0.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ spec:
cpu: 100m
memory: 128Mi
extraFlags:
recovery-period-block-duration: 5s
instance-poll-time: 1s
partitionings:
- equal:
parts: 1
Expand Down
2 changes: 1 addition & 1 deletion examples/operator/401_scheduled_backups.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -116,7 +116,7 @@ spec:
cpu: 100m
memory: 128Mi
extraFlags:
recovery-period-block-duration: 5s
instance-poll-time: 1s
partitionings:
- equal:
parts: 1
Expand Down
16 changes: 4 additions & 12 deletions go/cmd/vtorc/cli/cli.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ import (
"github.com/spf13/cobra"

"vitess.io/vitess/go/acl"
"vitess.io/vitess/go/viperutil/debug"
"vitess.io/vitess/go/vt/log"
"vitess.io/vitess/go/vt/servenv"
"vitess.io/vitess/go/vt/vtorc/config"
Expand All @@ -29,8 +30,7 @@ import (
)

var (
configFile string
Main = &cobra.Command{
Main = &cobra.Command{
Use: "vtorc",
Short: "VTOrc is the automated fault detection and repair tool in Vitess.",
Example: `vtorc \
Expand All @@ -51,22 +51,16 @@ var (

func run(cmd *cobra.Command, args []string) {
servenv.Init()
config.UpdateConfigValuesFromFlags()
inst.RegisterStats()

log.Info("starting vtorc")
if len(configFile) > 0 {
config.ForceRead(configFile)
} else {
config.Read("/etc/vtorc.conf.json", "conf/vtorc.conf.json", "vtorc.conf.json")
}
if config.Config.AuditToSyslog {
if config.GetAuditToSyslog() {
inst.EnableAuditSyslog()
}
config.MarkConfigurationLoaded()

// Log final config values to debug if something goes wrong.
config.LogConfigValues()
log.Infof("Running with Configuration - %v", debug.AllSettings())
server.StartVTOrcDiscovery()

server.RegisterVTOrcAPIEndpoints()
Expand Down Expand Up @@ -96,7 +90,5 @@ func init() {
servenv.MoveFlagsToCobraCommand(Main)

logic.RegisterFlags(Main.Flags())
config.RegisterFlags(Main.Flags())
acl.RegisterFlags(Main.Flags())
Main.Flags().StringVar(&configFile, "config", "", "config file name")
}
1 change: 0 additions & 1 deletion go/flags/endtoend/vtorc.txt
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,6 @@ Flags:
--catch-sigpipe catch and ignore SIGPIPE on stdout and stderr if specified
--change-tablets-with-errant-gtid-to-drained Whether VTOrc should be changing the type of tablets with errant GTIDs to DRAINED
--clusters_to_watch strings Comma-separated list of keyspaces or keyspace/shards that this instance will monitor and repair. Defaults to all clusters in the topology. Example: "ks1,ks2/-80"
--config string config file name
--config-file string Full path of the config file (with extension) to use. If set, --config-path, --config-type, and --config-name are ignored.
--config-file-not-found-handling ConfigFileNotFoundHandling Behavior when a config file is not found. (Options: error, exit, ignore, warn) (default warn)
--config-name string Name of the config file (without extension) to search for. (default "vtconfig")
Expand Down
1 change: 0 additions & 1 deletion go/test/endtoend/cluster/cluster_process.go
Original file line number Diff line number Diff line change
Expand Up @@ -1301,7 +1301,6 @@ func (cluster *LocalProcessCluster) NewVTOrcProcess(config VTOrcConfiguration) *
VtctlProcess: *base,
LogDir: cluster.TmpDirectory,
Config: config,
WebPort: cluster.GetAndReservePort(),
Port: cluster.GetAndReservePort(),
}
}
Expand Down
57 changes: 35 additions & 22 deletions go/test/endtoend/cluster/vtorc_process.go
Original file line number Diff line number Diff line change
Expand Up @@ -43,20 +43,28 @@ type VTOrcProcess struct {
ExtraArgs []string
ConfigPath string
Config VTOrcConfiguration
WebPort int
NoOverride bool
proc *exec.Cmd
exit chan error
}

type VTOrcConfiguration struct {
Debug bool
ListenAddress string
RecoveryPeriodBlockSeconds int
TopologyRefreshSeconds int `json:",omitempty"`
PreventCrossDataCenterPrimaryFailover bool `json:",omitempty"`
LockShardTimeoutSeconds int `json:",omitempty"`
ReplicationLagQuery string `json:",omitempty"`
FailPrimaryPromotionOnLagMinutes int `json:",omitempty"`
InstancePollTime string `json:"instance-poll-time,omitempty"`
SnapshotTopologyInterval string `json:"snapshot-topology-interval,omitempty"`
PreventCrossCellFailover bool `json:"prevent-cross-cell-failover,omitempty"`
ReasonableReplicationLag string `json:"reasonable-replication-lag,omitempty"`
AuditToBackend bool `json:"audit-to-backend,omitempty"`
AuditToSyslog bool `json:"audit-to-syslog,omitempty"`
AuditPurgeDuration string `json:"audit-purge-duration,omitempty"`
WaitReplicasTimeout string `json:"wait-replicas-timeout,omitempty"`
TolerableReplicationLag string `json:"tolerable-replication-lag,omitempty"`
TopoInformationRefreshDuration string `json:"topo-information-refresh-duration,omitempty"`
RecoveryPollDuration string `json:"recovery-poll-duration,omitempty"`
AllowEmergencyReparent string `json:"allow-emergency-reparent,omitempty"`
ChangeTabletsWithErrantGtidToDrained bool `json:"change-tablets-with-errant-gtid-to-drained,omitempty"`
LockShardTimeoutSeconds int `json:",omitempty"`
ReplicationLagQuery string `json:",omitempty"`
FailPrimaryPromotionOnLagMinutes int `json:",omitempty"`
}

// ToJSONString will marshal this configuration as JSON
Expand All @@ -65,12 +73,12 @@ func (config *VTOrcConfiguration) ToJSONString() string {
return string(b)
}

func (config *VTOrcConfiguration) AddDefaults(webPort int) {
config.Debug = true
if config.RecoveryPeriodBlockSeconds == 0 {
config.RecoveryPeriodBlockSeconds = 1
}
config.ListenAddress = fmt.Sprintf(":%d", webPort)
func (config *VTOrcConfiguration) addValuesToCheckOverride() {
config.InstancePollTime = "10h"
}

func (orc *VTOrcProcess) RewriteConfiguration() error {
return os.WriteFile(orc.ConfigPath, []byte(orc.Config.ToJSONString()), 0644)
}

// Setup starts orc process with required arguements
Expand All @@ -91,7 +99,9 @@ func (orc *VTOrcProcess) Setup() (err error) {
orc.ConfigPath = configFile.Name()

// Add the default configurations and print them out
orc.Config.AddDefaults(orc.WebPort)
if !orc.NoOverride {
orc.Config.addValuesToCheckOverride()
}
log.Errorf("configuration - %v", orc.Config.ToJSONString())
_, err = configFile.WriteString(orc.Config.ToJSONString())
if err != nil {
Expand All @@ -111,15 +121,18 @@ func (orc *VTOrcProcess) Setup() (err error) {
"--topo_implementation", orc.TopoImplementation,
"--topo_global_server_address", orc.TopoGlobalAddress,
"--topo_global_root", orc.TopoGlobalRoot,
"--config", orc.ConfigPath,
"--config-file", orc.ConfigPath,
"--port", fmt.Sprintf("%d", orc.Port),
// This parameter is overriden from the config file, added here to just verify that we indeed use the config file paramter over the flag
"--recovery-period-block-duration", "10h",
"--instance-poll-time", "1s",
// Faster topo information refresh speeds up the tests. This doesn't add any significant load either
"--topo-information-refresh-duration", "3s",
"--bind-address", "127.0.0.1",
)
if !orc.NoOverride {
orc.proc.Args = append(orc.proc.Args,
// This parameter is overriden from the config file. This verifies that we indeed use the flag value over the config file.
"--instance-poll-time", "1s",
// Faster topo information refresh speeds up the tests. This doesn't add any significant load either.
"--topo-information-refresh-duration", "3s",
)
}

if *isCoverage {
orc.proc.Args = append(orc.proc.Args, "--test.coverprofile="+getCoveragePath("orc.out"))
Expand Down
3 changes: 1 addition & 2 deletions go/test/endtoend/vtorc/api/api_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -35,8 +35,7 @@ import (
func TestAPIEndpoints(t *testing.T) {
defer cluster.PanicHandler(t)
utils.SetupVttabletsAndVTOrcs(t, clusterInfo, 2, 1, nil, cluster.VTOrcConfiguration{
PreventCrossDataCenterPrimaryFailover: true,
RecoveryPeriodBlockSeconds: 5,
PreventCrossCellFailover: true,
}, 1, "")
keyspace := &clusterInfo.ClusterInstance.Keyspaces[0]
shard0 := &keyspace.Shards[0]
Expand Down
Loading

0 comments on commit 1058621

Please sign in to comment.