freno
was written to assist in controlling writes to MySQL
clusters.
MySQL
installments typically include a master and multiple replicas. Aggressive write to the master may cause increased replication lags. Replication lags have multiple undesired effects, such as stale replica data.
Common operations apply massive changes to MySQL data, such as:
- Archiving/purging of old data (e.g. via
pt-archiver
) - Online migrations (e.g. via
gh-ost
) - Population of newly added columns
- Bulk loading of data (e.g. importing data from Hadoop via
sqoop
) - Any application generated massive update
Such operations can and should be broken to smaller subtasks (e.g. 100
rows at a time) and throttle based on replication lag.
Tools such as gh-ost
and pt-archiver
already support auto-throttling by replication lag, however:
- They use different internal implementations
- They need to be given the list of servers
- They do not adapt automatically to a change in the list of relevant servers (
gh-ost
can be updated dynamically, but again must be told the identities of the replicas)
freno
provides a unified service which self-adapts to changes in MySQL replica inventory.
Let's dissect the MySQL part of the sample config file:
You will find the top-level configuration:
"MySQL": {
"User": "some_user",
"Password": "${mysql_password_env_variable}",
"MetricQuery": "select unix_timestamp(now(6)) - unix_timestamp(ts) as lag_check from meta.heartbeat order by ts desc limit 1",
"CacheMillis": 0,
"ThrottleThreshold": 1.0,
"IgnoreHostsCount": 0,
"IgnoreHostsThreshold": 0.0,
"HttpCheckPort": -1,
"HttpCheckPath": "path-to-check",
"IgnoreHosts": [
"us-east-1",
"us-east-2"
],
"Clusters": {
}
}
These params apply in general to all MySQL clusters, unless specified differently (overridden) on a per-cluster basis.
-
User
,Password
: these can be specified as plaintext, or in a${some_env_variable}
format, in which casefreno
will look up its environment for specified variable. (e.g. to match the above config, ashell
script invokingfreno
canexport mysql_password_env_variable=flyingcircus
) -
MetricQuery
:- Note: returned value is expected to be
[0..)
(0
or more), where lower values are "better" and higher values are "worse". - if not provided,
freno
will assume you're interested in replication lag, and will issue aSHOW SLAVE STATUS
to extractSeconds_behind_master
- We strongly recommend using a custom heartbeat mechanism such as
pt-heartbeat
, with subsecond resolution. The sample query above works well withpt-heartbeat
subsecond timestamps. - Strictly speaking, you don't have to provide a replication-lag metric. This could be any query that reports any metric. However you're likely interested in replication lag to start with.
- Note: the default time unit for replication lag is seconds
- Note: returned value is expected to be
-
CacheMillis
: optional (default:0
, disabled), cacheMetricQuery
results. For some queries it make senses to poll aggressively (such is replication lag measurement). For some other queries, it does not. You may, for example, throttle on master's load instead of replication lag. Or on master's history length. In such cases you may wish to only query the master in longer intervals. WhenCacheMillis > 0
freno
will cache valid (non-error) query results for specified number of milliseconds. -
ThrottleThreshold
: an upper limit for valid collected values. If value collected (viaMetricQuery
) is below or equal toThrottleThreshold
, cluster is considered to be good to write to. If higher, then cluster writes will need to be throttled.- Note: valid range is
[0..)
(0
or more), where lower values are stricter and higher values are more permissive. - Note: use seconds as replication lag time unit. In the above we throttle above
1.0
seconds.
- Note: valid range is
-
IgnoreHostsCount
: number of hosts that can be ignored while aggregating cluster's values. For example, ifIgnoreHostsCount
is2
, then up to2
hosts that have errors are silently ignored. Or, if there's no errors, the two highest values will be ignored (so if these two values exceed the cluster's threshold,freno
may still be happy to allow writes to the cluster). -
IgnoreHostsThreshold
: applies a conditional to theIgnoreHostsCount
logic: for hosts with errors, no conditional applies. For hosts reporting a value, the value needs to be higher than given threshold to be ignored. A use case: ignore up ton
lagging replicas, on condition that they're lagging at least10.0sec
. -
HttpCheckPort
: when> 0
, and together withHttpCheckPath
,freno
will run a HTTP check on the MySQL boxes. For a given cluster there can only be one HTTP check on a MySQL box, even if one has multiple MySQL services running on that box. The HTTP check may return any HTTP status. The404 Not Found
status is special:freno
will completely disregard hosts where HTTP checks return404
.You may override
HttpCheckPort
on specific clusters. Set to-1
to disable HTTP check. -
HttpCheckPath
: path to test. e.g. when"HttpCheckPort": 1234
and"HttpCheckPath": "health"
,freno
will testhttp://<mysql-box>:1234/health
.You may override
HttpCheckPath
on specific clusters. -
IgnoreHosts
: array of substrings. A host is completely ignored byfreno
if it contains a substring listed inIgnoreHosts
. Like other values, this value can be overridden per-cluster. A non-emptyIgnoreHosts
in a specific cluster will replace theMySQL
scope definition, for that cluster. An emptyIgnoreHosts
in a cluster scope will not un-ignore the patterns specified inMySQL
scope. If you want to un-ignore theMySQL
scope use some thing like"IgnoreHosts": ["--no-such-pattern--"],
, known to never match any of your hosts.
Looking at clusters configuration:
"Clusters": {
"prod4": {
"ThrottleThreshold": 0.8,
"HAProxySettings": {
"Host": "my.haproxy.mydomain.com",
"Port": 1001,
"PoolName": "my_prod4_pool"
}
},
"sharded": {
"IgnoreHosts": [
"us-east-2"
],
"VitessSettings": {
"API": "https://vtctld.example.com/api/",
"Keyspace": "my_sharded_ks"
}
},
"local": {
"User": "msandbox",
"Password": "msandbox",
"IgnoreHostsCount": 1,
"StaticHostsSettings" : {
"Hosts": [
"127.0.0.1:22293",
"127.0.0.1:22294",
"127.0.0.1:22295"
]
}
}
}
This introduces two clusters: prod4
and local
. freno
will only serve requests for these two clusters. Any other request (e.g. /check/archive/mysql/prod7
) will be answered with HTTP 500
-- an unknown metric
Noteworthy:
prod4
chooses to (but doesn't have to) override theThrottleThreshold
to0.8
secondsprod4
list of servers is dictated byHAProxy
.freno
will routinely and dynamically poll given HAProxy server for list of hosts. These will include any hosts not inNOLB
.local
cluster chooses to overrideUser
,Password
andIgnoreHostsCount
.local
cluster defines a static list of hosts.
freno
isn't necessarily about replication lag. You may choose to use different thresholds appropriate for your setup and workload. For example, you may choose to monitor the master (as opposed of the replicas) and read some metric such as threads_running
. An example configuration would be:
"Clusters": {
"master8": {
"MetricQuery": "show global variables like 'threads_running'",
"CacheMillis": 500,
"ThrottleThreshold": 50,
"User": "msandbox",
"Password": "msandbox",
"StaticHostsSettings" : {
"Hosts": [
"my.master8.vip.com:3306"
]
}
}
}
freno
explicitly recognizes show global ...
statements and reads the result's numeric value.
Otherwise you may provide any query that returns a single row, single numeric column.