-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add check to detect increased etcd traffic #4316
add check to detect increased etcd traffic #4316
Conversation
899d89f
to
733724b
Compare
|
||
cmd = [ | ||
'/bin/journalctl', | ||
'-ru', 'etcd', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sosiouxme will make this a module param. Not sure what the value should be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit? It'll be etcd
for multi-master clusters or the master unit name which depends on the deployment type
c52fb2e
to
6a1ff5f
Compare
aos-ci-test |
@sosiouxme will begin adding tests |
0bcd1aa
to
9153160
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looking pretty good!
|
||
cmd = [ | ||
'/bin/journalctl', | ||
'-ru', 'etcd', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit? It'll be etcd
for multi-master clusters or the master unit name which depends on the deployment type
cmd = [ | ||
'/bin/journalctl', | ||
'-ru', 'etcd', | ||
'--output', module.params["output"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no real reason to parameterize this, you'll always want json because it's machine-readable that way
@@ -0,0 +1,95 @@ | |||
# pylint: disable=missing-docstring | |||
|
|||
"""Interface to journalctl""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hard to believe there isn't a module for journalctl... guess it just shows, no one is doing monitoring/diagnostics with ansible yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's actually nothing about this module that's specific to the task at hand; it could probably be renamed to search_journal.py or some such. although in a more generic implementation you'd probably search for several things at the same time in a single unit's journal. but we can cross that bridge later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although in a more generic implementation you'd probably search for several things at the same time in a single unit's journal. but we can cross that bridge later.
@sosiouxme yeah, was thinking of implementing this, but was not sure how to handle "callbacks" for different "matchers". Ideally I would send an array of dictionaries, with each dictionary containing a regexp field, and a function to be called if that regex was matched, to the remote module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just have it pass back what matched which regex and let the caller figure out how to respond.
try: | ||
obj = json.loads(line.rstrip()) | ||
if start_matcher.match(obj["MESSAGE"]): | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# don't need to look past the most recent service restart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just because that may not be obvious)
@@ -0,0 +1,35 @@ | |||
# pylint: disable=missing-docstring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't do that...
def is_active(cls, task_vars): | ||
"""Skip hosts that do not have etcd or masters in their group names.""" | ||
group_names = get_var(task_vars, "group_names", default=[]) | ||
active = "masters" in group_names or "etcd" in group_names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, it's masters if there's only one master; etcd if there are multiple masters. there's a variable for figuring out what's in all the groups - maybe just groups
?
it's not exactly critical to get this right in is_active
but we might as well.
def run(self, tmp, task_vars): | ||
match = self.module_executor("etcdlogs", { | ||
"log_matcher": { | ||
"regexp": "etcd: sync duration of \d+\.\d+s, expected less than 1s", # pylint: disable=anomalous-backslash-in-string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use r'
so the backslashes aren't misinterpreted - or does that not work across the module invocation? i think it should be fine.
if match["matched"]: | ||
msg = ("Higher than normal etcd traffic detected. " | ||
"OpenShift 3.4 introduced an increase in etcd traffic by at least a factor of 4." | ||
"\nUpgrading to OpenShift 3.6 is recommended in order to fix this issue.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending an upgrade there are a few tricks for mitigating this; can we mention those too? Or best, refer to a kbase if we have one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, linked to https://access.redhat.com/solutions/291638
1865f63
to
afb6496
Compare
@juanvallejo is this still [WIP]? looks pretty done-ish. got some lint/flake8 stuff to fix though. |
@sosiouxme thanks, removed WIP tag. Will address lint stuff |
9cfdd5b
to
fbf31aa
Compare
2b8d30b
to
1372c59
Compare
@rhcarvalho thanks for the feedback, comments addressed |
|
||
module.exit_json( | ||
changed=False, | ||
failed=len(errors), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be less confusing for us if failed
is a boolean. Please make it bool(errors)
(note you don't need len
to get what you mean, an empty sequence is false, otherwise true).
"""Iterate through a list of matchers, verify required fields exist on each one, and | ||
retrieve log output for each matcher, appending its match pattern to a collection if | ||
a matcher's regex pattern is found. | ||
Return a collection of matched regular expressions, or raise a MatchErrorException if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No raise
statements in this function body, is this comment true / up-to-date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update docstring
|
||
def get_log_input(matcher): | ||
"""Run journalctl with the systemd unit specified in the matcher. | ||
Return the command output as an iterator.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General suggestion for docstrings: speak about what a function does, and why somebody would want to call it. Often you can omit how it does it, leave it as an implementation detail unless the algorithm has particular importance.
That way, comments will be less prone to getting out of sync when the implementation changes, because the how tends to change more predominantly than why or what.
E.g.:
def get_log_output(matcher):
"""Return an iterator on the logs of a given matcher."""
BTW, why is the function called "log_input"? Shouldn't it be "output"?!
retrieve log output for each matcher, appending its match pattern to a collection if | ||
a matcher's regex pattern is found. | ||
Return a collection of matched regular expressions, or raise a MatchErrorException if | ||
any errors occur.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(please consider this a continuation to the comment below about docstrings)
Another example of docstring:
def get_log_matches(matchers, log_count_limit, timestamp_limit_seconds):
"""Return a list of up to log_count_limit matches for each matcher.
Log entries are only considered if older than timestamp_limit_seconds.
"""
The current doscstring seems to be spelling out a procedure, enumerating the algorithm that reflects the current implementation. It however does not explain what log_count_limit
and timestamp_limit_seconds
are meant for (and maybe my example above is documenting them wrong).
def find_matches(log_input, matcher, log_count_limit, timestamp_limit_seconds): | ||
"""Receive iterable input, a matcher dictionary and scan journalctl at the | ||
given systemd module for log messages matching a given regular expression. | ||
Return the matched log message (or None) and an error string if any.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring is outdated. Please review the docstrings. Some are really verbose and still miss fundamental points like explaining non-obvious arguments. Better to aim for concise strings that are still helpful in understanding the purpose of the functions/methods.
1372c59
to
65758ab
Compare
@rhcarvalho thanks, comments addressed |
aos-ci-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nits, let's get this in and avoid another CI cycle.
"Please refer to https://access.redhat.com/solutions/2916381 for more information.") | ||
return {"failed": True, "msg": msg} | ||
|
||
if match["failed"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we check this before match.get("matched")
? It will probably work the same this way, but it seems logical to check for errors first, and then look at matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
for word in extra_words: | ||
assert word in result.get("msg", "") | ||
|
||
assert result.get("failed", False) == failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems more logical to first check "failed" and only then check "extra_words".
[merge] |
[test]ing while waiting on the merge queue |
Flake openshift/origin#13977 [merge] |
Not merging until the queue is open again. |
65758ab
to
7895589
Compare
Evaluated for openshift ansible merge up to 7895589 |
Evaluated for openshift ansible test up to 7895589 |
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/349/) (Base Commit: e432f8a) (PR Branch Commit: 7895589) |
continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/715/) (Base Commit: e432f8a) (PR Branch Commit: 7895589) |
Code was changed since last test run, need another run. |
aos-ci-test |
[merge] |
The OpenShift Ansible merge job could not be run again for this pull request.
|
Related trello card: https://trello.com/c/7pJdHj4F
Related bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1415839
An update in OSE 3.4 caused an increase in etcd traffic due to increased quorum reads.
This check scans etcd logs (
journalctl -ru <unit>
) and recommends an upgrade toOSE 3.6
if the messageetcd: sync duration of ..., expected less than 1s
is found (this message is a direct symptom of the increased traffic update).Based on oadm diagnostics
TODO
cc @brenton @rhcarvalho @sosiouxme