verify sane log times in logging stack #4682
Conversation
@rhcarvalho thanks for the feedback. Review comments from #4519 addressed
@@ -20,7 +20,7 @@ def is_active(cls, task_vars):
    @staticmethod
    def is_first_master(task_vars):
        """Run only on first master and only when logging is configured. Returns: bool"""
-       logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=True)
+       logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
👍
        except ValueError:
            return {
                "failed": True,
                "msg": ('Invalid value provided in your Ansible inventory file for '
We can't know where the value is coming from with certainty, s/in your Ansible inventory file//
"msg": ('Invalid value provided in your Ansible inventory file for ' | ||
'"openshift_check_logging_index_timeout_seconds". Value must ' | ||
'be an integer representing an amount in seconds.'), | ||
} |
@juanvallejo @sosiouxme what if instead of failing the check we would use the default value in this case?
Pros:
- Less annoyance from a failed check.
Cons:
- May be hard to realize the config is invalid / not being used.
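A rough sketch of the fall-back alternative being discussed (get_var is the helper already used by these checks; DEFAULT_TIMEOUT_SECONDS is a hypothetical stand-in for the check's existing default, not a name from this PR):

DEFAULT_TIMEOUT_SECONDS = 30  # hypothetical stand-in for the check's default

def get_index_timeout(task_vars):
    """Return the configured timeout, silently falling back to the default on a bad value."""
    raw = get_var(task_vars, "openshift_check_logging_index_timeout_seconds",
                  default=DEFAULT_TIMEOUT_SECONDS)
    try:
        return int(raw)
    except (TypeError, ValueError):
        return DEFAULT_TIMEOUT_SECONDS  # ignore the invalid value instead of failing the check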
I think that it might be worth the price of the added annoyance by having the check fail. That way, a user is at least aware that the value supplied for openshift_check_logging_index_timeout_seconds was invalid.
I agree if they went to the trouble to set this, they should be alerted if it isn't valid. Until we get some kind of "warning" mechanism anyway.
Well, one thing we need to work on is how users discover they can set a variable openshift_check_logging_index_timeout_seconds, and then that this value must be an integer (and not a float). This is a general problem for all checks, their usage and their customization -- unrelated to this PR.
One of the things I intend to do is introduce a kwarg on get_var so you can specify what type a var should have. Then it will either come back as that type or as an exception thrown that hopefully informs the user what is expected (in a unified fashion, not per check).
That doesn't help them discover the value in the first place... it would also be good if each check had a spec for which values are configurable and the expected semantics, which we could show in either a check "help" playbook or when the checks fail (as an addendum to the "how to disable" advice). In fact maybe get_var could look at that too :)
Of course this kinda goes against the r_role_variable concept.
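A sketch of what such a kwarg might look like; the wanted_type name, the signature, and the messages are assumptions for illustration, not the project's actual API:

class OpenShiftCheckException(Exception):
    """Stand-in for the existing check exception type."""

def get_var(task_vars, *keys, **kwargs):
    """Look up a nested variable, optionally coercing it to a requested type."""
    value = task_vars
    try:
        for key in keys:
            value = value[key]
    except (KeyError, TypeError):
        if "default" in kwargs:
            return kwargs["default"]
        raise OpenShiftCheckException("missing variable: {}".format(".".join(keys)))
    wanted_type = kwargs.get("wanted_type")
    if wanted_type is not None:
        try:
            value = wanted_type(value)
        except (TypeError, ValueError):
            raise OpenShiftCheckException(
                "variable {} must be interpretable as {}".format(".".join(keys), wanted_type.__name__))
    return value

A check could then write get_var(task_vars, "openshift_check_logging_index_timeout_seconds", default=30, wanted_type=int) and get a uniform error message for free.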
        if not len(running_es_pods):
            msg = ('No Elasticsearch pods were found to be in the "Running" state.'
                   'At least one Elasticsearch pod is required in order to perform this check.')
            return {"failed": True, "changed": False, "msg": msg}
We can avoid duplication and ensure consistency:
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
for component, name in (['kibana', 'Kibana'], ['es', 'Elasticsearch']):
    pods, error = self.get_pods_for_component(
        self.execute_module, self.logging_namespace, component, task_vars,
    )
    if error:
        msg = 'Unable to retrieve pods for the {} logging component: {}'
        return {"failed": True, "changed": False, "msg": msg.format(name, error)}
    running_pods = self.running_pods(pods)
    if not running_pods:
        msg = ('No {} pods in the "Running" state were found.'
               'At least one pod is required in order to perform this check.')
        return {"failed": True, "changed": False, "msg": msg.format(name)}
Thanks, will update
return {"failed": True, "changed": False, "msg": msg.format(error)} | ||
|
||
running_es_pods = self.running_pods(pods) | ||
if not len(running_es_pods): |
s/len(...)/...
    def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
        """Retry an Elasticsearch query every second until query success, or a defined
        length of time has passed."""
        now = time.time()
Original comment: #4519 (comment)
Let me try to explain again.
"Retry ... until a defined length of time has passed" sounds plausible, but is hard to implement without access to a monotonic time source. time.time()
is not monotonic.
Instead of adding hacks to try to compensate for problems with time.time()
, what I am proposing is an algorithm that doesn't depend on it: "retry N times with S seconds interval" (time.sleep
uses a monotonic clock internally).
In code:
interval = 1
for i in range(max_tries):
    if self.query_es_from_es(es_pod, uuid, task_vars):
        return
    if i < max_tries - 1:
        time.sleep(interval)  # note: we actually don't need to sleep after the last try...
msg = ...
raise ...
@rhcarvalho What I don't like about that is that it doesn't account for the time spent on the query, which involves execing into the pod (could take an unbounded amount of time) and running curl (could take up to the limit we set). Frequently the sleep(1) would be the briefest part of the loop. So you could easily have 30 times through the loop varying from ~30s to ~1000s, especially under adverse conditions (exactly the kind you are looking for in the first place). And that's the tradeoff for avoiding a large-clock-reset edge case that should be very rare. I'd rather detect the clock reset and restart the timer in that rare event than warp the accuracy of the check continuously.
    def oc_cmd(self, cmd_str, extra_args, task_vars):
        """Wrap parent exec_oc method. Allows for testing without actually invoking other modules."""
        return super(LoggingIndexTime, self).exec_oc(
Why super(LoggingIndexTime, self).exec_oc? super is used when you want to access the parent's implementation, skipping the current class's, which is useful when overriding a method. Here it should probably be self.exec_oc.
        # get all Kibana pods
        self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
        pods, error = super(LoggingIndexTime, self).get_pods_for_component(
s/super(LoggingIndexTime, self)/self ?
Although, looking at this a bit more, wouldn't it be better to re-use get_pods_for_component from the parent class (LoggingCheck)?
EDIT: just saw #4682 (comment), thanks
When an instance inst of a class C doesn't implement / override a given method m, calling inst.m() is the same as super(C, self).m(). However, as soon as you override the method, the two calls produce a different result.
Semantically, we mean inst.m() and the Python runtime will look up which implementation of m to run. The super(C, self).m() form is useful when you need to bypass the lookup, normally in a case where you would otherwise introduce infinite recursion while looking up a method implementation. E.g., this doesn't work:
>>> class Foo:
...     def m(self):
...         return self.m()
...
>>> Foo().m()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in m
  File "<stdin>", line 3, in m
  ...
RecursionError: maximum recursion depth exceeded
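For contrast, a minimal sketch (not from the PR) of the situation where super() is genuinely needed, i.e. once the method is overridden:

class Base(object):
    def m(self):
        return "base"

class Child(Base):
    def m(self):
        # self.m() here would call Child.m again and recurse forever;
        # super() bypasses the lookup and reaches Base.m
        return super(Child, self).m() + " + child"

print(Child().m())  # prints: base + child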
@rhcarvalho thanks for the feedback, review comments addressed
    def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
        """Retry an Elasticsearch query every second until query success, or a defined
        length of time has passed."""
Hmm, now this is out of sync with the implementation?
Re-reading the previous conversation in #4682 (comment) and reconsidering, I get that the intention of this check is to look at the elapsed time between writing something to Kibana and being able to read it from Elasticsearch.
Sorry for the back-and-forth, but I'll go with what @sosiouxme said and what you had originally: we want the whole check to run within a certain timeout, and we try to query ES as many times as possible within that time, always spaced by some interval.
"""Wrap uuid generator. Allows for testing with expected values.""" | ||
return str(uuid4()) | ||
|
||
def oc_cmd(self, cmd_str, extra_args, task_vars): |
Perhaps this belongs in the parent class? It seems to have nothing specific about the logging index time check.
@rhcarvalho @sosiouxme thanks for the feedback, comments addressed
201e8b9 to 15cc5cd
@rhcarvalho @sosiouxme Will go ahead and squash if no further comments on this. Thanks
            if self.query_es_from_es(es_pod, uuid, task_vars):
                return
            time.sleep(interval)
            now = int(time.time())
Why int? I think we don't need to truncate the time.
"""Retry an Elasticsearch query every second until query success, or a defined | ||
length of time has passed.""" | ||
now = time.time() | ||
time_start = now |
Seems that this is not used for anything, dead code.
It's used later for detecting if the clock reset far back (before the start time).
If the clock only resets a little, nothing special happens. Same if it resets ahead... the loop will not have as long to run.
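A rough sketch (not the PR's exact code) of the reset handling described here: if time.time() jumps back past the recorded start, restart the timeout window.

import time

def wait_with_reset_detection(query, timeout_secs, interval=1):
    """Poll query() until it returns True or timeout_secs elapse;
    restart the window if the wall clock jumps back before the start time."""
    time_start = time.time()
    time_end = time_start + timeout_secs
    while time.time() < time_end:
        if query():
            return True
        time.sleep(interval)
        now = time.time()
        if now < time_start:  # large backwards clock reset detected
            time_start = now
            time_end = now + timeout_secs
    return False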
        while now < time_end:
            if self.query_es_from_es(es_pod, uuid, task_vars):
                return
            time.sleep(interval)
We can save 1s in run time by not sleeping in the last loop iteration.
Suggestion:
def wait_until_doc_in_es(self, es_pod, uuid, timeout_secs, task_vars):
    deadline = time.time() + timeout_secs
    interval = 1
    while not self.query_es_from_es(es_pod, uuid, task_vars):
        if time.time() + interval > deadline:
            msg = "Message with UUID={} not found in Elasticsearch after {:.1f}s."
            raise OpenShiftCheckException(msg.format(uuid, timeout_secs))
        time.sleep(interval)
The above ignores clock resets, but guarantees that we query Elasticsearch at least once.
It also exits immediately when the deadline expired or the document was found, without the extra sleep in the last loop iteration.
It would only sleep on the last loop iteration if the query were not passing.
Err... so do you want to ignore clock resets now? It certainly simplifies the code... or was this just to insure against time resets forward?
Err... so do you want to ignore clock resets now? It certainly simplifies the code...
It seems better to have it simple and to guarantee it runs at least once. Trying to implement monotonic time out of time.time() is not going to work, plus it adds bloat to confuse us later.
27f9074 to 9cda0ad
aos-ci-test
This is almost ready.
@@ -20,7 +21,7 @@ def is_active(cls, task_vars):
    @staticmethod
    def is_first_master(task_vars):
This is a bad name; it makes the is_active expression look like we only care about being first master, when in fact it is a bit more than that.
Thanks, updated the method to only check if first master. Moved the logging_deployed check to is_active.
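Roughly, the split described here might look like the following (a sketch pieced together from the diffs in this thread, not necessarily the merged code):

@classmethod
def is_active(cls, task_vars):
    """Run only when logging is deployed and only on the first master."""
    logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
    return logging_deployed and cls.is_first_master(task_vars)

@staticmethod
def is_first_master(task_vars):
    """Determine whether the current host is the first master. Returns: bool"""
    hostname = get_var(task_vars, "ansible_ssh_host", default=None)
    masters = get_var(task_vars, "groups", "masters", default=None) or [None]
    return hostname is not None and masters[0] == hostname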
    check.curl_kibana_with_uuid(plain_running_kibana_pod, None)

    for word in extra_words:
        assert word in str(error)
Too much indentation
        with pytest.raises(expect_exception) as error:
            check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, uuid, timeout, None)
    else:
        check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, uuid, timeout, None)
Please split into two tests and avoid conditional assertions.
The parameters above mix positive and negative cases, making it harder to understand what is being tested.
Also please squash the commits before we run aos-ci-test.
9cda0ad to 2969965
@rhcarvalho thanks, review comments addressed
One nit and we should hopefully merge this today.
        # Note: It would be nice to use membership in oo_first_master group, however for now it
        # seems best to avoid requiring that setup and just check this is the first master.
        hostname = get_var(task_vars, "ansible_ssh_host") or [None]
        masters = get_var(task_vars, "groups", "masters", default=None) or [None]
-       return logging_deployed and masters[0] == hostname
+       return masters[0] == hostname
We may assume there is always at least one master. For robustness and to avoid an IndexError, we could do:
return masters and masters[0] == hostname
@@ -94,3 +95,13 @@ def exec_oc(execute_module=None, namespace="logging", cmd_str="", extra_args=Non
            raise OpenShiftCheckException(msg)

        return result.get("result", "")

    def oc_cmd(self, cmd_str, extra_args, task_vars):
        """Wrap parent exec_oc method. Allows for testing without actually invoking other modules."""
This docstring is out-of-sync. exec_oc is not a parent method; it is defined right here in LoggingCheck, not in the parent class, OpenShiftCheck.
Tip: document what it does, why someone would want to call this method, and what to expect from the return value. There is no need to make references to the class hierarchy nor to "testing without actually invoking other modules" (typically that is not interesting; you can always monkey-patch things in Python, and one could monkey-patch exec_oc itself, making this method unnecessary). It seems that its existence is solely to let you pass in two fewer arguments.
Suggestion:
def oc_cmd(self, cmd_str, extra_args, task_vars):
    """Same as exec_oc with a predefined execute_module and namespace."""
I almost wrote that this simplifies calls to exec_oc, but then I noted that exec_oc has all optional arguments while oc_cmd has 3 required args, so maybe it is not all that more convenient to call?! All calls I saw pass an empty [] to extra_args.
Thanks, will just remove oc_cmd
186f3d4 to c1aafde
This patch verifies that logs sent from logging pods can be queried on the Elasticsearch pod within a reasonable amount of time.
c1aafde to ede0374
"--cacert /etc/elasticsearch/secret/admin-ca " | ||
"--cert /etc/elasticsearch/secret/admin-cert " | ||
"--key /etc/elasticsearch/secret/admin-key " | ||
"https://logging-es:9200/project.{namespace}*/_count?q=message:{uuid}" |
Technically it should probably retrieve the project's UUID and be more specific about the ES index it's querying - reduce query time and load on ES. However I think it's a small enough concern that we can save this for when it becomes a problem or we're revisiting the code for some reason.
@rhcarvalho do you approve the changes?
LGTM
aos-ci-test
[merge]
Flake openshift/origin#15356 And I don't understand https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible_extended_conformance_install_update/649/, says
[merge]
Evaluated for openshift ansible merge up to ede0374
continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/728/) (Base Commit: 2a706ad) (PR Branch Commit: ede0374)
Again flake openshift/origin#15356 & Merging manually as per https://github.com/openshift/openshift-ansible/blob/master/docs/pull_requests.md#manual-merges.
For the record, the other failure seems to be a CI script error: openshift/origin#15186
thanks for tracking that down, sorry didn't get to respond to you on irc.
Related trello card: https://trello.com/c/7uOluOTW
Based on @richm 's script: https://bugzilla.redhat.com/attachment.cgi?id=1258225
This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.
TODO