verify sane log times in logging stack #4682

Merged

Conversation

@juanvallejo (Contributor)

Related trello card: https://trello.com/c/7uOluOTW
Based on @richm's script: https://bugzilla.redhat.com/attachment.cgi?id=1258225

This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.

TODO

  • add tests
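
Roughly, the idea of the check (a hedged conceptual sketch of the flow described above, not the merged code; the callable names here are illustrative stand-ins):

# Conceptual sketch only: emit a unique marker through the logging stack, then
# poll Elasticsearch until the marker is indexed or a timeout expires.
import time
import uuid

def verify_logging_index_time(log_via_kibana, query_elasticsearch, timeout_secs=30, interval=1):
    """log_via_kibana and query_elasticsearch are hypothetical callables standing in for
    the real 'log a message via Kibana' and 'query Elasticsearch' steps of the check."""
    marker = str(uuid.uuid4())                      # unique message to search for
    log_via_kibana(marker)                          # cause a log entry containing the marker
    deadline = time.time() + timeout_secs
    while not query_elasticsearch(marker):          # has the marker been indexed yet?
        if time.time() + interval > deadline:
            raise RuntimeError("marker {} not indexed within {}s".format(marker, timeout_secs))
        time.sleep(interval)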

@juanvallejo (Contributor Author)

@rhcarvalho thanks for the feedback. Review comments from #4519 addressed

@@ -20,7 +20,7 @@ def is_active(cls, task_vars):
     @staticmethod
     def is_first_master(task_vars):
         """Run only on first master and only when logging is configured. Returns: bool"""
-        logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=True)
+        logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
Contributor

👍

except ValueError:
return {
"failed": True,
"msg": ('Invalid value provided in your Ansible inventory file for '
Contributor

We can't know where the value is coming from with certainty, s/in your Ansible inventory file//

"msg": ('Invalid value provided in your Ansible inventory file for '
'"openshift_check_logging_index_timeout_seconds". Value must '
'be an integer representing an amount in seconds.'),
}
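
For context, a sketch of the validation this fragment belongs to (the default of 30 seconds is an assumption for illustration, and the message already reflects the wording change suggested above; the merged code may differ):

    try:
        timeout_secs = int(get_var(task_vars, "openshift_check_logging_index_timeout_seconds", default=30))
    except ValueError:
        return {
            "failed": True,
            "msg": ('Invalid value provided for '
                    '"openshift_check_logging_index_timeout_seconds". Value must '
                    'be an integer representing an amount in seconds.'),
        }
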
Contributor

@juanvallejo @sosiouxme what if, instead of failing the check, we used the default value in this case?

Pros:

  • Less annoyance from a failed check.

Cons:

  • May be hard to realize the config is invalid / not being used.

Contributor Author

I think it is worth the added annoyance of having the check fail. That way, a user is at least made aware that the value supplied for openshift_check_logging_index_timeout_seconds was invalid.

Member

I agree if they went to the trouble to set this, they should be alerted if it isn't valid. Until we get some kind of "warning" mechanism anyway.

Contributor

Well, one thing we need to work on is how users discover they can set a variable openshift_check_logging_index_timeout_seconds, and then that this value must be an integer (and not a float). This is a general problem for all checks, their usage and their customization -- unrelated to this PR.

@sosiouxme (Member) Jul 12, 2017

One of the things I intend to do is introduce a kwarg on get_var so you can specify what type a var should have. Then it will either come back as that type, or an exception will be thrown that hopefully informs the user what is expected (in a unified fashion, not per check).

That doesn't help them discover the value in the first place... it would also be good if each check had a spec for which values are configurable and the expected semantics, which we could show in either a check "help" playbook or when the checks fail (as an addendum to the "how to disable" advice). In fact maybe get_var could look at that too :)

Of course this kinda goes against the r_role_variable concept.
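
As an aside, a rough sketch of what such a type-aware lookup could look like (purely illustrative, not part of this PR; get_var and OpenShiftCheckException are the existing helpers referenced elsewhere in this review):

def get_var_of_type(task_vars, *keys, **kwargs):
    """Hypothetical wrapper around get_var that coerces the value to an expected type."""
    expected_type = kwargs.pop("expected_type", int)
    value = get_var(task_vars, *keys, **kwargs)
    try:
        return expected_type(value)
    except (TypeError, ValueError):
        raise OpenShiftCheckException(
            "'{}' must be of type {}, got: {!r}".format(
                ".".join(keys), expected_type.__name__, value))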

    if not len(running_es_pods):
        msg = ('No Elasticsearch pods were found to be in the "Running" state. '
               'At least one Elasticsearch pod is required in order to perform this check.')
        return {"failed": True, "changed": False, "msg": msg}
Contributor

We can avoid duplication and ensure consistency:

self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)

for component, name in (['kibana', 'Kibana'], ['es', 'Elasticsearch']):
    pods, error = self.get_pods_for_component(
        self.execute_module, self.logging_namespace, component, task_vars,
    )

    if error:
        msg = 'Unable to retrieve pods for the {} logging component: {}'
        return {"failed": True, "changed": False, "msg": msg.format(name, error)}

    running_pods = self.running_pods(pods)

    if not running_pods:
        msg = ('No {} pods in the "Running" state were found. '
               'At least one pod is required in order to perform this check.')
        return {"failed": True, "changed": False, "msg": msg.format(name)}

Contributor Author

Thanks, will update

return {"failed": True, "changed": False, "msg": msg.format(error)}

running_es_pods = self.running_pods(pods)
if not len(running_es_pods):
Contributor

s/len(...)/...

def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
"""Retry an Elasticsearch query every second until query success, or a defined
length of time has passed."""
now = time.time()
Contributor

Original comment: #4519 (comment)

Let me try to explain again.

"Retry ... until a defined length of time has passed" sounds plausible, but is hard to implement without access to a monotonic time source. time.time() is not monotonic.

Instead of adding hacks to try to compensate for problems with time.time(), what I am proposing is an algorithm that doesn't depend on it: "retry N times with S seconds interval" (time.sleep uses a monotonic clock internally).

In code:

interval = 1

for i in range(max_tries):
    if self.query_es_from_es(es_pod, uuid, task_vars):
        return
    if i < max_tries - 1:
        time.sleep(interval)  # note: we actually don't need to sleep after the last try...

msg = ...
raise ...

Member

@rhcarvalho What I don't like about that is that it doesn't account for the time spent on the query, which involves execing into the pod (could take an unbounded amount of time) and running curl (could take up to the limit we set). Frequently the sleep(1) would be the briefest part of the loop. So you could easily have 30 times through the loop varying from ~30s to ~1000s, especially under adverse conditions (exactly the kind you are looking for in the first place). And that's the tradeoff for avoiding a large-clock-reset edge case that should be very rare. I'd rather detect the clock reset and restart the timer in that rare event than warp the accuracy of the check continuously.
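
A sketch of the deadline-based variant described here, restarting the timer only if the clock appears to have jumped backwards (illustrative only, not the merged code; query_es_from_es and OpenShiftCheckException are taken from the snippets in this review):

def wait_until_doc_in_es(self, es_pod, uuid, timeout_secs, task_vars):
    interval = 1
    start = time.time()
    deadline = start + timeout_secs
    while not self.query_es_from_es(es_pod, uuid, task_vars):
        now = time.time()
        if now < start:                     # clock jumped far backwards: restart the timer
            start = now
            deadline = now + timeout_secs
        if now + interval > deadline:
            raise OpenShiftCheckException(
                "Message with UUID={} not found in Elasticsearch within {}s.".format(uuid, timeout_secs))
        time.sleep(interval)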


def oc_cmd(self, cmd_str, extra_args, task_vars):
"""Wrap parent exec_oc method. Allows for testing without actually invoking other modules."""
return super(LoggingIndexTime, self).exec_oc(
Contributor

Why super(LoggingIndexTime, self).exec_oc? super is used when you want to access the parent's implementation, skipping the current class's, which is useful when overriding a method. Here it should probably be self.exec_oc.


# get all Kibana pods
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
pods, error = super(LoggingIndexTime, self).get_pods_for_component(
Contributor

s/super(LoggingIndexTime, self)/self ?

@juanvallejo (Contributor Author) Jul 7, 2017

Although, looking at this a bit more, wouldn't it be better to re-use get_pods_for_component from the parent class (LoggingCheck)?

EDIT: just saw #4682 (comment), thanks

Contributor

When an instance inst of a class C doesn't implement / override a given method m, calling inst.m() is the same as super(C, self).m(). However, as soon as you override the method, the two calls produce different results.

Semantically, we mean inst.m(), and the Python runtime will look up which implementation of m to run. The super(C, self).m() form is useful when you need to bypass that lookup, normally in a case where the lookup would introduce infinite recursion. E.g., this doesn't work:

>>> class Foo:
...     def m(self):
...         return self.m()
... 
>>> Foo().m()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in m
  File "<stdin>", line 3, in m
  ...
RecursionError: maximum recursion depth exceeded
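
For contrast, a minimal example (not from this thread) of the case where super is genuinely needed, an override that extends the parent's implementation:

>>> class Base:
...     def m(self):
...         return "base"
...
>>> class Child(Base):
...     def m(self):
...         # bypass Child's own m and call the parent's implementation
...         return super(Child, self).m() + " + child"
...
>>> Child().m()
'base + child'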

@juanvallejo (Contributor Author)

@rhcarvalho thanks for the feedback, review comments addressed


def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
"""Retry an Elasticsearch query every second until query success, or a defined
length of time has passed."""
Contributor

Hmm, now this is out of sync with the implementation?

Re-reading the previous conversation in #4682 (comment) and reconsidering, I get that the intention of this check is to look at the elapsed time between writing something to Kibana and being able to read it from Elasticsearch.

Sorry for the back-and-forth, but I'll go with what @sosiouxme said and what you had originally, that we want the whole check to run within a certain timeout, and we try to query ES as many times as possible within that time, always spaced by some interval.

"""Wrap uuid generator. Allows for testing with expected values."""
return str(uuid4())

def oc_cmd(self, cmd_str, extra_args, task_vars):
Contributor

Perhaps this belongs in the parent class? It seems to have nothing specific about the logging index time check.

@juanvallejo (Contributor Author)

@rhcarvalho @sosiouxme thanks for the feedback, comments addressed

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from 201e8b9 to 15cc5cd (July 14, 2017 21:54)
@juanvallejo (Contributor Author)

@rhcarvalho @sosiouxme Will go ahead and squash if no further comments on this. Thanks

            if self.query_es_from_es(es_pod, uuid, task_vars):
                return
            time.sleep(interval)
            now = int(time.time())
Contributor

Why int? I think we don't need to truncate the time.

"""Retry an Elasticsearch query every second until query success, or a defined
length of time has passed."""
now = time.time()
time_start = now
Contributor

Seems that this is not used for anything, dead code.

Member

It's used later for detecting if the clock reset far back (before the start time).

If the clock only resets a little, nothing special happens. Same if it resets ahead... the loop will not have as long to run.

        while now < time_end:
            if self.query_es_from_es(es_pod, uuid, task_vars):
                return
            time.sleep(interval)
Contributor

We can save 1s in run time by not sleeping in the last loop iteration.

Suggestion:

def wait_until_doc_in_es(self, es_pod, uuid, timeout_secs, task_vars):
    deadline = time.time() + timeout_secs
    interval = 1
    while not self.query_es_from_es(es_pod, uuid, task_vars):
        if time.time() + interval > deadline:
            msg = "Message with UUID={} not found in Elasticsearch after {:.1f}s."
            raise OpenShiftCheckException(msg.format(uuid, timeout_secs))
        time.sleep(interval)

The above ignores clock resets, but guarantees that we query Elasticsearch at least once.
It also exits immediately once the deadline has expired or the document has been found, without the extra sleep in the last loop iteration.

Member

It would only sleep on the last loop iteration if the query were not passing.

Err... so do you want to ignore clock resets now? It certainly simplifies the code... or was this just to insure against time resets forward?

Contributor

Err... so do you want to ignore clock resets now? It certainly simplifies the code...

It seems better to have it simple and to guarantee it runs at least once. Trying to implement monotonic time on top of time.time() is not going to work, and it adds bloat that will confuse us later.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch 2 times, most recently from 27f9074 to 9cda0ad (July 19, 2017 02:00)
@juanvallejo (Contributor Author)

aos-ci-test

@openshift-bot

error: aos-ci-jenkins/OS_3.6_containerized for 9cda0ad (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 9cda0ad (logs)

@rhcarvalho (Contributor) left a comment

This is almost ready.

@@ -20,7 +21,7 @@ def is_active(cls, task_vars):
@staticmethod
def is_first_master(task_vars):
Contributor

This is a bad name; it makes the is_active expression look like we only care about being the first master, when in fact it is a bit more than that.

Contributor Author

Thanks, updated the method to only check whether this is the first master. Moved the logging_deployed check to is_active.
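
A rough sketch of how that split might look, pieced together from the diff fragments in this review (the class placement and exact wording are assumptions, not verified against the merged file):

    @classmethod
    def is_active(cls, task_vars):
        logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
        return super(LoggingIndexTime, cls).is_active(task_vars) and cls.is_first_master(task_vars) and logging_deployed

    @staticmethod
    def is_first_master(task_vars):
        """Determine whether this host is the first master. Returns: bool"""
        hostname = get_var(task_vars, "ansible_ssh_host") or [None]
        masters = get_var(task_vars, "groups", "masters", default=None) or [None]
        return masters and masters[0] == hostname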

check.curl_kibana_with_uuid(plain_running_kibana_pod, None)

for word in extra_words:
assert word in str(error)
Contributor

Too much indentation

        with pytest.raises(expect_exception) as error:
            check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, uuid, timeout, None)
    else:
        check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, uuid, timeout, None)
Contributor

Please split into two tests and avoid conditional assertions.
The parameters above mix positive and negative cases, making it harder to understand what is being tested.
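
A rough sketch of the kind of split being asked for (assumes the test module's existing pytest import and the plain_running_elasticsearch_pod fixture; how the check instance is constructed here is an assumption for illustration):

def test_wait_until_cmd_or_err_succeeds(monkeypatch):
    check = LoggingIndexTime()  # assumes the check can be constructed with default arguments
    monkeypatch.setattr(check, "query_es_from_es", lambda *_: True)   # document found right away
    check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, "some-uuid", 1, None)

def test_wait_until_cmd_or_err_raises_on_timeout(monkeypatch):
    check = LoggingIndexTime()
    monkeypatch.setattr(check, "query_es_from_es", lambda *_: False)  # document never found
    with pytest.raises(OpenShiftCheckException):
        check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, "some-uuid", 1, None)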

@rhcarvalho (Contributor)

Also please squash the commits before we run aos-ci-test.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from 9cda0ad to 2969965 (July 20, 2017 15:09)
@juanvallejo (Contributor Author)

@rhcarvalho thanks, review comments addressed

@rhcarvalho (Contributor) left a comment

One nit and we should hopefully merge this today.

         # Note: It would be nice to use membership in oo_first_master group, however for now it
         # seems best to avoid requiring that setup and just check this is the first master.
         hostname = get_var(task_vars, "ansible_ssh_host") or [None]
         masters = get_var(task_vars, "groups", "masters", default=None) or [None]
-        return logging_deployed and masters[0] == hostname
+        return masters[0] == hostname
Contributor

We may assume there is always at least one master. For robustness and to avoid an IndexError, we could do

return masters and masters[0] == hostname

@@ -94,3 +95,13 @@ def exec_oc(execute_module=None, namespace="logging", cmd_str="", extra_args=Non
raise OpenShiftCheckException(msg)

return result.get("result", "")

def oc_cmd(self, cmd_str, extra_args, task_vars):
"""Wrap parent exec_oc method. Allows for testing without actually invoking other modules."""
Contributor

This docstring is out-of-sync. exec_oc is not a parent method, it is defined right here in LoggingCheck, and not in the parent class, OpenShiftCheck.

Tip: document what it does, why someone would want to call this method, and what to expect from the return value. No need to make references to the class hierarchy nor to "testing without actually invoking other modules" (typically that is not interesting; you can always monkey-patch things in Python... one could monkey-patch exec_oc itself and this method wouldn't even be needed). It seems that its existence is solely to let you pass in two fewer arguments.

Suggestion:

    def oc_cmd(self, cmd_str, extra_args, task_vars):
        """Same as exec_oc with a predefined execute_module and namespace."""

I almost wrote that this simplifies calls to exec_oc, but then I noted that exec_oc has all optional arguments while oc_cmd has 3 required args, so maybe it is not all that more convenient to call?! All calls I saw pass an empty [] to extra_args.

Contributor Author

Thanks, will just remove oc_cmd

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from 186f3d4 to c1aafde (July 20, 2017 18:47)
This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.
@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from c1aafde to ede0374 (July 20, 2017 19:17)
"--cacert /etc/elasticsearch/secret/admin-ca "
"--cert /etc/elasticsearch/secret/admin-cert "
"--key /etc/elasticsearch/secret/admin-key "
"https://logging-es:9200/project.{namespace}*/_count?q=message:{uuid}"
Member

Technically it should probably retrieve the project's UUID and be more specific about the ES index it's querying, to reduce query time and load on ES. However, I think it's a small enough concern that we can save this for when it becomes a problem or we're revisiting the code for some reason.
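
For reference, a hedged sketch of what a more targeted query might look like (the project.<name>.<uid>.* index naming and the helper for looking up the project UID are assumptions, not something verified in this PR):

    # Illustration only: narrow the query to the project's own index instead of project.{namespace}*
    project_uid = get_project_uid(namespace)   # hypothetical helper, e.g. via `oc get project <name> -o jsonpath={.metadata.uid}`
    query_url = ("https://logging-es:9200/project.{name}.{uid}.*/_count?q=message:{msg_uuid}"
                 .format(name=namespace, uid=project_uid, msg_uuid=uuid))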

@sosiouxme (Member)

@rhcarvalho do you approve the changes?

@rhcarvalho (Contributor) left a comment

LGTM

@rhcarvalho (Contributor)

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for ede0374 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for ede0374 (logs)

@rhcarvalho (Contributor)

[merge]

@rhcarvalho (Contributor)

Flake openshift/origin#15356

And I don't understand https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible_extended_conformance_install_update/649/, which says Stage VERIFY THE INSTALLED OPENSHIFT-ANSIBLE RELEASE [00h 00m 01s] failed. -- Looks like it comes from https://github.com/openshift/aos-cd-jobs/blob/master/sjb/config/test_cases/test_branch_origin_extended_conformance_install_update.yml#L136-L143, and according to the console output:

+ cd /data/src/github.com/openshift/openshift-ansible
++ git describe --tags --abbrev=0 --exact-match HEAD
+ last_tag=openshift-ansible-3.6.154-1
++ git log -n 1 --pretty=%h
+ last_commit=4513926
+ rpm -V openshift-ansible-3.6.154-1.git.0.4513926.el7
package openshift-ansible-3.6.154-1.git.0.4513926.el7 is not installed
++ export status=FAILURE
++ status=FAILURE
+ set +o xtrace
########## FINISHED STAGE: FAILURE: VERIFY THE INSTALLED OPENSHIFT-ANSIBLE RELEASE [00h 00m 01s] ##########

[merge]

@openshift-bot

Evaluated for openshift ansible merge up to ede0374

@openshift-bot

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/728/) (Base Commit: 2a706ad) (PR Branch Commit: ede0374)

@rhcarvalho (Contributor)

Again flake openshift/origin#15356 & Stage VERIFY THE INSTALLED OPENSHIFT-ANSIBLE RELEASE [00h 00m 01s] failed.

Merging manually as per https://github.com/openshift/openshift-ansible/blob/master/docs/pull_requests.md#manual-merges.

@rhcarvalho rhcarvalho merged commit c5e782d into openshift:master Jul 24, 2017
@juanvallejo juanvallejo deleted the jvallejo/verify-logging-index-time branch July 24, 2017 14:11
@rhcarvalho (Contributor)

For the record, the other failure seems to be a CI script error: openshift/origin#15186

@sdodson (Member) commented Jul 25, 2017

Thanks for tracking that down, sorry I didn't get to respond to you on IRC.
