
Adding additonal checks upgrade path #5014

Conversation

@juanvallejo (Contributor)

Depends on #4960

TODO

  • Possibly handle upgrade playbook context on etcd_volume check

cc @sosiouxme @rhcarvalho

@sosiouxme (Member)

@juanvallejo it seems to me the etcd checks can be used as-is. They're both intended for already-installed clusters. Any thoughts on what would be different about an upgrade?

@sosiouxme (Member)

bot, retest this

@juanvallejo (Contributor, Author)

@sosiouxme

it seems to me the etcd checks can be used as-is. They're both intended for already-installed clusters.
Any thoughts on what would be different about an upgrade?

I was not sure whether I was missing any special cases during an upgrade that would need to be handled in the etcd checks. I cannot think of any, so I was hoping to get enough people to review this and either confirm that nothing else is needed, or catch whatever we're missing.

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from aaf6dfa to 8f3a6f6 on August 8, 2017 17:17
@juanvallejo (Contributor, Author)

[test]

@rhcarvalho (Contributor) left a review


The patch is pretty simple, but I still have doubts about the implications of certain checks in the upgrade flow.

```diff
@@ -11,3 +11,6 @@
 checks:
 - disk_availability
 - memory_availability
+- docker_image_availability
+- etcd_imagedata_size
```
Review comment (Contributor):

Should we block an upgrade if "etcd has too much image data"? That may not be a good idea.

Review reply (Member):

Checks can be disabled...

But I guess the question is really "are we running health checks generally, or just looking for issues we think might impact the upgrade?"

To the extent that an upgrade might exercise data in etcd, the etcd checks probably make sense either way. You do not want your upgrade to fail partway through due to running out of etcd space; even if etcd was going to fill up imminently anyway, at least you don't want to have to deal with that in the middle of an upgrade.
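
The "checks can be disabled" point above maps to an inventory variable in openshift-ansible; a minimal sketch, assuming the `openshift_disable_check` variable from the health-check documentation (check names are the ones discussed in this PR):

```ini
# Hedged sketch: skipping individual health checks from the Ansible
# inventory. The variable name is per openshift-ansible's health-check
# docs; the chosen check names are illustrative.
[OSEv3:vars]
openshift_disable_check=etcd_imagedata_size,etcd_volume
```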

```diff
@@ -11,3 +11,6 @@
 checks:
 - disk_availability
 - memory_availability
+- docker_image_availability
+- etcd_imagedata_size
+- etcd_volume
```
Review comment (Contributor):

Similarly, this might not be a good idea. If an upgrade is actually necessary to reduce etcd usage, running this check prior to the upgrade could block getting the cluster to a healthier state.

@juanvallejo (Contributor, Author)

@rhcarvalho Went ahead and removed etcd_volume from the upgrade path.
Per @sosiouxme's comment, I think it would be worth keeping the etcd_imagedata_size check in the upgrade path.
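
With that change, the pre-upgrade checks list from the diffs above would presumably end up as follows (a sketch based on this thread, not the merged file):

```yaml
checks:
- disk_availability
- memory_availability
- docker_image_availability
- etcd_imagedata_size
```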

@juanvallejo (Contributor, Author)

re[test]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from 8033106 to 2ece672 on August 11, 2017 13:55
@juanvallejo (Contributor, Author)

flaked on openshift/origin#8571
re[test]

@juanvallejo (Contributor, Author)

flake openshift/origin#10162
re[test]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from 2ece672 to 094b46c on August 14, 2017 13:21
@sosiouxme (Member)

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 094b46c (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 094b46c (logs)

@sosiouxme (Member)

try a [merge]

@sosiouxme (Member) commented Aug 15, 2017

In addition to openshift/origin#15769 we have openshift/origin#12072

[merge] again

@sosiouxme (Member)

Time wasted in the queue:

openshift/origin#15769 again, which should be fixed now.

I'm not sure what this is; there doesn't seem to be a flake issue for it:

[ERROR] wait_for_fluentd_to_catch_up: not found 1 record project .operations for 2b30b48d-248c-461a-8f27-ca64cf4a72ec after 300 seconds
[ERROR] Checking journal for 2b30b48d-248c-461a-8f27-ca64cf4a72ec...
[ERROR] Found 2b30b48d-248c-461a-8f27-ca64cf4a72ec in journal
No resources found.
error: expected 'logs (POD | TYPE/NAME) [CONTAINER_NAME]'.
POD or TYPE/NAME is a required argument for the logs command
See 'oc logs -h' for help and examples.
[ERROR] PID 119484: hack/testing/test-fluentd-forward.sh:171: `oc logs $ffpod > $ARTIFACT_DIR/test-fluentd-forward.forward-fluentd.log` exited with status 1.

Looks like it might be a logging flake, i.e. the logging aggregation didn't start up / catch up fast enough?

I can say [merge] again or @sdodson can save some queue time and merge this since other tests seem fine.

@juanvallejo (Contributor, Author)

[test]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from c010b7d to f2bf837 on August 21, 2017 13:35
@sdodson (Member) commented Aug 24, 2017

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 6a79df8 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 6a79df8 (logs)

@sdodson (Member) commented Aug 24, 2017

[merge]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch 2 times, most recently from f1fda9f to 6709b8a on August 25, 2017 14:07
@sdodson (Member) commented Aug 25, 2017

[test] there have been some fixes to the logging job

@sdodson (Member) commented Aug 28, 2017

aos-ci-test

@sdodson (Member) commented Aug 28, 2017

[merge]

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 6709b8a (logs)

@sdodson (Member) commented Aug 28, 2017

The test failure seems real. Where do we get python-etcd from? We should add it to openshift-ansible.spec as a requirement if it's required on the local host.
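
The spec change being proposed would be a dependency line along these lines; where exactly it goes in openshift-ansible.spec (main package vs. a subpackage) is an assumption, since the spec file isn't shown in this thread:

```
# Hypothetical addition to openshift-ansible.spec; placement is assumed.
Requires: python-etcd
```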

@juanvallejo (Contributor, Author)

@sdodson

The test failure seems real. Where do we get python-etcd from? We should add it to openshift-ansible.spec as a requirement if it's required on the local host.

So, it is available by default in the "updates" repo on Fedora, but I don't think there is a repo for it on RHEL yet.

If it is easier, I could remove the etcd_imagedata_size check from this PR, and just add the docker_image_availability check to the upgrade path for the time being. The etcd_imagedata_size check is not currently on the install path.

cc @sosiouxme
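
One way a check can cope with an optional dependency like python-etcd is to guard the import and report a clear failure instead of a traceback; a minimal sketch (the function name is illustrative, not taken from the actual check code):

```python
# Hedged sketch: guarding an optional dependency such as python-etcd.
def load_etcd_client():
    """Return (module, error) so callers can fail gracefully."""
    try:
        import etcd  # provided by the python-etcd package
    except ImportError:
        return None, "python-etcd is not installed on the control host"
    return etcd, None

client_mod, err = load_etcd_client()
if err:
    print(err)  # a real check would surface this as a failure reason
```

A guard like this lets the check degrade to a clear error message on hosts where the package isn't available (e.g. RHEL, per the comment above), rather than aborting the whole run.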

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from 6709b8a to 9dc723c on August 28, 2017 19:39
@rhcarvalho (Contributor)

Hmm, looks like aos-ci-test results are missing. Plus possibly some flakes in https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/944/.

@rhcarvalho (Contributor)

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 9dc723c (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 9dc723c (logs)

@juanvallejo (Contributor, Author)

cc @brenton
I think the test failures were flakes

@rhcarvalho (Contributor)

re-[merge]

@rhcarvalho (Contributor)

Previous merge flakes were openshift/origin#16005 and a yum failure.

re-[merge]

@openshift-bot

Evaluated for openshift ansible merge up to 9dc723c

@openshift-bot

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/979/) (Base Commit: 4acf08d) (PR Branch Commit: 9dc723c)

@juanvallejo (Contributor, Author)

[test]

@openshift-bot

Evaluated for openshift ansible test up to 9dc723c

@openshift-bot

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/610/) (Base Commit: 5aaa24b) (PR Branch Commit: 9dc723c)

@juanvallejo (Contributor, Author)

@sosiouxme or @rhcarvalho, tests seem to be passing; mind tagging once more?

@rhcarvalho (Contributor)

@juanvallejo I think we're only merging bug fixes this week

@sosiouxme (Member)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm label ("Indicates that a PR is ready to be merged.") Sep 18, 2017
@sosiouxme (Member)

I think for the new CI we also need to:
/retest

@sosiouxme (Member)

/retest

@juanvallejo (Contributor, Author) commented Sep 20, 2017

/test tox
/test install

@sosiouxme (Member)

CI fail
/test install

@openshift-merge-robot (Contributor)

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot (Contributor)

Automatic merge from submit-queue

@openshift-merge-robot openshift-merge-robot merged commit 796027a into openshift:master Sep 20, 2017
@juanvallejo juanvallejo deleted the jvallejo/add-additonal-checks-upgrade-path branch September 20, 2017 22:34
Labels
lgtm Indicates that a PR is ready to be merged.
7 participants