
Adding additonal checks upgrade path #5014

Conversation

@juanvallejo (Contributor)

Depends on #4960

TODO

  • Possibly handle upgrade playbook context on etcd_volume check

cc @sosiouxme @rhcarvalho

@sosiouxme (Member)

@juanvallejo it seems to me the etcd checks can be used as-is. They're both intended for already-installed clusters. Any thoughts on what would be different about an upgrade?

@sosiouxme (Member)

bot, retest this

@juanvallejo (Contributor, Author)

@sosiouxme

it seems to me the etcd checks can be used as-is. They're both intended for already-installed clusters.
Any thoughts on what would be different about an upgrade?

I was not sure whether I was missing any special cases during an upgrade that would need to be handled in the etcd checks. I cannot think of any, so I was hoping to get enough people to review this and either confirm that nothing else is needed, or catch whatever we're missing.

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from aaf6dfa to 8f3a6f6 on August 8, 2017 17:17
@juanvallejo (Contributor, Author)

[test]

@rhcarvalho (Contributor) left a review


The patch is pretty simple, but I still have doubts about the implications of certain checks in the upgrade flow.

```diff
@@ -11,3 +11,6 @@
 checks:
 - disk_availability
 - memory_availability
+- docker_image_availability
+- etcd_imagedata_size
```
Review comment (Contributor):

Should we block an upgrade if "etcd has too much image data"? That may not be a good idea.

Review reply (Member):

Checks can be disabled...

But I guess the question is really "are we running health checks generally, or just looking for issues we think might impact the upgrade?"

To the extent that an upgrade might exercise data in etcd, the etcd checks probably make sense either way. You do not want your upgrade to fail partway through due to running out of etcd space; even if etcd was going to fill up imminently anyway, at least you don't want to have to deal with that in the middle of an upgrade.
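
The "checks can be disabled" point above maps to an inventory variable in openshift-ansible; a minimal sketch, assuming the `openshift_disable_check` variable from the health-check documentation (check names are the ones discussed in this PR):

```ini
# Hedged sketch: skipping individual health checks from the Ansible
# inventory. The variable name is per openshift-ansible's health-check
# docs; the chosen check names are illustrative.
[OSEv3:vars]
openshift_disable_check=etcd_imagedata_size,etcd_volume
```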

```diff
@@ -11,3 +11,6 @@
 checks:
 - disk_availability
 - memory_availability
+- docker_image_availability
+- etcd_imagedata_size
+- etcd_volume
```
Review comment (Contributor):

Similarly, this might not be a good idea. If an upgrade is actually necessary to reduce etcd usage, running this check prior to the upgrade could block getting the cluster to a healthier state.

@juanvallejo (Contributor, Author)

@rhcarvalho Went ahead and removed etcd_volume from the upgrade path.
Per @sosiouxme's comment, I think it would be worth keeping the etcd_imagedata_size check in the upgrade path.
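
With that change, the pre-upgrade checks list from the diffs above would presumably end up as follows (a sketch based on this thread, not the merged file):

```yaml
checks:
- disk_availability
- memory_availability
- docker_image_availability
- etcd_imagedata_size
```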

@juanvallejo (Contributor, Author)

re[test]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from 8033106 to 2ece672 on August 11, 2017 13:55
@juanvallejo (Contributor, Author)

flaked on openshift/origin#8571
re[test]

@juanvallejo (Contributor, Author)

flake openshift/origin#10162
re[test]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from 2ece672 to 094b46c on August 14, 2017 13:21
@sosiouxme (Member)

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 094b46c (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 094b46c (logs)

@sosiouxme (Member)

try a [merge]

@sosiouxme (Member) commented Aug 15, 2017

In addition to openshift/origin#15769 we have openshift/origin#12072

[merge] again

@sosiouxme (Member)

Time wasted in the queue:

openshift/origin#15769 again, which should be fixed now.

I'm not sure what this is; there doesn't seem to be a flake issue for it:

[ERROR] wait_for_fluentd_to_catch_up: not found 1 record project .operations for 2b30b48d-248c-461a-8f27-ca64cf4a72ec after 300 seconds
[ERROR] Checking journal for 2b30b48d-248c-461a-8f27-ca64cf4a72ec...
[ERROR] Found 2b30b48d-248c-461a-8f27-ca64cf4a72ec in journal
No resources found.
error: expected 'logs (POD | TYPE/NAME) [CONTAINER_NAME]'.
POD or TYPE/NAME is a required argument for the logs command
See 'oc logs -h' for help and examples.
[ERROR] PID 119484: hack/testing/test-fluentd-forward.sh:171: `oc logs $ffpod > $ARTIFACT_DIR/test-fluentd-forward.forward-fluentd.log` exited with status 1.

Looks like it might be a logging flake, i.e. the logging aggregation didn't start up / catch up fast enough?

I can say [merge] again or @sdodson can save some queue time and merge this since other tests seem fine.

@juanvallejo (Contributor, Author)

[test]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from c010b7d to f2bf837 on August 21, 2017 13:35
@sdodson (Member) commented Aug 24, 2017

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 6a79df8 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 6a79df8 (logs)

@sdodson (Member) commented Aug 24, 2017

[merge]

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch 2 times, most recently from f1fda9f to 6709b8a on August 25, 2017 14:07
@sdodson (Member) commented Aug 25, 2017

[test] there have been some fixes to the logging job

@sdodson (Member) commented Aug 28, 2017

aos-ci-test

@sdodson (Member) commented Aug 28, 2017

[merge]

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 6709b8a (logs)

@sdodson (Member) commented Aug 28, 2017

The test failure seems real. Where do we get python-etcd from? We should add it to openshift-ansible.spec as a requirement if it's required on the local host.
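
The spec change being proposed would be a dependency line along these lines; where exactly it goes in openshift-ansible.spec (main package vs. a subpackage) is an assumption, since the spec file isn't shown in this thread:

```
# Hypothetical addition to openshift-ansible.spec; placement is assumed.
Requires: python-etcd
```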

@juanvallejo (Contributor, Author)

@sdodson

The test failure seems real. Where do we get python-etcd from? We should add it to openshift-ansible.spec as a requirement if it's required on the local host.

So, it is available by default in the "updates" repo on Fedora, but I don't think there is a repo for it on RHEL yet.

If it is easier, I could remove the etcd_imagedata_size check from this PR, and just add the docker_image_availability check to the upgrade path for the time being. The etcd_imagedata_size check is not currently on the install path.

cc @sosiouxme
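
One way a check can cope with an optional dependency like python-etcd is to guard the import and report a clear failure instead of a traceback; a minimal sketch (the function name is illustrative, not taken from the actual check code):

```python
# Hedged sketch: guarding an optional dependency such as python-etcd.
def load_etcd_client():
    """Return (module, error) so callers can fail gracefully."""
    try:
        import etcd  # provided by the python-etcd package
    except ImportError:
        return None, "python-etcd is not installed on the control host"
    return etcd, None

client_mod, err = load_etcd_client()
if err:
    print(err)  # a real check would surface this as a failure reason
```

A guard like this lets the check degrade to a clear error message on hosts where the package isn't available (e.g. RHEL, per the comment above), rather than aborting the whole run.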

@juanvallejo juanvallejo force-pushed the jvallejo/add-additonal-checks-upgrade-path branch from 6709b8a to 9dc723c on August 28, 2017 19:39
@rhcarvalho (Contributor)

Hmm, looks like aos-ci-test results are missing. Plus possibly some flakes in https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/944/.

@rhcarvalho (Contributor)

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 9dc723c (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 9dc723c (logs)

@juanvallejo (Contributor, Author)

cc @brenton
I think the test failures were flakes

@rhcarvalho (Contributor)

re-[merge]

@rhcarvalho (Contributor)

Previous merge flakes were openshift/origin#16005 and a yum failure.

re-[merge]

@openshift-bot

Evaluated for openshift ansible merge up to 9dc723c

@openshift-bot

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/979/) (Base Commit: 4acf08d) (PR Branch Commit: 9dc723c)

@juanvallejo (Contributor, Author)

[test]

@openshift-bot

Evaluated for openshift ansible test up to 9dc723c

@openshift-bot

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/610/) (Base Commit: 5aaa24b) (PR Branch Commit: 9dc723c)

@juanvallejo (Contributor, Author)

@sosiouxme or @rhcarvalho, tests seem to be passing; mind tagging once more?

@rhcarvalho (Contributor)

@juanvallejo I think we're only merging bug fixes this week

@sosiouxme (Member)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm label ("Indicates that a PR is ready to be merged.") Sep 18, 2017
@sosiouxme (Member)

I think for the new CI we also need to:
/retest

@sosiouxme (Member)

/retest

@juanvallejo (Contributor, Author) commented Sep 20, 2017

/test tox
/test install

@sosiouxme (Member)

CI fail
/test install

@openshift-merge-robot (Contributor)

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot (Contributor)

Automatic merge from submit-queue

@openshift-merge-robot openshift-merge-robot merged commit 796027a into openshift:master Sep 20, 2017
@juanvallejo juanvallejo deleted the jvallejo/add-additonal-checks-upgrade-path branch September 20, 2017 22:34
Labels
lgtm Indicates that a PR is ready to be merged.
7 participants