-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add mention of openshift-ansible image in Scaling and Performance Guide #4579
add mention of openshift-ansible image in Scaling and Performance Guide #4579
Conversation
98c60cd
to
2abd014
Compare
|
||
A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing. | ||
|
||
A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=400000000` to the `openshift_health_checker` role. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to clarify the effect of this setting (use 40 GB as the limit).
|Check Name |Purpose | ||
|
||
|`*etcd_imagedata_size*` | ||
|This check measures the total size of OpenShift image data in an etcd cluster. Fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check will fail if the size of OpenShift image data amounts to `50%` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd avoid saying 50%
, because if we ever decide on tweaking the default value we'd need to remember to update the docs.
A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=400000000` to the `openshift_health_checker` role. | ||
|
||
|`*etcd_traffic*` | ||
|This check detects higher-than-normal traffic on an Etcd host. Fails if a `journalctl` log entry with an Etcd sync duration warning is found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Etcd/etcd?
Use the *openshift-ansible* diagnostic checks with the following: | ||
|
||
---- | ||
# docker run --rm -it openshift/openshift-ansible /bin/bash -c "ansible-playbook playbooks/common/openshift-checks/check.yml --become --become-user root" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need "/bin/bash -c"? Do we need it?
The ENTRYPOINT should allow us to call ansible-playbook directly.
2abd014
to
87b6bf7
Compare
@rhcarvalho thanks, review comments addressed |
Use the *openshift-ansible* diagnostic checks with the following: | ||
|
||
---- | ||
# docker run --rm -it openshift/openshift-ansible "ansible-playbook playbooks/common/openshift-checks/check.yml --become --become-user root" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try this? I suspect there should be no quotes (it was originally an argument to bash -c
, now it is not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, modified to use the command provided in the image description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I just realized this is outdated.
The image name is now openshift/origin-ansible
|| registry.access.redhat.com/openshift3/ose-ansible
.
NEW: https://hub.docker.com/r/openshift/origin-ansible/ (needs description)
OLD: https://hub.docker.com/r/openshift/openshift-ansible/ (description needs update to state its deprecation and the NEW name)
/cc @brenton @codificat
87b6bf7
to
2f31c97
Compare
-e INVENTORY_FILE=/tmp/inventory \ | ||
-e OPTS="-v" \ | ||
-e PLAYBOOK_FILE=playbooks/byo/openshift-preflight/check.yml \ | ||
openshift/openshift-ansible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need to put a conditional here, and have different image names depending on the product. Origin gets openshift/origin-ansible
and OCP gets registry.access.redhat.com/openshift3/ose-ansible
(possibly with a version tag).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right.
For OCP I think it's fine to just use openshift3/ose-ansible
as registry.access.redhat.com
should already be configured as an additional registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, added conditional for updated image name based on distribution.
2f31c97
to
8e6ef1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM peding merge of a related PR.
A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=40000000000` to the `openshift_health_checker` role. | ||
This example limit will cause the check to fail if the total size of OpenShift image data stored in etcd exceeds `40GB`. | ||
|
||
|`*etcd_traffic*` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on openshift/openshift-ansible#4316 merging.
@@ -94,10 +94,10 @@ Registry credentials. | |||
[[scaling-performance-debugging]] | |||
== Debugging {product-title} Using the RHEL Tools Container | |||
|
|||
Red Hat distributes a *rhel-tools* container, which: | |||
Red Hat distributes a *rhel-tools* container, containing tools that aid in debugging scaling performance problems. This container: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/scaling performance/scaling or performance
== Debugging {product-title} Using the OpenShift-Ansible Image | ||
|
||
Red Hat distributes an https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[openshift-ansible image], with specific checks focused on detecting common deployment issues. | ||
Use the following checks to help detect common scaling performance problems: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/common scaling performance problems/potential issues
|This check measures the total size of OpenShift image data in an etcd cluster. | ||
Fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check will fail if the size of OpenShift image data exceeds a certain amount of the currently used space in the etcd cluster. | ||
|
||
A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/eventually result in your etcd cluster crashing/destabilize an etcd cluster
|
||
A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing. | ||
|
||
A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=40000000000` to the `openshift_health_checker` role. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chance you could add a specific example of how to pass in this variable?
|
||
|`*etcd_traffic*` | ||
|This check detects higher-than-normal traffic on an etcd host. Fails if a `journalctl` log entry with an etcd sync duration warning is found. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. It would be great to add a pointer here to the host_practices.adoc file which I will be updating shortly with a bunch of new etcd performance information.
|=== | ||
|
||
|
||
Use the *openshift-ansible* diagnostic checks with the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoa, could you make this an "atomic run"...thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I am assuming this would require us to add a LABEL RUN ...
to our image? cc @sosiouxme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this as a new card to our Trello board as a followup to this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sadly unfamiliar with what's required to allow atomic run
to do its thing but I'm sure it'll be simple enough; we really just need the card to figure out where in our CD processes to make changes and follow up with docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
atomic run
doesn't, AFAICS, allow the user to specify env vars or parameters to the command being run. Makes it challenging to run the playbook you want, modify the verbosity, etc. I guess we could have it read an optional env var file up front.
8e6ef1a
to
48181e4
Compare
@jeremyeder thanks for the review, comments addressed |
See below for a complete example of running checks with the Docker image. | ||
|
||
|`*etcd_traffic*` | ||
|This check detects higher-than-normal traffic on an etcd host. Fails if a `journalctl` log entry with an etcd sync duration warning is found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR for this check: openshift/openshift-ansible#4316
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eparis is this happening to us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean do we have message like:
Jun 22 18:11:28 ip-172-31-54-162.ec2.internal etcd[100560]: sync duration of 2.675498017s, expected less than 1s
(which I just got off an active cluster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the error message that generally precedes very bad things (tm)
-v /etc/ansible/hosts:/tmp/inventory:ro \ | ||
-e INVENTORY_FILE=/tmp/inventory \ | ||
-e OPTS="-v" \ | ||
-e PLAYBOOK_FILE=playbooks/byo/openshift-preflight/check.yml \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old location; but also, the checks above are health checks - how about pointing at that one? and could you include the logging checks...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Added a mention of the logging_index_time check, and linked to the existing Additional Diagnostics Checks... section in the diagnostics_tool
docs
48181e4
to
adb372b
Compare
Per discussion with @juanvallejo, closing in favor of #4713. |
Followup to #3881 (review)
Related Trello card: https://trello.com/c/dxXLsKYz
Adds a mention of the
etcd_imagedata_size
check for helping to debug Scaling Performance issues.cc @brenton @sosiouxme @adellape @rhcarvalho