add mention of openshift-ansible image in Scaling and Performance Guide #4579

juanvallejo · 2017-06-13T21:35:51Z

Followup to #3881 (review)
Related Trello card: https://trello.com/c/dxXLsKYz

Adds a mention of the etcd_imagedata_size check for helping to debug Scaling Performance issues.

cc @brenton @sosiouxme @adellape @rhcarvalho

rhcarvalho · 2017-06-14T11:33:12Z

scaling_performance/optimizing_compute_resources.adoc

+
+A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing.
+
+A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=400000000` to the `openshift_health_checker` role.


Need to clarify the effect of this setting (use 40 GB as the limit).
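
To make the mismatch concrete, here is a minimal arithmetic sketch. It assumes decimal units (1 GB = 10^9 bytes); the helper names are hypothetical and are not part of openshift-ansible.

```python
# Illustrative arithmetic only. Assumes decimal units: 1 GB = 10**9 bytes.
BYTES_PER_GB = 10**9


def bytes_to_gb(size_bytes: int) -> float:
    """Convert a byte count to decimal gigabytes."""
    return size_bytes / BYTES_PER_GB


def gb_to_bytes(size_gb: int) -> int:
    """Convert whole decimal gigabytes to a byte count."""
    return size_gb * BYTES_PER_GB


# The value quoted in the diff above is 0.4 GB, not 40 GB.
print(bytes_to_gb(400000000))

# The byte count that corresponds to a 40 GB limit.
print(gb_to_bytes(40))
```

Under that assumption, a 40 GB limit would be written as `etcd_max_image_data_size_bytes=40000000000`.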

rhcarvalho · 2017-06-14T11:35:49Z

scaling_performance/optimizing_compute_resources.adoc

+|Check Name |Purpose
+
+|`*etcd_imagedata_size*`
+|This check measures the total size of OpenShift image data in an etcd cluster. Fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check will fail if the size of OpenShift image data amounts to `50%`


I'd avoid saying 50%, because if we ever decide on tweaking the default value we'd need to remember to update the docs.

rhcarvalho · 2017-06-14T11:36:29Z

scaling_performance/optimizing_compute_resources.adoc

+A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=400000000` to the `openshift_health_checker` role.
+
+|`*etcd_traffic*`
+|This check detects higher-than-normal traffic on an Etcd host. Fails if a `journalctl` log entry with an Etcd sync duration warning is found.


s/Etcd/etcd?

rhcarvalho · 2017-06-14T11:46:42Z

scaling_performance/optimizing_compute_resources.adoc

+Use the *openshift-ansible* diagnostic checks with the following:
+
+----
+# docker run --rm -it openshift/openshift-ansible /bin/bash -c "ansible-playbook playbooks/common/openshift-checks/check.yml --become --become-user root"


Why do we need "/bin/bash -c"? Do we need it?
The ENTRYPOINT should allow us to call ansible-playbook directly.

juanvallejo · 2017-06-14T18:50:27Z

@rhcarvalho thanks, review comments addressed

rhcarvalho · 2017-06-15T16:05:55Z

scaling_performance/optimizing_compute_resources.adoc

+Use the *openshift-ansible* diagnostic checks with the following:
+
+----
+# docker run --rm -it openshift/openshift-ansible "ansible-playbook playbooks/common/openshift-checks/check.yml --become --become-user root"


Did you try this? I suspect there should be no quotes (it was originally an argument to bash -c, now it is not).

Thanks, modified to use the command provided in the image description

Actually, I just realized this is outdated.

The image name is now openshift/origin-ansible || registry.access.redhat.com/openshift3/ose-ansible.

NEW: https://hub.docker.com/r/openshift/origin-ansible/ (needs description)
OLD: https://hub.docker.com/r/openshift/openshift-ansible/ (description needs update to state its deprecation and the NEW name)

https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md#a-note-about-the-name-of-the-image

/cc @brenton @codificat

rhcarvalho · 2017-06-16T10:16:40Z

scaling_performance/optimizing_compute_resources.adoc

+       -e INVENTORY_FILE=/tmp/inventory \
+       -e OPTS="-v" \
+       -e PLAYBOOK_FILE=playbooks/byo/openshift-preflight/check.yml \
+       openshift/openshift-ansible


We will need to put a conditional here, and have different image names depending on the product. Origin gets openshift/origin-ansible and OCP gets registry.access.redhat.com/openshift3/ose-ansible (possibly with a version tag).

Right.

For OCP I think it's fine to just use openshift3/ose-ansible as registry.access.redhat.com should already be configured as an additional registry.

Thanks, added conditional for updated image name based on distribution.
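
As a rough sketch of the selection logic discussed here (the docs themselves would use an AsciiDoc conditional, not code): the image names come from this thread, while the function name and the `"origin"` / `"ocp"` keys are hypothetical labels chosen for illustration.

```python
# Image names are taken from the review comments; the keys are hypothetical.
IMAGES = {
    "origin": "openshift/origin-ansible",
    "ocp": "openshift3/ose-ansible",
}

# Registry mentioned in the thread for the OCP image.
OCP_REGISTRY = "registry.access.redhat.com"


def ansible_image(distro: str, fully_qualified: bool = False) -> str:
    """Return the image name to document for the given distribution."""
    if distro not in IMAGES:
        raise ValueError(f"unknown distribution: {distro!r}")
    image = IMAGES[distro]
    # Only the OCP image is hosted on the Red Hat registry named above.
    if fully_qualified and distro == "ocp":
        return f"{OCP_REGISTRY}/{image}"
    return image


print(ansible_image("origin"))
print(ansible_image("ocp"))
print(ansible_image("ocp", fully_qualified=True))
```

The short OCP form relies on the point made above that `registry.access.redhat.com` should already be configured as an additional registry.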

rhcarvalho

LGTM pending merge of a related PR.

rhcarvalho · 2017-06-19T14:28:40Z

scaling_performance/optimizing_compute_resources.adoc

+A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=40000000000` to the `openshift_health_checker` role.
+This example limit will cause the check to fail if the total size of OpenShift image data stored in etcd exceeds `40GB`.
+
+|`*etcd_traffic*`


Depends on openshift/openshift-ansible#4316 merging.

jeremyeder · 2017-06-21T13:47:25Z

scaling_performance/optimizing_compute_resources.adoc

@@ -94,10 +94,10 @@ Registry  credentials.
 [[scaling-performance-debugging]]
 == Debugging {product-title} Using the RHEL Tools Container

-Red Hat distributes a *rhel-tools* container, which:
+Red Hat distributes a *rhel-tools* container, containing tools that aid in debugging scaling performance problems. This container:


s/scaling performance/scaling or performance

jeremyeder · 2017-06-21T13:48:36Z

scaling_performance/optimizing_compute_resources.adoc

+== Debugging {product-title} Using the OpenShift-Ansible Image
+
+Red Hat distributes an https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[openshift-ansible image], with specific checks focused on detecting common deployment issues.
+Use the following checks to help detect common scaling performance problems:


s/common scaling performance problems/potential issues

jeremyeder · 2017-06-21T13:50:06Z

scaling_performance/optimizing_compute_resources.adoc

+|This check measures the total size of OpenShift image data in an etcd cluster.
+Fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check will fail if the size of OpenShift image data exceeds a certain amount of the currently used space in the etcd cluster.
+
+A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing.


s/eventually result in your etcd cluster crashing/destabilize an etcd cluster

jeremyeder · 2017-06-21T13:50:31Z

scaling_performance/optimizing_compute_resources.adoc

+
+A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing.
+
+A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=40000000000` to the `openshift_health_checker` role.


any chance you could add a specific example of how to pass in this variable?
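
One possible shape for such an example, sketched in Python so the pieces are easy to see. The `docker run` arguments are the ones quoted elsewhere in this thread. Passing the variable through `OPTS` with `ansible-playbook`'s `-e` (extra vars) flag is an assumption: it presumes the image forwards `OPTS` to `ansible-playbook` unchanged, which should be verified against the image README. The function name is hypothetical, and the playbook path is the one quoted in the thread, which a later comment describes as an old location.

```python
import shlex


def build_check_command(image: str, max_image_data_bytes: int,
                        inventory: str = "/etc/ansible/hosts") -> list:
    """Assemble a docker run argument list for the health checks."""
    # Assumption: OPTS is handed to ansible-playbook, so -e sets an extra var.
    opts = f"-v -e etcd_max_image_data_size_bytes={max_image_data_bytes}"
    return [
        "docker", "run",
        # Mount the host inventory read-only inside the container.
        "-v", f"{inventory}:/tmp/inventory:ro",
        "-e", "INVENTORY_FILE=/tmp/inventory",
        "-e", f"OPTS={opts}",
        # Playbook path as quoted in this thread; it may have moved since.
        "-e", "PLAYBOOK_FILE=playbooks/byo/openshift-preflight/check.yml",
        image,
    ]


cmd = build_check_command("openshift/origin-ansible", 40000000000)
# shlex.join quotes the OPTS value so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the command as a list avoids the quoting problem raised earlier in the review, where a whole command line was passed as one quoted argument.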

jeremyeder · 2017-06-21T13:51:21Z

scaling_performance/optimizing_compute_resources.adoc

+
+|`*etcd_traffic*`
+|This check detects higher-than-normal traffic on an etcd host. Fails if a `journalctl` log entry with an etcd sync duration warning is found.
+


Awesome. It would be great to add a pointer here to the host_practices.adoc file which I will be updating shortly with a bunch of new etcd performance information.

jeremyeder · 2017-06-21T13:51:47Z

scaling_performance/optimizing_compute_resources.adoc

+|===
+
+
+Use the *openshift-ansible* diagnostic checks with the following:


whoa, could you make this an "atomic run"...thing?

Sure, I am assuming this would require us to add a LABEL RUN ... to our image? cc @sosiouxme

Adding this as a new card to our Trello board as a followup to this PR

I'm sadly unfamiliar with what's required to allow atomic run to do its thing but I'm sure it'll be simple enough; we really just need the card to figure out where in our CD processes to make changes and follow up with docs.

atomic run doesn't, AFAICS, allow the user to specify env vars or parameters to the command being run. Makes it challenging to run the playbook you want, modify the verbosity, etc. I guess we could have it read an optional env var file up front.

juanvallejo · 2017-06-21T21:25:48Z

@jeremyeder thanks for the review, comments addressed

jeremyeder · 2017-06-22T15:53:55Z

scaling_performance/optimizing_compute_resources.adoc

+See below for a complete example of running checks with the Docker image.
+
+|`*etcd_traffic*`
+|This check detects higher-than-normal traffic on an etcd host. Fails if a `journalctl` log entry with an etcd sync duration warning is found.


PR for this check: openshift/openshift-ansible#4316

@eparis is this happening to us?

You mean, do we have messages like:

Jun 22 18:11:28 ip-172-31-54-162.ec2.internal etcd[100560]: sync duration of 2.675498017s, expected less than 1s

(which I just got off an active cluster)

this is the error message that generally precedes very bad things (tm)
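
For readers who want to look for this warning in their own logs, a minimal sketch follows. It is not the implementation of the `etcd_traffic` check; the pattern and function name are assumptions, and it only handles durations printed in plain seconds, as in the line quoted above.

```python
import re

# Matches the warning text from the log line quoted in this thread.
# Assumption: both durations are printed as a number followed by "s".
SYNC_WARNING = re.compile(
    r"sync duration of (?P<actual>[\d.]+)s, "
    r"expected less than (?P<limit>[\d.]+)s"
)


def find_sync_warning(line: str):
    """Return (actual_seconds, limit_seconds) or None if no warning is found."""
    match = SYNC_WARNING.search(line)
    if match is None:
        return None
    return float(match.group("actual")), float(match.group("limit"))


sample = (
    "Jun 22 18:11:28 ip-172-31-54-162.ec2.internal etcd[100560]: "
    "sync duration of 2.675498017s, expected less than 1s"
)
print(find_sync_warning(sample))
print(find_sync_warning("etcd[100560]: some unrelated message"))
```

Feeding it `journalctl` output line by line would flag the entries that the check description refers to as sync duration warnings.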

vikram-redhat · 2017-06-26T05:14:28Z

Document: [diagnostics] Document the use of the diagnostic tools in the official docs

sosiouxme · 2017-07-04T15:57:08Z

scaling_performance/optimizing_compute_resources.adoc

+       -v /etc/ansible/hosts:/tmp/inventory:ro \
+       -e INVENTORY_FILE=/tmp/inventory \
+       -e OPTS="-v" \
+       -e PLAYBOOK_FILE=playbooks/byo/openshift-preflight/check.yml \


old location; but also, the checks above are health checks - how about pointing at that one? and could you include the logging checks...

Done. Added a mention of the logging_index_time check, and linked to the existing Additional Diagnostics Checks... section in the diagnostics_tool docs

adellape · 2017-07-05T21:44:30Z

Per discussion with @juanvallejo, closing in favor of #4713.

juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 98c60cd to 2abd014 Compare June 13, 2017 21:39

rhcarvalho reviewed Jun 14, 2017

juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 2abd014 to 87b6bf7 Compare June 14, 2017 18:50

adellape self-assigned this Jun 14, 2017

rhcarvalho reviewed Jun 15, 2017

juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 87b6bf7 to 2f31c97 Compare June 15, 2017 20:34

rhcarvalho reviewed Jun 16, 2017

juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 2f31c97 to 8e6ef1a Compare June 16, 2017 21:59

rhcarvalho approved these changes Jun 19, 2017

jeremyeder suggested changes Jun 21, 2017

juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 8e6ef1a to 48181e4 Compare June 21, 2017 21:25

jeremyeder reviewed Jun 22, 2017

sosiouxme reviewed Jul 4, 2017

add mention of openshift-ansible image

adb372b

juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 48181e4 to adb372b Compare July 5, 2017 17:42

adellape mentioned this pull request Jul 5, 2017

Updated diagnostic_tools for openshift-ansible image #4713

Merged

adellape closed this Jul 5, 2017
