Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add mention of openshift-ansible image in Scaling and Performance Guide #4579

Conversation

juanvallejo
Copy link

Followup to #3881 (review)
Related Trello card: https://trello.com/c/dxXLsKYz

Adds a mention of the etcd_imagedata_size check for helping to debug Scaling Performance issues.

cc @brenton @sosiouxme @adellape @rhcarvalho

@juanvallejo juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 98c60cd to 2abd014 Compare June 13, 2017 21:39

A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing.

A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=400000000` to the `openshift_health_checker` role.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to clarify the effect of this setting (use 40 GB as the limit).

|Check Name |Purpose

|`*etcd_imagedata_size*`
|This check measures the total size of OpenShift image data in an etcd cluster. Fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check will fail if the size of OpenShift image data amounts to `50%`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid saying 50%, because if we ever decide on tweaking the default value we'd need to remember to update the docs.

A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=400000000` to the `openshift_health_checker` role.

|`*etcd_traffic*`
|This check detects higher-than-normal traffic on an Etcd host. Fails if a `journalctl` log entry with an Etcd sync duration warning is found.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Etcd/etcd?

Use the *openshift-ansible* diagnostic checks with the following:

----
# docker run --rm -it openshift/openshift-ansible /bin/bash -c "ansible-playbook playbooks/common/openshift-checks/check.yml --become --become-user root"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need "/bin/bash -c"? Do we need it?
The ENTRYPOINT should allow us to call ansible-playbook directly.

@juanvallejo juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 2abd014 to 87b6bf7 Compare June 14, 2017 18:50
@juanvallejo
Copy link
Author

@rhcarvalho thanks, review comments addressed

@adellape adellape self-assigned this Jun 14, 2017
Use the *openshift-ansible* diagnostic checks with the following:

----
# docker run --rm -it openshift/openshift-ansible "ansible-playbook playbooks/common/openshift-checks/check.yml --become --become-user root"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try this? I suspect there should be no quotes (it was originally an argument to bash -c, now it is not).

Copy link
Author

@juanvallejo juanvallejo Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, modified to use the command provided in the image description

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just realized this is outdated.

The image name is now openshift/origin-ansible || registry.access.redhat.com/openshift3/ose-ansible.

NEW: https://hub.docker.com/r/openshift/origin-ansible/ (needs description)
OLD: https://hub.docker.com/r/openshift/openshift-ansible/ (description needs update to state its deprecation and the NEW name)

https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md#a-note-about-the-name-of-the-image

/cc @brenton @codificat

@juanvallejo juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 87b6bf7 to 2f31c97 Compare June 15, 2017 20:34
-e INVENTORY_FILE=/tmp/inventory \
-e OPTS="-v" \
-e PLAYBOOK_FILE=playbooks/byo/openshift-preflight/check.yml \
openshift/openshift-ansible

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to put a conditional here, and have different image names depending on the product. Origin gets openshift/origin-ansible and OCP gets registry.access.redhat.com/openshift3/ose-ansible (possibly with a version tag).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

For OCP I think it's fine to just use openshift3/ose-ansible as registry.access.redhat.com should already be configured as an additional registry.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added conditional for updated image name based on distribution.

@juanvallejo juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 2f31c97 to 8e6ef1a Compare June 16, 2017 21:59
Copy link

@rhcarvalho rhcarvalho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM peding merge of a related PR.

A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=40000000000` to the `openshift_health_checker` role.
This example limit will cause the check to fail if the total size of OpenShift image data stored in etcd exceeds `40GB`.

|`*etcd_traffic*`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on openshift/openshift-ansible#4316 merging.

@@ -94,10 +94,10 @@ Registry credentials.
[[scaling-performance-debugging]]
== Debugging {product-title} Using the RHEL Tools Container

Red Hat distributes a *rhel-tools* container, which:
Red Hat distributes a *rhel-tools* container, containing tools that aid in debugging scaling performance problems. This container:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/scaling performance/scaling or performance

== Debugging {product-title} Using the OpenShift-Ansible Image

Red Hat distributes an https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[openshift-ansible image], with specific checks focused on detecting common deployment issues.
Use the following checks to help detect common scaling performance problems:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/common scaling performance problems/potential issues

|This check measures the total size of OpenShift image data in an etcd cluster.
Fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check will fail if the size of OpenShift image data exceeds a certain amount of the currently used space in the etcd cluster.

A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/eventually result in your etcd cluster crashing/destabilize an etcd cluster


A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift image data, which can eventually result in your etcd cluster crashing.

A user-defined limit may be set by passing the variable `etcd_max_image_data_size_bytes=40000000000` to the `openshift_health_checker` role.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance you could add a specific example of how to pass in this variable?


|`*etcd_traffic*`
|This check detects higher-than-normal traffic on an etcd host. Fails if a `journalctl` log entry with an etcd sync duration warning is found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. It would be great to add a pointer here to the host_practices.adoc file which I will be updating shortly with a bunch of new etcd performance information.

|===


Use the *openshift-ansible* diagnostic checks with the following:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoa, could you make this an "atomic run"...thing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I am assuming this would require us to add a LABEL RUN ... to our image? cc @sosiouxme

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this as a new card to our Trello board as a followup to this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sadly unfamiliar with what's required to allow atomic run to do its thing but I'm sure it'll be simple enough; we really just need the card to figure out where in our CD processes to make changes and follow up with docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atomic run doesn't, AFAICS, allow the user to specify env vars or parameters to the command being run. Makes it challenging to run the playbook you want, modify the verbosity, etc. I guess we could have it read an optional env var file up front.

@juanvallejo juanvallejo force-pushed the jvallejo/mention-etcd-imagedata-check branch from 8e6ef1a to 48181e4 Compare June 21, 2017 21:25
@juanvallejo
Copy link
Author

@jeremyeder thanks for the review, comments addressed

See below for a complete example of running checks with the Docker image.

|`*etcd_traffic*`
|This check detects higher-than-normal traffic on an etcd host. Fails if a `journalctl` log entry with an etcd sync duration warning is found.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eparis is this happening to us?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean do we have message like:

Jun 22 18:11:28 ip-172-31-54-162.ec2.internal etcd[100560]: sync duration of 2.675498017s, expected less than 1s

(which I just got off an active cluster)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the error message that generally precedes very bad things (tm)

@vikram-redhat
Copy link
Contributor

-v /etc/ansible/hosts:/tmp/inventory:ro \
-e INVENTORY_FILE=/tmp/inventory \
-e OPTS="-v" \
-e PLAYBOOK_FILE=playbooks/byo/openshift-preflight/check.yml \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old location; but also, the checks above are health checks - how about pointing at that one? and could you include the logging checks...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added a mention of the logging_index_time check, and linked to the existing Additional Diagnostics Checks... section in the diagnostics_tool docs

@adellape
Copy link
Contributor

adellape commented Jul 5, 2017

Per discussion with @juanvallejo, closing in favor of #4713.

@adellape adellape closed this Jul 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants