Merge pull request #4713 from adellape/scaling_preinstall
Updated diagnostic_tools for openshift-ansible image
adellape authored Aug 16, 2017
2 parents 2d4dc41 + 4aeec01 commit bcdd31b
Showing 3 changed files with 216 additions and 46 deletions.
238 changes: 199 additions & 39 deletions admin_guide/diagnostics_tool.adoc
@@ -116,37 +116,59 @@ current master) should be able to diagnose the status of infrastructure such as
nodes, registry, and router. In each case, running `oc adm diagnostics` looks
for the client configuration in its standard location and uses it if available.

[[ansible-based-tooling-health-checks]]
== Ansible-based Health Checks

// tag::ansible-based-health-checks-intro[]
Additional diagnostic health checks are available through the
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Ansible-based tooling] used to install and manage {product-title} clusters. They can report
common deployment problems for the current {product-title} installation.

These checks can be run either using the `ansible-playbook` command (the same
method used during
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or as a link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks
are provided by the
ifdef::openshift-enterprise[]
*atomic-openshift-utils* RPM package.
endif::[]
ifdef::openshift-origin[]
xref:../install_config/install/host_preparation.adoc#preparing-for-advanced-installations-origin[*openshift-ansible*]
Git repository.
endif::[]
For the containerized method,
ifdef::openshift-enterprise[]
the *openshift3/ose-ansible* container image is distributed via the
link:https://registry.access.redhat.com[Red Hat Container Registry].
endif::[]
ifdef::openshift-origin[]
the *openshift/origin-ansible* container image is distributed via Docker Hub.
endif::[]
// end::ansible-based-health-checks-intro[]
Example usage for each method is provided in subsequent sections.

The following health checks are a set of diagnostic tasks that are meant to be
run against the Ansible inventory file for a deployed {product-title} cluster
using the provided *_health.yml_* playbook.

[WARNING]
====
Because of the potential changes the health check playbooks can make to hosts,
run them only on clusters that were deployed using Ansible, and only with the
same inventory file used for that deployment. The changes mostly involve
installing dependencies so that the checks can gather required information, but
it is possible for certain system components (for example, `docker` or
networking) to be altered if their current state differs from the configuration
in the inventory file. Only run these health checks if you would not expect your
inventory file to make any changes to your current cluster configuration.
====

[[admin-guide-diagnostics-tool-ansible-checks]]
.Diagnostic Checks
.Diagnostic Health Checks
[options="header"]
|===

|Check Name |Purpose

|`ovs_version`
|This check ensures that a host has the correct version of Open vSwitch installed
for the currently deployed version of {product-title}.

|`etcd_imagedata_size`
|This check measures the total size of {product-title} image data in an etcd
cluster. The check fails if the calculated size exceeds a user-defined limit. If
@@ -157,36 +179,174 @@ A failure from this check indicates that a significant amount of space in etcd
is being taken up by {product-title} image data, which can eventually result in
your etcd cluster crashing.

A user-defined limit may be set by passing the `etcd_max_image_data_size_bytes`
variable. For example, setting `etcd_max_image_data_size_bytes=40000000000` will
cause the check to fail if the total size of image data stored in etcd
exceeds 40 GB.

|`etcd_traffic`
|This check detects higher-than-normal traffic on an etcd host. It fails if a
`journalctl` log entry with an etcd sync duration warning is found.

For further information on improving etcd performance, see
xref:../scaling_performance/host_practices.adoc#scaling-performance-capacity-host-practices-etcd[Recommended Practices for {product-title} etcd Hosts] and the
link:https://access.redhat.com/solutions/2916381[Red Hat Knowledgebase].

|`etcd_volume`
|This check ensures that the volume usage for an etcd cluster is below a maximum
user-specified threshold. If no maximum threshold value is specified, it
defaults to `90%` of the total volume size.

A user-defined limit may be set by passing the
`etcd_device_usage_threshold_percent` variable.

|`docker_storage`
|Only runs on hosts that depend on the *docker* daemon (nodes and containerized
installations). Checks that *docker*'s total usage does not exceed a
user-defined limit. If no user-defined limit is set, *docker*'s maximum usage
threshold defaults to 90% of the total size available.

The threshold limit for total percent usage can be set with a variable in your
inventory file, for example `max_thinpool_data_usage_percent=90`.

This also checks that *docker*'s storage is using a
xref:../install_config/registry/deploy_registry_existing_clusters.adoc#storage-for-the-registry[supported configuration].

|`curator`, `elasticsearch`, `fluentd`, `kibana`
|This set of checks verifies that Curator, Kibana, Elasticsearch, and Fluentd
pods have been deployed and are in a `running` state, and that a connection can
be established between the control host and the exposed Kibana URL. These checks
will only run if the `openshift_hosted_logging_deploy` inventory variable is set
to `true`, to ensure that they are executed in a deployment where
xref:../install_config/aggregate_logging.adoc#install-config-aggregate-logging[cluster logging] has been enabled.

|`logging_index_time`
|This check detects higher-than-normal time delays between log creation and log
aggregation by Elasticsearch in a logging stack deployment. It fails if a new
log entry cannot be queried through Elasticsearch within a timeout (by default,
30 seconds). The check only runs if logging is enabled.

A user-defined timeout may be set by passing the
`openshift_check_logging_index_timeout_seconds` variable. For example, setting
`openshift_check_logging_index_timeout_seconds=45` will cause the check to fail
if a newly created log entry cannot be queried via Elasticsearch after
45 seconds.

|===
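For reference, several of the check variables described above can be set
together in the cluster's inventory file. The following fragment is a sketch
only; it assumes the standard `[OSEv3:vars]` group used by the advanced
installation inventory, and the values shown are illustrative:

----
[OSEv3:vars]
etcd_max_image_data_size_bytes=40000000000
etcd_device_usage_threshold_percent=85
openshift_check_logging_index_timeout_seconds=45
max_thinpool_data_usage_percent=85
----

Note that `etcd_max_image_data_size_bytes` is specified in bytes
(40000000000 is approximately 40 GB).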

[NOTE]
====
A similar set of checks meant to run as part of the installation process can be
found in
xref:../install_config/install/advanced_install.adoc#configuring-cluster-pre-install-checks[Configuring Cluster Pre-install Checks]. Another set of checks for checking certificate
expiration can be found in
xref:../install_config/redeploying_certificates.adoc#install-config-redeploying-certificates[Redeploying Certificates].
====

[[admin-guide-health-checks-via-ansible-playbook]]
=== Running Health Checks via ansible-playbook

To run the *openshift-ansible* health checks using the `ansible-playbook`
command, specify your cluster's inventory file and run the *_health.yml_*
playbook:

----
# ansible-playbook -i <inventory_file> \
ifdef::openshift-enterprise[]
/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
ifdef::openshift-origin[]
~/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
----

To set variables in the command line, include the `-e` flag with any desired
variables in `key=value` format. For example:

----
# ansible-playbook -i <inventory_file> \
ifdef::openshift-enterprise[]
/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
ifdef::openshift-origin[]
~/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
-e openshift_check_logging_index_timeout_seconds=45
-e etcd_max_image_data_size_bytes=40000000000
----

To disable specific checks, include the variable `openshift_disable_check` with
a comma-delimited list of check names in your inventory file before running the
playbook. For example:

----
openshift_disable_check=etcd_traffic,etcd_volume
----

Alternatively, set any checks you want to disable as variables with
`-e openshift_disable_check=<check1>,<check2>` when running the
`ansible-playbook` command.
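As a small shell sketch (the check names are taken from the table above), the
comma-delimited value for `openshift_disable_check` can be assembled from a
whitespace-separated list:

----
# Join a whitespace-separated list of check names into the comma-delimited
# format expected by openshift_disable_check.
checks="etcd_traffic etcd_volume docker_storage"
disable_list=$(printf '%s' "$checks" | tr ' ' ',')
echo "-e openshift_disable_check=$disable_list"
----

This prints `-e openshift_disable_check=etcd_traffic,etcd_volume,docker_storage`,
which can be appended to the `ansible-playbook` command line.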

[[admin-guide-health-checks-via-docker-cli]]
=== Running Health Checks via Docker CLI

It is possible to run the *openshift-ansible* playbooks in a Docker container,
avoiding the need to install and configure Ansible, on any host that can run
the
ifdef::openshift-enterprise[]
*ose-ansible*
endif::[]
ifdef::openshift-origin[]
*origin-ansible*
endif::[]
image via the Docker CLI.

To do so, specify your cluster's inventory file and the *_health.yml_* playbook
when running the following `docker run` command as a non-root user that has
privileges to run containers:

----
# docker run -u `id -u` \ <1>
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ <2>
-v /etc/ansible/hosts:/tmp/inventory:ro \ <3>
-e INVENTORY_FILE=/tmp/inventory \
-e PLAYBOOK_FILE=playbooks/byo/openshift-checks/health.yml \ <4>
-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=45 -e etcd_max_image_data_size_bytes=40000000000" \ <5>
ifdef::openshift-enterprise[]
openshift3/ose-ansible
endif::[]
ifdef::openshift-origin[]
openshift/origin-ansible
endif::[]
----
<1> These options make the container run with the same UID as the current user,
which is required so that the SSH key can be read inside the container
(SSH private keys are expected to be readable only by their owner).
<2> Mount SSH keys as a volume under *_/opt/app-root/src/.ssh_* under normal usage
when running the container as a non-root user.
<3> Change *_/etc/ansible/hosts_* to the location of your cluster's inventory file,
if different. This file will be bind-mounted to *_/tmp/inventory_*, which is
used according to the `INVENTORY_FILE` environment variable in the container.
<4> The `PLAYBOOK_FILE` environment variable is set to the location of the
*_health.yml_* playbook relative to *_/usr/share/ansible/openshift-ansible_*
inside the container.
<5> Set any variables desired for a single run with the `-e key=value` format.

In the above command, the SSH key is mounted with the `:Z` flag so that the
container can read the SSH key from its restricted SELinux context; this means
that your original SSH key file will be relabeled to something like
`system_u:object_r:container_file_t:s0:c113,c247`. For more details about `:Z`,
see the `docker-run(1)` man page.

Keep this in mind for these volume mount specifications because it could have
unexpected consequences. For example, if you mount (and therefore relabel) your
*_$HOME/.ssh_* directory, *sshd* will become unable to access your public keys
to allow remote login. To avoid altering the original file labels, mounting a
copy of the SSH key (or directory) is recommended.
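For example, a minimal sketch of copying the key before mounting, so that the
`:Z` relabel applies to the copy rather than the original (the paths are
illustrative, and a dummy file stands in for an actual private key):

----
# Create a stand-in for $HOME/.ssh/id_rsa, copy it, and restrict permissions
# as SSH requires; the copy (not the original) is what gets mounted into the
# container and relabeled.
key_src=/tmp/demo_id_rsa
printf 'dummy-key-material\n' > "$key_src"
cp "$key_src" /tmp/id_rsa-copy
chmod 600 /tmp/id_rsa-copy
ls -l /tmp/id_rsa-copy
----

The copy at *_/tmp/id_rsa-copy_* would then be used in the `-v` volume mount
in place of *_$HOME/.ssh/id_rsa_*.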

You might want to mount an entire *_.ssh_* directory for various reasons. For
example, this would allow you to use an SSH configuration to match keys with
hosts or modify other connection parameters. It would also allow you to provide
a *_known_hosts_* file and have SSH validate host keys, which is disabled by the
default configuration and can be re-enabled with an environment variable by
adding `-e ANSIBLE_HOST_KEY_CHECKING=True` to the `docker` command line.
9 changes: 6 additions & 3 deletions install_config/install/advanced_install.adoc
@@ -544,11 +544,14 @@ inventory file. For example:
openshift_disable_check=memory_availability,disk_availability
----

[NOTE]
====
A similar set of health checks meant to run for diagnostics on existing clusters
can be found in
xref:../../admin_guide/diagnostics_tool.adoc#ansible-based-tooling-health-checks[Ansible-based Health Checks]. Another set of checks for checking certificate expiration can be
found in
xref:../../install_config/redeploying_certificates.adoc#install-config-redeploying-certificates[Redeploying Certificates].
====

[[advanced-install-configuring-system-containers]]
=== Configuring System Containers
15 changes: 11 additions & 4 deletions scaling_performance/optimizing_compute_resources.adoc
@@ -92,12 +92,13 @@ Registry credentials.
====

[[scaling-performance-debugging]]
== Debugging Using the RHEL Tools Container Image

Red Hat distributes a *rhel-tools* container image, packaging tools that aid in
debugging scaling or performance problems. This container image:

* Allows users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container.
* Provides debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable package tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities.

Use the *rhel-tools* container with the following:

@@ -107,5 +108,11 @@ Use the *rhel-tools* container with the following:

See the link:https://access.redhat.com/documentation/en/red-hat-enterprise-linux-atomic-host/7/getting-started-with-containers/chapter-11-using-the-atomic-tools-container-image[RHEL Tools Container documentation] for more information.

[[scaling-performance-debugging-using-ansible]]
== Debugging Using Ansible-based Health Checks

include::admin_guide/diagnostics_tool.adoc[tag=ansible-based-health-checks-intro]

See
xref:../admin_guide/diagnostics_tool.adoc#ansible-based-tooling-health-checks[Ansible-based Health Checks] in the Cluster Administration guide for information on the
available health checks and example usage.
