Merge pull request #4713 from adellape/scaling_preinstall
Updated diagnostic_tools for openshift-ansible image
adellape authored Aug 16, 2017
2 parents 2d4dc41 + 4aeec01 commit bcdd31b
Showing 3 changed files with 216 additions and 46 deletions.
238 changes: 199 additions & 39 deletions admin_guide/diagnostics_tool.adoc
@@ -116,37 +116,59 @@ current master) should be able to diagnose the status of infrastructure such as
nodes, registry, and router. In each case, running `oc adm diagnostics` looks
for the client configuration in its standard location and uses it if available.

[[ansible-based-tooling-health-checks]]
== Ansible-based Health Checks

// tag::ansible-based-health-checks-intro[]
Additional diagnostic health checks are available through the
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Ansible-based tooling] used to install and manage {product-title} clusters. They can report
common deployment problems for the current {product-title} installation.

These checks can be run either using the `ansible-playbook` command (the same
method used during
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or as a link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks
are provided by the
ifdef::openshift-enterprise[]
*atomic-openshift-utils* RPM package.
endif::[]
ifdef::openshift-origin[]
xref:../install_config/install/host_preparation.adoc#preparing-for-advanced-installations-origin[*openshift-ansible*]
Git repository.
endif::[]
For the containerized method,
ifdef::openshift-enterprise[]
the *openshift3/ose-ansible* container image is distributed via the
link:https://registry.access.redhat.com[Red Hat Container Registry].
endif::[]
ifdef::openshift-origin[]
the *openshift/origin-ansible* container image is distributed via Docker Hub.
endif::[]
// end::ansible-based-health-checks-intro[]
Example usage for each method is provided in subsequent sections.

The following health checks are a set of diagnostic tasks that are meant to be
run against the Ansible inventory file for a deployed {product-title} cluster
using the provided *_health.yml_* playbook.

[WARNING]
====
Because of the potential changes the health check playbooks can make to hosts,
run them only on clusters that were deployed using Ansible, and only with the
same inventory file used for that deployment. The changes mostly involve
installing dependencies so that the checks can gather required information, but
it is possible for certain system components (for example, `docker` or
networking) to be altered if their current state differs from the configuration
in the inventory file. Only run these health checks if you would not expect your
inventory file to make any changes to your current cluster configuration.
====

[[admin-guide-diagnostics-tool-ansible-checks]]
.Diagnostic Checks
.Diagnostic Health Checks
[options="header"]
|===

|Check Name |Purpose

|`ovs_version`
|This check ensures that a host has the correct version of Open vSwitch installed
for the currently deployed version of {product-title}.

|`etcd_imagedata_size`
|This check measures the total size of {product-title} image data in an etcd
cluster. The check fails if the calculated size exceeds a user-defined limit. If
@@ -157,36 +179,174 @@ A failure from this check indicates that a significant amount of space in etcd
is being taken up by {product-title} image data, which can eventually result in
your etcd cluster crashing.

A user-defined limit may be set by passing the `etcd_max_image_data_size_bytes`
variable. For example, setting `etcd_max_image_data_size_bytes=40000000000` will
cause the check to fail if the total size of image data stored in etcd
exceeds 40 GB.

|`etcd_traffic`
|This check detects higher-than-normal traffic on an etcd host. It fails if a
`journalctl` log entry with an etcd sync duration warning is found.

For further information on improving etcd performance, see
xref:../scaling_performance/host_practices.adoc#scaling-performance-capacity-host-practices-etcd[Recommended Practices for {product-title} etcd Hosts] and the
link:https://access.redhat.com/solutions/2916381[Red Hat Knowledgebase].

|`etcd_volume`
|This check ensures that the volume usage for an etcd cluster is below a maximum
user-specified threshold. If no maximum threshold value is specified, it
defaults to `90%` of the total volume size.

A user-defined limit may be set by passing the
`etcd_device_usage_threshold_percent` variable.

|`docker_storage`
|Only runs on hosts that depend on the *docker* daemon (nodes and containerized
installations). Checks that *docker*'s total usage does not exceed a
user-defined limit. If no user-defined limit is set, *docker*'s maximum usage
threshold defaults to 90% of the total size available.

The threshold limit for total percent usage can be set with a variable in your
inventory file, for example `max_thinpool_data_usage_percent=90`.

This also checks that *docker*'s storage is using a
xref:../install_config/registry/deploy_registry_existing_clusters.adoc#storage-for-the-registry[supported configuration].

|`curator`, `elasticsearch`, `fluentd`, `kibana`
|This set of checks verifies that Curator, Kibana, Elasticsearch, and Fluentd
pods have been deployed and are in a `running` state, and that a connection can
be established between the control host and the exposed Kibana URL. These checks
will only run if the `openshift_hosted_logging_deploy` inventory variable is set
to `true`, to ensure that they are executed in a deployment where
xref:../install_config/aggregate_logging.adoc#install-config-aggregate-logging[cluster logging] has been enabled.

|`logging_index_time`
|This check detects higher-than-normal time delays between log creation and log
aggregation by Elasticsearch in a logging stack deployment. It fails if a new
log entry cannot be queried through Elasticsearch within a timeout (by default,
30 seconds). The check only runs if logging is enabled.

A user-defined timeout may be set by passing the
`openshift_check_logging_index_timeout_seconds` variable. For example, setting
`openshift_check_logging_index_timeout_seconds=45` will cause the check to fail
if a newly created log entry cannot be queried via Elasticsearch after
45 seconds.

|===
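For reference, several of the check variables described above can be set
together in the cluster's inventory file. The following fragment is a sketch
only; it assumes the standard `[OSEv3:vars]` group used by the advanced
installation inventory, and the values shown are illustrative:

----
[OSEv3:vars]
etcd_max_image_data_size_bytes=40000000000
etcd_device_usage_threshold_percent=85
openshift_check_logging_index_timeout_seconds=45
max_thinpool_data_usage_percent=85
----

Note that `etcd_max_image_data_size_bytes` is specified in bytes
(40000000000 is approximately 40 GB).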

[NOTE]
====
A similar set of checks meant to run as part of the installation process can be
found in
xref:../install_config/install/advanced_install.adoc#configuring-cluster-pre-install-checks[Configuring Cluster Pre-install Checks]. Another set of checks for checking certificate
expiration can be found in
xref:../install_config/redeploying_certificates.adoc#install-config-redeploying-certificates[Redeploying Certificates].
====

[[admin-guide-health-checks-via-ansible-playbook]]
=== Running Health Checks via ansible-playbook

To run the *openshift-ansible* health checks using the `ansible-playbook`
command, specify your cluster's inventory file and run the *_health.yml_*
playbook:

----
# ansible-playbook -i <inventory_file> \
ifdef::openshift-enterprise[]
/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
ifdef::openshift-origin[]
~/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
----

To set variables in the command line, include the `-e` flag with any desired
variables in `key=value` format. For example:

----
# ansible-playbook -i <inventory_file> \
ifdef::openshift-enterprise[]
/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
ifdef::openshift-origin[]
~/openshift-ansible/playbooks/byo/openshift-checks/health.yml
endif::[]
-e openshift_check_logging_index_timeout_seconds=45
-e etcd_max_image_data_size_bytes=40000000000
----

To disable specific checks, include the variable `openshift_disable_check` with
a comma-delimited list of check names in your inventory file before running the
playbook. For example:

----
openshift_disable_check=etcd_traffic,etcd_volume
----

Alternatively, set any checks you want to disable as variables with
`-e openshift_disable_check=<check1>,<check2>` when running the
`ansible-playbook` command.
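As a small shell sketch (the check names are taken from the table above), the
comma-delimited value for `openshift_disable_check` can be assembled from a
whitespace-separated list:

----
# Join a whitespace-separated list of check names into the comma-delimited
# format expected by openshift_disable_check.
checks="etcd_traffic etcd_volume docker_storage"
disable_list=$(printf '%s' "$checks" | tr ' ' ',')
echo "-e openshift_disable_check=$disable_list"
----

This prints `-e openshift_disable_check=etcd_traffic,etcd_volume,docker_storage`,
which can be appended to the `ansible-playbook` command line.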

[[admin-guide-health-checks-via-docker-cli]]
=== Running Health Checks via Docker CLI

It is possible to run the *openshift-ansible* playbooks in a Docker container,
avoiding the need to install and configure Ansible, on any host that can run
the
ifdef::openshift-enterprise[]
*ose-ansible*
endif::[]
ifdef::openshift-origin[]
*origin-ansible*
endif::[]
image via the Docker CLI.

To do so, specify your cluster's inventory file and the *_health.yml_* playbook
when running the following `docker run` command as a non-root user that has
privileges to run containers:

----
# docker run -u `id -u` \ <1>
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ <2>
-v /etc/ansible/hosts:/tmp/inventory:ro \ <3>
-e INVENTORY_FILE=/tmp/inventory \
-e PLAYBOOK_FILE=playbooks/byo/openshift-checks/health.yml \ <4>
-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=45 -e etcd_max_image_data_size_bytes=40000000000" \ <5>
ifdef::openshift-enterprise[]
openshift3/ose-ansible
endif::[]
ifdef::openshift-origin[]
openshift/origin-ansible
endif::[]
----
<1> These options make the container run with the same UID as the current user,
which is required so that the SSH key can be read inside the container
(SSH private keys are expected to be readable only by their owner).
<2> Mount SSH keys as a volume under *_/opt/app-root/src/.ssh_* under normal usage
when running the container as a non-root user.
<3> Change *_/etc/ansible/hosts_* to the location of your cluster's inventory file,
if different. This file will be bind-mounted to *_/tmp/inventory_*, which is
used according to the `INVENTORY_FILE` environment variable in the container.
<4> The `PLAYBOOK_FILE` environment variable is set to the location of the
*_health.yml_* playbook relative to *_/usr/share/ansible/openshift-ansible_*
inside the container.
<5> Set any variables desired for a single run with the `-e key=value` format.

In the above command, the SSH key is mounted with the `:Z` flag so that the
container can read the SSH key from its restricted SELinux context; this means
that your original SSH key file will be relabeled to something like
`system_u:object_r:container_file_t:s0:c113,c247`. For more details about `:Z`,
see the `docker-run(1)` man page.

Keep this in mind for these volume mount specifications because it could have
unexpected consequences. For example, if you mount (and therefore relabel) your
*_$HOME/.ssh_* directory, *sshd* will become unable to access your public keys
to allow remote login. To avoid altering the original file labels, mounting a
copy of the SSH key (or directory) is recommended.
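For example, a minimal sketch of copying the key before mounting, so that the
`:Z` relabel applies to the copy rather than the original (the paths are
illustrative, and a dummy file stands in for an actual private key):

----
# Create a stand-in for $HOME/.ssh/id_rsa, copy it, and restrict permissions
# as SSH requires; the copy (not the original) is what gets mounted into the
# container and relabeled.
key_src=/tmp/demo_id_rsa
printf 'dummy-key-material\n' > "$key_src"
cp "$key_src" /tmp/id_rsa-copy
chmod 600 /tmp/id_rsa-copy
ls -l /tmp/id_rsa-copy
----

The copy at *_/tmp/id_rsa-copy_* would then be used in the `-v` volume mount
in place of *_$HOME/.ssh/id_rsa_*.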

You might want to mount an entire *_.ssh_* directory for various reasons. For
example, this would allow you to use an SSH configuration to match keys with
hosts or modify other connection parameters. It would also allow you to provide
a *_known_hosts_* file and have SSH validate host keys, which is disabled by the
default configuration and can be re-enabled with an environment variable by
adding `-e ANSIBLE_HOST_KEY_CHECKING=True` to the `docker` command line.
9 changes: 6 additions & 3 deletions install_config/install/advanced_install.adoc
@@ -544,11 +544,14 @@ inventory file. For example:
openshift_disable_check=memory_availability,disk_availability
----

[NOTE]
====
A similar set of health checks meant to run for diagnostics on existing clusters
can be found in
xref:../../admin_guide/diagnostics_tool.adoc#ansible-based-tooling-health-checks[Ansible-based Health Checks]. Another set of checks for checking certificate expiration can be
found in
xref:../../install_config/redeploying_certificates.adoc#install-config-redeploying-certificates[Redeploying Certificates].
====

[[advanced-install-configuring-system-containers]]
=== Configuring System Containers
15 changes: 11 additions & 4 deletions scaling_performance/optimizing_compute_resources.adoc
@@ -92,12 +92,13 @@ Registry credentials.
====

[[scaling-performance-debugging]]
== Debugging Using the RHEL Tools Container Image

Red Hat distributes a *rhel-tools* container image, packaging tools that aid in
debugging scaling or performance problems. This container image:

* Allows users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container.
* Provides debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable package tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities.

Use the *rhel-tools* container with the following:

@@ -107,5 +108,11 @@ Use the *rhel-tools* container with the following:

See the link:https://access.redhat.com/documentation/en/red-hat-enterprise-linux-atomic-host/7/getting-started-with-containers/chapter-11-using-the-atomic-tools-container-image[RHEL Tools Container documentation] for more information.

[[scaling-performance-debugging-using-ansible]]
== Debugging Using Ansible-based Health Checks

include::admin_guide/diagnostics_tool.adoc[tag=ansible-based-health-checks-intro]

See
xref:../admin_guide/diagnostics_tool.adoc#ansible-based-tooling-health-checks[Ansible-based Health Checks] in the Cluster Administration guide for information on the
available health checks and example usage.
