-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated diagnostic_tools for openshift-ansible image #4713
Conversation
dba9b7f
to
b6ce32a
Compare
admin_guide/diagnostics_tool.adoc
Outdated
|
||
---- | ||
# docker run -u `id -u` \ | ||
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was musing with Juan we should add a callout explaining what's going on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... there's a fairly detailed explanation of what's going on at https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md#usage
As usual you have to weigh how much to say about all this... I have to say that the ssh key permissions are particularly fragile and quite difficult to debug when they're not right (ssh is really persnickety and not very informative unless you turn debug up to 11), so we really want to guide them down a single path, not make them confident in how to do it differently. We definitely need to mention not running this as root, that's likely to be a nasty surprise.
b6ce32a
to
69e97c0
Compare
admin_guide/diagnostics_tool.adoc
Outdated
-v /etc/ansible/hosts:/tmp/inventory:ro \ <1> | ||
-e INVENTORY_FILE=/tmp/inventory \ | ||
-e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2> | ||
-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" \ <3> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty sure you need a second -e
for the second parameter.
-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 -e etcd_max_image_data_size_bytes=40000000000" \
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW are the callouts here going to interfere with copy/paste in the rendered version? Because they're really going to want to copy/paste this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty sure you need a second -e for the second parameter.
👍
Or quotes, so that everything after -e
goes as a single shell argument.
admin_guide/diagnostics_tool.adoc
Outdated
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ | ||
-v /etc/ansible/hosts:/tmp/inventory:ro \ <1> | ||
-e INVENTORY_FILE=/tmp/inventory \ | ||
-e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be playbooks/byo/openshift-checks/health.yml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sosiouxme Do the other two instances need to be /byo
as well?
admin_guide/diagnostics_tool.adoc
Outdated
|
||
---- | ||
# docker run -u `id -u` \ | ||
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... there's a fairly detailed explanation of what's going on at https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md#usage
As usual you have to weigh how much to say about all this... I have to say that the ssh key permissions are particularly fragile and quite difficult to debug when they're not right (ssh is really persnickety and not very informative unless you turn debug up to 11), so we really want to guide them down a single path, not make them confident in how to do it differently. We definitely need to mention not running this as root, that's likely to be a nasty surprise.
admin_guide/diagnostics_tool.adoc
Outdated
[[openshift-ansible-health-checks]] | ||
== openshift-ansible Health Checks | ||
|
||
Additional diagnostic health checks are available through *openshift-ansible*, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably a bit confusing that we refer to it as openshift-ansible
here, but later it has a different RPM name and a different image name. I'm not sure there's anything to be done about it, though... you have to call it something and origin/OCP have different names for everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use a conditional for the name, or in this particular case we could possible avoid giving it a name altogether:
Additional diagnostic health checks are available through the Ansible-based tooling used to install and manage {product-title} clusters.
admin_guide/diagnostics_tool.adoc
Outdated
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a | ||
link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks | ||
are provided with the *atomic-openshift-utils* RPM package. For the Docker CLI | ||
method, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the OCP RPM name... I don't think we ship an RPM for Origin. I think we expect people to run out of a git clone.
admin_guide/diagnostics_tool.adoc
Outdated
---- | ||
# docker run -u `id -u` \ | ||
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ | ||
-v /etc/ansible/hosts:/tmp/inventory:ro \ <1> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out this needs a :Z
as well (at least in my test today)
-v /etc/ansible/hosts:/tmp/inventory:Z,ro \
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you. Wondering why we don't have it in README_CONTAINER_IMAGE.md. The ro
is also "new" here, makes sense. We may update README_CONTAINER_IMAGE.md to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't need :Z
here in my tests, and ro
would be covered by standard file perms (we're running as non-root by default). Considering that :Z
would relabel the file on the host, better to leave it out if not needed:
# ls -Z /etc/ansible/hosts
-rw-r--r--. root root unconfined_u:object_r:net_conf_t:s0 /etc/ansible/hosts
# sesearch -A -s svirt_lxc_net_t -t net_conf_t
Found 2 semantic av rules:
allow svirt_sandbox_domain file_type : filesystem getattr ;
allow svirt_sandbox_domain file_type : dir { getattr search open } ;
If we do need it let's add it... just mentioning that I think it's worth double checking that it's necessary...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving it alone for now, then.
admin_guide/diagnostics_tool.adoc
Outdated
A user-defined timeout may be set by passing the | ||
`openshift_check_logging_index_timeout_seconds` variable. For example, setting | ||
`openshift_check_logging_index_timeout_seconds=30` will cause the check to fail | ||
if a newly-created Kibana log is not able to be queried via Elasticsearch after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Kibana log" is confusing here and it's a detail they don't need. Suggest: "log entry"
method used during | ||
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a | ||
link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks | ||
are provided with the *atomic-openshift-utils* RPM package. For the Docker CLI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again atomic-openshift-utils
is OCP only
link:https://registry.access.redhat.com[Red Hat Container Registry]. | ||
endif::[] | ||
ifdef::openshift-origin[] | ||
the *openshift/origin-ansible* container image is distirbuted via Docker Hub. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distributed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the content duplication intentional? This paragraph is the same in two files (typo included).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-did this so that I'm re-using the shared content w/ an include::
.
admin_guide/diagnostics_tool.adoc
Outdated
---- | ||
# ansible-playbook -i <inventory_file> \ | ||
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \ | ||
-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a -e
per param
-e openshift_check_logging_index_timeout_seconds=30 -e etcd_max_image_data_size_bytes=40000000000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested just to be sure. We can use a single -e
if the argument is quoted (because of spaces).
-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000"
is also correct.
And so is using single quotes (avoids shell expansion):
-e 'openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making it separate -e
entries on separate lines (without quotes) for readability.
admin_guide/diagnostics_tool.adoc
Outdated
The following health checks belong to a diagnostic task meant to be run against | ||
the Ansible inventory file for a deployed {product-title} cluster. They can | ||
report common problems for the current {product-title} installation. | ||
[[openshift-ansible-health-checks]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the anchor will break existing links, is it worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was only one xref in the docs repo and it's updated in this PR. This content has only appeared so far in the Origin docs, so not too worried about links out in the wild.
admin_guide/diagnostics_tool.adoc
Outdated
the Ansible inventory file for a deployed {product-title} cluster. They can | ||
report common problems for the current {product-title} installation. | ||
[[openshift-ansible-health-checks]] | ||
== openshift-ansible Health Checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: same comment as https://github.com/openshift/openshift-docs/pull/4713/files#r125776778
admin_guide/diagnostics_tool.adoc
Outdated
Example usage for each method are provided in subsequent sections. | ||
|
||
The following health checks are a set of diagnostic tasks that run as part of | ||
the *openshift_health_checker* Ansible role. They are meant to be run against |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The role name is internal implementation detail, should not be in the documentation.
admin_guide/diagnostics_tool.adoc
Outdated
|
||
|`logging_index_time` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is implemented in openshift/openshift-ansible#4682, not merged yet as of now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... meaning it will probably be in 3.6.1 not 3.6.0
admin_guide/diagnostics_tool.adoc
Outdated
if a newly-created Kibana log is not able to be queried via Elasticsearch after | ||
30 seconds. | ||
|
||
|`ovs_version` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't exist. We use a more generic package_version
check to check versions of multiple packages, including Open vSwitch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that was my fault. I believe I initially started the Open vSwitch check in a check of its own
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, will remove.
|
||
To disable specific checks, include the variable `openshift_disable_check` with | ||
a comma-delimited list of check names in your inventory file before running the | ||
playbook. For example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW it doesn't need to be in the inventory file, it can be with a -e
flag as well (among the other ways to set variables in Ansible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding mention of that method as well.
admin_guide/diagnostics_tool.adoc
Outdated
openshift_disable_check=ovs_version,etcd_volume | ||
---- | ||
|
||
To set variables that accept user-define values, include the `-e` flag with any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
user-defined
Or rather:
To set variables in the command-line, include ...
admin_guide/diagnostics_tool.adoc
Outdated
---- | ||
# ansible-playbook -i <inventory_file> \ | ||
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \ | ||
-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested just to be sure. We can use a single -e
if the argument is quoted (because of spaces).
-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000"
is also correct.
And so is using single quotes (avoids shell expansion):
-e 'openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000'
admin_guide/diagnostics_tool.adoc
Outdated
-v /etc/ansible/hosts:/tmp/inventory:ro \ <1> | ||
-e INVENTORY_FILE=/tmp/inventory \ | ||
-e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2> | ||
-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" \ <3> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty sure you need a second -e for the second parameter.
👍
Or quotes, so that everything after -e
goes as a single shell argument.
link:https://registry.access.redhat.com[Red Hat Container Registry]. | ||
endif::[] | ||
ifdef::openshift-origin[] | ||
the *openshift/origin-ansible* container image is distirbuted via Docker Hub. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the content duplication intentional? This paragraph is the same in two files (typo included).
69e97c0
to
7544eb9
Compare
ping @rhcarvalho or @sosiouxme wondering if there are any more comments on this? |
admin_guide/diagnostics_tool.adoc
Outdated
|
||
|`logging_index_time` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged today
admin_guide/diagnostics_tool.adoc
Outdated
established between the control host and the exposed Kibana URL. These checks | ||
will only run if the `openshift_hosted_logging_deploy` inventory variable is set | ||
to `true`, to ensure that they are executed in a deployment where a logging | ||
stack has been deployed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have some link in this paragraph tying back to the logging docs, or is it clear what "logging stack" we're referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could. And actually the name could use some standardization too. I see it called variously "cluster logging", "aggregated logging", "the logging stack", and "the EFK stack".
https://docs.openshift.org/latest/install_config/aggregate_logging.html is probably what we want to link to. I don't have a strong preference on standard name but "cluster logging" seems straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
admin_guide/diagnostics_tool.adoc
Outdated
`openshift_check_logging_index_timeout_seconds` variable. For example, setting | ||
`openshift_check_logging_index_timeout_seconds=30` will cause the check to fail | ||
if a newly-created log entry is not able to be queried via Elasticsearch after | ||
30 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd consider that more important than knowing the variable name would be to explain when one should bother to change the default? I think we're missing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sosiouxme When should one bother?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we need the long variable name twice. Don't like "is not able to be". And to Rodolfo's point:
Users that either require lower-latency log aggregation or are comfortable with higher latency may adjust this timeout with an Ansible variable. For example, setting
openshift_check_logging_index_timeout_seconds=45
relaxes the timeout to 45 seconds.
admin_guide/diagnostics_tool.adoc
Outdated
openshift_disable_check=etcd_traffic,etcd_volume | ||
---- | ||
|
||
Alternatively, set any checks you want to disable as environment variables with `-e openshift_disable_check` when running the `ansible-playbook` command. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be misleading, it should be -e openshift_disable_check=name1,name2,...
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So "an" environment variable yes?
// tag::ansible-based-health-checks-intro[] | ||
Additional diagnostic health checks are available through the | ||
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Ansible-based tooling] used to install and manage {product-title} clusters. They can report | ||
common deployment problems for the current {product-title} installation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There needs to be a substantial up-front warning about using these checks. They are not without side effects; running them can make changes to the hosts. Mostly those changes would be installing dependencies so the checks can gather needed information, and so they shouldn't be of much concern. My main concern is that some of the roles from the installer are invoked as pre-requisites to gather information the checks require, and those roles don't have a concept of not changing anything important. They try to ensure the system components they deal with are configured consistently with the inventory, even though you're not running from an installer playbook. So if the admin didn't install with Ansible, or ran the install with different/extra variables specified than what's in the inventory file, or made manual config changes after installing, they could easily find that running the checks re-configures their hosts according to the inventory file. Networking and Docker (for instance) can be affected.
So the checks should only be used on clusters that have been deployed with Ansible and using the same inventory it was deployed with. I'm not sure how to say this in a way that isn't terrifying to the user. Basically, if you wouldn't run the install with your inventory (and expect it to change nothing) then don't run the checks with it either, because the run may perform (some of) the same changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider this a bug in the check runner 😢
It is not the user's fault the implications of how the checks are implemented. We should fix it, since IIUC it did not start that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhcarvalho Oops, didn't see your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhcarvalho well that would involve decomposing roles into "informational" versus "transformational". With the caveat that even "informational" roles can install dependencies. It's probably worth doing but I don't think it will be quick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adellape That warning LGTM.
7544eb9
to
7d1f6c6
Compare
LGTM |
admin_guide/diagnostics_tool.adoc
Outdated
|`logging_index_time` | ||
|This check detects higher than normal time delays between log creation and log | ||
aggregation by Elasticsearch in a logging stack deployment. It fails if a | ||
user-defined timeout is reached before logs are able to be queried through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't say "user-defined" here as most users will use the default. And it's good to mention the default. So something like:
It fails if a new log entry cannot be queried through Elasticsearch within a timeout (by default, 30 seconds).
And below, I'd use a different configured timeout, e.g. 45 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
admin_guide/diagnostics_tool.adoc
Outdated
|
||
---- | ||
# ansible-playbook -i <inventory_file> \ | ||
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the location for OCP... do we ship the installer RPM for Origin? I don't think we do. I don't want to complicate this but maybe for Origin we need to describe cloning the repo...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/common/byo/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also s/common/byo/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll ifdef this for OCP vs Origin.
<1> These options make the container run with the same UID as the current user, | ||
which is required for permissions so that the SSH key can be read inside the | ||
container (SSH private keys are expected to be readable only by their owner). | ||
<2> Mount SSH keys as a volume under *_/opt/app-root/src/.ssh_* under normal usage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The :Z requires some explanation too. As the repo docs explain:
Note that the ssh key is mounted with the :Z flag: this is also required so that the container can read the ssh key from its restricted SELinux context; this means that your original ssh key file will be re-labeled to something like system_u:object_r:container_file_t:s0:c113,c247. For more details about :Z please check the docker-run(1) man page. Please keep this in mind when providing these volume mount specifications because this could have unexpected consequences: for example, if you mount (and therefore re-label) your whole $HOME/.ssh directory you will block sshd from accessing your keys. This is a reason why you might want to work on a separate copy of the ssh key, so that the original file's labels remain untouched.
ssh is exceedingly picky about what it accepts as keys, and having the wrong SELinux label on the file (which is virtually inevitable at least on the initial run) causes it to silently skip the key and blithely fail to connect without further explanation. It's a pretty infuriating user experience. This is also why it's so important to have the container user matching the user ID on the keyfile.
The mention of mounting a .ssh directory leads to yet more discussion; I don't like it but this probably has to be pretty verbose because this is what everyone will get hung up on adapting to their specific usage. You might want to mount a whole .ssh directory for various reasons, among them because you want to use an ssh config to match keys with hosts or tweak other connection parameters, or because you want to provide a known_hosts
file and have ssh validate host keys (which is disabled by default config and can be re-enabled by setting the envvar -e ANSIBLE_HOST_KEY_CHECKING=True
on the cmdline or a few more complicated ways).
Finally we should probably mention providing a vault password. Probably in a future update because frankly I know nothing about it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
admin_guide/diagnostics_tool.adoc
Outdated
when running the container as a non-root user. | ||
<3> Change *_/etc/ansible/hosts_* to the location of your cluster's inventory file, | ||
if different. This file will be bind-mounted to *_/tmp/inventory_*, which is | ||
used by the `INVENTORY_FILE` environment variable in the container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's not used by it but I guess used according to it. Or as indicated by it?
The same need for :Z labeling can apply to this (or really anything that's mounted in). The label on /etc/ansible/hosts
is typically suitable already for use in the container and they can get away without :Z relabeling. However if the inventory file is in a user home directory for instance, it will not be available to the user in the container and they will need to relabel it. I'm not sure there's a graceful way to explain this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we haven't even discussed dynamic inventory, which is a whole 'nother topic.
admin_guide/diagnostics_tool.adoc
Outdated
if different. This file will be bind-mounted to *_/tmp/inventory_*, which is | ||
used by the `INVENTORY_FILE` environment variable in the container. | ||
<3> The `PLAYBOOK_FILE` environment variable is set to the location of the | ||
*_health.yml_* playbook inside the container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably worth saying that this is the location relative to /usr/share/ansible/openshift-ansible
inside the container. You could also fully specify e.g. PLAYBOOK_FILE=/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
admin_guide/diagnostics_tool.adoc
Outdated
used by the `INVENTORY_FILE` environment variable in the container. | ||
<3> The `PLAYBOOK_FILE` environment variable is set to the location of the | ||
*_health.yml_* playbook inside the container. | ||
<4> Set any desired variables that accept user-defined values in `key=value key=value` format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All variables can be user-defined so that's kind of redundant. The example's clear enough but I'd suggest this say:
Set any variables desired for a single run with the
-e key=value
format.
Having two in there could be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@sosiouxme <https://github.com/sosiouxme> Do the other two instances need
to be /byo as well?
Yes... yes, they all do. No one should be running /common playbooks
directly.
|
admin_guide/diagnostics_tool.adoc
Outdated
|
||
---- | ||
# ansible-playbook -i <inventory_file> \ | ||
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/common/byo/
admin_guide/diagnostics_tool.adoc
Outdated
==== | ||
Due to potential changes the health checks could make to hosts, they should only | ||
be used on clusters that have been deployed using Ansible and using the same | ||
inventory file with which it was deployed. Changes mostly involve installing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"deployed using Ansible and using the same inventory file with which it was deployed"
Is there a programmatic way to determine this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err... run the playbook and see if it changes anything?
Seeing what would be changed without making the changes is what the --check
ansible flag is for but I honestly don't know how accurate or useful it is with as much custom stuff as openshift-ansible includes. Could be fine, could be broken, could even make changes anyway. I don't think it's a use case anyone's paying attention to.
@adellape how far do you think we are from getting it merged? :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, still a few things to fix up.
admin_guide/diagnostics_tool.adoc
Outdated
|
||
[WARNING] | ||
==== | ||
Due to potential changes the health checks could make to hosts, they should only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small quibble I missed before.
s/health checks/health check playbooks/
(the health checks themselves don't make the problem changes, the playbooks have dependencies that make them)
admin_guide/diagnostics_tool.adoc
Outdated
|
||
These checks can be run either using the `ansible-playbook` command (the same | ||
method used during | ||
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to link to github instructions in the next paragraph, it's probably worth mentioning that you can run it as a system container as well.
"using the Docker CLI or atomic..."
Or, go the other way and remain agnostic:
"... or as a containerized version of openshift-ansible." And "containerized method" below.
admin_guide/diagnostics_tool.adoc
Outdated
exceeds 40 GB. | ||
|
||
|`etcd_traffic` | ||
|This check detects higher than normal traffic on an etcd host. It fails if a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: how about "higher-than-normal"?
admin_guide/diagnostics_tool.adoc
Outdated
limit for total percent usage can be set with a variable in your inventory file: | ||
`max_thinpool_data_usage_percent=90`. | ||
|=== | ||
threshold defaults to 90% of the total size available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also checks that docker storage is configured in a supportable way. Meaning, with a storage driver and backing storage device that are supported for usage with Docker.
In particular, the default configuration of Docker storage is not supported. I'm not sure how much to say about this; probably worth linking to any docs we have on Docker storage configuration.
admin_guide/diagnostics_tool.adoc
Outdated
openshift_disable_check=ovs_version,etcd_volume | ||
---- | ||
|`kibana`, `curator`, `elasticsearch`, `fluentd` | ||
|This set of checks verifies that Elasticsearch, Fluentd, and Curator pods have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and Kibana
admin_guide/diagnostics_tool.adoc
Outdated
xref:../install_config/aggregate_logging.adoc#install-config-aggregate-logging[cluster | ||
logging] has been enabled. | ||
|
||
|`logging_index_time` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check similarly only runs if logging is enabled. Should it be grouped with the others? They're all related but the focus of this one is a little different, it's end-to-end...
admin_guide/diagnostics_tool.adoc
Outdated
---- | ||
|
||
To set variables in the command line, include the `-e` flag with any desired | ||
variables in `key=value key=value` format. For example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would still prefer a single key=value
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops.
admin_guide/diagnostics_tool.adoc
Outdated
therefore relabel) your *_$HOME/.ssh_* directory, you will block *sshd* | ||
from accessing your keys. This is a reason why you might want to work on a | ||
separate copy of the SSH key, so that the original file's labels remain | ||
untouched. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would reword this just a bit to be clearer about the problem and stronger in the suggested remedy.
Keep this in mind for these volume mount specifications because it could have unexpected consequences. For example, if you mount (and therefore relabel) your $HOME/.ssh directory, sshd will become unable to access your public keys to allow remote login. To avoid altering the original file labels, mounting a copy of the SSH key (or directory) is recommended.
admin_guide/diagnostics_tool.adoc
Outdated
hosts or modify other connection parameters. It would also allow you to provide | ||
a *_known_hosts_* file and have SSH validate host keys, which is disabled by the | ||
default configuration and can be re-enabled by setting the `envvar -e | ||
ANSIBLE_HOST_KEY_CHECKING=True` on the command line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be confusing as to whether it needs to be added on the docker command or in the OPTS var. It needs to be set on the docker command.
... and can be re-enabled with an environment variable by adding
-e ANSIBLE_HOST_KEY_CHECKING=True
to thedocker
command line.
27e3da9
to
c356e2a
Compare
@sosiouxme OK thanks, changes made, see latest commit. |
@sosiouxme bump |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits but LGTM
* Allow users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container. | ||
* Provide debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable packet tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities. | ||
* Allows users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container. | ||
* Provides debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable packet tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/packet/package/
here; this supplies tools that can't be installed as packages on AH. Not sure what an immutable packet tree would mean :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😊
admin_guide/diagnostics_tool.adoc
Outdated
openshift_disable_check=etcd_traffic,etcd_volume | ||
---- | ||
|
||
Alternatively, set any checks you want to disable as environment variables with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically these aren't environment variables... I think you can safely s/environment//
On a docker command line, -e
sets an environment variable. On an ansible command line, -e
sets an extra variable or inventory variable. They look the same but they're accessed in completely different ways...
admin_guide/diagnostics_tool.adoc
Outdated
=== Running Health Checks via Docker CLI | ||
|
||
It is possible a playbook may require dependencies that are not installed | ||
locally on the host running the `ansible-playbook` command. You can avoid this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, dependencies like Ansible and the playbooks :) Other dependencies aren't really much of a problem for this playbook, so I don't know that I'd put this as the motivation. The purpose of the image is to be able to run playbooks on a system with nothing more than Docker. You could even run it on a Mac or Atomic Host.
Not a big deal, but maybe something like:
"It is possible to run the openshift-ansible playbooks in a Docker container, avoiding the need for installing and configuring Ansible, on any host that can run the {ose,origin}-ansible image via the Docker CLI."
It would be nice to have these changes made at least for 3.6.1 if not before... |
c356e2a
to
4aeec01
Compare
@sosiouxme Thanks! Changes made and I'll merge after Travis finishes. It will get published w/ next Monday's release. |
[rev_history] |
Reworks #4579 and moves most of the content to the existing
admin_guide/diagnostic_tools
section, and adds subsections foransible-playbook
vs directdocker
usage. Still adds mention of the openshift-ansible health checks to the scaling guide, but mostly links to theadmin_guide
for the details and usage. Still adds in the previously-undocumented checks (logging_index_time
,etcd_traffic
).Preview:
http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/admin_guide/diagnostics_tool.html#additional-cluster-health-checks
http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/scaling_performance/optimizing_compute_resources.html#scaling-performance-debugging-using-openshift-ansible