Summary
We propose a system monitoring mechanism that, at the Cluster and Pod level, does not require changes to the existing Che code. However, application monitoring of the Che agents requires some changes:

- Add special HTTP monitoring requests (telemetry), or use the logs and convert them into monitoring metrics by adding a special tag to the record.
- Add a health check command to each agent for monitoring, and register it, together with its health check configuration policy, with the agent manager.
- Add a health check agent manager within the Pod for monitoring.
- Use custom environment params that are added to the records of the Che agents for customized purposes, e.g. the user's tenant (customer) id.
- Add a critical external health check command to relevant agents that will be used by the Kubelet livenessProbe to restart the Pod. In addition, add the agent's health check configuration as a livenessProbe to the Pod configuration file.
Description
Complementary Che epics:
- Tracing: #10298, #10288
- Logging: #10290
Background
Monitoring Che Workspace (aka WS) agents is required to anticipate problems and discover bottlenecks in production environments.
K8S monitoring can be categorized as follows:
- Cluster metrics (System Monitor)
- Pod metrics (System Monitor)
- Application metrics (Application Monitor)

https://logz.io/blog/kubernetes-monitoring
Prometheus solution
There are many possible combinations of node- and cluster-level agents that could comprise a monitoring pipeline. The most popular in K8S is Prometheus, which is part of the CNCF.
It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when some condition is observed to be true.
Prometheus comes with its own dashboard, which is suitable for running ad-hoc queries or quick debugging, but for the best experience it is recommended to integrate it with a visualization backend such as Grafana.
https://www.weave.works/technologies/monitoring-kubernetes-with-prometheus
Prometheus Architecture
Prometheus has a cluster-level agent and a node-level agent (the node exporter).
The node exporter is installed as a DaemonSet and gathers machine-level metrics, in addition to the metrics exposed by cAdvisor for each container.
The Prometheus server is installed per cluster. It scrapes and stores time-series data from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally, runs rules over this data, and generates alerts.
https://prometheus.io/docs/introduction/overview/#architecture
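For illustration only, a minimal Prometheus scrape configuration matching this architecture might look as follows; the job names and the Pushgateway address are assumptions, not part of the proposal.

```yaml
# Illustrative prometheus.yml fragment: scrape the node exporters via
# Kubernetes service discovery, plus a per-cluster Pushgateway.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node       # relabeling to the node-exporter port omitted for brevity
  - job_name: pushgateway
    honor_labels: true   # keep job/instance labels as pushed by clients
    static_configs:
      - targets: ['pushgateway.monitoring:9091']
```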
Pushgateway
The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway, which then exposes them to Prometheus. The Pushgateway is installed per cluster.
In order to expose metrics of Che agents and running applications, the application needs to send an HTTP POST/PUT with the metric payload to the Pushgateway URL.
https://github.com/prometheus/pushgateway
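As a sketch, pushing such a metric from Python with the official prometheus_client library could look like the following; the gateway address, job name, and metric name are illustrative assumptions.

```python
# Minimal sketch: push one Che agent metric to a Pushgateway using
# the official prometheus_client library.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
heartbeat = Gauge(
    'che_agent_last_heartbeat_seconds',          # hypothetical metric name
    'Unix time at which the agent last reported healthy',
    registry=registry,
)
heartbeat.set_to_current_time()

# Internally this issues an HTTP PUT of the metrics, in the Prometheus
# text exposition format, to http://<gateway>/metrics/job/<job_name>.
push_to_gateway('pushgateway.monitoring:9091', job='che_ws_agent',
                registry=registry)
```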
Application Health Checking
Application health checking is required to detect agents that are non-functioning from the application's perspective even when the Pod and Node are considered healthy, e.g. a deadlock.
External Application Health Check & Recovery
K8S addresses this problem by supporting user-implemented application health checks that are performed by the Kubelet to ensure that the application is operating correctly.
K8S application health check types:
- HTTP health check – calls a web hook; an HTTP status between 200 and 399 is considered success, anything else a failure.
- Container Exec – executes a command inside the container; exit status 0 is considered success, anything else a failure.
- TCP Socket – opens a socket to the container; if the connection is established the container is considered healthy, otherwise it is a failure.
The Kubelet can react to two kinds of probes:
- livenessProbe – if the Kubelet discovers a failure, the container is restarted.
- readinessProbe – if the Kubelet discovers a failure, the Pod IP is removed from the services for a period.
The container health checks are configured in the livenessProbe/readinessProbe section of the container config.
This can be used as an external health check for critical services: that way, a system outside of the application itself is responsible for monitoring the application and taking action to fix it.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
https://kubernetes.io/docs/tutorials/k8s201/#application-health-checking
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
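For illustration, a Pod spec fragment wiring an agent health check into the Kubelet probes might look like the sketch below; the image name, endpoint path, port, and script path are assumptions.

```yaml
# Sketch of a container config with livenessProbe/readinessProbe.
apiVersion: v1
kind: Pod
metadata:
  name: che-workspace
spec:
  containers:
    - name: ws-agent
      image: eclipse/che-ws-agent   # hypothetical image
      livenessProbe:                # failure => Kubelet restarts the container
        httpGet:
          path: /liveness           # assumed endpoint
          port: 4401                # assumed agent port
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:               # failure => Pod IP removed from Services
        exec:
          command: ["/bin/sh", "-c", "/agents/healthcheck.sh"]
        periodSeconds: 15
```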
Application Health Check Monitoring
While the Kubelet uses the health check response to restart the container or remove its IP, it does not provide a monitoring tool for the different container health checks.
Performing agent health checks with requests originating from outside the Pod is not scalable and can create network load; the checks should therefore originate within the Pod.
Each agent should provide a health check command for monitoring. To perform the health checks, there should be a dedicated agent (the health check manager agent) that triggers the health check commands at every interval.
Each agent needs to register with the health check agent manager and configure its health check policy.
The agent manager can expose the results in one of the following ways:
- Expose them at the cAdvisor endpoint (still in alpha, see below).
- Send Prometheus metrics to the Pushgateway Pod.
- Send dedicated logs that will be monitored – recommended.
cAdvisor solution – since K8S 1.2, a new feature (still in alpha) allows cAdvisor to collect custom metrics from applications running in containers, if these metrics are natively exposed in the Prometheus format.
https://github.com/google/cadvisor/blob/master/docs/application_metrics.md
Exposing to cAdvisor is not recommended, as the feature is still in alpha and would add dependencies on other components.
Sending Prometheus metrics is less recommended, as it adds complexity by introducing the Pushgateway component.
Using the logging [see #10290] for application monitoring is preferred as the more homogeneous option: it uses the existing logging system and can be correlated with the additional information supplied by it. In this case the Pushgateway is not required.
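A minimal sketch of such a health check manager agent, under the recommended tagged-logs option, is shown below; the registration format, agent names, script paths, and the MONITORING tag are all assumptions, not existing Che code.

```python
# Sketch: a health check manager agent that runs each registered
# agent's health check command on its own interval and emits the
# result as a tagged log record for the logging pipeline (#10290).
import json
import logging
import subprocess
import time

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("healthcheck-manager")

# Hypothetical registrations: a command plus its health check policy.
REGISTERED_CHECKS = [
    {"agent": "ws-exec-agent", "cmd": ["/agents/exec/healthcheck.sh"], "interval": 30},
    {"agent": "ws-terminal",   "cmd": ["/agents/term/healthcheck.sh"], "interval": 60},
]

def run_check(check):
    """Run one health check command; exit status 0 means healthy."""
    try:
        ok = subprocess.run(check["cmd"], timeout=10).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    # The MONITORING tag marks this record for conversion into a metric.
    log.info(json.dumps({
        "tag": "MONITORING",
        "metric": "agent_health",
        "agent": check["agent"],
        "value": 1 if ok else 0,
        "ts": int(time.time()),
    }))

def main():
    next_due = {c["agent"]: 0.0 for c in REGISTERED_CHECKS}
    while True:
        now = time.time()
        for check in REGISTERED_CHECKS:
            if now >= next_due[check["agent"]]:
                run_check(check)
                next_due[check["agent"]] = now + check["interval"]
        time.sleep(1)

if __name__ == "__main__":
    main()
```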
Health check agent manager
The health check agent manager can be implemented as either:
- An independent agent within the container.
- A HEALTHCHECK instruction within the Docker image.

Docker provides a HEALTHCHECK instruction that checks the container's health by running a command inside the container at every time interval.
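For illustration, the instruction looks like this in a Dockerfile; the base image and script path are assumptions.

```dockerfile
# Hypothetical agent image; the HEALTHCHECK flags below are standard
# Docker options (interval, timeout, retries).
FROM alpine:3.8
COPY healthcheck.sh /agents/healthcheck.sh
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD /agents/healthcheck.sh || exit 1
```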
The proposed solution for monitoring application health checks should also be applied to single centric components (e.g. the WS Master), for a homogeneous solution.
Implementation recommendation
System monitoring of the K8S Cluster and Nodes should be based on the Prometheus system.
Application monitoring of the WS agents within the container should follow:

Sending metrics
Send the metrics by adding logs to the WS agent with a specific tag that indicates the log is used for monitoring.

Custom environment params
Added to the records of the Che agents for customized purposes, e.g. the user's tenant (customer) id.

Internal health check
Each agent provides a health check command for monitoring. In addition, each agent should register with the health check agent manager with its health check configuration policy.

Health check agent manager
An agent within the Pod that can be implemented either as an independent agent or as a HEALTHCHECK instruction within the Docker image (should be further investigated).

External health check
Relevant agents provide a critical health check command to be used by the Kubelet livenessProbe to restart the Pod. In addition, the agent should add its health check configuration policy to the livenessProbe part of the Pod configuration file.
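Putting the pieces together, a monitoring-tagged agent log record carrying a custom environment param might look like this (all field names and values are illustrative):

```json
{"tag": "MONITORING", "metric": "agent_health", "agent": "ws-exec-agent", "value": 1, "tenant_id": "acme-42", "ts": 1531224000}
```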
Implementation
FWIW, keeping at least one form of the metrics available as an HTTP-pollable Prometheus-exporter URL would be pretty future-proof, even if the cAdvisor machinery were to go away.