From e28bb29b03bcf33ef77cca269097baad26cedd21 Mon Sep 17 00:00:00 2001
From: KAAAsS
Date: Thu, 9 May 2024 17:17:00 +0800
Subject: [PATCH 1/2] rfc: Export Metrics related to StressChaos Experiments

Signed-off-by: KAAAsS
---
 ...4-05-09-stresschaos-experiments-metrics.md | 179 ++++++++++++++++++
 1 file changed, 179 insertions(+)
 create mode 100644 text/2024-05-09-stresschaos-experiments-metrics.md

diff --git a/text/2024-05-09-stresschaos-experiments-metrics.md b/text/2024-05-09-stresschaos-experiments-metrics.md
new file mode 100644
index 0000000..c0710e5
--- /dev/null
+++ b/text/2024-05-09-stresschaos-experiments-metrics.md
@@ -0,0 +1,179 @@
# Export Metrics related to StressChaos Experiments

## Summary

Chaos Daemon exports the statistical metrics of the container, while Chaos Controller Manager exports the relation metrics between a StressChaos experiment and its selected containers. Helm Charts include Prometheus rules that join these metrics together to export the experiment metrics of StressChaos.

## Motivation

Currently, there are deficiencies in the observability of StressChaos experiments. Users need to deploy observation tools themselves to observe the pods and containers affected by StressChaos. Although there is already a Grafana plugin that adds the start and end times of StressChaos experiments to Grafana through annotations, it is still difficult for users to determine which Pods and Containers will be affected by an experiment. These issues make it difficult for users to observe the experimental effects of StressChaos, and also make it hard to evaluate the steady state.

This RFC aims to solve this problem by actively exporting StressChaos-related metrics from Chaos Mesh. After implementing this RFC, users will be able to directly observe the effects of StressChaos in Prometheus through metrics.

## Detailed design

### Overview

The purpose of this feature is to export metrics related to StressChaos. To achieve this, we need to modify Chaos Daemon and Chaos Controller Manager to export some intermediate metrics, and add Prometheus rules to Helm Charts to export the final metrics.

For end users, the expected outcomes are:

- Users can observe the effects of StressChaos experiments through Prometheus.
- Users can filter the metrics of specific experiments through Prometheus, for example with a query like the one sketched below.

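
As an illustration of the second point, a single label selector on the experiment metrics defined later in this document would be enough to list everything Chaos Mesh exports for one experiment. This is only a sketch; the experiment name `web-cpu-stress` is a made-up example:

```promql
# All experiment metrics that belong to one StressChaos experiment
{__name__=~"chaos_mesh:stress_chaos:.+", kind="StressChaos", name="web-cpu-stress"}
```
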
### Metrics

Overall, this design exports three types of metrics.

#### Statistical metrics

Statistical metrics are the metrics that describe the statistical information of the container. These metrics are exported by Chaos Daemon. We plan to export the following metrics:

| Metric Name                                              | Description                                                  |
| -------------------------------------------------------- | ------------------------------------------------------------ |
| `chaos_daemon_container_cpu_usage_seconds_total`         | Total CPU usage in seconds of the container.                 |
| `chaos_daemon_container_memory_working_set_bytes`        | The amount of working set memory in bytes of the container.  |
| `chaos_daemon_container_memory_available_bytes`          | The available memory in bytes of the container.              |
| `chaos_daemon_container_memory_usage_bytes`              | The memory usage in bytes of the container.                  |
| `chaos_daemon_container_memory_rss_bytes`                | The amount of RSS memory in bytes of the container.          |
| `chaos_daemon_container_memory_page_faults_total`        | Total number of page faults of the container.                |
| `chaos_daemon_container_memory_major_page_faults_total`  | Total number of major page faults of the container.          |
| `chaos_daemon_container_memory_swap_available_bytes`     | The available swap in bytes of the container.                |
| `chaos_daemon_container_memory_swap_usage_bytes`         | The swap usage in bytes of the container.                    |

Statistical metrics are exported with the following labels:

| Label Name  | Description                     |
| ----------- | ------------------------------- |
| `namespace` | The namespace of the container. |
| `pod`       | The pod name of the container.  |
| `container` | The container name.             |

The export of statistical metrics is not directly related to the StressChaos experiment. Instead, Chaos Daemon exports these metrics for all Kubernetes-managed containers on the node.

#### Relation metrics

Relation metrics are the metrics that describe the relation between the Chaos experiment and the container. These metrics are exported by Chaos Controller Manager. The proposed metric name is: `chaos_controller_manager_chaos_experiments_container_relation`. It is exported with the following labels:

| Label Name  | Description                                    |
| ----------- | ---------------------------------------------- |
| `namespace` | The namespace of the container.                |
| `kind`      | The kind of the experiment.                    |
| `phase`     | The phase of the experiment.                   |
| `name`      | The name of the experiment.                    |
| `uid`       | The UID of the experiment.                     |
| `pod`       | The pod name of the selected container.        |
| `container` | The container name of the selected container.  |

The relation metrics are exported for all Chaos experiments managed by Chaos Controller Manager if the phase of the experiment is not `Finished` or `Deleting`. This prevents exporting too many metrics by ignoring inactive experiments.

For each selected container in the experiment, Chaos Controller Manager exports a relation metric. The value of the metric is always fixed to `1` for the convenience of joining metrics.

#### Experiment metrics

Experiment metrics are the metrics that describe the effects of the StressChaos experiment. These are the final metrics that end users can observe. The proposed metric name prefix is `chaos_mesh:stress_chaos:`. For example:

- Statistical metrics: `chaos_daemon_container_cpu_usage_seconds_total`
- Experiment metrics: `chaos_mesh:stress_chaos:container_cpu_usage_seconds_total`

They are exported with the following labels:

| Label Name  | Description                                    |
| ----------- | ---------------------------------------------- |
| `namespace` | The namespace of the container.                |
| `kind`      | The kind of the experiment.                    |
| `phase`     | The phase of the experiment.                   |
| `name`      | The name of the experiment.                    |
| `uid`       | The UID of the experiment.                     |
| `pod`       | The pod name of the selected container.        |
| `container` | The container name of the selected container.  |

The experiment metrics are exported by joining the statistical metrics and relation metrics. The join is done by Prometheus rules in Helm Charts. Thus, the value of the experiment metrics is the same as the statistical metrics, but with additional labels of the experiment.

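
As a sketch of how these metrics could then be used to evaluate the steady state (again assuming a hypothetical experiment named `web-cpu-stress`), a user can aggregate the CPU usage of all containers selected by the experiment:

```promql
# Per-pod CPU usage rate of the containers selected by one StressChaos experiment
sum by (pod) (
  rate(chaos_mesh:stress_chaos:container_cpu_usage_seconds_total{kind="StressChaos", name="web-cpu-stress"}[1m])
)
```
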
### Joining Metrics with PromQL

The purpose of joining metrics is to add the experiment information labels to the statistical metrics. During the join, we first match the statistical metrics and relation metrics by the `namespace`, `pod`, and `container` labels. Then, we add the experiment information labels to the statistical metrics.

The join is done by PromQL specified in the PrometheusRule. The principle is PromQL's `group_right` modifier and vector matching. This modifier performs a one-to-many join between two metrics. For example, for the statistical metric `chaos_daemon_container_cpu_usage_seconds_total`, we use the following PromQL to join the relation metrics:

```promql
chaos_daemon_container_cpu_usage_seconds_total
    # The multiplication has no effect because the value is always 1.0; it only triggers the join
    *
    # Match based on the namespace, pod and container labels
    on(namespace, pod, container)
    # Join the relation metrics
    group_right
chaos_controller_manager_chaos_experiments_container_relation
```

### About Implementation

#### Chaos Daemon

We use CRI (Container Runtime Interface) to obtain the statistical metrics of the container. We will connect to the CRI socket to obtain the metrics. A new CLI parameter `--cri-socket-path` will be added to specify the CRI socket. The default value is derived from `--runtime` and `--runtime-socket-path`. We will export the metrics to the current probe endpoint.

#### Chaos Controller Manager

We will export the metrics to the current probe endpoint.

#### Helm Charts

We will modify the current Prometheus ConfigMap to add the Prometheus rules.

We will add new values to control the proposed feature:

| Value Name                              | Description                                          | Default Value |
| --------------------------------------- | ---------------------------------------------------- | ------------- |
| `prometheus.experimentMetrics.enabled`  | Enable the feature of exporting experiment metrics.  | `false`       |
| `chaosDaemon.criSocketPath`             | The path of the CRI socket.                          | -             |

## Drawbacks

We need to associate the statistical metrics of the container with the experiment information of StressChaos through Prometheus rules. This makes part of this design dependent on Prometheus. For other metrics processing tools, users need to implement this association themselves.

Due to the limitations of PromQL, this design can only export one label combination for a container metric. For example, this design will not be able to export metrics like this:

```
# For this metric, given a container, there are two label combinations:
# 1. type="working_set"
# 2. type="rss"
# Since PromQL cannot do many-to-many label joining, these metrics cannot be exported.
chaos_mesh:stress_chaos:memory_usage_bytes{container="test", type="working_set", ...} 1024
chaos_mesh:stress_chaos:memory_usage_bytes{container="test", type="rss", ...} 1024
```

Instead, we need to export two separate metrics like this:

```
chaos_mesh:stress_chaos:memory_working_set_usage_bytes{container="test", ...} 1024
chaos_mesh:stress_chaos:memory_rss_usage_bytes{container="test", ...} 1024
```

## Alternatives

### Alternative #1: Export the experiment metrics by Chaos Daemon

We can leave the Chaos Controller Manager completely unmodified. Instead, we allow the Chaos Daemon to obtain the experiment records of StressChaos and let the Chaos Daemon directly add experiment information labels to the metrics.

Pros:

- All problems mentioned in the drawbacks section can be solved.

Cons:

- Chaos Daemon is not aware of the experiment details at present. Therefore, Chaos Daemon needs to request the Kubernetes API Server to obtain the details of the experiment, which introduces additional complexity.
- Any metrics probing on a worker node's Chaos Daemon may result in several requests to the Kubernetes API Server. Although some requests can be cached, this may still cause a performance issue.

### Alternative #2: Export the experiment metrics by Chaos Controller Manager

We can export the final metrics in Chaos Controller Manager. Specifically, when obtaining metrics from Chaos Controller Manager, it retrieves statistical metrics from all Chaos Daemons. Then, it adds the experiment information labels to the metrics and exports them as the final experiment metrics.

Pros:

- All problems mentioned in the drawbacks section can be solved.

Cons:

- Each probe of the metrics will result in several requests to all Chaos Daemons, which may cause a performance issue.
- Chaos Controller Manager needs to implement the logic of parsing, filtering, modifying, and exporting metrics, which may introduce additional complexity.

From cc908ba74ec32574c683f9de446e66ecc9185f8f Mon Sep 17 00:00:00 2001
From: KAAAsS
Date: Wed, 15 May 2024 00:04:22 +0800
Subject: [PATCH 2/2] style: fix lint

Signed-off-by: KAAAsS
---
 ...4-05-09-stresschaos-experiments-metrics.md | 154 +++++++++++++-----
 1 file changed, 109 insertions(+), 45 deletions(-)

diff --git a/text/2024-05-09-stresschaos-experiments-metrics.md b/text/2024-05-09-stresschaos-experiments-metrics.md
index c0710e5..e0ae715 100644
--- a/text/2024-05-09-stresschaos-experiments-metrics.md
+++ b/text/2024-05-09-stresschaos-experiments-metrics.md
@@ -2,19 +2,33 @@

## Summary

-Chaos Daemon exports the statistical metrics of the container, while Chaos Controller Manager exports the relation metrics between a StressChaos experiment and its selected containers. Helm Charts include Prometheus rules that join these metrics together to export the experiment metrics of StressChaos.
+Chaos Daemon exports the statistical metrics of the container, while Chaos Controller
+Manager exports the relation metrics between a StressChaos experiment and its
+selected containers. Helm Charts include Prometheus rules that join these metrics
+together to export the experiment metrics of StressChaos.

## Motivation

-Currently, there are deficiencies in the observability of StressChaos experiments. Users need to deploy observation tools themselves to observe the pods and containers affected by StressChaos. Although there is already a Grafana plugin that adds the start and end times of StressChaos experiments to Grafana through annotations, it is still difficult for users to determine which Pods and Containers will be affected by an experiment. These issues make it difficult for users to observe the experimental effects of StressChaos, and also make it hard to evaluate the steady state.
+Currently, there are deficiencies in the observability of StressChaos experiments.
+Users need to deploy observation tools themselves to observe the pods and containers
+affected by StressChaos. Although there is already a Grafana plugin that adds the
+start and end times of StressChaos experiments to Grafana through annotations, it
+is still difficult for users to determine which Pods and Containers will be affected
+by an experiment. These issues make it difficult for users to observe the experimental
+effects of StressChaos, and also make it hard to evaluate the steady state.

-This RFC aims to solve this problem by actively exporting StressChaos-related metrics from Chaos Mesh. After implementing this RFC, users will be able to directly observe the effects of StressChaos in Prometheus through metrics.
+This RFC aims to solve this problem by actively exporting StressChaos-related
+metrics from Chaos Mesh. After implementing this RFC, users will be able to directly
+observe the effects of StressChaos in Prometheus through metrics.

## Detailed design

### Overview

-The purpose of this feature is to export metrics related to StressChaos. To achieve this, we need to modify Chaos Daemon and Chaos Controller Manager to export some intermediate metrics, and add Prometheus rules to Helm Charts to export the final metrics.
+The purpose of this feature is to export metrics related to StressChaos. To achieve
+this, we need to modify Chaos Daemon and Chaos Controller Manager to export some
+intermediate metrics, and add Prometheus rules to Helm Charts to export the final
+metrics.

For end users, the expected outcomes are:

@@ -27,19 +41,28 @@ Overall, this design exports three types of metrics.

#### Statistical metrics

-Statistical metrics are the metrics that describe the statistical information of the container. These metrics are exported by Chaos Daemon. We plan to export the following metrics:
-
-| Metric Name | Description |
-| ------------------------------------------------------- | ----------------------------------------------------------- |
-| `chaos_daemon_container_cpu_usage_seconds_total` | Total CPU usage in seconds of the container. |
-| `chaos_daemon_container_memory_working_set_bytes` | The amount of working set memory in bytes of the container. |
-| `chaos_daemon_container_memory_available_bytes` | The available memory in bytes of the container. |
-| `chaos_daemon_container_memory_usage_bytes` | The memory usage in bytes of the container. |
-| `chaos_daemon_container_memory_rss_bytes` | The amount of RSS memory in bytes of the container. |
-| `chaos_daemon_container_memory_page_faults_total` | Total number of page faults of the container. |
-| `chaos_daemon_container_memory_major_page_faults_total` | Total number of major page faults of the container. |
-| `chaos_daemon_container_memory_swap_available_bytes` | The available swap in bytes of the container. |
-| `chaos_daemon_container_memory_swap_usage_bytes` | The swap usage in bytes of the container. |
+Statistical metrics are the metrics that describe the statistical information of
+the container. These metrics are exported by Chaos Daemon. We plan to export the
+following metrics:
+
+- `chaos_daemon_container_cpu_usage_seconds_total`: Total CPU usage in seconds of
+  the container.
+- `chaos_daemon_container_memory_working_set_bytes`: The amount of working set
+  memory in bytes of the container.
+- `chaos_daemon_container_memory_available_bytes`: The available memory in bytes
+  of the container.
+- `chaos_daemon_container_memory_usage_bytes`: The memory usage in bytes of the
+  container.
+- `chaos_daemon_container_memory_rss_bytes`: The amount of RSS memory in bytes of
+  the container.
+- `chaos_daemon_container_memory_page_faults_total`: Total number of page faults
+  of the container.
+- `chaos_daemon_container_memory_major_page_faults_total`: Total number of major
+  page faults of the container.
+- `chaos_daemon_container_memory_swap_available_bytes`: The available swap in bytes
+  of the container.
+- `chaos_daemon_container_memory_swap_usage_bytes`: The swap usage in bytes of the
+  container.

Statistical metrics are exported with the following labels:

@@ -49,11 +72,17 @@ Statistical metrics are exported with the following labels:
| `pod` | The pod name of the container. |
| `container` | The container name. |

-The export of statistical metrics is not directly related to the StressChaos experiment. Instead, Chaos Daemon exports these metrics for all Kubernetes-managed containers on the node.
+The export of statistical metrics is not directly related to the StressChaos
+experiment. Instead, Chaos Daemon exports these metrics for all Kubernetes-managed
+containers on the node.

#### Relation metrics

-Relation metrics are the metrics that describe the relation between the Chaos experiment and the container. These metrics are exported by Chaos Controller Manager. The proposed metric name is: `chaos_controller_manager_chaos_experiments_container_relation`. It is exported with the following labels:
+Relation metrics are the metrics that describe the relation between the Chaos
+experiment and the container. These metrics are exported by Chaos Controller
+Manager. The proposed metric name is:
+`chaos_controller_manager_chaos_experiments_container_relation`. It is exported
+with the following labels:

| Label Name | Description |
| ----------- | --------------------------------------------- |
@@ -65,13 +94,19 @@ Relation metrics are the metrics that describe the relation between the Chaos ex
| `pod` | The pod name of the selected container. |
| `container` | The container name of the selected container. |

-The relation metrics are exported for all Chaos experiments managed by Chaos Controller Manager if the phase of the experiment is not `Finished` or `Deleting`. This prevents exporting too many metrics by ignoring inactive experiments.
+The relation metrics are exported for all Chaos experiments managed by Chaos
+Controller Manager if the phase of the experiment is not `Finished` or `Deleting`.
+This prevents exporting too many metrics by ignoring inactive experiments.

-For each selected container in the experiment, Chaos Controller Manager exports a relation metric. The value of the metric is always fixed to `1` for the convenience of joining metrics.
+For each selected container in the experiment, Chaos Controller Manager exports
+a relation metric. The value of the metric is always fixed to `1` for the
+convenience of joining metrics.

#### Experiment metrics

-Experiment metrics are the metrics that describe the effects of the StressChaos experiment. These are the final metrics that end users can observe. The proposed metric name prefix is `chaos_mesh:stress_chaos:`. For example:
+Experiment metrics are the metrics that describe the effects of the StressChaos
+experiment. These are the final metrics that end users can observe. The proposed
+metric name prefix is `chaos_mesh:stress_chaos:`. For example:

- Statistical metrics: `chaos_daemon_container_cpu_usage_seconds_total`
- Experiment metrics: `chaos_mesh:stress_chaos:container_cpu_usage_seconds_total`

@@ -88,22 +123,32 @@ They are exported with the following labels:
| `pod` | The pod name of the selected container. |
| `container` | The container name of the selected container. |

-The experiment metrics are exported by joining the statistical metrics and relation metrics. The join is done by Prometheus rules in Helm Charts. Thus, the value of the experiment metrics is the same as the statistical metrics, but with additional labels of the experiment.
+The experiment metrics are exported by joining the statistical metrics and
+relation metrics. The join is done by Prometheus rules in Helm Charts. Thus,
+the value of the experiment metrics is the same as the statistical metrics,
+but with additional labels of the experiment.

### Joining Metrics with PromQL

-The purpose of joining metrics is to add the experiment information labels to the statistical metrics. During the join, we first match the statistical metrics and relation metrics by the `namespace`, `pod`, and `container` labels. Then, we add the experiment information labels to the statistical metrics.
+The purpose of joining metrics is to add the experiment information labels to
+the statistical metrics. During the join, we first match the statistical
+metrics and relation metrics by the `namespace`, `pod`, and `container` labels.
+Then, we add the experiment information labels to the statistical metrics.

-The join is done by PromQL specified in the PrometheusRule. The principle is PromQL's `group_right` modifier and vector matching. This modifier performs a one-to-many join between two metrics. For example, for the statistical metric `chaos_daemon_container_cpu_usage_seconds_total`, we use the following PromQL to join the relation metrics:
+The join is done by PromQL specified in the PrometheusRule. The principle is
+PromQL's `group_right` modifier and vector matching. This modifier performs a
+one-to-many join between two metrics. For example, for the statistical metric
+`chaos_daemon_container_cpu_usage_seconds_total`, we use the following PromQL
+to join the relation metrics:

```promql
chaos_daemon_container_cpu_usage_seconds_total
-    # The multiplication has no effect because the value is always 1.0; it only triggers the join
-    *
-    # Match based on the namespace, pod and container labels
-    on(namespace, pod, container)
+  # The multiplication has no effect because the value is always 1.0; it only triggers the join
+  *
+  # Match based on the namespace, pod and container labels
+  on(namespace, pod, container)
  # Join the relation metrics
-    group_right
+  group_right
chaos_controller_manager_chaos_experiments_container_relation
```

@@ -111,7 +156,11 @@ chaos_controller_manager_chaos_experiments_container_relation

#### Chaos Daemon

-We use CRI (Container Runtime Interface) to obtain the statistical metrics of the container. We will connect to the CRI socket to obtain the metrics. A new CLI parameter `--cri-socket-path` will be added to specify the CRI socket. The default value is derived from `--runtime` and `--runtime-socket-path`. We will export the metrics to the current probe endpoint.
+We use CRI (Container Runtime Interface) to obtain the statistical metrics of
+the container. We will connect to the CRI socket to obtain the metrics. A new
+CLI parameter `--cri-socket-path` will be added to specify the CRI socket. The
+default value is derived from `--runtime` and
+`--runtime-socket-path`. We will export the metrics to the current probe endpoint.

#### Chaos Controller Manager

@@ -123,18 +172,22 @@ We will modify the current Prometheus ConfigMap to add the Prometheus rules.

We will add new values to control the proposed feature:

-| Value Name | Description | Default Value |
-| -------------------------------------- | --------------------------------------------------- | ------------- |
-| `prometheus.experimentMetrics.enabled` | Enable the feature of exporting experiment metrics. | `false` |
-| `chaosDaemon.criSocketPath` | The path of the CRI socket. | - |
+- `prometheus.experimentMetrics.enabled`: Enable the feature of exporting
+  experiment metrics, defaults to `false`.
+- `chaosDaemon.criSocketPath`: The path of the CRI socket. Not set by default.

## Drawbacks

-We need to associate the statistical metrics of the container with the experiment information of StressChaos through Prometheus rules. This makes part of this design dependent on Prometheus. For other metrics processing tools, users need to implement this association themselves.
+We need to associate the statistical metrics of the container with the experiment
+information of StressChaos through Prometheus rules. This makes part of this
+design dependent on Prometheus. For other metrics processing tools, users need
+to implement this association themselves.

-Due to the limitations of PromQL, this design can only export one label combination for a container metric. For example, this design will not be able to export metrics like this:
+Due to the limitations of PromQL, this design can only export one label combination
+for a container metric. For example, this design will not be able to export metrics
+like this:

-```
+```openmetrics
# For this metric, given a container, there are two label combinations:
# 1. type="working_set"
# 2. type="rss"
# Since PromQL cannot do many-to-many label joining, these metrics cannot be exported.
chaos_mesh:stress_chaos:memory_usage_bytes{container="test", type="working_set", ...} 1024
chaos_mesh:stress_chaos:memory_usage_bytes{container="test", type="rss", ...}

Instead, we need to export two separate metrics like this:

-```
+```openmetrics
chaos_mesh:stress_chaos:memory_working_set_usage_bytes{container="test", ...} 1024
chaos_mesh:stress_chaos:memory_rss_usage_bytes{container="test", ...} 1

## Alternatives

### Alternative #1: Export the experiment metrics by Chaos Daemon

-We can leave the Chaos Controller Manager completely unmodified. Instead, we allow the Chaos Daemon to obtain the experiment records of StressChaos and let the Chaos Daemon directly add experiment information labels to the metrics.
+We can leave the Chaos Controller Manager completely unmodified. Instead, we allow
+the Chaos Daemon to obtain the experiment records of StressChaos and let the Chaos
+Daemon directly add experiment information labels to the metrics.

Pros:

@@ -162,12 +217,19 @@ Pros:

Cons:

-- Chaos Daemon is not aware of the experiment details at present. Therefore, Chaos Daemon needs to request the Kubernetes API Server to obtain the details of the experiment, which introduces additional complexity.
-- Any metrics probing on a worker node's Chaos Daemon may result in several requests to the Kubernetes API Server. Although some requests can be cached, this may still cause a performance issue.
+- Chaos Daemon is not aware of the experiment details at present. Therefore, Chaos
+  Daemon needs to request the Kubernetes API Server to obtain the details of the
+  experiment, which introduces additional complexity.
+- Any metrics probing on a worker node's Chaos Daemon may result in several requests
+  to the Kubernetes API Server. Although some requests can be cached, this may
+  still cause a performance issue.

### Alternative #2: Export the experiment metrics by Chaos Controller Manager

-We can export the final metrics in Chaos Controller Manager. Specifically, when obtaining metrics from Chaos Controller Manager, it retrieves statistical metrics from all Chaos Daemons. Then, it adds the experiment information labels to the metrics and exports them as the final experiment metrics.
+We can export the final metrics in Chaos Controller Manager. Specifically, when
+obtaining metrics from Chaos Controller Manager, it retrieves statistical metrics
+from all Chaos Daemons. Then, it adds the experiment information labels to the
+metrics and exports them as the final experiment metrics.

Pros:

-- All problems mentioned in the drawbacks section can be solved.
+- All problems mentioned in the drawbacks section can be solved.

Cons:

-- Each probe of the metrics will result in several requests to all Chaos Daemons, which may cause a performance issue.
-- Chaos Controller Manager needs to implement the logic of parsing, filtering, modifying, and exporting metrics, which may introduce additional complexity.
+- Each probe of the metrics will result in several requests to all Chaos Daemons,
+  which may cause a performance issue.
+- Chaos Controller Manager needs to implement the logic of parsing, filtering,
+  modifying, and exporting metrics, which may introduce additional complexity.