From 82b8cda3adbb4fcd221c65e516c43924e5cb8bd4 Mon Sep 17 00:00:00 2001
From: Abhinav Khanna <45081182+abhi-sumo@users.noreply.github.com>
Date: Tue, 15 Oct 2019 21:50:59 -0700
Subject: [PATCH 1/3] Scaling Fluentd

---
 deploy/docs/monitoring-lag.md | 11 +++++++++++
 1 file changed, 11 insertions(+)
 create mode 100644 deploy/docs/monitoring-lag.md

diff --git a/deploy/docs/monitoring-lag.md b/deploy/docs/monitoring-lag.md
new file mode 100644
index 0000000000..2980d6c121
--- /dev/null
+++ b/deploy/docs/monitoring-lag.md
@@ -0,0 +1,11 @@
+# Monitoring the Monitoring
+
+Once you have Sumo Logic's collection setup installed, you should be primed to have metrics, logs, and events flowing into Sumo Logic. However, as your cluster scales up and down, you might find the need to rescale your fluentd deployment replica count. Here are some tips on how to judge if you're seeing lag in your Sumo Logic collection pipeline.
+
+1. Kubernetes Health Check Dashboard
+
+This dashboard can be found from the `Cluster` level in `Explore`, and is a great way of holistically judging if your collection process is working as expected.
+
+2. Fluentd Queue Length
+
+On the health check dashboard you'll see a panel for fluentd queue length. If you see this length going up over time, chances are that you either have backpressure or you are overwhelming the fluentd pods with requests. If you see any `429` status codes in the fluentd logs, that means you are likely getting throttled and need to contact Sumo Logic to increase your base plan or increase the throttling limit. If you aren't seeing `429` then you likely are in a situation where the incoming traffic into Fluentd is higher than the current replica count can handle. This is a good indication that you should scale up.

From dd79d075bdd8968be2957741e212e6b65e5cf5a3 Mon Sep 17 00:00:00 2001
From: Abhinav Khanna <45081182+abhi-sumo@users.noreply.github.com>
Date: Tue, 15 Oct 2019 21:54:12 -0700
Subject: [PATCH 2/3] Update monitoring-lag.md

---
 deploy/docs/monitoring-lag.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/deploy/docs/monitoring-lag.md b/deploy/docs/monitoring-lag.md
index 2980d6c121..f47e49d113 100644
--- a/deploy/docs/monitoring-lag.md
+++ b/deploy/docs/monitoring-lag.md
@@ -9,3 +9,11 @@ This dashboard can be found from the `Cluster` level in `Explore`, and is a grea
 2. Fluentd Queue Length
 
 On the health check dashboard you'll see a panel for fluentd queue length. If you see this length going up over time, chances are that you either have backpressure or you are overwhelming the fluentd pods with requests. If you see any `429` status codes in the fluentd logs, that means you are likely getting throttled and need to contact Sumo Logic to increase your base plan or increase the throttling limit. If you aren't seeing `429` then you likely are in a situation where the incoming traffic into Fluentd is higher than the current replica count can handle. This is a good indication that you should scale up.
+
+3. Check Prometheus Remote Write Metrics
+
+Prometheus has a few metrics to monitor its remote write performance. You should check that the succeeded count is strictly non-zero and if looking at a cumulative metric, it is going up and to the right. Two other great metrics to check are `remote_storage_pending_samples` and `remote_storage_failed_samples`. Higher failure counts and pending counts are good indicators of queue buildup.
+
+4. Check Prometheus Logs
+
+If all else fails, check the prometheus logs. If there is anything suspicious happening with regards to the fluentd connection you'll see it in the Prometheus logs. Any logs that have `connection reset` or `context cancelled` in them are indicative of requests that were terminated or dropped. Too many of those and your data will start to lag.
From 210e70e8189008810e671016a691a66e6124fece Mon Sep 17 00:00:00 2001
From: Sam Song
Date: Thu, 17 Oct 2019 14:05:37 -0700
Subject: [PATCH 3/3] Update monitoring-lag.md

---
 deploy/docs/monitoring-lag.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/deploy/docs/monitoring-lag.md b/deploy/docs/monitoring-lag.md
index f47e49d113..f01ea680e5 100644
--- a/deploy/docs/monitoring-lag.md
+++ b/deploy/docs/monitoring-lag.md
@@ -1,6 +1,6 @@
 # Monitoring the Monitoring
 
-Once you have Sumo Logic's collection setup installed, you should be primed to have metrics, logs, and events flowing into Sumo Logic. However, as your cluster scales up and down, you might find the need to rescale your fluentd deployment replica count. Here are some tips on how to judge if you're seeing lag in your Sumo Logic collection pipeline.
+Once you have Sumo Logic's collection set up, metrics, logs, and events should be flowing into Sumo Logic. However, as your cluster scales up and down, you may need to rescale your Fluentd deployment's replica count. Here are some tips on how to judge whether you're seeing lag in your Sumo Logic collection pipeline.
 
 1. Kubernetes Health Check Dashboard
 
@@ -8,7 +8,7 @@ This dashboard can be found from the `Cluster` level in `Explore`, and is a grea
 
 2. Fluentd Queue Length
 
-On the health check dashboard you'll see a panel for fluentd queue length. If you see this length going up over time, chances are that you either have backpressure or you are overwhelming the fluentd pods with requests. If you see any `429` status codes in the fluentd logs, that means you are likely getting throttled and need to contact Sumo Logic to increase your base plan or increase the throttling limit. If you aren't seeing `429` then you likely are in a situation where the incoming traffic into Fluentd is higher than the current replica count can handle. This is a good indication that you should scale up.
+On the health check dashboard you'll see a panel for Fluentd queue length. If this length keeps growing over time, chances are you either have backpressure or are overwhelming the Fluentd pods with requests. If you see any `429` status codes in the Fluentd logs, you are likely being throttled and should contact Sumo Logic to increase your base plan or raise the throttling limit. If you aren't seeing `429`s, then incoming traffic into Fluentd is likely higher than the current replica count can handle. This is a good indication that you should scale up.
 
 3. Check Prometheus Remote Write Metrics
 
@@ -16,4 +16,4 @@ Prometheus has a few metrics to monitor its remote write performance. You shoul
 
 4. Check Prometheus Logs
 
-If all else fails, check the prometheus logs. If there is anything suspicious happening with regards to the fluentd connection you'll see it in the Prometheus logs. Any logs that have `connection reset` or `context cancelled` in them are indicative of requests that were terminated or dropped. Too many of those and your data will start to lag.
+If all else fails, check the Prometheus logs. If anything suspicious is happening with the Fluentd connection, you'll see it there. Log lines containing `connection reset` or `context cancelled` indicate requests that were terminated or dropped. Too many of those and your data will start to lag.
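
If you would rather script these checks than eyeball the dashboards, the sketch below queries the Prometheus HTTP API for the signals described in the doc: the Fluentd buffer queue length and the remote-write pending, failed, and succeeded samples. It is a minimal sketch, not part of the official setup: the Prometheus address (assumed reachable on `localhost:9090`, for example via `kubectl port-forward`) and the exact metric names (`fluentd_output_status_buffer_queue_length`, the `prometheus_remote_storage_*` family) vary by Fluentd plugin and Prometheus version, so treat them as assumptions to verify against your own `/metrics` output.

```python
# check_lag.py -- a minimal sketch for spot-checking collection lag.
# Assumptions (adjust for your setup): Prometheus is reachable at PROM_URL,
# e.g. via `kubectl port-forward`, and the metric names below match what your
# Prometheus and Fluentd versions actually expose.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed port-forwarded Prometheus

# PromQL expressions for the signals discussed in the doc. The metric names
# are assumptions and may differ between versions.
QUERIES = {
    "fluentd buffer queue length": "sum(fluentd_output_status_buffer_queue_length)",
    "remote write pending samples": "sum(prometheus_remote_storage_pending_samples)",
    "remote write failed samples/s": "sum(rate(prometheus_remote_storage_failed_samples_total[5m]))",
    "remote write succeeded samples/s": "sum(rate(prometheus_remote_storage_succeeded_samples_total[5m]))",
}


def instant_query(expr: str) -> float:
    """Run an instant query against the Prometheus HTTP API and return the first value."""
    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    results = body.get("data", {}).get("result", [])
    return float(results[0]["value"][1]) if results else 0.0


if __name__ == "__main__":
    for name, expr in QUERIES.items():
        print(f"{name:35} {instant_query(expr):12.2f}")
    # Rough interpretation, matching the guidance in the doc: a queue length or
    # pending-sample count that keeps climbing, or a non-zero failure rate,
    # suggests Fluentd is falling behind (or that you are being throttled).
```

If the numbers point at Fluentd falling behind rather than throttling (`429`s), the usual remedy is to raise the Fluentd replica count, for example through your Helm values or `kubectl scale`; the exact deployment name depends on how the collection was installed.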