Merge pull request #235 from abhi-sumo/patch-9
Scaling Fluentd
samjsong authored Oct 17, 2019
2 parents a22ceb4 + 210e70e commit 85672f4
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions deploy/docs/monitoring-lag.md
@@ -0,0 +1,19 @@
# Monitoring the Monitoring

Once you have Sumo Logic's collection setup installed, you should be primed to have metrics, logs, and events flowing into Sumo Logic. However, as your cluster scales up and down, you might find the need to rescale your Fluentd deployment replica count. Here are some tips on how to judge if you're seeing lag in your Sumo Logic collection pipeline.

1. Kubernetes Health Check Dashboard

This dashboard can be found from the `Cluster` level in `Explore`, and is a great way of holistically judging if your collection process is working as expected.

2. Fluentd Queue Length

On the health check dashboard you'll see a panel for Fluentd queue length. If this length keeps growing over time, you are most likely either experiencing backpressure or overwhelming the Fluentd pods with requests. Any `429` status codes in the Fluentd logs mean you are probably being throttled, and you need to contact Sumo Logic to upgrade your base plan or raise the throttling limit. If you aren't seeing `429`, the traffic coming into Fluentd is likely higher than the current replica count can handle, which is a good indication that you should scale up.
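The triage described above can be sketched in Python. This is a minimal illustration, not part of the Sumo Logic tooling: the function names, the input shapes (a list of queue-length samples and a list of log lines), and the regular expression used to spot a `429` are all assumptions, and real Fluentd log formats vary.

```python
import re

# Hypothetical pattern: a standalone "429" token in a log line. Adjust it to
# match the messages your Fluentd output plugin actually emits.
THROTTLE_PATTERN = re.compile(r"\b429\b")


def queue_is_growing(samples):
    """Deliberately simple trend test: the queue length never shrinks between
    consecutive samples and ends higher than it started."""
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] > samples[0]


def diagnose_fluentd(queue_samples, log_lines):
    """Map queue-length samples and Fluentd log lines to a suggested action."""
    if not queue_is_growing(queue_samples):
        return "ok"
    if any(THROTTLE_PATTERN.search(line) for line in log_lines):
        return "throttled"  # contact Sumo Logic about the plan or limit
    return "scale_up"  # increase the Fluentd replica count
```

A noisy queue would need a more forgiving trend test (for example, comparing averages over two windows); the strict version here just keeps the decision order easy to read.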

3. Check Prometheus Remote Write Metrics

Prometheus exposes a few metrics for monitoring its remote write performance. Check that the succeeded count is greater than zero and, if you are looking at a cumulative metric, that it keeps increasing over time. Two other useful metrics to check are `remote_storage_pending_samples` and `remote_storage_failed_samples`. Rising failure and pending counts are good indicators of queue buildup.
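As a rough sketch of these checks, the Python below takes three series of samples that you have already pulled from Prometheus by whatever means you prefer. The function names and warning strings are hypothetical, and the sketch ignores counter resets (for example, after a Prometheus restart).

```python
def counter_is_advancing(samples):
    """True if a cumulative counter never decreases, ends above zero,
    and ends above its first sample."""
    if len(samples) < 2:
        return False
    monotonic = all(b >= a for a, b in zip(samples, samples[1:]))
    return monotonic and samples[-1] > 0 and samples[-1] > samples[0]


def remote_write_warnings(succeeded, pending, failed):
    """Return human-readable warnings for the three remote write series."""
    warnings = []
    if not counter_is_advancing(succeeded):
        warnings.append("succeeded count is not increasing")
    if len(pending) >= 2 and pending[-1] > pending[0]:
        warnings.append("pending samples are growing")
    if len(failed) >= 2 and failed[-1] > failed[0]:
        warnings.append("failed samples are growing")
    return warnings
```

An empty list means none of the three symptoms showed up in the window you sampled.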

4. Check Prometheus Logs

If all else fails, check the Prometheus logs. Anything suspicious about the connection to Fluentd will show up there. Log lines containing `connection reset` or `context canceled` (Go's error message spells it with a single `l`) indicate requests that were terminated or dropped. Too many of those and your data will start to lag.
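One way to quantify "too many" is to count the matching lines. The Python below is a minimal sketch that assumes you already have the Prometheus log lines as strings; the function name is hypothetical, and both spellings of "canceled" are matched in case your logs differ.

```python
from collections import Counter

# Substrings that suggest a remote write request was terminated or dropped.
DROP_MARKERS = ("connection reset", "context canceled", "context cancelled")


def count_dropped_requests(log_lines):
    """Count log lines containing each marker (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for marker in DROP_MARKERS:
            if marker in lowered:
                counts[marker] += 1
    return counts
```

Comparing these counts across time windows shows whether dropped requests are an occasional blip or a growing trend.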
