
Alloy Test and Investigation for Metrics #3522

Closed · 1 of 4 tasks · Tracked by #3520
Rotfuks opened this issue Jun 24, 2024 · 13 comments

@Rotfuks (Contributor) commented Jun 24, 2024

Motivation

We want to unify all of our agents to use the new OpenTelemetry-based agent from Grafana Labs: Alloy. To do this, we first need to test whether Alloy can deliver exactly the same capabilities as the Prometheus/Grafana agents when collecting metrics.

Todo

  • Deploy Alloy on one test installation (like gazelle)
    • Create a feature flag with which we can select which agents are active; for this we still need to deploy both (a hypothetical sketch follows this list)
  • Check that Mimir still receives the same metrics we expect
  • Compare the resource consumption of Alloy to the previous agents: is it less, more, or the same?
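
As a rough sketch of what such a feature flag could look like in Helm values; every key name here is a hypothetical illustration, not taken from our actual charts:

```yaml
# Hypothetical values override for the side-by-side test phase.
# All keys below are illustrative assumptions, not the real chart schema.
monitoring:
  agents:
    prometheusAgent:
      enabled: true   # keep the existing agent running during the comparison
    alloyMetrics:
      enabled: true   # deploy Alloy in parallel so both can be evaluated
```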

Outcome

  • We gained experience with Alloy and are confident enough to roll it out as the metrics-collecting agent everywhere
@TheoBrigitte (Member) commented Aug 8, 2024

I managed to use Alloy to send metrics to Mimir.

Settings kept:

  • affinity for karpenter and alloy itself
  • external labels
  • service and pod monitor selectors
  • remote write
  • scrape interval
  • priority class

Settings abandoned:

Decision made:

Here is the values file I used to deploy Alloy as metrics ingester: values.yaml.gz

helm install alloy giantswarm-test/alloy --version 0.3.1-c58378e71cbb5e9da677957500cf43b951d870a1 --values values.yaml
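
For context, a minimal sketch of what the kept settings could look like in such a values file, assuming the upstream Alloy chart layout (`controller.*`, `alloy.configMap.content`); the key names, label values, and the Mimir URL are assumptions, not the contents of the actual values.yaml.gz:

```yaml
# Minimal sketch of an Alloy values file covering the settings kept above.
# Chart keys, labels, and the remote-write URL are assumptions.
controller:
  priorityClassName: some-priority-class   # placeholder priority class name
  affinity: {}                             # affinity kept from the old setup
alloy:
  configMap:
    content: |
      // Discover scrape targets via the kept ServiceMonitor selectors.
      prometheus.operator.servicemonitors "default" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "60s"   // kept scrape interval (example value)
        }
      }

      // Remote write to Mimir, with the kept external labels.
      prometheus.remote_write "mimir" {
        external_labels = {
          installation = "golem",   // hypothetical external label
        }
        endpoint {
          url = "https://mimir.example.test/api/v1/push"   // placeholder URL
        }
      }
```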

The amount of metrics sent to Mimir stays the same as with Prometheus agent:

  • left side is Prometheus Agent sending metrics
  • middle is just me experimenting with Alloy; the spike is from when I ran 2 Alloy replicas without clustering, which resulted in nearly 2x the req./s
  • right side is Alloy sending metrics

[Screenshot: Mimir incoming requests per second, with Prometheus Agent on the left and Alloy on the right]

@TheoBrigitte (Member) commented Aug 15, 2024

Here are the results of running Prometheus agent and Alloy as the metrics agent. All tests were run on the same installation (golem), and each test ran for 1h.

I used 4 different test cases:

  • 1 Prometheus agent replica/shard
  • 1 Alloy replica
  • 2 Prometheus agent replicas/shards
  • 2 Alloy replicas

Agents

| Agent            | Replicas | CPU (cores) | Memory  |
| ---------------- | -------- | ----------- | ------- |
| Alloy            | 1        | < 0.1       | < 3 GiB |
| Prometheus agent | 1        | > 0.1       | > 4 GiB |
| Alloy            | 2        | < 0.05      | < 3 GiB |
| Prometheus agent | 2        | < 0.1       | < 3 GiB |

Mimir

The amount of metrics, network, and resource load on Mimir stayed approximately the same across all tests. Some Mimir ingesters restarted, which had some impact on the values shown in the graphs, but the values are mostly within the same range.

Summary

These tests showed that Alloy tends to consume about the same amount of resources as Prometheus agent, or less, and that Mimir load stayed the same across all tests.

Here are the results as Grafana dashboard screenshots: prometheus-agent_vs_alloy.tar.gz

@TheoBrigitte (Member) commented

Most of the work and testing was done in giantswarm/observability-operator#66

I decided to go with our current custom autoscaling solution, as anything else would differ too much from what we have currently, and it is also more complex to find a fit for every installation size.

  • alloy-app v0.4.0 was released with support for secret values
  • observability-bundle v1.6.0 was released with the new alloy-metrics app as v0.4.0
  • observability-operator v0.4.0 was released with support for Alloy as monitoring agent

Deployment to an installation is currently blocked as this feature is only supported on CAPI installations and we need a new release to get the new observability-bundle out.

@TheoBrigitte (Member) commented

v29.1.0 is on its way; once it is released to a CAPA installation we can proceed with our live testing of Alloy as monitoring agent. We would then only need to toggle the monitoring agent flag for the observability-operator (example: https://github.com/giantswarm/giantswarm-configs/pull/135/files), roughly as sketched below.
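
For illustration only, toggling that flag in config could look something like this; the key path is a made-up example, and the linked PR shows the real change:

```yaml
# Hypothetical config override switching the monitoring agent to Alloy.
# The key path is an illustrative assumption, not the real schema.
observability-operator:
  monitoring:
    agent: alloy   # e.g. switch from "prometheus-agent" to "alloy"
```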

@QuentinBisson commented

As an FYI, the release was merged :)

@TheoBrigitte (Member) commented Aug 29, 2024

Now we need to have it deployed to the MCs: https://github.com/giantswarm/giantswarm-management-clusters/pull/749

@QuentinBisson commented

We can try it on a WC, right?

@QuentinBisson commented

Oh wait, no we cannot, because of this: https://github.com/giantswarm/observability-operator/blob/09ddfe046e6a81cc6b874ac537941be9a495bc18/internal/controller/cluster_monitoring_controller.go#L181

Maybe the services should be created on each reconciliation then, so the agent is always injected? Or passed as a function parameter.

@QuentinBisson commented

Yes, we can test it out on the gazelle/cicddev cluster as it is running 29.1.0 :)

@TheoBrigitte (Member) commented

There were actually a few issues preventing this from being rolled out:

  • incorrect catalog name in observability-bundle for alloyMetrics
  • invalid Alloy configuration due to a missing comma in the external labels map (illustrated at the end of this comment)
  • broken release pipeline in observability-operator

Those are all fixed now, but we need to wait for an upgrade of observability-bundle to v1.6.2, most likely in CAPA v30.0.0 > giantswarm/releases#1357 (review)
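
To illustrate the external labels issue: in Alloy's configuration syntax, entries of a map are comma-separated, so a single missing comma makes the whole config invalid. A made-up excerpt (label names invented, only the comma placement is the point):

```yaml
# Hypothetical excerpt of the generated Alloy config embedded in Helm values.
# Label names are invented; the fix was the comma between map entries.
alloy:
  configMap:
    content: |
      external_labels = {
        cluster      = "golem",   // the comma separating entries was missing here,
        installation = "golem",   // which made the configuration invalid
      }
```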

@TheoBrigitte (Member) commented

This is running on golem now and would be available from:

Reminder: make sure we make an announcement to customers before releasing alloy-metrics.

@QuentinBisson commented

@TheoBrigitte as this is an investigation story and not the rollout, should this be put in tracking or closed?

@Rotfuks (Contributor, Author) commented Sep 30, 2024

Done on our side for now.

@Rotfuks closed this as completed on Sep 30, 2024