Implement alloy-service resource to configure Alloy as a monitoring agent #66

Closed · wants to merge 59 commits

Conversation

@TheoBrigitte (Member) commented Aug 9, 2024

Towards: giantswarm/roadmap#3522

This PR implements the logic to configure Alloy as a monitoring agent instead of the Prometheus agent.

It does the following:

  • add a --monitoring-agent flag to choose between Prometheus agent and Alloy
  • add logic to select the monitoring agent and check for Alloy support in the observability bundle version (a hedged sketch follows this list)
  • add the Alloy service resource, which creates a configmap for the Alloy app and a secret mounted directly by the pods to inject values via environment variables (also sketched below)
  • add the Alloy config files as templates
    • Alloy app Helm values: network policy, clustering enabled, autoscaling enabled, and affinity settings similar to the Prometheus agent
    • Alloy config: podmonitor and servicemonitor discovery, remote-write settings
  • add common labels used by both the configmap and the secret
  • disable the golangci-lint lll linter (which keeps complaining about long lines)
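
For illustration, a minimal sketch of what the agent selection and bundle-version check could look like. This is not the code from this PR; the constant names, flag wiring, and the 1.6.0 version threshold are assumptions for the example only.

```go
package monitoring

import (
	"flag"
	"fmt"

	"github.com/Masterminds/semver/v3"
)

const (
	MonitoringAgentPrometheus = "prometheus-agent"
	MonitoringAgentAlloy      = "alloy"
)

// Minimum observability bundle version assumed (for this sketch) to ship Alloy as a metrics agent.
var minAlloyBundleVersion = semver.MustParse("1.6.0")

// Hypothetical flag mirroring the --monitoring-agent flag described above.
var monitoringAgentFlag = flag.String("monitoring-agent", MonitoringAgentPrometheus,
	"monitoring agent to use: prometheus-agent or alloy")

// SelectMonitoringAgent picks the agent to configure for a cluster and errors out
// when Alloy is requested but the observability bundle is too old to support it.
func SelectMonitoringAgent(bundleVersion *semver.Version) (string, error) {
	switch *monitoringAgentFlag {
	case MonitoringAgentPrometheus:
		return MonitoringAgentPrometheus, nil
	case MonitoringAgentAlloy:
		if bundleVersion.LessThan(minAlloyBundleVersion) {
			return "", fmt.Errorf("observability bundle %s does not support Alloy as monitoring agent", bundleVersion)
		}
		return MonitoringAgentAlloy, nil
	default:
		return "", fmt.Errorf("unsupported monitoring agent %q", *monitoringAgentFlag)
	}
}
```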

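And a rough sketch of the Alloy service resource rendering the configmap and the secret with shared labels. All object names, label keys and values, and the REMOTE_WRITE_URL key are illustrative assumptions, not the PR's actual identifiers.

```go
package alloy

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// commonLabels are shared by the configmap and the secret, as described above.
func commonLabels(cluster string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/name":       "alloy",
		"app.kubernetes.io/managed-by": "observability-operator",
		"giantswarm.io/cluster":        cluster,
	}
}

// desiredConfigMap holds the rendered Helm values for the Alloy app.
func desiredConfigMap(cluster, values string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "alloy-metrics-values",
			Namespace: cluster,
			Labels:    commonLabels(cluster),
		},
		Data: map[string]string{"values": values},
	}
}

// desiredSecret holds values that the Alloy pods consume directly and expose as
// environment variables (for example the remote-write URL).
func desiredSecret(cluster, remoteWriteURL string) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "alloy-metrics-env",
			Namespace: cluster,
			Labels:    commonLabels(cluster),
		},
		StringData: map[string]string{"REMOTE_WRITE_URL": remoteWriteURL},
	}
}
```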
@TheoBrigitte TheoBrigitte self-assigned this Aug 9, 2024
Base automatically changed from monitoring-common to main August 13, 2024 08:11
@TheoBrigitte (Member, Author)

Clustering works ✔️

[screenshot]

@TheoBrigitte (Member, Author) commented Aug 19, 2024

I experimented with the Alloy horizontal pod autoscaler built into the Helm chart. The autoscaler is based on memory usage and, by default, scales up whenever 80% memory utilization is reached.
Autoscaling based on memory does make sense, as there is a relation between the number of time series and memory usage: https://grafana.com/docs/alloy/latest/introduction/estimate-resource-usage/
On our side, on the biggest installations, we see the Prometheus agent using up to 10 shards and ~10GiB of memory per shard. When using autoscaling, a memory request must be set, but a 10GiB request would not work on installations with smaller nodes.
An idea would be to use a 3GiB memory request with a 300% HPA memory utilization target, which would make the HPA scale Alloy up whenever it reaches 9GiB of memory usage, but this still leaves installations with smaller nodes unresolved, as pods would never be able to reach that much memory usage.
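
As a quick sanity check on that arithmetic (the 3GiB request and 300% target are the hypothetical values above, not settings from this PR):

```go
package main

import "fmt"

// scaleUpThresholdGiB returns the absolute memory usage at which an HPA with
// the given memory request and utilization target would start scaling up.
func scaleUpThresholdGiB(requestGiB, targetUtilizationPercent float64) float64 {
	return requestGiB * targetUtilizationPercent / 100
}

func main() {
	// A 3GiB request with a 300% utilization target means scale-up around 9GiB of usage.
	fmt.Printf("scale-up threshold: %.0f GiB\n", scaleUpThresholdGiB(3, 300))
}
```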

Therefore I went with our current custom implementation of autoscaling based on the number of metrics.

@TheoBrigitte (Member, Author)

Tested and working on both management and workload clusters.

[screenshot]
