Stop collecting the beat state metricset as part of agent monitoring #4153

Closed
cmacknz opened this issue Jan 26, 2024 · 13 comments · Fixed by #4579 or #4671
Labels: Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

Comments

@cmacknz
Member

cmacknz commented Jan 26, 2024

Our agent monitoring implementation currently uses the beat Metricbeat module to monitor Beat subprocesses. We collect both the stats and state metricsets.

	if isSupportedBeatsBinary(binaryName) {
		beatsStreams = append(beatsStreams, map[string]interface{}{
			idKey: "metrics-monitoring-" + name,
			"data_stream": map[string]interface{}{
				"type":      "metrics",
				"dataset":   fmt.Sprintf("elastic_agent.%s", name),
				"namespace": monitoringNamespace,
			},
			"metricsets": []interface{}{"stats", "state"},

It seems to me that nothing actually uses the data from the state metricset. We don't map the fields in the Elastic Agent integration. I believe we can remove this metricset and stop pointlessly storing this data for every Beat process we start.
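As a rough illustration of the proposed change, the sketch below builds one monitoring stream that requests only the stats metricset. It is not the agent's actual code: the function name `beatMetricsStream`, its parameters, and the literal `"id"` key (standing in for `idKey`) are assumptions made for this example; only the map layout follows the excerpt above.

```go
package main

import "fmt"

// beatMetricsStream builds a monitoring stream definition for one Beat
// subprocess. The function name, its parameters, and the "id" key are
// hypothetical; only the map layout mirrors the excerpt quoted in this issue.
func beatMetricsStream(name, namespace string) map[string]interface{} {
	return map[string]interface{}{
		"id": "metrics-monitoring-" + name,
		"data_stream": map[string]interface{}{
			"type":      "metrics",
			"dataset":   fmt.Sprintf("elastic_agent.%s", name),
			"namespace": namespace,
		},
		// Only "stats" is requested; "state" is no longer collected.
		"metricsets": []interface{}{"stats"},
	}
}

func main() {
	stream := beatMetricsStream("filebeat", "default")
	fmt.Println(stream["metricsets"])
}
```

With a single metricset per stream, each collection interval would produce one document per Beat instead of two.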

We currently store both the state and stats metricsets in the same data stream, and as such include the metricset name as a TSDB dimension, which could probably be removed after this change.

https://github.com/elastic/integrations/blob/a2c55c4cbf752e0490f9fe2d3e68698517c7b74d/packages/elastic_agent/data_stream/elastic_agent_metrics/fields/ecs.yml#L21-L23

- name: metricset.name
  type: keyword
  dimension: true

Acceptance Criteria:

@cmacknz added the Team:Elastic-Agent label Jan 26, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@nimarezainia
Contributor

@pchila thanks for your diligence on this issue. Would it be possible to get a benchmark of the savings we could expect from this change?

cc: @pierrehilbert

@ycombinator
Contributor

Reopening this issue as the second part of the acceptance criteria isn't actually done yet AFAICT:

> The data storage savings after removing this metricset are calculated and included in the release notes

Also related to @nimarezainia's question in the previous comment.

@ycombinator reopened this May 2, 2024
@pchila
Member

pchila commented May 3, 2024

> @pchila thanks for your diligence on this issue. Would it be possible to get a benchmark of the savings we could expect from this change?

@cmacknz did a quick check on the data savings here on the PR #4579 (comment)

I will re-run 2 versions of agent (with and without the change) and check the index size and document count
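One possible way to read the document count and on-disk size of an index is the Elasticsearch index stats API (`GET /<target>/_stats/docs,store`). The sketch below only parses such a response; it is not the script used for the measurements in this thread. The names `indexStats` and `parseIndexStats` and the sample payload are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexStats mirrors only the fields this sketch reads from an index stats
// response; the real response contains many more.
type indexStats struct {
	All struct {
		Primaries struct {
			Docs struct {
				Count int64 `json:"count"`
			} `json:"docs"`
			Store struct {
				SizeInBytes int64 `json:"size_in_bytes"`
			} `json:"store"`
		} `json:"primaries"`
	} `json:"_all"`
}

// parseIndexStats extracts the primary-shard document count and store size
// (in bytes) from a stats response body. Hypothetical helper.
func parseIndexStats(body []byte) (docs, sizeBytes int64, err error) {
	var s indexStats
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, 0, fmt.Errorf("decoding index stats: %w", err)
	}
	return s.All.Primaries.Docs.Count, s.All.Primaries.Store.SizeInBytes, nil
}

func main() {
	// Made-up payload, not data from this issue.
	sample := []byte(`{"_all":{"primaries":{"docs":{"count":120},"store":{"size_in_bytes":870000}}}}`)
	docs, size, err := parseIndexStats(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(docs, size)
}
```

Running the same query against both agent versions after an equal collection window would give the two pairs of numbers to compare.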

@ycombinator
Contributor

> @cmacknz did a quick check on the data savings here on the PR #4579 (comment)
>
> I will re-run 2 versions of agent (with and without the change) and check the index size and document count

Thanks. Could you make a small PR to update

with these savings numbers?

@pchila
Member

pchila commented May 3, 2024

@nimarezainia @ycombinator
Re-measured index size difference between commit 1e88a94 (commit just before the change) and commit 0d31445 (merge commit of the related PR) for a 10 min period after startup.

In both cases I used a policy that included the System Integration and agent logs and metrics collection.
image
image

Here are the sizes of the reindexed documents:
image

The document count for metrics-elastic_agent.filebeat-* and metrics-elastic_agent.metricbeat-disksize.baseline is down by 50% (as expected after removing one of the two metricsets), with on-disk size savings of ~13% for both indices.
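For anyone repeating the comparison, the percentages above are plain relative reductions between a "before" and an "after" measurement. A minimal sketch of the arithmetic follows; `percentSaved` is a hypothetical helper, and the numbers in `main` are placeholders, not the measurements from this issue.

```go
package main

import "fmt"

// percentSaved returns how much smaller "after" is than "before", in percent.
// It returns 0 when "before" is 0 to avoid dividing by zero.
func percentSaved(before, after float64) float64 {
	if before == 0 {
		return 0
	}
	return (before - after) / before * 100
}

func main() {
	// Placeholder values, not the measurements from this issue.
	docsBefore, docsAfter := 240.0, 120.0
	bytesBefore, bytesAfter := 1000000.0, 870000.0

	fmt.Printf("documents: %.0f%% fewer\n", percentSaved(docsBefore, docsAfter))
	fmt.Printf("disk size: %.0f%% smaller\n", percentSaved(bytesBefore, bytesAfter))
}
```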

I am gonna put up a small PR with the changelog patching and link it to this issue

@cmacknz
Member Author

cmacknz commented May 3, 2024

In that same PR, can you add something under the doc directory describing how to reproduce these test results?

@pchila
Member

pchila commented May 3, 2024

@cmacknz
I used a script that is part of PR #4633 for extracting and reindexing logs and metrics but it's not merged yet

@cmacknz
Member Author

cmacknz commented May 3, 2024

Sure, doesn't matter when or how it gets documented then, as long as we have a way to remember what we did if we want to re-evaluate this again later.

@strawgate
Contributor

strawgate commented May 3, 2024

Isn't the number of metrics produced dependent on the number of components running under agent? i.e. something like x documents per beat per interval? So the % savings depends on the number of deployed integrations/managed beats?

@cmacknz
Member Author

cmacknz commented May 3, 2024

That is correct yes, more complex configurations will see greater savings. I assume @pchila likely tested this with the default system integration installed, I will comment on the changelog entry.
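To make the scaling concrete, here is a small sketch under the assumption raised in the question above: one monitoring document per Beat per metricset per collection interval. That model is an assumption, not something confirmed in this thread, and `monitoringDocs` is a hypothetical helper.

```go
package main

import "fmt"

// monitoringDocs estimates how many monitoring documents are written,
// assuming one document per Beat per metricset per collection.
func monitoringDocs(beats, metricsets, collections int) int {
	return beats * metricsets * collections
}

func main() {
	collections := 10 // assumed number of collection intervals in the test window
	for _, beats := range []int{2, 5, 10} {
		before := monitoringDocs(beats, 2, collections) // stats + state
		after := monitoringDocs(beats, 1, collections)  // stats only
		fmt.Printf("beats=%d before=%d after=%d saved=%d\n", beats, before, after, before-after)
	}
}
```

In this model the document count of each affected data stream halves regardless of the policy, while the absolute number of documents saved grows with the number of Beats the agent runs.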

@pchila
Member

pchila commented May 3, 2024

@strawgate @cmacknz I edited my comment to clarify which policy I used for the test. This is why I expressed the savings in %, as the absolute numbers will scale with the number of impacted indices.

@ycombinator added the Team:Elastic-Agent-Control-Plane label May 4, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
