Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve stats api #27963

Merged
merged 30 commits into from
Oct 20, 2021
Merged

Improve stats api #27963

merged 30 commits into from
Oct 20, 2021

Conversation

newly12
Copy link
Contributor

@newly12 newly12 commented Sep 16, 2021

Enhancement

What does this PR do?

make sysinfo.Host() call only once in process.go.

Why is it important?

sysinfo.Host() is called multiple times when beats' /stats API is called for metrics collection. several reasons for this change

  1. it collects host info, but in process.go host info is never used, which is unnecessary waste, and we call it only to get memory usage later.
  2. we have seen issues in our production with this piece of code, where for some reasons syscalls are slow/hung, and metrics collection per minute makes it worse. the profile screenshot attached below. some more info about our deployment, metricbeat(the beat caused issue) is deployed as daemonset running on every k8s node, we have resource limit for it but because of syscall stuck, the process went to D state, and (I think) cgroup limitation is not working(we saw load of the node was high and kept climbing, and almost down to the floor after kill the pod). the issue happened on a lot of nodes in multiple k8s clusters.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

image

Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 16, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 16, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-10-19T01:28:09.607+0000

  • Duration: 160 min 38 sec

  • Commit: a2c536c

Test stats 🧪

Test Results
Failed 0
Passed 53933
Skipped 5344
Total 59277

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

@ruflin ruflin added the Team:Elastic-Agent Label for the Agent team label Sep 21, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 21, 2021
@ruflin ruflin added the Team:Integrations Label for the Integrations team label Sep 21, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@mergify
Copy link
Contributor

mergify bot commented Sep 22, 2021

This pull request does not have a backport label. Could you fix it @newly12? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Sep 22, 2021
Copy link
Contributor

@fearful-symmetry fearful-symmetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable, since I don't think we're using any of the fields that are filled out when we call Host(). However, I don't think we actually need to use init() here. Can we add the Host object to the Stats object instead?

@newly12
Copy link
Contributor Author

newly12 commented Sep 24, 2021

I've removed init() function and move the sysinfo.Host() call to Stats.Init(), the error of sysinfo.Host() call was swallowed to during the init to keep as much the same behaviour as before, and also the error will be logged just once during init.

@newly12
Copy link
Contributor Author

newly12 commented Sep 28, 2021

@fearful-symmetry Any update?

@mergify
Copy link
Contributor

mergify bot commented Oct 4, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b improve_stats_api upstream/improve_stats_api
git merge upstream/master
git push upstream improve_stats_api

Copy link
Contributor

@fearful-symmetry fearful-symmetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change. Sorry for the delay, I was occupied all of last week.

procStats.host, err = sysinfo.Host()
if err != nil {
procStats.logger.Warnf("Getting host details: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I 100% trust the behavior of go-sysinfo here. If we get an error, we may want to just set procStats.host to nil ourselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. updated.

@fearful-symmetry
Copy link
Contributor

LGTM. You'll need to merge to fix the changelog conflict, then we can wait for CI.

@newly12
Copy link
Contributor Author

newly12 commented Oct 7, 2021

resolved conflicts. can you please take another look?

@mergify
Copy link
Contributor

mergify bot commented Oct 7, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b improve_stats_api upstream/improve_stats_api
git merge upstream/master
git push upstream improve_stats_api

Copy link
Contributor

@fearful-symmetry fearful-symmetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fearful-symmetry
Copy link
Contributor

Alright, now to see if we can get CI to play nice...

@newly12
Copy link
Contributor Author

newly12 commented Oct 8, 2021

thanks! looks like CI is not approved. can you help with that?

@mergify
Copy link
Contributor

mergify bot commented Oct 11, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b improve_stats_api upstream/improve_stats_api
git merge upstream/master
git push upstream improve_stats_api

@fearful-symmetry
Copy link
Contributor

So, CI seems to be broken right now. You can try re-merging to fix conflicts, but you might have to do it again in a few hours.

v1v and others added 20 commits October 13, 2021 09:43
clone3 is a linux syscall that is now used by glibc starting version
2.34. It is used when pthread_create() gets called. Current seccomp
filters do not allow this syscall leading to crashes like
runtime/cgo: pthread_create failed: Operation not permitted

See elastic/apm-server#6238 for more details
…zations (elastic#28289)

Previously the osquery.autoload file was overwritten every time on
osquerybeat start and stamped with our extension.
After the change we check the content of the file and do not overwrite it on
each osquerybeat start. This allows the user to deploy their own
extensions if their want and start osquery with that.
* Runner and Fetcher unit tests

* Fix header formatting

* Tweak test
* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies
* Move labels and annotations under kubernetes.namespace.
…e token (elastic#28096)

* Add ability to communicate with Kibana through service token. Add ability to pass service token to container subcommand.

* Add changelog entry.

* Fix go fmt.
* Add username to ASA Security negotiation log

I added the username user.name field to ASA Security negotiation log line.

* adding support for both formats

* adding changelog entry

* updating geo fields in expected output files

* reverse formatting

* reverting to older version of file

* reverting formatting again

* regenrate golden files again

* remove formatting, ready for review

* fixing missing message due to no newline

* fix dissect pattern to fit correctly

Co-authored-by: Marius Iversen <marius.iversen@elastic.co>
* Redis: remove deprecated fields
elastic#28252)

* [Filebeat] - S3 Input - Add support for only iterating/accessing only specific folders or datapaths
…annotations (elastic#28230)

* Breaking change for 8.0, namespace_annotations replaced by namespace.annotations

* Take care of namespace being nil
…elastic#27878)

partial fix for elastic#27648 , this PR:

Detects if the user is running as root then:
Checks to see if an environment variable BEAT_SETUID_AS (set in our Docker.tmpl) is present
Attempts to Setuid , Setgid and Setgroups to that user / groups
Invokes setcap to drop all privileges except NET_RAW+ep
This PR also fixes the broken syscall filtering in heartbeat, some non-syscall strings were breaking that.

With the changes here elastic-agent can still run as root, but the subprocesses can lower their privileges ASAP. This should also make it possible for heartbeat to safely run ICMP pings and synthetics. Synthetics must run as non-root, but ICMP requires NET_RAW. This lets us be consistent in our docs with the recommendation that elastic-agent run as root.
@newly12
Copy link
Contributor Author

newly12 commented Oct 13, 2021

ok.. I've rebased and fixed conflicts. can you please check?

@mergify
Copy link
Contributor

mergify bot commented Oct 13, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b improve_stats_api upstream/improve_stats_api
git merge upstream/master
git push upstream improve_stats_api

@fearful-symmetry
Copy link
Contributor

Alas, it's in conflict again. At least, CI should be working now. This is why I hate changelogs.

@newly12
Copy link
Contributor Author

newly12 commented Oct 14, 2021

@fearful-symmetry it is really annoying, hope it works this time..

@fearful-symmetry
Copy link
Contributor

Yah, CI has been a mess recently, and there's only so much I can do. This time I it seems like there's an issue with libbeat/metric/system/process/process.go , if you run mage fmt in the root libbeat directory, it should fix it.

@fearful-symmetry fearful-symmetry merged commit 48c9d79 into elastic:master Oct 20, 2021
Icedroid pushed a commit to Icedroid/beats that referenced this pull request Nov 1, 2021
* singleton sysinfo host to avoid frequently collecting host info

* add Host object to Stats object

* update changelog

* set procStats.host to nil if any error calling sysinfo.Host()

* Update aws-lambda-go library version to 1.13.3 (elastic#28236)

* [cloud][docker] use the private docker namespace (elastic#28286)

* [7.x] [DOCS] Update api_key example on elasticsearch output (elastic#28288)

* packetbeat/protos/dns: don't render missing A and AAAA addresses from truncated records (elastic#28297)

* seccomp: allow clone3 syscall for x86 (elastic#28117)

clone3 is a linux syscall that is now used by glibc starting version
2.34. It is used when pthread_create() gets called. Current seccomp
filters do not allow this syscall leading to crashes like
runtime/cgo: pthread_create failed: Operation not permitted

See elastic/apm-server#6238 for more details

* Osquerybeat: Improve handling of osquery.autoload file, allow customizations (elastic#28289)

Previously the osquery.autoload file was overwritten every time on
osquerybeat start and stamped with our extension.
After the change we check the content of the file and do not overwrite it on
each osquerybeat start. This allows the user to deploy their own
extensions if their want and start osquery with that.

* Osquerybeat: Runner and Fetcher unit tests (elastic#28290)

* Runner and Fetcher unit tests

* Fix header formatting

* Tweak test

* Update go release version 1.17.1 (elastic#27543)

* format of conditional build tags has changed
* matching of * in regexes was fixed, thus breaking some of our code: golang/go#46123
* iproute package was missing from the new Golang Docker image, thus, we had to add it for our tests
* go.mod file contains separate require directive for transitive dependencies

* Move labels and annotations under kubernetes.namespace. (elastic#27917)

* Move labels and annotations under kubernetes.namespace.

* Remove GCP support from Functionbeat (elastic#28253)

* Fix build tags for Go 1.17 (elastic#28338)

* [Elastic Agent] Add ability to communicate with Kibana through service token (elastic#28096)

* Add ability to communicate with Kibana through service token. Add ability to pass service token to container subcommand.

* Add changelog entry.

* Fix go fmt.

* Add username to ASA Security negotiation log (elastic#26975)

* Add username to ASA Security negotiation log

I added the username user.name field to ASA Security negotiation log line.

* adding support for both formats

* adding changelog entry

* updating geo fields in expected output files

* reverse formatting

* reverting to older version of file

* reverting formatting again

* regenrate golden files again

* remove formatting, ready for review

* fixing missing message due to no newline

* fix dissect pattern to fit correctly

Co-authored-by: Marius Iversen <marius.iversen@elastic.co>

* x-pack/filebeat/module/cisco: loosen time parsing and add group and session type capture (elastic#28325)

* Redis: remove deprecated fields (elastic#28246)

* Redis: remove deprecated fields

* Disable generator tests temporarily (elastic#28362)

* Windows/perfmon metricset -  remove deprecated perfmon.counters configuration (elastic#28282)

* remove deprecated config

* changelog

* [Filebeat] - S3 Input - Add support for only iterating/accessing only… (elastic#28252)

* [Filebeat] - S3 Input - Add support for only iterating/accessing only specific folders or datapaths

* Breaking change for 8.0, namespace_annotations replaced by namespace.annotations (elastic#28230)

* Breaking change for 8.0, namespace_annotations replaced by namespace.annotations

* Take care of namespace being nil

* [Heartbeat] Setuid to regular user / lower capabilities when possible (elastic#27878)

partial fix for elastic#27648 , this PR:

Detects if the user is running as root then:
Checks to see if an environment variable BEAT_SETUID_AS (set in our Docker.tmpl) is present
Attempts to Setuid , Setgid and Setgroups to that user / groups
Invokes setcap to drop all privileges except NET_RAW+ep
This PR also fixes the broken syscall filtering in heartbeat, some non-syscall strings were breaking that.

With the changes here elastic-agent can still run as root, but the subprocesses can lower their privileges ASAP. This should also make it possible for heartbeat to safely run ICMP pings and synthetics. Synthetics must run as non-root, but ICMP requires NET_RAW. This lets us be consistent in our docs with the recommendation that elastic-agent run as root.

* mage fmt

Co-authored-by: kaiyan-sheng <kaiyan.sheng@elastic.co>
Co-authored-by: Victor Martinez <victormartinezrubio@gmail.com>
Co-authored-by: Ugo Sangiorgi <ugo.sangiorgi@elastic.co>
Co-authored-by: Dan Kortschak <90160302+efd6@users.noreply.github.com>
Co-authored-by: Arnaud Lefebvre <a.lefebvre@outlook.fr>
Co-authored-by: Aleksandr Maus <aleksandr.maus@elastic.co>
Co-authored-by: apmmachine <58790750+apmmachine@users.noreply.github.com>
Co-authored-by: Michael Katsoulis <michaelkatsoulis88@gmail.com>
Co-authored-by: Noémi Ványi <kvch@users.noreply.github.com>
Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Co-authored-by: LaZyDK <dennisperto@gmail.com>
Co-authored-by: Marius Iversen <marius.iversen@elastic.co>
Co-authored-by: Andrea Spacca <andrea.spacca@elastic.co>
Co-authored-by: Mariana Dima <mariana@elastic.co>
Co-authored-by: Andrew Cholakian <andrew@andrewvc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify Team:Elastic-Agent Label for the Agent team Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.