Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metricbeat6.0 system module doesn't capture kernel process in Centos7 #5700

Closed
maozi07 opened this issue Nov 24, 2017 · 10 comments · Fixed by #6224
Closed

Metricbeat6.0 system module doesn't capture kernel process in Centos7 #5700

maozi07 opened this issue Nov 24, 2017 · 10 comments · Fixed by #6224
Labels
bug good first issue Indicates a good issue for first-time contributors help wanted Indicates that a maintainer wants help on an issue or pull request Metricbeat Metricbeat

Comments

@maozi07
Copy link

maozi07 commented Nov 24, 2017

  • Version: metricbeat6.0.0
  • Operating System: Centos7 x64
  • Steps to Reproduce:

metricbeat system module doesn't capture the system process there cmdline always like this [xxxx]
and I enabled debug log found nothing useful, there nothing about these process
It's all kernel process like this:

11 root rt 0 0 0 0 S 0.0 0.0 0:30.50 watchdog/0
12 root rt 0 0 0 0 S 0.0 0.0 0:31.35 watchdog/1
13 root rt 0 0 0 0 S 0.0 0.0 0:19.37 migration/1
14 root 20 0 0 0 0 S 0.0 0.0 1:25.61 ksoftirqd/1
...

https://discuss.elastic.co/t/metricbeat6-0-system-module-doesnt-capture-linux-system-process/108654

@kvch kvch added bug Metricbeat Metricbeat labels Nov 24, 2017
@ruflin
Copy link
Contributor

ruflin commented Nov 28, 2017

I just tried this out in a docker centos:7 container. Initially I set up the module manually with only the process metricset and all looked ok. Then I run our default installation and got the following panic in the logs:

2017/11/28 02:45:38.935378 log.go:175: ERR recovered from panic while fetching 'system/process' for host ''. Recovering, but please report this: runtime error: slice bounds out of range.
2017/11/28 02:45:38.935541 log.go:176: ERR Stacktrace: goroutine 78 [running]:
runtime/debug.Stack(0x2a67c9d, 0x2b, 0xc4202711c8)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc420221f40, 0x40)
	/go/src/github.com/elastic/beats/libbeat/logp/log.go:176 +0x12f
panic(0x26f4340, 0x38c9a00)
	/usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/elastic/beats/metricbeat/module/system/process.(*ProcStats).includeTopProcesses(0xc42032a120, 0xc4203eb600, 0x2, 0x2, 0x2, 0xc4203eb600, 0x1)
	/go/src/github.com/elastic/beats/metricbeat/module/system/process/helper.go:414 +0x4ab
github.com/elastic/beats/metricbeat/module/system/process.(*ProcStats).GetProcStats(0xc42032a120, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/elastic/beats/metricbeat/module/system/process/helper.go:390 +0x6c8
github.com/elastic/beats/metricbeat/module/system/process.(*MetricSet).Fetch(0xc42021e000, 0x2, 0x30, 0x2, 0x2, 0xc420284bd0)
	/go/src/github.com/elastic/beats/metricbeat/module/system/process/process.go:82 +0x47
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc42040b000, 0x7fe359527d48, 0xc42021e000, 0x38f7940, 0xc4202b1c80)
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:238 +0x4f
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc42040b000, 0x38f7940, 0xc4202b1c80)
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:222 +0x118
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startPeriodicFetching(0xc42040b000, 0x38f7940, 0xc4202b1c80)
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:199 +0x5d
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).run(0xc42040b000, 0xc4203b8600, 0xc4203b8d20)
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:186 +0x5b8
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc4203459c0, 0xc4203b8600, 0xc4203b8d20, 0xc42040b000)
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:110 +0xbd
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:111 +0x14e

Not sure if this is related.

@maozi07 Can you share the config you are using? Could you also share some of the documents that metricbeat puts out?

@maozi07
Copy link
Author

maozi07 commented Nov 28, 2017

Hi @ruflin docker doesn't have any kernel process. I only collect system metric configure like this:

  • module: system
    period: 10s
    metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    #- core
    - diskio
    #- socket
    processes: ['.*']

What else can I share or do you need?

@ruflin
Copy link
Contributor

ruflin commented Nov 28, 2017

@maozi07 Does metricbeat on your end not collect the processes at all or is just the command line off? If it does collect the events, could you share one of the json documents that you are seeing?

@maozi07
Copy link
Author

maozi07 commented Nov 28, 2017

@ruflin I think it doesn't collect the process at all,because I haven't find any data using cmdline/process name from ES about the process.
And you can't find the cmdline about the kernel process /proc/PID/cmdlin are empty

@andrewkroh
Copy link
Member

As a data point, I am seeing events for these processes using 5.5.0. I believe there is no cmdline field because the value is empty for most of these processes. I did a query for system.process.ppid:0 || system.process.pid:<10 to see what was present.

{
    "@timestamp": "2017-11-28T11:01:17.780Z",
    "beat": {
      "hostname": "xxx",
      "name": "xxx",
      "version": "5.5.0"
    },
    "meta": {
      "cloud": {
        "instance_id": "xxx",
        "provider": "xxx",
        "region": "xxx"
      }
    },
    "metricset": {
      "module": "system",
      "name": "process",
      "rtt": 61001
    },
    "system": {
      "process": {
        "cpu": {
          "start_time": "2015-10-27T11:49:47.000Z",
          "total": {
            "pct": 0
          }
        },
        "fd": {
          "limit": {
            "hard": 4096,
            "soft": 1024
          },
          "open": 0
        },
        "memory": {
          "rss": {
            "bytes": 0,
            "pct": 0
          },
          "share": 0,
          "size": 0
        },
        "name": "ksoftirqd/0",
        "pgid": 0,
        "pid": 4,
        "ppid": 2,
        "state": "sleeping",
        "username": "root"
      }
    },
    "type": "metricsets"
  }

@maozi07
Copy link
Author

maozi07 commented Nov 29, 2017

@andrewkroh I'm using 6.0.0, and I tried to query system.process.pid:<10 , but there is nothing except system.process.pid=1 it's systemd. And I also tried query system.process.name=ksoftirqd/0 and no results

@maozi07
Copy link
Author

maozi07 commented Nov 30, 2017

@ruflin @andrewkroh Hi ,I found some log maybe useful to you when I tracked other problem.

2017-11-30T14:46:41+08:00 DBG Skip process pid=2: error getting process exe for pid=2: readlink /proc/2/exe: no such file or directory
2017-11-30T14:46:41+08:00 DBG Skip process pid=3: error getting process exe for pid=3: readlink /proc/3/exe: no such file or directory
2017-11-30T14:46:41+08:00 DBG Skip process pid=8: error getting process exe for pid=8: readlink /proc/8/exe: no such file or directory
2017-11-30T14:46:41+08:00 DBG Skip process pid=9: error getting process exe for pid=9: readlink /proc/9/exe: no such file or directory
2017-11-30T14:46:41+08:00 DBG Skip process pid=10: error getting process exe for pid=10: readlink /proc/10/exe: no such file or directory

@andrewkroh
Copy link
Member

andrewkroh commented Nov 30, 2017

That's very helpful. Those kernel processes don't have an exe on disk so the /proc/[PID]/exe symlink is broken. So Metricbeat should ignore the ENOENT error and continue processing data from the process.

I think the code here should also check for ... && !os.IsNotExist(err). The problem was introduced when we added the system.process.cwd data (which I think is only in 6.x).

exe := sigar.ProcExe{}
if err := exe.Get(pid); err != nil && !sigar.IsNotImplemented(err) && !os.IsPermission(err) {
return nil, fmt.Errorf("error getting process exe for pid=%d: %v", pid, err)
}

@ronindesign
Copy link

ronindesign commented Dec 31, 2017

I've hit this also.
Metricbeat v6.1.1
Fedora 27 (Linux hostname 4.14.8-300.fc27.x86_64 #1 SMP Wed Dec 20 19:00:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux)

This is for all pid=2..n

2017-12-31T09:56:14-08:00 DBG  [module] Starting metricSetWrapper[module=system, name=process, host=]
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=2: error getting process exe for pid=2: readlink /proc/2/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=4: error getting process exe for pid=4: readlink /proc/4/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=6: error getting process exe for pid=6: readlink /proc/6/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=7: error getting process exe for pid=7: readlink /proc/7/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=8: error getting process exe for pid=8: readlink /proc/8/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=9: error getting process exe for pid=9: readlink /proc/9/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=10: error getting process exe for pid=10: readlink /proc/10/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=11: error getting process exe for pid=11: readlink /proc/11/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=12: error getting process exe for pid=12: readlink /proc/12/exe: no such file or directory
2017-12-31T09:56:14-08:00 DBG  [metricbeat] Skip process pid=13: error getting process exe for pid=13: readlink /proc/13/exe: no such file or directory
....

EDIT: I can confirm commenting out process related stat sets causes no further error messages:
/etc/metricbeat/modules.d/system.yml

- module: system
  period: 15s
  metricsets:
    - cpu
    - load
    - memory
    - network
    #- process
    #- process_summary
    #- core
    #- diskio
    #- socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

- module: system
  period: 15m
  metricsets:
    - uptime

@andrewkroh andrewkroh added help wanted Indicates that a maintainer wants help on an issue or pull request good first issue Indicates a good issue for first-time contributors labels Jan 2, 2018
andrewkroh added a commit to andrewkroh/beats that referenced this issue Jan 30, 2018
On Linux, kernel processes don't have an exe on disk so the `/proc/[PID]/exe` symlink is broken. When Metricbeat tries to read the symlink it gets an error and it skips the whole process. This adds a check for ENOENT and ignores the error which allows kernel processes to be reported by Metricbeat. This is the error that it fixes.

    2018-01-30T20:31:02.512Z	DEBUG	[processes]	process/process.go:443	Skip process pid=12113: error getting process exe for pid=12113: readlink /proc/12113/exe: no such file or directory

Fixes elastic#5700
ruflin pushed a commit that referenced this issue Jan 30, 2018
On Linux, kernel processes don't have an exe on disk so the `/proc/[PID]/exe` symlink is broken. When Metricbeat tries to read the symlink it gets an error and it skips the whole process. This adds a check for ENOENT and ignores the error which allows kernel processes to be reported by Metricbeat. This is the error that it fixes.

    2018-01-30T20:31:02.512Z	DEBUG	[processes]	process/process.go:443	Skip process pid=12113: error getting process exe for pid=12113: readlink /proc/12113/exe: no such file or directory

Fixes #5700
@ronindesign
Copy link

Thank you!

andrewkroh added a commit to andrewkroh/beats that referenced this issue Mar 29, 2018
On Linux, kernel processes don't have an exe on disk so the `/proc/[PID]/exe` symlink is broken. When Metricbeat tries to read the symlink it gets an error and it skips the whole process. This adds a check for ENOENT and ignores the error which allows kernel processes to be reported by Metricbeat. This is the error that it fixes.

    2018-01-30T20:31:02.512Z	DEBUG	[processes]	process/process.go:443	Skip process pid=12113: error getting process exe for pid=12113: readlink /proc/12113/exe: no such file or directory

Fixes elastic#5700

(cherry picked from commit 6bde7d8)
leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
On Linux, kernel processes don't have an exe on disk so the `/proc/[PID]/exe` symlink is broken. When Metricbeat tries to read the symlink it gets an error and it skips the whole process. This adds a check for ENOENT and ignores the error which allows kernel processes to be reported by Metricbeat. This is the error that it fixes.

    2018-01-30T20:31:02.512Z	DEBUG	[processes]	process/process.go:443	Skip process pid=12113: error getting process exe for pid=12113: readlink /proc/12113/exe: no such file or directory

Fixes elastic#5700

(cherry picked from commit ee38051)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug good first issue Indicates a good issue for first-time contributors help wanted Indicates that a maintainer wants help on an issue or pull request Metricbeat Metricbeat
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants