[Metricbeat] Monitoring metrics don't work when containerised #6620

Closed
Constantin07 opened this issue Mar 21, 2018 · 21 comments

Comments

@Constantin07

Constantin07 commented Mar 21, 2018

Hello,

I'm still getting these errors in the metricbeat logs with version 6.2.3 when deployed as a docker container:

2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

but there is no clue what the problem could be. I'm running the official metricbeat docker image and trying to pull stats from the host.

The metricbeat container is run using this command:

docker run -d --restart=always --name metricbeat \
--net=host \
-u root \
-v /proc:/hostfs/proc:ro \
-v /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro \
-v /:/hostfs:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
metricbeat:6.2.3 -system.hostfs=/hostfs

cat system.yml

- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    #- core
    - diskio
    #- socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

• Metricbeat version: 6.2.3
• Operating System: CoreOS Linux 1465.8.0
• Docker Client 1.10.3 (API 1.22), Docker Server 1.12.6 (API 1.24)

Any idea what could be the cause?

@andrewkroh
Member

My initial guess is that it cannot find its own process in /hostfs/proc because it is PID namespaced: inside the container it looks for something like PID 2, while on the host its PID is very different.

This problem stems from the fact that /hostfs is treated as a global variable, which causes all metric-collecting code to read from /hostfs/proc. But the self-monitoring metrics should always come from /proc.

@kvch Can you try to reproduce this? I think we need a test case for this, and we can discuss some possible solutions.
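
A hypothetical illustration of that guess (not actual Beats code; the /hostfs path comes from the docker run command above): inside the container the Beat only knows its namespaced PID, and that number either does not exist under the host's /proc or belongs to an entirely different host process.

package main

import (
	"fmt"
	"os"
)

func main() {
	pid := os.Getpid() // e.g. 1 or 2 inside the container's PID namespace
	// Look that PID up under the host's proc mounted at /hostfs/proc.
	_, err := os.Stat(fmt.Sprintf("/hostfs/proc/%d", pid))
	fmt.Println("own namespaced PID:", pid, "lookup under /hostfs/proc:", err)
	// With PID namespacing this either fails ("no such file or directory")
	// or silently resolves to a different host process that happens to have
	// the same number.
}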

@andrewkroh
Member

@Constantin07 Just so we are on the same page, I should clarify that those log messages indicate a problem with the self-monitoring feature that was added in 6.2. It lets the Beat report its own CPU/memory/load information in the log (with a message of "Non-zero metrics in the last 30s") and to X-Pack Monitoring if configured.

Your regular metrics from the system module should not be affected.

@kvch
Contributor

kvch commented Mar 21, 2018

@andrewkroh I can reproduce the problem.
We definitely need a new test case for this. I think the problem should be handled in gosigar. However, I am not yet sure how it should be done.

@ewgRa
Contributor

ewgRa commented Mar 22, 2018

How about the fix in #6641?

Running metricbeat with the system.hostfs argument is enough to reproduce the problem; it is not related to Docker itself.

@jsoriano
Member

jsoriano commented Mar 23, 2018

I think a general solution without modifying gosigar could be to obtain the pid of the process from <sigar.Procd>/self/status (the Pid field) instead of using os.Getpid(). This would work both with and without namespaces.
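
To make that suggestion concrete, here is a minimal sketch of the idea, assuming the /hostfs mount from the docker run command above (this is not the actual gosigar/Beats change, and the selfPid helper name is made up):

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// selfPid returns the calling process's PID as seen by the procfs mounted at
// procd, by parsing the "Pid:" field of <procd>/self/status.
func selfPid(procd string) (int, error) {
	f, err := os.Open(filepath.Join(procd, "self", "status"))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Pid:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Pid:")))
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("no Pid field in %s/self/status", procd)
}

func main() {
	// With -system.hostfs=/hostfs this reads the host's proc and therefore
	// yields the PID in the host namespace; with plain /proc it behaves
	// like os.Getpid().
	pid, err := selfPid("/hostfs/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("PID in the procfs mount's namespace:", pid)
}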

@Constantin07
Author

Constantin07 commented Mar 23, 2018

@andrewkroh yes, you are right. I do see the regular metrics in Elasticsearch, but the error log message appears misleading to me (it's kind of "it works, but not completely").

@ewgRa
Contributor

ewgRa commented Mar 23, 2018

@jsoriano can you give more details about the general solution?

In the hostfs proc dir we have the host's processes; /hostfs/proc/self/status will not be the metricbeat process.

I will try to check it myself later, but for now I don't understand how it will work.

@jsoriano
Member

jsoriano commented Mar 23, 2018

@ewgRa the host proc dir contains all processes running in any PID namespace on the machine, including the namespace in which the metricbeat process runs. The special file self always refers to the calling process, with PIDs expressed in the namespace of the procfs mount, so if metricbeat reads <sigar.Procd>/self/status (with the host /proc mounted at sigar.Procd) it will see its own status in the host namespace, which includes its PID in that namespace; that PID could then be used in calls to gosigar without changing sigar.Procd.

I say it could be a general solution because it'd also work when no namespacing is used: sigar.Procd would be /proc and this would contain the process as usual.

The behaviour of self in different namespaces is documented in the pid_namespaces man page:

       Calling readlink(2) on the path /proc/self yields the process ID of
       the caller in the PID namespace of the procfs mount (i.e., the PID
       namespace of the process that mounted the procfs).  This can be
       useful for introspection purposes, when a process wants to discover
       its PID in other namespaces.
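
A tiny sketch of that readlink behaviour (not Beats code; it assumes the /hostfs mount from the docker run command earlier in the thread):

package main

import (
	"fmt"
	"os"
)

func main() {
	ownNS, err1 := os.Readlink("/proc/self")         // our PID in the container's namespace, e.g. "1"
	hostNS, err2 := os.Readlink("/hostfs/proc/self") // our PID in the host's namespace, e.g. "23481"
	fmt.Println(ownNS, err1, hostNS, err2)
}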

@ewgRa
Contributor

ewgRa commented Mar 24, 2018

@jsoriano thanks for the brilliant idea. I made the changes and it works, close to magic level :) Can you review it again? The failed CI looks unrelated to my changes.

I see only two problems/limitations with this solution:

  1. It is hard to write a test for this case; I've added a simple test, but it doesn't actually exercise the real case.

  2. If the -system.hostfs flag is wrong (for example, the directory does not exist), this approach will fail to get metrics.

But I think these are acceptable edge cases.

jsoriano added a commit that referenced this issue Mar 27, 2018
@jsoriano
Member

Fixed by #6641

@Constantin07
Author

Thanks @ewgRa @jsoriano

@Camillevau

Thanks

@andrzej-majewski

Is there a workaround for this?

I am running 6.2.4 and the issue still persists.

@grantcurell

grantcurell commented Oct 30, 2018

+1, still seeing:

2018-10-30T18:47:37.898Z ERROR instance/metrics.go:92 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=1
2018-10-30T18:47:37.898Z ERROR instance/metrics.go:61 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=1

(screenshot)

I added hostPid: true in accordance with: #6734

(screenshot)

Edit: I just realized that's not where it goes. Moved it in accordance with https://kubernetes.io/docs/concepts/policy/pod-security-policy/. Still no dice. We see the same errors in the logs as before.
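
For reference, hostPID (note the capitalisation) is a pod-level field in the Kubernetes pod spec, so in a Deployment or DaemonSet manifest it sits under spec.template.spec next to the containers list, not inside the container entry. A minimal sketch, with illustrative names and image tag (not taken from the attached manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricbeat                 # illustrative name
spec:
  selector:
    matchLabels:
      app: metricbeat
  template:
    metadata:
      labels:
        app: metricbeat
    spec:
      hostPID: true                # pod-level setting, alongside the containers list
      containers:
        - name: metricbeat
          image: docker.elastic.co/beats/metricbeat:6.3.0   # whichever version you deploy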

@jsoriano
Member

@grantcurell what version of metricbeat are you using? Could you also share the configuration you are using to start it?
And what logs do you see with hostPid: true? They shouldn't be the same, as metricbeat won't have pid 1.

@grantcurell

Metricbeat is 6.2.4. The configuration is attached: metricbeat-deploy.yml.txt

Update: It took me a second to get hostPid: true in the right place, but the logs are clean now and I'm no longer receiving the error. What is strange is that the dashboard behaves very erratically and the data is incorrect. For example: I'm doing a controlled test where I'm pumping 5 Gb/s into a security sensor I have, and I'm confirming with traditional monitoring tools that the sensor is receiving the expected 5 Gb/s (4.65 to be exact, with the loss from overhead), but Metricbeat's reading for inbound traffic jumps around all over the place, anywhere from 17 MB/s to 120 MB/s, and it changes on each 5-second interval I have the dashboard set to.

The other problem I have can be seen below. If I set the time period to anything less than 30 minutes, the entire top part of the dashboard zeros out, but the accompanying data continues to display appropriately, including network speed, which you can see is appropriately sitting at 600 MB/s (4800 Mb/s).

Additional info: This is running on Kubernetes 1.9.7.

(screenshot)

@grantcurell

grantcurell commented Oct 31, 2018

You can see below that when I change the time to 30 minutes it displays, though the disk IO is still missing.

Edit: And the information is still inaccurate in general.

(screenshot)

@jsoriano
Member

@grantcurell thanks for all the details.

6.2.4 didn't yet include the fix for this; you need 6.3.0 or later. In any case, the hostPid: true workaround should work.

Regarding the other problems, it'd be great if you could confirm them with a more modern metricbeat version and open specific issues.

@grantcurell

Can I update metricbeat independently of Elasticsearch in this case?

@jsoriano
Member

jsoriano commented Nov 1, 2018

Can I update metricbeat independently of Elasticsearch in this case?

If you are using Elasticsearch 6.X this should be fine, check the product compatibility matrix.

@grantcurell

grantcurell commented Nov 1, 2018

@jsoriano upgrading to Metricbeat 6.4.2 didn't fix the problem. I still get a bunch of strange partial data if the time interval is anything less than 30 minutes, e.g. the Kubernetes dashboard:

(screenshot)

or

(screenshot)

But move it to 30 minutes and you get:

(screenshot)
