[Metricbeat] Monitoring metrics don't work when containerised #6620

Closed
Constantin07 opened this issue Mar 21, 2018 · 21 comments

Comments

@Constantin07

Constantin07 commented Mar 21, 2018

Hello,

I'm still getting these errors in the metricbeat logs with version 6.2.3 when deployed as a docker container:

2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

but there is no clue what the problem could be. I'm running the official metricbeat docker image and trying to pull stats from the host.

The metricbeat container is run using this command:

docker run -d --restart=always --name metricbeat \
--net=host \
-u root \
-v /proc:/hostfs/proc:ro \
-v /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro \
-v /:/hostfs:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
metricbeat:6.2.3 -system.hostfs=/hostfs

cat system.yml

- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    #- core
    - diskio
    #- socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

• Metricbeat version: 6.2.3
• Operating System: CoreOS Linux 1465.8.0
• Docker Client 1.10.3 (API 1.22), Docker Server 1.12.6 (API 1.24)

Any idea what could be the cause?

@andrewkroh
Member

My initial guess is that it cannot find its own process in /hostfs/proc because it is PID namespaced: inside the container it looks for something like PID 2, while on the host its PID is very different.

This problem stems from the fact that /hostfs is treated as a global variable, which causes all metric-collecting code to read from /hostfs/proc. But the self-monitoring metrics should always come from /proc.

@kvch Can you try to reproduce this? I think we need a test case for this, and we can discuss some possible solutions.
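
A hypothetical illustration of that guess (not actual Beats code; the /hostfs path comes from the docker run command above): inside the container the Beat only knows its namespaced PID, and that number either does not exist under the host's /proc or belongs to an entirely different host process.

package main

import (
	"fmt"
	"os"
)

func main() {
	pid := os.Getpid() // e.g. 1 or 2 inside the container's PID namespace
	// Look that PID up under the host's proc mounted at /hostfs/proc.
	_, err := os.Stat(fmt.Sprintf("/hostfs/proc/%d", pid))
	fmt.Println("own namespaced PID:", pid, "lookup under /hostfs/proc:", err)
	// With PID namespacing this either fails ("no such file or directory")
	// or silently resolves to a different host process that happens to have
	// the same number.
}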

@andrewkroh
Member

@Constantin07 Just so we are on the same page, I should clarify that those log messages indicate a problem with the self-monitoring feature that was added in 6.2. It lets the Beat report its own CPU/memory/load information in the log (with a message of "Non-zero metrics in the last 30s") and to X-Pack Monitoring if configured.

Your regular metrics from the system module should not be affected.

@kvch
Contributor

kvch commented Mar 21, 2018

@andrewkroh I can reproduce the problem.
We definitely need a new test case for this. I think the problem should be handled in gosigar. However, I am not yet sure how it should be done.

@ewgRa
Contributor

ewgRa commented Mar 22, 2018

How about the fix in #6641?

Running metricbeat with the system.hostfs argument is enough to reproduce the problem; it is not related to Docker itself.

@jsoriano
Member

jsoriano commented Mar 23, 2018

I think a general solution without modifying gosigar could be to obtain the pid of the process from <sigar.Procd>/self/status (the Pid field) instead of using os.Getpid(). This would work both with and without namespaces.
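
To make that suggestion concrete, here is a minimal sketch of the idea, assuming the /hostfs mount from the docker run command above (this is not the actual gosigar/Beats change, and the selfPid helper name is made up):

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// selfPid returns the calling process's PID as seen by the procfs mounted at
// procd, by parsing the "Pid:" field of <procd>/self/status.
func selfPid(procd string) (int, error) {
	f, err := os.Open(filepath.Join(procd, "self", "status"))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Pid:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Pid:")))
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("no Pid field in %s/self/status", procd)
}

func main() {
	// With -system.hostfs=/hostfs this reads the host's proc and therefore
	// yields the PID in the host namespace; with plain /proc it behaves
	// like os.Getpid().
	pid, err := selfPid("/hostfs/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("PID in the procfs mount's namespace:", pid)
}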

@Constantin07
Author

Constantin07 commented Mar 23, 2018

@andrewkroh yes, you are right. I do see the regular metrics in Elasticsearch, but the error log message appears misleading to me (it's kind of "it works, but not completely").

@ewgRa
Contributor

ewgRa commented Mar 23, 2018

@jsoriano can you give more details about the general solution?

In the hostfs proc dir we have the host's processes; /hostfs/proc/self/status will not be the metricbeat process.

I will try to check it myself later, but for now I don't understand how it will work.

@jsoriano
Member

jsoriano commented Mar 23, 2018

@ewgRa the host proc dir contains all processes running in any PID namespace on the machine, including the namespace in which the metricbeat process runs. The special file self always refers to the calling process, with PIDs expressed in the namespace of the procfs mount, so if metricbeat reads <sigar.Procd>/self/status (with the host /proc mounted at sigar.Procd) it will see its own status in the host namespace, which includes its PID in that namespace; that PID could then be used in calls to gosigar without changing sigar.Procd.

I say it could be a general solution because it'd also work when no namespacing is used: sigar.Procd would be /proc and this would contain the process as usual.

The behaviour of self in different namespaces is documented in the pid_namespaces man page:

       Calling readlink(2) on the path /proc/self yields the process ID of
       the caller in the PID namespace of the procfs mount (i.e., the PID
       namespace of the process that mounted the procfs).  This can be
       useful for introspection purposes, when a process wants to discover
       its PID in other namespaces.
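
A tiny sketch of that readlink behaviour (not Beats code; it assumes the /hostfs mount from the docker run command earlier in the thread):

package main

import (
	"fmt"
	"os"
)

func main() {
	ownNS, err1 := os.Readlink("/proc/self")         // our PID in the container's namespace, e.g. "1"
	hostNS, err2 := os.Readlink("/hostfs/proc/self") // our PID in the host's namespace, e.g. "23481"
	fmt.Println(ownNS, err1, hostNS, err2)
}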

@ewgRa
Contributor

ewgRa commented Mar 24, 2018

@jsoriano thanks for the brilliant idea. I made the changes and it works, close to magic level :) Can you review it again? The failed CI looks unrelated to my changes.

I see only two problems/limitations with this solution:

  1. It is hard to write a test for this case; I've added a simple test, but it doesn't actually exercise the real case.

  2. If the -system.hostfs flag is wrong (for example, the directory does not exist), this approach will fail to get metrics.

But I think these are acceptable edge cases.

jsoriano added a commit that referenced this issue Mar 27, 2018
@jsoriano
Member

Fixed by #6641

@Constantin07
Author

Thanks @ewgRa @jsoriano

@Camillevau

Thanks

@andrzej-majewski

Is there a workaround for this?

I am running 6.2.4 and the issue still persists.

@grantcurell

grantcurell commented Oct 30, 2018

+1, still seeing:

2018-10-30T18:47:37.898Z ERROR instance/metrics.go:92 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=1
2018-10-30T18:47:37.898Z ERROR instance/metrics.go:61 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=1

(screenshot)

I added hostPid: true in accordance with: #6734

(screenshot)

Edit: I just realized that's not where it goes. Moved it in accordance with https://kubernetes.io/docs/concepts/policy/pod-security-policy/. Still no dice. We see the same errors in the logs as before.
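
For reference, hostPID (note the capitalisation) is a pod-level field in the Kubernetes pod spec, so in a Deployment or DaemonSet manifest it sits under spec.template.spec next to the containers list, not inside the container entry. A minimal sketch, with illustrative names and image tag (not taken from the attached manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricbeat                 # illustrative name
spec:
  selector:
    matchLabels:
      app: metricbeat
  template:
    metadata:
      labels:
        app: metricbeat
    spec:
      hostPID: true                # pod-level setting, alongside the containers list
      containers:
        - name: metricbeat
          image: docker.elastic.co/beats/metricbeat:6.3.0   # whichever version you deploy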

@jsoriano
Member

@grantcurell what version of metricbeat are you using? Could you also share the configuration you are using to start it?
And what logs do you see with hostPid: true? They shouldn't be the same, as metricbeat won't have pid 1.

@grantcurell

Metricbeat is 6.2.4. The configuration is attached: metricbeat-deploy.yml.txt

Update: It took me a second to get hostPid: true in the right place, but the logs are clean now and I'm no longer receiving the error. What is strange is that the dashboard behaves very erratically and the data is incorrect. For example: I'm doing a controlled test where I'm pumping 5 Gb/s into a security sensor I have, and I'm confirming with traditional monitoring tools that the sensor is receiving the expected 5 Gb/s (4.65 to be exact, with the loss from overhead), but Metricbeat's reading for inbound traffic jumps around all over the place, anywhere from 17 MB/s to 120 MB/s, and it changes on each 5-second interval I have the dashboard set to.

The other problem I have can be seen below. If I set the time period to anything less than 30 minutes, the entire top part of the dashboard zeros out, but the accompanying data continues to display appropriately, including network speed, which you can see is appropriately sitting at 600 MB/s (4800 Mb/s).

Additional info: This is running on Kubernetes 1.9.7.

(screenshot)

@grantcurell

grantcurell commented Oct 31, 2018

You can see below that when I change the time to 30 minutes it displays, though the disk IO is still missing.

Edit: And the information is still inaccurate in general.

(screenshot)

@jsoriano
Member

@grantcurell thanks for all the details.

6.2.4 didn't yet include the fix for this; you need 6.3.0 or later. In any case, the hostPid: true workaround should work.

Regarding the other problems, it'd be great if you could confirm them with a more modern metricbeat version and open specific issues.

@grantcurell

Can I update metricbeat independently of Elasticsearch in this case?

@jsoriano
Member

jsoriano commented Nov 1, 2018

Can I update metricbeat independently of Elasticsearch in this case?

If you are using Elasticsearch 6.X this should be fine, check the product compatibility matrix.

@grantcurell

grantcurell commented Nov 1, 2018

@jsoriano upgrading to Metricbeat 6.4.2 didn't fix the problem. I still get a bunch of strange partial data if the time interval is anything less than 30 minutes, e.g. the Kubernetes dashboard:

(screenshot)

or

(screenshot)

But move it to 30 minutes and you get:

(screenshot)
