
Cannot find Docker metrics in InfluxDB, can anyone help? #645

Closed
asdfsx opened this issue Feb 4, 2016 · 35 comments
Labels: bug (unexpected problem or unintended behavior)

Comments

@asdfsx

asdfsx commented Feb 4, 2016

I start Telegraf with the following config:

$ more /etc/telegraf/telegraf.d/docker.conf
[[inputs.docker]]
  # Docker Endpoint
  #   To use TCP, set endpoint = "tcp://[ip]:[port]"
  #   To use environment variables (ie, docker-machine), set endpoint = "ENV"
  endpoint = "unix:///var/run/docker.sock"
  # Only collect metrics for these containers, collect all if empty
  container_names = []

and start Telegraf with the following command:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d  

but I cannot find any Docker measurements in InfluxDB:

> show measurements;
name: measurements
------------------
name
cpu
disk
diskio
mem
swap
system

Actually, I can see Docker data being collected by Telegraf:

$/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d  -input-filter docker -test
* Plugin: docker, Collection 1
......
> docker_cpu,com.docker.compose.config-hash=2db93f17fb0fdbb2b3be408209d18ac7eb9f44d787af2df58b6f6601771763cf,com.docker.compose.container-number=1,com.docker.compose.oneoff=False,com.docker.compose.project=grafana,com.docker.compose.service=grafana,com.docker.compose.version=1.5.1,cont_id=528cfa640ba2863df3febd0cd28b173527599b8c2d81a26c6965fc3b13b0ea2d,cont_image=grafana/grafana,cont_name=grafana_grafana_1,cpu=cpu1 usage_total=8040078470i 1454561313911620608
......
@asdfsx asdfsx closed this as completed Feb 4, 2016
@asdfsx
Author

asdfsx commented Feb 4, 2016

I finally fixed this problem!

@sparrc
Contributor

sparrc commented Feb 4, 2016

@asdfsx what was the issue/solution?

@asdfsx
Author

asdfsx commented Feb 4, 2016

@sparrc Nothing special, I just restarted Telegraf several times and then the data appeared in InfluxDB... but I think some problems remain.
Right now I'm trying to use Grafana to display the metrics, but I can only see one point on the graph, not a line.
[screenshot: Grafana panel showing a single point]
I'm confused by this.

And when I execute this query:

SELECT mean(usage_total) FROM docker_cpu WHERE time > now() - 1h AND host = 'mesos36' GROUP BY time(15m)

I only get one record:
[screenshot: query result with a single row]

@asdfsx
Author

asdfsx commented Feb 4, 2016

It's strange: when I start Telegraf by executing the command:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug 

I do get Docker data in InfluxDB.
Then I create a graph in Grafana using a query like the one below:

SELECT mean("usage_total") FROM "docker_cpu" WHERE "host" = 'mesos36' AND $timeFilter GROUP BY time($interval) fill(null)

I get a graph like the one below:
[screenshot: Grafana graph of usage_total climbing steadily]
It seems docker_cpu.usage_total keeps growing, which doesn't look right.

But when I start Telegraf with systemctl start telegraf, nothing about Docker seems to be sent to InfluxDB:

SELECT mean(usage_total) FROM docker_cpu WHERE host = 'mesos36' AND time > now() - 1h GROUP BY time(5m)

docker_cpu
time    mean
2016-02-04T07:25:00Z    
2016-02-04T07:30:00Z    
2016-02-04T07:35:00Z    
2016-02-04T07:40:00Z    9535640409061.729
2016-02-04T07:45:00Z    
2016-02-04T07:50:00Z    
2016-02-04T07:55:00Z    9541001545054.232
2016-02-04T08:00:00Z    9542674282644.1
2016-02-04T08:05:00Z    9544248412986.191
2016-02-04T08:10:00Z    
2016-02-04T08:15:00Z    
2016-02-04T08:20:00Z    
2016-02-04T08:25:00Z

@sparrc
Contributor

sparrc commented Feb 4, 2016

Might be a permissions issue; try:

sudo -u telegraf /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug 

The upward slope is normal. Docker's CPU "usage" metric is actually just a counter of CPU ticks used.

@asdfsx
Author

asdfsx commented Feb 5, 2016

I think you are right!
I just noticed that Telegraf was running under the telegraf account,
so I modified /etc/systemd/system/telegraf.service:

[Service]
EnvironmentFile=-/etc/default/telegraf
#User=telegraf
User=root
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d ${TELEGRAF_OPTS}
Restart=on-failure
KillMode=process

You can see Telegraf was started as the telegraf user before.
And now I can see new data!

BTW, you said that Docker's CPU "usage" metric is actually just a counter of CPU ticks used.
Does that mean I don't need to use the mean function and can just query the data like this:

SELECT "usage_total" FROM "docker_cpu" WHERE "host" = 'mesos36'

Or do I need some other function, like count or sum?

@sparrc
Contributor

sparrc commented Feb 5, 2016

yes, that query would be fine
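(Side note on the counter point above: since usage_total only ever increases, a per-second rate can also be derived in InfluxQL with derivative(). The query below is a sketch, not taken from this thread; it reuses the measurement, field, and tag names from the earlier examples, and in Grafana it plots as a rate instead of an ever-climbing line.)

SELECT derivative(mean("usage_total"), 1s) FROM "docker_cpu" WHERE "host" = 'mesos36' AND $timeFilter GROUP BY time($interval), "cont_name" fill(null)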

@acherunilam

sudo su telegraf -c '/usr/bin/telegraf -config telegraf.conf -test -filter docker'

works fine. However, the service fails to send the Docker metrics, and the log fills with multiple instances of:

Error getting docker stats: io: read/write on closed pipe

The default permissions on the Unix socket are 660 (owner root, group docker), and I've added the telegraf user to the docker group as well. @sparrc Any idea what's going wrong?
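(For anyone hitting the same wall: a quick way to double-check the socket permissions and the service account's group membership, assuming a systemd-managed install as elsewhere in this thread. Group changes only apply to processes started afterwards, hence the restart.)

$ ls -l /var/run/docker.sock        # expect: srw-rw---- ... root docker
$ id telegraf                       # "docker" should appear in the groups list
$ sudo systemctl restart telegraf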

@zstyblik
Contributor

@sparrc I'm seeing the same issue with v0.10.3-1:

Feb 23 09:42:21 dev112-12 docker[879]: 2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout

However, docker ps works just fine.

@zstyblik
Contributor

I think this is most likely related to the Docker version: I'm not seeing it on hosts with Docker 1.8.3, build f4bf5c7, but I am seeing it with Docker 1.9.1, build 4419fdb-dirty.

@zstyblik
Contributor

I've written a small app, which is a scaled-down version of the Telegraf plugin, and I'm seeing the same error even on a host with Docker v1.8.3. However, I can see the requests being made in the logs:

Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.054869742+01:00" level=info msg="GET /containers/json"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.056860514+01:00" level=info msg="GET /containers/6b864d4d17e370abeff82dc0bb6553905f161fc2ec3b8b2e5998ee9bd637f166/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057260630+01:00" level=info msg="GET /containers/616487a45616594a2ca671bd0a6f5691cd71fc2c7eee7dfd85cd6f4d6949e0f1/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057576478+01:00" level=info msg="GET /containers/6242042c2f252ab5225f0173090cf37dedda8c18cf2de5f28ed52ce57c22d69c/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.058421662+01:00" level=info msg="GET /containers/c618e64f04f5d9920119b70477d38d76b4761a1c2a8a92ce704e024d231c4dd1/stats?stream=false"

The origin of the message is https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L119

I have no idea how to fix this issue, though. I have found no way to increase the timeout value or anything related to such a setting. It's possible I just haven't dug deep enough.

@tripledes

@zstyblik the Docker plugin has a hardcoded timeout of 5s (https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L108-L114), which I believe should be more than enough.

I think that should be a configuration setting, but that's a topic for another discussion.
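(Purely to illustrate the "make it configurable" idea: a minimal, hypothetical Go sketch. The type and field names below are made up for illustration and are not Telegraf's actual plugin code.)

package main

import (
	"context"
	"fmt"
	"time"
)

// dockerInput is a hypothetical stand-in for the plugin struct; Timeout would
// come from the plugin's configuration file instead of being hardcoded to 5s.
type dockerInput struct {
	Endpoint string
	Timeout  time.Duration
}

// gatherOne derives a per-request deadline from the configured timeout.
func (d *dockerInput) gatherOne(parent context.Context, containerID string) error {
	ctx, cancel := context.WithTimeout(parent, d.Timeout)
	defer cancel()
	// The GET /containers/<id>/stats?stream=false request would be issued
	// with ctx here; the request code is elided in this sketch.
	_ = ctx
	_ = containerID
	return nil
}

func main() {
	d := &dockerInput{Endpoint: "unix:///var/run/docker.sock", Timeout: 5 * time.Second}
	fmt.Println(d.gatherOne(context.Background(), "example-container-id"))
}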

@zstyblik
Contributor

@tripledes unfortunately, this setting isn't related to the issue.

@tripledes

@zstyblik obviously a closed pipe has nothing to do with a timeout, but you suggested increasing the timeout and I just provided information about it being hardcoded ❔

@tripledes

It seems the closed pipe is a side effect of the timeout on the Docker socket. It looks to me like it might be some synchronisation issue in dockerClient, but that's just a guess.

On the other hand, I've been looking at how docker stats works, because it doesn't fail, or at least it doesn't report any issue. The difference is that it uses https://github.com/docker/engine-api as its client library. If I get the time I'll put together a POC to see whether I can reproduce the issue with engine-api.
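(For anyone curious what such a POC boils down to: below is a rough, self-contained sketch written against the Docker Go SDK, into which engine-api was later folded. It is not the code from any branch in this thread, and the exact option and struct names have shifted between SDK versions, so treat the identifiers as assumptions. The point is that every call carries a context with a deadline, so a stuck stats request fails cleanly instead of hanging.)

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	// The client honours DOCKER_HOST etc. and defaults to unix:///var/run/docker.sock.
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	// One deadline shared by the listing and the per-container stats calls.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{})
	if err != nil {
		panic(err)
	}

	for _, c := range containers {
		resp, err := cli.ContainerStats(ctx, c.ID, false) // stream=false: a single snapshot
		if err != nil {
			fmt.Println("stats error:", err)
			continue
		}
		var s types.StatsJSON
		if err := json.NewDecoder(resp.Body).Decode(&s); err == nil {
			fmt.Println(c.ID[:12], "usage_total:", s.CPUStats.CPUUsage.TotalUsage)
		}
		resp.Body.Close()
	}
}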

@tripledes

@sparrc should we keep this open? Since the issue can be reproduced, I believe it should stay open until a fix is found.

@sparrc
Contributor

sparrc commented Feb 24, 2016

yep, sure

@sparrc sparrc reopened this Feb 24, 2016
@tripledes

Here is a first attempt at switching to Docker's engine-api; if anyone is willing to test it, it's here:

https://github.com/zooplus/telegraf/tree/docker_engine_api

Besides better compatibility, I think one of the advantages of using engine-api is that it uses a context for every request, so failures can be handled better.

I'd be very glad to have some feedback. I tried to keep the output as it was before, but the following items still need some love:

  • Better error handling (probably using a shared error channel amongst all goroutines)
  • Unit tests

And where possible, I'd like to make the plugin a bit more flexible, perhaps using a JSON flattener, so we don't need to specify all the metrics upfront. But I guess this could be left for follow-ups.

@sparrc what are your thoughts on the change? I think it could also be done with Go's standard library, but it would take a bigger effort to get the same functionality (context, API version compatibility, ...).

@sparrc
Contributor

sparrc commented Feb 26, 2016

@tripledes I don't have time to test but this sounds fine with me.

There is also a PR up for improving some of the Docker metrics (#754); how does that fit in?

@tripledes

@sparrc I currently have an instance of Telegraf with my changes running on our test environment. No issues so far, just some blkio metric names I need to check... other than that it's running fine. Feedback from anyone involved in this issue would still be very much appreciated :) @asdfsx @AdithyaBenny

Regarding #754, I just had a quick look and I don't think it'd be an issue; I can reapply my changes on top of it once it's merged.

@asdfsx
Author

asdfsx commented Feb 29, 2016

@tripledes if you make any changes to Telegraf for this issue, I'd like to try them.

@tripledes

@asdfsx here: https://github.com/zooplus/telegraf/tree/docker_engine_api. You'd need to compile it yourself; I can provide a compiled binary if needed.

@asdfsx
Author

asdfsx commented Mar 1, 2016

@tripledes I just compiled it on Ubuntu and ran it via the following command:
sudo /home/ubuntu/go/bin/telegraf -config /home/ubuntu/telegraf/telegraf.conf -debug -test -filter docker
It seems OK right now.
CentOS seems OK too!
Anything else that needs testing? Please tell me!

@tripledes

@asdfsx thanks! Just let us know if you find any issues so they can be fixed before submitting a PR.

@sparrc sparrc added the bug (unexpected problem or unintended behavior) label on Mar 7, 2016
@sporokh

sporokh commented Mar 14, 2016

Any updates regarding this issue?

@tripledes

@sporokh I understand you're also hitting the issue, right? I'd like to have a PR ready by the end of the week... although I can't really promise; I'm a bit short on time this week, but I'll try.

@sporokh

sporokh commented Mar 15, 2016

@tripledes Thanks a lot Sergio!
We have the same issue on our staging server: the metrics are being collected, but I consistently see this error in the logs:

2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe

@sparrc
Contributor

sparrc commented Mar 30, 2016

@tripledes any possibility of a PR by the end of this week?

@tripledes

@sparrc sorry, I've been a little short on time lately; I'll try over the weekend. If I don't manage to find the time, I'll ping you back.

@tripledes

@sparrc Just finished modifying the input. I haven't done anything on the tests yet and have only run a manual test, but it's looking promising.

I'll get to the tests tomorrow; in the meantime, is anyone willing to test?

https://github.com/tripledes/telegraf/tree/engine-api

Feedback welcome 👍

@sparrc
Contributor

sparrc commented Apr 6, 2016

thank you @tripledes, this has worked well for me

@tripledes

@sparrc glad to hear it! I'd like to take a better look at the input plugin whenever I get a bit of time (quite busy at work lately), as I think it should check the API version and also have some kind of integration tests against the supported Docker API versions. Just some ideas.

@forzagreen

Check the syslog (tail -f /var/log/syslog). If the error is:

Error in plugin [inputs.docker]: Got permission denied while trying to connect to the Docker daemon...

then you have to add the telegraf user to the docker group, like so:

$ sudo usermod -aG docker telegraf
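(Note: the group change only applies to newly started processes, so restart the service afterwards; on a systemd-managed install, as used earlier in this thread:)

$ sudo systemctl restart telegraf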

@remiteeple

remiteeple commented Sep 9, 2020

To anyone looking for a solution on ARM-based architectures...

As root open the cmdline.txt file...
$ sudo nano /boot/firmware/cmdline.txt

Add the following to the end of the file...
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1

Reboot the system...
$ sudo reboot

Verify that the changes have worked!
$ docker stats

Hope this helps.
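(If it's unclear whether the cgroup change took effect after the reboot, the kernel's view can be checked directly; a quick sketch, assuming a standard /proc layout:)

$ grep -E 'cpuset|memory' /proc/cgroups   # the "enabled" column should be 1
$ docker info | grep -i cgroup            # Docker reports its cgroup driver here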

@Mrx381

Mrx381 commented Jan 3, 2022

Try running this command: sudo chmod 666 /var/run/docker.sock.
