Cannot find Docker metrics in InfluxDB, can anyone help? #645

asdfsx · 2016-02-04T04:51:19Z

I start Telegraf with the following config:

$ more /etc/telegraf/telegraf.d/docker.conf
[[inputs.docker]]
  # Docker Endpoint
  #   To use TCP, set endpoint = "tcp://[ip]:[port]"
  #   To use environment variables (ie, docker-machine), set endpoint = "ENV"
  endpoint = "unix:///var/run/docker.sock"
  # Only collect metrics for these containers, collect all if empty
  container_names = []

and start Telegraf with the following command:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

but I cannot find any Docker measurements in InfluxDB:

> show measurements;
name: measurements
------------------
name
cpu
disk
diskio
mem
swap
system

However, I can see that Telegraf does collect Docker data:

$ /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -input-filter docker -test
* Plugin: docker, Collection 1
......
> docker_cpu,com.docker.compose.config-hash=2db93f17fb0fdbb2b3be408209d18ac7eb9f44d787af2df58b6f6601771763cf,com.docker.compose.container-number=1,com.docker.compose.oneoff=False,com.docker.compose.project=grafana,com.docker.compose.service=grafana,com.docker.compose.version=1.5.1,cont_id=528cfa640ba2863df3febd0cd28b173527599b8c2d81a26c6965fc3b13b0ea2d,cont_image=grafana/grafana,cont_name=grafana_grafana_1,cpu=cpu1 usage_total=8040078470i 1454561313911620608
......

asdfsx · 2016-02-04T06:44:55Z

I finally fixed this problem.

sparrc · 2016-02-04T06:45:09Z

@asdfsx what was the issue/solution?

asdfsx · 2016-02-04T06:51:23Z

@sparrc Nothing special: I restarted Telegraf several times and then the data appeared in InfluxDB. But I think some problems still remain.
Right now I'm trying to use Grafana to display the metrics, but I can only see one point on the graph, not a line.

I'm confused by this.

And when I execute this query:

SELECT mean(usage_total) FROM docker_cpu WHERE time > now() - 1h AND host = 'mesos36' GROUP BY time(15m)

I only get one record.

asdfsx · 2016-02-04T08:30:03Z

Strangely, when I start Telegraf by running the command:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug

I can get Docker data in InfluxDB.
Then I create a graph in Grafana, using a query like this:

SELECT mean("usage_total") FROM "docker_cpu" WHERE "host" = 'mesos36' AND $timeFilter GROUP BY time($interval) fill(null)

The resulting graph shows docker_cpu.usage_total growing continuously, which doesn't seem right.

And when I start Telegraf with systemctl start telegraf, it seems that no Docker data is sent to InfluxDB:

SELECT mean(usage_total) FROM docker_cpu WHERE host = 'mesos36' AND time > now() - 1h GROUP BY time(5m)

docker_cpu
time    mean
2016-02-04T07:25:00Z    
2016-02-04T07:30:00Z    
2016-02-04T07:35:00Z    
2016-02-04T07:40:00Z    9535640409061.729
2016-02-04T07:45:00Z    
2016-02-04T07:50:00Z    
2016-02-04T07:55:00Z    9541001545054.232
2016-02-04T08:00:00Z    9542674282644.1
2016-02-04T08:05:00Z    9544248412986.191
2016-02-04T08:10:00Z    
2016-02-04T08:15:00Z    
2016-02-04T08:20:00Z    
2016-02-04T08:25:00Z

sparrc · 2016-02-04T16:26:00Z

might be a permissions issue, try

sudo -u telegraf /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug

The upward slope is normal. Docker's CPU "usage" metric is actually just a counter of CPU ticks used.
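
Because the value is a cumulative counter, a graph of the raw field (or of its mean) will always slope upward. What is usually plotted instead is the change between consecutive samples. Here is a minimal Python sketch of that conversion. The function name counter_to_rate and the sample data are hypothetical, timestamps are assumed to be in seconds, and intervals where the counter goes backwards (for example after a container restart) are simply skipped:

```python
from typing import List, Tuple


def counter_to_rate(samples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Turn (timestamp_seconds, counter_value) samples into per-second rates.

    Each output point is stamped with the later timestamp of the pair.
    Pairs where the counter decreased or time did not advance are skipped,
    since a decrease usually means the counter was reset.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        dv = v1 - v0
        if dt <= 0 or dv < 0:
            continue
        rates.append((t1, dv / dt))
    return rates


# Hypothetical samples, 10 seconds apart (not taken from this thread).
samples = [(0.0, 1000), (10.0, 1500), (20.0, 1500), (30.0, 200), (40.0, 700)]
print(counter_to_rate(samples))
```

InfluxQL also has derivative-style functions such as non_negative_derivative(), so the same conversion can be done in the query rather than in client code.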

asdfsx · 2016-02-05T08:55:11Z

I think you are right!
I just noticed that Telegraf is running under the telegraf account.
So I modified /etc/systemd/system/telegraf.service:

[Service]
EnvironmentFile=-/etc/default/telegraf
#User=telegraf
User=root
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d ${TELEGRAF_OPTS}
Restart=on-failure
KillMode=process

You can see that Telegraf was previously started as user telegraf.
Now I can see the new data!

By the way, you said that Docker's CPU "usage" metric is actually just a counter of CPU ticks used.
Does that mean I don't need to use the mean function?
Can I just query the data like this:

SELECT "usage_total" FROM "docker_cpu" WHERE "host" = 'mesos36'

Or do I need some other function, like count or sum?

sparrc · 2016-02-05T16:03:04Z

yes, that query would be fine

acherunilam · 2016-02-18T23:14:42Z

sudo su telegraf -c '/usr/bin/telegraf -config telegraf.conf -test -filter docker'

works fine. However, the service fails to send the docker metrics and the log fills with multiple instances of

Error getting docker stats: io: read/write on closed pipe

The default permission on the unix socket is 660 (UID:root, GID:docker), and I've added user telegraf to the docker group as well. @sparrc Any idea what's going wrong?
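
For reference, mode 660 with owner root and group docker means that only root and members of the docker group can read and write the socket. Below is a rough Python sketch of that check. The helper name can_read_write and the uid/gid numbers are made up for illustration, and the sketch ignores ACLs and capabilities. It also cannot show that a running process only picks up a new group membership once it is restarted:

```python
import stat


def can_read_write(mode: int, owner_uid: int, owner_gid: int,
                   uid: int, gids: set) -> bool:
    """Approximate the classic Unix permission check for read+write access."""
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == owner_uid:
        needed = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in gids:
        needed = stat.S_IRGRP | stat.S_IWGRP
    else:
        needed = stat.S_IROTH | stat.S_IWOTH
    return (mode & needed) == needed


# Hypothetical ids: socket owned by uid 0 / gid 998, telegraf is uid 999.
print(can_read_write(0o660, 0, 998, 999, {999}))       # not in the docker group
print(can_read_write(0o660, 0, 998, 999, {999, 998}))  # in the docker group
```

For a real socket, the mode, owner and group would come from os.stat() on the socket path.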

zstyblik · 2016-02-23T08:58:14Z

@sparrc I'm seeing the same issue with v0.10.3-1:

Feb 23 09:42:21 dev112-12 docker[879]: 2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout

However, docker ps works just fine.

zstyblik · 2016-02-23T09:07:11Z

I think this is most likely related to the Docker version, as I'm not seeing it on hosts with Docker version 1.8.3, build f4bf5c7, but I am seeing it on Docker version 1.9.1, build 4419fdb-dirty.

zstyblik · 2016-02-23T10:36:29Z

I've written a small app, which is a scaled-down version of the Telegraf plugin, and I'm seeing the same error even on a host with Docker v1.8.3. However, I can see the requests being made in the logs:

Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.054869742+01:00" level=info msg="GET /containers/json"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.056860514+01:00" level=info msg="GET /containers/6b864d4d17e370abeff82dc0bb6553905f161fc2ec3b8b2e5998ee9bd637f166/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057260630+01:00" level=info msg="GET /containers/616487a45616594a2ca671bd0a6f5691cd71fc2c7eee7dfd85cd6f4d6949e0f1/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057576478+01:00" level=info msg="GET /containers/6242042c2f252ab5225f0173090cf37dedda8c18cf2de5f28ed52ce57c22d69c/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.058421662+01:00" level=info msg="GET /containers/c618e64f04f5d9920119b70477d38d76b4761a1c2a8a92ce704e024d231c4dd1/stats?stream=false"

The message originates at https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L119

I have no idea how to fix this issue, though. I have found no way to increase the timeout value or any related setting. It's possible, though, that I haven't dug deep enough.

tripledes · 2016-02-23T11:01:41Z

@zstyblik the docker plugin has a hardcoded timeout of 5s, https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L108-L114 which I believe should be more than enough.

I think that should be a configuration setting, but that's a topic for another discussion.
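
To make the discussion concrete, here is a minimal Python sketch of the same kind of request (GET /containers/<id>/stats?stream=false over the Unix socket) with the timeout exposed as a parameter instead of being hardcoded. It is not how the Telegraf plugin is implemented. The names UnixHTTPConnection, stats_path and get_stats are made up for this example, and the socket path and 5 s default are just the values mentioned in this thread:

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, with a per-connection timeout."""

    def __init__(self, socket_path: str, timeout: float = 5.0):
        # The host name is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)  # applies to connect and to each read
        sock.connect(self.socket_path)
        self.sock = sock


def stats_path(container_id: str) -> str:
    """Path of the one-shot stats request, as seen in the daemon log."""
    return "/containers/%s/stats?stream=false" % container_id


def get_stats(container_id: str,
              socket_path: str = "/var/run/docker.sock",
              timeout: float = 5.0) -> dict:
    """Fetch one stats sample; a slow daemon raises a timeout error."""
    conn = UnixHTTPConnection(socket_path, timeout=timeout)
    try:
        conn.request("GET", stats_path(container_id))
        response = conn.getresponse()
        return json.loads(response.read())
    finally:
        conn.close()
```

Calling get_stats("<container id>", timeout=10.0) on a host with a reachable socket would be one way to check whether a longer timeout changes anything.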

zstyblik · 2016-02-23T11:44:01Z

@tripledes unfortunately, this setting isn't related to the issue.

tripledes · 2016-02-23T12:15:59Z

@zstyblik obviously, a closed pipe has nothing to do with a timeout, but you suggested increasing the timeout and I just provided the information that it is hardcoded ❔

tripledes · 2016-02-23T16:59:41Z

Seems like the closed pipe is a side effect of the timeout on the Docker socket. Looks to me like it might be some synchronisation issue in dockerClient, but that's just a guess.

On the other hand, I've been looking at how docker stats works, because it doesn't fail, or at least it doesn't report any issue. The difference is that it uses https://github.com/docker/engine-api as a client library. If I get the time I'll try to do a POC just to see if I can reproduce the issue with engine-api.

tripledes · 2016-02-24T14:14:09Z

@sparrc should we keep this open? As the issue can be reproduced, I believe it should stay open until a fix is found.

sparrc · 2016-02-24T16:03:42Z

yep, sure

tripledes · 2016-02-25T16:26:08Z

First attempt at switching to Docker's engine-api. If anyone is willing to test it, it's here:

https://github.com/zooplus/telegraf/tree/docker_engine_api

Besides having better compatibility, I think one of the advantages of using engine-api is that it uses context for all requests, so it can handle failures better.

I'd be very glad to have some feedback. I tried to keep the output as it was before, but the following items would need some love:

Better error handling (probably using a shared error channel amongst all goroutines)
Unit tests
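
The "shared error channel" idea could look roughly like the following. This is a Python sketch rather than Go: a thread pool stands in for the goroutines and a plain list of (container id, error) pairs stands in for the channel. gather_stats and fake_fetch are hypothetical names and are not part of Telegraf or engine-api:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def gather_stats(container_ids: Iterable[str],
                 fetch: Callable[[str], dict],
                 max_workers: int = 8,
                 ) -> Tuple[Dict[str, dict], List[Tuple[str, Exception]]]:
    """Fetch stats for every container concurrently.

    A failure for one container is recorded and does not stop the others.
    """
    results: Dict[str, dict] = {}
    errors: List[Tuple[str, Exception]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, cid): cid for cid in container_ids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                results[cid] = future.result()
            except Exception as exc:  # collect here, report in one place
                errors.append((cid, exc))
    return results, errors


def fake_fetch(cid: str) -> dict:
    """Stand-in for a real stats call; fails for one made-up id."""
    if cid == "bad":
        raise TimeoutError("i/o timeout")
    return {"id": cid}


results, errors = gather_stats(["a", "bad", "b"], fake_fetch)
print(sorted(results), [(cid, str(exc)) for cid, exc in errors])
```

The point of the design is that the caller gets both the partial results and every error, so one slow container does not hide the metrics of the others.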

And if possible, I'd like to make the plugin a bit more flexible, perhaps using jsonflattener, so we don't need to specify all the metrics upfront. But I guess this could be left for follow-ups.
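
As an illustration of the flattening idea (not Telegraf's own flattener), here is a small Python sketch that turns a nested stats document into flat field names, so new metrics show up without being listed in advance. The function name flatten, the "_" separator and the sample document are all assumptions made for this example:

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts and lists into a single-level dict of fields."""
    fields: Dict[str, Any] = {}
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # List elements are named by their index.
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        fields[prefix] = value
        return fields
    for key, child in items:
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        fields.update(flatten(child, name, sep))
    return fields


# Made-up fragment, shaped loosely like a stats response.
sample = {"cpu_stats": {"cpu_usage": {"total_usage": 42, "percpu_usage": [20, 22]}}}
print(flatten(sample))
```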

@sparrc what are your thoughts about the change? I think it could also be done with Go's std lib, but would require a bigger effort to have the same functionalities (context, api version compatibilities, ...).

sparrc · 2016-02-26T11:09:13Z

@tripledes I don't have time to test, but this sounds fine to me.

There is also a PR up for improving some of the docker metrics: #754, how does that fit in?

tripledes · 2016-02-26T11:31:50Z

@sparrc I currently have an instance of Telegraf with my changes running on our test env, no issues for now, just some blkio metric names that I need to check...other than that it's running fine, still some feedback from anyone involved on this issue would be very much appreciated :) @asdfsx @AdithyaBenny

Regarding #754, I just had a quick look and I don't think it'd be an issue, I could reapply my changes on top of it once you get it merged.

asdfsx · 2016-02-29T06:13:13Z

@tripledes If you make any changes to Telegraf for this issue, I'd like to try them.

tripledes · 2016-02-29T08:12:07Z

@asdfsx here: https://github.com/zooplus/telegraf/tree/docker_engine_api you'd need to compile it yourself. I could provide a compiled binary if needed.

asdfsx · 2016-03-01T11:26:17Z

@tripledes I just compiled it on Ubuntu and ran it with the following command:
sudo /home/ubuntu/go/bin/telegraf -config /home/ubuntu/telegraf/telegraf.conf -debug -test -filter docker
It seems OK right now.
CentOS seems OK too!
Does anything else need to be tested? Please tell me!

tripledes · 2016-03-01T13:27:10Z

@asdfsx thanks! Just let us know whether you find any issue so it can be fixed before submitting a PR.

sporokh · 2016-03-14T10:46:37Z

Any updates on this issue?

tripledes · 2016-03-15T12:28:45Z

@sporokh I understand you're also hitting the issue, right? I'd like to have a PR ready by the end of the week, although I cannot really promise it. I'm a little bit short on time this week, but I'll try.

sporokh · 2016-03-15T12:42:06Z

@tripledes Thanks a lot Sergio!
We have the same issue on our staging server: the metrics are being collected, but I consistently see this error in my logs:

2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe

sparrc · 2016-03-30T16:39:47Z

@tripledes any possibilities of a PR by the end of this week?

tripledes · 2016-03-31T10:07:39Z

@sparrc sorry, I've been a little short on time lately. I'll try over the weekend. If I don't manage to find the time, I'll ping you back.

tripledes · 2016-04-02T22:40:55Z

@sparrc Just finished modifying the input. I haven't done anything on the tests yet and have only run a manual test, but it's looking promising.

I'll get to the tests tomorrow. In the meantime, is anyone willing to test?

https://github.com/tripledes/telegraf/tree/engine-api

Feedback welcome 👍

sparrc · 2016-04-06T18:08:39Z

thank you @tripledes, this has worked well for me

tripledes · 2016-04-06T21:59:47Z

@sparrc glad to hear! I'd like to have a better look at the input plugin whenever I get a bit of time (quite busy lately at work), as I think it should be checking the API version and should also have some kind of integration tests against the supported Docker API versions. Just some ideas.

forzagreen · 2017-12-04T15:38:45Z

Check the syslogs (tail -f /var/log/syslog). If the error is

Error in plugin [inputs.docker]: Got permission denied while trying to connect to the Docker daemon...

then you have to add the telegraf user to the docker group:

$ sudo usermod -aG docker telegraf

remiteeple · 2020-09-09T03:01:51Z

To anyone looking for a solution on ARM based architecture...

As root open the cmdline.txt file...
$ sudo nano /boot/firmware/cmdline.txt

Add the following to the end of the file...
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1

Reboot the system...
$ sudo reboot

Verify that the changes have worked!
$ docker stats

Hope this helps.
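
A quick way to see which cgroup controllers the kernel reports as enabled, before and after the reboot, is to read /proc/cgroups. Here is a small Python sketch, assuming the file has its usual four-column layout (subsys_name, hierarchy, num_cgroups, enabled). The function name enabled_cgroup_controllers and the sample text are made up for illustration:

```python
def enabled_cgroup_controllers(proc_cgroups_text: str) -> set:
    """Return names of controllers marked enabled in /proc/cgroups content."""
    enabled = set()
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        parts = line.split()
        # Columns: subsys_name, hierarchy, num_cgroups, enabled
        if len(parts) == 4 and parts[3] == "1":
            enabled.add(parts[0])
    return enabled


# Made-up sample in the /proc/cgroups layout, with memory disabled.
sample = """#subsys_name\thierarchy\tnum_cgroups\tenabled
cpuset\t3\t1\t1
cpu\t2\t1\t1
memory\t0\t1\t0
"""
print("memory" in enabled_cgroup_controllers(sample))
```

On a real host the input would be the contents of /proc/cgroups, for example open("/proc/cgroups").read().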

Mrx381 · 2022-01-03T10:12:18Z

Try running this command: sudo chmod 666 /var/run/docker.sock

asdfsx closed this as completed Feb 4, 2016

sparrc reopened this Feb 24, 2016

sparrc added the bug label (unexpected problem or unintended behavior) Mar 7, 2016

sparrc mentioned this issue Apr 1, 2016

Added support for a TLS-enabled Docker daemon #910

Closed

tripledes mentioned this issue Apr 3, 2016

Docker engine api #957

Closed

sparrc closed this as completed in e19c474 Apr 6, 2016
