Drop and Error statistics on virtual interfaces not reported correctly #2111

Closed
Sil5nc5 opened this issue Dec 1, 2016 · 12 comments
Labels
bug unexpected problem or unintended behavior


Sil5nc5 commented Dec 1, 2016

Bug report

Relevant telegraf.conf:

Default configuration writing to InfluxDB with 1 metric enabled:
...
[[inputs.net]]
interfaces = ["video.1297"]
...

System info:

OS: Centos 7.2

Telegraf v1.1.1 (git: release-1.1.0 94de9dc)
InfluxDB shell version: 1.1.0

Steps to reproduce:

  1. Configure a VLAN interface, like eth0. with an IP/Subnet mask ...
  2. Set up your /etc/telegraf/telegraf.conf (telegraf -sample-config -input-filter net -output-filter influxdb > /etc/telegraf/telegraf.conf). If you prefer, you can restrict collection to a specific interface (see the net section of the configuration file).
  3. Drop the telegraf DB in influx (drop database telegraf) to have a clean start.
  4. Start Telegraf and InfluxDB
  5. Log in to influx (influx)
  6. Change database (use telegraf)
  7. Check if there are series already (show series)
  8. Query (select interface,bytes_recv,bytes_sent,drop_in,drop_out,err_in,err_out from net where interface =~ /(video).*/ limit 10). Adjust the interface name accordingly.

Expected behavior:

When drops or errors occur on the VLAN interface as shown by commands like:

  • ip -s l
  • ifconfig (RX and TX stats)

then those drop/err values should be present in InfluxDB.

Actual behavior:

Those values are always 0.

Additional info:

After some digging, basically stracing the Telegraf process (strace -ff -s 1500 -p $(pgrep telegraf)), I found out that /proc/net/dev is used to gather the interface statistics. If I "cat" those statistics on a regular basis (every 5s), the drop/err counters for the VLAN interface are present there, but the values are not stored correctly in InfluxDB. The "drop_in" and "drop_out" counters are 0 for that VLAN interface.

Telegraf strace:

Telegraf_Strace.txt

InfluxDB output:

InfluxDB shell version: 1.1.0
use telegraf
Using database telegraf
select interface,bytes_recv,bytes_sent,drop_in,drop_out,err_in,err_out from net where interface =~ /(video).*/ limit 10
name: net
time                interface  bytes_recv     bytes_sent     drop_in drop_out err_in err_out
----                ---------  ----------     ----------     ------- -------- ------ -------
1480586420000000000 video.1297 19661428082746 20537550752310 0       0        0      0
1480586425000000000 video.1297 19662919045426 20538672416520 0       0        0      0
1480586430000000000 video.1297 19664387950054 20539792739500 0       0        0      0

IP output (ip -s l):

9: video.1297@video: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether 00:0c:29:a7:9f:3f brd ff:ff:ff:ff:ff:ff
    RX: bytes       packets      errors  dropped     overrun  mcast
    22023152032994  16241265939  0       0           0        820
    TX: bytes       packets      errors  dropped     carrier  collsns
    22338244074504  16305291072  0       4030510272  0        0

Output "/proc/net/dev":

Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
video.1297: 21987005189672 16214608972 0 0 0 0 0 819 22310920550506 16285346891 0 4023375855 0 0 0 0
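
For reference, the sample line above can be mapped to named counters with a short script. This is only a minimal sketch in Python, not Telegraf's or gopsutil's code; the function name `parse_net_dev_line` and the column lists are assumptions made for illustration, with the column order copied from the header shown above.

```python
# Column order of /proc/net/dev after the "iface:" prefix, copied from the
# header above: 8 receive counters followed by 8 transmit counters.
RX_COLUMNS = ["bytes", "packets", "errs", "drop",
              "fifo", "frame", "compressed", "multicast"]
TX_COLUMNS = ["bytes", "packets", "errs", "drop",
              "fifo", "colls", "carrier", "compressed"]


def parse_net_dev_line(line):
    """Split one /proc/net/dev data line into (interface, rx dict, tx dict).

    Hypothetical helper for illustration only.
    """
    name, _, counters = line.partition(":")
    values = [int(v) for v in counters.split()]
    expected = len(RX_COLUMNS) + len(TX_COLUMNS)
    if len(values) != expected:
        raise ValueError("expected %d counters, got %d" % (expected, len(values)))
    rx = dict(zip(RX_COLUMNS, values[:len(RX_COLUMNS)]))
    tx = dict(zip(TX_COLUMNS, values[len(RX_COLUMNS):]))
    return name.strip(), rx, tx


# The video.1297 line from this report.
SAMPLE = ("video.1297: 21987005189672 16214608972 0 0 0 0 0 819 "
          "22310920550506 16285346891 0 4023375855 0 0 0 0")

iface, rx, tx = parse_net_dev_line(SAMPLE)
print(iface, "drop_in =", rx["drop"], "drop_out =", tx["drop"])
```

Read this way, the transmit "drop" column for video.1297 is clearly non-zero, which is the value expected to show up as drop_out in InfluxDB.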

Proposal:

N/A

Current behavior:

Counter for drop/err remains 0.

Desired behavior:

Get the drop/err counters for the VLAN interface in InfluxDB using Telegraf.

Use case: [Why is this important (helps with prioritizing requests)]

Everybody who has VLAN interfaces configured and wants to gather all the statistics. As it stands, one would assume that no drops ever happen on those interfaces, so no alerts can ever be triggered.

Contributor

sparrc commented Dec 1, 2016

can you provide output of uname -a?

TBH I'm tempted to close this because I think it's reasonable to assume that /proc/net/dev is correct, unless there is a reasonable workaround using procfs.

Author

Sil5nc5 commented Dec 1, 2016

@sparrc Hi, I updated the bug report: "/proc/net/dev" does provide the desired output (my apologies, I looked at the wrong field). The issue remains the same, though: the "drop_in" and "drop_out" counters do not get updated in InfluxDB; the value remains 0.

uname -a:

Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Author

Sil5nc5 commented Dec 1, 2016

Some extra information:

select interface,bytes_recv,bytes_sent,drop_in,drop_out,err_in,err_out from net where interface =~ /(video).*/ limit 20
name: net
time                    interface       bytes_recv      bytes_sent      drop_in drop_out        err_in  err_out
----                    ---------       ----------      ----------      ------- --------        ------  -------
1480596805000000000     video           22970416424910  22886710950176  24      0               0       0
1480596805000000000     video3.1296     22720381848522  23068674526646  0       0               0       0
1480596805000000000     video2          22970368955460  22008615195164  22      0               0       0
1480596805000000000     video.1297      22735682039790  22886730699448  0       0               0       0
1480596805000000000     video3          22954999162292  23068653478614  20      0               0       0
1480596805000000000     video2.1602     22735635059372  22008635722596  0       0               0       0
1480596810000000000     video3          22956474141282  23069760685214  20      0               0       0

As you can see, it does provide non-zero drop_in values for the video, video2 and video3 interfaces (those are non-VLAN interfaces).

Contributor

sparrc commented Dec 1, 2016

please send output of cat /proc/net/dev

Author

Sil5nc5 commented Dec 1, 2016

Output_Proc_Net_Dev.txt

@sparrc See attached file

Contributor

sparrc commented Dec 1, 2016

@Sil5nc5 afaict these numbers match what you have in the db.

your db values are not 0 for video, video2, or video3

Author

Sil5nc5 commented Dec 1, 2016

@sparrc The values for the drop_out (drops on the sending side / TX / Transmit) are 1281423891 (video2.1602) / 1281423891 (video3.1296) / 1281423891 (video.1297). But in InfluxDB they are 0.

Contributor

sparrc commented Dec 1, 2016

ah, so do I understand right that this doesn't have anything to do with the VLAN then?

Author

Sil5nc5 commented Dec 1, 2016

To be absolutely sure, I would have to trigger drops on the other interfaces to verify if this is the case. I will try it tomorrow. Or I will try to trigger RX errors on those interfaces to see if those values are correctly stored in the database. I'll get back to you.

sparrc added the bug (unexpected problem or unintended behavior) label and removed the Need More Info label Dec 1, 2016
Contributor

sparrc commented Dec 1, 2016

I may have located the issue: the library we use to get system stats appears to be indexing the fields incorrectly here: https://github.com/shirou/gopsutil/blob/master/net/net_linux.go#L82-L89. I'll open an issue there and see if we can get that fixed.

(dropOut should actually be the 11th index)
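
To make the index concrete: once the interface name is stripped from a /proc/net/dev line, the transmit drop counter sits at zero-based index 11. The sketch below (in Python for brevity) uses the video.1297 line from the report. The "wrong" index 13 is a hypothetical value picked only to illustrate the symptom; it is not a confirmed quote of the library code.

```python
# The video.1297 line from this report.
SAMPLE = ("video.1297: 21987005189672 16214608972 0 0 0 0 0 819 "
          "22310920550506 16285346891 0 4023375855 0 0 0 0")

# Drop the "iface:" prefix and split the 16 counters on whitespace.
fields = SAMPLE.split(":", 1)[1].split()

DROP_IN_INDEX = 3    # receive "drop" column
DROP_OUT_INDEX = 11  # transmit "drop" column (8 receive columns + 3)

# Hypothetical wrong index for illustration: 13 is the transmit "colls"
# column, which is 0 on this interface.
HYPOTHETICAL_WRONG_INDEX = 13

drop_in = int(fields[DROP_IN_INDEX])
drop_out = int(fields[DROP_OUT_INDEX])
misread = int(fields[HYPOTHETICAL_WRONG_INDEX])

print("drop_in:", drop_in)
print("drop_out at index 11:", drop_out)
print("value at hypothetical wrong index 13:", misread)
```

Any index that lands on one of the trailing transmit columns (fifo, colls, carrier, compressed) would read 0 on this interface, which is consistent with the drop_out = 0 rows seen in InfluxDB.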

Author

Sil5nc5 commented Dec 1, 2016

@sparrc great find!

timhallinflux added this to the 1.3.0 milestone Jan 27, 2017
Contributor

sparrc commented Feb 3, 2017

closed by #2353

sparrc closed this as completed Feb 3, 2017