bug in btrfs statistics with subvolume #1221

Closed · phoet opened this issue Nov 25, 2014 · 10 comments

@phoet commented Nov 25, 2014

I was just testing 5.1.0-539 to gather some statistics on btrfs usage, but the check fails with an error:

2014-11-25 11:51:55 UTC | ERROR | dd.collector | checks.btrfs(__init__.py:552) | Check 'btrfs' instance #0 failed
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/checks/__init__.py", line 543, in run
    self.check(copy.deepcopy(instance))
  File "/opt/datadog-agent/agent/checks.d/btrfs.py", line 124, in check
    replication_type, usage_type  = FLAGS_MAPPER[flags]
KeyError: 562949953421312

My guess is that handling two mount points is somehow broken. The get_usage method is not trivial, so this is my best guess...

mount shows two btrfs devices:

mount | grep btrfs
/dev/mapper/lvroot-u on /u type btrfs (rw)
/dev/mapper/lvroot-u on /var/log/nginx type btrfs (rw,subvol=nginx-logs)

disk_partitions finds both mounts, but the check ends up keeping just one of them:

>>> psutil.disk_partitions()
[..., sdiskpart(device='/dev/mapper/lvroot-u', mountpoint='/u', fstype='btrfs', opts='rw'), sdiskpart(device='/dev/mapper/lvroot-u', mountpoint='/var/log/nginx', fstype='btrfs', opts='rw,subvol=nginx-logs')]
{'/dev/mapper/lvroot-u': '/u'}
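
If the check builds a device-keyed dict from those partitions, that would explain it. A minimal sketch of that assumption (my guess, not verified against the agent source):

import psutil

# Assumption: the check keys a dict by device and keeps only the first
# mountpoint seen per device, so the subvolume mount of the same device
# is dropped.
btrfs_devices = {}
for part in psutil.disk_partitions():
    if part.fstype == 'btrfs' and part.device not in btrfs_devices:
        btrfs_devices[part.device] = part.mountpoint

print(btrfs_devices)  # -> {'/dev/mapper/lvroot-u': '/u'} on this host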

Nearly the same usage statistics for both, but the last tuple looks off (is the handling of the subvolume broken?)

>>> get_usage('/u')
[(1, 27925676032, 17798676480), (34, 8388608, 16384), (2, 4194304, 0), (36, 3758096384, 1661550592), (4, 8388608, 0), (562949953421312, 436207616, 0)]
>>> get_usage('/var/log/nginx')
[(1, 27925676032, 17798836224), (34, 8388608, 16384), (2, 4194304, 0), (36, 3758096384, 1661566976), (4, 8388608, 0), (562949953421312, 436207616, 0)]
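
Decoding the flag bits (constants lifted from the kernel's btrfs headers; this is just my illustration, not agent code) shows where the odd key comes from:

# Block-group flag bits as defined in the kernel's btrfs headers:
DATA, SYSTEM, METADATA = 1 << 0, 1 << 1, 1 << 2   # 1, 2, 4
DUP = 1 << 5                                      # 32
GLOBAL_RSV = 1 << 49                              # BTRFS_SPACE_INFO_GLOBAL_RSV

assert SYSTEM | DUP == 34                # the (34, ...) tuples above
assert METADATA | DUP == 36              # the (36, ...) tuples above
assert GLOBAL_RSV == 562949953421312     # the key that FLAGS_MAPPER rejects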
@remh (Contributor) commented Nov 25, 2014

Thanks a lot for this detailed feedback.
I'm looking into the issue and will get back to you shortly.

@phoet (Author) commented Nov 25, 2014

👍

@remh (Contributor) commented Nov 25, 2014

Could you also send us the output of

btrfs fi df

please?

@remh remh added this to the 5.1.1 milestone Nov 25, 2014
@remh remh added the bugfix label Nov 25, 2014
@phoet (Author) commented Nov 25, 2014

sudo btrfs fi df /u
Data, single: total=23.00GiB, used=9.36GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=3.50GiB, used=888.91MiB
unknown, single: total=240.00MiB, used=0.00

@remh (Contributor) commented Nov 25, 2014

Thanks, the weird key "562949953421312" corresponds to the "unknown" line in your output above.

Could you try this patched version of the check and let me know how it goes?
https://gist.githubusercontent.com/remh/8caa27c2b72c48bdd86f/raw/0913cc0bfb50d5bdfc90c8a0f33a296c7f20133e/btrfs.py

Just replace your /opt/datadog-agent/agent/checks.d/btrfs.py file with the one in the gist and restart the agent.
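
(The gist is the source of truth, but the idea is essentially to make the flag lookup fall back to an "unknown" bucket instead of raising a KeyError. A minimal sketch of that, assuming FLAGS_MAPPER was a plain dict before:)

from collections import defaultdict

# Unknown flag combinations now map to ('unknown', 'unknown')
# instead of raising KeyError.
FLAGS_MAPPER = defaultdict(lambda: ('unknown', 'unknown'), {
    1:  ('single', 'data'),
    2:  ('single', 'system'),
    4:  ('single', 'metadata'),
    34: ('dup', 'system'),
    36: ('dup', 'metadata'),
})

replication_type, usage_type = FLAGS_MAPPER[562949953421312]
print(replication_type, usage_type)  # -> unknown unknown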

Thanks!

@phoet (Author) commented Nov 26, 2014

This is the data from the test host. Maybe I just don't understand how btrfs handles the disk space, but it looks off.

[screenshot: Datadog 5.1 graphs of the btrfs metrics, 2014-11-26 12:02]

The other stats are collected via a script that uses the btrfs CLI:

#!/bin/bash
# Pushes btrfs gauges to the local dogstatsd listener (UDP 8125) via /dev/udp.

TOTAL=$(sudo /sbin/btrfs filesystem show /dev/mapper/lvroot-u | grep -oP '(?<=size )\d+\.\d+\w' | numfmt --from=iec)
USED=$(sudo /sbin/btrfs filesystem show /dev/mapper/lvroot-u | grep 'Total devices' | grep -oP '(?<=used.)\d+\.\d+\w' | numfmt --from=iec)
ALLOCATED=$(sudo /sbin/btrfs filesystem show /dev/mapper/lvroot-u | grep -oP '(?<=used )\d+\.\d+\w' | tail -n1 | numfmt --from=iec)
FREE=$(echo "$TOTAL - $USED" | bc)

echo -e -n "system.btrfs.total:$TOTAL|g" > /dev/udp/127.0.0.1/8125
echo -e -n "system.btrfs.used:$USED|g" > /dev/udp/127.0.0.1/8125
echo -e -n "system.btrfs.allocated:$USED|g" > /dev/udp/127.0.0.1/8125
echo -e -n "system.btrfs.free:$FREE|g" > /dev/udp/127.0.0.1/8125

DATA_TOTAL=$(sudo /sbin/btrfs filesystem df /u | grep "Data, single" | grep -oP '(?<=total=)\d+\.\d+\w' | numfmt --from=iec)
DATA_USED=$(sudo /sbin/btrfs filesystem df /u | grep "Data, single" | grep -oP '(?<=used.)\d+\.\d+\w' | numfmt --from=iec)
DATA_FREE=$(echo "$DATA_TOTAL - $DATA_USED" | bc)

echo -e -n "system.btrfs.data.total:$DATA_TOTAL|g"  > /dev/udp/127.0.0.1/8125
echo -e -n "system.btrfs.data.used:$DATA_USED|g"  > /dev/udp/127.0.0.1/8125
echo -e -n "system.btrfs.data.free:$DATA_FREE|g"  > /dev/udp/127.0.0.1/8125

METADATA_TOTAL=$(sudo /sbin/btrfs filesystem df /u | grep "Metadata, DUP" | grep -oP '(?<=total=)\d+\.\d+\w' | numfmt --from=iec)
METADATA_USED=$(sudo /sbin/btrfs filesystem df /u | grep "Metadata, DUP" | grep -oP '(?<=used=)\d+\.\d+\w' | numfmt --from=iec)
METADATA_FREE=$(echo "$METADATA_TOTAL - $METADATA_USED" | bc)

echo -e -n "system.btrfs.metadata.total:$METADATA_TOTAL|g"  > /dev/udp/127.0.0.1/8125
echo -e -n "system.btrfs.metadata.used:$METADATA_USED|g"  > /dev/udp/127.0.0.1/8125
echo -e -n "system.btrfs.metadata.free:$METADATA_FREE|g"  > /dev/udp/127.0.0.1/8125

SUBVOLUME_COUNT=$(sudo /sbin/btrfs subvolume list /u | wc -l)
echo -e -n "system.btrfs.subvolume.count:$SUBVOLUME_COUNT|g"  > /dev/udp/127.0.0.1/8125

This is the current fi show (and fi df) output:

sudo /sbin/btrfs filesystem show /dev/mapper/lvroot-u
Label: none  uuid: 0ed47570-e293-44cd-a2db-1395f60bffb3
    Total devices 1 FS bytes used 18.13GiB
    devid    1 size 70.80GiB used 33.04GiB path /dev/mapper/lvroot-u
Btrfs v3.12

sudo /sbin/btrfs filesystem df /dev/mapper/lvroot-u
Data, single: total=26.01GiB, used=16.59GiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=3.50GiB, used=1.55GiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=416.00MiB, used=0.00

The node is running 93 subvolumes at the moment.
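
(For my own sanity: the two outputs above do seem to reconcile once DUP double-allocation is taken into account. This is my reading of btrfs accounting, with the numbers plugged in from above:)

# "devid used" counts space allocated to chunks, and DUP profiles
# allocate every byte twice on this single device:
data_alloc     = 26.01                        # Data, single: total (GiB)
metadata_alloc = 2 * 3.50                     # Metadata, DUP: total, two copies
system_etc     = 2 * 0.008 + 0.004 + 0.008    # System DUP/single, Metadata single
print(data_alloc + metadata_alloc + system_etc)  # ~33.04 GiB ~= "devid 1 ... used"

# "FS bytes used" is logical usage, counting each byte once:
print(16.59 + 1.55)                           # ~18.14 GiB ~= "FS bytes used 18.13GiB"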

@alq666 (Member) commented Nov 26, 2014

Of interest to your current work, @technovangelist.

@remh (Contributor) commented Nov 26, 2014

@phoet Can you contact us by email (support at datadoghq.com)? It would be easier to debug.

The graphs of the system.disk.btrfs.* metrics are probably using the "avg" aggregator, which averages across everything, since the metrics collected by the agent are tagged by replication_type and usage_type:
https://github.com/DataDog/dd-agent/blob/master/checks.d/btrfs.py#L125-L136
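
i.e. something along these lines (an illustrative sketch of the linked lines, not the verbatim check):

from checks import AgentCheck

class BtrfsCheck(AgentCheck):  # illustrative shell, not the actual check
    def report_usage(self, device, replication_type, usage_type, total_bytes, used_bytes):
        # One tagged series per (replication_type, usage_type) combination;
        # graphing system.disk.btrfs.* with "avg" averages across all of them.
        tags = ['usage_type:%s' % usage_type,
                'replication_type:%s' % replication_type]
        self.gauge('system.disk.btrfs.total', total_bytes, tags=tags, device_name=device)
        self.gauge('system.disk.btrfs.used', used_bytes, tags=tags, device_name=device)

Picking a single combination to graph (e.g. filtering on usage_type:data) instead of averaging across all tags should make the numbers line up.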

@phoet (Author) commented Nov 27, 2014

@remh Ah, that makes total sense! LGTM

@remh (Contributor) commented Dec 9, 2014

Closing the issue then; the fix for the "unknown" usage type is going out with 5.1.1.

@remh remh closed this as completed Dec 9, 2014