
prometheus output can return partial results if a collect request is made in-between writing multiple batches #1775

Closed
banks opened this issue Sep 16, 2016 · 36 comments · Fixed by #1780
Labels
bug unexpected problem or unintended behavior

Comments


banks commented Sep 16, 2016

Bug report

We upgraded some servers to the latest deb release in repos.influxdata.com, only because that repo does not keep old versions (I'm filing a separate ticket about that).

The latest version seems to fail in a way I can't diagnose: the log output is the same as before, yet the response from the prometheus output metrics endpoint intermittently contains only a limited subset of the actual data.

It seems to be reporting only the last batch of metrics written, not everything since the last scrape.

In the config below the exec input pulls cgroups memory stats for each container running on the host. It returns ~3000-4000 distinct metrics on every invocation, and that count stays constant for a given number of running containers.

Relevant telegraf.conf:

/etc/telegraf/telegraf.conf

[agent]
collection_jitter = "3s"
flush_interval = "10s"
flush_jitter = "5s"
interval = "15s"
round_interval = true

/etc/telegraf/telegraf.d/default_inputs.conf

[inputs.cpu]
fieldpass = ["usage_*"]
percpu = false
totalcpu = true
[inputs.disk]
fieldpass = ["free", "total", "used", "inodes_used", "inodes_total"]
ignore_fs = ["tmpfs", "devtmpfs"]
mount_points = ["/", "/data_0", "/data"]
[inputs.diskio]
[inputs.exec]
commands = ["/usr/local/bin/capture_memory_metrics"]
data_format = "influx"
timeout = "15s"
[inputs.mem]
fielddrop = ["*_percent"]
[inputs.net]
interfaces = ["eth0", "bond0", "bond1", "br0"]
[inputs.system]
fieldpass = ["uptime", "load*"]

/etc/telegraf/telegraf.d/default_outputs.conf

[[outputs.prometheus_client]]
listen = ":9126"

System info:

telegraf 1.0.0 (old version 0.13.2)

Steps to reproduce:

(This is an assumption, based on the cases that do exhibit it.)

  1. create an exec input that returns > 1000 metrics
  2. run with that input and prometheus output
  3. observe the number of metrics scraped each second on the prometheus endpoint.

Expected behavior:

Should consistently see the correct number of metrics. In our case, this is what happened when we scraped every second and counted the metrics scraped using 0.13.2:

$ for i in {1..999}; do echo -n "Number of capsules metrics scraped:"; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980

Actual behavior:

After upgrade with 1.0.0 we see:

$ for i in {1..999}; do echo -n "Number of capsules metrics scraped:"; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:0
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000

These numbers match the log messages about the number of metrics in each batch, so the bug appears to be that the prometheus output only reports the last batch of metrics written, not everything since the last scrape.


sparrc commented Sep 16, 2016

hi @banks, good sleuthing, but the previous behavior of continuing to report metrics that are not reported anymore was actually a bug that was fixed in 1.0: #1334

my understanding is that this is the intended prometheus behavior.

why? because prometheus endpoints are supposed to work by gathering a set of metrics at the time that the http request is made. If the metrics don't exist at the time that the request is made, then they are not exposed.

I'll re-open this if you can find some documentation that explains otherwise, but my conversations with prometheus devs led me to believe that the 1.0 behavior is the intended behavior.

sparrc closed this as completed Sep 16, 2016

banks commented Sep 16, 2016

I'm pretty sure this is a bug and not a valid fix for #1334.

I followed that issue too as we were also bitten by that bug on other hosts. In fact 0.13.2 fixed that bug for us - I assume the fix was back-ported from 1.0 branch.

This report is different though:

  • the exec script is returning 3980 distinct metrics every invocation - none of the metrics are being removed or going away in the example given
  • those are written to the prometheus output in batches of 1000
  • previous behaviour was to report all of the metrics reported since last scrape
  • current behaviour only reports metrics in last batch
  • since 0.13.2, any metrics that actually stopped being reported did indeed stop showing in the output for us.

In other words, as of 1.0.0 the prometheus output can't report on more than 1000 distinct metrics. Not only that, but the failure mode when you try is a random sampling of metrics that drops an unbounded number of them on the floor.


banks commented Sep 16, 2016

Btw. if there was no intentional backport of the fix for #1334 in 0.13.2 then it's very interesting because we saw it go away.

Sadly it was too long ago to still have the data in prometheus to show, but I may be able to dig out a graph that showed how we dropped about 25% of our ingested-metrics-per-second over the 30 mins it took to roll out, due to the deleted containers' metrics going away as expected.

If that happened without an intentional change in telegraf perhaps it was a prometheus client bug that was fixed in that version or something and in fact the "fix" in 1.0.0 is creating this whole new problem?

Please re-open this; the prometheus output is unusable in production on 1.0.0 for anyone who might have more than 1000 metrics on a telegraf instance, as far as I can see.


banks commented Sep 16, 2016

For the record here:

[graph: number of distinct containers reporting metrics during the 0.13.1 -> 0.13.2 rollout]

That was the number of distinct containers we had metrics for over the time we rolled out 0.13.1 -> 0.13.2. I checked and none of the live containers I could find were missing them and all recently-deleted containers had their metrics stop as expected.

I didn't look further and just assumed that the #1334 fix had been backported and released. If that's not the case then I think #1334 didn't really address the underlying issue, and the "fix" potentially caused this new one.


sparrc commented Sep 16, 2016

I see what you mean, I'll reopen

sparrc reopened this Sep 16, 2016
sparrc added the bug label Sep 16, 2016
sparrc added this to the 1.1.0 milestone Sep 16, 2016

sparrc commented Sep 16, 2016

I can see exactly what's wrong here, and this should actually be a fairly simple fix, so I'm going to bump it up to 1.0.1

this map: https://github.com/influxdata/telegraf/blob/master/plugins/outputs/prometheus_client/prometheus_client.go#L95 should be getting remade on calls to Collect, not on Write

in the meantime, you should be able to bump up metric_batch_size in the [agent] config section to work around this: https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L35
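
For example, a minimal sketch of that workaround (the 20000 here is an arbitrary illustrative value, not a recommendation; it just needs to exceed the largest number of metrics a single collection is expected to produce):

[agent]
  ## illustrative value: must be larger than the biggest collection you
  ## expect, so a flush never has to be split across multiple batches
  metric_batch_size = 20000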

sparrc modified the milestones: 1.0.1, 1.1.0 Sep 16, 2016

banks commented Sep 16, 2016

Sounds exactly like the issue :) Thanks for the quick response.


banks commented Sep 16, 2016

Incidentally my assumption that it was caused by having more metrics than the buffer seems to be wrong.

In the process of trying to patch hosts I found one host reporting only 300 metrics, and it still showed them inconsistently: sometimes blank and sometimes all of them, even in the few seconds I observed:

$ for i in {1..999}; do echo -n "Number of capsules metrics scraped:"; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
Number of capsules metrics scraped:0
Number of capsules metrics scraped:0
Number of capsules metrics scraped:0
Number of capsules metrics scraped:0
Number of capsules metrics scraped:0
Number of capsules metrics scraped:0
Number of capsules metrics scraped:300
Number of capsules metrics scraped:300
Number of capsules metrics scraped:300
Number of capsules metrics scraped:300
Number of capsules metrics scraped:300

After downgrading to 0.13.2 this stopped too. Not sure whether or not that is explained by your finding yet... (maybe a timing issue even if writes are not larger than the buffer size?)


banks commented Nov 1, 2016

Hi, @sparrc

I've finally gotten around to testing 1.0.1 ready to roll out to our servers, and I've discovered that this issue seems NOT to be resolved in the latest 1.0.1 binary.

On my hosts before (0.13.2) I could run my little for loop from the original issue description:

$ telegraf --version
Telegraf - version 0.13.2
$ for i in {1..999}; do echo -n "Number of capsules metrics scraped:"; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
Number of capsules metrics scraped:4050
Number of capsules metrics scraped:4050
Number of capsules metrics scraped:4050
Number of capsules metrics scraped:4050
Number of capsules metrics scraped:4050
Number of capsules metrics scraped:4050
Number of capsules metrics scraped:4050
Number of capsules metrics scraped:4050

On my canary host (I wouldn't expect an identical number of metrics, but it should be equally consistent) I see:

$ telegraf --version
Telegraf - version 1.0.1
$ for i in {1..999}; do echo -n "Number of capsules metrics scraped:"; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
Number of capsules metrics scraped:0
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:122
Number of capsules metrics scraped:122
Number of capsules metrics scraped:122
Number of capsules metrics scraped:122
Number of capsules metrics scraped:122
Number of capsules metrics scraped:3987
Number of capsules metrics scraped:3987
Number of capsules metrics scraped:3987
Number of capsules metrics scraped:3987
Number of capsules metrics scraped:3987
Number of capsules metrics scraped:3987
Number of capsules metrics scraped:3987
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:123
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988
Number of capsules metrics scraped:3988

The log output on the new host is just as before and shows no errors:

2016/11/01 17:50:25 Output [prometheus_client] buffer fullness: 122 / 10000 metrics. Total gathered metrics: 136124. Total dropped metrics: 0.
2016/11/01 17:50:25 Output [prometheus_client] wrote batch of 122 metrics in 5.367323ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 53.550333ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 37.459509ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 31.184277ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 47.910489ms

So we expect 4122 metrics per scrape in this case but never get it...

So it seems the fix was not the full story. I think this is the same issue as before, not fixed as we expected, although it's possible it's the same symptom from a different cause.

@sparrc would you prefer I reopen this issue or create a new one with a link back?


banks commented Nov 1, 2016

FWIW my wild guess is that although the output is no longer clearing the map on each Write, it is clearing it at the end of each collection, so there is a race where prometheus might (and often does) scrape before all batches are written and sees the result of only some of the Writes... i.e. it still isn't seeing everything collected since the last scrape.

This is almost as bad as the original issue here if I'm correct, but less simple to fix.


sparrc commented Nov 2, 2016

doesn't this add up to 4122 metrics, which is what you were expecting?:

2016/11/01 17:50:25 Output [prometheus_client] wrote batch of 122 metrics in 5.367323ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 53.550333ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 37.459509ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 31.184277ms
2016/11/01 17:50:32 Output [prometheus_client] wrote batch of 1000 metrics in 47.910489ms

I think that the consistency can be fixed by making your output batch size larger than the number of metrics you expect to receive, ie, set metric_batch_size = 5000. If you do this and run your script, I think you will see consistent results every time.

The problem is that Telegraf can only "Lock" the output on each write call, which gets called once for every batch. So if your batch size is smaller than the number of metrics then the metrics will be split across multiple writes (and multiple locks). If you then make an HTTP request in-between multiple writes you will get split data. I'm not sure there's any other way to get around this other than having some very special handling of the prometheus output plugin coded into the core of telegraf.


banks commented Nov 2, 2016

doesn't this add up to 4122 metrics, which is what you were expecting?:

Yes, sorry, that's confusion on my part. I expected 4122 because I added those up - my point is that those log messages look correct and are the same for 1.0.1 as for 0.13.2, despite the output of the for loop above being different.

I can't explain why we saw 3988 in the output though - it was consistently less than 100%, but it's also not a simple combination of the batches reported in the logs. No filtering is configured on the output; it might be possible that I have duplicate metric names that are being de-duped by the prometheus output, but that doesn't seem to happen with 0.13.2. I need to be more scientific with testing that though to draw any conclusions.

I think that the consistency can be fixed by making your output batch size larger than the number of metrics you expect to receive

I imagine that would work but it feels like a hack. My issues with it are:

  1. Presumably batching was done for a reason - if there is no tradeoff making batches arbitrarily large why implement the complexity of writing in batches in the first place?
  2. Number of metrics is not necessarily known up front - we may have systems that add new metrics at will by different development teams and operations would need to keep track of how many metrics are being recorded per host and constantly ensure the batch size is big enough. It's not even like the effects are obvious like a disk filling up when it happens so we can detect and react. In practice Prometheus may scrape each metric often enough that it's hard to even tell there is missing data, we just lost resolution for a random sample of metrics for most of the time...
  3. We have some hosts like this example with ~4.5k metrics now but we may well end up with one or two orders of magnitude more and are hoping telegraf will scale to that. Having to make batch size be 50x or 100x the default seems like we are no longer using the system as it was intended and are likely to run into all sorts of issues like locks being held too long causing contention etc.
  4. For Telegraf + Prometheus in general it seems to be a broken state that users have to figure out they need to increase batch size to whatever number is right for them to avoid losing data silently. It's especially bad because it's very likely they won't notice anything is wrong during testing and even testing at scale as the gaps are random and often smoothed over by Prometheus..

If my understanding of the issue is correct, then the clean solution would be closer to the old prometheus output where metrics are registered and the current value (the last one written by any batch) is reported on every scrape. Then to fix the issue of old metrics not being unregistered, you'd need something more elaborate that kept a map of the last write time for each metric and swept through and unregistered any that had not been written for X collection cycles or seconds.

I may well be able to contribute this fix or an alternative if you have a preference.


sparrc commented Nov 2, 2016

  1. Presumably batching was done for a reason - if there is no tradeoff making batches arbitrarily large why implement the complexity of writing in batches in the first place?

You need to understand that the prometheus output is a bit different than any other telegraf output. "Writing" a batch to the prometheus output is more like just registering metrics on the http endpoint. Batching is really intended more for the sake of the more traditional write endpoints (InfluxDB, graphite, openTSDB, MQTT, AMQP, Kafka, etc).

  2. Number of metrics is not necessarily known up front - we may have systems that add new metrics at will by different development teams and operations would need to keep track of how many metrics are being recorded per host and constantly ensure the batch size is big enough. It's not even like the effects are obvious like a disk filling up when it happens so we can detect and react. In practice Prometheus may scrape each metric often enough that it's hard to even tell there is missing data, we just lost resolution for a random sample of metrics for most of the time...

I think that you should be able to come up with a max batch size that you reasonably can expect to not exceed.

  3. We have some hosts like this example with ~4.5k metrics now but we may well end up with one or two orders of magnitude more and are hoping telegraf will scale to that. Having to make batch size be 50x or 100x the default seems like we are no longer using the system as it was intended and are likely to run into all sorts of issues like locks being held too long causing contention etc.

I'm afraid I can't really speak to that, but as I said before if you're only using the prometheus output then I wouldn't worry too much about the batch size.

That being said, I would caution against your planned architecture... exposing 450,000 metrics over an http interface, then scraping and parsing that endpoint every 10s, would have an enormous overhead, and I'm not entirely sure how well the prometheus client library performs at that scale.

  4. For Telegraf + Prometheus in general it seems to be a broken state that users have to figure out they need to increase batch size to whatever number is right for them to avoid losing data silently. It's especially bad because it's very likely they won't notice anything is wrong during testing and even testing at scale as the gaps are random and often smoothed over by Prometheus..

It might be a documentation issue, I wouldn't be opposed if you submitted a PR explaining that users of the prometheus output may want to increase their batch size to an arbitrarily large number, if that is what works best.

If my understanding of the issue is correct, then the clean solution would be closer to the old prometheus output where metrics are registered and the current value (the last one written by any batch) is reported on every scrape. Then to fix the issue of old metrics not being unregistered, you'd need something more elaborate that kept a map of the last write time for each metric and swept through and unregistered any that had not been written for X collection cycles or seconds.

I don't think that keeping a map of all values is a good solution, especially not a good large-scale solution (that map will be in-memory and can grow quite large).

I haven't looked into it too much but I think my preference would be to have a special RunningOutput just for the prometheus output, which can be locked while it is writing a "set" of batches.


banks commented Nov 2, 2016

Thanks for that @sparrc

Batching not being relevant for prometheus is useful to know.

It might be a documentation issue, I wouldn't be opposed if you submitted a PR explaining that users of the prometheus output may want to increase their batch size to an arbitrarily large number, if that is what works best.

Yeah if that is a viable solution then this would be necessary. I'll see what happens in our tests.

That being said, I would caution against your planned architecture....exposing 450,000 metrics over an http interface, then scraping and parsing that endpoint every 10s would have an enormous overhead and I'm not entirely sure how well the prometheus client library performs at that scale.

Interesting. The way I see it is that if we do end up with ~100k metrics per host, which is not impossible, then exposing them all through a single scrape in Telegraf is less overhead than running hundreds of prometheus clients per host and having to scrape the same amount of data overall using hundreds of separate http requests... Obviously we have more research to do to verify how well the components scale and at what cost.

I don't think that keeping a map of all values is a good solution

That was a sketch suggestion for a "correct" (matching Prometheus' expectations) option; in practice, with a bit of design, I think it could be done with very little overhead compared to the maps the output already keeps in memory.

Prometheus' expectations are that hitting the endpoint should just return the most recent value of all metrics we care about. It used to do this. In trying to work around the never unregistering metrics issue (which is real and bites us too) we've kinda broken that basic assumption: now hitting the collect endpoint actually causes state change in the metrics/lastMetrics maps - potentially wiping older metrics that have not yet had a chance to be written for the most recent collection period.

(Effectively) Infinite batch size masks that problem only because it ensures p.metrics is never a partial result, but if we require large-enough batches for correctness anyway, then the current dance with p.lastMetrics and p.metrics is unnecessary and we should really go back to just having one map that gets wiped and rebuilt each Write().

To put it another way:

  • either the previous fix for this issue is wrong and should be reverted in conjunction with docs update/RunningOutput suggestions,
  • or it's valid to want to keep batching and we should augment the current solution to actually have correct behaviour - i.e. to only "drop" metrics when they've not been collected for a full cycle (or more) rather than having prometheus endpoint hits potentially modifying state*.

Do you agree with that analysis or have I missed something again?

I haven't looked into it too much but I think my preference would be to have a special RunningOutput just for the prometheus output

Interesting. That's certainly cleaner than relying on Prometheus users noticing the warning about batch size in the docs and never combining with another output...


  • There is an additional downside to the current implementation: if there are multiple prometheus servers scraping a single endpoint (which is common given that this is the recommended way to get HA metrics storage) one scrape might see one partial result and the other server will see another partial result. Now we have two prometheus servers that are both missing metrics but not even missing the same metrics.

sparrc changed the title from "1.0.0 upgrade causes metrics to be dropped in prometheus output, > 50% loss of data" to "prometheus output can return partial results if a collect request is made in-between writing multiple batches" Nov 2, 2016
sparrc reopened this Nov 2, 2016

sparrc commented Nov 2, 2016

you might be right that collecting the prometheus metrics in one place is better, especially if the machine exposing the http endpoint with that many metrics isn't expected to perform other heavy duties.

I will reopen the issue because it's probably worth revisiting how we handle the prometheus output. Basically the problem is that prometheus expects that the metrics get collected at the time that the http request is made. Telegraf obviously has a different model where it collects metrics on a set interval and then just gives prometheus its most recent metrics when asked.

Another solution that I think might be a lot simpler: every time a "Write" request comes in we increment a counter, and decrement it when the request finishes. The "HTTP" requests are then only allowed to grab the lock when the Write counter == 0... Go has a nice sync.WaitGroup object designed for this.

ie, every "Write" goes straight to the front of the line, and "HTTP" requests need to wait for all "Write" requests to finish before they are processed.
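
A minimal sketch of that idea, using made-up names (this is not the actual plugin code); it uses a counter guarded by a sync.Cond rather than a bare sync.WaitGroup so the gate can be reused safely across flushes:

package promsketch

import "sync"

// writeGate lets Write calls go "to the front of the line": HTTP scrape
// handlers wait until no Write is in flight before reading the metric map.
type writeGate struct {
    mu      sync.Mutex
    cond    *sync.Cond
    writers int // number of Write calls currently in progress
}

func newWriteGate() *writeGate {
    g := &writeGate{}
    g.cond = sync.NewCond(&g.mu)
    return g
}

// writeStart is called at the beginning of every Write (one per batch).
func (g *writeGate) writeStart() {
    g.mu.Lock()
    g.writers++
    g.mu.Unlock()
}

// writeDone is called when that Write finishes.
func (g *writeGate) writeDone() {
    g.mu.Lock()
    g.writers--
    if g.writers == 0 {
        g.cond.Broadcast() // wake any scrape waiting for writes to drain
    }
    g.mu.Unlock()
}

// waitForWriters is called by the HTTP handler before serving a scrape.
func (g *writeGate) waitForWriters() {
    g.mu.Lock()
    for g.writers > 0 {
        g.cond.Wait()
    }
    g.mu.Unlock()
}

Note the caveat that surfaces later in this thread: this only closes the window while a Write is actually in flight, so if there is any gap between the batches of one flush a scrape can still land in between them.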

sparrc modified the milestones: 1.2.0, 1.0.1 Nov 2, 2016

banks commented Nov 2, 2016

Cool thanks.

I didn't quite follow your suggestion. It sounds like something I was considering, but I think it would need a change to how Telegraf calls Write to indicate whether or not there are more metrics to be collected in the current collection cycle.

How do you propose informing the output that there are outstanding "Write" requests on the lock?

Given a mechanism like I described (and like I think you are implying would be needed), the solution is pretty easy.


sparrc commented Nov 2, 2016

My hope was that the prometheus plugin could service this lock internally, delaying HTTP requests until all metrics are ready.


sparrc commented Nov 2, 2016

giving this a second thought, I'm actually not sure there is anything we can do to guarantee that this won't happen. Telegraf deals with metrics individually as they stream to the output plugins, so there isn't really any concept of "groups" of metrics that should or shouldn't get exposed together (in the prometheus sense).

as to the behavior in 0.13.2, I think that was even worse in some ways. In that case, you were missing the same data from the current interval, but the plugin was just mixing data from the current & previous intervals as if they were collected at the same time.

One thing we could do is set some sort of arbitrary waiting period to service the HTTP requests. Basically we would set a certain waiting period, and if no new metrics arrive within that period then we can respond to the HTTP request, otherwise we would wait for any incoming metrics to finish arriving.
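
A rough sketch of that waiting-period idea, again under assumed names (quietGate and quietPeriod are made up for illustration): each Write stamps the time, and the HTTP handler waits until no new batch has arrived for the configured period.

package promsketch

import (
    "sync"
    "time"
)

// quietGate implements the proposed "quiet period": a scrape is served only
// once no new batch has been written for quietPeriod.
type quietGate struct {
    mu        sync.Mutex
    lastWrite time.Time
}

// markWrite is called at the end of every Write (i.e. per batch).
func (q *quietGate) markWrite() {
    q.mu.Lock()
    q.lastWrite = time.Now()
    q.mu.Unlock()
}

// waitForQuiet blocks the HTTP handler until writes have gone quiet.
func (q *quietGate) waitForQuiet(quietPeriod time.Duration) {
    for {
        q.mu.Lock()
        since := time.Since(q.lastWrite)
        q.mu.Unlock()
        if since >= quietPeriod {
            return
        }
        time.Sleep(quietPeriod - since)
    }
}

The obvious trade-off is that it adds latency to every scrape, and the period itself is another arbitrary knob to tune.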


banks commented Nov 2, 2016

@sparrc thanks for the thought. It was my assumption, based on logs and config etc, that Telegraf has distinct collection "passes" - i.e. it does have a concept of a group of metrics collected together that is just hidden from the outputs currently.

If that is really not the case then you're right, it's pretty hard to reconcile the model with Prometheus's assumptions.

I'm going to think some more about this and dig into the code. If I come up with any better ideas I'll report back. The timing option might be the best we can do but it feels like a hack to me. Almost like we'd be better off just saying outright in the docs that the prometheus output doesn't really work very well and explaining the issue, rather than fudging it to make it sorta work mostly kinda.

In my mind the old behaviour of reporting the last known value of a metric, even if it was collected in a previous pass, is "correct" at least in the sense of how prometheus expects things to work. You have to accept that counter values might end up incrementing effectively one sample later than otherwise etc, but if you use Telegraf or any other agent then there is some delay/time smearing going on anyway, and in Prometheus' case there isn't really a better option.

Anyway will return if I have further insight to share.


trastle commented Nov 3, 2016

I have encountered this same issue.

Reading through your notes, the behaviour in V0.13.2 would be much more useful for my use case than the V1.0.0 behaviour. For me, old data (V0.13.2) is less harmful than lost data (V1.0.0). However it does cause a problem where the lack of new data from a host is masked indefinitely.

The suggestion from @sparrc of delaying http responses from the prometheus client until no new metrics have arrived for a configurable amount of time would work well for my use case, as I see many metrics coming in batches from the poll and then a gap until the next interval.

@banks what was your workaround with V1.0.1?


banks commented Nov 4, 2016

@trastle increasing batchsize is a theoretical workaround with a few unpleasant caveats as noted. Honestly I've not even tried it yet although it should work.

The more I look at this, the more I see fundamental mismatches between what telegraf does and what Prometheus expects.

I've not finished looking yet, but I'm sadly leaning towards replacing telegraf in our setup with a combination of Prometheus' node_exporter and pushgateway and/or a custom extremely simple collector daemon that runs our custom collection scripts synchronously on scrapes. One of the reasons I'm not fully decided is that the other solutions all require subtle design decisions to avoid all the same issues telegraf has/had (old metrics not being dropped implicitly etc) and will end up being very similar to telegraf + batchsize workaround in practice. I still feel like that has too many gotchas to be a long term solution though.

I'll update again when we've made a choice here.


sparrc commented Nov 4, 2016

@trastle increase metric_batch_size in the [agent] section of your config to a number larger than you'd expect a single "collection" to take.

it's true that using telegraf for prometheus output is somewhat of a shoehorn. The typical workflow of a prometheus exporter is to collect metrics on an http request, which means that there is always a "batch" of metrics that are collected & exposed together.

we could presumably modify the core of telegraf to have a "prometheus mode" where it collected metrics on request, rather than on an interval. There actually was a PR to do that in the early days of telegraf that we were resistant to, but it could possibly be revisited especially given the popularity of prometheus.


sparrc commented Nov 4, 2016

@banks I'd appreciate if you could test with a very large batch size setting no matter which option you decide on. I'm thinking that the simplest solution to this would actually be to just set the batch size for the prometheus output to a very high number (10,000,000?) when the output is created (ignoring the metric_batch_size setting).

From what I can tell, this can be set to an almost arbitrarily high number because the memory is not allocated until it is actually filled with a metric (though it may allocate some "virtual" memory before that point)


banks commented Nov 4, 2016

I've had a few other duties to attend to this week, which have limited my time for figuring out the best option here beyond discovering the issue in the first place and reading up on things out of hours.

It's on my TODO list though to test with the hack fix and to review telegraf code base more thoroughly to see if there is a cleaner solution.

To be honest though, your suggestion of hard-coding it for the Prometheus output alone might not be a bad idea, if it's possible without affecting other outputs. I'll look further.

My mind flip-flops - it turns out the official Prometheus node_exporter (or some inputs for it) essentially works the same way as telegraf anyway - gathering metrics on an interval, caching them, and then reporting them on the next scrape. This is also more-or-less the behaviour you'd get from running push gateway and PUTing the results of a custom script to it on an interval.

If telegraf can behave that way without hacks/caveats and out of the box then it's actually fine but with much more flexibility over inputs and pruning metrics you don't care about on the host rather than centrally on the server (which matters if you care about bandwidth as well as just Prometheus storage etc).

Thanks for your patience, I'll try to produce something constructive for Telegraf here either way.



sparrc commented Nov 4, 2016

FWIW, in my mind I wouldn't call increasing the batch size a "hack", I think it's simply the correct behavior for a prometheus exporter. The fact that it needs to be increased manually at the moment is inconvenient, but it's because the setting makes more sense from a more standard "push" output perspective.


banks commented Nov 4, 2016

@sparrc I agree: the "hack" is doing it manually in a global config. This is a hack because:

  • even if documented, it's surprising behaviour that will be overlooked and catch people out
  • the failure mode is more or less silent but fundamentally compromises accuracy/availability of the data
  • it limits Prometheus users to never being able to use any other output in conjunction with Prometheus without possibly compromising performance or causing issues with the other outputs.

If we can find a clean way to bump that batch size for Prometheus output without manual config, and without affecting other outputs then I agree it's (theoretically) a viable solution and not a hack.

I'll look into this.

At first glance it looks like we'd need to change NewRunningOutput to have a case that overrides batch size cleanly for Prometheus.

Off the top of my head, we could add a new interface like:

type OutputWithBatchingOpinions interface {
    MinBatchSize() int
    MaxBatchSize() int
}

Those could return 0 (or a negative) for no preference or a positive int if they care, and then NewRunningOutput could do its current logic, then type assert OutputWithBatchingOpinions on the output and, if it is one, apply min/max clamping to whatever config value has been provided (potentially with log info output so people can see why their config was ignored?).

That would seem like a clean, general purpose solution that would allow Prometheus to clamp min to a very high number and so work in all cases out of the box without affecting other outputs. Added benefit, no other Output code nor core code would need to change to accommodate.
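
As a rough illustration of that clamping (everything here is hypothetical - the interface, MinBatchSize/MaxBatchSize, and the surrounding variables don't exist in Telegraf today):

if o, ok := output.(OutputWithBatchingOpinions); ok {
    // sketch only: clamp the configured batch size to the output's stated
    // preferences, logging when we override the user's config
    if min := o.MinBatchSize(); min > 0 && batchSize < min {
        log.Printf("I! output %q needs a batch size of at least %d, overriding configured %d", name, min, batchSize)
        batchSize = min
    }
    if max := o.MaxBatchSize(); max > 0 && batchSize > max {
        log.Printf("I! output %q caps batch size at %d, overriding configured %d", name, max, batchSize)
        batchSize = max
    }
}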

Any thoughts on that feel free to share - was based on a few mins of looking at the code on GitHub so may not be ideal!

If we did this, we could effectively revert the last PR on this issue and go back to assuming each Write totally replaces the metrics to report.

I also need to verify that bumping batch size is correct/works before I get too ahead of myself with code...


sparrc commented Nov 4, 2016

TBH I was thinking something quite a bit simpler: after calling NewRunningOutput we can just check if we are creating a prometheus output:

if strings.HasPrefix(name, "prometheus") {
    if batchSize < 1000000 {
        // log that we are auto-increasing the prometheus batch size
        log.Println("I! blahblah")
        batchSize = 1000000
    }
}

this would just need to get inserted at https://github.com/influxdata/telegraf/blob/master/internal/models/running_output.go#L44


banks commented Nov 4, 2016

Hmm thinking more: is this even a correct solution?

Given that flushing right now is not done in strict collection passes, even if we make the batch size arbitrarily large, it will only work as expected if all inputs are on the same interval and collect at exactly the same time so all the metrics arrive in one batch, right? Nothing guarantees this even if they are all on the same interval now, as far as I can see... especially if you have exec inputs that take a non-trivial amount of time (we do).

If you were to have 2 different inputs with different intervals, you'd never have all the metrics in one batch regardless of batchsize.

I think that is the fundamental issue rather than the batching. Am I wrong about that?

This also makes me nervous about testing big batches out at all, because it may seem to work for us now but may not be a good solution, or may have harder-to-spot but still present cases where it drops stuff...


sparrc commented Nov 4, 2016

what this does guarantee is that when the metrics are getting written (on flush_interval), whatever metrics have been reported at that time will get written together (and thus can never be split across multiple prometheus requests)

any metrics that arrive later would get added to the next flush, which to me seems like the logical thing to happen.

If you were to have 2 different inputs with different intervals, you'd never have all the metrics in one batch regardless of batchsize.

You would have the metrics together whenever they arrive within the same window. Let's say pluginA reports every 10s and pluginB every 13s. Then let's say that you make an HTTP request every 10s. What you'd get is this:

request 00:10: pluginA (collected at 00:10)
request 00:20: pluginA (collected at 00:20) + pluginB (collected at 00:13)
request 00:30: pluginA (collected at 00:30) + pluginB (collected at 00:26)
request 00:40: pluginA (collected at 00:40) + pluginB (collected at 00:39)
request 00:50: pluginA (collected at 00:50)
request 00:60: pluginA (collected at 00:60) + pluginB (collected at 00:52)

(NOTE that in this example I am also assuming that flush_interval = "10s")

so, from my perspective this is correct, because the only alternative would be to potentially keep metrics which no longer exist. (ie, prometheus metrics are only "registered" when they've actually been collected in the previous interval).


ragalie commented Nov 4, 2016

I think there are three intervals at play: input plugin collection interval, telegraf flush/write interval and prometheus scrape interval.

To me the core difficulty is in the interplay between telegraf flush and prometheus scrape. If telegraf flushes too quickly, then it's possible for prometheus to miss metrics. For example:

collect 00:00: plugin B
flush 00:05
flush 00:10
collect 00:13: plugin B
scrape 00:14
flush 00:15
flush 00:20
flush 00:25
collect 00:26: plugin B
scrape 00:29
flush 00:30

I don't think we ever actually get a plugin B reading in that case in prometheus, unless I'm missing something.

Now it's possible with some good documentation this could be ameliorated, but it's a lot of moving pieces to keep track of at once. I think Paul's suggestion earlier in the thread of keeping the most recent observation of each metric in the map for a longer period of time is a more user-friendly solution to the issue personally.


banks commented Nov 4, 2016

Heh, I was just writing up an elaborate example that said the same as Mike. Even if you consider that the flush interval is set to the minimum of the input intervals, the collections and flushes take time too, and there are clearly races that can mean some plugins' output is missed entirely on some scrapes due to being overwritten by another flush before the scrape can happen.

As a minimal example, assume the input timings in your example @sparrc, but have Prometheus only scrape every 30 seconds: if the scrapes occurred at 00:20 and 00:50, the second scrape would have missed all of pluginB's output since the previous scrape, even though there was some...

I think going back to keeping all the metrics in memory in the output and keeping enough additional data to be able to remove things that have not reported for X seconds is the only option that really works. There is overhead for sure but I think it could be minimised to acceptable levels with careful design (especially since you could just clean up while iterating the metrics on a scrape where we already visit every one anyway so no additional work needed to iterate them all to implement the timeouts..).

Any other correct solution I can think of would change too much about Telegraf to be worth considering IMO - even if it was optional it would probably remove a lot of the benefits Telegraf has by making features like separate intervals per input unusable etc.


sparrc commented Nov 4, 2016

@ragalie a flush with no metrics is a no-op, and so at scrape 00:14 you would actually get the collect 00:00: pluginB. But that being said I'm sure we can construct another example where things get dropped, and I agree that the interfacing between these three intervals is complicated & opaque.

Alright! I think the solution will be to have an "expiration" setting for the prometheus output. Every metric that gets written will set its expiration time to now() + expiration. When the metrics are being reported to the prometheus endpoint (on an http request) we will check if the metric has expired. If expired then we drop the metric until it gets written again.

Would this be an OK solution for everyone?
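
A minimal sketch of that approach under made-up names (this is not the code that eventually shipped in #2016, just the shape of the idea):

package promsketch

import "time"

// sample is one registered metric value with its expiration deadline.
type sample struct {
    value      float64
    labels     map[string]string
    expiration time.Time
}

// registry holds the most recently written value of every metric.
type registry struct {
    expiration time.Duration     // e.g. a multiple of the collection interval
    samples    map[string]sample // keyed by metric name + label fingerprint
}

// add is called from Write for every metric in every batch: register (or
// refresh) the sample and push its expiration forward.
func (r *registry) add(key string, s sample) {
    s.expiration = time.Now().Add(r.expiration)
    r.samples[key] = s
}

// collect is called on each HTTP scrape: report everything still fresh and
// prune anything that has not been written recently enough.
func (r *registry) collect() []sample {
    now := time.Now()
    out := make([]sample, 0, len(r.samples))
    for key, s := range r.samples {
        if now.After(s.expiration) {
            delete(r.samples, key) // expired: drop until it is written again
            continue
        }
        out = append(out, s)
    }
    return out
}

A real version would also need the plugin's existing locking around the map, but the key property is that a scrape never wipes state: Write refreshes, Collect only reads and prunes.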


banks commented Nov 4, 2016

Sounds good to me!

Thanks for your effort on this @sparrc we really appreciate the time in this conversation and putting up with a lot of verbosity and mind-changing from me :).

Either Mike or I would probably be able to contribute a PR for that proposal if that would help.


trastle commented Nov 4, 2016

I have found the same thing as @ragalie: the flush_interval is key. I need to make sure metrics collection is completed before flushing to the prometheus output, or metrics are lost.

In my environment I am collecting some 50k metrics per interval. This takes about 350 seconds. I am allowing 600 seconds so I have plenty of headroom.

I was testing this today using V1.0.0 with the following config:

[agent]
  interval = "600s"
  flush_interval = "600s"
  round_interval = false
  metric_batch_size = 200000
  metric_buffer_limit = 2000000

[[outputs.prometheus_client]]
  listen = ":9126"

[[inputs.exec]]
  timeout = "600s"
  commands = [   ...   ]

This is working quite well for me.
My log looks as follows:

2016/11/04 13:41:36 Output [prometheus_client] wrote batch of 53697 metrics in 50.817863898s
2016/11/04 13:50:44 Output [prometheus_client] buffer fullness: 53617 / 2000000 metrics. Total gathered metrics: 4591205. Total dropped metrics: 0.
2016/11/04 13:51:38 Output [prometheus_client] wrote batch of 53617 metrics in 53.024765754s
2016/11/04 14:00:44 Output [prometheus_client] buffer fullness: 53682 / 2000000 metrics. Total gathered metrics: 4644887. Total dropped metrics: 0.
2016/11/04 14:01:38 Output [prometheus_client] wrote batch of 53682 metrics in 52.685866545s
2016/11/04 14:10:44 Output [prometheus_client] buffer fullness: 53657 / 2000000 metrics. Total gathered metrics: 4698544. Total dropped metrics: 0.
2016/11/04 14:11:34 Output [prometheus_client] wrote batch of 53657 metrics in 48.475608566s
2016/11/04 14:20:44 Output [prometheus_client] buffer fullness: 53657 / 2000000 metrics. Total gathered metrics: 4752201. Total dropped metrics: 0.
2016/11/04 14:21:27 Output [prometheus_client] wrote batch of 53657 metrics in 42.113429451s
2016/11/04 14:30:44 Output [prometheus_client] buffer fullness: 53682 / 2000000 metrics. Total gathered metrics: 4805883. Total dropped metrics: 0.
2016/11/04 14:31:32 Output [prometheus_client] wrote batch of 53682 metrics in 46.925494062s

I am running 3 Telegraf servers, each publishing roughly the same number of metrics each interval. Prometheus is scraping each Telegraf server every 60s.

As best I can tell this has resolved the loss I was seeing with a smaller flush_interval.

An expiration for each metric would be much cleaner.


ragalie commented Nov 4, 2016

@sparrc I have a PR on this, just need to get it cleared and sign the CLA and I'll open it up. Thanks again!


sparrc commented Nov 15, 2016

closed by #2016

sparrc closed this as completed Nov 15, 2016