Duplicate metric events in Telemetry endpoint #536
Comments
Thanks for reporting this @eliasbrange. The behavior you're describing isn't what I'd expect from a Prometheus-formatted endpoint: a single metric counter should increase monotonically, not emit multiple lines or counters with a single event value. Could you paste a snippet of everything else being logged around when this starts (I don't need all the lines, obviously)? Is it just a single metric named …? A test configuration/setup is always helpful and helps me debug quicker, but I'm quite sure I can reproduce this. Sounds like I only need to push metrics into ContainerPilot.
I guess the config doesn't actually matter, and I don't think defining a metric in the config matters either. Anyway, here's a little config.

Then I run the following commands, with the following output:

As you can see, each call to `putmetric` …
Thank you, your configs definitely helped, because I wasn't seeing the issue until I tried your setup. So yes, you've found a bug, and possibly a race condition: hammering the endpoint in a for loop causes multiple events to be created, while sending them more slowly increments an existing event. Notice below that …
Looks like everything is working fine except logging. We're logging each unique counter event by value. This is especially prevalent when counting in varying increments, which I hadn't noticed before.
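The bug described above can be sketched as a small simulation (this is illustrative Python, not ContainerPilot's actual Go internals; the function and variable names are made up):

```python
# If each counter event is keyed by name *and* value, every distinct
# increment creates a new entry instead of bumping one counter.
from collections import defaultdict

def record_buggy(registry, name, value):
    # Bug: the value is part of the key, so "my_counter=1" and
    # "my_counter=2" become separate series.
    registry[f"{name}={value}"] += 1

def record_fixed(registry, name, value):
    # Fix: key by metric name only and accumulate the value.
    registry[name] += value

buggy, fixed = defaultdict(int), defaultdict(int)
for v in (1, 2, 1, 3):
    record_buggy(buggy, "my_counter", v)
    record_fixed(fixed, "my_counter", v)

print(len(buggy))            # 3 entries for one logical counter
print(fixed["my_counter"])   # 7: a single monotonically increasing value
```

With varying increments, the buggy keying produces one entry per distinct value, which matches the duplicate lines seen on the endpoint.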
We're reporting events through our event bus's Prometheus collector. It looks like I can remove the value portion of the event name when collecting …. Another option would be to stop recording Metric events altogether, since end users consume their Prometheus counter values themselves. Why do we need to duplicate that information through the endpoint? Do we really need to know about each Metric event if the counter or gauge is what we're ultimately deriving value from? Any thoughts? Going to think about this.
After adding a single if statement around Metric collection, I'm feeling good about option 2 and removing Metric event collection altogether. In hindsight it feels a bit overzealous. Also, I don't believe we had many consumers of the telemetry endpoint in that regard, so it's probably safe to remove those events.
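A hedged sketch of what "a single if statement around Metric collection" could look like (the `EventCode` names are hypothetical, for illustration only; the real fix lives in ContainerPilot's Go event bus):

```python
# Option 2: skip recording Metric events as their own series, since the
# user-defined Prometheus counter/gauge already exposes the value.
from enum import Enum, auto

class EventCode(Enum):
    METRIC = auto()
    STARTUP = auto()

def collect(event_code, counts):
    # The single guard: Metric events are not collected.
    if event_code is EventCode.METRIC:
        return
    counts[event_code] = counts.get(event_code, 0) + 1

counts = {}
collect(EventCode.METRIC, counts)
collect(EventCode.STARTUP, counts)
print(EventCode.METRIC in counts)  # False: no duplicate Metric series
```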
Thanks for the quick iterations on this one @cheapRoc. Much appreciated.
Just released in version 3.6.1, thanks again for the report!
We use `containerpilot -putmetric` to gather certain fluctuating metrics such as memory usage and free disk space. We noticed that after running a container in production for some time, the `/metrics` endpoint generates over 100k lines, with lines of the form …

Since the metrics fluctuate, this means we get a few new events every 15 seconds or so. My guess is that this will continue indefinitely until the container is restarted. For counters, which can only ever go up, this also means a new event line every time you do `putmetric`.

My question is: is this expected behaviour? Filtering in e.g. InfluxDB on `source` is now impossible, and the endpoint is getting slower and slower due to the massive number of events returned.

Edit: I have experimented with versions `3.4.2`, `3.5.0`, and `3.6.0`, and they all seem to behave the same.
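The unbounded growth described above can be simulated in a few lines (illustrative Python, not ContainerPilot code; the `source` label format is a guess based on this report):

```python
# If each distinct (metric, value) pair becomes its own permanent event
# line, a fluctuating gauge keeps adding lines for as long as it runs.
import random

random.seed(0)
lines = set()
for _ in range(1000):
    mem = random.randint(100, 999)  # pretend memory-usage samples
    lines.add(f'containerpilot_events{{source="metric: mem={mem}"}}')

# Hundreds of distinct exposition lines for a single logical metric,
# instead of one gauge line that simply changes value.
print(len(lines))
```

This is why the endpoint keeps growing and filtering on `source` becomes impractical: each new sample value mints a new series rather than updating an existing one.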