consul.raft.replication.heartbeat metrics have many suffixes #4450

xuejipeng · 2018-07-26T05:30:56Z

My Consul version is 1.2.1. Here is my configuration:
{
  "datacenter": "dc1",
  "data_dir": "/apps/consul_1.2.1/data",
  "log_level": "DEBUG",
  "node_name": "ast0",
  "server": true,
  "ui": true,
  "bootstrap_expect": 1,
  "bind_addr": "10.0.5.169",
  "client_addr": "0.0.0.0",
  "retry_join": ["10.0.5.160", "10.0.5.94"],
  "retry_interval": "3s",
  "enable_debug": true,
  "rejoin_after_leave": true,
  "enable_syslog": false,
  "telemetry": {
    "prometheus_retention_time": "24h",
    "disable_hostname": true
  }
}

When I fetch the consul.raft.replication.heartbeat metrics, they look like this: consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8

What are these suffixes? Can I get rid of them, or can I turn them into a label on these metrics instead?

banks · 2018-07-26T09:52:11Z

Hey, that metric comes from our raft library: https://github.com/hashicorp/raft/blob/a3fb4581fb07b16ecf1c3361580d4bdb17de9d98/replication.go#L535-L539

The suffix is the UUID of the raft server, so you should see a metric for each server. If you already split by hostname in Prometheus (the normal setup), you should see that each host has a single UUID suffix. (For clarity: I see "disable_hostname": true in the posted config, but that is normally used because collecting agents like Prometheus already add a host label based on the target they scraped, so I assume the OP already has metrics labeled that way.)

As an immediate solution, I'm 80% sure Prometheus rewriting rules are expressive enough to remove it or turn the last part into a label today with no changes.

In general I agree this would be better as a label - the reason it's not is that our raft library pre-dates go-metrics label features since the original sinks like statsd didn't have label support.

Seems reasonable to open an Issue on raft to switch to using labels but I'm not sure how soon that will happen!

xuejipeng · 2018-07-27T07:00:35Z

@banks Sorry, I find these suffixed metrics only on the Consul leader. I couldn't find how to rewrite a rule in Prometheus; I only found the recording rules. I would appreciate it if you could teach me some best practices.

These suffixes change after Consul restarts, so if I build a Grafana dashboard on them, it will stop working once Consul is restarted.

banks · 2018-07-27T10:57:11Z

> These suffixes will change after the consul restarts

> i find this suffix metrics only on the consul leader

So it would seem that only the raft leader logs this, since it is the only one initiating the appendEntries RPC. The reason you only saw one set was that only one leader existed during the interval you looked at. When you restart the leader, the cluster elects a new leader, which has a different UUID. So it _is_ legitimately a different metric: if you were also splitting metrics by `host` label, you'd see them as two separate time series in Prometheus regardless of whether this suffix is present or not.

> didn't find Prometheus how to rewrite a rule, I just found the record rules. https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/ i would appreciate it if you could teach me some best practices.

I was referring to relabel_config (sorry, I misremembered the specific name): https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Crelabel_config%3E

I think something like this (untested):

metric_relabel_configs:
  - source_labels: ["__name__"]
    regex: "consul_raft_replication_appendEntries_rpc(.*)"
    target_label: "__name__"
    replacement: "consul_raft_replication_appendEntries_rpc"

would remove the suffix. If you want to keep it as another label, then I think another rule before that one, which extracts only that portion and writes it into `target_label` = "raft_id", would work too.


xuejipeng · 2018-07-28T09:25:57Z

@banks Thank you so much. I have temporarily solved the Grafana data collection problem. Here is my configuration:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?'
    target_label: raft_id
    replacement: '${2}'
  - source_labels: [__name__]
    regex: '(consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?'
    target_label: __name__
    replacement: '${1}${4}'

I have to use the regex `(consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?` because some of the metric names look like this:

consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8

consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8_count

consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8_sum
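A minimal sketch of what the two rules do to these three names, written in Python rather than Prometheus itself. It assumes Python's `re.fullmatch` is a fair stand-in for Prometheus's anchored RE2 matching on this particular pattern; `relabel` and `relabel_naive` are hypothetical helper names. It also shows why the simpler `(.*)` pattern suggested earlier in the thread is not enough: it folds the `_sum` and `_count` series into the base name.

```python
import re

# Pattern used by both relabel rules. Prometheus anchors relabel regexes,
# so fullmatch() is the closest Python equivalent.
PATTERN = re.compile(
    r"(consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?"
)
# The simpler, untested pattern suggested earlier in the thread.
NAIVE = re.compile(r"consul_raft_replication_appendEntries_rpc(.*)")


def relabel(name):
    """Return (new_name, raft_id); (name, None) when the rule does not match."""
    m = PATTERN.fullmatch(name)
    if m is None:
        return name, None
    suffix = m.group(4) or ""  # optional group is None for the base series
    return m.group(1) + suffix, m.group(2)


def relabel_naive(name):
    """Apply the simpler rule, which always rewrites to the base name."""
    if NAIVE.fullmatch(name):
        return "consul_raft_replication_appendEntries_rpc"
    return name


BASE = "consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8"
for name in (BASE, BASE + "_count", BASE + "_sum"):
    print(relabel(name), "| naive:", relabel_naive(name))
```

With the 36-character pattern, the suffix survives in `__name__` and the UUID ends up in `raft_id`; with the naive pattern all three names become identical.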

I can't find a host label, so I have to split the series by raft_id, but that is not easy to identify.

Can I get the hostname of the corresponding server, or the mapping between raft_id and server?

banks · 2018-07-30T13:04:35Z

There is no host label because you are explicitly disabling it ("disable_hostname": true) in the telemetry block of the agent config. Typically that makes sense if you have Prometheus set up to scrape, since Prometheus adds its own `instance` label, which has the hostname/IP and port scraped and tends to be enough. So I'd recommend using that instance label in Grafana as a more readable one for now.

If you want to figure out the mapping of host to raft UUID, you can see it in the output of `consul operator raft list-peers` (or the API endpoint that it uses). But I don't recommend trying to hack that unless you have to, since Prometheus should have the instance label already.
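If the mapping is needed anyway, here is a rough sketch of building it from the raft configuration. It assumes the JSON shape returned by Consul's `/v1/operator/raft/configuration` endpoint (a `Servers` list with `ID` and `Node` fields) and assumes the metric suffix is the server ID with `-` turned into `_`. The sample payload, node names, and second UUID are made up for illustration, and `raft_id_to_node` is a hypothetical helper.

```python
import json

# Hypothetical sample payload: node names, addresses and the second ID are
# invented, and which node owns the first ID is not known from this thread.
SAMPLE = """
{"Servers": [
  {"ID": "450cf1d9-62de-d0be-905d-b4b42de3f8b8", "Node": "node-a",
   "Address": "192.0.2.10:8300", "Leader": false, "Voter": true},
  {"ID": "11111111-2222-3333-4444-555555555555", "Node": "node-b",
   "Address": "192.0.2.11:8300", "Leader": true, "Voter": true}
]}
"""


def raft_id_to_node(config_json):
    """Map the underscore form of each raft ID (as seen in metric names) to its node."""
    servers = json.loads(config_json)["Servers"]
    return {s["ID"].replace("-", "_"): s["Node"] for s in servers}


mapping = raft_id_to_node(SAMPLE)
print(mapping.get("450cf1d9_62de_d0be_905d_b4b42de3f8b8"))
```

The resulting dictionary could feed a Grafana variable or a static lookup, but as noted above the `instance` label is usually the simpler route.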


xuejipeng · 2018-07-31T09:02:30Z

@banks Thanks, my problem is now basically solved. But this metric exists only on the leader, so the instance label value is the leader's hostname/IP. I think it would be better if the instance label were the follower's hostname/IP. Can I get the follower's instance label on this metric, or do I have to configure a relabel rule? How should I configure it?

banks · 2018-07-31T11:37:07Z

The "instance" label is from Prometheus - it's whatever host:port prometheus scraped the metrics from typically. So each metric should be labelled with the instance it came from already.


xuejipeng · 2018-08-02T02:31:01Z

@banks Thank you, but my question is whether my final Grafana graph can look like the picture below, where the two IP addresses are the followers, not the leader.

banks · 2018-08-02T12:08:35Z

Ah, I was a bit wrong before. The metric comes from https://github.com/hashicorp/raft/blob/6b7a98d6741c1aa8e8280dd4074adab19899b1d8/replication.go, which runs on the _leader_ but replicates to a follower: the UUID is the _follower_ UUID, but the metric is emitted on the leader.

Bizarrely, that is the exact opposite of what your graphs seem to show: you only had a single unique UUID before, and now you have two separate instances both emitting this metric. I'm finding it hard to explain that without a lot more info about what your cluster is doing (e.g. is leadership changing all the time) or access to the raw metrics. I would expect to see two separate graphs like you have here, but I'd expect them to come from the same leader IP at any one time unless leadership changes. Note that Prometheus will make it look like there is still data for up to 5 minutes after a metric stops being recorded, which might complicate things.

On that basis, you probably do need to keep the UUID as a label, since the unique tuple for these metrics is (instance (the leader), peer UUID (the follower)), and there should be exactly 2 sets of series at any one time, one for each follower.


xuejipeng · 2018-08-07T03:07:07Z

@banks Thank you very much for your careful answer, maybe I need two graphs.

mvisonneau · 2018-08-16T14:48:02Z

Thanks for the tip, guys. I actually used the following myself in order to cover the other replication metrics as well, in case it is of interest to anyone:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_((\w){36})((_sum)|(_count))?'
    target_label: raft_id
    replacement: '${2}'
  - source_labels: [__name__]
    regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_((\w){36})((_sum)|(_count))?'
    target_label: __name__
    replacement: 'consul_raft_replication_${1}${4}'

xuejipeng · 2018-08-20T03:33:08Z

@mvisonneau Your metric rules are really cool.

jaysoncena · 2019-04-08T13:46:11Z

This is also used at my workplace, so I'm adding my feedback here as well.

I think you can make the regex shorter: change `((\w){36})((_sum)|(_count))?` to `(\w{36})(_sum|_count)?`, and the replacement to `consul_raft_replication_${1}${3}`. That makes it easier to understand.

mvisonneau · 2019-04-08T13:57:04Z

Ah indeed, which would give us:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_(\w{36})(_sum|_count)?'
    target_label: raft_id
    replacement: '${2}'
  - source_labels: [__name__]
    regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_(\w{36})(_sum|_count)?'
    target_label: __name__
    replacement: 'consul_raft_replication_${1}${3}'

I haven't tested yet though
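A quick way to sanity-check the shortened pattern against the longer one is sketched below. This runs under Python's `re`, not Prometheus's RE2, so it only shows that the two patterns agree on names shaped like the ones in this thread; `apply_rules` is a hypothetical helper, and the UUID is the one posted above.

```python
import re

KINDS = "appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot"
# Longer pattern: kind = group 1, UUID = group 2, suffix = group 4.
LONG = re.compile(rf"consul_raft_replication_({KINDS})_((\w){{36}})((_sum)|(_count))?")
# Shortened pattern: kind = group 1, UUID = group 2, suffix = group 3.
SHORT = re.compile(rf"consul_raft_replication_({KINDS})_(\w{{36}})(_sum|_count)?")


def apply_rules(pattern, suffix_group, name):
    """Return (new_name, raft_id) as the pair of relabel rules would produce."""
    m = pattern.fullmatch(name)  # relabel regexes are anchored
    if m is None:
        return name, None
    suffix = m.group(suffix_group) or ""
    return f"consul_raft_replication_{m.group(1)}{suffix}", m.group(2)


UUID = "450cf1d9_62de_d0be_905d_b4b42de3f8b8"
NAMES = [
    f"consul_raft_replication_{kind}_{UUID}{suffix}"
    for kind in KINDS.split("|")
    for suffix in ("", "_sum", "_count")
]

for name in NAMES:
    assert apply_rules(LONG, 4, name) == apply_rules(SHORT, 3, name)
print(apply_rules(SHORT, 3, NAMES[-1]))
```

The only thing that changes between the two versions is the group index of the suffix (`${4}` becomes `${3}`), which is why the replacement string has to change along with the regex.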

GMartinez-Sisti · 2019-11-29T18:12:59Z

The relabel above works perfectly, however if you're using a ServiceMonitor configuration in Kubernetes you need to change these two keys:

source_labels -> sourceLabels
target_label -> targetLabel
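For anyone converting rules mechanically, the key rename can be sketched as a small helper. This assumes the rules are already loaded as Python dictionaries and that only the keys need camelCase (values such as `__name__` are left alone); `to_camel` and `to_service_monitor` are hypothetical names, and where the result goes in the ServiceMonitor (normally the endpoint's `metricRelabelings` list) should be checked against the Prometheus Operator documentation.

```python
import re


def to_camel(key):
    """Convert a snake_case key such as 'source_labels' to 'sourceLabels'."""
    return re.sub(r"_([a-z])", lambda m: m.group(1).upper(), key)


def to_service_monitor(rules):
    """Rename the keys of each relabel rule; values are passed through untouched."""
    return [{to_camel(k): v for k, v in rule.items()} for rule in rules]


PATTERN = (
    "consul_raft_replication_"
    "(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)"
    r"_(\w{36})(_sum|_count)?"
)
rules = [
    {"source_labels": ["__name__"], "regex": PATTERN,
     "target_label": "raft_id", "replacement": "${2}"},
    {"source_labels": ["__name__"], "regex": PATTERN,
     "target_label": "__name__", "replacement": "consul_raft_replication_${1}${3}"},
]

for rule in to_service_monitor(rules):
    print(rule)
```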

pearkes added the type/question and waiting-reply labels Jul 26, 2018

oceyral mentioned this issue Sep 26, 2019

Move peer id from metric name to labels in raft replication hashicorp/raft#365

Closed

satyanash mentioned this issue Aug 18, 2020

Move peer id from metric name to labels in raft replication hashicorp/raft#416

Closed

mkcp mentioned this issue Sep 29, 2020

☂️ Use metrics labels for metadata rather than appending metadata to the metric name #8420

Closed

mikemorris mentioned this issue Oct 6, 2020

chore: update raft to v1.2.0 #8822

Merged

mikemorris closed this as completed in #8822 Oct 8, 2020
