consul.raft.replication.heartbeat metrics have many suffixes #4450
Comments
Hey, that metric comes from https://github.com/hashicorp/raft/blob/a3fb4581fb07b16ecf1c3361580d4bdb17de9d98/replication.go#L535-L539 in our raft library. The suffix is the UUID of the raft server, so you should see a metric for each server. If you already split by hostname in Prometheus (the normal setup), then you should see that each host has a single UUID suffix. (For clarity: I see

As an immediate solution, I'm 80% sure Prometheus rewriting rules are expressive enough to remove it or turn the last part into a label today with no changes.

In general I agree this would be better as a label - the reason it's not is that our raft library pre-dates

Seems reasonable to open an issue on raft to switch to using labels, but I'm not sure how soon that will happen!
@banks Sorry, I find these suffixed metrics only on the Consul leader. I couldn't find how to write a rewrite rule in Prometheus; I only found recording rules. I would appreciate it if you could teach me some best practices. These suffixes change after Consul restarts, so if I build a Grafana dashboard on them and Consul then reboots, the dashboard will stop working.
These suffixes will change after the consul restarts
i find this suffix metrics only on the consul leader
So it would seem that only the raft leader logs this, since it is the
only one initiating the appendEntries RPC. The reason you only saw one set
was that only one leader existed during the interval you looked at. When
you restart the leader, the cluster elects a new leader which has a
different UUID. So it _is_ legitimately a different metric - if you were
also splitting metrics by `host` label you'd see them as two separate time
series in Prometheus regardless of whether this suffix is present or not.
didn't find Prometheus how to rewrite a rule,I just found the record
rules.
<https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/>,i
would appreciate it if you could teach me some best practices.
I was referring to relabel_config: (sorry misremembered their specific
name)
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Crelabel_config%3E
I think something like (untested):

    metric_relabel_configs:
      - source_labels: ["__name__"]
        regex: "consul_raft_replication_appendEntries_rpc(.*)"
        target_label: "__name__"
        replacement: "consul_raft_replication_appendEntries_rpc"

That would remove the suffix. If you want to keep it as another label, then I
think another rule before that one, which extracts only that portion and
writes it into `target_label` = "raft_id", would work too.
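
A rule like this can be sanity-checked offline before reloading Prometheus. The sketch below is a rough Python emulation of a single `replace` relabel step, not Prometheus itself. It assumes that Prometheus anchors the regex at both ends and that `${N}` in the replacement refers to capture group N. The helper names `apply_replace` and `rewrite` are made up for illustration, and the sample names are the ones quoted later in this thread.

```python
import re


def apply_replace(labels, source_label, regex, target_label, replacement):
    """Rough emulation of one Prometheus `replace` relabel step (illustrative helper)."""
    # Assumption: the regex must match the whole source value, hence fullmatch.
    match = re.fullmatch(regex, labels.get(source_label, ""), flags=re.ASCII)
    if match is None:
        return dict(labels)  # rule does not apply; labels pass through unchanged
    # Translate Prometheus-style ${N} references into Python's \g<N> syntax.
    template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


BASE = "consul_raft_replication_appendEntries_rpc"
UUID = "450cf1d9_62de_d0be_905d_b4b42de3f8b8"
NAMES = [f"{BASE}_{UUID}", f"{BASE}_{UUID}_count", f"{BASE}_{UUID}_sum"]


def rewrite(name):
    """Apply the untested single-rule suggestion to one metric name."""
    labels = apply_replace({"__name__": name}, "__name__", BASE + "(.*)", "__name__", BASE)
    return labels["__name__"]


for name in NAMES:
    print(name, "->", rewrite(name))
```

Under this emulation the `(.*)` group also swallows a trailing `_sum` or `_count`, so all three names end up with the same base name. If those suffixes need to survive, the regex has to capture them separately.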
…On Fri, Jul 27, 2018 at 8:00 AM xuejipeng ***@***.***> wrote:
@banks <https://github.com/banks> Sorry, i find this suffix metrics only
on the consul leader, i didn't find Prometheus how to rewrite a rule,I just
found the record rules.
<https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/>,i
would appreciate it if you could teach me some best practices.
These suffixes will change after the consul restarts, and if I want to use
Grafana , then consul reboot, it will not work.
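@banks Thank you so much. I have temporarily solved the Grafana data collection problem with a `metric_relabel_configs` rule; my configuration is reproduced in the quoted message further down.
I have to use the regex `(consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?` because some metric names carry a `_sum` or `_count` suffix after the UUID.
I can't find a `host` label, so I have to split by `raft_id`, which is not easy to identify. Can I get the hostname of the corresponding server, or the mapping between `raft_id` and server?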
There is no host label because you are explicitly disabling that in the
telemetry block of your agent config (`disable_hostname`). Typically that makes
sense if you have Prometheus set up to scrape, since Prometheus adds its own
`instance` label which has the hostname/IP and port scraped, which tends to
be enough.
So I'd recommend using that instance label in Grafana as a more readable one
for now.
If you want to figure out the mapping of host to raft UUID you can see it
in the output of `consul operator raft list-peers` (or the API endpoint
that uses). But I don't recommend trying to hack that unless you have to,
since Prometheus should have the instance label already.
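
If the mapping is needed anyway, it can be built from the raft configuration. The sketch below assumes the API endpoint in question is `/v1/operator/raft/configuration` and that its response lists servers with `ID` and `Node` fields. It also assumes the metric suffix is the server ID with hyphens replaced by underscores, which is inferred from the metric names in this thread. The sample response, the node names and the second ID are invented placeholders, and `raft_id_to_node` is a hypothetical helper.

```python
import json

# Invented sample shaped like a raft configuration response; the node names,
# addresses and the second ID are placeholders, not values from this thread.
SAMPLE_RESPONSE = """
{
  "Servers": [
    {"ID": "450cf1d9-62de-d0be-905d-b4b42de3f8b8", "Node": "follower-a",
     "Address": "192.0.2.10:8300", "Leader": false, "Voter": true},
    {"ID": "11111111-2222-3333-4444-555555555555", "Node": "leader-b",
     "Address": "192.0.2.11:8300", "Leader": true, "Voter": true}
  ]
}
"""


def raft_id_to_node(response_text):
    """Map metric-style raft IDs (hyphens replaced by underscores) to node names."""
    servers = json.loads(response_text)["Servers"]
    return {server["ID"].replace("-", "_"): server["Node"] for server in servers}


if __name__ == "__main__":
    for raft_id, node in raft_id_to_node(SAMPLE_RESPONSE).items():
        print(raft_id, "->", node)
```

The resulting dictionary could be used to translate a `raft_id` label value into a readable name in a dashboard, but the `instance` label remains the simpler option.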
…On Sat, Jul 28, 2018 at 10:25 AM xuejipeng ***@***.***> wrote:
@banks Thank you so much. I have temporarily solved the Grafana data
collection problem; here's my configuration.

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?'
        target_label: raft_id
        replacement: '${2}'
      - source_labels: [__name__]
        regex: '(consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?'
        target_label: __name__
        replacement: '${1}${4}'

I have to use the regex

    (consul_raft_replication_appendEntries_rpc)_((\w){36})((_sum)|(_count))?

because some metric names look like this:

    consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8
    consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8_count
    consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8_sum

I can't find a `host` label, so I have to split by raft_id, but that doesn't
seem to be easy to identify.
[image: image]
<https://user-images.githubusercontent.com/18901031/43355140-f863287a-9289-11e8-904e-cfd0ce5d1302.png>
Can I get the hostname of the corresponding server, or the mapping between
raft_id and server?
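@banks Thanks. My problem is now basically solved, but this metric exists only on the leader, so the `instance` label value is the leader's hostname/IP. I think it would be better if the `instance` label were the follower's hostname/IP. Can I get the follower's instance label on this metric, or do I have to configure relabeling? How should I configure it?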
The "instance" label is from Prometheus - it's typically whatever host:port
Prometheus scraped the metrics from. So each metric should already be
labelled with the instance it came from.
…On Tue, Jul 31, 2018 at 10:02 AM xuejipeng ***@***.***> wrote:
@banks Thanks. My problem is now basically solved, but this metric exists
only on the leader, so the instance label value is the leader's
hostname/IP. I think it would be better if the instance label were the
follower's hostname/IP. Can I get the follower's instance label on this
metric, or do I have to configure relabeling? How should I configure it?
@banks Thank you, but my question is whether my final Grafana graph can look like the one at the bottom of this picture, where the two IP addresses are the followers, not the leader.
Ah, I was a bit wrong before:
The metric comes from
https://github.com/hashicorp/raft/blob/6b7a98d6741c1aa8e8280dd4074adab19899b1d8/replication.go
which is running on the _leader_ but replicating to a follower - the UUID
is the _follower_ UUID but the metric is emitted on the leader.
Bizarrely, that is the exact opposite of what your graphs seem to show - you
only had a single unique UUID before and now you have two separate instances
both emitting this metric. I'm finding it hard to explain that without a
lot more info about what your cluster is doing (e.g. is leadership changing
all the time) or access to the raw metrics.
I would expect to see two separate graphs like you have here, but I'd expect
them to come from the same leader IP at one time unless leadership changes.
Note that Prometheus will make it look like there is still data there for
up to 5 mins after a metric stops being recorded, which might complicate
things.
On that basis, you probably do need to keep the UUID as a label since the
unique tuple for metrics is: instance (the leader), peer UUID (the
follower), and there should be exactly 2 sets of series at any one time -
one for each follower.
…On Thu, Aug 2, 2018 at 3:31 AM xuejipeng ***@***.***> wrote:
@banks Thank you, but my question is whether my final Grafana graph can
look like the one at the bottom of this picture, where the two IP
addresses are the followers, not the leader.
[image: image]
<https://user-images.githubusercontent.com/18901031/43559031-c2933682-963e-11e8-91dd-edc21f31279b.png>
@banks Thank you very much for your careful answer. Maybe I need two graphs.
Thanks for the tip, guys. I actually used the following myself in order to cover the other metrics as well, in case it is of interest to anyone:

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_((\w){36})((_sum)|(_count))?'
        target_label: raft_id
        replacement: '${2}'
      - source_labels: [__name__]
        regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_((\w){36})((_sum)|(_count))?'
        target_label: __name__
        replacement: 'consul_raft_replication_${1}${4}'

@mvisonneau your metrics rules are really cool.
This was also used at my workplace, so I'm adding my feedback here as well. I think you can make the regex shorter. From
Ah indeed, which would give us:

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_(\w{36})(_sum|_count)?'
        target_label: raft_id
        replacement: '${2}'
      - source_labels: [__name__]
        regex: 'consul_raft_replication_(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_(\w{36})(_sum|_count)?'
        target_label: __name__
        replacement: 'consul_raft_replication_${1}${3}'

I haven't tested it yet, though.
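
Since the shortened regex changes the capture-group numbering (the optional suffix moves from group 4 to group 3), a quick offline comparison can confirm that both variants split names the same way. This is only a sketch using Python's `re` module as a stand-in for Prometheus's regex engine, assuming full-string matching. The helper names `split_name` and `regexes_agree` are made up, and the sample UUID is the one quoted earlier in the thread.

```python
import re

PREFIX = (
    r"consul_raft_replication_"
    r"(appendEntries_rpc|appendEntries_logs|heartbeat|installSnapshot)_"
)
LONG = PREFIX + r"((\w){36})((_sum)|(_count))?"  # suffix is capture group 4
SHORT = PREFIX + r"(\w{36})(_sum|_count)?"       # suffix is capture group 3

UUID = "450cf1d9_62de_d0be_905d_b4b42de3f8b8"
FAMILIES = ["appendEntries_rpc", "appendEntries_logs", "heartbeat", "installSnapshot"]
NAMES = [
    f"consul_raft_replication_{family}_{UUID}{suffix}"
    for family in FAMILIES
    for suffix in ("", "_sum", "_count")
]


def split_name(regex, suffix_group, name):
    """Return (family, raft_id, suffix) for a metric name, or None if it does not match."""
    match = re.fullmatch(regex, name, flags=re.ASCII)
    if match is None:
        return None
    # An unmatched optional group is None; treat it as an empty suffix.
    return match.group(1), match.group(2), match.group(suffix_group) or ""


def regexes_agree():
    """Check that both regex variants match every sample and split it identically."""
    return all(
        split_name(SHORT, 3, name) is not None
        and split_name(LONG, 4, name) == split_name(SHORT, 3, name)
        for name in NAMES
    )


print(regexes_agree())
```

This only covers the sample names generated here, so it is a smoke test rather than proof that the two rules behave identically on a live Prometheus server.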
The relabel above works perfectly; however, if you're using a ServiceMonitor configuration in Kubernetes you need to change these two keys:
My Consul version is 1.2.1; here is my configuration:

    {
      "datacenter": "dc1",
      "data_dir": "/apps/consul_1.2.1/data",
      "log_level": "DEBUG",
      "node_name": "ast0",
      "server": true,
      "ui": true,
      "bootstrap_expect": 1,
      "bind_addr": "10.0.5.169",
      "client_addr": "0.0.0.0",
      "retry_join": ["10.0.5.160","10.0.5.94"],
      "retry_interval": "3s",
      "enable_debug": true,
      "rejoin_after_leave": true,
      "enable_syslog": false,
      "telemetry": {
        "prometheus_retention_time": "24h",
        "disable_hostname": true
      }
    }

When I query the metric consul.raft.replication.heartbeat, it looks like this: consul_raft_replication_appendEntries_rpc_450cf1d9_62de_d0be_905d_b4b42de3f8b8
What are these suffixes? Can I get rid of them, or can I attach a label to this metric instead?