rafthttp: probe connection for Raft message transport #10022
Conversation
Address #9438.
Codecov Report
@@ Coverage Diff @@
## master #10022 +/- ##
==========================================
- Coverage 71.59% 71.56% -0.04%
==========================================
Files 390 390
Lines 36258 36266 +8
==========================================
- Hits 25960 25952 -8
- Misses 8488 8505 +17
+ Partials 1810 1809 -1
Continue to review full report at Codecov.
…econds" metric

Currently, only the v2 metrics ("stats.FollowerStats") track Raft message send latencies. Add a Prometheus histogram to track Raft messages for writes, since heartbeats are probed (see etcd-io#10022) and snapshots are already tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Preliminary work to add a prober to "streamRt". Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
In our production cluster, we found one TCP connection with >8-sec latencies to a remote peer, but the "etcd_network_peer_round_trip_time_seconds" metric shows a <1-sec latency distribution, which means either we weren't sampling enough or all the latency spikes happened outside the snapshot pipeline connection. The latter is most likely the case, since the cluster had leader elections from missing heartbeats. This PR adds another probing routine to monitor the connection for Raft message transport.
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
We need to track which connection had high latency spikes.

```
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_RAFT_MESSAGE",To="729934363faa4a24",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_RAFT_MESSAGE",To="729934363faa4a24",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_SNAPSHOT",To="729934363faa4a24",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_SNAPSHOT",To="729934363faa4a24",le="0.0002"} 1
```

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Fix:

```
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
```

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
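Taken together, the commits above add a second probing routine over the Raft message round tripper and key the round-trip-time histogram by a new ConnectionType label in addition to To. The Go sketch below only illustrates that shape and is not the PR's actual code: the probe loop, function and variable names, endpoint URL, and interval are assumptions, and it uses a plain HTTP GET in place of the probing package that rafthttp relies on.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// rttSec mirrors the histogram shape discussed in the commits above:
// the existing "To" label plus the proposed "ConnectionType" label.
var rttSec = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "etcd",
		Subsystem: "network",
		Name:      "peer_round_trip_time_seconds",
		Help:      "Round-trip-time histograms between peers.",
		// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
		Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
	},
	[]string{"ConnectionType", "To"},
)

func init() {
	prometheus.MustRegister(rttSec)
}

// probeLoop is a simplified stand-in for one probing routine: it periodically
// issues a request over the given round tripper (for example, the transport
// used for Raft message streams rather than the snapshot pipeline) and records
// the observed round-trip time under that connection type's label.
func probeLoop(rt http.RoundTripper, connType, to, probeURL string, interval time.Duration, stop <-chan struct{}) {
	client := &http.Client{Transport: rt, Timeout: interval}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			start := time.Now()
			resp, err := client.Get(probeURL)
			if err != nil {
				continue // a real prober would also track health and loss here
			}
			resp.Body.Close()
			rttSec.WithLabelValues(connType, to).Observe(time.Since(start).Seconds())
		}
	}
}

func main() {
	stop := make(chan struct{})
	// One loop per transport: the snapshot pipeline and the Raft message transport,
	// so latency spikes on either connection show up under their own label.
	go probeLoop(http.DefaultTransport, "ROUND_TRIPPER_SNAPSHOT", "729934363faa4a24", "http://peer:2380/raft/probing", 5*time.Second, stop)
	go probeLoop(http.DefaultTransport, "ROUND_TRIPPER_RAFT_MESSAGE", "729934363faa4a24", "http://peer:2380/raft/probing", 5*time.Second, stop)
	select {}
}
```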
@wenjiaswe @jpbetz I meant to ask about this metric as well. Would adding …
AWS does not use this metric, so this should be safe for us.
Let's hold off on this until we can confirm how label additions are handled. We're primarily concerned with how https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd handles this case. @wenjiaswe is checking now.
@jpbetz No rush. Thanks for checking!
@@ -143,7 +143,7 @@ var (
 		// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
 		Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
 	},
-	[]string{"To"},
+	[]string{"ConnectionType", "To"},
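A quick way to sanity-check the bucket comment in this hunk: client_golang's ExponentialBuckets produces the same layout, with the last bucket boundary at 0.0001 * 2^15 = 3.2768 seconds. The snippet below is only an illustration and is not part of the PR.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// 16 exponential buckets starting at 0.0001s with factor 2:
	// 0.0001, 0.0002, 0.0004, ..., 0.0001 * 2^15 = 3.2768 seconds.
	buckets := prometheus.ExponentialBuckets(0.0001, 2, 16)
	fmt.Println(len(buckets), buckets[0], buckets[len(buckets)-1])
}
```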
@gyuho Unfortunately, a label addition will break GKE. I checked https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd: for metrics with the "container.googleapis.com" prefix (e.g., etcd metrics), if the definition of a metric changes (a label is part of the definition), the metric is marked as broken and is no longer pushed: https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/a39287c62ca7c1ffaa3cac4887f6a92c1943abab/prometheus-to-sd/translator/metric_descriptor_cache.go#L72. prometheus-to-sd only calls UpdateMetricDescriptors for metrics with the "custom.googleapis.com" prefix: https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/1047589c380eed64c94484e7ab989db53d1b907b/prometheus-to-sd/main.go#L141.
@wenjiaswe @jpbetz Thanks for checking! Then let's close this. I will add a separate metric.
@wenjiaswe Would you also file an issue somewhere to track this limitation? Many label additions are conceptually backward compatible, since introducing a label would not break existing readers that are unaware of it, so making our metrics processing pipeline tolerant of label additions seems desirable and useful.
@jpbetz Would adding the ConnectionType="ROUND_TRIPPER_SNAPSHOT" label break anything in your monitoring systems? I want to backport this as well. Currently, probing doesn't tell much, since it only tracks the snapshot sender connection...