
rafthttp: add Raft send latency metric for writes #10023

Closed · wants to merge 2 commits

Conversation

Contributor

@gyuho gyuho commented Aug 18, 2018

Currently, only the v2 metrics ("stats.FollowerStats") track Raft message
send latencies. Add a Prometheus histogram to track Raft message send latency
for writes, since heartbeats are already probed (see #10022)
and snapshots are already tracked via #9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```
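For context, a minimal Go sketch of how such a histogram could be declared and observed with the Prometheus client. The metric name, labels, and bucket layout come from the output above and the diff quoted later in this thread; the Help text, the standalone registration, and the placeholder observation in `main` are assumptions for illustration, not the PR's verbatim code.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// raftSendSeconds mirrors the histogram added in this PR; the Help string
// here is an assumption, not the actual diff.
var raftSendSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "etcd",
		Subsystem: "network",
		Name:      "raft_send_total_duration_seconds",
		Help:      "Total latency distributions of Raft message sends.",
		// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
		Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
	},
	[]string{"Type", "To"},
)

func main() {
	prometheus.MustRegister(raftSendSeconds)

	// In rafthttp this happens after a successful send, e.g.
	// raftSendSeconds.WithLabelValues(m.Type.String(), types.ID(m.To).String()).Observe(took.Seconds())
	took := 150 * time.Microsecond // placeholder duration for illustration
	raftSendSeconds.WithLabelValues("MsgApp", "729934363faa4a24").Observe(took.Seconds())

	fmt.Println("observed one MsgApp send of", took)
}
```

With both the "Type" and "To" labels, a single histogram covers every tracked message type per peer.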

@gyuho gyuho requested review from jpbetz and xiang90 August 18, 2018 00:40
@gyuho gyuho added the WIP label Aug 18, 2018
@gyuho
Contributor Author

gyuho commented Aug 18, 2018

Addresses #9438.

@codecov-io

codecov-io commented Aug 18, 2018

Codecov Report

Merging #10023 into master will decrease coverage by 0.08%.
The diff coverage is 100%.


```diff
@@            Coverage Diff             @@
##           master   #10023      +/-   ##
==========================================
- Coverage   71.59%   71.51%   -0.09%
==========================================
  Files         390      390
  Lines       36258    36265       +7
==========================================
- Hits        25960    25935      -25
- Misses       8488     8519      +31
- Partials     1810     1811       +1
```
| Impacted Files | Coverage Δ |
| --- | --- |
| etcdserver/api/rafthttp/pipeline.go | 97.5% <100%> (ø) ⬆️ |
| etcdserver/api/rafthttp/stream.go | 79.66% <100%> (+0.26%) ⬆️ |
| etcdserver/api/rafthttp/metrics.go | 100% <100%> (ø) ⬆️ |
| pkg/adt/interval_tree.go | 79.27% <0%> (-12.62%) ⬇️ |
| clientv3/leasing/util.go | 91.66% <0%> (-6.67%) ⬇️ |
| proxy/grpcproxy/watch.go | 88.19% <0%> (-4.35%) ⬇️ |
| pkg/transport/listener.go | 58.67% <0%> (-4.09%) ⬇️ |
| etcdserver/v2_server.go | 80.76% <0%> (-3.85%) ⬇️ |
| etcdserver/api/v3rpc/watch.go | 80.06% <0%> (-2.29%) ⬇️ |
| clientv3/watch.go | 90.87% <0%> (-1.49%) ⬇️ |

... and 18 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@wenjiaswe wenjiaswe left a comment


I have two questions; everything else LGTM.

```go
// heartbeats are tracked via prober https://github.com/coreos/etcd/pull/10022
if m.Type == raftpb.MsgProp || m.Type == raftpb.MsgApp || m.Type == raftpb.MsgAppResp {
	raftSendSeconds.WithLabelValues(m.Type.String(), types.ID(m.To).String()).Observe(took.Seconds())
}
```
Contributor

Is there a reason that MsgProp, MsgApp and MsgAppResp are chosen here? I was reading the doc, and you explained snapshots and heartbeats; what about other message types?

```go
// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
},
[]string{"Type", "To"},
```
Contributor

Do you mean []string{"To"}?

gyuho added 2 commits August 29, 2018 15:07
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
…econds" metric

Currently, only v2 metrics ("stats.FollowerStats") tracks Raft message
send latencies. Add Prometheus histogram to track Raft messages for
writes, since heartbeats are probed (see etcd-io#10022)
and snapshots are already being tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@gyuho
Contributor Author

gyuho commented Aug 29, 2018

@jpbetz @wenjiaswe

> Is there a reason that MsgProp, MsgApp and MsgAppResp are chosen here?

Those are the most frequent ones and would give us enough sample size to investigate any network issues. The other important ones, heartbeats and v3 snapshot messages, are already covered by the prober (#10022) and the snapshot send metrics (#9997).

> Do you mean []string{"To"}?

By keeping the "Type" label in addition to "To", we can use a single metric to track multiple message types.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Can you confirm with the GKE team that this does not break anything in production?

I will check with other production users.

@wenjiaswe
Contributor

@gyuho Adding metrics does not break anything in GKE production as far as I know, unless @jpbetz has more insights that I am not aware of.

@gyuho
Contributor Author

gyuho commented Aug 29, 2018

Oh, this is a new metric, so it should not break anything when backported.

@wenjiaswe
Contributor

@gyuho I do need to check whether it's safe to add a label, though. Checking now; will get back to you soon.

```go
// snapshot sends are tracked via separate metrics https://github.com/etcd-io/etcd/pull/9997
// heartbeats are tracked via prober https://github.com/etcd-io/etcd/pull/10022
// TODO: track other messages?
if m.Type == raftpb.MsgProp ||
```
Contributor

This is not the actual sending time (flushing the buffer to the network card); we actually send the messages at line 211, when flush is called explicitly.
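To illustrate this point with a standalone sketch (not etcd code; the peer address and payload size are made up): timing only the write into a bufio.Writer measures a copy into an in-memory buffer, and the bytes reach the socket only when Flush is called or the buffer fills.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical peer address; rafthttp actually writes to an HTTP stream.
	conn, err := net.Dial("tcp", "127.0.0.1:2380")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	bw := bufio.NewWriter(conn)

	start := time.Now()
	bw.Write(make([]byte, 512)) // lands in the in-memory buffer only
	fmt.Println("buffered write:", time.Since(start))

	start = time.Now()
	bw.Flush() // the bytes actually hit the socket here
	fmt.Println("flush:", time.Since(start))
}
```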

Contributor

I would rather just remove this metric; it is not easy to measure after all, with buffers all over the place.

Contributor Author

@xiang90 Good point. For now, tracking snapshots and heartbeats is good enough. Will come back to this later.

@gyuho
Contributor Author

gyuho commented Sep 6, 2018

Will get back to this later.

@gyuho gyuho deleted the stats-for-v3 branch June 3, 2024 09:26