
rafthttp: add Raft send latency metric for writes #10023

Closed · wants to merge 2 commits

Conversation

Contributor

@gyuho gyuho commented Aug 18, 2018

Currently, only the v2 metrics ("stats.FollowerStats") track Raft message
send latencies. Add a Prometheus histogram to track Raft message send latency
for writes, since heartbeats are already probed (see #10022)
and snapshots are already tracked via #9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```
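For context, a minimal Go sketch of how such a histogram could be declared and observed with the Prometheus client. The metric name, labels, and bucket layout come from the output above and the diff quoted later in this thread; the Help text, the standalone registration, and the placeholder observation in `main` are assumptions for illustration, not the PR's verbatim code.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// raftSendSeconds mirrors the histogram added in this PR; the Help string
// here is an assumption, not the actual diff.
var raftSendSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "etcd",
		Subsystem: "network",
		Name:      "raft_send_total_duration_seconds",
		Help:      "Total latency distributions of Raft message sends.",
		// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
		Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
	},
	[]string{"Type", "To"},
)

func main() {
	prometheus.MustRegister(raftSendSeconds)

	// In rafthttp this happens after a successful send, e.g.
	// raftSendSeconds.WithLabelValues(m.Type.String(), types.ID(m.To).String()).Observe(took.Seconds())
	took := 150 * time.Microsecond // placeholder duration for illustration
	raftSendSeconds.WithLabelValues("MsgApp", "729934363faa4a24").Observe(took.Seconds())

	fmt.Println("observed one MsgApp send of", took)
}
```

With both the "Type" and "To" labels, a single histogram covers every tracked message type per peer.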

@gyuho gyuho requested review from jpbetz and xiang90 August 18, 2018 00:40
@gyuho gyuho added the WIP label Aug 18, 2018
@gyuho
Contributor Author

gyuho commented Aug 18, 2018

Addresses #9438.

@codecov-io

codecov-io commented Aug 18, 2018

Codecov Report

Merging #10023 into master will decrease coverage by 0.08%.
The diff coverage is 100%.


```diff
@@            Coverage Diff             @@
##           master   #10023      +/-   ##
==========================================
- Coverage   71.59%   71.51%   -0.09%
==========================================
  Files         390      390
  Lines       36258    36265       +7
==========================================
- Hits        25960    25935      -25
- Misses       8488     8519      +31
- Partials     1810     1811       +1
```
| Impacted Files | Coverage Δ |
| --- | --- |
| etcdserver/api/rafthttp/pipeline.go | 97.5% <100%> (ø) ⬆️ |
| etcdserver/api/rafthttp/stream.go | 79.66% <100%> (+0.26%) ⬆️ |
| etcdserver/api/rafthttp/metrics.go | 100% <100%> (ø) ⬆️ |
| pkg/adt/interval_tree.go | 79.27% <0%> (-12.62%) ⬇️ |
| clientv3/leasing/util.go | 91.66% <0%> (-6.67%) ⬇️ |
| proxy/grpcproxy/watch.go | 88.19% <0%> (-4.35%) ⬇️ |
| pkg/transport/listener.go | 58.67% <0%> (-4.09%) ⬇️ |
| etcdserver/v2_server.go | 80.76% <0%> (-3.85%) ⬇️ |
| etcdserver/api/v3rpc/watch.go | 80.06% <0%> (-2.29%) ⬇️ |
| clientv3/watch.go | 90.87% <0%> (-1.49%) ⬇️ |

... and 18 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@wenjiaswe wenjiaswe left a comment


I have two questions; everything else LGTM.

```go
// heartbeats are tracked via prober https://github.com/coreos/etcd/pull/10022
if m.Type == raftpb.MsgProp || m.Type == raftpb.MsgApp || m.Type == raftpb.MsgAppResp {
	raftSendSeconds.WithLabelValues(m.Type.String(), types.ID(m.To).String()).Observe(took.Seconds())
}
```
Contributor

Is there a reason that MsgProp, MsgApp and MsgAppResp are chosen here? I was reading the doc, and you explained snapshots and heartbeats; what about other message types?

```go
// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
},
[]string{"Type", "To"},
```
Contributor

Do you mean []string{"To"}?

gyuho added 2 commits August 29, 2018 15:07
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
…econds" metric

Currently, only v2 metrics ("stats.FollowerStats") tracks Raft message
send latencies. Add Prometheus histogram to track Raft messages for
writes, since heartbeats are probed (see etcd-io#10022)
and snapshots are already being tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@gyuho
Contributor Author

gyuho commented Aug 29, 2018

@jpbetz @wenjiaswe

> Is there a reason that MsgProp, MsgApp and MsgAppResp are chosen here?

Those are the most frequent ones and would give us enough sample size to investigate any network issues. The other important ones, heartbeats and v3 snapshot messages, are already covered by the prober (#10022) and the snapshot send metrics (#9997).

> Do you mean []string{"To"}?

By keeping the "Type" label in addition to "To", we can use a single metric to track multiple message types.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Can you confirm with the GKE team that this does not break anything in production?

I will check with other production users.

@wenjiaswe
Contributor

@gyuho Adding metrics does not break anything in GKE production as far as I know, unless @jpbetz has more insights that I am not aware of.

@gyuho
Contributor Author

gyuho commented Aug 29, 2018

Oh, this is a new metric, so it should not break anything when backported.

@wenjiaswe
Contributor

@gyuho I do need to check whether it's safe to add a label, though. Checking now; will get back to you soon.

```go
// snapshot sends are tracked via separate metrics https://github.com/etcd-io/etcd/pull/9997
// heartbeats are tracked via prober https://github.com/etcd-io/etcd/pull/10022
// TODO: track other messages?
if m.Type == raftpb.MsgProp ||
```
Contributor

This is not the actual sending time (flushing the buffer to the network card); we actually send the messages at line 211, when flush is called explicitly.
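To illustrate this point with a standalone sketch (not etcd code; the peer address and payload size are made up): timing only the write into a bufio.Writer measures a copy into an in-memory buffer, and the bytes reach the socket only when Flush is called or the buffer fills.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical peer address; rafthttp actually writes to an HTTP stream.
	conn, err := net.Dial("tcp", "127.0.0.1:2380")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	bw := bufio.NewWriter(conn)

	start := time.Now()
	bw.Write(make([]byte, 512)) // lands in the in-memory buffer only
	fmt.Println("buffered write:", time.Since(start))

	start = time.Now()
	bw.Flush() // the bytes actually hit the socket here
	fmt.Println("flush:", time.Since(start))
}
```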

Contributor

I would rather just remove this metric; it is not easy to measure after all, with buffers all over the place.

Contributor Author

@xiang90 Good point. For now, tracking snapshots and heartbeats is good enough. Will come back to this later.

@gyuho
Contributor Author

gyuho commented Sep 6, 2018

Will get back to this later.

@gyuho gyuho deleted the stats-for-v3 branch June 3, 2024 09:26