Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: add v3 snapshot metrics (fsync, network) #9997

Merged
merged 2 commits into from
Aug 15, 2018
Merged

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Aug 12, 2018

v3 snapshots are large and critical operations, often affecting the cluster availabilities. Currently, we do not have any metrics around v3 snapshots. Only way to debug is looking at server logs after something bad happens.


etcd_snap_db_fsync_duration_seconds_count
etcd_snap_db_save_total_duration_seconds_bucket

to monitor v3 snapshot save operations on local node.


etcd_network_snapshot_send_success
etcd_network_snapshot_send_failures
etcd_network_snapshot_send_total_duration_seconds
etcd_network_snapshot_receive_success
etcd_network_snapshot_receive_failures
etcd_network_snapshot_receive_total_duration_seconds

to monitor v3 snapshot operations between remote peers.

Distribution would be:

0.1 second or more
...
25.6 seconds or more
51.2 seconds or more

This records successful snapshot sends/receives as well, because frequent snapshots affect cluster availabilities, as bad as spikes in etcd_network_snapshot_send/receive_failures.

Will update http://etcd.readthedocs.io/en/latest and operation docs in following PRs.

ref. #9438

/cc @wenjiaswe

@gyuho gyuho requested review from jpbetz and xiang90 August 12, 2018 05:01
@codecov-io
Copy link

codecov-io commented Aug 12, 2018

Codecov Report

Merging #9997 into master will increase coverage by 0.4%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #9997     +/-   ##
=========================================
+ Coverage   69.24%   69.65%   +0.4%     
=========================================
  Files         386      386             
  Lines       36030    36057     +27     
=========================================
+ Hits        24948    25114    +166     
+ Misses       9276     9158    -118     
+ Partials     1806     1785     -21
Impacted Files Coverage Δ
etcdserver/api/snap/metrics.go 100% <100%> (ø) ⬆️
etcdserver/api/rafthttp/metrics.go 100% <100%> (ø) ⬆️
etcdserver/api/snap/db.go 64% <100%> (+3.13%) ⬆️
etcdserver/api/rafthttp/http.go 65.58% <50%> (-0.48%) ⬇️
etcdserver/api/rafthttp/snapshot_sender.go 81.55% <90%> (+0.94%) ⬆️
clientv3/namespace/watch.go 72.72% <0%> (-6.07%) ⬇️
integration/bridge.go 70.22% <0%> (-1.53%) ⬇️
etcdserver/server.go 74.24% <0%> (+0.14%) ⬆️
clientv3/watch.go 91.93% <0%> (+0.42%) ⬆️
pkg/transport/listener_tls.go 66.22% <0%> (+0.66%) ⬆️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e72730a...6f4c509. Read the comment docs.

Copy link
Contributor

@jpbetz jpbetz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks right. A couple minor suggestions for the names, help strings and bucket sizes of the metrics.

if r.Method != "POST" {
w.Header().Set("Allow", "POST")
http.Error(w, "Method Not Allowed", http.StatusMethodNotAllowed)
snapshotReceiveFailures.WithLabelValues("").Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"" feels a bit vague. Should we add a constant we always use here that is more descriptive?

Namespace: "etcd",
Subsystem: "network",
Name: "snapshot_send",
Help: "Total number of snapshot sends",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operators might not understand what this metric is for. They're more familiar with "snapshot save" and "snapshot restore" commands. Maybe clarify it a bit more? e.g. "Total number of snapshots send from a member to a client or another member" ?

snapshotSend = prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: "etcd",
Subsystem: "network",
Name: "snapshot_send",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for successful sends only, right? Maybe just clarify that in the help string so that it's clear that the total sent count would be this + snapshot_send_failures? Alternatively we could name this "snapshot_send_success", but that seems a bit too verbose..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Will suffix success.

Help: "Total latency distributions of v3 snapshot sends",

// lowest bucket start of upper bound 0.001 sec (1 ms) with factor 2
// highest bucket start of 0.001 sec * 2^13 == 8.192 sec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 8 sec enough for the highest bucket? Would we spare a couple more metric entries to get this up to maybe 30 sec or so? Might be helpful if/when things go wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will update the highest bucket up to 30-second.

gyuho added 2 commits August 15, 2018 12:56
etcd_snap_db_fsync_duration_seconds_count
etcd_snap_db_save_total_duration_seconds_bucket

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Distribution would be:
0.1 second or more
...
25.6 seconds or more
51.2 seconds or more

etcd_network_snapshot_send_success
etcd_network_snapshot_send_failures
etcd_network_snapshot_send_total_duration_seconds
etcd_network_snapshot_receive_success
etcd_network_snapshot_receive_failures
etcd_network_snapshot_receive_total_duration_seconds

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@gyuho
Copy link
Contributor Author

gyuho commented Aug 15, 2018

@jpbetz All addressed. PTAL. Thanks.

Copy link
Contributor

@jpbetz jpbetz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@gyuho gyuho merged commit 2a6bc7d into etcd-io:master Aug 15, 2018
@gyuho gyuho deleted the snap-metrics branch August 15, 2018 21:16
gyuho added a commit to gyuho/etcd that referenced this pull request Aug 18, 2018
…econds" metric

Currently, only v2 metrics ("stats.FollowerStats") tracks Raft message
send latencies. Add Prometheus histogram to track Raft messages for
writes, since heartbeats are probed (see etcd-io#10022)
and snapshots are already being tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit that referenced this pull request Aug 29, 2018
…97-upstream-release-3.1

Automated cherry pick of #9997
gyuho added a commit to gyuho/etcd that referenced this pull request Aug 29, 2018
…econds" metric

Currently, only v2 metrics ("stats.FollowerStats") tracks Raft message
send latencies. Add Prometheus histogram to track Raft messages for
writes, since heartbeats are probed (see etcd-io#10022)
and snapshots are already being tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
wenjiaswe added a commit that referenced this pull request Sep 4, 2018
…97-upstream-release-3.2

Automated cherry pick of #9997
wenjiaswe added a commit that referenced this pull request Oct 3, 2018
…97-upstream-release-3.3

Automated cherry pick of #9997
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants