Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rafthttp: probe all raft transports #10155

Merged
merged 4 commits into from
Oct 9, 2018
Merged

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Oct 7, 2018

This PR adds another probing routine to monitor the connection
for Raft message transports. Previously, we only monitored
snapshot transports.

In our production cluster, we found one TCP connection had >8-sec
latencies to a remote peer, but "etcd_network_peer_round_trip_time_seconds"
metrics shows <1-sec latency distribution, which means etcd server
was not sampling enough while such latency spikes happen
outside of snapshot pipeline connection.

Reworked on #10022.

gyuho added 4 commits October 7, 2018 03:13
Preliminary work to add prober to "streamRt"

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
This PR adds another probing routine to monitor the connection
for Raft message transports. Previously, we only monitored
snapshot transports.

In our production cluster, we found one TCP connection had >8-sec
latencies to a remote peer, but "etcd_network_peer_round_trip_time_seconds"
metrics shows <1-sec latency distribution, which means etcd server
was not sampling enough while such latency spikes happen
outside of snapshot pipeline connection.

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@codecov-io
Copy link

Codecov Report

Merging #10155 into master will decrease coverage by 0.2%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10155      +/-   ##
==========================================
- Coverage   71.54%   71.33%   -0.21%     
==========================================
  Files         390      390              
  Lines       36307    36315       +8     
==========================================
- Hits        25975    25906      -69     
- Misses       8530     8622      +92     
+ Partials     1802     1787      -15
Impacted Files Coverage Δ
etcdserver/api/rafthttp/metrics.go 100% <ø> (ø) ⬆️
etcdserver/api/rafthttp/transport.go 83.87% <100%> (+0.45%) ⬆️
etcdserver/api/rafthttp/probing_status.go 57.44% <83.33%> (-0.34%) ⬇️
etcdctl/ctlv3/command/printer_simple.go 32.21% <0%> (-40.27%) ⬇️
etcdctl/ctlv2/command/role_commands.go 29.76% <0%> (-27.39%) ⬇️
etcdctl/ctlv2/command/get_command.go 50% <0%> (-25%) ⬇️
proxy/grpcproxy/register.go 69.44% <0%> (-13.89%) ⬇️
clientv3/namespace/watch.go 87.87% <0%> (-6.07%) ⬇️
etcdctl/ctlv3/command/lease_command.go 65.34% <0%> (-5.95%) ⬇️
clientv3/leasing/txn.go 88.09% <0%> (-3.18%) ⬇️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6976819...884a8bd. Read the comment docs.

@jingyih
Copy link
Contributor

jingyih commented Oct 8, 2018

/cc @jpbetz

@jpbetz
Copy link
Contributor

jpbetz commented Oct 8, 2018

@wenjiaswe Would you have a look?

@wenjiaswe
Copy link
Contributor

LGTM

@gyuho gyuho merged commit d2a0f17 into etcd-io:master Oct 9, 2018
@gyuho gyuho deleted the metrics-messages branch October 9, 2018 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants