clientv3/balance: fixed flaky balancer tests #14204

Merged Jul 11, 2022 (1 commit)

Conversation


@lavacat lavacat commented Jul 9, 2022

  • added verification step to indirectly verify that all peers are in balancer subconn list

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@lavacat lavacat force-pushed the release-3.4-balancer-tests branch 2 times, most recently from 9a1b977 to 78c72ec on July 9, 2022 05:10

lavacat commented Jul 9, 2022

fixes #14158


lavacat commented Jul 9, 2022

Looking at the end of the log from the failed run https://github.com/etcd-io/etcd/runs/7049120841?check_suite_focus=true:

we see "msg":"picked" 10 times, as expected. But the first 4 picks have "subconn-size":1. That means the same peer keeps being picked and the switch counter won't be incremented (see the sketch after the log below).

{"level":"info","msg":"state changed","picker":"picker-error","balancer-id":"ckyp592dedhk","connected":true,"subconn":"0xaa01b00","subconn-size":5,"address":"127.0.0.1:38193","old-state":"CONNECTING","new-state":"READY"}
{"level":"info","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","policy":"picker-roundrobin-balanced","subconn-ready":["127.0.0.1:38193 (0xaa01b00)"],"subconn-size":1}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","subconn-index":0,"subconn-size":1}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","subconn-index":0,"subconn-size":1}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","subconn-index":0,"subconn-size":1}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","subconn-index":0,"subconn-size":1}
{"level":"info","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","connected":true,"subconn":"0xaa01ab0","subconn-size":5,"address":"127.0.0.1:36793","old-state":"CONNECTING","new-state":"READY"}
{"level":"info","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","policy":"picker-roundrobin-balanced","subconn-ready":["127.0.0.1:36793 (0xaa01ab0)","127.0.0.1:38193 (0xaa01b00)"],"subconn-size":2}
{"level":"info","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","connected":true,"subconn":"0xaa01ad0","subconn-size":5,"address":"127.0.0.1:37731","old-state":"CONNECTING","new-state":"READY"}
{"level":"info","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","policy":"picker-roundrobin-balanced","subconn-ready":["127.0.0.1:36793 (0xaa01ab0)","127.0.0.1:37731 (0xaa01ad0)","127.0.0.1:38193 (0xaa01b00)"],"subconn-size":3}
{"level":"info","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","connected":true,"subconn":"0xaa01ac0","subconn-size":5,"address":"127.0.0.1:41573","old-state":"CONNECTING","new-state":"READY"}
{"level":"info","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","policy":"picker-roundrobin-balanced","subconn-ready":["127.0.0.1:36793 (0xaa01ab0)","127.0.0.1:37731 (0xaa01ad0)","127.0.0.1:38193 (0xaa01b00)","127.0.0.1:41573 (0xaa01ac0)"],"subconn-size":4}
{"level":"info","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","connected":true,"subconn":"0xaa01b10","subconn-size":5,"address":"127.0.0.1:36837","old-state":"CONNECTING","new-state":"READY"}
{"level":"info","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"ckyp592dedhk","policy":"picker-roundrobin-balanced","subconn-ready":["127.0.0.1:36793 (0xaa01ab0)","127.0.0.1:36837 (0xaa01b10)","127.0.0.1:37731 (0xaa01ad0)","127.0.0.1:38193 (0xaa01b00)","127.0.0.1:41573 (0xaa01ac0)"],"subconn-size":5}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:36837","subconn-index":0,"subconn-size":5}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:36837","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:36793","subconn-index":1,"subconn-size":5}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:36793","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:41573","subconn-index":2,"subconn-size":5}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:41573","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:37731","subconn-index":3,"subconn-size":5}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:37731","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","subconn-index":4,"subconn-size":5}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:38193","success":true,"bytes-sent":true,"bytes-received":true}
{"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:36837","subconn-index":0,"subconn-size":5}
{"level":"debug","msg":"balancer done","picker":"picker-roundrobin-balanced","address":"127.0.0.1:36837","success":true,"bytes-sent":true,"bytes-received":true}
--- FAIL: TestRoundRobinBalancedResolvableFailoverFromRequestFail (0.00s)
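
For context, a minimal sketch of the switch-counting idea described above (hypothetical names, not the actual code in balancer_test.go): a switch is only counted when two consecutive picks return different endpoints, so while the picker has a single ready subconn every request returns the same address and the count stays at zero.

package balancer_test

import (
	"context"
	"testing"
)

// countSwitches is an illustrative helper (sketch only): it issues reqN requests and counts
// how often the picked endpoint differs from the previous one. While "subconn-size" is 1,
// every pick returns the same address, so no switches are recorded.
func countSwitches(t *testing.T, reqN int, reqFunc func(context.Context) (string, error)) int {
	var prev string
	switches := 0
	for i := 0; i < reqN; i++ {
		picked, err := reqFunc(context.Background())
		if err != nil {
			t.Fatalf("unexpected failure %v", err)
		}
		if prev != "" && prev != picked {
			switches++
		}
		prev = picked
	}
	return switches
}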

@lavacat lavacat force-pushed the release-3.4-balancer-tests branch from 78c72ec to 6a3da5e on July 9, 2022 05:25
available := make(map[string]struct{})
// cycle through all peers to indirectly verify that balancer subconn list is fully loaded
// otherwise we can't reliably count switches in the next step
for len(available) < tc.serverCount {
lavacat (Author)

FYI: technically we only need 2 different peers for the switch-counting logic to work.

Comment on lines 288 to 272
var picked string
available := make(map[string]struct{})
// cycle through all peers to indirectly verify that balancer subconn list is fully loaded
// otherwise we can't reliably count switches in the next step
for len(available) < serverCount {
picked, err = reqFunc(context.Background())
if err != nil {
t.Fatalf("Unexpected failure %v", err)
}
available[picked] = struct{}{}
}
ahrtr (Member)

I see exactly the same code multiple times; can you wrap it in a common function?

lavacat (Author)

I've followed the same style as the original tests; there is a bunch of duplication.

ahrtr (Member)

Yes, I understand. Since this is low-hanging fruit, let's try to do a little better where we can. Thanks.

func waitSubconnReady(count int, reqFunc func(context.Context) (string, error)) map[string]struct{} {

}
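
A possible implementation of that helper, based on the duplicated block above (a sketch only; a *testing.T parameter is added for the Fatalf call, and the helper is assumed to live in the existing balancer test file, which already imports context and testing):

// waitSubconnReady keeps issuing requests until count distinct endpoints have been picked,
// which indirectly verifies that the balancer's subconn list is fully loaded before
// switches are counted.
func waitSubconnReady(t *testing.T, count int, reqFunc func(context.Context) (string, error)) map[string]struct{} {
	available := make(map[string]struct{})
	for len(available) < count {
		picked, err := reqFunc(context.Background())
		if err != nil {
			t.Fatalf("Unexpected failure %v", err)
		}
		available[picked] = struct{}{}
	}
	return available
}

Per the earlier FYI, callers could also pass count = 2 instead of the full server count, since two distinct peers are enough for the switch counting to work.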

@lavacat lavacat force-pushed the release-3.4-balancer-tests branch from 6a3da5e to 0b3a09e on July 11, 2022 21:21

ahrtr commented Jul 11, 2022

Please fix the linux-amd64-fmt failure, and ignore the Release/release failure for now.

- added verification step to indirectly verify that all peers are in balancer subconn list

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
@lavacat lavacat force-pushed the release-3.4-balancer-tests branch from 0b3a09e to 185f203 on July 11, 2022 21:44
@ahrtr ahrtr left a comment

LGTM

Thank you @lavacat


ahrtr commented Jul 12, 2022

I saw another issue:

 {"level":"debug","msg":"picked","picker":"picker-roundrobin-balanced","address":"127.0.0.1:39987","subconn-index":0,"subconn-size":1}
{"level":"info","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"cld97ag39nv7","connected":false,"subconn":"0xc000316650","subconn-size":5,"address":"127.0.0.1:39987","old-state":"READY","new-state":"CONNECTING"}
{"level":"info","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"cld97ag39nv7","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"warn","msg":"balancer failed","error":"rpc error: code = Unavailable desc = transport is closing","picker":"picker-roundrobin-balanced","address":"127.0.0.1:39987","success":false,"bytes-sent":true,"bytes-received":false}
--- FAIL: TestRoundRobinBalancedResolvableFailoverFromServerFail (0.01s)
    balancer_test.go:167: Unexpected failure rpc error: code = Unavailable desc = transport is closing

https://github.com/etcd-io/etcd/runs/7293195456?check_suite_focus=true


lavacat commented Jul 12, 2022

@ahrtr interesting, I took a quick look but couldn't find the root cause. Seems like after stopping 1 peer, subconn-size went from 5 to 1 to 0. I'll investigate more.


ahrtr commented Jul 13, 2022

Just raised a new issue to track this: #14216
