clientv3/ordering TestUnresolvableOrderViolation fails #8624

Closed
gyuho opened this issue Sep 28, 2017 · 12 comments

@gyuho
Contributor

gyuho commented Sep 28, 2017

TestUnresolvableOrderViolation goes into infinite leader elections...

https://semaphoreci.com/coreos/etcd/branches/master/builds/2379

@gyuho gyuho changed the title clientv3/ordering unit test fails, TestUnresolvableOrderViolation clientv3/ordering TestUnresolvableOrderViolation fails Oct 3, 2017
@xiang90
Contributor

xiang90 commented Oct 4, 2017

Is this already fixed?

@gyuho
Contributor Author

gyuho commented Oct 4, 2017

Let me try to reproduce.

@gyuho
Contributor Author

gyuho commented Oct 5, 2017

@xiang90
Contributor

xiang90 commented Oct 5, 2017

/cc @lorneli would you like to take a look at this test failure?

@lorneli
Contributor

lorneli commented Oct 6, 2017

@xiang90 Yeah. I'll follow up this weekend.

@xiang90 xiang90 added this to the v3.4.0 milestone Oct 6, 2017
@lorneli
Contributor

lorneli commented Oct 8, 2017

I haven't been able to reproduce this failure locally, so I have to rely on the log from Semaphore.

Based on log-ordering.txt, TestUnresolvableOrderViolation blocks while putting a key-value pair to the first member of the cluster, so the failure is not actually related to the clientv3/ordering code. It looks like an integration/etcd-server bug.

cfg := clientv3.Config{
    // ...
}
cli, err := clientv3.New(cfg)
// ....
cli.SetEndpoints(clus.Members[0].GRPCAddr())
time.Sleep(1 * time.Second)
_, err = cli.Put(ctx, "foo", "bar") // blocks here; the server doesn't respond

I see many gRPC warning lines about connections closing, printed by grpclog; for example, lines 5163-5168 in log-ordering.txt (see below). Is there a guarantee that log lines generated by capnslog and grpclog appear in the order they happened? The log shows that three nodes can't be dialed after the etcd servers are published, which is a little confusing...

2017-10-05 11:41:31.385241 I | etcdserver: published {Name:3998243288755946524 ClientURLs:[unix://127.0.0.1:2105415012]} to cluster 77a8bf0cb4e3ab13
2017-10-05 11:41:31.389455 I | etcdserver: published {Name:337497985577860612 ClientURLs:[unix://127.0.0.1:2104815012]} to cluster 77a8bf0cb4e3ab13
2017-10-05 11:41:31.393393 I | etcdserver: published {Name:3210302211912965408 ClientURLs:[unix://127.0.0.1:2105015012]} to cluster 77a8bf0cb4e3ab13
2017-10-05 11:41:31.393679 I | etcdserver: published {Name:3123809328254117252 ClientURLs:[unix://127.0.0.1:2105215012]} to cluster 77a8bf0cb4e3ab13
2017-10-05 11:41:31.399699 I | etcdserver: published {Name:1904611785361370535 ClientURLs:[unix://127.0.0.1:2105615012]} to cluster 77a8bf0cb4e3ab13
2017-10-05 11:41:31.414984 I | etcdserver: setting up the initial cluster version to 3.2
2017-10-05 11:41:31.423877 N | etcdserver/membership: set the initial cluster version to 3.2
2017-10-05 11:41:31.425623 N | etcdserver/membership: set the initial cluster version to 3.2
2017-10-05 11:41:31.430211 N | etcdserver/membership: set the initial cluster version to 3.2
2017-10-05 11:41:31.431297 N | etcdserver/membership: set the initial cluster version to 3.2
2017-10-05 11:41:31.432081 N | etcdserver/membership: set the initial cluster version to 3.2
WARNING: 2017/10/05 11:41:31 Failed to dial localhost:31238093282541172520: grpc: the connection is closing; please retry.
WARNING: 2017/10/05 11:41:31 Failed to dial localhost:31238093282541172520: grpc: the connection is closing; please retry.
WARNING: 2017/10/05 11:41:31 Failed to dial localhost:39982432887559465240: grpc: the connection is closing; please retry.
WARNING: 2017/10/05 11:41:31 Failed to dial localhost:39982432887559465240: grpc: the connection is closing; please retry.
WARNING: 2017/10/05 11:41:31 Failed to dial localhost:19046117853613705350: grpc: the connection is closing; please retry.
WARNING: 2017/10/05 11:41:31 Failed to dial localhost:19046117853613705350: grpc: the connection is closing; please retry.

@xiang90
Contributor

xiang90 commented Oct 9, 2017

We probably need to investigate why the 1 second sleep is there. It seems pretty arbitrary.

@xiang90
Contributor

xiang90 commented Oct 9, 2017

/cc @mangoslicer

@mangoslicer
Contributor

The 1 second sleep was to ensure that the endpoint was set to the first member. I'll try to reproduce the error locally and see if changing the 1 second delay or not setting the endpoint to the first member changes anything.

@mkumatag
Contributor

This is failing consistently on the ppc64le platform - https://jenkins-etcd-public.prod.coreos.systems/job/etcd-ci-ppc64/

@mkumatag
Contributor

mkumatag commented Oct 11, 2017

@gyuho I still see the tests failing. Any idea why this issue is closed?

@gyuho
Contributor Author

gyuho commented Oct 11, 2017

@mkumatag client integration tests aren't stable yet.
We are fixing those failures now with highest priority (#8678 and #8677).

Sorry!
