DaemonSet test failing #118

Closed
jpkrohling opened this issue Nov 20, 2018 · 17 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@jpkrohling
Contributor

Currently, the DaemonSet e2e test is failing:

Running end-to-end tests...
time="2018-11-20T13:25:34+01:00" level=info msg="passing &{{Jaeger io.jaegertracing/v1alpha1} {agent-as-daemonset  jaeger-jaeger-group-daemonset-1542716724    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] nil [] } {allInOne { {map[]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {DaemonSet  {map[log-level:debug]} {[] [] map[] {map[] map[]}}} { {map[]} {<nil>   }} {<nil>  {[] [] map[] {map[] map[]}}} {[] [] map[] {map[] map[]}}} {}}"
time="2018-11-20T13:27:05+01:00" level=info msg="passing &{{Jaeger io.jaegertracing/v1alpha1} {with-cassandra  jaeger-jaeger-group-cassandra-1542716820    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] nil [] } {allInOne { {map[]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {  {map[]} {[] [] map[] {map[] map[]}}} {cassandra {map[cassandra.servers:cassandra.default.svc]} {<nil>   }} {<nil>  {[] [] map[] {map[] map[]}}} {[] [] map[] {map[] map[]}}} {}}"
time="2018-11-20T13:27:30+01:00" level=info msg="passing &{{Jaeger io.jaegertracing/v1alpha1} {my-jaeger  jaeger-jaeger-group-my-other-jaeger-1542716840    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] nil [] } {allInOne { {map[log-level:debug memory.max-traces:10000]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {  {map[]} {[] [] map[] {map[] map[]}}} { {map[]} {<nil>   }} {<nil>  {[] [] map[] {map[] map[]}}} {[] [] map[] {map[] map[]}}} {}}"
time="2018-11-20T13:27:30+01:00" level=info msg="passing &{{Jaeger io.jaegertracing/v1alpha1} {my-jaeger  jaeger-jaeger-group-my-jaeger-1542716840    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] nil [] } {allInOne { {map[memory.max-traces:10000 log-level:debug]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {0  {map[]} {[] [] map[] {map[] map[]}}} {  {map[]} {[] [] map[] {map[] map[]}}} { {map[]} {<nil>   }} {<nil>  {[] [] map[] {map[] map[]}}} {[] [] map[] {map[] map[]}}} {}}"
--- FAIL: TestJaeger (151.38s)
    --- FAIL: TestJaeger/jaeger-group (115.68s)
        --- FAIL: TestJaeger/jaeger-group/daemonset (70.16s)
            client.go:57: resource type Role with namespace/name (jaeger-jaeger-group-daemonset-1542716724/jaeger-operator) created
            client.go:57: resource type RoleBinding with namespace/name (jaeger-jaeger-group-daemonset-1542716724/default-account-jaeger-operator) created
            client.go:57: resource type Deployment with namespace/name (jaeger-jaeger-group-daemonset-1542716724/jaeger-operator) created
            jaeger_test.go:50: Initialized cluster resources
            wait_util.go:45: Waiting for full availability of jaeger-operator deployment (0/1)
            wait_util.go:51: Deployment available (1/1)
            client.go:57: resource type Jaeger with namespace/name (jaeger-jaeger-group-daemonset-1542716724/agent-as-daemonset) created
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            wait_util.go:55: Waiting for full availability of agent-as-daemonset-agent-daemonset daemonsets (0/1)
            daemonset.go:29: timed out waiting for the condition
            client.go:75: resource type Jaeger with namespace/name (jaeger-jaeger-group-daemonset-1542716724/agent-as-daemonset) successfully deleted
            client.go:75: resource type Deployment with namespace/name (jaeger-jaeger-group-daemonset-1542716724/jaeger-operator) successfully deleted
            client.go:75: resource type RoleBinding with namespace/name (jaeger-jaeger-group-daemonset-1542716724/default-account-jaeger-operator) successfully deleted
            client.go:75: resource type Role with namespace/name (jaeger-jaeger-group-daemonset-1542716724/jaeger-operator) successfully deleted
FAIL
FAIL	github.com/jaegertracing/jaeger-operator/test/e2e	151.460s

According to git bisect:

13d7cc5db1b3b9effbd262e9f0e77d8a1b76d139 is the first bad commit
commit 13d7cc5db1b3b9effbd262e9f0e77d8a1b76d139
Author: Juraci Paixão Kröhling <juraci.github@kroehling.de>
Date:   Tue Nov 13 16:29:46 2018 +0100
@jpkrohling jpkrohling added bug Something isn't working help wanted Extra attention is needed labels Nov 20, 2018
@objectiser
Contributor

Strange: that commit is mainly doc/changelog changes, except that the jaeger version moves from 1.7 to 1.8. I assume the e2e tests would pick up the version from that jaeger.version file?

@pavolloffay
Member

is e2e using agent?

@jpkrohling
Contributor Author

> is e2e using agent?

Not quite sure I understand the question. There's an e2e test with the agent as DaemonSet (the one that is failing), but there's also a test with the agent as side-car.

@dlmiddlecote

dlmiddlecote commented Nov 20, 2018

I'm noticing the same thing when trying to run the agent as a DaemonSet: the agent-daemonset never becomes available, e.g.:

jaeger-agent-daemonset-vbnzv        0/1       Running   0          7m
jaeger-collector-745dddf5c6-thxxx   1/1       Running   0          7m
jaeger-operator-7f5d55ffb6-qdzt2    1/1       Running   0          8m
jaeger-query-74bb5dc84d-6ppqs       1/1       Running   0          7m

The logs I get from this Pod are:

{"level":"warn","ts":1542730394.6390572,"caller":"tchannel/flags.go:67","msg":"Using deprecated configuration","option":"collector.host-port"} 
{"level":"info","ts":1542730394.639524,"caller":"tchannel/builder.go:94","msg":"Enabling service discovery","service":"jaeger-collector"} 
{"level":"info","ts":1542730394.6397831,"caller":"peerlistmgr/peer_list_mgr.go:111","msg":"Registering active peer","peer":"jaeger-collector:14267"} 
{"level":"info","ts":1542730394.6407661,"caller":"agent/main.go:75","msg":"Starting agent"} 
{"level":"info","ts":1542730395.6402557,"caller":"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1} 
{"level":"info","ts":1542730395.640376,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"jaeger-collector:14267"} 
{"level":"error","ts":1542730395.6445525,"caller":"peerlistmgr/peer_list_mgr.go:171","msg":"Unable to connect","host:port":"jaeger-collector:14267","connCheckTimeout":0.25,"error":"dial tcp: lookup jaeger-collector on 10.96.0.10:53: server misbehaving","stacktrace":"github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).ensureConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:171ngh.neting.cc/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).maintainConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:101"} 
{"level":"info","ts":1542730396.6405394,"caller":"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1} 
{"level":"info","ts":1542730396.6407301,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"jaeger-collector:14267"} 
{"level":"error","ts":1542730396.651564,"caller":"peerlistmgr/peer_list_mgr.go:171","msg":"Unable to connect","host:port":"jaeger-collector:14267","connCheckTimeout":0.25,"error":"dial tcp: lookup jaeger-collector on 10.96.0.10:53: server misbehaving","stacktrace":"github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).ensureConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:171ngh.neting.cc/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).maintainConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:101"} 
{"level":"info","ts":1542730397.6404357,"caller":"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1} 
{"level":"info","ts":1542730397.6408088,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"jaeger-collector:14267"} 
{"level":"error","ts":1542730397.6513267,"caller":"peerlistmgr/peer_list_mgr.go:171","msg":"Unable to connect","host:port":"jaeger-collector:14267","connCheckTimeout":0.25,"error":"dial tcp: lookup jaeger-collector on 10.96.0.10:53: server misbehaving","stacktrace":"github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).ensureConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:171ngh.neting.cc/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).maintainConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:101"} 
{"level":"info","ts":1542730398.6404321,"caller":"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1} 
{"level":"info","ts":1542730398.6406636,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"jaeger-collector:14267"} 
{"level":"error","ts":1542730398.6499462,"caller":"peerlistmgr/peer_list_mgr.go:171","msg":"Unable to connect","host:port":"jaeger-collector:14267","connCheckTimeout":0.25,"error":"dial tcp: lookup jaeger-collector on 10.96.0.10:53: server misbehaving","stacktrace":"github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).ensureConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:171ngh.neting.cc/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).maintainConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:101"} 
{"level":"info","ts":1542730399.640582,"caller":"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1} 
{"level":"info","ts":1542730399.640759,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"jaeger-collector:14267"} 
{"level":"error","ts":1542730399.6502535,"caller":"peerlistmgr/peer_list_mgr.go:171","msg":"Unable to connect","host:port":"jaeger-collector:14267","connCheckTimeout":0.25,"error":"dial tcp: lookup jaeger-collector on 10.96.0.10:53: server misbehaving","stacktrace":"github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).ensureConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:171ngh.neting.cc/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).maintainConnectionsnt/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:101"} 
{"level":"info","ts":1542730400.640373,"caller":"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1} 
{"level":"info","ts":1542730400.6405044,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"jaeger-collector:14267"} 
{"level":"info","ts":1542730400.6489508,"caller":"peerlistmgr/peer_list_mgr.go:176","msg":"Connected to peer","host:port":"[::]:14267"}

It seems like it connects to the collector and then "stops".

The readiness probe is what is preventing the agent from becoming available:

  Warning  Unhealthy              4m (x31 over 9m)  kubelet, minikube  Readiness probe failed: HTTP probe failed with statuscode: 400
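For context, the readiness probe in question is an HTTP probe against the agent's /metrics endpoint. A minimal sketch of what such a probe looks like in a DaemonSet pod spec is below; the port, path, and timings are assumptions for illustration, not the operator's actual generated manifest:

```yaml
# Hypothetical fragment, not the operator's real output.
# Port 5778 (the agent's HTTP server) and path /metrics are assumptions
# based on this thread's discussion.
containers:
  - name: jaeger-agent-daemonset
    image: jaegertracing/jaeger-agent:1.8
    readinessProbe:
      httpGet:
        path: /metrics
        port: 5778
      initialDelaySeconds: 1
      periodSeconds: 10
```

With the 1.8 image the HTTP server backing this endpoint never starts properly, so the probe keeps failing and the pod never reports Ready.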

@dlmiddlecote

Follow-up:
The readiness probe failure is caused by the /metrics endpoint not being available, which in turn is caused by the agent HTTP server not being started. This was fixed in this PR.
The jaegertracing/jaeger-agent:1.8 image does not have this fix, but jaegertracing/jaeger-agent:latest does.
I've tried it, and the latest image allows the agent to become available.

Not sure if that really helps with this problem, but I thought I'd mention it, as it helps me, and maybe others.

@jpkrohling
Contributor Author

This is some nice detective work, @dlmiddlecote!

@pavolloffay: did you end up releasing 1.8.1 with the fix from jaegertracing/jaeger#1178?

@jpkrohling
Contributor Author

For the record, the image for 1.8 is still the same as 1.8.0. I expect the 1.8 tag to point to 1.8.1 once that's released.

https://hub.docker.com/r/jaegertracing/jaeger-agent/tags/

@pavolloffay
Member

Not yet, I am waiting for jaegertracing/jaeger-ui#263.

@jpkrohling
Contributor Author

Alright, I'll remove the readiness check then, as it might take a couple of days for that to get merged.

@dlmiddlecote

@jpkrohling is there anything else you can use as a readiness/liveness probe?
Is there a way to see whether the agent is connected to the collector? That's what the agent really needs, not just the fact that metrics are being exposed.

@jpkrohling
Contributor Author

jpkrohling commented Nov 21, 2018

The agent is pretty much the only component without the health check handler. The metrics endpoint is the only one we can call without a service parameter, so we shouldn't be calling the baggage restriction manager or the sampling manager.

https://github.com/jaegertracing/jaeger/blob/f2eb7d14902909d2ace35b224f0c1e32519caf6f/cmd/agent/app/httpserver/server.go#L37-L50

And for the all-in-one, note how we create the hc and pass it to the collector and query, but not to the agent:

https://github.com/jaegertracing/jaeger/blob/f2eb7d14902909d2ace35b224f0c1e32519caf6f/cmd/all-in-one/main.go#L98

https://github.com/jaegertracing/jaeger/blob/f2eb7d14902909d2ace35b224f0c1e32519caf6f/cmd/all-in-one/main.go#L132-L135

This would be a nice feature request, I believe. Would you mind opening an issue on the main repo (jaegertracing/jaeger)?

@dlmiddlecote

Thanks for the explanation! I can open the issue, to add a healthcheck to the agent.

What are the next steps for this? Removing the readiness probe and releasing a new version of the operator chart?

@jpkrohling
Contributor Author

> operator chart

Did you mean this repo here? If so, I can certainly release a patch version this afternoon.

@dlmiddlecote

Yes, sorry, this repo; I got muddled up with my own repos! Thanks 👍

@jpkrohling
Contributor Author

jpkrohling commented Nov 21, 2018

@dlmiddlecote The Jaeger Operator 1.8.1 is out. Let me know if it looks alright to you.

@dlmiddlecote

@jpkrohling Looks good, thanks for the speedy turn-around.

@kevinearls
Contributor

@jpkrohling I think you can close this. The test currently passes on Kubernetes, and the remaining issues on OpenShift are being tracked in #178.
