
Got plugin error "transport: error while dialing: dial unix /tmp/plugin" #99

Open
bocharovf opened this issue Nov 26, 2021 · 6 comments

@bocharovf
Collaborator

Describe the bug
Got an error in the Jaeger UI with the ClickHouse gRPC plugin when searching for traces:
HTTP Error: plugin error: rpc error: code = Unavailable
desc = connection error:
desc = "transport: error while dialing: dial unix /tmp/plugin2381205323:
connect: connection refused

It seems to happen either:

  • after several hours of Jaeger Query inactivity, or
  • after jaeger_index_local exceeds ~70 million rows.

ClickHouse is up and running.
Restarting Jaeger Query fixes the problem temporarily (until the next long search).

To Reproduce
Steps to reproduce the behavior:

  1. Ingest ~70 million rows into jaeger_index_local
  2. Search for traces

Expected behavior
Traces are found

Screenshots
(screenshot of the error in the Jaeger UI attached)

Version (please complete the following information):

  • OS: Linux
  • Jaeger Query version: 1.25
  • ClickHouse plugin version: 0.8
  • Clickhouse version: 21.8.10.19
  • Deployment: Kubernetes

What troubleshooting steps did you try?

Additional context
jaeger-query-clickhouse-5ff64c9dbc-h7jr4.log

@bocharovf added the bug label on Nov 26, 2021
@bocharovf
Collaborator Author

Just measured: the error occurs after 16.7 s of trace searching.

In the ClickHouse error log:
2021.11.27 11:10:51.904396 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Error> TCPHandler: Code: 210, e.displayText() = DB::Exception: Connection reset by peer, while writing to socket (_10.X.X.X_:61097), Stack trace: ...
2021.11.27 11:10:51.904562 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Warning> TCPHandler: Client has gone away.
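
For context, this "Connection reset by peer" / "Client has gone away" pattern is what ClickHouse reports when the caller abandons a query before the result is fully written, for example when a context deadline fires on the Go side mid-query. A minimal sketch of that failure mode, assuming database/sql with the clickhouse-go driver; the DSN, timeout value, and query are illustrative assumptions, not the plugin's actual configuration:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go" // ClickHouse driver for database/sql
)

func main() {
	// Hypothetical DSN; the real plugin builds its own connection string.
	db, err := sql.Open("clickhouse", "tcp://clickhouse:9000?database=jaeger")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// If this deadline fires while results are still streaming, the driver
	// drops the TCP connection and ClickHouse logs "Client has gone away".
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, "SELECT model FROM jaeger_spans_local LIMIT 1000")
	if err != nil {
		log.Printf("query failed (possibly context deadline exceeded): %v", err)
		return
	}
	defer rows.Close()
	for rows.Next() {
		// process each row...
	}
	if err := rows.Err(); err != nil {
		log.Printf("row iteration aborted: %v", err)
	}
}
```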

@bocharovf
Collaborator Author

Related to #100. It seems that huge traces (100k+ spans) lead to a timeout. The ability to limit the number of fetched spans could partially help (see the sketch below).
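
As an illustration of what such a limit could look like, a per-trace span cap could be applied with a LIMIT clause on the span query. This is only a rough sketch: maxSpansPerTrace is a hypothetical knob, not an existing plugin option, and the table/column names are assumptions based on the local-table naming in this issue.

```go
package main

import "fmt"

// buildSpanQuery sketches how a per-trace span cap could be applied.
// maxSpansPerTrace is a hypothetical setting, not part of the plugin today.
func buildSpanQuery(maxSpansPerTrace int) string {
	q := "SELECT model FROM jaeger_spans_local WHERE traceID = ?"
	if maxSpansPerTrace > 0 {
		// Cap the result set so a single 100k+ span trace cannot blow up
		// query time and memory on its own.
		q += fmt.Sprintf(" LIMIT %d", maxSpansPerTrace)
	}
	return q
}

func main() {
	fmt.Println(buildSpanQuery(10000))
}
```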

@nickbp
Contributor

nickbp commented Jan 26, 2022

From the comments it sounds like this is caused by (and resolved by) #100

@nickbp closed this as completed on Jan 26, 2022
@bocharovf
Collaborator Author

Unfortunately #100 helped only partially; there are still errors on our test server. My guess is that we handle query timeouts or dropped ClickHouse connections incorrectly. Once the error occurs the plugin stops working entirely and all further requests fail, so this is a severe problem for us.
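
To illustrate the kind of handling I mean: if the plugin keeps reusing a connection pool that has gone bad after a timeout or drop, every later request fails. A minimal sketch of pinging and reopening the pool on failure, assuming database/sql with the clickhouse-go driver; the store type, DSN, and recovery strategy are assumptions for illustration, not the plugin's actual reader implementation:

```go
package storage

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/ClickHouse/clickhouse-go" // ClickHouse driver for database/sql
)

// store is a hypothetical wrapper; the real plugin has its own reader type.
type store struct {
	db *sql.DB
}

// query pings the pooled connection first and reopens the pool if the ping
// fails, so one dropped ClickHouse connection does not poison later requests.
func (s *store) query(ctx context.Context, q string, args ...interface{}) (*sql.Rows, error) {
	if err := s.db.PingContext(ctx); err != nil {
		log.Printf("ping failed, reopening connection pool: %v", err)
		s.db.Close()
		// DSN is an assumption; the plugin builds its own connection string.
		db, openErr := sql.Open("clickhouse", "tcp://clickhouse:9000?database=jaeger")
		if openErr != nil {
			return nil, openErr
		}
		s.db = db
	}
	return s.db.QueryContext(ctx, q, args...)
}
```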

@bocharovf reopened this on Jan 27, 2022
@vosmax

vosmax commented Feb 11, 2022

I've run into this issue as well. I believe it's related to resources: when querying a huge amount of data, Jaeger's memory consumption grows avalanche-like (see the attached graph).
(memory usage graph attached)
Sometimes an OOMKill doesn't occur, because when the pod almost hits its memory request, only the plugin process gets killed and the memory is freed. As a result, the pod is not restarted and the plugin no longer works.
The attached memory graph clearly shows the pod almost hitting its 1 GB request, the memory being freed, the error transport: error while dialing: dial unix /tmp/plugin... connection refused appearing, and 2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed" showing up in the pod log.
I reproduced this error within several minutes (blue graph); the pod log excerpt and a sketch of a possible mitigation follow below.

2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed"
{"level":"info","ts":1644609917.5783212,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
2022-02-11T23:05:17.592Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Canceled desc = context canceled"
{"level":"info","ts":1644609917.5928702,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929146,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929787,"caller":"grpclog/component.go:71","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3780131,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.378047,"caller":"channelz/logging.go:50","msg":"[core]Subchannel picks a new address \"unused\" to connect","system":"grpc","grpc_log":true}
{"level":"warn","ts":1644610105.3781164,"caller":"channelz/logging.go:75","msg":"[core]grpc: addrConn.createTransport failed to connect to {unused unused <nil>  0 }. Err: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781302,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781457,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781548,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781884,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {TRANSIENT_FAILURE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781965,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"error","ts":1644610105.378254,"caller":"app/http_handler.go:487","msg":"HTTP handler, Internal Server Error","error":"plugin error: rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgh.neting.cc/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngh.neting.cc/jaegertracing/jaeger/cmd/query/app.(*APIHandler).getOperationsLegacy\n\tgh.neting.cc/jaegertracing/jaeger/cmd/query/app/http_handler.go:180\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngh.neting.cc/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgh.neting.cc/opentracing-contrib/go-stdlib@v1.0.0/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngh.neting.cc/gorilla/mux.(*Router).ServeHTTP\n\tgh.neting.cc/gorilla/mux@v1.8.0/mux.go:210\ngh.neting.cc/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgh.neting.cc/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngh.neting.cc/gorilla/handlers.CompressHandlerLevel.func1\n\tgh.neting.cc/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngh.neting.cc/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgh.neting.cc/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2879\nnet/http.(*conn).serve\n\tnet/http/server.go:1930"}
{"level":"info","ts":1644610106.3785865,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610106.3786347,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}``` 

@leizhag

leizhag commented Jun 3, 2022

Starting from v1.33.0, Jaeger Query exits when the plugin crashes. This can be used as a workaround until the root cause of the plugin crash is found.
