Envoy Crash when connection state assert failure #9083
Comments
@alyssawilk do you mind taking a look at this?
Well, the good news is that this is a debug ASSERT, and there's actual error handling, so with an opt build it shouldn't crash. I think the problem is here:
Quick advice: switching to an opt build should fix the problem due to the error handling (and if I'm incorrect, PLEASE take it up with the envoy-security group, don't disclose crashes in public! :-) )
I've got a TODO to clean this area up, since it was obsoleted by #8352. If switching to an opt build isn't an option, I can do the quick fix of moving the read disable inside the connection checks. If removing that check or doing an opt build is sufficient to hold you over, I can probably get to the cleanup some time in early December.
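To illustrate the distinction drawn above between a debug ASSERT and runtime error handling, here is a minimal sketch. It is not Envoy's code: `SKETCH_ASSERT`, `SketchConnection`, and `readDisableIfOpen` are hypothetical names, and keying the assert on `NDEBUG` is an assumption made for the example. The helper at the end shows the general shape of the proposed quick fix, where the connection state is checked before read disable is attempted.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>

// Hypothetical stand-in for a debug-only assertion: it aborts in debug builds
// and compiles to nothing when NDEBUG is defined (typical for optimized builds).
#ifndef NDEBUG
#define SKETCH_ASSERT(cond)                                 \
  do {                                                      \
    if (!(cond)) {                                          \
      std::cerr << "assert failure: " #cond << std::endl;   \
      std::abort();                                         \
    }                                                       \
  } while (false)
#else
#define SKETCH_ASSERT(cond) do { (void)sizeof(cond); } while (false)
#endif

enum class State { Open, Closing, Closed };

// Hypothetical, heavily simplified connection used only for this sketch.
class SketchConnection {
public:
  State state() const { return state_; }
  void close() { state_ = State::Closed; }
  bool readEnabled() const { return read_enabled_; }

  void readDisable(bool disable) {
    // Debug builds abort here if the connection is not open.
    SKETCH_ASSERT(state() == State::Open);
    // Optimized builds fall through to this runtime check instead of crashing.
    if (state() != State::Open) {
      return;
    }
    read_enabled_ = !disable;
  }

private:
  State state_{State::Open};
  bool read_enabled_{true};
};

// Shape of the "quick fix": only call readDisable() once the connection is
// known to be open, so the assert cannot fire from this call site.
bool readDisableIfOpen(SketchConnection& connection, bool disable) {
  if (connection.state() != State::Open) {
    return false;
  }
  connection.readDisable(disable);
  return true;
}
```

Calling `readDisableIfOpen` on a closed connection returns `false` without touching the assert, whereas calling `readDisable` directly on a closed connection aborts in a debug build of this sketch and is silently ignored in an optimized one.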
Thanks for your reply, @alyssawilk. We will try your advice.
I can't see how the access log would create this problem (maybe it's making a race condition worse?), but having spent some time looking at the code, I'm frankly baffled how you're getting here.
The underlying connection looks like it is closed in one of three places. onIdleTimeout and onConnectionDurationTimeout are never in the dispatch stack. The third is checkForDeferredClose(), which I would think would be the culprit.
checkForDeferredClose only calls close() if streams_.empty(). On each dispatch() call for HTTP/1 we might create one stream, then might checkForDeferredClose, then might read disable, but I can't see how we'd go from no streams where we checkForDeferredClose to having streams at read disable time, because the parser doesn't act then.
I've played around with the network class to see if I can force a dispatch event in delay close and can't get that to fail either. I can do the quick fix, but I'd really love to understand what's going on here. If you guys ever get another reproducible situation and can turn up a bit of logging, ping me on Slack and I'd be happy to try to debug with you.
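The reasoning above can be modeled in a few lines. This is a toy sketch, not the real connection manager: `SketchConnectionManager`, `close_deferred`, and the rule that read disable is only attempted while a stream exists are assumptions taken from the description in the comment, and the actual code paths are more involved. Under those assumptions, closing requires zero streams and read disable requires at least one, so a single `dispatch()` call cannot reach read disable on a connection it just closed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model of the sequence described above.
struct SketchConnectionManager {
  bool connection_open = true;
  bool close_deferred = false;  // assumed flag: a close is pending
  bool read_disabled = false;
  std::vector<int> streams;     // stands in for streams_
  int next_stream_id = 0;

  // Closes the connection only when no streams remain.
  void checkForDeferredClose() {
    if (close_deferred && streams.empty()) {
      connection_open = false;
    }
  }

  // Returns false if read disable would be attempted on a closed connection,
  // which is the condition the debug ASSERT in the log guards against.
  bool dispatch(bool request_arrived) {
    if (request_arrived) {
      streams.push_back(next_stream_id++);  // at most one new stream per call
    }
    checkForDeferredClose();
    if (!streams.empty()) {  // assumed: read disable only with an active stream
      if (!connection_open) {
        return false;
      }
      read_disabled = true;
    }
    return true;
  }
};
```

In this model `dispatch()` returns `true` for every combination of `close_deferred` and `request_arrived`, which matches the puzzlement expressed above: something outside this simple sequence would have to close the connection for the assert to fire.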
Title: Envoy Crash when connection state assert failure
Description:
We use Envoy as the ingress in K8s. The load at the time was 30-40 rps. Checking the logs, I found the following error every 1-2 hours.
Repro steps:
We use a customized istio-pilot as the xDS server to configure Envoy. As mentioned above, Envoy is the ingress of the K8s cluster, and the xDS protocol has not been specially extended. Downstream of Envoy is Kong, as shown below:
Admin and Stats Output:
Config:
//...
dynamic_active_listener:
//...
{
"version_info": "2019-11-20T09:27:34Z/24",
"listener": {
"name": "0.0.0.0_8550",
"address": {
"socket_address": {
"address": "0.0.0.0",
"port_value": 8550
}
},
"filter_chains": [
{
"filters": [
{
"name": "envoy.http_connection_manager",
"typed_config": {
"@type": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager",
"stat_prefix": "http_default",
"rds": {
"config_source": {
"ads": {}
},
"route_config_name": "80"
},
"http_filters": [
{
"name": "envoy.router"
}
],
"access_log": [
{
"name": "envoy.file_access_log",
"config": {
"path": "/dev/stdout",
"format": "[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% "%DYNAMIC_METADATA(istio.mixer:status)%" %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME%\n"
}
}
],
"use_remote_address": true,
]
}
}
]
}
]
}
]
},
"last_updated": "2019-11-20T09:27:37.875Z"
}
]
},
The config is too long to include in full here.
Logs:
[2019-11-20T02:17:23.960Z] "GET /bflow HTTP/1.1" 200 - "-" 0 1541 8 7 "183.136.182.131, 10.160.246.207, 10.160.246.207,10.199.208.10" "Apache-HttpClient/4.4.1 (Java/1.8.0_191)" "9a2075b7-04ec-4f37-8e7a-f441513c6d4d" "test.yanxuan-nav-bflow.service.mailsaas" "10.199.209.106:8080" outbound|80||yanxuan-nav-bflow.test.svc.cluster.local - 10.199.211.97:8550 10.199.208.10:62890 -
[2019-11-20T02:17:25.102Z] "POST /log-platform/datasource/log-flume-service/writeFlumeMetrics.json HTTP/1.1" 200 - "-" 1366 12 103 102 "10.130.69.158,10.199.208.10" "Apache-HttpClient/4.2.1 (java 1.5)" "e5a3831d-3196-4251-b67c-743efe993b9a" "test.log-flume-service.service.mailsaas" "10.199.208.120:8080" outbound|80||log-flume-service.test.svc.cluster.local - 10.199.211.97:8550 10.199.208.10:62894 -
[2019-11-20 02:17:34.630][40][critical][assert] [external/envoy/source/common/network/connection_impl.cc:285] assert failure: state() == State::Open.
[2019-11-20 02:17:34.630][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Aborted, suspect faulting address 0x15
//....
[2019-11-20 02:17:34.777][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #23: start_thread [0x7fcd4d6446db]
2019-11-20T02:17:35.505685Z info Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused
2019-11-20T02:17:35.660007Z warn Epoch 0 terminated with an error: signal: aborted (core dumped)
2019-11-20T02:17:35.660042Z warn Aborted all epochs
2019-11-20T02:17:35.660186Z info Epoch 0: set retry delay to 200ms, budget to 9
2019-11-20T02:17:35.860294Z info Reconciling retry (budget 9)
2019-11-20T02:17:35.860380Z info Epoch 0 starting
2019-11-20T02:17:35.861613Z info Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster qz-ingress --service-node sidecar~10.199.211.97~qz-ingress-75b6868f98-p4mzt.istio-system~istio-system.svc.cluster.local --max-obj-name-len 189 --allow-unknown-fields -l warning]
Call Stack: