
Envoy Crash when connection state assert failure #9083

Closed

ydh926 opened this issue Nov 20, 2019 · 4 comments

@ydh926 commented Nov 20, 2019

Title: Envoy Crash when connection state assert failure

Description:
We use Envoy as the ingress in Kubernetes (K8S). The load at the time was 30-40 rps. When checking the logs, I found the following error every 1-2 hours.

Repro steps:
We use a customized istio-pilot as the xDS server to configure Envoy. As mentioned above, Envoy is the ingress of the K8S cluster, and the xDS protocol has not been specially extended. Downstream of Envoy is Kong, as shown below:

[image: deployment diagram]


Admin and Stats Output:


Config:
//...
dynamic_active_listener:
//...
{
  "version_info": "2019-11-20T09:27:34Z/24",
  "listener": {
    "name": "0.0.0.0_8550",
    "address": {
      "socket_address": {
        "address": "0.0.0.0",
        "port_value": 8550
      }
    },
    "filter_chains": [
      {
        "filters": [
          {
            "name": "envoy.http_connection_manager",
            "typed_config": {
              "@type": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager",
              "stat_prefix": "http_default",
              "rds": {
                "config_source": {
                  "ads": {}
                },
                "route_config_name": "80"
              },
              "http_filters": [
                {
                  "name": "envoy.router"
                }
              ],
              "access_log": [
                {
                  "name": "envoy.file_access_log",
                  "config": {
                    "path": "/dev/stdout",
                    "format": "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% \"%DYNAMIC_METADATA(istio.mixer:status)%\" %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME%\n"
                  }
                }
              ],
              "use_remote_address": true
            }
          }
        ]
      }
    ]
  },
  "last_updated": "2019-11-20T09:27:37.875Z"
}
]
},
The full config is too long to include here.

Logs:
[2019-11-20T02:17:23.960Z] "GET /bflow HTTP/1.1" 200 - "-" 0 1541 8 7 "183.136.182.131, 10.160.246.207, 10.160.246.207,10.199.208.10" "Apache-HttpClient/4.4.1 (Java/1.8.0_191)" "9a2075b7-04ec-4f37-8e7a-f441513c6d4d" "test.yanxuan-nav-bflow.service.mailsaas" "10.199.209.106:8080" outbound|80||yanxuan-nav-bflow.test.svc.cluster.local - 10.199.211.97:8550 10.199.208.10:62890 -
[2019-11-20T02:17:25.102Z] "POST /log-platform/datasource/log-flume-service/writeFlumeMetrics.json HTTP/1.1" 200 - "-" 1366 12 103 102 "10.130.69.158,10.199.208.10" "Apache-HttpClient/4.2.1 (java 1.5)" "e5a3831d-3196-4251-b67c-743efe993b9a" "test.log-flume-service.service.mailsaas" "10.199.208.120:8080" outbound|80||log-flume-service.test.svc.cluster.local - 10.199.211.97:8550 10.199.208.10:62894 -
[2019-11-20 02:17:34.630][40][critical][assert] [external/envoy/source/common/network/connection_impl.cc:285] assert failure: state() == State::Open.
[2019-11-20 02:17:34.630][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Aborted, suspect faulting address 0x15
//....
[2019-11-20 02:17:34.777][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #23: start_thread [0x7fcd4d6446db]
2019-11-20T02:17:35.505685Z  info  Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused
2019-11-20T02:17:35.660007Z warn Epoch 0 terminated with an error: signal: aborted (core dumped)
2019-11-20T02:17:35.660042Z warn Aborted all epochs
2019-11-20T02:17:35.660186Z info Epoch 0: set retry delay to 200ms, budget to 9
2019-11-20T02:17:35.860294Z info Reconciling retry (budget 9)
2019-11-20T02:17:35.860380Z info Epoch 0 starting
2019-11-20T02:17:35.861613Z  info  Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster qz-ingress --service-node sidecar~10.199.211.97~qz-ingress-75b6868f98-p4mzt.istio-system~istio-system.svc.cluster.local --max-obj-name-len 189 --allow-unknown-fields -l warning]


Call Stack:

[2019-11-20 02:17:34.630][40][critical][assert] [external/envoy/source/common/network/connection_impl.cc:285] assert failure: state() == State::Open.
[2019-11-20 02:17:34.630][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Aborted, suspect faulting address 0x15
[2019-11-20 02:17:34.630][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:69] Backtrace (use tools/stack_decode.py to get line numbers):
[2019-11-20 02:17:34.637][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #0: Envoy::SignalAction::sigHandler() [0x19ba75f]
[2019-11-20 02:17:34.637][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #1: __restore_rt [0x7fcd4d64f890]
[2019-11-20 02:17:34.644][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #2: Envoy::Http::ConnectionManagerImpl::onData() [0x13a11f1]
[2019-11-20 02:17:34.651][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #3: Envoy::Network::FilterManagerImpl::onContinueReading() [0xfb9342]
[2019-11-20 02:17:34.658][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #4: Envoy::Network::FilterManagerImpl::onRead() [0xfb94db]
[2019-11-20 02:17:34.664][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #5: Envoy::Network::ConnectionImpl::onRead() [0xfad740]
[2019-11-20 02:17:34.671][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #6: Envoy::Network::ConnectionImpl::onReadReady() [0xfb029e]
[2019-11-20 02:17:34.678][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #7: Envoy::Network::ConnectionImpl::onFileEvent() [0xfaffda]
[2019-11-20 02:17:34.685][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #8: Envoy::Network::ConnectionImpl::ConnectionImpl()::$_2::operator()() [0xfb237e]
[2019-11-20 02:17:34.691][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #9: std::_Function_handler<>::_M_invoke() [0xfb2231]
[2019-11-20 02:17:34.698][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #10: std::function<>::operator()() [0xfa48da]
[2019-11-20 02:17:34.704][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #11: Envoy::Event::FileEventImpl::assignEvents()::$_0::operator()() [0xfa4497]
[2019-11-20 02:17:34.711][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #12: Envoy::Event::FileEventImpl::assignEvents()::$_0::__invoke() [0xfa42c8]
[2019-11-20 02:17:34.717][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #13: event_process_active_single_queue [0x16b95eb]
[2019-11-20 02:17:34.724][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #14: event_base_loop [0x16b6170]
[2019-11-20 02:17:34.730][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #15: Envoy::Event::LibeventScheduler::run() [0xfc72bc]
[2019-11-20 02:17:34.737][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #16: Envoy::Event::DispatcherImpl::run() [0xf9eb9e]
[2019-11-20 02:17:34.743][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #17: Envoy::Server::WorkerImpl::threadRoutine() [0xf899c0]
[2019-11-20 02:17:34.750][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #18: Envoy::Server::WorkerImpl::start()::$_3::operator()() [0xf8aa0f]
[2019-11-20 02:17:34.757][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #19: std::_Function_handler<>::_M_invoke() [0xf8a8bd]
[2019-11-20 02:17:34.763][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #20: std::function<>::operator()() [0x571afe]
[2019-11-20 02:17:34.770][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #21: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::$_0::operator()() [0x19bddc0]
[2019-11-20 02:17:34.777][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #22: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::$_0::__invoke() [0x19bdd98]
[2019-11-20 02:17:34.777][40][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #23: start_thread [0x7fcd4d6446db]

@junr03 (Member) commented Nov 20, 2019

@alyssawilk do you mind taking a look at this?

@alyssawilk (Contributor) commented:

Well, the good news is that this is a debug ASSERT, and there's actual error handling, so with an opt build it shouldn't crash.

I think the problem is here:
https://github.com/envoyproxy/envoy/blob/master/source/common/http/conn_manager_impl.cc#L341
Basically I suspect there's some corner case where we think we should re-enable reading but the connection is disconnected, so we're hitting that assert.
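
To make that concrete, here is a minimal standalone C++ sketch of the assert in question (the Connection type below is a simplified stand-in for illustration, not Envoy's actual ConnectionImpl): touching readDisable() after the connection has been closed trips the equivalent of the ASSERT(state() == State::Open) at connection_impl.cc:285, which aborts in a debug build but compiles out in an opt build.

// Minimal standalone model of the assert in question; names only mirror
// Envoy's ConnectionImpl, this is NOT the real code.
#include <cassert>
#include <cstdio>

enum class State { Open, Closing, Closed };

struct Connection {
  State state = State::Open;

  void readDisable(bool disable) {
    // Mirrors the debug ASSERT(state() == State::Open) from the report.
    assert(state == State::Open);
    std::printf("readDisable(%s)\n", disable ? "true" : "false");
  }

  void close() { state = State::Closed; }
};

int main() {
  Connection conn;
  conn.readDisable(true);   // fine: the connection is open
  conn.close();
  conn.readDisable(false);  // debug build: the assert fires and the process aborts
  return 0;
}

Compiled without -DNDEBUG the last call aborts, like the debug ASSERT in the report; with -DNDEBUG it falls through to whatever error handling follows.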

Quick advice: switching to an opt build should fix the problem due to the error handling (and if I'm incorrect, PLEASE take it up with the envoy-security group, don't disclose crashes in public! :-) )

I've got a TODO to clean this area up, since it was obsoleted by #8352. If switching to an opt build isn't an option, I can do the quick fix of moving the read disable inside the connection checks. If removing that check or doing an opt build is sufficient to hold you over, I can probably get to the cleanup some time in early December.
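
For what it's worth, in terms of the simplified Connection model sketched above, "moving the read disable inside the connection checks" would roughly take the following shape; this is only an illustration of the idea, not the actual Envoy patch.

// Illustrative shape of the quick fix, reusing the simplified Connection and
// State types from the sketch above (hypothetical helper, not the Envoy change).
#include <vector>

void maybeReadDisable(Connection& conn, const std::vector<int>& streams) {
  // Guard on the connection still being open, so a closed connection is
  // simply skipped instead of tripping the debug assert.
  if (!streams.empty() && conn.state == State::Open) {
    conn.readDisable(true);
  }
}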

@alyssawilk alyssawilk self-assigned this Nov 20, 2019
@mattklein123 mattklein123 added this to the 1.13.0 milestone Nov 21, 2019
@ydh926 (Author) commented Nov 22, 2019

Thanks for your reply @alyssawilk, we will try your advice.
Also, while locating the problem, we deleted the access log configuration and found that the problem disappeared.
Could this issue be related to the access log?

@alyssawilk (Contributor) commented:

I can't see how the access log would create this problem (maybe it's making a race condition worse?), but having spent some time looking at the code I'm frankly baffled as to how you're getting here.

The underlying connection looks like it is closed in one of three places. onIdleTimeout and onConnectionDurationTimeout are never in the dispatch stack; the third is checkForDeferredClose(), which I would think would be the culprit:

checkForDeferredClose only calls close() if streams_.empty()
the ASSERT you're hitting is doing a read disable only if !streams_.empty()

On each dispatch() call for HTTP/1 we might create one stream, then might checkForDeferredClose, then might read disable, but I can't see how we'd go from having no streams when we checkForDeferredClose to having streams at read-disable time, because the parser doesn't act in between.
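
Spelling that per-dispatch ordering out in terms of the simplified Connection model from the sketch earlier in this thread (again just an illustration of the reasoning above, not the real dispatch code):

// Per-dispatch ordering as described above, reusing the simplified Connection
// and State types from the earlier sketch (hypothetical helper, not Envoy code).
#include <vector>

void dispatchOnce(Connection& conn, std::vector<int>& streams) {
  // 1. The HTTP/1 codec may create at most one new stream during dispatch;
  //    nothing runs between steps 2 and 3 that could add a stream.

  // 2. checkForDeferredClose(): only closes when no streams remain.
  if (streams.empty()) {
    conn.close();
  }

  // 3. Read disable: only runs when streams DO remain, and (in the real code)
  //    asserts that the connection is still open.
  if (!streams.empty()) {
    conn.readDisable(true);
  }

  // For the assert to fire, the connection would already have to be closed at
  // step 3 even though streams are non-empty, which step 2 alone cannot
  // produce -- hence the confusion about how this state is reached.
}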

I've played around with the network class to see if I can force a dispatch event during delayed close and can't get that to fail either. I can do the quick fix, but I'd really love to understand what's going on here. If you guys ever get another reproducible situation and can turn up a bit of logging, ping me on Slack and I'd be happy to try to debug with you.
