-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: Stop fake_upstream methods from accidentally succeeding #4232
test: Stop fake_upstream methods from accidentally succeeding #4232
Conversation
Constructing an AssertionResult out of a (non-empty) string gives a successful assertion, which wasn't intended. Signed-off-by: Michael Behr <mkbehr@google.com>
So, a few of the mac test cases do actually hit the 10-second timeout in that CI run. They're all RouterDownstreamDisconnectBeforeResponseComplete, and they all time out on waitForDisconnect. That test succeeds on Http2UpstreamIntegrationTest and Http2IntegrationTest, but it fails on SslIntegrationTest, UdsUpstreamIntegrationTest, and IntegrationTest. In the tests where this succeeds, the methods take <100ms, so I wouldn't be surprised if this is spinning forever and the test framework bug stopped us from noticing. How should we handle this? |
Hopefully these are the reason why the mac tests seem more flakey than the linux ones. (We currently re-run mac tests one when they fail to paper over this.) I think we should try to fix these tests. If you don't have a mac dev environment handy, I can take a look on Monday. |
Empirically, this patch makes all three of those tests pass:
It seems to me that nothing in the HTTP1 stack closes the upstream connection when the downstream codec_client_ is closed in @alyssawilk do you have any ideas? |
Ok, so in test we're sending half a request, making sure we establish a connection upstream, and killing the client connection, yeah? Logically that should result in killing the upstream connection too - if it's halfway through streaming a HTTP/1.1 request it can't be used for a subsequent request. If the router filter doesn't handle that by closing the connection that could be an actual Envoy bug. It's a bit tricky trying to visually find what happens on unexpected downstream disconnect. I assume we get an router onDestroy with an existing upstream_request_ which should result in resetStream which for HTTP/1.1 should kill the connection, which is what this test appears to be trying to test. Can either of you pastebin the -l trace logs? |
Mac run, fails: https://pastebin.com/raw/yuyvrju1 |
I'll dig through traces but I have a theory, which is it's one of those "don't do upstream things without letting the client dispatcher progress" bugs. I bet we're just hanging on the upstream "waitfordisconnect" without the FIN getting shipped somehow. linux side I do see the router resetting the pool request Mac side not so much. It looks like the client socket is getting closed |
alternately you could add a std::cerr on the actual socket close to see if the client is actually closing it before we go and block on the upstream waitForDisconnect. That'd be a fast way to sanity check my theory. |
I guess the difference is that the linux version sees "[C1] remote early close" which seems to trigger closing the upstream connection where the Mac version doesn't. |
Yeah, but looking more closely at the test I'm a bit more confused why we don't get that disconnect. We wrote the headers but verified they came through so it's not like the FIN might be behind them. I could maybe see the SslIntegrationTest failing if there's some ssl close time delay but that doesn't explain the IntegrationTest failures. But basically it looks like either the FIN isn't being written immediately in the stack of close() or Envoy isn't reading it for some reason, and given I've seen a bunch of things fail with the former failure pattern I'm still betting on it :-P |
So adding waitForNewStream made the test pass: after ten seconds of waiting for a new stream and failing, it called waitForDisconnect and after 5 more seconds, we got a timeout on the upstream connection that closed it. Later C1 gets closed. I also hacked waitForDisconnect to take a Dispatcher and that doesn't fix the test either, still no close on C1. I grabbed this during the test. 57956 is the listener port.
|
(I should add that I removed the waitForNewStream when I added the hacked waitForDisconnect and the netstat dump is from the latter test.) |
So, I've hacked up ConnectionImpl and FileEventImpl. Seems that the listener side of the downstream connection reads the 62 byte request, and then readDisabled is invoked. The upstream handles the request and writes its partial body and the client side of the downstream is closed. Because I've moved filtering of events into the libevent callback I can see that when that close happens a read event is sent on the listener side of the connection where on Linux we get a close event. |
Confirmed that if that read event is allowed through, we attempt to read, get 0 bytes and detect the connection is closed. I suppose this is some strange interaction between OSX and libevent? |
So, after a lot of digging, I've come to the conclusion that libevent doesn't support EV_FEATURE_EARLY_CLOSE on OSX. If I modify dispatcher_impl.cc to require the feature it fails to initialize libevent (I've verified that it succeeds under Linux). From reading the definition of EV_CLOSED and the detect_early_close_-related comments in ConnectionImpl, it seems that's exactly what we're depending on in this test and it was just never working correctly and not detected before this change. It's possible there's some way to gain support for that under OSX, but I'll have to some digging. |
The EV_FEATURE_EARLY_CLOSE is epoll-only. |
Ok, if EV_FEATURE_EARLY_CLOSE isn't supported on OSX we should just disable the HTTP/1 test for OSX noting that the downstream close will not be detected when envoy has paused reading during request handling, yeah? |
@alyssawilk yes, that's the conclusion I came too as well. I'm going to write this up in an issue that we can refer to where it's disabled. |
@mkbehr Let's disable |
Signed-off-by: Michael Behr <mkbehr@google.com>
And have to add the disable block in SslIntegrationTest's RouterDownstreamDisconnectBeforeResponseComplete test case. |
Signed-off-by: Michael Behr <mkbehr@google.com>
@@ -150,6 +150,13 @@ TEST_P(SslIntegrationTest, RouterUpstreamDisconnectBeforeResponseComplete) { | |||
} | |||
|
|||
TEST_P(SslIntegrationTest, RouterDownstreamDisconnectBeforeRequestComplete) { | |||
#ifdef __APPLE__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the request version of the test, which works. Conditional needs to be on the next one "RouterDownstreamDisconnectBeforeResponseComplete"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, right.
Implements early close detection on OS X by inspecting the TCP state using getsockopt. *Risk Level*: low (conditional compilation) *Testing*: integration tests pass (tested against envoyproxy#4232) *Docs Changes*: n/a *Release Notes*: n/a *Fixes*: envoyproxy#4294 Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>
instead of before /request/ complete Signed-off-by: Michael Behr <mkbehr@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Pulling the following changes from github.com/envoyproxy/envoy: f936fc6 ssl: serialize accesses to SSL socket factory contexts (envoyproxy#4345) e34dcd6 Fix crash in tcp_proxy (envoyproxy#4323) ae6a252 router: fix matching when all domains have wildcards (envoyproxy#4326) aa06142 test: Stop fake_upstream methods from accidentally succeeding (envoyproxy#4232) 5d73187 rbac: update the authenticated.user to a StringMatcher. (envoyproxy#4250) c6bfc7d time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers (envoyproxy#4257) 752483e Fixing the fix (envoyproxy#4333) 83487f6 tls: update BoringSSL to ab36a84b (3497). (envoyproxy#4338) 7bc210e test: fixing interactions between waitFor and ignore_spurious_events (envoyproxy#4309) 69474b3 admin: order stats in clusters json admin (envoyproxy#4306) 2d155f9 ppc64le build (envoyproxy#4183) 07efc6d fix static initialization fiasco problem (envoyproxy#4314) 0b7e3b5 test: Remove declared but undefined class methods (envoyproxy#4297) 1485a13 lua: make sure resetting dynamic metadata wrapper when request info is marked dead d243cd6 test: set to zero when start_time exceeds limit (envoyproxy#4328) 0a1e92a test: fix heap use-after-free in ~IntegrationTestServer. (envoyproxy#4319) cddc732 CONTRIBUTING: Document 'kick-ci' trick. (envoyproxy#4335) f13ef24 docs: remove reference to deprecated value field (envoyproxy#4322) e947a27 router: minor doc fixes in stream idle timeout (envoyproxy#4329) 0c2e998 tcp-proxy: fixing a TCP proxy bug where we attempted to readDisable a closed connection (envoyproxy#4296) 00ffe44 utility: fix strftime overflow handling. (envoyproxy#4321) af1183c Re-enable TcpProxySslIntegrationTest and make the tests pass again. (envoyproxy#4318) 3553461 fuzz: fix H2 codec fuzzer post envoyproxy#4262. (envoyproxy#4311) 42f6048 Proto string issue fix (envoyproxy#4320) 9c492a0 Support Envoy to fetch secrets using SDS service. (envoyproxy#4256) a857219 ratelimit: revert `revert rate limit failure mode config` and add tests (envoyproxy#4303) 1d34172 dns: fix exception unsafe behavior in c-ares callbacks. (envoyproxy#4307) 1212423 alts: add gRPC TSI socket (envoyproxy#4153) f0363ae fuzz: detect client-side resets in H2 codec fuzzer. (envoyproxy#4300) 01aa3f8 test: hopefully deflaking echo integration test (envoyproxy#4304) 1fc0f4b ratelimit: link legacy proto when message is being used (envoyproxy#4308) aa4481e fix rare List::remove(&target) segfault (envoyproxy#4244) 89e0f23 headers: fixing fast fail of size-validation (envoyproxy#4269) 97eba59 build: bump googletest version. (envoyproxy#4293) 0057e22 fuzz: avoid false positives in HCM fuzzer. (envoyproxy#4262) 9d094e5 Revert ac0bd74 (envoyproxy#4295) ddb28a4 Add validation context provider (envoyproxy#4264) 3b47cba added histogram latency information to Hystrix dashboard stream (envoyproxy#3986) cf87d50 docs: update SNI FAQ. (envoyproxy#4285) f952033 config: fix update empty stat for eds (envoyproxy#4276) 329e591 router: Add ability of custom headers to rely on per-request data (envoyproxy#4219) 68d20b4 thrift: refactor build files and imports (envoyproxy#4271) 5fa8192 access_log: log requested_server_name in tcp proxy (envoyproxy#4144) fa45bb4 fuzz: libc++ clocks don't like nanos. (envoyproxy#4282) 53f8944 stats: add symbol table for future stat name encoding (envoyproxy#3927) c987b42 test infra: Remove timeSource() from the ClusterManager api (envoyproxy#4247) cd171d9 websocket: tunneling websockets (and upgrades in general) over H2 (envoyproxy#4188) b9dc5d9 router: disallow :path/host rewriting in request_headers_to_add. (envoyproxy#4220) 0c91011 network: skip socket options and source address for UDS client connections (envoyproxy#4252) da1857d build: fixing a downstream compile error by noting explicit fallthrough (envoyproxy#4265) 9857cfe fuzz: cleanup per-test environment after each fuzz case. (envoyproxy#4253) 52beb06 test: Wrap proto string in std::string before comparison (envoyproxy#4238) f5e219e extensions/thrift_proxy: Add header matching to thrift router (envoyproxy#4239) c9ce5d2 fuzz: track read_disable_count bidirectionally in codec_impl_fuzz_test. (envoyproxy#4260) 35103b3 fuzz: use nanoseconds for SystemTime in RequestInfo. (envoyproxy#4255) ba6ba98 fuzz: make runtime root hermetic in server_fuzz_test. (envoyproxy#4258) b0a9014 time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. (envoyproxy#4248) 8567460 access_log: support beginning of epoch in START_TIME. (envoyproxy#4254) 28d5f41 proto: unify envoy_proto_library/api_proto_library. (envoyproxy#4233) f7d3cb6 http: fix allocation bug introduced in envoyproxy#4211. (envoyproxy#4245) Fixes istio/istio#8310 (once pulled into istio/istio). Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Description: FakeConnectionBase::waitForDisconnect and FakeHttpConnection::waitForNewStream were returning assertion successes when they timed out, because an AssertionResult constructed with a (non-empty) string counts as a success. Fix that.
Risk Level: Low (test only)
Testing: bazel test //test/...
Docs Changes: n/a
Release Notes: n/a