quic: fixing the disconnect between quiche stream count and Envoy #18694
Conversation
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
OK, here's the actual fix. We may opt not to land this until I fix the TODO for quiche APIs, since using a test peer in core code is pretty awful, but I wanted to at least get this out and get your takes before tackling that.
I also think a challenge for this PR is documenting and explaining the stream limit gaps, which I've now done a first pass at.
Wow, this looks GREAT! It's simpler than I would have expected. Excellent!
void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool) {
  if (notifier_.has_value()) {
    quic::QuicStreamIdManager& manager = quic::test::QuicSessionPeer::getStream(this);
    uint32_t streams_available = manager.outgoing_max_streams() - manager.outgoing_stream_count();
Might be nice to ASSERT(manager.outgoing_max_streams() > manager.outgoing_stream_count())
(which should clearly be true given that this method says we can create an outgoing stream).
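For concreteness, a sketch of where that assert could slot in, reusing the names from the quoted diff (not necessarily the PR's final code; the notification body is elided):

```cpp
void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool) {
  if (notifier_.has_value()) {
    quic::QuicStreamIdManager& manager = quic::test::QuicSessionPeer::getStream(this);
    // If quiche just told us we can create an outgoing stream, the limit
    // should be strictly greater than the current count.
    ASSERT(manager.outgoing_max_streams() > manager.outgoing_stream_count());
    uint32_t streams_available =
        manager.outgoing_max_streams() - manager.outgoing_stream_count();
    // ... notify the connection pool with streams_available, as in the PR.
  }
}
```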
ack
I think we want > not >= here (because if they're equal then we should NOT be able to open a stream, right?)
Yeah, I started with that, but unfortunately it doesn't pan out: we get at least one call with both numbers equal to 100 in Protocols/Http2UpstreamIntegrationTest.ManySimultaneousRequest/IPv4_Http2Downstream_Http3Upstream.
Seems like a quiche bug, but this assert is the best I can do for today.
[2021-10-20 20:45:07.866][402][critical][assert] [source/common/quic/envoy_quic_client_session.cc:109] assert failure: manager.outgoing_max_streams() > manager.outgoing_stream_count().
[2021-10-20 20:45:07.866][402][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0xad370000000e
[2021-10-20 20:45:07.866][402][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-10-20 20:45:07.866][402][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: 0/1.21.0-dev/test/DEBUG/BoringSSL
[2021-10-20 20:45:07.898][402][critical][backtrace] [./source/server/backtrace.h:96] #0: Envoy::SignalAction::sigHandler() [0x3f95079]
[2021-10-20 20:45:07.898][402][critical][backtrace] [./source/server/backtrace.h:96] #1: __restore_rt [0x7f27a3e9e8e0]
[2021-10-20 20:45:07.926][402][critical][backtrace] [./source/server/backtrace.h:96] #2: quic::QuicSession::StreamDraining() [0x35dbdfb]
[2021-10-20 20:45:07.953][402][critical][backtrace] [./source/server/backtrace.h:96] #3: quic::QuicStream::OnStreamFrame() [0x35f3152]
[2021-10-20 20:45:07.978][402][critical][backtrace] [./source/server/backtrace.h:96] #4: quic::QuicSession::OnStreamFrame() [0x35cbc09]
[2021-10-20 20:45:08.004][402][critical][backtrace] [./source/server/backtrace.h:96] #5: quic::QuicConnection::OnStreamFrame() [0x362ce54]
[2021-10-20 20:45:08.029][402][critical][backtrace] [./source/server/backtrace.h:96] #6: quic::QuicFramer::ProcessIetfFrameData() [0x373b77d]
[2021-10-20 20:45:08.055][402][critical][backtrace] [./source/server/backtrace.h:96] #7: quic::QuicFramer::ProcessIetfDataPacket() [0x3734cf0]
[2021-10-20 20:45:08.080][402][critical][backtrace] [./source/server/backtrace.h:96] #8: quic::QuicFramer::ProcessPacketInternal() [0x3731638]
[2021-10-20 20:45:08.106][402][critical][backtrace] [./source/server/backtrace.h:96] #9: quic::QuicFramer::ProcessPacket() [0x3730697]
[2021-10-20 20:45:08.131][402][critical][backtrace] [./source/server/backtrace.h:96] #10: quic::QuicConnection::ProcessUdpPacket() [0x363d12a]
[2021-10-20 20:45:08.152][402][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::Quic::EnvoyQuicClientConnection::processPacket() [0x340d85f]
[2021-10-20 20:45:08.173][402][critical][backtrace] [./source/server/backtrace.h:96] #12: Envoy::Network::passPayloadToProcessor() [0x3f1d297]
[2021-10-20 20:45:08.193][402][critical][backtrace] [./source/server/backtrace.h:96] #13: Envoy::Network::Utility::readFromSocket() [0x3f1dc1a]
[2021-10-20 20:45:08.214][402][critical][backtrace] [./source/server/backtrace.h:96] #14: Envoy::Network::Utility::readPacketsFromSocket() [0x3f2041c]
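For clarity, the relaxed form that survives is presumably along these lines (a sketch; the exact landed line isn't quoted in this thread):

```cpp
// >= rather than >, because quiche can invoke the callback with the two
// counters momentarily equal (e.g. both at 100), as shown in the backtrace above.
ASSERT(manager.outgoing_max_streams() >= manager.outgoing_stream_count());
```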
Yes, I think you're right that this is a quiche bug. In particular, I think QuicSession::StreamDraining() is doing the wrong thing and needs to wrap the call to OnCanCreateNewOutgoingStream(unidirectional) inside if (!VersionHasIetfQuicFrames(transport_version())), like the other calls.
So the assert you have sounds fine for now and I'll look into fixing that in quiche, if that SGTY?
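Roughly, the quiche-side fix being described would be this (a sketch against quiche's QuicSession::StreamDraining(), not the actual patch; only the guarded call is shown):

```cpp
if (!VersionHasIetfQuicFrames(transport_version())) {
  // Match the other call sites: for IETF QUIC, new outgoing-stream capacity is
  // granted by MAX_STREAMS frames, so firing the notification here over-reports it.
  OnCanCreateNewOutgoingStream(unidirectional);
}
```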
SGTM :-)
@@ -28,6 +28,19 @@ ActiveClient::ActiveClient(Envoy::Http::HttpConnPoolImplBase& parent,
                           Upstream::Host::CreateConnectionData& data)
    : MultiplexedActiveClientBase(parent, getMaxStreams(parent.host()->cluster()),
                                  parent.host()->cluster().stats().upstream_cx_http3_total_, data) {
  auto& connection = dynamic_cast<ActiveClient*>(this)->codec_client_->connection();
Is the dynamic_cast actually needed here? It looks like it's casting from ActiveClient to ActiveClient?
Yeah, but due to Envoy's "name everything the same thing" convention, this is actually casting from the base-class ActiveClient to the HTTP/3 ActiveClient :-P
LGTM! (modulo > vs >= in an ASSERT)
@@ -87,6 +109,14 @@ void EnvoyQuicClientSession::SetDefaultEncryptionLevel(quic::EncryptionLevel lev
  }
}

void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool) {
Oh! I just realized we should probably pay attention to this param. It's unidirectional, which is basically for push in this context. So I think we probably want to just do:
if (unidirectional) {
  return;
}
at the top of this method.
Yep, I fixed that in one of the later pushes, so if you look at all changes, it's a no-op for unidirectional streams :-)
🤦 I'm so bad at github UI.
@@ -87,6 +102,15 @@ void EnvoyQuicClientSession::SetDefaultEncryptionLevel(quic::EncryptionLevel lev
  }
}

void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool unidirectional) {
  if (http_connection_callbacks_ && !unidirectional) {
nit:
if (!http_connection_callbacks_ || unidirectional) {
  return;
}
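i.e., the whole method would then read roughly like this (a sketch based on the quoted diff; the capacity-notification body is elided and may not match the PR's exact code):

```cpp
void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool unidirectional) {
  // Nothing to report without registered callbacks, and unidirectional streams
  // (push) don't affect the pool's request-stream capacity.
  if (!http_connection_callbacks_ || unidirectional) {
    return;
  }
  // ... compute the available outgoing stream count and notify the pool,
  // as in the rest of this PR.
}
```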
// We track the QUIC capacity here, and overload currentUnusedCapacity so the
// connection pool can accurately keep track of when it is safe to create new
// streams.
uint64_t quiche_capacity_ = 100;
I'm not sure that this initialization to 100 is quite right. At the QUIC protocol layer, this starts at 0 until the initial_max_streams_bidi transport parameter is received to specify the initial value. After that, it's set to the value from MAX_STREAMS. So this makes me think two things:
- Should this be initialized to 0?
- Should we hook into QuicSession::OnConfigNegotiated() to grab the initial outgoing max streams value? (A rough sketch of what that might look like is below.)
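To illustrate the second idea, a hypothetical override (the accessor and callback names are assumptions for illustration, and the base-class call assumes QuicSpdyClientSession; this is not the PR's code):

```cpp
void EnvoyQuicClientSession::OnConfigNegotiated() {
  quic::QuicSpdyClientSession::OnConfigNegotiated();
  // Seed the pool's view of stream capacity from the peer's
  // initial_max_streams_bidi transport parameter rather than a hard-coded 100.
  if (http_connection_callbacks_) {
    http_connection_callbacks_->onMaxStreamsChanged(
        max_open_outgoing_bidirectional_streams());
  }
}
```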
We need to be at 100 or we'll prefetch a connection for each incoming stream; I've added a comment about this.
I would assume that we get a stream update on the initial frame as well; otherwise we should fix that upstream.
Hm. I definitely appreciate the point about the connection prefetch issue; that makes sense to me. We have to be careful that we don't actually send streams on the wire above the peer's MAX_STREAMS limit, or we will commit a protocol violation and the connection will be closed. In quiche, we do this via ShouldCreateOutgoingBidirectionalStream() returning false until we have stream capacity, but with #18614 that will return true. With 0-RTT in particular, we could send a bunch of requests in the first flight and trigger this.
But maybe we already have logic which is queueing requests somewhere to avoid this?
Added a test to future-proof that we update this before we raise the connected event.
Added a test to future-proof that we update this before we raise the connected event.
But do we send 0-RTT requests before updating this? initial_max_streams_bidi in the transport parameters will update this initially, and it seems reasonable to me that 0-RTT can be sent before receiving the transport params.
We don't support 0-RTT right now. When we do, we'll want to set this based on the cached params.
I just skimmed the h3 code, but I have a couple of questions about the common code, thanks.
/wait-any
/**
 * Fires when the MAX_STREAMS frame is received from the peer.
 * This may occur multiple times across the lifetime of an HTTP/3 connection.
Can't this technically happen for H/2 also? I thought the peer could send a new settings frame?
MAX_STREAMS is HTTP/3-specific, but the same thing does occur via onSettings above.
Apart from not wanting to manufacture a fake settings struct, I don't think we should combine the two, since SETTINGS tells us the number of streams we're allowed to have open at a given time, while onMaxStreamsChanged allows a fixed (cumulative) number of streams to be opened.
OK, I see. Do you mind just adding a comment about how this is different from H2? For those uninitiated in H3 this is kind of confusing.
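Something along these lines would probably cover it (the method name is taken from this PR's discussion; the exact signature here is a guess):

```cpp
/**
 * Fires when a MAX_STREAMS frame is received from the peer.
 * This may occur multiple times across the lifetime of an HTTP/3 connection.
 * Note this differs from HTTP/2's SETTINGS_MAX_CONCURRENT_STREAMS (delivered
 * via onSettings() above): SETTINGS caps how many streams may be open at any
 * one time, whereas MAX_STREAMS grants a cumulative total of streams that may
 * ever be opened on the connection.
 */
virtual void onMaxStreamsChanged(uint32_t num_streams) PURE;
```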
client.remaining_streams_--;
if (client.remaining_streams_ == 0) {
  ENVOY_CONN_LOG(debug, "maximum streams per connection, DRAINING", client);
  host_->cluster().stats().upstream_cx_max_requests_.inc();
  transitionActiveClientState(client, Envoy::ConnectionPool::ActiveClient::State::DRAINING);
} else if (client.numActiveStreams() + 1 >= client.concurrent_stream_limit_) {
} else if (capacity <= 1) {
I'm confused by this check. How do we get here if capacity == 0? Wouldn't we not be attaching to this client if we don't have any capacity? Should this be capacity == 1 with an ASSERT that capacity != 0? I'm probably missing something.
Yeah, it can't be 0; I was just trying to be as consistent as possible with the prior >=. I'll update it to == 1 since it's confusing.
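i.e., roughly this shape (a sketch of the adjusted check based on the quoted diff; the body of the new branch is unchanged and elided):

```cpp
ASSERT(capacity != 0); // We never attach a stream to a client with no capacity.
client.remaining_streams_--;
if (client.remaining_streams_ == 0) {
  ENVOY_CONN_LOG(debug, "maximum streams per connection, DRAINING", client);
  host_->cluster().stats().upstream_cx_max_requests_.inc();
  transitionActiveClientState(client, Envoy::ConnectionPool::ActiveClient::State::DRAINING);
} else if (capacity == 1) {
  // This stream consumes the client's last available slot; same handling as
  // the existing branch.
}
```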
// by configuration, or it will be set to std::numeric_limits<uint32_t>::max() to be
// (functionally) unlimited.
Unrelated to your change, but we should probably make this absl::optional to avoid the extremely rare but possible case where this is not actually unlimited. :) If you end up pushing more changes maybe TODO this?
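For reference, the TODO might eventually look something like this (hypothetical; the member name and helper are made up for illustration):

```cpp
// absl::nullopt means "no configured limit", instead of overloading
// std::numeric_limits<uint32_t>::max() as a sentinel value.
absl::optional<uint32_t> configured_concurrent_stream_limit_;

uint64_t effectiveConcurrentStreamLimit() const {
  return configured_concurrent_stream_limit_.value_or(
      std::numeric_limits<uint32_t>::max());
}
```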
raiseConnectionEvent(Network::ConnectionEvent::Connected);

// Arguably the peer could start a connection with 0 streams and increase open
// streams later but this is currently unsupported.
@RyanTheOptimist I want to call this out in particular. It would be pretty easy to have a member variable connected_ and, in OnCanCreateNewOutgoingStream, raise the connected event if !connected_ and streamsAvailable(), but it feels like a lot of complexity for a really bizarre server implementation. Let me know if you think the complexity is merited.
It's explicitly called out in the RFC and I have some vague sense of hearing someone in the WG proposing that this might be a thing to do under DoS. So I think we should probably support it. Though doing that in a subsequent PR seems totally reasonable.
That being said, if streamsAvailable() is 0 I might be more inclined to still call raiseConnectionEvent() here, but then move the connection to busy in the pool. Would that make sense?
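For what it's worth, a hypothetical sketch of those two options (names like connected_ are invented for illustration; neither is the PR's code):

```cpp
// Option A (deferred event): only raise Connected once the peer has granted
// bidirectional stream capacity.
void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool unidirectional) {
  if (unidirectional) {
    return;
  }
  if (!connected_ && streamsAvailable() > 0) {
    connected_ = true;
    raiseConnectionEvent(Network::ConnectionEvent::Connected);
  }
  // ... existing capacity notification ...
}

// Option B (suggested above): always raise Connected at handshake completion,
// and let the pool see zero unused capacity so the client sits in BUSY until a
// MAX_STREAMS update grants more streams.
```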
Needs a main merge as well. If you could add that comment, that would be great. Otherwise LGTM. /wait
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
// Creates up to 3 connections, based on the preconnect ratio.
// Returns the ConnectionResult of the last attempt.
ConnectionResult tryCreateNewConnections();

virtual void onConnected(Envoy::ConnectionPool::ActiveClient&) {}
unrelated change?
Had moved things up and then back to a different place; will fix.
ASSERT(streamsAvailable());
if (streamsAvailable() > 0) {
  OnCanCreateNewOutgoingStream(false);
  raiseConnectionEvent(Network::ConnectionEvent::Connected);
Why skip raiseConnectionEvent() when streamsAvailable() == 0?
Ah, the connection pool assumes that when a connection is connected it can start sending streams (and does so, violating QUIC constraints). Per the discussion I added the TODO above and will address this in a follow-up.
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
gfe-relnote: adding an accessor. PiperOrigin-RevId: 405507053
https://github.com/google/quiche/compare/12228ccfc..4c6ad6445
$ git log 12228ccfc..4c6ad6445 --date=short --no-merges --format="%ad %al %s"
2021-10-26 renjietang Add num_ptos_for_path_degrading_ getter in QuicSentPacketManagerPeer.
2021-10-25 dschinazi Add code count for packets dropped by blocked port
2021-10-25 dschinazi Internal change
2021-10-25 quiche-dev Adding an accessor for envoyproxy#18694
2021-10-25 wub Add retry_token to ReceivedPacketInfo. The token can be used later to extract cached network parameters.
Signed-off-by: Renjie Tang <renjietang@google.com>
https://quiche.googlesource.com/quiche.git/+log/385e03f1eab1..1d6d13074621
$ git log 385e03f1e..1d6d13074 --date=short --no-merges --format='%ad %ae %s'
2021-10-25 dschinazi Add code count for packets dropped by blocked port
2021-10-25 dschinazi Internal change
2021-10-25 quiche-dev Adding an accessor for envoyproxy/envoy#18694
2021-10-25 wub Add retry_token to ReceivedPacketInfo. The token can be used later to extract cached network parameters.
2021-10-25 renjietang Allow QuicSentPacketManager to expose the setter of num_ptos_for_path_degrading_.
Created with: roll-dep src/net/third_party/quiche/src src/third_party/quic_trace/src
Change-Id: I60ff709d7f349b585fde0823afa01fa472201d82
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3244885
Reviewed-by: Bin Wu <wub@chromium.org>
Commit-Queue: Bence Béky <bnc@chromium.org>
Cr-Commit-Position: refs/heads/main@{#934977}
Actually fixing a QUIC stream limit issue. Also fixing an unrelated bug where clean stream shutdown occasionally caused spurious stream-close writes to a closed connection.
Risk Level: High (changing connection pool limits)
Testing: new integration test
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes #18160