
quic: fixing the disconnect between quiche stream count and Envoy #18694

Merged: 14 commits merged from max_streams_2 into envoyproxy:main on Oct 26, 2021

Conversation

alyssawilk (Contributor) commented Oct 20, 2021

Actually fixing a QUIC stream limit issue. Also fixing an unrelated bug where clean stream shutdown occasionally caused spurious stream-close writes to an already-closed connection.

Risk Level: High (changing connection pool limits)
Testing: new integration test
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes #18160

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
alyssawilk (Contributor, Author) commented:

Ok here's the actual fix. We may opt to not land this until I fix the TODO for quiche APIs since using a test peer in core code is pretty awful, but I wanted to at least get this out here and get your takes before tackling that.
Will also tag one of the senior maintainers once I do the quiche pass since any changes to connection pool accounting deserve a second pair of eyes, but I'll spare everyone the email ccs until I've got your LGTMs :-)

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
alyssawilk (Contributor, Author) commented:

I also think a challenge for this PR is documenting and explaining the stream limit gaps which I've now done a first pass at.
I suspect we're going to want more clarity here, so please push for as much explanation as you see fit :-)

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
RyanTheOptimist (Contributor) left a comment:

Wow, this looks GREAT! It's simpler than I would have expected. Excellent!

Resolved review threads: source/common/conn_pool/conn_pool_base.h (two threads, one outdated), source/common/quic/envoy_quic_client_session.h (outdated)
void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool) {
  if (notifier_.has_value()) {
    quic::QuicStreamIdManager& manager = quic::test::QuicSessionPeer::getStream(this);
    uint32_t streams_available = manager.outgoing_max_streams() - manager.outgoing_stream_count();
Contributor:

Might be nice to ASSERT(manager.outgoing_max_streams() > manager.outgoing_stream_count()) (which should clearly be true given that this method says we can create an outgoing stream).

Contributor Author:

ack
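
For illustration, the suggested assertion might slot into the quoted snippet roughly as follows (a sketch only; the surrounding body is abridged and the final form may differ):

  void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool) {
    if (notifier_.has_value()) {
      quic::QuicStreamIdManager& manager = quic::test::QuicSessionPeer::getStream(this);
      // Sanity check: this callback firing implies room for at least one more outgoing stream.
      ASSERT(manager.outgoing_max_streams() > manager.outgoing_stream_count());
      uint32_t streams_available = manager.outgoing_max_streams() - manager.outgoing_stream_count();
      // ... notify the connection pool of the updated capacity (elided) ...
    }
  }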

Contributor:

I think we want > not >= here (Because if they're equal then we should NOT be able to open a stream, right?)

Contributor Author:

yeah I started with that, but unfortunately it doesn't pan out: we get at least one call with both numbers equal to 100 in Protocols/Http2UpstreamIntegrationTest.ManySimultaneousRequest/IPv4_Http2Downstream_Http3Upstream.
Seems like a quiche bug, but this assert is the best I can do for today.

[2021-10-20 20:45:07.866][402][critical][assert] [source/common/quic/envoy_quic_client_session.cc:109] assert failure: manager.outgoing_max_streams() > manager.outgoing_stream_count().
[2021-10-20 20:45:07.866][402][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0xad370000000e
[2021-10-20 20:45:07.866][402][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-10-20 20:45:07.866][402][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: 0/1.21.0-dev/test/DEBUG/BoringSSL
[2021-10-20 20:45:07.898][402][critical][backtrace] [./source/server/backtrace.h:96] #0: Envoy::SignalAction::sigHandler() [0x3f95079]
[2021-10-20 20:45:07.898][402][critical][backtrace] [./source/server/backtrace.h:96] #1: __restore_rt [0x7f27a3e9e8e0]
[2021-10-20 20:45:07.926][402][critical][backtrace] [./source/server/backtrace.h:96] #2: quic::QuicSession::StreamDraining() [0x35dbdfb]
[2021-10-20 20:45:07.953][402][critical][backtrace] [./source/server/backtrace.h:96] #3: quic::QuicStream::OnStreamFrame() [0x35f3152]
[2021-10-20 20:45:07.978][402][critical][backtrace] [./source/server/backtrace.h:96] #4: quic::QuicSession::OnStreamFrame() [0x35cbc09]
[2021-10-20 20:45:08.004][402][critical][backtrace] [./source/server/backtrace.h:96] #5: quic::QuicConnection::OnStreamFrame() [0x362ce54]
[2021-10-20 20:45:08.029][402][critical][backtrace] [./source/server/backtrace.h:96] #6: quic::QuicFramer::ProcessIetfFrameData() [0x373b77d]
[2021-10-20 20:45:08.055][402][critical][backtrace] [./source/server/backtrace.h:96] #7: quic::QuicFramer::ProcessIetfDataPacket() [0x3734cf0]
[2021-10-20 20:45:08.080][402][critical][backtrace] [./source/server/backtrace.h:96] #8: quic::QuicFramer::ProcessPacketInternal() [0x3731638]
[2021-10-20 20:45:08.106][402][critical][backtrace] [./source/server/backtrace.h:96] #9: quic::QuicFramer::ProcessPacket() [0x3730697]
[2021-10-20 20:45:08.131][402][critical][backtrace] [./source/server/backtrace.h:96] #10: quic::QuicConnection::ProcessUdpPacket() [0x363d12a]
[2021-10-20 20:45:08.152][402][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::Quic::EnvoyQuicClientConnection::processPacket() [0x340d85f]
[2021-10-20 20:45:08.173][402][critical][backtrace] [./source/server/backtrace.h:96] #12: Envoy::Network::passPayloadToProcessor() [0x3f1d297]
[2021-10-20 20:45:08.193][402][critical][backtrace] [./source/server/backtrace.h:96] #13: Envoy::Network::Utility::readFromSocket() [0x3f1dc1a]
[2021-10-20 20:45:08.214][402][critical][backtrace] [./source/server/backtrace.h:96] #14: Envoy::Network::Utility::readPacketsFromSocket() [0x3f2041c]

Contributor:

Yes, I think you're right that this is a quiche bug. In particular, I think QuicSession::StreamDraining() is doing the wrong thing and needs to wrap the call to OnCanCreateNewOutgoingStream(unidirectional) inside of if (!VersionHasIetfQuicFrames(transport_version())) like the other calls.

So the assert you have sounds fine for now and I'll look into fixing that in quiche, if that SGTY?

Contributor Author:

SGTM :-)
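
For reference, the quiche-side fix described above would presumably take a shape like this (a sketch, not the actual quiche source; the StreamDraining() signature and surrounding bookkeeping are assumptions):

  void QuicSession::StreamDraining(QuicStreamId stream_id, bool unidirectional) {
    // ... existing draining bookkeeping elided ...
    if (!VersionHasIetfQuicFrames(transport_version())) {
      // For IETF QUIC versions, outgoing stream credit is granted by MAX_STREAMS frames,
      // so gate this notification the same way as the other call sites.
      OnCanCreateNewOutgoingStream(unidirectional);
    }
  }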

@@ -28,6 +28,19 @@ ActiveClient::ActiveClient(Envoy::Http::HttpConnPoolImplBase& parent,
                           Upstream::Host::CreateConnectionData& data)
    : MultiplexedActiveClientBase(parent, getMaxStreams(parent.host()->cluster()),
                                  parent.host()->cluster().stats().upstream_cx_http3_total_, data) {
  auto& connection = dynamic_cast<ActiveClient*>(this)->codec_client_->connection();
Contributor:

Is the dynamic_cast actually needed here? It looks like it's casting from ActiveClient to ActiveClient?

Contributor Author:

Yeah, but due to Envoy's "name everything the same thing" convention, this is actually casting from the base-class ActiveClient to the HTTP/3 ActiveClient :-P

Resolved review threads: source/common/http/http3/conn_pool.cc (outdated), source/common/http/http3/conn_pool.h (two threads)
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
RyanTheOptimist (Contributor) left a comment:

LGTM! (modulo > vs >= in an ASSERT)

Resolved review thread: source/common/conn_pool/conn_pool_base.h (outdated)
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@@ -87,6 +109,14 @@ void EnvoyQuicClientSession::SetDefaultEncryptionLevel(quic::EncryptionLevel lev
}
}

void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool) {
Contributor:

Oh! I just realized we should probably pay attention to this param. It's unidirectional which is basically for push, in this context. So I think we probably want to just do:

if (unidirectional) {
  return;
}

at the top of this method.

Contributor Author:

Yep - I fixed that in one of the later pushes so if you look at all changes, it's a no-op for unidirectional streams :-)

Contributor:

🤦 I'm so bad at github UI.


@@ -87,6 +102,15 @@ void EnvoyQuicClientSession::SetDefaultEncryptionLevel(quic::EncryptionLevel lev
}
}

void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool unidirectional) {
  if (http_connection_callbacks_ && !unidirectional) {
Contributor:

nit:

  if(!http_connection_callbacks_ || unidirectional) {
    return;
  }


// We track the QUIC capacity here, and overload currentUnusedCapacity so the
// connection pool can accurately keep track of when it is safe to create new
// streams.
uint64_t quiche_capacity_ = 100;
Contributor:

I'm not sure that this initialization to 100 is quite right. At the QUIC protocol layer, this starts at 0 until the initial_max_streams_bidi transport parameter is received to specify the initial value. After that, it's set to the value from MAX_STREAMS. So I guess this makes me think 2 things.

  1. Should this be initialized to 0?
  2. Should we hook into QuicSession::OnConfigNegotiated() to grab the initial outgoing max streams value?
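
A minimal sketch of option 2, assuming EnvoyQuicClientSession can override quic::QuicSession::OnConfigNegotiated() and that the base-class call shown is the right one (both are assumptions, not code from this PR):

  void EnvoyQuicClientSession::OnConfigNegotiated() {
    quic::QuicSpdyClientSession::OnConfigNegotiated();
    // The peer's initial_max_streams_bidi has now been applied to the stream-id
    // manager; propagate the initial stream budget the same way MAX_STREAMS updates do.
    OnCanCreateNewOutgoingStream(/*unidirectional=*/false);
  }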

Contributor Author:

we need to be at 100 or we'll prefetch a connection for each incoming stream (added a comment to that effect).
I would assume we get a stream update on the initial frame as well; if not, we should fix that upstream.

Contributor:

Hm. I definitely appreciate the point about the connection prefetch issue; that makes sense to me. We have to be careful that we don't actually send streams on the wire above the peer's MAX_STREAMS limit, or we will commit a protocol violation and the connection will be closed. In quiche, we enforce this via ShouldCreateOutgoingBidirectionalStream() returning false until we have stream capacity, but with #18614 that will return true. With 0-RTT in particular, we could send a bunch of requests in the first flight and trigger this.

But maybe we already have logic which is queueing requests somewhere to avoid this?

Contributor Author:

Added a test to future-proof that we update this before we raise the connected event.

Contributor:

> Added a test to future-proof that we update this before we raise the connected event.

But do we send 0-RTT requests before updating this? initial_max_streams_bidi in the transport parameters will set this initially, and it seems reasonable to me that 0-RTT could be sent before receiving the transport params.

Contributor Author:

we don't support 0-rtt right now. When we do, we'll want to set this based on the cached params.
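
To make the capacity tracking in the quoted snippet concrete, a minimal sketch of the currentUnusedCapacity overload might look like this (quiche_capacity_ comes from the quoted code; concurrent_stream_limit_, numActiveStreams(), and the return type are assumptions, and the merged code may differ):

  // Report the smaller of Envoy's remaining concurrency budget and the stream
  // credit quiche currently has under the peer's MAX_STREAMS limit.
  uint64_t currentUnusedCapacity() const override {
    const uint64_t pool_capacity = concurrent_stream_limit_ > numActiveStreams()
                                       ? concurrent_stream_limit_ - numActiveStreams()
                                       : 0;
    return std::min<uint64_t>(pool_capacity, quiche_capacity_);
  }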

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
mattklein123 (Member) left a comment:

I just skimmed the h3 code but just a couple of questions about the common code, thanks.

/wait-any


/**
 * Fires when the MAX_STREAMS frame is received from the peer.
 * This may occur multiple times across the lifetime of an HTTP/3 connection.
Member:

Can't this technically happen for H/2 also? I thought the peer could send a new settings frame?

Contributor Author:

MAX_STREAMS is HTTP/3-specific, but the same thing does occur for onSettings above.

Apart from not wanting to manufacture a fake settings struct, I don't think we should combine the two, since SETTINGS tells us the number of streams we're allowed to have open at a given time, whereas onMaxStreamsChanged grants a fixed total number of streams that may be opened.

Member:

OK, I see. Do you mind adding a comment about how this is different from H2? For those uninitiated in H3 this is kind of confusing.
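
One possible wording for that comment, based on the distinction explained above (sketch only; the callback name comes from the discussion, while the parameter and PURE declaration are assumptions):

  /**
   * Fires when a MAX_STREAMS frame is received from the peer.
   * This may occur multiple times across the lifetime of an HTTP/3 connection.
   *
   * Note this differs from HTTP/2's SETTINGS handling in onSettings(): SETTINGS caps
   * how many streams may be open at any one time (slots free up as streams close),
   * whereas MAX_STREAMS grants a cumulative total of streams that may ever be opened,
   * and only a larger MAX_STREAMS value from the peer creates additional capacity.
   */
  virtual void onMaxStreamsChanged(uint32_t num_streams) PURE;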

  client.remaining_streams_--;
  if (client.remaining_streams_ == 0) {
    ENVOY_CONN_LOG(debug, "maximum streams per connection, DRAINING", client);
    host_->cluster().stats().upstream_cx_max_requests_.inc();
    transitionActiveClientState(client, Envoy::ConnectionPool::ActiveClient::State::DRAINING);
- } else if (client.numActiveStreams() + 1 >= client.concurrent_stream_limit_) {
+ } else if (capacity <= 1) {
Member:

I'm confused by this check. How do we get here if capacity == 0? Wouldn't we not be attaching to this client if we don't have any capacity? Should this be capacity == 1 with an ASSERT that capacity != 0? I'm probably missing something.

Contributor Author:

yeah, it can't be 0, I was just trying to be as consistent as possible with the prior >=. I'll update to == 1 since it's confusing.
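
Concretely, the adjustment being discussed might read as follows (a sketch built from the quoted hunk; the BUSY transition in the final branch is assumed from the surrounding pool logic):

  ASSERT(capacity > 0); // we never attach a stream to a client with no capacity
  client.remaining_streams_--;
  if (client.remaining_streams_ == 0) {
    ENVOY_CONN_LOG(debug, "maximum streams per connection, DRAINING", client);
    host_->cluster().stats().upstream_cx_max_requests_.inc();
    transitionActiveClientState(client, Envoy::ConnectionPool::ActiveClient::State::DRAINING);
  } else if (capacity == 1) {
    // This stream consumes the last available slot; stop assigning new streams
    // to this client until capacity is returned.
    transitionActiveClientState(client, Envoy::ConnectionPool::ActiveClient::State::BUSY);
  }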

Comment on lines +107 to +108
// by configuration, or it will be set to std::numeric_limits<uint32_t>::max() to be
// (functionally) unlimited.
Member:

Unrelated to your change, but we should probably make this absl::optional to avoid the extremely rare but possible case where this is not actually unlimited. :) If you end up pushing more changes maybe TODO this?
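
The suggested TODO might eventually take a shape like this (hypothetical sketch; the member name comes from the quoted code, the helper is invented for illustration):

  // TODO: use absl::optional so "unlimited" no longer relies on uint32_t max as a sentinel.
  absl::optional<uint32_t> concurrent_stream_limit_;

  uint64_t effectiveConcurrentStreamLimit() const {
    // Absent means no configured limit.
    return concurrent_stream_limit_.value_or(std::numeric_limits<uint32_t>::max());
  }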

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>


raiseConnectionEvent(Network::ConnectionEvent::Connected);

// Arguably the peer could start a connection with 0 streams and increase open
// streams later but this is currently unsupported.
Contributor Author:

@RyanTheOptimist I want to call this out in particular. It would be pretty easy to have a member variable connected_ and, in OnCanCreateNewOutgoingStream, raise the connected event if !connected_ and streamsAvailable(), but it feels like a lot of complexity for a really bizarre server implementation. Let me know if you think the complexity is merited.

Contributor:

It's explicitly called out in the RFC and I have some vague sense of hearing someone in the WG proposing that this might be a thing to do under DoS. So I think we should probably support it. Though doing that in a subsequent PR seems totally reasonable.

That being said, if streamsAvailable() is 0 I might be more inclined to still call raiseConnectionEvent() here, but then move the connection to busy in the pool. Would that make sense?
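
For context, the unimplemented alternative being weighed here (a connected_ flag that defers the Connected event until the peer grants stream credit) might be sketched as follows (purely illustrative; streamsAvailable() and raiseConnectionEvent() appear in the quoted code, connected_ is hypothetical):

  void EnvoyQuicClientSession::OnCanCreateNewOutgoingStream(bool unidirectional) {
    if (unidirectional) {
      return;
    }
    if (!connected_ && streamsAvailable() > 0) {
      // The peer has now granted stream credit; only at this point surface Connected
      // to the pool so it never tries to send streams the peer hasn't allowed.
      connected_ = true;
      raiseConnectionEvent(Network::ConnectionEvent::Connected);
    }
    // ... existing capacity-update logic elided ...
  }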

mattklein123 (Member) commented:

Needs a main merge as well. If you could add that comment, that would be great. Otherwise LGTM.

/wait

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
mattklein123 previously approved these changes Oct 25, 2021
// Creates up to 3 connections, based on the preconnect ratio.
// Returns the ConnectionResult of the last attempt.
ConnectionResult tryCreateNewConnections();

virtual void onConnected(Envoy::ConnectionPool::ActiveClient&) {}
Contributor:

unrelated change?

Contributor Author:

had moved things up and then back to a different place. will fix.

  ASSERT(streamsAvailable());
  if (streamsAvailable() > 0) {
    OnCanCreateNewOutgoingStream(false);
    raiseConnectionEvent(Network::ConnectionEvent::Connected);
Contributor:

Why skip raiseConnectionEvent() when streamsAvailable() == 0?

Contributor Author:

Ah, the connection pool assumes that once a connection is connected it can start sending streams (and does so, violating QUIC constraints). Per discussion I added the TODO above and will address this in a follow-up.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
copybara-service bot pushed a commit to google/quiche that referenced this pull request Oct 25, 2021
gfe-relnote: adding an accessor.
PiperOrigin-RevId: 405507053
@alyssawilk alyssawilk merged commit 0882d63 into envoyproxy:main Oct 26, 2021
RenjieTang added a commit to RenjieTang/envoy that referenced this pull request Oct 26, 2021
https://github.com/google/quiche/compare/12228ccfc..4c6ad6445

$ git log 12228ccfc..4c6ad6445 --date=short --no-merges --format="%ad %al %s"

2021-10-26 renjietang Add num_ptos_for_path_degrading_ getter in QuicSentPacketManagerPeer.
2021-10-25 dschinazi Add code count for packets dropped by blocked port
2021-10-25 dschinazi Internal change
2021-10-25 quiche-dev Adding an accessor for envoyproxy#18694
2021-10-25 wub Add retry_token to ReceivedPacketInfo. The token can be used later to extract cached network parameters.

Signed-off-by: Renjie Tang <renjietang@google.com>
pull bot pushed a commit to Mu-L/chromium that referenced this pull request Oct 26, 2021
https://quiche.googlesource.com/quiche.git/+log/385e03f1eab1..1d6d13074621

$ git log 385e03f1e..1d6d13074 --date=short --no-merges --format='%ad %ae %s'
2021-10-25 dschinazi Add code count for packets dropped by blocked port
2021-10-25 dschinazi Internal change
2021-10-25 quiche-dev Adding an accessor for envoyproxy/envoy#18694
2021-10-25 wub Add retry_token to ReceivedPacketInfo. The token can be used later to extract cached network parameters.
2021-10-25 renjietang Allow QuicSentPacketManager to expose the setter of num_ptos_for_path_degrading_.

Created with:
  roll-dep src/net/third_party/quiche/src src/third_party/quic_trace/src

Change-Id: I60ff709d7f349b585fde0823afa01fa472201d82
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3244885
Reviewed-by: Bin Wu <wub@chromium.org>
Commit-Queue: Bence Béky <bnc@chromium.org>
Cr-Commit-Position: refs/heads/main@{#934977}
@alyssawilk alyssawilk deleted the max_streams_2 branch August 4, 2022 00:57

Successfully merging this pull request may close these issues.

envoy crash when test http3 upgrade
4 participants