Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns resolvers: add All lookup mode #18464

Merged
merged 23 commits into from
Oct 28, 2021
Merged

dns resolvers: add All lookup mode #18464

merged 23 commits into from
Oct 28, 2021

Conversation

junr03
Copy link
Member

@junr03 junr03 commented Oct 5, 2021

Commit Message: dns resolvers - add All lookup mode
Additional Description: this change will enable callback targets to access both address families for a given hostname.
Risk Level: low - new behavior guarded by config option
Testing: added UT
Docs Changes: updated
Release Notes: updated

Jose Nino added 2 commits October 5, 2021 14:39
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #18464 was opened by junr03.

see: more, trace.

@junr03 junr03 assigned alyssawilk and unassigned adisuissa Oct 5, 2021
@@ -104,16 +104,24 @@ void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, i
// ARES_ECONNREFUSED. If the PendingResolution has not been cancelled that means that the
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The c-ares based resolver changed a bit more because it was not originally written to accumulate responses from multiple c-ares callbacks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored for ease of understanding but open to comments to simplify further!

dns_name_);

if (!pending_response_.address_list_.empty()) {
ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below I considered a failure in the second resolution an overall success if the first resolution was successful, and hence fired the callback with the existing addresses and a success state. Up to discussing if this should be the behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah IMO a resolve success of either v4 or v6 should be considered success.
cc @DavidSchinazi @RyanTheOptimist for informed opinions :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, success of v4 OR v6 should mean overall success since we have usable addresses

Jose Nino added 2 commits October 5, 2021 15:31
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

dns_name_);

if (!pending_response_.address_list_.empty()) {
ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah IMO a resolve success of either v4 or v6 should be considered success.
cc @DavidSchinazi @RyanTheOptimist for informed opinions :-)

return;
}
if (!fallback_if_failed_) {

if (!dual_resolution_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, are we holding all forward progress on the dns resolver returning v4 and v6 addreses? That might work for a first pass but I believe that the chrome folks have told me it's a bad idea in general for latency / Quality of Experience reasons. If they confirm, I think we need a TODO to handle ipv4 and ipv6 asynchronously.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking about this as well from two different angles:

  1. Should both c-ares function calls happen concurrently?
  2. This applies both for c-ares and for apple: should we stream results as they come in to the callback target.

(1) is easy and I could do that in this PR. (2) is a bit harder, and I wouldn't want to do it in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would do (1) and not (2) for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alyssawilk @mattklein123 re:(1) do we have a preference for all three modes (All,Auto,V4Preferred) to do the calls concurrently, vs only All doing it concurrently and Auto/V4Preferred only do the second call if the first one fails/returns no addresses?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, less sure on that - I'll defer to Matt and David.
Totally fine with not doing 2 in this PR but I think we should either comment or TODO that it has latency implications and should probably be fixed at some point.

Copy link
Member Author

@junr03 junr03 Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alyssawilk @mattklein123 I settled on doing both lookups concurrrently for All and leaving Auto and V4Preferred with sequential lookups. I did not want to block this PR on making the necessary changes to enable concurrent lookups for Auto/V4Preferred given that the existing behavior is already sequential, and concurrent lookup for All did not need bigger changes.

I am already part of the way on the TODO, so will put that up as a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that waiting for both means users are waiting unnecessarily. But tackling in a separate PR with TODO here sounds good

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG. Do we have a TODO on not waiting for both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decided to create an issue #18572

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely seems like a good way to handle this for now. While suboptimal because it block progres until both return, Chrome does exactly this today. Solving #18572 (later) is a great chance to make envoy mobile better than Chrome!

docs/root/version_history/current.rst Outdated Show resolved Hide resolved
@alyssawilk
Copy link
Contributor

oh,I had one other question which is if we're doing the scoping to 5 as disussed on slack, if we should do it here, or where we get the resolution? I have a TODO in my code for where we scope it, but no strong opinion

I'll send mine out in draft form today so you can TAL

@junr03
Copy link
Member Author

junr03 commented Oct 6, 2021

I had one other question which is if we're doing the scoping to 5 as disussed on slack, if we should do it here, or where we get the resolution?

My preference is for the callback target to decide what it does with all the addresses that the resolver gives it. That was static dns cluster takes everything, logical takes one, dns cache with happy eyeballs does it thing, etc.

Jose Nino added 4 commits October 6, 2021 14:45
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Member Author

junr03 commented Oct 7, 2021

@alyssawilk @mattklein123 updated!

return pending_response_.v4_responses_;
case DnsLookupFamily::All:
ASSERT(pending_response_.all_responses_.empty());
pending_response_.all_responses_.insert(pending_response_.all_responses_.end(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding this correctly, it's adding v4 addresses before v6 addresses. In general I'd suggest v6 first but you can get fancier as per the happy eyeballs spec

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, that's the case. But the callback receiver can do what they want with the list, and is where I would expect the ordering complexity to happen; so I am not too opinionated with order here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow the RFC 8305 (the happy eyeballs spec) as David suggest, and I agree with Jose that we should probably do that in HappyEyeballsConnectionImpl. I'd be happy to take a whack at that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to do this outside.

dns_name_);

if (!pending_response_.address_list_.empty()) {
ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, success of v4 OR v6 should mean overall success since we have usable addresses

return;
}
if (!fallback_if_failed_) {

if (!dual_resolution_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that waiting for both means users are waiting unnecessarily. But tackling in a separate PR with TODO here sounds good

Jose Nino added 4 commits October 8, 2021 09:03
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
fmt
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Member Author

junr03 commented Oct 8, 2021

I see the coverage failure. Writing a test for the 1 uncovered line

Signed-off-by: Jose Nino <jnino@lyft.com>
alyssawilk
alyssawilk previously approved these changes Oct 13, 2021
@alyssawilk
Copy link
Contributor

(needs main merge)

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for working on this. A few comments from a read through, thanks.

/wait

return pending_response_.v4_responses_;
case DnsLookupFamily::All:
ASSERT(pending_response_.all_responses_.empty());
pending_response_.all_responses_.insert(pending_response_.all_responses_.end(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to do this outside.

Comment on lines 91 to 92
// or Auto, perform a second resolution if the first one fails. If dns_lookup_family_ is All
// perform the second resolution whether the first one fails or succeeds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment out of date? It conflicts with the one above about concurrent? The behavior is that both are concurrent, but we don't raise the final callback until both are done? Or is that not correct? (From further reading I'm pretty sure it's correct but I would do a 2nd comment pass as the logic is actually pretty tricky and I would be concerned we might be missing some corner cases.

@@ -97,23 +97,35 @@ void DnsResolverImpl::initializeChannel(ares_options* options, int optmask) {

void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, int timeouts,
ares_addrinfo* addrinfo) {
// Purely for checking for invalid state. This will change is Auto and V4Preferred are changed to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Purely for checking for invalid state. This will change is Auto and V4Preferred are changed to
// Purely for checking for invalid state. This will change if Auto and V4Preferred are changed to

?

Comment on lines 103 to 106
// The only way the resolver would have addresses here is if it is configured to resolve All
// families, and the first resolution was successful. In that case, we might as well return
// the addresses that resolved.
ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this seems a bit fragile. Should you change status to ARES_SUCCESS in this case? Do we have test coverage for this case?

Comment on lines 119 to 120
if (pending_response_.address_list_.empty()) {
pending_response_.status_ = ResolutionStatus::Failure;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? Couldn't the list actually be empty? Or is this due to the new concurrent behavior and the 2nd one failing? Maybe this is related to my comment above about switching status? Can you add more comments?

Jose Nino added 3 commits October 19, 2021 13:10
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
fmt
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Member Author

junr03 commented Oct 19, 2021

@mattklein123 Thanks for the comments! I needed to rearrange a piece of code because there was not way to get coverage i.e., I couldn't force c-ares to behave in a way to get us on that path. At first glance I think I covered the new arrangement, but will make sure to double check.

At the very least I will add more comments in order to leave everything a bit more clear. I will update when I am back 10/26.

cc @alyssawilk

@junr03
Copy link
Member Author

junr03 commented Oct 19, 2021

/wait

Jose Nino added 2 commits October 27, 2021 11:00
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #18464 was synchronize by junr03.

see: more, trace.

@junr03
Copy link
Member Author

junr03 commented Oct 27, 2021

Alright, @mattklein123 when I came back to this with fresh eyes I realized that constructing the pending response as a failure and only changing to success when the c-ares callback gave us an ARES_SUCCESS made the code simpler. All the cases I can think about are covered with the existing and added test cases.

Let me know if this is better!

Jose Nino added 2 commits October 27, 2021 11:21
Signed-off-by: Jose Nino <jnino@lyft.com>
fmt
Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback is a lot more complicated than I would like and pretty hard to follow, so if you see any further opportunities to make that easier to understand that would be a nice follow up. Otherwise LGTM after staring at this for a while. Thanks for adding!

@junr03
Copy link
Member Author

junr03 commented Oct 27, 2021

The DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback is a lot more complicated than I would like and pretty hard to follow

Yeah it has never been the easiest to follow. I think this PR is a good first step by taking out finishResolve. I can take another refactoring pass in a follow up.

@mattklein123 mattklein123 merged commit 036ae75 into main Oct 28, 2021
@mattklein123 mattklein123 deleted the dns-all-resolution-mode branch October 28, 2021 15:59
mpuncel added a commit to mpuncel/envoy that referenced this pull request Oct 29, 2021
* main: (221 commits)
  deps: Bump `protobuf` -> 3.19.0 (envoyproxy#18471)
  tooling: auto-assign dependency shephards (envoyproxy#18794)
  clang-tidy: Return from diff fun if empty diff (envoyproxy#18815)
  repokitteh: Block PRs pending deps approval (envoyproxy#18814)
  deps: Bump `org_llvm_llvm` -> 12.0.1, `com_github_wavm_wavm` -> 9ffd3e2 (envoyproxy#18747)
  dns resolvers: add All lookup mode (envoyproxy#18464)
  doc: fix link formatting for TLS session_timeout (envoyproxy#18790)
  ext_authz: Set response flag and code details to UAEX when denied (envoyproxy#18740)
  socket options: add support for directly creating ipv4/ipv6 pairs (envoyproxy#18769)
  ecds: make onConfigUpdate generic over filter type (envoyproxy#18061)
  bazel: update CMake instructions in EXTERNAL_DEPS.md (envoyproxy#18799)
  upstream: fix typo in comment (envoyproxy#18798)
  runtime: removing envoy.reloadable_features.grpc_json_transcoder_adhere_to_buffer_limits (envoyproxy#18696)
  bazel: Add CC=clang to clang configuration (envoyproxy#18732)
  fix error request id in the dubbbo local reply (envoyproxy#18741)
  event: assert the case of both read and closed event registered (envoyproxy#18265)
  tcp proxy connect tunneling: improved testing (envoyproxy#18784)
  deps: Bump `protoc-gen-validate` -> 0.6.2 (envoyproxy#18742)
  deps: Bump `rules_pkg` -> ad57589 (envoyproxy#18746)
  bazel: copy .bazelversion for envoy filter examples (envoyproxy#18730)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants