Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ares_gethostbyname() and exception safety #219

Closed
htuch opened this issue Aug 29, 2018 · 2 comments
Closed

ares_gethostbyname() and exception safety #219

htuch opened this issue Aug 29, 2018 · 2 comments

Comments

@htuch
Copy link

htuch commented Aug 29, 2018

While investigating a memory leak in Envoy (spotted by https://github.com/google/oss-fuzz), a c-ares consumer, I noticed the following trace:

Direct leak of 8 byte(s) in 1 object(s) allocated from:
--
  | #0 0x4e7a78 in __interceptor___strdup _asan_rtl_
  | #1 0x279fba4 in fake_hostent /builder/home/.cache/bazel/_bazel_root/4e9824db8e7d11820cfa25090ed4ed10/external/envoy_deps_cache_b22e04bff96538ea37e715942da6315c/cares.dep.build/c-ares-cares-1_14_0/ares_gethostbyname.c:290:20
  | #2 0x279fba4 in ares_gethostbyname /builder/home/.cache/bazel/_bazel_root/4e9824db8e7d11820cfa25090ed4ed10/external/envoy_deps_cache_b22e04bff96538ea37e715942da6315c/cares.dep.build/c-ares-cares-1_14_0/ares_gethostbyname.c:98
  | #3 0x8cb8c5 in Envoy::Network::DnsResolverImpl::resolve(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Envoy::Network::DnsLookupFamily, std::__1::function<void (std::__1::list<std::__1::shared_ptr<Envoy::Network::Address::Instance const>, std::__1::allocator<std::__1::shared_ptr<Envoy::Network::Address::Instance const> > >&&)>) /source/common/network/dns_impl.cc:196:25
  | #4 0x1530264 in Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget::startResolve() /source/common/upstream/upstream_impl.cc:1138:42

what's happening is we're doing a DNS resolution on an IP address, the strdup at

hostent.h_name = ares_strdup(name);
allocates, and Envoy code is throwing an exception in the callback at
callback(arg, ARES_ENOMEM, 0, NULL);
. This then prevents the free at
ares_free((char *)(hostent.h_name));
from happening.

There are two ways to fix this:

  1. Make c-ares more robust to exceptions. I'm not sure how broad the issues around c-ares, callbacks and exceptions are, hoping the maintainers of this project can shed some light here.
  2. If exceptions can't happen in callbacks, force Envoy to be exception free in any c-ares callback. This is possible but might involve some rearchitecting of how we handle resolution. CC @envoyproxy/maintainers
@daviddrysdale
Copy link
Member

I think option 2 is the way to go. c-ares is a C library and so isn't really in a position to do anything about C++ exceptions. As a result, the callbacks it invokes really need to act like C code too (e.g. by having an outer wrapper that consumes all exceptions) -- which is option 2.

htuch added a commit to htuch/envoy that referenced this issue Aug 30, 2018
Previously, we were taking an exception due to the late validation of update_merge_window duration
in ClusterManagerImpl::scheduleUpdate, which happened under a c-ares strict DNS host resolution
callback. There are several related issues here:

1. c-ares is exception unsafe, see c-ares/c-ares#219.
2. We should be validating Durations with PGV, see
   bufbuild/protoc-gen-validate#97.
3. We should defer the c-ares resolution callbacks to be outside the c-ares callback context for
   exception safety.

This PR addresses (3) by moving callbacks, even when they are "immediate", to a dispatcher post, so
that we never take an exception under a c-ares callback.

A workaround for (2) is provided, in lieu of bufbuild/protoc-gen-validate#97,
which is blocked on our ability to bump PGV version in Envoy, see
lyft/protoc-gen-star#28.

Fixes oss-fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=9868.

Risk level: Medium (DNS clusters will have some timing changes).
Testing: Updated DNS implementation unit tests, server fuzz corpus entry added.

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Author

htuch commented Aug 30, 2018

Yeah, I think in general this is hard to do, although in this specific case it would be possible to save a pointer to the allocated string somewhere (not on the stack) that ares_destroy() could cleanup (we call this as part of exception handling in Envoy).

Anyway, totally reasonable that c-ares as a C library doesn't want to be in the business of trying to handle C++ exceptions; I have a fix Envoy-side in envoyproxy/envoy#4307, so I will close this out. Thanks.

@htuch htuch closed this as completed Aug 30, 2018
htuch added a commit to envoyproxy/envoy that referenced this issue Aug 31, 2018
Previously, we were taking an exception due to the late validation of update_merge_window duration
in ClusterManagerImpl::scheduleUpdate, which happened under a c-ares strict DNS host resolution
callback. There are several related issues here:

1. c-ares is exception unsafe, see c-ares/c-ares#219.
2. We should be validating Durations with PGV, see
   bufbuild/protoc-gen-validate#97.
3. We should defer the c-ares resolution callbacks to be outside the c-ares callback context for
   exception safety.

This PR addresses (3) by moving callbacks, even when they are "immediate", to a dispatcher post, so
that we never take an exception under a c-ares callback.

A workaround for (2) is provided, in lieu of bufbuild/protoc-gen-validate#97,
which is blocked on our ability to bump PGV version in Envoy, see
lyft/protoc-gen-star#28.

Fixes oss-fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=9868.

Risk level: Medium (DNS clusters will have some timing changes).
Testing: Updated DNS implementation unit tests, server fuzz corpus entry added.

Signed-off-by: Harvey Tuch <htuch@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants