Envoy crash when validating bootstrap config file #27775

Open
jakewang-stripe opened this issue Jun 2, 2023 · 16 comments
Labels
area/bootstrap, bug, no stalebot

Comments

@jakewang-stripe

jakewang-stripe commented Jun 2, 2023

Description:
Envoy occasionally crashes when validating bootstrap config

Repro steps:
This has only been observed on about 0.05% of Stripe hosts that run Envoy. On those hosts I can log in, run the validation manually, and get a segfault roughly 1 out of every 2 or 3 attempts:

$ sudo /pay/jenkins-artifacts/envoy/1.24.4-stripe1/envoy-stripe --config-path $bootstrap_path --mode validate --service-cluster certhorse --service-zone us-west-2b --service-node qa-certhorse--01b66f3177b5f6cdc
Segmentation fault
Sometimes the validation hangs indefinitely instead of crashing.
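
Since the failure is intermittent (a crash on some runs, a hang on others), a small harness can help quantify it. The following is only a sketch: `tally_outcomes` and its outcome labels are hypothetical names, and it assumes a crashing process is terminated by SIGSEGV so that Python reports a negative return code.

```python
import signal
import subprocess
from collections import Counter


def tally_outcomes(cmd, runs=10, timeout=30.0):
    """Run cmd repeatedly and count clean exits, segfaults, hangs, and other failures."""
    outcomes = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,  # the child is killed if it exceeds this
            )
        except subprocess.TimeoutExpired:
            outcomes["hang"] += 1
            continue
        if result.returncode == 0:
            outcomes["ok"] += 1
        elif result.returncode == -signal.SIGSEGV:
            # On POSIX a negative return code means "killed by that signal".
            outcomes["segfault"] += 1
        else:
            outcomes["other"] += 1
    return outcomes
```

Here `cmd` would be the `--mode validate` invocation above, split into an argument list, and the timeout should be well above the normal validation time so that slow runs are not miscounted as hangs.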

Call stack:

[external/envoy/source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x0
[external/envoy/source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[external/envoy/source/server/backtrace.h:92] Envoy version: 2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL
[external/envoy/source/server/backtrace.h:96] #0: __restore_rt [0x7f26c19d1420]
[external/envoy/source/server/backtrace.h:96] #1: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Refill() [0x55cf3f28ce2a]
[external/envoy/source/server/backtrace.h:96] #2: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Allocate<>()::Helper::Underflow() [0x55cf3f28df77]
[external/envoy/source/server/backtrace.h:96] #3: Envoy::Api::ValidationImpl::allocateDispatcher() [0x55cf3dd5b3de]
[external/envoy/source/server/backtrace.h:96] #4: Envoy::Server::ValidationInstance::ValidationInstance() [0x55cf3dd4b52f]
[external/envoy/source/server/backtrace.h:96] #5: Envoy::Server::validateConfig() [0x55cf3dd4aa65]
[external/envoy/source/server/backtrace.h:96] #6: Envoy::MainCommonBase::run() [0x55cf3dd11370]
[external/envoy/source/server/backtrace.h:96] #7: Envoy::MainCommon::main() [0x55cf3dd11a7d]
[external/envoy/source/server/backtrace.h:96] #8: main [0x55cf3dd0da4a]
[external/envoy/source/server/backtrace.h:96] #9: __libc_start_main [0x7f26c17ef083]
uname -a:
Linux 5.15.0-1036-aws #40 SMP Mon Apr 24 00:21:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Envoy version
2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL

2f44165e55dd47475c44d2d03018eac3cb8a6264 is an internal commit in Stripe's Envoy repo, which is based on OSS Envoy 1.24.4.

jakewang-stripe added the triage label Jun 2, 2023
jakewang-stripe changed the title from "Envoy crash when validating bootstrap config file #709" to "Envoy crash when validating bootstrap config file" Jun 2, 2023
@jakewang-stripe
Author

We found a potential root cause: a bug in the tcmalloc version used by Envoy 1.24.4, which cannot handle non-sequential online CPU IDs. The segfaults happen on EC2 instances with Nitro Enclaves enabled, where some CPUs have been hot-unplugged, e.g.:

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0,2-8,10-15
Off-line CPU(s) list: 1,9
The theory is that tcmalloc uses the CPU ID to index into the arrays that hold its per-CPU data structures. If tcmalloc allocates 14 entries because 14 CPUs are online, but the highest online CPU ID is 15, then the array access is out of bounds.
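
To make the arithmetic of that theory concrete, here is a minimal sketch that expands the on-line CPU list from the `lscpu` output and checks which CPU IDs would fall outside an array sized by the online count. It only models the hypothesis; it is not tcmalloc's actual code, and `parse_cpu_list` and `slots` are illustrative names.

```python
def parse_cpu_list(spec):
    """Expand a Linux CPU list such as "0,2-8,10-15" into individual CPU IDs."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


online = parse_cpu_list("0,2-8,10-15")  # on-line list from the lscpu output
slots = len(online)  # assumed: per-CPU array sized by the number of online CPUs
out_of_range = [cpu for cpu in online if cpu >= slots]

print(f"online CPUs: {slots}, highest CPU ID: {max(online)}")
print(f"CPU IDs that would index past a {slots}-entry array: {out_of_range}")
```

With this host's layout, a thread scheduled on CPU 14 or 15 would index past the end of a 14-entry array, which would fit with the crash being intermittent rather than deterministic.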

Can you confirm whether that's a valid root cause?

@jakewang-stripe
Author

Can you confirm whether a newer Envoy version (e.g. 1.26.1) uses a tcmalloc version that fixes the bug?

@ravenblackx
Contributor

That certainly sounds like a plausible root cause.

It looks like Envoy's tcmalloc dependency was last updated to a version from 2022-10-24, right after 1.24 was cut. 1.24 uses a version from 2022-08-06.

Do you know when/if the described tcmalloc bug was fixed?

@jakewang-stripe
Author

I'm not sure whether this is fixed, so I opened an issue with tcmalloc.

@ravenblackx
Contributor

Looks like this is confirmed as a tcmalloc issue that's not resolved, so there's nothing to do here until it's addressed there.

ravenblackx removed their assignment Jun 22, 2023
@clundquist-stripe

I see a commit referencing the tcmalloc issue!
google/tcmalloc@5823a86

Once the tcmalloc side is closed/confirmed we will have to bump the revision on this side to include the fix 🎉
Hopefully it will land soon!

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Aug 18, 2023
@clundquist-stripe

This is still an issue.

github-actions bot removed the stale label Aug 18, 2023
@ravenblackx
Contributor

Looks like Jul 24's commit may be the one to address the problem. Would it be reasonable to patch in an update to envoy's tcmalloc version at your end and see whether it fixes the problem?

@clundquist-stripe

We're working on building Envoy with the tcmalloc patch to test it.

@clundquist-stripe

We hit compilation errors attempting to update Envoy's dependencies.

My coworker Thejas attempted to build Envoy with the patch and encountered a chain of errors.

I tried updating tcmalloc, but that requires updating Abseil, and updating Abseil breaks compilation of the Common Expression Language (CEL) C++ library (com_google_cel_cpp).
There is no newer version of com_google_cel_cpp that uses the version of Abseil required by tcmalloc.

The errors we found were:

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
external/com_google_cel_cpp/base/memory_manager.cc:237:5: error: use of undeclared identifier 'ABSL_INTERNAL_UNREACHABLE'
    ABSL_INTERNAL_UNREACHABLE;
    ^
external/com_google_cel_cpp/base/memory_manager.cc:245:5: error: use of undeclared identifier 'ABSL_INTERNAL_UNREACHABLE'
    ABSL_INTERNAL_UNREACHABLE;
    ^
2 errors generated.
 Compiling tcmalloc/system-alloc.cc failed: (Exit 1): clang-14 failed: error executing command (from target @com_github_google_tcmalloc//tcmalloc:common_8k_pages)

# ...
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:250:3: error: unknown type name 'ABSL_ATTRIBUTE_NO_UNIQUE_ADDRESS'
  ABSL_ATTRIBUTE_NO_UNIQUE_ADDRESS Forwarder forwarder_;
  ^
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:250:36: error: duplicate member 'Forwarder'
  ABSL_ATTRIBUTE_NO_UNIQUE_ADDRESS Forwarder forwarder_;
                                   ^
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:70:9: note: previous declaration is here
  using Forwarder = ForwarderT;
        ^
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:250:45: error: expected ';' at end of declaration list
  ABSL_ATTRIBUTE_NO_UNIQUE_ADDRESS Forwarder forwarder_;
                                            ^
                                            ;
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:122:35: error: use of undeclared identifier 'forwarder_'; did you mean 'forwarder'?
  Forwarder& forwarder() { return forwarder_; }
                                  ^~~~~~~~~~
                                  forwarder
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:122:14: note: 'forwarder' declared here
  Forwarder& forwarder() { return forwarder_; }
             ^
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:392:18: error: use of undeclared identifier 'forwarder_'
    Span* span = forwarder_.MapObjectToSpan(batch[i]);
                 ^
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:418:5: error: use of undeclared identifier 'forwarder_'; did you mean 'forwarder'?
    forwarder_.DeallocateSpans(size_class_,
    ^~~~~~~~~~
    forwarder
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:122:14: note: 'forwarder' declared here
  Forwarder& forwarder() { return forwarder_; }
             ^
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:418:5: error: reference to non-static member function must be called
    forwarder_.DeallocateSpans(size_class_,
    ^~~~~~~~~~
external/com_github_google_tcmalloc/tcmalloc/central_freelist.h:498:16: error: use of undeclared identifier 'forwarder_'
  Span* span = forwarder_.AllocateSpan(size_class_, info, pages_per_span_);
  # ...

The patch boils down to:

diff --git a/bazel/repository_locations.bzl b/bazel/repository_locations.bzl
index a262e3fe44..c64d708995 100644
--- a/bazel/repository_locations.bzl
+++ b/bazel/repository_locations.bzl
@@ -150,8 +150,8 @@ REPOSITORY_LOCATIONS_SPEC = dict(
         project_name = "Abseil",
         project_desc = "Open source collection of C++ libraries drawn from the most fundamental pieces of Google’s internal codebase",
         project_url = "https://abseil.io/",
-        version = "9bff2a9302a8dbf91712fc215eb2e2cf8ec234e7",
-        sha256 = "ae959138730b55b3fb968d3c357e740e7ffdeab4648dc3eb28843a1e9fa56b57",
+        version = "a3020c763c12bd16bbf00804abe853afa5778174",
+        sha256 = "0734c1d74a75fef0298f8d08c279e092d319b783ea5ff46873af904df0003f81",
         strip_prefix = "abseil-cpp-{version}",
         urls = ["https://github.com/abseil/abseil-cpp/archive/{version}.tar.gz"],
         use_category = ["dataplane_core", "controlplane"],
@@ -336,8 +336,8 @@ REPOSITORY_LOCATIONS_SPEC = dict(
         project_name = "tcmalloc",
         project_desc = "Fast, multi-threaded malloc implementation",
         project_url = "https://github.com/google/tcmalloc",
-        version = "e33c7bc60415127c104006d3301c96902f98d42a",
-        sha256 = "14a2c91b71d6719558768a79671408c9acd8284b418e80386c5888047e2c15aa",
+        version = "cbbe578d8f2822a5f2cefff42ebabfa364b725ab",
+        sha256 = "ceef110ed7ea3fe1a4665b9b5adf38fdca8b026739db78cba4686d1a03224582",
         strip_prefix = "tcmalloc-{version}",
         urls = ["https://github.com/google/tcmalloc/archive/{version}.tar.gz"],
         use_category = ["dataplane_core", "controlplane"],
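
For anyone reproducing a bump like this, each `version`/`sha256` pair has to match the archive fetched from the `urls` template. The sketch below shows one way such a pair could be derived, assuming the `sha256` field is the SHA-256 of the downloaded `.tar.gz` (the usual convention for Bazel archive dependencies). `archive_url`, `sha256_hex`, and `pin_for` are hypothetical helper names, not part of Envoy's tooling.

```python
import hashlib
import urllib.request

# Mirrors the urls template in the diff above, with the repository made a parameter.
URL_TEMPLATE = "https://github.com/{repo}/archive/{version}.tar.gz"


def archive_url(repo, version):
    """Build the archive URL for a repository at a given commit."""
    return URL_TEMPLATE.format(repo=repo, version=version)


def sha256_hex(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def pin_for(repo, version):
    """Download the archive and return the (version, sha256) pair for the spec."""
    with urllib.request.urlopen(archive_url(repo, version)) as resp:
        return version, sha256_hex(resp.read())
```

Calling `pin_for("google/tcmalloc", "cbbe578d8f2822a5f2cefff42ebabfa364b725ab")` downloads the archive and returns the commit together with its digest, which can then be compared with the `sha256` value in the diff.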

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Sep 21, 2023
@clundquist-stripe

@ravenblackx This is not stale and still not resolved

github-actions bot removed the stale label Sep 28, 2023
@ravenblackx
Contributor

@mattklein123 I don't know what the strategy is for updating conflicting chains of dependencies?

cancecen added a commit to cancecen/envoy that referenced this issue Oct 9, 2023
online CPUs.

envoyproxy#27775

Signed-off-by: Can Cecen <ccecen@netflix.com>
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Oct 28, 2023
@clundquist-stripe

This is not stale.

ravenblackx added the no stalebot label and removed the stale label Oct 30, 2023