Envoy on Dps can get cleaned before added back which causes network errors. #2171

Closed
lahabana opened this issue Jun 16, 2021 · 0 comments · Fixed by #2172
Summary

When a disconnection happens on a multi-node setup, there are cases where the DP might be pushed an empty SotW (state of the world), which causes all listeners to be removed. Shortly after, a correct snapshot is pushed, which fixes the problem.

Here are some logs showing this:

[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][upstream] [source/common/upstream/health_discovery_service.cc:334] StreamHealthCheck gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][upstream] [source/common/upstream/health_discovery_service.cc:71] HdsDelegate stream/connection failure, will retry in 71 ms.
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamAggregatedResources gRPC config stream closed: 13,
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:71] cds: add 0 cluster(s), remove 5 cluster(s)
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cluster_manager_impl.cc:691] removing cluster 813bdfbf-3f86-4ca3-a3fe-da8b15ad25b5-80_prod
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:97] cds: remove cluster '813bdfbf-3f86-4ca3-a3fe-da8b15ad25b5-80_prod'
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cluster_manager_impl.cc:691] removing cluster kuma:envoy:admin
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:97] cds: remove cluster 'kuma:envoy:admin'
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cluster_manager_impl.cc:691] removing cluster bfea2c9d-d39a-44bd-9281-9dc27c989d4d-80_prod
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:97] cds: remove cluster 'bfea2c9d-d39a-44bd-9281-9dc27c989d4d-80_prod'
[2021-06-16 06:22:38.874][16846][info][upstream] [source/server/lds_api.cc:60] lds: remove listener 'inbound:10.132.15.224:5601'
[2021-06-16 06:22:38.874][16846][info][upstream] [source/server/lds_api.cc:60] lds: remove listener 'kuma:envoy:admin'
[2021-06-16 06:22:39.913][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:71] cds: add 3 cluster(s), remove 2 cluster(s)
[2021-06-16 06:22:39.949][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster '813bdfbf-3f86-4ca3-a3fe-da8b15ad25b5-80_prod'
[2021-06-16 06:22:39.982][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'bfea2c9d-d39a-44bd-9281-9dc27c989d4d-80_prod'
[2021-06-16 06:22:39.992][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'kuma:envoy:admin'
[2021-06-16 06:22:40.011][16846][info][upstream] [source/server/lds_api.cc:79] lds: add/update listener 'inbound:10.132.15.224:5601'
[2021-06-16 06:22:40.030][16846][info][upstream] [source/server/lds_api.cc:79] lds: add/update listener 'kuma:envoy:admin'

Steps To Reproduce

  1. Set up 2 CPs behind a TCP LB (preferably round-robin, as the problem seems to be more frequent when moving from one instance to another).
  2. Start a DP.
  3. Make the connection between the CP and the DP fail.
  4. Run a load generator that uses the DP endpoint exposed previously.
  5. Observe a few 503s that coincide with the logs pasted previously.

Additional Details & Logs

  • Version: Trunk (40e882)
  • Installation Method (Helm, kumactl, AWS CloudFormation, etc.): Universal backed by Postgres

Observations

Looking at the code, we can see that we push an empty snapshot instead of clearing the snapshot cache: code ref. PS: I can't seem to find the origin of this code in the history; it seems to predate the CNCF donation.

The comment claims that the fake value will be removed from the cache on Disconnect. However, I see no evidence that this is the case: SnapshotCache.ClearSnapshot(nodeId) seems to never be called.
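
For illustration, here is a minimal sketch of what the disconnect path effectively does today. This is not Kuma's actual code: the real type is go-control-plane's cache.SnapshotCache, whose exact method signatures vary between releases, so a narrowed local interface is used here.

```go
package sketch

// snapshotCache captures just the two calls discussed in this issue. The real
// type is go-control-plane's cache.SnapshotCache; its method signatures vary
// between releases, hence this narrowed stand-in.
type snapshotCache interface {
	SetSnapshot(nodeID string, snapshot interface{}) error
	ClearSnapshot(nodeID string)
}

// onDisconnect mirrors the behaviour described above (simplified): a snapshot
// with zero clusters/listeners is written for the node, and ClearSnapshot is
// never called. Because xDS is state-of-the-world, an Envoy that reconnects
// before the watchdog's next tick is served that empty set and drops all of
// its clusters and listeners, matching the "remove ..." log lines above.
func onDisconnect(c snapshotCache, nodeID string, emptySnapshot interface{}) error {
	return c.SetSnapshot(nodeID, emptySnapshot)
}
```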

This is likely to cause 2 issues:

  1. A growing snapshot cache (we never evict old nodeIds); the size of this leak depends on whether there is churn in the DP count.
  2. The issue explained in this ticket.

More explanation on 2.:

  • As mentioned, once we disconnect we push an empty snapshot to the cache.
  • The watchdog waits an entire refreshInterval before generating a new snapshot and populating the cache (see the sketch below).
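
To make the timing concrete, here is a minimal sketch of a ticker-driven refresh loop. Names like refreshInterval and regenerate are illustrative, not Kuma's actual watchdog code; the point is only that the snapshot is rebuilt on a tick, so up to one full refreshInterval can pass between a reconnect and the moment the empty snapshot is replaced.

```go
package sketch

import (
	"context"
	"time"
)

// runWatchdog rebuilds the node's snapshot once per refreshInterval. If the
// disconnect handler has just written an empty snapshot, any Envoy that
// reconnects before the next tick is served that empty snapshot first.
func runWatchdog(ctx context.Context, refreshInterval time.Duration, regenerate func(ctx context.Context) error) {
	ticker := time.NewTicker(refreshInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Rebuild the snapshot from the current mesh state and store it
			// in the cache; errors are ignored here for brevity.
			_ = regenerate(ctx)
		case <-ctx.Done():
			return
		}
	}
}
```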

So what I believe is happening:

  • We connect to a first node; everything works well.
  • We disconnect and push an empty snapshot to the cache.
  • We connect back; the empty snapshot is pushed to Envoy.
  • Envoy removes all listeners.
  • The watchdog ticks and creates a correct snapshot.
  • Envoy receives the correct snapshot and adds the listeners back.

Proposed fix

reconciler.Clear() should clear the snapshot and not push an empty snapshot.
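
A minimal sketch of what that could look like (hypothetical reconciler shape, not Kuma's actual types; the cache is narrowed to the one call that matters, as in the earlier sketch):

```go
package sketch

// reconciler is a hypothetical stand-in for Kuma's xDS reconciler.
type reconciler struct {
	cache interface {
		ClearSnapshot(nodeID string)
	}
}

// Clear evicts the node's entry from the snapshot cache instead of writing an
// empty snapshot. A reconnecting Envoy's ADS request is then parked until the
// watchdog produces a fresh snapshot, so it keeps its last-known-good config
// and no listeners are removed in the meantime. This also stops stale nodeIds
// from accumulating in the cache (issue 1 above).
func (r *reconciler) Clear(nodeID string) {
	r.cache.ClearSnapshot(nodeID)
}
```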

Note: will we need to change the way the watchdog refreshes entities?
I don't think it's necessary, for these reasons:
a) It works for the first request on a freshly booted CP.
b) The ADS request is simply parked until the watchdog has run once.
