Envoy on Dps can get cleaned before added back which causes network errors. #2171

Closed
lahabana opened this issue Jun 16, 2021 · 0 comments · Fixed by #2172
Summary

When a disconnection happens on a multi-node setup, there are cases where the DP might be pushed an empty SotW (state of the world), which causes all listeners to be removed. Shortly after, a correct snapshot is pushed, which fixes the problem.

Here are some logs showing this:

[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][upstream] [source/common/upstream/health_discovery_service.cc:334] StreamHealthCheck gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][upstream] [source/common/upstream/health_discovery_service.cc:71] HdsDelegate stream/connection failure, will retry in 71 ms.
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13,
[2021-06-16 06:22:38.364][16846][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamAggregatedResources gRPC config stream closed: 13,
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:71] cds: add 0 cluster(s), remove 5 cluster(s)
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cluster_manager_impl.cc:691] removing cluster 813bdfbf-3f86-4ca3-a3fe-da8b15ad25b5-80_prod
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:97] cds: remove cluster '813bdfbf-3f86-4ca3-a3fe-da8b15ad25b5-80_prod'
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cluster_manager_impl.cc:691] removing cluster kuma:envoy:admin
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:97] cds: remove cluster 'kuma:envoy:admin'
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cluster_manager_impl.cc:691] removing cluster bfea2c9d-d39a-44bd-9281-9dc27c989d4d-80_prod
[2021-06-16 06:22:38.873][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:97] cds: remove cluster 'bfea2c9d-d39a-44bd-9281-9dc27c989d4d-80_prod'
[2021-06-16 06:22:38.874][16846][info][upstream] [source/server/lds_api.cc:60] lds: remove listener 'inbound:10.132.15.224:5601'
[2021-06-16 06:22:38.874][16846][info][upstream] [source/server/lds_api.cc:60] lds: remove listener 'kuma:envoy:admin'
[2021-06-16 06:22:39.913][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:71] cds: add 3 cluster(s), remove 2 cluster(s)
[2021-06-16 06:22:39.949][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster '813bdfbf-3f86-4ca3-a3fe-da8b15ad25b5-80_prod'
[2021-06-16 06:22:39.982][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'bfea2c9d-d39a-44bd-9281-9dc27c989d4d-80_prod'
[2021-06-16 06:22:39.992][16846][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'kuma:envoy:admin'
[2021-06-16 06:22:40.011][16846][info][upstream] [source/server/lds_api.cc:79] lds: add/update listener 'inbound:10.132.15.224:5601'
[2021-06-16 06:22:40.030][16846][info][upstream] [source/server/lds_api.cc:79] lds: add/update listener 'kuma:envoy:admin'

Steps To Reproduce

  1. Set up 2 CPs behind a TCP LB (preferably round-robin, as the problem seems to be more frequent when moving from one instance to another).
  2. Start a DP.
  3. Make the connection between the CP and the DP fail.
  4. Run a load generator that uses the DP endpoint exposed previously.
  5. Observe a few 503s that coincide with the logs pasted previously.

Additional Details & Logs

  • Version: Trunk (40e882)
  • Installation Method (Helm, kumactl, AWS CloudFormation, etc.): Universal backed by Postgres

Observations

Looking at the code, we can see that we push an empty snapshot instead of clearing the snapshot cache: code ref. PS: I can't seem to find the origin of this code in the history; it seems to predate the CNCF donation.

The comment claims that the fake value will be removed from the cache on Disconnect. However, I see no evidence that this is the case: SnapshotCache.ClearSnapshot(nodeId) seems to never be called.
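
For illustration, here is a minimal sketch of what the disconnect path effectively does today. This is not Kuma's actual code: the real type is go-control-plane's cache.SnapshotCache, whose exact method signatures vary between releases, so a narrowed local interface is used here.

```go
package sketch

// snapshotCache captures just the two calls discussed in this issue. The real
// type is go-control-plane's cache.SnapshotCache; its method signatures vary
// between releases, hence this narrowed stand-in.
type snapshotCache interface {
	SetSnapshot(nodeID string, snapshot interface{}) error
	ClearSnapshot(nodeID string)
}

// onDisconnect mirrors the behaviour described above (simplified): a snapshot
// with zero clusters/listeners is written for the node, and ClearSnapshot is
// never called. Because xDS is state-of-the-world, an Envoy that reconnects
// before the watchdog's next tick is served that empty set and drops all of
// its clusters and listeners, matching the "remove ..." log lines above.
func onDisconnect(c snapshotCache, nodeID string, emptySnapshot interface{}) error {
	return c.SetSnapshot(nodeID, emptySnapshot)
}
```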

This is likely to cause 2 issues:

  1. A growing snapshot cache (we never evict old nodeIds); the size of this leak depends on whether there is churn in the DP count.
  2. The issue explained in this ticket.

More explanation on 2.:

  • As mentioned, once we disconnect we push an empty snapshot to the cache.
  • The watchdog waits an entire refreshInterval before generating a new snapshot and populating the cache (see the sketch below).
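
To make the timing concrete, here is a minimal sketch of a ticker-driven refresh loop. Names like refreshInterval and regenerate are illustrative, not Kuma's actual watchdog code; the point is only that the snapshot is rebuilt on a tick, so up to one full refreshInterval can pass between a reconnect and the moment the empty snapshot is replaced.

```go
package sketch

import (
	"context"
	"time"
)

// runWatchdog rebuilds the node's snapshot once per refreshInterval. If the
// disconnect handler has just written an empty snapshot, any Envoy that
// reconnects before the next tick is served that empty snapshot first.
func runWatchdog(ctx context.Context, refreshInterval time.Duration, regenerate func(ctx context.Context) error) {
	ticker := time.NewTicker(refreshInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Rebuild the snapshot from the current mesh state and store it
			// in the cache; errors are ignored here for brevity.
			_ = regenerate(ctx)
		case <-ctx.Done():
			return
		}
	}
}
```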

So what I believe is happening:

  • We connect to a first node; everything works well.
  • We disconnect and push an empty snapshot to the cache.
  • We connect back; the empty snapshot is pushed to Envoy.
  • Envoy removes all listeners.
  • The watchdog ticks and creates a correct snapshot.
  • Envoy receives the correct snapshot and adds the listeners back.

Proposed fix

reconciler.Clear() should clear the snapshot and not push an empty snapshot.
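
A minimal sketch of what that could look like (hypothetical reconciler shape, not Kuma's actual types; the cache is narrowed to the one call that matters, as in the earlier sketch):

```go
package sketch

// reconciler is a hypothetical stand-in for Kuma's xDS reconciler.
type reconciler struct {
	cache interface {
		ClearSnapshot(nodeID string)
	}
}

// Clear evicts the node's entry from the snapshot cache instead of writing an
// empty snapshot. A reconnecting Envoy's ADS request is then parked until the
// watchdog produces a fresh snapshot, so it keeps its last-known-good config
// and no listeners are removed in the meantime. This also stops stale nodeIds
// from accumulating in the cache (issue 1 above).
func (r *reconciler) Clear(nodeID string) {
	r.cache.ClearSnapshot(nodeID)
}
```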

Note: will we need to change the way the watchdog refreshes entities?
I don't think it's necessary, for these reasons:
a) It works for the first request on a freshly booted CP.
b) The ADS request is simply parked until the watchdog has run once.
