Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] send EDS before CDS, CDS will not resubscribe #526

Closed
maplebeats opened this issue Dec 17, 2021 · 33 comments · Fixed by #544
Closed

[BUG] send EDS before CDS, CDS will not resubscribe #526

maplebeats opened this issue Dec 17, 2021 · 33 comments · Fixed by #544

Comments

@maplebeats
Copy link
Contributor

maplebeats commented Dec 17, 2021

sotw server says send xds out of order.most of time CDS will request first, so it response at first.
but some times EDS will send before CDS

		// config watcher can send the requested resources types in any order
		case resp, more := <-values.endpoints:
			if !more {
				return status.Errorf(codes.Unavailable, "endpoints watch failed")
			}
			nonce, err := send(resp)
			if err != nil {
				return err
			}
			values.endpointNonce = nonce

		case resp, more := <-values.clusters:
			if !more {
				return status.Errorf(codes.Unavailable, "clusters watch failed")
			}
			nonce, err := send(resp)
			if err != nil {
				return err
			}
			values.clusterNonce = nonce

https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#resource-warming
but enovyproxy docs says

EDS updates (if any) must arrive after CDS updates for the respective clusters.

if CDS pushed and EDS was pushed brefore CDS, the pushed CDS config would not working.
We are running 40 nodes(envoyproxy v1.16.4) cluters sub xds to a single management server.7.5% nodes will push EDS brefore CDS when Snapshot changed, EDS will sub twice,and CDS will not sub.if we subsequently changed the cluster's resources , the cds will not push anymore.

you can add this code to stow server, it will reproduce send eds brefore cds.

	go func() error {
		tick := time.NewTicker(time.Second * 3)
		tick2 := time.NewTicker(time.Second * 1)
		for {
			select {
			case resp, more := <-values.endpoints:
				if !more {
					return status.Errorf(codes.Unavailable, "endpoints watch failed")
				}
				_ = resp
				nonce, err := send(resp)
				if err != nil {
					return err
				}
				values.endpointNonce = nonce
			case <-tick.C:

			}
			select {
			case resp, more := <-values.clusters:
				if !more {
					return status.Errorf(codes.Unavailable, "clusters watch failed")
				}
				nonce, err := send(resp)
				if err != nil {
					return err
				}
				values.clusterNonce = nonce
			case <-tick2.C:
				tick.Reset(time.Second * 3)
			}
		}
	}()
@maplebeats
Copy link
Contributor Author

maplebeats commented Dec 17, 2021

maybe LDS with RDS also has this issue, I haven't tested it.

@jpeach
Copy link
Contributor

jpeach commented Dec 20, 2021

@maplebeats Did you configure envoy to use ADS?

@maplebeats
Copy link
Contributor Author

maplebeats commented Dec 21, 2021

@maplebeats Did you configure envoy to use ADS?

yes, use ADS mode

dynamic_resources:
  cds_config:
    ads: {}
    resource_api_version: V3
    initial_fetch_timeout: 0s
  ads_config:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: envolve_xds_cluster
      timeout: 1s
    transport_api_version: V3

@maplebeats
Copy link
Contributor Author

#461 this PR points out the problem, it's STOW design flaws

@alecholmez
Copy link
Contributor

Let me go run the xDS conformance test suite against this repo to see if I can reproduce this. I know we do have some sequencing problems but originally the idea was to ignore ordering when things come from the cache to relieve back pressure in large scale systems

@alecholmez
Copy link
Contributor

hi @maplebeats thanks again for opening this issue. I was able to confirm this with the conformance suite, we'll look into this. I might take over that SOTW drafts PR since it seems quite a lot of people prefer ordering over backpressure relief

@maplebeats
Copy link
Contributor Author

hi @maplebeats thanks again for opening this issue. I was able to confirm this with the conformance suite, we'll look into this. I might take over that SOTW drafts PR since it seems quite a lot of people prefer ordering over backpressure relief

what can i do for you? I was modified a version at production environment,it looks work very well.
I can help you do testing or suggestion if your need.

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 10, 2022
@alecholmez alecholmez self-assigned this Feb 10, 2022
@alecholmez
Copy link
Contributor

This PR: #544

should resolve this issue.

@maplebeats
Copy link
Contributor Author

maplebeats commented Mar 29, 2022

I'm try to replace tag to ordered_ads branch and start server with ordered option

srv := server.NewServer(ctx, snapCache, cb, sotw.WithOrderedADS())

but it doesn't work , plane send xds to envoy also not in ordered. I need a moment to locate the problem

@alecholmez
Copy link
Contributor

alecholmez commented Mar 29, 2022

@maplebeats Are there any errors you can share here? That branch uses the same algorithm as this PR: #461

The server may be ordered but theres also a problem that we don't guarantee order from the cache side either.

Edit

I added callback log statements to the integration tests to verify that the responses were in the correct order and indeed find CDS comes before EDS, LDS comes before SRDS and RDS. If you're not seeing this behavior please let me know

@maplebeats
Copy link
Contributor Author

@alecholmez
I use

err := snapCache.SetSnapshot(context.Background(), node, snap)

to update snapshot. I found cache use a map store id-xds , and range it to respond

for id, watch := range info.watches  // the range is random

I try to use a test branch to fix it , it works well. but the commit just for test (only for cds/eds)
maplebeats@0aa201b

@sunjayBhatia
Copy link
Member

That’s a good find and may be what @alecholmez was talking about above around the cache not guaranteeing order

@sunjayBhatia
Copy link
Member

sunjayBhatia commented Mar 30, 2022

here's an integration test run that looks like SRDS responds before LDS (with added logging)

2022/03/30 16:59:30 upstream listening HTTP/1.1 on 18080
2022/03/30 16:59:30 enabling ordered ADS mode...
2022/03/30 16:59:30 waiting for the first request...
2022/03/30 16:59:30 gateway listening HTTP/1.1 on 18001
2022/03/30 16:59:30 access log server listening on 18090
2022/03/30 16:59:30 management server listening on 18000
2022/03/30 16:59:30 stream 1 open for
request for type type.googleapis.com/envoy.config.cluster.v3.Cluster version  nonce
2022/03/30 16:59:30 initial snapshot {Xds:ads Version: UpstreamPort:18080 BasePort:9000 NumClusters:4 NumHTTPListeners:2 NumScopedHTTPListeners:2 NumVHDSHTTPListeners:0 NumTCPListeners:2 NumRuntimes:2 TLS:false NumExtension:1 currentPort:0}
2022/03/30 16:59:30 executing sequence updates=10 request=5
2022/03/30 16:59:30 update snapshot v0
2022/03/30 16:59:30 request batch 0, ok 0, failed 6, pass false
responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v0 nonce 1
request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version  nonce
responding for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
request for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v0 nonce 1
LDS req 0 --> request for type type.googleapis.com/envoy.config.listener.v3.Listener version  nonce
LDS resp 0 --> responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v0 nonce 3
request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
SRDS req 0 --> request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version  nonce
request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version  nonce
SRDS resp 0 --> responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v0 nonce 4
LDS req 1 --> request for type type.googleapis.com/envoy.config.listener.v3.Listener version v0 nonce 3
request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version  nonce
SRDS req 1 --> request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v0 nonce 4
responding for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
2022/03/30 16:59:31 request batch 1, ok 6, failed 0, pass true
2022/03/30 16:59:31 request batch 2, ok 6, failed 0, pass true
2022/03/30 16:59:32 request batch 3, ok 6, failed 0, pass true
2022/03/30 16:59:32 request batch 4, ok 6, failed 0, pass true
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9001 / http 200 3dc76cae-3c9a-4431-a7d2-9106377c1c88 cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9000 / http 200 b7613ed4-ba16-48b2-b011-86b9630bedb5 cluster-v0-0
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9003 / http 200 f5324a61-f18c-45f7-bac6-95653cd10188 cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9001 / http 200 51e1cb3b-8059-41bb-834f-de5ec8780505 cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9000 / http 200 e317b2eb-a8ab-4cc6-b2ca-c50965cbb0d8 cluster-v0-0
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9003 / http 200 05a2d906-a5e5-447b-95ac-82433a3deb5e cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9002 / http 200 06064b6b-8937-4931-97dc-a7003b1b0f01 cluster-v0-0
2022/03/30 16:59:33 [echo2022-03-30T16:59:31Z] 127.0.0.1:9002 / http 200 d2595229-34c7-4958-bf0d-3548baf2b1a0 cluster-v0-0
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9001 / http 200 d41d997c-b948-4bcd-9a93-9c20e86d2615 cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9003 / http 200 17ca663d-65ca-4435-9d8f-5bf49b429ae9 cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9003 / http 200 0c2f4749-71d0-4cac-8017-003af7d17283 cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9002 / http 200 f7d9124a-2aaa-4ee1-9e80-ca9cafbc8f46 cluster-v0-0
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9000 / http 200 bdc3cd61-39a5-430d-9b8c-d49076942437 cluster-v0-0
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9001 / http 200 afd56dd7-975f-4a19-ae14-db191648e3fb cluster-v0-1
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9002 / http 200 7430017f-92c3-45d8-93b2-3051243fd974 cluster-v0-0
2022/03/30 16:59:33 [echo2022-03-30T16:59:32Z] 127.0.0.1:9000 / http 200 877d2d14-6dc9-457b-a935-4d51a07d438a cluster-v0-0
2022/03/30 16:59:33 server callbacks fetches=0 requests=11 responses=5
2022/03/30 16:59:33 update snapshot v1
SRDS resp 1 --> responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v1 nonce 6
responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v1 nonce 7
LDS resp 1 --> responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v1 nonce 8
request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v1 nonce 6
2022/03/30 16:59:33 request batch 0, ok 5, failed 1, pass false
request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
responding for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v1 nonce 9

@sunjayBhatia
Copy link
Member

another smaller example here with added logs in SetSnapshot

2022/03/30 17:16:47 update snapshot v4
respond open watch 22 type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration[] with new version "v4"
respond open watch 27 type.googleapis.com/envoy.config.route.v3.RouteConfiguration[route-v3-9003 route-v3-9002 route-v3-9001 route-v3-9000] with new version "v4"
respond open watch 24 type.googleapis.com/envoy.config.cluster.v3.Cluster[] with new version "v4"
respond open watch 25 type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster-v3-0 cluster-v3-3 cluster-v3-2 cluster-v3-1] with new version "v4"
responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v4 nonce 21
LDS watch is picked up and responded to later --> respond open watch 21 type.googleapis.com/envoy.config.listener.v3.Listener[] with new version "v4"
responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v4 nonce 22
responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v4 nonce 23

@alecholmez
Copy link
Contributor

alecholmez commented Mar 30, 2022

Ah yes @maplebeats so that is a known issue and what I was referring to. Because we just range over the open watches they can be set out of order and aren't sorted. Sorting is expensive so I think we might want an alternate SetSnapshot in the cache interface for guaranteed ordering. We're really only concerned about order in ADS mode.

@sunjayBhatia interesting finds! I'm curious as to where you added the logging? Could you maybe paste a diff here? I added back debug logging in the integration test callbacks and I'm seeing this behavior from strictly the server side:

2022/03/30 14:06:00 server callbacks fetches=0 requests=11 responses=5
2022/03/30 14:06:00 update snapshot v1
2022/03/30 14:06:00 responding to request for type.googleapis.com/envoy.config.cluster.v3.Cluster on stream 1
2022/03/30 14:06:00 responding to request for type.googleapis.com/envoy.config.listener.v3.Listener on stream 1
2022/03/30 14:06:00 responding to request for type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration on stream 1
2022/03/30 14:06:00 request batch 0, ok 6, failed 0, pass true
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment on stream 1
2022/03/30 14:06:00 responding to request for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment on stream 1
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.listener.v3.Listener on stream 1
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:00 responding to request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration on stream 1
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.cluster.v3.Cluster on stream 1
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment on stream 1
2022/03/30 14:06:00 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:01 request batch 1, ok 6, failed 0, pass true
2022/03/30 14:06:01 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:01 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:01 request batch 2, ok 6, failed 0, pass true
2022/03/30 14:06:02 request batch 3, ok 6, failed 0, pass true
2022/03/30 14:06:02 request batch 4, ok 6, failed 0, pass true

-----

2022/03/30 14:06:03 update snapshot v2
2022/03/30 14:06:03 responding to request for type.googleapis.com/envoy.config.listener.v3.Listener on stream 1
2022/03/30 14:06:03 responding to request for type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration on stream 1
2022/03/30 14:06:03 responding to request for type.googleapis.com/envoy.config.cluster.v3.Cluster on stream 1
2022/03/30 14:06:03 request batch 0, ok 6, failed 0, pass true
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.listener.v3.Listener on stream 1
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration on stream 1
2022/03/30 14:06:03 responding to request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment on stream 1
2022/03/30 14:06:03 responding to request for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment on stream 1
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.cluster.v3.Cluster on stream 1
2022/03/30 14:06:03 received request for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment on stream 1
2022/03/30 14:06:04 request batch 1, ok 6, failed 0, pass true
2022/03/30 14:06:04 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:04 received request for type.googleapis.com/envoy.config.route.v3.RouteConfiguration on stream 1
2022/03/30 14:06:04 request batch 2, ok 6, failed 0, pass true
2022/03/30 14:06:05 request batch 3, ok 6, failed 0, pass true
2022/03/30 14:06:05 request batch 4, ok 6, failed 0, pass true

I think the most important thing to take here is that LDS and CDS are always the 2 first responses on an ADS stream. After the initial request I'm not 100% sure on Envoy's sequencing since the root objects in the tree already exist and theres functionality like warming etc...

It is definitely important to note that the response logs from the cache will most certainly be out of order. And honestly now that I'm thinking about it we might want to namespace logs like cache: or server: cause reading the output from all the concurrent routines is extremely difficult.

I think the original design of this ordering algorithm on the server side from @fxposter accounted for such things but I could be wrong. It's probably worth me working on ordering the cache responses as well with an alternate implementation.

@sunjayBhatia
Copy link
Member

yeah using some structured logging tagging/contexts etc. might be nice to have when debugging

@sunjayBhatia
Copy link
Member

was a bit lazy and just put some printfs before the where the request/response callbacks are, adding properly w/ the callback logging (and adding a real logger to the test main as well)

@sunjayBhatia
Copy link
Member

heres a branch I am using for debugging, running the tests w/ docker make docker_tests locally: https://github.com/sunjayBhatia/go-control-plane/tree/debugging-ordered-ads

@alecholmez
Copy link
Contributor

Sweet let me take a look real quick thanks for that

@sunjayBhatia
Copy link
Member

heres an example where it looks like SRDS responds before LDS

2022/03/30 18:44:45 upstream listening HTTP/1.1 on 18080
2022/03/30 18:44:45 enabling ordered ADS mode...
2022/03/30 18:44:45 waiting for the first request...
2022/03/30 18:44:45 gateway listening HTTP/1.1 on 18001
2022/03/30 18:44:45 access log server listening on 18090
2022/03/30 18:44:45 management server listening on 18000
2022/03/30 18:44:45 stream 1 open for
2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.cluster.v3.Cluster version  nonce
2022/03/30 18:44:45 initial snapshot {Xds:ads Version: UpstreamPort:18080 BasePort:9000 NumClusters:4 NumHTTPListeners:2 NumScopedHTTPListeners:2 NumVHDSHTTPListeners:0 NumTCPListeners:2 NumRuntimes:2 TLS:false NumExtension:1 currentPort:0}
2022/03/30 18:44:45 executing sequence updates=10 request=5
DEBU[0000] open watch 1 for type.googleapis.com/envoy.config.cluster.v3.Cluster[] from nodeID "test-id", version ""
2022/03/30 18:44:45 update snapshot v0
DEBU[0000] respond open watch 1 type.googleapis.com/envoy.config.cluster.v3.Cluster[] with new version "v0"
DEBU[0000] respond type.googleapis.com/envoy.config.cluster.v3.Cluster[] version "" with version "v0"
2022/03/30 18:44:45 request batch 0, ok 0, failed 6, pass false
2022/03/30 18:44:45 responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v0 nonce 1
2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version  nonce
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster-v0-2 cluster-v0-3 cluster-v0-0 cluster-v0-1] and known map[]. Diff [cluster-v0-2 cluster-v0-3 cluster-v0-0 cluster-v0-1]
DEBU[0000] respond type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster-v0-2 cluster-v0-3 cluster-v0-0 cluster-v0-1] version "" with version "v0"
2022/03/30 18:44:45 responding for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v0 nonce 1
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[] and known map[]. Diff []
DEBU[0000] open watch 2 for type.googleapis.com/envoy.config.cluster.v3.Cluster[] from nodeID "test-id", version "v0"
LDS req 0 --> 2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.listener.v3.Listener version  nonce
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.listener.v3.Listener[] and known map[]. Diff []
DEBU[0000] respond type.googleapis.com/envoy.config.listener.v3.Listener[] version "" with version "v0"
LDS resp 0 --> 2022/03/30 18:44:45 responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v0 nonce 3
2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster-v0-2 cluster-v0-3 cluster-v0-0 cluster-v0-1] and known map[cluster-v0-0:{} cluster-v0-1:{} cluster-v0-2:{} cluster-v0-3:{}]. Diff []
DEBU[0000] open watch 3 for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster-v0-2 cluster-v0-3 cluster-v0-0 cluster-v0-1] from nodeID "test-id", version "v0"
SRDS req 0 --> 2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version  nonce
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration[] and known map[]. Diff []
DEBU[0000] respond type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration[] version "" with version "v0"
2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version  nonce
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[route-v0-9001 route-v0-9000] and known map[]. Diff [route-v0-9001 route-v0-9000]
WARN[0000] ADS mode: not responding to request: "route-v0-9002" not listed
SRDS resp 0 --> 2022/03/30 18:44:45 responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v0 nonce 4
LDS req 1 --> 2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.listener.v3.Listener version v0 nonce 3
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.listener.v3.Listener[] and known map[]. Diff []
DEBU[0000] open watch 4 for type.googleapis.com/envoy.config.listener.v3.Listener[] from nodeID "test-id", version "v0"
2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version  nonce
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[route-v0-9003 route-v0-9002 route-v0-9001 route-v0-9000] and known map[]. Diff [route-v0-9003 route-v0-9002 route-v0-9001 route-v0-9000]
DEBU[0000] respond type.googleapis.com/envoy.config.route.v3.RouteConfiguration[route-v0-9003 route-v0-9002 route-v0-9001 route-v0-9000] version "" with version "v0"
SRDS req 1 --> 2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v0 nonce 4
2022/03/30 18:44:45 responding for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration[] and known map[]. Diff []
DEBU[0000] open watch 5 for type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration[] from nodeID "test-id", version "v0"
2022/03/30 18:44:45 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
DEBU[0000] nodeID "test-id" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[route-v0-9003 route-v0-9002 route-v0-9001 route-v0-9000] and known map[route-v0-9000:{} route-v0-9001:{} route-v0-9002:{} route-v0-9003:{}]. Diff []
DEBU[0000] open watch 6 for type.googleapis.com/envoy.config.route.v3.RouteConfiguration[route-v0-9003 route-v0-9002 route-v0-9001 route-v0-9000] from nodeID "test-id", version "v0"
...
2022/03/30 18:44:47 server callbacks fetches=0 requests=11 responses=5
2022/03/30 18:44:47 update snapshot v1
DEBU[0002] respond open watch 5 type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration[] with new version "v1"
DEBU[0002] respond type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration[] version "v0" with version "v1"
DEBU[0002] respond open watch 6 type.googleapis.com/envoy.config.route.v3.RouteConfiguration[route-v0-9003 route-v0-9002 route-v0-9001 route-v0-9000] with new version "v1"
WARN[0002] ADS mode: not responding to request: "route-v1-9000" not listed
SRDS resp 1 --> 2022/03/30 18:44:47 responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v1 nonce 6
DEBU[0002] respond open watch 2 type.googleapis.com/envoy.config.cluster.v3.Cluster[] with new version "v1"
DEBU[0002] respond type.googleapis.com/envoy.config.cluster.v3.Cluster[] version "v0" with version "v1"
DEBU[0002] respond open watch 3 type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster-v0-2 cluster-v0-3 cluster-v0-0 cluster-v0-1] with new version "v1"
WARN[0002] ADS mode: not responding to request: "cluster-v1-0" not listed
DEBU[0002] respond open watch 4 type.googleapis.com/envoy.config.listener.v3.Listener[] with new version "v1"
DEBU[0002] respond type.googleapis.com/envoy.config.listener.v3.Listener[] version "v0" with version "v1"
2022/03/30 18:44:47 responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v1 nonce 7
LDS resp 1 --> 2022/03/30 18:44:47 responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v1 nonce 8
2022/03/30 18:44:47 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5

@alecholmez
Copy link
Contributor

Woah okay super weird behavior the integration test doesn't even pass on your branch. Getting very different outputs than on my branch, this is great thanks for doing this. I'll have to debug and see what's up here

@alecholmez
Copy link
Contributor

alecholmez commented Mar 30, 2022

I guess this brings me to my next question, do we want to hold off on the server ordering PR before we guarantee cache order?

I'll still try and figure out why we're seeing different behavior

@alecholmez
Copy link
Contributor

alecholmez commented Mar 30, 2022

Okay quick update, @sunjayBhatia on your branch if I turned off the cache logging I see the same output on the upstream branch. So LDS and CDS still come first in regards to the server side but if I reenable the cache logger it definitely shows srds is trying to go through first.

2022/03/30 16:06:18 responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v0 nonce 1
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version  nonce 
2022/03/30 16:06:18 responding for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v0 nonce 1
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.listener.v3.Listener version  nonce 
2022/03/30 16:06:18 responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v0 nonce 3
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version  nonce 
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version  nonce 
2022/03/30 16:06:18 responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v0 nonce 4
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.listener.v3.Listener version v0 nonce 3
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version  nonce 
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v0 nonce 4
2022/03/30 16:06:18 responding for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
2022/03/30 16:06:18 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5

---

2022/03/30 16:06:21 update snapshot v1
2022/03/30 16:06:21 responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v1 nonce 6
2022/03/30 16:06:21 responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v1 nonce 7
2022/03/30 16:06:21 responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v1 nonce 8
2022/03/30 16:06:21 request batch 0, ok 6, failed 0, pass true
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.listener.v3.Listener version v1 nonce 6
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v0 nonce 5
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v1 nonce 7
2022/03/30 16:06:21 responding for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v1 nonce 9
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v0 nonce 2
2022/03/30 16:06:21 responding for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v1 nonce 10
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v1 nonce 9
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v1 nonce 8
2022/03/30 16:06:21 request for type type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version v1 nonce 10
2022/03/30 16:06:22 request batch 1, ok 6, failed 0, pass true
2022/03/30 16:06:22 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v1 nonce 9
2022/03/30 16:06:22 request for type type.googleapis.com/envoy.config.route.v3.RouteConfiguration version v1 nonce 9
2022/03/30 16:06:22 request batch 2, ok 6, failed 0, pass true
2022/03/30 16:06:23 request batch 3, ok 6, failed 0, pass true
2022/03/30 16:06:23 request batch 4, ok 6, failed 0, pass true

I think it makes sense to say the localized change to the server is okay but the cache ordering is also probably worth implementing considering the behavior we're seeing.

I still have absolutely no why I see the management server die here:

2022/03/30 16:06:26 update snapshot v3
2022/03/30 16:06:26 responding for type type.googleapis.com/envoy.config.listener.v3.Listener version v3 nonce 16
2022/03/30 16:06:26 responding for type type.googleapis.com/envoy.config.route.v3.ScopedRouteConfiguration version v3 nonce 17
2022/03/30 16:06:26 responding for type type.googleapis.com/envoy.config.cluster.v3.Cluster version v3 nonce 18
2022/03/30 16:06:26 request batch 0, ok 6, failed 0, pass true
2022/03/30 16:06:26 stream 1 closed
2022/03/30 16:06:27 request batch 1, ok 0, failed 6, pass true
2022/03/30 16:06:27 request batch 2, ok 0, failed 6, pass true
2022/03/30 16:06:28 request batch 3, ok 0, failed 6, pass true
2022/03/30 16:06:28 request batch 4, ok 0, failed 6, pass true
2022/03/30 16:06:29 server callbacks fetches=0 requests=31 responses=18
2022/03/30 16:06:29 update snapshot v4
2022/03/30 16:06:29 request batch 0, ok 0, failed 6, pass false
2022/03/30 16:06:29 request batch 1, ok 0, failed 6, pass false
2022/03/30 16:06:30 request batch 2, ok 0, failed 6, pass false
2022/03/30 16:06:30 request batch 3, ok 0, failed 6, pass false
2022/03/30 16:06:31 request batch 4, ok 0, failed 6, pass false
2022/03/30 16:06:31 server callbacks fetches=0 requests=31 responses=18
2022/03/30 16:06:31 failed all requests in a run 4
build/integration.sh: line 46: 73561 Segmentation fault: 11  ( ${ENVOY} -c sample/bootstrap-${XDS}.yaml --drain-time-s 1 -l debug 2> ${ENVOY_LOG} )
build/integration.sh: line 40: kill: (73561) - No such process
make: *** [integration.ads] Error 1

But at least I can see similar results to my test. I'll go ahead on the cache ordering and see if I can do it without needing an API change. We already set an ads mode in the SnapshotCache so I think I can just key off that to make the change not invasive.

@sunjayBhatia
Copy link
Member

hm, i've increased the number of snapshots generated but I've never seen the tests actually fully fail or that crash, guess I'll have to run it a few more times

@alecholmez
Copy link
Contributor

hm, i've increased the number of snapshots generated but I've never seen the tests actually fully fail or that crash, guess I'll have to run it a few more times

It honestly could be me, I've been noticing random docker problems on my M1 arm machine. I'll try this on an x86 machine to see if there's any discrepancies

@alecholmez
Copy link
Contributor

alecholmez commented Mar 31, 2022

@maplebeats and @sunjayBhatia Can you guys test again?

I just pushed this commit: 1016a43

I was able to see that the cache logs lined up with the server logs in regards to what order things came back. Let me know if you guys see different behavior.

It added ordering to the cache and also fixed your bug @maplebeats where ScopedRoutes came before Routes. It only sorts when ads is enabled but it's still an expensive operation. Thinking about ways we can maybe cut down on all the computation time here.

@sunjayBhatia
Copy link
Member

run it a good few times and can't repro the previous out of order errors

@sunjayBhatia
Copy link
Member

could we maybe trade off the sorting complexity by saving watches indexed by type url first, then watch id, or if we can in the simple cache just save 1 watch per type if that works? we could then iterate w/o sorting etc. in the type order precedence we need to

@maplebeats
Copy link
Contributor Author

I test 10 nodes few times, all works fine

@alecholmez
Copy link
Contributor

Okay great, glad we now have at least a working POC. Yea I really don't like the sorting complexity. That might be interesting, I'm thinking we could have a specific code path for ADS that does something like you said

@github-actions github-actions bot removed the stale label Dec 19, 2022
@haoruolei
Copy link

Hi is there any updates on this issue and a possible fix? Encountered same problem with EDS sent before CDS in ADS mode

@yongbozhang
Copy link

+1 hoping to see this get fixed because we run into the same issue. It looks like #544 hasn't been updated for a long time. Is the review in a state that I can pull and test with our application?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants