-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] send EDS before CDS, CDS will not resubscribe #526
Comments
maybe LDS with RDS also has this issue, I haven't tested it. |
@maplebeats Did you configure envoy to use ADS? |
yes, use ADS mode dynamic_resources:
cds_config:
ads: {}
resource_api_version: V3
initial_fetch_timeout: 0s
ads_config:
api_type: GRPC
grpc_services:
- envoy_grpc:
cluster_name: envolve_xds_cluster
timeout: 1s
transport_api_version: V3 |
#461 this PR points out the problem, it's STOW design flaws |
Let me go run the xDS conformance test suite against this repo to see if I can reproduce this. I know we do have some sequencing problems but originally the idea was to ignore ordering when things come from the cache to relieve back pressure in large scale systems |
hi @maplebeats thanks again for opening this issue. I was able to confirm this with the conformance suite, we'll look into this. I might take over that SOTW drafts PR since it seems quite a lot of people prefer ordering over backpressure relief |
what can i do for you? I was modified a version at production environment,it looks work very well. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This PR: #544 should resolve this issue. |
I'm try to replace tag to ordered_ads branch and start server with ordered option srv := server.NewServer(ctx, snapCache, cb, sotw.WithOrderedADS()) but it doesn't work , plane send xds to envoy also not in ordered. I need a moment to locate the problem |
@maplebeats Are there any errors you can share here? That branch uses the same algorithm as this PR: #461 The server may be ordered but theres also a problem that we don't guarantee order from the cache side either. EditI added callback log statements to the integration tests to verify that the responses were in the correct order and indeed find CDS comes before EDS, LDS comes before SRDS and RDS. If you're not seeing this behavior please let me know |
@alecholmez
to update snapshot. I found cache use a map store id-xds , and range it to respond
I try to use a test branch to fix it , it works well. but the commit just for test (only for cds/eds) |
That’s a good find and may be what @alecholmez was talking about above around the cache not guaranteeing order |
here's an integration test run that looks like
|
another smaller example here with added logs in
|
Ah yes @maplebeats so that is a known issue and what I was referring to. Because we just @sunjayBhatia interesting finds! I'm curious as to where you added the logging? Could you maybe paste a diff here? I added back debug logging in the integration test callbacks and I'm seeing this behavior from strictly the server side:
I think the most important thing to take here is that It is definitely important to note that the response logs from the cache will most certainly be out of order. And honestly now that I'm thinking about it we might want to namespace logs like I think the original design of this ordering algorithm on the server side from @fxposter accounted for such things but I could be wrong. It's probably worth me working on ordering the cache responses as well with an alternate implementation. |
yeah using some structured logging tagging/contexts etc. might be nice to have when debugging |
was a bit lazy and just put some printfs before the where the request/response callbacks are, adding properly w/ the callback logging (and adding a real logger to the test main as well) |
heres a branch I am using for debugging, running the tests w/ docker |
Sweet let me take a look real quick thanks for that |
heres an example where it looks like
|
Woah okay super weird behavior the integration test doesn't even pass on your branch. Getting very different outputs than on my branch, this is great thanks for doing this. I'll have to debug and see what's up here |
I guess this brings me to my next question, do we want to hold off on the server ordering PR before we guarantee cache order? I'll still try and figure out why we're seeing different behavior |
Okay quick update, @sunjayBhatia on your branch if I turned off the cache logging I see the same output on the upstream branch. So LDS and CDS still come first in regards to the server side but if I reenable the cache logger it definitely shows srds is trying to go through first.
I think it makes sense to say the localized change to the server is okay but the cache ordering is also probably worth implementing considering the behavior we're seeing. I still have absolutely no why I see the management server die here:
But at least I can see similar results to my test. I'll go ahead on the cache ordering and see if I can do it without needing an API change. We already set an |
hm, i've increased the number of snapshots generated but I've never seen the tests actually fully fail or that crash, guess I'll have to run it a few more times |
It honestly could be me, I've been noticing random docker problems on my M1 arm machine. I'll try this on an x86 machine to see if there's any discrepancies |
@maplebeats and @sunjayBhatia Can you guys test again? I just pushed this commit: 1016a43 I was able to see that the cache logs lined up with the server logs in regards to what order things came back. Let me know if you guys see different behavior. It added ordering to the cache and also fixed your bug @maplebeats where ScopedRoutes came before Routes. It only sorts when |
run it a good few times and can't repro the previous out of order errors |
could we maybe trade off the sorting complexity by saving watches indexed by type url first, then watch id, or if we can in the simple cache just save 1 watch per type if that works? we could then iterate w/o sorting etc. in the type order precedence we need to |
I test 10 nodes few times, all works fine |
Okay great, glad we now have at least a working POC. Yea I really don't like the sorting complexity. That might be interesting, I'm thinking we could have a specific code path for ADS that does something like you said |
Hi is there any updates on this issue and a possible fix? Encountered same problem with EDS sent before CDS in ADS mode |
+1 hoping to see this get fixed because we run into the same issue. It looks like #544 hasn't been updated for a long time. Is the review in a state that I can pull and test with our application? |
sotw server says send xds out of order.most of time CDS will request first, so it response at first.
but some times EDS will send before CDS
https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#resource-warming
but enovyproxy docs says
if CDS pushed and EDS was pushed brefore CDS, the pushed CDS config would not working.
We are running 40 nodes(envoyproxy v1.16.4) cluters sub xds to a single management server.7.5% nodes will push EDS brefore CDS when Snapshot changed, EDS will sub twice,and CDS will not sub.if we subsequently changed the cluster's resources , the cds will not push anymore.
you can add this code to stow server, it will reproduce send eds brefore cds.
The text was updated successfully, but these errors were encountered: