cache: add support for xDS TTLs #359

snowp · 2020-09-24T14:52:38Z

Adds support for xDS TTLs for SotW by introducing a goroutine per TTL'd
resource group which pushes the current config to open watches every 1/3
TTL time.

Fixes #358

Signed-off-by: Snow Pettersen snowp@lyft.com

Adds support for xDS TTLs for SotW by introducing a goroutine per TTL'd resource group which pushes the current config to open watches every 1/3 TTL time. Signed-off-by: Snow Pettersen <snowp@lyft.com>

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-09-24T14:56:17Z

Note: this is not complete due to the proto changes not having landed upstream Envoy, but this implements all the necessary bits for the cache to trigger watches based on the configured TTL.

Mostly looking for high level feedback on this approach at this point, especially around whether we need to do something smarter around reducing the number of goroutines required for this feature. This approach is likely not feasible for large scale delta deployments, but since we're currently only supporting sotw this seems ok to me?

kyessenov · 2020-09-24T22:30:21Z

I'm still a little confused about how this is supposed to work end-to-end. I thought TTL means that Envoy just "forgets" about some version of xDS it received, and needs to re-request. That implies the server needs nothing special to support TTL except pass it through. Server TTL can be orchestrated by just re-pushing a new, identical version.

snowp · 2020-09-24T22:38:52Z

It does forget about it and would re-request, but if we rely on this then Envoy will flap between knowing about the resource and not knowing about it. The server will need to push the resource again before the TTL expires to ensure that the resource isn't forgotten.

I guess one option would be to include a defer the deletion for some duration after the TTL expires? So that Envoy is able to request a new version of the resource before dropping the old one. I would increase the complexity in Envoy, but I'll raise it in the issue as an alternative solution and we can see what people think

kyessenov · 2020-09-24T23:39:30Z

I think it makes more sense for Envoy not to delete expired resources if they are in use. Relying on the control plane to pro-actively re-push is fragile, and a lot to ask from all implementations.

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-11-05T21:37:23Z

@jyotimahapatra @jessicayuen either of you wanna give this a look for the general approach? I've updated it to reflect the current implementation in Envoy, though some test coverage is missing (the v2/v3 handling makes v3 only features tricky to test, lmk what you think of the workaround I started using). Feel free to ping me if you need more context on TTLs

Signed-off-by: Snow Pettersen <snowp@lyft.com>

jyotimahapatra

I echo Kuat's observation that having this complexity as part of control plane will cause multiple duplicate implementations in all control planes out there. The adoption of the feature will be delayed depending on control planes supporting this feature.

pkg/ttl/v3/ttl.go

pkg/cache/v2/snapshot.go

pkg/ttl/v3/ttl.go

pkg/cache/v2/simple.go

pkg/cache/types/types.go

pkg/cache/v2/simple.go

snowp · 2020-11-14T10:24:01Z

For more details on TTLs and heartbeats the Envoy configuration has some information: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#ttl

This PR essentially implements two potentially distinct features: being able to set TTLs on resources (done by wrapping them in a Delta-style Resource wrapper) and a mechanism for sending lightweight sotw updates that only serve to refresh the TTL (aka heartbeats) that would be used when TTLs are used for rarely changing resources.

Talking to @jyotimahapatra he suggested moving the heartbeat logic into server (to avoid having to implement it in each cache implementation), so I took a look.

It can be pretty easily put in the server logic by having the server poll the cache for resources that requires heart beating, though this likely introduces more chances of heartbeats and real updates being processed out of order (since we lose out on the strict ordering of the response channel). A more specific cache lookup could probably be made (for the purposes of reducing lock contention on the cache) by tracking more state in the server (e.g. which types have TTL'd resources), but at some point we'd be duplicating state in the cache and server, so not sure if we want to go down that road.

Moving it to the server puts the onus of configuring a heartbeat interval on the server, which might make it trickier to support variable intervals for different snapshots/types/etc., but that might be okay for now.

At the very least I'll push up a separate branch with the server driven heartbeating and anyone interested can take a look.

snowp · 2020-11-16T11:55:12Z

Having spent some time trying to fit heartbeating into the SotW server.go code, I feel less confident that it will work without having to maintain a substantial amount of per resource state in the server. The key problem arises when the server polls the cache to get the resources that should be heartbeated but is unable to verify that the resource has been sent to the client or not; this opens up for a race where the first response containing a resource might be heartbeat response. The cache is unable to determine whether a resource has been sent to the client, as the use of a buffered channel means that handing over the request to the channel != the request being sent.

I think that this + the fact that cache implementors might want to implement their own heartbeating strategies (e.g. elide heartbeat requests during frequent real updates) makes me lean towards keeping the responsibility of heart beating in the cache.

jessicayuen

Personally I think it's OK for this logic to live within the cache. If I'm reading the code correctly, It doesn't appear that the implementation is forced to implement this TTL logic unless the mesh operator wants to utilize TTL - is that correct?

jessicayuen · 2020-11-13T02:08:23Z

pkg/cache/v2/cache.go

@@ -113,7 +116,11 @@ func (r *RawResponse) GetDiscoveryResponse() (*discovery.DiscoveryResponse, erro
 		marshaledResources := make([]*any.Any, len(r.Resources))

 		for i, resource := range r.Resources {
-			marshaledResource, err := MarshalResource(resource)
+			maybeTtldResource, err := ttl.MaybeCreateTtlResource(resource, GetResourceName(resource.Resource), r.Heartbeat)


In v2, it seems that a ttl resource will never be created. The naming of this may be a bit misleading.

pkg/cache/v2/cache.go

jessicayuen · 2020-11-13T02:11:39Z

pkg/cache/v2/resource.go

@@ -82,13 +82,13 @@ func MarshalResource(resource types.Resource) (types.MarshaledResource, error) {

 // GetResourceReferences returns the names for dependent resources (EDS cluster
 // names for CDS, RDS routes names for LDS).
-func GetResourceReferences(resources map[string]types.Resource) map[string]bool {
+func GetResourceReferences(resources map[string]types.ResourceWithTtl) map[string]bool {


Is there something we can do here so we don't need to change the param type to a concrete implementation? Maybe by adding a getTTL method to the Resource interface?

Oh yeah I like that, should simplify things quite a bit

I spent some time trying to get this to work, the problem is that by adding a function here regular proto messages no longer satisfy the interface, which means that we end up having to wrap the resources in a struct regardless. I tried composing a proto.Message into ResourceWithTtl to make make them all just satisfy types.Resource, but it seems like the protobuf library does all kind of type conversions so just satisfying proto.Message isn't actually sufficient.

lmk if you have any ideas for how to get this working, otherwise the current approach seems like the simplest one (if verbose)

Could we change Resource to be a lightweight struct that embeds a proto? Might be a cleaner solution.

I gave this a go, the problem here is that it means that a regular proto no longer passes as a Resource (at least from what I could tell), which means that all callers that currently pass []Resource{proto1, proto2} would have to update it to []Resource{Resource{Message: proto1}, Resource{Message: proto2}. Not sure if it would be worth the churn to change? wdyt

I think you could do []Resource{{proto1}, {proto2}} which I can live with. What do others think? I think it makes sense to align Resource with xDS Resource proto which has TTL field. But I'm fine either way.

Omitting the field name works when there's only one field e.g. type Resource struct { proto.Message }. Once we introduce a second one for the TTL, they now need to be specified.

I would expose the a function to get the TTL but that means having to embed the TTL within the proto, which then forces the user to use the Resource proto wrapper (not to be confused with the Go type) themselves to specify the TTL. This would be an option: we could provide a helper function to make it easier to create the proto wrapper and rely on it for holding the TTL configuration. LMK what you all think (cc @jyotimahapatra @jessicayuen)

I'll note that this adds a bunch of complexity around anything that cares about the actual resource, like consistency checks, as we need to handle potentially unwrapping the wrapper type to see the real resource.

Ok after much more playing around it seems like embedding proto.Message breaks marshalling, since the proto library makes use of some cute interface casts to do marshalling, which ends up failing when embedding the interface. As such I'd propose leaving things as is, unless someone has any ideas for how to work around this

pkg/cache/v2/simple.go

jessicayuen · 2020-11-17T01:27:50Z

Also was there more conclusion discussed in Envoy surrounding the following?

I think it makes more sense for Envoy not to delete expired resources if they are in use.

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-11-18T15:02:07Z

Also was there more conclusion discussed in Envoy surrounding the following?

I think it makes more sense for Envoy not to delete expired resources if they are in use.

We've had support for expiring resources in use for EDS for a long time now (envoyproxy/envoy#6477), so TTL just formalizes this for all resource types. Envoy is already able to operate without specific resources (i.e. what happens if you hit the initial_fetch_timeout), so TTL expirations should simply revert to that state.

The whole purpose of xDS TTL is to ensure that certain resources that would do more harm than good are removed if the control plane is unavailable, which means that this logic has to live in Envoy and be controlled by control plane activity (hence the heartbeating). This is not meant as a general purpose TTL that applies to all resources, but just to resources that you don't want to go stale. ECDS, RTDS and EDS are all resource types where I've seen use cases where TTLs make sense.

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-11-18T19:03:09Z

@jessicayuen @jyotimahapatra I think I've responded/addressed all the comments now, ptal

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-11-23T14:49:10Z

friendly ping for reviews :)

jyotimahapatra · 2020-11-24T18:39:06Z

The changes lgtm.
2 final qs:

Is it possible to write an integration test. Can do in a subsequent PR
Is it possible to write a unit test showing that heartbeats send hollow responses ...may be in cache_test.go?

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-11-24T21:54:15Z

@jyotimahapatra I think I was able to address both your concerns with the integration test that verifies that the heartbeat response is a hollow response, lmk if this is good enough.

@kyessenov do you wanna take a look as well?

kyessenov

Seems reasonable but it's hard to reason about many go routines. Can we simplify it maybe by 1) turn Resource into struct, which hopefully reduce code changes and make it easier, 2) have a global heartbeat, which could keep track of how long ago resources were pushed?

kyessenov · 2020-12-02T19:49:00Z

pkg/ttl/v3/ttl.go

+			if err != nil {
+				return nil, "", err
+			}
+			any.TypeUrl = resourceTypeUrl


Is this line necessary? Doesn't marshal any already take care of it?

This preserves the behavior from the existing code where we always set the type URL to what the request asked for, I think this is used to allow marshalling V2 resources in response to a V3 request. I can add a comment to make this clearer

kyessenov · 2020-12-02T19:57:24Z

pkg/cache/v2/resource.go

@@ -82,13 +82,13 @@ func MarshalResource(resource types.Resource) (types.MarshaledResource, error) {

 // GetResourceReferences returns the names for dependent resources (EDS cluster
 // names for CDS, RDS routes names for LDS).
-func GetResourceReferences(resources map[string]types.Resource) map[string]bool {
+func GetResourceReferences(resources map[string]types.ResourceWithTtl) map[string]bool {


Could we change Resource to be a lightweight struct that embeds a proto? Might be a cleaner solution.

kyessenov · 2020-12-02T19:58:29Z

pkg/ttl/v2/ttl.go

+}
+
+func IsTTLResource(resource *any.Any) bool {
+	return true


Why is this true?

Thinking about it some more this should probably be false, it just so happens to only be used to test things we expect to be TTL responses, so this made the tests pass. Depending on the resolution of my other comment I'll update this

If it's used only for testing, then let's just put that as comment and not worry about its efficiency.

kyessenov · 2020-12-02T19:59:19Z

pkg/ttl/v3/ttl.go

+	return resource.Resource, resourceTypeUrl, nil
+}
+
+func IsTTLResource(resource *any.Any) bool {


This is a pretty expensive check since proto unmarshal is quite slow in golang. Where is it being used?

It's currently only used during tests to assert that a response emitted by the heartbeat routine is a TTL resource. I put it here since the V3 tests being generated from the V2 files makes per version tests a bit tricky, open to suggestions if you have a better idea

One option would be to only test this in the V3-only integration test and remove the tests from the versioned cache tests, wdyt?

kyessenov · 2020-12-02T20:17:55Z

pkg/cache/v2/simple.go

@@ -81,6 +85,9 @@ type snapshotCache struct {
 	// hash is the hashing function for Envoy nodes
 	hash NodeHash

+	// heartbeatRoutines keeps track of the active heartbeat goroutines
+	heartbeatRoutines map[string]heartbeatHandle


Why do we need so many go routines? Could we just have a single ticker that traverses the entire cache and re-pushes resources that aged?

It just felt the easiest at the time, I'll give the single goroutine a go

Is this field needed now? I don't see it being updated.

kyessenov · 2020-12-02T20:19:34Z

pkg/cache/v2/simple.go

+
+			wg.Add(1)
+			// Spawn a goroutine here so that we can give up the lock before issuing responses. This prevents lock contention if the
+			// response channels aren't ready to receive the response.


There's usually a spot in the response channel for every request. Can you explain why it could be not ready?

I think this is a holdover from before when the channel was blocking (I think they used to be blocking?), and tests weren't set up to consume from the channel. I removed the go routine and tests pass, so I don't think this is actually necessary

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-12-07T17:49:57Z

@kyessenov ptal, let me know if I've addressed everything

snowp · 2020-12-09T00:26:17Z

ping @kyessenov

Signed-off-by: Snow Pettersen <snowp@lyft.com>

kyessenov · 2020-12-11T03:57:35Z

pkg/integration/ttl_integration_test.go

@@ -0,0 +1,139 @@
+package integration


I think it would be nice to have an end-to-end example integration test with actual envoy. TTLs are very confusing, so a worked out example with envoy plus a control plane config would help understand it. This doesn't need to happen in this PR since envoy is pegged to some old version in CI, but a follow up would be helpful.

kyessenov

Can you also update CHANGELOG? This is an API change so it'll require work from consumers.

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp · 2020-12-16T21:40:05Z

@kyessenov Should be updated now, will follow up with an integration test as I get around to it

kyessenov

Thanks, and sorry it took so long. I'm fine with the change, so feel free to merge and iterate.

snowp added 2 commits September 24, 2020 10:23

cache: add support for xDS TTLs

d60ddf1

Adds support for xDS TTLs for SotW by introducing a goroutine per TTL'd resource group which pushes the current config to open watches every 1/3 TTL time. Signed-off-by: Snow Pettersen <snowp@lyft.com>

fix test names

6a04f23

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp mentioned this pull request Sep 24, 2020

xds: add support for TTLs envoyproxy/envoy#13201

Merged

snowp changed the title ~~cache: add support for xDS TTLs~~ DNR cache: add support for xDS TTLs Sep 30, 2020

snowp added 5 commits October 31, 2020 13:21

Merge branch 'master' of github.com:envoyproxy/go-control-plane into ttl

46e1521

Signed-off-by: Snow Pettersen <snowp@lyft.com>

get cache tests passing for new impl

4950ef7

Signed-off-by: Snow Pettersen <snowp@lyft.com>

format

7c7bd4b

Signed-off-by: Snow Pettersen <snowp@lyft.com>

update v3

8a3df30

Signed-off-by: Snow Pettersen <snowp@lyft.com>

add missing files

46f7685

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp added 3 commits November 9, 2020 10:23

add test for heartbeat marshalling

85188e4

Signed-off-by: Snow Pettersen <snowp@lyft.com>

fix race

d28f5f3

Signed-off-by: Snow Pettersen <snowp@lyft.com>

cleanup

dc7554a

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp changed the title ~~DNR cache: add support for xDS TTLs~~ cache: add support for xDS TTLs Nov 10, 2020

snowp assigned jyotimahapatra Nov 10, 2020

jyotimahapatra reviewed Nov 12, 2020

View reviewed changes

jessicayuen reviewed Nov 17, 2020

View reviewed changes

close heartbeat routine during cancel

506f272

Signed-off-by: Snow Pettersen <snowp@lyft.com>

comment heartbeat

bd50781

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp added 2 commits November 18, 2020 16:10

Merge branch 'master' of github.com:envoyproxy/go-control-plane into ttl

ef8ef83

Signed-off-by: Snow Pettersen <snowp@lyft.com>

add comment

7ca85c3

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp added 2 commits November 18, 2020 16:14

some more comments and renames

62b78a4

Signed-off-by: Snow Pettersen <snowp@lyft.com>

v3 clone

83e352f

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp added 2 commits November 24, 2020 16:48

add integration test

8aa5fc4

Signed-off-by: Snow Pettersen <snowp@lyft.com>

format

b9fe8c9

Signed-off-by: Snow Pettersen <snowp@lyft.com>

kyessenov reviewed Dec 2, 2020

View reviewed changes

snowp added 4 commits December 2, 2020 17:44

remove redundant go routine

703d458

Signed-off-by: Snow Pettersen <snowp@lyft.com>

use one goroutine per cache

e428918

Signed-off-by: Snow Pettersen <snowp@lyft.com>

add comments

8c823a7

Signed-off-by: Snow Pettersen <snowp@lyft.com>

Merge remote-tracking branch 'origin/master' into ttl

c3585ab

Signed-off-by: Snow Pettersen <snowp@lyft.com>

Merge branch 'master' of github.com:envoyproxy/go-control-plane into ttl

d899fe9

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp assigned kyessenov Dec 10, 2020

kyessenov reviewed Dec 11, 2020

View reviewed changes

snowp added 3 commits December 11, 2020 16:04

remove old heartbeat routines code

0bc2e6e

Signed-off-by: Snow Pettersen <snowp@lyft.com>

v3

3d7535d

Signed-off-by: Snow Pettersen <snowp@lyft.com>

change log entry

eb44d1a

Signed-off-by: Snow Pettersen <snowp@lyft.com>

snowp requested a review from kyessenov December 14, 2020 21:01

kyessenov approved these changes Dec 17, 2020

View reviewed changes

snowp merged commit 7fe139b into envoyproxy:master Dec 17, 2020

maksim-paskal mentioned this pull request Oct 25, 2021

add xDS TTL to endpoints maksim-paskal/envoy-control-plane#53

Merged

cache: add support for xDS TTLs #359

cache: add support for xDS TTLs #359

Conversation

snowp commented Sep 24, 2020

snowp commented Sep 24, 2020

kyessenov commented Sep 24, 2020

snowp commented Sep 24, 2020

kyessenov commented Sep 24, 2020

snowp commented Nov 5, 2020

jyotimahapatra left a comment

Choose a reason for hiding this comment

snowp commented Nov 14, 2020

snowp commented Nov 16, 2020

jessicayuen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jessicayuen commented Nov 17, 2020

snowp commented Nov 18, 2020

snowp commented Nov 18, 2020

snowp commented Nov 23, 2020

jyotimahapatra commented Nov 24, 2020

snowp commented Nov 24, 2020

kyessenov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

snowp commented Dec 7, 2020

snowp commented Dec 9, 2020

Choose a reason for hiding this comment

kyessenov left a comment

Choose a reason for hiding this comment

snowp commented Dec 16, 2020

kyessenov left a comment

Choose a reason for hiding this comment