Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: add support for xDS TTLs #359

Merged
merged 26 commits into from
Dec 17, 2020
Merged

cache: add support for xDS TTLs #359

merged 26 commits into from
Dec 17, 2020

Conversation

snowp
Copy link
Contributor

@snowp snowp commented Sep 24, 2020

Adds support for xDS TTLs for SotW by introducing a goroutine per TTL'd
resource group which pushes the current config to open watches every 1/3
TTL time.

Fixes #358

Signed-off-by: Snow Pettersen snowp@lyft.com

Adds support for xDS TTLs for SotW by introducing a goroutine per TTL'd
resource group which pushes the current config to open watches every 1/3
TTL time.

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Sep 24, 2020

Note: this is not complete due to the proto changes not having landed upstream Envoy, but this implements all the necessary bits for the cache to trigger watches based on the configured TTL.

Mostly looking for high level feedback on this approach at this point, especially around whether we need to do something smarter around reducing the number of goroutines required for this feature. This approach is likely not feasible for large scale delta deployments, but since we're currently only supporting sotw this seems ok to me?

@kyessenov
Copy link
Contributor

I'm still a little confused about how this is supposed to work end-to-end. I thought TTL means that Envoy just "forgets" about some version of xDS it received, and needs to re-request. That implies the server needs nothing special to support TTL except pass it through. Server TTL can be orchestrated by just re-pushing a new, identical version.

@snowp
Copy link
Contributor Author

snowp commented Sep 24, 2020

It does forget about it and would re-request, but if we rely on this then Envoy will flap between knowing about the resource and not knowing about it. The server will need to push the resource again before the TTL expires to ensure that the resource isn't forgotten.

I guess one option would be to include a defer the deletion for some duration after the TTL expires? So that Envoy is able to request a new version of the resource before dropping the old one. I would increase the complexity in Envoy, but I'll raise it in the issue as an alternative solution and we can see what people think

@kyessenov
Copy link
Contributor

I think it makes more sense for Envoy not to delete expired resources if they are in use. Relying on the control plane to pro-actively re-push is fragile, and a lot to ask from all implementations.

@snowp snowp changed the title cache: add support for xDS TTLs DNR cache: add support for xDS TTLs Sep 30, 2020
snowp added 5 commits October 31, 2020 13:21
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Nov 5, 2020

@jyotimahapatra @jessicayuen either of you wanna give this a look for the general approach? I've updated it to reflect the current implementation in Envoy, though some test coverage is missing (the v2/v3 handling makes v3 only features tricky to test, lmk what you think of the workaround I started using). Feel free to ping me if you need more context on TTLs

snowp added 3 commits November 9, 2020 10:23
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp snowp changed the title DNR cache: add support for xDS TTLs cache: add support for xDS TTLs Nov 10, 2020
Copy link
Contributor

@jyotimahapatra jyotimahapatra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I echo Kuat's observation that having this complexity as part of control plane will cause multiple duplicate implementations in all control planes out there. The adoption of the feature will be delayed depending on control planes supporting this feature.

pkg/ttl/v3/ttl.go Show resolved Hide resolved
pkg/cache/v2/snapshot.go Show resolved Hide resolved
pkg/ttl/v3/ttl.go Show resolved Hide resolved
pkg/cache/v2/simple.go Outdated Show resolved Hide resolved
pkg/cache/types/types.go Show resolved Hide resolved
pkg/cache/v2/simple.go Outdated Show resolved Hide resolved
@snowp
Copy link
Contributor Author

snowp commented Nov 14, 2020

For more details on TTLs and heartbeats the Envoy configuration has some information: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#ttl

This PR essentially implements two potentially distinct features: being able to set TTLs on resources (done by wrapping them in a Delta-style Resource wrapper) and a mechanism for sending lightweight sotw updates that only serve to refresh the TTL (aka heartbeats) that would be used when TTLs are used for rarely changing resources.

Talking to @jyotimahapatra he suggested moving the heartbeat logic into server (to avoid having to implement it in each cache implementation), so I took a look.

It can be pretty easily put in the server logic by having the server poll the cache for resources that requires heart beating, though this likely introduces more chances of heartbeats and real updates being processed out of order (since we lose out on the strict ordering of the response channel). A more specific cache lookup could probably be made (for the purposes of reducing lock contention on the cache) by tracking more state in the server (e.g. which types have TTL'd resources), but at some point we'd be duplicating state in the cache and server, so not sure if we want to go down that road.

Moving it to the server puts the onus of configuring a heartbeat interval on the server, which might make it trickier to support variable intervals for different snapshots/types/etc., but that might be okay for now.

At the very least I'll push up a separate branch with the server driven heartbeating and anyone interested can take a look.

@snowp
Copy link
Contributor Author

snowp commented Nov 16, 2020

Having spent some time trying to fit heartbeating into the SotW server.go code, I feel less confident that it will work without having to maintain a substantial amount of per resource state in the server. The key problem arises when the server polls the cache to get the resources that should be heartbeated but is unable to verify that the resource has been sent to the client or not; this opens up for a race where the first response containing a resource might be heartbeat response. The cache is unable to determine whether a resource has been sent to the client, as the use of a buffered channel means that handing over the request to the channel != the request being sent.

I think that this + the fact that cache implementors might want to implement their own heartbeating strategies (e.g. elide heartbeat requests during frequent real updates) makes me lean towards keeping the responsibility of heart beating in the cache.

Copy link
Member

@jessicayuen jessicayuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think it's OK for this logic to live within the cache. If I'm reading the code correctly, It doesn't appear that the implementation is forced to implement this TTL logic unless the mesh operator wants to utilize TTL - is that correct?

@@ -113,7 +116,11 @@ func (r *RawResponse) GetDiscoveryResponse() (*discovery.DiscoveryResponse, erro
marshaledResources := make([]*any.Any, len(r.Resources))

for i, resource := range r.Resources {
marshaledResource, err := MarshalResource(resource)
maybeTtldResource, err := ttl.MaybeCreateTtlResource(resource, GetResourceName(resource.Resource), r.Heartbeat)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In v2, it seems that a ttl resource will never be created. The naming of this may be a bit misleading.

pkg/cache/v2/cache.go Show resolved Hide resolved
@@ -82,13 +82,13 @@ func MarshalResource(resource types.Resource) (types.MarshaledResource, error) {

// GetResourceReferences returns the names for dependent resources (EDS cluster
// names for CDS, RDS routes names for LDS).
func GetResourceReferences(resources map[string]types.Resource) map[string]bool {
func GetResourceReferences(resources map[string]types.ResourceWithTtl) map[string]bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something we can do here so we don't need to change the param type to a concrete implementation? Maybe by adding a getTTL method to the Resource interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah I like that, should simplify things quite a bit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time trying to get this to work, the problem is that by adding a function here regular proto messages no longer satisfy the interface, which means that we end up having to wrap the resources in a struct regardless. I tried composing a proto.Message into ResourceWithTtl to make make them all just satisfy types.Resource, but it seems like the protobuf library does all kind of type conversions so just satisfying proto.Message isn't actually sufficient.

lmk if you have any ideas for how to get this working, otherwise the current approach seems like the simplest one (if verbose)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change Resource to be a lightweight struct that embeds a proto? Might be a cleaner solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a go, the problem here is that it means that a regular proto no longer passes as a Resource (at least from what I could tell), which means that all callers that currently pass []Resource{proto1, proto2} would have to update it to []Resource{Resource{Message: proto1}, Resource{Message: proto2}. Not sure if it would be worth the churn to change? wdyt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could do []Resource{{proto1}, {proto2}} which I can live with. What do others think? I think it makes sense to align Resource with xDS Resource proto which has TTL field. But I'm fine either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omitting the field name works when there's only one field e.g. type Resource struct { proto.Message }. Once we introduce a second one for the TTL, they now need to be specified.

I would expose the a function to get the TTL but that means having to embed the TTL within the proto, which then forces the user to use the Resource proto wrapper (not to be confused with the Go type) themselves to specify the TTL. This would be an option: we could provide a helper function to make it easier to create the proto wrapper and rely on it for holding the TTL configuration. LMK what you all think (cc @jyotimahapatra @jessicayuen)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll note that this adds a bunch of complexity around anything that cares about the actual resource, like consistency checks, as we need to handle potentially unwrapping the wrapper type to see the real resource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok after much more playing around it seems like embedding proto.Message breaks marshalling, since the proto library makes use of some cute interface casts to do marshalling, which ends up failing when embedding the interface. As such I'd propose leaving things as is, unless someone has any ideas for how to work around this

pkg/cache/v2/simple.go Show resolved Hide resolved
@jessicayuen
Copy link
Member

Also was there more conclusion discussed in Envoy surrounding the following?

I think it makes more sense for Envoy not to delete expired resources if they are in use.

Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Nov 18, 2020

Also was there more conclusion discussed in Envoy surrounding the following?

I think it makes more sense for Envoy not to delete expired resources if they are in use.

We've had support for expiring resources in use for EDS for a long time now (envoyproxy/envoy#6477), so TTL just formalizes this for all resource types. Envoy is already able to operate without specific resources (i.e. what happens if you hit the initial_fetch_timeout), so TTL expirations should simply revert to that state.

The whole purpose of xDS TTL is to ensure that certain resources that would do more harm than good are removed if the control plane is unavailable, which means that this logic has to live in Envoy and be controlled by control plane activity (hence the heartbeating). This is not meant as a general purpose TTL that applies to all resources, but just to resources that you don't want to go stale. ECDS, RTDS and EDS are all resource types where I've seen use cases where TTLs make sense.

Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Nov 18, 2020

@jessicayuen @jyotimahapatra I think I've responded/addressed all the comments now, ptal

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Nov 23, 2020

friendly ping for reviews :)

@jyotimahapatra
Copy link
Contributor

The changes lgtm.
2 final qs:

  1. Is it possible to write an integration test. Can do in a subsequent PR
  2. Is it possible to write a unit test showing that heartbeats send hollow responses ...may be in cache_test.go?

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Nov 24, 2020

@jyotimahapatra I think I was able to address both your concerns with the integration test that verifies that the heartbeat response is a hollow response, lmk if this is good enough.

@kyessenov do you wanna take a look as well?

Copy link
Contributor

@kyessenov kyessenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable but it's hard to reason about many go routines. Can we simplify it maybe by 1) turn Resource into struct, which hopefully reduce code changes and make it easier, 2) have a global heartbeat, which could keep track of how long ago resources were pushed?

if err != nil {
return nil, "", err
}
any.TypeUrl = resourceTypeUrl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line necessary? Doesn't marshal any already take care of it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This preserves the behavior from the existing code where we always set the type URL to what the request asked for, I think this is used to allow marshalling V2 resources in response to a V3 request. I can add a comment to make this clearer

@@ -82,13 +82,13 @@ func MarshalResource(resource types.Resource) (types.MarshaledResource, error) {

// GetResourceReferences returns the names for dependent resources (EDS cluster
// names for CDS, RDS routes names for LDS).
func GetResourceReferences(resources map[string]types.Resource) map[string]bool {
func GetResourceReferences(resources map[string]types.ResourceWithTtl) map[string]bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change Resource to be a lightweight struct that embeds a proto? Might be a cleaner solution.

}

func IsTTLResource(resource *any.Any) bool {
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it some more this should probably be false, it just so happens to only be used to test things we expect to be TTL responses, so this made the tests pass. Depending on the resolution of my other comment I'll update this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's used only for testing, then let's just put that as comment and not worry about its efficiency.

return resource.Resource, resourceTypeUrl, nil
}

func IsTTLResource(resource *any.Any) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty expensive check since proto unmarshal is quite slow in golang. Where is it being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's currently only used during tests to assert that a response emitted by the heartbeat routine is a TTL resource. I put it here since the V3 tests being generated from the V2 files makes per version tests a bit tricky, open to suggestions if you have a better idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option would be to only test this in the V3-only integration test and remove the tests from the versioned cache tests, wdyt?

@@ -81,6 +85,9 @@ type snapshotCache struct {
// hash is the hashing function for Envoy nodes
hash NodeHash

// heartbeatRoutines keeps track of the active heartbeat goroutines
heartbeatRoutines map[string]heartbeatHandle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need so many go routines? Could we just have a single ticker that traverses the entire cache and re-pushes resources that aged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just felt the easiest at the time, I'll give the single goroutine a go

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this field needed now? I don't see it being updated.


wg.Add(1)
// Spawn a goroutine here so that we can give up the lock before issuing responses. This prevents lock contention if the
// response channels aren't ready to receive the response.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's usually a spot in the response channel for every request. Can you explain why it could be not ready?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a holdover from before when the channel was blocking (I think they used to be blocking?), and tests weren't set up to consume from the channel. I removed the go routine and tests pass, so I don't think this is actually necessary

snowp added 4 commits December 2, 2020 17:44
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Dec 7, 2020

@kyessenov ptal, let me know if I've addressed everything

@snowp
Copy link
Contributor Author

snowp commented Dec 9, 2020

ping @kyessenov

@@ -0,0 +1,139 @@
package integration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to have an end-to-end example integration test with actual envoy. TTLs are very confusing, so a worked out example with envoy plus a control plane config would help understand it. This doesn't need to happen in this PR since envoy is pegged to some old version in CI, but a follow up would be helpful.

Copy link
Contributor

@kyessenov kyessenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update CHANGELOG? This is an API change so it'll require work from consumers.

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp snowp requested a review from kyessenov December 14, 2020 21:01
@snowp
Copy link
Contributor Author

snowp commented Dec 16, 2020

@kyessenov Should be updated now, will follow up with an integration test as I get around to it

Copy link
Contributor

@kyessenov kyessenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, and sorry it took so long. I'm fine with the change, so feel free to merge and iterate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support xDS TTLs
4 participants