-
Notifications
You must be signed in to change notification settings - Fork 517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache: add support for xDS TTLs #359
Conversation
Adds support for xDS TTLs for SotW by introducing a goroutine per TTL'd resource group which pushes the current config to open watches every 1/3 TTL time. Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Note: this is not complete due to the proto changes not having landed upstream Envoy, but this implements all the necessary bits for the cache to trigger watches based on the configured TTL. Mostly looking for high level feedback on this approach at this point, especially around whether we need to do something smarter around reducing the number of goroutines required for this feature. This approach is likely not feasible for large scale delta deployments, but since we're currently only supporting sotw this seems ok to me? |
I'm still a little confused about how this is supposed to work end-to-end. I thought TTL means that Envoy just "forgets" about some version of xDS it received, and needs to re-request. That implies the server needs nothing special to support TTL except pass it through. Server TTL can be orchestrated by just re-pushing a new, identical version. |
It does forget about it and would re-request, but if we rely on this then Envoy will flap between knowing about the resource and not knowing about it. The server will need to push the resource again before the TTL expires to ensure that the resource isn't forgotten. I guess one option would be to include a defer the deletion for some duration after the TTL expires? So that Envoy is able to request a new version of the resource before dropping the old one. I would increase the complexity in Envoy, but I'll raise it in the issue as an alternative solution and we can see what people think |
I think it makes more sense for Envoy not to delete expired resources if they are in use. Relying on the control plane to pro-actively re-push is fragile, and a lot to ask from all implementations. |
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@jyotimahapatra @jessicayuen either of you wanna give this a look for the general approach? I've updated it to reflect the current implementation in Envoy, though some test coverage is missing (the v2/v3 handling makes v3 only features tricky to test, lmk what you think of the workaround I started using). Feel free to ping me if you need more context on TTLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I echo Kuat's observation that having this complexity as part of control plane will cause multiple duplicate implementations in all control planes out there. The adoption of the feature will be delayed depending on control planes supporting this feature.
For more details on TTLs and heartbeats the Envoy configuration has some information: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#ttl This PR essentially implements two potentially distinct features: being able to set TTLs on resources (done by wrapping them in a Delta-style Resource wrapper) and a mechanism for sending lightweight sotw updates that only serve to refresh the TTL (aka heartbeats) that would be used when TTLs are used for rarely changing resources. Talking to @jyotimahapatra he suggested moving the heartbeat logic into server (to avoid having to implement it in each cache implementation), so I took a look. It can be pretty easily put in the server logic by having the server poll the cache for resources that requires heart beating, though this likely introduces more chances of heartbeats and real updates being processed out of order (since we lose out on the strict ordering of the response channel). A more specific cache lookup could probably be made (for the purposes of reducing lock contention on the cache) by tracking more state in the server (e.g. which types have TTL'd resources), but at some point we'd be duplicating state in the cache and server, so not sure if we want to go down that road. Moving it to the server puts the onus of configuring a heartbeat interval on the server, which might make it trickier to support variable intervals for different snapshots/types/etc., but that might be okay for now. At the very least I'll push up a separate branch with the server driven heartbeating and anyone interested can take a look. |
Having spent some time trying to fit heartbeating into the SotW server.go code, I feel less confident that it will work without having to maintain a substantial amount of per resource state in the server. The key problem arises when the server polls the cache to get the resources that should be heartbeated but is unable to verify that the resource has been sent to the client or not; this opens up for a race where the first response containing a resource might be heartbeat response. The cache is unable to determine whether a resource has been sent to the client, as the use of a buffered channel means that handing over the request to the channel != the request being sent. I think that this + the fact that cache implementors might want to implement their own heartbeating strategies (e.g. elide heartbeat requests during frequent real updates) makes me lean towards keeping the responsibility of heart beating in the cache. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think it's OK for this logic to live within the cache. If I'm reading the code correctly, It doesn't appear that the implementation is forced to implement this TTL logic unless the mesh operator wants to utilize TTL - is that correct?
pkg/cache/v2/cache.go
Outdated
@@ -113,7 +116,11 @@ func (r *RawResponse) GetDiscoveryResponse() (*discovery.DiscoveryResponse, erro | |||
marshaledResources := make([]*any.Any, len(r.Resources)) | |||
|
|||
for i, resource := range r.Resources { | |||
marshaledResource, err := MarshalResource(resource) | |||
maybeTtldResource, err := ttl.MaybeCreateTtlResource(resource, GetResourceName(resource.Resource), r.Heartbeat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In v2, it seems that a ttl resource will never be created. The naming of this may be a bit misleading.
@@ -82,13 +82,13 @@ func MarshalResource(resource types.Resource) (types.MarshaledResource, error) { | |||
|
|||
// GetResourceReferences returns the names for dependent resources (EDS cluster | |||
// names for CDS, RDS routes names for LDS). | |||
func GetResourceReferences(resources map[string]types.Resource) map[string]bool { | |||
func GetResourceReferences(resources map[string]types.ResourceWithTtl) map[string]bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something we can do here so we don't need to change the param type to a concrete implementation? Maybe by adding a getTTL
method to the Resource
interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah I like that, should simplify things quite a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time trying to get this to work, the problem is that by adding a function here regular proto messages no longer satisfy the interface, which means that we end up having to wrap the resources in a struct regardless. I tried composing a proto.Message
into ResourceWithTtl
to make make them all just satisfy types.Resource
, but it seems like the protobuf library does all kind of type conversions so just satisfying proto.Message
isn't actually sufficient.
lmk if you have any ideas for how to get this working, otherwise the current approach seems like the simplest one (if verbose)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change Resource to be a lightweight struct that embeds a proto? Might be a cleaner solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a go, the problem here is that it means that a regular proto no longer passes as a Resource
(at least from what I could tell), which means that all callers that currently pass []Resource{proto1, proto2}
would have to update it to []Resource{Resource{Message: proto1}, Resource{Message: proto2}
. Not sure if it would be worth the churn to change? wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could do []Resource{{proto1}, {proto2}}
which I can live with. What do others think? I think it makes sense to align Resource with xDS Resource proto which has TTL field. But I'm fine either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omitting the field name works when there's only one field e.g. type Resource struct { proto.Message }
. Once we introduce a second one for the TTL, they now need to be specified.
I would expose the a function to get the TTL but that means having to embed the TTL within the proto, which then forces the user to use the Resource
proto wrapper (not to be confused with the Go type) themselves to specify the TTL. This would be an option: we could provide a helper function to make it easier to create the proto wrapper and rely on it for holding the TTL configuration. LMK what you all think (cc @jyotimahapatra @jessicayuen)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll note that this adds a bunch of complexity around anything that cares about the actual resource, like consistency checks, as we need to handle potentially unwrapping the wrapper type to see the real resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok after much more playing around it seems like embedding proto.Message
breaks marshalling, since the proto library makes use of some cute interface casts to do marshalling, which ends up failing when embedding the interface. As such I'd propose leaving things as is, unless someone has any ideas for how to work around this
Also was there more conclusion discussed in Envoy surrounding the following?
|
Signed-off-by: Snow Pettersen <snowp@lyft.com>
We've had support for expiring resources in use for EDS for a long time now (envoyproxy/envoy#6477), so TTL just formalizes this for all resource types. Envoy is already able to operate without specific resources (i.e. what happens if you hit the initial_fetch_timeout), so TTL expirations should simply revert to that state. The whole purpose of xDS TTL is to ensure that certain resources that would do more harm than good are removed if the control plane is unavailable, which means that this logic has to live in Envoy and be controlled by control plane activity (hence the heartbeating). This is not meant as a general purpose TTL that applies to all resources, but just to resources that you don't want to go stale. ECDS, RTDS and EDS are all resource types where I've seen use cases where TTLs make sense. |
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@jessicayuen @jyotimahapatra I think I've responded/addressed all the comments now, ptal |
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
friendly ping for reviews :) |
The changes lgtm.
|
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@jyotimahapatra I think I was able to address both your concerns with the integration test that verifies that the heartbeat response is a hollow response, lmk if this is good enough. @kyessenov do you wanna take a look as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable but it's hard to reason about many go routines. Can we simplify it maybe by 1) turn Resource into struct, which hopefully reduce code changes and make it easier, 2) have a global heartbeat, which could keep track of how long ago resources were pushed?
if err != nil { | ||
return nil, "", err | ||
} | ||
any.TypeUrl = resourceTypeUrl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line necessary? Doesn't marshal any already take care of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This preserves the behavior from the existing code where we always set the type URL to what the request asked for, I think this is used to allow marshalling V2 resources in response to a V3 request. I can add a comment to make this clearer
@@ -82,13 +82,13 @@ func MarshalResource(resource types.Resource) (types.MarshaledResource, error) { | |||
|
|||
// GetResourceReferences returns the names for dependent resources (EDS cluster | |||
// names for CDS, RDS routes names for LDS). | |||
func GetResourceReferences(resources map[string]types.Resource) map[string]bool { | |||
func GetResourceReferences(resources map[string]types.ResourceWithTtl) map[string]bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change Resource to be a lightweight struct that embeds a proto? Might be a cleaner solution.
} | ||
|
||
func IsTTLResource(resource *any.Any) bool { | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it some more this should probably be false, it just so happens to only be used to test things we expect to be TTL responses, so this made the tests pass. Depending on the resolution of my other comment I'll update this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's used only for testing, then let's just put that as comment and not worry about its efficiency.
return resource.Resource, resourceTypeUrl, nil | ||
} | ||
|
||
func IsTTLResource(resource *any.Any) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty expensive check since proto unmarshal is quite slow in golang. Where is it being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's currently only used during tests to assert that a response emitted by the heartbeat routine is a TTL resource. I put it here since the V3 tests being generated from the V2 files makes per version tests a bit tricky, open to suggestions if you have a better idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option would be to only test this in the V3-only integration test and remove the tests from the versioned cache tests, wdyt?
pkg/cache/v2/simple.go
Outdated
@@ -81,6 +85,9 @@ type snapshotCache struct { | |||
// hash is the hashing function for Envoy nodes | |||
hash NodeHash | |||
|
|||
// heartbeatRoutines keeps track of the active heartbeat goroutines | |||
heartbeatRoutines map[string]heartbeatHandle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need so many go routines? Could we just have a single ticker that traverses the entire cache and re-pushes resources that aged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just felt the easiest at the time, I'll give the single goroutine a go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this field needed now? I don't see it being updated.
pkg/cache/v2/simple.go
Outdated
|
||
wg.Add(1) | ||
// Spawn a goroutine here so that we can give up the lock before issuing responses. This prevents lock contention if the | ||
// response channels aren't ready to receive the response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's usually a spot in the response channel for every request. Can you explain why it could be not ready?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a holdover from before when the channel was blocking (I think they used to be blocking?), and tests weren't set up to consume from the channel. I removed the go routine and tests pass, so I don't think this is actually necessary
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@kyessenov ptal, let me know if I've addressed everything |
ping @kyessenov |
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@@ -0,0 +1,139 @@ | |||
package integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to have an end-to-end example integration test with actual envoy. TTLs are very confusing, so a worked out example with envoy plus a control plane config would help understand it. This doesn't need to happen in this PR since envoy is pegged to some old version in CI, but a follow up would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also update CHANGELOG? This is an API change so it'll require work from consumers.
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@kyessenov Should be updated now, will follow up with an integration test as I get around to it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, and sorry it took so long. I'm fine with the change, so feel free to merge and iterate.
Adds support for xDS TTLs for SotW by introducing a goroutine per TTL'd
resource group which pushes the current config to open watches every 1/3
TTL time.
Fixes #358
Signed-off-by: Snow Pettersen snowp@lyft.com