Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix networking resources deletion of child resources during adoption #3136

Merged
merged 2 commits into from
Jul 13, 2023

Conversation

matthchr
Copy link
Member

Networking resources have a structure where a PUT on the parent also can delete child resources (or add them). ASO attempts to allow users to manage these networking child resources independent of the owner. In order to do that, when building the PUT payload for the parent resource we used the ownerReferences.uid field to look up the child resources owned by the parent resource we were about to send and include them all in the PUT payload of the owner so that they are not deleted.

The issue is that with adoption, parent + child already exist in Azure, but ownerReferences.uid isn't set yet in Kubernetes, which means the PUT of the parent doesn't include any reference to its children and so PUT parent ends up deleting the children.

This issue impacts the following resources:

  • VirtualNetwork
  • LoadBalancer
  • RouteTable
  • NetworkSecurityGroup

We investigated using .spec.owner instead of ownerReferences, which works, but inconsistently as there is still a race in Kubernetes client caching when the parent is created during adoption. Sometimes it realizes that it has children and sometimes it doesn't.

Instead, we rely on Azure itself, issuing a GET parent while building the PUT payload. We use the exact contents of the GET response as the PUT payload for the child resource field (subnets for VNET, rules for LoadBalancer, etc). This ensures that we don't ever delete any child resource when PUT-ing the parent, even if we're adopting that parent for the first time.

Note that technically GET + PUT is subject to a dual-writer race, which we could use ETag and If-Match headers to prevent. We didn't implement that here because:

  1. Some of the networking APIs support this paradigm but it's not universally available.
  2. It would require moderate changes to the reconciler to pipe the etag through.
  3. ASO expects to own the resource entirely, so if there is another parallel writer their changes will inevitably lose to ASO as it will keep updating the Azure resource to its desired state.

Given the above it didn't make sense to bother with optimistic concurrency.

Closes #3077

If applicable:

  • this PR contains documentation
  • this PR contains tests

if err != nil {
return errors.Wrap(err, "failed during comparison for embeddedInboundNatRuleJSON and inboundNatRuleJSON")
}
// TODO: Can't do a trivial fuzzyEqualityComparison here because we don't know which fields are readonly
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my least favorite part of this change. I think it's pretty safe though because before we were using the child resource shape and the parent resource shape and the assertion here was effectively "are they actually the same".

Now we're using the parent resource shape (from GET) and putting that into the parent resource shape (for PUT). Since it's the same resource w/ the same API version, the properties are guaranteed to line-up structurally. Properties that exist on GET but not on PUT are read-only, so it's OK for them to be omitted.

Copy link
Member

@theunrepentantgeek theunrepentantgeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. A few comments, including a possibility to simplify the code using reflection.

v2/api/network/customizations/load_balancer_extension.go Outdated Show resolved Hide resolved
v2/api/network/customizations/load_balancer_extension.go Outdated Show resolved Hide resolved
Comment on lines 143 to 165
func getRawInboundNatRulesField(lb map[string]any) ([]any, error) {
props, ok := lb["properties"]
if !ok {
return nil, errors.Errorf("couldn't find properties field on LoadBalancer")
}

propsMap, ok := props.(map[string]any)
if !ok {
return nil, errors.Errorf("properties field wasn't a map")
}

inboundNatRules, ok := propsMap["inboundNatRules"]
if !ok {
return nil, errors.Errorf("couldn't find inboundNatRules field on LoadBalancer")
}

inboundNatRulesSlice, ok := inboundNatRules.([]any)
if !ok {
return nil, errors.Errorf("inboundNatRules field wasn't a slice")
}

return inboundNatRulesSlice, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid the repetition here? What about defining a dynamicObject as an alias for map[string]any as a container for some helper methods?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored things to share a bunch more code.

LoadBalancer is a bit different because it needs to merge rather than overwrite, and that got more complicated with this refactor because now we're looking at Azure state which means that we need to deal with conflicts consistently (previously we merged state from LB in ASO and Routes in ASO, now we're merging LB from ASO w/ LB from Azure, and since LB from Azure will have the routes defined on LB in ASO we have to make sure we choose the ASO ones). So specifically for loadbalancer it doesn't use all of the shared logic it has a bit of its own for that, but the rest of the examples have significantly reduced duplication and even LB makes use of a good bit of that.

Comment on lines +238 to +177
if !foundExistingRule {
resultSlice = reflect.Append(resultSlice, inboundNatRule)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is order significant? This puts all the extras at the end of the slice, so any that are known by ASO will be promoted in front of the others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good question. Let me check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Order doesn't seem to matter. The networking RP seems to return the collection in order of creation time, so the oldest NatRules are first and the newer ones are last. Whatever order you supply rules to it in, that doesn't change.

For our purposes because we're doing a merge here based on name, it doesn't make a different to us I don't think, so the order we end up with here should be OK.

@matthchr matthchr force-pushed the feature/networking-resources-bad branch from 617f0f1 to 44fc044 Compare July 13, 2023 17:20
@matthchr
Copy link
Member Author

/ok-to-test sha=44fc044

@matthchr matthchr merged commit 9e11258 into Azure:main Jul 13, 2023
@matthchr matthchr deleted the feature/networking-resources-bad branch July 13, 2023 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
3 participants