Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Role.iam silently removes [] from spec.forProvider fields #1357

Open
1 task done
mbbush opened this issue Jun 12, 2024 · 5 comments
Open
1 task done

[Bug]: Role.iam silently removes [] from spec.forProvider fields #1357

mbbush opened this issue Jun 12, 2024 · 5 comments
Labels
bug Something isn't working is:triaged Indicates that an issue has been reviewed.

Comments

@mbbush
Copy link
Collaborator

mbbush commented Jun 12, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

  • iam.aws.upbound.io/v1beta1 - Role

Resource MRs required to reproduce the bug

apiVersion: iam.aws.upbound.io/v1beta1
kind: Role
metadata:
  name: role-with-inline-policy
spec:
  forProvider:
    assumeRolePolicy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "eks.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
    managedPolicyArns: []
    inlinePolicy:
      - name: "my_inline_policy"
        policy: |
          {
            "Version": "2012-10-17",
            "Statement": [
              {
                "Effect": "Allow",
                "Resource": "*",
                "Action": "ec2:Describe*"
              }
            ]
          }

Steps to Reproduce

Start running kubectl get -o yaml role.iam.aws.upbound.io --watch
Apply the manifest to the cluster. I observe the same behavior regardless of whether it is server-side or client-side applied.
Observe that while the very first version of the resource contains managedPolicyArns: [] as specified, every subsequent version has the key removed.

Here are the first few observed versions of the resource:

---
apiVersion: iam.aws.upbound.io/v1beta1
kind: Role
metadata:
  annotations:
    meta.upbound.io/example-id: iam/v1beta1/role
  creationTimestamp: "2024-06-12T23:37:16Z"
  generation: 1
  labels:
    testing.upbound.io/example-name: role
  name: role-with-inline-policy
  resourceVersion: "5128"
  uid: f6ff7514-ffd1-44be-ab15-e4b3790f2ca8
spec:
  deletionPolicy: Delete
  forProvider:
    assumeRolePolicy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "eks.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
    inlinePolicy:
    - name: my_inline_policy
      policy: |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Resource": "*",
              "Action": "ec2:Describe*"
            }
          ]
        }
    managedPolicyArns: []
  managementPolicies:
  - '*'
  providerConfigRef:
    name: default
---
apiVersion: iam.aws.upbound.io/v1beta1
kind: Role
metadata:
  annotations:
    meta.upbound.io/example-id: iam/v1beta1/role
  creationTimestamp: "2024-06-12T23:37:16Z"
  generation: 2
  labels:
    testing.upbound.io/example-name: role
  name: role-with-inline-policy
  resourceVersion: "5129"
  uid: f6ff7514-ffd1-44be-ab15-e4b3790f2ca8
spec:
  deletionPolicy: Delete
  forProvider:
    assumeRolePolicy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "eks.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
    inlinePolicy:
    - name: my_inline_policy
      policy: |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Resource": "*",
              "Action": "ec2:Describe*"
            }
          ]
        }
    tags:
      crossplane-kind: role.iam.aws.upbound.io
      crossplane-name: role-with-inline-policy
      crossplane-providerconfig: default
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: default
---
apiVersion: iam.aws.upbound.io/v1beta1
kind: Role
metadata:
  annotations:
    crossplane.io/external-name: role-with-inline-policy
    meta.upbound.io/example-id: iam/v1beta1/role
  creationTimestamp: "2024-06-12T23:37:16Z"
  generation: 2
  labels:
    testing.upbound.io/example-name: role
  name: role-with-inline-policy
  resourceVersion: "5130"
  uid: f6ff7514-ffd1-44be-ab15-e4b3790f2ca8
spec:
  deletionPolicy: Delete
  forProvider:
    assumeRolePolicy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "eks.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
    inlinePolicy:
    - name: my_inline_policy
      policy: |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Resource": "*",
              "Action": "ec2:Describe*"
            }
          ]
        }
    tags:
      crossplane-kind: role.iam.aws.upbound.io
      crossplane-name: role-with-inline-policy
      crossplane-providerconfig: default
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: default
---

What happened?

The managedPolicyArns field in the iam Role resource is a bit unusual. Its description says that setting an empty array will result in disassociating any policies attached out-of-band, while setting it to null will simply ignore attached policies.

I would like to use crossplane to enforce that additional policies are not attached to a certain IAM role, but it seems that the provider (or some other part of the reconciliation machinery) is removing the explicitly set [] value, so I can't actually get the managed resource to have the state it would need to enforce no additional policies.

Relevant Error Output Snippet

No response

Crossplane Version

1.14

Provider Version

1.6.0

Kubernetes Version

v1.30.0

Kubernetes Distribution

kind

Additional Info

No response

@mbbush mbbush added bug Something isn't working needs:triage labels Jun 12, 2024
@diranged
Copy link

We're seeing the exact same behavior... thanks for opening this bug.

@mbbush mbbush added is:triaged Indicates that an issue has been reviewed. and removed needs:triage labels Jun 17, 2024
@mbbush
Copy link
Collaborator Author

mbbush commented Jun 17, 2024

The issue is caused by the omitempty tag in the type definition. This is generated code, and as far as I can tell upjet does not expose the ability to configure it, instead just always adding omitempty to every Field it creates. (caveat: I may be misreading the code and this might not actually be the section of upjet being invoked here)

I removed omitempty from the generated code and ran the provider locally, and this resolved the bug. The provider no longer removed the field from spec.forProvider, and it correctly enforced that externally-attached policies should be detached during the reconciliation loop. I did a little bit of testing of the more common use case of not specifying spec.forProvider.managedPolicyArns, and I didn't see any issues, although I wouldn't consider my tests exhaustive.

@ulucinar Do you remember why we're always setting omitempty? Conceptually, it seems unnecessary, and while it's harmless in most cases, it's causing a problem here where the empty state is meaningful.

@mbbush
Copy link
Collaborator Author

mbbush commented Jul 17, 2024

We discussed this at the SIG-Upjet meeting today. I think the video gets automatically uploaded to youtube, but I don't know the exact link.

  • We need omitempty because the ManagedPolicyArns field has type []*string, which is a slice of pointers to strings, not a pointer to a slice of strings (or a pointer to a slice of pointers to strings), so there's no way to store a null value in a way that's distinct from an empty list. Similarly, InlinePolicy has type []InlinePolicyParameters, which has the same issue.
  • Because of that golang type, at the golang level we cannot distinguish between a missing/unspecified/null value and one that is set to an empty array. Since being able to tell when a user has specified a field or not is so important to crossplane providers, we chose to interpret "either null or empty" as "null" rather than "empty", which is what the omitempty annotation does. The unfortunate consequence is that declaring the user's intent that a field be defined and set to an empty collection becomes impossible.
  • There were several questions around XRM fidelity, and a desire to give @mergenci and @turkenf some more time to re-familiarize themselves with the history. We plan to discuss it again at the next meeting in 2 weeks.

Background and terminology:

In AWS IAM, a role supports two different types of policies: managed and inline. The main difference is that an inline policy can only ever be attached to a single role, while a managed policy can be attached to zero or more roles, groups, or users. They are also counted differently against various AWS quotas.

A managed policy can either be AWS-managed and simply referenced by the AWS-documented ARN, or customer-managed, and created by a Policy managed resource. The RolePolicyAttachment managed resource declares that a particular managed policy should be attached to a role, and detaches it when the managed resource is deleted. It uses the role name, together with a timestamp, as its external name and terraform id.

The RolePolicy managed resource declares that a particular role should have a particular inline policy, and deletes that policy when the managed resource is deleted. It uses the name of the inline policy as its external name (and the combination of policy name and role name as its terraform id). A single role may have more than one inline policy.

When the Role resource was first added to the v1beta1 apiVersion, both inlinePolicy and managedPolicyArns were only present in the status.atProvider fields of the Role managed resource, and the RolePolicy resource was skipped, possibly due to some misunderstanding around the slight, nuanced difference between inline and managed policies.

This left no way to manage inline role policies using this provider until version 0.40.0, which moved inlinePolicy and managedPolicyArns to also be in spec.forProvider of the Role, and added support for the RolePolicy resource.

It's also worth noting that at the time of provider release v0.40.0 we were still forking a terraform cli process for every reconciliation of every managed resource, which was extremely resource-intensive. There were some short-term pressures for ways to reduce the number of managed resources, that became moot once we switched to the no-fork architecture with dramatically improved performance.

References:
I mentioned at the meeting finding some reference that well-behaved kubernates operators should treat absent keys and empty values identically. I looked for and wasn't able to find the reference I remember. Here are two semi-related ones I could find.

kubernetes/kubernetes#125317
This is a (recent) discussion related to distinguishing nil vs empty in Server Side Apply. It seems only peripherally related, but shows the kubernetes maintainers distinguishing between nil and empty arrays/objects.

kubernetes/kubernetes#124050 Includes the quote

I read in kubernetes/kubernetes#70281 that Kubernetes should treat absent keys, empty values (in this case, an empty map), and null values the same.

although I could not find that message in the linked issue.

#745 contains significant discussion around the decision to add inlinePolicy and managedPolicyArns to the Role MR.

Some possible approaches for a solution:

  • Change the golang type used for terraform List/Set fields to use pointers to slices instead of slices, and remove omitempty. This seems like the most "correct" approach, and would provide broader benefits to all upjet-based providers by allowing them to differentiate between null and [], but is probably also the most work, especially as far as regression testing.
  • Add separate detachAllManagedPolicies and deleteAllInlinePolicies boolean fields to the Role's spec.forProvider, and remove spec.forProvider.managedPolicyArns and spec.forProvider.inlinePolicy. This would (mostly) remove the possibility of double-specifying attachments, leading to races like the one in RolePolicyAttachment.iam is detaching policies every reconciliation #929, and support declaring "There should be zero [managed|inline] policies", but not declaring "There should be this list of [managed|inline] policies and no others". If we do this approach, we'd need to make a plan for how to handle existing resources/manifests that contain the removed fields in their spec, possibly by bumping the api version.
  • Change the decision to support the use case of wanting to delete all inline/managed policies for a resource (which @jeanduplessis cited as the main reason for adding the otherwise-duplicative inlinePolicy and managedPolicyArns fields to spec.forProvider) and update the documentation accordingly.

@jeanduplessis
Copy link
Collaborator

@ryanewk
Copy link

ryanewk commented Sep 5, 2024

I believe it is the same underlying issue but we ran into this from a slightly different angle: when trying to remove all managedPolicyArns from an existing role. For example, if you have applied a role with

managedPolicyArns:
- arn:aws:iam::aws:policy/some-policy-name

You then want to remove that policy for some reason and change your spec to:

managedPolicyArns: []

After applying.... the "arn:aws:iam::aws:policy/some-policy-name" policy will remain associated with the role with no warnings. This is confusing (and potentially dangerous if the intention is to revoke permissions). And inconsistent with what the documentation says as mentioned above. (side note: the snake case vs camel case inconsistency in the docs is a bit confusing as well).

The only way we have found to actually remove the policy from the role is to set:

managedPolicyArns: [""]

as alluded to here.

  • This is only an issue when going from n managedPolicyArns to 0.
  • We are still in the evaluation phase and this is just something we noticed as we are kicking the tires and not representative of a particular use case. Basically just wanted to +1 this issue with a little more info on how it tripped us up a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working is:triaged Indicates that an issue has been reviewed.
Projects
None yet
Development

No branches or pull requests

4 participants