Skip deallocating Gid when static Gid set #733

Closed
wants to merge 1 commit

Conversation

kyanar

@kyanar kyanar commented Jul 7, 2022

Is this a bug fix or adding new feature?
Bug fix

What is this PR about? / Why do we need it?
The efs-plugin crashes with a segmentation fault when access point creation fails and the storage class defines a fixed uid and gid, because the error handler attempts to deallocate the gid using an uninitialised gidAllocator. Since the crash happens before the error is logged, the cluster admin never sees the underlying failure.

This PR adds a check for whether the "allocated gid" still holds the default Go int value, and skips the deallocation if so (see the sketch after this description).

What testing is done?
Covered by the existing test case for fixed uid/gid allocation.
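
For illustration, here is a minimal, self-contained sketch of the guard described above, assuming hypothetical type and function names (the real identifiers in aws-efs-csi-driver may differ): the release is skipped while the allocated gid still holds Go's zero value, so the error path no longer touches an uninitialised allocator.

```go
package main

import (
	"fmt"
	"sync"
)

// gidAllocator is a hypothetical stand-in for the driver's GID allocator;
// the actual type, field, and method names may differ.
type gidAllocator struct {
	mu    sync.Mutex
	inUse map[int]bool
}

func (g *gidAllocator) releaseGid(gid int) {
	g.mu.Lock() // calling this on a nil *gidAllocator dereferences nil and panics
	defer g.mu.Unlock()
	delete(g.inUse, gid)
}

// cleanupAfterError sketches the error path this PR guards. With a static GID
// in the storage class, no GID is ever allocated, so allocatedGid keeps Go's
// zero value for int and the release must be skipped.
func cleanupAfterError(alloc *gidAllocator, allocatedGid int, createErr error) error {
	if allocatedGid != 0 {
		alloc.releaseGid(allocatedGid)
	}
	// With the guard in place, the original error is surfaced (and can be
	// logged) instead of the process segfaulting inside the handler.
	return fmt.Errorf("failed to create access point: %w", createErr)
}

func main() {
	// Simulate the failure path with a static GID: no allocator was initialised.
	err := cleanupAfterError(nil, 0, fmt.Errorf("AccessDenied"))
	fmt.Println(err)
}
```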

@k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels Jul 7, 2022
@k8s-ci-robot
Contributor

Hi @kyanar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested a review from d-nishi July 7, 2022 05:12
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kyanar
To complete the pull request process, please assign ashley-wenyizha after the PR has been reviewed.
You can assign the PR to them by writing /assign @ashley-wenyizha in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kyanar
Author

kyanar commented Aug 22, 2022

/cc @Ashley-wenyizha

@daghaian

This issue is also currently preventing us from being able to leverage the newer versions of the efs-csi-driver for our production environments.

@MarkSpencerTan

We're hitting this same issue in our environment since we're setting GIDs. Would appreciate it if this could be reviewed and merged, please!

@kyanar
Author

kyanar commented Aug 30, 2022

@MarkSpencerTan @daghaian just so you're aware: if you're running into this issue, the EFS CSI driver has already failed to mount your EFS. The bug is unfortunately in the error handler, so the driver never gets a chance to output a usable error message.

I recommend checking CloudTrail for any AccessDenied errors - in my case I was using a version of the IAM policy that did not grant efs:CreateAccessPoint to the CSI controller service account's IAM role.

@daghaian

@kyanar Thanks for the suggestion. We did see AccessPointAlreadyExists errors being thrown inside CloudTrail, so it's plausible something is going on there.

@MarkSpencerTan

MarkSpencerTan commented Sep 3, 2022

@kyanar thanks for the suggestions! We tried to figure out what was causing the AccessPointAlreadyExists errors we were getting; however, they seem to happen at random times, or from some sort of race condition, and the EFS CSI Driver normally handles them by simply retrying, after which everything works again. This is why we never really saw any issues until we started setting the Gid...

When the Gid is set and this problem arises, the EFS CSI Driver pods error out continuously, preventing any further PVCs from being created until the problematic PVCs are cleared out.

I've captured the logs of the efs csi driver pod when this error happens without the Gid being set so you can see what it does normally:

https://gist.github.com/MarkSpencerTan/96775d9b2b3043ce7647693ecce309be#file-gistfile1-txt-L163

Would appreciate it if this fix becomes available since this is causing the efs-csi-driver to be very unstable in cases where the Gid is being set.

@kyanar
Author

kyanar commented Sep 5, 2022

Unfortunately none of the approvers for this project appear to be active on GitHub - the most recent activity was on 18 August - so I can't find anyone to review this.

@kyanar
Author

kyanar commented Sep 26, 2022

/cc @Ashley-wenyizha

@kyanar
Author

kyanar commented Oct 22, 2022

@Ashley-wenyizha this is a one-line fix for an issue affecting quite a few users - can this be looked at, please?

@kyanar
Author

kyanar commented Dec 14, 2022

Pull #850 corrects this behaviour.

@kyanar kyanar closed this Dec 14, 2022