Skip deallocating Gid when static Gid set #733

Closed
wants to merge 1 commit

Conversation

kyanar

@kyanar kyanar commented Jul 7, 2022

Is this a bug fix or adding new feature?
Bug fix

What is this PR about? / Why do we need it?
The efs-plugin crashes with a segmentation fault when access point creation fails and the storage class defines a fixed uid and gid, because the error handler attempts to deallocate the gid using an uninitialised gidAllocator. Since the crash happens before the error is logged, the cluster admin never sees the underlying failure.

This PR adds a check for whether the "allocated gid" still holds the default Go int value, and skips the deallocation if so (see the sketch after this description).

What testing is done?
Covered by the existing test case for fixed uid/gid allocation.
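
For illustration, here is a minimal, self-contained sketch of the guard described above, assuming hypothetical type and function names (the real identifiers in aws-efs-csi-driver may differ): the release is skipped while the allocated gid still holds Go's zero value, so the error path no longer touches an uninitialised allocator.

```go
package main

import (
	"fmt"
	"sync"
)

// gidAllocator is a hypothetical stand-in for the driver's GID allocator;
// the actual type, field, and method names may differ.
type gidAllocator struct {
	mu    sync.Mutex
	inUse map[int]bool
}

func (g *gidAllocator) releaseGid(gid int) {
	g.mu.Lock() // calling this on a nil *gidAllocator dereferences nil and panics
	defer g.mu.Unlock()
	delete(g.inUse, gid)
}

// cleanupAfterError sketches the error path this PR guards. With a static GID
// in the storage class, no GID is ever allocated, so allocatedGid keeps Go's
// zero value for int and the release must be skipped.
func cleanupAfterError(alloc *gidAllocator, allocatedGid int, createErr error) error {
	if allocatedGid != 0 {
		alloc.releaseGid(allocatedGid)
	}
	// With the guard in place, the original error is surfaced (and can be
	// logged) instead of the process segfaulting inside the handler.
	return fmt.Errorf("failed to create access point: %w", createErr)
}

func main() {
	// Simulate the failure path with a static GID: no allocator was initialised.
	err := cleanupAfterError(nil, 0, fmt.Errorf("AccessDenied"))
	fmt.Println(err)
}
```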

@k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels Jul 7, 2022
@k8s-ci-robot
Contributor

Hi @kyanar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested a review from d-nishi July 7, 2022 05:12
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kyanar
To complete the pull request process, please assign ashley-wenyizha after the PR has been reviewed.
You can assign the PR to them by writing /assign @ashley-wenyizha in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kyanar
Author

kyanar commented Aug 22, 2022

/cc @Ashley-wenyizha

@daghaian

This issue is also currently preventing us from being able to leverage the newer versions of the efs-csi-driver for our production environments.

@MarkSpencerTan

We're hitting this same issue in our environment since we're setting GIDs. Would appreciate it if this could be reviewed and merged, please!

@kyanar
Author

kyanar commented Aug 30, 2022

@MarkSpencerTan @daghaian just so you're aware: if you're running into this issue, the EFS CSI driver has already failed to mount your EFS. The bug is unfortunately in the error handler, so the driver never gets a chance to output a usable error message.

I recommend checking CloudTrail for any AccessDenied errors - in my case I was using a version of the IAM policy that did not grant efs:CreateAccessPoint to the CSI controller service account's IAM role.

@daghaian

@kyanar Thanks for the suggestion. We did see AccessPointAlreadyExists errors being thrown inside CloudTrail, so it's plausible something is going on there.

@MarkSpencerTan

MarkSpencerTan commented Sep 3, 2022

@kyanar thanks for the suggestions! We tried to figure out what was causing the AccessPointAlreadyExists errors we were getting; however, they seem to happen at random times, or from some sort of race condition, and the EFS CSI Driver normally handles them by simply retrying, after which everything works again. This is why we never really saw any issues until we started setting the Gid...

When the Gid is set and this problem arises, the EFS CSI Driver pods error out continuously, preventing any further PVCs from being created until the problematic PVCs are cleared out.

I've captured the logs of the efs csi driver pod when this error happens without the Gid being set so you can see what it does normally:

https://gist.github.com/MarkSpencerTan/96775d9b2b3043ce7647693ecce309be#file-gistfile1-txt-L163

Would appreciate it if this fix becomes available since this is causing the efs-csi-driver to be very unstable in cases where the Gid is being set.

@kyanar
Author

kyanar commented Sep 5, 2022

Unfortunately none of the approvers for this project appear to be active on GitHub - the most recent activity was on 18 August - so I can't find anyone to review this.

@kyanar
Author

kyanar commented Sep 26, 2022

/cc @Ashley-wenyizha

@kyanar
Author

kyanar commented Oct 22, 2022

@Ashley-wenyizha this is a one-line fix for an issue affecting quite a few users - can this be looked at, please?

@kyanar
Author

kyanar commented Dec 14, 2022

Pull #850 corrects this behaviour.

@kyanar kyanar closed this Dec 14, 2022