Fix dangling attach errors #55491

gnufied · 2017-11-10T16:51:00Z

Detach volumes from shutdown nodes and ensure that
dangling volumes are handled correctly in AWS

Implement correction mechanism for dangling volumes attached for deleted pods

gnufied · 2017-11-10T16:54:04Z

/sig aws
/sig storage

gnufied · 2017-11-10T16:54:13Z

cc @kubernetes/sig-storage-pr-reviews

jingxu97 · 2017-11-10T17:55:11Z

pkg/volume/aws_ebs/attacher.go

-		// Volume is already detached from node.
-		glog.Infof("detach operation was successful. volume %q is already detached from node %q.", volumeID, nodeName)
-		return nil
+		// it means either node has been deleted or volume is not attached to node


DiskIsAttached returns "not attached" without an error only when the lookup fails with cloudprovider.InstanceNotFound. If the node is deleted from the cloud provider, doesn't that mean the volume should be automatically detached?

No. In AWS, even nodes that are shut down or stopped are removed from the node list, and volumes are not detached from such nodes.

gnufied · 2017-11-11T02:00:39Z

/test pull-kubernetes-bazel-test

jsafrane

I have a couple of nits, but the PR looks good to me in general.

jsafrane · 2017-11-13T15:29:56Z

pkg/cloudprovider/providers/aws/aws.go

@@ -1718,6 +1718,10 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName,
 	if !alreadyAttached {
 		available, err := c.checkIfAvailable(disk, "attaching", awsInstance.awsID)



why empty line?

jsafrane · 2017-11-13T15:30:09Z

pkg/volume/util/error.go

@@ -0,0 +1,41 @@
+/*
+Copyright 2016 The Kubernetes Authors.


nit: wrong year

gnufied · 2017-11-13T16:37:56Z

@jingxu97 ping again on this. let me know what you think?

gnufied · 2017-11-15T14:18:40Z

@jsafrane can you ptal, I addressed your nits.

jsafrane · 2017-11-15T14:34:21Z

/assign

jsafrane · 2017-11-15T14:34:25Z

/lgtm

jsafrane · 2017-11-15T14:34:43Z

/lgtm
(sorry, wrong button :-)

jsafrane · 2017-11-15T14:57:23Z

@deads2k, one tiny approval please: we had to adapt a unit test in pkg/admission to a new interface.

deads2k · 2017-11-15T17:51:17Z

/approve

jingxu97 · 2017-11-15T19:27:09Z

I understand the main changes are for AWS. Will this change affect other volume plugins?

gnufied · 2017-11-15T19:31:15Z

@jingxu97 Not until they also start returning DanglingAttachError. I think it might be a good idea to implement it for other volume plugins too, but that can be done independently.
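As a rough illustration of what "implementing" the error in another plugin could look like, here is a minimal sketch. Only the type name `DanglingAttachError` comes from this PR; the struct fields, the `attachments` map, the `attach` function, and the device path are hypothetical stand-ins, not the real plugin or cloud provider API.

```go
package main

import (
	"errors"
	"fmt"
)

// DanglingAttachError reuses the type name from this PR. The fields are
// assumptions made for this sketch, not the PR's exact layout.
type DanglingAttachError struct {
	msg         string
	CurrentNode string // node the volume is actually attached to
	DevicePath  string // device path on that node
}

func (e *DanglingAttachError) Error() string { return e.msg }

// attachments is a hypothetical stand-in for a cloud API: volume ID -> node.
var attachments = map[string]string{"vol-1": "node-a"}

// attach is a hypothetical plugin attach routine. If the volume is already
// attached to a different node, it reports that node in a typed error.
func attach(volumeID, wantNode string) error {
	if cur, ok := attachments[volumeID]; ok && cur != wantNode {
		return &DanglingAttachError{
			msg:         fmt.Sprintf("volume %q is attached to %q, not %q", volumeID, cur, wantNode),
			CurrentNode: cur,
			DevicePath:  "/dev/xvdf", // placeholder value
		}
	}
	attachments[volumeID] = wantNode
	return nil
}

func main() {
	err := attach("vol-1", "node-b")
	var derr *DanglingAttachError
	if errors.As(err, &derr) {
		fmt.Println("dangling attachment on", derr.CurrentNode)
	}
}
```

A plugin that never returns this error type is unaffected, because callers only react when the type check succeeds.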

jingxu97 · 2017-11-15T20:41:42Z

pkg/cloudprovider/providers/aws/aws.go

@@ -1766,12 +1769,41 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName,
 }

 // DetachDisk implements Volumes.DetachDisk
-func (c *Cloud) DetachDisk(diskName KubernetesVolumeID, nodeName types.NodeName) (string, error) {
+func (c *Cloud) DetachDisk(diskName KubernetesVolumeID, nodeName types.NodeName, nodeExists bool) (string, error) {


It is not clear to me why nodeExists is needed here. The function detaches a disk from a node, so I feel the logic for checking whether the node exists should live inside this function?

Yeah, it can certainly be changed. Will do.

The main reason this additional boolean parameter was introduced is that the DiskIsAttached check incorrectly returns false (as in, the disk is detached) when the node is shut down and the volume is still attached to it. I am trying to fix that behaviour here.

DiskIsAttached returns false even though the disk is still attached to a stopped node because, in AWS, a stopped node is not returned when fetching the instance list. The AWS cloud provider therefore throws a "Node Not Found" error, whereas the node is still there and volumes are still attached to it.

I am trying to fix this without changing too much of the AWS cloud provider's internals. DiskIsAttached appears to be a standard function used by all volume types. I am not 100% sure it makes sense to roll this into DetachDisk just for AWS.
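The failure mode described above can be reproduced with a toy model. This is only a sketch under stated assumptions: `fakeCloud`, `findInstance`, `naiveDiskIsAttached`, and `errInstanceNotFound` are hypothetical names invented for illustration and do not mirror the AWS cloud provider code.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical sentinel standing in for the
// cloud provider's "instance not found" error.
var errInstanceNotFound = errors.New("instance not found")

// fakeCloud is a toy model; none of these names come from the AWS provider.
type fakeCloud struct {
	running  map[string]bool   // instances returned by the instance listing
	attached map[string]string // volume ID -> instance it is attached to
}

// findInstance only sees running instances, like the listing described above.
func (c *fakeCloud) findInstance(node string) error {
	if !c.running[node] {
		return errInstanceNotFound
	}
	return nil
}

// naiveDiskIsAttached treats "instance not found" as "not attached".
func (c *fakeCloud) naiveDiskIsAttached(vol, node string) (bool, error) {
	if err := c.findInstance(node); err != nil {
		if errors.Is(err, errInstanceNotFound) {
			return false, nil // misleading for a stopped instance
		}
		return false, err
	}
	return c.attached[vol] == node, nil
}

func main() {
	c := &fakeCloud{
		running:  map[string]bool{}, // node-a is stopped, so it is not listed
		attached: map[string]string{"vol-1": "node-a"},
	}
	got, _ := c.naiveDiskIsAttached("vol-1", "node-a")
	fmt.Println("reported attached:", got)
	fmt.Println("actually attached:", c.attached["vol-1"] == "node-a")
}
```

In this model the check reports the disk as detached while the attachment record still points at the stopped node, which is the mismatch being discussed.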

The main purpose of this fix is to handle the case where the reconciler tries to attach a disk to a node but the disk is somehow already attached to another node (e.g., an attach operation timed out but the disk was eventually attached).
Here it seems you are also trying to fix another issue with DiskIsAttached. Could you make a separate PR for that so we can discuss it in more detail?

Yep, that is correct. But I think both are related. We couldn't have fixed detaching volumes from stopped nodes if we didn't have the dangling volume detection code.

If you can explain what is bothering you about the fix, maybe I can work around or address that.

It seems DiskIsAttached() is not implemented correctly for AWS when the node cannot be found by the cloud provider, so I feel that instead of adding the bool value, we should fix that function.
Dangling volume detection is for triggering a detach when a volume is attached to a different node. But detaching a volume from a stopped node does not rely on this detection? When a node is stopped and removed from the cloud provider, the reconciler will trigger a detach, but the detach will do nothing since it thinks the volume is already detached.

You are correct that DiskIsAttached does not return the correct value. But fixing it alone will not fix detach from stopped nodes. For example, even if we called DetachDisk, the code at https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1775 would make DetachDisk return without performing the actual detach.

So at minimum I think we need to fix both DiskIsAttached and DetachDisk to detach disks correctly from stopped nodes.

But detaching a volume from a stopped node does not rely on this detection? When a node is stopped and removed from the cloud provider, the reconciler will trigger a detach, but the detach will do nothing since it thinks the volume is already detached.

That is correct, but I was also thinking about the case where a node is stopped and the controller-manager is restarted, or the active master switches in an HA environment. Both cases cause node information to be lost, so a detach may never be attempted. So there are two bugs here. :(

I am thinking we should fix both DetachDisk and DiskIsAttached so that they handle stopped instances correctly. This PR sort of fixes DetachDisk but leaves DiskIsAttached broken. I will fix that too and update the PR.

If you prefer, I can fix DetachDisk and DiskIsAttached in a separate PR.

I think having two PRs is better so it is easier to explain what we are trying to fix.


gnufied · 2017-11-16T15:00:55Z

@jingxu97 I have removed code that fixes detaching volumes from stopped instances on AWS from this PR. PTAL.

@jsafrane @justinsb please take another look. The PR needs lgtm and approve.

jsafrane · 2017-11-16T16:01:32Z

/lgtm

justinsb · 2017-11-17T00:57:56Z

/approve

Some nits that are more suggestions

justinsb · 2017-11-16T23:36:33Z

pkg/cloudprovider/providers/aws/aws.go

@@ -1717,6 +1717,9 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName,

 	if !alreadyAttached {
 		available, err := c.checkIfAvailable(disk, "attaching", awsInstance.awsID)
+		if err != nil {
+			glog.Error(err)


Maybe glog.Errorf("error checking if volume available: %v", err)

justinsb · 2017-11-16T23:58:58Z

pkg/volume/util/error.go

+
+// This error on attach indicates volume is attached to a different node
+// than we expected.
+type DanglingAttachError struct {


Naming nit: AlreadyAttachedError (?)

justinsb · 2017-11-17T00:00:45Z

pkg/volume/util/operationexecutor/operation_generator.go

@@ -267,6 +267,18 @@ func (og *operationGenerator) GenerateAttachVolumeFunc(
 			volumeToAttach.VolumeSpec, volumeToAttach.NodeName)

 		if attachErr != nil {
+			if derr, ok := attachErr.(*util.DanglingAttachError); ok {
+				addErr := actualStateOfWorld.MarkVolumeAsAttached(
+					v1.UniqueVolumeName(""),


Is the empty volume name going to cause a problem... can we have a comment as to why not?

Nah. If a volume spec is passed to MarkVolumeAsAttached, the unique name is taken from the volume spec rather than from the first parameter. I am only filling in the first parameter to satisfy the function signature.
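The precedence described in this reply can be sketched as follows. This is an assumption about the shape of the logic based only on the comment above; `volumeSpec`, `resolveVolumeName`, and the example names are hypothetical and are not copied from MarkVolumeAsAttached.

```go
package main

import "fmt"

// volumeSpec is a hypothetical, trimmed-down stand-in for a volume spec.
type volumeSpec struct {
	uniqueName string
}

// resolveVolumeName sketches the behaviour described in the reply: when a
// spec is supplied, the name derived from it wins and the explicit name is
// ignored. Without a spec, the explicit name is used.
func resolveVolumeName(explicit string, spec *volumeSpec) string {
	if spec != nil {
		return spec.uniqueName
	}
	return explicit
}

func main() {
	spec := &volumeSpec{uniqueName: "example-plugin/vol-1"} // placeholder name
	fmt.Println(resolveVolumeName("", spec))
	fmt.Println(resolveVolumeName("explicit-name", nil))
}
```

Under that assumption, passing an empty first argument is harmless as long as a non-nil spec accompanies it.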

justinsb · 2017-11-17T00:41:50Z

pkg/volume/util/operationexecutor/operation_generator.go

@@ -267,6 +267,18 @@ func (og *operationGenerator) GenerateAttachVolumeFunc(
 			volumeToAttach.VolumeSpec, volumeToAttach.NodeName)

 		if attachErr != nil {
+			if derr, ok := attachErr.(*util.DanglingAttachError); ok {
+				addErr := actualStateOfWorld.MarkVolumeAsAttached(


Can we glog.Info (or warning) when we're doing this ... it's unusual enough that we want to know in the logs I think
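One way the requested logging could sit next to the state update is sketched below. It is not the PR's code: it uses the standard library `log` package in place of glog, and `danglingAttachError`, `actualState`, and `handleAttachError` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"log"
)

// danglingAttachError is a hypothetical stand-in for the PR's error type.
type danglingAttachError struct {
	msg         string
	currentNode string // node the volume is actually attached to
}

func (e *danglingAttachError) Error() string { return e.msg }

// actualState is a toy stand-in for the actual state of world: volume -> node.
type actualState map[string]string

// handleAttachError records where the volume really is and logs the event.
// It returns true only when the error was a dangling attachment.
func handleAttachError(state actualState, volume string, attachErr error) bool {
	var derr *danglingAttachError
	if !errors.As(attachErr, &derr) {
		return false
	}
	log.Printf("warning: volume %q is attached to unexpected node %q; recording it so it can be detached later",
		volume, derr.currentNode)
	state[volume] = derr.currentNode
	return true
}

func main() {
	state := actualState{}
	err := &danglingAttachError{msg: "already attached", currentNode: "node-a"}
	handleAttachError(state, "vol-1", err)
	log.Printf("state: %v", state)
}
```

Logging before mutating the state keeps the unusual path visible even if the subsequent bookkeeping fails.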

k8s-github-robot · 2017-11-17T00:58:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, gnufied, jsafrane, justinsb

Associated issue: 52573

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/cloudprovider/providers/aws/OWNERS~~ [justinsb]
~~pkg/volume/util/OWNERS~~ [gnufied,jsafrane]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-github-robot · 2017-11-17T08:18:19Z

Automatic merge from submit-queue (batch tested with PRs 55392, 55491, 51914, 55831, 55836). If you want to cherry-pick this change to another branch, please follow the instructions here.

jingxu97 · 2017-11-17T09:56:14Z

It is merged already. But could you address @justinsb's comments in another PR? I think logging when this error happens is important.

gnufied · 2017-11-17T13:09:57Z

@jingxu97 yes I agree. I will do that in a follow up commit shortly.

k8s-github-robot assigned justinsb and zmerlynn Nov 10, 2017

k8s-ci-robot added sig/aws sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Nov 10, 2017

jingxu97 reviewed Nov 10, 2017


gnufied force-pushed the fix-dangling-attach-errors branch 2 times, most recently from 83ddecc to e9944f8 Compare November 10, 2017 20:38

gnufied force-pushed the fix-dangling-attach-errors branch from e9944f8 to f9a876a Compare November 11, 2017 13:16

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 11, 2017

jsafrane reviewed Nov 13, 2017


gnufied force-pushed the fix-dangling-attach-errors branch from f9a876a to a65c488 Compare November 13, 2017 16:21

k8s-ci-robot assigned jsafrane Nov 15, 2017

jsafrane closed this Nov 15, 2017

jsafrane reopened this Nov 15, 2017

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 15, 2017

jingxu97 reviewed Nov 15, 2017


gnufied force-pushed the fix-dangling-attach-errors branch from a65c488 to 884e917 Compare November 16, 2017 13:41

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 16, 2017

k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2017

gnufied force-pushed the fix-dangling-attach-errors branch from 884e917 to 0aaf189 Compare November 16, 2017 13:43

Fix dangling attach errors

5297c14

Detach volumes from shutdown nodes and ensure that dangling volumes are handled correctly in AWS

gnufied force-pushed the fix-dangling-attach-errors branch from 0aaf189 to 5297c14 Compare November 16, 2017 13:43

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2017

justinsb reviewed Nov 17, 2017


k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 17, 2017

k8s-github-robot merged commit bb82a3a into kubernetes:master Nov 17, 2017

andyzhangx mentioned this pull request Aug 11, 2019

fix: detach azure disk issue using dangling error #81266

Merged

andyzhangx mentioned this pull request Aug 15, 2020

detach azure disk using dangling error kubernetes-sigs/azuredisk-csi-driver#288

Closed

andyzhangx mentioned this pull request Aug 24, 2020

implement dangling error feature for CSI driver gracefully kubernetes-sigs/azuredisk-csi-driver#501

Closed
