
Volumes attaching fails #292

Closed
tsmetana opened this issue May 7, 2019 · 13 comments · Fixed by #298
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@tsmetana
Contributor

tsmetana commented May 7, 2019

/kind bug

What happened?
When trying to provision an EBS volume, attaching fails: the volume is stuck in the "attaching" state.

Pod events:

Events:
  Type     Reason              Age                   From                                    Message
  ----     ------              ----                  ----                                    -------
  Normal   Scheduled           4m11s                 default-scheduler                       Successfully assigned default/app to ip-172-18-26-100.ec2.internal
  Warning  FailedAttachVolume  3m56s                 attachdetach-controller                 AttachVolume.Attach failed for volume "pvc-5dea601a-f076-40d5-8d0a-fe1ea4e04e74" : attachment timeout for volume vol-0ddf92e13636fa667
  Warning  FailedMount         2m8s                  kubelet, ip-172-18-26-100.ec2.internal  Unable to mount volumes for pod "app_default(35032cc6-1df9-4a3c-a911-ad3ac40d718a)": timeout expired waiting for volumes to attach or mount for pod "default"/"app". list of unmounted volumes=[persistent-storage]. list of unattached volumes=[persistent-storage default-token-qz8h2]
  Warning  FailedAttachVolume  108s (x8 over 3m55s)  attachdetach-controller                 AttachVolume.Attach failed for volume "pvc-5dea601a-f076-40d5-8d0a-fe1ea4e04e74" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

The log excerpt from the driver:

key:"topology.ebs.csi.aws.com/zone" value:"us-east-1c" > > preferred:<segments:<key:"topology.ebs.csi.aws.com/zone" value:"us-east-1c" > >  XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:36:45.640858       1 identity.go:61] Probe: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:36:47.390177       1 controller.go:192] ControllerPublishVolume: called with args {VolumeId:vol-0ff706e7960da6554 NodeId:i-0cf6d85f66c7878fa VolumeCapability:mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[fstype: storage.kubernetes.io/csiProvisionerIdentity:1557225320251-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:36:47.916659       1 cloud.go:335] AttachVolume volume="vol-0ff706e7960da6554" instance="i-0cf6d85f66c7878fa" request returned {
  AttachTime: 2019-05-07 10:36:47.879 +0000 UTC,
  Device: "/dev/xvda",
  InstanceId: "i-0cf6d85f66c7878fa",
  State: "attaching",
  VolumeId: "vol-0ff706e7960da6554"
}
I0507 10:36:55.640954       1 identity.go:61] Probe: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:37:02.391502       1 controller.go:192] ControllerPublishVolume: called with args {VolumeId:vol-0ff706e7960da6554 NodeId:i-0cf6d85f66c7878fa VolumeCapability:mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[fstype: storage.kubernetes.io/csiProvisionerIdentity:1557225320251-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:37:05.640812       1 identity.go:61] Probe: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
E0507 10:37:10.772415       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0ff706e7960da6554" to node "i-0cf6d85f66c7878fa": RequestCanceled: request context canceled
caused by: context canceled

What you expected to happen?
The volume gets attached normally and the pod starts.

How to reproduce it (as minimally and precisely as possible)?
Using the manifest with the "canary" versions of the sidecars and with RBAC updated to match Kubernetes changes (though I think I run into the same issue with the manifest as-is, i.e. the 1.0.1/1.0.2 versions of the sidecars), create a PVC and a pod using it:

Anything else we need to know?:
The EC2 GUI tells me the volume device is /dev/xvda, which should be reserved for root devices according to the documentation -- not sure how this happens or whether it's OK.

Also -- the problem disappears if I revert the changes made in PR #274...

Environment

  • Kubernetes version (use kubectl version):
    v1.15.0-alpha.2.150+6a48257627d3bc-dirty (git master HEAD)
  • Driver version:
    master HEAD
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 7, 2019
@leakingtapan
Contributor

leakingtapan commented May 8, 2019

@tsmetana the nameAllocator.GetNext(inUse) call should take into consideration the volume names that are already in use (this was the same before the refactoring; the refactoring only changed how the names are picked based on the inUse names, not how the inUse names are fetched from EC2). One possibility is that d.getDeviceNamesInUser returns stale data.

Is this volume attachment happening shortly after the instance is started? And is this issue reproducible?
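
As a rough illustration of the allocation idea described above (a simplified sketch, not the driver's actual code; the candidate range and the getNext helper are hypothetical stand-ins for nameAllocator.GetNext(inUse), and the stale in-use set is simulated):

package main

import "fmt"

// getNext stands in for nameAllocator.GetNext(inUse): it returns the first
// candidate device name that is not in the set of names believed to be in
// use on the instance.
func getNext(inUse map[string]struct{}) string {
	for c := 'a'; c <= 'z'; c++ { // simplified candidate range
		name := fmt.Sprintf("/dev/xvd%c", c)
		if _, taken := inUse[name]; !taken {
			return name
		}
	}
	return ""
}

func main() {
	// If the in-use names fetched from EC2 are stale or incomplete, a name
	// that is actually taken can be handed out again.
	stale := map[string]struct{}{"/dev/xvdb": {}} // the root disk is missing here
	fmt.Println(getNext(stale))                   // prints /dev/xvda
}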

@anusha-ragunathan
Contributor

I am experiencing the same issue.

  • Kube cluster 1.14
  • Same result after updating the sidecars to use canary or v1.1.0
  • PVC/PV creation works
  • Pod fails to attach the EBS volume
  • I can reproduce this 100% of the time

@anusha-ragunathan
Contributor

error logs from ebs-plugin:

{"log":"E0509 19:19:35.923183       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume \"vol-00b79dc30944af2c4\" to node \"i-00db5016
9b8e6314a\": RequestCanceled: request context canceled\n","stream":"stderr","time":"2019-05-09T19:19:35.923320812Z"}
{"log":"caused by: context deadline exceeded\n","stream":"stderr","time":"2019-05-09T19:19:35.923340543Z"}
{"log":"I0509 19:22:11.856103       1 controller.go:196] ControllerPublishVolume: called with args {VolumeId:vol-00b79dc30944af2c4 NodeId:i-00db50169b8e6314a VolumeCapa
bility:mount:\u003cfs_type:\"xfs\" \u003e access_mode:\u003cmode:SINGLE_NODE_WRITER \u003e  Readonly:false Secrets:map[] VolumeContext:map[fstype:xfs storage.kubernetes
.io/csiProvisionerIdentity:1557429213017-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}\n","stream":"stderr","time":"2019-05-09T19:2
2:11.856246074Z"}
{"log":"E0509 19:22:34.691738       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume \"vol-00b79dc30944af2c4\" to node \"i-00db5016
9b8e6314a\": RequestCanceled: request context canceled\n","stream":"stderr","time":"2019-05-09T19:22:34.69188685Z"}
{"log":"caused by: context deadline exceeded\n","stream":"stderr","time":"2019-05-09T19:22:34.691905331Z"}
{"log":"I0509 19:23:33.358178       1 controller.go:196] ControllerPublishVolume: called with args {VolumeId:vol-00b79dc30944af2c4 NodeId:i-00db50169b8e6314a VolumeCapa
bility:mount:\u003cfs_type:\"xfs\" \u003e access_mode:\u003cmode:SINGLE_NODE_WRITER \u003e  Readonly:false Secrets:map[] VolumeContext:map[fstype:xfs storage.kubernetes
.io/csiProvisionerIdentity:1557429213017-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}\n","stream":"stderr","time":"2019-05-09T19:2
3:33.358312324Z"}
{"log":"E0509 19:23:56.309141       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume \"vol-00b79dc30944af2c4\" to node \"i-00db5016
9b8e6314a\": RequestCanceled: request context canceled\n","stream":"stderr","time":"2019-05-09T19:23:56.309280671Z"}
{"log":"caused by: context deadline exceeded\n","stream":"stderr","time":"2019-05-09T19:23:56.309299766Z"}

@anusha-ragunathan
Contributor

The AWS console shows that the volume being attached is /dev/xvda and is in the "attaching" state. However, this conflicts with the root partition of my AWS EC2 instance.

@tsmetana
Contributor Author

@leakingtapan -- just as described by @anusha-ragunathan (thanks for adding the info): this happens every time (i.e., the current master HEAD is unusable for me), and the age of the cluster doesn't affect the problem.

@leakingtapan
Contributor

leakingtapan commented May 10, 2019

I tried the dynamic provisioning example on a 1.14 cluster and it works for me. The cluster is provisioned using kops:

kops create cluster --zones us-east-1a,us-east-1b,us-east-1c --node-count=1 --kubernetes-version=v1.14.0 1.14-bug.k8s.local

k8s version:

kk version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:45:25Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

But I can't reproduce the issue. How did you guys create the cluster? And what instance type is being used?

@anusha-ragunathan since you can consistently reproduce the issue, could you print out the instance object during AttachDisk here?

@tsmetana @anusha-ragunathan

@anusha-ragunathan
Contributor

I have a 3-node cluster provisioned using Docker Enterprise UCP 3.2.0.
The AWS instances are t2.2xlarge.

@leakingtapan
Contributor

Still can't reproduce using t2.2xlarge. The simple fix would be to start the device path from /dev/xvdb instead of /dev/xvda, but let's root-cause this first.

@tsmetana
Contributor Author

I think I just used a t2.medium with CentOS 7.5, turned SELinux to permissive, installed Docker, added the docker user group, started up the service, copied the compiled vanilla Kubernetes onto it and started a single-node cluster with ALLOW_PRIVILEGED=true hack/local-up-cluster.sh -O...

@bertinatto bertinatto added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label May 13, 2019
@tsmetana
Contributor Author

I think the necessary condition for reproducing the bug is to have the system installed on /dev/sd*, so that /dev/xvda is (seemingly) available for the dynamically allocated devices.

@anusha-ragunathan
Contributor

@leakingtapan: I reproduced the issue with an added debug log dumping the instance in AttachDisk:
aws_ebs.log
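
A minimal sketch of the kind of debug dump being discussed, assuming the AWS SDK for Go's *ec2.Instance type and klog-style logging; the helper name, package and exact placement inside AttachDisk are hypothetical and not taken from the attached log:

package cloud

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"k8s.io/klog"
)

// dumpInstance logs the whole instance object plus one line per block-device
// mapping, which is the information that makes the device-name conflict visible.
func dumpInstance(instance *ec2.Instance) {
	klog.Infof("AttachDisk: instance=%s", instance.String())
	for _, bdm := range instance.BlockDeviceMappings {
		if bdm.Ebs == nil {
			continue
		}
		klog.Infof("AttachDisk: device=%s volume=%s state=%s",
			aws.StringValue(bdm.DeviceName),
			aws.StringValue(bdm.Ebs.VolumeId),
			aws.StringValue(bdm.Ebs.Status))
	}
}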

@anusha-ragunathan
Contributor

There's definitely some enumeration issue going on.

  • The instance has 3 block devices and claims that /dev/sda1 is the root device.
  • However, Linux only sees 2 /dev/xvd disks, where /dev/xvda holds the boot partition.

Linux partition details:

$ more /proc/partitions 
major minor  #blocks  name

 202        0   10485760 xvda
 202        1       1024 xvda1
 202        2   10483695 xvda2
 202       16  104857600 xvdb
 202       17  104856576 xvdb1

BlockDeviceMappings on the EC2 instance:

BlockDeviceMappings: [{
      DeviceName: "/dev/sda1",
      Ebs: {
        AttachTime: 2019-04-18 03:46:09 +0000 UTC,
        DeleteOnTermination: true,
        Status: "attached",
        VolumeId: "vol-0112e5a6cddbaed09"
      }
    },{
      DeviceName: "/dev/xvdb",
      Ebs: {
        AttachTime: 2019-04-18 03:46:09 +0000 UTC,
        DeleteOnTermination: true,
        Status: "attached",
        VolumeId: "vol-0a668d11c2427b9d3"
      }
    },{
      DeviceName: "/dev/xvda",
      Ebs: {
        AttachTime: 2019-05-15 16:57:00 +0000 UTC,
        DeleteOnTermination: false,
        Status: "attaching",
        VolumeId: "vol-0a4436f65c0565521"
      }
    }],
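
To make the mismatch concrete, here is a small standalone illustration (not driver code): a check based only on the EC2-reported device names above considers /dev/xvda free, even though the kernel uses xvda for the root disk.

package main

import "fmt"

// isFree reports whether no block-device mapping uses the candidate name
// verbatim, which is the only information the controller gets from EC2.
func isFree(candidate string, mappedNames []string) bool {
	for _, name := range mappedNames {
		if name == candidate {
			return false
		}
	}
	return true
}

func main() {
	// Device names as reported for this instance before the failed attach:
	// the root disk appears as /dev/sda1, not /dev/xvda.
	mappedNames := []string{"/dev/sda1", "/dev/xvdb"}
	fmt.Println(isFree("/dev/xvda", mappedNames)) // true, so the attach targets the kernel's root device name
}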

@anusha-ragunathan
Contributor

anusha-ragunathan commented May 15, 2019

The instance says "hey, /dev/xvda is available", so the plugin starts attaching the volume under that device name. However, the Linux system already uses /dev/xvda as the root partition and rejects the request.

anusha-ragunathan added a commit to anusha-ragunathan/aws-ebs-csi-driver that referenced this issue May 17, 2019
According to
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html,
on HVM instances, sda1 and xvda are reserved for the root partition. Also,
the available block device range for xvd follows the "/dev/xvd[b-c][a-z]"
pattern.

However, commit 1ace946 violates the reserved device names and the naming
logic. This causes volume attachment issues. For example, for instances
where /dev/xvda is the root partition, the driver tries to create a new
device with the same reserved name.
This change fixes the issue.

Testing:
Deployed a Kube 1.14 cluster.
Updated manifest.yaml with canary images for the sidecars and new names for
the API resources.
Tested the new plugin and created an sc, pvc (pv) and pod.
The volume attaches to the node successfully. The new device attached is
"/dev/xvdba" as expected.

The pod mounts the volume and runs fine.

Issue kubernetes-sigs#292

Signed-off-by: Anusha Ragunathan <anusha.ragunathan@docker.com>
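
A rough sketch of the naming scheme this commit message describes; the actual change in PR #298 may differ in details. Candidates come from the /dev/xvd[b-c][a-z] range, whose two-letter suffixes by construction never collide with the reserved root names (/dev/sda1 and /dev/xvda on HVM instances):

package main

import "fmt"

// nextDeviceName returns the first /dev/xvd[b-c][a-z] candidate that is not
// already in use on the instance. Every candidate has a two-letter suffix,
// so the reserved root names can never be picked.
func nextDeviceName(inUse map[string]struct{}) (string, error) {
	for _, first := range []rune{'b', 'c'} {
		for second := 'a'; second <= 'z'; second++ {
			name := fmt.Sprintf("/dev/xvd%c%c", first, second)
			if _, taken := inUse[name]; !taken {
				return name, nil
			}
		}
	}
	return "", fmt.Errorf("no device names left")
}

func main() {
	inUse := map[string]struct{}{"/dev/sda1": {}, "/dev/xvdb": {}}
	name, _ := nextDeviceName(inUse)
	fmt.Println(name) // /dev/xvdba, matching the device name reported in the testing notes above
}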
anusha-ragunathan added two further commits to anusha-ragunathan/aws-ebs-csi-driver that referenced this issue May 17, 2019, with the same commit message as above.
sreis pushed a commit to d2iq-archive/aws-ebs-csi-driver that referenced this issue Nov 5, 2019, with the same commit message as above.