
Volumes attaching fails #292

Closed
tsmetana opened this issue May 7, 2019 · 13 comments · Fixed by #298
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@tsmetana
Contributor

tsmetana commented May 7, 2019

/kind bug

What happened?
When trying to provision an EBS volume, attaching fails: the volume is stuck in the "attaching" state.

Pod events:

Events:
  Type     Reason              Age                   From                                    Message
  ----     ------              ----                  ----                                    -------
  Normal   Scheduled           4m11s                 default-scheduler                       Successfully assigned default/app to ip-172-18-26-100.ec2.internal
  Warning  FailedAttachVolume  3m56s                 attachdetach-controller                 AttachVolume.Attach failed for volume "pvc-5dea601a-f076-40d5-8d0a-fe1ea4e04e74" : attachment timeout for volume vol-0ddf92e13636fa667
  Warning  FailedMount         2m8s                  kubelet, ip-172-18-26-100.ec2.internal  Unable to mount volumes for pod "app_default(35032cc6-1df9-4a3c-a911-ad3ac40d718a)": timeout expired waiting for volumes to attach or mount for pod "default"/"app". list of unmounted volumes=[persistent-storage]. list of unattached volumes=[persistent-storage default-token-qz8h2]
  Warning  FailedAttachVolume  108s (x8 over 3m55s)  attachdetach-controller                 AttachVolume.Attach failed for volume "pvc-5dea601a-f076-40d5-8d0a-fe1ea4e04e74" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

The log excerpt from the driver:

key:"topology.ebs.csi.aws.com/zone" value:"us-east-1c" > > preferred:<segments:<key:"topology.ebs.csi.aws.com/zone" value:"us-east-1c" > >  XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:36:45.640858       1 identity.go:61] Probe: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:36:47.390177       1 controller.go:192] ControllerPublishVolume: called with args {VolumeId:vol-0ff706e7960da6554 NodeId:i-0cf6d85f66c7878fa VolumeCapability:mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[fstype: storage.kubernetes.io/csiProvisionerIdentity:1557225320251-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:36:47.916659       1 cloud.go:335] AttachVolume volume="vol-0ff706e7960da6554" instance="i-0cf6d85f66c7878fa" request returned {
  AttachTime: 2019-05-07 10:36:47.879 +0000 UTC,
  Device: "/dev/xvda",
  InstanceId: "i-0cf6d85f66c7878fa",
  State: "attaching",
  VolumeId: "vol-0ff706e7960da6554"
}
I0507 10:36:55.640954       1 identity.go:61] Probe: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:37:02.391502       1 controller.go:192] ControllerPublishVolume: called with args {VolumeId:vol-0ff706e7960da6554 NodeId:i-0cf6d85f66c7878fa VolumeCapability:mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[fstype: storage.kubernetes.io/csiProvisionerIdentity:1557225320251-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0507 10:37:05.640812       1 identity.go:61] Probe: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
E0507 10:37:10.772415       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0ff706e7960da6554" to node "i-0cf6d85f66c7878fa": RequestCanceled: request context canceled
caused by: context canceled

What you expected to happen?
The volume gets attached normally and the pod starts.

How to reproduce it (as minimally and precisely as possible)?
Using the manifest with the "canary" versions of the sidecars and with RBAC updated to match Kubernetes changes (though I think I run into the same issue with the manifest as-is, i.e. the 1.0.1/1.0.2 versions of the sidecars), create a PVC and a pod using it:

Anything else we need to know?:
The EC2 GUI tells me the volume device is /dev/xvda, which should be reserved for root devices according to the documentation -- not sure how this happens or whether it's OK.

Also -- the problem disappears if I revert the changes made in PR #274...

Environment

  • Kubernetes version (use kubectl version):
    v1.15.0-alpha.2.150+6a48257627d3bc-dirty (git master HEAD)
  • Driver version:
    master HEAD
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 7, 2019
@leakingtapan
Contributor

leakingtapan commented May 8, 2019

@tsmetana the nameAllocator.GetNext(inUse) call should take into consideration the volume names that are already in use (this was the same before the refactoring; the refactoring only changed how the names are picked based on the inUse names, not how the inUse names are fetched from EC2). One possibility is that d.getDeviceNamesInUser returns stale data.

Is this volume attachment happening shortly after the instance is started? And is this issue reproducible?
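
As a rough illustration of the allocation idea described above (a simplified sketch, not the driver's actual code; the candidate range and the getNext helper are hypothetical stand-ins for nameAllocator.GetNext(inUse), and the stale in-use set is simulated):

package main

import "fmt"

// getNext stands in for nameAllocator.GetNext(inUse): it returns the first
// candidate device name that is not in the set of names believed to be in
// use on the instance.
func getNext(inUse map[string]struct{}) string {
	for c := 'a'; c <= 'z'; c++ { // simplified candidate range
		name := fmt.Sprintf("/dev/xvd%c", c)
		if _, taken := inUse[name]; !taken {
			return name
		}
	}
	return ""
}

func main() {
	// If the in-use names fetched from EC2 are stale or incomplete, a name
	// that is actually taken can be handed out again.
	stale := map[string]struct{}{"/dev/xvdb": {}} // the root disk is missing here
	fmt.Println(getNext(stale))                   // prints /dev/xvda
}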

@anusha-ragunathan
Contributor

I am experiencing the same issue.

  • Kube cluster 1.14
  • Same result after updating the sidecars to use canary or v1.1.0
  • PVC/PV creation works
  • Pod fails to attach the EBS volume
  • I can reproduce this 100% of the time

@anusha-ragunathan
Contributor

error logs from ebs-plugin:

{"log":"E0509 19:19:35.923183       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume \"vol-00b79dc30944af2c4\" to node \"i-00db5016
9b8e6314a\": RequestCanceled: request context canceled\n","stream":"stderr","time":"2019-05-09T19:19:35.923320812Z"}
{"log":"caused by: context deadline exceeded\n","stream":"stderr","time":"2019-05-09T19:19:35.923340543Z"}
{"log":"I0509 19:22:11.856103       1 controller.go:196] ControllerPublishVolume: called with args {VolumeId:vol-00b79dc30944af2c4 NodeId:i-00db50169b8e6314a VolumeCapa
bility:mount:\u003cfs_type:\"xfs\" \u003e access_mode:\u003cmode:SINGLE_NODE_WRITER \u003e  Readonly:false Secrets:map[] VolumeContext:map[fstype:xfs storage.kubernetes
.io/csiProvisionerIdentity:1557429213017-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}\n","stream":"stderr","time":"2019-05-09T19:2
2:11.856246074Z"}
{"log":"E0509 19:22:34.691738       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume \"vol-00b79dc30944af2c4\" to node \"i-00db5016
9b8e6314a\": RequestCanceled: request context canceled\n","stream":"stderr","time":"2019-05-09T19:22:34.69188685Z"}
{"log":"caused by: context deadline exceeded\n","stream":"stderr","time":"2019-05-09T19:22:34.691905331Z"}
{"log":"I0509 19:23:33.358178       1 controller.go:196] ControllerPublishVolume: called with args {VolumeId:vol-00b79dc30944af2c4 NodeId:i-00db50169b8e6314a VolumeCapa
bility:mount:\u003cfs_type:\"xfs\" \u003e access_mode:\u003cmode:SINGLE_NODE_WRITER \u003e  Readonly:false Secrets:map[] VolumeContext:map[fstype:xfs storage.kubernetes
.io/csiProvisionerIdentity:1557429213017-8081-ebs.csi.aws.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}\n","stream":"stderr","time":"2019-05-09T19:2
3:33.358312324Z"}
{"log":"E0509 19:23:56.309141       1 driver.go:66] GRPC error: rpc error: code = Internal desc = Could not attach volume \"vol-00b79dc30944af2c4\" to node \"i-00db5016
9b8e6314a\": RequestCanceled: request context canceled\n","stream":"stderr","time":"2019-05-09T19:23:56.309280671Z"}
{"log":"caused by: context deadline exceeded\n","stream":"stderr","time":"2019-05-09T19:23:56.309299766Z"}

@anusha-ragunathan
Contributor

The AWS console shows that the volume being attached is /dev/xvda and is in the "attaching" state. However, this conflicts with the root partition of my AWS EC2 instance.

@tsmetana
Contributor Author

@leakingtapan -- just as described by @anusha-ragunathan (thanks for adding the info): this happens every time (i.e., the current master HEAD is unusable for me), and the age of the cluster doesn't affect the problem.

@leakingtapan
Contributor

leakingtapan commented May 10, 2019

I tried the dynamic provisioning example on a 1.14 cluster and it works for me. The cluster is provisioned using kops:

kops create cluster --zones us-east-1a,us-east-1b,us-east-1c --node-count=1 --kubernetes-version=v1.14.0 1.14-bug.k8s.local

k8s version:

kk version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:45:25Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

But I can't reproduce the issue. How did you guys create the cluster? And what instance type is being used?

@anusha-ragunathan since you can consistently reproduce the issue, could you print out the instance object during AttachDisk here?

@tsmetana @anusha-ragunathan

@anusha-ragunathan
Contributor

I have a 3-node cluster provisioned using Docker Enterprise UCP 3.2.0.
The AWS instances are t2.2xlarge.

@leakingtapan
Contributor

Still can't reproduce using t2.2xlarge. The simple fix would be to start the device path from /dev/xvdb instead of /dev/xvda, but let's root-cause this first.

@tsmetana
Contributor Author

I think I just used a t2.medium with CentOS 7.5, turned SELinux to permissive, installed Docker, added the docker user group, started up the service, copied the compiled vanilla Kubernetes onto it and started a single-node cluster with ALLOW_PRIVILEGED=true hack/local-up-cluster.sh -O...

@bertinatto bertinatto added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label May 13, 2019
@tsmetana
Contributor Author

I think the necessary condition for reproducing the bug is to have the system installed on /dev/sd*, so that /dev/xvda is (seemingly) available for the dynamically allocated devices.

@anusha-ragunathan
Contributor

@leakingtapan: I reproduced the issue with an added debug log dumping the instance in AttachDisk:
aws_ebs.log
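
A minimal sketch of the kind of debug dump being discussed, assuming the AWS SDK for Go's *ec2.Instance type and klog-style logging; the helper name, package and exact placement inside AttachDisk are hypothetical and not taken from the attached log:

package cloud

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"k8s.io/klog"
)

// dumpInstance logs the whole instance object plus one line per block-device
// mapping, which is the information that makes the device-name conflict visible.
func dumpInstance(instance *ec2.Instance) {
	klog.Infof("AttachDisk: instance=%s", instance.String())
	for _, bdm := range instance.BlockDeviceMappings {
		if bdm.Ebs == nil {
			continue
		}
		klog.Infof("AttachDisk: device=%s volume=%s state=%s",
			aws.StringValue(bdm.DeviceName),
			aws.StringValue(bdm.Ebs.VolumeId),
			aws.StringValue(bdm.Ebs.Status))
	}
}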

@anusha-ragunathan
Contributor

There's definitely some enumeration issue going on.

  • The instance has 3 block devices and claims that /dev/sda1 is the root device.
  • However, Linux only sees 2 /dev/xvd disks, where /dev/xvda holds the boot partition.

Linux partition details:

$ more /proc/partitions 
major minor  #blocks  name

 202        0   10485760 xvda
 202        1       1024 xvda1
 202        2   10483695 xvda2
 202       16  104857600 xvdb
 202       17  104856576 xvdb1

BlockDeviceMappings on the EC2 instance:

BlockDeviceMappings: [{
      DeviceName: "/dev/sda1",
      Ebs: {
        AttachTime: 2019-04-18 03:46:09 +0000 UTC,
        DeleteOnTermination: true,
        Status: "attached",
        VolumeId: "vol-0112e5a6cddbaed09"
      }
    },{
      DeviceName: "/dev/xvdb",
      Ebs: {
        AttachTime: 2019-04-18 03:46:09 +0000 UTC,
        DeleteOnTermination: true,
        Status: "attached",
        VolumeId: "vol-0a668d11c2427b9d3"
      }
    },{
      DeviceName: "/dev/xvda",
      Ebs: {
        AttachTime: 2019-05-15 16:57:00 +0000 UTC,
        DeleteOnTermination: false,
        Status: "attaching",
        VolumeId: "vol-0a4436f65c0565521"
      }
    }],
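
To make the mismatch concrete, here is a small standalone illustration (not driver code): a check based only on the EC2-reported device names above considers /dev/xvda free, even though the kernel uses xvda for the root disk.

package main

import "fmt"

// isFree reports whether no block-device mapping uses the candidate name
// verbatim, which is the only information the controller gets from EC2.
func isFree(candidate string, mappedNames []string) bool {
	for _, name := range mappedNames {
		if name == candidate {
			return false
		}
	}
	return true
}

func main() {
	// Device names as reported for this instance before the failed attach:
	// the root disk appears as /dev/sda1, not /dev/xvda.
	mappedNames := []string{"/dev/sda1", "/dev/xvdb"}
	fmt.Println(isFree("/dev/xvda", mappedNames)) // true, so the attach targets the kernel's root device name
}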

@anusha-ragunathan
Contributor

anusha-ragunathan commented May 15, 2019

The instance says "hey, /dev/xvda is available", so the plugin starts attaching the volume under that device name. However, the Linux system already uses /dev/xvda as the root partition and rejects the request.

anusha-ragunathan added a commit to anusha-ragunathan/aws-ebs-csi-driver that referenced this issue May 17, 2019
According to
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html,
on HVM instances, sda1 and xvda are reserved for the root partition. Also,
the available block device range for xvd follows the "/dev/xvd[b-c][a-z]"
pattern.

However, commit 1ace946 violates the reserved device names and the naming
logic. This causes volume attachment issues. For example, for instances
where /dev/xvda is the root partition, the driver tries to create a new
device with the same reserved name.
This change fixes the issue.

Testing:
Deployed a Kube 1.14 cluster.
Updated manifest.yaml with canary images for the sidecars and new names for
the API resources.
Tested the new plugin and created an sc, pvc (pv) and pod.
The volume attaches to the node successfully. The new device attached is
"/dev/xvdba" as expected.

The pod mounts the volume and runs fine.

Issue kubernetes-sigs#292

Signed-off-by: Anusha Ragunathan <anusha.ragunathan@docker.com>
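
A rough sketch of the naming scheme this commit message describes; the actual change in PR #298 may differ in details. Candidates come from the /dev/xvd[b-c][a-z] range, whose two-letter suffixes by construction never collide with the reserved root names (/dev/sda1 and /dev/xvda on HVM instances):

package main

import "fmt"

// nextDeviceName returns the first /dev/xvd[b-c][a-z] candidate that is not
// already in use on the instance. Every candidate has a two-letter suffix,
// so the reserved root names can never be picked.
func nextDeviceName(inUse map[string]struct{}) (string, error) {
	for _, first := range []rune{'b', 'c'} {
		for second := 'a'; second <= 'z'; second++ {
			name := fmt.Sprintf("/dev/xvd%c%c", first, second)
			if _, taken := inUse[name]; !taken {
				return name, nil
			}
		}
	}
	return "", fmt.Errorf("no device names left")
}

func main() {
	inUse := map[string]struct{}{"/dev/sda1": {}, "/dev/xvdb": {}}
	name, _ := nextDeviceName(inUse)
	fmt.Println(name) // /dev/xvdba, matching the device name reported in the testing notes above
}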
anusha-ragunathan added two further commits to anusha-ragunathan/aws-ebs-csi-driver that referenced this issue May 17, 2019, with the same commit message as above.
sreis pushed a commit to d2iq-archive/aws-ebs-csi-driver that referenced this issue Nov 5, 2019, with the same commit message as above.