I create a veth interface for my custom network configuration, and sometimes I see "no such device" errors when I call netlink.LinkSetNsPid on the newly created interface.
The interesting thing is that I was able to obtain the link handle via LinkByName or LinkList before I called LinkSetNsPid, so the device was created successfully. I also see the created devices with ip link.
I started debugging and added a LinkList call right after the error, plus another one after a 1 second delay. What I see consistently (whenever the issue occurs) is pretty weird: the first output usually contains only a few records with incorrect device index values, while the second one has all the records I have configured on the host, with correct indexes.
1st output (just after the failure), LinkList result has size 3:
time="2016-12-19T10:43:40Z" level=warning msg="=== debug interfaces, 1st attempt ==="
time="2016-12-19T10:43:40Z" level=debug msg="=== current interfaces list (count 3) ==="
time="2016-12-19T10:43:40Z" level=debug msg="interface record -> name: lo idx: 1"
time="2016-12-19T10:43:40Z" level=debug msg="interface record -> name: vethbada13B idx: 2"
time="2016-12-19T10:43:40Z" level=debug msg="interface record -> name: vethbada13A idx: 3"
time="2016-12-19T10:43:40Z" level=debug msg="=== end current interface list ==="
2nd output (after 1 second), LinkList result has size 23 (expected):
time="2016-12-19T10:43:41Z" level=warning msg="=== debug interfaces, 2nd attempt ==="
time="2016-12-19T10:43:41Z" level=debug msg="=== current interfaces list (count 23) ==="
time="2016-12-19T10:43:41Z" level=debug msg="interface record -> name: lo idx: 1"
time="2016-12-19T10:43:41Z" level=debug msg="interface record -> name: eth0 idx: 2"
time="2016-12-19T10:43:41Z" level=debug msg="interface record -> name: docker0 idx: 3"
time="2016-12-19T10:43:41Z" level=debug msg="interface record -> name: eth1 idx: 4"
time="2016-12-19T10:43:41Z" level=debug msg="interface record -> name: vetha434e0B idx: 1036"
time="2016-12-19T10:43:41Z" level=debug msg="interface record -> name: vetha434e0A idx: 1037"
We start a lot of containers, and this issue happens in only about 0.5-1% of launches (100 crashes out of 20000 tasks last night). We run on AWS and our instances are recycled every few days, but even so, the issue tends to recur on the same subset of hosts where it has already happened once (3-5 hosts out of 200).
@vishvananda I'm not sure whether this is a Go netlink bug or a kernel issue. If you have any recommendations on what to look at next, that would be super helpful. I can add more debug output around the netlink calls as well.
Our configuration is:
Linux mainvpc-r3.8xlarge-i-00e3fe8d965d0c577 3.19.0-26-generic #28~14.04.1-Ubuntu SMP Wed Aug 12 14:09:17 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
When we setns, it applies only to the current thread. It looks like your code is sometimes running on a thread in a different namespace, which gives you different results. First, make sure you are using LockOSThread for the duration of the namespace get/set. The other possibility is that the runtime is starting a new OS thread at some point (the lock doesn't prevent that), and that thread ends up in the wrong namespace. If that is the issue, it appears that the only safe approach is to exec a new process to do the work you need: lock the thread, do the work, then exit. See the discussion here: vishvananda/netns#17