Windows Containers route tables expected behaviour #386
Hello @gillg, I have the same issue.
Are you able to share your setup? Are you using containerd as your container runtime, or Docker?

> On Linux, by default, when you create a container, the host's main route table is reusable. So in the case of static routes (like for a cloud provider's link-local metadata API, IMDS at the IP 169.254.169.254), your container automatically has this route at runtime.

This is not default Linux behaviour. Depending on your container runtime, however, it may be the behaviour of the CNI plugin. If you run the following command with iproute2, you can see the default behaviour:

```
ip netns add cni-1234
```

You won't see any routes until you explicitly add them. Could this be a missing route in the root network namespace? What route are you adding? The route table you shared from the Linux container has a default route, so the root network namespace configuration would need to be shared to see the behaviour for that specific prefix.
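To expand on that point (this sketch is not from the original comment): a freshly created network namespace has an empty route table until something, typically the CNI plugin, populates it. A minimal demonstration, assuming root privileges and iproute2 are available:

```shell
# Create a bare network namespace and inspect it (requires root).
ip netns add cni-1234
ip netns exec cni-1234 ip route list   # prints nothing: the route table is empty
ip netns exec cni-1234 ip addr        # only a DOWN loopback interface exists
ip netns del cni-1234                 # clean up
```

Any route seen inside a container therefore had to be added explicitly by whatever set the network up, not inherited from the host by default.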
@MikeZappa87 my setup is as basic as possible.
A reminder, on Linux, from inside a container:
On Windows I also use Docker (not containerd), so I expect consistent behaviour. It is installed from an official zip package extracted into Program Files, with the service installed with
So the main questions are:
Let me know if you need more detailed information, or anything else. EDIT: exactly the same behaviour tested with version 24.0.4 on Windows. Same route table inside the container, and
Last note: the route I have to add via automation inside the container, because it is not applied as expected, is this one:
Is the issue only with 169.254.169.254? Does this happen with the others below? Do you have layer 2 connectivity to 10.102.28.65? Are you running that PowerShell script inside the container?
The problem is with all the persistent routes that are not active, but the others are not relevant to me. And yes, I have full connectivity to 10.102.28.64/26, because I reach a database server on this range. It's hard to say for sure for the implicit AWS router at 10.102.28.65, because it's completely locked down and a black box, but I should be able to, otherwise I would not reach my database. Or... for an unknown reason (Windows Firewall default rules?) I can reach it from the host but not from the container (while the container does reach the DB). Could that explain why the persistent routes are not added?
I am facing the same issue.
We also face this problem on GCE; we are currently using @gillg's solution to force the route inside our Docker containers. Host running Windows Server 2019 Datacenter, build 1809 (OS build 17763.4737).
Just came across this issue from #420, which turns out to be GCE-specific from August 2023, and hence is probably what @sam-sla was seeing, and maybe @Eishi2012, but that is not the original problem.

It's odd to have routes to 169.254.0.0/16 addresses; those are "link-local" and should not be routed. You can see this in the Linux example: it has an explicit route via 0.0.0.0, which AFAIK actually means "on-link", and just explicitly states which link to use by default for such messages. You can also see that there's no such route inside the container on Linux, i.e. "your container automatically has this route at runtime" is not actually what's happening here.

My guess for the original question (although I actually don't know how those routes got into the container's registry): did you already try to persist the routes inside the container? I don't expect Windows to copy random stuff from the host registry into the container registry, but maybe the network management system does that here as a side effect, even though it doesn't use the routes; docker/for-win#12297 suggests they used to be both copied and applied, which is buggy because containers don't live on the same network as the host, so maybe that has been fixed since.

By default, you don't need routes for 169.254.0.0/16 on Windows; it appears to handle that network internally without the route. However, if you have more than one interface, such a route would tell Windows which link to use if you don't specify the interface when sending packets, e.g. with

Technically, I think the Linux example is doing the wrong thing, as it forwarded 169.254.169.254 to its gateway (172.17.0.1), which then forwarded it to the host's eth0 per its local config. The container is doing what it's been told, because there isn't a link-local route set up inside it, but per the RFC, the router at 172.17.0.1 should have dropped those packets, not forwarded them.
Ignoring those rules happens to make AWS's link-local services (IMDS, DNS, etc.) work from inside containers somewhat by accident. Windows doesn't ignore those rules unless you explicitly tell it to by adding a route, so you can't reach 169.254.0.0/16 if it's not actually on the local link, and the local link here is a Hyper-V virtual network adapter. So 169.254.169.254 from inside the container should be trying to connect to a host on that virtual network, and of course the EC2 metadata service etc. are not present on that virtual network.

To some extent this is a legacy of AWS using link-local addresses in a world where hosts may have internal networks (which Google recently bounced off, as seen in #420). In IPv6 this is resolved by the use of Unique Local Addresses (which are routable within a site, just not out into the world), which would do the right thing in this case.

Assuming IPv6 isn't an option, you may want to consider hooking up your container to a custom transparent network. I haven't tested this myself, but I believe that should produce a network that can see the host's link-local peers, since transparent networks allow you to DHCP from the outside world, and that's approximately the same thing.
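To make the "link-local" property concrete (an illustration added here, not part of the original discussion): Python's standard `ipaddress` module classifies these ranges directly, and RFC 3927 is what says routers must not forward 169.254.0.0/16 traffic.

```python
import ipaddress

# The AWS IMDS address sits inside the IPv4 link-local block 169.254.0.0/16.
imds = ipaddress.ip_address("169.254.169.254")
print(imds.is_link_local)                               # True
print(imds in ipaddress.ip_network("169.254.0.0/16"))   # True

# An IPv6 Unique Local Address, by contrast, is private (site-routable)
# but not globally routable, which is the distinction made above.
ula = ipaddress.ip_address("fd00::1")
print(ula.is_private, ula.is_global)                    # True False
```

This is why the Linux NAT gateway forwarding those packets is technically a rule violation, while Windows dropping them is the conservative reading of the same rule.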
So I was thinking about this some more, and HCN's network-create API can include a list of routes for the network. So it's possible, but untested, that a Docker custom network (or the equivalent in your container-hosting environment, e.g. a Kubernetes Pod's network) could be defined with an explicit route for 169.254.0.0/16 to the HCN gateway, so that individual containers on that network don't need the route added directly, and don't need to use transparent mode (which would conflict with expected Kubernetes Pod behaviour, for example).

Of course, this would need to be implemented in the various HCN users (Docker, the CNI plugins, etc.), so even if it is possible, it seems unlikely to happen any time soon; but someone who has this use-case could file a feature request with the appropriate runtime system and see if anything comes of it.
Many thanks @TBBle for this very long but precise and realistic analysis. While searching for workarounds, I also discovered that AWS introduced some IPv6 addresses

So my only remaining question is by what magic those routes are "persisted" in the container when we launch it, while Windows doesn't "apply" them as the default behaviour would suggest. It seems more sensible not to have those routes persisted at all when we start the container.

Outside of that, I partially understand your proposal for moving forward in the right way. I'll have to dig deeper into HCN to understand your idea.
I haven't had a chance to play with IPv6 in Windows containers, but a quick look at that route table shows no default gateway for IPv6, suggesting that IPv6 is not active on the virtual network's gateway router, as normally it would advertise itself as a router and routes would appear in the IPv6 config by

https://docs.docker.com/engine/reference/commandline/network_create/ suggests that IPv6 is opt-in for Docker-created networks. https://learn.microsoft.com/en-au/virtualization/windowscontainers/container-networking/architecture#unsupported-features-and-network-options notes that IPv6 works with l2bridge but not NAT or overlay networks (they also list the

However, poking around suggests that Transparent mode doesn't work on EC2, presumably for similar reasons to why it doesn't work on Azure: the cloud networking infrastructure does not allow MAC spoofing. So I fear that IPv6 in containers on AWS/EC2 is likely to be a dead end if you cannot set up an l2bridge configuration. #230 suggests the same thing. https://techcommunity.microsoft.com/t5/networking-blog/l2bridge-container-networking/ba-p/1180923 (and maybe a simpler version in this comment) shows a way of using l2bridge with Docker, but doesn't touch on IPv6, so it may or may not result in working IPv6 if the host is IPv6-connected.

That said, I honestly haven't looked very closely at this, as my recent focus was only on getting enough networking going to run BuildKit, so lately I've only played with NAT via the Windows CNI plugin. So I can't speak to the practicality of any of this right now, sadly.

Edit: This issue isn't really in a satisfactory place, so as a possible workaround, a host-based proxy for IMDS (or the equivalent on other cloud providers) would also be possible. However, it must only be available to containers on that host.
So having it running in a container attached to the same virtual network would be a safer option; either way, there's still the challenge of telling your code or SDK that IMDS is actually found at a different IP address. So still not great, but if this turns out to be an absolute blocker, it's a workaround worth evaluating.
Have you tried the latest AWS AMI? We had a conversation with AWS and they were changing this behavior.
Which behavior do you mean? IPv6, or IPv4 with persistent routes? And when you say they made a change, was it in the very latest AMI? I would love more details about this potential change, because I'll need to account for it to avoid problems ^^
Sorry for the delay @gillg, is this still an issue you're facing?
Is it still inconsistent/unexpected behaviour for a container running in the cloud? Yes. Is it the expected behaviour per the IP RFCs...? Also yes. So I would say we should at least clearly document the "best practice" for making a Windows container able to reach a cloud metadata API: avoid black magic, avoid bad ideas, and give some official guidelines.
Also, because IPv6 is not supported in NAT mode, things become very complex to deal with.
As mentioned earlier, 169.254.0.0/16 addresses are typically link-local and shouldn't be routed. Usually, the system's network stack manages this internally, directing traffic to the appropriate interface. If no matching interface exists, the packet is dropped.

Generally, the CNI handles configuring the container network, including access to the host's link-local addresses. Here's an example using an AKS cluster with the Azure CNI plugin's default settings on a Windows Server 2022 (KB5041948) node.

From within the pod (IP: 10.224.0.44):

Again, from the same pod:
Notice how we can access 169.254.169.254 from within the pod without requiring any internal routing. However, if the CNI plugin doesn’t manage this, you’ll need to manually configure the route to these link-local IP addresses, similar to what's been discussed in this thread. Here is an example of adding these routes from within a pod, based on @gillg’s example:
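A sketch of what such a manual route-add can look like (this is illustrative, not the exact example from the thread; it assumes a NAT-mode network where the vNIC's default gateway can forward to the host, and uses `New-NetRoute`, the modern equivalent of `route add`):

```powershell
# From inside the container/pod: look up the default gateway, then route the
# link-local metadata IP through it (illustrative sketch, environment-dependent).
$default = Get-NetRoute -DestinationPrefix '0.0.0.0/0' |
    Sort-Object RouteMetric | Select-Object -First 1
New-NetRoute -DestinationPrefix '169.254.169.254/32' `
    -InterfaceIndex $default.ifIndex -NextHop $default.NextHop
```

Whether the gateway then forwards the link-local traffic onward depends on the host-side network, as discussed above.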
Closing for now, feel free to reopen if anything else comes up. |
We could not get it to work with /16; we were getting

This uses the interface alias instead of the index to simplify the commands:

```powershell
$gateway = Get-NetRoute | Where-Object { $_.DestinationPrefix -eq '0.0.0.0/0' } |
    Sort-Object RouteMetric | Select-Object -ExpandProperty NextHop
New-NetRoute -DestinationPrefix 169.254.0.0/16 -InterfaceAlias "vEthernet (Ethernet)" -NextHop $gateway
```

To validate, you can run:

```powershell
[string]$token = Invoke-RestMethod -Headers @{"X-aws-ec2-metadata-token-ttl-seconds" = "21600"} -Method PUT -Uri http://169.254.169.254/latest/api/token
```

Edit:
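As a follow-up (not in the original comment): per AWS's documented IMDSv2 flow, the `$token` retrieved above is then sent as the `X-aws-ec2-metadata-token` header on subsequent metadata requests, for example:

```powershell
# Only works on an EC2 instance where 169.254.169.254 is reachable:
Invoke-RestMethod -Headers @{ "X-aws-ec2-metadata-token" = $token } `
    -Uri http://169.254.169.254/latest/meta-data/instance-id
```

If this returns the instance ID from inside the container, the added route is working end to end.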
You could use
Hello,
I have been facing this issue for a long time; we have workarounds and tricks in place, but I would like a more elaborate answer about the expected behaviour, possibly a better approach, and to see what a cleaner solution could look like.
On Linux, by default, when you create a container, the host's main route table is reusable. So in the case of static routes (like for a cloud provider's link-local metadata API, IMDS at the IP 169.254.169.254), your container automatically has this route at runtime.
Example on vanilla AWS Linux2 AMI:
From inside a created container:
On Windows, currently, you have static routes at the host level, and these routes are inherited inside the container. But they seem not to be "applied" as active, so they are kind of useless and confusing. By the way, why is a route in the persistent store not always active?
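One way to see this discrepancy directly (this snippet is not from the original post, but `-PolicyStore` is a standard parameter of `Get-NetRoute`): compare the persistent store against the active store from inside the container.

```powershell
# Routes recorded in the registry (inherited from the host):
Get-NetRoute -PolicyStore PersistentStore
# Routes the network stack is actually using right now:
Get-NetRoute -PolicyStore ActiveStore
```

A prefix that appears in the first listing but not the second is exactly a "persisted but not applied" route, which is the behaviour described above.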
Example on a vanilla AWS Windows server AMI:
From inside a container:
And if I try to reach 169.254.169.254 it's impossible.
The obvious solution is to manually create the route, for example with the `route add` command, but what I don't understand is why it doesn't work when the PersistentStore already contains the routes.
Moreover, if the route is unknown, we should fall back to the default route

```
0.0.0.0          0.0.0.0        172.30.42.1    172.30.42.214   5256
```

and the host should route the request with NAT. So I don't really understand the current behaviour. Tweaking the route table at startup, at runtime, doesn't seem like a clean and reliable solution, assuming you could host your image on different cloud providers and need different routes in different contexts... This should be managed at the container runtime level.
Any insights or opinions welcome!