AWS VPC limits curb scalability of aws-vpc backend #164

Closed
rohansingh opened this issue Apr 30, 2015 · 9 comments
@rohansingh
Contributor

At Spotify we've actually determined the aws-vpc backend isn't usable for us, partly due to AWS VPC limits:
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html

Specifically, the "Entries per route table" limit is 50. While the documentation states that the number of entries per route table can be increased on request, here's Spotify's experience as related by one of our devops folks:

yeah, we raised the limit (200?) but apparently the routes over 100 are just skipped...
which is slightly better because before if we went over 100 it completely crashed

So yeah, unfortunately it looks like aws-vpc is unusable for more than a couple dozen nodes.
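
For anyone hitting this, a quick way to check how close an existing table is to that limit is to count its entries directly. A minimal sketch using aws-sdk-go (the region and route table ID are placeholders, not values from this thread):

```go
// Count the entries in a route table to see how close a cluster is to the
// default 50-entry VPC limit discussed above.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)

	out, err := svc.DescribeRouteTables(&ec2.DescribeRouteTablesInput{
		RouteTableIds: []*string{aws.String("rtb-0123456789abcdef0")}, // placeholder route table ID
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, rt := range out.RouteTables {
		// Each flannel host consumes one entry, so the count here effectively
		// caps the cluster size under the aws-vpc backend.
		fmt.Printf("%s: %d entries (default VPC limit: 50)\n",
			aws.StringValue(rt.RouteTableId), len(rt.Routes))
	}
}
```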

@rynbrd

rynbrd commented May 3, 2015

The VPC backend also has the limitation that all hosts must be in subnets which share the same route table.

Our cluster layout includes a subnet to which we deploy our proxy containers. This uses a different route table as these are the only hosts with external IPs and direct internet access.

A solution to both problems could involve adding support for multiple route tables. The implementation would likely include a mapping of host subnets to route table IDs.

The above solution would allow us to bypass the 100-entry limit in a single route table by assigning a new route table to each subnet.
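
For illustration, a subnet-to-route-table mapping like the one proposed above could be expressed in the backend config roughly like this. The RouteTableMappings field is purely hypothetical, not an existing flannel option:

```go
// Sketch of a hypothetical aws-vpc backend config carrying a mapping of
// host subnets to route table IDs, as proposed in this comment.
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

type awsVPCBackendConfig struct {
	Type string `json:"Type"`
	// Hypothetical field: host subnet CIDR -> route table to update for
	// hosts that live in that subnet.
	RouteTableMappings map[string]string `json:"RouteTableMappings"`
}

func main() {
	raw := `{
		"Type": "aws-vpc",
		"RouteTableMappings": {
			"10.0.1.0/24": "rtb-aaaaaaaa",
			"10.0.2.0/24": "rtb-bbbbbbbb"
		}
	}`
	var cfg awsVPCBackendConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		log.Fatal(err)
	}
	// A host would look up its own subnet here and write its flannel route
	// only into the matching table, keeping each table under the entry limit.
	fmt.Println(cfg.RouteTableMappings["10.0.1.0/24"])
}
```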

@eyakubovich
Contributor

Thank you for this feedback. Having good support for aws-vpc would be very valuable, since it avoids encapsulation overhead. We're currently working on adding a client/server option: the idea is that all the flannel daemons talk to a flannel server instead of to etcd directly. This is needed for deployments where there's a desire to restrict the set of machines that can access etcd, but it will also allow the flannel server to be the one modifying the route tables, so that only the server needs access to the tokens/IAM role required to modify routes.

As part of that work, we'll try to see how we can support multiple route tables. And PRs are always welcome.
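
Purely as an illustration of that stateless-server idea (not flannel's actual client/server API), the route-modifying piece could be as small as an HTTP endpoint that is the only process holding EC2 credentials:

```go
// Hypothetical sketch: daemons report their flannel lease to a small service,
// and only that service has the IAM permissions needed to edit route tables.
// The endpoint, parameters, and route table ID are all placeholders.
package main

import (
	"log"
	"net/http"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	http.HandleFunc("/route", func(w http.ResponseWriter, r *http.Request) {
		// Hypothetical parameters: the subnet flannel leased to a host and
		// that host's EC2 instance ID.
		subnet := r.URL.Query().Get("subnet")
		instance := r.URL.Query().Get("instance")

		_, err := svc.CreateRoute(&ec2.CreateRouteInput{
			RouteTableId:         aws.String("rtb-0123456789abcdef0"), // placeholder
			DestinationCidrBlock: aws.String(subnet),
			InstanceId:           aws.String(instance),
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})

	// All state lives in etcd / EC2 itself, so any replica of this process
	// can take over if one fails.
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```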

@rohansingh
Contributor Author

Interesting, good to know. Are you considering the availability implications? It would be a no-go for us if the flannel server/master became a single point of failure.

@eyakubovich
Contributor

@rohansingh The client/server will be an opt-in option so you'll be able to keep things as is. But more importantly, the server will be stateless -- the data is still stored in etcd. If a server fails, a new one can be brought up, hopefully automatically by a cluster scheduler such as fleet or Kubernetes.

@rohansingh
Contributor Author

@eyakubovich Sounds great! Thanks for the clarification :)

@Grindizer

Hi everyone,

We are also investigating the use of aws-vpc for our cluster. The most annoying limitation for us is the need for all subnets to share the same route table: in our case we use different subnets (one per AZ), and each one points to a NAT instance in its own AZ, so the route tables differ per subnet.

I was wondering whether updating more than one route table would be a possible solution. The list of route tables to alter could be selected with an AWS tag: the backend would update any route table tagged a certain way, with the tag name and value given as parameters (see the sketch below).
Or we could also imagine an Auto Scaling group feature, where the backend introspects the instance's Auto Scaling group, fetches the list of subnets involved, and then updates every associated route table.

For now the backend inspects the instance's subnet and updates the associated route table.
That wouldn't solve the entry-limit problem, but it would certainly make the backend more usable in many cases.
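
To make the tag-based idea concrete, here is a rough sketch (not an existing flannel feature) that finds every route table carrying an agreed-upon tag and adds the host's flannel route to each. The tag key/value, subnet, and instance ID are placeholders:

```go
// Update every route table tagged flannel=true with this host's route,
// so hosts behind different route tables (e.g. one per AZ) can reach it.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Select route tables by tag, as suggested above.
	out, err := svc.DescribeRouteTables(&ec2.DescribeRouteTablesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("tag:flannel"),
			Values: []*string{aws.String("true")},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, rt := range out.RouteTables {
		// Add this host's flannel subnet to every tagged table.
		_, err := svc.CreateRoute(&ec2.CreateRouteInput{
			RouteTableId:         rt.RouteTableId,
			DestinationCidrBlock: aws.String("10.5.34.0/24"),       // this host's flannel subnet
			InstanceId:           aws.String("i-0123456789abcdef0"), // this host's instance ID
		})
		if err != nil {
			log.Printf("failed to update %s: %v", aws.StringValue(rt.RouteTableId), err)
		}
	}
}
```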

@bernielomax

@Grindizer did you have any luck with this? I would like to do the same.

@anubhavmishra

@rohansingh What did you guys end up using? VXLAN option?

@rohansingh
Contributor Author

@anubhavmishra We went with this (copied from a presentation, sorry):

Alternative: BGP

  • Border Gateway Protocol, the routing protocol of
    the internet.
  • BGP peers connect to each other and exchange routes.

OMG. What did you do?

  • We configured our top-of-rack switches to accept
    routes from our docker hosts.
  • And then installed a bgp daemon, bird, on each
    docker host.

So how does it work now?

  1. We install flannel on docker hosts, just to do
    IP allocation.
  2. Docker uses the machine subnet allocated by flannel.
  3. bird takes a look at what the machine subnet is,
    and advertises this to the top-of-rack switch.
  4. That top-of-rack switch exchanges routes with its
    peers, so packets get routed correctly.

Is this better?

  • More complexity at startup since we need both
    bird and BGP to work.
  • But after initialization, the system is resilient.
  • Also, packets are "normal":
    traceroute makes sense, etc.
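
For reference, step 3 above can be approximated by reading the subnet flannel leased to the host and emitting a bird static-route stanza that a separate `protocol bgp` block would export to the top-of-rack switch. This is a sketch assuming flannel's default subnet file location and bird 1.x config syntax, not Spotify's exact setup:

```go
// Read the host's flannel subnet from /run/flannel/subnet.env and print a
// bird static-route stanza for it; a "protocol bgp" block in bird.conf
// would then export this route to the BGP peer.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/run/flannel/subnet.env")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var subnet string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like FLANNEL_SUBNET=10.5.34.1/24
		if strings.HasPrefix(sc.Text(), "FLANNEL_SUBNET=") {
			subnet = strings.TrimPrefix(sc.Text(), "FLANNEL_SUBNET=")
		}
	}
	if subnet == "" {
		log.Fatal("FLANNEL_SUBNET not found")
	}

	// FLANNEL_SUBNET holds the gateway address; convert it to the network
	// prefix (10.5.34.1/24 -> 10.5.34.0/24) before advertising it.
	_, ipnet, err := net.ParseCIDR(subnet)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("protocol static flannel_routes {\n\troute %s reject;\n}\n", ipnet)
}
```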

tomdee closed this as completed Mar 13, 2017