-
Notifications
You must be signed in to change notification settings - Fork 804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace kube-lego + nginx ingress with traefik #1539
Conversation
When we [picked](jupyterhub#229) kube-lego to do automatic HTTPS in 2017, it seemed the easiest and most kubernetes native way to do let's encrypt based HTTPS. It was fairly simple, and worked pretty well for our use case. Users of this chart could set up HTTPS without having to know anything about certificates, challenges or let's encrypt. Life was good. Life moves on, and so does code. kube-lego was deprecated in favor of [cert-manager](https://github.com/jetstack/cert-manager), which was far more general. It wasn't tied to Let's Encrypt, and could do a lot more things. It was far more kubernetes native, and slicker. However, this came with a massive increase in complexity, which is justifyable if you need many certificates in many situations. However, for the most common case of just one certificate per hub, this was super massive overkill. This PR replaces the current autohttps with autocertbot, a tiny wrapper around the recommended upstream let's encrypt software - [certbot](https://certbot.eff.org/). certbot requires persistent storage under /etc/letsencrypt, and we would prefer to not. This was the primary reason I picked kube-lego over certbot back in the day. However, a small wrapper script that sync's the contents of /etc/letsencrypt to a kubernetes secret when needed is small, easy to reason about and solves all our problems. This PR should be fully backwards compatible. autocertbot should be able to renew certificates that kube-lego was handling before, and issue new certificates as well. Since it is just wrapping upstream certbot, there are fewer chances of a deprecation hitting us hard. Fixes jupyterhub#1448
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! This seems like a huge simplification
Excellent!!! I like this a lot! |
This is soo valuable work! Lets make a quick 0.10.0 release with this when it lands. I'm looking forward to review this on sunday! I looked it through from mobile now and it looks good to me so far. |
nginx proved more complicated than hoped for, primarily due to lack of hot reloading. The two blockers were: 1. nginx won't start without SSL certs existing, but we won't get them until after nginx starts to use the webroot challenge! 2. When certificates get renewed, nginx needs a reload. We could have fixed these, but I realized the reason we were not using traefik for this was that it needs persistent disk to put its let's encrypt config in. Since that is no longer a problem due to the secret sync, I switched us over to traefik instead. *Much* cleaner, simpler and straightforward!!!
I rewrote this again to use traefik instead of nginx. d22c69f has information on why. |
This works really well! Things left to do:
|
@yuvipanda wieee! I figure we should release 0.9.0 without this and then go for 0.10.0 with this promptly after. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have more learning to do regarding certbot and your work but I want to properly understand it so I'll spend time to do so. Thank you for this work Yuvi, it is solving a urgent matter!
traefik does not seem to have a problem with this as far as I can test. I used the code in jupyterhub/mybinder.org-deploy#210 (comment) to test, and I crashed my browser around the 160M mark before I ran into the entity too large issue
Matches previous config. Users can adjust the maxAge as well as includeSubdomains.
This is good enough with emptyDir
Without this PRNo HTTPSIncoming traffic hits chp, which then routes it to hub / user pods Manual HTTPSIncoming traffic hits CHP, which does TLS termination, then routes it to hub / user pods Let's Encrypt HTTPS (autohttps)When it was workingIncoming traffic hit nginx-ingress, which did TLS termination, then routed it to CHP, which then routed to hub / user pods NowIf you turn on autohttps, your entire hub is broken, because kube-lego is deprecated and you can not get a new account. So if you want HTTPS, you need external complex options With this PRNo HTTPSNo Change Manual HTTPSs Let's Encrypt HTTPSIncoming traffic hits traefik, which does TLS termination, then routed to CHP, which routes to hub / user pods So the major difference right now, is that without this PR Let's Encrypt HTTPS is completely broken, so nobody can really use the nginx ingress. If they had set up the hub before November 2019 (based on this), then the HTTPS will continue to renew until June 2020. However, if they have to move the hub, or re-install, or install a new hub - no HTTPS. So IMO, this is a bugfix to unbreak HTTPS. I currently have it deployed from the PR for berkeley's big hubs, and it seems to work out well so far. I'll try find some time today to write better docs for the traefik config! Thank you for working on this slow moving, complex but important work <3 |
- Make sure logging actually works - Remove some unused config items
remove no-longer-used sections
Since testing letsencrypt isn't feasible right now, I went through a manual deployment following the guide (with Thanks for fixing this, @yuvipanda! |
@yuvipanda wieeeeeee awesome work! ❤️ 🎉 this feels sooooo good to have merged! |
Thank you for the merge and testing, @minrk! And thank you for reviewing, @consideRatio & @betatim! |
0.9.0-beta.3 released, I gained a lot of confidence that this would not disrupt users not using autohttps along the way, thanks for explaining the difference between this TLS termination and CHP termination! |
Thanks for making the release, @consideRatio! I also wanted to note two things I discovered:
|
We were waiting for jupyterhub/zero-to-jupyterhub-k8s#1539 to be merged. It has been merged!
Stupid NOOB question on this (and this may be the wrong place to put this, I'm not sure): |
@dahjah yepp I think that should be safe to do, I consider it a good idea. |
Thank you @consideRatio. Works like a charm! |
Ohhh!!! VERY relevant advice thank you for describing this! |
When we
picked
kube-lego to do automatic HTTPS in 2017, it seemed the easiest and
most kubernetes native way to do let's encrypt based HTTPS. It was
fairly simple, and worked pretty well for our use case. Users of
this chart could set up HTTPS without having to know anything about
certificates, challenges or let's encrypt. Life was good.
Life moves on, and so does code. kube-lego was deprecated in
favor of cert-manager,
which was far more general. It wasn't tied to Let's Encrypt,
and could do a lot more things. It was far more kubernetes native,
and slicker. However, this came with a massive increase in
complexity, which is justifyable if you need many certificates in
many situations. However, for the most common case of just one
certificate per hub, this was super massive overkill.
This PR replaces the current autohttps with traefik. traefik's
Let's encrypt implementation requires persistent
storage to put acme.json into, and we would prefer to not. This
was the primary reason I picked kube-lego over other implementations
(like certbot) back in
the day. However, a small wrapper script that sync's the contents
of acme.json to a kubernetes secret when needed is small,
easy to reason about and solves all our problems.
This PR should be fully backwards compatible. traefik should
be able to renew certificates that kube-lego was handling before,
and issue new certificates as well. Since it is just wrapping
upstream certbot, there are fewer chances of a deprecation hitting
us hard.
d22c69f has information on why
we used traefik instead of nginx.
Fixes #1448