Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace kube-lego + nginx ingress with traefik #1539

Merged
merged 26 commits into from
Jan 17, 2020

Conversation

yuvipanda
Copy link
Collaborator

@yuvipanda yuvipanda commented Jan 8, 2020

When we
picked
kube-lego to do automatic HTTPS in 2017, it seemed the easiest and
most kubernetes native way to do let's encrypt based HTTPS. It was
fairly simple, and worked pretty well for our use case. Users of
this chart could set up HTTPS without having to know anything about
certificates, challenges or let's encrypt. Life was good.

Life moves on, and so does code. kube-lego was deprecated in
favor of cert-manager,
which was far more general. It wasn't tied to Let's Encrypt,
and could do a lot more things. It was far more kubernetes native,
and slicker. However, this came with a massive increase in
complexity, which is justifyable if you need many certificates in
many situations. However, for the most common case of just one
certificate per hub, this was super massive overkill.

This PR replaces the current autohttps with traefik. traefik's
Let's encrypt implementation requires persistent
storage to put acme.json into, and we would prefer to not. This
was the primary reason I picked kube-lego over other implementations
(like certbot) back in
the day. However, a small wrapper script that sync's the contents
of acme.json to a kubernetes secret when needed is small,
easy to reason about and solves all our problems.

This PR should be fully backwards compatible. traefik should
be able to renew certificates that kube-lego was handling before,
and issue new certificates as well. Since it is just wrapping
upstream certbot, there are fewer chances of a deprecation hitting
us hard.

d22c69f has information on why
we used traefik instead of nginx.

Fixes #1448

When we
[picked](jupyterhub#229)
kube-lego to do automatic HTTPS in 2017, it seemed the easiest and
most kubernetes native way to do let's encrypt based HTTPS. It was
fairly simple, and worked pretty well for our use case. Users of
this chart could set up HTTPS without having to know anything about
certificates, challenges or let's encrypt. Life was good.

Life moves on, and so does code. kube-lego was deprecated in
favor of [cert-manager](https://github.com/jetstack/cert-manager),
which was far more general. It wasn't tied to Let's Encrypt,
and could do a lot more things. It was far more kubernetes native,
and slicker. However, this came with a massive increase in
complexity, which is justifyable if you need many certificates in
many situations. However, for the most common case of just one
certificate per hub, this was super massive overkill.

This PR replaces the current autohttps with autocertbot, a tiny
wrapper around the recommended upstream let's encrypt software -
[certbot](https://certbot.eff.org/). certbot requires persistent
storage under /etc/letsencrypt, and we would prefer to not. This
was the primary reason I picked kube-lego over certbot back in
the day. However, a small wrapper script that sync's the contents
of /etc/letsencrypt to a kubernetes secret when needed is small,
easy to reason about and solves all our problems.

This PR should be fully backwards compatible. autocertbot should
be able to renew certificates that kube-lego was handling before,
and issue new certificates as well. Since it is just wrapping
upstream certbot, there are fewer chances of a deprecation hitting
us hard.

Fixes jupyterhub#1448
Copy link
Member

@minrk minrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! This seems like a huge simplification

@consideRatio
Copy link
Member

Excellent!!! I like this a lot!

@consideRatio
Copy link
Member

consideRatio commented Jan 8, 2020

This is soo valuable work! Lets make a quick 0.10.0 release with this when it lands. I'm looking forward to review this on sunday! I looked it through from mobile now and it looks good to me so far.

nginx proved more complicated than hoped for, primarily due to
lack of hot reloading. The two blockers were:

1. nginx won't start without SSL certs existing, but we won't get
   them until after nginx starts to use the webroot challenge!
2. When certificates get renewed, nginx needs a reload.

We could have fixed these, but I realized the reason we were not
using traefik for this was that it needs persistent disk to put its
let's encrypt config in. Since that is no longer a problem due to
the secret sync, I switched us over to traefik instead. *Much*
cleaner, simpler and straightforward!!!
@yuvipanda yuvipanda changed the title [WIP] Replace kube-lego + nginx ingress with certbot + nginx [WIP] Replace kube-lego + nginx ingress with traefik Jan 9, 2020
@yuvipanda
Copy link
Collaborator Author

I rewrote this again to use traefik instead of nginx. d22c69f has information on why.

jupyterhub/values.yaml Outdated Show resolved Hide resolved
jupyterhub/values.yaml Outdated Show resolved Hide resolved
@yuvipanda yuvipanda changed the title [WIP] Replace kube-lego + nginx ingress with traefik Replace kube-lego + nginx ingress with traefik Jan 13, 2020
@yuvipanda
Copy link
Collaborator Author

yuvipanda commented Jan 13, 2020

This works really well!

Things left to do:

  • HSTS
  • verify that saving large notebooks works

@consideRatio
Copy link
Member

@yuvipanda wieee! I figure we should release 0.9.0 without this and then go for 0.10.0 with this promptly after.

Copy link
Member

@consideRatio consideRatio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have more learning to do regarding certbot and your work but I want to properly understand it so I'll spend time to do so. Thank you for this work Yuvi, it is solving a urgent matter!

traefik does not seem to have a problem with this as far as
I can test.

I used the code in jupyterhub/mybinder.org-deploy#210 (comment)
to test, and I crashed my browser around the 160M mark before I ran
into the entity too large issue
Matches previous config. Users can adjust the maxAge as well
as includeSubdomains.
This is good enough with emptyDir
@yuvipanda
Copy link
Collaborator Author

Without this PR

No HTTPS

Incoming traffic hits chp, which then routes it to hub / user pods

Manual HTTPS

Incoming traffic hits CHP, which does TLS termination, then routes it to hub / user pods

Let's Encrypt HTTPS (autohttps)

When it was working

Incoming traffic hit nginx-ingress, which did TLS termination, then routed it to CHP, which then routed to hub / user pods

Now

If you turn on autohttps, your entire hub is broken, because kube-lego is deprecated and you can not get a new account. So if you want HTTPS, you need external complex options

With this PR

No HTTPS

No Change

Manual HTTPS

s
No change

Let's Encrypt HTTPS

Incoming traffic hits traefik, which does TLS termination, then routed to CHP, which routes to hub / user pods

So the major difference right now, is that without this PR Let's Encrypt HTTPS is completely broken, so nobody can really use the nginx ingress. If they had set up the hub before November 2019 (based on this), then the HTTPS will continue to renew until June 2020. However, if they have to move the hub, or re-install, or install a new hub - no HTTPS.

So IMO, this is a bugfix to unbreak HTTPS. I currently have it deployed from the PR for berkeley's big hubs, and it seems to work out well so far.

I'll try find some time today to write better docs for the traefik config!

Thank you for working on this slow moving, complex but important work <3

- Make sure logging actually works
- Remove some unused config items
remove no-longer-used sections
@minrk
Copy link
Member

minrk commented Jan 17, 2020

Since testing letsencrypt isn't feasible right now, I went through a manual deployment following the guide (with helm install ./jupyterhub/ instead of jupyterhub/jupyterhub) and everything worked great! With the last commit adding nice docs, I think we should go ahead and land this now.

Thanks for fixing this, @yuvipanda!

@minrk minrk merged commit 80973b7 into jupyterhub:master Jan 17, 2020
@consideRatio
Copy link
Member

@yuvipanda wieeeeeee awesome work! ❤️ 🎉 this feels sooooo good to have merged!

@yuvipanda
Copy link
Collaborator Author

Thank you for the merge and testing, @minrk!

And thank you for reviewing, @consideRatio & @betatim!

@consideRatio
Copy link
Member

0.9.0-beta.3 released, I gained a lot of confidence that this would not disrupt users not using autohttps along the way, thanks for explaining the difference between this TLS termination and CHP termination!

@yuvipanda
Copy link
Collaborator Author

Thanks for making the release, @consideRatio!

I also wanted to note two things I discovered:

  1. Upgrading from nginx autohttps requires you delete the old autohttps deployment. kubectl -n hub-namespace delete deployment -l component=authottps. This is necessary because we are changing some immutable fields in the deployment, and kubernetes will not do an in-place change. However, maybe --force works - I have not tested it.
  2. This gets rid of a lot of our ClusterRoles, which will help with using this in higher security contexts

@dahjah
Copy link

dahjah commented Feb 19, 2020

Stupid NOOB question on this (and this may be the wrong place to put this, I'm not sure):
Having just installed jupyterhub, it still uses the stable 0.8.2 release. Would it be safe to run
helm upgrade --install jhub jupyterhub/jupyterhub --namespace jhub --version=0.9.0-beta.3 --values config.yaml to get this working, or would that be a bad idea?

@consideRatio
Copy link
Member

consideRatio commented Feb 19, 2020

@dahjah yepp I think that should be safe to do, I consider it a good idea.

@dahjah
Copy link

dahjah commented Feb 26, 2020

Thank you @consideRatio. Works like a charm!
In case anyone comes here with the same problem later, I also had to delete the old autohttps deployment as mentioned above before it would work.

@consideRatio
Copy link
Member

Ohhh!!! VERY relevant advice thank you for describing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

letsencrypt now wants ACMEv2
9 participants