blog: add an article about the 2023-07-07 outage #6532

Merged
@@ -15,9 +15,9 @@ links:

The Jenkins project packages and plugins are hosted through a network of mirror servers (provided by our sponsors) close to your location.

It provides a "HTTP redirector" service hosted behind the `get.jenkins.io`, `mirrors.jenkins.io` and `mirrors.jenkins-ci.org` domains, with a new public IP (`20.119.232.75`) since the 12th of June 2023.
It provides a "HTTP redirector" service hosted behind the `get.jenkins.io`, `mirrors.jenkins.io` and `mirrors.jenkins-ci.org` domains, with a new public IP: +++<s>`20.119.232.75`</s>+++ `20.7.178.24` (as per link:/blog/2023/07/12/jenkins-mirrors-postmortem-outage/[]) since the 12th of June 2023.
The former redirector service and its previous IPv4 will be decommissioned the 27th of June 2023.

IMPORTANT: Please update your DNS servers and firewall rules to the new IP `20.119.232.75` if you are in a restricted environment.
IMPORTANT: Please update your DNS servers and firewall rules to the new IP +++<s>`20.119.232.75`</s>+++ `20.7.178.24` if you are in a restricted environment.

TIP: More details in https://github.com/jenkins-infra/helpdesk/issues/3351.
@@ -0,0 +1,120 @@
---
layout: post
title: "Post Mortem of the 7th July 2023 Jenkins Infrastructure Outage"
tags:
- infrastructure
- mirrors
- jenkins
- outage
- postmortem
authors:
- dduportal
opengraph:
image: /images/logos/fire/fire.svg
links:
discourse: true
---

On Friday 7th of July 2023, the Jenkins infrastructure suffered a major outage from 11:05 UTC until 15:25 UTC, with complete downtime of the following public services:

* accounts.jenkins.io
* fallback.get.jenkins.io
* get.jenkins.io
* incrementals.jenkins.io
* javadoc.jenkins.io
* plugin-health.jenkins.io
* plugin-site-issues.jenkins.io
* plugins.origin.jenkins.io
* plugins.jenkins.io
* rating.jenkins.io
* repo.azure.jenkins.io
* reports.jenkins.io
* stories.jenkins.io
* uplink.jenkins.io
* weekly.ci.jenkins.io
* www.origin.jenkins.io

We also had complete downtime of the following non-public services:

* ldap.jenkins.io
* previews of *.jenkins.io pull requests (infra.ci.jenkins.io)

In addition, there was disruption (partial or complete) of the following services:

* ci.jenkins.io
* infra.ci.jenkins.io
* issues.jenkins.io
* plugins.jenkins.io
* repo.jenkins-ci.org
* www.jenkins.io

[IMPORTANT]
--
The public IPs of these services changed (DNS records included) to:

* `20.7.178.24` (IPv4)
* `2603:1030:408:5::15a` (IPv6)

⚠️ Update your corporate networks (DNS, proxies, firewall) if you need to access these services.
--
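
If you maintain an allow-list, a quick way to confirm that your resolver already returns the new addresses is to compare what it gives you against the two IPs listed above. The following is only a minimal sketch: the helper names (`normalize`, `unexpected_addresses`, `resolve`) are made up for this example, and `resolve` needs network access.

```python
import ipaddress
import socket

# New public IPs announced in this post.
EXPECTED = {"20.7.178.24", "2603:1030:408:5::15a"}


def normalize(addresses):
    """Parse textual IPs so that equivalent spellings compare equal."""
    return {ipaddress.ip_address(a) for a in addresses}


def unexpected_addresses(resolved, expected=EXPECTED):
    """Return the resolved addresses that are not in the expected set."""
    return normalize(resolved) - normalize(expected)


def resolve(hostname):
    """Return the IPs the local resolver gives for hostname (needs network)."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}
```

For instance, `unexpected_addresses(resolve("get.jenkins.io"))` should return an empty set once your DNS servers are up to date; anything else hints at a stale record or an intermediate proxy.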

== Incident Timeline

* **10:30 UTC:** After a successful upgrade of the public Kubernetes cluster in Azure to 1.25 (as part of https://github.com/jenkins-infra/helpdesk/issues/3582[help desk ticket 3582]), we noticed that the LDAP service was not reachable by the services running inside the cluster (such as accounts.jenkins.io).
We quickly determined that the IP restrictions were blocking these requests because the pods' originating IP was in a different range than before.

* **10:55 UTC:** The fix (https://github.com/jenkins-infra/azure/pull/431[Azure PR 431]) was deployed to specify a proper set of IP ranges for the pods.
It removed all of the node pools (all the virtual machines where the containers were running) and failed to re-create them, causing a full outage of all the services running in this cluster:
** accounts.jenkins.io
** get.jenkins.io
** incrementals.jenkins.io
** javadoc.jenkins.io
** jenkins-wiki-exporter.jenkins.io
** ldap.jenkins.io
** plugins.jenkins.io
** previews of *.jenkins.io pull requests (infra.ci.jenkins.io)
** release.ci.jenkins.io
** repo.azure.jenkins.io
** reports.jenkins.io
** stories.jenkins.io
** uplink.jenkins.io
** www.jenkins.io

* **11:16 until 13:16 UTC:** The failure to re-create resources led us to spend the next 2 hours creating the cluster from scratch with a fixed network setup.

* **15:17 UTC:** link:https://github.com/jenkins-infra/azure/pull/432[This pull request] was pushed to persist the manual work we did to recreate the cluster, including the IP setup.

* **15:25 UTC:** All services were back to normal.

== What Happened?

* When the cluster was initially created, we selected the virtual network's IP range `10.**244**.0.0/**16**` (ref. https://github.com/jenkins-infra/azure-net/blob/fcb010a5d9f164203c9a896fcb974df4051c321d/vnets.tf#L66[Azure VNets]) with a sub-network `10.245.0.0/24` (ref. https://github.com/jenkins-infra/azure-net/blob/fcb010a5d9f164203c9a896fcb974df4051c321d/vnets.tf#L161[Azure subnet]).

* But we overlooked that the range `10.**244**.0.0/**24**` is the default CIDR for the Kubernetes Pods network in Azure when using the link:https://learn.microsoft.com/en-us/azure/aks/configure-kubenet["kubenet" network to support IPv6 instead of default CNI].

* The node pool re-creation failed because we assumed both ranges were able to communicate and tried to deploy an invalid setup.

* As soon as we specified a custom Pod CIDR in a distinct range, everything went fine.

* When the original cluster was deleted, it transitively removed the current Public IPs, because it removed the link:https://learn.microsoft.com/en-us/azure/aks/faq#why-are-two-resource-groups-created-with-aks[Nodes Resource Group] containing the Public IPs.
** These public IPs should change as little as possible to avoid problems with our users running behind a corporate firewall with an allow-list.
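
The overlap described above can be detected before deployment with a simple range check. Here is a minimal sketch using Python's standard `ipaddress` module and the ranges quoted in this post. The `find_overlaps` helper and the custom pod CIDR `10.100.0.0/16` are hypothetical, chosen only for illustration; they are not the values or tooling we actually use.

```python
import ipaddress

# Ranges as quoted in this post.
vnet = ipaddress.ip_network("10.244.0.0/16")
default_pod_cidr = ipaddress.ip_network("10.244.0.0/24")

# Hypothetical custom pod CIDR in a distinct range (illustration only).
custom_pod_cidr = ipaddress.ip_network("10.100.0.0/16")


def find_overlaps(named_ranges):
    """Return the pairs of names whose IP ranges overlap."""
    names = sorted(named_ranges)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if named_ranges[a].overlaps(named_ranges[b])
    ]


print(find_overlaps({"vnet": vnet, "pods": default_pod_cidr}))  # [('pods', 'vnet')]
print(find_overlaps({"vnet": vnet, "pods": custom_pod_cidr}))   # []
```

A check like this could run in a pull request pipeline so that a conflicting range is flagged before any node pool is touched.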

== What can we do to improve?

* As per link:https://github.com/jenkins-infra/helpdesk/issues/3582#issuecomment-1629210833[our initial assessment]: protect the Public IPs from deletion by adding a https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/lock-resources?tabs=json[Management Lock].

* As link:https://github.com/jenkins-infra/helpdesk/issues/3582#issuecomment-1629752851[recommended by other contributors]: store the Public IPs in a distinct Resource Group and set up the Kubernetes-managed Load Balancers accordingly (annotation `service.beta.kubernetes.io/azure-load-balancer-resource-group`).

* Improve our network diagrams and documentation so that IP ranges and their potential overlaps are easy to see when preparing operations.

* Avoid changing all AKS node pool configurations at once: by changing one node pool at a time, we would have caught the issue after the first node pool and could have avoided a full outage (we are working on this topic for the `arm64` node pools in https://github.com/jenkins-infra/helpdesk/issues/3623[help desk ticket 3623]).
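
The "one node pool at a time" idea from the last item can be sketched as a small loop that stops at the first failure. Everything here is hypothetical: `roll_out`, `apply_change` and `is_healthy` are placeholder names for whatever tooling performs and verifies the change, not an existing API.

```python
def roll_out(pools, apply_change, is_healthy):
    """Apply a change one node pool at a time and stop at the first failure.

    Returns (updated, failed): the pools changed successfully, and the pool
    that failed its health check (None when every pool succeeded).
    """
    updated = []
    for pool in pools:
        apply_change(pool)
        if not is_healthy(pool):
            # Stop here: the remaining pools keep their working configuration.
            return updated, pool
        updated.append(pool)
    return updated, None
```

With such a loop, an invalid network setup would break a single node pool while the others keep serving traffic.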

== From 0 to production in less than 4 hours!

One of the takeaways of this outage is that we are able to recover from a full destruction in less than **4** hours.

A huge collaborative effort made this possible: defining the architecture, building the infrastructure, backing up its data, etc.

This effort was started years ago by link:/blog/authors/rtyler/[R. Tyler Croy] and link:/blog/authors/olblak/[Olivier Vernin], and has been backed by a lot of contributors such as link:/blog/authors/daniel-beck/[Daniel Beck], link:/blog/authors/hlemeur/[Hervé Le Meur], link:/blog/authors/timja/[Tim Jacomb], link:/blog/authors/markewaite/[Mark E Waite], link:/blog/authors/smerle33/[Stéphane Merle] and many more.

As the current Infrastructure Officer, I want to thank them all for making our lives easier when catastrophic events happen!