jenkins-infra · MarkEWaite · Jul 12, 2023
@@ -15,9 +15,9 @@ links:
 
 The Jenkins project packages and plugins are hosted through a network of mirror servers (provided by our sponsors) close to your location.
 
-It provides a "HTTP redirector" service hosted behind the `get.jenkins.io`, `mirrors.jenkins.io` and `mirrors.jenkins-ci.org` domains, with a new public IP (`20.119.232.75`) since the 12th of June 2023.
+It provides an "HTTP redirector" service hosted behind the `get.jenkins.io`, `mirrors.jenkins.io` and `mirrors.jenkins-ci.org` domains, with a new public IP: +++<s>`20.119.232.75`</s>+++ `20.7.178.24` (as per link:/blog/2023/07/12/jenkins-mirrors-postmortem-outage/[]) since the 12th of June 2023.
 The former redirector service and its previous IPv4 will be decommissioned the 27th of June 2023.
 
-IMPORTANT: Please update your DNS servers and firewall rules to the new IP `20.119.232.75` if you are in a restricted environment.
+IMPORTANT: Please update your DNS servers and firewall rules to the new IP +++<s>`20.119.232.75`</s>+++ `20.7.178.24` if you are in a restricted environment.
 
 TIP: More details in https://github.com/jenkins-infra/helpdesk/issues/3351.
@@ -0,0 +1,120 @@
+---
+layout: post
+title: "Post Mortem of the 7th July 2023 Jenkins Infrastructure Outage"
+tags:
+- infrastructure
+- mirrors
+- jenkins
+- outage
+- postmortem
+authors:
+- dduportal
+opengraph:
+  image: /images/logos/fire/fire.svg
+links:
+  discourse: true
+---
+
+On Friday 7th of July 2023, the Jenkins infrastructure suffered a major outage from 11:05am UTC until 3:25pm UTC, with complete downtime of the following public services:
+
+* accounts.jenkins.io
+* fallback.get.jenkins.io
+* get.jenkins.io
+* incrementals.jenkins.io
+* javadoc.jenkins.io
+* plugin-health.jenkins.io
+* plugin-site-issues.jenkins.io
+* plugins.origin.jenkins.io
+* plugins.jenkins.io
+* rating.jenkins.io
+* repo.azure.jenkins.io
+* reports.jenkins.io
+* stories.jenkins.io
+* uplink.jenkins.io
+* weekly.ci.jenkins.io
+* www.origin.jenkins.io
+
+We also had complete downtime of the following non-public services:
+
+* ldap.jenkins.io
+* previews of *.jenkins.io pull requests (infra.ci.jenkins.io)
+
+In addition, there was disruption (partial or complete) of the following services:
+
+* ci.jenkins.io
+* infra.ci.jenkins.io
+* issues.jenkins.io
+* plugins.jenkins.io
+* repo.jenkins-ci.org
+* www.jenkins.io
+
+[IMPORTANT]
+--
+The public IPs of these services changed (DNS records included) to:
+
+* `20.7.178.24` (IPv4)
+* `2603:1030:408:5::15a` (IPv6)
+
+⚠️ Update your corporate networks (DNS, proxies, firewall) if you need to access these services.
+--
+
+== Incident Timeline
+
+* **10:30am UTC:** After a successful upgrade of the public Kubernetes cluster in Azure to 1.25 (as part of https://github.com/jenkins-infra/helpdesk/issues/3582[help desk ticket 3582]), we noticed that the LDAP service was not reachable by the services running inside the cluster (such as accounts.jenkins.io).
+We quickly realized that the IP restrictions were blocking these requests because the pods' originating IP was in a different range than before.
+
+* **10:55am UTC:** The fix (https://github.com/jenkins-infra/azure/pull/431[Azure PR 431]) was deployed to specify a proper set of IP ranges for the pods.
+It removed all of the node pools (all the virtual machines where the containers were running) and failed to re-create them, causing a full outage of all the services running in this cluster:
+** accounts.jenkins.io
+** get.jenkins.io
+** incrementals.jenkins.io
+** javadoc.jenkins.io
+** jenkins-wiki-exporter.jenkins.io
+** ldap.jenkins.io
+** plugins.jenkins.io
+** previews of *.jenkins.io pull requests (infra.ci.jenkins.io)
+** release.ci.jenkins.io
+** repo.azure.jenkins.io
+** reports.jenkins.io
+** stories.jenkins.io
+** uplink.jenkins.io
+** www.jenkins.io
+
+* **11:16am until 1:16pm UTC:** The failure to re-create resources led us to spend the next 2 hours creating the cluster from scratch with a fixed network setup.
+
+* **3:17pm UTC:** link:https://github.com/jenkins-infra/azure/pull/432[This pull request] was pushed to persist the manual work we did to recreate the cluster, including the IP setup.
+
+* **3:25pm UTC:** All services were back to normal.
+
+== What Happened?
+
+* When the cluster was initially created, we selected the virtual network's IP range `10.**244**.0.0/**16**` (ref. https://github.com/jenkins-infra/azure-net/blob/fcb010a5d9f164203c9a896fcb974df4051c321d/vnets.tf#L66[Azure VNets]) with a sub-network `10.245.0.0/24` (ref. https://github.com/jenkins-infra/azure-net/blob/fcb010a5d9f164203c9a896fcb974df4051c321d/vnets.tf#L161[Azure subnet]).
+
+* But we overlooked that the range `10.**244**.0.0/**24**` is the default CIDR for the Kubernetes Pods network in Azure when using the link:https://learn.microsoft.com/en-us/azure/aks/configure-kubenet["kubenet" network to support IPv6 instead of the default CNI].
+
+* The node pool re-creation failed because we assumed both ranges were able to communicate and tried to deploy an invalid setup.
+
+* As soon as we specified a custom Pod CIDR in a distinct range, everything went fine.
+
+* When the original cluster was deleted, it transitively removed the current Public IPs, as it removed the link:https://learn.microsoft.com/en-us/azure/aks/faq#why-are-two-resource-groups-created-with-aks[Nodes Resource Group] containing the Public IPs.
+** These public IPs should change as little as possible to avoid problems with our users running behind a corporate firewall with an allow-list.
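
The range overlap described above is the kind of problem that can be checked before applying a network change. The following is only a minimal sketch using Python's standard `ipaddress` module, not the tooling used by the Jenkins infrastructure team. The virtual network, sub-network and default Pod ranges are the ones quoted in this post. The helper `find_overlaps` and the custom Pod CIDR `10.100.0.0/16` are hypothetical, since the post does not state which custom range was chosen.

```python
import ipaddress


def find_overlaps(candidate, existing):
    """Return the names of the existing ranges that overlap the candidate CIDR."""
    candidate_net = ipaddress.ip_network(candidate)
    return [
        name
        for name, cidr in existing.items()
        if candidate_net.overlaps(ipaddress.ip_network(cidr))
    ]


# Ranges as quoted in the post.
EXISTING_RANGES = {
    "virtual network": "10.244.0.0/16",
    "sub-network": "10.245.0.0/24",
}

# Default Pod CIDR as quoted in the post.
DEFAULT_POD_CIDR = "10.244.0.0/24"

# Hypothetical custom Pod CIDR, chosen only for illustration.
CUSTOM_POD_CIDR = "10.100.0.0/16"

if __name__ == "__main__":
    for label, cidr in [("default", DEFAULT_POD_CIDR), ("custom", CUSTOM_POD_CIDR)]:
        conflicts = find_overlaps(cidr, EXISTING_RANGES)
        print(f"{label} Pod CIDR {cidr} overlaps with: {conflicts or 'nothing'}")
```

A check like this could run in CI before a change to the cluster network is applied, failing the build when the returned list is not empty.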
+
+== What can we do to improve?
+
+* As per link:https://github.com/jenkins-infra/helpdesk/issues/3582#issuecomment-1629210833[our initial assessment]: protect the Public IPs from deletion by adding a https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/lock-resources?tabs=json[Management Lock].
+
+* As link:https://github.com/jenkins-infra/helpdesk/issues/3582#issuecomment-1629752851[recommended by other contributors]: store the Public IPs in a distinct Resource Group and set up the Kubernetes-managed Load Balancers accordingly (annotation `service.beta.kubernetes.io/azure-load-balancer-resource-group`).
+
+* Improve our network diagrams and documentation to have better access to the representation and potential overlaps when preparing operations.
+
+* Avoid changing all AKS node pool configurations at once: by changing one node pool at a time, we would have caught the issue after the first node pool and could have avoided a full outage (we are working on this topic for the `arm64` node pools in https://github.com/jenkins-infra/helpdesk/issues/3623[help desk ticket 3623]).
+
+== From 0 to production in less than 4 hours!
+
+One of the takeaways of this outage is that we are able to recover from a full destruction in less than **4** hours.
+
+A huge collaborative effort made this possible: defining the architecture, building the infrastructure, backing up its data, etc.
+
+This huge effort was started years ago by link:/blog/authors/rtyler/[R. Tyler Croy] and link:/blog/authors/olblak/[Olivier Vernin], and backed by a lot of contributors such as link:/blog/authors/daniel-beck/[Daniel Beck], link:/blog/authors/hlemeur/[Hervé Le Meur], link:/blog/authors/timja/[Tim Jacomb], link:/blog/authors/markewaite/[Mark E Waite], link:/blog/authors/smerle33/[Stéphane Merle] and many more.
+
+As the current Infrastructure Officer, I want to thank them all for making our lives easier when catastrophic events happen!