diff --git a/adr/001-architecture-decision-records.md b/adr/001-architecture-decision-records.md
index 026ad774..d150d7f6 100644
--- a/adr/001-architecture-decision-records.md
+++ b/adr/001-architecture-decision-records.md
@@ -1,6 +1,6 @@
 # 1. Architecture Decision Records
 
-Date: 2024-05-14
+Date: 2024-05-14, updated 2024-09-03
 
 ## Decision
 
@@ -21,5 +21,28 @@ We want to record our architectural decisions so that...
 
 - New team members who join can see why we made the decisions we made.
 - The team can revise or revisit decisions with more confidence and context.
 
-### Related Issues
+## Impact
+_The outcomes of the decision, both positive and negative. This section explains the impact of the decision, such as trade-offs, risks, and what needs to be done to implement it._
+
+### Positive
+
+- **Transparency**: ADRs make decision-making more transparent, helping current and future team members understand the rationale behind decisions.
+- **Historical Context**: They provide valuable historical context, aiding in future decision-making and avoiding repeated mistakes.
+- **Onboarding**: ADRs speed up the onboarding process by quickly familiarizing new team members with architectural decisions.
+- **Consistency**: A standardized format ensures consistent documentation, making records easier to maintain and reference.
+
+### Negative
+
+- **Overhead**: Maintaining ADRs requires time and effort.
+- **Outdated Records**: If not regularly updated, ADRs can become outdated and misleading.
+
+### Risks
+
+- **Incomplete Documentation**: Not all decisions may be documented, leading to gaps in the record.
+- **Misalignment**: ADRs may not always match the actual implementation, causing confusion.
+
+## Related Issues
+
+- #1
+- #13
diff --git a/adr/010-deployment-slots.md b/adr/010-deployment-slots.md
new file mode 100644
index 00000000..1863329a
--- /dev/null
+++ b/adr/010-deployment-slots.md
@@ -0,0 +1,77 @@
+# 10. Deployment Slots for Zero-Downtime Deploys for the Web App
+
+Date: 2024-09-03
+
+## Decision
+
+1. We will use Azure Web App Deployment Slots to facilitate zero-downtime deploys of the SFTP Ingestion Service web app.
+2. Because the ingestion service is queue-driven, and in order to keep both the pre-live and production slots healthy,
+we will use `sticky_settings` to keep queue configuration on only the production slot in each environment.
+
+## Status
+
+Accepted.
+
+## Context
+
+1. Even though the Ingestion Service's queue-driven workflow is resilient to small downtimes, implementing zero-downtime
+deploys is a standard best practice. Using Azure Deployment Slots also gives us fast and easy rollbacks in addition to
+zero-downtime deployment, and is consistent with the workflow we're using in TI.
+2. Because the Ingestion Service is queue-driven, turning 'off' the pre-live slot (which only affects HTTP routing) doesn't
+stop it from reading queues. To prevent actions from being duplicated, we're keeping queue configuration settings only
+on the production/live slot, which leaves the pre-live slot running and healthy, but not active (see the sketch below).
+
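+For illustration, the Terraform for this takes roughly the following shape (the setting name and storage account reference below are placeholders; the real keys live in `operations/template/app.tf`):
+
+```terraform
+resource "azurerm_linux_web_app" "sftp" {
+  # ...other app configuration...
+
+  app_settings = {
+    # Hypothetical setting name and storage account reference, for illustration only.
+    QUEUE_CONNECTION_STRING = azurerm_storage_account.storage.primary_connection_string
+  }
+
+  # Settings listed here stay pinned to the production slot during a swap,
+  # so the pre-live slot never receives queue configuration.
+  sticky_settings {
+    app_setting_names = ["QUEUE_CONNECTION_STRING"]
+  }
+}
+```
+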
+Even though there are some significant downsides to Deployment Slots, they're Azure's recommended
+approach to zero-downtime deploys (ZDD), and they're lower effort and lower risk than the alternatives.
+Other options to achieve ZDD are Kubernetes (significantly more complexity and effort), creating
+our own custom deploy system (significantly more complexity, effort, and risk), or switching to
+a cloud service provider that makes this easier, like AWS (not currently in scope as an option).
+
+## Impact
+
+### Positive
+
+- **Zero-downtime deploys**: Deploys no longer take the app offline, in line with standard best practice.
+- **Easy rollback**: Deployment slots make it easy to roll back to the previous version of the
+  app if we find errors after deploy.
+- **Consistency**: Deployment Slots are an Azure feature specifically designed to enable
+  zero-downtime deployment. We use deployment slots in all ingestion service environments and
+  in the Trusted Intermediary web app.
+
+### Negative
+
+- **Incomplete support for Linux**: The auto-swap feature is not available for Linux-based web apps like ours,
+  so we had to include an explicit swapping step in our updated deployment process.
+- **Opaque responses from `az webapp deployment slot swap` CLI**: When there are issues swapping slots, the CLI doesn't
+  return any details about the issue. The swapping operation can also take as much as 20 minutes
+  to time out if there's a silent failure, which slows down deploys and validation.
+- **Steep learning curve**: Most of the official docs and unofficial resources
+  (such as blogs and tutorials) for deployment slots are written for people using Windows
+  servers and Microsoft-published programming languages. This lack of support for other platforms
+  and languages means a lot more trial and error is involved.
+
+### Risks
+
+- Because of the incomplete support for and documentation of our use case, we may not have
+  chosen the optimal implementation of this feature. It may also be time-consuming to
+  troubleshoot if we run into future issues.
+- Future developers may be confused about which settings should be `sticky` and which should not.
diff --git a/operations/template/app.tf b/operations/template/app.tf
index f8f85a39..0f35498b 100644
--- a/operations/template/app.tf
+++ b/operations/template/app.tf
@@ -112,7 +112,8 @@ resource "azurerm_linux_web_app" "sftp" {
   }
 
   # When adding new settings that are needed for the live app but shouldn't be used in the pre-live
-  # slot, add them to `sticky_settings` as well as `app_settings` for the main app resource
+  # slot, add them to `sticky_settings` as well as `app_settings` for the main app resource.
+  # All queue-related settings should be `sticky` so that the pre-live slot does not send or consume messages.
   app_settings = {
     DOCKER_REGISTRY_SERVER_URL = "https://${azurerm_container_registry.registry.login_server}"
     WEBSITES_PORT              = 8080
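The pre-live slot that these sticky settings protect is declared as its own Terraform resource. A minimal sketch of that declaration, assuming an illustrative slot name (the actual resource in this repo may differ):

```terraform
# Sketch only: the pre-live deployment slot attached to the main web app.
# Deploys swap this slot with production; `sticky_settings` on the main
# resource keeps queue configuration from following the swap.
resource "azurerm_linux_web_app_slot" "pre_live" {
  name           = "pre-live"                    # illustrative slot name
  app_service_id = azurerm_linux_web_app.sftp.id # main app resource above

  site_config {}
}
```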