Reliability

Reliable applications and services in Azure must be both resilient and highly available. Achieving these goals involves many decisions and tradeoffs. The questions and initiatives below are based on the Reliability pillar of the Well-Architected Framework. Reflect on each question, then prioritize and plan the initiatives of the Reliability playbook.

Questions to ask

  • What reliability targets and metrics have you defined for your application?
  • How have you ensured that your application architecture is resilient to failures?
  • How have you ensured required capacity and services are available in targeted regions?
  • How are you handling disaster recovery for this workload?
  • What decisions have been taken to ensure the application platform meets your reliability requirements?
  • What decisions have been taken to ensure the data platform meets your reliability requirements?
  • How does your application logic handle exceptions and errors?
  • What decisions have been taken to ensure networking and connectivity meets your reliability requirements?
  • What reliability allowances for scalability and performance have you made?
  • What reliability allowances for security have you made?
  • What reliability allowances for operations have you made?
  • How do you test the application to ensure it is fault tolerant?
  • How do you monitor and measure application health?

Initiatives

Capture requirements

Everybody wants the highest level of reliability for their services, but this always comes with cost and architecture complexity tradeoffs. Therefore, the first thing to do regarding reliability is to identify how reliable we need our workload to be and what is required to achieve this objective.

Performing a failure mode analysis (FMA) helps you identify where your workload can fail and what the risks of each failure are. It also helps you prioritize the solutions and initiatives that will remediate the identified reliability vulnerabilities.
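
As an illustration of how FMA findings could be turned into a prioritized list, here is a minimal sketch. The 1 to 5 scales, the likelihood times impact score, and the example failure modes are assumptions made for this example; they are not prescribed by the playbook or by Azure, so adapt them to your own risk model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a failure mode analysis (hypothetical structure)."""
    component: str
    failure: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (full outage)

    @property
    def risk_score(self) -> int:
        # Simple likelihood x impact scoring; an assumption, not an Azure rule.
        return self.likelihood * self.impact


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk_score, reverse=True)


# Example failure modes, invented for illustration only.
modes = [
    FailureMode("web tier", "single instance crashes", likelihood=4, impact=2),
    FailureMode("database", "regional outage", likelihood=1, impact=5),
    FailureMode("cache", "connection timeouts", likelihood=3, impact=3),
]

for mode in prioritize(modes):
    print(f"{mode.risk_score:>2}  {mode.component}: {mode.failure}")
```

The ordered output gives a starting point for deciding which remediation initiatives to plan first; ties and business context still need human judgment.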

Availability targets are another important driver for identifying reliability requirements. For example, determining mean time to recovery (MTTR) or recovery time objective (RTO) targets will influence which reliability solutions are adopted. More details here.
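
To make an availability target concrete, it can help to translate it into a downtime budget. The sketch below assumes a 30-day period by default and uses the common steady-state approximation availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures; the function names are illustrative and not part of any Azure tool.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime permitted over the period for an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


def availability_from_mtbf_mttr(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction, given MTBF and MTTR in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Comparing the downtime budget with your expected MTTR shows quickly whether a target is realistic: a budget shorter than a single recovery cannot be met without redundancy or faster failover.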

Depending on your reliability requirements, Azure provides many high-availability and resiliency solutions to choose from, such as Availability Sets with Managed Disks, Availability Zones, Geo-Redundancy, Data Replication or Azure Site Recovery, just to name a few.
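
When comparing these options, a rough composite availability estimate can show what redundancy buys you. The sketch below assumes component failures are independent, which is a simplification, and the availability figures are placeholders rather than published SLA values.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up (fractions in [0, 1])."""
    return prod(components)


def redundant_availability(replicas: list[float]) -> float:
    """Availability when any one of several independent replicas is sufficient."""
    return 1.0 - prod(1.0 - a for a in replicas)


# Placeholder figures for illustration only, not actual Azure SLAs.
app_tier = 0.9995
data_tier = 0.9999

single_region = serial_availability([app_tier, data_tier])
two_regions = redundant_availability([single_region, single_region])

print(f"single region: {single_region:.6f}")
print(f"two regions:   {two_regions:.8f}")
```

Chaining dependencies lowers overall availability, while independent redundant deployments raise it, at the price of the extra cost and complexity discussed above.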

Validate reliability best practices are applied

Once reliability requirements are clearly identified, you have several tools and actions at your disposal to validate whether your workloads apply reliability best practices. Here are some actions that should be part of your to-do list:

Incorporate reliability in release engineering procedures