A curated list of selected readings that illustrate Scalability, Availability, and Stability Design Patterns in Back-end Development. Concepts are explained in articles by notable engineers (Martin Fowler, Robert C. Martin, Tom White, etc.) and in high-quality sources (highscalability.com, infoq.com, etc.). Case studies are taken from battle-tested systems that serve millions of users (Netflix, Instagram, Riot Games, LINE, etc.).
First, identify your problem by reviewing the design principles: is it a performance problem (slow for a single user) or a scalability problem (fast for a single user but slow under heavy load)? You can also watch talks by elite engineers from tech giants (Google, Facebook, Netflix, etc.) to see how they build and scale their systems.
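As a rough illustration of that distinction, here is a minimal sketch that labels a system from two sets of latency samples. The function name `classify`, the `budget_ms` threshold, and the sample numbers are all hypothetical and chosen only for illustration; real diagnosis would rely on proper load testing and percentiles rather than a single median.

```python
from statistics import median


def classify(single_user_ms, under_load_ms, budget_ms):
    """Label latency samples against a response-time budget.

    single_user_ms: latencies (ms) measured with one client
    under_load_ms:  latencies (ms) measured with many concurrent clients
    budget_ms:      hypothetical acceptable response time (ms)
    """
    single = median(single_user_ms)
    loaded = median(under_load_ms)
    if single > budget_ms:
        # Already slow for a single user.
        return "performance problem"
    if loaded > budget_ms:
        # Fast for a single user, slow only under heavy load.
        return "scalability problem"
    return "within budget"


print(classify([900, 950, 1100], [1200, 1300, 1250], budget_ms=300))
print(classify([40, 55, 60], [800, 950, 1200], budget_ms=300))
```

The single-user check comes first on purpose: if a request is slow in isolation, adding capacity is unlikely to help until the request path itself is fixed.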
"Even if you lose all one day, you can build all over again if you retain your calm!" - Thuan Pham, CTO at Uber Technologies Inc.
Contributions are greatly welcome! You may want to take a look at the contribution guidelines. If you find this project helpful, please share it with others. Thank you very much!
- Principles of Chaos Engineering
- Finding the Order in Chaos
- The Clean Architecture - Robert C. Martin (Uncle Bob)
- CAP Theorem and Trade-offs
- CAP Twelve Years Later: How the "Rules" Have Changed (2012) - Eric Brewer (VP of Infrastructure at Google)
- Scale Up or Scale Out, What it is and Why You Should Care
- Scaling Up vs Scaling Out: Hidden Costs
- ACID and BASE
- Blocking/Non-Blocking and Sync/Async
- Why Non-Blocking?
- SQL and NoSQL
- Consistent Hashing - Tom White, author of 'Hadoop: the Definitive Guide'
- Cache is King!
- Anti-Caching
- Understand Latency
- Architecture Issues When Scaling Web Applications: Bottlenecks, Database, CPU, IO
- 20 Common Bottlenecks
- Relying on Software to Redirect Traffic Reliably at Various Layers
- Advantages and Drawbacks of Microservices
- Breaking Things on Purpose
- Avoid Over Engineering
- Scalability Worst Practices
- Use Solid Technologies - Don’t Re-invent the Wheel - Keep It Simple!
- Performance is a Feature
- Make Performance Part of Your Workflow
- Writing Code that Scales
- AWS Do's and Don'ts
- (UI) Design Doesn’t Scale - Stanley Wood, Design Director at Spotify
- Design for Loose-coupling
- Design for Resiliency
- Design for Self-healing
- Design for Scaling Out
- Best Practices for Scaling Out
- Design for Evolution
- Learn from Mistakes
- Microservices Resource Guide - Martin Fowler, Chief Scientist at ThoughtWorks
- Thinking Inside the Container - Riot Games (8 part series)
- Containerization at Pinterest
- The Evolution of Container Usage at Netflix
- Dockerizing MySQL at Uber
- Testing of Microservices at Spotify
- Organize Monolith Before Breaking it into Services at Weebly
- The Log: What Every Software Engineer Should Know
- Scalable and reliable log ingestion at Pinterest
- Building DistributedLog at Twitter: High-performance replicated log service
- Logging Service with Spark at CERN Accelerator
- Logging and Aggregation at Quora
- BookKeeper: Distributed Log Storage at Yahoo
- LogDevice: Distributed Data Store for Logs at Facebook
- Understanding When to use RabbitMQ or Apache Kafka
- Running Kafka at scale at LinkedIn
- Delaying Asynchronous Message Processing with RabbitMQ at Indeed
- Real-time Data Pipeline with Kafka at Yelp
- Audit Kafka End-to-End at Uber (count each message exactly once, audit a message across tiers)
- Deduplication Techniques
RDBMS (MySQL, MSSQL, PostgreSQL)
- MS SQL versus MySQL
- Why SQL is beating NoSQL, and what this means for the future of data
- Sharding MySQL at Pinterest
- How Airbnb Partitioned Main MySQL Database in Two Weeks
- Replication is the Key for Scalability & High Availability
- How Twitch uses PostgreSQL
- Scaling MySQL-based financial reporting system at Airbnb
- Scaling to 100M at Wix: MySQL is a Better NoSQL
- Why Uber Engineering Switched from Postgres to MySQL
- Handling Growth with Postgres at Instagram
- Introduction to Modern Network Load Balancing and Proxying
- Load Balancing infrastructure to support more than 1.3 billion users at Facebook
- DHCPLB: Open Source Load Balancer for DHCP at Facebook
- Load Balancing with Eureka at Netflix
- Load Balancing at Yelp
- Load Balancing at GitHub
- Consistent Hashing to Improve Load Balancing at Vimeo
- UDP Load Balancing at 500px
- SPMD (Single Program Multiple Data): The Genetic Pattern
- Master/Worker Pattern
- Loop Parallelism Pattern: Extracting parallel tasks from loops
- Fork/Join Pattern: Good for recursive data processing
- Map-Reduce: Born for Simplified Data Processing on Large Clusters
- On the Death of Map-Reduce - Henry Robinson, Cloudera
- Parallelize the rendering of web pages: Use case of Yelp.com
- Scalable Deep Learning Platform On Spark In Baidu
- Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
- Scaling Gradient Boosted Trees for Click-Through-Rate Prediction at Yelp
- TensorFlowOnSpark: Distributed Deep Learning on Big Data Clusters at Yahoo
- CaffeOnSpark: Distributed Deep Learning on Big Data Clusters at Yahoo
- AIOps in Practice at Baidu
- Learning with Privacy at Scale - Differential Privacy Team, Apple
- Fail-over
- Replication
- NodeJS High Availability at Yahoo
- Every Day Is Monday in Operations - LinkedIn (11 part series)
- Practical Guide to Monitoring and Alerting with Time Series at Scale
- How Robust Monitoring Powers High Availability for LinkedIn Feed
- Architectural Patterns for High Availability - Adrian Cockcroft, Director of Architecture at Netflix
- Circuit Breaker
- Always use timeouts (if possible)
- Let it crash/Supervisors: Embrace failure as a natural state in the life-cycle of the application
- Crash early: An error now is better than a response tomorrow
- Bulkheads: Partition and tolerate failure in one part
- Steady state: Always put logs on separate disk
- Throttling: Maintain a steady pace
- Multi-clustering: Improving Resiliency and Stability of a Large-scale Monolithic API Service at LinkedIn
- Distributed Git server at Palantir
- Configuration management for distributed systems (using GitHub and cfg4j) at Flickr
- Seagull: Distributed system that helps running > 20 million tests per day at Yelp
- Cloud Bouncer: Distributed Rate Limiting at Yahoo
- Scalable gaming patterns on AWS (Sep 2017)
- Building a modern bank backend at Monzo
- Selecting a cloud provider at Etsy
- Architecture of Tripod (Flickr’s Backend)
- How eBay's Shopping Cart used compression techniques to solve network I/O bottlenecks
- Optimizing web servers for high throughput and low latency at Dropbox
- Talks on Efficiency, Reliability, and Scaling - James Hamilton, Vice President and Distinguished Engineer at AWS
- Building Real Time Infrastructure at Facebook - Jeff Barber and Shie Erlich, Software Engineers at Facebook
- Building Reliable Social Infrastructure for Google - Marc Alvidrez, Senior Manager at Google
- How Google Does Planet-Scale for Planet-Scale Infra - Melissa Binde, SRE Director for Google Cloud Platform
- Netflix Guide to Microservices - Josh Evans, Director of Operations Engineering at Netflix
- Achieving Rapid Response Times in Large Online Services - Jeff Dean, Google Senior Fellow
- How We've Scaled Dropbox - Kevin Modzelewski, Back-end Engineer at Dropbox
- Lessons of Scale at Facebook - Bobby Johnson, Director of Engineering at Facebook
- Scaling Instagram Infrastructure - Lisa Guo, Instagram Engineering
- Scaling Twitter Core Infrastructure - Yao Yue, Staff Software Engineer at Twitter
- Scaling Pinterest - Marty Weiner, Pinterest’s founding engineer
- Scaling Spotify Data Infrastructure - Matti (Lepistö) Pehrs, Spotify
- Scaling Uber's Backend by Breaking Everything - Matt Ranney, Chief Systems Architect at Uber
- Scaling Slack - Bing Wei, Software Engineer (Infrastructure) at Slack
- The Art of Scalability
- Designing Data-Intensive Applications
- Web Scalability for Startup Engineers
- Scalability Rules: 50 Principles for Scaling Web Sites
- Chaos Engineering - Building Confidence in System Behavior through Experiments
- Jonas Bonér, CTO at Lightbend, for the original inspiration
This work is licensed under a Creative Commons Attribution 4.0 International License.