Awesome Scalability, Availability, and Stability Back-end Design Patterns

A curated list of readings that illustrate scalability, availability, and stability design patterns in back-end development. Concepts are explained in articles by notable engineers (Martin Fowler, Robert C. Martin, Tom White, etc.) and in high-quality sources (highscalability.com, infoq.com, etc.). Case studies are taken from battle-tested systems that serve millions of users (Netflix, Instagram, Riot Games, LINE, etc.).

What if your Back-end went slow?

First identify which problem you have: a performance problem (slow for a single user) or a scalability problem (fast for a single user but slow under heavy load). Reviewing the design principles helps with that diagnosis. You can also watch some talks by engineers from tech giants (Google, Facebook, Netflix, etc.) to see how they build and scale their systems.

What if your Back-end went down?

"Even if you lose all one day, you can build all over again if you retain your calm!" - Thuan Pham, CTO at Uber Technologies Inc.

Community Power

Contributions are very welcome! Please take a look at the contribution guidelines first. If you find this project helpful, please help share it with more people. Thank you very much!

Principles

Principles of Chaos Engineering
Finding the Order in Chaos
The Clean Architecture - Robert C. Martin (Uncle Bob)
CAP Theorem and Trade-offs
CAP Twelve Years Later: How the "Rules" Have Changed (2012) - Eric Brewer (VP of Infrastructure at Google)
Scale Up or Scale Out, What it is and Why You Should Care
Scaling Up vs Scaling Out: Hidden Costs
ACID and BASE
Blocking/Non-Blocking and Sync/Async
Why Non-Blocking?
SQL and NoSQL
Consistent Hashing - Tom White, author of 'Hadoop: the Definitive Guide'
Cache is King!
Anti-Caching
Understand Latency
Architecture Issues When Scaling Web Applications: Bottlenecks, Database, CPU, IO
20 Common Bottlenecks
Relying on Software to Redirect Traffic Reliably at Various Layers
Advantages and Drawbacks of Microservices
Breaking Things on Purpose
Avoid Over Engineering
Scalability Worst Practices
Use Solid Technologies - Don’t Re-invent the Wheel - Keep It Simple!
Performance is a Feature
Make Performance Part of Your Workflow
Writing Code that Scales
AWS Do's and Don'ts
(UI) Design Doesn’t Scale - Stanley Wood, Design Director at Spotify
Design for Loose-coupling
Design for Resiliency
Design for Self-healing
Design for Scaling Out
Best Practices for Scaling Out
Design for Evolution
Learn from Mistakes
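
As a companion to the Consistent Hashing entry in the list above, here is a minimal sketch of the general idea: nodes and keys are hashed onto the same ring, and a key belongs to the first node point found clockwise from it. The class name `HashRing`, the `replicas` parameter (virtual points per node), and the use of MD5 are assumptions made for illustration; they are not taken from the linked article.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used for spread, not security)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual points per node."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at `replicas` positions to even out the load.
        for i in range(self.replicas):
            point = ring_hash(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; skip this virtual point
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node: str) -> None:
        for i in range(self.replicas):
            point = ring_hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                del self._points[bisect.bisect_left(self._points, point)]

    def get(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

With this layout, removing a node only reassigns the keys that node owned; keys owned by the remaining nodes stay where they were, which is the property that makes the technique attractive for caches and sharded stores.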

Scalability

Microservices
Distributed Caching
Distributed Tracking and Tracing
Distributed Logging
Distributed Messaging
Storage
- In-memory Storage
- Durable Storage (typically Object Storage)
NoSQL
RDBMS (MySQL, MSSQL, PostgreSQL)
Time Series Database (TSDB)
HTTP Caching (Reverse Proxy, CDN)
Concurrency
Event-Driven Architecture
Load Balancing
Parallel Computing
Distributed Machine Learning

Availability

Fail-over
- The Evolution of Global Traffic Routing and Failover
- Testing for Disaster Recovery Failover Testing
Replication
NodeJS High Availability at Yahoo
Every Day Is Monday in Operations - LinkedIn (11 part series)
Practical Guide to Monitoring and Alerting with Time Series at Scale
How Robust Monitoring Powers High Availability for LinkedIn Feed
Architectural Patterns for High Availability - Adrian Cockcroft, Director of Architecture at Netflix

Stability

Circuit Breaker
Always use timeouts (if possible)
Let it crash/Supervisors: Embrace failure as a natural state in the life-cycle of the application
Crash early: An error now is better than a response tomorrow
Bulkheads: Partition and tolerate failure in one part
Steady state: Always put logs on separate disk
Throttling: Maintain a steady pace
Multi-clustering: Improving Resiliency and Stability of a Large-scale Monolithic API Service at LinkedIn
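
To make the Circuit Breaker and "crash early" entries above more concrete, here is a minimal sketch of a circuit breaker. It is an illustration under stated assumptions, not the implementation from any linked article: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are hypothetical, and the `clock` parameter exists only so the behavior can be exercised without waiting.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after consecutive failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None          # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half-open"          # cool-down elapsed: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of waiting on a dependency known to be unhealthy.
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder for your own function, and treat `CircuitOpenError` as a signal to serve a fallback. Pairing the breaker with a timeout on the wrapped call keeps slow dependencies from tying up workers while the failure count builds.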

Others

Distributed Git server at Palantir
Configuration management for distributed systems (using GitHub and cfg4j) at Flickr
Seagull: Distributed system that helps running > 20 million tests per day at Yelp
Cloud Bouncer: Distributed Rate Limiting at Yahoo
Scalable gaming patterns on AWS (Sep 2017)
Building a modern bank backend at Monzo
Selecting a cloud provider at Etsy
Architecture of Tripod (Flickr’s Backend)
How eBay's Shopping Cart used compression techniques to solve network I/O bottlenecks
Optimizing web servers for high throughput and low latency at Dropbox

Talks

Talks on Efficiency, Reliability, and Scaling - James Hamilton, Vice President and Distinguished Engineer at AWS
Building Real Time Infrastructure at Facebook - Jeff Barber and Shie Erlich, Software Engineers at Facebook
Building Reliable Social Infrastructure for Google - Marc Alvidrez, Senior Manager at Google
How Google Does Planet-Scale for Planet-Scale Infra - Melissa Binde, SRE Director for Google Cloud Platform
Netflix Guide to Microservices - Josh Evans, Director of Operations Engineering at Netflix
Achieving Rapid Response Times in Large Online Services - Jeff Dean, Google Senior Fellow
How We've Scaled Dropbox - Kevin Modzelewski, Back-end Engineer at Dropbox
Lessons of Scale at Facebook - Bobby Johnson, Director of Engineering at Facebook
Scaling Instagram Infrastructure - Lisa Guo, Instagram Engineering
Scaling Twitter Core Infrastructure - Yao Yue, Staff Software Engineer at Twitter
Scaling Pinterest - Marty Weiner, Pinterest’s founding engineer
Scaling Spotify Data Infrastructure - Matti (Lepistö) Pehrs, Spotify
Scaling Uber's Backend by Breaking Everything - Matt Ranney, Chief Systems Architect at Uber
Scaling Slack - Bing Wei, Software Engineer (Infrastructure) at Slack

Books

The Art of Scalability
Designing Data-Intensive Applications
Web Scalability for Startup Engineers
Scalability Rules: 50 Principles for Scaling Web Sites
Chaos Engineering - Building Confidence in System Behavior through Experiments

Special Thanks

Jonas Bonér, CTO at Lightbend, for the original inspiration

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
