An updated and curated list of selected readings to illustrate Scalability, Availability, and Stability Design Patterns in Back-end Development. Concepts are explained in articles by notable engineers (Werner Vogels, James Hamilton, Jeff Atwood, Martin Fowler, Robert C. Martin, Tom White, Martin Kleppmann) and high-quality reference sources (highscalability.com, infoq.com, official engineering blogs, etc.). Case studies are taken from battle-tested systems that serve millions to billions of users (Netflix, Baidu, Flipkart, LINE, Spotify, etc.).
Understand your problem first: is it a performance problem (slow for a single user) or a scalability problem (fast for a single user but slow under heavy load)? Reviewing the design principles helps you tell the two apart. You can also watch talks by elite engineers from tech giants (Google, Facebook, Instagram, etc.) to see how they build and scale their systems.
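The distinction between a performance problem and a scalability problem can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the `target_ms` threshold, and the choice of the median as the summary statistic are assumptions made for this example, not something prescribed by the readings below.

```python
from statistics import median


def classify(single_user_ms, under_load_ms, target_ms):
    """Label a service from two sets of latency samples (in milliseconds).

    single_user_ms: latencies measured with one user and no contention.
    under_load_ms:  latencies measured under heavy concurrent load.
    target_ms:      hypothetical latency budget chosen for this sketch.
    """
    baseline = median(single_user_ms)
    loaded = median(under_load_ms)

    # Slow even for a single user: a performance problem.
    if baseline > target_ms:
        return "performance problem"
    # Fast for a single user but slow under load: a scalability problem.
    if loaded > target_ms:
        return "scalability problem"
    return "ok"


if __name__ == "__main__":
    print(classify([900, 1000, 1100], [1500, 1600, 1700], target_ms=300))
    print(classify([40, 50, 60], [800, 900, 1000], target_ms=300))
```

In practice you would likely compare tail percentiles rather than medians, but the decision order stays the same: check the single-user baseline first, then the behavior under load.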
"Even if you lose all one day, you can build all over again if you retain your calm!" - Thuan Pham, CTO at Uber Technologies Inc.
Contributions are very welcome! You may want to take a look at the contribution guidelines. If you find this project helpful, please help me share it on Twitter! Thank you very much ❤️
- My Scaling Hero - Jeff Atwood
- Principles of Chaos Engineering
- Finding the Order in Chaos
- The Clean Architecture - Robert C. Martin (Uncle Bob)
- The Twelve-Factor App
- 10 Common (Large-Scale) Software Architectural Patterns in a Nutshell
- CAP Theorem and Trade-offs
- CAP Twelve Years Later: How the "Rules" Have Changed (2012) - Eric Brewer, VP of Infrastructure at Google
- Scale Up or Scale Out, What it is and Why You Should Care
- Scaling Up vs Scaling Out: Hidden Costs
- ACID and BASE
- Blocking/Non-Blocking and Sync/Async
- Why Non-Blocking?
- SQL versus NoSQL
- SQL or NoSQL - Lesson Learned from Salesforce
- How Sharding Works
- Consistent Hashing - Tom White, author of 'Hadoop: The Definitive Guide'
- Uniform Consistent Hashing
- Eventually Consistent - Werner Vogels, CTO at Amazon
- Cache is King!
- Anti-Caching
- Understand Latency
- Architecture Issues When Scaling Web Applications: Bottlenecks, Database, CPU, IO
- 20 Common Bottlenecks
- Life Beyond Distributed Transactions
- Relying on Software to Redirect Traffic Reliably at Various Layers
- Advantages and Drawbacks of Microservices
- Microservices Scale Cube
- Breaking Things on Purpose
- Avoid Over Engineering
- Scalability Worst Practices
- Use Solid Technologies - Don’t Re-invent the Wheel - Keep It Simple!
- Why Over-Reusing is Bad
- Performance is a Feature
- Make Performance Part of Your Workflow
- The Benefits of Server Side Rendering Over Client Side Rendering
- Writing Code that Scales
- Automate and Abstract: Lessons from Facebook on Engineering for Scale
- AWS Do's and Don'ts
- (UI) Design Doesn’t Scale - Stanley Wood, Design Director at Spotify
- Design for Loose-coupling
- Design for Resiliency
- Design for Self-healing
- Design for Scaling Out
- Best Practices for Scaling Out
- Design for Evolution
- Learn from Mistakes
- Microservices and Orchestration
- Microservices Resource Guide - Martin Fowler, Chief Scientist at ThoughtWorks
- Microservices Patterns
- Thinking Inside the Container (8 parts) at Riot Games
- Containerization at Pinterest
- The Evolution of Container Usage at Netflix
- Dockerizing MySQL at Uber
- Testing of Microservices at Spotify
- Organize Monolith Before Breaking it into Services at Weebly
- Lessons learned running Docker in production at Treehouse
- Inside a SoundCloud Microservice
- Microservices at BlaBlaCar
- Operate Kubernetes Reliably at Stripe
- Agrarian-Scale Kubernetes (3 parts) at New York Times
- Mesos, Docker and Ochopod in Localization Services at Autodesk
- Nanoservices at BBC Online
- Distributed Caching
- Write-behind and Write-through
- Eviction Policies
- Peer-To-Peer Caching
- EVCache: Caching for a Global Netflix
- Memsniff: Robust Memcache Traffic Analyzer at Box.com
- Caching with Consistent Hashing and Cache Smearing at Etsy
- An Analysis of Facebook Photo Caching
- Reduce Memcached Memory Usage by 50% at Trivago
- Distributed Tracking and Tracing
- Distributed Logging
- The Problem with Logging - Jeff Atwood
- The Log: What Every Software Engineer Should Know
- Using Logs to Build a Solid Data Infrastructure - Martin Kleppmann
- Scalable and reliable log ingestion at Pinterest
- Building DistributedLog at Twitter: High-performance replicated log service
- Logging Service with Spark at CERN Accelerator
- Logging and Aggregation at Quora
- BookKeeper: Distributed Log Storage at Yahoo
- LogDevice: Distributed Data Store for Logs at Facebook
- Distributed Messaging
- Understanding When to use RabbitMQ or Apache Kafka
- Running Kafka at scale at LinkedIn
- Delaying Asynchronous Message Processing with RabbitMQ at Indeed
- Real-time Data Pipeline with Kafka at Yelp
- Audit Kafka End-to-End at Uber (count each message exactly once, audit a message across tiers)
- Deduplication Techniques
- Should You Put Several Event Types in the Same Kafka Topic? - Martin Kleppmann
- Distributed Searching
- Search Architecture of Instagram
- Search Architecture of eBay
- Improving Search Engine Efficiency by over 25% at eBay
- Elasticsearch Performance Tuning Practice at eBay
- Nautilus: Travel Search Engine of Expedia
- Galene: Search Architecture of LinkedIn
- Search at Slack
- Search Service (Half a Trillion Documents and Query Average Latency < 100ms) at Twitter (2014)
- Manas: High Performing Customized Search System at Pinterest
- Sherlock: Near Real Time Search Indexing at Flipkart
- Nebula: Storage Platform to Build Search Backends at Airbnb
- Elasticsearch at Kickstarter
- Distributed Storage
- Distributed Version Control
- NoSQL
- Key-Value Databases (DynamoDB, Voldemort, Manhattan)
- Scaling Mapbox infrastructure with DynamoDB Streams
- Manhattan: Twitter’s distributed key-value database
- Sherpa: Yahoo’s distributed NoSQL key-value store
- Riak inside Chat Service Architecture at Riot Games
- MPH: Fast and Compact Immutable Key-Value Stores at Indeed
- zBase: High Performance, Elastic, Distributed Key-Value Store at Zynga
- Column Databases (Cassandra, HBase, Vertica, Sybase IQ)
- Consistent Hashing in Cassandra
- When NOT to use Cassandra?
- Storing Images in Cassandra at Walmart Scale
- Cassandra at Instagram
- How Yelp Scaled Ad Analytics with Cassandra
- How Discord Stores Billions of Messages with Cassandra
- Scale to serve 100+ million reads/writes using Spark and Cassandra at Dream11
- Imgur Notification: From MySQL to HBase at Imgur
- Moving Food Feed from Redis to Cassandra at Zomato
- Document Databases (MongoDB, SimpleDB, CouchDB)
- eBay: Building Mission-Critical Multi-Data Center Applications with MongoDB
- MongoDB at Baidu: Multi-Tenant Cluster Storing 200+ Billion Documents across 160 Shards
- The AWS and MongoDB Infrastructure of Parse (acquired by Facebook)
- Migrating Mountains of Mongo Data at Addepar
- Couchbase Ecosystem at LinkedIn
- SimpleDB at Zendesk
- Graph Databases
- Data Structure Databases (Redis, Hazelcast)
- Using Redis To Scale at Twitter
- Scaling Job Queue with Redis at Slack
- Moving persistent data out of Redis at GitHub
- Storing Hundreds of Millions of Simple Key-Value Pairs in Redis at Instagram
- Redis in Chat Architecture of Twitch (from 27:22)
- Learn Redis the hard way (in production) at Trivago
- Optimizing Session Key Storage in Redis at Deliveroo
- Optimizing Redis Storage at Deliveroo
- Practical NoSQL resilience design pattern for the enterprise (eBay)
- RDBMS (MySQL, MSSQL, PostgreSQL)
- MS SQL versus MySQL
- Why SQL is beating NoSQL, and what this means for the future of data
- MySQL Crash-Safe Replication, Parallel Replication, and Slave Scaling (10 parts) at Booking.com
- Sharding MySQL at Pinterest
- How Airbnb Partitioned Main MySQL Database in Two Weeks
- Replication is the Key for Scalability & High Availability
- How Twitch uses PostgreSQL
- Scaling MySQL-based financial reporting system at Airbnb
- Scaling to 100M at Wix: MySQL is a Better NoSQL
- Why Uber Engineering Switched from Postgres to MySQL
- Handling Growth with Postgres at Instagram
- Scaling the Analytics Database (Postgres) at TransferWise
- MySQL Sharding (3 parts) at Evernote
- Time Series Database (TSDB)
- Time Series Data: Why and How to Use a Relational Database instead of NoSQL
- Beringei: High-performance Time Series Storage Engine at Facebook
- Atlas: In-memory Dimensional Time Series Database at Netflix
- Heroic: Time Series Database at Spotify
- Roshi: Distributed Storage System for Time-Series Event at SoundCloud
- Building a Scalable Time Series Database on PostgreSQL
- Scaling Time Series Data Storage at Netflix
- HTTP Caching (Reverse Proxy, CDN)
- Reverse Proxy (Nginx, Varnish, Squid, rack-cache)
- Stop Worrying and Love the Proxy
- Playing HTTP Tricks with Nginx
- Using CDN to Improve Site Performance at Coursera
- Strategy: Caching 404s Saved 66% On Server Time at The Onion
- Increasing Application Performance with HTTP Cache Headers
- Zynga Geo Proxy: Reducing Mobile Game Latency at Zynga
- Google AMP at Condé Nast
- Running A/B Tests on Hosting Infrastructure (CDNs) at Deliveroo
- Load Balancing
- Introduction to Modern Network Load Balancing and Proxying
- Load Balancing infrastructure to support more than 1.3 billion users at Facebook
- DHCPLB: Open Source Load Balancer for DHCP at Facebook
- Load Balancing with Eureka at Netflix
- Load Balancing at Yelp
- Load Balancing at GitHub
- Consistent Hashing to Improve Load Balancing at Vimeo
- UDP Load Balancing at 500px
- Autoscaling
- Concurrency
- Parallel Computing
- SPMD (Single Program Multiple Data): The Genetic Pattern
- Master/Worker Pattern
- Loop Parallelism Pattern: Extracting parallel tasks from loops
- Fork/Join Pattern: Good for recursive data processing
- Map-Reduce: Born for Simplified Data Processing on Large Clusters
- On the Death of Map-Reduce - Henry Robinson, Cloudera
- Server-side Optimization to Parallelize the Rendering of Web Pages at Yelp
- Event-Driven Architecture
- Distributed Machine Learning
- Scalable Deep Learning Platform On Spark In Baidu
- Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
- Scaling Gradient Boosted Trees for Click-Through-Rate Prediction at Yelp
- TensorFlowOnSpark: Distributed Deep Learning on Big Data Clusters at Yahoo
- CaffeOnSpark: Distributed Deep Learning on Big Data Clusters at Yahoo
- AIOps in Practice at Baidu
- Learning with Privacy at Scale - Differential Privacy Team, Apple
- Image Classification Experiment Using Deep Learning at Mercari
- Content-based Video Relevance Prediction at Hulu
- PaddlePaddle Fluid: Elastic Deep Learning on Kubernetes at Baidu
- Training ML Models with Airflow and BigQuery at WePay
- Improving Photo Selection With Deep Learning at TripAdvisor
- Machine Learning (2 parts) at Condé Nast
- Distributed Architecture in Financial Systems
- Failover
- Replication
- NodeJS High Availability at Yahoo
- Every Day Is Monday in Operations (11 parts) at LinkedIn
- Practical Guide to Monitoring and Alerting with Time Series at Scale
- How Robust Monitoring Powers High Availability for LinkedIn Feed
- Architectural Patterns for High Availability - Adrian Cockcroft, Director of Architecture at Netflix
- Ensuring Resilience to Disaster at Quora
- Resiliency against Traffic Oversaturation at iHeartRadio
- Circuit Breaker
- Always use timeouts (if possible)
- Let it crash/Supervisors: Embrace failure as a natural state in the life-cycle of the application
- Crash early: An error now is better than a response tomorrow
- Bulkheads: Partition and tolerate failure in one part
- Steady state: Always put logs on a separate disk
- Throttling: Maintain a steady pace
- Multi-clustering: Improving Resiliency and Stability of a Large-scale Monolithic API Service at LinkedIn
- Web Performance: Cache Efficiency Exercise at Facebook
- Improving Performance with Background Data Prefetching at Instagram
- Compression Techniques to Solve Network I/O Bottlenecks at eBay
- Optimizing Web Servers for High Throughput and Low Latency at Dropbox
- Boosting Site Speed Using Brotli Compression at LinkedIn
- Linux Performance Analysis in 60,000 Milliseconds at Netflix
- Optimizing 360 Photos at Scale at Facebook
- Reducing Image File Size in the Photos Infrastructure at Etsy
- Improving Video Thumbnails with Deep Neural Nets at YouTube
- Optimizing APIs through Dynamic Polyglot Runtime, Fully Asynchronous, and Reactive Programming at Netflix
- Optimizing Video Playback Performance at Pinterest
- Reducing Video Loading Time by Prefetching during Preroll at Dailymotion
- Improving GIF Performance at Pinterest
- Performance Improvements (All Stacks) at Pinterest
- Server Side Rendering at Wix
- 30x Performance Improvements on MySQLStreamer at Yelp
- Performance Monitoring with Riemann and Clojure at Walmart
- Improving Homepage Performance at Zillow
- Decreasing RAM Usage by 40% Using jemalloc with Python & Celery at Zapier
- Architecture of Tripod (Flickr’s Backend)
- Architecture of SurveyMonkey
- Architecture of Data Platform at Flipkart
- Distributed Cron Architecture at Quora
- Simone: Distributed Simulation Service at Netflix
- Seagull: Distributed System that Helps Run > 20 Million Tests Per Day at Yelp
- Cloud Bouncer: Distributed Rate Limiting at Yahoo
- Selecting a Cloud Provider at Etsy
- Basic Infrastructure Patterns at Zenefits
- Syscall Auditing at Scale at Slack
- Scaling Online Migrations at Stripe
- Netflix: What Happens When You Press Play?
- Service Decomposition at Scale at Intuit QuickBooks
- Back-end at BlaBlaCar
- Scalable Gaming Patterns on AWS
- How League Of Legends Scaled Chat To 70 Million Players
- Talks on Efficiency, Reliability, and Scaling - James Hamilton, Vice President and Distinguished Engineer at AWS
- Building Real Time Infrastructure at Facebook - Jeff Barber and Shie Erlich, Software Engineers at Facebook
- Building Reliable Social Infrastructure for Google - Marc Alvidrez, Senior Manager at Google
- How Google Does Planet-Scale Engineering for Planet-Scale Infra - Melissa Binde, SRE Director for Google Cloud Platform
- Netflix Guide to Microservices - Josh Evans, Director of Operations Engineering at Netflix
- Achieving Rapid Response Times in Large Online Services - Jeff Dean, Google Senior Fellow
- Architecture to Handle 80K RPS Celebrity Sales at Shopify - Simon Eskildsen, Engineering Lead at Shopify
- How We've Scaled Dropbox - Kevin Modzelewski, Back-end Engineer at Dropbox
- Lessons of Scale at Facebook - Bobby Johnson, Director of Engineering at Facebook
- Performance Optimization for the Greater China Region at Salesforce - Jeff Cheng, Enterprise Architect at Salesforce
- How GIPHY Delivers a GIF to 300 Million Users - Alex Hoang and Nima Khoshini, Services Engineers at GIPHY
- Scaling Facebook Live Videos to a Billion Users - Sachin Kulkarni, Director of Engineering at Facebook
- Scaling Instagram Infrastructure - Lisa Guo, Instagram Engineering
- Scaling Twitter Core Infrastructure - Yao Yue, Staff Software Engineer at Twitter
- Scaling Pinterest - Marty Weiner, Pinterest’s founding engineer
- Scaling Spotify Data Infrastructure - Matti (Lepistö) Pehrs, Spotify
- Scaling Uber's Backend by Breaking Everything - Matt Ranney, Chief Systems Architect at Uber
- Scaling Slack - Bing Wei, Software Engineer (Infrastructure) at Slack
- Scaling YouTube's Backend - Sugu Sougoumarane, SDE at YouTube
- Scaling (an NSFW site) to 200 Million Views A Day And Beyond - Eric Pickup, Lead Platform Developer at MindGeek
- Google Site Reliability Engineering (Online - Free)
- Distributed Systems for Fun and Profit (Online - Free)
- Beyond the Twelve-Factor App - Exploring the DNA of Highly Scalable, Resilient Cloud Applications (Free)
- Chaos Engineering - Building Confidence in System Behavior through Experiments (Free)
- The Art of Scalability
- Designing Data-Intensive Applications
- Web Scalability for Startup Engineers
- Scalability Rules: 50 Principles for Scaling Web Sites
- Jonas Bonér, CTO at Lightbend, for the original inspiration
Copyright Benny (Quoc-Binh) Nguyen, 2018. Licensed under a Creative Commons Attribution 4.0 International License.