TL;DR
- Horizontal scaling > vertical scaling for most web apps.
- Load balancers distribute traffic. CDNs cache static assets at the edge.
- Caching (Redis, CDN, browser) is the #1 performance lever.
- Know the tradeoffs: CAP theorem, consistency vs availability.
Step 1: Scaling Fundamentals
Scaling is the reason system design interviews exist. Every startup begins on a single server, and the moment it succeeds, that server falls over under load. The choice between vertical scaling (bigger machine) and horizontal scaling (more machines) shapes your entire architecture. Horizontal won because hardware has a ceiling, it creates redundancy, and cloud providers make adding instances trivial. But it introduces complexity: state management, data consistency, and load distribution — which is everything else in this handbook.
Vertical vs Horizontal Scaling
Vertical Scaling (Scale Up) Horizontal Scaling (Scale Out)
┌────────────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
│ │ │ S1 │ │ S2 │ │ S3 │ │ S4 │
│ BIGGER │ └────┘ └────┘ └────┘ └────┘
│ SERVER │ ↑
│ │ Load Balancer
└────────────┘ ┌──────────────────────┐
│ Traffic │
More CPU, RAM └──────────────────────┘
Has limits Add more machines
Single point of failure Redundancy built-in
| Vertical | Horizontal | |
|---|---|---|
| Cost | Exponential | Linear |
| Limit | Hardware ceiling | Practically unlimited |
| Complexity | Simple | Need load balancing, state management |
| Downtime | Required for upgrade | Zero-downtime possible |
Step 2: Load Balancing
Load balancers exist because the moment you have multiple servers, you need something to decide which server handles each request. Without one, you'd need DNS round-robin (no health checks, sticky DNS caches) or manual traffic splitting. Load balancers also enable zero-downtime deployments, automatic failover, and SSL termination. Understanding the different algorithms (round-robin, least connections, IP hash) comes up in every system design interview because the choice directly affects session handling and resource utilization.
Algorithms
| Algorithm | How it Works | Best For |
|---|---|---|
| Round Robin | Rotate through servers 1→2→3→1→... | Equal servers, stateless |
| Weighted Round Robin | More traffic to stronger servers | Mixed hardware |
| Least Connections | Route to server with fewest active connections | Variable request times |
| IP Hash | Same client always hits same server | Session affinity |
| Random | Randomly pick a server | Simple, surprisingly effective |
Architecture
┌─── App Server 1
User → DNS → CDN → LB ─── App Server 2
└─── App Server 3
│
┌────┴────┐
│ Cache │
│ (Redis) │
└────┬────┘
│
┌────┴────┐
│Database │
│(Primary)│
└────┬────┘
│
┌─────────┴─────────┐
│ Read Replica 1 │ Read Replica 2
└───────────────────┘
Health Checks
# Load balancer health check config
health_check:
path: /health
interval: 10s
timeout: 3s
unhealthy_threshold: 3 # 3 failures = remove from pool
healthy_threshold: 2 # 2 successes = add back
Step 3: Caching Strategies
Caching is the single biggest performance lever in any system — it's cheaper and faster to add cache than to optimize queries or add servers. The reason it's so effective: most apps have a heavily skewed access pattern (80/20 rule) where a small fraction of data gets most of the reads. Understanding the cache hierarchy (browser → CDN → Redis → DB) and patterns (cache-aside, write-through, write-behind) is essential because the wrong strategy leads to stale data, thundering herds, or cache stampedes that can bring down your system.
Cache Hierarchy
Browser Cache (fastest, per-user)
↓ miss
CDN Cache (fast, per-region)
↓ miss
Application Cache - Redis (fast, shared)
↓ miss
Database Query Cache
↓ miss
Database Disk
Patterns
Cache-Aside (Lazy Loading)
def get_user(user_id):
# 1. Check cache
cached = redis.get(f"user:{user_id}")
if cached:
return json.loads(cached)
# 2. Cache miss — fetch from DB
user = db.query("SELECT * FROM users WHERE id = ?", user_id)
# 3. Store in cache for next time
redis.setex(f"user:{user_id}", 3600, json.dumps(user)) # TTL: 1 hour
return user
Write-Through
def update_user(user_id, data):
# 1. Write to DB
db.update("UPDATE users SET ... WHERE id = ?", data, user_id)
# 2. Update cache immediately
redis.setex(f"user:{user_id}", 3600, json.dumps(data))
Write-Behind (Write-Back)
def update_user(user_id, data):
# 1. Write to cache only (fast response)
redis.setex(f"user:{user_id}", 3600, json.dumps(data))
# 2. Async worker writes to DB later (batched)
queue.push({"type": "user_update", "id": user_id, "data": data})
Cache Invalidation
| Strategy | Pros | Cons |
|---|---|---|
| TTL (Time-to-Live) | Simple, automatic | Stale data until expiry |
| Event-based | Immediate freshness | Complex, need event system |
| Version keys | Easy rollback | More cache keys |
Step 4: Database Scaling
Database scaling is where system design gets hard. Your app servers are stateless and easily replicated, but your database holds the truth — and truth is hard to distribute. Read replicas handle read-heavy workloads (most apps are 90%+ reads), but writes still hit a single primary. When that's not enough, sharding splits data across machines, but it makes joins impossible across shards and requires careful shard key selection. This is the domain where "it depends" actually matters, and interviewers want to hear you reason about the tradeoffs.
Read Replicas
Writes → Primary DB (single source of truth)
Reads → Read Replicas (many, distributed)
Replication lag: 10ms - 1s (eventual consistency)
Sharding (Horizontal Partitioning)
User data split by user_id:
Shard 1: user_id 1-1M (Server A)
Shard 2: user_id 1M-2M (Server B)
Shard 3: user_id 2M-3M (Server C)
Shard key selection is CRITICAL:
- Even distribution
- Queries should hit single shard
- Avoid hot spots
When to Scale What
| Symptom | Solution |
|---|---|
| Slow reads | Read replicas, caching |
| Slow writes | Sharding, write-behind cache |
| Too much data | Sharding, archiving old data |
| Complex queries | Denormalization, materialized views |
| Full-text search | Elasticsearch/OpenSearch |
Step 5: CAP Theorem
The CAP theorem (proved by Brewer in 2000) states that in the presence of a network partition, you must choose between consistency and availability. It matters because distributed systems WILL experience network issues — the question is how your system behaves when they happen. A banking system must choose consistency (reject transactions if unsure), while a social feed chooses availability (show potentially stale data). This framework is used in every system design interview to justify architecture decisions and understand database choices.
Consistency (C)
/\
/ \
/ \
/ Pick \
/ Two \
/──────────\
Availability Partition
(A) Tolerance (P)
You can only have 2 of 3 (and P is not optional in distributed systems):
| Choice | Trade-off | Example |
|---|---|---|
| CP | Sacrifices availability during partitions | MongoDB, Redis Cluster |
| AP | Sacrifices consistency (eventual) | Cassandra, DynamoDB |
| CA | Can't handle network partitions (single node) | Traditional RDBMS |
Practical Decision
Banking transaction → CP (consistency matters, reject if uncertain)
Social media feed → AP (show slightly stale data, always available)
Shopping cart → AP (merge conflicts later, never lose items)
Inventory count → CP (don't oversell)
Step 6: Content Delivery Networks (CDN)
CDNs exist because the speed of light is a real constraint — a user in Tokyo making a request to a server in Virginia experiences 200ms of unavoidable network latency. CDNs place cached copies of your content at "edge" locations worldwide, reducing latency to 10-30ms for cached content. For static assets, CDNs reduce origin load by 90%+. For modern apps using ISR/SSG (Next.js), CDNs serve entire pages from the edge. Cloudflare, AWS CloudFront, and Vercel's Edge Network are the major players.
Without CDN:
User (Tokyo) → Server (US-East) = 200ms latency
With CDN:
User (Tokyo) → CDN Edge (Tokyo) = 20ms latency
CDN Edge → Origin (cache miss only) = 200ms
What to CDN
| Cache | Don't Cache |
|---|---|
| Static assets (JS, CSS, images) | Personalized content |
| API responses (public, rarely changes) | Real-time data |
| HTML pages (SSG/ISR) | Authentication responses |
| Fonts, videos | Write operations |
Step 7: Back-of-Envelope Calculations
Latency Numbers Every Developer Should Know
| Operation | Time |
|---|---|
| L1 cache reference | 1 ns |
| L2 cache reference | 4 ns |
| RAM reference | 100 ns |
| SSD random read | 16 μs |
| HDD seek | 4 ms |
| Network round-trip (same datacenter) | 0.5 ms |
| Network round-trip (cross-continent) | 150 ms |
Quick Estimation
1 million users, 10% daily active = 100K DAU
100K users × 10 requests/day = 1M requests/day
1M / 86400 seconds ≈ 12 requests/second (average)
Peak = 3-5x average ≈ 50 RPS
Can a single server handle 50 RPS? YES (most can handle 1000+ RPS)
When do you NEED horizontal scaling? 5000+ RPS or high availability requirements
Interview Questions
-
How would you design a system to handle 10M users?
- Load balancer → multiple app servers (stateless) → Redis cache → database with read replicas. CDN for static assets. Shard database when single instance is bottleneck. Queue for async work.
-
Explain the difference between horizontal and vertical scaling.
- Vertical: bigger machine (more RAM/CPU). Horizontal: more machines behind a load balancer. Horizontal preferred for availability, cost-effectiveness, and practically unlimited growth.
-
How do you handle session state with multiple servers?
- Options: Sticky sessions (IP hash, not recommended), centralized session store (Redis), or JWT tokens (stateless — preferred).
-
What's eventual consistency?
- After a write, reads may temporarily return stale data. All replicas converge to the same value eventually. Trade-off: higher availability for temporary inconsistency.