Scalability & Core Concepts — System Design Handbook

TL;DR

Horizontal scaling > vertical scaling for most web apps.
Load balancers distribute traffic. CDNs cache static assets at the edge.
Caching (Redis, CDN, browser) is the #1 performance lever.
Know the tradeoffs: CAP theorem, consistency vs availability.

Step 1: Scaling Fundamentals

Scaling is the reason system design interviews exist. Every startup begins on a single server, and the moment it succeeds, that server falls over under load. The choice between vertical scaling (bigger machine) and horizontal scaling (more machines) shapes your entire architecture. Horizontal won because hardware has a ceiling, it creates redundancy, and cloud providers make adding instances trivial. But it introduces complexity: state management, data consistency, and load distribution — which is everything else in this handbook.

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up)          Horizontal Scaling (Scale Out)
┌────────────┐                       ┌────┐ ┌────┐ ┌────┐ ┌────┐
│            │                       │ S1 │ │ S2 │ │ S3 │ │ S4 │
│   BIGGER   │                       └────┘ └────┘ └────┘ └────┘
│   SERVER   │                            ↑
│            │                       Load Balancer
└────────────┘                       ┌──────────────────────┐
                                     │     Traffic          │
More CPU, RAM                        └──────────────────────┘
Has limits                           Add more machines
Single point of failure              Redundancy built-in

	Vertical	Horizontal
Cost	Exponential	Linear
Limit	Hardware ceiling	Practically unlimited
Complexity	Simple	Need load balancing, state management
Downtime	Required for upgrade	Zero-downtime possible

Step 2: Load Balancing

Load balancers exist because the moment you have multiple servers, you need something to decide which server handles each request. Without one, you'd need DNS round-robin (no health checks, sticky DNS caches) or manual traffic splitting. Load balancers also enable zero-downtime deployments, automatic failover, and SSL termination. Understanding the different algorithms (round-robin, least connections, IP hash) comes up in every system design interview because the choice directly affects session handling and resource utilization.

Algorithms

Algorithm	How it Works	Best For
Round Robin	Rotate through servers 1→2→3→1→...	Equal servers, stateless
Weighted Round Robin	More traffic to stronger servers	Mixed hardware
Least Connections	Route to server with fewest active connections	Variable request times
IP Hash	Same client always hits same server	Session affinity
Random	Randomly pick a server	Simple, surprisingly effective

Architecture

                    ┌─── App Server 1
User → DNS → CDN → LB ─── App Server 2
                    └─── App Server 3
                              │
                         ┌────┴────┐
                         │  Cache  │
                         │ (Redis) │
                         └────┬────┘
                              │
                         ┌────┴────┐
                         │Database │
                         │(Primary)│
                         └────┬────┘
                              │
                    ┌─────────┴─────────┐
                    │ Read Replica 1    │ Read Replica 2
                    └───────────────────┘

Health Checks

# Load balancer health check config
health_check:
  path: /health
  interval: 10s
  timeout: 3s
  unhealthy_threshold: 3   # 3 failures = remove from pool
  healthy_threshold: 2     # 2 successes = add back

Step 3: Caching Strategies

Caching is the single biggest performance lever in any system — it's cheaper and faster to add cache than to optimize queries or add servers. The reason it's so effective: most apps have a heavily skewed access pattern (80/20 rule) where a small fraction of data gets most of the reads. Understanding the cache hierarchy (browser → CDN → Redis → DB) and patterns (cache-aside, write-through, write-behind) is essential because the wrong strategy leads to stale data, thundering herds, or cache stampedes that can bring down your system.

Cache Hierarchy

Browser Cache (fastest, per-user)
    ↓ miss
CDN Cache (fast, per-region)
    ↓ miss
Application Cache - Redis (fast, shared)
    ↓ miss
Database Query Cache
    ↓ miss
Database Disk

Patterns

Cache-Aside (Lazy Loading)

def get_user(user_id):
    # 1. Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # 2. Cache miss — fetch from DB
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # 3. Store in cache for next time
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))  # TTL: 1 hour

    return user

Write-Through

def update_user(user_id, data):
    # 1. Write to DB
    db.update("UPDATE users SET ... WHERE id = ?", data, user_id)

    # 2. Update cache immediately
    redis.setex(f"user:{user_id}", 3600, json.dumps(data))

Write-Behind (Write-Back)

def update_user(user_id, data):
    # 1. Write to cache only (fast response)
    redis.setex(f"user:{user_id}", 3600, json.dumps(data))

    # 2. Async worker writes to DB later (batched)
    queue.push({"type": "user_update", "id": user_id, "data": data})

Cache Invalidation

Strategy	Pros	Cons
TTL (Time-to-Live)	Simple, automatic	Stale data until expiry
Event-based	Immediate freshness	Complex, need event system
Version keys	Easy rollback	More cache keys

Step 4: Database Scaling

Database scaling is where system design gets hard. Your app servers are stateless and easily replicated, but your database holds the truth — and truth is hard to distribute. Read replicas handle read-heavy workloads (most apps are 90%+ reads), but writes still hit a single primary. When that's not enough, sharding splits data across machines, but it makes joins impossible across shards and requires careful shard key selection. This is the domain where "it depends" actually matters, and interviewers want to hear you reason about the tradeoffs.

Read Replicas

Writes → Primary DB (single source of truth)
Reads  → Read Replicas (many, distributed)

Replication lag: 10ms - 1s (eventual consistency)

Sharding (Horizontal Partitioning)

User data split by user_id:
  Shard 1: user_id 1-1M      (Server A)
  Shard 2: user_id 1M-2M     (Server B)
  Shard 3: user_id 2M-3M     (Server C)

Shard key selection is CRITICAL:
- Even distribution
- Queries should hit single shard
- Avoid hot spots

When to Scale What

Symptom	Solution
Slow reads	Read replicas, caching
Slow writes	Sharding, write-behind cache
Too much data	Sharding, archiving old data
Complex queries	Denormalization, materialized views
Full-text search	Elasticsearch/OpenSearch

Step 5: CAP Theorem

The CAP theorem (proved by Brewer in 2000) states that in the presence of a network partition, you must choose between consistency and availability. It matters because distributed systems WILL experience network issues — the question is how your system behaves when they happen. A banking system must choose consistency (reject transactions if unsure), while a social feed chooses availability (show potentially stale data). This framework is used in every system design interview to justify architecture decisions and understand database choices.

    Consistency (C)
        /\
       /  \
      /    \
     / Pick \
    /  Two   \
   /──────────\
Availability    Partition
    (A)        Tolerance (P)

You can only have 2 of 3 (and P is not optional in distributed systems):

Choice	Trade-off	Example
CP	Sacrifices availability during partitions	MongoDB, Redis Cluster
AP	Sacrifices consistency (eventual)	Cassandra, DynamoDB
CA	Can't handle network partitions (single node)	Traditional RDBMS

Practical Decision

Banking transaction → CP (consistency matters, reject if uncertain)
Social media feed → AP (show slightly stale data, always available)
Shopping cart → AP (merge conflicts later, never lose items)
Inventory count → CP (don't oversell)

Step 6: Content Delivery Networks (CDN)

CDNs exist because the speed of light is a real constraint — a user in Tokyo making a request to a server in Virginia experiences 200ms of unavoidable network latency. CDNs place cached copies of your content at "edge" locations worldwide, reducing latency to 10-30ms for cached content. For static assets, CDNs reduce origin load by 90%+. For modern apps using ISR/SSG (Next.js), CDNs serve entire pages from the edge. Cloudflare, AWS CloudFront, and Vercel's Edge Network are the major players.

Without CDN:
  User (Tokyo) → Server (US-East) = 200ms latency

With CDN:
  User (Tokyo) → CDN Edge (Tokyo) = 20ms latency
  CDN Edge → Origin (cache miss only) = 200ms

What to CDN

Cache	Don't Cache
Static assets (JS, CSS, images)	Personalized content
API responses (public, rarely changes)	Real-time data
HTML pages (SSG/ISR)	Authentication responses
Fonts, videos	Write operations

Step 7: Back-of-Envelope Calculations

Latency Numbers Every Developer Should Know

Operation	Time
L1 cache reference	1 ns
L2 cache reference	4 ns
RAM reference	100 ns
SSD random read	16 μs
HDD seek	4 ms
Network round-trip (same datacenter)	0.5 ms
Network round-trip (cross-continent)	150 ms

Quick Estimation

1 million users, 10% daily active = 100K DAU
100K users × 10 requests/day = 1M requests/day
1M / 86400 seconds ≈ 12 requests/second (average)
Peak = 3-5x average ≈ 50 RPS

Can a single server handle 50 RPS? YES (most can handle 1000+ RPS)
When do you NEED horizontal scaling? 5000+ RPS or high availability requirements

Interview Questions

How would you design a system to handle 10M users?
- Load balancer → multiple app servers (stateless) → Redis cache → database with read replicas. CDN for static assets. Shard database when single instance is bottleneck. Queue for async work.
Explain the difference between horizontal and vertical scaling.
- Vertical: bigger machine (more RAM/CPU). Horizontal: more machines behind a load balancer. Horizontal preferred for availability, cost-effectiveness, and practically unlimited growth.
How do you handle session state with multiple servers?
- Options: Sticky sessions (IP hash, not recommended), centralized session store (Redis), or JWT tokens (stateless — preferred).
What's eventual consistency?
- After a write, reads may temporarily return stale data. All replicas converge to the same value eventually. Trade-off: higher availability for temporary inconsistency.