Indexing & Query Optimization — SQL Handbook

TL;DR

Indexes speed up reads (O(log n) vs O(n)) but slow down writes.
Use EXPLAIN ANALYZE to understand query performance.
Index columns in WHERE, JOIN, ORDER BY — but don't over-index.
Composite indexes: leftmost prefix matters. (a, b, c) supports queries on a, a+b, a+b+c.

Step 1: How Indexes Work

Indexes are the reason databases can serve queries in milliseconds over millions of rows instead of scanning every single row. The B-Tree (balanced tree) index, invented in 1972, is the most common structure — it maintains sorted data in a tree that allows O(log n) lookups. Without indexes, every query is a full table scan (O(n)). With the right index, finding a row in 1 million records takes ~20 comparisons instead of 1 million. Understanding index types and when to use each is the most impactful database performance skill.

Without index (Full Table Scan):
┌─────────────────────────────────────────┐
│ Scan ALL rows → O(n) → 1M rows = slow  │
└─────────────────────────────────────────┘

With B-Tree index:
         [50]
        /    \
    [25]      [75]
   /   \     /   \
 [12] [37] [62] [87]    ← O(log n) → 1M rows = ~20 lookups

Index Types

Type	Use Case	Example
B-Tree (default)	Equality, range, sorting	`WHERE age > 25`, `ORDER BY`
Hash	Exact equality only	`WHERE id = 123`
GIN	Full-text search, arrays, JSONB	`WHERE tags @> '{react}'`
GiST	Geometric, range types	PostGIS spatial queries
BRIN	Large sequential tables	Time-series data

Step 2: Creating Effective Indexes

Creating indexes is easy; creating the right indexes is an art. Over-indexing wastes disk space and slows writes (every INSERT/UPDATE must update all indexes). Under-indexing causes slow queries. Partial indexes (only indexing rows matching a condition) reduce index size dramatically for filtered queries. Covering indexes (INCLUDE) eliminate table lookups entirely. Expression indexes handle computed conditions. Knowing these techniques is what separates a developer who "adds an index" from one who designs an efficient indexing strategy.

-- Single column index
CREATE INDEX idx_users_email ON users(email);

-- Composite index (column order matters!)
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

-- Partial index (only index relevant rows)
CREATE INDEX idx_active_orders ON orders(created_at)
WHERE status = 'active';  -- Smaller index, faster for active order queries

-- Expression index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Now WHERE LOWER(email) = 'alice@test.com' uses the index

-- Covering index (INCLUDE — avoids table lookup)
CREATE INDEX idx_orders_covering ON orders(user_id)
INCLUDE (total, created_at);
-- Index-only scan: no need to read the actual table row

Step 3: Composite Index Rules

Composite indexes were invented because queries often filter on multiple columns simultaneously ("orders for user X after date Y"). A composite index on (user_id, created_at) handles both single-column and multi-column queries — but only in leftmost-prefix order. This "leftmost prefix rule" is the most common indexing gotcha: an index on (A, B, C) supports queries on A, A+B, or A+B+C, but NOT B alone or C alone. Getting column order right in composite indexes is the most frequently tested database optimization topic.

CREATE INDEX idx_example ON orders(user_id, status, created_at);

-- ✅ Uses index (leftmost prefix)
WHERE user_id = 1
WHERE user_id = 1 AND status = 'active'
WHERE user_id = 1 AND status = 'active' AND created_at > '2026-01-01'

-- ⚠️ Partial index use
WHERE user_id = 1 AND created_at > '2026-01-01'  -- Skips status (uses user_id only)

-- ❌ Cannot use index (skips leftmost column)
WHERE status = 'active'
WHERE created_at > '2026-01-01'
WHERE status = 'active' AND created_at > '2026-01-01'

Column Order Strategy

Put columns in this order:
1. Equality conditions (=) first
2. Range conditions (>, <, BETWEEN) last
3. High-cardinality first among equals

Example: Searching orders by user + date range
✅ CREATE INDEX ON orders(user_id, created_at);
❌ CREATE INDEX ON orders(created_at, user_id);

Step 4: EXPLAIN ANALYZE — Reading Query Plans

EXPLAIN ANALYZE is the X-ray machine for query performance. It was built into PostgreSQL (and equivalents exist in every database) because you can't optimize what you can't measure. It shows you exactly what the database engine does: which indexes it uses, how it joins tables, estimated vs actual row counts, and where time is spent. The gap between estimated and actual rows often reveals stale statistics or cardinality estimation errors. Learning to read query plans is the most practical skill for diagnosing slow queries in production.

EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;

Key Terms in Query Plans

Term	Meaning	Good/Bad
Seq Scan	Full table scan	Bad (unless table is tiny)
Index Scan	Uses index to find rows	Good
Index Only Scan	All data from index (no table access)	Best
Bitmap Index Scan	Index → bitmap → table	Good (multiple conditions)
Nested Loop	For each row in A, scan B	Good for small tables
Hash Join	Build hash of one table, probe with other	Good for large tables
Sort	Explicit sort operation	Bad if unexpected (add index)

Reading the Numbers

Hash Join  (cost=100.00..500.00 rows=1000 width=40)
            (actual time=2.5..15.3 rows=982 loops=1)
             ↑               ↑        ↑          ↑
        Estimated cost  Actual time  Real rows  Iterations

Planning Time: 0.5 ms
Execution Time: 16.1 ms  ← The number that matters

Common Fixes

Problem	Solution
Seq Scan on large table	Add index on WHERE/JOIN columns
Sort taking too long	Add index matching ORDER BY
Nested Loop on large tables	Ensure join columns are indexed
High cost estimate	Check statistics: `ANALYZE tablename;`

Step 5: Query Optimization Techniques

These optimization techniques represent decades of collective wisdom from database administrators and performance engineers. Each one addresses a common anti-pattern: SELECT * prevents index-only scans, functions on indexed columns bypass indexes, OFFSET pagination degrades with depth, and the N+1 query problem multiplies database round-trips. These aren't theoretical — they're the exact patterns that cause production incidents when traffic scales, and they're what senior engineers look for in code reviews.

1. Avoid SELECT *

-- ❌ Fetches all columns (prevents index-only scan)
SELECT * FROM orders WHERE user_id = 1;

-- ✅ Only fetch what you need
SELECT id, total, created_at FROM orders WHERE user_id = 1;

2. Use EXISTS Instead of IN (for large subqueries)

-- ❌ IN with large subquery (loads all IDs into memory)
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total > 100);

-- ✅ EXISTS (stops at first match per row)
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id AND o.total > 100
);

3. Avoid Functions on Indexed Columns

-- ❌ Index on created_at is USELESS here
WHERE YEAR(created_at) = 2026

-- ✅ Rewrite as range (index can be used)
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'

-- ❌ Index on email is USELESS
WHERE LOWER(email) = 'alice@test.com'

-- ✅ Create expression index or store lowercase
CREATE INDEX idx_lower_email ON users(LOWER(email));

4. Pagination Done Right

-- ❌ OFFSET is slow for deep pages (scans and discards rows)
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 10000;
-- Must scan 10020 rows, discard 10000

-- ✅ Cursor-based pagination (keyset)
SELECT * FROM products
WHERE id > 10000  -- Last seen ID
ORDER BY id
LIMIT 20;
-- Jumps directly to the right place via index

5. Batch Operations

-- ❌ N+1 query pattern
for user in users:
    orders = SELECT * FROM orders WHERE user_id = user.id

-- ✅ Single query with JOIN or IN
SELECT * FROM orders WHERE user_id IN (1, 2, 3, 4, 5);
-- Or use JOIN

Step 6: Transaction Isolation Levels

Level	Dirty Read	Non-Repeatable Read	Phantom Read	Use Case
Read Uncommitted	✅	✅	✅	Almost never
Read Committed	❌	✅	✅	PostgreSQL default
Repeatable Read	❌	❌	✅	Financial reports
Serializable	❌	❌	❌	Critical transactions

-- Set isolation level
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Critical operation
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;

Step 7: Monitoring & Maintenance

-- Find slow queries (PostgreSQL)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check index usage
SELECT
    indexrelname AS index_name,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;  -- Unused indexes at top (candidates for removal)

-- Table bloat check
SELECT
    relname AS table,
    n_live_tup AS live_rows,
    n_dead_tup AS dead_rows,
    ROUND(n_dead_tup * 100.0 / NULLIF(n_live_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_pct DESC;

-- Rebuild bloated indexes
REINDEX INDEX idx_orders_user_id;
-- Or non-blocking:
CREATE INDEX CONCURRENTLY idx_orders_user_id_new ON orders(user_id);
DROP INDEX idx_orders_user_id;
ALTER INDEX idx_orders_user_id_new RENAME TO idx_orders_user_id;

Interview Questions

When should you NOT add an index?
- On tables with heavy writes (index maintenance slows inserts/updates). On low-cardinality columns (boolean with 50/50 split). On small tables (seq scan is faster). On columns rarely used in WHERE/JOIN.
What's the N+1 query problem?
- Fetching a list (1 query), then fetching related data for each item (N queries). Fix: JOIN, eager loading, or batching with IN clause. Example: 1 query for users + 100 queries for each user's orders = 101 queries instead of 1 JOIN.
How does cursor-based pagination work?
- Instead of OFFSET (which scans and discards), use a WHERE clause with the last-seen value (WHERE id > last_id LIMIT 20). Leverages the index directly. Constant performance regardless of page depth.
What's a covering index?
- An index that contains ALL columns needed by a query (via INCLUDE). The database can answer the query from the index alone without reading the actual table row (index-only scan). Fastest possible lookup.