TL;DR
- Indexes speed up reads (O(log n) vs O(n)) but slow down writes.
- Use
EXPLAIN ANALYZEto understand query performance. - Index columns in WHERE, JOIN, ORDER BY — but don't over-index.
- Composite indexes: leftmost prefix matters.
(a, b, c)supports queries ona,a+b,a+b+c.
Step 1: How Indexes Work
Indexes are the reason databases can serve queries in milliseconds over millions of rows instead of scanning every single row. The B-Tree (balanced tree) index, invented in 1972, is the most common structure — it maintains sorted data in a tree that allows O(log n) lookups. Without indexes, every query is a full table scan (O(n)). With the right index, finding a row in 1 million records takes ~20 comparisons instead of 1 million. Understanding index types and when to use each is the most impactful database performance skill.
Without index (Full Table Scan):
┌─────────────────────────────────────────┐
│ Scan ALL rows → O(n) → 1M rows = slow │
└─────────────────────────────────────────┘
With B-Tree index:
[50]
/ \
[25] [75]
/ \ / \
[12] [37] [62] [87] ← O(log n) → 1M rows = ~20 lookups
Index Types
| Type | Use Case | Example |
|---|---|---|
| B-Tree (default) | Equality, range, sorting | WHERE age > 25, ORDER BY |
| Hash | Exact equality only | WHERE id = 123 |
| GIN | Full-text search, arrays, JSONB | WHERE tags @> '{react}' |
| GiST | Geometric, range types | PostGIS spatial queries |
| BRIN | Large sequential tables | Time-series data |
Step 2: Creating Effective Indexes
Creating indexes is easy; creating the right indexes is an art. Over-indexing wastes disk space and slows writes (every INSERT/UPDATE must update all indexes). Under-indexing causes slow queries. Partial indexes (only indexing rows matching a condition) reduce index size dramatically for filtered queries. Covering indexes (INCLUDE) eliminate table lookups entirely. Expression indexes handle computed conditions. Knowing these techniques is what separates a developer who "adds an index" from one who designs an efficient indexing strategy.
-- Single column index
CREATE INDEX idx_users_email ON users(email);
-- Composite index (column order matters!)
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);
-- Partial index (only index relevant rows)
CREATE INDEX idx_active_orders ON orders(created_at)
WHERE status = 'active'; -- Smaller index, faster for active order queries
-- Expression index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Now WHERE LOWER(email) = 'alice@test.com' uses the index
-- Covering index (INCLUDE — avoids table lookup)
CREATE INDEX idx_orders_covering ON orders(user_id)
INCLUDE (total, created_at);
-- Index-only scan: no need to read the actual table row
Step 3: Composite Index Rules
Composite indexes were invented because queries often filter on multiple columns simultaneously ("orders for user X after date Y"). A composite index on (user_id, created_at) handles both single-column and multi-column queries — but only in leftmost-prefix order. This "leftmost prefix rule" is the most common indexing gotcha: an index on (A, B, C) supports queries on A, A+B, or A+B+C, but NOT B alone or C alone. Getting column order right in composite indexes is the most frequently tested database optimization topic.
CREATE INDEX idx_example ON orders(user_id, status, created_at);
-- ✅ Uses index (leftmost prefix)
WHERE user_id = 1
WHERE user_id = 1 AND status = 'active'
WHERE user_id = 1 AND status = 'active' AND created_at > '2026-01-01'
-- ⚠️ Partial index use
WHERE user_id = 1 AND created_at > '2026-01-01' -- Skips status (uses user_id only)
-- ❌ Cannot use index (skips leftmost column)
WHERE status = 'active'
WHERE created_at > '2026-01-01'
WHERE status = 'active' AND created_at > '2026-01-01'
Column Order Strategy
Put columns in this order:
1. Equality conditions (=) first
2. Range conditions (>, <, BETWEEN) last
3. High-cardinality first among equals
Example: Searching orders by user + date range
✅ CREATE INDEX ON orders(user_id, created_at);
❌ CREATE INDEX ON orders(created_at, user_id);
Step 4: EXPLAIN ANALYZE — Reading Query Plans
EXPLAIN ANALYZE is the X-ray machine for query performance. It was built into PostgreSQL (and equivalents exist in every database) because you can't optimize what you can't measure. It shows you exactly what the database engine does: which indexes it uses, how it joins tables, estimated vs actual row counts, and where time is spent. The gap between estimated and actual rows often reveals stale statistics or cardinality estimation errors. Learning to read query plans is the most practical skill for diagnosing slow queries in production.
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;
Key Terms in Query Plans
| Term | Meaning | Good/Bad |
|---|---|---|
| Seq Scan | Full table scan | Bad (unless table is tiny) |
| Index Scan | Uses index to find rows | Good |
| Index Only Scan | All data from index (no table access) | Best |
| Bitmap Index Scan | Index → bitmap → table | Good (multiple conditions) |
| Nested Loop | For each row in A, scan B | Good for small tables |
| Hash Join | Build hash of one table, probe with other | Good for large tables |
| Sort | Explicit sort operation | Bad if unexpected (add index) |
Reading the Numbers
Hash Join (cost=100.00..500.00 rows=1000 width=40)
(actual time=2.5..15.3 rows=982 loops=1)
↑ ↑ ↑ ↑
Estimated cost Actual time Real rows Iterations
Planning Time: 0.5 ms
Execution Time: 16.1 ms ← The number that matters
Common Fixes
| Problem | Solution |
|---|---|
| Seq Scan on large table | Add index on WHERE/JOIN columns |
| Sort taking too long | Add index matching ORDER BY |
| Nested Loop on large tables | Ensure join columns are indexed |
| High cost estimate | Check statistics: ANALYZE tablename; |
Step 5: Query Optimization Techniques
These optimization techniques represent decades of collective wisdom from database administrators and performance engineers. Each one addresses a common anti-pattern: SELECT * prevents index-only scans, functions on indexed columns bypass indexes, OFFSET pagination degrades with depth, and the N+1 query problem multiplies database round-trips. These aren't theoretical — they're the exact patterns that cause production incidents when traffic scales, and they're what senior engineers look for in code reviews.
1. Avoid SELECT *
-- ❌ Fetches all columns (prevents index-only scan)
SELECT * FROM orders WHERE user_id = 1;
-- ✅ Only fetch what you need
SELECT id, total, created_at FROM orders WHERE user_id = 1;
2. Use EXISTS Instead of IN (for large subqueries)
-- ❌ IN with large subquery (loads all IDs into memory)
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total > 100);
-- ✅ EXISTS (stops at first match per row)
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.user_id = u.id AND o.total > 100
);
3. Avoid Functions on Indexed Columns
-- ❌ Index on created_at is USELESS here
WHERE YEAR(created_at) = 2026
-- ✅ Rewrite as range (index can be used)
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'
-- ❌ Index on email is USELESS
WHERE LOWER(email) = 'alice@test.com'
-- ✅ Create expression index or store lowercase
CREATE INDEX idx_lower_email ON users(LOWER(email));
4. Pagination Done Right
-- ❌ OFFSET is slow for deep pages (scans and discards rows)
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 10000;
-- Must scan 10020 rows, discard 10000
-- ✅ Cursor-based pagination (keyset)
SELECT * FROM products
WHERE id > 10000 -- Last seen ID
ORDER BY id
LIMIT 20;
-- Jumps directly to the right place via index
5. Batch Operations
-- ❌ N+1 query pattern
for user in users:
orders = SELECT * FROM orders WHERE user_id = user.id
-- ✅ Single query with JOIN or IN
SELECT * FROM orders WHERE user_id IN (1, 2, 3, 4, 5);
-- Or use JOIN
Step 6: Transaction Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use Case |
|---|---|---|---|---|
| Read Uncommitted | ✅ | ✅ | ✅ | Almost never |
| Read Committed | ❌ | ✅ | ✅ | PostgreSQL default |
| Repeatable Read | ❌ | ❌ | ✅ | Financial reports |
| Serializable | ❌ | ❌ | ❌ | Critical transactions |
-- Set isolation level
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Critical operation
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
Step 7: Monitoring & Maintenance
-- Find slow queries (PostgreSQL)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Check index usage
SELECT
indexrelname AS index_name,
idx_scan AS times_used,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC; -- Unused indexes at top (candidates for removal)
-- Table bloat check
SELECT
relname AS table,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows,
ROUND(n_dead_tup * 100.0 / NULLIF(n_live_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_pct DESC;
-- Rebuild bloated indexes
REINDEX INDEX idx_orders_user_id;
-- Or non-blocking:
CREATE INDEX CONCURRENTLY idx_orders_user_id_new ON orders(user_id);
DROP INDEX idx_orders_user_id;
ALTER INDEX idx_orders_user_id_new RENAME TO idx_orders_user_id;
Interview Questions
-
When should you NOT add an index?
- On tables with heavy writes (index maintenance slows inserts/updates). On low-cardinality columns (boolean with 50/50 split). On small tables (seq scan is faster). On columns rarely used in WHERE/JOIN.
-
What's the N+1 query problem?
- Fetching a list (1 query), then fetching related data for each item (N queries). Fix: JOIN, eager loading, or batching with IN clause. Example: 1 query for users + 100 queries for each user's orders = 101 queries instead of 1 JOIN.
-
How does cursor-based pagination work?
- Instead of OFFSET (which scans and discards), use a WHERE clause with the last-seen value (
WHERE id > last_id LIMIT 20). Leverages the index directly. Constant performance regardless of page depth.
- Instead of OFFSET (which scans and discards), use a WHERE clause with the last-seen value (
-
What's a covering index?
- An index that contains ALL columns needed by a query (via INCLUDE). The database can answer the query from the index alone without reading the actual table row (index-only scan). Fastest possible lookup.