Window Functions & CTEs — SQL Handbook

TL;DR

Window Functions compute across a set of rows related to the current row — without collapsing them.
CTEs (Common Table Expressions) = named subqueries for readability and recursion.
Window functions solve: ranking, running totals, moving averages, gap detection, percentiles.
Master ROW_NUMBER, RANK, LAG/LEAD, SUM() OVER — they cover 90% of interview questions.

Step 1: Window Function Anatomy

Window functions were added to the SQL standard (SQL:2003) because GROUP BY destroys individual rows — you can get the department average, but you lose each employee's row. Window functions let you compute aggregates while keeping every row visible, answering questions like "what percentile is this employee within their department?" or "what's the running total up to this row?". They revolutionized analytics queries and are now the most powerful and most-tested advanced SQL feature in data engineering and backend interviews.

function_name() OVER (
    PARTITION BY column      -- Split into groups (optional)
    ORDER BY column          -- Order within each group
    ROWS/RANGE frame_spec   -- Which rows to include (optional)
)

Key Difference: GROUP BY vs Window Functions

-- GROUP BY: collapses rows → one row per group
SELECT department, AVG(salary) FROM employees GROUP BY department;
-- Result: 5 rows (one per department)

-- Window function: keeps all rows, adds computed column
SELECT name, department, salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
-- Result: ALL rows, each with their department average

Step 2: Ranking Functions

Ranking functions were invented because "give me the top 3 per group" was one of the most-requested query patterns and was nightmarishly complex without them. Before ROW_NUMBER, getting the top N per group required correlated subqueries or application-side processing. The three variants (ROW_NUMBER for unique ordinals, RANK with gaps for ties, DENSE_RANK without gaps) cover every ranking scenario: leaderboards, pagination, deduplication, and percentile computation. "Top N per group" is arguably the single most common advanced SQL interview question.

-- ROW_NUMBER: unique sequential number (no ties)
-- RANK: same rank for ties, gaps after (1,2,2,4)
-- DENSE_RANK: same rank for ties, no gaps (1,2,2,3)

SELECT
    name,
    department,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK() OVER (ORDER BY salary DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;

-- Example output:
-- Alice  | Engineering | 150000 | 1 | 1 | 1
-- Bob    | Engineering | 150000 | 2 | 1 | 1  ← Same salary
-- Carol  | Sales       | 140000 | 3 | 3 | 2  ← RANK skips, DENSE_RANK doesn't
-- Dave   | Marketing   | 130000 | 4 | 4 | 3

Top N Per Group

-- Top 3 earners in each department
WITH ranked AS (
    SELECT
        name, department, salary,
        ROW_NUMBER() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS rn
    FROM employees
)
SELECT name, department, salary
FROM ranked
WHERE rn <= 3;

NTILE — Split into Buckets

-- Divide employees into 4 salary quartiles
SELECT
    name, salary,
    NTILE(4) OVER (ORDER BY salary) AS quartile
FROM employees;
-- quartile 1 = bottom 25%, quartile 4 = top 25%

Step 3: LAG & LEAD — Access Adjacent Rows

LAG and LEAD were created because comparing a row to its previous/next row is fundamental to time-series analysis but was extremely painful with self-joins. Calculating day-over-day revenue change, detecting consecutive events, or finding gaps in sequences all require looking at neighboring rows. Before LAG/LEAD, this meant joining the table to itself with offset conditions — slow and unreadable. These functions power every growth chart, trend analysis, and sequential pattern detection in analytics dashboards.

-- LAG: previous row's value
-- LEAD: next row's value

SELECT
    date,
    revenue,
    LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue,
    revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY date)) * 100.0
        / LAG(revenue) OVER (ORDER BY date), 2
    ) AS pct_change
FROM daily_revenue
ORDER BY date;

-- Output:
-- 2026-01-01 | 10000 | NULL  | NULL  | NULL
-- 2026-01-02 | 12000 | 10000 | 2000  | 20.00%
-- 2026-01-03 | 11500 | 12000 | -500  | -4.17%

Month-over-Month Growth

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(total) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
          / LAG(revenue) OVER (ORDER BY month), 1) AS growth_pct
FROM monthly
ORDER BY month;

Step 4: Running Totals & Moving Averages

Running totals and moving averages are the foundation of financial reporting, inventory tracking, and time-series analysis. Before window functions with frame specifications, computing a running total required a self-join with a <= condition or cursor-based iteration — both O(n²). SUM() OVER (ORDER BY date) does it in a single linear pass. Moving averages smooth out noise in data ("what's the 7-day average revenue?") and are used in stock analysis, capacity planning, and any dashboard showing trends over time.

-- Running total
SELECT
    date,
    revenue,
    SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily_revenue;

-- Running total per month (resets each month)
SELECT
    date,
    revenue,
    SUM(revenue) OVER (
        PARTITION BY DATE_TRUNC('month', date)
        ORDER BY date
    ) AS monthly_running_total
FROM daily_revenue;

-- 7-day moving average
SELECT
    date,
    revenue,
    ROUND(AVG(revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) AS moving_avg_7d
FROM daily_revenue;

-- Cumulative percentage
SELECT
    department,
    salary,
    ROUND(
        SUM(salary) OVER (ORDER BY salary DESC) * 100.0
        / SUM(salary) OVER (), 2
    ) AS cumulative_pct
FROM employees;

Step 5: Frame Specifications

Frame specifications give you precise control over which rows a window function considers — "the 6 rows before the current one" or "from the start of the partition to here" or "3 days preceding to 3 days following". They were added because default window behavior (all rows in partition, or all rows up to current) doesn't cover every use case. Understanding ROWS vs RANGE is critical: ROWS counts physical rows (predictable), while RANGE groups logically equal values (handles duplicates differently). This is the advanced-level window function knowledge that distinguishes expert SQL users.

-- ROWS BETWEEN defines which rows the window includes:

-- All rows from start to current (default for ORDER BY)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

-- Previous 2 + current + next 2 (5-row window)
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING

-- Current row to end
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

-- All rows in partition (no ORDER BY needed)
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

ROWS vs RANGE

-- ROWS: physical rows (exact count)
-- RANGE: logical range (values that are equal grouped together)

-- If dates have duplicates:
-- ROWS BETWEEN 6 PRECEDING = exactly 6 previous rows
-- RANGE BETWEEN INTERVAL '6 days' PRECEDING = all rows within 6 days

SELECT date, revenue,
    SUM(revenue) OVER (
        ORDER BY date
        RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
    ) AS trailing_7d_revenue
FROM daily_revenue;

Step 6: CTEs (Common Table Expressions)

CTEs (the WITH clause) were added to SQL because deeply nested subqueries become unreadable after 2-3 levels. They let you name intermediate result sets and reference them like tables, creating a top-down readable flow instead of inside-out nesting. Recursive CTEs go further — they enable hierarchical queries (org charts, category trees, bill of materials) that would otherwise require application-side loops or database-specific extensions. CTEs are the key to writing maintainable, self-documenting analytical SQL.

Basic CTE

-- Named subquery — much more readable than nested subqueries
WITH high_value_customers AS (
    SELECT user_id, SUM(total) AS lifetime_value
    FROM orders
    GROUP BY user_id
    HAVING SUM(total) > 5000
),
recent_orders AS (
    SELECT user_id, COUNT(*) AS recent_count
    FROM orders
    WHERE created_at > NOW() - INTERVAL '30 days'
    GROUP BY user_id
)
SELECT
    u.name,
    hvc.lifetime_value,
    COALESCE(ro.recent_count, 0) AS orders_last_30d
FROM high_value_customers hvc
JOIN users u ON hvc.user_id = u.id
LEFT JOIN recent_orders ro ON hvc.user_id = ro.user_id
ORDER BY hvc.lifetime_value DESC;

Recursive CTE

-- Organizational hierarchy
WITH RECURSIVE org_tree AS (
    -- Base case: top-level managers (no manager)
    SELECT id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: employees under each manager
    SELECT e.id, e.name, e.manager_id, ot.level + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT
    REPEAT('  ', level - 1) || name AS org_chart,
    level
FROM org_tree
ORDER BY level, name;

-- Output:
-- CEO                      | 1
--   VP Engineering         | 2
--     Senior Engineer      | 3
--     Junior Engineer      | 3
--   VP Sales              | 2
--     Sales Manager       | 3

Recursive: Generate Date Series

-- Generate all dates in a range (fill gaps)
WITH RECURSIVE dates AS (
    SELECT DATE '2026-01-01' AS date
    UNION ALL
    SELECT date + 1 FROM dates WHERE date < '2026-01-31'
)
SELECT
    d.date,
    COALESCE(SUM(o.total), 0) AS revenue
FROM dates d
LEFT JOIN orders o ON DATE(o.created_at) = d.date
GROUP BY d.date
ORDER BY d.date;

Step 7: Advanced Patterns

Gap & Island Detection

-- Find consecutive login streaks
WITH numbered AS (
    SELECT
        user_id,
        login_date,
        login_date - (ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY login_date
        ))::int AS group_id
    FROM user_logins
)
SELECT
    user_id,
    MIN(login_date) AS streak_start,
    MAX(login_date) AS streak_end,
    COUNT(*) AS streak_length
FROM numbered
GROUP BY user_id, group_id
HAVING COUNT(*) >= 3  -- Streaks of 3+ days
ORDER BY streak_length DESC;

Percentile / Median

-- Median salary per department
SELECT DISTINCT
    department,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
        OVER (PARTITION BY department) AS median_salary
FROM employees;

First/Last Value

-- Most recent order per customer
SELECT DISTINCT
    user_id,
    FIRST_VALUE(total) OVER (
        PARTITION BY user_id ORDER BY created_at DESC
    ) AS most_recent_order_total,
    LAST_VALUE(total) OVER (
        PARTITION BY user_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS latest_order_total
FROM orders;

Interview Questions

What's the difference between ROW_NUMBER, RANK, and DENSE_RANK?
- ROW_NUMBER: always unique (1,2,3,4). RANK: ties get same number, gaps after (1,2,2,4). DENSE_RANK: ties get same number, no gaps (1,2,2,3).
When would you use a CTE over a subquery?
- When the same subquery is referenced multiple times, for readability, for recursive queries (hierarchy traversal), and when you want to name intermediate steps logically.
How do you calculate a running total?
- SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). The ORDER BY + default frame gives cumulative sum.
Explain the frame specification in window functions.
- Defines which rows relative to current row are included in the window calculation. ROWS BETWEEN 2 PRECEDING AND CURRENT ROW = 3-row window. Default with ORDER BY is "from start to current row".