SQL Fundamentals — JOINs, Subqueries & Aggregations — SQL Handbook

TL;DR

JOINs combine rows from multiple tables. INNER (matching only), LEFT (all from left), RIGHT, FULL.
GROUP BY + aggregates (COUNT, SUM, AVG, MAX, MIN) for summarization.
Subqueries = queries inside queries. Correlated = references outer query.
HAVING filters after aggregation (WHERE filters before).

Step 1: JOIN Types Visualized

JOINs are the fundamental operation that makes relational databases powerful — they let you combine normalized data back into meaningful results. Before relational databases, data was stored in flat files with massive redundancy. Edgar Codd's relational model (1970) said: store each fact once, link tables with keys, and JOIN them when you need a complete picture. Understanding JOIN types is the #1 most-tested SQL skill in interviews because it reveals whether you truly understand relational data modeling and can reason about NULL handling, set operations, and data relationships.

INNER JOIN — Only matching rows
┌───────┐     ┌───────┐
│   A   │     │   B   │
│  ┌────┼─────┼────┐  │
│  │ ██ │     │ ██ │  │   ← Only the overlap
│  └────┼─────┼────┘  │
└───────┘     └───────┘

LEFT JOIN — All from A + matching from B
┌───────┐     ┌───────┐
│ ██████ │     │   B   │
│ ██████ ┼─────┼────┐  │   ← All of A, matching B (NULL if no match)
│ ██████ │     │ ██ │  │
│  └────┼─────┼────┘  │
└───────┘     └───────┘

FULL OUTER JOIN — Everything from both
┌───────┐     ┌───────┐
│ ██████ │     │██████ │   ← Everything (NULLs where no match)
│ ██████ ┼─────┼██████ │
│ ██████ │     │██████ │
└───────┘     └───────┘

Practical Examples

-- Sample tables
-- users: id, name, email
-- orders: id, user_id, total, created_at

-- INNER JOIN: Users who have placed orders
SELECT u.name, o.total, o.created_at
FROM users u
INNER JOIN orders o ON u.id = o.user_id;

-- LEFT JOIN: ALL users, even those with no orders
SELECT u.name, COALESCE(o.total, 0) AS total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- Find users with NO orders (anti-join pattern)
SELECT u.name
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL;

Multiple JOINs

-- Orders with user info AND product details
SELECT
    u.name AS customer,
    p.name AS product,
    oi.quantity,
    oi.price * oi.quantity AS line_total
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at > '2026-01-01';

Step 2: Aggregations & GROUP BY

Aggregation functions (COUNT, SUM, AVG, MAX, MIN) were built into SQL because the most common business questions are summaries: "how many orders per customer?", "what's the average revenue per month?", "who's our top spender?". GROUP BY partitions rows into groups and applies aggregate functions to each group independently. The HAVING clause was invented as a separate keyword because WHERE filters individual rows before grouping, but sometimes you need to filter after aggregation (e.g., "show me only departments with average salary above $80K").

-- Count orders per user
SELECT
    u.name,
    COUNT(o.id) AS order_count,
    SUM(o.total) AS total_spent,
    AVG(o.total) AS avg_order,
    MAX(o.total) AS largest_order
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;

-- HAVING = WHERE for groups
-- Users who spent more than $1000
SELECT
    u.name,
    SUM(o.total) AS total_spent
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
HAVING SUM(o.total) > 1000
ORDER BY total_spent DESC;

WHERE vs HAVING

-- WHERE: filters rows BEFORE grouping
-- HAVING: filters groups AFTER aggregation

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date > '2020-01-01'   -- Filter individual rows first
GROUP BY department
HAVING AVG(salary) > 80000;      -- Then filter the groups

Step 3: Subqueries

Subqueries exist because many real-world questions require multi-step reasoning: "find employees earning above the average" requires computing the average first, then comparing. Before subqueries, this required multiple queries or application-side logic. SQL subqueries let you nest computations inline, keeping logic in a single statement. Correlated subqueries (referencing the outer query) enable row-by-row comparisons like "employees earning above their department's average" — though they can be expensive and are often better rewritten as JOINs or window functions for performance.

Scalar Subquery (returns single value)

-- Employees earning above average
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

IN Subquery (returns a list)

-- Users who ordered product 'Widget Pro'
SELECT name, email
FROM users
WHERE id IN (
    SELECT DISTINCT o.user_id
    FROM orders o
    JOIN order_items oi ON o.id = oi.order_id
    JOIN products p ON oi.product_id = p.id
    WHERE p.name = 'Widget Pro'
);

Correlated Subquery (references outer query)

-- Employees earning more than their department average
SELECT e.name, e.salary, e.department
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department = e.department  -- References outer query!
);

EXISTS (more efficient than IN for large datasets)

-- Users who have at least one order
SELECT u.name
FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id
);

Step 4: CASE Expressions

CASE expressions bring conditional logic into SQL, eliminating the need to post-process query results in application code. They were added to the SQL standard because real data doesn't map neatly to columns — you need to derive categories ("Senior"/"Junior" from salary), create pivot tables, handle NULL values, and compute conditional aggregates. CASE is used in virtually every reporting query, data transformation pipeline, and analytical dashboard. It's SQL's equivalent of if/else and appears in SELECT, WHERE, GROUP BY, and ORDER BY clauses.

-- Conditional logic in queries
SELECT
    name,
    salary,
    CASE
        WHEN salary >= 150000 THEN 'Senior'
        WHEN salary >= 100000 THEN 'Mid'
        WHEN salary >= 60000 THEN 'Junior'
        ELSE 'Entry'
    END AS level,
    CASE
        WHEN department IN ('Engineering', 'Product') THEN 'Tech'
        WHEN department IN ('Sales', 'Marketing') THEN 'GTM'
        ELSE 'Operations'
    END AS org
FROM employees
ORDER BY salary DESC;

-- Pivot-style aggregation
SELECT
    department,
    COUNT(CASE WHEN gender = 'M' THEN 1 END) AS male_count,
    COUNT(CASE WHEN gender = 'F' THEN 1 END) AS female_count,
    ROUND(AVG(CASE WHEN gender = 'M' THEN salary END), 0) AS male_avg_salary,
    ROUND(AVG(CASE WHEN gender = 'F' THEN salary END), 0) AS female_avg_salary
FROM employees
GROUP BY department;

Step 5: String & Date Functions

String and date functions exist because raw database values rarely match how users need to see them. Dates need truncation to months for reporting, time zone conversion, and interval arithmetic ("users who signed up in the last 7 days"). Strings need normalization (case-insensitive search via LOWER), parsing (extract username from email), and formatting. These functions are what transform a database query from "get raw data" into "get a report" — and they're heavily tested in SQL interviews and data engineering roles.

-- String functions
SELECT
    UPPER(name),
    LOWER(email),
    LENGTH(name),
    CONCAT(first_name, ' ', last_name) AS full_name,
    SUBSTRING(email FROM 1 FOR POSITION('@' IN email) - 1) AS username,
    TRIM(BOTH ' ' FROM name)
FROM users;

-- Date functions (PostgreSQL)
SELECT
    created_at,
    DATE_TRUNC('month', created_at) AS month,
    EXTRACT(YEAR FROM created_at) AS year,
    AGE(NOW(), created_at) AS account_age,
    created_at + INTERVAL '30 days' AS trial_end
FROM users
WHERE created_at >= NOW() - INTERVAL '7 days';  -- Last 7 days

-- Date grouping
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS orders,
    SUM(total) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month DESC;

Step 6: Set Operations

Set operations (UNION, INTERSECT, EXCEPT) come directly from mathematical set theory and were included in SQL because they solve questions that JOINs can't express cleanly: combining results from different tables with the same structure, finding common elements across datasets, or computing differences. They're essential for cohort analysis ("users in group A but not group B"), data reconciliation, and combining results from partitioned tables. UNION ALL is a performance best practice when you know there are no duplicates — it skips the expensive deduplication step.

-- UNION: combine results (removes duplicates)
SELECT name, email FROM customers
UNION
SELECT name, email FROM leads;

-- UNION ALL: keep duplicates (faster)
SELECT name, email FROM customers
UNION ALL
SELECT name, email FROM leads;

-- INTERSECT: only in both
SELECT user_id FROM orders_2025
INTERSECT
SELECT user_id FROM orders_2026;  -- Users who ordered in both years

-- EXCEPT: in first but not second
SELECT user_id FROM newsletter_subscribers
EXCEPT
SELECT user_id FROM unsubscribed;  -- Active subscribers

Step 7: Common Query Patterns

Top N per Group

-- Top 3 highest-paid employees per department
SELECT * FROM (
    SELECT
        name, department, salary,
        ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn <= 3;

Running Total

SELECT
    date,
    revenue,
    SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily_revenue;

Duplicate Detection

-- Find duplicate emails
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Delete duplicates (keep lowest id)
DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id)
    FROM users
    GROUP BY email
);

Interview Questions

What's the difference between WHERE and HAVING?
- WHERE filters individual rows before GROUP BY. HAVING filters aggregated groups after GROUP BY. HAVING can use aggregate functions (COUNT, SUM), WHERE cannot.
LEFT JOIN vs INNER JOIN — when to use which?
- INNER: only want rows that match in both tables. LEFT: want ALL rows from left table, even with no match (NULL for missing right side). Use LEFT when "no match" is meaningful data.
What's a correlated subquery?
- A subquery that references the outer query. Executes once per outer row (can be slow). Often rewritten as JOINs for performance. Use when comparing each row to an aggregate of its group.
How do you find the Nth highest salary?
- Use ROW_NUMBER() OVER (ORDER BY salary DESC) or LIMIT 1 OFFSET N-1 or DENSE_RANK().