Basic Joins in SQL Databases

February 24, 2026

Database

Oracle

Postgresql

1. Nested Loop Join

How it works.

  • For each row in the outer table, scan the entire inner table looking for matches
  • Similar to nested for-loops in programming

Example.

For each row in Table A (outer table):
    For each row in Table B (inner table):
        If A.key = B.key:
            Output joined row
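The pseudocode above maps directly onto nested loops; a minimal Python sketch (table contents and column names are illustrative):

```python
def nested_loop_join(outer, inner, key):
    """Join two lists of dicts by rescanning the entire inner table
    for every row of the outer table: O(N * M) comparisons."""
    result = []
    for a in outer:               # outer loop: Table A, scanned once
        for b in inner:           # inner loop: Table B, rescanned each time
            if a[key] == b[key]:  # join condition A.key = B.key
                result.append({**a, **b})
    return result

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 2, "total": 10}, {"id": 1, "total": 5}, {"id": 2, "total": 7}]
print(nested_loop_join(customers, orders, "id"))
```

Note that the inner list is walked from the beginning for every outer row, which is exactly why this strategy degrades on large unindexed inputs.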

Why is it called "outer" and "inner"?

The terminology comes from the nested loop structure in programming.

  • The outer table (Table A) is like the outer loop - it's processed first and runs fewer times (once through the entire table)
  • The inner table (Table B) is like the inner loop - it's nested inside and runs many times (once for each row in the outer table)

Just like in nested loops where the outer loop variable changes slowly and the inner loop variable changes rapidly, the outer table is scanned once while the inner table is scanned repeatedly.

In this example, Table A is the outer table (scanned once) and Table B is the inner table (scanned repeatedly for each row in A).

When it's used.

  • Small outer table with indexed inner table
  • Selective WHERE clauses (few matching rows)

Performance.

  • Time complexity: O(N × M), where N = outer rows, M = inner rows
  • Best when outer table is small (Table A) and inner table has an index (Table B)

Example scenario.

SELECT * FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.state = 'CA';

Execution order.

  • Step 1. Apply WHERE clause - filter customers by state (returns 5 customers from 10 total)
  • Step 2. For each of those 5 customers, use index to find their orders

Good choice: only 5 CA customers after filtering. The customers (outer table) is small after WHERE clause, and orders (inner table) is large (1M rows) but has index on customer_id. Result: Only 5 index lookups - very fast!
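Under these assumptions (small filtered outer table, index on orders.customer_id), the lookup pattern can be sketched in Python, with a dict of lists standing in for the index (data is hypothetical):

```python
from collections import defaultdict

customers = [
    {"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}, {"id": 3, "state": "CA"},
]
orders = [
    {"customer_id": 1, "total": 20}, {"customer_id": 3, "total": 15},
    {"customer_id": 2, "total": 99},
]

# The dict below plays the role of the index on orders.customer_id
index = defaultdict(list)
for o in orders:
    index[o["customer_id"]].append(o)

# Step 1: the WHERE clause filters the outer table first
ca_customers = [c for c in customers if c["state"] == "CA"]

# Step 2: one index lookup per filtered customer, not a scan of all orders
joined = [(c["id"], o["total"]) for c in ca_customers for o in index[c["id"]]]
print(joined)
```

Each filtered customer costs one dictionary lookup, mirroring the "only 5 index lookups" argument above.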

2. Hash Join

How it works.

  • Build phase. Create hash table from smaller table
  • Probe phase. Scan larger table, probe hash table for matches

Example.

1. Build hash table from Table A (smaller table) on join key
2. For each row in Table B (larger table):
    - Hash the join key
    - Look up in hash table
    - Output matches

In this example, Table A is the build input (smaller table, used to create hash table) and Table B is the probe input (larger table, scanned to find matches).
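The build/probe phases can be sketched in a few lines of Python (table contents are illustrative):

```python
from collections import defaultdict

def hash_join(build, probe, build_key, probe_key):
    """Hash join: O(N + M). Build a hash table on the smaller input,
    then probe it once per row of the larger input."""
    # Build phase: hash table keyed on the join column of the smaller table
    table = defaultdict(list)
    for a in build:
        table[a[build_key]].append(a)
    # Probe phase: one hash lookup per row of the larger table
    result = []
    for b in probe:
        for a in table.get(b[probe_key], []):
            result.append({**a, **b})
    return result

orders = [{"id": 1}, {"id": 2}]                       # smaller: build input
items = [{"order_id": 1, "sku": "X"}, {"order_id": 1, "sku": "Y"},
         {"order_id": 2, "sku": "Z"}]                 # larger: probe input
print(hash_join(orders, items, "id", "order_id"))
```

Unlike the nested loop version, each probe row costs a single hash lookup regardless of the build table's size, which is why the join key must be an equality comparison.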

When it's used.

  • Large tables where indexes don't exist or wouldn't be efficient (both Table A and B can be large, but algorithm works best when at least one is significantly smaller)
  • Equality joins only (=)
  • Sufficient memory available to hold the hash table

Why choose hash join over nested loop (even when indexes exist).

  • Hash join does not use indexes, but the presence of an index does not prevent the optimizer from choosing it
  • The optimizer chooses hash join based on cost, which can be lower than nested loop even with indexes when:
    • Both tables return many rows (poor selectivity) - repeatedly using an index becomes expensive
    • Full table scans + hash join is faster than many index lookups
    • The WHERE clause still leaves a large intermediate result set
  • Example. If our WHERE clause returns 50,000 customers, doing 50,000 index lookups on orders is slower than scanning both tables and hashing

Performance.

  • Time complexity: O(N + M) - linear
  • Space complexity: O(min(N, M)) for the hash table on the smaller input
  • Fast for large datasets

2.1. Example 1. No indexes

SELECT * FROM orders o
JOIN order_items oi ON o.id = oi.order_id;

Hash join builds hash table on smaller table (orders as build input), then probes with larger table (order_items as probe input).

2.2. Example 2. Indexes exist, but hash join is still chosen

SELECT * FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.order_date >= '2025-01-01';

Execution order.

  • Step 1. Apply WHERE clause - filter orders by date (returns 100k orders)
  • Step 2. Now need to join these 100k orders with order_items

Even with an index on order_id, if the WHERE clause returns 100k orders.

  • Nested loop. 100k index lookups on order_items (expensive!)
  • Hash join. Scan filtered orders + scan order_items = faster!

The WHERE clause is applied FIRST, creating a large intermediate result. Then the join algorithm is chosen based on that filtered row count. Optimizer chooses hash join based on cost estimates.

3. Merge Join (Sort-Merge Join)

How it works.

  • 1st Phase (Sort). Sort both tables by join key (if not already sorted)
  • 2nd Phase (Merge). Scan both sorted tables simultaneously with two pointers, matching rows as we go

Why is it called "Sort-Merge"? The algorithm has two distinct phases:

  • The sort phase ensures both inputs are ordered by the join key
  • The merge phase walks through both sorted lists simultaneously, similar to the merge step in the merge sort algorithm

Detailed merge process.

Start: orders = [1, 3, 5, 7, 9] (pointer A), items = [1, 3, 3, 5, 5, 5, 8] (pointer B)

Step 1:  A=1, B=1 → MATCH, output #1 (1,1); advance both → A=3, B=3
Step 2:  A=3, B=3 → MATCH, output #2 (3,3); advance B → B=3
Step 3:  A=3, B=3 → MATCH, output #3 (3,3); advance B → B=5
Step 4:  A=3, B=5 → A < B, no match; advance A → A=5
Step 5:  A=5, B=5 → MATCH, output #4 (5,5); advance B → B=5
Step 6:  A=5, B=5 → MATCH, output #5 (5,5); advance B → B=5
Step 7:  A=5, B=5 → MATCH, output #6 (5,5); advance B → B=8
Step 8:  A=5, B=8 → A < B, no match; advance A → A=7
Step 9:  A=7, B=8 → A < B, no match; advance A → A=9
Step 10: A=9, B=8 → A > B, no match; advance B → end of items
Done: 6 matching rows

Each pointer advances monotonically through its table - we never go backwards. When we find order 3 with two matching items, we scan through both items for order 3 and then move on. Because both tables are sorted (and the order IDs on the outer side are unique), we never need to rescan.

Advantage. Each table is scanned only once during the merge phase, making it very efficient for large datasets that are already sorted.
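The walk-through above can be reproduced with a two-pointer sketch in Python (assuming, as in the example, that the outer keys are unique):

```python
def merge_join(outer, inner):
    """Merge phase of a sort-merge join over two already-sorted key lists.
    Assumes outer keys are unique (as with order IDs), so neither pointer
    ever moves backwards: O(N + M)."""
    result = []
    i = j = 0
    while i < len(outer) and j < len(inner):
        if outer[i] == inner[j]:
            result.append((outer[i], inner[j]))
            j += 1                 # consume one inner row, keep the outer row
        elif outer[i] < inner[j]:
            i += 1                 # outer key too small: advance outer pointer
        else:
            j += 1                 # inner key too small: advance inner pointer
    return result

# The same inputs as the trace above
print(merge_join([1, 3, 5, 7, 9], [1, 3, 3, 5, 5, 5, 8]))
```

If the outer side could also contain duplicate keys, a real implementation would have to remember the start of each matching inner group and rescan it; with unique outer keys the single forward pass shown here is enough.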

When it is used.

  • Both inputs already sorted (indexes on join columns) - avoids expensive sort phase
  • Non-equality joins (<, >, BETWEEN) - unlike hash join, which only works with =
  • Large sorted datasets where hash join would require too much memory
  • Range joins where we need to match ranges of values

Performance.

  • Time complexity: O(N log N + M log M) if sorting is needed, O(N + M) if pre-sorted
  • Space complexity: O(1) if data is already sorted (no buffering needed)
  • Efficient for sorted data - single pass through each table during merge

3.1. Example 1. Pre-sorted data (optimal case)

SELECT * FROM customers c
JOIN orders o ON c.id = o.customer_id
ORDER BY c.id;

Execution order.

  • Step 1. Both tables have clustered indexes on their join columns, so they're already sorted
  • Step 2. Merge phase only - scan both tables once with two pointers
  • Step 3. Output is already sorted (bonus)

Result. O(N + M) - a single scan of each table.

3.2. Example 2. Range joins

SELECT * FROM sales s
JOIN promotions p ON s.sale_date BETWEEN p.start_date AND p.end_date;

Why merge join?

  • Hash join doesn't work with BETWEEN (non-equality)
  • Nested loop would be O(N × M) - checking every sale against every promotion
  • Merge join. Sort both by date, then efficiently match ranges in O(N + M) after sorting
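One way to sketch the range-matching idea in Python: sort sales by date, then binary-search for each promotion's date window (dates are hypothetical integer day numbers):

```python
import bisect

# Hypothetical data: dates as integer day numbers, sales sorted by date
sales = [{"sale_date": d} for d in [2, 5, 7, 11, 14]]
promotions = [{"start": 1, "end": 6}, {"start": 10, "end": 12}]

sale_dates = [s["sale_date"] for s in sales]
matches = []
for p in promotions:
    # Binary search for the contiguous slice of sales inside [start, end]
    lo = bisect.bisect_left(sale_dates, p["start"])
    hi = bisect.bisect_right(sale_dates, p["end"])
    for s in sales[lo:hi]:
        matches.append((s["sale_date"], p["start"], p["end"]))
print(matches)
```

Because the sales are sorted, each promotion's matching rows form one contiguous slice, so no sale outside the window is ever examined.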

3.3. Example 3. When might merge join be chosen despite sorting cost?

SELECT * FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000
ORDER BY e.dept_id;

Question. If no indexes exist on dept_id, why not use hash join instead of merge join?

Hash join is typically better, but merge join might be chosen in cases like the following.

Case A. Result needs sorting (ORDER BY clause present)

  1. With a hash join we need:

    • Step 1. Apply WHERE clause - filter employees by salary
    • Step 2 (Hash join). Build hash on departments, probe with employees → unsorted result
    • Step 3 (Sort result). Sort by dept_id for ORDER BY, costing O(R log R), where R = number of result rows
  2. On the other hand, with sort-merge we need:

    • Step 1. Apply WHERE clause - filter employees by salary
    • Step 2 (Merge join approach). Sort employees by dept_id: O(N log N)
    • Step 3. Sort departments by id: O(M log M)
    • Step 4. Merge (output is already sorted!) → O(N + M)
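To make the comparison concrete, a back-of-the-envelope cost calculation (row counts are hypothetical, and real optimizers use far more detailed cost models):

```python
import math

N = 100_000   # filtered employees (hypothetical)
M = 100       # departments (hypothetical)
R = N         # joined rows, assuming each employee matches one department

# Hash join + explicit sort for ORDER BY: O(N + M) join, then O(R log R) sort
hash_then_sort = (N + M) + R * math.log2(R)

# Sort-merge: sort both inputs, then one O(N + M) merge; output already sorted
sort_merge = N * math.log2(N) + M * math.log2(M) + (N + M)

print(f"hash join + sort: ~{hash_then_sort:,.0f} operations")
print(f"sort-merge:       ~{sort_merge:,.0f} operations")
```

With R on the order of N, the two totals come out nearly identical, so the ORDER BY effectively tips the decision toward merge join, which produces sorted output for free.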

Case B. Limited memory (hash table would not fit)

SELECT * FROM huge_orders o
JOIN massive_items i ON o.category_id = i.category_id;

If both tables are massive and the hash table won't fit in available memory.

  • Hash join. Build hash table on smaller table → might exceed work_mem, spill to disk (very slow!)
  • Merge join. External sort both tables using disk-based sort → disk I/O, then merge

Merge join's external sort is more efficient than hash join's disk spilling.

Case C. Statistics-driven decision

The optimizer considers.

  • Table sizes (N and M)
  • Available memory (work_mem in PostgreSQL)
  • Expected result size
  • Presence of ORDER BY clause

Cost comparison.

  • Hash join cost. O(N + M) for build and probe, plus a sort cost of O(R log R) when ORDER BY is present

  • Merge join cost. O(N log N + M log M) for sorting + O(N + M) for the merge

    Remark. When R is on the order of N + M, the hash join's extra sort O(R log R) is comparable to the merge join's sort cost, so the merge join effectively gets its sorted output for free

Decision. It depends on statistics:

  • Small tables → nested loop or hash join
  • Large tables + sufficient memory + no ORDER BY → hash join
  • Large tables + limited memory → merge join
  • Any size + ORDER BY on join key → merge join
  • Pre-sorted data (indexes) → merge join (no sort needed!)

Typical scenario. For Example 3 without ORDER BY, hash join would be preferred. Merge join is chosen when there's a good reason (sorting needed, memory constraints, or data already sorted).

Comparison with other joins.

Scenario                    | Nested Loop | Hash Join   | Merge Join
Small outer + indexed inner | Best        | Too slow    | Overkill
Large tables, unsorted      | Terrible    | Best        | Good if sorted
Pre-sorted data             | Slow        | Wastes sort | Best
Range joins (BETWEEN)       | Slow        | Can't use   | Best
Limited memory              | OK          | Needs RAM   | Best

3.4. Comparison Summary

Feature        | Nested Loop       | Hash Join     | Merge Join
Best for       | Small outer table | Large tables  | Pre-sorted data
Memory         | Low               | High          | Medium
Complexity     | O(N × M)          | O(N + M)      | O(N log N + M log M), O(N + M) if sorted
Index helps?   | Yes (inner)       | No            | Yes (both)
Join types     | All               | Equality only | All
Parallelizable | Limited           | Easy          | Medium

4. Simple Case Study for Different Joins

Given tables:

  • customers - 1,000 rows
  • orders - 1,000,000 rows

SELECT c.name, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.country = 'USA';

4.1. Scenario 1: USA has 5 customers, orders.customer_id indexed

Winning choice. Nested Loop (5 index lookups)

4.2. Scenario 2: USA has 900 customers, no indexes, plenty of RAM

Winning choice. Hash Join (build hash on customers, probe orders)

4.3.

Scenario 3: Both tables ordered by ID, limited RAM

Winning choice. Merge Join (both already sorted, no memory needed)