1. Nested Loop Join
How it works.
- For each row in the outer table, scan the entire inner table looking for matches
- Similar to nested for-loops in programming
Example.
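As a sketch of the idea, here is a minimal nested loop join in Python. The table data and the function name are made up for illustration; a real engine operates on pages and tuples, not Python dicts.

```python
# Minimal nested loop join: for each row of the outer table,
# scan every row of the inner table looking for a key match.
def nested_loop_join(outer, inner, key):
    result = []
    for o in outer:                 # outer loop: runs once per outer row
        for i in inner:             # inner loop: full scan per outer row
            if o[key] == i[key]:    # join condition
                result.append({**o, **i})
    return result

table_a = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]   # outer
table_b = [{"id": 1, "city": "Oslo"}, {"id": 3, "city": "Rome"}] # inner

print(nested_loop_join(table_a, table_b, "id"))
# only id=1 appears in both tables
```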
Why is it called "outer" and "inner"?
The terminology comes from nested loop structure in programming.
- The outer table (Table A) is like the outer loop - it's processed first and runs fewer times (once through the entire table)
- The inner table (Table B) is like the inner loop - it's nested inside and runs many times (once for each row in the outer table)
Just like in nested loops where the outer loop variable changes slowly and the inner loop variable changes rapidly, the outer table is scanned once while the inner table is scanned repeatedly.
In this example, Table A is the outer table (scanned once) and Table B is the inner table (scanned repeatedly for each row in A).
When it's used.
- Small outer table with indexed inner table
- Selective `WHERE` clauses (few matching rows)
Performance.
- Time complexity: O(n × m), where n = outer rows, m = inner rows
- Best when outer table is small (Table A) and inner table has an index (Table B)
Example scenario.
Execution order.
- Step 1. Apply `WHERE` clause - filter customers by state (returns 5 customers from 10 total)
- Step 2. For each of those 5 customers, use index to find their orders
Good choice: only 5 CA customers remain after filtering. The customers table (outer) is small after the `WHERE` clause, and the orders table (inner) is large (1M rows) but has an index on customer_id. Result: only 5 index lookups - very fast!
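The two steps above can be sketched in Python, with a dict of lists standing in for the index on orders.customer_id. The tiny data set here is invented for illustration.

```python
# Hypothetical data mirroring the scenario: filter customers by state,
# then do one index lookup per surviving customer.
customers = [
    {"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}, {"id": 3, "state": "CA"},
]
orders = [
    {"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 2},
]

# An index on orders.customer_id, modeled as a dict of lists.
index = {}
for o in orders:
    index.setdefault(o["customer_id"], []).append(o)

# Step 1: apply the WHERE clause to the outer table.
ca_customers = [c for c in customers if c["state"] == "CA"]

# Step 2: one index lookup per filtered customer - no inner table scan.
result = [(c["id"], o["order_id"])
          for c in ca_customers
          for o in index.get(c["id"], [])]
print(result)  # [(1, 10), (3, 11)]
```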
2. Hash Join
How it works.
- Build phase. Create hash table from smaller table
- Probe phase. Scan larger table, probe hash table for matches
Example.
In this example, Table A is the build input (smaller table, used to create hash table) and Table B is the probe input (larger table, scanned to find matches).
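The build/probe structure can be sketched in Python with a dict as the hash table. Data and the `hash_join` name are illustrative only.

```python
# Build phase: hash the smaller table; probe phase: scan the larger one.
def hash_join(build, probe, key):
    hash_table = {}
    for b in build:                          # build phase: O(n)
        hash_table.setdefault(b[key], []).append(b)
    result = []
    for p in probe:                          # probe phase: O(1) lookup per row
        for b in hash_table.get(p[key], []):
            result.append({**b, **p})
    return result

table_a = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]   # build input (smaller)
table_b = [{"id": 2, "city": "Rome"}, {"id": 2, "city": "Oslo"},
           {"id": 9, "city": "Lima"}]                            # probe input (larger)

print(hash_join(table_a, table_b, "id"))
# Bob (id=2) matches two probe rows; id=9 matches nothing
```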
When it's used.
- Large tables where indexes don't exist or wouldn't be efficient (both Table A and B can be large, but algorithm works best when at least one is significantly smaller)
- Equality joins only (=)
- Sufficient memory available to hold the hash table
Why choose hash join over nested loop (even when indexes exist).
- Hash join ignores indexes but doesn't require their absence
- Optimizer chooses hash join based on cost, which can be lower than nested loop even with indexes when:
- Both tables return many rows (poor selectivity) - repeatedly using an index becomes expensive
- Full table scans + hash join is faster than many index lookups
- The `WHERE` clause filters result in large intermediate result sets
- Example. If our `WHERE` clause returns 50,000 customers, doing 50,000 index lookups on orders is slower than scanning both tables and hashing
Performance.
- Time complexity: O(n + m) - linear
- Space complexity: O(min(n, m)) for the hash table
- Fast for large datasets
2.1. Example 1. No indexes
Hash join builds hash table on smaller table (orders as build input), then probes with larger table (order_items as probe input).
2.2. Example 2. Indexes exist, but hash join is still chosen
Execution order.
- Step 1. Apply `WHERE` clause - filter orders by date (returns 100k orders)
- Step 2. Now need to join these 100k orders with `order_items`
Even with an index on order_id, if the `WHERE` clause returns 100k orders:
- Nested loop. 100k index lookups on `order_items` (expensive!)
- Hash join. Scan filtered orders + scan `order_items` = faster!
The WHERE clause is applied FIRST, creating a large intermediate result. Then the join algorithm is chosen based on that filtered row count. Optimizer chooses hash join based on cost estimates.
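A back-of-envelope cost comparison makes the trade-off concrete. All unit costs and table sizes below are assumed for illustration; they are not real optimizer numbers.

```python
# Illustrative cost model (assumed numbers, not real optimizer estimates):
# compare 100k index lookups against two sequential scans + hashing.
filtered_orders = 100_000     # rows surviving the WHERE clause
order_items = 5_000_000       # total rows in order_items (assumed size)
index_lookup_cost = 4         # ~B-tree height in page reads per lookup (assumed)
scan_cost_per_row = 0.01      # sequential I/O is far cheaper per row (assumed)

nested_loop_cost = filtered_orders * index_lookup_cost
hash_join_cost = (filtered_orders + order_items) * scan_cost_per_row

print(nested_loop_cost)  # 400000
print(hash_join_cost)    # 51000.0
```

With these (made-up) constants, scanning both tables beats 100k random index lookups by almost 8×, which is the kind of gap that makes the optimizer pick a hash join despite the available index.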
3. Merge Join (Sort-Merge Join)
How it works.
- 1st Phase (Sort). Sort both tables by join key (if not already sorted)
- 2nd Phase (Merge). Scan both sorted tables simultaneously with two pointers, matching rows as we go
Why is it called "Sort-Merge"? The algorithm has two distinct phases:
- The sort phase ensures both inputs are ordered by the join key
- The merge phase walks through both sorted lists simultaneously, similar to the merge step in merge sort algorithm
Detailed merge process.
Each pointer advances monotonically through its table - we never go backwards. When we find order 3 with 2 items, we scan through both items for order 3, then move on. We don't need to rescan because both tables are sorted.
Advantage. Each table is scanned only once during the merge phase, making it very efficient for large datasets that are already sorted.
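The two-pointer merge phase can be sketched in Python. The sketch assumes both inputs are already sorted on the join key and that order ids are unique in `orders` (a one-to-many join); the data mirrors the "order 3 with 2 items" case above.

```python
# Two-pointer merge over inputs pre-sorted on the join key.
# Simplified one-to-many join: order ids are unique in `orders`.
def merge_join(orders, items):
    result = []
    i = j = 0
    while i < len(orders) and j < len(items):
        if orders[i]["id"] < items[j]["order_id"]:
            i += 1                  # advance the side with the smaller key
        elif orders[i]["id"] > items[j]["order_id"]:
            j += 1
        else:
            # keys match: emit this item; pointers only ever move forward
            result.append((orders[i]["id"], items[j]["item"]))
            j += 1
    return result

orders = [{"id": 1}, {"id": 3}, {"id": 5}]              # sorted by id
items = [{"order_id": 1, "item": "pen"},
         {"order_id": 3, "item": "ink"},
         {"order_id": 3, "item": "pad"},
         {"order_id": 4, "item": "cup"}]                # sorted by order_id

print(merge_join(orders, items))  # order 3 matches two items
```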
When it is used.
- Both inputs already sorted (indexes on join columns) - avoids expensive sort phase
- Non-equality joins (`<`, `>`, `BETWEEN`) - unlike hash join, which only works with equality (=)
- Large sorted datasets where hash join would require too much memory
- Range joins where we need to match ranges of values
Performance.
- Time complexity: O(n log n + m log m) if sorting needed, O(n + m) if pre-sorted
- Space complexity: O(1) if data already sorted (no buffering needed)
- Efficient for sorted data - single pass through each table during merge
3.1. Example 1. Pre-sorted data (optimal case)
Execution order.
- Step 1. Both tables have clustered indexes on their join columns, so they're already sorted
- Step 2. Merge phase only - scan both tables once with two pointers
- Step 3. Output is already sorted (bonus)
Result. O(n + m) - single scan of each table.
3.2. Example 2. Range joins
Why merge join?
- Hash join doesn't work with `BETWEEN` (non-equality)
- Nested loop would be O(n × m) - checking every sale against every promotion
- Merge join. Sort both by date, then efficiently match ranges in O(n + m) after sorting
3.3. Example 3. When might merge join be chosen despite sorting cost?
Question. If no indexes exist on `dept_id`, why not use hash join instead of merge join?
Hash join is typically better, but merge join might be chosen when the result set is large, memory is limited, or the output must be sorted - see the cases below.
Case A. Result needs sorting (ORDER BY clause present)
- With a hash join we need:
  - Step 1. Apply `WHERE` clause - filter employees by salary
  - Step 2 (Hash join). Build hash on departments, probe with employees → unsorted result
  - Step 3 (Sort result). Sort by `dept_id` for `ORDER BY`, costing O(k log k), where k = number of results
- With sort-merge, on the other hand, we need:
  - Step 1. Apply `WHERE` clause - filter employees by salary
  - Step 2 (Merge join approach). Sort employees by `dept_id`: O(n log n)
  - Step 3. Sort departments by `id`: O(m log m)
  - Step 4. Merge (output is already sorted!) → O(n + m)
Case B. Limited memory (hash table would not fit)
If both tables are massive and hash table won't fit in available memory.
- Hash join. Build hash table on smaller table → might exceed `work_mem`, spill to disk (very slow!)
- Merge join. External sort both tables using disk-based sort → disk I/O, then merge
Merge join's external sort is more efficient than hash join's disk spilling.
Case C. Statistics-driven decision
The optimizer considers.
- Table sizes (n and m)
- Available memory (`work_mem` in PostgreSQL)
- Expected result size
- Presence of `ORDER BY` clause
Cost comparison.
- Hash join cost. O(n + m) + potential sort cost O(k log k) if we are using `ORDER BY`
- Merge join cost. O(n log n + m log m) + merge O(n + m)
Remark. When k ≈ n + m, O(k log k) is of the same order as O(n log n + m log m), so the two total costs become comparable and the choice comes down to the optimizer's estimates.
Decision. It depends on statistics! If:
- Small tables → nested loop or hash join
- Large tables + sufficient memory + no `ORDER BY` → hash join
- Large tables + limited memory → merge join
- Any size + `ORDER BY` on join key → merge join
- Pre-sorted data (indexes) → merge join (no sort needed!)
Typical scenario. For Example 3 without ORDER BY, hash join would be preferred. Merge join is chosen when there's a good reason (sorting needed, memory constraints, or data already sorted).
Comparison with other joins.
| Scenario | Nested Loop | Hash Join | Merge Join |
|---|---|---|---|
| Small outer + indexed inner | Best | Too slow | Overkill |
| Large tables, unsorted | Terrible | Best | Good (after sorting) |
| Pre-sorted data | Slow | Wastes sort | Best |
| Range joins (BETWEEN) | Slow | Can't use | Best |
| Limited memory | OK | Needs RAM | Best |
3.4. Comparison Summary
| Feature | Nested Loop | Hash Join | Merge Join |
|---|---|---|---|
| Best for | Small outer table | Large tables | Pre-sorted data |
| Memory | Low | High | Medium |
| Complexity | O(n × m) | O(n + m) | O(n log n + m log m), or O(n + m) if sorted |
| Index helps? | Yes (inner) | No | Yes (both) |
| Join types | All | Equality only | All |
| Parallelizable | Limited | Easy | Medium |
4. Simple Case Study for Different Joins
Given tables:
- `customers` - 1,000 rows
- `orders` - 1,000,000 rows
4.1. Scenario 1: USA has 5 customers, orders.customer_id indexed
Winning choice. Nested Loop (5 index lookups)
4.2. Scenario 2: USA has 900 customers, no indexes, plenty of RAM
Winning choice. Hash Join (build hash on customers, probe orders)
4.3. Scenario 3: Both tables ordered by ID, limited RAM
Winning choice. Merge Join (both already sorted, no memory needed)





