Anthropic recently open sourced their performance engineering take-home challenge. The goal: make a tree traversal algorithm with hashing run as fast as possible on a simulated VLIW SIMD CPU.
I started this on a quiet evening, thinking “I’ll just poke at it for an hour.”
Four days later, I was obsessed with ideas about instruction scheduling. The 2,000 cycle barrier became my white whale.
Starting point: 147,734 cycles. Final result: 1,524 cycles. Speedup: 96.9x.
TIL I have an unhealthy obsession with making numbers go down.
The Problem
256 “walkers” traverse a binary tree over 16 rounds. Each round:
- Load current node’s value
- XOR with walker’s value
- Hash the result (6 stages of multiply-add and xor/shift)
- Use the LSB to decide: go left (even) or right (odd)
- If we fall off the tree (depth > 10), reset to root
The tree is stored as an implicit array: node i has children at 2i+1 and 2i+2.
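In plain Python, one walker's round looks roughly like this (a sketch: hash6 is a hypothetical stand-in for the 6-stage hash, and what the walker carries forward between rounds is elided):

def walker_round(tree, node, depth, walker_val):
    h = hash6(tree[node] ^ walker_val)  # load, XOR, 6-stage hash
    node = 2 * node + 1 + (h & 1)       # LSB: even -> left, odd -> right
    depth += 1
    if depth > 10:                      # fell off the tree
        node, depth = 0, 0              # reset to root
    return node, depth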
Le Machine
Understanding the hardware is crucial. The target is a VLIW SIMD architecture:
VLIW (Very Long Instruction Word): Multiple operations execute in parallel across “engines”:
- ALU: 12 scalar operations per cycle
- VALU: 6 vector operations per cycle
- Load: 2 memory reads per cycle
- Store: 2 memory writes per cycle
- Flow: 1 control operation per cycle
SIMD: Vector operations process 8 elements simultaneously.
In theory, the machine can execute 23 ops per cycle (12 + 6 + 2 + 2 + 1). But there are constraints:
- RAW hazards: Can’t read a register being written in the same cycle
- WAW hazards: Can’t write to the same register twice per cycle
- Dependencies: Operations that depend on each other must be in different cycles
The load engine is the bottleneck. With only 2 load slots per cycle and heavy memory access patterns, we’re fundamentally load-bound. Everything else follows from this.
The Baseline: 147,734 Cycles
The starter code processes walkers sequentially. One walker at a time, one round at a time:
for round in range(16):
    for walker in range(256):
        # Load, XOR, Hash, Store
We’re using maybe 1-2% of the machine’s capacity per cycle.
Verdict: Maximally inefficient.
Step 1: SIMD → 25,165 cycles (5.87x)
The obvious first step. Instead of processing 1 walker at a time, process 8.
# Before: scalar
load(val, addr)
alu("+", result, val, const)
# After: vector (8 walkers at once)
vload(v_val, addr)
valu("+", v_result, v_val, v_const)
We’re doing 8x the work per instruction. The VALU engine has 6 slots per cycle, so we can process up to 48 walkers worth of computation per cycle.
Verdict: SIMD is the foundation. Everything else builds on this.
Step 2: Hazard-Aware Packing → 15,438 cycles (9.57x)
After vectorization, we had lots of independent instructions, but they were executing one per cycle. The pack_slots() function bundles independent instructions together.
The algorithm tracks which registers are written in the current cycle. A new instruction can pack if:
- Its reads don’t overlap with current writes (RAW hazard)
- Its writes don’t overlap with current writes (WAW hazard)
- The engine slot limit isn’t exceeded
from collections import defaultdict

def pack_slots(instructions):
    cycles = []
    current_cycle = []
    current_writes = set()
    engine_counts = defaultdict(int)
    for inst in instructions:
        reads, writes = get_registers(inst)
        engine = inst.engine
        # Check hazards and slot limits
        has_raw = reads & current_writes    # reads a register written this cycle
        has_waw = writes & current_writes   # writes a register written this cycle
        engine_full = engine_counts[engine] >= ENGINE_LIMITS[engine]
        if has_raw or has_waw or engine_full:
            # Start a new cycle
            cycles.append(current_cycle)
            current_cycle = []
            current_writes = set()
            engine_counts = defaultdict(int)
        current_cycle.append(inst)
        current_writes |= writes
        engine_counts[engine] += 1
    if current_cycle:  # don't drop the final, partially filled cycle
        cycles.append(current_cycle)
    return cycles
Instead of 1 op per cycle, we’re now packing 5-10 ops per cycle wherever possible.
Verdict: Greedy packing goes surprisingly far.
Step 3: Phase Interleaving → 5,870 cycles (25.2x)
This was the conceptual breakthrough.
Before: Process all 16 rounds for chunk 0, then chunk 1, etc.
chunk0: round0 → round1 → ... → round15
chunk1: round0 → round1 → ... → round15
Problem: Each round depends on the previous round’s results. We have a dependency chain 16 rounds long. The packer can’t parallelize across rounds.
After: Process round 0 for all chunks, then round 1 for all chunks, etc.
round0: chunk0, chunk1, ..., chunk31
round1: chunk0, chunk1, ..., chunk31
Within a round, different chunks are completely independent. While chunk 0 is waiting for its load to complete, we can work on chunks 1, 2, 3…
I flipped two nested loops and got 2.6x.
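In emitter terms, the change is just swapping the loop order (emit_round here is a stand-in for whatever emits one round's instructions for one 8-wide chunk):

# Before: 16-round dependency chain per chunk
for chunk in range(32):
    for rnd in range(16):
        emit_round(chunk, rnd)

# After: same work, but 32 independent chunks within each round
for rnd in range(16):
    for chunk in range(32):
        emit_round(chunk, rnd)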
Verdict: Shorter chains = more parallelism.
Step 4: Stream Scheduler → 2,779 cycles (53x)
Phase interleaving helped, but we were still leaving performance on the table. Time to get fancier.
Treat each chunk as a “stream” of operations. Interleave streams to maximize engine utilization. The key idea was to limit how many streams can transition from VALU to LOAD per cycle (switch_cap=1) so we don’t drain all VALU-ready heads at once.
This started feeling like dark magic.
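A minimal sketch of the idea, assuming each stream is a queue of (engine, op) pairs; the real scheduler also tracks data hazards, and this simplification issues at most one op per stream per cycle:

from collections import deque

ENGINE_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def interleave_streams(streams, switch_cap=1):
    """Round-robin over per-chunk op streams, capping VALU->LOAD switches."""
    last = [None] * len(streams)  # last engine each stream issued on
    cycles = []
    while any(streams):
        slots = dict(ENGINE_LIMITS)
        switches, cycle = 0, []
        for i, stream in enumerate(streams):
            if not stream or slots[stream[0][0]] == 0:
                continue
            engine, op = stream[0]
            if engine == "load" and last[i] == "valu":
                if switches >= switch_cap:
                    continue  # defer: don't drain all VALU-ready heads at once
                switches += 1
            stream.popleft()
            slots[engine] -= 1
            last[i] = engine
            cycle.append(op)
        cycles.append(cycle)
    return cycles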
Verdict: Better load balancing. The load engine (our bottleneck) stays more consistently busy.
Step 5: Shallow Tree Preloading → 2,516 cycles (59x)
A tree of height 10 has 2047 nodes, but the first few levels are tiny:
- Level 0: 1 node (root)
- Level 1: 2 nodes
- Level 2: 4 nodes
For the first few rounds, all 256 walkers are guaranteed to access these same 7 nodes.
So: preload nodes 0-6 into vector registers at startup. Precompute the differences between sibling nodes. For shallow rounds, use multiply_add to select the right node without memory access:
# Branchless mux using multiply_add
# b0, b1 = first two direction bits; diff34 = node4 - node3, diff56 = node6 - node5
left = b1 * diff34 + node3 # node3 or node4
right = b1 * diff56 + node5 # node5 or node6
result = b0 * (right - left) + left
Since loads are our bottleneck, eliminating them is pure win.
The jump from 2,779 to 2,516 (10% improvement) was smaller than previous steps. By this point I was running low on ideas. The big architectural wins were behind us.
Verdict: Squeezing.
Step 6: Scratch Optimization → 2,347 cycles (63x)
The input values only need to be read at the start and written at the end, but we were reading and writing them every round.
Fix: Load all values into scratch memory once, keep them there for all 16 rounds, write back once at the end.
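The shape of the fix, in the same pseudo-style as the earlier snippets (names and addressing are illustrative):

# Before: every round round-trips each chunk's values through memory
# After: hoist once, keep in scratch for all 16 rounds, write back once
for c in range(32):
    vload(v_val[c], val_base + 8 * c)    # once, before round 0
# ... all 16 rounds operate on v_val in scratch ...
for c in range(32):
    vstore(val_base + 8 * c, v_val[c])   # once, after round 15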
Bonus trick: The flow engine only has 1 slot for add_imm. Using two pointers means we can increment both in parallel via ALU instead of serializing on flow.
Verdict: Small wins add up.
Step 7: Skip Redundant Ops → 2,257 cycles (65.5x)
Micro-optimizations:
- Skip last round’s pointer update (indices aren’t used)
- Simplified reset logic (we know exactly when it happens)
- Skip direction bit storage in later rounds (only needed for shallow tree mux)
Verdict: Death by a thousand cuts, reversed.
Step 8: Global List Scheduler + multiply_add → 1,524 cycles (96.9x)
The final push combined two techniques.
Global List Scheduler
The setup phase (loading constants, preloading nodes, initializing pointers) had complex dependencies. The basic packer wasn’t optimal.
The algorithm:
- Build a dependency graph across all instructions
- Compute instruction “levels” via BFS (distance from root operations)
- Use priority queues per engine, prioritizing higher-level (more critical) instructions
- Greedily schedule the highest-priority ready instruction each cycle
Suddenly the setup phase that took forever was tight.
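A sketch of that scheduler, assuming deps[i] (the set of instructions i waits on) and level[i] (BFS depth) are precomputed as described; the names and tie-breaking are mine:

import heapq
from collections import defaultdict

ENGINE_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def list_schedule(insts, deps, level):
    """insts: list of (engine, op). Greedy cycle-by-cycle list scheduling."""
    indegree = [len(deps[i]) for i in range(len(insts))]
    dependents = defaultdict(list)
    for i, ds in enumerate(deps):
        for d in ds:
            dependents[d].append(i)
    ready = defaultdict(list)  # per-engine heap, highest level first
    for i, deg in enumerate(indegree):
        if deg == 0:
            heapq.heappush(ready[insts[i][0]], (-level[i], i))
    cycles, done = [], 0
    while done < len(insts):
        cycle, finished = [], []
        for engine, limit in ENGINE_LIMITS.items():
            for _ in range(limit):
                if not ready[engine]:
                    break
                _, i = heapq.heappop(ready[engine])
                cycle.append(insts[i][1])
                finished.append(i)
        # Dependents become ready only after this cycle (RAW rule)
        for i in finished:
            done += 1
            for j in dependents[i]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    heapq.heappush(ready[insts[j][0]], (-level[j], j))
        cycles.append(cycle)
    return cycles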
multiply_add Fusion
The hash function has 6 stages. Three are affine+shift:
a = (a + C1) + (a << 12) # a*4097 + C1
a = (a + C3) + (a << 5) # a*33 + C3
a = (a + C5) + (a << 3) # a*9 + C5
Without fusion, each multiply stage needs 3 VALU ops (shift + add + add). The machine supports fused multiply_add(a, b, c) → a * b + c in one op. This cuts 3 ops to 1 for each multiply stage.
33% fewer VALU ops per hash. Across 256 walkers × 16 rounds, this adds up.
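In the same pseudo-ISA style as the earlier snippets (register names and the multiply_add operand order are my assumptions):

# Before: each affine+shift stage takes 3 VALU ops
valu("<<", t, a, 12)    # t = a << 12
valu("+", a, a, C1)     # a = a + C1
valu("+", a, a, t)      # a = (a + C1) + (a << 12) = a*4097 + C1

# After: one fused op
valu("multiply_add", a, a, c4097, C1)  # a = a*4097 + C1; c4097 holds 4097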
Verdict: The “aha” moment. ✨
The Obsession
Every time I ran the tests and saw a number starting with 2, I felt the pull to try one more thing. The load engine sat there at 100% utilization, mocking me.
When I finally hit 1,524 cycles, I just stared at the terminal for a minute. 96.9x faster than where I started. The load engine was still at 100%. There was probably more to squeeze out.
But I had to get back to work :>
The Numbers
| Metric | Value |
|---|---|
| Total cycles | 1,524 |
| Total operations | 11,152 |
| Ops/cycle | 7.32 / 23 possible (31.8%) |
| VALU utilization | 5.72/6 slots (95.3%) |
| Load utilization | 2.00/2 slots (100%) |
The final code is 720 lines of Python generating ~11,000 VLIW instructions.
There’s something deeply satisfying about optimization problems. Monkey brain like number go down.