Anthropic recently open sourced their performance engineering take-home challenge. The goal: make a tree traversal algorithm with hashing run as fast as possible on a simulated VLIW SIMD CPU.
I started this on a quiet evening, thinking “I’ll just poke at it for an hour.”
Four days later, I was obsessed with ideas about instruction scheduling. The 2,000 cycle barrier became my white whale.
Starting point: 147,734 cycles. Final result: 1,524 cycles. Speedup: 96.9x.
TIL I have an unhealthy obsession with making numbers go down.
The Problem
256 “walkers” traverse a binary tree over 16 rounds. Each round:
- Load current node’s value
- XOR with walker’s value
- Hash the result (6 stages of multiply-add and xor/shift)
- Use the LSB to decide: go left (even) or right (odd)
- If we fall off the tree (depth > 10), reset to root
The tree is stored as an implicit array: node i has children at 2i+1 and 2i+2.
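In plain Python, one walker's round looks roughly like this (a sketch: hash6 is a hypothetical stand-in for the 6-stage hash, and what the walker carries forward between rounds is elided):

def walker_round(tree, node, depth, walker_val):
    h = hash6(tree[node] ^ walker_val)  # load, XOR, 6-stage hash
    node = 2 * node + 1 + (h & 1)       # LSB: even -> left, odd -> right
    depth += 1
    if depth > 10:                      # fell off the tree
        node, depth = 0, 0              # reset to root
    return node, depth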
Le Machine
Understanding the hardware is crucial. The target is a VLIW SIMD architecture:
VLIW (Very Long Instruction Word): Multiple operations execute in parallel across “engines”:
- ALU: 12 scalar operations per cycle
- VALU: 6 vector operations per cycle
- Load: 2 memory reads per cycle
- Store: 2 memory writes per cycle
- Flow: 1 control operation per cycle
SIMD: Vector operations process 8 elements simultaneously.
In theory, the machine can execute 23 ops per cycle (12 + 6 + 2 + 2 + 1). But there are constraints:
- RAW hazards: Can’t read a register being written in the same cycle
- WAW hazards: Can’t write to the same register twice per cycle
- Dependencies: Operations that depend on each other must be in different cycles
The load engine is the bottleneck. With only 2 load slots per cycle and heavy memory access patterns, we’re fundamentally load-bound. Everything else follows from this.
The Baseline: 147,734 Cycles
The starter code processes walkers sequentially. One walker at a time, one round at a time:
for round in range(16):
    for walker in range(256):
        # Load, XOR, Hash, Store
We’re using maybe 1-2% of the machine’s capacity per cycle.
Verdict: Maximally inefficient.
Step 1: SIMD → 25,165 cycles (5.87x)
The obvious first step. Instead of processing 1 walker at a time, process 8.
# Before: scalar
load(val, addr)
alu("+", result, val, const)
# After: vector (8 walkers at once)
vload(v_val, addr)
valu("+", v_result, v_val, v_const)
We’re doing 8x the work per instruction. The VALU engine has 6 slots per cycle, so we can process up to 48 walkers worth of computation per cycle.
Verdict: SIMD is the foundation. Everything else builds on this.
Step 2: Hazard-Aware Packing → 15,438 cycles (9.57x)
After vectorization, we had lots of independent instructions, but they were executing one per cycle. The pack_slots() function bundles independent instructions together.
The algorithm tracks which registers are written in the current cycle. A new instruction can pack if:
- Its reads don’t overlap with current writes (RAW hazard)
- Its writes don’t overlap with current writes (WAW hazard)
- The engine slot limit isn’t exceeded
from collections import defaultdict

def pack_slots(instructions):
    cycles = []
    current_cycle = []
    current_writes = set()
    engine_counts = defaultdict(int)
    for inst in instructions:
        reads, writes = get_registers(inst)
        engine = inst.engine
        # Check hazards and slot limits
        has_raw = reads & current_writes    # reads a register written this cycle
        has_waw = writes & current_writes   # writes a register written this cycle
        engine_full = engine_counts[engine] >= ENGINE_LIMITS[engine]
        if has_raw or has_waw or engine_full:
            # Start a new cycle
            cycles.append(current_cycle)
            current_cycle = []
            current_writes = set()
            engine_counts = defaultdict(int)
        current_cycle.append(inst)
        current_writes |= writes
        engine_counts[engine] += 1
    if current_cycle:  # don't drop the final, partially filled cycle
        cycles.append(current_cycle)
    return cycles
Instead of 1 op per cycle, we’re now packing 5-10 ops per cycle wherever possible.
Verdict: Greedy packing goes surprisingly far.
Step 3: Phase Interleaving → 5,870 cycles (25.2x)
This was the conceptual breakthrough.
Before: Process all 16 rounds for chunk 0, then chunk 1, etc.
chunk0: round0 → round1 → ... → round15
chunk1: round0 → round1 → ... → round15
Problem: Each round depends on the previous round’s results. We have a dependency chain 16 rounds long. The packer can’t parallelize across rounds.
After: Process round 0 for all chunks, then round 1 for all chunks, etc.
round0: chunk0, chunk1, ..., chunk31
round1: chunk0, chunk1, ..., chunk31
Within a round, different chunks are completely independent. While chunk 0 is waiting for its load to complete, we can work on chunks 1, 2, 3…
I flipped two nested loops and got 2.6x.
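In emitter terms, the change is just swapping the loop order (emit_round here is a stand-in for whatever emits one round's instructions for one 8-wide chunk):

# Before: 16-round dependency chain per chunk
for chunk in range(32):
    for rnd in range(16):
        emit_round(chunk, rnd)

# After: same work, but 32 independent chunks within each round
for rnd in range(16):
    for chunk in range(32):
        emit_round(chunk, rnd)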
Verdict: Shorter chains = more parallelism.
Step 4: Stream Scheduler → 2,779 cycles (53x)
Phase interleaving helped, but we were still leaving performance on the table. Time to get fancier.
Treat each chunk as a “stream” of operations. Interleave streams to maximize engine utilization. The key idea was to limit how many streams can transition from VALU to LOAD per cycle (switch_cap=1) so we don’t drain all VALU-ready heads at once.
This started feeling like dark magic.
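A minimal sketch of the idea, assuming each stream is a queue of (engine, op) pairs; the real scheduler also tracks data hazards, and this simplification issues at most one op per stream per cycle:

from collections import deque

ENGINE_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def interleave_streams(streams, switch_cap=1):
    """Round-robin over per-chunk op streams, capping VALU->LOAD switches."""
    last = [None] * len(streams)  # last engine each stream issued on
    cycles = []
    while any(streams):
        slots = dict(ENGINE_LIMITS)
        switches, cycle = 0, []
        for i, stream in enumerate(streams):
            if not stream or slots[stream[0][0]] == 0:
                continue
            engine, op = stream[0]
            if engine == "load" and last[i] == "valu":
                if switches >= switch_cap:
                    continue  # defer: don't drain all VALU-ready heads at once
                switches += 1
            stream.popleft()
            slots[engine] -= 1
            last[i] = engine
            cycle.append(op)
        cycles.append(cycle)
    return cycles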
Verdict: Better load balancing. The load engine (our bottleneck) stays more consistently busy.
Step 5: Shallow Tree Preloading → 2,516 cycles (59x)
A tree of height 10 has 2047 nodes, but the first few levels are tiny:
- Level 0: 1 node (root)
- Level 1: 2 nodes
- Level 2: 4 nodes
For the first few rounds, all 256 walkers are guaranteed to access these same 7 nodes.
So: preload nodes 0-6 into vector registers at startup. Precompute the differences between sibling nodes. For shallow rounds, use multiply_add to select the right node without memory access:
# Branchless mux using multiply_add
# b0, b1 = first two direction bits; diff34 = node4 - node3, diff56 = node6 - node5
left = b1 * diff34 + node3 # node3 or node4
right = b1 * diff56 + node5 # node5 or node6
result = b0 * (right - left) + left
Since loads are our bottleneck, eliminating them is pure win.
The jump from 2,779 to 2,516 (10% improvement) was smaller than previous steps. By this point I was running low on ideas. The big architectural wins were behind us.
Verdict: Squeezing.
Step 6: Scratch Optimization → 2,347 cycles (63x)
The input values only need to be read at the start and written at the end, but we were reading and writing them every round.
Fix: Load all values into scratch memory once, keep them there for all 16 rounds, write back once at the end.
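The shape of the fix, in the same pseudo-style as the earlier snippets (names and addressing are illustrative):

# Before: every round round-trips each chunk's values through memory
# After: hoist once, keep in scratch for all 16 rounds, write back once
for c in range(32):
    vload(v_val[c], val_base + 8 * c)    # once, before round 0
# ... all 16 rounds operate on v_val in scratch ...
for c in range(32):
    vstore(val_base + 8 * c, v_val[c])   # once, after round 15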
Bonus trick: The flow engine only has 1 slot for add_imm. Using two pointers means we can increment both in parallel via ALU instead of serializing on flow.
Verdict: Small wins add up.
Step 7: Skip Redundant Ops → 2,257 cycles (65.5x)
Micro-optimizations:
- Skip last round’s pointer update (indices aren’t used)
- Simplified reset logic (we know exactly when it happens)
- Skip direction bit storage in later rounds (only needed for shallow tree mux)
Verdict: Death by a thousand cuts, reversed.
Step 8: Global List Scheduler + multiply_add → 1,524 cycles (96.9x)
The final push combined two techniques.
Global List Scheduler
The setup phase (loading constants, preloading nodes, initializing pointers) had complex dependencies. The basic packer wasn’t optimal.
The algorithm:
- Build a dependency graph across all instructions
- Compute instruction “levels” via BFS (distance from root operations)
- Use priority queues per engine, prioritizing higher-level (more critical) instructions
- Greedily schedule the highest-priority ready instruction each cycle
Suddenly the setup phase that took forever was tight.
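A sketch of that scheduler, assuming deps[i] (the set of instructions i waits on) and level[i] (BFS depth) are precomputed as described; the names and tie-breaking are mine:

import heapq
from collections import defaultdict

ENGINE_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def list_schedule(insts, deps, level):
    """insts: list of (engine, op). Greedy cycle-by-cycle list scheduling."""
    indegree = [len(deps[i]) for i in range(len(insts))]
    dependents = defaultdict(list)
    for i, ds in enumerate(deps):
        for d in ds:
            dependents[d].append(i)
    ready = defaultdict(list)  # per-engine heap, highest level first
    for i, deg in enumerate(indegree):
        if deg == 0:
            heapq.heappush(ready[insts[i][0]], (-level[i], i))
    cycles, done = [], 0
    while done < len(insts):
        cycle, finished = [], []
        for engine, limit in ENGINE_LIMITS.items():
            for _ in range(limit):
                if not ready[engine]:
                    break
                _, i = heapq.heappop(ready[engine])
                cycle.append(insts[i][1])
                finished.append(i)
        # Dependents become ready only after this cycle (RAW rule)
        for i in finished:
            done += 1
            for j in dependents[i]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    heapq.heappush(ready[insts[j][0]], (-level[j], j))
        cycles.append(cycle)
    return cycles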
multiply_add Fusion
The hash function has 6 stages. Three are affine+shift:
a = (a + C1) + (a << 12) # a*4097 + C1
a = (a + C3) + (a << 5) # a*33 + C3
a = (a + C5) + (a << 3) # a*9 + C5
Without fusion, each multiply stage needs 3 VALU ops (shift + add + add). The machine supports fused multiply_add(a, b, c) → a * b + c in one op. This cuts 3 ops to 1 for each multiply stage.
33% fewer VALU ops per hash. Across 256 walkers × 16 rounds, this adds up.
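In the same pseudo-ISA style as the earlier snippets (register names and the multiply_add operand order are my assumptions):

# Before: each affine+shift stage takes 3 VALU ops
valu("<<", t, a, 12)    # t = a << 12
valu("+", a, a, C1)     # a = a + C1
valu("+", a, a, t)      # a = (a + C1) + (a << 12) = a*4097 + C1

# After: one fused op
valu("multiply_add", a, a, c4097, C1)  # a = a*4097 + C1; c4097 holds 4097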
Verdict: The “aha” moment. ✨
The Obsession
Every time I ran the tests and saw a number starting with 2, I felt the pull to try one more thing. The load engine sat there at 100% utilization, mocking me.
When I finally hit 1,524 cycles, I just stared at the terminal for a minute. 96.9x faster than where I started. The load engine was still at 100%. There was probably more to squeeze out.
But I had to get back to work :>
The Numbers
| Metric | Value |
|---|---|
| Total cycles | 1,524 |
| Total operations | 11,152 |
| Ops/cycle | 7.32 / 23 possible (31.8%) |
| VALU utilization | 5.72/6 slots (95.3%) |
| Load utilization | 2.00/2 slots (100%) |
The final code is 720 lines of Python generating ~11,000 VLIW instructions.
There’s something deeply satisfying about optimization problems. Monkey brain like number go down.