---
title: "Optimizing Anthropic's VLIW SIMD Assignment"
description: "I started this on a quiet evening, thinking 'I'll just poke at it for an hour.'"
datePublished: 2026-01-22T18:30:00.000Z
slug: "vliw-optimization"
permalink: "/blog/vliw-optimization"
---
<br />

Anthropic recently open sourced their performance engineering take-home challenge. The goal: make a tree traversal algorithm with hashing run as fast as possible on a simulated VLIW SIMD CPU.

I started this on a quiet evening, thinking "I'll just poke at it for an hour."

Four days later, I was _obsessed_ with ideas about instruction scheduling. The 2,000 cycle barrier became my white whale.

**Starting point:** 147,734 cycles.
**Final result:** 1,524 cycles.
**Speedup:** 96.9x.

<div class="border-l-2 border-yellow-400 px-4">
	TIL I have an unhealthy obsession with making numbers go down.
</div>

## The Problem

256 "walkers" traverse a binary tree over 16 rounds. Each round:

1. Load current node's value
2. XOR with walker's value
3. Hash the result (6 stages of multiply-add and xor/shift)
4. Use the LSB to decide: go left (even) or right (odd)
5. If we fall off the tree (depth > 10), reset to root

The tree is stored as an implicit array: node `i` has children at `2i+1` and `2i+2`.

## Le Machine

Understanding the hardware is crucial. The target is a VLIW SIMD architecture:

**VLIW (Very Long Instruction Word):** Multiple operations execute in parallel across "engines":

- **ALU:** 12 scalar operations per cycle
- **VALU:** 6 vector operations per cycle
- **Load:** 2 memory reads per cycle
- **Store:** 2 memory writes per cycle
- **Flow:** 1 control operation per cycle

**SIMD:** Vector operations process 8 elements simultaneously.

In theory, the machine can execute 23 ops per cycle. But there are constraints:

- **RAW hazards:** Can't read a register being written in the same cycle
- **WAW hazards:** Can't write to the same register twice per cycle
- **Dependencies:** Operations that depend on each other must be in different cycles

**The load engine is the bottleneck.** With only 2 load slots per cycle and heavy memory access patterns, we're fundamentally load-bound. Everything else follows from this.

## The Baseline: 147,734 Cycles

The starter code processes walkers sequentially. One walker at a time, one round at a time:

```python
for round in range(16):
    for walker in range(256):
        # Load, XOR, Hash, Store
```

We're using maybe 1-2% of the machine's capacity per cycle.

**Verdict:** Maximally inefficient.

## Step 1: SIMD → 25,165 cycles (5.87x)

The obvious first step. Instead of processing 1 walker at a time, process 8.

```python
# Before: scalar
load(val, addr)
alu("+", result, val, const)

# After: vector (8 walkers at once)
vload(v_val, addr)
valu("+", v_result, v_val, v_const)
```

We're doing 8x the work per instruction. The VALU engine has 6 slots per cycle, so we can process up to 48 walkers worth of computation per cycle.

**Verdict:** SIMD is the foundation. Everything else builds on this.

## Step 2: Hazard-Aware Packing → 15,438 cycles (9.57x)

After vectorization, we had lots of independent instructions, but they were executing one per cycle. The `pack_slots()` function bundles independent instructions together.

The algorithm tracks which registers are written in the current cycle. A new instruction can pack if:

- Its reads don't overlap with current writes (RAW hazard)
- Its writes don't overlap with current writes (WAW hazard)
- The engine slot limit isn't exceeded

```python
def pack_slots(instructions):
    cycles = []
    current_cycle = []
    current_writes = set()
    engine_counts = defaultdict(int)

    for inst in instructions:
        reads, writes = get_registers(inst)
        engine = inst.engine

        # Check hazards and slot limits
        has_raw = reads & current_writes
        has_waw = writes & current_writes
        engine_full = engine_counts[engine] >= ENGINE_LIMITS[engine]

        if has_raw or has_waw or engine_full:
            # Start new cycle
            cycles.append(current_cycle)
            current_cycle = []
            current_writes = set()
            engine_counts = defaultdict(int)

        current_cycle.append(inst)
        current_writes |= writes
        engine_counts[engine] += 1

    return cycles
```

Instead of 1 op per cycle, we're now packing 5-10 ops per cycle wherever possible.

**Verdict:** Greedy packing goes surprisingly far.

## Step 3: Phase Interleaving → 5,870 cycles (25.2x)

This was the conceptual breakthrough.

**Before:** Process all 16 rounds for chunk 0, then chunk 1, etc.

```
chunk0: round0 → round1 → ... → round15
chunk1: round0 → round1 → ... → round15
```

**Problem:** Each round depends on the previous round's results. We have a dependency chain 16 rounds long. The packer can't parallelize across rounds.

**After:** Process round 0 for all chunks, then round 1 for all chunks, etc.

```
round0: chunk0, chunk1, ..., chunk31
round1: chunk0, chunk1, ..., chunk31
```

Within a round, different chunks are completely independent. While chunk 0 is waiting for its load to complete, we can work on chunks 1, 2, 3...

I flipped two nested loops and got 4x.

**Verdict:** Shorter chains = more parallelism.

## Step 4: Stream Scheduler → 2,779 cycles (53x)

Phase interleaving helped, but we were still leaving performance on the table. Time to get fancier.

Treat each chunk as a "stream" of operations. Interleave streams to maximize engine utilization. The key idea was to limit how many streams can transition from VALU to LOAD per cycle (`switch_cap=1`) so we don't drain all VALU-ready heads at once.

This started feeling like dark magic.

**Verdict:** Better load balancing. The load engine (our bottleneck) stays more consistently busy.

## Step 5: Shallow Tree Preloading → 2,516 cycles (59x)

A tree of height 10 has 2047 nodes, but the first few levels are tiny:

- Level 0: 1 node (root)
- Level 1: 2 nodes
- Level 2: 4 nodes

For the first few rounds, all 256 walkers are guaranteed to access these same 7 nodes.

So: preload nodes 0-6 into vector registers at startup. Precompute the differences. For shallow rounds, use `multiply_add` to select the right node without memory access:

```python
# Branchless mux using multiply_add
left = b1 * diff34 + node3   # node3 or node4
right = b1 * diff56 + node5  # node5 or node6
result = b0 * (right - left) + left
```

Since loads are our bottleneck, eliminating them is pure win.

The jump from 2,779 to 2,516 (10% improvement) was smaller than previous steps. By this point I was running low on ideas. The big architectural wins were behind us.

**Verdict:** Squeezing.

## Step 6: Scratch Optimization → 2,347 cycles (63x)

The input values needed to be read at the start and written at the end. We were reading/writing them every round.

**Fix:** Load all values into scratch memory once, keep them there for all 16 rounds, write back once at the end.

**Bonus trick:** The flow engine only has 1 slot for `add_imm`. Using two pointers means we can increment both in parallel via ALU instead of serializing on flow.

**Verdict:** Small wins add up.

## Step 7: Skip Redundant Ops → 2,257 cycles (65.5x)

Micro-optimizations:

- Skip last round's pointer update (indices aren't used)
- Simplified reset logic (we know exactly when it happens)
- Skip direction bit storage in later rounds (only needed for shallow tree mux)

**Verdict:** Death by a thousand cuts, reversed.

## Step 8: Global List Scheduler + multiply_add → 1,524 cycles (96.9x)

The final push combined two techniques.

### Global List Scheduler

The setup phase (loading constants, preloading nodes, initializing pointers) had complex dependencies. The basic packer wasn't optimal.

The algorithm:

1. Build a dependency graph across all instructions
2. Compute instruction "levels" via BFS (distance from root operations)
3. Use priority queues per engine, prioritizing higher-level (more critical) instructions
4. Greedily schedule the highest-priority ready instruction each cycle

Suddenly the setup phase that took forever was tight.

### `multiply_add` Fusion

The hash function has 6 stages. Three are affine+shift:

```
a = (a + C1) + (a << 12)   # a*4097 + C1
a = (a + C3) + (a << 5)    # a*33 + C3
a = (a + C5) + (a << 3)    # a*9 + C5
```

Without fusion, each multiply stage needs 3 VALU ops (shift + add + add). The machine supports fused `multiply_add(a, b, c)` → `a * b + c` in one op. This cuts 3 ops to 1 for each multiply stage.

33% fewer VALU ops per hash. Across 256 walkers × 16 rounds, this adds up.

**Verdict:** The "aha" moment. ✨

## The Obsession

Every time I ran the tests and saw a number starting with 2, I felt the pull to try one more thing. The load engine sat there at 100% utilization, mocking me.

When I finally hit 1,524 cycles, I just stared at the terminal for a minute. 96.9x faster than where I started. The load engine was still at 100%. There was probably more to squeeze out.

But I had to get back to work :>

## The Numbers

| Metric           | Value                      |
| ---------------- | -------------------------- |
| Total cycles     | 1,524                      |
| Total operations | 11,152                     |
| Ops/cycle        | 7.32 / 23 possible (31.8%) |
| VALU utilization | 5.72/6 slots (95.3%)       |
| Load utilization | 2.00/2 slots (100%)        |

The final code is [720 lines of Python](https://github.com/obviyus/original_performance_takehome) generating ~11,000 VLIW instructions.

There's something deeply satisfying about optimization problems. Monkey brain like number go down.