$ cd ..

Debugging a Cloudflare Routing Blackhole

📅 2025-07-30

⌛ 73 days ago

📖 3 min read

TL;DR: Users reported issues, two Cloudflare proxied hostnames would just hang. Same app on a non proxied domain worked great. The culprit (my best guess) was a routing/peering blackhole on the ISP -> Cloudflare path for a specific Cloudflare anycast pool. Not DNS!

Workaround: Change to a non-proxied DNS only (direct) hostname.

Update: Cloudflare is currently investigating. According to this CF Community thread, it was broken on Jio yesterday but they fixed it today and broke it for Airtel.

The Setup

Users started reporting issues that the sites kept timing out. I was completely unable to replicate this. I asked friends and colleagues, but none of them had issues.

Two doors, same house:

app.alpha.example → Cloudflare proxied (via Tunnel) → Bun server on :3002
app.beta.example → direct (Nginx, no CF proxy) → same app

Users could not access app.alpha.example while app.beta.example worked just fine. Other apps on the same box (:3001) worked perfectly.

The Symptom

On one ISP: Cloudflare proxied hostname(s) → infinite timeout.
On VPN / other ISPs: flawless.
On the same ISP, direct hostname: flawless.

So either my server hates orange clouds, or the path to Cloudflare is cursed. What made this infinitely more complex is that it was intermittent. Out of nowhere, it started working for a subset of users but broke for another.

First Try: DNS is Boring (and That’s Good)

dig +short A app.alpha.example
# → 104.21.48.1 104.21.80.1 104.21.96.1 104.21.112.1 104.21.16.1 104.21.32.1 104.21.64.1

dig +short AAAA app.alpha.example
# → 2606:4700:3030::6815:6001 ... ::6815:5001

Resolves fine. Shockingly, it wasn’t DNS. The proxied domain lands in 104.21.0.0/16. The direct one hits my origin.

Second Try: Pinning the Edge IPs

for ip in $(dig +short A app.alpha.example); do
  echo "== testing $ip =="
  curl -sSvk --connect-timeout 6 \
    --resolve app.alpha.example:443:$ip \
    https://app.alpha.example/cdn-cgi/trace | head -n 6 || echo "(timeout)"
done

On the broken ISP: every 104.21.x.x IP → timeout.
On VPN: same IPs → normal colo=... traces.

The Smoking Gun: `mtr` on TCP/443

sudo mtr -rwzc 100 -P 443 104.21.96.1

Result:

1. 192.168.31.1         0.0%   100    3.1   4.5   2.5  12.6   1.6
2. 192.168.1.1          0.0%   100    5.8   4.9   3.1  11.2   1.3
3. <ISP access>         0.0%   100    9.1  10.8   5.2  46.1   7.3
4. <ISP core>           0.0%   100    8.3  11.5   4.6  50.7   7.6
5. *                  100%   100     —     —     —     —     —

Hop 5: the abyss. Exactly where the ISP should hand traffic to Cloudflare (AS13335). Instead, the packets vanish.

”But What About Argo?”

… is what I asked myself. Nope. Argo optimizes edge ⇄ origin. This failure happens before the edge handshake. If your SYNs don’t land, Argo cannot save you.

Takeaways (and a checklist for next time)

If a CF proxied site dies on one ISP, suspect routing.
Prove it fast: curl --resolve + mtr -P 443.
Keep users happy: DNS only fallback.
Don’t blame your tunnel/server when it is a hop 5 blackhole.

I had become quite a fan of using Cloudflare Tunnels lately. Alas, back to good old nginx I go.