View Single Post
  #12  
Old 03-08-2026, 03:55
cjack's Avatar
cjack cjack is offline
Family
 
Join Date: Jan 2002
Posts: 170
Rept. Given: 196
Rept. Rcvd 176 Times in 34 Posts
Thanks Given: 332
Thanks Rcvd at 219 Times in 64 Posts
cjack Reputation: 100-199 cjack Reputation: 100-199
Critical Bug Found & Fixed (v1.5.0)

Hey all,

Wanted to share a hard lesson learned with the THUNDERSTRIKE distributed Pollard's Rho solver targeting Armadillo ECDSA-113 (binary Koblitz curve over GF(2^113)).

The Problem

We've been running ~30 agents (mostly RTX 5090s) at a combined ~92 G/s for a while now. Reached 233% of the expected median iteration count with absolutely zero collisions. The probability of that happening with a correctly functioning Rho walk is roughly 1.4% — suspicious enough to warrant a deep investigation.

Root Cause

The walk partition function p = X.hi & 31 was using the projective X coordinate instead of the affine x coordinate.

In Lopez-Dahab projective coordinates, X_proj = x_affine × Z. After the very first ld_madd step, Z diverges from 1. So two walks arriving at the same affine point but carrying different Z values would compute different partition indices, select different walk table entries, and diverge. Walks never merge. Pollard's Rho degenerates into pure random distinguished point sampling — you'd need ~10^12 DPs for a birthday collision among them. We had collected ~1.8 million. At that rate: roughly 223 years. Not ideal.

The bug was subtle because every individual component (EC arithmetic, DP detection, server collection) was working correctly in isolation. The partition function just happened to operate on the wrong representation of the point.

The Fix (v1.5.0)

We switched to per-step affine normalization using Itoh-Tsujii inversion, ensuring Z = 1 at every step. This means the partition function now sees the true affine x coordinate and walks sharing the same point will always take the same step — as Pollard intended.

With Z guaranteed to be 1 on input, we wrote an optimized ld_madd_z1 routine (5M+3S vs the previous 8M+5S). The compiled kernel hits 96 registers, 0 spills. Throughput on a single RTX 5090 is ~975 M/s — about 3.5x slower per step than before, but the algorithm now actually converges.

Verification

We wrote formal proofs for the partition invariant and ran a 5-test verification suite — all passing, confirming both that the old code was broken and that the new code preserves walk mergeability. Test runs show hash table duplicates growing at the expected rate, which is exactly what you want to see.

What's Next

Operations are temporarily suspended while we do final verification across the fleet. Once we restart with v1.5.0, the estimated time to solve one certificate is in the range of 32-50 hours with the full agent fleet at reduced per-step throughput. A very different story from "223 years."

Sometimes the most dangerous bugs are the ones where everything looks like it's working perfectly. 92 G/s of beautifully fast, completely useless computation.
Reply With Quote