Exetools  

Go Back   Exetools > General > General Discussion

Notices

 
 
Thread Tools Display Modes
Prev Previous Post   Next Post Next
  #12  
Old 03-08-2026, 03:55
cjack's Avatar
cjack cjack is offline
Family
 
Join Date: Jan 2002
Posts: 170
Rept. Given: 196
Rept. Rcvd 176 Times in 34 Posts
Thanks Given: 332
Thanks Rcvd at 219 Times in 64 Posts
cjack Reputation: 100-199 cjack Reputation: 100-199
Critical Bug Found & Fixed (v1.5.0)

Hey all,

Wanted to share a hard lesson learned with the THUNDERSTRIKE distributed Pollard's Rho solver targeting Armadillo ECDSA-113 (binary Koblitz curve over GF(2^113)).

The Problem

We've been running ~30 agents (mostly RTX 5090s) at a combined ~92 G/s for a while now. Reached 233% of the expected median iteration count with absolutely zero collisions. The probability of that happening with a correctly functioning Rho walk is roughly 1.4% — suspicious enough to warrant a deep investigation.

Root Cause

The walk partition function p = X.hi & 31 was using the projective X coordinate instead of the affine x coordinate.

In Lopez-Dahab projective coordinates, X_proj = x_affine × Z. After the very first ld_madd step, Z diverges from 1. So two walks arriving at the same affine point but carrying different Z values would compute different partition indices, select different walk table entries, and diverge. Walks never merge. Pollard's Rho degenerates into pure random distinguished point sampling — you'd need ~10^12 DPs for a birthday collision among them. We had collected ~1.8 million. At that rate: roughly 223 years. Not ideal.

The bug was subtle because every individual component (EC arithmetic, DP detection, server collection) was working correctly in isolation. The partition function just happened to operate on the wrong representation of the point.

The Fix (v1.5.0)

We switched to per-step affine normalization using Itoh-Tsujii inversion, ensuring Z = 1 at every step. This means the partition function now sees the true affine x coordinate and walks sharing the same point will always take the same step — as Pollard intended.

With Z guaranteed to be 1 on input, we wrote an optimized ld_madd_z1 routine (5M+3S vs the previous 8M+5S). The compiled kernel hits 96 registers, 0 spills. Throughput on a single RTX 5090 is ~975 M/s — about 3.5x slower per step than before, but the algorithm now actually converges.

Verification

We wrote formal proofs for the partition invariant and ran a 5-test verification suite — all passing, confirming both that the old code was broken and that the new code preserves walk mergeability. Test runs show hash table duplicates growing at the expected rate, which is exactly what you want to see.

What's Next

Operations are temporarily suspended while we do final verification across the fleet. Once we restart with v1.5.0, the estimated time to solve one certificate is in the range of 32-50 hours with the full agent fleet at reduced per-step throughput. A very different story from "223 years."

Sometimes the most dangerous bugs are the ones where everything looks like it's working perfectly. 92 G/s of beautifully fast, completely useless computation.
Reply With Quote
 

Tags
bolero, ecdlp


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Replacing ECDSA in Target (arma) Mynotos General Discussion 3 11-22-2019 00:49


All times are GMT +8. The time now is 03:35.


Always Your Best Friend: Aaron, JMI, ahmadmansoor, ZeNiX, chessgod101
( Since 1998 )