Exetools  

  #1  
Old 03-13-2026, 17:49
cjack's Avatar
cjack cjack is offline
Family
 
Join Date: Jan 2002
Posts: 170
Rept. Given: 196
Rept. Rcvd 176 Times in 34 Posts
Thanks Given: 332
Thanks Rcvd at 219 Times in 64 Posts
cjack Reputation: 100-199 cjack Reputation: 100-199
Quote:
Originally Posted by WhoCares View Post
@cjack

too many "OFFLINE" prints now


[ENDGAME][ 1240s] 200.10 M iter/s | 2.481e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1249s] 200.12 M iter/s | 2.500e+11 iters | DP sent:3
[ENDGAME][ 1250s] 200.04 M iter/s | 2.500e+11 iters | SERVER OFFLINE - retrying...

[DP] 7602 sent, 7602 unique (0.0% dup rate)
[ENDGAME][ 1259s] 200.14 M iter/s | 2.520e+11 iters | DP sent:0
[ENDGAME][ 1260s] 200.06 M iter/s | 2.521e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1269s] 200.16 M iter/s | 2.540e+11 iters | DP sent:4
[ENDGAME][ 1270s] 200.08 M iter/s | 2.541e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1288s] 200.13 M iter/s | 2.578e+11 iters | DP sent:4
[DP] 7801 sent, 7801 unique (0.0% dup rate)
[ENDGAME][ 1321s] 200.02 M iter/s | 2.642e+11 iters | DP sent:3
[DP] 8000 sent, 8000 unique (0.0% dup rate)
[ENDGAME][ 1355s] 199.63 M iter/s | 2.705e+11 iters | DP sent:4
[DP] 8202 sent, 8202 unique (0.0% dup rate)
[ENDGAME][ 1359s] 199.68 M iter/s | 2.714e+11 iters | DP sent:1
[ENDGAME][ 1360s] 199.61 M iter/s | 2.715e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1369s] 199.63 M iter/s | 2.733e+11 iters | DP sent:2
[ENDGAME][ 1370s] 199.56 M iter/s | 2.734e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1379s] 199.65 M iter/s | 2.753e+11 iters | DP sent:4
[ENDGAME][ 1380s] 199.58 M iter/s | 2.754e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1388s] 199.54 M iter/s | 2.770e+11 iters | DP sent:2
[DP] 8401 sent, 8401 unique (0.0% dup rate)
[ENDGAME][ 1389s] 199.61 M iter/s | 2.773e+11 iters | DP sent:3
[ENDGAME][ 1390s] 199.53 M iter/s | 2.773e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1419s] 199.46 M iter/s | 2.830e+11 iters | DP sent:1
[DP] 8601 sent, 8601 unique (0.0% dup rate)
[ENDGAME][ 1459s] 199.42 M iter/s | 2.910e+11 iters | DP sent:2
[DP] 8802 sent, 8802 unique (0.0% dup rate)
[ENDGAME][ 1489s] 199.48 M iter/s | 2.970e+11 iters | DP sent:3
[ENDGAME][ 1490s] 199.41 M iter/s | 2.971e+11 iters | SERVER OFFLINE - retrying...
[ENDGAME][ 1492s] 199.40 M iter/s | 2.975e+11 iters | DP sent:6
[DP] 9006 sent, 9006 unique (0.0% dup rate)
[ENDGAME][ 1499s] 199.50 M iter/s | 2.991e+11 iters | DP sent:3
[ENDGAME][ 1500s] 199.43 M iter/s | 2.992e+11 iters | SERVER OFFLINE - retrying...
Hey WhoCares, thanks for reporting the connection problems you experienced.

The "SERVER OFFLINE" errors you're seeing are caused by our network setup — we use a VPN as a reverse proxy to route traffic to the server. This intermediate hop occasionally drops connections or times out, resulting in those 502 errors. The agent handles this gracefully by retrying automatically, so no work is ever lost.

We've recently applied a server-side optimization that significantly reduced the issue (server response time went from ~2000ms down to ~2ms), but the VPN layer can still cause sporadic hiccups. We're looking into improving this path further.

Important: these brief connection blips have zero impact on computational performance. The solver keeps running on the GPU continuously — it only needs the server to submit Distinguished Points, which happens asynchronously. Even if a submission fails, it simply retries on the next cycle. Your GPU speed and contribution are not affected at all.
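A minimal sketch of the retry behaviour described above. The `send` callable is hypothetical and is assumed to raise `ConnectionError` when the proxy hop fails; the class and method names are illustrative and not taken from the agent's source.

```python
from collections import deque


class DPSubmitter:
    """Buffers distinguished points and retries failed uploads on the next cycle."""

    def __init__(self, send):
        self.send = send        # hypothetical upload function; raises ConnectionError on failure
        self.pending = deque()  # DPs not yet acknowledged by the server

    def add(self, dp):
        self.pending.append(dp)

    def flush(self):
        """Try to upload everything pending; return True if the batch was accepted."""
        if not self.pending:
            return True
        batch = list(self.pending)
        try:
            self.send(batch)
        except ConnectionError:
            return False        # keep the batch queued; the GPU loop never waits on this
        for _ in batch:         # drop only the entries that were actually sent
            self.pending.popleft()
        return True
```

Because `flush()` only removes points after a successful send, a failed cycle leaves the queue untouched and the next cycle picks it up again.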

Thanks for your patience and for contributing to the project!
Reply With Quote
The Following User Says Thank You to cjack For This Useful Post:
WhoCares (03-13-2026)
  #2  
Old 03-13-2026, 21:27
WhoCares's Avatar
WhoCares WhoCares is offline
who cares
 
Join Date: Jan 2002
Location: Here
Posts: 468
Rept. Given: 11
Rept. Rcvd 32 Times in 25 Posts
Thanks Given: 69
Thanks Rcvd at 247 Times in 94 Posts
WhoCares Reputation: 32
@cjack

Not sure if this is expected behavior, but I noticed a couple of things:

1. The local console reports about 197 M/s, while the web dashboard shows around 192 M/s.
2. The speed reported by the console gradually decreases over time. It started at around 200 M/s and has slowly dropped to ~197 M/s so far.

My machine isn’t running any other CPU- or GPU-intensive workloads.

Just wondering if this difference is expected, or if I might be missing something.
__________________
AKA Solomon/blowfish.

Last edited by WhoCares; 03-13-2026 at 21:39.
Reply With Quote
  #3  
Old 03-13-2026, 23:30
cjack
Quote:
Originally Posted by WhoCares View Post
@cjack

Not sure if this is expected behavior, but I noticed a couple of things:

1. The local console reports about 197 M/s, while the web dashboard shows around 192 M/s.
2. The speed reported by the console gradually decreases over time. It started at around 200 M/s and has slowly dropped to ~197 M/s so far.

My machine isn’t running any other CPU- or GPU-intensive workloads.

Just wondering if this difference is expected, or if I might be missing something.
@WhoCares

Both are totally expected, no worries!

1. Console vs Dashboard difference (~197 vs ~192 M/s):
The local console shows a cumulative average (total iterations ÷ total time since start). The dashboard uses a server-side EMA (exponential moving average) computed from DP arrival deltas every ~10 seconds. The server measurement includes a tiny network latency overhead on each heartbeat interval, which consistently makes it read ~2-3% lower. The dashboard number is actually a more accurate picture of your current sustained throughput, while the console number is slightly inflated by the faster early-session burst.

2. Gradual speed decrease (200 → 197 M/s):
Classic GPU thermal behavior. When you first start, the GPU runs at max boost clocks while it's cool. After a few minutes it reaches thermal equilibrium and settles to sustained clocks — slightly lower. Since the console uses a cumulative average from time zero, it takes a while for the displayed number to drift down from that early burst. It'll eventually stabilize around ~195-197 and stay there. Completely normal.
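To illustrate why the two readouts disagree, here is a small simulation. The boost rate, sustained rate, boost duration and smoothing factor `alpha` are assumptions chosen for the example, and the network-latency offset mentioned above is not modelled.

```python
def simulate(boost=200.0, sustained=195.0, boost_seconds=300,
             total_seconds=3600, interval=10, alpha=0.2):
    """Return (cumulative average, EMA) in M iter/s after total_seconds."""
    total_iters = 0.0
    ema = None
    for t in range(interval, total_seconds + 1, interval):
        rate = boost if t <= boost_seconds else sustained
        total_iters += rate * interval
        # The EMA forgets old samples; the cumulative average never does.
        ema = rate if ema is None else alpha * rate + (1 - alpha) * ema
    cumulative = total_iters / total_seconds
    return cumulative, ema


cumulative, ema = simulate()
print(f"cumulative average: {cumulative:.2f} M/s, EMA: {ema:.2f} M/s")
```

After an hour the EMA has settled on the sustained rate, while the cumulative average still carries the early boost period and only drifts down slowly.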

Your setup is running perfectly. Thanks for contributing!
Reply With Quote
The Following 2 Users Say Thank You to cjack For This Useful Post:
Jupiter (03-15-2026), WhoCares (03-13-2026)
  #4  
Old 03-15-2026, 00:46
WhoCares
@cjack

Please confirm whether the issues described in the attachment are real bugs. Thanks.
Attached Files
File Type: zip MayBeBug.zip (1.2 KB, 3 views)
__________________
AKA Solomon/blowfish.
Reply With Quote
The Following User Gave Reputation+1 to WhoCares For This Useful Post:
Jupiter (03-16-2026)
  #5  
Old 03-15-2026, 01:06
cjack
Hey WhoCares,

I owe you a correction — and a big one.

In my previous reply, I said your point #3 (walk partition not Frobenius-invariant) was "a legitimate optimization opportunity, not a correctness bug." I was wrong, and I want to be upfront about it.

After a deeper analysis, we realized this is actually the root cause of why our collision is taking so long. Here's what we found:

The walk function uses p = x.lo & 31 (actual affine x bits) for the partition. Since τ(x,y) = (x², y²) produces completely different low bits, walks through P and τ(P) diverge immediately. There's no walk merging between orbit members, which means the 226× Frobenius speedup we assumed in our probability formula was never active.

The numbers speak for themselves:

What we thought: P = 94.5%, past the median, "just unlucky"
Reality: P = 1.28%, only 14% of the way to the median
Search space: We were walking in ORDER (5.19×10³³), not ORDER/226 (2.30×10³¹)
The canonicalization at DP time is mathematically correct (the server handles all Frobenius cases perfectly, as I described before), but it only helps with DP-space birthday collisions — which are negligible at our DP count (expected: 2.6×10⁻¹⁵). The dominant mechanism is walk merging, and that requires Frobenius-invariant iteration.
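The two probabilities quoted above can be related with the usual birthday-style approximation P = 1 - exp(-k^2 / (2N)). That model is an assumption here (the dashboard's exact formula is not shown in the thread), but it reproduces both figures:

```python
import math


def true_probability(p_assumed, class_size):
    """Convert a probability computed for N/class_size into the one for the full N."""
    # Assumed model: P = 1 - exp(-k^2 / (2N)) after k iterations.
    exponent_assumed = -math.log(1.0 - p_assumed)  # k^2 / (2 * N / class_size)
    exponent_true = exponent_assumed / class_size  # same k, space is class_size times larger
    p_true = 1.0 - math.exp(-exponent_true)
    # The median is reached when the exponent equals ln 2; k scales with its square root.
    frac_of_median = math.sqrt(exponent_true / math.log(2.0))
    return p_true, frac_of_median


p_true, frac = true_probability(0.945, 226)
print(f"P = {p_true:.2%}, {frac:.0%} of the way to the median")
```

Note that the factor 226 sits inside the exponent, so in terms of iterations the distance to the median changes by the square root of that factor.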

We're already working on the fix. The plan is to implement canonical lifting at every walk step:

Canonicalize the current point each step (112 squarings in GF(2¹¹³) — cheap in binary fields)
Use canonical x bits for the walk partition
Adjust (a,b) coefficients by λ^i at each canonicalization
Walk from the canonical representative
Cost: ~25-35% overhead per step. Gain: 226× from walk invariance. Net: ~178× speedup.
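The canonicalization idea in the plan above can be sketched on the CPU. This is not the solver's code: it assumes a polynomial-basis representation with the reduction polynomial x^113 + x^9 + 1 (the thread's data is in ONB2), it uses plain integer comparison as the "smallest" ordering, and the (a,b) scaling by λ^i is left out.

```python
M = 113
POLY = (1 << 113) | (1 << 9) | 1  # x^113 + x^9 + 1, assumed polynomial basis (not ONB2)


def gf_mul(a, b):
    """Multiply two GF(2^113) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def canon_x(x):
    """Return (smallest conjugate of x, number of squarings that reaches it)."""
    best, best_i, cur = x, 0, x
    for i in range(1, M):  # 112 squarings visit the whole Frobenius orbit
        cur = gf_mul(cur, cur)
        if cur < best:
            best, best_i = cur, i
    return best, best_i


def partition(x):
    """Walk partition taken from the low 5 bits of the canonical representative."""
    return canon_x(x)[0] & 31
```

Since x and x² share the same orbit, `partition(x)` and `partition(gf_mul(x, x))` agree, which is the invariance the fix is after. The returned index is what a real implementation would use to pick the λ^i factor.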

With this fix, the post-fleet timeline goes from years to weeks — completely changes the game.

You nailed the critical issue. Your analysis was more right than our initial review gave it credit for, and I should have dug deeper before replying. This is exactly why external reviews matter — thank you for pushing on this.

We'll keep you posted on the implementation progress.

Last edited by cjack; 03-15-2026 at 01:34.
Reply With Quote
The Following 2 Users Say Thank You to cjack For This Useful Post:
Abaddon (03-19-2026), Jupiter (03-16-2026)
  #6  
Old 03-15-2026, 03:05
cjack
Hey everyone,

Quick update on the Armadillo ECDSA-113 solver project. We just released v2.0.0 of both server and solver, fixing a critical bug discovered by WhoCares — huge thanks to him for the sharp eye.

What was wrong:

The walk partition function was using raw affine x-coordinate bits to determine the walk step (p = wX.hi & 31). Since the Frobenius endomorphism τ(x,y) = (x², y²) completely changes the bit pattern of x, two points in the same Frobenius orbit would take different walk paths. This means the 226× speedup from Frobenius equivalence classes was completely inactive — the solver was effectively brute-forcing the full 112-bit space instead of the reduced ~104-bit quotient space.

With 344M distinguished points collected under v1.x, the true collision probability was around 1.3%, not the 94.5% displayed on the dashboard. The old DPs were essentially useless for finding collisions through Frobenius orbits.

What v2.0 fixes:

The solver now implements canonical lifting — at every walk step, it finds the lexicographically smallest x-coordinate across all 113 Frobenius conjugates (x, x², x⁴, ..., x^(2¹¹²)), lifts the point to that canonical representative via repeated squaring, and scales the linear combination coefficients by the appropriate power of λ (the Frobenius eigenvalue). The partition function now operates on the canonical x, making the walk Frobenius-invariant.

Key changes:

ec_canon_x(): iterates 112 squarings to find lex-min x in the orbit
λ^i power table (113 entries) precomputed in __constant__ memory
Coefficient scaling via sc_mul128(a, lambda_pow[ci]) at each step
Server collision resolution updated for all 5 Frobenius cases (d=0, ±τ^d, ±τ^d')
Version gating: server rejects all agents < v2.0.0 to prevent mixing old/new DPs
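A rough sketch of the λ power table and coefficient scaling from the list above. It assumes the Frobenius eigenvalue satisfies λ² - λ + 2 ≡ 0 (mod n), which is the characteristic equation for a Koblitz curve with a = 1, and that n is a prime with n % 4 == 3. The function names are illustrative and are not the solver's `sc_mul128` or `lambda_pow`.

```python
def frobenius_eigenvalue(n):
    """Return one root of lam^2 - lam + 2 = 0 (mod n) for a prime n with n % 4 == 3."""
    if n % 4 != 3:
        raise ValueError("this shortcut needs n % 4 == 3")
    d = (-7) % n                 # discriminant of lam^2 - lam + 2
    s = pow(d, (n + 1) // 4, n)  # square root of d, valid only if d is a quadratic residue
    if (s * s - d) % n != 0:
        raise ValueError("-7 is not a square mod n")
    return (1 + s) * pow(2, -1, n) % n


def lambda_powers(lam, n, count=113):
    """Precompute lam^0 .. lam^(count-1) mod n."""
    table = [1]
    for _ in range(count - 1):
        table.append(table[-1] * lam % n)
    return table


def scale_coeffs(a, b, i, table, n):
    """If P = a*G + b*Q, then tau^i(P) = (lam^i * a)*G + (lam^i * b)*Q."""
    return a * table[i] % n, b * table[i] % n
```

The quadratic has two roots and only one of them matches τ on a given curve, so a real implementation would check the candidate against an actual point before building the table.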
Performance: ~377 M iter/s per RTX 5090 (down from ~1050 M/s due to canonicalization overhead), but each iteration now searches an effective space 226× smaller. Net theoretical speedup: ~81×.

Current status:

Server v2.0.0 and solver v2.0.0 are up and running, all workers active and submitting canonical DPs. Fresh start from zero — all old DPs discarded.

@WhoCares: Would you be willing to do another independent verification of the v2 solver+server? I've prepared a review package (source code + test suite + review instructions) — I can send it to you or you can grab it from the dashboard download. Specifically interested in confirming:

The canonical lifting logic is mathematically sound
Coefficient scaling with λ^i is correct
Walk partition is truly Frobenius-invariant
Server collision resolution handles all Frobenius rotation cases
No regression in DP generation/submission pipeline
Your previous analysis was spot-on and saved us from burning through the entire fleet budget on broken walks. Want to make absolutely sure v2 is solid before we spend the next few days waiting for a collision.

Source code and full changelog available on the dashboard. ZIP includes complete sources, build scripts, and test suite.
Reply With Quote
The Following 3 Users Say Thank You to cjack For This Useful Post:
Abaddon (03-19-2026), WhoCares (03-15-2026), wx69wx2023 (03-15-2026)
  #7  
Old 03-15-2026, 10:34
WhoCares
@cjack

To be honest, I didn't catch this manually. It was actually flagged by Gemini within Google Antigravity. I simply audited the code referenced in the report to verify it wasn't a false positive. I ran out of tokens for Gemini Pro, so I was using Gemini Flash yesterday—even the free tier provided surprisingly solid analysis. My takeaway is that it’s definitely worth running project code through different AI agents for peer reviews.

I've already gone through the v2.1.0 client implementation and found no further issues (apart from some possible performance optimizations, such as warp-level Montgomery batch inversion, or early-exit or bit-filtering to eliminate candidates before full squaring in ec_canon_x()). If you'd like me to take a look at the server-side logic, you can reach me at "bugtraq at 163 dot com". Alternatively, you might want to perform a local audit using an AI agent for a quick sanity check.

Thanks.
__________________
AKA Solomon/blowfish.

Last edited by WhoCares; 03-15-2026 at 12:46.
Reply With Quote
  #8  
Old 03-15-2026, 15:17
cjack
Quote:
Originally Posted by WhoCares View Post
@cjack

To be honest, I didn't catch this manually. It was actually flagged by Gemini within Google Antigravity. I simply audited the code referenced in the report to verify it wasn't a false positive. I ran out of tokens for Gemini Pro, so I was using Gemini Flash yesterday—even the free tier provided surprisingly solid analysis. My takeaway is that it’s definitely worth running project code through different AI agents for peer reviews.

I've already gone through the v2.1.0 client implementation and found no further issues (apart from some possible performance optimizations, such as warp-level Montgomery batch inversion, or early-exit or bit-filtering to eliminate candidates before full squaring in ec_canon_x()). If you'd like me to take a look at the server-side logic, you can reach me at "bugtraq at 163 dot com". Alternatively, you might want to perform a local audit using an AI agent for a quick sanity check.

Thanks.
@WhoCares That's a great approach — using multiple AI agents as independent reviewers is something we've fully embraced in this project, after the latest critical bug. The Frobenius invariance bug you flagged was the single most impactful finding: it saved us from burning weeks of GPU time on completely useless DPs. We rebuilt everything from scratch after that (v2.0.0), and then the independent audits caught another subtle bug in the collision resolution logic (missing negation case for d=0). None of these would have been easy to spot manually.
Honestly, we should have started doing independent reviews much earlier in the project — lesson learned. From now on it's standard procedure: every significant change gets reviewed by at least one external AI agent before deployment.
Regarding the perf optimizations you mentioned — we evaluated warp-level batch inversion but decided the risk/complexity wasn't worth it for the marginal gain. The early-exit in ec_canon_x() is interesting but the current 112-squaring loop is already branch-free and register-friendly on sm_120, so we kept it simple. We did implement two other optimizations in v2.1.0 (reusing canon_x from canonicalization + hardware-accelerated scalar multiply with Barrett reduction) which brought us from 377 to 535 M/s per GPU — a clean 1.42x speedup with zero register increase.
Fleet is currently running at 32.5 G/s with 58 workers, ETA ~2 days for the collision. Fingers crossed.
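The ETA can be sanity-checked with the textbook Pollard rho estimate of sqrt(πN/2) expected iterations. The sketch assumes walks merge across full 226-element classes (113 Frobenius conjugates times 2 for negation) and uses the subgroup order posted later in the thread; the function names are illustrative.

```python
import math

ORDER = 5192296858534827627896703833467507  # subgroup order quoted in the thread
CLASS_SIZE = 226                            # assumed: 113 conjugates x 2 for negation


def expected_iterations(order, class_size=1):
    """Textbook rho estimate sqrt(pi * N / 2) with N = order / class_size."""
    return math.sqrt(math.pi * order / (2 * class_size))


def eta_hours(order, class_size, fleet_rate):
    """Expected wall-clock hours at fleet_rate iterations per second."""
    return expected_iterations(order, class_size) / fleet_rate / 3600


print(f"{expected_iterations(ORDER, CLASS_SIZE):.2e} iterations, "
      f"{eta_hours(ORDER, CLASS_SIZE, 32.5e9):.0f} h at 32.5 G/s")
```

Under these assumptions the estimate is about 6 × 10^15 iterations, or roughly two days at 32.5 G/s. Expected work grows with the square root of the search space, so a 226× smaller space corresponds to about 15× fewer iterations.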
I've sent you the server-side sources via PM — looking forward to your analysis. When this is over, everything goes open source on GitHub.
Reply With Quote
  #9  
Old 03-15-2026, 17:05
JMP-JECXZ JMP-JECXZ is offline
Friend
 
Join Date: Mar 2017
Posts: 123
Rept. Given: 0
Rept. Rcvd 5 Times in 4 Posts
Thanks Given: 15
Thanks Rcvd at 150 Times in 69 Posts
JMP-JECXZ Reputation: 5
Is this project fully vibe-coded? It feels like every time a bug is found, days of computing are wasted.
Can't you add a test mode, maybe using this post https://forum.exetools.com/showpost.php?p=111962&postcount=54, to see whether the solver can actually reproduce the result for a known broken cert?
Reply With Quote
The Following User Says Thank You to JMP-JECXZ For This Useful Post:
deepzero (03-15-2026)
  #10  
Old 03-15-2026, 17:21
cjack
@JMP-JECXZ Fair question, and I understand the concern — finding bugs after days of computing is painful (and expensive).

To be honest, you're touching on a real lesson we learned the hard way. The independent review process was introduced only AFTER WhoCares flagged the Frobenius invariance bug that invalidated all our v1.x DPs. Before that, we were too confident in our own code — that was our mistake, plain and simple. Presumption.
Since then, we've put the entire codebase through 4 independent reviews, with 175/175 mathematical tests passing, 100/100 production DPs verified via independent EC scalar multiplication, and all 6 collision resolution formulas tested with synthetic collisions. So only NOW do we have mathematical certainty that the calculations are correct. We should have done this from day one.
Regarding testing with a known key: we already do exactly that. The solver has a --v964 test mode using the Armadillo v9.64 known secret:

k = 1984557253727814641989266002264698

This was used extensively to verify the entire pipeline end-to-end: GPU walk → DP emission → server collection → collision detection → key recovery → serial generation.
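The idea of a known-answer test can be shown on a toy problem. The sketch below is not the `--v964` mode: it runs a small rho walk with Floyd cycle detection in a 509-element subgroup of the integers mod 1019, with a secret chosen for the example, and checks that the recovered value matches.

```python
import random

P, N, G = 1019, 509, 4  # toy group: G has prime order N in Z_P*
KNOWN_K = 321           # "known secret" for the self-test (arbitrary example value)
H = pow(G, KNOWN_K, P)


def solve_dlp(g, h, p, n, seed=1, max_attempts=64):
    rng = random.Random(seed)
    # r-adding walk: 16 precomputed steps m[j] = g^u[j] * h^v[j]
    u = [rng.randrange(n) for _ in range(16)]
    v = [rng.randrange(n) for _ in range(16)]
    m = [pow(g, uj, p) * pow(h, vj, p) % p for uj, vj in zip(u, v)]

    def step(x, a, b):
        j = x % 16
        return x * m[j] % p, (a + u[j]) % n, (b + v[j]) % n

    for _ in range(max_attempts):
        a0, b0 = rng.randrange(n), rng.randrange(n)
        slow = fast = (pow(g, a0, p) * pow(h, b0, p) % p, a0, b0)
        while True:  # Floyd cycle detection: fast moves two steps per slow step
            slow = step(*slow)
            fast = step(*step(*fast))
            if slow[0] == fast[0]:
                break
        db = (slow[2] - fast[2]) % n
        if db:  # usable collision: k * (b1 - b2) = a2 - a1 (mod n)
            return (fast[1] - slow[1]) * pow(db, -1, n) % n
    raise RuntimeError("no usable collision found")


recovered = solve_dlp(G, H, P, N)
assert recovered == KNOWN_K, "pipeline self-test failed"
```

A full pipeline test would replace Floyd's method with distinguished points and a server round trip, but the pass criterion is the same: the recovered key must equal the one planted at the start.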
The current system is in code freeze — no changes until the collision is found. The fleet is running at 32.5 G/s with 99.1% efficiency and 58 workers. At this point, it's pure math.
Thanks for your input — it's great to see the community engaged on this!
Reply With Quote
The Following User Says Thank You to cjack For This Useful Post:
niculaita (03-16-2026)
  #11  
Old 03-18-2026, 15:58
cjack
After 74.5 hours of computation, we successfully extracted the private key for Encryptionizer Certificate #6!!!!!!!!

The collision was found by "Raziel" (RTX 5090) on March 17, 2026 at 22:06 UTC.

The solution is:

k = 0x0000002082DF1821D6E82CEE6880211228

Certificate: #6 of 12 (Encryptionizer / reg3.exe)
BasepointInit: 3254253397 (0xC1F7F755)
Checksum: 0xF4775F5E
Public key (ONB2):
Q.x = 3600264749883462755399490686438491
Q.y = 419754383946908414551514272523181
Curve: GF(2^113) Koblitz (y^2 + xy = x^3 + x^2 + 1)
Subgroup order: 5192296858534827627896703833467507


Statistics:

59 GPU workers (57x RTX 5090 + RTX 4070 + RTX 4070 Ti SUPER)
Peak speed: 32.78 billion iterations/second
210 million distinguished points collected
~7.84 × 10^15 effective iterations
Pollard's Rho with Frobenius canonicalization on GF(2^113) Koblitz curve
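The statistics above imply an average distance between distinguished points, which can be worked out as follows. The `is_distinguished` helper is hypothetical: the solver's actual DP criterion is not stated in the thread, and a low-bits-zero test is only a common choice.

```python
import math

iterations = 7.84e15  # "effective iterations" from the post
dps = 210e6           # distinguished points collected

iters_per_dp = iterations / dps
dp_bits = math.log2(iters_per_dp)  # roughly how many bits a DP test would need to fix
print(f"{iters_per_dp:.3g} iterations per DP, about 2^{dp_bits:.1f}")


def is_distinguished(x, bits=25):
    """Hypothetical DP test: the low `bits` bits of the canonical x are all zero."""
    return x & ((1 << bits) - 1) == 0
```

On these numbers a DP appears about once every 3.7 × 10^7 steps, which is consistent with a criterion of around 25 bits.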
Architecture: Custom CUDA solver running on a distributed fleet coordinated by a central Python/FastAPI server. The solver uses table-free GF(2^113) arithmetic, Lopez-Dahab projective coordinates, and per-step Itoh-Tsujii affine normalization — achieving 96 registers, 0 spill on sm_120 (RTX 5090).

Special thanks to WhoCares whose independent audit identified a critical missing optimization (the negation map), which led to the development of v2.2.0 — our "GOLDEN" build that will deliver an additional 41% algorithmic speedup on future certificates. His RTX 4070 also contributed to the search alongside the main fleet.
We will be publishing the complete source code (GPU solver, CPU solver, and distributed server) on GitHub shortly. This includes the full v2.2.0-GOLDEN solver with negation map and fruitless cycle escape, ready for future ECDSA-113 targets.

The journey from a broken 3.5 G/s solver to a production-grade 32.78 G/s distributed system was quite a ride. Every bug caught made the code stronger. Every optimization attempt taught us something about the hardware.
Reply With Quote
The Following 2 Users Gave Reputation+1 to cjack For This Useful Post:
niculaita (03-18-2026), WhoCares (03-19-2026)
The Following 7 Users Say Thank You to cjack For This Useful Post:
Abaddon (03-19-2026), DARKER (03-18-2026), Jupiter (03-18-2026), niculaita (03-18-2026), nulli (03-18-2026), WhoCares (03-18-2026), wx69wx2023 (03-18-2026)
  #12  
Old 03-19-2026, 12:30
WhoCares
Armadillo parameter verification scripts, including:

Elliptic curve base point generation (based on code from mrexodia’s Armadillo Key Tool).
Solving for the basis transformation matrix between PB (Polynomial Basis) and ONB2 (Type-2 Optimal Normal Basis), and transforming coordinates between PB and ONB2.
Curve equation validation (trivial task).
Subgroup order verification (trivial task).
Public/private key consistency checks (trivial task).

https://github.com/z16166/ArmadilloEcdlpVerify (for fun)

Quote:
Originally Posted by cjack View Post
The solution is:

k = 0x0000002082DF1821D6E82CEE6880211228

Certificate: #6 of 12 (Encryptionizer / reg3.exe)
BasepointInit: 3254253397 (0xC1F7F755)
Checksum: 0xF4775F5E
Public key (ONB2):
Q.x = 3600264749883462755399490686438491
Q.y = 419754383946908414551514272523181
Curve: GF(2^113) Koblitz (y^2 + xy = x^3 + x^2 + 1)
Subgroup order: 5192296858534827627896703833467507
__________________
AKA Solomon/blowfish.

Last edited by WhoCares; 03-19-2026 at 13:15.
Reply With Quote
The Following User Gave Reputation+1 to WhoCares For This Useful Post:
cjack (03-19-2026)
The Following 5 Users Say Thank You to WhoCares For This Useful Post:
blue_devil (03-20-2026), cjack (03-19-2026), cybercoder (03-21-2026), niculaita (03-19-2026), tonyweb (04-06-2026)
  #13  
Old 03-18-2026, 16:00
WhoCares
Perhaps we'd better upgrade CUDA toolkit from 12.x to 13.1.

For learning purposes, I asked an AI to optimize the GPU kernel function pollard_kernel(), mainly targeting the NVIDIA GeForce RTX 5090.

The optimization goal was to reduce register usage from 96 registers to 64 registers. This increases SM occupancy, allowing the number of blocks that can run concurrently on a single SM to increase from 5 to 8, yielding a theoretical performance improvement of around one third.

The actual performance gain should be evaluated using NVIDIA Nsight Compute together with real benchmark data.

By leveraging the SMRS compiler feature introduced in NVIDIA CUDA Toolkit 13.0, spilled registers can be replaced with accesses to shared memory, making it possible to ultimately achieve the 64-register optimization target.
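The 5 and 8 blocks per SM can be reproduced with a simple register-budget calculation. The block size of 128 threads, the 65536-register file and the 1536-thread cap per SM are assumptions made for this sketch; shared-memory limits and register allocation granularity are ignored.

```python
def blocks_per_sm(regs_per_thread, threads_per_block=128,
                  regs_per_sm=65536, max_threads_per_sm=1536):
    """Resident blocks per SM, limited by the register file and by the thread cap."""
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_threads = max_threads_per_sm // threads_per_block
    return min(by_regs, by_threads)


for regs in (96, 64):
    print(regs, "registers ->", blocks_per_sm(regs), "blocks per SM")
```

Whether the extra resident blocks translate into throughput depends on what the kernel is bound by, which is why profiling is still needed.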

Quote:
ptxas info : Compiling entry function '_Z14pollard_kernelP6worm_tP4dp_tPjjiyiyi' for 'sm_120'
ptxas info : Function properties for _Z14pollard_kernelP6worm_tP4dp_tPjjiyiyi
200 bytes stack frame, -36 bytes spill stores, -28 bytes spill loads
ptxas info : Used 64 registers, used 1 barriers, 200 bytes cumulative stack size, 7168 bytes smem

ptxas info : Compile time = 0.000 ms
ptxas info : Function properties for _Z10ec_canon_x4fe_t
0 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _Z10ld_madd_z1RK4fe_tS1_S1_S1_RS_S2_S2_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z10reset_wormiyR14worm_context_tP6walk_t
0 bytes stack frame, 124 bytes spill stores, 120 bytes spill loads
ptxas info : Function properties for _Z12iterate_stepiyR14worm_context_tjP4dp_tPjiP6walk_ty
0 bytes stack frame, 32 bytes spill stores, 16 bytes spill loads
ptxas info : Function properties for _Z12prepare_stepR4fe_tS0_R4sc_tS2_
0 bytes stack frame, 0 bytes spill stores, 8 bytes spill loads
ptxas info : Function properties for _Z6fe_inv4fe_t
0 bytes stack frame, 0 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _Z9record_dpRK4fe_tRK4sc_tS4_P4dp_tPji
0 bytes stack frame, 0 bytes spill stores, 8 bytes spill loads
tmpxft_00006c68_00000000-7_solver_fast.compute_120.cudafe1.cpp
[100%] Linking CUDA executable solver_fast.exe
[100%] Built target solver_fast
Attached Files
File Type: zip GPU_Register_Usage_Optimization_Walkthrough.zip (3.1 KB, 12 views)
__________________
AKA Solomon/blowfish.

Last edited by WhoCares; 03-18-2026 at 16:12.
Reply With Quote
The Following User Gave Reputation+1 to WhoCares For This Useful Post:
cjack (03-18-2026)
The Following 4 Users Say Thank You to WhoCares For This Useful Post:
cjack (03-18-2026), niculaita (03-19-2026), nulli (03-20-2026), wx69wx2023 (03-18-2026)
  #14  
Old 03-18-2026, 19:08
cjack
Quote:
Originally Posted by WhoCares View Post
Perhaps we'd better upgrade CUDA toolkit from 12.x to 13.1.

For learning purposes, I asked an AI to optimize the GPU kernel function pollard_kernel(), mainly targeting the NVIDIA GeForce RTX 5090.

The optimization goal was to reduce register usage from 96 registers to 64 registers. This increases SM occupancy, allowing the number of blocks that can run concurrently on a single SM to increase from 5 to 8, yielding a theoretical performance improvement of around one third.

The actual performance gain should be evaluated using NVIDIA Nsight Compute together with real benchmark data.

By leveraging the SMRS compiler feature introduced in NVIDIA CUDA Toolkit 13.0, spilled registers can be replaced with accesses to shared memory, making it possible to ultimately achieve the 64-register optimization target.
Hey WhoCares,

Thanks for the detailed optimization work!
About the register optimization proposal — we actually went down this exact rabbit hole. Here's what we found on real silicon:

The kernel is throughput-bound, not latency-bound. We built and benchmarked an optimization called "Thunderstrike" (OPT-3) that used ILP to fuse operations and improve parallelism. Result on RTX 5090: 0% speedup. The ALU pipeline is already saturated by the 112 sequential GF(2^113) squarings in ec_canon_x (40% of step cost) and the 8 multiplications in fe_inv via Itoh-Tsujii (50% of step cost). More warps via higher occupancy just queue up behind the same ALU — there are no idle cycles to fill.

A few specific concerns with the 64-register approach:

fe_mul alone needs ~80 registers (table-free XOR accumulation across 113-bit field elements). Forcing 64 via __launch_bounds__ guarantees massive spills. Even with __noinline__ on hot functions like fe_inv and ec_canon_x, the call overhead and lost register context hurt throughput on the critical path.

CUDA 13.x / SMRS: we'd love to test it!

The ptxas output shows negative spill values (-36 bytes spill stores, -28 bytes spill loads). Negative spills are unusual and suggest the compiler is reporting redirected spills rather than actual elimination. Without real Nsight Compute profiling data, it's hard to confirm whether this translates to actual throughput gains.

What we WILL adopt from your proposals for the next certificate:

Single-pass DP retrieval (clean simplification, ~1-2%)
cudaOccupancyMaxActiveBlocksPerMultiprocessor for self-tuning grid size
Benchmarking L1 cache vs shared memory for the walk table (your bank conflict analysis was spot-on theoretically)
Bottom line: fe_inv and ec_canon_x consume 90% of the step cost and are algorithmically irreducible for Koblitz curve canonicalization. No amount of occupancy optimization can reduce these costs.
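Taking the 90% figure at face value, Amdahl's law gives the ceiling for any optimization that only touches the remaining 10% of the step. This is a generic illustration, not a measurement from the solver.

```python
def amdahl_speedup(fixed_fraction, speedup_of_rest):
    """Overall speedup when only the non-fixed part of the step gets faster."""
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / speedup_of_rest)


print(amdahl_speedup(0.90, 2.0))           # remaining 10% made twice as fast
print(amdahl_speedup(0.90, float("inf")))  # remaining 10% made free
```

Even if everything outside fe_inv and ec_canon_x cost nothing, the step would be only about 11% faster under this assumption.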
Reply With Quote
The Following 2 Users Say Thank You to cjack For This Useful Post:
niculaita (03-18-2026), nulli (03-20-2026)
  #15  
Old 03-18-2026, 18:55
cjack
Hey everyone,

After weeks of work, we've reached the finish line. The ECDLP on the Armadillo Koblitz curve GF(2^113) has been solved, and we've successfully generated a working registration key for the original, unmodified target: Encryptionizer NEP.
It's a great transparent data encryption product for Windows (an old version).

The Target:
NetLib Encryptionizer Platform for Server, Version 2017.1216.31129.1

Download:
https://mega.nz/file/M8oRXBBK#OSW0ZKq963cY92MhHh9xAxOYHMtlIaKNMqUfo9DJpD4

Registration Data:

Serial Number: 7375-4869-9685-9586a
Name: exetools
Key: 111HTT-TGQ6JH-HVEZF7-U36QEA-UFJD05-QHM5RJ-YEP3DZ-TNZZ77-2UGEKK-KW8A1X-5M95WE


Built a keygen based on AKT sources and verified it against the original Armadillo v9.66 protection of Encryptionizer.
This was a team effort. Special thanks to WhoCares whose vulnerability report identified the missing negation map optimization, giving us a √2 algorithmic speedup that made the difference.

Cheers,
CrackerJack & Claude
Reply With Quote
The Following User Says Thank You to cjack For This Useful Post:
niculaita (03-18-2026)

Tags
bolero, ecdlp

All times are GMT +8.