Exetools  

  #1  
03-12-2026, 16:13
cjack
Quote:
Originally Posted by WhoCares
@cjack

Performance optimizations suggested by the AI agent (Claude Opus 4.6 Thinking):

Core conclusion: The two highest-priority optimizations together can provide roughly a 3–4× speedup:

Montgomery batch inversion (P0) — A prototype already exists in pollard_rho.cuh, but it uses an old data structure. It needs to be ported to the fe_t architecture used in solver_fast.cu. This can reduce the per-step cost from 15M + 120S to approximately 6M + 6S.

CUDA Stream double buffering + pinned memory (P0): The current workflow (kernel → sync → D2H → CPU processing) is strictly serial. Using dual streams with a ping-pong buffer allows GPU computation to fully overlap with CPU/network processing.

I also asked the AI to write a Python script that runs and upgrades "solver_fast.exe" automatically:
https://github.com/z16166/PySolverLauncher/
@WhoCares

Thanks for the deep analysis and the PySolverLauncher — really appreciate you putting time into this!

Let me give some context on the current state, since the AI reviewed v1.3.0 but we're now on v1.6.0 with several things already addressed:

CUDA Streams / GPU overlap — Already solved in v1.6.0. We implemented an async DP sender pipeline (background thread handles all HTTP while the main thread immediately relaunches the kernel). Measured GPU utilization: 100%, power draw 502W/575W on the RTX 5090. The double-buffering approach from the analysis would add <0.5% on top of what we already have.
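To make the async sender idea concrete, here is a minimal Python sketch of the pattern described above: the main loop hands each batch of distinguished points to a background thread and immediately starts the next round. The names `run_pipeline`, `launch_kernel` and `send_batch` are hypothetical stand-ins, not the solver's actual functions, and the real implementation lives in the CUDA/C++ code.

```python
import queue
import threading

def run_pipeline(launch_kernel, send_batch, rounds):
    """Run `rounds` compute rounds; uploads happen on a background thread."""
    pending = queue.Queue()

    def sender():
        # Drain batches in FIFO order until the None sentinel arrives.
        while True:
            batch = pending.get()
            if batch is None:
                break
            send_batch(batch)  # slow HTTP work stays off the main thread

    worker = threading.Thread(target=sender, daemon=True)
    worker.start()
    for i in range(rounds):
        batch = launch_kernel(i)  # stand-in for kernel launch + result copy
        pending.put(batch)        # returns immediately; next round starts now
    pending.put(None)             # tell the sender to finish
    worker.join()
```

With a single sender thread the batches arrive in the order they were produced, so the server-side bookkeeping does not change.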

Montgomery batch inversion: This is the one genuinely interesting suggestion. The per-step Itoh-Tsujii inversion IS the main cost (8M+116S out of 15M+120S per step, where M is a field multiplication and S a field squaring). Batch inversion could amortize it across 128 threads. However, it requires 14x __syncthreads() per step (currently we have ZERO sync across 2048 steps), plus shared memory for the product tree, plus extra registers. Our realistic estimate is 1.5-2x speedup, not 3x. Worth exploring after the current run.
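For readers who have not seen the trick, here is a minimal Python sketch of Montgomery batch inversion. It uses integers modulo a prime as a stand-in field purely for readability (the solver's fe_t arithmetic is different), and `batch_inverse` is an illustrative name, not a function from the project. The point is that n inversions collapse into one inversion plus about 3(n-1) multiplications.

```python
def batch_inverse(values, p):
    """Invert every nonzero element of `values` modulo prime p
    using a single modular inversion (Montgomery's trick)."""
    n = len(values)
    # prefix[i] = values[0] * ... * values[i-1] (mod p)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p
    # The only inversion: inverse of the product of all inputs.
    inv_running = pow(prefix[n], -1, p)
    out = [0] * n
    for i in range(n - 1, -1, -1):
        # inv_running == (values[0] * ... * values[i])^-1 at this point
        out[i] = inv_running * prefix[i] % p
        inv_running = inv_running * values[i] % p
    return out
```

On a GPU the prefix products have to be shared between the threads of a block, which is where the extra synchronization and shared memory mentioned above come from.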

Comb w=5/6 for fe_mul — We actually tested this. Wider comb = more registers = less occupancy. Our history: comb table fe_mul gave 198 registers (16.7% occupancy), while the current table-free approach uses 80 registers (50% occupancy) and was 2.4x faster in practice. Occupancy wins over per-operation speed on GPU.
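The occupancy figures above can be sanity-checked with a simplified register-limited model. This is only a sketch: the per-SM limits (65,536 registers, 1,536 resident threads), the 128-thread block size and the round-up-to-8 register granularity are assumptions on my part, not values stated in the thread, and `occupancy` is a hypothetical helper. Under those assumptions the model gives about 16.7% for 198 registers and 50% for 80 registers.

```python
import math

def occupancy(regs_per_thread, block_threads=128, regs_per_sm=65536,
              max_threads_per_sm=1536, warp_size=32, reg_granularity=8):
    """Register-limited occupancy estimate (all hardware limits are assumed)."""
    # Registers per thread are rounded up to the allocation granularity.
    regs = math.ceil(regs_per_thread / reg_granularity) * reg_granularity
    warps_by_regs = regs_per_sm // (regs * warp_size)
    warps_per_block = math.ceil(block_threads / warp_size)
    # Whole blocks only: limited by registers and by the thread cap.
    blocks = min(warps_by_regs // warps_per_block,
                 max_threads_per_sm // block_threads)
    return blocks * block_threads / max_threads_per_sm

print(f"198 regs: {occupancy(198):.1%}")
print(f" 80 regs: {occupancy(80):.1%}")
```

The model ignores shared memory and block-count limits, so treat it as a rough check rather than a replacement for the profiler.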

Important note about per-step normalization: the analysis calls it "the biggest bottleneck" — true, but it's mathematically required. Without it, walks never merge (we learned this the hard way — our old 3.5 G/s benchmark was invalid because of this). It can only be amortized (batch inversion), not removed.
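A small Python illustration of why normalization cannot simply be dropped, assuming the walk keeps points in a projective form where the affine coordinate is x = X/Z. A prime modulus stands in for the real field, and `normalize` and `branch` are hypothetical helpers, not project code. Two walks can hold the same point under different (X, Z) pairs, so any decision made on the raw X value differs between them.

```python
P = 2**61 - 1  # stand-in prime field, for illustration only

def normalize(X, Z, p=P):
    """Canonical affine coordinate x = X / Z (costs one inversion)."""
    return X * pow(Z, -1, p) % p

def branch(value, partitions=16):
    """Hypothetical step selector: picks the next-step rule from a coordinate."""
    return value % partitions

x = 123456789
walk_a = (x * 5 % P, 5)  # same point, scaled by Z = 5
walk_b = (x * 9 % P, 9)  # same point, scaled by Z = 9

# Raw X values disagree, so the two walks would take different next steps.
print(branch(walk_a[0]), branch(walk_b[0]))
# After normalization both walks see the same x and follow the same path.
print(branch(normalize(*walk_a)), branch(normalize(*walk_b)))
```

Once two walks choose different steps from the same point they never merge, and no collision is ever reported.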

Your PySolverLauncher: we've bundled it into the official ArmadilloSolver.zip on the dashboard! The /api/download-info endpoint was already there, so it works out of the box. You're credited in the changelog. Thanks for the contribution!

Current status: 42 G/s fleet, 101M DPs, 22% probability, all verified end-to-end. Just waiting for the birthday paradox to do its thing!
  #2  
03-12-2026, 16:52
WhoCares
@cjack

The AI's analysis is based on the 1.6.0 code. I just extract each new version of the exe and the source code into the same directory, whose name still contains 1.3.0, overwriting the old version.

You forgot to update the ZIP SHA1 returned by the download API. The latest is B5A021ADE2C88548EB511120C17470F0D00FBB5C, not 0306102b37f2a102d4d8376c3dc806ce059c9597.
The Python script uses this SHA1 as the version number. The real version number "v1.6.0" is hardcoded in the exe, and it's not easy to extract.
__________________
AKA Solomon/blowfish.

Last edited by WhoCares; 03-12-2026 at 17:03.
  #3  
03-12-2026, 17:36
cjack
Quote:
Originally Posted by WhoCares
@cjack

The AI's analysis is based on the 1.6.0 code. I just extract each new version of the exe and the source code into the same directory, whose name still contains 1.3.0, overwriting the old version.

You forgot to update the ZIP SHA1 returned by the download API. The latest is B5A021ADE2C88548EB511120C17470F0D00FBB5C, not 0306102b37f2a102d4d8376c3dc806ce059c9597.
The Python script uses this SHA1 as the version number. The real version number "v1.6.0" is hardcoded in the exe, and it's not easy to extract.
@WhoCares

Good catch on the stale SHA1! It was cached at server startup and never refreshed after we updated the ZIP (we added your launcher script + updated changelog).

Fixed. The API now returns the correct hash. I also implemented automatic mtime-based detection: whenever the ZIP file changes on disk, /api/download-info recomputes SHA1 and size on the fly. No more stale hashes, no server restart needed.
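A minimal Python sketch of the mtime-based refresh described above, for anyone running a similar endpoint. The class name `ZipInfoCache` and its layout are hypothetical; the actual server code is not shown in this thread and may work differently. The hash is only recomputed when the file's modification time changes.

```python
import hashlib
import os

class ZipInfoCache:
    """Cache SHA1 and size of a file, recomputing when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._info = None

    def get(self):
        st = os.stat(self.path)
        if self._info is None or st.st_mtime_ns != self._mtime_ns:
            h = hashlib.sha1()
            with open(self.path, "rb") as f:
                # Hash in 1 MiB chunks so large ZIPs are not loaded at once.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            self._mtime_ns = st.st_mtime_ns
            self._info = {
                "available": True,
                "sha1": h.hexdigest(),
                "size": st.st_size,
                "filename": os.path.basename(self.path),
            }
        return self._info
```

Each request then costs one `os.stat` call in the common case, and a replaced ZIP is picked up on the next request without a restart.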

The API response now also includes a version field:

{
  "available": true,
  "sha1": "c7c267adca36c7e2ddac0f4b3bf37100f88ef033",
  "size": 1018086,
  "filename": "ArmadilloSolver.zip",
  "version": "1.6.0"
}

This is fully backward compatible — your existing launcher works without any changes. The version field is just extra info you can optionally use if you want to display a human-readable version instead of the SHA1.
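On the client side, using the optional field could look like the sketch below. The helpers `needs_update` and `display_version` are hypothetical and not taken from PySolverLauncher. The SHA1 comparison is case-insensitive because hashes show up in both upper and lower case in this thread.

```python
import json

def needs_update(info, local_sha1):
    """True when the server ZIP differs from the locally recorded SHA1."""
    return bool(info.get("available")) and info["sha1"].lower() != local_sha1.lower()

def display_version(info):
    """Prefer the human-readable version; fall back to a short SHA1 prefix
    when the server response has no version field."""
    return info.get("version") or info["sha1"][:8]

response = json.loads("""
{
  "available": true,
  "sha1": "c7c267adca36c7e2ddac0f4b3bf37100f88ef033",
  "size": 1018086,
  "filename": "ArmadilloSolver.zip",
  "version": "1.6.0"
}
""")

print(display_version(response))
print(needs_update(response, "B5A021ADE2C88548EB511120C17470F0D00FBB5C"))
```

A launcher that ignores the version field keeps working unchanged, since the SHA1 is still what decides whether to download.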

We also added VERSION.txt and a README.txt with quick start instructions inside the ZIP, so anyone extracting it knows exactly which version they have.

Thanks for keeping an eye on things — your feedback makes the project better!

Flash status update:

42 active workers, fleet speed 41.7 G/s
104M distinguished points collected
Current probability: 23.4%
Median ETA: ~17 hours (Mar 13, ~03:00 UTC+1)
0 collisions so far — the hunt continues!
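For anyone wondering how a probability figure like this relates to the amount of work done, here is a sketch using the standard birthday-style approximation P ≈ 1 - exp(-T²/(2n)) for T total steps in a group of order n. This is a generic model, not necessarily the formula the dashboard uses, and the function names and the example group order are hypothetical.

```python
import math

def collision_probability(total_steps, group_order):
    """Approximate chance of at least one collision after total_steps steps."""
    return -math.expm1(-total_steps**2 / (2.0 * group_order))

def steps_for_probability(prob, group_order):
    """Inverse of the above: steps needed to reach a target probability."""
    return math.sqrt(-2.0 * group_order * math.log1p(-prob))

n = 2**40  # toy group order, chosen only for the example
median_steps = steps_for_probability(0.5, n)
print(f"median work: {median_steps:.3e} steps")
print(f"check: {collision_probability(median_steps, n):.3f}")
```

Because the exponent grows with the square of the work, the probability climbs faster as the run goes on.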

Tags
bolero, ecdlp

All times are GMT +8.