Exetools - View Single Post - Armadillo ECDSA-113

WhoCares · #**106** 03-12-2026, 09:06

@cjack

Performance optimizations suggested by the AI agent (Claude Opus 4.6 Thinking):

Core conclusion: The two highest-priority optimizations together can provide roughly a 3–4× speedup:

Montgomery batch inversion (P0) — A prototype already exists in pollard_rho.cuh, but it uses an old data structure. It needs to be ported to the fe_t architecture used in solver_fast.cu. This can reduce the per-step cost from 15M + 120S to approximately 6M + 6S.

CUDA Stream double buffering + pinned memory (P0) — The current workflow (kernel → sync → D2H → CPU processing) is strictly serial. Using dual streams with a ping-pong buffer allows GPU computation to fully overlap with CPU/network processing. ��

And I ask AI to code a Python script to run and upgrade "slover_fast.exe" automatically:
https://github.com/z16166/PySolverLauncher/