|
@cjack
Performance optimizations suggested by the AI agent (Claude Opus 4.6 Thinking):
Core conclusion: The two highest-priority optimizations together can provide roughly a 3–4× speedup:
Montgomery batch inversion (P0): A prototype already exists in pollard_rho.cuh, but it uses an older data structure and needs to be ported to the fe_t architecture used in solver_fast.cu. This can reduce the per-step cost from 15M + 120S to approximately 6M + 6S.
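For anyone who hasn't seen the trick: below is a minimal host-side sketch of Montgomery batch inversion, not the code from pollard_rho.cuh. It assumes a toy 30-bit prime field so it fits in plain integers; the names P, mul_mod, inv_mod and batch_invert are made up for illustration, and a real port would operate on fe_t elements instead. The idea is that n inversions are replaced by one inversion plus roughly 3n multiplications.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime modulus (assumption for illustration only); the real solver
// would work on 256-bit field elements.
constexpr std::uint64_t P = 1000000007ULL;

// Modular multiplication; inputs must already be reduced below P.
std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) { return (a * b) % P; }

// One "expensive" inversion via Fermat's little theorem: a^(P-2) mod P.
std::uint64_t inv_mod(std::uint64_t a) {
    std::uint64_t result = 1, base = a % P, e = P - 2;
    while (e > 0) {
        if (e & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return result;
}

// Replaces every element of v with its modular inverse using a single
// inv_mod call. Precondition: every element is in [1, P-1].
void batch_invert(std::vector<std::uint64_t>& v) {
    const std::size_t n = v.size();
    if (n == 0) return;

    // Forward pass: prefix[i] = v[0] * ... * v[i-1] (empty product = 1).
    std::vector<std::uint64_t> prefix(n);
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        prefix[i] = acc;
        acc = mul_mod(acc, v[i]);
    }
    assert(acc != 0 && "a zero element would make the whole batch non-invertible");

    // The only inversion: inverse of the product of all elements.
    std::uint64_t inv = inv_mod(acc);

    // Backward pass: peel off one element at a time.
    for (std::size_t i = n; i-- > 0;) {
        const std::uint64_t vi = v[i];
        v[i] = mul_mod(inv, prefix[i]);  // inverse of the original v[i]
        inv = mul_mod(inv, vi);          // now the inverse of v[0] * ... * v[i-1]
    }
}
```

Calling batch_invert on {2, 3, 5} leaves each slot holding the inverse of the value that was there before, so multiplying old and new values gives 1 mod P. In a GPU port the batch would presumably be the set of points a thread (or block) steps in one iteration, but that layout is a guess on my part.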
CUDA Stream double buffering + pinned memory (P0): The current workflow (kernel → sync → D2H → CPU processing) is strictly serial. Using dual streams with a ping-pong buffer allows GPU computation to fully overlap with CPU/network processing.
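To show the ping-pong scheduling itself, here is a host-only analogy in plain C++, not actual CUDA code and not what solver_fast.cu does. std::async stands in for "launch the kernel and copy the results back asynchronously", and all function names (produce_batch, consume_batch, run_pipeline) are hypothetical. A real version would presumably use two streams from cudaStreamCreate, pinned buffers from cudaMallocHost, cudaMemcpyAsync for the D2H copy, and cudaStreamSynchronize where this sketch calls get().

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for "kernel + async D2H copy": fills buf with the results of one batch.
void produce_batch(std::vector<std::uint32_t>& buf, std::uint32_t batch) {
    for (std::size_t i = 0; i < buf.size(); ++i)
        buf[i] = batch * 1000u + static_cast<std::uint32_t>(i);
}

// Stand-in for the CPU/network processing of a finished batch.
std::uint64_t consume_batch(const std::vector<std::uint32_t>& buf) {
    return std::accumulate(buf.begin(), buf.end(), std::uint64_t{0});
}

// Runs num_batches batches through two buffers so that producing batch b+1
// overlaps with consuming batch b.
std::uint64_t run_pipeline(std::uint32_t num_batches, std::size_t batch_size) {
    std::array<std::vector<std::uint32_t>, 2> buffers{
        std::vector<std::uint32_t>(batch_size),
        std::vector<std::uint32_t>(batch_size)};
    std::uint64_t total = 0;
    if (num_batches == 0) return total;

    // Prime the pipeline: batch 0 goes into buffer 0.
    std::future<void> in_flight = std::async(
        std::launch::async, [&buffers] { produce_batch(buffers[0], 0u); });

    for (std::uint32_t b = 0; b < num_batches; ++b) {
        in_flight.get();  // wait until batch b has landed in its buffer
        const std::size_t cur = b % 2;
        if (b + 1 < num_batches) {
            const std::uint32_t next = b + 1;
            // Start batch b+1 in the other buffer before processing batch b.
            in_flight = std::async(std::launch::async, [&buffers, next] {
                produce_batch(buffers[next % 2], next);
            });
        }
        total += consume_batch(buffers[cur]);  // overlaps with production of b+1
    }
    return total;
}
```

The point is that the consumer only ever touches the buffer that is not currently being filled, so the only wait is the get() at the top of each iteration. The serial workflow corresponds to calling produce_batch and consume_batch back to back on a single buffer.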
I also asked the AI to write a Python script that runs and upgrades "solver_fast.exe" automatically:
https://github.com/z16166/PySolverLauncher/
__________________
AKA Solomon/blowfish.
Last edited by WhoCares; 03-12-2026 at 15:03.
|