  #106  
03-12-2026, 09:06
WhoCares
who cares
 
@cjack

Performance optimizations suggested by the AI agent (Claude Opus 4.6 Thinking):

Core conclusion: The two highest-priority optimizations together can provide roughly a 3–4× speedup:

Montgomery batch inversion (P0) — A prototype already exists in pollard_rho.cuh, but it uses an old data structure. It needs to be ported to the fe_t architecture used in solver_fast.cu. This can reduce the per-step cost from 15M + 120S to approximately 6M + 6S.
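For anyone who has not seen the trick, here is a minimal C++ sketch of Montgomery batch inversion. It is not the code from pollard_rho.cuh or solver_fast.cu: the small prime P, the 64-bit element type, and the helper names mul_mod, inv_mod and batch_invert are all assumptions made for illustration, with a plain integer standing in for the real 256-bit fe_t. The point is only the shape of the algorithm: n inversions are replaced by one inversion plus 3(n-1) multiplications.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime modulus (assumption): small enough that a * b fits in 64 bits.
// The real solver would use its own 256-bit field type instead.
constexpr std::uint64_t P = 1000000007ULL;

std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) {
    return (a % P) * (b % P) % P;
}

// Single inversion via Fermat's little theorem: a^(P-2) mod P.
// This is the expensive operation that batching amortizes.
std::uint64_t inv_mod(std::uint64_t a) {
    std::uint64_t result = 1, base = a % P, e = P - 2;
    while (e > 0) {
        if (e & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return result;
}

// Montgomery's trick: invert every element of v in place using
// one inv_mod call and 3(n-1) multiplications.
// Precondition: every element is nonzero mod P.
void batch_invert(std::vector<std::uint64_t>& v) {
    const std::size_t n = v.size();
    if (n == 0) return;

    // prefix[i] = v[0] * v[1] * ... * v[i]
    std::vector<std::uint64_t> prefix(n);
    assert(v[0] % P != 0);
    prefix[0] = v[0] % P;
    for (std::size_t i = 1; i < n; ++i) {
        assert(v[i] % P != 0);
        prefix[i] = mul_mod(prefix[i - 1], v[i]);
    }

    // acc starts as the inverse of the full product.
    std::uint64_t acc = inv_mod(prefix[n - 1]);

    // Walk backwards, peeling off one element per step.
    for (std::size_t i = n - 1; i > 0; --i) {
        const std::uint64_t vi = v[i];
        v[i] = mul_mod(acc, prefix[i - 1]);  // inverse of v[i]
        acc = mul_mod(acc, vi);              // inverse of v[0] * ... * v[i-1]
    }
    v[0] = acc;
}
```

Calling batch_invert on {2, 3, 5} leaves each slot holding the inverse of the original value mod P. On a GPU the same pattern would typically run over the batch of points a thread or block advances together, so the per-point share of the single inversion shrinks as the batch grows.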

CUDA Stream double buffering + pinned memory (P0): The current workflow (kernel → sync → D2H → CPU processing) is strictly serial. Using dual streams with a ping-pong buffer allows GPU computation to fully overlap with CPU/network processing.
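The ping-pong scheduling itself can be shown without a GPU. The sketch below is a host-only C++ analogy, not CUDA code and not the solver's implementation: std::async stands in for "launch kernel and copy results back on a stream", two std::vector buffers stand in for the pinned host buffers, and produce_batch, consume_batch and run_ping_pong are hypothetical names. In a real port, the buffers would come from cudaMallocHost, the producer step would be a kernel launch followed by cudaMemcpyAsync on one of the two streams, and the wait would be a cudaStreamSynchronize on that stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for "GPU computes a batch and copies it into a host buffer".
void produce_batch(std::vector<std::uint64_t>& buf, std::uint64_t batch) {
    for (std::size_t i = 0; i < buf.size(); ++i) {
        buf[i] = batch * buf.size() + i;
    }
}

// Stand-in for the CPU/network processing of one finished batch.
std::uint64_t consume_batch(const std::vector<std::uint64_t>& buf) {
    return std::accumulate(buf.begin(), buf.end(), std::uint64_t{0});
}

// Ping-pong loop: while the host consumes buffer `cur`,
// the producer is already filling buffer `next`.
std::uint64_t run_ping_pong(std::size_t batches, std::size_t batch_size) {
    if (batches == 0) return 0;

    std::vector<std::uint64_t> buffers[2] = {
        std::vector<std::uint64_t>(batch_size),
        std::vector<std::uint64_t>(batch_size)};
    std::uint64_t total = 0;

    // Prime the pipeline with batch 0 in buffer 0.
    std::future<void> pending = std::async(
        std::launch::async, produce_batch, std::ref(buffers[0]), std::uint64_t{0});

    for (std::size_t b = 0; b < batches; ++b) {
        assert(pending.valid());
        pending.get();  // wait until batch b is complete

        const std::size_t cur = b % 2;
        const std::size_t next = 1 - cur;

        // Start batch b+1 in the other buffer before touching batch b.
        // That buffer is free: its previous contents were consumed last iteration.
        if (b + 1 < batches) {
            pending = std::async(std::launch::async, produce_batch,
                                 std::ref(buffers[next]),
                                 static_cast<std::uint64_t>(b + 1));
        }

        total += consume_batch(buffers[cur]);  // overlaps with production of b+1
    }
    return total;
}
```

The ordering inside the loop is what removes the serial kernel → sync → D2H → CPU chain: the next batch is queued before the current one is processed, so the only time the host waits is when the producer is the slower side.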

I also asked the AI to write a Python script that runs and upgrades "solver_fast.exe" automatically:
https://github.com/z16166/PySolverLauncher/
Attached Files
File Type: rar performance_analysis_en.rar (6.7 KB, 6 views)
__________________
AKA Solomon/blowfish.

Last edited by WhoCares; 03-12-2026 at 15:03.