Quote:
Originally Posted by WhoCares
@cjack
Performance optimizations suggested by the AI agent (Claude Opus 4.6 Thinking):
Core conclusion: The two highest-priority optimizations together can provide roughly a 3–4× speedup:
Montgomery batch inversion (P0) — A prototype already exists in pollard_rho.cuh, but it uses an old data structure. It needs to be ported to the fe_t architecture used in solver_fast.cu. This can reduce the per-step cost from 15M + 120S to approximately 6M + 6S.
CUDA Stream double buffering + pinned memory (P0) — The current workflow (kernel → sync → D2H → CPU processing) is strictly serial. Using dual streams with a ping-pong buffer allows GPU computation to fully overlap with CPU/network processing. ������
And I ask AI to code a Python script to run and upgrade "slover_fast.exe" automatically:
https://github.com/z16166/PySolverLauncher/
|
@WhoCares
Thanks for the deep analysis and the PySolverLauncher — really appreciate you putting time into this!
Let me give some context on the current state, since the AI reviewed v1.3.0 but we're now on v1.6.0 with several things already addressed:
CUDA Streams / GPU overlap — Already
solved in v1.6.0. We implemented an async DP sender pipeline (background thread handles all HTTP while the main thread immediately relaunches the kernel). Measured GPU utilization: 100%, power draw 502W/575W on the RTX 5090. The double-buffering approach from the analysis would add <0.5% on top of what we already have.
Montgomery batch inversion — This is the one genuinely interesting suggestion. The per-step Itoh-Tsujii inversion IS the main cost (8M+116S out of 15M+120S per step). Batch inversion could amortize it across 128 threads. However, it requires 14x __syncthreads() per step (currently we have ZERO sync across 2048 steps), plus shared memory for the product tree, plus extra registers.
Our realistic estimate is 1.5-2x speedup, not 3x.
Worth exploring after the current run.
Comb w=5/6 for fe_mul — We actually tested this. Wider comb = more registers = less occupancy. Our history: comb table fe_mul gave 198 registers (16.7% occupancy), while the current table-free approach uses 80 registers (50% occupancy) and was 2.4x faster in practice. Occupancy wins over per-operation speed on GPU.
Important note about
per-step normalization: the analysis calls it "the biggest bottleneck" — true, but
it's mathematically required. Without it, walks never merge (we learned this the hard way — our old 3.5 G/s benchmark was invalid because of this). It can only be amortized (batch inversion), not removed.
Your PySolverLauncher: we've bundled it into the official ArmadilloSolver.zip on the dashboard! The /api/download-info endpoint was already there, so it works out of the box. Credit in the changelog. Thanks for the contribution!
Current status: 42 G/s fleet, 101M DPs, 22% probability, all verified end-to-end. Just waiting for the birthday paradox to do its thing!