![]() |
|
#37
|
||||
|
||||
|
Quote:
Thanks for the detailed optimization work! About the register optimization proposal — we actually went down this exact rabbit hole. Here's what we found on real silicon: The kernel is throughput-bound, not latency-bound. We built and benchmarked an optimization called "Thunderstrike" (OPT-3) that used ILP to fuse operations and improve parallelism. Result on RTX 5090: 0% speedup. The ALU pipeline is already saturated by the 112 sequential GF(2^113) squarings in ec_canon_x (40% of step cost) and the 8 multiplications in fe_inv via Itoh-Tsujii (50% of step cost). More warps via higher occupancy just queue up behind the same ALU — there are no idle cycles to fill. A few specific concerns with the 64-register approach: fe_mul alone needs ~80 registers (table-free XOR accumulation across 113-bit field elements). Forcing 64 via __launch_bounds__ guarantees massive spills. Even with __noinline__ on hot functions like fe_inv and ec_canon_x, the call overhead and lost register context hurt throughput on the critical path. CUDA 13.x / SMRS: we'd love to test it!. The ptxas output shows negative spill values (-36 bytes spill stores, -28 bytes spill loads). Negative spills are unusual and suggest the compiler is reporting redirected spills rather than actual elimination. Without real Nsight Compute profiling data, it's hard to confirm whether this translates to actual throughput gains. What we WILL adopt from your proposals for the next certificate: Single-pass DP retrieval (clean simplification, ~1-2%) cudaOccupancyMaxActiveBlocksPerMultiprocessor for self-tuning grid size Benchmarking L1 cache vs shared memory for the walk table (your bank conflict analysis was spot-on theoretically) Bottom line: fe_inv and ec_canon_x consume 90% of the step cost and are algorithmically irreducible for Koblitz curve canonicalization. No amount of occupancy optimization can reduce these costs. |
| Tags |
| bolero, ecdlp |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Replacing ECDSA in Target (arma) | Mynotos | General Discussion | 3 | 11-22-2019 00:49 |