Exetools  

Go Back   Exetools > General > General Discussion

Notices

 
 
Thread Tools Display Modes
Prev Previous Post   Next Post Next
  #36  
Old 03-18-2026, 16:00
WhoCares's Avatar
WhoCares WhoCares is offline
who cares
 
Join Date: Jan 2002
Location: Here
Posts: 468
Rept. Given: 11
Rept. Rcvd 32 Times in 25 Posts
Thanks Given: 69
Thanks Rcvd at 247 Times in 94 Posts
WhoCares Reputation: 32
Perhaps we'd better upgrade CUDA toolkit from 12.x to 13.1.

For learning purposes, I asked an AI to optimize the GPU kernel function pollard_kernel(), mainly targeting the NVIDIA GeForce RTX 5090.

The optimization goal was to reduce register usage from 96 registers to 64 registers. This increases SM occupancy, allowing the number of blocks that can run concurrently on a single SM to increase from 5 to 8, yielding a theoretical performance improvement of around one third.

The actual performance gain should be evaluated using NVIDIA Nsight Compute together with real benchmark data.

By leveraging the SMRS compiler feature introduced in NVIDIA CUDA Toolkit 13.0, spilled registers can be replaced with accesses to shared memory, making it possible to ultimately achieve the 64-register optimization target.

Quote:
ptxas info : Compiling entry function '_Z14pollard_kernelP6worm_tP4dp_tPjjiyiyi' for 'sm_120'
ptxas info : Function properties for _Z14pollard_kernelP6worm_tP4dp_tPjjiyiyi
200 bytes stack frame, -36 bytes spill stores, -28 bytes spill loads
ptxas info : Used 64 registers, used 1 barriers, 200 bytes cumulative stack size, 7168 bytes smem

ptxas info : Compile time = 0.000 ms
ptxas info : Function properties for _Z10ec_canon_x4fe_t
0 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _Z10ld_madd_z1RK4fe_tS1_S1_S1_RS_S2_S2_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z10reset_wormiyR14worm_context_tP6walk_t
0 bytes stack frame, 124 bytes spill stores, 120 bytes spill loads
ptxas info : Function properties for _Z12iterate_stepiyR14worm_context_tjP4dp_tPjiP6walk_ty
0 bytes stack frame, 32 bytes spill stores, 16 bytes spill loads
ptxas info : Function properties for _Z12prepare_stepR4fe_tS0_R4sc_tS2_
0 bytes stack frame, 0 bytes spill stores, 8 bytes spill loads
ptxas info : Function properties for _Z6fe_inv4fe_t
0 bytes stack frame, 0 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _Z9record_dpRK4fe_tRK4sc_tS4_P4dp_tPji
0 bytes stack frame, 0 bytes spill stores, 8 bytes spill loads
tmpxft_00006c68_00000000-7_solver_fast.compute_120.cudafe1.cpp
[100%] Linking CUDA executable solver_fast.exe
[100%] Built target solver_fast
Attached Files
File Type: zip GPU_Register_Usage_Optimization_Walkthrough.zip (3.1 KB, 12 views)
__________________
AKA Solomon/blowfish.

Last edited by WhoCares; 03-18-2026 at 16:12.
Reply With Quote
The Following User Gave Reputation+1 to WhoCares For This Useful Post:
cjack (03-18-2026)
The Following 4 Users Say Thank You to WhoCares For This Useful Post:
cjack (03-18-2026), niculaita (03-19-2026), nulli (03-20-2026), wx69wx2023 (03-18-2026)
 

Tags
bolero, ecdlp

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Replacing ECDSA in Target (arma) Mynotos General Discussion 3 11-22-2019 00:49


All times are GMT +8. The time now is 04:28.


Always Your Best Friend: Aaron, JMI, ahmadmansoor, ZeNiX, chessgod101
( Since 1998 )