Montgomery multiplication would significantly reduce the run time especially for the modular exponentiation in Miller Rabin primarily testing.
Further modular exponentiation here is naive as it could use windowing of the exponent into groups of bits at a time to reduce the products by a decent amount though squarings remains constant.
Ideally you generate a few dozen numbers and test them for primarily in batches. This way the modular exponentation precomputed table is efficiently applied to all the different exponents simultaneously.
There are also ways to do GCD and modular inverse without divisions using montgomery multiplication and even powers of 2 based on the machine word size. And yes of course at a certain bit size Karatsuba or Schonhage Strassen multiplication should be used. Other tricks are possible such as looking at consecutive ones in the multiplicand and dealing with primes of forms 2^n2^m+... by doing addition and subtraction which is far better than pure addition in grade school method.
Isnt the OpenSSL open source library considered the de facto reference for a reasonably heavily optimized big prime finder?
