Modular exponentiation is the algorithm whose performance determines the practicality of many public key cryptosystems, including RSA, DH and DSA. We have recently achieved a manyfold improvement in the performance of modular exponentiation in JavaScript over the implementation in the Stanford JavaScript Crypto Library (SJCL). JavaScript was originally intended for performing simple tasks in web pages, but it has grown into a sophisticated general purpose programming language used for both client and server computing, arguably the most important programming language today. Good performance of public key cryptography is difficult to achieve in JavaScript, because JavaScript is an interpreted language, inherently slower than a compiled language such as C, and because it provides floating point arithmetic but no integer arithmetic. But fast JavaScript public key cryptography is worth the effort, because it may radically change the way cryptography is used in web applications.
(We are concerned here with modular exponentiation using the general moduli of classical public key cryptosystems such as RSA, DSA and DH, rather than the pseudo-Mersenne primes used as moduli in elliptic curve cryptography. In SJCL, modular exponentiation with general moduli is provided by montpowermod.)
Our implementation of modular exponentiation is part of a JavaScript big integer library that we are developing as the basis of a cryptographic authentication toolkit for web applications. We initially considered building the toolkit on top of the Web Cryptography API of the W3C. That approach was appealing because it would have provided our JavaScript code with access to fast cryptographic primitives supplied by the browser, presumably implemented in C, perhaps with assembly language optimizations. But we decided against it for several reasons: the API is unfinished and seems to be in a state of flux; it lacks important primitives, notably DSA; and it is unnecessarily complicated, requiring cryptographic primitives to be invoked via asynchronous JavaScript Promises. Instead we took up the challenge of implementing faster public key cryptography in JavaScript itself, and were lucky to succeed well beyond our expectations.
Performance results
Modular exponentiation computes g^{x} mod m, where g is the base, x is the exponent, and m is the modulus. The bit lengths of g, x and m depend on the cryptosystem where the algorithm is used and the desired security strength. The security strength of a cryptosystem is defined by the amount of work that it would take to break it with the best available cryptanalytic techniques. That amount of work can be loosely measured by the key length of a symmetric encryption algorithm such as AES for which an exhaustive search of the key space would require the same amount of work. In Table 2 of SP 800-57 Part 1, NIST lists the key sizes that are required in several cryptosystems to achieve a range of security strengths. We have measured the performance of modular exponentiation for the parameter bit lengths used in the cryptosystems of columns FFC and IFC at the security strengths that NIST considers acceptable, i.e. 112, 128, 192 and 256 bits.
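To make the operation concrete, here is a minimal binary square-and-multiply sketch of g^x mod m using the built-in BigInt type. This is for illustration only; it is not the code of our library, which achieves its speed with very different techniques.

```javascript
// Minimal square-and-multiply sketch of g^x mod m using BigInt.
// Illustrative only; not the implementation discussed in this post.
function modPow(g, x, m) {
  let result = 1n;
  let base = g % m;
  while (x > 0n) {
    if (x & 1n) result = (result * base) % m; // multiply when the low bit is 1
    base = (base * base) % m;                 // square once per exponent bit
    x >>= 1n;                                 // consume the low bit
  }
  return result;
}

// Example: 4^13 mod 497
console.log(modPow(4n, 13n, 497n)); // 445n
```

A production implementation avoids the costly division implicit in each `% m` by using techniques such as Montgomery reduction, which is what makes the performance comparison below interesting.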
FFC stands for Finite-Field Cryptography, which comprises DSA and DH. Column FFC of the NIST table lists key sizes (L, N) for DSA and DH, L being the bit length of the public key and N the bit length of the private key. DSA involves one modular exponentiation for signing and two for verifying, while DH requires one modular exponentiation by each participant in a key exchange. In those modular exponentiations g and m are L bits long, while x is N bits long. The following table shows the time it takes to compute those modular exponentiations with our implementation and with the one in the Stanford library, both running in Firefox on a Mac laptop with a 1.7 GHz 64-bit Intel Core i5 processor. (The bit length of g has negligible effect on performance but is included for clarity. The same is true for the other three tables below.)
DSA and DH bit lengths, min time in Firefox on 64-bit laptop

| NIST security strength  | 112 bits  | 128 bits  | 192 bits  | 256 bits  |
|-------------------------|-----------|-----------|-----------|-----------|
| Bit length of g         | 2048      | 3072      | 7680      | 15360     |
| Bit length of x         | 224       | 256       | 384       | 512       |
| Bit length of m         | 2048      | 3072      | 7680      | 15360     |
| Stanford JS Crypto Lib  | 74 ms     | 180 ms    | 1549 ms   | 7908 ms   |
| Pomcor JS Big Int Lib   | 10 ms     | 23 ms     | 199 ms    | 1050 ms   |
| Performance improvement | 7x faster | 8x faster | 8x faster | 8x faster |
IFC stands for Integer-Factorization Cryptography, which comprises RSA. Column IFC of the NIST table lists the bit length k of the RSA modulus at the range of security strengths. When the usual CRT optimization is used, a private key RSA operation for signing or decrypting requires two modular exponentiations, where the bit lengths of g and x are k and the bit length of m is k/2. (Public key operations matter less for performance because they use a very small exponent.) The following table shows the time it takes to compute each of those two modular exponentiations with our implementation and with the one in the Stanford library, again in Firefox on the same Mac laptop.
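As an illustration of how the CRT optimization splits a private key operation into two exponentiations with half-size moduli, here is a BigInt sketch. The parameter names (p, q, dp, dq, qinv) are the customary CRT values, assumed here for illustration and not taken from our library: p and q are the prime factors of the modulus, dp = d mod (p-1), dq = d mod (q-1), and qinv = q^(-1) mod p.

```javascript
// Helper: plain square-and-multiply modular exponentiation over BigInt.
function modPow(g, x, m) {
  let result = 1n, base = g % m;
  while (x > 0n) {
    if (x & 1n) result = (result * base) % m;
    base = (base * base) % m;
    x >>= 1n;
  }
  return result;
}

// Illustrative RSA private key operation with the CRT optimization.
function rsaCrt(c, p, q, dp, dq, qinv) {
  const m1 = modPow(c % p, dp, p); // exponentiation modulo the k/2-bit prime p
  const m2 = modPow(c % q, dq, q); // exponentiation modulo the k/2-bit prime q
  const h = (((m1 - m2) % p + p) % p) * qinv % p; // Garner recombination
  return m2 + h * q;
}

// Tiny textbook example: n = 61 * 53 = 3233, d = 2753;
// ciphertext 2790n decrypts back to the message 65n.
console.log(rsaCrt(2790n, 61n, 53n, 53n, 49n, 38n)); // 65n
```

Each of the two exponentiations operates on a modulus of half the bit length of n, which is why the m row in the table below is k/2 bits.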
RSA-with-CRT bit lengths, min time in Firefox on 64-bit laptop

| NIST security strength  | 112 bits  | 128 bits  | 192 bits  | 256 bits  |
|-------------------------|-----------|-----------|-----------|-----------|
| Bit length of g         | 2048      | 3072      | 7680      | 15360     |
| Bit length of x         | 2048      | 3072      | 7680      | 15360     |
| Bit length of m         | 1024      | 1536      | 3840      | 7680      |
| Stanford JS Crypto Lib  | 148 ms    | 460 ms    | 6636 ms   | 50818 ms  |
| Pomcor JS Big Int Lib   | 25 ms     | 75 ms     | 882 ms    | 7424 ms   |
| Performance improvement | 6x faster | 6x faster | 8x faster | 7x faster |
Column ECC of the NIST table lists the key sizes for elliptic curve cryptosystems such as ECDSA and ECDH, shown as the order f of the group generated by the base point of the elliptic curve. ECC has important advantages over classical public key cryptography and we plan to support it in our big integer library and authentication toolkit. But after the Snowden revelations, some users of cryptography, especially in Europe, are worried that elliptic curves chosen by NIST or by others may have back doors. This was reported and discussed at the NIST ECC workshop last June (webcast available). Hence the use of classical cryptography may be preferable or required for some applications or use cases. (Classical DSA was designed by the NSA but is not under suspicion.) Modular exponentiation with general moduli is not used in ECC, where scalar multiplication of curve points is used instead.
The next two tables report measurements for implementations running in Chrome on a Samsung Galaxy Note 3 with a 32-bit 2.3 GHz processor. Our implementation includes Karatsuba multiplication, which is asymptotically faster than long multiplication, but slower for shorter bit lengths. We use it only where indicated. SJCL does not implement Karatsuba multiplication.
DSA and DH bit lengths, min time in Chrome on 32-bit phone

| NIST security strength  | 112 bits  | 128 bits  | 192 bits   | 256 bits   |
|-------------------------|-----------|-----------|------------|------------|
| Bit length of g         | 2048      | 3072      | 7680       | 15360      |
| Bit length of x         | 224       | 256       | 384        | 512        |
| Bit length of m         | 2048      | 3072      | 7680       | 15360      |
| Stanford JS Crypto Lib  | 315 ms    | 742 ms    | 6264 ms    | 34460 ms   |
| Pomcor JS Big Int Lib   | 46 ms     | 103 ms    | 644 ms*    | 3379 ms*   |
| Performance improvement | 7x faster | 7x faster | 10x faster | 10x faster |

\* Using Karatsuba multiplication
RSA-with-CRT bit lengths, min time in Chrome on 32-bit phone

| NIST security strength  | 112 bits  | 128 bits  | 192 bits  | 256 bits   |
|-------------------------|-----------|-----------|-----------|------------|
| Bit length of g         | 2048      | 3072      | 7680      | 15360      |
| Bit length of x         | 2048      | 3072      | 7680      | 15360      |
| Bit length of m         | 1024      | 1536      | 3840      | 7680       |
| Stanford JS Crypto Lib  | 710 ms    | 2108 ms   | 29300 ms  | Not tested |
| Pomcor JS Big Int Lib   | 115 ms    | 263 ms    | 3263 ms   | Not tested |
| Performance improvement | 6x faster | 7x faster | 7x faster | n/a        |
Remarks
The following observations can be made from the timings shown in the above tables:
- Our modular exponentiation is 6 to 10 times faster than the one in SJCL for the bit lengths used in DSA, DH and RSA with CRT at the tested security strengths.
- With our fast implementation, given that a DSA signature takes one exponentiation, cryptographic authentication by signing a challenge should take a small fraction of a second on a laptop at security strengths 112, 128 and 192, and not much more than a second at security strength 256. On a smart phone with a 32-bit processor it should take a small fraction of a second at security strengths 112 and 128, and less than a second at security strength 192. For security strength 256 on a phone it would be necessary to resort to ECDSA.
- A private key operation with the CRT optimization of RSA takes two exponentiations. Thus, using our implementation, cryptographic authentication or public key encryption using RSA on a laptop would take a small fraction of a second at security strengths 112 and 128, and less than two seconds at security strength 192. At security strength 256 one would use instead DSA or ECDSA for authentication, or an elliptic curve version of El Gamal for encryption. On a phone with a 32-bit processor, authentication or encryption with RSA would take less than a second at security strengths 112 and 128, but would not be practical at security strengths 192 and beyond.
- In our implementation, Karatsuba multiplication does not help in any of the cases where the size of the multiplicands (as determined by the modulus) is 3840 bits or less. This is surprising, and it was disappointing to us, since we made a serious effort to optimize our implementation of Karatsuba. By contrast, the Wikipedia article on Karatsuba states that “as a rule of thumb, Karatsuba is usually faster when the multiplicands are longer than 320–640 bits”.
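For reference, the Karatsuba recursion discussed in the last remark can be sketched over BigInt as follows. The 64-bit base-case threshold and the splitting strategy are arbitrary illustrative choices, not the tuned values of our implementation, which operates on arrays of digits rather than on BigInt values.

```javascript
// Number of bits in a positive BigInt (illustrative helper).
function bitLength(n) {
  return BigInt(n.toString(2).length);
}

// Toy Karatsuba multiplication over BigInt.
function karatsuba(a, b) {
  if (a < 1n << 64n || b < 1n << 64n) return a * b; // fall back to direct multiply
  const la = bitLength(a), lb = bitLength(b);
  const m = (la > lb ? la : lb) >> 1n;      // split position
  const mask = (1n << m) - 1n;
  const ah = a >> m, al = a & mask;         // a = ah * 2^m + al
  const bh = b >> m, bl = b & mask;         // b = bh * 2^m + bl
  const z2 = karatsuba(ah, bh);
  const z0 = karatsuba(al, bl);
  const z1 = karatsuba(ah + al, bh + bl) - z2 - z0; // three multiplies instead of four
  return (z2 << (2n * m)) + (z1 << m) + z0;
}
```

The saving comes from replacing four half-size multiplications with three, at the cost of extra additions, shifts and bookkeeping; that overhead is what pushes the crossover point to larger operands in practice.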
Measurement method
It is difficult to measure performance in JavaScript. Programs written in C can use the clock function of the C Standard Library to measure the processor time spent by the program; JavaScript has no equivalent functionality, so we have to measure elapsed time instead. The JavaScript Date object can be used to measure elapsed time, but not accurately enough: the getTime() method of the Date object reports time in milliseconds, but the reported time may not be accurate to the millisecond and is subject to occasional system clock adjustments forward or backward. We use instead the performance.now() function introduced by the High Resolution Time specification of the W3C, which provides sub-millisecond resolution, is not subject to clock adjustments, and is now supported by most browsers. Elapsed time, however, is not a good measure of program performance, since the process that executes the program may have to yield the processor core on which it runs to other processes, and the processor clock may be slowed down to conserve battery life.
We cope with these difficulties as follows. To measure the time taken by an algorithm on a set of inputs, we run the algorithm on the inputs in a loop with enough iterations for the total time to be at least 200 ms, and we divide the total time by the number of iterations. Then we manually repeat the loop a number of times. As the number of repetitions increases, the minimum time over the set of repetitions converges towards the ideal time that the algorithm would take without interruptions or slowdowns. We report the minimum time observed after a large number of repetitions.
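The measurement procedure just described can be sketched as follows. The function name and signature are ours, for illustration; performance.now() is the W3C High Resolution Time function, available as a global in browsers and in recent versions of Node.

```javascript
// Run fn in a loop until at least 200 ms have elapsed, average over the
// iterations, repeat, and keep the minimum average. The minimum over
// many repetitions converges to the ideal uninterrupted time per call.
function minTimePerCall(fn, repetitions) {
  let best = Infinity;
  for (let r = 0; r < repetitions; r++) {
    let iterations = 0;
    let elapsed = 0;
    const start = performance.now();
    do {
      fn();
      iterations++;
      elapsed = performance.now() - start;
    } while (elapsed < 200); // accumulate at least 200 ms per run
    best = Math.min(best, elapsed / iterations);
  }
  return best; // milliseconds per call
}
```

In our measurements the repetitions were performed manually rather than in an outer loop, so that each run starts from a comparable browser state.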
For more information…
…come to the XXI Internet Identity Workshop, where we will make a presentation and give a demo. The workshop takes place next week, October 27-29, at the Computer History Museum in Mountain View.
We plan to release the big integer library together with the cryptographic toolkit as soon as they are both ready. We will then explain in detail what makes our modular exponentiation code faster.
Update. The presentation at IIW is now available.
When researching a topic on Wikipedia, it is best to look at the quoted references to get a more accurate idea of what is being paraphrased. Regarding the efficiency of Karatsuba multiplication, one of the references is http://www.cburch.com/proj/karat/comment1.html where the following was noted:
For Linux on x86 the crossover seems to be around 64-digit
factors. For HP-UX on PA-RISC 2.0, between 125 and 130
digits. In both cases the top optimization level was used. In
addition, I compiled under C, not C++.

These results are somewhat disappointing in that one has
to go to quite a few digits before the advantages of Karatsuba
kick in. For both platforms above, Karatsuba becomes significantly
faster only when reaching some 250 (Linux) or 600 digits (HP-UX).
If you had read the aforementioned reference prior to your post, you may not have been as surprised and disappointed to find less benefit in JavaScript than you had hoped for.
It is unsurprising to me that an optimizing compiler would find the crossover point for Karatsuba multiplication to occur earlier than would be observed in JavaScript in a browser.
My response is in reaction to the following excerpted text:
In our implementation, Karatsuba multiplication does not help in any of the cases where the size of the multiplicands (as determined by the modulus), is 3840 bits or less. This is surprising, and it was disappointing to us, since we made a serious effort to optimize our implementation of Karatsuba. By contrast, the Wikipedia article on Karatsuba states that “as a rule of thumb, Karatsuba is usually faster when the multiplicands are longer than 320–640 bits”.