background

i'm trying implement rsa 2048 in hardware (xilinx zynq fpga) using variety of different montgomery methods. implementing algorithm using xilinx hls (essentially c++ code synthesized hardware).

note: sake of post, treat standard c++ implementation, except can have variables act bit-vectors 4096 bits wide, , access individual bits using foo[bit] or foo.range(7,0) syntax. haven't yet parallelized it, should behave standard c++ code. please don't afraid , stop reading because said word fpga , hls. treat 1 c++ code.

i've been able working prototype uses standard square-and-multiply modular exponentiation, , standard radix-2 mm algorithm modular multiplication, takes space on fpga , need use less resource-heavy algorithms.

to save space, i'm trying implement tenka-koc scalable multiple word radix 2 montgomery multiplication (mwr2mm) proposed here. i've been struggling month no avail. there interesting problem resulting out of struggles cannot figure out.

the problem

my issue mwr2mm not return correct answer when performing montgomery multiplication. however, beginning think not coding error, , rather, instead misinterpreting critical usage of algorithm.

there multiple variations of mwr2mm algorithm, quite differing implementations, , i've tried implement many of them. have 4 different implementations of mwr2mm coded up, based on modifications algorithm put forth in number of papers. what makes me think implementations correct, of these varying versions of algorithm returning same incorrect answer! don't think coincidence, don't think published algorithms wrong....therefore, posit more nefarious going on, , algorithm implementations correct.

example 1

for example, take original proposed mwr2mm put forth in tenca-koc's paper, refer mwr2mm_csa because algorithm's addition operations use carry-save adder (csa) when implemented in hardware.

s partial sum
m modulus
y multiplicand
x multiplier , x_i (subscript) single bit (e.g. x = (x_n,...,x_1,x_0).
the superscript words vectors ( e.g. m = (0,m^{e-1},...,m^1,m^0)
(a,b) concatenation of 2 bit vectors.
m operands width
w width of choosen words
e number of w-bit words required complete vectors, (e.g. e = ceil((m+1)/w) )

my implementation of algorithm uses following parameters:

mwr2mm_m = 2048 (operand size, m above)
mwr2mm_w = 8 (word size, w above)
mwr2mm_e = ceil( (e+1)/w ) = 257 (number of words + 1 per operand, e above)
ap_uint<num_bits> how declare bit vector in hls

my code:

void mwr2mm_csa( ap_uint<mwr2mm_m> x,                  ap_uint<mwr2mm_w> y[mwr2mm_e+1],                  ap_uint<mwr2mm_w> m[mwr2mm_e+1],                  ap_uint<mwr2mm_m> *out) {     // declare , 0 partial sum s     ap_uint<mwr2mm_w> s[mwr2mm_e] = 0;     (int i=0; i<mwr2mm_e; i++)         s[i] = 0;      // 2 carry bits     ap_uint<1> ca=0, cb=0;      (int i=0; i<mwr2mm_m; i++)     {         (ca,s[0]) = x[i]*y[0] + s[0]; // how hls concatenates vectors, in paper!         if (s[0][0] == 1) // if 0th bit of 0th word 1         {             (cb,s[0]) = s[0] + m[0];             (int j=1; j<=mwr2mm_e; j++)             {                    (ca, s[j]) = ca + x[i]*y[j] + s[j];                 (cb, s[j]) = cb + m[j] + s[j];                 s[j-1] = ( s[j][0], s[j-1].range(mwr2mm_w-1,1) );             }         }         else         {             (int j=1; j<=mwr2mm_e; j++)             {                 (ca, s[j]) = ca + x[i]*y[j] + s[j];                 s[j-1] = ( s[j][0], s[j-1].range(mwr2mm_w-1,1) );             }         }     }      // copy result output pointer     (int i=0; i<mwr2mm_e-1; i++)         out->range(mwr2mm_w*i+(mwr2mm_w-1), mwr2mm_w*i) = s[i].to_uchar(); }

now, understanding (quoting paper above)

the montgomery multiplication (mm) algorithm on 2 integers x , y , required parameters n bits of precision, result in number mm(x,y,m) = xy(2^-n) (modulo m), r=2^n , m integer in range (2^(n-1), 2^(n)) such gcd(r,m)=1. since r=2^n , sufficient modulus m odd integer.

therefore, should expect following results (verified w/ software library):

x = 0xaba5e025b607aa14f7f1b8cc88d6ec01c2d17c536508e7fa10114c9437d9616c9e1c689a4fc54744fa7dfe66d6c2fcf86e332bfd6195c13fe9e331148013987a947d9556a27a326a36c84fb38bfefa0a0ffa2e121600a4b6aa4f9ad2f43fb1d5d3eb5eaba13d3b382fed0677df30a089869e4e93943e913d0dc099aa320b8d8325b2fc5a5718b19254775917ed48a34e86324adbc8549228b5c7beeefa86d27a44ceb204be6f315b138a52ec714888c8a699f6000d1cd5ab9bf261373a5f14da1f568be70a0c97c2c3eff0f73f7ebd47b521184dc3ca932c91022bf86dd029d21c660c7c6440d3a3ae799097642f0507dfaecac11c2bd6941cbc66cedeeab744 y = 0xd091be9d9a4e98a172bd721c4bc50ac3f47daa31522db869eb6f98197e63535636c8a6f0ba2fd4c154c762738fbc7b38bdd441c5b9a43b347c5b65cfdef4dcd355e5e6f538efbb1cc161693fa2171b639a2967bea0e3f5e429d991fe1f4de802d2a1d600702e7d517b82bffe393e090a41f57e966a394d34297842552e15550b387e0e485d81c8cccaad488b2c07a1e83193ce757fe00f3252e4bd670668b1728d73830f7ae7d1a4c02e7afd913b3f011782422f6de4ed0ef913a3a261176a7d922e65428ae7aaa2497bb75bfc52084ef9f74190d0d24d581eb0b3dac6b5e44596881200b2ce5d0fb2831d65f036d8e30d5f42becab3a956d277e3510df8cba9 m = 0xd27bf9f01e2a901db957879f45f697330d21a21095da4fa7d3aab75454a8e9f0f4ea531ece34f0c3ba9e02eb27d8f0dbe78eede4ac84061beef162d00b55c0dd772d28f23e994899aa19b9bea7b12a8027a32a92190a3630e249544675488121565a23548fcd36f5382eeb993db9ce3f526f20ab355e82d963d59541bc1161e211a03e3b372560840c57e12bd2f40eac5ffcec01b3f07c378c0a60b74bef7b572764c88a4f98b61fa8ccd905afae779e6193378304d8eb17695ce71a173ac3de11271753c48db58546e5af9917c1cebba5bb1af3fce3df9516c0c95c9bc14bb65d1c53078c06c81ac0f3ed0d8634260e47bf780cf4f4996084df732935194417 mm(x,y,m) = 0x444682cc199679928f5971191accb8eaa5c76cf743e54fc28fd8dcff57bd198677a26a5c1a6254810a91049fa85cbe3eddfdcdf12ed3fbb204de249c389cdee3fa6db65441afe03f1148660ea0e756e038891cef098f2a009fb443685202fac40d8fe7b82a1f643020ea31f5a8f4b253ad2f30028c59f1e2dcf3902bbc48e73eca7bdc22bb92e8a70bc535584bf644caf24ef39a1899f18c05937446aacc5c64762afad2b73eedf3aa96c9a4cff836a551a26aed46279328edd4b9bbbc182b9e408640d058926882b3a0faa043f726ef96e07b7960d586e2648534eb15c23fe152d0d088f1742e023715e3abaec8128b51cc86e8bc207d69f1e6ba7067d44429

but instead, algorithm returns

mwr2mm_csa(x,y,m) = 0x16c27cbc37c109b048b0f8b860c3501db2e90f07d9bf9f6a63839453ac6603776c8cbd7ae8974544c52f078ad035af1ac58cbbd5db5801cdf3cf876c43f29fc1719adf46804928d8bb621fcd48988160602c47812299603181fd97aec74b7be563ea0b0cb9ec9b2559191d8ee6ae8092ff9e50adc1b874bc40c9256d785a4920dc1c1a5df2b8492b181d16841eea5377524bdf9bcc8a6dc3919dd4fdf6bbd7bb9d8fc35d06d7a4135363a2aa7fa6ae43b335a2704b007e405731a0d5d352ef7c51ad58241d201e07fa86aa395bb8f5ab3c9b966d5db966777b45fe47b1838b97afed23907d7af61cf809d0b934fc3899998bfef5b11516ca76c62d999ced8840

example 2

ok, maybe implementation wrong. lets try modified version, mwr2mm_cpa algorithm (named carry-propogate adders used in hardware):

and implementation of mwr2mm_csa:

void mwr2mm_cpa(rsasize_t x, rsasize_t yin, rsasize_t min, rsasize_t* out) { // extend operands 2 words longer ap_uint<mwr2mm_m+2*mwr2mm_w> y = yin;  ap_uint<mwr2mm_m+2*mwr2mm_w> m = min; ap_uint<mwr2mm_m+2*mwr2mm_w> s = 0;  ap_uint<2> c = 0; bit_t qi = 0;  // unlike previous example, store concatenations in temporary variable ap_uint<10> temp_concat=0;   (int i=0; i<mwr2mm_m; i++) {     qi = (x[i]*y[0]) xor s[0];      // c gets top 2 bits of temp_concat, j'th word of s gets bottom 8 bits of temp_concat     temp_concat = x[i]*y.range(mwr2mm_w-1,0) + qi*m.range(mwr2mm_w-1,0) + s.range(mwr2mm_w-1,0);     c = temp_concat.range(9,8);     s.range(mwr2mm_w-1,0) = temp_concat.range(7,0);      (int j=1; j<=mwr2mm_e; j++)     {         temp_concat = c + x[i]*y.range(mwr2mm_w*j+(mwr2mm_w-1), mwr2mm_w*j) + qi*m.range(mwr2mm_w*j+(mwr2mm_w-1), mwr2mm_w*j) + s.range(mwr2mm_w*j+(mwr2mm_w-1), mwr2mm_w*j);         c = temp_concat.range(9,8);         s.range(mwr2mm_w*j+(mwr2mm_w-1), mwr2mm_w*j) = temp_concat.range(7,0);          s.range(mwr2mm_w*(j-1)+(mwr2mm_w-1), mwr2mm_w*(j-1)) = (s.bit(mwr2mm_w*j), s.range( mwr2mm_w*(j-1)+(mwr2mm_w-1), mwr2mm_w*(j-1)+1));     }     s.range(s.length()-1, s.length()-mwr2mm_w) = 0;     c=0; }  *out = s;

}

when run same x,y , m, returns exact same incorrect result mwr2mm_csa, despite different bit-level operations.

mwr2mm_cpa(x,y,m) = 0x16c27cbc37c109b048b0f8b860c3501db2e90f07d9bf9f6a63839453ac6603776c8cbd7ae8974544c52f078ad035af1ac58cbbd5db5801cdf3cf876c43f29fc1719adf46804928d8bb621fcd48988160602c47812299603181fd97aec74b7be563ea0b0cb9ec9b2559191d8ee6ae8092ff9e50adc1b874bc40c9256d785a4920dc1c1a5df2b8492b181d16841eea5377524bdf9bcc8a6dc3919dd4fdf6bbd7bb9d8fc35d06d7a4135363a2aa7fa6ae43b335a2704b007e405731a0d5d352ef7c51ad58241d201e07fa86aa395bb8f5ab3c9b966d5db966777b45fe47b1838b97afed23907d7af61cf809d0b934fc3899998bfef5b11516ca76c62d999ced8840

for brevity, spare 2 other algorithms return same incorrect result. should note both of these algorithms work correctly when used 4-bit operand size , 2-bit word size. other operand size/word size combinations incorrect, have same incorrect result 4 differing bit-level implementations.

i cannot life of me figure out why 4 algorithms return same incorrect result. code in first example literally word-for-word identical algorithm put forth in tenca-koc paper!

am incorrect @ assuming mwr2mm algorithm should return same result (in montgomery domain) standard radix-2 mm algorithm? have same radix, results should identical regardless of word-size. should not able interchange these algorithms each other?

sorry lengthy post, want precise , coherent in explaining issue is. i not asking debugging code, rather trying figure out whether misunderstanding underlying feature of montgomery multiplication algorithms. curious why different implementations giving same wrong result.

thanks!

the issue algorithm returns:

0x116c27cbc37...   ^

which greater m. if subtract m expected answer:

both algorithms return value in range 0 2*m, if answer greater or equal m, need final subtraction stage.

in other words, if test algorithm randomly chosen x , y should find half of time gives correct answer.

from section 2 of paper:

thus 1 conditional subtraction necessary bring s[n] required range 0 ≤ s[n] < m. subtraction omitted in subsequent discussion since independent of specific algorithm , architecture , can treated part of post processing.

Search This Blog

Insert

c++ - Bizzare identical incorrect results across different MWR2MM algorithms for RSA montgomery multiplication -

background

the problem

example 1

example 2

Comments

Post a Comment

Popular posts from this blog

vue.js - Create hooks for automated testing -

php - Vagrant up error - Uncaught Reflection Exception: Class DOMDocument does not exist -

serial port - hub4com OVERRUN Error -