
It's amazing that the 17 instructions at the end to tidy up all the carries, which look like they have to be done serially, are still faster. But I guess each register's carry bits are independent from the ones that get carries added from the last register, so it could be ...

  mov W, E
  mov V, D
  mov U, C
  mov T, B
  shr W, 51
  shr V, 51
  shr U, 51
  shr T, 51
  add D, W
  add C, V
  add B, U
  add A, T
which does seem like it could be parallelized.


> each register's carry bits are independent from the ones that get carries added from the last register

No they're not. Let's say B's non-carry bits are all 1. If you carry anything from C, that will affect B's carry bits.
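A minimal sketch of this counterexample, in Python rather than assembly. It assumes the layout under discussion: five 64-bit registers A..E (A most significant), each holding a 51-bit limb in its low bits, with the upper 13 bits left as room for carries. The function names and the explicit masking step are illustrative additions; the listing above shows only the shifts and adds.

```python
MASK51 = (1 << 51) - 1  # the low 51 bits of a register hold the limb value

def value(limbs):
    """Integer represented by limbs [A, B, C, D, E], A most significant."""
    total = 0
    for limb in limbs:
        total = (total << 51) + limb
    return total

def normalize_serial(limbs):
    """Propagate carries from E up to A, one register after another."""
    out = list(limbs)
    for i in range(len(out) - 1, 0, -1):
        out[i - 1] += out[i] >> 51  # carry includes anything just added to out[i]
        out[i] &= MASK51
    return out

def normalize_snapshot(limbs):
    """Read every carry from the ORIGINAL registers, as in the listing above."""
    out = list(limbs)
    for i in range(1, len(out)):
        out[i - 1] += limbs[i] >> 51  # carry taken from the untouched input
        out[i] &= MASK51
    return out

# B's non-carry bits are all 1, and C holds a single carry bit.
regs = [0, MASK51, 1 << 51, 0, 0]
serial = normalize_serial(regs)
snapshot = normalize_snapshot(regs)
print(serial)    # A has absorbed the carry that rippled through B
print(snapshot)  # B is left holding a value above MASK51
```

Both results still represent the same integer, but the snapshot version leaves B with a carry bit set, so the registers are not fully normalized after one pass.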


You could check if the middle 64-12-12 bits are all 1's, and if they are, branch to a slow code path.

It would be an incredibly rare case, so branch prediction will almost always get it right, and the cost of the branch is nearly nil.
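One way such a check could be sketched in Python, assuming five registers [A, B, C, D, E] with 51-bit limbs. This tests the exact overflow condition (an incoming carry pushing a limb past 51 bits) rather than the bit pattern described above, and `needs_slow_path` is a hypothetical name.

```python
MASK51 = (1 << 51) - 1  # the low 51 bits of a register hold the limb value

def needs_slow_path(limbs):
    """True if adding the neighbor's carry would overflow some limb's low 51 bits.

    limbs is [A, B, C, D, E], A most significant. A is skipped because it is
    the top register and is never masked. E is skipped because nothing
    carries into it.
    """
    for i in range(1, len(limbs) - 1):
        incoming = limbs[i + 1] >> 51       # carry from the next-lower register
        if (limbs[i] & MASK51) + incoming > MASK51:
            return True                     # a second carry would be generated
    return False

# B's low bits are all 1 and C carries a 1: the one-pass version is not enough.
rare_case = needs_slow_path([0, MASK51, 1 << 51, 0, 0])
# Ordinary values: every carry lands in a limb with room to spare.
common_case = needs_slow_path([0, 5, (1 << 51) + 7, 3, 9])
print(rare_case, common_case)
```

Note that the result of this test depends on the values being added.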


If you're trying to add two 256-bit numbers, there's probably cryptography involved, and data-dependent branches are poison to crypto code.


Well, it's only faster if you do three or more additions before that. But otherwise yes.
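The idea of batching additions before tidying up can be sketched in Python: several limb-wise additions with no carry handling at all, followed by a single normalization pass. The helper names and sample values are illustrative assumptions, and the sketch checks correctness only; it says nothing about where the speed break-even point lies.

```python
MASK51 = (1 << 51) - 1
MASK64 = (1 << 64) - 1

def to_limbs(n):
    """Split n into [A, B, C, D, E]: B..E get 51 bits each, A gets the rest."""
    low = [(n >> (51 * k)) & MASK51 for k in range(4)]  # E, D, C, B
    return [n >> 204] + low[::-1]

def add_no_carry(x, y):
    """Limb-wise add with no carry handling; & MASK64 mimics 64-bit registers."""
    return [(a + b) & MASK64 for a, b in zip(x, y)]

def normalize(limbs):
    """One serial pass that moves each register's carry bits into the next one up."""
    out = list(limbs)
    for i in range(len(out) - 1, 0, -1):
        out[i - 1] += out[i] >> 51
        out[i] &= MASK51
    return out

def value(limbs):
    """Integer represented by limbs [A, B, C, D, E], A most significant."""
    total = 0
    for limb in limbs:
        total = (total << 51) + limb
    return total

numbers = [(1 << 255) - 19, (1 << 256) - 1, 123456789, (1 << 200) + (1 << 51) - 1]

acc = to_limbs(numbers[0])
for n in numbers[1:]:
    acc = add_no_carry(acc, to_limbs(n))  # three additions, zero carry work

result = normalize(acc)                   # carries tidied up once, at the end
print(value(result) == sum(numbers))
```

The 13 spare bits per register are what make this safe: each normalized limb is below 2^51, so a register can absorb many such additions before it could overflow 64 bits.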



