Hacker News

It's theoretically possible to run gather as fast as one cache line per cycle instead of one SIMD lane per cycle. I don't think anyone has thrown that much permute hardware at the problem, though. It's only profitable if you believe that scatter and gather do have cache locality even when they don't have regularity.
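To make "locality without regularity" concrete, here is a rough Python sketch that counts how many distinct cache lines a gather's lanes fall into. The 64-byte line size, the 4-byte element size, and the helper name `cache_lines_touched` are my assumptions for illustration, not anything a particular CPU exposes.

```python
CACHE_LINE_BYTES = 64  # assumed line size; common on x86, not universal


def cache_lines_touched(base, indices, elem_bytes=4):
    """Return the set of cache-line numbers covered by a gather's lanes."""
    lines = set()
    for i in indices:
        first = base + i * elem_bytes      # first byte of this lane's element
        last = first + elem_bytes - 1      # last byte of this lane's element
        # An unaligned element can straddle two lines, so cover the whole range.
        lines.update(range(first // CACHE_LINE_BYTES,
                           last // CACHE_LINE_BYTES + 1))
    return lines


# Irregular order, but every element sits in the same 64-byte line.
irregular_but_local = [3, 0, 7, 1, 12, 5, 9, 2]
# Regular stride, but every element sits in a different line.
spread_out = [0, 100, 200, 300, 400, 500, 600, 700]

print(len(cache_lines_touched(0, irregular_but_local)))  # 1
print(len(cache_lines_touched(0, spread_out)))           # 8
```

A one-line-per-cycle design would, in this toy model, finish the first gather in one step and the second in eight, while a one-lane-per-cycle design takes eight steps for both. The extra permute hardware only pays off if real index vectors look more like the first case.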


I believe the Xeon Phi series implemented gather this way.


Isn't AVX512 basically cacheline-instructions?


That's the way normal SIMD loads work, yeah.

But the scatter/gather instructions do random-access memory operations. You have one SIMD register holding 8 (or whatever the width is) indices to be applied to a base address in a scalar register, and the hardware then goes and does 8 separate memory operations on your behalf, packing the results into a SIMD register at the end.

That has to hit the cache 8 times in the general case. It's extremely expensive as a single instruction, though faster than running scalar code to do the same thing.
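If it helps, a scalar model of what a gather computes looks roughly like this in Python. The function name `gather` and the list standing in for memory are made up for the sketch. Real instructions also take a scale factor and usually a mask, both omitted here.

```python
def gather(memory, base, indices):
    """Scalar model of a gather: one independent load per lane."""
    result = []
    for idx in indices:
        # Each lane computes its own address and performs its own access.
        result.append(memory[base + idx])
    return result  # lanes packed back together, in index order


memory = list(range(100, 200))            # stand-in for an array in memory
indices = [40, 3, 17, 88, 0, 51, 9, 62]   # the 8 lanes of the index register

print(gather(memory, 10, indices))
# [150, 113, 127, 198, 110, 161, 119, 172]
```

The loop body is the part the hardware has to repeat per lane: nothing about the indices guarantees that two lanes share a cache line, so the general case is 8 separate lookups.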



