It's theoretically possible to run gather as fast as one cache line per cycle ins...

BeeOnRope · on Aug 28, 2020

I believe the Xeon Phi series implemented gather this way.

PixelOfDeath · on Aug 28, 2020

Isn't AVX512 basically cacheline-instructions?

ajross · on Aug 28, 2020

That's the way normal SIMD loads work, yeah.

But the scatter/gather instructions do random-access memory operations. You have one SIMD register holding 8 (or whatever the width is) indexes to be applied to a base address in a scalar register, and the hardware then goes and does 8 separate memory operations on your behalf, packing the results into a SIMD register at the end.

That has to hit the cache 8 times in the general case. It's extremely expensive as a single instruction, though faster than running scalar code to do the same thing.
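
To make that concrete, here's a rough scalar model of what a gather does, plus a helper that counts how many distinct cache lines the 8 loads land on. This is only a sketch under assumptions of mine: 8 lanes of 32-bit elements, 64-byte cache lines, a 64-byte-aligned base, and unsigned indexes. The names `gather_model` and `lines_touched` are made up for illustration and are not real intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8            /* assumed: 8 x 32-bit lanes */
#define ELEMS_PER_LINE 16  /* assumed: 64-byte line / 4-byte element */

/* Scalar model of a gather: out[lane] = base[idx[lane]] for every lane. */
static void gather_model(const int32_t *base, const uint32_t idx[LANES],
                         int32_t out[LANES])
{
    assert(base != NULL);
    for (size_t lane = 0; lane < LANES; ++lane) {
        out[lane] = base[idx[lane]];  /* one independent load per lane */
    }
}

/* Number of distinct cache lines the LANES loads touch,
 * assuming base itself sits on a 64-byte boundary. */
static size_t lines_touched(const uint32_t idx[LANES])
{
    size_t count = 0;
    for (size_t i = 0; i < LANES; ++i) {
        uint32_t line = idx[i] / ELEMS_PER_LINE;
        int seen = 0;
        for (size_t j = 0; j < i; ++j) {
            if (idx[j] / ELEMS_PER_LINE == line) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            ++count;
        }
    }
    return count;
}
```

With contiguous indexes 0..7, `lines_touched` returns 1, which is the normal SIMD load case. With indexes scattered across the table it can return up to 8, which is the general case described above.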