
The original problem that Tomasulo's algorithm (register renaming) solved was that the IBM 360 Model 91 floating point unit only had 4 FP registers. Register renaming is also important on x86 (starting with Pentium Pro) which only has 8 GPRs and x86_64 which only has 16 GPRs.

These (8 and 16) are still small numbers. I wonder how important renaming is on ARMv8 or RISC-V with 32 GPRs. I really wonder if compiler register allocators help/hurt/know about renaming which in fact would require microarchitectural knowledge.

The example that Agner found is a form of register renaming. So is this AMD trying to make its legacy code base go faster (AMD+Intel do a TON of that) or is it something a compiler can actually target? I think the former.



Behind the 16 GPRs on Icelake / Sunny Cove (Intel's next generation desktop core), there is a 352-entry reorder buffer, so hundreds of renamed results can be in flight at once.

Behind the 32 GPRs on ARM's Cortex A78 application processor, there is a 160-entry reorder buffer.

> I really wonder if compiler register allocators help/hurt/know about renaming which in fact would require microarchitectural knowledge.

The benefit of register renaming is that idioms like "xor eax, eax" automatically scale to whatever the reorder buffer size is.

"xor eax, eax" REALLY means " 'malloc' a register and call it EAX". Because the reorder buffer size changes between architectures and even within an architecture (smaller chips may have smaller reorder buffers), it's best to "cut all dependencies" at the compiler level, and then simply emit code where the CPU allocates registers in whatever way is optimal for it.
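A rough way to picture the "malloc a register" idea is a toy rename table in Python. This is only a sketch: the `Renamer` class, its method names, and the unlimited supply of physical registers are all made up for illustration and are not how any particular core is built.

```python
from itertools import count

class Renamer:
    """Toy rename stage: architectural names -> physical register ids."""

    def __init__(self, arch_regs=("eax", "ebx")):
        # Assumption: the physical register file never runs out.
        self._fresh = count()
        self.table = {name: next(self._fresh) for name in arch_regs}

    def rename(self, dst, srcs=()):
        """Return (physical dst, physical srcs) for one instruction."""
        phys_srcs = [self.table[s] for s in srcs]  # read current mappings first
        self.table[dst] = next(self._fresh)        # every write gets a new register
        return self.table[dst], phys_srcs

    def zero(self, dst):
        """Recognized zeroing idiom (xor dst, dst): result is 0, so no sources."""
        return self.rename(dst, srcs=())

r = Renamer()
first_xor = r.zero("eax")               # xor eax, eax
first_add = r.rename("eax", ["eax"])    # add eax, 1
second_xor = r.zero("eax")              # xor eax, eax
second_add = r.rename("eax", ["eax"])   # add eax, 2
```

In this model the second `add` reads only the physical register handed out by the second `xor`, so nothing in it waits on the first chain even though both chains are written against the name EAX.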

The compiler doesn't care if you have a 160-entry reorder buffer (ARM A78), 224 entries (Intel Skylake), or 300+ entries (Intel Icelake). The "xor eax, eax" idiom on every dependency cut emits the optimal code for all CPUs.

----------

You should have a large enough architectural register set to perform the calculations you need... in practice, 16 to 32 registers seem to be enough.

With the "dependency cutting" paradigm (aka: "xor eax, eax" really means malloc-register), your compiler's code will scale to all future processors, no matter the size of the reorder buffer of the particular CPU it ends up running on.
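To make the scaling argument concrete, here is a deliberately crude Python model of an out-of-order window. Everything in it is an assumption for illustration: 1-cycle latency, unlimited execution units, in-order retirement, and a `window` parameter standing in for the reorder buffer. The function names `cycles` and `chains` are hypothetical.

```python
def cycles(deps, window):
    """Cycles to retire a program on a toy out-of-order core.

    deps[i] lists the earlier instructions that instruction i reads from.
    Assumes 1-cycle latency, unlimited execution units, in-order retire,
    and at most `window` instructions in flight.
    """
    exec_at, retire_at = [], []
    for i, srcs in enumerate(deps):
        # A slot opens when the instruction `window` places earlier retires.
        enter = retire_at[i - window] if i >= window else 0
        ready = max((exec_at[s] + 1 for s in srcs), default=0)
        exec_at.append(max(enter, ready))
        prev_retire = retire_at[-1] if retire_at else 0
        retire_at.append(max(exec_at[-1] + 1, prev_retire))
    return retire_at[-1]

def chains(n_chains, length, cut):
    """Build n_chains back-to-back dependency chains on one register."""
    deps = []
    for i in range(n_chains * length):
        starts_chain = i % length == 0
        if i == 0 or (starts_chain and cut):
            deps.append([])        # "xor eax, eax": no source, fresh register
        else:
            deps.append([i - 1])   # reads the previous result in eax
    return deps

cut = chains(8, 4, cut=True)       # each chain starts with a dependency cut
uncut = chains(8, 4, cut=False)    # one long chain through eax
for w in (4, 32):
    print(w, cycles(uncut, w), cycles(cut, w))
```

In this model the uncut program takes the same number of cycles at any window size, while the same instruction stream with cuts gets faster as the window grows. That is the sense in which the compiler can emit one binary and let each CPU extract whatever parallelism its own buffer allows.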

EDIT: It should be noted that on Intel Skylake / AMD Zen, "xor eax, eax" is so well optimized it's not even a micro-op. Literally zero uop execution time for that instruction.



