The limitations are very real. Dennard scaling has been dead since the mid-2000s (that is, power use per unit area has been increasing, even though energy per logic operation is still dropping at leading-edge nodes), which means an increasing fraction of all silicon has to be "dark": power-gated and only lit up for the rare accelerated workload. Additionally, recent nodes have seen very little improvement in the size of the SRAM cells used for register files and caches. So perhaps we'll be seeing relatively smaller caches per core in the future, plus the addition of eDRAM (either on-die or on a separate chiplet) as a new, slower L4 level to partially cope with that.
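To make the "dark" fraction concrete, here's the back-of-the-envelope version. The numbers are assumptions for illustration, not foundry data (0.7x linear shrink per node, flat voltage and frequency): once V stops scaling, per-transistor dynamic power ~ C·V²·f only falls with capacitance while density still roughly doubles, so power density climbs and the fraction of transistors you can switch at once under a fixed budget shrinks each generation.

```c
/* Back-of-the-envelope "dark silicon" arithmetic (illustrative only).
 * Assumptions, not foundry data:
 *   - each node is a 0.7x linear shrink, so transistor density doubles
 *   - post-Dennard: supply voltage V and clock f stay roughly flat
 *   - dynamic power per transistor ~ C*V^2*f, and only C keeps shrinking
 * Density grows ~2x per node while per-transistor power only falls ~1.4x,
 * so under a fixed power budget the fraction of transistors you can
 * switch at once shrinks every generation. */
#include <stdio.h>

int main(void) {
    const double k = 1.0 / 0.7;    /* linear scale factor per node */
    double density = 1.0;          /* transistors per unit area (normalized) */
    double p_per_t = 1.0;          /* dynamic power per transistor (normalized) */
    const double budget = 1.0;     /* fixed power budget per unit area */

    for (int gen = 0; gen <= 5; gen++) {
        double power_density = density * p_per_t;   /* if everything switches */
        double active = budget / power_density;
        if (active > 1.0) active = 1.0;
        printf("gen %d: density %5.2fx, P/transistor %4.2fx, "
               "max active fraction %3.0f%%\n",
               gen, density, p_per_t, active * 100.0);
        density *= k * k;          /* ~2x more transistors per area */
        p_per_t /= k;              /* only capacitance scales; V, f flat */
    }
    return 0;
}
```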
I'm ignorant of this space, but it seems like the obvious solution for heat dissipation is to layer lattices and not solid layers, in order to increase the overall surface area of the chip. I assume the manufacturing is too difficult...?
What if it went the other way and you got much larger die area dedicated to caches or even on-chip RAM, since that usage is relatively cheaper from a power/heat point of view? Or is the process different enough between the two that it just doesn't make sense to have them interwoven like that?
The point of SRAM, especially at the L1/L2 level, is extremely high bandwidth and extremely low latency (a few clock cycles), so it's not really an option to put those caches somewhere else. L3 and lower levels, on the other hand, can be and already are being put either on separate chiplets in the same package with an extremely fast ring interconnect, or directly on top of the die (3D stacking).
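You can see that latency cliff directly with a pointer-chase microbenchmark. A minimal sketch follows; the working-set sizes are assumptions (roughly L1 / L2 / L3 / DRAM on a typical desktop part) and clock()-based timing is crude, but the jump in nanoseconds per dependent load as the working set spills out of each cache level is exactly why L1/L2 have to stay small and close.

```c
/* Minimal pointer-chase sketch. The working-set sizes are assumptions
 * ("roughly L1 / L2 / L3 / DRAM" on a typical desktop part), and clock()
 * timing is crude -- a real benchmark wants pinning, warmup and better
 * timers -- but the jump in ns per dependent load is the point. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average nanoseconds per dependent load over a working set of
 * n_elems * sizeof(size_t) bytes. */
static double chase_ns(size_t n_elems, size_t iters) {
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return -1.0;
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    /* Sattolo's algorithm: a single big cycle, so the prefetcher cannot
     * guess the next address and every step pays the real latency. */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    size_t p = 0;
    clock_t t0 = clock();
    for (size_t n = 0; n < iters; n++) p = next[p];
    clock_t t1 = clock();
    if (p == (size_t)-1) puts("unreachable");   /* keep the loop alive */
    free(next);
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}

int main(void) {
    /* Working sets straddling typical L1 (~32 KiB), L2 (~512 KiB),
     * L3 (~16 MiB) and DRAM; exact sizes vary by CPU. */
    size_t kib[] = {16, 256, 8 * 1024, 256 * 1024};
    for (int i = 0; i < 4; i++) {
        size_t n = kib[i] * 1024 / sizeof(size_t);
        printf("%8zu KiB working set: %6.1f ns per load\n",
               kib[i], chase_ns(n, 10u * 1000 * 1000));
    }
    return 0;
}
```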
Yeah. The analogy I like to use for cache is a table at the library. Think about doing research (the old-fashioned way): you go through the library shelf by shelf and bring books back to your table to read through more closely. If you have a bigger table you can hold more books, which speeds up your lookups since you don't need to get up and go back and forth to the shelves as often.
But at some point making your table larger just defeats the purpose of the library itself. Your table becomes the new library, and you have to walk around on it and look up things in these piles of books. So you make a smaller table in the middle of the big table.
Your fundamental limitation is how small you can make a memory cell, not how big you want to make a cache. That's akin to printing the books in a smaller type size so you can fit more of them on the same-size table.
Well, sorta, since caches are just SRAM + tag logic. You can parallelize tables so that each remains fast, but it costs you power/heat. The decoder inherent to SRAM is what introduces the size/speed tradeoff.
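For anyone who hasn't seen what "SRAM + tag logic" means concretely, here's a toy lookup in C. The geometry (32 KiB, 8-way, 64-byte lines) is an assumption, just a typical L1 shape: the set index feeds the SRAM row decoder, and the eight ways' tags are compared in parallel, which is exactly the parallelism you pay for in power/heat.

```c
/* Toy lookup showing what "SRAM + tag logic" means. The geometry
 * (32 KiB, 8-way set associative, 64-byte lines) is an assumption --
 * just a typical L1 shape. The set index drives the SRAM row decoder;
 * the 8 ways of that set are read and their tags compared in parallel. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES  64
#define WAYS        8
#define CACHE_BYTES (32 * 1024)
#define NUM_SETS    (CACHE_BYTES / (LINE_BYTES * WAYS))   /* = 64 sets */

struct line { bool valid; uint64_t tag; /* + 64 bytes of data */ };
static struct line cache[NUM_SETS][WAYS];

static bool lookup(uint64_t paddr) {
    uint64_t offset = paddr % LINE_BYTES;               /* low 6 bits      */
    uint64_t index  = (paddr / LINE_BYTES) % NUM_SETS;  /* next 6 bits     */
    uint64_t tag    = paddr / (LINE_BYTES * NUM_SETS);  /* everything else */
    (void)offset;                          /* selects the byte within the line */
    for (int w = 0; w < WAYS; w++)         /* parallel comparators in hardware */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return true;                   /* hit */
    return false;                          /* miss */
}

int main(void) {
    printf("sets: %d (6 index bits, 6 offset bits, rest is tag)\n", NUM_SETS);
    printf("lookup(0x12345678): %s\n", lookup(0x12345678) ? "hit" : "miss");
    return 0;
}
```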
I was ignoring the details of how SRAM works in favour of thinking about it physically. Most of those details just affect the average cell size at the end of the day.
The other physical aspect we're dealing with is propagation delay and physical distance. That's where the library analogy really shines: if there's a minimum size for a book and a minimum size for you (the person doing the research), that corresponds roughly to minimum cell sizes and minimum wire pitch, so you're ultimately limited in the density you can fit within a given volume.
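The distance limit is easy to put numbers on. Assuming a 5 GHz clock and signals moving at roughly half the speed of light (real on-chip wires are much slower still because of RC delay), you only cover a couple of centimetres per cycle, so a round trip across a physically large cache already eats multiple cycles:

```c
/* Quick wire-delay arithmetic. Assumptions: 5 GHz clock, signals moving
 * at roughly half the speed of light in vacuum; real on-chip wires are
 * considerably slower because of RC delay, so the real budget is tighter. */
#include <stdio.h>

int main(void) {
    const double c_mm_per_s = 3.0e11;            /* speed of light, mm/s */
    const double v = 0.5 * c_mm_per_s;           /* assumed signal speed */
    const double cycle = 1.0 / 5.0e9;            /* one 5 GHz clock period */

    double one_way = v * cycle;
    printf("one-way reach per cycle:    ~%.0f mm\n", one_way);
    printf("round-trip reach per cycle: ~%.0f mm\n", one_way / 2.0);
    /* A big die is already 15-25 mm across, so an L1 that answers in a
     * few cycles has to sit right next to the load/store units. */
    return 0;
}
```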
A compiler AND processor design amateur here. (Latter in school.)
Once you have enough registers, having more means lower active utilization for any given instruction (a bad use of space, versus fast pipelined access to a cached stack) or higher levels of parallel instruction dispatch (much greater complexity, and even greater inefficiency on branch mispredictions).
Then you have to update the instruction set, which could be impossible given how tightly the register fields are already packed into current instruction encodings (see the encoding sketch at the end of this comment).
Ergo, increasing the register count is a major architecture and platform change, from hardware to software redesign, with heavy end-user impact and a fair chance of decreasing performance.
In contrast, anything that improves caching performance is a big non-disruptive win.
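To see how tight the encoding really is, take RISC-V's R-type format as a concrete example: three 5-bit register fields (rd, rs1, rs2) plus opcode/funct bits exactly fill a 32-bit word for 32 registers, so 64 registers would need 3 more bits that simply aren't there, and you're into a new encoding, new ABI, recompile-everything territory. Quick decode sketch (the sample instruction is just `add x1, x2, x3`):

```c
/* Concrete example of how tightly register numbers are packed, using the
 * RISC-V R-type layout: three 5-bit register fields plus opcode and funct
 * bits exactly fill a 32-bit word for 32 registers. 64 registers would
 * need 6-bit fields -- 3 bits the word doesn't have. */
#include <stdint.h>
#include <stdio.h>

static void decode_rtype(uint32_t insn) {
    uint32_t opcode = insn         & 0x7f;   /* bits  6:0            */
    uint32_t rd     = (insn >> 7)  & 0x1f;   /* bits 11:7  (5 bits)  */
    uint32_t funct3 = (insn >> 12) & 0x07;   /* bits 14:12           */
    uint32_t rs1    = (insn >> 15) & 0x1f;   /* bits 19:15 (5 bits)  */
    uint32_t rs2    = (insn >> 20) & 0x1f;   /* bits 24:20 (5 bits)  */
    uint32_t funct7 = (insn >> 25) & 0x7f;   /* bits 31:25           */
    printf("opcode=0x%02x rd=x%u rs1=x%u rs2=x%u funct3=%u funct7=0x%02x\n",
           opcode, rd, rs1, rs2, funct3, funct7);
}

int main(void) {
    decode_rtype(0x003100b3);   /* add x1, x2, x3 */
    return 0;
}
```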
The caches are already ~75% of the die area; you can't significantly increase that. On-die RAM is also relatively unlikely due to process differences. My best guess is more 3D cache chips: if we can get the interconnects small enough and fast enough, I could see a future where the logic is stacked on top of a dozen (physical) layers of cache.
AMD's stacked cache (3D V-Cache) is a significant capacity increase and gives a huge boost in certain gaming scenarios, to the point of a ~100% performance uplift in certain games that lean heavily on cache.
Good question - but it would have to be the kind of stacking that decreases latency, not just the kind that increases bandwidth. Maybe there is a way to achieve that.