
The limitations are very real. Dennard scaling has been dead since the mid-2000s (that is, power use per unit area has been increasing, even though energy use per logic operation is still very much dropping at leading-edge nodes), which means an increasing fraction of all silicon has to be "dark": power-gated and only used for the rare accelerated workload. Additionally, recent nodes have seen very little improvement in SRAM cell size, and SRAM is what register files and caches are built from. So perhaps we'll be seeing relatively smaller caches per core in the future, plus the addition of eDRAM (either on-die or on a separate chiplet) as a new, slower L4 level to partially cope with that.
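To put rough numbers on the dark-silicon point, here's a back-of-the-envelope sketch. The per-generation factors (2x density, only ~30% less energy per operation, fixed package power budget) are illustrative assumptions, not data for any particular node, but they show how the fraction of the chip you can keep active shrinks every generation:

    #include <stdio.h>

    /* Back-of-the-envelope dark-silicon estimate. The scaling factors are
     * illustrative assumptions, not measurements of any particular node:
     * each shrink doubles transistor density but only cuts energy per
     * operation by ~30%, while the package's power budget stays fixed. */
    int main(void) {
        double density = 1.0;   /* transistors per unit area, normalized */
        double energy  = 1.0;   /* energy per operation, normalized      */
        double budget  = 1.0;   /* power the package can dissipate       */

        for (int gen = 0; gen <= 4; gen++) {
            /* Power if everything switched flat out is density * energy;
             * the fraction you can actually light up is budget / that.  */
            double full_power = density * energy;
            double active = budget / full_power;
            if (active > 1.0) active = 1.0;
            printf("gen %d: density %4.1fx, energy/op %.2fx, active fraction %3.0f%%\n",
                   gen, density, energy, active * 100.0);
            density *= 2.0;   /* assumed density gain per generation  */
            energy  *= 0.7;   /* assumed energy saving per generation */
        }
        return 0;
    }

With those made-up factors the active fraction falls from 100% to roughly a quarter of the die after four shrinks, which is the whole "dark silicon" problem in one loop.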


I'm ignorant of this space, but it seems like the obvious solution for heat dissipation is to layer lattices and not solid layers, in order to increase the overall surface area of the chip. I assume the manufacturing is too difficult...?


That's one of the promises of 3D stacked transistors, yes.


What if it went the other way and you got much larger die area dedicated to caches or even on-chip RAM, since that usage is relatively cheaper from a power/heat point of view? Or is the process different enough between the two that it just doesn't make sense to have them interwoven like that?


The point of SRAM, especially at the L1/L2 level, is extremely high bandwidth and extremely low latency (a few clock cycles), so it is not really an option to put those caches somewhere else. L3 and, as mentioned, other lower levels can be and already are being moved, either into separate chiplets in the same package connected by an extremely fast ring interconnect, or directly on top of the die (3D stacking).
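To make the latency point concrete, here's a minimal pointer-chasing sketch (my own illustration, not from any real benchmark suite): every load depends on the previous one, so the time per load steps up as the working set outgrows L1, then L2, then L3. The sizes are just guesses at typical capacities and the results are machine-dependent.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Walk a random single-cycle permutation so every load depends on the
     * previous one; the time per load steps up as the working set outgrows
     * each cache level. Numbers are machine-dependent; this only shows the
     * shape of the curve, not anything precise. */
    static double ns_per_load(size_t n, long iters) {
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* Sattolo shuffle: one big cycle */
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile size_t sink = p; (void)sink;     /* keep the chain from being elided */
        free(next);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
    }

    int main(void) {
        /* Working sets from ~32 KiB (L1-ish) up to ~16 MiB (L3-ish or beyond). */
        for (size_t kib = 32; kib <= 16 * 1024; kib *= 8)
            printf("%6zu KiB: %.1f ns per dependent load\n",
                   kib, ns_per_load(kib * 1024 / sizeof(size_t), 10 * 1000 * 1000));
        return 0;
    }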


Yeah. The analogy for cache that I like to use is a table at the library. Think of doing research the old-fashioned way: you go through the library shelf by shelf and bring books back to your table to read through more closely. A bigger table lets you keep more books at hand, which speeds up your lookups, since you don’t need to get up and go back and forth to the shelves.

But at some point making your table larger just defeats the purpose of the library itself. Your table becomes the new library, and you have to walk around on it and look up things in these piles of books. So you make a smaller table in the middle of the big table.

Your fundamental limitation is how small you can make a memory cell, not how big you want to make a cache. That’s akin to making the books smaller print size so you can fit more on the same size table.


well, sorta, since caches are just sram+tag logic. you can parallelize tables, so that each remains fast, but it costs you power/heat. the decoder inherent to sram is what introduces the size-speed tradeoff.
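for anyone who hasn't seen it spelled out, here's a toy direct-mapped lookup showing the index decode and tag compare in question; the sizes are made up and nothing here is any real design:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy direct-mapped cache: the low address bits select an SRAM row (the
     * "decoder" step) and the stored tag is compared against the remaining
     * high bits to decide hit or miss. The sizes here are made up. */
    #define LINE_BYTES 64
    #define NUM_SETS   512                    /* 512 sets * 64 B = 32 KiB */

    struct line { bool valid; uint64_t tag; uint8_t data[LINE_BYTES]; };
    static struct line cache[NUM_SETS];

    static bool lookup(uint64_t addr) {
        uint64_t index = (addr / LINE_BYTES) % NUM_SETS;  /* which SRAM row   */
        uint64_t tag   = (addr / LINE_BYTES) / NUM_SETS;  /* everything above */
        /* A bigger cache means more rows (a wider decoder, longer wordlines)
         * and/or a wider tag compare - that is the size-speed tradeoff. */
        return cache[index].valid && cache[index].tag == tag;
    }

    int main(void) {
        cache[3].valid = true;
        cache[3].tag   = 7;
        uint64_t addr = ((7ULL * NUM_SETS) + 3) * LINE_BYTES;   /* index 3, tag 7 */
        printf("%d %d\n", lookup(addr), lookup(addr + NUM_SETS * LINE_BYTES)); /* 1 0 */
        return 0;
    }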


I was ignoring the details on how SRAM works in favour of thinking about it physically. Most of those details just affect the average cell size at the end of the day.

The other physical aspect we’re dealing with is propagation delay and physical distance. That’s where the library analogy really shines: if there’s a minimum size for a book and a minimum size for you (the person doing the research), that corresponds roughly to minimum cell sizes and minimum wire pitch, so you’re ultimately limited in the density you can fit within a given volume.


Really good analogy!


Is it possible to use big and fat CPU registers instead of cache? There might be no wasted clock cycles and no delay.


A compiler AND processor design amateur here. (Latter in school.)

Once you have enough registers, adding more means either lower active utilization for any given instruction (a bad use of space, versus fast pipelined access to a cached stack) or higher levels of parallel instruction dispatch (much greater complexity, and even greater inefficiency on branch mispredictions).

Then you have to update instruction sets, which could be impossible given how tightly register fields are packed into current instruction encodings.
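For example, RV32I packs three 5-bit register fields into each R-type instruction, so 32 architectural registers is baked into the encoding; going to 64 would need 3 more bits that a 32-bit instruction doesn't have spare. A quick decode sketch (field positions are per the RISC-V spec; the rest is just illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* RV32I R-type layout: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12]
     * rd[11:7] opcode[6:0]. Each register field is 5 bits, so 32 architectural
     * registers is baked into the 32-bit encoding; 64 registers would need
     * 3 more bits that simply aren't there. */
    int main(void) {
        uint32_t insn = 0x003100B3u;               /* add x1, x2, x3 */
        unsigned rd  = (insn >> 7)  & 0x1F;
        unsigned rs1 = (insn >> 15) & 0x1F;
        unsigned rs2 = (insn >> 20) & 0x1F;
        printf("rd=x%u rs1=x%u rs2=x%u\n", rd, rs1, rs2);   /* rd=x1 rs1=x2 rs2=x3 */
        return 0;
    }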

Ergo, increasing register banks is a major architecture & platform change from hardware to software redesign, with heavy end user impact, and a fair chance of decreasing performance.

In contrast, anything that improves caching performance is a big non-disruptive win.


What about if you use register windows or special renaming of architectural registers to internal ones? https://en.wikipedia.org/wiki/Register_window


The portion of the stack held in the L1 cache is essentially that: a shifting, fast-access working-memory area.

Then think of registers as just part of the fetch & store pipelines for staging operations on stack values.

Forth goes all in with this approach.
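A toy sketch of that staging idea, with the top of the data stack cached in a local (ideally register-allocated) variable and the rest spilled to memory, which is roughly how many threaded Forths keep their TOS. This is my own illustration, not code from any particular Forth system:

    #include <stdio.h>

    /* Toy illustration of "registers stage stack values": the top of the data
     * stack lives in a local (ideally a register) while the rest of the stack
     * sits in memory. My own sketch, not code from any particular Forth. */
    int main(void) {
        long stack[64];
        int  sp  = 0;      /* next free slot in the in-memory stack       */
        long tos = 0;      /* cached top of stack (garbage while empty)   */

        /* 2 3 +  ->  5 */
        stack[sp++] = tos; tos = 2;     /* push 2: spill old TOS, load new      */
        stack[sp++] = tos; tos = 3;     /* push 3                               */
        tos = stack[--sp] + tos;        /* +: second operand comes from memory  */

        printf("%ld\n", tos);           /* prints 5 */
        return 0;
    }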


Registers are quite expensive in space and power, because several of them have to be readable and writable at once, from many places in the core.

If you add more registers, the cost per register increases rapidly, and you very quickly hit your limits.

If you make registers wider, that's still very expensive, and you introduce extra steps to get to your data most of the time.

So no, you can't do that in a reasonable way.


Thank you!


CPU registers are built from SRAM cells or even larger flip-flops; they have the same problem.


The caches are already ~75% of the space; you can't significantly increase that. On-die RAM is also relatively unlikely due to process differences. My best guess is more 3D cache chips. If we can get the interconnects small enough and fast enough, I could see a future where the logic is stacked on top of a dozen (physical) layers of stacked cache.


AMD's stacked cache is a significant increase and gives a huge boost in certain gaming scenarios, to the point that it's a 100% increase in certain games that rely on huge caches.


stacking is a heat problem, and heat has been the PRIMARY system limit for over a decade.

2.5d is just too easy and effective - we're going to have lots more chiplets, and only the cool ones will get stacked.


Limitations in existing processes, sure. But not limitations in physics. If E=mc^2, we've got a lot of efficiencies still to find.


Fusion-based computing FTW!


I wonder if we’ll see compressed data transmission at some point.


Good question - but it would have to be the kind that decreases latency, not just the kind that reduces bandwidth requirements. Maybe there is a way to achieve that.


fast compression is way too slow.

remember, we're talking TB/s these days.


Could be useful for sparse data structures.
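For example, a toy zero-elision scheme for a 64-byte line: a bitmask marks the nonzero bytes and only those bytes get sent. Whether the pack/unpack step pays for itself at the bandwidths and latencies discussed above is exactly the open question; this is purely an illustration, not the format of any real link or cache:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy zero-elision scheme for one 64-byte line: a 64-bit mask marks the
     * nonzero bytes and only those bytes follow. Sparse data shrinks a lot;
     * dense data pays 8 bytes of overhead. Purely illustrative - not the
     * format of any real link, cache, or memory controller. */
    size_t compress_line(const uint8_t line[64], uint8_t out[72]) {
        uint64_t mask = 0;
        size_t   n    = 8;                     /* reserve room for the mask */
        for (int i = 0; i < 64; i++) {
            if (line[i]) {
                mask |= (uint64_t)1 << i;
                out[n++] = line[i];
            }
        }
        for (int i = 0; i < 8; i++)            /* store the mask little-endian */
            out[i] = (uint8_t)(mask >> (8 * i));
        return n;                              /* bytes to send: 8..72 */
    }

    int main(void) {
        uint8_t line[64] = {0}, out[72];
        line[5] = 0xAA; line[40] = 0x07;       /* a very sparse line */
        printf("%zu bytes instead of 64\n", compress_line(line, out));  /* 10 */
        return 0;
    }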



