
The limitations are very real. Dennard scaling has been dead since the mid-2000s (that is, power use per unit area has been increasing, even though energy use per logic operation is still very much dropping at leading-edge nodes), which means an increasing fraction of all silicon has to be "dark": power-gated and only used for the rare accelerated workload. Additionally, recent nodes have seen very little improvement in SRAM cell size, and SRAM is what register files and caches are built from. So perhaps we'll be seeing relatively smaller caches per core in the future, plus the addition of eDRAM (either on-die or on a separate chiplet) as a new, slower L4 level to partially cope with that.
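To put rough numbers on the dark-silicon point, here's a back-of-the-envelope sketch. The per-generation factors (2x density, only ~30% less energy per operation, fixed package power budget) are illustrative assumptions, not data for any particular node, but they show how the fraction of the chip you can keep active shrinks every generation:

    #include <stdio.h>

    /* Back-of-the-envelope dark-silicon estimate. The scaling factors are
     * illustrative assumptions, not measurements of any particular node:
     * each shrink doubles transistor density but only cuts energy per
     * operation by ~30%, while the package's power budget stays fixed. */
    int main(void) {
        double density = 1.0;   /* transistors per unit area, normalized */
        double energy  = 1.0;   /* energy per operation, normalized      */
        double budget  = 1.0;   /* power the package can dissipate       */

        for (int gen = 0; gen <= 4; gen++) {
            /* Power if everything switched flat out is density * energy;
             * the fraction you can actually light up is budget / that.  */
            double full_power = density * energy;
            double active = budget / full_power;
            if (active > 1.0) active = 1.0;
            printf("gen %d: density %4.1fx, energy/op %.2fx, active fraction %3.0f%%\n",
                   gen, density, energy, active * 100.0);
            density *= 2.0;   /* assumed density gain per generation  */
            energy  *= 0.7;   /* assumed energy saving per generation */
        }
        return 0;
    }

With those made-up factors the active fraction falls from 100% to roughly a quarter of the die after four shrinks, which is the whole "dark silicon" problem in one loop.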


I'm ignorant of this space, but it seems like the obvious solution for heat dissipation is to layer lattices and not solid layers, in order to increase the overall surface area of the chip. I assume the manufacturing is too difficult...?


That's one of the promises of 3D stacked transistors, yes.


What if it went the other way and you got much larger die area dedicated to caches or even on-chip RAM, since that usage is relatively cheaper from a power/heat point of view? Or is the process different enough between the two that it just doesn't make sense to have them interwoven like that?


The point of SRAM, especially at the L1/L2 level, is extremely high bandwidth and extremely low latency (a few clock cycles), so it is not really an option to put those caches somewhere else. L3 and, as mentioned, other lower levels can be and already are being moved, either into separate chiplets in the same package connected by an extremely fast ring interconnect, or directly on top of the die (3D stacking).
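To make the latency point concrete, here's a minimal pointer-chasing sketch (my own illustration, not from any real benchmark suite): every load depends on the previous one, so the time per load steps up as the working set outgrows L1, then L2, then L3. The sizes are just guesses at typical capacities and the results are machine-dependent.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Walk a random single-cycle permutation so every load depends on the
     * previous one; the time per load steps up as the working set outgrows
     * each cache level. Numbers are machine-dependent; this only shows the
     * shape of the curve, not anything precise. */
    static double ns_per_load(size_t n, long iters) {
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* Sattolo shuffle: one big cycle */
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile size_t sink = p; (void)sink;     /* keep the chain from being elided */
        free(next);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
    }

    int main(void) {
        /* Working sets from ~32 KiB (L1-ish) up to ~16 MiB (L3-ish or beyond). */
        for (size_t kib = 32; kib <= 16 * 1024; kib *= 8)
            printf("%6zu KiB: %.1f ns per dependent load\n",
                   kib, ns_per_load(kib * 1024 / sizeof(size_t), 10 * 1000 * 1000));
        return 0;
    }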


Yeah. The analogy for cache that I like to use is a table at the library. Think of doing research the old-fashioned way: you go through the library shelf by shelf and bring books back to your table to read through more closely. A bigger table lets you keep more books at hand, which speeds up your lookups, since you don’t need to get up and go back and forth to the shelves.

But at some point making your table larger just defeats the purpose of the library itself. Your table becomes the new library, and you have to walk around on it and look up things in these piles of books. So you make a smaller table in the middle of the big table.

Your fundamental limitation is how small you can make a memory cell, not how big you want to make a cache. That’s akin to making the books smaller print size so you can fit more on the same size table.


well, sorta, since caches are just sram+tag logic. you can parallelize tables, so that each remains fast, but it costs you power/heat. the decoder inherent to sram is what introduces the size-speed tradeoff.
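for anyone who hasn't seen it spelled out, here's a toy direct-mapped lookup showing the index decode and tag compare in question; the sizes are made up and nothing here is any real design:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy direct-mapped cache: the low address bits select an SRAM row (the
     * "decoder" step) and the stored tag is compared against the remaining
     * high bits to decide hit or miss. The sizes here are made up. */
    #define LINE_BYTES 64
    #define NUM_SETS   512                    /* 512 sets * 64 B = 32 KiB */

    struct line { bool valid; uint64_t tag; uint8_t data[LINE_BYTES]; };
    static struct line cache[NUM_SETS];

    static bool lookup(uint64_t addr) {
        uint64_t index = (addr / LINE_BYTES) % NUM_SETS;  /* which SRAM row   */
        uint64_t tag   = (addr / LINE_BYTES) / NUM_SETS;  /* everything above */
        /* A bigger cache means more rows (a wider decoder, longer wordlines)
         * and/or a wider tag compare - that is the size-speed tradeoff. */
        return cache[index].valid && cache[index].tag == tag;
    }

    int main(void) {
        cache[3].valid = true;
        cache[3].tag   = 7;
        uint64_t addr = ((7ULL * NUM_SETS) + 3) * LINE_BYTES;   /* index 3, tag 7 */
        printf("%d %d\n", lookup(addr), lookup(addr + NUM_SETS * LINE_BYTES)); /* 1 0 */
        return 0;
    }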


I was ignoring the details on how SRAM works in favour of thinking about it physically. Most of those details just affect the average cell size at the end of the day.

The other physical aspect we’re dealing with is propagation delay and physical distance. That’s where the library analogy really shines: if there’s a minimum size for a book and a minimum size for you (the person doing the research), that corresponds roughly to minimum cell sizes and minimum wire pitch, so you’re ultimately limited in the density you can fit within a given volume.


Really good analogy!


Is it possible to use big and fat CPU registers instead of cache? There might be no wasted clock cycles and no delay.


A compiler AND processor design amateur here. (Latter in school.)

Once you have enough registers, adding more means either lower active utilization for any given instruction (a bad use of space, versus fast pipelined access to a cached stack) or higher levels of parallel instruction dispatch (much greater complexity, and even greater inefficiency on branch mispredictions).

Then you have to update instruction sets, which could be impossible given how tightly register fields are packed into current instruction encodings.
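For example, RV32I packs three 5-bit register fields into each R-type instruction, so 32 architectural registers is baked into the encoding; going to 64 would need 3 more bits that a 32-bit instruction doesn't have spare. A quick decode sketch (field positions are per the RISC-V spec; the rest is just illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* RV32I R-type layout: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12]
     * rd[11:7] opcode[6:0]. Each register field is 5 bits, so 32 architectural
     * registers is baked into the 32-bit encoding; 64 registers would need
     * 3 more bits that simply aren't there. */
    int main(void) {
        uint32_t insn = 0x003100B3u;               /* add x1, x2, x3 */
        unsigned rd  = (insn >> 7)  & 0x1F;
        unsigned rs1 = (insn >> 15) & 0x1F;
        unsigned rs2 = (insn >> 20) & 0x1F;
        printf("rd=x%u rs1=x%u rs2=x%u\n", rd, rs1, rs2);   /* rd=x1 rs1=x2 rs2=x3 */
        return 0;
    }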

Ergo, increasing register banks is a major architecture & platform change from hardware to software redesign, with heavy end user impact, and a fair chance of decreasing performance.

In contrast, anything that improves caching performance is a big non-disruptive win.


What about if you use register windows or special renaming of architectural registers to internal ones? https://en.wikipedia.org/wiki/Register_window


The portion of the stack held in the L1 cache is essentially that: a shifting, fast-access working-memory area.

Then think of registers as just part of the fetch & store pipelines for staging operations on stack values.

Forth goes all in with this approach.
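A toy sketch of that staging idea, with the top of the data stack cached in a local (ideally register-allocated) variable and the rest spilled to memory, which is roughly how many threaded Forths keep their TOS. This is my own illustration, not code from any particular Forth system:

    #include <stdio.h>

    /* Toy illustration of "registers stage stack values": the top of the data
     * stack lives in a local (ideally a register) while the rest of the stack
     * sits in memory. My own sketch, not code from any particular Forth. */
    int main(void) {
        long stack[64];
        int  sp  = 0;      /* next free slot in the in-memory stack       */
        long tos = 0;      /* cached top of stack (garbage while empty)   */

        /* 2 3 +  ->  5 */
        stack[sp++] = tos; tos = 2;     /* push 2: spill old TOS, load new      */
        stack[sp++] = tos; tos = 3;     /* push 3                               */
        tos = stack[--sp] + tos;        /* +: second operand comes from memory  */

        printf("%ld\n", tos);           /* prints 5 */
        return 0;
    }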


Registers are quite expensive in space and power, because several of them have to be readable and writable at once, from many places in the core.

If you add more registers, the cost per register increases rapidly, and you very quickly hit your limits.

If you make registers wider, that's still very expensive, and you introduce extra steps to get to your data most of the time.

So no, you can't do that in a reasonable way.


Thank you!


CPU registers are built from SRAM cells or even larger flip-flops; they have the same problem.


The caches are already ~75% of the space; you can't significantly increase that. On-die RAM is also relatively unlikely due to process differences. My best guess is more 3D cache chips. If we can get the interconnects small enough and fast enough, I could see a future where the logic is stacked on top of a dozen (physical) layers of stacked cache.


AMD's stacked cache is a significant increase and gives a huge boost in certain gaming scenarios, to the point that it's a 100% increase in certain games that rely on huge caches.


stacking is a heat problem, and heat has been the PRIMARY system limit for over a decade.

2.5d is just too easy and effective - we're going to have lots more chiplets, and only the cool ones will get stacked.


Limitations in existing processes, sure. But not limitations in physics. If E=mc^2, we've got a lot of efficiencies still to find.


Fusion-based computing FTW!


I wonder if we’ll see compressed data transmission at some point.


Good question - but it would have to be the kind that decreases latency, not just the kind that reduces bandwidth requirements. Maybe there is a way to achieve that.


fast compression is way too slow.

remember, we're talking TB/s these days.


Could be useful for sparse data structures.
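For example, a toy zero-elision scheme for a 64-byte line: a bitmask marks the nonzero bytes and only those bytes get sent. Whether the pack/unpack step pays for itself at the bandwidths and latencies discussed above is exactly the open question; this is purely an illustration, not the format of any real link or cache:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy zero-elision scheme for one 64-byte line: a 64-bit mask marks the
     * nonzero bytes and only those bytes follow. Sparse data shrinks a lot;
     * dense data pays 8 bytes of overhead. Purely illustrative - not the
     * format of any real link, cache, or memory controller. */
    size_t compress_line(const uint8_t line[64], uint8_t out[72]) {
        uint64_t mask = 0;
        size_t   n    = 8;                     /* reserve room for the mask */
        for (int i = 0; i < 64; i++) {
            if (line[i]) {
                mask |= (uint64_t)1 << i;
                out[n++] = line[i];
            }
        }
        for (int i = 0; i < 8; i++)            /* store the mask little-endian */
            out[i] = (uint8_t)(mask >> (8 * i));
        return n;                              /* bytes to send: 8..72 */
    }

    int main(void) {
        uint8_t line[64] = {0}, out[72];
        line[5] = 0xAA; line[40] = 0x07;       /* a very sparse line */
        printf("%zu bytes instead of 64\n", compress_line(line, out));  /* 10 */
        return 0;
    }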



