It’s fun to be just a curious bystander for many years in this industry.
Every now and then Moore’s law hits a roadblock. Some experts see that as a clear sign that it’s reaching its end. Others that it’s already dead, because actually, the price per transistor has increased. Others that it’s physics, we can approach Y but after X nm it can’t be done.
Then you read others that claim that Intel has just been lazy enjoying its almost monopoly for the past decade and was caught off guard by TSMC’s ultraviolet prowess. Or people who really know how the sausage is made, like Jim Keller, enthusiastically stating that we are nowhere near any major fundamental limitation and can expect 1000X improvement in the years to come at least.
Anyway, it’s really fun to watch, like I said. Hard to think of a field with such rollercoaster-like forecasting while still delivering unparalleled growth in such a steady state for decades.
The limitations are very real. Dennard scaling has been dead since the mid-2000s (that is, power use per unit area has been increasing, even though energy use per logic operation is very much dropping at leading edge nodes), which means an increasing fraction of all silicon has to be "dark": power-gated and only used for the rare accelerated workload. Additionally, recent nodes have seen very little improvement in the size of the SRAM cells used for register files and caches. So perhaps we'll be seeing relatively smaller caches per core in the future, and the addition of eDRAM (either on-die or on a separate chiplet) as a new, slower L4 level to partially cope with that.
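To make the Dennard-scaling point concrete, here is a rough back-of-the-envelope sketch in Python (illustrative scaling exponents only, not measured process data): under classic Dennard scaling, voltage shrinks along with the dimensions and power density stays flat; once voltage stops scaling, power density grows with each shrink, which is what forces dark silicon.

    # Rough sketch of why post-Dennard scaling forces "dark silicon".
    # All numbers are illustrative scaling factors, not real process data.

    def power_density(scale, voltage_scale):
        """Relative power density after shrinking linear dimensions by `scale` (>1).

        Dynamic power per transistor ~ C * V^2 * f, with C ~ 1/scale,
        f ~ scale, and area per transistor ~ 1/scale^2.
        """
        power_per_transistor = (1 / scale) * voltage_scale ** 2 * scale
        area_per_transistor = 1 / scale ** 2
        return power_per_transistor / area_per_transistor

    s = 1.4  # one "node" of linear shrink
    print(f"Dennard (V scales too): {power_density(s, 1 / s):.2f}x power density")  # ~1.0x
    print(f"Post-Dennard (V stuck): {power_density(s, 1.0):.2f}x power density")    # ~s^2 ~ 2x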
I'm ignorant of this space, but it seems like the obvious solution for heat dissipation is to layer lattices and not solid layers, in order to increase the overall surface area of the chip. I assume the manufacturing is too difficult...?
What if it went the other way and you got much larger die area dedicated to caches or even on-chip RAM, since that usage is relatively cheaper from a power/heat point of view? Or is the process different enough between the two that it just doesn't make sense to have them interwoven like that?
The point of SRAM, especially at the L1/L2 level, is having extremely high bandwidth and extremely low latency (a few clock cycles). So it is not really an option to put them somewhere else, although L3 and, as mentioned, other lower cache levels can be and already are being put either on separate chiplets on the same package with an extremely fast ring, or directly on top of the die (3D stacking).
Yeah. The analogy for cache that I like to use is a table at the library. Think about doing research (the old-fashioned way) by looking through a library shelf by shelf and bringing books to your table to read through more closely. If you have a bigger table you can keep more books out, which can speed up your lookup times since you don't need to get up and go back and forth to the shelves.
But at some point making your table larger just defeats the purpose of the library itself. Your table becomes the new library, and you have to walk around on it and look up things in these piles of books. So you make a smaller table in the middle of the big table.
Your fundamental limitation is how small you can make a memory cell, not how big you want to make a cache. That’s akin to making the books smaller print size so you can fit more on the same size table.
well, sorta, since caches are just sram+tag logic. you can parallelize tables, so that each remains fast, but it costs you power/heat. the decoder inherent to sram is what introduces the size-speed tradeoff.
I was ignoring the details on how SRAM works in favour of thinking about it physically. Most of those details just affect the average cell size at the end of the day.
The other physical aspect we’re dealing with is propagation delay and physical distance. That’s where the library analogy really shines: if there’s a minimum size to a book and a minimum size of you (the person doing the research) this corresponds roughly to minimum cell sizes and minimum wire pitch, so you’re ultimately limited in the density you can fit within a given volume.
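A toy model of the size-speed tradeoff being discussed here (my own rough sketch, not a CACTI-class simulation): if access time is dominated by wire delay across the array, it grows roughly with the square root of capacity, so every quadrupling of the cache costs about 2x in worst-case wire distance.

    import math

    # Toy wire-delay model: access distance ~ sqrt(capacity), so a cache that
    # is 4x bigger has roughly 2x the worst-case wire length to cross.
    # Purely illustrative; real caches are banked and pipelined to hide this.

    def relative_access_distance(capacity_kib, base_kib=32):
        return math.sqrt(capacity_kib / base_kib)

    for kib in (32, 128, 1024, 32 * 1024):   # L1-ish sizes up to big L3-ish sizes
        print(f"{kib:>6} KiB: ~{relative_access_distance(kib):.1f}x the wire distance of 32 KiB")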
A compiler AND processor design amateur here. (Latter in school.)
Once you have enough registers, having more means lower active utilization for any given instruction (bad use of space, vs. fast pipelined access to a cached stack) or higher levels of parallel instruction dispatch (much greater complexity, and even greater inefficiency on branch misses).
Then you have to update instruction sets, which could be impossible given how tightly they fit in current instruction sizes.
Ergo, increasing register banks is a major architecture & platform change from hardware to software redesign, with heavy end user impact, and a fair chance of decreasing performance.
In contrast, anything that improves caching performance is a big non-disruptive win.
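To make the earlier instruction-encoding point concrete, here is a quick sketch (assuming a generic fixed 32-bit, three-operand format; the numbers are illustrative, not any specific ISA): every doubling of the architectural register count costs one more bit per operand field, and with three operands per instruction those bits run out fast.

    import math

    # Bits consumed by register operand fields in a 3-operand instruction
    # (dst, src1, src2) for different architectural register counts.
    # Illustrative only; real ISAs also spend bits on opcodes, immediates, etc.

    INSTRUCTION_BITS = 32
    OPERANDS = 3

    for num_regs in (16, 32, 64, 128, 256):
        bits_per_field = math.ceil(math.log2(num_regs))
        total = bits_per_field * OPERANDS
        print(f"{num_regs:>3} registers: {bits_per_field} bits/field, "
              f"{total} of {INSTRUCTION_BITS} bits just for register names")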
the caches are already ~75% of the space. you can't significantly increase that. On die ram is also relatively unlikely due to process differences. my best guess is more 3d cache chips. if we can get the interconnects small enough and fast enough, I could see a future where the logic is stacked on top of a dozen (physical) layers of stacked cache
AMD stacked cache is a significant increase and gives a huge boost in certain gaming scenarios, to the point that it's a 100% increase in certain games that rely on huge caches
Good question - but it would have to be the kind that decreases latency, not one that decreases bandwidth. Maybe there is a way to achieve that.
>It’s fun to be just a curious bystander for many years in this industry.
> Every now and then Moore’s law hits a roadblock. Some experts see that as a clear sign that it’s reaching its end......
That is just mainstream reporting.
If one actually went and read the paper being referred to, or looked at what the context was, it was always the same thing. It was all about the economics, all the way back from the early 90s. We can't do node X because it would be too expensive to sustain it at a node every two years.
The smartphone era (referring to post-iPhone launch) essentially meant we ship an additional ~2 billion pocket computers every year, including tablets. That is 5x the most optimistic projection for the traditional PC model at 400M / year (which we never reached). And that is ignoring the server market, network market, GPU market, AI market, etc. In terms of transistors and revenue or profits, the whole TAM (Total Addressable Market) went up at least 10x more than those projections. Which is essentially what scaled us from 22nm to now 3nm, and all the way to 2nm and 1.4nm. And my projection of 1nm by 2030 as well. I even wrote on HN in ~2015 that I had a hard time seeing how we could sustain this post 3nm, at a time when a trillion dollar company was thought to be impossible.
On the other side of things, the cost projection for the next node (e.g. 2nm) and the next-next node (e.g. 1.4nm) was always higher than how it actually turned out. As with any large project, it is better to ask for and project more in case shit hits the fan (Intel 10nm). But every time TSMC has executed so well.
So as you can see there is a projection mismatch at both ends. Which is why the clear sign of progress coming to end keeps being wrong.
> and can expect 1000X improvement in the years to come at least.
I just want to state that this figure keeps being thrown around. It was Jim Keller comparing the then-current Intel 14nm (which is somewhere close to TSMC N10) to a hypothetical physics limit. At 3nm we are at least 4x past that. Depending on how you want to measure it, we could be down to less than 100x of remaining headroom by 2030.
The AI trend could carry us forward to maybe 2035. But we don't have another product category like the iPhone. Servers at hyperscalers are already at a scale where growth is slowing. We will again need to substantially lower the development cost of leading nodes (my bet is on the AI / software side) and find some product that continues to grow the TAM. Maybe autonomous vehicles will finally be a thing by the 2030s? (I doubt it, but just throwing in some ideas.)
I remember reading around the 300nm transition that Moore's law was all over because of wavelengths and physics. No one was talking about multiple patterning, probably because it was prohibitively expensive. Inconceivable, much like trillion dollar companies in the early 2000s.
I remember a quote from von Braun about how he had learned to use the word 'impossible' with the greatest caution.
When you have a significant fraction of the GDP of a superpower dedicated to achieving some crazy engineering task, it almost certainly can be done. And I wouldn't bet against our hunger for better chips.
> It was all about the economics, all the way back from the early 90s. We can't do node X because it would be too expensive to sustain it at a node every two years.
Totally agree.
> The AI trend could carry us forward to maybe 2035. But we don't have another product category like the iPhone.
There will be fancier iPhones with on board offline Large Language Models and other Foundation Models to talk to, solving all kinds of tasks for you that would require a human assistant today.
However there is a big difference between those "~2 Billions Pocket computer every year including Tablet" and regular computers, so to speak.
They are mostly programmed in managed languages, where the respective runtimes and OS collaborate in order to distribute the computing across all available cores in the best way possible, with little intervention required from the developer's side.
Additionally, the OS frameworks and language runtimes collaborate in the best way to take advantage of each specific set of CPU capabilities in an almost transparent way.
Quite different from the regular POSIX and Win32 applications coded in C and C++, where everything needs to be explicitly taken care of, which is part of what prevents most of the cool CPU approaches from taking off, sitting there idle most of the time.
> They are mostly programmed in managed languages, where the respective runtimes and OS collaborate in order to distribute the computing across all available cores in the best way possible, with little intervention required from the developer's side.
I was under the impression that distributing workloads across many CPU cores (or HW threads) is done at the process and thread level by the OS? That gives managed and unmanaged languages the same benefits.
Managed languages provide higher level primitives that make it easier to create a multi-threaded application. But isn't that still manually coded in the mainstream managed languages?
I'm thinking of inherently CPU-intensive custom workloads. UI rendering and IO operations become automatically distributed with little intervention.
Or am I missing something, where there is "little intervention required from the developer's side" to create multi-threaded apps?
You are missing the part that ART, Swift/Objective-C runtime, and stuff like Grand Central Dispatch also take part in the decision process.
So the schedulers can decide in a more transparent way what runs where, especially on the Android side, where the on-device JIT/AOT compilers are part of the loop.
Additionally, there is more effort on having the toolchains exploit SIMD capabilities, whereas at the C and C++ level one is expected to write that code explicitly.
Yes, auto-vectorization isn't as good as writing the code explicitly, however the latter implies that only a niche set of developers actually care to write any of it.
Hence why frameworks like Accelerate exist, even if a JIT isn't part of the picture, the framework takes the best path depending on available hardware.
Likewise, higher level managed frameworks offer a better distribution of the parallel processing across CPU, GPU or NPU, which again on classical UNIX/Win32 in C and C++ has to be explicitly programmed for.
Such higher level frameworks can of course also be provided in such languages, e.g. CUDA and SYCL, however then we start discussing programmer culture and whether such tooling gets adopted in classical LOB applications.
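As a rough analogy in Python (numpy standing in here for the kind of framework being described, not for Accelerate itself): the developer writes one high-level array operation and the library dispatches to whatever optimized SIMD kernels the CPU it finds itself on supports, instead of the developer writing vector code explicitly.

    import numpy as np

    # Explicit scalar loop: the programmer spells out every element operation.
    def saxpy_loop(a, x, y):
        out = [0.0] * len(x)
        for i in range(len(x)):
            out[i] = a * x[i] + y[i]
        return out

    # Framework path: one array expression; numpy's prebuilt kernels pick a
    # vectorized code path based on runtime CPU feature detection.
    def saxpy_numpy(a, x, y):
        return a * x + y

    x = np.random.rand(100_000)
    y = np.random.rand(100_000)
    print(np.allclose(saxpy_loop(2.0, x, y), saxpy_numpy(2.0, x, y)))  # True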
> ART, Swift/Objective-C runtime, and stuff like Grand Central Dispatch
I don't know these, but from a quick googling it still looks like explicit multi-threading? Albeit with higher level primitives than in older languages, but still explicit?
> auto-vectorization
I'm not sure I see a hard dividing line between older languages and managed ones as far as auto-vectorization? Sure, a higher-level language might make it easier for the compiler since it knows more about potential side effects, but simple and local C code doesn't have any side effects either.
> with little intervention required from the developer's side
> Hence why frameworks like Accelerate exist, even if a JIT isn't part of the picture
Accelerate looks nice, but it still looks like it has to be called explicitly in the user code?
> Likewise, higher level managed frameworks offer a better distribution of the parallel processing across CPU, GPU or NPU, which again on classical UNIX/Win32 in C and C++ has to be explicitly programmed for.
I'm not sure I understand, can you give more explicit examples?
My point here isn't that managed languages don't give big benefits over C. I prefer Python and C# when those can be used.
It's more that I don't see "automatic parallel processing" as a solved problem?
Sure, we get better and better primitives for multi-threading, and there are more and more high-level parallel libraries like you mentioned. But for most cases, the programmer still has to explicitly design the application to take advantage of multiple cores.
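As a small illustration of that last point (a sketch with a purely hypothetical CPU-bound workload): even with nice high-level primitives, the programmer still has to decide that the work is parallel, chunk it, and pick the right executor; the runtime does not spread a CPU-bound loop across cores on its own.

    from concurrent.futures import ProcessPoolExecutor

    # Hypothetical CPU-bound task: the runtime won't parallelize this loop for
    # you; you explicitly chunk the range and hand it to a pool of workers.

    def count_primes(bounds):
        lo, hi = bounds
        def is_prime(n):
            if n < 2:
                return False
            return all(n % d for d in range(2, int(n ** 0.5) + 1))
        return sum(1 for n in range(lo, hi) if is_prime(n))

    if __name__ == "__main__":
        chunks = [(i, i + 50_000) for i in range(0, 200_000, 50_000)]
        with ProcessPoolExecutor() as pool:          # explicit choice of executor
            total = sum(pool.map(count_primes, chunks))
        print("primes below 200000:", total)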
It's predominantly TSMC. I do get quite tired and sometimes annoyed when 99.99999% of the internet, including HN, states it is just "ASML". As if buying those TwinScan machines would be enough to make you the world's best leading edge foundry.
Remember,
1. TSMC didn't beat Intel because they had the newer EUV first. They beat Intel before the whole thing started.
2. If having EUV machines from ASML were enough, Samsung would have been the 2nd, given they are already using, or will use, EUV for NAND and DRAM. And yet they are barely competing.
3. It is not like GlobalFoundries dropped out of the leading edge node race for no reason.
4. TSMC has always managed to work around any roadblock when ASML failed to deliver on its promises on time.
5. Quoting the CEO of ASML, half jokingly but also half true: "Don't ask us how those EUV machines are doing. Ask TSMC, they know that thing better than we do."
Of course there is a large number of small companies around the whole pure play foundry business, which TSMC's ex-CEO calls the Grand Alliance. You need every party to perform well for it to happen. This is somewhat different from Samsung and Intel, both of which are (more or less) much more vertically integrated.
arguably, the current race is down to TSMC making the right decision on hi-NA EUV (ie, to run with low-NA). it's not as if Intel couldn't have acquired EUV, they just chose not to.
It’s a massive supply chain, so, yes, both. But also a hundred other companies. TSMC and other foundries bring together many technologies from many companies (and no doubt a lot of their own) to ship a full foundry solution (design-technology-cooptimization, masks, lithography, packaging, etc).
Aren't Intel, TSMC and Samsung all customers (and investors) of ASML, which is actually the manufacturer and developer of the EUV (extreme ultraviolet) machines this refers to? Basically, if anything, they might have a slight exclusivity deal, but given the ownership structure you can imagine that this will not really affect anything in the long run. With the willingness to spend the money on new nodes, they will have the technology too.
As Jim Keller himself famously put it, Moore's law is still fine. Furthermore, the number of people predicting the end of Moore's law doubles every 18 months, thus following Moore's law itself.
It is fun to watch and keep track of - And keeping in mind it's also been an insane amount of work by an insane number of people with an insane amount of budget thrown at the problems. You can do quite a bit in software "as a hobby" - and this field is not it.
I think one of the interesting takeaways here should be that they have a 48 - 50nm "device pitch", which is to say that even though the transistors are small, in the XY plane the pitch widths are much larger than "5nm" or "3nm". (People familiar with chip production realize this, but too often people without a very deep understanding of chip production are misled into thinking you can put down transistors 5nm apart from each other.)
So from a density perspective, a perhaps 30 - 40% gain in overall number of transistors in the same space.
Looking at the Intel inverter design, it looks like if they were willing to double the depth they could come up with a really compact DRAM cell. A chiplet with 8 GB of ECC DDR memory on it would be a useful thing both for their processors and their high end FPGA architectures.
General question about semiconductors: Why is there so much emphasis on the density of transistors rather than purely on the costs of production (compute/$)? CPUs aren't particularly large. My computer's CPU may be just a few tablespoons in volume. Hence, is compute less useful if it's spread out (e.g., due to communication speeds)?
That's only if you needed a signal to cross the whole chip in one cycle. There's no such limitation preventing a 1 foot wide chip from being filled with 5ghz cores on an appropriate ring bus.
You can make X lower by reducing the frequency (= having each cycle be longer)
But apart for that, the main reason big chips would clock slower is power, not timing. If you have a lot of transistors all switching on a high voltage so that the frequency is high, you get molten metal and the magic smoke leaves.
Big chips aren't one big stage where light travels from one side to the other. But they are giant weaves of heating elements that can't all run fast all of the time
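For a feel of the numbers (rough sketch using an idealized signal speed of 0.5c; real on-chip wires are RC-limited and much slower): at 5 GHz a cycle is 200 ps, in which a signal covers at best a few centimeters, so a foot-wide chip clearly can't be one synchronous clock domain, but per-core or ring-bus-segment distances are fine.

    # Rough timing numbers for "can a signal cross the chip in one cycle?"
    # Idealized: assumes signal speed ~ 0.5c; real RC-limited wires are far slower.

    C = 3.0e8                 # speed of light, m/s
    SIGNAL_SPEED = 0.5 * C

    for freq_ghz in (1, 3, 5):
        cycle_s = 1 / (freq_ghz * 1e9)
        reach_cm = SIGNAL_SPEED * cycle_s * 100
        print(f"{freq_ghz} GHz: cycle = {cycle_s * 1e12:.0f} ps, "
              f"ideal reach per cycle ~ {reach_cm:.1f} cm")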
cache latency is definitely part of what limits core clock. you're not going to have a good time if your L1 latency is, say 10 clocks. not to mention the fact that register files are not much different than SRAM (therefore cache-like).
You could always purchase a multi CPU system (effectively what you're suggesting) from several years ago for much cheaper than modern hardware. If you're using it regularly though, the electrical cost will eventually eat away any money savings vs the same computational power in a modern single CPU.
With the way solar/wind + batteries are bringing electricity prices down, the cost per unit of compute will still come down even as Moore's law slows. Looking at current trends, running today's processors 10 years from now could cost just 10-12% of what it costs now in electricity.
A factory makes transistors, and if you move to the next 'node', you make twice as many. If you do an amazing job, you might reduce cost 10%.
So by far the best way to maximize value in semiconductors is to enable shrink.
But you also just don't hear it in the popular or even engineering press. Most manufacturers and designers look at a PPAC curve (power, performance, area, cost) and find optimal design points.
As for spreading it out: the unit of production isn't a wafer, it is a lithographic field, which is roughly 25*35mm. You can't practically 'spread out' much more (ok, you sort of can with field stitching, but that is really expensive).
Because when you make it denser, you can cut the CPU into smaller parts, which decreases costs
when you make it less dense, it can clock up higher, but you will have fewer cores per mm^2
AMD went with both approaches, where their hybrid CPU will have densely packed low speed Zen 4C cores and some high speed Zen 4 cores to boost at the highest frequency
Increasing density has caused chip cost per FLOP/s to decrease exponentially over the last decades. But nowadays the price per transistor doesn't go down as fast with increased density like it used to.
E.g. new Nvidia GPUs are getting smaller for the same price, which means they are getting more expensive for the same size. At some point, the price per transistor will actually increase. Then Moore's Law (the exponential increase in transistor density) will probably stop, simply because it won't be economical to produce denser chips at a higher price per transistor. (Maybe the increased power efficiency will still make density scaling worth it for a little while longer, but probably not a lot longer.)
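A toy cost-per-transistor calculation (all numbers are made-up placeholders, just to show the mechanics, not actual foundry pricing): density gains only translate into cheaper transistors if the wafer cost doesn't rise faster than the density does.

    # Toy model: cost per transistor = wafer cost / (good dies * transistors per die).
    # All figures below are illustrative placeholders.

    def cost_per_transistor(wafer_cost_usd, dies_per_wafer, yield_frac, transistors_per_die):
        good_dies = dies_per_wafer * yield_frac
        return wafer_cost_usd / (good_dies * transistors_per_die)

    old_node = cost_per_transistor(10_000, 600, 0.85, 10e9)
    new_node = cost_per_transistor(20_000, 600, 0.75, 17e9)  # 1.7x density, 2x wafer cost

    print(f"old node: {old_node * 1e9:.2f} nano-dollars per transistor")
    print(f"new node: {new_node * 1e9:.2f} nano-dollars per transistor")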
> this is because NVidia changed their pricing strategy to decouple it from that
Because neither AMD nor Intel can come within striking distance of Nvidia's flagships, and seeing how their silicon flies off the shelves, they have also adjusted their pricing to match their relative performance to Nvidia.
In addition to the answers already given, there are defects during the process, and they are more likely to render your chip useless the larger your chip is. This is true for smaller chips as well, and often the design can tolerate a defective component, but you prefer to minimize defects per chip.
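The defect argument is easy to quantify with the classic Poisson yield model, yield ≈ exp(-defect_density * area). A quick sketch (the defect density here is a made-up placeholder, not any fab's real number):

    import math

    # Poisson yield model: probability that a die has zero fatal defects.
    D0 = 0.1  # defects per cm^2, placeholder value

    for area_cm2 in (1, 2, 4, 8):
        y = math.exp(-D0 * area_cm2)
        print(f"{area_cm2} cm^2 die: ~{y:.0%} yield")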
Density is one of the main ways to get cost savings. But there are others too, and there's also a lot of hype around them. Chiplets for example. Or CXL for memory.
Personal usage still relies on fast single threaded performance. As far as business usage, the cost is primarily energy which requires smaller node size for the same performance.
Because you are assuming there is an objectively optimal processor design for a specific manufacturing process.
If you don't constrain the chip to a specific design then what is going to count as compute? The number of adders or multipliers? That is just a different way of talking about transistor density.
TOF latency isn't that much of a big deal, though driving a signal for distance consumes a lot of power, and power has been the primary design-limiter for at least a decade.
But I think the GP's point is that heat is far easier managed when spread out over a larger area, so why all the emphasis on ultra tiny transistors vs just making a chip that's two inches by two inches or something?
And I think the main answer to that comes when you look at some of the discourse around Apple's M-series chips, that doing a larger-die design is just way riskier: there are huge implications on cost, yield, flexibility, etc, so it was really something that Apple was uniquely positioned to move aggressively on vs a player like Qualcomm who needs to be way more conservative in what they try to sell to their main customers (phone OEMs like Samsung).
Every chip manufacturer does that, that's how they come up with cheap, low end parts. They just try to keep number of cores an even number, so the trick is less obvious.
Maybe I’m missing something here, but wouldn’t heat become a bigger issue? Right now we have pretty intense cooling solutions to get heat off the surface of a comparatively thinner chip. If chips become more cubic how would we cool the inside?
If we keep going down this route I have to wonder if we'll see something drastic in the cooling space.
CPU dies are optimised towards being cooled from one side. I wonder if we'll eventually see sockets, motherboards and heat spreaders shift towards cooling both sides of the CPU.
Probably not, can't imagine what a halfway feasible solution to integrating pin out and a heat spreader would be.
A couple years back they noted that they were looking at having essentially cooling pipes _inside_ the chips. There hasn't been much noise in terms of commercialization, but that's the kind of extreme they were looking at.
Here's my first thought on how you might be able to do pin-out on a 'sandwich a CPU between 2 heatsinks' design.
1) DRAM gets integrated with the cpu. Slight thickness increase, probably quite a bit of added width. We get a bigger area to cool, closer ram and no need for any memory pins.
2) Add power connections to the 2 cooling sides. Running power wires through the coolers shouldn't be an issue.
3) Run as many of the fastest PCIe lanes as you can out the 4 thin sides of the package. These end up handling ALL of the IO.
Some downsides I can think of off the bat are cooking the RAM chips, and, with so much density and heat, I'm not sure how well signal integrity would work out.
A relatively easy win here is to have a “stock” set of fins built into the motherboard behind the CPU socket. The CPU could get attached to it with a pad or paste on the back.
In storage, moving away from planar (2D) MLC and TLC NAND towards 3D TLC stacking (and horrendous higher bit counts per cell) has introduced disturbances that literally shorten the memory's life cycle. When a cell is read, the voltage alters the state of adjacent cells, which must then be rewritten to preserve their state, thus shortening the life cycle of the disk just by reading data. They are selling us crap.
From the little I understand about the problem, this would be solved by using more surface area to separate the tracks that run through the vertical stacks? That would be something like a 2D design's surface area, but with bigger complications. Although I have read papers[1] that propose adding latency in an attempt to mitigate (not solve) the problem.
So now, reading this news about processors and stacking, I wonder about what inconveniences the end users are going to suffer with processors built under these techniques. Whether in computational reliability, vulnerabilities and so on.
I wrote vulnerabilities (pure imagination and speculation of my own; I'm imagining a prefetch problem at the transistor level) because if it turns out to be real in the future, I can see the manufacturer introducing a fix that randomly increases latencies or some such, sending computing power back ten years with an "oh, we didn't expect such a thing was possible when we designed it".
And of course the computational reliability.
Is all of this being taken care of, to avoid these problems? If not, I leave my comment here for the courts of the future.
>So now, reading this news about processors and stacking, I wonder about what inconveniences the end users are going to suffer with processors built under these techniques. Whether in computational reliability, vulnerabilities and so on.
Denser logic hasn't got the same issues as dense non-volatile storage as logic doesn't need to have any persistence.
It's what the likes of Micron and Samsung are good at fixing and working around when they launch and scale their Xnm processes for a specific storage technology, and what makes them better than competitors.
Intel, TSMC, GloFo, etc they all can buy the latest gen EUV machines from ASML if they want, but yet TSMC is always one node ahead on logic and Micron and Samsung win at storage, because they're good at ironing out the kinks and challenges that come from shrinking down those specific designs closer and closer to sub-nm level while the others can not (so easily).
If fabbing cutting edge silicon was as easy as just having the latest gen ASML machines, then ASML would just hoard the cutting edge machines for themselves and become vertically integrated, fabbing their own cutting edge chips as a side hustle before everyone else.
3D NAND introduced degradation when data is read from the disk. You then need to account for how many times the disk is read, the unwritten free space that will be consumed to maintain the data when the disk is read, and so on.
The TBW of the disk shown in the specifications is the estimated write limit of each cell multiplied by the number of cells. It doesn't take into account that, in order to read the data of each cell, the adjacent cells will be rewritten and will consume, little by little, this estimated write limit.
Therefore, if you fill the disk and only read data, it will sooner or later go into protection mode or lose data because of it.
They could only guarantee the TBW if more spare memory were added to cover the write consumption caused by reads in the current 3D NAND design. I no longer know how to explain that it is programmed obsolescence: self-destructing disks, just by reading data.
We stopped seeing 10 year guarantees when 3D NAND was introduced, so they know well what they are doing.
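To show roughly what that read-induced wear looks like, here's a toy model (the block geometry, read-disturb threshold and P/E endurance below are all made-up placeholders; real values vary widely by NAND generation, vendor and firmware): every page read of a block bumps that block's disturb counter, and past a threshold the firmware has to refresh the block, spending one of its program/erase cycles.

    # Toy model of read-disturb wear on 3D NAND. Placeholder parameters only.

    PAGES_PER_BLOCK = 1024
    READ_THRESHOLD = 100_000     # page reads of a block before a forced refresh
    PE_CYCLES_PER_BLOCK = 1_500  # rated program/erase endurance

    def endurance_spent_on_reads(read_volume_tb, capacity_tb=1.0):
        # Reading the whole drive once performs PAGES_PER_BLOCK page reads
        # against every block, all counting toward its disturb counter.
        full_drive_reads = read_volume_tb / capacity_tb
        page_reads_per_block = full_drive_reads * PAGES_PER_BLOCK
        refreshes_per_block = page_reads_per_block / READ_THRESHOLD
        return refreshes_per_block / PE_CYCLES_PER_BLOCK

    for tb in (1_000, 10_000, 100_000):
        frac = endurance_spent_on_reads(tb)
        print(f"{tb:>7} TB read from a 1 TB drive: ~{frac:.1%} of endurance "
              "spent on read-induced rewrites")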
" Figure 1a plots the average SSD lifetime consumed by the read-only workloads across 200 days on three SSDs (the detailed parameters of these SSDs can be found from SSD-A/-B/-C in Table 1). As shown in the figure, the lifetime consumed by the read (disturbance) induced writes increases significantly as the SSD density increases. In addition, increasing the read throughput (from 17MBps to 56/68MBps) can greatly accelerate the lifetime consumption. Even more problematically, as the density increases, the SSD lifetime (plotted in Figure 1b) decreases.
In addition, SSD-aware write-reduction-oriented system software is no longer sufficient for high-density 3D SSDs, to reduce lifetime consumption. This is because the SSDs entered an era where one can wear out an SSD by simply reading it."
" 3D NAND flash memory exhibits three new error sources that were not previously observed in planar NAND flash memory:
(1) layer-to-layer process variation, a new phenomenon specific to the 3D nature of the device, where the average error rate of each 3D-stacked layer in a chip is significantly different;
(2) early retention loss, a new phenomenon where the number of errors due to charge leakage increases quickly within several hours after programming; and
(3) retention interference, a new phenomenon where the rate at which charge leaks from a flash cell is dependent on the data value stored in the neighboring cell. "
TLC is a decent spot, which is why it's still being produced.
QLC is less so, since its endurance is only ~300 cycles. there's plenty of tension in the storage industry about this, with vendors saying "don't worry be happy", and purchasers saying "wait, what read:write ratio are you assuming, and how much dedupe?"
PLC (probably <100 cycles) is very dubious, IMO, simply because it would only be suitable for very cold data - and at that point you're competing with magnetic media (which has been scaling quite nicely).
that's the tape market I mentioned. agreed, tape doesn't fit the personal market, but it totally dominates anywhere that has scale.
the question is: what counts as reliable? if PLC is good for 50 erasures, are you really comfortable with that? it's going to cost more than half of QLC, I assure you...
the interesting thing about flash is that people want to use the speed. which means they put it in places that have a high content-mutation rate. if it's just personal stuff - mostly cold, little mutation - that's fine but not the main market.
There is a market for high speed read only data - S3 serving, and all kinds of mostly read database scenarios (OLAP). You can have tiered storage, data is first consolidated/updated on TLC drives, and as it ages is moved to PLC storage. RocksDB already supports something like this.
The CPU was modeled and designed primarily on computers in advanced 3-Dimensional programming packages, where simulated testing could be done in real time, or at increased rates.
The lattice of cubes in the construction of the prototype CPU suggests a "hypercube", a cube of more than three dimensions. In computer design, hypercubes are used as a physical connection scheme that minimizes the effective communication distance (and therefore the time delay) between processors, when the logical connection scheme needed by the software that will be run on those processors cannot be known in advance. This then supports the Neural Net's ability to learn, adapt, and build new logical connection schemes.
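For what it's worth, the hypercube property being described is easy to check: in an n-dimensional hypercube with 2^n nodes, the hop distance between two nodes is just the Hamming distance of their IDs, so the worst case grows only as log2 of the node count. A small sketch:

    # Hop distance in a hypercube interconnect: nodes are numbered 0..2^n - 1
    # and two nodes are directly linked iff their IDs differ in exactly one
    # bit, so route length between any two nodes is the Hamming distance.

    def hops(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    for dim in (3, 6, 10):
        n_nodes = 2 ** dim
        worst = hops(0, n_nodes - 1)        # all bits differ
        print(f"{n_nodes:>5} nodes: worst-case distance = {worst} hops")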
Faster chips which use less power to do the same amount of computation, same as ever.
CFETs are very much real world technology and are on the roadmaps of all leading edge fabs. They're like current gen FinFETs and the GAAFETs coming a year or two from now, in that they essentially just do the same thing as previous generations of chip tech, except they do it better.
Running watts through transistors produces heat. Flat transistors are cooled by various heat dispersal mechanisms today. Thicker 3D stacked transistors will possibly provide impetus for a different cooling paradigm.
Interesting that when we can't make chips bigger laterally, we go vertical and stack transistors. It's like we discovered high-rise buildings all over again.
These two layers are touching, nanometers apart. The heat dissipation will be the same for both layers. It's still a simple problem of density, not a more complicated problem similar to trying to cool multiple dies.
Edit: To throw math at it, silicon conducts at 2-4 Watts per centimeter-Kelvin. If we need the heat to travel an extra 100nm, and we're looking at a 1cm by 1cm area of chip, then it takes 20 to 40 kilowatts flowing through that slice before the top and bottom will differ by more than 0.1 degrees.
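That back-of-the-envelope checks out; here's the same arithmetic spelled out (same assumptions as above: bulk silicon conductivity, a 1 cm by 1 cm area, and an extra 100 nm of conduction path):

    # Fourier conduction: Q = k * A * dT / L
    AREA_M2 = 1e-4        # 1 cm x 1 cm
    LENGTH_M = 100e-9     # extra 100 nm of silicon between the two layers
    DELTA_T = 0.1         # allowed temperature difference, kelvin

    for k_w_per_cm_k in (2, 4):             # silicon conductivity range quoted above
        k_w_per_m_k = k_w_per_cm_k * 100
        q_watts = k_w_per_m_k * AREA_M2 * DELTA_T / LENGTH_M
        print(f"k = {k_w_per_cm_k} W/(cm*K): ~{q_watts / 1000:.0f} kW "
              "before the layers differ by 0.1 K")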
In gaming, especially simulators: the 5800X3D and then the 7800X3D have proved what exemplary performance benefits can be gained in certain use cases, in some cases outperforming Intel with less than half the power usage (if not a third).
Limited overclocking is the price to pay for that, but you kind of get it back in the monthly power bills, while still going toe to toe with Intel in general.
I doubt that for one user using one CPU the power bill is going to matter. People still use AC, washing machines and electrical heating, which consume thousands of watts.
Power/heat matters for datacenters, but not for people. Yes, you can build a kW desktop, but you know you're doing something weird. For most people, their computer's peak dissipation has been falling for a decade. 90W CPUs used to be common, but the mainstream is currently going from the 65W to the 40W category in desktops. And normal people do not have GPUs. Even more normal people depend primarily on mobile devices, where 15W laptops are routine, and lots of people use devices <4W.
Samsung went even smaller than Intel, showing results for 48-nm and 45-nm contacted poly pitch (CPP), compared to Intel’s 60 nm, though these were for individual devices, not complete inverters. Although there was some performance degradation in the smaller of Samsung’s two prototype CFETs, it wasn’t much, and the company’s researchers believe manufacturing process optimization will take care of it.
Crucial to Samsung’s success was the ability to electrically isolate the sources and drains of the stacked pFET and nFET devices. Without adequate isolation, the device, which Samsung calls a 3D stacked FET (3DSFET), will leak current. A key step to achieving that isolation was swapping an etching step involving wet chemicals with a new kind of dry etch. That led to an 80 percent boost in the yield of good devices.
Like Intel, Samsung contacted the bottom of the device from beneath the silicon to save space. However, the Korean chipmaker differed from the American one by using a single nanosheet in each of the paired devices, instead of Intel’s three. According to its researchers, increasing the number of nanosheets will enhance the CFET’s performance.
The fact that they are predicted to be "seven to ten years" away suggests there are still many unsolved problems that are preventing scaling-up from becoming a reality.