As HPC Chip Sizes Grow, So Does the Need For 1kW+ Chip Cooling
by Anton Shilov on June 27, 2022 10:00 AM EST
One trend in the high performance computing (HPC) space that is becoming increasingly clear is that per-chip and per-rack-unit power consumption is not going to stop at the limits of air cooling. As supercomputers and other high performance systems have already hit – and in some cases exceeded – these limits, power requirements and power densities have continued to scale up. And based on the news from TSMC's recent annual technology symposium, we should expect this trend to continue as TSMC lays the groundwork for even denser chip configurations.
The problem at hand is not a new one: transistor power consumption isn't scaling down nearly as quickly as transistor sizes. And as chipmakers are not about to leave performance on the table (and fail to deliver semi-annual increases for their customers), in the HPC space power per transistor is quickly growing. As an additional wrinkle, chiplets are paving the way towards constructing chips with even more silicon than traditional reticle limits, which is good for performance and latency, but even more problematic for cooling.
Enabling this kind of silicon and power growth have been modern technologies like TSMC's CoWoS and InFO, which allow chipmakers to build integrated multi-chiplet system-in-packages (SiPs) with as much as double the amount of silicon otherwise allowed by TSMC's reticle limits. By 2024, advancements in TSMC's CoWoS packaging technology will enable building even larger multi-chiplet SiPs, with TSMC anticipating stitching together upwards of four reticle-sized chiplets. This will enable tremendous levels of complexity (over 300 billion transistors per SiP is a possibility that TSMC and its partners are looking at) and performance, but naturally at the cost of formidable power consumption and heat generation.
Already, flagship products like NVIDIA's H100 accelerator module require upwards of 700W of power for peak performance. So the prospect of multiple GH100-sized chiplets on a single product is raising eyebrows – and power budgets. TSMC envisions that several years down the road there will be multi-chiplet SiPs with a power consumption of around 1000W or even higher, creating a serious cooling challenge.
At 700W, H100 already requires liquid cooling; and the story is much the same for Intel's chiplet-based Ponte Vecchio and AMD's Instinct MI250X. But even traditional liquid cooling has its limits. By the time chips reach a cumulative 1 kW, TSMC envisions that datacenters will need to use immersion liquid cooling systems for such extreme AI and HPC processors. Immersion liquid cooling, in turn, will require rearchitecting the datacenters themselves, which will be a major change in design and a major challenge in operational continuity.
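To put those power figures in perspective, the coolant flow a cold plate needs scales linearly with chip power and inversely with the allowable coolant temperature rise. Here is a back-of-the-envelope sketch of that relationship; the function name, the 10°C temperature rise, and the water properties are illustrative assumptions, not figures from TSMC or NVIDIA:

```python
# Rough estimate: water flow needed to carry away a given chip power.
# Based on the sensible-heat balance P = m_dot * c_p * dT
# (ignores pump losses and the cold plate's own thermal resistance).

def required_flow_lpm(power_w: float, delta_t_c: float,
                      c_p: float = 4186.0,       # J/(kg*K), water
                      density: float = 997.0) -> float:  # kg/m^3, water ~25C
    """Volumetric water flow (L/min) needed to absorb power_w
    with a coolant temperature rise of delta_t_c."""
    mass_flow = power_w / (c_p * delta_t_c)       # kg/s
    return mass_flow / density * 1000.0 * 60.0    # m^3/s -> L/min

# A 700 W part with a 10 C coolant rise needs ~1 L/min;
# a 1 kW SiP pushes that to ~1.4 L/min, multiplied across every
# package in a rack.
print(required_flow_lpm(700, 10))   # ~1.0 L/min
print(required_flow_lpm(1000, 10))  # ~1.4 L/min
```

The per-chip numbers look modest, but summed over a rack of such SiPs the plumbing, manifolds, and heat rejection capacity become the dominant design problem, which is the gap immersion cooling aims to close.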
The short-term challenges aside, once datacenters are set up for immersion liquid cooling, they will be ready for even hotter chips. Liquid immersion cooling has a lot of potential for handling large cooling loads, which is one reason why Intel is investing heavily in this technology in an attempt to make it more mainstream.
In addition to immersion liquid cooling, there is another technology that can be used to cool down ultra-hot chips — on-chip water cooling. Last year TSMC revealed that it had experimented with on-chip water cooling and said that even 2.6 kW SiPs could be cooled down using this technology. But of course, on-chip water cooling is an extremely expensive technology by itself, which will drive costs of those extreme AI and HPC solutions to unprecedented levels.
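To see why a 2.6 kW SiP pushes cooling onto the die itself, consider the junction temperature budget: T_junction = T_coolant + P × R_th, so the allowable junction-to-coolant thermal resistance shrinks in proportion to power. A minimal sketch, with the 100°C junction limit and 25°C coolant temperature assumed for illustration (they are not TSMC's published figures):

```python
# Maximum allowable junction-to-coolant thermal resistance for a given
# power and temperature budget: R_th <= (T_j_max - T_coolant) / P.

def max_thermal_resistance(power_w: float,
                           t_junction_max_c: float = 100.0,
                           t_coolant_c: float = 25.0) -> float:
    """Junction-to-coolant thermal resistance budget in K/W."""
    return (t_junction_max_c - t_coolant_c) / power_w

# A 300 W CPU can tolerate ~0.25 K/W, within reach of a good cold plate.
# A 2.6 kW SiP needs under ~0.03 K/W, which is why channels that bring
# water into or directly onto the silicon enter the picture.
print(max_thermal_resistance(300))   # 0.25 K/W
print(max_thermal_resistance(2600))  # ~0.029 K/W
```

In other words, at these power levels the conventional stack of die, lid, paste, and cold plate simply has too much thermal resistance in series, independent of how much coolant flows past it.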
Nonetheless, while the future isn't set in stone, it has seemingly been cast in silicon. TSMC's chipmaking clients have customers willing to pay top dollar for these ultra-high-performance solutions (think operators of hyperscale cloud datacenters), even with the high costs and technical complexity that entails. Which, to bring things back to where we started, is why TSMC has been developing its CoWoS and InFO packaging processes in the first place – because there are customers ready and eager to break the reticle limit via chiplet technology. We're already seeing some of this today with products like Cerebras' massive Wafer Scale Engine processor, and via large chiplets, TSMC is preparing to make smaller (but still reticle-breaking) designs accessible to its wider customer base.
Such extreme requirements for performance, packaging, and cooling not only push producers of semiconductors, servers, and cooling systems to their limits, but also require modifications of cloud datacenters. If indeed massive SiPs for AI and HPC workloads become widespread, cloud datacenters will be completely different in the coming years.
40 Comments
ballsystemlord - Monday, June 27, 2022 - link
So, how about liquid nitrogen cooling? You could create micro-through-chiplet and inter-chiplet vias to allow the liquid nitrogen to flow through the chips and chiplet interconnects to cool them down. Potentially, the spaces between chiplets (created by the micro-solder-bumps) could be utilized as pathways without the need for making any vias at all.
ingwe - Monday, June 27, 2022 - link
I suspect the phase change would create significant problems given the small size we are talking about. I don't know though and it is an interesting idea.
Wereweeb - Monday, June 27, 2022 - link
That's an entirely different can of worms, and an expensive one at that.
ChefJeff789 - Monday, June 27, 2022 - link
An interesting idea, but capillary action prevents fluids from travelling through extremely small openings unless there is a force acting to pull the fluid through each cavity. To keep it practical, you'd need one inlet and one outlet (or at least one inlet region and one outlet region, if you wanted to feed multiple channels), so you need to create an explicit path through the chip for the fluid to follow. Perhaps doable, if you make the channel part of the chip stack during manufacturing, but difficult. Also, if even one region ends up with an air pocket, you could get some really harsh effects. I suspect cavitation would completely destroy a chip in seconds, if the thermal gradient from air-fluid doesn't do it first.
xol - Monday, June 27, 2022 - link
Active cooling aka refrigeration is going to add an additional cost (energy to refrigerate), whilst liquid cooling can just use passive radiators to lose the heat to the environment. That being said, refrigeration for chips *could* be interesting, since low temps will drastically reduce wire resistance (even though most of chip power is lost in switching, not resistance)...
I think another challenge is thermal coefficients of expansion/contraction. Room temp to 99C is around 75 degrees difference; room temp to liquid N2 is ~225 degrees – and this is exacerbated by materials becoming more brittle at lower temps (e.g. solder joints).
meacupla - Monday, June 27, 2022 - link
AFAIK, the preferred method for cryogenic cooling is pumping liquid helium through a piping system of some sort. The downside with both liquid nitrogen and liquid helium is that they are both cryogenic forms of cooling, and will create a lot of condensation/ice around whatever metal is exposed to the room.
Foo Barred - Monday, June 27, 2022 - link
Water / fluid cooling is about transferring the heat to someplace where it can be removed/cooled, possibly in a more effective manner than with a local heat-sink and airflow. Liquid nitrogen is a very effective cooling solution, but it consumes the nitrogen in the process and removes (cools) rather than transfers the heat. I.e. very effective for setting overclocking records, but not practical at scale.
ballsystemlord - Monday, June 27, 2022 - link
I mean that you could use a closed loop system. Thus you wouldn't "consume" the nitrogen.
Kevin G - Tuesday, June 28, 2022 - link
The problem is that the liquid nitrogen becomes a gas when it comes into contact with the chip, which heats it up. To form a closed loop, you'd have to have a condenser/compressor in the loop to convert the nitrogen gas back into its liquid form. Due to inefficiencies, this also takes energy to do. Not impossible to pull off, but very, very impractical.