Stulz Blog 2024

Cooling the AI Blaze: Solutions for Surging Rack Densities in Data Centers

Written by Dave Meadows | Jun 12, 2024 1:21:00 PM

In the past couple of years, Generative AI capabilities and use cases have exploded, driving increased demand for compute power and storage. As a result, rack densities and power consumption in data centers are climbing steadily, reflecting the growing workloads they must handle.
The impact of this growth is significant, with the latest IEA projections indicating that data centers worldwide will consume 700-1,000 TWh of electricity by 2026! To accommodate this, governments worldwide have started planning to expand power infrastructure accordingly.

Alongside this explosion comes the challenge of cooling these data centers effectively. The machines powering the future are running hotter than ever, and traditional cooling methods may no longer cut it.

So, how are data centers rising to this challenge?

Understanding the Impact of Rising Rack Densities

Rack density refers to the amount of computing power and equipment packed into a single server rack within a data center. In simple terms, it is the total power drawn by the equipment in one rack, usually expressed in kilowatts (kW).

Advancements in chip technology and growing demand have led to a steady increase in rack densities in data centers. For example, under AFCOM and DCI’s 2014 industry density standard, 8-15 kW was classified as a high-density rack, while anything above that was termed extreme density.
Fast forward to today, and data centers, including hyperscalers like Google and Facebook, are deploying racks in the 25-40 kW range. In fact, specialized companies offering HPC (High Performance Computing) services are already deploying racks that draw well over 90 kW. As data centers deploy more power-hungry GPUs, these numbers are only going to climb.
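As a quick illustration of the arithmetic (the server count and wattages below are hypothetical figures, not measurements from any particular facility), rack density is simply the summed power draw of the hardware in the cabinet:

```python
# Illustrative rack-density arithmetic -- all figures are hypothetical.
servers_per_rack = 8           # e.g., eight GPU servers in a 42U rack
power_per_server_kw = 5.6      # assumed draw of one accelerator node under load

rack_density_kw = servers_per_rack * power_per_server_kw
print(f"Rack density: {rack_density_kw:.1f} kW")   # 44.8 kW -- "extreme" by the 2014 yardstick

# The cooling problem scales with the row: ten such racks is ~450 kW of heat to remove.
row_heat_load_kw = rack_density_kw * 10
print(f"Row heat load: {row_heat_load_kw:.0f} kW")
```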

Rising Rack Densities: Opportunity or Challenge?

On one hand, higher rack densities allow for greater consolidation of computing resources, maximizing the utilization of physical space within the facility. This increased density translates to improved scalability and efficiency, enabling data centers to handle larger workloads and meet the demands of evolving applications.

However, as rack densities soar, so too does the heat generated by the densely packed equipment. Many hyperscale companies, including Google, already run their data centers at temperatures above the ASHRAE TC9.9 maximum recommended server inlet temperature of 80.6°F (27°C) to keep cooling costs down.

Even so, high-density racks run much hotter and still need substantial cooling to stay within acceptable temperatures. This heightened thermal load poses significant challenges for cooling systems, requiring data center operators to implement robust cooling solutions to maintain optimal operating temperatures and prevent equipment failures.

Emerging Technologies to Tackle Rising Chip Densities

To combat the rising temperatures, data center operators are exploring and deploying new cooling infrastructure to support their high-density racks. One new technique in particular that’s generating a lot of interest is liquid cooling.

Liquid cooling techniques have been proven to be more cost-effective and energy-efficient than air cooling. For some data centers, switching to a liquid cooling system has resulted in up to a 50% reduction in facility cooling power consumption.

These eye-popping results have led to a boom in the deployment of liquid cooling systems. According to AFCOM’s 2022 State of The Data Center Report, at least 40% of data centers are considering using liquid cooling in some form. 
Let’s explore some of these techniques:

Direct Liquid to Chip Cooling

Direct liquid-to-chip cooling has emerged as one of the most cost-effective and efficient methods for cooling high-density racks.
It involves circulating a coolant through cold plate heat exchangers mounted directly on the server’s components to cool them. The coolant absorbs heat from the components and carries it out of the server, where it is rejected through a heat exchanger or expelled from the system.

This cooling approach is more efficient than traditional air cooling because it uses less mechanical power, leading to a lower PUE (Power Usage Effectiveness). It also captures approximately 80% of the heat generated by the server’s components, resulting in more efficient operation and a longer equipment lifespan.
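To see why lower mechanical power translates into a lower PUE, here is a back-of-the-envelope comparison. All loads are assumptions chosen for illustration, not measurements, and the 50% cooling-power reduction is the upper-end figure cited above:

```python
# Back-of-the-envelope PUE comparison -- all loads are illustrative assumptions.
# PUE = total facility power / IT equipment power (lower is better, 1.0 is ideal).
it_load_kw = 1000                          # hypothetical data hall IT load

cooling_air_kw = 450                       # assumed cooling power with traditional air cooling
cooling_liquid_kw = cooling_air_kw * 0.5   # applying the ~50% reduction cited above
other_overhead_kw = 100                    # lighting, power distribution losses, etc.

pue_air = (it_load_kw + cooling_air_kw + other_overhead_kw) / it_load_kw
pue_liquid = (it_load_kw + cooling_liquid_kw + other_overhead_kw) / it_load_kw
print(f"PUE with air cooling:    {pue_air:.2f}")      # ~1.55
print(f"PUE with liquid cooling: {pue_liquid:.2f}")   # ~1.33
```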

Another important reason for direct liquid-to-chip cooling’s mass appeal is its compactness: the cooling takes place inside the server’s enclosure, eliminating the need for additional bulky equipment.

Immersion Cooling

Immersion cooling is another innovative cooling method that’s gaining steam in HPC circles. This technique involves submerging the entire device in a non-conductive dielectric fluid, such as a mineral oil or an engineered dielectric coolant.

The fluid absorbs heat from the components through direct contact. It is then circulated through a plate heat exchanger or fluid cooler, where the heat is rejected so the fluid can be reused.

Rear Door Heat Exchangers

Rear door heat exchangers are a variation on air cooling that has been deployed successfully to cool medium-density racks. They aren’t as effective as the liquid cooling methods mentioned above, but several industry players have used them to cool racks in the 20 kW+ range.

Rear door heat exchangers are doors with a radiator-like coil structure mounted at the back of the server rack. Hot air from the servers is exhausted towards the door, which has a coolant such as chilled water flowing through its coils.

The door serves as a heat exchanger, removing up to 100% of the heat generated by the rack. The warmed coolant then returns to a chiller, where it is cooled and recirculated through the system.
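For a sense of scale, a simple heat balance (Q = ṁ·cp·ΔT) shows how much chilled water a rear door has to move; the rack load and water temperature rise below are assumed values for illustration only:

```python
# Rough sizing of the chilled-water flow for a rear door coil -- assumed figures.
# Heat removed: Q = m_dot * cp * delta_T (mass flow x specific heat x temperature rise)
rack_heat_w = 25_000        # hypothetical 25 kW high-density rack
cp_water = 4186             # J/(kg*K), specific heat of water
delta_t_k = 6.0             # assumed water temperature rise across the coil

m_dot_kg_s = rack_heat_w / (cp_water * delta_t_k)
flow_l_min = m_dot_kg_s * 60          # 1 kg of water is roughly 1 litre
print(f"Required water flow: {m_dot_kg_s:.2f} kg/s (~{flow_l_min:.0f} L/min)")
# -> about 1 kg/s, i.e. roughly 60 L/min, to hold a 25 kW rack at a 6 K water rise
```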

Energy Efficiency: Balancing Performance with Cooling Costs

Now that we know the various methods available for cooling high-density racks, it's time to talk about cost. Or more specifically, how can data centers manage cooling costs?

Since cooling accounts for about 50% of the energy used in data centers, it takes more than just selecting the right technology. Companies must also deploy it cost-effectively to achieve maximum efficiency. Here are some strategies data centers are adopting.
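Taking that ~50% share at face value, cooling energy for a rack is roughly on par with the IT energy it consumes, which makes the stakes easy to put in dollars. The rack size and electricity rate below are assumptions for illustration:

```python
# What the ~50% cooling share can mean in dollars for one rack -- illustrative assumptions.
rack_it_load_kw = 40            # hypothetical AI rack
hours_per_year = 8760
usd_per_kwh = 0.10              # assumed utility rate

it_energy_kwh = rack_it_load_kw * hours_per_year
cooling_energy_kwh = it_energy_kwh        # cooling roughly equals IT energy if it is ~half of total
cooling_cost_usd = cooling_energy_kwh * usd_per_kwh

print(f"IT energy:     {it_energy_kwh:,.0f} kWh/yr")
print(f"Cooling cost: ${cooling_cost_usd:,.0f}/yr for a single {rack_it_load_kw} kW rack")
# Halving cooling power, as some liquid-cooled sites report, would save ~$17,500/yr per rack.
```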

Running Servers Hot

In the past, it was common to keep server rooms as cold as possible to maximize performance. However, recent studies have shown that servers can perform reliably at higher temperatures.

In light of this, many companies, such as Facebook and Google, have taken to running their server racks right at the ASHRAE recommended limit or beyond. Additionally, most new server components and GPUs are built to tolerate higher temperatures for longer, reducing the need to keep them cold.

Implementing a Hybrid Strategy

While most data centers are exploring liquid cooling options, few if any are willing to completely phase out air cooling. Instead, they are opting to employ a hybrid approach combining both liquid and air cooling. One such approach uses cold plates to target and remove heat from high-heat-density components like chips, while other low-heat-density components use air cooling. This is one of the better options for retrofitting older data centers. To save costs, new racks fitted with liquid cooling can be introduced while legacy racks can run on air cooling.

Improving Data Center Monitoring

Enhanced monitoring has led to significant cost savings for many data centers. By embedding a network of real-time sensors that monitor temperature and humidity, operators can accurately track and adjust cooling conditions to maintain optimal operating temperatures and increase efficiency. With these sensors, data centers can better target hotspots for cooling. Hyperscalers like Google are even turning to AI to optimize their cooling: they are using innovations in AI and computational fluid dynamics (CFD) models to map, control, and optimize airflow and cooling, predict hotspots, and identify areas for cost savings.
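At its simplest, hotspot detection is just comparing inlet readings against the ASHRAE envelope and ranking the worst offenders. The sketch below uses made-up rack names and sensor values; in a real deployment the readings would come from BMS-, SNMP-, or Modbus-connected sensors:

```python
# Minimal hotspot-detection sketch -- rack names and readings are hypothetical.
ASHRAE_RECOMMENDED_MAX_F = 80.6   # recommended maximum server inlet temperature

inlet_temps_f = {
    "rack_A01": 75.2,
    "rack_A02": 82.9,   # above the recommended envelope
    "rack_B07": 79.8,
    "rack_B08": 85.4,   # hotspot
}

hotspots = {rack: t for rack, t in inlet_temps_f.items() if t > ASHRAE_RECOMMENDED_MAX_F}
for rack, temp in sorted(hotspots.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Hotspot: {rack} inlet at {temp:.1f} °F -- target cooling here first")
```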

Adopting the Modular Data Center Approach

Another growing technology in the data center space is the modular data center. According to AFCOM’s 2024 State of The Data Center report, 42% of survey respondents are considering deploying modular facilities alongside their traditional buildings. Several hyperscalers are adopting them because they are cheaper and faster to deploy than retrofitting old data centers, and modules can easily be added to meet demand for AI or edge workloads. They are also quite compact, with a smaller footprint that makes them easier and cheaper to cool.

Looking to The Future…

As data centers continue to evolve to keep pace with surging AI workloads, we’re going to see many more cooling innovations come into play over the next few years. As an industry leader in data center cooling, STULZ is right at the forefront of these innovations. By deploying our cutting-edge cooling solutions, data centers are effectively mitigating these challenges and ensuring the reliability and efficiency of their systems.