Simple Science

Cutting edge science explained simply

Computer Science · Hardware Architecture · Artificial Intelligence · Distributed, Parallel, and Cluster Computing

Power Capping: A Step Towards Sustainable AI

Research shows how power capping GPUs can reduce energy use and temperatures.

― 8 min read


[Figure: GPU Power Capping for Energy Savings. Study reveals GPU power limits cut energy use and heat.]

As the need for artificial intelligence (AI) grows, so does the demand for the computing resources that support it. Training complex AI models, especially in fields like natural language processing and computer vision, requires powerful hardware. Large models consume substantial energy and resources, which can lead to high carbon emissions and increased demand for specialized accelerators like GPUs.

This increase in demand raises questions about energy efficiency and sustainability in data centers where supercomputers operate. In this study, we look at how limiting the power used by GPUs affects temperature and energy use in a supercomputing facility. By carefully setting the power limits, we found that we could lower both the temperature and energy usage of the GPUs. This not only saves energy but may also help extend the life of the hardware, all while keeping job performance mostly unaffected.

However, there are challenges. If users notice a drop in job performance because of the power limits, they may try to compensate by running more jobs, which could reverse any energy benefits gained from the limits. Our research is the first detailed analysis of how capping GPU power works at the large scale of a supercomputing center. We hope it will motivate other centers to look into power capping as a way to make AI more sustainable.

The Cost of Advanced AI

Recent progress in AI has led to amazing results, such as realistic text generation and breakthroughs in medical research. However, these advancements come with a price. For instance, training large language models can produce as much carbon dioxide as the total emissions from five cars over their lifetime. Many of these AI models have hundreds of millions of parameters and take weeks or even months to train on powerful hardware with massive datasets.

On top of that, deploying these models can also consume significant energy. Popular models like GPT-3 and GPT-4, which power applications like chatbots and search engines, can drive high energy consumption as millions of users access these tools daily. Beyond large language models, many other AI applications also use considerable energy.

As AI continues to evolve, it requires more resources in terms of data, computing power, time, and energy. The more complex these systems become, the larger their impact on the environment will be, raising concerns about the sustainability of energy sources and the overall demand for resources.

Finding the Balance

Addressing the issue of sustainability in AI is crucial as these technologies proliferate across industries. Striking the right balance between performance, energy efficiency, and sustainability is important for both the environment and the future of AI development. Significant efforts have been made to improve model efficiency through various techniques, such as reducing model size and using smaller datasets for training. However, many of these methods require advanced technical skills and may complicate the training and deployment processes.

One potential solution for data centers and computing facilities is to limit the power that their hardware uses. In our research, we present observations from an academic supercomputing center where we set a 60% power cap on GPUs. Our findings indicate that this power limitation can significantly reduce energy consumption and operating temperatures without heavily affecting job performance.

Previous Research

While the idea of limiting power use for better energy efficiency is not new, previous research has mainly focused on CPUs rather than GPUs. Some studies showed that power caps could lower energy use and operating temperatures while increasing hardware reliability. With GPUs becoming essential for AI workloads, researchers have started to investigate how capping GPU power affects performance.

For instance, one study found that capping power during pre-training of the BERT model saved energy without greatly disrupting the task. Another study examined various AI models and confirmed that GPU power limits could lead to energy savings. Nevertheless, many of these larger-scale experiments have not been made publicly available or analyzed in detail.

Experiment Setup

We conducted our research on the MIT Supercloud, a high-performance computing system that utilizes NVIDIA Volta V100 GPUs. The system is composed of numerous nodes managed by resource-scheduling software. We collected data on GPU utilization, temperature, power draw, and other metrics at regular intervals. The dataset consisted of over 123,000 GPU jobs, some of which were subject to power capping.
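The summary does not describe the exact telemetry pipeline used on the MIT Supercloud. As a rough illustration only, per-GPU power, temperature, and utilization can be sampled with NVIDIA's `nvidia-smi` tool and parsed into per-sample records; the CSV text below is an invented sample in that tool's output style, not real study data.

```python
import csv
import io

# Hypothetical sample of `nvidia-smi --query-gpu=...` CSV output; the
# actual collection pipeline used in the study is not described here.
SAMPLE = """\
power.draw [W], temperature.gpu, utilization.gpu [%]
187.42, 62, 98
142.10, 55, 97
"""

def parse_gpu_telemetry(text):
    """Parse nvidia-smi-style CSV telemetry into a list of dicts."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    return [
        {h: float(v.strip()) for h, v in zip(header, row)}
        for row in reader
    ]

records = parse_gpu_telemetry(SAMPLE)
for r in records:
    print(r)
```

For live collection, a command such as `nvidia-smi --query-gpu=power.draw,temperature.gpu,utilization.gpu --format=csv -l 1` emits one such CSV row per GPU per second.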

Our analysis focused on job-level hardware utilization to protect user privacy. Since the GPUs are highly utilized, we couldn't frequently change power cap levels without risking disruptions. We summarized data relevant to various jobs to better analyze power capping effects.

Key Results

After implementing power caps across the system, we noticed a decrease in energy use and GPU temperature. The drop in power draw and temperatures was consistent with earlier research. However, the overall effect on energy consumption remains ambiguous, as users might increase job requests if they observe performance degradation. This could eliminate any gains made from power limiting.

To determine the effectiveness of power capping, we conducted a more rigorous analysis of temperature and power use across jobs. This evaluation helps clarify whether the observed changes were significant or just due to random variation.

Analysis of Power Draw and Temperatures

We grouped the results to visualize how power capping influences GPU temperatures and power usage. The data showed that jobs with power caps had lower temperatures than those without, with a consistent decrease across all measured percentiles. The variance in GPU temperatures also dropped, indicating that capped jobs had less fluctuation in temperature.

Similar trends were seen for GPU power draw, confirming that power capping successfully reduced overall energy usage. This stability in temperatures and power draw may also imply potential benefits for extending GPU lifespan and promoting sustainable hardware practices in data centers.
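The percentile-and-variance comparison described above can be sketched with Python's standard library. The temperature samples here are synthetic, chosen only to illustrate the pattern reported (lower percentiles and lower variance for capped jobs), not the study's data.

```python
import statistics

# Synthetic per-job mean GPU temperatures (deg C); illustrative only.
uncapped_temps = [68, 72, 75, 70, 74, 77, 69, 73]
capped_temps = [61, 63, 64, 60, 62, 65, 61, 63]

def summarize(samples):
    """Return the quartiles and population variance of a sample."""
    p25, p50, p75 = statistics.quantiles(samples, n=4)
    return {"p25": p25, "p50": p50, "p75": p75,
            "var": statistics.pvariance(samples)}

print("uncapped:", summarize(uncapped_temps))
print("capped:  ", summarize(capped_temps))
```

A consistent gap at every percentile, plus a smaller variance, is the signature the study reports for capped jobs.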

Statistical Testing

To quantify the changes in GPU power draw and temperature, we employed statistical tests. These tests helped us determine the significance of the differences between capped and uncapped jobs. Our findings indicated that the reductions in temperature and power draw were statistically significant.
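The summary does not name the specific statistical test the authors used. One generic, assumption-light option for comparing two groups is a permutation test on the difference of group means, sketched below on invented per-job power figures.

```python
import random

random.seed(0)

# Synthetic per-job mean power draws (W); illustrative only.
capped = [148.0, 152.5, 150.1, 147.3, 151.8, 149.6]
uncapped = [231.4, 228.9, 235.0, 229.7, 233.2, 230.5]

def permutation_p_value(a, b, n_perm=10_000):
    """Two-sided permutation test on the difference of group means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_perm + 1)

p = permutation_p_value(capped, uncapped)
print(f"p = {p:.4f}")
```

A small p-value here means random relabeling of jobs almost never reproduces a gap as large as the observed one, which is what "statistically significant" asserts.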

Overall, the evidence strongly suggests that power capping effectively reduces energy use and temperature in operational settings. The reductions we observed could help enhance hardware reliability and lower the chances of early failures.

Treatment Effect Estimation

While hypothesis testing highlighted significant differences between the two groups, understanding how much of that effect was truly due to power capping was crucial. To tackle this, we aimed to estimate the average treatment effect (ATE) of power capping on GPU power draw and temperature.

Our estimates showed that power capping can lead to meaningful reductions in GPU power draw and temperature. For jobs with average GPU utilization, we noted slightly larger reductions in both areas. This suggests that more efficient jobs benefit even more from power limitations.

Matching for Bias Mitigation

To address any biases from the non-random assignment of power caps, we applied matching techniques to estimate the ATE. We categorized observations based on their features to identify similar groups from both capped and uncapped jobs. This approach allowed us to control for potential bias when estimating the effects of power capping.
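A minimal version of this idea can be sketched as nearest-neighbor matching on a single covariate (GPU utilization): each capped job is paired with the most similar uncapped job, and the outcome differences are averaged. The study's actual matching procedure is more involved; all numbers below are synthetic.

```python
# (utilization %, mean temperature deg C) pairs; synthetic data.
capped_jobs = [
    (95, 62), (80, 58), (99, 64), (70, 55),
]
uncapped_jobs = [
    (96, 71), (82, 66), (98, 73), (72, 63), (50, 57),
]

def matched_ate(treated, control):
    """Estimate the average treatment effect (ATE) on temperature by
    matching each treated job to the control job with the closest
    covariate value, then averaging the outcome differences."""
    diffs = []
    for util_t, temp_t in treated:
        _, temp_c = min(control, key=lambda c: abs(c[0] - util_t))
        diffs.append(temp_t - temp_c)
    return sum(diffs) / len(diffs)

ate = matched_ate(capped_jobs, uncapped_jobs)
print(f"estimated ATE on temperature: {ate:.1f} deg C")
```

Matching on covariates like utilization compares like with like, so the estimated effect is less likely to reflect differences in workload mix rather than the cap itself.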

Our findings continued to show significant reductions in power draw and temperatures, reinforcing the idea that power caps can help improve energy efficiency in a practical way.

Impact on Job Performance

After analyzing power capping's effects on temperature and power usage, we also examined its influence on job performance. To define an optimal power cap, we aimed for a balance between reduced energy consumption and minimal performance impact.

When considering deep learning training, we found that power capping does yield energy savings while keeping performance relatively stable. For various AI models, optimal caps reduced energy use significantly without greatly slowing down training times.

Stricter power caps could save more energy but came at the cost of reduced performance. Therefore, we identified "sweet spots" in power capping that can maximize energy savings while keeping performance losses within acceptable limits.
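The sweet-spot idea can be phrased as a small optimization: among candidate caps, pick the one with the lowest energy whose slowdown stays within an acceptable bound. The cap/energy/runtime triples below are hypothetical; the real curves are workload-specific and not given in this summary.

```python
# (power cap as % of TDP, energy vs uncapped, runtime vs uncapped);
# hypothetical profile points for one workload.
profiles = [
    (100, 1.00, 1.00),
    (80, 0.90, 1.02),
    (60, 0.82, 1.08),
    (50, 0.78, 1.25),
]

def sweet_spot(profiles, max_slowdown=1.10):
    """Among caps whose slowdown is acceptable, pick the one that
    minimizes relative energy consumption."""
    feasible = [p for p in profiles if p[2] <= max_slowdown]
    return min(feasible, key=lambda p: p[1])

cap, energy, runtime = sweet_spot(profiles)
print(f"cap={cap}%  energy={energy:.2f}  runtime={runtime:.2f}")
```

With these numbers, the 50% cap saves the most energy but exceeds the 10% slowdown budget, so the 60% cap is selected; tightening or loosening `max_slowdown` shifts the chosen cap accordingly.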

Model Inference

Inference performance also deserves attention, especially for large models like LLaMA 65B. We tested power capping on this model and observed that setting a cap led to good energy savings with minimal performance degradation.

Our results indicated that, while stricter caps provided more energy savings, they also resulted in noticeable speed drops. This highlights the importance of finding an appropriate power cap that aligns with specific workload requirements to maintain efficiency without significant performance losses.

Conclusion

Our research offers insights into the impact of power capping on GPUs at a large scale. We observed significant reductions in GPU temperatures and energy use, which can contribute to better hardware longevity and decreased carbon footprints. Allowing users to control GPU power limits can empower researchers to make greener choices in AI development.

However, many questions still remain about how various workloads interact with power capping, and how to best implement such strategies in different settings. Future research may explore dynamic power capping systems that adapt to workload demands while maximizing energy efficiency.

By continuing to study these methods, we hope to identify better strategies for improving sustainability and making AI development more responsible.

Original Source

Title: Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale

Abstract: As research and deployment of AI grows, the computational burden to support and sustain its progress inevitably does too. To train or fine-tune state-of-the-art models in NLP, computer vision, etc., some form of AI hardware acceleration is virtually a requirement. Recent large language models require considerable resources to train and deploy, resulting in significant energy usage, potential carbon emissions, and massive demand for GPUs and other hardware accelerators. However, this surge carries large implications for energy sustainability at the HPC/datacenter level. In this paper, we study the aggregate effect of power-capping GPUs on GPU temperature and power draw at a research supercomputing center. With the right amount of power-capping, we show significant decreases in both temperature and power draw, reducing power consumption and potentially improving hardware life-span with minimal impact on job performance. While power-capping reduces power draw by design, the aggregate system-wide effect on overall energy consumption is less clear; for instance, if users notice job performance degradation from GPU power-caps, they may request additional GPU-jobs to compensate, negating any energy savings or even worsening energy consumption. To our knowledge, our work is the first to conduct and make available a detailed analysis of the effects of GPU power-capping at the supercomputing scale. We hope our work will inspire HPCs/datacenters to further explore, evaluate, and communicate the impact of power-capping AI hardware accelerators for more sustainable AI.

Authors: Dan Zhao, Siddharth Samsi, Joseph McDonald, Baolin Li, David Bestor, Michael Jones, Devesh Tiwari, Vijay Gadepally

Last Update: 2024-02-24

Language: English

Source URL: https://arxiv.org/abs/2402.18593

Source PDF: https://arxiv.org/pdf/2402.18593

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
