Scaling Challenges in Neural Network Training
Examining the impact of hardware and communication on deep learning efficiency.
Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn
― 13 min read
Table of Contents
- Accelerators at Scale
- Algorithms for Distributed Training
- Experimental Methodology
- Effects of Scaling: Parallelization, Hardware, Model Size
- Scaling Data Parallelism
- Scaling Model Parallelism
- Scaling the Hardware World Size
- Scaling the Hardware Generation
- Scaling the Model Architecture
- Scaling the Compute Workload
- Trends in Scaling and Implications
- Diminishing Returns in Performance and Power at Scale
- Performance Benchmarking Fails to Extrapolate Across Scales and Hardware Generations
- Related Work
- Limitations and Future Work
- Software and Hardware Details
- Original Source
- Reference Links
Recent gains in neural network capabilities have been driven by scaling up model size, the amount of training data, and the computation used for training. Developing the gigantic networks behind applications like chatbots and image recognition typically means distributing the training work across thousands of GPUs, which requires careful orchestration of computation and communication across large clusters.
In this work, we show that careful choices of hardware configuration and parallelization strategy are critical for effective, cost-efficient scaling of model size, training data, and total computation.
We conducted an extensive empirical study of how well large models train as we vary model size, hardware configuration, and parallelization strategy. Two findings stand out:
- Beyond certain scales, the overhead of some distributed communication strategies makes parallelization strategies previously thought to be sub-optimal preferable in practice.
- Simply adding more hardware without adjusting anything else quickly yields diminishing returns: each additional GPU or unit of power buys less and less extra performance.
Empirical scaling laws show that the performance of large neural networks improves with model size, the amount of training data, and total training compute (measured in FLOPs). This has driven the pursuit of ever-larger models for state-of-the-art results in language and vision tasks.
These state-of-the-art neural networks, now containing hundreds of billions of parameters, demand more computing and memory during training. In many cases, a single GPU can’t hold the entire model, meaning we have to spread the workload across many GPUs to take advantage of their processing power and memory. Training in such situations requires some complicated strategies to split data and model pieces between the GPUs.
As more devices are required to train these large neural networks, the relative costs of communication and computation have shifted significantly. Deep learning workloads used to be dominated by computation (they were compute-bound), but with so many GPUs working together, the communication required to keep them synchronized can now become the limiting factor.
This means we can't just keep adding more GPUs and expect things to run faster. Instead, we see that the communication cost rises and can limit how much we can scale up our model size and workload while still improving performance. Our experiments across different hardware setups show that improvements in computing have outpaced improvements in memory and network speeds, which means we experience even more communication-related issues as we scale up.
So, while established techniques exist for training across many devices, how they behave at scale has not been fully characterized. Our contributions are:
- A large-scale study of distributed training covering different hardware setups and strategies, highlighting how sharded training scales.
- Evidence that adding more accelerator devices yields diminishing returns in performance, particularly in per-device token throughput.
- An analysis of real costs, showing that power draw grows roughly linearly with GPU count while performance does not keep pace.
- Evidence that model parallelism, often assumed to reduce hardware utilization, can actually improve performance by reducing exposed communication.
- A study showing that future gains in performance are likely to be marginal unless we also improve communication networks and memory size.
In this section, we take a moment to introduce key ideas around using hardware accelerators in data centers and common techniques for training large neural models.
Accelerators at Scale
Training large neural networks typically takes place in clusters containing very large numbers of GPUs, valued for their compute throughput and high-bandwidth memory. A variety of interconnect technologies link these GPUs, each with its own trade-offs in scale, bandwidth, and cost.
When grouped together on the same node, NVIDIA GPUs can communicate over fast interconnects such as NVLink or NVSwitch, allowing them to exchange data much faster than with GPUs on other nodes. These GPU characteristics, such as compute throughput, memory capacity, and communication bandwidth, have improved at different rates, and that imbalance has shaped how neural networks have been designed over the years.
As models grow bigger and data sets expand, communication has become a bottleneck, prompting new algorithms for training. What this means is that with all this added hardware, the cost of communicating between devices starts to impact the efficiency of the entire system.
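To make this concrete, here is a minimal sketch of how one might time a single AllReduce across devices with PyTorch's `torch.distributed`. It assumes a `torchrun` launch and the NCCL backend, and is only an illustration of the kind of measurement involved, not the exact benchmarking setup used in this work (which relies on dedicated tools such as nccl-tests, linked below).

```python
# Minimal timing sketch (assumes a `torchrun` launch and the NCCL backend).
import os
import time
import torch
import torch.distributed as dist

def benchmark_allreduce(num_elements: int = 256 * 1024 * 1024, iters: int = 20) -> float:
    dist.init_process_group(backend="nccl")          # uses RANK/WORLD_SIZE set by torchrun
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    buf = torch.randn(num_elements, device="cuda")   # ~1 GiB of fp32 "gradients"

    for _ in range(5):                               # warm-up so NCCL sets up communicators
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        gb = buf.numel() * buf.element_size() / 1e9
        print(f"AllReduce of {gb:.1f} GB: {elapsed * 1e3:.1f} ms per call")
    dist.destroy_process_group()
    return elapsed

if __name__ == "__main__":
    benchmark_allreduce()
```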
Algorithms for Distributed Training
Most distributed training methods aim to keep things simple for users while mimicking training on a single device. A big decision is how to distribute the components of the model and the data: do we replicate the model across different GPUs or split it up?
When using data parallelism, we replicate the model parameters across GPUs but split the data batch between them. Each GPU computes local outputs and gradients, and the gradients are then exchanged and averaged so that every replica applies the same update; this collective exchange is known as AllReduce.
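As a rough illustration of this pattern, the sketch below replicates a model, computes gradients on a local slice of the batch, and averages them with an explicit AllReduce. It assumes an already-initialized process group; the model and sizes are placeholders.

```python
# Sketch of data parallelism with an explicit gradient AllReduce. Assumes an
# initialized process group (e.g. via torchrun + init_process_group).
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)                          # identical initial weights on every rank
model = nn.Linear(4096, 4096).cuda()          # full replica of the model per GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(local_batch: torch.Tensor, target: torch.Tensor) -> None:
    optimizer.zero_grad(set_to_none=True)
    loss = F.mse_loss(model(local_batch), target)
    loss.backward()
    # Average gradients so every replica applies the same global update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
    optimizer.step()
```

In practice, `torch.nn.parallel.DistributedDataParallel` performs the same exchange automatically and overlaps it with the backward pass, which helps hide communication time at scale.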
If a model is too big to fit on one device, techniques like Fully-Sharded Data Parallelism (FSDP) split the model parameters and optimizer states across GPUs. Each device must then temporarily gather the parameters it needs right before computing with them, which adds communication and can slow things down.
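A minimal sketch of what this looks like with PyTorch's FSDP wrapper, again assuming an initialized process group; the toy model stands in for a transformer block.

```python
# Sketch of sharded training with PyTorch FSDP. Assumes an initialized
# process group; the toy model and sizes are placeholders.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(
    nn.Linear(4096, 16384),
    nn.GELU(),
    nn.Linear(16384, 4096),
).cuda()

# Each rank now stores only a 1/world_size shard of every parameter; full
# parameters are gathered (AllGather) just before each forward/backward use
# and freed afterwards, which is the extra communication described above.
sharded_model = FSDP(model)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)
```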
With model parallelism, we instead split the model parameters themselves across GPUs, with every GPU working on the same input data at once. Activations must then be exchanged between devices, which can introduce delays. Tensor parallelism is a related strategy that splits individual layers and weight matrices across devices.
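For illustration, here is a simplified column-parallel linear layer of the kind used in tensor parallelism. The class name and shapes are hypothetical, and real implementations (for example, Megatron-style tensor parallelism) also insert the matching collectives in the backward pass, which this sketch omits.

```python
# Simplified column-parallel linear layer: each rank owns a slice of the
# weight matrix, computes a partial output for the same input, and the
# slices are gathered back into the full activation.
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Each rank holds out_features // world_size output columns.
        self.local = nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)                     # same input, partial output
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)          # communication on every forward pass
        return torch.cat(gathered, dim=-1)
```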
In the end, the goal is to optimize communication and computation to make distributed training as efficient as possible.
Experimental Methodology
In the following sections, we’ll delve into the effects of scaling workloads on both computation speed and communication. We’ll analyze how these factors influence the overall system performance across different strategies, GPU devices, and model sizes.
We focus mainly on the Llama-2 family of transformer models, as they are representative of current state-of-the-art architectures. Our experiments run on clusters of NVIDIA DGX nodes and compare multiple parallelization techniques.
We measure metrics such as throughput, which tells us how many tokens or examples we can process per unit of time, along with efficiency metrics for power and compute utilization.
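As an example of how such metrics can be computed, the helper below estimates token throughput and an approximate model FLOPs utilization (MFU), a commonly reported utilization metric, using the rough 6 * parameters * tokens estimate of transformer training FLOPs. All example numbers are assumptions, not results from this study.

```python
# Hypothetical helper: token throughput and approximate model FLOPs
# utilization (MFU), using the common ~6 * parameters * tokens estimate of
# training FLOPs for transformers. All example numbers are assumptions.
def throughput_and_mfu(tokens_per_step: int,
                       step_time_s: float,
                       n_params: float,
                       peak_flops_per_gpu: float,
                       n_gpus: int) -> tuple[float, float]:
    tokens_per_s = tokens_per_step / step_time_s
    achieved_flops = 6.0 * n_params * tokens_per_s   # forward + backward, ignoring attention terms
    mfu = achieved_flops / (peak_flops_per_gpu * n_gpus)
    return tokens_per_s, mfu

# Example with assumed numbers: a 7B-parameter model, 4M tokens per step,
# 12 s per step, 256 GPUs at an A100-class bf16 peak of 312 TFLOP/s.
tps, mfu = throughput_and_mfu(4_000_000, 12.0, 7e9, 312e12, 256)
print(f"{tps:,.0f} tokens/s, MFU = {mfu:.1%}")
```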
Effects of Scaling: Parallelization, Hardware, Model Size
Now, let’s look at how scaling affects different types of neural network architecture, the hardware used, and the ways we share the load.
Scaling Data Parallelism
We can see how increasing the number of GPUs (from 8 to 2048) affects data parallel training speed. With more devices, we can increase the overall throughput as the global batch size grows.
When using FSDP data parallel training with a Llama-7B model, we notice that while adding more nodes increases overall power usage linearly, the efficiency drops. There is a growing amount of communication overhead, which holds back throughput.
Interestingly, at smaller scales, the cost of communication is quite low. But as we scale up, those larger communication costs start to take a toll. We observe that as we increase the number of GPUs, the time for communication rises significantly.
This means that while adding devices should ideally speed things up, in reality the gains shrink: the communication cost leaves GPUs sitting idle, unable to do useful work, which lowers overall performance rather than improving it.
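A back-of-the-envelope model helps build intuition for why this happens. The sketch below combines a fixed per-device compute time with a simple ring-AllReduce cost model (a bandwidth term plus a per-hop latency term) and reports what fraction of each step is spent communicating. All constants are assumptions for illustration, not measurements from our experiments.

```python
# Toy model: step time = per-device compute time + (un-overlapped) ring
# AllReduce time. All constants are illustrative assumptions.
def step_time_model(world_size: int,
                    grad_bytes: float = 14e9,        # ~7B parameters in bf16
                    bus_bandwidth: float = 100e9,    # bytes/s, assumed effective bandwidth
                    latency_per_hop: float = 20e-6,  # seconds per ring hop, assumed
                    compute_time: float = 0.5) -> tuple[float, float]:
    p = world_size
    comm = 2 * (p - 1) / p * grad_bytes / bus_bandwidth + (p - 1) * latency_per_hop
    step = compute_time + comm       # pessimistic: assumes no compute/communication overlap
    return step, comm / step         # step time and exposed-communication fraction

for p in (8, 64, 512, 2048):
    step, frac = step_time_model(p)
    print(f"{p:5d} GPUs: step {step:.2f} s, communication share {frac:.0%}")
```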
Scaling Model Parallelism
Model parallelism is helpful when our models don’t fit on one GPU. It allows us to split layers between different devices to make training more efficient. When we train the Llama-7B model using this method, we can boost throughput and efficiency while minimizing communication costs.
While some might argue that model parallelism lowers hardware utilization, we've found that clever strategies can actually improve performance by minimizing communication that slows things down.
Scaling the Hardware World Size
Next, let’s look at what happens when we scale our hardware. By keeping the workload the same but adding more devices, we might expect to see gains. Instead, we find that as we increase the number of devices, the effective workload per device decreases.
So, when the hardware expands too rapidly without adjusting the workload, we see diminishing returns in both global throughput and local hardware utilization. This is especially true when training larger models, where communication needs can dominate performance.
When we push the scale up too high (for example, increasing from 512 to 2048 GPUs), the per-device performance can drop significantly. We see a similar pattern even when we test models of various sizes; bigger doesn't always mean better.
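The arithmetic behind this effect is simple: with a fixed global batch, each added device receives a smaller slice of work, leaving less computation with which to amortize fixed communication and launch overheads. The numbers below are purely illustrative.

```python
# Illustrative only: a fixed global batch divided over more devices leaves
# each GPU with less work per step.
global_batch_tokens = 4_000_000          # assumed global batch size in tokens
for world_size in (512, 1024, 2048):
    per_device = global_batch_tokens // world_size
    print(f"{world_size:5d} GPUs -> {per_device:6,d} tokens per device per step")
```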
Scaling the Hardware Generation
Comparing hardware generations, such as moving from A100 to H100 GPUs, we find that even though raw compute gets faster, actual utilization can fall. Compute performance has improved more quickly than interconnect bandwidth, so a larger fraction of communication time is left exposed and efficiency drops.
On closer inspection, the newer hardware does deliver improvements, but the worsening balance between computation and communication continues to limit overall performance.
Scaling the Model Architecture
When we examine models of varying sizes, we find that as models grow, so do their communication needs. Scaling up therefore affects not only how fast we can compute, but also how much data must move between devices.
As we go bigger, we need to be smart about our model parallelism, as some strategies work better than others across different sizes. The results show that we can boost both utilization and throughput by finding the right balance between size and communication.
Scaling the Compute Workload
Increasing the per-device workload, for example by processing more data per step, leads to better GPU utilization: the larger amount of local computation keeps GPUs busy and helps amortize communication costs.
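One common way to increase the per-device workload without changing the model is gradient accumulation: do more local compute for each communication step. The sketch below assumes the `model` and `optimizer` from the earlier data-parallel example; it is an illustration, not the configuration used in our experiments.

```python
# Sketch of amortizing communication with gradient accumulation: several
# local micro-batches share a single gradient AllReduce and optimizer step.
# `model` and `optimizer` are assumed to be the data-parallel ones defined
# in the earlier sketch.
import torch.distributed as dist
import torch.nn.functional as F

ACCUM_STEPS = 8  # more local compute per unit of communication

def accumulated_step(micro_batches, targets) -> None:
    optimizer.zero_grad(set_to_none=True)
    for x, y in zip(micro_batches, targets):
        loss = F.mse_loss(model(x), y) / ACCUM_STEPS
        loss.backward()                      # gradients accumulate in param.grad
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad)          # one exchange per ACCUM_STEPS micro-batches
        param.grad /= world_size
    optimizer.step()
```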
However, changing the workload isn’t always straightforward. Training setups often have to account for various factors, including how they affect overall performance.
Trends in Scaling and Implications
We can break training setups into two distinct regimes. In one, the model size is large compared to the world size, so each device has enough work to stay busy and operate efficiently. In the other, the device count is large relative to the model, so communication takes over and efficiency suffers.
What we’ve learned is that not all FLOPs are created equal. Traditional measures of performance based on FLOPs often miss the mark because they don’t factor in the communication needs that arise with distributed setups.
Collective communication can really slow things down at scale, which drives us to think about different strategies. If we don’t adjust our methods to fit this, we could see diminishing returns in performance as the model size increases.
Not only that, but trying to train one massive model can actually be less efficient in terms of power than training several smaller models simultaneously. This gives a strong case for pursuing ensemble strategies, which might yield better performance without needing to scale hardware to absurd levels.
Diminishing Returns in Performance and Power at Scale
Though scaling hardware starts out promising, it can lead to undesirable outcomes: power draw rises roughly linearly with GPU count, but the expected performance often doesn't follow suit, leading to worsening power-to-performance ratios.
As a result, we see issues with the effectiveness of scaling. For example, adding more resources can lead to only slight increases in throughput, while costs keep climbing.
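One way to quantify this trade-off is to track energy alongside throughput. The sketch below samples GPU power via NVML (through the `pynvml` bindings) around each training step and reports tokens per joule; it is an illustrative measurement loop with placeholder inputs, not the instrumentation behind the results reported here.

```python
# Illustrative measurement loop: sample GPU power via NVML around each
# training step and report tokens per joule. `run_training_step` and
# `tokens_per_step` are placeholders for the caller's training loop.
import time
import pynvml

def tokens_per_joule(run_training_step, tokens_per_step: int, steps: int = 10) -> float:
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    energy_joules, tokens = 0.0, 0
    for _ in range(steps):
        t0 = time.perf_counter()
        run_training_step()
        dt = time.perf_counter() - t0
        # nvmlDeviceGetPowerUsage returns milliwatts; a single end-of-step
        # sample is a crude approximation of average power over the step.
        watts = sum(pynvml.nvmlDeviceGetPowerUsage(h) for h in handles) / 1000.0
        energy_joules += watts * dt
        tokens += tokens_per_step
    pynvml.nvmlShutdown()
    return tokens / energy_joules
```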
While upgrading internal node connections can help, building nodes with more accelerators and higher speeds is also an effective way to tackle communication issues at larger scales.
Performance Benchmarking Fails to Extrapolate Across Scales and Hardware Generations
As we evaluate new hardware and larger scales, we find that benchmark results measured in one setting do not carry over to another. The dynamics change significantly across scales and hardware generations, which makes performance evaluation challenging.
In the end, we notice that simply measuring throughput or FLOPs does not give a full picture of performance. A clear understanding of how communication impacts execution time in large systems is crucial for success.
Related Work
Throughout this exploration of deep learning, we've realized that the world of large-scale training is different from traditional computing. Past research has helped shed light on how efficient we can be, but it’s crucial to keep evolving our understanding and strategies.
Everything from memory needs to the speed of communication can drastically change how we think about deep learning systems. Our work aims to expand on past research by examining how scaling affects performance, particularly in relation to communication needs.
In wrapping up, we’d like to thank those who helped us shape our ideas and put together this work. It’s a team effort all the way, and we’re looking forward to future research that will keep pushing these boundaries.
Limitations and Future Work
While we certainly focused on core parallelization techniques for training neural networks, there’s still a ton out there to explore. Other methods for reducing memory and workload could really shake things up, and future work will likely expand on what we’ve covered here.
As we move forward, we also want to look at how different rates of communication can change our results. It’s a chance to dive even deeper into how different setups affect performance.
And while our findings center on certain types of hardware, we believe similar trends will appear with other platforms as well. So, the road ahead is filled with opportunities for exploration and new insights.
Software and Hardware Details
For our training, we used a special framework and applied various optimizations. Our primary experiments were conducted with specific GPU setups and took the time to measure a range of metrics that matter when evaluating performance.
In summary, as we continue to push the boundaries of what’s possible in training deep neural networks, it’s clear that communication and computation must go hand in hand. The journey is ongoing, and we can only imagine where it will lead us next.
Title: Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Abstract: Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, overhead incurred from certain distributed communication strategies leads parallelization strategies previously thought to be sub-optimal in fact become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.
Authors: Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn
Last Update: 2024-11-20 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2411.13055
Source PDF: https://arxiv.org/pdf/2411.13055
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-1/dgx-1-rhel-datasheet-nvidia-us-808336-r3-web.pdf
- https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf
- https://resources.nvidia.com/en-us-dgx-systems/ai-enterprise-dgx?xs=489753
- https://images.nvidia.com/content/technologies/deep-learning/pdf/Datasheet-DGX1.pdf
- https://www.nvidia.com/en-us/data-center/dgx-station-a100-whitepaper/
- https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- https://www.nvidia.com/en-us/data-center/nvlink/
- https://datatracker.ietf.org/doc/html/rfc4391
- https://github.com/NVIDIA/nccl
- https://github.com/ROCm/rccl
- https://github.com/openxla/xla
- https://github.com/NVIDIA/nccl-tests
- https://resources.nvidia.com/en-us-dgx-systems/dgx-superpod-gb200-datasheet
- https://developer.nvidia.com/management-library-nvml
- https://github.com/goodfeli/dlbook_notation
- https://github.com/facebookresearch/repo
- https://ai.meta.com/blog/?page=1
- https://fb.workplace.com/notes/1767028093764250
- https://www.facebook.com/brand/meta/color/