TL;DR
"As the hardware landscape is becoming less homogenous and the access to GPU more intricate, what about making LLM training hardware-agnostic? Training models on AMD hardware is becoming increasingly accessible to everyone. In this blog post we will demonstrate an advanced example where we will write our own kernels for a new language model architecture with little support in the open. It goes without saying that training a Transformer model would be way easier and more straightforward thanks to the available support of such models, for instance optimizations such as Flash Attention v2 are already available on AMD. What if in the future, the main criteria to choose the training hardware would be the GPU cost per hour?"
Introduction
Most AI training workloads use NVIDIA GPUs as the hardware accelerator, and the deep learning stack used by ML practitioners is heavily dependent on libraries leveraging CUDA software. While Transformer architectures have benefited from a number of hardware and implementation optimizations (e.g., Flash Attention 1, 2, and recently 3, xformers, etc.), new model architectures (e.g., state space models) may be lagging behind in this regard. The Transformer architecture has the advantage of relying primarily on matrix multiplication, which is one of the most optimized pieces of software. While novel architectures such as Mamba are built upon different algorithms that admit efficient implementations on NVIDIA GPUs, they often struggle to achieve similar performance on AMD GPUs due to limited support. That is why we decided to adapt the Mamba kernels to run on AMD GPUs.
Previous efforts have demonstrated the feasibility of training Transformer models at scale on AMD hardware, such as MosaicML's blog post Training LLMs with AMD MI250 GPUs and MosaicML. We expand on these efforts by focusing on the Mamba architecture instead of Transformers and by developing kernels to maximize training efficiency on AMD hardware.
In this blog post we show how we can train a Mamba model interchangeably on NVIDIA and AMD GPUs, and we compare training performance and convergence on both. This shows that our training stack is becoming more GPU-agnostic.
AMD Instinct MI250/MI250X GPUs:
The AMD MI250/250X is a high-performance datacenter GPU designed for AI and HPC workloads, comparable to NVIDIA's A100. It is typically deployed in nodes of four cards, with each card containing two dies. This unique design offers a total of 128 GB of High Bandwidth Memory (HBM) per card and achieves a theoretical peak performance of 362.1 TFLOPs in bfloat16/float16 floating point operations, surpassing the A100's 312 TFLOPs.
Key features of the MI250/250X include:
- Higher peak TFLOP/s in FP16 or BF16 compared to the A100 (362.1 vs 312 TFLOPs)
- Larger HBM memory capacity (128GB vs A100's 80GB max), enabling training and inference of larger models
- Comparable or slightly better power efficiency per GPU when considering system-level power consumption
- Typically configured in 4-GPU nodes, contrasting with A100's common 8-GPU setups
The MI250/250X utilizes Matrix Cores, which are analogous to NVIDIA's Tensor Cores, optimized for rapid matrix operations crucial in machine learning tasks.
In the rest of the blogpost we will refer to the AMD Instinct MI250/250X by just MI250 for simplicity.
ROCm as an alternative to CUDA
ROCm is an open-source software stack for GPU computation that provides a comprehensive suite of drivers, APIs, and development tools enabling GPU programming from low-level kernel operations to high-level applications. Powered by the Heterogeneous-Computing Interface for Portability (HIP), ROCm supports code portability across various GPU platforms and integrates with popular machine learning frameworks like PyTorch. It offers support for multiple programming models, including OpenMP and OpenCL, and is particularly well-suited for AI workloads.
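One practical consequence of this integration is that ROCm builds of PyTorch expose AMD GPUs through the usual torch.cuda namespace. A minimal sketch to check which backend a given PyTorch build is running on:

```python
import torch

# On a ROCm build of PyTorch, the familiar torch.cuda API maps to HIP,
# so the same training code runs unchanged on either vendor's GPUs.
if torch.cuda.is_available():
    backend = "ROCm/HIP" if torch.version.hip is not None else "CUDA"
    print(f"Backend: {backend}")
    print(f"Device:  {torch.cuda.get_device_name(0)}")
else:
    print("No GPU visible to PyTorch")
```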
The setup: MI250/MI250X GPUs, PyTorch, ROCm, RCCL
We use PyTorch 2.4.0 and RCCL 2.18.3 along with ROCm 6.1. The training code is based on Composer, a PyTorch-FSDP-based training framework that supports training on multiple GPUs as well as multiple nodes in native PyTorch.
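Composer drives FSDP through its own configuration, so the sketch below is only an illustration of the underlying wrapping in plain PyTorch, not the actual Mambaoutai training code; build_mamba_model() is a hypothetical constructor standing in for the real model.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Illustrative FSDP wrapping; Composer performs the equivalent internally.
dist.init_process_group(backend="nccl")  # the "nccl" backend maps to RCCL on ROCm
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = build_mamba_model()  # hypothetical constructor for the 1.6B Mamba model

model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
```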
How to reproduce the run:
- Install PyTorch 2.4+, RCCL 2.18+ and ROCm 6.1.
- Clone both the causal-conv1d-amd and mamba-amd repos and build the binaries by following the instructions outlined in the respective READMEs; a quick sanity check is sketched after this list.
- Start the run on as many GPUs as you can afford :)
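Once both extensions are built, a quick sanity check is to run a single Mamba block forward and backward on one die. This assumes the AMD forks keep the upstream mamba_ssm Python interface; the shapes match the benchmark configuration used later in this post.

```python
import torch
from mamba_ssm import Mamba  # assumes the mamba-amd fork keeps the upstream Python API

device = "cuda"  # on a ROCm build of PyTorch this targets an MI250 die
block = Mamba(d_model=5376, d_state=16, d_conv=4, expand=2).to(device)

x = torch.randn(2, 4096, 5376, device=device, requires_grad=True)
y = block(x)        # forward pass exercises the causal-conv1d and selective-scan kernels
y.sum().backward()  # backward pass exercises their backward counterparts
print(y.shape)      # torch.Size([2, 4096, 5376])
```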
Mamba Training Performance:
Individual Kernels
Training Mamba on AMD GPUs presents significant challenges without custom kernels. In fact, standard PyTorch eager mode proves impractical and leads to excessive memory consumption and poor throughput. While PyTorch does provide a built-in implementation of 1D causal convolution via the Conv1d layer, the efficient selective scan operation—a key component of Mamba—is currently only optimized for CUDA.
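To see why, here is a didactic, eager-mode sketch of the selective scan recurrence (following the formulation in the Mamba paper, not the fused kernel): it loops over the sequence and materializes a (batch, d_inner, d_state) hidden state at every step, which is exactly the sequential, memory-hungry pattern the custom kernel avoids.

```python
import torch

def selective_scan_ref(u, delta, A, B, C, D):
    """Naive selective scan (didactic reference, not the fused kernel).
    Shapes: u, delta: (batch, d_inner, seqlen); A: (d_inner, d_state);
    B, C: (batch, d_state, seqlen); D: (d_inner,).
    Recurrence: x_t = exp(delta_t * A) * x_{t-1} + (delta_t * B_t) * u_t,
                y_t = <C_t, x_t> + D * u_t."""
    batch, d_inner, seqlen = u.shape
    d_state = A.shape[1]
    x = u.new_zeros(batch, d_inner, d_state)
    ys = []
    for t in range(seqlen):  # this sequential loop is what the fused kernel avoids
        dA = torch.exp(delta[:, :, t, None] * A)                         # (batch, d_inner, d_state)
        dBu = delta[:, :, t, None] * B[:, None, :, t] * u[:, :, t, None]
        x = dA * x + dBu
        ys.append(torch.einsum("bdn,bn->bd", x, C[:, :, t]))
    y = torch.stack(ys, dim=-1)                                          # (batch, d_inner, seqlen)
    return y + D[None, :, None] * u
```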
We provide a speed benchmark of the fundamental operations used in Mamba, namely the selective scan and the causal 1D convolution, on an NVIDIA A40 GPU, which has 48 GB of memory and 149.7 TFLOPs of peak theoretical performance, and on a single die of an MI250, which corresponds to around 64 GB of memory and 181 TFLOPs.
The previous table shows that the causal convolution kernel on MI250 is faster than on the A40 during both the forward and backward passes, while the selective scan kernel is slower on MI250, especially during the backward pass. Selective scan is a more complex operation than causal convolution, making it more prone to degradation when translated to HIP. In particular, we found hipCUB's InclusiveScan and InclusiveReverseScan, which are not used in causal convolution, along with ATen/PyTorch's atomic add, to be significantly slower on MI250 compared to the A40. In the interest of time, we did not redesign the kernels specifically for AMD hardware or with HIP's most efficient libraries in mind. An important point is that a direct translation of selective scan (and, to a lesser extent, causal convolution) into a functional HIP kernel, such as the one by EmbeddedLLM at the time of writing, yields kernels that are an order of magnitude slower than our final results. The speedup comes largely from updating the launch bounds and switching the atomics to the unsafe atomic versions according to the HIP Porting Guide. Additional gains can be made by tuning the grid shape and compiler options.
Note:
- These measurements are for one die of MI250, which has slightly more TFLOPs than an A40.
- The benchmark configuration is: (batch size, sequence length, d_model, d_state) = (2, 4096, 5376, 16)
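As an illustration of how such per-kernel timings can be collected, the sketch below times the causal-conv1d forward pass with CUDA/HIP events using the configuration above; it assumes the causal-conv1d-amd fork keeps the upstream causal_conv1d_fn interface, and the selective-scan kernel can be timed the same way.

```python
import torch
from causal_conv1d import causal_conv1d_fn  # assumes the AMD fork keeps the upstream API

batch, seqlen, d_model, d_state = 2, 4096, 5376, 16
d_inner = 2 * d_model  # Mamba expands the model dimension by 2 before the conv/scan

x = torch.randn(batch, d_inner, seqlen, device="cuda", dtype=torch.bfloat16)
weight = torch.randn(d_inner, 4, device="cuda", dtype=torch.bfloat16)  # conv width 4
bias = torch.randn(d_inner, device="cuda", dtype=torch.bfloat16)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for _ in range(10):  # warmup
    causal_conv1d_fn(x, weight, bias, activation="silu")
torch.cuda.synchronize()

start.record()
for _ in range(100):
    causal_conv1d_fn(x, weight, bias, activation="silu")
end.record()
torch.cuda.synchronize()
print(f"causal_conv1d forward: {start.elapsed_time(end) / 100:.3f} ms")
```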
Overall Training Throughput
In addition to the individual speed of the selective scan and the causal convolution, we are also interested in the overall speed achieved by the kernels when baked into a full language model architecture. To that end, we reuse our previously released 1.6B language model Mambaoutai, which was originally trained on NVIDIA GPUs, and benchmark its performance on AMD MI250 GPUs with our custom kernels.
Note that what we call a GPU here is actually just one die: PyTorch treats each die as a separate GPU device, so with a node of 4xMI250 GPUs, rocm-smi (the equivalent of nvidia-smi) would show 8 devices. This is a particularity of the AMD MI250 and will be abandoned in the next generations of AMD GPUs, such as the recently announced MI300 series.
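For example, on a 4xMI250 node (a sketch; the exact output depends on the machine):

```python
import torch

# On a node with 4 MI250 cards, each of the 8 dies appears as its own device.
print(torch.cuda.device_count())  # -> 8, not 4
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 2**30:.0f} GiB")  # ~64 GiB per die
```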
We benchmarked throughput across various sequence lengths on a single die, aiming to draw comparisons with our previous Mambaoutai 1.6B blog post. This comparison is particularly relevant as one AMD MI250 boasts nearly identical FLOPs to both variants of the NVIDIA A100.
Without leveraging block-wise activation checkpointing, our results closely mirror those achieved on an A100 with 64GB. However, it is worth noting that this specific activation checkpointing pattern does not seem to translate seamlessly to PyTorch on AMD hardware.
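For reference, block-wise activation checkpointing simply means recomputing the activations of a subset of blocks during the backward pass instead of storing them. A minimal sketch with torch.utils.checkpoint (the block layout and every_k value are hypothetical, not the exact Mambaoutai configuration):

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_blockwise_checkpointing(blocks, hidden_states, every_k=2):
    """Checkpoint every k-th block: its activations are recomputed in the
    backward pass instead of being kept in memory (hypothetical layout)."""
    for i, block in enumerate(blocks):
        if i % every_k == 0:
            hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
        else:
            hidden_states = block(hidden_states)
    return hidden_states
```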
While there is undoubtedly room for further optimization in our kernels, we are already seeing promising performance in training a small language model. For context, using a 3D parallelism approach via the Nanotron library, we achieved 50k tokens per second on an A100 64GB, which is very close to the 44k tokens per second measured here.
Considering this is the first port of a new architecture, the performance is already highly competitive. We are confident that the community can push it even further.
Multi-node Training: Strong Scaling Setup
We test multi-node scaling, i.e., how training throughput evolves as a function of the number of GPUs used. There are two scenarios to consider: strong scaling and weak scaling. In weak scaling, everything in the training setup is held constant except the global batch size, which is scaled with the number of nodes so that the number of gradient accumulation steps, and thus the ratio of computation to communication and the various overheads, stays constant. However, this means the runs would not be equivalent from an optimization perspective. For this reason, and for reproduction purposes, we focus on strong scaling. In strong scaling experiments, everything is kept constant, including the global batch size of 896. Smaller node counts can therefore run computation for much longer before they need to exchange results, so they are expected to achieve better FLOPs utilization, as their ratio of computation to communication is much higher.
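A small sketch of this arithmetic, with the global batch fixed at 896 sequences and a hypothetical per-die micro-batch size of 7: as dies are added, the number of gradient-accumulation steps between synchronizations shrinks, which is why FLOPs utilization drops at higher GPU counts.

```python
GLOBAL_BATCH = 896  # fixed across all strong-scaling runs
MICRO_BATCH = 7     # per-die micro-batch size (hypothetical, for illustration)

for num_dies in (8, 16, 32, 64, 128):
    grad_accum_steps = GLOBAL_BATCH // (num_dies * MICRO_BATCH)
    print(f"{num_dies:3d} dies -> {grad_accum_steps:2d} gradient-accumulation steps per optimizer step")
```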
For more details about training hyperparameters, check the Mambaoutai training repo.
The following figure illustrates that training throughput increases with the number of GPUs, as expected. However, we observe that this scaling becomes sub-linear as the GPU count rises, indicating growing communication bottlenecks. This sub-optimal scaling can be attributed to our use of Ethernet networking instead of the lower-latency InfiniBand. Due to time constraints, we opted for the more straightforward Ethernet setup rather than configuring InfiniBand, which would have required a more complex setup but potentially offered better performance.
The choice of network interface is crucial for AI workloads, where bandwidth and latency are paramount. InfiniBand's architecture significantly reduces packet loss, resulting in very low latency. In contrast, Ethernet networks inherently experience more packet loss. While many applications can tolerate this, AI training workloads are particularly sensitive to such delays, given their already time-consuming and costly nature.
It is worth noting that InfiniBand and Ethernet have different strengths. While InfiniBand excels in latency, Ethernet often leads in raw bandwidth capacity. For instance, NVIDIA’s latest Quantum InfiniBand switches peak at 51.2 Tb/s with 400 Gb/s ports. In comparison, Ethernet switching reached the 51.2 Tb/s milestone nearly two years ago and now supports port speeds up to 800 Gb/s.
We also visualize the per-GPU throughput, which stays almost constant up to 16 GPUs and then starts to degrade because of the communication bottleneck.
Mamba Optimization Convergence
When comparing hardware, it is crucial to ensure we are getting the same end result in terms of the final model. To verify this training stability, we decided to train the exact same model as Mambaoutai 1.6B, pushing it slightly beyond Chinchilla-optimal. Both runs utilized bfloat16 precision and FSDP. We also evaluate the resulting checkpoints on a set of standard in-context-learning (ICL) benchmarks after training. This way, we are not just looking at raw training performance, but making sure the quality of the output holds up across different hardware setups.
As we can see on the training loss curves, the same training dynamics are reproduced on both AMD and NVIDIA across around 40 GT (billion tokens). This result is notable when you consider all the factors that could throw things off: the non-deterministic nature of GPUs, the quirks of floating-point numerics, differences in software versions, and the general variability between platforms. It is a solid indication that we are dealing with comparable performance and stability across these different hardware setups, which is no small feat in the world of deep learning.
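As a rough sanity check of "slightly beyond Chinchilla-optimal", using the common ~20-tokens-per-parameter rule of thumb:

```python
params = 1.6e9                   # Mambaoutai parameter count
chinchilla_tokens = 20 * params  # ~20 tokens per parameter rule of thumb
print(f"~{chinchilla_tokens / 1e9:.0f}B tokens")  # about 32 GT, vs. the ~40 GT actually trained
```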
Examining the initial 200 steps of the training process more closely reveals nearly identical loss dynamics, regardless of whether the model is trained using 4 MI250-128GB GPUs or 8 H100-80GB GPUs.
When comparing checkpoints at similar training steps, we observe that the model trained on AMD MI250 performs on par with the one trained on NVIDIA H100. All models underwent evaluation using the lm-evaluation-harness with identical configurations. It is worth noting that the evaluation process has some inherent randomness, introducing slight variations in results. However, considering these natural fluctuations in the evaluation process and the fact that the compared checkpoints are from closely matched but not identical training steps, we view these results as equivalent in terms of performance.
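For reference, a minimal sketch of driving such an evaluation from Python with lm-evaluation-harness; the checkpoint path, model loader, and task list below are placeholders rather than the exact Mambaoutai evaluation configuration.

```python
import lm_eval

# Placeholder model path and task list; not the exact Mambaoutai evaluation config.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/checkpoint,dtype=bfloat16",
    tasks=["arc_easy", "hellaswag", "lambada_openai"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```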
Partnership with Nscale
The AMD Instinct MI250/MI250X nodes were kindly provided by Nscale. Nscale is a European cloud provider offering access to AMD GPUs, high-performance computing, and cloud solutions with a focus on sustainability, using 100% renewable energy sources and located primarily in Norway.