Accelerating Training of Large Transformer Models By Reducing Activation Recomputation: A Research Explainer

November 21, 2023

Introduction

Training deep learning models often requires significant computational resources (Andoorveedu et al., 2022). Large transformer models with billions of parameters are especially challenging to train (Narayanan et al., 2021), and they keep growing in size. As a result, creative methods are needed to train them effectively.

Model parallelism takes center stage as these models scale up. It divides the model's parameters, activations, and optimizer state across several devices so that they fit within device memory and training completes in a practical amount of time. Tensor-level model parallelism is often limited to a small group of GPUs connected by high-speed interconnects, such as those found in DGX servers.

Another option is pipeline parallelism, which requires storing the activations of several microbatches to reduce idle time in the pipeline. While this technique handles the memory for model parameters and optimizer state adequately, it does not reduce the memory needed to store activations. As a result, when scaling up large transformer models, storing these activations becomes a bottleneck. Nonetheless, novel techniques can alleviate the memory pressure of storing activations, thereby substantially reducing the need for recomputation.

These strategies are designed specifically for the transformer architecture. Conveniently, they are simple to implement and have no impact on compute efficiency. Other methods exist to reduce memory requirements when training large models, such as partitioning data across data-parallel ranks or offloading data to CPU memory, but these approaches often entail higher implementation costs and have a more pronounced impact on compute efficiency. This article homes in on the novel techniques while acknowledging that they can be combined with those other methods. It examines various forms of model parallelism and their implications for activation memory requirements, and it introduces sequence parallelism, used alongside tensor parallelism, as a pivotal tool for preventing redundant activation storage in regions where conventional tensor parallelism is less effective.

Furthermore, it underscores the importance of being selective about which activations to retain and which to recompute, a critical strategy for reducing recomputation cost while using memory efficiently. In the paper, the findings are presented as a series of experiments that quantify the improvements these techniques bring to individual training components and to overall training throughput.

Transformer Architecture

The discussion focuses on a single-stack transformer encoder or decoder comprising L layers. This architecture is illustrated in Figure 1 and forms the basis for understanding the techniques presented later.

Figure 1: Transformer Architecture

At the start of the network, the input tokens go through a word embedding table of size v × h. The resulting token embeddings are combined with learnt positional embeddings of size s × h. Here, 's' stands for sequence length, 'h' for hidden dimension, and 'v' for vocabulary size. The output of this embedding layer is a 3-dimensional tensor of size s × b × h, where 'b' stands for the micro-batch size.

Each transformer layer consists of a self-attention block with multiple attention heads, followed by a two-layer multi-layer perceptron (MLP). The MLP first expands the hidden dimension to 4h and then reduces it back to h. As a result, the input and output of each transformer layer have the same size, s × b × h.
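The shape bookkeeping described above can be traced with a few lines of Python. This is a minimal sketch that only tracks tensor shapes as tuples and performs no computation; the function names and the example sizes are illustrative assumptions, not taken from the paper.

```python
# Shape bookkeeping for the embedding step and the MLP of one transformer layer.
# Function names and example sizes are illustrative assumptions.


def embedding_shapes(v, s, b, h):
    """Return the shapes involved in the embedding step."""
    return {
        "word_embedding_table": (v, h),
        "positional_embeddings": (s, h),
        "embedding_output": (s, b, h),
    }


def mlp_shapes(s, b, h):
    """Return the activation shapes as a tensor passes through the two-layer MLP."""
    return [
        (s, b, h),      # input to the MLP
        (s, b, 4 * h),  # after the first linear layer (h -> 4h)
        (s, b, h),      # after the second linear layer (4h -> h)
    ]


if __name__ == "__main__":
    shapes = mlp_shapes(s=2048, b=4, h=1024)
    print(shapes)
    # The layer's input and output shapes are identical.
    assert shapes[0] == shapes[-1]
```

Because the input and output shapes match, layers can be stacked L times without any reshaping between them.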

Activation Memory

This section looks at activation memory in the single-stack transformer model depicted in Figure 1. Activations, in this context, are any tensors generated during the forward pass that are needed for gradient calculations during backpropagation. This category excludes the main model parameters and the optimizer state but includes elements such as dropout masks. The memory analysis covers the core components of a transformer layer: the attention block, the MLP, and layer normalization. Sequence parallelism is a further method that targets memory efficiency. It reduces the memory required for activations that tensor parallelism does not partition effectively. Partitioning those operations along the sequence dimension reduces the memory needed even further, at the cost of specific communication operations that bridge the sequence-parallel and tensor-parallel regions.
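For reference, the per-layer activation memory totals given in Korthikanti et al. (2023) can be written as below. This is a restatement under that paper's assumptions (16-bit activations and one byte per dropout-mask element), with results in bytes. The symbols a (number of attention heads) and t (tensor-parallel size) are not introduced elsewhere in this article.

```latex
\begin{align*}
\text{no parallelism:} \quad & sbh\left(34 + \frac{5as}{h}\right) \\
\text{tensor parallelism:} \quad & sbh\left(10 + \frac{24}{t} + \frac{5as}{ht}\right) \\
\text{tensor + sequence parallelism:} \quad & \frac{sbh}{t}\left(34 + \frac{5as}{h}\right)
\end{align*}
```

The constant 10 in the tensor-parallel expression is the portion that tensor parallelism leaves undivided; with sequence parallelism, that portion is divided by t as well.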

Pipeline parallelism, which divides the transformer layers into groups (stages), is another dimension to consider. The activation memory then depends on the schedule used. For example, in a 1F1B pipeline schedule with p stages, the first stage is under the most memory pressure because it must store activations for p microbatches. Other schedules might vary slightly. The standard input sequence in FFN layers can be replaced by various other types of activations (Pan et al., 2021).
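As a rough illustration of why the first stage is under the most pressure, the sketch below counts how many microbatches each stage holds activations for at its peak under a 1F1B schedule. It assumes the common non-interleaved 1F1B formulation, in which stage i (counting from 0) runs p - i forward passes before its first backward pass; the function name is hypothetical.

```python
# Peak number of microbatches whose activations a pipeline stage holds
# under a non-interleaved 1F1B schedule (an assumed, simplified model).


def peak_in_flight_microbatches(stage, num_stages, num_microbatches):
    """Return the peak count of in-flight microbatches for a 0-indexed stage."""
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    # Stage i runs (num_stages - i) forward passes before its first backward
    # pass, but it can never hold more microbatches than exist in the batch.
    return min(num_stages - stage, num_microbatches)


if __name__ == "__main__":
    p, m = 4, 8  # illustrative pipeline depth and microbatch count
    for stage in range(p):
        print(f"stage {stage}: {peak_in_flight_microbatches(stage, p, m)}")
```

With p = 4 and enough microbatches, the first stage holds 4 microbatches at its peak and the last stage holds 1, which is why memory planning is usually driven by the first stage.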

Evaluation of Memory and Speed Optimization in Large Transformer Models

In the pursuit of more powerful and efficient AI models, researchers keep pushing the boundaries of model size and training speed. The authors of the paper evaluated their proposed approach comprehensively, focusing on memory consumption and training speed.

Memory Efficiency

The colossal memory requirements of training models with billions of parameters were the researchers' primary concern. Two key techniques were introduced: sequence parallelism and selective activation recomputation. Used separately, each technique roughly halved the memory demand. The real breakthrough came when the techniques were combined, resulting in a five-fold reduction in memory requirements. In practical terms, this brought memory needs to under 20% of the original requirement.
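To see where a figure of "under 20%" can come from, the sketch below evaluates the per-layer activation memory expressions from Korthikanti et al. (2023) for each combination of techniques and reports each as a fraction of the tensor-parallel baseline. The formulas are restated from that paper as understood here (results in bytes, 16-bit activations); a is the number of attention heads and t the tensor-parallel size. The configuration values and function names are illustrative assumptions.

```python
# Per-layer activation memory (bytes) under different techniques, following
# the expressions in Korthikanti et al. (2023) as understood here.
# s: sequence length, b: micro-batch size, h: hidden size,
# a: attention heads, t: tensor-parallel size.


def tensor_parallel(s, b, h, a, t):
    return s * b * h * (10 + 24 / t + 5 * a * s / (h * t))


def with_sequence_parallel(s, b, h, a, t):
    return s * b * h / t * (34 + 5 * a * s / h)


def with_selective_recompute(s, b, h, a, t):
    return s * b * h * (10 + 24 / t)


def with_both(s, b, h, a, t):
    return 34 * s * b * h / t


def fractions_of_baseline(s, b, h, a, t):
    """Return each variant's memory as a fraction of the tensor-parallel baseline."""
    baseline = tensor_parallel(s, b, h, a, t)
    return {
        "sequence_parallel": with_sequence_parallel(s, b, h, a, t) / baseline,
        "selective_recompute": with_selective_recompute(s, b, h, a, t) / baseline,
        "both": with_both(s, b, h, a, t) / baseline,
    }


if __name__ == "__main__":
    # Illustrative configuration (an assumption, not a result from the article).
    config = dict(s=2048, b=4, h=6144, a=64, t=8)
    for name, fraction in fractions_of_baseline(**config).items():
        print(f"{name}: {fraction:.1%} of baseline")
```

For this assumed configuration, the combined setting comes out below one fifth of the baseline, in line with the five-fold reduction described above. The exact savings of each individual technique depend on the model's a, s, h, and t.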

For the evaluated models with up to one trillion parameters, this innovative approach proved to be a game-changer. Without these memory-saving techniques, these massive models would be practically untrainable.

Speed Optimization

Memory is not the sole challenge when training models of this magnitude. Training speed is equally vital. The researchers measured the time required to complete the forward and backward passes for a single transformer layer in a 22-billion-parameter model. The findings show that sequence parallelism alone brought a modest 6% improvement by reducing the forward-pass time. Selective activation recomputation proved even more potent. By recomputing only selected operations instead of all of them, the recomputation overhead was reduced from 64% to just 11%. When combined with sequence parallelism, the overhead dropped further, to a mere 4%.
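The overhead percentages above can be turned into two more intuitive quantities: the layer time relative to a run with no recomputation, and the share of the full-recomputation overhead that each approach removes. The sketch below is plain arithmetic on the figures quoted in this section; the function names are hypothetical.

```python
# Arithmetic on the recomputation overheads quoted in the text
# (64% full recomputation, 11% selective, 4% selective + sequence parallel).


def relative_time(overhead):
    """Layer time relative to a run without recomputation (1.0 = no overhead)."""
    return 1.0 + overhead


def overhead_eliminated(full_overhead, reduced_overhead):
    """Fraction of the full-recomputation overhead that has been removed."""
    return (full_overhead - reduced_overhead) / full_overhead


if __name__ == "__main__":
    full, selective, combined = 0.64, 0.11, 0.04
    print(f"relative time, full recompute: {relative_time(full):.2f}x")
    print(f"relative time, selective: {relative_time(selective):.2f}x")
    print(f"relative time, combined: {relative_time(combined):.2f}x")
    print(f"overhead removed, selective: {overhead_eliminated(full, selective):.1%}")
    print(f"overhead removed, combined: {overhead_eliminated(full, combined):.1%}")
```

Going from 64% to 4% removes more than nine tenths of the overhead that full activation recomputation would add.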

Conclusion and Future Prospects

Two novel and practical strategies were presented in the ‘Reducing Activation Recomputation in Large Transformer Models’ paper to address the memory constraints associated with storing activations during large-scale transformer model training. By considerably lowering the need to recompute activations, these strategies offer a convincing solution to the memory-pressure problem. A significant decrease in activation memory was achieved by combining sequence parallelism with tensor parallelism and selective activation recomputation. The decrease was particularly noticeable for large models, where it amounted to a roughly five-fold reduction in activation memory, while avoiding most of the computational overhead incurred by full activation recomputation.

Future work in this area will build on these promising foundations. The researchers want to reduce activation memory consumption even further by tackling memory fragmentation problems, particularly with large microbatches. They also want to address the non-uniform memory allocation caused by pipeline parallelism, thereby improving memory management for smoother model training.


References

Andoorveedu, M., Zhu, Z., Zheng, B. and Pekhimenko, G., 2022. ‘Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction.’ Advances in Neural Information Processing Systems, 35, pp.12267-12282.

Narayanan, D., Phanishayee, A., Shi, K., Chen, X. and Zaharia, M., 2021, July. ‘Memory-Efficient Pipeline-Parallel DNN Training.’ International Conference on Machine Learning (pp. 7937-7947). PMLR.

Pan, Z., Chen, P., He, H., Liu, J., Cai, J. and Zhuang, B., 2021. ‘Mesa: A Memory-Saving Training Framework for Transformers.’ arXiv preprint arXiv:2111.11124.

Korthikanti, V.A., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M. and Catanzaro, B., 2023. ‘Reducing Activation Recomputation in Large Transformer Models.’ Proceedings of Machine Learning and Systems, 5.
