Galvatron: Automatic Distributed Training for Large Transformer Models - 2025

Research Paper on Galvatron: Automatic Distributed Training for Large Transformer Models

Research Area:  Machine Learning

Abstract:

Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed Optimus-Megatron in the implementation) that dynamically combines data parallelism, tensor model parallelism, and pipeline parallelism to optimize training throughput. Built atop PyTorch and integrating NVIDIA's Megatron-LM and Microsoft's DeepSpeed, Galvatron automatically selects and adjusts parallelism strategies in real time based on model architecture, hardware, and training dynamics. This paper details Galvatron's key features -- automatic hybrid parallelism selection, layer-wise and phase-wise strategy optimization, and runtime adaptation -- and contrasts them with existing static frameworks. We describe the system's technical stack, including its use of DeepSpeed's ZeRO and NCCL communication, and provide an in-depth implementation overview of its core modules (profilers, strategy selector, parallelism manager). We then illustrate how Galvatron can be seamlessly integrated into existing training pipelines with minimal code modifications, providing companies a plug-and-play solution for efficient large-model training. Finally, we situate Galvatron in the context of related efforts (NVIDIA Megatron-LM, Microsoft DeepSpeed, Google GShard, Meta FairScale, etc.), highlighting how it advances the state of the art in distributed deep learning. References to the GitHub repository and relevant literature are provided throughout.
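The abstract's central idea, automatic hybrid parallelism selection, can be illustrated with a toy sketch: enumerate candidate (data, tensor, pipeline) parallel degrees that exactly use a fixed GPU budget and pick the one with the lowest estimated step time. The cost model, its constants, and all function names below are illustrative assumptions for this page, not Galvatron's actual profiler or search algorithm.

```python
from itertools import product
from dataclasses import dataclass


@dataclass
class Strategy:
    dp: int  # data-parallel degree
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree


def candidate_strategies(num_gpus: int):
    """Enumerate (dp, tp, pp) factorizations that use exactly num_gpus GPUs."""
    for dp, tp, pp in product(range(1, num_gpus + 1), repeat=3):
        if dp * tp * pp == num_gpus:
            yield Strategy(dp, tp, pp)


def estimated_step_time(s: Strategy, layers: int, params_per_layer_gb: float,
                        gpu_mem_gb: float) -> float:
    """Toy cost model (assumed, not Galvatron's): compute scales down with dp*tp,
    while tensor-parallel all-reduces and pipeline bubbles add overhead."""
    mem_per_gpu = layers * params_per_layer_gb / (s.tp * s.pp)
    if mem_per_gpu > gpu_mem_gb:               # strategy does not fit in GPU memory
        return float("inf")
    compute = layers / (s.dp * s.tp)           # idealized compute scaling
    tp_comm = 0.3 * layers * (s.tp - 1) / s.tp  # tensor-parallel communication cost
    pp_bubble = 0.1 * (s.pp - 1)                # pipeline bubble overhead
    return compute + tp_comm + pp_bubble


def select_strategy(num_gpus: int = 8, layers: int = 48,
                    params_per_layer_gb: float = 1.5, gpu_mem_gb: float = 40):
    """Return the lowest-cost hybrid strategy under the toy model."""
    return min(candidate_strategies(num_gpus),
               key=lambda s: estimated_step_time(s, layers,
                                                 params_per_layer_gb, gpu_mem_gb))


if __name__ == "__main__":
    best = select_strategy()
    print(f"Selected hybrid strategy: dp={best.dp}, tp={best.tp}, pp={best.pp}")
```

In the paper's described design, this kind of search is driven by measured profiles rather than fixed constants, and the chosen strategy can differ per layer and per training phase; the sketch only conveys the shape of the decision problem.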

Keywords:  

Author(s) Name:  Esmail Gumaan

Journal name:  Distributed, Parallel, and Cluster Computing

Conference name:  

Publisher name:  arXiv

DOI:  10.48550/arXiv.2504.03662

Volume Information:  arXiv preprint (2025)