Amazon SageMaker HyperPod

Reduce time to train foundation models by up to 40% and scale across more than a thousand AI accelerators efficiently

What is SageMaker HyperPod?

Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure. It is preconfigured with SageMaker’s distributed training libraries, which automatically split training workloads across more than a thousand AI accelerators so they can be processed in parallel for improved model performance. SageMaker HyperPod keeps your FM training uninterrupted by periodically saving checkpoints. When hardware fails, it automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint, removing the need for you to manage this process manually. This resilient environment allows you to train models for weeks or months in a distributed setting without disruption, saving up to 40% of training time. SageMaker HyperPod is also highly customizable, allowing you to efficiently run and scale FM workloads and easily share compute capacity between different workloads, from large-scale training to inference.
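To make the checkpoint-and-resume behavior concrete, here is a minimal sketch of a training loop in plain PyTorch. The checkpoint path and the tiny model are hypothetical stand-ins; on HyperPod, the fault detection, node replacement, and relaunch are handled for you.

```python
import os
import torch
import torch.nn as nn

CKPT_PATH = "/fsx/checkpoints/model.pt"  # hypothetical shared-filesystem path

model = nn.Linear(128, 10)  # stand-in for a real foundation model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
total_steps = 10_000

def save_checkpoint(step):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_checkpoint():
    # On (re)start, resume from the last checkpoint if one exists.
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt["step"]
    return 0

# If a faulty node is replaced and this script is relaunched,
# load_checkpoint restores progress instead of restarting from step 0.
for step in range(load_checkpoint(), total_steps):
    loss = model(torch.randn(32, 128)).square().mean()  # dummy training step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        save_checkpoint(step)
```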

Benefits of SageMaker HyperPod

Amazon SageMaker HyperPod is preconfigured with Amazon SageMaker distributed training libraries, allowing you to automatically split your models and training datasets across AWS cluster instances to help you efficiently scale training workloads.
Amazon SageMaker distributed training libraries optimize your training job for AWS network infrastructure and cluster topology through two techniques: data parallelism and model parallelism. Data parallelism splits large datasets so that shards are trained on concurrently, improving training speed (a minimal sketch follows this section). Model parallelism splits models that are too large to fit on a single GPU into smaller parts and distributes them across multiple GPUs for training.
SageMaker HyperPod enables a more resilient training environment by automatically detecting, diagnosing, and recovering from faults, allowing you to continually train FMs for months without disruption.
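To illustrate the data-parallel technique described above (standard PyTorch here, not HyperPod-specific), a DistributedSampler shards the dataset so each accelerator trains on a disjoint slice of every epoch:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes the script is launched once per GPU (e.g. via torchrun),
# which sets the RANK/WORLD_SIZE env vars that init_process_group reads.
dist.init_process_group(backend="nccl")

dataset = TensorDataset(torch.randn(10_000, 128))  # stand-in training data

# Each rank sees a disjoint 1/world_size shard of the dataset, so the
# ranks train on different data concurrently -- this is data parallelism.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for (batch,) in loader:
        ...  # forward/backward; DDP averages gradients across ranks
```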

Automatic cluster health check and repair

To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for GPU and network integrity. If any instance becomes defective during a training workload, SageMaker HyperPod automatically detects it and swaps the faulty node for a healthy one.
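As a rough, purely illustrative picture of what a node-level GPU probe can look like (this is not HyperPod's actual implementation; its own checks cover GPU and network integrity far more thoroughly):

```python
import subprocess

def gpu_health_ok() -> bool:
    # Illustrative probe only: nvidia-smi exits non-zero if the driver
    # cannot reach a GPU; production health checks (e.g. DCGM diagnostics,
    # NCCL bandwidth tests) are far more thorough.
    try:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=ecc.errors.uncorrected.volatile.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except (subprocess.CalledProcessError,
            subprocess.TimeoutExpired,
            FileNotFoundError):
        return False  # driver hang, missing GPU, missing tool, etc.
    # Any uncorrected ECC errors suggest the node should be replaced.
    return all(v.strip() in ("0", "[N/A]") for v in out.stdout.splitlines())

if __name__ == "__main__":
    print("healthy" if gpu_health_ok() else "faulty")
```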

High-performing distributed training libraries

With SageMaker’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. SageMaker HyperPod is preconfigured with SageMaker distributed libraries. With only a few lines of code, you can enable data parallelism in your training scripts. SageMaker HyperPod makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
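Those “few lines” look roughly like the following in a PyTorch training script, based on the publicly documented smdistributed.dataparallel package (details may vary by library version): importing the package registers an AWS-optimized collective backend, so enabling it is essentially a backend swap.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Importing this module registers SageMaker's "smddp" backend
# with torch.distributed.
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

dist.init_process_group(backend="smddp")  # swapped in place of "nccl"

# LOCAL_RANK is set by the launcher; pin each process to its GPU.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda()
# Gradient averaging now runs over the AWS-optimized backend.
model = DDP(model, device_ids=[local_rank])
```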


Advanced observability for improved performance

You can use the built-in ML tools in SageMaker HyperPod to improve model performance. For example, Amazon SageMaker with TensorBoard helps you save development time by visualizing model architecture to identify and remediate convergence issues, while Amazon SageMaker Debugger captures metrics and profiles training jobs in real time. Integration with Amazon CloudWatch Container Insights provides deeper insights into cluster performance, health, and utilization.
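For instance, streaming training metrics to TensorBoard from a PyTorch script needs only the standard SummaryWriter; the log directory below is a hypothetical shared location that a hosted TensorBoard could read:

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical shared log directory readable by TensorBoard.
writer = SummaryWriter(log_dir="/fsx/tensorboard/run-1")

for step in range(1000):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, global_step=step)

writer.close()
```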

Workload scheduling and orchestration

SageMaker HyperPod schedules and orchestrates workloads using Slurm or Amazon EKS, and its user interface is highly customizable: you can select and install any frameworks or tools you need. All clusters are provisioned with the instance type and count you choose, and they are retained for your use across workloads.
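On a Slurm-orchestrated cluster, for example, jobs are queued with standard Slurm tooling. The sketch below (hypothetical script contents, node count, and task layout) submits a 16-node training run:

```python
import subprocess

# Hypothetical Slurm batch script for a 16-node training run; the Slurm
# integration exposes the cluster through standard Slurm commands.
SBATCH_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=fm-pretrain
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8
srun python train.py
"""

with open("train.sbatch", "w") as f:
    f.write(SBATCH_SCRIPT)

# Queue the job; Slurm schedules it across the cluster's instances.
subprocess.run(["sbatch", "train.sbatch"], check=True)
```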

Scalability and optimized resource utilization

You can manage and operate SageMaker HyperPod clusters with a consistent Kubernetes-based administrator experience. This allows you to efficiently run and scale FM workloads, from training and fine-tuning to experimentation and inference. You can easily share compute capacity and switch between Slurm and EKS for different types of workloads.