AWS launches Trainium2 EC2 instances & UltraServers

Wed, 11th Dec 2024

Amazon Web Services has announced the general availability of its AWS Trainium2-powered Amazon EC2 instances and introduced new EC2 Trn2 UltraServers.

The Trn2 instances use AWS's latest Trainium2 AI chips and promise 30-40% better price performance than the existing GPU-based EC2 P5e and P5en instances. Each instance has 16 Trainium2 chips and offers a theoretical peak performance of 20.8 petaflops, making it suitable for training and deploying large language models (LLMs) with billions of parameters.

In addition, the newly introduced Amazon EC2 Trn2 UltraServers feature 64 Trainium2 chips linked by AWS's NeuronLink interconnect. This configuration raises peak compute to 83.2 petaflops and provides four times the compute, memory, and networking resources of a single Trn2 instance. Such enhancements allow for the training and deployment of some of the world's largest models.
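The two peak figures quoted above are consistent with straightforward scaling by chip count. The short sketch below checks that arithmetic. The per-chip value is derived here from the article's numbers rather than stated in it, and the calculation assumes peak throughput scales linearly with the number of chips, which describes theoretical peaks rather than real workload performance.

```python
import math

# Figures quoted in the article
CHIPS_PER_INSTANCE = 16        # Trainium2 chips in one Trn2 instance
INSTANCE_PEAK_PFLOPS = 20.8    # theoretical peak of one Trn2 instance
CHIPS_PER_ULTRASERVER = 64     # Trainium2 chips in one Trn2 UltraServer

# Derived value (assumption: peak is split evenly across chips)
per_chip_pflops = INSTANCE_PEAK_PFLOPS / CHIPS_PER_INSTANCE

# Assumption: theoretical peak scales linearly with chip count
ultraserver_pflops = per_chip_pflops * CHIPS_PER_ULTRASERVER
scale_factor = CHIPS_PER_ULTRASERVER // CHIPS_PER_INSTANCE

print(f"Per-chip peak:    {per_chip_pflops:.1f} petaflops")
print(f"UltraServer peak: {ultraserver_pflops:.1f} petaflops")
print(f"Scale factor:     {scale_factor}x a single instance")
```

Under these assumptions the UltraServer figure comes out at four times the single-instance figure, matching the quadrupling described above.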

AWS has partnered with Anthropic to build Project Rainier, an EC2 UltraCluster made up of hundreds of thousands of Trainium2 chips. The cluster is expected to provide more than five times the exaflops used to train Anthropic's current leading AI models. It aims to be the world's largest AI compute cluster, offering Anthropic unprecedented resources to train future AI models.

David Brown, Vice President of Compute and Networking at AWS, stated, "Trainium2 is purpose-built to support the largest, most cutting-edge generative AI workloads, for both training and inference, and to deliver the best price performance on AWS. With models approaching trillions of parameters, we understand customers also need a novel approach to train and run these massive workloads. New Trn2 UltraServers offer the fastest training and inference performance on AWS and help organisations of all sizes to train and deploy the world's largest models faster and at a lower cost."

Companies such as Anthropic, Databricks, and Hugging Face are among the early adopters of Trainium2, using its performance to enhance their AI offerings. Databricks anticipates that using Trn2 with its Mosaic AI platform will enable up to 30% lower total cost of ownership for customers. Similarly, Hugging Face expects Trainium2 to improve performance on its open platform, speeding up model development and deployment.

Anthropic has also committed to optimising its Claude models to run on Trainium2 and to scaling model training across hundreds of thousands of AI chips, significantly increasing the size and capacity of its training cluster.

Alongside the availability of Trainium2, AWS has announced its Trainium3 chips, which are expected to debut in late 2025. These next-generation chips are built on a 3-nanometer process node and promise improved performance and energy efficiency for advanced AI workloads.

AWS's Neuron SDK, which supports development with Trainium chips, is integrated with widely used machine learning frameworks like JAX and PyTorch, simplifying the transition for developers familiar with these systems. Google is collaborating with AWS to enhance JAX's capabilities for large-scale training and inference on Trn2 instances.

Currently, Trn2 instances are generally available in the US East (Ohio) AWS Region, with plans to expand availability to additional regions soon. Trn2 UltraServers are presently in the preview phase.
