Trainium3: New AWS Chip Promises 4x Performance Boost

Matt Garman, CEO of AWS. Credit: Amazon Web Services (AWS)
AWS says its third-generation custom chip Trainium3 provides a 50% cost reduction for enterprise training and inference workloads

Amazon Web Services has announced the general availability of Amazon EC2 Trn3 UltraServers, powered by the company’s third generation Trainium chip built on 3-nanometre technology.

The Trn3 UltraServers pack up to 144 Trainium3 chips into a single integrated system, delivering up to 4.4 times more compute performance than Trainium2 UltraServers. Running OpenAI’s open-weight model GPT-OSS, customers can achieve three times higher throughput per chip and four times faster response times than on Trn2 UltraServers.

“We ran through a number of open-source models – all the workloads that we've been optimising to run on Trainium2 – to see how they run on Trainium3,” says Matt Garman, CEO of AWS.

AWS' Trainium3 chip. Credit: Amazon Web Services (AWS)

In his keynote at AWS re:Invent, the CEO reported that Trainium3 is more efficient than Trainium2, producing five times more output tokens per megawatt at the same latency.

“It’s got 4x more performance over Trainium2, which is fantastic,” David Brown, VP of AWS Compute and Machine Learning Services, told Technology Magazine. “It’s also going to be 40% more performance per watt, which is obviously very, very important as we think about how much compute we can get out of every watt of power that we put into the device. And that makes it about 40% better energy efficient as well when we go from Trainium 2. We've also increased the memory bandwidth by 50%.”

Amazon Bedrock runs majority of inference on Trainium architecture

AWS has deployed over one million Trainium chips starting this year, with the majority of Amazon Bedrock inference workloads now running on Trainium architecture.

“If you look at all the inference running on Amazon Bedrock today, the majority is actually powered by Trainium, and the performance advances of Trainium are really noticeable,” says Matt Garman, CEO of AWS. “For Anthropic’s latest generation models in Bedrock, all of that traffic is running on Trainium, which is delivering the best response times compared to any other major provider.”

The Trainium family has evolved rapidly since its introduction in 2020. AWS engineered the Trn3 UltraServer as a vertically integrated system from chip architecture to software stack. The new NeuronSwitch-v1 delivers twice the bandwidth within each UltraServer, while enhanced Neuron Fabric networking reduces communication delays between chips to under 10 microseconds.

For customers requiring greater scale, EC2 UltraClusters 3.0 can connect thousands of UltraServers containing up to one million Trainium chips, 10 times as many as the previous generation. Matt explains: “We’ve gotten to a million chips in record speed, not just because we control the whole stack, but because we can optimise end-to-end how we roll it out.”
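To put "thousands of UltraServers" in concrete terms, the short Python sketch below works out how many fully populated Trn3 UltraServers it would take to reach one million chips. The two figures come from the article (up to 144 chips per UltraServer, up to one million chips per UltraCluster); the function name is hypothetical, and the calculation assumes every UltraServer carries its maximum chip count, which real deployments may not.

```python
# Figures quoted in the article.
CHIPS_PER_ULTRASERVER = 144   # up to 144 Trainium3 chips per Trn3 UltraServer
TARGET_CHIPS = 1_000_000      # up to one million chips per EC2 UltraCluster 3.0


def ultraservers_needed(total_chips: int,
                        chips_per_server: int = CHIPS_PER_ULTRASERVER) -> int:
    """Smallest number of fully populated UltraServers holding at least total_chips.

    Uses integer ceiling division so no floating-point rounding is involved.
    """
    return (total_chips + chips_per_server - 1) // chips_per_server


if __name__ == "__main__":
    print(ultraservers_needed(TARGET_CHIPS))  # 6945
```

Under that assumption, a one-million-chip cluster corresponds to roughly 7,000 UltraServers.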

Project Rainier demonstrates Trainium2 deployment velocity

Through Project Rainier, AWS collaborated with Anthropic to connect more than 500,000 Trainium2 chips into what the company describes as the world’s largest AI compute cluster. 

“We spoke about Project Rainier at re:Invent last year, so early December 2024, and by October, less than a year later – 10 months later – we were deploying it,” says David. 

He adds: “I would actually expect us to scale Trainium 3 even faster than Trainium 2. One thing we constantly learn is it’s not only about how quickly you can make the silicon – and you have to make it well, because any mistake in the silicon, you're going to lose way too much time.”

AWS Project Rainier. Credit: Amazon

Customers including Anthropic, Karakuri, Metagenomics, Neto.ai, Ricoh and Splashmusic are reducing training and inference costs by up to 50% with Trainium technology. Decart, an AI laboratory specialising in efficient generative AI video and image models, is achieving four times faster frame generation at half the cost of graphics processing units.

Trainium4 to offer six times performance increase

AWS also announced it is actively developing Trainium4, designed to bring performance improvements including at least six times the processing performance in FP4 precision, three times the FP8 performance and four times more memory bandwidth. FP8 is the industry-standard precision format that balances model accuracy with computational efficiency for AI workloads.
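One reason lower-precision formats such as FP8 and FP4 matter is that each value takes fewer bits to store and move. The Python sketch below illustrates that arithmetic for model weights alone. The 100-billion-parameter model is a hypothetical example, not a figure from AWS, and the calculation is a simplification: it ignores activations, caches and the extra scaling metadata that practical low-precision schemes typically carry.

```python
# Storage width of each format in bits (this follows from the format names).
BITS_PER_VALUE = {"FP16": 16, "FP8": 8, "FP4": 4}

# Hypothetical model size, chosen only for illustration.
HYPOTHETICAL_PARAMS = 100_000_000_000


def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BITS_PER_VALUE[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_VALUE:
        print(f"{fmt}: {weight_memory_gb(HYPOTHETICAL_PARAMS, fmt):.0f} GB")
    # FP16: 200 GB
    # FP8: 100 GB
    # FP4: 50 GB
```

Halving the bits per value halves the memory the weights occupy and the data that has to be moved per parameter, which is part of why chip vendors quote separate performance figures for each precision.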

Trainium4 is being designed to support Nvidia NVLink Fusion high-speed chip interconnect technology, enabling Trainium4, Graviton and Elastic Fabric Adapter to work together within common MGX racks, providing rack-scale AI infrastructure that supports both graphics processing unit and Trainium servers.

Amazon EC2 Trn3 UltraServers are available now. David states: “Model efficiency and model technology has to improve significantly, and then the other big thing you can do for the cost is obviously on the hardware side. Whether it’s price performance on GPUs or price performance on custom silicon like Trainium, you want to get more compute for every dollar spent. The reason I’m so focused on that is I know if I can get that cost down, it’s going to unlock more innovation for our customers and more adoption.”
