At its recent re:Invent conference, AWS moved its custom AI accelerator strategy into a new phase with the general availability of EC2 Trn3 UltraServers, based on the Trainium3 chip, and the public preview of its next-generation Trainium4.
The new Trainium3 UltraServers target customers that need to train and serve increasingly large and complex models but want lower cost per token and higher energy efficiency than general-purpose GPU clusters.
AWS claims up to 4.4× more compute performance compared to its previous Trainium2 UltraServers, with 3× higher throughput per chip and 4× lower response times on OpenAI’s GPT-OSS test workloads.
AWS also disclosed some details of its next generation Trainium4 accelerators, targeting at least 6× processing performance at FP4, 3× FP8 performance, and 4× more memory bandwidth compared to Trainium3.
Let’s take a look at what was announced.
Trainium3 UltraServers
Trainium3 is the foundational building block for AWS’s new EC2 Trn3 UltraServers, which aggregate Trainium3 chips into a large, tightly coupled compute and memory domain.
At a high level, Trainium3 combines updated NeuronCores, expanded HBM memory, and faster XPU interconnects:
- Process and core architecture
- Fabricated on a 3 nm process node, which enables higher transistor density and better energy efficiency than earlier 5 nm and 7 nm nodes.
- Uses NeuronCore-v4 cores that:
- Maintain scalar, vector, tensor, and collective communication engines.
- Add optimizations for transformer workloads, including faster exponential function evaluation in the vector engine for attention operations.
- Support FP16/BF16 formats and quantization to MXFP8 to reduce memory and bandwidth requirements between MLP layers.
- On-chip memory and HBM
- Increases per-core SRAM (e.g., up to 32 MB per NeuronCore, according to prior technical material) to keep more activations and parameters close to compute.
- Expands HBM to 144 GB per chip with up to 4.9 TB/s HBM bandwidth.
- These changes reduce off-chip bottlenecks for large language and multimodal models.
- XPU interconnect and memory domains
- Uses NeuronLink-v4 for chip-to-chip connectivity, delivering what Amazon claims is up to 2.5 TB/s of aggregate bandwidth per device.
- UltraServers can now host up to 144 Trainium3 chips in a single integrated memory domain, compared with 64 in earlier configurations.
- This expanded domain size allows larger model partitions and reduces cross-system communication for data- and model-parallel training (see the back-of-envelope capacity sketch after this list).
- System-level integration (Trn3 UltraServers)
- A single Trn3 UltraServer aggregates these 144 chips into a single, highly connected system, which AWS positions as a “scale-up” node for both training and inference.
- AWS reports that this configuration delivers up to 4.4× more compute than its Trainium2 UltraServers.
Network Fabric & Ultracluster Scale-Out
Beyond single-system performance, AWS emphasizes the networking and system fabric that connects Trainium3 chips within and across racks. This is crucial for LLMs, mixture-of-experts (MoE) models, and agentic AI systems that require high bandwidth and low latency across many accelerators.
The networking stack comprises the following:
- Intra-UltraServer networking
- NeuronSwitch-v1 provides a dedicated switching layer for XPU-to-XPU traffic inside the UltraServer.
- AWS claims its NeuronSwitch-v1 delivers 2× more bandwidth within each UltraServer compared to prior generations.
- This reduces contention for multi-chip collective operations (e.g., all-reduce, all-gather) and improves scaling efficiency inside each node.
- Latency characteristics
- Enhanced Neuron Fabric reportedly reduces communication latency between chips to just under 10 microseconds.
- That latency profile is critical for models with tight synchronization points, such as large transformer training and expert routing in MoE architectures (a rough communication-cost sketch follows this list).
- EC2 UltraClusters 3.0
- AWS connects Trn3 UltraServers into EC2 UltraClusters 3.0, a scale-out fabric that extends to thousands of UltraServers.
- The company claims these clusters can contain up to 1 million Trainium chips, representing a 10× increase in possible scale compared to earlier UltraCluster generations.
- At this scale, AWS positions Trainium3 as suitable for:
- Training multimodal foundation models over trillion-token datasets.
- Serving real-time inference workloads for millions of concurrent users.
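The latency and bandwidth figures above matter most for collective operations. Below is a rough ring all-reduce cost model: the roughly 10 microsecond fabric latency and the 2.5 TB/s aggregate per-device NeuronLink-v4 bandwidth come from the announcement, but treating that full bandwidth as usable ring bandwidth, the per-step latency model, and the 140 GB gradient size are illustrative assumptions, not a description of AWS's actual collective implementation.

```python
# Rough ring all-reduce estimate inside a 144-chip UltraServer domain.
# Announced figures: ~10 microsecond chip-to-chip latency, 2.5 TB/s per device.
# Assumptions: the full 2.5 TB/s is usable by the ring, and latency is paid per step.

def ring_allreduce_seconds(message_bytes: float,
                           num_chips: int = 144,
                           link_bandwidth_bytes_s: float = 2.5e12,
                           step_latency_s: float = 10e-6) -> float:
    """Classic 2*(N-1)-step ring all-reduce (reduce-scatter followed by all-gather)."""
    steps = 2 * (num_chips - 1)
    bandwidth_term = steps * (message_bytes / num_chips) / link_bandwidth_bytes_s
    latency_term = steps * step_latency_s
    return bandwidth_term + latency_term

if __name__ == "__main__":
    # Illustrative: all-reducing ~140 GB of BF16 gradients (a 70B-parameter model).
    t = ring_allreduce_seconds(message_bytes=140e9)
    print(f"Estimated all-reduce time: {t * 1e3:.1f} ms")
```

Even in this simplified model, the latency term (a few milliseconds across 286 steps) stays small relative to the bandwidth term, which is why the sub-10-microsecond fabric is paired with higher per-device bandwidth.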
Trainium4 Preview
During his keynote address at the event, AWS CEO Matt Garman also disclosed initial details about the company’s next-generation Trainium4, which combines more aggressive numeric formats, higher bandwidth, and deep integration with NVIDIA’s rack-scale architecture. The roadmap emphasizes both raw performance and infrastructure standardization.
According to Garman, AWS’s Trainium4 targets include:
- Core numeric performance
- At least 6× processing performance at FP4 relative to Trainium3.
- 3× FP8 performance, which AWS says will deliver proportional improvements in training speed or inference throughput, before further gains from software optimization.
- 4× more HBM bandwidth and 2× HBM capacity compared to Trainium3.
- Rack-scale integration with NVIDIA
- Trainium4 will integrate with NVIDIA NVLink 6 and NVLink Fusion, allowing it to participate in the same scale-up fabric used by NVIDIA’s own accelerators.
- It will fit into NVIDIA MGX rack architectures, enabling AWS to build AI racks with a common footprint, power distribution, and cooling design for Trainium, Graviton, and GPU systems.
- NVLink Fusion includes a chiplet that custom ASIC designers (including AWS) can embed to connect their accelerators to NVLink Switches and the scale-up fabric. NVIDIA documents indicate:
- Support for all-to-all connectivity across up to 72 accelerators in a single scale-up domain.
- Up to 3.6 TB/s per accelerator and 260 TB/s total scale-up bandwidth in such a domain (a quick arithmetic check follows this list).
- Hardware support for peer-to-peer memory access, atomic operations, and SHARP-based in-network reductions and multicast.
- Ecosystem leverage and heterogeneous racks
- By aligning Trainium4 with MGX and NVLink Fusion, AWS can reuse the same rack, tray, cooling, and power delivery ecosystem as NVIDIA GPU deployments.
- AWS can build heterogeneous AI racks mixing Trainium accelerators, NVIDIA GPUs, Graviton CPUs, and networking components (such as EFA and other NICs) within a consistent mechanical and electrical envelope.
- This reduces the need for AWS to engineer fully bespoke rack-scale infrastructure for each custom silicon generation.
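As a quick sanity check on the NVLink Fusion numbers cited above, the per-accelerator and aggregate bandwidth figures are mutually consistent; the per-peer split below is purely illustrative, since it assumes bandwidth is divided evenly across all 71 peers in an all-to-all domain.

```python
# Consistency check of the published NVLink Fusion scale-up figures:
# up to 72 accelerators, 3.6 TB/s per accelerator, ~260 TB/s aggregate.
# The even per-peer split is an assumption used only for illustration.

ACCELERATORS = 72
PER_ACCELERATOR_TBPS = 3.6

aggregate_tbps = ACCELERATORS * PER_ACCELERATOR_TBPS                # 259.2 ~= 260 TB/s
per_peer_gbps = PER_ACCELERATOR_TBPS * 1000 / (ACCELERATORS - 1)    # even split across 71 peers

print(f"Aggregate scale-up bandwidth: {aggregate_tbps:.1f} TB/s")
print(f"Per-peer share (even split): {per_peer_gbps:.1f} GB/s")
```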
Analysis
AWS announced its first Trainium accelerator in 2021, just ahead of the current GenAI-focused moment in AI, and the company has been consistently refining the technology since.
The general availability of EC2 Trn3 UltraServers and the public preview of Trainium4 show AWS has completed its evolution from infrastructure-centered cloud provider to full-stack AI infrastructure vendor.
Trainium3 brings concrete, measurable advances over earlier Trainium generations, including higher compute density, larger memory domains, improved energy efficiency, and more capable networking fabrics.
AWS’s claims of 4.4× compute, 3× throughput per chip, and 4× lower response times versus Trainium2 UltraServers show that AWS can compete credibly with high-end GPU configurations on both performance and cost for select workloads. While these numbers haven’t been independently validated, Amazon’s performance claims have historically held true, and we expect that will continue with Trainium3.
One of the most surprising elements of the Trainium4 disclosure is the inclusion of NVIDIA’s NVLink Fusion. This shows NVIDIA broadening its NVLink approach while AWS becomes more receptive to deeper integration with NVIDIA (and potentially other third-party) ecosystems.
We don’t yet know where this integration is headed, but including NVLink technology in Trainium4 gives Amazon a high degree of flexibility while increasing NVIDIA’s footprint in a traditionally non-NVIDIA environment. We’ll have more to say about this in a dedicated analysis of NVIDIA’s NVLink strategy.
For technology decision-makers, the net impact of this announcement is positive. Trainium3 expands the practical choices for large-scale AI beyond NVIDIA-only solutions, and Trainium4’s roadmap shows a credible path toward even higher performance and better rack-scale integration.
Organizations that are already heavily invested in AWS, or that face escalating GPU costs and capacity constraints, should treat Trainium3 as a serious option for both training and inference.
Overall, the release of Trainium3 instances and the preview of Trainium4 show AWS strengthening its position in the hosted AI infrastructure space, increasing competitive pressure on GPU-centric offerings, and giving enterprises more levers to balance performance, cost, and scalability for next-generation AI workloads.
Competitive Outlook & Advice to IT Buyers
The cloud accelerator space is a crowded market, with AWS competitors Google Cloud and Microsoft Azure each promoting their own custom accelerators, while AMD and Qualcomm offer more off-the-shelf options. Everyone, of course, competes with NVIDIA in this space.
Given this dynamic, how do the new AWS Trainium3-based offerings stack up, and what should IT buyers consider when evaluating these options? Let’s jump in.
These sections are only available to NAND Research clients and IT Advisory Members. Please reach out to [email protected] to learn more.



