AWS Trainium3 Deep Dive – A Potential Challenger Approaching
Posted by Symmetry 5 days ago
Comments
Comment by klysm 15 hours ago
Comment by mrlongroots 13 hours ago
Comment by bri3d 13 hours ago
I do think AWS needs to improve its software to capture more downmarket traction, but my understanding is that even Trainium2, with virtually no public support, was financially successful for Anthropic as well as for scaling AWS Bedrock workloads.
Ease of optimization at the architecture level is what matters at the bleeding edge; a pure-AI organization will have teams of optimization and compiler engineers mining the hardware for every available trick.
Comment by epolanski 6 hours ago
Amazon has all the resources needed to write its own backends for the major ML frameworks, or even drop-in API replacements.
Eventually economics wins: where margins are high, competition appears; over time margins get thinner and competition starts disappearing again. It's a cycle.
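For a sense of where such a "drop-in" backend hooks into PyTorch: torch.compile accepts any callable that takes an FX graph and returns a compiled callable, so a vendor backend plugs in at that seam. The toy_vendor_backend below is purely illustrative, a sketch of the extension point, not AWS's actual API.

```python
import torch


def toy_vendor_backend(gm: torch.fx.GraphModule, example_inputs):
    # A real backend would lower the FX graph to the vendor compiler here
    # (e.g. hand it to a Neuron/XLA toolchain) and return the compiled kernel.
    # This toy version just runs the traced graph eagerly.
    return gm.forward


model = torch.nn.Linear(16, 4)
compiled = torch.compile(model, backend=toy_vendor_backend)
print(compiled(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```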
Comment by stogot 15 hours ago
> In fact, they are conducting a massive, multi-phase shift in software strategy. Phase 1 is releasing and open sourcing a new native PyTorch backend. They will also be open sourcing the compiler for their kernel language called “NKI” (Neuron Kernel Interface) and their kernel and communication libraries for matmul and ML ops (analogous to NCCL, cuBLAS, cuDNN, Aten Ops). Phase 2 consists of open sourcing their XLA graph compiler and JAX software stack.
> By open sourcing most of their software stack, AWS will help broaden adoption and kick-start an open developer ecosystem. We believe the CUDA Moat isn’t constructed by the Nvidia engineers that built the castle, but by the millions of external developers that dig the moat around that castle by contributing to the CUDA ecosystem. AWS has internalized this and is pursuing the exact same strategy.
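For context, the publicly documented way to target Trainium from PyTorch today goes through torch_xla (AWS's torch-neuronx builds on it). A minimal sketch, assuming a Trainium instance with the Neuron SDK and torch-neuronx installed; the new native PyTorch backend described in the quote may expose a different interface.

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()            # resolves to a NeuronCore when torch-neuronx is present
model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(8, 128).to(device)
y = model(x)
xm.mark_step()                      # flush the lazily traced graph to the XLA/Neuron compiler
print(y.shape)
```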
Comment by coredog64 14 hours ago
Comment by almostgotcaught 13 hours ago
Comment by willahmad 11 hours ago
AWS can make it seamless, so you can run open-source models on their hardware.
See their ARM-based instances: you rarely notice you're running on ARM when using Lambda, k8s, Fargate, and the rest.
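As a minimal illustration of how thin the ARM-specific surface is on Lambda: with boto3, Graviton is a single field. The function name, role ARN, and artifact below are placeholders.

```python
import boto3

lam = boto3.client("lambda")
lam.create_function(
    FunctionName="demo-arm-fn",                       # placeholder name
    Runtime="python3.12",
    Role="arn:aws:iam::123456789012:role/demo-role",  # placeholder role ARN
    Handler="app.handler",
    Code={"ZipFile": open("app.zip", "rb").read()},   # placeholder artifact
    Architectures=["arm64"],                          # the only ARM-specific bit
)
```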
Comment by trueismywork 11 hours ago
Comment by ivape 14 hours ago
With Alchip, Amazon is working on "more economical design, foundry and backend support" for its upcoming chip programs, according to Acree.
https://www.morningstar.com/news/marketwatch/20251208112/mar...
Comment by thecopy 14 hours ago
Comment by esafak 13 hours ago
Comment by epolanski 6 hours ago
Comment by hobo_mark 11 hours ago
Comment by ijidak 8 hours ago
Comment by mlmonkey 13 hours ago
Comment by mNovak 10 hours ago
Comment by artur44 14 hours ago
If AWS really delivers on open-sourcing more of the toolchain, that could be a much bigger signal for adoption than raw specs alone.
Comment by t1234s 10 hours ago
Comment by Analemma_ 9 hours ago
Comment by jauntywundrkind 14 hours ago
It doesn't have a lot of ports and certainly not enough NTB to be useful as a switch, but man, it's wild to me that an AMD Epyc socket has 128 lanes of PCIe and that switch chips are struggling to match even a basic server's worth of net bandwidth.
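Back-of-the-envelope, assuming PCIe Gen5 (the generation is an assumption here; adjust for Gen4/Gen6), 128 lanes works out to roughly 4 Tb/s per direction:

```python
# 128 lanes at PCIe Gen5: 32 GT/s per lane, 128b/130b encoding
lanes = 128
gen5_gt_per_s = 32
encoding = 128 / 130

gb_per_s_per_lane = gen5_gt_per_s * encoding / 8   # ~3.94 GB/s per direction
socket_gb_per_s = lanes * gb_per_s_per_lane        # ~504 GB/s per direction
socket_tb_per_s = socket_gb_per_s * 8 / 1000       # ~4 Tb/s per direction

print(f"{socket_gb_per_s:.0f} GB/s ≈ {socket_tb_per_s:.1f} Tb/s per direction")
```

That ~4 Tb/s per direction is the "basic server's worth of net bandwidth" being compared against.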
Comment by cmiles8 13 hours ago