Touching the Elephant – TPUs
Posted by giuliomagnifico 5 days ago
Comments
Comment by desideratum 5 days ago
Comment by jauntywundrkind 5 days ago
The work that XLA & schedulers are doing here is wildly impressive.
This feels drastically harder to work with than Itanium must have been. ~400-bit VLIW, across extremely diverse execution units. The workload is different, it's not general purpose, but it's still awe-inspiring to know not just that they built the chip but that the software folks can actually use such a wildly weird beast.
I wish we saw more industry uptake for XLA. Uptake's not bad, per se: there's a bunch of different hardware it can target! But what amazing secret sauce, it's open source, and it doesn't feel like there's the industry rally behind it that it deserves. It feels like Nvidia is only barely beginning to catch up, to dig a new moat, with the just-announced Nvidia Tiles. Such huge overlap. Afaik, please correct if wrong, but XLA isn't at present particularly useful at scheduling across machines, is it? https://github.com/openxla/xla
Comment by alevskaya 5 days ago
JAX/XLA does offer some really nice tools for doing automated sharding of models across devices, but for really large performance-optimized models we often handle the comms stuff manually, similar in spirit to MPI.
Comment by jauntywundrkind 5 days ago
But if you make it 2900 words through this 9000 word document, to the "Sample VLIW Instructions" and "Simplified TPU Instruction Overlay" diagrams, trying to map the VLIW slots ("They contain slots for 2 scalar, 4 vector, 2 matrix, 1 miscellaneous, and 6 immediate instructions") to useful work seems incredibly challenging, given the vast disparity in functionality and style of the attached units those slots govern, and given the extreme complexity of keeping that MXU constantly fed, with timing tight enough that it stays well utilized.
> Subsystems operate with different latencies: scalar arithmetic might take single digit cycles, vector arithmetic 10s, and matrix multiplies 100s. DMAs, VMEM loads/stores, FIFO buffer fill/drain, etc. all must be coordinated with precise timing.
Whereas Itanium's compilers needed to pack parallel work into a single instruction, there's maybe less need for that here. But that quote there feels like an incredible heart-of-the-machine challenge: to write instruction bundles that are going to feed a variety of systems all at once, when these systems have such drastically different performance profiles / pipeline depths. Truly an awe-some system, IMO.
Still though, yes: Itanium's software teams did have an incredibly hard challenge finding enough work at compile time to pack into instructions. Maybe it was a harder task. What a marvel modern cores are, having almost a dozen execution units that CPU control can juggle and keep utilized, analyzing incoming instructions on the fly, with deep out-of-order dependency-tracking insight. Trying to figure it all out ahead of time & packing it into the instructions a priori was a wildly hard task.
Comment by desideratum 5 days ago
> XLA isn't at present particularly useful at scheduling across machines,
I'm not sure if you mean compiler-based distributed optimizations, but JAX does this with XLA: https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_...
Comment by cpgxiii 5 days ago
Comment by jauntywundrkind 5 days ago
Comment by Simplita 5 days ago
Comment by Zigurd 5 days ago
Comment by randomtoast 4 days ago
Starting point: In 1965, the most advanced chips contained roughly 50 to 100 transistors (e.g., early integrated logic).
Let's take 1965 -> 2025, which is 60 years.
Number of doubling intervals: 60 years / 2 years per doubling = 30 doublings
So the theoretical prediction is:
Transistors in 2025 (predicted) = 100 × 2^30 ≈ 107 billion transistors
The Apple M1 Ultra has 114 billion transistors.
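The arithmetic in this comment can be reproduced with a minimal Python sketch. The 100-transistor starting point and the 2-year doubling period are the comment's own assumptions; the variable names are just illustrative.

```python
# Back-of-the-envelope Moore's law estimate, using the comment's assumptions.
START_YEAR = 1965
END_YEAR = 2025
START_TRANSISTORS = 100      # upper end of the comment's 50-100 range
DOUBLING_PERIOD_YEARS = 2    # the comment's assumed doubling interval

# Number of doubling intervals between the two years.
doublings = (END_YEAR - START_YEAR) / DOUBLING_PERIOD_YEARS

# Each interval doubles the count, so the prediction is start * 2^doublings.
predicted = START_TRANSISTORS * 2 ** doublings

print(f"doublings: {doublings:.0f}")
print(f"predicted transistors in {END_YEAR}: {predicted:.3e}")
```

Starting from 50 transistors instead of 100 halves the prediction, so the estimate is only good to within a factor of two or so.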
Comment by KolenCh 4 days ago
But if we relax it to be a slowly varying constant, then it is not dead. That constant has been changed (by consensus) a few times already.
Your mistake is to (1) take that constant literally (i.e. using the strong law) and (2) use the boundary points to find the “average” effect. The latter is a really flawed argument: it cannot prove the law isn't dead (a recent effect), because you haven't considered its change over time.
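A rough Python sketch of the endpoint objection, using the thread's figures (about 100 transistors in 1965, 114 billion in 2025). The two-phase history below is purely hypothetical, chosen only to show that a very different recent rate can hide behind the same endpoints; it is not measured data.

```python
import math

# Endpoints taken from the thread: ~100 transistors in 1965, 114 billion in 2025.
N_START, N_END, YEARS = 100, 114e9, 60

# Average doubling period implied by the two endpoints alone.
total_doublings = math.log2(N_END / N_START)
average_period = YEARS / total_doublings

# Hypothetical two-phase history (illustrative numbers, not measured data):
# doubling every 1.6 years for the first 40 years, then whatever slower
# pace is needed to land on the same 2025 endpoint.
EARLY_YEARS, EARLY_PERIOD = 40, 1.6
early_doublings = EARLY_YEARS / EARLY_PERIOD
late_period = (YEARS - EARLY_YEARS) / (total_doublings - early_doublings)

print(f"average doubling period from endpoints: {average_period:.2f} years")
print(f"hypothetical late-phase doubling period: {late_period:.2f} years")
```

Under these made-up numbers the endpoints still average out to roughly two years per doubling, even though the recent pace is about half that fast. That is why a start-to-end comparison alone says little about whether the trend has slowed.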
Comment by alecco 5 days ago
The TPUv4 and TPUv6 docs were stolen by a Chinese national in 2022/2023: https://www.cyberhaven.com/blog/lessons-learned-from-the-goo... https://www.justice.gov/opa/pr/superseding-indictment-charge...
And that's just 1 guy that got caught. Who knows how many other cases there were.
A Chinese startup is already making clusters of TPUs and has revenue https://www.scmp.com/tech/tech-war/article/3334244/ai-start-...
Comment by Workaccount2 5 days ago
There is a dark art to semiconductor manufacturing that pretty much only TSMC really has the wizards for. Maybe Intel and Samsung a bit too.
Comment by mr_toad 5 days ago
China has fabs. Most are older nodes and are used to manufacture chips used in cars and consumer electronics. They have companies that design chips (manufactured by TSMC), like the Ascend 910, which are purpose built for AI. They may be behind, but they’re not standing still.
Comment by aunty_helen 5 days ago
The question is when? Does that come in time to deflate the US tech stock bubble? Or will the bubble start to level out and reality catch up, or will the market crash for another reason beforehand?
Comment by snek_case 5 days ago
This is like the funny idea people had in the early 2000s that China would continue to manufacture most US technology but could never design its own competitive tech. Why would anyone think that?
Wrt invading Taiwan, I don't think there is any way China can get TSMC intact. If they do invade Taiwan (please God no), it would be a horrible bloodbath. Deaths in the hundreds of thousands and probably relentless bombing. Taiwan would likely destroy its own fabs to avoid them being taken. It would be sad and horrible.
Comment by renewiltord 5 days ago
They’ll just catch the next wave of tech or eventually break into EUV.
Comment by adgjlsfhk1 5 days ago
Comment by jandrewrogers 5 days ago
Everyone is still dependent on a single American manufacturer for this tech after decades of development. This strongly suggests that it is considerably more difficult than just "funding a second source".
Comment by renewiltord 5 days ago
Comment by mr_toad 5 days ago
There are so many trade and manufacturing links between China and Taiwan that an outright war would be economically disastrous for both countries.
Comment by dpe82 5 days ago
Comment by overfeed 5 days ago
That'd be the belief in good old American exceptionalism. Up until recently, a common meme on HN was "freedom" is fundamental to innovation, and naturally the country with the most Freedom(TM) wins. This even persisted after it was clear that DJI was kicking all kinds of ass, outcompeting multiple western drone companies.
Comment by snek_case 4 days ago
Comment by overfeed 3 days ago
It's not exactly a new idea. This was the CIA's operating principle in the western hemisphere since before the cold war.
Comment by radialstub 5 days ago
Comment by PunchyHamster 5 days ago
We desperately need more open frameworks for competition to work.
Comment by tomrod 5 days ago
Comment by Workaccount2 5 days ago
The knowledge of making 2008 era chips is not a gating factor for getting a handful of atoms to function as a transistor in current SOTA chips. There are probably 100 people on earth who know how to do this, and the majority of them are in Taiwan.
Again, China has literally stolen the plans for EUV lithography, years ago, and still cannot get it to work. Even Samsung and Intel, using the same machines as TSMC, cannot match what they are doing.
It's a dark art in the most literal sense.
Never mind that these new cutting-edge fabs cost ~$50 billion each.
Comment by checker659 5 days ago
Comment by pixl97 5 days ago
Comment by Zigurd 5 days ago
Comment by Workaccount2 5 days ago
The killer really is training, which is insanely compute intensive and has only recently become practical in hardware at the scale needed.
Comment by adgjlsfhk1 5 days ago
Comment by Zigurd 5 days ago
Comment by pests 5 days ago
Comment by llm_nerd 5 days ago
How would this be a deadly blow to Google? Google makes TPUs for their own services and products, avoiding paying the expensive Nvidia tax. If other people make similar products, this has effectively zero impact on Google.
Nvidia knew their days were numbered, at least in their ownership of the whole market. And China hardly had to steal the great plans for a TPU to make one; an FMA/MAC unit is actually a surprisingly simple bit of hardware to design. Everyone is adding "TPUs" in their chips - Apple, Qualcomm, Google, AMD, Amazon, Huawei, Nvidia (that's what tensor cores are) and everyone else.
And that startup isn't the big secret. Huawei already has solutions matching the H20. Once the specific need that can be serviced by an ASIC is clear, everyone starts building it.
>America will train 600k Chinese students as Trump agreed to
What great advantage do you think this is?
America isn't remotely the great gatekeeper on this. If anything, Taiwan + the Netherlands (ASML) are. China would gain infinitely more value from learning manufacturing and fabrication secrets than from cloning some specific ASIC.
Comment by lukasb 5 days ago
Comment by fullofideas 5 days ago
I don't understand this part. What has a nuclear base got to do with chip manufacturing? And surely, not all 600k students are learning chip design or stealing plans.
Comment by dylanowen 5 days ago
Comment by mr_toad 5 days ago
Comment by renewiltord 5 days ago
Comment by pstuart 5 days ago
There are things about China not to be celebrated but one cannot help but admire the way that they invest in their country as a whole. The US is all about "what's in it for me".
Comment by renewiltord 5 days ago
Is all that construction really worth it when we could be protecting neighborhoods and historic views?
Comment by pstuart 5 days ago
And it's not an entirely binary choice on protecting neighborhoods and views; for example what's happening in south Memphis with the power plant that's powering the Grok center there is a classic case of environmental racism -- they are cutting costs on pollution regulation because they have a community that they can dump the externalized costs on via their emissions.
Nobody's saying Grok shouldn't have the power, it's just a small detail on how that impact is managed.
Comment by renewiltord 5 days ago
Comment by Spooky23 5 days ago
Comment by pixl97 5 days ago
Comment by alecco 5 days ago
About students, have you seen the microelectronics labs in American universities lately? A huge chunk are Chinese already. Same with some of the top AI labs.
Comment by tormeh 5 days ago
Comment by daveguy 5 days ago
Comment by alivetoad 5 days ago
Comment by allisdust 5 days ago
I have spent close to 2 hours on your extremely info-dense article and loved every bit of it.
Looking forward to the next one.
Comment by alivetoad 3 days ago
Comment by imtringued 5 days ago
Comment by alivetoad 3 days ago
Comment by tomhow 4 days ago
Comment by alivetoad 3 days ago
Comment by daveguy 3 days ago
Just wanted to chime in (even though you probably don't want advice from me after that a-holery). I don't think LLMs are good for editing at all. They make things more wordy and obtuse. But editing is supposed to add clarity, especially in technical writing. Generally, the more you can remove while conveying the ideas the better. Honestly, I expect your original writing was better before using Claude.
Sorry again for being so harsh in the first place.
Comment by alivetoad 3 days ago
You're all good in my book, don't sweat it. I really do welcome the feedback. I am just some yokel from the internet, the burden is on me to write ~well enough so that reading 10k words isn't an enormous waste of time. Definitely rusty, but a little less so after wrestling with this one. I'm still ironing out my process, but writing more is the only way to get that sorted.
Comment by tomhow 4 days ago
Comment by alivetoad 3 days ago
Comment by daveguy 3 days ago
Comment by ddtaylor 5 days ago