4× RTX Pro 6000 Blackwell on Water, and the One Card That Wouldn't Behave
Posted by sabareesh 4 days ago
Comments
Comment by lgessler 19 hours ago
Comment by sabareesh 19 hours ago
Comment by arjie 19 hours ago
The story is interesting but it’s hard to read because it’s hard to tell which parts are meaningful and which parts are filler.
E.g. “we pulled the card cold - straight from the rig to the workbench”. Okay, but why would going straight from the rig to the workbench make it cold? If anything it would be warm. But it turns out the temperature is meaningful in your story.
Comment by sabareesh 19 hours ago
Comment by cogman10 20 hours ago
Comment by ssl-3 19 hours ago
The business of protecting individual power cords was handled by an Eaton PDU that had a 30a twist-lock plug on one side and a couple of rows of current-limited IEC C13 sockets on the other side.
Comment by Kirby64 20 hours ago
Comment by FireBeyond 18 hours ago
Certainly on my panel the only "single outlet" breakers are hot water, AC, oven/stove, dryer.
Comment by Kirby64 18 hours ago
Also, another place where you might already have this outlet: some older houses that used larger window AC units have 240V 20A outlets. Not common these days, but you can still buy these types of window AC units.
Comment by hgoel 20 hours ago
Comment by jmalicki 18 hours ago
If you're gonna get rewired you may as well install a 240v circuit, and some 120v 20a sockets while you're at it.
Comment by hgoel 17 hours ago
I'm very close to just running a cord over or devising a way to put my machine closer to a second circuit, because my rental is horribly set up and both my bedroom AC and living room desktop (which also doubles as an ML training box) end up on the same circuit.
Comment by quickthrowman 17 hours ago
It would require an additional run of 14/2G romex (12/2G for 20A) and a single-pole breaker, but allows you to skip cutting in an old work box to add a 2nd duplex receptacle.
You could possibly replace the existing 14/2G with 14/4G which has enough conductors for both circuits.
Comment by jmalicki 17 hours ago
The receptacle is the easy part, running the new circuit is the hard part.
Or you know, install a new 240V receptacle.
If I have to:
1) Run wire
2) Get a bigger breaker box
3) To do it legally, hire an electrician and maybe get a permit
Replacing the receptacle is like, <1% of what's involved there.
Comment by quickthrowman 14 hours ago
Any 15A or 20A duplex receptacle can have the tabs broken to get two separate 15A or 20A simplex receptacles. You don’t need a 5-20R for that; a 3-wire 5-15R works just fine.
Someone upthread mentioned a 1.2kW load, which a 15A receptacle handles just fine: .8*120*15=1,440W continuous. Bumping that up to 20A only gets you an additional 480W of continuous load: .8*120*20=1,920W. A continuous load is one that runs for 3 hours or longer; the overcurrent protection and wire must be upsized by 1.25x (or the circuit derated to 80%).
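To make the continuous-load arithmetic easy to rerun for other breaker sizes, here is a minimal Python sketch. The function name `continuous_watts` and the 80% figure as an integer percentage are illustrative choices; the voltages, amperages, and the 1.2kW load come from the comment above.

```python
def continuous_watts(volts, breaker_amps, derate_percent=80):
    """Continuous-load capacity in watts for a circuit.

    Uses integer math so the results match the hand arithmetic exactly.
    derate_percent=80 is the 80% continuous-load rule quoted above.
    """
    return volts * breaker_amps * derate_percent // 100


LOAD_W = 1200  # the 1.2kW load mentioned upthread

for amps in (15, 20):
    capacity = continuous_watts(120, amps)
    print(f"120V {amps}A: {capacity}W continuous, "
          f"{capacity - LOAD_W}W headroom over a {LOAD_W}W load")
```

The same function can be called with 240 as the first argument to compare against a 240V circuit.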
Most receptacles in homes are wired with 14/2 romex which is only good to 15A (in homes, which use the 60C ampacity column) which is why I suggested pulling another run of 14/2G romex and breaking the tabs. Pulling 14/2 romex to an existing receptacle usually isn’t that hard if you have a fish tape.
AFAIK computer PSUs can’t easily use 240V power without a PDU in the middle, but I’m likely wrong on that, especially for server PSUs.
Comment by jmalicki 13 hours ago
Comment by quickthrowman 11 hours ago
Comment by sabareesh 20 hours ago
Comment by tjwebbnorfolk 20 hours ago
I'm air cooling, so I set -pl 450 and I'm not running them all at the full 600W.
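As a rough sketch of what that cap means for total draw, the Python below multiplies out the per-card limit. It assumes `-pl` refers to the power-limit option of `nvidia-smi`, and it assumes four cards (the count in the thread title, not something this comment states); `total_gpu_watts` is a hypothetical helper name, and the figures cover GPU board power only, not the rest of the system.

```python
def total_gpu_watts(card_count, per_card_limit_watts):
    """Combined GPU power ceiling in watts for identical cards."""
    return card_count * per_card_limit_watts


CARDS = 4  # assumption: taken from the thread title, not from this comment

capped = total_gpu_watts(CARDS, 450)    # with the 450W limit applied
uncapped = total_gpu_watts(CARDS, 600)  # at the full 600W per card

print(f"capped: {capped}W, uncapped: {uncapped}W, saved: {uncapped - capped}W")
```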
Comment by amluto 20 hours ago
Hint: when you have a piece of metal stuck with thermal goop to a lot of components, the force doesn’t “concentrate” on one of them. You need to detach it from each one with however much force is needed to detach it from that component.
Comment by sabareesh 20 hours ago
Comment by tomaytotomato 20 hours ago
It cost about £190 in 2006.
Now we have GPUs that cost tens of thousands of pounds with insane performance, but what would their price be without the AI and datacentre squeeze?
Comment by a012 20 hours ago
Comment by tomaytotomato 18 hours ago
Thanks for making me feel older now
Comment by cobalt60 16 hours ago
Comment by testing22321 20 hours ago
Comment by rvba 18 hours ago
Diablo 2 stopped lagging when a necromancer joined the game and summoned all the skeletons...
On an unrelated note, Path of Exile 1 still lags even on a 5090.
Comment by embedding-shape 14 hours ago
A bunch of games lag on an RTX Pro 6000. At some point (most points?) it's less about the hardware :)
Comment by NwtnsMthd 20 hours ago
Something went wrong in manufacturing. The solder should have wicked to cover the entire pad, not just a small square, and there should be no (brown) discoloration.
Comment by josephg 20 hours ago
Comment by sabareesh 20 hours ago
Comment by dwroberts 19 hours ago
The trouble with this though is, what if that is not the only issue with the card? That’s normally my thought process on reaching for RMA. The unit could be an all-round lemon that should not have passed QA etc. (and as noted in the post itself, working for a week on various tasks is not enough to prove it good)
Comment by stryakr 19 hours ago
The phrasing is very Claude-like:
"That cracked joint is the whole story. The card had passed initial bring-up and ran fine at light loads for a week."
"That sequencing matters — it’s why we have a story to tell. The pilot card failed, taught us a lesson, and the lesson is the reason the other three went on without incident."
"Driver swaps, CUDA reinstalls, and inference-engine theories were dead ends I spent hours on. The failure pattern itself told the story — listen to it earlier."
Comment by ssl-3 18 hours ago
Stuff like "it's the whole story," "this part matters," and "it's not X" (when X wasn't ever under discussion to begin with).
They're like a bot characterizing to itself what is important, what is unimportant, or sometimes even arguing with itself. Their presentation seems like bits of the internal thinking mechanism leaking into the output queue.
Comment by voidUpdate 20 hours ago
Comment by sabareesh 20 hours ago
Comment by atemerev 20 hours ago
Comment by sieabahlpark 20 hours ago
Comment by sabareesh 4 days ago
Comment by robin_reala 20 hours ago
Edit: reading fail on my part, nothing to see here.
Comment by xmichael909 20 hours ago
Comment by OneDeuxTriSeiGo 20 hours ago
> With 18× 140 mm of surface, the fans run quietly and the coolant Δt across the rads stays small
Comment by robin_reala 20 hours ago
Comment by iagooar 19 hours ago
Comment by sabareesh 19 hours ago
Comment by iagooar 18 hours ago
Comment by alecco 19 hours ago
Those are SM120 so no tmem/tcgen05 and lack of support in main libraries (it's like everybody is focusing on B300/SM100).
For that money I'd buy a single B300, similar total AI TOPS, similar GPU bandwidth aggregated, and only 25% less total memory (probably saved in less implementation complexity), half the energy consumption...
Also by having all SMs local they have the special L1-level interconnect. SMs can collaborate on the same GEMM. And a bunch of other nice features.
Or, you know, rent it.
Comment by sabareesh 19 hours ago
Comment by sandworm101 20 hours ago
Comment by AnthonBerg 18 hours ago
Comment by atemerev 20 hours ago
Comment by lightedman 19 hours ago
Signed, IPC-A-610 certified tech.
Comment by warpfactor 20 hours ago
Comment by stryakr 19 hours ago
I picked up on it too. This wouldn't have been a difficult story to share, but it's far too verbose to be a real person's words.