4× RTX Pro 6000 Blackwell on Water, and the One Card That Wouldn't Behave

Posted by sabareesh 4 days ago

Comments

Comment by lgessler 19 hours ago

Not that this really takes away from the substance of the article, but the first two paragraphs are giving heavy Claude smell. Semicolons, em dashes, "That sequencing matters"... I guess I'm just a little surprised that anyone could be arsed to take on a hardware project like this but can't be arsed to write their own introduction.

Comment by sabareesh 19 hours ago

Appreciate the feedback. I have improved the article.

Comment by arjie 19 hours ago

Pass the generated text through some kind of quality prompt. It’s got too much filler right now.

The story is interesting but it’s hard to read because it’s hard to tell which parts are meaningful and which parts are filler.

E.g. “we pulled the card cold - straight from the rig to the workbench”. Okay, but why would going straight from the rig to the workbench make it cold? If anything it would be warm. But it turns out the temperature is meaningful in your story.

Comment by sabareesh 19 hours ago

Appreciate the feedback. I have improved the article.

Comment by cogman10 20 hours ago

Ok, how are people powering these things? 2.4kW is well beyond a standard circuit in the US. Are people having 240V/30A circuits installed? Are they hijacking the dryer plugs? EV charger plugs? Hot tub circuits?

Comment by ssl-3 19 hours ago

When I dipped my toes into vaguely-serious eth mining with GPUs 5 or 6 years ago, I just installed a dedicated 240v, 30a circuit in the basement. The run of cabling was short, the basement was unfinished utility space, and the parts didn't cost very much. It went together quickly.

The business of protecting individual power cords was handled by an Eaton PDU that had a 30a twist-lock plug on one side and a couple of rows of current-limited IEC C13 sockets on the other side.

Comment by Kirby64 20 hours ago

240V-20A circuits will handle 3.8kW continuous. It’s probably a 240V-20A circuit, as that is what the power supplies typically want. Also, easy to convert an outlet to 240V, if the breaker is dedicated to that outlet. Just requires swapping the breaker and the outlet, not the wires.

Comment by FireBeyond 18 hours ago

I don't think that's exactly common other than for outlets that are already wired for those specialized purposes, no?

Certainly on my panel the only "single outlet" breakers are hot water, AC, oven/stove, dryer.

Comment by Kirby64 18 hours ago

I agree it’s not common. Doesn’t mean you couldn’t do that. If every room has its own breaker for the outlets in that room, you could convert that room to 240V, as an example, though.

Also, another place where you might already have this outlet: some older houses that use window AC units that were larger had 240V 20A outlets. Not common these days, but you can still buy these types of window AC units.

Comment by hgoel 20 hours ago

Chaining two PSUs on separate circuits is also an option. If they're using the MaxQ versions though, the total GPU power draw is only ~1200W. The bigger question to me is how are they cooling it? Sticking an AC in that room just doubles the power draw issues.

Comment by jmalicki 18 hours ago

If I wanted to use two circuits without running extension cords, every place I've lived would mean getting electrical rewired.

If you're gonna get rewired you may as well install a 240v circuit, and some 120v 20a sockets while you're at it.

Comment by hgoel 17 hours ago

Fair point.

I'm very close to just running a cord over or devising a way to put my machine closer to a second circuit because my rental is horribly set up and both my bedroom AC and living room desktop (that also doubles as an ML training box) end up on the same circuit.

Comment by quickthrowman 17 hours ago

You can break the tabs on a 15A or 20A duplex receptacle to have (2) single 15A or 20A dedicated circuits on a single duplex receptacle.

It would require an additional run of 14/2G romex (12/2G for 20A) and a single-pole breaker, but allows you to skip cutting in an old work box to add a 2nd duplex receptacle.

You could possibly replace the existing 14/2G with 14/4G which has enough conductors for both circuits.

Comment by jmalicki 17 hours ago

If you are going to do that, why not install a NEMA 5-20R receptacle, that has two independent circuits and is backwards compatible, as well as being rated for 20A per plug?

The receptacle is the easy part, running the new circuit is the hard part.

Or you know, install a new 240V receptacle.

If I have to:

1) Run wire

2) Get a bigger breaker box

3) To do it legally, hire an electrician and maybe get a permit

Replacing the receptacle is like, <1% of what's involved there.

Comment by quickthrowman 14 hours ago

Breaking the tabs on the existing receptacle prevents one from having to use a jab saw or multitool to cut a hole in the gypsum wallboard or plaster and install a cut-in/old work box to add a 2nd duplex receptacle: https://www.homedepot.com/c/ah/how-to-install-remodeling-box...

Any 15A or 20A duplex receptacle can have the tabs broken to get two separate 15A or 20A simplex receptacles, you don’t need a 5-20R for that, a 3-wire 5-15R works just fine.

Someone upthread mentioned a 1.2kW load, which a 15A receptacle handles just fine: .8*120*15=1,440W continuous. Bumping that up to 20A only gets you an additional 480W of continuous load: .8*120*20=1,920W. A continuous load is one that runs for 3 hours or longer; for such loads the overcurrent protection and wire must be upsized by 1.25x (or the circuit derated to 80%).
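
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 80% rule of thumb as used in this thread; the function name and the list of circuits are made up for the example, and none of it replaces checking the actual code requirements for a given installation.

```python
def continuous_capacity_watts(volts: float, breaker_amps: float, derate: float = 0.8) -> float:
    """Continuous load a circuit can carry when limited to `derate` of the breaker rating.

    Hypothetical helper for illustration; 0.8 is the 80% figure quoted in the thread.
    """
    return derate * volts * breaker_amps


# Circuits mentioned in this thread: 120 V at 15 A and 20 A, 240 V at 20 A and 30 A.
for volts, amps in [(120, 15), (120, 20), (240, 20), (240, 30)]:
    watts = continuous_capacity_watts(volts, amps)
    print(f"{volts} V / {amps} A: {watts:,.0f} W continuous")
```

By this rule a 120 V / 15 A circuit covers the ~1.2 kW case mentioned upthread, while a sustained 2.4 kW load needs one of the 240 V options or two separate 120 V circuits.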

Most receptacles in homes are wired with 14/2 romex which is only good to 15A (in homes, which use the 60C ampacity column) which is why I suggested pulling another run of 14/2G romex and breaking the tabs. Pulling 14/2 romex to an existing receptacle usually isn’t that hard if you have a fish tape.

AFAIK computer PSUs can’t easily use 240V power without a PDU in the middle, but I’m likely wrong on that, especially for server PSUs.

Comment by jmalicki 13 hours ago

Almost all computer PSUs I have ever seen are 110/220 since they don't make different models for Europe.

Comment by quickthrowman 11 hours ago

Gotcha, that would make sense. In that case, your suggestion of a 240V outlet is best. Swap the outlet for a 240V one, swap the breaker out for a 240V 2-pole, and use the same wire, assuming the wire is already big enough, since you won’t need a neutral.

Comment by sabareesh 20 hours ago

It is basically on 2 different circuits/breakers. Asus wrx90e supports 2 PSUs as well. You may need to synchronize both PSUs, and several adapters for this are available on Amazon. I am planning to upgrade it to 240V soon.

Comment by tjwebbnorfolk 20 hours ago

exactly, I had a 220v 30a circuit installed to run a multi-GPU server in my basement.

I'm air cooling so I set -pl 450 so I'm not running them all at the full 600w
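
For anyone who wants to do the same, here is a rough Python sketch that builds the `nvidia-smi` power-limit commands and shows what the cap does to the total GPU budget. The GPU count of 4 is an assumption (the comment only says "multi-GPU"), the helper names are hypothetical, and the commands are only printed, not executed; actually applying them typically needs root and a limit within the range the card supports.

```python
def power_limit_commands(gpu_count: int, limit_watts: int) -> list[list[str]]:
    """Build one `nvidia-smi -i <index> -pl <watts>` command per GPU (not executed here)."""
    return [
        ["nvidia-smi", "-i", str(index), "-pl", str(limit_watts)]
        for index in range(gpu_count)
    ]


def total_gpu_watts(gpu_count: int, per_gpu_watts: int) -> int:
    """Total GPU power budget, ignoring CPU, pumps, fans, and PSU losses."""
    return gpu_count * per_gpu_watts


GPU_COUNT = 4      # assumption for illustration
STOCK_WATTS = 600  # per-card figure quoted in the thread
CAPPED_WATTS = 450 # the -pl value from this comment

for command in power_limit_commands(GPU_COUNT, CAPPED_WATTS):
    print(" ".join(command))

saved = total_gpu_watts(GPU_COUNT, STOCK_WATTS) - total_gpu_watts(GPU_COUNT, CAPPED_WATTS)
print(f"GPU budget: {total_gpu_watts(GPU_COUNT, CAPPED_WATTS)} W capped, {saved} W below stock")
```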

Comment by amluto 20 hours ago

I wonder whether those cards ran the model that wrote the nonsense about the forces involved.

Hint: when you have a piece of metal stuck with thermal goop to a lot of components, the force doesn’t “concentrate” on one of them. You need to detach it from each one with however much force is needed to detach it from that component.

Comment by sabareesh 20 hours ago

Not sure what really happened but some force or bad solder caused it.

Comment by tomaytotomato 20 hours ago

What a time to be alive, I remember 10 years ago as a poor student waiting to buy an ATI Radeon X1600 Pro with 256mb, yes 256mb of RAM.

It cost about £190 in 2006.

Now we have GPUs that are in tens of thousands of pounds with insane performance, but what would their price be without the AI and Datacentre squeeze?

Comment by a012 20 hours ago

2006 is 20 years ago

Comment by tomaytotomato 18 hours ago

It feels like 10 years ago for me :D

Thanks for making me feel older now

Comment by cobalt60 16 hours ago

we all seem to have our own timeline! XFX 6600GT

Comment by testing22321 20 hours ago

I remember buying the Radeon PCI with 32MB RAM for $650AUD…

Comment by rvba 18 hours ago

I remember buying 128mb that was added on top of 64mb, which brought me to the unbalanced 192mb setup.

Diablo 2 stopped lagging when a necromancer joined the game and summoned all the skeletons...

On an unrelated note Path of exile 1 still lags even on a 5090

Comment by embedding-shape 14 hours ago

> On an unrelated note Path of exile 1 still lags even on a 5090

Bunch of games lag on a RTX Pro 6000, at one point (most points?) it's less about the hardware :)

Comment by NwtnsMthd 20 hours ago

It's difficult to speculate as to the exact failure from blurry pictures but the solder on that choke (inductor) looks terrible.

Something went wrong in manufacturing. The solder should have wicked to cover the entire pad, not just a small square, and there should be no (brown) discoloration.

Comment by josephg 20 hours ago

Cool post. FYI you might be better off getting one big fan for your "radiator" instead of lots of little fans. Big fans don't need to spin as fast as small fans to push the same amount of air. So they run a lot quieter.

Comment by sabareesh 20 hours ago

Sure, you may call 140mm fans little, but they do need enough static pressure for the radiators. This setup is already several times quieter than the stock setup.

Comment by dwroberts 19 hours ago

> Don’t RMA it, and don’t solder it yourself. A local phone-repair chain with a microsolder tech can put a 3 mm SMD part back on a GPU PCB in twenty minutes for the price of dinner. The skill is in your city. You just have to look.

The trouble with this though is, what if that is not the only issue with the card? That’s normally my thought process on reaching for RMA. The unit could be an all-round lemon that should not have passed QA etc. (and as noted in the post itself, working for a week on various tasks is not enough to prove it good)

Comment by stryakr 19 hours ago

Why does this post sound like it's an AI story based on the inputs from the engineer?

The phrasing is very claude like:

"That cracked joint is the whole story. The card had passed initial bring-up and ran fine at light loads for a week."

"That sequencing matters — it’s why we have a story to tell. The pilot card failed, taught us a lesson, and the lesson is the reason the other three went on without incident."

"Driver swaps, CUDA reinstalls, and inference-engine theories were dead ends I spent hours on. The failure pattern itself told the story — listen to it earlier."

Comment by ssl-3 18 hours ago

Some bot-isms like that are amusing to me.

Stuff like "it's the whole story," "this part matters," and "it's not X" (when X wasn't ever under discussion to begin with).

They're like a bot characterizing to itself what is important, what is unimportant, or sometimes even arguing with itself. Their presentation seems like bits of the internal thinking mechanism leaking into the output queue.

Comment by voidUpdate 20 hours ago

Is that little computer training LLMs from scratch all by itself? That must take years to get any kind of progress, given the scale of training other providers do. Where do you get the training data from?

Comment by sabareesh 20 hours ago

Most of the training I am working on is post-training. You can do so much with a system that is running 24/7.

Comment by atemerev 20 hours ago

You can train TinyStories in a few hours on retail hardware, and this is a highly illuminating experience that I can recommend for everyone.

Comment by sabareesh 4 days ago

Converting four RTX PRO 6000 Blackwell cards to waterblocks, finding a VRM choke loose on the workbench, and getting back to 41k tok/s.

Comment by robin_reala 20 hours ago

Complete side note, but I can’t work out how the author managed to mistype “at” as “Δt”.

Edit: reading fail on my part, nothing to see here.

Comment by xmichael909 20 hours ago

I caught that too, probably a qwen bug (;

Comment by OneDeuxTriSeiGo 20 hours ago

huh? the only Δt in the article is used correctly.

> With 18× 140 mm of surface, the fans run quietly and the coolant Δt across the rads stays small
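
For context, coolant Δt here is the temperature difference between the water entering and leaving the radiators. A minimal steady-state sketch of how it relates to heat load and flow, assuming plain water, the 2.4 kW figure mentioned elsewhere in this thread, and a hypothetical 4 L/min flow rate (the article's real flow rate and coolant mix are not given here):

```python
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water
WATER_DENSITY_KG_PER_L = 1.0             # approximate; real coolant mixes differ


def coolant_delta_t(heat_watts: float, flow_l_per_min: float) -> float:
    """Steady-state coolant temperature change in kelvin: Q = m_dot * c_p * delta_t."""
    mass_flow_kg_per_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return heat_watts / (mass_flow_kg_per_s * WATER_SPECIFIC_HEAT_J_PER_KG_K)


# Assumed numbers for illustration only.
for flow in (2.0, 4.0, 8.0):
    print(f"{flow:.0f} L/min at 2400 W: delta-t = {coolant_delta_t(2400.0, flow):.1f} K")
```

Doubling the flow halves the Δt for the same heat load; how cool the loop runs overall depends on the radiator and airflow, which this balance does not model.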

Comment by robin_reala 20 hours ago

Hah, wow, I completely misread it. Delta-t makes sense when you get the context right, thanks.

Comment by iagooar 19 hours ago

Out of curiosity, what are you training with these cards?

Comment by sabareesh 19 hours ago

I am primarily experimenting on the post-training stack. As of now I am working on training a model that is natively RLM: https://github.com/alexzhang13/rlm

Comment by iagooar 18 hours ago

Nice. Are you working on it for "fun" or as a lab? I am looking into building a semi-professional cluster + setting up a lab, but the investment needed is beyond what I could justify as a hobby.

Comment by alecco 19 hours ago

> 4× RTX PRO 6000 Blackwell Workstation (GB202, 96 GB GDDR7, 600 W)

Those are SM120 so no tmem/tcgen05 and lack of support in main libraries (it's like everybody is focusing on B300/SM100).

For that money I'd buy a single B300, similar total AI TOPS, similar GPU bandwidth aggregated, and only 25% less total memory (probably saved in less implementation complexity), half the energy consumption...

Also by having all SMs local they have the special L1-level interconnect. SMs can collaborate on the same GEMM. And a bunch of other nice features.

Or, you know, rent it.

Comment by sabareesh 19 hours ago

Yes, this has been on my mind as well. But this was built one card at a time, and overall I am still very happy with them.

Comment by sandworm101 20 hours ago

Ditch the tiny DC fans. Build a shroud and switch to a single ac-powered industrial blower / duct fan.

Comment by AnthonBerg 18 hours ago

dB?

Comment by atemerev 20 hours ago

If you want ready, well engineered, water-cooled multi-GPU research workstations, my colleagues at https://comino.com build and sell them. Or you can purchase fitted waterblocks from them for many GPUs, and build your own.

Comment by lightedman 19 hours ago

You can tell this is AI slop by the horrible soldering descriptions - anyone with experience can look at that VRM and go "Oh, the solder is still in its stencil-applied state and has not flowed across the contacts at all, on either component or board. This is a reflow-in-oven issue from the manufacturer." This wasn't a cracking joint; this was a poorly done joint.

Signed, IPC-610 certified tech.

Comment by warpfactor 20 hours ago

AI slop post.

Comment by stryakr 19 hours ago

https://news.ycombinator.com/item?id=48557170

I picked up on it too, this wouldn't have been something difficult to share but it's far too verbose to be a real person's words in this way.