Ooooh yeah. From experience, even the usual (thorough) suspects who over-over-over-spec cooling and power (as anyone here who has heard an HPE Apollo 6500 at full fan speed can attest) have a hard time getting all the interconnects, PCIe, and firmware stuff up to snuff and running 24/7. Once it's set up it's amazing, but the bring-up can be rocky.
And I'm not even talking about 100/400G networking: wonderful, wonderful hardware, but good luck debugging it and getting all the RoCE/RDMA/GPUDirect/StorageDirect/NCCL pieces working (already a bit of a pain on NVIDIA, and that's with a large installed base...).
Either you want to learn all this stuff (for reasons) or you're dumping a lot of money on fast-evolving tech.