ML: A Hardware Perspective
This rant is inspired by a Zoom call and the limelight of The Stargate Project. Who it’s for: micro-cloud providers and anyone spending more than $10K/mo on ML training.
Cost Optimization Background
Assumptions
If you use AMD or TPUs, your technical proficiency exceeds this post.
Scaling efficiency of ~90% across all generations (close enough).
You’re going for the cheapest regions, pricing as of 01/28/25.
You’re not in a specialized HPC environment like UltraCluster.
Zero consideration for vCPU, bandwidth and SSD allocations.
…but your pre-processing can keep GPUs fed at near 100%.
You don’t have the DevOps to leverage spot instances.
You don’t have the budget for 1-3 year reservations.
Transformer training when possible, because hype.
VRAM limits don’t matter as much for training.
NVIDIA A100, 2020
This is when NVIDIA introduced TF32 (TensorFloat-32). TF32 is supported by TensorFlow and PyTorch; it keeps the 8-bit exponent of FP32 but reduces the mantissa to 10 bits. This life hack gives it performance so far beyond FP32 that unless you specifically need FP32 or FP64 accuracy for advanced quantization techniques (in which case reading this article is a waste of your time), you care mostly about TF32 performance.
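If you’re on PyTorch, opting in takes a couple of lines. A minimal sketch, assuming a reasonably recent PyTorch build (the flags below exist as of PyTorch 1.12+):

```python
import torch

# Allow TF32 on Ampere+ tensor cores. In recent PyTorch versions cuDNN
# convolutions use TF32 by default, but matmuls must be opted in explicitly.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent higher-level switch: "high" allows TF32, "highest" forces full FP32.
torch.set_float32_matmul_precision("high")
```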
Unrelated ML perf: this paper achieves a 95% energy use reduction for multiplication on tensor hardware by almost losslessly approximating it with addition using L-Mul.
TF32 Training Performance with 8x A100 80GB on DGX: ~2,200 TF [1] (Sparse)
AWS p4d.24xlarge: 96 vCPU, 1.1TB RAM
$32.77/Hr On-Demand ($10.36/Hr Spot)
GCP a2-ultragpu-8g: 96 vCPU, 1.4TB RAM
$40.22/Hr On-Demand
Azure ND96amsr A100 v4: 96 vCPU, 1.9TB RAM
$32.77/Hr On-Demand ($5.55/Hr Spot)
NVIDIA H100, 2022
TF32 Training Performance with 8x H100 on SXM (DGX): ~7,120 TF [2]
AWS p5.48xlarge: 192 vCPU, 2TB RAM
$98.32/Hr On-Demand ($54.07/Hr Spot)
GCP a3-highgpu-8g: 208 vCPU, 1.9TB RAM
$87.83/Hr On-Demand
Azure ND96isr H100 v5: 96 vCPU, 1.9TB RAM
$98.32/Hr On-Demand
Practical Cost Optimization
On the Big 3 cloud providers, we pay:
$14.89/Hr for 1000 TF32 TFLOPS with Ampere
$12.34/Hr for 1000 TF32 TFLOPS with Hopper
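If you want to reproduce these numbers, it’s just the hourly price normalized by aggregate TF32 throughput; a quick sketch using the figures above:

```python
def per_1000_tflops(hourly_usd: float, tf32_tflops: float) -> float:
    """Hourly price normalized to 1000 TF32 TFLOPS of aggregate throughput."""
    return hourly_usd / tf32_tflops * 1000

# Cheapest Big 3 on-demand 8x nodes, using the figures above:
print(per_1000_tflops(32.77, 2200))  # Ampere: ~$14.9/hr per 1000 TFLOPS
print(per_1000_tflops(87.83, 7120))  # Hopper: ~$12.3/hr per 1000 TFLOPS
```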
Let’s focus on the lowest micro-cloud cost per 1000 TF32 TFLOPS:
OVHCloud only offers pre-TensorFloat Voltas, so they’re out.
DigitalOcean is $3.35/Hr for 1000 TF32 TFLOPS …maybe.
They’re VERY unclear on whether this is SXM or PCIe.
Atlantic.Net doesn’t list prices. That about sums that up.
Lambda Labs are f*cking champs. Clear pricing and platform:
$4.69/Hr for 1000 TF32 TFLOPS on SXM 40GB A100s
$6.50/Hr for 1000 TF32 TFLOPS on SXM 80GB A100s
$3.36/Hr for 1000 TF32 TFLOPS on SXM 80GB H100s
CoreWeave, not a winner …BUT clear and transparent pricing:
$6.91/Hr for 1000 TF32 TFLOPS on SXM 80GB H100s
They offer a lot of value-add software (hint hint)
RunPod’s “community cloud” and the lack of 8x nodes are sus. They’re out.
Lambda Labs vs Big 3 with on-demand 8x H100s SXM? $3.36 vs $12.34 per 1000 TF32 TFLOPS: a 72.77% price reduction.
Now, Vast.ai is a beast of its own. Let’s say it’s outside the scope of this post, because they deserve their own post and aren’t really a micro-cloud provider. Places like Vast.ai / OpenRouter / etc. are excellent for micro-cloud providers monetizing their spare capacity. They’re also a different universe where things like this are real:
…and that’s inside a secure cloud hosted in a trusted datacenter. These are by far the best prices you’ll ever see and they offer value-add software that makes life easy. The whole world isn’t using them because they’re effectively limited to single VM setups.
Technological Barriers to Entry
Ok Yevgen, we’d love to save $7.3K/mo out of our $10K R&D budget, or $22K/mo out of our $30K/mo R&D budget (enough to hire a good MLOps / DevOps engineer with a $210K salary, 1.25x employer overhead accounted for)! How do we get in on this?
Getting a K8s Cluster Up and Running
First off, why? Because that unlocks a large ecosystem of tools: everything you need.
Rejoice! As of October 2024, Lambda Labs has finessed this into a 1-click operation. A massive upgrade from what you have to do on most micro-clouds today, which still resembles Lambda Labs’ messy and time-consuming guide for how to do this in 2022.
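Once it’s up, a quick sanity check that your GPU nodes actually advertise GPUs to the scheduler saves head-scratching later. A minimal sketch with the official kubernetes Python client, assuming your kubeconfig already points at the new cluster and the NVIDIA device plugin is installed:

```python
from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig context
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # "nvidia.com/gpu" is populated by the NVIDIA device plugin.
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```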
Making a Cluster Useful for R&D
Once you have a cluster, a whole world opens up …for you to implement. A brief look:
SkyPilot is excellent (inference-focused) due to its automatic cost optimization.
If you want to not hate your life and do things like go directly to distributed training jobs, Kubeflow has an operator for that. It supports PyTorch, XGBoost, TensorFlow, and almost everything else you may want. Kubeflow is my personal favorite, so this may be a biased suggestion since I also know it the best. It has:
Easy-to-use VS Code Server and JupyterLab in-cluster with KF Notebooks!
It lets you create E2E ML pipelines natively in Python (researcher-friendly); see the sketch after this list.
It has an amazing ez-bake (in beta, I love it) for Hyperparameter Tuning.
It has a simple model and artifact registry that works for most use cases.
Want to aggregate gradients computed on multiple nodes? You got it.
This uses K8s’ MPI operator and works with TensorFlow and PyTorch.
Got your shiny new model ready for inference? You can easily do that too.
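To make the “pipelines natively in Python” point concrete, here’s a minimal sketch with the KFP v2 SDK. The component body, names, and parameters are placeholders, not a real training job:

```python
from kfp import dsl, compiler


@dsl.component(base_image="python:3.11")
def train(epochs: int) -> str:
    # Placeholder: a real component would pull data, train, and push artifacts.
    return f"trained for {epochs} epochs"


@dsl.pipeline(name="toy-training-pipeline")
def pipeline(epochs: int = 3):
    train(epochs=epochs)


# Compile to IR YAML that you can upload via the Kubeflow Pipelines UI or client.
compiler.Compiler().compile(pipeline, "pipeline.yaml")
```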
Making a Cluster Good for R&D
For performant S3-compatible cluster storage, you’ll want MinIO. We love MinIO! After this, you’ll want cluster observability. Here’s a diagram I cooked up a while back: a basic, opinionated, but OK cookie-cutter setup (click for full size).
Now - I wrote that diagram for a different purpose, so you’ll want to ignore everything outside of the purple “K8 Cluster” boundary. If this is an R&D cluster, skip Envoy, Istio, Kafka, and probably ArgoCD too, and tweak it to use OpenTelemetry-compatible thingies.
This setup will work for most use-cases because it lets you easily unify observability and operations between AWS/GCP/Azure clusters, or multiple micro-cloud clusters. Having this before something sus happens is good for your health and happiness.
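Before moving on - the nice thing about MinIO is that anything speaking S3 just works. A minimal sketch with boto3; the endpoint, bucket, and credentials are placeholders for whatever your in-cluster deployment exposes:

```python
import boto3

# Placeholder endpoint/credentials: point these at your in-cluster MinIO service.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.minio.svc.cluster.local:9000",
    aws_access_key_id="CHANGE_ME",
    aws_secret_access_key="CHANGE_ME",
)

s3.upload_file("checkpoint.pt", "experiments", "run-42/checkpoint.pt")
print(s3.list_objects_v2(Bucket="experiments", Prefix="run-42/"))
```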
Now that you’ve got basic DevOps observability, you may want MLOps observability. You can always use Comet ML or Weights & Biases like a normal person, or (there are benefits to this) drop MLflow into your cluster with Bitnami’s ez-bake Helm Chart.
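Logging to that in-cluster MLflow from a training script looks roughly like this; the tracking URI, experiment name, and values are placeholders:

```python
import mlflow

# Placeholder URI: the in-cluster service exposed by the MLflow Helm chart.
mlflow.set_tracking_uri("http://mlflow.mlflow.svc.cluster.local:5000")
mlflow.set_experiment("resnet-baseline")

with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "batch_size": 256})
    for epoch in range(3):
        mlflow.log_metric("val_loss", 1.0 / (epoch + 1), step=epoch)
    mlflow.log_artifact("checkpoint.pt")  # assumes this file exists on disk
```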
Also, something like Apache Airflow can be a godsend for scheduling / management.
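For flavor, a minimal Airflow DAG sketch (Airflow 2.4+ syntax) that kicks off a hypothetical nightly retraining script; the script path is a placeholder:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_retrain",
    start_date=datetime(2025, 1, 1),
    schedule="0 2 * * *",  # every night at 02:00
    catchup=False,
) as dag:
    retrain = BashOperator(
        task_id="retrain",
        bash_command="python /opt/ml/retrain.py",  # hypothetical script
    )
```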
Keep in mind that at this point your tooling will have significant overlap. This will get you everything you need for a complete R&D shop, but it’s most likely unnecessary.
Doing unnecessary things means unnecessary maintenance and overhead. Do be aware of the tech debt you are committing yourself to when considering these.
The Micro-Cloud Dream
You know what I and a lot of other people who care about GPU pricing would pay a lot of money for, but which, to the best of my knowledge, does not exist right now?
Imagine this. You’re on Azure, or GCP, or AWS. You have a DevOps or MLOps guy or two, and your R&D budget is $10K-50K/mo - nowhere near enough to justify having a full expert platform team for self-service and optimization. That’s a lot of people.
First off, the micro-cloud contract. Cloud providers always want to sell steady reserved capacity on 9-36 month contracts …but that sucks for most people. What if you could sign on for a set amount of resources but use them when you actually need them?
What if your contract had a fully reserved rate only for the machines you need to be always on and run your K8s control / management planes, and the rest auto-scaled?
What if the onboarding’s timeline, expectations, and engineering requirements were all clearly expressed in the contract? We could just have a few checkboxes, such as:
We work on [Reinforcement Learning / Computer Vision / Transformers / etc]
Our Engineers use [VS Code / Jupyter / etc] and [TensorFlow / PyTorch / etc]
We will need mostly [egress / ingress], [X] Gbps sustained, [Y] Gbps burst.
Our training data is [X] GB and we guesstimate training [Y] epochs on it.
We [want / don’t want] them to have awesome, efficient in-cluster IDEs.
We [want / don’t want] to peer with our main [AWS / GCP / Azure] VPC.
We [want / don’t want] cluster logs and metrics visible in our main cloud.
We [want / don’t want] to manage our clusters ourselves. Do it for us!
Our preferred contract duration is [12/24/36 mo] - include incentives!
We want Platform support [Nah / As-Needed / Business Hours / 24/7]
Boom! Here are 3-5 options and quotes for what hardware we recommend.
What if micro-cloud providers provided $2-5K worth of professional services free to onboard your R&D team without DevOps/MLOps support and amortized that cost in the contract itself? That’d de-risk potential customers and shift their CAC (Customer Acquisition Cost) from sales to professional services with a notable overall reduction.
You know what almost every R&D shop wants to hear? It’s this:
All done with your checklist? We’re spinning up a cluster for you; you’ll get SSH keys for the non-control-plane nodes if you want them, plus VPN credentials to connect to it. We’ll give your ML engineers docs on how to use the IDEs, observability tools, and everything else.
Whoever gets this right will dominate, because the ML engineers will sell their CTO on it (dope tooling) for you, and the CTO in turn won’t be risking their job on this.
“This” being: fill out survey, pick feature and hardware options, sign contract, ENG gets credentials and documentation on how to use their shiny new tools.
R&D orgs are not looking for cheaper servers or GPUs, they’re looking to optimize overall costs - and there is a massive opportunity there for high value add services. You don’t even have to do it yourself. Just find trusted partners who can do it for you, for which you foot the bill (amortize it in your contract to de-risk your customers) and lower the barrier to entry. A $100K+ contract is risky enough. Signing it without a partner heavily invested in making sure it all works out is not an option for most.
Aight, Let’s Wrap it Up
Every startup wants to save costs on GPUs. Less than 1% of them (I can list the ones able to efficiently span micro-clouds on one hand, because I know most of them) are capable of it. This is a wheel that has to get re-invented every time a startup wants to save costs on GPUs. A batteries-included, end-to-end setup - from the cluster, to observability, to Docker images, to workload distribution, to IDEs - is the R&D dream.
Why leave it to them? Cloud providers have a vested interest. This isn’t even a sales objection, it’s THE sales objection. Please see exhibit below - the ones on top STILL can’t span micro-clouds …and you’re mostly talking to the ones on the bottom.
Meme credit goes to Eduardo Ordax, the man responsible for my 6-pack.
[1] NVIDIA Datasheet (look at SXM, since that’s HGX/DGX)
[2] NVIDIA Datasheet - BEWARE: that assumes the max 700W TDP configuration.