The most expensive part of training an AI model is no longer the talent—it’s the infrastructure. Training GPT-4 alone reportedly cost OpenAI over $100 million, and projections suggest the next generation of frontier models could cost more than $1 billion to train by 2027. This massive investment has turned the question of where to build and train models into a make-or-break strategic decision for any organization serious about AI.
The default choice for most has long been hyperscalers, such as AWS, Azure, and Google Cloud Platform (GCP), because they offer the scale, security, and broad service offerings that enterprises rely on. However, a new wave of specialized GPU-as-a-service (GPUaaS) providers, such as CUDO Compute, CoreWeave, and Lambda, is challenging the status quo. Purpose-built for AI workloads, they deliver high-performance computing at a significantly lower price point.
The wrong choice can lead to ballooning costs, stalled projects, and lost competitive advantage. In this article, we’ll break down the financial and practical implications of each platform, helping you determine which is the most efficient path to build cost-effective AI models.
Differences between hyperscalers and specialized platforms
Hyperscalers and specialized GPUaaS platforms can both deliver the raw compute power needed to train large models, but they operate on fundamentally different business models, with distinct trade-offs in cost, flexibility, and services.
Hyperscalers
For most enterprises, the starting point has long been the hyperscalers—AWS, Microsoft Azure, and Google Cloud Platform (GCP). Their appeal is built on a foundation of trust and comprehensiveness:
- Scalability: Global data centers and on-demand access to GPU clusters that can be spun up in minutes.
- Deep ecosystem integration: Seamless connections with storage, databases, MLOps pipelines, compliance tools, and enterprise security frameworks.
- Established enterprise trust: Long-standing vendor relationships and certifications that simplify procurement and regulatory approvals.
However, this all-in-one convenience comes at a premium. Hyperscalers are designed to support every imaginable workload, meaning AI-focused organizations often pay for infrastructure flexibility they don’t use. This leads to premium pricing on GPU instances, with costs further escalated by data egress fees and rigid configurations. For example, a team needing 6 GPUs might be forced to rent an 8-GPU instance—the closest available size—thereby paying for idle capacity simply because resources are bundled into fixed types.
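To put a rough number on that bundling penalty, here is a minimal Python sketch; the $4/GPU-hour rate and the one-month duration are hypothetical placeholders, not quoted prices:

```python
# Hypothetical illustration of the bundling penalty described above.
HOURLY_RATE_PER_GPU = 4.00   # placeholder rate, not a quoted price
GPUS_NEEDED = 6
GPUS_BILLED = 8              # smallest instance size that fits the job
HOURS = 24 * 30              # one month of continuous training

cost_billed = GPUS_BILLED * HOURLY_RATE_PER_GPU * HOURS
cost_needed = GPUS_NEEDED * HOURLY_RATE_PER_GPU * HOURS
print(f"Billed: ${cost_billed:,.0f}  Used: ${cost_needed:,.0f}")
print(f"Idle spend: ${cost_billed - cost_needed:,.0f} "
      f"({1 - GPUS_NEEDED / GPUS_BILLED:.0%} of capacity unused)")
```

Under these assumptions, a quarter of the monthly bill pays for GPUs the team never uses.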
Specialized GPUaaS providers
In contrast, GPUaaS providers—such as CUDO Compute—have emerged to do one thing exceptionally well: run AI workloads at scale. Their value proposition rests on three pillars:
- Aggressive pricing: By focusing exclusively on high-demand chips like the NVIDIA A100 and H100 and eliminating the overhead of other cloud services, they can offer significantly lower compute costs.
- Tailored infrastructure: Their networks, storage, and orchestration are purpose-built for the unique demands of high-performance training and inference.
- Greater flexibility: They often provide faster access to the latest GPUs without the long waitlists or strict quotas sometimes seen on hyperscalers.
These advantages come with their own trade-offs. GPUaaS providers typically have a smaller global footprint, a more limited ecosystem of adjacent tools, and may carry higher perceived risk for enterprises evaluating long-term vendor stability. For organizations that require deep integration across a wide array of cloud services, this narrower focus can be a significant limitation.
What it costs to train an AI model
Training a modern AI model is not just about renting GPUs. The final bill stems from a complex web of cost drivers that extend well beyond raw compute time. We have covered these drivers in depth in previous articles, but here is a brief overview.
1. Compute hardware
The largest cost component is GPU usage. Cloud pricing for high-end GPUs, such as the NVIDIA H100, can range from $1.77 to $13 per hour, depending on the provider and configuration. GPUaaS providers like CUDO Compute often undercut hyperscaler rates significantly by optimizing infrastructure utilization.
For a detailed breakdown of rental costs across providers, see What Does It Cost to Rent Cloud GPUs?
2. Scale and duration of training
Model size and training duration drive costs sharply upward. A small model might train in days on a few GPUs, while larger models can run for weeks or months, turning GPU-hours into the dominant line item. Hyperscalers often charge premiums for uninterrupted, large-scale GPU clusters.
For a scenario-based analysis of how costs escalate from GPT-3 to GPT-4 and beyond, see What Is the Cost of Training Large Language Models?.
3. Storage and data pipeline
Handling vast datasets brings hidden expenses. Egress charges—fees for moving data out of the cloud—can quickly accumulate and are often overlooked. Hyperscalers bundle storage with compute, but the convenience may come at a steeper cost if data pipelines are inefficient.
Read more: Why AI teams need cloud infrastructure without vendor lock-ins
4. Networking and interconnect
Efficient distributed training requires high-speed networking (e.g., NVLink, InfiniBand). Hyperscalers may charge extra for low-latency clusters, while GPUaaS providers often design infrastructure with training in mind, offering better value for high-bandwidth requirements.
5. Software & orchestration overhead
Compute efficiency is also about tooling. While hyperscalers provide integrated services (e.g., SageMaker, Vertex AI), which reduce setup bottlenecks, GPUaaS providers often offer customizable orchestration, allowing for greater control over tuning; however, this may require more setup and engineering effort.
6. Energy & sustainability considerations
AI training consumes a lot of energy. Hyperscalers often invest heavily in renewable credits and highly optimized data centers, which can support ESG reporting. GPUaaS clouds may operate leaner—prioritizing utilization and cost efficiency—but often with a less formalized sustainability narrative.
A blend of hardware pricing, training scale, data handling, networking, orchestration, and energy efficiency shapes your AI training costs. Hyperscalers charge a premium for ecosystem breadth and reliability, while specialized GPU providers focus narrowly on cost and performance efficiency.
Cost comparison of AI training across AWS, Azure, GCP, CoreWeave, CUDO, and Lambda (On-demand)
Before diving into the three scenarios we will analyze, here are the on-demand prices for NVIDIA H100 GPUs on each platform (as of mid-2025). These rates (per GPU per hour) will be used in our calculations:
- AWS (EC2 P5 instance): ~$7.57/hour per H100 GPU. (e.g., p5.48xlarge with 8×H100 ≈ $60.54/hr total)
- Azure (NCads H100 v5 VM): $6.98/hour per H100 GPU. (1×H100 80GB VM in East US region)
- Google Cloud (A3 a3-highgpu-1g): ~$11.06/hour per H100 GPU. (1×H100 80GB in us-central1 on-demand)
- CoreWeave (“H100 HGX” instance): ~$6.16/hour per H100 GPU. (8×H100 w/ InfiniBand at $49.24/hr total under standard pricing)
- CUDO Compute (on-demand marketplace): $2.25–$2.47/hour per H100 GPU. (SXM: $2.25/hr, PCIe: $2.47/hr; flat on-demand)
- Lambda (Lambda Cloud instance): $2.99/hour per H100 GPU. (8×H100 SXM instance at $23.92/hr total; billed by the minute, with no egress fees)
Note: The GPU-hour prices above bundle the associated CPU, memory, and local storage that come with each instance. We will separately account for object storage and data transfer (egress) costs in the following scenarios. All rates are on-demand (pay-as-you-go), with no reserved capacity or spot discounts applied.
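To keep the scenario arithmetic transparent, here is a minimal sketch of the cost model we apply throughout: GPU-hours times the hourly rate, plus flat storage and egress charges. The `estimate_cost` helper and its simplifications (no instance-size rounding, no reserved or spot discounts) are our own assumptions:

```python
# A minimal sketch of the cost model behind the scenario tables below.
# Rates per H100 GPU-hour as listed above (on-demand, mid-2025).
H100_RATES = {
    "AWS": 7.57,
    "Azure": 6.98,
    "GCP": 11.06,
    "CoreWeave": 6.16,
    "CUDO Compute": 2.25,   # SXM rate
    "Lambda": 2.99,
}

def estimate_cost(provider: str, gpu_hours: float,
                  storage: float = 0.0, egress: float = 0.0) -> float:
    """Total bill: GPU compute plus flat storage and egress charges."""
    return H100_RATES[provider] * gpu_hours + storage + egress

# Example: a 384 GPU-hour fine-tune (Scenario 2) on CUDO Compute.
print(f"${estimate_cost('CUDO Compute', 384, storage=5):,.2f}")  # -> $869.00
```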
Scenario 1: Training a 70B-parameter model from scratch
Assumptions: We consider a full pre-training run of a large 70B parameter model (comparable to Meta’s LLaMA 2/3 70B). This is an extremely compute-intensive task. For example, Meta’s LLaMA 3 70B reportedly required around 6.4 million H100 GPU-hours to train using a cluster of 24,576 H100 GPUs over 11 days.
In practice, one might use a smaller cluster for longer (e.g., on 2,000 GPUs for ~133 days, or 1,000 GPUs for ~266 days) to accumulate a similar order of GPU-hours. Here we will use 6.4 million GPU-hours as the basis for cost comparison, and assume training data on the order of a few terabytes (stored in cloud object storage) with minimal external data transfer during training (data is read and processed in-region).
- Resource & duration assumptions: Total H100 GPU-hours: ~6,400,000 (e.g., 1,000 GPUs × 6,400 hours, or equivalent)
- Training duration: On a large cluster of several thousand GPUs, a training run of this magnitude would take months. Our cost model is based on the total work performed, which amounts to 6.4 million GPU-hours. This total remains the same regardless of the configuration; for example, using 1,000 GPUs for 266 days costs the same as using 2,000 GPUs for 133 days.
- High-speed interconnect: Required for multi-GPU synchronization. All these platforms offer NVLink/NVSwitch within nodes and InfiniBand or similar between nodes, included in instance pricing. For example, AWS P5 instances have 3,200 Gbps EFA networking built in. No additional networking costs are charged for in-region internal traffic on these instances.
- Data storage: Our cost model includes 5 TB of data storage, calculated using standard object storage rates. This total volume covers the initial training data plus the essential model "checkpoints"—snapshots saved during the run to ensure progress isn't lost in case of an interruption.
- Data egress: Assume negligible during training since data is stored and used within the cloud. We consider a one-time model output of ~200 GB (final checkpoint) downloaded from the cloud as egress.
Below is a cost breakdown table for Scenario 1:
Provider | H100 Price per Hour | GPU-Hours | GPU Compute Cost | Storage (5 TB) | Egress (200 GB) | Estimated Total |
---|---|---|---|---|---|---|
AWS | $7.57 | 6,400,000 | $48,448,000 | $115 | $18 | $48,448,133.00 |
Azure | $6.98 | 6,400,000 | $44,672,000 | $92 | $17.40 | $44,672,109.40 |
GCP | $11.06 | 6,400,000 | $70,784,000 | $125 | $24 | $70,784,149.00 |
CoreWeave | $6.16 | 6,400,000 | $39,424,000 | $150 | 0 | $39,424,150.00 |
CUDO Compute | $2.25* | 6,400,000 | $14,400,000 | $50 | 0 | $14,400,050.00 |
Lambda | $2.99 | 6,400,000 | $19,136,000 | 0** | 0 | $19,136,000.00 |
- *Using the lower SXM rate, appropriate for large-scale training.
- **Assuming bundled ephemeral storage is sufficient for in-training checkpoints.
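For readers who want to sanity-check the table, this standalone sketch recomputes each row from the hourly rates and the per-provider storage and egress estimates above:

```python
# Standalone check of the Scenario 1 totals (figures from the table above).
GPU_HOURS = 6_400_000
SCENARIO_1 = {
    # provider: (rate per GPU-hour, storage $, egress $)
    "AWS":          (7.57, 115, 18),
    "Azure":        (6.98, 92, 17.40),
    "GCP":          (11.06, 125, 24),
    "CoreWeave":    (6.16, 150, 0),
    "CUDO Compute": (2.25, 50, 0),
    "Lambda":       (2.99, 0, 0),
}

for provider, (rate, storage, egress) in SCENARIO_1.items():
    total = rate * GPU_HOURS + storage + egress
    print(f"{provider:>12}: ${total:,.2f}")
```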
Analysis:
Training a 70B-parameter model from scratch is enormously expensive across all providers. GPU compute dominates the bill, quickly reaching tens of millions of dollars on the hyperscalers. For such a workload, on-demand costs for AWS and Azure fall in the $45–48 million range, while GCP is even higher, at approximately $71 million, due to its $11/hr GPU pricing.
By contrast, specialized GPU cloud providers are substantially cheaper: CoreWeave would cost around $39 million, Lambda approximately $19 million, and CUDO Compute just over $14.4 million for the same 6.4 million GPU-hour job. The cost gap reflects the significantly lower base price per GPU hour on these platforms.
Storage and data transfer are relatively minor contributors to the overall cost. Holding 5 TB of data costs roughly $50–$150 per month, depending on the provider. Downloading final model checkpoints (200 GB) would cost only a few tens of dollars on AWS, Azure, or GCP (roughly $0.09–$0.12/GB), and nothing on CoreWeave, CUDO, or Lambda, all of which advertise zero egress fees.
Networking interconnect charges do not apply here, since all training nodes operate within the same cluster or region. For instance, AWS’s UltraCluster networking is included in instance pricing, and CoreWeave likewise bundles high-speed networking into its rates.
Overall, on-demand cloud training of a 70B model ranges from $14.4M on CUDO Compute to $71M on GCP, highlighting why most organizations avoid training such large models from scratch in the cloud. Instead, it is far more cost-effective to start with a pre-trained model and fine-tune it for specific use cases.
Scenario 2: Fine-tuning a mid-sized model
Assumptions: In this scenario, we fine-tune a mid-sized pre-trained model (for example, a language model with approximately 6–13 billion parameters) on a domain-specific dataset. Fine-tuning is far cheaper than training a model from scratch because it typically requires fewer GPU hours (often only a few epochs on a smaller dataset) and sometimes uses techniques like low-rank adaptation (LoRA) to update only a subset of parameters (reducing compute by 10–100×). We assume a moderately intensive fine-tuning run:
- GPU setup: 8× H100 GPUs (e.g., one 8-GPU server) for parallel training.
- Duration: About 48 hours of training. This could represent 2 days of fine-tuning with a reasonably large dataset or multiple runs.
- Total GPU-hours: 8 GPUs × 48 hours = 384 GPU-hours.
- Storage: Assume 500 GB of training data and model checkpoints, which is much smaller than in Scenario 1.
- Data egress: Minimal; essentially just the one-time download of the final fine-tuned model (approximately 20 GB).
Costs for fine-tuning (on-demand prices):
Provider | H100 Price per Hour | GPU-Hours | GPU Compute Cost | Storage (0.5 TB) | Egress (20 GB) | Estimated Total |
---|---|---|---|---|---|---|
AWS | $7.57 | 384 | $2,907 | $11.50 | $1.8 | $2,920.18 |
Azure | $6.98 | 384 | $2,680 | $9.20 | $1.7 | $2,691.22 |
GCP | $11.06 | 384 | $4,247 | $11.50 | $2.4 | $4,260.94 |
CoreWeave | $6.16 | 384 | $2,366 | $15 | Free | $2,380.44 |
CUDO Compute | $2.25* | 384 | $864 | $5 | Free | $869.00 |
Lambda | $2.99 | 384 | $1,148 | 0** | Free | $1,148.16 |
- *Using the lower SXM rate, consistent with Scenario 1.
- **Assuming bundled ephemeral storage is sufficient for in-training checkpoints.
Analysis:
As expected, fine-tuning is orders of magnitude cheaper than training from scratch. The data reveals a clear cost hierarchy: the job costs approximately $4,260 on GCP, drops to the $2,700-$2,900 range on AWS and Azure, and becomes even more accessible on GPUaaS platforms.
CoreWeave completes the task for $2,380, while Lambda and CUDO Compute are significantly lower at $1,148 and $869, respectively. In relative terms, choosing the most cost-effective provider (CUDO) over the most expensive (GCP) could reduce the fine-tuning budget by approximately 80%.
At this scale, secondary costs like storage and data transfer are almost negligible. Storing half a terabyte of data costs only $5–$15 per month, and downloading the 20 GB final model is just a couple of dollars on hyperscalers (and free on the others).
In fact, for short jobs like this, these costs can often be eliminated entirely by using ephemeral instance storage. For example, CUDO Compute’s on-demand clusters are provisioned with 17.84 TiB of NVMe SSD per machine, included in the cluster rate, allowing you to manage datasets and checkpoints without a separate storage service.
Finally, it’s important to note that our 384 GPU-hour scenario represents a relatively heavy fine-tuning run. Many modern techniques (like LoRA) can achieve excellent results with far less compute, potentially bringing costs down to just hundreds of dollars. This affordability is precisely why fine-tuning pre-trained models has become the default strategy for most organizations, reserving the multi-million-dollar cost of training from scratch for only the largest, most foundational projects.
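As a rough illustration of that headroom, the sketch below applies a 10× LoRA reduction (the conservative end of the 10–100× range cited earlier) to the 384 GPU-hour baseline; the factor is illustrative, and real savings depend on the method, dataset, and task:

```python
# Illustrative LoRA savings relative to the 384 GPU-hour run above.
FULL_RUN_GPU_HOURS = 384
LORA_REDUCTION = 10                 # illustrative; the article cites 10-100x
CUDO_RATE, GCP_RATE = 2.25, 11.06   # per H100 GPU-hour, from the price list

lora_hours = FULL_RUN_GPU_HOURS / LORA_REDUCTION
print(f"LoRA run: ~{lora_hours:.0f} GPU-hours")
print(f"CUDO Compute: ~${lora_hours * CUDO_RATE:,.0f}")   # ~$86
print(f"GCP:          ~${lora_hours * GCP_RATE:,.0f}")    # ~$425
```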
Scenario 3: Multiple small-model experiments
Assumptions: This scenario represents a typical research and development workflow, such as training or fine-tuning numerous small models (e.g., those with 100 million to 1 billion parameters) or prototyping with small datasets. Researchers often run dozens of these experiments to test new ideas or tune hyperparameters.
We will assume a cumulative usage of 200 GPU-hours spread across multiple jobs. This could be, for instance, 20 separate experiments, each using a single H100 GPU for about 10 hours. In this setting, each individual run is relatively short and self-contained on one GPU, meaning no special multi-node networking is required. We will also assume that dataset sizes are small (a few gigabytes per experiment, totaling less than 100 GB of storage) and that data egress is negligible, aside from downloading final results.
Costs for multiple small experiments (200 GPU-hours total):
Provider | H100 Price per Hour | GPU-Hours | GPU Compute Cost | Storage (0.1 TB) | Egress | Estimated Total |
---|---|---|---|---|---|---|
AWS | $7.57 | 200 | $1,514 | $2.30 | $0.09 | $1,516.39 |
Azure | $6.98 | 200 | $1,396 | $1.84 | $0.09 | $1,397.93 |
GCP | $11.06 | 200 | $2,212 | $2.30 | $0.12 | $2,214.42 |
CoreWeave | $6.16 | 200 | $1,232 | $3 | 0 | $1,235.00 |
CUDO Compute | $2.25 | 200 | $450 | $1 | 0 | $451.00 |
Lambda | $2.99 | 200 | $598 | 0 | 0 | $598.00 |
Analysis:
For a collection of small experiments totaling 200 GPU-hours, costs are comparatively low in absolute terms, but the differences between providers remain significant in percentage terms. On AWS, 200 GPU-hours cost about $1,514; on Azure, about $1,396. GCP’s higher rate makes it $2,212 for the same work.
Meanwhile, CoreWeave would be roughly $1,232, and the cheapest options are CUDO at $450 and Lambda at $598 for the full 200 hours. This means a researcher's budget stretches nearly five times further on CUDO compared to GCP, allowing for a vastly larger number of experiments for the same cost.
For small-scale runs, ancillary costs are negligible. 100 GB of storage for datasets is only a dollar or two per month on any cloud. Most providers have free inbound data and charge nothing for keeping data in the same region during experiments. If results (such as a few model files or plots) are downloaded, that might be just a few GB – on AWS, that would cost well under $1 (at $0.09/GB), and again, $0 on platforms with no egress fees.
One consideration for multiple small runs is the overhead of repeatedly spinning up instances. On-demand instances are billed by the minute (or hour), and each job may not perfectly use a full hour increment; however, our calculation assumes 200 full GPU-hours of usage in total. The no-cost interconnect advantage of specialized providers is less relevant here, as we are mostly using single GPUs per experiment (no multi-GPU syncing is needed). However, the absence of egress fees on some platforms can encourage freely moving data and results out, which is convenient for an iterative research workflow.
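The sketch below illustrates that point with a hypothetical batch of 20 short jobs, comparing per-minute billing against rounding each job up to a full hour; the job durations and the $2.99 rate are illustrative, and billing granularity varies by provider:

```python
import math

# Hypothetical batch of 20 short jobs, ~9.7 hours (582 minutes) each.
JOB_MINUTES = [582] * 20
RATE_PER_HOUR = 2.99   # e.g., an H100 rate billed by the minute

by_minute = sum(m / 60 * RATE_PER_HOUR for m in JOB_MINUTES)
by_hour = sum(math.ceil(m / 60) * RATE_PER_HOUR for m in JOB_MINUTES)
print(f"Billed by the minute: ${by_minute:,.2f}")   # $580.06
print(f"Rounded up to hours:  ${by_hour:,.2f}")     # $598.00
print(f"Rounding overhead:    ${by_hour - by_minute:,.2f}")
```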
In summary, for numerous small-model experiments, cloud costs remain in the low thousands or even hundreds of dollars. Using lower-cost GPU clouds, such as CUDO or Lambda, can make an extensive experimental campaign very affordable, whereas relying solely on the big-three cloud on-demand GPUs could cost a couple of thousand dollars for the same work. Researchers often choose these specialized providers or reserved/spot instances on hyperscalers to stretch their experimentation budgets.
Sources: The above comparisons use published on-demand pricing from official cloud provider pages and recent third-party summaries.
Trade-offs between hyperscalers and specialized providers
While cost is a primary driver, the choice between a hyperscaler and a specialized provider involves a strategic assessment of performance, compliance, and ecosystem integration. Each platform offers a distinct value proposition tailored to different organizational needs.
1. Ecosystem and integration
- Hyperscalers: Offer deeply integrated ecosystems—from data warehouses to managed AI services, DevOps tools, and enterprise-grade identity management. For organizations already embedded in a cloud stack, these native integrations reduce friction.
- Specialized Providers: Focus narrowly on compute performance and cost. They typically lack the broad service catalog but compensate with transparent pricing, zero egress fees, and direct, optimized access to GPU clusters.
2. Performance and flexibility
- Hyperscalers: Provide high-performance infrastructure with global availability and strong SLAs. However, access to cutting-edge GPUs can be limited, resulting in long queue times or allocation challenges.
- Specialized Providers: Purpose-built for AI, they often provision high-demand GPUs faster and with greater flexibility in cluster configurations. This agility can significantly shorten project timelines.
3. Compliance and enterprise readiness
- Hyperscalers: Bring long-standing certifications, enterprise security frameworks, and established vendor relationships that simplify procurement and regulatory approval, making them the safer default in heavily regulated industries.
- Specialized Providers: Are maturing quickly on this front, but typically offer narrower compliance coverage and may carry higher perceived risk for enterprises evaluating long-term vendor stability.
4. Cost transparency and predictability
- Hyperscalers: On-demand pricing is premium, and costs can escalate quickly due to egress charges, storage fees, and networking. Reserved or spot pricing can reduce costs, but introduces complexity and risk.
- Specialized Providers: Generally offer simpler, more transparent pricing with fewer hidden costs. Zero egress fees (as seen with CUDO) can substantially reduce TCO for iterative workflows or multi-cloud strategies.
5. Strategic positioning
- Hyperscalers: Offer “one-stop-shop” convenience and brand assurance, but at the cost of vendor lock-in and premium rates.
- Specialized Providers: Represent a cost-effective and agile alternative, empowering AI-native companies and research teams. Enterprises often adopt a hybrid strategy to leverage the best of both worlds.
Key takeaway: Hyperscalers offer the safety of scale, compliance, and integration, but at a premium. Specialized GPU providers, on the other hand, deliver speed, transparency, and cost efficiency—making them especially attractive for AI-first organizations and experimentation-heavy workflows.
The right compute for the right job
The cost of AI training is no longer a background concern — it’s a strategic determinant of competitiveness. Training a frontier-scale model can cost tens of millions, while even mid-sized fine-tuning projects quickly add up. The platform you choose to train on has a direct impact not only on your budget, but also on how fast your teams can iterate and how much risk your organization carries.
If your team is preparing to fine-tune, pre-train, or experiment with AI models, it’s worth running your next workload on CUDO Compute’s GPU clusters. With competitive on-demand H100 pricing, massive NVMe storage included per cluster, and zero egress fees, CUDO gives you the performance of a hyperscaler at a fraction of the cost.