The LLM Deployment Provider Breakdown
The best model in the world doesn't matter if it's running somewhere that costs a fortune, violates your compliance requirements, or bottlenecks your latency. The deployment layer is where most teams make expensive mistakes — usually by defaulting to whatever they heard about first.
Here's a map of the actual tradeoffs.
The landscape in one sentence
There are four categories: inference APIs (Groq, Together, Fireworks), managed cloud integrations (Bedrock, Azure OpenAI, Vertex), serverless GPU platforms (Modal, RunPod), and self-hosting (vLLM on your own GPUs). Each one wins in different scenarios.
Inference APIs — speed and simplicity first
These are purpose-built inference layers, not cloud behemoths with LLMs bolted on.
Groq runs custom LPU (Language Processing Unit) hardware, not GPUs. The result: 716 tokens/sec on Llama 3.1 70B, with TTFT under 100ms. For real-time applications such as chat, voice agents, or anything user-facing, this is hard to beat. Pricing sits at ~$0.59 per 1M tokens, input and output alike, for 70B models.
Together.ai covers the widest open-source model catalog, including fine-tuned variants. Their batch API runs at a 75% discount versus synchronous calls, which makes it the right tool for nightly pipelines, evaluations, or offline enrichment jobs. Synchronous pricing is ~$0.90/1M tokens.
Fireworks.ai hits a rare combo: HIPAA-eligible and SOC 2 Type II, at inference-API pricing. For healthcare or fintech startups that can't afford AWS Bedrock pricing but need compliance documentation, this is the play. They also run FireFunction (function-calling-optimized Mistral variant) at ~70ms TTFT.
What inference APIs can't give you
- VPC PrivateLink / private endpoints
- Zero-data-retention (ZDR) guarantees from the provider
- Enterprise SLAs with dedicated capacity
- Multi-modal parity (most are text-only or lag behind on image and audio support)
If those matter, you're in managed cloud territory.
Managed cloud providers — compliance and ecosystem
AWS Bedrock gives you Claude, Titan, Llama, Mistral, and others inside your VPC via PrivateLink, so traffic to the model stays off the public internet. It's HIPAA-eligible, FedRAMP-in-progress, and it logs no prompts or completions by default. The catch: it's roughly 10–15× more expensive than the inference APIs above, though that compares a frontier model against Llama 3.1 70B rather than like for like. Claude Sonnet 4.6 runs $3/$15 per 1M in/out, against ~$0.59 for 70B models on Groq. For companies already deep in AWS where the security posture justifies the cost, it's the right call.
Azure OpenAI's standout feature is Zero Data Retention (ZDR) mode. With ZDR enabled, prompts and completions never touch storage, not even temporarily. For banks and healthcare systems under strict data retention rules, that's a hard requirement. GPT-4o runs ~$10/1M output.
GCP Vertex AI is the Google route: Gemini, Claude (through Google's partnership with Anthropic), and Llama. If you're already on GCP with BigQuery, Cloud Run, and GKE, the IAM integration is seamless. Gemini 1.5 Pro at ~$7/1M output is competitive. The tooling around evaluation pipelines and Vertex Pipelines is genuinely good.
TGI (Hugging Face Text Generation Inference) is a self-hosted serving framework rather than a managed cloud service, and it is effectively in maintenance mode as of 2025. New deployments should use vLLM. TGI is still reasonable for existing setups, but don't build new infrastructure on it.
Self-hosting — when the math finally works
The break-even point is roughly $80K/month in API spend. Below that, managed providers win on simplicity and reliability. Above it, the GPU economics shift.
At that scale, you provision A100 or H100 instances and run vLLM. The numbers: vLLM hits 793 tokens/sec throughput with continuous batching, roughly 24× the throughput of a naive Transformers inference loop at high concurrency. vLLM also provides PagedAttention (efficient KV cache allocation), prefix caching, and multi-LoRA serving.
Contrast that with Ollama's ~41 tokens/sec — fine for local dev, not for production multi-user workloads.
Self-hosting realities people don't talk about: you own the CUDA driver hell, you own the model weight updates, you own the uptime. Add the salary of one ML infra engineer to that cost estimate.
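To check the break-even against your own numbers, a back-of-the-envelope sketch like the one below can help. The GPU count, hourly rate, and engineer cost are placeholder assumptions, not quotes from any provider, and the sketch ignores whether the cluster can actually absorb your traffic.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass
class SelfHostEstimate:
    gpu_count: int               # GPUs kept running around the clock (assumed)
    gpu_hourly_usd: float        # per-GPU hourly rate (assumed, not a quote)
    engineer_monthly_usd: float  # loaded cost of one ML infra engineer (assumed)

    def monthly_cost(self) -> float:
        """GPU rental for the whole month plus the engineer who runs it."""
        gpu_cost = self.gpu_count * self.gpu_hourly_usd * HOURS_PER_MONTH
        return gpu_cost + self.engineer_monthly_usd


def self_hosting_wins(api_spend_monthly_usd: float, estimate: SelfHostEstimate) -> bool:
    """True when current API spend exceeds the estimated self-hosting bill."""
    return api_spend_monthly_usd > estimate.monthly_cost()


# Hypothetical cluster: 16 GPUs at $3.00/hour plus a $20K/month engineer.
estimate = SelfHostEstimate(gpu_count=16, gpu_hourly_usd=3.00, engineer_monthly_usd=20_000)
print(f"Self-hosting estimate: ${estimate.monthly_cost():,.0f}/month")
print("Wins at $80K API spend:", self_hosting_wins(80_000, estimate))
print("Wins at $30K API spend:", self_hosting_wins(30_000, estimate))
```

Swap in your real GPU quotes and headcount; the point is that the engineer line item moves the break-even noticeably at small cluster sizes.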
Serverless GPU — the middle path
Modal and RunPod let you deploy custom model containers and pay per compute-second. Cold start is the catch — first request on a scaled-to-zero instance takes 15–30 seconds. Warm, it runs well (~80 TPS on a 4×A10G container). Good for:
- Low-traffic fine-tuned models that need on-demand availability
- Batch jobs you want to parallelize across GPU workers
- Prototypes before you commit to an instance type
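How much the cold start hurts depends on how often requests land on a scaled-to-zero container. The sketch below estimates average latency from the figures above (~80 tokens/sec warm, 15–30 second cold start); the reply length and the cold-request fractions are hypothetical.

```python
def generation_time_s(output_tokens: int, tokens_per_s: float) -> float:
    """Seconds spent generating once the container is warm."""
    return output_tokens / tokens_per_s


def expected_latency_s(output_tokens: int, tokens_per_s: float,
                       cold_start_s: float, cold_fraction: float) -> float:
    """Average request latency when a fraction of requests pay the cold start."""
    warm = generation_time_s(output_tokens, tokens_per_s)
    return warm + cold_fraction * cold_start_s


# 400-token reply at ~80 tokens/sec, 20 s cold start (middle of the 15-30 s range).
for cold_fraction in (0.0, 0.05, 0.5):  # hypothetical traffic patterns
    latency = expected_latency_s(400, 80.0, cold_start_s=20.0, cold_fraction=cold_fraction)
    print(f"{cold_fraction:.0%} cold requests: {latency:.1f} s average latency")
```

Averages hide the tail: the unlucky request still waits the full cold start, which is why this path suits low-traffic and batch workloads better than interactive ones.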
The decision framework
The right provider follows from three questions:
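1. Do you have compliance requirements? HIPAA points to AWS Bedrock (VPC PrivateLink), Azure OpenAI (ZDR), or Fireworks.ai. FedRAMP narrows the field to the managed clouds, and you should confirm the current authorization status before committing. Not Groq, not Together.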
2. What's your API spend? Below $80K/month: managed provider or inference API. Above: self-hosting starts to pencil out with an infra team.
3. Is TTFT the primary UX constraint? Real-time voice or chat: Groq/Cerebras. Async background work: Together batch or any managed provider.
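The three questions can be sketched as a small function. This is one possible reading, not a definitive rule set: treating compliance as an override of the other two questions is an assumption, and the function and constant names are hypothetical.

```python
SELF_HOST_BREAK_EVEN_USD = 80_000  # monthly API spend threshold from the article


def recommend(needs_compliance: bool, monthly_api_spend_usd: float,
              ttft_critical: bool) -> list[str]:
    """Return candidate deployment options, asking the three questions in order."""
    if needs_compliance:
        # Assumption: compliance overrides cost and latency concerns.
        # Verify each provider's certifications for your specific regime.
        return ["AWS Bedrock", "Azure OpenAI", "Fireworks.ai"]

    candidates: list[str] = []
    if monthly_api_spend_usd > SELF_HOST_BREAK_EVEN_USD:
        candidates.append("Self-hosted vLLM")  # only with an infra team
    if ttft_critical:
        candidates += ["Groq", "Cerebras"]
    else:
        candidates += ["Together batch", "any managed provider"]
    return candidates


print(recommend(needs_compliance=True, monthly_api_spend_usd=5_000, ttft_critical=True))
print(recommend(needs_compliance=False, monthly_api_spend_usd=10_000, ttft_critical=True))
print(recommend(needs_compliance=False, monthly_api_spend_usd=120_000, ttft_critical=False))
```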
Pricing quick reference — May 2026
| Provider | Model | Input /1M | Output /1M | Notes |
|---|---|---|---|---|
| Groq | Llama 3.1 70B | $0.59 | $0.59 | LPU, 716 TPS |
| Together | Llama 3.1 70B | $0.90 | $0.90 | batch -75% |
| Fireworks | Llama 3.1 70B | $0.90 | $0.90 | HIPAA+SOC2 |
| AWS Bedrock | Claude Sonnet 4.6 | $3.00 | $15.00 | VPC private |
| Azure OAI | GPT-4o | $2.50 | $10.00 | ZDR available |
| GCP Vertex | Gemini 1.5 Pro | $1.25 | $7.00 | GCP-native |
| Self-hosted | Llama 3.1 70B | ~$0.80 | ~$0.80 | A100 at scale |
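To turn the table into a monthly bill, multiply each price by your token volume. The sketch below copies the prices from the table above (the self-hosted figure is approximate) and ranks providers for a hypothetical workload; the token volumes are made up for illustration, and the ranking compares different models, so it says nothing about quality.

```python
# (input USD per 1M tokens, output USD per 1M tokens), copied from the table above.
PRICES = {
    "Groq / Llama 3.1 70B": (0.59, 0.59),
    "Together / Llama 3.1 70B": (0.90, 0.90),
    "Fireworks / Llama 3.1 70B": (0.90, 0.90),
    "AWS Bedrock / Claude Sonnet 4.6": (3.00, 15.00),
    "Azure OAI / GPT-4o": (2.50, 10.00),
    "GCP Vertex / Gemini 1.5 Pro": (1.25, 7.00),
    "Self-hosted / Llama 3.1 70B": (0.80, 0.80),  # approximate in the table
}


def monthly_cost_usd(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Token volume times the per-million price for one provider."""
    in_price, out_price = PRICES[provider]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def rank_by_cost(input_tokens: int, output_tokens: int) -> list[tuple[str, float]]:
    """All providers sorted from cheapest to most expensive for this workload."""
    costs = {name: monthly_cost_usd(name, input_tokens, output_tokens) for name in PRICES}
    return sorted(costs.items(), key=lambda item: item[1])


# Hypothetical workload: 2B input tokens and 500M output tokens per month.
for name, cost in rank_by_cost(2_000_000_000, 500_000_000):
    print(f"{name:<34} ${cost:>10,.2f}")
```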
Operational things worth knowing
vLLM, not TGI. If you're self-hosting in 2026, TGI is in maintenance mode. vLLM has the active development community, better continuous batching, and more frequent model support updates.
Multi-provider routing matters. At production scale, no single provider has 100% uptime. LiteLLM lets you configure primary/fallback routing across providers with a single API surface. A typical chain: Groq primary → Together fallback → Bedrock emergency.
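The fallback pattern itself is simple enough to sketch in plain Python. This illustrates the idea only; it is not LiteLLM's API, and the stub functions stand in for real provider SDK calls.

```python
from typing import Callable, Sequence

# (provider name, function that takes a prompt and returns completion text)
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errors out."""


def complete_with_fallback(prompt: str, chain: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in order; return (name, completion) from the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stubs: the primary is down, the fallback answers.
def groq_stub(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def together_stub(prompt: str) -> str:
    return f"echo: {prompt}"


def bedrock_stub(prompt: str) -> str:
    return f"emergency echo: {prompt}"


chain = [("groq", groq_stub), ("together", together_stub), ("bedrock", bedrock_stub)]
print(complete_with_fallback("hello", chain))
```

A production router also needs timeouts, retry budgets, and awareness that the fallback may be a different model with different output behavior, which is part of what a library buys you over a hand-rolled loop.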
Don't use Bedrock or Azure for prototyping. The setup friction (IAM roles, VPC config, quota requests) is significant. Use Together or Groq to move fast, then migrate when compliance requirements actually exist.
Watch the context window tax. Providers charge by token, so a 100K-token call costs 25× what a 4K-token call does on the same model. Claude Sonnet on Bedrock at $3/1M input comes to $0.30 per 100K-context call, versus about $0.012 at 4K. Design your context management accordingly.
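The arithmetic is worth running against your own traffic. A minimal sketch, using the $3/1M input price quoted above; the daily call volume is a hypothetical figure and output tokens are ignored.

```python
def input_cost_usd(context_tokens: int, usd_per_1m_input: float) -> float:
    """Input-side cost of a single call at the given context size."""
    return context_tokens * usd_per_1m_input / 1_000_000


CALLS_PER_DAY = 10_000  # hypothetical traffic level

for tokens in (4_000, 100_000):
    per_call = input_cost_usd(tokens, 3.00)
    per_day = per_call * CALLS_PER_DAY
    print(f"{tokens:>7,} tokens: ${per_call:.3f} per call, ${per_day:,.0f} per day")
```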
The deployment layer is a cost center until you treat it like an architectural decision.