May 23, 2026 · 9 min read

The LLM Deployment Provider Breakdown

The best model in the world doesn't matter if it's running somewhere that costs a fortune, violates your compliance requirements, or bottlenecks your latency. The deployment layer is where most teams make expensive mistakes — usually by defaulting to whatever they heard about first.

Here's a map of the actual tradeoffs.

The landscape in one sentence

There are four categories: inference APIs (Groq, Together, Fireworks), managed cloud integrations (Bedrock, Azure OpenAI, Vertex), serverless GPU platforms (Modal, RunPod), and self-hosting (vLLM on your own GPUs). Each one wins in different scenarios.

Inference APIs — speed and simplicity first

These are purpose-built inference layers, not cloud behemoths with LLMs bolted on.

Groq runs custom LPU (Language Processing Unit) hardware, not GPUs. The result: 716 tokens/sec on Llama 3.1 70B, with time to first token (TTFT) under 100ms. For real-time applications (chat, voice agents, anything user-facing) this is hard to beat. Pricing sits at ~$0.59 per 1M tokens, input and output alike, for 70B models.

Together.ai covers the widest open-source model catalog, including fine-tuned variants. Their batch API runs at a 75% discount compared with synchronous calls, which makes it the right tool for nightly pipelines, evaluations, or offline enrichment jobs. Standard pricing is ~$0.90 per 1M tokens.

Fireworks.ai hits a rare combo: HIPAA-eligible and SOC 2 Type II, at inference-API pricing. For healthcare or fintech startups that can't afford AWS Bedrock pricing but need compliance documentation, this is the play. They also run FireFunction (function-calling-optimized Mistral variant) at ~70ms TTFT.

[Figure: Throughput vs cost scatter plot for major LLM providers]

What inference APIs can't give you

  • VPC PrivateLink / private endpoints
  • Zero-data-retention (ZDR) guarantees from the provider
  • Enterprise SLAs with dedicated capacity
  • Multi-modal parity (most are text-only or lag behind the managed clouds on image and audio support)

If those matter, you're in managed cloud territory.

Managed cloud providers — compliance and ecosystem

AWS Bedrock gives you Claude, Titan, Llama, Mistral, and others inside your VPC via PrivateLink, so traffic stays on the AWS network instead of crossing the public internet. It's HIPAA-eligible, FedRAMP-in-progress, and it logs nothing by default. The catch: it can run roughly 10–15× more expensive than inference APIs, although that comparison is between different models. Claude Sonnet 4.6 runs $3/$15 per 1M input/output tokens, against under $1 per 1M for the open-weight 70B models above. For companies already deep in AWS where the security posture justifies the cost, it's the right call.

Azure OpenAI's standout feature in this comparison is Zero Data Retention (ZDR) mode. With ZDR enabled, prompts and completions never touch storage, not even temporarily. For banks and healthcare systems under strict data retention rules, that's a hard requirement. GPT-4o runs ~$10 per 1M output tokens.

GCP Vertex AI is the Google route — Gemini, Claude (via agreement), Llama. If you're already on GCP with BigQuery, Cloud Run, and GKE, the IAM integration is seamless. Gemini 1.5 Pro at ~$7/1M output is competitive. The tooling around evaluation pipelines and Vertex Pipelines is genuinely good.

TGI (Hugging Face Text Generation Inference) is effectively in maintenance mode as of 2025. New deployments should use vLLM. TGI is still reasonable for existing setups, but don't build new infrastructure on it.

Self-hosting — when the math finally works

The break-even point is roughly $80K/month in API spend. Below that, managed providers win on simplicity and reliability. Above it, the GPU economics shift.

At that scale, you provision A100 or H100 instances and run vLLM. The numbers: vLLM hits 793 tokens/sec throughput with continuous batching. That's up to ~24× the throughput of a naive Transformers inference loop at high concurrency. vLLM also implements PagedAttention (efficient KV cache allocation), prefix caching, and multi-LoRA serving.

Contrast that with Ollama's ~41 tokens/sec — fine for local dev, not for production multi-user workloads.

Self-hosting realities people don't talk about: you own the CUDA driver hell, you own the model weight updates, you own the uptime. Add the salary of at least one ML infrastructure engineer to that cost estimate.
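To make the break-even reasoning concrete, here is a minimal sketch of the arithmetic. The GPU hourly rate, instance count, utilization, and engineer cost are hypothetical placeholders, not figures from this article; only the 793 tokens/sec throughput comes from the text above. Plug in your own quotes before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # average hours in a month (assumption)


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float,
                            utilization: float = 1.0) -> float:
    """USD per 1M tokens for one always-on instance at the given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def monthly_self_host_cost(gpu_hourly_usd: float, instances: int,
                           engineer_monthly_usd: float) -> float:
    """GPU rental for always-on instances plus the engineer's loaded cost."""
    return gpu_hourly_usd * instances * HOURS_PER_MONTH + engineer_monthly_usd


if __name__ == "__main__":
    # Hypothetical inputs: $4/hour per instance, 8 instances,
    # 50% utilization, $15K/month for the infra engineer.
    per_million = cost_per_million_tokens(4.0, 793, utilization=0.5)
    total = monthly_self_host_cost(4.0, 8, 15_000)
    print(f"~${per_million:.2f} per 1M tokens, ${total:,.0f}/month fixed cost")
```

Utilization is the term that moves the result most: an always-on GPU that sits idle half the day doubles the effective per-token price, which is why self-hosting tends to pay off only at sustained volume.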

Serverless GPU — the middle path

Modal and RunPod let you deploy custom model containers and pay per compute-second. Cold start is the catch — first request on a scaled-to-zero instance takes 15–30 seconds. Warm, it runs well (~80 TPS on a 4×A10G container). Good for:

  • Low-traffic fine-tuned models that need on-demand availability
  • Batch jobs you want to parallelize across GPU workers
  • Prototypes before you commit to an instance type

[Figure: Provider selection decision tree]

The decision framework

The right provider follows from three questions:

1. Do you have compliance requirements? HIPAA or FedRAMP means AWS Bedrock (VPC PrivateLink), Azure OpenAI (ZDR), or Fireworks.ai. Not Groq, not Together.

2. What's your API spend? Below $80K/month: managed provider or inference API. Above: self-hosting starts to pencil out with an infra team.

3. Is TTFT the primary UX constraint? Real-time voice or chat: Groq/Cerebras. Async background work: Together batch or any managed provider.
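The three questions can be sketched as a small function. The order in which the checks are applied (compliance first, then spend, then latency) is an assumption made for this illustration, and the `Workload` type and `recommend` function are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

BREAK_EVEN_MONTHLY_USD = 80_000  # break-even figure quoted in this article


@dataclass
class Workload:
    needs_compliance: bool        # HIPAA or FedRAMP applies
    monthly_api_spend_usd: float
    latency_critical: bool        # TTFT is the primary UX constraint


def recommend(w: Workload) -> list[str]:
    """Return candidate providers, checking the three questions in order."""
    if w.needs_compliance:
        return ["AWS Bedrock", "Azure OpenAI", "Fireworks.ai"]
    if w.monthly_api_spend_usd > BREAK_EVEN_MONTHLY_USD:
        return ["Self-hosted vLLM"]
    if w.latency_critical:
        return ["Groq", "Cerebras"]
    return ["Together (batch)", "Any managed provider"]


if __name__ == "__main__":
    voice_agent = Workload(needs_compliance=False,
                           monthly_api_spend_usd=5_000,
                           latency_critical=True)
    print(recommend(voice_agent))
```

Real decisions are rarely this clean: a compliance-bound team with heavy spend may still self-host inside its own VPC, so treat the output as a shortlist to investigate.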

Pricing quick reference — May 2026

| Provider | Model | Input /1M | Output /1M | Notes |
|---|---|---|---|---|
| Groq | Llama 3.1 70B | $0.59 | $0.59 | LPU, 716 TPS |
| Together | Llama 3.1 70B | $0.90 | $0.90 | batch -75% |
| Fireworks | Llama 3.1 70B | $0.90 | $0.90 | HIPAA+SOC2 |
| AWS Bedrock | Claude Sonnet 4.6 | $3.00 | $15.00 | VPC private |
| Azure OAI | GPT-4o | $2.50 | $10.00 | ZDR available |
| GCP Vertex | Gemini 1.5 Pro | $1.25 | $7.00 | GCP-native |
| Self-hosted | Llama 3.1 70B | ~$0.80 | ~$0.80 | A100 at scale |

Operational things worth knowing

vLLM, not TGI. If you're self-hosting in 2026, TGI is in maintenance mode. vLLM has the active development community, better continuous batching, and more frequent model support updates.

Multi-provider routing matters. At production scale, no single provider has 100% uptime. LiteLLM lets you configure primary/fallback routing across providers with a single API surface. A typical chain: Groq as primary, Together as fallback, Bedrock as the emergency option.
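The fallback chain itself is a simple pattern. The sketch below is provider-agnostic plain Python and does not use LiteLLM's API; the `groq_stub` and `together_stub` functions are hypothetical stand-ins for real client calls, included only so the example runs on its own.

```python
from typing import Callable, Sequence

# A provider is a (name, callable) pair; the callable takes a prompt
# and returns a completion string, or raises on failure.
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(prompt: str,
                           providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def groq_stub(prompt: str) -> str:
    # Hypothetical stand-in that simulates an outage.
    raise TimeoutError("simulated timeout")


def together_stub(prompt: str) -> str:
    # Hypothetical stand-in that simulates a healthy provider.
    return f"echo: {prompt}"


if __name__ == "__main__":
    chain = [("groq", groq_stub), ("together", together_stub)]
    print(complete_with_fallback("hi", chain))
```

In production you would likely add per-provider timeouts and only fall back on retryable errors (timeouts, rate limits, 5xx), since falling back on a malformed request just repeats the failure three times.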

Don't use Bedrock or Azure for prototyping. The setup friction (IAM roles, VPC config, quota requests) is significant. Use Together or Groq to move fast, then migrate when compliance requirements actually exist.

Watch the context window tax. Providers charge by token, so a 100K-token context costs 25× as much in input as a 4K-token context on the same model. Claude Sonnet on Bedrock at $3/1M input comes to $0.30 of input per 100K-context call, versus about $0.012 for a 4K call. Design your context management accordingly.
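The per-call arithmetic is worth keeping in a helper so it can be applied to real traffic. This is a minimal sketch; `input_cost_usd` is a hypothetical name, it covers input tokens only, and the $3 per 1M price is the Bedrock figure quoted in this article.

```python
def input_cost_usd(context_tokens: int, price_per_million_usd: float) -> float:
    """Input-side cost of a single call, ignoring output tokens."""
    return context_tokens * price_per_million_usd / 1_000_000


if __name__ == "__main__":
    price = 3.0  # USD per 1M input tokens
    for tokens in (4_000, 100_000):
        print(f"{tokens:>7} tokens -> ${input_cost_usd(tokens, price):.3f} per call")
```

Multiplied by request volume, the gap compounds quickly: at 10,000 calls a day, the 100K-context version costs $3,000 in input alone, against $120 for the 4K version.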

The deployment layer is a cost center until you treat it like an architectural decision.

Tanmay Bohra
Full Stack Engineer at Grant Thornton Bharat. Building high-concurrency systems in Go and TypeScript.