The Cheapest LLM Call Is the One You Never Make
The deployment dashboard showed green. Costs were not green.
Three months after launch, Sourcegraph's AI coding assistant had accumulated a $1M cloud bill. Fixie's agent — a customer-facing assistant — looped 847 times on a single session before a human noticed. A Series A startup I know ran out of their entire $10K OpenAI credit in four days after a single user discovered they could prompt the agent into recursion.
These aren't freak incidents. They're the default outcome of shipping LLM agents with a Datadog dashboard designed for microservices and a mental model inherited from REST APIs.
Standard APM will tell you your agent has 99.9% uptime. It won't tell you your P99 session is consuming 340,000 tokens. It won't tell you 31% of your tool calls are silently failing and the agent is compensating by hallucinating. It won't tell you a model routing decision you made in staging is costing you 60× more per query in production.
This post is about building observability systems that catch this — and the cost optimization techniques that make your cheapest LLM call the one you never had to make.
Why Standard APM Fails for Agents
Traditional observability was built for a world where requests are short, stateless, and independently meaningful. An HTTP request either succeeds or fails. Latency is deterministic. Cost is fixed.
LLM agent sessions break every one of these assumptions.
Sessions are stateful and variable-length. An agent doing a research task might make 3 LLM calls or 40. The session isn't "done" after the first response — it's done when the agent decides it's done, or when it hits a budget, or when it loops forever.
Costs are non-linear and input-sensitive. A user asking "what's 2+2" costs almost nothing. A user who uploads a 200-page PDF and asks for an audit costs 200× more. Both look like successful requests to your uptime monitor.
Tool calls add hidden failure surfaces. Your agent might call 8 tools per session. Each is a new failure surface. A tool call that always returns {} instead of failing visibly will silently degrade response quality for months.
Models are not fungible. Routing the same request to Claude Haiku 3 vs Claude Opus 4 is a 60× cost difference. Your metrics need to capture which model served which request type, not just that a request was served.
The result: teams that only watch latency, uptime, and error rate in production are flying blind. The real signals are one layer deeper.
The Signal Stack: What to Actually Instrument
Token Metrics — Split the Cache
Every LLM call should emit at minimum:
    gen_ai.usage.input_tokens
    gen_ai.usage.output_tokens
    gen_ai.usage.cache_read_input_tokens      # Anthropic/OpenAI cache hit
    gen_ai.usage.cache_creation_input_tokens

The cache_read split is critical. On Anthropic's Claude, cache hits cost 10% of the normal input token price. If you're not tracking this separately, your cost attribution is wrong, and you don't know whether your prompt engineering is actually saving money or not.
A good cache hit rate for a RAG system should be 40–70%. If you're below 15%, your system prompt ordering is probably wrong (cache-friendly content goes at the top, before dynamic content).
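To make the split concrete, here is a minimal sketch of the arithmetic. It assumes input_tokens includes the cached tokens and that cache reads are billed at 10% of the input price, as stated above. The function names are illustrative and not part of any SDK.

```python
def cache_hit_rate(input_tokens: int, cache_read_tokens: int) -> float:
    """Share of input tokens served from the provider's prompt cache."""
    if input_tokens <= 0:
        return 0.0
    return cache_read_tokens / input_tokens


def input_cost_usd(input_tokens: int, cache_read_tokens: int,
                   price_per_1k: float, cache_discount: float = 0.10) -> float:
    """Input-side cost when cache reads are billed at a fraction of the normal price."""
    uncached = max(input_tokens - cache_read_tokens, 0)
    return (uncached + cache_read_tokens * cache_discount) / 1000 * price_per_1k


# Token counts from the example span later in this post: 1240 input, 980 from cache.
# 0.0008 USD per 1K is the Haiku input price from the pricing table in this post.
rate = cache_hit_rate(1240, 980)
with_cache = input_cost_usd(1240, 980, price_per_1k=0.0008)
without_cache = input_cost_usd(1240, 0, price_per_1k=0.0008)
print(f"hit rate {rate:.0%}, input cost {with_cache:.6f} vs {without_cache:.6f} USD")
```

If the cache counts are not recorded separately, the two cost figures cannot be told apart after the fact.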
Latency — Split TTFT from Total
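    gen_ai.client.operation.duration       # total span duration
    gen_ai.client.time_per_output_token    # streaming throughput

Time to First Token (TTFT) and total generation time tell different stories. High TTFT indicates prompt processing overhead or routing delay. High time per output token indicates slow decoding, while a long total duration at normal per-token speed simply means a long output. Neither metric above is TTFT itself, so record it as its own measurement: the time until the first streamed chunk arrives. Each needs separate P50/P95/P99 tracking.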
A 3-second TTFT is catastrophic for interactive chat. The same latency is irrelevant for a background document processing agent. Context matters — track against session type, not globally.
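Measuring TTFT on the client is mostly a matter of timing the first streamed chunk. The sketch below wraps any iterable of chunks; `StreamTimer` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a provider's streaming response.

```python
import time
from typing import Iterable, Iterator, Optional


class StreamTimer:
    """Wraps a stream of output chunks and records TTFT and total duration."""

    def __init__(self, chunks: Iterable[str]):
        self._chunks = chunks
        self.ttft_s: Optional[float] = None
        self.total_s: Optional[float] = None
        self.chunk_count = 0

    def __iter__(self) -> Iterator[str]:
        start = time.perf_counter()
        for chunk in self._chunks:
            if self.ttft_s is None:
                # First chunk has arrived: this is the time to first token
                self.ttft_s = time.perf_counter() - start
            self.chunk_count += 1
            yield chunk
        self.total_s = time.perf_counter() - start


def fake_stream() -> Iterator[str]:
    # Stand-in for a streaming LLM response (hypothetical)
    time.sleep(0.05)  # simulated prompt processing before the first token
    for word in ["hello", " ", "world"]:
        time.sleep(0.01)
        yield word


timer = StreamTimer(fake_stream())
text = "".join(timer)
print(f"TTFT {timer.ttft_s:.3f}s, total {timer.total_s:.3f}s, chunks {timer.chunk_count}")
```

Tag both numbers with the session type before exporting them, so interactive chat and background jobs get separate percentiles.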
Tool Call Metrics
    gen_ai.tool.name
    gen_ai.tool.call.count
    gen_ai.tool.error.rate
    gen_ai.tool.duration_ms

Tool error rate is one of the most under-tracked metrics in production agent systems. A tool failing 30% of the time silently degrades agent quality: the model starts hedging, hallucinating fallbacks, or looping while it tries to recover.
Alert on tool error rates exceeding 5% with a 15-minute window. Anything higher is a systemic issue, not noise.
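If you want the same check in-process before the metrics pipeline exists, a sliding window is enough. This is a sketch only: `ToolErrorWindow` is a hypothetical helper, and the `min_calls` floor is an added assumption so that a handful of calls cannot trip the alert.

```python
import time
from collections import deque
from typing import Optional


class ToolErrorWindow:
    """Sliding-window error rate for one tool."""

    def __init__(self, window_s: float = 900.0, threshold: float = 0.05, min_calls: int = 20):
        self.window_s = window_s      # 15 minutes
        self.threshold = threshold    # 5% error rate
        self.min_calls = min_calls    # assumption: ignore windows with very few calls
        self._events = deque()        # (timestamp, failed) pairs, oldest first

    def _evict(self, now: float) -> None:
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def record(self, failed: bool, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._events.append((now, failed))
        self._evict(now)

    def error_rate(self, now: Optional[float] = None) -> float:
        self._evict(time.time() if now is None else now)
        if not self._events:
            return 0.0
        return sum(failed for _, failed in self._events) / len(self._events)

    def should_alert(self, now: Optional[float] = None) -> bool:
        rate = self.error_rate(now)
        return len(self._events) >= self.min_calls and rate > self.threshold


window = ToolErrorWindow()
for i in range(40):
    # 4 failures in 40 calls, one call per second
    window.record(failed=(i % 10 == 0), now=1000.0 + i)
print(window.error_rate(now=1040.0), window.should_alert(now=1040.0))
```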
Agent Loop Metrics
    agent.step.count             # per session
    agent.session.depth          # recursive agent calls
    agent.loop.detected          # boolean flag
    agent.token_budget.exceeded  # boolean flag

These don't exist in any standard library; you instrument them yourself in your agent loop. They're also the most important metrics you'll ever add. A session with agent.step.count > 50 and agent.loop.detected = true is a runaway session, and you need to know about it in real time, not in next week's billing cycle.
The OTel Span Structure
OpenTelemetry's gen_ai semantic conventions give you a standard schema for LLM instrumentation. Here's what a well-structured agent span looks like:
    Span: agent.session
    ├── Span: gen_ai.chat (LLM call 1)
    │   ├── gen_ai.system = "anthropic"
    │   ├── gen_ai.request.model = "claude-haiku-3-5"
    │   ├── gen_ai.usage.input_tokens = 1240
    │   ├── gen_ai.usage.output_tokens = 89
    │   ├── gen_ai.usage.cache_read_input_tokens = 980
    │   └── gen_ai.response.finish_reason = "tool_calls"
    ├── Span: tool.call (search_web)
    │   ├── tool.name = "search_web"
    │   ├── tool.input_length = 42
    │   └── tool.status = "success"
    ├── Span: gen_ai.chat (LLM call 2)
    │   └── ...
    └── agent.step.count = 3

The span hierarchy matters: the agent session span is the parent, and each LLM call and tool call is a child span under it, in execution order. This gives you cost-per-session rolled up from token costs at each LLM call, plus the ability to trace exactly which tool calls triggered which follow-up LLM calls.
Toolchain options:
- OpenLLMetry (open-source) — drop-in auto-instrumentation for 30+ LLM providers. Zero-config. Emits OTel spans. Best starting point.
- Langfuse — full-featured LLM observability platform, self-hostable, excellent for prompt versioning and human feedback collection.
- Helicone — proxy-based, so zero SDK changes. Best for teams who want instant visibility without modifying agent code.
- Arize Phoenix — local-first, great for debugging during development. LLM traces visualized as trees.
For production systems: OpenLLMetry for spans → Langfuse for the observability UI → alert on Prometheus metrics via the OTel Collector.
Cost Attribution That Actually Works
The reason most teams don't know their real cost per user is that attribution is an afterthought. They track total monthly OpenAI spend, not cost per feature, user cohort, or agent type.
Here's a span processor pattern that attaches business context to every LLM call:
    import threading

    from opentelemetry.sdk.trace import SpanProcessor

    class CostAttributionProcessor(SpanProcessor):
        # USD per 1K tokens
        COST_PER_1K = {
            "claude-haiku-3-5": {"input": 0.0008, "output": 0.004, "cache_read": 0.00008},
            "claude-sonnet-4-6": {"input": 0.003, "output": 0.015, "cache_read": 0.0003},
            "claude-opus-4-7": {"input": 0.015, "output": 0.075, "cache_read": 0.0015},
            "gpt-4o": {"input": 0.0025, "output": 0.010, "cache_read": 0.00125},
            "gpt-4o-mini": {"input": 0.00015, "output": 0.0006, "cache_read": 0.000075},
        }

        def __init__(self):
            self._lock = threading.Lock()
            self._open_spans = {}  # span_id -> span that has started but not ended

        def cost_usd(self, attributes) -> float | None:
            pricing = self.COST_PER_1K.get(attributes.get("gen_ai.request.model", ""))
            if not pricing:
                return None
            input_tokens = attributes.get("gen_ai.usage.input_tokens", 0)
            output_tokens = attributes.get("gen_ai.usage.output_tokens", 0)
            cache_read = attributes.get("gen_ai.usage.cache_read_input_tokens", 0)
            # Assumes input_tokens includes cached tokens; clamp in case a provider reports them separately
            billable_input = max(input_tokens - cache_read, 0)
            return (
                billable_input / 1000 * pricing["input"] +
                cache_read / 1000 * pricing["cache_read"] +
                output_tokens / 1000 * pricing["output"]
            )

        def on_start(self, span, parent_context=None):
            with self._lock:
                self._open_spans[span.context.span_id] = span

        def on_end(self, span):
            cost = self.cost_usd(span.attributes or {})
            with self._lock:
                self._open_spans.pop(span.context.span_id, None)
                # The ended span is read-only, so accumulate on the parent while it is still open
                parent = self._open_spans.get(span.parent.span_id) if span.parent else None
                if cost is not None and parent is not None and parent.is_recording():
                    existing = parent.attributes.get("llm.cost.session_usd", 0.0)
                    parent.set_attribute("llm.cost.session_usd", round(existing + cost, 6))

Add this processor to your tracer provider and llm.cost.session_usd accumulates on the session span as each LLM call finishes. A span is read-only by the time on_end sees it, so set llm.cost.usd on the LLM span itself where you make the call, before the span ends, using the same cost_usd helper. Now you can answer: "what's my P95 session cost?" and "what does this feature cost per user?"
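Once session totals exist, the P95 question is a few lines of arithmetic. A minimal sketch, assuming the per-session totals have been exported as a plain list; the numbers below are made up for illustration and `percentile` is a simple nearest-rank helper, not a library call.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical per-session totals in USD (one runaway session at the end)
session_costs = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.06, 0.08, 2.40]
p50 = percentile(session_costs, 50)
p95 = percentile(session_costs, 95)
mean = sum(session_costs) / len(session_costs)
print(f"mean {mean:.3f}  P50 {p50:.2f}  P95 {p95:.2f}")
```

The mean and the median both look harmless here; only the tail percentile exposes the runaway session.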
The No-Call Hierarchy
The cheapest LLM call is the one you never make. Here's the hierarchy from cheapest to most expensive, implemented in order:
1. Guard Rails (Pre-LLM)
Input validation before the request ever touches a model:
    def validate_input(query: str, user_id: str) -> tuple[bool, str | None]:
        # None of these checks call a model
        if len(query) > 8000:
            return False, "Query too long"
        if contains_jailbreak_pattern(query):  # your own regex/denylist check
            return False, "Invalid request"
        if is_rate_limited(user_id):  # your own rate limiter lookup
            return False, "Rate limit exceeded"
        return True, None

A regex check that fires in 0.1ms costs nothing. A guard rail that catches 3% of traffic saves 3% of your LLM budget with zero quality tradeoff.
2. Exact Match Cache
Before touching any embedding model:
    import hashlib
    import json

    def exact_cache_key(query: str, context_hash: str) -> str:
        normalized = query.strip().lower()
        return hashlib.sha256(f"{normalized}:{context_hash}".encode()).hexdigest()

    def exact_lookup(redis_client, query: str, context_hash: str):
        # Entries are written to Redis with a 4-hour TTL for exact matches
        cached = redis_client.get(exact_cache_key(query, context_hash))
        if cached:
            return json.loads(cached)
        return None

An exact match is a single O(1) Redis lookup, typically well under a millisecond. For FAQ-style agents where users ask the same 50 questions over and over, hit rates of 15–25% are common, and every one of those hits is free. Reworded questions will miss here and fall through to the semantic cache.
3. Semantic Cache
When exact match misses, check semantic similarity before generating:
    from numpy import dot
    from numpy.linalg import norm

    SIMILARITY_THRESHOLD = 0.91

    def cosine_similarity(a, b) -> float:
        return float(dot(a, b) / (norm(a) * norm(b)))

    async def semantic_lookup(query_embedding, cache_store) -> str | None:
        # Linear scan over recent entries; return the closest match at or above the threshold
        candidates = cache_store.get_recent(limit=500)
        best_sim, best_response = SIMILARITY_THRESHOLD, None
        for entry in candidates:
            sim = cosine_similarity(query_embedding, entry.embedding)
            if sim >= best_sim:
                best_sim, best_response = sim, entry.response
        return best_response

The threshold matters. 0.88 catches too many false positives: "how do I deploy to staging" and "how do I deploy to production" will match and return wrong answers. 0.95 misses too many valid cache hits. 0.91–0.92 is a good starting point for technical Q&A, but the right value depends on your embedding model, so validate it against labeled query pairs from your own traffic.
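Generate one embedding (cheap) to avoid one full generation (expensive). Anthropic does not sell an embedding model of its own, so price this against whichever embedding provider you use; for a short query, a cache hit is on the order of 50× cheaper than a full generation.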
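The trade is easy to sanity-check: every query pays for an embedding, and only misses pay for a generation. The sketch below uses unit costs rather than real prices; the 50:1 ratio is taken from the estimate above and is an assumption, not a quoted rate.

```python
def expected_cost_per_query(embed_cost: float, generation_cost: float, hit_rate: float) -> float:
    """Every query pays for an embedding; only cache misses pay for a generation."""
    return embed_cost + (1.0 - hit_rate) * generation_cost


def break_even_hit_rate(embed_cost: float, generation_cost: float) -> float:
    """Hit rate above which the semantic cache is cheaper than always generating."""
    return embed_cost / generation_cost


# Illustrative unit costs only: a generation is assumed to cost 50x an embedding
embed, generate = 1.0, 50.0
break_even = break_even_hit_rate(embed, generate)
at_20_percent = expected_cost_per_query(embed, generate, hit_rate=0.20)
print(f"break-even hit rate {break_even:.1%}, cost at 20% hits {at_20_percent:.1f} vs {generate:.1f}")
```

At that ratio the cache pays for itself once about one query in fifty is a hit, which is why the similarity threshold, not the embedding bill, is the thing to tune.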
4. Prefix Caching
For systems with long, stable system prompts (RAG context, tool definitions, persona), prefix caching on the model side reduces input token costs by 10× on cache hits.
The key constraint: cache-friendly content must be at the beginning of your prompt, and it must be identical across requests. Dynamic content (user ID, current date, session-specific context) goes at the end.
    [SYSTEM PROMPT - 8,000 tokens - CACHED]
    [RAG CONTEXT - 3,000 tokens - CACHED]
    [TOOL DEFINITIONS - 2,000 tokens - CACHED]
    -----
    [SESSION HISTORY - variable - NOT cached]
    [USER QUERY - variable - NOT cached]

If you can get the stable prefix to 10,000+ tokens, prefix caching typically reduces your input token cost by 60–80% for repeat users.
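A quick worked check of that range, using the 13,000-token stable prefix from the layout above and assuming cache reads are billed at 10% of the input price. The dynamic token counts are assumptions, and the one-time surcharge some providers apply when writing the cache is ignored.

```python
def input_cost_reduction(stable_tokens: int, dynamic_tokens: int,
                         cache_read_multiplier: float = 0.10) -> float:
    """Fraction of input-token cost saved on a request that hits the prefix cache."""
    full = stable_tokens + dynamic_tokens
    cached = stable_tokens * cache_read_multiplier + dynamic_tokens
    return 1.0 - cached / full


stable = 8_000 + 3_000 + 2_000  # system prompt + RAG context + tool definitions
for dynamic in (1_000, 2_000, 5_000):
    print(dynamic, f"{input_cost_reduction(stable, dynamic):.0%}")
```

The saving shrinks as session history grows, because the uncached tail becomes a larger share of every request.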
Model Routing: The 60× Cost Gap
The 60× figure is the gap between Claude Haiku 3 and Claude Opus. Going by the per-token prices in the table above, Claude Haiku 3.5 is still roughly 19× cheaper than Claude Opus 4.7. For a system serving thousands of queries per day, routing even 40% of traffic to Haiku instead of Opus pays for a new engineer.
The failure mode teams fall into: they benchmark on their hardest queries (where Opus is actually needed) and deploy Opus everywhere.
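To see what a routing share is worth, weight the per-token prices by traffic. A sketch using the input prices from the pricing table earlier in this post; the 40/60 split is the scenario mentioned above, and `blended_input_price` is an illustrative helper.

```python
def blended_input_price(shares: dict, price_per_1k: dict) -> float:
    """Traffic-weighted input price per 1K tokens; shares must sum to 1."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price_per_1k[model] for model, share in shares.items())


# Input prices (USD per 1K tokens) from the pricing table in this post
PRICES = {"claude-haiku-3-5": 0.0008, "claude-opus-4-7": 0.015}

all_opus = blended_input_price({"claude-opus-4-7": 1.0}, PRICES)
routed = blended_input_price({"claude-haiku-3-5": 0.4, "claude-opus-4-7": 0.6}, PRICES)
saving = 1 - routed / all_opus
print(f"input cost saving from routing 40% to Haiku: {saving:.1%}")
```

This only covers the input side and assumes both models see the same token counts; output tokens follow the same weighting with the output prices.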
Heuristic routing:
    def route_model(query: str, session_context: dict) -> str:
        query_lower = query.lower()
        # Haiku: simple lookups, factual questions, short responses
        if session_context.get("step_count", 0) > 10:
            return "claude-haiku-3-5"  # Deep in a session, use cheap model for interim steps
        if len(query.split()) < 15 and not any(
            w in query_lower for w in ["analyze", "compare", "explain", "design", "evaluate"]
        ):
            return "claude-haiku-3-5"
        # Sonnet: moderate complexity, most production traffic
        if session_context.get("tool_call_count", 0) < 5:
            return "claude-sonnet-4-6"
        # Opus: complex reasoning, multi-step analysis, critical path
        return "claude-opus-4-7"

Cascading (smarter, slower):
Route to Haiku first. If the response confidence is low (hallucination markers, hedge language, or a low average logprob where the API exposes logprobs), escalate to Sonnet. Escalation adds 200–400ms of latency, but cascading cuts cost by 3–5× for the queries Haiku handles confidently.
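A cascade can be sketched without committing to any provider SDK by passing the two models in as plain callables. Everything here is illustrative: the hedge-phrase list is a crude stand-in for a real confidence signal, and `stub_cheap` and `stub_strong` are fake models.

```python
from typing import Callable, Tuple

HEDGE_MARKERS = ("i'm not sure", "i am not sure", "i cannot", "it depends", "as an ai")


def looks_unconfident(response: str, min_length: int = 20) -> bool:
    """Crude confidence check: very short answers or hedge phrases trigger escalation."""
    text = response.lower()
    return len(text.strip()) < min_length or any(m in text for m in HEDGE_MARKERS)


def cascade(query: str,
            cheap: Callable[[str], str],
            strong: Callable[[str], str]) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its answer looks unconfident."""
    draft = cheap(query)
    if looks_unconfident(draft):
        return strong(query), "escalated"
    return draft, "cheap"


def stub_cheap(query: str) -> str:
    # Fake cheap model: hedges on design questions, answers lookups directly
    if "design" in query:
        return "I'm not sure, it depends on your setup."
    return "Use `git revert <sha>` to undo a pushed commit."


def stub_strong(query: str) -> str:
    return "Detailed answer from the larger model."


print(cascade("how do I undo a pushed commit", stub_cheap, stub_strong))
print(cascade("design a sharding scheme", stub_cheap, stub_strong))
```

Record the returned route label as a span attribute so the escalation rate shows up next to cost.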
Measure before you route. Run 500 queries through both models. Build a confusion matrix. Only deploy routing where Haiku achieves >95% of Opus quality on a labeled test set. Routing without measurement is just hoping.
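One simple way to turn that measurement into a routing decision is a per-query-type quality ratio. This is a simplification of a full confusion matrix, and the row format and scores below are hypothetical.

```python
from collections import defaultdict


def safe_to_route(rows, threshold: float = 0.95) -> dict:
    """Per query type: does the cheap model reach `threshold` of the strong model's total score?"""
    cheap = defaultdict(float)
    strong = defaultdict(float)
    for row in rows:
        cheap[row["type"]] += row["cheap_score"]
        strong[row["type"]] += row["strong_score"]
    return {t: strong[t] > 0 and cheap[t] / strong[t] >= threshold for t in strong}


# Hypothetical graded results (scores in [0, 1] from your own labeled test set)
rows = [
    {"type": "lookup", "cheap_score": 0.9, "strong_score": 0.9},
    {"type": "lookup", "cheap_score": 1.0, "strong_score": 1.0},
    {"type": "analysis", "cheap_score": 0.5, "strong_score": 0.9},
    {"type": "analysis", "cheap_score": 0.6, "strong_score": 1.0},
]
print(safe_to_route(rows))
```

Only the query types that come back True are candidates for the cheaper model; the rest stay where they are until the gap closes.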
Agent Loop Hardening
Runaway agents are not a hypothetical. Here's a minimal loop hardening implementation:
    from opentelemetry import trace

    class StepBudgetExceeded(RuntimeError):
        """Raised when a session uses up its step budget."""

    class AgentSession:
        MAX_STEPS = 25
        LOOP_WINDOW = 5

        def __init__(self):
            self.steps = 0
            self.recent_outputs = []

        def check_budget(self) -> None:
            if self.steps >= self.MAX_STEPS:
                raise StepBudgetExceeded(f"Agent exceeded {self.MAX_STEPS} steps")

        def detect_loop(self, output: str) -> bool:
            # Semantic loop detection: if 3 of the last 5 outputs are similar to this one, we're looping
            if len(self.recent_outputs) < self.LOOP_WINDOW:
                return False
            # embed() is your embedding call; cosine_similarity is the helper from the semantic cache
            current = embed(output)
            similarities = [
                cosine_similarity(current, embed(prev))
                for prev in self.recent_outputs[-self.LOOP_WINDOW:]
            ]
            looping = sum(1 for s in similarities if s > 0.92) >= 3
            if looping:
                span = trace.get_current_span()
                span.set_attribute("agent.loop.detected", True)
            return looping

        def record_step(self, output: str) -> None:
            self.steps += 1
            self.recent_outputs.append(output)
            if len(self.recent_outputs) > self.LOOP_WINDOW * 2:
                self.recent_outputs.pop(0)

Parallel tool calls are the other major loop hardening technique. Instead of:
    Step 1: search("topic A")
    Step 2: search("topic B")
    Step 3: search("topic C")

A well-configured agent should issue:

    Step 1: [search("topic A"), search("topic B"), search("topic C")]  # parallel

This cuts 3 LLM calls down to 1 for the tool-dispatch step. For research agents that fan out across many sources, parallel tool calls reduce both step count and total latency by 3–5×. Both major providers (Anthropic and OpenAI) support parallel tool calls natively, and both enable them by default unless the request turns them off, so check that your framework isn't disabling them.
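On the execution side, the tool calls from one model turn can be dispatched concurrently. A minimal sketch with asyncio; `search` is a fake tool that sleeps to simulate network latency, and `run_tools_in_parallel` is an illustrative name.

```python
import asyncio
import time


async def search(topic: str) -> str:
    # Stand-in for a real tool call (hypothetical); sleeps to simulate network latency
    await asyncio.sleep(0.1)
    return f"results for {topic}"


async def run_tools_in_parallel(topics: list) -> list:
    """Dispatch all tool calls from one model turn concurrently; results keep input order."""
    return await asyncio.gather(*(search(t) for t in topics))


start = time.perf_counter()
results = asyncio.run(run_tools_in_parallel(["topic A", "topic B", "topic C"]))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Three 100ms calls finish in roughly the time of one, and all three results go back to the model in a single follow-up turn.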
PromQL Alerting Patterns
Once your OTel spans flow into Prometheus, these are the alerts that matter:
    # Cost rate of change: catch sudden spikes before they become incidents
    increase(llm_cost_usd_total[15m]) > 50

    # Runaway session detector: any session over $2 is suspicious
    max(llm_cost_session_usd) by (session_id) > 2.0

    # Tool error rate: agent quality degradation (15-minute window)
    sum(rate(agent_tool_errors_total[15m])) / sum(rate(agent_tool_calls_total[15m])) > 0.05

    # Context pressure: nearing the context window limit increases hallucination risk
    histogram_quantile(0.95, sum by (le) (rate(gen_ai_usage_input_tokens_bucket[15m]))) > 80000

    # Model routing drift: production deviating from intended routing
    (
      sum(rate(llm_calls_total{model="claude-opus-4-7"}[1h])) /
      sum(rate(llm_calls_total[1h]))
    ) > 0.30  # Alert if Opus > 30% of traffic

Set increase(llm_cost_usd_total[15m]) > 50 as a PagerDuty alert. Everything else can be Slack. The cost spike alert is the one that protects your bank account.
What Teams Actually Measure (And What They Should)
According to a 2024 survey of 200+ teams shipping LLM products in production:
- 94% track latency
- 91% track error rate
- 67% track token usage
- 34% track cost per request
- 12% track cost per session
- 4% track Cost Per Successful Outcome (CPSO)
CPSO is the metric that actually matters for business sustainability. It answers: "For every successful outcome delivered to a user, what did I spend?"
A system with 60% task completion and $0.03 CPSO is more sustainable than one with 80% completion and $0.45 CPSO. The second system will bankrupt you before you can fix the quality gap.
Define "successful outcome" for your domain:
- Coding assistant: user accepted the suggestion
- RAG Q&A: user didn't ask a follow-up for clarification
- Document processor: user didn't reject the output
- Booking agent: booking completed successfully
Then instrument it:
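    # At session end
    outcome = "success" if user_accepted_result else "failure"
    span.set_attribute("agent.outcome", outcome)
    span.set_attribute("agent.cost_usd", session_cost)

Now you can query it: sum(llm_cost_session_usd) over all sessions, divided by the count of sessions where agent.outcome = "success". That's your CPSO. Averaging only the successful sessions would leave out what the failed ones cost you.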
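One way to compute it offline, assuming CPSO means total spend divided by successful outcomes, so that the cost of failed sessions is charged to the successes. The session records below are hypothetical.

```python
def cpso(sessions: list) -> float:
    """Total spend across all sessions divided by the number of successful ones."""
    successes = sum(1 for s in sessions if s["outcome"] == "success")
    if successes == 0:
        return float("inf")
    return sum(s["cost_usd"] for s in sessions) / successes


# Hypothetical sessions: 3 successes and 2 failures at $0.02 each
sessions = [
    {"outcome": "success", "cost_usd": 0.02},
    {"outcome": "success", "cost_usd": 0.02},
    {"outcome": "success", "cost_usd": 0.02},
    {"outcome": "failure", "cost_usd": 0.02},
    {"outcome": "failure", "cost_usd": 0.02},
]
successful = [s["cost_usd"] for s in sessions if s["outcome"] == "success"]
avg_success_cost = sum(successful) / len(successful)
print(f"CPSO {cpso(sessions):.4f} vs average successful session {avg_success_cost:.4f}")
```

With a 60% completion rate, the CPSO comes out about two thirds higher than the average cost of a successful session.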
The Pre-Ship Observability Checklist
Before shipping any agent to production:
Instrumentation:
- [ ] Every LLM call emits gen_ai.* spans with token counts split by cache/non-cache
- [ ] Tool calls have individual spans with error status
- [ ] Agent sessions have step count and loop detection
- [ ] Cost attribution attaches llm.cost.usd to every LLM span
Caching:
- [ ] Exact match cache implemented (Redis, SHA256 key)
- [ ] Semantic cache with threshold 0.91–0.92 on test queries
- [ ] System prompt ordering optimized for prefix cache hits
- [ ] Cache hit rate measured (target: >30% for repeat-use features)
Budget controls:
- [ ] Per-session step budget enforced (fail safe, not silent)
- [ ] Per-user daily token cap
- [ ] Loop detection with semantic similarity
- [ ] Model routing with measured quality/cost tradeoff
Alerting:
- [ ] Cost rate spike alert (15-minute window)
- [ ] Runaway session alert (per-session cost threshold)
- [ ] Tool error rate alert (>5% triggers investigation)
- [ ] Context pressure alert (P95 input tokens > 80% of limit)
Business metrics:
- [ ] CPSO calculated and baselined pre-launch
- [ ] A/B test framework for model routing changes
- [ ] Cost attribution by feature/user cohort
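The per-user daily token cap from the budget controls above is the one item on this list with no code earlier in the post. A minimal in-memory sketch; `DailyTokenBudget` is a hypothetical helper, and a production version would keep the counters in a shared store such as Redis rather than in process memory.

```python
import datetime as dt
from collections import defaultdict
from typing import Optional


class DailyTokenBudget:
    """Per-user daily token cap that refuses a request instead of silently overspending."""

    def __init__(self, daily_cap: int):
        self.daily_cap = daily_cap
        self._used = defaultdict(int)  # (user_id, date) -> tokens spent that day

    def try_spend(self, user_id: str, tokens: int, today: Optional[dt.date] = None) -> bool:
        key = (user_id, today or dt.date.today())
        if self._used[key] + tokens > self.daily_cap:
            return False
        self._used[key] += tokens
        return True


budget = DailyTokenBudget(daily_cap=1000)
day = dt.date(2025, 1, 1)
print(budget.try_spend("u1", 600, day))  # within the cap
print(budget.try_spend("u1", 500, day))  # would exceed the cap, refused
```

Check the budget before the LLM call using an estimate of the request size, then reconcile with the actual token counts from the response.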
Closing
The teams that will win at AI products long-term are not the ones with the smartest prompts or the biggest models. They're the ones who treat observability and cost as first-class engineering problems from day one.
A semantic cache with a threshold of 0.91 isn't a performance optimization. It's a competitive moat — because it lets you iterate faster, serve more users on the same budget, and not wake up to a $50K bill that your entire Series A was not prepared for.
Instrument ruthlessly. Cache aggressively. Route intelligently. Alert early.
The cheapest LLM call really is the one you never had to make.