May 23, 2026 · 14 min read

The AI Chatbot Scaling Playbook

Stateless API with external session state

LLMs are stateless. They have no memory between API calls. Every "stateful chatbot" is a stateless model with state externalized into your infrastructure.

This is easy to miss when you're building on a single server. It becomes expensive when you deploy replica 2.

The statefulness lie

When you store conversation history in the application process, everything works fine. Until you scale to two replicas and half your users lose their context, depending on which task the load balancer routes them to. Session affinity (sticky sessions) patches this — but now you can't freely balance load, and you've introduced a single point of failure per user.

The architecture that actually scales: stateless API layer + external state store.

Session storage:
  Redis ElastiCache   → hot path: recent N turns, <5ms
  DynamoDB            → durable: full history, partition_key=session_id, sort_key=created_at
  TTL on DynamoDB     → 90 days, auto-expiry
  Split messages      → separate items, avoid 400KB limit

DynamoDB design matters. Don't store unbounded message lists as a single item — the 400KB limit becomes real at ~50 turns. Partition key = session_id, sort key = created_at for chronological ordering. On-demand billing scales to millions of RPS.

History management strategies:

Full replay: send all past messages every turn. Simple, accurate, gets expensive as sessions grow.
Sliding window: keep last N turns. Cheap, but loses early context.
Summarization: periodic LLM call to compress old turns. One extra API call, but handles long sessions well.
Anthropic prompt caching: reuses KV tensor for fixed prefixes (system prompt, RAG context). 90% cost reduction on cached tokens, 85% latency reduction on long prompts. Break-even at 2 cache hits.

The semantic cache architecture

Two-layer cache: exact match and semantic similarity

Cache hit response time: 1–27ms vs 500–2000ms for a live LLM call. That's not a rounding error.

Two layers:

Exact match (SHA-256 hash): catches deterministic repeats. Zero false positives. 15–30% of production traffic. Sub-1ms.
Semantic match (embedding + cosine similarity): catches paraphrases and intent-similar queries. "How do I reset my password?" and "Forgot my password help" → same cache hit. 3–8ms on the lookup.

Threshold 0.92 is the sweet spot for most chatbots: catches clear rephrasings, rejects distinct queries. Below 0.80: risk of serving wrong answers confidently. Above 0.97: functionally equivalent to exact-match with extra infrastructure cost.

Real production hit rates: FAQ/support bots 40–60%, open-ended chat 10–20%. Profile your traffic before building the cache — if 90% of your queries are unique and context-dependent, the ceiling hit rate is 12%.

Redis + vector index handles everything below 1M cached entries. Sub-50ms queries. Infrastructure cost stays under 5% of savings achieved. Separate vector DB only when you exceed that.

LLM gateways: the layer you're missing

The gateway sits between your app and provider APIs. It handles auth (virtual keys), per-team token/RPM budgets, automatic failover, load balancing, logging, and semantic caching. Adding one changes a lot.

LLM Gateway with multi-provider fallback

Numbers on the trade-offs:

Kong: 27,000 RPS at pure proxy throughput. If you already run Kong for non-AI traffic, adding LLM routing is a config change.
LiteLLM: 3,200 RPS. Lower throughput, but supports 100+ providers via OpenAI-compatible protocol. Easier to self-host for pure LLM-routing setups.

Intercom's Fin processed 13M+ conversations for 4,000+ customers with 99.9%+ availability by routing across multiple Bedrock regions. Their key insight: "model availabilities are relatively uncorrelated" — if one model or region has issues, others are typically fine. Multi-provider routing is availability insurance, not just cost optimization.

Cross-region inference adds ~52ms latency. Justified by quota headroom and availability.

Fallback routing in LiteLLM:

model_list = [
    {"model_name": "gpt-4o", "litellm_params": {"model": "openai/gpt-4o"}},
    {"model_name": "gpt-4o", "litellm_params": {"model": "anthropic/claude-sonnet-4-6"}},
    {"model_name": "gpt-4o", "litellm_params": {"model": "together_ai/llama-3-70b"}},
]
# on 429 or connection error: escalate through the list

Horizontal scaling: what to actually measure

Stateless API + external state means any request can go to any task. No routing tricks, no sticky sessions, no lost context. ECS Fargate + ALB handles this naturally.

For inference workloads, standard CPU/memory HPA is the wrong signal. The bottlenecks are KV cache utilization and TTFT degradation — not CPU. Scale on custom CloudWatch metrics via KEDA:

triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: kv_cache_utilization
      threshold: "0.75"
  - type: prometheus
    metadata:
      metricName: ttft_p90_seconds
      threshold: "0.5"

vLLM continuous batching: 2–4x throughput improvement over FasterTransformer, 24x higher throughput than HuggingFace TGI under high-concurrency workloads via PagedAttention. If you're running self-hosted inference, this is the deployment framework.

SSE vs WebSocket for streaming

SSE is the right call for token streaming. It's just HTTP — works through CDNs, existing proxies, and ALBs without configuration. Browser EventSource has built-in automatic reconnection. No sticky sessions required.

The ops gotcha: your load balancer needs to support thousands of concurrent long-lived keep-alive connections. ALB on AWS handles this natively. nginx requires tuning proxy_read_timeout — the default 60s kills long responses.

WebSocket only makes sense when you need high-frequency bidirectional messaging on the same connection: voice, collaborative editing, real-time multiplayer. Not token streaming.

Track goodput — RPS where both TTFT and TPOT meet your SLO — not raw throughput. A system at high RPS with degraded latency has low goodput.

TTFT SLA targets:

Interactive chatbot: ≤ 500ms to feel responsive
Code completion: ≤ 100ms
P95 inflates 1.6–3.2x over P50 in production

The seven mistakes

1. State in the API process       → breaks at replica 2, silently
2. Caching before profiling       → hit rate might be 12%
3. Synchronous LLM calls          → blocks under load, cascades to timeouts
4. Scaling on CPU/memory          → wrong signal for inference (KV cache is the bottleneck)
5. Streaming everywhere           → disable for non-user-facing calls to enable retries
6. Single-provider dependency     → one outage = full product outage
7. Logging as an afterthought     → can't diagnose hallucination spikes post-deploy

Intercom's hard-won insight on #5: streaming clients generally can't retry mid-stream. Disable streaming for model evaluation, async processing, and batch jobs — reserve it for the user-facing paths where it matters.

The AWS stack

CloudFront → Route53 → ALB → ECS Fargate (stateless, private subnet)
                             ↓
              LLM Gateway (LiteLLM) → Provider APIs
                             ↓
ElastiCache Redis (semantic cache + hot sessions)
DynamoDB (durable sessions, on-demand billing)

ElastiCache Serverless: scales automatically, no cluster sizing decisions, charges per ECU. Worth it for unpredictable chatbot traffic patterns. DynamoDB on-demand billing handles the bursty read pattern from session fetches without pre-provisioning.