May 23, 2026 · 8 min read

Semantic Caching Is the Cheat Code Nobody Talks About

Every team building with LLMs eventually discovers that a huge percentage of their queries are near-duplicates. "How do I reset my password?" and "forgot password, how to reset?" are semantically identical — but if you're hashing the query string, you miss the second one and send it to the LLM anyway.

Semantic caching fixes this. It's one of the highest-ROI optimizations in the LLM infrastructure stack, and most teams implement it too late.

What semantic caching actually is

A regular cache uses an exact key match — SHA-256 of the query string, for example. Semantic cache adds a second lookup: if the exact key misses, embed the query and do a vector similarity search against previously cached responses. If the similarity score is above a threshold, return the cached answer.

The difference in hit rate is significant. Exact hashing on FAQ-heavy workloads gets you maybe 15–25%. Semantic matching on top pushes that to 50–70% depending on query distribution.

For the math: if a cached query costs $0.0001 (embedding + Redis lookup) vs $0.015 (GPT-4o generation), every 1% additional hit rate on 1M daily queries saves $150/day. At 50% combined hit rate: $7,500/day.

Three-layer semantic cache architecture diagram

Three layers, not one

The full production setup has three layers:

L1 — Exact hash (Redis GET, ~0.3ms) Normalise the query first: lowercase, strip punctuation, collapse whitespace. Hash the result. This catches true duplicates. Fast, cheap, zero false positives.

L2 — Semantic match (vector search, ~8–15ms) Embed the query using a small fast model (text-embedding-3-small at $0.02/1M tokens, or a local sentence-transformers model at $0). Do a cosine similarity lookup against cached embeddings. Return if above threshold.

L3 — Provider prompt cache (implicit) Anthropic and OpenAI both cache the prefix of your prompts automatically. If your system prompt + RAG context is 5K tokens and the user question is 50 tokens, you pay for 50 tokens of input processing — the 5K is served from the provider's cache. This is ~90% cost reduction on the static prefix. No code required, just keep your system prompt stable and in the front of the message.

The threshold problem

The cosine similarity threshold is the most important parameter, and the one teams get wrong most often.

Hit rate vs false positive rate by cosine similarity threshold

Too low (0.80): you get high hit rates but semantic drift becomes a real problem. "What's the weather in Mumbai?" and "What's the weather in Delhi?" might score 0.82 cosine similarity on an embedding model. Returning Mumbai's answer for Delhi is a correctness failure.

Too high (0.95): you're back to near-exact matching. The hit rate barely beats L1.

The sweet spot is 0.88–0.92 for most chatbot and FAQ workloads:

0.92 → 2.1% false positive rate, 35–55% hit rate
0.88 → 8.4% false positive rate, 55–65% hit rate

For factual Q&A where correctness matters: stay at 0.92. For creative or conversational workloads where a near-match is fine: 0.88–0.90.

Real production hit rates

What teams actually see after a few weeks of warmup:

FAQ/support chatbots: 60–70% combined (L1 + L2). Users ask the same questions constantly.
Coding assistants: 20–30%. Code queries have high variance.
Internal knowledge bases: 45–55%. Domain vocabulary is narrow, embeddings cluster well.
General-purpose chat: 25–35%. Depends heavily on use case.

The cache needs warmup — most production systems hit stable hit rates after 50K–100K queries.

Implementation with GPTCache / Redis LangCache

Two production-ready options:

GPTCache (open source) drops in front of any LiteLLM or OpenAI client. It handles L1 and L2 automatically, plugs into Redis or Qdrant for storage, and exposes the threshold config.

from gptcache import cache
from gptcache.adapter import openai

cache.init(
    embedding_func=onnx_embedding,  # local, fast
    data_manager=get_data_manager(data_path="cache_data"),
    similarity_evaluation=SearchDistanceEvaluation(max_distance=0.12),  # ~0.88 threshold
)
client = openai.OpenAI()

Redis LangCache (Redis's own semantic cache layer) is simpler to operate if you're already running Redis. One endpoint, stores embeddings in a Redis vector index. The semantic search is done inside Redis itself, which keeps latency low.

Don't cache at the model output level for agentic or tool-calling flows. If the LLM is deciding which tool to call based on real-time state (inventory, booking availability, user context), a cached answer is a stale answer. Semantic caching is best for read-heavy, stateless Q&A patterns.

Cache invalidation

The unsexy part. Three patterns that work:

TTL-based: simplest. All cache entries expire after N hours/days. Right for FAQ content that changes occasionally. Low overhead.

Event-based: when source documents change (product catalog update, policy change), invalidate the embeddings that were retrieved from those documents. Requires tracking which cached responses were generated from which source chunks. More complex but precise.

Version-tagged: prefix all cache keys with a content version hash. When the knowledge base version changes, bump the hash, effectively invalidating everything. Blunt but reliable — you avoid stale answers at the cost of a cold cache restart.

The CacheAttack problem

Worth knowing: adversarial cache probing exists. An attacker who can query your system can send slightly-varied questions to infer what's been cached — and from that, reconstruct what other users have been asking. This is called CacheAttack (Berkeley, 2024).

Mitigations: add noise to similarity scores before returning (differentially private hit/miss signals), rate-limit embedding-identical queries from the same session, and don't cache sensitive or user-specific responses in the shared cache.

Where semantic caching doesn't help

Per-user personalised responses (can't be shared across the cache)
Streaming responses where you need to start token delivery immediately (latency to check L2 adds 8–15ms — acceptable in most cases, occasionally not)
Model outputs that embed current timestamps or real-time data
Agentic flows with tool state (as noted above)

For everything else — FAQ bots, RAG-powered search, support automation, knowledge base assistants — semantic caching is one of the cheapest wins in the stack.