May 19, 2026 · 14 min read

The Golden Database: How Production AI Agents Are Actually Built

Replit tried fine-tuning. They built the infrastructure, labeled the data, ran the experiments. And then they stopped.

The conclusion their team published: fine-tuning "didn't yield breakthroughs." Leveraging a superior base model â€” Claude 3.5 Sonnet at the time â€” with carefully designed few-shot examples and long task-specific instructions "often outperforms fine-tuned approaches." This is not a novel finding in 2026, but it's still widely ignored by teams who equate "customization" with "retraining."

The pattern behind this has a name: the golden database. Not weights. Not a fine-tuned checkpoint. A curated store of high-quality (input â†’ output) pairs, organized as an authoritative reference corpus, retrieved at inference time to anchor the model's behavior. The knowledge lives in the database; the prompt is the retrieval window.

This is what Replit actually runs in production. It's what the smarter coding agents use. And it has a concrete architecture â€” one worth understanding before deciding whether fine-tuning is actually the thing you need.

What "golden database" actually means

The term sounds more abstract than it is. In practice, a golden database is just a collection of example trajectories where the agent performed correctly â€” user request, agent reasoning, tool calls, output. What makes them "golden" is curation: these aren't randomly sampled traces. They're the ones where the agent got it right, where a human reviewer confirmed the output was high-quality, or where an automated evaluator (test suite, type checker, lint pass) validated the result.

Curating golden traces happens through three main pathways:

Production mining â€” after each agent run, evaluate the output against a success signal. In a coding agent, that might be: did the tests pass? Did the user continue or immediately undo the change? Good outcomes become candidates for the golden set.

Active human labeling â€” for complex cases where automated signals aren't sufficient, a human reviews traces and marks them high-quality. Labor-intensive but produces the highest-fidelity examples.

Synthetic generation â€” use a powerful model (or a human expert) to write ideal traces from scratch for specific scenarios. Useful for covering edge cases that rarely appear in production data.

Once you have the golden set, each example gets an embedding. At inference time, you embed the incoming task, run a cosine similarity search against the golden database, and inject the top-k most similar examples into the prompt. Static few-shot picks the same examples for every request; golden database retrieval picks the right examples for this request.

Decision-time guidance: Replit's evolution beyond static prompts

Static system prompts have a scaling problem. Every rule you add is present in every request â€” relevant or not. As the rule set grows, irrelevant context starts polluting model attention and degrading performance on the actual task. Replit's engineering team published the mechanism they built to address this: decision-time guidance.

The implementation is a lightweight multi-label classifier that continuously analyzes the agent's trajectory â€” recent tool results, error patterns, user messages, loop detection signals. When it identifies a relevant situation, it injects targeted instructions into that specific request rather than baking them into the system prompt.

The numbers matter here. Compared to modifying the system prompt dynamically:

90% cost reduction â€” because the stable core prompt stays cached (Claude's 5-minute and 1-hour TTL cache). Cache reads cost 10% of base input token price; dynamic system prompt changes bust the cache on every call.
15% increase in parallel tool calls â€” guidance injected at the right moment, rather than always-present rules that compete for attention with the actual task.

The design philosophy behind this: treat injected guidance as suggestions rather than hard constraints. False positives â€” injecting a reminder when it wasn't strictly necessary â€” carry no penalty. The cost is a few extra tokens. The benefit of catching the true positives (doom loops, mock-data compliance risks, high-stakes changes) is significant.

This ephemeral injection pattern â€” guidance that appears once, influences the current turn, and doesn't persist in conversation history â€” is the architectural detail that separates decision-time guidance from memory systems or chain-of-thought prompting.

The retrieval mechanics

Golden database retrieval and decision-time guidance are complementary mechanisms operating at different levels. Retrieval provides task-relevant examples (how to solve problems like this one). Decision-time guidance provides situation-relevant instructions (what to watch out for right now). In production systems, you typically want both.

The retrieval pipeline itself is straightforward: embed the user task, cosine-search the golden database, inject top-k examples. The engineering questions are in the details:

What similarity threshold to use. Below ~0.75-0.80 cosine similarity (depending on embedding model), you start injecting tangentially related examples that add noise. Better to inject zero examples than a misleading one.

How many examples to inject. Three to five is the typical range. More than that starts consuming context budget and attention. The diminishing returns from example #6 onward are steep.

When to cache the golden database itself. For a system with consistent traffic, the embedding index stays warm in memory. For high-volume API-based systems, caching the embeddings themselves (not just the text) avoids re-embedding on every retrieval call.

The token arithmetic. A static few-shot set of 10 examples, each 300 tokens, costs 3,000 input tokens per request â€” every request, regardless of whether those examples are relevant. A golden database with retrieval costs 0 tokens for the database itself (the index lives server-side) plus ~900 tokens for the 3 relevant examples actually injected. At scale, this difference is significant: RAG few-shot reduced code vulnerability detection F1 from 36.35% (zero-shot) to 74.05% (20-shot retrieval), outperforming a fine-tuned Gemini-1.5-Flash at 59.31% â€” without any training cost or retraining cycle.

When fine-tuning actually wins

The Replit finding doesn't mean fine-tuning is always wrong. It means fine-tuning is wrong for their use case: a diverse coding agent handling arbitrary user requests across thousands of different tech stacks. The performance ceiling from a better base model + golden database retrieval is higher than fine-tuning for that problem shape.

Different problem shapes yield different conclusions. There are three cases where fine-tuning is the correct default:

High-volume, narrow tasks with verifiable ground truth. SK Telecom fine-tuned Claude 3 Haiku on Amazon Bedrock for telecom-specific tasks â€” content moderation, call log summarization, ICD-style topic extraction. Results: classification accuracy from 81.5% to 99.6%, 85% reduction in tokens per query, 73% increase in positive user feedback. For tasks this well-defined, the model learns to skip the reasoning overhead that generalist prompting requires. The token reduction alone justifies the training investment at their scale.

Verifiable reasoning tasks (OpenAI Reinforcement Fine-Tuning). RFT is a different mechanism â€” it runs reinforcement learning on top of an existing model, using a custom grader to reward good completions and penalize bad ones. The grader must produce a verifiable signal: right/wrong on math, correct ICD code, valid legal citation. Accordance AI used RFT on o4-mini for complex tax analysis and saw 39% accuracy improvement over baseline, outperforming every other frontier model on their tax reasoning benchmark. Ambience Healthcare applied it to ICD-10 medical coding and achieved 12 points above physician baselines. RFT works with just a few dozen examples â€” but those examples need a grader, not just human labels.

Latency-sensitive narrow tasks where inference budget is fixed. A fine-tuned small model can handle classification or extraction tasks with far fewer tokens than a frontier model with chain-of-thought. If your p95 latency target is 200ms and your task is well-defined, fine-tuning wins on latency even when it doesn't win on absolute accuracy.

The common thread: fine-tuning pays off when the task is narrow, high-volume, and verifiable. When any of those three conditions is missing, you're usually better off with a better model and better retrieval.

The third option: Anthropic Agent Skills

Fine-tuning and few-shot retrieval aren't the only customization tools in 2026. In October 2025, Anthropic shipped Agent Skills â€” and they're worth understanding as a distinct mechanism.

Skills are folders containing instructions, scripts, and resources that Claude loads dynamically when it detects the skill is relevant. Anthropic ships managed skills for working with Excel, PowerPoint, Word documents, and fillable PDFs. Custom skills let you package domain expertise and organizational workflows and upload them via the /v1/skills endpoint.

The architectural distinction from fine-tuning: skills are compositional, not permanent. They stack. Claude automatically identifies which skills are needed for a given task and coordinates their use. You can update a skill without retraining anything; the change is live on the next request. And unlike RAG few-shot retrieval, skills don't require a vector database infrastructure â€” the skill is a folder, the API handles loading.

The distinction from in-context examples: skills inject scripts and procedural knowledge (how to operate a tool, how to follow a workflow), not examples of good outputs. For use cases where the task is "manipulate this Excel file correctly" or "follow this org's code review procedure," skills are more appropriate than a golden database of output examples.

The practical guidance: if your customization need is workflow-oriented (teach Claude how to do something), use skills. If it's output-quality-oriented (teach Claude what good looks like), use a golden database. If it's both, use both â€” they compose.

The inference-time compute wildcard

One more factor is reshaping the fine-tuning vs prompting calculus in 2026: extended reasoning, or test-time compute.

Claude's extended thinking mode, OpenAI's o-series (o3, o4-mini), DeepSeek R1 â€” all of these scale inference compute rather than training compute. The model spends more tokens reasoning through a problem before producing output. The result: frontier models, given a reasoning budget, now match or exceed what fine-tuned models achieve on many tasks â€” without any task-specific training.

The practical implication: when your model starts thinking harder, some of the precision improvements you used to chase through fine-tuning become available through reasoning. Tax analysis, complex code debugging, architectural decisions â€” extended reasoning often catches the edge cases that once required fine-tuning.

This doesn't eliminate fine-tuning, but it shifts the threshold. Five years ago, "improve accuracy on task X" almost always meant "fine-tune on task X." Today the first question is: does extended reasoning close the gap? If yes (often), you avoid the training infrastructure entirely. If no (narrow classification, latency-constrained), fine-tuning or RFT is still the right tool.

The broader shift: companies are increasingly finding that inference-strategy updates have shorter development cycles than retraining. When an edge case appears in production, adjusting a retrieval threshold or adding a skill takes hours. A fine-tuning run takes days and requires maintaining a training pipeline. For product iteration velocity, this asymmetry matters.

What to build

The practical decision is a function of task shape. Here's the framework I'd use in 2026:

Does your task require consistent behavior across many diverse inputs? Golden database retrieval is your default. Build the curated example store, add the retrieval pipeline, use the base frontier model. This is what Replit ships. It scales with your data quality, not your training infrastructure.

Is your task narrow, high-volume, and does it have verifiable ground truth? Consider fine-tuning â€” specifically Claude 3 Haiku on Amazon Bedrock if you're in the Anthropic ecosystem, or RFT on o4-mini if you have a grader signal and need peak reasoning accuracy on a specific domain.

Is your use case workflow-oriented? Anthropic Agent Skills give you composable capability injection without the retrieval overhead or training cost. They're particularly strong for document processing, org-specific procedures, and tool-specific expertise.

Are you hitting a quality ceiling on a reasoning-heavy task? Try extended reasoning before you try fine-tuning. The test-time compute you're paying for often eliminates the problem that was leading you toward a training run.

The mistake most teams make is defaulting to fine-tuning as the "serious" customization option. It looks like what you do when you want the model to take your domain seriously. But Replit's experience â€” hundreds of thousands of production runs, ~90% tool invocation success rate, no fine-tuning â€” is the more common outcome when you've actually built the golden database correctly.

The weights aren't where the knowledge lives. The database is.