Sakana Means Fish. Their New Model Is a School of Them.
A pufferfish is only dangerous if you prepare it wrong. It takes a licensed chef years to learn how to serve fugu safely. The fish itself is not the problem — the handling is.
Sakana AI named their new system after it. The implication is deliberate.
Fugu launched on June 22, 2026 out of Tokyo. It does not work the way any other AI system works. It does not have a larger model. It has a conductor — a 7-billion-parameter orchestrator trained to manage a school of frontier models: GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro. On some benchmarks, this school beats Anthropic's Mythos 5. On others it trails. But the architecture itself is the news.
Who Built This
Sakana AI was founded in 2023 by David Ha and Llion Jones.
David Ha was Research Director at Google Brain — not just a researcher, he ran the team. Before starting Sakana, he was also Head of Research at Stability AI.
Llion Jones is one of the eight original authors of "Attention Is All You Need," the 2017 Vaswani et al. paper that introduced the Transformer architecture. Every LLM running today traces its lineage to that paper.
"Sakana" (魚) means fish in Japanese. This is not incidental. Their entire research philosophy is modeled on collective intelligence in nature — a school of fish producing emergent behavior smarter than any individual swimmer. You see this in their earlier work: Evolutionary Model Merge (2024) combined weights from multiple existing open-source models using evolutionary search, producing state-of-the-art Japanese math models without training data. AI Scientist (2024) automated the ML research lifecycle end-to-end. Earlier this year, AI Scientist-v2 published the first fully AI-generated paper to pass human peer review in Nature.
The school-of-fish metaphor is not a branding choice. It is their thesis.
The Core Idea: School, Not Whale
Every major AI lab is racing to build the biggest model. Sakana built a smarter coach.
Fugu does not answer your query with one large model. A Conductor reads the query, decides which models in the pool to call, assigns them roles, sequences the work, and synthesizes the result. The individual worker models (GPT-5.5, Opus 4.8, Gemini 3.1 Pro) do the actual generation. The Conductor coordinates.
This is what Sakana calls behavioral composition. It is fundamentally different from a Mixture-of-Experts (MoE) model, where routing happens between expert sub-networks inside a single model, and from weight merging, where parameters from several models are combined. Both need access to model internals. Behavioral composition treats every frontier model as a black box: you access them via API, and you never touch their weights. The Conductor only learns what to ask and who to ask it.
The practical consequence: as new frontier models release, Sakana retrains the orchestrator (roughly two weeks per new worker model added to the pool) without rerunning a foundation model training job. The school grows. The members change. The conductor adapts.
Two Variants, Two Different Architectures
Fugu ships in two variants that are architecturally distinct — not just differently tuned.
Fugu (standard) is powered by TRINITY, a roughly 0.6-billion-parameter coordinator combined with a small routing head. The coordinator reads your query, compresses it to a hidden-state vector, and the routing head outputs a probability distribution across the worker pool. Softmax, sample, call. Fast. Low latency. One worker per query. Routing patterns that emerged from training: GPT-5.5 dominates math and physics; Opus 4.8 dominates software engineering and security; Gemini 3.1 Pro dominates science recall. These are learned, not hardcoded.
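The "softmax, sample, call" step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sakana's implementation: the 16-dimensional hidden state, the random untrained weights, the worker identifiers, and the names `softmax` and `route` are all hypothetical.

```python
import math
import random

WORKERS = ["gpt-5.5", "claude-opus-4.8", "gemini-3.1-pro"]  # hypothetical identifiers


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(hidden, weights, bias, rng):
    """Map a query's hidden-state vector to one sampled worker."""
    # Linear routing head: one logit per worker.
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    # Sample a single worker from the distribution (one worker per query).
    worker = rng.choices(WORKERS, weights=probs, k=1)[0]
    return worker, probs


rng = random.Random(0)
hidden = [rng.gauss(0, 1) for _ in range(16)]  # stand-in for the coordinator's hidden state
weights = [[rng.gauss(0, 1) for _ in range(16)] for _ in WORKERS]  # untrained, random head
bias = [0.0] * len(WORKERS)

worker, probs = route(hidden, weights, bias, rng)
print(worker, [round(p, 3) for p in probs])
```

In a trained system the weights would come from training rather than a random generator, which is where routing patterns like the ones described above would be encoded.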
Fugu Ultra is powered by the Conductor — the 7B model built on Qwen2.5-7B, trained with GRPO (Group Relative Policy Optimization) over 200 iterations on 960 labeled problems from MATH, MMLU, LiveCodeBench, and RLPR datasets. The Conductor does not just pick a worker. It generates a complete natural-language workflow: up to five steps, with a subtask list, a worker assignment per subtask, and a communication topology.
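GRPO, named above, scores each sampled output relative to the other outputs sampled for the same problem, rather than against a separately learned value model. The sketch below shows only that group-relative normalization step. The binary reward (1.0 for a correct final answer) and the function name are assumptions for illustration, and the use of the population standard deviation is a choice made here; implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same problem."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four workflows sampled for one problem:
# 1.0 if the final answer was correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Workflows that did better than their group average receive a positive advantage and are reinforced; those that did worse receive a negative one. When every sample in a group earns the same reward, all advantages are zero and that problem contributes no learning signal.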
The communication topology is the key architectural innovation. The Conductor can emit chains (A → B → C), best-of-N ensembles (run all three workers, take the best), debate trees (workers argue, Conductor resolves), or recursive structures — all from the same system depending on task complexity. This topology is emergent. It is not a template the engineer designed. The Conductor learns when to chain, when to ensemble, when to debate.
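One way to picture a workflow and its topology is as a list of steps, each naming a worker and the earlier steps whose output it can read. The sketch below is an assumed representation, not the Conductor's actual output format (which the source describes as natural language). The class names, field names, worker identifiers, and the choice of which worker performs the selection step are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    subtask: str
    worker: str  # which model handles this subtask
    reads_from: list[int] = field(default_factory=list)  # indices of earlier steps it sees


@dataclass
class Workflow:
    steps: list[Step]

    def validate(self, max_steps: int = 5) -> None:
        """Check the step limit and that every step reads only from earlier steps."""
        if len(self.steps) > max_steps:
            raise ValueError(f"workflow has more than {max_steps} steps")
        for i, step in enumerate(self.steps):
            if any(j < 0 or j >= i for j in step.reads_from):
                raise ValueError(f"step {i} must read only from earlier steps")


# A chain: A -> B -> C
chain = Workflow([
    Step("draft a patch", "gpt-5.5"),
    Step("review the patch", "claude-opus-4.8", reads_from=[0]),
    Step("write the summary", "gemini-3.1-pro", reads_from=[1]),
])

# A best-of-N ensemble: three independent attempts, then one selection step
ensemble = Workflow([
    Step("solve the problem", "gpt-5.5"),
    Step("solve the problem", "claude-opus-4.8"),
    Step("solve the problem", "gemini-3.1-pro"),
    Step("pick the best answer", "claude-opus-4.8", reads_from=[0, 1, 2]),
])

chain.validate()
ensemble.validate()
```

In this representation the difference between a chain and an ensemble is only the `reads_from` edges, which is why a single system can emit either shape.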
The Moment That Made This Matter
On June 12, 2026, the US Department of Commerce sent a letter to Anthropic imposing export controls on Fable 5 and Mythos 5. Any foreign person — including foreign nationals inside the United States, including Anthropic's own international employees — now requires an approved license to access either model.
This was the first time export controls were applied to a specific AI model rather than hardware. Anthropic's response was to disable both models for all customers to ensure compliance.
Fugu had never included Fable 5 or Mythos 5 in its pool. Their absence was not a limitation; it was an architectural feature.
David Ha said it plainly: "Relying on a single company's APIs for critical infrastructure is a material vulnerability."
The swappable pool design means that when a provider goes dark (export controls, pricing, an outage), the Conductor retargets what remains without a change to application code. You change the worker list, retrain the routing head on the new pool, and redeploy. The Conductor does not hard-code assumptions about which models exist.
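The "change the worker list" step might look something like the following. This is a sketch under assumptions: the `Worker` record, the `available_pool` helper, and the worker identifiers are hypothetical, and retraining the routing head on the reduced pool is outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    provider: str


POOL = [
    Worker("gpt-5.5", "openai"),
    Worker("claude-opus-4.8", "anthropic"),
    Worker("gemini-3.1-pro", "google"),
]


def available_pool(pool, blocked_providers):
    """Return the workers whose provider is still reachable."""
    remaining = [w for w in pool if w.provider not in blocked_providers]
    if not remaining:
        raise RuntimeError("no workers left in the pool")
    return remaining


# If one provider goes dark, the pool shrinks; the calling code stays the same.
pool = available_pool(POOL, blocked_providers={"anthropic"})
print([w.name for w in pool])  # ['gpt-5.5', 'gemini-3.1-pro']
```

Keeping the pool as data rather than as branches in code is what makes the swap a configuration change instead of a rewrite.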
European officials cited the June 12 directive as justification for sovereign AI development. The incident crystallized what Sakana had been arguing architecturally: distributed behavioral composition is more resilient than single-provider dependency.
What the Benchmarks Actually Say
Sakana's marketing says "shoulder-to-shoulder with Mythos." The numbers say selective parity. Both are technically accurate. Only one is the full picture.
| Benchmark | Fugu Ultra | Fugu std | Claude Opus 4.8 | GPT-5.5 | Mythos 5 |
|---|---|---|---|---|---|
| SWE-Bench Pro | 73.7 | 59.0 | 69.2 | 58.6 | 80.3 |
| GPQA-Diamond | 95.5 ★ | 95.5 | 92.0 | 93.6 | 94.1 |
| LiveCodeBench v6 | 93.2 ★ | 92.9 | 90.3 | 90.7 | — |
| TerminalBench 2.1 | 82.1 | 80.2 | 74.6 | 78.2 | 88.0 |
★ = Fugu Ultra leads. Source: Sakana AI technical report, arXiv 2606.21228. All figures vendor-reported.
On GPQA-Diamond (graduate-level science and reasoning), Fugu Ultra beats every frontier model in the comparison, including Mythos 5, though Fugu standard matches its 95.5. On LiveCodeBench v6 it leads every model with a reported score; the table lists no Mythos 5 figure for that benchmark. On SWE-Bench Pro and TerminalBench 2.1, the two benchmarks most representative of real engineering work, Mythos 5 is still ahead.
The honest read: Fugu Ultra is at the frontier on reasoning and coding benchmarks. It trails on the hardest software engineering and terminal-agent tasks. Given that it achieves this by orchestrating models that are themselves in the comparison set, this is not a small result.
Why the API Is the Right Move
Fugu uses an OpenAI-compatible API. Chat Completions endpoint. Responses endpoint. No SDK migration. You change your base_url to fugu.sakana.ai and swap the API key.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://fugu.sakana.ai/v1",
    api_key="your-fugu-key",
)

response = client.chat.completions.create(
    model="fugu-ultra",
    messages=[{"role": "user", "content": "Review this PR diff for security issues: ..."}],
)
print(response.choices[0].message.content)
```

This is the right product decision. Any team already using any OpenAI-compatible client can switch in twenty minutes. The barrier to evaluation is the price of a pip install and a new key.
Fugu standard starts at $20/month. Fugu Ultra pay-as-you-go is $5/1M input tokens and $30/1M output tokens — with a caveat that heavy Fugu Ultra tasks, which can involve multiple 5-step agentic workflows, can reach $10 per message on complex security assessments or multi-hour research runs.
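To see how a single message could reach $10 at those rates, it helps to run the arithmetic. The helper below is a back-of-the-envelope sketch: the function name and the token counts are made up for illustration, and it assumes the quoted per-token prices apply to the total tokens billed for a request, which the source does not spell out.

```python
INPUT_USD_PER_MTOK = 5.0    # Fugu Ultra input price quoted above
OUTPUT_USD_PER_MTOK = 30.0  # Fugu Ultra output price quoted above


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the quoted pay-as-you-go rates."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Hypothetical token counts, chosen only to show the scale involved.
print(estimate_cost(2_000, 1_000))      # a short exchange: 0.04
print(estimate_cost(200_000, 300_000))  # a long multi-step run: 10.0
```

At these prices output tokens dominate: the long run above spends $1 on input and $9 on output.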
What This Means
The standard AI product narrative is: more parameters, better results. Train bigger, compute harder, win the benchmark.
Sakana is running a different experiment. The Conductor is 7 billion parameters. On GPQA-Diamond and LiveCodeBench v6, the system it directs outscores models that cost hundreds of millions of dollars to train, not by being smarter, but by knowing who to ask. The small fish directing the school.
The constraint the architecture accepts is latency. Multi-step Conductor workflows take longer than a single direct call. For interactive chat, this is a real trade-off. For deep research tasks, security assessments, multi-turn code review — the latency premium buys output quality that a single model call cannot match.
Fugu is also currently unavailable in the EU while Sakana works through GDPR compliance for its black-box data routing architecture. If your team is EU-based, you cannot use it yet.
But the architectural argument is durable regardless of whether Fugu specifically wins the market. The question it asks — can a small model that learns to coordinate beat large models that learn to generate? — is now empirically answered for certain task classes. The answer is yes.
A school of fish moves faster than the same mass of water alone.