May 29, 2026 · 14 min read

Building a Multi-Agent Debate System for Transfer Pricing Defense

Figure: 3-agent comparable selection debate flow (Inclusion Counsel, Exclusion Counsel, and Partner Arbitrator)

India has ₹12+ lakh crore in pending transfer pricing cases at ITAT. The average TP study costs ₹20–50 lakh and takes three to four months. Most of that cost is senior partner time — a handful of people who carry 20 years of case law in their heads, know which ITAT benches favour which arguments, and can read a TPO order in 10 minutes and identify the procedural weakness.

That knowledge doesn’t scale. Aura TP is my attempt to encode some of it.

This post is specifically about the multi-agent debate architecture — why I used adversarial agents instead of a single “think carefully” prompt, what the agent personas actually look like, how RAG is wired to prevent hallucinated case citations, and where it still fails.

The Transfer Pricing Problem

Transfer pricing under Section 92 of the Income Tax Act requires that cross-border transactions between related parties (Associated Enterprises) happen at arm’s-length prices — prices that unrelated parties would agree to in similar conditions. In practice, this means:

You select a set of “comparable” companies that perform similar functions with similar risks.
You compute a margin range from those comparables.
You show your related-party transaction falls within that range.
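The three steps above can be sketched numerically. This is an illustration only, not the computation Aura TP performs: the margins are made-up figures, the percentile band is a parameter (the 35/65 defaults are an assumption for the example), and `percentile_nearest_rank` is a hypothetical helper using the simple nearest-rank method.

```python
import math

def percentile_nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def arms_length_range(margins, lower_pct=35, upper_pct=65):
    """Return (low, high) bounds of the margin range built from the comparables."""
    ordered = sorted(margins)
    return (percentile_nearest_rank(ordered, lower_pct),
            percentile_nearest_rank(ordered, upper_pct))

def within_range(tested_margin, margins, lower_pct=35, upper_pct=65):
    """True if the tested party's margin falls inside the comparables' range."""
    low, high = arms_length_range(margins, lower_pct, upper_pct)
    return low <= tested_margin <= high

# Hypothetical operating margins (%) of seven accepted comparables
comparable_margins = [8.0, 10.5, 12.0, 14.5, 16.0, 18.5, 21.0]
low, high = arms_length_range(comparable_margins)
print(f"range: {low}% to {high}%")
print("tested party at 13.2% inside range:", within_range(13.2, comparable_margins))
```

Dropping or adding a single comparable shifts both bounds, which is why the comparable set is where most disputes start.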

The Transfer Pricing Officer (TPO) reviews your comparable selection during audit. They can, and routinely do, reject your comparables and substitute their own, narrowing the range and creating an “adjustment” (a taxable income addition). Budget 2026 introduced a fixed January 30/31 deadline for TPO orders, which means TPOs are under time pressure and the number of adjustments is rising.

The documentation work is enormous. A single TP study for a mid-size MNC covers:

FAR analysis (Functions, Assets, Risks) for each entity in the group
Comparable company search across CMIE Prowess / Capital IQ
Filter application (revenue range, related-party transaction filters, loss-company filters)
Margin computation and range analysis
Economic justification for the tested party’s margins
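The filter-application step in the list above is the most mechanical one, so it is easy to sketch. Everything here is hypothetical: the `Candidate` fields, the threshold values, and the company names are placeholders, not the filters any particular study uses. The point is only that each rejection carries its reasons, which makes it possible to check later that a filter was applied consistently.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    revenue_cr: float   # revenue in Rs crore
    rpt_ratio: float    # related-party transactions as a share of revenue
    loss_years: int     # loss-making years in the review window

def apply_filters(candidates, min_rev=1.0, max_rev=200.0, max_rpt=0.25, max_loss_years=1):
    """Screen candidates and record why each rejected one failed (thresholds are assumptions)."""
    accepted, rejected = [], []
    for c in candidates:
        reasons = []
        if not (min_rev <= c.revenue_cr <= max_rev):
            reasons.append("revenue outside range")
        if c.rpt_ratio > max_rpt:
            reasons.append("related-party transactions above threshold")
        if c.loss_years > max_loss_years:
            reasons.append("persistent losses")
        if reasons:
            rejected.append((c.name, reasons))
        else:
            accepted.append(c.name)
    return accepted, rejected

pool = [
    Candidate("Alpha Ltd", revenue_cr=85.0, rpt_ratio=0.05, loss_years=0),
    Candidate("Beta Ltd", revenue_cr=450.0, rpt_ratio=0.10, loss_years=0),
    Candidate("Gamma Ltd", revenue_cr=40.0, rpt_ratio=0.40, loss_years=3),
]
accepted, rejected = apply_filters(pool)
print(accepted)
for name, reasons in rejected:
    print(name, "->", "; ".join(reasons))
```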

When the TPO rejects your comparables, you need to write a detailed rebuttal — a Section 3(11) reply that argues each rejection point, cites supporting ITAT precedents, and is signed off by a Senior TP Partner before filing.

Aura TP automates two pieces: comparable review and TPO defense drafting.

The Comparable Selection Debate Architecture

The naive approach: give an LLM the comparable’s annual report and ask “should this be included?” The problem is that a single agent trying to be balanced produces mediocre output — it hedges, it doesn’t push hard on the weak points, and it surfaces arguments in the order they occur to it, not in the order a TP partner would prioritise them.

The adversarial design forces both sides of the argument to be fully developed before a verdict is rendered. Three agents. Sequential. Each sees the output of the previous.

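// Agent ensemble: comparable review debate
// ComparableRecord, AgentResponse, ITATReference, runAgent, runArbitrator,
// and the prompt builders are defined elsewhere in the package.
import (
    "context"
    "fmt"
)
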
type DebateRound struct {
    Comparable    ComparableRecord
    InclusionCase AgentResponse
    ExclusionCase AgentResponse
    Verdict       ArbitratorVerdict
}

type ArbitratorVerdict struct {
    Decision    string            // "ACCEPT" | "REJECT"
    Confidence  float64           // 0.0–1.0
    Reasoning   string
    KeyFactor   string            // the single deciding factor
    Citations   []ITATReference
}

func RunDebate(ctx context.Context, comp ComparableRecord) (DebateRound, error) {
    round := DebateRound{Comparable: comp}

    // Agent 1: Inclusion Counsel
    // Sees: comparable FAR summary, tested party description
    // Must argue: industry match, functional overlap, risk alignment, asset intensity
    inc, err := runAgent(ctx, InclusionCounsel, buildInclusionPrompt(comp))
    if err != nil {
        return round, fmt.Errorf("inclusion agent: %w", err)
    }
    round.InclusionCase = inc

    // Agent 2: Exclusion Counsel
    // Sees: comparable FAR summary + Agent 1 output
    // Must rebut: find mismatches, procedural weaknesses, TPO-style objections
    exc, err := runAgent(ctx, ExclusionCounsel, buildExclusionPrompt(comp, inc))
    if err != nil {
        return round, fmt.Errorf("exclusion agent: %w", err)
    }
    round.ExclusionCase = exc

    // Agent 3: Partner Arbitrator (Meera)
    // Sees: both cases + ITAT precedents from RAG
    // Renders: structured verdict with key deciding factor
    verdict, err := runArbitrator(ctx, round.InclusionCase, round.ExclusionCase, comp)
    if err != nil {
        return round, fmt.Errorf("arbitrator: %w", err)
    }
    round.Verdict = verdict
    return round, nil
}

The structured output format matters. Each agent is constrained to output under specific headings:

const InclusionPromptTemplate = `
You are Inclusion Counsel in a transfer pricing comparable review.
Your ONLY job: argue why {{.ComparableName}} SHOULD be included as a comparable.

Structure your response under EXACTLY these four headings:
INDUSTRY MATCH: [1-3 sentences]
FUNCTIONAL ANALYSIS: [2-4 sentences, cite specific functions from FAR]
RISK PROFILE: [1-3 sentences]
ASSET INTENSITY: [1-2 sentences]

Do NOT hedge. Do NOT mention counterarguments. Your job is to build the strongest
possible inclusion case. The Exclusion Counsel will handle the other side.

Tested party: {{.TestedParty}}
Comparable FAR summary: {{.FARSummary}}
`

The “do NOT hedge” instruction is non-negotiable. Without it, Agent 1 starts doing Agent 3’s job — weighing both sides — and the debate collapses into a verbose single-agent response. Each agent must be constrained to its role or the adversarial structure loses its value.
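A structured format is only useful if something checks it. Below is a minimal sketch of such a check, assuming the agent's reply arrives as plain text with each heading at the start of a line; `parse_sections` is a hypothetical helper and the sample reply is invented for the example.

```python
import re

REQUIRED_HEADINGS = ["INDUSTRY MATCH", "FUNCTIONAL ANALYSIS", "RISK PROFILE", "ASSET INTENSITY"]

def parse_sections(response):
    """Split a reply into {heading: body}; raise if headings are missing, reordered, or empty."""
    pattern = r"^(" + "|".join(re.escape(h) for h in REQUIRED_HEADINGS) + r"):[ \t]*"
    # With one capture group, re.split returns [preamble, heading, body, heading, body, ...]
    parts = re.split(pattern, response.strip(), flags=re.MULTILINE)
    headings = parts[1::2]
    bodies = [body.strip() for body in parts[2::2]]
    if headings != REQUIRED_HEADINGS:
        raise ValueError(f"expected headings {REQUIRED_HEADINGS}, got {headings}")
    if not all(bodies):
        raise ValueError("one or more sections are empty")
    return dict(zip(headings, bodies))

sample = """INDUSTRY MATCH: Both entities provide software development services.
FUNCTIONAL ANALYSIS: The comparable performs contract development. It owns no significant IP.
RISK PROFILE: Limited market risk.
ASSET INTENSITY: Low fixed-asset base."""

sections = parse_sections(sample)
print(sections["RISK PROFILE"])
```

A reply that fails this check can be retried before it reaches the next agent, so Exclusion Counsel always has four labelled sections to attack.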

Why adversarial beats single-agent for comparable selection: a single agent told to “think carefully” will find the obvious industry match and stop. The debate forces Exclusion Counsel to read Inclusion Counsel’s argument and attack its weakest point. In practice, that’s usually the risk profile: Inclusion Counsel says “comparable has similar risk profile,” and Exclusion Counsel points out that the comparable is a subsidiary that bears inventory risk while the tested party is a service company that doesn’t. That specific FAR mismatch surfaced in 70% of the debates and almost never came up in single-agent tests.

The TPO Defense Ensemble

The TPO defense is harder. You’re not evaluating a single comparable — you’re writing a legal rebuttal to a government order that may run to 40 pages and cite 8–12 case references.

Figure: 4-agent TPO defense ensemble with RAG knowledge base feeding Meera's synthesis

Four agents. Two modes: favor_company (default) and favor_tpo (stress-test).

Rajeev (Legal Advocate): Defends the company’s benchmark methodology. His inputs are the TPO adjustment notice and the original TP study. He argues the selected comparables meet Rule 10B criteria, the TNMM range is properly computed, and the filter application was consistent with industry practice. Rajeev’s persona is a Delhi-based tax litigator who learned TP in the KPMG Mumbai practice in the 2010s — formal, citation-heavy, comfortable with ITAT Delhi bench preferences.

Shilpa (Ex-TPO Auditor): This is the agent I’m most proud of. Shilpa attacks the TPO’s own methodology. Her inputs are the TPO order and the list of comparables the TPO added. She looks for cherry-picking (TPO included high-margin comparables while rejecting low-margin ones on inconsistent grounds), filter inconsistencies (TPO applied a filter for one purpose and ignored it for another), and procedural irregularities. She knows that TPOs are under time pressure and often make the same five methodological mistakes — she’s been on the other side of the desk.

Vikram (Economist): Provides the economic theory backbone. Vikram argues why the company’s margins are economically justified given its capital structure, industry dynamics, and risk profile. He’s less useful on routine IT services cases (where everyone knows the TNMM range) and most useful on cases with unusual transactions — royalties, guarantees, intra-group loans.

Meera (Senior TP Partner): Synthesizes. She sees all three agents’ outputs plus RAG-retrieved ITAT precedents. She writes the final Section 3(11) reply — structured as a legal document, with numbered paragraphs and specific case citations. Meera is the only agent with RAG access.

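import concurrent.futures

# run_agent, retrieve_precedents, and extract_issue_type are defined elsewhere in the project.
def run_tpo_defense(tpo_notice: str, tp_study: str, strategy: str = 'favor_company') -> dict: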
    # strategy: 'favor_company' | 'favor_tpo'
    # favor_tpo is stress-test mode -- finds weak points. Not for filing.
    mode_context = {
        'favor_company': 'Defend the company transfer pricing position vigorously.',
        'favor_tpo':     'Argue as the TPO would. Find every weakness. Be adversarial.',
    }[strategy]

    # Agents 1-3 run in parallel (no inter-dependency)
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
        rajeev_fut = pool.submit(run_agent, "rajeev", tpo_notice, tp_study, mode_context)
        shilpa_fut = pool.submit(run_agent, "shilpa", tpo_notice, tp_study, mode_context)
        vikram_fut = pool.submit(run_agent, "vikram", tpo_notice, tp_study, mode_context)

    rajeev_out = rajeev_fut.result()
    shilpa_out = shilpa_fut.result()
    vikram_out = vikram_fut.result()

    # RAG retrieval: Meera queries the knowledge base
    rag_snippets = retrieve_precedents(
        query=f"TPO adjustment {extract_issue_type(tpo_notice)} ITAT favorable",
        top_k=5,
    )

    # Meera synthesizes
    final_rebuttal = run_agent("meera", rajeev_out, shilpa_out, vikram_out,
                               rag_snippets, mode_context)
    return {"strategy": strategy, "rebuttal": final_rebuttal, "rag_sources": rag_snippets}

The parallel execution of Rajeev, Shilpa, and Vikram matters: these three agents have no inter-dependencies. Chaining them sequentially, with each agent reading the previous one’s output, would mean Vikram reads Shilpa’s output and starts parroting her arguments. Independence produces more diverse input for Meera.

The Agent Roster

AGENT	ROLE	STRATEGY	OUTPUT
Rajeev	Legal Advocate	favor_company	Legal rebuttal citing company's TP benchmark and OECD guidelines
Shilpa	Ex-TPO Auditor	favor_company	Attacks TPO methodology — cherry-picking, ignored filters
Vikram	Economist	favor_company	Economic theory justifying company's operating margins
Meera	Senior TP Partner	synthesizer	Final defense brief with ITAT precedent citations via RAG
Karan	TPO Advocate	favor_tpo	Justifies TPO's upward adjustment (mock audit mode)

Persona Engineering

Why named personas with specific backstories instead of generic “Agent 1, Agent 2”?

It’s not aesthetics. The persona shapes what the agent notices and how it structures its response.

Shilpa’s “ex-TPO auditor” backstory means her system prompt includes knowledge of how the Income Tax Department’s internal assessment targets work, what a TPO’s default objection list looks like, and what procedural shortcuts create appealable errors. That context — even described briefly in the system prompt — shifts her outputs toward the procedural attack angles that actually work in ITAT.

Meera’s “20-year Delhi/Mumbai practice” backstory means she’s calibrated to the difference between the ITAT Delhi bench (which historically favours certain economic arguments over legal technicalities) and the ITAT Mumbai bench (which tends toward strict textual interpretation of Rule 10B).

AGENT_PERSONAS = {
    "meera": {
        "name": "Meera",
        "role": "Senior TP Partner",
        "system": (
            "You are Meera, a Senior Transfer Pricing Partner with 20 years of\n"
            "practice across Delhi and Mumbai. You were at a Big 4 firm for 14 years\n"
            "before going independent. You know the preferences of ITAT Delhi, Mumbai,\n"
            "and Bangalore benches by heart. You write rebuttal letters that are legally\n"
            "precise, citation-rich, and leave no procedural opening for the TPO to\n"
            "expand the adjustment.\n\n"
            "When you cite an ITAT case, you MUST have seen it in the context provided.\n"
            "Never cite a case from memory. If the context does not contain a supporting\n"
            "citation, say so explicitly and note that a citation search is needed.\n\n"
            "Your tone: formal, confident, never combative."
        ),
    },
    "shilpa": {
        "name": "Shilpa",
        "role": "Ex-TPO Auditor",
        "system": (
            "You are Shilpa, a transfer pricing specialist who spent 8 years as a TPO\n"
            "in the Mumbai circle before moving to private practice. You know exactly how\n"
            "TPOs select comparables — and the shortcuts they take. You know which filter\n"
            "applications are internally inconsistent. Your job is to find every\n"
            "procedural and methodological weakness in the TPO's position.\n\n"
            "Be specific. Vague objections don't win at ITAT. Name the specific filter,\n"
            "the specific comparable, the specific inconsistency."
        ),
    },
    "rajeev": {
        "name": "Rajeev",
        "role": "Legal Advocate",
        "system": (
            "You are Rajeev, a tax litigator who has handled TP matters for 15 years,\n"
            "primarily for IT/ITES and pharmaceutical companies. Defend the company's\n"
            "transfer pricing methodology. Argue that the selected comparables meet Rule\n"
            "10B criteria, the TNMM range is correctly computed, and filter application\n"
            "is consistent with CBDT guidelines.\n\n"
            "Structure your argument as: (a) methodology correctness, (b) comparable\n"
            "quality, (c) range analysis, (d) why the TPO's additions are inappropriate."
        ),
    },
    "vikram": {
        "name": "Vikram",
        "role": "Economist",
        "system": (
            "You are Vikram, an economist specialising in transfer pricing. Provide\n"
            "economic justification — explain why the company's margins are arm's-length\n"
            "from first principles. Use industry data, capital structure arguments, risk\n"
            "premium logic, and market conditions. Do not use legal language. Do not cite\n"
            "cases. Your output is the economic backbone Meera will weave into the\n"
            "legal rebuttal."
        ),
    },
}

The “never cite a case from memory” instruction in Meera’s persona is the single most important safety guardrail in the system. Without it, Meera hallucinates plausible-sounding ITAT case names. With it, she explicitly notes when she doesn’t have a citation — which is a signal for human review, not a failure state.

RAG Design for Legal Knowledge

I considered a vector database. I chose local .md files with BM25 search plus a small semantic re-ranker. Here’s why.

ITAT case law has a specific problem: names matter. “Aztec Software & Technology Services v. ACIT (ITAT Bangalore, 2007)” is not semantically close to “Infosys BPO Ltd. v. DCIT (ITAT Bangalore, 2015)” but both may be relevant to an IT services comparable selection argument. Semantic search ranks by how similar the text reads, so it can miss one of them or blur distinct cases together. BM25 on case names, party names, and issue types finds the right documents when you know what you’re looking for.

The knowledge base is small by design:

knowledge/
  itat_cases/
    aztec_software_2007.md        # landmark comparable-selection case
    juniper_networks_2022.md      # cherry-picking TPO methodology
    netflix_india_2024.md         # digital services TP, recent
    sony_india_2023.md            # marketing intangibles
  statutes/
    rule_10b_comparable_criteria.md
    rule_10td_safe_harbour.md
    section_92_computation.md
  defense_patterns/
    tpo_cherry_picking.md
    filter_inconsistency.md
    tnmm_range_defense.md

Each .md file follows a strict schema: case citation header, key holding, applicable facts, distinguished facts (when it doesn’t apply), and the exact quote most likely to be useful in a rebuttal. This structure is manual — it takes about 45 minutes per ITAT case to write a good .md — but the retrieval precision is far better than chunking the full case PDF and letting the vector DB figure it out.

The hard lesson from early prototypes: vector DBs are good for “find me documents about X.” They are bad for “find me the specific holding in Aztec Software that addresses the revenue filter inconsistency.” Legal retrieval is more like a lookup than a search. BM25 on structured case summaries beats cosine similarity on full-text chunks for this specific use case.
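To make the BM25 side concrete, here is a self-contained sketch of BM25 scoring over short case summaries. It is not the retrieval code Aura TP runs: the `BM25Index` class is hypothetical, the summaries are placeholder one-liners built from the file comments in the listing above, the IDF uses the common non-negative variant, and the semantic re-ranker is left out entirely.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so 'cherry-picking' becomes two terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tokens = [tokenize(docs[doc_id]) for doc_id in self.ids]
        self.tf = [Counter(toks) for toks in self.tokens]
        self.avg_len = sum(len(toks) for toks in self.tokens) / len(self.tokens)
        doc_freq = Counter()
        for toks in self.tokens:
            doc_freq.update(set(toks))
        n = len(self.ids)
        self.idf = {term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in doc_freq.items()}

    def score(self, query_terms, i):
        length = len(self.tokens[i])
        total = 0.0
        for term in query_terms:
            f = self.tf[i].get(term, 0)
            if f:
                norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=5):
        terms = tokenize(query)
        scored = [(self.score(terms, i), doc_id) for i, doc_id in enumerate(self.ids)]
        hits = sorted((s, d) for s, d in scored if s > 0)
        hits.sort(key=lambda pair: -pair[0])  # stable: ties stay in doc-id order
        return [doc_id for _, doc_id in hits[:top_k]]

# Placeholder summaries; real entries would follow the structured .md schema
docs = {
    "aztec_software_2007.md": "Aztec Software 2007 landmark comparable-selection case",
    "juniper_networks_2022.md": "Juniper Networks 2022 cherry-picking TPO methodology",
    "netflix_india_2024.md": "Netflix India 2024 digital services TP recent",
    "sony_india_2023.md": "Sony India 2023 marketing intangibles",
}
index = BM25Index(docs)
print(index.search("Aztec Software comparable selection"))
```

Because scoring is driven by exact term overlap, a query containing a party name lands on that case's file and nothing else, which is the lookup-like behaviour described above.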

The retrieval pipeline feeds Meera a set of numbered context blocks. She’s instructed to cite by block number. After generation, the system replaces block numbers with proper citations. This prevents the hallucination pattern where the model sees a case name in context and invents a slightly different citation for a related point.
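The block-number substitution can be sketched in a few lines. This assumes Meera's draft marks citations as `[1]`, `[2]`, and so on, and that the retrieval step hands back a mapping from block number to citation string; `resolve_citations` and the draft text are hypothetical, and the two citation strings are the ones quoted earlier in this post.

```python
import re

def resolve_citations(draft, blocks):
    """Replace [n] markers with citations from retrieved blocks; collect markers with no block."""
    unmatched = []

    def swap(match):
        number = int(match.group(1))
        if number in blocks:
            return f"({blocks[number]})"
        unmatched.append(number)
        return f"[UNVERIFIED CITATION {number}]"

    resolved = re.sub(r"\[(\d+)\]", swap, draft)
    return resolved, unmatched

retrieved_blocks = {
    1: "Aztec Software & Technology Services v. ACIT (ITAT Bangalore, 2007)",
    2: "Infosys BPO Ltd. v. DCIT (ITAT Bangalore, 2015)",
}
draft = "The comparable selection argument relies on [1] and [2]. A further point cites [7]."

resolved, unmatched = resolve_citations(draft, retrieved_blocks)
print(resolved)
print("needs review:", unmatched)
```

The model never writes a case name itself in this scheme, so a marker that points at a block that was never retrieved is easy to spot and flag.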

What Actually Fails

Being honest about failure modes matters here because these outputs go into legal documents.

Hallucinated case citations. Even with the “only cite what’s in context” instruction, Meera will occasionally produce a plausible-sounding case name for a point that isn’t in the RAG context. It happens most often when the context has three cases about related topics and she synthesizes a fourth. Mitigation: every citation in the output is checked against the RAG retrieval log. Unmatched citations are flagged red in the review UI. Unmatched rate: about 4% of Meera’s citations in my test set.

Over-confident Exclusion Counsel. Shilpa is calibrated to find weaknesses. On some comparables, there are no strong exclusion arguments — the comparable is genuinely clean. Shilpa will still find something to say, and it won’t always be worth saying. The arbitrator (Meera-as-Partner) is supposed to filter this, but when both sides produce weak arguments, the verdict confidence score is low (< 0.6) and the system flags it for human review. Don’t skip the confidence threshold.
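The confidence gate itself is tiny, which is part of why it is tempting to skip. A minimal sketch, assuming verdicts are available as simple records: the `Verdict` dataclass is a hypothetical Python mirror of the arbitrator's output, the comparable names are placeholders, and 0.6 is the threshold mentioned above.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.6  # verdicts below this go to a human

@dataclass
class Verdict:
    comparable: str
    decision: str      # "ACCEPT" | "REJECT"
    confidence: float  # 0.0-1.0

def split_for_review(verdicts, threshold=REVIEW_THRESHOLD):
    """Separate verdicts that can stand from those that need human review."""
    confident, needs_review = [], []
    for verdict in verdicts:
        if verdict.confidence < threshold:
            needs_review.append(verdict)
        else:
            confident.append(verdict)
    return confident, needs_review

verdicts = [
    Verdict("Alpha Ltd", "ACCEPT", 0.85),
    Verdict("Beta Ltd", "REJECT", 0.55),
    Verdict("Gamma Ltd", "ACCEPT", 0.60),
]
confident, needs_review = split_for_review(verdicts)
print([v.comparable for v in needs_review])
```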

Vikram on unusual transactions. Vikram is weakest on transactions he hasn’t seen described in his system prompt context — management fees, brand royalties, cost-sharing arrangements. His economic arguments become generic (“the margin reflects the risk-free rate plus a premium”) and add little. For non-TNMM cases, I currently disable Vikram and use a specialised prompt instead.

The favor_tpo stress-test mode producing unusable output. favor_tpo is useful for finding weak spots before a hearing. But the adversarial tone has occasionally been confused for the filing version when someone ran the wrong mode. The output now has a large red header: “STRESS-TEST MODE — NOT FOR FILING.” Belt and suspenders.

Human review is non-negotiable before filing. This system drafts. It does not file. The final Section 3(11) reply must be reviewed by a qualified TP professional who verifies every citation, checks every factual claim against the source documents, and takes professional responsibility for the content. Aura TP cuts the drafting time from ~40 hours to ~4 hours. The 4 hours of professional review is not optional.

The Cost Math

A full TPO defense run — all 4 agents plus RAG retrieval — looks like this:

Input tokens per run:
  Rajeev:  TPO notice (~3,000) + TP study excerpt (~4,000) + persona (~600)  = ~7,600
  Shilpa:  Same inputs, different persona                                     = ~7,600
  Vikram:  Financial data (~2,000) + industry context (~2,000) + persona (~500) = ~4,500
  Meera:   All 3 outputs (~6,000) + RAG context (~3,000) + persona (~800)    = ~9,800

Total input:  ~29,500 tokens
Output:       ~8,000-12,000 tokens (full rebuttal draft)

At Gemini 1.5 Pro pricing (May 2026):
  Input:   29,500 tokens x Rs 0.001/token  ~= Rs 30
  Output:  10,000 tokens x Rs 0.003/token  ~= Rs 30

Full defense run: ~Rs 60-80

Compare that to a traditional TP defense: ₹20–50 lakh for a full study, ₹3–8 lakh for a TPO defense rebuttal drafted by a Big 4 team. The AI draft gets you a defensible first draft for under ₹100. The human review and sign-off time drops from 40 hours to 4 hours for a partner billing at ₹15,000–20,000/hour: that’s ₹60,000–80,000 instead of ₹6–8 lakh. Total cost of a TPO defense with Aura TP: roughly ₹1–1.5 lakh, budgeting ₹1 lakh for partner review and ₹500 for API costs (enough for six to eight full runs at ₹60–80 each).
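The arithmetic is simple enough to keep in a script and rerun when token counts or prices change. This sketch uses only the figures quoted in this section; the function names are hypothetical and the per-token rates are the approximate rupee rates from the table, not an official price list.

```python
# Approximate input tokens per agent, from the table above
INPUT_TOKENS = {"rajeev": 7_600, "shilpa": 7_600, "vikram": 4_500, "meera": 9_800}
INPUT_RATE_RS = 0.001   # Rs per input token (approximate)
OUTPUT_RATE_RS = 0.003  # Rs per output token (approximate)

def run_cost_rs(output_tokens):
    """API cost in rupees for one full defense run."""
    total_input = sum(INPUT_TOKENS.values())
    return total_input * INPUT_RATE_RS + output_tokens * OUTPUT_RATE_RS

def review_cost_rs(hours, hourly_rate_rs):
    """Partner review cost in rupees."""
    return hours * hourly_rate_rs

print("total input tokens:", sum(INPUT_TOKENS.values()))
for output_tokens in (8_000, 10_000, 12_000):
    print(output_tokens, "output tokens ->", round(run_cost_rs(output_tokens), 1), "Rs")
print("4h review at Rs 15,000-20,000/h:", review_cost_rs(4, 15_000), "to", review_cost_rs(4, 20_000))
print("40h review at Rs 15,000-20,000/h:", review_cost_rs(40, 15_000), "to", review_cost_rs(40, 20_000))
```

Whatever the exact output length, the API line stays in the tens of rupees while the review line is in the tens of thousands, so the review hours are the number that matters.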

That math holds for routine IT/ITES cases. Complex cases — intangibles, financial transactions, restructurings — still require significantly more partner time. The system is not trying to replace the senior partner. It’s trying to get the first 80% done before that person opens the file.

The real cost saving isn’t the API bill. It’s that a junior associate can now run the debate, review the output, flag the low-confidence comparables, and hand the partner a structured 8-page document instead of a stack of PDFs. Partner time drops from 12 hours to 2 hours on a routine case. At Big 4 Mumbai billing rates, that’s the actual ROI.

What I’d Change

A few things I’d do differently from the start:

Confidence scores on every agent output, not just the arbitrator. If Inclusion Counsel’s confidence is 0.4 and Exclusion Counsel’s is 0.9, that’s information. The arbitrator currently doesn’t see these scores explicitly — it infers strength from argument quality. Making confidence explicit would let the arbitrator weigh the cases more systematically.

Structured evidence IDs instead of free-text citations. Right now, Rajeev and Shilpa reference source documents by filename. Meera then has to re-locate those references in her synthesis. A structured evidence ID system — [E-001] maps to a specific page and paragraph in the TPO order — would make the evidence chain auditable and reduce the chance of Meera slightly misquoting a source.

A feedback loop from ITAT outcomes. The system has no mechanism to learn that “Aztec Software” is strong on IT services comparables but weak on BPO cases. ITAT outcome tracking would let me weight knowledge-base entries by how often they survive judicial scrutiny. That’s a 6-month project, not a weekend one.

The comparable debate architecture is working. The TPO defense drafts are reviewable and useful. The hallucination rate on citations is low enough that the review workflow catches them reliably. What remains is the longer-term problem: building a system that improves as outcomes come in, rather than staying static on a snapshot of case law from the time I indexed it.

Transfer pricing is adversarial by design: the taxpayer and the TPO are on opposite sides. An adversarial multi-agent system is a natural fit for a domain where the best answer emerges from pressure-testing both positions before anyone commits to one on paper.