May 29, 2026 · 13 min read

AI Governance in the Era of Autonomous Agents

[Figure: EU AI Act risk classification tiers]

The EU AI Act's high-risk compliance obligations are scheduled to apply from August 2026. Its prohibited practices have been banned since February 2025. The fines are real: up to €35M or 7% of global annual turnover, whichever is higher, for prohibited AI practices; up to €15M or 3% for high-risk non-compliance.
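To make the ceiling concrete, here is a minimal Python sketch of the arithmetic. It assumes the "whichever is higher" reading that applies to larger undertakings, and the turnover figures and function name are hypothetical.

```python
def fine_ceiling_eur(global_turnover_eur: int, fixed_cap_eur: int, turnover_pct: int) -> int:
    """Return the higher of the fixed cap and the given percentage of turnover."""
    return max(fixed_cap_eur, global_turnover_eur * turnover_pct // 100)


# Hypothetical company with EUR 2 billion in global annual turnover.
large_turnover = 2_000_000_000
prohibited_ceiling = fine_ceiling_eur(large_turnover, 35_000_000, 7)
high_risk_ceiling = fine_ceiling_eur(large_turnover, 15_000_000, 3)

# Hypothetical company with EUR 100 million in turnover: the fixed cap dominates.
small_turnover = 100_000_000
small_prohibited_ceiling = fine_ceiling_eur(small_turnover, 35_000_000, 7)

print(prohibited_ceiling, high_risk_ceiling, small_prohibited_ceiling)
```

For the EUR 2 billion company the percentage dominates (EUR 140M and EUR 60M). For the smaller one the fixed EUR 35M figure does.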

Here is the number that should be on every board agenda: 40% of enterprise AI systems still can't be classified by risk tier (appliedAI, 2026). Not "haven't been classified yet." Can't be classified — because no one documented what data they were trained on, what they decide, or who they affect.

That's where we are.

The EU AI Act Obligation Stack

The regulation is structured as four concentric rings of obligation.

Prohibited practices (Article 5) are the hard stops. These have been banned outright since February 2025:

  • Real-time remote biometric identification in public spaces by law enforcement (with narrow exceptions for imminent threats)
  • Social scoring systems that evaluate citizens and deny them services based on general behavior
  • AI that exploits vulnerabilities of specific groups — children, people with disabilities, the economically precarious
  • Subliminal manipulation techniques that bypass conscious decision-making
  • Emotion recognition in workplace and education settings

These aren't ambiguous. If your system does any of these, it's illegal in the EU regardless of technical sophistication or business rationale.

High-risk systems (Articles 6–7 and Annex III) are the compliance-heavy tier. The categories:

  • Biometric categorization systems (not just identification — categorization by race, political opinion, religion)
  • Critical infrastructure: power grids, water systems, traffic management
  • Education and vocational training: admission decisions, assessment of students
  • Employment and HR: CV screening, interview evaluation, promotion decisions
  • Essential services: credit scoring, insurance risk assessment, social benefit eligibility
  • Law enforcement: crime prediction, polygraphs, risk assessment of individuals
  • Migration and border control: asylum application assessment, travel authorization
  • Justice: influencing judicial decisions or dispute resolution

For high-risk systems, you need:

  • A conformity assessment before deployment
  • Technical documentation (Article 11)
  • A risk management system that runs throughout the lifecycle (Article 9)
  • Data governance covering training data quality and bias (Article 10)
  • Logging and record-keeping so decisions can be audited (Article 12)
  • Human oversight measures: actual mechanisms by which a human can override the system (Article 14)
  • Accuracy, robustness, and cybersecurity standards (Article 15)

Limited-risk systems (Article 50) have disclosure requirements. Chatbots must tell users they're talking to AI. AI-generated content — images, audio, video — must be labeled as such. This is where most consumer-facing AI lives right now.

Minimal-risk covers everything else: spam filters, recommendation engines, content moderation tools. No specific obligations beyond the general requirement not to violate other EU law.

The compliance trap most organizations fall into: they build a chatbot for HR questions, treat it as "minimal risk," and then it starts answering questions about performance reviews and promotion criteria. At that point it's an employment decision system. It's high-risk. And none of the Article 9–15 obligations were ever satisfied.
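The trap is easier to see in code. The sketch below assumes classification is re-run against what a system is observed doing, not what it was launched to do. The use tags and the `SystemProfile` type are hypothetical, and the real test is a legal one, not a string match.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    PROHIBITED = "prohibited"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"


# Hypothetical use tags, loosely following the categories listed above.
PROHIBITED_USES = {"social_scoring", "workplace_emotion_recognition"}
HIGH_RISK_USES = {"cv_screening", "promotion_decisions", "credit_scoring"}


@dataclass
class SystemProfile:
    name: str
    observed_uses: set[str] = field(default_factory=set)
    talks_to_humans: bool = False


def classify(profile: SystemProfile) -> Tier:
    """Return the strictest tier triggered by any observed use."""
    if profile.observed_uses & PROHIBITED_USES:
        return Tier.PROHIBITED
    if profile.observed_uses & HIGH_RISK_USES:
        return Tier.HIGH
    if profile.talks_to_humans:
        return Tier.LIMITED  # disclosure obligations only
    return Tier.MINIMAL


bot = SystemProfile("hr-helpdesk-bot", {"leave_policy_faq"}, talks_to_humans=True)
tier_at_launch = classify(bot)

# Six months later the same bot is answering promotion questions.
bot.observed_uses.add("promotion_decisions")
tier_after_scope_creep = classify(bot)
```

Nothing about the bot's code changed between the two calls. Only its observed use did, and that is what moves the tier.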

NIST AI RMF in Practice

The NIST AI Risk Management Framework (AI RMF 1.0) is voluntary. But "voluntary" is doing a lot of work in that sentence — it's increasingly referenced in US government contracts, cited in regulatory guidance, and used as a benchmark in litigation to establish what "reasonable" AI governance looks like.

The framework has four functions: Govern, Map, Measure, Manage.

Govern means establishing the organizational culture, policies, and accountability structures. Who owns AI risk? What's the escalation path when a model misbehaves? Are there documented AI principles that actual product decisions get checked against, or just a PDF that lives on the intranet?

In practice, most organizations are stuck at the PDF stage. The governance function should produce: an AI inventory (every production AI system, not just the LLM-powered ones), a defined risk taxonomy, clear ownership, and a documented escalation process for high-risk decisions.

Map means identifying AI risks in context. For a given system: what are the potential failure modes? Who are the affected populations? What happens if the model drifts? What's the blast radius if it fails silently vs. loudly?

This is where the 40% classification problem lives. You can't map risk on systems you haven't inventoried. And you can't inventory systems that were deployed as "experiments" three years ago and never documented.

Measure means actually quantifying risks. This requires metrics — not just "we think it's working." Bias metrics across demographic slices. Accuracy on distribution shift. Confidence calibration (does the model's stated confidence match its actual accuracy?). The NIST framework doesn't prescribe specific metrics; it requires that you select appropriate ones and track them.
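Two of those metrics fit in a few lines. This is a minimal sketch, not a NIST-prescribed method: it assumes confidences lie in [0, 1], uses equal-width bins for calibration, and both function names are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / len(confidences)) * abs(avg_conf - accuracy)
    return ece


def positive_rate_ratio(group_outcomes, reference_outcomes):
    """Ratio of positive-outcome rates between a group and a reference group."""
    group_rate = sum(group_outcomes) / len(group_outcomes)
    reference_rate = sum(reference_outcomes) / len(reference_outcomes)
    return group_rate / reference_rate


# A model that says 90% on four predictions but gets only three right.
overconfidence_gap = expected_calibration_error([0.9] * 4, [True, True, True, False])

# Hypothetical approvals (1) and denials (0) for two demographic slices.
slice_ratio = positive_rate_ratio([1, 1, 1, 0], [1, 1, 1, 1])
```

The first value is the gap between stated confidence and accuracy (about 0.15 here). The second is 0.75, meaning the first slice is approved at three quarters the rate of the reference slice.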

Manage means taking action on what you measure. This is the feedback loop: detect drift, retrain or rollback, document the decision. For high-stakes systems, this means automated monitoring with human review at defined thresholds.
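One way to wire that loop is a Population Stability Index check on score distributions. The sketch below is illustrative: the 0.2 alert level matches the model card later in this post, the 0.1 "monitor" level is a common rule of thumb assumed here, and the action names are hypothetical.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a baseline and a current distribution over the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Clamp to eps so empty bins do not produce log(0) or division by zero.
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        psi += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return psi


def drift_action(psi, monitor_threshold=0.1, alert_threshold=0.2):
    """Map a PSI value to an action; thresholds are assumptions to tune per system."""
    if psi >= alert_threshold:
        return "escalate_to_human_review"
    if psi >= monitor_threshold:
        return "monitor"
    return "no_action"


baseline_bins = [50, 30, 20]   # score distribution at validation time
current_bins = [20, 30, 50]    # score distribution this quarter
psi = population_stability_index(baseline_bins, current_bins)
action = drift_action(psi)
```

The point of returning an action instead of a number is that the check ends in a decision someone has to document, not a dashboard tile.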

The gap between citing NIST AI RMF and actually implementing it is whether your Measure and Manage functions generate real decisions or just reports.

[Figure: IT governance vs AI governance gap analysis]

The Agentic Gap

Traditional IT governance was built for deterministic software. You write code. Code is reviewed. Code is deployed. The behavior of the deployed code matches the behavior of the reviewed code. If something breaks, you look at the stack trace.

Autonomous agents don't work this way.

An agentic system — a system that orchestrates multiple LLM calls, uses tools, spawns sub-agents, maintains state across a session — has emergent behavior. The same prompt produces different outputs on different invocations. The system can develop strategies across a conversation that weren't anticipated in any individual component's design. A sub-agent that was individually safe can behave unexpectedly when given context accumulated by a parent agent.

The specific problems that existing governance frameworks don't address:

Model drift. LLM providers update their models, sometimes without announcement. GPT-4 in November 2023 is not the same model as GPT-4 in August 2024. The system you validated is not the system you're running. SR 11-7 (more on this below) handles this with challenger model validation — but that process was designed for models you own and control, not APIs where the model can change under you.

Non-reproducible outputs. When a deterministic system produces a wrong answer, you can reproduce the failure, trace it, and fix it. When an LLM produces a wrong answer, you often can't reproduce the exact failure — temperature, sampling, and context accumulation make each invocation unique. You need to log everything.
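"Log everything" can start as a single record per invocation. The sketch below is one possible shape, assuming JSON-serializable inputs; the field names and the `build_log_record` helper are hypothetical, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(record: dict) -> str:
    """Stable SHA-256 over the record's JSON form (keys sorted)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_log_record(model_id, model_version, prompt, sampling_params, output,
                     retrieved_context=()):
    """Capture everything needed to explain one invocation after the fact."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "model_version": model_version,      # pin the exact version you called
        "prompt": prompt,
        "sampling_params": dict(sampling_params),
        "retrieved_context": list(retrieved_context),
        "output": output,
    }
    record["sha256"] = _digest(record)
    return record


def is_untampered(record: dict) -> bool:
    """Recompute the digest over everything except the stored hash."""
    body = {key: value for key, value in record.items() if key != "sha256"}
    return record.get("sha256") == _digest(body)


record = build_log_record(
    "support-agent", "2026-03-01", "Can I get a refund?",
    {"temperature": 0.7}, "Yes, within 30 days.",
    ["refund-policy.md#section-2"],
)
```

A hash stored next to the record only catches accidental edits. For evidentiary use you would keep the digests in a separate, append-only store.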

Tool call surfaces. An agent with access to email, calendar, and CRM can do things that are individually authorized by each tool integration but collectively constitute something no one approved. The governance question isn't "can it access the CRM?" — it's "can it combine CRM data with email and make a decision that affects a customer without human review?"
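That question can be encoded as a session-level policy check. This is a sketch under assumptions: the tool names are hypothetical, and which combinations need sign-off is a policy decision, not something the code can derive.

```python
# Hypothetical tool names. Each combination below needs human sign-off even
# though every tool in it is individually authorized.
REVIEW_REQUIRED_COMBINATIONS = [
    frozenset({"crm.read", "email.send"}),
    frozenset({"crm.write", "email.read"}),
]


def combinations_needing_review(tools_used_in_session):
    """Return every flagged combination fully covered by the session's tool calls."""
    used = set(tools_used_in_session)
    return [sorted(combo) for combo in REVIEW_REQUIRED_COMBINATIONS if combo <= used]


read_only_session = combinations_needing_review(["crm.read", "calendar.read"])
outreach_session = combinations_needing_review(["crm.read", "calendar.read", "email.send"])
```

The check runs on the accumulated set of calls in a session, because the risk is in the combination, not in any single call.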

Sub-agent spawning. Multi-agent systems where a top-level orchestrator creates sub-agents dynamically create accountability chains that don't fit neatly into any existing approval framework. Who approved the sub-agent? The orchestrator? The team that deployed the orchestrator? The vendor who built the orchestrator framework?
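One pragmatic answer is to make every spawned agent inherit accountability from the nearest ancestor a human approved. The sketch below assumes a registry keyed by agent ID; the `AgentRecord` type and the names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRecord:
    agent_id: str
    parent_id: Optional[str]     # None for a top-level agent
    approved_by: Optional[str]   # named human, or None if spawned dynamically


def accountable_owner(agent_id, registry):
    """Walk up the spawn chain to the nearest human approval, if there is one."""
    seen = set()
    current = registry.get(agent_id)
    while current is not None and current.agent_id not in seen:
        if current.approved_by:
            return current.approved_by
        seen.add(current.agent_id)  # guard against cyclic parent links
        current = registry.get(current.parent_id) if current.parent_id else None
    return None


registry = {
    "orchestrator": AgentRecord("orchestrator", None, "jane.doe"),
    "researcher": AgentRecord("researcher", "orchestrator", None),
    "emailer": AgentRecord("emailer", "researcher", None),
    "orphan": AgentRecord("orphan", None, None),
}
emailer_owner = accountable_owner("emailer", registry)
orphan_owner = accountable_owner("orphan", registry)
```

An agent that resolves to `None` has no accountable human anywhere in its chain, which is a reasonable condition on which to refuse to run it.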

Explainability failure. LIME and SHAP work reasonably well for single-model decisions — they perturb inputs and observe output changes. They largely fail for multi-step agentic chains. When the decision is the product of five sequential LLM calls with tool use, there is no clean attribution path back to input features. The explainability tools that satisfy EU AI Act's Article 13 (transparency) for a credit scoring model don't work for an agent that made a credit decision across multiple hops.

No current EU AI Act article specifically covers agentic behavior. The regulation was drafted for models. The industry is deploying agents. That gap will be filled — by guidance, enforcement action, or both. The organizations that are building governance infrastructure for agents now will be better positioned than those waiting for the regulation to catch up.

SR 11-7 is from 2011 and was written for risk models like credit scoring. But it's currently the best applicable framework for AI model validation that has teeth — US banks are applying it to LLMs right now. A challenger model validation process for your production LLM isn't paranoia; it's what a model risk examiner will ask for.

The Accountability Chain

When an autonomous agent makes a wrong decision, who is liable?

The Air Canada case (BC Civil Resolution Tribunal, 2024) is the clearest precedent available. Air Canada's chatbot told a passenger that he could book a full-fare bereavement flight and then apply for a discounted rate retroactively — a policy that didn't actually exist. The airline argued the chatbot was "a separate legal entity" responsible for its own information. The tribunal rejected this categorically: companies are responsible for information their AI systems provide to customers.

This seems obvious. But consider what it means at scale: every enterprise AI system that customer-facing employees, customers, or partners interact with is generating statements the organization can be held to. The chatbot that tells a customer their claim is denied. The AI assistant that tells a job applicant their skills don't match the role. The pricing engine that quotes a rate.

LLM terms of service are not a defense. Every major LLM API agreement explicitly disclaims liability for outputs. OpenAI, Google, Anthropic — their contracts transfer responsibility for deployment decisions entirely to the operator. You built it, you deployed it, you're responsible for what it says.

The Uber Freight AI pricing lawsuit is the second case worth knowing. Uber Freight's dynamic pricing algorithm allegedly made pricing decisions that violated freight broker regulations. The lawsuit is ongoing, but the theory is the same: the algorithm acted as an agent of the company, and the company is liable for its decisions.

COMPAS — the recidivism prediction algorithm used in US criminal sentencing since the 2000s — remains the long-running case study in high-risk AI accountability failure. ProPublica's 2016 analysis found Black defendants were nearly twice as likely to be falsely flagged as higher risk. COMPAS was not trained on protected characteristics, but it was trained on factors correlated with them. The bias was emergent. The accountability was absent. It's still being cited in AI governance discussions a decade later because the accountability chain was never resolved.

For enterprise AI deployers, the accountability chain needs to be explicit before deployment:

  • Who approved the deployment decision?
  • Who can override an AI decision in real time?
  • Where are the inference logs, and how long are they retained?
  • What's the process when a customer disputes an AI-generated decision?

These aren't rhetorical questions. They're what a regulator or a plaintiff's attorney will ask.

What Boards Actually Need to Do

Not a compliance checklist. Specific actions.

1. Build an AI inventory. Every AI system in production — your own and third-party integrations. For each: what does it decide or influence? Who does it affect? What data does it use? What jurisdiction's users does it touch? This is the prerequisite for everything else. You cannot classify what you haven't catalogued.
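An inventory entry can be as small as the four questions above. In this sketch the field names are illustrative, and an entry with gaps is reported as unclassifiable instead of being guessed at.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class InventoryEntry:
    system_name: str
    decides_or_influences: Optional[str] = None
    affected_people: Optional[str] = None
    data_used: Optional[str] = None
    user_jurisdictions: Optional[str] = None


def missing_fields(entry: InventoryEntry) -> list[str]:
    """List the questions nobody has answered yet for this system."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) in (None, "")]


def can_be_classified(entry: InventoryEntry) -> bool:
    return not missing_fields(entry)


# Hypothetical system deployed as an experiment and never documented.
screener = InventoryEntry("resume-screener", decides_or_influences="Shortlists applicants")
gaps = missing_fields(screener)
```

Here `gaps` names the three unanswered questions, which is a more useful output than a blank risk tier.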

2. Risk-tier classify everything. Apply the EU AI Act tiers (or NIST AI RMF risk categories) to every system in the inventory. Do this with input from legal and the business owners who deployed each system, not just engineering. The HR team knows if their ATS is screening resumes; engineering may not know that means it's a high-risk system.

3. Produce model cards for all production models. A model card documents: what the model does, what data it was trained on, performance metrics across demographic slices, known limitations, intended use cases, and prohibited use cases. If you're using third-party models (GPT-4, Gemini, Claude), document how you're using them and the constraints you've applied. This is the core of what technical documentation under EU AI Act Article 11 asks for, though the Act's Annex IV lists further items for high-risk systems.

4. Implement 90-day inference log retention minimum. For any system making decisions that affect individuals (approvals, denials, recommendations, scoring), retain the inputs, outputs, and any retrieval context for at least 90 days. This is the minimum audit trail that makes post-incident review possible. For high-risk systems under the EU AI Act, automatic logging is legally required under Article 12, and the Act's retention floor for those logs is six months (Articles 19 and 26), so treat 90 days as the baseline for everything below that tier.

5. Build human escalation paths for high-risk decisions. Not theoretical override mechanisms — actual workflows. For credit decisions: a process where a declined applicant can request human review within 24 hours. For HR screening: a process where a recruiter can override an AI shortlist with documented reasoning. EU AI Act Article 14 requires "effective oversight" — oversight requires a human who can actually intervene.
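A workflow like that has two small pieces: routing borderline cases to a person, and refusing overrides that come without reasoning. The sketch below borrows the illustrative 400 and 750 thresholds from the model card later in this post; treating both boundary scores as human-review cases is an assumption, as are all the names.

```python
from dataclasses import dataclass


def route_decision(score: int, auto_approve_above: int = 750,
                   auto_decline_below: int = 400) -> str:
    """Send clear cases down the automated path and everything else to a human."""
    if auto_decline_below > auto_approve_above:
        raise ValueError("decline threshold must not exceed approve threshold")
    if score > auto_approve_above:
        return "auto_approve"
    if score < auto_decline_below:
        return "auto_decline"
    return "human_review"


@dataclass(frozen=True)
class Override:
    decision_id: str
    reviewer: str
    new_outcome: str
    reason: str


def record_override(decision_id: str, reviewer: str, new_outcome: str, reason: str) -> Override:
    """An override without documented reasoning is rejected, not silently stored."""
    if not reason.strip():
        raise ValueError("an override needs documented reasoning")
    return Override(decision_id, reviewer, new_outcome, reason.strip())


borderline = route_decision(750)
override = record_override("app-1042", "recruiter-7", "shortlist",
                           "Relevant experience listed under a non-standard title")
```

The override is a record, not a flag: who intervened, on which decision, and why.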

6. Run quarterly model performance reviews. For every production model: check accuracy metrics against baseline, check for performance degradation on defined demographic slices, check if the model's operating context has drifted (new data types, new user populations, new use cases that weren't in the original design). This is the Measure function of NIST AI RMF operationalized.
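The per-slice check in that review can be sketched as a comparison against baseline. The 0.05 tolerance mirrors the Gini alert threshold in the model card later in this post; the metric values and the function name are hypothetical.

```python
def degraded_slices(baseline: dict, current: dict, max_drop: float = 0.05) -> list[str]:
    """Return slices whose metric fell by more than max_drop, or went unmeasured."""
    flagged = []
    for slice_name, baseline_value in baseline.items():
        current_value = current.get(slice_name)
        if current_value is None or baseline_value - current_value > max_drop:
            flagged.append(slice_name)
    return sorted(flagged)


# Hypothetical Gini values at validation time and this quarter.
baseline_gini = {"overall": 0.68, "age_18_25": 0.64, "thin_file": 0.55}
current_gini = {"overall": 0.66, "age_18_25": 0.57}

needs_review = degraded_slices(baseline_gini, current_gini)
```

The overall number barely moved here, yet two slices are flagged: one degraded and one was never measured this quarter. That is the case a headline accuracy figure hides.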

7. Appoint an AI risk owner. Not the CTO. Not "the AI team." A named individual who is accountable for AI risk across the organization, has authority to pause deployments, and reports to the board. In regulated industries this person will be the point of contact for regulatory examination.

India Context

India's regulatory environment for AI is distinct from the EU — and moving faster than most organizations realize.

The Digital Personal Data Protection Act 2023 (DPDP Act) is operational. It governs how personal data is collected, processed, and stored — and any AI system that processes personal data of Indian residents (which is most enterprise AI) has obligations under it. Data principals have rights: to know what data is held, to correction, to erasure. Consent obligations apply. Cross-border transfer rules are in place. The DPDP Act doesn't mention AI specifically, but the obligations attach to AI systems through data processing.

MeitY's AI advisory (March 2024) calls for disclosure labels for AI-generated content on platforms with significant user bases. The advisory isn't binding legislation, but it established the direction: platforms should not let AI-generated content pass as human-created without disclosure. MeitY has since issued further guidance; the direction is toward mandatory disclosure, not voluntary.

AI-specific rules are under development. The MeitY consultation process for AI regulation has been ongoing through 2025–2026. The draft framework follows a risk-based approach influenced by the EU AI Act but adapted for India's context — lower per-capita digital literacy, higher dependency on AI in government services (UIDAI's facial recognition scale, for instance), and different enforcement capacity.

What's different about governing AI in the Indian regulatory environment:

  • Scale of government AI: India runs biometric identity verification at population scale through Aadhaar, alongside national digital platforms such as FASTag and DigiLocker. The biometric components would be classified as high-risk under the EU AI Act, yet they operate in a regulatory environment that's still developing oversight for them.
  • Sectoral fragmentation: RBI has its own model risk guidelines for banking AI. SEBI has guidance for algorithmic trading. IRDAI is developing AI rules for insurance. The frameworks are sector-specific and don't yet form a coherent whole.
  • Contractual pressure before regulation: Indian enterprises doing business with EU customers or US banks are already subject to EU AI Act and SR 11-7 obligations through contractual requirements — before Indian domestic AI regulation is finalized. EU data adequacy decisions and US correspondent banking relationships create governance obligations that flow through supply chains.

For Indian enterprises: DPDP Act compliance is not optional and is active now. EU AI Act applies if you have EU users or EU contracts. SR 11-7 applies if you have US banking customers. The governance infrastructure you build for one of these applies to all of them.

Framework Comparison

| Framework | Scope | Enforceability | Agentic AI coverage |
|---|---|---|---|
| EU AI Act | Risk classification, prohibited uses, conformity assessments | Hard law: €35M penalties | ⚠ Partial: no article covers agentic reasoning |
| NIST AI RMF 1.0 | Govern / Map / Measure / Manage across AI lifecycle | Voluntary: referenced in contracts | ⚠ Partial: model risk only, not tool-call audit |
| ISO/IEC 42001 | AI management system: policies, objectives, controls | Certifiable: growing in procurement | ✗ Weak: governance layer only, no inference controls |
| Fed SR 11-7 | Model risk management: soundness, monitoring | Hard law for US banks | ✗ None: SR 26-2 (Apr 2026) explicitly excludes agents |
| India DPDP Act | Personal data processing, consent, fiduciary obligations | Hard law: rules finalizing | ✗ None: covers data, not AI decision-making |

What a Model Card Looks Like

The governance artifact most organizations are missing. This is a minimal model card in YAML:

model_card:
  model_id: "credit-risk-scorer-v2.1"
  version: "2.1.0"
  last_updated: "2026-03-15"
  owner: "Risk Engineering, [email protected]"

  purpose:
    description: "Scores loan applications for consumer credit risk (0–1000)"
    intended_use: "Internal decisioning for personal loan applications under INR 5L"
    prohibited_use:
      - "Mortgage decisions"
      - "Business loan scoring"
      - "Re-identification of individuals from anonymised datasets"

  data:
    training_cutoff: "2025-12-31"
    training_sources:
      - "Internal repayment history (2018–2025)"
      - "Bureau data: CIBIL, Experian"
    protected_characteristics_excluded: true
    known_proxy_risks:
      - "Postal code may proxy for caste/religion in some regions"

  performance:
    overall_gini: 0.68
    population_parity_ratio:
      gender: 0.97   # ratio of positive rate: female/male
      age_18_25: 0.91
    calibration_error: 0.022

  limitations:
    - "Performance degrades on thin-file applicants (< 6 months bureau history)"
    - "Not validated for NRI applicants with foreign income"
    - "Bureau data latency up to 72h — score may not reflect recent defaults"

  human_oversight:
    auto_approve_threshold: 750
    auto_decline_threshold: 400
    human_review_band: "400–750"
    human_review_sla: "24 hours"
    appeal_process: "Customer can request human review via branch or app within 30 days"

  monitoring:
    psi_alert_threshold: 0.2   # Population Stability Index
    gini_alert_threshold: 0.05  # Gini degradation vs baseline
    review_frequency: "Quarterly"
    last_challenger_model_run: "2026-01-10"
    next_scheduled_review: "2026-06-15"

  regulatory:
    eu_ai_act_risk_tier: "High Risk (Annex III, 5b: creditworthiness)"
    nist_rmf_risk_level: "High"
    sr_11_7_applicable: true
    dpdp_data_category: "Personal data with financial profile"

This isn't theoretical. This is the artifact that anchors your EU AI Act Article 11 technical documentation, your evidence for NIST AI RMF's Measure function, and SR 11-7's conceptual soundness review. Build one for every production model.
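A card is only useful if it is complete, and that is cheap to check in CI. The sketch below assumes the card has already been parsed into a Python dict (for example by a YAML library, not shown) and that the seven sections from the example above are the ones you require. Both are assumptions, not a standard.

```python
REQUIRED_SECTIONS = (
    "purpose", "data", "performance", "limitations",
    "human_oversight", "monitoring", "regulatory",
)


def missing_sections(model_card: dict) -> list[str]:
    """Return required sections that are absent or empty in the parsed card."""
    return [section for section in REQUIRED_SECTIONS if not model_card.get(section)]


# Hypothetical, partially written card (the mapping under `model_card:`).
draft_card = {
    "model_id": "credit-risk-scorer-v2.1",
    "purpose": {"description": "Scores loan applications for consumer credit risk"},
    "data": {"training_cutoff": "2025-12-31"},
    "limitations": [],
}
gaps_in_draft = missing_sections(draft_card)
```

An empty `limitations` list counts as missing on purpose: a model with no documented limitations usually means nobody looked.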

Closing

At GT Bharat, the question I get most from clients isn't about AI capability. The models are capable enough — often more capable than the organizations deploying them. The question is about accountability.

Who signed off on this? Where's the audit trail? If this decision gets challenged, what do we show a regulator?

Those are governance questions. And the answer, in most organizations, is: nobody signed off, there's no audit trail, and we'd have to rebuild the paper trail retroactively.

That's the gap. Not a technology gap — a documentation and accountability gap. The organizations that will navigate the next two years of AI regulation without expensive surprises are the ones that are building the inventory, writing the model cards, instrumenting the inference logs, and naming the humans responsible for escalation today — not when the enforcement notice arrives.

The frameworks exist. NIST AI RMF tells you what to govern. EU AI Act tells you what you're required to do. SR 11-7 tells you what validation looks like. ISO 42001 gives you a certifiable management system. The question is whether governance is treated as a cost center or as the infrastructure that makes AI deployments defensible.

Regulators are not the only audience. Your enterprise customers are asking for AI governance documentation in procurement processes. Your board's D&O insurers are starting to ask about AI risk programs. Your auditors are developing AI-specific testing procedures. The governance infrastructure you build now isn't just about EU AI Act compliance — it's about being the organization that can actually answer the accountability question.

Build the inventory. Write the model cards. Log the inferences. Name the humans.

Tanmay Bohra
Full Stack Engineer at Grant Thornton Bharat. Building high-concurrency systems in Go and TypeScript.