Production reading
87.4%
Groundedness
Is that good?
No threshold. No baseline. No answer.

The Baseline Problem in AI Evaluation.

Frameworks tell you what to track. They don’t tell you what “good enough” looks like.

The gap between the metric and the threshold that makes the metric actionable is the practical heart of AI assurance.

TrustEvals field guide for finance AI teams.

A baseline in AI evaluation is the threshold that makes a metric actionable for a specific use case. It has four components: metric (what you measure), threshold (the number that triggers action), context (use case, population, risk appetite), owner (who authorizes threshold changes). Without all four, you have a number, not a baseline.
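The four components can be sketched as a single record. This is an illustrative sketch, not a TrustEvals schema; all field and variable names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Baseline:
    """Illustrative: a baseline is all four fields together, not just the number."""
    metric: str       # what you measure, e.g. "groundedness"
    threshold: float  # the number that triggers action
    context: str      # use case, population, risk appetite
    owner: str        # named human authorized to change the threshold

# Hypothetical example: the loan-underwriting agent from the next section.
loan_bias = Baseline(
    metric="bias_tolerance",
    threshold=0.05,
    context="loan underwriting; protected classes; fair-lending exposure",
    owner="head-of-credit-risk",
)
```

Drop any one field and the record stops being a baseline and becomes a number.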

The problem

Same question, four different answers.

Four realistic agents. The same metric category. Four different numbers, because context (not the framework) sets the threshold.

01

Marketing-copy agent

Retail brand. Product descriptions.
Metric
Bias tolerance
Threshold
0.25

Loose. Creative copy. Low regulatory impact. Risk appetite is “don’t embarrass us.” QA already catches most issues downstream.

02

Loan-underwriting agent

Regional bank. Credit decisions.
Metric
Bias tolerance
Threshold
0.05

Tight. Fair-lending regulatory exposure. Protected classes in the population. Consumer-protection laws. Reputational cost of a lawsuit. Same metric, different number.

03

Healthcare-triage agent

Health system. Patient inquiries.
Metric
Groundedness SLO
Threshold
Near-zero (safety-critical)

Hallucination near-zero for safety-critical outputs, looser for wayfinding. Groundedness SLO is the primary control, not the generic “hallucination rate.” Context outweighs the metric name.

04

Customer-support agent

SaaS. Tier-1 tickets.
Metric
Data exposure
Threshold
Per-tenant

Depends on tenant isolation guarantees, what the agent can see, contractual commitments. A B2B PII tenant needs different settings than a B2C tenant on published docs.
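Per-tenant thresholds reduce to a lookup keyed on tenant tier. A minimal sketch, assuming two hypothetical tiers; the tier names and numbers are illustrative, not contractual values.

```python
# Hypothetical per-tenant data-exposure thresholds: a B2B tenant with PII
# commitments gets tighter settings than a B2C tenant on published docs.
TENANT_THRESHOLDS = {
    "b2b-pii": 0.0,    # zero tolerance: contractual PII commitments
    "b2c-docs": 0.01,  # looser: agent only sees published documentation
}
DEFAULT_THRESHOLD = 0.0  # fail closed for unknown tenant tiers

def exposure_threshold(tenant_tier: str) -> float:
    """Resolve the data-exposure threshold for a tenant tier."""
    return TENANT_THRESHOLDS.get(tenant_tier, DEFAULT_THRESHOLD)
```

Failing closed on unknown tiers keeps a misconfigured tenant on the strictest setting rather than the loosest.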

ISO 42001 · NIST AI RMF · SR 11-7 · EU AI Act
Why frameworks stop short

Standards don’t define thresholds. And shouldn’t.

NIST AI RMF, ISO 42001, EU AI Act, Singapore Agentic AI Governance, AIUC-1. Each names categories of risk. Each names what to measure. Each refuses, correctly, to name the threshold.

A framework that legislated “bias < 0.10 for all agents” would be too loose for loan underwriting and too strict for marketing copy. The same metric, applied uniformly, is either dangerous or useless depending on context.

The gap is intentional. What frameworks can’t do, the organization has to do: define what “good enough” looks like for its specific deployments, populations, and risk appetite.

What baselines are

A baseline is not a number. It’s a process.

Strip any one of these and you don’t have a baseline. You have a number.

01

Metric.

What you’re measuring.

02

Threshold.

The number that triggers action.

03

Context.

Use case, population, risk appetite.

04

Owner.

The named human authorized to change the threshold.

Most enterprises skip the owner. That is why most “baselines” on the market today are aspirational PDFs instead of operational controls.
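The “strip any one and you have a number” test is mechanical enough to automate. A sketch, assuming baselines arrive as plain dicts; the component names mirror the four above.

```python
REQUIRED_COMPONENTS = ("metric", "threshold", "context", "owner")

def is_baseline(record: dict) -> bool:
    """A record missing any of the four components is a number, not a baseline."""
    return all(record.get(k) not in (None, "") for k in REQUIRED_COMPONENTS)

# A metric and a threshold alone fail the test:
number_only = {"metric": "groundedness", "threshold": 0.874}
```

Note the check treats a threshold of 0.0 as present; only missing or empty fields disqualify a record.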
How to set a baseline

The five-step method we use with customers.

  1. 01

    NAME the use case tightly.

    “Customer-support agent for tier-1 troubleshooting on product X.” Not “AI chatbot.”

  2. 02

    INVENTORY the risks.

    Per AIUC-1’s six categories, NIST AI RMF MEASURE, or your internal framework. The risk categories drive the metric set.

  3. 03

    PICK the metrics, narrowly.

    Groundedness, bias against protected class, tool-call authorization, data-exposure incidents. Fewer is better.

  4. 04

    SET the threshold, with the context.

    Threshold, use-case description, population, risk appetite, business-impact model. Document the reasoning.

  5. 05

    ASSIGN the owner.

    Named human. Authorized to change the threshold based on evidence. Versioned document. Change log accessible.

Step 5 is where most baseline programs die. Without an owner, step 4 turns into a PDF; the PDF turns into a reference nobody updates; the baseline becomes an artifact, not a control.
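Steps 4 and 5 together imply a versioned document whose threshold only the named owner can change, with every change logged. A minimal sketch under those assumptions; the class and field names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class BaselineDoc:
    """Versioned baseline document: owner-authorized changes, logged."""
    metric: str
    threshold: float
    context: str
    owner: str
    version: int = 1
    change_log: list = field(default_factory=list)

    def revise(self, new_threshold: float, authorized_by: str, evidence: str) -> None:
        """Change the threshold: owner-only, evidence-backed, versioned."""
        if authorized_by != self.owner:
            raise PermissionError("only the named owner may change the threshold")
        self.change_log.append(
            (date.today().isoformat(), self.threshold, new_threshold, evidence)
        )
        self.threshold = new_threshold
        self.version += 1
```

The point of the sketch is step 5: without the `owner` gate and the change log, the record is exactly the PDF the paragraph above describes.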

Why continuous

A baseline without continuous measurement is performance art.

The baseline is the threshold. Continuous measurement is what tells you whether the production system is on the right side of it today, this hour, this interaction. Without continuous measurement, a baseline is intent without enforcement.

AI systems are non-deterministic. A system that met its baseline on Monday can drift past it by Thursday. A prompt update, a model refresh, a new corpus, a change in user behavior. Quarterly attestation against a baseline means up to 90 days of undetected drift.

Continuous measurement without a baseline is just noise. “Groundedness 87.4%” with no way to decide whether that’s good or bad. Continuous + baseline together are the full control. Either alone falls short.

A baseline is the threshold that makes the metric actionable. Continuous measurement is what keeps the threshold honest. Both, or neither.
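The per-interaction check is one comparison. A sketch, assuming a score already computed for each production trace; note that for metrics like bias, lower is better, so the direction of the comparison is part of the baseline.

```python
def check_interaction(score: float, threshold: float,
                      higher_is_better: bool = True) -> bool:
    """Judge one production trace against the baseline threshold."""
    if higher_is_better:
        return score >= threshold   # e.g. groundedness
    return score <= threshold       # e.g. bias tolerance

# The 87.4% groundedness reading from the top of the page is only
# answerable relative to a threshold: it passes a 0.85 SLO and
# breaches a 0.95 one.
```

The same number flips between acceptable and alertable depending on the threshold, which is the whole argument of the section.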

FAQ

Common questions on baselines.

What is a baseline in AI evaluation?

A baseline is the threshold that makes a metric actionable for a specific use case. It has four components: metric, threshold, context, owner. Without all four, you have a number, not a baseline. NIST AI RMF and ISO 42001 name the metric categories; baselines name the thresholds.

Who owns a baseline?

A named owner in product, security, or compliance, depending on the use case. The owner authorizes threshold changes based on evidence and is referenced by the policy-as-code layer that enforces the baseline. Step 5 of the five-step method is where most baseline programs die.

How often should a baseline be reviewed?

Continuously evaluated, periodically reviewed. Production traces flow against the threshold every interaction. Formal review of the threshold itself runs quarterly or whenever the use case, population, or risk appetite changes. Model upgrade, prompt revision, new corpus, regulatory shift.

What is the difference between a threshold and a baseline?

A threshold is a number (“bias < 0.05”). A baseline is the threshold plus context plus owner: the metric, the threshold, the use case + population + risk appetite the threshold is grounded in, and the named human authorized to change it. Baselines are operational; thresholds in isolation are aspirational.

Do NIST AI RMF and ISO 42001 define baselines?

Both frameworks specify what to measure (MAP and MEASURE in NIST; clauses 6 to 9 in ISO 42001). Neither specifies thresholds, correctly, because thresholds are use-case-specific. Baselines fill the gap between framework category and operational control.

Where should we start?

Run the five-step method against your highest-risk AI deployment first: name the use case tightly, inventory the risks, pick metrics narrowly, set the threshold with context, assign the owner. One baseline per quarter is faster than waiting for a framework certification.