Counterfactual Explanations
Counterfactuals answer a different question than SHAP attributions: instead of "which phrases mattered most?" they ask "what is the minimum change that would have flipped the agent's decision?"
They are surrogate-driven, cached per span, and reproducible. To run them, open the Counterfactuals tab on any trace detail page (/traces/[id]).
When to use counterfactuals
Use them when SHAP alone doesn't give you an actionable lever:
- A debugging session where you already know which phrase dominated but you need a concrete rewrite that changes the outcome
- Regression root-causing — pin down the smallest prompt delta between two runs that chose different tools
- Policy authoring — generate realistic phrase edits you can add as constitutional rule examples
How they're computed
- The server fetches the user prompt + the SHAP-backed surrogate classifier already trained for this span's ablation run.
- It iterates candidate phrase edits drawn from the surrogate's top-weighted terms, picks the num_candidates that most flip the predicted tool class, and attaches:
  - original_phrase: the text fragment being replaced
  - suggested_phrase: the minimal edit
  - predicted_outcome: free-form string describing the new decision ("Would have called `calculator` instead of `web_search`")
  - confidence: float 0..1 (the UI buckets to high ≥ 0.7, medium ≥ 0.4, low otherwise)
  - Optional explanation: human-readable reasoning
- Results are persisted so subsequent reads are cache hits (source: "cache" on the response).
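The selection and bucketing steps above can be sketched roughly as follows. This is an illustrative sketch, not the server's code: the Counterfactual dataclass and the select_top and confidence_bucket helpers are hypothetical names, and ranking by confidence is an assumed stand-in for "most flip the predicted tool class". Only the field names and the 0.7 / 0.4 thresholds come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Counterfactual:
    original_phrase: str               # text fragment being replaced
    suggested_phrase: str              # the minimal edit
    predicted_outcome: str             # free-form description of the new decision
    confidence: float                  # 0..1
    explanation: Optional[str] = None  # optional human-readable reasoning


def confidence_bucket(confidence: float) -> str:
    """Map a 0..1 confidence onto the UI buckets: high >= 0.7, medium >= 0.4, else low."""
    if confidence >= 0.7:
        return "high"
    if confidence >= 0.4:
        return "medium"
    return "low"


def select_top(candidates: List[Counterfactual], num_candidates: int) -> List[Counterfactual]:
    """Keep the num_candidates strongest edits.

    Assumption: flip strength is approximated by sorting on confidence, descending.
    """
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    return ranked[:num_candidates]
```

With the sample response in the Endpoint section, a confidence of 0.83 would land in the high bucket.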
Opt-in, no surprises
Like ablation, counterfactual compute is opt-in. Nothing runs until you click "Run counterfactual analysis" — the card shows an estimated cost first. This is the same pattern used throughout AuditTrail for LLM-cost-bearing operations.
Empty states
- No prompt text on the span → returns 400 with a clear message. Counterfactuals require an input phrase to rewrite.
- No LLM provider configured → the surrogate still runs (it's local SHAP inference). The explanation field falls back to a deterministic template.
- No candidates flip the decision → response has counterfactuals: [] and the UI shows an explicit empty card.
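One way to picture the three empty states is as branches of a single response builder. The sketch below is hypothetical throughout: build_response, template_explanation, the error message, and the template wording are all invented for illustration. Only the 400 status, the deterministic fallback, and the empty counterfactuals list come from the list above.

```python
from typing import Any, Dict, List, Optional, Tuple


def template_explanation(original: str, suggested: str, outcome: str) -> str:
    """Deterministic fallback text (wording is hypothetical)."""
    return f"Replacing '{original}' with '{suggested}': {outcome}"


def build_response(
    prompt_text: Optional[str],
    candidates: List[Dict[str, Any]],
    llm_configured: bool,
) -> Tuple[int, Dict[str, Any]]:
    """Return (status_code, body) for the empty-state cases."""
    # No prompt text on the span: nothing to rewrite, so reject with 400.
    if not prompt_text:
        return 400, {"error": "span has no prompt text to rewrite"}

    results = []
    for cand in candidates:
        item = dict(cand)
        # Without an LLM provider, fill the explanation from the template.
        if not llm_configured:
            item["explanation"] = template_explanation(
                item["original_phrase"],
                item["suggested_phrase"],
                item["predicted_outcome"],
            )
        results.append(item)

    # When no candidate flips the decision, results is simply an empty list.
    return 200, {"counterfactuals": results}
```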
Endpoint
POST /api/v1/xai/counterfactuals
// Request
{
"span_id": "a3f29c1b-…",
"num_candidates": 3
}
// Response
{
"span_id": "a3f29c1b-…",
"surrogate_f1": 0.912,
"source": "surrogate",
"counterfactuals": [
{
"id": "cf-8821-…",
"original_phrase": "search the web for",
"suggested_phrase": "calculate the ROI of",
"predicted_outcome": "Would have called `calculator` instead of `web_search`",
"confidence": 0.83,
"explanation": "The phrase 'search the web' dominates the web_search weight by 0.41; the rewrite moves the decision boundary.",
"created_at": "2026-04-19T18:22:41Z"
}
]
}
User-scoped (span must belong to a trace the caller owns). Rate limited at 20/minute per client.
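A minimal client-side sketch of the request and response shapes above, using only the Python standard library. The build_request and summarize helpers and the base URL are hypothetical. Authentication is omitted because the scheme is not described here, so a real call would need whatever credentials identify the trace owner.

```python
import json
import urllib.request
from typing import Any, Dict, List

ENDPOINT_PATH = "/api/v1/xai/counterfactuals"


def build_request(base_url: str, span_id: str, num_candidates: int = 3) -> urllib.request.Request:
    """Build (but do not send) the POST request for one span."""
    body = json.dumps({"span_id": span_id, "num_candidates": num_candidates}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + ENDPOINT_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: Dict[str, Any]) -> List[str]:
    """Render one line per counterfactual, noting whether the read was a cache hit."""
    suffix = ", cached" if response.get("source") == "cache" else ""
    lines = []
    for cf in response.get("counterfactuals", []):
        lines.append(
            f"{cf['original_phrase']} -> {cf['suggested_phrase']} "
            f"({cf['confidence']:.2f}{suffix})"
        )
    return lines
```

Sending the request would be a matter of passing it to urllib.request.urlopen and decoding the JSON body. Given the 20/minute limit, a caller looping over many spans would likely want to space requests at least three seconds apart.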
UI mount point
The <CounterfactualPanel /> component mounts as a 5th tab on
/traces/[id] — next to Spans / DAG / Timeline / Sankey. The span
picker defaults to the first LLM or tool span in the trace.
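The span picker's default can be thought of as a first-match scan over the trace's spans. The sketch below assumes each span is a dict with a kind field and that LLM and tool spans are labelled "llm" and "tool". Both are assumptions for illustration, not the actual span schema.

```python
from typing import Dict, List, Optional

ELIGIBLE_KINDS = {"llm", "tool"}  # hypothetical kind labels


def default_span(spans: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the first LLM or tool span in trace order, or None if there is none."""
    return next((s for s in spans if s.get("kind") in ELIGIBLE_KINDS), None)
```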
Shipped in: prod-v2.6.4 — backend + model shipped in V1.3; UI
lands this sprint.