Counterfactual Explanations

Counterfactuals answer a different question than SHAP attributions: instead of "which phrases mattered most?" they surface "what is the minimum change that would have flipped the agent's decision?".

Counterfactuals are surrogate-driven, cached per span, and reproducible. To run them, open the Counterfactuals tab on any trace detail page (/traces/[id]).

When to use counterfactuals

Use them when SHAP alone doesn't give you an actionable lever:

A debugging session where you already know which phrase dominated but you need a concrete rewrite that changes the outcome
Regression root-causing — pin down the smallest prompt delta between two runs that chose different tools
Policy authoring — generate realistic phrase edits you can add as constitutional rule examples

How they're computed

The server fetches the user prompt and the SHAP-backed surrogate classifier already trained for this span's ablation run.
It iterates over candidate phrase edits drawn from the surrogate's top-weighted terms, keeps the num_candidates edits that most strongly flip the predicted tool class, and attaches to each one:
- original_phrase — the text fragment being replaced
- suggested_phrase — the minimal edit
- predicted_outcome — free-form string describing the new decision ("Would have called calculator instead of web_search")
- confidence — float 0..1 (the UI buckets to high ≥ 0.7, medium ≥ 0.4, low otherwise)
- Optional explanation — human-readable reasoning
Results are persisted so subsequent reads are cache hits (source: "cache" on the response).
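
The selection and bucketing steps above can be sketched roughly as follows. This is a minimal illustration, not the server's implementation: the Surrogate callable, rank_edits, bucket, and toy_surrogate are hypothetical names, the only edit type tried is phrase removal, and using the new class's predicted probability as the confidence is an assumption. Only the thresholds (high ≥ 0.7, medium ≥ 0.4) and the field names come from the description above.

```python
from typing import Callable, Dict, List

# Hypothetical surrogate interface: prompt text -> {tool_name: probability}.
Surrogate = Callable[[str], Dict[str, float]]


def bucket(confidence: float) -> str:
    """UI bucketing as documented: high >= 0.7, medium >= 0.4, low otherwise."""
    if confidence >= 0.7:
        return "high"
    if confidence >= 0.4:
        return "medium"
    return "low"


def rank_edits(prompt: str, phrases: List[str], surrogate: Surrogate,
               num_candidates: int = 3) -> List[dict]:
    """Try removing each candidate phrase and keep the edits that flip the tool."""
    base_probs = surrogate(prompt)
    baseline = max(base_probs, key=base_probs.get)
    flips = []
    for phrase in phrases:
        if phrase not in prompt:
            continue
        # Remove the phrase and normalise the leftover whitespace.
        edited = " ".join(prompt.replace(phrase, " ").split())
        probs = surrogate(edited)
        new_tool = max(probs, key=probs.get)
        if new_tool != baseline:
            flips.append({
                "original_phrase": phrase,
                "suggested_phrase": f"remove '{phrase}'",
                "predicted_outcome": f"Would have chosen {new_tool} instead of {baseline}",
                # Assumption: confidence is the probability of the new class.
                "confidence": probs[new_tool],
            })
    flips.sort(key=lambda c: c["confidence"], reverse=True)
    return flips[:num_candidates]


def toy_surrogate(text: str) -> Dict[str, float]:
    """Stand-in classifier for the example only; the real surrogate is XGBoost."""
    if "search the web" in text:
        return {"web_search": 0.9, "calculator": 0.1}
    return {"web_search": 0.17, "calculator": 0.83}
```

Calling rank_edits("search the web for 12 * 7", ["search the web for", "12"], toy_surrogate) keeps only the first phrase, because removing "12" leaves the toy classifier's prediction unchanged.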

Fast, deterministic, no cost

Counterfactuals make no LLM call. They run entirely against the local SHAP surrogate (XGBoost tool classifier), so there is no cost and no estimate step — click "Run counterfactual analysis" and results return in well under a second. The predicted_outcome and explanation fields are deterministic, template-generated text (reproducible run-to-run), not model output.

Nothing runs until you click. That is only to keep the panel from doing work on every trace open; it has nothing to do with spend. Results are cached per span, so re-opening a span is an instant cache hit.

Prompt resolution (why tool spans "just work")

The span you click doesn't have to carry the prompt itself. AuditTrail looks for the user prompt on the span under a narrow set of keys (user_input, prompt, query, input, text) — the same keys the surrogate is trained on. A leaf tool span keyed only by its argument (calculator(expression=…), read_file(path=…)) has no user prompt of its own, so resolution walks up the ancestor chain to the root and borrows the real user prompt from the orchestrator span above it. The tool argument itself is never mistaken for the prompt — counterfactuals rewrite what the user asked, not the tool's internal argument.
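
The ancestor walk described above might look something like this sketch. The span shape (a dict with parent_id and attributes) and the function name resolve_prompt are assumptions made for illustration, as is the order in which the keys are tried. Only the key names themselves come from the text.

```python
from typing import Dict, Optional

# Keys named in the text above; the lookup order is an assumption.
PROMPT_KEYS = ("user_input", "prompt", "query", "input", "text")


def resolve_prompt(span_id: str, spans: Dict[str, dict]) -> Optional[str]:
    """Return the nearest user prompt on the span or its ancestors, else None."""
    seen = set()
    current: Optional[str] = span_id
    while current is not None and current in spans and current not in seen:
        seen.add(current)  # guard against a malformed parent cycle
        span = spans[current]
        attrs = span.get("attributes", {})
        for key in PROMPT_KEYS:
            value = attrs.get(key)
            if isinstance(value, str) and value.strip():
                return value
        # Tool arguments such as "expression" or "path" are not prompt keys,
        # so they are skipped and the walk continues upward.
        current = span.get("parent_id")
    return None


# Hypothetical trace: an orchestrator root with one leaf tool span.
example_spans = {
    "root": {"parent_id": None, "attributes": {"user_input": "what is 12 * 7?"}},
    "calc": {"parent_id": "root", "attributes": {"expression": "12 * 7"}},
    "orphan": {"parent_id": None, "attributes": {"path": "/tmp/notes.txt"}},
}
```

Here resolve_prompt("calc", example_spans) borrows the root's prompt, while resolve_prompt("orphan", example_spans) returns None, which corresponds to the case the API reports as HTTP 400.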

Result status

Every response carries a status field. ok means at least one edit was found; when the candidate list is empty, status distinguishes three separate outcomes instead of leaving you to guess (the UI renders a different card for each status):

ok — one or more edits flip the decision; counterfactuals is populated.
stable — the surrogate agreed with the observed tool choice and no edit flipped it. This is a valid result, not an error: your tool selection is robust for this prompt.
surrogate_untrained — the tool-classification surrogate has not trained yet (it needs ≥ 5 tool spans across your traces). Run a few more agents, then retry.
surrogate_disagrees_baseline — the surrogate's own baseline prediction differs from the observed action, so "the minimal edit to flip this action" isn't well-defined; usually a sign the surrogate is still under-trained on this tool.

Only a span with no prompt text anywhere up its ancestor chain returns HTTP 400 — the genuine "nothing to rewrite" case.
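
A client can branch on status along these lines. This is a sketch only: describe_result is a hypothetical helper and the message wording is illustrative, while the four status values are the ones listed above.

```python
def describe_result(response: dict) -> str:
    """Map a counterfactuals response to a short human-readable summary."""
    status = response.get("status")
    if status == "ok":
        count = len(response.get("counterfactuals", []))
        return f"{count} edit(s) flip the decision"
    if status == "stable":
        return "no edit flips the decision; tool selection is robust for this prompt"
    if status == "surrogate_untrained":
        return "surrogate not trained yet; run a few more agents, then retry"
    if status == "surrogate_disagrees_baseline":
        return "surrogate baseline differs from the observed action; likely under-trained"
    # Fail loudly on anything unexpected rather than treating it as 'stable'.
    raise ValueError(f"unexpected status: {status!r}")
```

Raising on an unknown status keeps an empty list from being silently read as a robust tool choice.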

Endpoint

POST /api/v1/xai/counterfactuals

```json
// Request
{
  "span_id": "a3f29c1b-…",
  "num_candidates": 3
}

// Response
{
  "span_id": "a3f29c1b-…",
  "surrogate_f1": 0.912,
  "source": "surrogate",
  "status": "ok",
  "reason": null,
  "counterfactuals": [
    {
      "id": "cf-8821-…",
      "original_phrase": "search the web for",
      "suggested_phrase": "remove 'search the web for'",
      "predicted_outcome": "Would have chosen calculator instead of web_search",
      "confidence": 0.83,
      "explanation": "Removing the phrase 'search the web for' shifts the surrogate's tool prediction from web_search to calculator, a margin of 83% over the next-best alternative on the original prompt. The surrogate is a tool classifier, not the agent itself, so treat this as a directional hint rather than a guarantee.",
      "created_at": "2026-04-19T18:22:41Z"
    }
  ]
}
```

The endpoint is user-scoped: the span must belong to a trace the caller owns. It is rate limited to 20 requests per minute per client.
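
As a rough illustration, the request could be issued from Python with the standard library alone. The base URL and the Bearer-token Authorization header are assumptions (this page does not describe the auth scheme), and build_request and fetch_counterfactuals are hypothetical helper names. The path and body fields are taken from the example above.

```python
import json
import urllib.request


def build_request(base_url: str, span_id: str, token: str,
                  num_candidates: int = 3) -> urllib.request.Request:
    """Build the POST request for the counterfactuals endpoint."""
    body = json.dumps({"span_id": span_id, "num_candidates": num_candidates})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/xai/counterfactuals",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; substitute your deployment's scheme.
            "Authorization": f"Bearer {token}",
        },
    )


def fetch_counterfactuals(base_url: str, span_id: str, token: str,
                          num_candidates: int = 3) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(base_url, span_id, token, num_candidates)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With urllib, a non-2xx response such as the HTTP 400 for a span with no resolvable prompt surfaces as urllib.error.HTTPError, so callers may want to catch that around fetch_counterfactuals.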

UI mount point

The <CounterfactualPanel /> component mounts as a 5th tab on /traces/[id] — next to Spans / DAG / Timeline / Sankey. The span picker defaults to the first LLM or tool span in the trace.

Shipped in: prod-v2.6.4 — backend + model shipped in V1.3; UI lands this sprint.