Counterfactual Explanations

Counterfactuals answer a different question than SHAP attributions. Instead of "which phrases mattered most?", they ask "what is the minimum change that would have flipped the agent's decision?"

Counterfactuals are surrogate-driven, cached per span, and reproducible. To run them, open the Counterfactuals tab on any trace detail page (/traces/[id]).

When to use counterfactuals

Use them when SHAP alone doesn't give you an actionable lever:

  • Debugging: you already know which phrase dominated, but you need a concrete rewrite that changes the outcome
  • Regression root-causing: pin down the smallest prompt delta between two runs that chose different tools
  • Policy authoring: generate realistic phrase edits you can add as constitutional rule examples

How they're computed

  1. The server fetches the user prompt and the SHAP-backed surrogate classifier already trained for this span's ablation run.
  2. It iterates over candidate phrase edits drawn from the surrogate's top-weighted terms, keeps the num_candidates edits that most strongly flip the predicted tool class, and attaches the following fields to each:
    • original_phrase — the text fragment being replaced
    • suggested_phrase — the minimal edit
    • predicted_outcome — free-form string describing the new decision ("Would have called calculator instead of web_search")
    • confidence — float 0..1 (the UI buckets to high ≥ 0.7, medium ≥ 0.4, low otherwise)
    • Optional explanation — human-readable reasoning
  3. Results are persisted so subsequent reads are cache hits (source: "cache" on the response).
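
The selection and bucketing logic described above can be sketched roughly as follows. This is an illustration only, not the server's actual code: the Candidate class, its predicted_tool field, and the select_counterfactuals helper are hypothetical names, and ranking by confidence is an assumed stand-in for "most flip the predicted tool class". Only the bucket thresholds (0.7 and 0.4) come from this page.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    original_phrase: str
    suggested_phrase: str
    predicted_tool: str  # hypothetical: tool the surrogate predicts after the edit
    confidence: float    # float in the range 0..1


def select_counterfactuals(candidates, current_tool, num_candidates):
    """Keep only edits that change the predicted tool, strongest first."""
    flipping = [c for c in candidates if c.predicted_tool != current_tool]
    flipping.sort(key=lambda c: c.confidence, reverse=True)
    return flipping[:num_candidates]


def confidence_bucket(confidence):
    """Mirror the UI buckets: high >= 0.7, medium >= 0.4, low otherwise."""
    if confidence >= 0.7:
        return "high"
    if confidence >= 0.4:
        return "medium"
    return "low"


candidates = [
    Candidate("search the web for", "calculate the ROI of", "calculator", 0.83),
    Candidate("search the web for", "look up", "web_search", 0.91),  # does not flip
    Candidate("latest", "total", "calculator", 0.45),
]
top = select_counterfactuals(candidates, current_tool="web_search", num_candidates=3)
for c in top:
    print(c.suggested_phrase, confidence_bucket(c.confidence))
```

Note that the second candidate is dropped even though it has the highest confidence, because it leaves the predicted tool unchanged.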

Opt-in, no surprises

Like ablation, counterfactual compute is opt-in. Nothing runs until you click "Run counterfactual analysis" — the card shows an estimated cost first. This is the same pattern used throughout AuditTrail for LLM-cost-bearing operations.

Empty states

  • No prompt text on the span → returns 400 with a clear message. Counterfactuals require an input phrase to rewrite.
  • No LLM provider configured → the surrogate still runs (it's local SHAP inference). The explanation field falls back to a deterministic template.
  • No candidates flip the decision → response has counterfactuals: [] and the UI shows an explicit empty card.
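
A client could map these cases to UI states along the following lines. This is a minimal sketch under stated assumptions: the state labels are made up for illustration, a successful call is assumed to return HTTP 200, and the shape of the 400 error body is not documented here, so the sketch does not inspect it.

```python
def empty_state(status_code, body):
    """Return a short label describing how to render a counterfactual response.

    `body` is the parsed JSON response as a dict (or None on error).
    The returned labels are hypothetical, not part of the API.
    """
    if status_code == 400:
        # Documented case: the span has no prompt text to rewrite.
        return "no_prompt"
    if status_code != 200:
        # Assumption: anything other than 200 is treated as a generic error.
        return "error"
    if not body.get("counterfactuals"):
        # counterfactuals: [] means no candidate flipped the decision.
        return "no_flip"
    return "results"


print(empty_state(200, {"counterfactuals": []}))
```

The missing-LLM-provider case needs no special handling in this sketch, since the response still contains counterfactuals and only the explanation text differs.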

Endpoint

POST /api/v1/xai/counterfactuals

json
// Request
{
  "span_id": "a3f29c1b-…",
  "num_candidates": 3
}
 
// Response
{
  "span_id": "a3f29c1b-…",
  "surrogate_f1": 0.912,
  "source": "surrogate",
  "counterfactuals": [
    {
      "id": "cf-8821-…",
      "original_phrase": "search the web for",
      "suggested_phrase": "calculate the ROI of",
      "predicted_outcome": "Would have called `calculator` instead of `web_search`",
      "confidence": 0.83,
      "explanation": "The phrase 'search the web' dominates the web_search weight by 0.41; the rewrite moves the decision boundary.",
      "created_at": "2026-04-19T18:22:41Z"
    }
  ]
}

The endpoint is user-scoped: the span must belong to a trace the caller owns. Requests are rate limited to 20 per minute per client.
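
For callers that want typed access to the payload, the request and response shown above could be modeled as in the sketch below. The class and function names are hypothetical, the field names follow the sample JSON, and treating explanation and created_at as optional is an assumption based on the field list on this page. No network call is made; authentication and transport are left out because this page does not describe them.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Counterfactual:
    id: str
    original_phrase: str
    suggested_phrase: str
    predicted_outcome: str
    confidence: float
    explanation: Optional[str] = None
    created_at: Optional[str] = None


@dataclass
class CounterfactualResponse:
    span_id: str
    surrogate_f1: float
    source: str  # "cache" on a cache hit, per this page
    counterfactuals: List[Counterfactual]

    @property
    def from_cache(self):
        return self.source == "cache"


def build_request(span_id, num_candidates):
    """Serialize the POST body for /api/v1/xai/counterfactuals."""
    return json.dumps({"span_id": span_id, "num_candidates": num_candidates})


def parse_response(raw):
    """Parse a JSON response string into typed objects."""
    data = json.loads(raw)
    items = [
        Counterfactual(
            id=item["id"],
            original_phrase=item["original_phrase"],
            suggested_phrase=item["suggested_phrase"],
            predicted_outcome=item["predicted_outcome"],
            confidence=float(item["confidence"]),
            explanation=item.get("explanation"),
            created_at=item.get("created_at"),
        )
        for item in data.get("counterfactuals", [])
    ]
    return CounterfactualResponse(
        span_id=data["span_id"],
        surrogate_f1=float(data["surrogate_f1"]),
        source=data["source"],
        counterfactuals=items,
    )


# Sample payload adapted from the response above (identifiers left elided).
sample = json.dumps({
    "span_id": "a3f29c1b-…",
    "surrogate_f1": 0.912,
    "source": "surrogate",
    "counterfactuals": [{
        "id": "cf-8821-…",
        "original_phrase": "search the web for",
        "suggested_phrase": "calculate the ROI of",
        "predicted_outcome": "Would have called `calculator` instead of `web_search`",
        "confidence": 0.83,
    }],
})
response = parse_response(sample)
print(response.from_cache, len(response.counterfactuals))
```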

UI mount point

The <CounterfactualPanel /> component mounts as the fifth tab on /traces/[id], next to Spans / DAG / Timeline / Sankey. The span picker defaults to the first LLM or tool span in the trace.
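
The span picker's default can be expressed as a small selection rule. The sketch below is written in Python for illustration only (the panel itself is a UI component), and it assumes each span is a dict with a "kind" field whose values include "llm" and "tool"; those names are hypothetical.

```python
def default_span(spans):
    """Return the first span whose kind is "llm" or "tool", or None if there is none."""
    for span in spans:
        if span.get("kind") in ("llm", "tool"):
            return span
    return None


spans = [
    {"id": "s1", "kind": "http"},
    {"id": "s2", "kind": "tool"},
    {"id": "s3", "kind": "llm"},
]
print(default_span(spans)["id"])
```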


Shipped in: prod-v2.6.4. Backend and model shipped in V1.3; UI lands this sprint.