Sparse Autoencoder (SAE) mechanistic XAI

Status: shipped v2.9.3.1. UI live on /traces/[id] → SAE tab. Activation inference runs on your API host; no prompts, traces, or model outputs leave your instance.

Mechanistic interpretability asks a different question than ablation or counterfactuals: rather than "which words drove the tool choice?", it asks "which concepts was the model thinking about?". AuditTrail's SAE tab surfaces the top-K features from a sparse-autoencoder trained on the model's residual stream, for each span whose model has a published SAE.
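To make "top-K features" concrete, here is a toy sketch in pure Python. It assumes the common SAE encoder form (a ReLU over an affine map of one residual-stream vector). The weights, sizes, and function names are invented for illustration; this is not AuditTrail's or sae-lens's code, and real SAEs have thousands of features per layer.

```python
from typing import List, Tuple


def sae_encode(x: List[float], w_enc: List[List[float]], b_enc: List[float]) -> List[float]:
    """Toy SAE encoder: one activation per feature, ReLU(w_row . x + bias)."""
    acts = []
    for row, bias in zip(w_enc, b_enc):
        pre = sum(w * xi for w, xi in zip(row, x)) + bias
        acts.append(max(0.0, pre))  # ReLU keeps most features at exactly zero
    return acts


def top_k_features(acts: List[float], k: int) -> List[Tuple[int, float]]:
    """Return up to k (feature_index, activation) pairs, strongest first."""
    ranked = sorted(enumerate(acts), key=lambda pair: pair[1], reverse=True)
    return [(idx, act) for idx, act in ranked[:k] if act > 0.0]


# Hypothetical 3-dimensional residual vector and a 4-feature dictionary.
x = [1.0, -2.0, 0.5]
w_enc = [
    [0.5, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, 2.0],
]
b_enc = [0.0, 0.0, -0.5, -1.0]

top = top_k_features(sae_encode(x, w_enc, b_enc), k=2)
```

With these made-up numbers only features 2 and 0 fire, so `top` holds those two indices with their activations. The SAE tab shows the same kind of ranked list, computed from the real model's residual stream.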

SAE is a self-host capability

Unlike ablation and counterfactuals (which work against any model, including closed-weight GPT/Claude), SAE extraction needs the base model's weights plus a published sparse autoencoder. That means it only runs when all of these are true on your own API host:

the span's model is an open-weight one with a published SAE (see the catalog below),

sae-lens (which brings torch + TransformerLens) is installed on the API container, and

for the license-gated repos (gemma, llama) a Hugging Face token (AUDITTRAIL_SAE_MODEL_KEY) is set — gpt2-small needs no key.

The hosted demo at auditrail.imaginaerium.in does not run extraction — its traces are all closed-weight GPT/Claude, and the demo image ships without the heavy inference stack. SAE is a bring-your-own-dependencies feature you enable on a self-hosted instance. The panel's pre-flight checklist (see below) tells you exactly which of the three prerequisites are in place for any span.

Supported models

gpt2-small (or gpt2)                  → gpt2-small-res-jb               (no HF key, CPU-OK, Neuronpedia labels)
google/gemma-2-2b                     → gemma-scope-2b-pt-res-canonical (gated repo, Neuronpedia labels)
meta-llama/Meta-Llama-3-8B-Instruct   → llama-3-8b-it-res-jh            (gated repo)

gpt2-small is the smallest entry point: ~500 MB of weights, runs on CPU in a few GB of RAM, its base repo is ungated, and its features have Neuronpedia labels — the fastest way to see real SAE output end-to-end. (v3.25.0 catalog corrections: Mistral-7B was removed — it had no published SAE and its old entry pointed at a GPT-2 release by mistake — and the Llama entry moved from Llama-3.1-8B to Meta-Llama-3-8B-Instruct, the model its SAE was actually trained on.)

API-only models (GPT, Claude) do not get SAE extraction — no SAE weights exist for them publicly. Those spans fall back to ablation + counterfactuals on the Sankey and Counterfactuals tabs, which is the right tool for closed-weight models anyway.

Tip — Neuronpedia labels

Extracted features are always enriched, on a best-effort basis, with human-readable labels from neuronpedia.org; there is no enable flag to set. Lookups are cached (LRU, 2048 entries) and each enrichment call has a 3-second timeout. On a miss or a timeout the feature falls back to bucket heuristics (label = None, raw feature id shown), so a Neuronpedia outage never blocks extraction.

The supported catalog is served live at GET /api/v1/sae/supported-models — the SAE tab renders this list verbatim in its empty state so users can see exactly which runs will produce activations without leaving the dashboard.

Finding SAE-ready runs

You don't have to open every trace to find out which ones SAE can run against. Two surfaces answer "which of my runs support SAE?":

/traces badge. The Traces page shows a small N SAE-ready chip next to the title. N is the number of your traces that have at least one span recorded on a supported open-weight model. On the hosted demo this is honestly 0 — every demo span is GPT/Claude.

GET /api/v1/sae/candidate-traces. The endpoint behind the badge. It's user-scoped and cheap (it never runs extraction), and returns:

json

{
  "total": 2,
  "supported_models": [
    "gpt2-small",
    "gpt2",
    "google/gemma-2-2b",
    "meta-llama/Meta-Llama-3-8B-Instruct"
  ],
  "traces": [
    {
      "trace_id": "…",
      "agent_name": "research-agent",
      "created_at": "2026-07-02T…",
      "matched_models": ["google/gemma-2-2b"],
      "span_count": 3
    }
  ]
}

matched_models lists the supported models actually seen on that trace's spans, and span_count is how many of its spans are on one.
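A minimal client sketch for this endpoint, using only the standard library. The response fields are the ones shown in the sample payload. The Bearer header is borrowed from the extract example elsewhere on this page and is assumed to apply here too. The function names and the 10-second timeout are illustrative.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_candidate_traces(api_base: str, api_key: str) -> Dict[str, Any]:
    """GET /api/v1/sae/candidate-traces and decode the JSON body."""
    req = urllib.request.Request(
        f"{api_base}/api/v1/sae/candidate-traces",
        # Assumption: same Bearer auth as POST /api/v1/sae/extract.
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize_candidates(payload: Dict[str, Any]) -> List[str]:
    """Turn the payload into one headline plus one line per SAE-ready trace."""
    lines = [f"{payload['total']} SAE-ready trace(s)"]
    for trace in payload["traces"]:
        models = ", ".join(trace["matched_models"])
        lines.append(f"{trace['trace_id']}: {trace['span_count']} span(s) on {models}")
    return lines
```

Calling `summarize_candidates(fetch_candidate_traces(base, key))` gives a quick text report of which traces are worth opening on the SAE tab.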

Pre-flight checklist

The SAE tab renders a compact four-row pre-flight checklist above the panel body. Where the body shows only the first thing that's missing, the checklist shows all four prerequisites at once so you can see the whole gap at a glance:

Gate	Green when
Model has a published SAE	the span's `model` is in the supported catalog
`sae-lens` + `torch` installed on API host	the inference stack is importable on the server
`AUDITTRAIL_SAE_MODEL_KEY` configured	the HF token env var is set — or the model needs no key (gpt2-small)
Activations extracted for this span	a cached `sae_feature_activations` row exists

A ✓ is a met gate, a dashed circle is an actionable to-do (install the extras, set the key, or click Extract now), and an ✗ is a hard blocker you can't act on from the dashboard (a closed-weight model has no SAE — use ablation + counterfactuals instead). The header shows an N/4 ready summary.
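The gate rules in the table can be sketched as a small pure function. This is an illustration of the documented behavior, not the dashboard's actual code: the state names, the function signature, and the choice to evaluate the remaining gates independently when the model is unsupported are all assumptions.

```python
from typing import List, Optional, Tuple

# Catalog entries as listed under "Supported models".
SUPPORTED = {
    "gpt2-small",
    "gpt2",
    "google/gemma-2-2b",
    "meta-llama/Meta-Llama-3-8B-Instruct",
}
NO_KEY_NEEDED = {"gpt2-small", "gpt2"}  # ungated base repo

MET, TODO, BLOCKED = "met", "todo", "blocked"  # check / dashed circle / cross


def preflight(
    model: Optional[str],
    stack_installed: bool,
    key_set: bool,
    has_cached_row: bool,
) -> Tuple[List[Tuple[str, str]], str]:
    """Return the four gates with their state, plus the 'N/4 ready' summary."""
    key_ok = key_set or model in NO_KEY_NEEDED
    gates = [
        # An unsupported model cannot be fixed from the dashboard: hard blocker.
        ("Model has a published SAE", MET if model in SUPPORTED else BLOCKED),
        ("sae-lens + torch installed on API host", MET if stack_installed else TODO),
        ("AUDITTRAIL_SAE_MODEL_KEY configured", MET if key_ok else TODO),
        ("Activations extracted for this span", MET if has_cached_row else TODO),
    ]
    ready = sum(1 for _, state in gates if state == MET)
    return gates, f"{ready}/4 ready"
```

For example, a gpt2-small span on a host with the stack installed, no key, and no cached row comes out as 3/4 ready, with only the extraction gate left as a to-do.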

Setup

The SAE stack is optional because it adds a few GB of Python deps (torch, TransformerLens, transformers — all pulled in by sae-lens). Install when you actually need it:

bash

pip install sae-lens
# or, from a source checkout of the API:
pip install -e "apps/api[sae]"
# or, in the compose api image:
docker compose exec api pip install sae-lens
docker compose restart api

For the license-gated base models (gemma, llama), set AUDITTRAIL_SAE_MODEL_KEY to a Hugging Face token that has accepted the model licenses — skip this step for gpt2-small, its repo is open:

env

# /opt/audittrail/compose/.env
AUDITTRAIL_SAE_MODEL_KEY=hf_xxxxxxxxxxxxxxxxxxxx

Restart the API container so the env var takes effect.
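After the restart, a quick self-check run inside the API container can confirm the two host-side prerequisites. This is a hypothetical helper, not something AuditTrail ships. It only checks that the packages can be located (the sae-lens distribution imports as `sae_lens`), not that they import cleanly, so it will not catch the Windows DLL problem described under Troubleshooting.

```python
import importlib.util
import os
from typing import Dict, Mapping, Optional


def sae_host_status(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report whether the SAE stack is locatable and the HF token is set."""
    env = os.environ if env is None else env
    return {
        # find_spec returns None when a top-level package is not installed.
        "sae_lens": importlib.util.find_spec("sae_lens") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        # Only needed for the gated repos (gemma, llama); gpt2-small works without it.
        "hf_key_set": bool(env.get("AUDITTRAIL_SAE_MODEL_KEY")),
    }


if __name__ == "__main__":
    for name, ok in sae_host_status().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```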

Triggering extraction

Extraction is opt-in per span: running the base model to get residual-stream activations takes seconds and several GB of RAM, so it is not done automatically on every trace. There are two ways to trigger it:

From the dashboard. Open /traces/[trace_id], go to the SAE tab, pick an LLM span, click Extract now. Results cache in the sae_feature_activations table so the next visit is instant.

From the API.

bash

curl -X POST "$AUDITTRAIL_API/api/v1/sae/extract" \
  -H "Authorization: Bearer $AUDITTRAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"span_id": "...", "top_k": 20}'

The POST /sae/extract handler short-circuits if a row exists for that span — the response's reason field will say ok (cached).
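The same call from Python, as a standard-library sketch. The URL, headers, and body mirror the curl example. The function names and the 120-second timeout are illustrative, and `reason` is the only response field this page documents, so the sketch does not assume any others.

```python
import json
import urllib.request
from typing import Any, Dict


def build_extract_request(
    api_base: str, api_key: str, span_id: str, top_k: int = 20
) -> urllib.request.Request:
    """Build the POST /api/v1/sae/extract request without sending it."""
    body = json.dumps({"span_id": span_id, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{api_base}/api/v1/sae/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def extract(api_base: str, api_key: str, span_id: str, top_k: int = 20) -> Dict[str, Any]:
    """Send the request; a first extraction can take a while, hence the long timeout."""
    req = build_extract_request(api_base, api_key, span_id, top_k)
    with urllib.request.urlopen(req, timeout=120) as resp:
        result = json.load(resp)
    # A cached span reports "ok (cached)" in `reason`, per the note above.
    print("reason:", result.get("reason"))
    return result
```

Splitting request construction from sending keeps the payload easy to inspect or unit-test without a running API.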

Feature labels (Neuronpedia)

Feature IDs like f_15_4109 are opaque without a dictionary. AuditTrail calls Neuronpedia, the community interpretability index, to resolve a human-written short description for each feature. When Neuronpedia has the feature, you see:

f_15_4109 · refusal / policy-adjacent       0.92

When it doesn't, the UI shows (label pending) and the raw ID. Neither case blocks the extraction path; both are cached in-process with a 2048-entry LRU.

The coverage set is advertised through the labels_available field on each supported-models catalog entry, so the UI can tag releases that have label coverage with a small labels chip.
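The lookup behavior described in this section (hits and misses both cached in a 2048-entry in-process LRU, failures never blocking extraction) can be sketched with `functools.lru_cache`. The `fetch` callable is a stand-in: in a real adapter it would be the Neuronpedia GET with a 3-second timeout. Everything named here is hypothetical and not AuditTrail's actual adapter.

```python
from functools import lru_cache
from typing import Callable, Optional


def make_label_lookup(
    fetch: Callable[[str], Optional[str]], maxsize: int = 2048
) -> Callable[[str], Optional[str]]:
    """Wrap a label fetcher so results, including misses, are cached in-process."""

    @lru_cache(maxsize=maxsize)
    def lookup(feature_id: str) -> Optional[str]:
        try:
            return fetch(feature_id)
        except Exception:
            # Timeout, outage, or HTTP error: treat as "no label" and cache that,
            # so a slow Neuronpedia is not retried for every row.
            return None

    return lookup


def render_row(feature_id: str, label: Optional[str], activation: float) -> str:
    """Format one feature row; unlabeled features show the pending marker."""
    shown = label if label else "(label pending)"
    return f"{feature_id} · {shown}  {activation:.2f}"
```

One consequence of caching misses this way is that a feature that failed once stays unlabeled until the entry is evicted or the process restarts. That matches the "both are cached" wording above, but it is a design trade-off to be aware of.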

What you see in the UI

The SAE tab mirrors the landing teaser exactly:

Compact feature strip, top-K sorted by activation
Bucket dot on the left (safety / intent / reasoning / tool / unknown) heuristically mapped from the label keyword
Animated activation bar for each row
"View on Neuronpedia" deep-link when a URL is available

Bucket colors come from the standard design tokens (--status-critical, --span-llm, --accent-docs, --span-tool, --muted-foreground) — no ad-hoc hex.
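A rough sketch of what "heuristically mapped from the label keyword" could look like. The keyword lists are invented, and pairing the five buckets with the five tokens in the order they are listed above is an assumption; the real mapping lives in the dashboard code.

```python
from typing import Optional

# Assumption: buckets and tokens pair up in the order the docs list them.
BUCKET_TOKENS = {
    "safety": "--status-critical",
    "intent": "--span-llm",
    "reasoning": "--accent-docs",
    "tool": "--span-tool",
    "unknown": "--muted-foreground",
}

# Hypothetical keyword lists, checked in order; first match wins.
BUCKET_KEYWORDS = [
    ("safety", ("refusal", "policy", "harm")),
    ("intent", ("intent", "request", "goal")),
    ("reasoning", ("reason", "logic", "math")),
    ("tool", ("tool", "function call")),
]


def bucket_for(label: Optional[str]) -> str:
    """Map a feature label to a bucket; unlabeled features are 'unknown'."""
    if not label:
        return "unknown"
    text = label.lower()
    for bucket, keywords in BUCKET_KEYWORDS:
        if any(word in text for word in keywords):
            return bucket
    return "unknown"


def dot_color(label: Optional[str]) -> str:
    """CSS custom property for the bucket dot next to a feature row."""
    return BUCKET_TOKENS[bucket_for(label)]
```

Under these made-up keywords, the label from the earlier example, "refusal / policy-adjacent", lands in the safety bucket.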

Privacy + telemetry

Inference runs on your API host. Nothing leaves the instance except the outbound HF and Neuronpedia calls listed above.
Neuronpedia is read-only. We only ever GET feature labels; we never send prompts, traces, or model outputs.
HF token scope. A read-only token with access to the SAE model repos is sufficient. Do not issue write-scope tokens — the adapter never uploads.

Troubleshooting

"SAE isn't available for this model" — the span's model string isn't in the supported catalog. This is expected for GPT/Claude spans and for spans where no model attribute was ever recorded.

"Server missing sae-lens + torch" — pip install sae-lens on the API host (see Setup). If you're on Windows, install into a venv at a SHORT path — deep paths break pyarrow's DLL loading with "The filename or extension is too long" (Windows MAX_PATH).

"AUDITTRAIL_SAE_MODEL_KEY not configured" — set the env var and restart the API (gated models only; gpt2-small doesn't need it). The dashboard will refresh the state automatically on the next tab load.

pyarrow.dataset has no attribute ParquetFragmentScanOptions — on Windows this is usually a MAX_PATH symptom in disguise: pyarrow's parquet DLL fails to load from a deep venv path and the missing attribute is the downstream error. Recreate the venv at a short path (e.g. C:\venvs\sae).

Extraction failed: OOM — reduce top_k or run on a host with ≥16 GB RAM. Meta-Llama-3-8B-Instruct in fp16 needs ~16 GB; Gemma-2-2b runs on 8 GB hosts; gpt2-small runs in a few GB on CPU.
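The RAM figures follow from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below uses approximate public parameter counts (about 124M for gpt2-small, 2.6B for Gemma-2-2b, 8.0B for Meta-Llama-3-8B-Instruct). Treat the results as a lower bound, since activations, the SAE weights, and the Python runtime all add overhead on top.

```python
def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate weight memory in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Approximate parameter counts; fp32 = 4 bytes, fp16 = 2 bytes per parameter.
estimates = {
    "gpt2-small (fp32)": weights_gb(124e6, 4),
    "google/gemma-2-2b (fp16)": weights_gb(2.6e9, 2),
    "meta-llama/Meta-Llama-3-8B-Instruct (fp16)": weights_gb(8.0e9, 2),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

This gives about 0.5 GB for gpt2-small, 5.2 GB for Gemma-2-2b, and 16 GB for the Llama model, which is why the 8B model does not fit on a host smaller than the one recommended above.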