Sparse Autoencoder (SAE) mechanistic XAI

Status: shipped in v2.9.3.1. UI live on /traces/[id] → SAE tab. Activation inference runs on your API host; prompts, traces, and model outputs never leave your instance.

Mechanistic interpretability asks a different question from ablation or counterfactuals: rather than "which words drove the tool choice?", it asks "which concepts was the model thinking about?". AuditTrail's SAE tab surfaces the top-K features from a sparse autoencoder trained on the model's residual stream, for each span whose model has a published SAE.

Supported models

meta-llama/Llama-3.1-8B    → Juliushanhanhan/llama-3-8b-it-res
google/gemma-2-2b          → gemma-scope-2b-pt-res
mistralai/Mistral-7B-v0.1  → (SAE release pending upstream)

API-only models (GPT, Claude) do not get SAE extraction because no public SAE weights exist for them. Those spans fall back to ablation + counterfactuals on the Sankey and Counterfactuals tabs, which is the right tool for closed-weight models anyway.

Tip — Neuronpedia labels

When AUDITTRAIL_NEURONPEDIA_ENABLED=true, extracted features are enriched with human-readable labels from neuronpedia.org. Lookups are cached (LRU, 2048 entries) and each enrichment call has a 3-second timeout, so a miss or a timeout falls back gracefully to bucket heuristics.

The supported catalog is served live at GET /api/v1/sae/supported-models — the SAE tab renders this list verbatim in its empty state so users can see exactly which runs will produce activations without leaving the dashboard.
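The same catalog can be read from a script. The following is a minimal sketch using only the Python standard library; the helper names are illustrative, the Bearer header is an assumption carried over from the extract endpoint, and the shape of the JSON body is not specified here, so the sketch returns it undecoded beyond JSON parsing.

```python
import json
import urllib.request


def catalog_request(base_url, api_key=None):
    """Build the GET request for the supported-models catalog."""
    url = base_url.rstrip("/") + "/api/v1/sae/supported-models"
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: the catalog accepts the same Bearer scheme as /sae/extract.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_catalog(base_url, api_key=None, timeout=10.0):
    """Return the parsed JSON body of the catalog response."""
    request = catalog_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_catalog with the value of $AUDITTRAIL_API returns whatever the endpoint serves, which is the same list the SAE tab shows in its empty state.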

Setup

The SAE stack is optional because it adds a few GB of Python deps (torch, transformers, saelens). Install it only when you need it:

bash
pip install 'audittrail[sae]'
# or, in the compose api image:
docker compose exec api pip install 'audittrail[sae]'
docker compose restart api

Then set AUDITTRAIL_SAE_MODEL_KEY to a Hugging Face token with access to the model repos:

env
# /opt/audittrail/compose/.env
AUDITTRAIL_SAE_MODEL_KEY=hf_xxxxxxxxxxxxxxxxxxxx

Restart the API container so the env var takes effect.
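Before triggering an extraction, it can help to confirm that both halves of the setup are in place. The sketch below is a hypothetical preflight check, not part of AuditTrail: it assumes the saelens dependency is importable as sae_lens, and it only looks modules up without importing them, so it stays fast even when torch is installed.

```python
import importlib.util
import os

# Import names for the optional stack. Assumption: saelens is imported
# as `sae_lens`; adjust if your install exposes a different name.
REQUIRED_MODULES = ("torch", "transformers", "sae_lens")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(env=None):
    """Collect human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = [f"missing Python module: {name}" for name in missing_modules()]
    if not env.get("AUDITTRAIL_SAE_MODEL_KEY"):
        problems.append("AUDITTRAIL_SAE_MODEL_KEY is not set")
    return problems
```

Run preflight() inside the API container (for example through docker compose exec api python) so that it sees the same interpreter and environment as the API process.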

Triggering extraction

Extraction is opt-in per span: running the base model to get residual-stream activations takes seconds and several GB of RAM, so it does not run automatically on every trace. There are two ways to trigger it:

  1. From the dashboard. Open /traces/[trace_id], go to the SAE tab, pick an LLM span, and click Extract now. Results are cached in the sae_feature_activations table, so the next visit is instant.

  2. From the API.

    bash
    curl -X POST "$AUDITTRAIL_API/api/v1/sae/extract" \
      -H "Authorization: Bearer $AUDITTRAIL_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"span_id": "...", "top_k": 20}'

The POST /api/v1/sae/extract handler short-circuits if a row already exists for that span; the response's reason field then reads ok (cached).
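For scripted use, the curl call translates directly to Python. This is a sketch with the standard library only: the function names are illustrative, and was_cached assumes the reason field is a string containing the word "cached", which is an inference from the text above rather than a documented schema.

```python
import json
import urllib.request


def extract_request(base_url, api_key, span_id, top_k=20):
    """Build the POST request that asks for SAE extraction on one span."""
    body = json.dumps({"span_id": span_id, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/sae/extract",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract(base_url, api_key, span_id, top_k=20, timeout=120.0):
    """Send the request and return the parsed JSON response."""
    request = extract_request(base_url, api_key, span_id, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def was_cached(response):
    """True when the handler reused an existing row instead of re-running."""
    # Assumption: a cache hit is reported as reason == "ok (cached)".
    return "cached" in str(response.get("reason", ""))
```

The generous default timeout reflects that a first, uncached extraction has to run the base model; cached responses return much faster.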

Feature labels (Neuronpedia)

Feature IDs like f_15_4109 are opaque without a dictionary. AuditTrail calls Neuronpedia, the community interpretability index, to resolve a human-written short description for each feature. When Neuronpedia has the feature, you see:

f_15_4109 · refusal / policy-adjacent       0.92

When it doesn't, the UI shows (label pending) and the raw ID. Neither case blocks the extraction path; both are cached in-process with a 2048-entry LRU.
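The caching behaviour described above can be sketched as a small LRU wrapper. This is an illustration under stated assumptions, not AuditTrail's adapter: LabelCache and the injected fetch callable are hypothetical names, fetch stands in for the real Neuronpedia lookup and is expected to return a label string or None, and the choice not to cache network errors (so a later call can retry) is a design assumption of the sketch.

```python
from collections import OrderedDict

PENDING = "(label pending)"


class LabelCache:
    """In-process LRU cache for feature labels, covering hits and misses."""

    def __init__(self, fetch, max_entries=2048, timeout_s=3.0):
        self._fetch = fetch          # hypothetical: fetch(feature_id, timeout_s)
        self._max = max_entries
        self._timeout = timeout_s
        self._entries = OrderedDict()

    def label(self, feature_id):
        if feature_id in self._entries:
            # Refresh recency so frequently used features stay cached.
            self._entries.move_to_end(feature_id)
            return self._entries[feature_id]
        try:
            found = self._fetch(feature_id, self._timeout)
        except OSError:
            # Timeouts and network failures never block extraction.
            # Assumption: these are not cached, so the next call retries.
            return PENDING
        value = found or PENDING
        self._entries[feature_id] = value
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

Caching the miss as well as the hit means an unlabeled feature costs one outbound request, not one per page view.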

The coverage set is advertised through the labels_available field on each supported-models catalog entry, so the UI can tag releases that have label coverage with a small labels chip.

What you see in the UI

The SAE tab mirrors the landing teaser exactly:

  • Compact feature strip, top-K sorted by activation
  • Bucket dot on the left (safety / intent / reasoning / tool / unknown) heuristically mapped from the label keyword
  • Animated activation bar for each row
  • "View on Neuronpedia" deep-link when a URL is available

Bucket colors come from the standard design tokens (--status-critical, --span-llm, --accent-docs, --span-tool, --muted-foreground) — no ad-hoc hex.
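As a rough illustration of how the strip could be assembled, the sketch below sorts features by activation, keeps the top K, and assigns a bucket from label keywords. Everything specific in it is an assumption: the keyword lists are invented for the example, the feature dict keys (activation, label) are hypothetical, and the bucket-to-token pairing simply follows the order in which the buckets and tokens are listed above.

```python
# Hypothetical keyword lists; the real heuristic may use different terms.
BUCKET_KEYWORDS = {
    "safety": ("refusal", "policy", "harm"),
    "intent": ("intent", "request", "goal"),
    "reasoning": ("reason", "logic", "math"),
    "tool": ("tool", "function call"),
}

# Assumption: buckets map to design tokens in the order listed in the docs.
BUCKET_TOKENS = {
    "safety": "--status-critical",
    "intent": "--span-llm",
    "reasoning": "--accent-docs",
    "tool": "--span-tool",
    "unknown": "--muted-foreground",
}


def bucket_for(label):
    """Pick the first bucket whose keyword appears in the label text."""
    text = (label or "").lower()
    for bucket, keywords in BUCKET_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return bucket
    return "unknown"


def feature_strip(features, top_k=20):
    """Return the top_k features by activation, tagged with bucket and token."""
    ranked = sorted(features, key=lambda f: f["activation"], reverse=True)
    rows = []
    for feature in ranked[:top_k]:
        bucket = bucket_for(feature.get("label"))
        rows.append({**feature, "bucket": bucket, "token": BUCKET_TOKENS[bucket]})
    return rows
```

Features without a label fall into the unknown bucket, which matches the (label pending) case described earlier.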

Privacy + telemetry

  • Inference runs on your API host. Nothing leaves the instance except the outbound HF and Neuronpedia calls listed above.
  • Neuronpedia is read-only. We only ever GET feature labels; we never send prompts, traces, or model outputs.
  • HF token scope. A read-only token with access to the SAE model repos is sufficient. Do not issue write-scope tokens — the adapter never uploads.

Troubleshooting

"SAE isn't available for this model" — the span's model string isn't in the supported catalog. This is expected for GPT/Claude spans and for spans where no model attribute was ever recorded.

"Server missing saelens + torch" — install audittrail[sae] (see Setup). If you're on Windows with # in the repo path, run the install from inside the Docker container.

"AUDITTRAIL_SAE_MODEL_KEY not configured" — set the env var and restart the API. The dashboard will refresh the state automatically on the next tab load.

Extraction failed: OOM — reduce top_k or run on a host with ≥16 GB RAM. Llama-3.1-8B in fp16 needs ~16 GB; Gemma-2-2b runs on 8 GB hosts.
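The fp16 figures follow from simple arithmetic: two bytes per parameter. The sketch below assumes rounded parameter counts (about 8 billion for Llama-3.1-8B and about 2.6 billion for Gemma-2-2b) and counts model weights only, so activations, the SAE itself, and the rest of the API process come on top.

```python
def weights_gb(params_billions, bytes_per_param=2):
    """Approximate weight footprint in GB (10^9 bytes); fp16 is 2 bytes."""
    return params_billions * bytes_per_param


# Rounded parameter counts, used here as assumptions.
print("Llama-3.1-8B fp16:", weights_gb(8.0), "GB")
print("Gemma-2-2b fp16:", weights_gb(2.6), "GB")
```

Because the Llama weights alone come to roughly 16 GB under these assumptions, a host with exactly 16 GB of RAM is likely to be tight in practice.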