TurboQuant-accelerated ablation
Status: research preview (v2.9.2). Contract is stable; runtime is opt-in and requires an external TurboQuant + vLLM service.
Ablation drives the Sankey / SHAP explanation pipeline. Every segment of your prompt is masked and re-scored, which means an N-segment prompt normally costs N+1 LLM calls. For long prompts this adds up fast.
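The N+1 figure comes from one call for the unmasked baseline plus one call per masked segment. A minimal sketch of that loop follows, assuming a leave-one-out scheme in which attribution is the baseline score minus the masked score. The function name `leave_one_out`, the `[MASKED]` placeholder, and the toy scorer are illustrative assumptions, not AuditTrail's actual implementation.

```python
from typing import Callable, Dict, List


def leave_one_out(
    segments: List[str],
    score: Callable[[str], float],
    mask: str = "[MASKED]",  # hypothetical placeholder token
) -> Dict[int, float]:
    """Return a per-segment attribution: baseline score minus masked score."""
    baseline = score(" ".join(segments))  # 1 call: the full, unmasked prompt
    attributions: Dict[int, float] = {}
    for i in range(len(segments)):
        masked = segments[:i] + [mask] + segments[i + 1:]
        # 1 call per segment, so N segments cost N + 1 calls in total
        attributions[i] = baseline - score(" ".join(masked))
    return attributions


calls = 0


def toy_score(prompt: str) -> float:
    """Stand-in for an LLM call; counts how often it is invoked."""
    global calls
    calls += 1
    return 1.0 if "search" in prompt else 0.25


result = leave_one_out(["please", "search the web", "for cats"], toy_score)
print(result, calls)
```

With three segments the toy scorer is invoked four times, which is the cost the backends below try to reduce.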
AuditTrail exposes three ablation backends, selected via `AUDITTRAIL_ABLATION_BACKEND`:
| Backend | When to use | Typical cost reduction |
|---|---|---|
| `standard` | Default. One model scores every segment. | Baseline (0%) |
| `adaptive_route` | Real LLM runs. Cheap model for the first pass, expensive model only for the top-K ambiguous segments. | ~70% |
| `turboquant` | GPU-accelerated bulk inference via vLLM + TurboQuant 4-bit quantization. Requires an external runner endpoint. | 80–100% |
Enabling adaptive routing (no extra infra)
Adaptive routing is the easy win. It doesn't need a GPU — it just drops down to a smaller model on the first pass, then escalates only the genuinely ambiguous segments to your chosen accurate model.
AUDITTRAIL_ABLATION_REAL_LLM_ENABLED=true
AUDITTRAIL_ABLATION_LLM_MODEL=anthropic:claude-3-5-sonnet-latest
AUDITTRAIL_ABLATION_BACKEND=adaptive_route
AUDITTRAIL_ABLATION_ROUTE_CHEAP_MODEL=openai:gpt-4o-mini
AUDITTRAIL_ABLATION_ROUTE_ACCURATE_ESCALATION_TOP_K=3

How it works, per trace:
- Phase A: every segment is scored once with `ablation_route_cheap_model`. Cost is small because the model is small.
- Segments in the ambiguous band (`0.05 ≤ attribution ≤ 0.30`) are identified. These are the ones where the cheap model is uncertain.
- Phase B: the top-K of those (ranked by cheap-model attribution) are re-scored with your accurate model. K defaults to 3; tune to taste.
Everything else uses the cheap-model score. The UI renders exactly the same Sankey and SHAP cards; the only observable difference is cost.
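The two phases can be sketched as follows. This is an illustration under stated assumptions, not the shipped router: the function name `adaptive_route` and the scorer callables are hypothetical, and the sketch assumes "ranked by cheap-model attribution" means highest attribution first.

```python
from typing import Callable, Dict, List, Tuple


def adaptive_route(
    segment_ids: List[str],
    cheap: Callable[[str], float],
    accurate: Callable[[str], float],
    top_k: int = 3,
    low: float = 0.05,
    high: float = 0.30,
) -> Tuple[Dict[str, float], List[str]]:
    """Score all segments cheaply, then re-score the top-K ambiguous ones."""
    # Phase A: every segment is scored once with the cheap model.
    scores = {sid: cheap(sid) for sid in segment_ids}
    # Segments whose cheap attribution falls inside the ambiguous band.
    ambiguous = [sid for sid in segment_ids if low <= scores[sid] <= high]
    # Phase B: escalate the top-K ambiguous segments.
    # Assumption: ranking is by descending cheap-model attribution.
    escalated = sorted(ambiguous, key=lambda sid: scores[sid], reverse=True)[:top_k]
    for sid in escalated:
        scores[sid] = accurate(sid)
    return scores, escalated


# Toy data standing in for real model calls.
cheap_scores = {"a": 0.90, "b": 0.28, "c": 0.10, "d": 0.20, "e": 0.06, "f": 0.01}
accurate_calls: List[str] = []


def accurate_model(sid: str) -> float:
    accurate_calls.append(sid)  # track how many expensive calls were made
    return 0.5


final, escalated = adaptive_route(
    list(cheap_scores), lambda sid: cheap_scores[sid], accurate_model
)
print(escalated, len(accurate_calls))
```

In this toy run four of six segments fall in the ambiguous band, but only three reach the accurate model; the fourth keeps its cheap-model score.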
Enabling TurboQuant (GPU runner)
TurboQuant backs the segment-scoring calls with a bulk-batched vLLM service running a 4-bit quantized Llama 3.1 70B (or equivalent). The runner is not bundled with AuditTrail — ship it separately and point the API at it:
AUDITTRAIL_ABLATION_BACKEND=turboquant
AUDITTRAIL_TURBOQUANT_RUNNER_URL=http://turboquant:8080

Runner contract (request/response, POST /score):
// request
{
  "model": "llama3.1:70b-turboquant",
  "candidates": ["web_search", "calculator", "read_file"],
  "segments": [
    { "id": "seg_0", "text": "..." },
    { "id": "seg_1", "text": "..." }
  ]
}
// response
{
  "scores": {
    "seg_0": { "web_search": 0.82, "calculator": 0.11, "read_file": 0.07 },
    "seg_1": { "web_search": 0.12, "calculator": 0.85, "read_file": 0.03 }
  },
  "total_runs": 2
}

If the runner is unreachable or returns a non-200 response, the backend falls back to `standard` and logs a warning. It never silently degrades to the demo heuristic, because that would look like a real result.
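A client for this contract, including the fallback, might look like the sketch below. It uses only the Python standard library. The names `post_json`, `score_segments`, and `standard_scorer`, the logger name, and the injectable `post` parameter are assumptions made for illustration; AuditTrail's real client may be structured differently.

```python
import json
import logging
import urllib.request
from typing import Any, Callable, Dict, List

logger = logging.getLogger("audittrail.ablation")  # hypothetical logger name

Scores = Dict[str, Dict[str, float]]
Segments = List[Dict[str, str]]


def post_json(url: str, body: Dict[str, Any], timeout: float = 10.0) -> Dict[str, Any]:
    """POST a JSON body; raise OSError if unreachable or the status is not 200."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises URLError/HTTPError (both OSError subclasses) on
    # connection failures and on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 200:
            raise OSError(f"runner returned HTTP {response.status}")
        return json.loads(response.read().decode("utf-8"))


def score_segments(
    runner_url: str,
    segments: Segments,
    candidates: List[str],
    standard_scorer: Callable[[Segments, List[str]], Scores],
    post: Callable[[str, Dict[str, Any]], Dict[str, Any]] = post_json,
) -> Scores:
    """Try the runner first; on failure, warn and use the standard scorer."""
    body = {"model": "llama3.1:70b-turboquant", "candidates": candidates, "segments": segments}
    try:
        return post(runner_url.rstrip("/") + "/score", body)["scores"]
    except OSError as exc:
        # Loud fallback: a warning is always logged, never a silent downgrade.
        logger.warning("TurboQuant runner failed (%s); falling back to standard", exc)
        return standard_scorer(segments, candidates)


# Demo with fake transports, so no network is needed.
def fake_standard(segments: Segments, candidates: List[str]) -> Scores:
    return {s["id"]: {c: 0.0 for c in candidates} for s in segments}


def unreachable(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    raise ConnectionRefusedError("runner down")


def healthy(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    return {"scores": {"seg_0": {"web_search": 0.82}}, "total_runs": 1}


segs = [{"id": "seg_0", "text": "..."}]
fallback = score_segments("http://turboquant:8080", segs, ["web_search"], fake_standard, post=unreachable)
ok = score_segments("http://turboquant:8080", segs, ["web_search"], fake_standard, post=healthy)
print(fallback, ok)
```

Because the fallback is decided per call, nothing in this sketch mutates the configured backend: the next request tries the runner again.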
Picking the right backend
- Local demo / test / dev: leave it on `standard` with the demo scorer.
- Production with API LLMs (OpenAI / Anthropic / etc.): `adaptive_route` is almost always the right call. Roughly 70% cost reduction with no extra infra.
- Self-hosted GPU fleet: `turboquant` for throughput-bound batch workloads where every LLM round-trip is a hot path.
Cost accounting
Attribution calls made via each backend are recorded in the normal
cost ledger — the only difference is the model name that lands on the
span. Dashboards and /api/v1/reports/*.pdf pick them up automatically.
What this does NOT do
- It does not skip the baseline full-prompt score. That's always the caller's chosen model so the anchor probability is trustworthy.
- It does not aggregate multiple users' workloads. Each ablation run stays per-user and per-trace; there's no cross-tenant batching.
- The `ablation_backend` setting does not flip automatically when the runner becomes unreachable. You'll see a one-line warning and a fall-through to `standard`, but the process goes back to using the configured backend once the runner comes back.