TurboQuant-accelerated ablation

Status: research preview (v2.9.2). Contract is stable; runtime is opt-in and requires an external TurboQuant + vLLM service.

Ablation drives the Sankey / SHAP explanation pipeline. Each segment of your prompt is masked in turn and the prompt is re-scored, so an N-segment prompt normally costs N+1 LLM calls: one baseline score for the full prompt plus one per masked segment. For long prompts this adds up fast.
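
To make the N+1 count concrete, here is a minimal leave-one-out sketch. It is not AuditTrail's implementation: the `[MASKED]` placeholder, the `ablate` helper, the toy scorer, and the "baseline minus masked score" attribution formula are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List

MASK = "[MASKED]"  # hypothetical placeholder; the real mask token is not specified


def ablate(segments: List[str], score: Callable[[str], float]) -> Dict[str, object]:
    """Leave-one-out ablation: 1 baseline call plus 1 call per masked segment."""
    calls = 0

    def counted_score(prompt: str) -> float:
        nonlocal calls
        calls += 1
        return score(prompt)

    # One call for the unmasked prompt (the baseline).
    baseline = counted_score(" ".join(segments))
    attributions = []
    for i in range(len(segments)):
        masked = segments[:i] + [MASK] + segments[i + 1:]
        # Assumed attribution: how much the score drops when this segment is hidden.
        attributions.append(baseline - counted_score(" ".join(masked)))
    return {"baseline": baseline, "attributions": attributions, "calls": calls}


def toy_score(prompt: str) -> float:
    """Stand-in for an LLM call: fraction of words equal to 'search'."""
    words = prompt.split()
    return words.count("search") / len(words)


result = ablate(["please search", "the web", "for cats"], toy_score)
print(result["calls"])  # 4 calls for 3 segments (N + 1)
```

Every backend described below reduces the cost of the N per-segment calls; the baseline call is always made.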

AuditTrail exposes three ablation backends, selected via AUDITTRAIL_ABLATION_BACKEND:

Backend	When to use	Typical cost reduction
`standard`	Default. One model scores every segment.	Baseline (0%)
`adaptive_route`	Real LLM runs. Cheap model for the first pass, expensive model only for the top-K ambiguous segments.	~70%
`turboquant`	GPU-accelerated bulk inference via vLLM + TurboQuant 4-bit quantization. Requires an external runner endpoint.	80–100%
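
If you wrap AuditTrail in your own deployment tooling, it can help to validate the backend choice before start-up. The sketch below is a hypothetical helper, not part of AuditTrail; it only assumes the environment variable names used in this document and that `standard` is the default.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = ("standard", "adaptive_route", "turboquant")


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Read and validate the backend name (hypothetical helper)."""
    env = os.environ if env is None else env
    backend = env.get("AUDITTRAIL_ABLATION_BACKEND", "standard").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown ablation backend: {backend!r}")
    # turboquant needs an external runner, so fail early if no URL is configured.
    if backend == "turboquant" and not env.get("AUDITTRAIL_TURBOQUANT_RUNNER_URL"):
        raise ValueError("turboquant backend requires AUDITTRAIL_TURBOQUANT_RUNNER_URL")
    return backend
```

Passing an explicit mapping instead of relying on `os.environ` keeps the check easy to exercise in tests.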

Enabling adaptive routing (no extra infra)

Adaptive routing is the easy win. It doesn't need a GPU — it just drops down to a smaller model on the first pass, then escalates only the genuinely ambiguous segments to your chosen accurate model.

env

AUDITTRAIL_ABLATION_REAL_LLM_ENABLED=true
AUDITTRAIL_ABLATION_LLM_MODEL=anthropic:claude-3-5-sonnet-latest
AUDITTRAIL_ABLATION_BACKEND=adaptive_route
AUDITTRAIL_ABLATION_ROUTE_CHEAP_MODEL=openai:gpt-4o-mini
AUDITTRAIL_ABLATION_ROUTE_ACCURATE_ESCALATION_TOP_K=3

How it works, per trace:

Phase A: every segment is scored once with ablation_route_cheap_model. Cost is low because the model is small.
Segments in the ambiguous band (0.05 ≤ attribution ≤ 0.30) are identified. These are the ones where the cheap model is uncertain.
Phase B: the top-K of those (ranked by cheap-model attribution) are re-scored with your accurate model. K defaults to 3 and is set with AUDITTRAIL_ABLATION_ROUTE_ACCURATE_ESCALATION_TOP_K.

Everything else uses the cheap-model score. The UI renders exactly the same Sankey and SHAP cards; the only observable difference is cost.
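
The two phases can be sketched roughly as follows. This is an illustration, not the shipped code: `adaptive_route`, `cheap_score`, and `accurate_score` are hypothetical names, and the sketch assumes "ranked by cheap-model attribution" means highest attribution first.

```python
from typing import Callable, Dict, List

AMBIGUOUS_LOW, AMBIGUOUS_HIGH = 0.05, 0.30  # band quoted in the steps above


def adaptive_route(
    segment_ids: List[str],
    cheap_score: Callable[[str], float],
    accurate_score: Callable[[str], float],
    top_k: int = 3,
) -> Dict[str, float]:
    # Phase A: score every segment once with the cheap model.
    scores = {seg: cheap_score(seg) for seg in segment_ids}

    # Collect segments whose cheap attribution falls in the ambiguous band.
    ambiguous = [s for s in segment_ids if AMBIGUOUS_LOW <= scores[s] <= AMBIGUOUS_HIGH]

    # Phase B: re-score only the top-K ambiguous segments with the accurate model.
    # Assumption: ranking is by cheap attribution, highest first.
    ambiguous.sort(key=lambda s: scores[s], reverse=True)
    for seg in ambiguous[:top_k]:
        scores[seg] = accurate_score(seg)
    return scores


# Toy data: cheap-model attributions for six segments.
cheap = {"a": 0.90, "b": 0.20, "c": 0.10, "d": 0.01, "e": 0.25, "f": 0.06}
escalated = []


def accurate(seg: str) -> float:
    escalated.append(seg)  # record which segments reach the expensive model
    return 0.5


final = adaptive_route(list(cheap), lambda s: cheap[s], accurate, top_k=3)
print(escalated)  # ['e', 'b', 'c']
```

In this toy run four segments land in the ambiguous band, but only three reach the accurate model; `f` keeps its cheap score, as do `a` and `d`, which sit outside the band.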

Enabling TurboQuant (GPU runner)

TurboQuant backs the segment-scoring calls with a bulk-batched vLLM service running a 4-bit quantized Llama 3.1 70B (or equivalent). The runner is not bundled with AuditTrail — ship it separately and point the API at it:

env

AUDITTRAIL_ABLATION_BACKEND=turboquant
AUDITTRAIL_TURBOQUANT_RUNNER_URL=http://turboquant:8080

Runner contract (request/response, POST /score):

jsonc

// request
{
  "model": "llama3.1:70b-turboquant",
  "candidates": ["web_search", "calculator", "read_file"],
  "segments": [
    { "id": "seg_0", "text": "..." },
    { "id": "seg_1", "text": "..." }
  ]
}
 
// response
{
  "scores": {
    "seg_0": { "web_search": 0.82, "calculator": 0.11, "read_file": 0.07 },
    "seg_1": { "web_search": 0.12, "calculator": 0.85, "read_file": 0.03 }
  },
  "total_runs": 2
}

If the runner is unreachable or returns a non-200 response, the backend falls back to standard and logs a warning. It never silently degrades to the demo heuristic, because heuristic output would look like a real result.
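
A client for this contract, including the fallback, might look like the sketch below. It uses only the Python standard library. The function names, the `standard_scorer` callback, and the injectable `post` transport are assumptions made for illustration; only the `/score` path and the request/response fields come from the contract above.

```python
import json
import logging
import urllib.request
from typing import Callable, Dict, List

log = logging.getLogger("ablation")
Scores = Dict[str, Dict[str, float]]


def post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises HTTPError (an OSError subclass) for 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if resp.status != 200:
            raise RuntimeError(f"runner returned HTTP {resp.status}")
        return json.loads(resp.read().decode("utf-8"))


def score_segments(
    runner_url: str,
    model: str,
    candidates: List[str],
    segments: List[Dict[str, str]],
    standard_scorer: Callable[[List[str], List[Dict[str, str]]], Scores],
    post: Callable[[str, dict], dict] = post_json,
) -> Scores:
    payload = {"model": model, "candidates": candidates, "segments": segments}
    try:
        return post(runner_url.rstrip("/") + "/score", payload)["scores"]
    except (OSError, RuntimeError) as exc:
        # Runner unreachable or non-200: warn and use the standard backend
        # for this call only. No state is kept, so the next call tries the
        # runner again.
        log.warning("TurboQuant runner failed (%s); falling back to standard", exc)
        return standard_scorer(candidates, segments)
```

Keeping the fallback decision per call, with no cached "runner is down" flag, is one simple way to get the behaviour described here: a warning on failure, and the configured backend back in use as soon as the runner responds.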

Picking the right backend

Local demo / test / dev — leave it on standard with the demo scorer.
Production with API LLMs (OpenAI / Anthropic / etc.) — adaptive_route is almost always the right call. Free ~70% cost reduction, no infra.
Self-hosted GPU fleet — turboquant for throughput-bound batch workloads where every LLM round-trip is a hot path.

Cost accounting

Attribution calls made via each backend are recorded in the normal cost ledger — the only difference is the model name that lands on the span. Dashboards and /api/v1/reports/*.pdf pick them up automatically.

What this does NOT do

It does not skip the baseline full-prompt score. That call always uses the caller's chosen model, so the anchor probability is trustworthy.
It does not aggregate multiple users' workloads. Each ablation run stays per-user and per-trace; there's no cross-tenant batching.
The ablation_backend setting does not flip automatically when the runner becomes unreachable. You'll see a one-line warning and a fall-through to standard, and the process goes back to using the configured backend once the runner is reachable again.