§ BENCHMARKS

Enforcement latency, measured honestly

Governance only works if it is fast enough to sit in the request path. This page documents how we measure Execlave's enforcement latency and throughput — and publishes the numbers once they are measured under this methodology.

The pattern-tier and server-side enforce figures below are measured by checked-in harnesses (see methodology). The semantic tier and scaled throughput are marked not yet published — we publish only figures measured under the methodology on this page, never estimates or marketing round-numbers.

§ RESULTS

Pattern tier — measured

In-process policy evaluation latency (parse-cached CEL), excluding network and database time.

p50, in-process policy evaluation: 8.4 µs

p95, in-process policy evaluation: 10.6 µs

p99, in-process policy evaluation: ~21 µs

Throughput, single core, one process: ~96,000 eval passes/sec
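As a sanity check, the throughput figure can be converted into a mean cost per pass and compared with the percentiles. This is a small sketch that assumes the throughput was obtained by running passes back to back on one core; the page does not state how it was derived, so treat the result as approximate.

```typescript
// Published pattern-tier figures (approximate where the page marks them "~").
const passesPerSecond = 96_000;
const p50Micros = 8.4;
const p95Micros = 10.6;

// Assumption: passes run sequentially on one core, so the mean time per
// pass is simply the inverse of the throughput.
const meanMicrosPerPass = 1_000_000 / passesPerSecond;

console.log(meanMicrosPerPass.toFixed(1)); // "10.4"
console.log(p50Micros < meanMicrosPerPass && meanMicrosPerPass < p95Micros); // true
```

An implied mean above the median is what a right-skewed latency distribution looks like: a small number of slow passes pull the average up.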

§ RESULTS

Server-side enforce decision — measured

The real PolicyService.enforcePreExecution path: agent-status lookup + policy load + evaluation, DB-inclusive, against a local Postgres. Excludes HTTP/auth framing and the model-backed semantic tier.

p50, server-side enforce decision (DB-inclusive): 2.1 ms

p95, server-side enforce decision (DB-inclusive): 2.9 ms

p99, server-side enforce decision (DB-inclusive): 3.9 ms

§ RESULTS

Kill switch — measured

Server-side cost of the pause taking effect: the real AgentService.pause transaction (BEGIN + SET LOCAL + status SELECT/UPDATE + COMMIT), DB-inclusive, under the app_user role with RLS enforced. Propagation is synchronous — the first policy check after the pause commits is already blocked. Excludes HTTP/auth framing and network; a client over HTTP additionally pays per-request auth and round-trip latency.

p50, pause transaction (kill takes effect, DB-inclusive): 2.9 ms

p95, pause transaction (kill takes effect, DB-inclusive): 3.8 ms

p99, pause transaction (kill takes effect, DB-inclusive): 5.4 ms
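The synchronous propagation described above can be illustrated with an in-memory sketch. Everything here is hypothetical: AgentStore, pauseAgent, and checkBeforeExecution are simplified stand-ins for the database table, AgentService.pause, and the enforce path, not their real signatures. The sketch assumes the check reads agent status from the store on every call, so nothing is cached between the pause and the next decision.

```typescript
type AgentStatus = "active" | "paused";
type Decision = { allowed: boolean; reason: string };

// Hypothetical in-memory stand-in for the agents table.
class AgentStore {
  private statuses = new Map<string, AgentStatus>();

  register(agentId: string): void {
    this.statuses.set(agentId, "active");
  }

  getStatus(agentId: string): AgentStatus | undefined {
    return this.statuses.get(agentId);
  }

  setStatus(agentId: string, status: AgentStatus): void {
    if (!this.statuses.has(agentId)) throw new Error(`unknown agent: ${agentId}`);
    this.statuses.set(agentId, status);
  }
}

// Stand-in for the pause transaction: once this returns, "paused" is the
// only status any later reader can observe.
function pauseAgent(store: AgentStore, agentId: string): void {
  store.setStatus(agentId, "paused");
}

// Stand-in for the enforce decision: the status lookup runs first on every
// call, then a single hypothetical cost policy.
function checkBeforeExecution(store: AgentStore, agentId: string, costUsd: number): Decision {
  const status = store.getStatus(agentId);
  if (status === undefined) return { allowed: false, reason: "unknown agent" };
  if (status === "paused") return { allowed: false, reason: "agent paused" };
  if (costUsd > 1.0) return { allowed: false, reason: "cost limit exceeded" };
  return { allowed: true, reason: "ok" };
}

const store = new AgentStore();
store.register("agent-1");

const beforePause = checkBeforeExecution(store, "agent-1", 0.02);
pauseAgent(store, "agent-1");
const afterPause = checkBeforeExecution(store, "agent-1", 0.02);

console.log(beforePause); // { allowed: true, reason: "ok" }
console.log(afterPause);  // { allowed: false, reason: "agent paused" }
```

Because the status is read on each check rather than cached, there is no propagation delay to measure separately; the cost of the kill switch is the cost of the write itself.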

§ RESULTS

Semantic tier & scaled throughput — pending

Dominated by the chosen LLM (semantic tier) and worker fan-out (throughput). These figures will be measured on representative infrastructure before publication, never estimated.

Semantic-tier latency (model-backed checks): not yet published

Throughput, horizontally scaled (traces/sec): not yet published

§ METHODOLOGY

How we measure

Documented so that the published numbers are reproducible and comparable across releases.

What the measured figures cover

The pattern-tier numbers cover in-process evaluation only: the cost of evaluating a trace against a set of policies in the enforcement engine (parse-cached CEL expression evaluation), excluding network and database time. This is the part of enforcement latency Execlave controls directly. It is reported as p50/p95/p99 rather than as an average, because tail latency is what governance SLAs care about.
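To see why percentiles are reported instead of an average, here is a minimal sketch using the nearest-rank percentile definition. The latency samples are invented for illustration and are not measurements; the page does not state which percentile method the harness uses, so nearest-rank is an assumption.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Illustrative data only: 98 fast evaluations at 8 µs and 2 slow ones at 500 µs.
const latenciesUs: number[] = [...Array.from({ length: 98 }, () => 8), 500, 500];

const summary = {
  mean: mean(latenciesUs),
  p50: percentile(latenciesUs, 50),
  p95: percentile(latenciesUs, 95),
  p99: percentile(latenciesUs, 99),
};

console.log(summary); // { mean: 17.84, p50: 8, p95: 8, p99: 500 }
```

The mean of 17.84 µs describes no request that actually happened: typical requests take 8 µs and the slow ones take 500 µs. Only the p99 exposes the tail.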

Measurement provenance

Measured on Node.js v22.20.0, single process, one core, caches warmed. Each "pass" evaluates a representative set of 5 expression policies (cost/size/environment comparisons) against one trace, over 200,000 iterations after a 20,000-iteration warmup. The harness is checked in at backend/scripts/bench-enforcement.ts and calls the same evaluateExpression path the runtime uses — re-run it to reproduce these figures on your own hardware.
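The warmup-then-measure procedure described above might look roughly like the following. This is a sketch, not the checked-in harness: the Trace shape, the five policies, and evaluatePass are hypothetical stand-ins written as plain functions, whereas the real harness evaluates CEL expressions through evaluateExpression. It also assumes a Node.js runtime where performance.now() is available globally.

```typescript
// Hypothetical trace shape and policies, for illustration only.
type Trace = { costUsd: number; sizeBytes: number; environment: string };
type Policy = (trace: Trace) => boolean;

const policies: Policy[] = [
  (t) => t.costUsd < 1.0,
  (t) => t.costUsd >= 0,
  (t) => t.sizeBytes < 1_000_000,
  (t) => t.sizeBytes > 0,
  (t) => t.environment === "production",
];

// One "pass": evaluate every policy against a single trace.
function evaluatePass(trace: Trace): boolean {
  let ok = true;
  for (const policy of policies) {
    ok = policy(trace) && ok; // call first so every policy always runs
  }
  return ok;
}

// Nearest-rank percentile over samples already sorted ascending.
function percentile(sortedAsc: ArrayLike<number>, p: number): number {
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1];
}

function bench(warmup: number, iterations: number) {
  const trace: Trace = { costUsd: 0.02, sizeBytes: 4096, environment: "production" };
  let passed = 0; // consumed below so the work cannot be optimised away

  // Warmup: give the JIT time to optimise the hot path; timings are discarded.
  for (let i = 0; i < warmup; i++) if (evaluatePass(trace)) passed++;

  const samplesUs = new Float64Array(iterations);
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    if (evaluatePass(trace)) passed++;
    samplesUs[i] = (performance.now() - t0) * 1000; // ms to µs
  }
  const elapsedMs = performance.now() - start;

  samplesUs.sort(); // typed arrays sort numerically by default
  return {
    iterations,
    passed,
    p50Us: percentile(samplesUs, 50),
    p95Us: percentile(samplesUs, 95),
    p99Us: percentile(samplesUs, 99),
    passesPerSec: iterations / (elapsedMs / 1000),
  };
}

const result = bench(20_000, 200_000);
console.log(result);
```

Timing each pass individually includes the cost of the two timer calls, which matters at microsecond scale. A harness that needs tighter numbers could time batches of passes and divide, at the price of losing per-pass tail information.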

End-to-end measurement

The server-side figures drive the real PolicyService.enforcePreExecution path against a live Postgres: agent-status lookup, policy load, and expression evaluation, including every database round-trip the SDK enforce endpoint makes, under the least-privilege app role with row-level security on. Measured on Node.js v22.20.0 against a local Postgres (sub-millisecond network), over 5,000 calls after a 500-call warmup; the harness is checked in at backend/scripts/bench-enforce-e2e.ts. The figures exclude HTTP/auth framing (sub-millisecond, and not part of the enforcement decision itself) and the model-backed semantic tier. Production database round-trips are slower than localhost, so treat these as a floor, not a ceiling.

What is NOT in these numbers

The semantic tier (model-backed checks via the Python processing service) depends on the chosen LLM and is not yet published. Horizontally-scaled throughput (traces/sec across workers) is likewise not yet published. We do not blend the microsecond evaluation, the millisecond server-side decision, and the model-backed tier into one headline figure, and we do not estimate the tiers we have not measured.

Why per-tier, never blended

Pattern-tier (in-process, microsecond-scale) and semantic-tier (model-backed, millisecond-to-second-scale) enforcement differ by orders of magnitude. Collapsing them into a single "enforcement latency" number would be misleading, so each tier is reported separately and labelled with exactly what it includes.
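One way to keep tiers from being blended is to make the separation part of the data model. The sketch below is a hypothetical report shape, not something Execlave ships: each tier carries its own unit and its own includes/excludes labels, and a pending tier has no latency field at all, so there is nothing to average it with. The figures are copied from the results on this page.

```typescript
type Percentiles = { p50: number; p95: number; p99: number };

// A tier is either measured (with explicit scope labels) or pending.
type TierReport =
  | {
      tier: string;
      status: "measured";
      unit: "µs" | "ms";
      includes: string[];
      excludes: string[];
      latency: Percentiles;
    }
  | { tier: string; status: "pending" };

const reports: TierReport[] = [
  {
    tier: "pattern",
    status: "measured",
    unit: "µs",
    includes: ["parse-cached CEL evaluation"],
    excludes: ["network", "database"],
    latency: { p50: 8.4, p95: 10.6, p99: 21 }, // p99 is published as "~21"
  },
  {
    tier: "server-side enforce",
    status: "measured",
    unit: "ms",
    includes: ["agent-status lookup", "policy load", "evaluation", "local Postgres"],
    excludes: ["HTTP/auth framing", "semantic tier"],
    latency: { p50: 2.1, p95: 2.9, p99: 3.9 },
  },
  { tier: "semantic", status: "pending" },
];

function formatReport(r: TierReport): string {
  if (r.status === "pending") return `${r.tier}: not yet published`;
  const { p50, p95, p99 } = r.latency;
  return (
    `${r.tier}: p50 ${p50} ${r.unit}, p95 ${p95} ${r.unit}, p99 ${p99} ${r.unit}` +
    ` (includes ${r.includes.join(", ")}; excludes ${r.excludes.join(", ")})`
  );
}

const lines = reports.map(formatReport);
lines.forEach((line) => console.log(line));
```

Because the unit lives on each report rather than being global, any code that tried to combine a microsecond tier with a millisecond tier would have to do so explicitly.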
