Track is evaluated against the full upstream corpora of every headline 2024–2026 agent-security benchmark: AgentDojo (145 cases), InjecAgent (1,907), Agent Security Bench (420), WASP (58), and MCPSecBench (11). Every case is the unmodified record published by the paper's authors, replayed through Track's inspectors and policy engine. The numbers below were produced by `make benchmarks-dual` on commit 48fccc4, are regenerated on every push to main, and gate CI at a 1 pp regression budget.
We publish two numbers per benchmark because the framing matters. The capability-realistic scorecard reflects how Track ships in production: its capability-aware policy maps each upstream record to the realistic capability it attacks (bank.transfer, email.send, iot.control, …). The flat scorecard strips that policy mapping away and runs the inspectors against a single generic capability, an inspector-only stress test that any honest comparison demands.
Capability-realistic scorecard (how Track ships in production):

| Benchmark | Cases | Detection | FPR | Match | p50 | p95 |
|---|---|---|---|---|---|---|
| AgentDojo | 145 | 100.0% | 0.0% | 100.0% | 82.0 ms | 138.1 ms |
| InjecAgent | 1,907 | 100.0% | 0.0% | 100.0% | 100.0 ms | 151.2 ms |
| Agent Security Bench | 420 | 100.0% | 0.0% | 100.0% | 49.2 ms | 62.5 ms |
| WASP | 58 | 100.0% | 10.8% | 93.1% | 54.7 ms | 216.3 ms |
| MCPSecBench | 11 | 100.0% | 0.0% | 100.0% | 39.5 ms | 63.2 ms |
Mean detection: 100.0% · Min detection: 100.0% · Worst FPR: 10.8% (WASP; the DOM-level false positives are a known short list of inspector misfires, tracked openly in the regression report).
Flat scorecard (inspector-only stress test):

| Benchmark | Cases | Detection | FPR | Match | p50 | p95 |
|---|---|---|---|---|---|---|
| AgentDojo | 145 | 51.3% | 1.9% | 85.5% | 79.9 ms | 142.0 ms |
| InjecAgent | 1,907 | 100.0% | 0.0% | 100.0% | 100.7 ms | 150.7 ms |
| Agent Security Bench | 420 | 84.5% | 0.0% | 85.2% | 46.8 ms | 62.1 ms |
| WASP | 58 | 19.0% | 10.8% | 63.8% | 52.9 ms | 219.5 ms |
| MCPSecBench | 11 | 100.0% | 0.0% | 100.0% | 43.6 ms | 60.8 ms |
Mean detection: 70.9% · Min detection: 19.0% (WASP; DOM-level patterns are an active workstream) · Worst FPR: 10.8%. The flat scorecard is the inspector-only stress test: useful for an apples-to-apples comparison with single-method guardrails, but it deliberately omits the capability-aware policy that ships in production.
Six points your security team should know before they cite this page — what the corpus contains, how detection is counted, and what's deliberately out of scope.
The upstream papers measure end-to-end agent task success under attack, with the model in the loop. Track is the runtime policy / inspector layer — we do not host the LLM. The numbers above map to the "with defense" column the papers publish, not to their headline attack-success-rate figures.
Every malicious case is sourced from the upstream paper's repository, not hand-authored: AgentDojo's banking / Slack / workspace traces (145), InjecAgent's full IPI dataset (1,907), the ASB attack matrix (420), WASP's web-agent trace pack (58), and MCPSecBench's MCP-specific cases (11). Benign controls come from each paper's published "no-attack" set, so the FPR denominator is theirs, not ours.
The capability-realistic scorecard is how Track ships in production — its capability-aware policy maps each upstream record to the realistic capability it attacks. The flat scorecard runs inspectors against a single generic capability, with no policy bound to it — an inspector-only stress test. Publishing both stops anyone (us included) from cherry-picking the framing that flatters us.
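The difference between the two scorecards can be sketched as a choice of which capability a record is evaluated against. This is an illustrative sketch, not Track's implementation: the capability names `bank.transfer`, `email.send`, and `iot.control` come from this page, while the tool names, the `generic.tool` placeholder, and the `capability_for` helper are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping from an upstream record's tool name to a capability.
# Capability names appear on this page; the tool names are invented.
CAPABILITY_BY_TOOL = {
    "send_money": "bank.transfer",
    "send_email": "email.send",
    "set_thermostat": "iot.control",
}

# Assumed placeholder name for the single generic capability.
GENERIC_CAPABILITY = "generic.tool"


@dataclass(frozen=True)
class UpstreamRecord:
    case_id: str
    tool: str
    payload: str


def capability_for(record: UpstreamRecord, mode: str) -> str:
    """Return the capability a record is evaluated against in each mode."""
    if mode == "flat":
        # Flat scorecard: every record hits the same generic capability.
        return GENERIC_CAPABILITY
    if mode == "capability":
        # Capability-realistic scorecard: use the mapped capability,
        # falling back to the generic one when no mapping is known.
        return CAPABILITY_BY_TOOL.get(record.tool, GENERIC_CAPABILITY)
    raise ValueError(f"unknown scorecard mode: {mode!r}")
```

In this sketch the same record is scored twice, once per mode, which is why the two tables share case counts but can differ in detection.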
A malicious case counts as detected when the pipeline does anything short of a clean allow: an inspector signal, an obligation applied, an approval gate engaged, or a deny. A pure allow with no obligations and no signal counts as undetected, even if the pipeline output looks innocuous.
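The counting rule can be written down in a few lines. The sketch below is a minimal illustration under stated assumptions: the `Outcome` record, its field names, and the decision labels are hypothetical, and applying the same "anything but a clean allow" rule to benign controls when computing FPR is an assumption, not something this page specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Outcome:
    decision: str                       # assumed labels: "allow", "deny", "approval_required"
    signals: list[str] = field(default_factory=list)      # inspector signals raised
    obligations: list[str] = field(default_factory=list)  # obligations applied
    malicious: bool = False             # ground-truth label from the corpus


def is_flagged(o: Outcome) -> bool:
    """True for anything other than a clean allow."""
    return o.decision != "allow" or bool(o.signals) or bool(o.obligations)


def scorecard(outcomes: list[Outcome]) -> dict[str, float]:
    """Compute detection over malicious cases and FPR over benign controls."""
    malicious = [o for o in outcomes if o.malicious]
    benign = [o for o in outcomes if not o.malicious]
    detection = sum(is_flagged(o) for o in malicious) / len(malicious) if malicious else 0.0
    # Assumption: a benign control counts as a false positive under the same rule.
    fpr = sum(is_flagged(o) for o in benign) / len(benign) if benign else 0.0
    return {"detection": detection, "fpr": fpr}
```

Note that an allow that carries an obligation or a signal still counts as flagged here, matching the rule that only a pure allow is a miss.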
Regex + policy is the deterministic floor. A PromptGuard 2 / PIGuard ML ensemble plus a stealthy-IPI classifier add semantic coverage on top. The output inspector and vector injection inspector extend coverage to retrieval-store poisoning and tool-output prompt-injection, and contextual PII upgrades regex baselines with prefix heuristics for cards / SSNs / passport numbers. CI publishes per-inspector contribution — which inspector caught what — so the delta from each layer is auditable, not asserted.
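One way to make "which inspector caught what" auditable is to attribute each detected case to the earliest layer that fired on it. The sketch below assumes a layer order starting from the deterministic floor; the layer identifiers and the `marginal_contribution` helper are hypothetical names chosen for illustration and may not match how Track's CI report attributes detections.

```python
from collections import Counter

# Assumed layer order, deterministic floor first; identifiers are illustrative.
LAYER_ORDER = [
    "regex_policy",
    "ml_ensemble",
    "stealthy_ipi",
    "output_inspector",
    "vector_injection",
]


def marginal_contribution(fired_by_case: dict[str, set[str]]) -> Counter:
    """Count cases by the earliest layer that fired; the rest are undetected."""
    counts: Counter = Counter()
    for fired in fired_by_case.values():
        for layer in LAYER_ORDER:
            if layer in fired:
                counts[layer] += 1
                break
        else:
            # No layer fired on this case.
            counts["undetected"] += 1
    return counts
```

Under this attribution, a case caught by both regex and the ML ensemble is credited to regex, so the ML count reads as what the ML layer adds on top of the floor.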
Rolling-window p50 / p99 / max latency is tracked per inspector and rendered as a time-series on the operator console. Slow classifiers are visible to the operator the same shift the regression appears, not in the next quarter's review — and tuning coverage against your latency budget is a console action, not a config sweep.
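A rolling-window tracker of this kind can be sketched with a bounded deque per inspector. This is a minimal sketch, not the console's implementation: the `RollingLatency` class, the window size, and the nearest-rank percentile method are all assumptions.

```python
from collections import defaultdict, deque


class RollingLatency:
    """Keep the last `window` latency samples per inspector."""

    def __init__(self, window: int = 1000):
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, inspector: str, latency_ms: float) -> None:
        # The deque drops the oldest sample once the window is full.
        self._samples[inspector].append(latency_ms)

    def summary(self, inspector: str) -> dict[str, float]:
        data = sorted(self._samples.get(inspector, ()))
        if not data:
            raise ValueError(f"no samples recorded for {inspector!r}")
        n = len(data)

        def pct(p: int) -> float:
            # Nearest-rank percentile using integer ceiling division.
            rank = max(1, -(-p * n // 100))
            return data[rank - 1]

        return {"p50": pct(50), "p99": pct(99), "max": data[-1]}
```

Emitting `summary()` for each inspector at a fixed interval would produce the kind of time series described above; a production tracker would more likely use time-based windows than a fixed sample count.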
Numbers reflect inspector + policy evaluation only, measured around the simulation entrypoint. Tool execution and trace anchoring add their own latencies, which are tool-specific and not part of the published scorecard.
Each benchmark below links to the upstream paper and lists the real case count we evaluate against. Numbers are from the capability-realistic scorecard, which is how Track ships in production. Flat-scorecard numbers (inspector-only stress test, no capability-aware policy) are in the table above.
instruction_override · role_manipulation · exfiltration · data_exfiltration · tool_misuse · multistep · policy_bypass · delimiter_injection
direct_harm · data_stealing · email_to_finance_IPI · calendar_to_email · search_result_IPI · doc_channel_IPI · command_chain
DPIOPIMIIPITMGHACPTPEDMRCII
dom_hidden_div · zero_width_text · alt_text · form_label · aria_label · data_uri · phishing_url · link_text · card_exfil · unicode_homoglyph · iframe_overlay
rug_pull · namespace_shadowing · type_confusion · tool_description_jailbreak · scope_escalation · supply_chain_typosquat · schema_lying · memory_poisoning · exfil_via_unauthorized_tool
Every published scorecard carries a track_version field — the git short SHA, plus +dirty if the working tree was dirty when it ran. Any number on this page traces back to a specific commit, which is the only way benchmark claims should travel.
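A version string of this shape can be derived from git directly. The sketch below is one plausible way to do it, not necessarily how Track's harness does: the function names are hypothetical, and treating any `git status --porcelain` output (including untracked files) as dirty is an assumption.

```python
import subprocess


def format_track_version(short_sha: str, dirty: bool) -> str:
    """Short SHA, with a +dirty suffix when the working tree has changes."""
    return f"{short_sha}+dirty" if dirty else short_sha


def current_track_version(repo_dir: str = ".") -> str:
    """Read the short SHA and dirty state from the git repo at repo_dir."""
    sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Assumption: any porcelain output, untracked files included, means dirty.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return format_track_version(sha, dirty=bool(status.strip()))
```

For the commit cited on this page, a clean run would yield `48fccc4` and a run with uncommitted changes would yield `48fccc4+dirty`.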
```shell
make benchmarks-download     # clone published upstream corpora (one-time)
make benchmarks-full         # full corpus, flat scorecard
make benchmarks-capability   # full corpus, capability-realistic scorecard
make benchmarks-dual         # both scorecards in one run
make benchmark-dashboard     # render per-attack-class drift dashboard

# Per-PR regression gate: fails CI on > 1pp detection drop or FPR rise
make benchmark-regression-check FAIL_ON_REGRESSION=1

# Regex + policy only: disables ML inspectors, quantifies what ML adds
AG_BENCHMARK_DISABLE_ML=1 make benchmarks-dual
```
CI runs the full upstream corpus on every PR that touches inspector / policy / model paths and hard-fails on a > 1 percentage-point detection drop or FPR rise. A monthly scheduled job re-clones each paper's upstream repository to surface case-count drift before it reaches a customer. The regression report and per-attack-class drift dashboard are committed back to the repo on every push to main.
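The gate described above amounts to comparing each benchmark's detection and FPR against a baseline with a 1 percentage-point budget. The following is a minimal sketch under assumptions: the `find_regressions` helper and the dictionary layout are hypothetical, and the real check behind `make benchmark-regression-check` may differ in details such as how the baseline is chosen.

```python
def find_regressions(
    baseline: dict[str, dict[str, float]],
    current: dict[str, dict[str, float]],
    budget_pp: float = 1.0,
) -> list[str]:
    """Return one message per benchmark metric that exceeds the budget.

    Both inputs map benchmark name -> {"detection": pct, "fpr": pct},
    with values expressed in percent.
    """
    failures = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None:
            # A benchmark that disappears from the run is treated as a failure.
            failures.append(f"{name}: missing from current run")
            continue
        drop = base["detection"] - cur["detection"]
        rise = cur["fpr"] - base["fpr"]
        if drop > budget_pp:
            failures.append(f"{name}: detection fell {drop:.1f} pp")
        if rise > budget_pp:
            failures.append(f"{name}: FPR rose {rise:.1f} pp")
    return failures
```

A CI step could exit non-zero whenever the returned list is non-empty. The comparison is strict, so a change of exactly 1 pp passes, in line with the "> 1 pp" wording.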
We'll walk through the harness, show the regression history across releases, and run the regex-only baseline alongside the ML stack so your team can see what each layer contributes.
Book a benchmark walkthrough