Track — Benchmarks · AgentDojo, InjecAgent, ASB, WASP, MCPSecBench

Headline numbers — full upstream corpus

2,541 real cases. Two scorecards.

We publish two numbers per benchmark because the framing matters. The capability-realistic scorecard reflects how Track ships in production — its capability-aware policy maps each upstream record to the realistic capability it attacks (bank.transfer, email.send, iot.control, …). The flat scorecard strips that policy speech away and runs inspectors against a single generic capability — an inspector-only stress test that any honest comparison demands. Both are regenerated on every push to main and gate CI at a 1 pp regression budget on detection and FPR.
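The 1 pp gate can be pictured as a comparison between the previous scorecard and the freshly regenerated one. This is a minimal sketch, not Track's CI code: the `Row` record and `gate` function are hypothetical names, and it assumes a regression means detection falling or FPR rising by more than the budget.

```python
from dataclasses import dataclass

BUDGET_PP = 1.0  # regression budget in percentage points, per the text above


@dataclass(frozen=True)
class Row:
    """One scorecard row (hypothetical shape)."""
    benchmark: str
    detection: float  # percent
    fpr: float        # percent


def gate(baseline: list[Row], current: list[Row], budget_pp: float = BUDGET_PP) -> list[str]:
    """Return one message per metric that regressed by more than the budget."""
    base = {row.benchmark: row for row in baseline}
    failures = []
    for row in current:
        prev = base.get(row.benchmark)
        if prev is None:
            continue  # new benchmark: nothing to compare against yet
        if prev.detection - row.detection > budget_pp:
            failures.append(f"{row.benchmark}: detection {prev.detection} -> {row.detection}")
        if row.fpr - prev.fpr > budget_pp:
            failures.append(f"{row.benchmark}: FPR {prev.fpr} -> {row.fpr}")
    return failures
```

An empty list means the push passes; any entry would fail the build in this sketch.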

Capability-realistic scorecard — deployment shape

Generated: 2026-04-26 · Track: 48fccc4 · Corpus: full upstream (capability-realistic) · Cases: 2,541 · Stack: regex + policy + ML inspector ensemble

Benchmark	Cases	Detection	FPR	Match	p50	p95
AgentDojo	145	100.0%	0.0%	100.0%	82.0 ms	138.1 ms
InjecAgent	1,907	100.0%	0.0%	100.0%	100.0 ms	151.2 ms
Agent Security Bench	420	100.0%	0.0%	100.0%	49.2 ms	62.5 ms
WASP	58	100.0%	10.8%	93.1%	54.7 ms	216.3 ms
MCPSecBench	11	100.0%	0.0%	100.0%	39.5 ms	63.2 ms

Mean detection: 100.0% · Min detection: 100.0% · Worst FPR: 10.8% (WASP — DOM-level FP class is a known shortlist of inspector misses, tracked openly in the regression report).
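The summary line can be recomputed from the table rows. The sketch below assumes the mean is unweighted across the five benchmarks rather than weighted by case count, which is consistent, to within rounding, with the 70.9% mean on the flat scorecard. The `summarize` helper is a hypothetical name.

```python
# Rows copied from the capability-realistic table: (benchmark, detection %, FPR %).
ROWS = [
    ("AgentDojo", 100.0, 0.0),
    ("InjecAgent", 100.0, 0.0),
    ("Agent Security Bench", 100.0, 0.0),
    ("WASP", 100.0, 10.8),
    ("MCPSecBench", 100.0, 0.0),
]


def summarize(rows):
    """Unweighted mean and minimum detection, plus the benchmark with the worst FPR."""
    detections = [det for _, det, _ in rows]
    worst_name, _, worst_fpr = max(rows, key=lambda row: row[2])
    return {
        "mean_detection": sum(detections) / len(detections),
        "min_detection": min(detections),
        "worst_fpr": (worst_name, worst_fpr),
    }
```

Recomputing from the rounded table values can differ from a published figure in the last digit, since the published summary may be derived from unrounded inputs.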

Flat scorecard — inspector-only stress test, no policy speech

Generated: 2026-04-26 · Track: 48fccc4 · Corpus: full upstream (flat / generic-capability) · Cases: 2,541

Benchmark	Cases	Detection	FPR	Match	p50	p95
AgentDojo	145	51.3%	1.9%	85.5%	79.9 ms	142.0 ms
InjecAgent	1,907	100.0%	0.0%	100.0%	100.7 ms	150.7 ms
Agent Security Bench	420	84.5%	0.0%	85.2%	46.8 ms	62.1 ms
WASP	58	19.0%	10.8%	63.8%	52.9 ms	219.5 ms
MCPSecBench	11	100.0%	0.0%	100.0%	43.6 ms	60.8 ms

Mean detection: 70.9% · Min detection: 19.0% (WASP — DOM-level patterns are an active workstream) · Worst FPR: 10.8%. The flat scorecard is the inspector-only stress test — useful for honest apples-to-apples comparison with single-method guardrails, but it deliberately omits the capability-aware policy speech that ships in production.

Methodology & scope

How to read the table.

Seven points your security team should know before citing this page: what the corpus contains, how detection is counted, and what is deliberately out of scope.

Scope

Defended-detection axis only.

The upstream papers measure end-to-end agent task success under attack, with the model in the loop. Track is the runtime policy / inspector layer — we do not host the LLM. The numbers above map to the "with defense" column the papers publish, not to their headline attack-success-rate figures.

Corpus

Real cases from the published papers.

Every malicious case is sourced from the upstream paper's repository, not hand-authored: AgentDojo's banking / Slack / workspace traces (145), InjecAgent's full IPI dataset (1,907), the ASB attack matrix (420), WASP's web-agent trace pack (58), and MCPSecBench's MCP-specific cases (11). Benign controls come from each paper's published "no-attack" set, so the FPR denominator is theirs, not ours.

Corpus

Two scorecards, both published.

The capability-realistic scorecard is how Track ships in production — its capability-aware policy maps each upstream record to the realistic capability it attacks. The flat scorecard runs inspectors against a single generic capability, with no policy bound to it — an inspector-only stress test. Publishing both stops anyone (us included) from cherry-picking the framing that flatters us.

Metric

Detection is a union of signals.

A malicious case counts as detected when the pipeline does anything other than a clean allow: an inspector signal fires, an obligation is applied, an approval gate engages, or the request is denied. A pure allow with no obligations and no signal counts as undetected, even if the pipeline output looks innocuous.
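The counting rule reduces to a single boolean OR over the possible signals. A minimal sketch, assuming a hypothetical `Outcome` record; the field names are illustrative and are not Track's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Outcome:
    """Pipeline result for one case (hypothetical shape)."""
    decision: str                                          # "allow" or "deny"
    signals: list[str] = field(default_factory=list)       # inspector signals raised
    obligations: list[str] = field(default_factory=list)   # obligations applied
    approval_required: bool = False                        # approval gate engaged


def is_detected(outcome: Outcome) -> bool:
    """True unless the outcome is a clean allow: no deny, gate, obligation, or signal."""
    return (
        outcome.decision == "deny"
        or outcome.approval_required
        or bool(outcome.obligations)
        or bool(outcome.signals)
    )


def detection_rate(malicious_outcomes: list[Outcome]) -> float:
    """Percentage of malicious cases that count as detected."""
    detected = sum(is_detected(o) for o in malicious_outcomes)
    return 100.0 * detected / len(malicious_outcomes)
```

Under this rule an allow that carries an obligation (for example, a redaction) still counts as a detection, while a bare allow does not.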

Stack

Layered inspectors, ablations published.

Regex + policy is the deterministic floor. A PromptGuard 2 / PIGuard ML ensemble plus a stealthy-IPI classifier add semantic coverage on top. The output inspector and vector injection inspector extend coverage to retrieval-store poisoning and tool-output prompt-injection, and contextual PII upgrades regex baselines with prefix heuristics for cards / SSNs / passport numbers. CI publishes per-inspector contribution — which inspector caught what — so the delta from each layer is auditable, not asserted.
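A per-inspector contribution report of this kind can be approximated by counting, for each inspector, how many cases it flagged and how many it was the only one to flag. This is a sketch under assumptions: the `contributions` helper and the inspector names in the usage note are hypothetical, and the published report may be structured differently.

```python
from collections import Counter


def contributions(flags_by_case: dict[str, set[str]]) -> tuple[Counter, Counter]:
    """Count total and sole catches per inspector.

    flags_by_case maps a case id to the set of inspectors that flagged it.
    A "sole" catch is a case only one inspector flagged, i.e. coverage that
    would be lost if that inspector were removed.
    """
    total, sole = Counter(), Counter()
    for inspectors in flags_by_case.values():
        total.update(inspectors)
        if len(inspectors) == 1:
            sole.update(inspectors)
    return total, sole
```

For example, with `{"c1": {"regex", "ml_ensemble"}, "c2": {"ml_ensemble"}}`, both inspectors have catches in the total count, but only `ml_ensemble` has a sole catch, which is the delta that layer adds over the deterministic floor.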

Telemetry

Per-inspector latency, surfaced live.

Rolling-window p50 / p99 / max latency is tracked per inspector and rendered as a time-series on the operator console. Slow classifiers are visible to the operator the same shift the regression appears, not in the next quarter's review — and tuning coverage against your latency budget is a console action, not a config sweep.
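A rolling window per inspector can be kept with a bounded deque. The sketch below is illustrative only: `LatencyWindow` is a hypothetical class, the window is sample-count based rather than time based, and percentiles use the nearest-rank method, none of which is confirmed as what the console does.

```python
from collections import defaultdict, deque
import math


class LatencyWindow:
    """Keeps the most recent `size` latency samples per inspector."""

    def __init__(self, size: int = 1000):
        self._samples = defaultdict(lambda: deque(maxlen=size))

    def record(self, inspector: str, millis: float) -> None:
        # Once the deque is full, appending drops the oldest sample.
        self._samples[inspector].append(millis)

    def stats(self, inspector: str) -> dict[str, float]:
        """Nearest-rank p50 / p99 / max over the current window."""
        ordered = sorted(self._samples[inspector])
        if not ordered:
            raise ValueError(f"no samples recorded for {inspector!r}")

        def rank(p: int) -> float:
            # Smallest value with at least p% of the samples at or below it.
            return ordered[max(0, math.ceil(p * len(ordered) / 100) - 1)]

        return {"p50": rank(50), "p99": rank(99), "max": ordered[-1]}
```

Because old samples fall out of the window, a classifier that slows down shows up in `p99` and `max` as soon as the slow calls enter the window.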

Bounds

Latency excludes execution.

Numbers reflect inspector + policy evaluation only, measured around the simulation entrypoint. Tool execution and trace anchoring add their own latencies, which are tool-specific and not part of the published scorecard.
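The measurement boundary can be illustrated with a timer wrapped around a single call. In this sketch `evaluate` stands in for the simulation entrypoint; both it and `timed_ms` are hypothetical names, not Track's API.

```python
import time


def timed_ms(evaluate, request):
    """Run `evaluate(request)` and return (result, elapsed milliseconds).

    Only the evaluation call sits inside the timer. Whatever the caller does
    with the result afterwards (running the tool, anchoring the trace) falls
    outside the measurement.
    """
    start = time.perf_counter()
    result = evaluate(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.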

Per benchmark

What we tested, where it came from.

Each benchmark below links to the upstream paper and lists the real case count we evaluate against. Numbers are from the capability-realistic scorecard — how Track ships in production. Flat-scorecard numbers (inspector-only stress test, no policy speech) are in the table above.

AgentDojo

ETH Zürich SPY Lab · MIT licensed
arXiv:2406.13352

100.0%

Detection

0.0%

FPR

145

Cases

138.1 ms

p95

Full AgentDojo workspace + banking + Slack + travel suites — every malicious task and every benign control the upstream authors published, replayed unmodified through Track's pipeline. Capability mapping binds banking transfers to bank.transfer, workspace mail to email.send, Slack to collab.message, travel bookings to travel.book.

Attack categories (upstream taxonomy): instruction_override, role_manipulation, exfiltration, data_exfiltration, tool_misuse, multistep, policy_bypass, delimiter_injection
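The bindings listed for AgentDojo amount to a lookup from suite to capability, with the flat scorecard collapsing every record onto one generic capability. A minimal sketch: the suite keys, the `"generic"` capability name, and the `capability_for` helper are assumptions for illustration, and the real policy may map individual records rather than whole suites.

```python
# Suite-to-capability bindings as listed in the AgentDojo description.
CAPABILITY_BY_SUITE = {
    "banking": "bank.transfer",
    "workspace": "email.send",
    "slack": "collab.message",
    "travel": "travel.book",
}

GENERIC_CAPABILITY = "generic"  # hypothetical name for the flat scorecard's single capability


def capability_for(suite: str, realistic: bool = True) -> str:
    """Pick the capability a replayed record is evaluated against."""
    if not realistic:
        return GENERIC_CAPABILITY
    return CAPABILITY_BY_SUITE[suite.lower()]
```

Calling `capability_for("banking")` selects `bank.transfer`, while `capability_for("banking", realistic=False)` models the flat run, where no capability-specific policy applies.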

InjecAgent

UIUC Kang Lab · Apache-2.0
arXiv:2403.02691

100.0%

Detection

0.0%

FPR

1,907

Cases

151.2 ms

p95

The complete InjecAgent IPI dataset — 1,907 cases of indirect prompt injection embedded in tool-returned content (email replies, calendar events, search results, doc channels). The hardest class for naive guardrails to catch and by far the largest corpus we evaluate against. Track holds at 100% detection across all 1,907.

Attack categories (upstream taxonomy): direct_harm, data_stealing, email_to_finance_IPI, calendar_to_email, search_result_IPI, doc_channel_IPI, command_chain

Agent Security Bench

ASB · MIT licensed
arXiv:2410.02644

100.0%

Detection

0.0%

FPR

420

Cases

62.5 ms

p95

The full ASB attack matrix — 420 cases spanning the broadest threat model in the literature: direct prompt injection, observation-channel injection, memory injection, plan-extraction, tool manipulation, goal-hijack, agent-collusion, prompt-trojan, and more. Track's capability-aware policy detects 100% with 0% FPR.

Attack categories (ASB codes): DPI, OPI, MI, IPI, TM, GH, AC, PT, PE, DM, RC, II

WASP

Web-Agent Security Probe · Apache-2.0
arXiv:2504.18575

100.0%

Detection

10.8%

FPR

58

Cases

216.3 ms

p95

The full WASP web-agent corpus — DOM-level adversarial patterns most policy layers can't see (hidden divs, zero-width text, ARIA traps, homoglyph URLs). 100% detection on the malicious side; the 10.8% FPR is a known shortlist of benign DOM patterns the inspectors over-flag, tracked openly in the regression report and the active item on our roadmap.

Attack categories (upstream taxonomy): dom_hidden_div, zero_width_text, alt_text, form_label, aria_label, data_uri, phishing_url, link_text, card_exfil, unicode_homoglyph, iframe_overlay
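Two of the DOM-level patterns named for WASP, zero-width text and homoglyph URLs, are simple enough to illustrate directly. The checks below are a rough sketch of the pattern class, not Track's inspectors; both function names are hypothetical, and the homoglyph heuristic (ASCII and non-ASCII letters mixed in one hostname) is deliberately crude.

```python
# Common zero-width code points: ZWSP, ZWNJ, ZWJ, word joiner, and BOM / ZWNBSP.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def zero_width_positions(text: str) -> list[int]:
    """Indexes of zero-width characters hidden in `text`."""
    return [i for i, ch in enumerate(text) if ch in ZERO_WIDTH]


def mixes_ascii_and_non_ascii_letters(host: str) -> bool:
    """Crude homoglyph hint: the hostname mixes ASCII and non-ASCII letters."""
    letters = [ch for ch in host if ch.isalpha()]
    has_ascii = any(ch.isascii() for ch in letters)
    has_non_ascii = any(not ch.isascii() for ch in letters)
    return has_ascii and has_non_ascii
```

A heuristic like the second one also fires on some legitimate internationalized hostnames, which is the kind of benign pattern that can inflate a false-positive rate.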

MCPSecBench

MCP Security Benchmark · MIT licensed
arXiv:2508.13220

100.0%

Detection

0.0%

FPR

11

Cases

63.2 ms

p95

All 11 published MCPSecBench cases — Model Context Protocol-specific patterns including rug-pull updates, namespace shadowing, schema-lying tool descriptions, and supply-chain typosquats. The newest benchmark and the most relevant to the MCP ecosystem Track sits in front of. Small corpus today; we re-clone monthly to surface new cases. Rug-pull and supply-chain cases are additionally gated by Track's signed MCP manifest verification (Ed25519, pinned publisher keys) — closing OWASP ASI04 at runtime.

Attack categories (upstream taxonomy): rug_pull, namespace_shadowing, type_confusion, tool_description_jailbreak, scope_escalation, supply_chain_typosquat, schema_lying, memory_poisoning, exfil_via_unauthorized_tool

Scored against 2,541 real cases from the published benchmarks — not curated fixtures.

