Benchmarks · SOTA-0.5 · full upstream corpus

Scored against 2,541 real cases from the published benchmarks — not curated fixtures.

Track is evaluated against the full upstream corpora of every headline 2024–2026 agent-security benchmark — AgentDojo (145 cases), InjecAgent (1,907), Agent Security Bench (420), WASP (58), and MCPSecBench (11). Every case is the unmodified record published by the paper's authors, replayed through Track's inspectors and policy engine. The numbers below come out of make benchmarks-dual on commit 48fccc4, are regenerated on every push to main, and gate CI at a 1 pp regression budget.

These are not hand-authored tests. Every malicious case here is sourced directly from the upstream paper's repository — we run the same prompts, tool outputs, and attack payloads the original authors published. Track is a runtime policy / inspector layer, so we score the defended-detection axis — did the pipeline flag, gate, mitigate, or block the malicious payload — which maps to the "with defense" column those papers publish. We do not host the LLM, so end-to-end attack-success-rate numbers (which depend on the model in the loop) are not what we claim here.
Headline numbers — full upstream corpus

2,541 real cases. Two scorecards.

We publish two numbers per benchmark because the framing matters. The capability-realistic scorecard reflects how Track ships in production — its capability-aware policy maps each upstream record to the realistic capability it attacks (bank.transfer, email.send, iot.control, …). The flat scorecard strips that policy speech away and runs inspectors against a single generic capability — an inspector-only stress test that any honest comparison demands. Both are regenerated on every push to main and gate CI at a 1 pp regression budget on detection and FPR.

Capability-realistic scorecard — deployment shape

Generated: 2026-04-26 · Track: 48fccc4 · Corpus: full upstream (capability-realistic) · Cases: 2,541 · Stack: regex + policy + ML inspector ensemble

Benchmark              Cases   Detection   FPR     Match    p50 latency   p95 latency
AgentDojo              145     100.0%      0.0%    100.0%   82.0 ms       138.1 ms
InjecAgent             1,907   100.0%      0.0%    100.0%   100.0 ms      151.2 ms
Agent Security Bench   420     100.0%      0.0%    100.0%   49.2 ms       62.5 ms
WASP                   58      100.0%      10.8%   93.1%    54.7 ms       216.3 ms
MCPSecBench            11      100.0%      0.0%    100.0%   39.5 ms       63.2 ms

Mean detection: 100.0% · Min detection: 100.0% · Worst FPR: 10.8% (WASP; the DOM-level false positives are a known shortlist of benign patterns the inspectors over-flag, tracked openly in the regression report).
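
The summary line can be recomputed from the table rows. The sketch below is illustrative only: it assumes the mean is an unweighted average across the five benchmarks (the page does not say whether it is case-weighted), and the `Row` type and `summarize` helper are hypothetical names, not part of Track's harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    benchmark: str
    cases: int
    detection: float  # percent
    fpr: float        # percent


# Values copied from the capability-realistic scorecard above.
ROWS = [
    Row("AgentDojo", 145, 100.0, 0.0),
    Row("InjecAgent", 1907, 100.0, 0.0),
    Row("Agent Security Bench", 420, 100.0, 0.0),
    Row("WASP", 58, 100.0, 10.8),
    Row("MCPSecBench", 11, 100.0, 0.0),
]


def summarize(rows):
    """Recompute the headline summary (assumes an unweighted mean)."""
    worst = max(rows, key=lambda r: r.fpr)
    return {
        "cases": sum(r.cases for r in rows),
        "mean_detection": round(sum(r.detection for r in rows) / len(rows), 1),
        "min_detection": min(r.detection for r in rows),
        "worst_fpr": worst.fpr,
        "worst_fpr_benchmark": worst.benchmark,
    }


summary = summarize(ROWS)
```

Here `summary["cases"]` totals 2,541 and the worst-FPR benchmark resolves to WASP, matching the line above.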

Flat scorecard — inspector-only stress test, no policy speech

Generated: 2026-04-26 · Track: 48fccc4 · Corpus: full upstream (flat / generic-capability) · Cases: 2,541

Benchmark              Cases   Detection   FPR     Match    p50 latency   p95 latency
AgentDojo              145     51.3%       1.9%    85.5%    79.9 ms       142.0 ms
InjecAgent             1,907   100.0%      0.0%    100.0%   100.7 ms      150.7 ms
Agent Security Bench   420     84.5%       0.0%    85.2%    46.8 ms       62.1 ms
WASP                   58      19.0%       10.8%   63.8%    52.9 ms       219.5 ms
MCPSecBench            11      100.0%      0.0%    100.0%   43.6 ms       60.8 ms

Mean detection: 70.9% · Min detection: 19.0% (WASP — DOM-level patterns are an active workstream) · Worst FPR: 10.8%. The flat scorecard is the inspector-only stress test — useful for honest apples-to-apples comparison with single-method guardrails, but it deliberately omits the capability-aware policy speech that ships in production.

Methodology & scope

How to read the table.

Seven points your security team should know before citing this page: what the corpus contains, how detection is counted, and what is deliberately out of scope.

Scope

Defended-detection axis only.

The upstream papers measure end-to-end agent task success under attack, with the model in the loop. Track is the runtime policy / inspector layer — we do not host the LLM. The numbers above map to the "with defense" column the papers publish, not to their headline attack-success-rate figures.

Corpus

Real cases from the published papers.

Every malicious case is sourced from the upstream paper's repository, not hand-authored: AgentDojo's banking / Slack / workspace traces (145), InjecAgent's full IPI dataset (1,907), the ASB attack matrix (420), WASP's web-agent web-trace pack (58), and MCPSecBench's MCP-specific cases (11). Benign controls come from each paper's published "no-attack" set so the FPR denominator is theirs, not ours.
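
As a rough illustration of how the two rates relate to those denominators, the sketch below computes detection over the malicious cases and FPR over the benign controls. The function names and the example counts are hypothetical; they are not figures from any of the upstream corpora.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return a percentage rounded to one decimal place."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100.0 * numerator / denominator, 1)


def score(malicious_total: int, malicious_flagged: int,
          benign_total: int, benign_flagged: int) -> dict:
    """Detection uses the malicious set; FPR uses the benign control set."""
    return {
        "detection": rate(malicious_flagged, malicious_total),
        "fpr": rate(benign_flagged, benign_total),
    }


# Hypothetical counts, for illustration only.
example = score(malicious_total=20, malicious_flagged=19,
                benign_total=40, benign_flagged=2)
```

Because the benign set is the paper's own, a change in the upstream "no-attack" set changes the FPR denominator even if the inspectors stay the same.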

Framing

Two scorecards, both published.

The capability-realistic scorecard is how Track ships in production — its capability-aware policy maps each upstream record to the realistic capability it attacks. The flat scorecard runs inspectors against a single generic capability, with no policy bound to it — an inspector-only stress test. Publishing both stops anyone (us included) from cherry-picking the framing that flatters us.
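
A minimal sketch of the difference between the two modes, assuming a simple lookup from benchmark suite to capability. The capability names are the ones listed for AgentDojo on this page; the suite keys, the `generic.tool` placeholder, and the `resolve_capability` helper are hypothetical and do not describe Track's actual policy API.

```python
# Hypothetical placeholder for the single generic capability used in flat mode.
GENERIC_CAPABILITY = "generic.tool"

# Suite keys are illustrative; capability names come from the AgentDojo entry.
SUITE_TO_CAPABILITY = {
    "banking": "bank.transfer",
    "workspace": "email.send",
    "slack": "collab.message",
    "travel": "travel.book",
}


def resolve_capability(suite: str, mode: str) -> str:
    """Pick the capability a replayed record is evaluated against."""
    if mode == "flat":
        # Inspector-only stress test: every record gets the same capability.
        return GENERIC_CAPABILITY
    if mode == "capability":
        # Deployment shape: bind the record to the capability it attacks.
        return SUITE_TO_CAPABILITY.get(suite, GENERIC_CAPABILITY)
    raise ValueError(f"unknown scorecard mode: {mode!r}")
```

The same record therefore produces two rows, one per mode, which is why detection can differ between scorecards while the inspectors stay identical.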

Metric

Detection is a union of signals.

A malicious case counts as detected when the pipeline does anything other than a clean allow: an inspector signal fires, an obligation is applied, an approval gate engages, or the request is denied. A pure allow with no obligations and no signal counts as undetected, even if the pipeline output looks innocuous.
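
The rule can be written as a single predicate. This is a sketch under assumptions: the `Outcome` shape and the decision strings (`allow`, `deny`, `require_approval`) are hypothetical stand-ins for whatever the pipeline actually returns.

```python
from dataclasses import dataclass, field


@dataclass
class Outcome:
    decision: str                                     # hypothetical: "allow", "deny", "require_approval"
    signals: list = field(default_factory=list)       # inspector signals that fired
    obligations: list = field(default_factory=list)   # obligations applied by policy


def is_detected(outcome: Outcome) -> bool:
    """Detected unless the result is a clean allow: no signal, no obligation."""
    return (
        outcome.decision != "allow"
        or bool(outcome.signals)
        or bool(outcome.obligations)
    )
```

An allow that carries a redaction obligation counts as detected; only `Outcome("allow")` with empty lists counts as a miss.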

Stack

Layered inspectors, ablations published.

Regex + policy is the deterministic floor. A PromptGuard 2 / PIGuard ML ensemble plus a stealthy-IPI classifier add semantic coverage on top. The output inspector and vector injection inspector extend coverage to retrieval-store poisoning and tool-output prompt-injection, and contextual PII upgrades regex baselines with prefix heuristics for cards / SSNs / passport numbers. CI publishes per-inspector contribution — which inspector caught what — so the delta from each layer is auditable, not asserted.
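
One way to tabulate per-inspector contribution is to count, for each inspector, how many cases it flagged and how many it alone flagged. The sketch below assumes each case result records the set of inspectors that fired; the function name and inspector labels are hypothetical, not the format CI actually publishes.

```python
from collections import Counter


def contributions(fired_by_case: dict) -> tuple:
    """Map case id -> set of inspector names into (caught, uniquely_caught) counts."""
    caught = Counter()
    unique = Counter()
    for fired in fired_by_case.values():
        for name in fired:
            caught[name] += 1
        if len(fired) == 1:
            # Only one inspector fired, so removing it would lose this case.
            unique[next(iter(fired))] += 1
    return caught, unique


# Illustrative data only.
cases = {
    "c1": {"regex"},
    "c2": {"regex", "ml_ensemble"},
    "c3": {"ml_ensemble"},
    "c4": set(),
}
caught, unique = contributions(cases)
```

The `unique` counter is the auditable delta: it shows what would be lost if a given layer were disabled.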

Telemetry

Per-inspector latency, surfaced live.

Rolling-window p50 / p99 / max latency is tracked per inspector and rendered as a time-series on the operator console. Slow classifiers are visible to the operator the same shift the regression appears, not in the next quarter's review — and tuning coverage against your latency budget is a console action, not a config sweep.
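
A rolling window of this kind can be sketched with a bounded deque per inspector. The class name, the default window size, and the nearest-rank percentile method are assumptions for illustration; the page does not specify how the console computes its percentiles.

```python
from collections import defaultdict, deque


class LatencyWindow:
    """Keep the most recent `size` latency samples per inspector."""

    def __init__(self, size: int = 1000):
        self._samples = defaultdict(lambda: deque(maxlen=size))

    def record(self, inspector: str, latency_ms: float) -> None:
        # The deque drops the oldest sample once the window is full.
        self._samples[inspector].append(latency_ms)

    def stats(self, inspector: str):
        data = sorted(self._samples[inspector])
        if not data:
            return None

        def percentile(p: int) -> float:
            # Nearest-rank method with integer arithmetic: ceil(p * n / 100).
            rank = max(1, (p * len(data) + 99) // 100)
            return data[rank - 1]

        return {"p50": percentile(50), "p99": percentile(99), "max": data[-1]}
```

Sorting on every `stats` call is fine for a sketch; a production console would more likely use a streaming quantile estimator.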

Bounds

Latency excludes execution.

Numbers reflect inspector + policy evaluation only, measured around the simulation entrypoint. Tool execution and trace anchoring add their own latencies, which are tool-specific and not part of the published scorecard.
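
The measurement boundary can be pictured as a timer that wraps only the evaluation call. In this sketch `evaluate` and `execute_tool` are hypothetical callables standing in for the simulation entrypoint and the downstream tool; neither name comes from Track's codebase.

```python
import time


def timed_evaluation(evaluate, request):
    """Time only the inspector + policy evaluation step."""
    start = time.perf_counter()
    decision = evaluate(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return decision, elapsed_ms


def handle(request, evaluate, execute_tool):
    decision, elapsed_ms = timed_evaluation(evaluate, request)
    # Tool execution happens outside the timed region, so its cost
    # never appears in the reported latency.
    result = execute_tool(request) if decision == "allow" else None
    return decision, result, elapsed_ms
```

A slow tool therefore changes end-to-end request time but not the p50 / p95 figures in the scorecards.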

Per benchmark

What we tested, where it came from.

Each benchmark below links to the upstream paper and lists the real case count we evaluate against. Numbers are from the capability-realistic scorecard — how Track ships in production. Flat-scorecard numbers (inspector-only stress test, no policy speech) are in the table above.

AgentDojo
ETH Zürich SPY Lab · MIT licensed
arXiv:2406.13352
Detection: 100.0% · FPR: 0.0% · Cases: 145 · p95: 138.1 ms
Full AgentDojo workspace + banking + Slack + travel suites — every malicious task and every benign control the upstream authors published, replayed unmodified through Track's pipeline. Capability mapping binds banking transfers to bank.transfer, workspace mail to email.send, Slack to collab.message, travel bookings to travel.book.
Attack categories (upstream taxonomy): instruction_override, role_manipulation, exfiltration, data_exfiltration, tool_misuse, multistep, policy_bypass, delimiter_injection
InjecAgent
UIUC Kang Lab · Apache-2.0
arXiv:2403.02691
Detection: 100.0% · FPR: 0.0% · Cases: 1,907 · p95: 151.2 ms
The complete InjecAgent IPI dataset — 1,907 cases of indirect prompt injection embedded in tool-returned content (email replies, calendar events, search results, doc channels). The hardest class for naive guardrails to catch and by far the largest corpus we evaluate against. Track holds at 100% detection across all 1,907.
Attack categories (upstream taxonomy): direct_harm, data_stealing, email_to_finance_IPI, calendar_to_email, search_result_IPI, doc_channel_IPI, command_chain
Agent Security Bench
ASB · MIT licensed
arXiv:2410.02644
Detection: 100.0% · FPR: 0.0% · Cases: 420 · p95: 62.5 ms
The full ASB attack matrix — 420 cases spanning the broadest threat model in the literature: direct prompt injection, observation-channel injection, memory injection, plan-extraction, tool manipulation, goal-hijack, agent-collusion, prompt-trojan, and more. Track's capability-aware policy detects 100% with 0% FPR.
Attack categories (ASB codes): DPI, OPI, MI, IPI, TM, GH, AC, PT, PE, DM, RC, II
WASP
Web-Agent Security Probe · Apache-2.0
arXiv:2504.18575
Detection: 100.0% · FPR: 10.8% · Cases: 58 · p95: 216.3 ms
The full WASP web-agent corpus — DOM-level adversarial patterns most policy layers can't see (hidden divs, zero-width text, ARIA traps, homoglyph URLs). 100% detection on the malicious side; the 10.8% FPR is a known shortlist of benign DOM patterns the inspectors over-flag, tracked openly in the regression report and the active item on our roadmap.
Attack categories (upstream taxonomy): dom_hidden_div, zero_width_text, alt_text, form_label, aria_label, data_uri, phishing_url, link_text, card_exfil, unicode_homoglyph, iframe_overlay
MCPSecBench
MCP Security Benchmark · MIT licensed
arXiv:2508.13220
Detection: 100.0% · FPR: 0.0% · Cases: 11 · p95: 63.2 ms
All 11 published MCPSecBench cases — Model Context Protocol-specific patterns including rug-pull updates, namespace shadowing, schema-lying tool descriptions, and supply-chain typosquats. The newest benchmark and the most relevant to the MCP ecosystem Track sits in front of. Small corpus today; we re-clone monthly to surface new cases. Rug-pull and supply-chain cases are additionally gated by Track's signed MCP manifest verification (Ed25519, pinned publisher keys) — closing OWASP ASI04 at runtime.
Attack categories (upstream taxonomy): rug_pull, namespace_shadowing, type_confusion, tool_description_jailbreak, scope_escalation, supply_chain_typosquat, schema_lying, memory_poisoning, exfil_via_unauthorized_tool
Reproduce

Same harness. Same corpus. Same numbers.

Every published scorecard carries a track_version field — the git short SHA, plus +dirty if the working tree was dirty when it ran. Any number on this page traces back to a specific commit, which is the only way benchmark claims should travel.
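
A version string of that shape could be derived as follows. This is a sketch, not the harness's own code: the helper names are hypothetical, and treating any `git status --porcelain` output (including untracked files) as "dirty" is an assumption.

```python
import subprocess


def format_track_version(short_sha: str, dirty: bool) -> str:
    """Short SHA, with a +dirty suffix when the working tree had changes."""
    return short_sha + ("+dirty" if dirty else "")


def read_track_version(repo_dir: str = ".") -> str:
    """Read the version from a git checkout (requires git on PATH)."""
    sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Assumption: any porcelain output, untracked files included, means dirty.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return format_track_version(sha, bool(status.strip()))
```

A scorecard stamped `48fccc4+dirty` would signal that the numbers came from uncommitted code and cannot be reproduced from the commit alone.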

Reproduce locally

make benchmarks-download         # clone published upstream corpora (one-time)
make benchmarks-full             # full corpus, flat scorecard
make benchmarks-capability       # full corpus, capability-realistic scorecard
make benchmarks-dual             # both scorecards in one run
make benchmark-dashboard         # render per-attack-class drift dashboard

# Per-PR regression gate — fails CI on > 1pp detection drop or FPR rise
make benchmark-regression-check FAIL_ON_REGRESSION=1

# Regex + policy only — disables ML inspectors, quantifies what ML adds
AG_BENCHMARK_DISABLE_ML=1 make benchmarks-dual

CI runs the full upstream corpus on every PR that touches inspector / policy / model paths and hard-fails on a > 1 percentage-point detection drop or FPR rise. A monthly scheduled job re-clones each paper's upstream repository to surface case-count drift before it reaches a customer. The regression report and per-attack-class drift dashboard are committed back to the repo on every push to main.
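
The gate logic amounts to comparing each benchmark's current row against a stored baseline. The sketch below assumes scorecards are dicts keyed by benchmark name with `detection` and `fpr` percentages; that shape, the `regressions` helper, and the "current" figures in the example are hypothetical.

```python
BUDGET_PP = 1.0  # regression budget in percentage points


def regressions(baseline: dict, current: dict, budget_pp: float = BUDGET_PP) -> list:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None:
            failures.append(f"{name}: missing from current run")
            continue
        drop = base["detection"] - cur["detection"]
        rise = cur["fpr"] - base["fpr"]
        if drop > budget_pp:
            failures.append(f"{name}: detection dropped {drop:.1f} pp")
        if rise > budget_pp:
            failures.append(f"{name}: FPR rose {rise:.1f} pp")
    return failures


# Baseline row from the published scorecard; the "current" rows are invented.
baseline = {"WASP": {"detection": 100.0, "fpr": 10.8}}
ok = regressions(baseline, {"WASP": {"detection": 99.5, "fpr": 10.8}})
bad = regressions(baseline, {"WASP": {"detection": 98.5, "fpr": 12.0}})
```

A CI step would exit non-zero whenever the returned list is non-empty, which is the hard-fail behaviour described above.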

Want the regression report and the full-corpus numbers?

We'll walk through the harness, show the regression history across releases, and run the regex-only baseline alongside the ML stack so your team can see what each layer contributes.