V12 · 20260604

Crypto Security Bench

Crypto Security Bench is a benchmark that measures the ability of AI security agents to identify vulnerabilities in blockchain code. It includes 289 ground-truth vulnerabilities across 31 projects, spanning Solidity, Rust, C++, and Go codebases. Unlike benchmarks such as EVMBench, Crypto Security Bench covers both smart contracts and L1/core blockchain code. We compare our system, V12, with Claude Code, Codex, and Pashov Skills.

01 — Overall recall

Overall Performance

We ran each system against the codebases in Crypto Security Bench. We report recall: the fraction of ground-truth bugs that each system successfully identified.

System                            Detected     Recall
V12                               181 / 289    62.6%
Claude Code harness (Opus 4.7)    104 / 289    36.0%
Codex harness (GPT-5.5)           81 / 289     28.0%

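The recall figures above follow directly from the detected counts. As a minimal sketch (the function and variable names are illustrative assumptions, not part of the benchmark's tooling), the percentages can be reproduced like this:

```python
# Detected counts and the total of 289 tasks are copied from the results above.
# The names below (TOTAL_TASKS, detected, recall_pct) are illustrative only.
TOTAL_TASKS = 289

detected = {
    "V12": 181,
    "Claude Code harness (Opus 4.7)": 104,
    "Codex harness (GPT-5.5)": 81,
}


def recall_pct(true_positives: int, total: int) -> float:
    """Return recall as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * true_positives / total, 1)


recalls = {name: recall_pct(tp, TOTAL_TASKS) for name, tp in detected.items()}

for name, pct in recalls.items():
    print(f"{name}: {detected[name]} / {TOTAL_TASKS} = {pct:.1f}%")
```

Recall here counts only ground-truth vulnerabilities that were found; it says nothing about how many additional, non-matching findings a system reported.
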
Solidity head-to-head
Pashov Skills audits only Solidity, so it can't be ranked against the others on the full 289-task set. We therefore compared all four systems on the 91 Solidity tasks:

System           Detected    Recall
V12              67 / 91     73.6%
Claude Code      52 / 91     57.1%
Codex            42 / 91     46.2%
Pashov Skills    40 / 91     44.0%

02 — By language

Performance By Language

Recall grouped by language.

[Chart: Recall (%) = TP / total tasks, per language. Task counts: Rust: 145, C++: 29, Solidity: 91, Go: 24.]

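Per-language recall applies the same TP / total formula within each language group. The sketch below assumes one (language, detected) record per task; that record shape and the toy input are hypothetical, chosen only to show the grouping, and are not benchmark data.

```python
from collections import defaultdict

# Task counts per language, as listed with the chart.
TASKS_PER_LANGUAGE = {"Rust": 145, "C++": 29, "Solidity": 91, "Go": 24}


def recall_by_language(results):
    """Group per-task outcomes by language and return recall in percent.

    `results` is an iterable of (language, detected) pairs, one per task.
    This record shape is an assumption made for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for language, was_detected in results:
        totals[language] += 1
        hits[language] += int(was_detected)
    return {lang: 100 * hits[lang] / totals[lang] for lang in totals}


# Hypothetical toy input, NOT benchmark data: two Go tasks, two Rust tasks.
toy_results = [("Go", True), ("Go", False), ("Rust", True), ("Rust", True)]
by_lang = recall_by_language(toy_results)
print(by_lang)
```

Because the language groups differ widely in size (145 Rust tasks versus 24 Go tasks), the overall recall is weighted toward the larger groups rather than being a simple average of the per-language values.
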
03 — All Fixtures

Detailed Results

List of all ground-truth vulnerabilities included in Crypto Security Bench, along with the vulnerability reports generated by each system tested. Each fixture links to an associated human audit report.

Fixture    Tasks    V12    Claude    Codex    Pashov

05 — Methodology

Methodology and FAQ