Crypto Security Bench
Crypto Security Bench is a benchmark that measures the ability of AI security agents to identify vulnerabilities in blockchain code. It includes 289 ground-truth vulnerabilities across 31 projects, spanning Solidity, Rust, C++, and Go codebases. Unlike other benchmarks such as EVMBench, Crypto Security Bench covers both smart contracts and L1/core blockchain code. We compare our system, V12, with Claude Code, Codex, and Pashov Skills.
Overall Performance
We ran each system against the codebases in Crypto Security Bench. We report recall: the fraction of ground-truth vulnerabilities that each system successfully identified.
| System | Detected | Recall |
|---|---|---|
| V12 | 181 / 289 | 62.6% |
| Claude Code harness (Opus 4.7) | 104 / 289 | 36.0% |
| Codex harness (GPT-5.5) | 81 / 289 | 28.0% |

| System | Detected | Recall |
|---|---|---|
| V12 | 67 / 91 | 73.6% |
| Claude Code | 52 / 91 | 57.1% |
| Codex | 42 / 91 | 46.2% |
| Pashov Skills | 40 / 91 | 44.0% |
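
The recall percentages above follow directly from the detected counts. As a minimal sketch (the function and variable names here are illustrative assumptions, not part of the benchmark's tooling), recall can be recomputed from the reported figures like this:

```python
# Detected / total counts copied from the results above.
RESULTS = [
    ("V12", 181, 289),
    ("Claude Code harness (Opus 4.7)", 104, 289),
    ("Codex harness (GPT-5.5)", 81, 289),
    ("V12", 67, 91),
    ("Claude Code", 52, 91),
    ("Codex", 42, 91),
    ("Pashov Skills", 40, 91),
]


def recall(detected: int, total: int) -> float:
    """Fraction of ground-truth vulnerabilities that were detected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return detected / total


def format_recall(detected: int, total: int) -> str:
    """Recall as a percentage string with one decimal place."""
    return f"{100 * recall(detected, total):.1f}%"


for system, detected, total in RESULTS:
    print(f"{system}: {detected} / {total} = {format_recall(detected, total)}")
```

Recall only counts missed vulnerabilities against a system; it says nothing about how many of a system's reports were false positives.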
Performance By Language
Recall grouped by language.
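
One plausible way to produce a per-language breakdown is to tally detected and total vulnerabilities per language and divide. The sketch below assumes each ground-truth vulnerability is tagged with a single language; the records are hypothetical placeholders for illustration, not benchmark results.

```python
from collections import defaultdict

# Hypothetical records, one per ground-truth vulnerability:
# (language, whether the system detected it). Placeholder values only.
FINDINGS = [
    ("Solidity", True),
    ("Solidity", False),
    ("Rust", True),
    ("Go", False),
    ("Go", True),
    ("Go", True),
]


def recall_by_language(findings):
    """Return {language: detected / total} for the given records."""
    detected = defaultdict(int)
    total = defaultdict(int)
    for language, found in findings:
        total[language] += 1
        detected[language] += int(found)
    return {language: detected[language] / total[language] for language in total}


for language, value in recall_by_language(FINDINGS).items():
    print(f"{language}: {100 * value:.1f}%")
```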
Detailed Results
List of all ground-truth vulnerabilities included in Crypto Security Bench, along with the vulnerability reports generated by each system tested. Each fixture links to an associated human audit report.