V12 · 20260604

Crypto Security Bench

Crypto Security Bench is a benchmark that measures the ability of AI security agents to identify vulnerabilities in blockchain code. It includes 289 ground-truth vulnerabilities across 31 projects, spanning Solidity, Rust, C++, and Go codebases. Unlike benchmarks such as EVMBench, Crypto Security Bench covers both smart contracts and L1/core blockchain code. We compare our system, V12, with Claude Code, Codex, and Pashov Skills.

01 — Overall recall

Overall Performance

We ran each system against the codebases in Crypto Security Bench. We report recall: the fraction of ground-truth bugs that each system successfully identified.

System	Detected	Recall
V12	181 / 289	62.6%
Claude Code harness (Opus 4.7)	104 / 289	36.0%
Codex harness (GPT-5.5)	81 / 289	28.0%
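
As a quick check of the arithmetic, the recall column can be reproduced from the detected counts. This is a minimal sketch: the helper name `recall_pct` and the dictionary layout are assumptions made for illustration and are not part of the benchmark's tooling.

```python
# Detected counts copied from the table above; 289 is the total number of
# ground-truth vulnerabilities in the benchmark.
TOTAL_TASKS = 289

detected = {
    "V12": 181,
    "Claude Code harness (Opus 4.7)": 104,
    "Codex harness (GPT-5.5)": 81,
}


def recall_pct(true_positives: int, total: int) -> float:
    """Recall as a percentage, rounded to one decimal place (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * true_positives / total, 1)


recalls = {name: recall_pct(tp, TOTAL_TASKS) for name, tp in detected.items()}

for name, value in recalls.items():
    print(f"{name}: {value:.1f}%")
```

The same helper applies to any subset of tasks, provided the denominator is changed to the number of tasks in that subset.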

Solidity head-to-head

Pashov Skills only audits Solidity, so it cannot be ranked against the other systems on the full 289-task set. We therefore compared all four systems on the 91 Solidity tasks:

System	Detected	Recall
V12	67 / 91	73.6%
Claude Code	52 / 91	57.1%
Codex	42 / 91	46.2%
Pashov Skills	40 / 91	44.0%

02 — By language

Performance By Language

Recall grouped by language.

Recall (%) = TP / total tasks, per language.

Tasks per language: Rust: 145, C++: 29, Solidity: 91, Go: 24
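
The per-language grouping can be sketched as follows. The task counts come from the chart legend, but the `recall_by_language` function and the `toy_results` records are hypothetical, made up purely to show the calculation; they are not benchmark data.

```python
from collections import Counter

# Task counts per language, as stated in the chart legend.
tasks_per_language = {"Rust": 145, "C++": 29, "Solidity": 91, "Go": 24}

# The four groups should account for every task in the benchmark.
assert sum(tasks_per_language.values()) == 289


def recall_by_language(results):
    """Return {language: recall %} from (language, detected) pairs."""
    totals, hits = Counter(), Counter()
    for language, was_detected in results:
        totals[language] += 1
        if was_detected:
            hits[language] += 1
    return {lang: round(100.0 * hits[lang] / totals[lang], 1) for lang in totals}


# Hypothetical toy records, NOT benchmark data: one (language, detected) pair per task.
toy_results = [("Go", True), ("Go", False), ("Rust", True), ("Rust", True)]
toy_recall = recall_by_language(toy_results)
print(toy_recall)
```

Each language's recall uses that language's own task count as the denominator, so a small group such as Go (24 tasks) moves by several percentage points per detected vulnerability.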

03 — All Fixtures

Detailed Results

This section lists all ground-truth vulnerabilities included in Crypto Security Bench, along with the vulnerability reports generated by each system tested. Each fixture links to an associated human audit report.

Fixture	Tasks	V12	Claude	Codex	Pashov

05 — Methodology

Crypto Security Bench

What determines the 'ground truth' for which vulnerabilities exist in a codebase?

What about precision?

What harness was used for Claude Code and Codex?

Why are some fixtures redacted?

How are language groups assigned?

Where do the public fixtures come from?

How is detection scored?