Karpathy-6 Adversarial Verification
A six-failure-mode adversarial pipeline that verifies LLM-generated analytical outputs against source evidence before they reach a client.
Most organisations treat hallucination as a single problem. It is not. It is six structurally distinct failure modes, each with a different cause, a different detection method, and a different remediation. Lumping them together guarantees you will build a defence that catches one type and misses five.
That diagnosis — Andrej Karpathy’s — is the foundation of everything Blu Wingu ships to clients.
Why this matters more than model selection
The default posture in enterprise AI is model selection: which model scores best on the benchmark? That is the wrong question. Architecture choices, not model benchmarks, decide outcomes. A well-architected verification gate running on a mid-tier model will catch more consequential errors than an unverified output from the most capable model available.
Karpathy-6 Adversarial Verification is the gate. It is a domain-agnostic, parameterisable pipeline that verifies any LLM-generated analytical output — requirements documents, solution designs, gap analyses, strategic recommendations — against the source evidence that was supposed to ground them. It runs before the output reaches a client. It runs before the output informs a business decision. It is not an audit of the past. It is an architectural invariant of the present.
The six failure modes
The pipeline is organised around six failure modes, each distinct in cause and remedy:
FM-1 — Fabrication. A claim that nobody said, showed, or implied — invented from training data. The most dangerous failure mode because fabricated claims are often the most confident-sounding. In a regulated-sector engagement, a plausible-but-groundless enforcement figure is a fabrication. In a requirements document, a capability attributed to a system that the system does not have is a fabrication.
FM-2 — Misattribution. The claim is real but attributed to the wrong source. The gap finding is genuine; the evidence citation points to the wrong workshop session. The quantitative metric is correct; the document that contains it is not the one cited. Misattribution makes correction harder because the claim passes a surface plausibility check — only the provenance is wrong.
FM-3 — Inference Leakage. Training-data knowledge imported into evidence-grounded analysis as though it were sourced from the evidence. An executive title that the LLM knows from its training data but that does not appear in the workshop transcript. An industry benchmark that the LLM knows from pre-training but that was never stated in the engagement. Inference Leakage is particularly prevalent in regulated sectors where the LLM has substantial domain knowledge that is technically correct but analytically inadmissible — because the engagement’s conclusions must be grounded in the client’s own evidence, not in general knowledge.
FM-4 — Severity Inflation. Gaps or risks rated higher than the source evidence supports. A finding described as critical when the transcript characterises it as a known limitation being actively managed. A risk assigned a high severity when the only evidence is a single passing mention. Severity Inflation drives misaligned remediation effort and can damage client trust when the inflation is noticed.
FM-5 — Phantom Consensus. “The team agreed” when actually one person proposed and nobody objected. “Stakeholders confirmed” when the transcript shows one stakeholder comment without challenge. Phantom Consensus is the failure mode of summarisation: the LLM resolves ambiguity by asserting agreement, because agreement is the expected output of a workshop analysis.
FM-6 — Omission. Critical source content that does not appear in the output. The gap that was raised, discussed, and evidenced in the transcript but that never made it into the gap analysis. The requirement that was explicitly stated by a named stakeholder but that has no corresponding R-number. Omission is systematically underdetected because verification processes that check what the output says cannot detect what it fails to say.
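Because each failure mode has its own detection method and remediation, it can help to carry the mode as an explicit tag on every finding rather than a generic "hallucination" flag. The sketch below is one possible representation, not the pipeline's actual schema: the `Finding` fields and the `group_by_mode` helper are illustrative assumptions, and only the FM identifiers and names come from the taxonomy above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    """The six failure modes, keyed by the FM identifiers used in the text."""
    FABRICATION = "FM-1"
    MISATTRIBUTION = "FM-2"
    INFERENCE_LEAKAGE = "FM-3"
    SEVERITY_INFLATION = "FM-4"
    PHANTOM_CONSENSUS = "FM-5"
    OMISSION = "FM-6"


@dataclass(frozen=True)
class Finding:
    """One detected defect. Field names are hypothetical, for illustration."""
    mode: FailureMode
    claim_text: str      # the claim as written; empty for an omission
    evidence_note: str   # what the source does or does not show


def group_by_mode(findings):
    """Bucket findings per failure mode so each can be remediated separately."""
    grouped = {mode: [] for mode in FailureMode}
    for finding in findings:
        grouped[finding.mode].append(finding)
    return grouped
```

Keeping an empty bucket for every mode makes it visible when a mode produced no findings, as opposed to never having been checked.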
The six-phase pipeline
The pipeline runs six isolated phases. Context isolation is non-negotiable: what each agent receives is exactly what its detection method requires, and nothing more.
Phase 1 — Claim Extraction. Parallel extractor agents each read one document under verification in isolation, without seeing the source documents or any other document under verification. Each agent produces a structured claim register for its assigned document. The isolation prevents the extractor from anchoring its reading to the evidence it might expect the document to be grounded in.
Phase 2 — Evidence Search. Reviewer agents search the source documents for evidence that supports or contradicts each claim in the register, and record where no evidence exists. These agents receive the source documents and a batch of claims, but never the document the claims came from. The isolation prevents the searcher from reading the source through the lens of the output that generated the claims.
Phase 3 — Independent Extraction. A second wave of extractor agents reads the source documents without ever seeing any document under verification or any claim register. These agents produce an independent content inventory of what the sources actually contain. Phase 3 is the omission-detection input: it establishes the ground truth of what is in the evidence before any comparison to what the output says.
Phase 4 — Failure Mode Analysis. Six parallel adversarial reviewer agents, one per failure mode, run simultaneously. Each receives only the specific inputs its detection method requires. The FM-1 Fabrication agent compares claims against evidence to identify unsupported assertions. The FM-2 Misattribution agent checks source citations and provenance chains. The FM-3 Inference Leakage agent identifies claims that match domain training knowledge but have no grounding in the provided evidence. The FM-4 Severity Inflation agent evaluates whether severity ratings are proportionate to the evidence. The FM-5 Phantom Consensus agent scrutinises consensus attributions against transcript signals. The FM-6 Omission agent works from the Phase 3 inventory to identify source content missing from the output.
Phase 5 — Omission Detection. A dedicated reviewer agent cross-references the Phase 3 independent extractions against the Phase 1 claim registers to surface content present in the source but absent from the output. Phase 5 runs in parallel with Phase 4 and feeds additional omission findings into Phase 6.
Phase 6 — Adversarial Correction. For every Category A finding (FM-1 fabrications, FM-2 misattributions, FM-3 inference leakage, and confirmed contradictions), a correction agent produces source-grounded replacement text. This agent receives the source documents and the correction directives only. It never sees the document under verification, and the agent that produced the original output is never used for corrections. This is the SA5 principle: the architecture enforces independence between generation and correction, so corrections cannot perpetuate the original bias.
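The context-isolation rule across the phases can be sketched as an allow-list: each phase declares the inputs it may see, and everything else is dropped before an agent is dispatched. This is a minimal illustration under stated assumptions, not the pipeline's implementation. The `Artifact` names and the `build_context` helper are hypothetical, and Phase 4 is left out because the text does not enumerate each reviewer's exact inputs.

```python
from enum import Enum


class Artifact(Enum):
    """Kinds of material an agent could be handed (names are assumptions)."""
    DOCUMENT_UNDER_VERIFICATION = "document_under_verification"
    SOURCE_DOCUMENTS = "source_documents"
    CLAIM_REGISTER = "claim_register"
    SOURCE_INVENTORY = "source_inventory"          # Phase 3 output
    CORRECTION_DIRECTIVES = "correction_directives"


# Allowed inputs per phase, following the phase descriptions above.
# Phase 4 is omitted: its per-reviewer inputs are not listed in the text.
ALLOWED_INPUTS = {
    1: {Artifact.DOCUMENT_UNDER_VERIFICATION},
    2: {Artifact.SOURCE_DOCUMENTS, Artifact.CLAIM_REGISTER},
    3: {Artifact.SOURCE_DOCUMENTS},
    5: {Artifact.SOURCE_INVENTORY, Artifact.CLAIM_REGISTER},
    6: {Artifact.SOURCE_DOCUMENTS, Artifact.CORRECTION_DIRECTIVES},
}


def build_context(phase, available):
    """Return exactly the artifacts the phase may see, dropping all others."""
    if phase not in ALLOWED_INPUTS:
        raise KeyError(f"no allow-list defined for phase {phase}")
    allowed = ALLOWED_INPUTS[phase]
    missing = allowed - set(available)
    if missing:
        names = sorted(artifact.value for artifact in missing)
        raise ValueError(f"phase {phase} is missing required inputs: {names}")
    return {artifact: available[artifact] for artifact in allowed}


# Usage: even when every artifact is available, the correction agent's
# context never contains the document under verification.
everything = {artifact: f"<{artifact.value}>" for artifact in Artifact}
correction_context = build_context(6, everything)
assert Artifact.DOCUMENT_UNDER_VERIFICATION not in correction_context
```

Building the context by whitelisting rather than by removing forbidden items means a newly introduced artifact is hidden from every phase until someone deliberately allows it.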
The eight quality gates
Eight checks must pass before an output is considered verified.
QG-AV-1: Faithfulness score per document ≥ 85% (GROUNDED and REASONABLY_INFERRED claims as a proportion of all claims). Configurable.
QG-AV-2: Zero CONTRADICTED claims post-correction. An architectural invariant — any claim the evidence directly contradicts must be corrected or removed before the output proceeds.
QG-AV-3: Fabrication rate below 5% of total claims. Configurable.
QG-AV-4: Omission rate for HIGH-significance source content below 10%. Configurable.
QG-AV-5: 100% of Category A corrections applied. An architectural invariant — there is no partial pass for Category A findings.
QG-AV-6: Report completeness — all six failure mode sections present. An architectural invariant.
QG-AV-7: Per-document coverage — every document under verification has been analysed. An architectural invariant.
QG-AV-8: Sonnet model pinned on every sub-agent dispatch. An architectural invariant — the pipeline’s sub-agent work is structured, isolated, and template-driven, making it exactly the workload a well-prompted mid-tier model handles reliably.
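The three claim-level gates (QG-AV-1, QG-AV-2, QG-AV-3) reduce to simple arithmetic over the claim statuses. The sketch below shows one way that check might look. It assumes each claim carries a single status label, and the `FABRICATED` label, the `GateConfig` type, and the `evaluate_gates` function are hypothetical names introduced for illustration. Only `GROUNDED`, `REASONABLY_INFERRED`, `CONTRADICTED`, and the default thresholds come from the gate definitions above.

```python
from collections import Counter
from dataclasses import dataclass

# Statuses that count towards faithfulness under QG-AV-1.
FAITHFUL_STATUSES = ("GROUNDED", "REASONABLY_INFERRED")


@dataclass(frozen=True)
class GateConfig:
    """Configurable thresholds; defaults mirror the gates listed above."""
    min_faithfulness: float = 0.85      # QG-AV-1
    max_fabrication_rate: float = 0.05  # QG-AV-3 (rate must be below this)


def evaluate_gates(statuses, config=GateConfig()):
    """Evaluate QG-AV-1..3 for one document's post-correction claim statuses."""
    if not statuses:
        raise ValueError("cannot score a document with no extracted claims")
    counts = Counter(statuses)
    total = len(statuses)
    faithfulness = sum(counts[s] for s in FAITHFUL_STATUSES) / total
    # "FABRICATED" is an assumed label for FM-1 findings.
    fabrication_rate = counts["FABRICATED"] / total
    return {
        "QG-AV-1": faithfulness >= config.min_faithfulness,
        "QG-AV-2": counts["CONTRADICTED"] == 0,  # invariant, not configurable
        "QG-AV-3": fabrication_rate < config.max_fabrication_rate,
    }
```

The invariant gate is hard-coded while the configurable ones read from `GateConfig`, which keeps the distinction drawn in the list visible in the code.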
The degradation protocol is three-tiered. Faithfulness at or above 85% passes and proceeds. Faithfulness of at least 70% but below 85% passes conditionally, with a documented risk acceptance note. Below 70%, the source documents must be re-analysed: the output is fundamentally ungrounded and the gap cannot be papered over.
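The three tiers map onto a small threshold function. This is a sketch only: the tier labels and the `degradation_tier` name are assumptions, and a score of exactly 70% is treated as a conditional pass because only scores below 70% trigger re-analysis.

```python
def degradation_tier(faithfulness, pass_at=0.85, conditional_at=0.70):
    """Map a faithfulness score in [0, 1] to one of three tiers.

    Tier labels are hypothetical; thresholds default to those in the text.
    """
    if not 0.0 <= faithfulness <= 1.0:
        raise ValueError("faithfulness must be a proportion between 0 and 1")
    if faithfulness >= pass_at:
        return "PASS"
    if faithfulness >= conditional_at:
        return "CONDITIONAL_PASS"   # requires a documented risk acceptance note
    return "REANALYSE_SOURCES"      # output is treated as ungrounded
```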
Before and after: a Category A defect caught
Context: A regulated-sector engagement. The output document under verification is a gap analysis. One finding reads: “The organisation’s ICT risk tolerance threshold of 2% breach probability per quarter exceeds the regulator’s published guidance ceiling of 1.5%.”
Phase 4 result — FM-1 Fabrication: The FM-1 agent finds no evidence for either figure in the source documents. The workshop transcript contains no quantitative ICT risk tolerance discussion. The source gap analysis contains no reference to a 2% figure. The regulator's published guidance that the finding cites is not among the source documents provided.
Phase 6 correction: The correction agent, working from the source documents without seeing the gap analysis, produces the replacement text: “The organisation has not documented a quantitative ICT risk tolerance threshold. The workshop transcript references informal tolerance discussions but no numeric threshold was confirmed. Recommendation: establish and document a quantitative threshold before the next regulatory review cycle.”
Net effect: A plausible-sounding but entirely groundless claim, the kind a reader would accept without question and a regulator might act on, is replaced with a sourced, accurate statement that drives a concrete action. Without the gate, the original claim would have been printed, delivered, and acted upon. The corrected text is audit-ready.
This is not a contrived example. It is the structural form of one Category A finding corrected in a live engagement at a Tier-1 regulated-sector client in Spring 2026, where the pipeline analysed four customer-facing documents, extracted 379 claims, and lifted faithfulness from 89% pre-correction to 97% post-correction.
What the pipeline is not
Karpathy-6 Adversarial Verification does not check factual accuracy against external knowledge bases. It checks whether the output is faithfully grounded in the evidence provided. An output can be 97% faithful to flawed source evidence and still be wrong — that is a separate problem, addressed by the quality of the source evidence, not by this pipeline.
The pipeline does not replace human review. It prepares the output for human review by surfacing the claims that require it most urgently, with source-grounded evidence for or against each one.
Severity Inflation findings (FM-4) are informational, not blocking. Both the original and the downgraded severity may be defensible. FM-4 findings are classified as Category C (Design Judgment) and triaged by the orchestrator.
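The blocking versus informational split can be expressed as a routing table. The sketch below encodes only what the text states: FM-1, FM-2, and FM-3 findings are Category A and must be corrected, while FM-4 findings are Category C and go to the orchestrator. The route labels and the `route_finding` function are hypothetical. FM-5 and FM-6 raise an error here because their categories are not given in this section.

```python
# Categories stated in the text. Confirmed contradictions are also Category A
# but are not a numbered failure mode, so they are not keyed here.
CATEGORY_BY_MODE = {"FM-1": "A", "FM-2": "A", "FM-3": "A", "FM-4": "C"}

# Hypothetical route labels for illustration.
ROUTE_BY_CATEGORY = {
    "A": "correct_in_phase_6",     # blocking: 100% must be applied (QG-AV-5)
    "C": "triage_by_orchestrator", # informational: design judgment
}


def route_finding(mode):
    """Return (category, route, is_blocking) for a failure-mode identifier."""
    if mode not in CATEGORY_BY_MODE:
        raise ValueError(f"category for {mode} is not specified in this section")
    category = CATEGORY_BY_MODE[mode]
    return category, ROUTE_BY_CATEGORY[category], category == "A"
```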
How other methodologies consume it
Karpathy-6 is embedded as a mandatory gate in Blu Wingu’s Workshop-to-Design Audit pipeline — the adversarial verification tier that gates requirements documents and solution designs before they proceed to work decomposition.
Implementation Verification extends the same failure-mode taxonomy to code and configuration outputs, running lens agents against the implementation before it is committed or deployed.
Any engagement that produces analytical deliverables — gap analyses, business requirements, solution designs, regulatory assessments — is subject to a Karpathy-6 gate before the output reaches the client. That is the posture. It is not optional.
Start a Stream D Evaluation Audit
The Stream D Evaluation Audit applies Karpathy-6 Adversarial Verification to a specific analytical output you already have: a requirements document, a gap analysis, a vendor assessment, a regulatory submission. Bring the output and the source evidence. The pipeline runs in five days. You receive a structured findings report, a faithfulness score, and a corrected document for every Category A finding detected.