LLM-as-a-Judge

Architecture choices, not model benchmarks, decide outcomes. LLM-as-a-Judge is the methodology for making those choices correctly.

What it is

LLM-as-a-Judge is the methodology Blu Wingu uses to select, evaluate, and assign language models to specific roles within a multi-agent or AI-augmented system. It replaces benchmark-driven model selection with task-driven evaluation: the model under consideration is assessed by a structured evaluator (the judge) on the specific output types the use case requires, using representative examples from the actual deployment context.

The methodology distinguishes three model assignment questions: capability fit (can this model produce the required output type at the required quality?), cost-at-scale fit (does cost per output stay inside the engagement’s Outcome NAV structure at production volume?), and role fit (which tasks belong to orchestrator-class models, which to structured-task models, which to domain-specific fine-tunes?).

The judge is itself a model, typically a more capable model evaluating a smaller candidate. It applies a rubric derived from the actual success criteria of the use case, not from general benchmarks. The Karpathy-6 pipeline illustrates the principle: it pins a specific model tier to every sub-agent dispatch, not because that tier is the strongest available, but because the structured, isolated, template-driven nature of each phase is exactly the workload that tier handles reliably at scale.
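The rubric-driven scoring described above can be sketched roughly as follows. This is a minimal illustration, not Blu Wingu's implementation: the `Criterion` type, the `Judge` callable signature, the criterion names, and `stub_judge` are all assumptions made for the example. In practice the judge would be a call to a more capable model prompted with the rubric, rather than the string checks used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One rubric line. Names and weights here are hypothetical."""
    name: str
    weight: float  # relative importance within the rubric


# Assumed judge interface: (criterion name, task input, candidate output) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def score_output(judge: Judge, rubric: List[Criterion],
                 task_input: str, candidate_output: str) -> float:
    """Weighted mean of the judge's per-criterion scores for one output."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    weighted = 0.0
    for c in rubric:
        raw = judge(c.name, task_input, candidate_output)
        weighted += c.weight * min(1.0, max(0.0, raw))  # clamp to [0, 1]
    return weighted / total_weight


def evaluate_candidate(judge: Judge, rubric: List[Criterion],
                       examples: List[Tuple[str, str]]) -> float:
    """Mean rubric score over representative (input, candidate output) pairs."""
    if not examples:
        raise ValueError("at least one representative example is required")
    scores = [score_output(judge, rubric, i, o) for i, o in examples]
    return sum(scores) / len(scores)


def stub_judge(criterion: str, task_input: str, candidate_output: str) -> float:
    """Stand-in for a model-backed judge; uses trivial string checks only."""
    if criterion == "non_empty":
        return 1.0 if candidate_output.strip() else 0.0
    if criterion == "mentions_input_term":
        first_term = task_input.split()[0].lower()
        return 1.0 if first_term in candidate_output.lower() else 0.0
    return 0.0


if __name__ == "__main__":
    rubric = [Criterion("non_empty", 1.0), Criterion("mentions_input_term", 3.0)]
    examples = [
        ("Invoice 42 overdue", "The invoice is overdue"),
        ("Invoice 42 overdue", "Payment late"),
    ]
    print(evaluate_candidate(stub_judge, rubric, examples))
```

Keeping the judge behind a plain callable means the same rubric can be replayed against several candidate models, or against a different judge, without changing the scoring code.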

When you reach for it

LLM-as-a-Judge applies at the architecture stage of any AI system that involves more than one model or more than one task type. It is the right methodology when a client is considering a wholesale migration to a new model family, when a multi-agent system is experiencing quality degradation that cannot be diagnosed from outputs alone, or when cost is growing non-linearly with usage and the source of the growth is unclear.

It is not the right methodology for evaluating a single model on a single task in isolation. For that, a direct capability assessment suffices. LLM-as-a-Judge adds value when the evaluation requires comparison across candidates, roles, or contexts — and when the selection decision will govern a system running at production volume.

What you ship

A role-assignment register — a structured document mapping each task type in the target system to a recommended model tier, with the evaluation evidence for each assignment stated explicitly. The register is the decision record for model governance and audit.
A judge rubric set — the evaluation criteria used to assess model candidates on the use case’s actual output types. The rubric set is reusable and can be extended as new task types are added to the system.
A cost-at-scale model — a projection of token cost per output type at three volumes (baseline, expected peak, stress case), with the break-even analysis for each candidate model. Produced as a spreadsheet so the client can update it as usage evolves.
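As a rough illustration of the cost-at-scale model, the sketch below projects monthly cost for one output type at three volumes and computes a break-even volume between two candidates. Everything specific is an assumption made for the example: the model names, the prices, the token counts, the volumes, and the idea of a fixed monthly cost (for instance, hosting a fine-tune). The deliverable itself is a spreadsheet; this only shows the arithmetic such a sheet might encode.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CandidateModel:
    """Pricing for one candidate. All figures are hypothetical."""
    name: str
    input_price_per_mtok: float    # currency units per million input tokens
    output_price_per_mtok: float   # currency units per million output tokens
    fixed_monthly_cost: float = 0.0  # assumed fixed overhead, e.g. hosting


def cost_per_output(model: CandidateModel, input_tokens: int, output_tokens: int) -> float:
    """Token cost of producing a single output of this type."""
    return (input_tokens * model.input_price_per_mtok
            + output_tokens * model.output_price_per_mtok) / 1_000_000


def monthly_cost(model: CandidateModel, input_tokens: int, output_tokens: int,
                 outputs_per_month: int) -> float:
    """Fixed cost plus per-output cost at the given monthly volume."""
    return model.fixed_monthly_cost + outputs_per_month * cost_per_output(
        model, input_tokens, output_tokens)


def project(model: CandidateModel, input_tokens: int, output_tokens: int,
            volumes: Dict[str, int]) -> Dict[str, float]:
    """Monthly cost at each named volume (baseline, expected peak, stress case)."""
    return {label: monthly_cost(model, input_tokens, output_tokens, v)
            for label, v in volumes.items()}


def break_even_volume(a: CandidateModel, b: CandidateModel,
                      input_tokens: int, output_tokens: int) -> Optional[float]:
    """Monthly volume at which a and b cost the same, or None if they never cross."""
    per_a = cost_per_output(a, input_tokens, output_tokens)
    per_b = cost_per_output(b, input_tokens, output_tokens)
    if per_a == per_b:
        return None
    # Solve fixed_a + v * per_a == fixed_b + v * per_b for v.
    volume = (b.fixed_monthly_cost - a.fixed_monthly_cost) / (per_a - per_b)
    return volume if volume > 0 else None


if __name__ == "__main__":
    hosted = CandidateModel("large-hosted", 10.0, 30.0)
    tuned = CandidateModel("small-fine-tune", 1.0, 2.0, fixed_monthly_cost=500.0)
    volumes = {"baseline": 10_000, "expected_peak": 50_000, "stress_case": 200_000}
    for m in (hosted, tuned):
        print(m.name, project(m, 1000, 500, volumes))
    print("break-even outputs/month:", break_even_volume(hosted, tuned, 1000, 500))
```

With these made-up figures the cheaper-per-token candidate only wins above the break-even volume, which is the kind of crossover the three-volume projection is meant to surface.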

Linked methodologies

LLM-as-a-Judge is the precondition for any multi-agent pipeline that includes a Karpathy-6 Adversarial Verification gate — because the model pinning decisions in Karpathy-6 (orchestrator runs on the caller’s model; all sub-agents pin to a specific tier) are themselves model-selection decisions. The LLM-as-a-Judge methodology provides the evidence base for those decisions.
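The pinning rule stated above (the orchestrator runs on the caller's model; every sub-agent pins to one specific tier) might be expressed along these lines. This is a sketch under assumptions: the tier label, the role names, and the `resolve_model` helper are hypothetical and are not taken from the Karpathy-6 pipeline itself.

```python
from dataclasses import dataclass

# Hypothetical label for the tier that all sub-agents are pinned to.
SUB_AGENT_TIER = "structured-task-tier"


@dataclass(frozen=True)
class Dispatch:
    role: str
    model: str


def resolve_model(role: str, caller_model: str,
                  pinned_tier: str = SUB_AGENT_TIER) -> Dispatch:
    """Orchestrator inherits the caller's model; any other role gets the pinned tier."""
    if role == "orchestrator":
        return Dispatch(role, caller_model)
    return Dispatch(role, pinned_tier)


if __name__ == "__main__":
    # Role names other than "orchestrator" are placeholders for pipeline phases.
    for role in ("orchestrator", "phase-1-agent", "phase-2-agent"):
        print(resolve_model(role, caller_model="caller-selected-model"))
```

Making the pinned tier an explicit parameter keeps the selection decision in one place, so the evidence recorded in the role-assignment register can be traced to a single setting.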

Implementation Verification uses an analogous lens-assignment approach — correctness lens, safety lens, spec-adherence lens — where each lens is assigned to the agent type best suited to its detection method. The same role-fit thinking applies.

Start here

LLM-as-a-Judge engagements typically run as part of a Stream B AI system design or a Stream D evaluation. If your existing AI system is producing outputs that vary in quality in ways you cannot explain from the prompt or the task, the model assignment is the first place to look. Book a discovery conversation.
