Here’s the stat that should terrify every board:
25% of companies building AI can’t tell you if their AI is right or wrong.
They’re not laggards. They’re your “innovative” competitors. And they’re about to pay for it.
This comes from Iconic’s survey of senior AI leaders—the people who should know better. Even more telling: when Superintelligent ran thousands of agent readiness audits, they found this pattern everywhere. Companies building sophisticated AI systems with no way to measure if they actually work.
The stakes in 2025 are different. Companies will no longer buy or build for FOMO. They’ll buy or build because they believe there’s value. And there will be enormous pressure to show that value.
The question: Will you have answers when that pressure hits?
What Evaluation Blindness Actually Costs
I keep hearing anecdotes of enterprise AI systems failing around 50% of the time, even after a £3M investment and a board presentation showcasing them as the digital transformation flagship.
The pattern repeats across industries:
- Glorious builds that “went up in flames” when major vendors like Salesforce or ServiceNow caught up
- Pilot Hell, where most companies get permanently stuck
- £300K custom builds that became obsolete when Salesforce shipped native features
- Compliance nightmares when auditors asked “How do you know this is accurate?” and got silence in return
As Superintelligent’s research found: “Those who don’t test rigorously pay a very, very heavy price. Often the quickest road to value is the one that slows down for proper evaluation.”
If you can’t prove your AI’s accuracy, you won’t have an answer. The 25% without evals will be exposed.
Before We Talk About Missing Evals, Let’s Define Them
Most executives confuse testing with evaluation. They’re not the same thing.
Think of evals like this: Your AI is a junior analyst. Testing checks if they show up to work. Evaluation checks if their analysis is accurate, if they cite proper sources, and if they’d pass an audit.
Without evals, you’re promoting someone you’ve never actually reviewed.
Here’s what a simple eval looks like:
Input: "What was Q3 revenue?"
Expected: £47M (from verified data)
AI Output: £52M
Result: ❌ FAIL – Hallucination detected
One question is easy. Your enterprise AI answers 10,000 variations. How do you know which 1,000 are fabricating numbers?
You scale this. You build datasets. You create methods to test systematically. That’s evaluation.
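To make that concrete, here is a minimal sketch of the single check above turned into a repeatable harness. Everything in it is an assumption for illustration: `EvalCase`, `run_evals`, and `toy_model` are hypothetical names, the Q2 figure is invented, and exact-match scoring is the simplest possible grading method, not the only one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected: str  # answer verified against a trusted source


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the model and score by exact match."""
    failures = []
    for case in cases:
        output = model(case.question).strip()
        if output != case.expected:
            failures.append((case.question, case.expected, output))
    passed = len(cases) - len(failures)
    accuracy = passed / len(cases) if cases else 0.0
    return {"total": len(cases), "passed": passed,
            "accuracy": accuracy, "failures": failures}


def toy_model(question: str) -> str:
    """Hypothetical stand-in for the AI system under test."""
    canned = {
        "What was Q3 revenue?": "£52M",  # wrong on purpose
        "What was Q2 revenue?": "£41M",
    }
    return canned.get(question, "unknown")


# Illustrative dataset; the Q2 figure is invented for this example.
cases = [
    EvalCase("What was Q3 revenue?", "£47M"),
    EvalCase("What was Q2 revenue?", "£41M"),
]

report = run_evals(toy_model, cases)
print(f"{report['passed']}/{report['total']} passed")
for question, expected, got in report["failures"]:
    print(f"FAIL: {question!r} expected {expected}, got {got}")
```

Swap `toy_model` for a call to your real system and grow `cases` from verified data. The failure list is the part you would show an auditor.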
And here’s the secret that successful companies won’t tell you: they view their eval systems as a competitive advantage. According to Superintelligent’s research, “Companies with good evals won’t share them. They view them as their secret sauce for getting quick AND quality results.”
If they’re hiding them, you know they’re valuable.
Why Smart Companies Get This Wrong
Superintelligent’s research across thousands of audits reveals three failure modes. You’ll recognise your organisation in one of them.
The Sizzle Seeker
Optimises for: PR and demos
These companies chase shiny AI showcases for marketing and social media. They can’t be bothered with the “drudge work” of data quality and evaluation systems.
Result: Stuck in Pilot Hell. Impressive demos that never make it to production.
What evals would have caught: You’d have known your demo was 40% hallucination before the board saw it. You’d have data showing why it couldn’t scale beyond the cherry-picked examples.
The Over-Planner
Optimises for: Certainty
These organisations look at all the data issues, infrastructure gaps, and integration challenges. They become overwhelmed by the scale and unknowns. Every discovery spawns three more workstreams.
Result: Analysis paralysis. They never actually start.
What evals would have shown: You’d have data proving which two problems actually matter—not all 47. You’d know where to focus instead of trying to solve everything.
The Foundation Maximalist
Optimises for: “Perfect platform”
These teams see the problems clearly. Too clearly. They initiate comprehensive foundational projects before building anything. Six months to sort out the data lake. Another four to implement the perfect governance framework.
Result: They miss the entire opportunity window while “preparing.”
What evals would have enabled: You’d have shipped v1, measured actual gaps with real data, and fixed them iteratively. Instead, you’re still architecting.
Here’s the uncomfortable truth from Superintelligent’s research: “These foundational projects didn’t really work even before Gen AI existed. I didn’t believe them then, and I don’t believe them now.”
The Pendulum Trap
The worst pattern? Companies that swing between archetypes. They start as Sizzle Seekers with a quick demo, hit data reality, then swing to Over-Planner paralysis (“stop everything, fix foundations first”). Both kill momentum.
The goal is holding the middle through discovery. That’s where Intentional Opportunism comes in.
The Framework That Actually Works
After thousands of audits, one approach consistently delivers results: Intentional Opportunism.
The philosophy is simple: “Be very pragmatic—blend opportunistic low-hanging fruit with structured vision. It’s not either/or, it’s both.”
Here’s the quarter-by-quarter playbook:
Q1 (Now): Launch 1-2 agent projects combining visibility with value. These projects get a “free pass” on serious foundational work. Just get them out the door. Learn by doing.
Q2 onwards: Build your roadmap based on what you learned. Now identify critical gaps—not all gaps. Build reusable blocks. Focus on the 20% that matters.
And here’s the permission structure that liberates paralysed organisations:
For your first 1-2 agent projects, you get a free pass on foundational work. Just ship them.
Aim for 70% right. That’s enough to be ahead of the curve.
Use what you learn to build proper foundations. Not the other way round.
Why this works:
- Value realised quickly = stakeholder buy-in
- Real learning only happens by doing
- Everyone understands what the technology can and can’t do
- Builds momentum for proper infrastructure investment
You’re not avoiding the hard work. You’re sequencing it intelligently.
What To Actually Do On Monday
Three frameworks that turn philosophy into action.
Framework 1: The Stack That Scales
Most companies either over-centralise (creating bottlenecks) or over-decentralise (creating chaos). The solution is a three-tier architecture that serves all populations.
Tier 1 - Point-Based Platforms (Relevance, All India). For non-technical teams. Visual, no-code interfaces. Fastest path to agent building without IT dependencies.
Tier 2 - Low-Code Automation (n8n, Zapier, Make). For technical ops teams. More flexibility, easy integrations with existing systems. Bridges business needs and engineering capability.
Tier 3 - Full Developer Flexibility (OpenAI SDK, Google ADK). For engineering teams. Complete control. Custom integrations. Maximum capability for complex use cases.
Plus: Vertical Solutions. For legal, customer support, and coding, buy proven solutions. Don’t compete with vendors who do this daily. Customise, don’t rebuild from scratch.
Plus: Reusable Internal Utilities. Data access, monitoring, governance, guardrails. Build once, serve to everyone. This is where central teams add real value.
The provision checklist: By end of Q1, your teams should have access to all three tiers. This eliminates “waiting for IT” as the blocker that kills momentum.
The governance model:
- Teams build locally: Day-to-day workflows, experimentation
- Central team handles: Shared/complex problems, write permissions to critical systems
- Boundary rule: Centralise based on Value × Complexity × Spread
- Security model: Local teams get read access; central approval required for write access to finance systems and customer data
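One way to apply the boundary rule is to score each proposed project. The sketch below is an assumption throughout: the source gives only the product Value × Complexity × Spread, so the 1-to-5 rating scale, the threshold of 27, and the names `centralisation_score` and `owner` are all hypothetical choices you would tune yourself.

```python
def centralisation_score(value: int, complexity: int, spread: int) -> int:
    """Multiply the three factors, each rated 1 (low) to 5 (high)."""
    for name, rating in (("value", value), ("complexity", complexity), ("spread", spread)):
        if not 1 <= rating <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {rating}")
    return value * complexity * spread


def owner(value: int, complexity: int, spread: int,
          writes_to_critical_system: bool = False,
          threshold: int = 27) -> str:
    """Return 'central' or 'local' for a proposed agent project."""
    # Write access to finance systems or customer data always routes
    # to the central team, whatever the score.
    if writes_to_critical_system:
        return "central"
    score = centralisation_score(value, complexity, spread)
    # Assumed threshold: 27 corresponds to a mid rating (3) on all three factors.
    return "central" if score >= threshold else "local"


print(owner(2, 2, 2))                                  # small team workflow
print(owner(4, 3, 3))                                  # valuable and widely shared
print(owner(1, 1, 1, writes_to_critical_system=True))  # trivial, but writes to finance
```

The write-access check runs first on purpose: a low-scoring project that touches critical systems should still not be approved locally.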
Framework 2: Build vs Buy Decision Rules
The £300K mistake happens when companies build what vendors are about to ship. Here are three programmatic rules to avoid it:
Rule 1 - The 80% Rule: If a vendor tool covers 80% of your use case → Buy it. Don’t waste months trying to build the remaining 20% of perfection.
Rule 2 - The Platform Inevitability Rule: If Salesforce, Workday, or your core platform will inevitably build this → Wait (or create a temporary patch).
As one research participant put it: “I’ve seen many cases of glorious builds that went up in flames once a major vendor got their act together.”
Rule 3 - The Competitive Advantage Rule: Build only when:
- No solution exists in market or vendor roadmaps
- Building precisely to your needs creates competitive differentiation
- You have the skills and resources to get it right
This framework prevents the classic pattern: spending £300K and six months building something your vendor ships as a native feature three months after you launch.
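The three rules can be written down as a small decision function. Treat this as a sketch under assumptions: the `UseCase` fields, the order in which the rules are checked, and the fall-through answer `"revisit"` (for cases the rules do not cover) are hypothetical and not part of the original framework.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    vendor_coverage: float        # share of the use case the best vendor tool covers, 0.0 to 1.0
    on_platform_roadmap: bool     # your core platform will likely ship this
    market_solution_exists: bool  # any vendor offers it or has it on a roadmap
    differentiating: bool         # building to your exact needs sets you apart
    have_skills: bool             # you have the people and budget to get it right


def build_or_buy(uc: UseCase) -> str:
    # Rule 1: the 80% rule.
    if uc.vendor_coverage >= 0.8:
        return "buy"
    # Rule 2: platform inevitability.
    if uc.on_platform_roadmap:
        return "wait"
    # Rule 3: build only when all three conditions hold.
    if not uc.market_solution_exists and uc.differentiating and uc.have_skills:
        return "build"
    # Assumed default for anything the three rules leave open.
    return "revisit"


print(build_or_buy(UseCase(0.85, False, True, False, True)))
print(build_or_buy(UseCase(0.40, True, True, True, True)))
print(build_or_buy(UseCase(0.00, False, False, True, True)))
```

Writing the rules down forces the awkward conversation early: someone has to put a number on vendor coverage before the build starts.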
Framework 3: The 10-Hour SOP Solution
Your best processes live in experts’ heads. They’re too busy to document them. Agents can’t learn from what doesn’t exist.
This is often the blocker that stops everything.
Here’s the recording hack that transforms weeks into hours:
Before: Weeks of documentation meetings. Experts too busy to write SOPs. Projects stalled waiting for tribal knowledge to be codified.
The method:
- Have your subject matter expert work in front of a recording (any device)
- They work normally and narrate what they’re doing as they go
- Feed the video → LLM → Draft SOP
- Expert reviews and refines (handful of hours total)
Result: Fully documented processes in days, not months.
Real companies have used this to transform “we have no documentation” (complete blocker) into “we have SOPs for our top 20 processes” (solid foundation for agent development).
The ROI: A handful of expert hours = fully documented processes that would have taken weeks of meetings and writing. This unlocks agent development when tribal knowledge is your blocker.
Solving Data Without Boiling The Ocean
Data fragmentation kills more AI initiatives than any other factor. But you don’t need to fix everything. Fix what matters.
1. Use AI for Data Problems (the irony works)
Connect data entities with natural language. Use RAG and semantic similarity for related data retrieval. This didn’t exist pre-GenAI. It’s a massive unlock for problems that were previously intractable.
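A toy illustration of similarity-based retrieval, using only the standard library. This is a stand-in, not the real thing: it compares word counts, whereas a production RAG setup would compare embeddings from a model. The function names and sample field descriptions are invented for the example.

```python
import math
from collections import Counter


def vectorise(text: str) -> Counter:
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, documents: list[str]) -> str:
    """Return the document whose wording is closest to the query."""
    query_vec = vectorise(query)
    return max(documents, key=lambda doc: cosine(query_vec, vectorise(doc)))


# Invented field descriptions from three disconnected systems.
documents = [
    "customer invoice total amount due",
    "employee annual leave balance",
    "supplier delivery lead time",
]

match = most_similar("amount due on customer invoice", documents)
print(match)
```

The shape is what matters: describe each data entity in plain language, then let similarity do the matching that used to need a hand-built mapping table.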
2. The MCP Breakthrough
Instead of rationalising 287 disconnected data sources (18-month project, £2M budget, probable failure), identify your handful of most critical data buckets. Structure them as MCP (Model Context Protocol) servers.
This transforms “fix everything first” paralysis into a manageable incremental project. You’re not boiling the ocean—you’re heating the kettles that matter.
Start with the data sources that unlock the highest value. Make them accessible. Then move to the next tier.
3. Build New Systems Right From Day One
If you’re building new systems or starting fresh, build for agents from the beginning:
- Data accessible and logically organised
- Metadata everywhere you can put it
- Standard operating procedures documented from launch
- No excuses—you know agents are coming
The principle: Invest in cleanup only for extreme ROI. Historical data foundation projects failed because they tried to do everything. Be opportunistic. Select what yields actual return.
The Practice That Changes Everything
Here’s the breakthrough that’s emerging from successful implementations: Use agents to test other agents.
Why this matters:
- Overcomes the human bottleneck in testing
- Allows continuous evaluation at scale
- Creates safe sandbox environments for experimentation
- Shows “lots of promise” according to Superintelligent’s research
You don’t need endless testing. You need rigorous testing:
- Evaluation datasets and methods
- Safe environments to test and fail
- Only then promote to production
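Here is a minimal sketch of one agent grading another, with a gate before production. The agent and the judge are hard-coded stubs so the example runs on its own; in practice both would be model calls. The names `promotion_gate`, `toy_agent`, and `toy_judge`, the citation format, and the 95% pass-rate threshold are assumptions.

```python
from typing import Callable

Agent = Callable[[str], str]
Judge = Callable[[str, str], bool]  # (prompt, answer) -> acceptable?


def promotion_gate(agent: Agent, judge: Judge, prompts: list[str],
                   min_pass_rate: float = 0.95) -> tuple[bool, float]:
    """Run the agent on every prompt, let the judge grade each answer,
    and allow promotion only if the pass rate clears the threshold."""
    if not prompts:
        raise ValueError("need at least one prompt to evaluate")
    passes = sum(1 for prompt in prompts if judge(prompt, agent(prompt)))
    pass_rate = passes / len(prompts)
    return pass_rate >= min_pass_rate, pass_rate


def toy_agent(prompt: str) -> str:
    """Stub agent under test: only cites a policy for refund questions."""
    if "refund" in prompt:
        return "Refunds are allowed within 30 days [policy:R1]"
    return "I think so"


def toy_judge(prompt: str, answer: str) -> bool:
    """Stub judge: accepts only answers that cite a policy."""
    return "[policy:" in answer


prompts = [
    "refund window?",
    "can I get a refund?",
    "is shipping free?",
    "refund on sale items?",
]

promote, pass_rate = promotion_gate(toy_agent, toy_judge, prompts)
print(promote, pass_rate)
```

The judge is itself an AI system, so it needs its own spot checks by a human. The gain is that humans review a sample of the judge’s verdicts instead of every answer.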
The ROI reality: “Those who don’t test rigorously pay a very, very heavy price. Often the quickest road to value is the one that slows down for proper evaluation.”
And remember: successful companies view their eval systems as a competitive advantage they won’t share. If you can build what they’re hiding, you’ve built something valuable.
Why This Year Is Different
2024 was the FOMO year. Companies built agents to say they had agents. Demos, not deployment. Proof of concepts that never went to production.
2025 is the value realisation year. Boards want ROI. There will be enormous pressure to show tangible business value.
The timing is perfect for two reasons:
The MCP unlock: New tooling (Model Context Protocol) makes incremental data organisation possible. No more “fix everything first” paralysis. You can actually make progress without boiling the ocean.
The permission moment: Enterprises now have permission to invest in foundations—context engineering, not just shiny demos. The case studies being shared this year will be about infrastructure and systematic deployment, not speed to first demo.
This is your window.
The Three Questions That Matter
Based on Superintelligent’s thousands of audits, three questions separate winners from losers:
1. “Do you have evaluation datasets and methods to test your AI?”
If no → this is your Q1 priority. Everything else is built on sand without this.
2. “Are you a Sizzle Seeker, Over-Planner, Foundation Maximalist, or Intentional Opportunist?”
If you’re one of the first three → use the frameworks above to find the middle ground. The pendulum will kill your momentum if you let it swing.
3. “What percentage of your use cases have documented SOPs?”
If low → use the recording hack this month. Don’t let tribal knowledge block your entire AI strategy.
The Action Plan
This quarter: Launch 1-2 visible, valuable agent projects. Get your free pass on foundations. Just ship and learn.
Next quarter: Build your roadmap from actual learnings. Provision the three-tier stack. Focus on critical gaps only—not all gaps.
The permission: Aim for 70% right. That’s enough to be ahead of the curve.
The 12-Week Window
Q1 earnings season starts in 12 weeks. Your board will ask two questions:
“What’s our AI ROI?”
“How do you know it’s accurate?”
The companies winning won’t have the fanciest models or the most impressive demos. They’ll be the ones who can prove their AI works—at scale, under audit, before anyone asks.
That’s not a technical problem. That’s an executive decision you’re making this week.
25% of your competitors are flying blind with significant investment and no way to measure success.
Which side will you be on when the questions start?
What’s your experience getting stuck between velocity and quality? Which framework would help you most: the three-tier stack, build vs buy rules, or the SOP recording hack? Share in the comments.
Poll: Which archetype is your organisation?
- 🎬 Sizzle Seeker (optimising for PR and demos)
- 📊 Over-Planner (optimising for certainty)
- 🏗️ Foundation Maximalist (optimising for “perfect platform”)
- 🎯 Intentional Opportunist
Insights from Superintelligent’s Head of Research, based on thousands of agent readiness audits. Additional research from Iconic’s survey of senior AI leaders.