The Guide on How to Avoid Brené Brown’s AI Problem
I recently saw a great post from Nikki Belt, who had watched Brené Brown share a pretty wild story about AI. There is a widespread, fundamental misunderstanding of how AI works, and many confident assertions rest on poor AI literacy. Below is Nikki’s post, followed by my analysis of what is happening in the global market at the moment:
While writing her new book, Strong Ground (please add to your reading list right now!) she and her team ran a 60-day AI experiment. Her team used ChatGPT and Perplexity to do a literature review. In parallel, she had graduate students doing the same work manually (the OG way). The results? 60-70% of the AI-generated academic citations were completely fake. Perfect formatting. Compelling quotes. Zero existence. “They were SO legit sounding,” she said, “but they didn’t exist.” Brené described feeling hollow afterward. Like her team had been in an “AI haze.” Not energised. Not productive. Hollow. She wasn’t telling this story to scare us away from AI. She was making a different point entirely: In a world where AI is seeping into every corner of our work, human wisdom matters more than ever. Trust building. Empathy. Real connection. These aren’t soft skills anymore—they’re the skills that keep your business from becoming a shell of what it could be. This is exactly what I see playing out with the business owners I work with. The question isn’t “should we use AI?” It’s “how do we use it?” Are you chasing efficiency for efficiency’s sake? Or are you thinking about how these tools help your team do more meaningful work, serve customers better, and make more thoughtful decisions? Because here’s the truth: AI can help you scale. But if you’re not careful, it can also hollow out what makes your business special. You have a choice in how you adopt this technology. Don’t let it strip away the human parts that matter most.
I read this and immediately thought: that hollow feeling Brené describes? It’s real. It’s valid. And it’s completely preventable.
But here’s what bothers me: the conclusion people will draw from this story. They’ll say “AI isn’t ready” or “AI hallucinates” or “you can’t trust AI for serious work.”
That’s not what this story shows.
What it actually shows is this: Brené’s team ran a literature review using a tool that was never designed for literature review, without the setup that would make it work, and got exactly the result you’d expect from that mismatch.
It’s like using a calculator without a decimal point, getting wrong answers, and concluding “calculators can’t be trusted for financial work.” The calculator isn’t broken. You’re using it wrong.
The uncomfortable truth? This same pattern—expecting AI to be something it isn’t, then calling it a failure when it behaves exactly as designed—is playing out across thousands of enterprises right now. And it’s expensive.
I’ve watched teams spend £2M building “anti-hallucination” infrastructure for problems that needed £100K in proper setup. I’ve seen procurement departments reject AI entirely because of stories exactly like Brené’s. I’ve seen CTOs over-invest in human verification for errors that are completely deterministic and fixable.
Let me show you what actually happened in Brené’s experiment, why it happened, and more importantly—how understanding the difference between four types of AI errors could completely change how you budget, deploy, and succeed with these systems.
Because the graduate students didn’t win because they’re smarter. They won because they knew where the library was.
What Actually Happened (And Why It Matters)
Let’s backprop through what Brené’s team actually did.
They asked ChatGPT: “Give me academic citations about [topic]”
Here’s what ChatGPT heard: “Generate text that looks like academic citations”
Not “retrieve citations from a database.” Not “search my verified sources.” Generate text that matches the pattern of what citations look like.
The mental model most people have:
Query → AI searches knowledge base → Returns real citations
What’s actually happening:
Query → AI computes probability distribution over next words →
Samples plausible-looking citation text
It’s the difference between asking someone to check the library catalogue versus asking them to describe what books might exist based on knowing how academic publishing works.
Here’s the crucial bit: ChatGPT is phenomenally good at knowing what citations should look like. It’s seen millions of academic papers during training. It knows that author names follow certain distributions, that years cluster around recent decades, that journal names co-occur with certain topics, that citation formats have specific structures.
So when you ask for citations, it generates perfectly formatted, utterly convincing, completely fabricated references. Not because it’s “hallucinating” or “confused”—because it’s doing exactly what it was trained to do: produce text with high probability under the learned distribution.
Think of it like this: if I asked you to invent a plausible academic paper about leadership without checking any sources, you could probably generate something like “Johnson, M. & Williams, S. (2019). Transformational Leadership in Modern Organisations. Journal of Organisational Behaviour, 45(3), 234-251.” It sounds right. The format is correct. The journal name is plausible. The year is reasonable. But you just made it up based on your knowledge of what academic citations look like.
That’s exactly what the model is doing—at a much more sophisticated level.
The technical reality is that the model computes p(next word|all previous words). When you ask for “citations about leadership,” it’s computing something like p("Brown, B. (2019)"|"Citation about leadership:"), and that probability is high—not because Brown, B. (2019) exists, but because that pattern exists millions of times in the training data.
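To make that concrete, here is a minimal sketch of what "sampling from a probability distribution" means. The three candidate continuations and their probabilities are invented for illustration; a real model scores tens of thousands of tokens, one token at a time.

```python
import random

# Hypothetical next-token probabilities after the prompt
# "Citation about leadership:". The numbers are invented for illustration.
next_token_probs = {
    "Brown, B. (2019)": 0.40,
    "Johnson, M. (2018)": 0.35,
    "Williams, S. (2020)": 0.25,
}

def sample_next(probs, rng):
    """Draw one continuation in proportion to its probability."""
    options = list(probs)
    weights = [probs[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]

def greedy_next(probs):
    """Pick the single most probable continuation (temperature 0)."""
    return max(probs, key=probs.get)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next(next_token_probs, rng) for _ in range(5)]
print(samples)
print(greedy_next(next_token_probs))
```

Notice what is missing: nothing in either function checks whether the chosen reference exists. Probability is the only criterion.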
Why does this matter for your business? Because this isn’t an AI failure. This is a resource allocation failure masquerading as an AI failure.
Brené’s team needed retrieval grounding (connect AI to actual document database with proper indexing), verification loops (check citations against sources), proper mode selection (understanding when search and reasoning are required), and setup time investment (the engineering work that unlocks reliability).
Cost: Anywhere from £10K to £80K depending on corpus size and infrastructure maturity.
What this story will cause: Thousands of enterprises to reject AI for literature review, academic research, and knowledge work—tasks where properly configured AI is brilliant.
That’s the real cost of AI illiteracy.
The Four Types of “Failures” (And Why Only One Is Real)
Brené’s team experienced what I call a Type 3 error—but let me show you why diagnosing which type you’re facing matters enormously.
When your AI “fails,” you’re actually seeing one of four completely different problems. And only one of them needs the expensive solutions most teams are building.
Type 0: Environment Drift (Wrong Input)
Imagine your RAG system pulls different documents each time you run the same query. The AI produces a perfect analysis… of different source material. Nothing’s broken; your “database” moved underneath you.
The fix: Snapshot your retrieval indices with versioning, implement content-addressable storage for documents, log exactly what was accessed (document IDs, chunk boundaries, retrieval scores).
The cost: £30-80K in infrastructure work to build reproducible retrieval pipelines.
Brené’s case: Not applicable—she had no document database connected at all.
Type 1: Learned Wrong Pattern (Deterministic Error)
The model thinks all revenue growth is positive news, even when it’s -15%. Run the same query 10 times with temperature set to zero, and you get the same wrong answer every single time. The model genuinely believes this pattern. It’s not being “random”—it’s being consistently incorrect.
The fix: Better prompting with examples, add validation checks, implement step-by-step verification.
The cost: £50-200K in prompt engineering and validation infrastructure.
Brené’s case: Partially applicable—the model learned citation patterns, not citation facts.
Type 2: Bad Decision Strategy (Model Knows, Decoding Fails)
Ask the AI a complex question and it fails. Simplify the exact same question and it succeeds. The model has the right answer somewhere in its probability distribution, but your sampling policy or context structure isn’t extracting it.
The fix: Lower temperature, generate multiple candidates and rerank them, use step-by-step prompting techniques.
The cost: £30-100K in sampling infrastructure improvements.
Brené’s case: Not applicable—the model doesn’t have the specific citations in its weights at all.
Type 3: True Knowledge Gap (Genuine Hallucination)
Ask about specific Q3 2024 financial data the model never saw during training. Sample 20 times with different random seeds and you get wildly different fabricated answers. High uncertainty equals a flat probability distribution across many options. The model doesn’t know, so it generates plausible-sounding content to fill the gap.
The fix: Retrieval grounding, forced abstention when uncertain, human verification loops.
The cost: £500K-2M in architectural changes depending on scale and accuracy requirements.
Brené’s case: THIS IS WHAT HAPPENED.
Here’s the diagnostic that would have saved Brené’s team weeks of work:
- Generate the same citation request 20 times with different random seeds
- Observe wildly different author names, years, and journals each time
- Conclusion: High uncertainty region—the model doesn’t have this knowledge in its weights
- Correct response: Build proper retrieval infrastructure OR enable web search with verification OR verify manually
They would have known in five minutes this approach wasn’t going to work.
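The check above can be sketched in a few lines. This is a rough illustration, not a calibrated uncertainty measure: the 0.5 agreement threshold is my assumption, and the sample lists are stand-ins for 20 real model responses generated with different seeds.

```python
from collections import Counter

def agreement_rate(outputs):
    """Fraction of samples that match the most common output."""
    if not outputs:
        raise ValueError("need at least one sample")
    counts = Counter(o.strip().lower() for o in outputs)
    return counts.most_common(1)[0][1] / len(outputs)

def looks_like_knowledge_gap(outputs, threshold=0.5):
    """Flag a likely Type 3 error when samples rarely agree (threshold is an assumption)."""
    return agreement_rate(outputs) < threshold

# Stand-in data: in practice these would be 20 responses to the same request.
stable = ["Paris"] * 18 + ["paris ", "Lyon"]
scattered = [f"Author{i}, A. ({2000 + i}). Title {i}." for i in range(20)]

print(agreement_rate(stable))     # 0.95
print(agreement_rate(scattered))  # 0.05
```

When nearly every sample is different, as in the second list, the model is filling a gap rather than recalling something it knows.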
The brutal truth is that Brené’s team experienced the ONE type of error that actually needs expensive infrastructure (Type 3: knowledge gap). But because the industry calls everything “hallucination,” most enterprises are building Type 3 solutions for Type 1 and 2 problems.
In my experience across enterprise deployments, roughly 70% of reported “hallucinations” are actually Types 0, 1, or 2—problems with specific, much cheaper solutions. But the imprecise language prevents proper diagnosis, which leads to massive budget misallocation.
What They Should Have Done (The Proper Setup)
Here’s the architecture for literature review that actually works. This is a five-component system, not a five-minute configuration.
A. Corpus Pipeline (For Controlled Internal Corpora)
When you’re working with a known set of papers—your organisation’s research library, a specific domain corpus, or curated literature—you need a proper indexing pipeline:
The engineering sequence:
- Ingest PDFs → Use tools like pdftotext with layout preservation to extract text while maintaining structure
- Text normalisation → Handle de-hyphenation, Unicode normalisation, remove headers/footers
- Metadata extraction → Pull title, authors, publication year, DOI from PDF metadata or parse from text
- Deduplication → Hash by title + DOI to identify and merge duplicates
- Indexing → Build both vector embeddings (for semantic search) and keyword indices (for exact matching)
- Snapshot versioning → Assign version ID to the index state so you can replay queries against the same corpus state
Ground-truth library: Maintain a parallel Zotero library or CSL-JSON/BibTeX file that represents the canonical bibliography the model is allowed to cite. This becomes your source of truth for validation.
Why this matters: When your index has version IDs, you can reproduce exactly which documents existed at query time. When incidents happen three months later, you can replay them against the same corpus state instead of debugging against a moving target.
Setup time: 1-3 weeks depending on corpus size, OCR requirements, and infrastructure maturity
Cost: £30-80K for initial setup; ongoing maintenance depends on corpus update frequency
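The deduplication and snapshot steps are the easiest to picture in code. Here is a minimal sketch, assuming each paper is a dictionary with a `title` and an optional `doi`; the function names and the placeholder DOIs are mine, not part of any particular tool.

```python
import hashlib
import json

def record_key(title, doi):
    """Hash a normalised title + DOI so duplicate PDFs collapse to one key."""
    normalised = " ".join(title.lower().split()) + "|" + (doi or "").lower().strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Keep the first record seen for each title + DOI key."""
    unique = {}
    for rec in records:
        unique.setdefault(record_key(rec["title"], rec.get("doi")), rec)
    return list(unique.values())

def snapshot_id(records):
    """Derive a version ID from the sorted record keys in the index."""
    keys = sorted(record_key(r["title"], r.get("doi")) for r in records)
    digest = hashlib.sha256(json.dumps(keys).encode("utf-8")).hexdigest()
    return digest[:12]

papers = [
    {"title": "An Example Paper", "doi": "10.1234/example.1"},   # placeholder DOI
    {"title": "an  example paper", "doi": "10.1234/EXAMPLE.1"},  # same paper, messy metadata
    {"title": "A Different Paper", "doi": None},
]
corpus = deduplicate(papers)
print(len(corpus), snapshot_id(corpus))
```

Because the snapshot ID is derived from the contents, adding or removing a single paper changes it, which is what lets you tell later which corpus state a query ran against.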
B. Web Research Lane (For Fresh External Sources)
When you need recent papers not in your corpus, you can’t just “enable web search” and hope:
Domain allowlisting (non-negotiable):
- Only permit domains from known publishers and registries: Crossref, PubMed, arXiv, ACM Digital Library, IEEE Xplore, Nature/Springer/Elsevier portals
- Block aggregators, mirrors, and unverified sources
- Maintain an explicit allowlist that you review quarterly
The retrieval-to-verification pipeline:
- Fetch → Search returns candidate papers from allowed domains
- Resolve DOI → Extract and resolve DOI to get canonical metadata
- Store canonical PDF URL → Don’t rely on the search result URL; get the publisher’s authoritative link
- Cache passages → Store the actual text passages you’re citing against for later verification
Why this matters: A blue link that resolves today might 404 tomorrow, or might point to an abstract page rather than the full paper. Resolving to the DOI and caching passages means you can verify quotes even if access changes.
Setup time: 1-2 weeks to establish domain allowlists, implement DOI resolution, and build passage caching
Cost: £15-30K for pipeline infrastructure; ongoing costs for API access and storage
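As a rough sketch of the allowlist and DOI steps, assuming you filter search results by hostname and pull DOIs out of page text with a regular expression: the hostnames below are examples to adapt, and the pattern covers common modern DOIs rather than every valid one. Actual DOI resolution needs a network call, which is left out here.

```python
import re
from urllib.parse import urlparse

# Illustrative allowlist; the exact hostnames are assumptions to adapt and review.
ALLOWED_DOMAINS = {"doi.org", "arxiv.org", "pubmed.ncbi.nlm.nih.gov", "api.crossref.org"}

# Common shape of modern DOIs; not an exhaustive validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def is_allowed(url):
    """True if the URL's host is an allowlisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def extract_doi(text):
    """Return the first DOI-like string in the text, or None."""
    match = DOI_PATTERN.search(text)
    return match.group(0).rstrip(".,;)") if match else None

def canonical_url(doi):
    """Build the doi.org resolver link for a DOI."""
    return f"https://doi.org/{doi}"

print(is_allowed("https://arxiv.org/abs/0000.00000"))
print(is_allowed("https://arxiv.org.mirror.example/abs/0000.00000"))
print(canonical_url(extract_doi("See https://doi.org/10.1234/example.1.")))
```

Matching on the parsed hostname, rather than searching the URL string, is what stops a lookalike mirror domain from slipping through.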
C. Generation Policy (Deterministic Decode for Factual Work)
Your system prompt and sampling strategy determine whether outputs are reproducible:
System prompt enforcement:
"Do not invent citations. Only cite items present in the
`approved_citations` list provided below. If none qualify,
respond with 'No suitable sources found in the provided corpus.'
Never generate plausible-sounding citations not in this list."
Sampling strategy:
- Use greedy decoding (temperature = 0, do_sample = False) for all factual work
- This eliminates sampling variance, making outputs deterministic and reproducible
- Treat the model as a deterministic core where stochastic decoding is a choice: run it in calculator mode here, not creative mode
Why this matters: When verification fails, you need to reproduce the exact output to debug. Stochastic sampling means every run is different, making root-cause analysis impossible. For citations, there’s no creative upside to randomness—only reproducibility downside.
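A minimal sketch of how the policy can be wired together, assuming you assemble the system prompt yourself before each call. The `build_system_prompt` helper is hypothetical, and the decoding key names follow common conventions but depend on the model API you use.

```python
SYSTEM_RULES = (
    "Do not invent citations. Only cite items present in the "
    "`approved_citations` list provided below. If none qualify, "
    "respond with 'No suitable sources found in the provided corpus.'"
)

# Deterministic decoding settings. Key names vary by API; these are assumptions.
DECODING = {"temperature": 0, "do_sample": False}

def build_system_prompt(approved_citations):
    """Append the approved list to the rules so the model can only pick from it."""
    lines = [f"{i}. {c}" for i, c in enumerate(approved_citations, start=1)]
    listing = "\n".join(lines) if lines else "(none)"
    return f"{SYSTEM_RULES}\n\napproved_citations:\n{listing}"

# Placeholder entry; in practice this comes from your ground-truth library.
prompt = build_system_prompt(["Example, A. (2020). A Placeholder Title."])
print(prompt)
print(DECODING)
```

The prompt is only a request. It narrows what the model is likely to produce, but it is the verification gate, not the wording, that actually stops a fabricated citation.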
D. Post-Generation Verification (Must-Pass Gate)
Generation is not deployment. Every citation must pass validation before reaching the user:
The verification sequence:
- Parse citations → Extract all citations from generated text into structured format
- Normalise to CSL-JSON → Convert to a standard format regardless of original style
- Match by DOI/title → First try an exact DOI match against your ground-truth library; fall back to a fuzzy title match (e.g., normalised Levenshtein similarity of at least 90%); flag any citation that doesn’t match within tolerance
- Open canonical PDF → Retrieve the actual paper from your corpus or canonical URL
- Verify quote presence and page range → Extract text from the cited page(s); check that the quoted text appears (exact match or high similarity); verify the page range is correct
Failure handling:
- If any check fails: Either abstain (“I cannot verify this citation”) or replace with a verified alternative from your corpus
- Never pass through unverified citations
- Log all failures for analysis
Why this matters: The model will generate plausible-looking garbage without enforcement. Verification as a must-pass gate is the only way to achieve citation-grade reliability.
Setup time: 1-2 weeks to build parsing, matching, and PDF verification infrastructure
Cost: £20-40K for verification pipeline development
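The matching step is the heart of the gate, so here is a small sketch of it. It assumes citations and library entries are dictionaries with `title` and optional `doi` fields, and it reads the 90% threshold as a normalised similarity (1 minus edit distance divided by the longer title's length), which is one reasonable interpretation rather than the only one.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def verify_citation(citation, library, threshold=0.90):
    """Return the matching library entry, or None if the citation fails the gate."""
    for entry in library:
        if citation.get("doi") and citation["doi"].lower() == (entry.get("doi") or "").lower():
            return entry  # exact DOI match
    for entry in library:
        if similarity(citation["title"], entry["title"]) >= threshold:
            return entry  # fuzzy title match
    return None  # caller should abstain or substitute, and log the failure

# Placeholder library entry for illustration only.
library = [{"title": "An Example Paper on Leadership", "doi": "10.1234/example.1"}]
print(verify_citation({"title": "An Example Paper on Leadershp", "doi": None}, library))
print(verify_citation({"title": "Completely Unrelated Work", "doi": None}, library))
```

A title with a small typo still matches, while an invented title returns `None` and never reaches the user. Quote and page-range checks would follow the same pattern against the extracted PDF text.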
E. Audit & Replay Infrastructure (For Governance)
Every query must be fully reproducible for compliance, debugging, and continuous improvement:
The execution manifest (log for every query):
- Model hash (exact version/checkpoint)
- Prompt template (with all system instructions)
- Decoding parameters (temperature, top-p, seed, sampling strategy)
- Index snapshot ID (which version of your corpus was accessed)
- Retrieval hits (which documents were retrieved, with scores)
- Verification logs (which citations passed/failed validation, why)
- Timestamp and user context
Why this matters: When someone reports a bad citation in week 12, you can load the exact model version, corpus snapshot, and parameters to reproduce the failure. This makes incidents debuggable instead of mysterious. It also keeps procurement and compliance teams happy—you can prove exactly what happened and why.
Setup time: 1 week to implement logging infrastructure and replay tooling
Cost: £10-20K plus ongoing storage for logs
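One way to picture the manifest is as a single record written per query. The sketch below assumes a plain dataclass serialised to JSON; the field names mirror the list above, while the `replay_key` helper and all the values are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ExecutionManifest:
    model_hash: str
    prompt_template: str
    decoding: dict
    index_snapshot_id: str
    retrieval_hits: list
    verification_log: list
    timestamp: str
    user_context: dict = field(default_factory=dict)

    def to_json(self):
        """Serialise with sorted keys so identical manifests give identical text."""
        return json.dumps(asdict(self), sort_keys=True)

    def replay_key(self):
        """Hash of everything except timestamp and user, for comparing reruns."""
        payload = asdict(self)
        payload.pop("timestamp")
        payload.pop("user_context")
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# All values below are placeholders.
run = ExecutionManifest(
    model_hash="model-v1-placeholder",
    prompt_template="Do not invent citations...",
    decoding={"temperature": 0, "seed": 0},
    index_snapshot_id="snapshot-0001",
    retrieval_hits=[{"doc_id": "doc-42", "score": 0.87}],
    verification_log=[{"citation": "doc-42", "passed": True}],
    timestamp="2025-01-01T00:00:00Z",
)
print(run.to_json())
print(run.replay_key())
```

Two runs with the same replay key used the same model, prompt, parameters, corpus state and retrieval results, so any difference in their output points at something outside the manifest.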
Total System Cost & Timeline
Full production-grade citation system:
- Engineering time: 4-8 weeks depending on corpus size and existing infrastructure
- Initial build cost: £75-170K across all components
- Ongoing costs: API usage, storage, index maintenance, periodic verification audits
What Brené’s team had: £0 spent, zero infrastructure, basic chat mode
The gap: This is why the hollow feeling happened.
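For anyone who wants to check the totals, the arithmetic is simply the sum of the per-component ranges quoted in sections A, B, D and E (section C lists no separate figure). The week total assumes the components are built one after another.

```python
# (low, high) setup cost in thousands of pounds and setup time in weeks,
# copied from components A, B, D and E. Component C has no separate figure.
components = {
    "A. Corpus pipeline":   {"cost": (30, 80), "weeks": (1, 3)},
    "B. Web research lane": {"cost": (15, 30), "weeks": (1, 2)},
    "D. Verification":      {"cost": (20, 40), "weeks": (1, 2)},
    "E. Audit and replay":  {"cost": (10, 20), "weeks": (1, 1)},
}

def total(key):
    """Sum the low ends and the high ends of a range across all components."""
    low = sum(c[key][0] for c in components.values())
    high = sum(c[key][1] for c in components.values())
    return low, high

print(total("cost"))   # (75, 170)
print(total("weeks"))  # (4, 8)
```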
KPIs That Actually Matter (Not Hand-Wavy “Accuracy”)
Stop claiming “95% accuracy” without defining what you’re measuring. Track these specific metrics:
Groundedness: Percentage of citations with:
- Resolvable DOI or canonical URL
- Verified quote presence on cited page
- Correct page range
- Target: >95% for production systems
Abstention rate: Percentage of queries where the system refuses to fabricate rather than inventing plausible citations
- Measures whether your enforcement actually works
- Target: Abstain when groundedness would be <80%
Replay rate: Percentage of queries that reproduce exactly when re-run under the same manifest
- With deterministic decoding and versioned indices, should be 100%
- Any deviation indicates environmental drift or infrastructure issues
Recall (for known-answer testing): Given a gold-standard list of papers that should be cited for specific queries, what percentage does your system successfully retrieve and cite?
- Measures whether your retrieval is missing relevant sources
- Target: >90% for well-formed queries in your domain
Time-to-first-draft: How long from query submission to verified output?
- Matters for user experience and operational planning
- Typically 30 seconds to 3 minutes depending on corpus size and verification depth
Token cost per query: Total tokens used (input + output + retrieval context)
- Matters for budget planning
- Track separately for retrieval, generation, and verification steps
These metrics tell you whether you’ve built a reliable system or an expensive fabrication engine.
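Several of these KPIs reduce to simple ratios once the verification and replay logs exist. A minimal sketch, assuming each log entry is a dictionary with the boolean fields shown; the field names and sample data are hypothetical.

```python
def rate(flags):
    """Share of True values in an iterable of booleans (0.0 if empty)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

def groundedness(citations):
    """A citation counts only if DOI/URL, quote and page range all check out."""
    return rate(c["resolves"] and c["quote_found"] and c["pages_ok"] for c in citations)

def abstention_rate(queries):
    """Share of queries where the system declined rather than fabricated."""
    return rate(q["abstained"] for q in queries)

def replay_rate(runs):
    """Share of queries whose rerun output equals the original output."""
    return rate(r["original"] == r["rerun"] for r in runs)

def recall(cited_ids, gold_ids):
    """Share of gold-standard papers that the system actually cited."""
    gold = set(gold_ids)
    return len(gold & set(cited_ids)) / len(gold) if gold else 0.0

# Invented sample logs.
citations = [
    {"resolves": True, "quote_found": True, "pages_ok": True},
    {"resolves": True, "quote_found": True, "pages_ok": True},
    {"resolves": True, "quote_found": False, "pages_ok": True},
    {"resolves": True, "quote_found": True, "pages_ok": True},
]
print(groundedness(citations))                    # 0.75
print(recall(["a", "b", "x"], ["a", "b", "c", "d"]))  # 0.5
```

Keeping each metric as its own function makes it harder to blur them back into a single "accuracy" number.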
What Brené’s Team Actually Did
Query: "Give me citations about leadership and vulnerability"
Mode: Basic chat (no retrieval, no search, no grounding)
Validation: None
Infrastructure: None
Result: Fabricated but plausible-looking citations
It’s like asking someone to list books from your library when you haven’t told them where your library is, what books are in it, or given them access to go look.
The gap between “basic chat” and “production-grade citation system” is measured in weeks of engineering work and tens of thousands of pounds in infrastructure. Not in model capability—in system design.
Why This Pattern Is Costing Enterprises Millions
Let me show you the cycle of misdiagnosis I see repeatedly:
- Wrong expectations: Team expects AI to function as a database
- No setup: Doesn’t build retrieval infrastructure or verification protocols
- Predictable failure: Gets Type 3 error (genuine knowledge gap)
- Wrong conclusion: “AI can’t be trusted for serious work”
- Expensive overreaction: Builds comprehensive human verification for everything
- Actual cost: £2M in human-in-the-loop systems when £80K in proper setup would have worked
Here’s a real example from a legal team I consulted with. They wanted AI to cite relevant case law. They ran an experiment similar to Brené’s. Got fake case citations. Concluded “AI isn’t ready for legal work.” Built a £1.5M human verification system.
What they actually needed? Integrate their AI with their existing Westlaw or LexisNexis access via API (£15-25K integration work), build a citation validation layer that checks case references against legal databases (£30-50K), and add it to their existing workflow. They’re now paying lawyers to verify AI outputs that could have been grounded in verified legal databases from the start.
Another example from finance: A team wanted AI to analyse quarterly reports. Got inconsistent answers between runs. Called it “hallucination.” Started building an expensive retrieval architecture.
The actual problem? Type 2 error—bad sampling strategy. The model knew the answer, but the temperature was set too high (0.9) causing excessive randomness in token selection. The fix: Lower temperature to 0.2, generate five candidates and rerank them by consistency with source documents. Cost to implement proper reranking infrastructure: £30K.
They spent £800K on retrieval architecture they didn’t need.
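To show how small that fix is, here is a rough sketch of reranking candidates by consistency with a source document. Scoring by word overlap is a crude stand-in I am assuming for illustration; a production system would use a stronger consistency check. The source sentence and candidates are invented.

```python
import re

def tokens(text):
    """Lowercase word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(candidate, source):
    """Fraction of the candidate's tokens that also appear in the source."""
    cand = tokens(candidate)
    return len(cand & tokens(source)) / len(cand) if cand else 0.0

def rerank(candidates, source):
    """Order candidates from most to least supported by the source text."""
    return sorted(candidates, key=lambda c: support_score(c, source), reverse=True)

# Invented example data.
source = "Q3 revenue fell 15 percent year on year to 8.5 million."
candidates = [
    "Revenue rose strongly in Q3.",
    "Q3 revenue fell 15 percent year on year.",
    "Revenue was flat.",
]
best = rerank(candidates, source)[0]
print(best)  # Q3 revenue fell 15 percent year on year.
```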
See the pattern?
Imprecise language (“hallucination,” “unreliable,” “random”) leads to imprecise diagnosis, which leads to wrong solutions, which leads to massive overspend, which leads to underdeployment of systems that could actually work brilliantly.
I call this the Brené Brown effect: her story will cause thousands of enterprises to conclude “AI isn’t ready for knowledge work” when the real lesson is “build proper retrieval infrastructure and verification protocols.”
What Changes Monday Morning
For your team implementing AI, here’s what needs to change immediately:
1. Know Your Modes (And Their Limitations)
Before starting any project, understand what each capability actually provides:
- Basic chat mode: Pattern matching with no external knowledge → Don’t use for factual queries requiring grounding
- Web search enabled: Retrieves external sources → Good for breadth, still requires verification
- Reasoning mode enabled: Adds internal self-verification steps → Good for complex logic, doesn’t guarantee factual correctness
- Retrieval-augmented: Grounds in your indexed corpus → Only as good as your indexing infrastructure
- Search + Reasoning: Both capabilities together → Best for complex research, but not magic
Critical understanding: Search and reasoning are orthogonal toggles. Some reasoning modes don’t browse; some browsing modes don’t run deeper verification. Wire both explicitly when you need both.
Ask before every project: “Which capabilities does this specific task require, and what infrastructure do we need to support them?”
2. Invest Engineering Time Upfront
The “five minutes whilst driving” approach that Brené’s team took guarantees failure. Full stop.
Proper setup for citation-grade work requires:
For RAG systems:
- PDF text extraction with normalisation (1-3 days)
- Vector indexing with metadata extraction (2-5 days)
- Retrieval pipeline with logging (1-2 days)
- Validation layer for generated citations (2-3 days)
- Testing and quality measurement (2-3 days)
For search-based systems:
- Mode configuration and testing (1 day)
- Domain allowlisting (1-2 days)
- DOI resolution and passage caching (2-3 days)
- Verification protocol definition (1 day)
- Second-pass validation workflow (1-2 days)
- Quality metric tracking (1 day)
Time investment: 1-3 weeks for the core build steps listed here, depending on corpus size and accuracy requirements; the full five-component production system described earlier takes 4-8 weeks
Return: Prevents the hollow feeling, the wasted work, and the fake citations. More importantly, produces a system you can actually trust and measure.
3. Run The Diagnostic
Before deploying any AI system, run this four-step test:
Test 1: Pin everything (model, prompt, temperature=0, fixed seed), run 10 times
→ Same wrong output every time? Type 1 error (learned wrong pattern)
Test 2: Simplify the query to its most basic form
→ Works simplified but fails in full context? Type 2 error (decoding issue)
Test 3: Sample the same query 20 times with different random seeds
→ Wildly different answers? Type 3 error (knowledge gap)
Test 4: Check if retrieval system returns different documents
→ Output changes only when retrieval changes? Type 0 error (environment drift)
This diagnostic tells you which solution you need and how much to budget. It takes 30 minutes and could save you millions in misallocated spending.
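The four tests can be folded into one small decision function. This is a sketch: the argument names are hypothetical, and the order of the checks (drift first, because it confounds the other three tests) is my assumption rather than a fixed rule.

```python
def diagnose(output_tracks_retrieval, same_wrong_answer_when_pinned,
             works_when_simplified, varies_across_seeds):
    """Map the four test observations to the error type they suggest."""
    if output_tracks_retrieval:           # Test 4
        return "Type 0: environment drift"
    if same_wrong_answer_when_pinned:     # Test 1
        return "Type 1: learned wrong pattern"
    if works_when_simplified:             # Test 2
        return "Type 2: decoding issue"
    if varies_across_seeds:               # Test 3
        return "Type 3: knowledge gap"
    return "No error type indicated"

# What the citation experiment would have shown: wildly different
# answers across seeds and nothing else.
print(diagnose(False, False, False, True))  # Type 3: knowledge gap
```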
4. Measure What Actually Matters
Stop measuring just “accuracy.” That single metric obscures more than it reveals.
Define accuracy precisely:
- Link existence: Does the URL resolve?
- Full-text access: Can you retrieve the complete paper?
- Quote fidelity: Does the quoted text actually appear on the cited page?
- Metadata correctness: Are author names, years, page numbers accurate?
Track these separately—they fail independently and require different fixes.
Also measure:
- Groundedness score: What percentage of output traces back to verified sources?
- Abstention rate: Does the AI admit uncertainty when appropriate, or does it fabricate?
- Replay rate: Can you reproduce the exact same result with the same inputs?
- Retrieval recall: Are relevant documents actually being retrieved from your corpus?
These metrics tell you whether you’re engineering a reliable system or hoping for magic.
5. Budget for the Real Costs
When planning AI deployments:
Initial setup costs:
- Basic chat with no grounding: £0 (but limited use cases)
- Search-based with manual verification: £15-30K (workflow setup + domain controls)
- Production RAG system: £75-170K (indexing, retrieval, validation, audit infrastructure)
- Enterprise-grade with high accuracy: £150-250K (includes monitoring, versioning, quality gates, ongoing maintenance setup)
Ongoing costs:
- API usage (model calls, typically £50-500/month depending on volume)
- Infrastructure (vector databases, search APIs, £100-1000/month)
- Human verification time (even good systems need spot-checking)
- Index maintenance and updates (quarterly or as corpus changes)
Cost of getting it wrong:
- Building Type 3 solutions for Type 1 problems: £1.5-2M wasted
- Rejecting AI entirely due to poor first experience: Opportunity cost in millions
- Deploying unreliable systems without verification: Reputational risk, potential regulatory exposure
The most expensive option is the one you build without proper diagnosis.
The Real Lesson
Let me bring us back to where we started.
That hollow feeling Brené described? It’s the feeling of using the wrong tool for the job and blaming the tool instead of the setup. It’s real, it’s valid, and it’s what happens when we ask AI to retrieve facts from a database that doesn’t exist.
Her broader point about human wisdom mattering more than ever is absolutely right. But perhaps not for the reason the story initially suggests.
We need human wisdom to engineer AI systems properly. To understand the difference between pattern matching and information retrieval. To build the infrastructure that connects models to verified sources. To establish verification protocols. To measure what actually matters. To invest the weeks of engineering work upfront that prevent the hollow outcomes downstream.
The graduate students in Brené’s experiment won because they had access to the library and knew how to use it. They knew where the catalogue was. They knew how to verify sources. They knew how to trace citations back to real papers. They had a system.
Give AI the same access—actual retrieval infrastructure, not just hopes and prompts—and it’s brilliant at the exact same task. Actually, it’s often better: it can cross-reference thousands of papers in seconds, surface connections you’d never find manually, and synthesise across domains with remarkable insight.
But you have to build the library first. You have to index the books. You have to implement the search system. You have to add the verification protocols. You have to log the execution manifests.
The model wasn’t broken. The infrastructure was missing entirely.
And that distinction is worth millions.