Outline: Why Brené Brown's AI Failed
Working Thesis (40 words): Brené Brown’s team got 60-70% fake citations from AI, not because AI is unreliable, but because they used the wrong tool for the job. This same misunderstanding is costing enterprises millions in wrong solutions to solvable problems.
I. Opening: The Hollow Feeling (300 words)
[Start with Nikki Belt’s full post verbatim]
Then the pivot: I read this and immediately thought: that hollow feeling Brené describes? It’s real. It’s valid. And it’s completely preventable.
But here’s what bothers me: the conclusion people will draw from this story. They’ll say “AI isn’t ready” or “AI hallucinates” or “you can’t trust AI for serious work.”
That’s not what this story shows.
What it actually shows: Brené’s team ran a literature review using a tool that was never designed for literature review, without the setup that would make it work, and got exactly the result you’d expect from that mismatch.
It’s like using a calculator without a decimal point, getting wrong answers, and concluding “calculators can’t be trusted for financial work.” The calculator isn’t broken. You’re using it wrong.
The uncomfortable truth: This same pattern—expecting AI to be something it isn’t, then calling it a failure when it behaves exactly as designed—is playing out across thousands of enterprises right now. And it’s expensive.
I’ve watched teams spend £2M building “anti-hallucination” infrastructure for problems that needed £100K in proper setup. I’ve seen procurement departments reject AI entirely because of stories exactly like Brené’s. I’ve seen CTOs over-invest in human verification for errors that are completely deterministic and fixable.
The promise: Let me show you what actually happened in Brené’s experiment, why it happened, and more importantly—how understanding the difference between four types of AI errors could completely change how you budget, deploy, and succeed with these systems.
Because the grad students didn’t win because they’re smarter. They won because they knew where the library was.
II. What Actually Happened (And Why It Matters) (400 words)
Intuition first (Karpathy style): Let’s backprop through what Brené’s team actually did.
They asked ChatGPT: “Give me academic citations about [topic]”
Here’s what ChatGPT heard: “Generate text that looks like academic citations”
Not “retrieve citations from a database.” Not “search my verified sources.” Generate text that matches the pattern of what citations look like.
The mental model most people have:
Query → AI searches knowledge base → Returns real citations
What’s actually happening:
Query → AI computes probability distribution over next words → Samples plausible-looking citation text
It’s the difference between asking someone to check the library catalogue versus asking them to describe what books might exist based on knowing how academic publishing works.
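The two mental models can be contrasted in a toy sketch. Everything here is hypothetical: the catalogue, the name pools and both function names are invented for illustration, and the "generator" is only a stand-in for a language model, not a description of how one is built.

```python
import random

# Hypothetical catalogue: the only references that "exist" in this toy world.
CATALOGUE = {
    "leadership": ["Example, A. (2015). A catalogued paper. Journal of Placeholders."],
}

def retrieve_citations(topic):
    """Lookup: returns only catalogued entries, or nothing at all."""
    return CATALOGUE.get(topic, [])

def generate_citation(topic, rng):
    """Pattern generation: assembles citation-shaped text with no lookup."""
    surname = rng.choice(["Smith", "Jones", "Garcia"])
    initial = rng.choice("ABCDE")
    year = rng.randint(1995, 2023)
    journal = rng.choice(["Journal of Placeholders", "Placeholder Review"])
    return f"{surname}, {initial}. ({year}). A study of {topic}. {journal}."

rng = random.Random(0)
found = retrieve_citations("vulnerability")
fake = generate_citation("vulnerability", rng)
print(found)  # [] : the lookup is able to say "nothing found"
print(fake)   # well formed, but backed by nothing
```

The lookup can come back empty; the generator never does, which is the whole problem.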
Here’s the crucial bit: ChatGPT is phenomenally good at knowing what citations should look like. It’s seen millions of academic papers. It knows:
- Author names follow certain distributions
- Years cluster around recent decades
- Journal names co-occur with certain topics
- Citation formats have specific structures
So when you ask for citations, it generates perfectly formatted, utterly convincing, completely fabricated references. Not because it’s “hallucinating” or “confused”—because it’s doing exactly what it was trained to do: produce text with high probability under the learned distribution.
The technical reality (gentle formalism): The model computes:
p(\text{next word} | \text{all previous words})
When you ask for “citations about leadership,” it’s computing:
p(\text{"Brown, B. (2019)"} | \text{"Citation about leadership:"})
And that probability is high—not because Brown, B. (2019) exists, but because that pattern exists millions of times in the training data.
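A rough way to see why frequency, not existence, drives that probability is to estimate it by counting. This is a deliberately crude sketch: the four-line corpus is invented, "Lee, C. (2004)" is a made-up placeholder, and a real model learns its distribution with a neural network rather than a counter.

```python
from collections import Counter

# Hypothetical toy "training data": citation-shaped strings, none checked against reality.
corpus = [
    "Citation about leadership: Brown, B. (2019)",
    "Citation about leadership: Brown, B. (2019)",
    "Citation about leadership: Brown, B. (2019)",
    "Citation about leadership: Lee, C. (2004)",
]

def next_text_probs(prefix, corpus):
    """Estimate p(continuation | prefix) purely from how often each continuation follows it."""
    continuations = [s[len(prefix):].strip() for s in corpus if s.startswith(prefix)]
    counts = Counter(continuations)
    total = sum(counts.values())
    return {text: n / total for text, n in counts.items()}

probs = next_text_probs("Citation about leadership:", corpus)
print(probs)
```

Nothing in the calculation asks whether either reference is real; the more often a pattern appears, the more probable it becomes.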
Why this matters for your business: This isn’t an AI failure. This is a resource allocation failure masquerading as an AI failure.
Brené’s team needed:
- Retrieval grounding (connect AI to actual document database)
- Verification loops (check citations against sources)
- Proper mode selection (use reasoning mode with web search, not basic chat)
- Setup time investment (the “couple hours upfront” that unlocks reliability)
Cost: Maybe £5K in proper setup time.
What this story will cause: Thousands of enterprises to reject AI for literature review, academic research, and knowledge work—tasks where properly configured AI is brilliant.
That’s the real cost of AI illiteracy.
III. The Four Types of “Failures” (And Why Only One Is Real) (450 words)
Reframe the problem: Brené’s team experienced what I call a Type 3 error—but let me show you why diagnosing which type matters.
When your AI “fails,” you’re seeing one of four completely different problems:
Type 0: Environment Drift (Wrong Input)
Example: Your RAG system pulls different documents each time you run the same query
- The AI produces a perfect analysis… of different source material
- Nothing’s broken; your “database” moved
- The fix: Snapshot your retrieval indices, version your context
- The cost: £30-80K infrastructure work
- Brené’s case: Not applicable—she had no document database connected at all
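One inexpensive way to detect Type 0 drift is to fingerprint the retrieved context on every run and compare. The sketch below assumes retrieval returns a list of (doc_id, text) pairs; that shape, the function name and the sample documents are illustrative assumptions, not part of any particular RAG product.

```python
import hashlib
import json

def context_fingerprint(documents):
    """Hash the retrieved documents so two runs can be compared exactly.

    `documents` is a list of (doc_id, text) pairs. Order is included in the
    hash because the order of context can change what the model produces.
    """
    payload = json.dumps(documents, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical retrieval results for the same query on two different days.
run_a = [("doc-1", "Q3 revenue summary"), ("doc-2", "Q3 cost summary")]
run_b = [("doc-1", "Q3 revenue summary"), ("doc-3", "Q2 cost summary")]

same_input = context_fingerprint(run_a) == context_fingerprint(run_b)
print("same retrieved context:", same_input)
```

Logging the fingerprint next to each answer makes "the model changed its mind" distinguishable from "the input changed".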
Type 1: Learned Wrong Pattern (Deterministic Error)
Example: Model thinks all revenue growth is positive, even when it’s -15%
- Run the same query 10 times with temperature=0, get same wrong answer
- The model genuinely believes this pattern
- Not “random”—consistently incorrect
- The fix: Better prompting, add examples, validation checks
- The cost: £50-200K in prompt engineering
- Brené’s case: Partially applicable—model learned citation patterns, not citation facts
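Because a Type 1 error is consistent, a small deterministic check can catch it. The sketch below uses the revenue example and assumes the model returns a structured output with a figure and a claimed direction; the field names are hypothetical.

```python
def growth_direction(growth_pct):
    """Derive the direction from the number itself, never from the model's wording."""
    if growth_pct > 0:
        return "up"
    if growth_pct < 0:
        return "down"
    return "flat"

def check_growth_claim(claimed_direction, growth_pct):
    """Return True only when the model's claimed direction agrees with the figure."""
    return claimed_direction == growth_direction(growth_pct)

# Hypothetical model output: the figure is right but the description is not.
model_output = {"growth_pct": -15.0, "claimed_direction": "up"}
is_consistent = check_growth_claim(
    model_output["claimed_direction"], model_output["growth_pct"]
)
print("claim consistent with figure:", is_consistent)
```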
Type 2: Bad Decision Strategy (Model Knows, Decode Fails)
Example: Ask AI a complex question and it fails; simplify the question and it succeeds
- The model has the right answer in its probability distribution
- Your sampling policy or context structure isn’t extracting it
- The fix: Lower temperature, generate multiple candidates and rerank, step-by-step prompting
- The cost: £30-100K sampling improvements
- Brené’s case: Not applicable—model doesn’t have the citations in its weights
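"Generate multiple candidates and rerank" can take many forms. The simplest is a majority vote over sampled answers, sketched below; it is one possible reranker, not necessarily the one any given team should use, and the candidate list is invented.

```python
from collections import Counter

def majority_answer(candidates):
    """Pick the most frequent answer after light normalisation; return it with its vote share."""
    normalised = [c.strip().lower() for c in candidates]
    answer, votes = Counter(normalised).most_common(1)[0]
    return answer, votes / len(normalised)

# Hypothetical answers from five samples of the same question.
candidates = ["42%", "42%", "17%", " 42% ", "38%"]
answer, share = majority_answer(candidates)
print(answer, share)
```

A low vote share is itself a signal: if no answer dominates, the problem may be a Type 3 knowledge gap rather than a decoding issue.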
Type 3: True Knowledge Gap (Genuine Hallucination)
Example: Ask about specific Q3 2024 financial data the model never saw
- Sample 20 times with different seeds = wildly different fabricated answers
- High uncertainty = flat probability distribution across many options
- Model doesn’t know, so it generates plausible-sounding content
- The fix: Retrieval grounding, forced abstention, human verification
- The cost: £500K-2M in architectural changes
- Brené’s case: THIS IS WHAT HAPPENED
The diagnostic that would have saved Brené’s team:
- Generate same citation request 20 times with different random seeds
- Get wildly different author names, years, journals each time
- Conclusion: High uncertainty region—model doesn’t have this knowledge
- Correct response: Connect to actual citation database OR use web search mode OR verify manually
They would have known in 5 minutes this wasn’t going to work.
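The seed-variation diagnostic is easy to automate. In the sketch below, `generate` is a placeholder for whatever function calls your model with a given seed; both stand-in models are invented so the example runs without any API access.

```python
import random

def disagreement_rate(generate, prompt, n=20):
    """Fraction of distinct outputs across n seeded samples (1/n means fully consistent)."""
    outputs = {generate(prompt, seed) for seed in range(n)}
    return len(outputs) / n

def ungrounded_model(prompt, seed):
    """Hypothetical stand-in for an ungrounded model: invents a citation per seed."""
    rng = random.Random(seed)
    surname = rng.choice(["Smith", "Jones", "Garcia", "Chen"])
    return f"{surname}, {rng.choice('ABCD')}. ({rng.randint(1990, 2023)})"

def grounded_model(prompt, seed):
    """Hypothetical stand-in for a grounded system: same source, same citation."""
    return "Example, A. (2015)"

high = disagreement_rate(ungrounded_model, "citations about leadership")
low = disagreement_rate(grounded_model, "citations about leadership")
print(high, low)
```

A rate near 1.0 means almost every sample disagrees, which points to a knowledge gap; a rate of 1/n means the system gives the same answer every time.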
The brutal truth: Brené’s team experienced the ONE type of error that actually needs expensive infrastructure (Type 3: knowledge gap). But because the industry calls everything “hallucination,” most enterprises are building Type 3 solutions for Type 1 and 2 problems.
The cost? Millions in misallocated budget.
IV. What They Should Have Done (The Proper Setup) (350 words)
Here’s the architecture for literature review that actually works:
Option 1: Retrieval-Augmented Generation (The Right Way)
1. Upload your existing literature database to Google Drive
2. Connect ChatGPT to Google Drive (built-in feature)
3. Prompt: "Search my Drive for papers about [topic],
cite only from documents you find"
4. Model retrieves actual papers → generates citations from real sources
Setup time: 2-3 hours
Reliability: 95%+ accuracy (citations trace to real documents)
Cost: Essentially free beyond existing ChatGPT subscription
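Under the hood, Option 1 amounts to "find the documents first, then tell the model to cite only those". The sketch below shows that shape with a toy keyword matcher; it is not how the ChatGPT Drive connector works internally, and the library contents and function names are made up.

```python
def retrieve(query, library, top_k=2):
    """Rank documents by how many query words they contain (a toy stand-in for real search)."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(text.lower().split())), doc_id)
        for doc_id, text in library.items()
    ]
    hits = [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]
    return hits[:top_k]

def build_grounded_prompt(query, library):
    """Put the retrieved text in the prompt and restrict citations to it."""
    doc_ids = retrieve(query, library)
    sources = "\n".join(f"[{d}] {library[d]}" for d in doc_ids)
    return (
        "Answer using only the sources below and cite them by id. "
        "If they do not cover the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

library = {  # hypothetical documents
    "paper-1": "shame resilience and vulnerability in leadership teams",
    "paper-2": "quarterly revenue forecasting methods",
}
prompt = build_grounded_prompt("vulnerability in leadership", library)
print(prompt)
```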
Option 2: Web-Search Mode (For Fresh Research)
1. Use reasoning mode (deep research) or Perplexity with sources enabled
2. Prompt: "Find recent academic papers about [topic],
provide full citations with links"
3. Model searches actual databases → returns verifiable citations
4. Verify links click through to real papers
Setup time: 30 minutes to learn mode selection
Reliability: 90%+ (citations are real, links are checkable)
Cost: Perplexity Pro ~£20/month, or ChatGPT reasoning mode
Option 3: Constrained Generation (For Known Corpus)
1. Create bibliography of approved sources
2. Prompt: "Cite only from this list: [approved citations]"
3. Add validation: "After citing, confirm the citation exists in the provided list"
Setup time: 1-2 hours
Reliability: Near 100%, provided every citation is checked against the list (the prompt alone does not stop the model citing outside it)
Cost: Zero beyond base usage
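The validation step in Option 3 is more dependable when it runs outside the model, as a plain string comparison. A minimal sketch, assuming citations arrive as a list of strings; the sample entries are placeholders, not real references.

```python
def unapproved_citations(cited, approved):
    """Return citations that do not appear in the approved list (whitespace and case ignored)."""
    def norm(text):
        return " ".join(text.lower().split())

    allowed = {norm(a) for a in approved}
    return [c for c in cited if norm(c) not in allowed]

# Placeholder bibliography and a hypothetical model response.
approved = ["Example, A. (2015). Paper one.", "Sample, B. (2018). Paper two."]
cited = ["Example, A. (2015).  Paper one.", "Invented, C. (2021). Paper three."]

rejected = unapproved_citations(cited, approved)
print(rejected)
```

Anything in `rejected` is either removed or sent back to the model; nothing outside the list reaches the final document.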
What Brené’s team actually did:
Query: "Give me citations about leadership and vulnerability"
Mode: Basic chat (no retrieval, no search, no grounding)
Validation: None
It’s like asking someone to list books from your library when you haven’t told them where your library is, what books are in it, or given them access to go look.
The setup cost: 2-3 hours of configuration
The actual cost they paid: Hollow feeling, fake citations, wasted team time, and now a cautionary tale spreading across social media
For enterprise decision-makers: This isn’t about AI capability. It’s about interface design and operational discipline.
V. Why This Pattern Is Costing Enterprises Millions (300 words)
The cycle of misdiagnosis:
- Wrong expectations: Team expects AI to be a database
- No setup: Doesn’t connect retrieval or enable search
- Predictable failure: Gets Type 3 error (knowledge gap)
- Wrong conclusion: “AI can’t be trusted for serious work”
- Expensive overreaction: Builds comprehensive human verification for everything
- Actual cost: £2M in human-in-loop systems when £5K setup would have worked
Real example from the field: Legal team wanted AI to cite relevant case law. Ran experiment like Brené’s. Got fake case citations. Concluded “AI isn’t ready for legal work.” Built £1.5M human verification system.
What they needed: Connect AI to their existing Westlaw database (£10K integration) and add citation verification to their existing workflow (£50K workflow modification).
They’re now paying lawyers to verify AI outputs that could have been grounded from the start.
Another example: Finance team wanted AI to analyse quarterly reports. Got inconsistent answers. Called it “hallucination.” Built expensive retrieval architecture.
Actual problem: Type 2 error (bad sampling strategy). Model knew the answer but temperature was too high. Fix: Lower temperature from 0.9 to 0.2, generate 5 candidates and rerank. Cost: £30K to implement reranking infrastructure.
They spent £800K on retrieval they didn’t need.
The pattern:
- Imprecise language (“hallucination,” “unreliable,” “random”)
- Leads to imprecise diagnosis
- Leads to wrong solutions
- Leads to massive overspend
- Leads to underdeployment of systems that could actually work
The Brené Brown effect: Her story will cause thousands of enterprises to conclude “AI isn’t ready for knowledge work” when the real lesson is “know which mode you’re running and set it up properly.”
VI. What Changes Monday Morning (300 words)
For your team implementing AI:
1. Know Your Modes
- Basic chat: Pattern matching, no external knowledge → Don’t use for facts
- With retrieval: Grounded in your documents → Use for internal knowledge
- With search: Grounded in web results → Use for current facts
- Reasoning mode: Deep analysis with verification → Use for complex research
Ask before every project: “Which mode does this task need?”
2. Invest Setup Time Upfront
The “5 minutes while driving” approach that Brené’s team took guarantees failure.
Proper setup for knowledge work:
- Connect to your document repositories
- Enable web search if needed
- Configure verification loops
- Set temperature appropriately (0.2 for facts, 0.7 for creativity)
- Test with diagnostic queries
Time investment: 2-4 hours
Return: Prevents the hollow feeling, the wasted work, the fake citations
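Why temperature matters can be seen by applying it to a small set of scores. The sketch below is the standard temperature-scaled softmax; the three logits are invented, and real models apply this over a vocabulary of many thousands of tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
p_facts = softmax_with_temperature(logits, 0.2)
p_creative = softmax_with_temperature(logits, 0.7)
print([round(p, 3) for p in p_facts])
print([round(p, 3) for p in p_creative])
```

At 0.2 almost all the probability sits on the top-scoring token, so repeated runs agree; at 0.7 the alternatives get a real share, which helps variety and hurts reproducibility.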
3. Run The Diagnostic
Before deploying any AI system:
Test 1: Pin everything (temperature=0, fixed seed and context), run 10x → Same wrong output? Type 1 error
Test 2: Simplify query → Works? Type 2 error
Test 3: Sample 20x → Wildly different? Type 3 error
Test 4: Check if retrieval changes → Output changes? Type 0 error
This tells you which solution you need and how much to budget.
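The four tests can be folded into a single triage function. The order of the checks below is an assumption (inputs first, then decoding, then knowledge, then learned pattern), chosen so that cheaper explanations are ruled out before expensive ones; the function and argument names are illustrative.

```python
def diagnose(retrieval_changed, fixed_by_simpler_query,
             varies_across_seeds, same_wrong_when_pinned):
    """Map the four test outcomes to an error type, checking inputs before the model."""
    if retrieval_changed:
        return "Type 0: environment drift"
    if fixed_by_simpler_query:
        return "Type 2: bad decision strategy"
    if varies_across_seeds:
        return "Type 3: true knowledge gap"
    if same_wrong_when_pinned:
        return "Type 1: learned wrong pattern"
    return "No error type identified"

# The citation experiment: no retrieval, and answers that change with every seed.
result = diagnose(
    retrieval_changed=False,
    fixed_by_simpler_query=False,
    varies_across_seeds=True,
    same_wrong_when_pinned=False,
)
print(result)
```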
4. Measure What Matters
Stop measuring just “accuracy.”
Track:
- Groundedness: Does output trace to verified sources?
- Abstention rate: Does AI admit uncertainty appropriately?
- Replay rate: Can you reproduce the result?
- Setup compliance: Did team follow configuration checklist?
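These four measures are simple ratios once each run is logged. The sketch below assumes one record per query with four boolean fields; the field names and the sample records are hypothetical, and groundedness is computed only over answers the system actually gave.

```python
def summarise(records):
    """Compute groundedness, abstention, replay and compliance rates from run records."""
    n = len(records)
    answered = [r for r in records if not r["abstained"]]
    grounded = sum(r["sources_verified"] for r in answered)
    return {
        "groundedness": grounded / len(answered) if answered else 0.0,
        "abstention_rate": (n - len(answered)) / n,
        "replay_rate": sum(r["reproduced"] for r in records) / n,
        "setup_compliance": sum(r["checklist_followed"] for r in records) / n,
    }

records = [  # hypothetical log of four runs
    {"abstained": False, "sources_verified": True, "reproduced": True, "checklist_followed": True},
    {"abstained": False, "sources_verified": True, "reproduced": True, "checklist_followed": True},
    {"abstained": False, "sources_verified": False, "reproduced": False, "checklist_followed": False},
    {"abstained": True, "sources_verified": False, "reproduced": True, "checklist_followed": True},
]
metrics = summarise(records)
print(metrics)
```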
5. Educate Your Procurement
When evaluating AI vendors, ask:
- “What retrieval options are included?”
- “Can we connect to our document repositories?”
- “What verification loops exist?”
- “How do we prevent the Brené Brown scenario?”
You’re not buying a model. You’re buying a system with proper grounding.
VII. Closing: The Real Lesson (100 words)
Circle back to Brené:
That hollow feeling she described? It’s the feeling of using the wrong tool for the job and blaming the tool instead of the setup.
Her point about human wisdom mattering more than ever is absolutely right. But not for the reason the story suggests.
We need human wisdom to set up AI properly. To know which mode to use. To connect it to real sources. To verify outputs. To invest the couple hours that prevent the hollow outcomes.
The grad students won because they had access to the library and knew how to use it.
Give AI the same access and the same guidance, and it’s brilliant at the exact same task.
Final line: The model wasn’t broken. The interface was. And that distinction is worth millions.
Word count allocation:
- Opening (with Nikki’s post): 300
- What actually happened: 400
- Four error types: 450
- Proper setup: 350
- Enterprise cost pattern: 300
- Monday morning actions: 300
- Closing: 100
Total: ~2200 words (slightly over but trimmable)
Karpathy style elements:
✓ Starts with real, relatable story
✓ Intuition before any formalism
✓ “Let’s backprop through…” construction
✓ Gentle technical explanation (probability distributions, not transformer architecture)
✓ Practical analogies (library, calculator)
✓ Clear diagnostic frameworks
✓ Monday-morning actionability
✓ Respects the emotional truth (hollow feeling) while correcting technical misunderstanding
✓ Enterprise decision-maker framing (budget, ROI, procurement)
✓ Ends with sharp, memorable takeaway