Why Brené Brown's AI Failed V1

I recently saw a great post from Nikki Belt, who had watched Brené Brown share a pretty wild story about AI. There is a widespread, fundamental misunderstanding of how AI works, and a lot of confident assertions rest on poor AI literacy.

Below is Nikki’s post. I will follow it with my analysis of what is happening in the global market at the moment:

While writing her new book, Strong Ground (please add to your reading list right now!) she and her team ran a 60-day AI experiment. Her team used ChatGPT and Perplexity to do a literature review. In parallel, she had graduate students doing the same work manually (the OG way).

The results? 60-70% of the AI-generated academic citations were completely fake. Perfect formatting. Compelling quotes. Zero existence.

“They were SO legit sounding,” she said, “but they didn’t exist.”

Brené described feeling hollow afterward. Like her team had been in an “AI haze.” Not energised. Not productive. Hollow.

She wasn’t telling this story to scare us away from AI. She was making a different point entirely: In a world where AI is seeping into every corner of our work, human wisdom matters more than ever.

Trust building. Empathy. Real connection. These aren’t soft skills anymore—they’re the skills that keep your business from becoming a shell of what it could be.

This is exactly what I see playing out with the business owners I work with. The question isn’t “should we use AI?” It’s “how do we use it?”

Are you chasing efficiency for efficiency’s sake? Or are you thinking about how these tools help your team do more meaningful work, serve customers better, and make more thoughtful decisions?

Because here’s the truth: AI can help you scale. But if you’re not careful, it can also hollow out what makes your business special. You have a choice in how you adopt this technology. Don’t let it strip away the human parts that matter most.

I read this and immediately thought: that hollow feeling Brené describes? It’s real. It’s valid. And it’s completely preventable.

But here’s what bothers me: the conclusion people will draw from this story. They’ll say “AI isn’t ready” or “AI hallucinates” or “you can’t trust AI for serious work.”

That’s not what this story shows.

What it actually shows is this: Brené’s team ran a literature review using a tool that was never designed for literature review, without the setup that would make it work, and got exactly the result you’d expect from that mismatch.

It’s like using a calculator without a decimal point, getting wrong answers, and concluding “calculators can’t be trusted for financial work.” The calculator isn’t broken. You’re using it wrong.

The uncomfortable truth? This same pattern—expecting AI to be something it isn’t, then calling it a failure when it behaves exactly as designed—is playing out across thousands of enterprises right now. And it’s expensive.

I’ve watched teams spend £2M building “anti-hallucination” infrastructure for problems that needed £100K in proper setup. I’ve seen procurement departments reject AI entirely because of stories exactly like Brené’s. I’ve seen CTOs over-invest in human verification for errors that are completely deterministic and fixable.

Let me show you what actually happened in Brené’s experiment, why it happened, and more importantly—how understanding the difference between four types of AI errors could completely change how you budget, deploy, and succeed with these systems.

Because the graduate students didn’t win because they’re smarter. They won because they knew where the library was.

What Actually Happened (And Why It Matters)

Let’s backprop through what Brené’s team actually did.

They asked ChatGPT: “Give me academic citations about [topic]”

Here’s what ChatGPT heard: “Generate text that looks like academic citations”

Not “retrieve citations from a database.” Not “search my verified sources.” Generate text that matches the pattern of what citations look like.

The mental model most people have:

Query → AI searches knowledge base → Returns real citations

What’s actually happening:

Query → AI computes probability distribution over next words →
Samples plausible-looking citation text

It’s the difference between asking someone to check the library catalogue versus asking them to describe what books might exist based on knowing how academic publishing works.

Here’s the crucial bit: ChatGPT is phenomenally good at knowing what citations should look like. It’s seen millions of academic papers during training. It knows that author names follow certain distributions, that years cluster around recent decades, that journal names co-occur with certain topics, that citation formats have specific structures.

So when you ask for citations, it generates perfectly formatted, utterly convincing, completely fabricated references. Not because it’s “hallucinating” or “confused”—because it’s doing exactly what it was trained to do: produce text with high probability under the learned distribution.

Think of it like this: if I asked you to invent a plausible academic paper about leadership without checking any sources, you could probably generate something like “Johnson, M. & Williams, S. (2019). Transformational Leadership in Modern Organizations. Journal of Organizational Behaviour, 45(3), 234-251.” It sounds right. The format is correct. The journal name is plausible. The year is reasonable. But you just made it up based on your knowledge of what academic citations look like.

That’s exactly what the model is doing—at a much more sophisticated level.

The technical reality is that the model computes p(next token | all previous tokens). When you ask for “citations about leadership,” it’s computing something like p("Brown, B. (2019)" | "Citation about leadership:"), and that probability is high. Not because Brown, B. (2019) exists, but because the pattern of an author name followed by a year appears millions of times in the training data.
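To make that concrete, here is a toy sketch of pattern-based generation. It is not how ChatGPT is implemented; it simply samples each part of a citation from lists of plausible values, which is enough to show why the output looks right without anything checking that it exists. Every author, title, and journal name in the lists is an invented placeholder.

```python
import random

# Toy "learned distribution": surface patterns only, no lookup of real papers.
# All names, titles, and journals below are invented for illustration.
AUTHORS = ["Johnson, M.", "Williams, S.", "Chen, L.", "Patel, R."]
TITLES = [
    "Transformational Leadership in Modern Organizations",
    "Vulnerability and Trust in Teams",
    "Courage at Work",
]
JOURNALS = ["Journal of Organizational Behaviour", "Journal of Workplace Courage"]


def sample_citation(rng: random.Random) -> str:
    """Assemble a citation from plausible parts; nothing checks that it exists."""
    first, second = rng.sample(AUTHORS, 2)
    year = rng.randint(2005, 2023)
    title = rng.choice(TITLES)
    journal = rng.choice(JOURNALS)
    volume, issue = rng.randint(10, 60), rng.randint(1, 4)
    start_page = rng.randint(1, 400)
    end_page = start_page + rng.randint(8, 25)
    return (
        f"{first} & {second} ({year}). {title}. "
        f"{journal}, {volume}({issue}), {start_page}-{end_page}."
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_citation(rng))
```

Every line it prints is well formed, and none of it came from a catalogue. A real model does the same thing with far richer statistics.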

Why does this matter for your business? Because this isn’t an AI failure. This is a resource allocation failure masquerading as an AI failure.

Brené’s team needed four things:

Retrieval grounding: connect the AI to an actual document database
Verification loops: check citations against sources
Proper mode selection: use reasoning mode with web search, not basic chat
Setup time investment: the “couple hours upfront” that unlocks reliability

Cost: Maybe £5,000 in proper setup time.

What this story will cause: Thousands of enterprises to reject AI for literature review, academic research, and knowledge work—tasks where properly configured AI is brilliant.

That’s the real cost of AI illiteracy.

The Four Types of “Failures” (And Why Only One Is Real)

Brené’s team experienced what I call a Type 3 error—but let me show you why diagnosing which type you’re facing matters enormously.

When your AI “fails,” you’re actually seeing one of four completely different problems. And only one of them needs the expensive solutions most teams are building.

Type 0: Environment Drift (Wrong Input)

Imagine your RAG system pulls different documents each time you run the same query. The AI produces a perfect analysis… of different source material. Nothing’s broken; your “database” moved underneath you.

The fix: Snapshot your retrieval indices, version your context, log exactly what documents were accessed.
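One lightweight way to log exactly what was accessed is to fingerprint the retrieved documents on every run. The sketch below is a minimal illustration, assuming each retrieved document is a dict with hypothetical `id` and `text` keys; the function names are mine, not from any particular RAG framework.

```python
import hashlib
import json


def context_fingerprint(documents: list[dict]) -> str:
    """Return a stable hash of the retrieved documents (ids plus content)."""
    canonical = sorted(
        (
            {
                "id": d["id"],
                "sha256": hashlib.sha256(d["text"].encode("utf-8")).hexdigest(),
            }
            for d in documents
        ),
        key=lambda item: item["id"],
    )
    payload = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def log_retrieval(query: str, documents: list[dict]) -> dict:
    """Build a log record tying a query to exactly what was retrieved."""
    return {
        "query": query,
        "doc_ids": sorted(d["id"] for d in documents),
        "fingerprint": context_fingerprint(documents),
    }
```

If two runs of the same query produce different fingerprints, the input changed, not the model. This version ignores document order on purpose; if the order in which documents reach the model matters in your system, drop the sort.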

The cost: £30-80K in infrastructure work.

Brené’s case: Not applicable—she had no document database connected at all.

Type 1: Learned Wrong Pattern (Deterministic Error)

The model thinks all revenue growth is positive news, even when it’s -15%. Run the same query 10 times with temperature set to zero, and you get the same wrong answer every single time. The model genuinely believes this pattern. It’s not being “random”—it’s being consistently incorrect.

The fix: Better prompting with examples, add validation checks, implement step-by-step verification.

The cost: £50-200K in prompt engineering and validation infrastructure.

Brené’s case: Partially applicable—the model learned citation patterns, not citation facts.

Type 2: Bad Decision Strategy (Model Knows, Decoding Fails)

Ask the AI a complex question and it fails. Simplify the exact same question and it succeeds. The model has the right answer somewhere in its probability distribution, but your sampling policy or context structure isn’t extracting it.

The fix: Lower temperature, generate multiple candidates and rerank them, use step-by-step prompting techniques.

The cost: £30-100K in sampling infrastructure improvements.

Brené’s case: Not applicable—the model doesn’t have the specific citations in its weights at all.
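To see why lowering temperature helps with a Type 2 error, here is a minimal sketch of temperature-scaled softmax, the standard way sampling temperature reshapes a next-token distribution. The logits in the usage example are made-up numbers, not taken from any real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
    for t in (0.9, 0.2):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])
```

At the higher temperature the weaker candidates keep a meaningful share of the probability, so they get sampled fairly often. At the lower temperature almost all of the mass sits on the top candidate. If the model's best guess is right, that alone fixes a lot of inconsistency.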

Type 3: True Knowledge Gap (Genuine Hallucination)

Ask about specific Q3 2024 financial data the model never saw during training. Sample 20 times with different random seeds and you get wildly different fabricated answers. High uncertainty equals a flat probability distribution across many options. The model doesn’t know, so it generates plausible-sounding content to fill the gap.

The fix: Retrieval grounding, forced abstention when uncertain, human verification loops.

The cost: £500K-2M in architectural changes.

Brené’s case: THIS IS WHAT HAPPENED.

Here’s the diagnostic that would have saved Brené’s team weeks of work:

Generate the same citation request 20 times with different random seeds
Observe wildly different author names, years, and journals each time
Conclusion: High uncertainty region—the model doesn’t have this knowledge in its weights
Correct response: Connect to an actual citation database OR enable web search mode OR verify manually

They would have known in five minutes this approach wasn’t going to work.
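The diagnostic above can be sketched in a few lines. This is an illustration under assumptions: `generate` is a hypothetical wrapper around whatever model call you use, taking a seed and returning the text, and the 0.5 threshold is an arbitrary starting point rather than a calibrated value.

```python
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[int], str], n_samples: int = 20) -> float:
    """Fraction of samples that match the most common answer."""
    answers = [generate(seed).strip().lower() for seed in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_samples


def looks_like_knowledge_gap(
    generate: Callable[[int], str], n_samples: int = 20, threshold: float = 0.5
) -> bool:
    """Low agreement across seeds suggests the model is guessing (Type 3)."""
    return agreement_rate(generate, n_samples) < threshold
```

Exact string matching is deliberately crude. For citations you would probably compare extracted fields (author, year, journal) rather than whole strings, but the principle is the same: if the answers scatter, the knowledge is not in the weights.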

The brutal truth is that Brené’s team experienced the ONE type of error that actually needs expensive infrastructure (Type 3: knowledge gap). But because the industry calls everything “hallucination,” most enterprises are building Type 3 solutions for Type 1 and 2 problems.

In my experience across enterprise deployments, roughly 70% of reported “hallucinations” are actually Types 0, 1, or 2—problems with specific, much cheaper solutions. But the imprecise language prevents proper diagnosis, which leads to massive budget misallocation.

What They Should Have Done (The Proper Setup)

Here’s the architecture for literature review that actually works:

Option 1: Retrieval-Augmented Generation (The Right Way)

The setup:

Upload your existing literature database to Google Drive
Connect ChatGPT to Google Drive (this is a built-in feature)
Prompt: “Search my Drive for papers about [topic], cite only from documents you find”
The model retrieves actual papers, then generates citations from real sources

Setup time: 2-3 hours
Reliability: 95%+ accuracy (every citation traces to a real document)
Cost: Essentially free beyond existing ChatGPT subscription
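If you are building this yourself rather than using a built-in connector, the retrieve-then-generate pattern looks roughly like the sketch below. It is a deliberately simple stand-in: retrieval here is plain word overlap, where a production system would more likely use embedding search, and the `library` dict, function names, and prompt wording are all my own assumptions.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,;:()'\"").lower() for w in text.split())
    return {w for w in words if w}


def retrieve(query: str, library: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank document ids by word overlap with the query; drop zero-overlap docs."""
    q = tokenize(query)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in library.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


def build_grounded_prompt(query: str, library: dict[str, str], top_k: int = 3) -> str:
    """Put only retrieved documents in front of the model, with an abstain path."""
    doc_ids = retrieve(query, library, top_k)
    if not doc_ids:
        return f"No documents matched '{query}'. Reply: 'No sources found.'"
    sources = "\n".join(f"[{d}] {library[d]}" for d in doc_ids)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```

The important design choice is the empty case. When nothing matches, the prompt tells the model to say so instead of leaving it free to invent something.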

Option 2: Web-Search Mode (For Fresh Research)

The setup:

Use reasoning mode (for deep research) or Perplexity with sources enabled
Prompt: “Find recent academic papers about [topic], provide full citations with links”
The model searches actual databases, returns verifiable citations
Verify that links click through to real papers

Setup time: 30 minutes to learn proper mode selection
Reliability: 90%+ (citations are real, links are checkable)
Cost: Perplexity Pro ~£20/month, or ChatGPT reasoning mode

Option 3: Constrained Generation (For Known Corpus)

The setup:

Create a bibliography of approved sources
Prompt: “Cite only from this list: [approved citations]”
Add validation: “After citing, confirm the citation exists in the provided list”

Setup time: 1-2 hours
Reliability: Close to 100%, provided you check each returned citation against the list (the prompt alone does not guarantee the model stays within it)
Cost: Zero beyond base usage
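The validation step is easy to automate outside the model. Here is a minimal sketch, assuming you can extract the citations from the model's answer as a list of strings; the function names and the normalisation rule (case, whitespace, trailing full stop) are my own choices.

```python
def normalise(citation: str) -> str:
    """Collapse whitespace, lowercase, and drop a trailing full stop."""
    return " ".join(citation.lower().split()).rstrip(".")


def check_citations(cited: list[str], approved: list[str]) -> dict[str, list[str]]:
    """Split the model's citations into those on the approved list and those not."""
    approved_set = {normalise(c) for c in approved}
    result: dict[str, list[str]] = {"verified": [], "unverified": []}
    for citation in cited:
        key = "verified" if normalise(citation) in approved_set else "unverified"
        result[key].append(citation)
    return result
```

Anything that lands in `unverified` gets rejected or sent to a human. That check, not the prompt, is what makes the constraint hold.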

What Brené’s team actually did:

Query: "Give me citations about leadership and vulnerability"
Mode: Basic chat (no retrieval, no search, no grounding)
Validation: None
Result: Fabricated but plausible-looking citations

It’s like asking someone to list books from your library when you haven’t told them where your library is, what books are in it, or given them access to go look.

The setup cost to avoid this? Two to three hours of configuration.

The actual cost they paid? A hollow feeling, fake citations, wasted team time, and now a cautionary tale spreading across social media that will cause thousands of teams to avoid AI for tasks where it could be transformative.

For enterprise decision-makers, understand this: the problem wasn’t AI capability. It was interface design and operational discipline.

Why This Pattern Is Costing Enterprises Millions

Let me show you the cycle of misdiagnosis I see repeatedly:

Wrong expectations: Team expects AI to function as a database
No setup: Doesn’t connect retrieval systems or enable search capabilities
Predictable failure: Gets Type 3 error (genuine knowledge gap)
Wrong conclusion: “AI can’t be trusted for serious work”
Expensive overreaction: Builds comprehensive human verification for everything
Actual cost: £2M in human-in-the-loop systems when £5K in setup would have worked

Here’s a real example from a legal team I consulted with. They wanted AI to cite relevant case law. They ran an experiment similar to Brené’s. Got fake case citations. Concluded “AI isn’t ready for legal work.” Built a £1.5M human verification system.

What they actually needed? Connect their AI to their existing Westlaw database (£10K integration) and add citation verification to their existing workflow (£50K workflow modification). They’re now paying lawyers to verify AI outputs that could have been grounded in verified sources from the start.

Another example from finance: A team wanted AI to analyse quarterly reports. Got inconsistent answers between runs. Called it “hallucination.” Built an expensive retrieval architecture.

The actual problem? Type 2 error—bad sampling strategy. The model knew the answer, but the temperature was set too high (0.9) causing excessive randomness in token selection. The fix: Lower temperature to 0.2, generate five candidates and rerank them by consistency with source documents. Cost to implement proper reranking infrastructure: £30K.

They spent £800K on retrieval architecture they didn’t need.
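For a sense of scale, here is roughly what “generate candidates and rerank by consistency with the source” means. This is a toy sketch, not the team's implementation: the support score is simple word overlap, which I am using as a stand-in for whatever consistency measure you trust, and the example sentences are invented.

```python
def _words(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:()'\"").lower() for w in text.split())
    return {w for w in stripped if w}


def support_score(candidate: str, source: str) -> float:
    """Share of the candidate's words that also appear in the source text."""
    cand_words = _words(candidate)
    if not cand_words:
        return 0.0
    return len(cand_words & _words(source)) / len(cand_words)


def rerank(candidates: list[str], source: str) -> list[str]:
    """Order candidates from most to least supported by the source."""
    return sorted(candidates, key=lambda c: support_score(c, source), reverse=True)


if __name__ == "__main__":
    source = "Revenue fell 15% in Q3 while costs rose 4%."
    candidates = ["Revenue grew strongly in Q3", "Revenue fell 15% in Q3"]
    print(rerank(candidates, source)[0])
```

The candidate that sticks to what the report says rises to the top. Word overlap will miss paraphrases and negations, so treat this as the shape of the solution, not the finished article.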

See the pattern?

Imprecise language (“hallucination,” “unreliable,” “random”) leads to imprecise diagnosis, which leads to wrong solutions, which leads to massive overspend, which leads to underdeployment of systems that could actually work brilliantly.

I call this the Brené Brown effect: her story will cause thousands of enterprises to conclude “AI isn’t ready for knowledge work” when the real lesson is “know which mode you’re running and set it up properly.”

What Changes Monday Morning

For your team implementing AI, here’s what needs to change immediately:

1. Know Your Modes

Before starting any project, understand what each mode actually does:

Basic chat mode: Pattern matching with no external knowledge → Don’t use for factual queries
With retrieval enabled: Grounded in your uploaded documents → Use for internal knowledge work
With web search enabled: Grounded in current web results → Use for recent facts and research
Reasoning mode: Deep analysis with internal verification steps → Use for complex research requiring chain-of-thought

Ask before every project: “Which mode does this specific task require?”

2. Invest Setup Time Upfront

The “five minutes whilst driving” approach that Brené’s team took guarantees failure. Full stop.

Proper setup for knowledge work requires:

Connecting to your document repositories (Google Drive, SharePoint, etc.)
Enabling web search if you need current information
Configuring verification loops and validation checks
Setting temperature appropriately (0.2 for factual work, 0.7 for creative tasks)
Testing with diagnostic queries to verify the setup works

Time investment: 2-4 hours per project
Return: Prevents the hollow feeling, the wasted work, and the fake citations

3. Run The Diagnostic

Before deploying any AI system, run this four-step test:

Test 1: Pin everything (model, prompt, temperature 0 or a fixed seed), run 10 times
→ Same wrong output every time? Type 1 error (learned wrong pattern)

Test 2: Simplify the query to its most basic form
→ Works simplified but fails in full context? Type 2 error (decoding issue)

Test 3: Sample the same query 20 times with different random seeds
→ Wildly different answers? Type 3 error (knowledge gap)

Test 4: Check if retrieval system returns different documents
→ Output changes only when retrieval changes? Type 0 error (environment drift)

This diagnostic tells you which solution you need and how much to budget. It takes 30 minutes and could save you millions in misallocated spending.
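Once you have the four observations, the triage logic is simple enough to write down. The sketch below is one possible reading of the test, not a validated procedure. In particular, the order of the checks is my assumption: input drift is ruled out first, and “works when simplified” takes priority over “answers vary”, because a Type 2 problem with a high temperature can also produce varying answers.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticResult:
    retrieval_changes_between_runs: bool  # Test 4
    works_when_simplified: bool           # Test 2
    answers_vary_across_seeds: bool       # Test 3
    same_wrong_output_when_pinned: bool   # Test 1


def classify(result: DiagnosticResult) -> str:
    """Map the four observations to the most likely error type."""
    if result.retrieval_changes_between_runs:
        return "Type 0: environment drift"
    if result.works_when_simplified:
        return "Type 2: bad decision strategy"
    if result.answers_vary_across_seeds:
        return "Type 3: knowledge gap"
    if result.same_wrong_output_when_pinned:
        return "Type 1: learned wrong pattern"
    return "unclassified: gather more evidence"
```

For Brené’s experiment, with no retrieval connected and citations that change on every run, this returns Type 3.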

4. Measure What Actually Matters

Stop measuring just “accuracy.” That single metric obscures more than it reveals.

Start tracking:

Groundedness score: Does the output trace back to verified sources?
Abstention rate: Does the AI admit uncertainty when appropriate, or does it fabricate?
Replay rate: Can you reproduce the exact same result with the same inputs?
Setup compliance: Did the team follow the configuration checklist?

These metrics tell you whether you’re engineering a reliable system or hoping for magic.
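Three of these metrics fall straight out of a run log. Here is a minimal sketch, assuming you record a handful of booleans per run; the `RunRecord` fields are hypothetical names, and how you decide whether an output is “grounded” is the hard part that this code does not solve.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunRecord:
    grounded: bool        # every claim traced back to a verified source
    should_abstain: bool  # the system did not have the information needed
    abstained: bool       # the system said it did not know
    replay_matches: bool  # same inputs reproduced the same output


def rate(flags: Iterable[bool]) -> float:
    """Share of True values; 0.0 for an empty sequence."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def summarise(records: list[RunRecord]) -> dict[str, float]:
    """Compute groundedness, abstention-when-needed, and replay rate."""
    return {
        "groundedness": rate(r.grounded for r in records),
        "abstention_when_needed": rate(r.abstained for r in records if r.should_abstain),
        "replay_rate": rate(r.replay_matches for r in records),
    }
```

Note that abstention is only scored on the runs where the system should have abstained. A model that refuses everything would otherwise look perfect.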

5. Educate Your Procurement Team

When evaluating AI vendors, your procurement team should ask:

“What retrieval options are included in the platform?”
“Can we connect this to our existing document repositories?”
“What verification loops and validation checks exist?”
“How do you prevent the Brené Brown scenario we’ve read about?”

You’re not buying a model. You’re buying a system with proper grounding, and those are very different things.

The Real Lesson

Let me bring us back to where we started.

That hollow feeling Brené described? It’s the feeling of using the wrong tool for the job and blaming the tool instead of the setup. It’s real, it’s valid, and it’s what happens when we ask AI to retrieve facts from a database that doesn’t exist.

Her broader point about human wisdom mattering more than ever is absolutely right. But perhaps not for the reason the story initially suggests.

We need human wisdom to set up AI properly. To know which mode to use for which task. To connect it to real sources. To verify outputs against ground truth. To invest the couple of hours upfront that prevent the hollow outcomes downstream.

The graduate students in Brené’s experiment won because they had access to the library and knew how to use it. They knew where the catalogue was. They knew how to verify sources. They knew how to trace citations back to real papers.

Give AI the same access and the same guidance, and it’s brilliant at the exact same task. Actually, it’s often better—it can cross-reference thousands of papers in seconds, surface connections you’d never find manually, and synthesise across domains with remarkable insight.

But you have to give it the library card first.

The model wasn’t broken. The interface was. And that distinction is worth millions.
