AI Coding Agents the Rise of the Poster Chil

Meta description: Discover how AI coding agents, the most powerful vertical agents, are revolutionising software development, with insights from OpenAI’s SWE Lancer benchmark and strategic takeaways for tech leaders.

Intro

[image: Screenshot 2025-02-22 at 12.39.55]

AI coding agents are reshaping software development.

My article examines the stark reality revealed by OpenAI’s SWE Lancer benchmark - a test that challenged leading AI models to complete 1,400 real-world coding tasks from Upwork.

I explore three crucial developments transforming the industry:

  1. The emergence of “vibe coding” - a new paradigm championed by my favourite co-founder of OpenAI, Andrej Karpathy.
  2. The limitations of horizontal LLMs - why even top models like Claude 3.5 Sonnet and GPT-4o struggle to complete more than 40% of real-world coding tasks
  3. The rise of vertical agents - specialised AI systems that overcome these limitations through domain-specific architecture and iterative learning

Using fresh data and insights from industry leaders, I examine why the future of AI in software development isn’t about bigger language models, but smarter implementation through vertical specialisation.

For tech leaders and developers alike, understanding this shift is crucial for navigating the next wave of AI innovation and not just in software development.

The analysis moves from examining current benchmarks to exploring practical implications, concluding with strategic considerations for organisations looking to harness these emerging technologies effectively.

AI Coding Agents: A Fundamental Shift in Software Development

The software development landscape is undergoing a seismic shift. OpenAI co-founder Andrej Karpathy recently introduced the term “vibe coding”—where developers increasingly rely on AI for coding assistance. But beyond the buzz, what does this actually mean for businesses, developers, and the future of software engineering?

[image: Screenshot 2025-02-21 at 13.59.18]

[image: Screenshot 2025-02-21 at 14.01.12]

This has started an entire movement of Vibe Coders who are thinking about new categories of tools. It’s predicated, as Karpathy points out, on the availability of a particular set of new coding tools. These coding tools hit that line right between LLMs and agents in terms of

  • how much they’re being controlled by humans and
  • how much they’re actually doing for themselves

I think that is part of what makes this area so interesting: it is at the forefront of agents in practice.

It also shows how soft some of this terminology is, and at the same time how powerful these tools are likely to be in practice.

For example, Grok 3 just launched and showed off a bunch of benchmarks.

[image: Screenshot 2025-02-21 at 14.22.28]

Limitations of Traditional Benchmarks

Like many, I find the flood of AI benchmark numbers overwhelming, despite being a stats enthusiast and well read. The signal-to-noise ratio has become too low to be meaningful. As Ethan Mollick notes, public benchmarks are unreliable, and more importantly, they often fail to reflect real-world work scenarios.

At Eclipse AI, our deep focus on real-world AI implementation reveals a critical gap between benchmarks and practical deployment. This disconnect isn’t just a challenge - it’s a strategic advantage.

The market lacks robust frameworks for measuring real operational value, creating space for those who understand ground-truth implementation.

The additional challenge, and therefore opportunity, is that AI is advancing so fast it is creating a paradox, as Dr Ethan Mollick expresses so well:

[image: Screenshot 2025-02-21 at 16.27.56]

From Code Assistants to AI Engineers

We’re moving beyond basic code completion into an era where AI coding agents can autonomously handle complex development tasks. This transformation isn’t just about writing code faster; it’s about fundamentally reshaping how software is built.

LLMs are broad horizontals. Their limitations in autonomously handling the job of a software engineer stem from the absence of the constraints you get when you build a vertical AI agent.

[image: Screenshot 2025-02-21 at 17.13.31]

The Million-Dollar AI Experiment

The question that provoked the whole conversation:

Can Frontier LLMs earn $1 million from real world freelance software engineering?

OpenAI’s SWE Lancer benchmark tested leading AI models on 1,400 real-world software development tasks from Upwork.

It is worth pointing out that the LLMs were not configured as vertical agents, a setup that in principle allows an agent to work in much the same way as a human doing the same tasks. The tests were very revealing all the same:

  • Claude 3.5 Sonnet led the pack, completing 40% of tasks and theoretically earning $403,000.

  • AI models excel at finding relevant code snippets but struggle with holistic problem-solving.

  • AI performs better at managerial tasks (54% completion rate) than at hands-on coding (26% completion rate).

  • SWE Lancer encompasses both independent engineering tasks, ranging from $50 bug fixes to $32,000 feature implementations, and managerial tasks where models choose between technical implementation proposals.

    [image]

    Full paper: https://arxiv.org/pdf/2502.12115
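
To make the two headline metrics concrete, here is a minimal Python sketch of how a payout-weighted score can differ from a plain completion rate. The task list, the prices, and the `Task` and `score` names are hypothetical illustrations, not data or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One freelance task: what it pays and whether the agent's fix passed."""
    payout_usd: int
    passed: bool


def score(tasks: list[Task]) -> tuple[float, int, int]:
    """Return (completion rate, dollars earned, total dollars available)."""
    earned = sum(t.payout_usd for t in tasks if t.passed)
    available = sum(t.payout_usd for t in tasks)
    rate = sum(t.passed for t in tasks) / len(tasks)
    return rate, earned, available


# Hypothetical results: the small bug fixes pass, the expensive work does not.
tasks = [
    Task(payout_usd=50, passed=True),
    Task(payout_usd=250, passed=True),
    Task(payout_usd=1_000, passed=False),
    Task(payout_usd=8_000, passed=False),
]

rate, earned, available = score(tasks)
print(f"completed {rate:.0%} of tasks, earned ${earned:,} of ${available:,}")
```

In this made-up example the agent completes half of the tasks but captures only about 3% of the money on offer, which is why a dollar-denominated benchmark can tell a different story from a pass rate.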

Significance and Innovation

Until now, coding benchmarks have largely involved competitive coding problems:

  • These are tests that assess models on tricky programming puzzles
  • But don’t translate directly into practical real world use cases
  • On top of their inapplicability to the real world, they’re becoming increasingly saturated
  • Thus making it difficult to know whether a new model represents a significant improvement or was simply trained to perform well on a known set of questions

The History of the Million Dollar Turing Test

This new OpenAI benchmark is different: it’s focused on the real world. It hearkens back to an idea for a new type of Turing test based on how AI interacts with the real world.

[image: Screenshot 2025-02-21 at 17.30.09]

[image: Screenshot 2025-02-21 at 17.31.28]

Back in the middle of 2023, Mustafa Suleyman proposed a modern Turing test to determine whether an AI could make $1,000,000. The test gives the AI the instruction to:

  • Make $1,000,000 on a retail web platform
  • In a few months, with just a $100,000 investment

Clearly Mustafa’s idea was different to what OpenAI have now done with SWE Lancer, specifically because SWE Lancer gave the models 1,400 freelance tasks rather than asking them to go and be creative and figure out how to make that money. That fixed task list is, arguably, an element of vertical agent architecture.

Horizontal Agents versus Vertical Agents

  • But the principle of getting benchmarks into the real world, plus this baselining to $1,000,000, is obviously reminiscent of what Mustafa was getting at.

    [image]

    Full paper: https://arxiv.org/pdf/2502.12115

  • Both experiments share the same DNA though: testing AI in real-world markets and targeting that ambitious $1M benchmark.

  • Different paths, same mountaintop! The key distinction? Structure vs. exploration. And that’s what makes this whole experiment important: it’s testing whether AI performs better with guard rails or creative freedom in the wild west of freelancing.

    [image]

    Full paper: https://arxiv.org/pdf/2502.12115

Test Participants and Setup

For their research the scientists who wrote the paper set three LLMs to the task.

  • OpenAI’s GPT-4o and o1
  • Anthropic’s Claude 3.5 Sonnet.

Each LLM was driving a basic coding agent capable of directly interacting with a code base. The models were given one shot to complete each task.

Overall Performance Results

[image]

Full paper here: https://arxiv.org/pdf/2502.12115

Reality hits hard

Even the most advanced language models stumble when faced with the messy world of freelance work. Like a fresh graduate stepping into their first job, these AI titans discovered that real-world tasks aren’t as neat as their training exercises.

The setup was ruthless

Researchers tossed unvarnished, raw tasks from Upwork and Expensify at these models - no clarifications, no hand-holding, just pure freelance chaos. Think of it as throwing a perfectly trained kitchen chef into a food truck during rush hour.

[image]

Full paper: https://arxiv.org/pdf/2502.12115

SWE Lancer issues are dynamically priced based on real-world difficulty. In the example above (Expensify, 2023), SWE managers rejected 5 early proposals that did not appropriately address edge cases. The initial request priced at $1,000 was increased to $8,000 over four weeks until it was solved.

The twist? No Vertical Agents Allowed

These AI powerhouses were cut off from their usual lifeline - no Internet, no GitHub, no external knowledge bases. They had to work with what they knew, armed only with snapshots of relevant codebases.

It’s like asking a master chef to cook something new to them, without allowing them to check their phone for recipes!

  • The user tool is integrated with each IC SWE Task, allowing the agent to observe its work through automated browser interactions via Playwright scripts
  • When invoked, the tool simulates a user attempting the task-specific action in the local application, similar to how engineers test their code during development
  • After execution, the tool generates two types of feedback: a text-based trajectory and screenshots, both saved to the agent’s working directory
  • The agent receives no direct success/failure feedback from the tool executions but can invoke it repeatedly via command line when enabled
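
The behaviour described in these bullets can be sketched roughly as follows. This is a stand-in written purely for illustration: `run_user_tool`, the file names, and the fake action steps are assumptions, and the real tool drives a browser through Playwright scripts rather than writing placeholder files.

```python
import tempfile
from pathlib import Path


def run_user_tool(workdir: Path, steps: list[str]) -> None:
    """Simulate a user walking through the task and leave artefacts behind.

    Deliberately returns nothing: the agent gets no pass/fail signal and
    must inspect the saved trajectory and screenshots itself.
    """
    out = workdir / "user_tool"
    out.mkdir(parents=True, exist_ok=True)

    trajectory = []
    for i, step in enumerate(steps, start=1):
        trajectory.append(f"step {i}: {step}")
        # Placeholder for a real screenshot captured by a browser driver.
        (out / f"screenshot_{i}.png").write_bytes(b"")

    (out / "trajectory.txt").write_text("\n".join(trajectory), encoding="utf-8")


workdir = Path(tempfile.mkdtemp())
run_user_tool(workdir, ["open expense form", "enter amount", "click save"])

# The agent's only feedback is whatever it reads back from its working directory.
artefacts = sorted(p.name for p in (workdir / "user_tool").iterdir())
print(artefacts)
```

Because the function returns nothing, the agent has to decide for itself whether the trajectory and screenshots show the task working, much as an engineer would when clicking through their own change.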

[image]

Full paper: https://arxiv.org/pdf/2502.12115

The results were sobering.

Not a single model cracked the code to become a million-dollar freelancer. More critically, they failed to complete most tasks, revealing a stark gap between academic excellence and practical capability of an LLM (or horizontal agent).

This test wasn’t just about performance - it was a reality check. While these models excel in controlled environments, they’re still learning to navigate the unpredictable rapids of real-world problem-solving. The message is clear: AI isn’t ready to steal your Upwork gigs just yet!

Performance Rankings: AI Models as Freelancers

[image: Screenshot 2025-02-21 at 18.21.26]

On the IC SWE Diamond Set:

  1. Claude 3.5 Sonnet emerged as the clear leader, resolving 26% of tasks and generating $89,000 from a potential $415,000 pool.
  2. o1 followed at $78,000.
  3. GPT-4o lagged significantly at $29,000.

The spread between models hints at a critical insight: raw processing power may matter less than adaptability to ambiguous real-world instructions.

Sonnet’s stronger performance, achieved with presumably fewer parameters than GPT-4, suggests architectural choices and training approaches may prove more decisive than scale alone.

Agent Strengths and Limitations

[image: Screenshot 2025-02-21 at 18.26.57]

[image: Screenshot 2025-02-21 at 18.27.10]

The Good News = Good Solution Picking

The SWE-Lancer paper had interesting results for ‘Problem-Solving Performance’:

The paper points out that the researchers rarely found cases where the model/agent aims to reproduce the issue or fails due to not finding the right file or location to edit.

For the managerial tasks, each model displayed better performance.

  • Claude 3.5 Sonnet: $314,000 (54% completion)
  • o1: $302,000 (52% completion)
  • GPT-4o: $275,000 (47% completion)

Models showed strong decision-making when selecting between options but still fall short of replacing human technical leads.

The Implications for Vertical Agents

Part of what’s so interesting about this is that it reflects the broad consensus that people have had for some time: Claude 3.5 Sonnet is the best coding model.

[image: Screenshot 2025-02-21 at 18.43.04]

What’s striking is how these findings mirror human freelancers. As Henry Shi, founder of super.com, points out, even human developers rarely get everything right on the first attempt.

The key difference? Humans iterate and improve based on feedback—a gap AI models must overcome. But until then, vertical agents!

[image: Screenshot 2025-02-21 at 18.46.09]

For this benchmark, the agents only got one shot at the task. But that’s not how things work in the real world.
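
The difference between a single attempt and an iterate-on-feedback loop can be sketched in a few lines of Python. Everything here is hypothetical: `propose_fix` stands in for a model call, `run_tests` for a test suite, and the toy "bug" exists only so the example runs.

```python
from typing import Callable, Optional


def solve(
    propose_fix: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_attempts: int,
) -> Optional[str]:
    """Try up to max_attempts times, feeding each failure message back in.

    max_attempts=1 mirrors the one-shot setting described above.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = propose_fix(feedback)
        feedback = run_tests(candidate)  # None means every test passed
        if feedback is None:
            return candidate
    return None


# Toy stand-ins: the first proposal misses an edge case, the second handles it.
def propose_fix(feedback: Optional[str]) -> str:
    return "handle empty input" if feedback else "happy path only"


def run_tests(candidate: str) -> Optional[str]:
    return None if candidate == "handle empty input" else "fails on empty input"


one_shot = solve(propose_fix, run_tests, max_attempts=1)
iterative = solve(propose_fix, run_tests, max_attempts=3)
print(one_shot, "|", iterative)
```

With one attempt the toy agent fails; given the failure message and a second attempt it succeeds. That is the gap between the benchmark setup and how freelancers actually work.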

In a world where AI models are becoming commodities with shrinking technological advantages, OpenAI has strong incentive to control the full customer experience of agents end-to-end. Operator anyone?

They’re likely exploring agents across all major work domains, to maintain their competitive edge. Horizontal LLM frontier companies are coming for the Vertical Agent market.

Vibe Coding’s Broader Impact

Horizontal frontier models like OpenAI’s and Anthropic’s, as SWE Lancer shows, are not going to bring the power of AI to our daily tasks on their own. That is why the emergence of “vibe coding” isn’t just another tech buzzword - it’s reshaping how we think about software creation at a fundamental level.

[image: Screenshot 2025-02-22 at 11.24.16]

The fact that coding agents are among the first AI tools hitting real production environments is particularly telling. We’re watching the democratization of code creation unfold in real-time, with a spectrum that’s more inclusive and nuanced than ever before:

No-Code → AI Agents → Traditional Coding

Market Response and Investment

A16Z’s market map is perfectly timed to capture this seismic shift in how we think about software creation and monetisation.

[image: Screenshot 2025-02-22 at 11.26.33]

Riley Brown’s deep dive into vibe coding isn’t just trendsetting - it’s signalling a fundamental transformation in the ‘creator economy’. The ability to monetise audiences through software rather than traditional routes (courses/ads) is a game-changer that’s reshaping the landscape faster than we can map it!

What is significantly disruptive is how this intersects with a new generation of VC funding strategies. These new ‘creator funds’ aren’t just throwing money at content - they’re betting on a future where creators are essentially micro-software companies, powered by AI. It’s like watching the app store revolution 2.0, but this time with AI as the rocket fuel!

[image: Screenshot 2025-02-22 at 11.32.06]

A tweet from Andrew Chen (AI VC expert) perfectly captures the excitement around vibe coding’s potential!

It’s not just about making coding more accessible - it’s about reimagining the entire creative process of software development.

This isn’t disruption - it’s a complete reinvention of what it means to create digital products. Traditional software engineering isn’t being replaced; it’s being augmented by entirely new modalities of creation. We’re expanding the universe of who gets to build the future, and that’s incredibly exciting!

Future Outlook

The enthusiasm around democratised coding is electric, but there is a crucial reality check: enterprise integration isn’t as simple as dropping these tools into existing workflows.

The next few years will be critical in defining how these different coding modalities mature and find their proper places. Think of it like how smartphones evolved - consumer apps blazed the trail, but enterprise adoption required its own specialised pathway with unique security, compliance, and integration considerations.

At Eclipse AI and Magick AI we are watching the bifurcation unfold in real-time. We are inserting ourselves at the frontier, to help navigate the market:

  • Consumer-focused vibe coding tools optimised for speed and creativity
  • Enterprise-grade coding agents with robust guardrails and integration capabilities
  • Hybrid approaches that bridge both worlds

How AI Coding Agents Are Reshaping Development Teams

For technology leaders and engineering managers, AI coding agents present new opportunities and challenges:

[image: image]

1. Accelerating Development Cycles

  • Rapid prototyping and initial code generation
  • Automated bug fixing and optimisation
  • AI-powered code reviews and refactoring

2. Transforming Team Dynamics

  • Shifting engineers from routine coding to strategic problem-solving
  • Empowering non-technical team members to contribute to development
  • Creating new AI-human collaboration roles

3. Optimising Resource Allocation

  • Reducing time spent on basic debugging
  • Freeing up senior engineers for architectural decisions
  • Scaling projects more efficiently

Real-World Impact: AI-Augmented Development

AI coding tools are already changing workflows:

  • Traditional development processes are being reimagined
  • New AI-powered platforms blend human creativity with machine efficiency
  • Software development is becoming more accessible to non-traditional developers

However, AI still has limitations. As Ethan Mollick put it: “AI can perform PhD-level tasks in some areas while making basic errors in closely related fields.” This paradox highlights the need for strong AI oversight and integration strategies, which is exactly what vertical agents provide. As CTO of Eclipse AI, that is a strategic direction I will champion.

Strategic Considerations for Tech Leaders

Successfully integrating AI coding agents into development teams requires a deliberate strategy:

[image: image]

1. AI Integration Strategy

  • Begin with low-risk, high-reward applications
  • Establish best practices for AI-human collaboration
  • Define KPIs to measure AI’s impact

2. Team Training & Enablement

  • Invest in AI-assisted development training
  • Develop guidelines for effective AI tool usage
  • Encourage a culture of experimentation

3. AI Risk Management & Compliance

  • Implement robust code review processes
  • Establish security and compliance frameworks
  • Continuously monitor and improve AI performance

This is challenging but the value returned to you is exponential. This is something we are excited to help the market with.

The Future: AI as a Partner, Not a Replacement

The rise of AI coding agents isn’t about replacing developers—it’s about augmenting their capabilities. The organisations that strategically integrate AI into their software development lifecycle will gain a competitive advantage.

Key Next Steps for Your Organisation

[image: image]

The Takeaway

The limitations revealed by OpenAI’s SWE Lancer benchmark aren’t just statistics—they’re a signal about the future of AI, not just in software development, but for any human task.

While horizontal LLMs like Claude 3.5 Sonnet and GPT-4 showcase impressive capabilities, their sub-50% completion rates on real-world tasks illuminate a critical truth: raw intelligence isn’t enough.

The path forward lies in vertical agents—specialised AI systems that combine the broad knowledge of LLMs with domain-specific constraints, iterative learning, and practical workflow integration.

[image: image]

As we venture into this new era of AI-augmented everything, it’s becoming clear that the true revolution won’t come from bigger language models, but from focused, vertical agents that can truly understand, adapt to, and excel in specific development contexts.

The future isn’t just about smarter AI—it’s about smarter implementation through system design and vertical specialisation.

Join the Conversation

How is your team preparing for the rise of AI coding agents? Are you already using AI tools like Claude 3.5, GPT-4, or Copilot X? Share your experiences and insights in the comments below!

And if you are looking for an advisor, a guide and a builder to help you on the journey, we look forward to helping you.


About the Author

Chris Jones, CTO of Eclipse AI, is a technology leader specialising in AI-driven strategies for enterprise, helping organisations navigate the future of tech and business.