
Most AI agents are failing the only test that matters. Here's how to spot the ones that aren't

A framework from Nate B Jones cuts through the noise around outcome agents and exposes why the most hyped tools in the category are still falling short

by Ian Lyall
Photo by Growtika / Unsplash

Twenty twenty-five was supposed to be the year of the agent: software that can act and think autonomously to complete tasks in the background from a few helpful prompts. Sounds simple. 2026, however, has ushered in a more frustrating phase in AI's evolution: untangling what these agents can actually achieve, which at this point looks limited.

The pitch from every major AI agent platform right now is some version of the same promise: sit back, describe what you want, and let the software do the work. Lindy, Anthropic's Cowork, Google Opal and a clutch of better-funded but less visible startups are all making versions of this claim. The investment flowing into the category reflects how seriously the market is taking it. More than $285 billion has shifted out of SaaS stocks as investors reconsider whether subscription software can survive a world where agents do the same work without the licence fee.

Nate B Jones, who has been tracking the evolution of this category closely, argues that most of these tools are not actually solving the hard part. They are solving the easy part and calling it done.

Why code came first

To understand where outcome agents are going, it helps to understand where they started. The first generation of capable AI agents focused on code, and the reason is simple: code is verifiable. It either runs or it does not. There is no ambiguity about whether the output is good. That property made it possible to train and evaluate agents systematically, and companies like OpenAI and Google built their early agent products around it.

The challenge with general-purpose outcome agents is that most work is not verifiable in the same way. Telling an agent to prepare a client briefing or manage a calendar does not produce output with a clear pass or fail state. That ambiguity is what makes the category genuinely hard, and it is where most current tools break down.
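
To make the asymmetry concrete, here is a minimal sketch in Python of the two situations. It assumes a pytest-style test runner and uses invented function names; it illustrates the verifiability gap rather than reproducing code from any agent product.

```python
import subprocess
import sys


def verify_code(path: str) -> bool:
    """An agent-written module can be checked mechanically: run its tests
    and read the exit code, which gives an unambiguous pass or fail."""
    result = subprocess.run([sys.executable, "-m", "pytest", path], capture_output=True)
    return result.returncode == 0


def verify_briefing(path: str) -> bool:
    """A client briefing has no equivalent check; whether it is good enough
    remains a human judgment call."""
    raise NotImplementedError("no objective pass/fail state for this kind of output")
```

The first function is the kind of signal early coding agents could be trained and evaluated against; the second is the gap general-purpose outcome agents still have to cross.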

Three questions that separate real from fake

Jones applies three questions to any agent claiming to deliver outcomes. Does it have persistent memory? Does it produce artifacts that can be inspected and edited? And does its architecture allow context to compound over time, so the tenth task is easier than the first?

These are not complicated criteria. They reflect what anyone who has used a capable human assistant would expect as a baseline. But very few current tools meet all three, and several of the most prominent ones fail on multiple counts.
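
One way to read the three questions is as a simple rubric with partial credit. The sketch below is hypothetical, not anything Jones publishes, but it shows how the fractional scores quoted later in this piece arise.

```python
from dataclasses import dataclass


@dataclass
class AgentScore:
    persistent_memory: float      # 0, 0.5 or 1: does it remember between sessions?
    inspectable_artifacts: float  # does it produce output you can open and edit?
    compounding_context: float    # is the tenth task easier than the first?

    def total(self) -> float:
        return self.persistent_memory + self.inspectable_artifacts + self.compounding_context


# A tool with partial memory, real artifacts, but no compounding context
example = AgentScore(persistent_memory=0.5, inspectable_artifacts=1.0, compounding_context=0.0)
print(example.total())  # 1.5 out of 3
```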

Anthropic's Cowork, which launched in January and prompted Microsoft to build a rival product on top of Claude's agent framework despite its $13 billion investment in OpenAI, scores around one and a half out of three against Jones's criteria. It has some persistent memory and it produces tangible artifacts, which is part of why Wall Street noticed it. But its architecture does not compound context. Each session is broadly a fresh start. Shut the laptop and the agent stops. For a tool positioning itself as a replacement for enterprise SaaS, that is a significant limitation.

Lindy: executive automation with a trust problem

Lindy is one of the more visible players in the outcome agent space, founded by Flo Crivello and built around natural language descriptions of desired outcomes. The user describes what they want, and Lindy builds and runs the automation. In principle, that is close to the ideal. In practice, the user experience has not kept pace with the ambition.

Its Trustpilot rating sits at 2.4 out of five. Complaints centre on unclear credit usage and automations that produce no useful result while consuming resources. The interface makes it easy to start but difficult to debug. Jones places Lindy in the gap between Zapier and a genuine outcome agent, useful for automating small executive tasks but not a tool that produces deep, compounding work.

Sauna: the strongest conceptual play

The most intellectually coherent approach Jones identifies comes from Sauna, formerly Wordware, which raised $30 million from Spark Capital and Y Combinator before pivoting from an AI development environment to a professional workspace. The pivot came after the founders concluded that what people actually needed was not a way to build agents but a way to work with them.

Sauna's founder Filip Kozera has built the product around memory as a substrate rather than a feature. The distinction matters. When memory is a feature, it is bolted on, partial and easy to break. When it is a substrate, everything else in the product is built on top of it, and context accumulates in a way that changes the quality of work over time.
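
A loose sketch of that distinction, with invented class names and no claim to reflect Sauna's actual implementation: when memory is the substrate, every task reads from and writes back to the same store, so context accumulates by default rather than by exception.

```python
class MemoryStore:
    """Shared context that every operation reads from and writes back to."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def recall(self) -> list[str]:
        return list(self.facts)

    def record(self, fact: str) -> None:
        self.facts.append(fact)


class Workspace:
    """Memory as substrate: every task starts from accumulated context and adds to it."""

    def __init__(self, memory: MemoryStore) -> None:
        self.memory = memory

    def run_task(self, request: str) -> str:
        context = self.memory.recall()  # the tenth task sees what the first nine learned
        result = f"{request} (informed by {len(context)} prior facts)"
        self.memory.record(f"completed: {request}")
        return result
```

Memory as a feature would be the same workspace with an optional cache bolted on afterwards; nothing else in the product would depend on it, which is why it tends to stay partial and easy to break.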

The platform also reflects a broader insight that Jones considers durable: knowledge workers do not need to become programmers in an AI-native world. They need to be precise about what they want and capable of writing clear specifications. Sauna is building toward that model. The concern is that the product is still early, and the gap between its demo videos and its production performance has not fully closed.

Google Opal: free, capable and possibly temporary

Google Opal is a prompt-to-workflow builder that is free to use and has recently been upgraded with Gemini 3 Flash. It can understand objectives, select tools, self-correct, remember across sessions and ask clarifying questions. People are building and sharing real projects on it, including meeting preparation agents and research workflows.

The open-source ethos and the culture of building in public give it genuine momentum. But Jones flags two risks. One is data: using a Google product means accepting Google's terms around how that data is used. The other is continuity. Google has a documented history of building useful products and then abandoning them. For anyone investing significant time in building workflows on Opal, that history is not reassuring.

Obvious: quiet and ambitious

The least visible tool Jones evaluates is Obvious, an AI workspace offering workbooks with SQL, documents with live charts and custom application building. It appears to be addressing the artifacts and compounding context questions more seriously than most competitors. Its memory capabilities are harder to assess because the product is new and has generated little public discussion. Jones flags it as one to watch rather than one to adopt immediately.

What to build toward

The practical framework Jones proposes for anyone building or evaluating agents comes down to three layers. A knowledge store provides the memory foundation, separate from the model and updateable independently. Agent recipes provide pre-wired workflows for recurring tasks, from calendar management to document preparation. And a surface for editing outcomes makes the agent's work visible and correctable.
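
One possible shape for those three layers, sketched as Python interfaces. The names are assumptions made for illustration, not any vendor's API; the point they capture is that the knowledge store sits apart from the model and can be updated independently, while recipes and the editing surface both work through it.

```python
from typing import Protocol


class KnowledgeStore(Protocol):
    """Memory foundation, separate from the model and updateable on its own."""
    def query(self, topic: str) -> list[str]: ...
    def update(self, fact: str) -> None: ...


class AgentRecipe(Protocol):
    """A pre-wired workflow for a recurring task, such as calendar management."""
    def run(self, store: KnowledgeStore, request: str) -> str: ...


class OutcomeSurface(Protocol):
    """Where the agent's work becomes visible and correctable by the user."""
    def show(self, artifact: str) -> None: ...
    def apply_edit(self, artifact: str, correction: str) -> str: ...
```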

The question to ask any vendor is whether all three exist, how they interact, and whether the context genuinely compounds. If the answer to any of those is vague, the tool is probably not yet what it claims to be.
