'Godmother of AI' Fei-Fei Li on AGI, the limits of current AI and why world models are next
Why an overlooked episode of Lenny’s Podcast is one of the most useful AI conversations of the year
Fei-Fei Li opens with a line that should have attracted far more attention than it has: AGI, she says, is essentially a marketing term, not a scientific category. In her view, there is no meaningful boundary separating “AI” from “AGI”, and the field remains anchored to the north star Alan Turing articulated decades ago: can machines think and act in ways comparable to humans?
Falling short
By this measure, the field has made important progress over the last decade, yet still falls short on some of the most basic forms of intelligence that humans develop as children. The gap is not abstract. Li lists capabilities such as physical reasoning, inferring structure from limited evidence, and expressing genuine emotional or social understanding. These, she argues, remain areas where the systems we celebrate today still fail. This framing punctures both extremes of the hype cycle. AI is not about to become a superintelligence, nor is it close to capturing the full scope of human cognition.
To explain how we arrived here, Li walks back through the ImageNet years. Her argument is simple. The period from the mid-2000s to the early 2010s marked the crucial turn in modern AI because it combined three elements that had never been brought together at sufficient scale. The first was a large, cleanly labelled dataset, which ImageNet provided.
The second was neural network architectures that could learn from such data. The third was the arrival of GPUs capable of supporting the necessary computation. This combination enabled Geoff Hinton’s group at the University of Toronto to produce the 2012 ImageNet result that reset expectations across the field. It was a sharp demonstration of what happens when models are given enough examples to learn meaningful patterns.
Lessons to be learned
However, Li is careful to draw limits around that era’s lessons. She pushes back against a simplistic reading of Rich Sutton’s “bitter lesson”, the observation that general methods which scale with data and computation tend to outperform hand-crafted approaches over time. Sutton’s logic fits language models neatly because training data and output are both text, creating a well-aligned objective. It is far harder to apply in robotics, where the ultimate goal is not text but action in three-dimensional space.
The internet does not contain labelled trajectories of physical actions, nor does it provide the rich, structured feedback that robots require to learn how to operate safely and effectively. Physical systems also introduce constraints that do not apply to software models. Robots and self-driving vehicles interact with the real world, where failure costs time, money and sometimes safety. Experiments cannot be run at the speed or volume of text-based model training. As a result, the template that worked for language is not easily transferred.
Spatial intelligence is the next frontier
This is the starting point for Li’s central argument in the episode: the next phase of progress depends on spatial intelligence. She notes that humans are embodied agents who operate in three-dimensional environments. Much of our intelligence comes from the ability to navigate, manipulate objects, read scenes and act with awareness of physical constraints. Language is only one dimension of cognition. Current AI systems lack this deeper understanding of space and structure. They are skilled at predicting the next word, but not at forming an internal model of how the world works.
Li frames the solution in terms of world models. In her formulation, a world model should be capable of generating a coherent, explorable environment from prompts, whether textual or visual. It should allow users or agents to move through that environment, change it, interact with objects and plan actions. It should provide a substrate in which both humans and robots can reason about the world. This is not a theoretical claim.
It is the foundation of her company, World Labs, which has spent the past year developing a prompt-to-world engine known as Marble. The tool can generate 3D scenes that users can explore, export or adapt for specific purposes. It is already being tested in VFX pipelines, game development workflows, robotics simulation environments and early-stage psychological research, where immersive scenes can be used to study behaviour.
A complementary layer
What stands out is Li’s insistence that world models are not a replacement for language models but a complementary layer. Language captures relations, instructions and concepts. World models capture structure, space and interaction. Only by bringing these strands together, she argues, will AI systems begin to approximate the richness of human intelligence. She also notes that world models could help address the data gap in robotics by providing diverse, controllable synthetic environments in which robots can practise before moving into the physical world.
Running through the episode is Li’s belief that AI must remain human-centred. She discusses her work at the Stanford Institute for Human-Centered AI (HAI), which focuses on interdisciplinary research, governance and policy. Her message is consistent. AI is a transformative set of technologies that will touch every profession. Nurses, teachers, farmers, musicians and many others will all feel its effects.
The crucial question is whether society deploys AI in ways that preserve dignity and agency. Li is neither utopian nor fatalistic. She reiterates that technology can be a net positive, but only if its development and use are guided with care. This includes how we support workers, how we educate the next generation, and how we build systems that reflect the needs of the communities they serve.
The knowns (and the known unknowns)
Perhaps the most striking aspect of the conversation is its clarity. Li avoids speculation and concentrates on what is known and what remains unknown. She does not deny the value of large language models, but she stresses their limits. She rejects simplistic AGI narratives, not from a defensive posture, but because the term obscures rather than clarifies. She sees the field as being in a productive but incomplete stage, where significant advances have been made but major capabilities are still missing. Her emphasis on world models is not a prediction but a direction of travel, grounded in the specific shortcomings of current systems.
The episode is a reminder that the AI world often moves faster on narrative than substance. When one of the field’s pivotal figures outlines, in detail, what the next generation of systems might require and why, it deserves broader discussion. Li does not offer grand claims or timelines. Instead, she presents a disciplined account of where the frontier truly lies. In a moment dominated by noise, her perspective cuts cleanly through. Anyone seeking a realistic understanding of the state of AI and its future path would be well served to listen closely.
A clearer picture
Taken together, Li’s remarks provide a grounded view of the field:
• Today’s AI is powerful but narrow.
• Major gaps remain in physical reasoning, spatial intelligence and emotional cognition.
• Future progress requires new ideas, not just more compute.
• World models are a plausible next step because they capture aspects of human intelligence that language models cannot.
• And throughout, the focus should remain on people, not hype categories like AGI.
If you want a realistic sense of where the frontier actually is and what meaningful breakthroughs will require, her interview delivers it.