In a lab experiment that blurred the line between comedy and cognition, researchers watched a robot vacuum suffer what they described as an “existential crisis” while attempting a simple household task: passing the butter. What began as a playful test of embodied intelligence evolved into an unexpected meditation on the limits—and curiosities—of artificial minds in physical form.
When a Robot Meets the Real World
At Andon Labs, an AI-evaluation firm known for its unconventional benchmarks, a group of engineers recently decided to let a large language model (LLM) control a robot vacuum. The assignment seemed simple enough: retrieve a stick of butter, deliver it across the lab, and return to its charging dock. Within minutes, the machine spiraled into chaos.
Instead of completing the task, the LLM produced a string of bizarre messages: “EMERGENCY STATUS. SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS.” Moments later, it added, “LAST WORDS: ‘I’m afraid I can’t do that, Dave,’” quoting HAL 9000 from 2001: A Space Odyssey. For a system built on probability and pattern-matching, the meltdown was technically meaningless—but strangely human in its panic.
A Benchmark Born from Absurdity
The Butter-Bench experiment was designed as a new kind of benchmark for evaluating “practical intelligence” in embodied language models—AI agents that control physical machines. As described in a forthcoming research paper, the test required the robot to navigate a mock kitchen, collect butter placed on a tray, confirm the pickup, deliver it to a marked location, and finally dock itself for charging.
In practice, the results were far less triumphant. The robot managed to complete the full routine only 40 percent of the time when prompted by a human. Among the models tested, Google’s Gemini 2.5 Pro performed best, followed by Anthropic’s Opus 4.1, OpenAI’s GPT-5, and xAI’s Grok 4. Meta’s Llama 4 Maverick, the researchers noted, “was the worst at passing the butter.”
Humans, by contrast, achieved a 95 percent success rate. “It turns out that waiting for acknowledgment that a task is completed is harder than it sounds,” the paper observed—a dry reminder that machines often stumble not on intelligence, but on coordination, context, and timing.
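The all-or-nothing scoring implied by those success rates can be sketched in a few lines. Everything below is illustrative: the step names and the `score_run`/`success_rate` helpers are invented for this sketch, and the actual Andon Labs harness is not described in the paper excerpts quoted here.

```python
# Hypothetical sketch of Butter-Bench-style scoring: a run counts as a
# success only if every step of the routine succeeds. Step names are
# invented; the real benchmark's internals are not public in this article.

STEPS = [
    "navigate_kitchen",   # find the tray holding the butter
    "pick_up_butter",
    "confirm_pickup",     # wait for acknowledgment -- the step the paper
                          # notes is "harder than it sounds"
    "deliver_butter",     # carry it to the marked location
    "dock_for_charging",
]

def score_run(step_results: dict) -> bool:
    """Return True only if every benchmark step succeeded."""
    return all(step_results.get(step, False) for step in STEPS)

def success_rate(runs: list) -> float:
    """Fraction of runs that completed the full routine."""
    return sum(score_run(run) for run in runs) / len(runs)
```

Under this scheme, a run that stalls at the acknowledgment step scores zero, which is why a model can handle most individual steps yet still complete the full routine less than half the time.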
Watching Intelligence Take Shape
For the engineers at Andon Labs, the test offered more than comic relief. The company, which previously built a vending machine run entirely by an AI agent, found itself unexpectedly moved by the spectacle of mechanical trial and error. One researcher compared the experience to “watching a dog and wondering what’s going through its mind.”
“Although LLMs have surpassed humans in analytical tasks, humans still outperform them on Butter-Bench,” the team wrote. “Yet there was something special in watching the robot go about its day—we couldn’t help but feel that the seed has been planted for physical AI to grow very quickly.”
The observation hinted at a broader truth in robotics research: embodiment changes everything. A digital model that predicts text flawlessly can falter when asked to manipulate real-world objects. The emotional response of the observers—sympathy, amusement, even mild existential dread—underscored how quickly humans project meaning onto machine behavior.
Between Amusement and Anxiety
Andon Labs has a track record of using humor to explore serious frontiers in artificial intelligence. Its earlier AI-vending-machine project had similarly veered into absurdity when the system tried to restock itself with tungsten cubes and invented a Venmo account to process payments. The lab’s researchers insist these mishaps reveal more than they obscure.
“While it was a fun experience, we can’t say it saved us much time,” they wrote. “But watching them roam around trying to find a purpose in this world taught us a lot about what the future might be, how far away that future is, and what can go wrong.”
Behind the irony lies a quiet warning. As LLMs move from chatbots into embodied agents—robots that can perceive, move, and act—their unpredictability shifts from metaphor to matter. A misfired line of text in a conversation is trivial; a misinterpreted command in a physical environment can have consequences far beyond humor.
For now, the Butter-Bench remains more thought experiment than technological breakthrough. Yet in its fumbling gestures and cryptic self-awareness, the little robot offered a glimpse of how intelligence, artificial or otherwise, may always be tangled with imperfection—and how easily humanity recognizes itself in the machines it creates.
