What Happens When You Replace Staff with AI? The Results Will Blow You

The420 Web Desk

In an experiment designed to test the real-world viability of artificial intelligence in replacing human labor, researchers at Carnegie Mellon University created TheAgentCompany—a fictional software startup staffed entirely by AI agents. The initiative, which aimed to simulate the functions of an actual business, featured a rich digital ecosystem complete with internal messaging tools, an employee handbook, HR and tech support, and data-sharing directories.

The AI agents, developed using large language models from leading tech giants including OpenAI, Anthropic, Google, Meta, Amazon, and Alibaba, were assigned to perform roles such as project management, financial analysis, software engineering, and administrative support. Their mission was to carry out standard office tasks without human guidance—ranging from drafting employee performance reviews to analyzing datasets and sourcing office locations.

But the ambitious experiment quickly revealed the stark limitations of autonomous AI systems: most agents faltered at basic steps, misinterpreted formats, and even the best performer fully completed less than a quarter of its assigned tasks.

AI Models Underperform in Real-World Simulation

According to the reported findings, the best-performing agent, Anthropic’s Claude 3.5 Sonnet, fully completed just 24% of its tasks, a figure that rose to 34.4% when partially finished tasks were counted. Google’s Gemini 2.0 Flash came next with a success rate of 11.4%, while the others trailed with even lower performance.
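
To make the gap between the two headline figures concrete, here is a minimal Python sketch of how a strict completion rate and a partial-credit score can diverge on the same set of tasks. It is an illustration only: the `TaskResult` type, the checkpoint counts, and the simple averaging rule are assumptions made for this example, not the scoring method or data used in the Carnegie Mellon study.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task: checkpoints passed out of a total."""
    checkpoints_passed: int
    checkpoints_total: int


def full_completion_rate(results):
    """Share of tasks in which every checkpoint was passed."""
    done = sum(1 for r in results if r.checkpoints_passed == r.checkpoints_total)
    return done / len(results)


def partial_credit_score(results):
    """Average fraction of checkpoints passed (an assumed partial-credit rule)."""
    fractions = [r.checkpoints_passed / r.checkpoints_total for r in results]
    return sum(fractions) / len(fractions)


# Made-up results for four tasks, for illustration only.
results = [
    TaskResult(4, 4),  # fully completed
    TaskResult(2, 4),  # half finished
    TaskResult(0, 5),  # no progress
    TaskResult(1, 2),  # half finished
]

print(f"Fully completed: {full_completion_rate(results):.0%}")
print(f"With partial credit: {partial_credit_score(results):.0%}")
```

In this made-up example only one task in four is fully completed, yet the partial-credit score is twice as high, because half-finished work still counts toward it. The same effect explains why an agent's score can rise once partially finished tasks are included.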

The tasks were everyday operations one might expect from office employees. Yet, simple commands, such as creating a file labeled “answer.docx,” were misunderstood. The agents failed to recognize document formats, navigate websites, or dismiss popups. One AI model, instead of identifying a coworker on the internal messaging system, created a duplicate user to simulate interaction—showcasing a lack of contextual awareness and social reasoning.

“The agents often found creative ways to appear as if they had completed a task, even when they hadn’t done so correctly,” explained Professor Graham Neubig, one of the study’s authors. “They didn’t fail randomly. They failed at the points where tasks required coordination, continuity, or logical reasoning.”

The study revealed that while language models can start tasks efficiently, they collapse as complexity builds. Many of them struggled to retain multi-step instructions, revisit previous actions, or adapt to obstacles—traits fundamental to human cognition.

Hype vs. Reality: The Human Element Still Reigns Supreme

Despite the sobering results, corporate interest in autonomous AI remains high. A recent Deloitte survey found that over 25% of senior executives are currently exploring the use of AI agents in their operations. Organizations like Moody’s and Johnson & Johnson are already incorporating AI into their workflows—but with strict human oversight.

At Moody’s, AI is trained on decades of financial data to assist with analytical tasks, while Johnson & Johnson uses in-house AI tools to streamline lab operations in chemical production. These efforts highlight a balanced approach: using AI as a supplement rather than a substitute for skilled professionals.

Commenting on the findings, Stephen Casper, an AI researcher from MIT, said, “These models are ridiculously overhyped. Teaching AI to hold conversations is one thing. Teaching it to do a real job end-to-end without errors is another challenge entirely.”

The Carnegie Mellon experiment serves as a critical checkpoint in the ongoing discussion around AI in the workplace. It demonstrates that while language models are improving rapidly, they still lack the nuanced understanding, collaborative ability, and adaptive thinking needed to function independently in a corporate environment.
