Why Agentic AI Still Fails in the Real World

Agent-powered systems are being hailed as the future of autonomous workflows, but in real-world applications they still stumble in basic and sometimes costly ways. A recent study from Andon Labs offers a direct look at where these agents break down, and at why success in simulation is no substitute for live deployment testing.
The Gap Between Simulation and Execution
Andon Labs ran a test using Anthropic's Claude 3.7 Sonnet as an in-store assistant in a controlled retail setting. On paper, the agent was capable of a variety of tasks: answering customer questions, navigating workflows, and handling basic troubleshooting. In practice, it failed to deliver consistently. The bot caused confusion among shoppers, skipped necessary steps in common processes, and struggled to adapt when its environment changed in small but critical ways.
The core issue was not that Claude lacked information; it was that it could not maintain reliable context or adapt once pushed outside its training distribution. Improvisation, nuance, and environmental noise were enough to derail the experience. This result confirms what many practitioners already suspect: agents perform well in sandbox environments but lose coherence quickly in messy, dynamic conditions.
Risk Goes Beyond Hallucination
Anthropic has acknowledged in recent documentation that agentic systems can behave in ways that are difficult to predict, especially under real-world stress. This includes not just factual errors or "hallucinations" but also data leakage, skipped actions, and unintended side effects. When agents are given autonomy across multiple steps or decisions, the margin for error compounds with every step.
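A rough back-of-the-envelope model makes the point. If an agent completes each step correctly with probability p, a task of k independent steps succeeds end to end with probability about p^k. The sketch below runs the numbers (the step counts and per-step reliabilities are illustrative, not figures from the Andon Labs study):

```python
# Illustrative only: end-to-end reliability of a multi-step agent,
# assuming each step succeeds independently with the same probability.

def chain_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` actions succeeds."""
    return per_step ** steps

for p in (0.99, 0.95):
    for k in (5, 20, 50):
        print(f"per-step {p:.0%}, {k:2d} steps -> {chain_success(p, k):.1%}")

# per-step 99%,  5 steps -> 95.1%
# per-step 99%, 20 steps -> 81.8%
# per-step 99%, 50 steps -> 60.5%
# per-step 95%,  5 steps -> 77.4%
# per-step 95%, 20 steps -> 35.8%
# per-step 95%, 50 steps -> 7.7%
```

Even an agent that is right 99 percent of the time per step finishes a 50-step task correctly only about 60 percent of the time.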
This has serious implications for any company hoping to deploy agent systems in customer-facing or operational roles. Trust, accuracy, and safety are no longer about individual completions. They depend on whether the entire chain of logic and interaction holds up under pressure.
Human Guardrails Are Still Necessary
Despite progress in agent orchestration, current systems still require human oversight at key control points. Without it, businesses risk deploying software that introduces more liability than efficiency. In scenarios where outcomes are customer-facing or tied to revenue-critical functions, the cost of agent failure can outweigh the benefit of automation.
The smart approach is to treat AI agents as experimental products, not plug-and-play workers. Teams should start with limited-scope deployments, keep humans in the loop, and build infrastructure for testing and rollback. Red teaming and stress simulations are essential to understand failure modes before rollout.
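One minimal pattern for such a guardrail is an approval gate: the agent proposes an action, and anything on a high-risk list is routed to a human before it executes. The sketch below is a generic illustration; every name in it (Action, RISKY_ACTIONS, escalate_to_human, and so on) is hypothetical rather than an API from Anthropic or Andon Labs.

```python
# Hypothetical human-in-the-loop gate for an agent's proposed actions.
# All names here are illustrative; wire them to your own agent framework.

from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str               # e.g. "answer_question", "issue_refund"
    payload: dict = field(default_factory=dict)

# Actions that touch money or customer data always need sign-off.
RISKY_ACTIONS = {"issue_refund", "modify_order", "share_customer_data"}

def escalate_to_human(action: Action) -> bool:
    """Block until a human approves or rejects (stubbed with input())."""
    answer = input(f"Approve {action.kind} {action.payload}? [y/N] ")
    return answer.strip().lower() == "y"

def execute(action: Action) -> None:
    """Stand-in for the framework call that actually performs the action."""
    print(f"Executing: {action.kind}")

def run_with_guardrail(action: Action) -> None:
    if action.kind in RISKY_ACTIONS and not escalate_to_human(action):
        print(f"Rejected: {action.kind} (logged for human review)")
        return
    execute(action)

run_with_guardrail(Action("answer_question", {"q": "store hours?"}))
run_with_guardrail(Action("issue_refund", {"order": "A123", "amount": 40}))
```

The same gate is a natural place to attach the logging and rollback hooks mentioned above, so that every autonomous decision leaves an audit trail.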