Claude, GPT, and Gemini Agents Failed Most U.S. Healthcare Workflows in New Benchmark
A new benchmark, reported in the Carroll County Mirror-Democrat, found major failures when leading AI agents were tested on U.S. healthcare workflows. The result is a sharp reminder that general-purpose agents remain far from dependable for complex clinical operations.
The result matters because it cuts through the optimism around agentic AI. Healthcare workflows are not just information retrieval problems; they involve policy rules, exceptions, sequence-dependent tasks, and high stakes that reward precision over fluency.
A reported failure rate of 72% should be read as a warning about deployment readiness. It suggests that even the most capable frontier models can struggle when a task requires integrated operational judgment rather than isolated question answering. In healthcare, that gap is not a minor engineering issue; it is the difference between a useful assistant and a liability.
The finding also helps explain why many health systems are becoming more selective about AI. There is growing recognition that general-purpose models may be impressive in demos but brittle in production, especially when workflows cross clinical, administrative, and regulatory boundaries.
The real takeaway is not that agents have no future, but that healthcare may need a different class of agent: narrower, better governed, and deeply aware of domain-specific constraints. Until such agents exist, the benchmark serves as a strong caution against overestimating current capability.