Why General-Purpose LLMs Still Fail at Differential Diagnosis

A new wave of studies is reinforcing a blunt conclusion: large language models may sound clinically fluent, but they remain unreliable when asked to reason through differential diagnosis. For specialties like ophthalmology, where pattern recognition must be paired with structured reasoning and domain-specific context, the gap between conversational confidence and diagnostic quality remains wide.

Large language models have become remarkably good at producing plausible medical language, but plausibility is not the same as clinical judgment. The latest ophthalmology-focused coverage adds to a growing body of evidence that general-purpose LLMs still struggle when the task shifts from answering factual questions to ranking competing diagnoses.

That distinction matters because differential diagnosis is not a trivia test. It requires weighing symptoms, risk factors, temporal patterns, and uncertainty, often with incomplete information. In real practice, clinicians do not just need an answer; they need a defensible path from evidence to decision.

The concern is not that LLMs are useless in medicine, but that their strengths are mismatched to the hardest parts of clinical work. They can summarize, draft, and retrieve text well, yet they appear far less dependable at the inferential steps that separate the common from the dangerous, and the likely from the merely possible.

For healthcare organizations, this is an important corrective to the hype cycle. The practical opportunity is not to hand diagnosis over to a chatbot, but to build systems that constrain model behavior, verify outputs, and keep physician oversight central. Until models can show consistent reasoning under uncertainty, diagnostic use cases should be treated as high-risk decision support, not autonomous judgment.
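To make "constrain, verify, and keep oversight central" concrete, here is a minimal Python sketch of what such a wrapper might look like. Everything in it is an assumption for illustration: the names `Candidate`, `ReviewPacket`, and `screen_differential`, the tiny `ALLOWED_DIAGNOSES` vocabulary, and the sample findings are hypothetical and do not come from any study or product. It does not call a model; it only screens a differential that a model has already proposed.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical controlled vocabulary (assumption). A real system would
# draw on a maintained clinical terminology, not a hard-coded set.
ALLOWED_DIAGNOSES = {
    "acute angle-closure glaucoma",
    "conjunctivitis",
    "uveitis",
}


@dataclass
class Candidate:
    """One model-proposed diagnosis plus the findings it cites as support."""
    diagnosis: str
    supporting_findings: list[str]


@dataclass
class ReviewPacket:
    """What the physician sees: screened candidates and the reasons for rejection."""
    accepted: list[Candidate] = field(default_factory=list)
    rejected: list[tuple[Candidate, str]] = field(default_factory=list)
    # Oversight stays central: nothing in this sketch ever sets this to False.
    requires_physician_signoff: bool = True


def screen_differential(
    candidates: list[Candidate], documented_findings: set[str]
) -> ReviewPacket:
    """Constrain and verify model output before it reaches a clinician."""
    packet = ReviewPacket()
    for c in candidates:
        if c.diagnosis.lower() not in ALLOWED_DIAGNOSES:
            # Constrain: only diagnoses from the agreed vocabulary pass.
            packet.rejected.append((c, "diagnosis not in allowed vocabulary"))
        elif not c.supporting_findings:
            # Verify: every candidate must cite at least one finding.
            packet.rejected.append((c, "no supporting findings cited"))
        elif not set(c.supporting_findings) <= documented_findings:
            # Verify: cited findings must actually appear in the record.
            packet.rejected.append((c, "cites findings absent from the record"))
        else:
            packet.accepted.append(c)
    return packet
```

The design choice worth noting is that the wrapper never ranks or finalizes anything itself. It only filters out candidates that cannot be traced back to documented evidence and hands the rest to a physician, which matches the "decision support, not autonomous judgment" framing.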