Physicians received unexpected news recently when a small study published in JAMA Network Open found that ChatGPT-4 had outdone us at our own craft: diagnosis, what some have called the physician’s “most important procedure.” The study’s large language model (LLM) outperformed doctors, achieving a 90% diagnostic reasoning score on challenging cases while physicians — even those working with the chatbot — reached only 76%.
But more striking than the numbers was what the study revealed about what we know, and don’t know, about how doctors think. Why would GPT alone trump physician plus GPT? The result raised a troubling, if not existential, question: What is actually going on when physicians make a diagnosis, and how easily might we be replaced?
The study’s methodology might help explain these counterintuitive results. Human graders assessed diagnostic performance based not only on diagnostic accuracy but also on how well the respondent (bot or human) explained themselves, the subtle art of argument and counterargument called differential diagnosis. For many doctors, who go by experience, instinct, or intuition (“I just knew”), such explanations may be hard to produce. And of the doctors paired with GPT, many remained anchored to their initial impressions when presented with AI-generated alternatives, treating the technology more like a search engine than a collaborator.
I’m a practicing physician and historian of diagnosis, and I’ve studied how medicine has grappled with new technologies. I found these results both fascinating and familiar, but found the surge of responses to the study, many staging us in a man vs. machine battle — “chatbots defeated doctors” and “what’s the prognosis for medics” — even more so.
There is a long history of new diagnostic technologies prompting declarations about the future of medical practice. In the 1820s, the stethoscope’s ability to reveal hidden sounds within the body was dismissed by some and led others to predict the end of the intimate relationship between doctor and patient. A century later, the X-ray’s capacity to penetrate flesh raised alarms about a purely mechanical approach to diagnosis. More recently, as I’ve shown in research with colleagues, the 20th-century introduction of computerized health records was meant to systematize medical thinking and free physicians up to be at the bedside; instead, many physicians report that the electronic health record (EHR) disrupts patient care and imposes a significant administrative burden.
Each innovation eventually found its place not as a replacement for clinicians, but as a tool within medicine’s broader interpretive framework and contributor to an increasingly complex system.
We tend to tell binary stories. The physician-vs.-machine one tends to focus on a pattern of predicted replacement rather than thoughtful integration. It’s patterned on science fiction, where machine overlords enslave or supplant humans. The reality is more banal but may be more consequential, as we end up with tools that wield us or that sit uneasily within an increasingly clunky system, prompting the question: For whom are they — and we — meant to work? Who benefits? Who suffers?
One of the reasons this recent study has prompted such a strong response, especially from physicians, is how much we prize diagnosis itself. What is it about diagnosis? We could argue that the prizing of diagnostic ability dates back to the ancients (the Hippocratics famously led with prognosis, but you cannot prognose without diagnosing), but it was really the post-Enlightenment shift to pathological anatomy and disease localization, with figures such as Giovanni Morgagni and the Parisian and Edinburgh schools, that centered diagnosis in the core identity of a physician.
Our metaphors reveal how we think of ourselves and our work, what we are complaining about, and what we believe makes us matter. Listen on the wards or in clinic, and a number of shorthands emerge: “chasing all leads,” “inputting data,” “chart bloat,” “technician,” “historian,” “educator.” Running through them is the idea of detection — the ability to piece together subtle clues into coherent diagnoses. This tradition stretches from the morgues and clinics of post-revolutionary Paris to modern academic medical centers, shaping not just how we practice but how we understand ourselves as physicians.
Consider how 19th-century physicians approached their work. In European clinics, a systematic method developed in which physicians simultaneously observed and documented patients’ symptoms and signs while — as Michel Foucault famously argued — cultivating a “medical gaze” that selectively retrieved pieces of data from the person’s holistic story and slotted them into a biomedical paradigm. The metaphor of physician-as-detective emerged alongside new diagnostic techniques such as R.T.H. Laennec’s mediate auscultation and Carl Wunderlich’s thermometry. When Joseph Bell, the Edinburgh surgeon who inspired Sherlock Holmes, taught students to observe and deduce, he wasn’t just teaching technique; he was encoding an entire philosophy of clinical reasoning that became part of an enduring fantasy with actionable consequences. And while physicians do many other things — including but not limited to systems thinking, conversation, treatment decision-making, team collaboration, historical and narrative reconstruction — many of us are drawn to being detectives.
With new concerns about artificial intelligence outperforming physicians in diagnosis, we can see why we’d start to wonder whether medicine’s detective work is becoming obsolete, heralding a future where algorithms, not clinicians, solve cases.
AI forces us to reconsider the real nature of diagnosis. The detective and other metaphors, while resilient, important, and useful, have always seemed to miss the mark slightly. After all, we are not trying to catch a murderer but improve people’s lives. The way we talk about diagnosis reveals this ongoing tension between systematic and interpretive approaches. For one, diagnosis is both label and process. Historian Charles Rosenberg famously described the “tyranny of diagnosis,” and medical sociologist Annemarie Jutel has talked about how powerful it is to “put a label on” someone’s illness.
But we have struggled to encapsulate diagnosis as process, to understand what it truly is. Narrative scholar Kathryn Montgomery argues that it is a complex interplay of evidence gathering and interpretation. Diagnosis has always required more than pattern recognition — the area where AI has us beat. It isn’t just about matching symptoms to diseases. When I see patients, I’m not simply processing data points but interpreting a narrative shaped by cultural context, experience, and human interaction. I am operating in a diagnostic environment which includes the social world. Research on diagnostic disparities has shown how crucial this interpretive element is — many patients, especially those who are vulnerable and from marginalized communities, have historically lost out or become subjects of medical surveillance without benefit when viewed purely as collections of symptoms rather than complex individuals enmeshed in complex environments.
Diagnostic health equity research is moving us beyond the in-clinic process to reveal how diagnosis operates at multiple levels — from broad societal influences to direct clinical interactions to hidden systemic factors. When we bring algorithms into the picture, there is good evidence to suggest that it’s not the algorithms themselves that are biased, but that they encode the gaps and inequities in our existing systems. For example, an influential health care algorithm exhibited racial bias by using medical costs as a proxy for health needs, causing Black patients to receive less care despite being sicker than white patients with equal algorithm scores. The algorithm reflected socioeconomic disparities in which Black patients historically incur lower costs at the same level of illness. Machine learning algorithms in health care can perpetuate or amplify existing biases found in medical devices, clinical interactions, and interventions, such as pulse oximeters that work poorly on darker skin, diagnostic disparities between racial groups, and gender-based differences in treatment quality.
A patient’s diagnostic journey can be dramatically different based on their starting point and the landscape they face. Consider how different types of diagnostic error emerge. Research on diagnostic disparities finds that mistakes often arise not just (not even mostly) from lack of medical knowledge, but from failures of system and structure (access, follow-up, digital literacy and equity) and failures of interpretation — missing crucial context, overlooking cultural factors, or failing to integrate seemingly unrelated symptoms into a coherent whole. AI systems excel at spotting patterns across vast datasets, but they can’t independently weigh the significance of a patient’s lived experience or cultural context. These disparities reveal why AI’s pattern recognition capabilities alone cannot solve our diagnostic challenges and why thoughtful integration that accounts for history, context, and multilevel factors will be so important. It won’t be as simple as plugging in a case and having GPT spit out solutions.
Today’s AI systems represent something different — not just a new diagnostic tool, but a new kind of medical reasoning altogether. The recent study revealed that ChatGPT succeeded not by mimicking physician thought processes, but by applying its own form of pattern recognition to medical knowledge. Most notably, it excelled at explaining its diagnostic reasoning, offering comprehensive analyses that many physicians, trained in more intuitive methods, struggled to match. As one physician noted, sharing a clinic with an LLM scribe is like having both a smart medical student observer and a senior attending supervisor rolled into one. What makes AI different is its ability to engage in what we might call meta-diagnosis — not just pattern matching symptoms to diseases, but explaining its reasoning in ways that can challenge and enhance physician thinking.
Yet the recent study shows we’re far from realizing this potential. When doctors in the study had access to ChatGPT, they often used it narrowly, asking specific questions rather than engaging with its full analytical capabilities. Even more telling, when the AI suggested alternative diagnoses, many physicians remained anchored to their initial impressions. This resistance isn’t simply about protecting professional turf. It is part of a long and dynamic interplay between technology, people, and the concept of clinical judgment that has always been more complex than simple replacement or resistance.
In my work I have found that each novel technology has prompted medicine to redefine — but not to abandon — its interpretive core. The stethoscope didn’t just let doctors hear hearts better; it created new ways of understanding bodies and changed how we listen. The X-ray didn’t just reveal hidden structures; it reshaped how we conceptualized disease itself. Even as the electronic health record fragmented patient stories into discrete data points, it created structures of information storage and access that could have significant impact at a population health level. Similarly, AI is already forcing us to think deeply about our cognitive processes and our use of language. Its algorithms hold up a mirror to what’s wrong in our system and who we are leaving out. What is prompt engineering but rhetoric, and what is a neural network but philosophy of mind?
The challenge isn’t choosing between human and machine intelligence but understanding how we can thoughtfully partner them. The art of diagnosis has always been about integration — of different types of knowledge, different ways of knowing, and different tools for discovery.
Educational reform is critical. Starting at the pre-health stage, medical education must evolve beyond its traditional focus on biomedical knowledge acquisition. As AI systems increasingly excel at pattern recognition and rapid information retrieval, today’s physicians-in-training will need to be sophisticated interpreters of narrative and context rather than repositories of rapidly shifting scientific evidence. They need to be sophisticated users of language. Don’t teach them how to code; teach them how to read.
This shift demands a renewed emphasis on humanistic skills in medical training — teaching doctors to be astute readers of both scientific literature and human stories, skilled writers who can articulate their reasoning clearly, and nuanced interpreters who can weave together biological, social, and cultural threads into coherent and clear narratives. Just as previous diagnostic technologies like the stethoscope and X-ray required new interpretive frameworks, AI will be most valuable when we develop new educational models that help physicians understand when to trust their contextual judgment and when to lean on AI’s analytical capabilities. This requires understanding what aspects of diagnosis remain essentially human. Rather than diminishing, the physician’s essential role is evolving — from information holder to meaning maker, from isolated detective to collaborative interpreter.
Referring to Charles Babbage’s 1830s proto-AI “thinking machine,” Arthur Conan Doyle wrote to Joseph Bell that Sherlock Holmes was “as inhuman as a Babbage’s Calculating Machine and just about as likely to fall in love.” Yet Holmes became far more than a pure analytical engine. Across 60 stories, the detective displays not only love but loyalty, intuition, and kindness. He relies not just on his own analytical powers but on a network of informants, specialists, and technologies. Most of all, he relies on his friendship with Dr. Watson, an enduring literary partnership that hints at collaboration over solitary analysis.
Lakshmi Krishnan is a physician-scholar and founding director of medical humanities at Georgetown, where she researches the intersection of medical knowledge, culture, and practice.