Mass General Brigham Evaluates ChatGPT and Other Chatbots’ Medical Advice


This article looks at a study from Mass General Brigham, published in JAMA Network Open. The research team wanted to see how well general-purpose AI chatbots, like ChatGPT, Grok, Claude, and Gemini, could come up with a differential diagnosis using just basic patient info—and how they’d do with more detailed clinical data.

They tested 21 large language models using 29 published clinical cases. These cases covered common issues like heart failure and ectopic pregnancy.

The results? There’s a pretty big gap. Chatbots really struggle to start a case when all they have is sparse information, but their accuracy jumps way up once you give them physical exam findings and lab results.

Key findings: AI in initial triage under sparse data

When the AI models got only age, gender, and symptoms, they missed the correct differential diagnosis over 80% of the time. But after adding exam findings and test results, the models nailed the correct diagnosis more than 90% of the time.

So, AI shines when it has all the data, but it flounders at the open-ended beginning of a case when key info is missing.

Lead author Arya Rao pointed out that these models are great at zeroing in on a final diagnosis once you’ve handed them everything they need. Still, they’re just not reliable when they have to start from scratch with little to go on.

Marc Succi, the study’s co-lead, warned that bad advice early on could send patients down the wrong path—maybe even toward unnecessary invasive procedures or delays in care for serious problems like stroke.

Both researchers stressed the importance of clinician oversight. They say a “human in the loop” is needed to narrow down differentials by asking targeted questions, doing physical exams, and ordering the right tests.

The human-in-the-loop is essential

To use AI responsibly in medical triage, the study calls for clinician stewardship. There needs to be a structured back-and-forth between AI outputs and real patient evaluation.

Clinicians should review what the AI suggests, then dig deeper with focused questions, physical exams, and selective testing to get the diagnosis right.

This approach helps avoid misdirection from AI during the most uncertain moments and keeps professional medical judgment front and center.

Some practical steps? Prioritize the most important questions—like risk factors or how symptoms started and changed. Perform the essential physical exams. Order only the tests that’ll actually help confirm or rule out dangerous conditions.

Blending AI input with guideline-based care and clinician experience keeps the diagnostic process focused on the patient and safety.

Real-world application: Care Connect at Mass General Brigham

Mass General Brigham has already brought AI into its workflow through Care Connect. This AI-driven intake tool screens patients, checks records, and schedules telehealth visits.

Crucially, Care Connect doesn’t make diagnoses—its job is to get patients in front of a clinician faster and make sure they receive care promptly.

Rajesh Patel from MGB made it clear that only real clinicians handle diagnoses and treatment decisions. This keeps the boundary strong between triage support and actual clinical judgment.

Implications for patients and healthcare systems

  • AI as decision-support works best when it helps clinicians make choices, not when it acts as the sole diagnostician. This approach keeps care safer and more accurate.
  • Guardrails and governance matter a lot. Clinicians need to review AI-generated recommendations before doing anything with them.
  • Improved triage and access tools can speed up clinician assessment and testing. These tools might help avoid unnecessary procedures, which is always a win.
  • Continuous monitoring and updates keep AI models relevant as they encounter new patient populations and medical conditions. It’s not a set-it-and-forget-it situation—these systems need regular attention.
Here is the source article for this story: Mass General Brigham study explores medical advice by ChatGPT, other chatbots
