November 03, 2024
Systematic Review Reveals Gaps in Evaluating Large Language Models for Healthcare
A recent systematic review published in JAMA examined how large language models (LLMs) are tested in healthcare, analyzing 519 studies published between January 2022 and February 2024. Despite the technology's potential in areas such as diagnostics and patient communication, the researchers found that only 5% of these studies used real patient care data, underscoring a gap in real-world testing that may hinder effective clinical implementation.
The review found that most studies focused on accuracy in narrow tasks, chiefly answering medical exam questions, while tasks such as summarizing clinical notes and handling conversational interactions were largely neglected. Administrative applications, including billing and prescription generation, were also underrepresented. Moreover, metrics essential for safe clinical deployment, including fairness, bias, and toxicity, were assessed in less than 16% of the studies, pointing to the need for a broader range of evaluation criteria.
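To make the review's recommendation concrete, here is a minimal, hypothetical sketch of what a multi-dimensional evaluation record could look like. Everything in it, including the EvalResult structure, the dimension list, and the example scores, is an illustrative assumption, not code or data from the review.

```python
# Hypothetical sketch of a multi-dimensional evaluation record for an LLM
# healthcare task. All names (EvalResult, DIMENSIONS, the example scores)
# are illustrative placeholders, not an API or data from the review.
from dataclasses import dataclass, field

# Dimensions the review found under-measured: most studies report accuracy
# alone, while fairness, bias, and toxicity appear in under 16% of studies.
DIMENSIONS = ["accuracy", "fairness", "bias", "toxicity"]

@dataclass
class EvalResult:
    task: str                      # e.g. "clinical note summarization"
    uses_real_patient_data: bool   # only about 5% of reviewed studies did
    scores: dict = field(default_factory=dict)  # dimension -> score in [0, 1]

    def coverage(self) -> float:
        """Fraction of the target dimensions this evaluation actually measured."""
        return sum(d in self.scores for d in DIMENSIONS) / len(DIMENSIONS)

# A typical study profile per the review: exam-question accuracy only,
# no real patient care data. Scores are made up for illustration.
narrow = EvalResult(task="medical exam QA", uses_real_patient_data=False,
                    scores={"accuracy": 0.86})

# The broader evaluation the authors call for: real data, more dimensions.
broad = EvalResult(task="clinical note summarization", uses_real_patient_data=True,
                   scores={"accuracy": 0.81, "fairness": 0.74,
                           "bias": 0.69, "toxicity": 0.98})

print(f"narrow study covers {narrow.coverage():.0%} of dimensions")  # 25%
print(f"broad study covers {broad.coverage():.0%} of dimensions")    # 100%
```

The point of the sketch is simply that evaluation coverage can be tracked as explicitly as accuracy itself; a study reporting only exam-question accuracy would score low on such a coverage measure, mirroring the gap the review describes.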