In a study recently published in NPJ Digital Medicine, researchers developed diagnostic reasoning prompts to investigate whether large language models (LLMs) can simulate diagnostic clinical reasoning.
LLMs, artificial intelligence-based systems trained on large amounts of text data, are known for human-like performance on tasks such as writing clinical notes and passing medical exams. However, understanding their clinical diagnostic reasoning abilities is crucial for their integration into clinical care.
Recent studies of open-ended clinical questions indicate that newer LLMs such as GPT-4 have the potential to identify the diagnoses of complex patients. Because LLM performance varies with prompt and question type, prompt engineering has begun to address this variability.
About the study
In the present study, researchers assessed GPT-3.5 and GPT-4 diagnostic reasoning on open-ended clinical questions, hypothesizing that GPT models prompted with diagnostic reasoning prompts would outperform conventional chain-of-thought (CoT) prompting.
The team used modified MedQA United States Medical Licensing Examination (USMLE) datasets and a New England Journal of Medicine (NEJM) case series to compare conventional chain-of-thought prompting with diagnostic reasoning prompts modeled after the cognitive processes of differential diagnosis formation, analytical reasoning, Bayesian inference, and intuitive reasoning.
They investigated whether large-language models could mimic clinical reasoning skills using specialized prompts combining clinical skills with advanced prompting techniques.
The team used prompt engineering to create diagnostic reasoning prompts, eliminating the multiple-choice options and converting questions to free responses. They included only Step II and Step III questions from the USMLE dataset whose answers were patient diagnoses.
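The conversion described above can be sketched as follows. This is a hypothetical illustration: the field names (`stem`, `choices`, `answer_idx`) and the rephrased question text are assumptions for demonstration, not the actual MedQA schema or the study's wording.

```python
# Hypothetical sketch of converting a multiple-choice MedQA-style item into a
# free-response question, so the model must produce a free-text diagnosis
# rather than pick from listed options.
def to_free_response(item: dict) -> dict:
    """Drop the answer choices and rephrase the item as an open-ended question."""
    return {
        "question": item["stem"] + "\nWhat is the most likely diagnosis?",
        "answer": item["choices"][item["answer_idx"]],  # keep gold label for grading
    }

# Invented example item for illustration only.
item = {
    "stem": "A 54-year-old man presents with crushing substernal chest pain ...",
    "choices": ["Myocardial infarction", "GERD", "Acute pericarditis"],
    "answer_idx": 0,
}
converted = to_free_response(item)
```

Because the answer choices no longer appear in the prompt, graders must compare the model's free-text response against the retained gold label, which is why the study used physician raters rather than exact string matching.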
Each round of prompt engineering involved evaluating GPT-3.5 accuracy on the MedQA training set; the development and test sets, containing 95 and 518 questions, respectively, were reserved for final evaluation.
Researchers evaluated GPT-4 performance on 310 cases recently published in NEJM, excluding 10 that lacked a definitive final diagnosis or exceeded GPT-4's maximum context length. They compared conventional CoT prompting with the clinical diagnostic reasoning CoT prompt that performed best on the MedQA dataset (differential diagnosis reasoning).
Each prompt consisted of two example questions with rationales employing the target reasoning technique, i.e., few-shot learning. The study used free-response questions from the USMLE and NEJM case series to facilitate rigorous comparisons between prompting techniques.
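The few-shot structure described above can be sketched as below. The exemplar text is invented for illustration; the study's actual exemplars and instructions are not reproduced here.

```python
# Illustrative few-shot prompt assembly: two worked examples whose rationales
# use the target reasoning style (here, differential diagnosis formation),
# followed by the unanswered test question.
EXAMPLES = [
    ("A 62-year-old woman with fever and flank pain ...",
     "The differential includes pyelonephritis, nephrolithiasis, ... "
     "Final diagnosis: pyelonephritis."),
    ("A 28-year-old man with acute-onset dyspnea ...",
     "The differential includes pneumothorax, pulmonary embolism, ... "
     "Final diagnosis: spontaneous pneumothorax."),
]

def build_prompt(test_question: str) -> str:
    """Concatenate the two exemplars, then the test question with a blank answer."""
    parts = [f"Question: {q}\nAnswer: {rationale}" for q, rationale in EXAMPLES]
    parts.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt("A 45-year-old woman presents with painless jaundice ...")
```

Swapping the exemplar rationales for analytical, Bayesian, or intuitive reasoning text yields the other prompt variants while keeping the overall structure fixed.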
The physician authors, an attending physician and an internal medicine resident, evaluated the language model responses, with each question scored by two blinded physicians; a third researcher resolved disagreements. Where necessary, the physicians verified answer accuracy using reference software.
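The grading scheme above reduces to a simple rule, sketched here under stated assumptions (this is an inferred reconstruction of the logic, not the authors' code):

```python
# Minimal sketch of blinded two-rater grading with third-party adjudication:
# if the two blinded raters agree, their shared verdict stands; otherwise the
# third researcher's verdict decides.
def final_grade(rater1: bool, rater2: bool, adjudicator: bool) -> bool:
    """Return the final correctness verdict for one model response."""
    return rater1 if rater1 == rater2 else adjudicator
```

Under this rule the adjudicator's opinion only matters for the small fraction of responses where the raters disagree, which is consistent with the high inter-rater agreement the study reports.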
The study revealed that GPT-4 prompts can simulate clinicians' reasoning without compromising diagnostic accuracy, which is crucial for assessing the accuracy of LLM responses and thereby increasing their trustworthiness for patient care. This approach can help overcome the black-box limitations of LLMs, bringing them closer to safe and effective use in medicine.
GPT-3.5 accurately answered 46% of assessment questions with standard CoT prompting and 31% with zero-shot non-CoT prompting. Among the clinical diagnostic reasoning prompts, GPT-3.5 performed best with intuitive reasoning (48% vs. 46%).
Compared with classic CoT, GPT-3.5 performed significantly worse with analytical reasoning (40%) and differential diagnosis (38%) prompts, while the decline with Bayesian inference (42%) did not reach significance. The team observed 97% inter-rater agreement for the GPT-3.5 MedQA evaluation.
The GPT-4 API returned errors for 20 test queries, limiting the test dataset to 498 questions. GPT-4 demonstrated higher accuracy than GPT-3.5, showing 76%, 77%, 78%, 78%, and 72% accuracy with classic chain-of-thought, intuitive reasoning, differential diagnosis reasoning, analytical reasoning, and Bayesian inference prompts, respectively. Inter-rater agreement for the GPT-4 MedQA evaluation was 99%.
On the NEJM dataset, GPT-4 scored 38% accuracy with conventional CoT versus 34% with differential diagnosis formulation (a 4.2% difference). Inter-rater agreement for the GPT-4 NEJM evaluation was 97%. Prompts that promote step-by-step reasoning and focus on a single diagnostic reasoning strategy performed better than those combining multiple strategies.
Overall, the results showed that GPT-3.5 and GPT-4 displayed improved reasoning ability but not improved accuracy. GPT-4 performed similarly with conventional and intuitive reasoning CoT prompts but worse with analytical and differential diagnosis prompts, while Bayesian inference prompting also performed worse than classic CoT.
The authors propose three explanations for the difference: GPT-4's reasoning process may be fundamentally different from that of human providers; it may produce post-hoc explanations of its diagnostic evaluations in the requested reasoning format; or it may already achieve maximum accuracy with the given vignette data.