This year many teams will try to use generative artificial intelligence to impart something like knowledge to programs. They will mostly do so using a rapidly expanding effort called “retrieval-augmented generation,” or RAG, in which large language models (LLM) seek outside input — while shaping their output — to expand what the neural network can do on its own.
RAG can make LLMs better at medical knowledge, for example, according to a report by Stanford University and colleagues. Published this week in NEJM AIA new journal published by the prestigious New England Journal of Medicine.
Also: MedPerf aims to accelerate medical AI by keeping data private
The RAG-enhanced version of the GPT-4 and other programs “showed a significant improvement in performance compared to the standard LLM” when answering novel questions written by board-certified clinicians, the report’s lead author, Cyril Zakka, and colleagues.
The authors argue that RAG is a key element in the safe deployment of general AI in the clinic. Even programs designed expressly for medical knowledge, including training in medical information, fall short of that goal, they claim.
Programs like Google DeepMind’s MedPaLM, an LLM tuned to answer questions on various medical datasets, the authors write, still suffer from hallucinations. Also, their responses “do not accurately reflect clinically relevant tasks.”
RAG is important because the alternative is retraining LLMs to keep pace with ever-changing medical knowledge, a task “that can quickly become prohibitively expensive in the billion-parameter scale”, they claim.
The study breaks new ground in several ways. First, it created a new system — called an almanac — to retrieve medical information. The Almanac program retrieves medical background data using metadata from a 14-year-old medical reference database MDCalc.
Second, Zakka and colleagues compiled a brand new set of 314 medical questions called ClinicalQA, “spanning several medical specialties with topics ranging from principles of medicine to clinical calculations.” The questions were written by eight board-certified physicians and two physicians were tasked with writing “as many questions as you can in your skill area related to your daily clinical responsibilities.”
Also: Google’s MedPaLM emphasizes human physicians over medical AI
A new set of questions aims to avoid the phenomenon where programs trained on medical databases copy pieces of information that later appear on medical tests like MedQ, such as memorizing answers to a test. As Zakka and team put it, “Data sets intended for model evaluation may end up in training data, making it difficult to objectively evaluate models using the same criteria.”
ClinicalQA questions are more realistic because they are written by medical professionals, the team claims. “US medical licensing exam-style questions fail to capture the full scope of actual clinical situations encountered by medical professionals,” they write. “They often depict patient situations as neat clinical vignettes, bypassing the complex series of microdecisions that constitute actual patient care.”
The study presented an experiment known as a “zero-shot” task in AI, where a language model is used without modification and without examples of correct and incorrect answers. It’s a method that’s supposed to test what’s called “context learning,” the ability of a language model to acquire new abilities that weren’t present in its training data.
Also: 20 things to consider before rolling out an AI chatbot to your customers
Almanac operates OpenAI’s GPT-4 by connecting it to a program called a browser that navigates to web-based sources to perform RAG operations based on guidelines from MDCalc metadata.
Once a query is found in the medical data, a second Almanac program called a retriever passes the result to GPT-4, which turns it into a natural-language answer to the query.
Almanac’s responses using GPT-4 were compared to responses from plain-vanilla ChatGPT-4, Microsoft’s Bing, and Google’s Bird, with no changes to these programs as a baseline.
All answers are graded by human doctors for factuality, completeness, “likability” – that is, how desirable the answers were relative to the question – and safety in the event of “adversary” attempts to shut down the programs. To test the attack’s resistance, the authors inserted misleading text into 25 queries designed to satisfy the program to “create false output or more advanced scenarios designed to bypass artificial protections.”
Also: AI pioneer Daphne Koller sees generative AI leading to cancer breakthroughs
The human judges did not know which program was submitting which response, the study notes, to prevent them from expressing bias toward one of the programs.
Almanac, they are concerned, outperforms the other three with average scores of 67%, 70% and 70% out of 100 for realism, completeness and likability respectively. This compares with answer scores between 30% and 50%. the other three
The programs also had to include a citation from which the data was drawn, and the results were eye-opening: Alamance scored highly, with 91% correct citations. There seem to be three other fundamental flaws.
“Bing achieved 82% performance due to unreliable sources, including personal blogs and online forums,” write Zakka and team. “While the ChatGPT-4 citations were mostly littered with non-existent or unrelated web pages, Bird either relied on his inside knowledge or refused to cite sources, despite requests to do so.”
In countering the opponent’s prompts, they found that Almanac “dramatically outperformed” the others, answering 100% correctly, although it sometimes refused to answer.
Also: AI is surpassing our best weather forecasting technology
Again, there were idiosyncrasies. Google’s Bird often gives both a correct answer and a false answer when prompted by an opponent’s prompt. ChatGPT-4 was the worst by a wide margin, getting exactly 7% of questions in the adversarial setting, largely because it would answer with incorrect information instead of abstaining entirely.
The authors note that there is much work to be done to “optimize” and “fine-tune” the almanac. The program is “limited to effective ranking data sources by criteria such as evidence level, study type, and publication date.” Also, relying on a handful of human judges doesn’t scale, they note, so future projects should try to automate the evaluation.