Why open source generative AI models are still a step behind GPT-4

Human kidneys glow blue. (Image: MagicMine/Getty Images)

One of the most heated debates in generative artificial intelligence (AI) is open source versus closed source: which will prove more valuable?

On the one hand, a plethora of open-source large language models (LLMs) is continuously being produced by a growing constellation of contributors, led by the most prestigious open-source model to date, Meta’s Llama 2. On the other side are two of the most well-established commercial, closed-source programs: OpenAI’s GPT-4 and venture-backed startup Anthropic’s language model, known as Claude 2.

Also: I’m taking the free AI Image course on Udemy with this little trick – and you can too

One way to test these programs against each other is to see how well they do at answering questions in a specific field, such as, say, medical knowledge.

On that basis, Llama 2 is terrible at answering questions in the field of nephrology, the science of the kidney, according to a recent study by scientists at Pepperdine University, the University of California, Los Angeles, and UC Riverside, published this week in NEJM AI, a new journal from the prestigious New England Journal of Medicine.

Also: Best AI Chatbot: ChatGPT and Other Notable Alternatives

“Compared to GPT-4 and Claude 2, the open-source models performed worse in terms of absolute correct answers and the quality of their explanations,” wrote lead author Sean Wu of Pepperdine’s Keck Data Science Institute and colleagues.


Pepperdine University scholars converted nephrology questions into prompts to feed into a bunch of large language models, including Llama 2 and GPT-4.

New England Journal of Medicine

“GPT-4 performed exceptionally well and achieved human-like performance for most subjects,” they wrote. It achieved a score of 73.3%, just below the 75% mark that counts as a passing grade for humans answering the multiple-choice nephrology questions.

“The majority of open-source LLMs achieved an overall score that did not differ from what would be expected if the questions were answered randomly,” they wrote, noting that Llama 2 was the best of the five open-source models, which included Vicuna and Falcon. Llama 2 scored 30.6%, above the random-guess level of 23.8%.

Also: Five ways to use AI responsibly

The study was an experiment in what is known in AI as a “zero-shot” task, where a language model is used as-is, with no modifications and no examples of right and wrong answers. Zero-shot testing is meant to probe “in-context learning,” the ability of a language model to acquire new abilities that were not present in its training data.
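
To make the setup concrete, here is a minimal sketch of what a zero-shot multiple-choice prompt can look like; the wrapper function and the sample question are illustrative inventions, not taken from the study.

```python
def build_zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Format a question zero-shot: the answer choices are shown, but no
    solved examples are included for the model to learn from in context."""
    letters = "ABCDE"
    options = "\n".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return (
        "Answer the following nephrology question with the letter of the "
        f"single best choice.\n\n{question}\n{options}\nAnswer:"
    )

# Illustrative question -- not an actual exam item.
print(build_zero_shot_prompt(
    "Which electrolyte abnormality is most commonly caused by loop diuretics?",
    ["Hyperkalemia", "Hypokalemia", "Hypercalcemia", "Hyponatremia"],
))
```

A few-shot version of the same prompt would prepend several solved questions; zero-shot deliberately withholds them.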

In the tests, the models — Llama 2 and four other open-source programs, as well as the two commercial programs — were each fed 858 nephrology questions from nephSAP, the Nephrology Self-Assessment Program, a publication of the American Society of Nephrology that physicians use for self-study in the field.

Also: Google’s AI image generator finally rolls out to the public – how to try it

Converting nephSAP’s plain-text files into prompts that could be fed into the language models required significant data preparation by the authors. Each prompt consists of a natural-language question and its multiple-choice answers. (The data set is posted on Hugging Face for others to use.)
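
As a rough illustration of that kind of preparation, the sketch below parses a hypothetical plain-text layout into question records; the real nephSAP files and the authors' pipeline surely differ.

```python
import re

# Hypothetical plain-text layout: numbered question, lettered choices,
# blocks separated by blank lines. Not the actual nephSAP format.
RAW = """\
1. Which hormone is produced primarily by the kidney?
A. Insulin
B. Erythropoietin
C. Glucagon

2. Which nephron segment reabsorbs the most sodium?
A. Proximal tubule
B. Collecting duct
C. Loop of Henle
"""

def parse_questions(text: str) -> list[dict]:
    """Split plain-text question blocks into {question, choices} records
    that can then be formatted into prompts."""
    records = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        question = re.sub(r"^\d+\.\s*", "", lines[0])
        choices = [re.sub(r"^[A-E]\.\s*", "", ln) for ln in lines[1:]]
        records.append({"question": question, "choices": choices})
    return records

print(parse_questions(RAW)[0])
```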

And because GPT-4, Llama 2, and the rest often produce long passages of text as their answers, the authors also had to develop automated techniques to parse each model’s answer to every question and compare it against the correct answer, so that results could be scored automatically.
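
A simple version of that parsing step might extract the first lettered choice from each free-text answer and compare it with the answer key; this regular-expression approach is a guess at the general technique, not the authors' actual code.

```python
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first stand-alone letter A-E out of a free-text answer."""
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else None

def score(outputs: list[str], answer_key: list[str]) -> float:
    """Fraction of outputs whose extracted letter matches the key."""
    correct = sum(extract_choice(o) == k for o, k in zip(outputs, answer_key))
    return correct / len(answer_key)

# Two made-up free-text answers scored against a made-up key.
print(score(["The best answer is (B), because ...", "I'd go with D."],
            ["B", "C"]))  # -> 0.5
```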

There are many possible reasons why the open-source models performed worse than GPT-4, but the authors suspect a key one is that Anthropic and OpenAI baked proprietary medical data into the training of their programs.

“GPT-4 and Claude 2 were trained not only on publicly available data, but also on third-party data,” they write.

“High-quality data for training LLMs in the medical field often reside in nonpublic materials that have been curated and peer reviewed, such as textbooks, published articles, and curated datasets,” note Wu and team. “Without denying the importance of the computational capabilities of specific LLMs, the ability to access medical training data material not currently in the public domain will likely be a key factor in determining whether the performance of specific LLMs improves in the future.”

Also: MedPerf aims to accelerate medical AI by keeping data private

Clearly, with GPT-4 scoring two points below the human passing grade, not just the open-source models but all language models have huge room for improvement.

Happily for the open-source crowd, efforts are underway to help even the odds when it comes to training data.

One of these efforts is a broader movement called federated training, where language models are trained locally on private data, but then the results of that training are contributed to an overall effort in the public cloud.
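
At its core, that scheme is federated averaging: each site trains on data it never shares, and only the parameter updates travel to a central server. The toy NumPy sketch below uses a linear least-squares model as a stand-in; real federated systems add secure aggregation, privacy protections, and far larger models.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(global_weights: np.ndarray, private_data: np.ndarray,
                 lr: float = 0.1) -> np.ndarray:
    """One site's local training step: a gradient computed on private data.
    The raw data never leaves the site; only the updated weights do."""
    X, y = private_data[:, :-1], private_data[:, -1]
    grad = X.T @ (X @ global_weights - y) / len(y)
    return global_weights - lr * grad

def federated_average(site_weights: list[np.ndarray]) -> np.ndarray:
    """Server-side step: average the sites' weights into a new global model."""
    return np.mean(site_weights, axis=0)

# Three "hospitals", each holding its own private data, train one shared model.
weights = np.zeros(3)
sites = [rng.normal(size=(20, 4)) for _ in range(3)]
for _ in range(10):  # ten federated rounds
    updates = [local_update(weights, data) for data in sites]
    weights = federated_average(updates)
print(weights)
```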

This approach could be a way to bridge the divide between confidential data sources in medicine and the collective push to strengthen open-source foundation models. A prominent effort in that area is the MLCommons industry consortium’s MedPerf project, which began last year.

It is also possible that some commercial models will be distilled into open-source programs that inherit certain medical skills from the parent. For example, Google DeepMind’s Med-PaLM is an LLM tuned to answer questions from various medical datasets, including a brand-new one invented by Google that represents the health questions consumers ask on the internet.
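
In its simplest, sequence-level form, distillation just means generating answers with the commercial "teacher" and fine-tuning the open "student" on them. The sketch below is hypothetical; teacher_answer() stands in for a call to a commercial API.

```python
def teacher_answer(question: str) -> str:
    """Placeholder for querying a commercial teacher model; a real
    pipeline would call GPT-4 or a similar API here."""
    return "(B) Erythropoietin"  # canned stand-in response

def build_distillation_set(questions: list[str]) -> list[dict]:
    """Pair each question with the teacher's answer, producing
    {prompt, completion} records for supervised fine-tuning of an
    open-source student model such as Llama 2."""
    return [{"prompt": q, "completion": teacher_answer(q)} for q in questions]

print(build_distillation_set(["Which hormone is produced by the kidney?"]))
```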

Also: Google’s Med-PaLM emphasizes human physicians over medical AI

Even without training a program on medical knowledge, output can be improved with “retrieval-augmented generation,” a method in which an LLM fetches outside information as it builds its output, supplementing what the neural network can do on its own.
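
Here is a minimal sketch of the retrieval step, using naive word overlap in place of the embedding index a production system would use; the two-document corpus is a placeholder.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k.
    Real systems use embeddings and a vector index instead."""
    q = tokens(query)
    return sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved passages so the model can ground its answer."""
    context = "\n".join(retrieve(query, documents))
    return f"Use these references to answer.\n{context}\n\nQuestion: {query}"

# Placeholder corpus; in practice these would be textbook or journal passages.
docs = ["Erythropoietin is produced mainly by the kidney.",
        "Insulin is produced by the pancreas."]
print(build_rag_prompt("Which organ produces erythropoietin?", docs))
```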

Whichever approach wins out, the open nature of Llama 2 and the other models allows many parties to improve the programs, unlike commercial programs such as GPT-4 and Claude 2, whose inner workings are disclosed only at their corporate owners’ discretion.




