Hugging Face releases a benchmark for testing generative AI on health tasks

Generative AI models are increasingly being introduced into healthcare – perhaps prematurely in some cases. Early adopters believe they will achieve greater efficiencies while revealing insights that would otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to poorer health outcomes.

But is there a quantitative way to know how useful or harmful a model can be when tasked with things like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, proposes a solution: a newly released benchmark called Open Medical-LLM. Created in collaboration with researchers from the nonprofit Open Life Science AI and the University of Edinburgh’s Natural Language Processing Group, Open Medical-LLM aims to standardize the evaluation of generative AI models’ performance on a range of medical-related tasks.

Open Medical-LLM is not a from-scratch benchmark in itself, but rather an amalgamation of existing test sets – MedQA, PubMedQA, MedMCQA and so on – designed to probe models for general medical knowledge and related areas such as anatomy, pharmacology, genetics and clinical practice. The benchmark includes multiple-choice and open-ended questions that require medical reasoning and understanding, drawing on materials such as US and Indian medical licensing exams and college biology test question banks.
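
As a rough illustration of what this style of evaluation looks like in practice, the sketch below pulls a question from MedMCQA (one of the test sets the benchmark draws on) via the Hugging Face `datasets` library and renders it as a multiple-choice prompt. The dataset ID, split and field names (`question`, `opa`–`opd`, `cop`) are assumptions based on the publicly hosted MedMCQA dataset, not part of Open Medical-LLM’s own evaluation harness.

```python
# Minimal sketch: sample a question from MedMCQA (one of the test sets Open
# Medical-LLM aggregates) and render it as a multiple-choice prompt.
# The dataset ID, split and field names are assumptions about the public
# "medmcqa" dataset on the Hugging Face Hub, not the official benchmark code.
from datasets import load_dataset

dataset = load_dataset("medmcqa", split="validation")  # assumed dataset ID and split

def format_question(example):
    """Turn one MedMCQA row into a prompt string plus its gold answer letter."""
    options = [example["opa"], example["opb"], example["opc"], example["opd"]]
    lines = [f"Question: {example['question']}"]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
    lines.append("Answer with a single letter (A-D).")
    gold = "ABCD"[example["cop"]]  # `cop` holds the index of the correct option
    return "\n".join(lines), gold

prompt, gold = format_question(dataset[0])
print(prompt)
print("Expected answer:", gold)
```

A full evaluation would then score a model’s generated answers against the gold letters across every constituent test set, which is roughly what the leaderboard’s rankings summarize.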

“[Open Medical-LLM] allows researchers and practitioners to identify the strengths and weaknesses of different approaches, stimulate further progress in the field, and ultimately contribute to better patient care and outcomes,” Hugging Face wrote in a blog post.

Image credits: Hugging Face

Hugging Face positions the benchmark as a “robust assessment” of generative AI models for healthcare. But some medical experts on social media cautioned against putting too much stock in Open Medical-LLM, lest it lead to ill-informed implementations.

The gap between the contrived environment of medical question-answering and actual clinical practice, they noted, can be quite large.

Hugging Face researcher Clémentine Fourrier, co-author of the blog post, agreed.

“These rankings should only be used as a first approximation of which [generative AI model] to explore for a particular use case, but then a deeper testing phase is always needed to examine the model’s limits and relevance in real-world conditions,” Fourrier replied on X. “Medical [models] should absolutely not be used by patients on their own, but should instead be trained as supporting tools for physicians.”

It’s reminiscent of Google’s experience when it tried to bring an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.

Google created a deep learning system that scanned images of the eye, looking for evidence of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of harmony with on-the-ground practices.

Significantly, of the 139 AI-related medical devices the US Food and Drug Administration has approved to date, none use generative AI. It is exceptionally difficult to test how the performance of a generative AI tool in the laboratory will translate to hospitals and outpatient clinics, and, perhaps more importantly, how the outcomes might evolve over time.

That is not to say that Open Medical-LLM is not useful or informative. If anything, its rankings serve as a reminder of just how poorly models answer fundamental health questions. But neither Open Medical-LLM nor any other benchmark is a substitute for carefully thought-out real-world testing.