OpenAI, the company behind ChatGPT, has released HealthBench, a new benchmark for measuring AI outputs in healthcare use cases. The company says it built the benchmark in partnership with 262 physicians across 60 countries, who developed 5,000 conversations, each with a customized “rubric” for grading the efficacy and quality of model responses.
The company said the benchmark rests on its view that evaluations of AI systems in healthcare should be:
Meaningful: scores should reflect real-world impact in realistic clinical scenarios, not just performance on exam questions
Trustworthy: scores should reflect the standards that trained medical professionals would themselves prioritize
Unsaturated: current models should still have substantial room to improve on the benchmark, so that it continues to drive progress
The company also announced last week that it would acquire Jony Ive’s startup IO for $6.5 billion as it makes inroads into hardware and devices. Ive is best known for his design work on the original iPhone and other flagship products from Apple’s early move into mobile. The acquisition signals OpenAI’s formal commitment to building a device that could integrate its AI work. Very little is known about what the device may be, but many are speculating that it will be “unobtrusive [and] fully aware of a user’s life and surroundings.”
Why is all of this important?
The intersection of healthcare and AI is growing rapidly across the ecosystem, especially as technology companies and large hyperscalers invest billions of dollars to build models specifically for healthcare use cases. New hardware and devices add another layer to this trend: users will be able to use these devices to interact with their surroundings, track their day-to-day health metrics more closely and have a true “intelligent companion,” almost akin to having a live concierge clinician with them at all times.
Take, for example, Meta, which has created one of the most successful open-source models with Llama. Earlier this month, the company released a case study examining how a major health system (MHS) used the Llama 3.1 8B model to generate clinical documentation and ease workflows. Specifically, the model was used to “reduce time spent abstracting data from electronic health records (EHRs) while maintaining patient confidentiality” and to alleviate manual clinical annotation and chart review. The study found that the platform cut manual annotation by roughly 70-80%, creating the potential for nearly $176 in savings per patient record. Scaled across large healthcare systems over multiple years, this could potentially lead to billions of dollars saved and thousands of hours recovered for clinical staff. Additionally, Meta’s much-anticipated Orion glasses product line has significant potential to augment human health capabilities.
Another example is Google’s Med-PaLM large language model. The original version of the model drew wide attention for scoring more than 60% on U.S. Medical Licensing Exam (USMLE)-style questions. Since then, the company has made significant progress, and Med-PaLM 2 scored 86.5% on medical benchmark tests. Last week, Google also introduced its latest MedGemma model, which has even stronger comprehension capabilities for medical text and images. Google has worked with numerous healthcare organizations and systems to deploy its models across a variety of use cases, ranging from clinical documentation and workflow optimization to agentic uses and task automation. Google also announced its own upcoming line of AI-powered glasses, built on its Android XR platform.
Indeed, the landscape as a whole is growing immensely. A paper published in Nature in 2023 describes the impact that medically tuned large language models will have on medicine: “LLMs have the potential to improve patient care by augmenting core medical competencies such as factual knowledge or interpersonal communication skills.” Specifically, the paper documents several areas that are already capturing significant value from these advanced models: augmenting communication with patients, conveying complex medical information more clearly, collating and summarizing data from a variety of sources and formats, and supporting medical research, which often requires analyzing large volumes of data to generate meaningful and concise insights.
OpenAI’s push with HealthBench, and the larger industry push towards creating broader device ecosystems, stands to advance healthcare and societal health outcomes, provided it is done in a safe, well-tested and patient-centered manner.