OpenAI is making a serious push into the healthcare sector, with the release of a new benchmark called HealthBench, designed to evaluate the capabilities of AI systems in health.
The benchmark aims to help large language models (LLMs) support patients and clinicians with health discussions that are trustworthy, meaningful, and open to continuous improvement. HealthBench looks at seven key areas, including emergency care, managing uncertainty, and global health.
“What if you had a world-class doctor in your pocket, 24/7, at no cost? That’s the promise of AI in healthcare, but mistakes can be catastrophic. That’s why OpenAI launched HealthBench, a new benchmark to test how well AI models handle real, complex medical conversations,” Matthew Berman, CEO of Forward Future, wrote on X.
Developed in partnership with 262 physicians from 60 countries, HealthBench includes 5,000 realistic health-related conversations, each paired with a custom physician-created rubric for grading model responses.
OpenAI shared in its blog that it used HealthBench to evaluate how well its latest models perform on healthcare tasks. According to the company, recent models have improved quickly, with o3 outperforming others, including Claude 3.7 Sonnet and Gemini 2.5 Pro (March 2025 version) in the tests.
OpenAI also mentioned that small models have gotten much better lately. GPT‑4.1 nano, for example, beats the August 2024 GPT‑4o model—even though it’s 25 times less expensive.
Compared to written responses from doctors, LLMs were found to write better answers for many of the instances. By April this year, the newest models had reached a point where physician responses no longer improved the quality of the answers.
Online, many users have shared stories of how ChatGPT helped them make sense of complicated health problems, ranging from chronic back pain to unexplained jaw issues.
“I’ve had half a dozen healthcare-related issues in my family over the last few months, and ChatGPT has been more helpful than the physician…,” said Joe Flaherty, a former Wired staff writer, in a post on X.
“ChatGPT outperforms human doctors for me. It diagnosed a condition I have and recommended the correct treatment after two human specialists failed. Perfect use-case for LLMs as it requires knowledge & pattern matching,” another user said on X.
However, experts warn of the over-dependence on AI. “Using artificial intelligence for diagnosis and even for prescriptions, one has to be really cautious, because physical examination is missing,” Dr CN Manjunath, senior cardiologist and director of the Sri Jayadeva Institute of Cardiovascular Sciences and Research, Bengaluru, told AIM in an earlier interaction.
He further emphasised that, despite the widespread use of technology in healthcare, physical evaluation remains a cornerstone of accurate diagnosis. Though medications may alleviate symptoms, he advised always following up with a qualified medical practitioner for comprehensive care. He explained that once a particular diagnosis has been made, patients can follow up with ChatGPT.
OpenAI’s growing interest in healthcare is reflected in its job openings, which include roles such as health AI research engineer and healthcare software engineer.
This development comes against the backdrop of OpenAI appointing Fidji Simo as the CEO of applications, allowing Sam Altman to focus more on research, compute, and safety. Time and time again, Altman has reiterated that he is most excited about scientific discoveries with the help of AI.
“I’m personally most excited about AI for science at this point. I’m a big believer that the most important driver of the world and people’s lives getting better and better is new scientific discovery,” said Altman in a recent TED talk. He added that they hear from scientists about how the latest AI models have been making them more productive and impacting what they are able to discover.
“I deeply believe that AGI can extend human life by broadening trustworthy access to care and accelerating longevity research,” said Karina Nguyen, researcher at OpenAI, in a post on X.
Even Bryan Johnson, known for his radical approach to longevity and anti-ageing, weighed in on OpenAI’s development. He pointed out that AI-assisted physicians had outperformed human physicians without reference materials, adding that by April, the responses were so strong that physicians could no longer improve them.
Google is Stepping Up in Healthcare AI
OpenAI is not alone in focusing on healthcare. Google recently launched TxGemma, a new suite of open-source language models built to support therapeutic development. The models are intended to improve tasks such as drug candidate assessment, molecule property prediction, and clinical trial outcome estimation by applying LLM capabilities to biomedical data.
In 2024, Google developed Med-Gemini, a next-generation set of healthcare models that combine Gemini’s advanced multimodal and reasoning capabilities by fine-tuning on de-identified medical data.
To support care providers, Google, in 2023, introduced MedLM and Search for Healthcare. These are built to handle medical queries and are available on the Google Cloud Vertex AI platform. They help clinicians make better-informed decisions and enable patients to receive more accurate and personalised care.
Anthropic chief Dario Amodei, a rival of OpenAI, has also expressed excitement about AI’s potential in biology. “I’m optimistic that diseases which have plagued us for thousands of years—such as cancer, Alzheimer’s, and ageing itself—may be treatable,” he said.
In his recent essay ‘Machines of Loving Grace’, Amodei outlined a future in which AI could “double our lifespans, cure all diseases, and create untold global economic wealth”.Anthropic recently launched the AI for Science Program to support scientific research and discovery by giving researchers access to its API. The program offers free API credits for high-impact projects, with a focus on biology and life sciences.