Researchers from the Center for AI Safety (CAIS), MIT’s Media Lab, the Brazilian university UFABC, and the pandemic prevention non-profit SecureBio have found that leading artificial intelligence models can outperform experienced, PhD-level virologists in troubleshooting complex laboratory procedures.
The findings, detailed in a new study introducing the Virology Capabilities Test (VCT), demonstrate AI’s proficiency in specialized scientific tasks but also highlight serious dual-use concerns, suggesting these tools could lower the barrier for creating dangerous biological agents.
The VCT benchmark, consisting of 322 questions and detailed further in its research paper, was designed specifically to measure an AI’s ability to assist with intricate ‘wet lab’ virology protocols, assessing fundamental, visual, and tacit understanding – the kind of practical know-how often gained through hands-on lab experience.
The results showed OpenAI’s o3 model achieved 43.8% accuracy, substantially exceeding the 22.1% average scored by specialized human virologists answering questions within their fields. Google’s Gemini 2.5 Pro also performed strongly, scoring 37.6%. According to the VCT analysis, o3’s performance surpassed 94% of the human experts even on question subsets tailored to the experts’ own specialties.
AI Virologist Chatbots Pose Dual-Use Dilemma
This emergent AI capability – providing expert-level guidance for sensitive lab work – presents a clear dual-use scenario: useful for accelerating legitimate research but potentially dangerous if misused. Seth Donoughe, a SecureBio research scientist and study co-author, told TIME the findings made him “a little nervous.”
He elaborated on the historical context: “Throughout history, there are a fair number of cases where someone attempted to make a bioweapon—and one of the major reasons why they didn’t succeed is because they didn’t have access to the right level of expertise… So it seems worthwhile to be cautious about how these capabilities are being distributed.”
Reflecting this, the VCT researchers propose that this AI skill warrants inclusion within governance frameworks designed for dual-use life science technologies.
The VCT findings spurred immediate calls for action from safety advocates. Dan Hendrycks, director of the Center for AI Safety, urged AI companies to implement robust safeguards within six months, calling inaction “reckless.”
He advocated for tiered or gated access controls as a potential mitigation strategy. “We want to give the people who have a legitimate use for asking how to manipulate deadly viruses—like a researcher at the MIT biology department—the ability to do so,” Hendrycks explained to TIME. “But random people who made an account a second ago don’t get those capabilities.”
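To make the idea concrete, the sketch below shows one hypothetical way a tiered, gated-access check could sit in front of a model. It is purely illustrative and does not describe any provider’s actual system: the access tiers, the affiliation-verification flag, and the keyword-based `is_dual_use_query` classifier are all assumptions made for the example.

```python
# Illustrative sketch of a tiered-access gate for dual-use queries.
# All names here are hypothetical; a real deployment would rely on proper
# identity verification and a trained safety classifier, not keyword matching.
from dataclasses import dataclass
from enum import Enum, auto


class AccessTier(Enum):
    PUBLIC = auto()         # anonymous or newly created accounts
    VERIFIED = auto()       # identity-verified individual users
    INSTITUTIONAL = auto()  # vetted researchers at known institutions


@dataclass
class User:
    account_age_days: int
    affiliation_verified: bool


def assign_tier(user: User) -> AccessTier:
    """Map a user to an access tier based on verification status and account age."""
    if user.affiliation_verified:
        return AccessTier.INSTITUTIONAL
    if user.account_age_days >= 30:
        return AccessTier.VERIFIED
    return AccessTier.PUBLIC


def is_dual_use_query(prompt: str) -> bool:
    """Placeholder check; stands in for a real dual-use-content classifier."""
    flagged_terms = ("enhance transmissibility", "culture the virus", "aerosolize")
    return any(term in prompt.lower() for term in flagged_terms)


def handle_prompt(user: User, prompt: str) -> str:
    """Reserve detailed virology assistance for the highest access tier."""
    if is_dual_use_query(prompt) and assign_tier(user) is not AccessTier.INSTITUTIONAL:
        return "This request requires verified institutional access."
    return "<model response>"
```

The design choice mirrors Hendrycks’ point: the gate keys off who is asking, not just what is asked, so a vetted researcher and a brand-new account receive different levels of assistance for the same prompt.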
Industry Responses and Calls for Oversight
AI developers, briefed on the VCT results months ago, have responded in different ways. xAI, Elon Musk’s company, published a risk management framework in February that acknowledges the paper and mentions potential virology safeguards for its Grok model, such as training it to decline harmful requests.
OpenAI stated it “deployed new system-level mitigations for biological risks” for its recently released o3 and o4-mini models, including specific measures like “blocking harmful outputs.”
These mitigations reportedly resulted from a “thousand-hour red-teaming campaign in which 98.7% of unsafe bio-related conversations were successfully flagged and blocked.” Red-teaming is a common security practice involving simulated attacks to find vulnerabilities. Anthropic, another leading AI lab, acknowledged the VCT results in its system documentation but offered no specific mitigation plans, while Google declined to comment on the matter to TIME.
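For readers unfamiliar with output-level mitigations, the following is a minimal sketch of how a post-generation filter of this general kind can work. It is an assumption-laden illustration, not OpenAI’s implementation: the `bio_risk_score` callable, the 0.8 threshold, and the toy keyword scorer are invented for the example.

```python
# Illustrative sketch of a system-level output filter: score the generated text
# for biological risk and block it above a threshold. The scorer and threshold
# here are hypothetical; a real system would use a trained safety classifier.
from typing import Callable, Tuple


def moderate_output(
    generated_text: str,
    bio_risk_score: Callable[[str], float],
    threshold: float = 0.8,
) -> Tuple[str, bool]:
    """Return the text (or a refusal) plus a flag indicating whether it was blocked."""
    score = bio_risk_score(generated_text)
    if score >= threshold:
        # Block rather than return potentially hazardous protocol details.
        return ("I can't help with that request.", True)
    return (generated_text, False)


# Stand-in scorer for demonstration only.
def toy_scorer(text: str) -> float:
    return 0.9 if "reverse genetics protocol" in text.lower() else 0.1


response, blocked = moderate_output("Step 1 of the reverse genetics protocol...", toy_scorer)
print(blocked)  # True
```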
However, some experts believe self-policing by the industry isn’t sufficient. Tom Inglesby from the Johns Hopkins Center for Health Security advocated for governmental policy and regulation. “The current situation is that the companies that are most virtuous are taking time and money to do this work, which is good for all of us, but other companies don’t have to do it,” he told TIME, adding, “That doesn’t make sense.” Inglesby proposed mandatory evaluations for new large language models before their release “to make sure it will not produce pandemic-level outcomes.”
AI’s Expanding Footprint in Scientific Research
The VCT results are not an isolated incident but rather a stark data point within a broader landscape where AI is rapidly integrating into specialized scientific fields. OpenAI, creator of the top-performing o3 model, was already known to be exploring biological applications; Winbuzzer reported in January on its collaboration with Retro Biosciences using a model named GPT-4b Micro to optimize proteins involved in stem cell creation.
Similarly, Google DeepMind has been highly active. Besides the Gemini model family, its widely used AlphaFold program predicts protein structures, while an “AI Co-Scientist” project, detailed in February, aims to generate novel scientific hypotheses, sometimes mirroring unpublished human research.
Microsoft entered the fray in February with BioEmu-1, a model focused on predicting the dynamic movement of proteins, complementing AlphaFold’s static predictions. These tools, focusing on protein engineering, hypothesis generation, and molecular simulation, illustrate AI’s expanding role, moving beyond data analysis toward complex scientific reasoning and procedural assistance – amplifying both the potential scientific gains and the safety challenges highlighted by the VCT.