
(Krot_Studio/Shutterstock)
In the TPC25 session Science Updates from Key TPC Leaders, two distinguished speakers shared different yet complementary perspectives on the future of large language models in science. Franck Cappello from Argonne National Laboratory introduced EAIRA, a new framework for evaluating AI research assistants. His focus was on how to measure reasoning, adaptability, and domain-specific skill so researchers can trust these systems to handle complex scientific work without constant oversight.
From Japan, Professor Rio Yokota of the Tokyo Institute of Technology described the country’s ambitious two-track plan for LLM development. The LLM-jp consortium is training massive models on Japan’s most powerful supercomputers, while the smaller Swallow project experiments with leaner architectures and faster iteration. Together, they showed that the future of LLMs in science depends on more than just building bigger models. It is about making them trustworthy and creating the infrastructure and collaboration to put them to use.
What Do You Need to Trust an LLM Research Assistant?
What do we want LLMs to be able to do as research assistants in science? How can we effectively evaluate these new AI research assistants?

Slide courtesy of Franck Cappello
Franck Cappello of Argonne National Laboratory's AuroraGPT Evaluation Team examined these two core questions in his TPC25 plenary talk, EAIRA: Establishing a Methodology to Evaluate LLMs as Research Assistants.
Broadly speaking, our ambitions for these AI colleagues keep growing. Early notions centered on quickly sifting the scientific literature and returning useful information; today we want something closer to a full partner, one able to survey the literature, develop novel hypotheses, write code, and suggest (and perhaps carry out) experimental workflows.
“But how do we test their reasoning and knowledge capability? How do we test that they actually understand the problem?” said Cappello. “And [how do we] develop the trust of the researcher on this model? When we develop a telescope, or microscope or a light source, we know very well how they work. That’s not the case here, because it’s still a black box.”
“We don’t want to spend a lot of time checking what the model is providing,” said Cappello. “We want to trust the results that it’s providing. It understands the command that humans are giving, but it should also interface with tools and devices that we have in our laboratories, and it should have some degree of autonomy, of course, repeating workflow or learning workflow is one possibility, but what we really want is to be able to generate hypotheses and high-quality hypothesis.”
Getting to that point will require new tools. After a quick review of recent LLM progress, Cappello dug into efforts to develop an effective evaluation methodology. Currently, he said, the two primary evaluation tools are multiple-choice questions (MCQ) and open responses. The current crop of both can be problematic.

Slide courtesy of Franck Cappello
“When you ask researchers to generate many of these MCQs, it takes a lot of time to do that. So they are very important; we still need to consider them,” said Cappello. “Currently, if we look at the available benchmarks, they are too generic. They are not specific to some disciplines. They are static — this means they don’t evolve in time, which opens the problem of contamination; so the benchmark being used for the training of the model. That’s a problem that we need to consider.”
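The contamination concern is easy to illustrate: if benchmark questions appear verbatim, or nearly so, in a model’s training data, high scores say little about reasoning. Below is a minimal sketch of one common screening step, n-gram overlap between a benchmark item and a sample of training documents. The file contents, n-gram size, and interpretation are illustrative assumptions, not part of the EAIRA tooling.

```python
# Illustrative n-gram overlap check for benchmark contamination.
# The n-gram size and example data are assumptions for this sketch,
# not details from the EAIRA methodology.
import re

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a punctuation-stripped, lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item: str, training_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear somewhere in the training sample."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Toy usage
docs = ["The cross section of the reaction was measured at 13 TeV with high precision."]
question = "The cross section of the reaction was measured at 13 TeV. What was found?"
print(contamination_score(question, docs))  # ~0.57: heavy overlap, flag the item for review
```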
Open responses are also difficult to get right, but still important. He walked through the various criteria (slide below) that an evaluation methodology should accommodate.
Cappello then reviewed EAIRA, the evaluation methodology Argonne is developing to provide a rigorous, repeatable approach. In February, Cappello and colleagues from several institutions posted a preprint (EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants) on arXiv. He presented a figure from the paper (see slide below).

Slide courtesy of Franck Cappello
“So you see the methodology that we propose here (slide below). It’s a combination of techniques. So it’s using MCQs. Yes, we are generating MCQs. It’s also using Open Response benchmarks. And we have two new things, the lab-style experiment and the field-style experiment,” he said.
This methodology incorporates four primary classes of evaluations:
Multiple Choice Questions to assess factual recall;
Open Response to evaluate advanced reasoning and problem-solving skills;
Lab-Style Experiments involving detailed analysis of capabilities as research assistants in controlled environments;
Field-Style Experiments to capture researcher-LLM interactions at scale in a wide range of scientific domains and applications.

Slide courtesy of Franck Cappello
The paper’s abstract does a nice job summarizing EAIRA:
“Large Language Models (LLMs) [have emerged] as transformative tools for scientific research, capable of addressing complex tasks that require reasoning, problem-solving, and decision-making. Their exceptional capabilities suggest their potential as scientific research assistants, but also highlight the need for holistic, rigorous, and domain-specific evaluation to assess effectiveness in real world scientific applications.
“This paper describes a multifaceted methodology for Evaluating AI models as scientific Research Assistants (EAIRA) developed at Argonne National Laboratory. This methodology incorporates four primary classes of evaluations. 1) Multiple Choice Questions to assess factual recall; 2) Open Response to evaluate advanced reasoning and problem-solving skills; 3) Lab-Style Experiments involving detailed analysis of capabilities as research assistants in controlled environments; and 4) Field-Style Experiments to capture researcher-LLM interactions at scale in a wide range of scientific domains and applications.
“These complementary methods enable a comprehensive analysis of LLM strengths and weaknesses with respect to their scientific knowledge, reasoning abilities, and adaptability. Recognizing the rapid pace of LLM advancements, we designed the methodology to evolve and adapt so as to ensure its continued relevance and applicability. This paper describes the methodology’s state at the end of February 2025. Although developed within a subset of scientific domains, the methodology is designed to be generalizable to a wide range of scientific domains.”
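To make the first two evaluation classes concrete, the sketch below shows what scoring a model on multiple-choice items and collecting free-text answers for later expert grading might look like in code. The ask_model interface, the item format, and the grading loop are hypothetical illustrations, not the EAIRA implementation.

```python
# Hypothetical sketch of MCQ and open-response evaluation loops.
# The ask_model() callable and the item format are assumptions for illustration;
# they are not the EAIRA tooling described in the paper.
from typing import Callable

def evaluate_mcq(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Score factual recall: fraction of items where the model picks the right letter."""
    correct = 0
    for item in items:
        choices = "\n".join(f"{letter}. {text}" for letter, text in item["choices"].items())
        prompt = f"{item['question']}\n{choices}\nAnswer with a single letter."
        reply = ask_model(prompt).strip().upper()
        if reply[:1] == item["answer"]:
            correct += 1
    return correct / len(items)

def collect_open_responses(items: list[dict], ask_model: Callable[[str], str]) -> list[dict]:
    """Gather free-text answers; grading (by experts or an LLM judge) happens separately."""
    return [{"question": it["question"], "response": ask_model(it["question"])} for it in items]

# Toy usage with a stub model
mcq = [{"question": "Which particle mediates the electromagnetic force?",
        "choices": {"A": "gluon", "B": "photon", "C": "W boson"}, "answer": "B"}]
print(evaluate_mcq(mcq, ask_model=lambda prompt: "B"))  # -> 1.0
```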
Cappello also looked at a couple of other benchmarks, including the ASTRO MCQ Benchmark (astronomy) and the SciCode Open Response Benchmark, and briefly touched on an ANL-HPE collaboration (DoReMi: Difficulty-Oriented Reasoning Effort Modeling of Science Problems for Reasoning Language Models). There was a good deal more to his talk, and TPC will be providing links to a recording.
Recent Progress on Japanese LLMs
Japan’s AI community is taking bold steps to expand its role in the global large language model (LLM) landscape. In a plenary address at TPC25, Professor Rio Yokota of the Tokyo Institute of Technology, a leading figure in Japan’s high-performance computing and AI research, presented the country’s most ambitious initiatives to date: the large-scale LLM-jp consortium and the compact Swallow project with its targeted research agenda.
These two projects are building massive multilingual datasets, exploring everything from dense 172-billion-parameter models to nimble Mixture of Experts (MoE) designs, and committing millions of H100 GPU hours to stay in step with global leaders. Much of this work runs on Japan’s top computing assets, including the ABCI supercomputer and the Fugaku system, giving the teams both the scale and flexibility needed to push LLM research forward.
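The “millions of H100 GPU hours” figure is easy to sanity-check with the widely used 6·N·D approximation for training FLOPs (N parameters, D tokens). The token count and achieved utilization below are assumptions chosen for illustration, not numbers reported in the talk.

```python
# Back-of-envelope training-cost estimate using the common 6*N*D FLOPs rule.
# Token count and achieved utilization are assumptions for this sketch,
# not figures from the LLM-jp or Swallow teams.

params = 172e9          # dense 172B-parameter model (from the talk)
tokens = 2e12           # assumed training tokens
flops_needed = 6 * params * tokens            # ~2.1e24 FLOPs

h100_peak_bf16 = 0.989e15                     # ~989 TFLOP/s dense BF16 peak per GPU
utilization = 0.4                             # assumed model FLOPs utilization
sustained = h100_peak_bf16 * utilization      # ~4e14 FLOP/s per GPU in practice

gpu_hours = flops_needed / sustained / 3600
print(f"~{gpu_hours/1e6:.1f} million H100 GPU-hours")   # roughly 1.4 million
```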

Slide courtesy of Rio Yokota
Yokota explained that such scale requires more than just hardware and data; it demands careful coordination, disciplined experimentation, and a constant awareness of the risks and trade-offs involved. From there, he shifted to the practical realities of training at this level, noting that “these things cost like many, many millions of dollars to train” and that even “just one parameter” set incorrectly can mean “a million dollars’ worth wasted.” He also stressed the painstaking work of cleaning and deduplicating data, calling it one of the most decisive factors in building models that are not only bigger but also smarter.
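As an illustration of the deduplication step he highlighted, the sketch below removes exact duplicates by hashing normalized documents. This is a generic example, not the LLM-jp pipeline; production systems typically add near-duplicate detection (e.g., MinHash/LSH) and much heavier cleaning.

```python
# Minimal exact-duplicate removal by hashing normalized text.
# An illustrative sketch only; real pretraining pipelines go further
# with near-duplicate detection and language- and quality-filtering.
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hello  world", "hello world", "A different document"]
print(len(deduplicate(corpus)))  # -> 2
```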
From the overarching vision, focus shifted to how Japan is translating its AI ambitions into a coordinated national program. LLM-jp brings together universities, government research centers, and corporate partners under a shared framework that aligns funding and development priorities.
This structure makes it possible to run experiments at a scale no single institution could manage on its own, while ensuring that progress in one area is quickly shared across the community. As Yokota put it, the goal is to “share everything as fast as possible so others can build on it right away.”
Yokota described how the consortium’s governance is built for speed, with teams able to exchange interim findings, surface technical issues early, and adjust their methods without being slowed by lengthy approval processes. That ability to adapt on the fly, he noted, can be just as decisive as compute capacity when competing with the fastest-moving global initiatives.
If LLM-jp is about scale and coordination, Swallow takes a different approach. This smaller initiative is designed for targeted experimentation, focusing on efficient training methods and leaner model architectures.
Yokota explained that Swallow operates with far fewer parameters than the largest LLM-jp models, but pushes for innovations that can be applied across projects, from data filtering techniques to optimized hyperparameter search. In his words, “it’s where we try risky ideas that might not work at 172 billion parameters.”

Slide courtesy of Rio Yokota
Swallow also serves as a proving ground for MoE designs, which rely on sparse activation: only a small subset of specialized expert sub-models is active for any given input, cutting FLOPs dramatically while preserving accuracy. The approach reduces computation costs while maintaining performance on complex tasks, an area of growing interest for teams facing finite GPU budgets. According to Yokota, Swallow’s agility makes it well-suited for rapid iteration, with “lessons flowing back into the big models” almost immediately.
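To make sparse activation concrete, the toy routing sketch below sends each token to its top-k experts out of many and mixes their outputs by gate weight; only those k expert matrices are ever multiplied. The dimensions, expert count, and linear experts are illustrative assumptions, not Swallow’s architecture.

```python
# Toy Mixture-of-Experts layer with top-k routing (sparse activation).
# Sizes, expert count, and the linear experts are illustrative assumptions,
# not the Swallow architecture.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

gate_w = rng.normal(size=(d_model, n_experts))                  # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x to its top-k experts and mix their outputs."""
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]                           # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                                    # softmax over selected experts only
    # Only top_k of n_experts matrices are touched: that is the FLOPs saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)   # (16,)
```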
Yokota concluded by framing LLM-jp and Swallow as two halves of the same strategy. One pushes the limits of scale, the other refines the techniques that make such scale practical. Both are tied together by an insistence on sharing results quickly so the broader community can benefit.
He acknowledged that the path forward for Japan’s LLM progress will be demanding, especially with rising compute costs and rapidly shifting benchmarks. However, he argued that Japan’s combination of national coordination, targeted innovation, and open exchange is what will keep it competitive in the global AI landscape.
The Key Takeaway
The two talks converged on a single point. Trust and scale must advance together if LLMs are to reach their full potential in science. The largest models in the world will have limited impact if their outputs cannot be verified, and even the most thorough evaluation methods lose value if they are not applied to systems powerful enough to address complex, real-world problems.
Cappello’s EAIRA framework addresses the trust challenge by combining multiple evaluation approaches to give a clearer view of what an AI can actually do. Yokota’s LLM-jp and Swallow initiatives focus on scale through national coordination, efficient architectures, and a culture of rapid knowledge sharing. The shared message was clear. The LLMs that will matter most in science will be those that are ambitious in capability and grounded in rigorous and transparent testing.
Thank you for following our TPC25 coverage. Complete session videos and transcripts will be available shortly at TPC25.org.
Contributing to this article were Ali Azhar, Doug Eadline, Jaime Hampton, Drew Jolly, and John Russell.