
At ISC High Performance 2025 in Hamburg, an audience gathered to hear an expert panel tackle a question that now shadows every research cluster and datacenter deployment: how do scientists keep AI models both trustworthy and energy-efficient while the scale of training jobs grows by orders of magnitude?
In “Trustworthiness and Energy Efficiency in AI for Science,” experts from across the HPC landscape unpacked this shared challenge: ensuring machine learning predictions are reliable while keeping the energy demands of the underlying hardware in check.
Chaired by Prasanna Balaprakash, Oak Ridge National Laboratory’s Director of AI Programs, the panel dug into what trustworthiness means in scientific AI, weighed strategies for improving reliability and efficiency without trading one for the other, and examined how domain-specific constraints shape AI development across hardware, software, and modeling. At its core, the discussion pushed for deeper collaboration across disciplines and borders to tackle these shared challenges.
The Case for Trustworthy AI Starts with the Data
University of Utah Chief AI Officer Manish Parashar laid out what trustworthiness means in the context of scientific AI: a result that is not only accurate, but transparent, reproducible, and grounded in data whose origins and handling can be verified. In other words, trust is not just about what a model predicts, but whether the full path to that prediction can be understood and repeated.

The panel at ISC featured experts from all around the HPC ecosystem. (Source: ISC)
Parashar broke that idea down into three practical checks on the data that feed a model:
Provenance: Where did the data originate, and can its lineage be verified?
Governance: How is the dataset curated, secured, and shared?
Use: Who has applied it, and with what results or peer-reviewed citations?
He argued that if any one of those links is weak, the model’s predictions lose credibility. To support that kind of data transparency, he described the National Data Platform, a prototype project led by the National Artificial Intelligence Research Resource (NAIRR). The platform would bring provenance, governance, and usage checks together in one place, allowing researchers to review a dataset’s history, access terms, and citation trail before ever downloading it.
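To make those three checks concrete, a dataset record of the kind Parashar described might look something like the sketch below. The field names are hypothetical and purely illustrative, not the National Data Platform’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """Illustrative record covering the three checks Parashar described.
    Field names are hypothetical, not the National Data Platform schema."""
    # Provenance: where the data originated and how its lineage can be verified
    source_instrument: str
    collection_date: str
    lineage: list[str] = field(default_factory=list)  # upstream datasets / processing steps
    checksum: str = ""                                # content hash for integrity checks
    # Governance: how the dataset is curated, secured, and shared
    license: str = "unspecified"
    access_terms: str = "unspecified"
    curator: str = "unspecified"
    # Use: who has applied it, and with what results or citations
    citing_publications: list[str] = field(default_factory=list)

record = DatasetRecord(
    source_instrument="example beamline detector",
    collection_date="2024-11-02",
    lineage=["raw detector frames", "calibrated images v1.2"],
    checksum="sha256:…",
    license="CC-BY-4.0",
    access_terms="open after 12-month embargo",
    curator="facility data office",
    citing_publications=["doi:10.xxxx/example"],
)
```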
“The way you do this is by being able to connect different corpuses, whether publications or patents, or other public corpuses, that you can then mine and correlate the data. You can then provide a more holistic view to the user in terms of what the data is, and that allows the user to build trust in that data. It’s a continuous process,” he said.
When asked whether reproducibility is part of the trust equation, Parashar said yes: an AI result is only reliable if someone else, using the same data and process, can run the job again and get the same outcome. That level of reproducibility depends on two practices that often go overlooked: metadata capture (recording where the data came from and how it was collected) and workflow capture (saving the exact code, parameters, model checkpoints, and software environments used).
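In practice, workflow capture can be as simple as a training script snapshotting its own code version, software environment, and configuration alongside its outputs. The sketch below is a generic illustration under those assumptions, not a tool any panelist named.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def capture_run_metadata(config: dict, checkpoint_path: str) -> dict:
    """Record the code version, software environment, and inputs of a run
    so the job can be re-executed and checked later."""
    # Exact code version (assumes the script runs inside a git repository)
    git_commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    # Pinned software environment
    packages = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout.splitlines()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": git_commit,
        "python": platform.python_version(),
        "packages": packages,
        "config": config,                               # hyperparameters and data paths
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest(),
        "checkpoint": checkpoint_path,                  # model checkpoint used or produced
    }

if __name__ == "__main__":
    meta = capture_run_metadata({"lr": 3e-4, "batch_size": 256}, "ckpt/epoch_10.pt")
    with open("run_metadata.json", "w") as f:
        json.dump(meta, f, indent=2)
```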
For HPC practitioners, keeping those records is no longer a nice-to-have: metadata and workflow artifacts must be saved, stored, and shared just like the results themselves, because without them the science cannot be trusted or repeated.
The Carbon Cost of Scaling Up
While Parashar focused on the integrity of data, Pekka Manninen, Director of Science and Technology at CSC in Finland, turned the conversation toward energy consumption and what it really costs to train large-scale scientific AI. Drawing on usage records from CSC’s 10,000-GPU LUMI system, Manninen estimated that a single million-GPU-hour workload running on a 7-megawatt cluster in Central Europe could emit around 245 tons of CO₂. That figure, he noted, exceeds what most people will emit from air travel over the course of their lives.
Manninen acknowledged the typical efficiency improvements available to HPC centers, such as warmer-water cooling loops, optimized scheduling, and AI-specific hardware like ASICs, but argued that these offer only incremental gains. The biggest impact, he said, comes from location: Nordic countries with hydropower-rich grids can deliver electricity at under 50 grams of CO₂ per kilowatt-hour, cutting emissions from the same AI job by a factor of ten.
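The arithmetic behind those figures is simple to reproduce. The back-of-the-envelope sketch below assumes the million GPU-hours run on a LUMI-scale 10,000-GPU system drawing the full 7 MW, with a Central European grid intensity of roughly 350 g CO₂/kWh (the value implied by the 245-ton estimate) and 35 g CO₂/kWh standing in for a Nordic grid to reproduce the factor-of-ten gap.

```python
# Back-of-the-envelope emissions estimate (assumed inputs, not CSC's exact model)
gpu_hours = 1_000_000          # workload size cited in the talk
cluster_power_mw = 7.0         # cluster power draw cited in the talk
cluster_gpus = 10_000          # assumed LUMI-scale GPU count

wall_hours = gpu_hours / cluster_gpus       # ~100 hours of wall time
energy_mwh = cluster_power_mw * wall_hours  # ~700 MWh consumed

grid_central_europe = 350      # g CO2/kWh, back-solved from the 245-ton figure
grid_nordic = 35               # g CO2/kWh, consistent with "under 50 grams"

tons_central = energy_mwh * 1000 * grid_central_europe / 1e6   # ~245 t CO2
tons_nordic = energy_mwh * 1000 * grid_nordic / 1e6            # ~24.5 t CO2
print(f"Central Europe: {tons_central:.0f} t CO2, Nordic grid: {tons_nordic:.1f} t CO2")
```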
His point landed with extra weight. Just a day earlier, Germany’s new JUPITER system had officially displaced LUMI as Europe’s top-ranked supercomputer. But while JUPITER has captured headlines for performance, Manninen’s remarks underscored a less-visible consideration: grid emissions matter. As nations invest in AI and HPC infrastructure, the carbon intensity of the surrounding power system may determine not just cost, but legitimacy.
Training AI Tools for the HPC Stack
After discussions of data integrity and energy cost, Oak Ridge’s Jeffrey Vetter shifted the focus to the developer level, arguing that trustworthy AI also depends on how scientists build, adapt, and interact with the software tools that support their work. As head of advanced computing systems research at ORNL, Vetter highlighted efforts to improve software productivity through AI, including ChatHPC, a project exploring whether large language models can be tuned to help researchers parallelize code, translate between programming models, and make better use of performance tools.

Vetter showed how ChatHPC explores whether fine-tuned language models can support code translation, parallelization, and performance tuning in HPC environments. (Source: ISC)
The project builds on Code Llama, an open source foundation model, and fine-tunes it using a wide range of materials such as documentation, tutorials, installation guides, and source code from libraries like Kokkos, MAGMA, and TAO. The models are trained with an expert-in-the-loop process: users provide feedback on where the model succeeds or fails, and that feedback is then used to refine the system.
In one early test, a fine-tuned model was able to convert calls from Intel’s MKL library to Nvidia’s MAGMA with 93 percent functional accuracy, up from just 37 percent using the base Code Llama model. While the current models are not fully automated or general-purpose, the team’s goal is to create a toolkit that HPC centers can adapt to their own programming environments and deploy locally.
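To picture what fine-tuning data for such a translation task could look like, the hypothetical record below pairs a standard MKL (CBLAS) GEMM call with its MAGMA counterpart. The record layout and feedback field are illustrative assumptions, not ChatHPC’s actual training format.

```python
import json

# Hypothetical fine-tuning record for a library-translation task.
# The snippets are standard MKL (CBLAS) and MAGMA GEMM calls; the record
# format itself is an assumption for illustration, not ChatHPC's.
example = {
    "instruction": "Translate this Intel MKL call to the equivalent MAGMA call.",
    "input": (
        "cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,\n"
        "            m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);"
    ),
    "output": (
        "magma_dgemm(MagmaNoTrans, MagmaNoTrans,\n"
        "            m, n, k, alpha, dA, ldda, dB, lddb, beta, dC, lddc, queue);"
    ),
    "feedback": "accepted",   # expert-in-the-loop label used to refine later rounds
}

with open("finetune_examples.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```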
Vetter emphasized that trust is just as important for code generation as it is for scientific prediction. Each fine-tuning step introduces potential bias, he noted, and Oak Ridge now applies the same rigor to model provenance as it does to data: recording configuration files, training settings, and test cases to ensure the models can be understood, reproduced, and improved over time.
Scaling AI Responsibly, From Silicon to Substation
Chris Porter, director of HPC and AI infrastructure at Nvidia, emphasized that trust in AI depends heavily on how it is used. A model built to assist with molecular modeling, he noted, operates under different expectations than one built to drive a car. In scientific settings, trustworthy AI needs to obey fundamental physical laws, such as conservation of energy or momentum, and align with observations drawn from experiments or instruments like telescopes and microscopes.
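As a rough illustration of what a physics-grounded check could look like (a generic sketch, not something Porter demonstrated), a surrogate model’s predicted update for a particle system might be screened against momentum conservation before its output is trusted.

```python
import numpy as np

def conserves_momentum(masses, velocities_before, velocities_after, rtol=1e-3):
    """Check that total momentum is approximately unchanged by a
    model-predicted update -- a simple physics-consistency test."""
    p_before = (masses[:, None] * velocities_before).sum(axis=0)
    p_after = (masses[:, None] * velocities_after).sum(axis=0)
    return np.allclose(p_before, p_after, rtol=rtol, atol=1e-9)

# Toy usage: a two-body exchange in which total momentum stays at zero
m = np.array([1.0, 2.0])
v0 = np.array([[ 2.0, 0.0, 0.0],
               [-1.0, 0.0, 0.0]])
v1 = np.array([[-2.0, 0.0, 0.0],
               [ 1.0, 0.0, 0.0]])
print(conserves_momentum(m, v0, v1))  # True
```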
But Porter argued that the bigger challenge is transparency. When scientists cannot trace how an AI system reached a conclusion, it becomes difficult to trust or verify its output, especially in research contexts where reproducibility and interpretability are critical.
He then connected that concern to Nvidia’s broader work on efficiency and performance across the AI stack. With Moore’s Law no longer delivering predictable gains, Nvidia is investing across hardware and system layers, including compilers, software libraries, networking, cooling, and even grid-level energy planning. As Porter put it, quoting Nvidia CEO Jensen Huang, “The data center is the computer now.”
To show what’s possible, Porter pointed to an internal study of a 1.8-trillion-parameter mixture-of-experts model. Projecting across GPU generations, Nvidia found a 200,000× improvement in inference efficiency from Kepler to Blackwell Ultra. In another example, Porter referenced MLPerf Inference benchmarks showing that, over the course of a year, Hopper GPU performance improved 1.6× purely through software updates with no hardware changes required. These, he said, are the types of full-stack improvements needed to keep AI both performant and sustainable.
Across the panel, the message was clear: building AI systems that scientists can trust is not just a question of accuracy or speed. It also means making those systems efficient enough to be usable, reproducible, and sustainable. That includes understanding where data comes from, how workflows can be repeated, how much energy models consume, and whether the tools scientists rely on are transparent about their outputs. As AI becomes more embedded in scientific discovery, trust and efficiency are not competing goals. They are shared responsibilities. Each layer of the stack, from algorithms to infrastructure, shapes how AI advances science and how the scientific community holds it to account.