Four Voices on the Future of Scientific AI at TPC25

By Advanced AI Editor | August 1, 2025

From national labs to hyperscale clouds to academic research, Wednesday’s first plenary session at TPC25 asked the same question from different angles: What will it take for AI to stop describing science and start doing it?

Argonne’s Ian Foster outlined a “thought-action fabric” that would let an LLM move from hypothesis to lab execution. Microsoft’s Preeth Chengappa showed how an agentic platform is already stringing together data, models, and tools for problems that range from coolant chemistry to vaccine design. FutureHouse’s Siddharth Narayanan tested whether language agents can reason like junior lab partners and found that new benchmarks and tool-calling scaffolds are beginning to close the gap. University of Michigan’s Karthik Duraisamy argued that none of this will matter unless AI can bridge the abstraction, reasoning, and reality gaps that separate token predictors from causal thinkers.

This plenary session reframed frontier-scale models not as end products but as components in larger, collaborative systems that can couple reasoning engines to experiments, encode provenance and policy, and keep humans in the loop.

A Blueprint for AI-Native Discovery: Thought, Action, and Infrastructure

Ian Foster of Argonne National Laboratory opened the plenary session with a look at how large-scale AI models might move from pattern recognition to actual scientific reasoning and discovery. While language models can now outperform human experts on benchmarks like GPQA, Foster argued that this kind of reasoning still falls short of what science demands. The real question, he said, is how to connect these reasoning systems to the world—how to enable them not just to think but to act.

This vision requires what Foster described as an AI-native scientific discovery platform. At its core is a reasoning engine, but that engine must be embedded in a larger framework that includes access to data repositories, simulations, experimental labs, policy constraints, and memory systems that track what has been done and learned. He called this a “thought-action fabric,” a framework that lets AI systems generate hypotheses, run experiments, evaluate outcomes, and learn from results.

Slide courtesy of Ian Foster

“How do we take a reasoning core and link it through what we might call a thought action fabric, something that will allow thoughts by a model to be translated into actions on the world?” Foster said. “How do we create this fabric and link it into the various elements of what a model needs to be able to access?”
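
To make the shape of such a fabric concrete, here is a minimal Python sketch of the kind of interface a reasoning engine might call into. It is purely illustrative: the class and method names are assumptions made for this article, not part of any existing Argonne system.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class ActionRecord:
    """Provenance entry: what was done, with what inputs, and what came back."""
    action: str
    inputs: dict[str, Any]
    outputs: dict[str, Any]


class ThoughtActionFabric(Protocol):
    """Hypothetical surface a reasoning engine would call to act on the world."""

    def query_data(self, question: str) -> list[dict]:
        """Search structured repositories and the scientific literature."""
        ...

    def run_simulation(self, spec: dict) -> dict:
        """Launch a simulation, e.g. on HPC resources, and return its results."""
        ...

    def run_lab_experiment(self, protocol: dict) -> dict:
        """Submit a protocol to an automated lab and return measurements."""
        ...

    def check_policy(self, proposed_action: dict) -> bool:
        """Check a proposed action against safety and resource policies."""
        ...

    def record(self, entry: ActionRecord) -> None:
        """Append to the memory of what has been done and learned."""
        ...
```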

To illustrate how such a system might function, Foster walked through a detailed example centered on a familiar challenge at Argonne: discovering a more efficient catalyst for converting carbon dioxide into ethanol. In this imagined scenario, a scientist poses the question to the system, which then mines structured data and scientific literature, proposes a hypothesis, and develops a plan involving simulations, lab assays, and iterative learning. Over the course of about three hours, the system screens thousands of candidate materials, launches high-fidelity simulations on supercomputing systems, conducts experiments in automated labs, and retrains its surrogate models based on new results. It ultimately identifies a promising catalyst, with all actions recorded, policy checked, and results delivered back to the human researcher with full transparency.
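
Foster stressed that the scenario is hypothetical, and the same goes for the sketch below, which builds on the illustrative interface above to show how such a screen-simulate-test-retrain cycle might be orchestrated. Every call, field name, and scoring rule here is an assumption, not a description of an existing system.

```python
import random

def discovery_loop(fabric, goal: str, rounds: int = 3) -> str:
    """Toy screen -> simulate -> test -> learn cycle driven by a hypothetical fabric."""
    fabric.query_data(goal)          # mine structured data and literature for context
    scores: dict[str, float] = {}    # stands in for a retrainable surrogate model

    for _ in range(rounds):
        # 1. Screen a large candidate pool cheaply with the current surrogate.
        pool = [f"candidate-{i}" for i in range(1000)]
        shortlist = sorted(pool, key=lambda c: scores.get(c, random.random()),
                           reverse=True)[:5]

        # 2. Higher-fidelity simulation, then policy-checked automated-lab assays.
        for cand in shortlist:
            sim = fabric.run_simulation({"material": cand})
            protocol = {"assay": "CO2-to-ethanol activity", "sample": cand}
            if fabric.check_policy(protocol):
                lab = fabric.run_lab_experiment(protocol)
                # 3. Learn from the new result and keep full provenance.
                scores[cand] = lab.get("activity", sim.get("predicted_activity", 0.0))
                fabric.record(ActionRecord("assay", protocol, lab))

    # 4. Report the most promising candidate back to the human researcher.
    return max(scores, key=scores.get, default="no candidate passed policy checks")
```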

Foster was clear that this is still a speculative example, not yet something any lab can do start to finish. But many of the necessary pieces already exist in isolation. Argonne and other labs have built automated labs that can be controlled remotely, simulation pipelines that integrate with surrogates and domain models, and early infrastructure for orchestrating workflows and enforcing policies. The challenge now is to connect these components into a coherent system that can reason, act, and learn at scale.

He also acknowledged the limits of current models. While some can generate plausible hypotheses, there is still little clarity on how to evaluate those ideas or decide which are worth pursuing. The lack of structure and accessibility in much scientific data also remains a major bottleneck. Perhaps more fundamentally, the infrastructure of science—built over decades for human users—may need to be reimagined for intelligent agents that can operate at speeds and scales humans cannot match.

Foster called on the TPC community to take up this challenge. “I would hope that some people here are going to want to engage within the TPC to create this scientific reasoning platform, to build out the thought and action fabric that we’re going to need, and to build the interfaces with these different types of resources that we’re going to be able to access.”

While some of the foundational pieces are already in place across national labs and research institutions, he emphasized that real progress will require close coordination across disciplines, linking AI, simulation science, automation, data management, and policy. The opportunity, he said, is to define what science looks like in an AI-native world. If successful, the result could be systems that don’t just analyze the world but help transform it—faster, more transparently, and with the capacity to learn continuously from every new experiment.

Pushing Autonomy from Cloud to Lab

Noting that “all of Microsoft is an AI-first company now,” Preeth Chengappa, Microsoft’s head of Industry, Semiconductors & Physics, provided a glimpse into the company’s recently launched Discovery Platform (May 2025) and early use cases that leverage its AI components in his TPC talk, “Agents, Autonomy, and Agency: A Brave New World.”

Microsoft, of course, has been an early participant in the AI revolution, not least through its stake in OpenAI. It has three groups working on generative AI – one focused on LLMs, another focused on AI infrastructure, and a third building services on top of both infrastructure and LLMs. “I’m part of the third,” said Chengappa. Much of that work is being done on the Microsoft Discovery platform, which is now working with early customers.

“The three main areas of focus, as you can see (slide below), are physics, chemistry and biology. There’s a fourth one, which is shaping up to be very focused on autonomy [and] robotics, a fascinating area that [touches] everything Ian was talking about, [such as] wet labs and lab automation. As you can see, it’s built on top of core Azure infrastructure. But the layer in between is what I’m going to be talking about.”

Slides courtesy of Preeth Chengappa

The HPC layer provides orchestration and access to different kinds of compute at scale, said Chengappa. The middle data layer “brings in all the different aspects of your data, as well as customer data, third-party data, anything else from a data perspective.” There’s also a model and tools repository.

“This platform actually now gives you the ability to bring in data, agents, [and] models, and to enumerate or register all these models and agents together in ways that then leads up to the top part, which is the Science Copilot, because this is where you actually interface with everything else that you see,” said Chengappa. “The cool thing here, as we’ll see [in] a few examples that I have, is that it’s possible for some degree of autonomy, some degree of shall we say planning and execution that can happen because of the agentic architecture and agency frameworks that we have.”

The first case history involved data center cooling. “Microsoft has large data centers, and we have problems with cooling. One of the things that we have problems with also is the fact that existing solutions have a lot of harmful chemicals, PFAS, in them. So we wanted to discover something, or see if we could discover something that had less PFAS,” said Chengappa. The process started with a simple prompt – “Hey, go do the research and tell me what I need to know.” Once the system had sifted the literature and produced a useful summary, “You go to the next step, which is, ‘Hey, make a plan and tell me how to execute on this.’”

“Next, you can edit and say, hey, look, this is not the right boiling point, or this is not the right variance that I want. This is where the human [in the loop] is super important, because that’s going to continue [to be needed] for some time. But the plan has been edited. And now we say, execute. I didn’t do too much animation, but there’s a bunch of agents getting called, and after that, it sends out a bunch of candidates. Somebody actually looked at these and said, this particular one looks like a useful candidate. We took the logical next step, which is to actually get it made, and it’s actually being tested now for large-scale applications,” said Chengappa.
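
The pattern Chengappa walks through – research, plan, human edit, then agentic execution – can be summarized in a short, generic sketch. This is not the Discovery platform’s API; the function and class names below are hypothetical stand-ins, with the llm, agents, and human_review arguments assumed to be callables supplied by the reader.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    agent: str             # which specialist agent should handle this step
    task: str              # natural-language instruction for that agent
    approved: bool = True  # a human can veto or edit a step before execution

def run_campaign(llm, agents, prompt, human_review):
    """Generic research -> plan -> human edit -> execute pattern (hypothetical API)."""
    # 1. "Go do the research and tell me what I need to know."
    summary = llm(f"Survey the literature and summarize what is known about: {prompt}")

    # 2. "Make a plan and tell me how to execute on this."
    raw_plan = llm(f"Given this summary, list the execution steps, one per line:\n{summary}")
    plan = [PlanStep(agent="screening", task=line)
            for line in raw_plan.splitlines() if line.strip()]

    # 3. The human edits the plan before anything runs (wrong boiling point, wrong variance).
    plan = human_review(plan)

    # 4. Execute: each approved step is routed to its agent; outputs become candidates.
    candidates = []
    for step in plan:
        if step.approved and step.agent in agents:
            candidates.extend(agents[step.agent](step.task))
    return candidates
```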

The next example involved an aspect of pharmaceutical drug discovery. The World Health Organization issued a challenge stating that current antiviral or vaccine delivery mechanisms were not optimally efficient and might cause side effects. The question: Can we improve the delivery mechanism?

“So again, same process, create a nice prompt, iteratively made more complex. It immediately resulted in some activity, which identified a specific protein that you could bind to, which could give you better results. Here are all the different agents (slide below) that we see and all the different tools that we call, and this is the kind of visualization that you keep seeing as we look at the explainability of what happened across different discovery processes.

“With all those tools and [agents] being called, we discovered eight different spike proteins which can be used in delivery, and this is something that is now being explored for a real-life solution for this problem.”

Chengappa packed a lot into his talk. He also touched on several current trends. Here are two:

  • Personalized Agents – “Lately we are seeing highly personalized agents in the enterprise. They have email IDs. They have Teams IDs. You can talk to them just like you would any other colleague. You could give them tasks.”
  • Trust – “We look at trust at many different levels: human to agent, agent to tool, agent to LLM index. When I say trust, people no longer have any doubts that LLMs are going to do something; they’re not questioning competence. But it’s important to understand what the agent’s motivations are, what the agent is actually doing.”

Perhaps surprisingly, Chengappa concluded with a caution:

“Closing thoughts? Agents are going to be everywhere. They are going to be pervasive. They are going to take over all our jobs – not just yet. It’s going to happen over a period of time. And so in every previous instance where there’s been such disruption, we’ve been able to figure out what we as human beings are going to do next. I know everybody’s excited about everything that TPC is going to do, and all the wonderful things that agents [and] LLMs and generative AI are going to do, but I do think the unsolved problem is, where do we all end up in an agentic world? I think that TPC should pay some attention to that as well as the societal impacts, the political structure.”

Teaching Language Agents to Think Like Scientists

It’s one thing to ask AI a question. It’s another to ask it to think like a scientist, plan an experiment, and tell you what to test next. That’s the kind of system Dr. Siddharth Narayanan is working on. At TPC25, he shared how language model agents are beginning to take on complex scientific tasks like analyzing data, reviewing papers, and designing drug candidates.

Dr. Narayanan is a physicist and researcher at FutureHouse, an independent nonprofit research organization focused on automating scientific discovery. His background spans particle physics and machine learning, with past work on dark matter and protein design. Today, his focus is on building AI systems that extend scientific reasoning and help researchers work more efficiently.

Keeping up with science today feels harder than ever. There’s more data, more tools, and more papers coming out than most researchers can reasonably stay on top of. Dr. Narayanan acknowledged this pressure, noting that “the volume of information is so large” and the way science is conducted has become “more complex as well.”

Rather than trying to simplify the field, he sees AI as a way to help researchers keep pace. With the right design, language model agents could act as collaborators who read papers, generate ideas, and help scientists scale their thinking.

Figuring out what these models can actually do in science turned out to be a challenge in itself. Dr. Narayanan’s team found that most benchmarks reward models for getting the right answer, even if it is just a fact they memorized. However, science is not about recall. It is about reasoning through problems step by step. So his group built tests that reflect how scientists actually work. The models are not beating humans yet, but they are getting closer.

Figuring out what models can do is only the first step. To make them genuinely useful in research, Dr. Narayanan’s team is building language agents. These systems let models run tools, take actions, and interact with data inside structured environments. The agents are designed to reflect how real scientists work, allowing the AI to reason, experiment, and solve problems instead of just predicting text.

These agents are already helping with some of the most time-consuming parts of research. Paper Q/A works like a smart assistant that can search through scientific papers, pull out what matters, and answer detailed questions about a topic. Another agent, called Protein, focuses on designing new molecules. It reads past studies, runs structure predictions, and suggests binders that can be tested on a computer. These tools are not just experiments; they are meant to be used.
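
Under the hood, agents like Paper Q/A follow a familiar tool-calling loop: the model proposes an action, a tool executes it, and the observation is fed back until the model commits to an answer. The sketch below is a generic illustration of that loop, not FutureHouse’s implementation; the llm and tools arguments are assumed callables.

```python
import json

def paper_qa_agent(llm, tools: dict, question: str, max_steps: int = 8) -> str:
    """Minimal tool-calling loop: the model picks a tool, sees its output, repeats."""
    transcript = [
        {"role": "system",
         "content": "Answer scientific questions. Reply with JSON: "
                    '{"tool": <name>, "args": {...}} to act, or {"answer": <text>} to finish. '
                    f"Available tools: {list(tools)}"},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = json.loads(llm(transcript))          # model decides the next action
        if "answer" in reply:
            return reply["answer"]                   # final, evidence-grounded answer
        tool_fn = tools[reply["tool"]]               # e.g., "search_papers", "read_chunk"
        observation = tool_fn(**reply.get("args", {}))
        transcript.append({"role": "assistant", "content": json.dumps(reply)})
        transcript.append({"role": "user", "content": f"Observation: {observation}"})
    return "No answer within the step budget."
```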

To make these agents more accessible, the team has built an interactive platform where scientists can run them, inspect results, and follow the reasoning behind each step. One of the most advanced systems so far is Rocklin, which helps generate and refine drug hypotheses using AI. Dr. Narayanan covered even more in his session, including live examples, early results, and where this work is headed next.

Active Inference as a Discovery Loop

University of Michigan Professor Karthik Duraisamy used his TPC25 talk, “Active Inference AI Systems for Scientific Discovery,” to explain why bigger models alone will not give science its ChatGPT moment. He says today’s AI struggles with three persistent gaps that hinder its use as a discovery engine: an abstraction gap, a reasoning gap, and a reality gap.

The abstraction gap comes from models that reason with tokens and pixels instead of the high-level concepts scientists rely on. Current foundation models may spot statistical patterns, but they rarely identify important domain concepts in the way scientists need. The reasoning gap refers to brittle logic chains that collapse when tasks demand months of context or explicit causal links. When inference chains stretch beyond a handful of steps, many models still stumble, losing the causal links that connect one result to the next.

“Every day, I’m extremely impressed by what these reasoning models can do. But I think all of us know that they’re very brittle, and there are many reasons why they’re brittle,” he said. “What we need are models that reason through established causal relationships and can also develop analogies at some kind of abstract level, much more than they do right now. And something very important for science, as all of us know, science operates over weeks and months and years and maybe even decades, right? We need a very long reasoning chain.”

The reality gap shows up when a model strays outside its training data and predicts the outcome of an unseen experiment. Without fresh data from the lab, off-distribution forecasts – the predictions AI models make on unfamiliar data – remain guesses rather than reliable guides for the next experiment. Duraisamy says these three gaps reinforce one another and that closing them will matter more than parameter counts or extra compute. Doing so, he argues, calls for an architecture that blends modeling, experimentation, and human judgment in a single feedback loop.

Duraisamy’s answer is an active inference stack that keeps the model, the lab, and the human in continuous dialogue. At the top are general-purpose language models, either commercial or open weight. They feed domain foundation models for materials, biology, and other fields. Shared embedding spaces let these domain models compare ideas and spot unexpected links.

Slide courtesy of Karthik Duraisamy

Next is a dynamic knowledge graph that holds literature, simulation outputs, and fresh observations. Then there’s an orchestration layer, built from uncertainty quantification, optimal experiment design, and control theory, that plans new simulations or bench tests. At the bottom, experiments, simulations, and formal proofs generate evidence that flows back up the stack. Duraisamy says that formally verified code or theorems can serve as trusted building blocks and reduce error propagation. Each layer can tune the others, turning the system into a “world model” that learns by acting.

However, this inference stack cannot run on autopilot. It still needs human oversight to steer discovery and set guardrails. Researchers must decide when to revise a theory, when to trust a simulation, and when to discard a wrong path. That judgment is a permanent architectural feature. Duraisamy cautions that science lives in pockets of computable reality, so models must recognize where their manifold ends and the unknown begins. In practice, an AI system should flag its blind spots and defer to new data or human review whenever it strays beyond validated ground.
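
One way to picture a single turn of that loop is sketched below: an orchestration step picks the experiment the model is least certain about, runs it, and feeds the result back, while anything outside the model’s validated range is routed to a human reviewer instead. The surrogate interface assumed here (predict returning a mean and a standard deviation, plus update) is an illustration, not a description of Duraisamy’s system.

```python
def active_inference_round(surrogate, candidate_experiments, run_experiment,
                           ood_threshold: float = 2.0):
    """One illustrative turn of the loop: choose the most informative experiment,
    run it, update the model, and defer to a human when the model is off-distribution."""
    # Orchestration layer: score candidates by predictive uncertainty,
    # a crude stand-in for optimal experiment design / expected information gain.
    stds = [surrogate.predict(x)[1] for x in candidate_experiments]
    pick = max(range(len(stds)), key=stds.__getitem__)
    x_next = candidate_experiments[pick]

    # Reality-gap check: flag blind spots instead of trusting an off-distribution guess.
    if stds[pick] > ood_threshold:
        return {"status": "needs_human_review", "experiment": x_next,
                "reason": "uncertainty exceeds the model's validated range"}

    # Evidence flows back up the stack: run the experiment, then learn from it.
    y = run_experiment(x_next)
    surrogate.update(x_next, y)
    return {"status": "updated", "experiment": x_next, "observation": y}
```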

Duraisamy closed his talk with the idea that AGI is a long-term goal rather than an imminent breakthrough: “I do believe it is practical to reconstruct intelligence via computation, but I also believe we are very, very far away from the right kind of abstractions on the right kind of compute for this to be purely computational. What helps us here is the interplay between counterfactual reasoning, where our approximate world models interact in the real world with some kind of human oversight.”

The Takeaway

If there was a single refrain running through the session, it was that “bigger” is no longer enough. The next leap in scientific AI will come from architectural elements like knowledge graphs, orchestration layers, automated labs, and domain-aware agents that let models plan, act, and learn in a closed loop. Yet every speaker also underscored the current challenges: brittle reasoning chains, insufficient structured data, and questions of trust and accountability. The opportunity for the TPC community is clear: build the connective tissue that turns trillion-parameter models into reliable scientific partners, while designing the guardrails that keep discovery both rapid and responsible.

Thank you for following our TPC25 coverage. Complete session videos and transcripts will be available shortly at TPC25.org.

Contributing to this article were Ali Azhar, Doug Eadline, Jaime Hampton, Drew Jolly, and John Russell.
