
At its recent annual conference in Boston, Splunk made the case that AI now has a dual role in enterprise operations: it can accelerate incident response, and it must also be monitored like any other critical system. The company’s new observability features align with that shift, pairing agentic troubleshooting tools with dashboards that track the performance, cost, and reliability of AI agents and models.
Splunk and Cisco executives described the approach as two sides of the same coin: AI for observability and observability for AI. On one hand, AI is embedded into observability workflows to cut resolution times from hours to seconds, freeing engineers to focus on development rather than troubleshooting, they said. On the other hand, observability itself is being extended to AI, providing transparency into the behavior and cost of models that are becoming central to business processes and keeping those AI systems accountable.
AI for Observability: Speeding Troubleshooting
The first half of Splunk’s strategy centers on embedding AI directly into the work of troubleshooting. Several new features are designed to shorten the distance between a spike in telemetry and its resolution. Instead of pulling engineers into hours-long calls to rummage through layers of infrastructure, Splunk’s agentic AI highlights the likely root cause and even offers to remediate.
The AI Troubleshooting Agents, available in both Splunk Observability Cloud and Splunk AppDynamics, automatically analyze incidents in real time for actionable insights. Event iQ in Splunk IT Service Intelligence (ITSI) applies AI to correlate floods of alerts into meaningful groups, giving teams more context on what is actually happening. ITSI Episode Summarization gives users concise overviews of grouped alerts, including trends, potential impact, and root causes, to move from detection to resolution faster.
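Splunk has not published how Event iQ’s correlation works under the hood, but the basic idea of collapsing an alert flood into reviewable groups can be illustrated with a deliberately simplified sketch: group alerts that share a service and arrive within a short time window. Everything below, from the Alert fields to the five-minute window, is an illustrative assumption rather than Splunk’s actual algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative only: this is not Event iQ's logic, just the general idea of
# collapsing an alert flood into a handful of reviewable "episodes" by
# grouping alerts that share a service within a short time window.

@dataclass
class Alert:
    service: str
    message: str
    timestamp: float  # seconds since epoch


def group_alerts(alerts, window_seconds=300):
    """Group alerts by service, starting a new episode when the gap exceeds the window."""
    episodes = defaultdict(list)  # (service, episode_index) -> list of alerts
    last_seen = {}                # service -> (episode_index, last timestamp)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx, last_ts = last_seen.get(alert.service, (0, None))
        if last_ts is not None and alert.timestamp - last_ts > window_seconds:
            idx += 1  # quiet period since the last alert for this service: new episode
        episodes[(alert.service, idx)].append(alert)
        last_seen[alert.service] = (idx, alert.timestamp)
    return episodes


alerts = [
    Alert("payments", "5xx rate above threshold", 1000.0),
    Alert("payments", "API key authentication failures", 1030.0),
    Alert("checkout", "latency p99 > 2s", 1045.0),
    Alert("payments", "5xx rate above threshold", 9000.0),  # much later: separate episode
]
for (service, idx), group in group_alerts(alerts).items():
    print(f"{service} episode {idx}: {len(group)} alerts")
```

In practice, products like ITSI correlate on far richer signals than service name and time, but the sketch shows why grouped episodes are easier to summarize and reason about than raw alert streams.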

AIwire spoke with Mimi Shalash, an observability advisor at Splunk, who described how these features change the experience for teams managing complex services such as e-commerce sites. A sudden wave of “rage clicking” from users might once have required several people tracing user sessions, checking infrastructure layers, and combing through logs and dashboards before isolating a failed API key in a payment service. AI Troubleshooting Agents, she explained, are meant to automate that painstaking, cross-team process of manual investigation.
With agentic AI in place, that same sequence can now be flagged and diagnosed instantly, with the system even offering to roll back the faulty version. A workflow that once required multiple engineers and hours of effort can be compressed into moments. The result, Shalash said, is not only operational savings and protected revenue, but also a better experience for developers and data engineers.
Beyond speeding up troubleshooting, Splunk is also responding to a larger set of customer challenges. Shalash noted that these challenges vary widely depending on maturity, but one consistent theme is complexity. Some organizations are already experimenting with agentic AI, while others are still early in their observability journey.
A common obstacle, she said, is tool sprawl: “The challenge, especially in our customers that are a bit earlier to the observability curve or maturity curve, is, if you think about how people procured software years ago, you could be a director, a VP, and you’d have one specific problem, one use case, and you would go buy a tool to solve that use case,” she said. “Well, fast forward, and it has become financially irresponsible, because then you have nine different teams, and each has their own tool that they love, and they guard it very closely. It’s familiar, it’s comfortable, but we know it creates all the blind spots, all the silos.”
Consolidating those tools can be difficult, and onboarding new ones adds friction. Splunk’s agentic AI aims to ease that process by lowering the learning curve, Shalash said. Tasks that once required years of experience with specialized query languages like SPL or SignalFlow can now be handled by AI, reducing cognitive load and making complex analytics more accessible.
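For a sense of the expertise being abstracted away, here is the kind of hand-written SPL an engineer might otherwise need, run through the Splunk SDK for Python (splunklib). The host, credentials, index, and field names are placeholders, and the query is a generic example rather than anything generated by Splunk’s agents.

```python
# A rough illustration of the hand-written SPL that agentic AI aims to generate
# from plain-language questions. Uses the Splunk SDK for Python (splunklib);
# the host, credentials, index, and field names below are placeholders.
import splunklib.client as client
import splunklib.results as results

service = client.connect(
    host="splunk.example.com",  # placeholder search head
    port=8089,
    username="admin",
    password="changeme",
)

# "Which payment-service hosts threw the most 5xx errors in the last hour?"
# expressed the traditional way, as SPL an engineer would have to know how to write:
spl = (
    "search index=web sourcetype=access_combined service=payments status>=500 "
    "earliest=-60m | stats count by host | sort -count"
)

reader = results.JSONResultsReader(service.jobs.oneshot(spl, output_mode="json"))
for row in reader:
    if isinstance(row, dict):  # skip diagnostic messages from the search job
        print(row["host"], row["count"])
```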
Observability for AI: Keeping Models Accountable
If Splunk’s first priority is using AI to make operations run more smoothly, its second is making sure AI itself can be trusted. As more organizations embed agents and LLMs into critical business processes, the risks of runaway costs, opaque decision-making, or degraded performance have grown too large to ignore. Splunk leaders describe this as the other half of the equation: extending observability from applications and infrastructure to the AI stack itself.

Mimi Shalash, Observability Advisor, Splunk
“Historically, where Splunk has focused its AI observability strategy has been AI for observability, so generative prompts allowing teams to use natural language to identify if there’s an issue and doing sophisticated correlation of analytics and some root cause analysis,” Shalash said. “What is new, and I think exciting, is observability for AI, and being able to have visibility and intelligence in the now and understanding the performance of underlying infrastructure, being able to capture cost and degradation and make sure that that’s being reported to the business.”
That shift means watching not just applications and logs, but the AI stack itself, from infrastructure health to GPU usage, to ensure performance and costs stay under control. To illustrate the point, Shalash described a financial services customer who automated reporting with agentic AI, only to see GPU demand spiral into a seven-figure bill. Without observability, the cost spike was not caught until it was too late, making it hard to defend the project’s ROI to executives and shareholders. In Shalash’s view, Splunk’s observability for AI would have flagged the degrading GPU performance early, allowing the company to investigate before it became a catastrophe.
Splunk’s new features are built for exactly these scenarios. AI Agent Monitoring provides visibility into both the quality and cost of LLMs and AI agents, helping teams decide whether models are performing at the right price point and whether they align with business objectives. AI Infrastructure Monitoring focuses on the hardware layer, watching for bottlenecks and consumption spikes across services so that costs can be managed before they spiral. Used together, these tools are intended to give organizations the same kind of oversight they expect for traditional IT systems, but tailored to the unpredictable economics and behaviors of modern AI.
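Splunk has not documented AI Agent Monitoring’s internals here, but the kind of per-request telemetry such a tool depends on can be sketched with the vendor-neutral OpenTelemetry metrics API, which Splunk Observability Cloud already ingests. The metric names, the per-token price, and the agent label below are illustrative assumptions, and a configured MeterProvider and exporter (for example, via a Splunk OpenTelemetry distribution) are assumed to exist elsewhere.

```python
# Not Splunk's AI Agent Monitoring API: a generic sketch of the per-request cost
# and latency telemetry such a tool consumes, emitted with the OpenTelemetry
# Python API. Metric names, attributes, and prices are illustrative assumptions,
# and a configured MeterProvider/exporter is assumed to be set up elsewhere.
import time
from opentelemetry import metrics

meter = metrics.get_meter("llm.agent.demo")

token_counter = meter.create_counter(
    "llm.tokens.used", unit="{token}", description="Tokens consumed per request"
)
cost_counter = meter.create_counter(
    "llm.cost.usd", description="Estimated spend per request in USD"
)
latency_hist = meter.create_histogram(
    "llm.request.duration", unit="s", description="End-to-end request latency"
)


def record_llm_call(model: str, prompt_tokens: int, completion_tokens: int, started: float):
    """Record one model call so dashboards can track usage, spend, and latency."""
    attrs = {"model": model, "agent": "report-generator"}  # hypothetical agent name
    total = prompt_tokens + completion_tokens
    token_counter.add(total, attributes=attrs)
    cost_counter.add(total / 1000 * 0.002, attributes=attrs)  # assumed $0.002 per 1K tokens
    latency_hist.record(time.monotonic() - started, attributes=attrs)


start = time.monotonic()
# ... call the model here ...
record_llm_call("example-model", prompt_tokens=1200, completion_tokens=300, started=start)
```

Feeding signals like these into dashboards is what makes it possible to catch a GPU or token-spend spike like the one Shalash described before it becomes a seven-figure surprise.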

Another goal is building trust. As Splunk VP of AI Hao Yang noted during a panel discussion, enterprises cannot rely on AI systems that operate as black boxes. By giving models and infrastructure the same kind of oversight that applications receive, observability provides a foundation of transparency: a way to verify costs, trace decisions, and ensure performance. In this new world, observability is less an add-on than a basic requirement for scaling AI.
The Takeaway
Splunk’s Boston announcements show how observability is evolving alongside AI itself. The company is expanding its portfolio with features that speed incident response and extend monitoring into the AI stack, covering agents, models, and infrastructure. The shift reflects a new reality: AI is now part of the operational backbone, and it requires the same level of oversight as any other system.
For enterprises, the value lies in being able to trust that systems are working as expected. Faster troubleshooting protects revenue and eases the burden on engineers, while visibility into AI performance and costs helps prevent runaway bills and opaque decisions. Splunk’s pitch is that observability can provide both the efficiency gains businesses want today and the guardrails they will need as AI adoption accelerates.