Advanced AI News

What Benchmarks Say About Agentic AI’s Coding Potential

By Advanced AI Bot, March 28, 2025

(Source: Shutterstock)

Last week’s GTC 2025 show may have been agentic AI’s breakout moment, but the core technology behind it has been improving quietly for some time. That progress is being tracked across a series of benchmarks, such as SWE-bench and GAIA, leading some to believe AI agents are on the cusp of something big.

It wasn’t that long ago that AI-generated code was not deemed suitable for deployment. The SQL code would be too verbose or the Python code would be buggy or insecure. However, that situation has changed considerably in recent months, and AI models today are generating more code for customers every day.

Benchmarks provide a good way to gauge how far agentic AI has come in the software engineering domain. One of the more popular benchmarks, dubbed SWE-bench, was created by researchers at Princeton University to measure how well LLMs like Meta’s Llama and Anthropic’s Claude can solve common software engineering challenges. The benchmark draws on GitHub as a rich source of real bugs across 12 Python repositories and provides a mechanism for measuring how well LLM-based AI agents can solve them.

When the authors submitted their paper, “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” to the International Conference on Learning Representations (ICLR) in October 2023, the LLMs were not performing at a high level. “Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues,” the authors wrote in the abstract. “The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues.”

That changed quickly. Today, the SWE-bench leaderboard shows the top-scoring model resolved 55% of the coding issues on SWE-bench Lite, which is a subset of the benchmark designed to make evaluation less costly and more accessible.
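A resolution rate like this can be sketched in a few lines. In the SWE-bench paper, an issue counts as resolved when the tests that failed before the fix now pass and the tests that already passed keep passing. The class and function names and the sample data below are hypothetical, and this is a simplified illustration rather than the benchmark's own harness.

```python
from dataclasses import dataclass


@dataclass
class IssueResult:
    """Test outcomes for one issue after the agent's patch is applied."""
    issue_id: str                  # hypothetical identifier
    fail_to_pass: dict[str, bool]  # tests the fix is supposed to turn green
    pass_to_pass: dict[str, bool]  # tests that must keep passing


def is_resolved(result: IssueResult) -> bool:
    # Resolved only if the patch fixes the target tests and breaks nothing.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[IssueResult]) -> float:
    """Percentage of issues that count as resolved."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for four issues.
results = [
    IssueResult("issue-1", {"test_bug": True}, {"test_old": True}),
    IssueResult("issue-2", {"test_bug": False}, {"test_old": True}),   # bug not fixed
    IssueResult("issue-3", {"test_bug": True}, {"test_old": False}),   # regression
    IssueResult("issue-4", {"test_bug": True}, {}),
]
print(f"{resolved_rate(results):.1f}% resolved")  # prints "50.0% resolved"
```

Note that a patch that fixes the reported bug but breaks an existing test (issue-3 here) does not count, which is part of why the scores started so low.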

SWE-bench measures AI agents’ capability to resolve GitHub issues. You can see the current leaders at https://www.swebench.com.

Hugging Face put together a benchmark for General AI Assistants, dubbed GAIA, that measures a model’s capability across several realms, including reasoning, multi-modality handling, Web browsing, and general tool-use proficiency. The GAIA tests are unambiguous but challenging, such as counting the number of birds in a five-minute video.

A year ago, the top score on level 3 of the GAIA test was around 14, according to Sri Ambati, the CEO and co-founder of H2O.ai. Today, an H2O.ai system built on Claude 3.7 Sonnet holds the top overall score, about 53.

“So the accuracy is just really growing very fast,” Ambati said. “We’re not fully there, but we are on that path.”

H2O.ai’s software is involved in another benchmark that measures SQL generation. BIRD, which stands for BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation, measures how well AI models can parse natural language into SQL.
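Text-to-SQL accuracy is commonly scored by execution: the generated query and a reference query are both run against the database, and the prediction counts as correct when the results match. The sketch below illustrates that idea with Python's built-in sqlite3 module. The table, the queries, and the helper names are hypothetical, and comparing rows as an unordered multiset is an assumption of this sketch, not necessarily how BIRD's official evaluator compares results.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the query's rows as an unordered multiset, or None if the SQL fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when the predicted query runs and returns the same rows as the gold query."""
    predicted = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    return predicted is not None and predicted == gold


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 10), (2, "acme", 20), (3, "globex", 5)],
)

# (predicted SQL, gold SQL) pairs; the predictions are made up for illustration.
pairs = [
    ("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer ORDER BY customer",
     "SELECT customer, SUM(amount) FROM orders GROUP BY customer"),
    ("SELECT COUNT(id) FROM orders",
     "SELECT COUNT(*) FROM orders"),
    ("SELECT customer, COUNT(*) FROM orders GROUP BY customer",   # wrong aggregate
     "SELECT customer, SUM(amount) FROM orders GROUP BY customer"),
    ("SELEC * FROM orders",                                        # invalid SQL
     "SELECT * FROM orders"),
]

matches = [execution_match(conn, predicted, gold) for predicted, gold in pairs]
accuracy = 100.0 * sum(matches) / len(matches)
print(f"execution accuracy: {accuracy:.1f}%")  # prints "execution accuracy: 50.0%"
```

Scoring on results rather than on query text means two differently written queries can both be counted as correct, as the first two pairs show.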

When BIRD debuted in May 2023, the top-scoring model, CoT+ChatGPT, demonstrated about 40% accuracy. One year ago, the top-scoring AI model, ExSL+granite-20b-code, was based on IBM’s Granite AI model and had an accuracy of about 68%. That was well below human performance, which BIRD measures at about 92%. The current BIRD leaderboard shows an H2O.ai-based model from AT&T as the leader, with a 77% accuracy rate.
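To put those figures in perspective, the short calculation below uses the approximate numbers quoted above to work out how far the leading model remains from the human baseline, and how much of the original gap has been closed. The function name is hypothetical, and the inputs are the article's rounded values rather than exact leaderboard entries.

```python
# Approximate BIRD accuracy figures quoted in the article, in percent.
scores = {"May 2023": 40.0, "one year ago": 68.0, "current": 77.0}
human = 92.0


def gap_closed(start, now, ceiling):
    """Share of the start-to-ceiling gap that has been closed, in percent."""
    return 100.0 * (now - start) / (ceiling - start)


remaining = human - scores["current"]
closed = gap_closed(scores["May 2023"], scores["current"], human)

print(f"points still below human performance: {remaining:.0f}")  # prints 15
print(f"share of the original gap closed: {closed:.1f}%")        # prints 71.2%
```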

The rapid progress in generating decent computer code has led some influential AI leaders, such as Nvidia CEO and co-founder Jensen Huang and Anthropic co-founder and CEO Dario Amodei, to make bold predictions about where we will soon find ourselves.

“We are not far from a world–I think we’ll be there in three to six months–where AI is writing 90 percent of the code,” Amodei said earlier this month. “And then in twelve months, we may be in a world where AI is writing essentially all of the code.”

GAIA measures AI agents’ capability to handle a number of tasks. You can see the current leaderboard at https://huggingface.co/spaces/gaia-benchmark/leaderboard.

During his GTC25 keynote last week, Huang shared his vision about the future of agentic computing. In his view, we are rapidly approaching a world where AI factories generate and run software based on human inputs, as opposed to humans writing software to retrieve and manipulate data.

“Whereas in the past we wrote the software and we ran it on computers, in the future, the computers are going to generate the tokens for the software,” Huang said. “And so the computer has become a generator of tokens, not a retrieval of files. [We’ve gone] from retrieval-based computing to generative-based computing.”

Others are taking a more pragmatic view. Anupam Datta, the principal research scientist at Snowflake and lead of the Snowflake AI Research Team, applauds the improvement in SQL generation. For instance, Snowflake says its Cortex Agent’s text-to-SQL generation accuracy rate is 92%. However, Datta doesn’t share Amodei’s view that computers will be rolling their own code by the end of the year.

“My view is that coding agents in certain areas, like text-to-SQL, I think are getting really good,” Datta said at GTC25 last week. “Certain other areas, they’re more assistants that help a programmer get faster. The human is not out of the loop just yet.”

BIRD measures the text-to-SQL capability of AI agents. You can access the current leaderboard at https://bird-bench.github.io.

Programmer productivity will be the big winner thanks to coding copilots and agentic AI systems, Datta said. We’re not far from a world where agentic AI generates the first draft and humans then come in to refine and improve it. “There will be huge gains in productivity,” he said. “So the impact will be very significant, just with copilot alone.”

H2O.ai’s Ambati also believes that software engineers will work closely with AI. Even the best coding agents today introduce “subtle bugs,” so people still need to look at the code, he said. “It’s still a pretty necessary skill set.”

One area that’s still pretty green is the semantic layer, where natural language is translated into business context. The problem is that the English language can be ambiguous, with multiple meanings from the same phrase.

“Part of it is understanding the semantics layer of the customer schema, the metadata,” Ambati said. “That piece is still building. That ontology is still a bit of a domain knowledge.”
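One way to picture the semantic layer is as a lookup from business terms to the SQL expressions they could mean in a given customer schema. The sketch below is purely illustrative: the terms, table names, and column names are all hypothetical, and a real semantic layer would carry far more metadata. It shows why an ambiguous phrase such as "revenue" cannot be translated until someone supplies the business context.

```python
# Hypothetical semantic layer: business terms mapped to the SQL expressions
# they could mean in one imagined customer schema.
SEMANTIC_LAYER = {
    "revenue": {
        "gross": "SUM(orders.amount)",
        "net": "SUM(orders.amount - orders.refunds)",
    },
    "active customers": {
        "default": "COUNT(DISTINCT orders.customer_id)",
    },
}


def resolve(term, sense=None):
    """Return the SQL expression for a term, or raise if it is unknown or ambiguous."""
    senses = SEMANTIC_LAYER.get(term.lower())
    if senses is None:
        raise KeyError(f"unknown business term: {term!r}")
    if sense is not None:
        return senses[sense]
    if len(senses) == 1:
        # Only one meaning is defined, so no clarification is needed.
        return next(iter(senses.values()))
    raise ValueError(f"{term!r} is ambiguous; choose one of {sorted(senses)}")


print(resolve("Active customers"))
print(resolve("revenue", sense="net"))
try:
    resolve("revenue")
except ValueError as err:
    print(err)
```

In this toy version the ambiguity is surfaced as an error; an agent would instead have to ask a follow-up question or rely on domain knowledge encoded ahead of time.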

Hallucinations are still an issue too, as is the potential for an AI model to go off the rails and say or do bad things. Companies like Anthropic, Nvidia, H2O.ai, and Snowflake are all working to mitigate those concerns. But as the core capabilities of generative AI get better, the number of reasons not to put AI agents into production decreases.

This article first appeared on BigDATAwire.

About the author: Alex Woodie

Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, including topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He resides in the San Diego area.


