Close Menu
  • Home
  • AI Models
    • DeepSeek
    • xAI
    • OpenAI
    • Meta AI Llama
    • Google DeepMind
    • Amazon AWS AI
    • Microsoft AI
    • Anthropic (Claude)
    • NVIDIA AI
    • IBM WatsonX Granite 3.1
    • Adobe Sensi
    • Hugging Face
    • Alibaba Cloud (Qwen)
    • Baidu (ERNIE)
    • C3 AI
    • DataRobot
    • Mistral AI
    • Moonshot AI (Kimi)
    • Google Gemma
    • xAI
    • Stability AI
    • H20.ai
  • AI Research
    • Allen Institue for AI
    • arXiv AI
    • Berkeley AI Research
    • CMU AI
    • Google Research
    • Microsoft Research
    • Meta AI Research
    • OpenAI Research
    • Stanford HAI
    • MIT CSAIL
    • Harvard AI
  • AI Funding & Startups
    • AI Funding Database
    • CBInsights AI
    • Crunchbase AI
    • Data Robot Blog
    • TechCrunch AI
    • VentureBeat AI
    • The Information AI
    • Sifted AI
    • WIRED AI
    • Fortune AI
    • PitchBook
    • TechRepublic
    • SiliconANGLE – Big Data
    • MIT News
    • Data Robot Blog
  • Expert Insights & Videos
    • Google DeepMind
    • Lex Fridman
    • Matt Wolfe AI
    • Yannic Kilcher
    • Two Minute Papers
    • AI Explained
    • TheAIEdge
    • Matt Wolfe AI
    • The TechLead
    • Andrew Ng
    • OpenAI
  • Expert Blogs
    • François Chollet
    • Gary Marcus
    • IBM
    • Jack Clark
    • Jeremy Howard
    • Melanie Mitchell
    • Andrew Ng
    • Andrej Karpathy
    • Sebastian Ruder
    • Rachel Thomas
    • IBM
  • AI Policy & Ethics
    • ACLU AI
    • AI Now Institute
    • Center for AI Safety
    • EFF AI
    • European Commission AI
    • Partnership on AI
    • Stanford HAI Policy
    • Mozilla Foundation AI
    • Future of Life Institute
    • Center for AI Safety
    • World Economic Forum AI
  • AI Tools & Product Releases
    • AI Assistants
    • AI for Recruitment
    • AI Search
    • Coding Assistants
    • Customer Service AI
    • Image Generation
    • Video Generation
    • Writing Tools
    • AI for Recruitment
    • Voice/Audio Generation
  • Industry Applications
    • Finance AI
    • Healthcare AI
    • Legal AI
    • Manufacturing AI
    • Media & Entertainment
    • Transportation AI
    • Education AI
    • Retail AI
    • Agriculture AI
    • Energy AI
  • AI Art & Entertainment
    • AI Art News Blog
    • Artvy Blog » AI Art Blog
    • Weird Wonderful AI Art Blog
    • The Chainsaw » AI Art
    • Artvy Blog » AI Art Blog
What's Hot

Contextual retrieval in Anthropic using Amazon Bedrock Knowledge Bases

Class Dismissed? Representative Claims in Getty v. Stability AI | Cooley LLP

Google Gemma 3 : Comprehensive Guide to the New AI Model Family

Facebook X (Twitter) Instagram
Advanced AI News
  • Home
  • AI Models
    • Adobe Sensi
    • Aleph Alpha
    • Alibaba Cloud (Qwen)
    • Amazon AWS AI
    • Anthropic (Claude)
    • Apple Core ML
    • Baidu (ERNIE)
    • ByteDance Doubao
    • C3 AI
    • Cohere
    • DataRobot
    • DeepSeek
  • AI Research & Breakthroughs
    • Allen Institue for AI
    • arXiv AI
    • Berkeley AI Research
    • CMU AI
    • Google Research
    • Meta AI Research
    • Microsoft Research
    • OpenAI Research
    • Stanford HAI
    • MIT CSAIL
    • Harvard AI
  • AI Funding & Startups
    • AI Funding Database
    • CBInsights AI
    • Crunchbase AI
    • Data Robot Blog
    • TechCrunch AI
    • VentureBeat AI
    • The Information AI
    • Sifted AI
    • WIRED AI
    • Fortune AI
    • PitchBook
    • TechRepublic
    • SiliconANGLE – Big Data
    • MIT News
    • Data Robot Blog
  • Expert Insights & Videos
    • Google DeepMind
    • Lex Fridman
    • Meta AI Llama
    • Yannic Kilcher
    • Two Minute Papers
    • AI Explained
    • TheAIEdge
    • Matt Wolfe AI
    • The TechLead
    • Andrew Ng
    • OpenAI
  • Expert Blogs
    • François Chollet
    • Gary Marcus
    • IBM
    • Jack Clark
    • Jeremy Howard
    • Melanie Mitchell
    • Andrew Ng
    • Andrej Karpathy
    • Sebastian Ruder
    • Rachel Thomas
    • IBM
  • AI Policy & Ethics
    • ACLU AI
    • AI Now Institute
    • Center for AI Safety
    • EFF AI
    • European Commission AI
    • Partnership on AI
    • Stanford HAI Policy
    • Mozilla Foundation AI
    • Future of Life Institute
    • Center for AI Safety
    • World Economic Forum AI
  • AI Tools & Product Releases
    • AI Assistants
    • AI for Recruitment
    • AI Search
    • Coding Assistants
    • Customer Service AI
    • Image Generation
    • Video Generation
    • Writing Tools
    • AI for Recruitment
    • Voice/Audio Generation
  • Industry Applications
    • Education AI
    • Energy AI
    • Finance AI
    • Healthcare AI
    • Legal AI
    • Media & Entertainment
    • Transportation AI
    • Manufacturing AI
    • Retail AI
    • Agriculture AI
  • AI Art & Entertainment
    • AI Art News Blog
    • Artvy Blog » AI Art Blog
    • Weird Wonderful AI Art Blog
    • The Chainsaw » AI Art
    • Artvy Blog » AI Art Blog
Advanced AI News
Home » Your AI models are failing in production—Here’s how to fix model selection
VentureBeat AI

Your AI models are failing in production—Here’s how to fix model selection

Advanced AI BotBy Advanced AI BotJune 4, 2025No Comments5 Mins Read
Share Facebook Twitter Pinterest Copy Link Telegram LinkedIn Tumblr Email
Share
Facebook Twitter LinkedIn Pinterest Email


Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

Enterprises need to know if the models that power their applications and agents work in real-life scenarios. This type of evaluation can sometimes be complex because it is hard to predict specific scenarios. A revamped version of the RewardBench benchmark looks to give organizations a better idea of a model’s real-life performance. 

The Allen Institute of AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which they claim provides a more holistic view of model performance and assesses how models align with an enterprise’s goals and standards. 

Ai2 built RewardBench with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly deals with reward models (RM), which can act as judges and evaluate LLM outputs. RMs assign a score or a “reward” that guides reinforcement learning with human feedback (RHLF).

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling. pic.twitter.com/NGetvNrOQV

— Ai2 (@allen_ai) June 2, 2025

Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it was launched. Still, the model environment rapidly evolved, and so should its benchmarks. 

“As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn’t fully capture the complexity of real-world human preferences,” he said. 

Lambert added that with RewardBench 2, “we set out to improve both the breadth and depth of evaluation—incorporating more diverse, challenging prompts and refining the methodology to reflect better how humans actually judge AI outputs in practice.” He said the second version uses unseen human prompts, has a more challenging scoring setup and new domains. 

Using evaluations for models that evaluate

While reward models test how well models work, it’s also important that RMs align with company values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior, such as hallucinations, reduce generalization, and score harmful responses too high.

RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.

“Enterprises should use RewardBench 2 in two different ways depending on their application. If they’re performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines because reward models need on-policy training recipes (i.e. reward models that mirror the model they’re trying to train with RL). For inference time scaling or data filtering, RewardBench 2 has shown that they can select the best model for their domain and see correlated performance,” Lambert said. 

Lambert noted that benchmarks like RewardBench offer users a way to evaluate the models they’re choosing based on the “dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score.” He said the idea of performance, which many evaluation methods claim to assess, is very subjective because a good response from a model highly depends on the context and goals of the user. At the same time, human preferences get very nuanced. 

Ai 2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RM have emerged. Researchers at Meta’s FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter and scalable RM. 

Super excited that our second reward model evaluation is out. It’s substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling.

Happy hillclimbing!

Huge congrats to @saumyamalik44 who lead the project with a total commitment to excellence. https://t.co/c0b6rHTXY5

— Nathan Lambert (@natolambert) June 2, 2025

How models performed

Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see if they continue to rank high. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1, and Llama-3.1, along with datasets and models like Qwen, Skywork, and its own Tulu. 

The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. In terms of focus and safety, Skywork data “is particularly helpful,” and Tulu did well on factuality. 

Ai2 said that while they believe RewardBench 2 “is a step forward in broad, multi-domain accuracy-based evaluation” for reward models, they cautioned that model evaluation should be mainly used as a guide to pick models that work best with an enterprise’s needs. 

Daily insights on business use cases with VB Daily

If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI.

Read our Privacy Policy

Thanks for subscribing. Check out more VB newsletters here.

An error occured.





Source link

Follow on Google News Follow on Flipboard
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email Copy Link
Previous ArticleWindsurf says Anthropic is limiting its direct access to Claude AI models
Next Article Protein Industries Canada Targets Supply Chain Resilience with $15 Million in Genomics and AI Funding – vegconomist
Advanced AI Bot
  • Website

Related Posts

Solidroad just raised $6.5M to reinvent customer service with AI that coaches, not replaces

June 5, 2025

How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell

June 5, 2025

Databricks, Noma Tackle CISOs’ AI Inference Nightmare

June 5, 2025
Leave A Reply Cancel Reply

Latest Posts

At Gearbox Records The Sound Quality Remains First

The Science Of De-Extinction Is Providing Hope For Nature’s Future

Influential French Gallerist Daniel Lelong Dies at 92

Anish Kapoor backs appeal to buy Barbara Hepworth sculpture

Latest Posts

Contextual retrieval in Anthropic using Amazon Bedrock Knowledge Bases

June 5, 2025

Class Dismissed? Representative Claims in Getty v. Stability AI | Cooley LLP

June 5, 2025

Google Gemma 3 : Comprehensive Guide to the New AI Model Family

June 5, 2025

Subscribe to News

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

Welcome to Advanced AI News—your ultimate destination for the latest advancements, insights, and breakthroughs in artificial intelligence.

At Advanced AI News, we are passionate about keeping you informed on the cutting edge of AI technology, from groundbreaking research to emerging startups, expert insights, and real-world applications. Our mission is to deliver high-quality, up-to-date, and insightful content that empowers AI enthusiasts, professionals, and businesses to stay ahead in this fast-evolving field.

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

YouTube LinkedIn
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA
  • Privacy Policy
  • Terms & Conditions
© 2025 advancedainews. Designed by advancedainews.

Type above and press Enter to search. Press Esc to cancel.