Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix.

However, these benchmarks often test for general capabilities. For organizations that want to use models and large language model-based agents, it’s harder to evaluate how well the agent or the model actually understands their specific needs.

Model repository Hugging Face launched Yourbench, an open-source tool where developers and enterprises can create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced Yourbench on X. The feature offers “custom benchmarking and synthetic data generation from ANY of your documents. It’s a big step towards improving how model evaluations work.”

He added that Hugging Face knows “that for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you.”

Creating custom evaluations

Hugging Face said in a paper that it demonstrated Yourbench's effectiveness by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings.”
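To make "preserving the relative model performance rankings" concrete, one common way to compare two leaderboards is a rank correlation such as Spearman's rho, where 1.0 means both benchmarks order the models identically. The sketch below is illustrative only: the model names and scores are made up, and it assumes no two models share the same score.

```python
def ranks(scores):
    # Map each model name to its rank (1 = best). Assumes no tied scores.
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(scores_a, scores_b):
    # Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(ranks_a)
    d_squared = sum((ranks_a[m] - ranks_b[m]) ** 2 for m in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical accuracy scores, invented for illustration.
original = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61}
replica = {"model_a": 0.79, "model_b": 0.70, "model_c": 0.55}

rho = spearman(original, replica)
print(rho)
```

The absolute scores differ between the two dictionaries, but the ordering is the same, which is the property the quoted claim describes.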

Organizations need to pre-process their documents before Yourbench can work. This involves three stages:

Document Ingestion to “normalize” file formats.

Semantic Chunking to break down the documents to meet context window limits and focus the model’s attention.

Document Summarization
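The three stages above can be sketched in a few lines of Python. This is a simplified stand-in, not Yourbench's implementation: every function name here is hypothetical, the chunker packs whole sentences under a word budget rather than measuring semantic similarity, and the "summary" simply takes the first sentence where a real pipeline would likely call a model.

```python
import re


def ingest(raw_text):
    # Ingestion stand-in: collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", raw_text).strip()


def split_sentences(text):
    # Split after ., ! or ? followed by whitespace; drop empty pieces.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def chunk(text, max_words=50):
    # Naive chunking: pack whole sentences until the word budget is reached.
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize(text):
    # Placeholder summary: the first sentence, or an empty string.
    sentences = split_sentences(text)
    return sentences[0] if sentences else ""


document = "First  sentence here.\n\nSecond sentence here. Third one."
clean = ingest(document)
pieces = chunk(clean, max_words=4)
summary = summarize(clean)
print(pieces, summary)
```

Keeping each stage as a separate function mirrors the staged description and makes it easy to swap the naive chunker or summarizer for a model-backed version.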

Next comes the question-and-answer generation process, which creates questions from information in the documents. This is where users bring in their chosen LLMs to see which one best answers the questions.
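As a rough illustration of that comparison step, the sketch below scores candidate models against a set of generated question-and-answer pairs. Everything in it is an assumption made for the example: the questions are invented, the "models" are plain Python functions standing in for LLM calls, and exact-match scoring is a simplification of how answers would be graded in practice.

```python
def normalize(answer):
    # Lowercase and collapse whitespace so trivial differences do not count.
    return " ".join(answer.lower().split())


def accuracy(model, qa_pairs):
    # Fraction of questions where the model's answer matches the reference.
    correct = sum(normalize(model(q)) == normalize(a) for q, a in qa_pairs)
    return correct / len(qa_pairs)


# Hypothetical question-answer pairs derived from an internal document.
qa_pairs = [
    ("What is the refund window?", "30 days"),
    ("Who approves travel requests?", "The line manager"),
]


def model_a(question):
    # Stand-in for an LLM call; answers both questions correctly.
    answers = {
        "What is the refund window?": "30 Days",
        "Who approves travel requests?": "the line manager",
    }
    return answers.get(question, "")


def model_b(question):
    # Stand-in for a weaker model; knows only one answer.
    return "30 days" if "refund" in question else "unknown"


results = {"model_a": accuracy(model_a, qa_pairs),
           "model_b": accuracy(model_b, qa_pairs)}
best = max(results, key=results.get)
print(results, best)
```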

Hugging Face tested Yourbench with DeepSeek V3 and R1 models, Alibaba’s Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral 3.1 Small, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o-mini, and o3 mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.

Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce tremendous value for very very low costs.”

Compute limitations

However, creating custom LLM benchmarks based on an organization’s documents comes at a cost. Yourbench requires a lot of compute power to work. Shashidhar said on X that the company is “adding capacity” as fast as it can.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about Yourbench’s compute usage.

Benchmarking is not perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not perfectly capture how the models will behave in day-to-day use.

Some have even voiced skepticism about benchmark tests, arguing that they have limitations of their own and can lead to false conclusions about models’ safety and performance. A study also warned that benchmarking agents could be “misleading.”

However, enterprises cannot avoid evaluating models now that there are many choices in the market and technology leaders must justify the rising cost of using AI models. This has led to different methods to test model performance and reliability.

Google DeepMind introduced FACTS Grounding, which tests a model’s ability to generate factually accurate responses based on information from documents. Researchers from Yale and Tsinghua University developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them.
