
A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction

By Advanced AI Editor, May 19, 2025

[Submitted on 15 Oct 2024 (v1), last revised 16 May 2025 (this version, v4)]

TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction, by Wanying Wang and 3 other authors


Abstract: As large language models (LLMs) are increasingly deployed to various vertical domains, automatically evaluating their performance across different domains remains a critical challenge. Current evaluation methods often rely on static and resource-intensive datasets that are not aligned with real-world requirements and lack cross-domain adaptability. To address these limitations, we revisit the evaluation process and introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible “strategy-criterion” format; and Assessment+, which enhances the interaction process to facilitate deeper exploration and comprehensive analysis from multiple perspectives. We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning. TestAgent enables automatic dynamic benchmark generation and in-depth assessment across diverse vertical domains. Experiments on tasks ranging from constructing multiple vertical domain evaluations to transforming static benchmarks into dynamic forms demonstrate the effectiveness of TestAgent. This work provides a novel perspective on automatic evaluation methods for domain-specific LLMs, offering a pathway for domain-adaptive dynamic benchmark construction and exploratory assessment.

Submission history

From: Wanying Wang
[v1] Tue, 15 Oct 2024 11:20:42 UTC (4,213 KB)
[v2] Wed, 16 Oct 2024 10:36:18 UTC (3,017 KB)
[v3] Tue, 11 Feb 2025 07:03:51 UTC (2,463 KB)
[v4] Fri, 16 May 2025 05:34:13 UTC (422 KB)
