Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems

arXiv:2505.13546v1 Announce Type: new
Abstract: Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts by their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability, defined as the consistency of model responses across repeated executions, as a key factor for building robust and effective prompt generation systems. To quantify it, we propose semantic stability as a criterion for assessing the response consistency of prompts, and we fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components enable us to develop the first stability-aware general-purpose prompt generation system, which uses stability feedback to iteratively improve both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, showing that stability is a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.
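To make the idea of stability as response consistency across repeated executions concrete, here is a minimal Python sketch. It is not the paper's method: the abstract describes a fine-tuned LLaMA-based evaluator, whereas this sketch swaps in token-level Jaccard overlap as a crude stand-in for semantic similarity. The names `stability_score`, `refine_until_stable`, `run_model`, and `revise_prompt`, along with the default threshold and run counts, are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1].

    ASSUMPTION: a crude stand-in for a learned semantic evaluator.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def stability_score(
    responses: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
) -> float:
    """Mean pairwise similarity of repeated responses to one prompt."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to measure consistency")
    pairs = list(combinations(responses, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def refine_until_stable(
    prompt: str,
    run_model: Callable[[str], str],             # hypothetical: one model call
    revise_prompt: Callable[[str, float], str],  # hypothetical: prompt rewriter
    n_runs: int = 5,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Revise a prompt while its stability score stays below the threshold."""

    def measure(p: str) -> float:
        # Execute the same prompt several times and score the agreement.
        return stability_score([run_model(p) for _ in range(n_runs)])

    score = measure(prompt)
    for _ in range(max_rounds):
        if score >= threshold:
            break
        prompt = revise_prompt(prompt, score)
        score = measure(prompt)
    return prompt, score
```

In use, `run_model` would wrap a sampled LLM call and `revise_prompt` would wrap whatever rewriting step the system applies; the returned score always corresponds to the returned prompt. Replacing `jaccard_similarity` with an embedding-based or model-based scorer would move the sketch closer to a semantic measure.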
