Parameterized Argumentation-based Reasoning Tasks For Benchmarking Generative Language Models

arXiv:2505.01539v1
Abstract: Generative large language models, used as tools in the legal domain, have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, and hence these models cannot yet be responsibly applied in the domains of law and evidence. In this paper, we introduce an approach for creating benchmarks that can be used to evaluate the reasoning capabilities of generative language models. These benchmarks are dynamically varied, scalable in their complexity, and have formally unambiguous interpretations. We illustrate the approach using witness testimony, focusing on the underlying argument attack structure. We dynamically generate both linear and non-linear argument attack graphs of varying complexity and translate these into reasoning puzzles about witness testimony expressed in natural language. We show that state-of-the-art large language models often fail at these reasoning puzzles, even at low complexity. The models make obvious mistakes, and their inconsistent performance indicates that their reasoning capabilities are brittle. Furthermore, at higher complexity, even state-of-the-art models specifically presented as having reasoning capabilities make mistakes. We show the viability of using a parameterized benchmark with varying complexity to evaluate the reasoning capabilities of generative language models. As such, the findings contribute to a better understanding of the limitations of the reasoning capabilities of generative models, which is essential when designing responsible AI systems in the legal domain.
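
To make the idea of a parameterized attack-graph puzzle concrete, here is a minimal Python sketch. It is not the authors' generator: the abstract does not specify the formal semantics or the wording of the puzzles, so the sketch assumes Dung-style abstract argumentation with grounded semantics (an argument is accepted once all of its attackers are rejected, and rejected once any attacker is accepted). The function names, the witness names, and the sentence templates are all hypothetical.

```python
def linear_attack_chain(n):
    """Hypothetical generator: arguments 0..n-1, where argument i+1 attacks argument i."""
    return [(i + 1, i) for i in range(n - 1)]


def grounded_accepted(num_args, attacks):
    """Return the accepted arguments under grounded semantics (assumed here, not from the paper)."""
    attackers = {a: set() for a in range(num_args)}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in range(num_args):
            if a in accepted or a in rejected:
                continue
            if attackers[a] <= rejected:
                # Every attacker is already defeated (or there are none).
                accepted.add(a)
                changed = True
            elif attackers[a] & accepted:
                # At least one attacker is accepted.
                rejected.add(a)
                changed = True
    return accepted


def to_puzzle(attacks, names):
    """Render an attack graph as a witness-testimony puzzle (illustrative wording only)."""
    lines = [f"{names[0]} says that the suspect was at the scene."]
    for src, dst in attacks:
        lines.append(f"{names[src]} says that {names[dst]} is lying.")
    lines.append(f"Should we believe {names[0]}?")
    return "\n".join(lines)


if __name__ == "__main__":
    names = ["Alice", "Bob", "Carol"]  # hypothetical witnesses
    attacks = linear_attack_chain(len(names))
    print(to_puzzle(attacks, names))
    answer = 0 in grounded_accepted(len(names), attacks)
    print("Expected answer:", "yes" if answer else "no")
```

In this sketch the chain length is the complexity parameter, and a non-linear graph is simply an attack list in which some argument has more than one attacker, for example `[(1, 0), (2, 0), (3, 1)]`. Because the grounded extension is unique, each generated puzzle has a single reference answer against which a model's response can be scored.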
