From Scores To Skills: A Cognitive Diagnosis Framework For Evaluating Financial Large Language Models - Takara TLDR

Large Language Models (LLMs) have shown promise for financial applications,
yet their suitability for this high-stakes domain remains largely unproven due
to inadequacies in existing benchmarks. Existing benchmarks rely solely on
score-level evaluation, summarizing performance with a single score that
obscures what models actually know and where precisely they fall short.
They also draw on datasets that cover only a narrow subset of
financial concepts while overlooking others essential for real-world
applications. To address these gaps, we introduce FinCDM, the first cognitive
diagnosis evaluation framework tailored for financial LLMs. FinCDM evaluates
LLMs at the knowledge-skill level: it identifies which financial
skills and knowledge they have or lack based on their response patterns across
skill-tagged tasks, rather than reporting a single aggregated number. We construct
CPA-QKA, the first cognitively informed financial evaluation dataset derived
from the Certified Public Accountant (CPA) examination, with comprehensive
coverage of real-world accounting and financial skills. Its questions are
authored, validated, and rigorously annotated by domain experts, with
high inter-annotator agreement and fine-grained knowledge labels. Our extensive
experiments on 30 proprietary, open-source, and domain-specific LLMs show that
FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax
and regulatory reasoning overlooked by traditional benchmarks, and uncovers
behavioral clusters among models. FinCDM introduces a new paradigm for
financial LLM evaluation by enabling interpretable, skill-aware diagnosis that
supports more trustworthy and targeted model development. All datasets and
evaluation scripts will be publicly released to support further research.
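
To make the contrast between a single aggregated score and a skill-level diagnosis concrete, here is a minimal sketch. It is not FinCDM's actual diagnosis model, which the abstract does not specify. It only computes per-skill accuracy from a binary item-to-skill tagging matrix (often called a Q-matrix in the cognitive diagnosis literature). The skill names, the matrix, the responses, and the helper `skill_profile` are all hypothetical.

```python
from typing import Dict, List


def skill_profile(
    responses: List[int],
    q_matrix: List[List[int]],
    skills: List[str],
) -> Dict[str, float]:
    """Return per-skill accuracy: the share of correct answers among
    the items tagged with each skill. Skills with no tagged items are skipped."""
    profile: Dict[str, float] = {}
    for k, skill in enumerate(skills):
        # Collect the responses to every item that is tagged with skill k.
        tagged = [r for r, row in zip(responses, q_matrix) if row[k] == 1]
        if tagged:
            profile[skill] = sum(tagged) / len(tagged)
    return profile


# Hypothetical skills and item tagging (rows = items, columns = skills).
skills = ["tax", "audit", "financial_reporting"]
q_matrix = [
    [1, 0, 0],  # item 1 tests tax
    [1, 0, 0],  # item 2 tests tax
    [0, 1, 0],  # item 3 tests audit
    [0, 1, 1],  # item 4 tests audit and financial reporting
    [0, 0, 1],  # item 5 tests financial reporting
    [0, 0, 1],  # item 6 tests financial reporting
]

# Hypothetical model responses: 1 = correct, 0 = incorrect.
responses = [0, 0, 1, 1, 1, 1]

# Score-level view: one number for the whole test.
overall = sum(responses) / len(responses)

# Skill-level view: one number per tagged skill.
profile = skill_profile(responses, q_matrix, skills)

print(f"overall accuracy: {overall:.2f}")
for skill, acc in profile.items():
    print(f"{skill}: {acc:.2f}")
```

In this toy case the overall accuracy looks moderate, while the per-skill view shows that every tax item was answered incorrectly. That is the kind of gap a single aggregated number hides. A full cognitive diagnosis model would go further and estimate latent mastery rather than raw per-skill accuracy.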
