A new multilingual, multi-sector, and multi-task benchmark, M³FinMeeting, evaluates large language models’ performance in understanding financial meetings across different languages and industries.
Recent breakthroughs in large language models (LLMs) have led to the
development of new benchmarks for evaluating their performance in the financial
domain. However, current financial benchmarks often rely on news articles,
earnings reports, or announcements, making it challenging to capture the
real-world dynamics of financial meetings. To address this gap, we propose a
novel benchmark called M³FinMeeting, which is a multilingual,
multi-sector, and multi-task dataset designed for financial meeting
understanding. First, M³FinMeeting supports English, Chinese, and
Japanese, enabling evaluation of how well models comprehend financial
discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by
the Global Industry Classification Standard (GICS), ensuring that the benchmark
spans a broad range of financial activities. Finally,
M³FinMeeting includes three tasks: summarization, question-answer
(QA) pair extraction, and question answering, facilitating a more realistic and
comprehensive evaluation of meeting understanding. Experimental results with seven
popular LLMs reveal that even the most advanced long-context models have
significant room for improvement, demonstrating the effectiveness of
M³FinMeeting as a benchmark for assessing LLMs’ financial meeting
comprehension skills.