Large language models (LLMs) have recently demonstrated strong capabilities
as autonomous agents, showing promise in reasoning, tool use, and sequential
decision-making. While prior benchmarks have evaluated LLM agents in domains
such as software engineering and scientific discovery, the finance domain
remains underexplored, despite its direct relevance to economic value and
high-stakes decision-making. Existing financial benchmarks primarily test
static knowledge through question answering, but they fall short of capturing
the dynamic and iterative nature of trading. To address this gap, we introduce
StockBench, a contamination-free benchmark designed to evaluate LLM agents in
realistic, multi-month stock trading environments. Agents receive daily market
signals, including prices, fundamentals, and news, and must make sequential
buy, sell, or hold decisions. Performance is assessed using financial metrics
such as cumulative return, maximum drawdown, and the Sortino ratio. Our
evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and
open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM
agents struggle to outperform the simple buy-and-hold baseline, several models
demonstrate the potential to deliver higher returns and manage risk more
effectively. These findings highlight both the challenges and opportunities in
developing LLM-powered financial agents, showing that excelling at static
financial knowledge tasks does not necessarily translate into successful
trading strategies. We release StockBench as an open-source resource to support
reproducibility and advance future research in this domain.
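
To make the evaluation protocol concrete, here is a minimal sketch of the daily decision loop described above, assuming a hypothetical environment and agent interface; the names `DailyObservation`, `daily_observations`, `decide`, `apply_action`, and `portfolio_value` are illustrative, not the released StockBench API:

```python
from dataclasses import dataclass

@dataclass
class DailyObservation:
    """One trading day's market signals, mirroring the abstract's description.
    All field names here are hypothetical illustrations."""
    date: str
    prices: dict[str, float]        # ticker -> closing price
    fundamentals: dict[str, dict]   # ticker -> fundamental indicators
    news: list[str]                 # headlines released that day

def run_episode(env, agent) -> list[float]:
    """Roll an LLM agent through a multi-month trading episode,
    recording portfolio value after each trading day."""
    equity_curve = []
    for obs in env.daily_observations():   # one DailyObservation per day
        # The agent maps signals to per-ticker actions,
        # e.g. {"AAPL": "buy", "MSFT": "hold", "NVDA": "sell"}.
        action = agent.decide(obs)
        env.apply_action(action)
        equity_curve.append(env.portfolio_value())
    return equity_curve
```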
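The reported metrics can likewise be sketched. The functions below compute cumulative return, maximum drawdown, and the Sortino ratio from a daily equity curve; the 252-trading-day annualization factor and the downside-deviation convention are common assumptions, not details confirmed by the paper:

```python
import numpy as np

def cumulative_return(equity: np.ndarray) -> float:
    # Total portfolio growth over the episode.
    return float(equity[-1] / equity[0] - 1.0)

def max_drawdown(equity: np.ndarray) -> float:
    # Largest peak-to-trough decline, as a positive fraction.
    running_peak = np.maximum.accumulate(equity)
    return float(np.max((running_peak - equity) / running_peak))

def sortino_ratio(equity: np.ndarray,
                  risk_free: float = 0.0,
                  periods_per_year: int = 252) -> float:
    # Like the Sharpe ratio, but penalizes only downside volatility.
    returns = np.diff(equity) / equity[:-1]
    excess = returns - risk_free / periods_per_year
    downside = excess[excess < 0]
    downside_dev = np.sqrt(np.mean(downside ** 2)) if downside.size else np.nan
    return float(np.mean(excess) / downside_dev * np.sqrt(periods_per_year))
```

A buy-and-hold baseline can be scored with the same functions by feeding them the equity curve of a portfolio that simply holds its initial positions for the entire episode.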