Paper page - TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs covering 3 levels with 11 fine-grained sub-tasks. The benchmark comprises 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on both reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME, and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME.
