Browsing: Hugging Face
Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an…
We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models’ (LRMs) performance on simple tasks that favor intuitive…
Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning…
OpenAI’s multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic…
Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information…
Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing…
Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time…
Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows.…
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural…
Overview of EmoEval for Evaluating Mental Safety of AI-human Interactions. The simulation consists of four steps: (1) User Agent Initialization…