Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

[Submitted on 15 Aug 2024 (v1), last revised 12 Apr 2025 (this version, v4)]

Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W. Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang

Abstract: Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks for each task, which break down a task into intermediary steps for a more detailed evaluation. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top-performing models (GPT-4o and Claude 3.5 Sonnet), we further investigate performance across 4 agent scaffolds (structured bash, action-only, pseudoterminal, and web search). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that took human teams up to 11 minutes to solve. In comparison, the most difficult task took human teams 24 hours and 54 minutes to solve. All code and data are publicly available at this https URL.
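
As a rough illustration of the task and subtask structure the abstract describes, the following Python sketch models a task with a description, starter files, and ordered subtasks, and scores an agent by the fraction of subtasks it answers correctly. This is not the Cybench code or schema: the class names, field names, the exact-match scoring rule, and the toy task are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    # Hypothetical fields: an intermediate question and its expected answer.
    question: str
    answer: str


@dataclass
class Task:
    # Hypothetical fields mirroring the abstract: description, starter files, subtasks.
    name: str
    description: str
    starter_files: list[str] = field(default_factory=list)
    subtasks: list[Subtask] = field(default_factory=list)


def subtask_score(task: Task, agent_answers: list[str]) -> float:
    """Fraction of subtasks answered correctly (exact match after stripping whitespace)."""
    if not task.subtasks:
        return 0.0
    correct = sum(
        1
        for sub, ans in zip(task.subtasks, agent_answers)
        if sub.answer.strip() == ans.strip()
    )
    return correct / len(task.subtasks)


# Toy task invented for this example; it is not taken from the benchmark.
toy = Task(
    name="toy-crypto",
    description="Recover the flag from the provided ciphertext.",
    starter_files=["cipher.txt"],
    subtasks=[
        Subtask("Which cipher is used?", "xor"),
        Subtask("What is the flag?", "flag{example}"),
    ],
)
```

Under these assumptions, an agent that identifies the cipher but fails to recover the flag would still receive partial credit, which is the kind of finer-grained signal subtasks are meant to provide when the full task is out of reach.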

Submission history

From: Riya Dulepet
[v1] Thu, 15 Aug 2024 17:23:10 UTC (2,722 KB)
[v2] Sun, 6 Oct 2024 22:19:54 UTC (2,940 KB)
[v3] Thu, 5 Dec 2024 19:46:36 UTC (2,982 KB)
[v4] Sat, 12 Apr 2025 21:26:07 UTC (2,982 KB)
