
[2505.13511] Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

By Advanced AI Editor, May 22, 2025

[Submitted on 16 May 2025]

Paper authors: David Noever and 1 other author


Abstract: This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. It presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $250, average $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary valuation of performance. The approach is inspired by OpenAI’s recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M total), but our framework simplifies evaluation by using programmatically testable tasks and predicted price values, which makes it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model’s accuracy (task success rate and test-case pass rate) and the total “freelance earnings” it achieves (the sum of the prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million USD, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33M) and Mistral ($0.70M). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks and the true complexity of real-world freelance jobs.
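The three metrics named in the abstract (task success rate, test-case pass rate, and earnings as the sum of prices of solved tasks) can be illustrated with a minimal Python sketch. This is not the authors' code. The names TaskResult and score are hypothetical, and the rule that a task counts as "solved" only when every one of its test cases passes is an assumption, since the abstract does not state the threshold.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task for one model (hypothetical structure)."""
    price_usd: float   # estimated price tag attached to the task
    tests_passed: int  # number of input-output test cases the model passed
    tests_total: int   # number of test cases defined for the task

    @property
    def solved(self) -> bool:
        # Assumption: a task is solved only if all of its test cases pass.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def score(results):
    """Aggregate per-task results into the three metrics the abstract names."""
    n_tasks = len(results)
    solved = [r for r in results if r.solved]
    total_tests = sum(r.tests_total for r in results)
    passed_tests = sum(r.tests_passed for r in results)
    return {
        # Fraction of tasks that were fully solved.
        "task_success_rate": len(solved) / n_tasks if n_tasks else 0.0,
        # Fraction of individual test cases passed, pooled over all tasks.
        "test_pass_rate": passed_tests / total_tests if total_tests else 0.0,
        # "Freelance earnings": sum of the prices of solved tasks only.
        "earnings_usd": sum(r.price_usd for r in solved),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    demo = [
        TaskResult(price_usd=250.0, tests_passed=4, tests_total=4),
        TaskResult(price_usd=300.0, tests_passed=2, tests_total=4),
        TaskResult(price_usd=100.0, tests_passed=0, tests_total=2),
    ]
    print(score(demo))
```

Under this all-or-nothing assumption, a partially correct solution raises the test-case pass rate but earns nothing, which is why the two accuracy figures and the earnings total can diverge.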

Submission history

From: David Noever
[v1] Fri, 16 May 2025 22:42:04 UTC (563 KB)
