
By Jim Wagner, CEO, The Contract Network
A ‘Vals 2’ legal AI benchmarking project is planned, and it is bound to spark debate. As many remember, ‘Vals 1’ was far from smooth – companies withdrew and participants questioned the methodology. I support transparency in legal AI, but past benchmarking efforts in our industry, including Vals 1, have at best produced mixed results.
When executed well, benchmarking offers genuine insight. It lets legal professionals choose tools with confidence, nudges vendors to improve, and, in rare cases, moves the entire field forward (thank you again, Maura Grossman and Gordon Cormack).
Still, truly effective benchmarking in legal AI remains elusive. Prior attempts, both inside and outside law, have often generated more heat than light, leaving would‑be adopters uncertain about what these tools can accomplish in real‑world conditions.
The Perils of ‘Good Enough’ – When Benchmarks Fall Short
Even well‑intentioned studies can mislead. We saw that with the first Vals Legal AI Report (‘Vals 1’). While groundbreaking, it struggled with three issues:
Timeliness of Results. The AI landscape shifts at breakneck speed. As Vecflow noted in its commentary on Vals 1, its product had “advanced substantially” during the six months separating data collection from publication. By the time readers saw the numbers, those numbers already lagged behind reality.
Sample Size and Scope. Some tasks faced questions about whether the dataset was large enough to support broad conclusions. Noah Waisberg’s ‘Missed‑MFN‑Gate’ analysis argued that data‑extraction tests should span a far larger document set to paint a reliable picture of accuracy.
Perceived Conflicts of Interest. Vals 1 disclosed that ‘Vals AI has a customer relationship with one or more of the participants.’ Transparency helps, but such ties naturally raise doubts about impartiality and underscore the need for demonstrably independent evaluation.
The Frontier‑Model Factor
Most specialized legal‑AI tools rely on the same frontier models – Anthropic’s Claude, OpenAI’s GPT family, or their peers. That means macro‑level performance will trend with breakthroughs in those underlying models more than with incremental vendor enhancements. When OpenAI or Anthropic releases a model with stronger reasoning, the tools built on it improve in tandem, regardless of domain tuning.
Implementation still matters. Effective retrieval‑augmented generation, domain‑specific data, and thoughtful prompt engineering can yield sizable gains. We see it daily in the clinical‑research agreements we handle. But we should all acknowledge that the ultimate ceilings are set by the foundation models themselves. An ideal, but likely unreachable, benchmark would separate what stems from the base model from the value created by the legal‑AI layer.
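To make that last point concrete, here is a minimal sketch of what such an ablation could look like, assuming a simple exact‑match comparison against expert labels. Everything in it – the Example alias, the score and ablation_report helpers, and the stand‑in model functions – is an illustrative assumption, not a protocol from Vals or any vendor.

```python
from typing import Callable, Iterable

# Hypothetical labeled examples: (contract_text, expert_answer).
Example = tuple[str, str]

def score(answer_fn: Callable[[str], str], examples: Iterable[Example]) -> float:
    """Fraction of examples where the system's answer exactly matches the expert label."""
    examples = list(examples)
    hits = sum(1 for text, expected in examples if answer_fn(text).strip() == expected)
    return hits / len(examples)

def ablation_report(base_model_fn: Callable[[str], str],
                    vendor_pipeline_fn: Callable[[str], str],
                    examples: Iterable[Example]) -> None:
    """Score the bare foundation model and the full legal-AI pipeline (retrieval,
    domain prompts, etc.) on the same labeled set, so the delta attributable to
    the legal-AI layer becomes visible."""
    examples = list(examples)
    base = score(base_model_fn, examples)
    full = score(vendor_pipeline_fn, examples)
    print(f"base model:      {base:.1%}")
    print(f"vendor pipeline: {full:.1%}")
    print(f"layer delta:     {full - base:+.1%}")

if __name__ == "__main__":
    # Stand-in functions and toy data; a real study would call actual model APIs.
    examples = [("...contract A...", "30 days"), ("...contract B...", "Delaware")]
    ablation_report(lambda text: "30 days",
                    lambda text: "30 days" if "contract A" in text else "Delaware",
                    examples)
```

The structure, not the scoring, is the point: the same labeled set, answered once by the bare model and once by the full pipeline, is what isolates the value added by the legal‑AI layer.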
How We Communicate Results
Beyond the reports themselves lies a broader problem: marketing. Some vendors trumpet headline accuracy figures like ‘verified xx % accurate’. Within a tightly controlled test – known documents, known issues, and expert users – the claim is likely true. The trouble comes when that number, stripped of context, morphs into a promise of universal performance. In messy reality – unfamiliar contract types, novel clauses, users with uneven skills – results will vary, sometimes sharply. Overstated claims breed skepticism and, ultimately, distrust of the entire AI community.
It’s particularly important to note that courts and regulatory bodies regularly cite benchmarking studies when weighing AI‑assisted efforts. Our homework will face scrutiny on the record.
Real‑World Benchmarking Challenges: Lessons from the Trenches
After years of building and evaluating AI tools for the legal community, I keep running into two persistent hurdles:
Bandwidth. Proper evaluations demand enormous effort. Experts have to label data. Vendors must dedicate engineers. Evaluators must craft and run rigorous protocols. The temptation to cut corners—fewer documents, simpler tasks, automated scoring—always looms but undercuts the value of the exercise.
A Moving Gold Standard. Legal professionals often disagree on what is ‘correct’. Clause identification, risk assessment, even basic interpretation spark debate. Place three experts in a room, and you may receive four answers. When half approve an AI output and half dissent, is that success or failure?
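Chance‑corrected agreement statistics will not settle that debate, but they at least make the disagreement measurable before anyone scores the tool. Below is a minimal Python sketch, assuming three reviewers label the same AI outputs; the votes, labels, and function name are illustrative, not data from any benchmark.

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items each rated by the same number of reviewers:
    observed agreement among rater pairs, corrected for chance agreement."""
    n_items, n_raters = len(ratings), len(ratings[0])
    per_item_agreement = []
    category_totals = Counter()
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_observed = sum(per_item_agreement) / n_items
    p_chance = sum((total / (n_items * n_raters)) ** 2 for total in category_totals.values())
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical votes: three reviewers judging four AI-extracted clauses.
votes = [
    ["correct", "correct", "incorrect"],
    ["correct", "incorrect", "incorrect"],
    ["correct", "correct", "correct"],
    ["incorrect", "incorrect", "correct"],
]
print(f"Fleiss' kappa: {fleiss_kappa(votes):.2f}")
```

A kappa near zero means the experts agree no more often than chance would predict, which is a finding about the gold standard itself, not about the tool being measured.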
The Blueprint: Evolving Toward Gold‑Standard Legal‑AI Benchmarking
If we want benchmarking to fulfill its promise, we must treat it as an ongoing discipline. Key priorities:
Transparency and Independence. Publish methodology, dataset characteristics, scoring rubrics, funding sources, and governance. Show how conflicts of interest are managed.
Robustness and Realism. Use sufficiently large, diverse datasets and tasks that mirror real practice, not synthetic edge cases.
Objective, Contextualized Metrics. Move beyond a single accuracy number. Balanced scorecards that pair precision, recall, and F‑score with task‑specific qualitative notes give a fuller picture (a minimal scoring sketch follows this list).
Continuous, Accessible Evaluation. AI evolves rapidly; benchmarks must keep pace. Evaluate current models on rolling schedules and share findings widely.
Guardrails Against “Teaching to the Test.” Rotate blind test sets, vary tasks, and focus on core reasoning skills to prevent overfitting.
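As one illustration of the balanced‑scorecard point above, here is a minimal sketch, assuming a binary clause‑extraction task tallied as true positives, false positives, and false negatives. The task names and counts are made up for illustration; the point is that precision and recall can pull in opposite directions while a single headline number hides the trade‑off.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Hypothetical per-task tallies from a clause-extraction evaluation."""
    task: str
    true_positives: int   # clauses the tool found and experts confirmed
    false_positives: int  # clauses the tool flagged that experts rejected
    false_negatives: int  # clauses experts found that the tool missed

def scorecard(result: TaskResult) -> dict:
    """Precision, recall, and F1 for one task, rather than a single 'accuracy' figure."""
    tp, fp, fn = result.true_positives, result.false_positives, result.false_negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"task": result.task, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for two tasks: the first run is precise but misses clauses,
# the second finds nearly everything but over-flags. One headline number hides that.
for r in (TaskResult("MFN clause extraction", 42, 3, 15),
          TaskResult("Termination clause extraction", 50, 20, 2)):
    s = scorecard(r)
    print(f"{s['task']}: precision={s['precision']:.2f} recall={s['recall']:.2f} f1={s['f1']:.2f}")
```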
A Pragmatic Path Forward
Legal AI benchmarking is in its crawl‑walk‑run phase. Missteps are inevitable, but poorly designed evaluations can do more harm than good by distorting perceptions of capability.
As Vals 2 and other initiatives proceed, I hope we see benchmarks that own their limits, emphasize real‑world utility, and inch us toward fair, contextual assessment. The goal is not to anoint champions but to deepen industry understanding, spur responsible innovation, and help legal professionals adopt AI wisely. The future of AI in law depends on it.
—
About the Author:
Jim Wagner is co‑founder and CEO of The Contract Network, where he and his colleagues tackle the wasted effort in contracting. Before founding TCN, he served as Vice President of Agreement Cloud Strategy at DocuSign after DocuSign acquired Seal Software, where he was President. A serial founder in legal tech, Jim holds multiple patents on using AI and analytics for legal documents.
—
[ This is an educational think piece written by Jim Wagner for Artificial Lawyer. ]
—
Legal Innovators California Conference, San Francisco, June 11 + 12
If you’re interested in the cutting edge of legal AI and innovation – and where we are all heading – then come along to Legal Innovators California, in San Francisco, June 11 and 12, where speakers from the leading law firms, inhouse teams, and tech companies will be sharing their insights and experiences on what is really happening across the market.
We already have an incredible roster of companies to hear from. This includes: Legora, Harvey, StructureFlow, Ivo, Flatiron Law Group, PointOne, Centari, LexisNexis, eBrevia, Legatics, Knowable, Draftwise, newcode.AI, Riskaway, Aracor, SimpleClosure and more.

See you all there!
More information and tickets here.