GPT-5 Tops Harvey’s BigLaw Bench Eval – Artificial Lawyer

As AL shared last night, Harvey – and other companies – have had early access to GPT-5. The genAI pioneer has analysed the new LLM’s outputs and marked it as the best-performing OpenAI model using its ‘BigLaw Bench’ AI evaluation system. It scored 89.22% overall.

The company launched BigLaw Bench (see AL article) last year to help with gauging the quality of genAI responses, in particular relative to how a lawyer would expect an acceptable response to read.

As they explained at the time – ‘Each task in BigLaw Bench is assessed using custom-designed rubrics that measure:

Answer Quality: Evaluates the completeness, accuracy, and appropriateness of the model’s response based on specific criteria essential for effective task completion.

Source Reliability: Assesses the model’s ability to provide verifiable and correctly cited sources for its assertions, enhancing trust and facilitating validation.

Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps (e.g. hallucinations).

Those scores are then expressed as percentages.’
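To make that concrete, here is a minimal sketch of how a rubric of this kind could be totalled up. It is not Harvey's actual implementation: the `Criterion` class, the `rubric_score` function, the point values and the choice to divide by the maximum positive points available (with a floor at zero) are all assumptions made for illustration, since the quoted description does not spell out the exact formula.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line (hypothetical structure, not Harvey's schema)."""
    description: str
    points: float    # positive = task requirement, negative = penalty for an error
    triggered: bool  # requirement met, or error actually present in the answer


def rubric_score(criteria: list[Criterion]) -> float:
    """Return a percentage score for one task.

    Assumption: the denominator is the total of all positive points on offer,
    and the result is floored at 0 so heavy penalties cannot go negative.
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(c.points for c in criteria if c.triggered)
    return max(0.0, 100.0 * earned / max_points)


# Illustrative rubric with made-up values: two of three requirements met,
# plus one hallucination penalty.
example = [
    Criterion("Identifies the governing clause", 4, True),
    Criterion("Explains the practical consequence", 3, True),
    Criterion("Flags the limitation period", 3, False),
    Criterion("Cites a case that does not exist", -2, True),
]

print(f"{rubric_score(example):.1f}%")  # 4 + 3 - 2 = 5 of a possible 10
```

On this reading, a single hallucination can wipe out the credit earned for an otherwise solid point, which is why penalties matter so much at the top end of the scale.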

And below is the chart they have provided. As you can see, GPT-5 scored 89.22%, a notable improvement of around five percentage points on the next closest result shown: 84.13% for another OpenAI model, o3. (Note: Harvey uses other companies’ models, not just OpenAI’s, but those are not shown here.)
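For anyone who wants to check the arithmetic, the gap between the two published scores can be expressed in a few ways. The short calculation below uses only the two figures quoted above; the variable names are illustrative, and describing the distance from 100% as a 'gap to a perfect score' is a simplification, as the benchmark blends positive and negative points.

```python
gpt5 = 89.22  # GPT-5 overall score quoted above
o3 = 84.13    # o3 overall score quoted above

point_gain = gpt5 - o3                   # difference in percentage points
relative_gain = point_gain / o3 * 100    # improvement relative to o3's score
gap_closed = point_gain / (100 - o3) * 100  # share of o3's gap to 100% that was closed

print(f"Gain: {point_gain:.2f} percentage points")
print(f"Relative improvement over o3: {relative_gain:.2f}%")
print(f"Share of the remaining gap closed: {gap_closed:.2f}%")
```

This works out at 5.09 percentage points, or roughly a 6% relative improvement, and it closes about a third of the distance o3 still had to go to reach 100%.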

Moreover, this is really starting to get close to ‘last mile’ territory.

I.e. the closer we get to something where lawyers can go ‘yep, that’s fine, let it through’, the harder each further gain becomes.

Getting to ‘it’s kind of right, but needs some work to get to the level I want’ is relatively easy for many LLMs. But getting up to 90%, and then into that massive last mile on the journey to 99%, is a totally different experience.

But we are moving in the right direction. Plus, these outputs will improve as Harvey (and other legal tech companies) apply refinement, system prompting, and orchestration with related data.

Which raises the question: can we ever get to 99.9% on BigLaw Bench? Probably not for some years yet, but eventually…? Why not? It goes back to the Waymo analogy this site has used a few times now: getting to the level of success where people just go with it is incredibly hard in a super-complex, unstructured environment but, as Waymo showed, it can be done with enough time and investment.

Will new genAI models get much better? It’s hard to say. There will be incremental improvements for sure. But, bigger steps may come from other strategies, such as improving the verification layer.

Either way, we are making progress, and at an incredible pace. In three years we have gone from scepticism about AI to a majority of large law firms, and their clients too, engaging deeply with the technology. And central to this change is the performance of the models. If those LLMs didn’t deliver, then the lawyers would not be so enthusiastic about the current wave of legal AI tools.

—

Right, what else?

In Harvey’s blog post on the new model, they also added some details about their own plans on how to leverage GPT-5:

‘Integrated into Harvey’s systems, these baseline capabilities can be leveraged to enable more powerful use cases in the document drafting and complex research domains. GPT-5 is also the first orchestration model that appears capable of combining these tasks—allowing for a single agent to both collaborate with a user on the research and produce the finished work product.

For example, on a task like: ‘Identify if any of these internal guidance documents are inconsistent with current regulation, we operate in the United States and the European Union’ . . . GPT-5 can be used to orchestrate agents that:

Review the internal documents to identify relevant trends to search for;

Find recent changes in global regulation;

Perform a comprehensive review of any gaps between the two; and

Draft a memo of recommendations of how to best update your internal guidance to stay aligned with the new regulatory environment.

All while prompting the user as needed for additional context to ensure it reaches the goal as expected.

Coupled with our recently-announced data partnerships with LexisNexis and iManage, Harvey is now able to see the full picture – public and proprietary – before it acts. With GPT-5’s substantially improved tool-use and drafting capabilities, we can now build a deeply integrated AI system that reasons over an organization’s internal data and leverages trusted third-party content in real-time.

Building an Intelligent Coworker

Complex matters don’t unfold linearly; they advance dynamically through iteration, and in close collaboration with internal and external stakeholders. With GPT-5, and our product and data ingredients in place, Harvey’s north star of creating an intelligent coworker comes into focus.’

—

You can find more about Harvey and read the original post here. Thanks to CEO Winston Weinberg and team for sharing.
