S3 Launches – LLM Eval ‘For Any Jurisdiction, Language + Model’ – Artificial Lawyer

By Advanced AI Editor | June 30, 2025

Raymond Blyd, the well-known legal tech expert, has launched S3, a new LLM evaluation framework for legal needs, which focuses on ‘identifying core deficiencies rather than proficiencies’.

As Blyd explained to AL, S3 was created during the development of Sabaio, his earlier AI company, to calibrate and compare open-source models, with a focus on accuracy and hallucinations.

It provides:

‘Standardized Evaluation Metrics: Implements industry-standard benchmarks and custom metrics tailored for legal tasks.

Reproducible Workflows: Ensures that evaluation processes can be repeated and verified by others.

Extensible Architecture: Easily add new evaluation modules or integrate with other legal tech tools.

Transparent Reporting: Generates clear, auditable reports for regulatory and internal review.’
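
The 'Extensible Architecture' point above implies evaluation modules that plug into a common interface. As a purely illustrative Python sketch (the class and method names are assumptions, not taken from the S3 codebase), such a module might look like this:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Outcome of a single evaluation question."""
    question_id: str
    correct: bool


class EvalModule(ABC):
    """Hypothetical plug-in interface for an S3-style evaluation module."""

    name: str = "base"

    @abstractmethod
    def build_prompt(self, case: dict) -> str:
        """Turn one test case into the prompt sent to the model."""

    @abstractmethod
    def score(self, case: dict, model_answer: str) -> EvalResult:
        """Decide whether the model's answer is correct for this case."""
```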

Blyd commented: ‘I needed a consistent method to assess improvements in core model capabilities. For instance, many models failed to cite correct articles or reference numbers. To test this, I developed a simple ‘Strawberry’ test by offsetting legal article numbers to check model accuracy. Most models failed, exposing their unreliability.

‘This insight led to the creation of a prompt template for model testing. The template uses a fixed structure – jurisdiction, code, article number, offset, and legal topic – to ensure consistency. This allows for measurable, reproducible comparisons of model performance across languages and legal systems.’
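
As an illustration only, a fixed-structure template of that kind might be written roughly as follows (the wording, field names, and helper function are assumptions, not the actual S3 template):

```python
from string import Template

# Hypothetical template following the fixed structure Blyd describes:
# jurisdiction, code, article number, offset, and legal topic.
OFFSET_CHECK = Template(
    "In $jurisdiction, under the $code, is Article $claimed_article "
    "the provision that governs $topic? Answer only 'true' or 'false'."
)


def build_offset_question(jurisdiction: str, code: str,
                          correct_article: int, offset: int,
                          topic: str) -> tuple[str, bool]:
    """Build one 'Strawberry'-style question.

    The article number shown to the model is deliberately shifted by
    `offset`, so the expected answer is 'false' whenever the offset is
    non-zero.
    """
    claimed = correct_article + offset
    prompt = OFFSET_CHECK.substitute(
        jurisdiction=jurisdiction, code=code,
        claimed_article=claimed, topic=topic,
    )
    expected_true = (offset == 0)
    return prompt, expected_true


# Illustrative example: Dutch tort liability sits in art. 6:162 of the
# Civil Code (the book prefix is dropped here for simplicity); shifting
# the number by 3 should make an accurate model answer 'false'.
prompt, expected = build_offset_question(
    "the Netherlands", "Dutch Civil Code", 162, 3, "tort liability"
)
```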

The framework employs a ‘straightforward quantitative approach: each model responds to a fixed set of objective questions, and correct answers are counted’. Performance is reported as a ratio (e.g., 12/12), enabling transparent and reproducible comparisons between models and test runs, he explained.
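
A minimal sketch of that scoring step, assuming each test case records whether the expected answer is 'true' or 'false' (the function and field names are illustrative, not from the S3 code):

```python
def score_run(cases: list[dict], answers: list[str]) -> str:
    """Count correct answers and report the run as a ratio, e.g. '12/12'.

    Each case is assumed to carry an `expected` boolean; a model answer
    counts as correct if it contains the matching 'true'/'false' token.
    """
    correct = 0
    for case, answer in zip(cases, answers):
        normalized = answer.strip().lower()
        predicted_true = "true" in normalized and "false" not in normalized
        if predicted_true == case["expected"]:
            correct += 1
    return f"{correct}/{len(cases)}"
```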

Below is a more in-depth interview with Blyd about the how and the why of the project.

Why do this?

For Sabaio, I was looking for a way to check if a large language model could accurately reference Dutch civil legislation, specifically identifying the correct article related to tort law. None of the open-source models I’ve run locally managed this. So, I wanted to see if any model out there has this fundamental legal skill. Other evaluation frameworks look at model proficiencies or specific product proficiencies, while the S3 framework looks at deficiencies in foundational models only. 

How can you tell what is accurate or not?

By deliberately including an incorrect article number and then asking the model to verify if the number is correct. This results in a straightforward “true or false” test—like a legal version of the “strawberry test.” This works on legal code as well as case law references. 

What measure do you use?

We use a simple ratio, like 12/12, to provide clear and reproducible comparisons across different models and test runs. This also helps gauge consistency when repeating tests with identical inputs. For instance, the first run might achieve 12/12, whereas a second run could be 10/12. Some models are consistently better than others. Therefore, we see vendors and firms looking at S3 evals as an MCP service or tool call to verify outputs. S3 provides essential infrastructure for model output stability, making legal AI realistically reliable.
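
Under the same assumptions as the scoring sketch above, repeated runs could be compared for consistency like this (again purely illustrative, not the S3 implementation):

```python
from statistics import mean


def consistency_report(run_scores: list[tuple[int, int]]) -> dict:
    """Summarise repeated runs of the same test set.

    `run_scores` holds (correct, total) pairs, e.g. [(12, 12), (10, 12)].
    Returns per-run ratios plus the spread between best and worst run,
    which indicates how stable a model is across identical inputs.
    """
    ratios = [correct / total for correct, total in run_scores]
    return {
        "runs": [f"{c}/{t}" for c, t in run_scores],
        "mean_accuracy": round(mean(ratios), 3),
        "spread": round(max(ratios) - min(ratios), 3),
    }


# Example from the interview: a first run at 12/12 and a second at 10/12.
print(consistency_report([(12, 12), (10, 12)]))
```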

Which ones have you tested?

We tested DeepSeek R1 0528 with Dutch and Jordanian laws, specifically in the Dutch and Arabic languages. The Legal AI Arabic test was carried out in Egypt to help create a new tool for judges. S3 allows us to test any model, in any language, for any legal jurisdiction.

What datasets are you testing against?

We generate our tests using local legislative texts and case law databases.

If testing for citation accuracy, which case law library do you use for comparison?

Currently, we’re not testing citation formats. Our tests are limited to verifying whether the case reference number correctly matches the case name. In those cases, S3 tests will have to rely on customers’ access to case law databases. However, we do see opportunities to add citation formats as an extra eval in S3.

How can you measure accuracy for more subjective areas, like drafting and redlining?

In short, we currently don’t test in subjective areas. We don’t believe drafting and redlining can be objectively measured unless approached from a litigant’s perspective. Each party typically wants to strengthen their arguments in a case or contract negotiation. In litigation, this may have been the cause of the hallucinated citations in court briefs. That being said, understanding these conditions allows us to create custom evals for specific use cases.

Special thanks to Emma Kelly and Khrizelle Lascano for their key contributions. We invite the legal and AI communities to help build a more trustworthy future for legal AI. If you are a legal expert, vendor, or at a law firm, connect with Emma at emma@legalcomplex.com.

—

You can see more about S3 here on GitHub.

—

Legal Innovators Conferences New York and UK – Both In November ’25

If you’d like to stay ahead of the legal AI curve, then come along to Legal Innovators New York, Nov 19 + 20, where the brightest minds will be sharing their insights on where we are now and where we are heading.

And also, Legal Innovators UK – Nov 4 + 5 + 6

Both events, as always, are organised by the awesome Cosmonauts team! 

Please get in contact with them if you’d like to take part. 
