Advanced AI News

S3 Launches – LLM Eval ‘For Any Jurisdiction, Language + Model’ – Artificial Lawyer

By Advanced AI Editor | June 30, 2025 | 5 Mins Read

Raymond Blyd, the well-known legal tech expert, has launched S3, a new LLM evaluation framework for legal needs, which focuses on ‘identifying core deficiencies rather than proficiencies’.

As Blyd explained to AL, S3 was created to calibrate and compare open-source models during the development of Sabaio, his earlier AI company, targeting accuracy and hallucinations.

It provides:

‘Standardized Evaluation Metrics: Implements industry-standard benchmarks and custom metrics tailored for legal tasks.

Reproducible Workflows: Ensures that evaluation processes can be repeated and verified by others.

Extensible Architecture: Easily add new evaluation modules or integrate with other legal tech tools.

Transparent Reporting: Generates clear, auditable reports for regulatory and internal review.’

Blyd commented: ‘I needed a consistent method to assess improvements in core model capabilities. For instance, many models failed to cite correct articles or reference numbers. To test this, I developed a simple ‘Strawberry’ test by offsetting legal article numbers to check model accuracy. Most models failed, exposing their unreliability.

‘This insight led to the creation of a prompt template for model testing. The template uses a fixed structure – jurisdiction, code, article number, offset, and legal topic – to ensure consistency. This allows for measurable, reproducible comparisons of model performance across languages and legal systems.’
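As a rough illustration of the fixed structure Blyd describes, the five fields can be held in a small record that renders the same prompt every time. This is a minimal sketch, not S3's actual template: the class name `OffsetProbe`, the prompt wording, and the placeholder jurisdiction, code, and article number are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetProbe:
    """One test item: jurisdiction, code, article number, offset, legal topic.

    Hypothetical structure for illustration; not taken from the S3 repository.
    """
    jurisdiction: str
    code: str
    article: int   # the article that really covers the topic
    offset: int    # 0 keeps the true number, any other value shifts it
    topic: str

    @property
    def cited_article(self) -> int:
        # The number shown to the model, possibly shifted away from the truth.
        return self.article + self.offset

    @property
    def expected_answer(self) -> bool:
        # The claim is only correct when no offset was applied.
        return self.offset == 0

    def render(self) -> str:
        return (
            f"Jurisdiction: {self.jurisdiction}\n"
            f"Code: {self.code}\n"
            f"Claim: Article {self.cited_article} covers {self.topic}.\n"
            "Is the article number correct? Answer only 'true' or 'false'."
        )


# Placeholder values, not real legislation.
probe = OffsetProbe("Exampleland", "Example Civil Code", 100, 3, "tort liability")
prompt = probe.render()
```

Because every field is explicit, the same probe can be rendered in another language or for another legal system by swapping the field values and the wording, while the expected answer stays derivable from the offset alone.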

The framework employs a ‘straightforward quantitative approach: each model responds to a fixed set of objective questions, and correct answers are counted’. Performance is reported as a ratio (e.g., 12/12), enabling transparent and reproducible comparisons between models and test runs, he explained.
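The counting approach can be sketched in a few lines. The snippet below assumes the model was asked to reply with 'true' or 'false'; the helper names `parse_verdict` and `score` are hypothetical, and an unparseable reply is simply counted as wrong, which is a design choice of this sketch rather than something the article specifies.

```python
from typing import Optional, Sequence


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a free-text reply to True/False, or None if it is neither."""
    words = reply.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!'\"")
    if first == "true":
        return True
    if first == "false":
        return False
    return None


def score(replies: Sequence[str], expected: Sequence[bool]) -> str:
    """Count correct answers and report them as a ratio string such as '12/12'."""
    if len(replies) != len(expected):
        raise ValueError("replies and expected must have the same length")
    correct = sum(parse_verdict(r) == e for r, e in zip(replies, expected))
    return f"{correct}/{len(expected)}"


# Made-up replies for three questions; the last one cannot be parsed.
result = score(["True.", "false", "I am not sure"], [True, False, False])
```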

Below is a more in-depth interview with Blyd about the how and the why of the project.

Why do this?

For Sabaio, I was looking for a way to check if a large language model could accurately reference Dutch civil legislation, specifically identifying the correct article related to tort law. None of the open-source models I’ve run locally managed this. So, I wanted to see if any model out there has this fundamental legal skill. Other evaluation frameworks look at model proficiencies or specific product proficiencies, while the S3 framework looks at deficiencies in foundational models only. 

How can you tell what is accurate or not?

By deliberately including an incorrect article number and then asking the model to verify if the number is correct. This results in a straightforward “true or false” test—like a legal version of the “strawberry test.” This works on legal code as well as case law references. 

What measure do you use?

We use a simple ratio, like 12/12, to provide clear and reproducible comparisons across different models and test runs. This also helps gauge consistency when repeating tests with identical inputs. For instance, the first run might achieve 12/12, whereas a second run could be 10/12. Some models perform consistently better than others. Therefore, we see vendors and firms looking at S3 evals as an MCP service or tool call to verify outputs. S3 provides essential infrastructure for model output stability, making legal AI realistically reliable.
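The 12/12 versus 10/12 example can be made concrete with a small sketch that reports the ratio per run and lists which questions changed outcome between runs. The function names and the per-question boolean layout are assumptions for illustration, and the run data is invented to mirror the figures quoted above.

```python
from typing import List, Sequence


def run_ratios(runs: Sequence[Sequence[bool]]) -> List[str]:
    """One 'correct/total' ratio per run."""
    return [f"{sum(run)}/{len(run)}" for run in runs]


def unstable_questions(runs: Sequence[Sequence[bool]]) -> List[int]:
    """Indices of questions whose outcome differs between runs."""
    return [i for i, outcomes in enumerate(zip(*runs)) if len(set(outcomes)) > 1]


# Invented results: identical inputs, two runs, two questions flip in run two.
first_run = [True] * 12
second_run = [True] * 12
second_run[3] = second_run[7] = False

ratios = run_ratios([first_run, second_run])
flipped = unstable_questions([first_run, second_run])
```

Listing the flipped questions alongside the ratios shows whether a drop comes from the same items each time or from different ones, which the ratio alone does not reveal.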

Which ones have you tested?

We tested DeepSeek R1 0528 with Dutch and Jordanian laws, specifically in the Dutch and Arabic languages. The Legal AI Arabic test was carried out in Egypt to help create a new tool for judges. S3 allows us to test any model, in any language, for any legal jurisdiction.

What datasets are you testing against?

We generate our tests using local legislative texts and case law databases.

If testing for citation accuracy, which case law library do you use for comparison?

Currently, we’re not testing citation formats. Our tests are limited to verifying if the case reference number correctly matches the case name. In those cases, S3 tests will have to rely on customers’ access to case law databases. However, we do see opportunities to add citation formats as an extra eval in S3.
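A check of this kind could look like the sketch below, where the customer's case law database is stood in for by a plain mapping from reference number to case name. The function name `reference_matches`, the mapping, and the sample entries are all hypothetical; a real deployment would query whatever database the customer has access to.

```python
from typing import Mapping, Optional


def reference_matches(
    reference: str, case_name: str, database: Mapping[str, str]
) -> Optional[bool]:
    """True/False if the reference is known, None if it cannot be verified."""
    known_name = database.get(reference.strip())
    if known_name is None:
        return None  # not in the customer's database, so no verdict
    return known_name.casefold() == case_name.strip().casefold()


# Invented entries standing in for a customer's case law database.
cases = {"REF-2020-001": "Alpha v Beta", "REF-2021-042": "Gamma v Delta"}

ok = reference_matches("REF-2020-001", "alpha v beta", cases)
wrong = reference_matches("REF-2021-042", "Alpha v Beta", cases)
unknown = reference_matches("REF-1999-999", "Alpha v Beta", cases)
```

Returning `None` for an unknown reference keeps "cannot verify" separate from "verified wrong", so missing database coverage does not count against the model.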

How can you measure accuracy for more subjective areas, like drafting and redlining?

In short, we currently don’t test in subjective areas. We don’t believe drafting and redlining can be objectively measured unless approached from a litigant’s perspective. Each party typically wants to strengthen their arguments in a case or contract negotiation. In litigation, this may have been the cause of the hallucinated citations in court briefs. That being said, understanding these conditions allows us to create custom evals in specific use cases.

Special thanks to Emma Kelly and Khrizelle Lascano for their key contributions. We invite the legal and AI communities to help build a more trustworthy future for legal AI. If you are a legal expert, vendor, or at a law firm, connect with Emma at emma@legalcomplex.com.

—

You can see more about S3 here on GitHub.

