Advanced AI News

S3 Launches – LLM Eval ‘For Any Jurisdiction, Language + Model’ – Artificial Lawyer

By Advanced AI Editor, June 30, 2025

Raymond Blyd, the well-known legal tech expert, has launched S3, a new LLM evaluation framework for legal needs, which focuses on ‘identifying core deficiencies rather than proficiencies’.

As Blyd explained to AL, S3 was created to calibrate and compare open-source models during the development of Sabaio, his earlier AI company, with a focus on accuracy and hallucinations.

It provides:

‘Standardized Evaluation Metrics: Implements industry-standard benchmarks and custom metrics tailored for legal tasks.

Reproducible Workflows: Ensures that evaluation processes can be repeated and verified by others.

Extensible Architecture: Easily add new evaluation modules or integrate with other legal tech tools.

Transparent Reporting: Generates clear, auditable reports for regulatory and internal review.’

Blyd commented: ‘I needed a consistent method to assess improvements in core model capabilities. For instance, many models failed to cite correct articles or reference numbers. To test this, I developed a simple ‘Strawberry’ test by offsetting legal article numbers to check model accuracy. Most models failed, exposing their unreliability.

‘This insight led to the creation of a prompt template for model testing. The template uses a fixed structure – jurisdiction, code, article number, offset, and legal topic – to ensure consistency. This allows for measurable, reproducible comparisons of model performance across languages and legal systems.’
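As a rough illustration of the fixed structure Blyd describes, the five fields can be held in one small record that renders the same prompt every time. This is a minimal sketch and not S3's actual template: the class name `ArticleCheck`, the prompt wording, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleCheck:
    """One test item built from the five fields named in the article.

    The class and its wording are hypothetical; S3's real template may differ.
    """
    jurisdiction: str
    code: str
    article: int
    offset: int  # 0 keeps the true number; any other value shifts it
    topic: str

    @property
    def stated_article(self) -> int:
        # The number shown to the model: the true number plus the offset.
        return self.article + self.offset

    @property
    def expected(self) -> bool:
        # The claim is only true when the number was left unshifted.
        return self.offset == 0

    def render(self) -> str:
        return (
            f"Jurisdiction: {self.jurisdiction}\n"
            f"Code: {self.code}\n"
            f"Claim: Article {self.stated_article} governs {self.topic}.\n"
            "Is the article number correct? Answer only 'true' or 'false'."
        )


# Illustrative values only: a Dutch tort-law item with the number shifted by 3.
item = ArticleCheck("Netherlands", "Burgerlijk Wetboek, Book 6", 162, 3, "tort liability")
print(item.render())
```

Because every item is rendered from the same fields, two models (or two runs of one model) see identical text, which is what makes the comparison repeatable.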

The framework employs a ‘straightforward quantitative approach: each model responds to a fixed set of objective questions, and correct answers are counted’. Performance is reported as a ratio (e.g., 12/12), enabling transparent and reproducible comparisons between models and test runs, he explained.
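The counting approach described above can be sketched in a few lines. The function name `score` and the sample answers are assumptions for illustration, not part of the S3 codebase.

```python
def score(answers: list[bool], expected: list[bool]) -> str:
    """Count matching answers and report them as a 'correct/total' ratio."""
    if len(answers) != len(expected):
        raise ValueError("answers and expected must have the same length")
    correct = sum(a == e for a, e in zip(answers, expected))
    return f"{correct}/{len(expected)}"


# Hypothetical run: 12 objective questions, the model misses the last two.
expected = [True, False] * 6
answers = expected[:10] + [not e for e in expected[10:]]
print(score(answers, expected))
```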

Below is a more in-depth interview with Blyd about the how and the why of the project.

Why do this?

For Sabaio, I was looking for a way to check if a large language model could accurately reference Dutch civil legislation, specifically identifying the correct article related to tort law. None of the open-source models I’ve run locally managed this. So, I wanted to see if any model out there has this fundamental legal skill. Other evaluation frameworks look at model proficiencies or specific product proficiencies, while the S3 framework looks at deficiencies in foundational models only. 

How can you tell what is accurate or not?

By deliberately including an incorrect article number and then asking the model to verify if the number is correct. This results in a straightforward “true or false” test—like a legal version of the “strawberry test.” This works on legal code as well as case law references. 
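Grading such a true-or-false test means reducing a model's free-text reply to a verdict and comparing it with the known answer. The sketch below shows one way to do that; the function names and the accepted keywords are assumptions, and a real harness would likely need stricter output constraints.

```python
from typing import Optional


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a reply onto True/False using its first word; None if unclear."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    if first in {"true", "yes", "correct"}:
        return True
    if first in {"false", "no", "incorrect"}:
        return False
    return None


def is_correct(reply: str, expected: bool) -> bool:
    """An unclear reply (None) never counts as correct."""
    return parse_verdict(reply) is expected


# The claim used a shifted article number, so the expected verdict is False.
print(is_correct("False. That article covers something else.", expected=False))
print(is_correct("I am not sure.", expected=False))
```

Treating an unparseable reply as a miss keeps the score conservative: a model only earns a point when it commits to the right verdict.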

What measure do you use?

We use a simple ratio, like 12/12, to provide clear and reproducible comparisons across different models and test runs. This also helps gauge consistency when repeating tests with identical inputs. For instance, the first run might achieve 12/12, whereas a second run could be 10/12. Some models perform better consistently. Therefore, we see vendors and firms looking at S3 evals as an MCP service or tool call to verify outputs. S3 provides essential infrastructure for model output stability, making legal AI realistically reliable.
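The consistency check mentioned here, where identical inputs score 12/12 on one run and 10/12 on the next, can be illustrated by comparing per-question results across runs. This is a sketch under assumed data: the function name and the two sample runs are hypothetical.

```python
def unstable_questions(runs: list[list[bool]]) -> list[int]:
    """Return indices of questions whose result differs between runs."""
    return [i for i, results in enumerate(zip(*runs)) if len(set(results)) > 1]


# Two hypothetical runs over the same 12 questions with identical inputs.
run_1 = [True] * 12
run_2 = [True] * 10 + [False] * 2

for run in (run_1, run_2):
    print(f"{sum(run)}/{len(run)}")

# Questions that flipped between runs point at unstable model behavior.
print(unstable_questions([run_1, run_2]))
```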

Which ones have you tested?

We tested DeepSeek R1 0528 against Dutch and Jordanian law, in the Dutch and Arabic languages. The Legal AI Arabic test was carried out in Egypt to help create a new tool for judges. S3 allows us to test any model, in any language, for any legal jurisdiction.

What datasets are you testing against?

We generate our tests using local legislative texts and case law databases.

If testing for citation accuracy, which case law library do you use for comparison?

Currently, we’re not testing citation formats. Our tests are limited to verifying if the case reference number correctly matches the case name. In those cases, S3 tests will have to rely on customers’ access to case law databases. However, we do see opportunities to add citation formats as an extra eval in S3.

How can you measure accuracy for more subjective areas, like drafting and redlining?

In short, we currently don’t test in subjective areas. We don’t believe drafting and redlining can be objectively measured unless approached from a litigant’s perspective. Each party typically wants to strengthen their arguments in a case or contract negotiation. In litigation, this may have been the cause of the hallucinated citations in court briefs. That being said, understanding these conditions allows us to create custom evals in specific use cases.

Special thanks to Emma Kelly and Khrizelle Lascano for their key contributions. We invite the legal and AI communities to help build a more trustworthy future for legal AI. If you are a legal expert, vendor, or at a law firm, connect with Emma at emma@legalcomplex.com.

—

You can see more about S3 on GitHub.
