
Raymond Blyd, the well-known legal tech expert, has launched S3, a new LLM evaluation framework for the legal sector, which focuses on ‘identifying core deficiencies rather than proficiencies’.
As Blyd explained to AL, S3 was created to calibrate and compare open-source models during the development of Sabaio, his earlier AI company, with a focus on accuracy and hallucinations.
It provides:
‘Standardized Evaluation Metrics: Implements industry-standard benchmarks and custom metrics tailored for legal tasks.
Reproducible Workflows: Ensures that evaluation processes can be repeated and verified by others.
Extensible Architecture: Easily add new evaluation modules or integrate with other legal tech tools.
Transparent Reporting: Generates clear, auditable reports for regulatory and internal review.’
Blyd commented: ‘I needed a consistent method to assess improvements in core model capabilities. For instance, many models failed to cite correct articles or reference numbers. To test this, I developed a simple ‘Strawberry’ test by offsetting legal article numbers to check model accuracy. Most models failed, exposing their unreliability.
‘This insight led to the creation of a prompt template for model testing. The template uses a fixed structure – jurisdiction, code, article number, offset, and legal topic – to ensure consistency. This allows for measurable, reproducible comparisons of model performance across languages and legal systems.’
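For illustration, here is a minimal Python sketch of what such a fixed-structure template might look like. The field names, wording and example values are our own assumptions based on Blyd’s description, not S3’s published code.

```python
# Illustrative only: the template wording and field names below are assumptions
# based on the structure described (jurisdiction, code, article number, offset,
# legal topic), not S3's actual implementation.

PROMPT_TEMPLATE = (
    "In the {jurisdiction} {code}, is article {cited_article} the provision "
    "that governs {legal_topic}? Answer TRUE or FALSE."
)

def build_prompt(jurisdiction: str, code: str, article: int,
                 offset: int, legal_topic: str) -> str:
    """Build a verification question, optionally offsetting the article number."""
    return PROMPT_TEMPLATE.format(
        jurisdiction=jurisdiction,
        code=code,
        cited_article=article + offset,  # offset == 0 keeps the genuine citation
        legal_topic=legal_topic,
    )

# Example: Dutch tort liability (onrechtmatige daad) sits in article 6:162 of
# the Burgerlijk Wetboek; an offset of +2 makes the citation deliberately wrong,
# so the correct answer becomes FALSE.
print(build_prompt("Dutch", "Civil Code (Burgerlijk Wetboek), Book 6", 162, 2,
                   "tort liability (onrechtmatige daad)"))
```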
The framework employs a ‘straightforward quantitative approach: each model responds to a fixed set of objective questions, and correct answers are counted’. Performance is reported as a ratio (e.g., 12/12), enabling transparent and reproducible comparisons between models and test runs, he explained.
Below is a more in-depth interview with Blyd about the how and the why of the project.
Why do this?
For Sabaio, I was looking for a way to check if a large language model could accurately reference Dutch civil legislation, specifically identifying the correct article related to tort law. None of the open-source models I’ve run locally managed this. So, I wanted to see if any model out there has this fundamental legal skill. Other evaluation frameworks look at the proficiencies of models or of specific products, whereas the S3 framework looks only at deficiencies in foundational models.
How can you tell what is accurate or not?
By deliberately including an incorrect article number and then asking the model to verify if the number is correct. This results in a straightforward “true or false” test—like a legal version of the “strawberry test.” This works on legal code as well as case law references.
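As a rough sketch of how that kind of true-or-false check could be scored in code, again an assumption of the logic rather than S3’s actual implementation, with deliberately naive answer parsing:

```python
from typing import Optional

def expected_answer(offset: int) -> bool:
    # A zero offset means the cited article number is genuine, so the model
    # should answer TRUE; any non-zero offset makes the reference wrong.
    return offset == 0

def parse_verdict(model_reply: str) -> Optional[bool]:
    # Naive parsing: look for an explicit TRUE/FALSE at the start of the reply.
    text = model_reply.strip().upper()
    if text.startswith("TRUE"):
        return True
    if text.startswith("FALSE"):
        return False
    return None  # ambiguous replies are treated as failures

def grade(model_reply: str, offset: int) -> bool:
    # Correct only when the model's verdict matches the ground truth.
    return parse_verdict(model_reply) == expected_answer(offset)

assert grade("FALSE - that article concerns a different topic.", offset=2) is True
assert grade("TRUE", offset=2) is False
```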
What measure do you use?
We use a simple ratio, like 12/12, to provide clear and reproducible comparisons across different models and test runs. This also helps gauge consistency when repeating tests with identical inputs. For instance, the first run might achieve 12/12, whereas a second run could be 10/12. Some models perform consistently better than others. As a result, we see vendors and firms looking at S3 evals as an MCP service or tool call to verify outputs. S3 provides essential infrastructure for model output stability, making legal AI realistically reliable.
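To make the ratio reporting concrete, here is a minimal sketch that assumes a `run_model` callable standing in for whichever model is under test; the function names and question format are ours, not S3’s API.

```python
from typing import Callable, List, Tuple

def run_eval(run_model: Callable[[str], str],
             questions: List[Tuple[str, bool]]) -> str:
    """Ask a fixed set of (prompt, expected TRUE/FALSE) questions, report a ratio."""
    correct = 0
    for prompt, expected in questions:
        reply = run_model(prompt).strip().upper()
        verdict = (True if reply.startswith("TRUE")
                   else False if reply.startswith("FALSE")
                   else None)
        if verdict == expected:
            correct += 1
    return f"{correct}/{len(questions)}"

# Repeating the identical question set against the same model gauges
# run-to-run consistency, e.g. a first pass of 12/12 and a second of 10/12.
# first_run  = run_eval(my_model, fixed_questions)
# second_run = run_eval(my_model, fixed_questions)
```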
Which ones have you tested?
We tested DeepSeek R1 0528 with Dutch and Jordanian laws, specifically in the Dutch and Arabic languages. The Legal AI Arabic test was carried out in Egypt to help create a new tool for judges. S3 allows us to test any model, in any language, for any legal jurisdiction.
What datasets are you testing against?
We generate our tests using local legislative texts and case law databases.
If testing for citation accuracy, which case law library do you use for comparison?
Currently, we’re not testing citation formats. Our tests are limited to verifying whether the case reference number correctly matches the case name. In those cases, S3 tests will have to rely on customers’ access to case law databases. However, we do see opportunities to add citation formats as an extra eval in S3.
How can you measure accuracy for more subjective areas, like drafting and redlining?
In short, we currently don’t test in subjective areas. We don’t believe drafting and redlining can be objectively measured unless approached from a litigant’s perspective. Each party typically wants to strengthen their arguments in a case or contract negotiation. In litigation, this may have been the cause of the hallucinated citations in court briefs. That being said, understanding these conditions allows us to create custom evals for specific use cases.
Special thanks to Emma Kelly and Khrizelle Lascano for their key contributions. We invite the legal and AI communities to help build a more trustworthy future for legal AI. If you are a legal expert, vendor, or at a law firm, connect with Emma at emma@legalcomplex.com.
—
You can see more about S3 here on GitHub.
—
Legal Innovators Conferences New York and UK – Both In November ’25
If you’d like to stay ahead of the legal AI curve… then come along to Legal Innovators New York, Nov 19 + 20, where the brightest minds will be sharing their insights on where we are now and where we are heading.

And also, Legal Innovators UK – Nov 4 + 5 + 6

Both events, as always, are organised by the awesome Cosmonauts team!
Please get in contact with them if you’d like to take part.