
‘AI Model Fine-Tuning Is Overrated’ – Artificial Lawyer

By Advanced AI Editor | October 7, 2025



In this Law Punx blast, Scott Stevenson of Spellbook discusses the limitations of fine-tuning AI models for legal use cases, arguing that it has become an overrated technique. He emphasizes the importance of using LLMs as layers of human reasoning rather than relying on their long-term memory. The discussion also covers the advantages of real-time information retrieval over fine-tuning. There is a full transcript below.

Press Play to watch / listen here, or go to the AL TV Channel.

Law Punx via AL TV Productions, 2025.

Takeaways:

  • Fine-tuning legal AI models is often ineffective.
  • Large language models should be viewed as layers of reasoning.
  • Real-time information retrieval is superior to fine-tuning.
  • Models can hallucinate when relying on long-term memory.
  • Preference learning is crucial for improving AI accuracy.
  • The acceptance rate of AI suggestions is a key metric.
  • Legal tech tools should focus on application layers.
  • AI models should fetch information rather than memorize it.
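
To make the last point concrete, here is a minimal, illustrative sketch in Python of the retrieval-first (RAG) pattern discussed in the episode. It is not Spellbook's implementation; the tiny corpus, the keyword scoring, and the call_llm placeholder are all assumptions for illustration. The point is simply that the model reasons over passages fetched at query time, with citations, rather than recalling them from its long-term memory.

    # Illustrative RAG sketch (not Spellbook's implementation).
    # The corpus, the scoring, and call_llm() are hypothetical placeholders.

    from collections import Counter

    # A tiny stand-in for a real document store (clauses, legislation, case law).
    CORPUS = {
        "clause-101": "The supplier shall indemnify the customer against third-party IP claims.",
        "clause-102": "Either party may terminate this agreement on 30 days' written notice.",
        "statute-7": "Consumer contracts must state the total price including all mandatory fees.",
    }

    def score(query, text):
        """Crude keyword-overlap score; a real system would use embeddings or BM25."""
        q, t = Counter(query.lower().split()), Counter(text.lower().split())
        return sum((q & t).values())

    def retrieve(query, k=2):
        """Fetch the k most relevant passages at query time, instead of memorizing them."""
        ranked = sorted(CORPUS.items(), key=lambda kv: score(query, kv[1]), reverse=True)
        return ranked[:k]

    def build_prompt(query, passages):
        """Ask the model to reason only over the retrieved text and to cite passage IDs."""
        context = "\n".join(f"[{pid}] {text}" for pid, text in passages)
        return (
            "Answer using ONLY the passages below and cite their IDs.\n"
            f"{context}\n\nQuestion: {query}\nAnswer:"
        )

    def call_llm(prompt):
        """Placeholder for whichever base model is used (GPT, Claude, etc.)."""
        return f"(model response grounded in {len(prompt)} characters of retrieved context)"

    question = "Can the customer terminate the agreement early?"
    print(call_llm(build_prompt(question, retrieve(question))))

In a production system the keyword scorer would be replaced by embeddings or BM25 and call_llm by whichever base model you use; the retrieve-then-cite structure is the part that matters here.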


Five other Law Punx podcasts are on Spotify, featuring Electra Japonas, Horace Wu, Jake Jones, Todd Smithline, and Richard Mabey. This one will be uploaded soon.

The Spotify Law Punx site with multiple episodes is here – where you can listen/watch all of the episodes in one place.

And the same, but just an audio version, is available on Apple Podcasts.

Several more Law Punx blasts have been recorded and are now in the studio, with Oz Benamram, Tara Waters, and Jerry Levine, among others. Drop AL a line if you’ve got something you really, really, really want to say.

—

AI Transcript
Hey everybody, Richard Tromans here at Law Punx and another episode this time with Scott Stevenson from Spellbook in Canada. Hey Scott, good to have you with us. I’m not going to do a big preamble. We’re going to get straight into this. So Scott basically has a very key point that he wants to make, which is that refinement is a scam. Take it away, Scott.

Scott Stevenson (00:30.327)
Yeah, so one of the views I hold, which I think has been under-discussed, is that fine-tuning legal AI models turned out to be a little bit of a scam. And we were one of the first companies in the world to use fine-tuning with large language models for legal work, way back in 2022. The models weren’t very good at that time. And you definitely did, to some extent, need to try fine-tuning to either reduce costs or to improve accuracy.

But fine-tuning as a technique, we hear about it from our customers at Spellbook all the time. We have 4,000 law firms and legal teams on board. We meet with about 400 teams of prospects a week right now. And we still get the question all the time, like, are you using fine-tuned models and so on? And what has happened over the past three years is I think fine-tuning has actually become a fairly useless technique a lot of the time, and I think it is highly, highly, highly overrated compared to other techniques and approaches to getting these models to work really, really well for lawyers.

Richard Tromans (01:36.658)
Okay, well, we’re going to keep digging. Just for the audience’s sake, can you explain what we mean by fine-tuning? So you’ve got your model, you’ve got your LLM, you’ve got your GPT-5, your Claude, your whatever it is, then a legal tech company comes along and says, great, we’ll fine-tune your model with all this lovely legal data, which will make it much, much better. Is that basically right?

Scott Stevenson (01:57.206)
Yeah, basically the idea is that you’re going to have this underlying large language model and you’re going to feed it a bunch of legal data and train the model itself to basically, in some way or another, remember that data in its long-term memory, in its essence, to make it better at legal work. And this is something we have done for certain techniques at Spellbook, but it turned out to actually not be that great a technique in the long run.

One example I’ll give before I get into the legal side is Bloomberg trained a model for financial work a couple of years ago. And I think they might have spent like a couple of million training it. And as soon as GPT-4 came out, that model performed better than Bloomberg’s model that they had spent a ton of time and energy training. And this pattern has happened over and over again.

Richard Tromans (02:30.798)
Mmm.

Scott Stevenson (02:53.911)
I think Harvey does great work, the product is great. They had a proprietary model at one point. And now I think if you look at legal benchmarks, there were maybe like seven public models that do better at certain types of legal work than Harvey’s proprietary model. And they do a lot of other great work as well. But I think that technique ended up being really overrated. And I think it should be known by lawyers that it’s actually not the best way to get

these types of models to get the best results for you. One, yeah, yeah.

Richard Tromans (03:29.198)
But hold on, I just want to clarify for you just so people understand. So we’re talking about, so you’ve got your basic LLM. How do you add to that, though, when you can’t get into the original LLM? It’s like a closed system, right? So when you say that we’re training the model or refining the model or whatever it is, I mean, how is that even possible?

Scott Stevenson (03:50.687)
Yeah, so there are open source models where you can literally get into them and train them on more data. You also have, even with OpenAI, they do offer fine-tuning services, which use other techniques, I won’t go too deep into the technical details, to do something similar in maybe a little bit more of a scalable way, where you’re not necessarily training the model directly, but they kind of have this fine-tuning or learning layer on top. And these techniques actually are OK for maybe some of the agentic work. But the idea that you’re going to train a model on all of these contracts or all of legislation, and now it’s just going to perfectly remember everything, is not the best way to do things.

And the analogy I’ll give is: imagine you’re a human. I think thinking of large language models like humans intuitively gives you really good intuition about how to work with these systems. So imagine you’re asking a human to remember all of, say, EDGAR, the SEC’s database that contains many contracts. And you’re asking a human to put that in the human’s long-term memory.

That’s going to cause hallucination, because our long-term memory is not that good. Our long-term memory hallucinates. So actually that method of imbuing these models with knowledge, relying on it too heavily, is actually what causes hallucination. And there was sort of this, I think, misunderstanding that by fine-tuning on legal data we can actually reduce hallucination. And while that might be true to an extent, the idea of relying on the model’s baked-in long-term memory is actually what causes hallucination most of the time. So it’s actually a really bad approach. The alternatives are approaches like RAG and agentic RAG. So retrieval-augmented generation, not a new thing anymore, but I think very underhyped, because it doesn’t sound as sexy. But the way we think of models now at Spellbook is that they’re a layer of human reasoning

Scott Stevenson (06:15.902)
of human logic, of human-like thinking. But they’re not really where you want to store knowledge and information, because they’re fallible. If you’re storing your knowledge and information inside a large language model, it’s going to be fallible and it’s going to hallucinate, and that’s going to give you bad results. If you instead teach the AI...

Richard Tromans (06:32.776)
So, so, so effectively what you’re saying is it’s better to let the big boys, OpenAI and Anthropic and everyone else, build their LLMs, you know, which come out every six months, two years, whatever, and then just add workflows, agents, you name it, on top of that. And don’t, don’t mess around with it. Just use that improved language understanding, improved reasoning and so forth, and then build on top of that, literally just like an extension, rather than trying to get into training it on data.

Scott Stevenson (07:06.582)
Oftentimes, yes. There are different use cases in agents, for instance, where maybe reinforcement learning is becoming a little bit more useful again. But generally, yes. Treating these models as a layer of human reasoning, not as a database where you’re storing knowledge, and then accessing knowledge in other ways, has been a better approach and gives us better results.

Richard Tromans (07:29.358)
Please. Is it, is it, is it a question of the quantity of data? I mean, if you could give, you know, the latest version of, you know, whatever, GPT-5, whatever the latest thing is, right? If you could give it 25 billion tokens worth of legal data, surely that would make some positive impact. Or are you just going to confuse the language model, because the language model is generalistic, by dumping a great big legal chunk right into the middle of it? It's kind of...

Scott Stevenson (07:55.988)
Yeah.

Richard Tromans (07:59.192)
mess with its brain.

Scott Stevenson (08:00.663)
Yeah, I mean, it’s not deterministic. Yeah, it might help, and it might help you get some better results. But I think you don’t want the model to be looking in its long-term memory to remember what was that contract that, you know, Microsoft signed in, you know, 1991. Yeah, it’s just not, because models are effectively compressing information and compression is lossy. It’s much better to teach these models how to go and fetch this information, how to search for it, to assist them in

Richard Tromans (08:17.602)
We did, we did.

Scott Stevenson (08:28.95)
finding the correct information. And then they can provide it to you with citations. So the amazing part about RAG is that you can have fresh, real-time information. So the other problem with fine-tuning is there’s always a knowledge cutoff. For GPT-5, for instance, there’s a knowledge cutoff of maybe like, I don’t know the specific date right now, but it’s probably at least six months old, and it doesn’t know anything more recent unless it goes and searches the web. So you don’t want to be reliant on

having to constantly retrain the models. Whereas if you rely on teaching the models to go and fetch information and find information, you can do everything in real time. You can have the most real-time legislation. You can have the most real-time case law and so on. And you have citations. And so it’s just a way, way better approach to kind of think about the knowledge separately and think about the LLMs as sort of your layer of human reasoning. And I think we’ve been

building in this space since 2022. And it turns out there’s actually just an enormous amount to build around models. In some cases, we still might use a fine-tuned model for a very specific task where we want a low cost or we want it to be really fast. But it’s much more effective to build around them, and there’s a lot to build. Yeah.
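
An editorial aside to illustrate the knowledge-cutoff point above: in a retrieval setup the source is consulted when the question is asked, so freshness comes from the moment of retrieval rather than from the model's training date. The Python sketch below is purely hypothetical; fetch_legislation and its contents are invented stand-ins, not any real product or API.

    # Hypothetical sketch of "fetch at query time": there is no knowledge cutoff
    # because the source is consulted when the question is asked.
    # fetch_legislation() and its contents are invented placeholders, not a real API.

    from datetime import date

    def fetch_legislation(query):
        """Stand-in for a live search over legislation or case law (a maintained index)."""
        return [{
            "source": "Example Act 2025, s. 12",
            "retrieved": date.today().isoformat(),  # freshness comes from retrieval time
            "text": "A consumer may cancel within 14 days of signing.",
        }]

    def answer_with_citations(question):
        passages = fetch_legislation(question)
        grounding = "; ".join(
            f"{p['text']} [{p['source']}, retrieved {p['retrieved']}]" for p in passages
        )
        # The LLM acts as the reasoning layer over these passages; it does not
        # need the statute text in its long-term memory.
        return f"Q: {question}\nGrounding: {grounding}"

    print(answer_with_citations("What is the consumer cooling-off period?"))

Because the passage is fetched live and carries its own citation, updating an answer only requires updating the index being searched, not retraining the model.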

Richard Tromans (09:50.402)
Yeah. And that goes to the broader debate, just to wrap up. We’ve got 10 seconds left just to wrap up. And I think what you’re saying goes to the broader debate about what the benefits of legal tech tools are, which is, and some people kind of look down their nose at the application layer and workflows and data management and so forth. But really that is where the value of legal tech is, because you can’t mess with the base model. The base models have been taught,

Scott Stevenson (10:01.408)
Mm-hmm.

Richard Tromans (10:17.58)
you know, developed language understanding, that’s what they’re good at. You then take that and you apply it through the application layer, through the data that you’ve created, through the workflow, through the agents, and then you get your good results, right?

Scott Stevenson (10:29.942)
Yeah, that’s right. Yeah. Can I go into overtime? I have another point, or not, it’s okay if not. The other, secondary point I will make is I think the evaluation of these systems is far more subjective than people want to admit. So when we look at Spellbook, our AI contract review tool, we’ve seen probably over 10 million contracts. And what we look at is our suggestion acceptance rate. So when we suggest

Richard Tromans (10:34.51)
Special bonus, special bonus, one minute, one minute starting now,

Scott Stevenson (10:58.638)
a redline to a contract, we’re looking at what percentage of those actually get accepted by a lawyer. And when we first launched Spellbook, the acceptance rate was maybe 5%. Now our acceptance rate is 60%. And we’ve driven that acceptance rate up through an enormous amount of iteration, hooking in different data sources. But a big thing that we’ve been doing more recently, that’s in beta, is preference learning. So contract review is highly, highly subjective.

Every client and lawyer has a different way of doing things. There really isn’t an objective answer in most cases. And there’s the context of the deal and the power dynamic of the deal. So, like, I think memory and preference learning are incredibly important and something that’s not being taken into account very well. And I think in some ways, in some ways we think of Spellbook more like a YouTube recommendation algorithm for redlines than like something that’s giving you this objective, mathematically correct answer. And the more that someone uses Spellbook, the better and more closely those suggestions should match their preferences. That’s not something you can do with fine-tuning, because fine-tuning kind of trains your model for everyone, without that subjective preference learning. So I think we’re going to see a lot more of this. RAG is already super, super popular, and I think we’re going to start seeing a lot more preference learning, because just training one model for all customers and all users everywhere only gets you so far. Legal problems are highly subjective and very preference-driven. And RAG and memory are a way that we can drive results there.
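
To make the metric and the preference-learning idea concrete, here is a toy, editor-added sketch: it tracks how often suggested redlines are accepted, overall and per category, so that future suggestions can be weighted toward what a particular user actually accepts. This is purely illustrative and not Spellbook's system; all class names, categories, and numbers are invented.

    # Toy sketch, not Spellbook's system: acceptance rate as the tracked metric,
    # plus a naive per-user preference signal derived from accept/reject history.
    # All names, categories, and numbers here are invented for illustration.

    from collections import defaultdict

    class SuggestionLog:
        def __init__(self):
            self.shown = defaultdict(int)     # suggestions shown, per category
            self.accepted = defaultdict(int)  # suggestions accepted, per category

        def record(self, category, accepted):
            self.shown[category] += 1
            if accepted:
                self.accepted[category] += 1

        def acceptance_rate(self):
            """Overall fraction of suggested redlines the lawyer actually accepted."""
            shown = sum(self.shown.values())
            return sum(self.accepted.values()) / shown if shown else 0.0

        def preference_weight(self, category):
            """Per-user weight for ranking future suggestions in this category."""
            shown = self.shown[category]
            return self.accepted[category] / shown if shown else 0.5  # neutral prior

    log = SuggestionLog()
    for category, was_accepted in [("indemnity", True), ("indemnity", True),
                                   ("termination", False), ("termination", False)]:
        log.record(category, was_accepted)

    print(f"overall acceptance rate: {log.acceptance_rate():.0%}")            # 50%
    print(f"indemnity weight: {log.preference_weight('indemnity'):.2f}")      # 1.00
    print(f"termination weight: {log.preference_weight('termination'):.2f}")  # 0.00

The overall acceptance rate is the headline metric Scott mentions; the per-category weights are one simple way a per-user preference signal could feed back into how suggestions are ranked.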

Richard Tromans (12:43.086)
All right, I’m gonna stop you there. Thanks, Scott. Awesome. I should just say for the record, what Scott says basically applies to all legal tech companies, and all legal tech companies can therefore benefit from this approach.

