The world is being taken by storm by the apparently amazing “superpowers” of the latest generation of large language models (LLMs). Whether you’ve used DeepSeek, Gemini, Perplexity, or Claude, you’ve almost certainly wondered, “How did they do that?”
Artificial intelligence guru Andrej Karpathy has produced one of the best tech videos I’ve ever watched. It’s not for the faint of heart, but over 3.5 hours, he gives anyone with a basic understanding of neural networks a working knowledge of how modern LLMs, including “chat-based” and “reasoning” models, are constructed:
The video above contains a lot to unpack, but we’ll provide a walkthrough of some of the major concepts in this article.
Transformer Networks
Almost a decade ago, so-called transformer networks started to appear. The idea is that if you train a neural network on many sequences of characters or words (more technically, those are turned into tokens), it can begin to predict the following character or word.
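To make that idea concrete, here is a toy sketch in Python. It is not how a transformer works internally (there is no neural network here at all), but it captures the same next-token objective: count which token tends to follow which in some training text, then predict the most likely follower.

from collections import Counter, defaultdict

# Toy "training": count which token most often follows each token.
# Real models learn this with huge neural networks over vast corpora;
# this only illustrates the core next-token idea.
training_text = "the cat sat on the mat because the cat was tired".split()

follow_counts = defaultdict(Counter)
for current_token, next_token in zip(training_text, training_text[1:]):
    follow_counts[current_token][next_token] += 1

def predict_next(token):
    # Return the most frequently observed follower, if we've seen this token.
    if token in follow_counts:
        return follow_counts[token].most_common(1)[0][0]
    return None

print(predict_next("the"))  # prints "cat", the most common follower in the training text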
OpenAI’s GPT-2, released in 2019, was an early and famous example of a transformer network. As an experiment that year, I trained it on a few thousand ET articles to see how well it could generate them on its own.
Since then, GPUs have become much faster and models have become much larger. As a result, subsequent generations of transformer models can be trained on much larger sources of data, accept longer inputs, and generate better and more varied results.
Cost curve for training GPT-2
Advances in computer performance have taken the cost of training GPT-2 down from $40,000 to less than $1,000, and currently as low as $100. Credit: Andrej Karpathy
The leap from a model like GPT-2 to a modern system like ChatGPT builds on the basic transformer model with additional layers and types of training, which has led to extraordinary results.
Now, the traditional transformer training is called pre-training. For a large LLM, it includes crawling much of the internet and using the sequences of words (actually tokens) that it finds there to make a model of what word (token) is most likely to follow a sequence of input tokens.
Given enough time, money, and electricity, the result is a powerful model that can provide plausible answers to some questions by calculating the most likely subsequent words.
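As a rough sketch of what all that training optimizes, here is a minimal next-token training step. It assumes the PyTorch library, and the “model” is a trivial stand-in (it ignores all context, which a real transformer does not); the point is only the objective: make the token that actually comes next more likely.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a transformer: an embedding plus a linear layer. It ignores
# context entirely; a real transformer attends to all earlier tokens and
# has billions of parameters. The training objective is the same, though.
vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (1, 33))   # a fake "document" of token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token i+1 from token i

logits = model(inputs)                           # shape: (1, 32, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # nudge weights to make the real next token more likely
optimizer.step()
print(f"pre-training loss for this batch: {loss.item():.3f}")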
At this point, the model is really good at regurgitating what it read on the internet, but it doesn’t (yet) know how to go beyond that. Next, we’ll look at what can go wrong, along with the technologies that can turn it into a state-of-the-art LLM.
Preventing Hallucinations
We’ve all seen instances where an LLM makes up facts out of thin air. This was especially true of “traditional” LLMs that relied exclusively on a transformer model. Since the LLM doesn’t really “know” everything, it makes sense that it will invent answers.
More recent models have a strategy for minimizing hallucinations. Basically, the model is asked a slate of questions multiple times. If it provides different answers each time, it is “told” that it doesn’t know those answers. It then manages to “learn” the types of questions it can’t answer from memory.
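Here is a hedged sketch of that consistency check, where ask_model() is a placeholder for whatever inference API a lab actually uses: probe the model several times, and if its answers disagree, build a training example that teaches it to admit it doesn’t know.

# Sketch only: ask_model() and the exact wording are placeholders, not a real API.
def ask_model(question, temperature=1.0):
    ...  # call the model and return its answer as a string

def build_honesty_example(question, num_samples=5):
    answers = {ask_model(question, temperature=1.0) for _ in range(num_samples)}
    if len(answers) > 1:
        # Inconsistent answers suggest the model is guessing, so train it to say so.
        return {"question": question, "ideal_answer": "I'm sorry, I don't know."}
    # Consistent answers suggest the fact really is in the model's "memory".
    return {"question": question, "ideal_answer": answers.pop()}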
This is where tools come in. The data a model is trained on is not only finite, but has a specific cutoff date. So giving the model access to an outside resource, typically called a tool, can help it get additional or better results. Early versions of systems like ChatGPT were limited to the information they were trained on. But current versions know how to search the web when they determine they need additional information. Sometimes that works, but if the query stumps the web, we can still get a nonsensical answer.
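One way such a tool loop can look is sketched below; the <search> tag and both helper functions are illustrative placeholders rather than any particular vendor’s API. The model emits a marker asking for a search, the surrounding code runs the search, and the results get pasted back into the model’s context before it answers again.

import re

# Sketch only: generate(), web_search(), and the <search>...</search> tag
# are illustrative placeholders, not any particular vendor's API.
def answer_with_tools(question, generate, web_search):
    context = question
    draft = generate(context)
    match = re.search(r"<search>(.*?)</search>", draft)
    if match:
        # The model asked for a web search; run it, append the results to its
        # working context, then let it answer again with fresher information.
        results = web_search(match.group(1))
        context += f"\n\nSearch results:\n{results}"
        draft = generate(context)
    return draft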
Self-Awareness—Or Not
Models are frequently asked questions along the lines of “Who made you?”, which can lead to suspicious results. For example, at one point, DeepSeek was “outed” for claiming it had been created by OpenAI. What is actually happening is that the model is looking for the most common answer on the web, which is OpenAI, and returning it.
One powerful way to teach models to avoid these mistakes is to pre-program them with base context information. As Karpathy so elegantly puts it, the model itself is similar to our total memory, while its context is more like our current working memory. So a model might have some question-and-answer strings preloaded in its context that include basic facts about the model.
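Here is a minimal sketch of that preloading, using an OpenAI-style list of chat messages; the model name and facts shown are placeholders for whatever a model’s creators actually bake in.

# Sketch only: the message format mimics common chat APIs, and the facts
# shown are placeholders for whatever the model's creators preload.
IDENTITY_CONTEXT = [
    {"role": "system", "content": "You are ExampleLM, built by Example Labs."},
    {"role": "user", "content": "Who made you?"},
    {"role": "assistant", "content": "I was built and trained by Example Labs."},
]

def build_conversation(user_message):
    # Every real conversation starts with the identity Q&A already in context,
    # so the model doesn't fall back on the most common answer from the web.
    return IDENTITY_CONTEXT + [{"role": "user", "content": user_message}]

print(build_conversation("Who made you?"))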
‘Chat’ Training: Learning How to Converse
It’d be pretty easy to miss the technology transition from, say, GPT-2 to ChatGPT. Since models seem to get named pretty randomly, and features are often poorly described, the large leap in how they behave is easy to overlook.
The big advance is the next round of training for the models—one that provides “conversational” input to them, such as questions asked by a user and ideal answers provided in response. This training provides another layer on top of the “simplistic” word generator.
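Here is a hedged sketch of how one of those labeled conversations might be turned into a training sequence; the <|user|> and <|end|> delimiters are illustrative stand-ins, since each model family defines its own special tokens for marking who is “speaking.”

# Sketch only: the delimiter tokens below are illustrative; each model family
# defines its own special tokens for marking conversational turns.
conversation = [
    ("user", "What is the capital of France?"),
    ("assistant", "The capital of France is Paris."),
    ("user", "How many people live there?"),
    ("assistant", "Paris proper has a little over 2 million residents."),
]

def to_training_text(conversation):
    parts = [f"<|{role}|>{text}<|end|>" for role, text in conversation]
    return "".join(parts)

# This string is then tokenized and used as fine-tuning data, layered on top
# of the pre-trained next-token predictor.
print(to_training_text(conversation))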
Human Labeling
Currently, training a model to decide what’s useful conversation takes a large amount of human labeling. Not the billions of words needed to pretrain the model, but enough that companies are making a living providing and automating this service.
This is a tiny excerpt of the hundreds of pages a human labeler might be given to help them author conversations that are useful for training an LLM. Credit: Andrej Karpathy
Why Models Can Be Stupid
It’s common to wonder how models can solve complex problems but get simple questions wrong, like which of two numbers is larger or whether water freezes at 0° C.
The key to understanding that is to realize that LLMs see the world as a series of tokens. They don’t actually have an intuitive understanding of numbers as mathematical constructs.
One illustrative example that Karpathy cites in his video is an older model being asked whether 9.9 is larger than 9.11. That’s trivial for us, but some models stated that 9.11 was larger. A paper analyzing the issue found that because Bible verse 9.11 comes after 9.9, those models treated 9.11 as the larger number.
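You can see the token-level view of that comparison for yourself. This sketch assumes OpenAI’s open-source tiktoken tokenizer package is installed; other models’ tokenizers split text differently, but the lesson is the same.

import tiktoken  # assumes: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ("9.9", "9.11"):
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(text, "->", pieces)

# The model never sees 9.11 as a single number, only as a short sequence
# of tokens, which is one reason numeric comparisons can go wrong.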
This is where another clever tool comes in. “Use code” tells a model that has been taught to write and run code to essentially “show its work” by writing a program. For math problems, the program often gets the correct answer even when the equivalent plain query might fool the model. For example, here is the result when we ask ChatGPT to use code to decide which number is greater:
a = 9.9
b = 9.11

if a > b:
    print(f"{a} is greater than {b}")
elif a < b:
    print(f"{b} is greater than {a}")
else:
    print(f"{a} and {b} are equal")

It then runs it for us and provides the output:
9.9 is greater than 9.11
Another reason is that the data initially fed into the model is a bit like our own memory: It can be hazy. That’s a major reason a user-provided prompt, or “context,” supplied to the model can generate much more useful responses.
‘Thinking’ Models: Reinforcement Learning
All of the above steps create some excellent models. But they are trapped in their immediate analysis of a problem. Borrowing a page from successful reinforcement learning systems like AlphaGo, creators of LLMs have begun to allow them to improve their own results by doing multiple trials of their answers and evaluating them.
RL Example
LLMs do reinforcement learning essentially by working the same sort of practice problems a student would find in a textbook, over and over, while improving their answers. Credit: Andrej Karpathy
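Here is a hedged sketch of that loop, with placeholder functions standing in for the real sampling and training machinery: the model attempts the same problem many times, and only the attempts that reach the known correct answer get reinforced.

# Sketch only: these three functions are placeholders for the lab's real
# sampling and training machinery.
def generate_solution(model, problem): ...      # sample one full chain of reasoning
def extract_final_answer(solution): ...         # pull out the final answer
def fine_tune(model, problem, solutions): ...   # reinforce these solutions

def reinforcement_step(model, problem, correct_answer, attempts=16):
    good_solutions = []
    for _ in range(attempts):
        solution = generate_solution(model, problem)
        if extract_final_answer(solution) == correct_answer:
            good_solutions.append(solution)   # this attempt reached the known answer
    # Reinforce only the reasoning paths that worked, so the model becomes more
    # likely to produce them (and the "thinking" steps they contain) next time.
    if good_solutions:
        fine_tune(model, problem, good_solutions)
    return model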
This technique is already being used in some proprietary models from OpenAI and others. But DeepSeek blew it wide open with its R1 model, publishing a paper on how the technique works and making the model publicly available (you can now find it hosted on many sites and available for download).
Learning How to Respond to Subjective Questions
Reinforcement Learning (RL) has proven to be an impressive approach for solving problems with objectively checkable answers, like AlphaGo learning to beat the world’s best human player by playing against itself. But that approach isn’t very useful for subjective queries like “tell me a joke” or “write me a poem.” Those require human judgment.
The naive approach to training a model on queries that require creativity would be to have humans create great jokes, poems, and so on, and feed them to the model. Unfortunately, people are not generally great creators of jokes and poems.
But people are much better at judging the quality of a poem or a joke than they are at creating one. So RL training on subjective topics relies instead on many humans scoring the quality of jokes, poems, and other common subjective responses, an approach generally known as reinforcement learning from human feedback.
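In practice, humans can’t score every output, so their rankings are typically used to train a separate “reward model” that then stands in for human judgment at scale. Here is a rough sketch of that arrangement; every function is a placeholder for real infrastructure.

# Sketch only: every function here is a placeholder for real infrastructure.
def generate(model, prompt): ...                         # sample one poem or joke
def collect_human_ranking(candidates): ...               # humans rank best to worst
def fit_reward_model(candidates, ranking): ...           # learn to imitate that ranking
def update_model(model, prompt, candidate, score): ...   # nudge model toward high scores

def train_reward_model(model, prompt, num_candidates=5):
    # Humans judge a handful of candidates; the reward model learns to score
    # outputs the way a human would, but millions of times over.
    candidates = [generate(model, prompt) for _ in range(num_candidates)]
    ranking = collect_human_ranking(candidates)
    return fit_reward_model(candidates, ranking)

def reinforcement_step(model, prompt, reward_model):
    candidate = generate(model, prompt)
    score = reward_model(candidate)   # the reward model stands in for human judgment
    update_model(model, prompt, candidate, score)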
Science That Looks Like Magic
Even knowing something about how modern LLMs are built, trained, and run, the output I can get from them often still seems magical. As a result, it is tempting to conclude that they have some kind of superpower that comes from their massive neural networks. Whatever you think about that, hopefully you now at least have an understanding of what goes on underneath the surface, or “behind the curtain,” if you prefer.
Thanks to Phil Z. for getting me inspired to watch Andrej’s video and write this article.