“This is one of the most insane pieces of work I’ve ever written.” Andrej Karpathy, the former director of AI at Tesla and a founding member of OpenAI, has just released his latest open-source project, a repository named nanochat. The project has already passed 7.9k stars on GitHub!
GitHub repository: https://github.com/karpathy/nanochat
It is reported that, unlike Karpathy’s earlier repository nanoGPT, which covered only pre-training, nanochat is a minimalist, end-to-end training/inference pipeline built from scratch. It can be used to build a simplified ChatGPT-style replica, and the whole thing lives in a single codebase with very few dependencies.
A model trained in half a day with $100 beats GPT-2
“The best ChatGPT that $100 can buy” is how Karpathy described nanochat in the announcement. With nanochat, you just launch a cloud GPU server and run a script; in as little as 4 hours, you can chat with your own freshly trained large language model (LLM) through a ChatGPT-like web interface.
Specifically, the project covers the following:
Train a tokenizer using a new implementation written in Rust
Pre-train a Transformer-based large language model on the FineWeb dataset and evaluate the CORE score across a number of metrics
Midtrain on the SmolTalk user-assistant dialogue dataset, a multiple-choice question dataset, and a tool-use dataset
Perform supervised fine-tuning (SFT) on the chat model and evaluate it on world-knowledge multiple-choice benchmarks (ARC-E/C, MMLU), math problems (GSM8K), and code tasks (HumanEval)
Optionally, train the model further with reinforcement learning (RL) on GSM8K using the “GRPO” algorithm
Run efficient inference in an engine with a KV cache, covering simple prefill/decode phases and tool use (a Python interpreter in a lightweight sandbox), and interact with the model through a command-line interface (CLI) or a ChatGPT-like web UI; a toy sketch of the prefill/decode pattern follows this list
Automatically generate a single Markdown “report card” that summarizes the whole run and presents the metrics in a “gamified” way
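To make the prefill/decode split concrete, here is a minimal toy sketch in Python (not nanochat’s actual code; the tiny single-head “model” and all dimensions are made up for illustration) of how a KV cache lets generation reuse past keys and values instead of recomputing them:

```python
# Toy illustration of KV-cache inference: prefill the cache with the prompt,
# then decode one token at a time, appending each new key/value to the cache.
import torch

torch.manual_seed(0)
d_model, vocab = 64, 1000
embed = torch.nn.Embedding(vocab, d_model)
wq = torch.nn.Linear(d_model, d_model, bias=False)
wk = torch.nn.Linear(d_model, d_model, bias=False)
wv = torch.nn.Linear(d_model, d_model, bias=False)
lm_head = torch.nn.Linear(d_model, vocab, bias=False)

def attend(q, k_cache, v_cache):
    # Single-head attention of the current query against all cached keys/values.
    scores = (q @ k_cache.T) / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ v_cache

@torch.no_grad()
def generate(prompt_ids, max_new_tokens=8):
    x = embed(prompt_ids)                          # (T, d_model)
    k_cache, v_cache = wk(x), wv(x)                # prefill: cache K/V for the whole prompt at once
    h = attend(wq(x[-1:]), k_cache, v_cache)       # hidden state for the last prompt position
    out = prompt_ids.tolist()
    for _ in range(max_new_tokens):
        next_id = lm_head(h).argmax(dim=-1)        # greedy decoding, for simplicity
        out.append(int(next_id))
        x_new = embed(next_id)                     # (1, d_model)
        k_cache = torch.cat([k_cache, wk(x_new)])  # decode: append to the cache,
        v_cache = torch.cat([v_cache, wv(x_new)])  # never recompute old positions
        h = attend(wq(x_new), k_cache, v_cache)
    return out

print(generate(torch.tensor([1, 2, 3])))
```

A real inference engine adds multiple Transformer layers, batching, attention masks, and the tool-use sandbox on top of this basic loop, but the cache-and-append pattern is the same.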
According to Karpathy, even at a cost of about $100 (roughly 4 hours on an 8-card H100 node), nanochat can train a conversational, simplified ChatGPT replica that can write stories and poems and answer simple questions. After about 12 hours of training, the model’s performance surpasses GPT-2 on the CORE metric.
On GitHub, Karpathy laid out the detailed process of “rapidly training” the best ChatGPT that $100 can buy.
Detailed technical steps: https://github.com/karpathy/nanochat/discussions/1
If the budget is raised further to about $1000 (roughly 41.6 hours of training), the model’s coherence improves significantly: it can solve simple math problems, handle coding tasks, and complete multiple-choice tests. For example, a depth-30 model trained for 24 hours (roughly the training FLOPs of GPT-3 Small at 125 million parameters, about 1/1000 of GPT-3) scores over 40 on MMLU, over 70 on ARC-Easy, and over 20 on GSM8K.
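For reference, the two budgets line up with an 8×H100 node renting for roughly $24 per hour; this hourly rate is not stated in the passage above but is implied by its own figures, as the quick arithmetic below shows:

```latex
% Back-of-the-envelope check, assuming ~$24/hour for an 8xH100 node:
\[
4\,\text{h} \times \$24/\text{h} \approx \$100,
\qquad
41.6\,\text{h} \times \$24/\text{h} \approx \$1000
\]
```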
Karpathy’s goal is to fold this complete “strong baseline” stack into one logically coherent, minimal, readable, hackable, and maximally forkable repository. “nanochat will be the capstone project of the LLM101n course (still under development). I think it also has the potential to grow into a research harness or a benchmark, just like nanoGPT before it.”
He noted that the project is by no means final: it has been neither fully tuned nor optimized for performance. Still, the overall skeleton is well established enough to publish on GitHub, and every subsequent stage can be improved further by the community. Karpathy added that nanochat in fact still has plenty of low-hanging optimization opportunities.
8,000 lines of code written by hand: “agents were of no help”
The entire project totals only about 8,000 lines of code, yet Karpathy emphasized that “the code structure is quite clear.” Moreover, the repository was written almost entirely by Karpathy himself, with nothing more than Tab-key auto-completion.
“I tried using Claude and Codex agents to help a few times, but the results were extremely poor and in the end they were of no use. Perhaps it’s because the style and functionality of this repository deviate too far from the typical code in those tools’ training data,” Karpathy said.
Speaking about nanochat’s model architecture, Karpathy explained that it is broadly similar to Llama but somewhat simpler, and that it also borrows some design ideas from modded-nanoGPT (an improved fork of nanoGPT).
He tried to pin down a solid baseline architecture for a model of this scale, as follows (a short illustrative sketch of a few of these choices appears after the list):
Dense Transformer (without sparse structure)
Rotary Embeddings for position encoding, without using other position encodings
QK Norm (normalizing the query vector Q and the key vector K)
The weights of the embedding layer and the unembedding layer are not shared
The outputs of the token embedding are normalized
Use the relu² activation function in the multi-layer perceptron (MLP)
The Root Mean Square Normalization (RMSNorm) does not contain learnable parameters
No biases are used in the linear layers
Multi-Query Attention (MQA)
Logit softcap (limiting the logit value range to stabilize training)
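The following Python snippet is a hedged illustration (not nanochat’s source code; the dimensions and the softcap value are assumptions made up for this example) of a few of the design choices above: parameter-free RMSNorm, an MLP with the relu² activation and no biases, and logit softcapping.

```python
# Illustrative PyTorch versions of some of the listed design choices.
import torch
import torch.nn.functional as F

def rmsnorm(x, eps=1e-6):
    # RMSNorm with no learnable scale or bias.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class ReluSquaredMLP(torch.nn.Module):
    # MLP whose activation is relu(x)^2; the linear layers carry no biases.
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.up = torch.nn.Linear(d_model, d_hidden, bias=False)
        self.down = torch.nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)) ** 2)

def softcap_logits(logits, cap=15.0):
    # Logit softcap: squash logits smoothly into (-cap, cap) to stabilize training.
    return cap * torch.tanh(logits / cap)

x = torch.randn(2, 8, 64)
y = softcap_logits(ReluSquaredMLP()(rmsnorm(x)))
print(y.shape)  # torch.Size([2, 8, 64])
```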
nanochat’s optimizer is a combination of Muon + AdamW, largely inspired by modded-nanoGPT. Karpathy reportedly has a to-do item to try removing the dependence on Muon by better tuning Adam’s learning rates (for example, giving different modules their own learning rates), but he hasn’t put much energy into it yet.
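As a rough sketch of the “per-module learning rate” idea mentioned in that to-do item (the module names and rates below are purely illustrative assumptions, not values from nanochat), AdamW can be given separate parameter groups:

```python
# Hypothetical example: separate AdamW learning rates for embeddings,
# interior matrix weights, and the unembedding head.
import torch

model = torch.nn.ModuleDict({
    "embed": torch.nn.Embedding(1000, 64),
    "block": torch.nn.Linear(64, 64, bias=False),
    "lm_head": torch.nn.Linear(64, 1000, bias=False),
})

param_groups = [
    {"params": model["embed"].parameters(),   "lr": 3e-3},  # embeddings
    {"params": model["block"].parameters(),   "lr": 6e-4},  # interior weights
    {"params": model["lm_head"].parameters(), "lr": 1e-4},  # unembedding
]
optimizer = torch.optim.AdamW(param_groups, betas=(0.9, 0.95), weight_decay=0.0)
```

The intuition is that if each module gets a learning rate that suits it, plain AdamW might close much of the gap that Muon currently covers.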
Netizens: rent a machine and earn the title of machine learning engineer
Beyond GitHub, the newly released nanochat has also drawn plenty of attention on social platforms.
“I’ve always loved the Nano series of projects! This minimalist end-to-end training/inference pipeline will surely have a profound impact on many machine learning students and researchers,” one netizen said.
Another netizen said, “For me personally, this repository will be excellent learning material going forward: it is very helpful whether for understanding the low-level, Rust-based implementation or (more fundamentally) deep learning development in Python.” At the same time, he pointed out, “If everyone can now train their own large language model (LLM) with minimal effort using this repo, wouldn’t the technological edge of companies like Anthropic and OpenAI be weakened? After all, there are plenty of excellent engineers on the market, and with sufficient resources they could well train even more powerful LLMs.”
Someone else pointed out, “I think this repository’s biggest audience is researchers. Many people have ideas for improving large language models (LLMs), but turning those ideas into a complete implementation takes a lot of effort, and the results are far from guaranteed. Now there is a ready-made pipeline you can experiment with directly. What used to be a daydream of 'what if we could do this?' has become a practical 'I can try this idea next weekend'.”
Some netizens even joked, “After running this, I’ll definitely add the title of ‘Machine Learning Engineer’ to my resume.”
Reference links:
https://x.com/karpathy/status/1977755427569111362
https://github.com/karpathy/nanochat
This article is from the WeChat official account “AI Frontline”, compiled by Hua Wei, and published by 36Kr with authorization.