Welcome to the final part of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.
Links to previous and upcoming posts of the series:
- Part 1: Intro
- Part 2: The Training set
- Part 3: The Bigram model
- Part 4: The Mathematical Trick behind Self Attention
- Part 5: Positional Encodings
- Part 6: Coding Self-Attention
- Part 7: Building a GPT (this post)
The T in GPT stands for Transformer, the key neural net architecture that powers LLMs and more.
In the last post we dove deep into self-attention's code and intuition. Although self-attention is the heart of transformers, there are a few additional critical parts of the transformer architecture that actually make it shine. Almost every single post, deck or tutorial about transformers shows this diagram from Google's original "Attention Is All You Need" paper.

In this post, we’ll get to implement and understand all the parts of that diagram that are relevant to build a GPT.
(Masked) Multi-Head Attention

This part is really about running multiple attention heads (like the one in our last post, see the Head module) in parallel.
Karpathy's implementation is as elegant and simple as this:

A few notes:
- In the forward pass, dim = -1 means you concatenate on the last dimension of each head's output, which is the C in (B,T,C) and corresponds to the embedding size.
- So basically, each self-attention head produces a vector (an embedding), and here you simply concatenate them.
- To keep dimensions intact: if before we had a single attention head of size, say, 32, and we now want 4 attention heads, we just divide the size of each head by 4. The initialization then looks like this:

- Dimensions are preserved, but we get to build and learn different communication channels in parallel.
- This is similar to grouped convolutions, where you learn several independent groups of convolution filters.
- Re-running with this gives yet another non-negligible improvement: the loss goes from 2.4 to 2.28.
- This suggests the characters have a lot to communicate about, and that having multiple smaller communication channels helps more than having a single larger one.
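Putting this together, here is a minimal sketch of multi-head attention. The Head module and the hyperparameter values are my assumptions, carried over from what the previous posts describe, so treat this as a reconstruction rather than Karpathy's exact code:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# illustrative hyperparameter values, assumed from earlier in the series
n_embd = 32
block_size = 8

class Head(nn.Module):
    """One head of causal self-attention, as built in the previous post."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scaled dot-product attention with causal masking
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)
        return wei @ v  # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Multiple attention heads in parallel, concatenated on the channel dim."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # each head outputs (B, T, head_size); concatenating 4 heads of size
        # n_embd // 4 restores the original (B, T, n_embd) shape
        return torch.cat([h(x) for h in self.heads], dim=-1)
```

Calling `MultiHeadAttention(4, n_embd // 4)` on a `(B, T, n_embd)` tensor returns a tensor of the same shape, which is exactly the dimension-preservation property discussed above.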
Computation on top of Communication

In our previous post, we explained Karpathy's interpretation of self-attention as a communication mechanism.
Communication is great, but it is not enough.
Once communication has happened, you need a computation layer that allows each node to "process" what it has gathered during communication.
Indeed, up until now we rushed to extract the logits immediately after the self-attention (communication) layer:

Instead, as described in the "Attention Is All You Need" paper, it is important to add at least a simple linear layer just before extracting the logits. As simple as this:

- Note that the dimension we give is n_embd, so the feed-forward keeps the same (B,T,C) shape.
- It also means this feed-forward operates at the token level: it is applied independently to each of the BxT token embeddings.
- In our case, each character gets some time to "think" (or compute) about everything it just gathered in the self-attention layer.
- Retraining with this new trick brings the loss from 2.28 to 2.24. Not huge, but it still improves the situation.
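A minimal sketch of such a per-token computation layer (the module and variable names are my assumptions; at this point in the series it is just a single linear layer followed by a non-linearity, with the expansion and projection coming in a later section):

```python
import torch
import torch.nn as nn

n_embd = 32  # embedding size, assumed from earlier in the series

class FeedForward(nn.Module):
    """A simple per-token computation layer applied after self-attention."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        # applied independently to each of the B*T token embeddings,
        # so the (B, T, C) shape is preserved
        return self.net(x)
```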
Interspersing Communication and Computation

As we just showed in the previous section, it is important to add a "computation" layer on top of the communication one.
One of the important aspects of the transformer architecture is to do this multiple times.
This can be done by creating a "block" that does multi-head self-attention (communication) followed by computation. Nothing too complex, mainly packaging things together to get the boxed part of the diagram above (except for the middle branch, the "Add & Norm" parts and some of the arrows, which we'll explain in the next sections):

Then, in the overall model, you can initialize a stack of those blocks. The number of blocks can be a parameter n_layer, and you simply apply them just before extracting the logits.
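To sketch the idea, here is a compact variant of such a block and of stacking n_layer of them. This uses a fused query/key/value projection instead of Karpathy's separate Head modules, and illustrative hyperparameter values, so treat it as a sketch rather than the exact implementation:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# illustrative hyperparameter values
n_embd, n_head, n_layer, block_size = 32, 4, 3, 8

class MultiHeadAttention(nn.Module):
    """Compact multi-head causal self-attention with a fused qkv projection."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.num_heads, self.head_size = num_heads, head_size
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # (B, T, C) -> (B, num_heads, T, head_size)
        q, k, v = (t.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
                   for t in (q, k, v))
        wei = q @ k.transpose(-2, -1) * self.head_size ** -0.5  # (B, nh, T, T)
        mask = torch.tril(torch.ones(T, T, device=x.device)) == 0
        wei = F.softmax(wei.masked_fill(mask, float('-inf')), dim=-1)
        out = wei @ v                                           # (B, nh, T, hs)
        return out.transpose(1, 2).reshape(B, T, C)

class FeedForward(nn.Module):
    """Per-token computation layer."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (multi-head attention) followed by computation."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)    # no residuals yet: this is the version that
        x = self.ffwd(x)  # struggles to optimize once stacked deep
        return x

# in the overall model: a stack of n_layer blocks, applied before the lm_head
blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
```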

It turns out, though, that retraining with this doesn't give good results.
The reason is that the network is getting quite deep by now and is starting to suffer from optimization issues (during backpropagation), which is rather typical for deep neural nets.
So we need a couple more techniques we can borrow from the paper to actually solve this optimization issue: residual connections and layer norm.
Residual (or skip) connections

- Residual connections were initially introduced in that paper and are represented by that arrow in the diagram.
- The idea is that each time you have a set of complex computations, you also add, on the side, a direct path that skips all the computations and reconnects at the end with an addition (hence the "Add" part in the diagram):

- The point is that gradients can then flow directly (like on a gradient super-highway) from the output all the way back to the input of the network.
- Remember that an addition distributes its gradient equally to all of its branches during backpropagation.
- This trick thus avoids gradients getting stuck or vanishing inside the highly complex computations of deep neural nets.
- The residual branches are initialized so that, at the beginning, they contribute almost nothing; the "gradient highway" lets the optimization get started, and then the gradients progressively kick in through the complex computations as well.
- It turns out that it dramatically helps with the optimization issues we mentioned above.
- Karpathy's implementation:
- First, we do this '+' operation in the forward pass of our block (the code discussed in the previous section), like this:

- Then, we add a projection layer. To be honest, I'm not sure why it is needed and I'll need to dive deeper into the paper to understand it, but I'll show how Karpathy adds one to both the multi-head self-attention and the feed-forward modules.
- For the multi-head self-attention layer, it corresponds to adding this:

- And for the feed-forward module, it corresponds to adding this (splitting the linear layer this way also comes from the paper):

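Putting the residual wiring and the two projection layers together, here is a sketch of the resulting modules (names and hyperparameter values are assumptions consistent with the previous sections):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

n_embd, n_head = 32, 4  # illustrative values

class Head(nn.Module):
    """Single causal self-attention head, as in the previous posts."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        tril = torch.tril(torch.ones(T, T, device=x.device))
        wei = wei.masked_fill(tril == 0, float('-inf'))
        return F.softmax(wei, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # projection back into the residual path

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # 4x inner expansion, from the paper
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # projection back into the residual path
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)    # residual ("skip") connection around communication
        x = x + self.ffwd(x)  # residual connection around computation
        return x
```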
- Amazingly, retraining the model with this trick brings the val loss down to 2.08 (from 2.28), which is a very serious improvement. And text generated by the model starts looking like proper English words now 🤯
- However, we start to observe that the train loss is a bit better than the val loss, meaning we are starting to overfit slightly.

- This is where the next trick comes into play.
Layer norm

- In another video, Karpathy talked about batch normalization, which was critical for enabling better optimization of deep neural nets.
- Here we'll use an evolution of it called layer normalization (introduced in that paper), also implemented in PyTorch here.
- The basic idea is that each embedding vector inside our BxT batch is normalized to be unit Gaussian (mean 0, variance 1).
- The main difference with batch normalization is that batch norm normalizes across the batch, while layer norm normalizes each individual example, which makes it robust to varying batch sizes.
- Karpathy uses the PyTorch implementation and applies it right inside the block, before the input goes into self-attention and feed-forward:

- It is important as well to add one after the stack of blocks in the overall model, just before the final linear layer.

- Note that the dimension we give the layer norm is n_embd. This is because it is applied directly to x, i.e. to each of the size-C vectors in our BxTxC batch.
- Note that initially LayerNorm makes each vector unit Gaussian, but because LayerNorm has learnable weights, during training it can adapt and decide to normalize differently.
- Note as well that in the original "Attention Is All You Need" paper, layer normalization happens after the transformations (i.e. after self-attention and feed-forward). Applying it before the transformations ("pre-norm") is one of the very few changes made to the original architecture over the past few years.
- Retraining the network now brings the loss down to 2.06 (from 2.08): only a slight improvement, but we would expect it to help much more as the network gets deeper (these tricks are nearly useless for small networks but become critical for much larger ones).
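To illustrate the pre-norm wiring, here is a sketch of the block's forward pass. The sa and ffwd modules are simple per-token stand-ins for the attention and feed-forward modules built earlier, so the snippet runs on its own:

```python
import torch
import torch.nn as nn

n_embd = 32  # illustrative embedding size

class Block(nn.Module):
    """Pre-norm transformer block: LayerNorm is applied *before* each
    transformation, unlike the post-norm layout of the original paper."""
    def __init__(self, n_embd):
        super().__init__()
        self.sa = nn.Linear(n_embd, n_embd)    # stand-in for MultiHeadAttention
        self.ffwd = nn.Linear(n_embd, n_embd)  # stand-in for FeedForward
        self.ln1 = nn.LayerNorm(n_embd)  # normalizes each size-n_embd vector
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # norm -> communicate -> add
        x = x + self.ffwd(self.ln2(x))  # norm -> compute -> add
        return x

# in the overall model, one final LayerNorm after the stack of blocks,
# just before the final linear layer:
ln_f = nn.LayerNorm(n_embd)
```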
Scaling the model
Now it is time to run the model training on a GPU.
The code is now exactly what was shown in the very first post of this series, and hopefully you now have a good understanding of each line. A small thing we didn't mention, which shows up in the code, is dropout, added to prevent overfitting (critical when you want to scale a model). A nice interpretation of dropout: by removing random nodes at each backpropagation pass, you effectively train an ensemble of sub-networks.
Here is the hyperparameter configuration used by Karpathy:

Some notes:
- The batch size was increased from 8 to 64.
- The context length (the number of characters the model can attend to) went from 8 to 256.
- The learning rate is much smaller (otherwise you overshoot with such deep neural nets).
- n_embd is 384 and the number of heads is 6. Remember that at the end the heads are concatenated to form the full 384-dimensional embedding, so each head has size 384/6 = 64.
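For reference, the configuration can be sketched as follows. The batch size, context length, n_embd and number of heads are the values stated above; the remaining values (learning rate, n_layer, dropout) are the ones I recall from Karpathy's video, so treat them as indicative:

```python
# Values stated in this post: batch_size, block_size, n_embd, n_head.
# The rest (learning_rate, n_layer, dropout) are quoted from memory from
# Karpathy's video -- treat them as indicative, not authoritative.
batch_size = 64       # sequences processed in parallel (was 8)
block_size = 256      # context length in characters (was 8)
learning_rate = 3e-4  # much smaller than before, to avoid overshooting
n_embd = 384          # embedding size
n_head = 6            # attention heads, each of size 384 // 6 = 64
n_layer = 6           # number of transformer blocks
dropout = 0.2         # dropout probability, to fight overfitting at this scale
```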
Now, all that is left is to run the simple training loop:
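A minimal sketch of what such a loop looks like. Here model, optimizer and get_batch are toy stand-ins so the snippet runs on its own; in the real code they are the GPT model, the AdamW optimizer and the batch-sampling helper defined earlier in the series:

```python
import torch
import torch.nn.functional as F

# toy stand-ins for the real model, optimizer and batch sampler
model = torch.nn.Linear(8, 65)  # stand-in for the GPT model (65 = vocab size)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch():
    # stand-in: random inputs and integer targets of roughly the right shape
    x = torch.randn(64, 8)
    y = torch.randint(0, 65, (64,))
    return x, y

max_iters = 100
for step in range(max_iters):
    xb, yb = get_batch()
    logits = model(xb)                     # forward pass
    loss = F.cross_entropy(logits, yb)     # compare predictions to targets
    optimizer.zero_grad(set_to_none=True)  # reset gradients
    loss.backward()                        # backpropagate
    optimizer.step()                       # update parameters
```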

The final result
After retraining, the loss gets as low as (drumroll): 1.48!! (from 2.06) 🤯🤯 just by scaling it up, with the exact same code.
On an A100 GPU, training took 15 minutes. On a CPU, it would take far too long to be practical.
The results are still nonsensical English, but the output now follows the original format, with "English-sounding" words:

This is pretty amazing considering it is a character-level model, trained on just 1 million characters of Shakespeare, with only 15 minutes of training on a GPU.
What was implemented: Decoder Only Transformer
Implementing a GPT only requires part of the full transformer architecture. Concretely, this is what Karpathy implemented in his video:

In another post I explain how different flavors of the transformer architecture are used to build different kinds of models (GPT, BERT, BART). The left part of the diagram is the encoder and the right part is a decoder.
A few notes:
- Only the decoder part is needed because we are just generating text, without conditioning on anything else, such as an input sentence in another language.
- What concretely makes it a decoder is that we use the triangular mask to "hide the future": during training, we hide the characters beyond the next character we want to generate.
- In the original paper, they needed an encoder because it is a machine translation paper, and the decoder must be conditioned on the sentence to translate.

- Note that the encoder part (the left side of the diagram) is actually identical to the block we implemented, except that it does not do "masked" multi-head attention: it is allowed to look at the whole input and let all the tokens communicate with each other.
- The middle layer, fed from the left, is called cross-attention. The queries come from our decoder batch (our BxTxC batch), but the keys and values come from the encoder on the left.
- Concretely, the decoder's generation is then conditioned not only on the decoder's past input, but also on the full output of the encoder.
From GPT to Gemini / ChatGPT
What we have built is a Pretrained (the P in GPT) Base Model. It is a powerful next-token prediction engine, but it is not yet a helpful assistant. If you ask a Base Model “How do I make an omelet?”, it might reply with “…and why eggs are delicious,” simply because it is trying to complete the sentence rather than answer a request.
To turn this Base Model into a ChatBot assistant (like Gemini or ChatGPT), industry leaders apply a process called Alignment:
- Supervised Fine-Tuning (SFT): We retrain the model on high-quality “Question -> Answer” pairs written by humans. This teaches the model the format of a helpful assistant.
- Reinforcement Learning from Human Feedback (RLHF): We generate multiple answers, ask humans to rank them (best to worst), and use those rankings to tune the model. This teaches the model preferences: prioritizing safety, helpfulness, and honesty.
Some more details can be found in a technical report on Gemini 1.5 from Google DeepMind and this paper from OpenAI.
It is quite magical and unexpected how GPT models can become useful assistants using those methods applied to a rather limited amount of human feedback data.
Where to go from there
I hope that by now you have a deep understanding of how a GPT is built. But true mastery comes from doing, so try reproducing the code on your own. You can peek at Karpathy's code from time to time, but without copy-pasting; otherwise it defeats the purpose.
And if you’re brave enough, continue with yet another masterpiece from Andrej: the 4 hours Let’s reproduce GPT-2 video.
That’s it, I hope you enjoyed that series and found it useful.
And keep learning and enjoying/mastering whatever you do.
