Philippe Adjiman's blog

GPT From Scratch #7: Building a GPT

adjiman — Sat, 22 Nov 2025 16:27:57 +0000

Welcome to the final part of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.

Links to previous and upcoming posts of the series:

Part 1: Intro
Part 2: The Training set
Part 3: The Bigram model
Part 4: The Mathematical Trick behind Self Attention
Part 5: Positional Encodings
Part 6: Coding Self-Attention
Part 7: Building a GPT (this post)

The T in GPT stands for Transformer, the key neural net architecture that powers LLMs and more.

In the last post we took a deep dive into self attention’s code and intuition. Although self attention is the heart of transformers, there are a few additional critical parts of the transformer architecture that actually made it shine. Almost every post, deck or tutorial about transformers shows this diagram from Google’s original “Attention Is All You Need” paper.

In this post, we’ll get to implement and understand all the parts of that diagram that are relevant to build a GPT.

(Masked) Multi-Head Attention

This part is really about having multiple attention heads (like the one in our last post, see the Head module) running in parallel.

Karpathy’s implementation is as elegant and easy as that:

A few notes:

In the forward pass, dim = -1 means you concatenate on the last dimension of each head's output, which is the C in B,T,C and corresponds to the size of the embeddings.
So basically, each self-attention head produces a vector (the embedding) and here you simply concatenate them.
To not break dimensions: if before we had a single attention head of size e.g. 32 and we now want 4 attention heads, we just divide the size of each head by 4 (so each head is of size 8). So when initializing, it simply looks like that:

It means dimensions are preserved, but we get to build and learn different communication channels in parallel.
This concept is similar to what is done for convolutions when you learn “convolution groups” (grouped convolutions).
Re-running with this gives yet another non-negligible improvement in the loss, from 2.4 to 2.28.
This means that the characters have a lot to talk/communicate about, and having multiple smaller communication channels helps more than having only one larger channel.
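
To make the notes above concrete, here is a minimal sketch of a multi-head attention module in PyTorch. It is written in the spirit of the description and is not a verbatim copy of Karpathy's code: the constructor arguments, the toy sizes (32 channels, 4 heads) and the use of head_size for the scaling factor are assumptions of this sketch, and dropout is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One causal self-attention head (dropout left out to keep the sketch short)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask used to hide the future
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)   # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                         # (B, T, head_size)


class MultiHeadAttention(nn.Module):
    """Several heads in parallel, concatenated on the last (channel) dimension."""

    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        head_size = n_embd // num_heads   # e.g. 32 // 4 = 8, so dimensions are preserved
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, n_embd)


torch.manual_seed(0)
x = torch.randn(2, 8, 32)                                      # (B, T, C)
mha = MultiHeadAttention(n_embd=32, num_heads=4, block_size=8)
out = mha(x)                                                   # same shape as x
```

Each head outputs vectors of size 8, and concatenating the 4 heads gives back vectors of size 32, which is why the rest of the model does not need to change.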

Computation on top of Communication

In our previous post , we explained Karpathy’s interpretation of self attention as a communication mechanism.

Communication is great, but it is not enough.

Once communication has happened, you need to have a computation layer allowing each node to “process” what they’ve learned in the communication.

Indeed, up until now, we went way too fast to extract the logits, immediately after the self-attention (communication) layer:

Instead, and as described in the paper “Attention Is All You Need”, it is important to add even a simple linear layer just before extracting the logits. As simple as that:

Note that the dimension we give is n_embd, so when we feed forward it, it will keep the same B,T,C dimension.
It also means that this feed forward is applied at the token level (you apply it independently to each of the embeddings of the BxT tokens).
In our case, it means that you give each character some time to “think” (or compute) after all the communication it has learned just before in the self attention layer.
Retraining with this new trick brings the loss from 2.28 to 2.24. Not huge, but it still improved the situation.
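
As an illustration of the “token level” remark, here is a minimal sketch of such a feed forward module. The ReLU non-linearity after the linear layer and the toy sizes are assumptions of this sketch. The last lines check that running the module on the whole (B,T,C) batch gives the same result as running it on a single token's embedding.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """A linear layer followed by a non-linearity, applied to each token independently."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),   # keeps the channel dimension at n_embd
            nn.ReLU(),                   # assumption of this sketch
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
B, T, n_embd = 4, 8, 32
x = torch.randn(B, T, n_embd)
ffwd = FeedForward(n_embd)

out = ffwd(x)            # (B, T, n_embd): same shape as the input
single = ffwd(x[1, 2])   # the same module applied to one token only, shape (n_embd,)
same = torch.allclose(out[1, 2], single, atol=1e-6)
```

No information flows between tokens here: all the communication already happened in the attention layer.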

Interspersing Communication and Computation

As we just showed in the previous paragraph, it is important to add a “computation” layer on top of the communication one.

One of the important aspects of the transformer architecture is to do it multiple times.

This can be done by creating a “block” of multi headed self attention, which then does computation. Nothing too complex, mainly packaging things together, to get the box of the diagram above (except for the middle part, and also for the “add & norm” part and some of the arrows there as we’ll explain in next sections):

Then, in the overall model, you can initialize a bunch of those blocks. The number can be a parameter n_layer , and then you simply apply it just before extracting the logits.

It turns out, though, that retraining with this doesn’t give good results.

The reason is that the network is getting quite deep by now and is starting to suffer from optimization issues (while running backpropagation), which is rather typical for deep neural nets.

So we need a couple more techniques/tricks that we can borrow from the paper, to actually solve this optimization issue: residual connections and layer norm.

Residual (or skip) connections

Residual connections were initially introduced in that paper and are represented by that arrow in the diagram.
The idea is that each time you have a set of complex computations, you also add on the side a direct path that skips all the computations and that you connect at the end with addition (hence the “Add” part in the diagram):

The idea is that the gradients can flow directly (like on a gradient super highway) from the output all the way back to the input/beginning of the network.
Remember that during backpropagation, an addition distributes the incoming gradient equally to both of its inputs.
This trick thus avoids gradients getting stuck or vanishing in the highly complex computations of deep neural nets.
Instead, the residual blocks are initialized such that at the beginning they contribute almost nothing, so only the “gradient highway” allows the optimization to get started, and then the gradients start to kick in also in the complex computations.
It turns out that it dramatically helps with the optimization issues we mentioned above.
Karpathy’s implementation:
- First we do this ‘+’ operation in the forward pass of our block (the code discussed in previous section) like this:

Then, we add a projection layer. To be honest, I’m not sure why it is needed and I’ll need to dive deeper into the paper to understand it, but I’ll show how Karpathy adds it to both the multi headed self attention and the feed forward modules.
For the multi headed self attention layer it just corresponds to adding this:

And for the feed forward module it just corresponds to adding this (splitting the linear layer that way also comes from the paper):

Amazingly, retraining the model with that trick brings the val loss down to 2.08 (from 2.28), which is a very serious improvement. And the generated text starts looking like proper English words now.
However, we start to observe that the train loss is now a bit better (lower) than the val loss, which means we are starting to slightly overfit.
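
To see the “gradient highway” at work, here is a small experiment (a sketch with toy sizes, not code from the video). The branch is a linear layer whose weights are set exactly to zero, which is an exaggerated version of “contributing almost nothing at the beginning”. Without the skip connection no gradient reaches the input, while with it the gradient flows through the addition untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 8

# A "complex computation" branch, initialized to contribute nothing at all
branch = nn.Linear(n_embd, n_embd)
nn.init.zeros_(branch.weight)
nn.init.zeros_(branch.bias)

x = torch.randn(2, 3, n_embd, requires_grad=True)

# Without the residual connection: the gradient has to go through the branch
branch(x).sum().backward()
grad_plain = x.grad.clone()      # all zeros, the input receives no signal
x.grad = None

# With the residual connection: the addition passes the gradient straight through
(x + branch(x)).sum().backward()
grad_residual = x.grad.clone()   # all ones, coming from the skip path
```

In a real block the branch is not exactly zero, but the picture is the same: the skip path guarantees that a useful gradient reaches the early layers from the very first step.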

This is where the next trick comes into play.

Layer norm

In another video , Karpathy talked about batch normalization, which was critical for allowing better optimization of deep neural nets.
Here we’ll use an evolution, called layer normalization (introduced in that paper). Also implemented in pytorch here.
The basic idea is that each embedding vector inside of our BxT batch is normalized across its C features to have zero mean and unit variance.
The main difference with batch normalization is that batch normalization normalizes across the batch dimension, while layer norm normalizes each individual example (here, each token vector) on its own, and is thus more robust to varying batch sizes.
Karpathy uses the pytorch implementation and applies it right into the block, before it goes into self attention and feed forward:

It is also important to add it after the blocks in the general model, just before the final linear layer.

Note that the dimension of the layer norm is n_embd. This is because it will be applied directly to x, i.e. to all our vectors of size C in our BxTxC batch.
Note that initially LayerNorm will make each vector zero mean and unit variance, but because LayerNorm has trainable weights (a scale and a shift), during training it can adapt and end up normalizing in a different way.
Note as well that in the original “Attention Is All You Need” paper, the layer normalization happens after the transformations (i.e. after self attention and feed forward). This is one of the very few changes made to the original architecture in the years since: we now apply it before the transformation.
Retraining the network now brings the loss down to 2.06 (from 2.08), so only a slight improvement, but we would expect it to help much more with deeper and deeper networks (those things matter little for small networks but become critical for much larger ones).
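
To demystify what the layer norm computes, here is a short sketch that normalizes each vector of size C by hand and compares the result with PyTorch's nn.LayerNorm at initialization (scale of 1, shift of 0). The toy shapes and the deliberately shifted and scaled input are assumptions of this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C) * 3 + 5     # deliberately not zero mean / unit variance

# Manual version: statistics are computed per token, over its C features only
mean = x.mean(dim=-1, keepdim=True)                  # (B, T, 1)
var = x.var(dim=-1, keepdim=True, unbiased=False)    # (B, T, 1)
manual = (x - mean) / torch.sqrt(var + 1e-5)         # 1e-5 is nn.LayerNorm's default eps

# PyTorch version, with its freshly initialized scale (ones) and shift (zeros)
ln = nn.LayerNorm(C)
builtin = ln(x)

max_diff = (manual - builtin).abs().max().item()
```

Nothing in this computation looks at the other examples of the batch, which is exactly the difference with batch normalization.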

Scaling the model

Now it is time to run the model training on a GPU.

The code is now exactly what was shown in the very first post of that series and hopefully you now have a good understanding of each line. A small thing we didn’t mention and that shows up in the code is adding dropout to prevent overfitting (which is critical when you want to scale a model). A nice interpretation of dropout: by randomly switching off a subset of the nodes at each training step, you kind of train an ensemble of sub-neural nets.

Here are the hyper params config used by Karpathy:

Some notes:

The batch size was increased from 8 to 64
The context was increased from 8 to 256 characters (the size of the prompt).
A much smaller learning rate (otherwise you overshoot for such deep neural nets).
n_embd is 384 and the number of heads is 6. Remember that at the end, the outputs of all the heads are concatenated to get the full 384-dimensional embedding. So it means that each head is of size 384/6 = 64.
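
The arithmetic behind those notes can be written down in a few lines. Only the values quoted in the notes are used here; the variable names follow the ones used in this series, and tokens_per_batch is a helper name introduced just for this sketch.

```python
# Values quoted in the notes above
batch_size = 64      # examples processed in parallel
block_size = 256     # context length (size of the prompt)
n_embd = 384         # embedding size
n_head = 6           # number of attention heads

# The heads are concatenated, so n_embd must split evenly across them
assert n_embd % n_head == 0
head_size = n_embd // n_head                  # 384 / 6 = 64

# Number of characters seen in one training step
tokens_per_batch = batch_size * block_size    # 64 * 256 = 16384
```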

Now, all that is left is to run the simple training loop:

The final result

After retraining, the loss gets as low as (drumroll): 1.48!! (from 2.06) by just scaling it, with the exact same code.

On an A100 GPU, it took 15 minutes. On a CPU, it would be impractically slow.

The results are nonsensical English, but the model now outputs something that looks like the original format and with “English sounding” words:

This is pretty amazing knowing that it is a character level model, trained on just 1 million characters from Shakespeare with 15 minutes of training on a GPU.

What was implemented: Decoder Only Transformer

What was needed for implementing a GPT is only part of the full transformer architecture. Concretely, this is what was implemented by Karpathy in his video:

In another post I explain how different flavors of the transformer architecture are used to build different kinds of models (GPT, BERT, BART). The left part of the diagram is the encoder and the right part is a decoder.

A few notes:

The reason why only the decoder part is needed is that we’re just generating text, and we don’t condition it on anything, e.g. an input sentence in another language.
What concretely makes it a decoder is the fact that we used the triangular matrix for “hiding the future”, i.e. during training, each position is prevented from seeing the character it has to predict and everything after it.
In the original paper, they needed an encoder first because it is a machine translation paper, and thus the decoder needs to be conditioned on the input sentence to translate.

Note that the encoder part (left part of the diagram) is actually identical to the block we implemented, except that it does not do “masked” multi-head attention, because it is allowed to look at the whole input and let all the tokens communicate together.
The middle layer coming from the left is called the cross attention layer. This is because the queries in that component come from our batch input (our BxTxC batch), but the keys and the values come from an external source, the encoder on the left.
So what it does concretely is that the generation of the decoder is not only conditioned on the past input to the decoder, but now also conditioned on the full output of the encoder.
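
Cross attention was not implemented in this series, but the description above translates almost word for word into code. The following is a minimal single-head sketch under assumed toy shapes, where dec_x and enc_out are hypothetical names for the decoder input and the encoder output: queries come from the decoder side, keys and values from the encoder side, and there is no triangular mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T_dec, T_enc, C, head_size = 2, 5, 7, 16, 8

dec_x = torch.randn(B, T_dec, C)     # what the decoder has produced so far
enc_out = torch.randn(B, T_enc, C)   # full output of the encoder

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(dec_x)     # (B, T_dec, head_size): queries come from the decoder
k = key(enc_out)     # (B, T_enc, head_size): keys come from the encoder
v = value(enc_out)   # (B, T_enc, head_size): values come from the encoder

# No causal mask: every decoder position may look at the whole encoder output
wei = F.softmax(q @ k.transpose(-2, -1) * head_size ** -0.5, dim=-1)   # (B, T_dec, T_enc)
out = wei @ v                                                          # (B, T_dec, head_size)
```

Note that the affinity matrix is no longer square: it has one row per decoder position and one column per encoder position.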

From GPT to Gemini / ChatGPT

What we have built is a Pretrained (the P in GPT) Base Model. It is a powerful next-token prediction engine, but it is not yet a helpful assistant. If you ask a Base Model “How do I make an omelet?”, it might reply with “…and why eggs are delicious,” simply because it is trying to complete the sentence rather than answer a request.

To turn this Base Model into a ChatBot assistant (like Gemini or ChatGPT), industry leaders apply a process called Alignment:

Supervised Fine-Tuning (SFT): We retrain the model on high-quality “Question -> Answer” pairs written by humans. This teaches the model the format of a helpful assistant.
Reinforcement Learning from Human Feedback (RLHF): We generate multiple answers, ask humans to rank them (Best to Worst), and use those rankings to tune the model. This teaches the model preferences: prioritizing safety, helpfulness, and honesty.

Some more details can be found in a technical report on Gemini 1.5 from Google DeepMind and this paper from OpenAI.

It is quite magical and unexpected how GPT models can become useful assistants using those methods applied to a rather limited amount of human feedback data.

Where to go from there

I hope that by now you have a deep understanding of how GPT is built. But true mastery comes from doing. So what you could do is try to reproduce the code on your own. You can look at Karpathy’s code from time to time, but without copy-pasting, otherwise it defeats the purpose.

And if you’re brave enough, continue with yet another masterpiece from Andrej: the 4-hour Let’s reproduce GPT-2 video.

That’s it, I hope you enjoyed that series and found it useful.
And keep learning and enjoying/mastering whatever you do.

GPT From Scratch #6: Coding Self Attention

adjiman — Wed, 19 Nov 2025 06:20:48 +0000

Welcome to Part 6 of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.

Links to previous and upcoming posts of the series:

Part 1: Intro
Part 2: The Training set
Part 3: The Bigram model
Part 4: The Mathematical Trick behind Self Attention
Part 5: Positional Encodings
Part 6: Coding Self-Attention (this post)
Part 7: Building a GPT

In Part 2 we explained how to create a training set from Shakespeare’s works. Part 3 introduced a basic bigram model, predicting the next character based solely on its predecessor. Part 4 explained how a very clever matrix multiplication enables doing some operations (like averaging) over the previous characters in a very efficient way, and Part 5 was about the positional encoding trick.

Now we’ve reached a critical point of that series: the implementation of the self attention mechanism, which is at the very heart of the transformer architecture. The original paper by Google which introduced the Transformer architecture is called “Attention Is All You Need” for a reason.

What is self-attention

In a separate post (not part of that series) called Decoding Transformers: The Neural Nets Behind LLMs and More, I describe the history of transformers, why they were a game changer, and take a deep dive into their main component: the self attention mechanism (that we’ll implement now).

So, as a prerequisite, it is highly recommended to read it. But if you need to understand one thing about the intuition behind self attention, it is this:

The main purpose of the self attention mechanism is to adapt the vectors/embeddings of the words based on the context of the sentence/prompt.

If this sentence doesn’t sink in very well for you yet, read on and hopefully it will become clearer (and if not, please read the separate post I mentioned above).

How does it connect to the previous posts of that series?

As we explained in Part 2, section “Framing the Prediction Task”, our goal is to predict the next character, using the previous ones.

Regardless of how complex the model is, the input/output format is always the same: you start with a bunch of characters (the “context” or “prompt”) and do some calculation that outputs a vector called the logits, whose size is that of your vocabulary (in our case 65, which is the number of distinct characters). You ultimately transform those logits (using the softmax function) into probabilities that allow you to pick the next character.

In part 3, we introduced how to compute those logits using only the single previous character as the context (which is obviously very limited), using a bigram model.

Now we want to use all the previous characters (that we can call “the context”, or even, the “prompt”) in order to predict the next one. Self attention will provide us a way to do it in a very smart, massively parallel and effective way, by combining the embeddings of each previous character in a way that is putting more weight on the relevant characters.

Let’s visualize at a very high level the steps showing how the next character is generated at prediction time (once we have already learned the weights of the model):

It starts with the prompt (in the example below, three letters V, E, and R), with embeddings of size 4 each
Then self attention does its “contextualization magic” on the embeddings, producing “Contextualized Embeddings”
We then end up with an additional linear layer and softmax to give each character of our vocabulary a probability of being the next character (in our case, it is 65 characters, each one getting its probability).
Then the predicted next character is sampled according to those probabilities, and in our example the character B is picked, which seems a logical result when the prompt was VER, thus forming the word VERB.
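
The four steps above can be sketched in a few lines of PyTorch. Everything here is untrained and illustrative: the character ids are hypothetical placeholders for V, E and R, and a plain running average stands in for the learned self attention, so the sampled character is random rather than the B of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_embd = 65, 4

# Step 1: the prompt and its embeddings (ids are hypothetical stand-ins for V, E, R)
prompt_ids = torch.tensor([[21, 4, 17]])                  # (1, 3)
embedding_table = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size)

with torch.no_grad():
    x = embedding_table(prompt_ids)                       # (1, 3, 4)

    # Step 2: "contextualization" (a running average as a stand-in for self attention)
    counts = torch.arange(1, 4).view(1, 3, 1)
    contextualized = x.cumsum(dim=1) / counts             # (1, 3, 4)

    # Step 3: linear layer + softmax on the last position
    logits = lm_head(contextualized[:, -1, :])            # (1, 65)
    probs = F.softmax(logits, dim=-1)                     # (1, 65), sums to 1

    # Step 4: sample the next character according to those probabilities
    next_id = torch.multinomial(probs, num_samples=1)     # (1, 1)
```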

As we said, this is at prediction time. At training time, we’re doing it on a bunch of B examples in parallel, and we’re learning relevant weights. Let’s dive into it, deciphering self-attention code, line by line, and understand how it connects at training and prediction time.

Deciphering self-attention code, line by line

The main diagram illustrating what is happening in self-attention is the one below (originally from this great video), where the learned weights (during the training process) are the matrices M_k, M_q and M_v. Again, it is best to read the detailed explanations in my post, but in a nutshell:

Taking our example above, we have 3 characters in the context.
V₂ represents the embeddings of the second character (in our example above it is E).
V₂ is transformed into y₂, which is simply a weighted combination of the embeddings of the characters of the context, using normalized affinity scores as weights.
The M_k, M_q and M_v matrices represent Keys, Queries and Values, and are the weights learned during training.

So here is how Karpathy implements this self attention layer:

Wow. Barely 20 lines of code, but maybe the most important 20 lines of the AI revolution.

A lot to unpack. Let’s dive in and explain it all, line by line.

First, the main object that we’ll manipulate in that code (the variable x in the code) is our good old batch of size BxTxC (as explained in Part 3, section “the logits”).

As a reminder, each row holds T consecutive characters (a.k.a. an example) from the training set, each such character is associated with its embedding (an array of numbers of fixed size C), and you have B such examples in the batch.
In the init, you have the initialization of the learnable weights of the attention head, namely the key, query and value layers. More details and intuitions on what they mean are in my other post, but at a high level:
Each token (in our case a character) in the batch will emit three vectors, the key, the query and the value:
- The query represents: “what am I looking for”
- The key represents: “what do I contain”
- The value is “what I will communicate”
Note the dimension of those. E.g. key is a linear layer of dimension (n_embd, head_size) where n_embd is just C and head_size is a hyperparameter that can be chosen through hyperparameter tuning.
Let’s now explain the forward function line by line.
The first line is simply about getting the dimensions (B, T and C) of our batch x, that we explained just above.
Next, the keys k are computed with key(x):
- First, let’s understand the dimensions there. The dimensions of x are B,T,C (in the example above B=8, T=3 and C=4). Think of it as BxT vectors, each of dimension (1,C).
- What key(x) does is a matrix multiplication between each of those BxT vectors and the same weight matrix: key (where key is one of the learned weights as explained above). key is of dimension (C,head_size).
- So, when you multiply BxT vectors of dimension (1,C) with a matrix (key) of dimension (C,head_size), you end up with BxT vectors of dimension (1,head_size), and thus a tensor of dimension (B,T,head_size).
- Note that in his comment, Andrej Karpathy wrote (B,T,C), but he made a small typo and meant (B,T,head_size).
- The exact same thing happens with the query object. query(x) produces a tensor of dimension (B,T,head_size).
So basically, what you have in hand now are two tensors, k and q, each of dimension (B,T,head_size).
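
A quick way to convince yourself of those dimensions is to run them. In this sketch (with head_size = 16 as an arbitrary choice), key is an nn.Linear without bias. One PyTorch detail: nn.Linear stores its weight with shape (head_size, C), so the (C,head_size) matrix of the explanation above is key.weight.T.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C, head_size = 8, 3, 4, 16

x = torch.randn(B, T, C)                       # our batch
key = nn.Linear(C, head_size, bias=False)      # learned weights
query = nn.Linear(C, head_size, bias=False)

k = key(x)      # (B, T, head_size)
q = query(x)    # (B, T, head_size)

# The same thing written as an explicit matrix multiplication:
# every (1, C) vector is multiplied by the same (C, head_size) matrix
k_manual = x @ key.weight.T
same = torch.allclose(k, k_manual, atol=1e-6)
```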

This line hides quite a lot of magic behind the scenes. Let’s unpack it.
Each row in the batch of examples is like a sentence of T tokens, and for each such sentence, you’d like to create an affinity matrix between all of its tokens. This matrix is of size TxT.
The way to create this affinity matrix is to do a dot product between the queries and the keys of each pair of those tokens, which are represented by q and k.
For doing a dot product between two tensors of dimension (B,T,head_size), you have to swap the last two dimensions of the second tensor. This is what the transpose applied to k does. As a reminder, the symbol “@” in PyTorch performs matrix multiplication (batched over the leading B dimension).
You end up with a tensor of dimension (B,T,T) as explained in the comment.
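
Here is a small sketch of that computation with random q and k (toy sizes, no learned weights), checking that entry [b, i, j] of the result is indeed the dot product between the query of token i and the key of token j. The last line applies a scaling by head_size ** -0.5, which is the choice made in this sketch.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 8, 3, 16

q = torch.randn(B, T, head_size)
k = torch.randn(B, T, head_size)

# (B, T, head_size) @ (B, head_size, T) -> (B, T, T): one affinity matrix per example
wei = q @ k.transpose(-2, -1)

# Entry [0, 1, 2] is the dot product of query 1 and key 2 in the first example
manual = (q[0, 1] * k[0, 2]).sum()
matches = torch.allclose(wei[0, 1, 2], manual, atol=1e-5)

# Scaled version, to keep the values in a reasonable range before the softmax
wei_scaled = wei * head_size ** -0.5
```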

As for the multiplicative factor at the end of the line, this is just a scaling trick, and I explain the intuition behind it in my other blog post (look for “scaling embeddings”).
Btw, this line of code can be represented mathematically by this formula:

This part might be intimidating, but in fact, it is highly similar to what we introduced in part 4 of this series: The Mathematical Trick Behind Self Attention.
The main difference is that now we’re not multiplying our batch with a simple triangular matrix (which would do only a simple average).
We’re now multiplying by the weights that resulted from the dot products between all the tokens in the example! But otherwise the mathematical trick is exactly the same.
Let that sink in for a second. When we multiplied the queries and the keys, what we got is the affinity matrix between each token of the examples, giving one TxT matrix per example in the batch.
This TxT matrix represents the affinities between each token, and thus by applying the mathematical trick, we’re transforming the embedding of each token into a weighted average, the weights being the (normalized) affinity scores.
The only differences you might notice from the mathematical trick are:
- The dropout, which is a well-known simple trick to prevent neural networks from overfitting.
- We don’t multiply the wei tensor directly with x, but with value(x), where value is also a learned linear layer (see the init function). It is simply an additional learned layer applied to x before the aggregation, on top of the key*query raw affinity scores.
Bottom line: Each token embedding, of each example of the batch, is now transformed into a weighted version, with the weights being the affinities between the tokens. Which is exactly what was illustrated in the diagram above.
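
Putting the steps together, here is a sketch of the whole forward pass written as a plain function (a hypothetical causal_attention helper, with dropout left out and toy sizes). It also checks the two properties that matter: each row of weights is a proper weighted average, and changing a later token never changes the output of an earlier one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def causal_attention(x, key, query, value):
    """Single-head causal self attention; returns the output and the weights."""
    B, T, C = x.shape
    k, q, v = key(x), query(x), value(x)                    # each (B, T, head_size)
    wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5     # (B, T, T) raw affinities
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # True where looking is allowed
    wei = wei.masked_fill(~mask, float("-inf"))             # hide the future
    wei = F.softmax(wei, dim=-1)                            # each row sums to 1
    return wei @ v, wei                                     # (B, T, head_size), (B, T, T)


torch.manual_seed(0)
B, T, C, head_size = 2, 8, 32, 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

x = torch.randn(B, T, C)
out, wei = causal_attention(x, key, query, value)

# Modify only the last token: earlier positions must not be affected
x_changed = x.clone()
x_changed[:, -1, :] += 1.0
out_changed, _ = causal_attention(x_changed, key, query, value)
```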

And that can be also captured in that one formula, which is at the center of the initial paper Attention is all you need that introduced it all:
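
For reference, the formula from the paper can be written as follows, where Q, K and V are the matrices of queries, keys and values, and d_k is the dimension of the keys (head_size in our code). In our decoder setting, the triangular mask is applied to the scores before the softmax.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```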

Interpreting self-attention as a communication mechanism

Karpathy made a very interesting note about how to think about self attention.

In some sense, each token (in our case, character) has some vector of information (the embeddings) and has to aggregate it via a weighted sum from all the other tokens that are connected to it. This weighted sum can be interpreted as a kind of communication between each token.

It can even be seen as a directed graph, where each token points to itself and all the previous tokens. For our 8 characters it can look like this (graph created with graphviz, with code created by Gemini).

Note that it is only because we’re in an autoregressive setting, where each character can only see the previous characters (the prompt), that a node cannot point to another one further in the sequence. In other settings (where you’re e.g. allowed to see the whole sentence, like for analyzing a text for sentiment or anything else), you could have a fully connected graph, where every token can communicate with any other.
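
The graph is nothing more than the triangular matrix read as an adjacency matrix. As a small sketch for our 8 characters, with an edge (i, j) meaning “token i points to (listens to) token j”:

```python
import torch

T = 8

# adj[i, j] is True when token i is allowed to look at token j (itself or a previous one)
adj = torch.tril(torch.ones(T, T, dtype=torch.bool))
edges = [(i, j) for i in range(T) for j in range(T) if adj[i, j].item()]

num_edges = len(edges)       # T * (T + 1) / 2 edges in the causal graph
num_edges_full = T * T       # what a fully connected graph would have
```

The first token only points to itself, while the last one points to all 8 tokens.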

Connecting it all to our model

In part 3, we explained the bigram model. Now that we have the implementation of a self attention head, we can replace the basic embeddings table that we had in the bigram model by:

1. The positional encoding trick (explained in part 5) and

2. the attention head described above.

It gives this:

By now, this code should look much more self-explanatory to you:

It is the exact same structure as the bigram model (see Part 3)
In the init we add the position embedding table (see Part 5)
We also add a layer normalization layer (we’ll explain that in our next post)
And a final linear layer of dimension (n_embd, vocab_size) so we end up with logits
About the last few lines, the targets variable is None when the model is invoked for prediction/inference (and not for training), more details on that in the next section.
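
To make those notes tangible, here is a compact sketch of such a model. It follows the structure described above but is not Karpathy's exact code: the class name SingleHeadLM is made up, the attention head is written inline with head_size equal to n_embd, and the sizes are toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadLM(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # One attention head; head_size == n_embd is an assumption of this sketch
        self.key = nn.Linear(n_embd, n_embd, bias=False)
        self.query = nn.Linear(n_embd, n_embd, bias=False)
        self.value = nn.Linear(n_embd, n_embd, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.ln_f = nn.LayerNorm(n_embd)               # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)   # (n_embd, vocab_size) -> logits

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                                    # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, C)
        x = tok_emb + pos_emb                                                        # (B, T, C)
        q, k, v = self.query(x), self.key(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5                          # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        x = F.softmax(wei, dim=-1) @ v                                               # (B, T, C)
        logits = self.lm_head(self.ln_f(x))                                          # (B, T, vocab_size)
        if targets is None:          # inference: no loss to compute
            return logits, None
        loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss


torch.manual_seed(0)
model = SingleHeadLM(vocab_size=65, n_embd=32, block_size=8)
idx = torch.randint(0, 65, (4, 8))
logits, loss = model(idx, targets=torch.randint(0, 65, (4, 8)))   # training-style call
logits_only, no_loss = model(idx)                                 # inference-style call
```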

Let’s draw the dimensions of the forward pass to see how things work out very nicely.

From there, what you do next really depends on whether you are in the training phase (i.e. tuning the actual parameters of the model) or in the inference phase (i.e. actually predicting the next character). Let’s detail that difference as it is important.

Training vs. Inference

Both in training and inference, the key element that you need are the logits, which are the raw predictions (one number per possible prediction, in our case, vocab_size numbers). Let’s see how those logits are used in Training vs. Inference phase.

Training

In that phase, your goal is to tune the parameters of the model. As explained above, these include the weights of the Key, Query and Value layers (along with the embeddings and the final linear layer). In deep learning, the way you tune those is via backpropagation. To do backpropagation, you need three things: the logits, the labels (or targets) and a loss function. This is all captured in those few lines from the previous snippet.

The details of how those few lines beautifully work are explained in one of my “Deep Learning Gymnastics” posts: Master Your (LLM) Cross Entropy. Below is an excerpt from that post. Read it fully for more details.
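
As a quick sanity check of what the loss computes, here is a sketch with random logits and targets (toy shapes). The (B,T,vocab_size) logits are flattened to (B*T, vocab_size) so that each of the B*T positions counts as one classification example, and the result is compared with the formula written by hand: the average of minus the log-probability assigned to the correct next character.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, vocab_size = 4, 8, 65

logits = torch.randn(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# What the model does: flatten batch and time, then call cross_entropy
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))

# The same thing by hand
log_probs = F.log_softmax(logits, dim=-1)                         # (B, T, vocab_size)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
manual = -picked.mean()

# A model that knows nothing (uniform probabilities) scores ln(65), about 4.17
uniform_loss = F.cross_entropy(torch.zeros(B * T, vocab_size), targets.view(B * T))
```

That last number is a useful reference point: any loss below ln(65) means the model has learned something.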

From there, the training loop to tune the weights with backpropagation is rather straightforward:

The get_batch function is what we explained in Part 2: The Training Set .
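
Here is a self-contained sketch of such a loop. To keep it runnable in a few seconds on a CPU, several things are stand-ins and not the setup of this series: the data is a repeating pattern of 5 symbols instead of Shakespeare, the model is a tiny bigram table (class BigramStandIn, a made-up name), and the learning rate and number of steps are arbitrary. The structure of the loop is what matters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 5, 8, 16
data = torch.arange(1000) % vocab_size       # stand-in "text": 0,1,2,3,4,0,1,...


def get_batch():
    """Sample batch_size random chunks and their targets (shifted by one)."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y


class BigramStandIn(nn.Module):
    """Tiny stand-in model with the same (logits, loss) interface."""

    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)                       # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, V = logits.shape
        return logits, F.cross_entropy(logits.view(B * T, V), targets.view(B * T))


model = BigramStandIn(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-1)

losses = []
for step in range(200):
    xb, yb = get_batch()                     # sample a batch of data
    logits, loss = model(xb, yb)             # forward pass
    optimizer.zero_grad(set_to_none=True)    # reset the gradients
    loss.backward()                          # backpropagation
    optimizer.step()                         # update the weights
    losses.append(loss.item())
```

Swapping the stand-in model and data for the real ones does not change a single line of the loop itself.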

Inference

Once your model has been trained and your keys, queries, values and other parameters like embeddings have the proper weights learned from backpropagation, you can start actually doing inference, which in our case corresponds to generating Shakespeare text, one character at a time. The code to do it was presented in part 3, and we’ll put it again here for completeness:
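
As a sketch of what such a generation loop looks like, here it is written as a standalone function (in the series it is a method of the model). The model below is an untrained stand-in with the same (logits, loss) interface, so the generated ids are random; the function and variable names are choices made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Extend idx (B, T) by max_new_tokens sampled tokens, one at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                      # never feed more than block_size tokens
        logits, _ = model(idx_cond)                          # (B, T, vocab_size), targets=None
        logits = logits[:, -1, :]                            # keep only the last position
        probs = F.softmax(logits, dim=-1)                    # (B, vocab_size)
        idx_next = torch.multinomial(probs, num_samples=1)   # sample the next token, (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)              # append it to the sequence
    return idx


torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)   # untrained stand-in for a real model


def stand_in_model(idx, targets=None):
    return table(idx), None


start = torch.zeros((1, 1), dtype=torch.long)  # start from token 0
generated = generate(stand_in_model, start, max_new_tokens=20, block_size=8)
```

Cropping to the last block_size tokens matters as soon as the model uses positional embeddings, since the position table only has block_size entries.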

Wow, we now deeply understand how one of the most important pieces of code of the gen AI revolution works!

But to achieve the actual end-goal (of this series) of building a GPT, there are a bunch of very important optimizations that will take the loss to new heights (well, maybe “canyon” is a better term, as the lower the loss, the better).

Let’s now dive into the grand final part of this series: Building a GPT.

GPT From Scratch #5: Positional Encodings

adjiman — Sat, 15 Nov 2025 16:18:19 +0000

Welcome to Part 5 of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.

Links to previous and upcoming posts of the series:

Part 1: Intro
Part 2: The Training set
Part 3: The Bigram model
Part 4: The Mathematical Trick behind Self Attention
Part 5: Positional Encodings (this post)
Part 6: Coding Self-Attention
Part 7: Building a GPT

In Part 2 we explained how to create a training set from Shakespeare’s works. Part 3 introduced a basic bigram model, predicting the next character based solely on its predecessor. Part 4 explained how a very clever matrix multiplication enables doing some operations (like averaging) over the previous characters in a very efficient way.

Before we jump to the actual implementation of a transformer (the heart of GPT), we’ll add one more trick to the arsenal: positional encoding.

Why a token’s position matters

For example, the sentences:

“Alice gave the book to Bob.”
“Bob gave the book to Alice.”

contain the exact same words but mean different things — only the order changes. A large language model without some notion of order wouldn’t be able to tell the difference.

The Solution: Positional Encodings

We already have embeddings for each character (see Part 3, “The token embedding table”). In order to give the model a way to learn the notion of position, we’ll simply add another set of embeddings, one per position. We’re working with characters, but the same applies for words or more generally tokens.

Think of it like this:

Token Embedding = What the word is (or its identity)
Positional Encoding = Where the word is

How to add it to our model

We’re still focused on our good old batch of size BxTxC (as explained in Part 3, section “the logits”).

In our toy example, the batch contains B=8 examples, each example has T = 3 characters (let’s call it block_size), and each character embedding is of size C = 2 (let’s call it n_embd). So, for the position, we need block_size embeddings (one for each possible position of a character in the example). As for the size of the embedding, we could take anything, but for the sake of simplicity, we’ll take the same size as the character embedding, which is n_embd.

Let’s see what it looks like in code. We first need to declare our new embeddings table.

As a reminder, our token (or character in our case) embedding table was keeping track of the embedding of each of our 65 (vocab_size) characters. The embedding of each character is of size n_embd. This token_embedding_table was initialized like this:

Now, in the forward pass, how do you combine the embeddings of both the identity of the character (token_embedding_table) and the embeddings of the position (position_embedding_table)?

Answer: you actually simply just sum them up!

Before doing so, you first need to fetch the token embeddings (which played the role of the logits in Part 3, section “the logits”).

As for the position embeddings, you just need to arrange them so that they align nicely as a T x C tensor (i.e. block_size x n_embd). All in all, this is how to include positional embeddings into the logits:
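
With the toy sizes used above (B=8, T=3, C=2), the whole operation can be sketched like this. The table names follow the ones used in the text; idx is a random batch of character ids used as a stand-in for real data. The sum relies on broadcasting: the (T, C) position embeddings are added to each of the B examples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 65, 3, 2
B, T = 8, block_size

token_embedding_table = nn.Embedding(vocab_size, n_embd)      # one vector per character
position_embedding_table = nn.Embedding(block_size, n_embd)   # one vector per position

idx = torch.randint(0, vocab_size, (B, T))                    # batch of character ids

tok_emb = token_embedding_table(idx)                          # (B, T, C): what the character is
pos_emb = position_embedding_table(torch.arange(T))           # (T, C): where the character is
x = tok_emb + pos_emb                                         # (B, T, C), broadcast over B
```

The same character at two different positions now gets two different vectors, which is what lets the model tell “Alice gave the book to Bob” from “Bob gave the book to Alice”.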

That’s it.

From there, the logits are enriched with a “positional” component in the forward pass, and thus, backpropagation will now optimize the weights of the neural net with that new positional information, constraining the optimization in a way that will take it into account.

Now, the next step is to enrich further the logits with another piece of information: the famous attention layer, that is the heart of the transformer architecture and that made GPT what it is today. That’s what we’ll explore in our next post: Coding Self Attention.

GPT From Scratch #4: The Mathematical Trick Behind Self Attention

adjiman — Fri, 14 Nov 2025 13:17:25 +0000

Welcome to Part 4 of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.

Links to previous and upcoming posts of the series:

Part 1: Intro
Part 2: The Training set
Part 3: The Bigram model
Part 4: The Mathematical Trick behind Self Attention (this post)
Part 5: Positional Encodings
Part 6: Coding Self-Attention
Part 7: Building a GPT

In Part 2 we explained how to create a training set from Shakespeare’s works. Part 3 then introduced a basic bigram model, predicting the next character based solely on its predecessor.

However, this approach is fundamentally limited. To achieve the capabilities of models like GPT, we need to go beyond just one character (or word, or token) back, and consider the broader context of the preceding sequence. This vital interaction is enabled by self-attention, a mechanism underpinned by a very elegant mathematical trick for efficient context awareness.

Let’s dive in.

The simplest kind of communication in our batch

Back to our good old batch of size BxTxC (as explained in Part 3, section “the logits”).

As a reminder, each line contains T consecutive characters from the training set (a.k.a. an example), each such character is associated with its embedding (an array of numbers of fixed size C), and you have B such examples in the batch.

Our goal, in order to illustrate communication between characters, is the following: for each character at index i (in an example of the batch), compute the average of the embeddings of all the characters up to and including index i (we’ll loosely call them “the previous characters”).

One would wonder why doing just an average is interesting, but we’ll see later that it will be the basis for building the powerful self attention mechanism.

So, let’s illustrate with an example what we mean by doing an average of the embeddings.

We first generate a random batch of size 4,8,2 (i.e. 4 examples, each with 8 characters, and each with an embedding of size 2).

Let’s look at the first line example:

Those numbers represent the 8 characters (one character per line) of the first example, and for each, their embeddings (2 numbers).

So our goal is to produce a tensor such that for each line we get the average of all the previous numbers in the respective column.

In our example, we’re looking to get the tensor below. Look for example at the second line, the first number there is 0.3507, which is the average of the first number of the two first lines in the original tensor above (0.0783 and 0.6231). Same for all the other numbers in the resulting tensor, they are the average of all the previous ones.

Now the challenge is: how to produce that in a very efficient way so it can scale.

The brute force way

It is always useful in every problem to start with the brute force solution as a baseline.

Here is Karpathy’s code for the brute force way of solving this:

A few notes on that code:

xprev refers to the b-th example of the batch, and to all the characters from 0 to t (included). For each of those, you have the embedding of size C, and thus xprev is of dimension (t+1,C)
Then, when you do the torch.mean(xprev,0), it is actually doing the average on each channel of the embedding (in our case, there are 2).
Bow means bag of words (a common term when just averaging stuff out)

And sure enough, it works and produces the right result. The problem is that it is highly inefficient and won’t scale both at training and at inference when talking about huge models like GPTs.
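For reference, the brute-force approach described above can be sketched as follows. This is a minimal reconstruction from the notes, assuming a random (4, 8, 2) batch; the seed and therefore the values are arbitrary and differ from the numbers shown in the post.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)   # random batch (values are arbitrary)

xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t + 1]               # characters 0..t of example b
        xbow[b, t] = torch.mean(xprev, 0)  # average over time, separately per channel

print(xbow[0])
```

The two nested Python loops are what makes this slow: every (b, t) pair triggers its own small tensor operation.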

The trick: a (very) clever matrix multiplication

Now let’s describe the trick that enabled scaling self attention, and that arguably is at the core of the generative AI revolution.

First, a small reminder on how matrix multiplication works.

Each element in c is obtained as the dot product of the corresponding row in a and the corresponding column in b.

E.g., to obtain in c the result of the 2nd row, and 1st column, you just do the dot product of the 2nd row in a (which is [4,6,5]) and the 1st column in b (which is [3,4,4]) . And thus [4,6,5] . [3,4,4] = 3*4+4*6+4*5 = 56, which is indeed what we see in c at 2nd row and 1st column.

Now, in the example above, if instead of multiplying b by a random matrix, we multiply it by a triangular matrix, something magic happens:

Can you see what happened? It turns out that now each element in c, is the sum of all the previous elements from b!

For instance, the 7 in c , which is 1st column, 2nd row, corresponds to the sum of all the elements of the 1st column in b, and up to the second row: 3+4 = 7. This works for every element in c.
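A small sketch of that effect in PyTorch. The first column of b ([3, 4, 4]) comes from the example above; the second column is made up for illustration.

```python
import torch

a = torch.tril(torch.ones(3, 3))   # lower-triangular matrix of ones
b = torch.tensor([[3., 1.],
                  [4., 5.],
                  [4., 2.]])        # 1st column from the text, 2nd column is hypothetical
c = a @ b                           # row i of c = sum of rows 0..i of b

print(a)
print(c)
```

Row i of the triangular matrix has ones in columns 0..i and zeros after, so the dot product with a column of b keeps the first i+1 entries and drops the rest. It is the same result as a cumulative sum along the rows.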

Why is it magical and why is it helping in what we try to achieve?

Because if you add a last ingredient to the magic, and instead of multiplying by the triangular matrix, you multiply by a matrix that is the triangular matrix, but with the number divided by the sum of the row, you get the exact average result we wanted! See it below with the associated code

If you want to understand what the line

is doing, you can read my blog post on tensor broadcasting.
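A minimal sketch of the normalized triangular matrix, under the same assumptions as before (the first column of b is from the text, the second one is made up). Dividing by the row sums relies on broadcasting a (3, 1) tensor against a (3, 3) one.

```python
import torch

a = torch.tril(torch.ones(3, 3))
a = a / a.sum(dim=1, keepdim=True)   # (3, 3) / (3, 1): every row now sums to 1
b = torch.tensor([[3., 1.],
                  [4., 5.],
                  [4., 2.]])          # 2nd column is hypothetical
c = a @ b                             # row i of c = average of rows 0..i of b

print(a)
print(c)
```

With keepdim=True the row sums keep a trailing dimension of size 1, which is what lets each row be divided by its own sum.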

Those are two dimensional matrices, let’s get back to our initial problem which is a tensor of dimension BxTxC. Let’s see how it works well for those dimension:

We’re multiplying a TxT matrix with a tensor BxTxC .
PyTorch will broadcast the TxT matrix along a new B dimension, which yields a multiplication between BxTxT and BxTxC. For each batch element, it does a TxT times TxC multiplication (in parallel), which yields a BxTxC result. In code, it gives:

Does it really work?

To check it, we can compare the result of xbow2 with xbow that we obtained in the brute force way section:

Sure enough, it magically works and we indeed obtain the exact same output in both methods .
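The comparison can be sketched end to end as follows, assuming a random (4, 8, 2) batch. The names wei, xbow and xbow2 follow the text; the seed and the tolerance passed to torch.allclose are arbitrary choices.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Brute-force reference: explicit loops over examples and time steps
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(0)

# Matrix-multiplication version
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)   # (T, T), each row sums to 1
xbow2 = wei @ x                        # (T, T) @ (B, T, C) -> (B, T, C)

print(torch.allclose(xbow, xbow2, atol=1e-5))
```

Comparing with a small tolerance rather than strict equality is the safer check, since the two methods may accumulate floating-point rounding in a different order.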

The difference? With the matrix multiplication, it is incomparably more efficient and is thus a game changer given the scale of what it takes to build GPT.

A softmax version

We saw that the key part of the trick is to produce this normalized triangular tensor (the wei in the code above ).

Turns out there is an equivalent way to produce it using the softmax function!

This works because softmax is actually a normalization: you exponentiate all elements and then divide each of them by the sum of its row. A forbidden entry (minus infinity) becomes exp(-inf) = 0 and an allowed entry (zero) becomes exp(0) = 1, so every row ends up spreading its weight evenly over the allowed positions, which is exactly the average.

Why use softmax instead of what we did in the previous example? To be honest, I’m not sure, but I assume it is a matter of elegance and interpretation because, as Karpathy explains, the triangular matrix before applying softmax looks like this:

If you interpret this matrix as the “communication allowance” for each element in the batch (because this is what it ends up being through the matrix multiplication: each line of the batch contains 8 training examples, one per context length), then it says that an element is only allowed to communicate with past elements, while communication with future elements is forbidden (because when we generate characters, we only have access to the previous ones, not the upcoming ones).

And, magically enough, it just works, and produces the same result as in the brute force way.
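A minimal sketch of the softmax variant, assuming the pre-softmax matrix holds zeros at allowed positions and minus infinity at forbidden ones. The names tril and xbow3 are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))  # future positions: communication forbidden
wei = F.softmax(wei, dim=-1)                     # each row: uniform weights over the past
xbow3 = wei @ x                                  # (T, T) @ (B, T, C) -> (B, T, C)

print(wei)
```

The zeros in wei are only a starting point. Once they are replaced by learned, data-dependent scores, the same masking and softmax turn the plain average into a weighted one.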

What was achieved and what’s next?

We started with our standard batch of examples of dimension (B,T,C), and our goal was to produce another batch of the same dimension, in which each character at index i (in an example of the batch) is replaced by the average of the embeddings of all the characters up to index i.

We discovered it could be done in a crazy effective way by simply multiplying the tensor of the batch by a triangular normalized matrix.

This is what it looks like in one example of our batch:

And this works exactly the same if we apply it to the whole batch of examples: you start with the tensor (B,T,C) and you end up with a tensor of the same dimension, but this time with all the averaged examples.

Why does it matter?

Because:

It is an extremely efficient operation and that’s what will allow scaling GPT to huge amounts of data
We’ll replace the simple averaging by a very smart aggregation of all the previous characters (the context)
The operation will be exactly the same: a simple matrix multiplication, we’ll just replace the triangular matrix by the smart aggregation.

We thus now illustrated the foundation of what lies behind GPT.

From there, we’ll just introduce an additional fundamental concept, called “positional encoding” and then we’ll implement the famous self-attention mechanism which is the backbone of GPT.

So let’s do it and dive into Part 5: Positional Encodings.

GPT From Scratch #3: The Bigram Model

adjiman — Mon, 10 Nov 2025 14:14:35 +0000

Welcome to Part 3 of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.

Links to previous and upcoming posts of the series:

Part 1: Intro
Part 2: The Training set
Part 3: The Bigram model (this post)
Part 4: The Mathematical Trick behind Self Attention
Part 5: Positional Encodings
Part 6: Coding Self-Attention
Part 7: Building a GPT

Now that we’ve created a training set, we can start to train a first model.

To illustrate the overall principles of how we’ll build a GPT, Karpathy starts with a simple Bigram model.

It is mind blowing how the structure and principles we’ll use to build such a simplistic model are exactly the same as what will take us up to a full GPT.

Read on.

Bigram ?

First, let’s describe very briefly what a bigram model is. As our goal is to predict the next character, one of the simplest and most naive ways to predict it would be to check how often 2 characters occur together in the data (in our case, in Shakespeare literature). For instance, if you have the letter ‘t’, in English the most likely next letter would be ‘h’ based on the frequency of occurrences of two letters together, as illustrated in that old yet great post by Peter Norvig exposing a bigram frequency table for English.

Peter Norvig’s English Bi-Gram table

With such a table, one naive yet working way of generating the next character would be based on the current character and drawing randomly the next one according to that table distribution.
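As a rough illustration of that counting approach, here is a small pure-Python sketch. The toy corpus and the helper name sample_next are assumptions made for the example; a real table would be built from a much larger text.

```python
import random
from collections import Counter, defaultdict

text = "the thin thief thought the theory through"  # toy corpus (assumption)

# Bigram table: for each character, count which characters follow it
counts = defaultdict(Counter)
for cur, nxt in zip(text, text[1:]):
    counts[cur][nxt] += 1

def sample_next(cur, rng):
    """Draw the next character proportionally to the bigram counts of cur."""
    chars, weights = zip(*counts[cur].items())
    return rng.choices(chars, weights=weights, k=1)[0]

rng = random.Random(0)
print(counts["t"].most_common(2))   # which characters follow 't', and how often
print(sample_next("t", rng))
```

Calling sample_next repeatedly, each time feeding back the character it returned, is all it takes to generate text from such a table.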

Bigrams in a neural net?

In another amazing separate video, Karpathy illustrates how to actually do that not only with a direct approach of just counting occurrences in a bigram table, but how to learn it directly using a neural net. Of course, it doesn’t give better results (because eventually, you only use one character as the context to guess the next one), but learning it from a neural net is without any comparison more flexible and powerful as it sets the ground for extending it up to a full GPT model , as it will be illustrated along the posts.

From here, we assume the reader has some minimal understanding of PyTorch, and how neural nets are working, what is a forward pass, backpropagation and a loss function. But if not, you can watch this video or just read on and hopefully you’ll get the idea on the way.

So let’s jump into Karpathy’s implementation of the bigram model as a neural net in pytorch.

Let’s dive into each important line/component of that code. Each section below highlights the line(s) of code of interest and explains it.

The token embedding table

Usually, embeddings refer to a latent, compact representation of a concept (word, image, others). But here, the table can actually be interpreted simply as the NxN bigram matrix. N (the vocab_size) is the number of possible characters, so e.g. 26 for English letters, but it will be 65 in our case as we consider punctuation marks and lower/upper case letters as different characters (see our previous post on the training set).

The logits: an important BxTxC tensor

In this section, we’ll dive into this line:

Let’s explain.

In machine learning, and in classification tasks in particular, logits can be interpreted as the raw score a model can give to each of the possible classes. In our case, given we try to predict the next character, the possible classes are each of the possible characters of the vocabulary, so 65 different options.

So the logits are the raw score the model gives to each of those 65 options, before normalizing them into probabilities. In the case of bigrams, one could expect this raw score to be the count of occurrences of the two characters (the current one + the one we consider to predict next).

So, why does token_embedding_table(idx) give us the logits?

To understand it, you can read my blog post about tensor indexing, but the basic idea (illustrated in the picture below) is that idx represents the batch of examples (as we described it in the training set post, on the left of the picture below), token_embedding_table represents the embedding matrix (each row is an embedding representation of the corresponding character in the vocabulary, in the middle of the picture below), and doing token_embedding_table(idx) simply returns the same initial batch idx, but this time augmented with the embedding of each character (and thus it is a cube, on the right of the picture below).

Augmenting the batch with embeddings using tensor indexing

This BxTxC tensor will be our starting point in everything we’ll do to build a GPT. So what you need to remember is that each line contains T consecutive characters from the training set (a.k.a. an example), each such character is associated with its embedding (an array of numbers of fixed size C), and you have B such examples in the batch.

A note on the terminology of the dimensions:

B refers to Batch (number of lines/examples in the batch)
T refers to Time. It comes from the number of timesteps in the series of characters (those concepts were initially introduced for time series in the context of Recurrent Neural Networks, and the terminology stuck in the context of characters too)
C refers to channels, or the size of the embedding. Channels were often used for the three RGB channels in pixel of a picture in image classification, but in general it just represents the number of latent features of each timestep.
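To make the shapes concrete, here is a minimal sketch of the lookup, assuming a vocab_size x vocab_size embedding table as described above. The batch idx is filled with random token ids as a stand-in for a real batch, and B = 4, T = 8 are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65                                   # number of distinct characters
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

B, T = 4, 8                                       # illustrative batch and context sizes
idx = torch.randint(0, vocab_size, (B, T))        # stand-in for a real batch of token ids

logits = token_embedding_table(idx)               # (B, T, C) with C = vocab_size
print(idx.shape, logits.shape)

# The entry at (b, t) is simply the row of the table selected by idx[b, t]
row = token_embedding_table.weight[idx[1, 3]]
print(torch.equal(logits[1, 3], row))
```

Nothing is computed during this lookup: each token id just selects one row of the table, which is why that row can be read directly as the scores for the next character.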

Cross-entropy loss

As Jeremy Howard likes to say, in order to train a neural net, you just need 3 things: an input (the batch in our case), a label (the next character in our case) and a loss function. Give those 3 things to a neural net, and it will learn something.

So, what is the loss function in our case? We’ll use cross-entropy as it is often used in classifiers.

A long time ago, I wrote a blog post about the theoretical aspects of entropy, and more recently I wrote one explaining how it is done in LLMs, as Karpathy demonstrates in his video and code. You can find it here: Master Your (LLM) Cross Entropy.

In a nutshell, it explains how the three first lines of code with B,T,C above are just using tensor reshaping to unfold both the embeddings obtained in the previous section (the logits) and the labels (the next character we want to predict) so it can be easily passed to the cross_entropy function in PyTorch.

Basically, it looks like that before we pass it to the cross entropy function:
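A minimal sketch of that reshaping, assuming logits of shape (B, T, C) and targets of shape (B, T). The all-zero logits and random targets are stand-ins chosen so the result is easy to check by hand.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
logits = torch.zeros(B, T, C)              # stand-in scores (assumption for the demo)
targets = torch.randint(0, C, (B, T))      # stand-in labels: the next characters

# F.cross_entropy expects (N, C) scores and (N,) class indices,
# so the batch and time dimensions are unfolded into one
flat_logits = logits.view(B * T, C)
flat_targets = targets.view(B * T)
loss = F.cross_entropy(flat_logits, flat_targets)

# All-zero logits give a uniform distribution, so the loss equals ln(C)
print(loss.item(), math.log(C))
```

That ln(65) value is a useful sanity check: a freshly initialized model that knows nothing should start with a loss in that neighborhood.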

The training loop

So what we defined above is basically the model. It defines the logits, and the loss (by calculating the cross entropy between the logits and the label).

Now we need to train it. To learn what? Basically to learn the parameters of the model, which in our case is the embedding table (the NxN token bigram matrix we discussed above in the section “The token embedding table”).

So here is a basic yet functioning training loop:

Rather simple: we fetch a batch (using the getBatch function we described in the previous post), we compute the logits and the loss on it, and we do the backward pass magic (someday we’ll write another post on how it works). And that’s it. Rinse and repeat and observe the loss dropping.
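For readers who want something to run, here is a self-contained sketch of such a loop on a toy corpus. The corpus, the hyperparameters (batch size, learning rate, number of steps), the choice of AdamW and the get_batch helper are all assumptions made for the example, not the post's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
text = "hello world, hello there. " * 40          # toy corpus (assumption)
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
vocab_size, batch_size, block_size = len(chars), 16, 8

def get_batch():
    """Sample random contexts x and their shifted-by-one targets y."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

table = nn.Embedding(vocab_size, vocab_size)       # the only parameters of the bigram model
optimizer = torch.optim.AdamW(table.parameters(), lr=1e-2)

losses = []
for step in range(500):
    xb, yb = get_batch()
    logits = table(xb)                             # (B, T, C)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)          # reset gradients from the previous step
    loss.backward()                                # backward pass
    optimizer.step()                               # update the embedding table
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loop does not depend on what the model is: swapping the embedding table for a full transformer later would leave these lines unchanged.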

Generating Text From the model

Once the model is trained, this is how text is generated from the model:

A couple of notes on why this function is a bit overkill for now, but will still set us up for future steps:

We don’t need batches (B) as we’ll start only from one example. The idea is simply to start with some empty context, generate one character at a time based on the logits converted into probabilities (using softmax) and sampling from them, and then concatenate it to the current context and repeat.
For the bigram model, this function is even more overkill given that, to generate the next character, the bigram model only looks at the last character, while here we’re passing it the full context from the beginning. This generic way of writing it will be reusable as-is later on, when we start using the full context of all the previous characters
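The generation loop described in the notes above can be sketched like this. It uses an untrained embedding table as a stand-in for the bigram model and a standalone generate function, both assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)   # untrained stand-in for the bigram model

@torch.no_grad()
def generate(idx, max_new_tokens):
    """idx is a (B, T) tensor of token ids; returns a (B, T + max_new_tokens) tensor."""
    for _ in range(max_new_tokens):
        logits = table(idx)                    # (B, T, C): full context is passed in
        logits = logits[:, -1, :]              # keep only the last time step -> (B, C)
        probs = F.softmax(logits, dim=-1)      # raw scores -> probabilities
        idx_next = torch.multinomial(probs, num_samples=1)   # sample one token, (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)              # append it to the context
    return idx

context = torch.zeros((1, 1), dtype=torch.long)   # start from a single token with id 0
out = generate(context, max_new_tokens=20)
print(out.shape)
print(out[0].tolist())
```

The returned ids still have to be mapped back to characters with the tokenizer's decode function to obtain readable text.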

Then to generate text, it just looks like this:

Obviously, a Bigram model won’t be able to generate any Shakespeare-like content, but just for fun, let’s use the function above to generate some text:

As expected, it is complete gibberish, and Shakespeare can still rest in peace for now. Still, it is interesting to observe that the model did capture some structure of how the text from Shakespeare is formed (someone speaking, then comma, then new line etc.). As a reminder, this is what the original Shakespeare text looks like in the training set:

Why this framework sets us up for GPT

So what is really amazing, is that all the building blocks described above for the bigram model, will stay exactly the same for a full GPT model: the training loop, the loss computation from the logits, the text generation.

So what will change? Well, basically the way the logits will be built. Instead of getting them using only the previous character, they will be built using the full context of the previous text (the “prompt”) and will go through the beautiful transformer architecture and its self attention component, which we already explained at length from the intuition perspective (in that post) and which we’ll build from scratch in the coming posts of this series.

What’s next? We’ll now dive into the Mathematical Trick behind Self Attention.

GPT From Scratch #2: The Training Set

adjiman — Thu, 06 Nov 2025 11:42:56 +0000

Welcome to Part 2 of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.

Links to previous and upcoming posts of the series:

Part 1: Intro
Part 2: The Training set (this post)
Part 3: The Bigram model
Part 4: The Mathematical Trick behind Self Attention
Part 5: Positional Encodings
Part 6: Coding Self-Attention
Part 7: Building a GPT

In this post, we’ll focus on building the training set for our model. Although this part might sound less exciting than building the model itself, it is still a crucial (and in my opinion very clever) part of the whole process.

Framing the prediction task

Before building a training set, you need to frame the prediction task in a proper way. The core capability of chat bots is to be able to predict the next word (or more generally token) given the previous words of the sentence. For example:

To us, it is obvious that the next predicted word is very likely “cold”. But the goal is to build a (very) strong model that will be able to do that. Why does it matter? Because once you have that capability, building a very strong chatbot is a matter of “just” an additional (tricky and important) step around teaching your model to answer questions using reinforcement learning. Let’s keep that for a separate post or series. For now, we’ll focus on building that core capability.

Getting training data

Often, getting proper training data with labels can be a very costly process. If you e.g. build a model classifying images of dog vs. cat or cancer vs. benign, you’ll need a lot of human labeled data.

In our case, since we want to train a model to guess the next word, we simply need to get well formed sentences (a lot of them), hide the last word, and train the model to guess it! Where to find a lot of well formed sentences? Well, wikipedia, any book, blogs (just like this one), you name it. Finding data here is not really the issue.

To illustrate how to build the model, Karpathy is taking all Shakespeare literature as the basis. And for the sake of simplicity, the model is built to predict the next character (and not the next word), but the principles are exactly the same, and the end result is equivalent given that by predicting the next character over and over again, you end up generating words and sentences.

One Sentence, Multiple Training Examples

Let’s consider the sentence “Today I put my coat on because it is very cold”. We could consider it as one training example, as in the picture above (i.e. giving the whole sentence except the last word as the context, and tell the model that what it needs to guess is the last word), but we can actually do more. Consider this:

With only one sentence, we end up with not only 1 training example, but with 10!
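The count is easy to verify with a few lines of Python. This is only an illustration at the word level, using the sentence from the text; the variable names are arbitrary.

```python
sentence = "Today I put my coat on because it is very cold"
words = sentence.split()

# Every prefix of the sentence is a context; the word that follows it is the label
examples = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for context, target in examples:
    print(f"{context!r} -> {target!r}")
print(len(examples))
```

The sentence has 11 words, and each of the 10 prefixes that still has a following word yields one (context, target) pair.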

That’s a very clever trick. Now let’s see how we can use it to actually implement a training set on real data.

Encoding Shakespeare

In his video, Karpathy works with all Shakespeare literature, which is one long text file that can be found here .

Let’s print the size of this text and the first few sentences:

Now, the very first step is to encode this into tokens, a process called tokenization. For the purpose of illustrating how to build a toy GPT imitating Shakespeare, Karpathy uses the most simple tokenization possible: encoding each character with just a number.

First, we can see that there are 65 distinct characters in that file, (see below), so the tokenization process will be about transforming each of those characters (including space) into its equivalent number between 0 and 64.

From there, Karpathy provides the simplest yet working tokenizer by providing the encode and decode lambda function below.

You can see how each character is encoded into a number between 0 and 64 (corresponding to the index of the relevant character in the list above), and then decoding it back to get the original sentence.
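A minimal sketch of such a character-level tokenizer. To stay self-contained it builds the vocabulary from a short sample string instead of the full Shakespeare file, so the vocabulary here is much smaller than 65; the names stoi and itos are illustrative.

```python
text = "First Citizen:\nBefore we proceed any further, hear me speak."  # small sample (assumption)

chars = sorted(set(text))                      # the vocabulary: all distinct characters
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> character

encode = lambda s: [stoi[c] for c in s]              # string -> list of integers
decode = lambda ids: "".join(itos[i] for i in ids)   # list of integers -> string

ids = encode("hear me")
print(len(chars))
print(ids)
print(decode(ids))
```

Note that encode will raise a KeyError for any character that is not in the vocabulary, which is fine for a closed corpus like this one.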

As we’ll work with PyTorch or TensorFlow to build GPT, we’ll first need to convert all that textual data into a long tensor containing those encoded numbers. If you don’t know what a tensor is, for now you can simply think of them as a generalization of a vector to N dimensions. So in our case, our tensor for now can be seen just as a long vector of integers, as illustrated below.

As a reminder, this is a very naive encoding (or more precisely, tokenization), and in practice, this is much more sophisticated (another great video from Karpathy details it here) but this will surprisingly be more than enough for illustrating the implementation of GPT from scratch.

Generating (a lot of) training data

Now that our data is properly encoded, let’s see the implementation of the clever trick we described earlier.

First, let’s split our data into training and test data. This is done simply by taking the first 90% of characters of the long text file of Shakespeare literature as the training data, and the rest as the test data.

And now the heart of the implementation of the clever data generation trick.

The goal now is to generate one batch of data for the training set (see older post here where I describe what batches look like in general). So here’s Karpathy’s simple and elegant code for generating the training set:

Let’s understand what is going on here:

First we decide the batch size (how many examples we want in each batch) and the block size (how many tokens we want to have as “the context” in each example).
Then to create a batch (of size 4 in our case), you take 4 totally random positions in the train data, which will be the beginning of each of the examples of the batch (the ix index) and you stack together:
- The 4 block_size (8) characters (4 x 8 characters)
- The same thing but shifted by one
This will create two tensors of size 4×8 each :

Hidden in there are 32 examples to learn from! Here are the first 8:

And that’s it, we now have a way to generate a lot of training data to train a GPT from scratch.
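The split and the batch generation can be sketched as follows. To keep the example self-contained, data is a simple range of integers standing in for the encoded Shakespeare text, and the get_batch helper name and its source argument are assumptions.

```python
import torch

torch.manual_seed(1337)
data = torch.arange(1000)                  # stand-in for the encoded text (assumption)
n = int(0.9 * len(data))
train_data, test_data = data[:n], data[n:]   # first 90% for training, the rest for testing

batch_size, block_size = 4, 8

def get_batch(source):
    ix = torch.randint(len(source) - block_size, (batch_size,))       # random start offsets
    x = torch.stack([source[i:i + block_size] for i in ix])           # the contexts
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])   # same, shifted by one
    return x, y

xb, yb = get_batch(train_data)
print(xb.shape, yb.shape)

# Each row hides block_size examples: context xb[b, :t+1] -> target yb[b, t]
for t in range(block_size):
    print(f"context {xb[0, :t + 1].tolist()} -> target {yb[0, t].item()}")
```

With 4 rows of 8 positions each, one call yields the 4 x 8 = 32 (context, target) pairs mentioned above.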

Next, we’ll talk about a very naive model, that will give very poor results but that will set the ground for the real game.

See you in Part 3: The Bigram Model .

GPT From Scratch #1: Intro

adjiman — Fri, 24 Oct 2025 11:13:11 +0000

Since the (generative) AI revolution started, it seems like we’re observing one breakthrough every 2 weeks on average, and sometimes it can feel overwhelming. Rather than shallowly chasing every breakthrough, I believe it is critical to first start by getting a deep understanding of what started it all: GPT.

Welcome to a new series of 7 posts, where we’re going to deep dive into one of the most exciting videos from Andrej Karpathy on the topic: Let’s build GPT: from scratch, in code, spelled out.

A Note on Motivation and Inspiration

Before we begin this 7-part journey, I want to set the stage. This entire series is a deep dive directly inspired by and based on Andrej Karpathy’s phenomenal video mentioned above, which in my opinion is by far the best resource on the internet to understand GPT.

As a fellow PhD in AI, my own learning process has always been to ‘teach what I learn.’ I created this series as a way to meticulously deconstruct and document every step from that video, solidifying my own understanding in the process.

While heavily guided by the video, I’ve invested significant time in structuring the content into clear topics, creating custom illustrations, and adding in-depth explanations to unpack the ‘why’ behind the ‘how.’ My hope is that this series serves as a valuable resource for those who learn best by reading, or who need a quick, searchable reference to complement the video format.

A final note: true understanding comes from doing. So I encourage you to use this, but then challenge yourself to e.g. reproduce Karpathy’s code entirely on your own.

What is GPT?

GPT stands for Generative Pretrained Transformer.

Generative: because it is a model that generates new content, most commonly words or more generally tokens. You give it some initial input (a prompt) and it generates what is most likely to follow from it.

Pretrained: because of the foundational training process the model undergoes before it is used.

Transformer: because it is based on the Transformer neural net architecture, introduced in the now famous attention is all you need paper.

Once you have such a model, building a chat bot like Gemini or ChatGPT requires a few more important steps, but GPT is the foundational part that enables it and our primary focus in this series.

Starting from the end

In some tense TV shows or movies, it sometimes starts by showing the final scene, without context. It looks great, but we have no clue about what is going on. And then the show starts all from the beginning, walking us slowly but surely through that final scene again, and that time, we understand it all.

This is what we’ll do in this series. We’ll just look at the end result, the heart of the beautiful and minimalist implementation of GPT by Karpathy. The core components basically look like this (don’t freak out just yet, it is expected if you don’t have a clue yet of what this code is doing):

The init and forward pass of the model. The key magical component in that code is the “block” variable, which includes the implementation of the transformer (and self attention).

A self-attention head (which is the core component of a Transformer that we’ll discuss later) :

And then, adding some key ingredient of the Transformer architecture, forming a Block (that is initialized in the first snippet).

From there, to generate text, you just do:

That’s it.

Not even 100 lines of code.

Knowing that this is what enabled the generative AI revolution, you’re probably staring at those like:

But no worries, after this series of posts, this code will hopefully look much clearer and more intuitive to you.

The journey to understand it all

To follow along, you’ll just need a basic understanding of Python and some basics of tensors and PyTorch neural networks. From time to time, it will be required to review one of the posts from my Deep Learning Gymnastic Series (I’ll point them out each time it is relevant).

Here is the plan of the posts we’ll study together in that series:

Part 1: Intro (this post)
Part 2: The Training set
Part 3: The Bigram model
Part 4: The Mathematical Trick behind Self Attention
Part 5: Positional Encodings
Part 6: Coding Self-Attention
Part 7: Building a GPT

From here, let’s continue to the next part, the training set.

Decoding Transformers: The Neural Nets Behind LLMs and More

adjiman — Thu, 28 Nov 2024 12:14:21 +0000

When Karpathy was asked by Lex Fridman “What is the most beautiful or surprising idea in deep learning or AI”, the answer was quite obvious: The Transformer Architecture.

The T in GPT.

When Google researchers introduced the idea back in 2017 with the famous “Attention is all you need” paper, it was looking like just “yet another cool idea for machine translation tasks”. In hindsight, it turned out to be the backbone of the current LLMs/AI revolution.

What you’ll find in this post:

A bit of history of transformers and the larger neural net architecture they belong to: Encoder-Decoder
An explanation of why transformers were a game changer
A deep (yet intuitive) dive into the core component on which transformers are based: self-attention
The important neural nets that emerged from Transformers
What are arguably the 10 most important lines of code behind the LLM revolution

Let’s get started.

Encoder-Decoder

A challenge of Deep Neural Networks back in 2010 was to handle “sequence to sequence” problems, where both the input and output are of unknown length. The best-known example is machine translation: if you need to translate from e.g. French to English, both the input sentence and its (output) translation are of unknown length.

It is in that context that the Encoder-Decoder neural network architecture emerged, pioneered by the paper Sequence to Sequence Learning with Neural Networks (the first author is Ilya Sutskever, former OpenAI chief scientist), which proposed a general (domain-independent) method to tackle sequence to sequence problems.

The core idea was to take a sequence as input, encode it (via an “encoder”) into a fixed size vector representation and then decode it (via a “decoder”) into another sequence (of possibly a different size).

The main evolution of the encoder-decoder neural network lies in what was used inside the encoders and decoders. It started with RNN, and then was revolutionized by Transformers.

RNN based vs. Transformer based Encoder-Decoder

RNN based encoders/decoders: In the initial paper mentioned above, the encoder and decoder components were handled by Recurrent Neural Networks (RNNs, more particularly LSTMs). The RNN encoder processes each word (or token) at a time (sequentially), encodes its state in a vector, and passes it on, until the whole sentence is encoded. The RNN decoder does the opposite: takes the encoded vector representation of the sentence, and decodes it one token at a time, using the current state and what has been decoded so far.

Transformer based encoders/decoders: Then came the famous Attention is all you need paper from Google, which suggested replacing the RNNs in the encoder-decoder with a new architecture called Transformer, which relies on what is called the “attention mechanism” (see section below).

Why were Transformers a game changer?

One could wonder what was so game-changing about the transformer architecture (compared to RNNs) that it took encoder-decoder from a very nice state-of-the-art method in NLP to what enabled the LLM revolution. Here are a few of the central reasons:

Parallelization: First, the self attention mechanism allows all tokens to be processed in parallel (unlike RNNs, which are sequential and recurrent at their core), which significantly sped up training and inference.

Self-attention: The longer the input sequence (or prompt) is, the more RNNs struggle to capture what is essential in it due to issues like vanishing/exploding gradients. The self attention mechanism overcomes this issue by being able to focus on the important part of the sequence given the context, regardless of its length (see next section for a deep dive into self attention).

A very effective general purpose computer: As Karpathy explains here (highly recommended short watch), Transformers are remarkably general purpose compared to all previous neural net architectures. You can feed it videos, images or speech or text and it just gobbles it up. Also, it is not only about Self Attention, as every piece and detail of the architecture (the residual connection, the layer normalization and more) creates not only a very powerful and expressive machine, but most importantly an optimizable one with our very basic but scalable methods like back-propagation/gradient descent. In Karpathy’s own words:

Deep Diving into the intuition behind self-attention

As we just said, transformers are not only about self attention. Yet, this new paradigm played a big part in Transformers’ success, and marked the beginning of the revolution in NLP tasks first, and in LLMs next.

We’ll use this great video to understand the core intuitions and components behind self attention.

First, let’s recall that in neural networks, words are represented by vectors of numbers, usually called embeddings (see the relevant section in my post here). Two words are similar if their embeddings point in the same direction (which corresponds to the dot product of their vectors being high). But the same word can have very different meanings depending on the context of a sentence.

The main purpose of the self attention mechanism is to adapt the vectors/embeddings of the words based on the context of the sentence/prompt.

If we take an example (from the video above) of a sentence, “I swam across the river to get to the other bank”, and draw the matrix of the dot products of the word embeddings (before applying self attention), we would get e.g. something like this:

All words on the diagonal obviously have the highest score, since it represents the similarity of a word with itself. But the word “bank” (traditionally related to the institution) might have nothing to do with the word “river” and get a low score. After applying the self attention mechanism, one would expect “bank” and “river” to have a strong correlation and thus a high dot product.

So how to weigh word vectors based on their context? The way the self attention mechanism proposed to do that is rather simple and can be summarized in the diagram below (also from the video) that we’ll explain step by step.

As mentioned above, the purpose of self attention is to transform each word vector of the sentence/prompt into a version that is weighted by the other word vectors in the sentence/prompt (a.k.a. the context). The diagram below (also from the video) illustrates exactly that for the word vector v₂, and shows how to transform it into a vector y₂ that is weighted based on the whole sentence/prompt (which contains only 3 words in that example: v₁, v₂ and v₃).

Check the diagram and the step by step explanations below.

From the top left of the diagram, this is what happens:

First we take the vector v₂ (which is a word embedding of dimension (1×50) in that example) and compute its dot product with every vector of the prompt, including itself. It gives 3 numbers (scalars): s₂₁ (which is v₂ . v₁), s₂₂ (which is v₂ . v₂) and s₂₃ (which is v₂ . v₃).
Those numbers (or scores) represent the respective affinity of v₂ with each of the words of the context. But since those numbers are not scaled, we just normalize them using softmax, thus giving 3 new numbers: the weights w₂₁, w₂₂, w₂₃.
And now, to get our y₂, which is the “weighted version of v₂ based on the context”, we just compute y₂ = w₂₁ v₁ + w₂₂ v₂ + w₂₃ v₃. Et voilà. You get y₂, the weighted version of the initial word vector v₂.
Pay attention to the dimensions: we started with a word embedding of dimension (1,50), and we properly end up with a contextualized version of it (y₂) with the same dimension (1,50).

Doing this for each word vector of the sentence is essentially what the self-attention mechanism is all about. Note that those operations can be made massively parallel using matrix multiplication.

The result is that now each embedding captures the relation with any other word in the sentence/prompt, regardless of the length/distance between two words, and it does it in a massively parallel and effective way.
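The steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the three toy vectors are made up and of dimension 4 (instead of 50), and the helper names `softmax`, `dot` and `contextualize` are hypothetical, not taken from the video.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max does not change the result but avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def contextualize(vectors, i):
    """Return y_i, the context-weighted version of vectors[i]."""
    v_i = vectors[i]
    scores = [dot(v_i, v_j) for v_j in vectors]  # s_i1, s_i2, s_i3
    weights = softmax(scores)                    # w_i1, w_i2, w_i3
    # y_i = sum_j w_ij * v_j, computed coordinate by coordinate
    return [sum(w * v_j[k] for w, v_j in zip(weights, vectors))
            for k in range(len(v_i))]

# Made-up toy embeddings of dimension 4 (the post uses dimension 50)
v1 = [1.0, 0.0, 1.0, 0.0]
v2 = [0.5, 0.5, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0, 1.0]
vectors = [v1, v2, v3]

y2 = contextualize(vectors, 1)
print(len(y2))  # same dimension as v2
```

Note that nothing here is learned: the output only depends on the input vectors, which is exactly the limitation discussed next.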

Now, if you’re into ML, you probably wonder: where are the learnable weights?? Indeed, you don’t need to train any model to apply the above mechanism, so how do we learn an optimal way to contextualize each word vector?

This is where the magic happens. In the steps we described above, you can think of v₂ as a query looking for similar words that could be matched as keys. So by simply introducing matrices of the right dimensions (here, (50×50)), you’re essentially creating learnable weights (optimizable through backpropagation at training time) that have the following meaning:

M_Q represents what v₂ is “looking for”
M_K represents, for each word vector (v₁, v₂ or v₃), what it “contains”, represents, or has to offer
M_V, or the values, are used essentially as a way to communicate the result of the matching between queries and keys.

Notice the dimensions, the M_Q, M_K and M_V are matrices that are just injected in between the simple weighting scheme that we described in the first diagram, and do not affect the input and output dimensions (1×50 vector dot product by 50×50 matrix still gives a 1×50 vector).

The main difference is that the M_Q, M_K and M_V matrices are now powerful learnable, optimizable weights of the model.

Scaling embeddings

The original paper described something called scaled dot-product attention. We’ll just give an intuition (again from the great video) of what that scaling term is.

Suppose you have an embedding vector being just (2,2,2). The magnitude of the vector is √(2² + 2² + 2²) = √12 ≈ 3.46.

If you divide this by the square root of the dimension of the vector (which is 3), i.e. you multiply by 1/√3, then you just get 2, which is the average.

Why is this important? Because the embeddings will usually be of high dimension (e.g. 300) and thus the dot products (going out of the GetScores component in the diagram above) can end up huge, which would pretty much annihilate the gradient when going through the softmax function.

That’s pretty much it: the scaling term is just a clever trick to keep the softmax weights in a reasonable range and not create issues at training time.
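To make the effect of the scaling term concrete, here is a small sketch in plain Python. The three raw scores are made-up values, chosen large to mimic dot products of high dimensional embeddings; only the dimension 300 comes from the example above.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d = 300                           # embedding dimension, as in the example above
raw_scores = [40.0, 25.0, 10.0]   # made-up dot products (large because d is large)
scaled_scores = [s / math.sqrt(d) for s in raw_scores]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print(raw_weights)     # nearly all the weight lands on the first score
print(scaled_weights)  # the weights are spread much more evenly
```

Without scaling, the softmax is almost a hard one-hot selection, which is where its gradient becomes tiny. After dividing by √d, the other words still get a meaningful share of the weight.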

The formula that captures it all

The whole (scaled dot-product) attention mechanism can be summarized by a simple formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimension of the keys.

It simply represents the matrix multiplications we described above, between queries (Q), keys (K) and values (V), going through the softmax function, after being scaled with the scaling term explained above.
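As a rough PyTorch sketch of that computation, assuming the 3 tokens and the embedding size of 50 from the example above: the names `query`, `key` and `value` are hypothetical stand-ins for M_Q, M_K and M_V, the inputs are random, and there is no masking, so this is not the full GPT attention head.

```python
import math
import torch

torch.manual_seed(0)

T, d = 3, 50                      # 3 tokens, embeddings of size 50
x = torch.randn(T, d)             # random stand-ins for the word vectors v1, v2, v3

# Hypothetical learnable matrices playing the roles of M_Q, M_K and M_V
query = torch.nn.Linear(d, d, bias=False)
key = torch.nn.Linear(d, d, bias=False)
value = torch.nn.Linear(d, d, bias=False)

Q = query(x)                      # (T, d): what each token is looking for
K = key(x)                        # (T, d): what each token has to offer
V = value(x)                      # (T, d): what each token communicates

scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (T, T) scaled affinities
weights = torch.softmax(scores, dim=-1)          # each row sums to 1
Y = weights @ V                                  # (T, d) contextualized vectors

print(Y.shape)  # torch.Size([3, 50])
```

The input and output shapes are identical, which is what allows such a block to be stacked many times in a transformer.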

This formula captures the essence of self attention, which is itself at the heart of the transformers that sparked the LLM revolution.

So yes, this formula is really at the heart of the LLM revolution and beyond.

BERT vs. BART vs. GPT: All flavors of Transformers

While GPT is the most famous usage of transformers which powered the revolution around LLMs and chat bots, some other famous models were also groundbreaking additions to the NLP world: BART and BERT.

Can you find what is common between the three? Yes, it is the T , which stands for Transformer.

Below is a comparative table between the three models.

Now a question for you: can you guess which of the three models generated that table?

You probably guessed it: it is GPT. And I promise: it was the only generated part of this blog post.

The most important 10 lines of code of the LLMs revolution?

My favorite series of learning videos in the past few years is by far Karpathy’s series on Neural Networks (my blog post series Deep Learning Gymnastics is directly inspired from it). In one of the videos, Andrej is building GPT from scratch.

In a future series of posts, I’ll deep dive into the core components of it, but just as a teaser, look at Karpathy’s concise and beautiful implementation of the self attention mechanism we described above.

The forward layer is just 10 lines of code, and implements exactly the attention formula that we described above:

Of course, those 10 lines of code in a silo, without transformers, back-propagation, gradient descent, and tons of GPUs cannot do much.

But since self-attention can be considered one of the most important core elements of the transformer breakthrough, if we were to decide which are the most important (or influential) 10 lines of code behind the LLM revolution, those would probably be among the best candidates.

Hope you enjoyed this post and see you soon for more.

Deep Learning Gymnastics #4: Master Your (LLM) Cross Entropy

adjiman — Sat, 09 Mar 2024 18:12:05 +0000

Welcome to the 4th episode of our Deep Learning Gymnastics series.

Today, we’ll use all the skills learned in our previous lessons: tensor broadcasting, indexing and reshaping, to revisit one of the most famous and important loss functions of supervised machine learning (and deep learning): cross entropy.

LLMs? Yes, they are also based on it. We’ll actually get inspired (again) by Andrej Karpathy’s videos around building an LLM from scratch to illustrate how to manipulate the cross entropy function.

A short refresher on Cross Entropy

Entropy in general and Cross-entropy in particular are fascinating concepts that lie at the foundation of information theory. If you want to dive a bit into it and understand the links between the logistic regression cost function, Log Loss, Cross Entropy and Negative Log Likelihood and are not afraid of some maths formulas, you can read one of my old posts here.

But for today we’ll focus on the essence. Cross-entropy in ML is most often used as a cost function that measures the difference between a probability vector (one probability per predicted class) and a one-hot encoded label. Typically:

Here, O is the raw output of the neural network, often called logits. Then, before we apply the cross entropy formula, we typically pass those logits through the softmax function so it becomes a probability vector P, where each probability is the prediction of each of your multiple classes. And L is the one hot encoded vector representing the label.

So in our example, we can see that the cross-entropy is simply -log(0.6), i.e. ~0.51 using the natural logarithm, which is what PyTorch uses (or ~0.22 with a base-10 logarithm). As you can see, the higher the probability for the correct class, the closer to 0 the cost will be (when the probability is 1 for the correct class, the cost is -log(1), which is 0). The lower the probability for the correct class, the bigger the cost (tending to infinity when the probability tends to 0). Note that the figure above is inspired by this short great video.
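Here is a minimal sketch of that computation in plain Python. Only the 0.6 for the correct class comes from the example; the two other probabilities are made up so that the vector sums to 1.

```python
import math

# Hypothetical probability vector P (after softmax) and one-hot label L
P = [0.1, 0.6, 0.3]
L = [0, 1, 0]

# General formula: -sum_i L_i * log(P_i), using the natural logarithm
cross_entropy = -sum(l * math.log(p) for l, p in zip(L, P))

# Because L is one-hot, this is just -log of the probability of the correct class
shortcut = -math.log(P[L.index(1)])

print(round(cross_entropy, 4))  # 0.5108
print(round(shortcut, 4))       # 0.5108
```

The shortcut (pluck the probability of the correct class, then take minus its log) is the form that comes back later when handling a whole batch.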

Cross Entropy in LLMs

The core capability of Large Language Models (LLMs) is to predict the next word (or, more generally, token) given a list of previous words/tokens. In a future blog post, we’ll describe precisely how the training set is built, but for the sake of this post, let’s illustrate a batch of the training set of an LLM on a picture and explain it:

In episode #2 of our series, we explained what a batch is, and that those numbers represent the index of a token in the vocabulary. Assuming our LLM is predicting the next token (out of 27 possible) given a context of at most 3 tokens, this is how to read the figure above:

The batch on the left represents 8 lines of three tokens each.
Each token of the batch points to a tensor of size (27,1) representing the prediction of what the next token should be (one logit for each of the 27 possible tokens). So the batch tensor shape is (8,3,27).
For instance, the (27,1) tensor in the figure represents the prediction for each of the 27 tokens, given the sequence of the three tokens 7,16,18.
In that example, what is e.g. the logit prediction for the next token to be token 1? just look at index 1 of that vector. Here you go: ~0.55 (which seems rather high compared to others)
The tensor on the right contains the labels (the actual next token from the training set). It thus has the same shape as the batch, except that it does not contain prediction logits tensors, so just (8,3).

How to calculate the Cross Entropy on that single prediction logits (in the figure) against the actual label?

Simple, we just follow the diagram we gave above: we pass that vector through the softmax function, which will give us the (27,1) tensor P representing probabilities. Then we have L = (0,1,0,0,0,0,…,0) , and we just apply the cross entropy formula.

The Gymnastic Exercise

In the previous section, we explained how to compute the Cross Entropy for one single entry of the (8,3) batch of our example. But how to compute it for the whole batch? To do so, we need to calculate the exact same thing, but for the 8*3 = 24 possible cases.

Did you recognize the vector we had in the previous section’s figure? Yes, that’s the 7th one from the bottom.

So the gymnastic exercise is to take the initial batch with prediction tensor of shape (8,3,27), stretch it out to the 8*3 = 24 prediction logits (which is a (24,27) tensor as in the picture above), do the same for the label tensor, and from there, compute in parallel the cross entropy of the 24 pairs of logits/label, and return their mean as the result.

Solving it in PyTorch

First we need to generate all the input tensors:

X, the batch with prediction, which is a (8,3,27) tensor
Y, the labels, which is a (8,3) tensor.

The code below will produce the same numbers as the ones shown in the second figure of this post.

import torch
torch.manual_seed(18)

# creates the batch of token indexes (note: high is exclusive, so the indexes are in 0..25)
random_tensor = torch.randint(low=0, high=26, size=(8,3))

# create random logits for each index in the vocabulary
L = torch.randn((27, 27))
#creating the labels
Y = torch.randint(low=0, high=26, size=(8,3))
# creating our batch (8,3,27). C.f https://www.philippeadjiman.com/blog/2023/12/23/deep-learning-gymnastics-tensor-indexing/ 
X = L[random_tensor]

To fully understand this code, please refer to the post #2 of this series about tensor indexing.

Note that in that other post, we created embeddings of size 4 as an illustration, while here, we’re having already the final logits (of size 27, which is the vocabulary size). In a fully implemented LLM, those logits will only come up after many steps (stay tuned for a future blog post about it).

Now, we’d like to use PyTorch’s cross_entropy function. Reading the doc, we see that it expects the logits of the classes to be in the second dimension, which corresponds exactly to what we described in the figure above: stretching out the input batch. And the same goes for the labels. We actually learned how to do that with views in post #3 of this series, around tensor reshaping. So here you go:

#Reshaping before using cross_entropy. C.f https://www.philippeadjiman.com/blog/2024/02/03/deep-learning-gymnastics-tensor-reshaping/
B,T,C = X.shape 
logits = X.view(B*T,C)
labels = Y.view(B*T)

With that, we’ll exactly obtain what we illustrated in our previous figure.

Now that we got our inputs in the proper shape, we can compute our cross entropy with the function:

import torch.nn.functional as F
F.cross_entropy(logits , labels)

Which gives 3.7759 . Yay! we computed the cross entropy of our LLM batch

Calculating Cross Entropy “manually”

Turns out that once we have the logits and labels in the proper shape, like we just did with views, calculating cross entropy without using PyTorch’s function is actually quite simple, and is useful to understand what happens behind the scenes.

Here is a compact and elegant way to do it (credit again to the code from Karpathy’s videos):

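# softmax, step 1: exponentiate the logits
counts = logits.exp()
# softmax, step 2: normalize each row so that it sums to 1
prob = counts / counts.sum(1,keepdims=True)
# pluck out the probability of the correct token in each of the 24 rows, then average the negative logs
- prob[torch.arange(24),labels].log().mean()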

Surely enough, it returns the exact same result (3.7759) as when using the PyTorch function .

So what’s going on in that code?

The first two lines transform the logits into probabilities using the softmax function, by first applying the exponential function and then dividing each exponential by the sum of the exponentials of its row. Wonder what that keepdims=True means? Please read post #1 of this series, around tensor broadcasting.

Now the last line is interesting.

Remember our initial figure. Let’s look again how cross entropy is calculated:

Given L is a one hot encoded vector, there will be only one 1, and thus the cross entropy is just about plucking out the right index in P and -log it. In the figure, the 1 is at the second place, so in terms of index it is 1 (as index starts at 0), and thus cross entropy is simply -log(P[1]).

Because in our code the labels are already numbers between 0 and 26 (indexes into the vocabulary of size 27), we can use them as indexes, extract the right probability in each of the 24 vectors of prob, take the log of them all, and the negated mean is simply the cross entropy of the whole batch.

So, simply:

- prob[torch.arange(24),labels].log().mean()

Magical, no?

If you’re wondering why it is still worth using the built-in cross entropy function, watch this great explanation by Andrej Karpathy.
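One reason that can be checked directly is numerical stability. The sketch below is my own illustration with a deliberately huge, made-up logit: the manual version overflows, while the built-in function stays finite.

```python
import torch
import torch.nn.functional as F

# One example with a deliberately huge logit (made-up values)
logits = torch.tensor([[1000.0, 0.0, 0.0]])
labels = torch.tensor([0])

# Manual version: exp(1000) overflows to inf in float32, and inf / inf is nan
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
manual = -prob[torch.arange(1), labels].log().mean()

# Built-in version: computed in a numerically stable way
builtin = F.cross_entropy(logits, labels)

print(manual)   # nan
print(builtin)  # a finite value, essentially 0
```

In practice, logits rarely get that extreme, but the built-in function removes the need to think about it.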

What about TensorFlow?

As traditionally done in the posts of that series, let’s also look at the equivalent code in TensorFlow.

As for PyTorch, for all the gymnastic preparation (broadcasting, indexing and reshaping), please refer to the post #1 , #2 and #3 of our Deep Learning Gymnastic series .

Regarding the cross entropy function in TensorFlow, we can use e.g. sparse_softmax_cross_entropy_with_logits. Note how explicit the name is: it tells you that you need to pass logits, and that it will then apply softmax and cross entropy.

If you’re using Keras, you can also use the SparseCategoricalCrossentropy . Note that to do so, you first need to instantiate the function , explicitly saying we’re using logits, and then apply it to the reshaped logits and labels.

Find the full code below, illustrating both cross entropy functions.

import tensorflow as tf
tf.random.set_seed(18)

# Create a random batch of shape (8,3) with indexes between 0 and 25 (maxval is exclusive)
random_tensor = tf.random.uniform(shape=(8,3), minval=0, maxval=26, dtype=tf.int32)

# create random logits for each index in the vocabulary
L = tf.random.uniform((27,27), dtype=tf.float32)

#creating the labels
Y = tf.random.uniform(shape=(8,3), minval=0, maxval=26, dtype=tf.int32)

# creating our batch (8,3,27). C.f https://www.philippeadjiman.com/blog/2023/12/23/deep-learning-gymnastics-tensor-indexing/ 
X = tf.gather(L,random_tensor)

#Reshaping before using cross_entropy. C.f https://www.philippeadjiman.com/blog/2024/02/03/deep-learning-gymnastics-tensor-reshaping/
B,T,C = X.shape
logits = tf.reshape( X , [B*T,C])
labels =  tf.reshape( Y , [B*T,1]) # 24 numbers (each one between 0 and 26)

#Calling cross entropy using sparse_softmax_cross_entropy_with_logits
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels[:, 0],logits=logits)
print(tf.reduce_mean(loss))

#Calling cross entropy using Keras' SparseCategoricalCrossentropy
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(ce(labels,logits))

That’s it for today.

Hope you’re feeling in better shape with your tensors. Until our next episode.


Deep Learning Gymnastics #3: Tensor (re)Shaping

adjiman — Sat, 03 Feb 2024 16:47:39 +0000

Welcome to the 3rd episode of the Deep Learning Gymnastics series. By now you should already start to be in shape. That’s good, because today we’ll talk about how to shape (or more precisely reshape) tensors, a basic yet critical operation that is needed in any advanced enough deep learning model implementation.

To best understand this post, it is highly recommended to read the previous gymnastic exercise around tensor indexing as we’ll build on top of it.

MLP Motivating example

To illustrate the power of tensor (re-)shaping, we’ll continue to get inspired by Andrej Karpathy’s makemore series, where he implements from scratch the famous paper “A neural probabilistic language model”. As Andrej says, it is not the first paper that proposed a neural network approach to predict the next token in a sequence, but it is one that is very often cited and is a really nice write-up.

The gymnastic exercise will consist of implementing the bottom part of the figure below, which describes the architecture of the neural network (or Multi Layer Perceptron, MLP for short) defined in the paper. First we’ll explain the diagram a bit so that the goal of the exercise is crystal clear.

Let’s assume that the 3 green dots at the bottom are the last three characters of a word and that we’re trying to predict (or generate) the next character. The first layer is nothing else than the embeddings of each of the three characters. Turns out it is exactly the output of the example we introduced in our previous gymnastic exercise around tensor indexing. We ended up with a tensor of shape (8,3,4), the one on the right in the figure below. As a reminder, an embedding here is simply a one dimensional tensor (of size 4 in our case).

So in our example, the first layer of the neural net is nothing else than the 3 embeddings of each character, as seen below:

So the first example of the batch is associated with those three embeddings:

Now, in order to pass this to the next layer, we need to concatenate those three embeddings of size 4 each into a single long one of size 12.

So here is the gymnastic exercise: take our (8,3,4) tensor, and for each of the 8 lines of the batch, transform the 3 embeddings of size 4 into one of size 12 (which is just the concatenation of the 3). We should thus end up with a tensor of shape (8,12).

The basics of PyTorch Views

Let’s introduce the concept that will allow us to solve the gymnastic exercise as a breeze: PyTorch views. The easiest way to understand PyTorch views is through a simple example.

Let’s create a one dimensional tensor of elements from 0 to 17.

The exact same underlying storage can be viewed as (2,9) tensor.

Or a (9,2) one:

Or a (3,2,3) one:

As you can see, as long as the product of the dimensions equals the number of elements in the underlying storage (18 in our case), we can view (or reshape) the tensor.

Beyond being very convenient, the big advantage of this is that it is blazing fast, because no new tensors are created: the underlying storage stays the same, and only some metadata about the tensor is modified.

Bonus: we can also use -1 to infer the dimension automatically. E.g., if the underlying storage is 18 numbers, then invoking the view function with shape (-1,9), it will deduce that the first dimension has to be 2:
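The view examples described in this section can be sketched as follows. This is my own minimal reconstruction under the stated shapes; the variable names `a` and `b` are arbitrary.

```python
import torch

a = torch.arange(18)          # one dimensional tensor: 0, 1, ..., 17

print(a.view(2, 9).shape)     # torch.Size([2, 9])
print(a.view(9, 2).shape)     # torch.Size([9, 2])
print(a.view(3, 2, 3).shape)  # torch.Size([3, 2, 3])
print(a.view(-1, 9).shape)    # torch.Size([2, 9]): the -1 is inferred as 2

# A view shares the underlying storage with the original tensor: no data is copied
b = a.view(2, 9)
print(b.data_ptr() == a.data_ptr())  # True

# Writing through the view is therefore visible in the original tensor
b[0, 0] = 100
print(a[0])  # tensor(100)
```

The last two prints are a quick way to convince yourself that only the metadata (shape and strides) differs between `a` and `b`.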

Solving our gymnastic exercise with views

Now that we understand views, let’s get back to our gymnastic exercise: we have a tensor of shape (8,3,4) and we need to transform it into a tensor of shape (8,12). First, let’s reproduce the embedded batch of shape (8,3,4) (see our previous gymnastic exercise to understand the code below):

import torch
torch.manual_seed(18)

# Create a random batch of shape (8,3)
# with indexes between 0 and 25 (high is exclusive)
random_tensor = torch.randint(low=0, high=26, size=(8,3))

# Create a random embedding matrix of shape (27,4): 
# one embedding for each of the 27 indexes elements
embeddings = torch.randn(size=(27, 4))

#Creating the embedded batch
embedded_batch = embeddings[random_tensor]

Get ready, and let’s solve our exercise. As in last post, it will be a short yet sharp (tensor) movement:

input_layer = embedded_batch.view(8,12)

Yes, that’s it, just one line. For each of the 8 lines of the batch, it takes the 3 associated embeddings of size 4 each and concatenates them, extremely efficiently and in parallel, so that we end up with a tensor of shape (8,12).

Let’s actually validate it on the first example of the batch:

We obtain an embedding of size 12 as expected, which is nothing else than the concatenation of the 3 embeddings of size 4 that we showed at the end of our motivating example above. Baam.
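A possible way to run that check in code is sketched below. It repeats the setup so that it runs on its own, and the variable name `expected` is mine.

```python
import torch

torch.manual_seed(18)
random_tensor = torch.randint(low=0, high=26, size=(8, 3))
embeddings = torch.randn(size=(27, 4))
embedded_batch = embeddings[random_tensor]   # shape (8, 3, 4)
input_layer = embedded_batch.view(8, 12)     # shape (8, 12)

# Concatenate by hand the 3 embeddings of the first example of the batch
expected = torch.cat([embedded_batch[0, 0],
                      embedded_batch[0, 1],
                      embedded_batch[0, 2]])

print(input_layer[0].shape)                   # torch.Size([12])
print(torch.equal(input_layer[0], expected))  # True

# Same check for the whole batch: unbind the 3 positions and concatenate them
whole = torch.cat(torch.unbind(embedded_batch, 1), 1)
print(torch.equal(input_layer, whole))        # True
```

The explicit concatenation gives the same result as the view, but it creates new tensors, whereas the view only changes metadata.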

Oh, let’s not forget that we created this to pass it as input to a layer of a neural net. So let’s do it: we create the initial random weights and biases of the layer, pass our (reshaped) batch into it and apply tanh on top of it. In other words:

W1 = torch.randn((12, 100)) # weights
b1 = torch.randn(100) # biases
h = torch.tanh(embedded_batch.view(-1, 12) @ W1 + b1) # (8,12) @ (12,100) => (8,100)

PyTorch view vs. reshape ?

There is another function in PyTorch called reshape that seems to achieve the exact same goal as view. So what’s the difference?

Typically, view is extremely efficient as it won’t move any underlying data and just modifies the shape of the tensor. But it comes with a constraint: the underlying data has to be contiguous, otherwise calling view will raise an error (see example below).

If you’re not sure if your tensor is contiguous, you can either use the contiguous function before calling view (it will make the tensor contiguous), or simply use reshape which returns a view if the shapes are compatible, and copies otherwise.

You might ask why anyone would use view over reshape? I asked myself the same question, and I assume that since view is guaranteed to be efficient, seeing it in the code tells any reader that there is nothing to optimize there. As for the one writing the code, in the cases where there would be an inefficient copy, view will at least fail explicitly and make you aware of the potential efficiency bottleneck.

Below is an example of code illustrating where view wouldn’t work:

import torch

# Create a non-contiguous tensor
tensor = torch.tensor([[1, 2, 3], [4, 5, 6]]).t()  # Transpose to make it non-contiguous

# Reshape works successfully
reshaped_tensor = tensor.reshape(6)
print(reshaped_tensor)  # Output: tensor([1, 4, 2, 5, 3, 6])

# View fails with an error
try:
    viewed_tensor = tensor.view(6)
except RuntimeError as e:
    print(e)  # Prints a message starting with: view size is not compatible with input tensor's size and stride

TensorFlow reshape

Obviously, TensorFlow also supports the same powerful reshape operation. In TensorFlow, you don’t have the explicit view function, but reshape handles non-contiguous tensors gracefully, similar to PyTorch’s reshape.

Below is the full TensorFlow code equivalent to what we illustrated above in PyTorch.

import tensorflow as tf
tf.random.set_seed(18)

# Create a random batch of shape (8,3) with indexes between 0 and 25 (maxval is exclusive)
random_tensor = tf.random.uniform(shape=(8,3), minval=0, maxval=26, dtype=tf.int32)

# Create a random embedding matrix of shape (27,4): one embedding for each of the 27 indexes elements
embeddings = tf.random.uniform((27,4), dtype=tf.float32)

# Solving the gymnastic exercise: creating an embedded batch with the tf.gather function
embedded_batch = tf.gather(embeddings,random_tensor)

# Validating the results
print(random_tensor)
print(embeddings)
print(embedded_batch.shape) # (8,3,4) which is the expected dimension
print(embedded_batch[0,0])

W1 =  tf.random.normal([12, 100])
b1 =  tf.random.normal([100])
h = tf.math.tanh(tf.linalg.matmul(tf.reshape(embedded_batch, [8, 12]) , W1) + b1)

Another example of usage: CNNs

Reshaping is a very useful operation in various cases in Deep Learning. Another frequent usage/example is in the context of image manipulation in convolutional neural networks (CNN), where you need for instance to connect the output of a convolutional layer to a fully connected layer:

import torch

# An output from a convolutional layer
conv_output = torch.randn(10, 8, 5, 5)  # (batch size, channels, height, width)

# Flatten for a fully connected layer
flattened = conv_output.view(-1, 8 * 5 * 5)  # (batch size, flattened features)

print(flattened.shape)  # Output: torch.Size([10, 200])

Alright, that’s it for today. Hope you’re now in better shape, and see you next time for other gymnastic exercises.

References

Part 2 of the amazing makemore series by Andrej Karpathy (which inspired this post).
Great blog post on the internal representation of tensors, and its very cool stride visualizer (it is from a PyTorch research engineer, so it is about PyTorch, but the concepts are useful in general).