Welcome to Part 5 of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.
Links to previous and upcoming posts of the series:
- Part 1: Intro
- Part 2: The Training set
- Part 3: The Bigram model
- Part 4: The Mathematical Trick behind Self Attention
- Part 5: Positional Encodings (this post)
- Part 6: Coding Self-Attention
- Part 7: Building a GPT
In Part 2 we explained how to create a training set from Shakespeare’s works. Part 3 introduced a basic bigram model, which predicts the next character based solely on its predecessor. Part 4 explained how a very clever matrix multiplication lets us perform operations (like averaging) over the previous characters in a very efficient way.
Before we jump to the actual implementation of a transformer (the heart of GPT), we’ll add one more trick to the arsenal: positional encoding.
Why a token’s position matters
Consider the two sentences:
- “Alice gave the book to Bob.”
- “Bob gave the book to Alice.”
contain the exact same words but mean different things — only the order changes. A large language model without some notion of order wouldn’t be able to tell the difference.
The Solution: Positional Encodings
We already have embeddings for each character (see Part 3, “The token embedding table”). To give the model a way to learn the notion of position, we’ll simply add another set of embeddings, one per position. We’re working with characters, but the same applies to words or, more generally, tokens.
Think of it like this:
Token Embedding = What the word is (or its identity)
Positional Encoding = Where the word is
How to add it to our model
We’re still focused on our good old batch of shape B×T×C (as explained in Part 3, section “the logits”).

In our toy example, the batch contains B = 8 examples, each example has T = 3 characters (let’s call it block_size), and each character embedding has size C = 2 (let’s call it n_embd). So, for the positions, we need block_size embeddings, one to encode each position a character can occupy within an example. As for the size of each positional embedding, we could pick anything, but for simplicity we’ll use the same size as the character embedding, n_embd.
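To make those shapes concrete, here is a minimal sketch of such a batch (the tensor is random filler; only the shapes matter, and the variable names B, T, C follow this post):

```python
import torch

# Toy shapes from this post: B examples, T characters each, C-dim embeddings
B, T, C = 8, 3, 2

# A random stand-in for a batch of character embeddings
x = torch.randn(B, T, C)
print(x.shape)  # torch.Size([8, 3, 2])
```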
Let’s see what it looks like in code. We first need to declare our new embeddings table.
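A minimal sketch of that declaration, using the `position_embedding_table` name from Karpathy’s video and the toy `block_size` and `n_embd` values from this post:

```python
import torch.nn as nn

block_size = 3  # T: number of characters per example
n_embd = 2      # C: embedding size

# One learnable embedding vector per position 0 .. block_size - 1
position_embedding_table = nn.Embedding(block_size, n_embd)
```

Each row of this table is a learnable vector that the model will associate with one position.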

As a reminder, our token (or character in our case) embedding table was keeping track of the embedding of each of our 65 (vocab_size) characters. The embedding of each character is of size n_embd. This token_embedding_table was initialized like this:
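That initialization was, roughly, the following (a sketch assuming an `nn.Embedding`, as used in the earlier parts of the series):

```python
import torch.nn as nn

vocab_size = 65  # number of distinct characters in the Shakespeare dataset
n_embd = 2       # toy embedding size from this post

# One learnable embedding vector per character in the vocabulary
token_embedding_table = nn.Embedding(vocab_size, n_embd)
```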

Now, in the forward pass, how do you combine the embeddings of both the identity of the character (token_embedding_table) and the embeddings of the position (position_embedding_table)?
Answer: you simply sum them up!
Before doing so, you first need to fetch the logits (as we explained in Part 3, section “the logits”).
As for the position embeddings, you just need to arrange them so they align nicely as T × C (i.e. block_size × n_embd). All in all, this is how to include positional embeddings in the logits:
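Here is a sketch of that forward-pass combination, under the toy shapes above (variable names follow Karpathy’s video; `idx` is a hypothetical (B, T) batch of character indices):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 3, 2
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

# A toy (B, T) batch of character indices
idx = torch.randint(0, vocab_size, (8, block_size))
B, T = idx.shape

tok_emb = token_embedding_table(idx)                 # (B, T, C): what each character is
pos_emb = position_embedding_table(torch.arange(T))  # (T, C): where it sits
x = tok_emb + pos_emb                                # (B, T, C): broadcasting adds pos_emb to every example
```

Note how broadcasting does the alignment for us: the (T, C) positional table is added to each of the B examples in the batch.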

That’s it.
From there, the logits carry a “positional” component in the forward pass, so backpropagation will optimize the network’s weights with that positional information, steering the optimization to take position into account.
Now, the next step is to enrich the logits further with another piece of information: the famous attention layer, the heart of the transformer architecture and what made GPT what it is today. That’s what we’ll explore in our next post: Coding Self-Attention.
