Since the (generative) AI revolution started, it seems like a new breakthrough lands every two weeks on average, and it can feel overwhelming. Rather than shallowly chasing every breakthrough, I believe it is critical to first build a deep understanding of what started it all: GPT.
Welcome to a new series of 7 posts, where we're going to deep dive into one of the most exciting videos from Andrej Karpathy on the topic: Let's build GPT: from scratch, in code, spelled out.
A Note on Motivation and Inspiration
Before we begin this 7-part journey, I want to set the stage. This entire series is a deep dive directly inspired by and based on Andrej Karpathy’s phenomenal video mentioned above, which in my opinion is by far the best resource on the internet to understand GPT.
As a fellow PhD in AI, I have always learned best by 'teaching what I learn.' I created this series as a way to meticulously deconstruct and document every step from that video, solidifying my own understanding in the process.
While heavily guided by the video, I’ve invested significant time in structuring the content into clear topics, creating custom illustrations, and adding in-depth explanations to unpack the ‘why’ behind the ‘how.’ My hope is that this series serves as a valuable resource for those who learn best by reading, or who need a quick, searchable reference to complement the video format.
A final note: true understanding comes from doing. So I encourage you to use this series, but then challenge yourself to, for example, reproduce Karpathy's code entirely on your own.
What is GPT?
GPT stands for Generative Pretrained Transformer.
Generative: because it is a model that generates new content, most commonly words or more generally tokens. You give it some initial input (a prompt) and it generates what is most likely to follow from it.
Pretrained: because of the foundational training process the model undergoes before it is used.
Transformer: because it is based on the Transformer neural net architecture, introduced in the now-famous Attention Is All You Need paper.
Once you have such a model, building a chatbot like Gemini or ChatGPT requires a few more important steps, but GPT is the foundational part that enables them, and it is our primary focus in this series.
Starting from the end
Some tense TV shows and movies open with the final scene, without any context. It looks great, but we have no clue what is going on. Then the story starts from the beginning, walking us slowly but surely toward that final scene again, and this time, we understand it all.
This is what we'll do in this series. We'll first look at the end result: the heart of Karpathy's beautiful and minimalist implementation of GPT. The core components basically look like this (don't freak out just yet; it is expected that you have no clue what this code is doing):
The init and forward pass of the model. The key magical component in that code is the "block" variable, which contains the implementation of the Transformer (and self-attention).
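For readers who can't see the original snippet, here is a minimal sketch of what such an init and forward pass roughly look like, following the spirit of Karpathy's video. The class name `GPTLanguageModel` and the hyperparameter names are my own assumptions, and the `Block` here is a deliberately simplified placeholder (it omits self-attention, which the real Block adds):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Placeholder Transformer block: just a feed-forward net with a
    residual connection. The real Block also contains self-attention."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return x + self.ffwd(self.ln(x))

class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer):
        super().__init__()
        # each token maps to an embedding vector; each position does too
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # the "magic": a stack of Transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)            # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)  # project back to vocabulary

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                          # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, C)
        x = tok_emb + pos_emb          # combine token identity and position
        x = self.blocks(x)             # run through the Transformer blocks
        x = self.ln_f(x)
        logits = self.lm_head(x)       # (B, T, vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss
```

Don't worry about the details yet; the whole series is devoted to unpacking each of these pieces.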

A self-attention head (the core component of a Transformer, which we'll discuss later):
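Again, for readers without the image, here is a sketch of what a single causal self-attention head roughly looks like in Karpathy's style (the names `key`, `query`, `value`, and `tril` follow the video; exact details may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of (causal) self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular matrix used to mask out "future" positions
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # attention scores ("affinities"), scaled by sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)
        return wei @ v  # (B, T, head_size)
```

The masking ensures each token can only attend to itself and the tokens before it, which is what makes generation autoregressive.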

And then, adding some key ingredients of the Transformer architecture, we form a Block (the one initialized in the first snippet).
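Here too, a hedged sketch of what that Block roughly looks like: multiple attention heads in parallel, a feed-forward net, residual connections, and layer norms. Class and attribute names follow the video's conventions but are not guaranteed to match exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """Condensed version of the self-attention head shown above."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        return F.softmax(wei, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Several heads in parallel, concatenated then projected back."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(n_embd, head_size, block_size) for _ in range(n_head)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class FeedForward(nn.Module):
    """Position-wise MLP: the 'computation' after the 'communication'."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: attention then feed-forward,
    each wrapped in a residual connection with a (pre-)layer norm."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # communicate
        x = x + self.ffwd(self.ln2(x))  # compute
        return x
```

We'll unpack why each of these ingredients matters (residuals, layer norms, the projection) later in the series.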

From there, to generate text, you just do:
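In sketch form, generation is a simple sampling loop: feed the sequence in, take the logits for the last position, sample the next token, append it, and repeat. The function signature below is my own framing (Karpathy implements it as a method on the model), written to work with any model that returns `(logits, loss)`:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressively extend idx of shape (B, T) with max_new_tokens sampled tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the last block_size tokens
        logits, _ = model(idx_cond)               # model returns (logits, loss)
        logits = logits[:, -1, :]                 # focus on the last time step
        probs = F.softmax(logits, dim=-1)         # turn logits into probabilities
        idx_next = torch.multinomial(probs, num_samples=1)  # sample one token
        idx = torch.cat((idx, idx_next), dim=1)   # append and continue
    return idx
```

Decoding the resulting token ids back into characters gives you the generated text.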

That’s it.
Not even 100 lines of code.
Knowing that this is what enabled the generative AI revolution, you’re probably staring at those like:

But no worries: after this series of posts, this code will hopefully look much clearer and more intuitive to you.
The journey to understand it all
To follow along, you'll just need a basic understanding of Python, plus some basics of tensors and PyTorch neural networks. From time to time, you'll also need to review one of the posts from my Deep Learning Gymnastic Series (I'll point them out whenever relevant).
Here is the plan of the posts we'll study together in this series:
- Part 1: Intro (this post)
- Part 2: The Training set
- Part 3: The Bigram model
- Part 4: The Mathematical Trick behind Self Attention
- Part 5: Positional Encodings
- Part 6: Coding Self-Attention
- Part 7: Building a GPT
From here, let's continue to the next part: the training set.
