{"id":2212,"date":"2025-11-22T16:27:57","date_gmt":"2025-11-22T16:27:57","guid":{"rendered":"https:\/\/philippeadjiman.com\/blog\/?p=2212"},"modified":"2025-11-22T16:29:28","modified_gmt":"2025-11-22T16:29:28","slug":"gpt-from-scratch-7-building-a-gpt","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2025\/11\/22\/gpt-from-scratch-7-building-a-gpt\/","title":{"rendered":"GPT From Scratch #7: Building a GPT"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Welcome to the final part of our GPT From Scratch series, inspired by Karpathy\u2019s&nbsp; <a href=\"https:\/\/www.youtube.com\/watch?v=kCc8FmEb1nY\">Let&#8217;s build GPT: from scratch, in code, spelled out.<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Links to previous and upcoming posts of the series:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/10\/24\/gpt-from-scratch-1-intro\/\">Part 1: Intro<\/a><\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/06\/gpt-from-scratch-2-the-training-set\/\">Part 2: The Training set<\/a>\u00a0<\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/10\/gpt-from-scratch-3-the-bigram-model\/\">Part 3: The Bigram model<\/a>\u00a0<\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/14\/gpt-from-scratch-4-the-mathematical-trick-behind-self-attention\/\">Part 4: The Mathematical Trick behind Self Attention<\/a><\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/15\/gpt-from-scratch-5-positional-encodings\/\">Part 5: Positional Encodings<\/a><\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/19\/gpt-from-scratch-6-coding-self-attention\/\">Part 6: Coding Self-Attention<\/a>\u00a0<\/li>\n\n\n\n<li class=\"\">Part 7: Building a GPT (this post)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The T in GPT stands 
for Transformer, the key neural net architecture that powers LLMs and more.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the last post we took a deep dive into self attention\u2019s code and intuition. Although self attention is the heart of transformers, there are a few additional critical parts of the transformer architecture that actually made it shine. Almost every single post, deck or tutorial about transformers shows this diagram from Google\u2019s original \u201c<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">attention is all you need<\/a>\u201d paper.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img decoding=\"async\" width=\"1237\" height=\"1600\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-90.png?fit=792%2C1024&amp;ssl=1\" alt=\"\" class=\"wp-image-2218\" style=\"width:548px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-90.png?w=1237&amp;ssl=1 1237w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-90.png?resize=232%2C300&amp;ssl=1 232w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-90.png?resize=792%2C1024&amp;ssl=1 792w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-90.png?resize=768%2C993&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-90.png?resize=1188%2C1536&amp;ssl=1 1188w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In this post, we\u2019ll get to implement and understand all the parts of that diagram that are relevant to building a GPT.&nbsp;<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">(Masked) Multi-Head Attention<\/h1>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" width=\"870\" height=\"1054\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-91.png?fit=845%2C1024&amp;ssl=1\" alt=\"\" class=\"wp-image-2220\" style=\"width:175px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-91.png?w=870&amp;ssl=1 870w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-91.png?resize=248%2C300&amp;ssl=1 248w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-91.png?resize=845%2C1024&amp;ssl=1 845w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-91.png?resize=768%2C930&amp;ssl=1 768w\" sizes=\"auto, (max-width: 870px) 100vw, 870px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This part is really about running multiple attention heads (like the one in our <a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/19\/gpt-from-scratch-6-coding-self-attention\/\">last post<\/a>, see the Head module) in parallel.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Karpathy\u2019s implementation is as elegant and simple as this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"391\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-94.png?fit=1024%2C250&amp;ssl=1\" alt=\"\" class=\"wp-image-2222\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-94.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-94.png?resize=300%2C73&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-94.png?resize=1024%2C250&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-94.png?resize=768%2C188&amp;ssl=1 768w, 
https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-94.png?resize=1536%2C375&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A few notes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">In the forward pass, dim = -1 means the head outputs are concatenated along the last dimension, which is the C in B,T,C and corresponds to the size of the embeddings.<\/li>\n\n\n\n<li class=\"\">So basically, each self-attention head produces a vector for each token, and here you simply concatenate those vectors.<\/li>\n\n\n\n<li class=\"\">To keep the dimensions unchanged: if we previously had a single attention head of size 32 and now want 4 attention heads, we divide the head size by 4, so each head has size 8. At initialization, it simply looks like this:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"140\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-85.png?fit=1024%2C90&amp;ssl=1\" alt=\"\" class=\"wp-image-2214\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-85.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-85.png?resize=300%2C26&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-85.png?resize=1024%2C90&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-85.png?resize=768%2C67&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-85.png?resize=1536%2C134&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">This means the dimensions are preserved, but we get to build and learn different communication channels in 
parallel\u00a0<\/li>\n\n\n\n<li class=\"\">This is similar to group convolutions, where several smaller convolutions are learned in parallel instead of a single large one.<\/li>\n\n\n\n<li class=\"\">Retraining with this gives yet another non-negligible improvement: the loss goes from 2.4 to 2.28.<\/li>\n\n\n\n<li class=\"\">This means that the characters have a lot to talk\/communicate about, and having multiple smaller communication channels helps more than having a single larger one.\u00a0<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\">Computation on top of Communication&nbsp;<\/h1>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" width=\"892\" height=\"1092\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-89.png?fit=836%2C1024&amp;ssl=1\" alt=\"\" class=\"wp-image-2219\" style=\"width:251px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-89.png?w=892&amp;ssl=1 892w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-89.png?resize=245%2C300&amp;ssl=1 245w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-89.png?resize=836%2C1024&amp;ssl=1 836w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-89.png?resize=768%2C940&amp;ssl=1 768w\" sizes=\"auto, (max-width: 892px) 100vw, 892px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In our <a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/19\/gpt-from-scratch-6-coding-self-attention\/\">previous post<\/a>, we explained Karpathy\u2019s interpretation of self attention as a communication mechanism.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Communication is great, but it is not enough.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Once communication has happened, you need a computation layer that allows each node to \u201cprocess\u201d what it has learned in the 
communication.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Indeed, up until now, we extracted the logits too quickly, immediately after the self-attention (communication) layer:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"103\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-86.png?fit=1024%2C66&amp;ssl=1\" alt=\"\" class=\"wp-image-2215\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-86.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-86.png?resize=300%2C19&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-86.png?resize=1024%2C66&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-86.png?resize=768%2C49&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-86.png?resize=1536%2C99&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Instead, as described in the \u201cattention is all you need\u201d paper, it is important to add at least a simple feed-forward layer (a linear layer followed by a non-linearity) just before extracting the logits. 
As simple as that:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"651\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-92.png?fit=1024%2C417&amp;ssl=1\" alt=\"\" class=\"wp-image-2221\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-92.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-92.png?resize=300%2C122&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-92.png?resize=1024%2C417&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-92.png?resize=768%2C312&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-92.png?resize=1536%2C625&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Note that the dimension we give is n_embd, so the output of the feed-forward layer keeps the same B,T,C shape.<\/li>\n\n\n\n<li class=\"\">It also means that this feed-forward layer works at the token level (it is applied independently to the embedding of each of the BxT tokens).<\/li>\n\n\n\n<li class=\"\">In our case, it means that you give each character some time to \u201cthink\u201d (or compute) after the communication that just happened in the self attention layer.<\/li>\n\n\n\n<li class=\"\">Retraining with this new trick brings the loss from 2.28 to 2.24. 
Not huge, but it still improved the situation.<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\">Interspersing Communication and Computation&nbsp;<\/h1>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" width=\"624\" height=\"982\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-87.png?fit=624%2C982&amp;ssl=1\" alt=\"\" class=\"wp-image-2216\" style=\"width:203px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-87.png?w=624&amp;ssl=1 624w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-87.png?resize=191%2C300&amp;ssl=1 191w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">As we just showed in the previous section, it is important to add a \u201ccomputation\u201d layer on top of the communication one.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One of the important aspects of the transformer architecture is to repeat this pair multiple times.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This can be done by creating a \u201cblock\u201d that applies multi-headed self attention followed by computation. 
Nothing too complex, mainly packaging things together, to get the box of the diagram above (except for the middle part, the \u201cadd &amp; norm\u201d part and some of the arrows there, which we\u2019ll explain in the next sections):<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"591\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-93.png?fit=1024%2C378&amp;ssl=1\" alt=\"\" class=\"wp-image-2223\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-93.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-93.png?resize=300%2C111&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-93.png?resize=1024%2C378&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-93.png?resize=768%2C284&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-93.png?resize=1536%2C567&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Then, in the overall model, you can initialize a bunch of those blocks. 
The number of blocks can be a parameter, n_layer, and you simply apply them in sequence just before extracting the logits.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"662\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-96.png?fit=1024%2C424&amp;ssl=1\" alt=\"\" class=\"wp-image-2225\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-96.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-96.png?resize=300%2C124&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-96.png?resize=1024%2C424&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-96.png?resize=768%2C318&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-96.png?resize=1536%2C636&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">It turns out, though, that retraining with this doesn\u2019t give good results.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The reason is that the network is getting quite deep by now and is starting to suffer from optimization issues (while running backpropagation), which is rather typical for deep neural nets.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So we need a couple more techniques\/tricks that we can borrow from the paper to solve this optimization issue: residual connections and layer norm.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Residual (or skip) connections<\/h1>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" width=\"894\" height=\"788\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-88.png?fit=894%2C788&amp;ssl=1\" alt=\"\" 
class=\"wp-image-2217\" style=\"width:184px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-88.png?w=894&amp;ssl=1 894w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-88.png?resize=300%2C264&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-88.png?resize=768%2C677&amp;ssl=1 768w\" sizes=\"auto, (max-width: 894px) 100vw, 894px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Residual connections were initially introduced in <a href=\"https:\/\/arxiv.org\/abs\/1512.03385\">the ResNet paper<\/a> and are represented by the arrow in the diagram.<\/li>\n\n\n\n<li class=\"\">The idea is that each time you have a set of complex computations, you also add on the side a direct path that skips all the computations and that you connect at the end with addition (hence the \u201cAdd\u201d part in the diagram):<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"424\" height=\"234\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-84.png?fit=424%2C234&amp;ssl=1\" alt=\"\" class=\"wp-image-2213\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-84.png?w=424&amp;ssl=1 424w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-84.png?resize=300%2C166&amp;ssl=1 300w\" sizes=\"auto, (max-width: 424px) 100vw, 424px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">The idea is that gradients can flow directly from the output back to the input\/beginning of the network, like on a gradient superhighway.<\/li>\n\n\n\n<li class=\"\">Remember that during backpropagation, an addition passes the incoming gradient unchanged to each of its inputs.<\/li>\n\n\n\n<li class=\"\">This trick thus keeps gradients from getting stuck or 
vanishing in the highly complex computations of deep neural nets.\u00a0<\/li>\n\n\n\n<li class=\"\">The complex computations on the side branches are typically initialized so that at the beginning they contribute almost nothing. Only the \u201cgradient highway\u201d allows the optimization to get started, and then the gradients start to kick in for the complex computations as well.<\/li>\n\n\n\n<li class=\"\">It turns out that it dramatically helps with the optimization issues we mentioned above.<\/li>\n\n\n\n<li class=\"\">Karpathy\u2019s implementation:\n<ul class=\"wp-block-list\">\n<li class=\"\">First we do this \u2018+\u2019 operation in the forward pass of our block (the code discussed in the previous section) like this:<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" width=\"1134\" height=\"392\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-97.png?fit=1024%2C354&amp;ssl=1\" alt=\"\" class=\"wp-image-2226\" style=\"width:249px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-97.png?w=1134&amp;ssl=1 1134w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-97.png?resize=300%2C104&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-97.png?resize=1024%2C354&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-97.png?resize=768%2C265&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Then, we add a projection layer. 
To be honest, I\u2019m not sure why it is needed and I\u2019ll need to dive deeper into the paper to understand it, but I\u2019ll show how Karpathy adds it to both the multi headed self attention and the feed forward modules.\u00a0<\/li>\n\n\n\n<li class=\"\">For the multi headed self attention layer, it just amounts to adding this:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"496\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-104.png?fit=1024%2C317&amp;ssl=1\" alt=\"\" class=\"wp-image-2233\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-104.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-104.png?resize=300%2C93&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-104.png?resize=1024%2C317&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-104.png?resize=768%2C238&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-104.png?resize=1536%2C476&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">And for the feed forward module, it just amounts to adding this (splitting the linear layer that way also comes from the paper):<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"691\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-98.png?fit=1024%2C442&amp;ssl=1\" alt=\"\" class=\"wp-image-2227\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-98.png?w=1600&amp;ssl=1 1600w, 
https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-98.png?resize=300%2C130&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-98.png?resize=1024%2C442&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-98.png?resize=768%2C332&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-98.png?resize=1536%2C663&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Amazingly, retraining the model with that trick brings the val loss down to 2.08 (from 2.28), which is a very serious improvement. And the text generated by the model now starts to look like proper English words\u00a0 \ud83e\udd2f<\/li>\n\n\n\n<li class=\"\">However, we start to observe that train loss is a bit better (lower) than val loss, which means we are starting to overfit slightly.\u00a0<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"474\" height=\"48\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-95.png?fit=474%2C48&amp;ssl=1\" alt=\"\" class=\"wp-image-2224\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-95.png?w=474&amp;ssl=1 474w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-95.png?resize=300%2C30&amp;ssl=1 300w\" sizes=\"auto, (max-width: 474px) 100vw, 474px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">This is where the next trick comes into play.<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\">Layer norm<\/h1>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" width=\"456\" height=\"940\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-102.png?fit=456%2C940&amp;ssl=1\" alt=\"\" class=\"wp-image-2231\" style=\"width:150px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-102.png?w=456&amp;ssl=1 456w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-102.png?resize=146%2C300&amp;ssl=1 146w\" sizes=\"auto, (max-width: 456px) 100vw, 456px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">In <a href=\"https:\/\/www.youtube.com\/watch?v=P6sfmUTpUmc\">another video<\/a>, Karpathy talked about batch normalization, which was critical for allowing better optimization of deep neural nets.\u00a0<\/li>\n\n\n\n<li class=\"\">Here we\u2019ll use a close relative called layer normalization (introduced in <a href=\"https:\/\/arxiv.org\/abs\/1607.06450\">that paper<\/a>), which is also implemented in PyTorch <a href=\"https:\/\/pytorch.org\/docs\/stable\/generated\/torch.nn.LayerNorm.html\">here<\/a>.<\/li>\n\n\n\n<li class=\"\">The basic idea is that each embedding vector inside our BxT batch is normalized across its features to have zero mean and unit variance.\u00a0<\/li>\n\n\n\n<li class=\"\">The main difference is that batch normalization normalizes across the batch, whereas layer norm normalizes within each individual example, which makes it independent of the batch size.<\/li>\n\n\n\n<li class=\"\">Karpathy uses the PyTorch implementation and applies it inside the block, just before the self attention and feed forward layers:\u00a0<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"672\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-107.png?fit=1024%2C430&amp;ssl=1\" alt=\"\" class=\"wp-image-2236\" 
srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-107.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-107.png?resize=300%2C126&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-107.png?resize=1024%2C430&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-107.png?resize=768%2C323&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-107.png?resize=1536%2C645&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">It is also important to add a layer norm after the last block in the overall model, just before the final linear layer.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"734\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-105.png?fit=1024%2C470&amp;ssl=1\" alt=\"\" class=\"wp-image-2234\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-105.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-105.png?resize=300%2C138&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-105.png?resize=1024%2C470&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-105.png?resize=768%2C352&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-105.png?resize=1536%2C705&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Note that the dimension of the layer norm is n_embd. 
This is because it will be applied directly to x, i.e. to all our vectors of size C in our BxTxC batch.<\/li>\n\n\n\n<li class=\"\">Note that initially LayerNorm gives each vector zero mean and unit variance, but because LayerNorm has trainable parameters (a scale and a shift), training may adapt them so that the output is normalized differently.<\/li>\n\n\n\n<li class=\"\">Note as well that in the original \u201cattention is all you need\u201d paper, the layer normalization happens <strong>after<\/strong> the transformations (i.e. after self attention and feed forward). Applying it before the transformations instead (the \u201cpre-norm\u201d formulation) is one of the very few changes to the original architecture that have become common in the years since.\u00a0\u00a0<\/li>\n\n\n\n<li class=\"\">Retraining the network now brings the loss down to\u00a0 2.06 (from 2.08): only a slight improvement, but we would expect layer norm to help much more as the network gets deeper (these techniques matter little for small networks but become critical for much larger ones).<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\">Scaling the model<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Now it is time to run the model training on a GPU.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The code is now exactly what was shown in the <a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/10\/24\/gpt-from-scratch-1-intro\/\">very first post of this series<\/a> and hopefully you now have a good understanding of each line. 
A small thing we didn\u2019t mention that shows up in the code is <a href=\"https:\/\/www.cs.toronto.edu\/~rsalakhu\/papers\/srivastava14a.pdf\">dropout<\/a>, added to reduce overfitting (which becomes critical when you want to scale a model).&nbsp; A nice interpretation of dropout: by randomly dropping nodes at each training step (forward and backward pass), you effectively train an ensemble of sub-networks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the hyperparameter config used by Karpathy:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1380\" height=\"434\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-101.png?fit=1024%2C322&amp;ssl=1\" alt=\"\" class=\"wp-image-2230\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-101.png?w=1380&amp;ssl=1 1380w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-101.png?resize=300%2C94&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-101.png?resize=1024%2C322&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-101.png?resize=768%2C242&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Some notes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">The batch size was increased from 8 to 64.<\/li>\n\n\n\n<li class=\"\">The context length (the size of the prompt) was increased from 8 to 256.<\/li>\n\n\n\n<li class=\"\">A much smaller learning rate is used (otherwise you overshoot with such deep neural nets).<\/li>\n\n\n\n<li class=\"\">n_embd is 384 and the number of heads is 6. Remember that the outputs of all heads are concatenated to get the full 384-dimensional embedding. 
So it means that each head has size 384\/6 = 64.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Now, all that is left is to run the simple training loop:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1600\" height=\"536\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-103.png?fit=1024%2C343&amp;ssl=1\" alt=\"\" class=\"wp-image-2232\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-103.png?w=1600&amp;ssl=1 1600w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-103.png?resize=300%2C101&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-103.png?resize=1024%2C343&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-103.png?resize=768%2C257&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-103.png?resize=1536%2C515&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\">The final result<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">After retraining, the loss gets as low as (drumroll): 1.48!! (from 2.06) \ud83e\udd2f\ud83e\udd2f by just scaling it, with the exact same code.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">On an A100 GPU, it took 15 minutes. 
On a CPU, training at this scale would be impractically slow.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The output is still nonsensical English, but it now looks like the original format and uses \u201cEnglish-sounding\u201d words:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1152\" height=\"494\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-100.png?fit=1024%2C439&amp;ssl=1\" alt=\"\" class=\"wp-image-2229\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-100.png?w=1152&amp;ssl=1 1152w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-100.png?resize=300%2C129&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-100.png?resize=1024%2C439&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-100.png?resize=768%2C329&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This is pretty amazing given that it is a character-level model, trained on just 1 million characters of Shakespeare for 15 minutes on a GPU.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">What was implemented: Decoder Only Transformer<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Implementing a GPT requires only part of the full transformer architecture. Specifically, this is what Karpathy implemented in his video:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img decoding=\"async\" width=\"1188\" height=\"1600\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-106.png?fit=760%2C1024&amp;ssl=1\" alt=\"\" class=\"wp-image-2235\" style=\"width:451px;height:auto\" 
srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-106.png?w=1188&amp;ssl=1 1188w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-106.png?resize=223%2C300&amp;ssl=1 223w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-106.png?resize=760%2C1024&amp;ssl=1 760w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-106.png?resize=768%2C1034&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-106.png?resize=1140%2C1536&amp;ssl=1 1140w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In <a href=\"https:\/\/philippeadjiman.com\/blog\/2024\/11\/28\/decoding-transformers-the-neural-nets-behind-llms-and-more\/\">another post<\/a> I explain how different flavors of the transformer architecture are used to build different kinds of models (GPT, BERT, BART). The left part of the diagram is the encoder and the right part is the decoder.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A few notes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Only the decoder part is needed because we\u2019re just generating text, without conditioning it on anything else, e.g. an input sentence in another language.<\/li>\n\n\n\n<li class=\"\">What concretely makes it a decoder is the triangular matrix we used for \u201chiding the future\u201d: during training, each position only sees the characters up to and including itself, so the characters it has to predict stay hidden.<\/li>\n\n\n\n<li class=\"\">In the original paper, they needed an encoder first because it is a machine translation paper. 
Thus the decoder needs to be conditioned on the input sentence to translate.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1092\" height=\"126\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-99.png?fit=1024%2C118&amp;ssl=1\" alt=\"\" class=\"wp-image-2228\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-99.png?w=1092&amp;ssl=1 1092w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-99.png?resize=300%2C35&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-99.png?resize=1024%2C118&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-99.png?resize=768%2C89&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Note that the encoder part (left part of the diagram) is actually identical to the block we implemented, except that it does not do \u201cmasked\u201d multi-head attention, because it is allowed to look at the whole input and let all the tokens communicate with each other.<\/li>\n\n\n\n<li class=\"\">The middle layer, fed from the left, is called the cross attention layer. 
This is because the queries of that component come from our batch input (our BxTxC batch), but the keys and the values come from an external source, the encoder on the left.<\/li>\n\n\n\n<li class=\"\">So concretely, the decoder\u2019s generation is conditioned not only on the past input to the decoder, but also on the full output of the encoder.<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\">From GPT to Gemini \/ ChatGPT<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">What we have built is a <strong>P<\/strong>retrained (the <strong>P<\/strong> in GPT) <strong>Base Model<\/strong>. It is a powerful next-token prediction engine, but it is not yet a helpful assistant. If you ask a Base Model &#8220;How do I make an omelet?&#8221;, it might reply with &#8220;&#8230;and why eggs are delicious,&#8221; simply because it is trying to complete the sentence rather than answer a request.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To turn this Base Model into a ChatBot assistant (like Gemini or ChatGPT), industry leaders apply a process called <strong>Alignment<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\"><strong>Supervised Fine-Tuning (SFT):<\/strong> We retrain the model on high-quality &#8220;Question -> Answer&#8221; pairs written by humans. This teaches the model the <em>format<\/em> of a helpful assistant.<\/li>\n\n\n\n<li class=\"\"><strong>Reinforcement Learning from Human Feedback (RLHF):<\/strong> We generate multiple answers, ask humans to rank them (Best to Worst), and use those rankings to tune the model. 
This teaches the model <em>preferences<\/em>: prioritizing safety, helpfulness, and honesty.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">More details can be found in a <a href=\"https:\/\/storage.googleapis.com\/deepmind-media\/gemini\/gemini_v1_5_report.pdf\">technical report<\/a> on Gemini 1.5 from Google DeepMind and <a href=\"https:\/\/arxiv.org\/pdf\/2203.02155\">this paper<\/a> from OpenAI.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It is quite magical and unexpected how GPT models can become useful assistants through these methods, applied to a rather limited amount of human feedback data.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Where to go from here<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">I hope that by now you have a deep understanding of how GPT is built. But true mastery comes from doing. So try reproducing the code on your own. You can look at Karpathy\u2019s code from time to time, but without copy-pasting; otherwise it defeats the purpose.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And if you\u2019re brave enough, continue with yet another masterpiece from Andrej: the 4-hour <a href=\"https:\/\/www.youtube.com\/watch?v=l8pRSuU81PU\">Let&#8217;s reproduce GPT-2<\/a> video.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That\u2019s it, I hope you enjoyed this series and found it useful.<br>And keep learning and enjoying\/mastering whatever you do.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Self Attention is the heart of Transformers, the T of GPT. 
But there are few additional critical parts to the transformer architecture that actually made it shine.<\/p>\n","protected":false},"author":1,"featured_media":2237,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[18,46,14],"tags":[48,47,49],"class_list":["post-2212","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-deep-learning","category-gpt","category-pytorch","tag-deep-learning","tag-gpt","tag-pytorch"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/gpt-7.png?fit=510%2C284&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/2212","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=2212"}],"version-history":[{"count":2,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/2212\/revisions"}],"predecessor-version":[{"id":2239,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/2212\/revisions\/2239"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/2237"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=2212"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadj
iman.com\/blog\/wp-json\/wp\/v2\/categories?post=2212"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=2212"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}