{"id":2114,"date":"2025-11-14T13:17:25","date_gmt":"2025-11-14T13:17:25","guid":{"rendered":"https:\/\/philippeadjiman.com\/blog\/?p=2114"},"modified":"2025-11-22T18:35:32","modified_gmt":"2025-11-22T18:35:32","slug":"gpt-from-scratch-4-the-mathematical-trick-behind-self-attention","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2025\/11\/14\/gpt-from-scratch-4-the-mathematical-trick-behind-self-attention\/","title":{"rendered":"GPT From Scratch #4: The Mathematical Trick Behind Self Attention"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Welcome to Part 4 of our GPT From Scratch series, inspired by Karpathy\u2019s&nbsp; <a href=\"https:\/\/www.youtube.com\/watch?v=kCc8FmEb1nY\">Let&#8217;s build GPT: from scratch, in code, spelled out.<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Links to previous and upcoming posts of the series:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/10\/24\/gpt-from-scratch-1-intro\/\">Part 1: Intro<\/a><\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/06\/gpt-from-scratch-2-the-training-set\/\">Part 2: The Training set<\/a>&nbsp;<\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/10\/gpt-from-scratch-3-the-bigram-model\/\">Part 3: The Bigram model<\/a>&nbsp;<\/li>\n\n\n\n<li class=\"\">Part 4: The Mathematical Trick behind Self Attention (this post)<\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/15\/gpt-from-scratch-5-positional-encodings\/\">Part 5: Positional Encodings<\/a><\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/19\/gpt-from-scratch-6-coding-self-attention\/\">Part 6: Coding Self-Attention<\/a><\/li>\n\n\n\n<li class=\"\"><a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/22\/gpt-from-scratch-7-building-a-gpt\/\">Part 7: Building a GPT<\/a><\/li>\n<\/ul>\n\n\n\n<p 
class=\"wp-block-paragraph\">In <a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/06\/gpt-from-scratch-2-the-training-set\/\">Part 2<\/a> we explained how to create a training set from Shakespeare&#8217;s works. <a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/10\/gpt-from-scratch-3-the-bigram-model\/\">Part 3<\/a> then introduced a basic bigram model, predicting the next character based solely on its predecessor.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, this approach is fundamentally limited. To achieve the capabilities of models like GPT, we need to look further back than a single character (or word, or token) and consider the broader context of the preceding sequence. This interaction is enabled by self-attention, a mechanism underpinned by a very elegant mathematical trick for efficient context awareness.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s dive in.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">The simplest kind of communication in our batch<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Back to our good old batch of size BxTxC (as explained in <a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/10\/gpt-from-scratch-3-the-bigram-model\/\">Part 3<\/a>, section \u201cthe logits\u201d).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a reminder, each line there holds <strong>T<\/strong> consecutive characters from the training set (a.k.a. an example), each character is associated with its embedding (an array of numbers of fixed size <strong>C<\/strong>), and there are <strong>B<\/strong> such examples in the batch.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Our goal in order to illustrate communication between characters is the following<\/strong>:&nbsp; for each character at index i (in an example of the batch), compute the average of the embeddings of that character and all the characters before it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One might wonder why a plain average is interesting, but we\u2019ll see 
later that it will be the basis for building the powerful self attention mechanism.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So, let\u2019s illustrate with an example what we mean by doing an average of the embeddings.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We first generate a random batch of size 4,8,2 (i.e. 4 examples, each with 8 characters, and each character with an embedding of size 2).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s look at the first example:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"488\" height=\"368\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-36.png?resize=488%2C368&#038;ssl=1\" alt=\"\" class=\"wp-image-2117\" style=\"width:357px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-36.png?w=488&amp;ssl=1 488w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-36.png?resize=300%2C226&amp;ssl=1 300w\" sizes=\"auto, (max-width: 488px) 100vw, 488px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Those numbers represent the 8 characters (one character per line) of the first example, and for each, their embeddings (2 numbers).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So our goal is to produce a tensor such that for each line we get, column by column, the average of the numbers from the first line down to that line.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In our example, we\u2019re looking to get the tensor below. Look for example at the second line: the first number there is 0.3507, which is the average of the first number of the first two lines in the original tensor above (0.0783 and 0.6231). 
Same for all the other numbers in the resulting tensor, they are the average of all the previous ones.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"464\" height=\"300\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-37.png?resize=464%2C300&#038;ssl=1\" alt=\"\" class=\"wp-image-2118\" style=\"width:370px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-37.png?w=464&amp;ssl=1 464w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-37.png?resize=300%2C194&amp;ssl=1 300w\" sizes=\"auto, (max-width: 464px) 100vw, 464px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now the challenge is: how to produce that in a very efficient way so it can scale.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">The brute force way<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">It is always useful in every problem to start with the brute force solution as a baseline.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is Karpathy\u2019s code for the brute force way of solving this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"690\" height=\"232\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-39.png?resize=690%2C232&#038;ssl=1\" alt=\"\" class=\"wp-image-2120\" style=\"width:482px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-39.png?w=690&amp;ssl=1 690w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-39.png?resize=300%2C101&amp;ssl=1 300w\" sizes=\"auto, (max-width: 690px) 100vw, 690px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A few notes on that code:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">xprev is 
referring to the b<sup>th <\/sup>example of the batch, and to all the characters from position 0 up to and including t. Each of those has an embedding of size C, so xprev is of dimension (t+1,C), with t counted from 0.<\/li>\n\n\n\n<li class=\"\">Then, torch.mean(xprev,0) averages over the first dimension (the characters), separately for each channel of the embedding (in our case, there are 2).<\/li>\n\n\n\n<li class=\"\">Bow stands for bag of words (a common term when just averaging stuff out).<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">And sure enough, it works and produces the right result. The problem is that handling every example and every position one at a time is highly inefficient and won\u2019t scale, either at training or at inference, when talking about huge models like GPTs.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">The trick: a (very) clever matrix multiplication<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Now let\u2019s describe the trick that enabled scaling self attention, and that arguably is at the core of the generative AI revolution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, a small reminder on how matrix multiplication works.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"440\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-55.png?resize=1024%2C440&#038;ssl=1\" alt=\"\" class=\"wp-image-2138\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-55.png?resize=1024%2C440&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-55.png?resize=300%2C129&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-55.png?resize=768%2C330&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-55.png?w=1118&amp;ssl=1 1118w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p 
class=\"wp-block-paragraph\">Each element in <strong>c<\/strong> is the dot product of the corresponding row of <strong>a<\/strong> and column of <strong>b<\/strong>.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">E.g., to obtain the element of <strong>c<\/strong> at the 2nd row and 1st column, you just take the dot product of the 2nd row in <strong>a<\/strong> (which is [4,6,5]) and the 1st column in <strong>b<\/strong> (which is [3,4,4]). Thus [4,6,5] . [3,4,4] = 4*3+6*4+5*4 = 12+24+20 = 56, which is indeed what we see in <strong>c<\/strong> at 2nd row and 1st column.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now, in the example above, if instead of multiplying <strong>b<\/strong> by a random matrix, we multiply it by a lower triangular matrix of ones, something magic happens:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"414\" height=\"496\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-56.png?resize=414%2C496&#038;ssl=1\" alt=\"\" class=\"wp-image-2139\" style=\"width:202px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-56.png?w=414&amp;ssl=1 414w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-56.png?resize=250%2C300&amp;ssl=1 250w\" sizes=\"auto, (max-width: 414px) 100vw, 414px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Can you see what happened? It turns out that now each element in c is the sum of the elements of b in the same column, from the first row down to the current row!&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, the 7 in c, which is at the 1st column, 2nd row, corresponds to the sum of the elements of the 1st column in b, up to the second row: 3+4 = 7. 
This works for every element in c.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Why is it magical, and how does it help with what we are trying to achieve?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because if you add one last ingredient and divide each row of the triangular matrix by the sum of that row (so the i-th row holds i equal weights of 1\/i followed by zeros), the multiplication yields exactly the average we wanted! See it below with the associated code:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"424\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-57.png?resize=1024%2C424&#038;ssl=1\" alt=\"\" class=\"wp-image-2140\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-57.png?resize=1024%2C424&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-57.png?resize=300%2C124&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-57.png?resize=768%2C318&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-57.png?resize=1536%2C635&amp;ssl=1 1536w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-57.png?w=2048&amp;ssl=1 2048w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to understand what the line<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"56\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-58.png?resize=1024%2C56&#038;ssl=1\" alt=\"\" class=\"wp-image-2141\" style=\"width:491px;height:auto\" 
srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-58.png?resize=1024%2C56&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-58.png?resize=300%2C16&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-58.png?resize=768%2C42&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-58.png?w=1276&amp;ssl=1 1276w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">is doing, you can read my blog post on <a href=\"https:\/\/www.philippeadjiman.com\/blog\/2023\/07\/16\/deep-learning-gymnastics-tensor-broadcasting\/\">tensor broadcasting<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So far these were two-dimensional matrices. Let\u2019s get back to our initial problem, which involves a tensor of dimension BxTxC, and see how the trick carries over to those dimensions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">We\u2019re multiplying a TxT matrix with a BxTxC tensor.<\/li>\n\n\n\n<li class=\"\">PyTorch broadcasts the TxT matrix along a new leading B dimension, so the operation becomes a multiplication between BxTxT and BxTxC: for each batch element it performs a TxT times TxC multiplication (in parallel), which yields a BxTxC result. 
In code, it gives:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"948\" height=\"118\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-59.png?resize=948%2C118&#038;ssl=1\" alt=\"\" class=\"wp-image-2142\" style=\"width:545px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-59.png?w=948&amp;ssl=1 948w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-59.png?resize=300%2C37&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-59.png?resize=768%2C96&amp;ssl=1 768w\" sizes=\"auto, (max-width: 948px) 100vw, 948px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Does it really work?&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To check it, we can compare the result of xbow2 with xbow that we obtained in the brute force way section:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"852\" height=\"82\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-60.png?resize=852%2C82&#038;ssl=1\" alt=\"\" class=\"wp-image-2143\" style=\"width:253px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-60.png?w=852&amp;ssl=1 852w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-60.png?resize=300%2C29&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-60.png?resize=768%2C74&amp;ssl=1 768w\" sizes=\"auto, (max-width: 852px) 100vw, 852px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Sure enough, it magically works and we indeed obtain the exact same output in both methods \ud83e\udd2f .<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">The difference? The matrix multiplication does the whole computation as a single batched operation, which is far more efficient and is thus a game changer given the scale of what it takes to build GPT.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">A softmax version<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">We saw that the key part of the trick is to produce this normalized triangular tensor (the wei in the code above).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"264\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-61.png?resize=1024%2C264&#038;ssl=1\" alt=\"\" class=\"wp-image-2144\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-61.png?resize=1024%2C264&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-61.png?resize=300%2C77&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-61.png?resize=768%2C198&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-61.png?w=1032&amp;ssl=1 1032w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Turns out there is an equivalent way to produce it using the softmax function!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This works because softmax is a normalization: it exponentiates every element of a row and then divides each one by the sum of the row. If the allowed positions hold 0 and the forbidden ones hold -inf, exponentiation turns them into 1 and 0 respectively, and dividing by the row sum gives back the same averaging weights.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Why use softmax instead of what we did in the previous example? 
To be honest, I\u2019m not sure, but I assume it is a matter of elegance and interpretation, because, as Karpathy explains, the triangular matrix before applying softmax looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"311\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-62.png?resize=1024%2C311&#038;ssl=1\" alt=\"\" class=\"wp-image-2145\" style=\"width:509px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-62.png?resize=1024%2C311&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-62.png?resize=300%2C91&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-62.png?resize=768%2C233&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-62.png?w=1172&amp;ssl=1 1172w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">And if you interpret this matrix as the \u201ccommunication allowance\u201d for each element in the batch (because this is what it will end up being through the matrix multiplication =&gt; each line of the batch contains 8 training examples), then it says that each element is only allowed to communicate with itself and with past elements, while communication with future elements is forbidden (because when we generate characters, we\u2019ll only have access to the previous ones, not the upcoming ones).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And, magically enough, it just works, and produces the same result as in the brute force way.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">What was achieved and what\u2019s next?<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">We started with our standard batch of examples of dimension (B,T,C), and our goal was to 
produce another batch (same dimension) in which each character at index i (in an example of the batch) is replaced by the average of the embeddings of that character and all the characters before it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We discovered it could be done in a crazy efficient way by simply multiplying the tensor of the batch by a normalized triangular matrix.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is what it looks like for one example of our batch:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"551\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-63.png?resize=1024%2C551&#038;ssl=1\" alt=\"\" class=\"wp-image-2146\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-63.png?resize=1024%2C551&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-63.png?resize=300%2C161&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-63.png?resize=768%2C413&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/image-63.png?w=1524&amp;ssl=1 1524w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">And this works exactly the same if we apply it to the whole batch of examples: you start with the tensor (B,T,C) and you end up with a tensor of the same dimension, but this time with all the averaged examples.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Why does it matter?&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\">It is an extremely efficient operation, and that\u2019s what will allow scaling GPT to huge amounts of data<\/li>\n\n\n\n<li class=\"\">We\u2019ll replace the simple averaging by a very smart aggregation of all the 
previous characters (the context)<\/li>\n\n\n\n<li class=\"\">The operation will be exactly the same: a simple matrix multiplication. We\u2019ll just replace the weights of the triangular matrix with those of the smart aggregation.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">We have thus illustrated the foundation of what lies behind GPT.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From there, we\u2019ll just introduce an additional fundamental concept, called \u201cpositional encoding\u201d, and then we\u2019ll implement the famous self-attention mechanism, which is the backbone of GPT.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So let&#8217;s do it and dive into <a href=\"https:\/\/philippeadjiman.com\/blog\/2025\/11\/15\/gpt-from-scratch-5-positional-encodings\/\">Part 5: Positional Encodings<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One simple mathematical trick. The most clever matrix multiplication of the gen AI revolution. What enabled ultra fast self attention. 
<\/p>\n","protected":false},"author":1,"featured_media":2148,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[18,46,13],"tags":[48,47,49],"class_list":["post-2114","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-deep-learning","category-gpt","category-python","tag-deep-learning","tag-gpt","tag-pytorch"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2025\/11\/gpt-4.png?fit=510%2C287&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/2114","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=2114"}],"version-history":[{"count":6,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/2114\/revisions"}],"predecessor-version":[{"id":2253,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/2114\/revisions\/2253"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/2148"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=2114"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=2114"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=2114"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}