
Decoding Transformers: The Neural Nets Behind LLMs and More

When Karpathy was asked by Lex Fridman “What is the most beautiful or surprising idea in deep learning or AI”, the answer was quite obvious: The Transformer Architecture.

The T in GPT.

When Google researchers introduced the idea back in 2017 with the famous “Attention is all you need” paper, it looked like just “yet another cool idea for machine translation tasks”. In hindsight, it turned out to be the backbone of the current LLM/AI revolution.

What you’ll find in this post:

  • A bit of history of transformers and the larger neural net architecture they belong to: Encoder-Decoder
  • An explanation of why transformers were a game changer
  • A deep (yet intuitive) dive into the core component on which transformers are based: self-attention
  • The important neural nets that emerged from Transformers
  • What are arguably the 10 most important lines of code behind the LLM revolution

Let’s get started.

Encoder-Decoder

A challenge for deep neural networks in the early 2010s was handling “sequence to sequence” problems, where both the input and output are of unknown length. The best known example is machine translation: if you need to translate from e.g. French to English, both the input sentence and its (output) translation are of unknown length.

It is in that context that the Encoder-Decoder neural network architecture emerged, pioneered by the paper Sequence to Sequence Learning with Neural Networks (the first author is Ilya Sutskever, former OpenAI chief scientist), which proposed a general (domain-independent) method to tackle sequence to sequence problems.

The core idea was to take a sequence as input, encode it (via an “encoder”) into a fixed size vector representation and then decode it (via a “decoder”) into another sequence (of possibly a different size). 

The main evolution of the encoder-decoder neural network lies in what was used inside the encoders and decoders. It started with RNN, and then was revolutionized by Transformers.

RNN based vs. Transformer based Encoder-Decoder

RNN based encoders/decoders: In the initial paper mentioned above, the encoder and decoder components were handled by Recurrent Neural Networks (RNNs, more particularly LSTMs). The RNN encoder processes each word (or token) at a time (sequentially), encodes its state in a vector, and passes it on, until the whole sentence is encoded. The RNN decoder does the opposite: takes the encoded vector representation of the sentence, and decodes it one token at a time, using the current state and what has been decoded so far.

Transformer based encoders/decoders: Then came the famous Attention is all you need paper from Google, which suggested replacing the RNNs in the encoder-decoder with a new architecture called the Transformer, which relies on what is called the “attention mechanism” (see section below).

Why were Transformers a game changer?

One could wonder what made the transformer architecture such a game changer (compared to RNNs), taking encoder-decoder from a very nice state-of-the-art method in NLP to what enabled the LLM revolution. Here are a few of the central reasons:

  • Parallelization: First, the self-attention mechanism allows all tokens to be processed in parallel (unlike RNNs, which are sequential and recurrent at their core), which significantly sped up training and inference.
  • Self-attention: The longer the input sequence (or prompt) is, the more RNNs struggle to capture what is essential in it due to issues like vanishing/exploding gradients. The self-attention mechanism overcomes this issue by being able to focus on the important part of the sequence given the context, regardless of its length (see next section for a deep dive into self-attention).
  • A very effective general purpose computer: As Karpathy explains here (highly recommended short watch), Transformers are remarkably general purpose compared to all previous neural net architectures. You can feed them videos, images, speech or text and they just gobble it up. Also, it is not only about self-attention: every piece and detail of the architecture (the residual connections, the layer normalization and more) contributes to a machine that is not only very powerful and expressive, but most importantly optimizable with our very basic but scalable methods like back-propagation/gradient descent.

Deep Diving into the intuition behind self-attention

As we just said, transformers are not only about self attention. Yet, this new paradigm played a big part in Transformers’ success, and marked the beginning of the revolution in NLP tasks first, and in LLMs next. 

We’ll use this great video to understand the core intuitions and components behind self attention.

First, let’s recall that in neural networks, words are represented by vectors of numbers, usually called embeddings (see the relevant section in my post here). Two words are similar if their embeddings point in the same direction (which corresponds to the dot product of their vectors being high). But the same word can have very different meanings depending on the context of a sentence.

The main purpose of the self attention mechanism is to adapt the vectors/embeddings of the words based on the context of the sentence/prompt. 

If we take an example (from the video above), of a sentence “I swam across the river to get to the other bank” and draw the matrix of the dot product of their embedding (before applying self attention), we would get e.g. something like that:  

All words on the diagonal obviously have the highest score, since it represents the similarity of a word with itself. The word “bank” (traditionally related to the institution), however, might have nothing to do with the word “river” and get a low score. After applying the self-attention mechanism, one would expect “bank” and “river” to have a strong correlation and thus a high dot product.

So how to weigh word vectors based on their context? The way the self attention mechanism proposed to do that is rather simple and can be summarized in the diagram below (also from the video) that we’ll explain step by step. 

As mentioned above, the purpose of self-attention is to transform each word vector of the sentence/prompt into a version that is weighted by the other word vectors in the sentence/prompt (a.k.a. the context). The diagram below (also from the video) illustrates exactly that for the word vector v2, and shows how to transform it into a vector y2 that is weighted based on the whole sentence/prompt (which contains only 3 words in that example: v1, v2 and v3).

Check the diagram and the step by step explanations below.

From the top left of the diagram, this is what happens:

  • First we take the vector v2 (which is a word embedding of dimension (1×50) in that example) and compute its dot product with every vector of the prompt, itself included. It gives 3 numbers (scalars): s21 (which is v2 . v1), s22 (which is v2 . v2) and s23 (which is v2 . v3).
  • Those numbers (or scores) represent the respective affinity of v2 with each of the words of the context. But since those numbers are not scaled, we just normalize them using softmax, thus giving 3 new numbers that sum to 1: the weights w21, w22, w23.
  • And now, to get our y2, which is the “weighted version of v2 based on the context”, we just compute the weighted sum y2 = w21 v1 + w22 v2 + w23 v3. Et voilà. You get y2, the weighted version of the initial word vector v2.
  • Pay attention to the dimensions: we started with a word embedding of dimension (1,50), and we properly end up with a contextualized version of it (y2) with the same dimension (1,50).

Doing this for each word vector of the sentence is essentially what the self-attention mechanism is all about. Note that these operations can be made massively parallel using matrix multiplication.

The result is that now each embedding captures the relation with any other word in the sentence/prompt, regardless of the length/distance between two words, and it does it in a massively parallel and effective way.
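The steps above can be sketched in a few lines of PyTorch. This is only an illustration under assumptions: the three word vectors are random placeholders (not real embeddings), and the names V, S, W and Y are mine, not from the video.

```python
import torch

torch.manual_seed(0)

# Three hypothetical word embeddings v1, v2, v3 of dimension 50, stacked as rows.
V = torch.randn(3, 50)

# Scores: s_ij = v_i . v_j for every pair of words, all at once.
S = V @ V.T                      # shape (3, 3)

# Weights: softmax over each row, so each row sums to 1.
W = torch.softmax(S, dim=1)      # shape (3, 3)

# Contextualized vectors: y_i = sum_j w_ij * v_j.
Y = W @ V                        # shape (3, 50)

# The second row of Y is y2; rebuild it step by step for comparison.
s2 = torch.stack([V[1] @ V[0], V[1] @ V[1], V[1] @ V[2]])
w2 = torch.softmax(s2, dim=0)
y2 = w2[0] * V[0] + w2[1] * V[1] + w2[2] * V[2]

print(Y.shape)
print(torch.allclose(Y[1], y2, atol=1e-4))
```

The two matrix multiplications compute all the scores and all the weighted sums at once, which is where the parallelism comes from.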

Now, if you’re into ML, you probably wonder: where are the learnable weights?? Indeed, you don’t need to train any model to apply the above mechanism, so how do we learn an optimal way to contextualize each word vector?

This is where the magic happens. In the steps we described above, you can think of v2 as a query looking for similar words that could be matched as keys . So by simply introducing matrices of the right dimensions (here, (50×50)), you’re essentially creating learnable weights (optimizable through backpropagation at training time), that have the following meaning:

  • MQ represents what v2 is “looking for”
  • MK represents, for each word vector (v1, v2 or v3), what it “contains”, or represents, or has to offer
  • MV, or the values, is used essentially as a way to communicate the result of the matching between queries and keys.

Notice the dimensions: the MQ, MK and MV matrices are just injected in between the steps of the simple weighting scheme that we described in the first diagram, and do not affect the input and output dimensions (a 1×50 vector multiplied by a 50×50 matrix still gives a 1×50 vector).

The main difference is that the MQ, MK and MV matrices are now powerful learnable, optimizable weights of the model.
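Here is a minimal sketch of that idea, assuming a toy sentence of 3 random word vectors of dimension 50. The matrices are randomly initialized placeholders, the variable names are mine, and the scaling of the initial values is just a convenience to keep the scores small.

```python
import torch

torch.manual_seed(0)
d = 50

# A toy sentence: 3 hypothetical word vectors of dimension 50, stacked as rows.
V = torch.randn(3, d)

# Learnable (50x50) matrices, randomly initialized.
MQ = torch.nn.Parameter(torch.randn(d, d) * d ** -0.5)
MK = torch.nn.Parameter(torch.randn(d, d) * d ** -0.5)
MV = torch.nn.Parameter(torch.randn(d, d) * d ** -0.5)

Q = V @ MQ      # what each word is "looking for"   (3, 50)
K = V @ MK      # what each word "contains"         (3, 50)
Vals = V @ MV   # what each word communicates       (3, 50)

W = torch.softmax(Q @ K.T, dim=1)   # (3, 3) attention weights
Y = W @ Vals                        # (3, 50) contextualized vectors

# Any scalar loss computed from Y sends gradients back to MQ, MK and MV.
Y.sum().backward()
print(Y.shape, MQ.grad.shape, MK.grad.shape, MV.grad.shape)
```

The input and output shapes are unchanged, but there are now three matrices that gradient descent can tune.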

Scaling embeddings

The original paper described something called scaled dot-product attention. We’ll just give an intuition (again from the great video) of what that scaling term is.

Suppose you have an embedding vector being just (2,2,2). The magnitude of the vector is √(2² + 2² + 2²) = √12 = 2√3 ≈ 3.46.

If you divide this by the square root of the dimension of the vector (which is 3), i.e. you multiply by 1/√3, then you just get 2, which is the average of the components.

Why is this important? Because the embeddings will usually be of high dimension  (e.g. 300) and thus the dot products (going out of the GetScores component in the diagram above) can end up huge, which would pretty much annihilate the gradient when going through the softmax function.

That’s pretty much it: the scaling term is just a clever trick to keep the softmax weights in a reasonable range and not create issues at training time.
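To see the effect of the scaling term, here is a small sketch assuming random vectors of dimension 300 with unit-variance entries (the query, the keys and their sizes are hypothetical, chosen only for illustration).

```python
import torch

torch.manual_seed(0)
d = 300

# One hypothetical query and 5 hypothetical keys, with unit-variance entries.
q = torch.randn(d)
K = torch.randn(5, d)

scores = K @ q                 # raw dot products, spread on the order of sqrt(d)
scaled = scores / d ** 0.5     # scaled dot products, spread on the order of 1

print(torch.softmax(scores, dim=0))  # typically close to one-hot
print(torch.softmax(scaled, dim=0))  # weights spread more evenly over the keys
```

When the softmax output is nearly one-hot, its gradient is close to zero almost everywhere, which is what the division by √d is meant to avoid.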

The formula that captures it all

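The whole (scaled dot-product) attention mechanism can be summarized by a simple formula:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

(here d_k is the dimension of the query and key vectors).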

It simply represents the matrix multiplications we described above, between queries (Q), keys (K) and values (V), going through the softmax function, after being scaled with the scaling term explained above.

This formula captures the essence of self-attention, which is itself at the heart of the transformers that sparked the LLM revolution.

So yes, this formula is really at the heart of the LLM revolution and beyond.

BERT vs. BART vs. GPT: All flavors of Transformers

While GPT is the most famous usage of transformers which powered the revolution around LLMs and chat bots, some other famous models were also groundbreaking additions to the NLP world: BART and BERT. 

Can you find what is common between the three? Yes, it is the T , which stands for Transformer. 

Below is a comparative table between the three models.

Now a question for you: can you guess which of the three models generated that table?

You probably guessed it: it is GPT. And I promise: it was the only generated part of this blog post 😀

The most important 10 lines of code of the LLMs revolution?

My favorite series of learning videos in the past few years is by far Karpathy’s series on Neural Networks (my blog post series Deep Learning Gymnastics is directly inspired by it). In one of the videos, Andrej builds GPT from scratch.

In a future series of posts, I’ll deep dive into its core components, but just as a teaser, look at Karpathy’s concise and beautiful implementation of the self-attention mechanism we described above.

The forward layer is just 10 lines of code, and implements exactly the attention formula that we described above:
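For readers who want something runnable, here is a minimal sketch of a single causal self-attention head in PyTorch, written in the spirit of that implementation. It is not Karpathy’s code verbatim: the class name, the constructor arguments (n_embd, head_size, block_size) and the toy sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: a token may only attend to itself and earlier tokens.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                       # (B, T, head_size)
        q = self.query(x)                                     # (B, T, head_size)
        v = self.value(x)                                     # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                          # attention weights
        return wei @ v                                        # (B, T, head_size)

torch.manual_seed(0)
head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(4, 8, 32)   # hypothetical batch: 4 sequences of 8 tokens
out = head(x)
print(out.shape)            # torch.Size([4, 8, 16])
```

The forward method is the attention formula line by line: queries times keys, scaling, softmax, then the weighted sum of values. The mask is the only extra ingredient, and it is what makes the head suitable for next-token prediction.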

Of course, those 10 lines of code in a silo, without transformers, back-propagation, gradient descent, and tons of GPUs cannot do much. 

But since self-attention can be considered one of the most important core elements of the transformer breakthrough, if we were to decide which are the most important (or influential) 10 lines of code behind the LLM revolution, those would probably be among the best candidates.

Hope you enjoyed this post and see you soon for more.

Deep Learning Gymnastics #4: Master Your (LLM) Cross Entropy

Welcome to the 4th episode of our Deep Learning Gymnastics series.

Today, we’ll use all the skills learned in our previous lessons: tensor broadcasting, indexing and reshaping, to revisit one of the most famous and important loss functions of supervised machine learning (and deep learning): cross entropy. 

LLMs? Yes, they are also based on it. We’ll actually get inspired (again) by Andrej Karpathy’s videos around building an LLM from scratch to illustrate how to manipulate the cross entropy function.

A short refresher on Cross Entropy

Entropy in general and Cross-entropy in particular are fascinating concepts that lie at the foundation of information theory. If you want to dive a bit into it and understand the links between the logistic regression cost function, Log Loss, Cross Entropy and Negative Log Likelihood and are not afraid of some maths formulas, you can read one of my old posts here.

But for today we’ll focus on the essence. Cross-entropy in ML is most often used as a cost function that measures the difference between a probability vector (one probability per predicted class) and a one-hot encoded label. Typically:

Here, O is the raw output of the neural network, often called logits. Then, before we apply the cross entropy formula, we typically pass those logits through the softmax function so it becomes a probability vector P, where each probability is the prediction of each of your multiple classes. And L is the one hot encoded vector representing the label. 

So in our example, we can see that the cross-entropy is simply -log(0.6), i.e. ~0.51 with the natural logarithm that ML libraries such as PyTorch use (~0.22 if you take the logarithm in base 10). As you can see, the higher the probability for the correct class, the closer to 0 the cost will be (when the probability is 1 for the correct class, the cost is -log(1), which is 0). The lower the probability for the correct class, the bigger the cost (tending to infinity when the probability tends to 0). Note that the figure above is inspired by this short great video.
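The pipeline described above (logits, then softmax, then cross entropy against a one-hot label) can be sketched as follows. The logits are hypothetical values chosen for illustration, not the ones from the figure.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits O for 3 classes, and a label saying class 1 is the correct one.
O = torch.tensor([1.0, 2.2, 0.3])
L = torch.tensor([0.0, 1.0, 0.0])       # one-hot encoded label

P = torch.softmax(O, dim=0)             # probability vector
ce = -(L * P.log()).sum()               # cross entropy: -sum_i L_i * log(P_i)

# Same value from PyTorch's built-in, which takes raw logits and the class index.
ce_builtin = F.cross_entropy(O.unsqueeze(0), torch.tensor([1]))

print(P, ce.item(), ce_builtin.item())
```

Because L is one-hot, the sum collapses to a single term: minus the log of the probability given to the correct class.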

Cross Entropy in LLMs

The core capability of Large Language Models (LLMs) is to predict the next word (or more generally, token) given a list of previous words/tokens. In a future blog post, we’ll describe precisely how the training set is built, but for the sake of this post, let’s illustrate a batch of the training set of an LLM with a picture and explain it:

In episode #2 of our series, we explained what a batch is, and that those numbers represent the index of a token in the vocabulary. Assuming our LLM is predicting the next token (out of 27 possible) given a context of max 3 tokens, this is how to read the figure above:

  • The batch on the left represents 8 lines of three tokens each.
  • Each token of the batch points to a tensor of size (27,1) representing the prediction of what the next token should be (one logit for each of the 27 possible tokens). So the batch tensor shape is (8,3,27).
  • For instance, the (27,1) tensor in the figure represents the prediction for each of the 27 tokens, given the sequence of the three tokens 7,16,18.
  • In that example, what is e.g. the logit prediction for the next token to be token 1? just look at index 1 of that vector. Here you go: ~0.55 (which seems rather high compared to others)
  • The tensor on the right contains the labels (the actual next token from the training set). It thus has the same shape as the batch, except that it does not contain the prediction logits tensors, so just (8,3).

How do we calculate the cross entropy of that single logits vector (in the figure) against the actual label?

Simple, we just follow the diagram we gave above: we pass that vector through the softmax function, which will give us the (27,1) tensor P representing probabilities. Then we have L = (0,1,0,0,0,0,…,0) , and we just apply the cross entropy formula.

The Gymnastic Exercise

In the previous section, we explained how to compute the Cross Entropy for one single entry of the (8,3) batch of our example. But how to compute it for the whole batch? To do so, we need to calculate the exact same thing, but for the 8*3 = 24 possible cases.

Did you recognize the vector we had in the previous section’s figure? Yes, that’s the 7th one from the bottom.

So the gymnastic exercise is to take the initial batch with prediction tensor of shape (8,3,27), stretch it out to the 8*3 = 24 prediction logits (which is a (24,27) tensor as in the picture above), do the same for the label tensor, and from there, compute in parallel the cross entropy of the 24 logits/label pairs, and return their mean as the result.

Solving it in PyTorch

First we need to generate all the input tensors:

  • X, the batch with prediction, which is a (8,3,27) tensor
  • Y, the labels, which is a (8,3) tensor.

The code below will produce the same numbers as the ones shown in the second figure of this post.

import torch
torch.manual_seed(18)

# creates the batch (note: `high` is exclusive in torch.randint, so indexes fall in [0, 25])
random_tensor = torch.randint(low=0, high=26, size=(8,3))

# create random logits for each index in the vocabulary
L = torch.randn((27, 27))
# creating the labels (same remark: labels fall in [0, 25])
Y = torch.randint(low=0, high=26, size=(8,3))
# creating our batch (8,3,27). C.f https://www.philippeadjiman.com/blog/2023/12/23/deep-learning-gymnastics-tensor-indexing/ 
X = L[random_tensor] 

To fully understand this code, please refer to the post #2 of this series about tensor indexing.

Note that in that other post, we created embeddings of size 4 as an illustration, while here, we already have the final logits (of size 27, which is the vocabulary size). In a fully implemented LLM, those logits will only come up after many steps (stay tuned for a future blog post about it).

Now, we’d like to use PyTorch’s cross_entropy function. Reading the doc, we see that it expects the class logits to be in the second dimension of the input, which corresponds exactly to what we described in the figure above: stretching out the input batch. The same goes for the labels, which need to be a one dimensional tensor of class indexes. We actually learned how to do that with views in post #3 of this series around tensor reshaping. So here you go:

#Reshaping before using cross_entropy. C.f https://www.philippeadjiman.com/blog/2024/02/03/deep-learning-gymnastics-tensor-reshaping/
B,T,C = X.shape 
logits = X.view(B*T,C)
labels = Y.view(B*T)

With that, we’ll exactly obtain what we illustrated in our previous figure.

Now that we got our inputs in the proper shape, we can compute our cross entropy with the function:

import torch.nn.functional as F
F.cross_entropy(logits , labels)

Which gives 3.7759 . Yay! we computed the cross entropy of our LLM batch 💪

Calculating Cross Entropy “manually”

It turns out that once we have the logits and labels in the proper shape like we just did with views, calculating cross entropy without using PyTorch’s function is actually quite simple, and is useful to understand what happens behind the scenes.

Here is a compact and elegant way to do it (credit again to the code from Karpathy’s videos):

counts = logits.exp()
prob = counts / counts.sum(1,keepdims=True)
- prob[torch.arange(24),labels].log().mean() 

Sure enough, it returns the exact same result (3.7759) as when using the PyTorch function 🤩 .

So what’s going on in that code?

The first two lines transform the logits into probabilities using the softmax function, by simply first applying the exponential function and then dividing each exponential by the sum of the exponentials of its row. Wondering what that keepdims=True means? Please read post #1 of this series around tensor broadcasting.

Now the last line is interesting.

Remember our initial figure. Let’s look again how cross entropy is calculated:

Given that L is a one-hot encoded vector, there will be only one 1, and thus the cross entropy is just about plucking out the right index in P and taking minus its log. In the figure, the 1 is at the second place, so in terms of index it is 1 (as indexes start at 0), and thus the cross entropy is simply -log(P[1]).

Because in our code the labels are already numbers between 0 and 26 (the vocabulary has 27 tokens), we can use them as indexes, extract the right number in each of the 24 vectors of prob, log them all, and minus the mean is simply the cross entropy of the whole batch.

So, simply:

- prob[torch.arange(24),labels].log().mean() 

Magical, no?
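If the indexing trick feels too magical, the sketch below checks it against the explicit one-hot formula. It assumes a random (24,27) logits tensor and random labels, which are placeholders and not the tensors built earlier in this post.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(24, 27)             # hypothetical stretched-out batch
labels = torch.randint(0, 27, (24,))     # hypothetical labels in [0, 26]

prob = torch.softmax(logits, dim=1)

# One-hot route: -sum_i L_i * log(P_i) for each row, then the mean.
one_hot = F.one_hot(labels, num_classes=27).float()
loss_one_hot = -(one_hot * prob.log()).sum(dim=1).mean()

# Indexing route: pluck out the probability of the correct class in each row.
loss_indexing = -prob[torch.arange(24), labels].log().mean()

print(loss_one_hot.item(), loss_indexing.item(), F.cross_entropy(logits, labels).item())
```

The three numbers should agree up to floating point error; the indexing route simply skips all the multiplications by zero.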

If you’re wondering why it is still worth using the built-in cross entropy function, watch this great explanation by Andrej Karpathy.

What about TensorFlow?

As is traditional in the posts of this series, let’s also look at the equivalent code in TensorFlow.

As for PyTorch, for all the gymnastic preparation (broadcasting, indexing and reshaping), please refer to the post #1 , #2 and #3 of our Deep Learning Gymnastic series .

Regarding the cross entropy function in TensorFlow, we can use e.g. sparse_softmax_cross_entropy_with_logits. Note how explicit the name is: it tells you that you need to pass logits, and that it will then apply softmax and cross entropy.

If you’re using Keras, you can also use the SparseCategoricalCrossentropy . Note that to do so, you first need to instantiate the function , explicitly saying we’re using logits, and then apply it to the reshaped logits and labels.

Find the full code below, illustrating both cross entropy functions.

import tensorflow as tf
tf.random.set_seed(18)

# Create a random batch of shape (8,3) with indexes between 0 and 26
random_tensor = tf.random.uniform(shape=(8,3), minval=0, maxval=26, dtype=tf.int32)

# create random logits for each index in the vocabulary
L = tf.random.uniform((27,27), dtype=tf.float32)

#creating the labels
Y = tf.random.uniform(shape=(8,3), minval=0, maxval=26, dtype=tf.int32)

# creating our batch (8,3,27). C.f https://www.philippeadjiman.com/blog/2023/12/23/deep-learning-gymnastics-tensor-indexing/ 
X = tf.gather(L,random_tensor)

#Reshaping before using cross_entropy. C.f https://www.philippeadjiman.com/blog/2024/02/03/deep-learning-gymnastics-tensor-reshaping/
B,T,C = X.shape
logits = tf.reshape( X , [B*T,C])
labels =  tf.reshape( Y , [B*T,1]) # 24 numbers (each one between 0 and 26)

#Calling cross entropy using sparse_softmax_cross_entropy_with_logits
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels[:, 0],logits=logits)
print(tf.reduce_mean(loss))

#Calling cross entropy using Keras' SparseCategoricalCrossentropy
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(ce(labels,logits))

That’s it for today.

Hope you’re feeling in better shape with your tensors 🤸. Until our next episode.


Deep Learning Gymnastics #3: Tensor (re)Shaping

Welcome to the 3rd episode of the Deep Learning Gymnastics series. By now you should already start to be in shape. That’s good, because today we’ll talk about how to shape (or more precisely reshape) tensors, a basic yet critical operation that is needed in any advanced enough deep learning model implementation.

To best understand this post, it is highly recommended to read the previous gymnastic exercise around tensor indexing as we’ll build on top of it.

MLP Motivating example

To illustrate the power of tensor (re-)shaping, we’ll continue to get inspired by Andrej Karpathy’s makemore series, where he implements from scratch the famous paper “A neural probabilistic language model”. As Andrej says, it is not the first paper that proposed a neural network approach to predict the next token in a sequence, but it is one that is very often cited and is a really nice write-up.

The gymnastic exercise will consist of implementing the bottom part of the figure below, which describes the architecture of the neural network (or Multi Layer Perceptron, MLP for short) defined in the paper. First we’ll explain the diagram a bit so the goal of the exercise will be crystal clear.

Let’s assume that the 3 green dots at the bottom are the last three characters of a word and that we’re trying to predict (or generate) the next character. The first layer is nothing else than the embeddings of each of the three characters. It turns out it is exactly the output of the example we introduced in our previous gymnastic exercise around tensor indexing. We ended up with a tensor of shape (8,3,4), the one on the right in the figure below. As a reminder, an embedding here is simply a one dimensional tensor (of size 4 in our case).

So in our example, the first layer of the neural net is nothing else than the 3 embeddings of each character, as seen below:

So the first example of the batch is associated with those three embeddings:

Now, in order to pass this to the next layer, we need to concatenate those three embeddings of size 4 each into a single long one of size 12.

So here is the gymnastic exercise: take our (8,3,4) tensor, and for each of the 8 lines of the batch, transform the 3 embeddings of size 4 into one of size 12 (which is just the concatenation of the 3). We should thus end up with a tensor of shape (8,12).

The basics of PyTorch Views

Let’s introduce the concept that will allow us to solve the gymnastic exercise as a breeze: PyTorch views. The easiest way to understand PyTorch views is through a simple example.

Let’s create a one dimensional tensor of elements from 0 to 17.

The exact same underlying storage can be viewed as (2,9) tensor.

Or a (9,2) one:

Or a (3,2,3) one:

As you can see, as long as the product of the dimensions equals the number of elements in the underlying storage (18 in our case), we can view (or reshape) the tensor.

Beyond being very convenient, the big advantage of this is that it is blazing fast, because no new data is created: the underlying storage stays the same, and only some metadata about the tensor is modified.

Bonus: we can also use -1 to infer a dimension automatically. E.g., if the underlying storage is 18 numbers, then invoking the view function with shape (-1,9) will deduce that the first dimension has to be 2:
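The view examples of this section can be sketched as follows (a minimal illustration; the variable names a and b are mine):

```python
import torch

a = torch.arange(18)          # one dimensional tensor: 0, 1, ..., 17

print(a.view(2, 9))           # 2 rows of 9 elements
print(a.view(9, 2))           # 9 rows of 2 elements
print(a.view(3, 2, 3))        # 3 blocks of 2 rows of 3 elements
print(a.view(-1, 9).shape)    # torch.Size([2, 9]): the -1 is inferred as 2

# All views share the same underlying storage: no data is copied.
b = a.view(2, 9)
print(b.data_ptr() == a.data_ptr())   # True
```

Since the storage is shared, writing into b also changes a, which is worth keeping in mind when modifying a view in place.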

Solving our gymnastic exercise with views

Now that we understand views, let’s get back to our gymnastic exercise: we have a tensor of shape (8,3,4) and we need to transform it into a tensor of shape (8,12). First, let’s reproduce the embedded batch of shape (8,3,4) (see our previous gymnastic exercise to understand the code below):

import torch
torch.manual_seed(18)

# Create a random batch of shape (8,3) of vocabulary indexes
# (note: `high` is exclusive, so the drawn indexes fall between 0 and 25)
random_tensor = torch.randint(low=0, high=26, size=(8,3))

# Create a random embedding matrix of shape (27,4): 
# one embedding for each of the 27 indexes elements
embeddings = torch.randn(size=(27, 4))

#Creating the embedded batch
embedded_batch = embeddings[random_tensor]

Get ready, and let’s solve our exercise. As in last post, it will be a short yet sharp (tensor) movement:

input_layer = embedded_batch.view(8,12)

Yes, that’s it, just one line. By doing this, each of the 8 lines of the batch will, extremely efficiently and in parallel, have its 3 associated embeddings of size 4 concatenated together, and we thus end up with a tensor of shape (8,12).

Let’s actually validate it on the first example of the batch:

We obtain an embedding of size 12 as expected, which is nothing else than the concatenation of the 3 embeddings of size 4 that we showed at the end of our motivating example above. Baam.
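One way to run that check yourself is sketched below. It rebuilds the embedded batch with the same code as above and compares the first line of the reshaped tensor with an explicit concatenation (the name first_example is mine).

```python
import torch

torch.manual_seed(18)
random_tensor = torch.randint(low=0, high=26, size=(8, 3))
embeddings = torch.randn(size=(27, 4))
embedded_batch = embeddings[random_tensor]        # (8, 3, 4)
input_layer = embedded_batch.view(8, 12)          # (8, 12)

# Explicit concatenation of the 3 embeddings of the first example.
first_example = torch.cat([embedded_batch[0, 0],
                           embedded_batch[0, 1],
                           embedded_batch[0, 2]])

print(input_layer[0])
print(torch.equal(input_layer[0], first_example))  # True
```

The comparison is exact because view does not compute anything: it only reinterprets the same 12 numbers as one row.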

Oh, let’s not forget that we created this to pass it as input to a layer of a neural net. So let’s do it: we create the initial random weights and biases of the layer, pass our (reshaped) batch through it and apply tanh on top of it, in other words:

W1 = torch.randn((12, 100)) # weights
b1 = torch.randn(100) # biases
h = torch.tanh(embedded_batch.view(-1, 12) @ W1 + b1) # (8,12) @ (12,100) => (8,100)

PyTorch view vs. reshape ?

There is another function in PyTorch called reshape that seems to achieve the exact same goal as view. So what’s the difference?

Typically, view is extremely efficient as it won’t move any underlying data and will just modify the metadata (shape and strides) of the tensor. But it comes with a constraint: the memory layout has to be compatible with the requested shape (which is always the case for a contiguous tensor), otherwise calling view will raise an error (see example below).

If you’re not sure if your tensor is contiguous, you can either use the contiguous function before calling view (it will make the tensor contiguous), or simply use reshape which returns a view if the shapes are compatible, and copies otherwise.

You might ask why anyone would use view over reshape? I asked myself the same question, and I assume that since view is guaranteed to be efficient, seeing it in the code gives any reader the guarantee that there is nothing to optimize there. As for the one writing the code, if there are some cases where there would be an inefficient copy, then at least when using view it will fail explicitly and make you aware of the potential efficiency bottleneck.

Below is an example of code illustrating where view wouldn’t work:

import torch

# Create a non-contiguous tensor
tensor = torch.tensor([[1, 2, 3], [4, 5, 6]]).t()  # Transpose to make it non-contiguous

# Reshape works successfully
reshaped_tensor = tensor.reshape(6)
print(reshaped_tensor)  # Output: tensor([1, 4, 2, 5, 3, 6])

# View fails with an error
try:
    viewed_tensor = tensor.view(6)
except RuntimeError as e:
    print(e)  # Prints the error message: view size is not compatible with input tensor's size and stride ...

TensorFlow reshape

Obviously, TensorFlow also supports the same powerful reshape operation. In TensorFlow, you don’t have the explicit view function, but reshape handles non-contiguous tensors gracefully, similar to PyTorch’s reshape.

Below is the full TensorFlow code equivalent to what we illustrated above in PyTorch.

import tensorflow as tf
tf.random.set_seed(18)

# Create a random batch of shape (8,3) with indexes between 0 and 26
random_tensor = tf.random.uniform(shape=(8,3), minval=0, maxval=26, dtype=tf.int32)

# Create a random embedding matrix of shape (27,4): one embedding for each of the 27 indexes elements
embeddings = tf.random.uniform((27,4), dtype=tf.float32)

# Solving the gymnastic exercise: creating an embedded batch with the tf.gather function
embedded_batch = tf.gather(embeddings,random_tensor)

# Validating the results
print(random_tensor)
print(embeddings)
print(embedded_batch.shape) # (8,3,4) which is the expected dimension
print(embedded_batch[0,0])

W1 =  tf.random.normal([12, 100])
b1 =  tf.random.normal([100])
h = tf.math.tanh(tf.linalg.matmul(tf.reshape(embedded_batch, [8, 12]) , W1) + b1)

Another example of usage: CNNs

Reshaping is a very useful operation in various cases in Deep Learning. Another frequent usage/example is in the context of image manipulation in convolutional neural networks (CNN), where you need for instance to connect the output of a convolutional layer to a fully connected layer:

import torch

# An output from a convolutional layer
conv_output = torch.randn(10, 8, 5, 5)  # (batch size, channels, height, width)

# Flatten for a fully connected layer
flattened = conv_output.view(-1, 8 * 5 * 5)  # (batch size, flattened features)

print(flattened.shape)  # Output: torch.Size([10, 200])

Alright, that’s it for today. Hope you’re now in better shape, and see you next time for other gymnastic exercises 🤸.

References

  • Part 2 of the amazing makemore series by Andrej Karpathy (which inspired this post).
  • Great blog post on the internal representation of tensors, and his very cool stride visualizer (it is from a PyTorch research engineer, so it is about PyTorch 🙂 but still useful general concepts )

Deep Learning Gymnastics #2: Tensor Indexing

Welcome to the second episode of the Deep Learning Gymnastics series. Hope you’re in good shape. Get warmed up. We start.

Today, we’ll talk about a simple yet important and powerful aspect of tensor manipulations: tensor indexing.

Batches and embeddings motivating example

At the heart of any modern deep learning model, you’ll most often deal with batches and embeddings.

Batches? Below is a toy example of what a batch from a training set could look like:

The numbers represent indexes in a vocabulary of size N, representing any kind of entity. These could be letters or words in a language model, movies in a recommender system, segments on a map in an ETA model, or ads in an ad network.

For the example, let’s assume those are letters (indexed between 0 and 26 for all letters + a special end character) as in the great Andrej Karpathy “makemore” series.

Embeddings? For each element of that vocabulary, you’ve learned a representation of its (latent) characteristics, represented by a vector of size k. This vector is often called an embedding. Continuing with our example above, let’s consider an embedding of size 4 for each element (in our case, English letters) of the vocabulary, i.e. a tensor of dimension (27, 4)

Here is the gymnastic exercise: you have a toy batch containing 8 examples of size 3, where each number in an example is taken from a vocabulary of size 27. You also have an embedding matrix of dimension (27,4), where each row is an embedding vector of size 4, one for each of the 27 elements of the vocabulary. For each element of the batch, you need to fetch its embedding vector, to end up with a batch which is a tensor of dimension (8,3,4). This is illustrated below

Tensor indexing, the PyTorch way

Let’s first generate the two input tensors (the same as the two inputs on the left of the picture above ) :

import torch
torch.manual_seed(18)

# Create a random batch of shape (8,3)
# with indexes between 0 and 26 (high is exclusive)
random_tensor = torch.randint(low=0, high=27, size=(8,3))

# Create a random embedding matrix of shape (27,4):
# one vector for each of the 27 vocabulary elements
embeddings = torch.randn(size=(27, 4))

And now, let’s solve the gymnastic exercise. Take a deep breath, prepare the movement, and here you go:

embedded_batch = embeddings[random_tensor]

Yes, that’s right. PyTorch allows you to pass a full tensor as the index. And it works like magic.

You can check the shape of the result, and observe it is indeed (8,3,4), as expected (see the picture above). Indeed, (8,3) is the shape of the initial batch, and for each element of it, we get the proper embedding vector of shape (4,).

Let’s validate that the first element of the result (embedded_batch[0,0] ) corresponds to the embedding vector of the index of the first element of the batch. This corresponds to this part of the picture:

And sure enough, it worked 🎉 :
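A minimal sketch of that check, assuming the same seed and shapes as in the setup above (the names first_index and matches are only introduced here for illustration):

```python
import torch

torch.manual_seed(18)

# Setup mirroring the one above: a batch of indexes and an embedding matrix.
# high is exclusive in torch.randint, so high=27 yields indexes 0..26.
random_tensor = torch.randint(low=0, high=27, size=(8, 3))
embeddings = torch.randn(size=(27, 4))

# Index the embedding matrix with the whole batch at once
embedded_batch = embeddings[random_tensor]

# The first vector of the result must be the row of `embeddings`
# selected by the first index of the batch
first_index = random_tensor[0, 0]
print(first_index)
print(embeddings[first_index])
print(embedded_batch[0, 0])

matches = torch.equal(embedded_batch[0, 0], embeddings[first_index])
print(matches)  # True
```

The two printed vectors are identical, which is exactly what the picture illustrates for the first element of the batch.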

What about TensorFlow?

In TensorFlow, it is of course possible to achieve the same result, but this is done a bit differently.

The tf.gather function

Instead of injecting the batch directly as a (tensor) index in the embedding matrix, in TensorFlow we have to use a very powerful function: tf.gather .

You can read the details in the documentation, but essentially, the equivalent of the following PyTorch indexing:

embedded_batch = embeddings[random_tensor] 

in TensorFlow would be:

embedded_batch = tf.gather(embeddings,random_tensor)

And that’s all.

Full equivalent TensorFlow code below :

import tensorflow as tf
tf.random.set_seed(18)

# Create a random batch of shape (8,3) with indexes between 0 and 26 (maxval is exclusive)
random_tensor = tf.random.uniform(shape=(8,3), minval=0, maxval=27, dtype=tf.int32)

# Create a random embedding matrix of shape (27,4): one vector for each of the 27 vocabulary elements
embeddings = tf.random.uniform((27,4), dtype=tf.float32)

# Solving the gymnastic exercise: creating an embedded batch with the tf.gather function
embedded_batch = tf.gather(embeddings,random_tensor)

# Validating the results
print(random_tensor)
print(embeddings)
print(embedded_batch.shape) # (8,3,4) which is the expected dimension
print(embedded_batch[0,0])

Hope you enjoyed the gymnastic lesson. Take some rest. Until the next one 🤸 .

Deep Learning Gymnastics #1: Tensor Broadcasting

At the heart of the implementation of modern deep learning models (yes, including LLMs) always lie some subtle and critical techniques and tricks that are important to know and master. Tensor broadcasting is one of them.

Official documentation exists (e.g. for PyTorch or TensorFlow), but in this post we’ll try to introduce the topic in a simple and intuitive way, using a motivating example inspired by the amazing series of videos from Andrej Karpathy on language modeling.

Example of broadcasting in action

Suppose you have a tensor of size 3 x 4 (a tensor with 2 dimensions can also simply be called a matrix), where each row represents a set of counts over 4 options you are trying to choose from (the higher the count, the more likely it is the right option). Your goal is to efficiently transform those counts into probabilities. On a concrete example, you want to go from left to right here:

The matrix on the left contains our raw counts, and the one on the right is what we’d like to get. So we’d like to find an efficient (vectorized) way to sum each row separately and divide each count by the sum of its row. We first need to create a matrix of shape 3×1 which contains the sum of each row:
\[ \begin{bmatrix} 150 \\ 50 \\ 100 \end{bmatrix} \]
The question then is whether the following operation is allowed:

(for the sake of the explanation, we’re assuming that none of the rows’ sum is equal to 0)

This is where broadcasting comes into play. When presented with such an operation, broadcasting will find a way to adapt the second matrix to be of the same dimension as the first one, by duplicating its single column, and then perform an efficient element-wise division. As follows:
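To make the duplication concrete, here is a small PyTorch sketch. The counts are hypothetical values chosen only so that the rows sum to 150, 50 and 100; they are not the ones from the original picture. It compares an explicit expansion of the 3×1 column with the implicit broadcast:

```python
import torch

# Illustrative counts (hypothetical values; rows sum to 150, 50 and 100)
counts = torch.tensor([[30., 60., 45., 15.],
                       [10., 20., 5., 15.],
                       [25., 25., 40., 10.]])

# The 3x1 matrix holding the sum of each row
row_sums = torch.tensor([[150.], [50.], [100.]])

# What broadcasting does conceptually: repeat the single column 4 times
expanded = row_sums.expand(3, 4)  # shape (3, 4), no data is copied

manual = counts / expanded        # explicit element-wise division
broadcast = counts / row_sums     # same division, expansion done implicitly

same = torch.equal(manual, broadcast)
print(same)                       # True
print(broadcast.sum(dim=1))       # each row sums to 1 (up to float rounding)
```

In practice you never write the expand call yourself; it is shown here only to make visible what the division does behind the scenes.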

Are your tensors broadcastable?

Whether you’re doing broadcasting using NumPy, PyTorch or TensorFlow, in order to know if two tensors are “broadcastable”, you just need to align the shapes (or dimensions) of your two tensors from right to left, and for each dimension, check if they are either equal, or one of them is 1, or one of them does not exist. If it is the case for all dimensions, then the two tensors are broadcastable. What is the shape of the resulting tensor? Just take the max size along each dimension.

Let’s try it on our example. The shape of the first tensor is [3,4] and the second one (before broadcasting) is [3,1] . So let’s align the shapes and go from right to left and compare each dimension:

This method also works for tensors of any shape. Let’s check a couple of other examples:

Example 1: Two tensors with shapes A.shape = [4,3,2] and B.shape = [3,1]
Example 2: Two tensors with shapes A.shape = [4,3,2] and B.shape = [3,1,2]

Which of the two examples are broadcastable tensors and which are not? Let’s start with Example 1:

All good, you can broadcast those two tensors. Note that for the leftmost dimension, since it does not exist for the second tensor, it just acts as if it were a 1.

What about Example 2?

Because the leftmost dimension exists in both tensors, the two sizes are not equal, and neither of them is 1, the conditions for the tensors to be broadcastable are not met.
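The rule above can be sketched in a few lines of plain Python. The helper broadcast_shape is a hypothetical name introduced here for illustration (it is not a library function), and it assumes all dimension sizes are at least 1:

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or None if they are not broadcastable."""
    result = []
    # Walk both shapes from right to left; a missing dimension counts as 1
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or a == 1 or b == 1:
            # Compatible: the resulting size is the max of the two
            result.append(max(a, b))
        else:
            return None
    return tuple(reversed(result))

print(broadcast_shape([3, 4], [3, 1]))        # (3, 4)
print(broadcast_shape([4, 3, 2], [3, 1]))     # (4, 3, 2)
print(broadcast_shape([4, 3, 2], [3, 1, 2]))  # None
```

The three calls correspond to the normalization example and to Examples 1 and 2 above.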

Tensor broadcasting in PyTorch and TensorFlow

Let’s see broadcasting in action with PyTorch on an example of a tensor of counts of shape 3×3, which we want to normalize in the same way as in our previous example:

import torch

N = torch.tensor([[10, 20, 10], 
                  [20, 5 , 25], 
                  [10, 60, 30]], dtype=torch.int32) 
# calculate sum along rows 
row_sums = N.sum(dim=1, keepdim=True)
# normalize each row 
N_normalized = N / row_sums

The parameter dim=1 says that we want to sum along dimension 1, i.e. add up the values of each row. As for the keepdim parameter, wait for the next section to see why we used it and why it is critical.
Let’s now print N, row_sums and N_normalized respectively:

As we can see, the broadcast operation worked as expected as the sum on each row of the results is indeed equal to 1.

Let’s see what the code looks like in TensorFlow:

import tensorflow as tf

N = tf.constant([
    [10, 20, 10],
    [20, 5, 25],
    [10, 60, 30]
], dtype=tf.int32)

# calculate sum along rows 
row_sums = tf.reduce_sum(N, axis=1, keepdims=True)
# normalize each row 
N_normalized = N / row_sums

As you can see, the code is rather similar, apart from a few differences: you need the tf.reduce_sum function rather than calling sum directly on the tensor, and the keepdim parameter is now plural (keepdims) 😅. But printing N_normalized returns the same result as with the PyTorch code.

When things go wrong

So, what was this keepdim=True (or keepdims=True in tensorflow) all about?

If you run e.g. the exact same pytorch code as above but without keepdim=True, this is what you’ll get when printing N, row_sums and N_normalized .

As you can see, N_normalized is completely messed up and the rows don’t sum to 1 anymore 🤦
But how did that happen? What did broadcasting actually do?

First, was the operation broadcastable? Well, now you know how to check it from the previous section. N is of shape [3,3], and the trick is that row_sums is now of shape [3], because PyTorch dropped the summed dimension and returned a 1-dimensional tensor. Using the method explained before, you can see that the tensors are still broadcastable: the missing leftmost dimension acts as a 1, so row_sums is treated as a row vector of shape [1,3].

And practically, what happens now is that row_sums is laid out as a row and copied down the rows, instead of being a column copied across the columns! In other words, during the operation N / row_sums , this is what happened to row_sums in the process:

So as you can see, in that case, the keepdim parameter was critical to keep row_sums with the same number of dimensions as the initial tensor, and thus to have the right shape for proper broadcasting.
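The difference can be reproduced side by side with the same N as above. This is a minimal sketch; the variable names sums_col, sums_flat, good and bad are introduced here only for illustration:

```python
import torch

N = torch.tensor([[10, 20, 10],
                  [20, 5, 25],
                  [10, 60, 30]], dtype=torch.int32)

# With keepdim=True the sums stay a column: shape (3, 1)
sums_col = N.sum(dim=1, keepdim=True)
# Without it the summed dimension is dropped: shape (3,)
sums_flat = N.sum(dim=1)

good = N / sums_col   # row i is divided by the sum of row i
bad = N / sums_flat   # column j is divided by the sum of row j

print(sums_col.shape, sums_flat.shape)  # torch.Size([3, 1]) torch.Size([3])
print(good.sum(dim=1))  # every row sums to 1
print(bad.sum(dim=1))   # rows no longer sum to 1

# The silent bug is equivalent to dividing by the sums laid out as a row
same_as_row = torch.equal(bad, N / sums_flat.view(1, 3))
print(same_as_row)      # True
```

No error is raised in the second case, which is what makes this bug easy to miss: both divisions are valid broadcasts, only one of them is the normalization we wanted.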

ChatBots can help, but only when you know what you’re doing

This statement holds for any code related generation coming from chat bots like Bard or ChatGPT.

Specifically on that one, depending on the version of the chatbot you’re using and how you phrase your prompt, sometimes you’ll get the right code (using keepdims=True) and sometimes not. But now, for any broadcasting-related question, you won’t get fooled anymore 🤩.

Conclusion

Broadcasting is a critical technique that every deep learning developer needs to master in order to implement state-of-the-art models efficiently and correctly. And you’d better understand the nuances and subtleties we discussed (such as the keepdims parameter), otherwise you might silently introduce bugs that will render your whole model useless.