GPT From Scratch #2: The Training Set

Welcome to Part 2 of our GPT From Scratch series, inspired by Karpathy’s Let’s build GPT: from scratch, in code, spelled out.


In this post, we’ll focus on building the training set for our model. Although this part might sound less exciting than building the model itself, it is still a crucial (and in my opinion very clever) part of the whole process.

Framing the prediction task

Before building a training set, you need to frame the prediction task properly. The core capability of chatbots is predicting the next word (or, more generally, the next token) given the previous words of the sentence. For example: “Today I put my coat on because it is very ___”.

To us, it is obvious that the next predicted word is very likely “cold”. But the goal is to build a (very) strong model that is able to do that. Why does it matter? Because once you have that capability, building a very strong chatbot “just” requires an additional (tricky and important) step: teaching your model to answer questions using reinforcement learning. Let’s keep that for a separate post or series. For now, we’ll focus on building that core capability.

Getting training data

Often, getting proper labeled training data is a very costly process. If, for example, you build a model classifying images of dogs vs. cats, or cancerous vs. benign tissue, you’ll need a lot of human-labeled data.

In our case, since we want to train a model to guess the next word, we simply need to get a lot of well-formed sentences, hide the last word, and train the model to guess it! Where do we find a lot of well-formed sentences? Wikipedia, any book, blogs (just like this one), you name it. Finding data is not really the issue here.

To illustrate how to build the model, Karpathy takes the complete works of Shakespeare as the basis. And for the sake of simplicity, the model is built to predict the next character (not the next word), but the principles are exactly the same, and the end result is equivalent: by predicting the next character over and over again, you end up generating words and sentences.

One Sentence, Multiple Training Examples

Let’s consider the sentence “Today I put my coat on because it is very cold”. We could treat it as one training example, as described above (i.e. give the whole sentence except the last word as the context, and tell the model that the last word is what it needs to guess), but we can actually do more. Consider this:

With only one sentence, we end up not with 1 training example, but with 10!
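The trick can be sketched at the word level with a few lines of Python. This is only an illustration of the idea (the real model works on tokens, not on a whitespace split):

```python
# One sentence yields many (context, target) training pairs:
# every prefix of the sentence becomes a context, and the word
# that follows it becomes the target to guess.
sentence = "Today I put my coat on because it is very cold".split()

examples = []
for i in range(1, len(sentence)):
    context = sentence[:i]   # everything seen so far
    target = sentence[i]     # the word the model must guess
    examples.append((context, target))

for context, target in examples:
    print(" ".join(context), "->", target)

print(len(examples))  # 10 examples from a single 11-word sentence
```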

That’s a very clever trick. Now let’s see how we can use it to actually implement a training set on real data.

Encoding Shakespeare

In his video, Karpathy works with all Shakespeare literature, which is one long text file that can be found here.

Let’s print the size of this text and the first few sentences:
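A minimal way to load and inspect the file, assuming the raw URL from Karpathy’s char-rnn repository (the same file he downloads in his notebook):

```python
import urllib.request

# Download the tiny Shakespeare file used in the video.
url = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")
text = urllib.request.urlopen(url).read().decode("utf-8")

print("length of dataset in characters:", len(text))
print(text[:250])  # the opening lines ("First Citizen: ...")
```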

Now, the very first step is to encode this text into tokens, a process called tokenization. For the purpose of illustrating how to build a toy GPT imitating Shakespeare, Karpathy uses the simplest tokenization possible: encoding each character as a number.

First, we can see that there are 65 distinct characters in that file (see below), so the tokenization process will consist of transforming each of those characters (including space) into an equivalent number between 0 and 64.
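Computing the vocabulary is essentially one line, sketched here on the downloaded file (URL assumed to be the one from Karpathy’s char-rnn repository):

```python
import urllib.request

url = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")
text = urllib.request.urlopen(url).read().decode("utf-8")

# The vocabulary is simply every distinct character, sorted.
chars = sorted(set(text))
vocab_size = len(chars)
print("".join(chars))
print(vocab_size)  # 65
```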

From there, Karpathy provides the simplest working tokenizer, via the encode and decode lambda functions below.
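A sketch of such a tokenizer, along the lines of Karpathy’s stoi/itos lookup tables. Here a short sample string stands in for the full file, so the codes won’t span the full 0–64 range they would on the real vocabulary:

```python
# Build the character <-> integer lookup tables and the
# encode/decode lambdas over the (sample) vocabulary.
text = "Today I put my coat on because it is very cold"
chars = sorted(set(text))

stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> string

encode = lambda s: [stoi[c] for c in s]               # text -> list of ints
decode = lambda nums: "".join(itos[n] for n in nums)  # list of ints -> text

codes = encode("my coat")
print(codes)
print(decode(codes))  # "my coat"
```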

You can see how each character is encoded as a number between 0 and 64 (the index of that character in the sorted list above), and how decoding maps the numbers back to the original sentence.

As we’ll work with PyTorch to build GPT, we’ll first need to convert all that textual data into a long tensor containing those encoded numbers. If you don’t know what a tensor is, for now you can simply think of it as a generalization of a vector to N dimensions. In our case, the tensor can be seen as just a long vector of integers.
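A minimal sketch of that conversion, again using a sample string rather than the full file:

```python
import torch

# Encode a sample text and wrap it in a 1-D tensor of integers,
# exactly as we would with the full Shakespeare file.
text = "Today I put my coat on because it is very cold"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
encode = lambda s: [stoi[c] for c in s]

data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)  # one integer per character
print(data[:10])
```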

As a reminder, this is a very naive encoding (more precisely, tokenization); in practice, tokenization is much more sophisticated (another great video from Karpathy details it here), but it will, surprisingly, be more than enough for illustrating the implementation of GPT from scratch.

Generating (a lot of) training data

Now that our data is properly encoded, let’s see the implementation of the clever trick we described earlier.

First, let’s split our data into training and test data. This is done simply by taking the first 90% of characters of the long text file of Shakespeare literature as the training data, and the rest as the test data.
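The split can be sketched as follows (a sample string stands in for the full data tensor):

```python
import torch

text = "Today I put my coat on because it is very cold"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

# First 90% of the characters for training, the last 10% for testing.
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
print(len(train_data), len(val_data))
```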

And now the heart of the implementation of the clever data generation trick.

The goal now is to generate one batch of data for the training set (see an older post here where I describe what batches look like in general). So here’s Karpathy’s simple and elegant code for generating a batch:
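A sketch of that batch-generation code, lightly adapted so it is self-contained here (a sample string stands in for the full Shakespeare tensor; batch_size and block_size match the values used in this post):

```python
import torch
torch.manual_seed(1337)

batch_size = 4  # how many independent sequences we process in parallel
block_size = 8  # how many tokens form "the context" of each example

# Stand-in for the encoded Shakespeare training tensor.
text = "Today I put my coat on because it is very cold"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
train_data = data[:int(0.9 * len(data))]

def get_batch(data):
    # Sample batch_size random starting positions, then stack the
    # block_size tokens following each one (x) and the same windows
    # shifted right by one (y, the targets).
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch(train_data)
print(xb.shape, yb.shape)  # two 4 x 8 tensors
```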

Let’s understand what is going on here:

  • First we decide the batch size (how many examples we want in each batch) and the block size (how many tokens form “the context” in each example).
  • Then, to create a batch (of size 4 in our case), we pick 4 random positions in the training data, which will be the beginnings of the batch’s examples (the ix index), and we stack together:
    • For each of the 4 positions, the block_size (8) tokens that follow it (4 × 8 tokens)
    • The same thing, but shifted by one (the targets)
  • This creates two tensors of size 4×8 each:
  • Hiding behind this are 32 examples to learn from! Here are the first 8:
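The unrolling of those 32 examples can be sketched as follows (again with a sample string standing in for the real data):

```python
import torch
torch.manual_seed(1337)

batch_size, block_size = 4, 8
text = "Today I put my coat on because it is very cold"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([data[i:i + block_size] for i in ix])
y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

# Unroll the batch: every (row, position) pair is one training example,
# pairing a context of 1..block_size tokens with the token that follows it.
count = 0
for b in range(batch_size):       # batch dimension
    for t in range(block_size):   # time dimension
        context = x[b, :t + 1]
        target = y[b, t]
        count += 1
print(count)  # 4 * 8 = 32 examples hiding in one batch
```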

And that’s it, we now have a way to generate a lot of training data to train a GPT from scratch.

Next, we’ll talk about a very naive model that will give very poor results, but that will lay the groundwork for the real thing.

See you in Part 3: The Bigram Model.

