
SET A - LECTURE NOTES


Introduction

Timestamp: 00:00:00

In the previous lecture, we implemented a bigram character-level language model, where we took one character and tried to predict the next one. This was all well and good if we only cared about pairs of characters, but we saw that it didn't do very well when we tried to sample words out of it, plus we only implemented a single layer of neurons. And if we stick with the same approach (where we did counts and built a counts matrix) and take more context characters, the size of the table explodes, i.e. from 27x27 to 27x27x27 and so on.

So now we will be moving on to another model called MLP (Multi Layer Perceptron).

 

Bengio et al. 2003 (MLP language model) paper walkthrough.

Timestamp: 00:01:48

Paper Review: (Now this obviously wasn't the first paper to introduce this, but it was definitely the most influential one) Paper Link

Now, in the paper they propose a word level language model, but we will be implementing it for characters instead - so it's the same approach, just as a character level language model.

The modelling approach suggested in the paper is also identical: we use a multi layer NN to predict the next word from the previous ones, and we train it by minimizing the negative log likelihood of the training data (i.e. maximizing the likelihood).

They are basically proposing to embed the words in a vector space (visualized as a 3D space - you can revisit the imagery from here), where words that are related to each other end up close by. So during testing, if the model encounters a sentence it may not have been trained on, it can still lean on nearby words in that space and complete the sentence. So within that embedding space, knowledge is shared between similar words and an outcome is produced.

First explanation of the diagram at 5:42 with an overview (will have to come back to this as I progress through the lecture, there was some imagery/explanation which I couldn't grasp completely)

Update: Yeah it all makes sense now lol

 

(re-)building our training dataset

Timestamp: 00:09:03

We are preparing our dataset. It's the same names.txt file we used before. We have made a slight change to how we format the dataset (the <S> and <E> tokens from before are gone); here we are adding a block size, which represents 'how many characters should we consider for predicting the next one?'

We've used 3 to follow the diagram in the research paper: the 3 different inputs shown side by side at the bottom of the diagram represent that. View page 6 for the diagram.

Prepared the X and Y values.
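A minimal sketch of how that preparation looks, assuming the words list and the stoi character-to-index mapping (with '.' at index 0) carried over from the previous lecture:

import torch

block_size = 3  # context length: how many characters we use to predict the next one
X, Y = [], []
for w in words:
    context = [0] * block_size           # pad the start of each word with '.' (index 0)
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)                # the current context is the input
        Y.append(ix)                     # the character that follows is the label
        context = context[1:] + [ix]     # crop the oldest character and append the new one
X = torch.tensor(X)   # e.g. torch.Size([32, 3]) when using just the first 5 names
Y = torch.tensor(Y)   # torch.Size([32])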

 

Implementing the embedding lookup table

Timestamp: 00:12:19

(Basically showing a broken down alternative way of implementing this, but ultimately the point is to show how simple and direct it is to do indexing in PyTorch)

In the diagram, we are basically implementing the lookup table C. So we have 27 possible characters and we're gonna embed them in a lower dimensional space. In the paper, they had taken 17,000 words and crammed them into a 30 dimensional space. So we'll be doing something similar: taking 27 characters and cramming them into a 2 dimensional space.

This lookup table C will be filled with random numbers, and it will have 27 rows and 2 columns. So each one of the 27 characters will have a 2 dimensional embedding: C = torch.randn((27,2))

Now, if you look at the diagram, we are indexing each word (our case character) into the look up table C. Ultimately, you can see that entire structure as one layer of the NN (the first layer)

So, plucking out a character's row of the lookup table like this is called INDEXING. There is also the method of one-hot encoding the character and matrix multiplying it with C (which gives the same row), but we'll be discarding that as it's simpler and much faster to do indexing.

So long story short, in order to embed all of X (every 3-character context in our dataset) into C, we simply do C[X]
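A small sketch of that comparison, showing that the one-hot route and direct indexing produce the same embedding:

import torch
import torch.nn.functional as F

C = torch.randn((27, 2))                  # the embedding lookup table

ix = 5                                    # index of some character
one_hot = F.one_hot(torch.tensor(ix), num_classes=27).float()
emb_via_onehot = one_hot @ C              # the one-hot vector plucks out row ix of C
emb_via_indexing = C[ix]                  # same result, simpler and much faster

print(torch.allclose(emb_via_onehot, emb_via_indexing))  # True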

 

Implementing the hidden layer + internals of torch.Tensor: storage, views

Timestamp: 00:18:35

Now we try to build the hidden layer. Here we consider the shape of the embeddings: C[X].shape is torch.Size([32, 3, 2])

So for each of the 32 examples we have 3 context characters, each with a 2 dimensional embedding. (Just consider the diagram itself: the 2D embeddings are the red circles and the 3 of them are the 3 rectangles.)

For the hidden layer, we create a weight matrix W1 initialized with a bunch of random numbers. Its input size is 6, since we take the 2D embeddings times the 3 context characters, i.e. 3 x 2 = 6. And for the number of neurons in the hidden layer we can pick any number of our choice, so we take 100. W1 = torch.randn((6, 100))

And we add bias to it b1 = torch.randn(100)

Now, normally we would wanna matrix multiply the embeddings with the weights in the hidden layer and add bias to it emb @ W1 + b1 (Note: emb is basically C[X]. So, emb = C[X])

But we can't do that, because the shape of emb is [32, 3, 2] and our W1 is [6, 100]. So we need to somehow concatenate those 3 embeddings into one vector per example, so that we get 3x2 i.e. 6.

So for those 3 different boxes that we have, we want to concatenate all of their values into one. And this is where the different functions provided by PyTorch come in.

PyTorch's concatenate function torch.cat takes the tensors to join (the embedding slices) and the dimension along which to concatenate them, hence the 1 in torch.cat([...], 1)

Now, instead of listing the embeddings out one by one like torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)

we can use the torch function called unbind, which basically returns such a list of slices. So we do torch.unbind(emb, 1). Here too we mention the dimension, the one we are slicing along (we are basically looping over that dimension)

So finally we get torch.cat(torch.unbind(emb, 1), 1)
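A quick sanity check (reusing emb = C[X] from above): both routes produce the same [32, 6] tensor.

a = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)
b = torch.cat(torch.unbind(emb, 1), 1)
print(a.shape)               # torch.Size([32, 6])
print(torch.allclose(a, b))  # True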

But it turns out even that is not very efficient, because the concatenation creates a whole new tensor with a whole new chunk of memory.

To resolve this, we will change the shape of the tensor instead. PyTorch has something called .view, where we can reinterpret the dimensions however we want. So if the total number of elements is 18, we can view it as 9x2, 3x3x2, 6x3, anything. The reason is that PyTorch stores all the elements in memory as a single one-dimensional array, i.e. 0 to 17 in our example. So as long as the total number of elements remains the same, we can always ask PyTorch to view it with a different shape.
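A tiny sketch of that idea (assuming torch is already imported):

a = torch.arange(18)                  # 18 elements stored contiguously in memory: 0..17
b = a.view(9, 2)                      # same storage, interpreted as 9x2
c = a.view(3, 3, 2)                   # same storage, interpreted as 3x3x2
print(b.shape, c.shape)               # torch.Size([9, 2]) torch.Size([3, 3, 2])
print(a.data_ptr() == b.data_ptr())   # True: no copy was made, only the view changed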

So instead we go back to the original matrix multiplication equation emb @ W1 + b1 and simply convert the shape of the embedding to match W1 for the multiplication, by asking PyTorch to view it differently: emb.view(32, 6) @ W1 + b1

Now we don't want to hardcode the value 32 and would like to make it more dynamic, so we instead write emb.view(emb.shape[0], 6) @ W1 + b1

or, even more conveniently, emb.view(-1, 6) @ W1 + b1. When we pass -1, PyTorch will itself figure out the size that needs to go there.

And finally, since this is the hidden tanh layer, we implement that as well, so h = torch.tanh(emb.view(-1, 6) @ W1 + b1)

AND THAT'S OUR HIDDEN LAYER!

(psst.. fun flashback: in that equation we also add the bias after the matrix multiplication. emb.view(-1, 6) @ W1 is [32, 100] and b1 is [100], so broadcasting is happening! b1 is treated as [1, 100] and added to every one of the 32 rows.)
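A quick shape check of that broadcast, reusing emb, W1 and b1 from above:

x = emb.view(-1, 6) @ W1      # torch.Size([32, 100])
print(x.shape, b1.shape)      # torch.Size([32, 100]) torch.Size([100])
print((x + b1).shape)         # torch.Size([32, 100]) -- b1 is broadcast as [1, 100] over the rows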

 

Implementing the output layer

Timestamp: 00:29:15

Now finally let's implement the final layer. So we assign W2 and b2. W2 takes the 100 outputs of the hidden layer as input, and we need 27 outputs since we have 27 characters; the bias is also of size 27.

W2 = torch.randn((100, 27))
b2 = torch.randn(27)

And finally we calculate the logits which is the output of the final layer

logits = h @ W2 + b2

So finally, our output layer (logits) dimension will be [32, 27]

 

Implementing the negative log likelihood loss

Timestamp: 00:29:53

Now, as we've seen in part 1 of makemore, we need to take those logit values, first exponentiate them to get our "fake counts" and then normalize them to get probabilities.

counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
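Each row of prob is now a probability distribution over the 27 possible next characters (a quick check, reusing the tensors above):

print(prob.shape)      # torch.Size([32, 27])
print(prob[0].sum())   # tensor(1.) -- every row sums to (approximately) 1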

Now we bring in the final piece of this equation, Y: the actual next character for each example.

So now we need to index into prob with Y. For each example we pair its row number (0 to 31, via torch.arange(32)) with its corresponding value in Y.

torch.arange(32) -> 0, 1, 2, 3, ..., 31
Y -> 5, 13, 13, 1, ..., 0

prob[torch.arange(32), Y]
So now we've indexed Y into the prob
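A toy illustration of that indexing pattern (made-up numbers, 4 examples over a 3-character vocabulary):

p = torch.tensor([[0.10, 0.70, 0.20],
                  [0.30, 0.30, 0.40],
                  [0.50, 0.25, 0.25],
                  [0.20, 0.20, 0.60]])
targets = torch.tensor([1, 2, 0, 2])
print(p[torch.arange(4), targets])   # tensor([0.7000, 0.4000, 0.5000, 0.6000])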

Then we take the log of those probabilities, then their mean, and finally the negative of it to get the negative log likelihood, which is basically our loss value.

loss = -prob[torch.arange(32), Y].log().mean()

So this is the loss value that we would like to minimize, so that we can get the network to predict the next character of the sequence correctly.
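For what it's worth, this exponentiate / normalize / index / log / mean / negate chain is exactly what a classification cross entropy loss computes, so the manual value above can be cross-checked against PyTorch's built-in (a sketch, reusing logits, prob and Y from above):

import torch.nn.functional as F

loss_manual = -prob[torch.arange(32), Y].log().mean()
loss_builtin = F.cross_entropy(logits, Y)   # should match, up to floating point error
print(loss_manual.item(), loss_builtin.item())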

 

Summary of the full network

Timestamp: 00:32:17

Now we're just putting it all together (to make it more respectable lol)

X.shape, Y.shape # dataset
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)     # embedding lookup table
W1 = torch.randn((6, 100), generator=g)   # hidden layer weights
b1 = torch.randn(100, generator=g)        # hidden layer bias
W2 = torch.randn((100, 27), generator=g)  # output layer weights
b2 = torch.randn(27, generator=g)         # output layer bias
parameters = [C, W1, b1, W2, b2]
sum(p.nelement() for p in parameters) # to check the total number of parameters
emb = C[X]                                      # [32, 3, 2]
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)       # [32, 100]
logits = h @ W2 + b2                            # [32, 27]
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
loss = -prob[torch.arange(32), Y].log().mean()
loss