Hi, everyone. So recently, I gave a 30-minute talk on large language models, just kind of an intro talk. Unfortunately, that talk was not recorded, but a lot of people came to me afterwards and told me that they really liked it, so I thought I would just re-record it and put it up on YouTube. So here we go: The Busy Person's Intro to Large Language Models. Okay, so let's begin.
First of all, what is a large language model, really? Well, a large language model is just two files, right? There are two files in this hypothetical directory. So, for example, let's work with the specific example of the Llama 2 70B model. This is a large language model released by Meta AI, and it's the second iteration of the Llama series of language models. And this is the 70-billion-parameter model of the series. There are multiple models belonging to the Llama 2 series: 7 billion, 13 billion, 34 billion, and 70 billion, the biggest one.
Now, many people like this model specifically because it is probably the most powerful open-weights model today. Basically, the weights, the architecture, and a paper were all released by Meta, so anyone can work with this model very easily by themselves. This is unlike many other language models that you might be familiar with. For example, if you're using ChatGPT or something like that, the model architecture was never released; it is owned by OpenAI, and you're allowed to use the language model through a web interface, but you don't actually have access to the model itself.
So, in this case, the Llama 2 70B model is really just two files on your file system: the parameters file, and the run file, some kind of code that runs those parameters. The parameters are the weights of the neural network that is the language model; we'll go into that in a bit. Because this is a 70-billion-parameter model, and every one of those parameters is stored as two bytes, the parameters file is 140 gigabytes. And it's two bytes because the data type is a float16 number.
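The sizing above is simple arithmetic, and it's worth checking once. Here's a minimal sketch (just the numbers from the talk, nothing model-specific):

```python
# Back-of-the-envelope check of the parameters file size:
# 70 billion parameters, each stored as a float16 (16 bits = 2 bytes).
num_params = 70_000_000_000   # 70B parameters
bytes_per_param = 2           # float16

size_gb = num_params * bytes_per_param / 1e9
print(size_gb)  # 140.0 -> the 140 GB parameters file
```

The same arithmetic explains why precision matters for model distribution: the same model stored as float32 would be 280 GB, and as 8-bit integers only 70 GB.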
Now, in addition to these parameters, which are just a large list of numbers for the neural network, you also need something that runs that neural network, and that piece of code is implemented in the run file. This could be a C file, a Python file, or any other programming language, really; it can be written in any arbitrary language. C is a very simple language, just to give you a sense, and it would only require about 500 lines of C, with no other dependencies, to implement the neural network architecture that uses the parameters to run the model. So it's only these two files. You can take these two files and your MacBook, and this is a fully self-contained package. This is everything that's necessary; you don't need any connectivity to the internet or anything else.
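To make the "run file" idea concrete, here is a highly simplified, hypothetical sketch of its main loop in Python. The `forward` function here is a toy stand-in, not the real transformer: in the actual ~500 lines of C, it would compute the network's forward pass using the 140 GB of weights.

```python
# Hypothetical sketch of a run file's generation loop.
# `forward` is a placeholder: a real implementation computes next-word
# probabilities from the model's parameters; here we fake a uniform one.
import random

vocab = ["the", "cat", "sat", "on", "a", "mat"]

def forward(params, tokens):
    # Stand-in for the neural network forward pass: given the context,
    # return a probability for every word in the vocabulary.
    return [1.0 / len(vocab)] * len(vocab)

def generate(params, prompt, n_tokens):
    tokens = prompt.split()
    for _ in range(n_tokens):
        probs = forward(params, tokens)
        # Sample the next word from the predicted distribution.
        next_word = random.choices(vocab, weights=probs)[0]
        tokens.append(next_word)
    return " ".join(tokens)

print(generate(params=None, prompt="the cat", n_tokens=4))
```

The essential structure is just this loop: predict a distribution over the next word, sample from it, append, repeat.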
You can take these two files, compile your C code, get a binary that you can point at the parameters, and you can talk to this language model. So, for example, you can send it text like "Write a poem about the company Scale AI," and this language model will start generating text. In this case, it will follow the directions and give you a poem about Scale AI. Now, the reason I'm picking on Scale AI here, and you're going to see that throughout the talk, is that the event where I originally presented this talk was run by Scale AI, so I'm picking on them throughout the slides a little bit, just in an effort to make things concrete.
So, this is how we can run the model: it just requires two files and a MacBook. I'm slightly cheating here in terms of the speed of this video: this was not actually running a 70-billion-parameter model, only a 7-billion-parameter one. A 70B model would run about ten times slower, but I wanted to give you an idea of what text generation looks like. So, not a lot is necessary to run the model; this is a very small package. The computational complexity really comes in when we'd like to get those parameters. So, how do we get the parameters, and where are they from? Because whatever is in the run.c file, the neural network architecture and the forward pass of that network, is all algorithmically understood, open, and so on. But the magic really is in the parameters. How do we obtain them?
So, to obtain the parameters: basically, model training, as we call it, is a lot more involved than model inference, which is the part I showed you earlier. Model inference is just running the model on your MacBook. Model training is a computationally very involved process. Basically, what we're doing can best be understood as a kind of compression of a good chunk of the internet. Because Llama 2 70B is an open-weights model, we know quite a bit about how it was trained, because Meta released that information in a paper. These are some of the numbers involved. You take a chunk of the internet that is roughly, you should be thinking, 10 terabytes of text. This typically comes from a crawl of the internet, so just imagine collecting tons of text from all kinds of different websites and gathering it together. Then you procure a GPU cluster; these are very specialized computers intended for very heavy computational workloads, like training neural networks.
You need about 6,000 GPUs, and you would run this for about 12 days to get a Llama 2 70B, and this would cost you about $2 million. What this is doing is compressing this large chunk of text into what you can think of as a kind of zip file. So, the parameters I showed you in an earlier slide are best thought of as a zip file of the internet. In this case, what comes out are these parameters, 140 GB, so you can see that the compression ratio here is roughly 100x, roughly speaking. But this is not exactly a zip file, because a zip file is lossless compression; what's happening here is lossy compression. We're just kind of getting a Gestalt of the text we trained on. We don't have an identical copy of it in these parameters, so it's a lossy compression. You can think about it that way.
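The rough numbers above can be sanity-checked with a couple of lines of arithmetic (a sketch using only the approximate figures from the talk, not official Meta accounting):

```python
# 10 TB of training text compressed into 140 GB of parameters.
dataset_gb = 10 * 1000   # 10 TB expressed in GB
params_gb = 140
ratio = dataset_gb / params_gb
print(round(ratio))  # 71 -> i.e. "roughly 100x", order of magnitude

# 6,000 GPUs for 12 days at a total cost of ~$2M implies a
# per-GPU-hour price, as a rough consistency check.
gpu_hours = 6000 * 12 * 24
cost_per_gpu_hour = 2_000_000 / gpu_hours
print(round(cost_per_gpu_hour, 2))  # ~1.16 dollars per GPU-hour
```

The implied price of roughly a dollar per GPU-hour is in the plausible range for bulk datacenter GPU rental, which is a nice check that the talk's numbers hang together.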
One more thing to point out here: these numbers are actually, by today's state-of-the-art standards, rookie numbers. If you want to think about state-of-the-art neural networks, like, say, what you might use in ChatGPT, or Claude, or Bard, or something like that, these numbers are off by a factor of 10 or more. You would just go in and start multiplying by quite a bit more, and that's why these training runs today cost many tens or even potentially hundreds of millions of dollars: very large clusters, very large datasets. So this process of getting the parameters is very involved. Once you have those parameters, running the neural network is fairly computationally cheap.
Okay, so what is this neural network really doing, right? I mentioned that there are these parameters. This neural network is basically just trying to predict the next word in a sequence; you can think about it that way. So, you can feed in a sequence of words, for example, "cat sat on a." This feeds into a neural net, and these parameters are dispersed throughout the network; there are neurons, and they're connected to each other, and they all fire in a certain way. Out comes a prediction for what word comes next. So, for example, in this case, the neural network might predict that, in this context of four words, the next word will probably be "mat," with, say, a 97% probability. This is fundamentally the problem the neural network is solving. And you can show mathematically that there is a very close relationship between prediction and compression, which is why I alluded to this training as being kind of like a compression of the internet: if you can predict the next word very accurately, you can use that to compress the dataset. So, it's just a next-word prediction neural network. You give it some words; it gives you the next word.
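The prediction-compression relationship mentioned above has a simple quantitative form: an ideal entropy coder spends about -log2(p) bits to encode a symbol the model assigns probability p. A small sketch, using the talk's "mat at 97%" example and an assumed 50,000-word vocabulary for the clueless baseline:

```python
# Bits needed to encode the next word under an ideal entropy coder:
# a confident, correct prediction is nearly free; a clueless uniform
# guess over the vocabulary is expensive.
import math

p_confident = 0.97            # model predicts "mat" with 97% probability
p_uniform = 1 / 50_000        # uniform guess over an assumed 50k-word vocab

bits_confident = -math.log2(p_confident)
bits_uniform = -math.log2(p_uniform)

print(round(bits_confident, 3))  # 0.044 bits
print(round(bits_uniform, 1))    # 15.6 bits
```

So the better the next-word predictions, the fewer bits per word are needed to encode the dataset, which is exactly the sense in which training compresses the internet.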
Now, the reason that what you get out of the training is actually quite a magical artifact is that the next-word prediction task might seem like a very simple objective, but it's actually a pretty powerful objective, because it forces the model to learn a lot about the world inside the parameters of the neural network