TL;DR: LLM stands for Large Language Model. LLMs are trained on loads of human language text and are made up of a bunch of numbers (aka parameters) used to make predictions - like the next word or sentence. By identifying patterns in the training data, they can answer questions, understand context, and generate human-like text.
what are LLMs?
You’ve likely seen the word LLM thrown about everywhere. Some of you might already know that it stands for Large Language Model. But what actually is an LLM?
A model is essentially a giant set of numbers arranged in a particular structure. In the case of Large Language Models, that structure is what’s called a transformer architecture. Before any training begins, the model is initialised with small random values (otherwise known as parameters), giving it a starting point from which it will learn.
A model’s parameters are made up of multiple categories of data which represent different things. What we care most about in this post are the weights!
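To make that a little more concrete, here’s a tiny sketch of what “a giant set of numbers” looks like before training: just grids of small random values. The shapes below are made up purely for illustration - real models have billions of these numbers spread across many layers.

```python
import numpy as np

# A toy stand-in for a model's weights before training: just small random
# numbers with no meaning yet. Real models have billions of these, with
# vocabularies of ~50,000+ tokens and much larger dimensions.
rng = np.random.default_rng(seed=0)
embedding_weights = rng.normal(0, 0.02, size=(1_000, 64))  # one row per token in a (tiny) vocabulary
layer_weights = rng.normal(0, 0.02, size=(64, 64))         # one of many, many weight matrices

print(embedding_weights.shape)   # (1000, 64)
print(layer_weights[:2, :4])     # meaningless small numbers - nothing has been learned yet
```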
At this point, we could give the model an input, but the model doesn’t actually know anything, and the output would be random nonsense. It’s like a newborn baby’s brain; everything is in place for it to start absorbing information.
Training will adjust all of these random small values, which will ultimately give us meaningful information. So how is an LLM trained?
an example
We have the following text, which is used exclusively to train a model called AlligatorBot:
Alligators have about 75 teeth in their mouths at any one time, but as the teeth wear down or break off, they are replaced. As a result, many can have about 3,000 teeth throughout their lives.
Before training, if we input a question like “How many teeth do alligators have?”, we could get back gibberish. This could even be something like “Teeth teeth teeth teeth.” The model doesn’t know anything about language, or alligators.
However, after training, the model has gathered an understanding of how to phrase an answer about an alligator’s teeth and should be able to parrot back the information we gave it. It wouldn’t be able to say anything else, though - not even “I don’t know”. The model hasn’t been trained on any other tokens and is confined to the information learned from the training data. This is all it knows! Here is a conversation that could happen:
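(A made-up exchange, purely to illustrate the point.)

You: How many teeth do alligators have?
AlligatorBot: Alligators have about 75 teeth at any one time, but as the teeth wear down or break off, they are replaced, so many can have about 3,000 teeth throughout their lives.

You: What do alligators eat?
AlligatorBot: Alligators have about 75 teeth in their mouths at any one time…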
That said, it has a strong understanding of the relationship between tokens and the context of the sentence.
training
A huge amount of data is gathered (this is called a corpus) from a range of sources: books, videos, papers, code repositories, and social media platforms. A corpus usually amounts to terabytes or even petabytes of data. It is then cleaned up (aka “preprocessed”), broken down into individual pieces called tokens, and fed into the model.
Tokens can be:
- words like alligator, teeth
- characters and punctuation like £, &, <>
- subwords, like -able, un-, dis-
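If you want to see what tokens actually look like, here’s a quick sketch using OpenAI’s tiktoken library (assuming you have it installed), which uses the same style of subword tokenisation as the GPT models:

```python
# A quick look at how a sentence gets broken into tokens.
# Assumes the `tiktoken` package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-3.5/GPT-4-era models
token_ids = enc.encode("Alligators have about 75 teeth in their mouths.")

print(token_ids)                             # a list of integers, one per token
print([enc.decode([t]) for t in token_ids])  # the text piece each ID maps to, e.g. ['All', 'igators', ' have', ...]
```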
Then, the model will start learning the probability of one token following another based on the sequence of tokens - this is called Next Token Prediction (NTP). A lot of other things happen around this, such as building contextual understanding and improving prediction accuracy, but NTP is a pretty important task in the context of building an LLM.
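To make NTP a bit more concrete, here’s a toy sketch that “trains” on AlligatorBot’s single training text by counting which token follows which, then predicts the most likely next token. It’s a simple bigram counter rather than a transformer, but the core idea - learning the probability of one token following another - is the same.

```python
from collections import Counter, defaultdict

# AlligatorBot's entire training text
corpus = (
    "Alligators have about 75 teeth in their mouths at any one time, but as "
    "the teeth wear down or break off, they are replaced. As a result, many "
    "can have about 3,000 teeth throughout their lives."
)

# Naive tokenisation: split on whitespace (real models use subword tokenisers)
tokens = corpus.lower().split()

# Count how often each token follows each other token (a bigram model)
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent next token seen during 'training'."""
    counts = following[token.lower()]
    return counts.most_common(1)[0][0] if counts else "<unknown>"

print(predict_next("teeth"))  # one of 'in', 'wear' or 'throughout' - each followed 'teeth' once
print(predict_next("their"))  # 'mouths' or 'lives.'
```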
Inside the model, NTP is carried out by several layers that work together to predict the next token in a sequence. I’ll go into more detail about these interesting layers in another post, but for now, you can think of them as a bunch of steps. The input tokens are gobbled up and go through each step like an assembly line, where each layer transforms the token into something more meaningful to the model.
While all this is happening, the values of the weights are adjusted based on the difference between the predicted token and the actual token. This process, known as backpropagation, gradually reduces future errors. You could compare these weights to the strength of the synapses between neurons in a human brain. These values often form the bulk of a model’s parameters.
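Here’s a toy sketch of that “nudge the weight based on the error” idea, boiled down to a single weight and a single made-up training example. Real backpropagation does this for billions of weights across every layer of the transformer at once, but the principle is the same.

```python
# Adjusting one weight to reduce the error between a prediction and a target.
weight = 0.1           # starts as a small random-ish value
learning_rate = 0.05
x, target = 2.0, 1.0   # one made-up training example

for step in range(10):
    prediction = weight * x              # the model's guess
    error = prediction - target          # how wrong the guess was
    gradient = error * x                 # how the error changes as the weight changes
    weight -= learning_rate * gradient   # nudge the weight to reduce the error
    print(f"step {step}: weight={weight:.3f}, error={error:.3f}")
```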
deployment
In the real world, an LLM will have been trained on a huge number of tokens. For context, GPT-4, one of the models behind OpenAI’s ChatGPT, is reported to have been trained on around 13 trillion tokens, which is roughly 10 trillion words.
Once the LLM has undergone some fine-tuning (among other steps) and meets an expected standard of performance, it will be compressed and packaged up into one or more files. For public use, the LLM may then be deployed to a server with an API in front of it, so that other people can interact with the model over the web.
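As a rough sketch of what that might look like, here’s a minimal API wrapped around a small, publicly available model using FastAPI and Hugging Face’s transformers library. The model name (gpt2) is just a stand-in, and a real deployment would involve far more: GPUs, batching, authentication, monitoring, and so on.

```python
# A minimal sketch of serving a model behind an API.
# Assumes `fastapi`, `uvicorn`, and `transformers` are installed.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # small public model as a stand-in

class Prompt(BaseModel):
    text: str

@app.post("/generate")
def generate(prompt: Prompt):
    result = generator(prompt.text, max_new_tokens=40)
    return {"completion": result[0]["generated_text"]}

# Run with: uvicorn app:app --reload
```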
explore LLMs yourself
If you want to peruse some publicly available models and learn more about them, I recommend having a gander on Hugging Face. You can even see a comparison of public models and their performance on the Open LLM Leaderboard. Watch this space for an explanation of what the stats on that leaderboard mean.
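If you’d rather poke around from code, the huggingface_hub Python package (assuming it’s installed) can list models programmatically - here’s a small sketch that prints a handful of the most downloaded ones:

```python
# Print a few of the most downloaded models on Hugging Face.
# Assumes the `huggingface_hub` package is installed (pip install huggingface_hub).
from huggingface_hub import list_models

for model in list_models(sort="downloads", direction=-1, limit=5):
    print(model.id)  # model names you can look up on the Hugging Face website
```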