The Codess

How Does ChatGPT Work?

ChatGPT is a thrilling achievement in machine learning for the simple reason that it is really good at a wide range of tasks. This is exciting for computer nerds (such as myself) because, in the past, we could only build models that were proficient at a single thing. If you made a neural network that could detect facial features, for example, it would excel at telling people apart. However, it could not also tell you what the temperature outside will be at 4 pm. ChatGPT changed everything because it can handle a wide range of input and give a sensible answer almost immediately.


So, how does it work? In this article, I'd like to give a general overview of what happens when you ask ChatGPT a question, without delving into all the scary math. In doing this, I'm going to make some gross oversimplifications, but my goal is to give you a general gist of what's going on. Even the math haters deserve to nerd out over this exciting tech!


Let's start with the name itself. GPT stands for Generative Pre-trained Transformer. Sounds fancy, but let's break it down. Generative means that when you ask ChatGPT "Make me a photo of a purple unicorn flying on top of a rainbow in a pink sky," it's going to generate an image. Okay, next is pre-trained. I like to think of neural networks like toddlers. They have to be trained on data to be able to make predictions. Picture it as if you have a stack of flashcards and a toddler sitting in front of you. You hold up a picture of a red car and say "Car." The toddler repeats it back to you. Then, you pull out a picture of a pick-up truck and say "Truck." The toddler repeats. You go through this cycle a few times, and now you can just hold up a picture and the toddler can identify that it's a car without you telling them. You've just trained a toddler, or in our case, a neural network. Now imagine you pull out a picture of a purple car instead of a red one. Most of the time, the toddler will still be able to identify it as a car, even though they've never seen a purple car before. The same goes for a neural network. So, we train our neural network on a bunch of data, and then it can make assumptions and generalizations about information it's never seen before.
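For the code-curious, the flashcard idea can be sketched in a few lines. This is a toy 1-nearest-neighbour classifier, not a neural network, and the "flashcards" (made-up length/height/wheel-count features) are purely illustrative, but it shows the same pattern: learn from labeled examples, then generalize to one you've never seen.

```python
# A toy version of the flashcard game: supervised learning with a
# 1-nearest-neighbour classifier. All features and examples are made up.

def distance(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(training_data, features):
    # Predict the label of the closest "flashcard" we trained on.
    _, label = min(training_data, key=lambda card: distance(card[0], features))
    return label

# Each flashcard: ([length_m, height_m, wheel_count], label)
flashcards = [
    ([4.5, 1.4, 4], "car"),
    ([4.3, 1.5, 4], "car"),
    ([5.8, 1.9, 6], "truck"),
    ([6.0, 2.0, 6], "truck"),
]

# A "purple car" the model never saw: its measurements are still car-like,
# so it generalizes correctly.
print(classify(flashcards, [4.4, 1.45, 4]))  # -> car
```

Just like the toddler, it has never seen this exact car before, but the examples it trained on are enough to get the label right.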


Finally, we come to the transformer. This is what makes ChatGPT so exciting, but it is tricky to understand. Transformers learn to understand and generate text by analyzing patterns in large amounts of data (which happens in our pre-training step). This is difficult for several reasons. When I say large amounts of data, I mean millions of images, books, web articles, Reddit posts, YouTube videos - anything that researchers can find and use. As it turns out, computers don't really like words. They think in numbers and math. Humans (for the most part) prefer words over numbers. This creates a sort of language barrier between us, and that's where the transformer comes into play. Say we ask ChatGPT "What Shakespearean play is the one where two young lovers die?" ChatGPT takes that sentence and divides it up into tokens, which for our purposes we'll say are individual words. Each of these tokens gets a query, key, and value matrix associated with it. A query contains information about the token. It tells us "The word play is the 3rd word in the sentence. Play is a noun. Are there adjectives in front of play?" The key responds to this query: "There's an adjective before the word play. It's a Shakespearean play."
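To make the "language barrier" concrete, here's a toy tokenizer that splits our Shakespeare question into word tokens and assigns each one a number. Real models use subword tokenizers (so one word can become several tokens), and the IDs here are just positions in a made-up vocabulary, but the point stands: before the transformer sees anything, the words have already become numbers.

```python
# A toy tokenizer: one token per word, as in the text above.
# Real tokenizers split words into smaller subword pieces.

def tokenize(sentence):
    # Lowercase, drop the question mark, split on whitespace.
    return sentence.lower().replace("?", "").split()

def build_vocab(tokens):
    # Map each distinct token to an integer ID -- the numbers the
    # model actually computes with.
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

tokens = tokenize("What Shakespearean play is the one where two young lovers die?")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['what', 'shakespearean', 'play', ...]
print(ids)
```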


Remember: these 'thoughts' are all encoded as a series of numbers that represent this information. The key and query matrices are multiplied together, creating the attention matrix. The name is pretty indicative: the words that affect each other the most get a higher score, so that is where the attention is focused. The value matrix scales these values by their relevance. It may say "The word the is an article that doesn't give me much information, so it's not as important. The word Shakespearean is a very specific adjective, so it's very important. I need to adjust what plays I consider based on this information." It repeats this for every word it's given. This process is important for understanding the context of each token, which gives the model better reading comprehension. We may find the words with the highest scores are Shakespearean, play, lovers, and die. So ChatGPT takes these high-scoring words and thinks "Hm, when I've seen these words combined in my training before, it was about Romeo and Juliet. It must be that one." Again, this is all done in numbers. So for the final step, it decodes this matrix of numbers to tell you: "It's Romeo and Juliet." And it does all this in a matter of seconds! You don't usually think about how quickly your brain processes all this until you have to explain it to a computer.
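The query-times-key-then-weight-the-values step above can be written out in miniature. This is a bare-bones single attention head with tiny made-up 2-D vectors standing in for the learned query, key, and value matrices; real models use high-dimensional learned versions of all three, but the arithmetic is the same shape.

```python
import math

# A minimal sketch of one attention head. Q, K, V below are toy
# stand-ins for the learned query/key/value vectors of three tokens.

def softmax(scores):
    # Turn raw scores into positive weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Query . key: how relevant is each other token to this one?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)  # one row of the "attention matrix"
        # Blend the value vectors according to those attention weights.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three tokens, each with a toy 2-D query, key, and value.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
```

Each row of `out` is a new representation of a token: a mix of all the value vectors, weighted by how much attention that token pays to each of the others.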

All this information is contained and calculated in a single "attention head". What's so cool about ChatGPT is that it has multiple "heads" doing these calculations at the same time, each picking up on different relationships between the words. This is part of what makes it fast and able to take on such diverse tasks! Again, this is an overly simplified explanation that isn't exactly correct because of the bits I've left out, but the general idea is there. It's cool to see a chatbot with such a similar "thought" process to our own, and this architecture will be significant in future discoveries in AI.
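The multi-head idea can be sketched without any real math at all. The two "heads" below are hypothetical stand-in functions (real heads are learned attention computations, not hand-written rules): each one looks at the same tokens from a different angle, and their per-token results get concatenated, just like the outputs of real attention heads.

```python
# A sketch of multi-head attention's structure. The heads here are
# made-up illustrative functions, not learned attention.

def head_syntax(tokens):
    # Pretend this head tracks grammar: here it just tags word order.
    return [f"pos:{i}" for i, _ in enumerate(tokens)]

def head_meaning(tokens):
    # Pretend this head tracks meaning: here it just flags key words.
    keywords = {"shakespearean", "play", "lovers", "die"}
    return ["key" if t in keywords else "-" for t in tokens]

def multi_head(tokens, heads):
    # Every head processes all the tokens; their outputs are then
    # combined per token, as real heads' outputs are concatenated.
    results = [head(tokens) for head in heads]
    return [tuple(r[i] for r in results) for i in range(len(tokens))]

tokens = ["shakespearean", "play", "where", "lovers", "die"]
combined = multi_head(tokens, [head_syntax, head_meaning])
print(combined[0])  # -> ('pos:0', 'key')
```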




