How ChatGPT Works: The Science of Conversational AI

ChatGPT has an average of 13 million users per day. Users ask how to center a `div` or generate emails to send to their boss. But how many of those users really know how this amazing piece of computing works? Why are its answers so accurate (most of the time)? Well, in this post I will explain to you in a simple way how ChatGPT works and what the magic behind the scenes is.

What does GPT mean?

GPT stands for Generative Pre-trained Transformer. In plain English, it’s a tool built on a deep learning model called a transformer, which is pre-trained on a large amount of text and can then generate text for many different purposes.

Great, that sounds easy to understand except for the "deep learning model called a transformer" part. What do those strange words mean? Well, let’s explain each concept, starting with the deep learning model.

Deep learning model

Deep learning is a type of machine learning that uses artificial neural networks with multiple layers (groups of artificial neurons stacked one after another) to analyze and learn from complex data inputs. It is inspired by the structure and function of the human brain, and aims to improve the accuracy and speed of decision-making processes in tasks such as image and speech recognition, Natural Language Processing (NLP), and more.

Simply put, a deep learning model is a computer system that can learn and make decisions based on the data it is trained on. The deep learning model that gives life to the GPT technology is the transformer.

Transformer

So a transformer is basically a deep learning model used in NLP (among other tasks). But what exactly is NLP? I know you are very curious, but I would need a whole post just to explain this topic. In short, have you ever talked to Siri or Alexa, the voice assistants on your phone or smart speaker? They use NLP to understand what you're saying and respond to your questions or commands. This is the technology that allows computers to understand, interpret, and generate human language, making it possible for computers to communicate with people using natural language.

Interesting, but how do transformers fit into the NLP field? Well, in order to answer this question and keep it simple, let’s explain how a computer understands text.

Did you know that computers don’t understand text? Yeah, computers just understand numbers, zeros and ones. So then how is NLP possible? Thanks to the text pre-processing phase which converts regular text into numerical representation to feed a deep learning model (like a transformer). This phase is divided into the following steps:

Tokenization: Breaking down the text into individual words, phrases, or even splitting a single word into more than one token.
Encoding: Converting the words or phrases into numerical values, such as vectors. This can be done using techniques like word indexing and word embeddings (an interesting concept that I can cover in another post).
Normalization: Making sure that words with similar meanings are represented in a consistent way, for example, converting all words to lowercase.
Feature extraction: Selecting relevant information from the pre-processed text, such as the frequency of certain words.
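
The four steps above can be sketched in a few lines of Python. This is only a toy illustration, not how ChatGPT does it: the function names are made up for this post, the tokenizer is a simple word-level one (GPT models use subword tokenizers), and the text is lowercased before indexing so that the IDs already reflect the normalized words.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word and punctuation tokens (simple word-level tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(tokens):
    """Lowercase every token so 'How' and 'how' share one representation."""
    return [token.lower() for token in tokens]


def build_vocabulary(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary


def encode(tokens, vocabulary):
    """Word indexing: replace each token with its integer ID."""
    return [vocabulary[token] for token in tokens]


def extract_features(tokens):
    """A basic feature: how often each token occurs."""
    return Counter(tokens)


text = "How was your day? How was your week?"
tokens = normalize(tokenize(text))
vocabulary = build_vocabulary(tokens)
token_ids = encode(tokens, vocabulary)
frequencies = extract_features(tokens)

print(tokens)
print(token_ids)
print(frequencies["how"])
```

Repeated words get the same ID, which is the whole point of word indexing: the model only ever sees the list of numbers, never the original characters.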

And that is the process that is followed to feed a model like a transformer with data as text into its architecture. Oh, yes, I haven't explained what the architecture of a transformer is.

Attention is all you need

The transformer architecture was first presented in 2017 in a paper called “Attention Is All You Need”, written by researchers from Google and the University of Toronto. The title refers to what is probably the most important concept in transformers and their main difference from other deep learning models such as Recurrent Neural Networks (RNNs): the attention mechanism.

In short, the attention mechanism allows the model to dynamically focus on different parts of the input sequence when making predictions. The attention mechanism operates by computing attention weights for each word in the input sequence, which represent the degree of importance of that word with respect to the rest of the words in the text for the prediction task.

The attention mechanism is based on three vectors: the query vector, the key vector, and the value vector. The query vector represents the information that the model is looking for, the key vector represents the information in the input sequence, and the value vector represents the information that the model should pay attention to.

These concepts can be a little confusing, so let's take a look at an example to get a better understanding. Imagine you're a chef, and you're preparing a recipe that requires several ingredients. To make sure you have all the ingredients, you would have three things:

The query: This would be the request for a list of ingredients required for a recipe. It represents the information that you're looking for.
The keys: This would be all the ingredients you have in your kitchen. They represent the information that you have available.
The values: These would be the ingredients you actually need for the recipe. They represent the information that you should pay attention to.

I hope these three concepts are clear now. Also, keep in mind that each word in the text has its own query, key, and value vectors, which are numerical representations computed by the transformer itself.

Finally, the attention scores are calculated by taking the dot product of the query vector with each key vector. The scores are then scaled and passed through a softmax function, so the resulting attention weights are positive and sum to 1. The attention weights are then used to compute a weighted sum of the value vectors, which provides a context-aware representation of the input data for the prediction task.
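
To make this computation concrete, here is a minimal sketch of attention for a single query in plain Python. The vectors are tiny and invented for illustration; in a real transformer they are learned and have hundreds of dimensions. Dividing the scores by the square root of the vector size and normalizing them with a softmax follow the formulation in the “Attention Is All You Need” paper.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # One score per key: how well does this key match the query?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    dimension = len(values[0])
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dimension)
    ]
    return weights, context


# Toy 2-dimensional vectors, invented for illustration only.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, context = attention(query, keys, values)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The first and third keys point in the same direction as the query, so they receive the largest weights, and the resulting context vector leans toward their values.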

In the image below you can see a graphic representation of the attention mechanism:

This image is based on the one presented in this article about the attention mechanism.

As you can see, the model learns to pay attention to those words whose French translation coincides both contextually and grammatically. The more intense the color, the greater the relationship that the English word has with each word of the French sentence.

In this example of a neural machine translation system, the key vectors are the word vector representations of the input sequence (“How was your day”), which are computed from the input words and their corresponding hidden states (i.e., internal representations of the input sequence). The query vector is the word vector representation of a word in the output sequence (for example “Comment”), which is obtained by passing the decoder hidden state (i.e., internal representation of the output sequence generated so far) at a particular time step through a separate neural network. The value vectors are another representation of the input sequence similar to the key vectors, but including a representation of the state. The key vector is used to calculate the attention score by multiplying it by the query vector, and the resulting scores are used to compute a probability distribution over the value vectors. Finally, the next token is selected using the computed probabilities.

This process is done for every word in the output sequence, and each word of the input sequence has the full context of its relationship with every word in the output sequence, no matter the distance between words in the sequence.

Now, you’re probably wondering why having the full context is so important? Or, where does all this process occur in the transformer architecture?

All the text pre-processing we saw earlier and the attention weight computation occur on the encoding side (the one on the left) of the architecture you can see below:

As you can see, this figure has a lot of strange names which I will not explain in depth to keep this post simple, but you have to know that the encoding side processes the input sequence (text or audio), and consists of a stack of identical encoder layers. Each encoder layer has two sub-layers: a multi-head self-attention mechanism (self-attention refers to an attention mechanism applied to the input sequence itself) and a fully connected feed-forward network (a feed-forward network is a type of neural net where information flows only in one direction, from input to output). The multi-head self-attention mechanism allows the encoder to attend to different parts of the input sequence, and the feed-forward network processes the representations generated by the attention mechanism.
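
The feed-forward sub-layer is the easier of the two to picture in code. The sketch below assumes the form used in the original paper, two fully connected layers with a ReLU activation in between, and uses hand-picked weights that a real model would learn during training. The other pieces of an encoder layer are left out to keep it short.

```python
def linear(vector, weights, bias):
    """One fully connected layer: multiply by the weight matrix, then add the bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def relu(vector):
    """Keep positive values and replace negative ones with zero."""
    return [max(0.0, x) for x in vector]


def feed_forward(vector, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between; data only moves forward."""
    hidden = relu(linear(vector, w1, b1))
    return linear(hidden, w2, b2)


# Hypothetical hand-picked parameters; a trained model would learn these.
w1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]  # 2 inputs -> 3 hidden units
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]  # 3 hidden units -> 2 outputs
b2 = [0.0, 0.0]

# Stand-in for the representation of one token coming out of self-attention.
token_representation = [2.0, 1.0]
output = feed_forward(token_representation, w1, b1, w2, b2)
print(output)  # [2.5, 1.5]
```

Notice that nothing loops back: the input goes through the first layer, the activation, and the second layer, and that is it. The same network is applied to every token's representation independently.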

This architecture also has a decoder side (the one on the right) which generates the output sequence, and is similarly composed of a stack of identical decoder layers. The decoder has three sub-layers: a multi-head self-attention mechanism, a multi-head attention mechanism over the output of the encoder stack, and a fully connected feed-forward network like the one in the encoder. The self-attention mechanism in the decoder allows it to attend to different parts of the output sequence, while the attention mechanism over the output of the encoder allows the decoder to incorporate information from the input sequence into its predictions.

This is tricky, so let's see another analogy.

You can think of the encoding and decoding architecture in a transformer as two parts of a conversation between two people.

The encoding part can be thought of as the sender in the conversation, who is encoding or summarizing a message for another person. In this part, the person “speaks” the message carefully, breaking it down into individual words or phrases (tokenization) and converting them into numerical representations (encoding). The person also normalizes the message, making sure that words with similar meanings are represented consistently (normalization). Finally, the person extracts the most relevant information from the message, such as the frequency of certain words (feature extraction).

The decoding part can be thought of as the person who is decoding or reconstructing the message from the summary that the sender generated. In this part, the person listens to the summarized message, attending to different parts of it to generate a new message (self-attention mechanism in the decoder). Additionally, the person also uses information from the original message to help reconstruct it accurately (attention mechanism over the output of the encoder).

So, in this analogy, the attention mechanism is like the process of listening and paying attention to different parts of the message in order to generate an accurate summary or reconstruction. That is, the model is learning to continue the conversation that the "person" started in the first place, making it the basis of a chat, a ChatGPT.

So, how was ChatGPT trained?

The answer to this question is simple: it's all about text. As you may know at this point, OpenAI engineers trained ChatGPT using a large text corpus, that is, content-rich writing of all kinds that existed on the Internet up to the beginning of 2022, as the official OpenAI blog explains.

There is another feature of transformers I haven't told you about: the architecture is designed to process data sequences in parallel, which speeds up computations and helps make products like ChatGPT so amazing, just as amazing as its "father", GPT-3.

GPT-3 is the third generation of the NLP models created by OpenAI. This model has 175 billion parameters arranged in the transformer architecture I explained above, and you can consider it the predecessor of ChatGPT: it was released earlier, but its use was focused on letting other developers build external products with the OpenAI API.

At the time of writing this post, OpenAI has published the fourth version of its model, GPT-4, which adds some interesting improvements that can be discussed in a future post.

But it was only a matter of time before OpenAI developed its own product based on GPT-3, and they probably knew that this product was going to be used by millions of people every day, so the interaction between "human and machine" had to be really good. For that reason, the ChatGPT model was trained using a machine learning technique called Reinforcement Learning from Human Feedback (RLHF), which "teaches" the model how good its responses are based on human feedback. That way, the model learns that generating better responses earns it a higher reward.

In the image below you can see this process.

As you can see, in step 1 a sampled prompt is used to write the desired response to be given by the model. This response is made by a human and then passed as training data to the most recent version of GPT-3, GPT-3.5. Be aware that the model will not generate exactly the same response for exactly the same prompt as if it were a database, but rather this data is used to train the transformer model that GPT-3.5 uses to try to create new responses based on future prompts that are similar to those used during the training process.
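
One way to picture what "training on the human-written response" means is the loss below. It assumes the usual next-token objective for language models (the post's sources do not spell out the exact loss), and both the tokens and the probabilities are invented for illustration. The lower the loss, the more probability the model gives to the words the human actually wrote.

```python
import math


def demonstration_loss(predicted_distributions, target_tokens):
    """Average negative log-probability given to the human-written tokens."""
    total = 0.0
    for distribution, target in zip(predicted_distributions, target_tokens):
        total += -math.log(distribution[target])
    return total / len(target_tokens)


# Human-written response for some prompt, token by token (invented example).
target_tokens = ["use", "flexbox"]

# Hypothetical model probabilities for the next token at each position.
before_training = [
    {"use": 0.2, "try": 0.5, "flexbox": 0.3},
    {"use": 0.1, "try": 0.3, "flexbox": 0.6},
]
after_training = [
    {"use": 0.7, "try": 0.2, "flexbox": 0.1},
    {"use": 0.05, "try": 0.05, "flexbox": 0.9},
]

loss_before = demonstration_loss(before_training, target_tokens)
loss_after = demonstration_loss(after_training, target_tokens)
print(round(loss_before, 3))
print(round(loss_after, 3))
```

Training nudges the model's parameters so that this number goes down across many prompts, which is why the model ends up imitating the style of the demonstrations instead of memorizing them.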

In the second step, a sampled prompt is used along with several responses generated by the GPT-3.5 model. A human then ranks the responses from best to worst, and this ranking data is used to train the ChatGPT reward model, which learns to score responses the way the human labelers would.
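
A ranking can be turned into training data for a reward model by comparing the responses two at a time. The sketch below assumes a pairwise loss, a common choice for learning from rankings; the response names and the scores are hypothetical, and a real reward model would be a neural network producing those scores.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(difference)): small when the preferred response scores higher."""
    difference = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-difference))


# A labeler ranked three sampled responses from best to worst (hypothetical).
ranking = ["response_b", "response_a", "response_c"]
pairs = ranking_to_pairs(ranking)

# Hypothetical scores from a reward model that is still being trained.
rewards = {"response_a": 0.1, "response_b": 1.5, "response_c": 0.4}

loss = sum(pairwise_loss(rewards[p], rewards[r]) for p, r in pairs) / len(pairs)
print(pairs)
print(round(loss, 3))
```

Here the reward model scores `response_c` above `response_a` even though the human ranked them the other way around, so that pair contributes the largest share of the loss. Training pushes the scores toward the human ordering.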

Finally, step 3 is quite complex to explain. PPO, which stands for Proximal Policy Optimization, is, in a nutshell, an algorithm that the model uses to adjust its learning process based on the feedback it receives to get better results on the task it is trying to achieve. In this case, the task is to generate a consistent response for a given prompt.
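
The core idea of PPO fits in one small function. The sketch below shows the clipped objective from the PPO algorithm for a single decision; it is not OpenAI's training code, and the numbers are invented. The `ratio` says how much more (or less) likely the updated model is to produce a given output than the previous version was, and the `advantage` says whether that output turned out better or worse than expected according to the reward.

```python
def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO's clipped surrogate objective for a single sampled action."""
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the clipped and unclipped estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# Small change, good outcome: the objective is not clipped.
small_step = clipped_objective(ratio=1.1, advantage=2.0)

# Large change, good outcome: the gain is capped at (1 + epsilon) * advantage.
large_step = clipped_objective(ratio=1.5, advantage=2.0)

# Bad outcome: the objective is held at (1 - epsilon) * advantage.
bad_outcome = clipped_objective(ratio=0.5, advantage=-1.0)

print(small_step, large_step, bad_outcome)
```

The clipping is what the "proximal" in the name refers to: the model gets no extra credit for moving far away from its previous behavior in a single update, which keeps training stable.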

Thus, ChatGPT was trained to give responses very close to human ones, combining the unsupervised learning of transformers with Reinforcement Learning from Human Feedback, just as a human learns based on experience (unsupervised learning) and feedback given after demonstrating that experience (RLHF).

And this combination is the superpower behind ChatGPT. The GPT-3.5 model can give good responses when people interact with it through the OpenAI API, but here I’m talking about a product used publicly by millions of people every day, so the interaction with those people has to be extremely good, and RLHF is the key to achieving it. For that reason, I may explain this concept further in a second part of this post, so stay tuned.

Conclusion

The "brain" of ChatGPT is very interesting from a theoretical perspective. It is quite impressive how a tool uses an Artificial Intelligence model that was trained using an incredible amount of data from the Internet, making it so useful, but it's not magic. Although ChatGPT can help us to perform many tasks and save some time being more efficient, this is a tool that can make many mistakes like any other tech tool, and it is very important to be aware of that to use it carefully without taking its answers as the final solution to use in our daily tasks.

Also, I want to encourage you to be aware of the ethical implications that using this technology has, since tools like these "learned" from the public data that exists on the Internet, and you may know that in that place not all information is accurate or unbiased, so AI models can and do reflect these inaccuracies and biases. Thus it is necessary to create policies to regulate these situations.

I hope you learned something new from this blog post. See you next time, and don't forget: the future is today.
