This is the third part of my series explaining the technology behind generative models like ChatGPT. This time: Transformers. Tune in for the gory details:
The previous episodes:
Tokenisation: how text gets into the model: Part 1
Embedding spaces: how the model represents text internally: Part 2
This one will be about how the model manipulates internal representations.
Google came out with the "Attention Is All You Need" paper in 2017 and changed deep learning forever. Just like Mikolov's word2vec paper in 2013, it is hard to overstate the significance of this work. It truly changed the industry overnight.
Part of the reason is its natural simplicity. The paper is easily readable even for non-researchers, with simple concepts and nice illustrations. If you don't want to read the original, I link Jay Alammar's famous illustrated guide instead. [Link]
So what is this simple idea that changed the field?
Fuzzy Dictionaries
Transformers are essentially fuzzy dictionaries. And training them is like training fuzzy KNNs, a well-understood topic.
Let’s elaborate on this.
What is a dictionary? If you are familiar with Python, this shouldn't be a question. If not: a dictionary is a store of key-value pairs. You look up a key (a foreign word), also known as the "query", like in an actual dictionary, and you get back its paired value (the English word). (This, by the way, is one reason why translation is so well suited for DL models.) What happens if a key is not in the dictionary? You either get a default value or an error, depending on the implementation.
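In Python this is just the built-in dict; a tiny, purely illustrative example (the words are made up for the sake of the analogy):

```python
# An ordinary, exact dictionary: a French -> English lookup.
dictionary = {"chat": "cat", "chien": "dog", "maison": "house"}

print(dictionary["chat"])       # exact key -> its value: "cat"
print(dictionary.get("pomme"))  # missing key with .get() -> a default value (None)
# dictionary["pomme"]           # missing key with [] -> raises KeyError
```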
What are the keys and values?
Queries, keys and values are all points in the (same) embedding space.
So the tokens represented in this space are mapped from one place to another by transformers.
What is a "fuzzy" dictionary, then? DL models are continuous mappings, so the traditional exact, one-to-one lookup is relaxed. When the query does not match any stored key exactly, you still get back the best match: the value of whatever is closest to it in the embedding space.
This has many benefits:
You can address the dictionary with a query that is a mixture of keys (e.g. a weighted average), essentially any point in the embedding space.
You can address an uninitialised dictionary and still get an answer (a random point in the embedding space).
These "dictionaries" are created at training time, and then they are fixed. They store the model's knowledge, as opposed to the prompt (the input), which changes with every request. The model weights are two sets of vectors in the embedding space, the keys and the values, shaped as matrices/tensors.
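To make the analogy concrete, here is a minimal sketch of such a fuzzy lookup in NumPy. This is not the full attention mechanism from the paper (no learned projections, no scaling, a single query); the sizes and values are made up for illustration:

```python
import numpy as np

d = 4                      # toy embedding dimension
K = np.random.randn(3, d)  # 3 stored keys, fixed after "training"
V = np.random.randn(3, d)  # 3 stored values, fixed after "training"

def fuzzy_lookup(query):
    # Similarity of the query to every key (dot products), turned into
    # weights with a softmax: nearby keys get most of the weight.
    scores = K @ query
    weights = np.exp(scores) / np.exp(scores).sum()
    # The answer is a weighted average of the values: even a query that
    # matches no key exactly still returns a point in the embedding space.
    return weights @ V

query = np.random.randn(d)  # any point in the embedding space works as a query
print(fuzzy_lookup(query))
```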
Skip Connections
There is one more concept, called the "skip connection". It simply connects the transformer's input to its output. As a formula:
y = x + f(x) instead of y = f(x).
This means that during training, the model doesn’t need to remap the entire space through f(), only a small portion where there is an input-output mismatch.
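Here is a minimal sketch of that formula, assuming f() stands in for whatever the transformer block computes (a toy linear map here, not real attention):

```python
import numpy as np

d = 4
W = np.zeros((d, d))  # an "untrained" block: f(x) = 0 everywhere

def f(x):
    return W @ x

def block_with_skip(x):
    return x + f(x)   # y = x + f(x): the block only has to learn the correction

x = np.random.randn(d)
print(np.allclose(block_with_skip(x), x))  # True: with f doing nothing, the input passes through unchanged
```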
Fuzzy KNNs
But what happens during training?
Transformers are "fuzzy KNNs": simple classifiers in the embedding space, with a vector (the value) attached to each class. The dictionaries/KNNs have a fixed size (the length of the dictionary, or the K in KNN), forming an information bottleneck. The training's job is to utilise this bottleneck in the best possible way, which amounts to a lossy compression.
Skip connections help here because each transformer only needs to map the difference between its output and its input. The required output is signalled through backpropagation from the end of the network, and each transformer's output is the input of the next one.
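Under the same toy assumptions as above, the stacking looks roughly like this; each block's output (its input plus a small correction) becomes the next block's input:

```python
import numpy as np

d = 4
# Stand-in "blocks": small random linear maps instead of real attention layers.
blocks = [lambda x, W=np.random.randn(d, d) * 0.01: W @ x for _ in range(3)]

def transformer_stack(x, blocks):
    for f in blocks:
        x = x + f(x)  # skip connection: each block only adds its correction
    return x

x = np.random.randn(d)
print(transformer_stack(x, blocks))
```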
Next part: Training
I need to finish here because the article is already too long. Subscribe to be notified, or follow me on LinkedIn for the next part: Training Transformers https://www.linkedin.com/in/laszlosragner/