A Very, Very Surface-Level Look At Transformers
Or “Documenting My Struggle To Understand Transformers”
Transformers are a key part of what makes models like GPT-3, GPT-4, and chatbots like ChatGPT so powerful. In 2017, a paper came out called Attention Is All You Need — the paper is only 15 pages long but has been cited more than 70,000 times. You can find a number of blog posts on Medium that reference the paper, a 2020 YouTube video by Computerphile that references the paper, and this illustrated guide to the paper on GitHub.io, licensed under Creative Commons and open for redistribution, so long as it is not done for commercial purposes (note the lack of a star or paywall on this blog post). If you can read the 15-page paper and understand everything, great. You probably have background knowledge in this, and you have saved time.
For the rest of us…
GPT stands for Generative Pretrained Transformer. You can try ChatGPT here, and they released GPT-4 about two months ago. A Transformer is a model that uses attention, and this approach is now a reliable alternative to the Recurrent Neural Network, which Computerphile characterizes as “inherently serial,” meaning it cannot scale in the same way. Transformers, on the other hand, lend themselves to parallelization. From the IBM blog post about recurrent neural networks:
Like feedforward and convolutional neural networks (CNNs), recurrent neural networks utilize training data to learn. They are distinguished by their “memory” as they take information from prior inputs to influence the current input and output. While traditional deep neural networks assume that inputs and outputs are independent of each other, the output of recurrent neural networks depend on the prior elements within the sequence.
— Source
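To make that “memory” idea a little more concrete, here is a minimal numpy sketch of a vanilla RNN. It is my own toy example, not anything from the IBM post; the sizes, weights, and names are all made up. The only point it illustrates is that each hidden state is computed from the previous one, which is why Computerphile calls the approach “inherently serial”: the loop over time steps has to run in order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen arbitrarily for illustration
input_size, hidden_size, seq_len = 4, 3, 5

# Randomly initialized weights (a trained RNN would have learned these)
W_xh = rng.normal(size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden (the "memory")
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))  # a fake input sequence
h = np.zeros(hidden_size)                        # initial hidden state

# The "inherently serial" part: step t needs the hidden state from step t - 1,
# so this loop cannot be parallelized across time steps.
for t in range(seq_len):
    h = np.tanh(W_xh @ inputs[t] + W_hh @ h + b_h)
    print(f"step {t}: hidden state = {h.round(3)}")
```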
If you know what these terms mean, great. If not, you probably have the following questions:
- What is a Recurrent Neural Network, and how does it work?
- WHY is a transformer able to overcome a limitation of Recurrent Neural Networks?
- What is attention and how does attention work?
What I had wanted to do was title this blog post something like “A Gentle Introduction To Transformers,” “Transformers Explained,” or “Why We Care About Transformers And How They Work.” Instead, I want to put my writing out there for entertainment value. I will start with the least confusing and least debatable topics, then fan out until I am going over the actual math involved…mostly through quotes, because I do not follow all of it.
(Edit: The diagrams and explanations you see here will be for transformers in general. GPT still uses transformers, but will not follow this exact architecture. ChatGPT, for example, is a decoder-only model)
The Surface-Level Explanation
Transformers are used in OpenAI language models. You will notice that most of these sources are from several years ago, but transformers are still very relevant today.
From this TowardsDataScience post by Giuliano Giacaglia:
- Transformers solve the problem of sequence transduction, meaning they transform an input sequence to an output sequence. For example, maybe we are translating an input sentence into French
- Recurrent Neural Networks have been used to deal with this problem in the past. They have a chain-like structure
- A transformer uses a self-attention layer. A decoder focuses on relevant parts of the input sentence, analogous to how real human attention works
Recurrent Neural Networks
From Alammar:
- A sequence-to-sequence model is made up of an encoder and a decoder, both of which tend to be recurrent neural networks
- The decoder maintains a hidden state that it passes from one step to the next (if you visit the above, that screenshot is actually a full animation translating a French sentence into an English sentence. As already stated, memory is crucial here and every state impacts the one that comes next; it’s linear). There is a toy sketch of this encoder-to-decoder hand-off right after this list
- The mechanism is described in much more detail here and here. These are two pioneering papers about recurrent neural networks, totaling about 30 pages in length
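Here is my own toy numpy sketch of the hand-off those bullets describe: the encoder RNN folds the whole input sentence into one final hidden state (the “context”), and the decoder RNN starts from that context and emits output words one step at a time. The word ids, sizes, greedy decoding loop, and random untrained weights are all placeholders I made up, not anything from Alammar’s post.

```python
import numpy as np

rng = np.random.default_rng(1)
emb, hidden, vocab = 4, 3, 10  # toy sizes

def rnn_step(x, h, W_x, W_h):
    # one vanilla RNN step, as in the earlier sketch
    return np.tanh(W_x @ x + W_h @ h)

# Separate (random, untrained) weights for the encoder and decoder RNNs
enc_Wx, enc_Wh = rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden))
dec_Wx, dec_Wh = rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden))
W_out = rng.normal(size=(vocab, hidden))   # hidden -> output "word" scores
embed = rng.normal(size=(vocab, emb))      # fake word embeddings

source = [2, 5, 1]        # pretend these are the word ids of a French sentence

# Encoder: fold the whole input sequence into one final hidden state.
h = np.zeros(hidden)
for word_id in source:
    h = rnn_step(embed[word_id], h, enc_Wx, enc_Wh)
context = h               # this single vector is all the decoder gets to see

# Decoder: start from the context and emit one word id per step.
h, word_id = context, 0   # 0 = a made-up <start> token
for _ in range(4):
    h = rnn_step(embed[word_id], h, dec_Wx, dec_Wh)
    word_id = int(np.argmax(W_out @ h))   # greedy pick of the next "word"
    print("decoder emitted word id:", word_id)
```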
I have another example of a recurrent neural network. It’s me using CakeChat on one of my sister blogs, many months ago, blindly copying steps from the GitHub to train a chatbot on some canned Twitter data and then interface with the backend through Telegram. Anyone can tell you that this is not a fair representation of all recurrent neural networks, but the CakeChat GitHub itself states very clearly that it was archived because transformer-based models outperformed recurrent neural networks like it.
This is Sharon. Sharon sucks. I wanted to make my own Replika, which I believe was based on GPT-3 at the time. Replika is a good chatbot. ChatGPT is a good chatbot. The above is just dumb.
Attention
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train
— From the key paper, Attention Is All You Need
Here is a picture of the transformer architecture.
You can find an annotated PyTorch implementation of the Attention paper, and it also has a GitHub. So…here.
…hopefully that clears everything up.
At a basic level, encoders in transformers use a self-attention layer. The encoder “looks at other words in the input sentence as it encodes a specific word.” The decoder has an attention layer that focuses on relevant parts of the input sentence; it is analogous to how real human attention works.
Transformers are non-sequential. They overcome the limitations of recurrent neural networks by providing a different underlying model, one that is more easily parallelizable.
Now I am going to just go off, doing my best to summarize the steps of self-attention from the GitHub I keep referencing. To understand it better, please read the actual GitHub.
Calculate self-attention by creating three vectors from each of the encoder’s input vectors. Create a query vector, a key vector, and a value vector per word. I refuse to elaborate.
Calculate scores by taking the dot product of the query vector and the key vector of the respective word you are scoring.
Divide the scores by 8 (the square root of the key vector dimension used in the paper, 64) to get more stable gradients, then apply softmax so the scores are normalized and sum to 1.
Multiply each value vector by its softmax score, then sum up the weighted value vectors. (I will try to turn these four steps into a toy numpy sketch below.)
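Here is my attempt at turning those four steps into a short numpy sketch. The 64-dimensional query/key/value vectors and the divide-by-8 (the square root of 64) follow the numbers used in the paper and the GitHub walkthrough; the random weights and the three-word “sentence” are placeholders I invented.

```python
import numpy as np

rng = np.random.default_rng(2)

d_model, d_k = 512, 64             # sizes used in the Attention paper / the walkthrough
x = rng.normal(size=(3, d_model))  # stand-in embeddings for a toy 3-word sentence

# Step 1: three learned matrices turn each input vector into a query, a key, and a value.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) * 0.01 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v    # each row is one word's q / k / v vector

# Step 2: score word i against word j with the dot product of q_i and k_j.
scores = Q @ K.T                       # shape (3, 3)

# Step 3: divide by 8 (the square root of d_k = 64) for stable gradients, then softmax.
scores = scores / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# Step 4: weight each value vector by its softmax score and sum the weighted values.
output = weights @ V                   # one 64-dimensional output vector per word
print(weights.round(3))                # each row sums to 1
print(output.shape)                    # (3, 64)
```

Notice that, unlike the RNN loop earlier, there is no loop over time steps here: every word’s output comes out of the same handful of matrix multiplications, which is the parallelization advantage everyone keeps bringing up.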
Okay, I am going to put this on my to-do list. Here is a class I found called CS224N, from Stanford. I see their slides are publicly available at the bottom of the page.
Here is one of their assignments, assignment 5, that challenges students to explore the math behind self-attention. Here is a reading they posted that will probably help. Unfortunately for my readers, I do not see any publicly available midterms/answer keys they put up. In fact, it doesn’t look like this course even has midterms.
So yeah, I will be sure to read all their slides, learn all the math, and solve some of their problem sets. Someday. Of course.
Until then, I do not understand the math behind this.