Transformers
Transformers are a powerful type of architecture that allows each element of an input sequence to be considered in the context of the whole input. They are built on the self-attention mechanism, which performs an \(O(N^2)\) computation over the input sequence. Stacked in successive layers, this provides the ability to represent relations between inputs at different levels of abstraction.
Transformers can be used in three general ways: encoder-only, decoder-only, and encoder-decoder.
Types of Transformer Architectures¶
Encoder-Only Networks¶
In encoder-only networks, like BERT, the entire input text is attended to bidirectionally. These networks are primarily useful for classification tasks over the input (sequence-to-value).
Encoder-Decoder Networks¶
As described in the original Transformer attention paper, encoder-decoder networks convert sequences to sequences for tasks like language translation. In these systems, an encoder first produces a representation of the input; the decoder then attends to that representation and generates outputs one step at a time, feeding each new output back in to generate the next.
Decoder-Only Networks¶
In decoder-only networks, like GPT, the model performs next-token prediction, requiring information only from previously seen words/tokens. The outputs are estimates of the probability of the next word/token. Although each prediction yields only a single token, the process is applied iteratively, and with the proper prompting the generation of output sequences can perform a variety of sequence-to-sequence tasks, such as language translation. A minimal decoding-loop sketch follows.
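The loop below is a minimal sketch of that iterative process, assuming a hypothetical decoder-only `model` that maps a tensor of token ids to next-token logits; it illustrates the idea rather than any particular library's API.

```python
import torch

# Hypothetical decoder-only `model`: takes a (1, T) tensor of token ids and
# returns (1, T, vocab_size) logits. Greedy decoding is shown for simplicity.
def greedy_generate(model, prompt_ids: list[int], n_new: int) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model(torch.tensor(ids).unsqueeze(0))   # score the whole prefix
        next_id = int(logits[0, -1].argmax())            # most probable next token
        ids.append(next_id)                              # feed the prediction back in
    return ids
```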
Key Components of Transformers¶
Attention Mechanism¶
- Attention: The token being predicted is mapped to a query vector, and tokens in context are mapped to key and value vectors. Inner products between queries and keys determine how strongly each value contributes to the output.
- Bi-directional / Unmasked: Allows the model to attend to all tokens in the sequence.
- Unidirectional / Masked Self-Attention: Ensures that predictions for a token can only depend on previous tokens (both masking schemes are sketched after this list).
- Cross-Attention: Applies attention from the primary sequence to a second token sequence, which is treated as the context.
- Multi-Head Attention: Multiple attention heads operate in parallel.
- Layer Normalization: Improves training stability; root mean square layer normalization (RMSNorm) is a commonly used, computationally efficient variant.
- Unembedding: Learns to convert vectors back into vocabulary elements.
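The masking options listed above differ only in which positions a token is allowed to attend to. A minimal sketch, using random scores in place of real \(QK^T\) values and illustrative sizes:

```python
import torch

T = 5  # sequence length

# Bi-directional / unmasked: every position may attend to every other position.
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)

# Unidirectional / masked (causal): position i may only attend to positions j <= i,
# so a prediction never depends on tokens that come after it.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Disallowed positions are set to -inf before the softmax so they get zero weight.
scores = torch.randn(T, T)                       # stand-in for Q @ K.T / sqrt(d_k)
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)   # rows sum to 1 over allowed positions
```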
Positional Encoding¶
Standard embeddings are position-invariant, meaning the position of the token/word in the input carries no information by itself. Positional embeddings are used to encode the position of tokens/words, generally by adding fixed sinusoids of varying frequency or trainable parameters to the token embeddings. A sketch of the sinusoidal variant follows the links below.
- A Gentle Introduction to Positional Encoding in Transformer Models, pt1
- Transformer Language Models without Positional Encodings Still Learn Positional Information
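A minimal sketch of the additive sinusoidal scheme popularized by the original Transformer paper; the sequence length and `d_model` are illustrative, and a trainable variant would simply replace this fixed table with a learned embedding:

```python
import math
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed (seq_len, d_model) table of sinusoidal position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    freqs = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * freqs)    # even dimensions
    pe[:, 1::2] = torch.cos(position * freqs)    # odd dimensions
    return pe

# The table is simply added to the (position-invariant) token embeddings.
token_embeddings = torch.randn(10, 64)           # illustrative: 10 tokens, d_model = 64
x = token_embeddings + sinusoidal_positions(10, 64)
```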
Layer Normalization¶
Layer normalization observably improves results. For more details, see On Layer Normalization in the Transformer Architecture.
Visualizing The Structures¶
Detailed Components¶
- Positional Encoding
- Attention: Query, Key, and Value Vectors
- Layer Normalization
Initially, the text is broken into words or subwords (tokens), each of which is used as a lookup key to find an 'embedding'. These embeddings can be trained alongside the transformer model or pre-trained from other models. The embedding provides a vector representation of the input token.
To allow each token embedding to attend to, and share information with, the other inputs, a self-attention matrix is calculated. For a sequence of input token embeddings, three learned matrices are used:
- A Query matrix \(W^Q\)
- A Key matrix \(W^K\)
- A Value matrix \(W^V\)
For each token/word \(i\), the embedding is multiplied by these matrices to yield a query vector, a key vector, and a value vector: \(Q_i\), \(K_i\), and \(V_i\).
Each query vector is then compared against each key vector, giving the matrix product \(QK^T\). Because the query-key score is supposed to describe how important an input combination is, it is scaled by \(\sqrt{d_k}\), the dimension of the keys, so that behavior is similar across different dimensions, and then passed through a softmax function: \(\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)\)
This is then multiplied by the value matrix to provide the attention output: \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
Multiple attention heads can be combined by concatenating their outputs and multiplying by a final matrix \(W^O\) that produces the layer output: \(\mathrm{MultiHead} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O\)
Finally, this output is added back to the input values via a residual connection, and the result is layer-normalized. It can then be passed to additional layers or to a final fully-connected projection layer. A minimal end-to-end sketch of a single attention head follows.
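Putting the steps above together, a minimal single-head sketch; the sizes, random initializations, and variable names are illustrative rather than taken from any particular implementation:

```python
import math
import torch
import torch.nn.functional as F

d_model, d_k, T = 64, 64, 8                    # illustrative sizes
X = torch.randn(T, d_model)                    # token embeddings (plus positional encodings)

W_Q = torch.randn(d_model, d_k)                # learned projection matrices
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # per-token query / key / value vectors

scores = Q @ K.T / math.sqrt(d_k)              # scaled query-key inner products
weights = F.softmax(scores, dim=-1)            # each row sums to 1
attn_out = weights @ V                         # weighted combination of value vectors

# Residual connection followed by layer normalization; in a multi-head setup the
# concatenated head outputs would first be multiplied by an output matrix W^O.
out = F.layer_norm(X + attn_out, normalized_shape=(d_model,))
```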
Reviews¶
The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture provides a thorough exposition of transformer technology.
Useful References and Research¶
General Introductions¶
Seminal Research¶
- Neural Machine Translation by Jointly Learning to Align and Translate: First paper introducing the notion of an 'attention' mechanism.
- Attention Is All You Need: The original Transformer paper, showing that attention alone, without recurrence, is powerful enough to replace earlier sequence-to-sequence architectures.
- Formal Algorithms for Transformers: A precise, algorithm-level description of the components of Transformers.
Modifications¶
- A Simple yet Effective Learnable Positional Encoding Method for Improving Document Transformer Model: Introduces a learnable sinusoidal positional encoding feed-forward network, demonstrating significant improvements across several document datasets.
Enhancements and Variations¶
Context Length Improvements¶
In their vanilla state, Transformers are \(O(N^2)\) in self-attention computation with respect to sequence length. This makes long context lengths increasingly costly to train and generate with. Improvements in context length, for both training and generation, have found ways to work around these limits. While there is ample research in this domain, we present a few of the most successful methods. They improve computational complexity in one of several ways (a sketch of a banded/wedge attention mask follows the list):
- Introducing sparsity in the attention pattern that is:
- Banded or fixed, reducing the full computation
- Hierarchical
- Wedge-shaped: a banded window that also keeps the observably important first tokens.
- Inclusion of a recurrent, RNN-style mechanism that permits memory to be retained.
- Memory retrieval systems.
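A minimal sketch of a banded (sliding-window) mask with a 'wedge' that also keeps the first few tokens; the window size and the number of kept initial tokens are illustrative choices rather than values from any specific paper:

```python
import torch

T, window, n_sink = 12, 4, 2                   # illustrative sizes

i = torch.arange(T).unsqueeze(1)               # query positions (rows)
j = torch.arange(T).unsqueeze(0)               # key positions (columns)

causal = j <= i                                # never look ahead
banded = causal & (i - j < window)             # only the most recent `window` tokens
wedge = banded | (causal & (j < n_sink))       # also keep the first few tokens

# Each row of `wedge` marks which keys a query may attend to; applying it like a
# causal mask reduces per-token cost from O(N) to O(window + n_sink).
```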
HyperAttention: Long-context Attention in Near-Linear Time
Developments: The authors reveal a new method of attention that allows for very long context lengths, which they call 'HyperAttention'. The algorithm (1) finds the larger entries in the attention matrix using sorted locality-sensitive hashing, and then (2) performs column subsampling, rearranging the matrices to provide a block-diagonal approximation.
Results: While not without a tradeoff for perplexity, the speedup for long context lengths can be considerable.
Generating Long Sequences with Sparse Transformers provides simple solutions to generate longer sequences.
Scaling Transformer to 1M tokens and beyond with RMT Uses a Recurrent Memory Transformer (RMT) architecture to extend understanding to very long input lengths.
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
MEGABYTE segments sequences into patches and uses a local submodel within patches and a global model between patches. This allows for \(O(N^{4/3})\) scaling directly on bytes, thereby bypassing tokenization requirements found with traditional transformers.
An open-source version made by lucidrains: Megabyte Github implementation for PyTorch
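The \(O(N^{4/3})\) scaling quoted for MEGABYTE can be motivated with a rough cost model; the accounting below is a simplification for intuition, not the paper's exact analysis:

```python
# Rough attention-cost model for patch-based modeling of N bytes with patch size P:
# a local model pays ~P^2 per patch over N/P patches, and a global model pays
# ~(N/P)^2 over the sequence of patch embeddings.
def approx_cost(n_bytes: int, patch_size: int) -> float:
    local = (n_bytes / patch_size) * patch_size**2      # = N * P
    global_ = (n_bytes / patch_size) ** 2               # = N^2 / P^2
    return local + global_

N = 1_000_000
P = round(N ** (1 / 3))        # choosing P ~ N^(1/3) balances the two terms
print(approx_cost(N, P))       # ~2 * N^(4/3), far below the ~N^2 of full attention
```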
Infinite Former Uses a representation of the input sequence as a continuous signal expressed as a combination of N radial basis functions.
LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models provides an \(O(n)\) time/space extension allowing LLMs to go to 32k tokens with a 2.7x speedup.
Selective Attention Improves Transformer
The authors present a method to reduce unneeded attention paid to tokens whose value diminishes as further context arrives. Their 'selective attention' lets tokens that are no longer needed be masked out of future attention, which improves performance and can reduce memory requirements.
Advanced Transformer Blocks¶
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Developments: The authors reveal in their paper a variation of the transformer that yields improved results by introducing 'Depth Weighted Averaging', which after each layer averages (i) the output of the current block \(B_i\), (ii) the outputs of all previous blocks \(B_{j<i}\), and (iii) the embedded input \(X_0\). A minimal sketch follows.
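A minimal sketch of the depth-weighted-averaging idea; the module name, shapes, and the identity-style initialization are assumptions for illustration rather than the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class DepthWeightedAverage(nn.Module):
    """Mix the current block's output with all earlier block outputs and the input X_0."""

    def __init__(self, n_inputs: int):
        super().__init__()
        # One learnable weight per representation in the running history.
        self.alpha = nn.Parameter(torch.zeros(n_inputs))
        self.alpha.data[-1] = 1.0                  # start as identity: keep only B_i's output

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history = [X_0, B_1 output, ..., B_i output], each of shape (T, d_model)
        stacked = torch.stack(history)             # (n_inputs, T, d_model)
        return torch.einsum("k,ktd->td", self.alpha, stacked)
```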
Diff-Transformer
Developments: The authors introduce the 'Diff Transformer', whose differential attention mechanism amplifies attention to the relevant context while canceling noise. It does so by calculating attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that DIFF Transformer outperforms the Transformer in various settings of scaling up model size and training tokens.
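A minimal single-head sketch of the differential-attention idea, ignoring the paper's multi-head normalization and learned \(\lambda\) re-parameterization; names and sizes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def differential_attention(X, W_Q1, W_K1, W_Q2, W_K2, W_V, lam=0.5):
    """Scores are the difference of two softmax maps, which cancels common-mode
    'noise' attention and encourages sparse attention patterns."""
    d_k = W_K1.shape[1]
    A1 = F.softmax((X @ W_Q1) @ (X @ W_K1).T / math.sqrt(d_k), dim=-1)
    A2 = F.softmax((X @ W_Q2) @ (X @ W_K2).T / math.sqrt(d_k), dim=-1)
    return (A1 - lam * A2) @ (X @ W_V)

T, d_model, d_k = 8, 64, 32                     # illustrative sizes
X = torch.randn(T, d_model)
proj = [torch.randn(d_model, d_k) for _ in range(4)]   # Q1, K1, Q2, K2 projections
W_V = torch.randn(d_model, d_model)
out = differential_attention(X, *proj, W_V)
```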
Computation Reduction¶
Simplified Transformers removes the 'value' parameter set to increase speed by 14% with potentially minimal accuracy reduction.
Herein the authors reveal a variation of transformers that removes the 'value' parameters to yield notable speed gains at essentially the same performance level. Paper
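A minimal sketch of attention with the value projection fixed to the identity, which is one way to read removing the 'value' parameter set; this illustrates the idea rather than reproducing the paper's exact block:

```python
import math
import torch
import torch.nn.functional as F

# With W_V fixed to the identity, the softmax weights mix the block inputs X directly,
# saving the parameters and compute of a learned value projection.
T, d_model = 8, 64
X = torch.randn(T, d_model)
W_Q, W_K = torch.randn(d_model, d_model), torch.randn(d_model, d_model)

weights = F.softmax((X @ W_Q) @ (X @ W_K).T / math.sqrt(d_model), dim=-1)
out = weights @ X          # no learned value matrix: X itself plays the role of V
```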
Other Modalities¶
Vision¶
Graphs¶
Transformers Meet Directed Graphs introduces a variation of Transformer GNNs that uses 'direction-aware' positional encodings to help handle both undirected and directed graphs
Multimodal¶
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
In this work, we present VistaLLM, the first general-purpose vision model that addresses coarse- and fine-grained vision-language reasoning and grounding tasks over single and multiple input images. We unify these tasks by converting them into an instruction-following sequence-to-sequence format. We efficiently transform binary masks into a sequence of points by proposing a gradient-aware adaptive contour sampling scheme, which significantly improves over the naive uniform sampling technique previously used for sequence-to-sequence segmentation tasks.
- Visual GPT
- Language is not all you need
- Meta-Transformer: A Unified Framework for Multimodal Learning: The first framework to perform unified learning across 12 modalities with unpaired data. It does so by learning an embedding that can be shared across the modalities. Github
Graph¶
Code¶
Hugging Face Transformers An API to access a large number of pre-trained transformers. PyTorch based.
Fast Transformers A quality collection of transformer implementations written in PyTorch.
Theory and Experiment¶
A Mathematical Perspective on Transformers
We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time.
Abstract Uses¶
Looped Transformers and Programmable Computers Shows that transformer networks can simulate complex algorithms when hardcoded with specific weights and placed in a loop.
"We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, function calls, program counters, and conditional branches. Using these building blocks, we emulate a small instruction-set computer."