Skip to content

Transformers

Transformers are a powerful type of architecture that allows input sequences to be considered with the whole input context. They are built on the self-attention mechanism, which performs an \(O(N^2)\) computation on the input sequence. In continued stacks, this provides the ability to represent relations between inputs at different levels of abstraction.

Transformers can be used in three general ways: encoder-only, decoder-only, and encoder-decoder.

Types of Transformer Architectures

Encoder-Only Networks

In encoder-only networks, like BERT, the entire input text is used. These networks are primarily useful for output classification tasks (sequence-to-value).

Encoder-Decoder Networks

As described in the original Transformer attention paper, encoder-decoder networks are used to convert sequences to sequences for tasks like language translation. In these systems, an encoder first projects information based on the input, generates new outputs, and these new outputs are used in a recurrent fashion to generate subsequent outputs.

Decoder-Only Networks

In decoder-only networks, like GPT the model performs next-token predictions, requiring information only from previously seen words/tokens. The outputs are estimates of the probability of the next word/token. While next-token prediction is singular, this can happen iteratively, and with the proper prompting, the generation of output sequences can perform a variety of sequence-to-sequence tasks, such as language translation.

Key Components of Transformers

Attention Mechanism

  • Attention: The token being predicted is mapped to a query vector, and tokens in context are mapped to key and value vectors. Inner products are used to combine and extract information.
  • Bi-directional / Unmasked: Allows the model to attend to all tokens in the sequence.
  • Unidirectional / Masked Self-Attention: Ensures that predictions for a token can only depend on previous tokens.
  • Cross-Attention: Applies attention to the primary sequence and treats the second token sequence as the context.
  • Multi-Head Attention: Multiple attention heads operate in parallel.
  • Layer Normalization: Found to be computationally efficient, often using root mean square layer normalization (RMSnorm).
  • Unembedding: Learns to convert vectors into vocabulary elements.

Positional Encoding

Standard embeddings are position-invariant, meaning the position of the token/word in the input has little importance. Positional embeddings are used to encode the position of tokens/words, generally using additive, varying sinusoids, or trainable parameters.

Layer Normalization

Layer normalization observably improves results. For more details, see On Layer Normalization in the Transformer Architecture.

Visualizing The Structures

Visualizing Large Transformers

A very interesting visual representation of transformers. image

Detailed Components

  1. Positional Encoding
  2. Attention: Query, Key, Vectors
  3. Layer Normalization

Initially, the word or subword is broken and represented as a lookup key to find an 'embedding'. This can be trained alongside transformer models or pre-trained from other models. It provides a vector representation of the input word.

To allow the token embedding to attend or share information with the other inputs, calculate a self-attention matrix. In a series of input token-embeddings, there is an attention query:

  1. A Query matrix \(W^Q\)
  2. A Key matrix \(W^K\)
  3. A Value matrix \(W^V\)

For each token/word \(i\), the embedding is multiplied by this matrix to yield a query vector, a key vector, and a value vector, \(Q_i\), \(K_i\), and \(V_i\).

Each query vector is then multiplied by each key vector, resulting in matrix computation \(Q*V\). Because the key-query is supposed to describe how important an input combination is, it is then normalized by the dimension of the values to allow for similar behavior for different dimensions, and then passed through a soft-max function:

\[ \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \]

This is then multiplied by the value matrix to provide the attention output:

\[ Z_{\text{head } i} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) V \]

Multiple attention heads can be combined by stacking and concatenating them together and then multiplying by a final matrix that will produce a final output:

\[ Z = \text{concat}(Z_i) \cdot W^O \]

Finally, this matrix is combined with input values to have a residual connection, and the layer is normalized. This matrix can be passed to additional layers or a final fully-connected projection layer.

Reviews

The Illustrated Transformer
The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture provides a thorough exposition of transformer technology.

Useful References and Research

General Introductions

Seminal Research

Modifications

Enhancements and Variations

Context Length Improvements

In its vanilla state, Transformers are \(O(N^2)\) in their computation with self-complexity. This makes long context lengths increasingly costly to train and generate. Improvements in context length, for both training and generation, have found ways to generally work around these limits. While there is ample research in this domain, we present a few of the most successful methods. They improve computation complexity in one of several ways:

  • Introducing sparsity that is:
  • Banded or fixed
  • Hierarchical
  • Banded to reduce full computation
  • Wedge-shaped with a banded window that also takes into account observably important first tokens.
  • Inclusion of a recursive RNN-style that permits memory to be retained.
  • Memory retrieval systems.
GitHub Repo stars HyperAttention: Long-context Attention in Near-Linear Time

Developments: The authors reveal a new method of attention that allows for very-long context lengths, which they call 'hyperattention'. This algorithm finds (1) larger entries in the attention matrix using sorted locality sensitive hashing, and then performs column subsampling to rearrange the matrices to provide block-diagonal approximation.

image image

Results: While not without a tradeoff for perplexity, the speedup for long context lengths can be considerable. image image

Paper

Generating Long Sequences with Sparse Transformers provides simple solutions to generate longer sequences.

image

GitHub Repo stars Hierarchical Attention

Paper

Scaling Transformer to 1M tokens and beyond with RMT Uses a Recurrent Memory Transformer (RMT) architecture to extend understanding to large lengths.

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

MEGABYTE segments sequences into patches and uses a local submodel within patches and a global model between patches. This allows for \(O(N^{4/3})\) scaling directly on bytes, thereby bypassing tokenization requirements found with traditional transformers.

image

An open-source version made by lucidrains: Megabyte Github implementation for PyTorch

GitHub Repo stars Infinite Former Uses a representation of the input sequence as a continuous signal expressed in a combination of N radial basis functions.

Paper Infinity Former

LM-INfinite: Simple On-the-Fly Length Generalization for Large Language Models provides an O(n) time/space extension allowing LLMs to go to 32k tokens and 2.7x speedup.

image image image

GitHub Repo stars Efficient Streaming Language Models with Attention Sinks

Paper image

📋
Selective Atteion Improves Transformer

The authors present a solution to minimize unecessary attention given to information based on updated understanding of its value. They create 'selective attention' that helps to ensure information is may not need to attend to other areas.

image

image

Advanced Transformer Blocks

GitHub Repo stars DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Developments: The authors reveal in their paper a variation of the transformer that yields improved results by introducing 'Depth Weighted Averaging' that averages weights at layer (i) with the output from the current block \(B_i\) (ii) the output of all previous blocks \(B_{j<i}\), and (iii) the embedded input \(X_0\). image image

Computation Reduction

GitHub Repo stars Simplified Transformers that removes the 'value' parameter-set to increase speed by 14% with potentially minimal accuracy reduction

Herein the authors reveal a variation of transformers that removes the 'value' parameter to yield notable speed gains at the same performance level. image Paper

Other Modalities

Vision

Graphs

Transformers Meet Directed Graphs introduces a variation of Transformer GNNs that uses 'direction-aware' positional encodings to help handle both undirected and directed graphs

image

Multimodal

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

image

In this work, we present VistaLLM, the first general-purpose vision model that addresses coarse- and fine-grained vision-language reasoning and grounding tasks over single and multiple input images. We unify these tasks by converting them into an instruction-following sequence-to-sequence format. We efficiently transform binary masks into a sequence of points by proposing a gradient-aware adaptive contour sampling scheme, which significantly improves over the naive uniform sampling technique previously used for sequence-to-sequence segmentation tasks.

Graph

Code

Hugging Face Transformers An API to access a large number of pre-trained transformers. Pytorch based.
Fast Transformers A quality collection of a number of transformer implementations written in Pytorch.

Theory and Experiment

A MATHEMATICAL PERSPECTIVE ON TRANSFORMERS

We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clustersemerge in long time.

Abstract Uses

Looped Transformers and Programmable Computers Understanding that transformer networks can simulate complex algorithms when hardcoded with specific weights and made into a loop.

'Machine Learning' 'Machine code'. "We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, function calls, program counters, and conditional branches. Using these building blocks, we emulate a small instruction-set computer."