Diff-Transformer

Developments: The authors introduce DIFF Transformer, which replaces standard attention with a differential attention mechanism that amplifies attention to relevant context while canceling noise. Attention scores are computed as the difference between two separate softmax attention maps; the subtraction cancels common noise and promotes the emergence of sparse attention patterns. Experimental results on language modeling show that DIFF Transformer outperforms the standard Transformer across various settings of model size and number of training tokens.
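
Below is a minimal single-head sketch of the differential attention idea in PyTorch, assuming a simplified learnable scalar `lambda_`; the paper re-parameterizes λ with learnable vectors and uses a multi-head layout, which is omitted here for brevity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    """Single-head differential attention (illustrative sketch)."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Project to two sets of queries/keys by splitting the projections.
        self.w_q = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_k = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_v = nn.Linear(d_model, 2 * d_head, bias=False)
        # Simplified: a single learnable scalar lambda (assumption; the paper
        # derives lambda from learnable vectors plus a depth-dependent init).
        self.lambda_ = nn.Parameter(torch.tensor(lambda_init))
        self.d_head = d_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.w_q(x).chunk(2, dim=-1)  # two query maps
        k1, k2 = self.w_k(x).chunk(2, dim=-1)  # two key maps
        v = self.w_v(x)                        # shared values
        scale = 1.0 / math.sqrt(self.d_head)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # Differential attention: subtracting the second softmax map cancels
        # common-mode noise and yields sparser attention patterns.
        return (a1 - self.lambda_ * a2) @ v

# Example usage
x = torch.randn(2, 16, 64)
out = DiffAttention(d_model=64, d_head=32)(x)
print(out.shape)  # torch.Size([2, 16, 64])
```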
