AMPLIFY Protein Language Model
The authors show in their paper that they can train ESM-style models (with architectural modifications) that are markedly more performant. They use different datasets with better filtering and validation-set selection, and they use FlashAttention. Together, these changes make their 350M model as performant as the 15B ESM model. They also evaluate with pseudo-perplexity, which masks each token of a sequence one at a time (rather than random masking) and measures how well the model recovers it; a sketch of this metric follows the list below. They also show the effect of retraining the same models (ESM and AMPLIFY) on UniRef data.

Differences with ESM

* They used SwiGLU activation (instead of GELU) and RMSNorm (see the sketch below).
* They used a reduced number of attention heads.
* They used AdamW optimization (not Adam).
* They trained in bf16 using DeepSpeed and model sharding.
* They streamlined the vocabulary, removing unused tokens.
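A minimal sketch of the pseudo-perplexity metric mentioned above, assuming a Hugging Face-style masked language model and tokenizer (the `model`, `tokenizer`, and sequence arguments are illustrative placeholders, not taken from the AMPLIFY code): each residue is masked in turn, and the negative log-likelihoods of the true tokens are averaged and exponentiated.

```python
import math

import torch
import torch.nn.functional as F


def pseudo_perplexity(model, tokenizer, sequence: str) -> float:
    """Pseudo-perplexity: mask each residue one at a time (instead of
    random masking) and average the negative log-likelihood of the true token."""
    model.eval()
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    nlls = []
    with torch.no_grad():
        # positions 0 and -1 are special tokens (e.g. CLS/EOS), so skip them
        for pos in range(1, input_ids.shape[1] - 1):
            masked = input_ids.clone()
            true_id = masked[0, pos].item()
            masked[0, pos] = tokenizer.mask_token_id
            logits = model(input_ids=masked).logits
            log_probs = F.log_softmax(logits[0, pos], dim=-1)
            nlls.append(-log_probs[true_id].item())
    return math.exp(sum(nlls) / len(nlls))
```

With a masked protein LM loaded via `transformers` (e.g. `AutoModelForMaskedLM`), a lower pseudo-perplexity means the model assigns higher likelihood to each held-out residue given the rest of the sequence.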
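To make the first architectural difference concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block as they are commonly implemented; the dimensions, module names, and the absence of biases are assumptions for illustration, not details taken from the AMPLIFY release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features (no mean centering, unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit in place of the usual GELU MLP."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```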
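And a minimal sketch of the training choices in the last two bullets, assuming plain PyTorch: AdamW plus bf16 autocast on a toy stand-in model. The DeepSpeed integration and model sharding are omitted, and the model, data, and hyperparameters here are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(128, 128).to(device)  # toy stand-in for the transformer
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # AdamW, not Adam

x = torch.randn(8, 128, device=device)
target = torch.randn(8, 128, device=device)

for step in range(10):
    optimizer.zero_grad()
    # bf16 autocast for the forward pass; weights and optimizer state stay in fp32
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        out = model(x)
    loss = nn.functional.mse_loss(out.float(), target)  # loss computed in fp32
    loss.backward()
    optimizer.step()
```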