AMPLIFY Protein Language Model

The authors show in their paper that they can train highly performant ESM-style models (with modifications) at a much smaller scale. They use different datasets with better filtering and validation selection, and they use FlashAttention. Together, these changes make their 350M model about as performant as the 15B ESM model. They evaluate with pseudo-perplexity, which masks each token of a sequence one at a time (non-random masking) and measures how well the model recovers it; a rough sketch appears after the list below. They also report results from retraining the same models (ESM and AMPLIFY) on UniRef data to compare the effect of the training data.

Differences with ESM:

* They used SwiGLU activation instead of GELU, and RMSNorm instead of LayerNorm.
* They used a reduced number of attention heads.
* They used AdamW optimization (not Adam).
* They trained in bf16 using DeepSpeed and model sharding.
* They streamlined the vocabulary, removing unused tokens.
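As a rough illustration of the pseudo-perplexity metric mentioned above, here is a minimal PyTorch sketch that masks each position one at a time and averages the negative log-likelihood of the true residue. It assumes a HuggingFace-style masked language model and tokenizer; the function name and the one-forward-pass-per-position loop are my own illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def pseudo_perplexity(model, tokenizer, sequence, device="cpu"):
    """Mask each residue one at a time (non-random masking) and average
    the negative log-likelihood of recovering the true token."""
    enc = tokenizer(sequence, return_tensors="pt").to(device)
    input_ids = enc["input_ids"][0]
    mask_id = tokenizer.mask_token_id
    nlls = []
    # Skip the first and last positions, assumed to be special tokens (BOS/EOS).
    for i in range(1, len(input_ids) - 1):
        masked = input_ids.clone()
        true_token = masked[i].item()
        masked[i] = mask_id                      # mask exactly one position
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = F.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[true_token].item())
    return float(torch.exp(torch.tensor(nlls).mean()))
```

Note that this does one forward pass per residue, so it is slow but faithful to the definition: lower pseudo-perplexity means the model is better at filling in each masked position.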

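For the first bullet in the list, here is a minimal sketch of SwiGLU and RMSNorm in PyTorch. These are the generic formulations of the two components, not code from the AMPLIFY repository, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the activations (no mean centering)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit replacing the usual GELU MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```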