Large language models generate functional protein sequences across diverse families
In their paper, the authors demonstrate the ability to generate functional proteins across a wide variety of protein families. The model uses property-conditional generation, so that generated sequences are conditioned on control tags such as protein family, biological process, and molecular function. The models are trained with a next-amino-acid prediction objective. With models fine-tuned to different lysozyme families, the generated proteins showed catalytic efficiencies similar to natural enzymes, high expression rates (40-50%), and activity even with sometimes much lower sequence identity to known natural proteins.

Conditional Language Modeling

They achieve this by concatenating the control tags c with the protein sequence a into a single sequence x = [c; a] and performing next-token prediction over the concatenated sequence.
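A minimal sketch of this training setup is shown below, assuming a toy vocabulary in which a few control tags are prepended to the 20 amino-acid tokens. The tag names, the tiny Transformer dimensions, and the example sequence are illustrative assumptions, not the paper's actual tokenizer or architecture; they only show the mechanics of building x = [c; a] and computing a next-token loss.

```python
import torch
import torch.nn as nn

# Toy vocabulary: hypothetical control tags followed by the 20 standard amino acids.
CONTROL_TAGS = ["<family:lysozyme>", "<process:catabolic>", "<function:hydrolase>"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
vocab = CONTROL_TAGS + AMINO_ACIDS
tok2id = {t: i for i, t in enumerate(vocab)}

def encode(tags, sequence):
    """Concatenate control tags and amino acids into one token sequence x = [c; a]."""
    return torch.tensor([tok2id[t] for t in tags] + [tok2id[aa] for aa in sequence])

# Any autoregressive decoder works here; a tiny causally masked Transformer
# stands in for the much larger model used in the paper.
class TinyConditionalLM(nn.Module):
    def __init__(self, vocab_size, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Causal mask so position t only attends to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(self.embed(x), mask=mask)
        return self.head(h)

# Next-token prediction: the model predicts token t+1 from tokens <= t,
# so targets are the input shifted by one position.
x = encode(CONTROL_TAGS, "KVFGRCELAA").unsqueeze(0)  # batch of one sequence
model = TinyConditionalLM(len(vocab))
logits = model(x[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), x[:, 1:].reshape(-1)
)
loss.backward()
print(float(loss))
```

At generation time the same idea runs in reverse: the desired control tags are fed as the prompt, and amino acids are sampled one at a time from the model's next-token distribution.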