Protein Optimization Using AI¶
Generating or modifying protein sequences to improve existing behavior or create novel behavior is a powerful application of AI. Guided by evolutionary techniques, Bayesian optimization, and/or protein language models (PLMs), AI can vastly accelerate the development of biotechnological tools and identify targets and avenues for therapeutics. Because of their ability to represent the 'language of proteins,' PLMs are increasingly important for predicting the structure and function of proteins.
Where to start?¶
There are two general approaches to optimizing proteins: mutagenic and de novo. In mutagenic protein optimization, a target protein is identified and altered to fulfill the target requirements. In de novo protein generation, protein sequences are created without direct seeding from an initial target protein. De novo generation is generally more difficult because generated sequences have not been shaped by evolutionary pressures and so may be disfavored in practice, but de novo designs offer a degree of freedom and flexibility beyond directly evolution-derived sequences.
Targets¶
There are a number of targets that protein optimization can focus on. Some targets primarily serve basic understanding, such as protein structure, while others relate to function, though structure is generally considered to enable function.
In the canonical chain of causal influence, the source gives rise to a sequence, the sequence folds into a structure, and the structure enables the function (source → sequence → structure → function). We can generally compartmentalize targets along these lines, though there is some crossover between them.
- Source
- Sequence
- Alignment
- Remote homology: similar function or structure
- Structure
- Contact prediction
- Secondary and tertiary structure
- (mis)Folding (missense)
- Function
- Enzymatic Catalysis: The ability of an enzyme to accelerate chemical processes
- Thermocompatibility or thermostability: how well a protein remains stable or functional at varying temperatures
- Fluorescence for visualization purposes
- Protein Binding to...
- Proteins
- Nucleic Acids
- Drug molecules
- Metals
Though these classes often cross, they capture the essential targets for protein optimization.
Components¶
Protein optimization can be broken down into several components1:
- Target Property: The intended goal(s) for protein development.
- Fitness Predictor: Uses sequence information to estimate the value of the optimization target, as a surrogate for laboratory measurement.
- Sequence Proposer: Creates sequences to evaluate and explore.
- Prioritizer: Uses sequence and predictor information to estimate the top candidates.
- Laboratory Measurements: Reveal the quality of the generated proteins based on the targets.
- Orchestrator: Puts the pieces together in a functional and validated manner.
Optimization systems combine these components into full solutions in two general ways:
- A model that separates generation and evaluation steps, where the predictor model evaluates the quality of an input set of sequences (generated or otherwise defined).
- A model that directly predicts the best designs using adaptive sampling, proposing solutions, evaluating them with the predictor model, and then iterating.
These components can be seen in the box below:
Fitness Prediction¶
Training a fitness model may first involve training an unsupervised foundation model on a high volume of data. These models can then be fine-tuned, or otherwise adapted, on protein sequences of higher relevance to the protein targets of interest.
Learning protein fitness models from evolutionary and assay-labeled data
In their paper, the authors combine ridge regression with (protein) language models and show that this combination effectively predicts fitness from both evolutionary and assay-labeled data.
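As a rough illustration of this style of fitness predictor, the sketch below fits a cross-validated ridge regression on fixed per-sequence embeddings. The embeddings and labels are random placeholders standing in for PLM representations and assay measurements; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: in practice, X would be mean-pooled PLM embeddings
# (one row per protein sequence) and y would be assay-measured fitness.
n_sequences, embed_dim = 200, 1280
X = rng.normal(size=(n_sequences, embed_dim))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n_sequences)  # synthetic fitness

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge regression with a cross-validated regularization strength acts as a
# simple, hard-to-overfit fitness surrogate on top of the frozen embeddings.
model = RidgeCV(alphas=np.logspace(-3, 3, 13))
model.fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```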
Strategy¶
Protein optimization necessarily involves creating the proteins and evaluating their target characteristics. There are many databases of various forms that may be useful for creating foundation models. It remains essential to use continued observation to improve the optimization target based on predicted and iterated feedback.
The volume of observations helps determine which architectures one can use. Base models tend to be PLMs because of the large amount of available sequence data. Unsupervised fine-tuning of those large models can be performed on homologous or family-related sequence sets. Final targets may then be optimized with simple networks, often regression models chosen to minimize overfitting, or with Bayesian or evolutionary approaches.
To successfully deliver on final target optimization, the greater the quantity of direct or surrogate data that can be obtained, the better the resulting models will predict the fitness of future protein sequence candidates. That is why massive screening approaches, such as Ginkgo's platform described below, screen thousands of candidates.
An example process by Ginkgo
Ginkgo shows that, with foundry-scale screening of thousands of samples, they were able to create an enzyme with a 10x improvement over the starting point. In their design process they use structure-based (differential) estimates via Rosetta, evolutionary-scale modeling (PLMs), active-site-focused evolutionary models, as well as an in-house method called 'OWL.'
When proposed sequences can be measured iteratively, the new data can be used to improve subsequent sequence predictions. Selection can be done greedily, choosing the best predicted solutions, or with probabilistic methods such as [Bayesian Optimization], which searches for a protein that optimizes a target by combining estimated values with their uncertainties. Greedily selecting the sequences with the highest predicted target values is simple but can easily fail due to incorrect estimates from the predictor model. Alternatively, upper confidence bound (UCB) acquisition selects sequences based on the sum of the predicted target value and the predicted target uncertainty.
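As a small illustration of the acquisition step, the sketch below ranks candidates by a UCB score; the function and variable names are illustrative rather than taken from any specific library, and the predictive mean and uncertainty would come from the fitness model (e.g., an ensemble) in practice.

```python
import numpy as np

def ucb_select(mu, sigma, beta=2.0, k=10):
    """Rank candidate sequences by an upper-confidence-bound score.

    mu    : predicted fitness for each candidate
    sigma : predictive uncertainty (e.g., std. dev. across an ensemble)
    beta  : exploration weight; beta=0 recovers greedy selection
    k     : number of candidates to send to the lab
    """
    scores = mu + beta * sigma
    return np.argsort(scores)[::-1][:k]

# Toy example: 1000 candidates scored by a surrogate model.
rng = np.random.default_rng(1)
mu = rng.normal(size=1000)
sigma = rng.uniform(0.1, 1.0, size=1000)
picked = ucb_select(mu, sigma, beta=2.0, k=5)
print(picked)
```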
Fine-tuning protein language models boosts predictions across diverse tasks
The authors compared fine-tuning ESM2, ProtT5, and Ankh on different tasks. They found that supervised fine-tuning improves downstream predictions, and that parameter-efficient fine-tuning (PEFT) gives similar improvements at much lower cost. They also suggest that tuning only the final layer is not ideal.
Evaluation of Machine Learning-Assisted Directed Evolution Across Diverse Combinatorial Landscapes
Abstract
Various machine learning-assisted directed evolution (MLDE) strategies have been shown to identify high-fitness protein variants more efficiently than typical wet-lab directed evolution approaches. However, limited understanding of the factors influencing MLDE performance across diverse proteins has hindered optimal strategy selection for wet-lab campaigns. To address this, we systematically analyzed multiple MLDE strategies, including active learning and focused training using six distinct zero-shot predictors, across 16 diverse protein fitness landscapes. By quantifying landscape navigability with six attributes, we found that MLDE offers a greater advantage on landscapes which are more challenging for directed evolution, especially when focused training is combined with active learning. Despite varying levels of advantage across landscapes, focused training with zero-shot predictors leveraging distinct evolutionary, structural, and stability knowledge sources consistently outperforms random sampling for both binding interactions and enzyme activities. Our findings provide practical guidelines for selecting MLDE strategies for protein engineering.
Sequence Proposer¶
With a fitness predictor made available, the next step is to create proposal sequences that may be evaluated with the predictor model, or potentially with direct measurement.
One way of doing this is to use generative models. Generative models can be built by taking probabilistic outputs from a model and randomly sampling to determine the amino acids in a sequence. This can be done with causal language models (CLMs), like GPT, where tokens attend only to prior tokens, or with masked language models (MLMs), which can attend to the entire sequence. With a CLM, sequences are generated sequentially, optionally seeded with a starting fragment of the target sequence or even a natural language prompt, as in models like ProGen. With an MLM, sequences can be generated using several techniques.
These methods include:
- Activation maximization, which optimizes an input sequence so that it maximizes the output of a given predictor model.
- Iterative masking, where masked positions are repeatedly re-predicted until the generated sequence remains stationary.
- Markov chain Monte Carlo, which iteratively proposes and evaluates mutations to improve designs.
Iterative Masking¶
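A minimal sketch of iterative masked sampling is shown below, assuming the Hugging Face transformers library and the small public ESM-2 checkpoint facebook/esm2_t6_8M_UR50D; the masking schedule and iteration budget are illustrative, and in practice one would iterate until the sequence stops changing.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed: the `transformers` library and the public ESM-2 checkpoint below;
# any masked protein language model could be substituted.
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

seed = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"            # starting sequence
input_ids = tokenizer(seed, return_tensors="pt")["input_ids"]
generator = torch.Generator().manual_seed(0)

# Restrict sampling to the 20 standard amino-acid tokens.
aa_ids = torch.tensor(tokenizer.convert_tokens_to_ids(list("ACDEFGHIKLMNPQRSTVWY")))

with torch.no_grad():
    for _ in range(100):                               # illustrative iteration budget
        # Pick a random residue position (skip the BOS/EOS special tokens).
        pos = torch.randint(1, input_ids.shape[1] - 1, (1,), generator=generator).item()
        input_ids[0, pos] = tokenizer.mask_token_id    # mask that position
        logits = model(input_ids=input_ids).logits[0, pos]
        probs = torch.softmax(logits[aa_ids], dim=-1)
        new_token = aa_ids[torch.multinomial(probs, 1, generator=generator)].item()
        input_ids[0, pos] = new_token                  # refill with a sampled residue

print(tokenizer.decode(input_ids[0, 1:-1]).replace(" ", ""))
```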
Activation Maximization¶
SeqProp: Stochastic Sequence Propagation - A Keras Model for optimizing DNA, RNA and protein sequences based on a predictor.
The authors reveal in their paper and arXiv preprint a method to optimize biological sequences against a predictor model. They use trainable logits that can be sampled from, stabilized with instance normalization. The accompanying code is a Python API for constructing generative DNA/RNA/protein sequence PWM models in Keras, implementing a PWM generator (with support for discrete sampling and straight-through gradient estimation), a predictor model wrapper, and a loss model.
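A generic PyTorch sketch of activation maximization over trainable sequence logits with a straight-through estimator is shown below; the toy predictor is a stand-in for a trained fitness model, and this is not SeqProp's Keras implementation.

```python
import torch
import torch.nn as nn

L, A = 50, 20                      # sequence length, amino-acid alphabet size

# Stand-in differentiable predictor: in practice this is a trained fitness
# model; here it just scores sequences against a random "preferred" profile.
target_profile = torch.randn(L, A)
def predictor(one_hot):            # one_hot: (L, A)
    return (one_hot * target_profile).sum()

logits = nn.Parameter(torch.zeros(L, A))    # trainable sequence logits
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    probs = torch.softmax(logits, dim=-1)
    # Straight-through trick: the forward pass uses a discrete one-hot sample,
    # the backward pass uses the gradient of the underlying probabilities.
    idx = torch.multinomial(probs, 1).squeeze(-1)
    hard = torch.nn.functional.one_hot(idx, A).float()
    one_hot = hard + probs - probs.detach()
    loss = -predictor(one_hot)     # maximize the predicted fitness
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

designed = "".join("ACDEFGHIKLMNPQRSTVWY"[int(i)] for i in logits.argmax(-1))
print(designed)
```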
Protein sequence design by conformational landscape optimization
The authors propose a Bayesian approach to optimizing a protein structure to yield a residue sequence. They use a loss of the form \(\mathrm{Loss} = -\log P(\text{contacts} \mid \text{sequence}) + D_{KL}(f_{20} \,\|\, f_{20}^{PDB})\), where \(D_{KL}\) is the Kullback-Leibler divergence, \(f_{20}\) is the average frequency of amino acids in the sequence, and \(f_{20}^{PDB}\) is the average frequency of amino acids in proteins in the PDB. Paper
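The regularization term of this loss, the KL divergence between a designed sequence's amino-acid frequencies and a PDB background distribution, can be computed as in the sketch below; the uniform background used here is a placeholder for the actual PDB frequencies, and the contact likelihood term would come from a structure model.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(sequence, eps=1e-6):
    """Empirical amino-acid frequencies of a sequence (with smoothing)."""
    counts = np.array([sequence.count(a) for a in AA], dtype=float) + eps
    return counts / counts.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

# Placeholder background: in the paper this is the average amino-acid
# frequency over proteins in the PDB; uniform is used here for illustration.
f20_pdb = np.full(len(AA), 1.0 / len(AA))

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
regularizer = kl_divergence(aa_frequencies(seq), f20_pdb)
# total_loss = -log_p_contacts_given_sequence + regularizer
print(regularizer)
```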
Structure-based scoring and sampling of 'Combinatorial Variant Effects from Structure' (CoVES)
The authors show in their paper (Nature), across 7 different combinatorial mutation studies, the ability to design proteins by exploring the design space without needing a combinatorial number of measurements. They build a model to estimate a residue-preference effect for each amino acid variant at each position and sum these effects to predict combinatorial variants. Simple linear and logistic models using a mutation-effect preference matrix of size 20 (amino acids) x number of residue positions were able to predict the effect of variants. They could then use this model to design sequences via Boltzmann sampling and generate markedly improved variants. The accompanying figure lends credence to the idea that these simple models of important sites can be useful for designing proteins.
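The additive scoring and Boltzmann sampling idea can be sketched as follows; the preference matrix is random here, standing in for the fitted per-position effects, and this is a schematic rather than the authors' code.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)

L = 10                                    # number of variable positions
# Per-position residue-preference scores (the paper's 20 x L matrix, transposed here).
prefs = rng.normal(size=(L, len(AA)))

def score(variant):
    """Additive score: sum of residue preferences at each position."""
    return sum(prefs[i, AA.index(a)] for i, a in enumerate(variant))

def boltzmann_sample(temperature=1.0):
    """Sample each position independently with probability proportional to exp(pref / T)."""
    seq = []
    for i in range(L):
        p = np.exp(prefs[i] / temperature)
        p /= p.sum()
        seq.append(rng.choice(list(AA), p=p))
    return "".join(seq)

for _ in range(5):
    d = boltzmann_sample(temperature=0.5)
    print(d, round(score(d), 2))
```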
Markov Chain Monte Carlo¶
Plug & play directed evolution of proteins with gradient-based discrete MCMC (EvoProtGrad for MCMC)
A Python package for directed evolution on a protein sequence with gradient-based discrete Markov chain Monte Carlo (MCMC) based on the paper, blog, and docs
Low-N protein engineering with data-efficient deep learning
The authors demonstrate a standard pipeline in which a PLM undergoes unsupervised pre-training, is then refined on evolutionarily related sequences, and is finally fine-tuned on assay-labeled sequences. They use a Markov chain Monte Carlo (MCMC) method to propose mutations and iteratively evaluate them to improve designs.
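A minimal Metropolis-style sketch of this mutate-and-evaluate loop is shown below; the fitness function is a toy stand-in for a learned predictor, and the temperature and iteration count are illustrative.

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
random.seed(0)

def fitness(seq):
    """Stand-in for a learned fitness predictor (here: simply counts alanines)."""
    return seq.count("A")

def metropolis_step(seq, temperature=1.0):
    # Propose a random point mutation.
    pos = random.randrange(len(seq))
    proposal = seq[:pos] + random.choice(AA) + seq[pos + 1:]
    delta = fitness(proposal) - fitness(seq)
    # Always accept improvements; accept worse moves with Boltzmann probability.
    if delta >= 0 or random.random() < math.exp(delta / temperature):
        return proposal
    return seq

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
for _ in range(500):
    seq = metropolis_step(seq, temperature=0.5)
print(seq, fitness(seq))
```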
Generative Models¶
Progen2¶
Large language models generate functional protein sequences across diverse families
In their paper the authors demonstrate the ability to generate functional proteins across a wide variety of families. The model uses property-conditional generation, so generated sequences are conditioned on protein family, biological process, and molecular function. The models are trained with next-amino-acid prediction. With models fine-tuned to different lysozyme families, the generated enzymes showed catalytic efficiencies similar to natural versions and high rates of expressed activity (40-50%), sometimes at much lower sequence identity. Conditional language modeling: they achieve this by concatenating a control tag with the protein sequence, \(x = [c; a]\), and performing next-token prediction on the combined sequence.
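A toy sketch of this conditional next-token setup is shown below; the tiny recurrent "causal LM" and the control-tag vocabulary are stand-ins for illustration, not ProGen's architecture, but the concatenation \(x = [c; a]\) and the shifted cross-entropy objective follow the description above.

```python
import torch
import torch.nn as nn

# Toy vocabulary: 20 amino acids plus a couple of control tags (e.g., protein family).
AA = list("ACDEFGHIKLMNPQRSTVWY")
TAGS = ["<lysozyme>", "<hydrolase>"]
vocab = AA + TAGS
stoi = {t: i for i, t in enumerate(vocab)}

class TinyCausalLM(nn.Module):
    """Stand-in causal LM (real conditional PLMs are far larger transformers)."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # causal by construction
        self.head = nn.Linear(dim, vocab_size)
    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)

model = TinyCausalLM(len(vocab))

# x = [c; a]: control tag prepended to the amino-acid sequence.
tag, seq = "<lysozyme>", "MKALIVLGL"
ids = torch.tensor([[stoi[tag]] + [stoi[a] for a in seq]])

# Next-token objective: predict token t+1 from tokens <= t.
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))
print(float(loss))
```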
Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences
To generate novel CRISPR-Cas proteins, they fine-tuned the ProGen2-base language model.
Conditional Enzyme Generation Using Protein Language Models with Adapters (ProCALM)
The authors show the ability to generate proteins within particular families by using a conditional encoder that projects conditions (e.g., enzyme family) into an embedding, which is then used to generate proteins satisfying those conditions.
Evo¶
Sequence modeling and design from molecular to genome scale with Evo
The authors show in their paper that long-context genomic models can yield state-of-the-art predictions on protein-related tasks, including zero-shot function prediction and multi-element sequence generation. Their model, known as Evo, uses the StripedHyena structured state space architecture.
ZymCTRL: a conditional language model for the controllable generation of artificial enzymes
Here, we describe ZymCTRL, a conditional language model trained on the BRENDA database of enzymes, which generates enzymes of a specific enzymatic class upon a user prompt. ZymCTRL generates artificial enzymes distant from natural ones while their intended functionality matches predictions from orthogonal methods. Model
With Natural Large Language Models¶
Data¶
Data Selection¶
Protein Language Model Fitness Is a Matter of Preference
The authors show that model preferences are biased by human preferences during data curation. Quite cleanly, they state: "Algorithmic differences might be overshadowed by human preferences at the data level confounding whether a model better captures the biology of proteome."
Data Sources¶
ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design
ProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants. The results are "curated to enable thorough comparisons of various mutation effect predictors in different regimes." Website Paper
Example Architectures¶
While there are many architectures and methods for creating and optimizing proteins, we focus here primarily on ways that employ PLMs in some way. These create foundation models that can be fine-tuned and readily adapted to specific domains of interest.
The general method of creating protein foundation models uses masked language modeling (MLM), or 'BERT-style' prediction, though next-token prediction, as done in GPT architectures, may also be used. We share a number of prominent models and their uses or derivatives.
Evaluation Metrics¶
To do¶
- Spearman Correlation Coefficient
- AUC
- MCC
Pseudo Likelihood¶
The pseudo log-likelihood (PLL) is often used to evaluate the fitness of a given sequence conditioned upon the parameters of the model. It is found by evaluating \(\mathrm{PLL}(x) = \sum_{i=1}^{L} \log p(x_i \mid x_{\setminus i})\), where each residue \(x_i\) is masked and predicted from the rest of the sequence \(x_{\setminus i}\).
It requires \(O(L)\) passes through the data.
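A direct implementation of this per-position masking, requiring one forward pass per residue, might look like the sketch below; it assumes the Hugging Face transformers library and a small public ESM-2 checkpoint, and any masked protein LM could be substituted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed: the `transformers` library and the public ESM-2 checkpoint below.
name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def pseudo_log_likelihood(sequence: str) -> float:
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    total = 0.0
    with torch.no_grad():
        for pos in range(1, ids.shape[1] - 1):        # skip BOS/EOS special tokens
            masked = ids.clone()
            true_token = ids[0, pos].item()
            masked[0, pos] = tokenizer.mask_token_id  # mask one residue at a time
            log_probs = torch.log_softmax(model(input_ids=masked).logits[0, pos], dim=-1)
            total += log_probs[true_token].item()     # accumulate log p(x_i | x_{\i})
    return total

print(pseudo_log_likelihood("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```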
There is a way to go faster, as in Protein Language Model Fitness Is a Matter of Preference, where the authors show that the pseudo log-likelihood can be calculated in a single pass.
BERTOLOGY MEETS BIOLOGY: INTERPRETING ATTENTION IN PROTEIN LANGUAGE MODELS
Developments: The authors show in their paper "that attention: (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We find this behavior to be consistent across three Transformer architectures (BERT, ALBERT, XLNet) and two distinct protein datasets. We also present a three-dimensional visualization of the interaction between attention and protein structure." They see the following:

- Attention aligns strongly with contact maps in the deepest layers.
- Attention targets binding sites throughout most layers of the models.
- Attention targets post-translational modifications in a small number of heads.
- Attention targets higher-level properties in deeper layers.
- Attention heads specialize in particular amino acids.
- Attention is consistent with substitution relationships.
Strategies¶
Pro-FSFP: Few-Shot Protein Fitness Prediction
In their paper the authors show the ability to use meta-learning across multiple tasks to create a meta-learned model (PLMs with LoRA adapters) that gives better few-shot fitness predictions using a ranking loss. Training with a ranking loss allows results from multiple experiments to be used simultaneously without impacting the quality of results.
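A pairwise ranking objective of the general kind described can be sketched as follows; this is a generic logistic (Bradley-Terry-style) formulation rather than the paper's exact loss, and the scores would come from the adapted PLM in practice.

```python
import torch

def pairwise_ranking_loss(scores: torch.Tensor, fitness: torch.Tensor) -> torch.Tensor:
    """Logistic pairwise ranking loss.

    scores  : model-predicted scores for a batch of variants
    fitness : measured fitness values; only their ordering is used,
              which lets assays on different scales be mixed.
    """
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)                 # diff[i, j] = s_i - s_j
    higher = (fitness.unsqueeze(1) > fitness.unsqueeze(0)).float()   # 1 where fitness_i > fitness_j
    loss = -(higher * torch.nn.functional.logsigmoid(diff)).sum()
    return loss / higher.sum().clamp(min=1)

# Toy usage with random predictions and measurements.
scores = torch.randn(8, requires_grad=True)
fitness = torch.randn(8)
print(pairwise_ranking_loss(scores, fitness))
```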
Foundation Models¶
ESM Models¶
Evolutionary-scale prediction of atomic-level protein structure with a language model (esm)
An end-to-end language model enabling structure-sequence pairing, coupled with an equivariant transformer structure module at the end. Science paper
Genome-wide prediction of disease variant effects with a deep protein language model
The authors show in their paper a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and make all predictions available on a web portal. Developments: Using established and newly trained protein language models, the authors demonstrate the ability to provide zero-shot predictions of the effect of a mutation on a protein's fluorescence. They use a PLM to score mutations with a log-odds ratio for the mutated protein. Data: They create ESM-1v, an unsupervised masked transformer model, by training on 98 million protein sequences from UniRef90 (2020-03), and evaluate the model on a set of 41 deep mutational scans.
MSA Transformer
The authors demonstrate in their paper the training of an unsupervised PLM that operates on sets of aligned sequences, where self-supervision reconstructs a corrupted MSA.

Architecture: The architecture 'interleaves attention across the rows and columns of the alignment as an axial attention' and ties the attention map across rows with 'tied row attention'. They use a single feed-forward layer for each block. For position embeddings, they use 1D learned position embeddings added independently to each row of the MSA to distinguish aligned positions differently for each sequence. The objective is the masked-MSA loss, with the probabilities being the output of the MSA Transformer, softmax-normalized over the amino acid vocabulary independently per position in the sequence. Masking columns uniformly resulted in the best performance. The models have 12 layers, a 768-dimensional embedding, and 12 attention heads, for roughly 100M parameters.

Data: They use 26 million MSAs generated from UniRef50 by searching UniClust30 with HHblits.

Analysis: They show that a logistic regression with 144 parameters, fit on 20 training structures, could predict the contact maps of almost 15k other structures in an almost unsupervised manner, and that a supervised contact-prediction head improves the contact maps further. They find the attention heads focus on highly variable columns, correlating with the per-column entropy of the MSA.
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
The authors used masked language prediction with transformer models to train a foundation model capable of multiple downstream tasks. "To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections."
Reference Optimization of Protein Language Models as a Multi-objective Binder Design Paradigm
The authors create a design paradigm using instruction fine-tuning and direct preference optimization of PLMs. Starting from ProtGPT2, binders can be designed based on receptor and drug-developability criteria. To do this, they perform two-step instruction tuning with receptor-binding 'chat templates', and then optimize the fine-tuned models to promote preferred binders. Specifically, they "propose an alignment method to transform pre-trained unconditional protein sequence models (p(s)), that autoregressively sample sequences (s) from underlying data distribution (D), to conditional probability models (p(s|r; c)) that given a target receptor (r) sample binders that satisfy constraints (c) encoded by preference datasets compiled from experiments and domain experts." Notably, they fuse protein sequences with English-language prompts and use BPE encoding with a large vocabulary size (50k) instead of the smaller PLM vocabulary sizes (33) that are standard.
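For reference, the standard direct preference optimization objective on preference pairs can be sketched as below; this is the generic DPO formulation, and the paper's exact variant and batching may differ.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct preference optimization loss on sequence log-likelihoods.

    Each argument is the summed log-probability of a binder sequence under
    the trainable policy or the frozen reference model; 'chosen' binders are
    preferred over 'rejected' ones in the preference dataset.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with random log-likelihoods for a batch of preference pairs.
batch = torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)
print(dpo_loss(*batch))
```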
Single-sequence protein structure prediction using supervised transformer protein language models
The authors show in their paper the ability to generate high-quality single-sequence structure predictions, outperforming AlphaFold2, with a model called trRosettaX-Single that uses ESM to generate representations and attention maps, which are then trained to predict distance and energy maps.
AMPLIFY Protein Language Model
The authors show in their paper that they can train highly performant ESM-style models (and modifications) with better performance. They use different data sets with better filtering and validation selection, and they use flash attention. Together, they see that their 350M model is as performant as the 15B ESM model. They also use pseudo-perplexity, which evaluates with non-random masking (masking each position of a sequence one at a time), and they show results from retraining the same models (ESM and AMPLIFY) on UniRef data. Differences with ESM:

- They used SwiGLU and RMSNorm instead of GELU activations.
- They used a reduced number of attention heads.
- They used AdamW optimization (not Adam).
- They trained with bf16 using DeepSpeed and model sharding.
- They streamlined the vocabulary, removing unused tokens.
Alpha-models¶
(closed source) De novo design of high-affinity protein binders with AlphaProteo
The authors reveal in their paper and blog, a very performant solution that designs proteins to bind to protein targets.
(semi-open) Accurate structure prediction of biomolecular interactions with AlphaFold 3
The authors reveal a highly powerful solution that achieves high-accuracy binding predictions and uses tokenization beyond single protein letters.
Protenix: Protein + X
The authors share in their technical report and code a trainable PyTorch reproduction of AlphaFold 3.
xTrimo¶
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Developments: The authors reveal an innovative way of training protein language models using a novel masked language modeling scheme. They also investigate LoRA and MLP adapter layers for fine-tuning and show a significant gain when using LoRA.
Results: The models are trained with standard [MASK] tokens, [sMASK] tokens that mask short spans, and [gMASK] tokens that mask spans at the end of the sequence. Training with both standard and block masking, at a ratio of 20% to 80%, respectively, yields models with notable improvements over prior models.
xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data
Developments: The authors create a scalable asymmetric encoder-decoder network for scRNA-seq data with sparse labeling.
Methods: From an expression matrix, the model masks and filters expression values and tries to reconstruct the full-length embedding and expression matrix. They also introduce auto-discretization to help alleviate category-assignment errors across genes, since gene expression values are not truly categorical. The auto-discretization strategy uses a lookup table and produces a weighted combination of individual embeddings from that lookup table.
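One plausible reading of the auto-discretization module is sketched below: a continuous expression value is routed with a softmax over a learnable lookup table of bin embeddings and represented as their weighted combination. The module name, number of bins, and embedding size are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn

class AutoDiscretization(nn.Module):
    """Hypothetical sketch: map a continuous expression value to a weighted
    mixture of learnable 'bin' embeddings via a softmax over a lookup table."""
    def __init__(self, num_bins=100, dim=64):
        super().__init__()
        self.bin_embeddings = nn.Embedding(num_bins, dim)  # the lookup table
        self.router = nn.Linear(1, num_bins)               # scores each bin

    def forward(self, expression):                         # expression: (batch, 1)
        weights = torch.softmax(self.router(expression), dim=-1)  # (batch, num_bins)
        return weights @ self.bin_embeddings.weight                # weighted combination

emb = AutoDiscretization()(torch.tensor([[0.0], [3.2], [7.5]]))
print(emb.shape)  # -> torch.Size([3, 64])
```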
Others¶
Chai labs protein model
An apparent competitor to AF-3 in the making
??? "Miniaturizing, Modifying, and Augmenting Nature’s Proteins with" raygun Developments; The author show in their publication raygun a novel method that is able to generate solutions that are able to create sequneces that are able to design proteins with structural similarity but verying lengths.
They do so by creating use models to create an intermediate multivariate normal distribution to embody varyiable length sequences into a same-lengh metric.
<img width="689" alt="image" src="https://github.com/user-attachments/assets/81fd94f5-a904-47de-8da8-574677f80535">
Their process of collapsing data into a smaller fixed-length embedding is a parameter-free method that performs blocked chunking and averaging of components.
<img width="689" alt="image" src="https://github.com/user-attachments/assets/1c1cf62e-5c91-4e7a-90d1-4ee849dffdbf">
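As a generic illustration of this fixed-length reduction idea (not Raygun's exact implementation, which additionally models a multivariate normal over the reduced representation), a blocked chunk-and-average can be sketched as:

```python
import torch

def blocked_average(embeddings: torch.Tensor, num_blocks: int = 50) -> torch.Tensor:
    """Collapse a variable-length (L, d) residue embedding into a fixed
    (num_blocks, d) representation by chunking positions into roughly equal
    blocks and averaging within each block (a parameter-free reduction)."""
    chunks = torch.tensor_split(embeddings, num_blocks, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks])

short = blocked_average(torch.randn(120, 1280))  # a 120-residue protein
long = blocked_average(torch.randn(900, 1280))   # a 900-residue protein
print(short.shape, long.shape)                   # both -> (50, 1280)
```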
Natural Language + Protein Language model integrations¶
It is possible to combine natural language LLMs with PLMs to produce powerful suggestions from natural-language queries alone. Here are some examples.
🧬 Protein function prediction as approximate semantic entailment
Developments: Current LLM models excel at predicting the structure and other attributes of biological sequences like proteins. However, their transferability is limited, capping their true potential. The DeepGO-SE model innovates 🚀 by integrating protein language models with specific knowledge on protein function, bridging the gap between knowledge graphs' explicit representations and next-token prediction's implicit representations, and thereby significantly improving model performance. How it works:

- 🔄 First, DeepGO-SE reuses the ESM2 large language model to convert a protein sequence into a vector-space embedding, prepping it for machine learning application.
- 🧠 Next, an ensemble of fitted prediction models is trained to align ESM2 embeddings with an embedding space (ELEmbeddings) derived from GO axioms, creating a world model filled with geometric shapes and relations akin to a Σ algebra, which can verify the truth of a statement.
- ✅ Finally, for statements such as "protein has function C", when the ensemble reaches a consensus on truth, the semantic truth estimation is accepted as valid.

The authors demonstrate 📈 that this method improves molecular function prediction by a substantial margin. Moreover, they reveal that training with protein-protein interactions substantially benefits the understanding of complex biological processes. They suggest that predicting biological processes may only require knowledge of molecular functions, potentially paving the way for a more generalized approach that could be advantageous in other domains.
ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
The authors show in their paper that fusing a natural language model with a protein language model can meaningfully improve protein location prediction, fitness landscape prediction, and protein function annotation. Data: They build ProtDescribe, a dataset matching protein sequences with text descriptions. Models: Their models involve three losses:

1. An InfoNCE loss to maximize similarity between matched sequence-text pairs and minimize similarity between negative pairs.
2. A masked protein modeling cross-entropy loss to maintain unimodal information in the sequences.
3. A fusion multimodal mask prediction loss that uses self- and cross-attention on masked input sequence and text pairs to mutually recover the masked content in both modalities.

They start with pre-trained protein models (Bert, ESM-1b, and ESM-2) and pre-trained language models (PubMedBERT-abs and PubMedBERT-full).
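A minimal sketch of the first of these losses, a symmetric InfoNCE over paired sequence and text embeddings, is shown below; the encoders are omitted and random tensors stand in for their outputs, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(seq_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired sequence/text embeddings.

    Matched pairs (row i of each tensor) are pulled together; all other
    in-batch combinations act as negatives.
    """
    seq_emb = F.normalize(seq_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = seq_emb @ text_emb.T / temperature
    targets = torch.arange(seq_emb.shape[0])
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage: 8 paired (sequence, text) embeddings of dimension 512.
print(info_nce(torch.randn(8, 512), torch.randn(8, 512)))
```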
Architectures by Target¶
Enzymatic Catalysis¶
ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model
Developments: The authors present ForceGen, an end-to-end algorithm for de novo protein generation based on nonlinear mechanical unfolding responses. Rooted in the physics of protein mechanics, this generative strategy provides a powerful way to design new proteins rapidly, including rapid predictions about their dynamical behavior. Proteins, like any other mechanical object, respond to forces in peculiar ways: think of the different response you would get from pulling on a steel cable versus a rubber band, or the difference between honey and glass. With this approach, proteins can be designed with a set of desirable mechanical characteristics, with applications from health to sustainable plastics. The key was to integrate a protein language model with denoising diffusion methods and to use accurate atomistic-level physical simulation data to endow the model with a first-principles understanding. ForceGen solves both forward and inverse tasks: in the forward task, it predicts how stable a protein is, how it will unfold, and what forces are involved, given just the amino acid sequence; in the inverse task, it designs new proteins that meet complex nonlinear mechanical signature targets. Via full-atom molecular simulations for direct validation against physical and chemical principles, the authors demonstrate that the designed proteins are de novo and fulfill the targeted mechanical properties, including unfolding energy, mechanical strength, and detailed unfolding force-separation curves.
Thermostability¶
ProLaTherm: Protein Language Model-based Thermophilicity Predictor
Developments: The authors present in their paper a model that predicts protein thermophilicity, along with an augmented dataset that enables its strong predictive performance. Data: Collected from multiple sources to create new sets: "9422 UniProt identifiers and 9363 corresponding amino acid sequences from 16 thermophilic and 16 mesophilic organisms", then filtered. Models: They consider several model classes: first, feature-based models that rely on manually engineered features, such as physicochemical properties; second, hybrid sequence-based models that use amino acid features to learn sequence embeddings; third, purely sequence-based approaches similar to ProLaTherm that, in contrast, train sequence embeddings from scratch. The final model is a simplified transformer that feeds 1024-dimensional sequence embeddings into a self-attention network, averages the resulting output embeddings, and passes them through a ReLU activation, a batch norm, and a logistic prediction of whether the protein is a thermophile. Training: From scratch. Results: The PLM achieves high performance (97% accuracy) over other models, though this accuracy drops when train/test set homology is reduced.
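A rough PyTorch re-creation of the described classifier head might look like the following; the number of attention heads and other details are assumptions, and the per-residue embeddings would come from a protein language model in practice.

```python
import torch
import torch.nn as nn

class ThermophilicityHead(nn.Module):
    """Rough re-creation of the described classifier head (details approximated)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.BatchNorm1d(dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, embeddings):                # (batch, length, dim) per-residue embeddings
        attended, _ = self.attn(embeddings, embeddings, embeddings)
        pooled = attended.mean(dim=1)             # average over sequence positions
        hidden = self.norm(torch.relu(pooled))    # ReLU then batch norm, as described
        return torch.sigmoid(self.out(hidden))    # logistic thermophile prediction

model = ThermophilicityHead()
print(model(torch.randn(4, 100, 1024)).shape)     # -> torch.Size([4, 1])
```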
Transfer learning to leverage larger datasets for improved prediction of protein stability changes
The authors show in their paper a graph neural network (GNN) trained using transfer learning to predict changes in stability for protein point mutants.
Candidate Identification¶
Particularly for evolutionary methods, it is essential to know where to start optimizing from. GenAI can be used to identify candidates based on databases of prior candidates.
Searching is essential to find similar sequences that may aid in the training or fine-tuning of models. This can be done with sequence-based alignment, as well as structure-based alignment. Here are a few references of highly-relevant tools for search/alignment.
Fast and accurate protein structure search with: Foldseek
Foldseek "aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet." Paper
Candidate Alignment¶
It is not enough just to identify a potential candidate; the candidate should also be aligned with the starting or suggested candidates. This alignment provides a degree of interpretability for people.
Contrastive learning on protein embeddings enlightens midnight zone
In their paper the authors demonstrate the use of contrastive optimization (like CLIP) to create embeddings that "optimize constraints captured by hierarchical classification of protein 3D structures."
Protein Binding¶
Contrastive learning in protein language space predicts interactions between drugs and protein targets
The authors show in their paper the use of contrastive learning to co-locate proteins and potential drug molecules in a 'shared feature space', learning to contrast true binders against non-binding 'decoy' molecules.
ProteinMPNN¶
Robust deep learning based protein sequence design using ProteinMPNN
In their paper the authors present a message-passing neural network for structure-based protein sequence design, reporting improved sequence recovery.
Performance optimizations¶
Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure
Developments: The authors show they "can construct a tokenized all-atom structure vocabulary that retains high reconstruction accuracy, thus introducing a tokenized representation of all-atom structure that can be obtained from sequence alone". They use Compressed Hourglass Embedding Adaptations of Proteins (CHEAP) to represent protein sequence and structure with significant embedding compression.
Common Methods¶
Tools¶
Colab Design¶
Quality Reviews and References¶
Deep Learning in Protein Structural Modeling and Design
Provides a thorough summary of deep learning approaches to optimizing proteins. They emphasize that a Sequence → Structure → Function approach should be the focus.
Companies¶
Here are several companies that focus on protein design. If you have one you'd like to suggest, please file an issue.