Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure
Developments The authors show they "can construct a tokenized all-atom structure vocabulary that retains high reconstruction accuracy, thus introducing a tokenized representation of all-atom structure that can be obtained from sequence alone". They use a Compressed Hourglass Embedding Adaptations of Proteins (CHEAP) toe represent protein structure of sequence and structure with significant embedding compression.