Protein Generation: AI Tools & Techniques


The 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold2, marks an important moment of recognition for artificial intelligence’s role in biology. What comes next after protein folding? This article introduces PLAID, a generative model that produces protein sequence and all-atom structure together, with the aim of accelerating drug discovery and materials science.

From Structure Prediction to Real-World Drug Design

While recent works demonstrate promise for the ability of diffusion models to generate proteins, there still exist limitations that make them impractical for real-world applications. Firstly, many existing generative models only produce backbone atoms. To produce an all-atom structure and place sidechain atoms, a sequence is needed, creating a multimodal generation problem requiring simultaneous discrete and continuous modality generation. Secondly, proteins intended for human use need to be humanized to avoid immune system destruction. Furthermore, drug discovery involves complex constraints; for example, even after addressing biological factors, solubility might dictate tablet formulation over vials.

Generating “Useful” Proteins

Simply generating proteins isn’t as valuable as controlling generation to obtain useful proteins. Image generation offers a helpful analogy: there, compositional textual prompts steer the output, and the same idea carries over to protein generation.

Compositional Textual Prompts Example — *For inspiration, let’s consider how we’d control image generation via compositional textual prompts (example from Liu et al., 2022).*

The ultimate goal is to control generation entirely via a textual interface, but here we consider compositional constraints for function and organism axes as a proof-of-concept. This allows for greater control in the process of protein generation.
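One common way to compose several conditions in a diffusion model is to combine conditional noise predictions around an unconditional one, in the spirit of the composable diffusion work by Liu et al. (2022). The sketch below assumes that mechanism purely for illustration; the article does not state how the function and organism constraints are combined, and the names `compose_guidance`, `eps_function`, and `eps_organism` are hypothetical.

```python
def compose_guidance(eps_uncond, eps_conds, weights):
    """Return eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond), elementwise.

    All predictions are flat lists of floats of equal length. This is an
    assumed composition rule for illustration, not a confirmed detail of
    the method described in the article.
    """
    if len(eps_conds) != len(weights):
        raise ValueError("need exactly one weight per condition")
    combined = list(eps_uncond)
    for eps_cond, weight in zip(eps_conds, weights):
        for k, (cond, uncond) in enumerate(zip(eps_cond, eps_uncond)):
            combined[k] += weight * (cond - uncond)
    return combined


# Toy noise predictions standing in for a denoiser's outputs.
eps_uncond = [0.0, 1.0, -1.0]    # no condition
eps_function = [0.5, 1.0, -1.0]  # conditioned on a function label
eps_organism = [0.0, 2.0, -1.0]  # conditioned on an organism label

guided = compose_guidance(eps_uncond, [eps_function, eps_organism], [2.0, 1.0])
print(guided)
```

Each weight sets how strongly its condition pulls the sample; setting a weight to zero drops that condition entirely.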

Function-Structure Connection — Learning the function-structure-sequence connection. PLAID learns the tetrahedral cysteine-Fe2+/Fe3+ coordination pattern often found in metalloproteins, while maintaining high sequence-level diversity.

Training Using Sequence-Only Training Data

Notably, this method only requires sequences to train the generative model! Generative models learn the data distribution defined by their training data, and sequence databases are considerably larger than structural ones. Training on these abundant datasets therefore lets the model learn from a much broader distribution of proteins.

Training Data Comparison — Learning from a larger and broader database. The cost of obtaining protein sequences is much lower than experimentally characterizing structure, and sequence databases are 2-4 orders of magnitude larger than structural ones.

How Does It Work?

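We can train the generative model to generate structure using only sequence data by learning a diffusion model over the latent space of a protein folding model. During inference, we sample an embedding from the diffusion model and decode structure from it using the frozen weights of the protein folding model. We use ESMFold, a successor to AlphaFold2 that replaces the multiple-sequence-alignment retrieval step with a protein language model.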
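To make the training side concrete, here is a minimal sketch of the standard diffusion noising step applied to a per-residue latent. It is a toy under stated assumptions: `embed_sequence` is a hypothetical stand-in for the frozen folding-model encoder (it is not the ESMFold API), and the linear beta schedule and its values are illustrative defaults rather than the authors' settings.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def embed_sequence(sequence, dim=4):
    """Hypothetical frozen encoder: maps each residue to a fixed toy vector."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    return [
        [math.sin((alphabet.index(aa) + 1) * (k + 1)) for k in range(dim)]
        for aa in sequence
    ]


def noise_latent(latent, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps."""
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in latent]
    noisy = [
        [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
         for x, e in zip(row, noise_row)]
        for row, noise_row in zip(latent, noise)
    ]
    return noisy, noise


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(num_steps=1000)
latent = embed_sequence("MKTAYIAK")  # only the sequence is needed here
t = rng.randrange(len(alpha_bars))   # random timestep for this example
noisy_latent, target_noise = noise_latent(latent, alpha_bars[t], rng)
# A denoiser would be trained to predict target_noise from (noisy_latent, t).
```

No structure appears anywhere in this loop, which is the point: the structural information enters only through the frozen model that defines the latent space and later decodes samples from it.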

PLAID Method Overview — *Our method. During training, only sequences are needed to obtain the embedding; during inference, we can decode sequence and structure from the sampled embedding. ❄️ denotes frozen weights.*

Compressing the Latent Space of Protein Folding Models

The latent space of ESMFold requires a lot of regularization before it is practical to learn a diffusion model over it. To address this, we propose CHEAP (Compressed Hourglass Embedding Adaptations of Proteins), where we learn a compression model for the joint embedding of protein sequence and structure. The compressed embedding is cheaper to work with and gives the generative model a smaller space to learn.
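As a rough picture of what an hourglass-style compression does to tensor shapes, the sketch below shrinks a per-residue embedding along both the length and channel axes. It assumes compression happens on both axes and uses fixed mean pooling as a stand-in; CHEAP learns its compression, so this shows only the shape change, not the actual model. The function names and pooling factors are hypothetical.

```python
def mean_pool_length(embedding, factor):
    """Average consecutive groups of `factor` residues (last group may be shorter)."""
    pooled = []
    for start in range(0, len(embedding), factor):
        group = embedding[start:start + factor]
        pooled.append([sum(column) / len(group) for column in zip(*group)])
    return pooled


def mean_pool_channels(embedding, factor):
    """Average consecutive groups of `factor` channels within each residue."""
    return [
        [sum(row[s:s + factor]) / len(row[s:s + factor])
         for s in range(0, len(row), factor)]
        for row in embedding
    ]


# Toy embedding: 6 residues x 8 channels.
embedding = [[float(i * 8 + j) for j in range(8)] for i in range(6)]

# Halve the length, then reduce the channels fourfold: (6, 8) -> (3, 2).
compressed = mean_pool_channels(mean_pool_length(embedding, 2), 4)
print(len(compressed), len(compressed[0]))
```

A learned version would pair this bottleneck with a decoder that expands the compressed embedding back to the original shape, so the frozen folding model can still consume it.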

What’s Next?

This method can be adapted to perform multi-modal generation for any modalities where there is a predictor from an abundant modality to a less abundant one. If you are interested in collaborating, reach out!

Further Links

@article{lu2024generating,
  title={Generating All-Atom Protein Structure from Sequence-Only Training Data},
  author={Lu, Amy X and Yan, Wilson and Robinson, Sarah A and Yang, Kevin K and Gligorijevic, Vladimir and Cho, Kyunghyun and Bonneau, Richard and Abbeel, Pieter and Frey, Nathan},
  journal={bioRxiv},
  pages={2024--12},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

@article{lu2024tokenized,
  title={Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure},
  author={Lu, Amy X and Yan, Wilson and Yang, Kevin K and Gligorijevic, Vladimir and Cho, Kyunghyun and Abbeel, Pieter and Bonneau, Richard and Frey, Nathan},
  journal={bioRxiv},
  pages={2024--08},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Check out our preprints (PLAID, CHEAP) and codebases (PLAID, CHEAP).

Source: Read the original article here.
