The schedule is now final!
ALL TIMES ARE IN UTC!
CLICK ON A TIME TO FIND OUT WHEN THE EVENT TAKES PLACE IN YOUR TIME ZONE!
11:50-12:05 Lightning talks
11:50-11:55 - Evotuning protocols for Transformer-based variant effect prediction on multi-domain proteins - Hideki Yamaguchi [Live]
Accurate prediction of variant effects has broad impacts on protein engineering. Recent machine learning approaches toward this end are based on representation learning, often using large-scale, diverse datasets. However, it is still unclear how we can effectively learn the intrinsic evolutionary properties of an engineering target protein, specifically when the protein is composed of multiple domains. Additionally, no optimal protocols have been established for incorporating such properties into Transformer-based variant effect predictors. In response, we propose evolutionary fine-tuning, or “evotuning”, protocols, considering various combinations of homology search, fine-tuning, and sequence embedding strategies, without the need for multiple sequence alignment. Exhaustive evaluations on diverse proteins indicate that the models obtained by our protocols achieve significantly better performance than previous methods. Visualizations of attention maps suggest that structural information can be incorporated by evotuning without direct supervision, possibly leading to better prediction accuracy.
11:55 - 12:00 - ProteinBERT: A universal deep-learning model of protein sequence and function - Dan Ofer [Live]
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme consists of masked language modeling combined with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to very large sequence lengths. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post translational modifications and biophysical attributes), despite using a far smaller model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert
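The two pretraining objectives described above — masked language modeling over residues and multi-label GO annotation prediction — can be sketched as a single combined loss. This is an illustrative sketch in NumPy, not ProteinBERT's actual implementation; all names, shapes, and the unweighted sum of the two terms are assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def pretraining_loss(token_logits, token_targets, mask, go_logits, go_labels):
    """Combined pretraining loss (sketch): masked-token cross-entropy
    (local, per-residue) plus GO-annotation binary cross-entropy
    (global, per-protein)."""
    probs = softmax(token_logits)                        # (L, vocab)
    tok_nll = -np.log(probs[np.arange(len(token_targets)), token_targets])
    mlm = (tok_nll * mask).sum() / max(mask.sum(), 1)    # average over masked positions only
    p = 1.0 / (1.0 + np.exp(-go_logits))                 # sigmoid: GO terms are multi-label
    bce = -(go_labels * np.log(p) + (1 - go_labels) * np.log(1 - p)).mean()
    return mlm + bce

rng = np.random.default_rng(0)
L, V, G = 20, 25, 8                      # sequence length, amino-acid vocab, number of GO terms
loss = pretraining_loss(rng.normal(size=(L, V)),
                        rng.integers(0, V, size=L),
                        (rng.random(L) < 0.15).astype(float),  # ~15% of tokens masked
                        rng.normal(size=G),
                        rng.integers(0, 2, size=G).astype(float))
print(loss > 0)  # True
```

The key point of the scheme is that one loss term supervises local (per-residue) representations and the other supervises a global (per-protein) representation, matching the two-track architecture the abstract describes.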
12:00 - 12:05 - Graph attention network based representation learning for cancer drug response prediction and interpretation - Dionizije Fa [Recorded]
We present a state-of-the-art multimodal deep learning model for cancer drug response prediction based on pharmacogenomic data. We featurize cell lines as protein-protein interaction graphs. Graph attention networks then allow us to identify potentially plausible biological interactions in protein-protein interaction graphs by examining the attention coefficients.
12:55-13:15 Lightning talks
12:55-13:00 - HydrAMP: a deep generative model for antimicrobial peptide discovery - Paulina Szymczak [Live]
The development of resistance to conventional antibiotics in pathogenic bacteria poses a global health hazard. Antimicrobial peptides (AMPs) are an emerging group of compounds with the potential to become the new generation of antibiotics. Deep learning methods are widely used by wet-laboratory researchers to screen for the most promising candidates. We propose HydrAMP, a generative model based on a semi-supervised variational autoencoder that can generate new AMPs and perform analogue discovery. Novel features of our approach include non-iterative training, parameter-regulated model creativity, and improvement of existing AMPs. We introduce multiple refinements to latent space modelling that allow us to sample novel AMPs despite the data scarcity. The peptides generated by HydrAMP are similar to known AMPs in terms of physicochemical properties. We have successfully obtained and verified experimentally a new, more active analogue of Pexiganan, proving that HydrAMP is able to find potent analogues of existing peptides. The learnt representation enables fast and efficient discovery of peptides with desired biological activity.
13:00-13:05 - Random Walk-based Matrix Factorization of a Multilayer Network for Protein Function Prediction - Surabhi Jagtap [Recorded]
Cellular systems of organisms are composed of multiple interacting entities that control cellular processes at multiple levels by tightly regulated molecular networks. In recent years, the advent of high-throughput experimental methods has resulted in an increase of large-scale molecular and functional interaction networks such as gene co-expression, protein–protein interaction (PPI), genetic interaction, and metabolic networks. These networks are rich sources of information that could be used to infer the functional annotations of genes or proteins. Extracting relevant biological information from their topologies is essential in understanding the functioning of the cell and its building blocks (proteins). Therefore, it is necessary to obtain an informative representation of the proteins and their proximity that is not fully captured by features extracted directly from single input networks. Here, we propose BraneMF, a random walk-based matrix factorization of a multi-layer network for protein function prediction.
13:05-13:10 - Light Attention Predicts Protein Location from the Language of Life - Hannes Stärk [Live]
Although knowing where a protein functions in a cell is important to characterize biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expertly chosen input features leveraging evolutionary information that is resource-expensive to generate. We showcase the use of embeddings from protein language models for competitive localization prediction without relying on evolutionary information. Our lightweight deep neural network architecture uses a softmax-weighted aggregation mechanism with linear complexity in sequence length, referred to as light attention (LA). The method significantly outperformed the state-of-the-art for ten localization classes by about eight percentage points (Q10). The novel models are available as a web-service and as a stand-alone application at embed.protein.properties.
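The softmax-weighted aggregation the abstract mentions can be sketched as follows: attention coefficients are computed per residue, softmaxed over the sequence dimension, and used to pool per-residue embeddings into one fixed-size vector in O(L). This is a minimal NumPy sketch under assumed shapes and random projections, not the authors' exact LA architecture (which uses convolutions and additional pooling).

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def light_attention_pool(embeddings, w_att, w_val):
    """Aggregate per-residue embeddings (L x d) into a single (d,) vector.

    embeddings: (L, d) per-residue protein language-model embeddings
    w_att, w_val: (d, d) learned projections (random here for illustration)
    """
    scores = embeddings @ w_att          # (L, d) attention logits
    alpha = softmax(scores, axis=0)      # normalize over sequence length L
    values = embeddings @ w_val          # (L, d) value projections
    return (alpha * values).sum(axis=0)  # (d,) weighted aggregation, linear in L

rng = np.random.default_rng(0)
L, d = 120, 32                           # sequence length, embedding dimension
emb = rng.normal(size=(L, d))
pooled = light_attention_pool(emb, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(pooled.shape)  # (32,)
```

Because the output size is independent of L, the pooled vector can feed a standard classifier head over the ten localization classes.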
13:10-13:15 - Guided Generative Protein Design using Regularized Transformers - Egbert Castro [Live]
The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labelled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which is trained to jointly generate sequences as well as predict fitness. Using ReLSO, we explicitly model the underlying sequence-function landscape of large labeled datasets and optimize within latent space using gradient-based methods. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal.
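The gradient-based latent-space optimization step can be sketched as plain gradient ascent on a differentiable predicted-fitness surface: start from a sequence's latent code and step uphill, then decode the optimized code. The toy quadratic fitness surface, step size, and step count below are all illustrative assumptions standing in for ReLSO's trained prediction head.

```python
import numpy as np

def latent_gradient_ascent(z0, grad_fn, lr=0.1, steps=50):
    """Gradient-based traversal of a learned latent space toward higher
    predicted fitness (illustrative sketch, not the ReLSO implementation)."""
    z = z0.copy()
    for _ in range(steps):
        z += lr * grad_fn(z)   # step uphill on the predicted-fitness surface
    return z

# Toy concave fitness surface standing in for a trained prediction head.
target = np.array([1.0, -2.0, 0.5])            # hypothetical fitness optimum
fitness = lambda z: -((z - target) ** 2).sum()  # predicted fitness
grad = lambda z: -2.0 * (z - target)            # its analytic gradient

z_opt = latent_gradient_ascent(np.zeros(3), grad)
print(fitness(z_opt) > fitness(np.zeros(3)))  # True
```

In the actual setting the gradient would come from backpropagating through the fitness head, and the regularization the abstract mentions keeps the traversal inside regions of latent space that decode to plausible sequences.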
14:35-14:40 - Efficient Design of Optimized AAV Capsids using Multi-property Machine Learning Models Trained across Cells, Organs and Species - Farhan Damani [Live]
While next-gen high-throughput assays enable us to learn how capsid sequence changes affect capsid functionality, measuring and optimizing capsid properties in the most therapeutically relevant models, such as non-human primates (NHP), remains challenging. The rate of transduction in target organs is lower than ideal, and most of the sequence space is non-functional. To overcome these challenges, we investigated to what extent multi-property machine learning models (MPMs) can improve the efficiency of AAV capsid design for high-performing capsids. We apply our method to a previously designed library of 156,858 sequence variants derived from a natural AAV capsid serotype and measured their properties as delivery vectors. MPMs provide a coherent framework in which to connect information from experiments across cell lines, organs, and species to the most relevant outcomes in NHP studies, thereby reducing the high resource and ethical burdens of NHP experimentation. Additionally, MPMs help overcome data sparsity in traits that are hard to measure, thereby improving model accuracy and providing a more reliable interpretation of experimental results. With further refinement, MPMs will enable the design of highly optimized AAV capsids that open new frontiers in delivery, toward realizing the full potential of gene therapy.
14:40-14:45 - Multimodal data visualization and denoising with integrated diffusion - Manik Kuchroo [Live]
We propose a method called integrated diffusion for combining multimodal data, gathered via different sensors on the same system, to create an integrated data diffusion operator. As real-world data suffers from both local and global noise, we introduce mechanisms to optimally calculate a diffusion operator that reflects the combined information in data by maintaining low frequency eigenvectors of each modality both globally and locally. We show the utility of this integrated operator in denoising and visualizing multimodal toy data as well as multi-omic data generated from blood cells, measuring both gene expression and chromatin accessibility. Our approach better visualizes the geometry of the integrated data and captures known cross-modality associations. More generally, integrated diffusion is broadly applicable to multimodal datasets generated by noisy sensors collected in a variety of fields.
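The core construction — one diffusion operator per modality, combined into a joint operator over the same cells — can be sketched as follows. This is a simplified illustration (Gaussian kernels, fixed diffusion times, a plain operator product); the kernel choice, bandwidth, and the way the authors balance modalities and denoise are assumptions, not the paper's exact method.

```python
import numpy as np

def markov_operator(X, sigma=1.0):
    """Row-stochastic diffusion operator from a Gaussian affinity kernel."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-d2 / (2 * sigma ** 2))                   # symmetric affinities
    return K / K.sum(axis=1, keepdims=True)              # normalize rows to sum to 1

def integrated_diffusion(X1, X2, t1=2, t2=2):
    """Combine two modalities measured on the same cells by composing
    powers of each modality's diffusion operator (illustrative sketch)."""
    P1 = np.linalg.matrix_power(markov_operator(X1), t1)  # diffuse t1 steps in modality 1
    P2 = np.linalg.matrix_power(markov_operator(X2), t2)  # diffuse t2 steps in modality 2
    return P1 @ P2  # joint operator: still row-stochastic

rng = np.random.default_rng(1)
X_rna = rng.normal(size=(50, 10))   # e.g. gene expression, 50 cells
X_atac = rng.normal(size=(50, 8))   # e.g. chromatin accessibility, same 50 cells
P = integrated_diffusion(X_rna, X_atac)
print(np.allclose(P.sum(axis=1), 1.0))  # True: a product of row-stochastic matrices is row-stochastic
```

Powers of the joint operator can then be used for denoising (low-pass filtering the data) or embedded for visualization, which is how the abstract applies it to the paired expression/accessibility data.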
Asynchronous Poster Session
Outside of program
In order to accommodate speakers and attendees in different time zones, we propose two optional sessions where they can come together and have a conversation.
1. Cumulative QA session
These speakers will be present: Smita, Maria, Kevin, Hannes, Paulina, Surabhi, Dionizije, Burkhard, Farhan, Dan, Maria, Alex, Egbert, Yana