Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power

Summary

Diverse biological sciences, from engineering to epidemiology, benefit from an increased understanding of how genetic code (genotype) determines downstream function (phenotype). While decreasing costs and increasing throughput have led to progressively larger-scale measurements, these measurements can still sample only a small fraction of the full genotype-phenotype landscape. A complete picture of the landscape therefore requires models that predict unmeasured genotype-phenotype combinations. Recently, researchers have increasingly relied on black-box models, such as deep neural networks, to make these predictions because of their unsurpassed ability to accurately predict the effects of genotype on phenotype. But these approaches have a substantial limitation: they are uninterpretable. Specifically, it is extremely difficult to understand how or why a black-box model makes a particular prediction. This drastically limits the insights these approaches can provide and decreases their trustworthiness to practitioners.

To address these problems, researchers in the Statistical Engineering Division and Cellular Engineering Group developed a novel approach that is fully interpretable: LANTERN. Across a broad benchmark of large-scale genotype-phenotype landscapes, LANTERN equals or outperforms alternative models (including deep neural networks), achieving state-of-the-art prediction. Beyond accurate predictions, LANTERN is fully interpretable, and we show how the model provides novel insights into diverse protein landscapes relevant to public health and the bioeconomy. LANTERN demonstrates that state-of-the-art prediction is possible without sacrificing interpretability.

Description

Overview of LANTERN applied to genotype-phenotype landscapes
(top) LANTERN takes genotype-phenotype pairs as training data and learns a predictive model through two major components. Each genotype is first compressed down to a low-dimensional latent mutational effect space where mutations combine additively. Then, a smooth non-linear surface over the latent space maps each genotype's latent position to its measured phenotype. (bottom left) A two-dimensional cross-section of the low-dimensional surface learned by LANTERN on the LacI EC50 phenotype. (bottom right) LANTERN equals or outperforms alternative predictive models, as shown here for ten-fold cross-validation predictive accuracy.
 
LANTERN: genotype-phenotype LANdscape inTERpretable Non-parametric model

The bioeconomy is increasingly data-driven, relying on large-scale measurements to rapidly engineer novel functions. To facilitate rational engineering, where complex designs are constructed from well-understood components, models must distill the large design space associated with genetic sequences down to a scale comprehensible to designers. For example, a typical protein sequence has thousands of potential mutations that can be included in a novel construct. Since most bioengineering goals require multiple mutations before reaching optimal performance, this massive design space creates a bottleneck for engineering tasks.

In this project, we address these problems with LANTERN, a fully interpretable model of genotype-phenotype landscapes. LANTERN compresses the high-dimensional genotype design space down to a continuous, low-dimensional space where the effects of mutations combine additively. This converts the challenge of designing over thousands of potential mutations into one of navigating a much smaller space when designing new proteins. For example, LANTERN compressed thousands of mutations in a previously measured landscape of genetic sensors (see the NIST project "Large-Scale Genotype-Phenotype Landscape Measurements for Precision Engineering of Living Measurement Systems") down to only three dimensions. Importantly, LANTERN also predicts the effects of mutations with state-of-the-art accuracy: across a broad benchmark of large-scale datasets, LANTERN equaled or outperformed existing approaches in predicting novel function. This comparison included black-box neural networks, which are popular for their ability to generate accurate predictions but come at a substantial cost to model interpretability. When modeling genotype-phenotype landscapes, we show that LANTERN makes this trade-off unnecessary: state-of-the-art prediction is possible while remaining fully interpretable.
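
The two-component structure described above (an additive low-dimensional latent space followed by a smooth non-linear surface) can be illustrated with a small toy sketch. This is not the LANTERN implementation or its API: the names `encode`, `latent_position`, and `surface` are hypothetical, the mutation effect matrix `W` is random rather than learned, and the surface is a fixed hand-written function standing in for the non-parametric surface that LANTERN learns from data. The sketch only shows how mutations can add in the latent space while the predicted phenotype remains non-additive.

```python
import numpy as np

rng = np.random.default_rng(0)

n_mutations = 1000  # hypothetical number of distinct mutations in a landscape
latent_dim = 3      # small latent space (three dimensions, as in the example above)

# Hypothetical effect matrix: row i is the latent effect vector of mutation i.
# In a real model these values would be learned from genotype-phenotype data.
W = rng.normal(scale=0.5, size=(n_mutations, latent_dim))


def encode(mutations, n_mutations):
    """Represent a variant as a binary vector marking which mutations it carries."""
    x = np.zeros(n_mutations)
    for m in mutations:
        x[m] = 1.0
    return x


def latent_position(x, W):
    """Additive step: sum the effect vectors of all mutations present."""
    return x @ W


def surface(z):
    """Stand-in smooth non-linear surface mapping a latent position to a phenotype."""
    return 1.0 / (1.0 + np.exp(-z[0])) + 0.1 * np.tanh(z[1])


# Two single mutants and the corresponding double mutant.
z_a = latent_position(encode([3], n_mutations), W)
z_b = latent_position(encode([7], n_mutations), W)
z_ab = latent_position(encode([3, 7], n_mutations), W)

print("latent effects add:", np.allclose(z_ab, z_a + z_b))
print("phenotypes:", surface(z_a), surface(z_b), surface(z_ab))
```

In this toy setup, interpretation happens in the latent space: every mutation is a single vector with `latent_dim` entries, and the effect of combining mutations is read off by adding those vectors before applying the surface.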

Resources

Preprint

Source code: https://github.com/usnistgov/lantern

Documentation: https://lantern-gpl.readthedocs.io/en/latest/

 

Created August 9, 2021, Updated March 28, 2022