New biological “’omics” measurements, particularly sequencing-based methods, can produce 10³ to 10⁹ values from biological systems and are transforming the biosciences and biotechnology. However, large investments in sequencing technologies by the bioindustry also require advancing measurement confidence in standard reference materials (SRMs), such as the Genome in a Bottle (GIAB) genomes and mass spectrometry data catalogs. These SRMs provide reference values but not certified values, because the underlying measurement biases are insufficiently understood. A metrological framework is needed to accelerate the determination of ~10⁹ certified values with probabilistic uncertainties by integrating multiple measurement methods, including cutting-edge artificial intelligence (AI) based modeling methods.
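To make the integration of multiple measurement methods concrete, a standard metrology building block is the inverse-variance weighted consensus: independent measurements of the same quantity are combined into one value whose uncertainty is smaller than any single method's. This is only an illustrative sketch of the kind of combination such a framework would automate at scale; the numbers are hypothetical.

```python
import math

def consensus_value(values, uncertainties):
    """Combine independent measurements of one quantity by
    inverse-variance weighting. Returns the weighted mean and the
    combined standard uncertainty (a textbook metrology formula,
    shown here only to illustrate the framework's goal)."""
    weights = [1.0 / (u ** 2) for u in uncertainties]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    combined_uncertainty = math.sqrt(1.0 / total)
    return mean, combined_uncertainty

# Three hypothetical methods measuring the same allele fraction,
# each reporting its own standard uncertainty:
mean, u = consensus_value([0.48, 0.51, 0.50], [0.02, 0.01, 0.03])
```

The combined uncertainty here is smaller than the best single method's, which is what makes multi-method integration attractive for certifying reference values.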
The goal of this project is to produce billions of certified values, with well-characterized uncertainty, from multiple measurement methods in the “’omics” fields (genomics, proteomics, transcriptomics). These values are prepared jointly by human experts and trained AI-based models that match gene sequences to a reference. For example, the genomics convolutional neural network (CNN) DeepVariant (based on the Inception AI model architecture) has been applied to sequence classification and matching, but without quantifying model prediction uncertainty.
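One simple way to attach an uncertainty score to a classifier that, like a single deterministic CNN pass, reports only class probabilities is to collect predictions from an ensemble (or Monte Carlo dropout samples) and compute the entropy of the averaged distribution. The sketch below is generic and not DeepVariant's actual machinery; all probability vectors are hypothetical.

```python
import math

def predictive_entropy(prob_sets):
    """Average class-probability vectors from several stochastic
    passes or ensemble members, then return the entropy (in nats)
    of the mean distribution. Low entropy = confident, agreeing
    predictions; high entropy = uncertain or disagreeing ones."""
    n_classes = len(prob_sets[0])
    mean = [sum(p[c] for p in prob_sets) / len(prob_sets)
            for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)

# Two members that agree on the genotype class -> low entropy:
low = predictive_entropy([[0.95, 0.03, 0.02], [0.93, 0.04, 0.03]])
# Two members that disagree -> high entropy:
high = predictive_entropy([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
```

A reference-value pipeline could flag high-entropy calls for human expert review while accepting low-entropy calls automatically.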
Our goal in this project is to develop approaches that make the results of deep-learning based variant calling interpretable and that enable the tandem of AI model and human expert to produce trusted reference values. Our approach to explainable AI rests on simulations at small scales (see the Related Publications), interactive visualization for interpreting sequencing data and genomic variation, and the design of multiple AI model metrics based on input perturbations.
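A perturbation-based metric of the kind mentioned above can be sketched as occlusion-style sensitivity analysis: substitute each base of an input sequence in turn and record how much the model's score moves. Positions whose perturbation changes the score most are the ones the model relies on. The "model" below is a toy reference-matching scorer standing in for any trained classifier; it is purely illustrative.

```python
def perturbation_sensitivity(model_score, sequence, alphabet="ACGT"):
    """For each position, substitute every alternative base and record
    the largest absolute change in the model's score relative to the
    unperturbed baseline. Returns one sensitivity value per position."""
    baseline = model_score(sequence)
    sensitivity = []
    for i, base in enumerate(sequence):
        deltas = [abs(model_score(sequence[:i] + b + sequence[i + 1:]) - baseline)
                  for b in alphabet if b != base]
        sensitivity.append(max(deltas))
    return sensitivity

# Toy stand-in model: fraction of bases matching a fixed reference motif.
REF = "ACGTAC"
toy_score = lambda s: sum(a == b for a, b in zip(s, REF)) / len(REF)
sens = perturbation_sensitivity(toy_score, "ACGTAC")
```

Applied to a real variant caller, the same loop would highlight which read positions drive a call, giving the human expert a concrete, model-agnostic interpretability signal.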