Machine learning and deep learning (ML/DL) models have been successfully used to systematically analyze various datasets and extract knowledge without having to “manually” define and fit highly non-linear, multi-dimensional mathematical models. However, taking advantage of ML/DL models becomes particularly challenging for diverse datasets consisting of many data types (e.g., image, text, or other structured data). Such datasets require a different ML/DL model for each data type, along with a compositional ML/DL model that combines the results of the type-specific models.
In the biomedical field, diverse datasets are exemplified by those acquired by a new diagnostic platform, the Molecular Perceptron (MP), developed at NIST-MML by Dr. Ming Zheng and his group. MP uses high-throughput spectroscopy to generate excitation-emission 2D images from a drop of serum when it is exposed to an array of optical probes made of DNA-wrapped carbon nanotubes (CNT). The platform design is disease-agnostic and relies on analyzing data of multiple types (the 2D images, DNA wrapping sequences, and CNT structure characteristics) to classify disease states. An additional ML/DL challenge is the expected scarcity of labeled data. This scarcity motivates the exploration of unsupervised learning methods such as Principal Component Analysis (PCA) and clustering, which will require defining general representations of the input data and associated metadata.
The proposed work will be demonstrated on data from the MP. Namely, we will leverage ML/DL models operating on multi-type input data to perform signal prediction or, when labeled image data are available, disease-based classification. The tools developed will be general and can be further validated as additional data related to different diseases become available through our NIST-MML collaborators.
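As an illustration of what a multi-type input might look like before it reaches a compositional model, the following minimal sketch encodes a DNA wrapping sequence and a CNT chirality (n, m) as numeric vectors and concatenates them with a vector of image-derived features. All function names, the choice of one-hot encoding, the choice of chirality descriptors (diameter and chiral angle from the standard geometric relations), and the sample values are assumptions made for this example; they are not the project's actual representation.

```python
import math

BASES = "ACGT"

def one_hot_dna(seq, max_len):
    """One-hot encode a DNA sequence, zero-padded to max_len bases."""
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    vec = []
    for i in range(max_len):
        base = seq[i] if i < len(seq) else None
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def chirality_features(n, m):
    """Describe an (n, m) nanotube by its indices, diameter (nm), and chiral angle (rad)."""
    a = 0.246  # graphene lattice constant in nm
    diameter = a * math.sqrt(n * n + n * m + m * m) / math.pi
    angle = math.atan2(math.sqrt(3) * m, 2 * n + m)
    return [float(n), float(m), diameter, angle]

def fuse(image_features, seq, chirality, max_len=12):
    """Concatenate the per-type feature vectors into a single input vector."""
    n, m = chirality
    return list(image_features) + one_hot_dna(seq, max_len) + chirality_features(n, m)

# Made-up sample: three image features, a 6-base sequence, and a (6, 5) tube.
x = fuse([0.8, 0.1, 0.3], "GTTGTT", (6, 5), max_len=8)
print(len(x))  # 3 + 8*4 + 4 = 39
```

Simple concatenation is only one way to combine data types; a learned, type-specific encoder per input followed by a joint model is another, and which works better on MP data is an open question in this sketch.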
The main goals of this project are as follows.
Define ML/DL architectures targeted at diverse datasets consisting of spectral images and metadata of different types. Train and test the resulting ML/DL networks on data from the MP and select the best ones.
Using the ML/DL networks developed above, select the best-performing combination of CNT chirality and DNA sequence based on MP data.
Design feature extraction techniques targeting MP spectra, DNA sequences, CNT chirality, and other important entities associated with the MP. Certain spectral features, such as the spatial distribution and shape of high-intensity spots or local textural properties of spectral images (e.g., contrast and homogeneity), are known to be closely related to complex disease signatures. Validation will compare ML/DL network performance across the various features.
Test the effectiveness of unsupervised learning techniques in characterizing unlabeled data.
Design data enhancement techniques to overcome the scarcity of labeled and unlabeled data.
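The textural properties named in the feature extraction goal, contrast and homogeneity, are commonly computed from a gray-level co-occurrence matrix (GLCM). The sketch below shows one such computation in plain Python. It assumes the spectral image has already been quantized to integer gray levels in the range 0 to levels - 1, uses a single pixel offset, and uses hypothetical function names; it is an illustration of the general technique, not the project's feature pipeline.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric gray-level co-occurrence matrix for offset (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1.0
                counts[j][i] += 1.0  # count each pair in both directions
    total = sum(map(sum, counts))
    if total == 0.0:
        raise ValueError("no pixel pairs exist for this offset")
    return [[v / total for v in row] for row in counts]

def contrast(p):
    """Sum of p[i][j] * (i - j)^2: large when neighboring pixels differ strongly."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

def homogeneity(p):
    """Sum of p[i][j] / (1 + (i - j)^2): close to 1 when neighbors are similar."""
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))

# Two toy 2x2 images with two gray levels each.
flat = glcm([[1, 1], [1, 1]], levels=2)
checker = glcm([[0, 1], [1, 0]], levels=2)
print(contrast(flat), homogeneity(flat))        # 0.0 1.0
print(contrast(checker), homogeneity(checker))  # 1.0 0.5
```

A uniform image gives zero contrast and maximal homogeneity, while a checkerboard gives the opposite trend, which is the kind of distinction these features are meant to capture.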
The project was recently funded through an internal NIST-ITL program and is therefore at an early stage. Its anticipated benefits are as follows.
The proposed effort will lead to the design of novel ML/DL methods for the analysis of diverse data types. Diverse datasets are ubiquitous, but ML/DL networks generally target a single type of data, either image data or text data, and do not properly handle the associated metadata, especially when it consists of a different type of structured data.
Several computational tools will be built to support the ML/DL networks: PCA and clustering techniques for unsupervised learning, feature extraction techniques to improve the performance of the learning networks, and data enhancement techniques to overcome the expected scarcity of training data.
From the application perspective, the MP platform empowered by ML/DL analysis will initially demonstrate its ability to accurately diagnose pathologies for which no biomarker-based approach currently exists. The long-term vision is to establish MP as a general-purpose disease classification technology based on serum measurements. This, combined with AI and existing genome- and proteome-based diagnostics, will lay a foundation for future precision medicine.
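To make the unsupervised tools mentioned above more concrete, the following minimal sketch projects unlabeled feature vectors onto their leading principal component and then splits the projected scores into two clusters with a scalar k-means. The function names, the power-iteration approach, the fixed choice of two clusters, and the toy data are all assumptions for illustration; a real analysis would more likely rely on an established numerical library and more components.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (mean, unit direction) of the leading principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (requires at least two rows).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Deterministic start; assumed not orthogonal to the leading direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, v):
    """Score of each row along direction v after centering."""
    return [sum((r[k] - mean[k]) * v[k] for k in range(len(v))) for r in rows]

def two_means_1d(scores, iters=50):
    """Two-cluster k-means on scalar scores, initialized at the min and max."""
    centers = [min(scores), max(scores)]
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [0 if abs(s - centers[0]) <= abs(s - centers[1]) else 1
                  for s in scores]
        for k in (0, 1):
            members = [s for s, lab in zip(scores, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

# Toy unlabeled data: two well-separated groups of 2D feature vectors.
data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
mean, direction = first_principal_component(data)
labels = two_means_1d(project(data, mean, direction))
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On such clearly separated toy data the clusters match the two groups; on real MP spectra, how well unlabeled samples separate is exactly what the project sets out to test.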