StratoMod: predicting sequencing and variant calling errors with interpretable machine learning

Nathan Dwarshuis; Nathanael Olson; Fritz Sedlazeck; Justin Wagner; Justin Zook

Published

October 13, 2024

Author(s)

Nathan Dwarshuis, Nathanael Olson, Fritz Sedlazeck, Justin Wagner, Justin Zook

Abstract

Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show that StratoMod can precisely predict recall using HiFi or Illumina, and we leverage its interpretability to measure the contributions of difficult-to-map and homopolymer regions to each outcome. Furthermore, we use StratoMod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these analyses we use our draft benchmark based on the Q100 HG002 assembly, which contains previously inaccessible difficult regions. Finally, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines that only filter variants likely to be false. We anticipate that this will be useful for performing precise risk-reward analyses when designing variant calling pipelines.

Citation

Communications Biology

Pub Type

Journals

Keywords

benchmarking, genomics, variant calling

Reference materials, Precision medicine, Genomics, Clinical diagnostics, Bioscience, Artificial intelligence and Applied AI

Citation

Dwarshuis, N., Olson, N., Sedlazeck, F., Wagner, J. and Zook, J. (2024), StratoMod: predicting sequencing and variant calling errors with interpretable machine learning, Communications Biology, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=936081 (Accessed October 28, 2025)

Created October 13, 2024, Updated March 4, 2025
