SVClassify: a method to use multiple datasets to classify candidate structural variants into true positives and false positives

Author(s)

Justin M. Zook, Hemang M. Parikh, Desu Chen, Hariharan K. Iyer, Marc L. Salit, Wolfgang Losert

Abstract

The human genome contains variants ranging in size from small single nucleotide polymorphisms (SNPs) to large structural variants (SVs). While high-quality benchmark small variant calls have recently been developed by the Genome in a Bottle Consortium, no similar high-quality benchmarks exist for SVs. Therefore, we have developed methods to combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs into likely true or false positives. Our method (SVClassify) calculates annotations from one or more aligned BAM files from any high-throughput sequencing technology, and then builds a one-class model using these annotations to classify candidate SVs as likely true or false positives. We used pedigree analysis to develop a set of high-confidence breakpoint-resolved large deletions for the Genome in a Bottle pilot genome NA12878, and then used SVClassify to classify these deletions and a set of high-confidence deletions from the 1000 Genomes Project. We first performed unsupervised clustering and visualization of these candidate SV calls alongside random likely non-SV regions. We found that likely SVs generally cluster separately from likely non-SVs based on the annotations calculated from the aligned BAM files, and that the SVs cluster into different types of deletions. We then developed a one-class classification method that separates a training set of 4000 random non-SV regions from the pedigree-based and 1000 Genomes SVs. We used our pedigree-based “Gold” SVs and 1000 Genomes Project “validated” SVs, along with manual visualization, to test our classification methods, and found that candidate SVs with high scores are generally true SVs, whereas candidate SVs with low scores are questionable. We distribute a set of 3000 high-confidence deletions with high SVClassify scores from these call sets for benchmarking SV callers.
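
The one-class idea described in the abstract can be illustrated with a minimal sketch. This is not the SVClassify implementation: the abstract does not specify the model, so a simple per-annotation z-score stands in for it here. The two annotations (normalized coverage and discordant read-pair fraction), the simulated values, and the function names `fit_one_class` and `score` are all assumptions made for illustration. Only the training-set size of 4000 non-SV regions is taken from the abstract.

```python
import random
import statistics


def fit_one_class(non_sv_rows):
    """Learn per-annotation mean and standard deviation from non-SV regions only."""
    columns = list(zip(*non_sv_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 if an annotation is constant, to avoid dividing by zero.
    stdevs = [statistics.stdev(col) or 1.0 for col in columns]
    return means, stdevs


def score(row, model):
    """Largest absolute z-score across annotations.

    A higher score means the region looks less like the non-SV training set.
    """
    means, stdevs = model
    return max(abs(x - m) / s for x, m, s in zip(row, means, stdevs))


rng = random.Random(0)

# Simulated (hypothetical) annotations for 4000 random non-SV regions:
# [normalized coverage, fraction of discordant read pairs]
non_sv = [[rng.gauss(1.0, 0.1), abs(rng.gauss(0.02, 0.01))] for _ in range(4000)]
model = fit_one_class(non_sv)

deletion_like = [0.5, 0.30]   # reduced coverage, many discordant pairs
reference_like = [1.0, 0.02]  # resembles the training regions

print("deletion-like score:", score(deletion_like, model))
print("reference-like score:", score(reference_like, model))
```

Because the model is fit only to non-SV regions, no labeled true SVs are needed for training; candidates are ranked by how far their annotations depart from the non-SV background. A real analysis would compute the annotations from aligned BAM files rather than simulate them.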
Citation
Genome Research
Volume
17

Keywords

Genomics, DNA sequencing, Structural Variants, Machine Learning
Created January 16, 2016, Updated November 10, 2018