SVClassify: a method to use multiple datasets to classify candidate structural variants into true positives and false positives

Published: January 16, 2016


Justin M. Zook, Hemang M. Parikh, Desu Chen, Hariharan K. Iyer, Marc L. Salit, Wolfgang Losert


The human genome contains variants ranging in size from small single nucleotide polymorphisms (SNPs) to large structural variants (SVs). While the Genome in a Bottle Consortium has recently developed high-quality benchmark small variant calls, no similar high-quality benchmarks exist for SVs. We have therefore developed methods that combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs as likely true or false positives. Our method (SVClassify) calculates annotations from one or more aligned BAM files from any high-throughput sequencing technology, and then builds a one-class model using these annotations to classify candidate SVs as likely true or false positives. We used pedigree analysis to develop a set of high-confidence breakpoint-resolved large deletions for the Genome in a Bottle pilot genome NA12878, and then used SVClassify to classify these deletions and a set of high-confidence deletions from the 1000 Genomes Project. We first performed unsupervised clustering and visualization of these candidate SV calls alongside random likely non-SV regions. We found that likely SVs generally cluster separately from likely non-SVs based on the annotations calculated from the aligned BAM files, and that the SVs cluster into different types of deletions. We then developed a one-class classification method that separates a training set of 4000 random non-SV regions from the pedigree-based and 1000 Genomes SVs. We used our pedigree-based “Gold” SVs and 1000 Genomes Project “validated” SVs, along with manual visualization, to test our classification methods, and found that candidate SVs with high scores are generally true SVs, while candidate SVs with low scores are questionable. We distribute a set of 3000 high-confidence deletions with high SVClassify scores from these call sets for benchmarking SV callers.
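
The one-class idea in the abstract (train only on random non-SV regions, then score how far a candidate lies from them) can be illustrated with a minimal Python sketch. This is not the SVClassify implementation. The two annotations (normalized coverage and discordant read-pair fraction), the simulated values, the distance measure, and the function names `fit_one_class` and `score` are all assumptions made for illustration.

```python
import bisect
import random
import statistics


def _distance(row, centers, spreads):
    """Largest per-annotation deviation from the non-SV center, in spread units."""
    return max(abs(x - c) / s for x, c, s in zip(row, centers, spreads))


def fit_one_class(non_sv_rows):
    """Learn a model from non-SV regions only (hypothetical, simplified)."""
    columns = list(zip(*non_sv_rows))
    centers = [statistics.median(col) for col in columns]
    spreads = []
    for col, center in zip(columns, centers):
        # Median absolute deviation; fall back to 1.0 to avoid dividing by zero.
        mad = statistics.median(abs(x - center) for x in col)
        spreads.append(mad if mad > 0 else 1.0)
    # Sorted distances of the training regions, used to rank new candidates.
    train_dist = sorted(_distance(r, centers, spreads) for r in non_sv_rows)
    return {"centers": centers, "spreads": spreads, "train_dist": train_dist}


def score(model, row):
    """Fraction of training non-SV regions that lie closer to the center than row.

    Values near 1 mean the candidate looks unlike the non-SV regions.
    """
    d = _distance(row, model["centers"], model["spreads"])
    closer = bisect.bisect_left(model["train_dist"], d)
    return closer / len(model["train_dist"])


# Simulated annotations for 4000 random non-SV regions (assumed values):
# column 0 = coverage relative to the genome average,
# column 1 = fraction of discordant read pairs.
rng = random.Random(0)
non_sv = [
    (rng.gauss(1.0, 0.1), max(0.0, rng.gauss(0.02, 0.01)))
    for _ in range(4000)
]
model = fit_one_class(non_sv)

# A candidate resembling a heterozygous deletion: about half the coverage
# and many discordant pairs.
deletion_score = score(model, (0.5, 0.30))
# A candidate that looks like an ordinary non-SV region.
normal_score = score(model, (1.0, 0.02))
print(deletion_score, normal_score)
```

In this sketch a candidate's score is the fraction of non-SV training regions that sit closer to the non-SV center than the candidate does, so deletion-like candidates score near 1 and ordinary regions score low. Real annotations would be computed from the aligned BAM files rather than simulated.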
Citation: Genome Research
Volume: 17
Pub Type: Journals


Keywords: Genomics, DNA sequencing, Structural Variants, Machine Learning
Created January 16, 2016, Updated November 10, 2018