SVClassify: a method to use multiple datasets to classify candidate structural variants into true positives and false positives

Justin M. Zook; Hemang M. Parikh; Desu Chen; Hariharan K. Iyer; Marc L. Salit; Wolfgang Losert

Published

January 16, 2016

Abstract

The human genome contains variants ranging in size from small single nucleotide polymorphisms (SNPs) to large structural variants (SVs). While the Genome in a Bottle Consortium has recently developed high-quality benchmark small-variant calls, no comparable benchmarks exist for SVs. We therefore developed methods that combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs as likely true or false positives. Our method, SVClassify, calculates annotations from one or more aligned BAM files from any high-throughput sequencing technology and then builds a one-class model from these annotations to classify candidate SVs as likely true or false positives. We used pedigree analysis to develop a set of high-confidence, breakpoint-resolved large deletions for the Genome in a Bottle pilot genome NA12878, and then used SVClassify to classify these deletions together with a set of high-confidence deletions from the 1000 Genomes Project. We first performed unsupervised clustering and visualization of these candidate SV calls alongside random regions that are likely non-SVs. Likely SVs generally clustered separately from likely non-SVs based on the annotations calculated from the aligned BAM files, and the SVs clustered into different types of deletions. We then developed a one-class classification method that separates a training set of 4000 random non-SV regions from the pedigree-based and 1000 Genomes SVs. We used our pedigree-based Gold SVs and the validated 1000 Genomes Project SVs, along with manual visualization, to test our classification methods, and found that candidate SVs with high scores are generally true SVs, whereas candidate SVs with low scores are questionable. We distribute a set of 3000 high-confidence deletions with high SVClassify scores from these call sets for benchmarking SV callers.
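The one-class idea in the abstract (fit a model on random non-SV regions only, then score how unlike those regions each candidate looks) can be illustrated with a minimal sketch. This is not the SVClassify implementation: the scoring rule here is a swapped-in robust z-score (median and median absolute deviation), the two annotations (relative coverage and discordant read-pair fraction) are hypothetical stand-ins for values that would be computed from BAM files, the data are synthetic, and the threshold of 3.0 is an arbitrary assumption.

```python
import random
import statistics

# Illustrative sketch only, NOT the SVClassify algorithm.
# Annotation names, synthetic data, and the threshold are assumptions.

def fit_one_class(non_sv_rows):
    """Learn a per-annotation median and MAD from non-SV regions only."""
    model = []
    for column in zip(*non_sv_rows):
        med = statistics.median(column)
        mad = statistics.median(abs(x - med) for x in column)
        model.append((med, mad if mad > 0 else 1e-9))  # guard against zero spread
    return model

def score(model, row):
    """Largest robust z-score across annotations; higher means less like non-SV."""
    return max(
        abs(x - med) / (1.4826 * mad)  # 1.4826 scales MAD to a normal std. dev.
        for x, (med, mad) in zip(row, model)
    )

# Synthetic training set: 4000 random non-SV regions, each described by
# (relative coverage, discordant read-pair fraction). Values are made up.
rng = random.Random(0)
non_sv = [(rng.gauss(1.0, 0.1), rng.gauss(0.02, 0.01)) for _ in range(4000)]
model = fit_one_class(non_sv)

# Hypothetical candidates: one resembling a heterozygous deletion
# (half coverage, many discordant pairs) and one resembling normal sequence.
deletion_like = (0.5, 0.30)
reference_like = (1.0, 0.02)

THRESHOLD = 3.0  # arbitrary cutoff chosen for illustration
for name, row in [("deletion_like", deletion_like), ("reference_like", reference_like)]:
    s = score(model, row)
    label = "likely SV" if s > THRESHOLD else "questionable"
    print(f"{name}: score={s:.2f} -> {label}")
```

Because only non-SV regions are used for fitting, a sketch like this needs no labeled true SVs for training, which matches the motivation stated in the abstract: validated SVs are scarce and are better reserved for testing.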

Citation

Genome Research

Pub Type

Journals

Keywords

Genomics, DNA sequencing, Structural Variants, Machine Learning

Bioscience, Genomics, Health, Clinical diagnostics and Reference materials

Citation

Zook, J., Parikh, H., Chen, D., Iyer, H., Salit, M. and Losert, W. (2016), SVClassify: a method to use multiple datasets to classify candidate structural variants into true positives and false positives, Genome Research, [online], https://doi.org/10.1186/s12864-016-2366-2 (Accessed April 19, 2024)

Created January 16, 2016, Updated January 27, 2020
