CrowdVariant: a crowdsourcing approach to curate copy number variants

Published: January 06, 2019

Author(s)

Justin M. Zook, Marc L. Salit, Peyton Greenside, Ryan Poplin, Mark DePristo, Madeleine Cule

Abstract

Copy number variants (CNVs) are an important type of genetic variation and play a causal role in many diseases. However, they are also notoriously difficult to identify accurately from next-generation sequencing (NGS) data. For larger CNVs, genotyping arrays provide reasonable benchmark data, but NGS allows us to assay a far larger number of small (< 10 kbp) CNVs that are poorly captured by array-based methods. The lack of high-quality benchmark callsets of small-scale CNVs has limited our ability to assess and improve CNV calling algorithms for NGS data. To address this issue, we developed a crowdsourcing framework, called CrowdVariant, that leverages Google’s scalable compute infrastructure to create a high-confidence set of copy number variants for NIST RM 8391, an Ashkenazim reference sample developed in partnership with the Genome in a Bottle Consortium. In a pilot study, we show that crowdsourced classifications, even from non-experts, can be used to form curated CNV calls. We then scale our framework genome-wide to identify 1782 high-confidence CNVs, which multiple lines of evidence suggest are a substantial improvement over existing CNV callsets. Our crowdsourcing methodology may be a useful guide for other genomics applications.
Citation: Pacific Symposium on Biocomputing
Volume: 24
Pub Type: Journals

Keywords

genomics, DNA sequencing, crowdsourcing, structural variants
Created January 06, 2019, Updated February 14, 2019