CrowdVariant: a crowdsourcing approach to curate copy number variants

Published: January 06, 2019


Justin M. Zook, Marc L. Salit, Peyton Greenside, Ryan Poplin, Mark DePristo, Madeleine Cule


Copy number variants (CNVs) are an important type of genetic variation and play a causal role in many diseases. However, they are also notoriously difficult to identify accurately from next-generation sequencing (NGS) data. For larger CNVs, genotyping arrays provide reasonable benchmark data, but NGS allows us to assay a far larger number of small (< 10 kbp) CNVs that are poorly captured by array-based methods. The lack of high-quality benchmark callsets for small CNVs has limited our ability to assess and improve CNV calling algorithms for NGS data. To address this issue, we developed a crowdsourcing framework, called CrowdVariant, that leverages Google’s scalable compute infrastructure to create a high-confidence set of copy number variants for NIST RM 8391, an Ashkenazim reference sample developed in partnership with the Genome in a Bottle Consortium. In a pilot study, we show that crowdsourced classifications, even from non-experts, can be useful for forming curated CNV calls. We then scale our framework genome-wide to identify 1782 high-confidence CNVs, which multiple lines of evidence suggest are a substantial improvement over existing CNV callsets. Our crowdsourcing methodology may be a useful guide for other genomics applications.
Citation: Pacific Symposium on Biocomputing
Volume: 24
Pub Type: Journals

Keywords: genomics, DNA sequencing, crowdsourcing, structural variants
Created January 06, 2019, Updated February 14, 2019