Published: January 06, 2019
Justin M. Zook, Marc L. Salit, Peyton Greenside, Ryan Poplin, Mark DePristo, Madeleine Cule
Copy number variants (CNVs) are an important type of genetic variation and play a causal role in many diseases. However, they are notoriously difficult to identify accurately from next-generation sequencing (NGS) data. For larger CNVs, genotyping arrays provide reasonable benchmark data, but NGS allows us to assay a far larger number of small (< 10 kbp) CNVs that are poorly captured by array-based methods. The lack of high-quality benchmark callsets of small-scale CNVs has limited our ability to assess and improve CNV calling algorithms for NGS data. To address this issue, we developed a crowdsourcing framework, called CrowdVariant, that leverages Google's scalable compute infrastructure to create a high-confidence set of copy number variants for NIST RM 8391, an Ashkenazim reference sample developed in partnership with the Genome in a Bottle Consortium. In a pilot study, we show that crowdsourced classifications, even from non-experts, can be used to form curated CNV calls. We then scale our framework genome-wide to identify 1782 high-confidence CNVs, which multiple lines of evidence suggest are a substantial improvement over existing CNV callsets. Our crowdsourcing methodology may be a useful guide for other genomics applications.
Citation: Pacific Symposium on Biocomputing
Pub Type: Journals
Keywords: genomics, DNA sequencing, crowd sourcing, structural variants
Created January 06, 2019, Updated February 14, 2019