Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing

Justin M. Zook; Daniel V. Samarov; Jennifer H. McDaniel; Shurjo Sen; Marc L. Salit

Published

July 31, 2012

Abstract

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being recalibrated (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide bias but more bias for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
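The recalibration idea in the abstract can be illustrated with a small sketch. Because the sequence of a spike-in standard is known, any mismatch in a read mapped to it can be counted as a sequencing error, and an empirical quality score can be computed for each group of bases sharing a reported quality and dinucleotide context. The code below is a simplified illustration under stated assumptions, not GATK's implementation: the function names, the input tuple layout, and the add-one smoothing are all hypothetical choices made for this example. It relies only on the standard Phred relationship Q = -10 log10(P_error).

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def empirical_quality(mismatches, observations):
    """Phred-scaled empirical quality from mismatch counts.

    Add-one smoothing (an assumption of this sketch) avoids log10(0)
    when a group has no observed mismatches.
    """
    return -10 * math.log10((mismatches + 1) / (observations + 2))


def recalibration_table(bases):
    """Group spike-in bases by (reported quality, dinucleotide).

    `bases` is an iterable of (reported_q, dinucleotide, is_mismatch)
    tuples, a hypothetical layout for bases from reads mapped to
    spike-in standards of known sequence.
    Returns {(reported_q, dinucleotide): empirical_q}.
    """
    counts = defaultdict(lambda: [0, 0])  # [mismatches, observations]
    for reported_q, dinucleotide, is_mismatch in bases:
        entry = counts[(reported_q, dinucleotide)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    return {key: empirical_quality(m, n) for key, (m, n) in counts.items()}


if __name__ == "__main__":
    # Made-up counts for illustration only: bases reported at Q35 in a
    # GG context, with 9 mismatches among 998 observations.
    toy = [(35, "GG", i < 9) for i in range(998)]
    for key, q_emp in recalibration_table(toy).items():
        print(key, round(q_emp, 1))
```

In this toy input, the empirical quality falls below the reported quality of 35, which is the kind of overestimate the abstract describes for certain dinucleotides. The counts are invented and carry no experimental meaning.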

Citation

PLoS One

Pub Type

Journals

Download Paper

https://doi.org/10.1371/journal.pone.0041356

Keywords

bioscience and health, DNA and RNA sequencing, systematic sequencing errors

Bioscience, Genomics, Health, Clinical diagnostics and Reference materials

Citation

Zook, J., Samarov, D., McDaniel, J., Sen, S. and Salit, M. (2012), Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing, PLoS One, [online], https://doi.org/10.1371/journal.pone.0041356 (Accessed May 11, 2026)

Created July 31, 2012, Updated November 10, 2018
