Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines
Justin Zook, John G. Cleary, Len Trigg, Francisco De La Vega
To evaluate and compare the performance of variant calling methods and their confidence scores, a test call set must be compared against a "gold standard". Unfortunately, these comparisons are not straightforward with current Variant Call Files (VCFs), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, which produces misleading results. A variant caller is inherently a classification method designed to score putative variants with a confidence score that allows the rate of false positive (FP) or false negative (FN) variants to be controlled for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) are efficient metrics for evaluating a test call set against the gold standard. However, in the case of VCF data this also requires special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies and, through a dynamic programming method, minimizes false positives and false negatives globally across the entire call sets for accurate performance evaluation of VCFs.
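The representation problem described above can be illustrated with a minimal Python sketch. It is not the authors' dynamic programming algorithm; it only shows why a record-by-record comparison misleads, by replaying each call set onto the reference and comparing the resulting haplotypes. The function name `apply_variants`, the 0-based `(pos, ref, alt)` tuples, and the toy reference sequence are all assumptions made for this illustration.

```python
def apply_variants(reference, variants):
    """Replay (pos, ref, alt) variants onto a reference string.

    Hypothetical helper: positions are 0-based and variants must not overlap.
    Returns the haplotype sequence implied by the call set.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # unchanged tail
    return "".join(pieces)


# Toy reference containing a short CA repeat (an assumption for illustration).
reference = "GCACACAT"

# The same two-base deletion, placed at different positions in the repeat.
call_a = [(0, "GCA", "G")]
call_b = [(4, "ACA", "A")]

# One MNP versus the same change written as two adjacent SNVs.
call_mnp = [(1, "CA", "TG")]
call_snvs = [(1, "C", "T"), (2, "A", "G")]

# A naive record-level comparison reports disagreement...
records_match = set(call_a) == set(call_b)

# ...while the implied haplotypes are identical.
haplotype_a = apply_variants(reference, call_a)
haplotype_b = apply_variants(reference, call_b)
haplotype_mnp = apply_variants(reference, call_mnp)
haplotype_snvs = apply_variants(reference, call_snvs)

print(records_match, haplotype_a == haplotype_b, haplotype_mnp == haplotype_snvs)
```

In this sketch a record-level match would count the second call in each pair as both a false positive and a false negative, even though the sequences agree. Comparing at the haplotype level avoids that, which is the kind of accounting the abstract says is needed before ROC and AUC values are meaningful.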
Zook, J., Cleary, J., Trigg, L. and De La Vega, F., Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines, bioRxiv, [online], https://doi.org/10.1101/023754 (Accessed December 11, 2023)