Frequently Asked Questions about the Genome in a Bottle Consortium, NIST's human genome reference materials, and data resources
What's the difference between NIST Reference Material DNA and DNA/cells from Coriell?
NIST worked with Coriell to grow large batches of cells, extract DNA, mix the DNA well, and aliquot it into thousands of vials, which are the NIST Reference Materials for HG001-HG005. These were characterized under the NIST quality system. They may differ from the DNA at Coriell, which comes from different batches of cells, though in general these differences are expected to be small. The NIST price is higher because it incorporates some of the costs of the NIST quality system and the extensive NIST/GIAB characterization of these samples.
If I want to start with one GIAB genome, which one should I choose?
GIAB currently develops new benchmarks first on the PGP Ashkenazi Jewish son HG002 (NIST RM 8391), since it has the most extensive trio data and is covered by the broad consent of the PGP. This currently includes benchmarks extending our small variant and structural variant calls. Over 50 commercial products based on this cell line are also available. Therefore, we recommend that you start with HG002/RM 8391, though it is often helpful to use all seven of the GIAB genomes.
Can I use GIAB for exome and targeted gene panel sequencing?
Yes, our benchmarks can be used to assess exome and targeted gene panel sequencing. You will generally want to subset to your regions of interest, e.g., using the --target-regions option in hap.py. One important limitation is that our benchmarks contain limited numbers of difficult small variants and structural variants in exons, and even fewer within targeted panels, so it is particularly important to calculate confidence intervals for performance metrics like precision and recall. One resource for more challenging variants in clinically important regions is described in https://doi.org/10.1101/335950.
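To illustrate why confidence intervals matter when few benchmark variants fall in a target region, here is a minimal Python sketch that computes a Wilson score interval for precision and recall. This is only one common choice of binomial interval, not a method prescribed by GIAB. The counts are hypothetical, and the sketch assumes a single true-positive count is used for both metrics, whereas benchmarking tools may report separate truth-side and query-side counts.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def precision_recall_with_ci(tp, fp, fn):
    """Return point estimates and Wilson intervals for precision and recall."""
    return {
        "precision": (tp / (tp + fp), wilson_interval(tp, tp + fp)),
        "recall": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
    }


# Hypothetical counts for a small gene panel (not real GIAB results).
result = precision_recall_with_ci(tp=48, fp=2, fn=2)
for metric, (estimate, (low, high)) in result.items():
    print(f"{metric}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

With only 50 variants, the interval around a 96% point estimate spans several percentage points, which is why a point estimate alone can be misleading for small panels.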
Currently there are no tumor/normal cell lines characterized by NIST or GIAB, although NIST is exploring possibilities for developing appropriately consented tumor/normal cell lines for reference material development. NIST has characterized CNVs in EGFR, HER2, and MET in several tumor cell lines in SRM 2373 and RM 8366. The Medical Device Innovation Consortium has put together a Somatic Reference Sample Landscape Report that describes many of the somatic reference samples available as of early 2019 (https://mdic.org/wp-content/uploads/2019/03/MDIC-SRS-Landscape-Analysis-Report.pdf).
How can I get involved in GIAB?
A good first step to learn about active work is to sign up for the general GIAB and analysis team Google Groups and read their recent emails.
Why are there variants in the benchmark vcfs outside the benchmark bed files?
We include some variants outside the benchmark bed file because they reduce the risk of our benchmark including only part of a complex variant (e.g., when one indel is just inside the bed and one is just outside). These complex variants can often be represented in multiple ways in the vcf file, and it is important that the benchmark vcf include all parts of a complex variant, even if part falls outside the bed, in order to ensure that benchmarking tools will not erroneously count different, but correct, representations of the complex variant as incorrect.
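To see which benchmark variants sit outside the bed file, a rough Python sketch like the following can help. It is only an illustration and is not how any benchmarking tool handles complex variants. It assumes the bed intervals are non-overlapping, checks only the VCF POS rather than the full REF span, and uses made-up coordinates; the function names are hypothetical.

```python
import bisect


def load_bed(lines):
    """Parse BED lines into {chrom: sorted [(start, end)]}; BED is 0-based, half-open."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_bed(regions, chrom, pos):
    """True if the 1-based VCF position lies inside a BED interval on chrom."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1  # VCF POS is 1-based; BED starts are 0-based
    # Find the last interval whose start is <= the position (assumes no overlaps).
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


# Made-up example: one benchmark region and two nearby variants.
bed = load_bed(["chr1\t100\t200"])
variants = [("chr1", 200), ("chr1", 201)]
for chrom, pos in variants:
    label = "inside" if in_bed(bed, chrom, pos) else "outside"
    print(f"{chrom}:{pos} is {label} the benchmark bed")
```

In this example the variant at position 200 is the last base inside the region and the one at 201 is just outside, which is the kind of boundary case where both parts of a complex variant need to be present in the vcf.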
What is the difference between "high-confidence variants and regions" and "benchmark variants and regions"?
In 2018, we decided to change the terminology for our vcf and bed files from "high-confidence" to "benchmark" in order to more clearly convey their intended use for benchmarking performance. Although we do still have high confidence that the variants are largely true, "high-confidence regions" was sometimes interpreted as meaning that everyone should have confidence in their own variant calls in these regions. Especially as we expand to more difficult regions, our benchmark regions will contain variants and regions that are difficult to characterize for some methods. In fact, our benchmark variants and regions are intended to enable anyone to determine how well any method performs for different types of variants and genome contexts within our benchmark regions.
Where can I report potential errors in the GIAB calls?
GIAB and IGSR/HGSVC both have characterized trios of Chinese ancestry. Does the proband with GIAB ID "HG005_NA24631" correspond to the one of IGSR-HGSVC with ID "HG00512"?
No, these are in fact different trios of Chinese ancestry. The GIAB Ashkenazi and Chinese trios are from the Personal Genome Project, since they are more broadly consented, including for commercial redistribution, development of iPSCs, etc. For SVs, we developed the first benchmark that enables both sensitivity and specificity assessment for the son in the Ashkenazi trio (HG002) – see https://doi.org/10.1101/664623.