2022 Jan 3;23:2. doi: 10.1186/s13059-021-02569-8

Fig. 1.

Study design and highly reproducible regions (HRR). a Study design. The DNA samples are from the Chinese quartet, the HapMap trio, and NA12878. WGS was conducted on the samples using different platforms and library preparation kits in multiple labs in the original study (light blue background) and the confirmatory study (light brown background). Various variant calling pipelines were employed to generate variants (yellow boxes) from the raw sequence data. The variants were used to define the HRR and pinpoint HRVs (light green boxes). Reproducibility (blue boxes) was analyzed both for all variants and for the variants within the HRR only (green boxes) in the original and confirmatory studies. The variants with and without HRR filtering were compared with the HRVs to calculate F-scores (blue boxes), which were used to evaluate reproducibility from a different angle. b Process for defining HRR. All alignment results for the same sample were first examined to find the genomic regions that have sequence reads mapped. Difficult regions such as repeats were then removed to form the callable regions. Finally, the HRVs obtained from comparative analysis of all call sets were used to remove the low-confidence calling regions from the callable regions, resulting in the HRR. c Data generated. Sequencing data coverage is shown on the y-axis for the DNA samples. Original and confirmatory data sets are separated by the vertical solid line and labeled on the x-axis. The four Illumina sequencing platforms are separated by the vertical dashed lines and marked on the x-axis ticks, where L1 indicates the Nextera DNA Flex library preparation kit and L2 the TruSeq DNA PCR-Free Library Prep Kit. The color legend indicates samples. d Sizes (y-axis) of HRR (dark blue bars) for the 8 samples (x-axis).
The color legend shows the excluded genomic regions, including gap region (dark brown) not in GRCh38, heterochromatin (blue) for condensed DNA labeled as N in the reference, telomere (dark purple) for repeat sequence at the end of the chromosome, not mapped region (light blue), mapping conflict region (green), difficult region (purple) for repeat regions (“SimpleRepeat_imperfecthomopolgt10_slop5.BED” and “remapped_superdupsmerged_all_sort.BED”) defined by GA4GH and GIAB, calling conflict region (yellow) for the flanking region of discordant variants, and pedigree conflict region (brown). e False negative rates (FN/(TP + FN)) of HRVs for NA12878 against the GIAB v4.0 benchmark set, stratified by genome context for SNVs (left panel) and indels (right panel), in the entire v4.0 benchmark regions (blue) and confined to the HRR (red). Error bars indicate 95% confidence intervals. f False positive rates (FP/(TP + FP)) of HRVs stratified by genome context in the entire v4.0 benchmark regions. Error bars indicate 95% confidence intervals.
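The region arithmetic described in panel b can be illustrated with a small interval-subtraction sketch. This is only an illustration under stated assumptions, not the pipeline used in the study: the interval representation (half-open coordinates on a single chromosome), the helper names `merge` and `subtract`, and all coordinates are hypothetical, and the mapped regions are taken as already given.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumed convention)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(keep: List[Interval], remove: List[Interval]) -> List[Interval]:
    """Return the parts of `keep` that are not covered by `remove`."""
    result: List[Interval] = []
    remove = merge(remove)
    for start, end in merge(keep):
        cursor = start
        for r_start, r_end in remove:
            if r_end <= cursor:
                continue  # removal interval lies entirely before the cursor
            if r_start >= end:
                break  # removal interval lies entirely after this region
            if r_start > cursor:
                result.append((cursor, r_start))
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


# Toy coordinates, invented for illustration only.
mapped = [(0, 100), (150, 300)]          # regions with reads mapped
difficult = [(20, 30), (290, 400)]       # e.g. repeat regions
low_confidence = [(50, 60)]              # e.g. flanks of discordant calls

callable_regions = subtract(mapped, difficult)
hrr = subtract(callable_regions, low_confidence)
hrr_size = sum(end - start for start, end in hrr)
```

In practice, operations like these are usually run on BED files with a dedicated interval tool rather than hand-written code; the sketch only makes the order of the two subtraction steps explicit.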
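The two error rates in panels e and f follow directly from the formulas given in the caption. The sketch below computes them and attaches a 95% interval. The caption does not say how the confidence intervals were derived, so the Wilson score interval used here is an assumption chosen for illustration, and the function names and counts are hypothetical. Note that FN/(TP + FN) equals 1 minus recall, and FP/(TP + FP) equals 1 minus precision.

```python
import math
from typing import Tuple


def false_negative_rate(tp: int, fn: int) -> float:
    """FN / (TP + FN), as defined in the caption for panel e."""
    return fn / (tp + fn)


def false_positive_rate(tp: int, fp: int) -> float:
    """FP / (TP + FP), as defined in the caption for panel f."""
    return fp / (tp + fp)


def wilson_interval(events: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the study's interval method is not stated in the caption;
    Wilson is used here only as a common choice. z = 1.96 gives roughly 95%.
    """
    p = events / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Invented counts for illustration only.
tp, fn, fp = 90, 10, 5
fn_rate = false_negative_rate(tp, fn)
fn_low, fn_high = wilson_interval(fn, tp + fn)
fp_rate = false_positive_rate(tp, fp)
fp_low, fp_high = wilson_interval(fp, tp + fp)
```

With small counts in a stratum, the Wilson interval stays within [0, 1] where a simple normal approximation may not, which is one reason it is often preferred for rates close to zero.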