2022 Apr 28;2(5):100128. doi: 10.1016/j.xgen.2022.100128

Table 2.

Base pairs overlapping different types of difficult regions that are excluded from all input call sets for HG002

| Difficult region description | Bases excluded in GRCh37 | Bases excluded in GRCh38 | Explanation of exclusion |
| --- | --- | --- | --- |
| Modeled centromere and heterochromatin | N/A | 58,270,517 | highly repetitive regions with modeled reference sequences that are difficult to characterize and structurally variable |
| VDJ | 3,482,644 | 3,348,717 | a region that undergoes somatic recombination |
| Regions that are collapsed and expanded from GRCh37/38 primary assembly alignments | 17,702,248 | N/A | regions of GRCh37 with identified issues, so benchmark small variant calls are generally not as reliable |
| Segmental duplications with >5 copies, >99% identity, and longer than 10 kb | 1,026,737 | 2,094,143 | highly similar duplications with many copies in the reference make it difficult to identify which segmental duplication is the correct location for small variants, and variants could be from structural variants or additional copies of the sequence in HG002 not in the reference |
| Potential increased copy number in HG002 | 21,595,779 | 28,679,205 | difficult to identify in which copy of region the small variants are, could be at a location in GRCh37/38 or at the extra copy of the region in HG002; no standards for representation or benchmarking in these regions |
| Inversions | 843,244 | 893,369 | would need to have a joint small and structural variant benchmark for reliable benchmarking |
| v.0.6 GIAB tier 1 plus tier 2 SV benchmark expanded by 150% | 39,371,460 | 39,560,707 | would need to have a joint small and structural variant benchmark for reliable benchmarking |
| Tandem repeats >10 kb | 1,736,692 | 4,486,559 | these repeats are similar to or longer than the read lengths for all input datasets, making variant calls less reliable |

Rows are computed by progressive subtraction: the regions in all rows above a given row are removed from that row's regions before its overlapping base pairs are counted, so no base is counted in more than one row. In non-gap regions on chromosomes 1–22, there are 158,845,257 bp in GRCh37 and 202,943,679 bp in GRCh38 that are excluded by v.4.2.1 (i.e., outside the v.4.2.1 benchmark regions).
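The progressive subtraction behind the per-row counts can be illustrated with a small sketch. This is not the authors' pipeline; it assumes each region type is available as a list of half-open `(chrom, start, end)` intervals (BED-style coordinates), and the function names `merge`, `subtract`, and `progressive_overlap`, as well as the toy intervals, are hypothetical.

```python
from collections import defaultdict


def merge(intervals):
    """Merge overlapping or adjacent half-open intervals (chrom, start, end)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        if start < end:  # skip empty intervals
            by_chrom[chrom].append((start, end))
    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


def subtract(intervals, mask):
    """Return the parts of `intervals` that are not covered by `mask`."""
    mask_by_chrom = defaultdict(list)
    for chrom, start, end in merge(mask):
        mask_by_chrom[chrom].append((start, end))
    result = []
    for chrom, start, end in merge(intervals):
        pos = start  # left edge of the part not yet masked
        for m_start, m_end in mask_by_chrom.get(chrom, []):
            if m_end <= pos:
                continue  # mask interval lies entirely to the left
            if m_start >= end:
                break  # mask interval lies entirely to the right
            if m_start > pos:
                result.append((chrom, pos, m_start))
            pos = max(pos, m_end)
            if pos >= end:
                break
        if pos < end:
            result.append((chrom, pos, end))
    return result


def progressive_overlap(categories):
    """Count bases per category after removing all earlier categories.

    `categories` is an ordered list of (name, intervals) pairs, in the
    same top-to-bottom order as the table rows.
    """
    seen = []  # union of all categories processed so far
    counts = {}
    for name, intervals in categories:
        remaining = subtract(intervals, seen)
        counts[name] = sum(end - start for _, start, end in remaining)
        seen = merge(seen + remaining)
    return counts


# Toy data only; these are not regions from the benchmark.
toy = [
    ("A", [("chr1", 0, 100)]),
    ("B", [("chr1", 50, 150)]),
    ("C", [("chr1", 90, 160), ("chr2", 0, 10)]),
]
counts = progressive_overlap(toy)
print(counts)  # {'A': 100, 'B': 50, 'C': 20}
```

In this toy case, category B spans 100 bp but contributes only 50 bp because its first half was already attributed to A. The counts therefore depend on row order, which is why the table's values should be read as bases remaining after the rows above are removed, not as the full size of each region type.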