Table 2.
Difficult region description | Bases excluded in GRCh37 | Bases excluded in GRCh38 | Explanation of exclusion
---|---|---|---
Modeled centromere and heterochromatin | N/A | 58,270,517 | highly repetitive regions with modeled reference sequences that are difficult to characterize and structurally variable |
VDJ | 3,482,644 | 3,348,717 | a region that undergoes somatic recombination |
Regions that are collapsed and expanded from GRCh37/38 primary assembly alignments | 17,702,248 | N/A | regions of GRCh37 with identified issues, where benchmark small variant calls are generally less reliable
Segmental duplications with >5 copies, >99% identity, and longer than 10 kb | 1,026,737 | 2,094,143 | highly similar duplications with many copies in the reference make it difficult to determine which copy is the correct location for a small variant; variants could also arise from structural variants or from additional copies of the sequence in HG002 that are not in the reference
Potential increased copy number in HG002 | 21,595,779 | 28,679,205 | difficult to determine which copy of the region contains a small variant, since it could lie at the location in GRCh37/38 or in the extra copy of the region in HG002; no standards exist for representation or benchmarking in these regions
Inversions | 843,244 | 893,369 | reliable benchmarking would require a joint small and structural variant benchmark
v.0.6 GIAB tier 1 plus tier 2 SV benchmark expanded by 150% | 39,371,460 | 39,560,707 | reliable benchmarking would require a joint small and structural variant benchmark
Tandem repeats >10 kb | 1,736,692 | 4,486,559 | these repeats are similar to or longer than the read lengths for all input datasets, making variant calls less reliable |
Rows are subtracted progressively: each row's count excludes bases already covered by any row above it, so bases that overlap more than one difficult region are counted only once. In non-gap regions of chromosomes 1–22, v.4.2.1 excludes 158,845,257 bp in GRCh37 and 202,943,679 bp in GRCh38 (i.e., these bases lie outside the v.4.2.1 benchmark regions).
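
The progressive subtraction used for the per-row counts can be illustrated with a minimal Python sketch. It is not the pipeline used to build the table: the function names (`merge`, `subtract`, `progressive_counts`), the single-chromosome simplification, and the toy coordinates are all assumptions made for illustration. Intervals are half-open `(start, end)` pairs, as in BED files.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), BED-style


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of `a` that are not covered by `b`."""
    b = merge(b)
    result: List[Interval] = []
    for start, end in merge(a):
        cur = start  # left edge of the piece not yet known to be covered
        for b_start, b_end in b:
            if b_end <= cur:
                continue  # this piece of b lies entirely to the left
            if b_start >= end:
                break  # remaining pieces of b lie entirely to the right
            if b_start > cur:
                result.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            result.append((cur, end))
    return result


def progressive_counts(
    categories: List[Tuple[str, List[Interval]]]
) -> Dict[str, int]:
    """Count bases per category after removing all earlier categories."""
    covered: List[Interval] = []
    counts: Dict[str, int] = {}
    for name, intervals in categories:
        unique = subtract(intervals, covered)
        counts[name] = sum(end - start for start, end in unique)
        covered = merge(covered + unique)
    return counts


# Hypothetical toy regions on one chromosome, listed in table order.
toy = [
    ("centromere", [(0, 100)]),
    ("VDJ", [(50, 150)]),
    ("inversions", [(140, 160), (300, 310)]),
]
counts = progressive_counts(toy)
print(counts)
```

With these toy inputs the counts are 100, 50, and 20 bases. They sum to 170, which matches the size of the union of all three region sets, so no base is counted twice. Real region files span multiple chromosomes, so a practical version would key the intervals by chromosome name or use an existing interval tool.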