Stratification Group | Description | # Strats | Example Stratifications | Useful for |
---|---|---|---|---|
FunctionalRegions | Coding regions | 2 | CDS, not in CDS | evaluating performance in coding regions more likely to be functional |
GC-content | Various ranges of GC-content | 14 | GC < 25%; 30% < GC < 55% | identifying GC bias in variant calling performance |
Low Complexity | | 22 | | evaluating performance in locally repetitive, difficult to sequence contexts |
Homopolymers | Identification of homopolymers by length | 4 | Homopolymers >101 bp; imperfect homopolymers >10 bp | evaluating performance in homopolymers, where systematic sequencing errors and complex variants frequently occur |
Simple Repeats | Di-, tri-, and quad-nucleotide repeats of different lengths | 9 | Di-nucleotide repeats 11–50 bp; di-nucleotide repeats >200 bp | evaluating performance in exact Short Tandem Repeats where systematic sequencing errors and complex variants frequently occur, and variant calls are challenging if the read length is insufficient to traverse the entire repeat |
Tandem Repeats | Tandem repeats of different lengths | 5 | Tandem repeats between 51 and 200 bp; tandem repeats >10 kb | evaluating performance in exact Short Tandem Repeats and Variable Number Tandem Repeats where systematic sequencing errors and complex variants frequently occur, and variant calls are challenging if the read length is insufficient to traverse the entire repeat |
Other Difficult | Various difficult regions of the genome | 6 | MHC; VDJ | evaluating performance in or excluding regions where variants are difficult to call and represent due to limitations of the reference genome (e.g., gaps or errors) or being highly polymorphic in the population (e.g., MHC) |
Segmental Duplications | Segmental duplications defined using multiple methods and limited to segdups >10 kb | 9 | Segdups >10 kb; selfChain | evaluating performance in regions with multiple similar copies in the reference, which are challenging to map and assemble |
Genome Specific | Difficult regions of the genome specific to one or more of the GIAB genomes, including but not limited to complex variants, copy number variants, and structural variants | 65 | CNVs, complex variants | evaluating performance in or excluding regions in each GIAB reference sample where small variants can be challenging to call (e.g., complex variants) or represent (e.g., CNVs and SVs) |

The updated stratification set includes unions of multiple stratifications as well as “not in” stratifications; the latter are useful for evaluating performance outside specific difficult genomic contexts.
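
As an illustration of how a union stratification and a “not in” stratification relate to their inputs, the following is a minimal sketch, not the pipeline used to build the stratifications. It assumes each stratification is a list of half-open `(start, end)` intervals on a single chromosome, as in BED coordinates. The function names `union` and `not_in`, the interval values, and the chromosome length of 100 are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED coordinates


def union(*strats: List[Interval]) -> List[Interval]:
    """Merge several interval lists into one sorted, non-overlapping list."""
    merged: List[Interval] = []
    for start, end in sorted(iv for strat in strats for iv in strat):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def not_in(strat: List[Interval], chrom_length: int) -> List[Interval]:
    """Return the complement of a stratification within [0, chrom_length)."""
    complement: List[Interval] = []
    cursor = 0
    for start, end in union(strat):
        if start > cursor:
            complement.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        complement.append((cursor, chrom_length))
    return complement


# Hypothetical coordinates, for illustration only.
homopolymers = [(10, 20), (50, 60)]
tandem_repeats = [(15, 30), (80, 90)]

low_complexity = union(homopolymers, tandem_repeats)
not_low_complexity = not_in(low_complexity, chrom_length=100)
```

With these toy inputs, `low_complexity` covers every base that falls in either input stratification, and `not_low_complexity` covers the rest of the chromosome, so the two together partition it. For genome-scale BED files, an established interval tool would be a more practical choice than this sketch.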