Table 1.
No. | Category | Feature identifier^a | Feature description | Selection status^b |
---|---|---|---|---|
1 | Sequence/Alignment | repeat | Count of repeat units, including homopolymers and STRs, in the indel flanking region | s, m |
2 | Sequence/Alignment | lc (linguistic complexity) | Diversity of k-mers in flanking 50-bp region | |
3 | Sequence/Alignment | local_lc | Diversity of k-mers in flanking 6-bp region | s, m |
4 | Sequence/Alignment | gc | GC content in flanking 50-bp region | |
5 | Sequence/Alignment | local_gc | GC content in flanking 6-bp region | |
6 | Sequence/Alignment | strength | DNA pair-bond strength of 2-mers in flanking 50-bp region | m |
7 | Sequence/Alignment | local_strength | DNA pair-bond strength of 2-mers in flanking 6-bp region | s |
8 | Sequence/Alignment | dissimilarity** | Edit distance between indel and flanking sequences | m |
9 | Sequence/Alignment | indel_complexity | Mismatches around the indel measured by edit distance | s |
10 | Sequence/Alignment | indel_size** | Length of inserted or deleted nucleotides | m |
11 | Sequence/Alignment | is_ins | True for insertions | m |
12 | Sequence/Alignment | is_at_ins* | True for ‘A’ or ‘T’ insertions | s |
13 | Sequence/Alignment | is_at_del* | True for ‘A’ or ‘T’ deletions | |
14 | Sequence/Alignment | is_gc_ins* | True for ‘G’ or ‘C’ insertions | |
15 | Sequence/Alignment | is_gc_del* | True for ‘G’ or ‘C’ deletions | s |
16 | Sequence/Alignment | ref_count | Count of RNA-Seq reads representing the reference allele | s, m |
17 | Sequence/Alignment | alt_count | Count of RNA-Seq reads representing the indel allele | s, m |
18 | Sequence/Alignment | is_bidirectional | True if an indel is supported by forward and reverse reads | s |
19 | Sequence/Alignment | is_uniq_mapped | True if an indel is supported by uniquely mapped reads | s, m |
20 | Sequence/Alignment | is_near_exon_boundary | True if an indel is within an exon but on the exon boundary | s, m |
21 | Sequence/Alignment | equivalence_exists | True if alternative indel alignments are observed | s, m |
22 | Sequence/Alignment | is_multiallelic | True if multiple indels are observed at the locus | s, m |
23 | Transcript/Protein | is_inframe** | True if an indel is in-frame | |
24 | Transcript/Protein | is_splice | True if an indel is in an intronic region within 10 bp of an exon | |
25 | Transcript/Protein | is_truncating | True if an indel causes a frameshift or stop gain, or destroys a splice motif | |
26 | Transcript/Protein | is_in_cdd** | True if an indel is located in a conserved domain | |
27 | Transcript/Protein | indel_location | Relative indel location in coding region | s |
28 | Transcript/Protein | is_nmd_insensitive | True if an indel is insensitive to nonsense-mediated decay | |
29 | Transcript/Protein | indels_per_gene | Number of indels detected in the gene in the sample | |
30 | Transcript/Protein | cds_length | Length of the coding region | s |
31 | DB | is_on_db | True if an indel is present in the default germline database | s, m |
Note: A total of 31 features related to sequence/alignment, biological effect on transcription and protein coding, and match to germline variant database are examined.
^a Features marked with * were used only to train the single-nucleotide indel model, while those marked with ** were used only to train the multi-nucleotide indel model.
^b Features selected by the single-nucleotide model are marked s, and those selected by the multi-nucleotide model are marked m.
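
To make a few of the sequence features in Table 1 concrete, the following is a minimal Python sketch of how gc, local_gc, and the single-nucleotide type flags (is_ins, is_at_ins, is_at_del, is_gc_ins, is_gc_del) might be computed. It is an illustration, not the authors' implementation: the function names, the `left`/`right` flank arguments, and the assumption that the flanking region is taken symmetrically on both sides of the indel are all hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def flanking_gc(left: str, right: str, window: int) -> float:
    """GC content of up to `window` bases on each side of the indel.

    Assumption: the flank is symmetric, i.e. the last `window` bases of
    the left sequence plus the first `window` bases of the right sequence.
    """
    left_part = left[-window:] if window > 0 else ""
    return gc_content(left_part + right[:window])


def indel_type_flags(indel_seq: str, is_insertion: bool) -> dict:
    """Boolean type flags for an indel, keyed by the Table 1 feature names.

    The A/T and G/C flags are only set for single-nucleotide indels.
    """
    base = indel_seq.upper()
    single = len(base) == 1
    return {
        "is_ins": is_insertion,
        "is_at_ins": single and is_insertion and base in "AT",
        "is_at_del": single and not is_insertion and base in "AT",
        "is_gc_ins": single and is_insertion and base in "GC",
        "is_gc_del": single and not is_insertion and base in "GC",
    }


# Hypothetical flanking sequences around a single 'T' insertion.
left_flank = "ACGTTTGCA"
right_flank = "GGCATTACG"

gc = flanking_gc(left_flank, right_flank, window=50)       # 50-bp flank
local_gc = flanking_gc(left_flank, right_flank, window=6)  # 6-bp flank
flags = indel_type_flags("T", is_insertion=True)
```

With flanks shorter than the window, as in this toy input, the whole available sequence is used; a real pipeline would extract the flanks from the reference genome at the indel position.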