Abstract
Accurate estimation of population allele frequency (AF) is crucial for gene discovery and genetic diagnostics. However, determining AF for frameshift-inducing small insertions and deletions (indels) faces challenges due to discrepancies in mapping and variant calling methods. Here, we propose an innovative approach to assess indel AF. We developed CRAFTS-indels (Calculating Regional Allele Frequency Targeting Small indels), an algorithm that combines AF of distinct indels within a given region and provides “regional AF” (rAF). We tested and validated CRAFTS-indels using three independent datasets: gnomAD v2 (n=125,748 samples), an internal dataset (IGM; n=39,367), and the UK BioBank (UKBB; n=469,835). By comparing rAF against standard AF (sAF), we identified rare indels with rAF exceeding sAF (sAF≤10−4 and rAF>10−4) as “rAF-hi” indels. Notably, a high percentage of rare indels were “rAF-hi”, with a higher proportion in gnomAD v2 (11–20%) and IGM (11–22%) compared to the UKBB (5–9%), depending on the CRAFTS-indels parameters. Analysis of the overlap of regions based on their rAF with low complexity regions and with ClinVar classification supported the pertinence of rAF. Using the internal dataset, we illustrated the utility of CRAFTS-indels in the analysis of de-novo variants and the potential negative impact of rAF-hi indels in gene discovery. In summary, annotation of indels with cohort-specific rAF can be used to handle some of the limitations of current annotation pipelines and facilitate detection of novel gene-disease associations. CRAFTS-indels offers a user-friendly approach to provide rAF annotation. It can be integrated into public databases such as gnomAD and UKBB, and used by ClinVar to revise indel classifications.
Keywords: insertions and deletions, allele frequencies, bioinformatic pipeline, variants annotation, genomic mapping
INTRODUCTION
The discovery of novel gene-disease associations relies on the identification of predicted deleterious genetic variants. Frameshift variants caused by small insertions and deletions (indels ≤ 50bp) are frequently categorized as predicted deleterious and they constitute the second most common class of variants underlying monogenic disorders (Stenson et al. 2020). The recommendation is therefore to apply “Very Strong” evidence of pathogenicity if the indel is in a gene where loss-of-function is a known mechanism of disease (Richards et al. 2015). Indels represent 16% to 25% of all sequence polymorphisms in humans (Mills et al. 2006) and the analysis of the Telomere-to-Telomere CHM13 (T2T-CHM13) genome identified many new indels suggesting many are benign (Aganezov et al. 2022). One of the central arguments for refuting a variant’s deleteriousness is its identification at a high frequency in the general population (Richards et al. 2015). However, the mapping, calling and annotation of indels are challenging, leading to inaccurate estimation of indel’s AF by many pipelines.
Comparison of four different callers for analysis of whole genome sequences demonstrated limited overlap in indel calling, with 323,334 indels identified by a single caller, 658,363 indels identified by at least two of the callers, and 315,159 identified by all four (Ratan et al. 2015). There are efforts to increase the sensitivity and specificity of indel calling (Albers et al. 2011; Li et al. 2013); however, the issue remains, and researchers repeatedly identify false-positive associations between indels and disease, only to realize after additional investigation that the indels are redundant or overlapping and most probably not associated with disease. These challenges for indel calling led researchers to remove indels when calculating the probability of loss-of-function intolerance (pLI) (Fuller et al. 2019). In addition to the difficulty in accurate indel calling, indel annotation is also challenging, as demonstrated by indel redundancy within dbSNP (Li et al. 2014). Despite these obstacles, indels are a main mutation class and need to be analyzed when searching for causal variants of human diseases.
One of the solutions implemented to overcome this substantial issue is to mask regions at higher risk for artifacts or mapping issues, or flag variants within those regions. This approach is based on the assumption that the occurrence of indels depends on the genomic context, with some loci more prone to mutate than others (i.e., hotspots)(MacLean et al. 2006; Montgomery et al. 2013; Georgakopoulos-Soares et al. 2018; Nesta et al. 2021). Different tools have been incorporated. For example, ENCODE recommends removing blacklisted regions (Amemiya et al. 2019). The Genome in a Bottle (GIAB) consortium created files that can be used as a standard resource to identify variants in low complexity regions (LCRs) (the Global Alliance for Genomics and Health Benchmarking Team et al. 2019). In the Genome Aggregation Database (gnomAD)(Karczewski et al. 2020), DUST was implemented to flag variants in LCRs (Morgulis et al. 2006). While other genomic contexts may also affect accurate mapping of indels, there is no current tool rapidly identifying them.
In this paper, we propose a novel concept “regional allele frequency” (rAF) and developed a novel approach to identify indels which are annotated as rare by standard AF (sAF) methodology, but are either redundant (same indel annotated differently) or clustered with other indels in genomic micro-regions. The approach, CRAFTS-indel (Calculating Regional Allele Frequency Targeting Small indels), aimed at flagging those regions as either biological hot spots or false positive calls. CRAFTS-indel is an adjustable algorithm calculating rAF for all indels within a given region within any dataset of interest, minimizing the issues associated with differences between bioinformatic pipelines. Using CRAFTS-indel, we identified genomic regions that harbor indels that are rare based on sAF but more common based on their rAF. We call these indels and the genomic regions containing them “rAF-hi” indels and regions, respectively.
To validate this approach, we compared the overlap of LCRs with rAF-hi regions, rAF-lo regions (only containing indels with low sAF and low rAF) and sAF-hi regions (only containing indels with high sAF). As sequencing technologies and technical batch effects can impact the occurrence of indels, we compared the prevalence and characteristics of rAF-hi regions from three datasets: 1) gnomADv2, which includes exomes from a variety of research studies but was extensively curated and outliers were systematically removed, 2) the Columbia University Institute for Genomic Medicine (IGM), which also includes exomes from a variety of studies, and 3) the UK Biobank (UKBB), which utilized a single sequencing pipeline. We also used indels with a ClinVar classification to test our hypothesis that rAF-hi indels are more likely to be classified as benign. To assess the utility of rAF in novel gene discovery, we compared the output of enrichment analysis of de-novo variants with and without the application of rAF and analyzed the burden of rAF-hi indels in an internal dataset (IGM).
Materials and Methods
2.1. Datasets
The annotated variant-level data of the 125,748 publicly available samples from gnomADv2 was downloaded from the gnomADv2.1.1 database (Karczewski et al. 2020). Indels were retained by parsing the gnomADv2 VCF files using bcftools (Danecek et al. 2021). All indels were annotated using ClinVar from the September 10, 2023 ClinVar pull (Landrum et al. 2020). The analysis was limited to protein-coding variants located in the exons of the 18,894 protein-coding genes from the Consensus Coding Sequence (CCDS) database (release 20) (Pruitt et al. 2009). No allele frequency cut-offs were applied. Only indels smaller than or equal to 50bp in length were included in this study.
The IGM cohort includes data from individuals who consented to have anonymized sequence data available for secondary genetic analysis. The variant-level data of 39,367 unrelated samples stored in ATAVDB was analyzed (Table SI 1) (Ren et al. 2021). Utilization of this dataset was approved by the Columbia University Institutional Review Board. The same quality filters described for the gnomADv2 dataset were applied to the IGM dataset (see SI).
The UKBB exome sequencing variant-level data was generated for n=469,835 UKBB participants as previously described (Geisinger-Regeneron DiscovEHR Collaboration et al. 2020; Backman et al. 2021). We accessed and pulled the unique list of indels with their respective AC and AN values from the joint-genotyped multi-sample project-level VCF through the UKBB Research Analysis Platform (RAP) on DNAnexus.
2.2. Identification of rAF-hi indels and genomic regions
For the identification of clusters of indels in the coding regions of the genome, we developed CRAFTS-indel. The input of CRAFTS-indel is the chromosome number, genomic coordinates (position), reference allele, alternate allele, alternate allele counts (AC) and the total number of alleles (AN; number of chromosomes covered in the genomic position of interest) from the given dataset. CRAFTS-indel sorts unique indels in ascending order by position for each chromosome. To ensure that CRAFTS-indel can fit studies with different hypotheses and aims, researchers can define the base pair range used to calculate the rAF. In this study, we analyzed four base pair ranges (10bp, 20bp, 30bp, and 40bp), equivalent to a maximum distance between two adjacent indels of 5bp, 10bp, 15bp and 20bp, respectively. These four base pair ranges were chosen based on previous reports on the distance between adjacent indels (Li et al. 2014), the higher risk for correlated mapping errors in small genomic regions (Bansal and Libiger 2011) and our laboratory experience. For each range, indels were grouped into genomic regions, and each region was assigned a unique identifier (Figure 1a–b). The size of the genomic regions varies, so that each indel belongs to only one genomic region and there is no overlap between genomic regions (Figure 1b).
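The grouping step described above can be sketched as follows. This is a minimal illustration, not the CRAFTS-indel implementation: it assumes that two adjacent indels share a region whenever their start positions differ by at most half the chosen base pair range, and the names `Indel` and `group_into_regions` are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Indel:
    chrom: str
    pos: int
    ref: str
    alt: str
    ac: int  # alternate allele count (AC)
    an: int  # total number of alleles covered at this position (AN)


def group_into_regions(indels, bp_range=10):
    """Assign each indel to exactly one non-overlapping region.

    Assumption: adjacent indels are chained into the same region when
    their positions differ by at most bp_range / 2 (5bp for the 10bp range).
    Returns a list of regions, each a list of Indel objects.
    """
    max_gap = bp_range / 2
    regions = []
    # Sort by chromosome, then by ascending position.
    ordered = sorted(indels, key=lambda v: (v.chrom, v.pos))
    for _, chrom_indels in groupby(ordered, key=lambda v: v.chrom):
        current = []
        for indel in chrom_indels:
            # Start a new region when the gap to the previous indel is too large.
            if current and indel.pos - current[-1].pos > max_gap:
                regions.append(current)
                current = []
            current.append(indel)
        if current:
            regions.append(current)
    return regions


# Fictional indels, for illustration only.
demo = [
    Indel("1", 100, "AT", "A", 2, 200_000),
    Indel("1", 103, "A", "AG", 3, 200_000),
    Indel("1", 120, "GC", "G", 1, 200_000),
    Indel("2", 101, "T", "TA", 1, 200_000),
]
positions_10bp = [[v.pos for v in r] for r in group_into_regions(demo, 10)]
positions_40bp = [[v.pos for v in r] for r in group_into_regions(demo, 40)]
```

With the 10bp range the indel at position 120 forms its own region, whereas the 40bp range merges it with the two upstream indels, which illustrates why larger ranges act as a more inclusive grouping.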
Figure 1: Illustration of CRAFTS-indel calculation of regional allele frequency.

A. Fictional deletions (grey triangles above the DNA strand) and insertions (white triangles under the DNA strand) depicted along a 180bp DNA region. The positions and sizes of the insertions and deletions (indels) are provided. B. The regions with indels are identified using 4 different ranges (10bp, 20bp, 30bp and 40bp). The size of each region depends on the distance between the indels and the size of the range. C. All fictional indels are assigned a sAF of 4×10-5. Regions are labelled as rAF-hi (sAF≤10−4 and rAF>10−4) or rAF-lo (sAF≤10−4 and rAF≤10−4).
Whereas sAF represents the proportion of one variant alternate allele (AC) to the total number of the alleles sequenced (coverage, AN), rAF represents the proportion of all alternate alleles within a genomic region to the coverage of the genomic region (Figure 1c–d). CRAFTS-indel first identifies all the alleles within a genomic region and their counts (AC). The AC of each indel within each genomic region is then summed (rAC), as if all indels in the region were identical and assuming that they are from independent individuals. To calculate the coverage of each genomic region (rAN), CRAFTS-indel calculates the mean coverage (AN) of all the indels within the region. The regional allele frequency (rAF) for each indel $i$ is finally calculated as $\mathrm{rAF}_i = \mathrm{rAC}_i / \mathrm{rAN}_i$, where $\mathrm{rAC}_i$ and $\mathrm{rAN}_i$ are the summed AC and the mean AN of the genomic region containing indel $i$. In addition, CRAFTS-indel calculates the standard allele frequency (sAF) for each indel $i$ in any given dataset: $\mathrm{sAF}_i = \mathrm{AC}_i / \mathrm{AN}_i$. The output from CRAFTS-indel contains the sAF, rAC, rAN and rAF for the chosen base pair range, the unique identifiers and lengths of the genomic regions, and the columns of the provided input.
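As a worked illustration of these definitions, the sketch below computes rAC, rAN, rAF and sAF for one fictional region. The function names and the counts are assumptions made for the example; only the arithmetic (sum of AC, mean of AN, ratio of the two) follows the description above.

```python
from statistics import mean


def standard_af(ac, an):
    """sAF of a single indel: its AC divided by its AN."""
    return ac / an


def regional_af(region):
    """Return (rAC, rAN, rAF) for one genomic region.

    region: list of (AC, AN) pairs, one pair per indel in the region.
    rAC is the sum of AC, rAN is the mean AN, and rAF = rAC / rAN.
    """
    r_ac = sum(ac for ac, _ in region)
    r_an = mean(an for _, an in region)
    return r_ac, r_an, r_ac / r_an


# Three fictional indels in the same region, each rare on its own.
region = [(8, 200_000), (7, 200_000), (9, 200_000)]
s_afs = [standard_af(ac, an) for ac, an in region]
r_ac, r_an, r_af = regional_af(region)
```

Here every indel has an sAF at or below 10−4, yet the 24 alternate alleles pooled over a mean coverage of 200,000 alleles give a rAF above 10−4, which is the situation the rAF-hi label is meant to capture.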
The AF threshold can be defined based on the purpose of the rAF calculation. In this study, an AF threshold of 10−4 was chosen to define rAF-hi indels (sAF≤10−4 and rAF>10−4, Table 1). We defined rAF-hi regions as genomic regions containing at least one rAF-hi indel. Genomic regions only containing rAF-lo indels (sAF≤10−4 and rAF≤10−4) were defined as rAF-lo. Genomic regions only containing indels with an sAF>10−4 were defined as sAF-hi regions. Rare indels based on standard allele frequency (sAF≤10−4) were defined as sAF-lo. The length of the rAF-hi, rAF-lo and sAF-hi genomic regions is the distance between the most proximal and the most distal indel within the same “micro-region” based on the chosen base-pair range. The AF thresholds and base-pair ranges can be changed depending on the goals of the study.
Table 1:
Classification of indels and genomic regions based on standard AF and regional AF
| | All indels in the genomic region have sAF > 10−4 | All indels in the genomic region have sAF ≤ 10−4 (sAF-lo) | At least one indel in the genomic region has sAF ≤ 10−4 and the others have sAF > 10−4 |
|---|---|---|---|
| All indels in the genomic region have rAF ≤ 10−4 | sAF-hi indels; sAF-hi region | rAF-lo indels; rAF-lo region | rAF-lo & sAF-hi indels; rAF-lo region |
| All indels in the genomic region have rAF > 10−4 | | rAF-hi indels; rAF-hi region | rAF-hi & sAF-hi indels; rAF-hi region |
| At least one of the indels in the genomic region has rAF > 10−4 and the others have rAF ≤ 10−4 | | rAF-hi & rAF-lo indels; rAF-hi region | |
rAF: regional allele frequency; sAF: standard allele frequency
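The labelling rules summarized in Table 1 can be expressed compactly. The sketch below is one possible reading of those rules, with hypothetical function names; the default threshold of 10−4 is the one used in this study and is exposed as a parameter because it is meant to be adjustable.

```python
def classify_indel(s_af, r_af, threshold=1e-4):
    """Label one indel from its standard (sAF) and regional (rAF) frequency."""
    if s_af > threshold:
        return "sAF-hi"
    # From here on the indel is rare by sAF (sAF-lo).
    return "rAF-hi" if r_af > threshold else "rAF-lo"


def classify_region(indel_labels):
    """Label a genomic region from the labels of the indels it contains."""
    labels = set(indel_labels)
    if "rAF-hi" in labels:
        return "rAF-hi"  # at least one rAF-hi indel
    if labels == {"sAF-hi"}:
        return "sAF-hi"  # only indels with sAF above the threshold
    return "rAF-lo"      # rAF-lo indels, with or without sAF-hi indels (Table 1)


# Fictional frequencies, for illustration only.
labels = [
    classify_indel(4e-5, 1.2e-4),  # rare by sAF, common by rAF
    classify_indel(4e-5, 8e-5),    # rare by both
    classify_indel(2e-4, 3e-4),    # common by sAF
]
```

An indel whose sAF equals the threshold exactly is treated as rare, in line with the "≤" used in the definitions.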
As the proportion of redundant indels out of all indels was previously reported by chromosome (Li et al. 2014), we also analyzed the proportion of rAF-hi indels out of all sAF-lo indels across each chromosome.
2.3. Low complexity regions (LCRs)
As LCRs are known to be difficult to sequence and to map, we hypothesized that rAF-hi regions overlap with LCRs like sAF-hi regions, but more than rAF-lo regions. We compared the overlap between LCRs and genomic regions containing sAF-hi, rAF-lo, or rAF-hi indels using chi-square and the Cramer’s V test. The null hypothesis assumed equal proportions of genomic regions overlapping with LCRs irrespective of the indel’s AF. The positions of the LCRs in bed file format were downloaded from the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team and the GIAB Github (the Global Alliance for Genomics and Health Benchmarking Team et al. 2019). The overlaps between these genomic regions and LCRs were identified using BEDTools. The percentage of overlap was calculated by dividing the number of sAF-hi regions, rAF-lo regions and rAF-hi regions that overlap with the LCR by the total number of genomic regions for each group.
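To make the comparison concrete, the following sketch computes the percentage of overlap, the Pearson chi-square statistic and Cramér's V from a contingency table of region counts. The counts are hypothetical and chosen only to reproduce overlap percentages of the kind reported in this study; the function name is an assumption, and no continuity correction is applied.

```python
from math import sqrt


def chi_square_and_cramers_v(table):
    """Pearson chi-square (no continuity correction) and Cramer's V.

    table: contingency table as a list of rows, e.g. rows = region group,
    columns = (overlaps an LCR, does not overlap an LCR).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under equal overlap proportions across groups.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return chi2, sqrt(chi2 / (n * (k - 1)))


# Hypothetical counts (not taken from any dataset in this study).
counts = [
    [640, 360],  # rAF-hi regions: overlap LCR, no overlap
    [90, 910],   # rAF-lo regions: overlap LCR, no overlap
]
overlap_pct = [100 * row[0] / sum(row) for row in counts]
chi2, cramers_v = chi_square_and_cramers_v(counts)
```

Cramér's V depends on the relative sizes of the groups as well as on the overlap percentages, so values obtained from invented counts such as these are not comparable to the effect sizes reported in the Results.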
2.4. ClinVar classifications
Variants submitted to ClinVar are reviewed and interpreted by geneticists, and indels in regions with a high frequency of indels are often annotated as likely benign, even if they are rare based on standard allele frequency. We therefore hypothesized that rAF-hi indels are more likely to be classified as benign. Rare indels in the three datasets were annotated with ClinVar and the association between their rAF and their ClinVar classification was analyzed (see SI).
2.5. Assessing the utility of rAF in studying monogenic diseases
One approach for studying the genetic architecture of diseases is to assess the genome-wide enrichment of de-novo variants (Deciphering Developmental Disorders Study 2017). Removal of variants using strict internal and external AF thresholds is an important step in de-novo analysis. To assess the impact of applying a strict rAF threshold, we identified a set of 111 trios in the IGM with either a diagnostic inherited variant or a strong family history suggesting an inherited disease. Family relationship was confirmed using KING software (Manichaikul et al. 2010). De-novo variants were defined using accepted methodologies (see SI). To estimate the probability of a de-novo variant, we used denovolyzeR (Ware et al. 2015) and the updated mutation table from denovoWEST (Jobo and Samocha 2020) (see SI). We hypothesized that application of rAF will reduce the false-positive enrichment for de-novo loss-of-function variants in a cohort unlikely to have an increased burden of de-novo variants.
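The logic of such an enrichment test can be illustrated with a one-sided Poisson comparison of observed and expected counts. This is a simplified sketch under that assumption, not a reproduction of the denovolyzeR workflow; the function name and the numbers are hypothetical.

```python
from math import exp


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    term = exp(-expected)  # P(X = 0)
    below = 0.0
    for k in range(observed):
        below += term  # accumulate P(X = k) for k < observed
        term *= expected / (k + 1)
    return 1.0 - below


# Hypothetical cohort: 5 de-novo loss-of-function variants observed where
# the mutation model expects 1.5; removing two variants that fail an rAF
# filter leaves 3 observed against the same expectation.
p_before_filter = poisson_upper_tail(5, 1.5)
p_after_filter = poisson_upper_tail(3, 1.5)
```

In this toy example, removing a small number of artefactual calls is enough to move the p-value from below to above a 0.05 threshold, which is the kind of effect the rAF filter is hypothesized to have in a cohort without a true excess of de-novo variants.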
To evaluate the impact of rAF on novel gene discovery, we analyzed the number of individuals in the IGM cohort carrying rAF-hi predicted deleterious indels. Predicted deleterious indels were defined as those annotated as frameshift, splice donor, splice acceptor, stop gained, start lost, stop lost, or exon loss (Richards et al. 2015). We then limited the analysis to constrained genes (gnomADv2 pLI>0.5 and Loss-of-function Observed/Expected Upper-bound Fraction [LOEUF] score < 0.35), as they are most associated with dominant disorders and identified as novel candidates for dominant genetic disorders (Lek et al. 2016; Harrison et al. 2019; Karczewski et al. 2020). Finally, we limited the analysis to constrained genes known to be associated with dominant disorders based on OMIM to assess the potential impact of rAF on variant classification (Amberger et al. 2019).
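The carrier counts described above amount to a filter followed by a count of distinct individuals and genes. A minimal sketch is given below; the effect labels, gene names and the function `count_carriers` are hypothetical, and only the constraint cut-offs (pLI>0.5 and LOEUF<0.35) are taken from the text.

```python
# Effect labels are hypothetical; annotation tools use their own vocabularies.
DELETERIOUS_EFFECTS = {
    "frameshift", "splice_donor", "splice_acceptor",
    "stop_gained", "start_lost", "stop_lost", "exon_loss",
}


def is_constrained(pli, loeuf):
    """Constraint definition used in this study: pLI > 0.5 and LOEUF < 0.35."""
    return pli > 0.5 and loeuf < 0.35


def count_carriers(calls, gene_scores, constrained_only=False):
    """Count individuals and genes with >= 1 rAF-hi predicted deleterious indel.

    calls: iterable of (sample_id, gene, effect, label) tuples.
    gene_scores: dict mapping gene -> (pLI, LOEUF). Genes without scores
    are treated as unconstrained (an assumption of this sketch).
    """
    samples, genes = set(), set()
    for sample_id, gene, effect, label in calls:
        if label != "rAF-hi" or effect not in DELETERIOUS_EFFECTS:
            continue
        if constrained_only:
            scores = gene_scores.get(gene)
            if scores is None or not is_constrained(*scores):
                continue
        samples.add(sample_id)
        genes.add(gene)
    return len(samples), len(genes)


# Fictional calls and scores, for illustration only.
calls = [
    ("S1", "GENE_A", "frameshift", "rAF-hi"),
    ("S1", "GENE_B", "frameshift", "rAF-hi"),
    ("S2", "GENE_B", "stop_gained", "rAF-lo"),
    ("S3", "GENE_A", "missense", "rAF-hi"),
    ("S4", "GENE_B", "splice_donor", "rAF-hi"),
]
gene_scores = {"GENE_A": (0.99, 0.20), "GENE_B": (0.10, 0.90)}
all_genes = count_carriers(calls, gene_scores)
constrained = count_carriers(calls, gene_scores, constrained_only=True)
```

Counting distinct sample identifiers rather than calls ensures that an individual carrying several qualifying indels is counted once.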
3. RESULTS
3.1. Characteristics of rAF-hi genomic regions and rAF-hi indels
In the gnomADv2 dataset, there are 38,543 rAF-hi genomic regions (CRAFTS-indel 10bp range). The mean length of the rAF-hi genomic regions is 3.76bp and the median is 2bp (Table SI 2a). There is a drop in the number of rAF-hi genomic regions after 5bp, and genomic regions less than 10bp account for 91% of the rAF-hi genomic regions (Figure SI 1a, Table SI 2a). Utilization of larger base pair ranges (20–40bp) does not change the proportion of rAF-hi regions smaller than the range used (Figure SI 1b–d, Table SI 2a). Similar findings were replicated in the IGM cohort (Figure SI 1e–h, Table SI 2b). In the UKBB, a higher proportion of rAF-hi genomic regions is smaller than the range used (97%, Figure SI 1i–l, Table SI 2c).
In the gnomADv2 dataset, there are 101,092 rAF-hi indels (11% of sAF-lo indels) in 13,137 genes (CRAFTS-indel 10bp range, Figure 2, Table SI 3a). The proportion of sAF-lo indels that are rAF-hi is similar in all chromosomes, with the highest proportion in chromosomes X and 21 (Figure SI 2). The proportion of sAF-lo indels identified as rAF-hi grows with the size of the base pair range utilized, and so does the number of genes with rAF-hi indels (11–20%, Figure 2, Table SI 3a). A similar proportion of sAF-lo indels is identified as rAF-hi in the IGM dataset (11–22%, Figure 2, Table SI 3b), but a smaller proportion in the UKBB (5–9%, Figure 2, Table SI 3c). The number of genes containing rAF-hi indels is similar in the IGM and UKBB, but larger in the gnomADv2 dataset (Tables SI 3a–c).
Figure 2: Proportion of rAF-hi indels out of sAF-lo indels.

The percentage of rAF-hi indels (rAF>10−4) out of all sAF-lo indels (sAF ≤ 10−4) in the gnomAD, IGM and UKBB datasets for each of the four different ranges (10bp, 20bp, 30bp and 40bp). Additional information in Table S3.
3.2. Overlap between rAF-hi genomic regions and LCRs
Using the CRAFTS-indel 10bp range, 64% of rAF-hi regions in the gnomADv2 dataset overlap with LCRs, compared to 43% of the sAF-hi regions and 9% of the rAF-lo regions (Figure 3). There is a stronger difference between the proportion of rAF-hi regions and the proportion of rAF-lo regions overlapping with LCRs (Cramer’s V=0.39) than between the proportions of rAF-hi regions and sAF-hi regions overlapping with LCRs (Cramer’s V=0.20, Table SI 4a). The proportion of rAF-hi genomic regions outside of LCRs increases with the size of the base pair range (20–40bp), and the difference between the proportions of rAF-hi and sAF-hi regions overlapping with LCRs decreases (Figure 3, Table SI 4a). Similar findings were replicated in the IGM dataset and the UKBB (Figure 3, Table SI 4b–c). The overlap between the rAF-hi regions identified in the three datasets is minimal, regardless of their overlap with LCRs (Figure S3).
Figure 3: Number and proportion of genomic regions overlapping with low complexity regions (LCRs).

A. Illustration of the identification of regions overlapping with LCRs. B. Number and proportion of rAF-lo regions and LCRs for the 3 datasets (gnomAD, IGM and UKBB) and for the 4 ranges (10bp, 20bp, 30bp and 40bp). C. Number and proportion of sAF-hi regions and LCRs for the 3 datasets (gnomAD, IGM and UKBB) and for the 4 ranges (10bp, 20bp, 30bp and 40bp). D. Number and proportion of rAF-hi regions and LCRs for the 3 datasets (gnomAD, IGM and UKBB) and for the 4 ranges (10bp, 20bp, 30bp and 40bp). Additional information in Tables S4a–c.
3.3. Correlation between rAF-hi indels and ClinVar classification
Using the CRAFTS-indel 10bp range, in the gnomADv2 dataset, 3% of the sAF-lo pathogenic (P) and 3% of the sAF-lo likely pathogenic (LP), 21% of the sAF-lo benign (B) and 15% of the sAF-lo likely benign (LB) indels are rAF-hi (Table 2). Of the 404 rAF-hi indels with a rAF>1% in gnomADv2 with a ClinVar classification of P/LP/LB or B, 12 (3%) are classified as P/LP and 392 (97%) are classified as B/LB. Increasing the range raises the proportion of rAF-hi for both pathogenic and likely pathogenic indels to 7% and the proportion of benign variants to 31%. Similar findings were replicated in the IGM and UKBB datasets and when using different base pair ranges (Table 2, Tables SI 5a–c).
Table 2:
Association between ClinVar classification and regional allele frequency (10bp range)
| ClinVar classification | gnomAD: sAF-lo indels | gnomAD: rAF-hi (proportion a) | IGM: sAF-lo indels | IGM: rAF-hi (proportion a) | UKBB: sAF-lo indels | UKBB: rAF-hi (proportion a) |
|---|---|---|---|---|---|---|
| Benign | 1939 | 402 (21%) | 162 | 61 (38%) | 424 | 57 (13%) |
| Likely Benign | 5917 | 881 (15%) | 546 | 230 (42%) | 2463 | 226 (9%) |
| Likely Pathogenic | 1379 | 40 (3%) | 700 | 29 (4%) | 408 | 2 (0%) |
| Pathogenic | 7921 | 201 (3%) | 4086 | 166 (4%) | 2047 | 13 (1%) |
a Proportion of rAF-hi indels out of sAF-lo indels.
Analyses with base pair ranges of 20bp, 30bp and 40bp are presented in Tables SI 5a–c.
The 241 rAF-hi indels classified as P/LP in gnomAD (Table 2) belong to 161 independent rAF-hi regions (10bp range). Of those 161 regions, 46% contained only one gnomAD indel with a ClinVar classification, and 31% contained one P/LP variant as well as one or multiple gnomAD indels classified as Variant of Uncertain Significance (VUS), Benign or Likely Benign (Table SI 6). The remaining regions contained multiple gnomAD indels classified as P/LP: 12% did not contain indels of other ClinVar classifications and 11% also contained one or multiple gnomAD indels classified as VUS, Benign or Likely Benign. Similar results were observed in the IGM and UKBB datasets.
An example of a sAF-lo indel (gnomADv2 sAF=3.10×10−5 and IGM sAF=1.28×10−5) that is classified in ClinVar as pathogenic and identified as rAF-hi based on both its IGM rAF (1.02×10−4) and its gnomAD v2 rAF (9.95×10−4) is the indel 4–1980558-GC-G in the gene NSD2 (Figure S4). The indel was identified in an individual enrolled in a study as a control. The deletion is in an area with 8 consecutive cytosine nucleotides, which can lead to equivalent deletions. NSD2 is associated with Rauch-Steindl syndrome, an autosomal dominant disorder associated with short stature, small head circumference, dysmorphic facial features, developmental delay, and impaired intellectual function, albeit with variable expressivity and unknown penetrance (Zanoni et al. 2021). The rAF-hi annotation is due to two additional indels within the CRAFTS-indel 10bp range. One of the additional indels is a frameshift classified in ClinVar as “Likely Benign”. Based on this information, the ClinVar classification of both indels should probably be amended to Variant of Uncertain Significance.
3.4. Assessing the burden of rAF-hi indels
To evaluate the impact of applying a rAF threshold in research, we analyzed 111 negative trios (See Methods). When only applying a gnomADv2 sAF≤10−5 filter, we observed a false-positive significant enrichment for de-novo loss-of-function variants (p-value=8.46×10−3). When applying a gnomADv2 rAF≤10−4 and an IGM rAF≤10−4 filter, this enrichment was no longer significant (p-value>0.01, Table SI 7). Using larger CRAFTS-indel base pair ranges did not impact the results.
The IGM dataset was then used to assess the prevalence of rAF-hi indels and their potential impact on novel gene discovery through the determination of the number of individuals carrying predicted deleterious indels in all genes and in constrained genes only. Using the CRAFTS-indel 10bp range, 18,351 individuals in the IGM dataset (47%) carry at least one rAF-hi predicted deleterious indel in 4,767 different genes (Table 3). When restricting the analysis to genes constrained against loss-of-function, 4,325 individuals (11% of the IGM cohort) carry at least one rAF-hi predicted deleterious indel in 686 different genes, with 4 constrained genes having more than 50 individuals with a predicted deleterious rAF-hi indel: KMT2B, TRRAP, HTT and ATXN2 (Table SI 8a). Of the individuals carrying a rAF-hi indel in KMT2B, 29% were enrolled in studies on the genetics of kidney diseases and 26% were enrolled as controls or healthy family members. KMT2B is associated with childhood onset dystonia 28 (OMIM #617284) and intellectual developmental disorder, autosomal dominant 68 (OMIM #619934) and has a pLI score of 1. Even though KMT2B is expressed in the kidney, those disorders have not been associated with kidney anomalies. Of the individuals carrying a rAF-hi indel in TRRAP, 71% had epilepsy, which is one of the clinical presentations of TRRAP-associated developmental delay. However, most reported pathogenic variants in TRRAP are missense variants, raising questions about the clinical implications of loss-of-function variants (Cogné et al. 2019). Of the individuals carrying rAF-hi indels in HTT, 22% were enrolled in studies on the genetics of kidney diseases and 21% were enrolled as controls or healthy family members. Of the individuals carrying rAF-hi indels in ATXN2, 25% were enrolled in a genetic study on amyotrophic lateral sclerosis (ALS) and 24% were enrolled as controls or healthy family members.
For HTT and ATXN2, CAG expansion repeats have been reported to be associated with Huntington disease (1998) and ALS susceptibility (Elden et al. 2010), respectively; however, different bioinformatic pipelines, such as ExpansionHunter (Dolzhenko et al. 2017), are recommended to identify such variants and provide the number of repeats. The deleteriousness of other indels in those genes is unknown. The number of individuals carrying rAF-hi indels and the number of genes with those rAF-hi indels increase with the CRAFTS-indel base pair range, as does the number of constrained genes with more than 50 individuals carrying a predicted deleterious rAF-hi indel (Table 3, Tables SI 8a–c).
Table 3.
Number of individuals and genes with deleterious rAF-hi indels a in the IGM dataset
| Range | All genes: individuals | All genes: genes | Constrained genes b: individuals | Constrained genes b: genes |
|---|---|---|---|---|
| 10 bps | 18,351 | 4,767 | 4,325 | 686 |
| 20 bps | 21,727 | 5,728 | 5,249 | 806 |
| 30 bps | 24,002 | 6,443 | 6,087 | 916 |
| 40 bps | 25,874 | 7,054 | 6,745 | 981 |
a Deleterious effects: frameshift variants, splice donor variants, splice acceptor variants, stop gained variants, start lost variants, stop lost variants and exon loss variants.
b Constrained genes: gnomAD pLI >0.5 and an oe_lof_upper score <0.35.
To understand the burden of rAF-hi, we then focused on individuals carrying at least one rare (IGM sAF≤10−4) predicted deleterious indel in constrained genes already known to be associated with autosomal dominant disorders. A total of 5,299 individuals carrying at least one such indel were found. The prevalence of individuals carrying only rAF-hi indels (no rAF-lo indels) was then calculated and a total of 1,236 individuals (23.3%) identified (CRAFTS-indel 10bp range, Table SI 9).
DISCUSSION
We propose a novel concept: regional allele frequency of small insertions and deletions and developed an algorithm to calculate it. This algorithm identifies rAF-hi genomic regions containing indels that are redundant or in proximity with other indels. The algorithm is adjustable, allowing researchers to define the AF thresholds and the base pair ranges to identify rAF-hi regions and indels. We demonstrated that a large proportion of individuals carry rAF-hi predicted deleterious indels, even in constrained genes. This could impact novel gene discovery, especially if the cases and controls are not sequenced, aligned and annotated using the same bioinformatic pipeline. Overall, “regional allele frequency” could help with indels interpretation like standard allele frequency helps with the interpretation of single nucleotide variants.
Interestingly, we observed that most rAF-hi genomic regions are very small. More than 85% of the regions identified were smaller than the length of the base pair range utilized. Such small regions might be difficult to identify using only sequence-based models like LCRs or direct, inverted, or mirror repeats (Ball et al. 2005). The decision on which CRAFTS-indel base pair range to use to determine the rAF should be made based on the goal of the analysis, balancing sensitivity with specificity. Utilization of larger base pair range will be the equivalent of applying a more stringent filter as indels at larger distance would be combined to calculate the rAF. A more stringent filter might be used when the dataset includes a large number of different exome kits, sequencing data with low coverage or of relatively low quality. As previously shown, the number of adjacent indels rapidly decrease with distance, probably reducing the differences between a range of 40bp and a range of 50bp (Li et al. 2014). Similarly, we applied a threshold of 10−4 to identify rAF-hi indels and regions, however the threshold for low and high allele frequency depends on the size of the cohort and the goals of the study.
As expected, rAF-hi regions partially overlap with LCRs, however, they allow the identification of many more regions, therefore providing novel information. In all datasets, the proportion of regions containing rAF-hi indels overlapping with LCRs was comparable to the proportion of regions containing sAF-hi indels, but not to the regions containing rAF-lo indels. This similarity suggests that a subset of the rAF-hi regions, like LCRs, contain sequences at increased risk for PCR amplification and alignment errors. As DNA replication errors are also the cause of hotspots for disease-causing indels, the identification of rAF-hi indels should be interpreted in the frame of the specific research study.
Based on ClinVar submissions, rAF-hi indels are more frequently classified as benign or likely benign by geneticists compared to pathogenic or likely pathogenic variants, suggesting that rAF provides similar information to the one utilized by clinical geneticists when adjudicating the pathogenicity of rare indels. At least a third of the rAF-hi indels classified as P/LP could be reclassified as VUS based on the ClinVar classification of other adjacent or overlapping indels, suggesting that consideration of rAF could be used by geneticists as a criterion for indel pathogenicity classification. The analysis of de-novo variants demonstrated that filtering using rAF can reduce false-positive rate. Using the IGM cohort we report a high burden of rAF-hi indels, which can generate significant challenges, including time spent reviewing and validating those indels, as well as a risk of misclassification. Together with the lack of clinical presentation of genetic disease in individuals carrying rAF-hi indels, the disproportionate number of individuals carrying deleterious rAF-hi indels in constrained genes known to be associated with autosomal dominant disorders suggests that rAF-hi indels may mislead researchers into false-positive associations. The rAF values can be easily incorporated into databases reporting AF of large cohorts (e.g. gnomAD and the UKBB) and be displayed by annotation tools (e.g. VEP). The rAF could impact ClinVar classification of Pathogenic variants into Variants of Uncertain Significance. Such rAF annotation could help geneticists, genetic counselors, clinicians and researchers rapidly identify suspicious indels and interpret their potential relevance to their phenotype of interest.
In addition, because indel calling depends on the bioinformatic pipeline utilized, rAF-hi regions vary between datasets and most of the rAF-hi regions are dataset-specific, highlighting the importance of calculating rAF for each cohort. Differences between rAF-hi regions may be due to the distinct sequencing protocols and bioinformatic pipelines used to map, call, and annotate variants. Another cause might be differences in the composition of the cohorts in terms of genetic ancestry and disease status. The gnomAD v2 and IGM datasets are more diverse than the UKBB dataset and include participants in research studies focused on diseases, while the UKBB represents the general population. In addition, a higher number of rAF-hi indels was identified in the gnomAD v2 and IGM datasets than in the UKBB. This difference may be attributed to the fact that the entire UKBB cohort was sequenced using the same platform. Additionally, the UKBB underwent joint genotyping at the cohort level using GLnexus (Yun et al. 2021). In contrast, gnomAD v2 employed the GATK joint genotyping methodology, and the IGM samples were not jointly genotyped at the cohort level. The differences between the datasets emphasize the limitations of using the sAF provided by an external dataset when analyzing novel indels in an independent dataset. The unique rAF-hi regions identified in each dataset demonstrate the added value of annotating the indels of any cohort with a cohort-specific rAF.
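To make the idea of a cohort-specific rAF concrete, the following is a minimal sketch, not the CRAFTS-indels implementation. It assumes that the rAF of an indel is the sum of the sAF of all indels on the same chromosome whose position lies within a symmetric window (here ±10 bp) of that indel, and it applies the rAF-hi definition used in this study (sAF ≤ 10^-4 and rAF > 10^-4). The names `Indel`, `regional_af` and `is_raf_hi`, as well as the exact window definition, are hypothetical and chosen for illustration only.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass

# Rarity threshold from the rAF-hi definition: sAF <= 1e-4 and rAF > 1e-4.
RARE_THRESHOLD = 1e-4


@dataclass(frozen=True)
class Indel:
    chrom: str
    pos: int    # start position of the indel
    af: float   # standard allele frequency (sAF) in this cohort


def regional_af(indels, window_bp=10):
    """Return (indel, rAF) pairs for one cohort.

    ASSUMPTION: rAF is the sum of sAF over all indels on the same
    chromosome within +/- window_bp of the indel, capped at 1.0.
    The published tool may define the region differently.
    """
    by_chrom = {}
    for v in indels:
        by_chrom.setdefault(v.chrom, []).append(v)

    annotated = []
    for group in by_chrom.values():
        group.sort(key=lambda v: v.pos)
        positions = [v.pos for v in group]
        for v in group:
            # Indices of the first and last neighbors inside the window.
            lo = bisect_left(positions, v.pos - window_bp)
            hi = bisect_right(positions, v.pos + window_bp)
            raf = min(1.0, sum(n.af for n in group[lo:hi]))
            annotated.append((v, raf))
    return annotated


def is_raf_hi(saf, raf, threshold=RARE_THRESHOLD):
    """An indel is rAF-hi when it is rare by sAF but not by rAF."""
    return saf <= threshold and raf > threshold


# Toy cohort (made-up frequencies, for illustration only).
cohort = [
    Indel("1", 100, 6e-5),
    Indel("1", 105, 7e-5),
    Indel("1", 500, 5e-5),
    Indel("2", 100, 2e-4),
]
for indel, raf in regional_af(cohort, window_bp=10):
    print(indel.chrom, indel.pos, indel.af, raf, is_raf_hi(indel.af, raf))
```

In this toy cohort, the two neighboring indels on chromosome 1 are each rare by sAF but jointly exceed the threshold, so both would be flagged, whereas the isolated indel and the indel that is already common by sAF would not. Because the input is the cohort's own call set, rerunning the same function on a different cohort yields that cohort's rAF-hi regions.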
In summary, annotation of indels with cohort-specific regional allele frequency can provide an additional tool to handle some of the limitations of current annotation pipelines and facilitate detection of novel gene-disease associations. CRAFTS-indels can be easily implemented, so that rAF annotation could be provided to researchers and potentially clinical geneticists.
Supplementary Material
Figure SI 1: Length distribution of rAF-hi genomic regions. A-D Lengths of rAF-hi genomic regions in the gnomAD v2 dataset (A- 10bp range, B- 20bp range, C- 30bp range, D- 40bp range). E-H Lengths of rAF-hi genomic regions in the IGM dataset (E- 10bp range, F- 20bp range, G- 30bp range, H- 40bp range). I-L Lengths of rAF-hi genomic regions in the UKBB dataset (I- 10bp range, J- 20bp range, K- 30bp range, L- 40bp range). Additional information in Tables S3a–c.
Figure SI 2: The percentage of rAF-hi indels across each chromosome in the gnomAD v2 (A), IGM (B) and UKBB (C) datasets.
Figure SI 3: Overlap of rAF-hi regions between datasets and low complexity regions (LCR). A. Number of base pairs classified as rAF-hi in each one of the 3 datasets and in LCR. B. Number of base pairs classified as rAF-hi in each one of the 3 datasets but not in LCR.
Table SI 1: Distribution of phenotypes in the IGM dataset.
Table SI 2: Length of rAF-hi regions in the (a) gnomAD (b) IGM and (c) UKBB datasets.
Table SI 3: Proportion of rAF-hi indels in the (a) gnomAD (b) IGM and (c) UKBB datasets.
Table SI 4a. The proportion of genomic regions in the gnomAD dataset overlapping with Low Complexity Regions (LCR).
Table SI 4b. The proportion of genomic regions in the IGM dataset overlapping with the Low Complexity Regions (LCR).
Table SI 4c. The proportion of genomic regions in the UKBB dataset overlapping with the Low Complexity Regions (LCR).
Table SI 5: Association between ClinVar classification and regional allele frequency in the (a) gnomAD (b) IGM and (c) UKBB datasets.
Table SI 6: Genomic regions with “rAF-hi indels” classified as Pathogenic or Likely Pathogenic (P/LP)
Table SI 7. De-novo enrichment using two different allele frequency filtering approaches.
Table SI 8a: Constrained genes associated with more than 0.1% of individuals in the IGM cohort carrying a rAF-hi predicted deleterious indel (10bp range)
Table SI 8b. Constrained genes with more than 0.1% of the IGM cohort carrying deleterious rAF-hi indels (rAF 20bp range)
Table SI 8c. Constrained genes with more than 0.1% of the IGM cohort carrying deleterious rAF-hi indels (rAF 30bp range)
Table SI 8d. Constrained genes with more than 0.1% of the IGM cohort carrying deleterious rAF-hi indels (rAF 40bp range)
Table SI 9: Prevalence of rAF-hi predicted deleterious indels in constrained genes known to be associated with autosomal dominant disorders out of the 5,299 individuals carrying sAF-lo indels
Funding
Dr. Milo Rasouly received a K01 award from the NIH (Grant number K01DK132495) during the conduct of this study.
Dr. Milo Rasouly was also awarded the Donald E. Wesson Research Fellowship from the ASN Foundation for Kidney Research.
Dr. Motelow received support as a Samberg Scholar and a Thrasher Early Career Research Award during the conduct of this study.
Footnotes
Competing Interests
The authors have no relevant financial or non-financial interests to disclose.
Data availability
Code used to generate the rAF for the gnomAD, IGM and UKBB datasets is available at https://github.com/ColumbiaCPMG/CRAFTs-Indel, along with other scripts used to generate the tables and figures in this paper.
References
- Aganezov S, Yan SM, Soto DC, et al. (2022) A complete reference genome improves analysis of human genetic variation. Science 376:eabl3533. 10.1126/science.abl3533
- Albers CA, Lunter G, MacArthur DG, et al. (2011) Dindel: accurate indel calls from short-read data. Genome Res 21:961–973. 10.1101/gr.112326.110
- Amberger JS, Bocchini CA, Scott AF, Hamosh A (2019) OMIM.org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Research 47:D1038–D1043. 10.1093/nar/gky1151
- Amemiya HM, Kundaje A, Boyle AP (2019) The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9:9354. 10.1038/s41598-019-45839-z
- Backman JD, Li AH, Marcketta A, et al. (2021) Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599:628–634. 10.1038/s41586-021-04103-z
- Ball EV, Stenson PD, Abeysinghe SS, et al. (2005) Microdeletions and microinsertions causing human genetic disease: common mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat 26:205–213. 10.1002/humu.20212
- Bansal V, Libiger O (2011) A probabilistic method for the detection and genotyping of small indels from population-scale sequence data. Bioinformatics 27:2047–2053. 10.1093/bioinformatics/btr344
- Cogné B, Ehresmann S, Beauregard-Lacroix E, et al. (2019) Missense Variants in the Histone Acetyltransferase Complex Component Gene TRRAP Cause Autism and Syndromic Intellectual Disability. Am J Hum Genet 104:530–541. 10.1016/j.ajhg.2019.01.010
- Danecek P, Bonfield JK, Liddle J, et al. (2021) Twelve years of SAMtools and BCFtools. GigaScience 10:giab008. 10.1093/gigascience/giab008
- Deciphering Developmental Disorders Study (2017) Prevalence and architecture of de novo mutations in developmental disorders. Nature 542:433–438. 10.1038/nature21062
- Dolzhenko E, van Vugt JJFA, Shaw RJ, et al. (2017) Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res 27:1895–1903. 10.1101/gr.225672.117
- Elden AC, Kim H-J, Hart MP, et al. (2010) Ataxin-2 intermediate-length polyglutamine expansions are associated with increased risk for ALS. Nature 466:1069–1075. 10.1038/nature09320
- Fuller ZL, Berg JJ, Mostafavi H, et al. (2019) Measuring intolerance to mutation in human genetics. Nat Genet 51:772–776. 10.1038/s41588-019-0383-1
- Geisinger-Regeneron DiscovEHR Collaboration, Regeneron Genetics Center, Van Hout CV, et al. (2020) Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature. 10.1038/s41586-020-2853-0
- Georgakopoulos-Soares I, Morganella S, Jain N, et al. (2018) Noncanonical secondary structures arising from non-B DNA motifs are determinants of mutagenesis. Genome Res 28:1264–1271. 10.1101/gr.231688.117
- Harrison SM, Biesecker LG, Rehm HL (2019) Overview of Specifications to the ACMG/AMP Variant Interpretation Guidelines. Curr Protoc Hum Genet 103:e93. 10.1002/cphg.93
- Jobo Q, Samocha K (2020) queenjobo/DeNovoWEST: DeNovoWEST
- Karczewski KJ, Francioli LC, Tiao G, et al. (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581:434–443. 10.1038/s41586-020-2308-7
- Landrum MJ, Chitipiralla S, Brown GR, et al. (2020) ClinVar: improvements to accessing data. Nucleic Acids Res 48:D835–D844. 10.1093/nar/gkz972
- Lek M, Karczewski KJ, Minikel EV, et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285–91. 10.1038/nature19057
- Li S, Li R, Li H, et al. (2013) SOAPindel: efficient identification of indels from short paired reads. Genome Res 23:195–200. 10.1101/gr.132480.111
- Li Z, Wu X, He B, Zhang L (2014) Vindel: a simple pipeline for checking indel redundancy. BMC Bioinformatics 15:359. 10.1186/s12859-014-0359-1
- MacLean HE, Favaloro JM, Warne GL, Zajac JD (2006) Double-strand DNA break repair with replication slippage on two strands: a novel mechanism of deletion formation. Hum Mutat 27:483–489. 10.1002/humu.20327
- Manichaikul A, Mychaleckyj JC, Rich SS, et al. (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26:2867–73. 10.1093/bioinformatics/btq559
- Mills RE, Luttig CT, Larkins CE, et al. (2006) An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16:1182–1190. 10.1101/gr.4565806
- Montgomery SB, Goode DL, Kvikstad E, et al. (2013) The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res 23:749–761. 10.1101/gr.148718.112
- Morgulis A, Gertz EM, Schäffer AA, Agarwala R (2006) A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences. Journal of Computational Biology 13:1028–1040. 10.1089/cmb.2006.13.1028
- Nesta AV, Tafur D, Beck CR (2021) Hotspots of Human Mutation. Trends in Genetics 37:717–729. 10.1016/j.tig.2020.10.003
- Pruitt KD, Harrow J, Harte RA, et al. (2009) The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 19:1316–1323. 10.1101/gr.080531.108
- Ratan A, Olson TL, Loughran TP, Miller W (2015) Identification of indels in next-generation sequencing data. BMC Bioinformatics 16:42. 10.1186/s12859-015-0483-6
- Ren Z, Povysil G, Hostyk JA, et al. (2021) ATAV: a comprehensive platform for population-scale genomic analyses. BMC Bioinformatics 22:149. 10.1186/s12859-021-04071-1
- Richards S, Aziz N, Bale S, et al. (2015) Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17:405–24. 10.1038/gim.2015.30
- Stenson PD, Mort M, Ball EV, et al. (2020) The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum Genet 139:1197–1207. 10.1007/s00439-020-02199-3
- the Global Alliance for Genomics and Health Benchmarking Team, Krusche P, Trigg L, et al. (2019) Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol 37:555–560. 10.1038/s41587-019-0054-x
- Ware JS, Samocha KE, Homsy J, Daly MJ (2015) Interpreting de novo Variation in Human Disease Using denovolyzeR. Curr Protoc Hum Genet 87:1–15. 10.1002/0471142905.hg0725s87
- Yun T, Li H, Chang P-C, et al. (2021) Accurate, scalable cohort variant calls using DeepVariant and GLnexus. Bioinformatics 36:5582–5589. 10.1093/bioinformatics/btaa1081
- Zanoni P, Steindl K, Sengupta D, et al. (2021) Loss-of-function and missense variants in NSD2 cause decreased methylation activity and are associated with a distinct developmental phenotype. Genet Med 23:1474–1483. 10.1038/s41436-021-01158-1
- (1998). ACMG/ASHG statement. Laboratory guidelines for Huntington disease genetic testing. The American College of Medical Genetics/American Society of Human Genetics Huntington Disease Genetic Testing Working Group. Am J Hum Genet 62:1243–1247