Assembly and characterization of novel Alu inserts detected from next-generation sequencing data

Harun Mustafa; Matei David; Michael Brudno

2014 Dec 12;4(5):1–7. doi: 10.4161/21592543.2014.969584

PMCID: PMC4589995 PMID: 26442170

Abstract

Repetitive elements in general, and Alu inserts in particular, are major contributors to the recent evolution of the human genome. By assembling the sequences of novel Alu inserts using their respective subfamily consensus sequences as references, we found an exponential decay in the Alu subfamily call enrichment with increasing number of sequence variants (Pearson correlation $r = -0.68$, $p < 0.0039$). By mapping the sequences of these inserts to a human reference genome, we infer the reference Alu sources of a subset of the novel Alus; 85% of these sources were either previously shown or inferred here to be active. We also evaluate relationships between the loci of the novel inserts and their inferred sources.

Keywords: Alu, chromosomal conformation capture, Hi-C, retrotransposon, SINE, SNP

Abbreviations

TSD: Target site duplication
bp: base pair
SNP: single nucleotide polymorphism
NGS: Next-Generation Sequencing

Introduction

Alu elements are roughly 300 base pairs in length with about 10⁶ copies in the human genome, and thus are the most common type of short interspersed element (SINE).¹ They are divided into subfamilies based on homology to different ancestral sequences and were highly active during primate evolution, but only 22 AluY and 6 AluS subfamilies (0.05% of all Alus²) are still active in human populations^1-3^,† and less than 0.5% are polymorphic (not fixed).^4,5 Like other SINEs, they rely on external mechanisms for retrotransposition (Fig. S1). Novel Alu insertions have been linked to various human health issues through effects on regulatory regions, mRNA splicing, exon disruption, etc.⁶ Alu replication has traditionally been described using the “master gene” model, where very few Alus are actively replicating and few novel inserts are active. However, a recent study showed that human Alus consistently exhibit a “sprout”-like evolution model, in which copies of the master gene continue to copy themselves.⁷

Several computational studies have addressed novel Alu insert detection from NGS data.^8-10 We recently developed a tool called alu-detect¹¹ that improves on previous methods and, notably, is able to call novel inserts without requiring both breakpoints to be detected. This allows it to be used on very low coverage data and also enables statistical analysis of novel insert sites and their relative enrichments in different functional regions of the genome. All of the software packages referenced, except for the TEA pipeline,¹⁰ only report novel inserts as lists of genomic coordinates and do not assemble the inserts’ sequences. These sequences could be used to determine the inserts’ sequence variants and to infer their age and source sequence in the human reference genome.

In this manuscript we extend the alu-detect pipeline to reconstruct the sequences of the inserted Alus and identify their sources on the reference genome. We utilize this pipeline to analyze the polymorphic Alu insertions in 92 human genomes from the 1000 Genomes Project¹² and utilize the resulting data to evaluate the rate of insertion, characteristics, and spatial distribution of polymorphic Alus and their sources in the human genome.

Results

In this section we first give an overview of the assembly pipeline and then discuss the results obtained by applying it to 92 low-coverage human genomes.

Alu assembly pipeline

Here, we summarize key steps of the Alu assembly pipeline (more details are described in the Methods). To assemble the sequences of the novel Alu inserts it is necessary to obtain the reads used as evidence. We have modified alu-detect to output a modified BED file with each call annotated with its evidence read IDs. Due to the high similarity between Alu sequences, an approach involving read mapping and variant calling relative to an Alu consensus sequence was chosen to assemble the inserts’ sequences. Since evidence reads may span a region not contained in a consensus Alu sequence, for a given call, a corresponding reference sequence was generated by taking the consensus sequence for that call's subfamily, then flanking it on both sides by the TSD and extra sequence corresponding to the read size plus the maximum distance between reads in a pair. Thus, the mapping quality would be maximized by providing a reference in which both pairs in all evidence reads could be mapped in the correct orientation. After mapping, the reads were locally realigned, then a consensus sequence was generated to obtain the sequence of the insert. The full source code of the pipeline and the assembled sequences are available in the assembly branch of the alu-detect git repository at https://github.com/compbio-UofT/alu-detect.
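The per-call reference construction described above can be sketched as follows. This is a minimal illustration rather than the alu-detect implementation: the function name `build_insert_reference`, its parameters, and the assumption that the extra sequence is taken from the genomic sequence adjacent to the TSD are all hypothetical.

```python
def build_insert_reference(consensus, tsd, left_genome, right_genome,
                           read_len, max_pair_dist):
    """Return a per-call reference: flank + TSD + consensus + TSD + flank.

    left_genome / right_genome are assumed to hold the genomic sequence
    immediately upstream and downstream of the insertion site.
    """
    # Flank length = read size plus the maximum distance between reads in a pair
    flank_len = read_len + max_pair_dist
    # Guard against flank_len == 0, where [-0:] would return the whole string
    left_flank = left_genome[-flank_len:] if flank_len > 0 else ""
    right_flank = right_genome[:flank_len]
    return left_flank + tsd + consensus + tsd + right_flank


# Toy example with very short sequences
toy_reference = build_insert_reference(
    consensus="GGCC", tsd="AT",
    left_genome="TTTTTTTT", right_genome="CCCCCCCC",
    read_len=2, max_pair_dist=1,
)
```

With flanks of this length, both reads of any evidence pair that touches the insert have room to map in the correct orientation, which is the stated motivation for the design.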

Alu detection

A total of 24229 novel Alu calls were made from a sample of 92 low-coverage individuals from phase 1 of the 1000 Genomes Project¹² (Table S1). The samples from this phase were sequenced at an average coverage of 3.6×, significantly lower than the ∼15–40× coverage of the samples used in the other studies.^8-10 Since our method filters out calls supported by fewer than 10 read pairs, the number of calls made per individual was significantly lower than in previous studies, and varied greatly, with a mean of 263.36 and standard deviation of 221.14. An average of $22.5 \pm 12.3\%$ of the calls had both breakpoints detected, $47.6 \pm 13\%$ had one breakpoint detected, and $29.9 \pm 16.3\%$ had no breakpoints detected, with the distributions of these percentages having heavier right tails. Notably, of the 1543 calls assigned to the AluYb8 and AluYb9 subfamilies, 870 were equally distant from the AluYb8 and AluYb9 consensus sequences. Therefore, we have added a new subfamily called Yb8/Yb9 to the analysis.

Call merging

Novel Alu calls detected by alu-detect are reported as the coordinates of their TSDs, annotated with their orientation, subfamily, the number of reads supporting each breakpoint, and a quality score. When a certain breakpoint cannot be detected, its coordinate is given as the boundary of a confidence interval based on the read coverage. In addition, each call has a set of read pairs that were used as evidence. Under the assumption that overlapping calls represent the same novel insertion event, overlapping calls are merged to form a list of unique insert calls, which we will refer to as merged calls (Fig. S2 for a schematic of the method and the Methods and Supplementary Materials for a precise formulation of the merging algorithm). The read set for a merged call is the union of the constituent calls' read sets, meaning that each merged call has higher coverage than the calls detected from the individual samples. The breakpoint coordinates for a merged call are determined by a consensus of the breakpoint coordinates of its constituent calls, weighted by their quality score. In the case that none of the calls contained a certain breakpoint, the least/greatest left/right coordinate was chosen to produce the largest interval.
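The merging step can be illustrated with a short sketch. It is not the algorithm from the Supplementary Materials: the `Call` class, the use of a quality-weighted mean as the breakpoint "consensus", and the summed quality of the merged call are assumptions made for illustration, and the special handling of undetected breakpoints is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    chrom: str
    start: int          # left TSD coordinate
    end: int            # right TSD coordinate
    quality: float      # assumed to be strictly positive
    reads: set = field(default_factory=set)


def merge_group(group):
    """Merge overlapping calls: weighted breakpoints, union of read sets."""
    total_q = sum(c.quality for c in group)
    start = round(sum(c.start * c.quality for c in group) / total_q)
    end = round(sum(c.end * c.quality for c in group) / total_q)
    reads = set().union(*(c.reads for c in group))
    return Call(group[0].chrom, start, end, total_q, reads)


def merge_calls(calls):
    """Sweep over calls sorted by position and merge overlapping ones."""
    merged, group, group_end = [], [], None
    for call in sorted(calls, key=lambda c: (c.chrom, c.start)):
        if group and call.chrom == group[0].chrom and call.start <= group_end:
            group.append(call)
            group_end = max(group_end, call.end)
        else:
            if group:
                merged.append(merge_group(group))
            group, group_end = [call], call.end
    if group:
        merged.append(merge_group(group))
    return merged


calls = [
    Call("chr1", 100, 110, 1.0, {"readA"}),
    Call("chr1", 105, 115, 3.0, {"readB"}),
    Call("chr1", 500, 510, 2.0, {"readC"}),
]
merged_calls = merge_calls(calls)
```

In this toy input the first two calls overlap and collapse into one merged call whose read set is the union of both, while the third call is left on its own.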

After merging, the original call set was reduced to 6059 calls, with 24.2% having both breakpoints detected, 39.5% having one breakpoint detected, and 36.3% having no breakpoints detected. These results indicate that the reduction in the percentage of calls with one breakpoint is due mainly to their merging with existing 2-breakpoint calls, rather than the formation of new 2-breakpoint calls. Of the 6059 merged calls, 2952 were from the original call set (i.e. not merged with anything), 4837 were composed from 5 or fewer calls, and 271 were composed from 20 or more. Overall, the allele frequency of a call (the percentage of samples with that call) was negatively correlated with the natural log of the number of calls with that frequency ($r = -0.97$, $p < 4.5 \times 10^{-9}$, see Fig. 1) and was partially correlated with the number of detected breakpoints ($\mathrm{dCor} = 0.33$, $p < 2.2 \times 10^{-16}$).

Figure 1. — Distribution of allele frequencies (percentage of samples with a given call) for the merged novel *Alu* calls. 48.7% of the calls (2952/6059) were only present in one sample. When the calls are binned by allele frequency, the number of calls in each bin decays exponentially with allele frequency (Pearson correlation r = −0.97, P < 4.5×10⁻⁹ between allele frequency and log of bin size).

When we compared the subfamily distributions of our novel insert calls with those of Stewart et al.,⁹ we found a strong and significant correlation between the subfamily call enrichments of the 2 data sets ($r = 0.91$, $p < 1.3 \times 10^{-6}$, see Table 1). In this context, enrichment is a measure of Alu subfamily activity and is defined as the number of merged calls in a specific Alu subfamily divided by the number of reference Alus in that subfamily.
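As a worked example of the enrichment definition, the sketch below recomputes the enrichment for three subfamilies using the reference and call counts listed in Table 1. The function name `call_enrichment` and the dictionary layout are illustrative choices, not part of the pipeline.

```python
def call_enrichment(n_calls, n_reference):
    """Merged novel calls in a subfamily divided by its reference copy count."""
    if n_reference <= 0:
        raise ValueError("reference count must be positive")
    return n_calls / n_reference


# (Alus in reference, number of merged calls), taken from Table 1
table1_counts = {
    "Yb9": (327, 472),
    "Ya5": (3918, 2079),
    "Y": (120617, 1109),
}

enrichment = {
    name: call_enrichment(calls, reference)
    for name, (reference, calls) in table1_counts.items()
}
```

For Yb9 this gives 472/327, which rounds to the 1.4 reported in the table, whereas the generic Y subfamily, despite having 1109 calls, ends up near 0.0092 because of its very large reference copy count.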

Table 1.

The results of novel Alu detection and assembly. Columns 2-5: The number of reference and novel Alus in each subfamily, sorted by their enrichment, along with the enrichment of novel calls in those subfamilies in Stewart et al. The enrichments of our novel calls were strongly and significantly correlated with those of Stewart et al. Columns 6,7: Mean and median number of SNPs per call for the Alu subfamilies. There is a significant negative Pearson correlation between the natural log of the novel insert enrichment and the mean and median number of SNPs. Columns 8,9: Mean and median number of SNPs per call for the Alu subfamilies in sites important for Alu activity (see Confirming Alu Sources as Active). The mean increases as the enrichment decreases.

Subfamily	*Alu*s in Reference	Number of Calls	Call Enrichment	Stewart Call Enrichment	All loci SNPs		Important loci SNPs
					Mean	Median	Mean	Median
Yb9	327	472	1.4	0.19	2.53	2	1.27	1
Ya5	3918	2079	0.53	0.12	1.82	1	1.24	1
Yb8/Yb9	3181	870	0.27		1.82	1	1.14	1
Yg6	797	138	0.17	0.048	3.17	2	1.45	1
Yd8	225	30	0.13	0.053	2.5	1	1.17	1
Yf4/Ye5	1379	164	0.12	0.032	3.77	3	1.27	1
Yc5	45	5	0.090	0.022	1.8	1	1.4	1
Yb8	2854	202	0.071	0.085	2.58	2	1.19	1
Yh9	272	19	0.070	0.022	2.68	2	1.10	1
Ya8	338	21	0.062	0.038	1.14	1	1.86	2
Yk12	224	13	0.058	0.0045	3.07	2	3.23	4
Yk11	1044	21	0.020	0.0019	2.19	1	3.05	3
Yc3/Yd3a1	570	11	0.019	0.0053	6.91	8	2.27	3
Y	120617	1109	0.0092	0.030	4.56	2	1.61	1
Yf5	180	1	0.0056	0.0	4	4	1	1
Yc/Yd2	8519	26	0.0030	0.00070	4.96	5	1.58	1
Yk4/Yb3a2	1869	14	0.00075	0.00053	6.29	5	1.79	1

Read mapping and Alu sequence assembly

Since only novel AluY calls were assembled, the data set was reduced to 5193 merged calls (Table S2). Each merged call was assembled by generating a reference sequence for it based on its detected subfamily, aligning the merged call's read set to that reference, and generating a consensus sequence from this alignment (see Methods). When these were aligned to their respective reference sequences, there was an average alignment rate of 93.5% (Fig. S3). The mean and median SNP counts for the calls are 2.65 and 2, respectively, with the more active and younger Ya5 subfamily having lower average and median SNP counts, as expected. The Yb8/Yb9 subfamily also has lower variant counts and enrichment similar to the enrichment of the Yb8 subfamily in Stewart's data. Possibly, the subfamily we labeled as Yb8/Yb9 is in the Yb8 subfamily, or it is a new, active subfamily. Although a large proportion of the calls were determined to belong to the Y subfamily, normalization by its much greater copy count in the human genome shows that it is not, in fact, a highly active subfamily.

Comparing the different subfamilies, there was exponential decay in subfamily enrichment with increasing mean and median number of SNPs per call, with a significant distance correlation between the natural log of the respective subfamilies’ enrichments and their respective mean ($\mathrm{dCor} = 0.75$, $p < 4.4 \times 10^{-9}$) and median ($\mathrm{dCor} = 0.70$, $p < 8.7 \times 10^{-7}$) variant counts (see Table 1). Significant Pearson correlation coefficients ($r = -0.68$, $p < 0.0039$ and $r = -0.57$, $p < 0.023$, respectively) show that the relation between log enrichment and SNP count is, in fact, approximately linear. Thus, there is an exponential decrease in subfamily activity as the mean and median number of SNPs in a subfamily increases, indicating that the novel inserts from highly active subfamilies are of more recent origin. Since the mean number of SNPs at loci important for Alu activity also increases as enrichment decreases (significant distance correlation $\mathrm{dCor} = 0.40$, $p < 0.02$, see Confirming Alu Sources as Active), the exponential decrease in enrichment can be explained by a decreased ability to transcribe with increased mutation. Because Pearson correlation only measures linear relations, the distance correlation method was also used to assess the dependence between the number of SNPs and subfamily enrichment (see Methods).
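The log-linear relation can be checked directly from the values in Table 1. The sketch below computes a Pearson correlation between the natural log of the call enrichment and the mean SNP count, using all 17 table rows. Which subfamilies entered the reported test is not stated here, so the resulting coefficient need not reproduce the reported value exactly; the helper `pearson_r` is written out by hand only to keep the example self-contained.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# (call enrichment, mean SNPs per call at all loci), rows of Table 1 in order
table1_rows = [
    (1.4, 2.53), (0.53, 1.82), (0.27, 1.82), (0.17, 3.17), (0.13, 2.5),
    (0.12, 3.77), (0.090, 1.8), (0.071, 2.58), (0.070, 2.68), (0.062, 1.14),
    (0.058, 3.07), (0.020, 2.19), (0.019, 6.91), (0.0092, 4.56),
    (0.0056, 4.0), (0.0030, 4.96), (0.00075, 6.29),
]

log_enrichment = [math.log(e) for e, _ in table1_rows]
mean_snps = [s for _, s in table1_rows]
r_log_linear = pearson_r(log_enrichment, mean_snps)
```

A negative coefficient on the log scale corresponds to enrichment falling off roughly exponentially as the mean SNP count grows.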

Inference of Alu insert sources

With the sequences of novel Alu inserts now assembled, comparisons to reference Alus and other sequences can be made. In particular, it can be inferred that a novel insert's closest sequence match in the reference is a potential source or ancestor of the novel insert. Using the results from previous studies, which have produced lists of active³ and polymorphic⁹ reference Alus, the feasibility of sources inferred through this method was evaluated. Since these data sets were published for the hg18 human reference, hg18 was used as the reference for subsequent analyses.

Of the 5193 assembled calls, 4658 were alignable to the hg18 reference, of which only 1445 could be uniquely aligned, allowing their sources in the reference to be determined from a unique best-scoring match (see Table S2 for the genomic regions and the GitHub page for the assembled sequences of these calls). The sources of only 28.4%, 20.8%, and 19.7% of the calls in the Yb9, Ya5, and Yb8/Yb9 subfamilies, respectively, could be determined (see Table 2). This is most likely caused by their higher copy count and lower average SNP count, both of which contribute to an increased number of matches. However, a significant correlation could not be found between subfamily enrichment and unique mapping percentage ($\mathrm{dCor} = 0.38$, $p = 0.19$). Despite their low unique-match percentages, the Ya5, Yb8, Yb9, and Yb8/Yb9 subfamilies also had a greater number of sources. There was a notable drop in the number of reference sources for the novel Yg6 inserts (see Table 2), where one particular repeat at chr9:6752052-6752361 was inferred to be the source of 85 novel calls.

Table 2.

Inference and analysis of candidate sources for the novel Alu calls. Columns 2-5: Novel Alu call mapping results for the different subfamilies, indicating the number of calls mapped and the number and percentage of calls mapped with a unique highest-scoring match. A significant correlation could not be shown between enrichment and the unique mapping percentage. Columns 6-9: Evaluation of the inferred sources of the novel inserts. In the first step, known active Alus³ and known polymorphic reference Alus⁹ were filtered out, thus leaving those sources whose activity has not been previously reported. Then, those which were conserved in at least 120 of the 124 essential sites for activity³ were also filtered out and the remaining were determined to be inactive. Finally, those inactive sources not present in the chimpanzee reference panTro4 were filtered out, leaving sources which we can reliably infer are not active. There is a notably high number of AluY sources determined to be nonactive, perhaps as an artifact of the use of this subfamily label for novel inserts not in the other subfamilies. Since some reference Alus were inferred to be the sources of novel calls from different subfamilies, some of the sources are counted multiple times in the table.

Subfamily	Call Enrichment	Calls Mapped	Uniquely Mapped	Percent	Number Sources	Activity unknown	Source Nonactive	In panTro4
Yb9	1.4	455	129	28.4%	84	9	0	0
Ya5	0.53	2007	417	20.8%	198	23	13	9
Yb8/Yb9	0.27	841	166	19.7%	79	3	0	0
Yg6	0.17	133	104	78.2%	16	2	2	2
Yd8	0.13	30	18	60.0%	6	2	1	1
Yf4/Ye5	0.12	159	42	26.4%	23	4	3	1
Yc5	0.090	5	3	60.0%	1	0	0	0
Yb8	0.071	191	58	30.4%	43	2	0	0
Yh9	0.070	17	15	88.2%	4	0	0	0
Ya8	0.062	20	5	25.0%	4	1	1	1
Yk12	0.058	13	3	23.1%	3	0	0	0
Yk11	0.020	21	7	33.3%	3	1	1	1
Yc3/Yd3a1	0.019	11	9	81.8%	5	4	0	0
Y	0.0092	735	444	60.4%	290	117	64	51
Yf5	0.0056	1	0	0.00%
Yc/Yd2	0.0030	25	21	84.0%	12	5	1	1
Yk4/Yb3a2	0.00075	12	10	83.3%	6	0	0	0

Confirming Alu sources as active

The set of 1445 novel Alus whose source sequences could be determined were inferred to originate from 570 reference Alus. Of these, 419 were previously reported as active by Bennett et al.³ Of the remaining sources, 3 were previously reported as polymorphic by Stewart et al.,⁹ and thus are of recent origin. This left 148 AluY sources whose activity was unknown (see Table 2).

Using Bennett's method of determining the conservation of essential sites in the source sequences to infer activity (see Methods), a total of 87 of the sources were determined to be non-active. Since the set of Alus active in both humans and chimpanzees is negligible,¹³ we checked whether these non-active source sequences were present in the list of known Alus in the chimpanzee reference genome to make further inferences about their potential activity. Of the 87 non-active sources, 67 were present in the chimpanzee reference, with 51 of them being AluY's. The source Alus found in chimpanzees were each the source of at most 4 novel inserts, with only 3 of them, all from the Y subfamily, being the sources of more than 2 novel inserts. These results show that when sequence similarity is used to infer the sources of the novel inserts, 84.7% (483/570) of the inferred sources are active, and thus valid source candidates.

Contact enrichment distribution

One important tool that can be used for discovering relationships between pairs of genomic loci is chromosomal conformation capture. The Hi-C method uses an NGS and read mapping approach to obtain raw chromosomal contact numbers,¹⁴ which can be normalized by Poisson regression using the HiCNorm method.¹⁵ Using recently published chromosomal contact data and inferences about the sources of novel Alu inserts, we hypothesized that Alu transcripts insert into genomic regions that interact with their source regions. Using the upper triangle of a normalized Hi-C chromosomal contact matrix (see Methods), 2 control empirical cumulative distribution functions (ECDFs) of contact counts were constructed, one corresponding to loci pairs less than 5 Mbp (mega base pairs) apart and the other corresponding to pairs greater than 5 Mbp apart. From these, test ECDFs were generated by restricting their values to those corresponding to observed novel-source Alu pairs.

The control ECDF for the loci pairs greater than 5 Mbp apart was significantly greater than the corresponding test ECDF ($p < 2.2 \times 10^{-308}$, Kolmogorov-Smirnov test), indicating that the test distribution had greater values (Fig. S4). Due to the large difference in the proportion of zero contact values between the 2 distributions, the data were binarized into a $2 \times 2$ contingency table (see Methods) and a significant difference between the control and test distributions ($p < 1.42 \times 10^{-96}$, Fisher's exact test) was observed. When testing the distributions for loci less than 5 Mbp apart, no significant difference was found between the control and original test distributions ($p = 0.111$, Kolmogorov-Smirnov test).

To reduce any bias potentially introduced by the choice of novel or source Alu coordinates, a second control distribution was generated by shuffling the sources associated with each novel insert to generate a new set of coordinate pairs and obtaining the normalized Hi-C values for those pairs. There was no significant difference between the Alu distribution and the new control ($p = 0.5$, Fisher's exact test on binarized data). The likely explanation for these discrepancies is discussed below.

Discussion

Alu elements and other repetitive elements have contributed significantly to primate genomes. Although Alu elements were more active during early human evolution, 0.05% are still actively replicating in current human populations. Through recent computational methods, it is now possible to detect novel Alu inserts from next-generation sequencing data and, using the methods described above, to assemble their sequences. A few recent, highly active subfamilies form the bulk of novel Alu inserts, with the subfamily call enrichment (a measure of activity) decaying exponentially with increased number of sequence variants due to mutations in loci important for replication. When inferring the sources of the novel inserts by finding their unique best matches in the reference, 84.7% of the novel inserts’ inferred sources were determined to be active, and thus, viable sources. The sources of fewer of the novel inserts in the more active subfamilies could be determined, but the correlation between subfamily enrichment and the percentage of inserts whose sources could be inferred was not significant. The interaction frequencies between pairs of loci corresponding to novel inserts and their inferred sources are significantly greater than those of the control, but it was determined that this may be due to other factors.

The current method is unable to distinguish between the true source and true copy in a pair of Alus (it could be that a novel call is not present in the reference due to an assembly error and is the true source of a reference Alu), which could have been a factor in the inference of non-active Alus as the sources of novel inserts. In addition, Alus are known to replicate using a sprout-like model, with novel inserts showing significant amounts of variants derived from the descendants of the master sequence. Hence, there is a possibility that several intermediate copies could exist between a detected novel insert and a reference Alu. It is possible that the true source of a novel Alu is one of the other detected inserts or an undetected (possibly filtered out) insert, with the nearest ancestor in the reference being mutated to the point of being a lower-scoring match than another, less-closely-related reference Alu. Although sequence similarity can be used to determine viable candidates for the sources of novel Alu inserts, more information needs to be taken into consideration to distinguish between different candidates.

The results suggest that although there is an elevated level of interaction between the coordinates of novel calls and their sources in the reference, this elevation is not caused by this association between the 2 sequences and is instead due to other factors. One possibility, however, is that this is caused by a bias in the chromosomal contact data that is not being corrected by the HiCNorm method. When chromosomal contact numbers are being computed, only reads that have been uniquely mapped close to a restriction site with sufficient quality are considered. Naturally, this is expected to lead to an underreporting of contacts between repetitive regions due to mapping difficulties. Although HiCNorm attempts to remove systematic biases in the raw contact data, it is unclear whether it corrects for this form of underreporting. New methods that correct for this bias would have to be developed in order to get an accurate representation of chromosomal contact between repeat regions.

Methods

Data

Novel Alu inserts were detected in the 92 low-coverage individuals with the highest coverage from phase 1 of the 1000 Genomes Project. The hg19 human reference, Alu consensus sequences from RepBase, and reference Alu annotations from RepeatMasker were used for novel insert detection. For determining the replication activities of the inferred Alu sources, a list of known active Alus in hg18 was obtained from Bennett et al.,³ a list of known polymorphic Alus was obtained from Stewart et al.,⁹ and Alu annotations in the panTro4 chimpanzee reference genome were obtained through RepeatMasker.

Alu detection and call merging

The alu-detect tool was used to detect novel Alu insertions from the samples. Given mappings of reads to a reference, this tool produces a list of novel insert calls in BED format, annotated with their confidence and breakpoint support. Novel calls were filtered to retain only those with at least 10 evidence reads, to allow for accurate assembly of the inserts. The get-regions method for calling novel inserts from read mappings was modified so that each novel call would be annotated with the read IDs that were used as evidence to detect it.

To overcome a large false-negative rate due to low coverage, the calls from all individuals were combined into a single list and calls representing the same insert were merged, concatenating the read ID lists. See Supplementary materials for a precise description of the algorithm.

Alu sequence assembly

A reference was generated for each insert by taking the consensus sequence of the detected subfamily of the insert, then flanking it with the TSD and extra sequence corresponding to the total possible region coverable by the reads. The bowtie2 tool was used to map the reads from an insert to its respective reference, using the --very-sensitive option and disallowing overlapping pairs, dovetailing, and gaps within 1 bp of a read terminus. The sequence of each insert was determined by using the Genome Analysis Toolkit (GATK) to locally realign the reads, call variants, and generate an alternative reference from the consensus sequence and the variants.¹⁶

Novel insert origin inference

The sequences of the novel inserts were aligned to the hg18 human reference¹⁷ using the blat tool¹⁸ with output in ClustalW format. The results were then filtered to only include those which aligned with a unique top hit. These matches were inferred to be the sources of the novel inserts. The results were reformatted into a BED file, with the chromosomal coordinate fields representing the Alu sources and the associated novel Alus being represented as annotations. The intersectBed program from bedtools¹⁹ with the options -wao -s -f 0.95 was used to intersect this file with the list of known active reference Alus. Those that were not matched were intersected with the list of known polymorphic Alus.
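The unique-top-hit filter described above can be sketched as follows. The input format (tuples of query ID, target locus, and alignment score) and the function name `unique_top_hits` are assumptions for illustration; parsing of the actual aligner output is not shown.

```python
from collections import defaultdict


def unique_top_hits(hits):
    """Map each query to its best-scoring target, if that target is unique.

    hits: iterable of (query_id, target_locus, score) tuples.
    Queries whose best score is shared by several targets are dropped.
    """
    by_query = defaultdict(list)
    for query, target, score in hits:
        by_query[query].append((score, target))

    sources = {}
    for query, scored in by_query.items():
        best = max(score for score, _ in scored)
        top_targets = [target for score, target in scored if score == best]
        if len(top_targets) == 1:
            sources[query] = top_targets[0]
    return sources


toy_hits = [
    ("insert_a", "chr1:1000-1300", 290),
    ("insert_a", "chr2:5000-5300", 280),
    ("insert_b", "chr3:2000-2300", 300),
    ("insert_b", "chr4:7000-7300", 300),  # tie: insert_b has no unique source
]
inferred_sources = unique_top_hits(toy_hits)
```

Only `insert_a` receives an inferred source here; `insert_b` is discarded because two reference loci tie for the best score, mirroring the calls that were alignable but not uniquely so.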

The Alu sources that were not present in these 2 data sets were tested for activity using the method developed by Bennett et al.³ A template FASTA sequence of length 311 was created, with all base pairs except for the 124 conserved sites replaced with N. The source sequences were extracted from the reference using samtools,²⁰ with the negative strand sequences being reverse complemented. The source sequences were aligned to the template FASTA using the muscle multiple sequence aligner²¹ and those which contained at least 120 of the conserved 124 base pairs were considered to be active.
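The conserved-site test can be sketched as below, assuming the template and the source have already been aligned column by column (for example by taking two rows from a multiple sequence alignment). The function names and the toy sequences are hypothetical; only the threshold of 120 out of 124 sites comes from the text.

```python
def conserved_site_count(template, aligned_source):
    """Count template positions with a real base that the source matches.

    Positions where the template holds N (or a gap) are not essential
    sites and are ignored.
    """
    if len(template) != len(aligned_source):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for t, s in zip(template.upper(), aligned_source.upper())
        if t in "ACGT" and t == s
    )


def is_active(template, aligned_source, min_conserved=120):
    """Apply the conservation threshold (120 of the 124 sites by default)."""
    return conserved_site_count(template, aligned_source) >= min_conserved


# Toy template with 3 essential sites instead of 124
toy_template = "NANCGN"
n_matching = conserved_site_count(toy_template, "TAGCGT")
n_mutated = conserved_site_count(toy_template, "TTGCGT")
```

In the toy case the first source matches all 3 essential sites and the second only 2, so with `min_conserved=3` only the first would be classified as active.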

Finally, those that were determined to not be active were intersected with the list of known reference Alus in the reference chimpanzee genome (panTro4) using intersectBed, converting the panTro4 coordinates to hg18 coordinates using the liftOver tool.

Chromosomal contact

Normalized Hi-C data generated from the GM06990 human cell line with the HindIII restriction enzyme was obtained from Hu et al.¹⁵ The data is represented as a matrix, where each element is the contact count between 2 genomic loci bins of size 1 Mbp. Using the upper triangle of this matrix, 2 contact count distributions were generated. The first was for values corresponding to loci pairs less than 5 Mbp apart and the other was for pairs of loci greater than 5 Mbp apart. These distributions were then restricted to those pairs found in the source-novel Alu pair data to generate contact distributions that correspond to inferred Alu copy routes. Each control distribution was compared to its corresponding test distribution for significant difference with a Kolmogorov–Smirnov test using the R software package. In addition, the contact counts for the pairs greater than 5 Mbp were binarized with the ceiling function (the value mapping to 1 if and only if it is greater than 0) to form a $2 \times 2$ contingency table of the 2 distributions and a Fisher's exact test was performed to determine significant differences.
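The binarization and contingency-table step can be illustrated with a small, dependency-free sketch. The analysis itself was done in R; the Python functions below (`binarize`, `contingency_table`, `fisher_exact_two_sided`) are hypothetical stand-ins, and the two-sided p-value is computed by the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def binarize(counts):
    """Map each contact count to 1 if it is greater than 0, else 0."""
    return [1 if c > 0 else 0 for c in counts]


def contingency_table(control, test):
    """Rows: control, test. Columns: nonzero contact, zero contact."""
    c, t = binarize(control), binarize(test)
    return [[sum(c), len(c) - sum(c)],
            [sum(t), len(t) - sum(t)]]


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x nonzero values in the first row
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


toy_table = contingency_table([0, 2.5, 0, 1.0], [3.0, 0, 4.2, 0.1])
toy_p = fisher_exact_two_sided([[3, 1], [1, 3]])
```

Binarizing first makes the test sensitive only to the proportion of zero contact values, which is where the control and test distributions were reported to differ most.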

Distance correlation

Since correlations between call enrichment and other variables are not assumed to be linear, the distance correlation coefficient developed by Székely and Rizzo (denoted $\mathrm{dCor}$) was used as an alternative method.²² Correlation $t$-tests for independence were performed using the dcor.ttest function from the energy library for R.
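To make the statistic concrete, the sketch below computes the plain sample distance correlation for two one-dimensional samples from double-centered pairwise distance matrices. It is an illustration only: it is not the bias-corrected statistic used by the $t$-test in the energy library, and the function names are hypothetical.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]   # equal to column means
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean
         for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(xs, ys):
    """Sample distance correlation of two equal-length numeric samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    a, b = _centered_distances(xs), _centered_distances(ys)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j]
                   for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
dcor_linear = distance_correlation(xs, [2 * x + 1 for x in xs])
dcor_quadratic = distance_correlation(xs, [x * x for x in xs])
```

The quadratic example shows why the statistic is useful here: for these symmetric `xs` the Pearson correlation with `x * x` is zero, yet the distance correlation is clearly positive, so nonlinear dependence is still detected.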

Footnotes

^†

The AluY family refers to the subfamily of SINEs consisting of all AluY* subfamilies, not to be confused with the subfamily of generic AluY's.

Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

Acknowledgments

We would like to thank Orion Buske for his helpful discussions.

Funding

This work was supported by the Ontario Ministry of Research and Innovation (MRI) through the Genomes to Life (GL2) program.

Supplemental Material

Supplemental data for this article can be accessed on the publisher's website.

References

1. Cordaux R, Batzer MA. The impact of retrotransposons on human genome evolution. Nat Rev Genet 2009; 10:691-703; PMID:19763152; http://dx.doi.org/ 10.1038/nrg2640 [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Mills RE, Bennett EA, Iskow RC, Devine SE. Which transposable elements are active in the human genome? TRENDS Genet 2007; 23:183-91; PMID:17331616; http://dx.doi.org/ 10.1016/j.tig.2007.02.006 [DOI] [PubMed] [Google Scholar]
3. Bennett EA, Keller H, Mills RE, Schmidt S, Moran JV, Weichenrieder O, Devine SE. Active Alu retrotransposons in the human genome. Genome Res 2008; 18:1875-83; PMID:18836035; http://dx.doi.org/ 10.1101/gr.081737.108 [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Roy-Engel AM, Carroll ML, Vogel E, Garber RK, Nguyen SV, Salem A-H, Batzer MA, Deininger PL. Alu insertion polymorphisms for the study of human genomic diversity. Genetics 2001; 159:279-90; PMID:11560904 [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Batzer MA, Deininger PL. Alu repeats and human genomic diversity. Nat Rev Genet 2002; 3:370-9; PMID:11988762; http://dx.doi.org/ 10.1038/nrg798 [DOI] [PubMed] [Google Scholar]
6. Callinan P, Batzer M. Retrotransposable elements and human disease. Genome Dyn 2006;104-15; PMID: 18724056; http://dx.doi.org/ 10.1159/000092503 [DOI] [PubMed] [Google Scholar]
7. Cordaux R, Hedges DJ, Batzer MA. Retrotransposition of alu elements: how many sources? TRENDS Genet 2004; 20:464-7; PMID:15363897; http://dx.doi.org/ 10.1016/j.tig.2004.07.012 [DOI] [PubMed] [Google Scholar]
8. Hormozdiari F, Alkan C, Ventura M, Hajirasouliha I, Malig M, Hach F, Yorukoglu D, Dao P, Bakhshi M, Sahinalp SC, et al. Alu repeat discovery and characterization within human genomes. Genome Res 2011; 21(6):840-849; PMID:21131385; http://dx.doi.org/ 10.1101/gr.115956.110 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Stewart C, Kural D, Strömber MP, Walker JA, Konkel MK, et al. A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet [Internet] 2011; 7:e1002236; http://dx.doi.org/ 10.1371/journal.pgen.1002236 [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Lee E, Iskow R, Yang L, Gokcumen O, Haseley P, Luquette LJ, Lohr JG, Harris CC, Ding L, Wilson RK, et al. Landscape of somatic retrotransposition in human cancers. Science [Internet] 2012; 337:967-71.
11. David M, Mustafa H, Brudno M. Detecting Alu insertions from high-throughput sequencing data. Nucl Acids Res 2013; 41(17):e169; PMID:23921633; http://dx.doi.org/10.1093/nar/gkt612
12. Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA; 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 2010; 467:1061-73; http://dx.doi.org/10.1038/nature09534
13. Hedges DJ, Callinan PA, Cordaux R, Xing J, Barnes E, Batzer MA. Differential Alu mobilization and polymorphism among the human and chimpanzee lineages. Genome Res 2004; 14:1068-75; PMID:15173113; http://dx.doi.org/10.1101/gr.2530404
14. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009; 326:289-93; PMID:19815776; http://dx.doi.org/10.1126/science.1181369
15. Hu M, Deng K, Selvaraj S, Qin Z, Ren B, Liu JS. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics 2012; 28:3131-3; PMID:23023982; http://dx.doi.org/10.1093/bioinformatics/bts570
16. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011; 43:491-8; PMID:21478889; http://dx.doi.org/10.1038/ng.806
17. Lander E, Linton L, Birren B, Nusbaum C, Zody M, Baldwin J, Devon K, Dewar K, Doyle M, Fitzhugh W, et al. Initial sequencing and analysis of the human genome. Nature [Internet] 2001; 409:860-921.
18. Kent WJ. BLAT-the BLAST-like alignment tool. Genome Res 2002; 12:656-64; PMID:11932250; http://dx.doi.org/10.1101/gr.229202
19. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics [Internet] 2010; 26:841-2; http://dx.doi.org/10.1093/bioinformatics/btq033
20. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009; 25:2078-9; PMID:19505943; http://dx.doi.org/10.1093/bioinformatics/btp352
21. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res 2004; 32:1792-7; PMID:15034147; http://dx.doi.org/10.1093/nar/gkh340
22. Székely GJ, Rizzo ML. The distance correlation test of independence in high dimension. J Multivariate Analysis [Internet] 2013; 117:193-213; http://dx.doi.org/10.1016/j.jmva.2013.02.012

Supplementary Materials

Supplement_2.xlsx: kmge-04-05-969584-s001.xlsx (480 KB, xlsx)

Supplement_1.pdf: kmge-04-05-969584-s002.pdf (252.3 KB, pdf)
