Identification of a unique library of complex, but ordered, arrays of repetitive elements in the human genome and implication of their potential involvement in pathobiology

Kang-Hoon Lee; Young-Kwan Lee; Deug-Nam Kwon; Sophia Chiu; Victoria Chew; HyungChul Rah; Gregory Kujawski; Ramzi Melhem; Karen Hsu; Cecilia Chung; David G Greenhalgh; Kiho Cho

doi:10.1016/j.yexmp.2011.02.007

Author manuscript; available in PMC: 2012 Jun 1.

Published in final edited form as: Exp Mol Pathol. 2011 Mar 1;90(3):300–311. doi: 10.1016/j.yexmp.2011.02.007

PMCID: PMC3092023 NIHMSID: NIHMS278679 PMID: 21376035

Abstract

Approximately 2 % of the human genome is reported to be occupied by genes. Various forms of repetitive elements (REs), both characterized and uncharacterized, are presumed to make up the vast majority of the rest of the genomes of human and other species. In conjunction with a comprehensive annotation of genes, information regarding components of genome biology, such as gene polymorphisms, non-coding RNAs, and certain REs, is found in human genome databases. However, the genome-wide profile of unique RE arrangements formed by different groups of REs has not been fully characterized yet. In this study, the entire human genome was subjected to an unbiased RE survey to establish a whole-genome profile of REs and their arrangements. Due to the limitation in query size within the bl2seq alignment program (National Center for Biotechnology Information [NCBI]) utilized for the RE survey, the entire NCBI reference human genome was fragmented into 6,206 units of 0.5 M nucleotides. A number of RE arrangements with varying complexities and patterns were identified throughout the genome. Each chromosome had unique profiles of RE arrangements and density, and high levels of RE density were measured near the centromere regions. Subsequently, 175 complex RE arrangements, which were selected throughout the genome, were subjected to a comparison analysis using five different human genome sequences. Interestingly, three of the five human genome databases shared exactly the same arrangement patterns and sequences for all 175 RE arrangement regions (a total of 12,765,625 nucleotides). The findings from this study demonstrate that a substantial fraction of REs in the human genome are clustered into various forms of ordered structures. Further investigations are needed to examine whether some of these ordered RE arrangements contribute to human pathobiology as functional genome units.

Keywords: repetitive element, arrangement, density, human genome

Introduction

Diverse populations of repetitive elements (REs) are reported to be abundantly present in mammalian genomes, constituting approximately 45 % or more of the human genome (Lander et al., 2001; Richard et al., 2008). REs can be categorized into two main branches based primarily on their arrangement patterns as well as the mechanisms governing their generation and expansion: I) Tandem REs are commonly represented by small repetitive units, such as microsatellites and minisatellites, although some larger RE units are also present in this category, and II) Interspersed REs are scattered throughout the genome randomly and/or in an orderly fashion, and most elements in this category are known to be retrotransposons (e.g., SINE [short interspersed nuclear element], LINE [long interspersed nuclear element], LTR [long terminal repeat] retrotransposons) and DNA transposons (Babushok et al., 2007; Jurka, 2004; Smit, 1996). Although there are several reports describing patterns of RE arrangements in certain genomic loci, comprehensive RE profiles in regard to their distribution and arrangement configuration within the human genome, as well as in other species, have not yet been established (Lee et al., 1997; Schueler and Sullivan, 2006).

The mechanism(s) controlling the expansion of certain short tandem REs, such as (CAG)_n, may be directly linked to their tendency to form secondary structures in conjunction with replication slippage during the S-phase of cell cycle followed by repair (Mirkin, 2007; Pearson et al., 2005; Richard et al., 2008). In contrast, it is certain that the transposition activities of retrotransposons and DNA transposons in response to a variety of physiologic stress signals are responsible for their dynamic propagation in the genome during the life span of an individual (Beguiristain et al., 2001; Cho et al., 2008; Madlung and Comai, 2004).

REs’ participation in the biology of their hosts is exemplified by the role of size variations of microsatellites of certain genes in phenotypic diversity, such as skull sizes among dog breeds (Fondon and Garner, 2004; Sears et al., 2007). Several neurological disorders, such as fragile X syndrome, Huntington’s disease, and spinocerebellar ataxia type I, are reported to be associated with the expansion/instability of certain types of tandem trinucleotide REs (Bat et al., 1997; Crawford et al., 2001; Hutchinson et al., 1993; Orr et al., 1993). It is also suggested that expansion of highly methylated (CCG)_n in the genome of fragile X patients hinders normal nucleosome assembly, implicating the trinucleotide REs’ role in chromatin formation (Wang et al., 1996; Wang and Griffith, 1996). On the other hand, retrotransposons have been implicated in a range of biological processes, such as genome configuration and evolution, post-transcriptional modification, and expression control of neighboring genes following random genomic integration events (Han et al., 2004; Hasler et al., 2007; Kazazian, 2004; Peaston et al., 2004).

It is anticipated that further in-depth investigations into the genomic configuration and biological properties of diverse RE populations, in conjunction with advances in the collection of genome information, may reveal REs’ novel roles in various biologic processes essential for the evolution and structural organization of the genome, and determination of phenotype. In this study, the entire reference human genome database from the National Center for Biotechnology Information (NCBI) was surveyed to characterize arrangement structures and density of putative REs using 6,206 genome units of 0.5 M nucleotides (0.5 Mb) covering the entire reference genome (Pruitt et al., 2007). Next, selected genome units with complex, but highly-ordered, RE arrangement patterns were subjected to comparative analyses using five different human genome databases, which have been published since 2001 (Ahn et al., 2009; Levy et al., 2007; Venter et al., 2001; Wang et al., 2008; Wheeler et al., 2008). In addition, the centromere regions of all the 22 autosomes and two sex chromosomes (X and Y) were examined to identify the structural characteristics and density of REs in those regions.

Materials and Methods

Human genome databases

The following five human genome databases, which were published during the last decade, were selected for this study: NCBI (Build 37.1), Venter (HuRef version 6), Watson (as of October, 2009), Han (upload date: April 24, 2009), and SJK_r (revision: KOREF_20090224) (Ahn et al., 2009, 2004; Lander et al., 2001; Levy et al., 2007; Venter et al., 2001; Wang et al., 2008). The genome sequences were either retrieved from the web-based database as needed (NCBI, Venter, and Watson) or downloaded in their entirety from the website to a local computer (Han and SJK_r).

Generation of a library of human genome units of 0.5 Mb

From the entire NCBI human genome sequence (Build 37.1), a library of 6,206 genome units was generated by in silico cutting at every 0.5 Mb within the individual chromosomes (1 through 22, X, and Y) starting at the 5′-end. The nucleotide sequences of the individual genome units, which often included gap(s) with no sequence information, were processed and stored using the EditSeq program within DNASTAR ver. 8.0.2 (DNASTAR, Madison, WI) and chromosomal coordinates were recorded according to the NCBI reference. The gaps without any sequence information were filled with “N”s. Each genome unit was designated with the chromosome number and the sequential order of appearance within each chromosome (e.g., the 25th genome unit of chromosome X: X.025).
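The cutting and naming scheme described above can be illustrated with a minimal sketch. It is not the EditSeq workflow used in the study; the function name `fragment_chromosome`, the returned tuple layout, and the three-digit zero padding (inferred from the example X.025) are assumptions made for illustration.

```python
def fragment_chromosome(chrom_name, sequence, unit_size=500_000):
    """Cut a chromosome sequence into consecutive units starting at the 5'-end.

    Returns a dict mapping a unit ID (e.g. "X.025") to a tuple of
    (start, end, unit_sequence), with 1-based inclusive coordinates.
    The ID format is an assumption based on the example given in the text.
    """
    units = {}
    for index, start in enumerate(range(0, len(sequence), unit_size), start=1):
        end = min(start + unit_size, len(sequence))
        unit_id = f"{chrom_name}.{index:03d}"
        # Gaps are assumed to be already filled with "N" in the input sequence.
        units[unit_id] = (start + 1, end, sequence[start:end])
    return units


# Toy example with a 4-nt unit size instead of 0.5 Mb.
toy_units = fragment_chromosome("X", "ACGTNNNNAC", unit_size=4)
for unit_id, (start, end, seq) in toy_units.items():
    print(unit_id, start, end, seq)
```

The last unit of a chromosome is simply shorter than the others when the chromosome length is not a multiple of the unit size.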

Fragmentation of a human genome unit of 0.5 Mb into smaller subunits

For some human genome units of 0.5 Mb, which did not yield dot-matrix self-alignment data, presumably due to a complex RE arrangement, each genome unit was cut precisely in half sequentially, resulting in subunits, until dot-matrix data were generated. Each genome subunit was designated by the definition described in Figure 1 (e.g., the first half [0.25 Mb] and second half [0.25 Mb] of X.025: X.025.L and X.025.R, respectively).

Figure 1.

The schematic diagram illustrates how a series of half-size genome subunits were generated sequentially from a genome unit of 0.5 Mb. For each half-size cut, the 5′-fragment is designated as “L” and the 3′-fragment is designated as “R”, and lowercase and uppercase are alternated for each sequential cut to identify each genome subunit with varying sizes. Grey shade indicates the various sizes of genome subunits, which were not used in this study.
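The sequential halving can be sketched as follows. The source only shows the first-level labels (X.025.L and X.025.R); concatenating the letters of deeper cuts into a single suffix, as `subunit_label` does here, is a hypothetical format chosen for illustration, as are all function names.

```python
def halve(sequence):
    """Cut a sequence in half: returns (5'-half "L", 3'-half "R")."""
    mid = len(sequence) // 2
    return sequence[:mid], sequence[mid:]


def subunit_label(unit_id, path):
    """Build a subunit label from a string of L/R choices, one per cut.

    Case alternates per cut (uppercase for the first cut, lowercase for the
    second, and so on). The concatenated suffix is an assumed format.
    """
    letters = []
    for depth, side in enumerate(path.upper()):
        if side not in ("L", "R"):
            raise ValueError("path may only contain 'L' or 'R'")
        letters.append(side if depth % 2 == 0 else side.lower())
    return f"{unit_id}." + "".join(letters)


def subunit_sizes(unit_size=500_000, n_cuts=6):
    """Sizes (in nucleotides) after 0..n_cuts sequential halvings."""
    return [unit_size / 2 ** k for k in range(n_cuts + 1)]


print(subunit_label("X.025", "L"))   # first half of X.025
print(subunit_sizes())
```

Six sequential halvings of a 0.5 Mb unit give 7,812.5 nucleotides, which matches the smallest subunit size of 7.812 Kb reported for the density normalization.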

Survey of REs and their arrangements within each genome unit/subunit by self-alignment using the Align (bl2seq) program

The nucleotide sequence of each human genome unit/subunit was subjected to self-alignment analysis using the Align (bl2seq) program from NCBI. Although the Align (bl2seq) was originally developed to identify homology between two different sequences, its basic alignment algorithm allows for identification of REs within a single sequence. The analysis results are presented in two different formats: dot-matrix summary and nucleotide alignment data.
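To make the idea of self-alignment concrete, the following is a minimal word-match sketch of a self dot plot. It is not the bl2seq algorithm, which scores local alignments; exact k-mer matching is swapped in here as a simpler stand-in, and the function names and the word size `k` are assumptions.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (only A, C, G, T are translated)."""
    return seq.translate(COMPLEMENT)[::-1]


def self_dot_plot(seq, k=4):
    """Return (direct, inverted) sets of (i, j) word-match positions.

    direct:   the k-mer at i also occurs at j (i != j), i.e. a direct repeat.
    inverted: the reverse complement of the k-mer at i occurs at j.
    Words containing "N" (gap filler) are skipped.
    """
    positions = defaultdict(list)
    n_words = len(seq) - k + 1
    for i in range(n_words):
        positions[seq[i:i + k]].append(i)

    direct, inverted = set(), set()
    for i in range(n_words):
        word = seq[i:i + k]
        if "N" in word:
            continue
        for j in positions[word]:
            if i != j:
                direct.add((i, j))
        for j in positions.get(reverse_complement(word), []):
            inverted.add((i, j))
    return direct, inverted


direct, inverted = self_dot_plot("ACGGTTTTACGG", k=4)
print(sorted(direct))
```

Runs of consecutive direct matches line up along one diagonal of the plot, while runs of inverted matches line up along the opposite diagonal.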

Comparative analyses of selective RE arrangements among five human genome databases

Among the RE arrangements identified from the survey of the NCBI reference human genome, a total of 175 complex, but ordered, arrangements were selected for comparative analyses among five published human genome databases in regard to the similarity of conformational arrangements of REs as well as their nucleotide sequences. The nucleotide sequences corresponding to the individual NCBI genome units/subunits were extracted from the relevant databases and saved as a genome unit or subunit. The individual genome units/subunits from the other four databases were compared to the matching NCBI genome units/subunits. In addition, to demonstrate conformational changes in RE arrangements in association with sequence polymorphisms (99 % homology), the RE arrangement patterns of one representative genome subunit from NCBI, Han, and SJK_r databases were compared by overlaying the patterns using Adobe Photoshop CS2 (Adobe Systems Incorporated, San Jose, CA).

Comparison of RE density and pattern in 175 RE arrangements derived from three different genome databases

The differences in RE arrangements among three human genome sequences (Han, SJK_r, and Venter/Watson) were measured by comparison of RE density and pattern between each set of corresponding dot plots from 175 individual RE arrangements. The layer difference function of the Photoshop CS2 program (Adobe Systems Inc.), which differentiates matching (black) and mismatching (white) sections, was used to calculate a mean RGB value that indicates differences between png dot plots of two corresponding RE arrangements.
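The image comparison can be approximated outside Photoshop with a per-pixel absolute difference, as in the sketch below. The inputs are assumed to be two equal-size grayscale dot plots given as nested lists of values from 0 to 255, and expressing the mean difference as a percentage of 255 is an assumption for illustration, not a documented step of the study.

```python
def mean_difference(plot_a, plot_b):
    """Mean absolute per-pixel difference of two equal-size grayscale images."""
    if len(plot_a) != len(plot_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(plot_a, plot_b)
    ):
        raise ValueError("dot plots must have identical dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(plot_a, plot_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)  # 0 where pixels match, up to 255 where they differ
            count += 1
    return total / count


def percent_difference(plot_a, plot_b):
    """Mean difference expressed as a percentage of the maximum value (255)."""
    return 100.0 * mean_difference(plot_a, plot_b) / 255


# Toy 2 x 2 dot plots: 0 = black dot, 255 = white background.
plot_1 = [[0, 255], [255, 255]]
plot_2 = [[0, 255], [255, 0]]
print(percent_difference(plot_1, plot_2))
```

Identical plots give a difference of 0 %, which corresponds to a fully black difference layer.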

Survey of REs and their arrangements in centromere regions

Using the same protocol described above for the generation of a library of human genome units/subunits, the centromere regions of all 24 chromosomes from the NCBI database were divided into genome units or subunits, as necessary. Then, the arrangement pattern of REs within each genome unit/subunit of the individual centromere regions was analyzed by self-alignment using the Align (bl2seq) program.

Measurement of RE density of individual genome units to construct a whole-chromosome RE density plot

The RE density was initially measured from the dot plot images of each genome unit or subunit as an average RGB value using the Photoshop CS2 program (Adobe Systems Inc.). Since the RGB values obtained are inversely correlated with the density of RE dot plots, prior to the graphic presentation of RE density distribution, the RE density value was calculated by subtracting the original value from the maximum RGB value of 255 (white). In addition, the RGB values obtained from various sizes of genome subunits were normalized using the following protocol: 1) the smallest genome subunit was identified as 7.812 Kb, 2) a standard curve (a set of genome unit and subunits [7.812 Kb ~ 0.5 Mb] vs. respective RE densities) was generated using a full set of genome units/subunits from four 0.5 Mb genome units with a relatively even RE distribution, and 3) the average RE densities of the genome subunits, ranging from 7.812 Kb to 0.25 Mb, were normalized using specific conversion factors, which were calculated for individual genome subunits from the standard curve.
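The inversion step can be written as a one-line calculation: density = 255 − (mean RGB value). The sketch below assumes a grayscale dot plot supplied as nested lists of values from 0 (black dot) to 255 (white); the function name is hypothetical, and the size-dependent conversion factors are not reproduced because their values are not given in the text.

```python
def re_density(pixels):
    """RE density of a dot plot: 255 minus the mean pixel value.

    A fully white image (no dots) gives 0; a fully black image gives 255.
    """
    values = [value for row in pixels for value in row]
    if not values:
        raise ValueError("image has no pixels")
    return 255 - sum(values) / len(values)


# Toy 2 x 2 image: half of the pixels are black dots.
print(re_density([[255, 255], [0, 0]]))
```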

Results

Distribution and arrangement patterns of putative REs within the entire NCBI human genome sequence

The distribution and arrangement patterns of putative REs in the entire human genome sequence from NCBI were investigated by examining the arrangement pattern of putative REs within a genome unit of 0.5 Mb by a self-alignment protocol using the NCBI BLAST program. A total of 6,206 genome units, which were derived from the reference genome, were subjected to this survey. The majority of the genome units (6,045 of 6,206) were analyzed as intact units; however, the rest had to be further fragmented into smaller genome subunits to obtain the distribution and arrangement data, presumably due to a high level of complexity in regard to the RE arrangement in the genome units (Figure 1). It needs to be noted that no sequence information was provided for a substantial number of regions (partial or entire genome unit of 0.5 Mb) throughout the NCBI human genome database. None of the 24 chromosomes (1 through 22, X, and Y) had complete sequence information. In particular, five chromosomes (13, 14, 15, 21, and 22) lacked 5′-end sequence information ranging from ~9 Mb (chromosome 21) to ~20 Mb (chromosome 15). There was no sequence information for more than half of the Y chromosome.

In a two-dimensional dot matrix plot, a “dot” is formed where the sequences on the X and Y axes are aligned. A run of consecutive dots forms a “line” with either a 45-degree angle, indicating a direct alignment/repeat, or a 135-degree angle, indicating an inverse alignment/repeat. A comprehensive examination of the distribution and arrangement patterns of REs in all 6,206 genome units identified a diverse population of simple as well as complex, but ordered, RE arrangements within individual genome units or subunits, including large inverted repeats (Figures 2 and 3, and Supplementary Figures 1–24). Some of the RE arrangements resembled abstract drawings reflecting specific orders (Figures 2 and 4). The majority of REs residing in these complex RE arrangements were tandem REs with various sizes of repeat units; however, some interspersed REs contributed to the ordered arrangement structures as well. The patterns of these ordered RE arrangements were formed by various combinations of different types of REs, such as direct repeats, inverse repeats, and palindromes (cross of 45-degree and 135-degree lines). In addition, the length, spacing, and frequency of the individual RE components contributed to the specific patterns of the RE arrangements.

Figure 2.

The RE arrangements derived from individual genome units/subunits of chromosomes 1 and X, which are also presented in Figure 3, are assembled in order. The identifiers of genome units are labeled for every nine to ten units and all genome subunits are labeled. Both genome units and subunits are represented by the same size square. Variations in the size of genome subunits are indicated by a set of shades and patterns (legend on the bottom right corner). In addition, the 5′-end and 3′-end of a series of genome subunits within one genome unit are indicated by a circle and a triangle, respectively, in conjunction with appropriate shades for different sizes. The white square indicates genome units with no sequence information available in the database. Grey rectangles with an arrowhead indicate partial gaps present in certain genome units. A detailed view of the individual RE arrangements can be obtained by accessing the relevant electronic Supplementary Figures 1–24.

Figure 3.

The dot-matrix data set derived from 6,045 genome units of 0.5 Mb and 524 subunits of various sizes, which cover the entire NCBI reference genome, is assembled in order for each chromosome (1 through 22, X, and Y). Each genome unit/subunit is represented by a rectangle and genome units without any sequence information are indicated with a white rectangle. Grey areas indicate partial gaps present in certain genome units. Numbers in the left column indicate each chromosome. It needs to be noted that both genome units and subunits, although different in size, are represented by the same size rectangle.

Figure 4.

A total of 204 RE arrangements, which are very complex and ordered, are selected and compiled in order of chromosomes 1 through 22, X, and Y. Each labeled triangle represents one genome unit or subunit and the genome units/subunits, which were subjected to comparative analyses, are indicated with a grey arrow next to their identification.

Interestingly, areas with a higher density of REs in combination with more complex arrangement patterns were present in some chromosomes, represented by an overall darker/denser RE population, compared to the others; for example, chromosomes 17 and 19 (complex/dense) vs. chromosome 18 (less complex/dense) (Figure 3 and Supplementary Figures 1–24). Within the individual chromosomes, it was evident that certain segments, particularly the areas surrounding the regions with no sequence information, are more densely populated with REs than the others, often in conjunction with contiguous stretches of complex RE arrangements (Figures 2 and 3, and Supplementary Figures 1–24). Some sections within the genome units/subunits, which were densely populated with REs overall, were clearly delineated from the rest to be almost free of REs.

Centromeres are known to be highly populated with various types of REs; in this experiment, these regions (according to the NCBI representation and coordinates) for all chromosomes (1 through 22, X, and Y) were surveyed to characterize the distribution and arrangement patterns of REs specific for the centromeres (Lamb et al., 2004; Schueler et al., 2001). The majority of the centromere regions harbored complex RE arrangements as well as substantially large segments with no sequence information (Figure 5). It is probable that the absence of sequence information in these regions reflects technical difficulties during the sequencing and assembly of REs arranged in a complex pattern. No shared specific arrangement patterns of REs were identified in the centromere regions.

Figure 5.

The RE arrangements within each genome unit/subunit of the centromere regions of all 24 (22 autosomes, X, and Y) chromosomes are presented. Each genome unit/subunit is represented by a square and labeled using the same protocol as described in Figures 1 and 2. The white square indicates genome units with no sequence information available in the database. Grey rectangles indicate partial gaps present in certain genome units. Numbers at the 5′-end indicate individual chromosome identifications.

Chromosome-wide analyses of the distribution pattern of RE density revealed that there were substantial differences in RE density within individual chromosomes as well as among the 24 chromosomes examined (Figure 6). It was determined that chromosome 19 was the most densely populated with REs, while chromosome 4 had the lowest RE density. Among the entire set of 6,206 genome units, the Y.45 genome unit had the highest average RE density. It was interesting to observe that some chromosomal regions with a high level of RE density preceded the gaps without sequence information.

Figure 6.

The average RE densities of individual genome units/subunits are plotted in order (plus strand: 5′ to 3′) for each chromosome. There are significant differences in the distribution pattern of RE density within each chromosome as well as among the 24 chromosomes. NS indicates the regions with no sequence information. Black bar (chromosome 19) on the graph indicates the chromosome with the highest average RE density and white bar (chromosome 4) indicates the one with the lowest RE density. * genome unit (Y.45) with the highest RE density, Chr (chromosome). The chromosomal ideograms were obtained from the NCBI human genome database.

Comparative analyses of selective complex RE arrangements among five different human genome databases

Among the noteworthy RE arrangements identified during the initial survey, 204 unique arrangements were selected throughout the genome and their structural details are presented in Figure 4. It appears that the individual RE arrangements are formed based on a protocol specific for each structure. To investigate whether the structural characteristics of these RE arrangements as well as their nucleotide sequences are conserved among the five human genome databases (NCBI, Venter, Watson, Han, and SJK_r), genomic regions corresponding to 175 of 204 genome units/subunits were selected for comparative analyses (Ahn et al., 2009, 2004; Lander et al., 2001; Levy et al., 2007; Wang et al., 2008; Wheeler et al., 2008). The relative coordinates of the 175 genome units/subunits subjected to this analysis were consistent among the five databases. During the period of this study, accessing a sixth human genome database, published in August of 2009, was unsuccessful (Ahn et al., 2009).

The data obtained from the NCBI reference genome in regard to the RE arrangements and the nucleotide sequences of the genome units/subunits were directly compared to data sets extracted from the corresponding genomic regions of the four different human genome databases (Venter, Watson, Han, and SJK_r). Since these databases were established from the human subjects of presumably different genetic backgrounds, it was anticipated that there should be certain degrees of polymorphisms in both structure of RE arrangements and their nucleotide sequences, at least in some of the genome units/subunits examined. Interestingly, both the RE arrangement patterns and nucleotide sequences for all 175 genome units/subunits were 100 % identical in three of the five different databases (NCBI, Venter and Watson) (Table 1). This finding could suggest that all 175 genomic regions, summing up to 12,765,625 nucleotides, are conserved among human populations derived from different ancestries. It may indicate that maintaining the integrity of these genomic regions, in spite of the constant presence of evolutionary pressure and stress signals, is essential for certain phenotypes of humans. In contrast, the nucleotide sequences of all 175 genome units/subunits from the Han and SJK_r genome databases had a sequence homology of 97 % ~ 99 % in comparison to their NCBI/Venter/Watson counterparts (Table 1). Some of the mismatches originated from the special characters occasionally embedded into the sequences and/or internal gaps (no sequence information) in a substantial number of genome units/subunits examined. As presented in Figure 7, the sequence differences (~99 % homology) between NCBI reference vs. Han or SJK_r genome within a genomic subunit are reflected in the discrepancy among the RE arrangement patterns. 
The SJK_r genome units/subunits, which display a very low homology score compared to the NCBI reference units, primarily due to multiple alignment break points, were excluded from this analysis (indicated with an open triangle in Table 1).

Table 1.

Comparative analysis of selective RE arrangements (a total of 175) among five human genome databases.

The individual RE arrangements from four genome databases (V [Venter], W [Watson], H [Han], and K_r [SJK_r]) are compared to the NCBI references and their sequence similarities are indicated: closed circle (100 %), open circle (99 % ≤ similarity < 100 %), grey circle (97 % ≤ similarity < 99 %), open circle with a line (99 % ≤ similarity < 100 %, with a gap of no sequence information), and open triangle (not analyzed). Additional markers indicate a gap (no sequence information) at the 5′-end or at the 3′-end. Chr (chromosome), GU (genome unit/subunit), and HGD (human genome database).

Figure 7.

Comparative analyses of RE arrangement patterns of one representative genome subunit from NCBI, Han, and SJK_r databases demonstrate the effect of sequence polymorphisms (99 % homology) on their arrangement patterns.

Differences in RE arrangement density and pattern between putative Han-SJK_r clade and Venter-Watson clade

To quantify the differences in RE arrangements between putative clades of Han-SJK_r and Venter-Watson, the density and pattern of the individual RE arrangements were examined. Within the putative Venter-Watson clade, all 175 RE arrangements shared identical density and pattern, consistent with the 100 % sequence identity in these regions (Table 2). In contrast, an average difference of 2.091 % in RE arrangement density and pattern was present across all 175 arrangements between Han and SJK_r. In addition, there were average differences of 1.182 % and 1.748 % in RE arrangement density and pattern between Han and Venter-Watson (within 174 of 175 RE arrangements) and between SJK_r and Venter-Watson (within 174 of 175 RE arrangements), respectively. It is likely that the presence of a 100 % match of density and pattern in one RE arrangement from each of the comparison groups is linked to the resolution threshold of the png dot plot images and/or software. These findings support the formation of a specific clade between Venter and Watson, but not with Han and/or SJK_r. On the other hand, these results also suggest that Han and SJK_r are not close enough to share a specific clade, which is similar to the previous report by Ahn et al. (Ahn et al., 2009).

Table 2.

Similarity of RE arrangements (a total of 175) between putative clades of Han-SJK_r and Venter-Watson.

The similarities of RE arrangements among three human genome sequences (Han [H], SJK_r [Kr], and Venter/Watson [V/W]) were determined by comparison of RE density and pattern between each set of corresponding dot plots from 175 individual RE arrangements. Numbers indicate percent similarity. Chr (chromosome), GU (genome unit/subunit).

Discussion

Since the first human genome was reported to be decoded in 2001, DNA sequences of several other human genomes derived from different genetic backgrounds have been published (Ahn et al., 2009, 2004; Lander et al., 2001; Levy et al., 2007; Wang et al., 2008; Wheeler et al., 2008). Public availability of the databases allows for in-depth investigations into the biology of the human genome in a range of aspects. The outcomes from this study regarding a genome-wide survey for arrangements of putative REs are highlighted by two key findings: 1) Throughout the human genome, there was a diverse population of simple as well as complex, but ordered, RE arrangements within individual genome units/subunits examined; 2) The comparative analyses of 175 complex RE arrangements revealed that three independent genome databases (NCBI, Venter, and Watson), presumably established from different human subjects, are identical, including their nucleotide sequences. It suggests that the NCBI, Venter, and Watson genomes belong to the same clade. In addition, the results from these analyses yielded two somewhat distantly-related clades for the Han and SJK_r genomes, which are separated from the NCBI-Venter-Watson clade.

In response to relevant pathophysiologic signals, REs are reported to play an essential and specific role in the rearrangement of various genomic segments to form a diverse population of functional genes, such as the V/(D)/J rearrangements of B and T cell receptors and isotype switching of B cell receptors (Reddy et al., 2006; Stavnezer, 1996). The unique characteristics of some RE arrangements identified in this study, such as orientation, spacing, and repeat frequency, suggest that their specific configuration is essential for a precise rearrangement of scattered, but relevant, genome segments in a region into an ordered genetic element.

During their life span, humans and other species are subjected to a wide range of pathophysiologic stress signals, and the outcomes of the interactions between the genome and the environment determine the phenotype. It is likely that the transposable REs, both DNA transposons and retrotransposons, dynamically transpose to random loci in the genome of stressed cells as well as their neighbors, including the regions with a high density of ordered REs. Sporadic, but persistent, integration of transposable REs into certain genomic regions with ordered RE arrangements may cause disruption of their structures and putative functions in conjunction with phenotypic changes. The biological impacts of conformational changes in RE arrangements due to the random integration of the transposable REs may be variable depending on their arrangement patterns and genomic locations.

Tandem trinucleotide REs, such as (CAG)_n and (CTG)_n, are reported to form various secondary structures (e.g., hairpin by cytosine guanine pairing) and they are implicated in the modification of chromatin structure (Gacy et al., 1995; Mitas, 1997; Pearson et al., 2005; Richard et al., 2008). It is probable that some RE arrangements identified in this study may reflect combinations of different forms of secondary structures, which are constructed by specific types of REs and non-specific spacers. For instance, some inverted repeats of varying sizes, which were abundantly present in some RE arrangements, may be able or prone to form a secondary structure, such as a hairpin. We can speculate that the presence of certain arrangements, in conjunction with the formation of RE secondary and/or tertiary structures, may contribute to the configuration of chromatin structure and affect the promoter activity of genes.

Centromere regions are reported to be heavily populated with REs and the findings from our study revealed an extensive presence of very complex RE arrangements in these regions (Hattori et al., 2000; Lamb et al., 2004; Schueler et al., 2001). It is probable that there are some genomic regions in which various types of REs are purposely organized to form unique secondary/architectural structures for specific physical functions. The putative secondary structures of REs formed in centromere regions may play essential physical roles in chromosome compaction in metaphase and/or chromosome migration to the poles in anaphase during mitosis and meiosis. Various RE arrangements (both simple and complex), probably composed of a set of specific RE types, may play different roles as a scaffolding unit in chromosome compaction and migration in conjunction with certain structural proteins bound to the REs’ repetitive epitopes.

In conclusion, further investigations are needed to examine whether the complex and ordered RE arrangements harbor unique biological characteristics and serve as a functional genome unit. They may play critical roles in the fine control of chromatin configuration directly linked to gene expression, cell division, and other pathophysiologic processes. The scarcity of protein-coding regions and the presumably abundant presence of various types of REs in the Y chromosome suggest that certain RE arrangements may play a role in the manifestation of the male phenotype as a functional unit in cooperation with other genome elements. It may be worthwhile to investigate the biological properties of these RE arrangements by limited or complete disturbance of their arrangement structures using transgenic technology, followed by the examination of phenotypes, such as gender determinants. The outcomes of these investigations will provide new insights into the understanding of human pathobiology.

Supplementary Material

01. Supplementary Figures 1–24 RE arrangements in chromosomes 1 through 22, X, and Y of the NCBI human genome.

The RE arrangements derived from the individual genome units/subunits are assembled in order for each chromosome (1 through 22, X, and Y). Genome unit identifiers are labeled for every nine to ten units, and all genome subunits are labeled. Both genome units and subunits are represented by squares of the same size. Variations in the size of genome subunits are indicated by a set of shades and patterns. In addition, the 5′-end and 3′-end of a series of subunits within one genome unit are indicated by a circle and a triangle, respectively. A white square indicates a genome unit with no sequence information available in the database. Grey rectangles with an arrowhead indicate partial gaps present in certain genome units. Detailed information pertaining to the individual RE arrangements can be obtained by selecting each unit/subunit.

NIHMS278679-supplement-01.pdf (15.4 MB, pdf)

Acknowledgments

This study was supported by grants from Shriners of North America (No. 86800 to KC, No. 84302 to KHL [postdoctoral fellowship], No. 84308 to YKL [postdoctoral fellowship], and No. 84294 to DNK [postdoctoral fellowship]) and the National Institutes of Health (R01 GM071360 to KC).

Abbreviations

Chr: chromosome
GU: genome unit/subunit
H: Yan Huang’s genome
HGD: human genome database
K_r: Seong-Jin Kim’s genome-revised (KOREF_20090224)
LINE: long interspersed nuclear element
LTR: long terminal repeat
NCBI: National Center for Biotechnology Information
RE: repetitive element
SINE: short interspersed nuclear element
V: Craig Venter’s Genome
W: James Watson’s Genome

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

Ahn SM, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009;19:1622–1629. doi: 10.1101/gr.092197.109.
Babushok DV, et al. Current topics in genome evolution: molecular mechanisms of new gene formation. Cell Mol Life Sci. 2007;64:542–554. doi: 10.1007/s00018-006-6453-4.
Bat O, et al. Computer simulation of expansions of DNA triplet repeats in the fragile X syndrome and Huntington’s disease. J Theor Biol. 1997;188:53–67. doi: 10.1006/jtbi.1997.0451.
Beguiristain T, et al. Three Tnt1 subfamilies show different stress-associated patterns of expression in tobacco. Consequences for retrotransposon control and evolution in plants. Plant Physiol. 2001;127:212–221. doi: 10.1104/pp.127.1.212.
Cho K, et al. Endogenous retroviruses in systemic response to stress signals. Shock. 2008;30:105–116. doi: 10.1097/SHK.0b013e31816a363f.
Collins FS, et al. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. doi: 10.1038/nature03001.
Crawford DC, et al. FMR1 and the fragile X syndrome: human genome epidemiology review. Genet Med. 2001;3:359–371. doi: 10.1097/00125817-200109000-00006.
Fondon JW, 3rd, Garner HR. Molecular origins of rapid and continuous morphological evolution. Proc Natl Acad Sci USA. 2004;101:18058–18063. doi: 10.1073/pnas.0408118101.
Gacy AM, et al. Trinucleotide repeats that expand in human disease form hairpin structures in vitro. Cell. 1995;81:533–540. doi: 10.1016/0092-8674(95)90074-8.
Han JS, et al. Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature. 2004;429:268–274. doi: 10.1038/nature02536.
Hasler J, et al. Useful ‘junk’: Alu RNAs in the human transcriptome. Cell Mol Life Sci. 2007;64:1793–1800. doi: 10.1007/s00018-007-7084-0.
Hattori M, et al. The DNA sequence of human chromosome 21. Nature. 2000;405:311–319. doi: 10.1038/35012518.
Hutchinson GB, et al. An Alu element retroposition in two families with Huntington disease defines a new active Alu subfamily. Nucleic Acids Res. 1993;21:3379–3383. doi: 10.1093/nar/21.15.3379.
Jurka J. Evolutionary impact of human Alu repetitive elements. Curr Opin Genet Dev. 2004;14:603–608. doi: 10.1016/j.gde.2004.08.008.
Kazazian HH Jr. Mobile elements: drivers of genome evolution. Science. 2004;303:1626–1632. doi: 10.1126/science.1089670.
Lamb JC, et al. What’s in a centromere? Genome Biol. 2004;5:239. doi: 10.1186/gb-2004-5-9-239.
Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
Lee C, et al. Human centromeric DNAs. Hum Genet. 1997;100:291–304. doi: 10.1007/s004390050508.
Levy S, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
Madlung A, Comai L. The effect of stress on genome regulation and structure. Ann Bot. 2004;94:481–495. doi: 10.1093/aob/mch172.
Mirkin SM. Expandable DNA repeats and human disease. Nature. 2007;447:932–940. doi: 10.1038/nature05977.
Mitas M. Trinucleotide repeats associated with human disease. Nucleic Acids Res. 1997;25:2245–2254. doi: 10.1093/nar/25.12.2245.
Orr HT, et al. Expansion of an unstable trinucleotide CAG repeat in spinocerebellar ataxia type 1. Nat Genet. 1993;4:221–226. doi: 10.1038/ng0793-221.
Pearson CE, et al. Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet. 2005;6:729–742. doi: 10.1038/nrg1689.
Peaston AE, et al. Retrotransposons regulate host genes in mouse oocytes and preimplantation embryos. Dev Cell. 2004;7:597–606. doi: 10.1016/j.devcel.2004.09.004.
Pruitt KD, et al. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. doi: 10.1093/nar/gkl842.
Reddy YV, et al. Genomic instability due to V(D)J recombination-associated transposition. Genes Dev. 2006;20:1575–1582. doi: 10.1101/gad.1432706.
Richard GF, et al. Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Microbiol Mol Biol Rev. 2008;72:686–727. doi: 10.1128/MMBR.00011-08.
Schueler MG, et al. Genomic and genetic definition of a functional human centromere. Science. 2001;294:109–115. doi: 10.1126/science.1065042.
Schueler MG, Sullivan BA. Structural and functional dynamics of human centromeric chromatin. Annu Rev Genomics Hum Genet. 2006;7:301–313. doi: 10.1146/annurev.genom.7.080505.115613.
Sears KE, et al. The correlated evolution of Runx2 tandem repeats, transcriptional activity, and facial length in carnivora. Evol Dev. 2007;9:555–565. doi: 10.1111/j.1525-142X.2007.00196.x.
Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996;6:743–748. doi: 10.1016/s0959-437x(96)80030-x.
Stavnezer J. Immunoglobulin class switching. Curr Opin Immunol. 1996;8:199–205. doi: 10.1016/s0952-7915(96)80058-6.
Venter JC, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
Wang J, et al. The diploid genome sequence of an Asian individual. Nature. 2008;456:60–65. doi: 10.1038/nature07484.
Wang YH, et al. Long CCG triplet repeat blocks exclude nucleosomes: a possible mechanism for the nature of fragile sites in chromosomes. J Mol Biol. 1996;263:511–516. doi: 10.1006/jmbi.1996.0593.
Wang YH, Griffith J. Methylation of expanded CCG triplet repeat DNA from fragile X syndrome patients enhances nucleosome exclusion. J Biol Chem. 1996;271:22937–22940.
Wheeler DA, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876. doi: 10.1038/nature06884.

