Construction and validation of customized genomes for human and mouse ribosomal DNA mapping

Subin S George; Maxim Pimkin; Vikram R Paralkar

doi:10.1016/j.jbc.2023.104766

. 2023 Apr 28;299(6):104766. doi: 10.1016/j.jbc.2023.104766

Construction and validation of customized genomes for human and mouse ribosomal DNA mapping

Subin S George ¹, Maxim Pimkin ², Vikram R Paralkar ^3,^4,^5,^∗

PMCID: PMC10245113 PMID: 37121547

Abstract

rRNAs are transcribed from ribosomal DNA (rDNA) repeats, the most intensively transcribed loci in the genome. Due to their repetitive nature, there is a lack of genome assemblies suitable for rDNA mapping, creating a vacuum in our understanding of how the most abundant RNA in the cell is regulated. Our recent work revealed binding of numerous mammalian transcription and chromatin factors to rDNA. Several of these factors were known to play critical roles in development, tissue function, and malignancy, but their potential roles in rDNA regulation remained unexplored. This demonstrated the blind spot into which rDNA has fallen in genetic and epigenetic studies and highlighted an unmet need for public rDNA-optimized genome assemblies. Here, we customized five human and mouse assemblies—hg19 (GRCh37), hg38 (GRCh38), hs1 (T2T-CHM13), mm10 (GRCm38), and mm39 (GRCm39)—to render them suitable for rDNA mapping. The standard builds of these genomes contain numerous fragmented or repetitive rDNA loci. We identified and masked all rDNA-like regions, added a single rDNA reference sequence of the appropriate species as a ∼45 kb chromosome designated “chromosome R,” and created annotation files to aid visualization of rDNA features in browser tracks. We validated these customized genomes for mapping of known rDNA-binding proteins and present a simple workflow for mapping chromatin immunoprecipitation-sequencing datasets. Customized genome assemblies, annotation files, positive and negative control tracks, and Snapgene files of standard rDNA reference sequences have been deposited to GitHub. These resources make rDNA mapping and visualization more readily accessible to a broad audience.

Keywords: rRNA, rDNA, mapping, genome, transcription

Ribosomes are essential ribonucleoprotein complexes composed in eukaryotes of large 60S and small 40S subunits (1). RNA components of the ribosome, comprising a majority of its mass, are termed rRNAs (28S, 5.8S, 5S rRNAs in the 60S subunit, and 18S rRNA in the 40S subunit) and collectively form >80% of total cellular RNA. Of these, 18S, 5.8S, and 28S rRNAs are produced in mammals through stepwise cleavage of a 13.4 kb 47S pre-rRNA, the transcription of which from ribosomal DNA (rDNA) genes by RNA polymerase I (RNA Pol I) is the rate-limiting step of ribosome biogenesis. rDNA genes, each containing a transcribed region (producing 47S pre-rRNA) followed by a ∼30 kb intergenic spacer, are present in hundreds of near-identical copies (typically 200–600 per cell) in repetitive tandem arrays called nucleolar organizing regions (NORs) (2) distributed across five chromosomes in humans and mice (3, 4). rDNA genes are the most intensively transcribed loci in dividing cells and account for the bulk of all transcription (5, 6).

The repetitive nature of rDNA has placed substantial bioinformatic barriers in its study. Most standard genome assemblies do not contain intact rDNA sequences and instead contain a scattering of rDNA fragments, inhibiting the ability of investigators to effectively map high-throughput datasets to them (7). The recent telomere-to-telomere (hs1/T2T-CHM13) human genome assembly contains multiple full NOR arrays (8); however, the presence in this assembly of 219 rDNA repeats creates nontrivial mapping and interpretive complexities. A handful of groups with ribosome biogenesis expertise have mapped datasets to rDNA using privately generated rDNA-containing assemblies (9, 10, 11, 12), but, to our knowledge, these assemblies are not publicly deposited. The historical lack of suitable public genomes has led investigators at large to simply ignore rDNA when profiling the binding sites of their transcriptional regulators of interest, creating a significant vacuum in the understanding of how the most abundant RNA in the cell is regulated. Recently published work by our group illustrated the extent of this vacuum, when, through unbiased mapping of ∼2200 chromatin immunoprecipitation-sequencing (ChIP-seq) datasets for ∼250 target proteins to rDNA, we identified conserved patterns of rDNA binding for numerous hematopoietic transcription factors and chromatin regulators (7). Several of the transcription factors, including the CCAAT enhancer binding protein (CEBP) family, interferon-regulatory factor family, and SPI1, all of which have critical roles in hematopoiesis, immune cell development and function, and leukemogenesis, showed consistent, robust peaks at canonical motif sequences. Several of these motifs were conserved between human and mouse. Most of these factors had been extensively studied for decades and had been subjected to ChIP-seq by dozens of investigators, but their striking rDNA occupancy had passed unnoticed due to a lack of appropriate bioinformatic tools. We further demonstrated that the myeloid transcription factor CCAAT enhancer binding protein alpha (CEBPA) promotes RNA Pol I occupancy on rDNA in myeloid progenitors (7). Our work used mammalian hematopoiesis as a model system, but we have no reason to believe that binding of cell-type–specific transcriptional regulators to rDNA is restricted to just one tissue; this is potentially a universal mechanism that is repurposed across all tissues to modulate rRNA transcription rates per tissue-specific needs but whose prevalence is unknown due to a dearth of suitable mapping tools. Collectively, our work highlighted an unmet need for public assemblies that can be easily used for rDNA mapping.

In this article, we present human and mouse genome assemblies customized for rDNA mapping (hg19-rDNA v1.0, hg38-rDNA v1.0, hs1-rDNA v1.0, mm10-rDNA v1.0, and mm39-rDNA v1.0). We generated these assemblies by masking degenerate or repetitive rDNA-matching loci in standard human and mouse genomes and inserting a new chromosome R (chrR) composed of a single rDNA reference sequence of the appropriate species. We have made these genomes accessible by depositing them to GitHub, along with annotation files and control mapping tracks. Here, we present validation for the use of these genomes for rDNA ChIP-seq along with a simple workflow to produce rDNA mapping tracks.

Results

Generation of customized human and mouse genomes with rDNA

Human KY962518.1 (44,838 nt) and mouse Genbank BK000964.3 (45,306 nt) reference rDNA sequences are available from NCBI Genbank (13) and are used in the literature as consensus reference rDNA sequences for their respective species (8, 14, 15, 16). Each reference contains two promoters (spacer promoter and 47S promoter), a 13.4 kb transcribed region and intergenic spacer sequence (IGS) (Fig. 1A). The human hg19 (GRCh37), hg38 (GRCh38), hs1 (T2T-CHM13), and mouse mm10 (GRCm38), mm39 (GRCm39) genome assemblies each contain multiple partial, fragmented, or repetitive rDNA-matching loci. We identified and masked between 418 and 541 such regions in the genomes (Table S1, see Experimental procedures), masking total regions ranging from 252 kb (mm10) to 12 mb (hs1). We then added to these masked genomes a single full-length rDNA sequence of the respective species as an additional chrR to generate customized genomes (Fig. 1B).

Workflow for mapping to customized genomes with rDNA

Aiming to make routine rDNA mapping accessible to users with no more than basic Linux command line experience, we designed a simple workflow optimized for mapping of high-throughput datasets to the customized genomes (Fig. 2). Given the repetitive nature of rDNA and given that all reads deriving from rDNA will map to the new chrR, even datasets with lower read numbers are capable of giving interpretable chrR coverage; however, we recommend using datasets with at least 10 million total reads. We recommend running Trimmomatic (17) with standard parameters. We recommend Bowtie2 (18) for indexing custom rDNA-containing reference genomes and mapping with -X 2000 parameter to permit paired-end sequence mapping with an accommodation for 1 to 2 kb indel variation that has been documented in rDNA repeats (19). We recommend filtering Sequence Alignment Map (SAM) files produced by Bowtie2 with Samtools view (20) -q 1 to allow for some residual multimapping that may occur despite genome masking. After sorting and indexing the resultant Binary Alignment Map (BAM) files with Samtools (20), we recommend using DeepTools BamCoverage (21) with -bs 1 to generate nucleotide-resolution tracks. PCR duplicate removal is not recommended; dense overlapping read coverage is expected on chrR due to the presence of hundreds of rDNA copies per cell, leading to the unavoidable presence of independent reads with the same start and end sites, confounding PCR duplicate removal tools. Given that each individual human or mouse has a different number of rDNA copies in its genome and given that biases arising from library preparation and sequencing protocols can produce some unevenness in background coverage density across the rDNA gene sequence (22), users of our customized genomes should anticipate that different cell lines, tissues, and sample preparation methodologies will yield chrR background signal of varying depth and slight idiosyncratic nonuniformity that could be misinterpreted as peak signal if appropriate controls are not used. To robustly distinguish specific signal from background variation, sequencing artifact, or PCR duplication, the inclusion of at least two biological ChIP-Seq replicates, along with paired input and IgG pull-down samples prepared through the same protocol and with the same number of PCR cycles, is critical. We also mention at the bottom of Figure 2 a method of track normalization, to be used if quantitative signal comparison between datasets is required.

Validation of customized genomes with rDNA

To validate the ability of our customized genomes and workflow to yield valid rDNA tracks, we picked four proteins whose rDNA ChIP-seq mapping patterns are known (7, 11): CCCTC-binding factor (CTCF) (insulator transcription factor), TATA-binding protein (TBP), RNA Polymerase I Subunit A (core catalytic subunit of RNA Pol I), and upstream binding transcription factor (UBTF) (high mobility group-box protein that binds and activates rDNA). We picked as a negative control EP300, a transcriptional coactivator known to bind promoters and enhancers of Pol II–transcribed genes (23) but not known to bind rDNA. Using the public repositories Encyclopedia of DNA Elements (ENCODE) (24) and GEO/SRAdb (25), we identified human and mouse ChIP-seq datasets for each of these factors and downloaded their raw FASTQ files as well as matched background controls (Input or IgG deposited in the same accession) (links to original datasets are provided in Table S2). We passed these datasets through the workflow in Figure 2 and visualized BigWig (.bw) tracks on the Integrative Genomics Viewer (IGV) browser (26) along with the genome and annotation files (tracks for hg38-rDNA and mm39-rDNA are shown in Figure. 3, A and B). In all cases, background controls (Input and IgG) showed relatively flat background chrR coverage of significantly lower intensity than specific factor pull-down. Upstream binding transcription factor and RNA Polymerase I Subunit A showed the expected broad signal throughout the transcribed region, TBP showed the expected peaks at the spacer and 47S promoters, and CCCTC-binding factor showed the expected sharp peak slightly upstream of TBP at the spacer promoter (7, 11). EP300, as expected, did not show any rDNA occupancy (Fig. 3, A and B). Similar tracks were observed on mapping these datasets to the hg19-rDNA, hs1-rDNA, and mm10-rDNA genomes (not shown; tracks deposited to GitHub). Thus, we have validated our customized genomes and workflow for high-throughput dataset mapping.

Accessibility of customized genomes and annotation files

All customized human and mouse genomes and associated files have been deposited to GitHub (https://github.com/vikramparalkar/rDNA-Mapping-Genomes). Specifically, FASTA files for three human (hg19-rDNA v1.0, hg38-rDNA v1.0, hs1-rDNA v1.0) and two mouse (mm10-rDNA v1.0, mm39-rDNA v1.0) genomes have been deposited, along with annotation Browser Extensible Data (BED) files (containing standard RefSeq gene annotations plus new chrR annotations) and Bigwig files for positive and negative controls tracks for each genome.

Discussion

The public availability of genome assemblies (27), enabling widespread profiling in diverse cell types of chromatin accessibility, histone modifications, transcription factor binding, and chromosome conformation, has facilitated unprecedented progress over the past 2 decades in our understanding of gene regulation in health and disease. However, certain regions of the genome remain poorly studied due to their repetitive nature (28). rDNA genes, organized in tandem arrays in NORs, are a prominent example. Due to the inability of standard genome assemblies and mapping pipelines to yield interpretable rDNA data, a stark contrast has emerged between the depths of our understanding of coding genes compared to rDNA genes. Given that rRNA is the most abundant cellular RNA and accounts for the bulk of all transcription, this is a significant bioinformatic blind spot.

We attempt in this article to address this deficiency by generating and depositing to GitHub-customized versions of multiple commonly used human and mouse genomes containing a single copy of reference rDNA as an added chromosome, along with annotation files, sample tracks, and a simple mapping workflow. This method can be easily extended to any new assembly that may be released, as well as to other species for which full-length reference rDNA sequences are available. Our recent work (7) indicates that rDNA is bound and regulated in complex eukaryotes by a variety of cell-type–specific transcription factors. The implications of such binding for evolution, cell biology, development, and oncogenesis are substantial and warrant a systematic exploration of whether modulation of ribosome production rates through control of rRNA transcription is a conserved function of many transcription factors and chromatin regulators and whether this regulation is hijacked in malignancies to promote rapid ribosome biogenesis. Identification of such factors will only become possible with the routine incorporation of rDNA-containing genomes into mapping pipelines.

rDNA repeats, though fairly uniform in their tandem repeat arrangements, are not perfectly identical and show variations in the form of single nucleotide polymorphisms scattered throughout the repeat, as well as small insertions and deletions in the IGS (16, 19, 29, 30). These variations can be seen between repeats within the same cell as well as between individuals within the same species. The next challenge for rDNA bioinformatics will be to accommodate rDNA variants. Continued work by the T2T Consortium (8) and the Human Pangenome Project (31) will, over time, reveal the full extent of variation in rDNA number and sequence. We anticipate that future standard genome assemblies of all species will contain a systematized diversity of rDNA repeat sequences, and the next generation of mapping tools will need to grapple with the peculiar challenges of effectively mapping to these repetitive loci.

In summary, we provide in this article customized genomes, annotation files, and a simple workflow to make human and mouse rDNA mapping accessible to nonspecialists in ribosome biogenesis. We envision that routine integration of these genomes into standard bioinformatic pipelines will allow the discovery and characterization of novel regulators of rDNA relevant to normal and disease biology.

Experimental procedures

Identifying regions of human and mouse genome assemblies with sequence similarity to reference rDNA

The full-length rDNA sequence is ∼45 kb in human and mouse (19, 32), and all human and mouse assemblies contain numerous partial, fragmented, or repeated rDNA sequence stretches. To systematically identify these regions for masking, we used four methods: (i) we obtained from the UCSC genome browser (33) a list of all Repeatmasker (34) annotations titled “LSU-rRNA” or “SSU-rRNA” (large and small subunit rRNA). (ii) we obtained from NCBI Genbank (13) the reference sequences for human and mouse rDNA (KY962518.1 (19) and BK000964.3 (32), respectively) and used BLAST-like alignment tool (BLAT) (35) to identify genomic intervals with matches to the reference sequences. (iii) modifying a previously reported approach (22), we fragmented the KY962518.1 and BK000964.3 rDNA gene reference sequences to a collection of 50 nt “reads” that we mapped to genomes of the respective species using Bowtie2 (18) with the -k 10 parameter to include multiple mapped locations. We then parsed the output Sequence Alignment Map file to identify stretches of the genome with consecutive mapping of at least eight of the fragmented “reads”, including those in which adjacent stretches were separated by less than 200 nt. (iv) for the hs1 (T2T-CHM13) genome, which contains full NOR arrays on five chromosomes, we obtained from the USCS genome browser (33) a list of five Centromeric Satellite Annotations titled “rDNA”. To add buffer regions, we extended intervals from (i), (ii), and (iv) by 100 nt upstream and downstream and those from (iii) by 500 nt. Because several intervals were overlapping or redundant, we used the Merge tool in the public Galaxy server usegalaxy.org (36) to merge the interval groups and produce a final nonredundant set of intervals (Table S1):

hg19 (GRCh37): We finalized for masking 473 intervals ranging from 225 nt to 61 kb in length, with mean and median lengths of 763 nt and 331 nt, respectively. One interval was longer than 45 kb (the reference rDNA length) and on manual curation were found to be composed of more than one rDNA copy.

hg38 (GRCh38): We finalized for masking 473 intervals ranging from 226 nt to 271 kb in length, with mean and median lengths of 1862 nt and 323 nt, respectively. Four intervals were longer than 45 kb (the reference rDNA length) and on manual curation were found to be composed of sequence combinations corresponding to more than one rDNA copy.

hs1 (T2T-CHM13): We finalized for masking 418 intervals ranging from 1132 nt to 3.5 mb in length, with mean and median lengths of 29 kb and 5 kb, respectively. These included five full NOR arrays on chromosomes 13, 14, 15, 21, 22, ranging from 718 kb to 3.5 mb and encompassing 219 rDNA repeats.

mm10 (GRCm38): We finalized for masking 541 intervals ranging from 231 nt to 17 kb in length, with mean and median lengths of 467 nt and 311 nt, respectively.

mm39 (GRCm39): We finalized for masking 516 intervals ranging from 223 nt to 32 kb in length, with mean and median lengths of 554 nt and 318 nt, respectively.

Generating masked genome assemblies

We used the bedtools maskfasta (37) tool to mask human and mouse rDNA-like loci and masked a total of 361 kb, 881 kb, 12 mb, 252 kb, and 286 kb in the hg19, hg38, hs1, mm10, and mm39 genomes, respectively. We remapped the 50 nt fragmented reads generated from human and mouse rDNA references (see section above) to the masked genomes of the respective species and found no consecutive mapping of more than eight reads, confirming adequate masking.

Adding reference rDNA sequence to masked genome assemblies

The human KY962518.1 (44,838 nt) and mouse BK000964.3 (45,306 nt) reference rDNA sequences both start with the rRNA transcription start site (TSS) and end with the Pol I promoter. Given that these ∼45 kb units are repeated in tandem arrays in NORs, their 3′ ends can be seen as continuous with their 5′ ends, and we therefore modified their sequences to bring their promoters upstream of their TSSs for ease of track visualization. We did so by cutting the human KY962518.1 sequence at the 35,500 nt position and the mouse BK000964.3 sequence at the 36,000 nt position and transposed the sequences downstream of the cut to upstream of the TSSs, such that in the modified orientation, the rDNA sequences began with ∼10 kb of IGS before transitioning into promoters and transcribed regions (final orientation shown in Fig. 1A). We chose the cut sites based on our rDNA mapping study of ∼2200 ChIP-seq datasets (7), which did not show overlap of these sites with any transcription factor, chromatin protein, or histone modification peaks. We then added the modified rDNA reference sequences to masked genomes of the respective species as an additional chrR (Fig. 1B). These customized assemblies are hg19-rDNA v1.0, hg38-rDNA v1.0, hs1-rDNA v1.0, mm10-rDNA v1.0, mm39-rDNA v1.0.

Generating annotation files containing chromosome R

We obtained from the USCS genome browser (33) the RefGene (38) annotation BED files for each assembly, swapped RefSeq NM/NR IDs with Gene Name IDs, and added chrR annotations depicting the known locations of spacer and 47S promoters, rRNA transcript segments, and thyroid transcription factor-1 (TTF1) binding sites within the rDNA references (7, 11, 39). The final annotation BED files were confirmed to be compatible with the Integrative Genomics Viewer browser (26) for visualization of chrR as well as for genome-wide visualization of standard gene structures.

Data availability

Customized genome assemblies, annotation files, and positive and negative control mapping tracks are available through GitHub at: https://github.com/vikramparalkar/rDNA-Mapping-Genomes.

Supporting information

This article contains supporting information (11, 40).

Conflict of interests

The authors declare that they have no conflicts of interest with the contents of this article.

Acknowledgments

Author contributions

S. S. G. and V. R. P. methodology; V. R. P. supervision; M. P. and V. R. P. conceptualization; V. R. P. supervision; V. R. P. funding acquisition; V. R. P. project administration; S. S. G. and V. R. P. writing–original draft; S. S. G. software; S. S. G. validation; S. S. G. investigation; S. S. G. data curation; S. S. G. visualization; M. P. writing–review and editing.

Funding and additional information

V. R. P. is supported by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) under award number R35-GM138035. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Biography

graphic file with name fx1.jpg

Subin S.George is a bioinformatician at the Institute for Biomedical Informatics at the University of Pennsylvania Perelman School of Medicine. He studies the epigenetic regulation of ribosomal DNA and became interested in this topic because of a dearth of computational tools to study the genetics and epigenetics of this highly transcribed repetitive gene. He hopes that the tools generated for this manuscript are helpful to the broader genomics community. Twitter: @Subin_S_George.

Reviewed by members of the JBC Editorial Board. Edited by Brian Strahl

Supporting information

Supporting Table S1

mmc1.xlsx^{(74.5KB, xlsx)}

Supporting Table S2

mmc2.xlsx^{(11.1KB, xlsx)}

References

1.Baßler J., Hurt E. Eukaryotic ribosome assembly. Annu. Rev. Biochem. 2019;88:281–306. doi: 10.1146/annurev-biochem-013118-110817. [DOI] [PubMed] [Google Scholar]
2.Long E.O., Dawid I.B. Repeated genes in eukaryotes. Annu. Rev. Biochem. 1980;49:727–764. doi: 10.1146/annurev.bi.49.070180.003455. [DOI] [PubMed] [Google Scholar]
3.Henderson A.S., Warburton D., Atwood K.C. Location of ribosomal DNA in the human chromosome complement. Proc. Natl. Acad. Sci. U. S. A. 1972;69:3394–3398. doi: 10.1073/pnas.69.11.3394. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Dev V.G., Tantravahi R., Miller D.A., Miller O.J. Nucleolus organizers in Mus musculus subspecies and in the RAG mouse cell line. Genetics. 1977;86:389–398. [PMC free article] [PubMed] [Google Scholar]
5.Miller O.L., Beatty B.R. Visualization of nucleolar genes. Science. 1969;164:955–957. doi: 10.1126/science.164.3882.955. [DOI] [PubMed] [Google Scholar]
6.Pederson T. The nucleolus. Cold Spring Harb. Perspect. Biol. 2011;3:a000638. doi: 10.1101/cshperspect.a000638. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Antony C., George S.S., Blum J., Somers P., Thorsheim C.L., Wu-Corts D.J., et al. Control of ribosomal RNA synthesis by hematopoietic transcription factors. Mol. Cell. 2022;82:3826–3839.e9. doi: 10.1016/j.molcel.2022.08.027. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Nurk S., Koren S., Rhie A., Rautiainen M., Bzikadze A.V., Mikheenko A., et al. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Zentner G.E., Saiakhova A., Manaenkov P., Adams M.D., Scacheri P.C. Integrative genomic analysis of human ribosomal DNA. Nucleic Acids Res. 2011;39:4949–4960. doi: 10.1093/nar/gkq1326. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Zentner G.E., Balow S.A., Scacheri P.C. Genomic characterization of the mouse ribosomal DNA locus. G3 (Bethesda) 2014;4:243–254. doi: 10.1534/g3.113.009290. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Herdman C., Mars J.-C., Stefanovsky V.Y., Tremblay M.G., Sabourin-Felix M., Lindsay H., et al. A unique enhancer boundary complex on the mouse ribosomal RNA genes persists after loss of Rrn3 or UBF and the inactivation of RNA polymerase I transcription. PLoS Genet. 2017;13 doi: 10.1371/journal.pgen.1006899. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Diesch J., Bywater M.J., Sanij E., Cameron D.P., Schierding W., Brajanovski N., et al. Changes in long-range rDNA-genomic interactions associate with altered RNA polymerase II gene programs during malignant transformation. Commun. Biol. 2019;2:39. doi: 10.1038/s42003-019-0284-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Sayers E.W., Bolton E.E., Brister J.R., Canese K., Chan J., Comeau D.C., et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022;50:D20–D26. doi: 10.1093/nar/gkab1112. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Henras A.K., Plisson-Chastang C., O'Donohue M.-F., Chakraborty A., Gleizes P.-E. An overview of pre-ribosomal RNA processing in eukaryotes. Wiley Interdiscip. Rev. RNA. 2015;6:225–242. doi: 10.1002/wrna.1269. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Agrawal S., Ganley A.R.D. The conservation landscape of the human ribosomal RNA gene repeats. PLoS One. 2018;13 doi: 10.1371/journal.pone.0207531. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Hori Y., Shimamoto A., Kobayashi T. The human ribosomal DNA array is composed of highly homogenized tandem clusters. Genome Res. 2021;31:1971–1982. doi: 10.1101/gr.275838.121. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Bolger A.M., Lohse M., Usadel B. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Kim J.-H., Dilthey A.T., Nagaraja R., Lee H.S., Koren S., Dudekula D., et al. Variation in human chromosome 21 ribosomal RNA genes characterized by TAR cloning and long-read sequencing. Nucleic Acids Res. 2018;46:6712–6725. doi: 10.1093/nar/gky442. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Ramírez F., Ryan D.P., Grüning B., Bhardwaj V., Kilpert F., Richter A.S., et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44:W160–W165. doi: 10.1093/nar/gkw257. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Mars J.-C., Sabourin-Felix M., Tremblay M.G., Moss T. A deconvolution protocol for ChIP-seq reveals analogous enhancer structures on the mouse and human ribosomal RNA genes. G3 (Bethesda) 2018;8:303–314. doi: 10.1534/g3.117.300225. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Zhou P., Gu F., Zhang L., Akerberg B.N., Ma Q., Li K., et al. Mapping cell type-specific transcriptional enhancers using high affinity, lineage-specific Ep300 bioChIP-seq. Elife. 2017;6 doi: 10.7554/eLife.22039. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.ENCODE Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Barrett T., Wilhite S.E., Ledoux P., Evangelista C., Kim I.F., Tomashevsky M., et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013;41:D991–D995. doi: 10.1093/nar/gks1193. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Robinson J.T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E.S., Getz G., et al. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Gates A.J., Gysi D.M., Kellis M., Barabási A.-L. A wealth of discovery built on the human genome project - by the numbers. Nature. 2021;590:212–215. doi: 10.1038/d41586-021-00314-6. [DOI] [PubMed] [Google Scholar]
28.Treangen T.J., Salzberg S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2011;13:36–46. doi: 10.1038/nrg3117. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Parks M.M., Kurylo C.M., Dass R.A., Bojmar L., Lyden D., Vincent C.T., et al. Variant ribosomal RNA alleles are conserved and exhibit tissue-specific expression. Sci. Adv. 2018;4:eaao0665. doi: 10.1126/sciadv.aao0665. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Fan W., Eklund E., Sherman R.M., Liu H., Pitts S., Ford B., et al. Widespread genetic heterogeneity of human ribosomal RNA genes. RNA. 2022;28:478–492. doi: 10.1261/rna.078925.121. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Wang T., Antonacci-Fulton L., Howe K., Lawson H.A., Lucas J.K., Phillippy A.M., et al. The human pangenome project: a global resource to map genomic diversity. Nature. 2022;604:437–446. doi: 10.1038/s41586-022-04601-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Grozdanov P., Georgiev O., Karagyozov L. Complete sequence of the 45-kb mouse ribosomal DNA repeat: analysis of the intergenic spacer. Genomics. 2003;82:637–643. doi: 10.1016/s0888-7543(03)00199-x. [DOI] [PubMed] [Google Scholar]
33.Kent W.J. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Bao W., Kojima K.K., Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob. DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Kent W.J. BLAT--the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Galaxy Community The galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res. 2022;50:W345–W351. doi: 10.1093/nar/gkac247. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.O’Leary N.A., Wright M.W., Brister J.R., Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–D745. doi: 10.1093/nar/gkv1189. [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Diermeier S.D., Németh A., Rehli M., Grummt I., Längst G. Chromatin-specific regulation of mammalian rDNA transcription by clustered TTF-I binding sites. PLoS Genet. 2013;9 doi: 10.1371/journal.pgen.1003786. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Sanij E., Diesch J., Lesmana A., Poortinga G., Hein N., Lidgerwood G., et al. A novel role for the Pol I transcription factor UBTF in maintaining genome stability through the regulation of highly transcribed Pol II genes. Genome Res. 2015;25:201–212. doi: 10.1101/gr.176115.114. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Table S1

mmc1.xlsx^{(74.5KB, xlsx)}

Supporting Table S2

mmc2.xlsx^{(11.1KB, xlsx)}

Data Availability Statement

Customized genome assemblies, annotation files, and positive and negative control mapping tracks are available through GitHub at: https://github.com/vikramparalkar/rDNA-Mapping-Genomes.

[bib1] 1.Baßler J., Hurt E. Eukaryotic ribosome assembly. Annu. Rev. Biochem. 2019;88:281–306. doi: 10.1146/annurev-biochem-013118-110817. [DOI] [PubMed] [Google Scholar]

[bib2] 2.Long E.O., Dawid I.B. Repeated genes in eukaryotes. Annu. Rev. Biochem. 1980;49:727–764. doi: 10.1146/annurev.bi.49.070180.003455. [DOI] [PubMed] [Google Scholar]

[bib3] 3.Henderson A.S., Warburton D., Atwood K.C. Location of ribosomal DNA in the human chromosome complement. Proc. Natl. Acad. Sci. U. S. A. 1972;69:3394–3398. doi: 10.1073/pnas.69.11.3394. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Dev V.G., Tantravahi R., Miller D.A., Miller O.J. Nucleolus organizers in Mus musculus subspecies and in the RAG mouse cell line. Genetics. 1977;86:389–398. [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Miller O.L., Beatty B.R. Visualization of nucleolar genes. Science. 1969;164:955–957. doi: 10.1126/science.164.3882.955. [DOI] [PubMed] [Google Scholar]

[bib6] 6.Pederson T. The nucleolus. Cold Spring Harb. Perspect. Biol. 2011;3:a000638. doi: 10.1101/cshperspect.a000638. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Antony C., George S.S., Blum J., Somers P., Thorsheim C.L., Wu-Corts D.J., et al. Control of ribosomal RNA synthesis by hematopoietic transcription factors. Mol. Cell. 2022;82:3826–3839.e9. doi: 10.1016/j.molcel.2022.08.027. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Nurk S., Koren S., Rhie A., Rautiainen M., Bzikadze A.V., Mikheenko A., et al. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] 9.Zentner G.E., Saiakhova A., Manaenkov P., Adams M.D., Scacheri P.C. Integrative genomic analysis of human ribosomal DNA. Nucleic Acids Res. 2011;39:4949–4960. doi: 10.1093/nar/gkq1326. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Zentner G.E., Balow S.A., Scacheri P.C. Genomic characterization of the mouse ribosomal DNA locus. G3 (Bethesda) 2014;4:243–254. doi: 10.1534/g3.113.009290. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11.Herdman C., Mars J.-C., Stefanovsky V.Y., Tremblay M.G., Sabourin-Felix M., Lindsay H., et al. A unique enhancer boundary complex on the mouse ribosomal RNA genes persists after loss of Rrn3 or UBF and the inactivation of RNA polymerase I transcription. PLoS Genet. 2017;13 doi: 10.1371/journal.pgen.1006899. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Diesch J., Bywater M.J., Sanij E., Cameron D.P., Schierding W., Brajanovski N., et al. Changes in long-range rDNA-genomic interactions associate with altered RNA polymerase II gene programs during malignant transformation. Commun. Biol. 2019;2:39. doi: 10.1038/s42003-019-0284-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13.Sayers E.W., Bolton E.E., Brister J.R., Canese K., Chan J., Comeau D.C., et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022;50:D20–D26. doi: 10.1093/nar/gkab1112. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] 14.Henras A.K., Plisson-Chastang C., O'Donohue M.-F., Chakraborty A., Gleizes P.-E. An overview of pre-ribosomal RNA processing in eukaryotes. Wiley Interdiscip. Rev. RNA. 2015;6:225–242. doi: 10.1002/wrna.1269. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] 15.Agrawal S., Ganley A.R.D. The conservation landscape of the human ribosomal RNA gene repeats. PLoS One. 2018;13 doi: 10.1371/journal.pone.0207531. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] 16.Hori Y., Shimamoto A., Kobayashi T. The human ribosomal DNA array is composed of highly homogenized tandem clusters. Genome Res. 2021;31:1971–1982. doi: 10.1101/gr.275838.121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] 17.Bolger A.M., Lohse M., Usadel B. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] 18.Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19.Kim J.-H., Dilthey A.T., Nagaraja R., Lee H.S., Koren S., Dudekula D., et al. Variation in human chromosome 21 ribosomal RNA genes characterized by TAR cloning and long-read sequencing. Nucleic Acids Res. 2018;46:6712–6725. doi: 10.1093/nar/gky442. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] 20.Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] 21.Ramírez F., Ryan D.P., Grüning B., Bhardwaj V., Kilpert F., Richter A.S., et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44:W160–W165. doi: 10.1093/nar/gkw257. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] 22.Mars J.-C., Sabourin-Felix M., Tremblay M.G., Moss T. A deconvolution protocol for ChIP-seq reveals analogous enhancer structures on the mouse and human ribosomal RNA genes. G3 (Bethesda) 2018;8:303–314. doi: 10.1534/g3.117.300225. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] 23.Zhou P., Gu F., Zhang L., Akerberg B.N., Ma Q., Li K., et al. Mapping cell type-specific transcriptional enhancers using high affinity, lineage-specific Ep300 bioChIP-seq. Elife. 2017;6 doi: 10.7554/eLife.22039. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] 24.ENCODE Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] 25.Barrett T., Wilhite S.E., Ledoux P., Evangelista C., Kim I.F., Tomashevsky M., et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013;41:D991–D995. doi: 10.1093/nar/gks1193. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] 26.Robinson J.T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E.S., Getz G., et al. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] 27.Gates A.J., Gysi D.M., Kellis M., Barabási A.-L. A wealth of discovery built on the human genome project - by the numbers. Nature. 2021;590:212–215. doi: 10.1038/d41586-021-00314-6. [DOI] [PubMed] [Google Scholar]

[bib28] 28.Treangen T.J., Salzberg S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2011;13:36–46. doi: 10.1038/nrg3117. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] 29.Parks M.M., Kurylo C.M., Dass R.A., Bojmar L., Lyden D., Vincent C.T., et al. Variant ribosomal RNA alleles are conserved and exhibit tissue-specific expression. Sci. Adv. 2018;4:eaao0665. doi: 10.1126/sciadv.aao0665. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] 30.Fan W., Eklund E., Sherman R.M., Liu H., Pitts S., Ford B., et al. Widespread genetic heterogeneity of human ribosomal RNA genes. RNA. 2022;28:478–492. doi: 10.1261/rna.078925.121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib31] 31.Wang T., Antonacci-Fulton L., Howe K., Lawson H.A., Lucas J.K., Phillippy A.M., et al. The human pangenome project: a global resource to map genomic diversity. Nature. 2022;604:437–446. doi: 10.1038/s41586-022-04601-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib32] 32.Grozdanov P., Georgiev O., Karagyozov L. Complete sequence of the 45-kb mouse ribosomal DNA repeat: analysis of the intergenic spacer. Genomics. 2003;82:637–643. doi: 10.1016/s0888-7543(03)00199-x. [DOI] [PubMed] [Google Scholar]

[bib33] 33.Kent W.J. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] 34.Bao W., Kojima K.K., Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob. DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] 35.Kent W.J. BLAT--the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] 36.Galaxy Community The galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res. 2022;50:W345–W351. doi: 10.1093/nar/gkac247. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] 37.Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib38] 38.O’Leary N.A., Wright M.W., Brister J.R., Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–D745. doi: 10.1093/nar/gkv1189. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib39] 39.Diermeier S.D., Németh A., Rehli M., Grummt I., Längst G. Chromatin-specific regulation of mammalian rDNA transcription by clustered TTF-I binding sites. PLoS Genet. 2013;9 doi: 10.1371/journal.pgen.1003786. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib40] 40.Sanij E., Diesch J., Lesmana A., Poortinga G., Hein N., Lidgerwood G., et al. A novel role for the Pol I transcription factor UBTF in maintaining genome stability through the regulation of highly transcribed Pol II genes. Genome Res. 2015;25:201–212. doi: 10.1101/gr.176115.114. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Construction and validation of customized genomes for human and mouse ribosomal DNA mapping

Subin S George

Maxim Pimkin

Vikram R Paralkar

Abstract

Results

Generation of customized human and mouse genomes with rDNA

Figure 1.

Workflow for mapping to customized genomes with rDNA

Figure 2.

Validation of customized genomes with rDNA

Figure 3.

Accessibility of customized genomes and annotation files

Discussion

Experimental procedures

Identifying regions of human and mouse genome assemblies with sequence similarity to reference rDNA

Generating masked genome assemblies

Adding reference rDNA sequence to masked genome assemblies

Generating annotation files containing chromosome R

Data availability

Supporting information

Conflict of interests

Acknowledgments

Author contributions

Funding and additional information

Biography

Supporting information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Construction and validation of customized genomes for human and mouse ribosomal DNA mapping

Subin S George

Maxim Pimkin

Vikram R Paralkar

Abstract

Results

Generation of customized human and mouse genomes with rDNA

Figure 1.

Workflow for mapping to customized genomes with rDNA

Figure 2.

Validation of customized genomes with rDNA

Figure 3.

Accessibility of customized genomes and annotation files

Discussion

Experimental procedures

Identifying regions of human and mouse genome assemblies with sequence similarity to reference rDNA

Generating masked genome assemblies

Adding reference rDNA sequence to masked genome assemblies

Generating annotation files containing chromosome R

Data availability

Supporting information

Conflict of interests

Acknowledgments

Author contributions

Funding and additional information

Biography

Supporting information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases