Nucleic Acids Research. 2000 Jun 15;28(12):2342–2352. doi: 10.1093/nar/28.12.2342

Evolutionary appearance of genes encoding proteins associated with box H/ACA snoRNAs: Cbf5p in Euglena gracilis, an early diverging eukaryote, and candidate Gar1p and Nop10p homologs in archaebacteria

Yoh-ichi Watanabe 1, Michael W Gray 1,a
PMCID: PMC102724  PMID: 10871366

Abstract

A reverse transcription–polymerase chain reaction (RT–PCR) approach was used to clone a cDNA encoding the Euglena gracilis homolog of yeast Cbf5p, a protein component of the box H/ACA class of snoRNPs that mediate pseudouridine formation in eukaryotic rRNA. Cbf5p is a putative pseudouridine synthase, and the Euglena homolog is the first full-length Cbf5p sequence to be reported for an early diverging unicellular eukaryote (protist). Phylogenetic analysis of putative pseudouridine synthase sequences confirms that archaebacterial and eukaryotic (including Euglena) Cbf5p proteins are specifically related and are distinct from the TruB/Pus4p clade that is responsible for formation of pseudouridine at position 55 in eubacterial (TruB) and eukaryotic (Pus4p) tRNAs. Using a bioinformatics approach, we also identified archaebacterial genes encoding candidate homologs of yeast Gar1p and Nop10p, two additional proteins known to be associated with eukaryotic box H/ACA snoRNPs. These observations raise the possibility that pseudouridine formation in archaebacterial rRNA may be dependent on analogs of the eukaryotic box H/ACA snoRNPs, whose evolutionary origin may therefore predate the split between Archaea (archaebacteria) and Eucarya (eukaryotes). Database searches further revealed, in archaebacterial and some eukaryotic genomes, two previously unrecognized groups of genes (here designated ‘PsuX’ and ‘PsuY’) distantly related to the Cbf5p/TruB gene family.

INTRODUCTION

In eukaryotes, cytoplasmic rRNA species other than 5S rRNA are transcribed as part of a longer precursor (pre-rRNA), from which the mature rRNAs are processed. In most cases, the pre-rRNA, which includes external and internal transcribed spacers, undergoes endonucleolytic cleavage to yield large subunit (LSU) rRNAs (5.8S and 26–28S) and a small subunit (SSU) rRNA. To date, the mechanisms of eukaryotic pre-rRNA processing have been well studied only in relatively late diverging species such as yeast (Saccharomyces cerevisiae) and vertebrate animals. These investigations have provided information about the complexity and order of processing events, as well as the involvement of numerous cis-elements and trans-acting factors (for recent reviews, see 1–4).

In the protist phylum Euglenozoa, which includes euglenid protozoa such as Euglena gracilis and kinetoplastid protozoa such as Trypanosoma spp, the LSU rRNA is further fragmented into smaller pieces (a total of 14 in the case of E.gracilis; 5). Euglena gracilis LSU rRNA also contains a substantially higher proportion of O2′-methylnucleosides than its counterparts in yeast and vertebrates (6; M.N.Schnare, personal communication). As in other eukaryotes, the highly fragmented euglenid LSU rRNA is derived from a pre-rRNA that includes the SSU rRNA sequence (7,8); however, information about trans-acting factors and details about the rRNA maturation pathway in E.gracilis are still quite limited (8,9).

Although questions about the precise phylogenetic position of the Euglenozoa within the domain Eucarya (eukaryotes) have been raised by recent protein phylogenies (10–13), this phylum is generally considered to represent an early-branching eukaryotic lineage (14). In this context, studies of ribosome biogenesis in members of the Euglenozoa should provide insights into the evolutionary origin of the system that processes eukaryotic rRNA.

The box H/ACA snoRNAs constitute one of the two major classes of snoRNA in eukaryotes (15). Box H/ACA snoRNAs display a secondary structure that features two hairpin motifs and two conserved sequence blocks, box H (ANANNA, in the hinge region between the two hairpin structures) and box ACA (ACANNN, at the 3′ end) (15–18). The box H/ACA snoRNAs participate in either rRNA processing (e.g. yeast snR30) (19) or formation of pseudouridine (5-ribosyluracil, Ψ) in rRNA. The latter function relies on complementarity within ‘pseudouridine pockets’ to short regions flanking the modification sites in rRNA (16,20). Some box H/ACA snoRNAs (e.g. yeast snR10) (20,21) are involved in both nucleolytic processing and post-transcriptional modification. Box H/ACA snoRNAs have been identified in so-called ‘crown’ eukaryotes such as animals, plants and yeast (reviewed in 22,23; see also 18,24–26) as well as in a ciliate protozoon, Tetrahymena thermophila (27).
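The two consensus elements quoted above can be expressed as simple patterns. The following Python sketch is illustrative only: the function name and the toy sequence are hypothetical, and a match to ANANNA or ACANNN says nothing about the two-hairpin secondary structure, so any hit would be a candidate at best.

```python
import re

# Consensus elements as quoted in the text; N is taken to mean any nucleotide.
# The lookahead lets overlapping ANANNA matches be reported.
BOX_H = re.compile(r"(?=(A[ACGU]A[ACGU]{2}A))")
# ACANNN anchored at the 3' end, i.e. ACA followed by exactly three nucleotides.
BOX_ACA = re.compile(r"ACA[ACGU]{3}$")


def scan_box_elements(rna):
    """Return (0-based start positions of ANANNA, True if the RNA ends in ACANNN)."""
    seq = rna.upper().replace("T", "U")  # accept DNA-style input as well
    h_starts = [m.start() for m in BOX_H.finditer(seq)]
    return h_starts, BOX_ACA.search(seq) is not None


# Toy sequence (hypothetical, not a real snoRNA).
starts, has_aca = scan_box_elements("GGCCAGAGCAGGCCUUUUACAUGU")
print(starts, has_aca)
```

In practice such a scan would be combined with a secondary-structure check, since short degenerate motifs like these occur frequently by chance.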

In S.cerevisiae, box H/ACA snoRNAs form a complex with the proteins Gar1p, Cbf5p, Nhp2p and Nop10p (4,28–31). Cbf5p, originally identified as a centromere/microtubule-binding protein (32), is an essential protein in yeast, implicated in rRNA processing (33) and stability of box H/ACA snoRNAs (4). Cbf5p shares sequence similarity with the TruB/Pus4p family of tRNA Ψ55 synthases (34–38) and, in association with box H/ACA snoRNAs, is believed to act as an rRNA Ψ synthase. Eukaryotic Cbf5p homologs have been identified and characterized in rat (NAP57; 34), human (dyskerin; 39,40), Drosophila [Nop60B (41) or Minifly protein (26)] and fungi (42). Genes for archaebacterial Cbf5p homologs have also been identified (43–47) but their function is currently unclear.

As yet, there is no published information about box H/ACA snoRNP proteins in unicellular eukaryotes (protists), especially ones thought to be early branching members of the domain Eucarya. As part of an on-going investigation of rRNA processing and modification in the early diverging protist, E.gracilis, we have cloned and characterized a full-length cDNA encoding the Cbf5p protein of this organism. In the course of this study, we also identified candidate genes encoding archaebacterial homologs of two of the proteins (Gar1p and Nop10p) known to be associated with eukaryotic box H/ACA snoRNPs, as well as genes specifying two novel protein families related to Cbf5p and TruB.

MATERIALS AND METHODS

RNA and DNA

Total cellular DNA and RNA from E.gracilis were prepared as described by Breckenridge et al. (48). Poly(A)+ RNA was selected from total RNA using the PolyATtract system (Promega, Madison, WI).

Oligonucleotides

Oligoribonucleotide P-1R (5′-AAUAAAGCGGCCGCGGAUCCAA-3′) was purchased from Dalton Chemical Laboratories Inc. (North York, ON, Canada). Oligodeoxyribonucleotides (Table 1) were obtained from Gibco BRL (Burlington, ON, Canada) or ID Labs Biotechnology, Inc. (London, ON, Canada).

Table 1. Oligonucleotides used for cloning of Cbf5p cDNA and genomic sequences.

Primer Sequence (5′ to 3′) Corresponding peptide sequence
P-4 5′-AATAAAGCGGCCGCGGATCCAA-3′
dTP-4 5′-AATAAAGCGGCCGCGGATCCAAT16-3′
P-16 5′-AATAAAGCGGCCGCGGATCCAAT16(A/G/C)N-3′
P-22 5′-CTAATACGACTCACTATAGGGCTCGAGCGGCCGCCCGGGCAGGT-3′
P-23ddC 5′-pACCTGCCddC-3′ (a)
P-24 5′-GGATCCTAATACGACTCACTATAGGGC-3′
P-25 5′-AATAGGGCTCGAGCGGC-3′
P-55 5′-GGAGCTCAATAAAGCGGCCGC-3′
P-70 5′-AGTCA(T/C)GA(A/G)GTN(T/G)CNTGG-3′ SHEVV(A/S)W (sense)
P-73 5′-ACTTTTGGN(A/T)C(C/A/G)A(A/G)NGTNCC-3′ GTL(D/V)PKV (antisense)
P-77 5′-GCT(A/G)T(T/C/G)TG(T/C)TA(T/C)GGNGCNAA-3′ A(I/V)CYGAK (sense)
P-79 5′-ATNGC(C/T)TCNCC(T/C)TTNGTNGT-3′ TTKGEA(I/V) (antisense)
P-85 5′-TTGAAGCGAATCCTGAAGGTGGAGAAGAC-3′
P-87 5′-GATGACCTCGTTGACCTCAATGCCGT-3′
P-86 5′-AAGGTGGAGAAGACTGGGCACGCA-3′
P-88 5′-CAATGCCGTTCTCAAAGCGGAGGAGT-3′
P-95 5′-CCAATCGGGTGGCTCTCTCAATGCAGA-3′
P-99 5′-CTCGGCATCGCCCTCATGACCA-3′
P-107 5′-CTGCGGCAGCAACGAGGAGGCGATGAA-3′
P-163 5′-GGCATAGCTTCAGGCCGTAGGTCATTCA-3′ (b)
P-164 5′-CCCCTGGAACGTGCGGTGCTGCTGCTT-3′ (c)

(a) 3′ end blocked with enzymatically added dideoxy C (ddC), rather than an amine group as in ref. 56.

(b) Corresponds to positions 1469 to 1496 of the cDNA sequence.

(c) Corresponds to positions 1511 to 1537 of the cDNA sequence.
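The degenerate primers in Table 1 (P-70, P-73, P-77 and P-79) are written with bracketed alternatives such as (T/C) and with N for any base. As a rough illustration of what that notation implies, the Python sketch below parses it and counts the distinct sequences a primer pool would contain. The function names are hypothetical, and the parser handles only the bracket/N notation, not the T16 or ddC shorthand used for other primers in the table.

```python
import re
from itertools import product

# One token is either a bracketed set such as (T/C) or a single base / N.
TOKEN = re.compile(r"\(([ACGT](?:/[ACGT])+)\)|([ACGTN])")


def primer_positions(primer):
    """Return one string of allowed bases per primer position."""
    positions, consumed = [], 0
    for match in TOKEN.finditer(primer):
        if match.start() != consumed:
            raise ValueError(f"unexpected character at index {consumed}")
        consumed = match.end()
        group, single = match.groups()
        if group:
            positions.append(group.replace("/", ""))
        elif single == "N":
            positions.append("ACGT")
        else:
            positions.append(single)
    if consumed != len(primer):
        raise ValueError(f"unexpected character at index {consumed}")
    return positions


def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    count = 1
    for bases in primer_positions(primer):
        count *= len(bases)
    return count


def expand_primer(primer):
    """List every concrete sequence in the degenerate pool."""
    return ["".join(combo) for combo in product(*primer_positions(primer))]


p70 = "AGTCA(T/C)GA(A/G)GTN(T/G)CNTGG"  # P-70 from Table 1, without 5'/3' labels
print(len(primer_positions(p70)), degeneracy(p70))  # 18 positions, 128-fold degenerate
```

By the same count, P-73, P-77 and P-79 come out as 768-, 384- and 1024-fold degenerate, respectively.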

Generation of Cbf5p cDNA sequence by reverse transcription–polymerase chain reaction (RT–PCR)

RT of a fraction enriched in poly(A)+ RNA (∼500 ng) was carried out using Superscript II reverse transcriptase (RNase H⁻; Gibco BRL) and modified oligo(dT) primers (Table 1; dTP-4 for degenerate PCR or P-16 for 3′ RACE, rapid amplification of cDNA ends), following the supplier’s protocol. The oligo-capping method of Maruyama and Sugano (49) was used in 5′ RACE experiments. In brief, a fraction enriched in poly(A)+ RNA was treated with calf intestinal alkaline phosphatase (New England Biolabs, Beverly, MA) followed by tobacco acid pyrophosphatase (Epicentre Technologies, Madison, WI). Using T4 RNA ligase (Amersham-Pharmacia Biotech, Cleveland, OH), the 3′ end of RNA oligo P-1R was joined to the 5′ end of the RNA originally possessing the cap structure. The resulting RNA was subjected to RT as described above using P-87 (Table 1) as a primer.

PCR mixtures were prepared as described by Baskaran et al. (50) using Pfu DNA polymerase (1.5 U/ml, Stratagene, La Jolla, CA) but substituting Taq DNA polymerase (25 U/ml of reaction mix, Gibco BRL) for the KlenTaq1 polymerase. Initially, two internal portions of Cbf5p sequence were amplified by degenerate PCR using modified touchdown protocols (51) with the primer pairs P-70/P-73 and P-77/P-79 (Table 1; degenerate RT–PCR, Fig. 1). These PCR products were purified by non-denaturing polyacrylamide gel electrophoresis, re-amplified using a conventional PCR cycle, then cloned and sequenced as outlined below. New primers (P-85 and P-87, then P-86 and P-88 in the nested reaction), designed on the basis of the sequence information obtained, were used to connect the two portions (internal RT–PCR, Fig. 1). For 5′ RACE, the combination P-55 + P-87 was followed by P-4 + P-95 in the nested reaction (Fig. 1). For 3′ RACE, P-85 + P-55, then P-86 + P-4 were used in the nested reaction (3′ RACE I, Fig. 1).

Figure 1.

Strategy for RT–PCR cloning of E.gracilis Cbf5p cDNA. Primers (see Table 1) are shown as arrowheads at positions corresponding to their target binding sites.

To verify poly(A) addition site(s) more precisely, an additional round of 3′ RACE was carried out using as template dTP-4-primed cDNA together with primers specific for sequence further downstream (P-99 and P-107, respectively, in primary and nested PCR) and anchor-specific primers (P-4 and P-55, respectively, in primary and nested reactions; 3′ RACE II, Fig. 1). Ten clones selected in this way were sequenced.

Nested PCR products were purified by agarose gel electrophoresis, cloned (52), then sequenced by the dideoxy method (53) with modifications (48,54,55). To minimize the possibility of ‘PCR mutation’, at least three independent clones from each reaction were sequenced on both strands. The P-86 to P-88 portion of the 3′ RACE I products was not fully analyzed because this region overlapped parts of other RT–PCR products.

Genomic PCR

To examine the 3′ terminal region of the E.gracilis Cbf5p gene, which includes the sequence heterogeneities noted during cDNA analysis (see Results), a PCR walking strategy (56) was used with minor modifications. In brief, E.gracilis total DNA was digested separately with a number of restriction enzymes that generate blunt ends. A partially double-stranded adapter (P-22 and P-23ddC annealed together; see table 1 and figure 1 of ref. 56) was ligated to both ends of the resulting restriction fragments using T4 DNA ligase. The ligation products were then used as templates in PCR with different combinations of cDNA- and adapter-specific primers (P-163 and P-24 in the primary reaction, P-164 and P-25 in the secondary reaction; Table 1). One of the PCR products (∼420 bp), amplified from template prepared from DraI-digested DNA, was purified, cloned and sequenced.

Northern hybridization

The fraction enriched in poly(A)+ RNA (500 ng) was electrophoresed in a 1% agarose–0.4 M formaldehyde gel (57), then transferred to GeneScreen Plus nylon membrane (NEN, Boston, MA) according to the protocol of Chomczynski and Mackey (58). A cloned cDNA fragment (P-86 to P-88) was amplified by PCR using a corresponding plasmid clone as template DNA. The amplification product was purified by agarose gel electrophoresis, then labeled by a random priming protocol (59) using [α-32P]dATP. Hybridization was performed in 0.5 M sodium phosphate buffer (pH 7.2), 7% SDS and 1 mM EDTA (60) at 65°C overnight. The membrane was immersed in 40 mM sodium phosphate buffer, 5% SDS and 1 mM EDTA at 65°C for 5 min, then 40 mM sodium phosphate buffer, 1% SDS and 1 mM EDTA at 65°C for 20 min (3×), and finally subjected to autoradiography with an intensifying screen at –75°C.

Southern hybridization

Total cellular DNA (5 µg) was hydrolyzed with restriction enzymes and the products were separated by electrophoresis in a 0.7% agarose gel, then transferred to GeneScreen Plus as described by Chomczynski (61). The cDNA fragment was amplified by PCR and labeled as above. Hybridization and autoradiography were carried out as described above.

Analysis of sequence data

Sequence data were assembled and analyzed using MacDNASIS version 3.5 (Hitachi Software, Yokohama, Japan). Searches of public domain nucleotide and protein sequence databases were carried out by Gapped-BLAST (BLASTP or TBLASTN; 62) through the NCBI web server (www.ncbi.nlm.nih.gov) using the default options unless otherwise specified. Sequence alignments were generated with Clustal W version 1.7 (63) followed by manual modification. Based on the alignment, 181 amino acid positions were selected for phylogenetic analysis, with positions of insertion and deletion omitted. Phylogenetic trees were constructed using the quartet puzzling and maximum likelihood methods of protein phylogeny in the PUZZLE 4.0 program (64). The JTT-F + γ model of amino acid substitutions was assumed in the analysis (64–66). Rate heterogeneity among sites was approximated by a discrete gamma distribution (with four categories).
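Omitting positions of insertion and deletion amounts to discarding every alignment column in which at least one sequence carries a gap. The Python sketch below shows that step on a toy alignment; the function names and the three short sequences are hypothetical. It covers gap exclusion only, so it should not be read as the full procedure by which the 181 positions were chosen, which also involved manual adjustment of the alignment.

```python
def ungapped_columns(aligned, gap_chars="-."):
    """Return 0-based indices of columns in which no sequence has a gap."""
    rows = list(aligned.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [
        index
        for index, column in enumerate(zip(*rows))
        if not any(residue in gap_chars for residue in column)
    ]


def strip_gapped_columns(aligned):
    """Return the alignment restricted to its gap-free columns."""
    keep = ungapped_columns(aligned)
    return {
        name: "".join(sequence[i] for i in keep)
        for name, sequence in aligned.items()
    }


# Toy alignment (hypothetical sequences): columns 1 and 2 each contain a gap.
toy = {"seqA": "MK-LV", "seqB": "MKALV", "seqC": "M-ALV"}
print(ungapped_columns(toy))      # [0, 3, 4]
print(strip_gapped_columns(toy))  # every sequence reduced to "MLV"
```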

RESULTS

Assembly of a cDNA sequence encoding E.gracilis Cbf5p

As summarized in Figure 1, an RT–PCR approach was used to assemble a complete cDNA sequence encoding Euglena Cbf5p. Initially, highly degenerate primers were designed on the basis of conserved peptide motifs identified in an alignment of eukaryotic homologs of yeast Cbf5p. Two of the primer sets successfully amplified short cDNA fragments that appeared to specify portions of Euglena Cbf5p (degenerate PCR in Fig. 1). These partial cDNA sequences were used in further primer design to connect the initial two sequences, after which additional sets of primers were synthesized as necessary to allow completion of the sequence by RACE techniques.

The 5′ end of the cDNA was obtained by an oligo-capping method (49) that in theory should yield the authentic 5′ end of the corresponding mature mRNA. In E.gracilis, as in the trypanosomatid protozoa, most mRNAs acquire a common 5′ terminal sequence as a result of a trans-splicing event that covalently joins the 5′ portion of a small, separately transcribed RNA, the spliced leader (SL) RNA, to different mRNA transcripts (67). In our hands, the oligo-capping procedure yielded an SL sequence having two additional 5′ nucleotides (AC) compared with previously reported sequences (67,68). Known gene sequences for the Euglena SL RNA do have AC at the corresponding positions (68); thus, the Euglena SL sequence very likely is 28 nt long, beginning with the sequence 5′-ACAC...

The SL RNA features an as yet uncharacterized methylguanosine 5′ cap structure, as suggested by analyses (Y.Watanabe, unpublished results) using an anti-TMG antibody that in our hands reacts with monomethyl- as well as trimethylguanosine (69). Previous studies of SL sequences in Euglena had employed direct reverse transcriptase sequencing, with the enzyme evidently stopping at the third or fourth positions from the 5′ end, presumably as a result of post-transcriptional modifications in the SL sequence (70; see also 71, but no sequencing gel shown). The protocol used in the present study may allow reverse transcriptase extension through these modifications to some degree; alternatively, reverse transcriptase readthrough may reflect undermodification at the 5′ end of the Euglena SL RNA, as observed in the case of the trypanosomatid SL (72–74). The latter RNA has a characteristic cap4 structure in which the first four 5′ nucleotides are methylated in the base and/or sugar (O2′-methyl) moieties (72,75). Our results and those of previous studies suggest that Euglena SL RNA likely harbors modifications in its first four 5′ nucleotides. Thus, Euglena and kinetoplastid protozoa may be similar not only in possessing trans-splicing but also in having an SL RNA that is extensively modified at its 5′ end. Recently, a 28-nt SL beginning with 5′-ACUC... and also possibly containing a 5′ cap structure with extensive modifications was characterized in the colorless euglenid, Entosiphon sulcatum (76).

In 3′ RACE experiments that examined cDNA synthesized using an oligo(dT) primer, we obtained evidence of heterogeneity at the poly(A) addition site (Fig. 2). 3′ RACE analysis also revealed a likely sequence heterogeneity at a single position (Fig. 2): at this site, four clones had C whereas five clones had T, with one clone lacking the corresponding region altogether (see below). It seemed unlikely that these differences could be attributed to RT or PCR artifacts because they were observed in clones generated in independent 3′ RACE experiments (data not shown). On the other hand, sequencing of the genomic PCR product comprising the 3′ end of the cDNA indicated that the position in question is T in the gene sequence (three out of three independent clones sequenced). The poly(A) addition site in the one ‘deleted’ clone is ambiguous due to the possible occurrence of A residues immediately 5′ to the poly(A) tail; all other clones had a pyrimidine residue at the poly(A) junction.

Figure 2.

cDNA sequence encoding E.gracilis Cbf5p, displayed with the deduced amino acid sequence. Locations of the trans-spliced leader sequence, TruB motifs I and II, and PUA domain (36,84) and the C-terminal KKE repeat are indicated. Alternative polyadenylation sites determined by 3′ RACE (see Fig. 1) are denoted by asterisks. Because in the gene sequence A residues immediately follow one of the deduced polyadenylation sites, the position of this site was assigned tentatively.

Genomic Southern analysis (data not shown) combined with the results of genomic PCR make it unlikely that there are multiple copies of the Cbf5p gene in E.gracilis nuclear DNA, although detailed Cbf5p gene structure has not yet been investigated. Northern blot analysis (not shown) revealed a Cbf5p mRNA of ∼1.7 kb, consistent with the size of the complete cDNA sequence plus about 100 3′ terminal A residues.

Characteristics of the Euglena Cbf5p sequence

The E.gracilis Cbf5p cDNA sequence contains an open reading frame of 467 residues specifying a protein of predicted molecular mass 52 392 Da (Fig. 2). Like the homologous fungal sequences, Euglena Cbf5p lacks an identifiable positively charged N-terminal extension that in metazoan Cbf5p sequences is assumed to represent a nuclear localization signal. However, Euglena Cbf5p does possess the C-terminal KKE/D repeats that are characteristic of all known eukaryotic Cbf5p homologs. These repeats display microtubule-binding activity in vitro and are essential for viability in S.cerevisiae (32) (although not in another yeast, Kluyveromyces lactis; 42).

In Escherichia coli Ψ synthases acting on rRNA and tRNA, a highly conserved aspartate residue is essential for activity (77–80), and a possible catalytic role for this residue via its β-carbonyl group has been proposed (77,81). A recent crystallographic study supports this idea (82). A motif that includes this conserved Asp (the TruB motif II) has been identified not only among members of the TruB/Pus4p family of tRNA Ψ55 synthases (37) but also in Cbf5p (36), and the functional importance of this Asp in both TruB (80) and Cbf5p (83) has now been demonstrated. A partial alignment of eukaryotic and archaebacterial Cbf5p homologs, encompassing highly conserved regions (motifs I and II) characteristic of Ψ synthases, is shown in Figure 3 (a full alignment of the corresponding Cbf5p sequences is available upon request). Exceptional conservation of the motif II sequence (including the functionally critical Asp residue) is evident among Cbf5p homologs, including the E.gracilis one.

Figure 3.

A portion of the Cbf5p sequence alignment showing the region encompassing TruB motifs I and II. Motifs are adapted from Koonin (36), with highly conserved residues boxed. The position of the conserved, functionally important Asp (see text) is marked with an asterisk. Sequences shown above the Euglena sequence are eukaryotic, sequences below are archaebacterial. GenBank accession numbers are: S.cerevisiae, AAA34473; K.lactis, AAC64862; Candida albicans, AAB94297; Emericella nidulans, AAB94296; Sartorya (Aspergillus) fumigata, AAB94298; Schizosaccharomyces pombe, CAB10131; D.melanogaster, AAC97117; C.elegans, CAB07244; Homo sapiens, AAB94299; Rattus norvegicus, P40615; A.fulgidus, AAB90995; M.jannaschii, AAB98132; Pyrococcus abyssi, CAB49444; Pyrococcus horikoshii, O59357; M.thermoautotrophicum, O26140. The A.pernix sequence is a fusion of two peptide sequences (BAA79973 and BAA79974) encoded by separate open reading frames (nucleotide sequence AP000060.1). In the complete alignment (available from the authors on request), non-Cbf5p-homologous regions of these two peptides were excluded.

An unusual arrangement of the Cbf5p gene is seen in Aeropyrum pernix, the only member of the archaebacterial kingdom Crenarchaeota for which a complete genome sequence is available (the other completely sequenced archaebacterial genomes being from members of the Euryarchaeota). In this case, the Cbf5p homolog is encoded by two partially overlapping open reading frames and the motif II aspartate residue (Asp60) is replaced by glutamate (47). In mutagenesis experiments with the E.coli TruA tRNA Ψ synthase, Huang et al. (77) found that Glu at this position could not substitute functionally for the conserved Asp.

Phylogenetic relationships

Phylogenetic analysis of eukaryotic and archaebacterial Cbf5p homologs (Fig. 4) suggests that Euglena diverged from the main branch of eukaryotic evolution earlier than the other eukaryotes for which Cbf5p sequences are currently available. However, confirmation of this point will require a more diverse collection of eukaryotic Cbf5p sequences, including additional protist ones. This phylogenetic analysis also demonstrates that archaebacterial and eukaryotic (including Euglena) Cbf5p sequences comprise two distinct branches of a single clade to the exclusion of the affiliated TruB (eubacterial) and Pus4p (eukaryotic) sequences, which form a separate clade. These affinities are supported by the presence of unique insertions found only in TruB and yeast Pus4p sequences and by the absence of an apparent pseudouridine synthase and archaeosine (PUA) domain in Pus4p and most eubacterial TruB sequences (84).

Figure 4.

Maximum likelihood phylogenetic tree of selected Cbf5p/TruB family sequences. Sequences are as listed in Figure 3, with GenBank accession numbers for additional sequences as follows: Thermotoga maritima, AAD35938; Aquifex aeolicus, AAC06885; Bacillus subtilis, CAB13539; E.coli, AAC76200; S.cerevisiae (Pus4p), P48567; S.pombe (Pus4p), CAA20692. Numbers on branches are the quartet puzzling support values from 1000 puzzling steps (64).

Potential homologs of box H/ACA snoRNP proteins in archaebacterial genomes

The known presence in archaebacterial genomes of genes encoding homologs of Cbf5p and Nhp2p (another box H/ACA snoRNP-specific protein) prompted us to search for genes that might specify Gar1p and Nop10p, the remaining two proteins recently identified as components of box H/ACA snoRNPs (30,31). In a TBLASTN search of sequenced archaebacterial genomes using the yeast Gar1p sequence, a potential Methanobacterium homolog was identified (E value 0.004). Using this Methanobacterium sequence as query, putative homologs from other archaebacterial genomes were identified (E values ranging from 10⁻¹² to 0.43 for Aeropyrum). Figure 5A presents an alignment of the N-terminal portion of inferred archaebacterial and eukaryotic Gar1p sequences. The archaebacterial versions are predicted to have shorter N- and C-terminal regions than their eukaryotic homologs, with sequence identity limited to short blocks dispersed throughout the sequence. In this regard, it has been shown in the case of S.cerevisiae Gar1p that Gly–Arg repeats at both N- and C-termini are not essential for growth (85), with the protein produced by in vitro translation able to interact with box H/ACA snoRNAs through its internal core region (86). Thus, the shorter archaebacterial Gar1p candidates could well be functional.

Figure 5.

(A) Alignment of the N-terminal portion of putative Gar1p protein sequences from archaebacteria and some eukaryotes. GenBank accession numbers for these sequences are: P.abyssi, CAB49230; M.jannaschii, P81312; A.pernix, BAA79764; M.thermoautotrophicum, AAB85384; Encephalitozoon cuniculi, CAA07263; S.cerevisiae, P28007; D.melanogaster, S49193; Arabidopsis thaliana, AAF00626. The sequences of P.horikoshii, A.fulgidus, C.parvum and H.sapiens (the latter two being partial sequences from EST data) are from open reading frames in nucleotide sequences AP000007.1, AE001014, AA532317 and AA308727, respectively. (B) Alignment of putative Nop10p sequences from archaebacteria and some eukaryotes. Accession numbers are: P.abyssi, CAB49761; P.horikoshii, translation of an open reading frame in AP000004; A.fulgidus, O29724; M.thermoautotrophicum, O27362; M.jannaschii, P81303; A.thaliana, AAD25649; Trypanosoma brucei, translation from AA681026 (EST data). The sequences of S.cerevisiae and H.sapiens Nop10p are from (31) whereas the Aeropyrum Nop10p sequence is from an open reading frame that begins with TTG in the nucleotide sequence AP000059. Highly conserved residues are shown in white on a black background whereas gray shading indicates conservative substitutions.

Using the yeast Nop10p sequence in a TBLASTN search of complete archaebacterial genome sequences, we identified possible homologs of this protein; E values ranged from 10⁻⁵ to 0.84 (except for 4.9 in the case of Methanococcus), with corresponding values between 10⁻¹⁶ and 10⁻¹⁰ obtained using the Pyrococcus Nop10p homologs (which have the identical protein sequence) as query. Amino acid sequence identity among the putative archaebacterial and eukaryotic Nop10p homologs (Fig. 5B) is somewhat greater than in the case of Gar1p, with N-terminal and C-terminal regions being the most divergent. However, in contrast to Gar1p, the putative Nop10p sequences from archaebacteria and eukaryotes are virtually the same length.

Examination of sequenced archaebacterial genomes revealed that in all cases, the putative Nop10p gene is physically linked to the gene encoding the homolog of eukaryotic translation initiation factor IF2-α, and sometimes ribosomal protein genes, as well, possibly in the same operon (Fig. 7A). Further, genes for some candidate archaebacterial Gar1p homologs (in Methanococcus, Methanobacterium, Archaeoglobus and Aeropyrum) are physically linked to genes for archaebacterial homologs of eukaryotic transcription factor IIB (TFIIB) (Fig. 7B). In Archaea, transcription of both rRNA and mRNA requires TFIIB (87). Furthermore, the candidate Aeropyrum Gar1p gene is in the same transcriptional orientation (and so may be part of the same operon) as the gene for ribosomal protein S8E (Fig. 7B). The organization of these putative archaebacterial Gar1p and Nop10p genes suggests that their expression may be co-regulated with components of the transcription and translation machineries in these organisms.

Figure 7.

(A) Organization of the genes encoding archaebacterial homologs of Nop10p (filled rectangles). IF2-α, homolog of eukaryotic translation initiation factor 2-α; L44E and S27E, ribosomal protein genes. (B) Organization of the genes encoding archaebacterial homologs of Gar1p (filled rectangles). TFIIB, homolog of eukaryotic transcription factor IIB; APE0788, unidentified open reading frame; S8E, ribosomal protein gene. (C) Organization of the genes encoding putative archaebacterial PsuX orthologs (filled rectangles). SRP54/Ffh, homolog of eukaryotic signal recognition particle GTPase; L21E, ribosomal protein gene. Direction of transcription is indicated by arrows above the rectangles. Note that the sizes of genes and spacers are not drawn to scale, and the region shown may not represent the entire operon.

Identification in archaebacteria and some eukaryotes of genes encoding a novel group of proteins related to the Cbf5p/TruB gene family

In an attempt to identify additional protein sequences related to known Ψ synthases, we conducted BLAST searches using as query an internal portion of the E.gracilis Cbf5p sequence (positions G66 to A271) that excluded Cbf5p-specific N- and C-terminal regions. In particular, the query sequence lacked the PUA domain, which has been proposed as a possible RNA-binding domain in a broad range of known or putative RNA-binding proteins (84). In addition to the expected Cbf5p/TruB family sequences (E values ranging from 10⁻⁸⁶ to 0.45 except for a value of 5.1 in the case of S.cerevisiae Pus4p), this search detected a previously uncharacterized Archaeoglobus sequence (GenBank accession no. AAB90092) at an E value of 0.052. Using this sequence in Gapped-BLAST searches (62), additional novel Cbf5p/TruB-related sequences (here designated ‘PsuX’) were detected (Fig. 6A). In BLASTP or TBLASTN searches, we identified PsuX sequences with high statistical significance (E values <10⁻²⁰) in all completely sequenced archaebacterial genomes, as well as detecting possible orthologous full-length sequences in the animals Caenorhabditis elegans and Drosophila melanogaster and a partial sequence in a protist, Giardia intestinalis; on the other hand, no PsuX-related sequences were evident in the yeast (S.cerevisiae) or eubacterial genomes. BLAST searches of EST databases suggested that mammals and plants also express a PsuX homolog (data not shown). After four rounds of iteration using the Archaeoglobus fulgidus sequence as query in a more sensitive PSI-BLAST (62) search and with the low-complexity filter program SEG (88) off, authentic Cbf5p/TruB family sequences but no pseudo-positives appeared with a high degree of significance (E values <10⁻⁸).

Figure 6.

(A) Alignment of the C-terminal portion of the novel PsuX protein sequence family (the full-length alignment is available upon request). The overlining denotes a highly conserved stretch of the PsuX alignment that contains two Asp residues and is reminiscent of a Ψ synthase motif (B). GenBank accession numbers for the PsuX sequences are: P.horikoshii, BAA30059; P.abyssi, CAB49759; M.jannaschii, Q60346; M.thermoautotrophicum, AAB85800; A.fulgidus, AAB90092; A.pernix, BAA79514; C.elegans, CAB60423; D.melanogaster (translation of open reading frame in GenBank accession number AC005334). The portions of the A.pernix and E.gracilis Cbf5p sequences that display similarity to the PsuX sequences are shown at the bottom of the alignment; these Cbf5p sequences correspond to K57 to L132 of accession number BAA79974 (A.pernix) and K127 to L205 (E.gracilis; see Fig. 2). Shading of residues is as described in Figure 5. (B) Alignment of different classes of Ψ synthase motifs (77) plus the overlined PsuX stretch shown in (A). Highly conserved positions are shown, with non-conserved positions indicated by ‘x’. Because the putative PsuX motif contains two Asp residues that might correspond to the catalytic Asp (indicated by the asterisk) in known Ψ synthases, two slightly different alignments are shown.

The Cbf5p/TruB-homologous portion of the PsuX protein sequence does not contain a readily identifiable TruB motif II (Fig. 6A) which, as noted above, includes the functionally critical Asp residue. However, alignment of PsuX protein sequences did reveal a highly conserved stretch, (A/S)GRED(V/I)D(A/V)R(M/T/V)LG (positions 183–194 in the A.fulgidus sequence) (Fig. 6A), that is likely a functionally important region. This conserved stretch resembles known Ψ synthase motifs (77) and includes two Asp residues (Fig. 6B). In Figure 6B, the PsuX2 alignment follows the DxxxxG pattern proposed for the TruB/RluA/RsuA superfamily (79).
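The conserved stretch described above can be written as a simple pattern and used to scan a protein sequence. The following is a minimal sketch, not part of the original analysis: the function name `find_psux_stretch` and the toy sequence are hypothetical, and a plain regular expression is used here only for illustration in place of the alignment-based comparison reported in the text.

```python
import re

# Consensus from the text: (A/S)GRED(V/I)D(A/V)R(M/T/V)LG
# Each parenthesized alternative becomes a character class.
PSUX_STRETCH = re.compile(r"[AS]GRED[VI]D[AV]R[MTV]LG")


def find_psux_stretch(protein_seq):
    """Return (start, end, matched text) for each occurrence.

    Positions are 1-based and inclusive, matching the way residue
    ranges such as 183-194 are quoted in the text.
    """
    seq = protein_seq.upper()
    return [(m.start() + 1, m.end(), m.group()) for m in PSUX_STRETCH.finditer(seq)]


# Hypothetical toy sequence (NOT a real PsuX protein), for illustration only.
toy = "MKTLLAGREDVDARMLGKKSE"
hits = find_psux_stretch(toy)
```

In this toy case the 12-residue match spans positions 6 to 17; the two Asp residues discussed in the text are the fifth and seventh residues of the match.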

Another notable feature of PsuX sequences is the presence of CX2C motifs within the N-terminal region. Some of these motifs (e.g., C4X2C7, C20X2C23, C139X2C142 and C147X2C150 in A.fulgidus) are well conserved among the aligned sequences (not shown), although the C139X2C142 and C147X2C150 motifs are shared only among A.fulgidus, Methanococcus jannaschii and Methanobacterium thermoautotrophicum. The CX2C motif is frequently found in metal-binding domains of various nucleic acid-binding proteins (89). Pus1p, a yeast tRNA Ψ synthase, contains a zinc ion essential for function and tRNA binding, and potential zinc-binding elements similar to CX2C motifs have been proposed for this protein (90).
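As an illustration of how CX2C motifs of the kind listed above could be located, the sketch below reports the 1-based positions of the two Cys residues of every C-X-X-C occurrence, so that a motif such as C4X2C7 would appear as the pair (4, 7). The function name `find_cx2c` and the toy sequences are hypothetical and are not taken from the PsuX alignment.

```python
import re

# Zero-width lookahead so that overlapping motifs (e.g. C-x-x-C-x-x-C) are all found.
CX2C = re.compile(r"(?=C[A-Z]{2}C)")


def find_cx2c(protein_seq):
    """Return (first Cys, second Cys) positions, 1-based, for each CX2C motif."""
    seq = protein_seq.upper()
    return [(m.start() + 1, m.start() + 4) for m in CX2C.finditer(seq)]


# Hypothetical toy sequences, for illustration only.
single = find_cx2c("MKLCPECGAAC")   # one motif, Cys at positions 4 and 7
overlap = find_cx2c("CAACAAC")      # two overlapping motifs sharing Cys 4
```

A scan like this only flags candidate motifs; whether they are conserved across species still has to be judged from the alignment.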

In M.jannaschii and M.thermoautotrophicum, the PsuX gene is physically linked to the gene encoding ribosomal protein L21E and (in the case of M.thermoautotrophicum) the gene for an archaebacterial homolog of the signal recognition particle GTPase, Ffh/SRP54, as well (Fig. 7C). Furthermore, in the genome of Pyrococcus spp, genes encoding PsuX and a homolog of eukaryotic initiation factor IF2-α are arrayed in a head-to-head fashion (Fig. 7C). These observations suggest the possibility in at least some archaebacteria of co-regulation of the expression of PsuX genes and of genes encoding proteins related to translation, as suggested above for some of the archaebacterial candidate Gar1p and Nop10p homologs.

DISCUSSION

On the basis of cDNA analysis, the Euglena Cbf5p mRNA has a 71-nt 5′ untranslated leader, including a 28-nt trans-spliced leader that is two nucleotides longer at the 5′ end than previously reported for the Euglena SL (67,68). A cluster of polyadenylation sites occurs between 122 and 136 nt downstream of an inferred UAG termination codon. However, no clear polyadenylation signals are evident.

The E.gracilis Cbf5p sequence is the first reported example of a protist homolog of this key rRNA modification protein. The Euglena branch is the earliest one in a eukaryotic Cbf5p phylogenetic tree, consistent with the early divergence of Euglena in phylogenies based on rRNA sequence comparisons; however, this placement should be considered tentative given the highly biased nature of the current Cbf5p database which, with the exception of Euglena Cbf5p, consists exclusively of animal and fungal sequences. In any event, Euglena Cbf5p branches robustly within the eukaryotic sub-tree, distinct from both archaebacterial Cbf5p and eubacterial/eukaryotic TruB/Pus4p sequences. Euglena Cbf5p has all of the conserved motifs characteristic of Ψ synthases, as well as the distinctive C-terminal KKE repeat motif found only in eukaryotic Cbf5p sequences.

Our finding of a Cbf5p sequence in Euglena as well as the presence of box H/ACA snoRNAs in Tetrahymena (27) strongly indicates that protists possess a box H/ACA snoRNP-based system for rRNA processing and pseudouridylylation of rRNA. This conclusion is supported by the presence in public databases of partial protist cDNA sequences encoding box H/ACA snoRNP proteins. These protein sequences include Gar1p, Cbf5p and Nhp2p from an apicomplexan, Cryptosporidium parvum (GenBank accession nos AA532317, AA532324 and AA224694, respectively), and Nop10p from kinetoplastid protozoa, Trypanosoma spp (AA681026 and AA952384). In contrast, eubacteria use a set of site-specific (sometimes multisite-specific) rRNA Ψ synthases, namely RluA, B, C, D and E (comprising the RluA family) and RsuA (78,79,91–97).

The situation with respect to rRNA pseudouridylylation in archaebacteria is unclear at present. Based on the relatively small number of Ψ residues in the rRNA of the crenarchaeote, Sulfolobus solfataricus (98), and the apparent absence of a Gar1p homolog in archaebacterial genomes, Lafontaine and Tollervey (99) suggested that archaebacteria may have a snoRNA-independent system of Ψ formation in rRNA. On the other hand, as noted recently (79,84), genes for the eubacterial-type RluA and RsuA families of Ψ synthases are not apparent in any of the completely sequenced archaebacterial genomes. Moreover, although the LSU rRNA of Sulfolobus acidocaldarius contains only six Ψ residues (compared with 9 and 55, respectively, in E.coli and human LSU rRNA), none of these archaebacterial Ψ residues is specifically shared with eubacteria to the exclusion of eukaryotes (100); rather, three are at unique positions, two are shared with both eubacteria and eukaryotes, and one is shared only with eukaryotes. Our report of archaebacterial sequences encoding candidate homologs of Gar1p and Nop10p, in conjunction with prior evidence of archaebacterial Cbf5p and Nhp2p homologs (30,31,101,102), lends additional support to the proposition that a snoRNA-based system operates in Archaea to generate Ψ in rRNA. However, biochemical characterization of these archaebacterial proteins will be required to verify their proposed function.

Table 2 summarizes the known distribution of Ψ synthases within the three domains of life. In addition to one or more members of the site-specific RluA and RsuA sub-families of rRNA Ψ synthases, eubacteria generally contain one TruA gene and (with the exception of Mycoplasma spp and Helicobacter pylori) one TruB gene (79). The TruA and TruB Ψ synthases catalyze formation of Ψ at positions 38–40 and 55, respectively, in tRNA [in E.coli, the RluA synthase is a dual-specificity enzyme that carries out pseudouridylylation at position 32 in tRNA as well as position 746 in the LSU (23S) rRNA (91)]. The RluA- and RsuA-type synthases bear no statistically significant similarity to the Cbf5p/TruB superfamily, although they do possess short sequence motifs diagnostic of Ψ synthases in general.

Table 2. Global distribution of genes for known Ψ synthases, the TruB/Cbf5p-related protein PsuX, and box H/ACA snoRNP proteins within the three domains of life.

| Gene family | Bacteria | Archaea | Eucarya |
|---|---|---|---|
| RluA, RsuA | yes (for rRNA or rRNA/tRNA) | no apparent homologs | yes (RluA only; for organellar rRNA?[a]) |
| Cbf5p | no apparent homologs | yes | yes (for cytoplasmic rRNA) |
| Other box H/ACA snoRNP proteins (Gar1p, Nhp2p, Nop10p) | no apparent homologs[b] | yes[c] | yes |
| TruB | yes[d] (for tRNA) | no apparent homologs | yes (Pus4p) (S.cerevisiae for tRNA, mammals for ?)[e] |
| PsuX | no apparent homologs | yes[f] | no (S.cerevisiae); yes (C.elegans, D.melanogaster, mammals, plants, G.intestinalis) |
| TruA | yes (for tRNA) | yes | yes (for tRNA or tRNA/snRNA[g]) |

[a] Thought to act on mitochondrial rRNA in S.cerevisiae (79,93).

[b] Some eubacteria do have genes for homologs of eukaryotic Nhp2p (101,102).

[c] For summary of gene organization, see text and Figure 7.

[d] Homologous gene is missing in some eubacteria (see text).

[e] PsuY, a distant relative of TruB, is also encoded by the H.sapiens, D.melanogaster and C.elegans genomes (see text).

[f] For description of gene organization, see text and Figure 7.

[g] Ref. 106.

Convincing Cbf5p and TruA homologs have been identified in those archaebacterial genomes that have been completely sequenced; however, no genes that are specifically related to TruB have been reported. If, as suggested here, Cbf5p functions as an rRNA Ψ synthase in archaebacteria, these observations raise the question of how Ψ55, which is known to occur in archaebacterial tRNAs (103), is formed. This may be another situation in which a single Ψ synthase (in this case, Cbf5p or TruA) possesses dual specificity, acting on tRNA as well as rRNA, or at different sites in tRNA. Alternatively, genes for other (tRNA-specific) Ψ synthases may exist in archaebacterial genomes but be too divergent to be recognized in database searches by the methods currently available. A third possibility is that the novel PsuX family reported here plays a role in Ψ55 synthesis in tRNA, as its organismal distribution might suggest (Table 2). For example, although Pus4p orthologs are not detectable in sequenced archaebacterial genomes or in the C.elegans and D.melanogaster genome sequences, PsuX homologs are present in these genomes. Conversely, the yeast (S.cerevisiae) genome apparently lacks PsuX homologs but does encode a tRNA Ψ55 synthase (Pus4p).
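The distribution argument in the preceding paragraph can be made explicit with a small presence/absence check. This is only a sketch under stated assumptions: the dictionary records just the examples named in that paragraph (it is not an exhaustive survey and does not include the more distantly related PsuY group), and the names `distribution` and `is_complementary` are hypothetical.

```python
# Presence (True) or absence (False) of detectable homologs, as stated in the
# paragraph above. Illustrative only; limited to the genomes named there.
distribution = {
    "sequenced archaebacterial genomes": {"TruB/Pus4p": False, "PsuX": True},
    "C.elegans": {"TruB/Pus4p": False, "PsuX": True},
    "D.melanogaster": {"TruB/Pus4p": False, "PsuX": True},
    "S.cerevisiae": {"TruB/Pus4p": True, "PsuX": False},
}


def is_complementary(table):
    """True if every genome has exactly one of the two gene families."""
    return all(row["TruB/Pus4p"] != row["PsuX"] for row in table.values())


complementary = is_complementary(distribution)
```

A complementary pattern of this kind is compatible with, but does not demonstrate, a role for PsuX in Ψ55 formation; as the text notes, biochemical evidence would be needed.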

Lafontaine and Tollervey (99) have suggested that the Cbf5p gene might have originated via duplication of an ancestral TruB-like gene (probably) after the separation of the domains Eucarya and Archaea. However, because the archaebacterial ‘TruB’ homolog is more closely related to the Cbf5p class than to the TruB sub-family per se, it is likely that any duplication of a TruB-like gene would have pre-dated the postulated archaebacterial–eukaryotic split. In fact, the accumulating evidence does suggest that a transition in the rRNA Ψ synthesizing machinery may have occurred in an archaebacterial–eukaryotic common ancestor, from the simpler, site-specific, eubacterial type of RluA-RsuA Ψ synthases to the more complex, snoRNP-dependent type present in eukaryotes and (as suggested here) archaebacteria. Alternatively (but perhaps less likely in view of the complexity factor), the snoRNP-based Cbf5p Ψ synthase system may have been ancestral, with retention of this system in the archaebacterial–eukaryotic line but with a shift to an RluA/RsuA-based system in eubacteria. A third possibility is the direct evolution of a TruB-like, tRNA-specific activity into a Cbf5p-like, rRNA-specific enzyme (perhaps initially retaining specificity for tRNA as well), again presumably in an archaebacterial–eukaryotic common ancestor. Current data do not allow us to distinguish among these possibilities.

This evolutionary picture is complicated by the presence in some eukaryotes of a TruB ortholog (Pus4p in yeast) having tRNA Ψ55 specificity. It is possible that this gene traces its origin to an ancestral, duplicated TruB-like gene, with one of the duplicates diverging to give the Cbf5p family and the other retaining TruB structure and function. However, this scenario would require that the TruB gene was lost in the archaebacterial lineage but retained (as a Pus4p homolog) in the eukaryotic lineage. Also, one must account for the fact that eukaryotic Pus4p sequences are substantially more closely related to eubacterial TruB sequences than to either archaebacterial or eukaryotic Cbf5p sequences (Fig. 4). Thus, if duplication of a TruB-like gene had occurred in an archaebacterial–eukaryotic common ancestor, one of the duplicates would have had to diverge so radically that the evolutionary descendants of this duplication (Cbf5p and Pus4p in eukaryotes) no longer bear evidence of a specifically shared common ancestry. Further clouding this issue is the novel PsuX family described here, which is distantly related to both the Cbf5p and TruB subfamilies.

An alternative possibility to account for the origin of Pus4p is lateral transfer of the TruB gene, e.g. from the eubacteria-like endosymbiont that gave rise to mitochondria. This scenario would account for the fact that eukaryotic Pus4p is more similar to eubacterial TruB than to eukaryotic Cbf5p (31,84). The phylogenetic placement of Pus4p sequences (Fig. 4) is consistent with a direct eubacterial ancestry, although evidence of a specific α-proteobacterial ancestry, as expected for a mitochondrial origin (104), is not apparent. Again, this may largely be a sampling limitation, with only two full-length Pus4p sequences available at the moment. With regard to a possible endosymbiotic origin of the Pus4p gene, we note that the Pus4p protein catalyzes formation of Ψ55 in mitochondrial as well as cytosolic tRNAs in yeast (38).

After submission of this manuscript, the nearly complete genome sequence of D.melanogaster was published (105). Using E.coli Ψ synthase sequences in a BLASTP search against the database of predicted Drosophila protein sequences, we detected (at E values <10⁻⁴) two and three putative homologs, respectively, of RluA and TruA, but no RsuA counterpart. Also, in addition to the Cbf5p (Nop60Bp/minifly) and PsuX orthologs already discussed, we encountered a third Drosophila protein sharing a low level (E = 2 × 10⁻³) of sequence similarity with E.coli TruB, namely the CB7849 gene product (AAF57283.1). The latter sequence detects closely related proteins encoded by the human (AAD20059.1; E = 4 × 10⁻²⁸) and C.elegans (AAF60570.1; E = 5 × 10⁻⁹) genomes. The human homolog of this protein group (for which we propose the name ‘PsuY’) was previously detected as a TruB homolog (TRUB2/HUMAN in figure 5 of ref. 79); however, no conserved candidate TruB motif II is evident in these PsuY sequences.

ACKNOWLEDGEMENTS

We thank Dr Murray N. Schnare and Michael Charette for valuable discussion and advice, and other members of the Gray Lab for critical comment. M.W.G., who is a Fellow in the Program in Evolutionary Biology, Canadian Institute for Advanced Research, gratefully acknowledges salary and interaction support from the CIAR. This work was funded by operating grant MT-11212 from the Medical Research Council of Canada to M.W.G.

DDBJ/EMBL/GenBank accession nos AF234319, AF234320

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
