Proceedings of the National Academy of Sciences of the United States of America. 2005 May 9;102(20):7303–7308. doi: 10.1073/pnas.0502313102

Predicted highly expressed genes in archaeal genomes

Samuel Karlin, Jan Mrázek, Jiong Ma, Luciano Brocchieri
PMCID: PMC1129124  PMID: 15883368

Abstract

Based primarily on 16S rRNA sequence comparisons, life has been broadly divided into the three domains of Bacteria, Archaea, and Eukarya. Archaea is further classified into Crenarchaea and Euryarchaea. Archaea generally thrive in extreme environments as assessed by temperature, pH, and salinity. For many prokaryotic organisms, ribosomal proteins (RP), transcription/translation factors, and chaperone genes tend to be highly expressed. A gene is predicted highly expressed (PHX) if its codon usage is rather similar to the average codon usage of at least one of the RP, transcription/translation factors, and chaperone gene classes and deviates strongly from the average gene of the genome. The thermosome (Ths) chaperonin family represents the most salient PHX genes among Archaea. The chaperones Trigger factor and HSP70 have overlapping functions in the folding process, but both of these proteins are lacking in most archaea where they may be substituted by the chaperone prefoldin. Other distinctive PHX proteins of Archaea, absent from Bacteria, include the proliferating cell nuclear antigen PCNA, a replication auxiliary factor responsible for tethering the catalytic unit of DNA polymerase to DNA during high-speed replication, and the acidic RP P0, which helps to initiate mRNA translation at the ribosome. Other PHX genes feature Cell division control protein 48 (Cdc48), whereas the bacterial septation proteins FtsZ and minD are lacking in Crenarchaea. RadA is a major DNA repair and recombination protein of Archaea. Archaeal genomes feature a strong Shine-Dalgarno ribosome-binding motif more pronounced in Euryarchaea compared with Crenarchaea.

Keywords: acidic ribosomal proteins, Archaea, highly expressed proteins, thermosome


The identity of the three domains of life (1) and their relationships are controversial (2-11). Archaea form a heterogeneous clade composed of a mosaic of bacterial, eukaryotic, and unique features. Archaea and Eukarya share many homologous genes involved in information processing (replication, transcription, and translation), whereas Archaea and Bacteria share many morphological structures and metabolic proteins (10, 12). Of 19 archaeal genomes completely sequenced (Table 1, mid-2004), 4 are from Crenarchaea and 14 are from Euryarchaea. Nanoarchaeum equitans, a parasitic archaeon that lives in coculture with the archaeon Ignicoccus, has been tentatively assigned to the separate group of Nanoarchaea. Most sequenced archaea, to date, are thermophilic, generally prefer extreme environments, and are found in most ecosystems. The four sequenced crenarchaea are all hyperthermophiles (optimal growth temperature, ≥75°C), although mesophilic crenarchaea have been putatively found in pelagic waters (3, 13). Among the Euryarchaea, six are methanogens, including three mesophiles, Methanosarcina acetivorans, Methanosarcina mazei, and Methanococcus maripaludis. Halobacterium NRC-1 is also mesophilic, thriving in high salt concentrations. Most sequenced archaea, excluding methanogens (lifestyle strictly anaerobic, metabolism methanogenesis) grow both aerobically and anaerobically. Archaeal mRNAs are principally polycistronic as in bacterial genomes. Archaeal proteins involved in translation have both eukaryotic and bacterial features (14).

Table 1. Archaeal genomes.

Name      Genome size, kb   Optimal growth temp, °C   G + C content, %   No. of genes ≥80 aa
Crenarchaea
SULSO     2,992             80                        36                 2,869
SULTO     2,695             80                        33                 2,657
AERPE     1,670             90                        56                 1,783
PYRAE     2,222             100                       51                 2,290
Euryarchaea
PYRAB     1,765             96                        45                 1,786
PYRFU     1,908             96                        41                 1,941
PYRHO     1,739             96                        42                 1,828
THEAC     1,565             59                        46                 1,415
THEVO     1,585             60                        40                 1,415
PICTO     1,546             60                        36                 1,473
ARCFU     2,178             83                        49                 2,214
METKA     1,695             110                       61                 1,606
METTH     1,751             65                        50                 1,735
METJA     1,665             85                        31                 1,635
METMP     1,661             37                        33                 1,630
METAC     5,751             37                        43                 4,252
METMA     4,096             37                        41                 3,147
HALSP     2,014             37                        68                 1,880
Nanoarchaea
NANEQ     491               90                        32                 515

Most archaea have genomes of moderate size, ranging from ≈1.5 to 3.0 Mb. The methanogens are of variable size, with the two mesophilic Methanosarcina species especially large, exceeding 4 and 5.7 Mb. SULSO, Sulfolobus solfataricus; SULTO, Sulfolobus tokodaii; AERPE, Aeropyrum pernix; PYRAE, Pyrobaculum aerophilum; PYRAB, Pyrococcus abyssi; PYRFU, Pyrococcus furiosus; PYRHO, Pyrococcus horikoshii; THEAC, Thermoplasma acidophilum; THEVO, Thermoplasma volcanium; PICTO, Picrophilus torridus; ARCFU, Archaeoglobus fulgidus; METKA, Methanopyrus kandleri; METTH, Methanobacter thermoautotrophicus; METJA, Methanocaldococcus jannaschii; METMP, Methanococcus maripaludis; METAC, Methanosarcina acetivorans; METMA, Methanosarcina mazei; HALSP, Halobacterium sp. NRC-1; NANEQ, Nanoarchaeum equitans; temp, temperature.

The objectives of this work are to identify and analyze the major predicted highly expressed (PHX) genes with respect to codon usage biases among the archaeal genomes. Our analyses of bacterial genomes support the hypothesis that each species has evolved codon usage patterns promoting “optimal” gene expression levels for most circumstances of its habitat, energy sources, and lifestyle (15, 16). Codon bias is often different at the start of a gene compared with the central or terminal part of the gene (17, 18). Different selection pressures are imposed by the constraints of ribosomal binding and translation fidelity. Protein folding is possibly correlated with codon usage (19, 20). According to the rare codon hypothesis for domains and secondary structures, repetition of rare codons reduces translation rate and introduces translation pauses, allowing time for protein domains and secondary structures to fold into native conformations. However, there appear to be subtle differences in bacterial and eukaryotic translation mechanisms, e.g., the role of chaperonins in bacteria vs. eukaryotes and the important activity of cotranslational folding in eukaryotes but not in prokaryotes. Generally, PHX genes in bacterial genomes rely on favorable codon usages, tend to possess a strong Shine-Dalgarno (SD) sequence (21), and putatively possess a strong promoter sequence. The substantial variability of G+C composition within mammalian genomes (isochores) may complicate predicting gene expression levels from codon usages. In contrast, the nucleotide compositions of bacterial genomes are largely homogeneous. Gene expression in prokaryotes is regulated at initiation and termination of transcription and translation and by different rates of transcription and translation, differential mRNA stabilities, segmental stability differences in polycistronic messages, codon preferences, and interactions with chaperones and other proteins.

The thermosome chaperones are outstandingly PHX genes (Table 2) consistent with the extreme environmental conditions to which these species have adapted. General comparisons of bacterial vs. archaeal genomes and corresponding discussion of the genomic and proteomic content of the Saccharomyces cerevisiae and Drosophila melanogaster genomes are set forth in our companion paper (2).

Table 2. Predicted expression values E(g) for distinctive genes of archaeal genomes.

Value SULSO SULTO AERPE PYRAE PYRAB PYRFU PYRHO THEAC THEVO PICTO ARCFU METKA METTH METJA METMP METAC METMA HALSP NANEQ
No. of PHX genes, % 601 (21) 529 (20) 299 (17) 84 (3.7) 163 (9.1) 160 (8.2) 139 (7.6) 172 (12) 122 (8.6) 214 (15) 301 (14) 255 (16) 168 (10) 128 (7.8) 162 (10) 505 (12) 321 (10) 335 (18) 36 (7.0)
Max E(g) 1.40 1.51 1.50 1.21 1.73 1.67 1.41 1.30 1.18 1.57 1.47 1.62 1.39 1.69 2.09 2.00 1.95 1.49 1.21
Ths 1.32 1.38 1.60 1.21 1.52 1.42 1.32 1.26 1.13 1.57 1.47 1.47 1.39 1.57 1.90 1.71 1.82 1.36 1.20
1.32 1.29 1.28 1.15 1.13 1.08 1.57 1.34 1.34 1.60 1.73 1.35
1.02 1.20 etc N* 1.26
DnaK 1.19 1.00 1.39 1.25 1.70 1.85 1.09
Psm 1.07 1.17 1.04 1.03 1.17 1.12 1.00 0.94 1.02 1.16 1.16 1.10 0.93 0.95 0.95 1.38 1.37 1.20 1.09
1.04 1.12 1.01 1.02 0.96 0.88 0.88 0.94 0.88 0.93 1.00 0.80 0.93 0.93 0.89 0.97 1.28 1.16
1.04 1.04 0.98 0.97 0.63 0.76 0.79 0.91
Hsp20 1.06 1.00 0.91 1.03 1.18 1.19 1.20 1.08 1.07 1.38 1.24 1.13 1.31 0.80 1.20 0.95 1.10
0.90 0.98 0.94 0.96 1.03 0.93 1.10 0.96 0.95
etc N etc N
Pfd β 0.94 1.23 1.15 0.91 1.32 0.79 1.20 0.82 0.88 0.96 1.25 0.92 1.02 0.96 1.13 0.83 0.83 0.99 0.97
Pfd α 0.81 1.09 1.07 0.77 0.96 0.98 1.18 1.01 0.99
PCNA 0.94 1.08 1.07 1.00 1.16 0.80 0.90 0.84 0.98 0.88 1.12 1.08 1.16 0.90 0.94 0.86 0.93 1.24 1.00
0.87 0.94 0.95
0.91
rp P0 1.11 1.26 1.01 1.00 0.98 1.25 1.23 0.91 0.98 1.37 0.91 0.93 0.95 1.28 1.69 1.56 1.57 1.09 0.94
Cdc48 1.18 0.98 1.01 1.01 1.16 1.19 1.17 1.11 1.09 1.39 1.37 0.82 1.08 1.54 0.68 1.43 1.02 1.08 1.21
1.15 0.88 0.95 0.97 0.66 0.89 1.13 1.26 1.16 1.07 0.80
etc N 0.60 0.71
Cdc6 1.20 0.83 1.12 0.70 0.70 0.81 0.89 0.87 0.83 0.72 0.82 0.91 0.86 1.06 0.87
0.89 1.05
etc P
AhpC/Bcp 1.08 1.12 1.16 0.95 1.32 1.20 1.11 1.14 1.08 1.26 1.14 0.92 0.83 0.76 0.87 0.87 0.97 1.09
0.95 1.11 1.14 0.94 0.97 1.06 1.06 1.18 0.78 0.81 0.80 0.77
etc P etc N 0.93 0.80 etc N etc N etc P
SecY 1.14 1.18 1.10 1.08 0.93 0.83 0.86 1.19 1.04 1.30 0.81 0.83 0.95 0.72 0.75 0.95 0.80 0.75
RadA 1.18 0.94 1.28 0.96 0.71 0.96 0.80 1.11 0.94 1.09 0.85 0.80 0.76 0.69 1.28 1.22 1.04 0.88 0.98
EF-1α (Tuf) 1.19 1.28 1.36 1.09 1.41 1.47 1.31 1.20 1.18 1.44 1.38 1.45 1.25 1.42 1.88 1.71 1.59 1.49 1.14
EF-2 (Fus) 1.17 1.35 1.29 1.06 1.70 1.55 1.31 1.13 1.07 1.45 1.21 1.62 1.23 1.56 2.04 1.85 1.95 1.32 1.09
rpoA 1.30 1.21 1.12 0.99 1.35 1.03 1.18 1.17 0.99 1.20 1.12 1.47 1.13, 0.83 1.14 1.50 2.00 1.90 0.75 1.05
1.08 1.02 0.91 0.95 0.83 0.80 0.94 1.16 1.04 0.97 0.97 1.12 0.82 0.80 0.89 1.27 1.29 0.87 1.02
rpoB 1.13 1.51 1.15 0.89 1.23 1.04 1.21 1.20 1.06 1.52 1.06 1.33 1.28 0.94 1.58 1.35 1.17 0.73 0.89
1.13 1.07 1.22 1.16 0.86 1.00 1.34 1.41 0.84 1.00

Bold indicates PHX; regular type indicates not PHX. Ths, any thermosome subunit; Psm, proteasome subunit; Hsp20, small heat shock protein, eye crystalline structure in higher eukaryotes; Pfd β, prefoldin β subunit; Pfd α, prefoldin α subunit; PCNA, proliferating cell nuclear antigen; rp P0, ribosomal protein P0; Cdc48, cell division control protein 48; Cdc6, replication initiation; AhpC/Bcp, alkyl hydroperoxide reductase, bacterioferritin comigratory protein, peroxiredoxin or thioredoxin peroxidase; SecY, protein translocase; RadA, DNA repair and recombination protein; EF-1α, elongation factor 1α; EF-2, elongation factor 2; rpo, DNA-directed RNA polymerase subunit; etc N, additional copies not PHX; etc P, additional copies PHX; —, gene not present in genome. HSP60 (GroEL) is PHX in two archaeal genomes, METAC and METMA, with E(g) of 1.35 and 1.38, respectively.

*Three copies of Ths in METAC are not PHX.

Methods

Highly expressed genes are characterized on the basis of biased codon usages compared with the average gene (15). For most bacteria during exponential growth, many genes encoding ribosomal proteins (RP), the principal transcription/translation synthesis factors (TF), and the major chaperone/degradation genes (CH) functioning in protein folding/unfolding and trafficking tend to be highly expressed. We consider these gene classes (restricted to genes of ≥80 codons) as representative of highly expressed genes. In this purview, a gene is PHX if its codon frequencies are similar to the average of any of these gene classes but deviate strongly from those of the average gene of the genome. The codon usage bias of a gene group F with respect to a gene group G is calculated by the formula

\[
B(F|G) = \sum_{a} p_a(F) \left[ \sum_{(x,y,z)=a} \left| f(x,y,z) - g(x,y,z) \right| \right]
\]

where {p_a(F)} are the average amino acid frequencies of the genes of F, and f(x, y, z) and g(x, y, z) are the codon frequencies of F and G genes, respectively, normalized to one for each amino acid family. Predicted expression levels with respect to the RP, TF, or CH standards can be based on the ratios E_RP(g) = B(g|C)/B(g|RP), E_CH(g) = B(g|C)/B(g|CH), and E_TF(g) = B(g|C)/B(g|TF), where C is the totality of all genes of the genome. An overall estimate E(g) of the expression level of a gene g is defined by the equation B(g|C)/E(g) = (1/3)[B(g|RP) + B(g|TF) + B(g|CH)]. For archaeal genes, the criterion E(g) > 1 with at least one (two for bacteria) of the values E_RP(g), E_TF(g), or E_CH(g) exceeding 1.05 generally provides a consistent benchmark reflecting high protein abundance. The concept of PHX in most bacterial genomes was justified by independent measurements of gene expression (16). For most bacterial genomes, the codon usage differences among the functional gene classes RP, TF, and CH tend to be low and concordant (see ref. 15). However, not all genes of the classes RP, TF, and CH are automatically PHX in Bacteria or Archaea. In this respect, the RP gene class of Archaea is the most variable, whereas the CH and TF gene classes are more coherent and consistent. There is good agreement of our determinations of PHX protein abundances with assessments by 2D gel electrophoresis displacements (e.g., ref. 16).
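The definitions above can be illustrated with a minimal Python sketch. It is not the authors' software: the function names, the use of the standard genetic code, the pooling of codon counts over a gene class, and the handling of zero denominators are all assumptions made for illustration. Coding sequences are assumed to be in-frame DNA strings.

```python
from collections import defaultdict

# Standard genetic code in TCAG order (assumption: no alternative code is needed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codon_counts(genes):
    """Pool sense-codon counts over a list of in-frame coding sequences."""
    counts = defaultdict(int)
    for seq in genes:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if CODON_TO_AA.get(codon, "*") != "*":  # skip stops and ambiguous codons
                counts[codon] += 1
    return counts

def family_frequencies(counts):
    """Return codon frequencies normalized to one per amino acid family,
    together with the amino acid frequencies p_a."""
    aa_totals = defaultdict(int)
    for codon, n in counts.items():
        aa_totals[CODON_TO_AA[codon]] += n
    total = sum(aa_totals.values())
    codon_freq = {c: n / aa_totals[CODON_TO_AA[c]] for c, n in counts.items()}
    aa_freq = {a: n / total for a, n in aa_totals.items()}
    return codon_freq, aa_freq

def codon_bias(F, G):
    """B(F|G) = sum_a p_a(F) * sum_{codons of a} |f(codon) - g(codon)|."""
    f, p = family_frequencies(codon_counts(F))
    g, _ = family_frequencies(codon_counts(G))
    return sum(
        p.get(aa, 0.0) * abs(f.get(codon, 0.0) - g.get(codon, 0.0))
        for codon, aa in CODON_TO_AA.items()
        if aa != "*"
    )

def expression_levels(gene, genome, rp, tf, ch):
    """Return (E(g), {class: E_class(g)}) for one gene against the three standards."""
    b_c = codon_bias([gene], genome)
    b = {name: codon_bias([gene], cls) for name, cls in (("RP", rp), ("TF", tf), ("CH", ch))}
    # Assumption: a zero bias against a standard is reported as infinite expression.
    e_class = {k: (b_c / v if v > 0 else float("inf")) for k, v in b.items()}
    mean_b = sum(b.values()) / 3
    e = b_c / mean_b if mean_b > 0 else float("inf")
    return e, e_class

def is_phx(e, e_class, min_classes=1):
    """Archaeal criterion from the text: E(g) > 1 and at least one class ratio
    above 1.05 (min_classes=2 corresponds to the bacterial criterion)."""
    return e > 1.0 and sum(v > 1.05 for v in e_class.values()) >= min_classes
```

A call such as `expression_levels(gene, all_genes, rp_genes, tf_genes, ch_genes)` followed by `is_phx(...)` reproduces the decision rule on toy data. Real use would also need the ≥80-codon filter applied to the gene classes, which is left out here for brevity.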

Results and Discussion

Distinctive Proteins of Archaeal Genomes. The thermosome subunits (Ths) are among the most PHX throughout the archaeal domain and are almost always essential (22). Archaea generally live in extreme environments that are likely to affect the integrity of their proteins, nucleic acids, and membranes. Mutational and other disturbances conceivably may be alleviated by an abundance of chaperone and degradation proteins, including thermosome, prefoldin, and the proteasome complex assisted by repair, recombination, and replication enzymes (22, 23). Ths is pervasively PHX in archaeal genomes at a very high predicted expression level (Table 2). Ths also has been investigated experimentally and confirmed to be especially abundant in Sulfolobus species, accounting for up to 20% of the cellular protein content (24-26). DnaK (HSP70) is found, so far, only in archaeal mesophiles or in moderate thermophiles (27), where it is PHX. The heat-inducible Lon protease is absent from the Crenarchaea but usually PHX among the Euryarchaea (see Table 5, which is published as supporting information on the PNAS web site). Archaeal genomes are also distinguished by their proteasome subunits. The chaperone prefoldin (Pfd) β-subunit is present in all Archaea, whereas the α-subunit is lacking in Crenarchaea and Thermoplasma genomes (Tables 2 and 5).

The replication protein PCNA (proliferating cell nuclear antigen) is present, and mostly PHX, in all archaea and eukaryotes but is absent from bacteria. There are multiple copies (two or three) of PCNA in the crenarchaeal genomes (see Table 2 and ref. 28). Moreover, PCNA is widely distributed in eukaryotes, where it functions in association with DNA polymerase δ, enhancing processivity in elongation of the leading strand during DNA replication.

The regulatory RP P0, P1, and P2, prominent in eukaryotes, feature a hyperacidic amino acid run proximal to their carboxyl termini. These proteins are missing from bacterial genomes, and only P0 is present in archaeal genomes. The RPs L7/L12 and S1 of bacterial genomes, neither of which is similar to P0, also emphasize acidic residues.

There is evidence that the cell division control protein 48 (Cdc48), ubiquitous in archaeal genomes, functions in cell division and growth processes. In Arabidopsis, Cdc48 is localized primarily to the nucleus. In other eukaryotes, this protein is mainly plasma membrane-associated. In Archaea, the genes of ORC (origin recognition complex) and Cdc6, analogous to DnaA of Bacteria, are deemed responsible for replication initiation (28).

RadA of archaea is similar to RecA of bacteria and Rad51 of eukaryotes. They all bind single-stranded DNA with the same stoichiometry and exhibit well-conserved Walker-A and -B ATP binding motifs (29). These proteins are important in recombination, mutagenesis, transposition, and repair, and are made in response to general DNA damage. DNA repair pathways involve direct chemical reversal of DNA damage, base excision repair, and nucleotide excision repair (29).

Small Hsps (average 150 aa), mostly Hsp20, are abundant among archaeal genomes, often in multiple copies (see Table 5), but are variably represented among bacterial genomes. The small Hsps are not present in approximately half of the current collection of bacterial genomes. For example, they are absent from Lactococcus lactis, Streptococcus pyogenes, Streptococcus pneumoniae, Listeria innocua, Haemophilus influenzae, Pasteurella multocida, Neisseria meningitidis, Helicobacter pylori, Campylobacter jejuni, Chlamydia trachomatis, Treponema pallidum, Borrelia burgdorferi, Mycoplasma genitalium, and others. However, the Hsp20 is ubiquitous in most eukaryotes, and in higher eukaryotes it is involved in the structural determination of the eye crystalline.

The translation elongation proteins EF-1α (Tuf) and EF-2 (Fus) and the RNA polymerase subunits RpoA, RpoB, and RpoC (Table 2) are, as in bacterial genomes, predominantly PHX in Archaea. These genes are generally present in multiple copies among bacteria but are represented by a single copy in most archaea (Table 2). Archaeal RPs contain a mixture of bacterial, eukaryotic, and some unique RPs (28, 30). Many archaea employ both eukaryotic and bacterial mechanisms in translation initiation (14). Archaeal mRNAs have no 5′ CAP structure or 3′ poly(A) appendages, but to some extent they engage a bacterial SD translation initiation motif (Table 3).

Table 3. Shine-Dalgarno (SD) sequences in archaeal genomes.

Name Anti-SD* sequence SD%
SULSO AUAUCACCUCAU 22.9
SULTO AUCACCUC 20.2
AERPE AUCACCUCC 38.8
PYRAE AUCACCUCC 23.8
PYRAB AUCACCUCCUAU 71.9
PYRFU AUCACCUCCUAU 69.8
PYRHO AUCACCUCCUAU 54.9
THEAC AUCACCUCC 24.6
THEVO AUCACCUCCAA 35.7
PICTO AUCACCUCCU 30.5
ARCFU AUCACCUCCUAA 47.0
METKA AUCACCUCC 70.9
METTH AUCACCUCCU 60.5
METJA AUCACCUCCU 54.4
METMP AUCACCUCCU 71.8
METAC AUCACCUCCUAA 48.6
METMA AUCACCUCCUAA 52.1
HALSP AUCACCUCCUAA 26.3
NANEQ AUCACCUCCU 7.5
*Bold indicates the core anti-SD sequence. See Table 1 for complete names of genomes.

SD% is defined as the fraction of genes (≥80 aa) in a given genome that possesses a SD motif (for details, see ref. 21). The anti-SD sequence at the 3′ end of the 16S rRNA binds to the SD motif of a gene when available to initiate translation. In bacterial genomes, the consensus anti-SD sequence is AUCACCUCCUUU, although archaeal genomes show some variation in their anti-SD sequence around the conserved core CCUCC.
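As a rough illustration of the SD% statistic, the sketch below counts the fraction of genes whose upstream region contains a sequence complementary to the conserved anti-SD core CCUCC. The actual detection criteria are those of ref. 21 and are not reproduced here; the exact-match rule, the function names, and the assumption that each gene is represented by a short upstream string are all hypothetical simplifications.

```python
def reverse_complement_rna(seq):
    """Reverse complement of an RNA string (A, C, G, U only)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def has_sd_motif(upstream, anti_sd_core="CCUCC"):
    """True if the upstream region contains the exact complement of the
    anti-SD core (assumption: exact match, no mismatches or energy model)."""
    motif = reverse_complement_rna(anti_sd_core)  # complement of CCUCC is GGAGG
    return motif in upstream.upper().replace("T", "U")

def sd_percent(upstream_regions):
    """Percentage of genes whose upstream region carries the motif."""
    if not upstream_regions:
        return 0.0
    hits = sum(has_sd_motif(u) for u in upstream_regions)
    return 100.0 * hits / len(upstream_regions)
```

Because this rule ignores partial matches and spacing to the start codon, values obtained this way should not be expected to match the SD% column of Table 3.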

There is debate on the time of the origin of aerobic respiration (e.g., refs. 11 and 31-34). A substantial increase of oxygen in the atmosphere accompanied the evolution of cyanobacterial lineages and associated oxygenic photosynthesis in the time epoch at ≈3,200-2,500 million years ago (31, 34). However, there is evidence showing some oxygen was present earlier than 3,500 million years ago (35-37) and influential in the evolution of life (31, 32). Methanogenesis is considered a later development in archaeal evolution (11). Onset of methane cycling putatively started ≈2,700 million years ago, whereas oxygenic photosynthesis evolved earlier (32). All archaeal genomes except for methanogens carry many PHX detoxification proteins that protect against oxygen stress, generally including alkyl hydroperoxide reductase, bacterioferritin comigratory proteins, thioredoxin, and superoxide dismutase (Table 2), suggesting that these prokaryotes were selected to survive under moderate aerobic conditions. These observations may indicate that archaea, especially methanogens in their present form, are probably not so ancient or were converted to an oxygen-tolerant variant. The detoxification gene thioredoxin also catalyzes and removes disulphide bonds during protein folding.

Chaperone Proteins in Archaea. Because chaperones, especially thermosomes, are potently PHX in Archaea, we elaborate more on these classes of proteins. Chaperones play pivotal roles in protein folding, degradation of misfolded proteins, proteolysis, secretion, trafficking across membranes, facilitating the assembly of macromolecular structural complexes (22), and archaeal membrane stabilization (26). Molecular chaperone systems that promote the correct folding of nascent or misfolded proteins have evolved in all domains of life.

The ATP-regulated HSP70 (DnaK) together with its cofactors DnaJ and GrpE and the ATP-independent Trigger factor (Tig) act posttranslationally in folding. Tig is only present in bacteria and generally is PHX. Tig is presumably substituted by nascent polypeptide-associated complex (NAC) in eukaryotes (38) and possibly in Archaea (22, 38). DnaK (HSP70) is ubiquitous in eukaryotes and bacteria, often with multiple copies, but is missing from most archaea (see Tables 2 and 5). Tig and DnaK are demonstrated to cooperate in the folding of newly synthesized proteins (39). Simultaneous deletion of both Tig and DnaK in bacteria is lethal under normal growth conditions (40). The archaeal HSP70 sequences, like those of Gram-positive bacteria, are missing a 23- to 25-aa segment present in those of Gram-negative bacteria (4, 5). All archaea and eukaryotes contain the molecular chaperone Pfd (subunit β) (Table 2), which has not been recognized in Bacteria. The crenarchaea do not contain the α subunit (Table 2). Pfd is considered to perform HSP70-like functions (41), although the sequences and structures of these proteins are substantially different.

The chaperonin complex (HSP60) assists protein folding in a cavity, where nonnative polypeptides, usually 30-70 kDa in size, are enclosed and protected against intermolecular aggregation (for reviews, see refs. 38 and 42). Two groups of chaperonins are distinguished. With occasional exceptions, Group I embodies GroEL of bacteria, mitochondria, and chloroplasts, whereas Group II features thermosomes in Archaea and TRiC (also labeled CCT) in eukaryotes. The thermosome genes are potently PHX in Archaea (Table 2), whereas the chaperonin (Cpn) and its cochaperonin (GroEL/GroES) are highly expressed in virtually all bacterial genomes (43). GroEL/GroES are lacking in five of the nine mycoplasmas sequenced so far (data not shown). The GroES lid of the chaperonin complex is missing from archaea and eukaryotes, wherein helical protrusions supplant GroES (22). The HSP60s in all three domains are purified from cells as double-ring complexes. In Bacteria, each ring of GroEL is composed of seven HSP60 subunits, whereas in Archaea each ring embodies eight or nine HSP60 subunits. Some archaeal rings are formed from identical subunits, whereas in others there are two subunit types; Sulfolobus species contain three subunit types. Yeast contains at least 11 distinct CCT genes. It is observed that the eukaryotic ring structure contains six to nine different subunits in a variety of arrangements (22).

Functional regions inferred from mutational studies and the E. coli GroEL 3D crystal structure (44, 45) have been evaluated in a multiple alignment across 43 HSP60 sequences selected from diverse genomes, centering on ATP/ADP and Mg2+ binding sites, residues interacting with substrate, GroES contact positions, interface regions between monomers and domains, and residues important in allosteric conformational changes (46). The most evolutionarily conserved residues relate to the ATP/ADP and Mg2+ binding sites. Hydrophobic residues that contribute to substrate binding also are significantly conserved. A large number of charged residues line the central cavity of the GroEL/GroES complex in the substrate-releasing conformation. Charged residues also span intramonomer and intermonomer 3D charge clusters (47) that are highly conserved among sequences and can play an important role interacting with the substrate.

Duplicated HSP60 sequences stand out among the classical α-proteobacteria, in contrast to the absence of HSP60 duplications in other proteobacterial clades (48). Multiple HSP60 paralogs also exist in cyanobacteria, in Chlamydia, and in high G+C Gram-positive bacteria. Many amitochondrial eukaryotes, including Trichomonas vaginalis, Giardia lamblia, and Entamoeba histolytica, contain two or more HSP60 copies. Plastids carry multiple copies of HSP60 that bind Rubisco. Specialized complex structures in cells often need their own “dedicated” chaperones (e.g., ref. 49).

Peptidyl-prolyl cis-trans isomerase (PPIase) in prokaryotes and protein disulfide isomerase (PDI) in prokaryotes and eukaryotes are generally present in multiple copies. E. coli has at least nine PPIases defined by sequence similarity. One of these, the survival protein SurA, promotes the folding of periplasmic and outer membrane proteins. As expected, SurA does not exist in Gram-positive bacteria. DegP is a chaperone/degradation factor that is significantly PHX and acts primarily in degrading misfolded proteins in the periplasm. PapD/FimC are other periplasmic chaperones, and disulfide oxidase proteins are cytoplasmic chaperones. Tig exhibits PPI activity in vitro and contains a PPI motif of the cyclophilin family in most bacteria, binds at the ribosome polypeptide tunnel exit, and cooperates with DnaK during de novo protein folding (39-42). In this respect, NAC substitutes for Tig in eukaryotes and possibly in archaea (38). Tig interacts with the ribosomal protein L23 at the ribosomal tunnel exit (50), where it helps to stabilize and organize nascent translated polypeptides in bacteria. Tig recognizes short hydrophobic stretches, whereas DnaK binds to longer peptides. Nevertheless, Tig and DnaK have overlapping functions in the folding process where Tig implements the initial chaperone interactions with ribosomes in bacteria. In yeast, Ssb (HSP70) is associated with ribosomes and generally contributes in posttranslational protein assembly and protein translocation across membranes (42). Prefoldin generally has six distinct subunit types in eukaryotes, at most two subunit types (α and β) in archaea (Table 2), and is absent from bacteria (22). Prefoldin may assist de novo protein folding when interacting with CCT (41). HSP90 (HtpG) is widely distributed in Bacteria but is absent from Archaea and also from the genomes of the deeply branching bacterial species Aquifex aeolicus and Thermotoga maritima (22). HtpG plays a role in stress tolerance.

Among the 19 archaeal genomes of Table 1, only 7 possess a DnaK gene, all of which are PHX. These seven archaea are either mesophiles or moderate thermophiles with optimal growth temperature of ≤65°C (27). However, the mesophile M. maripaludis (see Tables 2 and 5) is lacking a HSP70 gene. It has been suggested that the presence of HSP70 genes in the seven archaeal genomes is the consequence of lateral gene transfer (27, 51). The gene may have been subsequently lost in M. maripaludis. Nine genomes of the bacterial mycoplasma species have been entirely sequenced, and each has a PHX DnaK gene; in contrast, GroEL is missing from five of these genomes. By comparison, the eight thermophilic bacteria completely sequenced (Table 4) (five moderate thermophiles and three hyperthermophiles) all encode at least one DnaK gene persistently PHX.

Table 4. Thermophilic bacteria.

Name    Group                 DnaK copies   OGT, °C   G+C, %   E(DnaK)      24- to 30-bp repeats*
SYMTH   Firmicutes            1             51        68.7     1.27         +
THETH   Deinococcus-Thermus   1             70-75     69.4     1.27         + (4 copies)
THEEL   Cyanobacteria         3             55        53.9     dnak1 0.76
                                                               dnak2 1.38
                                                               dnak3 0.86
THETE   Firmicutes            1             80        37.6     1.28         +
STRTH   Firmicutes            1             42        39.1     2.10         +
THEMA   Thermotogales         1             80        46.3     1.38         +
AQUAE   Aquificales           1             90        43.3     1.36         +
CHLTE   Chlorobiales          1             48        56.5     1.05         +

SYMTH, Symbiobacterium thermophilum; THETH, Thermus thermophilus; THEEL, Thermosynechococcus elongatus; THETE, Thermoanaerobacter tengcogensis; STRTH, Streptococcus thermophilus; THEMA, Thermotoga maritima; AQUAE, Aquifex aeolicus; CHLTE, Chlorobium tepidum.

*The nature of the 24- to 30-bp repeats is discussed in ref. 2. These are lacking in T. elongatus and involve only four copies in the T. thermophilus genome.

The crenarchaea have no HSP70 representations (Table 2). It seems that the chaperone Pfd in archaea can substitute for HSP70, whereas in bacteria, the Tig gene (missing from archaea) functions cooperatively with or substitutes for DnaK.

“Hybrid” Bacterial and Archaeal Genomes. The genomes of the mesophilic methanogens Methanosarcina acetivorans and M. mazei (Table 1) might be regarded as hybrids of bacterial and archaeal species. They both contain multiple copies of Ths and one GroEL gene, all PHX. We speculate that the acquisition of GroEL in these genomes is due to lateral gene transfer. The same applies to the cyanobacterium Gloeobacter violaceus, which contains simultaneously Ths and GroEL and the recombination repair proteins RecA and RadA. G. violaceus is also similar to Archaea in expressing multiple detoxification PHX bacterioferritin comigratory proteins and several Hsp20s.

SD Sequences in Archaea. In Bacteria, a strong correlation is observed between high predicted gene expression levels and the presence of a SD sequence motif, which plays an important role in translation initiation (21). SD motifs do not exist in eukaryotes. Initiation is generally considered the rate-limiting step of translation, which in many bacteria involves interactions between a SD sequence immediately upstream of the start codon in the mRNA and an anti-SD sequence at the 3′ end of the 16S rRNA (reviewed in refs. 52 and 53). The consensus SD sequence features at its core the purine run AGGAGG, generally traversing positions -10 to -5 relative to the start codon, and the 16S rRNA gene usually carries the anti-SD sequence CACCTCCTTT at its 3′ end. The PHX genes, as compared with genes with an average expression level, are significantly more likely to possess a strong SD motif (21). This positive correlation between strong SD signal sequences and PHX genes can be found in almost all bacterial and archaeal genomes (ref. 21 and Table 3). The Crenarchaea and Thermoplasma have many leaderless transcripts and are low in SD motifs.
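The complementarity between the anti-SD sequence and the SD core, and the position convention used above, can be checked with a short Python sketch. The helper names and the convention that the upstream string ends immediately before the start codon (so that its last base is position -1) are assumptions for illustration only.

```python
ANTI_SD_DNA = "CACCTCCTTT"  # 3' end of the 16S rRNA gene, as quoted in the text
SD_CORE = "AGGAGG"          # purine-rich core of the consensus SD sequence

def reverse_complement_dna(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def locate_sd_core(upstream, core=SD_CORE):
    """Return (first, last) positions of the core relative to the start codon,
    or None if the core is absent. The upstream string is assumed to end
    immediately before the start codon, so its last base is position -1."""
    upstream = upstream.upper()
    i = upstream.rfind(core)  # occurrence closest to the start codon
    if i < 0:
        return None
    first = i - len(upstream)
    return first, first + len(core) - 1

# The sequence that pairs with the anti-SD contains the AGGAGG core.
assert SD_CORE in reverse_complement_dna(ANTI_SD_DNA)
```

For a toy upstream region such as `"TTTTAGGAGGTTTT"`, `locate_sd_core` returns `(-10, -5)`, which corresponds to the typical placement described in the text.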

Conclusions and Prospects

Several authors underscore processes of lateral gene transfer and the archaeal origin of eukaryotic genes (8, 10, 11, 51). Many conspicuous genes of Archaea, e.g., Ths, PCNA, P0, Cdc48, and Pfd (Table 2), are missing from bacterial genomes but distributed profusely in eukaryotes. However, several genomic features are common to all prokaryotes. These include operon gene organization, the SD motif that provides control on mRNA translation initiation, and the presence of several clusters of RP. The most pronounced PHX genes are the thermosome chaperones (distant homologs of GroEL) and the chaperone prefoldin, which is hypothesized to substitute for the activities of Trigger factor and HSP70 (22, 38, 41). Based on the Clusters of Orthologous Groups database (www.ncbi.nlm.nih.gov/COG, 2004), all bacterial and archaeal genomes share only 67 genes, of which 30 are ribosomal protein genes, 14 are aminoacyl-tRNA synthetases, and several are major protein synthesis and processing factors.

Cavicchioli et al. (54) address the question of whether there are pathogenic Archaea. They point out interactions of archaea with eukaryotes, note methanogenic inhabitants of the human oral cavity and intestinal tract, and call attention to many human diseases whose causative agents have not been identified. For these reasons and others, they effectively suggest that pathogenic archaea will be identified in time as more studies on archaeal species are conducted. Martin (55; see also ref. 56), in his reactions, seems doubtful on biochemical grounds whether Archaea are natural agents capable of parasitizing mammalian environments.

Archaeal-bacterial endosymbiosis and other relations have been proposed to explain the genesis of eukaryotes and their organelles (57-63). Along these lines, archaeal-bacterial partnerships have been conceived preceding the origins of eukaryotes. It is increasingly appreciated that the genomes of many prokaryotes and primitive eukaryotes are “heterogeneous unions” in which lateral gene transfer and/or close associations have been at work (64-67).

Supplementary Material

Supporting Table

Acknowledgments

We thank Profs. A. Campbell and D. Kaiser and Dr. J. Trent of NASA Ames Research Center for helpful comments on the manuscript. S.K. was supported in part by National Institutes of Health Grant 5R01GM10452-40.

Abbreviations: CH, chaperone/degradation genes; PHX, predicted highly expressed; RP, ribosomal proteins; SD, Shine-Dalgarno; TF, transcription/translation synthesis factors.

