The evolution of vertebrate Toll-like receptors -- Data Supplement -- Proceedings of the National Academy of Sciences

Supporting Text
Sequencing and Assembly

All Takifugu rubripes sequences used in the draft assembly and for finishing were derived from a single fish, unless otherwise stated. Thus, possibly one but probably two alleles were included in the assembly of each finished contig. Putative loci coding for TLRs were identified by tblastn or tswn, an algorithm for the Paracel GeneMatcher. The complete coding sequence, as well as some 5' and 3' sequences, was finished for each Takifugu Toll-like receptor (TLR) locus by a local reassembly of reads retrieved from the National Center for Biotechnology Information trace archive, additional in-house finishing reads, and/or sequences derived from shotgunning bacterial artificial chromosomes (BACs) or cosmids, as described in GenBank accession nos. AC156430-AC156440.

Bioinformatics

To be considered a TLR for our analysis, a sequence had to either (i) contain both an N-terminal leucine-rich repeat (LRR) domain and a C-terminal Toll/interleukin-1 receptor (TIR) domain or (ii) have an LRR domain that unambiguously had a best reciprocal match to a previously identified TLR. The second criterion allows the inclusion of soluble TLRs (e.g., BAC65467) and fragments from incomplete genomic pairwise assemblies. To form the basis of multiple alignments and molecular trees (our "core" analysis), vertebrate sequences from the nonredundant DDBJ/EMBL/NCBI database (GenBank) were identified no later than January 2005 by blast similarity to known TLRs (Table 1); the newly deposited zebrafish TLR3 was added at the request of a reviewer in April 2005. In cases where multiple nearly identical sequences were present in GenBank, some were not included in Table 1 or used for further analysis, with a preference for retaining RefSeqs (www.ncbi.nlm.nih.gov/RefSeq). For example, KIAA0012 was not included, because NP_003254 was considered adequate representation for human TLR1. Sequence data for some ESTs were obtained from The Institute for Genomic Research web site (www.tigr.org) (Table 1).
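The two inclusion criteria can be restated as a simple predicate. The sketch below is illustrative only: the function name and its boolean arguments are hypothetical and stand in for whatever domain search and reciprocal best-match procedure is actually used to derive them.

```python
def is_tlr(has_lrr: bool, has_tir: bool, lrr_reciprocal_best_is_tlr: bool) -> bool:
    """Return True if a candidate sequence meets criterion (i) or (ii).

    has_lrr: an LRR domain was detected in the sequence.
    has_tir: a C-terminal TIR domain was detected in the sequence.
    lrr_reciprocal_best_is_tlr: the LRR domain has an unambiguous
        reciprocal best match to a previously identified TLR.
    """
    criterion_i = has_lrr and has_tir
    criterion_ii = has_lrr and lrr_reciprocal_best_is_tlr
    return criterion_i or criterion_ii


# A soluble TLR has no TIR domain but can still qualify under criterion (ii).
soluble_like = is_tlr(has_lrr=True, has_tir=False, lrr_reciprocal_best_is_tlr=True)
```

Criterion (ii) is what admits sequences lacking a TIR domain, such as soluble forms or fragments from incomplete assemblies.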

TIR domains are more highly conserved than are LRR domains. Our analyses use molecular distances computed on alignments of both domains. Analyses based solely on TIR domains or on LRR domains reproduce the same topologies as full-length analyses but with lower bootstrap values (data not shown). This is expected, because statistical strength is gained with longer alignments. For particularly large evolutionary distances, the LRR domains become unalignable, and then alignments based solely on TIR domains must be used. Distances computed from such alignments are likely to be underestimates, however, because many residues of the TIR domain are highly constrained.

A gene model is a prediction possibly including the translation start, exon boundaries, and translation stop for a gene. Gene models in well studied organisms are often supported by mRNA sequences. Gene models from other organisms are less reliable; their reliability increases when they are based on high-quality ungapped genomic sequence and with corroboration from comparison with sequences from related species. More than one gene model can exist for a gene, such as when alternative splicing occurs. In this paper, we use a single best or longest gene model for each gene. Gene models not acquired from GenBank were constructed from available sequences of incomplete genome sequencing projects and, in cases indicated in Table 1, based on reassembled trace data. The Ciona savignyi draft genome was constructed from a highly polymorphic individual, making it difficult to differentiate genes from alleles; 14 scaffolds contain TLRs, two or three scaffolds contain TIR domains probably from TLRs, and two scaffolds contain divergent matches to TLRs that may be pseudogenes. Because each gene may be represented twice, there are likely 7-19 Ciona savignyi TLRs. The Strongylocentrotus genome assembly also was based on a single individual and is believed to provide 90% genome coverage. The Ciona intestinalis draft genome is reported to contain three TLRs (1), and we reconfirmed this count. C. intestinalis may have at least one additional divergent TLR or TLR pseudogene. The three C. intestinalis TLRs are thought to be distinct genes and not allelic versions of fewer genes. We identified 340 Strongylocentrotus TLRs; the actual number of genes could conceivably be as few as half this number if all alleles assembled distinctly.
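One reading of how the 7-19 range for Ciona savignyi follows from the scaffold counts is sketched below. The function and its argument names are hypothetical; the assumption is that the lower bound counts only the clear TLR scaffolds and treats every gene as assembled twice, whereas the upper bound counts every scaffold as a distinct gene.

```python
def tlr_count_bounds(tlr_scaffolds, tir_only_scaffolds_max, divergent_scaffolds):
    """Bound the gene count when alleles may have assembled as separate scaffolds."""
    # Lower bound: only clear TLR scaffolds, each gene represented by two alleles.
    lower = tlr_scaffolds // 2
    # Upper bound: every scaffold, including TIR-only and divergent matches,
    # is a distinct gene.
    upper = tlr_scaffolds + tir_only_scaffolds_max + divergent_scaffolds
    return lower, upper


ciona_savignyi = tlr_count_bounds(14, 3, 2)  # (7, 19)

# The same halving argument applied to the 340 Strongylocentrotus TLRs.
strongylocentrotus_minimum = 340 // 2  # 170
```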

In some cases, we did not include additional gene models from Danio rerio and Gallus gallus because of uncertainties in the TLR loci in these draft genomes. Trede et al. (2) discuss Danio TLRs. We did not include some mammalian TLRs, such as those from the dog, because their inclusion does not significantly alter our interpretation of TLR evolution but does tend to crowd figures.

Our use of data from draft genome projects is guided by the National Human Genome Research Institute Rapid Data Release Policy (www.genome.gov/10506376). References for genome projects are:

Canis familiaris Sequencing Project. July 2004 version 1.0 draft assembly. The Broad Institute, Cambridge, MA and Agencourt Bioscience, Beverly, MA.

Ciona intestinalis Sequencing Project. Version 1.95; October 2002 (1). Department of Energy Joint Genome Institute.

Ciona savignyi Sequencing Project. Data version 4/25/2003. Whitehead Institute and Massachusetts Institute of Technology Center for Genome Research (www-genome.wi.mit.edu).

Danio rerio Sequencing Project. June 2004 assembly. The Wellcome Trust Sanger Institute, United Kingdom.

Gallus gallus Sequencing Project. February 2004 draft sequence (3). Genome Sequencing Center, Washington University School of Medicine, St. Louis, MO.

Monodelphis domestica Sequencing Project. October 2004 preliminary assembly. The Broad Institute, Cambridge, MA.

Strongylocentrotus purpuratus Sequencing Project. Version 0.3. Baylor College of Medicine, Houston, TX (www.hgsc.bcm.tmc.edu).

Tetraodon nigroviridis Sequencing Project. V7 assembly; February 2004 (4). Genoscope, Evry, France.

Xenopus tropicalis Sequencing Project. Version 3.0 draft assembly; xenTro1. Department of Energy Joint Genome Institute.

Semantics and Nomenclature

In this paper, the term "fish" always refers to Actinopterygii. Unless specified otherwise, we use Xenopus as shorthand for X. tropicalis, for which a draft genome sequence is available. We use HUGO gene names when they exist; in cases where the HUGO name for an ortholog varies among species, we use the name of the human ortholog when referring generally to a gene.

Historically, vertebrate TLR numbering is based on the order of their discovery in humans and mice, spanning the range from TLR1 to TLR13. It is unlikely that additional complete TLRs will be found in humans or mice. Allowing room for some further mammalian consecutive numbering, fish numbering has started with TLR18. The symbols TLR18-20 have been assigned to mRNAs from D. rerio. D. rerio TLR18 may eventually be redesignated TLR1, but more complete genomic sequence would be necessary to accurately classify all of the D. rerio TLRs. Subfamilies TLR21 and TLR22 are previously known in fish. We introduce the TLR14-16 and TLR23 subfamilies in this paper. Vertebrate TLRs with the same number are generally orthologous. Invertebrate and vertebrate TLR nomenclature do not correspond (e.g., Drosophila TLR9 is not named on the basis of orthology with human TLR9). Our symbol for soluble or "short-form" TLR5 is TLRS5. To avoid confusion with the plural, we do not use the synonymous "TLR5S" abbreviation.

Our choice of certain clades as major families is subjective. Objective and universally accepted methodologies for dividing multigene families into clades for families and subfamilies do not exist. Our choice of families was designed to allow the inference that each family was represented by a single-copy gene in a common ancestor of all of the vertebrates.

The nomenclature relating to coincidental evolution has a long history. "Coincidental evolution" is the first general term to describe correlated evolution in gene families (5). Examples were first noticed by Edelman and Gally (6) and Brown et al. (7). They introduced the terms "horizontal evolution" and "coevolution" to refer specifically to the direct molecular transfer of information; these terms are less general than "coincidental evolution." "Coevolution" has also been given other meanings by other authors, which has eroded the utility of the term. A later term, "concerted evolution," is less desirable, because "concerted" implies teleological coordination.

Molecular Tree Construction and Multidimensional Scaling

Amino acid sequence alignments were generated with clustalx (8). In most cases, inspection of the resulting alignments suggested that little if any improvement could be made by manual adjustment, so the initial output of the alignment program was accepted (Fig. 4). clustalx could not align invertebrate TLRs properly (e.g., with the TIR domain aligned with all other TIR domains), so they were manually aligned with se-al using absolutely conserved residues as anchors [Rambaut, A. (1996) se-al: Sequence Alignment Editor; evolve.zoo.ox.ac.uk]. Molecular distances were computed using protdist from the phylip package [Felsenstein, J. (2004) phylogeny inference package (phylip), Version 3.6; distributed by the author; Department of Genome Sciences, University of Washington, Seattle]. We weighted the input to allow only alignment positions with at least 80% nongap characters to contribute to molecular distances. Molecular trees were inferred with fitch and portrayed with drawtree. To make it easier to read the text of species’ names, Figs. 5 and 6 were computed with kitsch and drawgram. Bootstrap statistics are the result of 100 block bootstrappings with seqboot (B = 10). Block bootstrapping is used, because the domain structure of TLRs causes significant correlation between nearby amino acids. The majority of our bootstrap values were 100. We conclude from the bootstrap values that our sequences are long enough to provide high-confidence topologies under the assumptions of protdist and fitch.
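The two preprocessing ideas in this paragraph, weighting out gap-rich columns and resampling columns in blocks, can be illustrated with a minimal sketch. It is not the phylip implementation: the function names are hypothetical, the alignment is assumed to be a list of equal-length strings with "-" as the gap character, and the wrap-around at the end of the alignment is an assumption that may differ from how seqboot treats block boundaries.

```python
import random


def column_weights(alignment, min_nongap=0.8, gap_chars="-"):
    """Return a 0/1 weight per column; 1 if enough sequences are non-gap there."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    weights = []
    for col in range(length):
        nongap = sum(1 for seq in alignment if seq[col] not in gap_chars)
        weights.append(1 if nongap / n_seqs >= min_nongap else 0)
    return weights


def block_bootstrap_columns(n_sites, block_size=10, rng=None):
    """Resample column indices in blocks of consecutive sites.

    Each block starts at a random site and covers block_size consecutive
    sites, wrapping around the end of the alignment (an assumption).
    """
    rng = rng or random.Random()
    n_blocks = -(-n_sites // block_size)  # ceiling division
    sampled = []
    for _ in range(n_blocks):
        start = rng.randrange(n_sites)
        sampled.extend((start + k) % n_sites for k in range(block_size))
    return sampled[:n_sites]  # trim so the replicate has the original length


# Toy alignment, invented for illustration only.
toy = ["ACDE-", "ACD--", "AC-E-", "ACDE-", "A-DEF"]
toy_weights = column_weights(toy)
replicate = block_bootstrap_columns(25, block_size=10, rng=random.Random(0))
```

Sampling whole blocks keeps neighboring residues together in each replicate, which is the point of block bootstrapping when nearby positions are correlated.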

Sequence Characteristics

Inspection of the multiple alignment (Fig. 4) indicates that each family has sequence features conserved within that family but not with other families. For example, the short consensus ANPGGPV is found in all TLR3 extracellular domains but in no other TLRs. These features are likely to be particularly important for recognition of specific ligands. They may do this in part by disruption of the LRR ectodomain structure.

All TLRs consist of an extracellular leucine-rich domain many hundreds of residues long, a membrane-spanning domain, and an intracellular TIR domain. If a hidden Markov model motif search algorithm is run on such a sequence, it predicts, for the TLRs, a series of short LRR domain model matches along the length of the extracellular domain. For TLR3, with default parameters, the HMMER hmmpfam algorithm searching for all PFAM domains places an N-terminal LRRNT domain, ≈22 LRR_1 domains, an LRRCT domain, and a TIR domain. LRRNT is PFAM’s leucine-rich repeat N-terminal domain (accession number PF01462). LRR_1 is PFAM’s standard leucine-rich repeat domain (accession number PF00560). LRRCT is PFAM’s leucine-rich repeat C-terminal domain (accession number PF01463). There is relatively low information content in the primary sequence of the extracellular domain. A typical e-value for hmmpfam’s TIR prediction is 10^-35; a typical e-value for each LRR_1 prediction is 10, 36 orders of magnitude less significant. The boundaries of the LRR_1 model matches can slide out of register with the model matches in other orthologs, causing the algorithms to report slightly different counts of LRR_1 domains in slightly different locations, even though the primary sequences are well aligned and highly similar in these locations. The relative merits remain to be demonstrated for considering the extracellular portion of TLR3 to be a string of 22 concatenated LRR_1 domains as opposed to a single 600-residue domain rich in leucines. The length of the leucine-rich extracellular region varies between TLR families. For example, considered in terms of approximate number of LRR_1 domains, TLR7/TLR8/TLR9 each have 25, TLR11 has 21, TLR5 has 20, and TLR1/TLR4 each have 13.
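The "36 orders of magnitude" figure is simply the difference in log10 between the two typical e-values quoted above. A minimal check, with a hypothetical helper name:

```python
import math


def orders_of_magnitude(larger_evalue, smaller_evalue):
    """Number of powers of ten separating two e-values."""
    return math.log10(larger_evalue) - math.log10(smaller_evalue)


# Typical values quoted in the text: LRR_1 match vs. TIR match.
gap = orders_of_magnitude(10.0, 1e-35)
```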

Some immune multigene families, such as MHC molecules, are under positive selection. If TLRs were under evolutionary forces similar to those operating on the MHC families, they might also demonstrate properties of positive selection, such as a nonsynonymous/synonymous substitution ratio significantly greater than one. We tested the TLR phylogeny for positive selection with the maximum-likelihood algorithm implemented in PAML’s codeml using multiple models for the site distribution of evolutionary rates. We tested three representative families: TLR3, TLR5, and TLR11. If positive selection were detectable in any family, we would have expected to find it for TLR11, which demonstrates the greatest intrafamily molecular distances. Even with the most sensitive empirical model (Bayes 11 class, also the least specific model), no statistic gave significant evidence for any residues under positive selection. Lack of evidence for positive selection is consistent with the conventional hypothesis that negative selection is the dominant constraint on vertebrate TLR evolution.

Coincidental Evolution

Multigene families often evolve in ways that violate assumptions necessary for simple and objective gene phylogeny estimation. Their molecular clock may not be regular. In particular, some members of the family may evolve at much faster rates and as such are dubbed "fast-evolving genes." This happens when one member gene takes on a significantly novel function and thus encounters significantly different selective pressures from the other multigene family members. Vertebrate lactate dehydrogenase C is a classic example of a fast-evolving gene. Another usual assumption of molecular tree construction is that each branch of the tree evolves independently from the other branches. "Coincidental evolution" is a term describing phylogenies with branches that do not evolve independently. Multigene families often show coincidental evolution, either indirectly through biased mutational and selective forces or directly by mechanisms such as gene conversion (9). By comparing the molecular distance of pairs of paralogs present in different species with pairs of paralogs present in the same species, we can gain a sense of the amount of within-species coincidental evolution.

One effect of coincidental evolution in a multigene family can be to shorten the apparent molecular distance between family members present in the same species’ genome compared with the distance between family members in different genomes. Thus, for example, the molecular distance between human TLR5 and human TLR7 might be less than the distance between human TLR5 and Takifugu TLR7 despite the fact that both molecular distances are estimates of the same period (9). To see whether coincidental evolution has operated globally on TLRs since the time of the divergence of the major families, we tested a representative subset of subfamilies (TLR2-5 and TLR9) and species (chicken, human, Takifugu, and X. tropicalis). We did not include TLR1, TLR6-8, and TLR10, because we did not want to overweight families that may have had more recent duplications. The average distance of intraspecies paralogs is 270 ± 20 percent accepted mutations (PAM); the average distance of paralogs between species is 280 ± 20 PAM. We are thus led to reject the hypothesis that coincidental evolution has been a significant factor in the evolution of the six major TLR families. At least two reasons may be considered as explanations: (i) the TLR gene families are so divergent at the sequence level that molecular homogenizing events such as gene conversion cannot occur, and (ii) most individual TLR genes are indispensable, so there is strong selection against any alteration in sequence, including those that would affect our statistic for measuring coincidental evolution.
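The comparison described above amounts to partitioning pairwise paralog distances by whether the two genes come from the same species. The sketch below assumes a hypothetical input layout (a dict keyed by pairs of (species, family) tuples) and uses invented toy distances; it does not reproduce the values reported in the text, and the text does not state how its ± terms were computed.

```python
from statistics import mean


def split_paralog_distances(dist):
    """Split paralog distances into intraspecies and interspecies lists.

    dist maps ((species_a, family_a), (species_b, family_b)) -> distance in PAM.
    Pairs from the same family are orthologs (or the same gene) and are skipped.
    """
    intra, inter = [], []
    for ((sp_a, fam_a), (sp_b, fam_b)), d in dist.items():
        if fam_a == fam_b:
            continue  # not a paralog pair
        (intra if sp_a == sp_b else inter).append(d)
    return intra, inter


# Toy distances, invented for illustration only.
toy_dist = {
    (("human", "TLR5"), ("human", "TLR9")): 260.0,
    (("Takifugu", "TLR5"), ("Takifugu", "TLR9")): 275.0,
    (("human", "TLR5"), ("Takifugu", "TLR9")): 280.0,
    (("human", "TLR9"), ("Takifugu", "TLR5")): 285.0,
    (("human", "TLR5"), ("Takifugu", "TLR5")): 90.0,  # ortholog pair, ignored
}
intra, inter = split_paralog_distances(toy_dist)
intra_mean, inter_mean = mean(intra), mean(inter)
```

Within-species coincidental evolution would show up as an intraspecies mean clearly below the interspecies mean, relative to the spread of each group.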

Although coincidental evolution has not operated globally on the TLR multigene family, we hypothesized it might have operated on a more recent timescale within particular TLR clades. This might have been facilitated by more similar sequences and function. In particular, our computed root of the TLR9 family is inconsistent with the known evolution of species (the root should be between the amphibian and fish TLR9, not between the amphibian and mammal TLR9). This inconsistency could be explained by coincidental evolution. However, for TLR7-9, the average distance of intraspecies paralogs is 120 ± 20 PAM; the average distance of paralogs between species is also 120 ± 20 PAM. These numbers do not support the hypothesis of coincidental evolution within the TLR7 family.

If there is not coincidental evolution within the TLR7 family, and TLRs evolve at a conservative and constant rate, as appears to be the case, then we can tentatively use a molecular clock to infer certain aspects of the timing of TLR evolution. TLR7 and TLR8 must have split more recently (average molecular distance 110 PAM) than the divergence of the six major clades (average molecular distance 270 PAM). Furthermore, the TLR7 and TLR8 split was before the divergence of fish and tetrapods, because both possess an ortholog for each. Therefore, we can infer that the divergence of the major families was more than twice as long ago as the divergence of fish and tetrapods. The major TLR families probably diverged during or before the Cambrian Period.
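The "more than twice as long ago" inference can be written out as arithmetic. Under a constant-rate clock, divergence time is proportional to molecular distance, so the ratio of the two average distances equals the ratio of the two ages; because the TLR7/TLR8 split predates the fish-tetrapod divergence, that ratio is a lower bound. The function name below is hypothetical.

```python
def min_age_ratio(major_clade_pam, tlr7_tlr8_pam):
    """Lower bound on (age of major-family divergence) / (age of fish-tetrapod split).

    Assumes a constant-rate molecular clock, so age is proportional to distance,
    and that the TLR7/TLR8 split is at least as old as the fish-tetrapod split.
    """
    return major_clade_pam / tlr7_tlr8_pam


ratio = min_age_ratio(270, 110)  # about 2.45, i.e. more than twice
```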

We next hypothesized that coincidental evolution might operate within the TLR1 family. This is particularly compelling, because TLR2 must interact with another partner within the family. Direct molecular interaction provides a mechanism for correlated selection pressure. Here, the average distance of intraspecies paralogs is 170 ± 10 PAM; the average distance of paralogs between species is 180 ± 10 PAM. Sequences from more species would be required to gain the power to conclude that coincidental evolution was at work between TLR2 and its partners.

Finally, we hypothesized there might be coincidental evolution operating between TLR5 and TLRS5. Considering only amino acid positions present in all sequences, the average distance of intraspecies paralogs is 70 ± 50 PAM; the average distance of paralogs between species is 120 ± 20 PAM. Coincidental evolution seems likely. The distance between trout TLR5 and TLRS5 of 19 PAM is particularly short, suggesting the possibility of recent gene conversion or of origin by duplication for trout TLRS5 independently from Takifugu and Xenopus TLRS5, which seems unlikely.

Gene Dosage

Gene dosage of TLRs appears important. Fish have undergone a complete genome duplication since their divergence from tetrapods (4). In many cases, both copies of the genes from this duplication have been retained in fish with respect to tetrapods. However, for almost all TLR genes, one of the duplicate copies has been lost. This implies that it is neutral or disadvantageous to have two nearly identical TLR genes in a genome. Perhaps such duplicates disrupt the highly conserved signaling pathways used by TLRs. A consequence of selection pressure against duplicate TLRs is that fewer gene duplications will be available for the evolution of new TLR functionality from a redundant TLR gene. This may also limit some mechanisms of coincidental evolution. Such selective pressure on gene dosage allows another hypothesis for the recognizable TLR2 pseudogenes in marsupials, dogs, and humans. The gene may have been functional until recently and independently pseudogenized in each species.

It is possible that not all novel TLR genes from the fish genome duplication have been lost, although clearly most have. For example, TLR5 and TLRS5 might represent a duplicated pair. TLRS5 is on Tetraodon chromosome 14; TLR5 is not localized. In Takifugu, the two TLR5 genes are on different scaffolds. Most of the fish TLR genes in the TLR11 clade have a Xenopus ortholog and thus must have diverged before a fish-specific genome duplication. However, Xenopus TLR22 appears to be a single Xenopus TLR orthologous to a pair of fish-specific TLR subfamilies: TLR22 and TLR23. Thus, fish TLR22 and TLR23 might have arisen by genome duplication.

The same selective pressure favoring single-copy genes may also operate to favor allelic exclusion of TLRs.

TLR Evolution

We hypothesize that TLRs are so critical for survival that vertebrate species seldom persist without possessing at least one TLR from each of the six major clades. We find it surprising that some appear to do so. If they have not evolved alternative recognition strategies, are these species doomed? Are the TLRs really there and missed by contemporary genomics? Does domestication of Takifugu and chicken allow them to survive? For example, could pseudogenization of chicken TLR8 be a result of antibiotics in feed? Does human intelligence allow avoidance of some pathogenic selection pressure and bypass the need for TLR12? If this is the case, we would also have to note that dogs have lost TLR12 as well. We have little data from which to answer these questions.

All species need methods of defense from pathogens. When such defense includes both innate and adaptive components, it is reasonable to suppose that species with a more robust adaptive response would need less of an innate response, and species with a more robust innate response would need less of an adaptive response. Because TLRs are a basic component of the innate response, it is logical to predict that species with a more primitive adaptive response would have a greater number of TLRs than are seen in mammals. Indeed, sea urchins appear to have greatly expanded their TLR repertoire, consistent with a lack of an adaptive immune system. But this expansion is seen in neither fish nor flies. Thus, we do not feel that we have the data to strongly endorse an inverse correlation between the number of TLRs and the development of the adaptive immune system.

Why are there not more TLRs in more species? The number of highly conserved pathogen-associated molecular patterns (PAMPs) must be > 12, and the hundreds of Strongylocentrotus TLRs would seem to confirm this. Certainly many billions of distinct structures, such as antibodies and T cell receptors, can be built on a basis of Ig domains, each recognizing a specific epitope. The failure of TLRs to evolve into as many PAMP specificities may be due to their critical role in initiating inflammatory responses. It would be very dangerous to have self-reactive TLRs and thus critical that TLRs faithfully distinguish between the pathogen and the host. Because the TLR system has not evolved self-tolerance, as seen in the adaptive immune system and the development of the natural killer cell repertoire, an autoreactive TLR would predictably lead to autoinflammatory and autoimmune disorders. Another possibility is that the small number of TLRs is sufficient to recognize a vast number of pathogens, either due to the conservation of specific PAMPs or the ability of individual TLRs to recognize multiple PAMPs, thereby decreasing selective pressure for additional receptors.

Some limitations on TLR evolution may be due to their reliance on LRR domains as the basis of their recognition structures (10, 11). LRR-based recognition is very permissive compared with Ig recognition. Such permissivity could be either due to (i) a static binding site that fits many ligands or (ii) a dynamic binding site that molds to many ligands. Igs recognize a very specific molecule, often derived from a specific gene of a specific strain of a specific species. LRRs of TLRs are clearly much more permissive. However, the variable lymphocyte receptors (VLRs) in agnathans are perhaps an exception to the argument that there is selective pressure limiting the number of LRR-based immune receptors. Even so, little is known about the immunologic compromises in agnathans. Their immune response is slower and less specific than a mammalian response. This may be due to permissivity of VLRs and compensatory tolerance mechanisms.

An important challenge in innate immunity today is to identify unknown TLR ligands. Our current study makes possible three classes of predictions. First, TLRs that are phylogenetically related should recognize structurally related PAMPs. For example, eutherian TLR1 and TLR6 arose by duplication after the divergence of marsupial TLR1, and both recognize lipopeptides. Thus, we predict that the closely related TLR10 and the other TLR1-like toll receptors will also be found to recognize lipopeptides. Second, species that have lost a critical TLR will have evolved alternative strategies for recognizing the PAMP for the lost TLR. Thus, we predict that pufferfish, lacking TLR4, will be found to have other means of recognizing components of Gram-negative bacteria to compensate for the loss of LPS recognition. We propose that TLR23, to date found only in puffers, may provide this recognition. If so, this would demonstrate that TLRs from different families can independently evolve to recognize the same or similar PAMPs. Analogously, chicken TLR15 might replace the functionality of TLR9. Third, where a gene family has expanded, it may have done so in response to clade-specific selection pressures. If terrestrial species have more members of the TLR1 family, and fish have more members of the TLR11 family, then lipopeptide recognition may be more important on land than is the PAMP for TLR11, and vice versa.

1. Dehal, P., Satou, Y., Campbell, R. K., Chapman, J., Degnan, B., De Tomaso, A., Davidson, B., Di Gregorio, A., Gelpke, M., Goodstein, D. M., et al. (2002) Science 298, 2157-2167.

2. Trede, N. S., Langenau, D. M., Traver, D., Look, A. T. & Zon, L. I. (2004) Immunity 20, 367-379.

3. Hillier, L. W., Miller, W., Birney, E., Warren, W., Hardison, R. C., Ponting, C. P., Bork, P., Burt, D. W., Groenen, M. A., Delany, M. E., et al. (2004) Nature 432, 695-716.

4. Jaillon, O., Aury, J. M., Brunet, F., Petit, J. L., Stange-Thomann, N., Mauceli, E., Bouneau, L., Fischer, C., Ozouf-Costaz, C., Bernot, A., et al. (2004) Nature 431, 946-957.

5. Hood, L., Campbell, J. H. & Elgin, S. C. (1975) Annu. Rev. Genet. 9, 305-353.

6. Edelman, G. M. & Gally, J. A. (1970) in The Neurosciences: Second Study Program, ed. Schmitt, F. O. (Rockefeller Univ. Press, New York), pp. 962-972.

7. Brown, D. D., Wensink, P. C. & Jordan, E. (1972) J. Mol. Biol. 63, 57-73.

8. Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. G. (1997) Nucleic Acids Res. 25, 4876-4882.

9. Roach, J. C., Wang, K., Gan, L. & Hood, L. (1997) J. Mol. Evol. 45, 640-652.

10. Kobe, B. & Deisenhofer, J. (1994) Trends Biochem. Sci. 19, 415-421.

11. Kajava, A. V. (1998) J. Mol. Biol. 277, 519-527.