Abstract
Recent work indicates that allelic incompatibility in the mouse PRDM9 (Meisetz) gene can cause hybrid male sterility, contributing to genetic isolation and potentially speciation. The only phenotype of mouse PRDM9 knockouts is a meiosis I block that causes sterility in both sexes. The PRDM9 gene encodes a protein with histone H3(K4) trimethyltransferase activity, a KRAB domain, and a DNA-binding domain consisting of multiple tandem C2H2 zinc finger (ZF) domains. We have analyzed human coding polymorphism and interspecies evolutionary changes in the PRDM9 gene. The ZF domains of PRDM9 are evolving very rapidly, with compelling evidence of positive selection in primates. Positively selected amino acids are predominantly those known to make nucleotide-specific contacts in C2H2 zinc fingers. These results suggest that PRDM9 is subject to recurrent selection to change DNA-binding specificity. The human PRDM9 protein is highly polymorphic in its ZF domains, and nearly all polymorphisms affect the same nucleotide contact residues that are subject to positive selection. ZF domain nucleotide sequences are strongly homogenized within species, indicating that interfinger recombination contributes to their evolution. PRDM9 has previously been assumed to be a transcription factor required to induce meiosis-specific genes, a role that is inconsistent with its molecular evolution. We suggest instead that PRDM9 is involved in some aspect of centromere segregation conflict and that rapidly evolving centromeric DNA drives changes in PRDM9 DNA-binding domains.
Introduction
Allelic incompatibility at the mouse PRDM9 (Meisetz) locus can cause hybrid male sterility due to failure in spermatogenesis [1]. Rare dominant nonsynonymous mutations in human PRDM9 may also cause failure in spermatogenesis (azoospermia, [2]), suggesting similar allelic incompatibility in humans. These data support a role for PRDM9 in an early stage of post-zygotic hybrid incompatibility, consistent with a role in speciation [1], [3]. Targeted PRDM9 knockout in the mouse causes sterility in both sexes due to a block in meiosis I in both the male and female germ line [4]. The germ line arrest morphology resulting from PRDM9 knockout is identical to that resulting from incompatible PRDM9 alleles, suggesting that the incompatible alleles abrogate PRDM9 function [1], [4].
The PRDM9 gene in human and mouse encodes a protein with KRAB [5] and SET domains followed by multiple tandem C2H2 zinc finger (ZF) domains near the C-terminus [6], [7]. The PRDM9 SET domain region confers histone H3(K4) trimethyltransferase activity, consistent with activity as a transcriptional activator [6]. The function of the KRAB domain of PRDM9 has not specifically been studied, but in other ZF transcription factors it is known to recruit histone deacetylases and histone H3(K9) methyltransferase, suggesting activity as a transcriptional repressor [8], [9]. Since the ZF domains are the only DNA-binding domains in PRDM9, it is likely that they confer the DNA-binding specificity of PRDM9. Because of its histone-modifying activity and DNA-binding domains, it has been assumed that PRDM9 encodes a transcription factor that regulates other genes important for germ line meiosis, but there is no direct evidence for such a role.
The structure of ZF domains of the type found in PRDM9 bound to DNA is well-established [10], [11], [12]. Tandem ZF domains confer DNA-binding specificity in a modular manner, with sequential ZF domains binding sequential 3 nt DNA sequences in target sites. Each core ZF domain is typically 21 residues long and consists of a conserved framework of amino acids that coordinates and positions a highly variable nucleotide contact region. Adjacent core ZF domains are joined by a conserved 7 amino acid linker region, which makes a DNA phosphate contact and coordinates adjacent zinc fingers. Within the core ZF domain, nucleotide contacts are made by an eight amino acid turn-helix that occupies the major groove of DNA. Three amino acids make the major nucleotide contacts and adjacent residues may contribute additional contacts and influence the positioning of the major nucleotide contacts. The positions of all these amino acids are highly conserved and can be directly inferred from ZF domain sequence.
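Because the contact positions can be inferred directly from the finger sequence, the inference can be sketched in a few lines of code. The block below is only an illustration under stated assumptions: the regular expression is a generic C2H2 pattern (C-X2-4-C-X12-H-X3-5-H), the three major contacts are taken at the conventional helix positions -1, 3 and 6 (the first zinc-coordinating histidine being position +7), and the test sequence `ZIF268_LIKE` is a three-finger sequence patterned on Zif268 [10], not PRDM9 itself. The function name is hypothetical.

```python
import re

# Generic C2H2 finger pattern (an assumption, not the authors' exact definition):
# C-X(2-4)-C-X(12)-H-X(3-5)-H. The 12 residues captured before the first His
# end with the recognition helix; the His itself is helix position +7.
C2H2 = re.compile(r"C.{2,4}C(?P<pre>.{12})H.{3,5}H")

# Illustrative three-finger sequence patterned on Zif268 (assumed test input).
ZIF268_LIKE = (
    "MERPYACPVESCDRRFSRSDELTRHIRIHTGQKP"
    "FQCRICMRNFSRSDHLTTHIRTHTGEKP"
    "FACDICGRKFARSDERKRHTKIHLRQKD"
)

def contact_residues(protein: str) -> list[str]:
    """Return the residues at helix positions -1, 3 and 6 for each finger found."""
    contacts = []
    for match in C2H2.finditer(protein):
        pre = match.group("pre")
        # Counting back from the first His (+7): +6 is pre[-1], +3 is pre[-4],
        # and -1 is pre[-7] (there is no helix position 0).
        contacts.append(pre[-7] + pre[-4] + pre[-1])
    return contacts

print(contact_residues(ZIF268_LIKE))  # ['RER', 'RHT', 'RER']
```

Comparing the lists returned for two orthologs, finger by finger, gives the kind of contact-residue difference count reported in the Results, provided the fingers are first aligned one to one.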
We report analysis of molecular evolution of the PRDM9 gene based on sequence comparisons between primate species and on human polymorphisms.
Results
Primate Divergence and Positive Selection
We identified PRDM9 orthologs in primates and analyzed their molecular evolution. Most of the PRDM9 gene is well conserved in sequence, but the ZF domains are highly divergent. For example, in the 12 ZF domains of human and chimpanzee PRDM9 proteins, 28 of the 36 major nucleotide contact residues differ (Figure 1), despite a genome-wide average nucleotide divergence of 1.2% and protein divergence of 0.12% [13]. To study selection acting on primate PRDM9 genes, we performed maximum-likelihood analysis of synonymous and nonsynonymous codon changes in human, chimpanzee, orangutan, macaque, and baboon. Most of the 894 codons in the PRDM9 alignment are characterized by negative (purifying) selection, indicated by a low estimated dN/dS value (the rate of nonsynonymous change relative to synonymous change). However, 32 codons had a high estimated dN/dS value with strong statistical support for positive selection (posterior probability P>0.95). Of these 32 codons, 26 encode major nucleotide contact residues in ZF domains and all 6 other codons are immediately adjacent to major nucleotide contact residues. Several additional codons in DNA-binding turn-helix regions also had high estimated dN/dS values that did not reach statistical significance. Divergence in the DNA-binding turn-helix regions is obvious by simple inspection of the protein multiple alignment (Figure 1). In addition to rapid divergence in the DNA-binding residues, there are several changes in the number of ZF domains among the five primates. All of these changes are the result of precise insertion or deletion of entire ZF domains, consistent with generation by unequal cross-over events. These changes are also expected to affect DNA-binding specificity. A high degree of amino acid divergence in the DNA-binding turn-helix region and changes in ZF domain number were also observed in other mammals, including rodents (data not shown).
The strong evidence for positive selection in nearly all of the DNA-binding domains, together with the sharp difference between human and chimpanzee PRDM9 in their major nucleotide contact residues, suggests directional selection on PRDM9 to rapidly change DNA-binding specificity. Maximum-likelihood analysis also suggests that similar selection is acting on every branch of the primate tree (data not shown). The remainder of the PRDM9 sequence is conserved throughout mammals and even in more basal lineages including non-mammalian chordates and echinoderms [7], indicating that a highly conserved function is tethered to a rapidly evolving DNA-binding specificity.
Human Polymorphism
We assessed single-nucleotide polymorphisms (SNPs) in the coding region of human PRDM9 reported in dbSNP130, from 16 individual exome sequences generated by next-generation sequencing ([14] and this study), and from a large study of PRDM9 coding SNPs in Japanese men [2]. Of 35 distinct SNPs, 31 are nonsynonymous, indicating an exceptionally high population diversity in PRDM9 protein sequence. The positions of the 31 nonsynonymous SNPs are remarkable (Figure 1 and Figure 2): 28 of 31 affect ZF domains or the linkers between ZF domains, and 24 of these 28 affect residues in the DNA-binding turn-helix of ZF domains. In the entire PRDM9 protein of 894 amino acids, only 96 amino acids are in a DNA-binding turn-helix region (8 in each of 12 ZF domains), indicating a highly significant enrichment of nonsynonymous SNPs in turn-helix residues (24/31 vs. 96/894, P<0.0001 by Fisher's exact test). Taken together with the strong positive selection acting at the same class of sites, this pattern suggests an ongoing series of partial selective sweeps affecting PRDM9. In the study of Japanese men, allele frequencies were determined for 21 SNPs. Two of these 21 were found only once, in patients with azoospermia; neither affects a ZF turn-helix. All of the remaining 19 alleles were common in the Japanese population (>1% frequency) and all 19 affect a ZF turn-helix. Of the 12 nonsynonymous SNPs found in non-Japanese populations, only one is identical to a Japanese SNP. This lack of overlap suggests that both the Japanese and non-Japanese PRDM9 polymorphisms arose very recently.
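The enrichment quoted above (24/31 vs. 96/894) can be checked with a short calculation. The sketch below is not the authors' procedure: it assumes that each nonsynonymous SNP falls on a distinct residue and computes a one-sided hypergeometric upper tail, which corresponds to a one-sided Fisher's exact test on the resulting 2x2 table. The function name is hypothetical.

```python
from math import comb

def hypergeom_upper_tail(total: int, marked: int, draws: int, observed: int) -> float:
    """P(X >= observed) when `draws` items are sampled without replacement
    from `total` items, of which `marked` belong to the class of interest."""
    denominator = comb(total, draws)
    upper = min(marked, draws)
    numerator = sum(
        comb(marked, k) * comb(total - marked, draws - k)
        for k in range(observed, upper + 1)
    )
    return numerator / denominator  # exact integers, converted to float once

# Values taken from the text: 894 residues, 96 in turn-helix regions,
# 31 nonsynonymous SNPs, 24 of them in turn-helix residues.
p = hypergeom_upper_tail(total=894, marked=96, draws=31, observed=24)
print(f"one-sided P = {p:.3g}")
```

Under these assumptions the probability is many orders of magnitude below the reported threshold of 0.0001, so the conclusion does not depend on the exact form of the test.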
Zinc Finger Homogenization
We noticed that zinc fingers within the PRDM9 gene in each primate tended to be similar in sequence, despite the rapid divergence among species. We investigated the generality of this pattern by identifying putative PRDM9 sequences from 19 sequenced mammalian genomes (including the primates in Figure 1). The 19 sequences all encode multiple ZF domains with the same spacing and arrangement as that in human and mouse. The number of ZF domains ranges from 3 to 20 with an average of 8.3. A dot plot of ZF amino acid sequence similarity within genes is shown in Figure 3. With the possible exception of tenrec, ZF domains within a species are clearly more similar to one another than to ZF domains from other species. These results are consistent with interfinger recombination resulting in homogenization of most of the ZF sequence. Homogenization among ZF domains in human and mouse is even more striking at the level of DNA sequence (Figure 4). For example, of the 28 codons that make up a single ZF domain plus linker, 19 encode the same amino acid in all 12 human ZF domains. All 19 of these codons are 2-fold or 4-fold degenerate, but only 2 of 228 possible synonymous differences are found among the fingers. Similarly, in the mouse only 1 of 264 possible synonymous differences is found. All 3 of these synonymous differences occur in the first or last zinc finger, consistent with partial recombinational isolation of terminal ZF domain sequences.
It is likely that such recombination events contribute to human population polymorphism as documented in Figures S1 and S2. Briefly, 28 of 31 distinct human SNPs found in ZF domains could have arisen by recombination with other ZF domains. Given the general homogeneity of sequence in the aligned ZF domains (Figure 4), this correlation is clearly significant. Rapid divergence in PRDM9 genes could be facilitated by such recombination events, which can allow spread of new advantageous mutations to other fingers [15]. We speculate that it is this process of spreading new advantageous changes by recombination that drives the homogenization of other sequences in PRDM9 ZF domains.
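The count of synonymous differences among fingers reported above (for example 2 of 228 in human) can be illustrated with a small sketch. It rests on assumptions not stated in the text: the fingers are supplied as equal-length, in-frame DNA strings; "possible" differences are counted as invariant-amino-acid codon columns times the number of fingers (19 x 12 = 228 for human); and an observed difference is any codon that deviates from the majority codon in its column. The function name and the toy input are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_differences(fingers: list[str]) -> tuple[int, int]:
    """Return (observed, possible) synonymous differences among aligned fingers."""
    n_codons = len(fingers[0]) // 3
    observed = possible = 0
    for i in range(n_codons):
        column = [f[3 * i:3 * i + 3] for f in fingers]
        amino_acids = {CODON_TABLE[codon] for codon in column}
        if len(amino_acids) != 1:
            continue  # amino acid varies among fingers: not a synonymous column
        if amino_acids <= {"M", "W"}:
            continue  # single-codon amino acids cannot differ synonymously
        possible += len(column)
        majority_count = Counter(column).most_common(1)[0][1]
        observed += len(column) - majority_count
    return observed, possible

# Toy alignment of three "fingers" (three codons each), for illustration only.
toy = ["TGTGAAAAA", "TGTGAAAGA", "TGCGAAAAA"]
print(synonymous_differences(toy))  # (1, 6)
```

In the toy input the first column holds one minority codon (TGC among TGT), the second column is identical in all fingers, and the third column is skipped because it encodes both Lys and Arg.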
Discussion
In summary, PRDM9 sequences across mammals show rapid divergence specific to the DNA-binding turn-helix region of their ZF domains, and there is an exceptionally high level of nonsynonymous human polymorphism in the same classes of sites. What could account for these patterns? All other domains in PRDM9 are conserved among species, suggesting that a conserved biochemical function is tethered to a rapidly evolving DNA-binding specificity. The expected histone modification activities of PRDM9 (histone deacetylation and histone H3(K4) trimethylation) suggest a role in transcriptional regulation. However, transcription factors are generally characterized by highly conserved DNA-binding domains, whereas these are the regions where PRDM9 is evolving most rapidly. In contrast to transcriptional regulation, centromere structure and function are associated both with regulated histone modification states [16], [17] and with rapidly evolving DNA sequences [18], [19], [20]. A favored model for the rapid evolution of centromere sequence is the centromere-drive hypothesis, in which selfish centromeres compete to segregate to the oocyte during female meiosis [19], [21]. Centromere drive is potentially deleterious to the host by causing skewed sex ratios or male sterility, effects that may be balanced by the observed rapid evolution in genes encoding centromere-associated proteins [22].
The internal structure and boundaries of functional centromeres are strongly associated with histone modifications and with centromere-specific classes of histones [16], [17]. PRDM9 has the potential to regulate two of the known centromeric chromatin-associated histone modifications. First, human and Drosophila centromeric chromatin is hypoacetylated on histones H3 and H4 [17]. PRDM9 encodes a KRAB domain, which is well-established to recruit histone deacetylases via the KAP1 protein [23]. Though the KRAB domain of PRDM9 has not itself been studied, it is very similar to other KRAB domains and is probably the evolutionary origin of the huge family of KRAB zinc finger transcription factors [7]. Second, histone H3(K4) dimethylation is associated with centromeric chromatin, whereas H3(K4) trimethylation is often found at the borders just outside of centromeric chromatin [17]. The SET domain of PRDM9 has been shown to have H3(K4) trimethyltransferase activity (converting dimethyl H3(K4) to trimethyl H3(K4)). Thus PRDM9 could function to limit the extent of core centromeric chromatin by helping to define the borders of di- and tri-methyl histone H3.
Centromere drive is expected to be limited to the germ line, with the most obvious site of action being meiosis in the female. The only phenotype of PRDM9 knockout in the mouse is arrest at prophase of meiosis I in both sexes, consistent with various specific roles including recombination and chromatin condensation in preparation for metaphase. Our hypothesis for PRDM9 function clearly predicts that the PRDM9 protein will be physically associated with the centromere during meiosis I and that it will function there to moderate centromere drive via histone modification.
Materials and Methods
dN/dS Analysis
Complete or nearly complete PRDM9 coding sequence was obtained either from available gene predictions (human, chimpanzee, orangutan, macaque) or from a genewise [24] prediction based on the human PRDM9 protein (baboon). Codons were aligned guided by a clustalw [25] protein alignment (default parameters). The codeml program from the PAML suite [26] was run on the codon alignment without gap removal using model 7 and model 8 (three starting omega values with unconstrained added omega class, plus one run with the added omega class constrained to 1.0). Evidence for positive selection was overwhelming (e.g. an 85.2 difference in log-likelihood for model 8 with unconstrained omega relative to model 8 with omega constrained to 1.0). The Bayes-Empirical-Bayes dN/dS estimates came from the codeml “rst” output file and P-values came from the standard codeml output (both with unconstrained model 8).
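The log-likelihood comparison above can be turned into a likelihood-ratio test. The following is a minimal sketch, not part of the original analysis: it assumes the reported 85.2 is the difference in log-likelihood between the two nested model 8 fits, that the test statistic is twice that difference, and that it is compared with a chi-square distribution with one degree of freedom. The function name is hypothetical.

```python
from math import erfc, sqrt

def lrt_pvalue_df1(delta_lnl: float) -> tuple[float, float]:
    """Likelihood-ratio test for nested models differing by one parameter.

    Returns (test statistic, P-value). For one degree of freedom the
    chi-square survival function is erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * delta_lnl
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value

statistic, p_value = lrt_pvalue_df1(85.2)  # 85.2 is the value given in the text
print(f"2*dlnL = {statistic:.1f}, P = {p_value:.2g}")
```

Because the constrained value (omega = 1.0) lies on the boundary of the parameter space, a plain chi-square with one degree of freedom is, if anything, conservative here; with a statistic this large the choice of null distribution makes no practical difference.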
SNP Identification
Human SNPs were ascertained in three ways. First, a publication provides extensive information on human coding-sequence alleles and frequencies in a Japanese population [2]. Second, dbSNP130 was queried to obtain all known PRDM9 coding SNPs. Third, novel coding SNPs were ascertained by deep sequencing of 12 human exomes [14] plus an additional 4 exomes sequenced by the same method. Table S1 summarizes information on all the SNPs, including genome position, nucleotide change, and source. The haplotype configurations of the SNPs are unknown. Haplotter summary statistics in the PRDM9 region show no obvious signs of recent population-specific positive selection [27].
Ortholog Identification
We sought to obtain the PRDM9 ZF coding exon from available mammalian genome assemblies. Divergence in PRDM9 ZF sequences, combined with large C2H2 zinc finger gene families throughout mammals, made PRDM9 identification based on these sequences impossible (data not shown). For low-coverage assemblies, the unique upstream coding exons were likewise of no use because they were often not present on the same contig as the ZF exon. However, there is a conserved unique protein sequence upstream of the ZF domains in the ZF-containing exon (see Figure 1) that appears to be diagnostic for PRDM9 genes in mammals. We used this protein sequence as a tblastn [28] query of all available mammalian genome assemblies. The DNA corresponding to the best tblastn match from each genome was extracted along with 1 kb of DNA downstream of the match. These DNA sequences were translated and tested for encoding multiple ZF domains in-frame and downstream of the diagnostic sequence. Finally, candidate sequences that passed this test were translated and a maximum-likelihood tree was made to confirm probable orthology of the sequences (see Figure 3). As expected because of incomplete assemblies, a PRDM9 ortholog was not found in all species, especially those with 2-fold coverage. In most cases additional confirmation that the sequence is a bona fide PRDM9 gene was obtained as follows: the marmoset, macaque, baboon, orangutan, and chimpanzee genes are complete or nearly complete in their assemblies and are clearly syntenic to human PRDM9; the rat gene is complete and syntenic with the mouse PRDM9 gene; the cow gene is nearly complete; and the cat, mouse lemur (Microcebus murinus), dolphin (Tursiops truncatus), and bat (Pteropus vampyrus) genes have an upstream exon on the same contig that matches the next upstream human exon from PRDM9.
Finally, all the sequences except tenrec show clear homogenization of their ZF domain sequences, a very unusual character among tandem ZF domain genes (data not shown). The tenrec sequence is included based on the protein tree, but it should be regarded as a questionable ortholog assignment because it shows no other shared PRDM9 characters. Higher primates have a partial duplicate of the PRDM9 gene (PRDM7) but it completely lacks the tandem ZF domains of PRDM9 and is thus easily distinguished.
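The screening step described above, translating the DNA downstream of a tblastn match and testing for multiple in-frame ZF domains, might look roughly like the sketch below. It is an illustration only: the input is assumed to start in the reading frame of the tblastn match, the C2H2 pattern is a generic one rather than the authors' exact criterion, and the function names and the threshold of two fingers are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Generic C2H2 finger pattern (assumed): C-X(2-4)-C-X(12)-H-X(3-5)-H.
C2H2 = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def translate(dna: str) -> str:
    """Translate frame 0 of `dna`; unknown codons (e.g. containing N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def count_inframe_fingers(dna: str, min_fingers: int = 2) -> tuple[int, bool]:
    """Count C2H2 fingers in the open reading frame that starts at position 0.

    Translation is cut at the first stop codon, so only fingers that are
    in-frame with the start of the input are counted.
    """
    protein = translate(dna).split("*")[0]
    n_fingers = len(C2H2.findall(protein))
    return n_fingers, n_fingers >= min_fingers
```

A candidate passing this test would still need the orthology checks described in the text (protein tree, synteny, upstream exon), since tandem C2H2 fingers alone are common to many genes.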
Supporting Information
Acknowledgments
We thank Sean Schneider for helpful discussions.
Footnotes
Competing Interests: The authors have declared that no competing interests exist.
Funding: The sequencing in the paper was supported by the following grants: NHLBI 5R01HL094976-02 “SeattleSeq”, http://www.genome.gov/, NHGRI 1R21HG004749-01, “Molecular Tools for Genome Partitioning”, http://www.genome.gov/. The computational work had no extramural grant support. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Mihola O, Trachtulec Z, Vlcek C, Schimenti JC, Forejt J. A mouse speciation gene encodes a meiotic histone H3 methyltransferase. Science. 2009;323:373–375. doi: 10.1126/science.1163601.
- 2. Irie S, Tsujimura A, Miyagawa Y, Ueda T, Matsuoka Y, et al. Single-nucleotide polymorphisms of the PRDM9 (MEISETZ) gene in patients with nonobstructive azoospermia. J Androl. 2009;30:426–431. doi: 10.2164/jandrol.108.006262.
- 3. Coyne JA, Orr HA. The evolutionary genetics of speciation. Philos Trans R Soc Lond B Biol Sci. 1998;353:287–305. doi: 10.1098/rstb.1998.0210.
- 4. Hayashi K, Yoshida K, Matsui Y. A histone H3 methyltransferase controls epigenetic events required for meiotic prophase. Nature. 2005;438:374–378. doi: 10.1038/nature04112.
- 5. Bellefroid EJ, Poncelet DA, Lecocq PJ, Revelant O, Martial JA. The evolutionarily conserved Kruppel-associated box domain defines a subfamily of eukaryotic multifingered proteins. Proc Natl Acad Sci U S A. 1991;88:3608–3612. doi: 10.1073/pnas.88.9.3608.
- 6. Hayashi K, Matsui Y. Meisetz, a novel histone tri-methyltransferase, regulates meiosis-specific epigenesis. Cell Cycle. 2006;5:615–620. doi: 10.4161/cc.5.6.2572.
- 7. Birtle Z, Ponting CP. Meisetz and the birth of the KRAB motif. Bioinformatics. 2006;22:2841–2845. doi: 10.1093/bioinformatics/btl498.
- 8. Margolin JF, Friedman JR, Meyer WK, Vissing H, Thiesen HJ, et al. Kruppel-associated boxes are potent transcriptional repression domains. Proc Natl Acad Sci U S A. 1994;91:4509–4513. doi: 10.1073/pnas.91.10.4509.
- 9. Pengue G, Calabro V, Bartoli PC, Pagliuca A, Lania L. Repression of transcriptional activity at a distance by the evolutionarily conserved KRAB domain present in a subfamily of zinc finger proteins. Nucleic Acids Res. 1994;22:2908–2914. doi: 10.1093/nar/22.15.2908.
- 10. Pavletich NP, Pabo CO. Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1 Å. Science. 1991;252:809–817. doi: 10.1126/science.2028256.
- 11. Elrod-Erickson M, Rould MA, Nekludova L, Pabo CO. Zif268 protein-DNA complex refined at 1.6 Å: a model system for understanding zinc finger-DNA interactions. Structure. 1996;4:1171–1180. doi: 10.1016/s0969-2126(96)00125-6.
- 12. Kim CA, Berg JM. A 2.2 Å resolution crystal structure of a designed zinc finger protein bound to DNA. Nat Struct Biol. 1996;3:940–945. doi: 10.1038/nsb1196-940.
- 13. The Chimpanzee Sequencing Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87. doi: 10.1038/nature04072.
- 14. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250.
- 15. Thomas JH. Concerted evolution of two novel protein families in Caenorhabditis species. Genetics. 2006;172:2269–2281. doi: 10.1534/genetics.105.052746.
- 16. Allshire RC, Karpen GH. Epigenetic regulation of centromeric chromatin: old dogs, new tricks? Nat Rev Genet. 2008;9:923–937. doi: 10.1038/nrg2466.
- 17. Sullivan BA, Karpen GH. Centromeric chromatin exhibits a histone modification pattern that is distinct from both euchromatin and heterochromatin. Nat Struct Mol Biol. 2004;11:1076–1083. doi: 10.1038/nsmb845.
- 18. Malik HS, Bayes JJ. Genetic conflicts during meiosis and the evolutionary origins of centromere complexity. Biochem Soc Trans. 2006;34:569–573. doi: 10.1042/BST0340569.
- 19. Malik HS, Henikoff S. Conflict begets complexity: the evolution of centromeres. Curr Opin Genet Dev. 2002;12:711–718. doi: 10.1016/s0959-437x(02)00351-9.
- 20. Cellamare A, Catacchio CR, Alkan C, Giannuzzi G, Antonacci F, et al. New insights into centromere organization and evolution from the white-cheeked gibbon and marmoset. Mol Biol Evol. 2009;26:1889–1900. doi: 10.1093/molbev/msp101.
- 21. Malik HS. The centromere-drive hypothesis: a simple basis for centromere complexity. Prog Mol Subcell Biol. 2009;48:33–52. doi: 10.1007/978-3-642-00182-6_2.
- 22. Malik HS, Henikoff S. Adaptive evolution of Cid, a centromere-specific histone in Drosophila. Genetics. 2001;157:1293–1298. doi: 10.1093/genetics/157.3.1293.
- 23. Friedman JR, Fredericks WJ, Jensen DE, Speicher DW, Huang XP, et al. KAP-1, a novel corepressor for the highly conserved KRAB repression domain. Genes Dev. 1996;10:2067–2078. doi: 10.1101/gad.10.16.2067.
- 24. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004;14:988–995. doi: 10.1101/gr.1865504.
- 25. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. doi: 10.1093/nar/22.22.4673.
- 26. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13:555–556. doi: 10.1093/bioinformatics/13.5.555.
- 27. Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006;4:e72. doi: 10.1371/journal.pbio.0040072.
- 28. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.