Proceedings of the National Academy of Sciences of the United States of America. 2003 Jul 11;100(15):8793–8798. doi: 10.1073/pnas.1031592100

A two-locus gene conversion model with selection and its application to the human RHCE and RHD genes

Hideki Innan
PMCID: PMC166392  PMID: 12857961

Abstract

A two-locus gene conversion model with selection is developed. Under the joint action of selection, mutation, gene conversion, recombination, and random genetic drift, approximate formulas for the expectations of the moments of allele frequencies and the expected amounts of variation within and between two loci are obtained by a diffusion method assuming relatively strong selection. It is shown that the pattern of allelic variation is mainly determined by the balance between gene conversion and selection, because these two mechanisms act in opposite directions. As an application of the theoretical results, the human RHCE and RHD genes are considered. The very high level of amino acid divergence between the two genes is observed only in a short region around exon 7. It is known that exon 7 encodes amino acids that characterize the difference between the RHCE and RHD antigens. The observed pattern of DNA variation in this region is consistent with the selection model developed in this article, suggesting that strong selection might be working to maintain the RHCE/RHD antigen variation in the two-locus system. The selection intensity is estimated on the basis of the theoretical result.


Recent genomic sequencing projects confirmed earlier studies showing that there are a number of duplicated genes or chromosome segments in the eukaryotic genome (1–4). Gene duplication has been considered an important mechanism for adaptive genome evolution, because an advantageous mutation may give one of the duplicated genes a new function (5–7). However, there is considerable debate about the fates of duplicated genes and how often adaptive functional diversification occurs. Despite many demonstrations of adaptive evolution in duplicated genes (e.g., refs. 8–11), theoretical studies indicate that one of the duplicated genes is likely to be silenced relatively quickly after duplication (e.g., refs. 12–16).

To understand the mechanism underlying the acquisition of a new function by duplicated genes, this article considers the evolutionary process within a relatively short period after gene duplication. Walsh (17) suggested that functional diversification does not occur frequently, because gene conversion homogenizes variation between duplicated genes (i.e., concerted evolution of multigene families; see refs. 18–24). He considered a neutral model, in which a duplicated gene can acquire a new function when it has successfully “escaped” from conversion due to accumulation of neutral mutations. The conversion rate is assumed to decrease as genes diverge. Here an alternative model is proposed, in which strong selection results in evolution of a new function under the pressure of gene conversion.

A simple two-locus gene conversion model with two alleles, A and B, is considered in a finite population. It is assumed that A and B have slightly different functions, so that haplotypes with the two different alleles (A-B and B-A) are advantageous over haplotypes with the same alleles (A-A and B-B). Let us suppose that a new allele (B) is introduced by mutation in a population in which A-A is fixed. The frequency of B might increase by selection, and the frequencies of advantageous haplotypes (A-B and B-A) might also increase, while gene conversion changes advantageous haplotypes to deleterious haplotypes (A-A and B-B). Thus, selection and gene conversion act in opposite directions. If the effect of gene conversion is larger than that of selection, the four haplotypes might coexist, but eventually one of the deleterious haplotypes could fix in the population by genetic drift. With very strong selection, on the other hand, one of the advantageous haplotypes is likely to fix, and deleterious haplotypes created by gene conversion are eliminated immediately from the population. This state gives a great opportunity for further functional divergence. The purpose of this article is to consider how strong selection must be to maintain the state where an advantageous haplotype is nearly fixed under the pressure of gene conversion. The model considers the joint action of selection, mutation, gene conversion, recombination, and random genetic drift as factors that determine the pattern of haplotype polymorphism in the duplicated genes. Because this study considers a relatively short-time evolutionary process (i.e., polymorphism) in a small multigene family, other mechanisms such as unequal crossing over and the birth-and-death process, which might play important roles in middle- or large-size multigene families, are ignored (25, 26).

Under the model described above, the pattern of polymorphism is summarized by hw and hb, where hw is the heterozygosity within a locus, and hb is the probability that a pair of alleles randomly chosen from different loci is not identical (27). When an advantageous haplotype is nearly fixed by selection, it is expected that hb is almost one and that hw is very small. On the other hand, when selection is not strong, hw might be relatively large and hb is much smaller than one. In this article, approximate equations for the expectations of hw and hb are obtained by using a diffusion method when selection is relatively strong. The theoretical result is very different from that under neutrality (21, 27, 28).

As an application of the theory, human rhesus (RH) genes are considered. On the short arm of chromosome 1, there are two closely linked RH genes, RHCE and RHD, which encode the CcEe and D blood group antigens (29). The DNA sequence identity between the two genes is high (≈97%), and their exon–intron structures are very similar (30), indicating they were created by a tandem gene duplication event (for review, see ref. 31). It is estimated that the gene duplication occurred 5–12 million years ago (32). The observed high level of amino acid replacement variation between the two genes in a short region around exon 7 might be explained by the model developed in this article, suggesting that strong selection might be working to maintain the RHCE/RHD antigen variation in the human population.

Theory

Consider two linked loci, I and II, in a random mating population with 2N haploids or N diploids. We consider two alleles, A and B, so there are four haplotypes, A-A, A-B, B-A, and B-B. The fitnesses of these haplotypes are given by 1 – s, 1, 1 and 1 – s, respectively. It is assumed that the symmetric mutation rate between the two alleles is μ per locus per generation. The recombination rate between the two loci is assumed to be r per generation. Intrachromosomal gene conversion occurs at rate c per locus per generation, e.g., A-B changes into A-A with probability c and into B-B with the same probability. This is a simple case of Ohta's model (33). Let the frequencies of A-A, A-B, B-A, and B-B be x1, x2, x3, and x4 (x1 + x2 + x3 + x4 = 1), respectively. Given x1, x2, x3, and x4, their expectations in the next generation might be given by the following recursion equations:

[Equation 1a: image not reproduced]
[Equation 1b: image not reproduced]
[Equation 1c: image not reproduced]
[Equation 1d: image not reproduced]

where D = x1x4 − x2x3. These recursions treat mutation, recombination, gene conversion, and selection independently, although these processes (especially recombination and gene conversion) may not be biologically independent. However, it should be noted that these events can be treated independently in a sufficiently large population (i.e., continuous time approximation). In a diploid population, Eqs. 1a–d might have a problem in the treatment of selection, which will be discussed later.
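As an illustration only, the verbal description of the model can be turned into a short Python sketch of one deterministic generation. The function name `next_generation`, the order in which the four processes are applied, and the first-order treatment of mutation are assumptions made here for illustration; they need not match the exact form of Eqs. 1a–d.

```python
def next_generation(x, s, mu, c, r):
    """Expected frequencies of A-A, A-B, B-A, B-B after one generation.

    Assumed order (illustrative): selection, mutation, gene conversion,
    recombination. Rates are per generation, as in the text.
    """
    x1, x2, x3, x4 = x

    # Selection: fitnesses are 1 - s, 1, 1, 1 - s.
    w_bar = 1.0 - s * (x1 + x4)
    x1, x4 = (1.0 - s) * x1 / w_bar, (1.0 - s) * x4 / w_bar
    x2, x3 = x2 / w_bar, x3 / w_bar

    # Symmetric mutation at rate mu per locus (terms of order mu^2 ignored).
    y1 = x1 + mu * (x2 + x3 - 2.0 * x1)
    y2 = x2 + mu * (x1 + x4 - 2.0 * x2)
    y3 = x3 + mu * (x1 + x4 - 2.0 * x3)
    y4 = x4 + mu * (x2 + x3 - 2.0 * x4)

    # Gene conversion: A-B and B-A each become A-A or B-B with probability c.
    z1 = y1 + c * (y2 + y3)
    z2 = y2 * (1.0 - 2.0 * c)
    z3 = y3 * (1.0 - 2.0 * c)
    z4 = y4 + c * (y2 + y3)

    # Recombination at rate r acts through D = x1*x4 - x2*x3.
    d = z1 * z4 - z2 * z3
    return (z1 - r * d, z2 + r * d, z3 + r * d, z4 - r * d)
```

Starting from a population fixed for A-B, conversion alone moves a fraction c of haplotypes into each of the two deleterious classes per generation, which is the pressure that selection has to counteract.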

The goal of this section is to obtain the expectations of hw and hb at equilibrium. In this model, their expectations are given by

[Equation 2: image not reproduced]

where p (= x1 + x2) and q (= x1 + x3) are the frequencies of A at loci I and II, respectively. Note that this symmetric model predicts E(p) = E(q) = 0.5 and E(p²) = E(q²). To obtain E(hw) and E(hb), we consider the expectations of the moments of allele frequencies by using a diffusion method (34). At equilibrium, it is known that a function, g(x1, x2, x3), satisfies the following equation:

[Equation 3: image not reproduced]

where L is the differential operator of the Kolmogorov backward equation (34–36). In this model, L(g) is given by

[Equation 4: image not reproduced]

Transforming the three variables, x1, x2, and x3 into p, q and D, Eq. 4 becomes

[Equation 5: image not reproduced]

and

[Equation 6: image not reproduced]

where θ = 4Nμ, C = 4Nc and R = 4Nr.

From Eqs. 3, 5, and 6, we consider approximate solutions for the moments of p and q under the assumption of relatively strong selection, because the diffusion equation can be solved exactly only when Ns = 0 (27). Because the model is symmetrical, it is obvious that

[Equation 7: image not reproduced]

When the effect of selection on allelic variation is large, the frequencies of deleterious haplotypes, A-A and B-B, should be very small in the population. Therefore, it may be possible to assume that the sum of p and q is approximately 1. Under this assumption, we have the following approximations:

[Equation 8: image not reproduced]

Because the amount of linkage disequilibrium (D) is much smaller than E(p²), E(p³), and E(p⁴), it is assumed that

[Equation 9: image not reproduced]

Then, letting g = p, pq and D with Eqs. 7–9, Eq. 5 gives the following three equations:

[Equation 10: image not reproduced]
[Equation 11: image not reproduced]
[Equation 12: image not reproduced]

Solving these equations, we have

[Equation 13: image not reproduced]
[Equation 14: image not reproduced]

and

[Equation 15: image not reproduced]

It should be noted that these expectations are independent of the recombination rate. From Eq. 13, the expectations of hw and hb are given by

[Equation 16: image not reproduced]

and

[Equation 17: image not reproduced]
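For a concrete reading of the two summary statistics, the following Python sketch computes hw and hb from a set of haplotype frequencies. It assumes that hw is the average of the single-locus heterozygosities 2p(1 − p) and 2q(1 − q), and that hb = p(1 − q) + q(1 − p); these expressions follow from the verbal definitions in the text, and the function name `heterozygosities` is hypothetical.

```python
def heterozygosities(x):
    """Return (hw, hb) for haplotype frequencies x = (x1, x2, x3, x4).

    x1..x4 are the frequencies of A-A, A-B, B-A, B-B. The averaging of
    hw over the two loci is an assumption made for this illustration.
    """
    x1, x2, x3, x4 = x
    p = x1 + x2  # frequency of A at locus I
    q = x1 + x3  # frequency of A at locus II
    hw = 0.5 * (2.0 * p * (1.0 - p) + 2.0 * q * (1.0 - q))
    hb = p * (1.0 - q) + q * (1.0 - p)
    return hw, hb
```

When one advantageous haplotype is fixed, for example x = (0, 1, 0, 0), the function returns hw = 0 and hb = 1; when the two advantageous haplotypes are equally frequent, x = (0, 0.5, 0.5, 0), both statistics equal 0.5.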

To check the theoretical results, Monte Carlo simulations were carried out with N = 1,000 and θ = 0.01. For each parameter set, a simulation was run for 100,000N generations, in which the pseudosampling method (37) was used to determine the haplotype frequencies generation by generation, and p², p³, p⁴, hw, and hb were calculated every N generations. The averages of p², p³, p⁴, hw, and hb are almost independent of R, as expected, and in very good agreement with the approximate Eqs. 13–17 when selection is relatively strong. Part of the results is shown in Fig. 1, in which the averages of hw and hb for R = 0 and 100 are plotted against C. It is demonstrated that Eqs. 16 and 17 are good approximations for hw and hb when C is smaller than Ns/5. If C is higher than Ns/5, hw from the simulation is smaller than Eq. 16, and hb from the simulation is larger than Eq. 17. The deviation from the theory is bigger when R is small. Eqs. 16 and 17 and Fig. 1 demonstrate that hw and hb approach 0 and 1, respectively, as selection intensity increases, indicating that selection works to keep one of the advantageous haplotypes (A-B or B-A) at a very high frequency in a population. That is, because of genetic drift, this model does not predict the state where both advantageous haplotypes coexist at intermediate frequencies (e.g., x2 ≈ x3 ≈ 0.5 so that hw ≈ hb ≈ 0.5).

Fig. 1. Results of simulations for hw and hb. The lines represent the theoretical results from Eqs. 16 and 17.
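A simulation of this kind can be sketched in Python as follows. This is a minimal illustration, not the procedure used for Fig. 1: it replaces the pseudosampling method with plain multinomial (Wright-Fisher) sampling of 2N haplotypes, starts from a population fixed for A-B, uses no burn-in, and applies selection, mutation, conversion, and recombination in an order chosen here for convenience. The names `expected_next` and `simulate` are hypothetical.

```python
import random
from collections import Counter


def expected_next(x, s, mu, c, r):
    """Expected frequencies of A-A, A-B, B-A, B-B after one generation."""
    x1, x2, x3, x4 = x
    w_bar = 1.0 - s * (x1 + x4)                       # selection
    x1, x4 = (1.0 - s) * x1 / w_bar, (1.0 - s) * x4 / w_bar
    x2, x3 = x2 / w_bar, x3 / w_bar
    y1 = x1 + mu * (x2 + x3 - 2.0 * x1)               # symmetric mutation
    y2 = x2 + mu * (x1 + x4 - 2.0 * x2)
    y3 = x3 + mu * (x1 + x4 - 2.0 * x3)
    y4 = x4 + mu * (x2 + x3 - 2.0 * x4)
    z1, z4 = y1 + c * (y2 + y3), y4 + c * (y2 + y3)   # gene conversion
    z2, z3 = y2 * (1.0 - 2.0 * c), y3 * (1.0 - 2.0 * c)
    d = z1 * z4 - z2 * z3                             # recombination
    return (z1 - r * d, z2 + r * d, z3 + r * d, z4 - r * d)


def simulate(n_diploid, s, theta, big_c, big_r, generations, seed=0):
    """Return the averages of hw and hb, recorded every N generations."""
    rng = random.Random(seed)
    two_n = 2 * n_diploid
    mu = theta / (4.0 * n_diploid)   # theta = 4N*mu
    c = big_c / (4.0 * n_diploid)    # C = 4N*c
    r = big_r / (4.0 * n_diploid)    # R = 4N*r
    x = (0.0, 1.0, 0.0, 0.0)         # assumed start: A-B fixed
    hw_sum = hb_sum = 0.0
    samples = 0
    for gen in range(1, generations + 1):
        probs = [max(v, 0.0) for v in expected_next(x, s, mu, c, r)]
        # Genetic drift: draw 2N haplotypes from the expected frequencies.
        counts = Counter(rng.choices(range(4), weights=probs, k=two_n))
        x = tuple(counts.get(i, 0) / two_n for i in range(4))
        if gen % n_diploid == 0:
            p, q = x[0] + x[1], x[0] + x[2]
            hw_sum += p * (1.0 - p) + q * (1.0 - q)
            hb_sum += p * (1.0 - q) + q * (1.0 - p)
            samples += 1
    if samples == 0:
        raise ValueError("generations must be at least n_diploid")
    return hw_sum / samples, hb_sum / samples
```

A call such as `simulate(100, 0.2, 0.01, 1.0, 0.0, 20000)` runs a much smaller population and far fewer generations than the settings quoted above; pure Python is too slow for N = 1,000 over 100,000N generations, so the sketch is meant for exploring qualitative behavior only.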

The simulations also demonstrate that a population reaches its equilibrium state very quickly after advantageous mutations are introduced (data not shown). The time from the appearance of an advantageous mutation to equilibrium is similar to the fixation time of an advantageous allele with fitness 1 + s in a single locus system [≈–2 ln(1/2N)/s; see ref. 17].
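To give a sense of scale for the approximation quoted above, the following Python snippet evaluates −2 ln(1/2N)/s. The value of s in the usage note is a hypothetical example, not a value taken from the simulations.

```python
import math


def approx_fixation_time(n_diploid, s):
    """Approximate time, in generations, given by -2 ln(1/(2N)) / s."""
    return -2.0 * math.log(1.0 / (2.0 * n_diploid)) / s
```

For N = 1,000 and an illustrative s = 0.01, the expression gives roughly 1,500 generations, which is short compared with the 100,000N generations over which the simulations were averaged.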

The theoretical results (Eqs. 13–17) might hold in a diploid population with size N, even though there is a problem in the recursion equations (Eqs. 1a–d). The recursions can be used in a diploid population only when the fitness effects of haplotypes are additive. For example, the fitness of a diploid with A-A and B-B is 1 – 2s, the fitness of a diploid with A-B and B-A is 1, and so on. This additive assumption is not consistent with the model considered here, in which diploids with genotype AABB should be most advantageous. That is, the fitness of a diploid with A-A and B-B is 1, not 1 – 2s. Nevertheless, Eqs. 13–17 might hold in a diploid population for the following reason. Because strong selection is assumed, most individuals are homozygotes of one of the two advantageous haplotypes, say A-B, and others could be heterozygotes of A-B and one of the two deleterious haplotypes (A-A or B-B), as indicated by the result that hw ≈ 0 and hb ≈ 1. This means that the effect of fitness of the heterozygotes of A-A and B-B may be negligible because they appear only with frequency 2x1x4, which is extremely small when x1 and x4 are much smaller than 1. In other words, whatever the fitness of a heterozygote of A-A and B-B, x1 and x4 cannot increase in a diploid population with the assumption of strong selection, because deleterious individuals with genotypes AAAA and BBBB appear with frequency [inline expression not reproduced].

Nucleotide Polymorphism in RHCE and RHD Genes

As an application of the theoretical results, the human RH genes, RHCE and RHD, are considered. Twenty-two complete coding sequences (five RHCE and 17 RHD) were obtained from GenBank. These sequences are aligned together, and the summary of the amounts of nucleotide variation is shown in Table 1. Forty-one amino acid replacement polymorphic sites are detected in a total of 50 segregating sites. As expected from other reports (e.g., refs. 31 and 39), the number of replacement polymorphic sites is very large. The average numbers of pairwise differences per site (πw) in the RHCE and RHD genes are 0.00526 and 0.00719, much higher than the genome average (0.0007–0.001; refs. 2, 40, and 41). These observations might be a signature of selection favoring amino acid changes (e.g., ref. 42).

Table 1. Summary of the amounts of nucleotide variation in human RH genes.

                          L       SW            πw (×100)
                                  Rep    Syn    Rep      Syn      Total
Within RHCE (n = 5)
    Exons 1-5             801     10     2      0.897    0.604    0.824
    Exons 6-10            453     0      0      0        0        0
    Total                 1254    10     2      0.576    0.380    0.526
Within RHD (n = 17)
    Exons 1-5             801     25     5      1.183    0.506    1.016
    Exons 6-10            453     3      2      0.106    0.445    0.195
    Total                 1254    28     7      0.799    0.483    0.719

                          L       SS            SF            πb (×100)
                                  Rep    Syn    Rep    Syn    Rep      Syn      Total
Between RHCE and RHD
    Exons 1-5             801     9      2      0      0      2.383    0.742    1.977
    Exons 6-10            453     0      0      13     2      4.724    2.592    4.168
    Total                 1254    9      2      13     2      3.219    1.432    2.769

SW, number of polymorphic sites within the gene; SS, number of shared polymorphic sites; SF, number of fixed polymorphic sites; n, sample size; L, sequence length; Rep, replacement; Syn, synonymous.

Following ref. 43, the 50 segregating sites are classified into three groups: “specific polymorphic sites,” where polymorphism is observed in either of the two genes; “shared polymorphic sites,” where two nucleotides are segregating in both genes; and “fixed polymorphic sites,” where each gene has a different fixed nucleotide. The observed numbers of these three types of polymorphic sites are 24, 11, and 15, respectively. The existence of a relatively large number of shared polymorphic sites indicates that DNA variation in the two RH genes has been frequently homogenized, probably by gene conversion (43, 44). As illustrated in Fig. 2, the distributions of shared and fixed polymorphic sites are not uniform. All 11 shared polymorphic sites are in the first half of the coding region (exons 1–5), whereas all 15 fixed sites are in the remaining region (exons 6–10). This striking difference in the numbers of the two classes of polymorphic sites is highly significant (P < 10⁻⁶; Fisher's exact test), even though the test is conservative due to the nonindependence of the two regions (43). This indicates that the mechanisms maintaining DNA variation in the two regions are very different. In the following analysis, therefore, the two regions are considered separately.
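The classification and the test can be illustrated with a short Python sketch. The function `classify_site` is a simplified reading of the three categories (it treats any site polymorphic in both genes as shared, without checking that the same two nucleotides segregate), and `fisher_exact_two_sided` is a generic hypergeometric implementation written for this example; neither is taken from ref. 43.

```python
from math import comb


def classify_site(alleles_gene1, alleles_gene2):
    """Classify one aligned site from the nucleotides seen in each gene."""
    set1, set2 = set(alleles_gene1), set(alleles_gene2)
    poly1, poly2 = len(set1) > 1, len(set2) > 1
    if poly1 and poly2:
        return "shared"
    if poly1 or poly2:
        return "specific"
    if set1 != set2:
        return "fixed"
    return "monomorphic"


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1.0 + 1e-9))


# Rows: exons 1-5, exons 6-10; columns: shared sites, fixed sites.
p_value = fisher_exact_two_sided(11, 0, 0, 15)
```

With the counts reported in the text (11 shared sites, all in exons 1–5, and 15 fixed sites, all in exons 6–10), the observed table is the most extreme one possible given its margins, and `p_value` falls below 10⁻⁶.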

Fig. 2. Distributions of the number of shared and fixed polymorphic sites. A window analysis is conducted, in which a 100-bp window is moved at 20-bp increments. The gene structures of the human RHCE and RHD genes (Upper) are according to refs. 30 and 49.

First, consider whether a neutral model can explain the observation. Assuming no selection, the mutation and gene conversion parameters can be estimated by a method of ref. 27. This method uses πw, πb and linkage disequilibrium to estimate θ, C, and R. Because the sequences obtained from GenBank are from independent chromosomes, linkage disequilibrium cannot be calculated. Therefore, θ and C are estimated from πw and πb, assuming free recombination (R = ∞). This assumption is not unreasonable, because the distance between the two genes is ≈80 kb, so that R may not be small when the recombination rate is on the same order as the mutation rate (45, 46). The effect of recombination on πw and πb is very small unless R is low (27). Given πw = 0.00920 and πb = 0.01977 in exons 1–5, θ and C are estimated to be 0.0047 and 0.423, respectively. The estimate of the gene conversion rate is ≈90 times larger than the mutation rate. This ratio is in the range of that in the Amy loci in Drosophila melanogaster, where C is estimated to be 60–165 times larger than θ (27). On the other hand, in exons 6–10, C is estimated to be 0.011, which is about 1/40 of the estimate in exons 1–5. Thus, a neutral model with very different levels of gene conversion might explain the data.
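The logic of estimating θ and C from πw and πb can be illustrated with a rough moment calculation. The sketch below is not the estimator of ref. 27. It assumes a simple infinite-sites coalescent argument with free recombination, under which E(πw) = 2θ and E(πb) = θ(2 + 1/C); both expressions, and the function name `neutral_estimates`, are assumptions introduced here for illustration.

```python
def neutral_estimates(pi_w, pi_b):
    """Rough moment estimates of theta and C under the assumed relations

    E(pi_w) = 2 * theta  and  E(pi_b) = theta * (2 + 1 / C),
    so that pi_b - pi_w = theta / C.
    """
    theta = pi_w / 2.0
    big_c = theta / (pi_b - pi_w)
    return theta, big_c


# Exons 1-5: values of pi_w and pi_b quoted in the text.
theta_1_5, c_1_5 = neutral_estimates(0.00920, 0.01977)

# Exons 6-10: pi_w taken as the mean of the RHCE and RHD totals in Table 1.
theta_6_10, c_6_10 = neutral_estimates((0.0 + 0.00195) / 2.0, 0.04168)
```

Under these simplified relations the estimates come out close to, but not identical with, the values reported in the text, and they reproduce the qualitative picture: C is roughly two orders of magnitude larger than θ in exons 1–5 and much lower in exons 6–10.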

However, it is important to notice that the two genes were identical when gene duplication occurred. That is, it is not unreasonable to consider that gene conversion used to occur in exons 6–10 as frequently as in exons 1–5 in the initial stages of the duplicated genes. It is suggested that some kind of mechanism worked to dramatically reduce the level of gene conversion in exons 6–10. A relatively high level of divergence between the two genes in exons 6–10 (4%) might contribute to the reduction in the gene conversion rate. A drastic change in the DNA sequence caused by an indel could be a barrier to restrict gene conversion (e.g., ref. 18), although no such big indels are found in the flanking region of exon 7 (30).

Another mechanism is selection, which could reduce the level of gene conversion effectively (i.e., selection does not favor gene conversion to maintain the variation between the two genes, as demonstrated in the previous section). The very high level of amino acid differences between the two genes might support the selection hypothesis. As shown in Table 1, there are 15 fixed sites in this region, of which 13 are amino acid replacement changes. The ratio of the rate of nonsynonymous substitution (Ka) to the rate of synonymous substitution (Ks) is ω = Ka/Ks = 3.25. Unfortunately, ω is not significantly larger than 1, but because this test is extremely conservative, ω > 1 is sometimes considered as evidence for positive selection. The spatial distribution of the 15 fixed sites might also support the selection hypothesis. All 15 sites are in a relatively short region around exon 7. The length of this cluster is 121 bp, significantly shorter than expected (P < 10⁻⁵; permutation test). It should be noted that these amino acid changes characterize the difference between the RHCE and RHD antigens. Exon 7 encodes amino acids in the transmembrane and cytoplasmic domains and those on the exofacial surface. It is suggested that the RHCE/RHD antigen variation might be maintained in the human population by strong selection.
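One possible form of a permutation test for clustering is sketched below in Python. The text does not describe the permutation scheme, so the choices here are assumptions: sites are placed uniformly at random without replacement along a sequence of length `seq_length`, the test statistic is the span from the first to the last site, and the P value uses the conventional (hits + 1)/(permutations + 1) correction. The function name and parameters are hypothetical.

```python
import random


def cluster_span_p_value(n_sites, observed_span, seq_length,
                         n_perm=10000, seed=0):
    """Estimate P(span <= observed_span) for randomly placed sites."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Draw n_sites distinct positions uniformly along the sequence.
        positions = rng.sample(range(1, seq_length + 1), n_sites)
        span = max(positions) - min(positions) + 1
        if span <= observed_span:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the estimated P value cannot fall below 1/(n_perm + 1), resolving a threshold such as 10⁻⁵ requires more than 10⁵ permutations. The appropriate `seq_length` (for example, the 1,254-bp coding sequence or only exons 6–10) depends on the null hypothesis and is left as a parameter.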

If this is the case, it might be possible to apply the two-locus model developed in this article to the target nucleotide site of selection in the RHCE and RHD genes. As discussed above, if strong selection works to maintain two different alleles (nucleotides) in the two-locus (site) system, the theory predicts that hb is close to 1 and hw is very small at the target site of selection. Although we do not know the exact position(s) of the target site(s) of selection, the pattern of polymorphism at all 16 replacement polymorphic sites in exons 6–10 is compatible with theoretical prediction. Here, we attempt to estimate the selection intensity for these sites. At the target site of selection, from Eqs. 16 and 17, the selection intensity is given by

[Equation 18: image not reproduced]

when θ is very small. Because this equation can be applied to only a single pair of target sites of selection, and only candidate sites are known (see above), some assumptions are needed to estimate the selection intensity. First, we assume that selection is working at all replacement polymorphic sites at equal intensity. The averages of hw in RHCE and RHD for the 16 sites are 0 and 0.022, respectively (the overall average is 0.011), and the average hb is 0.989, so that the observed amounts of variation within and between the two genes can be explained when Ns is ≈45 times larger than C. If we use the estimate of C in exons 1–5, Ns is ≈20.
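The arithmetic behind these two figures can be checked with a small Python sketch. Since Eq. 18 is not reproduced here, the sketch assumes a leading-order relation Ns/C ≈ 1/(2hw) for small hw; this relation is inferred only because it is consistent with the numbers quoted in the text, and the exact form of Eq. 18 may differ. The function name `ns_over_c` is hypothetical.

```python
def ns_over_c(hw):
    """Assumed leading-order relation Ns / C = 1 / (2 * hw) for small hw."""
    return 1.0 / (2.0 * hw)


ratio = ns_over_c(0.011)       # average hw over the 16 replacement sites
ns_estimate = ratio * 0.423    # using the estimate of C from exons 1-5
```

With hw = 0.011 the ratio is about 45, and multiplying by C = 0.423 gives a value of Ns a little below 20, in line with the rounded figures in the text.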

The selection intensity might be underestimated because of the assumption that selection is acting at all 16 replacement polymorphic sites at equal intensity. Because the peak of the fixed polymorphic sites is in the middle of exon 7 (Fig. 2), it might be reasonable to consider that the target site(s) is in exon 7. Because all 13 replacement variations in exon 7 are fixed sites, hw = 0 and hb = 1, and an estimate of Ns is infinity. A bigger sample size is needed to obtain a correct estimate of Ns. Using C = 0.423 for the gene conversion parameter in exons 6–10 might overestimate Ns if the gene conversion rate in exons 6–10 is lower than that in exons 1–5, as discussed above.

Discussion

Model and Theory. A two-locus gene conversion model with selection is developed. Under the joint action of selection, mutation, gene conversion, recombination, and random genetic drift, the pattern of allelic polymorphism is investigated by a diffusion method. Approximate formulas for the expectations of the moments of allele frequencies and the expected amounts of variation within and between two loci are obtained by assuming relatively strong selection. These expectations are given by functions of Ns, θ, and C, indicating they are nearly independent of the recombination rate. The approximate formulas for the expectations of p², p³, p⁴, hw, and hb are in excellent agreement with the results of simulations when Ns is more than five times larger than C. The theoretical results demonstrate that hw and hb approach 0 and 1, respectively, as selection intensity increases. This indicates that selection works to keep one of the advantageous haplotypes (A-B or B-A) at a very high frequency instead of maintaining both at intermediate frequencies. This behavior may be similar to that under Kimura's model of compensatory evolution (see refs. 47 and 48).

Under this model, selection and gene conversion act in opposite directions; that is, gene conversion produces deleterious haplotypes, and selection works to eliminate them. Therefore, the pattern of allelic variation is determined mainly by the balance between gene conversion and selection as shown by Eq. 18. This equation indicates that very strong selection is needed to keep two different alleles in a population when gene conversion is active. In a population in which two alleles are nearly fixed (e.g., hw < 0.01 and hb > 0.99), Ns/C may be >50, indicating that successful evolution of new gene function might not occur without strong selection unless C is very small. This result is compatible with other theoretical studies, which indicate that one of the duplicated genes is silenced relatively quickly after duplication (e.g., refs. 12–16). Recently, Lynch and Conery (1) demonstrated that the majority of duplicated genes become pseudogenes within a few million years, based on a survey of the genomic databases of several model species.

The model does not include null mutations by which genes are silenced. Although this is one of the important fates of duplicated genes, it might be possible to ignore such mutations in this model with strong selection, because selection might eliminate them immediately from the population.

Evolution of the Human RHCE and RHD Genes. DNA polymorphism in human RHCE and RHD genes is analyzed. Because the pattern of DNA polymorphism in exons 1–5 is completely different from that in exons 6–10, the two regions are analyzed separately. It is shown that ≈35% of segregating sites in exons 1–5 are shared polymorphic sites, indicating frequent gene conversion in this region. On the other hand, there is no shared polymorphism in the remaining coding region (exons 6–10). Instead, a large proportion of segregating sites (15 of 20) are fixed sites, and most of them (13 of 15) are amino acid replacement changes. Because all fixed sites are around exon 7, it is suggested that some kind of mechanism might be working in this region to accelerate the sequence divergence between the two genes. Although a reduction in the gene conversion rate due to a drastic change of the DNA sequences caused by indels might be one explanation, no such big indels are found in the flanking region of exon 7. An alternative and more likely explanation is that selection is acting to maintain the high level of amino acid differences between the two genes, and many aspects of the observed pattern of DNA variation could support this hypothesis. Assuming this is the case, the two-locus selection model developed in this article is applied to the data to estimate selection intensity. It is suggested that very strong selection (Ns is at least 45 times larger than C) is needed to explain the observed pattern of polymorphism.

Under the selection hypothesis, the evolutionary history of the human RHCE and RHD genes is inferred. The history of the duplicated genes started with two identical sequences created by tandem duplication. In the initial stages, gene conversion occurred quite frequently. The gene conversion parameter, C, might have been 0.4 or so, as estimated from the present-day polymorphism data in exons 1–5. This level of gene conversion is high enough to keep the two genes nearly identical. Then an advantageous mutation was introduced (probably in exon 7) and fixed in one gene. Selection might have been so strong that this fixation state was nearly stable and continued for quite a long time. During this state, additional mutations were accumulated near the target site of selection, creating the high level of sequence divergence between the two genes around exon 7. In exons 1–5, which are at least 3–4 kb upstream of exon 7, the sequence identity between the two genes has been maintained high by frequent gene conversion. The present human RHCE and RHD genes might be in their initial stage of further functional divergence, starting in the short region around exon 7. This region of high divergence might spread if the divergence itself reduces the gene conversion rate.

Although the application of the two-locus selection model to the human RHCE and RHD genes seems successful, there are a few caveats. The first concerns the well known RHD-negative chromosomes on which the RHD gene is either absent or silenced. Although the frequency of chromosomes with no RHD gene might be relatively low, there may be some effect of such chromosomes on the model. There might also be the possibility of other minor duplication and deletion polymorphism in this region. The second is the possibility of selection that might be working on exons 1–5. It is known that the RHCE gene encodes four types of antigens, CE, Ce, cE, and ce, which are characterized by the two amino acid positions 103 (exon 2) and 226 (exon 5) (reviewed in ref. 31). These antigen polymorphisms within RHCE might be under selection. The unusually high level of nonsynonymous polymorphic sites (80%) in exons 1–5 might also be a signature of selection. This selection might overestimate θ and C. The last is the theoretical problem in the treatment of selection in a diploid population. This might not much affect the results under the assumption of strong selection as discussed above, but when selection is weak, the problem should be considered seriously.

Acknowledgments

I thank D. Hewett-Emmett, M. Nordborg, T. Ohta, N. Rosenberg, F. Tajima, K. Teshima, B. Walsh, and two anonymous reviewers for comments and discussions.

Abbreviation: RH, rhesus.
