Bakewell et al. 10.1073/pnas.0701705104.

Supporting Information

Files in this Data Supplement:

SI Materials and Methods
SI Table 3
SI Table 4
SI Table 5
SI Table 6
SI Table 7
SI Table 8
SI Table 9
SI Figure 4
SI Table 10
SI Table 11
SI Table 12
SI Table 13




SI Figure 4

Fig. 4. Distribution of human and chimp PSGs across chromosomes. The human chromosome numbers are used. Shown are 13,714 genes in the Q20 dataset for which recombination rate data are available. Genes located in segments with a recombination rate in the lowest quintile of all 1-megabase segments in the genome are colored blue; genes in the second, third, fourth, and fifth quintiles of recombination rate are colored green, yellow, orange, and red, respectively. A total of 152 human PSGs (filled diamonds) and 228 chimp PSGs (open diamonds) for which recombination data are available are shown to the right of each chromosome. For all genes, position along the chromosome corresponds to the midpoint of the gene. There is a weak tendency for PSGs to have higher recombination rates than expected by chance (P = 0.32 and 0.04 for human and chimp, respectively; simulation test).





SI Materials and Methods

Use of the 6× Chimp Genome Assembly.

The 233 chimp PSGs identified by using the Q20 data from the 4× chimp sequence were reanalyzed using sequences from the 6× chimp genome assembly (panTro2; www.genome.ucsc.edu). The 6× sequences corresponding to 4× sequences of the 233 PSGs were found by using BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start). Codons with one or more bases having a quality score less than Q20 in the 4× assembly were eliminated, as described in Materials and Methods. Of the 233 PSGs, 100 had a perfect match between the 4× and 6× assemblies. Forty-eight PSGs were aligned to the 6× assembly with no gaps, but with mismatches of 0.02-0.13%. Eighty-five PSGs were aligned to 6× with some gaps, ranging from 0.02 to 63.0%. Codons having one or more bases missing or ambiguous (i.e., N) in the 6× assembly were also eliminated, and the resulting sequence was aligned to the human and macaque sequences. This alignment was used in the branch-site test of positive selection in the chimp branch, as described in Materials and Methods.
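The codon-level filtering described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `codons_to_keep`, its arguments, and the choice to combine the quality filter and the missing/ambiguous-base filter in one pass are assumptions made for the example.

```python
def codons_to_keep(seq, quals, min_q=20):
    """Return the indices of codons that pass both filters.

    seq   : in-frame coding sequence (string of bases)
    quals : per-base quality scores, same length as seq
    min_q : minimum acceptable quality (Q20 in the text)

    A codon is dropped if any of its three bases has a quality score
    below min_q, or if any base is missing or ambiguous (anything other
    than A, C, G, T). A trailing partial codon is ignored.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")

    kept = []
    n_codons = len(seq) // 3
    for idx in range(n_codons):
        start = 3 * idx
        codon = seq[start:start + 3].upper()
        codon_quals = quals[start:start + 3]
        if min(codon_quals) < min_q:
            continue  # low-quality base in this codon
        if any(base not in "ACGT" for base in codon):
            continue  # missing or ambiguous base (e.g., N)
        kept.append(idx)
    return kept


# Hypothetical example: codon 1 contains an N, codon 2 contains a Q19 base.
example_seq = "ATGNCCGGA"
example_quals = [30, 30, 30, 30, 30, 30, 30, 19, 30]
kept_indices = codons_to_keep(example_seq, example_quals)
```

Returning codon indices rather than a trimmed sequence makes it straightforward to drop the same codon columns from the aligned human and macaque sequences.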

Of the 48 PSGs with no gaps, 42 still show positive selection in chimp, using the 6× sequence, whereas 6 no longer show the signal of positive selection. Of the 85 PSGs with gaps, 65 still show positive selection in chimp, whereas 20 appeared not to be under positive selection when the 6× alignment was used. Each of these 26 (6 + 20) apparent reversals was examined manually to determine the source of discrepancies between the 4× and 6× results. In five cases (all with gaps), it was determined that the 4× assembly was more accurate because of elimination of exons or other problems in the 6× assembly. In these cases, the 4× result was retained.

Performance of the Improved Branch-Site Likelihood Method.

Although there have been concerns about the performance of the likelihood method in detecting positive selection (1), the improved branch-site likelihood method was previously shown by computer simulations to produce reasonably good results, even when some of the assumptions are violated (2). To further verify the suitability of the method when the number of substitutions is as small as in the present context, we conducted additional simulations specifically designed to mimic the evolution of human, chimp, and macaque genes. The simulation procedure follows ref. 2. A tree of three taxa was used. The numbers of synonymous substitutions per site for the human, chimp, and macaque branches were set to 0.006, 0.006, and 0.058, respectively, because these were the actual numbers observed from our Q20 data for the three branches. Because the 13,888 alignments have a mean length of 432 codons and a standard deviation of 339 codons, we examined three different sequence lengths (150, 400, and 1,000 codons) in the simulation. To examine the type I error (i.e., false positives), we used model B1 (negative selection) to simulate sequence evolution in the background branches (macaque and chimp branches) and either model F1 or F2 for the foreground branch (human branch) (SI Table 3). Note that F1 and F2 do not contain sites under positive selection; rather, they represent partial and complete relaxation of negative selection, respectively. After the three sequences were generated, the likelihood method was used to detect positive selection in the human branch. Positive selection was inferred if the likelihood of the alternative model was significantly greater than that of the null model at the 5% significance level. Four hundred simulation replications were conducted. The results showed that the type I error is lower than the nominal rate of 5% in the case of partial relaxation of negative selection (SI Table 4).
In the case of complete relaxation of negative selection, the error rate is lower than, close to, and higher than the nominal rate for short, intermediate, and long sequences, respectively (SI Table 4). Because only 10% of our 13,888 genes have >800 codons, and because complete relaxation of negative selection is rare, it is expected that the slightly higher-than-nominal type I error observed in one condition of the simulation will have only a minimal influence on our results. Although the χ² approximation of the likelihood ratio test depends on the large-sample assumption, our simulation showed that the approximation is justified in the present context. This may be due to two factors. First, we used χ₁² instead of a 50:50 mixture of point mass 0 and χ₁² (2, 3), thus reducing type I errors. Second, the χ² approximation appears insensitive to sample size, as was found previously (4).
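The effect of the reference distribution can be illustrated with a short sketch that converts a pair of log-likelihoods into a P value under χ₁² and under the 50:50 mixture of point mass 0 and χ₁². The function names and the example log-likelihoods are hypothetical; the only facts relied on are that the test statistic is twice the log-likelihood difference and that the upper tail of χ₁² equals erfc(sqrt(x/2)). The last helper gives the binomial sampling error of an error rate estimated from a finite number of replicates, which is useful when judging whether an observed rate is "close to" the nominal 5%.

```python
import math


def chi2_1df_sf(x):
    """Upper-tail probability P(X > x) for X ~ chi-square with 1 df."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def lrt_pvalues(lnl_null, lnl_alt):
    """Return (statistic, P under chi2_1, P under 50:50 mixture).

    The statistic is 2 * (lnL_alt - lnL_null), truncated at zero.
    Under the mixture, half of the null mass sits at 0, so the tail
    probability for any positive statistic is half the chi2_1 tail.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_chi2 = chi2_1df_sf(stat)
    p_mixture = 0.5 * p_chi2 if stat > 0 else 1.0
    return stat, p_chi2, p_mixture


def rate_standard_error(rate, n_replicates):
    """Binomial standard error of a rate estimated from n replicates."""
    return math.sqrt(rate * (1.0 - rate) / n_replicates)


# Hypothetical log-likelihoods giving a statistic of 3.0.
stat, p_chi2, p_mixture = lrt_pvalues(-1001.5, -1000.0)

# Sampling error of a 5% error rate estimated from 400 replicates.
se_400 = rate_standard_error(0.05, 400)
```

In this example the statistic of 3.0 falls below the 5% critical value of χ₁² (about 3.84) but above that of the mixture (about 2.71), so the test is significant only under the mixture. This is the sense in which using χ₁² is the more conservative choice.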

We also examined the power of the statistical test in four simulations, by changing the background and foreground models (SI Table 3). The sequence length of 400 codons was used in this set of simulations. The four background models (B3-B6) differ in the level of mean ω. The corresponding foreground models (F3-F6) also have different mean ω, but have the same level of positive selection. The results showed that a higher background ω increases the detection rate of positive selection (SI Table 9).

We noticed that when the likelihood ratio test provides statistical evidence for positive selection in a gene, the estimated ω for the positively selected sites (class 2 codons) in the foreground branch is often very large (e.g., >100). This appears biologically unreasonable. We examined the accuracy of the estimated ω by using the simulations described in the previous paragraph. We allowed 30% of codons to be under positive selection in the foreground branch, with a mean ω for these positively selected codons equal to 5 (SI Table 3). However, as shown in SI Table 9, the estimated ω for class 2 codons has a mean of several hundred and a standard deviation of several hundred among the genes in which positive selection is detected by PAML. Thus, the simulations showed that, although the likelihood ratio test of positive selection is reliable, the estimation of ω (when >1) is problematic and unreliable. For this reason, we do not present the likelihood-estimated ω values.

1. Nei M (2005) Mol Biol Evol 22:2318-2342.

2. Zhang J, Nielsen R, Yang Z (2005) Mol Biol Evol 22:2472-2479.

3. Self SG, Liang K-Y (1987) J Am Stat Assoc 82:605-610.

4. Zhang J (1999) Mol Biol Evol 16:868-875.