Proceedings of the National Academy of Sciences of the United States of America. 2003 Apr 28;100(10):5896–5901. doi: 10.1073/pnas.0730857100

Natural variation in human membrane transporter genes reveals evolutionary and functional constraints

Maya K Leabman, Conrad C Huang, Joseph DeYoung§, Elaine J Carlson§, Travis R Taylor§, Melanie de la Cruz§, Susan J Johns, Doug Stryke, Michiko Kawamoto, Thomas J Urban, Deanna L Kroetz, Thomas E Ferrin, Andrew G Clark, Neil Risch, Ira Herskowitz‡‡,††, Kathleen M Giacomini†,††,§§; Pharmacogenetics of Membrane Transporters Investigators¶¶
PMCID: PMC156298  PMID: 12719533

Abstract

Membrane transporters maintain cellular and organismal homeostasis by importing nutrients and exporting toxic compounds. Transporters also play a crucial role in drug response, serving as drug targets and setting drug levels. As part of a pharmacogenetics project, we screened exons and flanking intronic regions for variation in a set of 24 membrane transporter genes (96 kb; 57% coding) in 247 DNA samples from ethnically diverse populations. We identified 680 single nucleotide polymorphisms (SNPs), of which 175 were synonymous and 155 caused amino acid changes, and 29 small insertions and deletions. Amino acid diversity (πNS) in transmembrane domains (TMDs) was significantly lower than in loop domains, suggesting that TMDs have special functional constraints. This difference was especially striking in the ATP-binding cassette superfamily and did not parallel evolutionary conservation: there was little variation in the TMDs, even in evolutionarily unconserved residues. We used allele frequency distribution to evaluate different scoring systems (Grantham, BLOSUM62, SIFT, and evolutionarily conserved/evolutionarily unconserved) for their ability to predict which SNPs affect function. Our underlying assumption was that alleles that are functionally deleterious will be selected against and thus underrepresented at high frequencies and overrepresented at low frequencies. We found that evolutionary conservation of orthologous sequences, as assessed by evolutionarily conserved/evolutionarily unconserved and SIFT, was the best predictor of allele frequency distribution and hence of function. European Americans had an excess of high-frequency alleles in comparison to African Americans, consistent with a historic bottleneck. In addition, African Americans exhibited a much higher frequency of population-specific medium-frequency alleles than did European Americans.


With the completion of the draft sequence of the human genome and the development of high-throughput sequencing methods, several large-scale investigations of human sequence variation have been carried out (1–4). These studies have provided valuable information about the nature and frequency of sequence variation in the human genome. For example, Cargill et al. (1) and Halushka et al. (3) identified differences in the level of genetic diversity among single-nucleotide polymorphism (SNP) types, such as coding and noncoding SNPs as well as synonymous and nonsynonymous SNPs. More recently, patterns of haplotype diversity across the human genome have been characterized (2, 4). These studies typically screened 24–40 chromosomes within an ethnic population and therefore identified common variants (frequencies ≥5%) with high accuracy but did not have the power to identify less common variants, which may have more severe functional consequences. Because the studies to date screened genes from a wide variety of structural and functional classes, little is known about the relative levels of genetic diversity within classes of genes.

Membrane transporters play a critical role in a variety of physiological processes. They maintain cellular and organismal homeostasis by importing nutrients essential for cellular metabolism and exporting cellular waste products and toxic compounds. Furthermore, membrane transporters are important in drug response as they provide the targets for many commonly used drugs and are major determinants of drug absorption, distribution, and elimination. Membrane transport proteins share a similar secondary structure, characterized by multiple membrane-spanning domains joined by alternating intracellular and extracellular segments (“loops”). Two of the major superfamilies of membrane transport proteins are the ABC (ATP-binding cassette) transporters, which include MDR1, a protein that pumps xenobiotics from cells, and the SLC (solute carrier) transporters, which take up neurotransmitters, nutrients, heavy metals, and other substrates into cells.

In this study, we screened for variation in a set of 24 genes encoding membrane transporters as part of a pharmacogenetics project that seeks to identify genes that determine drug response (Fig. 1). We identified variants by screening an ethnically diverse collection of genomic DNA samples, 494 chromosomes in total. Sequencing this functionally and structurally similar class of proteins in this DNA collection allowed us to determine the levels and patterns of genetic diversity in different ethnic groups, in different transporter families, and across different structural regions of membrane transporters. By combining population-genetic and phylogenetic analyses, we were able to identify amino acid residues and protein domains that may be important for human fitness. Our large sample set made it possible for us to obtain information about rare variants. We have used this greater statistical power to identify predictors of allele frequency distribution and thus to infer functional consequences of amino acid substitution.

Figure 1.

Twenty-four membrane transporters with potential roles in drug response. Transporters are grouped based on transporter family (e.g., OCT1, OCT2, and OCT3 belong to the SLC22 family; CNT1 and CNT2 belong to the SLC28 family). Transporters depicted by blue ovals belong to the SLC superfamily; red rectangles, ABC superfamily; green hexagon, P-type ATPase. Typical substrates for each family of transporters are listed. The direction of transport is indicated by an arrow pointing into the cell (influx) or out of the cell (efflux).

Materials and Methods

Screening Protocol.

Genomic DNA samples were obtained from the Coriell Institute (Camden, NJ). Primers for PCR were synthesized to specifically amplify each exon and a minimum of 35 bases of 5′ and 3′ flanking intronic sequence. Primers were designed by using the Virtual Genome PCR primer selection website (http://alces.med.umn.edu/websub.html). Genomic and cDNA sequences were obtained from GenBank. For optimal denaturing HPLC (DHPLC) and DNA sequence analyses, amplicons were designed to be 200–500 bp in length. Small, closely spaced exons were combined and analyzed in a single amplicon; exons >500 bp were divided into smaller, overlapping amplicons. AmpliTaq Gold DNA Polymerase (Applied Biosystems) and a GeneAmp PCR System 9700 Thermal Cycler (Applied Biosystems) were used for PCR. All exons except exon 1 of MRP1 and exon 11 of ENT2 were successfully amplified. To obtain a preview of the frequency of variants in a given amplicon, the amplicon was sequenced from six randomly chosen DNA samples (12 chromosomes). Amplicons with at least four variant chromosomes were subjected to direct DNA sequencing. Both the forward and reverse DNA sequences of purified PCR products were determined by using ABI PRISM BigDye terminator cycle sequencing Version 2.0 and an ABI PRISM 3700 DNA analyzer. DNA sequence files were imported into SEQUENCHER (Gene Codes, Ann Arbor, MI) and aligned with the amplicon reference sequence. Heterozygous variants were identified in aligned sequences and scored in SEQUENCHER. Amplicons with fewer than four variant chromosomes were subjected to multiplexed DHPLC analysis followed by direct DNA sequencing as follows: amplicons from three individuals were pooled, heteroduplexed, and analyzed by using a HELIX DHPLC System (Varian) at two temperatures. Elution profiles were scored by visual inspection, and the three samples in each pool that showed variant peaks were sequenced. Homoduplex pools were inferred to be reference sequence. Approximately 60% of the amplicons (244 of 405) were subjected to direct DNA sequencing and ≈40% of the amplicons (161 of 405) were analyzed by the two-step procedure.

Error Analysis.

To determine our false positive rate for identifying polymorphisms, we resequenced all 314 singleton SNPs. Ninety-five percent (297 of 314) of the singletons were verified on resequencing, indicating a 5% false positive rate (essentially identical to that of Stephens et al., ref. 4). To evaluate our false negative rate for identifying polymorphisms, we resequenced 14 amplicons (4,923 bp) in all 247 DNA samples. We identified 11 new SNPs in 1.2 × 10⁶ bp of resequenced DNA, leading to a frequency of 9.0 × 10⁻⁶ SNPs missed per bp screened. All of the missed SNPs were very rare: eight were singletons, two were doubletons, and one was a tripleton. This false negative rate is approximately half of that reported by Cargill et al. (1), who resequenced 10 genes (20,475 bp) in 20 DNA samples, yielding a frequency of 17.1 × 10⁻⁶ SNPs missed per bp screened. Variant identification for one gene, OCT1, was particularly problematic because of the presence of multiple SNPs and insertions and deletions in several of its amplicons. Mis-scoring of variants in these amplicons was detected by apparent violations of Hardy–Weinberg equilibrium. We therefore resequenced several amplicons of OCT1. The values reported for OCT1 in Tables 4–9, which are published as supporting information on the PNAS web site, www.pnas.org, are the results of resequencing.
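The two error rates quoted above follow from simple ratios. The short Python sketch below reproduces that arithmetic from the counts given in the text; the function and variable names are illustrative only and are not taken from the study.

```python
# Counts quoted in the Error Analysis paragraph.
SINGLETONS_RESEQUENCED = 314
SINGLETONS_VERIFIED = 297
NEW_SNPS_FOUND = 11
AMPLICON_BP = 4923      # total length of the 14 resequenced amplicons
DNA_SAMPLES = 247


def false_positive_rate(verified, total):
    """Fraction of called singletons that were not confirmed on resequencing."""
    return (total - verified) / total


def missed_snps_per_bp(missed, bp_per_sample, samples):
    """SNPs missed per base pair screened (the false negative measure used above)."""
    return missed / (bp_per_sample * samples)


fp_rate = false_positive_rate(SINGLETONS_VERIFIED, SINGLETONS_RESEQUENCED)
fn_rate = missed_snps_per_bp(NEW_SNPS_FOUND, AMPLICON_BP, DNA_SAMPLES)

print(f"false positive rate: {fp_rate:.1%}")
print(f"missed SNPs per bp:  {fn_rate:.2e}")
```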

Population Genetic Parameters.

The neutral parameter (θ), nucleotide diversity (π), and Tajima's D statistic were calculated as described by Tajima et al. (5). Each parameter was calculated for various gene regions (e.g., coding, noncoding, and intron-exon boundaries) as well as for various sites within the coding region [e.g., synonymous and nonsynonymous sites, evolutionarily conserved and unconserved sites, and sites within predicted transmembrane domains (TMDs) and loops]. Synonymous and nonsynonymous sites were defined as described by Hartl and Clark (6). The observed allele frequency distribution of noncoding, synonymous, and nonsynonymous SNPs was compared with the distribution expected under the neutral mutation model. According to the neutral mutation model, the expected number of SNPs (Gn) with a frequency of i/n can be obtained from the following equation: Gn(i) = θ[1/i + 1/(n − i)], where θ is the neutral parameter, n is the number of chromosomes sequenced, and i/n is the expected allele frequency. The expected and observed allele frequency distributions were compared by using a χ² test of binned data. Several different binnings were used. Our assessment of significance did not depend on binning.
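As an illustration of the neutral expectation Gn(i) = θ[1/i + 1/(n − i)], the following Python sketch tabulates the expected number of SNPs for each minor allele count. It is a minimal sketch, not the authors' code: the function name is hypothetical, θ is assumed to be the per-locus value (per-site θ multiplied by the number of sites), and the halving of the term at i = n/2 is a standard detail of the folded frequency spectrum that the text does not spell out.

```python
def expected_folded_sfs(theta, n):
    """Expected SNP counts by minor allele count i (1 <= i <= n // 2).

    theta: neutral parameter for the whole region (assumed per-locus scale).
    n:     number of chromosomes sequenced.
    """
    expected = {}
    for i in range(1, n // 2 + 1):
        g = theta * (1.0 / i + 1.0 / (n - i))
        if i == n - i:
            # At i = n/2 the two terms describe the same class, so count it once.
            g /= 2.0
        expected[i] = g
    return expected


# Toy usage: 494 chromosomes and an arbitrary per-locus theta of 100.
sfs = expected_folded_sfs(theta=100.0, n=494)
print(f"expected singletons: {sfs[1]:.1f}")
print(f"expected SNPs in total: {sum(sfs.values()):.1f}")
```

Summing the expected counts over all classes gives θ multiplied by the harmonic sum 1 + 1/2 + ... + 1/(n − 1), which is the relationship behind the usual segregating-sites estimate of θ. Observed and expected counts can then be binned and compared with a χ² test, as described above.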

Characterization of Amino Acid Residues.

Evolutionarily conserved and unconserved amino acid residues were classified based on sequence alignments with mammalian orthologs by using the GCG program PILEUP. At least three mammalian protein sequences (the human sequence and at least two from rat, rabbit, and mouse) were used for the alignments. All orthologs were at least 65% identical to the human sequence (7). For three transporters (FIC1, MRP1, and VMAT1), there were not enough mammalian orthologs to generate alignments. Therefore, these transporters were not included in the analysis of evolutionarily conserved and unconserved amino acids. An amino acid residue was classified as EC (evolutionarily conserved) if it was present in all species in the alignment. All other residues were classified as EU (evolutionarily unconserved). TMD and loop regions were assigned based on published topology data where available, otherwise on topology data from the SwissProt database (www.ebi.ac.uk/swissprot).
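The EC/EU rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequences are assumed to be already aligned (for example, exported from an alignment program), the human sequence is assumed to come first, and a gap in any ortholog is treated as a mismatch; the function name and the toy sequences are hypothetical.

```python
def classify_residues(aligned, gap="-"):
    """Label each human residue as 'EC' or 'EU' from a multiple alignment.

    aligned: list of equal-length aligned sequences, human sequence first.
    Returns one label per human residue, in sequence order.
    """
    human = aligned[0]
    if any(len(seq) != len(human) for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    labels = []
    for column in zip(*aligned):
        if column[0] == gap:
            continue  # no human residue in this alignment column
        conserved = all(residue == column[0] for residue in column)
        labels.append("EC" if conserved else "EU")
    return labels


# Toy alignment (made-up sequences): human, then two orthologs.
toy = ["MKT-L", "MRT-L", "MKTAL"]
print(classify_residues(toy))
```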

Results

To identify polymorphisms in the set of 24 membrane transporter genes, we screened all exons as well as 35–100 bp of flanking intronic sequence in an ethnically diverse collection of genomic DNA samples by using denaturing HPLC analysis and direct sequencing. The collection of DNA used for screening included samples from 247 unrelated individuals: 100 European Americans, 100 African Americans, 30 Asians, 10 Mexicans, and 7 Pacific Islanders.

New SNPs Identified.

We identified 680 biallelic SNPs in almost 96 kb of genomic sequence (Table 1). In addition to the 680 biallelic SNPs, two tri-allelic sites were identified. To determine which of the SNPs had been previously identified, we examined dbSNP entries for 17 of the 24 transporter genes (www.ncbi.nlm.nih.gov/SNP/index.html). Of 477 SNPs that we found, 91 had been found previously. Seventy of the SNPs reported in dbSNP were not identified in our sample set.

Table 1.

Summary statistics of variation in transporter genes

Sequence section        bp        SNPs   θ*              π*             D*
Total population        96,074    680    10.44 ± 2.01    5.09 ± 2.44    −1.56
Coding                  54,700    330     8.90 ± 1.75    3.96 ± 1.94    −1.67
Noncoding               41,374    350    12.48 ± 2.45    6.57 ± 3.19    −1.43
Intron–exon boundary     4,437     12     3.99 ± 1.38    2.20 ± 1.66    −0.99
Synonymous              12,820    175    20.14 ± 4.10    9.73 ± 4.86    −1.53
Nonsynonymous           41,880    155     5.46 ± 1.12    2.20 ± 1.12    −1.77

*Values of θ, π, and Tajima's D are listed as mean × 10⁴ ± standard deviation; bp, base pairs analyzed.

Population Specificity of SNPs.

Of the 680 SNPs, 421 were population-specific, of which 248 were singletons (occurring on only one of 494 chromosomes). Of the 259 SNPs that were not population-specific, 83 were present in all five populations, and 176 were present in two, three, or four populations. In general, few population-specific alleles were found at high frequency. Only 4 of 278 African American-specific alleles and 1 of 50 Asian-specific alleles had frequencies >0.1. Strikingly, the European American population sample had no population-specific alleles (0 of 80) at frequencies ≥0.05, in contrast to the African American population sample, which had 31 of 278 at frequencies ≥0.05. The relatively high incidence of moderately frequent population-specific alleles in African Americans may facilitate identification of ethnic-specific disease loci in this population. One hundred thirty-three (48%) of the African American-specific SNPs and 71 (89%) of the European American-specific SNPs were singletons. These singletons reflect a mixture of new mutations that have not been subjected to selection and old mutations that have been. The ratio of African American to European American singletons provides a measure of the distribution of new mutations (similar in both populations) and old mutations, some of which are shared by both populations and others of which are present only in African Americans. We observed a ratio of ≈2:1, in contrast to Stephens et al. (4), who reported a ratio of 3:1. Because we identified rarer singletons than Stephens et al., our ratio should lead to a better estimation of new mutations.

Insertions and Deletions.

Previous large-scale screens have provided relatively little information on insertion and deletion (indel) mutations for technical reasons and because of their rarity (1, 3, 4). Clark and colleagues (8, 9) reported 9 indel mutations (all intronic) and 79 SNPs in screening the entire lipoprotein lipase gene in 71 individuals. Saito et al. (10) identified 29 noncoding indels and 297 SNPs in screening nine ABC transporter genes. In contrast, we observed 29 indel mutations, of which eight affected the coding region (see Table 4). The ratio of intronic indels to total SNPs (21 of 680) was lower than that reported for lipoprotein lipase (0.03 versus 0.11, respectively) (8, 9). The low frequency of coding indels presumably reflects their severe consequences for protein function. Of the eight coding indels that we identified, five added or deleted amino acids but otherwise conserved the reading frame. Two of these, one of which adds an amino acid and one of which deletes an amino acid, occurred at high allele frequencies (0.29 for CNT1 4-4 and 0.105 for OCT1 7-5). Both of these variants exhibit transport function but may have differences in specificity (unpublished observations). Because frameshift mutations comprise a substantial fraction of disease-causing mutations (Human Gene Mutation Database, Cardiff, Wales; http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html), it is notable that one of the three frameshift indels occurred at a frequency of 3.1% in African Americans. For 15 of the 29 indels, a plausible basis for their origin (e.g., occurrence at short repeated sequences, as has been observed in microorganisms) can be proposed (see Table 4) (11).

Population-Genetic Analysis of Nucleotide Diversity.

To further quantify the amount of variation in our set of genes, we calculated two measures of nucleotide diversity, the average heterozygosity (π) and the population mutation parameter (θ) (6). In addition, Tajima's D was calculated to detect deviations from the neutral mutation model (5). These parameters were calculated for the entire sequenced region as well as for each SNP type (i.e., coding versus noncoding and synonymous versus nonsynonymous) and for the total population and each ethnic group (Tables 1 and 5). Mean values of θ and π were similar to values previously reported for other genes (θ and π ranging from 5.0 × 10⁻⁴ to 8.3 × 10⁻⁴), suggesting that, on average, genetic variation in membrane transporters is similar to that in other genes (1, 3, 4). Values of θ × 10⁴ ranged from 1.6 to 21.7 and π × 10⁴ from 0.65 to 14.0 for the 24 genes analyzed (see Table 7). Our average θ values (Table 1) were somewhat higher than those reported by Halushka et al. (3) and substantially higher than those of Cargill et al. (1) and Stephens et al. (4), presumably because our sample contained a greater number and higher proportion of African Americans, who have higher levels of nucleotide diversity. Some genes (ENT1, FIC1, and SERT) exhibited particularly low πS values (<1) (see Table 6), suggesting that these genes may have intrinsically reduced mutability or that they have been subjected to some mechanism that eliminated variation, e.g., a recent selective sweep (6, 12).
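For readers who want to see how these three statistics relate, the Python sketch below computes θ from the number of segregating sites, π from per-SNP allele counts, and Tajima's D using the standard published constants. It is a hedged illustration, not the pipeline used in the study: the function name and inputs are hypothetical, every site is assumed to be biallelic and typed on all n chromosomes, and the toy input of 680 singletons is used only to show the scale of θ, not to reproduce the real allele frequencies.

```python
import math


def diversity_stats(allele_counts, n, length_bp):
    """Return (theta per site, pi per site, Tajima's D).

    allele_counts: for each SNP, the number of chromosomes carrying one allele.
    n:             number of chromosomes sampled.
    length_bp:     number of base pairs screened.
    """
    s = len(allele_counts)  # number of segregating sites
    if s == 0:
        return 0.0, 0.0, float("nan")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))

    theta = s / a1  # segregating-sites estimate, per locus
    # Average pairwise differences: heterozygosity summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    # Variance constants for Tajima's D.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    d = (pi - theta) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return theta / length_bp, pi / length_bp, d


# Toy input: 680 SNPs in 96,074 bp on 494 chromosomes, all set to singletons.
theta, pi, d = diversity_stats([1] * 680, n=494, length_bp=96074)
print(f"theta x 1e4 = {theta * 1e4:.2f}, pi x 1e4 = {pi * 1e4:.2f}, D = {d:.2f}")
```

Because θ depends only on the number of SNPs and the sample size, the toy call gives the same θ scale as the first row of Table 1, whereas π and D depend on the real allele frequencies, which are not listed here. An excess of rare alleles pulls π below θ and makes D negative.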

The ratio of π for nonsynonymous sites (πNS) to π for synonymous sites (πS) provides a measure of selection for function of a given gene (13). The πNS/πS ratios for each of the 24 transporter genes in our study, along with the ratios previously determined for VMAT2, SERT, and TLR4 (12, 14), are provided in Table 7. Twenty-two of 24 genes had ratios <1, suggesting that they are under selection. Values of θ were greater than π, resulting in negative Tajima's D values for 23 of 24 genes, which is consistent with population expansion or negative (purifying) selection. The average nucleotide diversity of the two major superfamilies, ABC transporters and SLC transporters, was similar (Table 7). Four of the five ABC transporters exhibited intermediate πNS/πS values, indicating similar levels of selection, whereas one (MRP1) had a low ratio. Nucleotide diversity and the πNS/πS ratio of SLC superfamily members exhibited wide variation, consistent with the functional diversity of transporters in this superfamily (12). In our set of membrane transporters, the average nucleotide diversity over all regions (πtotal) was greater than the amino acid diversity (πNS) (5.1 × 10⁻⁴ versus 2.2 × 10⁻⁴), consistent with Cargill et al., and indicating that these genes are under negative selection. The mean πNS/πS ratio in our study (0.23) was considerably lower than that of Cargill et al. (0.64), who analyzed 106 genes from diverse classes (1). The lower ratio that we observed suggests that membrane transporter genes may be under stronger negative selection than the diverse set of genes analyzed by Cargill et al.

Amino Acid Diversity Across Structural Regions and Transporter Superfamilies.

Loops and TMDs.

Because membrane transporters have two distinct types of secondary structure, TMDs and loops (Fig. 2), we compared nucleotide diversity of these structural regions. Amino acid diversity (πNS) in the TMDs was significantly lower than amino acid diversity in the loops (πTMD-NS = 1.16 × 10⁻⁴ versus πloop-NS = 2.68 × 10⁻⁴; P < 0.05 by a one-sample test for a binomial proportion) (Table 2). These results are consistent with the observation that TMDs are more evolutionarily conserved than loops (i.e., the proportions of EC residues in TMDs and loops are 83% and 74%, respectively), suggesting that there are constraints on TMDs of transporters. The parallel between phylogenetic variation and amino acid diversity suggests that constraints on structural regions of proteins (e.g., TMDs) occur across long and short evolutionary distances for this set of proteins. The restricted variation in TMDs relative to loops has been noted previously in phylogenetic comparisons of 93 integral membrane proteins with multiple TMDs (15).

Figure 2.

Predicted secondary structures of two representative membrane transporters (BSEP and CNT1) from the ABC and SLC superfamilies showing positions of nonsynonymous SNPs. The transmembrane topology schematic was rendered by using TOPO [S. J. Johns (University of California, San Francisco) and R. C. Speth (Washington State University, Pullman), transmembrane protein display software available at the University of California, San Francisco Sequence Analysis Consulting Group web site, www.sacs.ucsf.edu/TOPO/topo.html]. Nonsynonymous amino acid changes are shown in red.

Table 2.

Summary statistics of variation across structural regions

Gene region      All π (× 10⁴)    SLC π (× 10⁴)    ABC π (× 10⁴)
Loop (total)      4.33 ± 2.44      5.23 ± 2.67      4.00 ± 2.10
Loop (s)         10.0  ± 5.15     10.9  ± 5.97     10.7  ± 5.96
Loop (ns)         2.68 ± 1.38      3.57 ± 1.93      2.06 ± 1.22
  EC              1.16 ± 0.73      1.13 ± 0.82      1.20 ± 0.95
  EU              7.32 ± 3.98      8.78 ± 5.11      5.71 ± 3.69
TMD (total)       3.20 ± 1.70      3.38 ± 1.86      3.13 ± 2.12
TMD (s)           9.17 ± 5.05      8.57 ± 5.02     12.21 ± 8.38
TMD (ns)          1.16 ± 0.77      1.57 ± 1.06      0.15 ± 0.40
  EC              0.43 ± 0.44      0.47 ± 0.51      0.22 ± 0.67
  EU              5.16 ± 3.73      8.89 ± 6.46      0.13 ± 0.69

Conserved and unconserved regions are defined based on protein sequence alignments with at least two other mammalian orthologs and therefore refer to evolutionary conservation. s, synonymous; ns, nonsynonymous.

ABC and SLC superfamilies.

The ABC and the SLC superfamilies of transporters have evolved to transport structurally diverse biological molecules, including essential nutrients, metabolic waste products, and xenobiotics. The TMDs of both superfamilies contain residues and structural domains responsible for substrate specificity, whereas the loops of the ABC transporters contain ATP-binding domains. We observed that amino acid diversity in the TMDs of ABC transporter family members was extraordinarily low, much lower than in the TMDs of SLC family members (πNS-TMD 0.15 × 10⁻⁴ versus 1.6 × 10⁻⁴; Table 2; representative transporters are shown in Fig. 2). Surprisingly, the extent of amino acid diversity did not parallel evolutionary conservation: the fraction of residues that are evolutionarily unconserved in the TMDs of the ABC superfamily was significantly higher than the fraction of EU residues in the TMDs of the SLC superfamily (35% versus 13%). These observations imply that a protein segment, in this case the TMDs of ABC transporters, is more constrained within humans than across species. In contrast to the TMDs, there were no differences in amino acid diversity and in the fraction of EC residues between the loop regions of these superfamilies (πNS-loop 2.1 × 10⁻⁴ and 70% for ABC transporters and 3.6 × 10⁻⁴ and 74% for SLC transporters; Table 2).

EC and EU residues.

We classified amino acid changes as EC and EU based on sequence alignments with at least two mammalian orthologs (rat, mouse, and/or rabbit) (see Materials and Methods). Of the 155 nonsynonymous SNPs, we were able to assign 118 as affecting EC or EU amino acid residues. πNS-EC was significantly lower than πNS-EU (0.90 × 10⁻⁴ versus 6.77 × 10⁻⁴, P < 0.05 by a one-sample test for a binomial proportion) over the entire protein (see Table 5). Similarly, πNS-EC was significantly lower than πNS-EU within the TMDs and the loops (Table 2) for SLC family members. Strikingly, this relationship did not hold for the TMDs of ABC family members, in which πNS-EC and πNS-EU were not significantly different (0.22 × 10⁻⁴ and 0.13 × 10⁻⁴, respectively). The extraordinarily low amino acid diversity observed in the TMDs of ABC transporters, which extends to EU residues, may reflect special functional demands on the TMDs of this superfamily. The low amino acid diversity observed for both EU and EC residues in the TMDs of ABC transporters demonstrates that variation within humans does not always parallel phylogenetic variation.

Predictors of Frequency Distributions of Nonsynonymous SNP Alleles.

We identified 155 nonsynonymous SNPs, which occurred at various frequencies, and would ultimately like to know their effect on function. We first calculated the ratio of nonsynonymous to synonymous changes normalized to base pairs (NS*/S*) as a function of allele frequency. This ratio has been used previously to estimate the fraction of deleterious alleles, on the assumption that deleterious alleles are selected against and are under-represented at high frequencies (13). We observed that NS*/S* was slightly greater at low than at high allele frequencies (0.32 for alleles with frequencies ≤0.002 compared with 0.22 for alleles with frequencies ≥0.2), suggesting that rare nonsynonymous variants affect fitness more than common nonsynonymous variants (as observed in ref. 13). Striking differences were observed in the ratio of nonsynonymous changes at EC to EU sites normalized to base pairs (EC*/EU*) at different allele frequencies: the EC*/EU* ratio was 0.88 for alleles with frequencies ≤0.002 compared with 0.15 for alleles with frequencies ≥0.2.

To identify alleles that are deleterious, we fractionated nonsynonymous SNPs according to chemical similarities and evolutionary relatedness. We then compared the frequency distributions of the fractionated alleles, looking for differences between them. Our underlying assumption was that alleles that are functionally deleterious (from mildly to severely deleterious) will be selected against and thus underrepresented at high frequencies and overrepresented at low frequencies. The frequency distribution of EC alleles differed significantly from that of EU alleles (Fig. 3, Table 3) (χ² = 11.35, P = 0.025). For example, 52% of the amino acid changes at EC sites occurred at the lowest allele frequencies (≤0.002) in contrast to only 25% of the amino acid changes at EU sites (Fig. 3, Table 3). The skew toward low frequencies for the EC alleles in comparison to the EU alleles can be explained by proposing that changes at EC sites affect protein function and thus fitness. The highest percentage of amino acid changes at EC sites was likewise found at the lowest allele frequencies (≤0.002) for both African American and European American population samples (data not shown). The African American sample had a greater fraction of amino acid changes at EC sites at intermediate frequencies than did the European American sample. The allele frequency distributions of variants at EU sites showed less skewing to lower frequencies than the EC distributions and somewhat different distributions for European and African Americans (data not shown).

Figure 3.

Figure 3

Nature of nonsynonymous SNPs as a function of allele frequency. Allele frequency distribution of nonsynonymous SNPs at EC and EU amino acid residues. Percentage of nonsynonymous SNPs at EC (black) and EU (striped) amino acid residues are shown for different allele frequency ranges. Amino acid residues were classified as EC or EU based on alignments of each human protein with two mammalian orthologs (rat, mouse, or rabbit).

Table 3.

Allele frequency distribution of nonsynonymous SNPs

Frequency            All NS SNPs*   BLOSUM62        EC/EU           SIFT            Grantham
                                    <0      ≥0      EC      EU      ≤0.1    >0.1    ≤100    >100
P ≤ 0.002            41.3           39.7    41.8    52.3    24.5    47.6    22.9    41.9    36.7
0.002 < P ≤ 0.01     27.7           30.2    26.4    24.6    30.2    29.3    20.0    29.0    23.3
0.01 < P ≤ 0.10      18.7           19.0    18.7    15.4    22.6    13.4    31.4    16.9    26.7
0.10 < P ≤ 0.20       4.5            4.8     4.4     3.1     9.4     3.7    11.4     3.2    10.0
P > 0.20              7.7            6.3     8.8     4.6    13.2     6.1    14.3     8.9     3.3

*The percentage of all nonsynonymous SNPs in each allele frequency class (P ≤ 0.002, 0.002 < P ≤ 0.01, 0.01 < P ≤ 0.10, 0.10 < P ≤ 0.20, and P > 0.20). For the remaining columns, nonsynonymous SNPs were further subdivided based on BLOSUM62 scores (<0 or ≥0), SIFT scores (≤0.1 or >0.1), Grantham scores (≤100 or >100), or EC/EU criteria. The numbers in these columns also list the percentage of SNPs in each allele frequency class.
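The EC versus EU comparison can be illustrated with a χ² test of homogeneity on a 2 × 5 table. The Python sketch below is an illustration only: Table 3 reports percentages, so the counts are back-calculated here under the assumption that the 118 classified SNPs split into 65 at EC sites and 53 at EU sites (the split that reproduces the listed percentages). These counts and the function name are assumptions, not values taken directly from the paper.

```python
# Assumed counts per frequency class, back-calculated from the EC and EU
# percentage columns of Table 3 (65 EC SNPs and 53 EU SNPs assumed).
EC_COUNTS = [34, 16, 10, 2, 3]
EU_COUNTS = [13, 16, 12, 5, 7]


def chi_square_2xk(row_a, row_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in zip(row_a, row_b):
        column_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(row_a) - 1


stat, dof = chi_square_2xk(EC_COUNTS, EU_COUNTS)
print(f"chi-square = {stat:.2f} with {dof} degrees of freedom")
```

A P value can then be read from the χ² distribution with the reported degrees of freedom. Several expected counts in the high-frequency classes are small, which is one reason binned comparisons of this kind are sensitive to how classes are merged.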

We next used the same criteria as Cargill et al. to characterize the nonsynonymous changes using BLOSUM62, scoring variant substitutions as <0 (evolutionarily less acceptable) or ≥0 (evolutionarily more acceptable) (Tables 3 and 8) (1, 16). No difference was observed in the frequency distribution of alleles with BLOSUM62 scores <0 and ≥0 (χ² = 0.53, P = 0.97; Table 3). In fact, these distributions were very similar to the frequency distribution of unfractionated nonsynonymous SNPs, indicating that BLOSUM62 did not distinguish between deleterious and tolerated nonsynonymous SNPs in this set of genes.

Ng and Henikoff (17) have described an algorithm for predicting functional consequences of amino acid substitutions, SIFT, which assigns scores based in part on alignment of orthologous sequences. SIFT scores range from 0 to 1: scores near 0 reflect evolutionary conservation and intolerance to substitution, whereas scores near 1 reflect tolerance to substitution. We have assigned SIFT scores to our nonsynonymous SNPs and fractionated them in two ways: zero versus nonzero and ≤0.1 versus >0.1 (Tables 3 and 8). Scoring nonsynonymous SNPs as zero versus nonzero produced allele frequency distributions identical to EC and EU, respectively (χ² = 11.35, P = 0.025). Scoring nonsynonymous SNPs as ≤0.1 or >0.1 resulted in different allele-frequency distributions for alleles in the two categories (χ² = 13.15, P = 0.013).

A similar analysis of all 155 nonsynonymous SNPs was carried out by using Grantham values, which provide a measure of chemical similarity (18). We assigned Grantham scores to our nonsynonymous SNPs and fractionated the scores as less radical (≤100) or more radical (>100) (Table 3), which correspond to the categories used by Li et al. (19). No significant differences in the frequency distributions of the alleles with less radical and more radical amino acid changes were observed (χ² = 5.05, P = 0.282). Similar observations were made when we fractionated the alleles according to Grantham values of <50 and ≥50 (data not shown).

Frequency Distributions and Evolutionary Constraints of Minor Alleles.

Because we screened 494 chromosomes, binomial sampling theory suggests that we have an ≈99% chance of identifying SNPs that occur at a frequency of 1% in our total sample and an ≈86% chance of identifying SNPs that occur at a frequency of 1% in our two largest ethnic samples (European Americans and African Americans, 200 chromosomes of each). We tabulated the observed minor allele frequency distributions of noncoding, synonymous, and nonsynonymous SNPs (see Table 9). Similar to previous reports, we observed a higher percentage of SNPs in the lowest allele frequency class and a lower percentage in the higher allele frequency classes (1, 3). We tested the heterogeneity of the frequency distributions of synonymous versus nonsynonymous SNPs as well as coding versus noncoding and found them to be homogeneous (χ² = 7.24, P = 0.40, for nonsynonymous versus synonymous and χ² = 6.85, P = 0.44 for coding versus noncoding). We compared the observed minor allele frequency distributions with that predicted under the infinite-sites, neutral model, which is based on the assumptions that all sites are mutable, alleles are lost through genetic drift but not by selection, and population size is fixed (see Table 9). Relative to this model, we observed a higher percentage of low-frequency alleles for the noncoding and synonymous sites, which are not expected to be under selection, and a reduced percentage of high-frequency alleles. These observations are similar to those of Glatt et al. (12) and can be explained by population expansion.
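The detection probabilities quoted at the start of this paragraph follow from a simple binomial argument: a variant at population frequency p is missed only if none of the n sampled chromosomes carries it. The Python sketch below shows that calculation; it assumes independent sampling of chromosomes and perfect detection of any variant present in the sample, and the function name is illustrative.

```python
def detection_probability(allele_freq, chromosomes):
    """Probability that at least one sampled chromosome carries the variant."""
    return 1.0 - (1.0 - allele_freq) ** chromosomes


# Sample sizes taken from the text: 494 chromosomes in total, 200 per large group.
for n in (494, 200):
    prob = detection_probability(0.01, n)
    print(f"n = {n}: P(detect a 1% allele) = {prob:.3f}")
```

The same function shows how quickly power falls for rarer variants, for example by calling it with an allele frequency of 0.001.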

The overall trends in the allele frequency distributions of the African American and European American samples were similar to those seen in the total sample (see Table 9). The allele frequency distribution of coding-region SNPs in the European American sample was significantly different from that in the African American sample (χ2 = 16.21, P = 0.0127). This difference was most notable for intermediate- and high-frequency alleles and likely reflects population demography, for example, a historic population bottleneck. Admixture in the African American population may also confound the allele frequency distributions.
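A heterogeneity test of this kind can be sketched as a Pearson χ2 statistic on a contingency table of SNP counts, with one row per group and one column per allele frequency class. The counts below are hypothetical and serve only to show the arithmetic. They are not the data in Table 9, and the function name is an assumption.

```python
def chi_square_heterogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a table of
    counts (rows = groups, columns = allele frequency classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if both groups shared one frequency distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: two groups, three frequency classes (low, medium, high).
example_counts = [
    [60, 25, 15],
    [40, 30, 30],
]
statistic, dof = chi_square_heterogeneity(example_counts)
```

The P value then follows from the χ2 distribution with `dof` degrees of freedom, for example from a statistics library or a printed table.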

Discussion

Phylogenetic Variation and Amino Acid Diversity in Humans.

Based on screening for genetic variation, we have observed that, in general, amino acid diversity in humans paralleled phylogenetic variation. That is, the diversity of EC residues (πNS-EC) was significantly lower than the diversity of EU residues (πNS-EU) for the total protein as well as for protein segments (Table 2). We found a striking exception in the TMDs of ABC transporters, in which πNS-EU was as low as πNS-EC (Table 2): EU sites in the TMDs of ABC transporters did not exhibit the variation observed, for example, at EU sites in the TMDs of SLC family members. How can we explain the existence of sites that vary phylogenetically but not within humans, i.e., the EU sites in the TMDs of ABC transporters? One rationale is as follows. The TMDs contain the amino acids and structural domains involved in substrate specificity and translocation. Transporters in the ABC superfamily have broad specificities, and many have a primary role in protecting the organism from environmental toxins and xenobiotics through efflux. Humans differ from organisms such as mice and rabbits in the set of xenobiotics to which they are exposed through diet, inhalation, and metabolic processing. Thus, human ABC transporters are expected to have different substrate specificities from those of mice and rabbits. Such differences have been observed and reflect phylogenetic variation (20, 21). A given human ABC transporter, for example, may have evolved to recognize a distinct set of substrates important to humans. We suggest that EU residues within the TMDs of ABC transporters play important roles in defining the species-specific recognition of these xenobiotics and their metabolites. We therefore expect that mice would also exhibit restricted variation in the TMDs of their ABC transporters, as we observed for humans.

Predicting Function of Variants.

A goal of large-scale screening efforts is to identify variants in genes that affect disease susceptibility and drug response. A special challenge is to determine which of the variants affect function and therefore contribute to altered phenotype. We have fractionated the nonsynonymous variants by several criteria and used allele frequency distributions to evaluate the predictive ability of these criteria. Miller and Kumar, using Grantham values to assess chemical relatedness, reported that amino acid changes associated with disease exhibited greater chemical differences than amino acid changes observed across species (22), suggesting that Grantham values are useful for predicting deleterious function. In contrast, we observed no clear relationship between Grantham values and allele frequencies (Table 3), indicating that Grantham values are not useful for predicting the function of the transporter variants identified in healthy populations.

Cargill et al. observed that nonsynonymous SNPs had a lower fraction of nonconservative changes (negative BLOSUM62 scores) compared with that predicted from randomly distributed nonsynonymous SNPs, suggesting that BLOSUM62 values predict deleterious function (1). We did not observe a relationship between allele frequency distribution of nonsynonymous SNPs and BLOSUM62 values (Table 3), though a relationship between activity of nonsynonymous variants of OCT1 and BLOSUM62 scores has been observed (7). One potential limitation of BLOSUM62 is that it is based on soluble, globular proteins. An amino acid substitution scoring matrix based on TMD (e.g., SLIM) might better predict function of variants with alterations in TMDs than does BLOSUM62 (23). We were unfortunately not able to evaluate SLIM because of the paucity of high-frequency nonsynonymous variants affecting TMDs. A second potential limitation of BLOSUM62 (as noted by Ng and Henikoff, ref. 17) is that it uses protein domains rather than sequences homologous to the proteins under study.

We observed that SNPs at evolutionarily conserved sites, defined by alignments of mammalian orthologs (EC/EU) or by multiple alignments of orthologs (SIFT), were significantly underrepresented at high allele frequencies and overrepresented at low allele frequencies relative to SNPs at unconserved sites (Table 3 and Fig. 3). These observations indicate that stringent definition of EC or EU residues by alignment of mammalian orthologs is a strong predictor of fitness and hence protein function. Alignments of additional mammalian orthologs, which will become available as a result of ongoing genome sequencing projects, should enhance this scoring system but will require weighting systems like that used in SIFT to take into consideration the frequency and diversity of the amino acids found at evolutionarily unconserved positions. In past studies, SIFT was tested by its ability to predict function of experimentally derived variants for a set of three genes (17). Our studies demonstrate that SIFT can be used to evaluate natural variation in humans and suggest that the frequency distribution of variant alleles can be used to refine SIFT and similar algorithms. We suggest that the ability of such algorithms to predict protein function might be improved by taking into consideration the chemical nature of the amino acid changes (7). For example, changes at an evolutionarily conserved position from leucine to isoleucine would be expected to have less effect on function than changes from leucine to aspartate. We note that two of the three most common EC variants (>0.2) had low Grantham scores. The use of evolutionary conservation of protein orthologs to predict function of transporter variants is validated by functional studies of OCT1 variants described in the companion paper (7). Miller and Kumar have similarly observed that disease-causing mutations are more prevalent at evolutionarily conserved sites relative to other sites (22).

A substantial number of Mendelian disorders, including glucose-galactose malabsorption syndrome, Menkes syndrome, and Tangier disease, are associated with nonfunctional genetic variants of membrane transport proteins (24–26). Further studies are necessary to determine whether the variants that we have identified contribute to diseases or to alterations in drug response.

Supplementary Material

Supporting Tables

Acknowledgments

We thank Anna Di Rienzo, Eric Peters, Hao Li, and James Robertson for discussion, and Chung-I Wu, Pauline Ng, Sudhir Kumar, and Irwin Herskowitz for comments on the manuscript. This work was funded by National Institutes of Health Grant GM 61390. Data are available at www.pharmgkb.org and www.pharmacogenetics.ucsf.edu.

Abbreviations

EC, evolutionarily conserved

EU, evolutionarily unconserved

TMD, transmembrane domain

SNP, single-nucleotide polymorphism

References

  • 1. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Lane C R, Lim E P, Kalayanaraman N, Nemesh J, et al. Nat Genet. 1999;22:231–238. doi: 10.1038/10290.
  • 2. Gabriel S B, Schaffner S F, Nguyen H, Moore J M, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, et al. Science. 2002;296:2225–2229. doi: 10.1126/science.1069424.
  • 3. Halushka M K, Fan J B, Bentley K, Hsie L, Shen N, Weder A, Cooper R, Lipshutz R, Chakravarti A. Nat Genet. 1999;22:239–247. doi: 10.1038/10297.
  • 4. Stephens J C, Schneider J A, Tanguay D A, Choi J, Acharya T, Stanley S E, Jiang R, Messer C J, Chew A, Han J H, et al. Science. 2001;293:489–493. doi: 10.1126/science.1059431.
  • 5. Tajima F. Genetics. 1989;123:585–595. doi: 10.1093/genetics/123.3.585.
  • 6. Hartl D L, Clark A G. Principles of Population Genetics. Sunderland, MA: Sinauer Associates; 1997.
  • 7. Shu Y, Leabman M K, Feng B, Mangravite L M, Huang C C, Stryke D, Kawamoto M, Johns S J, DeYoung J, Carlson E, et al. Proc Natl Acad Sci USA. 2003;100:5902–5907. doi: 10.1073/pnas.0730858100.
  • 8. Clark A G, Weiss K M, Nickerson D A, Taylor S L, Buchanan A, Stengard J, Salomaa V, Vartiainen E, Perola M, Boerwinkle E, Sing C F. Am J Hum Genet. 1998;63:595–612. doi: 10.1086/301977.
  • 9. Nickerson D A, Taylor S L, Weiss K M, Clark A G, Hutchinson R G, Stengard J, Salomaa V, Vartiainen E, Boerwinkle E, Sing C F. Nat Genet. 1998;19:233–240. doi: 10.1038/907.
  • 10. Saito S, Iida A, Sekine A, Miura Y, Ogawa C, Kawauchi S, Higuchi S, Nakamura Y. J Hum Genet. 2002;47:38–50. doi: 10.1007/s10038-002-8653-6.
  • 11. Drake J W. Annu Rev Genet. 1991;25:125–146. doi: 10.1146/annurev.ge.25.120191.001013.
  • 12. Glatt C E, DeYoung J A, Delgado S, Service S K, Giacomini K M, Edwards R H, Risch N, Freimer N B. Nat Genet. 2001;27:435–438. doi: 10.1038/86948.
  • 13. Fay J C, Wyckoff G J, Wu C I. Genetics. 2001;158:1227–1234. doi: 10.1093/genetics/158.3.1227.
  • 14. Smirnova I, Hamblin M T, McBride C, Beutler B, Di Rienzo A. Genetics. 2001;158:1657–1664. doi: 10.1093/genetics/158.4.1657.
  • 15. Tourasse N J, Li W H. Mol Biol Evol. 2000;17:656–664. doi: 10.1093/oxfordjournals.molbev.a026344.
  • 16. Henikoff S, Henikoff J G. Proc Natl Acad Sci USA. 1992;89:10915–10919. doi: 10.1073/pnas.89.22.10915.
  • 17. Ng P C, Henikoff S. Genome Res. 2001;11:863–874. doi: 10.1101/gr.176601.
  • 18. Grantham R. Science. 1974;185:862–864. doi: 10.1126/science.185.4154.862.
  • 19. Li W H, Wu C I, Luo C C. J Mol Evol. 1984;21:58. doi: 10.1007/BF02100628.
  • 20. Ambudkar S V, Dey S, Hrycyna C A, Ramachandra M, Pastan I, Gottesman M M. Annu Rev Pharmacol Toxicol. 1999;39:361–398. doi: 10.1146/annurev.pharmtox.39.1.361.
  • 21. Tang-Wai D F, Kajiji S, DiCapua F, de Graaf D, Roninson I B, Gros P. Biochemistry. 1995;34:32–39. doi: 10.1021/bi00001a005.
  • 22. Miller M P, Kumar S. Hum Mol Genet. 2001;10:2319–2328. doi: 10.1093/hmg/10.21.2319.
  • 23. Jones D T, Taylor W R, Thornton J M. FEBS Lett. 1994;339:269–275. doi: 10.1016/0014-5793(94)80429-x.
  • 24. Rust S, Rosier M, Funke H, Real J, Amoura Z, Piette J C, Deleuze J F, Brewer H B, Duverger N, Denefle P, Assmann G. Nat Genet. 1999;22:352–355. doi: 10.1038/11921.
  • 25. Vulpe C, Levinson B, Whitney S, Packman S, Gitschier J. Nat Genet. 1993;3:7–13. doi: 10.1038/ng0193-7.
  • 26. Martin M G, Turk E, Lostao M P, Kerner C, Wright E M. Nat Genet. 1996;12:216–220. doi: 10.1038/ng0296-216.
