Proceedings of the National Academy of Sciences of the United States of America. 1997 Aug 19;94(17):9197–9201. doi: 10.1073/pnas.94.17.9197

Detecting immigration by using multilocus genotypes

Bruce Rannala 1,*, Joanna L Mountain 1
PMCID: PMC23111  PMID: 9256459

Abstract

Immigration is an important force shaping the social structure, evolution, and genetics of populations. A statistical method is presented that uses multilocus genotypes to identify individuals who are immigrants, or have recent immigrant ancestry. The method is appropriate for use with allozymes, microsatellites, or restriction fragment length polymorphisms (RFLPs) and assumes linkage equilibrium among loci. Potential applications include studies of dispersal among natural populations of animals and plants, human evolutionary studies, and typing zoo animals of unknown origin (for use in captive breeding programs). The method is illustrated by analyzing RFLP genotypes in samples of humans from Australian, Japanese, New Guinean, and Senegalese populations. The test has power to detect immigrant ancestors, for these data, up to two generations in the past even though the overall differentiation of allele frequencies among populations is low.


Classical theory in population genetics has focused on the long-term effects of immigration on allele frequency distributions in semi-isolated populations, concentrating on the stationary distribution resulting from a balance between forces of immigration, genetic drift, and mutation (1–4). Less theory exists addressing the effect of recent immigration among populations with low levels of genetic differentiation. A theory describing the effects of immigration on the genetic composition of individuals in populations that are not at genetic equilibrium is needed to interpret much of the data being generated using current genetic techniques.

In this paper we consider the multilocus genotypes that result when individuals are immigrants, or have recent immigrant ancestry. We propose a test that allows recent immigrants to be identified on the basis of their multilocus genotypes; the test has considerable power for detecting immigrant individuals even when the overall level of genetic differentiation among populations is low. Molecular genetic techniques that allow multilocus genotypes to be described from single individuals are relatively new, and much of the information contained in these types of data is not fully exploited by estimators of long-term gene flow that are currently available (5–7). We provide an example of an application of the method to restriction fragment length polymorphism (RFLP) genotypes from human populations; the method may also be applied to analyze multilocus allozyme and microsatellite data.

Theory

A collection of I discrete populations of a diploid species exchange immigrants, with random mating among individuals within populations. Consider a set of l loci in linkage equilibrium, and let kj be the number of alleles at the jth locus. Let x = {xhji} be a matrix of the allele frequencies in each population, where xhji is the frequency of the hth allele (h = 1, 2, … , kj) at the jth locus (j = 1, 2, … , l) in the ith population (i = 1, 2, … , I). A set of n = {n1, n2, … , nI} chromosomes are sampled from the I populations, where ni is the number sampled from the ith population. Let X = {Xijm} be the matrix of genotypes among the sampled individuals, where Xijm is the genotype at the jth locus of the mth individual sampled from the ith population.

The population allele frequencies are generally unknown, and we therefore derive the probability density of allele frequencies in each population by using a Bayesian approach. The tests presented in this paper make use of the allele frequency distributions to calculate genotype probabilities. It is assumed that the total number of alleles at the jth locus in each population is identically kj. The set of alleles observed in the collection of populations as a whole is used as an estimate of kj. Without additional information, we initially assign an equal probability density to the frequencies of the alleles at the jth locus in the ith population. The prior probability density of allele frequencies (i.e., before sampling) is then (8)

$$f(\mathbf{x}_{ji}) = \frac{\Gamma(1)}{\left[\Gamma(1/k_j)\right]^{k_j}} \prod_{h=1}^{k_j} x_{hji}^{(1/k_j) - 1} \tag{1}$$

The posterior probability density of the allele frequencies at the jth locus, conditioned on the alleles observed in a sample from population i, is now determined. Let the vector $\mathbf{n}_{ji} = \{n_{1ji}, \ldots, n_{k_j ji}\}$, where $n_{hji}$ is the observed number of copies of the hth allele at the jth locus in a sample from the ith population. The posterior probability density of allele frequencies is then

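$$f(\mathbf{x}_{ji} \mid \mathbf{n}_{ji}) = \frac{\Pr(\mathbf{n}_{ji} \mid \mathbf{x}_{ji})\, f(\mathbf{x}_{ji})}{\Pr(\mathbf{n}_{ji})} \tag{2}$$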

where

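$$\Pr(\mathbf{n}_{ji} \mid \mathbf{x}_{ji}) = \frac{n_{ji}!}{\prod_{h=1}^{k_j} n_{hji}!} \prod_{h=1}^{k_j} x_{hji}^{n_{hji}} \tag{3}$$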

and we define $n_{ji} = \sum_{h=1}^{k_j} n_{hji}$. The marginal distribution of the count vector $\mathbf{n}_{ji}$ (conditional on the total $n_{ji}$) is

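$$\Pr(\mathbf{n}_{ji}) = \int \Pr(\mathbf{n}_{ji} \mid \mathbf{x}_{ji})\, f(\mathbf{x}_{ji})\, d\mathbf{x}_{ji} = \frac{n_{ji}!\,\Gamma(1)}{\Gamma(n_{ji} + 1)} \prod_{h=1}^{k_j} \frac{\Gamma(n_{hji} + 1/k_j)}{n_{hji}!\,\Gamma(1/k_j)} \tag{4}$$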

Eq. 2 then simplifies to

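$$f(\mathbf{x}_{ji} \mid \mathbf{n}_{ji}) = \Gamma(\theta) \prod_{h=1}^{k_j} \frac{x_{hji}^{\theta a_h - 1}}{\Gamma(\theta a_h)} \tag{5}$$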

where θ = nji + 1 and

$$a_h = \frac{n_{hji} + 1/k_j}{n_{ji} + 1} \tag{6}$$

Eq. 5 is a Dirichlet probability density function (8) with parameters θ and ah, where h = 1, 2, … , kj.

Genotype Probabilities

If individual m is born to nonimmigrant parents in population i, the probability that the individual has genotype Xijm at the jth locus, assuming random mating, is

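$$\Pr(X_{ijm} \mid \mathbf{x}_{ji}) = \begin{cases} x_{hji}^{2} & \text{if } X_{ijm} = hh \\ 2\, x_{hji}\, x_{gji} & \text{if } X_{ijm} = hg \end{cases} \tag{7}$$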

for all h = 1, 2, … , kj and g = 1, 2, … , kj where g ≠ h. The actual allele frequencies in population i are unknown and we therefore consider the probability of the genotype for individual m, conditional on the sample of alleles from the ith population, denoted as Pr(Xijm|nji). The genotype of individual m is created by sampling two alleles at random from population i. Since the allele frequencies are not known we use the Dirichlet density of Eq. 5 to describe the probability density of allele frequencies and integrate over all possible allele frequencies

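$$\Pr(X_{ijm} \mid \mathbf{n}_{ji}) = \int \Pr(X_{ijm} \mid \mathbf{x}_{ji})\, f(\mathbf{x}_{ji} \mid \mathbf{n}_{ji})\, d\mathbf{x}_{ji} \tag{8}$$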

The marginal probability of the observed genotype, obtained in this way, is equal to the probability of sampling two alleles from a compound multinomial-Dirichlet distribution (7)

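$$\Pr(X_{ijm} \mid \mathbf{n}_{ji}) = \begin{cases} \dfrac{(n_{hji} + 1/k_j + 1)(n_{hji} + 1/k_j)}{(n_{ji} + 2)(n_{ji} + 1)} & \text{if } X_{ijm} = hh \\[2ex] \dfrac{2\,(n_{hji} + 1/k_j)(n_{gji} + 1/k_j)}{(n_{ji} + 2)(n_{ji} + 1)} & \text{if } X_{ijm} = hg \end{cases} \tag{9}$$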

This is the posterior probability that the genotype Xijm is observed for the jth locus when the individual is a nonimmigrant from population i. If the allele frequencies are independent among loci (i.e., there is no linkage disequilibrium), the genotype probabilities at other loci are calculated similarly. The probability of the multilocus genotype Xim = {Xi1m, … , Xilm} of individual m is then a product over the probabilities of the observed allelic configurations for each locus

$$\Pr(X_{im} \mid \mathbf{n}_{i}) = \prod_{j=1}^{l} \Pr(X_{ijm} \mid \mathbf{n}_{ji}) \tag{10}$$
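The single-locus and multilocus probabilities of Eqs. 9 and 10 can be sketched in Python as follows. This is a minimal illustration, not the authors' C program: it assumes the posterior Dirichlet parameters are n_hji + 1/k_j (the symmetric prior implied by θ = nji + 1), and the function names and the toy counts in the usage note are hypothetical.

```python
from math import log

def locus_genotype_prob(counts, h, g):
    """Posterior probability of genotype (h, g) at one locus.

    counts: allele counts observed in the population sample at this locus,
            one entry per allele (0 for alleles not seen in this population).
    h, g:   indices of the two alleles carried by the individual.
    Assumption: Dirichlet posterior parameters are counts[h] + 1/k.
    """
    k = len(counts)              # number of alleles at the locus (k_j)
    n = sum(counts)              # chromosomes sampled at the locus (n_ji)
    denom = (n + 2) * (n + 1)
    a_h = counts[h] + 1.0 / k
    if h == g:                   # homozygote hh
        return (a_h + 1.0) * a_h / denom
    a_g = counts[g] + 1.0 / k
    return 2.0 * a_h * a_g / denom   # heterozygote hg

def multilocus_log_prob(sample_counts, genotype):
    """Natural log of the product over loci, assuming linkage equilibrium."""
    return sum(log(locus_genotype_prob(c, h, g))
               for c, (h, g) in zip(sample_counts, genotype))
```

For a locus with counts [3, 1], the three possible genotypes receive probabilities 0.525 (hh for the common allele), 0.35 (heterozygote), and 0.125, which sum to one. Working on the log scale avoids numerical underflow when many loci are multiplied.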

We now consider situations in which one parent is a resident of population i, and the other is an immigrant born to nonimmigrant parents in population i′. In this case, one allele copy is of immigrant origin. There is generally no prior information regarding the source of an individual’s alleles (chromosomes), and each copy at a locus is therefore equally likely to have been derived from the immigrant source. If we consider the genotype of individual m, born in population i, and denote an immigrant allele from population i′ using a prime symbol, the probability of the mixed genotype X(i,i′)jm at the jth locus is

graphic file with name M11.gif 11

where parentheses in the subscripts indicate that the alleles making up the genotype are averaged with respect to possible source populations, and brackets indicate that the alleles making up the genotype are labeled according to their source population. If an individual has alleles h and g at a particular locus, for example, these possibilities are X[i,i′]jm = hg′ and X[i′,i]jm = h′g, where h′ indicates that allele h is derived from population i′ and h indicates that it is derived from population i. Conditional on the allele frequencies, the probabilities of the genotypes, labeled according to source population, are

graphic file with name M12.gif
graphic file with name M13.gif 12

Sampling a single allele for each population from a multinomial-Dirichlet density with the appropriate population parameters, we obtain

graphic file with name M14.gif 13
graphic file with name M15.gif

for all h = 1, 2, … , kj and g = 1, 2, … , kj where g ≠ h. This is the posterior probability that the genotype X(i′,i)jm is observed for the jth locus when the individual is of mixed ancestry from populations i and i′. The probability of the multilocus genotype X(i′,i)m = {X(i′,i)1m, … , X(i′,i)lm} is then

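$$\Pr(X_{(i',i)m} \mid \mathbf{n}_{i}, \mathbf{n}_{i'}) = \prod_{j=1}^{l} \Pr(X_{(i',i)jm} \mid \mathbf{n}_{ji}, \mathbf{n}_{ji'}) \tag{14}$$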

Identifying Immigrant Genotypes

In this section, we describe a test for detecting individuals born in a population other than the one from which they are sampled; these individuals are first-generation immigrants. Consider an individual m randomly sampled from population i. The probability of the observed multilocus genotype for the individual, given that the individual was born in population i and has no recent immigrant ancestry, is calculated using Eq. 10 above as Pr(Xim|ni). If individual m was instead born in population i′ to parents with no recent immigrant ancestry and subsequently immigrated to population i, then the probability of observing the multilocus genotype of the individual is calculated using Eq. 10 above as Pr(Xim|ni′). The relative probability that the individual was born to parents with no recent immigrant ancestry in population i, rather than population i′, is therefore given by the ratio of the probabilities

$$\Lambda = \frac{\Pr(X_{im} \mid \mathbf{n}_{i})}{\Pr(X_{im} \mid \mathbf{n}_{i'})} \tag{15}$$

In practice, we take logarithms and use the equivalent form

$$\ln \Lambda = \ln \Pr(X_{im} \mid \mathbf{n}_{i}) - \ln \Pr(X_{im} \mid \mathbf{n}_{i'}) \tag{16}$$

Positive values of ln Λ indicate that the null hypothesis (that the individual is not an immigrant) is favored, while negative values indicate that the alternative hypothesis (that the individual is an immigrant) is favored. A value of ln Λ = ln(10) = 2.30, for example, indicates that individual m is 10 times more likely to have arisen in population i, while a value of ln Λ = −2.30 indicates the individual is 10 times less likely to have arisen in i than i′. The distribution of the statistic Λ under the null hypothesis (that the individual is not an immigrant) was examined using Monte Carlo simulation (see below).
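As a rough illustration of how ln Λ might be computed from allele counts, the sketch below evaluates the log probability of one genotype against a resident sample and a potential source sample and takes the difference. It assumes the posterior Dirichlet parameters are n_hji + 1/k_j; the function names and the toy counts are hypothetical and are not taken from the data analyzed in this paper.

```python
from math import log

def log_genotype_prob(sample_counts, genotype):
    """ln probability of a multilocus genotype given per-locus allele counts.

    Assumption: Dirichlet posterior parameters are count + 1/k at each locus.
    """
    total = 0.0
    for counts, (h, g) in zip(sample_counts, genotype):
        k, n = len(counts), sum(counts)
        a_h = counts[h] + 1.0 / k
        a_g = counts[g] + 1.0 / k
        p = (a_h + 1.0) * a_h if h == g else 2.0 * a_h * a_g
        total += log(p / ((n + 2) * (n + 1)))
    return total

def log_lambda(resident_counts, source_counts, genotype):
    """ln of the ratio: born in the resident population vs. in the source."""
    return (log_genotype_prob(resident_counts, genotype)
            - log_genotype_prob(source_counts, genotype))

# Hypothetical counts for two loci (two and three alleles) in each sample.
resident = [[18, 6], [20, 4, 0]]
source = [[5, 19], [6, 10, 8]]
genotype = [(1, 1), (1, 2)]      # allele indices carried at each locus

stat = log_lambda(resident, source, genotype)
print(stat)   # negative here: the genotype is better explained by the source
```

The individual in this toy example carries alleles that are rare in the resident sample and common in the source sample, so the statistic is negative. Whether a given negative value is significant still has to be judged against the simulated null distribution.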

We now describe a test for detecting an individual with a single parent that is an immigrant, or is descended from an immigrant. In this case, one allele copy at each locus is of possible immigrant origin and the other is of local origin. For l independent loci, the probability of observing the genotype of individual m, given that the individual was born in population i and has an immigrant parent from population i′, is calculated using Eq. 14 as Pr(X(i′,i)m|ni, ni′).

The individual might instead have an ancestor d generations in the past that was an immigrant. The probability of the observed genotype X(i′,i)jm at the jth locus under this hypothesis is

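$$\Pr(X_{(i',i)jm} \mid \mathbf{n}_{ji}, \mathbf{n}_{ji'}, d) = \left(\tfrac{1}{2}\right)^{d-1} \Pr(X_{(i',i)jm} \mid \mathbf{n}_{ji}, \mathbf{n}_{ji'}) + \left[1 - \left(\tfrac{1}{2}\right)^{d-1}\right] \Pr(X_{ijm} \mid \mathbf{n}_{ji}) \tag{17}$$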

For l independent loci, the posterior probability of observing the genotype of individual m, given that this individual has an immigrant ancestor d generations removed from population i′, is

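$$\Pr(X_{(i',i)m} \mid \mathbf{n}_{i}, \mathbf{n}_{i'}, d) = \prod_{j=1}^{l} \Pr(X_{(i',i)jm} \mid \mathbf{n}_{ji}, \mathbf{n}_{ji'}, d) \tag{18}$$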

The relative probability that individual m, born in population i, did not have an immigrant ancestor from population i′ at generation d in the past is

$$\Lambda_d = \frac{\Pr(X_{im} \mid \mathbf{n}_{i})}{\Pr(X_{(i',i)m} \mid \mathbf{n}_{i}, \mathbf{n}_{i'}, d)} \tag{19}$$

We again use logarithms to calculate this statistic as ln Λd. The analysis can be extended to consider individuals of mixed parentage over several generations, but the number of possible outcomes makes an exhaustive analysis difficult. In certain cases, when fewer ancestral immigration patterns are possible, based on prior information, the method outlined above might be extended to decide among the possible alternatives.
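A minimal Python sketch of the statistic ln Λd follows. It rests on one stated assumption: at each unlinked locus, an allele from a single ancestor d generations back is carried with probability (1/2)^(d−1), so the per-locus probability is a mixture of the mixed-ancestry and resident genotype probabilities. The per-locus probabilities are taken as inputs, and the function names are hypothetical.

```python
from math import log

def ancestor_locus_prob(p_mixed, p_resident, d):
    """Per-locus genotype probability with one immigrant ancestor d generations back.

    p_mixed:    probability of the genotype if one allele came from the source
    p_resident: probability of the genotype if both alleles are local
    Assumption: the locus carries the ancestor's allele with chance (1/2)**(d-1).
    """
    if d < 1:
        raise ValueError("d must be at least 1 (d = 1 is an immigrant parent)")
    w = 0.5 ** (d - 1)
    return w * p_mixed + (1.0 - w) * p_resident

def log_lambda_d(p_resident_by_locus, p_mixed_by_locus, d):
    """ln of the ratio: no immigrant ancestry vs. an ancestor d generations back."""
    total = 0.0
    for p_res, p_mix in zip(p_resident_by_locus, p_mixed_by_locus):
        total += log(p_res) - log(ancestor_locus_prob(p_mix, p_res, d))
    return total
```

With d = 1 the mixture weight is one and the statistic reduces to the immigrant-parent comparison. As d grows the two hypotheses become indistinguishable and ln Λd approaches zero, which is consistent with the loss of power at larger d.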

Critical Region and Power of Tests

The critical (rejection) region for the test statistic calculated using the methods described in the preceding sections contains all values of the statistic such that Λ < C, where C is chosen to satisfy Pr(Λ < C) = α under the null hypothesis. For a specified value of C, the value of α is given by

$$\alpha = \sum_{h=1}^{G} I(X_{ih})\, \Pr(X_{ih} \mid \mathbf{n}_{i}) \tag{20}$$

where

$$I(X_{ih}) = \begin{cases} 1 & \text{if } \Lambda(X_{ih}) < C \\ 0 & \text{otherwise} \end{cases} \tag{21}$$

and the sum is over the total number of possible genotype configurations $G = \prod_{j=1}^{l} (k_j + 1)k_j/2$. A Monte Carlo estimator of α is

$$\hat{\alpha} = \frac{1}{R} \sum_{r=1}^{R} I\!\left(X_{i}^{(r)}\right) \tag{22}$$

where Xi(r) is the rth simulated genotype with R genotypes simulated in total from the posterior probability distribution Pr(Xih|ni). The random variables Xi(r) can be generated using the following procedure: for the jth locus, generate the first allele by assigning to allele type h the probability

$$\frac{n_{hji} + 1/k_j}{n_{ji} + 1} \tag{23}$$

If the first allele is of type h, generate the second allele by assigning to allele type h the probability

$$\frac{n_{hji} + 1/k_j + 1}{n_{ji} + 2} \tag{24}$$

or to allele type g ≠ h the probability

$$\frac{n_{gji} + 1/k_j}{n_{ji} + 2} \tag{25}$$
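The two-step sampling procedure described above can be sketched in Python as follows. The sketch assumes that the allele probabilities take the urn form implied by a Dirichlet posterior with parameters n_hji + 1/k_j: the first allele is drawn with weight proportional to count + 1/k_j, and the second after adding one to the weight of the allele just drawn. Function names are hypothetical, and a seeded generator is used so that runs are repeatable.

```python
import random

def simulate_locus_genotype(counts, rng):
    """Draw one genotype at a locus from the posterior predictive distribution.

    counts: allele counts in the population sample (one entry per allele).
    rng:    a random.Random instance.
    Assumption: first-allele weights are count + 1/k; the drawn allele's
    weight is then increased by one before the second draw.
    """
    k = len(counts)
    weights = [c + 1.0 / k for c in counts]
    first = rng.choices(range(k), weights=weights)[0]
    weights[first] += 1.0
    second = rng.choices(range(k), weights=weights)[0]
    return tuple(sorted((first, second)))   # unordered genotype

def simulate_genotype(sample_counts, rng):
    """Draw a multilocus genotype, treating loci as independent."""
    return [simulate_locus_genotype(c, rng) for c in sample_counts]

rng = random.Random(1)
draws = [simulate_locus_genotype([3, 1], rng) for _ in range(20000)]
freq_hh = draws.count((0, 0)) / len(draws)
print(freq_hh)   # close to 0.525, the closed-form homozygote probability
```

Evaluating the test statistic on each simulated multilocus genotype gives a sample from its null distribution, from which α or C can be estimated.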

It is also possible to determine C by generating a set of genotypes as outlined above and taking the value of the test statistic that falls below a fraction 1 − α of the values for the simulated genotypes (see Fig. 1). The power of the test to reject the null hypothesis when it is false, for a critical region of size α, is

$$\beta = \sum_{h=1}^{G} I_{\alpha}(X_{ih})\, \Pr(X_{ih} \mid \mathbf{n}_{i'}) \tag{26}$$

where

$$I_{\alpha}(X_{ih}) = \begin{cases} 1 & \text{if } \Lambda(X_{ih}) < C(\alpha) \\ 0 & \text{otherwise} \end{cases} \tag{27}$$

where C(α) is the value of C that specifies the critical region with probability α determined using Eq. 20. A Monte Carlo estimator of the power β is

$$\hat{\beta} = \frac{1}{R} \sum_{r=1}^{R} I_{\alpha}\!\left(X_{i}^{(r)}\right) \tag{28}$$

where Xi(r) is the rth simulated genotype, with R genotypes simulated in total from the posterior probability distribution under the alternative hypothesis, which is Pr(Xih|ni′) for the test of first-generation immigrants. The power of the test is illustrated graphically as the overlap between the distributions of the statistic generated by simulating genotypes under the null and alternative hypotheses (see Fig. 2).
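Given values of the test statistic simulated under the null and alternative hypotheses, the critical value and the Monte Carlo estimates of α and power reduce to simple counting. The sketch below illustrates this under the assumption that the simulated statistics are already available as lists; the function names and the toy values are hypothetical.

```python
def critical_value(null_stats, alpha=0.05):
    """Value C such that at most a fraction alpha of null statistics fall below it."""
    ordered = sorted(null_stats)
    idx = min(int(alpha * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def rejection_rate(stats, c):
    """Fraction of simulated statistics in the critical region (statistic < c).

    Applied to null simulations this estimates alpha; applied to simulations
    under the alternative hypothesis it estimates the power.
    """
    return sum(1 for s in stats if s < c) / len(stats)

# Toy values standing in for simulated ln-ratio statistics.
null_stats = [float(v) for v in range(100)]
alt_stats = [-10.0, -3.0, 2.0, 4.0, 7.0, 50.0]

c = critical_value(null_stats, alpha=0.05)
print(c)                                  # 5.0
print(rejection_rate(null_stats, c))      # 0.05
print(rejection_rate(alt_stats, c))       # 4 of 6 values fall below c
```

Using a strict inequality keeps the estimated size at or below α when simulated statistics are tied, which errs on the conservative side.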

Figure 1.

Illustration of Monte Carlo method for examining significance of test statistic ln Λ for comparison of Australian (sample) and New Guinean (potential source) populations. Histogram of 1,000 values of the ln-probability difference generated by simulating genotypes given the allele counts observed for the Australian (sample) population. A total of 72 markers, for which the individual Australian 1 has been typed, were used to generate the distribution. See Eqs. 23–25. The critical region for the test statistic (α = 0.05) is that portion of the distribution to the left of the arrow. The posterior probability ratio (ln Λ = −2.76) for the individual Australian 1 is indicated by an asterisk.

Figure 2.

Histograms indicating the power of the immigration tests for two cases. (a) The hypothesis that an Australian individual is an immigrant (d = 0) from New Guinea is considered. The shaded columns represent the distribution of ln Λ generated given the alleles observed for the Australian sample, while the unshaded columns represent the distribution of ln Λ generated given the alleles observed for the New Guinean sample. (b) The hypothesis that one parent of an Australian individual was an immigrant (d = 1) from New Guinea is considered. The shaded columns represent the distribution of ln Λ generated given the alleles observed for the Australian sample, while the unshaded columns represent the distribution of ln Λ generated given the alleles observed for the Australian and New Guinean samples and assuming that the individual received one allele at each locus from each population.

Application

We have applied our method to a set of 12 individuals from each of four human populations. We chose to compare two population samples with quite low genetic differentiation (9) and two population samples with quite high genetic differentiation from a set of population samples studied previously (10). The samples with low differentiation are from an Australian population and a New Guinean population (FST distance = 0.056). The samples with high differentiation are from a Japanese population and a Senegalese population (FST distance = 0.232). The Australian sample was collected from a coastal region of Australia, and the New Guinea sample from the highland region of New Guinea. The Japanese sample consists of individuals born in Japan and was collected in the San Francisco Bay Area (11). The Senegalese sample consists of Niokolonke individuals of the Mandenka population collected in southeastern Senegal (10). These 48 individuals have been typed at approximately 50 loci (12) by using RFLPs. The physical locations of the loci suggest that most are unlinked. Multiple restriction enzymes were used to type several of the loci so that the total number of genetic markers was approximately 75. The procedures employed in the sampling and the genetic analysis are described in detail elsewhere (11).

The power of the test to detect immigrants depends on the extent of differentiation between the populations compared (Table 1) as well as the number of loci examined and the number of individuals sampled (unpublished observations). A test of the hypothesis that an individual is an immigrant has high power in all the population comparisons. A test of the hypothesis that an individual has an immigrant parent has lower power for a comparison of individuals from the Australian and New Guinean samples than for a comparison of individuals from the Japanese and Senegalese samples. The test has power to detect an immigrant ancestor through the grandparent generation for a comparison of individuals from Japan and Senegal.

Table 1.

Power of posterior probability ratio tests for recent immigration, with α = 0.05

Sample population   Potential source   Power at d
                                       0      1      2      3      4
Australian          New Guinean        1.00   0.83   0.37   0.17   0.09
New Guinean         Australian         1.00   0.94   0.60   0.25   0.14
Senegalese          Japanese           1.00   1.00   0.78   0.37   0.16
Japanese            Senegalese         1.00   1.00   0.76   0.41   0.20

If d = 0, the individual under consideration immigrated from the source population; if d = 1, one parent of the individual immigrated; if d = 2, one grandparent of the individual immigrated; if d = 3, one great-grandparent of the individual immigrated; if d = 4, one great-great-grandparent of the individual immigrated from the source population.

The distribution of the statistic under Monte Carlo simulation (Fig. 2) illustrates the power of the tests. In Fig. 2a, individuals sampled in Australia are postulated to have immigrated from New Guinea. There is little overlap between the distribution of the test statistic generated by Monte Carlo simulation under the null hypothesis that an individual was born in the Australian population (at right of Fig. 2a) and that under the alternative hypothesis that an individual is an immigrant from the New Guinea population (at left of Fig. 2a). In Fig. 2b, individuals in the Australian sample have a single parent that is an immigrant from New Guinea under the alternative hypothesis. In this case there is more overlap between the distributions generated under the null and alternative hypotheses, indicating that the test has reduced power by comparison with the test for detecting first-generation immigrants (i.e., Fig. 2a).

We applied the test to predict whether individuals sampled in Australia have New Guinean ancestry, and vice versa, and whether individuals sampled in Japan have African ancestry, and vice versa. A total of four individuals from the complete set of 48 comparisons produced significant test statistics at some level of ancestry (Table 2). Three of the four individuals (Australia 1, Australia 2, and Australia 3) who appeared to be immigrants, or descended from immigrants, were drawn from the Australian population, which appears likely to have experienced recent exchanges of immigrants (11). In the case of three individuals (Australia 1, Australia 3, and Japanese 1) it appears possible that an ancestor two or more generations removed was an immigrant, whereas in the case of one individual (Australia 2) it appears most probable that the individual is a first-generation immigrant. Given these results, one might consider excluding individual Australia 1, for example, from the Australian population sample for evolutionary studies, as it is quite probable that this individual has recent immigrant ancestry.

Table 2.

Power of the posterior probability ratio test to detect immigrant ancestry: Four individuals with posterior probability ratios indicating possible immigration (α < 0.05)

Individual  Potential  No. of   Value   Hypothetical immigrant ancestor
            source     markers          Individual  Parent   Grandparent  Great-grandparent
                                        (d = 0)     (d = 1)  (d = 2)      (d = 3)
AUS1        NGN        76       ln Λ    −2.76       −2.89    −1.65        −0.89
                                α       0.000       0.009    0.022        0.037
                                Power   1.000       0.821    0.347        0.197
AUS2        NGN        73       ln Λ    4.48        0.87     −0.37        −0.11
                                α       0.032       0.179    0.244        0.288
                                Power   1.000       0.828    0.332        0.136
AUS3        NGN        82       ln Λ    5.23        −0.50    −0.90        −0.56
                                α       0.032       0.049    0.064        0.092
                                Power   1.000       0.862    0.375        0.149
JPN1        SEN        69       ln Λ    17.80       1.52     −1.26        −1.10
                                α       0.021       0.014    0.029        0.045
                                Power   1.000       0.999    0.771        0.431

Twelve individuals from each of four populations were included. Australians (AUS) were considered as possible immigrants, or descendants of immigrants, from New Guinea (NGN), and vice versa. Japanese (JPN) were considered as possible immigrants, or descendants of immigrants, from the Senegalese (SEN) population, and vice versa. Values of ln Λ or ln Λd are given in the first row for each individual. Values in the second row are significance levels (α values) approximated using the Monte Carlo approach (1,000 iterations per test). Values in the third row are the power of the test for this individual (α < 0.05). 

Discussion

The test for detecting recent immigration developed in this paper provides information relevant to a wide range of problems in population biology and human genetics. In the area of human genetics, for example, the method may be used to identify individuals whose genomes are not typical of the populations in which they currently live, or of their ethnic group. This may be helpful in genetic counselling. In the area of evolutionary biology, it is often important to identify immigrant individuals to study their behavior and interactions with resident individuals. It may also be important to quantify the amount of recent immigration in populations that are not at genetic equilibrium. In the field of conservation genetics, this test may be useful for identifying the population of origin for zoo animals whose history is poorly known to implement successful captive breeding programs.

At least three potentially misleading results may arise when applying the method considered here. First, a failure to detect that an individual is an immigrant, or descended from immigrants, may simply reflect the fact that the appropriate populations for comparison were not included in the analysis. Second, an individual might incorrectly appear to have originated in a particular population other than the one from which it was sampled. This might be due to similarities in allele frequencies, due to long-term gene flow, between that population and a third population from which the individual actually originated, but which was not included in the sample of populations. Third, the fact that many pairwise comparisons between populations are performed for each of a large number of individuals means that some individuals will appear to be immigrants purely by chance. This can be corrected for by using smaller values for α.

The analyses of human populations presented in this paper show that, even with a sample of only 60 independent loci, the method we have proposed has power to detect immigrant ancestry up to two generations in the past. This is despite our conservative correction for uncertainties of allele frequencies. A larger number of loci will increase the power and could allow a single immigrant great-grandparent (out of 8 total), or a single immigrant great-great-grandparent (out of 16 total), to be identified. The precise number of loci needed to obtain a given level of power depends on the degree of genetic differentiation between populations; with greater differentiation, fewer loci are needed to obtain the same level of power. Computer simulations should prove useful in exploring the statistical performance of the method more generally.

Program availability.

A program written in the C computer language for performing the calculations described in this paper is available by anonymous ftp from mw511.biol.berkeley.edu in directory /pub, or on the World-Wide Web at site http://mw511.biol.berkeley.edu/homepage.html.

Acknowledgments

This research was supported, in part, by a National Institutes of Health Grant (GM40282) to Montgomery Slatkin and by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada to B.R.

Abbreviation

RFLP, restriction fragment length polymorphism

References

  • 1. Wright S. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
  • 2. Kimura M. Annu Rep Natl Inst Genet. 1953;3:63.
  • 3. Maruyama T. Theor Pop Biol. 1970;1:273–306. doi: 10.1016/0040-5809(70)90047-x.
  • 4. Slatkin M. Annu Rev Ecol System. 1985;16:393–430.
  • 5. Slatkin M, Barton N H. Evolution. 1989;43:1349–1368. doi: 10.1111/j.1558-5646.1989.tb02587.x.
  • 6. Weir B S, Cockerham C C. Evolution. 1984;38:1358–1370. doi: 10.1111/j.1558-5646.1984.tb05657.x.
  • 7. Rannala B, Hartigan J A. Genet Res. 1996;67:147–158. doi: 10.1017/s0016672300033607.
  • 8. Johnson N L, Kotz S. Distributions in Statistics: Continuous Multivariate Distributions. New York: Wiley; 1970.
  • 9. Reynolds J, Weir B S, Cockerham C C. Genetics. 1983;105:767–779. doi: 10.1093/genetics/105.3.767.
  • 10. Poloni E S, Excoffier L, Mountain J L, Langaney A, Cavalli-Sforza L L. Ann Hum Genet. 1995;59:43–61. doi: 10.1111/j.1469-1809.1995.tb01605.x.
  • 11. Lin A A, Hebert J M, Mountain J L, Cavalli-Sforza L L. Gene Geography. 1994;8:191–214.
  • 12. Mountain J L. Ph.D. thesis. Stanford, CA: Stanford University; 1994.
