Abstract
The interpretation of matching between DNA profiles of a person of interest and an item of evidence is undertaken using population genetic models to predict the probability of matching by chance. Calculation of matching probabilities is straightforward if allelic probabilities are known, or can be estimated, in the relevant population. It is more often the case, however, that the relevant population has not been sampled and allele frequencies are available only from a broader collection of populations as might be represented in a national or regional database. Variation of allele probabilities among the relevant populations is quantified by the population structure quantity FST and this quantity affects matching proportions. Matching within a population can be interpreted only with respect to matching between populations and we show here that FST can be estimated from sample allelic matching proportions within and between populations. We report such estimates from data we extracted from 250 papers in the forensic literature, representing STR profiles at up to 24 loci from nearly 500,000 people in 446 different populations. The results suggest that theta values in current forensic use do not have the buffer of conservatism often thought.
Keywords: Forensic DNA, STR marker, theta, coancestry
INTRODUCTION
The use of microsatellite, or short tandem repeat (STR), loci for forensic identification is now well established and analyses of large STR frequency databases have been published. Silva et al. [1] made use of STR allele frequencies accessed from a then-online database strdna-db (Pamplona et al., [2]), while Steele et al. [3] used data from people living in the United Kingdom (UK), or wishing to migrate to the UK on the basis of relatedness to a UK resident, as well as “reference” data collected by the UK Forensic Science Service. Both these papers provide useful reviews of the forensic uses of STR profiles and both describe the population genetic structure revealed by STR data. Silva et al. [1] commented on the overall similarity between conclusions that can be drawn from forensic and other population databases about the structure of human populations and its relationship to the movement of people from East Africa, the most likely place of origin of anatomically modern humans [4]. This is in spite of forensic markers being chosen to maximize the diversity among individuals. Steele et al. [3] gave detailed analyses of the population structure quantity FST that can be regarded as measuring the evolutionary relatedness between two individuals in the same population.
Here we continue the discussion of FST by describing a new approach to estimating this quantity and then by applying this approach to allele frequencies at 24 STR loci in forensic use published for 446 populations. Central to our approach is the recognition that FST values are statements about pairs of alleles within a population relative to some other population or collection of populations, and this reference frame needs to be specified. Our results suggest that a somewhat higher value of FST than previously thought should be used in forensic match probability calculations, especially when the continental ancestry of the relevant population is unknown.
Allelic Matching Theory
Population genetic theory (e.g. [5]) distinguishes between three types of probability of Au, the uth allele at locus A. First, there is the sample frequency p̃iu in a sample of individuals from population i. Second, there is the actual frequency p̌iu of Au in population i from which this sample has been taken. Note that p̌iu is the expected value of p̃iu over samples from that population, irrespective of sample size, so p̃iu is an unbiased and consistent estimate of p̌iu. With random sampling of individuals, and if the population in question is in Hardy-Weinberg equilibrium, then the number of Au alleles in a sample of ni genotypes has a binomial distribution with parameters 2ni and p̌iu so the within-population variance of p̃iu is VarW(p̃iu) = p̌iu(1 − p̌iu)/(2ni).
Any population genetic theory aimed at calculating match probabilities must address another level of sampling, namely that inherent to the underlying evolutionary process. An actual allele frequency p̌iu is just one of many values that are possible for a given evolutionary history, and the expected value of p̌iu taken over many different realizations of this same history, written as pu, is usually referred to as the allele frequency of Au. It may be better to say “allele probability” here but we will defer to common usage and use “frequency” both for sample and population proportions and the corresponding expected values or probabilities. The same value pu applies here to all populations and it is the probability that an allele drawn at random from any population is of type Au. Note that the total expectation of the sample frequency p̃iu, taken over samples and over populations is also equal to pu. There is an implicit assumption that natural selection has not been acting differentially among populations. For a wide class of evolutionary models, the variance of the actual allele frequency p̌iu among replicates of that population has the form VarA(p̌iu) = pu(1−pu)θi, where θi is the probability that two alleles drawn randomly from population i are identical by descent (ibd). Alleles are ibd if they have a common origin – this is necessarily a statement about the past and there is variation among all the possible populations that might descend from an ancestral population. The total variance of the sample frequency p̃iu, over samples from a population and over realizations of a population, is VarT (p̃iu) = pu(1 − pu)[θi + (1 − θi)/(2ni)].
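The decomposition of the total variance into an evolutionary part and a sampling part can be checked numerically. The following Python sketch is illustrative only: the function names and the example values are ours and do not come from the data analysed here. It evaluates the among-replicate variance pu(1 − pu)θi, the expected within-population sampling variance pu(1 − pu)(1 − θi)/(2ni), and the total variance given in the text.

```python
def var_among(p_u: float, theta_i: float) -> float:
    """Variance of the population frequency over evolutionary replicates."""
    return p_u * (1.0 - p_u) * theta_i


def expected_var_within(p_u: float, theta_i: float, n_i: int) -> float:
    """Expected binomial sampling variance for n_i genotypes (2 * n_i alleles)."""
    return p_u * (1.0 - p_u) * (1.0 - theta_i) / (2 * n_i)


def var_total(p_u: float, theta_i: float, n_i: int) -> float:
    """Total variance of the sample frequency: p(1 - p)[theta + (1 - theta)/(2n)]."""
    return p_u * (1.0 - p_u) * (theta_i + (1.0 - theta_i) / (2 * n_i))


if __name__ == "__main__":
    # Hypothetical values chosen only to show the two components.
    p_u, theta_i = 0.2, 0.03
    for n_i in (50, 500, 5000):
        print(n_i, var_among(p_u, theta_i), var_total(p_u, theta_i, n_i))
```

As ni grows, the total variance approaches pu(1 − pu)θi: larger samples reduce the sampling component but not the evolutionary component.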
The evolutionary perspective given by this introduction of θi allows account to be taken of some alleles in the population having a shared ancestry and being ibd. Shared ancestry of some alleles increases the chance of allele matching, or identity in state (ibs), for alleles drawn randomly from the population.
If there is Hardy-Weinberg equilibrium within a (large) population i for which the frequency of allele Au is p̌iu, then the probability that two alleles in that population are of type Au is p̌iu², and this has an expected value over replicates of the population of pu² + pu(1 − pu)θi. If, instead of a population frequency p̌iu for population i, or its expected value pu, we have only a sample value p̃u from a larger population, or collection of populations, then we must consider the sampling variance of p̃u. Some of the alleles contributing to that sample value are from the target population i for which θi is appropriate and some are from different populations j, j ≠ i, for which an analogous quantity θij is appropriate. This new measure is the probability that an allele from population i is ibd to an allele from population j. The estimated two-allele probability for population i is p̃u² + p̃u(1 − p̃u)βi, where βi = (θi − θB)/(1 − θB) and θB is the average over all pairs of populations within the sampled collection of population-pair θij’s. We are assuming a large (unknown) number of populations within the sampled collection. The corresponding allelic match probability is Pr(Au|Au) = pu + θi(1 − pu) and we can estimate this as p̃u + βi(1 − p̃u).
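As a minimal sketch of the two quantities defined in this paragraph, the Python functions below compute βi from θi and θB and the estimated allelic match probability p̃u + βi(1 − p̃u). The function names and the numerical inputs are hypothetical and serve only as an illustration.

```python
def beta_from_thetas(theta_i: float, theta_b: float) -> float:
    """beta_i = (theta_i - theta_B) / (1 - theta_B)."""
    if theta_b >= 1.0:
        raise ValueError("theta_B must be less than 1")
    return (theta_i - theta_b) / (1.0 - theta_b)


def allelic_match_probability(p_tilde_u: float, beta_i: float) -> float:
    """Estimated Pr(Au | Au) = p + beta_i * (1 - p), with p the collection-wide sample frequency."""
    return p_tilde_u + beta_i * (1.0 - p_tilde_u)


if __name__ == "__main__":
    # Hypothetical inputs: theta_i = 0.05 within the population, theta_B = 0.03 between pairs.
    beta = beta_from_thetas(0.05, 0.03)
    print(beta, allelic_match_probability(0.10, beta))
```

Multiplying the allelic match probability by p̃u gives the estimated two-allele probability for the same population.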
Balding and Nichols [7] extended the discussion of matching to genotypes and gave the match probabilities for a population as:
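Pr(AuAu|AuAu) = [2θi + (1 − θi)pu][3θi + (1 − θi)pu] / [(1 + θi)(1 + 2θi)]
Pr(AuAv|AuAv) = 2[θi + (1 − θi)pu][θi + (1 − θi)pv] / [(1 + θi)(1 + 2θi)], u ≠ v   (1)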
If the particular population within a collection of populations is not specified, and we wish to express a match probability that would apply to any of a large number of populations within the collection, then we replace θi by the average value θW over populations. To estimate the matching probabilities with the total sample allele frequencies p̃u in place of the true values pu, we use βW = (θW − θB)/(1 − θB), the average of the βi’s, in place of θW.
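For readers who want to evaluate genotype match probabilities, here is a short Python sketch of the widely used Balding and Nichols [7] expressions, which we take to be the form intended in Equations 1. The function names are ours. The argument `theta` may be θi, the average θW, or, when sample allele frequencies from a population collection are supplied, the estimate of βW.

```python
def match_prob_homozygote(p_u: float, theta: float) -> float:
    """Pr(AuAu | AuAu) under the Balding-Nichols model."""
    numerator = (2 * theta + (1 - theta) * p_u) * (3 * theta + (1 - theta) * p_u)
    return numerator / ((1 + theta) * (1 + 2 * theta))


def match_prob_heterozygote(p_u: float, p_v: float, theta: float) -> float:
    """Pr(AuAv | AuAv), u != v, under the Balding-Nichols model."""
    numerator = 2 * (theta + (1 - theta) * p_u) * (theta + (1 - theta) * p_v)
    return numerator / ((1 + theta) * (1 + 2 * theta))


if __name__ == "__main__":
    # Hypothetical allele frequencies and theta values, for illustration only.
    for theta in (0.0, 0.01, 0.03):
        print(theta, match_prob_homozygote(0.1, theta), match_prob_heterozygote(0.1, 0.2, theta))
```

With `theta = 0` the functions reduce to the product rule values pu² and 2pupv.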
We note that most discussions in the literature, e.g. [1], [2], do not make explicit mention of between-population values θij or θB, and we also note that θW is the quantity usually referred to as FST. Our formulation stresses that the concept of identity by descent of alleles within a population requires a point of reference, which we take as identity by descent between pairs of populations. We see below that there is an immediate translation to allelic matching within and between populations.
There is widespread use in forensic science of the profile probabilities Pr(AuAu) or Pr(AuAv), u ≠ v, and their product rule estimates p̃u² or 2p̃up̃v, rather than the match probabilities shown in (1). To allow for departures from Hardy-Weinberg equilibrium in the population collection that result from variation in allele frequencies among populations, the profile probabilities in a subpopulation can be estimated as
Pr(AuAu) ≙ p̃u² + p̃u(1 − p̃u)Fi
Pr(AuAv) ≙ 2p̃up̃v(1 − Fi), u ≠ v   (2)
although, following the National Research Council [8], the (1−Fi) is often omitted in the expression for heterozygotes. In these last equations, the symbol ≙ means “is estimated by” and the equations provide estimates for population genotype frequencies in terms of population-collection sample allele frequencies. The quantity Fi is the total inbreeding coefficient for population i. It is the average of Equations 2 that would be appropriate if the particular population was not identified and here we write the average over populations of Fi as FW, although it is often written as FIT. If there is Hardy-Weinberg equilibrium within populations, then Fi = θi and FW = θW.
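The profile probability estimates can be sketched in Python as follows. We assume the usual forms, p̃u² + p̃u(1 − p̃u)F for homozygotes and 2p̃up̃v(1 − F) for heterozygotes. The flag `keep_factor` is a hypothetical name of ours for the option of omitting the (1 − F) term, as described for the National Research Council [8] approach.

```python
def profile_prob_homozygote(p_u: float, f: float) -> float:
    """Estimated Pr(AuAu) = p^2 + p(1 - p)F."""
    return p_u ** 2 + p_u * (1.0 - p_u) * f


def profile_prob_heterozygote(p_u: float, p_v: float, f: float, keep_factor: bool = True) -> float:
    """Estimated Pr(AuAv), u != v: 2 p_u p_v (1 - F), or 2 p_u p_v if the factor is omitted."""
    factor = (1.0 - f) if keep_factor else 1.0
    return 2.0 * p_u * p_v * factor


if __name__ == "__main__":
    # Hypothetical frequencies and F value, for illustration only.
    print(profile_prob_homozygote(0.1, 0.03))
    print(profile_prob_heterozygote(0.1, 0.2, 0.03))
    print(profile_prob_heterozygote(0.1, 0.2, 0.03, keep_factor=False))
```

Omitting the (1 − F) factor gives a slightly larger heterozygote estimate, which is the direction that favours a defendant.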
Regardless of whether the average (1−FW) term is included for heterozygotes, it is not clear that estimated profile probabilities are of great forensic interest since the relevant quantities are match probabilities: the chance that an untyped person will have a certain genotype given that a typed person has that genotype. When account is taken of the evolutionary history of a population, Fi, θi > 0, the match probabilities exceed the profile probabilities. Match probability estimates require the quantities βi, or their average βW.
Since their introduction by Balding and Nichols [7], the match probabilities in Equations 1 have been extended to allow for inbreeding [9], mixtures [10], and relatedness [11]. They have been of substantial benefit for the interpretation of matching autosomal profiles and they were endorsed by the US National Research Council [8].
Estimation of FST
There is a logical difficulty in estimating FST = βW or the population-specific values βi. We have shown how it arises in match probabilities to account for variation in allele frequencies over evolutionary replicate populations. We could estimate βi from standard methods, such as those of Weir and Cockerham [12] (who assumed θB = 0 and θi = θ for all values of i) or Weir and Hill [13] (who relaxed those assumptions), if we had data from more than one actual population to provide an indication of evolutionary variation, but if we had those frequencies we could use them directly in a within-population analysis and not need the θ formulation. We adopt a pragmatic work-around by exploiting the many published sets of allele frequencies, grouping them into broad ancestry or geographic sets: an estimate from a set of European-ancestry samples, for example, can be used in situations where a European-ancestry θ is required.
We take this opportunity to simplify the expressions for our earlier estimation procedures [12],[13] while allowing for different values θi for different populations i and for analogous quantities θij for pairs i, j of populations. Bhatia et al. [14] found it convenient to work with sample heterozygosities within and among populations, and we use the complementary matching proportions (i.e. sample homozygosities). The within-population sample matching proportions can be found from the counts nilu of allele Alu at locus l sampled from population i as M̃il = Σu nilu(nilu−1)/[nil(nil−1)], where nil = Σu nilu is the total number of alleles sampled at that locus. The sample allele frequencies are p̃ilu = nilu/nil, and we can write M̃il as (nil Σu p̃ilu² − 1)/(nil − 1) or, approximately, as Σu p̃ilu². For populations i and j, the between population-pair sample matching proportions are M̃ijl = Σu nilunjlu/(nilnjl) or M̃ijl = Σu p̃ilu p̃jlu. For sets of r populations, we take averages over populations and over pairs of populations of the matching proportions for locus l: M̃Wl = Σi M̃il/r and M̃Bl = Σi≠j M̃ijl/[r(r − 1)].
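The matching proportions defined above are simple functions of allele counts. The Python sketch below is a minimal illustration for a single locus: the type alias `AlleleCounts`, the function names and the example counts are ours, and allele labels are treated as plain strings.

```python
from typing import Dict

AlleleCounts = Dict[str, int]  # allele label -> number of copies observed at one locus


def within_matching(counts: AlleleCounts) -> float:
    """M_il = sum_u n_u (n_u - 1) / [n (n - 1)], with n the total number of alleles."""
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def between_matching(counts_i: AlleleCounts, counts_j: AlleleCounts) -> float:
    """M_ijl = sum_u n_iu n_ju / (n_i n_j); alleles absent from one sample contribute zero."""
    n_i = sum(counts_i.values())
    n_j = sum(counts_j.values())
    shared = set(counts_i) & set(counts_j)
    return sum(counts_i[a] * counts_j[a] for a in shared) / (n_i * n_j)


if __name__ == "__main__":
    # Hypothetical counts for two populations at one locus.
    pop_1 = {"12": 30, "13": 50, "14": 20}
    pop_2 = {"12": 10, "13": 60, "15": 30}
    print(within_matching(pop_1), within_matching(pop_2), between_matching(pop_1, pop_2))
```

Working from counts rather than rounded frequencies avoids the small-sample approximation M̃il ≈ Σu p̃ilu².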
We estimate the β’s by comparing within-population matching proportions to between population-pair matching proportions with the ratios β̂il = (M̃il − M̃Bl)/(1 − M̃Bl) and β̂Wl = (M̃Wl − M̃Bl)/(1 − M̃Bl). Approximating the expectation of these ratios as the ratio of expectations, the Weir and Hill model [13] leads to β̂il, β̂Wl being unbiased estimates of βil, βWl.
We will see below that there is variation among loci of the locus-specific values β̂il and these differences may reflect both different locus-specific mutation rates and large sampling variances. We provide estimates of locus-average values, evaluated as explained in the Supplementary Material, that will be unbiased if each locus has the same value. The average matching proportions over loci are M̃i = Σl M̃il/L, M̃W = Σl M̃Wl/L and M̃B = Σl M̃Bl/L for a set of L loci. The β estimates are β̂i = (M̃i − M̃B)/(1 − M̃B) and β̂W = (M̃W − M̃B)/(1 − M̃B): these “ratio of averages” estimates have smaller variances than the averages of the β̂il or β̂Wl ratios.
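The “ratio of averages” estimates can be sketched as below. This is a simplified illustration, not the averaging procedure of the Supplementary Material: we assume every population is typed at every locus and use unweighted means. The function name and the input layout (`within[i][l]` for M̃il, `between[(i, j)][l]` for M̃ijl with i < j) are our own choices.

```python
from typing import Dict, List, Tuple


def beta_estimates(
    within: List[List[float]],
    between: Dict[Tuple[int, int], List[float]],
) -> Tuple[List[float], float]:
    """Return (population-specific beta estimates, beta_W) from matching proportions."""
    r = len(within)
    n_loci = len(within[0])
    # Locus-averaged within-population matching for each population.
    m_i = [sum(row) / n_loci for row in within]
    m_w = sum(m_i) / r
    # Locus-averaged between-population matching, averaged over population pairs.
    pair_means = [sum(values) / n_loci for values in between.values()]
    m_b = sum(pair_means) / len(pair_means)
    beta_i = [(m - m_b) / (1.0 - m_b) for m in m_i]
    beta_w = (m_w - m_b) / (1.0 - m_b)
    return beta_i, beta_w


if __name__ == "__main__":
    # Hypothetical matching proportions for two populations at two loci.
    within = [[0.20, 0.30], [0.25, 0.35]]
    between = {(0, 1): [0.15, 0.25]}
    print(beta_estimates(within, between))
```

Because the same M̃B is subtracted from every population, β̂W here equals the mean of the population-specific estimates.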
It is not possible to estimate the within-population θ’s other than as being “relative to” the between-population-pair values. The compound quantities β measure how much more matching there is within populations than there is between pairs of populations. Weir and Hill [13] showed that it is these compound parameters that, for a pure drift model of evolution, are functions of the time since the set of populations diverged from an ancestral population. For two populations i and j in the pure drift case, the parameter (βi + βj)/2 is proportional to the time since those two populations diverged from an ancestral population and so it can be used as a distance measure in constructing evolutionary dendrograms (see Supplementary Material).
To illustrate how populations i differ in match probabilities it would be preferable to examine values of θi. Although only estimates of θi relative to θB are available, comparisons are still possible. If βi > βj, for example, then it can be inferred that θi > θj and it will be assumed reasonable that β̂i > β̂j also implies θi > θj when the same between-population-pair matching proportion M̃B is used for each. In the empirical values shown below, the estimates β̂i for population i, in a group of populations such as those with common continental ancestry, use data from all the populations in that group in order to calculate M̃B. Within-population match probabilities are appropriately estimated with population-collection sample allele frequencies p̃lu in place of population frequencies plu and estimated β̂W values in place of θW. The formulation given here of βW or FST shows this quantity describes allelic matching within populations relative to matching between pairs of populations within a collection of populations.
DATA AND DATA HANDLING
Data were obtained from 250 published population reports in the following journals: Forensic Science International, Forensic Science International: Genetics, International Journal of Legal Medicine, Journal of Forensic Sciences, and Legal Medicine. The data were gathered from electronic sources, either from the PDF text or from published electronic files, or from hard copy using OCR and post-processing. We are aware that we have not been comprehensive and that even more data are available. This process yielded 446 distinct populations. The populations used and the papers from which we extracted data are given in the Supplementary Material. We do not have access to the individual genotypes, so many of the data-cleaning steps described by Pemberton et al. [15] are not available.
The data required substantial handling after collection, from cleaning to storage. In particular, not all of the data collected in this study were originally collected for forensic purposes. A proportion was more probably collected for anthropological reasons than for validation of forensic multiplexes for casework. As a result, we had some issues with non-standard loci and non-standard allele designations, as well as small samples.
Errors in the data were also apparent in a significant number of cases. Most of these errors were typographical and in many cases the correct data could be deduced. In some cases we determined that an error was present because the allele frequencies did not sum sufficiently close to one or because allele frequencies multiplied by twice the number of individuals in the sample were not sufficiently close to integers. In one case the allele frequencies had not been correctly determined from the genotype frequencies and in another case D3S1358 and D13S317 had been swapped. In approximately half of the cases of perceived error we received a helpful answer from the corresponding author to our enquiries. In the other cases we have either omitted the data or used our own deductions as to the correct values. We have also struggled with different nomenclatures for rare alleles; however, this should have only a minor effect on the analysis. We assigned the populations to geographic groups using our best judgment, and we show below that these groups of populations cluster together with principal component analysis. We therefore have some confidence that the grouping has both genetic and geographic coherence although we concede that further adjustments could be made. The motivation for grouping at all, apart from simplifying the presentation of results, is that we wish to provide θ values (now interpreted as βW) that would result in appropriate match probability estimates for any population within a broadly-defined group of populations. The group names, numbers of populations within groups, and sample sizes within groups, are shown in Table 1. The data we discuss represent observations on 494,473 individuals. The loci for which we present results are displayed in Table 2.
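The two consistency checks described above (frequencies summing to one, and frequencies implying near-integer allele counts) can be sketched in Python. The tolerances `sum_tol` and `count_tol` are hypothetical defaults of ours, not thresholds used in this study. In practice they would depend on how many decimal places a paper reports.

```python
from typing import Dict, List


def check_frequencies(
    freqs: Dict[str, float],
    n_individuals: int,
    sum_tol: float = 0.01,
    count_tol: float = 0.1,
) -> List[str]:
    """Return a list of human-readable problems found in one locus of one population."""
    problems = []
    total = sum(freqs.values())
    if abs(total - 1.0) > sum_tol:
        problems.append(f"frequencies sum to {total:.4f}")
    for allele, freq in freqs.items():
        implied = 2 * n_individuals * freq  # implied number of copies of this allele
        if abs(implied - round(implied)) > count_tol:
            problems.append(f"allele {allele}: implied count {implied:.2f} is not near an integer")
    return problems


if __name__ == "__main__":
    # Hypothetical published frequencies for a sample of 50 individuals (100 alleles).
    print(check_frequencies({"9": 0.25, "10": 0.60, "11": 0.15}, 50))
    print(check_frequencies({"9": 0.25, "10": 0.60, "11": 0.10}, 50))
```

A flagged locus is only a prompt for manual inspection: rounding in the published table can trigger the second check for large samples.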
Table 1.
Sample characteristics for geographic groups.
| Group | Name | Number of populations | Number of loci* | Minimum sample size† | Maximum sample size† |
|---|---|---|---|---|---|
| Africa | African | 37 | 24 | 342 | 7998 |
| Andaman Islands | Andam | 3 | 9 | 61 | 97 |
| Asian | Asian | 85 | 24 | 1153 | 23557 |
| Australian Aborigine | AusAb | 17 | 15 | 1686 | 18441 |
| Caucasian | Caucn | 173 | 24 | 861 | 233753 |
| Hispanic | Hisp | 41 | 24 | 236 | 167872 |
| India Pakistan | IndPk | 30 | 23 | 488 | 4525 |
| Inuit | Inuit | 3 | 21 | 194 | 403 |
| Native American | NatAm | 34 | 17 | 618 | 4234 |
| Polynesian | Polyn | 4 | 15 | 2614 | 33593 |
| Unknown | Unknn | 15 | 24 | 205 | 9072 |
* Number of loci scored in at least one population in the group.
† Number of individuals scored at a single locus over all populations in the group.
Table 2.
Numbers of alleles and numbers of populations for each locus. The columns from African to Polyn give the number of populations per geographic group.
| Locus | No. of Alleles | African | Andam | Asian | AusAb | Caucn | Hisp | IndPk | Inuit | NatAm | Polyn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSF1PO | 27 | 33 | — | 76 | 2 | 109 | 39 | 25 | 2 | 33 | 2 |
| D1S1656 | 26 | 3 | — | 7 | — | 34 | 2 | 2 | 1 | — | — |
| D2S441 | 33 | 3 | — | 9 | — | 36 | 2 | 2 | 1 | — | — |
| D2S1338 | 33 | 20 | — | 63 | 2 | 97 | 21 | 14 | 1 | 25 | 4 |
| D3S1358 | 30 | 36 | 3 | 82 | 17 | 161 | 41 | 27 | 3 | 31 | 4 |
| D5S818 | 25 | 34 | 2 | 81 | 17 | 131 | 40 | 27 | 2 | 34 | 2 |
| D6S1043 | 32 | 1 | — | 4 | — | 2 | 1 | — | — | — | — |
| D7S820 | 37 | 34 | 3 | 81 | 17 | 131 | 40 | 27 | 2 | 34 | 2 |
| D8S1179 | 30 | 37 | 3 | 77 | 17 | 161 | 41 | 27 | 3 | 34 | 4 |
| D10S1248 | 16 | 3 | — | 9 | — | 36 | 2 | 2 | 1 | — | — |
| D12S391 | 38 | 3 | — | 7 | — | 31 | 3 | 2 | 1 | — | — |
| D13S317 | 28 | 34 | 3 | 81 | 17 | 131 | 40 | 27 | 2 | 34 | 2 |
| D16S539 | 26 | 35 | — | 74 | 2 | 135 | 39 | 28 | 3 | 30 | 4 |
| D18S51 | 56 | 36 | 2 | 80 | 17 | 161 | 41 | 28 | 3 | 34 | 4 |
| D19S433 | 36 | 20 | — | 65 | 2 | 99 | 21 | 14 | 1 | 25 | 4 |
| D21S11 | 70 | 37 | 3 | 80 | 17 | 161 | 41 | 28 | 3 | 34 | 4 |
| D22S1045 | 13 | 3 | — | 8 | — | 36 | 2 | 2 | 1 | — | — |
| FGA | 86 | 37 | 3 | 81 | 17 | 162 | 41 | 27 | 3 | 34 | 4 |
| PENTAD | 40 | 6 | — | 22 | — | 37 | 10 | 7 | — | 4 | — |
| PENTAE | 46 | 6 | — | 23 | — | 37 | 10 | 7 | — | 4 | — |
| SE33 | 85 | 2 | — | 3 | — | 11 | 2 | 2 | 1 | — | — |
| TH01 | 30 | 36 | — | 77 | 2 | 142 | 41 | 28 | 3 | 34 | 4 |
| TPOX | 22 | 33 | — | 75 | 2 | 109 | 39 | 25 | 2 | 34 | 2 |
| vWA | 31 | 36 | 3 | 81 | 17 | 162 | 41 | 30 | 2 | 34 | 4 |
RESULTS
Principal Coordinates Analysis
As a first examination of the data, and as a check on our assignment of populations to geographic groups, we performed a principal component analysis of the allele frequencies in the whole data set. We used only those populations for which at least 25 individuals were scored. In Figure 1 we show a parallel coordinates plot of the first 10 principal coordinates. There is clear separation of populations by geographic group: principal coordinate (PC) 1 separates out Asian, Caucasian and Indo-Pakistan groups, PC2 separates out the Hispanic and Native American groups, PC3 separates out the African group, PC4 separates out the Australian Aborigine group, PC5 separates out the Inuits and Polynesians. There does not appear to be any population assigned to the wrong group and the overall impression is that the forensic STR markers can distinguish geographic groups, as has been noted previously [1]. The results depicted in Figure 1 suggest that further FST -based analyses are appropriate.
Figure 1.
Parallel coordinate plot for first 10 principal coordinates for all populations with sample sizes at least 50. Each line in the plot represents one population. Color code: Black=African, Grey=AusAb, Yellow=Asian, Blue=Caucn, Purple=Hisp, Brown=IndPk, Red=NatAm, Orange=Inuit, Brown=Andam, Green=Polyn.
Population-specific FST
A complete list of the population- and locus-specific estimates β̂il is given in the Supplementary Material. In Figure 2 we display estimates of the population-specific parameters βi for populations i. The estimates use all the loci scored in each population, as explained in the Appendix, and we bootstrap over loci to provide the 95% confidence intervals indicated by vertical lines in the figure. The between-population-pair matching values M̃B used to estimate the β’s were for all pairs of populations in the survey. The estimates have been ordered by size and colored by geographic group. The smallest values are for the African populations, reflecting the greater diversity within those older populations, whereas the largest values are for the smaller and less diverse Native American and Inuit populations. The Asian values are generally higher than the Caucasian values, with both lying between the African and Native American values. This pattern is not unexpected but is based on larger sets of data than have been used previously. Even when up to 24 loci are used to estimate βi, there are large sampling variances. We now turn to a closer look among loci and among geographic groups.
Figure 2.
Estimated values of the population-specific βi, ordered by size. Each vertical line in the plot represents one population, and the length of the line is the 95% confidence interval obtained by bootstrapping over loci.
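Bootstrapping over loci can be sketched as follows. The text states only that loci are resampled. The percentile interval, the number of replicates, the fixed seed and the function names are assumptions of ours for illustration. Inputs are the per-locus within-population values M̃il for one population and the per-locus between-pair averages M̃Bl.

```python
import random
from typing import List, Tuple


def beta_from_loci(m_il: List[float], m_bl: List[float]) -> float:
    """Ratio-of-averages estimate (M_i - M_B) / (1 - M_B) over the supplied loci."""
    m_i = sum(m_il) / len(m_il)
    m_b = sum(m_bl) / len(m_bl)
    return (m_i - m_b) / (1.0 - m_b)


def bootstrap_ci(
    m_il: List[float],
    m_bl: List[float],
    n_boot: int = 1000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval obtained by resampling loci with replacement."""
    rng = random.Random(seed)
    n_loci = len(m_il)
    stats = []
    for _ in range(n_boot):
        # Resample locus indices so that M_il and M_Bl stay paired by locus.
        idx = [rng.randrange(n_loci) for _ in range(n_loci)]
        stats.append(beta_from_loci([m_il[k] for k in idx], [m_bl[k] for k in idx]))
    stats.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-locus matching proportions for one population.
    m_il = [0.21, 0.18, 0.25, 0.30, 0.16, 0.22]
    m_bl = [0.19, 0.17, 0.22, 0.27, 0.15, 0.20]
    print(beta_from_loci(m_il, m_bl), bootstrap_ci(m_il, m_bl))
```

With few loci the interval is wide, which is consistent with the large sampling variances noted in the text.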
Locus-specific FST
In Figure 3 we display box-plots of the estimates of βil for loci l and populations i. The between-population pair matching values M̃B used to estimate these β’s were for all pairs of populations in the survey. The colored dots show the mean values over populations for each of the nine geographic groups.
Figure 3.
Estimated values of the locus-specific βil, ordered by locus. The black dots, and box-plots with whiskers, are for populations. The colored dots are for geographic groups. The box plots show the medians and interquartile ranges: the whiskers extend out from the box plots to 1.5 times the interquartile range.
There is relatively little variation over loci in the averages over populations of locus-specific values, but there is substantial variation in the range of values over populations at each locus. The variation appears to reflect the numbers of geographic groups and/or numbers of populations typed at each locus. Locus D3S1358, for example, was typed in over 400 populations from all geographic groups, whereas locus D6S1043 was typed in only eight populations in four geographic groups. This sampling range seems to mask any effect of mutation rate in the figure. The loci have been ordered on the X axis in decreasing order of published mutation rates (Table 14.5 in [16]): SE33 on the left has the highest reported rate, of 0.0016, whereas the D1S1656, D2S441, D10S1248, D22S1045 loci on the right have no reported mutations.
Geographic-group-specific FST
In Figure 4 we display population-specific estimates of βi for each continental group of populations. Two sets of estimates are shown: those on the right, with a shaded box plot, use only the populations in the group to calculate the between-population-pair matching proportion M̃B. The resulting βW estimate is therefore the average over populations of θ relative to that group or region, and we could write the estimates as β̂PR. These are the estimates to be used if the relevant group of populations is known. The estimates on the left for each group, with an open box, use all pairs of populations across all groups to calculate M̃B. Now the estimates are relative to the total collection of populations and we could write them as β̂PT. These are appropriate if there was no information about the group to which the relevant population belongs.
Figure 4.
Estimated values of the population-specific β’s, for each geographic group, using all loci. Each box plot indicates the inter-quartile variation among populations within the group, with whiskers extending out from the box plots by 1.5 times the interquartile range. For each geographic group, the left hand plot (white box plot) compares each within-population matching proportion to the average matching proportion among all pairs of populations. The right hand plot (grey box plot) compares each within-population matching proportion to the average matching proportion among all pairs of populations in that region.
As the population versus total estimates are to be used when there is no information about the group to which a population belongs, it is reasonable that those estimates β̂PT are greater than the population versus region estimates β̂PR: this increases the matching probability and reduces the evidentiary strength of a match. A substantial difference in the two estimates β̂PT and β̂PR is seen for the Inuit group of populations: the average within-population matching proportion, averaged over loci, is M̃W = 0.4379 whereas the average between-population-pair matching proportions are 0.1726 for pairs within the group and 0.0090 for all pairs in the study. The Inuit populations are more similar to each other than are any pair of populations in the study. We find that β̂PR = 0.0205 and β̂PT = 0.1057. The theta correction has much less effect if attention can be confined to the Inuit group.
The values for the African group show the opposite relationship from the Inuit group. Now the average within-population matching proportion is M̃W = 0.1884 and the average between-population- pair averages are 0.1691 within the African region and 0.1726 for all pairs of populations. There is more divergence between pairs of African populations than there is among all pairs of populations and β̂PR = 0.0082 and β̂PT = 0.0020. The theta corrections for African populations have little effect regardless of the reference group.
The locus- and group-specific values of β̂Wl are shown in Tables 3 and 4, where the group or the world serve as reference collections of populations. We advocate the use of the locus-average values. Except for Africa and India-Pakistan, the values are greater when the world rather than the group serves as a reference, and we note the different rankings of the groups in the two tables: Africa clearly has the lowest average value in Table 4 whereas it is comparable to the Australian Aboriginal and Hispanic values in Table 3. The Inuit group has the highest value in Table 4, but is second behind Native Americans in Table 3.
Table 3.
Estimated values β̂Wl of locus-specific β’s for each geographic region, and the value for all loci within a region. Each estimate is for θW relative to pairs of populations within that region. There are no estimates if a locus is scored in fewer than two populations for a region, or if fewer than 25 individuals were typed in a region. Values in the body of the table are for each population, averaged over populations in a region. The bottom row is the average over loci. The method of averaging is described in the Supplementary Material.
| Locus | Africa | AusAb | Asian | Caucn | Hisp | IndPk | NatAm | Inuit | Polyn |
|---|---|---|---|---|---|---|---|---|---|
| CSF1PO | 0.0042 | 0.0020 | 0.0073 | 0.0031 | 0.0018 | 0.0064 | 0.0256 | 0.0066 | 0.0256 |
| D1S1656 | 0.0047 | —— | 0.0027 | 0.0039 | 0.0032 | −0.0018 | —— | —— | —— |
| D2S441 | 0.0129 | —— | 0.0082 | 0.0032 | 0.0212 | 0.1933 | —— | —— | —— |
| D2S1338 | 0.0101 | 0.0039 | 0.0129 | 0.0075 | 0.0097 | 0.0158 | 0.0661 | —— | 0.0051 |
| D3S1358 | 0.0042 | 0.0002 | 0.0104 | 0.0044 | 0.0111 | 0.0178 | 0.0548 | 0.0058 | 0.0014 |
| D5S818 | 0.0044 | 0.0018 | 0.0090 | 0.0055 | 0.0125 | 0.0109 | 0.0553 | 0.0127 | 0.0073 |
| D6S1043 | —— | —— | 0.0051 | 0.0005 | —— | —— | —— | —— | —— |
| D7S820 | 0.0055 | 0.0008 | 0.0117 | 0.0064 | 0.0070 | 0.0067 | 0.0282 | 0.0092 | 0.0047 |
| D8S1179 | 0.0050 | 0.0005 | 0.0123 | 0.0059 | 0.0021 | 0.0122 | 0.0197 | 0.0273 | 0.0077 |
| D10S1248 | 0.0018 | —— | 0.0137 | 0.0026 | 0.0036 | −0.0018 | —— | —— | —— |
| D12S391 | 0.0090 | —— | 0.0039 | 0.0016 | 0.0204 | −0.0018 | —— | —— | —— |
| D13S317 | 0.0068 | 0.0008 | 0.0157 | 0.0071 | 0.0144 | 0.0062 | 0.0335 | 0.0220 | 0.0136 |
| D16S539 | 0.0085 | 0.0015 | 0.0205 | 0.0047 | 0.0094 | 0.0049 | 0.0343 | 0.0402 | 0.0042 |
| D18S51 | 0.0047 | 0.0003 | 0.0145 | 0.0050 | 0.0025 | 0.0053 | 0.0306 | 0.0076 | 0.0026 |
| D19S433 | 0.0103 | 0.0011 | 0.0103 | 0.0091 | 0.0136 | 0.0002 | 0.0310 | —— | 0.0143 |
| D21S11 | 0.0151 | 0.0001 | 0.0140 | 0.0055 | 0.0064 | 0.0128 | 0.0530 | 0.0002 | 0.0184 |
| D22S1045 | 0.0270 | —— | 0.0498 | 0.0017 | −0.0022 | −0.0018 | —— | —— | —— |
| FGA | 0.0033 | 0.0002 | 0.0116 | 0.0038 | 0.0058 | 0.0076 | 0.0235 | 0.0147 | 0.0063 |
| PENTAD | 0.0075 | —— | 0.0145 | 0.0107 | 0.0022 | 0.0171 | 0.0259 | —— | —— |
| PENTAE | 0.0015 | —— | 0.0152 | 0.0056 | 0.0029 | 0.0115 | 0.0217 | —— | —— |
| SE33 | 0.0062 | —— | 0.0084 | 0.0264 | −0.0009 | −0.0018 | —— | —— | —— |
| TH01 | 0.0209 | 0.0006 | 0.0237 | 0.0176 | 0.0176 | 0.0128 | 0.0640 | 0.0366 | 0.0168 |
| TPOX | 0.0104 | 0.0871 | 0.0164 | 0.0076 | 0.0074 | 0.0174 | 0.0511 | 0.0359 | 0.0307 |
| vWA | 0.0042 | 0.0002 | 0.0114 | 0.0046 | 0.0076 | 0.0059 | 0.0198 | 0.0479 | 0.0045 |
| All | 0.0081 | 0.0064 | 0.0133 | 0.0066 | 0.0077 | 0.0171 | 0.0368 | 0.0199 | 0.0106 |
Table 4.
Estimated values β̂Wl of locus-specific β’s for each geographic region, and the value for all loci within a region. Each estimate is for θW relative to all pairs of populations in the survey. There are no estimates if a locus is scored in fewer than two populations for a region, or if fewer than 25 individuals were typed in a region. Values in the body of the table are for each population, averaged over populations in a region. The right-most column is the average over regions, and the bottom row is the average over loci. The method of averaging is described in the Supplementary Material.
| Africa | AusAb | Asian | Caucn | Hisp | IndPk | NatAm | Inuit | Polyn | World | |
|---|---|---|---|---|---|---|---|---|---|---|
| CSF1PO | −0.0668 | 0.0130 | 0.0154 | 0.0127 | 0.0165 | 0.0197 | 0.0616 | 0.0406 | 0.0291 | 0.0117 |
| D1S1656 | 0.0339 | —— | 0.0658 | −0.0018 | 0.0189 | 0.0316 | —— | 0.0812 | —— | 0.0157 |
| D2S441 | 0.0153 | —— | 0.0265 | 0.0316 | 0.1005 | −0.0285 | —— | 0.1625 | —— | 0.0332 |
| D2S1338 | 0.0029 | 0.0313 | 0.0319 | 0.0129 | 0.0234 | 0.0134 | 0.1210 | 0.1255 | 0.0035 | 0.0292 |
| D3S1358 | 0.0145 | 0.0279 | 0.0578 | −0.0345 | 0.0239 | 0.0227 | 0.2200 | 0.2196 | 0.0426 | 0.0254 |
| D5S818 | 0.0102 | −0.0229 | −0.0132 | 0.0465 | 0.0474 | 0.0197 | 0.1192 | 0.0461 | −0.0243 | 0.0337 |
| D6S1043 | −0.0006 | —— | 0.0126 | 0.0669 | 0.0030 | —— | —— | —— | —— | 0.0233 |
| D7S820 | 0.0244 | 0.0557 | 0.0345 | 0.0001 | 0.0165 | 0.0039 | 0.0842 | 0.0443 | −0.0078 | 0.0222 |
| D8S1179 | 0.0405 | −0.0153 | −0.0187 | 0.0169 | 0.0273 | −0.0207 | 0.0885 | 0.1264 | 0.0227 | 0.0179 |
| D10S1248 | −0.0397 | —— | 0.0383 | 0.0047 | 0.0473 | −0.0195 | —— | 0.1345 | —— | 0.0102 |
| D12S391 | 0.0317 | —— | 0.0448 | −0.0097 | 0.0745 | 0.0258 | —— | 0.0522 | —— | 0.0120 |
| D13S317 | 0.1221 | 0.0806 | 0.0235 | 0.0445 | 0.0051 | 0.0093 | 0.0252 | 0.0990 | 0.0375 | 0.0384 |
| D16S539 | −0.0018 | 0.0597 | 0.0237 | 0.0288 | 0.0093 | −0.0025 | 0.0720 | 0.1635 | 0.0227 | 0.0250 |
| D18S51 | −0.0012 | 0.0064 | 0.0345 | 0.0064 | 0.0026 | 0.0323 | 0.0503 | 0.0733 | 0.0538 | 0.0181 |
| D19S433 | −0.0095 | 0.1661 | 0.0226 | 0.0410 | 0.0053 | 0.0166 | 0.0132 | −0.0013 | 0.0015 | 0.0254 |
| D21S11 | −0.0076 | −0.0225 | 0.0422 | 0.0084 | 0.0126 | 0.0013 | 0.0702 | 0.0492 | 0.0393 | 0.0200 |
| D22S1045 | −0.0626 | —— | −0.0078 | 0.0300 | 0.0872 | −0.0211 | —— | 0.0836 | —— | 0.0204 |
| FGA | 0.0027 | 0.0038 | 0.0183 | 0.0164 | 0.0011 | 0.0072 | 0.0226 | 0.0296 | 0.0655 | 0.0142 |
| PENTAD | −0.0402 | —— | 0.0567 | 0.0180 | 0.0015 | 0.0160 | 0.0380 | —— | —— | 0.0227 |
| PENTAE | 0.0185 | —— | 0.0163 | 0.0235 | 0.0136 | 0.0137 | 0.0409 | —— | —— | 0.0202 |
| SE33 | 0.0234 | —— | 0.0138 | 0.0205 | 0.0152 | 0.0041 | —— | 0.1081 | —— | 0.0219 |
| TH01 | 0.0731 | 0.0679 | 0.1465 | 0.0189 | 0.0369 | 0.0199 | 0.2084 | 0.5200 | 0.0464 | 0.0755 |
| TPOX | −0.1336 | −0.0031 | 0.0911 | 0.0578 | 0.0064 | −0.0369 | 0.0736 | 0.0395 | 0.0412 | 0.0339 |
| VWA | −0.0021 | 0.0246 | 0.0195 | 0.0087 | 0.0373 | 0.0055 | 0.0808 | 0.0231 | 0.0213 | 0.0198 |
| All loci | 0.0038 | 0.0328 | 0.0328 | 0.0193 | 0.0258 | 0.0065 | 0.0804 | 0.1050 | 0.0265 | 0.0244 |
From the perspective of a forensic scientist wanting to assign an FST value for a broad geographic group, if an upper limit (the upper end of each whisker in Figure 4) were considered the most appropriate [3], then we would suggest values around 0.05, except for Native Americans and Inuit, for whom the value is above 0.10. Rather than the upper limit, however, we have some preference for the median values (the center of each box in Figure 4), and these suggest values around 0.01–0.03, except for Native Americans and Inuit, for whom the values are higher and more sensitive to whether the group or the world is the reference set. The table in the Supplementary Material shows much higher variation for locus- and population-specific values. As we discuss below, our recommendation is to use the multi-locus values in the last lines of Tables 3 and 4.
DISCUSSION
We have presented an analysis of an extensive set of autosomal forensic STR allele frequencies with the aim of assisting forensic scientists in assigning FST or theta values for use in match probability calculations.
If allele frequencies p̌iu are known for the population thought to be relevant, meaning that it is the population from which an unknown contributor to an evidence profile is supposed to be drawn randomly, then those frequencies may be used directly. Assuming independence of alleles within and between loci, and assuming the population is large, the probability of the two profiles (for the person of interest and the unknown person with a matching profile) is just the product of these frequencies and the match probability is the same as the profile probability.
In practice, however, allele frequencies in a population are not known, and sampling variation must be considered. For reasonably large sample sizes, the evolutionary variation is larger than the variation due to sampling individuals from the population, and the “theta corrections” of Equations 1 are appropriate. Values of θi need to be assigned: they cannot be estimated with data from a single population. Taking evolutionary variation into account allows, in effect, the evolutionary relatedness of the person of interest and the unknown person to be accommodated. If the relevant population is not sampled, but allele frequencies are available from some collection of populations, perhaps of some specific geographic region or ancestry, then θ is replaced by β, with an average value βW seeming most appropriate.
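As an illustration of how an assigned θ (or β) value enters a calculation, the following Python sketch uses the widely cited Balding and Nichols [7] form of the single-locus match probability. It assumes that this is the form of the “theta corrections” referred to as Equations 1; the function names and the example allele frequencies are hypothetical and chosen only for illustration.

```python
def match_probability(p_a, p_b=None, theta=0.0):
    """Theta-corrected single-locus match probability (Balding-Nichols form).

    p_a, p_b: database allele frequencies; p_b=None denotes a homozygote.
    theta: assigned F_ST (or beta) value; theta=0 gives the product rule.
    """
    denom = (1 + theta) * (1 + 2 * theta)
    if p_b is None:
        # Homozygous profile A_a A_a
        return (2 * theta + (1 - theta) * p_a) * (3 * theta + (1 - theta) * p_a) / denom
    # Heterozygous profile A_a A_b
    return 2 * (theta + (1 - theta) * p_a) * (theta + (1 - theta) * p_b) / denom


def profile_match_probability(genotype_freqs, theta=0.0):
    """Multiply single-locus values, assuming independence between loci.

    genotype_freqs: iterable of (p_a, p_b) pairs, with p_b=None for homozygotes.
    """
    result = 1.0
    for p_a, p_b in genotype_freqs:
        result *= match_probability(p_a, p_b, theta)
    return result


# Hypothetical two-locus profile: one homozygote, one heterozygote.
profile = [(0.1, None), (0.1, 0.2)]
product_rule = profile_match_probability(profile, theta=0.0)
corrected = profile_match_probability(profile, theta=0.03)
```

With θ = 0 the functions reduce to the product rule (p² for homozygotes and 2pq for heterozygotes). For the example frequencies used here, setting θ = 0.03 gives a larger match probability than the product rule.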
Silva et al. [1] showed that measures of population structure based on STR data provide good separation of geographic groups whether the loci were selected for evolutionary or forensic studies, with attenuated separation for forensic markers because they have been chosen to have the high diversity needed for individual identification. These authors made most use of the statistic RST, in which alleles are compared by the squared difference in the number of repeat units, rather than simply matching or not. As did these authors, we also examined the survey of non-forensic STR data reported by Pemberton et al. [15] and, like them, found similar but attenuated separation for forensic markers with β instead of RST (results not shown). It is interesting that the “global” FST value in [1] of 0.027 from allele frequency data is comparable to our 0.024 value shown in Table 4. Silva et al. did not focus on estimation of match probabilities.
Steele et al. [3] reported FST values for a collection of populations self-identified by UK residents or potential immigrants. Their “direct” method of estimation made use of data from a reference population with similar continental ancestry for each population under consideration, so their approach is closest to our β̂PR values shown in Table 3. Their “indirect” method uses allele frequencies from the whole set of populations and so corresponds to the approach we used to generate the β̂PT values shown in Table 4. Steele et al. use a Bayesian method and we would expect the medians of their posterior distributions to be comparable to the moment estimates we display. They concluded with a general recommendation of an FST of no more than 0.03 being generally applicable. For populations within a group, we are in general agreement with Steele et al., provided data from the appropriate group are used. If a world-wide dataset is to be used then higher values are necessary for Native American and Inuit populations. For individual Asian populations, however, we often find higher values than those shown in [3].
Steele and Balding [17] repeated the recommendation in [3] that an FST of 0.03 is sufficiently large to be almost always conservative. They advocate the use of a database “most appropriate” for the unknown donor of an evidentiary sample.
As the forensic task is to predict profile matching probabilities from an available set of data, we find intuitive appeal in phrasing our approach in terms of allelic matching proportions, especially by emphasizing the need to consider matching within and between populations. A matching proportion within a population is high in the present context only if it is substantially higher than matching between pairs of populations. The importance of the appropriate terms of reference is illustrated by the plots in Figure 4.
The situation most likely to confront a forensic scientist is having to decide on a value of FST for match probability predictions for a population when the allele frequencies are going to be based on a sample from a collection of populations of similar ancestry or geographic region. We show the locus-specific values for each region in Tables 3 and 4. These β̂Wl values are highly variable across loci and values are ranked differently within each region. We note a few negative locus-specific values, indicating less allelic matching at those loci within populations than between pairs of populations in a region. Because of the large variances in locus-specific values, we place little confidence in any one of them. We suggest, instead, the values we show that are based on all the loci typed for a population. In Figure 2 we showed the 95% confidence intervals based on bootstrapping over loci and we note that Steele et al. [3] advocated using quantities analogous to the upper ends of these intervals. Our recommendation, however, is to use the all-loci values in the last rows of Tables 3 and 4.
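A generic percentile bootstrap over loci can be sketched in Python as follows. It is an illustration under stated assumptions and may differ in detail from the procedure behind Figure 2: the inputs are hypothetical per-locus matching proportions for one population, and each resample is summarized by the all-loci estimate that weights locus-specific values by their denominators.

```python
import numpy as np


def bootstrap_over_loci(m_within, m_between, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for an all-loci beta estimate.

    m_within[l]:  within-population matching proportion at locus l.
    m_between[l]: average between-population-pair matching proportion at locus l.
    Loci are resampled with replacement (an assumed resampling scheme).
    """
    m_within = np.asarray(m_within, dtype=float)
    m_between = np.asarray(m_between, dtype=float)
    rng = np.random.default_rng(seed)
    n_loci = len(m_within)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_loci, size=n_loci)  # resampled locus indices
        numerator = (m_within[idx] - m_between[idx]).sum()
        denominator = (1.0 - m_between[idx]).sum()
        estimates[b] = numerator / denominator
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(estimates, [alpha, 1.0 - alpha])
    return lower, upper


# Hypothetical matching proportions at four loci.
m_w = [0.30, 0.12, 0.25, 0.18]
m_b = [0.28, 0.10, 0.20, 0.17]
lower, upper = bootstrap_over_loci(m_w, m_b)
```

With only a handful of loci the interval is wide, which is consistent with the caution expressed above about placing confidence in any single locus-specific value.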
We have approached allelic match probabilities from the expression $\sum_u \check{p}_{iu}^2$ for pairs of alleles in a population, i, for which each allele has probability p̌iu of being of type Au. The actual population frequencies p̌iu are generally unknown, so we took expectations over replicates of the evolutionary process to obtain $\sum_u [p_u^2 + p_u(1-p_u)\theta_i]$. If only an estimate p̃u is available for pu, we had the estimate $\sum_u [\tilde{p}_u^2 + \tilde{p}_u(1-\tilde{p}_u)\hat{\beta}_i]$. Steele et al. [3] also considered the case of matching probabilities for individuals in two populations i, j. Our development leads to $\sum_u [p_u^2 + p_u(1-p_u)\theta_{ij}]$ in that case, with sample values requiring βij = (θij−θB)/(1−θB). If populations i, j are in the same region R then we have an estimation problem: we cannot estimate βij other than by comparison to the average θB over all pairs of populations in that region, and the only possibility we have is to compare each β̂ij = (M̃ij − M̃B)/(1 − M̃B) to their average. This average is zero by construction. The reference point for a single population is (the average over) pairs of populations: for a single region there is no meaningful reference point for pairs of populations. We can make progress in estimating βij for two populations in the same region by taking θB to refer to pairs of populations, one from each of two regions. If we have a hierarchical structure limited to populations and regions, however, we cannot address estimation of θij for populations in different regions. Steele et al. [3] do not address the decomposition of FST into components within and between populations and so do not fully address the issue of allelic matching between populations.
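The statement that the average of the pairwise estimates is zero by construction follows in one line. Assuming, as in the Appendix, that M̃B is the average of the M̃ij over the r(r − 1) ordered pairs of distinct populations in the region (r is introduced here only for this illustration):

$$
\frac{1}{r(r-1)}\sum_i\sum_{j\neq i}\hat{\beta}_{ij}
= \frac{\frac{1}{r(r-1)}\sum_i\sum_{j\neq i}\tilde{M}_{ij} - \tilde{M}_B}{1-\tilde{M}_B}
= \frac{\tilde{M}_B - \tilde{M}_B}{1-\tilde{M}_B}
= 0 .
$$

Individual β̂ij values therefore only rank pairs of populations relative to the regional average; they carry no information about the absolute level of θB.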
The concept we stress in this paper is that forensic match probabilities, designed to recognize the shared ancestry between a person of interest and an unknown donor of an evidentiary sample, can be expressed in terms of identity by descent of pairs of alleles drawn from the relevant population. If sample allele frequencies, drawn from a larger set of populations, are to be used in calculating match probabilities then it is necessary to consider the identity between alleles drawn from pairs of these populations. It is the compound parameter FST, comparing allelic identity within populations to that between populations, that is of forensic importance. The value of this parameter clearly depends on the set of populations furnishing allele frequencies. The same point was made by Steele et al. [3] and we endorse their suggestion for the sampled set of populations to be generally of similar ancestry to the population of interest. A possible exception is for populations of African ancestry that tend to have retained greater diversity than those in the rest of the world.
Highlights.
Largest survey of published frequencies of forensic STR alleles.
Clarification of meaning of Fst needed to calculate match probabilities.
Demonstration that Fst can be expressed in terms of allelic matching proportions within and between populations.
Demonstration that Fst for a population depends on the set of populations to which it is compared.
Acknowledgments
This work was supported in part by grants 2011-DN-BX-K541 and 2014-DN-BX-K028 from the US National Institute of Justice, and grant IZK0Z3 157867 from the Swiss National Science Foundation. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of the U.S. Departments of Justice or Commerce or of the Swiss National Science Foundation.
APPENDIX
Combining Estimates over Loci and Populations
The within- and between-population-pair matching proportions for populations i and population-pairs i, j at locus l are M̃il and M̃ijl. These have expectations [1−Hl(1−θil)] and [1−Hl(1−θijl)] respectively and they provide the population/locus specific estimates

$$\hat{\beta}_{il} = \frac{\tilde{M}_{il} - \tilde{M}_{Bl}}{1 - \tilde{M}_{Bl}}$$

where M̃Bl = ΣiΣj≠i M̃ijl/[rl(rl − 1)] if locus l is scored in rl populations. These estimates have expectation βil = (θil − θBl)/(1 − θBl) if θBl = ΣiΣj≠i θijl/[rl(rl − 1)].
Averaging over populations
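Averaging matching proportions leads to average estimates over populations for locus l. If locus l is scored in rl populations:

$$\tilde{M}_{Wl} = \frac{1}{r_l}\sum_i \tilde{M}_{il}, \qquad \hat{\beta}_{Wl} = \frac{\tilde{M}_{Wl} - \tilde{M}_{Bl}}{1 - \tilde{M}_{Bl}}$$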
Here the sums are over populations for which locus l is scored (e.g. θWl = Σi θil/rl), and the estimate is for the average over populations of the θil values relative to a reference value θBl. These estimates allow a comparison among loci if the same populations are scored for each locus. In reality, this may not be the case and comparisons need to be interpreted carefully, although there is less of an issue as the number of populations increases.
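The single-locus calculations can be sketched in Python. Each estimate compares a within-population matching proportion with the average matching proportion over pairs of distinct populations, divided by one minus that average. The sketch assumes that matching proportions are computed from allele counts as the proportion of matching pairs of distinct alleles within a sample, and as the sum of products of sample frequencies between two samples; these estimators, the function names, and the example counts are assumptions made for illustration.

```python
import numpy as np


def matching_proportions(counts):
    """counts: (r, k) array of allele counts for r populations and k alleles.

    Returns the r within-population matching proportions and the (r, r)
    matrix of between-population values (its diagonal is not used).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)              # alleles sampled per population
    freqs = counts / n[:, None]         # sample allele frequencies
    # Proportion of matching pairs of distinct alleles within each sample.
    m_within = (n * (freqs ** 2).sum(axis=1) - 1.0) / (n - 1.0)
    # Proportion of matching pairs, one allele from each of two samples.
    m_between = freqs @ freqs.T
    return m_within, m_between


def beta_estimates(counts):
    """Population-specific and population-average beta estimates at one locus."""
    m_within, m_between = matching_proportions(counts)
    r = len(m_within)
    off_diagonal = ~np.eye(r, dtype=bool)
    m_b = m_between[off_diagonal].mean()        # average over r(r-1) ordered pairs
    beta_pop = (m_within - m_b) / (1.0 - m_b)   # one value per population
    beta_w = (m_within.mean() - m_b) / (1.0 - m_b)
    return beta_pop, beta_w, m_b


# Hypothetical counts: two populations, two alleles, 20 alleles sampled in each.
beta_pop, beta_w, m_b = beta_estimates([[10, 10], [20, 0]])
```

In this toy example the first population has slightly less within-population matching than the between-population average, so its estimate is negative, while the second population is fixed for one allele and has an estimate of one. Because all populations share the same denominator, the population-average estimate equals the mean of the population-specific values.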
Averaging over loci
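Combining estimates over loci for population i is not as straightforward if the θ’s differ among loci. An unweighted average over loci is

$$\hat{\beta}_i^{u} = \frac{1}{L_i}\sum_l \hat{\beta}_{il} = \frac{1}{L_i}\sum_l \frac{\tilde{M}_{il} - \tilde{M}_{Bl}}{1 - \tilde{M}_{Bl}}$$

where Li is the number of loci scored in population i. This average has expectation

$$\frac{1}{L_i}\sum_l \frac{\theta_{il} - \theta_{Bl}}{1 - \theta_{Bl}}$$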
This is not the average over loci of the θil values relative to a reference value.
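Variance is reduced by weighting estimates by their denominators:

$$\hat{\beta}_i^{w} = \frac{\sum_l (\tilde{M}_{il} - \tilde{M}_{Bl})}{\sum_l (1 - \tilde{M}_{Bl})}$$

If there is no variation among loci, θil = θi and θBl = θB, as might be expected for neutral loci with similar mutation rates, the unweighted and weighted averages over loci each have the same expected value of (θi − θB)/(1 − θB).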
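The difference between the two ways of combining loci can be sketched in Python. The unweighted version averages the locus-specific ratios, while the weighted version sums numerators and denominators separately before dividing, which is equivalent to weighting each locus-specific estimate by its denominator. The function name and the example matching proportions are hypothetical.

```python
import numpy as np


def combine_over_loci(m_within, m_between):
    """Combine locus-specific beta estimates for one population.

    m_within[l]:  within-population matching proportion at locus l.
    m_between[l]: average between-population-pair matching proportion at locus l.
    Returns (unweighted, weighted) all-loci estimates.
    """
    m_within = np.asarray(m_within, dtype=float)
    m_between = np.asarray(m_between, dtype=float)
    per_locus = (m_within - m_between) / (1.0 - m_between)
    unweighted = per_locus.mean()
    # Ratio of sums: each locus is weighted by its denominator 1 - M_B.
    weighted = (m_within - m_between).sum() / (1.0 - m_between).sum()
    return unweighted, weighted


# Hypothetical matching proportions at two loci with different diversity.
unweighted, weighted = combine_over_loci([0.30, 0.12], [0.28, 0.10])

# When every locus has the same underlying value (here 0.1), both agree.
same_u, same_w = combine_over_loci([0.28, 0.55], [0.20, 0.50])
```

The two estimates differ when locus-specific values differ, because the weighted form gives more weight to loci with lower between-population matching. They coincide when the locus-specific values are equal.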
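A single estimate β̂W, for a set of populations, using all the scored loci, can be calculated as

$$\hat{\beta}_W = \frac{\sum_l (\tilde{M}_{Wl} - \tilde{M}_{Bl})}{\sum_l (1 - \tilde{M}_{Bl})}$$

where M̃Wl is the average of the M̃il over the populations scored at locus l. An unweighted single estimate that does not have an expectation dependent on Hl is

$$\hat{\beta}_W^{u} = \frac{1}{L}\sum_l \hat{\beta}_{Wl} = \frac{1}{L}\sum_l \frac{\tilde{M}_{Wl} - \tilde{M}_{Bl}}{1 - \tilde{M}_{Bl}}$$

where L is the number of loci. This is not the average value over loci, θW, relative to a reference value. If θil = θi for all l, as might be expected for neutral loci with equal mutation rates, the weighted and unweighted estimates each have the same expected value (θW − θB)/(1 − θB).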
Which Estimate Should be Used?
The results shown in this paper show variation in β̂il, suggesting variation in θil, over both loci and populations. The estimates are ratios of quadratic forms and large variances are expected. The figures suggest more variation over populations than over loci and this is not unexpected. One goal of this paper is to suggest values of “theta” for the calculations of match probabilities with the “theta correction.” We have shown that the appropriate quantity is θW, an average over populations in some relevant group of populations.
Our interest is primarily in the effect of population, rather than locus, on match probabilities. We would prefer to use a single theta, or β, value than separate values for each locus. [There is a need for separate values for each locus if we wish, for example, to detect signatures of natural selection at individual loci.] Which value should be used? The mean or median, or some other percentile, of the collection of single-locus values for a population or group of populations are valid measures, and have been advocated by Balding and his colleagues. Collections of values are shown in Figures 2 and 3. Extensive sets of numerical values are shown in Supplementary Material section 4. We argue instead for the best single value calculated from all the loci, meaning the values with smallest variance, and these are the values shown in Tables 3 and 4.
LITERATURE CITED
- 1. Silva NM, Pereira L, Poloni ES, Currat M. Human neutral genetic variation and forensic STR data. PLoS One. 2012;7:e49666. doi: 10.1371/journal.pone.0049666.
- 2. Pamplona JP, Freitas F, Pereira L. A worldwide database of autosomal markers used by the forensic community. Forensic Science International: Genetics Supplement Series. 2008;1:656–657.
- 3. Steele CD, Syndercombe Court D, Balding DJ. Worldwide FST estimates relative to five continental-scale populations. Annals of Human Genetics. 2014;78:468–477. doi: 10.1111/ahg.12081.
- 4. Stringer CB, Andrews P. Genetic and fossil evidence for the origin of modern humans. Science. 1988;239:1263–1268. doi: 10.1126/science.3125610.
- 5. Cockerham CC. Variance of gene frequencies. Evolution. 1969;23:72–84. doi: 10.1111/j.1558-5646.1969.tb03496.x.
- 6. Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
- 7. Balding DJ, Nichols RA. DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands. Forensic Science International. 1994;64:125–140. doi: 10.1016/0379-0738(94)90222-4.
- 8. National Research Council. National Research Council Committee on DNA Forensic Science, The Evaluation of Forensic DNA Evidence. National Academy Press; Washington, D.C: 1996.
- 9. Ayres KL, Overall ADJ. Allowing for within-subpopulation inbreeding in forensic match probabilities. Forensic Science International. 1999;103:207–216.
- 10. Curran JM, Triggs CM, Buckleton JS, Weir BS. Interpreting DNA mixtures in structured populations. Journal of Forensic Sciences. 1999;44:987–995.
- 11. Weir BS. The rarity of DNA profiles. Annals of Applied Statistics. 2007;1:358–370. doi: 10.1214/07-AOAS128.
- 12. Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–1370. doi: 10.1111/j.1558-5646.1984.tb05657.x.
- 13. Weir BS, Hill WG. Estimating F-statistics. Annual Review of Genetics. 2002;36:721–750. doi: 10.1146/annurev.genet.36.050802.093940.
- 14. Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting FST: the impact of rare variants. Genome Research. 2013;23:1514–1521. doi: 10.1101/gr.154831.113.
- 15. Pemberton TJ, DeGiorgio M, Rosenberg NA. Population structure in a comprehensive genomic data set on human microsatellite variation. G3 Genes, Genomes, Genetics. 2013;3:891–907. doi: 10.1534/g3.113.005728.
- 16. Butler JM. Advanced Topics in Forensic DNA Typing: Interpretation. Academic Press; New York: 2014.
- 17. Steele CD, Balding DJ. Choice of population database for forensic DNA profile analysis. Science and Justice. 2014;54:487–493. doi: 10.1016/j.scijus.2014.10.004.