THE RARITY OF DNA PROFILES

Bruce S Weir

doi:10.1214/07-AOAS128

Author manuscript; available in PMC: 2008 Nov 21.

Published in final edited form as: Ann Appl Stat. 2007;1(2):358–370. doi: 10.1214/07-AOAS128

PMCID: PMC2585748 NIHMSID: NIHMS47987 PMID: 19030117

Abstract

It is now widely accepted that forensic DNA profiles are rare, so it was a surprise to some people that different people represented in offender databases are being found to have the same profile. In the first place this is just an illustration of the birthday problem, but a deeper analysis must take into account dependencies among profiles caused by family or population membership.

Key words and phrases: DNA profiles, forensic profiles, birthday problem, population genetics, relatives, inbreeding

1. Introduction

In the 20 years since the introduction of DNA profiles for forensic identification, a widespread belief has developed that it is unlikely two people will share the same profile. Assuming at least 10 alleles or 55 genotypes at each locus, a 13-locus system in common use allows for at least 10²¹ different profiles, which far exceeds the total number of people in the world. It is difficult to attach a meaningful estimate to the probability that a person chosen at random would have a particular profile, but a good first step is to assume independence of all (26) alleles in a profile to arrive at an estimate that “reaches a figure altogether beyond the range of the imagination” in the language Galton (1892) used to describe probabilities for fingerprints. Given such arguments, what is to be made of recent findings that the profiles of two people in a database of offender profiles either match or come very close to matching? Is there a need to rethink the understanding that profiles are rare?

There are forensic, statistical and genetic aspects to discussions of profile rarity. The key forensic issue centers on the comparison of two profiles, often one from a crime-scene sample and one from a suspect. The relevant calculations must recognize the existence of two profiles rather than focusing on only one of them. The statistical aspects are addressed initially by the “Birthday Problem.” The probability that a person chosen randomly has a particular birthday is 1/365, ignoring leap-year complications, but there is over 50% probability that two people in a group of 23 people share a birthday. This result recognizes that the number of pairs of people, 253, is much greater than the number of people, 23, and that the particular shared birthday is not specified. The finding of DNA profile matching in an Arizonan database of 65,000 profiles [Troyer, Gilroy and Koeneman (2001)] becomes less surprising when it is recognized that there are over two billion possible pairs of profiles in that database. The genetic aspects rest on the shared evolutionary history of humans. The very fact that the population is finite means that any two people have shared ancestors and the resulting dependencies increase the probability of profile matching.
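The birthday figures quoted above can be checked directly. The following is a minimal sketch, assuming equally likely birthdays and no leap years as in the text; the function names are illustrative and not from the paper.

```python
from math import comb


def pairs(n: int) -> int:
    """Number of distinct pairs that can be formed from n people."""
    return comb(n, 2)


def prob_shared_birthday(n: int, days: int = 365) -> float:
    """Exact probability that at least two of n people share a birthday,
    assuming every day is equally likely."""
    p_no_match = 1.0
    for i in range(n):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_no_match *= (days - i) / days
    return 1.0 - p_no_match


print(pairs(23))                 # pairs among 23 people
print(prob_shared_birthday(23))  # just over one half
print(pairs(65_000))             # pairs in a database of 65,000 profiles
```

The point of the sketch is the contrast between the number of people and the number of pairs: the latter grows roughly as the square of the former.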

2. Forensic issues

The interpretation of DNA forensic evidence E requires the probabilities of that evidence under alternative hypotheses, referred to here as H_p and H_d for the case where they represent the views of prosecution and defense in a criminal trial. A simple scenario is when the profile G_C of a crime-scene stain matches that, G_S, of a suspect. The hypotheses may be as follows:

\begin{array}{l} H_{p}: \text{the suspect is the source of the crime-scene stain.} \\ H_{d}: \text{the suspect is not the source of the crime-scene stain.} \end{array}

A quantity of interest to those charged with making a decision is the posterior odds of the prosecution hypothesis after the finding of matching DNA profiles:

\text{Posterior odds} = \frac{Pr (H_{p} ∣ E)}{Pr (H_{d} ∣ E)} .

From Bayes’ theorem,

\begin{aligned} \frac{Pr (H_{p} ∣ E)}{Pr (H_{d} ∣ E)} & = \frac{Pr (E ∣ H_{p})}{Pr (E ∣ H_{d})} \times \frac{Pr (H_{p})}{Pr (H_{d})}, \\ \text{Posterior odds} & = \text{LR} \times \text{Prior odds} \end{aligned}

and it is the likelihood ratio LR that is estimated by forensic scientists. In paternity disputes this quantity is called the paternity index. Those who equate Pr(E|H_p) and Pr(H_p|E), as in “The odds were billions to one that the blood found at the scene was not O.J.s” [Anonymous (1997)], are said to have committed the “Prosecutor’s Fallacy” [Thompson and Schumann (1987)].

The likelihood ratio for a single-contributor DNA profile can be expressed as

\begin{aligned} LR & = \frac{Pr (G_{S}, G_{C} ∣ H_{p})}{Pr (G_{S}, G_{C} ∣ H_{d})} \\ & = \frac{Pr (G_{S} ∣ G_{C}, H_{p})}{Pr (G_{S} ∣ G_{C}, H_{d})} \frac{Pr (G_{C} ∣ H_{p})}{Pr (G_{C} ∣ H_{d})} \\ & = \frac{1}{Pr (G_{S} ∣ G_{C}, H_{d})} \end{aligned}

by recognizing that the crime-scene stain profile does not depend on the alternative hypotheses and that the two profiles must match under the prosecution hypothesis. Among the many advantages of adopting this approach to comparing competing hypotheses is the clarification that it is match probabilities Pr(G_S|G_C) for profiles from two people that are relevant rather than profile probabilities Pr(G_S). In the discussion of matching profiles in a database, G_C and G_S can refer to the profiles from different people and the issue is whether or not matching is unlikely.
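The relation between the match probability, the likelihood ratio and the posterior odds can be illustrated numerically. This is only a sketch: the match probability and the prior odds used in the usage lines are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def likelihood_ratio(match_probability: float) -> float:
    """LR = 1 / Pr(G_S | G_C, H_d) for a single-contributor profile."""
    return 1.0 / match_probability


def posterior_odds(lr: float, prior_odds: float) -> float:
    """Bayes' theorem in odds form: posterior odds = LR x prior odds."""
    return lr * prior_odds


def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of H_p into Pr(H_p | E)."""
    return odds / (1.0 + odds)


# Hypothetical inputs: match probability of one in a million,
# prior odds of 1 to 1000 in favour of the prosecution hypothesis.
lr = likelihood_ratio(1e-6)
post = posterior_odds(lr, 1 / 1000)
print(lr, post, odds_to_probability(post))
```

The sketch makes the Prosecutor's Fallacy concrete: the likelihood ratio alone is not the posterior odds unless the prior odds happen to be one.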

3. Statistical issues

Diaconis and Mosteller (1989) discussed basic statistical techniques for studying coincidences and stated the law of truly large numbers: “With a large enough sample, any outrageous thing is likely to happen.” Can we attach probabilities to very unlikely events? In the forensic context, Kingston (1965) addressed match probabilities long before the advent of DNA profiling. If a particular evidence profile has probability P, then he assumed that the unknown number x of occurrences of the profile in a large population of N people is Poisson with parameter λ = NP. Suppose a person with the particular profile commits a crime, leaves evidence with that profile at the scene, and then rejoins the population. A person with the profile is subsequently found in the population and a simple model says that the probability that this suspect is the perpetrator is 1/x. Although x is not known, it must be at least one, so the probability that the correct person has been identified is the expected value of 1/x given that x ≥ 1. Those people who would equate x to its expected value λ and then assign equal probabilities to all λ people are said to have committed the “Defense Attorney’s Fallacy” [Thompson and Schumann (1987)].
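Kingston's quantity, the expected value of 1/x given x ≥ 1 for a Poisson count, has no simple closed form but is easy to evaluate by summing the series. The sketch below assumes only the Poisson model described above; the function name and the truncation point of the series are illustrative choices.

```python
from math import exp, expm1


def kingston_prob_correct(lam: float, max_x: int = 200) -> float:
    """E[1/x | x >= 1] for x ~ Poisson(lam), by direct summation."""
    pmf = exp(-lam)  # Poisson probability of x = 0
    total = 0.0
    for x in range(1, max_x + 1):
        pmf *= lam / x    # Poisson probability of the current x
        total += pmf / x  # contribution of 1/x weighted by Pr(x)
    # Condition on at least one occurrence: Pr(x >= 1) = 1 - exp(-lam).
    return total / (-expm1(-lam))


# Population of 3e8 and profile probability 1e-10, so lam = 0.03.
print(kingston_prob_correct(0.03))
```

For small λ the result is close to 1 − λ/4, which is about 0.9925 when λ = 0.03.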

Balding and Donnelly (1995), referring to Eggleston (1983) and Lenth (1986), pointed out that Kingston’s conditioning on at least one individual having the profile is not the same as the correct conditioning, that a specific individual (the suspect) has the profile. They gave a general treatment of this “island problem” and then Balding (1999) followed with a discussion of uniqueness of DNA profiles. He started with the event that a person (the perpetrator) sampled at random from a population of size (N + 1) has a particular profile. The remaining people in the population each have independent probability P of having the same profile. A second person (the suspect) is drawn from the population and may be the same person as the first (event G). The second person is found to have the same profile as the first (event E). If U is the event that the suspect has the profile and that no-one else in the population has the profile, then

Pr (U ∣ E) = Pr (U ∣ G, E) Pr (G ∣ E) .

Now Pr(G|E) = Pr(E|G) Pr(G)/[Pr(E|G) Pr(G) + Pr(E|Ḡ) Pr(Ḡ)] by Bayes’ theorem, and Pr(E|G) = 1, Pr(E|Ḡ) = P, Pr(G) = 1/(N + 1), so that Pr(G|E) = 1/(1 + NP). Moreover, for independent profiles, Pr(U|G, E) = (1 − P)^N so that Pr(U|E) > 1 − 2λ, where λ = NP as before. For the USA, with a population of about 3 × 10⁸, a profile with a probability of 10⁻¹⁰ would give λ = 0.03 and a probability of at least 0.94 that the correct person has been identified. This is not as dramatic a number as the original 10⁻¹⁰.
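The uniqueness calculation can be reproduced in a few lines. This is a sketch under the independence assumption stated above, using Pr(G|E) = 1/(1 + NP), which follows from the Bayes' theorem terms just listed; the function name is illustrative.

```python
from math import exp, log1p


def prob_unique(P: float, N: float) -> float:
    """Pr(U | E): the suspect is the perpetrator and nobody else among the
    remaining N people carries the profile, assuming independent profiles."""
    p_same_person = 1.0 / (1.0 + N * P)  # Pr(G | E) from Bayes' theorem
    p_no_other = exp(N * log1p(-P))      # (1 - P)^N, computed stably
    return p_no_other * p_same_person


P, N = 1e-10, 3e8        # USA example from the text
lam = N * P              # lambda = 0.03
print(prob_unique(P, N), 1 - 2 * lam)
```

The exact value is only slightly above the bound 1 − 2λ = 0.94, which is why the bound is adequate for reporting.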

The birthday problem has to do with multiple occurrences of any profile, not a particular profile as treated by Kingston and Balding. Mosteller (1962) refers to the latter as the “birthmate problem.” The probability that at least two of a sample of n people have the same unspecified birthday (or DNA profile), in the case where every birthday (or profile) has the same probability P, is

\begin{aligned} Pr (\text{At least one match}) & = 1 - Pr (\text{No matches}) \\ & = 1 - \{1 (1 - P) (1 - 2 P) \cdots [1 - (n - 1) P]\} \\ & \approx 1 - \prod_{i = 0}^{n - 1} e^{- i P} \approx 1 - e^{- n^{2} P / 2} \end{aligned}

For the USA example of P = 10⁻¹⁰, the chance of some profile being replicated in the population of N = 3 × 10⁸ is essentially 100%. The Arizona Department of Public Safety [Troyer, Gilroy and Koeneman (2001)] reported a nine-locus match in a database of 65,493 for a profile that had an estimated probability of 1 in 7.54 × 10⁸. Using that probability, the chance of finding two matching profiles in the database would be about 94%, so the finding is not unexpected. DNA profiles do not have equal or independent probabilities, however, so these calculations are approximate at best.
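The two numerical claims in this paragraph follow from the approximation 1 − exp(−n²P/2). The sketch below assumes equal and independent profile probabilities, exactly the simplification the text warns about; the function name is illustrative.

```python
from math import expm1


def prob_some_match(n: float, P: float) -> float:
    """Approximate probability that at least two of n profiles coincide
    when every profile has the same probability P: 1 - exp(-n^2 P / 2)."""
    return -expm1(-n * n * P / 2.0)


# Arizona database: 65,493 profiles, profile probability 1 in 7.54e8.
print(prob_some_match(65_493, 1 / 7.54e8))   # roughly 0.94

# USA example: N = 3e8 people, P = 1e-10.
print(prob_some_match(3e8, 1e-10))           # essentially 1
```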

4. Genetic issues

DNA profiles are genetic entities and, as such, are shaped by the evolutionary history of a population. Whereas it is sufficient to take samples from a population to provide descriptive statistics of that particular population, predictions of matching probabilities that recognize evolutionary events are necessarily expectations over replicate populations. There is no reason to believe that a particular population has properties that are at expectation.

As a simple example, consider the estimation of profile probabilities at a single locus A. If a sample of n genotypes provides estimates p̃_i for the frequencies p_i of alleles A_i, then genotypic frequency estimates are ${\tilde{p}}_{i}^{2}$ for homozygotes A_i A_i and 2p̃_i p̃_j for heterozygotes A_i A_j under the assumption of random mating within the population. Taking expectations of these estimates, over repeated samples from the same population and over replicates of the sampled population, provides

\begin{aligned} E ({\tilde{p}}_{i}^{2}) & = p_{i}^{2} + p_{i} (1 - p_{i}) \frac{1 + (2 n - 1) θ}{2 n}, \\ E (2 {\tilde{p}}_{i} {\tilde{p}}_{j}) & = 2 p_{i} p_{j} - 2 p_{i} p_{j} \frac{1 + (2 n - 1) θ}{2 n} \end{aligned}

[Weir (1996)] to introduce the population coancestry coefficient θ which measures the relationship between pairs of alleles within a population relative to the relationship of alleles between populations. To illustrate the meaning of “relative to” consider a fanciful example of a large community of people, all of whom are first cousins to each other. If these people pair at random, their children will form a population in which genotypic frequencies are products of allele frequencies. A child’s two alleles, one from each parent, are independent. From the perspective of an observer outside the community, however, the allele pairs within the community appear to be dependent, with θ= 1/16. This value of θ is needed to predict genotypic frequencies for the community children on the basis of population-wide allele frequencies.

For large sample sizes, the expected genotypic frequencies reduce to the parametric values $p_{i}^{2} + p_{i} (1 - p_{i}) θ$ and 2p_i p_j (1 − θ). The sample allele frequencies p̃_i are unbiased for the parametric values p_i and θ is serving to provide the variance of the sample values—in particular, p_i (1 − p_i)θ is the variance over populations of allele frequencies within one population. In the situation where alleles are selectively neutral, it is convenient to regard θ as the probability that a random pair of alleles in the same population are identical by descent, ibd, meaning that they have both descended from the same ancestral allele. Identity by descent is also an expectation over replicate populations.
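The large-sample genotypic frequencies can be tabulated for any set of allele frequencies. The sketch below uses hypothetical allele frequencies (0.5, 0.3, 0.2) purely for illustration; the function name is not from the paper.

```python
def genotype_frequencies(p, theta):
    """Expected genotypic frequencies for allele frequencies p and
    coancestry theta: p_i^2 + p_i(1 - p_i)theta for homozygotes and
    2 p_i p_j (1 - theta) for heterozygotes."""
    freqs = {}
    for i in range(len(p)):
        freqs[(i, i)] = p[i] ** 2 + p[i] * (1 - p[i]) * theta
        for j in range(i + 1, len(p)):
            freqs[(i, j)] = 2 * p[i] * p[j] * (1 - theta)
    return freqs


p = [0.5, 0.3, 0.2]                       # hypothetical allele frequencies
for theta in (0.0, 1 / 16):               # 1/16 is the first-cousin example
    freqs = genotype_frequencies(p, theta)
    print(theta, freqs[(0, 0)], sum(freqs.values()))
```

Increasing θ moves probability from heterozygotes to homozygotes while the frequencies still sum to one.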

The probabilities of pairs of genotypes require measures of relationship analogous to θ but for up to four alleles. Two individuals that are both homozygous A_i A_i for the same allelic type, for example, may carry two, three, four or two pairs of alleles that are ibd. For the class of evolutionary models where there is stationarity under the opposing forces of mutation introducing genetic variation and genetic drift causing variation to be lost, and allelic exchangeability, these higher-order ibd probabilities may all be expressed in terms of θ. The distribution of allele frequencies over replicate populations is Dirichlet for this class of models and a very useful consequence is that the probability of drawing an allele of type A_i from a population given that n_i of the previous n alleles drawn were of that type is [n_iθ + (1 − θ)p_i]/[1 + (n −1) θ] [Balding and Nichols (1997)]. This provides, for example, the probability of two members of the same population being homozygotes A_i A_i:

Pr (A_{i} A_{i}, A_{i} A_{i}) = \frac{p_{i} [θ + (1 - θ) p_{i}] [2 θ + (1 - θ) p_{i}] [3 θ + (1 - θ) p_{i}]}{(1 + θ) (1 + 2 θ)} .
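The expression for two homozygotes is the product of four successive draws under the sampling formula [n_iθ + (1 − θ)p_i]/[1 + (n − 1)θ]. The sketch below checks that the product agrees with the closed form; the allele frequency and θ in the usage lines are hypothetical, and the function names are illustrative.

```python
def next_allele_prob(p_i, n_i, n, theta):
    """Probability that the next allele drawn is A_i, given that n_i of the
    previous n alleles drawn were A_i (Dirichlet sampling formula)."""
    return (n_i * theta + (1 - theta) * p_i) / (1 + (n - 1) * theta)


def prob_two_homozygotes(p_i, theta):
    """Pr(A_i A_i, A_i A_i): four alleles in a row, all of type A_i."""
    prob = 1.0
    for n in range(4):
        # All n alleles drawn so far are A_i, so n_i = n.
        prob *= next_allele_prob(p_i, n, n, theta)
    return prob


def closed_form(p_i, theta):
    """The closed-form expression given in the text."""
    u = 1 - theta
    num = p_i * (theta + u * p_i) * (2 * theta + u * p_i) * (3 * theta + u * p_i)
    return num / ((1 + theta) * (1 + 2 * theta))


print(prob_two_homozygotes(0.1, 0.03), closed_form(0.1, 0.03))
```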

From this and similar expressions for other genotypes, it is possible to predict the probability that two members of a population will match, that is, have the same two alleles at a locus [Weir (2004)],

\begin{aligned} P_{2} = & \sum_{i} Pr (A_{i} A_{i}, A_{i} A_{i}) + \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{j}, A_{i} A_{j}) \\ = & \sum_{i} Pr (A_{i} A_{i} A_{i} A_{i}) + 2 \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{i} A_{j} A_{j}) \\ = & \frac{1}{D} [6 θ^{3} + θ^{2} (1 - θ) (2 + 9 S_{2}) \\ + 2 θ {(1 - θ)}^{2} (2 S_{2} + S_{3}) + {(1 - θ)}^{3} (2 S_{2}^{2} - S_{4})] . \end{aligned}

The first line specifies the genotypes, the second shows the corresponding sets of alleles, and the third shows the value from the Dirichlet assumption. Random mating is assumed for the second line. The third line employs the notation $S_{k} = \sum_{i} p_{i}^{k}$ for k = 2, 3, 4, and D = (1 + θ)(1 + 2θ).

Partial matches occur when two individuals share one allele at a locus, rather than the two required for a match. As Diaconis and Mosteller (1989) said: “We often find ‘near’ coincidences surprising.” The probability that two individuals partially match is

\begin{aligned} P_{1} = & 2 \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{i}, A_{i} A_{j}) + \sum_{i} \sum_{j \neq i} \sum_{k \neq i, j} Pr (A_{i} A_{j}, A_{i} A_{k}) \\ = & 4 \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{i} A_{i} A_{j}) + 4 \sum_{i} \sum_{j \neq i} \sum_{k \neq i, j} Pr (A_{i} A_{i} A_{j} A_{k}) \\ = & \frac{1}{D} [8 θ^{2} (1 - θ) (1 - S_{2}) + 4 θ {(1 - θ)}^{2} (1 - S_{3}) \\ + 4 {(1 - θ)}^{3} (S_{2} - S_{3} - S_{2}^{2} + S_{4})], \end{aligned}

with the same meaning for the three rows as for P₂. Finally, for two individuals to mismatch, that is, have no alleles in common,

\begin{aligned} P_{0} = & \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{i}, A_{j} A_{j}) + 2 \sum_{i} \sum_{j \neq i} \sum_{k \neq i, j} Pr (A_{i} A_{i}, A_{j} A_{k}) \\ + \sum_{i} \sum_{j \neq i} \sum_{k \neq i, j} \sum_{l \neq i, j, k} Pr (A_{i} A_{j}, A_{k} A_{l}) \\ = & \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{i} A_{j} A_{j}) + 2 \sum_{i} \sum_{j \neq i} \sum_{k \neq i, j} Pr (A_{i} A_{i} A_{j} A_{k}) \\ + \sum_{i} \sum_{j \neq i} \sum_{k \neq i, j} \sum_{l \neq i, j, k} Pr (A_{i} A_{j} A_{k} A_{l}) \\ = & \frac{1}{D} [θ^{2} (1 - θ) (1 - S_{2}) + 2 θ {(1 - θ)}^{2} (1 - 2 S_{2} + S_{3}) \\ + {(1 - θ)}^{3} (1 - 4 S_{2} + 4 S_{3} + 2 S_{2}^{2} - 3 S_{4})] . \end{aligned}
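The closed forms for P₂, P₁ and P₀ can be coded directly from S₂, S₃, S₄ and D. Since two individuals must match, partially match or mismatch, the three values should sum to one, which gives a convenient check. The allele frequencies in the usage lines are hypothetical and the function name is illustrative.

```python
def match_probabilities(p, theta):
    """Return (P2, P1, P0): single-locus probabilities that two members of
    the same population match, partially match, or mismatch."""
    s2 = sum(x ** 2 for x in p)
    s3 = sum(x ** 3 for x in p)
    s4 = sum(x ** 4 for x in p)
    t, u = theta, 1.0 - theta
    d = (1 + t) * (1 + 2 * t)
    p2 = (6 * t**3 + t**2 * u * (2 + 9 * s2)
          + 2 * t * u**2 * (2 * s2 + s3)
          + u**3 * (2 * s2**2 - s4)) / d
    p1 = (8 * t**2 * u * (1 - s2) + 4 * t * u**2 * (1 - s3)
          + 4 * u**3 * (s2 - s3 - s2**2 + s4)) / d
    p0 = (t**2 * u * (1 - s2) + 2 * t * u**2 * (1 - 2 * s2 + s3)
          + u**3 * (1 - 4 * s2 + 4 * s3 + 2 * s2**2 - 3 * s4)) / d
    return p2, p1, p0


p = [0.4, 0.3, 0.2, 0.1]             # hypothetical allele frequencies
for theta in (0.0, 0.01, 0.03):
    p2, p1, p0 = match_probabilities(p, theta)
    print(theta, p2, p1, p0, p2 + p1 + p0)
```

With two equally frequent alleles and θ = 0 the values reduce to 3/8, 1/2 and 1/8, which can be confirmed by enumerating the three genotypes by hand.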

Values of P₂ are shown in Table 1 for 13 commonly-used forensic loci, using Caucasian allele frequencies reported by Budowle and Moretti (1999) and various values of θ. Assuming independence of these loci, the full 13-locus match probabilities are the products of the 13 separate values and these products are also shown in Table 1. The probabilities of finding at least one matching pair among 65,493 individuals are given in Table 1, along with the sample size needed to give a 50% probability of at least one match. The column headed “Actual” shows the proportion of pairs of profiles that match at each locus in the very small sample of 203 Caucasian profiles reported by the FBI [Budowle and Moretti (1999)].
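The three summary rows of Table 1 can be recomputed from the per-locus values. The sketch below uses the rounded θ = 0 column of Table 1, so results agree with the table only to the precision shown there; independence of loci is assumed as in the text, and the variable names are illustrative.

```python
from math import expm1, log, prod, sqrt

# Per-locus match probabilities from the theta = 0 column of Table 1.
P2_THETA0 = [0.075, 0.062, 0.036, 0.067, 0.038, 0.028, 0.158,
             0.085, 0.065, 0.118, 0.195, 0.081, 0.089]

# Full 13-locus match probability, assuming independent loci.
full_match = prod(P2_THETA0)

# Probability of at least one matching pair among n = 65,493 profiles,
# using the birthday approximation 1 - exp(-n^2 P / 2).
n = 65_493
p_database = -expm1(-n * n * full_match / 2.0)

# Sample size giving a 50% chance of at least one match:
# solve exp(-n^2 P / 2) = 1/2 for n.
n_half = sqrt(2.0 * log(2.0) / full_match)

print(full_match)   # order of 2e-15
print(p_database)   # order of 4e-6
print(n_half)       # order of 28 million
```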

Table 1.

Probabilities that two unrelated noninbred¹ people match at common loci, based on allele frequencies reported by Budowle and Moretti (1999)

		θ
Locus	Actual²	0.000	0.001	0.005	0.010	0.030
D3S1358	0.077	0.075	0.075	0.077	0.079	0.089
vWA	0.063	0.062	0.063	0.065	0.067	0.077
FGA	0.036	0.036	0.036	0.038	0.040	0.048
D8S1179	0.063	0.067	0.068	0.070	0.072	0.083
D21S11	0.036	0.038	0.038	0.040	0.042	0.051
D18S51	0.027	0.028	0.029	0.030	0.032	0.040
D5S818	0.163	0.158	0.159	0.161	0.164	0.175
D13S317	0.076	0.085	0.085	0.088	0.090	0.101
D7S820	0.062	0.065	0.066	0.068	0.070	0.080
CSF1PO	0.122	0.118	0.119	0.121	0.123	0.134
TPOX	0.206	0.195	0.195	0.198	0.202	0.216
THO1	0.074	0.081	0.082	0.084	0.086	0.096
D16S539	0.086	0.089	0.089	0.091	0.094	0.105
All loci		2 × 10⁻¹⁵	2 × 10⁻¹⁵	3 × 10⁻¹⁵	4 × 10⁻¹⁵	2 × 10⁻¹⁴
Prob.³		0.000,004	0.000,004	0.000,006	0.000,009	0.000,050
Sample size⁴		28 million	27 million	22 million	18 million	7.7 million

¹ Apart from evolutionary-driven inbreeding and relatedness.

² Observed proportion of matches in data of Budowle and Moretti (1999).

³ Probability of at least one matching pair among 65,493 individuals.

⁴ Sample size to give 50% probability of at least one match.

The finding of Troyer, Gilroy and Koeneman (2001) was for a pair of profiles that matched at nine loci, partially matched at three loci and mismatched at one locus. It is shown in Table 2 that, in fact, 163 such pairs of individuals are expected when loci are assumed to be independent and θ = 0.03. This value of θ has been suggested as a very conservative value to use for forensic calculations [National Research Council (1996)], and Table 1 shows that this value makes all 13 predicted match probabilities greater than the FBI observed values. It would be of interest to examine the dataset of Troyer, Gilroy and Koeneman (2001) to see the level of agreement between observed and expected numbers of matches and partial matches. Weir (2004) was able to examine an Australian dataset of 15,000 profiles and showed (his Table 4) very good agreement when θ was set to 0.001. The agreement was not as good when θ was set to zero. Table 3 shows observed and expected numbers of match/partial match combinations for the Caucasian data of Budowle and Moretti (1999). The sample size is too small to have more than six loci with matches and is really too small to allow strong conclusions about the role of θ to be made. This example shows good overall agreement between observed and expected values for θ = 0. Examination of actual offender datasets is needed.

Table 2.

Expected numbers of pairs of matching or partially matching profiles in a sample of size 65,493 profiles when at least six of 13 loci match if θ = 0.03

	Number of partially matching loci
Number of matching loci	0	1	2	3	4	5	6	7
6	4,059	37,707	148,751	322,963	416,733	319,532	134,784	24,125
7	980	7,659	24,714	42,129	40,005	20,061	4,150
8	171	1,091	2,764	3,467	2,153	530
9	21	106	198	163	50
10	2	7	8	3
11	0	0	0
12	0	0
13	0

Table 4.

Identity probabilities for common family relationships

Relationship	k₂	k₁	k₀
Identical twins	1	0	0
Full sibs	$\frac{1}{4}$	$\frac{1}{2}$	$\frac{1}{4}$
Parent and child	0	1	0
Double first cousins	$\frac{1}{16}$	$\frac{3}{8}$	$\frac{9}{16}$
Half sibs	0	$\frac{1}{2}$	$\frac{1}{2}$
Grandparent and grandchild	0	$\frac{1}{2}$	$\frac{1}{2}$
Uncle and nephew	0	$\frac{1}{2}$	$\frac{1}{2}$
First cousins	0	$\frac{1}{4}$	$\frac{3}{4}$
Unrelated	0	0	1

Table 3.

Observed and expected numbers of pairs of profiles with specified numbers of matching or partially matching loci when all 94 profiles in a dataset of Budowle and Moretti (1999) are compared to each other

		Number of partially matching loci
No. of matching loci	θ	0	1	2	3	4	5	6	7	8	9	10	11	12	13
0	Obs.	0	3	18	92	249	624	1077	1363	1116	849	379	112	25	4
	0.000	0	2	19	90	293	672	1129	1403	1290	868	415	134	26	2
	0.001	0	2	18	88	286	661	1114	1391	1286	869	418	135	26	2
	0.010	0	2	14	70	236	566	992	1289	1241	875	439	148	30	3
	0.030	0	1	8	42	152	396	754	1065	1118	860	471	174	39	4
1	Obs.	0	12	48	203	574	1133	1516	1596	1206	602	193	43	3
	0.000	0	7	50	212	600	1192	1704	1768	1320	692	242	51	5
	0.001	0	7	49	208	592	1182	1698	1770	1328	700	246	52	5
	0.010	0	5	40	178	527	1094	1637	1779	1393	767	282	62	6
	0.030	0	3	26	125	401	905	1475	1749	1496	901	363	88	10
2	Obs.	0	7	61	203	539	836	942	807	471	187	35	2
	0.000	1	9	56	210	514	871	1040	877	511	196	45	5
	0.001	1	9	56	208	512	872	1046	886	519	200	46	5
	0.010	1	8	50	193	494	875	1096	969	593	239	57	6
	0.030	0	5	38	160	445	861	1178	1140	765	339	89	11
3	Obs.	0	6	33	124	215	320	259	196	92	16	1
	0.000	1	7	36	116	243	344	334	220	94	23	3
	0.001	1	6	36	116	244	348	339	224	96	24	3
	0.010	0	6	35	117	256	380	387	268	120	32	4
	0.030	0	5	31	115	275	447	499	379	187	54	7
4	Obs.	1	5	17	29	54	82	67	16	6	0
	0.000	0	3	15	40	70	81	61	29	8	1
	0.001	0	3	15	40	71	82	63	30	8	1
	0.010	0	3	15	44	81	98	78	40	12	1
	0.030	0	3	16	52	105	139	122	68	22	3
5	Obs.	0	1	2	6	12	14	6	5	0
	0.000	0	1	4	9	13	11	6	2	0
	0.001	0	1	4	9	13	12	7	2	0
	0.010	0	1	4	11	16	15	9	3	0
	0.030	0	1	6	15	25	26	17	6	1
6	Obs.	0	1	0	2	2	0	0	0
	0.000	0	0	1	1	1	1	0	0
	0.001	0	0	1	1	2	1	0	0
	0.010	0	0	1	2	2	1	1	0
	0.030	0	0	1	3	4	3	1	0

It is clear, however, that instances of matching and partially matching profiles are not unexpected in offender databases.

5. Effect of relatives

The previous results accommodated the effects of shared evolutionary history on the probabilities that two individuals have the same genotype. These probabilities are increased if the individuals have a shared family history. Allowing for this degree of relatedness, but still assuming random mating within a population so there is no inbreeding, requires the probabilities k₂, k₁, k₀ that the individuals have received 2, 1 or 0 pairs of alleles identical by descent from their immediate family ancestors. Values for these probabilities for common relationships are shown in Table 4. Individuals that share two pairs of ibd alleles must have matching genotypes. Those that share one pair of alleles ibd may either match or partially match, and individuals with no ibd allele sharing may match, partially match or mismatch. Therefore, the probabilities that two individuals match, partially match or mismatch at one locus are

\begin{aligned} Pr (\text{Match}) & = k_{2} + k_{1} [\sum_{i} Pr (A_{i} A_{i} A_{i}) + \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{j} A_{j})] + k_{0} P_{2} \\ & = k_{2} + k_{1} [θ + (1 - θ) S_{2}] + k_{0} P_{2}, \\ Pr (\text{Partial match}) & = k_{1} [2 \sum_{i} \sum_{j \neq i} Pr (A_{i} A_{i} A_{j}) + \sum_{i} \sum_{j \neq i} \sum_{k \neq i, j} Pr (A_{i} A_{j} A_{k})] + k_{0} P_{1} \\ & = k_{1} (1 - θ) (1 - S_{2}) + k_{0} P_{1}, \\ Pr (\text{Mismatch}) & = k_{0} P_{0} . \end{aligned}

Equivalent results were given by Fung, Carracedo and Hu (2003). Numerical values for the matching probabilities for the 13-locus system described in Table 1 are shown in Table 5 for common relationships. Clearly, the probabilities increase with the degree of relationship.
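The single-locus match probability for relatives is a weighted sum of three terms, and the Table 5 entries can be recovered from one another. In the sketch below the term θ + (1 − θ)S₂ is read from the parent–child column of Table 5 (for which k₁ = 1) and P₂ from the unrelated column, both for locus D3S1358 at θ = 0.03; the k values are those of Table 4. Function and variable names are illustrative.

```python
# (k2, k1, k0) for the relationships shown in Table 5, taken from Table 4.
RELATIONSHIPS = {
    "unrelated": (0.0, 0.0, 1.0),
    "first cousins": (0.0, 0.25, 0.75),
    "parent-child": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
}


def related_match_prob(k, one_ibd_match, p2):
    """Pr(Match) = k2 + k1 [theta + (1 - theta) S2] + k0 P2, where
    one_ibd_match stands for the bracketed term."""
    k2, k1, k0 = k
    return k2 + k1 * one_ibd_match + k0 * p2


# D3S1358 at theta = 0.03: rounded values from Table 5.
p2 = 0.089             # unrelated column
one_ibd_match = 0.229  # parent-child column

for name, k in RELATIONSHIPS.items():
    print(name, round(related_match_prob(k, one_ibd_match, p2), 3))
```

Because the inputs are rounded to three decimals, agreement with the first-cousin and full-sib columns of Table 5 can only be expected to about that precision.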

Table 5.

Matching probabilities for common family relationships (with θ = 0.03)

Locus	Not related	First-cousins	Parent–child	Full-sibs
D3S1358	0.089	0.124	0.229	0.387
vWA	0.077	0.111	0.213	0.376
FGA	0.048	0.078	0.166	0.345
D8S1179	0.083	0.119	0.227	0.384
D21S11	0.051	0.081	0.172	0.349
D18S51	0.040	0.068	0.150	0.335
D5S818	0.175	0.216	0.339	0.463
D13S317	0.101	0.139	0.252	0.401
D7S820	0.080	0.115	0.219	0.379
CSF1PO	0.134	0.173	0.288	0.428
TPOX	0.216	0.261	0.397	0.503
THO1	0.096	0.133	0.241	0.395
D16S539	0.105	0.143	0.256	0.404
Total	2 × 10⁻¹⁴	2 × 10⁻¹²	6 × 10⁻⁹	5 × 10⁻⁶

Pairs of relatives with related common ancestors within their family are inbred, and the three ibd probabilities k₂, k₁, k₀ must be replaced by a more extensive set of nine probabilities Δ_i, i = 1, 2, …, 9, for the various patterns of ibd among all four alleles carried by the two relatives [Weir, Anderson and Hepler (2006)]. These are defined in Table 6, along with numerical values for the situation of full sibs whose parents are first cousins. The various matching probabilities become

Table 6.

Identity probabilities for inbred relatives carrying alleles (a, b) and (c, d), and values for example of siblings whose parents are first cousins

ibd alleles	Probability	Example^*
a, b, c, d	Δ₁	1/64
a, b and c, d	Δ₂	0
a, b, c or a, b, d	Δ₃	2/64
a, b only	Δ₄	1/64
a, c, d or b, c, d	Δ₅	2/64
c, d only	Δ₆	1/64
(a, c and b, d) or (a, d and b, c)	Δ₇	15/64
a, c or a, d or b, c or b, d	Δ₈	30/64
none	Δ₉	12/64

* First cousin provides alleles a, c to sibs, second cousin provides alleles b, d to sibs.

\begin{aligned} Pr (\text{Match}) = {} & (Δ_{1} + Δ_{7}) + (Δ_{2} + Δ_{3} + Δ_{5} + Δ_{8}) [θ + (1 - θ) S_{2}] \\ & + \frac{1}{1 + θ} (Δ_{4} + Δ_{6}) [2 θ^{2} + 3 θ (1 - θ) S_{2} + {(1 - θ)}^{2} S_{3}] \\ & + Δ_{9} P_{2}, \end{aligned}

\begin{aligned} Pr (\text{Partial match}) = {} & (Δ_{3} + Δ_{5} + Δ_{8}) (1 - θ) (1 - S_{2}) \\ & + \frac{2 (1 - θ)}{1 + θ} (Δ_{4} + Δ_{6}) [θ + (1 - 2 θ) S_{2} - (1 - θ) S_{3}] \\ & + Δ_{9} P_{1}, \end{aligned}

\begin{aligned} Pr (\text{Mismatch}) = {} & Δ_{2} (1 - θ) (1 - S_{2}) \\ & + \frac{1 - θ}{1 + θ} (Δ_{4} + Δ_{6}) [1 - (2 - θ) S_{2} + (1 - θ) S_{3}] + Δ_{9} P_{0} . \end{aligned}
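The match probability for inbred relatives can be evaluated from the nine Δ values. The sketch below implements only the Pr(Match) expression, with the Table 6 example for full sibs whose parents are first cousins; the allele frequencies in the usage lines are hypothetical and the names are illustrative. Setting Δ₇ = 1/4, Δ₈ = 1/2, Δ₉ = 1/4 recovers the non-inbred full-sib case, which gives a simple consistency check.

```python
# Table 6 example: full sibs whose parents are first cousins (Delta_1..Delta_9).
SIBS_OF_COUSINS = tuple(k / 64 for k in (1, 0, 2, 1, 2, 1, 15, 30, 12))
# Non-inbred full sibs expressed in the same nine-state notation.
FULL_SIBS = (0, 0, 0, 0, 0, 0, 0.25, 0.5, 0.25)


def inbred_match_prob(delta, p, theta):
    """Single-locus Pr(Match) for two possibly inbred relatives."""
    d1, d2, d3, d4, d5, d6, d7, d8, d9 = delta
    s2 = sum(x ** 2 for x in p)
    s3 = sum(x ** 3 for x in p)
    s4 = sum(x ** 4 for x in p)
    t, u = theta, 1.0 - theta
    two_alike = t + u * s2  # two non-ibd alleles are of the same type
    three_alike = (2 * t**2 + 3 * t * u * s2 + u**2 * s3) / (1 + t)
    p2 = (6 * t**3 + t**2 * u * (2 + 9 * s2) + 2 * t * u**2 * (2 * s2 + s3)
          + u**3 * (2 * s2**2 - s4)) / ((1 + t) * (1 + 2 * t))
    return ((d1 + d7) + (d2 + d3 + d5 + d8) * two_alike
            + (d4 + d6) * three_alike + d9 * p2)


p = [0.4, 0.3, 0.2, 0.1]  # hypothetical allele frequencies
print(inbred_match_prob(FULL_SIBS, p, 0.03))
print(inbred_match_prob(SIBS_OF_COUSINS, p, 0.03))
```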

Relatedness will increase the probability that two individuals will have matching or partially matching DNA profiles and it would not be surprising if very large offender databases had profiles from related people. It is difficult, however, to turn the question around and infer relatedness of people whose profiles have a high degree of matching. The current set of fewer than 20 STR loci is not enough to give good estimates of the degree of relatedness [Weir, Anderson and Hepler (2006)], and even unrelated people can have very similar profiles.

6. Discussion

DNA profiling has proven to be a powerful tool for human identification in forensic and other contexts. Different people, identical twins excepted, have different genetic constitutions and it is hoped that an examination of a small portion of these constitutions will allow for identification or differentiation. Current forensic DNA profiling techniques examine between 10 and 20 regions of the genome, representing of the order of 10³ of the 10⁹ nucleotides in the complete genome. Nevertheless, the probability that a randomly chosen person has a particular forensic profile can easily reach the small value of 10⁻¹⁰. Even when the forensic scientist is careful to present probabilities in the preferred format such as “the probability of a person having this profile given that we know the perpetrator has the profile,” the numbers remain small and the evidence that a defendant also has that profile can be compelling.

Given the widespread belief that specific forensic profiles are rare, there has been some concern expressed at the finding of matching or nearly matching profiles in databases of fewer than 100,000 profiles. Such findings were predicted by Weir (2004), unaware that they had already been reported [Troyer, Gilroy and Koeneman (2001)] for the case of two profiles matching at nine of 13 loci. At the simplest level, the apparent discrepancy is merely an application of the birthday problem. If all DNA profiles have the same probability P, and if profiles are independent, then the probability of at least two instances of any profile in a set of n profiles is approximately 1 − exp(−n²P/2). This probability can be large even for small P and it can be 50% when n is of the order of $1 / \sqrt{P}$. The widespread practice of collecting profiles from people suspected of, arrested for, or convicted of crimes has already led to the establishment of large databases: the National DNA Database (NDNAD) in the United Kingdom had over three million profiles in February 2006 and the Combined DNA Index System (CODIS) in the United States had over four million profiles in February 2007. These and other national databases are growing.

This note has looked a little more closely at the probability of finding matching profiles in a database. The first observation was that DNA profiles are genetic entities with evolutionary histories that impose dependencies among profiles. The formulation of dependencies was made for single loci, but there is empirical evidence [Weir (2004), Figure 1] that sufficiently large “correction” for dependencies within loci will also accommodate between-locus dependencies. This means taking sufficiently large values of the parameter θ.

Incorporation of “θ-corrections” for the case of unrelated individuals refers to the dependencies generated by the evolutionary process. These would not be detected from observations taken solely within a population, but they are necessary to enable predictions to be made. Predictions need to take variation among populations into account. Additional dependencies due to nonrandom mating, leading to within-population inbreeding, were considered by Ayres and Overall (1999).

Acknowledgments

Very helpful comments were made by an anonymous reviewer.

Footnotes

Supported in part by NIH Grant GM 075091.

References

Anonymous. DNA fingerprinting comes of age. Science. 1997;278:1407.
Ayres KL, Overall ADJ. Allowing for within-subpopulation inbreeding in forensic match probabilities. Forensic Science International. 1999;103:207–216.
Balding DJ. When can a DNA profile be regarded as unique? Science and Justice. 1999;39:257–260. doi: 10.1016/S1355-0306(99)72057-5.
Balding DJ, Donnelly P. Inference in forensic identification. J Roy Statist Soc Ser A. 1995;158:21–53.
Balding DJ, Nichols RA. Significant genetic correlations among Caucasians at forensic DNA loci. Heredity. 1997;78:583–589. doi: 10.1038/hdy.1997.97.
Budowle B, Moretti TR. Genotype profiles for six population groups at the 13 CODIS short tandem repeat core loci and other PCR-based loci. Forensic Science Communications. 1999. Available at http://www.fbi.gov/hq/lab/fsc/backissu/july1999/budowle.htm.
Diaconis P, Mosteller F. Methods for studying coincidences. J Amer Statist Assoc. 1989;84:853–861. MR1134485.
Eggleston R. Evidence, Proof and Probability. 2nd ed. Weidenfeld and Nicolson; London: 1983.
Fung WK, Carracedo A, Hu Y-Q. Testing for kinship in a subdivided population. Forensic Science International. 2003;135:105–109. doi: 10.1016/s0379-0738(03)00168-3.
Galton F. Fingerprints. MacMillan; London: 1892.
Kingston CR. Applications of probability theory in criminalistics. J Amer Statist Assoc. 1965;60:70–80.
Lenth RV. On identification by probability. J Forensic Science Society. 1986;26:197–213. doi: 10.1016/s0015-7368(86)72477-8.
Mosteller F. Understanding the birthday problem. The Mathematics Teacher. 1962;55:322–325.
National Research Council. The Evaluation of Forensic DNA Evidence. National Academy Press; Washington, DC: 1996.
Thompson WC, Schumann EL. Interpretation of statistical evidence in criminal trials: The prosecutor's fallacy and the defense attorney's fallacy. Law and Human Behavior. 1987;11:167–187.
Troyer K, Gilroy T, Koeneman B. A nine STR locus match between two apparent unrelated individuals using AmpFlSTR Profiler Plus™ and COfiler™. Proceedings of the Promega 12th International Symposium on Human Identification. 2001.
Weir BS. Genetic Data Analysis II. Sinauer; Sunderland, MA: 1996.
Weir BS. Matching and partially-matching DNA profiles. J Forensic Sciences. 2004;49:1009–1014.
Weir BS, Anderson AD, Hepler AB. Genetic relatedness analysis: Modern data and new challenges. Nature Reviews Genetics. 2006;7:771–780. doi: 10.1038/nrg1960.
