Abstract
Friction ridge examiners report conclusions for palm impression comparisons much as they do for fingerprint impression comparisons, although several key differences exist. These include an extensive search process for palm impressions, differences in minutiae rarity, and orientation challenges that most fingerprint comparisons do not present. Most US laboratories use a three-conclusion scale that includes Identification, Exclusion, and Inconclusive, and these conclusions have not been calibrated against the actual strength of the evidence in palmprint comparisons. To measure the strength of the evidence of palmprint impressions, the present work constructs likelihood ratios using an ordered probit model based on distributions of examiner responses in an error rate study. Many of the calculated likelihood ratios are quite modest, and current articulation scales may overestimate the strength of support for same source propositions by up to five orders of magnitude. These likelihood ratios help calibrate the articulation language and may offer an alternative to categorical reporting scales.
1. Introduction
Friction ridge examination, colloquially known as fingerprint comparison, is a discipline that involves the comparison of an impression of friction ridge skin of unknown origin to that of a known individual. Historically, friction ridge examiners have reported their findings using a three-conclusion scale: Identification, Exclusion, or Inconclusive. In an attempt to calibrate the language against the strength of the evidence, the term representing the most support for the same source proposition has shifted among identification, individualization, and source identification, with various institutions using different labels. The Scientific Working Group on Friction Ridge Analysis, Study and Technology (SWGFAST), the Academy Standards Board (ASB), and the Friction Ridge Subcommittee of the Organization of Scientific Area Committees (OSAC) have all proposed verbal definitions with varying levels of statistical language and decisiveness. Terms like “substantially stronger support”, “extremely strong support”, “sufficient agreement”, and “practical impossibility” have been proposed or used [[1], [2], [3]]. These definitions often contain a mixture of posterior language and strength of evidence reporting. Despite these efforts, none of these terms have been calibrated against the strength of the evidence across the discipline or related to specific samples. To address these calibration issues, in the current work we extend a method previously described by Busey & Coon [4] to analyze data from a palmprint error rate study [5], with three goals: calculate likelihood ratios based on examiner responses, calibrate the specific verbal scale used by the examiners, and measure the strength of the evidence of each individual image pair in the study.
1.1. Palm print comparisons
Within the field of friction ridge comparison, most of the research emphasis has been on the comparison of the distal joints, otherwise known as fingerprints. While most crime scene impressions are fingerprint impressions, palmar impressions (from the region between the wrist and the base of each digit) occur in approximately 30 % of all crime scene recordings [6] and accounted for 21 % of all searches of unknown impressions against the FBI's Automated Fingerprint Identification System in 2023 [7].
The anatomy, rarity, and discriminability of regions and minutiae vary along the surface of the friction ridge skin [8], and while research on the error rate of fingerprint comparison has been extensive [[9], [10], [11]], error rate research on palm comparisons has been more limited, with only one error rate study on palmar comparison as of 2025 [5]. Examiners apply the same visual techniques to all areas of friction ridge skin, and palmar evidence has been accepted in US courts as early as 1918 in State v. Kuhl [12]. However, palm impressions present a distinct set of challenges. Fully recorded distal joint impressions are approximately 1 square inch and limited in their size and pattern configurations. Fully recorded palm impressions are approximately 16 square inches, contain around 800 minutiae, and are divided into three regions, each containing a variety of pattern configurations. Automated comparison systems can take 64 times longer to compare minutiae within palms than within fingers [13].
Palm impressions are categorized based upon the region of the palm and include interdigital, hypothenar, and thenar areas. These regions are classified by flexion creases, the directionality of ridge flow, delta shapes, and other anatomical features. Examiners often receive specialized training on palmprints to determine orientation, region of the palm, and handedness when analyzing crime scene impressions. Appropriately categorizing, orienting, and anchoring on features in an unknown palm impression requires a distinct set of skills, and the error rate of this extended search and pre-comparison stage has not been measured in current research.
1.2. Palm error rate study
Because of the distinct skills required by palm print comparisons, Eldridge et al. [5] conducted a black box study in which 210 expert participants were provided with 75 unknown palm impressions each; 134 subjects (40 %) completed all trials. Participants were provided with an unknown impression and were asked to document features, determine orientation, and record whether each trial's impression was of value for comparison. If the examiner deemed the unknown impression of value for comparison, they were provided with a known palm impression to compare (one hand only). The interface allowed examiners to add or remove features and rotate the images, and provided extensive markup tools to document the comparison process on both the latent and known images. Participants were not able to adjust the color or contrast of the latent or known impressions.
Examples of the images provided to subjects are displayed in Fig. 1, Fig. 2, Fig. 3. Each example addresses a specific challenge such as variation in spatial relationship, anchor prevalence, or contrast. In all figures, the image on the left was labeled as the latent impression. The right image was only provided after the latent was assessed and a subject deemed it of suitable comparison value to continue with the trial. Fig. 1 illustrates variation in contrast, crease prevalence, and orientation. Fig. 2 illustrates an image pair where the position of the hand varies between recordings, which caused a variation in minutiae spatial relationships. Fig. 3 is a sample where the known impression is not fully recorded and does not contain the anchor (a delta structure) located within the latent impression. In the Eldridge publication, this sample was marked with a magenta X and red dots indicating the corresponding creases and features between the images to aid the comparison process; these markings were not provided during the experiment.
Fig. 1.
This sample (case_0319) is a mated pair. Of the 34 examiners who completed this comparison, 23 said Identification, 8 said Inconclusive and 3 said Exclusion. The likelihood ratio calculated for this sample is 20.2, or twenty times greater support for the same source proposition relative to the different sources proposition. (Figure from original Eldridge study used with permission).
Fig. 2.
This sample (case_0344) is a mated pair. Of the 84 examiners who completed this comparison, 37 said Identification, 22 said Inconclusive and 25 said Exclusion. The likelihood ratio calculated for this sample is 1.42, or slightly greater support for the same source proposition relative to the different sources proposition. (Figure from original Eldridge study used with permission).
Fig. 3.
This sample (case_0224) is a mated pair. Of the 42 examiners who completed this comparison, 7 said Identification, 33 said Inconclusive and 2 said Exclusion. The likelihood ratio calculated for this sample is 1.19, or slightly greater support for the same source proposition relative to the different sources proposition. (Figure from original Eldridge study used with permission).
In the Eldridge study, 75 samples were randomly selected for each examiner, comprising 53 mated pairs and 22 nonmated pairs. Nonmated exemplars were chosen from an AFIS search of 25,000 samples, or manually by a member of the research team. The 526 image pairs were distributed across categories of difficulty, assigned by the principal investigator according to the conclusion they expected. Each examiner received approximately 8 no value, 10 easy, 12 medium difficulty, 21 hard or very hard, and 2 inconclusive same source pairs. Across all images and participants, 12,279 suitability decisions were reported, 2406 of which were no value for comparison. Of the 9460 comparison decisions rendered, 1840 were Inconclusive, a rate of 19.45 %. Out of the 2470 decisions rendered on ground truth nonmates, there were 10 identifications (0.4 %), and there were 515 exclusions out of 6683 mated pair decisions (7.7 %).
These results contrast with data from fingerprint comparisons, which show many fewer image pairs dominated by erroneous exclusions and fewer unanimous conclusions than palmprint comparisons. Ulery et al. [10] completed a black box study involving 10,052 fingerprint comparisons, which yielded a combined inconclusive rate of 22.99 %, an erroneous identification rate of 0.1 %, and an erroneous exclusion rate of 7.5 %. The errors were not random; they tended to occur on specific image pairs, which suggests that the strength of each comparison differs. While the combined error rates were similar between the finger and palm datasets, the rates of unanimous identifications on mated trials were 10 % for the finger dataset [10] and 25 % for the palm dataset [5]. Because 75 % of the mated pairs compared in the palm study were not unanimous, applying a system-wide error rate does not provide the strength of support for each comparison in the dataset, which may vary widely. Unanimous agreement on a conclusion is the exception, not the norm, in both the fingerprint and palmprint datasets.
Erroneous exclusion rates were similar between the palm and finger sets (7.7 % vs 7.5 %), but 36 palmprint samples received a majority of exclusion decisions despite being mated [5]. By contrast, Ulery et al. [10] reported only one sample where this was the case. Erroneous exclusions tended to cluster around specific image pairs and specific examiners: 10 % of participants had erroneous exclusion rates of 31 % or more, and one participant had an erroneous exclusion rate of 75 % [5]. Thus, palmprints can have high erroneous exclusion rates for some images (perhaps because some examiners could not find a starting point or mis-oriented the latent) and high unanimous identification rates for others (due to the greater skin surface recorded in palm impressions).
Although study-wide error rates provide a rough sense of the accuracy of examiners overall, it is difficult to make inferences about individual items or practitioners from them. In the present work we transform the data from the Eldridge et al. [5] study into a measure of the strength of evidence of each individual sample using an ordered probit model, and we use that measure to compute likelihood ratios for the samples in the study. Likelihood ratios have several advantages over definitive conclusions, as we discuss next.
2. Likelihood ratios
The likelihood ratio provides the relative support for two possible states of the world: The unknown sample originated from a certain individual or item (same source proposition), or the unknown sample originated from a different individual or item (different sources proposition). The probability of the observations is measured relative to these two propositions and the likelihood ratio is the ratio of these two relative probabilities.
Within a Bayesian updating framework, the practitioner reports a likelihood ratio, and factfinders can combine this value with their prior beliefs (prior odds) using a multiplication operation. The beliefs of the factfinder are established and re-weighted with each additional item of information that they observe [14]. Some items of information are given more consideration than others and influence the beliefs of each factfinder differently. As a factfinder receives information for or against a particular proposition, they can weigh it against the entirety of the information presented to determine guilt or innocence. Importantly, only the factfinders compute posterior odds and make decisions, while examiners report observations through evaluative reporting articulation language [15,16].
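As a concrete illustration of this updating step, the short Python sketch below multiplies hypothetical prior odds by a likelihood ratio to obtain posterior odds. The LR of 20.2 is the value reported for case_0319 in Fig. 1; the prior odds are purely illustrative and not drawn from any case.

```python
# Bayesian updating sketch: posterior odds = prior odds x likelihood ratio.
# The prior odds here are a hypothetical illustration, not a real case value.
prior_odds = 1 / 1000          # factfinder's illustrative prior odds of same source
likelihood_ratio = 20.2        # LR reported for case_0319 (Fig. 1)

posterior_odds = prior_odds * likelihood_ratio
posterior_probability = posterior_odds / (1 + posterior_odds)

print(f"posterior odds = {posterior_odds:.4f}")                # 0.0202
print(f"posterior probability = {posterior_probability:.4f}")  # ~0.0198
```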
To compute a likelihood ratio based on the responses of human examiners in error rate studies, Busey and Coon [4] introduced a method based on an ordered probit model to translate the distribution of conclusions by examiners into a measure of the strength of support for the same source proposition. The ordered probit model provides a way to convert categorical conclusions into a numerical value that represents a similarity or comparison score. This comparison score expresses the strength of support for the same source proposition and, when combined with ground truth, can be used to create likelihood ratios as described below. However, unlike score-based (or rarity-based) likelihood ratios that involve computational approaches to derive similarity between impressions, we rely on the conclusions of a group of examiners to provide the scores. Central to the idea of a likelihood ratio is the concept of calibration, because it provides the relative support for the same- and different-sources propositions as a numerical value, not a verbal label. As a result, ordered probit likelihood ratios allow us to calibrate the articulation language used by a discipline. Next, we describe how the ordered probit model converts a distribution of examiner conclusions into a numerical estimate of the strength of support for the same source proposition.
2.1. Ordered probit model
The central idea behind the ordered probit model is that during a comparison, the examiner gathers visual information and derives a latent value (a term defined by its unknown state, not to be confused with a crime scene impression). This latent value represents the result of an examiner's evaluation of both images and is converted to a categorical response such as Identification or Inconclusive through the application of a set of decision thresholds. In this approach, we convert examiner performance into a measure of the relative strength of support for the same- and different-sources propositions. This idea of converting examiner performance into quantitative support measures is also discussed by Warren, Handley, and Sheets [17] using a different modeling approach.
Consider the temperature of water as a metaphor for the latent value from each examiner. Temperature is a continuous measure that we can bin into frozen, liquid, and gas when describing the state of the water. However, within each category the water can take on any temperature within that category's range (e.g., ice can be at any temperature below 0 °C, not just 0 °C). The thresholds of 0 °C and 100 °C define the phase of water such that water colder than 0 °C is ice and water hotter than 100 °C is steam. Likewise, in friction ridge comparisons, the category of identification represents a range of possible values indicating the examiner's belief in the support for the same source proposition. The exact value along the latent dimension is not reported and is lost when a scale with only three categories is applied, much as the phrase “the water is frozen” implies any temperature below 0 °C. The ordered probit model can be used to recover the distribution of possible underlying values if a collection of examiner responses to an image pair is available.
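The information loss that binning produces can be made concrete in a few lines of code. The sketch below is our illustration, not part of the model's implementation; the thresholds match the anchoring values of 1.5 and 4.5 introduced later in this section.

```python
# Binning a continuous latent value into three labels loses information:
# many different latent values map to the same categorical conclusion.
def conclusion(latent_value: float, t_low: float = 1.5, t_high: float = 4.5) -> str:
    """Map a latent value to a category using two fixed thresholds."""
    if latent_value < t_low:
        return "Exclusion"
    if latent_value > t_high:
        return "Identification"
    return "Inconclusive"

# 5.0 and 9.0 both report as Identification; the difference between them is
# lost, just as "the water is frozen" implies any temperature below 0 degrees C.
print(conclusion(5.0), conclusion(9.0))   # Identification Identification
```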
Imagine that examiners could report the measurements of their internal thermometers (the latent value) as a single number that describes the strength of their belief in the same source proposition. This might relate to the amount of perceived detail in agreement, the number of minutiae within tolerance, the perceived rarity of features, the shapes and contours of ridges or scars, or any other source of information that contributed to a conclusion. This value would vary from examiner to examiner, since examiners vary in experience, may rely on different characteristics, and judge the rarity of those characteristics differently. We expect variability between examiners and propose that the resulting internal “temperature” values will tend to be normally distributed, because the decisions rendered by examiners are the conclusion of a complex process. The central limit theorem states that when measurements result from many individual steps or stages of evaluation, the final values tend to be normally distributed, which is often true of human decision making in general [18]. In the case of latent print examiners, the mental processes used to weigh information and reach conclusions are multi-faceted, involving evaluation of pattern type, minutiae discriminability and rarity, tolerance for visual differences and noise, and acceptance of risk for the consequences of errors. These factors combine to produce the final value, justifying a Gaussian distribution for the internal values.
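The normality claim is easy to check by simulation. In the sketch below, each simulated “latent value” is the sum of twelve uniformly distributed evaluation factors; the number of factors and their uniform distribution are arbitrary assumptions made only to illustrate the central limit theorem, not a model of real examiners.

```python
# Central limit theorem illustration: summing many independent evaluation
# factors yields an approximately normal latent value, even when the
# individual factors are decidedly non-normal (uniform here).
import numpy as np

rng = np.random.default_rng(seed=0)
n_examiners, n_factors = 10_000, 12   # arbitrary illustration values

factors = rng.uniform(0.0, 1.0, size=(n_examiners, n_factors))
latent_values = factors.sum(axis=1)

# Mean near 6 and standard deviation near 1; a histogram would be bell-shaped.
print(f"mean = {latent_values.mean():.2f}, sd = {latent_values.std():.2f}")
```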
Because examiners in the black box studies referenced here [5,10] did not report their conclusions as a latent value, the ordered probit model assumes two system-wide fixed thresholds that define the exclusion, inconclusive, and identification categories. Many variables affect whether examiners behave the same way given the same stimulus, whether as a product of their tolerance for risk, their visual system, or even personality [19,20]; the data in this study do not provide sufficient context to determine which variables may have affected individual threshold behavior. If an examiner made an identification response, the latent value must have exceeded the larger of the two thresholds. If the examiner made an exclusion response, the latent value must have been less than both thresholds. An inconclusive decision reflects a value between the thresholds. The ordered probit model enables us to work backwards to the internal value distribution that must have existed to produce the observed distribution of three-conclusion responses.
To set the scale of the latent axis, the conclusion thresholds are anchored at 1.5 (Exclusion/Inconclusive) and 4.5 (Inconclusive/Identification). These threshold values are arbitrary, and others could be used without loss of generality, much as Fahrenheit and Celsius assign different numeric values to water's freezing and boiling points. All examiner responses for each comparison are summarized by a normal distribution whose mean represents the typical support for the same source proposition and whose standard deviation represents the consistency among examiners on that image pair. The selection of the distribution type and the values assigned to the priors is discussed in the supplemental information. The normal distribution has a long history in science, dating to Fechner [21], and is widely used in signal detection theory (see the review by Wixted [18]). Sometimes this distribution is altered by adding skew or other higher moments, and while these may improve the fits, the extra parameters required by such distributions exceed the degrees of freedom offered by a 3-choice conclusion scale in the present dataset.
In the ordered probit model, the latent axis is partitioned by the two thresholds, and the area under the normal distribution in each region corresponds to the predicted proportion of examiners who reach that conclusion on a given comparison. Different locations of the normal distribution along the latent axis produce different predicted distributions of conclusions, as illustrated by Fig. 4. Parameter estimation recovers the most credible values of the mean and standard deviation for each comparison that best predict the empirical proportion of examiners reaching each conclusion. This provides an individual estimate of the strength of support for each comparison in the study.
Fig. 4.
Left Panels: Fit of the ordered probit model to response distributions from two hypothetical comparisons. The two thresholds (Exclusion/Inconclusive and Inconclusive/Identification) are shown as vertical bars that separate the latent axis into the Exclusion, Inconclusive, and Identification regions. The area under the normal distributions is the predicted proportion of examiners who reach each conclusion as illustrated by each colored region. Right Panels: Black dots are the actual proportion of examiners who reached each conclusion, while the height of the bars represents the predicted proportion of each conclusion. The ordered probit model adjusts the mean and standard deviation of the normal distribution to make the predicted proportions correspond to the actual proportions as closely as possible.
To illustrate the relation between a distribution of responses and the summary of that distribution by the ordered probit model, we have created a tool, also referenced in Busey and Coon [4], that lets the reader manipulate variables and view the effects of these changes:
https://iupbsapps.shinyapps.io/OrderedProbitDemoTraditional/
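For readers who prefer code to the interactive demonstration, the forward direction of the model can be sketched in a few lines of Python. The thresholds below are the anchoring values of 1.5 and 4.5 described above; the μ and σ values passed in are hypothetical, not fitted values from the study.

```python
# Forward direction of the ordered probit model: given a mean and standard
# deviation on the latent axis, predict the proportion of examiners who
# reach each conclusion as areas under the normal curve.
from scipy.stats import norm

T_EXC_INC = 1.5   # Exclusion/Inconclusive threshold anchor
T_INC_ID = 4.5    # Inconclusive/Identification threshold anchor

def predicted_proportions(mu: float, sigma: float) -> dict:
    """Area under N(mu, sigma) within each region of the latent axis."""
    p_exclusion = norm.cdf(T_EXC_INC, loc=mu, scale=sigma)
    p_identification = 1.0 - norm.cdf(T_INC_ID, loc=mu, scale=sigma)
    p_inconclusive = 1.0 - p_exclusion - p_identification
    return {"Exclusion": p_exclusion,
            "Inconclusive": p_inconclusive,
            "Identification": p_identification}

# A distribution centered above the upper threshold predicts mostly
# identifications; one centered between the thresholds predicts mostly
# inconclusives.
print(predicted_proportions(mu=6.0, sigma=1.5))
print(predicted_proportions(mu=3.0, sigma=1.5))
```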
2.2. Fit to the Eldridge et al. [5] data
We applied the ordered probit model from Busey and Coon [4] to the data from Eldridge et al. [5] to compute likelihood ratios for each item in the study. Participants were provided with more mated than nonmated pairs, and nonmated candidates were chosen from AFIS searches; however, participants were not informed of this during the study. The data were restricted to image pairs with conclusions from at least 16 subjects, because fewer responses produce ordered probit parameter estimates with increased uncertainty. With this restriction, 267 of the 526 total image pairs were included in the analysis. A Markov chain Monte Carlo (MCMC) parameter fitting approach was used to determine the most credible values of the mean and standard deviation of each normal distribution [22].
Bayesian parameter estimation requires a prior distribution for every estimated parameter. The prior distribution is assumed to be normal, with the mean and standard deviation each assigned a value of 3 (labeled μ and σ in our model). This prior is centered on the midpoint between the thresholds (in the inconclusive range), and its spread allows typical latent values to vary by twice the distance between the thresholds. The priors used in this analysis produce the highest likelihood ratios across all model variants tested. In the supplementary information we explore alternatives with wider spread and discuss their effects on the resulting likelihood ratio values: a t-distribution, wider μ priors, removing shrinkage, selectively removing poorly performing subjects, and including additional low-count samples.
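To give a feel for the fitting step, the simplified sketch below recovers μ and σ for a single image pair by maximum likelihood. The actual analysis is Bayesian and fit by MCMC under the priors just described [22]; this sketch omits the priors and shrinkage, so its estimates will not reproduce the published values. The response counts are those reported for case_0319 in Fig. 1.

```python
# Simplified (maximum likelihood) stand-in for the Bayesian fit: find the
# mu and sigma whose predicted conclusion proportions best match the observed
# counts under a multinomial likelihood. Priors and shrinkage are omitted.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

THRESHOLDS = (1.5, 4.5)   # the paper's fixed threshold anchors

def neg_log_likelihood(params, counts):
    """Multinomial negative log-likelihood of (Exclusion, Inconclusive,
    Identification) counts under the ordered probit forward model."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                    # keeps sigma positive
    p_exc = norm.cdf(THRESHOLDS[0], mu, sigma)
    p_id = 1.0 - norm.cdf(THRESHOLDS[1], mu, sigma)
    probs = np.clip([p_exc, 1.0 - p_exc - p_id, p_id], 1e-12, 1.0)
    return -np.sum(np.array(counts) * np.log(probs))

# case_0319 (Fig. 1): 3 exclusions, 8 inconclusives, 23 identifications.
counts = (3, 8, 23)
fit = minimize(neg_log_likelihood, x0=(3.0, 0.0), args=(counts,))
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(f"mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```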
In the following sections we describe how we construct likelihood ratios from the latent values estimated by the ordered probit model. Fig. 5 illustrates the individual normal distributions for all analyzed image pairs as determined by the model. Thin curves come from individual pairs, while the thick curves are discussed below. Narrower, taller distributions represent image pairs with more consistency between examiners, while wider ones reflect more variation between examiners (e.g., responses spread across the three categories).
Fig. 5.
Output of the ordered probit model. Each faint red curve represents a distribution of decisions within a single nonmated pair and each faint blue curve represents a distribution of decisions within a single mated pair. Bold curves represent a summed and renormalized distribution of all faint lines within that color. See text for details.
Samples whose curves lie further to the right contained more identification decisions and provide more support for the same source proposition. These samples likely contained more visual information consistent with the same source proposition and were judged to be identifications by more participants. As a simple analogy, a sample containing twenty corresponding minutiae would likely receive more identification decisions, and its distribution would sit further to the right on the latent axis, than a sample containing four corresponding minutiae.
We assume that samples are mutually exclusive and independent of the other image pairs in the Eldridge study [5]. Using the ‘or’ rule of probability, this allows us to sum all the mated distributions in Fig. 5 and renormalize so that the area under the summed curve is 1.0. In the supplementary information we discuss the consequences of possible violations of independence. The thick blue curve represents the relative probability of obtaining a particular latent value for any mated pair (the same source proposition). We repeat this for nonmated pairs, and the thick red curve represents the relative probability of obtaining a particular latent value for any nonmated pair (the different sources proposition). The faint peaks around −2.5 (red curves) and 8 (blue curves) on the latent dimension are samples on which the participants reached unanimous agreement. Curves centered around 3 indicate that most participants were inconclusive, where there is similar support for the same source and different sources propositions. To compute likelihood ratios, we need the support for the same- and different-sources propositions at each value along the latent axis. Table 1 presents a sample spanning the full range of likelihood ratios observed in the study.
Table 1.
This table illustrates a subset of comparisons from the Eldridge, De Donno, and Champod (2021) study; the full table is found in the Supplementary Information. The rows are sorted by the μ values, with smaller μ values at the top, so the items at the top have the least support for the same source proposition. PairID indicates the sample's label, “Mated” indicates whether the impressions came from the same or different sources, μ (mu) is the normal distribution's mean for that sample, and σ (sigma) is its standard deviation. LR is the likelihood ratio calculated from the μ location relative to the normalized curves. LRLowHDI and LRHighHDI are the lower and upper bounds of the 95 % HDI around the likelihood ratio. LogLRRange is log10(LRHighHDI) − log10(LRLowHDI) and represents the magnitude of uncertainty around the likelihood ratio. The columns Exclusion, Inconclusive, Identification, and NoValue give the number of examiners who reached each conclusion. MajorityID flags samples for which at least half of the examiners reached an Identification decision, and 2/3MajorityID is the same except that two-thirds of the examiners reached an Identification decision.
| pairID | Mated | mu | sigma | LR | LRLowHDI | LRHighHDI | LogLRRange | Exclusion | Inconclusive | Identification | No Value | MajorityID | 2/3MajorityID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | Different sources | −2.3 | 1.1 | 0.1 | 0.2 | 0.1 | 0.1 | 32 | 0 | 0 | 0 | FALSE | FALSE |
| 523 | Different sources | −2.2 | 1.2 | 0.1 | 0.2 | 0.1 | 0.1 | 24 | 0 | 0 | 0 | FALSE | FALSE |
| 177 | Different sources | −1.5 | 3.4 | 0.1 | 0.1 | 0.1 | 0.2 | 15 | 1 | 1 | 10 | FALSE | FALSE |
| 475 | Different sources | −1.4 | 1.8 | 0.1 | 0.1 | 0.2 | 0.2 | 24 | 1 | 0 | 0 | FALSE | FALSE |
| 430 | Different sources | −1.1 | 1.9 | 0.1 | 0.1 | 0.2 | 0.3 | 26 | 2 | 0 | 0 | FALSE | FALSE |
| 74 | Different sources | −0.7 | 1.9 | 0.1 | 0.1 | 0.2 | 0.5 | 23 | 3 | 0 | 0 | FALSE | FALSE |
| 163 | Different sources | −0.5 | 2.0 | 0.1 | 0.1 | 0.2 | 0.5 | 19 | 3 | 0 | 1 | FALSE | FALSE |
| 464 | Different sources | −0.2 | 1.9 | 0.1 | 0.1 | 0.2 | 0.5 | 20 | 4 | 0 | 1 | FALSE | FALSE |
| 383 | Different sources | 0.0 | 1.9 | 0.1 | 0.1 | 0.2 | 0.5 | 21 | 5 | 0 | 2 | FALSE | FALSE |
| 168 | Different sources | 0.3 | 1.8 | 0.1 | 0.1 | 0.2 | 0.5 | 15 | 5 | 0 | 0 | FALSE | FALSE |
| 71 | Different sources | 0.6 | 1.8 | 0.1 | 0.1 | 0.2 | 0.5 | 14 | 6 | 0 | 5 | FALSE | FALSE |
| 445 | Different sources | 0.7 | 1.7 | 0.1 | 0.1 | 0.2 | 0.4 | 15 | 7 | 0 | 2 | FALSE | FALSE |
| 145 | Different sources | 0.8 | 1.6 | 0.1 | 0.1 | 0.2 | 0.4 | 18 | 9 | 0 | 2 | FALSE | FALSE |
| 43 | Different sources | 1.1 | 1.5 | 0.2 | 0.1 | 0.2 | 0.3 | 16 | 10 | 0 | 2 | FALSE | FALSE |
| 405 | Different sources | 1.2 | 1.4 | 0.2 | 0.1 | 0.3 | 0.4 | 14 | 10 | 0 | 4 | FALSE | FALSE |
| 297 | Different sources | 1.3 | 1.4 | 0.2 | 0.1 | 0.3 | 0.3 | 15 | 12 | 0 | 1 | FALSE | FALSE |
| 261 | Same source | 1.4 | 2.3 | 0.2 | 0.1 | 0.5 | 0.7 | 8 | 7 | 1 | 0 | FALSE | FALSE |
| 315 | Different sources | 1.6 | 1.2 | 0.2 | 0.2 | 0.3 | 0.3 | 12 | 14 | 0 | 3 | FALSE | FALSE |
| 109 | Different sources | 1.8 | 1.1 | 0.2 | 0.2 | 0.4 | 0.3 | 10 | 17 | 0 | 4 | FALSE | FALSE |
| 56 | Different sources | 2.1 | 1.0 | 0.3 | 0.2 | 0.5 | 0.4 | 5 | 16 | 0 | 4 | FALSE | FALSE |
| 426 | Same source | 2.9 | 1.9 | 0.6 | 0.3 | 1.5 | 0.6 | 9 | 24 | 8 | 14 | FALSE | FALSE |
| 258 | Same source | 3.2 | 2.5 | 0.8 | 0.2 | 5.4 | 1.3 | 4 | 9 | 5 | 5 | FALSE | FALSE |
| 3 | Same source | 3.6 | 1.8 | 1.4 | 0.5 | 5.3 | 1.0 | 2 | 13 | 6 | 36 | FALSE | FALSE |
| 280 | Same source | 3.9 | 3.0 | 2.2 | 0.3 | 38.9 | 2.1 | 3 | 6 | 7 | 0 | FALSE | FALSE |
| 99 | Same source | 4.6 | 2.8 | 5.7 | 1.1 | 45.4 | 1.6 | 4 | 12 | 17 | 11 | TRUE | FALSE |
| 51 | Same source | 4.7 | 2.0 | 7.2 | 1.8 | 44.1 | 1.4 | 1 | 11 | 14 | 11 | TRUE | FALSE |
| 81 | Same source | 5.0 | 1.4 | 11.0 | 4.4 | 49.1 | 1.1 | 0 | 13 | 23 | 0 | TRUE | TRUE |
| 307 | Same source | 5.4 | 4.2 | 22.9 | 1.1 | 796.7 | 2.8 | 4 | 4 | 15 | 17 | TRUE | TRUE |
| 63 | Same source | 5.6 | 7.0 | 29.4 | 1.7 | 631.0 | 2.6 | 24 | 9 | 50 | 0 | TRUE | TRUE |
| 306 | Same source | 5.7 | 1.6 | 38.4 | 6.7 | 449.5 | 1.8 | 0 | 7 | 26 | 1 | TRUE | TRUE |
| 148 | Same source | 6.0 | 3.0 | 56.5 | 2.8 | 1245.4 | 2.6 | 1 | 3 | 12 | 0 | TRUE | TRUE |
| 85 | Same source | 6.2 | 4.3 | 81.1 | 2.1 | 1733.7 | 2.9 | 3 | 2 | 15 | 0 | TRUE | TRUE |
| 293 | Same source | 6.3 | 4.6 | 102.0 | 1.9 | 2056.0 | 3.0 | 3 | 1 | 14 | 1 | TRUE | TRUE |
| 457 | Same source | 6.4 | 1.8 | 121.5 | 7.7 | 1710.5 | 2.3 | 0 | 2 | 16 | 1 | TRUE | TRUE |
| 525 | Same source | 6.5 | 1.8 | 140.4 | 8.5 | 1789.1 | 2.3 | 0 | 2 | 18 | 1 | TRUE | TRUE |
| 469 | Same source | 6.7 | 3.1 | 191.9 | 7.4 | 2148.3 | 2.5 | 1 | 2 | 15 | 1 | TRUE | TRUE |
| 76 | Same source | 6.8 | 3.1 | 227.0 | 8.5 | 2174.5 | 2.4 | 1 | 2 | 16 | 7 | TRUE | TRUE |
| 390 | Same source | 6.9 | 1.6 | 244.7 | 9.9 | 2430.9 | 2.4 | 0 | 1 | 18 | 0 | TRUE | TRUE |
| 334 | Same source | 7.1 | 3.2 | 362.0 | 9.8 | 2917.7 | 2.5 | 1 | 1 | 15 | 0 | TRUE | TRUE |
| 466 | Same source | 7.3 | 4.2 | 435.3 | 8.1 | 4214.0 | 2.7 | 2 | 0 | 15 | 0 | TRUE | TRUE |
| 30 | Same source | 7.5 | 4.4 | 554.7 | 14.8 | 3741.4 | 2.4 | 3 | 1 | 22 | 0 | TRUE | TRUE |
| 13 | Same source | 7.5 | 1.1 | 575.3 | 13.3 | 5345.3 | 2.6 | 0 | 0 | 17 | 0 | TRUE | TRUE |
| 396 | Same source | 7.5 | 1.1 | 592.1 | 14.2 | 5732.6 | 2.6 | 0 | 0 | 18 | 0 | TRUE | TRUE |
| 89 | Same source | 7.6 | 1.0 | 642.5 | 14.7 | 6059.5 | 2.6 | 0 | 0 | 24 | 1 | TRUE | TRUE |
| 275 | Same source | 7.6 | 1.0 | 674.6 | 13.8 | 6191.9 | 2.7 | 0 | 0 | 31 | 0 | TRUE | TRUE |
| 286 | Same source | 7.7 | 1.0 | 690.8 | 14.4 | 6466.0 | 2.7 | 0 | 0 | 31 | 0 | TRUE | TRUE |
| 359 | Same source | 7.7 | 1.0 | 703.4 | 17.4 | 7248.3 | 2.6 | 0 | 0 | 37 | 0 | TRUE | TRUE |
| 391 | Same source | 7.7 | 3.3 | 712.9 | 19.4 | 5591.3 | 2.5 | 1 | 0 | 15 | 0 | TRUE | TRUE |
| 330 | Same source | 7.8 | 4.8 | 768.8 | 24.2 | 4797.3 | 2.3 | 4 | 1 | 27 | 1 | TRUE | TRUE |
| 11 | Same source | 7.8 | 4.8 | 845.2 | 26.9 | 5032.1 | 2.3 | 4 | 1 | 28 | 0 | TRUE | TRUE |
| 516 | Same source | 8.1 | 3.0 | 1046.1 | 57.9 | 6094.3 | 2.0 | 1 | 1 | 28 | 2 | TRUE | TRUE |
| 47 | Same source | 8.4 | 3.0 | 1293.5 | 62.9 | 8985.4 | 2.2 | 1 | 0 | 24 | 1 | TRUE | TRUE |
| 72 | Same source | 8.7 | 2.9 | 1565.4 | 111.2 | 11916.1 | 2.0 | 1 | 0 | 31 | 1 | TRUE | TRUE |
| 524 | Same source | 9.1 | 3.9 | 1923.9 | 234.5 | 13966.7 | 1.8 | 3 | 2 | 56 | 0 | TRUE | TRUE |
2.3. Computing likelihood ratios
The thick blue and red curves in Fig. 5 represent the relative probability of a particular latent value (the observation, which is a location along the x axis in Fig. 5) given that a mated or nonmated pair was presented. The likelihood ratio is the ratio of these two relative probability values computed at each location on the latent axis (blue line divided by red line). As an example, consider a latent value of 7.5, which reflects the information within an image pair as summarized by the distribution of conclusions across all examiners who completed that comparison. The likelihood ratio at this location is the height of the bold blue curve at 7.5 divided by the height of the bold red curve at 7.5. This ratio of relative probabilities (likelihoods) is the likelihood ratio at that location. Fig. 6 plots the ratio of the likelihoods (using the bold curves) at each location along the latent axis.
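The following sketch makes this computation concrete: it sums per-pair normal densities into mated and nonmated mixtures, renormalizes each to unit area, and takes their ratio at a latent value of 7.5. The handful of (μ, σ) pairs below is lifted from rows of Table 1 purely for illustration; the actual analysis uses all 267 fitted image pairs.

```python
# Likelihood ratio at a latent value: height of the renormalized mated
# mixture divided by the height of the renormalized nonmated mixture.
from scipy.stats import norm

# (mu, sigma) values borrowed from a few rows of Table 1 for illustration only.
mated = [(7.5, 1.1), (5.0, 1.4), (3.2, 2.5)]
nonmated = [(-2.3, 1.1), (0.0, 1.9), (2.1, 1.0)]

def mixture_density(params, x):
    """Sum the per-pair normal densities ('or' rule) and renormalize:
    dividing the sum by the number of pairs keeps the total area at 1.0."""
    return sum(norm.pdf(x, mu, sigma) for mu, sigma in params) / len(params)

latent_value = 7.5
lr = mixture_density(mated, latent_value) / mixture_density(nonmated, latent_value)
print(f"LR at latent value {latent_value}: {lr:.1f}")
```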
The final step in the ordered probit model approach is to compute an individual likelihood ratio for each image pair from its relationship to the overall data. We assume that the mean of the normal distribution associated with each image pair represents the typical strength of support offered by that image pair, and we use this value in conjunction with Fig. 6 to determine the likelihood ratio associated with the pair. Table 1 presents a subset of the likelihood ratios from the Eldridge study; the full table is in the supplementary information. The likelihood ratios range from well below 1 to the thousands, and in the Discussion we describe some implications of these values.
Fig. 6.
A graphic representation of the continuous measure represented by the ratio of the bold blue and red curves in Fig. 5. The Latent Dimension is a continuous value representing possible values of an image pair. The red dots indicate all mated pairs and their location along the latent dimension. Larger dots are those which contained a majority of identification decisions.
In addition to the likelihood ratio column in Table 1, we have included the 95 % highest density interval (HDI) values around the likelihood ratios that result from the MCMC process, along with the log of the range of HDI values. The 95 % HDI limits represent the range of credible values for the likelihood ratio, as derived from the 95 % HDI on the mu values. Note that samples with greater variation among examiners (which will have larger sigma values) also tend to have wider HDI ranges that imply greater uncertainty. In fact, the correlation between the sigma values and the log HDI interval is 0.42, which supports the observation that greater uncertainty among examiners is associated with greater uncertainty for the likelihood ratio.
While the mated or nonmated state of each image pair in the study was known (which is why we can color-code the curves), casework has no ground truth. Within the friction ridge discipline, conclusions of identification are typically only reported to the factfinder when at least two examiners agree, a process known as verification [23]. We consider a sample to have a reasonable likelihood of being reported to a factfinder as an identification when at least 50 % of the participants reached a conclusion of identification, and we call this Majority ID. In the Eldridge et al. [5] database, Majority ID samples had likelihood ratios ranging from as low as 5.7 up to 1982. These samples are flagged in the MajorityID column of Table 1.
The lowest likelihood ratio increases to 10.2 if a 66 % supermajority is required; refer to the Supplemental Information for a discussion and table for this comparison. Data, analysis code, and full graphs are available at:
https://osf.io/at3hw/?view_only=3b711922cfc74ebaa4951248fa838259.
For a normal distribution to account for erroneous exclusions, the standard deviation must be larger on impressions with near but not complete unanimity. Because of these increased standard deviation values, the μ (and therefore the likelihood ratio) values are slightly higher than those of completely unanimous comparisons: the standard deviation must grow to cover the erroneous exclusions (the far left side of the distribution), which allows the μ value to shift higher than in the completely unanimous comparisons. This is purely a statistical artifact that results from the choice of a symmetric distribution. For example, case_524 (likelihood ratio 1923.9) has a high standard deviation (3.9) due to its three erroneous exclusions and two inconclusive decisions, as seen in Table 1, and has a higher μ value than case_359 (likelihood ratio 703.4), which was unanimously identified (standard deviation 1.0). The 26 samples with the highest likelihood ratio values appear to be subject to this style of inflation (a selection of these appear in Table 1).
While this aspect of the model is a clear limitation, it only affects comparisons that are nearly unanimous, which arguably depend less on statistical support because their image quality and correspondence are clear even to a layperson. Samples that are barely Majority ID are unaffected by this artifact.
Previously we noted that unanimity was the exception, rather than the norm, in the palmprint data, and the ordered probit model explicitly models this variability across examiners, summarizing the responses on the latent axis as a normal distribution. Rather than requiring complete reproducibility across examiners, the model embraces this diversity to quantify the observation that image pairs with more disagreement among examiners should have lower evidentiary strength. While the justice system may have difficulty with experts who disagree, a scientific view expects variation among examiners, especially with more difficult image pairs. An advantage of the ordered probit model is that it distinguishes the relative amounts of support for the same- and different-sources propositions for these non-unanimous samples: as more examiners choose responses other than Identification, the likelihood ratio drops.
3. Discussion
3.1. Source data study designs: palms vs fingers
The likelihood ratios we report in this work are conditioned on the information and structure of the black box study (as are the error rates reported from such studies). Black box studies require care to create [24], and the data they provide are a function of the images used, the subjects who participated, the method the subjects used, and the testing environment. For example, if a researcher chose to include a large proportion of trivial exclusions, the erroneous identification rate would be greatly reduced. The Ulery et al. [10] and Eldridge et al. [5] studies involve the comparison of friction ridge skin with a similar pool of possible subjects. Study design has several impacts on the overall mated and nonmated distributions and the resulting likelihood ratios. Every dataset and participant pool can generate different results, and we have identified four factors that might influence them, as discussed below.
First, Ulery et al. [10] chose nonmated exemplar images from AFIS database searching (58 million individuals). The Eldridge et al. [5] study searched 25,000 palm records, and the principal investigator chose between the AFIS results and a manually selected nonmate, depending on the trial. While the selection procedure in Ulery et al. [10] may better represent casework for participants who use large databases, the procedure in Eldridge et al. [5] may more closely reflect small database searches or manual comparison requests.
Second, active training and research efforts to reduce erroneous fingerprint exclusions have been a priority. In the years following the original fingerprint black box data [10], the field has experienced changes in policies and increased training and awareness of erroneous exclusions [25]. Based on conversations with practitioners, these efforts have not been mirrored for erroneous exclusions in palm impressions.
Third, documentation of features to be used in later phases was available in both studies, but additional markup tools were encouraged in the Eldridge et al. [5] study, whereas the interface in the Ulery et al. [10] study was more limited. Participants anchoring themselves visually with markers may have affected their ability to perform the comparisons.
Finally, the likelihood ratios are sensitive to examiners' sufficiency-for-comparison decisions, because ‘no value’ decisions were not compared in the Eldridge dataset and did not contribute to this analysis. However, some comparisons received ‘of value’ determinations that were followed by a large number of inconclusive decisions. For example, in Fig. 5, the large humps around 3 in both the mated and nonmated curves result from trials in which examiners made predominately inconclusive decisions. Because both thick curves are normalized, these comparisons effectively lower the likelihood ratios of other comparisons by pulling the thick blue and red curves closer together.
Challenging impressions contribute to higher rates of inconclusive decisions, but black box studies with only extremely easy comparisons are unlikely to reflect casework. The proportion of mated to nonmated pairs each subject received (53 mated to 22 nonmated) can also influence the overall modeling and likelihood ratios. If examiners adopted a higher threshold of sufficiency for comparison, this would improve likelihood ratios overall, much like a student who is not penalized for skipping hard questions and thus answers only the easy ones. However, because more conservative examiners would contribute less information to the justice system, we are not advocating for such a shift.
Our likelihood ratios reflect the context of a given study design and could differ from what one might anticipate partly for that reason. Thus, the context of the black box study should be treated as conditioning information when discussing these likelihood ratios. The likelihood ratios represent the strength of the evidence for those aspects of casework that resemble the construction of the Eldridge et al. [5] study: the examiner is presented with an image pair and begins a deliberative comparison. Aspects of casework not captured by this process, such as easy exclusions based on pattern type, are not part of the likelihood ratio calculation.
3.2. Calibration of articulation scales
Based upon this model, a forensic report containing a palm print identification conclusion could indicate a strength of support as low as 5.7 times more support for the observations given the named individual than given any other individual. It is our view that terms such as “substantially stronger support”, “extremely strong support”, “sufficient agreement”, and “practical impossibility” [[1], [2], [3]] are misleadingly different from the likelihood ratios calculated near the threshold for majority identification. However, the likelihood ratios in the present work are a function of the images used in the study (the conditioning information), and this may have affected the results. Most of the samples provided to practitioners in the Eldridge study were difficult: only 41 % of the images were rated as very easy, easy, or medium difficulty by the researchers [5]. The likelihood ratio model provided here may therefore undervalue the strength of the typical casework item of evidence. Casework decisions may be easier and more unanimous, which may produce likelihood ratios more closely aligned with current definitions of the identification conclusion. We caution against using the highest likelihood ratio values as an endpoint for the highest discipline-wide strength of support; the samples, the subjects, the categorical labels, and how the labels are defined all affect the values, as discussed in the supplemental information.
The miscalibration described above is not specific to the U.S. reporting scale. The European Network of Forensic Science Institutes provides verbal equivalency scales in the range of the likelihood ratios calculated here, including Moderate Support (10–100), Moderately Strong Support (100–1000), and Strong Support (1000–10,000) [26]. When informed priors are used (i.e., a narrower prior on μ to constrain unanimous image pairs), no sample in this dataset supports conclusions such as Very Strong Support (10,000–1,000,000) or Extremely Strong Support (1,000,000 and above) for common sources. Our low end of likelihood ratios for majority-ID image pairs is 6.9, five orders of magnitude below 1,000,000.
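As a simple illustration of where the palm values fall on that scale, the lookup below encodes only the ENFSI bands named in the text [26]; it is our sketch, not code from the guideline, and the guideline's categories below a likelihood ratio of 10 are omitted.

```python
# Lookup sketch of the ENFSI verbal-equivalence bands quoted in the text.
# Bands below an LR of 10 exist in the guideline but are omitted here.
def enfsi_verbal(lr: float) -> str:
    if lr >= 1_000_000:
        return "Extremely strong support"
    if lr >= 10_000:
        return "Very strong support"
    if lr >= 1_000:
        return "Strong support"
    if lr >= 100:
        return "Moderately strong support"
    if lr >= 10:
        return "Moderate support"
    return "(below the bands quoted in the text)"

# The lowest majority-ID palm LR (5.7) falls below even Moderate Support,
# while the highest value in Table 1 (1923.9) reaches only Strong Support.
print(enfsi_verbal(5.7), "|", enfsi_verbal(1923.9))
```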
The published verbal equivalency guidelines may be overinflated due to the strength of DNA analysis and not indicative of strength of evidence in other disciplines [27,28]. The large likelihood ratios seen in DNA are likely to overwhelm most prior odds [29], but this may not be true in friction ridge comparisons.
Unanimous samples may require a different method of calculation to attain accurate likelihood ratio values, and score-based approaches such as FRStat [30] and Xena [31] may be more appropriate. These statistical models quantify the value of latent print comparisons using physical features such as minutiae count and orientation [[32], [33], [34]]. These quantitative approaches are based on minutiae from fingers, are compared against a database, rely on the minutiae quantity, placement, and relationships chosen by the examiner, and were not designed to describe the strength of different sources decisions. Some authors have argued that these models require additional calibration [35]. Quantitative models based on rarity data provide a contrast to the model proposed here, which is based on human performance data. Our model characterizes the strength of human-based conclusions, including the examiner's interpretation of ridge flow, ridge edges, and shapes. By harnessing the examiner's visual system, the model can be applied to the comparison of any two items, including items from other forensic disciplines such as firearms [36]. If the examiner uses a rarity-based model (e.g., FRStat [34]) to inform their conclusion, our approach will fold the combined output of the statistical tool and the examiner into one fused measure of the strength of support, albeit currently only in black box testing settings.
We caution against the assumption that all areas of skin would have the same range of likelihood ratios, as the results here demonstrate variation in examiner performance on palms compared to fingers. Majority Identification likelihood ratios ranging from 6.9 to 110,000 and from 23 to 22,000 for fingerprints [4], versus 5.7 to 1982 for palm samples, highlight the need for additional black box studies and research to determine likelihood ratio ranges for joints or plantar (foot) regions.
3.3. Likelihood ratios in casework
While the present approach does not create likelihood ratios for casework comparisons, it does help calibrate the conclusion scales that might be used by latent print examiners. Current research in our laboratory is exploring the possibility of extending likelihood ratios to operational casework, using benchmark comparisons with known likelihood ratios against which casework comparisons are judged. If successful, this approach could help ground examiner observations in a numerical strength-of-support scale in the form of a likelihood ratio.
There is also concern among examiners about statistical models in general, and especially that low values may be overinterpreted [34,37]. Even so, low likelihood ratios can still contribute meaningfully to a case [28,38]. Additional research is needed on interpretations of likelihood ratio values in relationship to criminal case outcomes and strategies to communicate clearly the strength of findings. To engage with policy makers, forensic practitioners and laboratory managers should understand the priors and implications of their analyses before using any statistical reporting scheme.
Likelihood ratio calculations are intended to be combined with other variables from a particular case, such as additional forensic evidence, testimony of lay witnesses, motive, means, and opportunity. Studies on jury interpretations are mixed: some indicate juries may underweight any testimony involving statistical evidence [39,40], while another indicates similar interpretation of evidence across reporting styles when jurors are given additional context [41]. Efforts should be made to communicate when samples are highly dependent upon certain examiners (by reporting specific and discipline-wide examiner performance), when a sample is low quality or likely to cause disagreement between practitioners, or when a procedure deviated from normal laboratory policies or discipline standards. Debate over specific styles of reporting comes from multiple perspectives [39], and ultimately, when the dust settles, we will support the style of communication that most accurately translates the mindset of the practitioner to that of the jury. Of course, the practitioner could be incorrect or use an inappropriately high or low likelihood ratio. However, our goal is to improve communication between the practitioner and the consumer, and likelihood ratios have the advantage of referencing both propositions and providing more information about the observed strength of support for each. We believe that applying tools like the ordered probit model to casework provides the most transparent and informative scientific analysis to the judicial system.
CRediT authorship contribution statement
Meredith Coon: Writing – review & editing, Writing – original draft, Formal analysis, Data curation, Conceptualization. Thomas Busey: Writing – review & editing, Formal analysis, Data curation, Conceptualization.
Declaration of competing interest
The authors have no competing interests to declare.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.fsisyn.2025.100628.
Contributor Information
Meredith Coon, Email: MeredithACoon@gmail.com.
Thomas Busey, Email: busey@iu.edu.
Appendix A. Supplementary data
The following is the Supplementary data to this article.
References
- 1. AAFS Academy Standards Board. ASB Standard 013: Standard for friction ridge examination conclusions. 2022. www.aafs.org/academy-standards-board
- 2. OSAC Friction Ridge Subcommittee. Standard for friction ridge examination conclusions [draft document]. 2018. https://www.nist.gov/system/files/documents/2018/07/17/standard_for_friction_ridge_examination_conclusions.pdf
- 3. SWGFAST. Guideline for the articulation of the decision-making process for the individualization in friction ridge examination. Scientific Working Group on Friction Ridge Analysis, Study and Technology; 2013. www.swgfast.org
- 4. Busey T., Coon M. Not all identification conclusions are equal: quantifying the strength of fingerprint decisions. Forensic Sci. Int. 2023;343:111543. doi: 10.1016/j.forsciint.2022.111543.
- 5. Eldridge H., De Donno M., Champod C. Testing the accuracy and reliability of palmar friction ridge comparisons – a black box study. Forensic Sci. Int. 2021;318:110457. doi: 10.1016/j.forsciint.2020.110457.
- 6. Dewan S. Elementary, Watson: scan a palm, find a clue. The New York Times; 2003.
- 7. Schooley S. Personal communication to M. Coon. 11-16-2023.
- 8. Gutierrez E., Galera V., Martinez J.M., Alonso C. Biological variability of the minutiae in the fingerprints of a sample of the Spanish population. Forensic Sci. Int. 2007;172(2–3):98–105. doi: 10.1016/j.forsciint.2006.12.013.
- 9. Pacheco I., Cerchiai B., Stoiloff S. Miami-Dade research study for the reliability of the ACE-V process: accuracy & precision in latent fingerprint examinations. 2014.
- 10. Ulery B.T., Hicklin R.A., Buscaglia J., Roberts M.A. Accuracy and reliability of forensic latent fingerprint decisions. Proc. Natl. Acad. Sci. U. S. A. 2011;108(19):7733–7738. doi: 10.1073/pnas.1018707108.
- 11. Ulery B.T., Hicklin R.A., Buscaglia J., Roberts M.A. Repeatability and reproducibility of decisions by latent fingerprint examiners. PLoS One. 2012;7(3):e32800. doi: 10.1371/journal.pone.0032800.
- 12. State v. Kuhl et al. Supreme Court of Nevada; 1918.
- 13. Jain A.K. Latent palmprint matching. IEEE Trans. Pattern Anal. Mach. Intell. 2009;31:1032–1047. doi: 10.1109/TPAMI.2008.242.
- 14. Evett I.W. Towards a uniform framework for reporting opinions in forensic science casework. Sci. Justice. 1998;38(3):198–202. doi: 10.1016/S1355-0306(98)72105-7.
- 15. Biedermann A. The strange persistence of (source) “identification” claims in forensic literature through descriptivism, diagnosticism and machinism. Forensic Sci. Int.: Synergy. 2022;4:100222. doi: 10.1016/j.fsisyn.2022.100222.
- 16. Evett I.W. Avoiding the transposed conditional. Sci. Justice. 1995;35(2):127–131.
- 17. Warren E.M., Handley J.C., Sheets H.D. Cross entropy and log likelihood ratio cost as performance measures for multi-conclusion categorical outcomes scales. J. Forensic Sci. 2025;70(2):589–606. doi: 10.1111/1556-4029.15686.
- 18. Wixted J.T. The forgotten history of signal detection theory. J. Exp. Psychol. Learn. Mem. Cognit. 2020;46(2):201. doi: 10.1037/xlm0000732.
- 19. Aggadi N., Coon M., Busey T. Measuring factors associated with identification thresholds in fingerprint analysts. J. Forensic Sci. 2025:1–13. doi: 10.1111/1556-4029.70085.
- 20. Ulery B.T., Hicklin R.A., Roberts M.A., Buscaglia J. Measuring what latent fingerprint examiners consider sufficient information for individualization determinations (correction; original vol. 9, e110179, 2014). PLoS One. 2015;10(2):e0118172. doi: 10.1371/journal.pone.0118172.
- 21. Fechner G.T. Elemente der Psychophysik. Vol. 2; 1860.
- 22. Kruschke J. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. 2014.
- 23. Vanderkolk J. The Fingerprint Sourcebook, Chapter 9: Examination process. US Dept. of Justice, Office of Justice Programs, National Institute of Justice; 2010.
- 24. Taylor M., Hicklin A., Kiebuzinski G. Best Practices in the Collection and Use of Biometric and Forensic Datasets. 2021.
- 25. Ray E., Dechant P. Sufficiency and standards for exclusion decisions. J. Forensic Ident. 2013:675–697.
- 26. Willis S., McKenna L., McDermott S., O'Donell G., Barrett A., Rasmusson B., …, Lucena-Molina J. ENFSI Guideline for Evaluative Reporting in Forensic Science. European Network of Forensic Science Institutes; 2015.
- 27. Marquis R., Biedermann A., Cadola L., Champod C., Gueissaz L., Massonnet G., …, Hicks T. Discussion on how to implement a verbal scale in a forensic laboratory: benefits, pitfalls and suggestions to avoid misunderstandings. Sci. Justice. 2016;56(5):364–370. doi: 10.1016/j.scijus.2016.05.009.
- 28. Robertson B., Vignaux G.A., Berger C.E. Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley & Sons; 2016.
- 29. Neumann C., Ausdemore M. Communicating forensic evidence: is it appropriate to report posterior beliefs when DNA evidence is obtained through a database search? Law Probab. Risk. 2019;18(1):25–34.
- 30. Swofford H., Koertner A., Zemp F., Ausdemore M., Liu A., Salyards M. A method for the statistical interpretation of friction ridge skin impression evidence: method development and validation. Forensic Sci. Int. 2018;287:113–126. doi: 10.1016/j.forsciint.2018.03.043.
- 31. Anthonioz N.M., Champod C. Evidence evaluation in fingerprint comparison and automated fingerprint identification systems—modeling between finger variability. Forensic Sci. Int. 2014;235:86–101. doi: 10.1016/j.forsciint.2013.12.003.
- 32. Egli Anthonioz N., Champod C. Evidence evaluation in fingerprint comparison and automated fingerprint identification systems—modeling between finger variability. Forensic Sci. Int. 2014;235:86–101. doi: 10.1016/j.forsciint.2013.12.003.
- 33. Neumann C., Evett I.W., Skerrett J. Quantifying the weight of evidence from a forensic fingerprint comparison: a new paradigm. J. Roy. Stat. Soc. 2012;175(2):371–415. doi: 10.1111/j.1467-985X.2011.01027.x.
- 34. Swofford H.J., Koertner A.J., Zemp F., Ausdemore M., Liu A., Salyards M.J. A method for the statistical interpretation of friction ridge skin impression evidence: method development and validation. Forensic Sci. Int. 2018;287:113–126. doi: 10.1016/j.forsciint.2018.03.043.
- 35. Hannig J., Iyer H. Testing for calibration discrepancy of reported likelihood ratios in forensic science. J. Roy. Stat. Soc. Ser. A. 2022;185(1):267–301.
- 36. Aggadi N., Zeller K., Busey T. Quantifying the strength of firearms comparisons based on error rate studies. J. Forensic Sci. 2025;70(1):84–97. doi: 10.1111/1556-4029.15646.
- 37. Swofford H., Cole S., King V. Mt. Everest—we are going to lose many: a survey of fingerprint examiners' attitudes towards probabilistic reporting. Law Probab. Risk. 2020;19(3–4):255–291.
- 38. Aitken C.G., Stoney D.A. The Use of Statistics in Forensic Science. CRC Press; 1991.
- 39. Eldridge H. Juror comprehension of forensic expert testimony: a literature review and gap analysis. Forensic Sci. Int.: Synergy. 2019;1:24–34. doi: 10.1016/j.fsisyn.2019.03.001.
- 40. Hans V.P., Saks M.J. Improving judge & jury evaluation of scientific evidence. Daedalus. 2018;147(4):164–180.
- 41. Bali S., Martire K. Exploring mock juror evaluations of forensic evidence conclusion formats within a complete expert report. Forensic Sci. Int.: Synergy. 2025;10:100564. doi: 10.1016/j.fsisyn.2024.100564.