Two Types of Error: False Positives and False Negatives
I’m honored to write this commentary for the special issue of Child Maltreatment regarding legal intervention in cases of child maltreatment. My work concerns child witnesses’ performance, focusing on the difficulties that children have in sexual abuse cases, and I’ll draw my examples from this field. However, the issues concern researchers and policymakers in all aspects of child maltreatment.
A recurring and central issue regards two types of error: false positives and false negatives. In the context of child maltreatment, a false positive is a false finding of abuse, and a false negative is an erroneous finding that abuse did not occur. For practical purposes, false negatives include cases in which abuse occurred, but for which there is insufficient evidence supporting an abuse finding. In criminal law, one speaks of false convictions and false acquittals. But one can think of the two types of error at every stage of legal intervention. In child abuse reporting, for example, there are false reports and false judgments that reporting is not warranted. Furthermore, the concept of false positives and false negatives can be applied to evidence suggesting that abuse occurred. In this context, a false positive occurs when the evidence appears in the absence of abuse, and a false negative occurs when the evidence is lacking in the presence of abuse. In what follows, I will focus on the difficulties in assessing evidence of abuse, and in particular the difficulties in assessing children’s behavior that might indicate that abuse occurred. This behavior includes what is probably the most important evidence in sexual abuse cases: the child’s statements describing abuse.
The ways in which false positives and false negatives are related have led to considerable confusion, certainly in the law (Lyon & Koehler, 1996), and to some extent among researchers as well. Let’s spell out the four different possibilities: true positives and false positives, true negatives and false negatives (Table 1). Somewhat counterintuitively, true positives and false negatives are complementary. True positives are cases in which abused children exhibit a behavior. The true-positive rate is the percentage or proportion of abused children who exhibit the behavior. False negatives are cases in which abused children do not exhibit the behavior. The false-negative rate is the percentage or proportion of abused children who do not exhibit the behavior. One chooses the threshold for deciding whether one counts the behavior as present (positive) or absent (negative). Therefore, one can derive the false-negative rate from the true-positive rate: they sum to 100%.
Table 1.
Correct Judgments and Errors in Assessing Behaviors in Abused and Non-Abused Children.
| | Behavior Present | Behavior Absent |
|---|---|---|
| Abused | True positive | False negative |
| Non-abused | False positive | True negative |
Similarly, false positives and true negatives are also complementary. False positives are cases in which non-abused children exhibit a behavior; the false-positive rate is the proportion or percentage of non-abused children who exhibit the behavior. True negatives are cases in which non-abused children don’t exhibit the behavior; the true-negative rate is the proportion or percentage of non-abused children who don’t exhibit the behavior. Again, one can derive the true-negative rate from the false-positive rate: they sum to 100%.
The reader must be careful, however, not to misinterpret the relation among the four possibilities. Given the true-positive rate, one can deduce the false-negative rate, and given the false-positive rate, one can deduce the true-negative rate. In other words, the rows in Table 1 each sum to 100%. However, one cannot deduce the true-positive rate from the false-positive rate, although they are related in an important way I will discuss below.
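The complementarity within each row can be written compactly. The notation here is my own shorthand for illustration, not drawn from the cited sources: $A$ means the child was abused, $B$ means the behavior is present, and $\neg$ marks the opposite.

```latex
\begin{align*}
\text{true-positive rate}  &= P(B \mid A),           &
\text{false-negative rate} &= P(\neg B \mid A),      &
P(B \mid A) + P(\neg B \mid A) &= 1, \\
\text{false-positive rate} &= P(B \mid \neg A),      &
\text{true-negative rate}  &= P(\neg B \mid \neg A), &
P(B \mid \neg A) + P(\neg B \mid \neg A) &= 1.
\end{align*}
```

Each identity holds within a single row of Table 1. Nothing ties $P(B \mid A)$ to $P(B \mid \neg A)$, which is why neither can be deduced from the other.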
Researchers estimate the four different rates when they examine groups of children who have been identified as abused and non-abused. In order to decide if a behavior is evidence of abuse, researchers assess whether the percentage of abused children who exhibit the behavior is higher than the percentage of non-abused children who exhibit the behavior; in other words, whether the true-positive rate is higher than the false-positive rate. Moreover, the ratio of the true-positive rate to the false-positive rate indicates the strength of the evidence. The higher the ratio, the stronger the evidence (Wood, 1996).
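A minimal Python sketch of this comparison follows. The counts are hypothetical and are not taken from any study; they were chosen to show that a behavior exhibited by only a minority of abused children can still be evidence of abuse when it is rarer still among non-abused children. The names `Counts`, `rates`, and `relevance_ratio` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical counts arranged like Table 1 (not real data)."""
    abused_present: int       # true positives
    abused_absent: int        # false negatives
    nonabused_present: int    # false positives
    nonabused_absent: int     # true negatives


def rates(c: Counts) -> dict[str, float]:
    """Return the four rates; each is a proportion of its own row."""
    abused = c.abused_present + c.abused_absent
    nonabused = c.nonabused_present + c.nonabused_absent
    return {
        "true_positive": c.abused_present / abused,
        "false_negative": c.abused_absent / abused,
        "false_positive": c.nonabused_present / nonabused,
        "true_negative": c.nonabused_absent / nonabused,
    }


def relevance_ratio(c: Counts) -> float:
    """Ratio of the true-positive rate to the false-positive rate."""
    r = rates(c)
    return r["true_positive"] / r["false_positive"]


# Assumed example: 20 of 80 abused children and 10 of 160 non-abused
# children exhibit the behavior.
example = Counts(abused_present=20, abused_absent=60,
                 nonabused_present=10, nonabused_absent=150)

if __name__ == "__main__":
    for name, value in rates(example).items():
        print(f"{name}: {value:.4f}")
    print(f"ratio: {relevance_ratio(example):.1f}")
```

In this assumed example only a quarter of the abused children show the behavior, yet the ratio exceeds 1, so the behavior would count as evidence of abuse. The sketch does not handle a false-positive rate of zero, for which the ratio is undefined.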
When the courts decide whether behavior is evidence of abuse, they assess the behavior’s logical relevance, which asks whether the evidence makes a legally significant fact more likely than it would be without the evidence (Federal Rule of Evidence 401, 2023). The logic is the same as assessing whether the true-positive rate is higher than the false-positive rate (Lyon & Koehler, 1996). Moreover, the strength of the evidence is characterized in the law as the weight of the evidence, or its probative value. The legal rule of admissibility is lenient, and the courts often dismiss objections to the admissibility of evidence as affecting the weight of the evidence rather than its admissibility (Mueller et al., 2018). Evidence may be inadmissible when its probative value is likely to be misevaluated by the jury, but the misevaluation must be “substantial” (Federal Rule of Evidence 403, 2023).
Setting the Threshold for Identifying Positives/Negatives
Although one cannot deduce the false-positive rate from the true-positive rate, there is an important relation between the two. If one wishes to reduce the false-positive rate, then one can become more conservative in identifying instances of a behavior. For example, if one is assessing disclosures of abuse, one can define a “disclosure” more narrowly. With respect to deciding if abuse occurred, or if one should report suspected abuse, one can raise the evidentiary threshold for the decision. However, doing so comes at a cost: decreasing false positives will also decrease true positives.
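The tradeoff can be illustrated with a small Python sketch. It assumes, purely for illustration, that each case can be given a numeric strength-of-evidence score and that a case counts as positive when its score meets the threshold. The scores are invented, and `rates_at_threshold` is a hypothetical helper.

```python
def rates_at_threshold(abused_scores, nonabused_scores, threshold):
    """Return (true-positive rate, false-positive rate) when a case is
    counted as positive if its score is at or above the threshold."""
    tpr = sum(s >= threshold for s in abused_scores) / len(abused_scores)
    fpr = sum(s >= threshold for s in nonabused_scores) / len(nonabused_scores)
    return tpr, fpr


# Invented scores on an arbitrary 0-10 scale (assumption, not data).
ABUSED_SCORES = [2, 4, 5, 6, 7, 8, 8, 9]
NONABUSED_SCORES = [0, 1, 1, 2, 3, 4, 5, 7]

if __name__ == "__main__":
    for threshold in (3, 5, 7):
        tpr, fpr = rates_at_threshold(ABUSED_SCORES, NONABUSED_SCORES, threshold)
        print(f"threshold {threshold}: true-positive rate {tpr:.3f}, "
              f"false-positive rate {fpr:.3f}")
```

With these invented scores, each increase in the threshold lowers the false-positive rate and the true-positive rate together, which is the cost described above.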
The notion of thresholds is well understood by the law because they correspond to standards of proof. The “beyond a reasonable doubt” standard in criminal cases is designed to minimize the number of false positives (convictions of the innocent), recognizing that this increases the number of false negatives (acquittals of the guilty; In re Winship, 1970). In contrast, dependency cases, in which parents risk losing custody of their children because of abuse, require a lower standard of proof. This is partly because the state prioritizes the protection of abused children, who may face immediate danger if a court fails to find that abuse occurred (Santosky v. Kramer, 1982). In other words, the standards of proof in dependency cases are more lenient than in criminal cases because of greater concern over the dangers of false negatives.
Failures to Recognize the Two Types of Error
The Supreme Court has instructed courts assessing the admissibility of expert opinion to consider the “rate of error,” rather than the “rates” of error (Daubert v. Merrell Dow Pharmaceuticals, Inc., 1993), and this is symptomatic of a common confusion. When evaluating expert testimony, or indeed any type of evidence, one needs to know both the false-positive rate and the false-negative rate. False-positive rates are subject to the greatest confusion. A false positive occurs when one finds evidence implicating an innocent person. The false-positive rate tells us the likelihood of the evidence, given that a person is innocent. But courts and jurors will often interpret false-positive rates as the likelihood that the person is innocent, given the evidence. Confusing the likelihood of evidence given innocence with the likelihood of innocence given evidence is an example of the inverse error (Koehler, 2011). True-positive rates are also sometimes misunderstood. Courts have dismissed symptoms of abuse as irrelevant when they are uncommon among abused children, without considering their frequency among non-abused children. Here again the inverse error is at least partly at fault: the likelihood of the evidence given abuse is confused with the likelihood of abuse given evidence (Lyon & Koehler, 1996). The inverse error is partly due to simple linguistic confusion. But it is particularly likely to occur when the courts are confronted with research findings. Researchers comparing abused to non-abused children start with the truth and assess evidence, whereas legal decision makers start with evidence and attempt to discern the truth.
An example of the inverse error comes from the oral argument in the New Jersey Supreme Court’s J.L.G. decision, which limited expert testimony explaining to juries why abused children often recant sexual abuse (New Jersey v. J.L.G., 2018). The prosecution had cited a study finding that recantations in substantiated cases of sexual abuse are more common when the child accused a father figure and when the mother was unsupportive (Malloy et al., 2007). The overall rate of recantation in that study was 23%. One of the judges asked whether that percentage meant that in the majority of cases, recantations were true, in other words, that the child had not in fact been abused. The judge was committing the inverse error: He was confusing the likelihood that recantations occurred, given abuse (23%), with the likelihood that abuse occurred, given a recantation.
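A short Python sketch using Bayes’ theorem shows why the 23% figure cannot answer the judge’s question on its own. Only the 0.23 comes from the study as described above. The recantation rates among non-abused children and the prior probability of abuse are hypothetical values that I have supplied for illustration, and the function name is likewise my own.

```python
def probability_of_abuse_given_evidence(
    p_evidence_given_abuse: float,
    p_evidence_given_no_abuse: float,
    prior_abuse: float,
) -> float:
    """Bayes' theorem: P(abuse | evidence) from two conditional rates and a prior."""
    numerator = p_evidence_given_abuse * prior_abuse
    denominator = numerator + p_evidence_given_no_abuse * (1.0 - prior_abuse)
    return numerator / denominator


RECANT_GIVEN_ABUSE = 0.23  # overall rate reported in the text (Malloy et al., 2007)
ASSUMED_PRIOR = 0.80       # hypothetical prior probability of abuse, not from any study

if __name__ == "__main__":
    # Hypothetical recantation rates among children who were not abused.
    for recant_given_no_abuse in (0.23, 0.46, 0.92):
        posterior = probability_of_abuse_given_evidence(
            RECANT_GIVEN_ABUSE, recant_given_no_abuse, ASSUMED_PRIOR)
        print(f"P(recant | no abuse) = {recant_given_no_abuse:.2f} -> "
              f"P(abuse | recant) = {posterior:.2f}")
```

The likelihood of abuse given a recantation depends on two quantities the study does not supply: how often non-abused children recant, and the prior likelihood of abuse. If the two recantation rates happened to be equal, a recantation would leave the prior unchanged.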
Researchers are not always immune from a failure to keep both types of error in mind. For example, take the difference between recognition and recall in memory. Recognition is typically tested by yes–no questions in which the child must merely affirm or deny proffered information (e.g., “Was it Jim?”), whereas recall is typically tested by wh-questions in which the child must generate the desired information (e.g., “Who was there?”). Classically, recognition is understood to have a higher true-positive rate than recall but also a higher false-positive rate (Pear & Wyatt, 1914). In order to reduce the likelihood of false allegations, interviewing protocols urge interviewers to maximize their use of recall questions (Lamb et al., 2018).
It is commonly asserted that recall questions produce fewer errors than recognition questions. The assumption is that an “error” is synonymous with a false positive. However, children’s recall performance is especially likely to be less complete than their recognition performance (Ceci & Bruck, 1993), and incompleteness constitutes a type of false negative. Perhaps because they are invisible, omissions are overlooked. Researchers have demonstrated that recall questions are more productive than recognition questions, but this is based on the average number of details produced per question by recall compared to recognition questions (Ahern et al., 2018). The unanswered issue is what to do about particular details that children omit in their answers to recall questions. For example, in laboratory research with children, recognition questions uncover a substantial percentage of true disclosures about children’s and adults’ transgressions that recall questions fail to elicit (Ahern et al., 2016).
When I first started thinking about this problem, my initial inclination was to defend the continued use of recognition questions because they increase true positives (Lyon, 1995). However, a subtler appreciation of the tradeoffs between recognition and recall has led to improved research and new insights. Early research on children’s recall typically asked children a single recall question, with little or no additional prompting (Saywitz et al., 1991). Subsequent research has shown how children’s recall of details about abuse can be enhanced through narrative practice, in which the interviewer asks questions about innocuous events early in the interview (Sternberg et al., 1997). Most recently, research has identified recall questions aimed at addressing the types of details frequently left out of children’s abuse narratives, such as emotional and bodily reactions, conversations (including child disclosures), and specific details of abusive touch. Because they rely on recall memory, and do not suggest specific details, these questions minimize both types of error: They reduce false negatives without increasing false positives (Henderson et al., 2023). In other words, they increase true details without increasing false details.
A second insight is that false-negative errors look different in response to recall and recognition questions. As noted, false-negative errors in recall are omissions, whereas false-negative errors in recognition are almost always overt denials. If a disclosure ultimately occurs, it is easier to explain an earlier omission than an overt denial. Hence, the move to ask recognition questions in order to uncover transgressions comes at a cost in cases in which reluctant children only gradually disclose, because their initial denial can undermine the credibility of subsequent disclosures.
Third, the assumption that the primary danger of recognition questions is in producing high rates of false positives has been challenged. Research has found that only the youngest children consistently exhibit yes-biases (Fritzley & Lee, 2003), and recognition questions elicit high rates of false negatives in many contexts, particularly when children misunderstand, are confused, or are reluctant (Lyon et al., 2019). This has important implications for practice. Because of the false-positive problem with recognition, interviewers have been advised to pair yes–no questions with requests for elaboration, so that children’s unthinking “yes” responses are detected (Lamb et al., 2018). However, pairing is impractical and not recommended for “no” responses, leaving recognition’s false-negative problem unaddressed.
Although I’ve focused here on the tradeoffs between recognition and recall, and on the evidentiary value of disclosures of sexual abuse, similar issues arise in other child maltreatment contexts. Worries about underreporting of child abuse (false negatives) are countered by worries about overreporting (false positives) (Piersiak et al., 2022). Child protection in many jurisdictions veers between highly publicized cases of children dying in abusive homes under the government’s supervision (false negatives) and unnecessary removal of children from falsely accused parents (false positives). Practitioners must take account of the tradeoff between false positives and false negatives at every stage of child abuse investigation and intervention. Moreover, the tradeoffs occur not just with respect to decision making but also with respect to the proper interpretation of evidence influencing those decisions. A good heuristic for thinking about any recommended practice is to ask what type of error is being addressed and how the practice will affect the other type of error.
A final point concerns the value judgments in converting research into policy. Researchers often act as if they are neutral with respect to tradeoffs between false positives and false negatives. However, when they advocate for the adoption of policies that decrease false positives without considering the effect on false negatives, they are implicitly adopting the value judgment that false negatives deserve zero weight. Even the strongest proponents of heightened standards of proof in criminal cases acknowledge false acquittals are a concern, save those who argue that criminal courts should be abolished. Moreover, child abuse allegations are made in many different forums, most of which have adopted more lenient standards of proof in order to ensure that true victims are not overlooked. The only value-free approach is to search for solutions that minimize both false positives and false negatives.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Preparation of this article was supported in part by NICHD Grant HD101617.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
- Ahern EC, Andrews SJ, Stolzenberg SN, & Lyon TD (2018). The productivity of wh-prompts in child forensic interviews. Journal of Interpersonal Violence, 33(13), 2007–2015. 10.1177/0886260515621084
- Ahern EC, Stolzenberg SN, McWilliams K, & Lyon TD (2016). The effects of secret instructions and yes/no questions on maltreated and non-maltreated children’s reports of a minor transgression. Behavioral Sciences and the Law, 34(6), 784–802. 10.1002/bsl.2277
- Ceci SJ, & Bruck M (1993). Suggestibility of the child witness: A historical review and synthesis. Psychological Bulletin, 113(3), 403–439. 10.1037/0033-2909.113.3.403
- Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993).
- Fritzley VH, & Lee K (2003). Do young children always say yes to yes–no questions? A metadevelopmental study of the affirmation bias. Child Development, 74(5), 1297–1313. 10.1111/1467-8624.00608
- Henderson HM, Lundon GM, & Lyon TD (2023). Suppositional wh-questions about perceptions, conversations, and actions are more productive than paired yes-no questions when questioning maltreated children. Child Maltreatment, 28(1), 55–65. 10.1177/10775595211067208
- In re Winship, 397 U.S. 358 (1970).
- Koehler JJ (2011). Misperceptions about statistics and statistical evidence. In Wiener RL, & Bornstein BH (Eds.), Handbook of trial consulting (pp. 121–133). Springer.
- Lamb ME, Brown DA, Hershkowitz I, Orbach Y, & Esplin PW (2018). Tell me what happened (2nd ed.). Wiley.
- Lyon TD (1995). False allegations and false denials in child sexual abuse. Psychology, Public Policy, and Law, 1(2), 429–437. 10.1037/1076-8971.1.2.429
- Lyon TD, & Koehler JJ (1996). The relevance ratio: Evaluating the probative value of expert testimony in child sexual abuse cases. Cornell Law Review, 82(1), 43–78.
- Lyon TD, McWilliams K, & Williams S (2019). Child witnesses. In Brewer N, & Douglass AB (Eds.), Psychological science and the law (pp. 157–181). Guilford.
- Malloy LC, Lyon TD, & Quas JA (2007). Filial dependency and recantation of child sexual abuse allegations. Journal of the American Academy of Child and Adolescent Psychiatry, 46(2), 162–170. 10.1097/01.chi.0000246067.77953.f7
- Mueller C, Kirkpatrick L, & Richter L (2018). Evidence (6th ed.). Wolters Kluwer.
- New Jersey v. J.L.G., 190 A.3d 442 (2018). Oral argument video at: https://www.njcourts.gov/attorneys/opinions/supreme
- Pear TH, & Wyatt S (1914). The testimony of normal and mentally defective children. British Journal of Psychology, 6(3–4), 387–419.
- Piersiak HA, Levi BH, & Humphreys KL (2022). Statutory threshold wording is associated with child maltreatment reporting. Child Maltreatment. 10.1177/10775595221092961
- Santosky v. Kramer, 455 U.S. 745 (1982).
- Saywitz KJ, Goodman GS, Nicholas E, & Moan SF (1991). Children’s memories of a physical examination involving genital touch: Implications for reports of child sexual abuse. Journal of Consulting and Clinical Psychology, 59(5), 682–691. 10.1037//0022-006x.59.5.682
- Sternberg KJ, Lamb ME, Hershkowitz I, Yudilevitch L, Orbach Y, Esplin PW, & Hovav M (1997). Effects of introductory style on children’s abilities to describe experiences of sexual abuse. Child Abuse and Neglect, 21(11), 1133–1146. 10.1016/s0145-2134(97)00071-9
- Wood JM (1996). Weighing evidence in sexual abuse evaluations: An introduction to Bayes’s Theorem. Child Maltreatment, 1(1), 25–36. 10.1177/1077559596001001004
