ABSTRACT
This study describes a novel measure of children's Theory of Mind (ToM) development—called the Comprehensive Assessment of ToM (CAT)—that addresses limitations in existing ToM measures. This behavioral measure includes three to six items each for diverse desires, diverse beliefs, knowledge access, knowledge expertise, false belief, and visual perspective taking, as well as nonsocial representational reasoning (i.e., false‐sign). All items include a prediction, explanation, and general comprehension question. The measure is psychometrically valid and robust in 3‐ to 8‐year‐old children (n = 206; 104 boys; 101 girls; 1 gender‐fluid child; 37.7% White non‐Hispanic). Children's performance replicates prior findings with the commonly used Wellman and Liu (2004) ToM scale, but also reveals a novel and nuanced pattern of mental‐state scaling over early to middle childhood.
Keywords: children, measurement, theory of mind
Theory of mind (ToM) describes the understanding that internal mental states such as desires, knowledge, and beliefs guide human behavior in the real world. Decades of research reveal that, over early to middle childhood, children improve in their ability to explicitly reason about mental states as person‐specific representations of the world that can be false. ToM as a field was revolutionized by Wellman and Liu's (2004) creation of a ToM “scale” showing that children's ToM development appears to progress through a series of mental‐state concepts wherein children around the world first reason accurately about desires, then about beliefs and knowledge, then about false belief, and then about hidden emotions (Liu et al. 2008; Peterson et al. 2012; Shahaeian et al. 2011; Wellman and Liu 2004; Wellman et al. 2006, 2011).
Wellman and Liu's (2004) ToM scale is a “multi‐mental‐state” ToM measure that assesses 3‐ to 5‐year‐old children's explicit understanding of multiple mental‐state constructs. The use of this ToM scale dominates ToM research in children: In a review of ToM measures by Beaudoin et al. (2020), nearly two‐thirds (65%) of studies employed Wellman and Liu's ToM scale, whereas alternative measures (e.g., ToM storybooks, Blijd‐Hoogewys et al. 2008; the ToM test, Muris et al. 1999; and 11 other ToM measures) were split across the remaining 35% of studies. The scale is so widely used in part because it has revealed much about how explicit understanding of multiple mental‐state concepts develops over early childhood; as noted above, the characterization of a “progression” of mental‐state concepts exists largely because of extensive research with this scale (see e.g., Shahaeian et al. 2011; Wellman et al. 2011). The Wellman and Liu (2004) scale has also been used in studies attempting to find more nuanced individual differences (by creating a sum score of item performance, e.g., Bowman and Brandone 2024).
However, despite the incredible value the Wellman and Liu (2004) scale brings to the study of ToM, it is limited in its ability to reveal a full and nuanced characterization of ToM and its development. These limitations raise questions about our current understanding of how ToM develops over childhood. Indeed, with over 20 years of disproportionate reliance on this single measure, our characterization of ToM development could be heavily influenced by the specifics of the measure itself, some of which may reflect task demands unique to specific items rather than robust conceptual understanding. New behavioral ToM measures are needed that assess multiple mental‐state concepts across multiple items, with better‐equated formats and test questions across all items, to provide a novel re‐evaluation of the progression of mental‐state concept development.
1. Assessing Multiple Mental‐State Concepts With Multiple Items Per Concept
The vast majority of children's behavioral ToM assessments include only one or two items to represent understanding of a given mental‐state concept (though see Gweon et al. 2012; Richardson et al. 2018 for exceptions). Throughout this paper, we use the term “item” to describe a measurement observation (e.g., children's score on a single story assessing mental‐state understanding) that yields the final score incorporated into analyses; an item score may thus reflect children's answers to an item's “test” questions (e.g., predict a character's actions based on their mental state) as well as “control” questions assessing general story comprehension. The commonly used Wellman and Liu (2004) scale used a single item to test each mental‐state concept. A single item to index a cognitive construct has high measurement error, whereas adding more items decreases measurement error (Spearman 1904). In contrast to ToM, robust measures of other domains of cognition include multiple items or trials (e.g., executive function measures; Carlson and Moses 2001), including measures used with young infants (e.g., the IOWA attentional cueing task; Ross‐Sheehy et al. 2015). Given the common vignette or storybook style of mental‐state reasoning tasks (in which initial context is provided to enable subsequent testing of mental‐state reasoning), multiple items that present multiple contexts may also help reveal a more generalizable measure of children's conceptual understanding that is unrelated to item‐specific task demands. Multiple items would also offer superior assessment of individual differences (which can be related to other behavioral, neurological, or physiological measures) to reveal more nuanced developmental patterns.
2. Using Standardized Question Formats That Ask Children to Predict as well as Explain, and That Systematically Test Understanding in Multiple Contexts
Existing multi‐mental‐state measures of ToM, including the Wellman and Liu (2004) scale and others (e.g., Gweon et al. 2012; Richardson et al. 2018), are also limited: although story context and general task demands vary little among items within a given mental‐state category, they often vary widely across mental‐state categories. That is, the way that items test reasoning about one category of mental‐state concept (e.g., desires) can consistently differ from the way that items test reasoning about another category (e.g., false beliefs). For example, some items have lengthy story set‐ups with multiple control questions, whereas others are relatively straightforward. Moreover, not all items have control or comprehension‐check questions, and those that do can vary in when the control question is presented within the story (e.g., either before or after the child predicts a character's mental state). This wide variability in item format across mental‐state concepts may create a confound in mental‐state concept assessment, especially when examining how mental‐state understanding develops or unfolds from easiest to most challenging. That is, it is possible that the order of difficulty for different mental‐state items within existing scales in part reflects general task demands.
As noted above, systematic testing of mental‐state concepts that uses similar question formats across multiple contexts can provide a more generalizable measure of an underlying ToM construct. Likewise, systematically varying the context (while keeping item formats similar) could reveal nuances in ToM development. Of particular note, Wellman and Liu (2004) constrained their testing of diverse desire and diverse belief concepts to “participant‐versus‐character” contexts in which items ask children about their own mental state versus another character's (e.g., the child likes cookies but Mr. Jones likes carrots, which snack will Mr. Jones choose?). However, research shows that reasoning about one's own epistemic mental states develops earlier than reasoning about other people's (Gonzales et al. 2018). Thus, it is possible that “character‐versus‐character” items asking children to reason about two other characters' contrasting perspectives (e.g., Mr. Jones likes carrots but Mrs. Wong likes cookies) may be more difficult than items asking about children's own mental states as diverse from another's. While these “character‐versus‐character” types of diverse desire and belief questions have been used in existing research (e.g., Bowman et al. 2015), they have not been scaled alongside the commonly used “participant‐versus‐character” items in the Wellman and Liu scale. Relatedly, two types of false‐belief items are commonly used in the field: unexpected contents false belief (involving a box with misleading contents such as a crayon box containing marbles; e.g., Gopnik and Astington 1988) and location‐change false belief (involving a central object changing from one location to another; e.g., Wimmer and Perner 1983). However, the Wellman and Liu scale only includes a single unexpected contents false‐belief item, leaving an open question of whether and how location‐change items may scale in difficulty relative to false belief and other mental‐state concepts.
Additionally, Wellman and Liu (2004) constrained their scaling to “prediction” items that require participants to predict a character's mental state or their behavior based on their mental state (e.g., “where will Sally look for her marble?”; Wimmer and Perner 1983). More recent ToM measures include “explanation” questions that ask why a character behaves a certain way (see e.g., Gweon et al. 2012; Richardson et al. 2018; Osterhaus et al. 2016, 2022). These explanation questions potentially tap a richer understanding of children's mental‐state reasoning than prediction questions, which are typically presented in a constrained two‐alternative forced‐choice format (e.g., “will Sally look in the box or in the basket?”). Explanation questions therefore reduce the risk of lucky guesses compared to two‐alternative forced‐choice questions (which have a 50% chance of being answered correctly). The inclusion of explanation questions is more common in ToM assessments of older children (Gweon et al. 2012; Osterhaus et al. 2016, 2022; Richardson et al. 2018). However, explanation questions are rarely included in ToM measures designed for preschoolers, and are not part of the commonly used Wellman and Liu (2004) scale. It is possible that the inclusion of explanation questions will reveal different patterns in preschool children's progression of mental‐state concepts.
3. Adding Assessments of Other ToM‐Relevant Concepts Such as Knowledge Expertise, Visual Perspective Taking, Nonmental Representation Reasoning, and True Belief
The commonly used Wellman and Liu (2004) scale includes items testing understanding of diverse desires, diverse beliefs, knowledge access, false beliefs, and hidden emotions. There are multiple mental‐state concepts in addition to these commonly studied concepts that have also received considerable attention in the broader field of ToM research, such as understanding of intentions and goals (e.g., Woodward 2009), pretense (e.g., Carlson et al. 2004), knowledge expertise (e.g., Lane et al. 2014), visual perspective taking (e.g., Flavell 2013), faux pas (e.g., Happé 1994), and sarcasm (e.g., Happé 1994). Additionally, several studies investigate understanding of mental‐state concepts as they relate to moral reasoning (e.g., see Lagattuta and Weller 2013 for review) or emotions (e.g., “affective ToM”; Gallant et al. 2020). The more mental‐state concepts a measure assesses, the more comprehensive a characterization of ToM development it offers. However, there are considerations for the number and type of concepts assessed in a single measure, such as whether the concepts can be tested with the same item format (e.g., vignette/story format versus behavioral coding of actions that reflect mental‐state understanding, e.g., Carlson et al. 2004), whether they similarly measure the same underlying construct (e.g., moral ToM versus affective ToM versus cognitive ToM; see e.g., Schurz et al. 2014), whether the concepts are developmentally appropriate for the target age range of participants, and whether the measure is feasible to administer (e.g., given a lengthy duration with more items). Each of these considerations can limit the number of concepts a given measure includes.
Here, we highlight four additional concepts that we view as particularly relevant because they have one or more of the following qualities: (1) they show variability in the highly studied early childhood age range; (2) they are closely related to the widely tested concepts in the Wellman and Liu scale; (3) they have been tested with the most commonly used vignette or story style item format; and (4) they can be broadly categorized as “cognitive” ToM concepts (similar to the majority of concepts in the original Wellman and Liu scale). Specifically, we highlight knowledge expertise, visual perspective taking, false‐sign concepts, and true belief.
First, while studies commonly include the “knowledge access” item from the Wellman and Liu scale testing children's understanding that people need to see inside a container to know its contents, few studies have assessed understanding of relative knowledge or “knowledge expertise” (e.g., understanding that a pilot should know more about planes than a chef; see e.g., Lane et al. 2014). Understanding of knowledge expertise is variable in 3.5‐ and 6.5‐year‐olds (Lane et al. 2014). This kind of knowledge understanding may be more or less difficult than knowledge access, or may follow a different developmental trajectory. However, knowledge expertise has never been scaled alongside other mental‐state concepts.
A second relevant concept that has been less explored in relation to the other mental states in the Wellman and Liu (2004) scale is visual perspective taking (VPT): the ability to understand that one person might see something that another person cannot (level 1 perspective taking) and that two people may view the same thing differently (level 2 perspective taking) (Masangkay et al. 1974). More recently developed multi‐mental‐state ToM measures (e.g., Gweon et al. 2012; Richardson et al. 2018) include VPT items, but VPT has never been scaled alongside other mental states and is not part of the Wellman and Liu scale. Early ToM research showed that children develop a rudimentary (level 1) understanding of visual perspective in toddlerhood and preschool (Masangkay et al. 1974; Carlson et al. 2004), whereas understanding of more complex (level 2) perspective taking continues to develop throughout preschool and middle childhood (e.g., Gweon et al. 2012; Richardson et al. 2018) and even shows variability in adulthood (e.g., Todd et al. 2017). More research is needed to understand how this ToM‐relevant concept connects with the development of other mental states.
Third, multi‐mental‐state measures of ToM rarely include assessment of nonmental representation reasoning, such as reasoning about a photograph or a sign (which represents the world and can be false or outdated, but is not a mental‐state construct per se) (Zaitchik 1990). There are both theoretical (e.g., theory‐theory; Gopnik and Wellman 1992) and empirical (Iao et al. 2011; Leekam et al. 2008; Sabbagh et al. 2006) connections between false‐sign and false‐photograph reasoning and ToM (particularly false belief). For example, the false‐sign task follows the same story structure as a false‐belief task except that instead of a character's belief about an object's location becoming false (e.g., because someone moved the object in the character's absence), a pictorial representation (e.g., a sticker, an arrow) changes so that it no longer reflects reality (e.g., it used to point to the ice cream stand, but the wind blew the sign to point elsewhere). Thus, false sign is similar to false belief in that both tasks assess understanding of representational change and false representations (Sabbagh et al. 2006), and both draw on domain‐general abilities (e.g., vocabulary and executive function; Iao et al. 2011). Children's performance on false‐sign tasks is variable over 3–6 years (Iao et al. 2011; Leekam et al. 2008; Sabbagh et al. 2006). However, children's performance on false‐sign or false‐photograph tasks has not been scaled alongside other mental‐state concepts. Examining the difficulty scaling of false‐sign items in relation to other representational mental‐state items (e.g., false belief) can shed light on developmental relations between mental and nonmental representational understanding (e.g., whether nonmental representational understanding is achieved before or after mental‐state reasoning, or whether the two progress in parallel).
Such insight is valuable given some existing research suggesting that mental‐state reasoning may develop out of, or become increasingly separate from, domain‐general representational understanding as mental‐state reasoning improves (see e.g., Richardson et al. 2018; Bowman and Brandone 2024). Including false sign may also help clarify how these types of representations relate to more advanced ToM task performance that shows variability beyond the preschool years, such as true‐belief understanding and level 2 perspective taking. Furthermore, including false‐sign items in ToM measures and using these measures in brain‐behavior examinations may help clarify the specificity of neural activity typically associated with reasoning about mental states (e.g., in the temporoparietal junction; Bowman et al. 2015; Gweon et al. 2012). To begin to address these open questions, more research is needed that empirically examines children's performance on nonmental representational reasoning compared to mental‐state reasoning and that includes nonmental representational items in ToM measures.
Fourth, there is evidence that true‐belief understanding—an understanding that an individual's belief can accurately reflect reality—develops later than false belief, despite assumptions that true‐belief reasoning is less cognitively demanding (e.g., Fabricius et al. 2010). Interestingly, while children begin to pass true‐belief tasks in preschool, 5‐ to 7‐year‐old children often fail true‐belief tasks after more consistently passing false‐belief tasks. This pattern may suggest that children of this age still do not understand under what conditions a person's belief would reflect reality, and therefore have not yet developed a complete understanding of representational ToM (Fabricius et al. 2010; Riggs and Simpson 2005). Thus, inclusion of true‐belief items may be valuable in ToM measures to facilitate assessment of a wider age range of participants and to reveal more nuances in the progression of representational ToM. The addition of true‐belief items also complements the assessment of false belief and diverse belief, creating a more complete assessment of belief understanding in the multiple ways that beliefs can be conceptualized (e.g., as diverse, false, and true), and could reveal the order in which these different aspects of belief understanding develop.
We note that some studies have expanded the original Wellman and Liu (2004) ToM scale by adding a sarcasm item (i.e., Peterson et al. 2012), while others have scaled “advanced” ToM concepts beyond those in the Wellman and Liu scale (e.g., first‐, second‐, and third‐order false belief, morally relevant false belief, faux pas, strange stories; Osterhaus et al. 2016, 2022). However, the limitations of few or singular items to represent a given mental‐state concept and of widely varying item formats across mental‐state categories (as discussed above) apply to these more recent measures and analyses. And no study has comprehensively scaled desires, beliefs, knowledge access, knowledge expertise, false belief, VPT, and nonmental representation reasoning altogether. Thus, open questions remain about how VPT and knowledge expertise develop in relation to other concepts included in the Wellman and Liu scale, and how mental‐state concepts more generally scale in relation to representational reasoning.
4. The Present Study
The present study aimed to address several limitations in existing ToM measures by developing a novel measure of children's ToM called the Comprehensive Assessment of ToM (CAT). The CAT importantly expands the original Wellman and Liu (2004) scale in several ways, and offers novel insight into children's progression of mental‐state concepts.
We administered the CAT to children 3‐ to 8‐years‐old to shed light on children's progression of mental‐state concepts throughout and beyond preschool. Critically, in addition to assessing commonly examined mental‐state concepts including diverse desires, diverse beliefs, knowledge access, and false belief (all included in the Wellman and Liu 2004 scale), the CAT also assesses children's understanding of knowledge expertise, VPT, false signs, and true beliefs. As noted above, these additional concepts were particularly useful to include given that they closely relate to the highly studied Wellman and Liu scale, show variability in our target age range, and have been tested in the common vignette‐style item format, enabling consistency across all items in the measure. Additional items examining reasoning about intentions and goals, pretense, faux pas, sarcasm, affective ToM, or moral ToM were not commonly tested in the vignette format, were unlikely to show variability in our target age range of preschool‐ to school‐age children, and/or tested underlying constructs possibly different from our target “cognitive” ToM construct captured in the majority of items in the Wellman and Liu scale (Apperly and Butterfill 2009; Schurz et al. 2014). Thus, to facilitate feasibility of administration and to reasonably constrain our measure, we did not include items testing these additional categories. However, in the discussion section, we describe the potential utility of developing additional items in these and other mental‐state categories following the methods and approach outlined in this paper.
Additionally, the CAT includes three to six items per category of mental‐state concept and false sign to evaluate whether the expected progression of mental‐state concept difficulty holds across multiple items that vary in story context. Our items consist of custom images and scripts built to closely match existing tasks in the field (see Appendix S2), modified so that item format and questions remain similar and consistent throughout the measure. In some cases, item images were taken directly from existing measures (i.e., Lane et al. 2014; Richardson et al. 2018). Importantly, each item consists of a prediction question as well as an open‐ended explanation question, and includes a control question assessing general story comprehension. Across all items, the general structure of questioning is matched such that participants are first told a story and then asked the prediction, explanation, and control questions in that order, to better equate the testing experience across items and thus better isolate assessment of item difficulty as a function of varying mental‐state content. We also systematically vary contexts to evaluate whether context influences item difficulty (e.g., participant‐versus‐character compared to character‐versus‐character mental‐state reasoning; false contents versus false locations).
We conduct psychometric analyses of the CAT in an item response theory framework (IRT; see Embretson and Reise 2000) in a large sample of children to assess the reliability of the CAT and to examine the relative difficulty of different mental‐state concepts and false sign. While our particular scaling analysis differs from that of Wellman and Liu (2004), who used Guttman scaling (Guttman 1944) to create a hierarchy of single items, our IRT analysis affords examination of the possibility that items do not scale hierarchically—namely, that mental‐state constructs may develop alongside each other rather than constraining analyses to the notion that children must first understand one concept (e.g., knowledge access) before they understand another (e.g., false belief). Importantly, with IRT analysis of the CAT, we can rank order the difficulty of each of the measure's items within each category of mental state and false sign to reveal a “scaling” of mental states (similar to the approach of Osterhaus et al. 2022) that can be compared with the scaling hierarchy revealed by Wellman and colleagues.
We also measure children's executive functioning and vocabulary using standard assessments to evaluate the extent to which these domain general skills are related to CAT performance. Additionally, as a measure of validity, we examine correlations between children's CAT performance and a multi‐item parent‐report measure of children's ToM (the Theory of Mind Inventory‐2, TOMI2; Hutchins et al. 2014) that has been validated in multiple samples (e.g., Greenslade and Coggins 2016; Hutchins et al. 2012).
In sum, the present study introduces a novel, more comprehensive multi‐mental‐state assessment of ToM tested in a large sample to provide a more nuanced analysis of mental‐state development in preschool‐ to school‐aged children.
1. Methods
1.1. Participants
A large sample (n = 223) of typically developing 3‐ to 8‐year‐old children was tested remotely via the online conference software Zoom. The study design plan, sampling plan, and data collection stopping rule were all preregistered on Open Science Framework (OSF) at https://osf.io/4d9ar/. Participants were recruited from a database of families living in the [BLINDED] region who were willing to participate in research and were assessed between August 2021 and September 2022. Participants were compensated with a $15 gift card, and parents gave informed consent prior to participation. The Institutional Review Board approved all study methods. Seventeen participants were excluded from the final sample: 11 did not meet a minimum vocabulary score required to understand and respond to the ToM items, 4 were excluded due to technical issues with presenting stimuli (e.g., internet speed), 1 due to technical errors in the audio‐visual recording used for later data scoring, and 2 because they did not complete the study. We recruited an approximately even number of boys and girls in each age group. The final sample consisted of 206 children: 26 3‐year‐olds (10 girls, 16 boys, Mage = 3.67 years, SD = 0.20), 34 4‐year‐olds (17 girls, 17 boys, Mage = 4.54 years, SD = 0.27), 36 5‐year‐olds (17 girls, 19 boys, Mage = 5.54 years, SD = 0.19), 35 6‐year‐olds (17 girls, 18 boys, Mage = 6.63 years, SD = 0.15), 35 7‐year‐olds (18 girls, 16 boys, and 1 gender‐fluid child, Mage = 7.60 years, SD = 0.15), and 40 8‐year‐olds (21 girls, 19 boys, Mage = 8.65 years, SD = 0.11). Participants' demographic makeup reflected the diversity of the community from which they were drawn. Figure S1 (Appendix S1) shows the race and ethnicity of the final sample, which was 37.7% White, non‐Hispanic. Median annual family income was $100,000 and greater (n = 100), and the range extended as low as $15–24k.
The median educational attainment of the child's primary caregiver was a 4‐year college degree (n = 49), and ranged from 8th grade to Ph.D. or M.D. See Appendix S1 for more details.
1.2. Measures
1.2.1. Comprehensive Assessment of Theory of Mind (CAT)
The CAT consists of a series of story‐based items requiring participants to reason about multiple mental states, as well as nonmental representation items about false signs, developed in PowerPoint. The task can be administered by an experimenter remotely (e.g., via online conference platforms), as was done in the present study (see Procedure below), or in person in a laboratory or home setting (as of the writing of this manuscript, we have tested 31 neurotypical children ages 3–5 years with the CAT in person in the lab). There were three to six items within each category of diverse desires, diverse beliefs, knowledge access, knowledge expertise, visual perspective taking (VPT; levels 1 and 2), true beliefs, false beliefs, and false sign. For each item, participants viewed an animated story and were asked three questions of different types, always in the same order. All items were adapted from existing tasks designed to measure the target category. In some cases, visual stimuli were taken directly from existing tasks (e.g., Gweon et al. 2012; Richardson et al. 2018; Lane et al. 2014).
The structure for each item was as follows. After hearing the initial story set‐up, participants were asked a prediction question (in which the participant had to predict a character's behavior or mental state). Next, the story continued and depicted a target character acting in accordance with their mental state. All participants (regardless of whether their prediction was correct) were then asked an explanation question wherein they were asked to explain why the target character behaved the way they did. After giving their explanation, children were then asked a control question about relevant story information to check that children understood the critical story content. Children received 1 point for a correct prediction and another point for a correct explanation (possible range per item = 0–2). If a participant failed the control question, they received a score of 0 for the item overall, following established scoring practices of similar story‐based ToM tasks (Peterson et al. 2012; Wellman and Liu 2004; Wellman et al. 2011; Osterhaus et al. 2022). Stimuli and coding scheme are available at [BLINDED]. Item descriptions and example items are in Appendices S2 and S3.
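The scoring rule just described is simple enough to express in a few lines. The sketch below is a minimal illustration only; the function name and boolean inputs are ours, not part of the CAT materials, and in the study itself responses were scored by coders from recordings.

```python
def score_item(prediction_correct: bool, explanation_correct: bool,
               control_correct: bool) -> int:
    """Score one CAT-style item on a 0-2 scale: 1 point for a correct
    prediction, 1 point for a correct explanation, but 0 overall if the
    control (comprehension) question is failed."""
    if not control_correct:
        return 0  # a failed comprehension check zeroes the item
    return int(prediction_correct) + int(explanation_correct)
```

This mirrors the scoring convention of similar story‐based ToM tasks, where comprehension failure invalidates the item rather than merely lowering its score.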
Items were administered across three sets (Set A, Set B, and Set C), counterbalanced across participants. Stimuli and sets were designed to be equated and counterbalanced across salient features (e.g., whether the correct image was presented first and whether it appeared on the left or right side of the screen). The number of items from each mental‐state and false‐sign category was equated across sets and pseudorandomized such that no two items assessing the same category were presented consecutively. See Appendix S4 for full details on counterbalancing. Each CAT set takes approximately 15 min to administer, for a total of approximately 45 min for all three sets. Set administration can be spread across a single testing session or multiple testing sessions to reduce participant fatigue. See Appendix S6 for further procedural details.
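The pseudorandomization constraint above (no two same‐category items back to back) is easy to verify programmatically. A minimal sketch, assuming one category label per item in presentation order (the function name and labels are hypothetical, not from the CAT materials):

```python
def no_adjacent_repeats(categories: list[str]) -> bool:
    """Return True if no two consecutive items share a category, i.e.,
    the ordering satisfies the pseudorandomization constraint."""
    return all(a != b for a, b in zip(categories, categories[1:]))

# e.g., no_adjacent_repeats(["DD", "FB", "DD", "KA"]) is True,
# while no_adjacent_repeats(["DD", "DD", "FB"]) is False.
```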
1.2.2. Executive Function (EF), Vocabulary, and Parent‐Report Theory of Mind Measures
Standard EF and vocabulary measures were administered. For each measure, higher scores indicate greater ability. See Appendix S5 for details. In brief, two standard conflict‐inhibition Stroop tasks—Grass/Sky (Carlson and Moses 2001) and Up/Down (Kramer et al. 2015)—assessed participants' ability to inhibit a dominant response to produce a nondominant one (e.g., point to the green card when the experimenter says “sky”; say “up” when shown a downward arrow). The verbal subscale of the Kaufman Brief Intelligence Test 2 (KBIT‐2; Kaufman and Kaufman 2004) assessed children's vocabulary: children selected 1 of 6 pictures that represented a word prompt. Grass/Sky and the KBIT‐2 were administered via a web‐based platform (Gorilla; Anwyl‐Irvine et al. 2020) in which children clicked directly on their response (or pointed to the picture and parents clicked). In Up/Down, children responded verbally to images presented by the experimenter via PowerPoint.
Parent‐report ToM was measured with the Theory of Mind Inventory‐2 (TOMI‐2; Hutchins et al. 2014), in which parents rated whether each of 60 statements was similar or dissimilar to their child (see Appendix S5).
1.3. Procedure
All participants interacted with an experimenter over the web‐based conference platform Zoom. Children sat in front of a computer or tablet (six children sat in front of a phone). See Appendix S6 for further procedural details.
2. Results
Analyses were conducted in R (Version 3.6.2; R Core Team 2019). Item Response Theory (IRT) models were fit using the ltm (Version 1.1‐1; Rizopoulos 2006) and mirt (Version 1.37.1; Chalmers 2012) packages. As noted above, the sampling plan and data collection stopping rule were preregistered on OSF ([BLINDED]). Our final sample of 206 met minimum sample size recommendations for latent factor models, such as IRT (Hoogland and Boomsma 1998; Boomsma and Hoogland 2001; Kline 2015). Criteria for acceptable model fit were preregistered on OSF.
2.1. Preliminary Analyses: Internal Consistency, Relations With Age, and Validity of the CAT
As described in the focal analyses below, all CAT items had good psychometric properties except true‐belief items, which had the lowest internal consistency (α = 0.55) and poor discrimination (an IRT parameter describing how well an item distinguishes participants with different levels of the target construct). Thus, true‐belief items were excluded from further analyses.
Next, we examined internal consistency for the ToM measure as a whole (with true‐belief items removed), and within each mental state and false‐sign subscale (i.e., the items within a given mental state or false‐sign category). Internal consistency was high for the measure (α = 0.96) and subscales (α ranged from 0.76 to 0.88), see Appendix S7, Table S1.
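Internal consistency here refers to Cronbach's alpha. The study's analyses were conducted in R (see Results); purely as an illustration of the statistic, a minimal Python sketch with hypothetical item scores is:

```python
# Illustrative sketch of Cronbach's alpha (hypothetical data, not the CAT
# dataset): alpha = k/(k-1) * (1 - sum of item variances / variance of totals).

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: list of k item-score lists, each of length n (participants)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(it) for it in items)
    return k / (k - 1) * (1 - item_var / variance(totals))
```

For perfectly redundant items alpha is exactly 1; real item sets, such as the CAT subscales here, yield values below 1 (0.76 to 0.88).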
We then examined relations between children's age and their subscale performance. We observed age‐related improvements across all subscale categories, rs > 0.60, ps < 0.001, see Figure S2 (Appendix S7). We also saw differences in overall performance across subscales. We analyzed these patterns of subscale performance differences across age through a linear mixed‐effects model in which the outcome was children's score on an item (i.e., their score from 0 to 2, accounting for control questions), predicted by an interaction between mental state and age, additive effects of vocabulary and the two EF measures, and random intercepts for participant ID and item. In this model, there was a significant interaction between age and mental state, as assessed through comparison between our final model containing the interaction term and a model with only additive effects of age and mental state, χ2(16, N = 206) = 76.53, p < 0.001. Probing this effect (in line with Figure 1), diverse desires was passed at significantly earlier ages than false sign (t(30.6) = 2.32, p = 0.027) and VPT (t(30.6) = 3.77, p = 0.001). VPT was also passed later than false belief (t(30.6) = 2.38, p = 0.024) and knowledge (t(30.6) = 2.12, p = 0.042). However, only the difference between diverse desires and VPT withstood Sidak correction for multiple comparisons. We report item‐level and subscale performance means and SDs in Appendix S7, Tables S2 and S3.
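As a concrete sketch of the Sidak correction referenced above (the family size m below is hypothetical; the text does not state how many pairwise comparisons were in the family):

```python
# Sidak adjustment for m comparisons: adjusted p = 1 - (1 - p)**m.
# m = 10 here is a hypothetical family size for illustration only.

def sidak_adjust(p, m):
    return 1 - (1 - p) ** m

# With m = 10, the p = 0.001 contrast survives at alpha = .05,
# while the p = 0.027 contrast does not:
print(sidak_adjust(0.001, 10) < 0.05)  # True
print(sidak_adjust(0.027, 10) < 0.05)  # False
```

This illustrates why only the strongest contrast (diverse desires vs. VPT, p = 0.001) could withstand correction while nominally significant contrasts around p = 0.02 to 0.04 did not.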
FIGURE 1.

Bar graph illustrating age‐related improvement in mental‐state and false‐sign reasoning. Color‐coded bars represent participants' average performance across mental‐state and false‐sign subscales calculated as the average performance across items within a category of mental state or false sign. Scores range from 0 to 2, wherein 2 reflects correct responses to prediction, explanation, and control questions. Figure shows differences in subscale performance across age, particularly between desires and visual perspective taking.
Table 1 shows how children's CAT performance related to their vocabulary, EF, and parent‐report ToM in bivariate and partial correlations (controlling for age). In general, these relations with age, vocabulary, EF, and parent‐report theory of mind increase confidence in the CAT given relations are in line with prior research and theory (Devine and Hughes 2014; Milligan et al. 2007; Wellman et al. 2001; Ford et al. 2012; Austin et al. 2020; Talwar et al. 2017). We note that the two EF measures were not significantly related to one another. This lack of relation is not unexpected given similar null relations in research with this age range (e.g., Livesey et al. 2006; Poarch and van Hell 2019), and given that the two measures in our study were administered differently—Grass/Sky required clicking on a computer or tablet, and Up/Down involved a verbal response to an experimenter—resulting in different demands (e.g., motor) across the two tasks that may have contributed to their null relation. Given the lack of relation between these two tasks, we examined further relations between additional constructs and each EF measure separately.
TABLE 1.
Full and partial correlations among children's performance on the CAT, executive functioning, and vocabulary.
| | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Age | — | — | — | — | — | — | — | — | — | — | — |
| 2 | Vocab. | 0.81** | — | 0.04 | −0.03 | 0.24** | 0.22* | 0.05 | 0.12 | 0.14 | 0.05 | 0.15 |
| 3 | Div. des. | 0.60** | 0.49** | — | 0.50** | 0.49** | 0.42** | 0.68** | 0.51** | 0.07 | 0.11 | 0.20* |
| 4 | Div. bel. | 0.68** | 0.54** | 0.70** | — | 0.43** | 0.42** | 0.60** | 0.48** | 0.20 + | 0.06 | 0.19 + |
| 5 | Know. | 0.71** | 0.69** | 0.71** | 0.70** | — | 0.52** | 0.57** | 0.61** | 0.08 | 0.06 | 0.24* |
| 6 | VPT | 0.77** | 0.71** | 0.68** | 0.72** | 0.78** | — | 0.46** | 0.51** | 0.20 + | 0.04 | 0.18 + |
| 7 | False bel. | 0.67** | 0.57** | 0.81** | 0.78** | 0.78** | 0.73** | — | 0.60** | 0.15 | 0.03 | 0.18 + |
| 8 | False sign | 0.68** | 0.61** | 0.71** | 0.72** | 0.79** | 0.76** | 0.78** | — | 0.11 | 0.07 | 0.07 |
| 9 | Grass/sky | 0.40** | 0.41** | 0.29* | 0.40** | 0.34** | 0.43** | 0.38** | 0.35** | — | 0.08 | 0.01 |
| 10 | Up/down | 0.33** | 0.32** | 0.26** | 0.25** | 0.26** | 0.27** | 0.23* | 0.26** | 0.18 | — | 0.03 |
| 11 | TOMI‐2 | 0.51** | 0.49** | 0.45** | 0.48** | 0.51** | 0.50** | 0.47** | 0.54** | 0.20 + | 0.17 + | — |
Note: Bivariate correlations are shown on the lower diagonal, and partial correlations controlling for age are shown on the upper diagonal in blue shading. Pairwise deletion was used to conduct correlations when data were missing.
Abbreviations: Div. bel., diverse beliefs; Div. des., diverse desires; False bel., false belief; Know., knowledge; TOMI‐2, Theory of Mind Inventory‐2; Vocab., vocabulary; VPT, visual perspective taking.
*p < 0.05.
**p < 0.01.
***p < 0.001.
We also examined whether vocabulary and EF explained variance in ToM beyond age through a series of regression analyses. As reported in detail in Appendix S10, we fit a regression to each mental‐state construct and to false sign in which the outcome was children's average score (0–2) within a given construct, and predictors were Up/Down (average score across all trials), Grass/Sky (average score across all trials), and vocabulary (raw sum score). Given that regression implements casewise deletion if any predictor or outcome is missing, we conducted this analysis on a reduced sample size of N = 110 children who had data for all measures. In line with bivariate correlations (Table 1), age significantly predicted performance across all mental state categories and false sign (ps < 0.015). In addition, vocabulary also significantly predicted performance in knowledge, visual perspective taking, and false sign (ps < 0.024). Up/Down was significantly related to false belief (p = 0.035), and there was a nonsignificant trend in which Grass/Sky was related to diverse beliefs (p = 0.069). Thus, our vocabulary measure explained additional variance beyond age in more than one mental‐state concept, whereas the influence of EF on ToM performance beyond age was minimal to null. These results parallel research and theory discussing relations between vocabulary and ToM development (Milligan et al. 2007) and add to the collection of studies that report null relations between EF and ToM (Ford et al. 2012; Austin et al. 2020; Talwar et al. 2017).
Children's parent‐report ToM scores from the TOMI‐2 (Hutchins et al. 2014) were strongly and positively correlated with CAT performance on all psychometrically valid subscales, and these relations withstood correction for age with the exception of false sign (which is not tested in the TOMI‐2). These positive links between parent‐report ToM and children's CAT performance provide evidence for the validity of the CAT. Correlations between the TOMI‐2 and EF did not withstand correction for age, nor did relations between the TOMI‐2 and vocabulary, again paralleling children's CAT performance and its relations to EF and vocabulary.
2.2. Focal Analyses: Psychometric Evaluation and Difficulty Scaling of Mental States and False Sign
We next conducted scaling analyses similar to those in Wellman and Liu (2004) and Osterhaus et al. (2016, 2022). Specifically, we built two‐parameter logistic (2PL) models (i.e., models within the IRT family). These models estimate each participant's latent ability (a standardized latent measure of performance across mental‐state and false‐sign items), which is then used to assess item difficulty (interpreted as the level of latent ability needed to reach a 0.5 probability of passing an item, allowing items to be ordered by difficulty across the measure as a whole) and item discrimination (interpreted as how well an item distinguishes between participants with different latent abilities, and how closely it relates to the underlying latent ability factor).
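The 2PL model just described can be summarized in a short sketch (Python is used here for illustration only; the study fit these models in R with ltm and mirt). An item's pass probability is a logistic function of the gap between latent ability θ and item difficulty b, scaled by discrimination a:

```python
import math

# Minimal sketch of the 2PL item response function:
# P(correct | theta) = 1 / (1 + exp(-a * (theta - b))).
# At theta == b the pass probability is exactly 0.5, which is how the
# difficulty coefficients reported in Tables 2 and 3 should be read.

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Example using item C4's Table 2 parameters (a = 5.24, b = -1.19):
print(p_correct(-1.19, 5.24, -1.19))  # 0.5 at theta equal to difficulty
```

Higher discrimination a makes this curve steeper around b, so the item separates children just below from just above that ability level more sharply.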
We built two different IRT models: one considering only children's scores on prediction questions (following the approach in Wellman and Liu 2004) and one considering children's scores on both prediction and explanation questions. All models met IRT assumptions of local independence and unidimensionality (see Appendix S8). Acceptable fit was determined by a χ2:df ratio ≤ 2 (Schreiber et al. 2006); approximate fit indices, Tucker–Lewis index (TLI) and comparative fit index (CFI) ≥ 0.90 (Brown 2006); goodness‐of‐fit indices (GFI/AGFI) > 0.90; and absolute fit indices, root mean square error of approximation (RMSEA) and standardized root mean square residual (SRMR) ≤ 0.08 (Hu and Bentler 1999).
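The preregistered thresholds above can be collected into a simple checker (an illustrative sketch, not the study's R code):

```python
# Encodes the stated fit criteria: chi2/df <= 2, TLI and CFI >= .90,
# GFI and AGFI > .90, RMSEA and SRMR <= .08. Example values are hypothetical.

def acceptable_fit(chisq, df, tli, cfi, gfi, agfi, rmsea, srmr):
    return (chisq / df <= 2
            and tli >= 0.90 and cfi >= 0.90
            and gfi > 0.90 and agfi > 0.90
            and rmsea <= 0.08 and srmr <= 0.08)

print(acceptable_fit(100, 60, 0.95, 0.96, 0.93, 0.92, 0.05, 0.06))  # True
print(acceptable_fit(100, 60, 0.95, 0.96, 0.93, 0.92, 0.10, 0.06))  # False
```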
2.2.1. Scaling of Prediction Items Only
This first 2PL model estimated the probability of children scoring either 0 or 1 on each item's prediction question based on the child's underlying latent ability (i.e., of ToM), the item difficulty (b), and the item discrimination (a). Reeve and Fayers (2005) suggest that items with good discrimination have coefficients between 0.5 and 2.5. As seen in Table 2, discrimination values ranged from 0.47 to 5.24: nearly all items fell within this standard, with two contents false‐belief items notably exceeding it (reflecting especially strong discrimination) and one VPT item falling just below it, supporting the inclusion of these items in the measure. Table 2 also shows item difficulties, ordered from easiest to hardest. With the exception of one VPT item (C11) that had a higher difficulty coefficient (5.05), items were generally easy (difficulties ranged from −3.04 to 0.58), suggesting that low latent ToM ability was needed to pass prediction questions in our sample.
TABLE 2.
Color‐coded item difficulty and discrimination for prediction questions, scaled from easiest items to most difficult.
| Category | Brief item description | Item | Diff. b | Disc. a |
|---|---|---|---|---|
| Know. (expertise) | Chef and Pilot talk about cooking | B9 | −3.04 | 1.10 |
| Div. des. (C vs. C) | Mark and Sarah like different toys | C1 | −2.81 | 1.05 |
| VPT (level‐1) | Tall fence and only Giraffe sees over it | B2 | −2.34 | 0.58 |
| VPT (level‐1) | Carmen sees only one of two raisin boxes a | B5 | −2.28 | 1.03 |
| Div. des. (P vs. C) | John and ppt like different snacks | B12 | −1.93 | 1.47 |
| Div. des. (C vs. C) | James and Audrey like different snacks | B10 | −1.88 | 1.39 |
| Div. des. (C vs. C) | Amanda and Jake like different books | A9 | −1.85 | 1.20 |
| Div. des. (P vs. C) | Zachary and ppt like different games | C5 | −1.82 | 1.44 |
| Know. (expertise) | Mechanic and Chef talk about a car | C10 | −1.79 | 1.88 |
| Div. des. (P vs. C) | Amelia and ppt like different hats | A1 | −1.78 | 1.68 |
| VPT (level‐1) | Big mountain blocks Bear's view | A11 | −1.75 | 1.98 |
| Know. (expertise) | Mechanic and Pilot talk about planes | A4 | −1.72 | 1.05 |
| False sign (loc.) | Light tells train to go to wrong location | B4 | −1.68 | 0.95 |
| Div. bel. (P vs. C) | Mr. Smith and ppt think snack is in different spots | A3 | −1.68 | 1.34 |
| False sign (cont.) | Sign shows a horse but there is a sheep in the field | C2 | −1.60 | 1.79 |
| Div. bel. (C vs. C) | Anne and Ben think cat is in different spots | C8 | −1.45 | 1.68 |
| False sign (cont.) | Sign shows cookies but there is a cake in the tin | A5 | −1.34 | 1.56 |
| False sign (loc.) | Ice cream sign points to wrong location | C9 | −1.27 | 2.03 |
| VPT (level‐1) | Billy's hat covers his eyes so he cannot see bird | A2 | −1.24 | 1.21 |
| Know. (access) | Alligator under rock, unknown to Jim | A12 | −1.23 | 2.39 |
| Know. (access) | Spoon in box, unknown to Michael | C3 | −1.21 | 2.82 |
| Know. (access) | Flower in jar, unknown to Alana | B6 | −1.19 | 2.40 |
| Div. bel. (C vs. C) | Joe and Sally think the gift is different toys | A6 | −1.19 | 1.49 |
| False bel. (cont.) | Legos in a Cheerios box | C4 | −1.19 | 5.24 |
| False bel. (loc.) | Nick moves Ryan's cookie, while Ryan is away | B11 | −1.13 | 1.72 |
| False sign (loc.) | Carrot sign points to wrong location | A13 | −1.12 | 1.83 |
| False bel. (cont.) | Candles in a crayon box | B3 | −1.06 | 2.30 |
| Div. bel. (P vs. C) | Ella and ppt think bunny is in different spots | B7 | −1.00 | 2.23 |
| False bel. (loc.) | Sue puts her bag where Allie's was, while Allie is away a | C6 | −1.00 | 1.65 |
| False bel. (cont.) | Marbles in a Band‐Aid box | A8 | −0.94 | 3.45 |
| False sign (cont.) | Sign shows dishes but there is food in the cupboard | B8 | −0.56 | 1.05 |
| False bel. (loc.) | Someone moves Diana's snack, while Diana is away a | A10 | −0.54 | 1.38 |
| VPT (level‐2) | Alice and her mom see apples differently | C7 | 0.58 | 1.42 |
| VPT (level‐2) | Marie & her dad see cookies differently a | C11 | 5.05 | 0.47 |
Note: a, discrimination; b, difficulty; C vs. C, character‐versus‐character; cont., contents; Div. bel., diverse beliefs; Div. des., diverse desires; Know., Knowledge; loc., location; P vs. C, participant‐versus‐character; ppt, participant; VPT, visual perspective taking.
Critically, as depicted in Table 2, the order of prediction item difficulty across categories of mental states, at the broadest level, is in line with foundational work by Wellman and Liu (2004) and others since (Osterhaus et al. 2022; Peterson et al. 2012; Shahaeian et al. 2011): Diverse desire items as a group were easier than diverse beliefs, knowledge access, and false‐belief items; and false beliefs were more difficult than diverse beliefs and knowledge (access and expertise).
Intriguingly, our addition of multiple items per category, as well as new categories, also revealed new patterns. Diverse belief items were not consistently easier than knowledge access items (contrasting with findings in other North American samples that used the Wellman and Liu scale). Regarding our new item categories: all knowledge expertise items were easier than diverse beliefs; VPT level‐1 items were among the easiest, whereas level‐2 items were the hardest both within the VPT category and in the measure overall; and false‐sign items were more difficult than diverse desires, paralleling the difficulty of diverse‐ and false‐belief items.
Item difficulty did not scale based on specific story context. Participant‐versus‐character mental‐state contrasts were both more and less difficult than character‐versus‐character mental‐state contrasts. Items about false contents were both more and less difficult than items about false locations (for both false‐belief and false‐sign items).
2.2.2. Scaling of Prediction and Explanation Items Together
We next scaled prediction and explanation items together, in a graded response model (an extension of the 2PL framework; Samejima 1969) that used a sum score in which each item was scored from 0 to 2 (1 point each per correct prediction and explanation, accounting for control question performance). This model estimates the probability of children scoring 0, 1, or 2 points on each item based on the child's underlying latent ability (i.e., of ToM), item difficulty, and item discrimination. We report item difficulty for a score of 0 compared to either 1 or 2 (b1) and for 0 and 1 compared to 2 (b2). We also report generalized item difficulty (b) as described by Ali et al. (2015), which gives a single difficulty parameter per item, simplifying item comparisons to estimate overall difficulty scaling.
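To illustrate how the graded response model handles the 0, 1, or 2 point scoring (a Python sketch for illustration; the actual fitting used R), the probability of reaching at least each score follows a 2PL curve at thresholds b1 and b2, and category probabilities are differences of adjacent curves:

```python
import math

# Graded response model sketch for a three-category item (scores 0, 1, 2):
# P(score >= k) is a 2PL curve at threshold b_k; category probabilities are
# differences of adjacent cumulative curves.

def grm_category_probs(theta, a, b1, b2):
    def p_at_least(b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    p1, p2 = p_at_least(b1), p_at_least(b2)
    return [1 - p1, p1 - p2, p2]  # P(score=0), P(score=1), P(score=2)

# Example with item B12's Table 3 parameters (a = 1.71, b1 = -2.12, b2 = -1.48):
probs = grm_category_probs(0.0, 1.71, -2.12, -1.48)
```

Because b1 < b2, the middle category always has nonnegative probability, and the three probabilities sum to 1 at any θ.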
Table 3 reports item difficulties (b, b1, b2) and discrimination (a). As in Table 2, this table is also ordered from easiest to hardest items. As in the prediction‐only model, two contents false‐belief items had the highest discrimination coefficients, and all item discriminations were high and ranged from 1.07 to 2.35, suggesting that items scored based on both predictions and explanations were still accurate discriminators of different latent abilities and broadly were good measures of mental‐state concepts and false sign. Table 3 also shows that the generalized item difficulties (b) for the prediction‐plus‐explanation model ranged from −1.79 to 2.39: Overall, items were still fairly easy, as shown by the negative difficulty coefficients.
TABLE 3.
Color‐coded item difficulty and discrimination for prediction and explanation questions, scaled from easiest items to most difficult.
| Category | Brief item description | Item | Difficulty | Disc. | ||
|---|---|---|---|---|---|---|
| b | b1 | b2 | a | |||
| Div. des. (P vs. C) | John and ppt like different snacks | B12 | −1.79 | −2.12 | −1.48 | 1.71 |
| Div. des. (C vs. C) | Mark and Sarah like different toys | C1 | −1.79 | −2.71 | −0.95 | 1.28 |
| Div. des. (P vs. C) | Zachary and ppt like different games | C5 | −1.55 | −1.97 | −1.17 | 1.74 |
| Div. des. (P vs. C) | Amelia and ppt like different hats | A1 | −1.47 | −2.07 | −0.87 | 1.60 |
| False sign (loc.) | Light tells train to go to wrong location | B4 | −1.36 | −1.99 | −0.73 | 1.33 |
| Div. des. (C vs. C) | James and Audrey like different snacks | B10 | −1.33 | −1.89 | −0.75 | 1.47 |
| Div. des. (C vs. C) | Amanda and Jake like different books | A9 | −1.32 | −1.88 | −0.74 | 1.36 |
| False bel. (loc.) | Nick moves Ryan's cookie, while Ryan is away | B11 | −1.1 | −1.72 | −0.46 | 1.77 |
| VPT (level‐1) | Big mountain blocks Bear's view | A11 | −1.08 | −1.78 | −0.3 | 2.00 |
| VPT (level‐1) | Tall fence and only Giraffe sees over it | B2 | −1.03 | −1.82 | −0.21 | 1.24 |
| Div. bel. (P vs. C) | Mr. Smith and ppt think snack is in different spots | A3 | −1.03 | −1.88 | −0.11 | 1.39 |
| Know. (expertise) | Chef and Pilot talk about cooking | B9 | −1.03 | −2.4 | 0.4 | 1.81 |
| Know. (access) | Spoon in box, unknown to Michael | C3 | −0.97 | −1.3 | −0.6 | 2.18 |
| False bel. (loc.) | Sue's bag goes where Allie's was, while Allie is away a | C6 | −0.93 | −1.45 | −0.38 | 1.57 |
| Know. (access) | Flower in jar, unknown to Alana | B6 | −0.89 | −1.26 | −0.46 | 2.15 |
| VPT (level‐1) | Billy's hat covers his eyes so he cannot see bird | A2 | −0.84 | −1.47 | −0.14 | 1.88 |
| VPT (level‐1) | Carmen sees only one of two raisin boxes a | B5 | −0.84 | −1.92 | 0.3 | 1.37 |
| False sign (loc.) | Ice cream sign points to wrong location | C9 | −0.83 | −1.36 | −0.22 | 2.16 |
| False bel. (cont.) | Candles in a crayon box | B3 | −0.81 | −1.19 | −0.36 | 2.26 |
| False bel. (cont.) | Marbles in a Band‐Aid box | A8 | −0.81 | −1.32 | −0.24 | 2.35 |
| False sign (loc.) | Carrot sign points to wrong location | A13 | −0.75 | −1.21 | −0.21 | 1.84 |
| Div. bel. (P vs. C) | Ella and ppt think bunny is in different spots | B7 | −0.72 | −1.28 | −0.08 | 1.65 |
| False sign (cont.) | Sign shows a horse but there is a sheep in field | C2 | −0.72 | −1.65 | 0.3 | 1.92 |
| False bel. (cont.) | Legos in a Cheerios box | C4 | −0.71 | −1.47 | 0.12 | 1.62 |
| Know. (expertise) | Mechanic and Chef talk about a car | C10 | −0.7 | −1.76 | 0.47 | 1.88 |
| False sign (cont.) | Sign shows cookies but there is a cake in the tin | A5 | −0.67 | −1.46 | 0.21 | 1.82 |
| False bel. (loc.) | Someone moves Diana's snack, while Diana is away a | A10 | −0.64 | −1.26 | 0.06 | 1.91 |
| Div. bel. (C vs. C) | Anne and Ben think cat is in different spots | C8 | −0.58 | −1.43 | 0.36 | 1.88 |
| Know. (expertise) | Mechanic and Pilot talk about planes | A4 | −0.56 | −1.49 | 0.47 | 1.32 |
| Div. bel. (C vs. C) | Joe and Sally think the gift is different toys | A6 | −0.54 | −1.15 | 0.18 | 1.41 |
| Know. (access) | Alligator under rock, unknown to Jim | A12 | −0.5 | −1.33 | 0.44 | 1.93 |
| False sign (cont.) | Sign shows dishes but there is food in cupboard | B8 | 0.03 | −0.71 | 0.93 | 1.37 |
| VPT (level‐2) | Alice and her mom see apples differently | C7 | 0.37 | −0.1 | 1.04 | 1.69 |
| VPT (level‐2) | Marie and her dad see cookies differently a | C11 | 2.39 | 1.13 | 4.16 | 1.08 |
Note: b, generalized difficulty, as described by Ali et al. (2015); b1, difficulty threshold for 0 vs. 1–2; b2, difficulty threshold for 0–1 vs. 2; a, discrimination; Know., Knowledge; Div. des., diverse desires; Div. bel., diverse beliefs; VPT, visual perspective taking; cont., contents; loc., location; C vs. C, character‐versus‐character; P vs. C, participant‐versus‐character; ppt, participant.
Critically, additional scaling patterns emerged in this model that are noteworthy compared to both the prediction‐only model and Wellman and Liu (2004). As shown in Table 3, diverse desires was now the most consistently easy category: All six items had the lowest generalized difficulty coefficients (with only one false‐sign item intervening, as 5th easiest overall). These results are in line with Wellman and Liu but differ from our prediction‐only model (in which knowledge expertise and VPT items were easiest along with diverse desires).
This prediction‐plus‐explanation model also diverged from Wellman and Liu (2004): the pattern of false‐belief reasoning being more difficult than knowledge access and diverse beliefs—a pattern we replicated in our prediction‐only scaling—was no longer evident. Additionally, the prediction‐only pattern in which knowledge expertise items were easier than knowledge access items was no longer evident. VPT and false‐sign items retained their wide range of difficulty, as in the prediction‐only scaling, though they were generally more difficult. VPT level‐1 items were still easier than VPT level‐2 items.
Also unlike prediction‐only scaling results, we did find scaling patterns based on elements of story context. Within the diverse beliefs category, participant‐versus‐character contrasts were easier for children compared to character‐versus‐character contrasts. Within false sign, items that were about signs pointing to inaccurate locations were easier for children compared to items about a picture inaccurately representing the contents of a container. However, for diverse desires and false‐belief items, item format did not influence difficulty level (similar to results from the prediction‐only modeling).
2.2.3. Item Information Curves and Test Information Functions
IRT models also yield item information curves (IICs), which visually depict the difficulty and discrimination parameters for each item and assess the information captured by each item as a function of both parameters. Additionally, IRT models yield a test information function (TIF), which visually depicts the range of ability the test measures best and assesses the combined information captured by the measure across all items. In the context of IRT, IICs and TIFs with greater information demonstrate that individual items (in the case of IICs) and the measure as a whole (in the case of the TIF) have greater precision in measuring the underlying latent ability (i.e., of ToM).
IICs for the diverse desires and false beliefs items are shown in Figure 2 (see Appendix S9, Figure S3 for remaining item IICs). The curves illustrate which items are most informative psychometrically, wherein higher curves depict the latent ability at which the item has the highest discrimination and wider curves depict items that capture a broader range of latent ability corresponding to the range of item difficulty (Embretson and Reise 2000). In both the prediction‐only and the prediction‐plus‐explanation analysis, IICs for multiple items within a mental‐state category or false sign overlapped substantially, suggesting items assessed the same underlying latent ToM ability while together covering a broader range of ability than any single item alone. However, for the prediction‐only model, the false‐belief items captured more information on children's latent ToM ability, meaning that these items had the best precision in assessing ToM, with items in all other constructs contributing relatively less information. In contrast, for the prediction‐plus‐explanation model, items across all categories captured similar amounts of information and all helped explain variability in children's latent ToM ability (see Figures 2 and S3).
FIGURE 2.

Item information curves (IICs) for diverse desires and false‐belief items, as well as test information functions for the prediction‐only and prediction‐plus‐explanation models. IICs (left two panels) for each individual item within a mental‐state category (i.e., diverse desires, pink; false belief, teal) are overlaid on a single plot. Y‐axis scales for prediction‐only IICs (top) and prediction‐plus‐explanation IICs (bottom) differ given that the models reveal different relative information captured by the items. TIFs (right of center panel) include blue shading to highlight the standardized latent abilities for which the models capture the most information. Histograms (far right panel) of children's ages for the subsample represented by blue shading show the ages for whom the measure (scored as prediction‐only or prediction‐plus‐explanation) is most informative in terms of ToM ability.
It is important to note that we cannot directly compare specific parameter values across the prediction‐only and prediction‐plus‐explanation analyses, because the underlying latent factors differ (one is estimated from prediction items alone, the other from both predictions and explanations) and parameter estimates are defined relative to the latent factor. Thus, we cannot simply compare the discrimination of an item in the prediction‐only analysis with that of an item in the prediction‐plus‐explanation analysis based on the values themselves. Rather, to evaluate the efficacy of items in a measure, we examine difficulty and discrimination for items within a given model, and we compare overall patterns of items from each model against the extant literature. Within these evaluation guidelines, results suggest that when considering children's predictions alone, false‐belief predictions provide the most information about children's ToM, with other mental‐state and false‐sign prediction questions offering relatively little information. In contrast, adding explanation questions for all mental states expands the variability captured by items, better captures “partial” understanding or gradients in understanding (e.g., passing a prediction item but not an explanation), and increases the amount of information available to the model in estimating the latent factor of ToM ability.
TIFs are shown in Figure 2. The TIF shows the test precision at different levels of the latent ability and is a function of the summed item information curves (Embretson and Reise 2000). Thus, the TIF shows which latent abilities the measure is most accurate at assessing. We used this information to find the age range that corresponded to the measure's most accurate assessment range to reveal the range of ages for which the measure was most optimal.
For the prediction‐only model, the TIF shows that the CAT was most informative for participants with a latent ability (θ) between −2.5 and −0.5, best representing 4‐ and 5‐year‐old children. These participants (n = 61) were on average substantially younger than the full sample tested (MAgeSubset = 4.64, SD = 1.22, versus MAgeFullSample = 6.28, SD = 1.71). In contrast, for the prediction‐plus‐explanation model, the TIF showed a larger latent ability range, between −2.5 and 1. These participants (n = 176) were on average still slightly younger than the full sample (MAgeSubset = 5.94, SD = 1.59). However, this subset included more older children, several of whom reached the top of the age range tested. These results suggest that when using both prediction and explanation questions, the CAT can accurately assess ToM across the full age range tested (3‐ to 8‐year‐olds), although it may be best suited for 4‐ to 8‐year‐old children given the comparatively low number of 3‐year‐olds for whom the measure was highly informative.
3. Discussion
The present study introduces and evaluates a novel measure of children's ToM—the Comprehensive Assessment of ToM (CAT). The CAT items deliberately mimic common tasks assessing multiple mental‐state concepts. In particular, our items assessing children's diverse desires, diverse beliefs, knowledge access, and false belief matched the core items from the most widely used multi‐mental‐state ToM measure to date—the Wellman and Liu (2004) ToM scale. We also included items assessing ToM‐relevant concepts that have not previously been scaled: knowledge expertise, visual perspective taking (VPT), and false sign (we also included true‐belief items, but this subscale had low internal consistency and discrimination and was removed from further analysis). We rank ordered items in the CAT by difficulty to reveal a scaling of mental‐state concepts and false sign that can be compared to the hierarchical scale identified in Wellman and Liu.
Importantly, the CAT has novel elements that address several limitations of existing ToM measures, including the Wellman and Liu (2004) scale. Our analysis of the CAT in a large and diverse sample of 3‐ to 8‐year‐old children establishes the measure as an important alternative to existing measures of children's ToM. Moreover, it provides novel insight into children's progression of mental‐state concepts and potentially challenges current characterizations of children's ToM development based on decades of research with the Wellman and Liu scale. In the sections that follow, we highlight seven key take‐away points from the present study.
3.1. Point #1: The CAT Offers a Reliable, Comprehensive Assessment of Children's Observed ToM Abilities. We Recommend Its Use in Future Research
The CAT is psychometrically robust and reliable. According to multiple metrics—i.e., alpha, as well as discrimination and difficulty from 2PL models—all items provide reliable assessments of children's mental‐state and false‐sign reasoning (except for true‐belief items, which were removed from the measure, as discussed in point 7 below). In the IRT model examining both prediction and explanation questions, item information curves (IICs) revealed that all items in a given mental‐state or false‐sign category were highly discriminating in measuring the underlying latent “ToM” ability, while together covering a broader range of ability than any single item alone. Category subscales demonstrated expected relations with age and with parent‐report ToM, further supporting use of the measure as a whole. The prediction‐plus‐explanation model test information function (TIF) also supports that the CAT assessed latent ToM ability for children ages 3–8 years.
Taken together, analyses revealed that the CAT provides a reliable assessment of children's explicit reasoning about diverse desires, diverse beliefs, knowledge access, knowledge expertise, false beliefs, VPT, and false sign, over a wide age range from early to middle childhood. It can assess group‐level performance and individual differences and can be used in a variety of assessment formats (e.g., within a single session, across sessions, online, in person). Thus, the CAT offers the most comprehensive assessment of children's observed ToM abilities to date. We recommend its use in future research (with the true‐belief items removed).
3.2. Point #2: Explicit Reasoning About Desires Is Easier for Children Compared to Explicit Reasoning About Beliefs, Replicating Decades of Prior Research
The IRT difficulty scaling analysis revealed that diverse desire items were consistently easier to pass compared to belief items—both diverse‐ and false‐belief items—across analyses of prediction questions only as well as when considering additional explanation questions. These results suggest a robust “progression” of mental‐state concept reasoning from more easily reasoning about desire concepts to reasoning about more difficult belief concepts—a pattern also evident in our age‐related analyses of the different mental‐state subscales, wherein desire items were passed at earlier ages. This pattern is in line with decades of prior research examining children's explicit reasoning about desire and belief concepts, in both studies that use the Wellman and Liu (2004) scale (e.g., Liu et al. 2008; Peterson et al. 2012; Shahaeian et al. 2011; Wellman et al. 2006, 2011) as well as studies of children's everyday conversations and experimental performance (e.g., Bartsch and Wellman 1995; Wellman and Woolley 1990) and more recent scaling analyses with the diverse‐desire, diverse‐belief, and false‐belief items from the Wellman and Liu scale (Osterhaus et al. 2022). Our findings from the CAT add to this literature on desire and belief reasoning, showing that across multiple items that vary in story context, and that test children's predictions as well as open‐ended explanations, desire reasoning is consistently easier than belief reasoning. This replication of the desires‐to‐beliefs trajectory validates the desire and belief items as administered in the context of the battery and in the novel format adopted for these items (which included both character‐versus‐character and self‐versus‐character judgments), and supports the use of the CAT to reveal robust underlying developmental phenomena.
3.3. Point #3: False‐Belief Items Can Be Especially Informative on Children's ToM Ability, Supporting the Prominence of This Task in the Literature
Passing a false‐belief task has long been regarded as a critical milestone in ToM development and often considered a marker for the achievement of a “mature” ToM itself (see meta‐analysis; Wellman et al. 2001). Although the false‐belief items were not the most difficult to pass in our measure, the two items with the highest discrimination parameters were false‐belief items, across both prediction‐only and prediction‐plus‐explanation analyses. Moreover, across both analyses, false‐belief items captured the most information (according to the item information curves) on children's underlying ToM ability. False‐belief tasks are designed to test two important aspects of ToM: that mental states are person‐specific—each person has a unique mind and unique mental states (e.g., the child thinks there are candles in the crayon box but Mickey thinks there are crayons)—and that mental states are distinct from reality—they reflect representations of the world that may or may not be true (e.g., Mickey thinks that there are crayons in the box, even though there are really candles inside). In contrast, several other mental state tasks only assess person‐specific aspects of ToM (e.g., diverse desires, diverse beliefs, visual perspective taking) or are designed to highlight nonmental representational aspects (e.g., false sign). It is possible that the “dual” qualities of the false‐belief task (assessing both person‐specific and reality‐distinct components of mental states) contribute to the task's ability to capture high levels of information in children's ToM abilities. Broadly, our psychometric analysis reveals the utility of false‐belief tasks in measuring ToM development, supporting decades of research that have featured this type of task in ToM investigations.
However, our use of multiple items to measure multiple mental states also reveals additional important nuances in the utility of the false‐belief task. First, the unique informativeness of false‐belief items was only prominent when considering children's responses to prediction questions alone. That is, when only prediction questions were evaluated, false‐belief items were by far the most informative. This pattern suggests that the prediction questions alone for nonfalse‐belief items are substantially less informative on children's ToM ability, whereas false‐belief items are highly informative of children's ToM in this constrained prediction‐only form. Notably though, when explanation questions were evaluated in analyses, the other nonfalse‐belief items in the CAT captured similar levels of information to the false‐belief items, suggesting that false‐belief items were no longer uniquely more informative and pointing to the utility of explanation questions in broadening assessment of multiple mental‐state constructs to inform on children's ToM development. Second, the high informative quality of the false‐belief task was specific to only 2 out of 6 false‐belief items: the two highest discrimination parameters were for contents false‐belief items; the other contents false‐belief item and the three location‐change items captured information similar to other mental‐state items including knowledge, visual perspective taking, diverse belief, and false sign. This pattern suggests that not all false‐belief items are equally informative, especially when considering children's responses to prediction questions alone (wherein 4 out of 6 false‐belief items were substantially less informative, on par with other mental‐state and false‐sign items). These findings again point to the importance of including multiple items per mental‐state category to best assess children's underlying ToM.
3.4. Point #4: The Progression of Knowledge and Belief Concepts May Be Less Straightforward Than Previously Thought (As Revealed by Multiple Items and Explanation Questions), Potentially Challenging Existing Conceptualizations of Children's ToM Development
Our inclusion of multiple items per category of mental‐state concept revealed that diverse belief, knowledge, and false‐belief items did not show a consistent scaled order of difficulty. Especially in the model of prediction and explanation items together, difficulty rankings of items from these categories were interleaved, with items in one category being both easier and more difficult than items in either of the other two categories.
This lack of clear difficulty scaling for these mental states is inconsistent with the hierarchical scaling revealed in the Wellman and Liu (2004) scale and its subsequent use in North American and other Western samples (e.g., Wellman et al. 2006), which indicated a progression of difficulty from diverse beliefs to knowledge access to false‐belief understanding. Under this premise, we would have expected to see in our North American sample a pattern of difficulty scaling wherein diverse‐belief items were consistently easier than knowledge items, which were in turn easier than false‐belief items. We saw this kind of consistent difficulty scaling pattern for our desire items compared to belief items (in line with Wellman and colleagues' prior findings), but it did not exist for diverse beliefs, knowledge, and false beliefs.
Intriguingly, our analysis of prediction questions alone in some ways revealed a pattern more in line with Wellman and colleagues. Indeed, if we scaled based on a single item per category in the prediction‐only model, it is possible we would have replicated Wellman and colleagues' findings of diverse beliefs being easier than knowledge access, which is easier than false belief. However, even though all items were adapted from common tasks designed to test our target concepts, not all items across these categories showed this pattern of progressive difficulty. Indeed, if we included only one item per category, we would have drawn different conclusions about mental‐state difficulty scaling depending on which specific item was selected.
Our findings with multiple items therefore raise the possibility that Wellman and colleagues' scaling results were based in part on the difficulty of a specific item and its story context, and not necessarily entirely on the basis of the underlying mental‐state concept. This possibility is further supported by the fact that our inclusion of explanation questions further reduced any discernible difficulty scaling pattern across items in the belief and knowledge categories. On the one hand, this increased spread in difficulty across items and categories in the prediction‐plus‐explanation model could suggest that the ease with which children can correctly explain behavior based on mental states may be more sensitive to specific item details (e.g., the specific story context) such that children's answers become more variable across different items designed to tap the same concept. However, inconsistent with this notion, the addition of explanation questions revealed a more consistent pattern of difficulty scaling for desire items compared to others. Alternatively, the spread of difficulty across items and categories could suggest that explanation questions reveal a generally more nuanced development of mental state concepts, at least with respect to nondesire states, calling into question the “progression” identified by Wellman and colleagues that was largely replicated in our analysis of prediction questions alone (although our prediction‐only analysis with multiple items also revealed some nuances in the progression that had not been shown before). At minimum, findings from the CAT suggest the need to assess children's mental‐state reasoning with multiple items, and with both prediction and explanation questions, to see whether these patterns replicate. 
We have confidence that our study findings are not spurious given the robust psychometric properties of our measure, and the replication of the robust desires‐easier‐than‐beliefs pattern, but future research (e.g., with our measure in other samples) is needed to further bolster our findings. It will be important to test our measure in non‐Western samples as well.
3.5. Point #5: Visual Perspective Taking and False Sign Are Just as Difficult to Reason About as Other Mental‐State Concepts
In addition to the core mental‐state concepts examined in Wellman and Liu (2004), our study sheds light on how VPT and false‐sign scale in difficulty across story contexts, and in comparison with other mental states. In general, both VPT and false‐sign items ranged in difficulty with some easy and some hard, and generally paralleled the difficulty ranges of diverse belief, false belief, and knowledge items.
In line with prior research (Masangkay et al. 1974), more difficult VPT items required level‐2 perspective taking: reasoning about multiple competing perspectives that, in the case of the most difficult items, also involved a question about an ambiguous reference (e.g., which cookie is the “biggest” given the cookies this character can see; which “apples” will be selected given one character can see green apples and one can see red apples). In these items, two characters saw the same thing (i.e., the “biggest cookie”, the “apples”) differently. Easier VPT items, in contrast, involved level‐1 perspective taking: reasoning about the difference between visual access versus lack thereof (e.g., only the giraffe can see over the tall fence; the boy cannot see when his hat covers his eyes). Both the prediction‐only and prediction‐plus‐explanation models revealed these nuances in visual perspective‐taking difficulty, providing further support for the validity of the CAT.
False‐sign items—which require children to reason about a sign that has become “false” (the arrow no longer points to the location of its target; the sticker no longer represents the actual contents of a closed container)—had similar difficulty to diverse‐belief, false‐belief, and knowledge items. This finding is also in line with existing theory and research. Several theories posit that ToM development involves conceptualizing the mind as a vessel that holds thoughts and ideas that can exist outside of reality and outside of the current moment (e.g., Gopnik and Wellman 1992). Empirical research shows that performance on false‐sign and other tasks testing nonmental representational reasoning (e.g., false‐photograph tasks) is related to false‐belief performance (Iao et al. 2011; Leekam et al. 2008), and the ability to “update” nonmental representations (e.g., about how much a deceptively heavy small object weighs after lifting it) is related to children's broader ToM performance (Sabbagh et al. 2006). Our findings add further support for the notion that reasoning about representations develops alongside reasoning about mental states, and lay the groundwork for continued testing of the extent to which reasoning about representations more generally is involved in children's ToM.
3.6. Point #6: Story Context and Other Item‐Specific Details May Influence ToM Performance, and Explanation Questions Are Useful
Item difficulty, in some cases, appeared to systematically differ based on story context or item set‐up. For example, in the prediction‐and‐explanation model, items that contrasted one character's belief with another character's were more difficult compared to items that contrasted a character's belief with the participant's. This finding is in line with research showing that reasoning about one's own epistemic mental states develops earlier than reasoning about other people's (Gonzales et al. 2018). Additionally, in the prediction‐and‐explanation model, items that were about signs that falsely represented the contents of a container were more difficult compared to items about signs that inaccurately pointed to a target location, potentially suggesting that children are more familiar with the notion of an inaccurate arrow sign than they are with a picture falsely representing internal contents. This finding and interpretation (as well as the lack of influence of contents versus location contexts in children's false‐belief reasoning) are also in line with prior research. Wellman et al. (2001) showed that false‐belief performance did not vary systematically across location versus contents contexts, whereas Callaghan et al. (2012) showed that increased familiarity with pictures as representational symbols improves children's reasoning about the representational nature of picture signs (but is not related to their false‐belief reasoning).
These nuances in difficulty across systematically varying story contexts were apparent only in the prediction‐plus‐explanation model, suggesting that the inclusion of children's explanations helps reveal potentially more subtle nuances in children's ToM ability. Importantly, these nuances have been substantiated in extant studies that aimed to target these more subtle aspects of children's ToM development. These results thus provide additional confidence in the CAT as a valid ToM measure, and further promote the importance of explanation questions in revealing children's ToM ability. They also potentially raise further questions about the purported progression from first understanding diverse beliefs, then knowledge, then false beliefs; while we were able to replicate other nuances of ToM development found in prior research, our prediction‐plus‐explanation model did not show this belief–knowledge–false‐belief progression. Taken together, findings from the CAT support the possibility that the progression as revealed in the single‐item Wellman and Liu (2004) scale may be based in part on item‐specific contexts.
More generally, findings suggest that differences in story context can influence children's performance, and should be evaluated or at the very least considered in ToM measures. Indeed, measures that include only a single item per mental‐state category, or that have only one type of story context per mental‐state category, could create confounds that obscure children's underlying ToM ability. The CAT is advantageous in its inclusion of multiple items that vary in story set‐up within a given mental‐state category because researchers can average children's item performance to reveal their ToM ability across these varying contexts.
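As a minimal illustration of this scoring approach (the data layout and names below are hypothetical, not the study's code), a child's subscale score can be computed by averaging item‐level accuracy within each mental‐state category:

```python
from collections import defaultdict

# Hypothetical item-level scores: (child, category, correct 0/1).
scores = [
    ("c1", "false_belief", 1), ("c1", "false_belief", 0),
    ("c1", "vpt", 1),          ("c1", "vpt", 1),
    ("c2", "false_belief", 0), ("c2", "false_belief", 0),
    ("c2", "vpt", 1),          ("c2", "vpt", 0),
]

def subscale_means(rows):
    """Average each child's item performance within a mental-state
    category, pooling over the items' varying story contexts."""
    sums = defaultdict(lambda: [0, 0])  # (child, category) -> [total, n]
    for child, category, correct in rows:
        cell = sums[(child, category)]
        cell[0] += correct
        cell[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

means = subscale_means(scores)
print(means[("c1", "false_belief")])  # 0.5
```

Because the average pools over story contexts within a category, a single item's idiosyncratic difficulty contributes less to the subscale score than it would in a single‐item measure.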
3.7. Point #7: True‐Belief Items May Be Measuring Something Different
The true‐belief items included in our measure had poor psychometric properties; items showed poor internal consistency and poor discrimination. We thus removed them from further analysis of the measure, and we also recommend they be excluded from the CAT. Intriguingly, true‐belief items—specifically, items in which the target character stays to view a location‐change event—have previously been discussed as a potentially poor measure of children's belief reasoning per se. For example, scholars posit that the pragmatics of these kinds of items leave children questioning the straightforward response (i.e., that the character will look for their item where it is; Oktay‐Gür and Rakoczy 2017), and empirical research shows that as children advance in their false‐belief reasoning, their performance on these kinds of true‐belief “stay” items declines (Fabricius et al. 2021). Our results, alongside existing research and theory, suggest careful treatment of true‐belief items in future ToM research.
3.8. Discussion of CAT Validity
Several aspects of our findings give confidence in the CAT and the revealed patterns of children's ToM development. Chiefly, the good psychometric properties demonstrate the measure's reliability, and the expected positive relations with age and with a multi‐item parent‐report ToM measure (TOMI‐2; Hutchins et al. 2014) provide important evidence of CAT validity. Additional evidence supporting the CAT's validity comes from our scaling findings that replicated previously demonstrated nuances in children's ToM development. That is, as described in the sections above, the CAT replicated previous findings of reasoning about desires being easier than beliefs (Bartsch and Wellman 1995), level‐1 perspective taking being easier than level‐2 perspective taking (Masangkay et al. 1974), reasoning about one's own mental states being easier than reasoning about others' (Gonzales et al. 2018), and that nonmental representational reasoning can be influenced by the item context (e.g., familiarity with pictures as signs or symbols; Callaghan et al. 2012) whereas false‐belief reasoning does not systematically vary by item context (e.g., across location‐change versus unexpected‐contents item variants) (Wellman et al. 2001). These nuances in children's ToM performance are exemplified in prior studies that targeted these more subtle aspects of children's ToM development, and thus our replication of them here gives confidence in the CAT as a ToM measure.
Although we did not find robust relations with our two EF measures (both conflict‐inhibition Stroop tasks; Carlson and Moses 2001; Kramer et al. 2015), these null results are in line with other studies reporting null relations between ToM and similar conflict‐inhibition EF tasks (Ford et al. 2012; Austin et al. 2020; Talwar et al. 2017). Moreover, the parent‐report ToM measure also did not relate to the EF measures in our study after controlling for age, paralleling findings with the CAT and raising further questions about the role of EF (at least as captured in these tasks) in ToM. Future research should continue to investigate the role of EF in ToM generally, and as assessed with the CAT specifically, using alternative EF tasks (e.g., working memory tasks) to further probe EF–ToM relations.
We also recommend the use of the CAT in longitudinal study designs to determine whether individual differences in early CAT performance predict individual differences in later CAT performance. Neuroscience research can also provide additional evidence for validity by examining how neural activity in the “ToM Neural Network” (e.g., Bowman and Wellman 2014) relates to children's CAT performance (e.g., following brain‐behavior study designs such as Bowman and Brandone 2024; Richardson et al. 2018).
3.9. Limitations and Future Directions
Although the current study presents evidence for both the reliability and validity of the CAT, there are noteworthy limitations that future research should consider.
First, while we did achieve the minimum recommended sample size, IRT models are often fit to larger samples. The CAT is available on OSF to encourage its broad use in the hopes that as a field we can develop a large‐scale data repository on children's CAT performance to validate the measure with a larger and more diverse sample.
Second, we used a 2PL IRT model to analyze our measure, which differs from the Guttman scale used in Wellman and Liu (2004). It is possible that we failed to replicate a difficulty pattern in line with a mental‐state progression from diverse beliefs, to knowledge, to false beliefs in part because of our use of a different analytic model. Our 2PL model also differs from the 1PL Rasch model used in Osterhaus et al.'s (2016, 2022) scaling of advanced ToM items. We chose the 2PL model over Guttman and Rasch models because it offers the additional parameter of discrimination to more comprehensively evaluate how well items assess ToM, and it is well‐suited for analysis of multiple items that have similar difficulties (e.g., items within the same mental state construct). Future research may conduct Guttman scaling of CAT items to further tease apart whether discrepancies between the present study findings and those from Wellman and colleagues are attributable to differences in the two measures' abilities to capture nuances in children's ToM development versus differences in analytic models. We think it unlikely that the present study findings are merely a product of our chosen analytic model given our ability to replicate Wellman and Liu's finding of desire reasoning being easier than belief reasoning, as well as other nuances in ToM development demonstrated in extant literature (Carlson et al. 2004; Gonzales et al. 2018; Callaghan et al. 2012).
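For researchers who take up this suggestion, a Guttman analysis can be sketched as follows. This is a simplified Goodenough–Edwards‐style error count on hypothetical response patterns (our own illustration, not the procedure used by Wellman and Liu): items are ordered by observed pass rate, each child's observed pattern is compared with the ideal cumulative pattern implied by their total score, and the coefficient of reproducibility summarizes the fit.

```python
def guttman_reproducibility(patterns):
    """Coefficient of reproducibility for a Guttman scale.

    `patterns` is a list of 0/1 response vectors, one per child, with
    columns as items. Items are ordered from easiest to hardest by
    observed pass rate; the ideal pattern for a child with total score
    s passes exactly the s easiest items. Rep = 1 - errors / (N * k).
    """
    n, k = len(patterns), len(patterns[0])
    # Order item columns from easiest (highest pass rate) to hardest.
    pass_rates = [sum(row[j] for row in patterns) for j in range(k)]
    order = sorted(range(k), key=lambda j: -pass_rates[j])
    errors = 0
    for row in patterns:
        s = sum(row)
        ideal = [1] * s + [0] * (k - s)
        observed = [row[j] for j in order]
        errors += sum(o != i for o, i in zip(observed, ideal))
    return 1 - errors / (n * k)

# Perfectly cumulative patterns reproduce exactly (Rep = 1.0).
print(guttman_reproducibility([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
```

A conventional rule of thumb treats Rep ≥ 0.90 as evidence of scalability; applying such a criterion to CAT items would directly test whether the difficulty interleaving we observed also appears under a Guttman framework.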
Third, IRT models (as well as Guttman and Rasch models) do not allow for inclusion of covariates that are not part of the unidimensional latent factor, and thus we did not partial out effects of age, EF, or vocabulary in our scaling. As such, in the present study, the latent factor of “ToM” ability used to scale items does not statistically account for item performance due to these domain‐general constructs, and it is likely that our difficulty scaling patterns were influenced at least in part by variability in underlying domain‐general skills that vary with item‐specific details. Indeed, the finding that the inclusion of explanation questions yielded more similar information (as shown in the item information curves) for items across all mental‐state categories could suggest the items are at least partly measuring a general ability to explain. We note that we examined relations between age, EF, and vocabulary and children's performance on the CAT mental‐state and false‐sign subscales and found some relations with vocabulary that existed beyond age (as expected given existing research), but many mental‐state categories did not relate to vocabulary or EF beyond age (also as expected given existing research) (see e.g., Milligan et al. 2007; Ford et al. 2012; Austin et al. 2020; Talwar et al. 2017). Our high‐level standardization of the overall item format (i.e., first presenting story context, followed by prediction, then explanation, then control questions) is helpful in potentially equating some domain‐general cognitive skills across items, but items still likely varied in these skills. Importantly, we did not see item categories scale on the basis of their relations with vocabulary and executive functioning (e.g., VPT and knowledge item performance were related to vocabulary development even after controlling for age, but these items paralleled the difficulty of other category items that were unrelated to vocabulary, such as diverse beliefs, false beliefs, and false sign).
Thus, while it is likely that the explanation questions capture additional domain‐general abilities (e.g., proficiency in explanation), these domain‐general demands are reasonably equated across items and mental‐state categories, and thus the ability to correctly explain some mental states over others (performance that affects difficulty and discrimination parameters) should still reveal a targeted ToM ability.
Nonetheless, future research is needed to explore issues related to domain‐general abilities contributing to ToM performance in our CAT and other existing ToM measures. As one example, researchers could analyze the CAT in an alternative latent factor model (e.g., confirmatory factor analysis, CFA) that enables inclusion of covariates. Parameters in CFA relate to those in IRT in the following ways: difficulty in IRT is related to the means of item performance, and therefore item means may be used to estimate how items may scale in CFA; discrimination in IRT is related to factor loadings, such that items with higher factor loadings are more informative or discriminating in CFA. Covariates may be included by regressing each item onto the covariates (e.g., onto age, executive function, and vocabulary). Given the complexity of this model, large sample sizes would be needed to appropriately estimate these parameters, and we hope that widespread use of the CAT and cross‐lab collaboration will make this feasible. Future research could also consider examining other domain‐general skills that may relate to ToM performance (e.g., working memory, sentence complexity).
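As a reference point for such a model, the correspondence between 2PL IRT parameters and CFA parameters for binary items can be stated more precisely. The identities below are standard psychometric results, given here as a sketch: $a_j$ denotes the item discrimination in the normal‐ogive metric (logistic discriminations divided by approximately 1.7), $\lambda_j$ the standardized factor loading, and $\tau_j$ the item threshold.

```latex
% 2PL (normal-ogive metric) <-> CFA-with-binary-indicators identities:
% loadings from discriminations, and difficulties from thresholds.
\lambda_j = \frac{a_j}{\sqrt{1 + a_j^{2}}}, \qquad
a_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^{2}}}, \qquad
b_j = \frac{\tau_j}{\lambda_j}
```

These mappings make explicit the claim in the text: higher discriminations correspond to higher loadings, and item thresholds (scaled by loadings) play the role of IRT difficulties, so a covariate‐adjusted CFA can recover quantities comparable to those reported here.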
We also encourage the expansion of the CAT with additional mental‐state constructs. We chose our target mental‐state and false‐sign categories because, as described in prior sections, these concepts integrated well with existing Wellman and Liu (2004) scale concepts, were likely to yield reasonable variability in a target age range across early childhood, and were reasonably categorized within a classic “cognitive” conceptualization of ToM. However, additional items in other mental‐state categories not targeted here (e.g., intention understanding) can be created following the format of our CAT items. Psychometric evaluation (following the approach we outline here) of the CAT with these additional items/mental‐state categories can then reveal robustness and reliability of new items as well as how they scale with our original CAT items. Likewise, similar measures could be created comprising items that test other ToM conceptualizations such as affective or moral ToM; the item format and psychometric analyses we adopt here provide a guide for such future measure development.
Finally, our sample was tested online via the videoconferencing platform Zoom. It is possible that children's performance may differ when tested in person. We note that recent research has revealed comparable performance on children's ToM tasks when tested online versus in person, including performance on false‐belief items similar to those we used here (Schidelko et al. 2021), and we have been successfully testing preschool children with the CAT in in‐person laboratory settings. We encourage future researchers to use the CAT in in‐person settings to compare results with those found with the present study's online sample.
3.10. Recommendation for How to Use the CAT in Future Studies
As discussed, our psychometric evaluation of the CAT reveals that it provides a reliable measure of diverse desires, diverse beliefs, knowledge, false beliefs, visual perspective taking, and false sign as administered in the present study. Specifically, we administered the measure items with both prediction and explanation questions within three randomized, counterbalanced blocks, either within a single testing session or across two sessions approximately 1 week apart (M = 6.14 days, SD = 5.68). We thus recommend its use in future research following similar administration guidelines. Our psychometric analyses also revealed details that support additional recommendations and modifications of the measure's use in future research.
1. Analyze both prediction and explanation questions. We recommend analyzing the CAT items using children's responses to both prediction and explanation questions, rather than prediction questions alone (as is currently common practice in the field). Scaling analysis of the prediction and explanation items together revealed more of the nuances in ToM development that have been documented in the extant literature, and the inclusion of explanation questions resulted in all items capturing similar amounts of information. Further, the inclusion of explanation questions for all mental states expands the variability captured by items, better captures “partial” understanding or gradients in understanding (e.g., passing a prediction item but not an explanation), and increases the amount of information available to the model in estimating the latent factor of ToM ability. In contrast, the prediction‐only items were more variable in the overall information captured, with two false‐belief contents items reflecting the majority of information relative to other items. This pattern of results suggests that when considering only prediction items, false‐belief items were the most informative (far beyond any other mental state examined). However, prediction and explanation questions together offer more information across all mental‐state and false‐sign categories we tested.
2. Use in its described form with 4‐ to 8‐year‐olds; potentially modify for 3‐year‐olds. The TIF revealed that items captured the most information on ToM for children ranging from 4 to 8 years of age when analyzing both prediction and explanation questions together. Thus, the measure can be used with confidence for children in this age range. Given the comparatively low number of 3‐year‐olds for whom the CAT was highly informative, if researchers wish to use the measure with the majority of their sample in this youngest range, it is possible the measure may benefit from augmentation. One possible augmentation that may better suit 3‐year‐olds is to constrain the measure to a smaller set of mental‐state categories that showed lower difficulty, leaving out mental‐state categories whose items were consistently more difficult. Another possible augmentation for 3‐year‐olds is to create new items, following the item structure we outline here, that test understanding of mental‐state concepts typically acquired earlier in development, such as intention understanding (e.g., Woodward 2009; Wellman and Brandone 2009). Researchers should conduct new psychometric analyses (e.g., following the approach we outline here and code available on [BLINDED FOR REVIEW]) on any new version of the measure (e.g., on the measure with new items added). We note that the analysis of prediction questions alone resulted in a substantial drop in the number of children for whom the measure was highly informative; in particular, the measure was no longer highly informative for children outside the range of 4 to 5 years (another reason why we recommend analyzing prediction and explanation questions together).
3. To reduce administrative burden, take breaks between sets, spread sets across testing sessions, and/or select the most theoretically relevant mental‐state categories for your research question (avoid reducing the number of items in a category). Each of the three item sets in the CAT takes ~15 min to administer. In the present study, we included other tasks between CAT sets and took “wiggle” and “snack” breaks to reduce participant fatigue. We recommend similar practices in future use of this measure. Sets can also be administered across testing sessions (in the current study, several 3‐year‐olds completed sets across two testing sessions ~1 week apart). Given that the subscales for each of the mental‐state categories and false sign showed good internal consistency when analyzed separately, researchers may be able to select only the subscales that are highly relevant to their research question and leave out others. For example, researchers may opt to only administer items in a subset of categories (e.g., only the false‐belief and visual perspective‐taking items). We recommend that whatever subscales are selected, researchers follow randomization and counterbalancing practices for the items administered in line with the present study. We do not recommend reducing the number of items within a given category, given that including more items that assess the same construct results in a more reliable scale (Cronbach 1951; Green and Yang 2009). We also acknowledge that our IRT analyses were conducted on the full set of items, and new combinations of subscales would benefit from similar psychometric analysis.
4. Administer the CAT online or in person. The psychometric analyses reported here are based on data collected online (over Zoom). However, we see no reason to restrict administration to online settings. As noted above, research has revealed comparable performance on children's ToM tasks when tested online versus in person, including on false‐belief items similar to those used here (Schidelko et al. 2021). In ongoing studies, we are successfully administering the CAT in person in a laboratory setting to a comparable sample of children (current n = 31 neurotypical children; 11 girls, 20 boys; 3–5 years of age, M = 55.47 months, SD = 7.88). We acknowledge that in‐person performance on the CAT should be psychometrically evaluated to confirm its appropriateness outside of online settings, but at this time we have no reason to believe the measure will perform substantially differently across testing settings.
4. Conclusion
The present study introduces and evaluates the Comprehensive Assessment of ToM (CAT), a novel multi‐mental‐state measure of children's understanding of diverse desires, diverse beliefs, knowledge access, knowledge expertise, false belief, visual perspective taking, and false sign. Psychometric analyses using item response theory (IRT), together with examined relations to age and parent‐reported ToM (ToMI‐2; Hutchins et al. 2012), establish the measure as reliable and valid. Further support for the measure's validity comes from its replication of nuances in children's ToM development demonstrated in prior studies targeting those specific aspects of ToM (e.g., differences in reasoning about desires versus beliefs, one's own mental states versus others', and level‐1 versus level‐2 perspective taking). Overall, findings from the CAT (which includes three to six items per category of mental state and false sign, as well as both "prediction" and "explanation" questions) suggest the need to consider how differences in item story context may influence children's ToM performance, and indicate that including multiple items and question types can help reveal mental‐state understanding that emerges across varying item‐specific task demands.
Supporting information
Appendix S1.
Acknowledgments
We thank the parents and children who participated in this study, without whom this work would not be possible. We also thank Simona Ghetti and Yuko Munakata, who contributed valuable feedback during the creation of this measure.
Funding: The authors received no specific funding for this work.
Data Availability Statement
Data are available from the corresponding author upon reasonable request. Methods, materials, analysis code, and preregistration information for the present study are available at https://osf.io/4d9ar/.
References
- Ali, U. S., Chang H., and Anderson C. J. 2015. "Location Indices for Ordinal Polytomous Items Based on Item Response Theory." ETS Research Report Series 2: 1–13. 10.1002/ets2.12065.
- Anwyl‐Irvine, A. L., Massonnié J., Flitton A., Kirkham N., and Evershed J. K. 2020. "Gorilla in Our Midst: An Online Behavioral Experiment Builder." Behavior Research Methods 52: 388–407. 10.3758/s13428-019-01237-x.
- Apperly, I. A., and Butterfill S. A. 2009. "Do Humans Have Two Systems to Track Beliefs and Belief‐Like States?" Psychological Review 116, no. 4: 953–970. 10.1037/a0016923.
- Austin, G., Bondü R., and Elsner B. 2020. "Executive Function, Theory of Mind, and Conduct‐Problem Symptoms in Middle Childhood." Frontiers in Psychology 11: 539.
- Bartsch, K., and Wellman H. M. 1995. Children Talk About the Mind. Oxford University Press.
- Beaudoin, C., Leblanc E., Gagner C., and Beauchamp M. H. 2020. "Systematic Review and Inventory of Theory of Mind Measures for Young Children." Frontiers in Psychology 10: 2905. 10.3389/fpsyg.2019.02905.
- Blijd‐Hoogewys, E. M. A., Van Geert P. L. C., Serra M., and Minderaa R. B. 2008. "Measuring Theory of Mind in Children: Psychometric Properties of the ToM Storybooks." Journal of Autism and Developmental Disorders 38: 1907–1930. 10.1007/s10803-008-0585-3.
- Boomsma, A., and Hoogland J. J. 2001. "The Robustness of LISREL Modeling Revisited." In Structural Equation Models: Present and Future: A Festschrift in Honor of Karl Jöreskog, 139–168.
- Bowman, L. C., and Brandone A. C. 2024. "Neural Correlates of Preschoolers' Passive‐Viewing False Belief: Insights Into Continuity and Change and the Function of Right Temporoparietal Activity in Theory of Mind Development." Developmental Science 27: e13530.
- Bowman, L. C., Kovelman I., Hu X., and Wellman H. M. 2015. "Children's Belief‐ and Desire‐Reasoning in the Temporoparietal Junction: Evidence for Specialization From Functional Near‐Infrared Spectroscopy." Frontiers in Human Neuroscience 9: 560.
- Bowman, L. C., and Wellman H. M. 2014. "Neuroscience Contributions to Childhood Theory of Mind Development." In Contemporary Perspectives on Research in Theories of Mind in Early Childhood Education, edited by Saracho O. N., 195–224. Information Age Publishing.
- Brown, T. A. 2006. Confirmatory Factor Analysis for Applied Research. Guilford Press.
- Callaghan, T. C., Rochat P., and Corbit J. 2012. "Young Children's Knowledge of the Representational Function of Pictorial Symbols: Development Across the Preschool Years in Three Cultures." Journal of Cognition and Development 13: 320–353.
- Carlson, S. M., Mandell D. J., and Williams L. 2004. "Executive Function and Theory of Mind: Stability and Prediction From Ages 2 to 3." Developmental Psychology 40: 1105–1122.
- Carlson, S. M., and Moses L. J. 2001. "Individual Differences in Inhibitory Control and Children's Theory of Mind." Child Development 72, no. 4: 1032–1053. 10.1111/1467-8624.00333.
- Chalmers, R. P. 2012. "mirt: A Multidimensional Item Response Theory Package for the R Environment." Journal of Statistical Software 48: 1–29. 10.18637/jss.v048.i06.
- Cronbach, L. J. 1951. "Coefficient Alpha and the Internal Structure of Tests." Psychometrika 16, no. 3: 297–334. 10.1007/BF02310555.
- Devine, R. T., and Hughes C. 2014. "Relations Between False Belief Understanding and Executive Function in Early Childhood: A Meta‐Analysis." Child Development 85: 1777–1794.
- Embretson, S. E., and Reise S. P. 2000. Item Response Theory for Psychologists. Lawrence Erlbaum Associates.
- Fabricius, W. V., Boyer T. W., Weimer A. A., and Carroll K. 2010. "True or False: Do 5‐Year‐Olds Understand Belief?" Developmental Psychology 46: 1402–1416.
- Fabricius, W. V., Gonzales C. R., Pesch A., et al. 2021. "Perceptual Access Reasoning (PAR) in Developing a Representational Theory of Mind." Monographs of the Society for Research in Child Development 86: 7–154. 10.1111/mono.12432.
- Flavell, J. H. 2013. "Perspectives on Perspective Taking." In Piaget's Theory, 107–139. Psychology Press.
- Ford, R. M., Driscoll T., Shum D., and Macaulay C. E. 2012. "Executive and Theory‐of‐Mind Contributions to Event‐Based Prospective Memory in Children: Exploring the Self‐Projection Hypothesis." Journal of Experimental Child Psychology 111: 468–489.
- Gallant, C. M., Lavis L., and Mahy C. E. 2020. "Developing an Understanding of Others' Emotional States: Relations Among Affective Theory of Mind and Empathy Measures in Early Childhood." British Journal of Developmental Psychology 38: 151–166.
- Gonzales, C. R., Fabricius W. V., and Kupfer A. S. 2018. "Introspection Plays an Early Role in Children's Explicit Theory of Mind Development." Child Development 89: 1545–1552.
- Gopnik, A., and Astington J. W. 1988. "Children's Understanding of Representational Change and Its Relation to the Understanding of False Belief and the Appearance‐Reality Distinction." Child Development 59: 26–37.
- Gopnik, A., and Wellman H. M. 1992. "Why the Child's Theory of Mind Really Is a Theory." Mind & Language 7: 145–171.
- Green, S. B., and Yang Y. 2009. "Reliability of Summed Item Scores Using Structural Equation Modeling: An Alternative to Coefficient Alpha." Psychometrika 74: 155–167.
- Greenslade, K., and Coggins T. 2016. "Brief Report: An Independent Replication and Extension of Psychometric Evidence Supporting the Theory of Mind Inventory." Journal of Autism and Developmental Disorders 46: 2785–2790.
- Guttman, L. 1944. "A Basis for Scaling Qualitative Data." American Sociological Review 9: 139–150.
- Gweon, H., Dodell‐Feder D., Bedny M., and Saxe R. 2012. "Theory of Mind Performance in Children Correlates With Functional Specialization of a Brain Region for Thinking About Thoughts." Child Development 83: 1853–1868. 10.1111/j.1467-8624.2012.01829.x.
- Happé, F. G. E. 1994. "An Advanced Test of Theory of Mind: Understanding of Story Characters' Thoughts and Feelings by Able Autistic, Mentally Handicapped, and Normal Children and Adults." Journal of Autism and Developmental Disorders 24: 129–154.
- Hoogland, J. J., and Boomsma A. 1998. "Robustness Studies in Covariance Structure Modeling: An Overview and a Meta‐Analysis." Sociological Methods & Research 26, no. 3: 329–367.
- Hu, L. T., and Bentler P. M. 1999. "Cutoff Criteria for Fit Indexes in Covariance Structure Analysis: Conventional Criteria Versus New Alternatives." Structural Equation Modeling: A Multidisciplinary Journal 6: 1–55. 10.1080/10705519909540118.
- Hutchins, T. L., Prelock P. A., and Bonazinga L. A. 2012. "Psychometric Evaluation of the Theory of Mind Inventory (ToMI): A Study of Typically Developing Children and Children With Autism Spectrum Disorder." Journal of Autism and Developmental Disorders 42: 327–341.
- Hutchins, T. L., Prelock P. A., and Bouyea L. B. 2014. "Technical Manual for the Theory of Mind Inventory‐2." https://www.theoryofmindinventory.com/task-battery.
- Iao, L. S., Leekam S., Perner J., and McConachie H. 2011. "Further Evidence for Nonspecificity of Theory of Mind in Preschoolers: Training and Transferability in the Understanding of False Beliefs and False Signs." Journal of Cognition and Development 12: 56–79.
- Kaufman, A. S., and Kaufman N. L. 2004. Kaufman Brief Intelligence Test–Second Edition (KBIT‐2). American Guidance Service.
- Kline, R. B. 2015. Principles and Practice of Structural Equation Modeling. Guilford Publications.
- Kramer, H. J., Lagattuta K. H., and Sayfan L. 2015. "Why Is Happy–Sad More Difficult? Focal Emotional Information Impairs Inhibitory Control in Children and Adults." Emotion 15: 61–72.
- Lagattuta, K. H., and Weller D. 2013. "Interrelations Between Theory of Mind and Morality: A Developmental Perspective." In Handbook of Moral Development, 385–407. Psychology Press.
- Lane, J. D., Wellman H. M., and Evans E. M. 2014. "Approaching an Understanding of Omniscience From the Preschool Years to Early Adulthood." Developmental Psychology 50, no. 10: 2380–2392. 10.1037/a0037715.
- Leekam, S., Perner J., Healey L., and Sewell C. 2008. "False Signs and the Non‐Specificity of Theory of Mind: Evidence That Preschoolers Have General Difficulties in Understanding Representations." British Journal of Developmental Psychology 26: 485–497.
- Liu, D., Wellman H. M., Tardif T., and Sabbagh M. A. 2008. "Theory of Mind Development in Chinese Children: A Meta‐Analysis of False‐Belief Understanding Across Cultures and Languages." Developmental Psychology 44: 523–531. 10.1037/0012-1649.44.2.523.
- Livesey, D., Keen J., Rouse J., and White F. 2006. "The Relationship Between Measures of Executive Function, Motor Performance and Externalising Behaviour in 5‐ and 6‐Year‐Old Children." Human Movement Science 25: 50–64.
- Masangkay, Z. S., McCluskey K. A., McIntyre C. W., Sims‐Knight J., Vaughn B. E., and Flavell J. H. 1974. "The Early Development of Inferences About the Visual Percepts of Others." Child Development 45: 357–366. 10.2307/1127956.
- Milligan, K., Astington J. W., and Dack L. A. 2007. "Language and Theory of Mind: Meta‐Analysis of the Relation Between Language Ability and False‐Belief Understanding." Child Development 78: 622–646. 10.1111/j.1467-8624.2007.01018.x.
- Muris, P., Steerneman P., Meesters C., et al. 1999. "The TOM Test: A New Instrument for Assessing Theory of Mind in Normal Children and Children With Pervasive Developmental Disorders." Journal of Autism and Developmental Disorders 29: 67–80. 10.1023/A:1025922717020.
- Oktay‐Gür, N., and Rakoczy H. 2017. "Children's Difficulty With True Belief Tasks: Competence Deficit or Performance Problem?" Cognition 166: 28–41.
- Osterhaus, C., Koerber S., and Sodian B. 2016. "Scaling of Advanced Theory‐of‐Mind Tasks." Child Development 87, no. 6: 1971–1991. 10.1111/cdev.12566.
- Osterhaus, C., Kristen‐Antonow S., Kloo D., and Sodian B. 2022. "Advanced Scaling and Modeling of Children's Theory of Mind Competencies: Longitudinal Findings in 4‐ to 6‐Year‐Olds." International Journal of Behavioral Development 46: 251–259.
- Peterson, C. C., Wellman H. M., and Slaughter V. 2012. "The Mind Behind the Message: Advancing Theory‐of‐Mind Scales for Typically Developing Children, and Those With Deafness, Autism, or Asperger Syndrome." Child Development 83: 469–485.
- Poarch, G. J., and van Hell J. G. 2019. "Does Performance on Executive Function Tasks Correlate?" In Bilingualism, Executive Function, and Beyond: Questions and Insights, 223–236.
- R Core Team. 2019. R: A Language and Environment for Statistical Computing (Version 3.6.2). R Foundation for Statistical Computing. www.R-project.org.
- Reeve, B. B., and Fayers P. 2005. "Applying Item Response Theory Modeling for Evaluating Questionnaire Item and Scale Properties." In Assessing Quality of Life in Clinical Trials: Methods and Practice, edited by Fayers P. M. and Hays R. D. Oxford University Press.
- Richardson, H., Lisandrelli G., Riobueno‐Naylor A., and Saxe R. 2018. "Development of the Social Brain From Age Three to Twelve Years." Nature Communications 9: 1027.
- Riggs, K. J., and Simpson A. 2005. "Young Children Have Difficulty Ascribing True Beliefs." Developmental Science 8: F27–F30.
- Rizopoulos, D. 2006. "ltm: An R Package for Latent Variable Modeling and Item Response Analysis." Journal of Statistical Software 17: 1–25. 10.18637/jss.v017.i05.
- Ross‐Sheehy, S., Schneegans S., and Spencer J. P. 2015. "The Infant Orienting With Attention Task: Assessing the Neural Basis of Spatial Attention in Infancy." Infancy 20: 467–506.
- Sabbagh, M. A., Moses L. J., and Shiverick S. 2006. "Executive Functioning and Preschoolers' Understanding of False Beliefs, False Photographs, and False Signs." Child Development 77: 1034–1049. 10.1111/j.1467-8624.2006.00917.x.
- Samejima, F. 1969. Estimation of Latent Ability Using a Response Pattern of Graded Scores. Psychometric Monograph Vol. 17. Psychometric Society.
- Schidelko, L. P., Schünemann B., Rakoczy H., and Proft M. 2021. "Online Testing Yields the Same Results as Lab Testing: A Validation Study With the False Belief Task." Frontiers in Psychology 12: 703238.
- Schreiber, J. B., Nora A., Stage F. K., Barlow E. A., and King J. 2006. "Reporting Structural Equation Modeling and Confirmatory Factor Analysis Results: A Review." Journal of Educational Research 99, no. 6: 323–338. 10.3200/JOER.99.6.323-338.
- Schurz, M., Radua J., Aichhorn M., Richlan F., and Perner J. 2014. "Fractionating Theory of Mind: A Meta‐Analysis of Functional Brain Imaging Studies." Neuroscience & Biobehavioral Reviews 42: 9–34.
- Shahaeian, A., Peterson C. C., Slaughter V., and Wellman H. M. 2011. "Culture and the Sequence of Steps in Theory of Mind Development." Developmental Psychology 47: 1239–1247.
- Spearman, C. 1904. "The Proof and Measurement of Association Between Two Things." American Journal of Psychology 15: 72–101.
- Talwar, V., Crossman A., and Wyman J. 2017. "The Role of Executive Functioning and Theory of Mind in Children's Lies for Another and for Themselves." Early Childhood Research Quarterly 41: 126–135.
- Todd, A. R., Cameron C. D., and Simpson A. J. 2017. "Dissociating Processes Underlying Level‐1 Visual Perspective Taking in Adults." Cognition 159: 97–101.
- Wellman, H. M., and Brandone A. C. 2009. "Early Intention Understandings That Are Common to Primates Predict Children's Later Theory of Mind." Current Opinion in Neurobiology 19: 57–62.
- Wellman, H. M., Cross D., and Watson J. 2001. "Meta‐Analysis of Theory‐of‐Mind Development: The Truth About False Belief." Child Development 72: 655–684.
- Wellman, H. M., Fang F., Liu D., Zhu L., and Liu G. 2006. "Scaling of Theory‐of‐Mind Understandings in Chinese Children." Psychological Science 17: 1075–1081.
- Wellman, H. M., Fang F., and Peterson C. C. 2011. "Sequential Progressions in a Theory‐of‐Mind Scale: Longitudinal Perspectives." Child Development 82: 780–792.
- Wellman, H. M., and Liu D. 2004. "Scaling of Theory‐of‐Mind Tasks." Child Development 75: 523–541. 10.1111/j.1467-8624.2004.00691.x.
- Wellman, H. M., and Woolley J. D. 1990. "From Simple Desires to Ordinary Beliefs: The Early Development of Everyday Psychology." Cognition 35: 245–275.
- Wimmer, H., and Perner J. 1983. "Beliefs About Beliefs: Representation and Constraining Function of Wrong Beliefs in Young Children's Understanding of Deception." Cognition 13, no. 1: 103–128.
- Woodward, A. L. 2009. "Infants' Grasp of Others' Intentions." Current Directions in Psychological Science 18: 53–57.
- Zaitchik, D. 1990. "When Representations Conflict With Reality: The Preschooler's Problem With False Beliefs and 'False' Photographs." Cognition 35: 41–68.