Abstract
There has been perennial interest in personal qualities other than cognitive ability that determine success, including self-control, grit, growth mindset, and many others. Attempts to measure such qualities for the purposes of educational policy and practice, however, are more recent. In this article, we identify serious challenges to doing so. We first address confusion over terminology, including the descriptor “non-cognitive.” We conclude that debate over the optimal name for this broad category of personal qualities obscures substantial agreement about the specific attributes worth measuring. Next, we discuss advantages and limitations of different measures. In particular, we compare self-report questionnaires, teacher-report questionnaires, and performance tasks, using self-control as an illustrative case study to make the general point that each approach is imperfect in its own way. Finally, we discuss how each measure’s imperfections can affect its suitability for program evaluation, accountability, individual diagnosis, and practice improvement. For example, we do not believe any available measure is suitable for between-school accountability judgments. In addition to urging caution among policymakers and practitioners, we highlight medium-term innovations that may make measures of these personal qualities more suitable for educational purposes.
Keywords: psychological assessment, non-cognitive, accountability, improvement research, character
Measurement matters. While reason and imagination also advance knowledge (Kuhn, 1961), only measurement makes it possible to observe patterns and to experiment—to put our guesses about what is and is not true to the test (Kelvin, 1883). From a practical standpoint, intentionally changing something is dramatically easier when you can quantify with precision how much or how little of it there is (Drucker, 1974).
In recent years, scholars, practitioners, and the lay public have grown increasingly interested in measuring and changing attributes other than cognitive ability (Heckman & Kautz, 2013; Levin, 2013; Naemi, Burrus, Kyllonen, & Roberts, 2012; Stecher & Hamilton, 2014; Tough, 2013; Willingham, 1985). These so-called “non-cognitive” qualities are diverse and collectively facilitate goal-directed effort (e.g., grit, self-control, growth mindset), healthy social relationships (e.g., gratitude, emotional intelligence, social belonging), and sound judgment and decision making (e.g., curiosity, open-mindedness). Longitudinal research has confirmed that such qualities powerfully predict academic, economic, social, psychological, and physical well-being (Almlund, Duckworth, Heckman, & Kautz, 2011; Borghans, Duckworth, Heckman, & ter Weel, 2008; Farrington et al., 2012; Jackson, Connolly, Garrison, Levin, & Connolly, 2015; Moffitt et al., 2011; Naemi et al., 2012; Yeager & Walton, 2011).
We share this more expansive view of student competence and well-being, but we also believe that enthusiasm for these factors should be tempered with appreciation for the many limitations of currently available measures. In this essay, our claim is not that everything that counts can be counted, nor that everything that can be counted counts. Rather, we argue that the field urgently requires much greater clarity about how well, at present, we are able to count some of the things that count.
A Rose by Any Other Name: Naming and Defining the Category
Reliable and predictive performance tasks to assess academic aptitude (i.e., the capacity to acquire new academic skills and knowledge) and academic achievement (i.e., previously acquired skills and knowledge) have been available for well over a century (Roberts, Markham, Matthews, & Zeidner, 2005). The influence of such measures on contemporary educational policy and practice is hard to overstate.
Yet parallel measures for human attributes other than cognitive ability have not followed suit. Notably, pioneers in the measurement of cognitive ability shared the intuition that these other qualities were crucial to success both in and out of the classroom. For instance, the creators of the first valid IQ test wrote that success in school “admits of other things than intelligence; to succeed in his studies, one must have qualities which depend especially on attention, will, and character” (Binet & Simon, 1916, p. 254). The author of the widely used Wechsler tests of cognitive ability likewise observed that “in addition to intellective there are also definite non-intellective factors which determine intelligent behavior” (Wechsler, 1943, p. 103). Our guess is that the present asymmetry represents more of an engineering problem than a difference in importance: attributes other than cognitive ability are just as consequential but may be harder to measure (Stecher & Hamilton, 2014).
Of the descriptor “non-cognitive,” Easton (2013) has pointed out, “Everybody hates this term but everyone knows roughly what you mean when you use it…” Where did the term originate? Messick (1979) explains: “Once the term cognitive is appropriated to refer to intellective abilities and subject-matter achievement in conventional school areas…the term noncognitive comes to the fore by default to describe everything else” (p. 282). The term is problematic. Arguably too broad to be useful,[1] this terminology also seems to imply that there are features of human behavior that are devoid of cognition. On the contrary, every facet of psychological functioning, from perception to personality, is inherently “cognitive” insofar as processing of information is involved. For example, self-control, a canonical “non-cognitive” attribute, depends crucially on how temptations are represented in the mind. Cognitive strategies that recast temptations in less alluring terms (e.g., thinking about a marshmallow as a fluffy white cloud instead of a sticky, sweet treat) dramatically improve our ability to resist them (Fujita, 2011; Mischel et al., 2011). And exercising self-control also relies on executive function, a suite of top-down cognitive processes including working memory (Blair & Raver, 2015; Diamond, 2013). Hence, from a psychological perspective, the term is simply inaccurate.
Given such obvious deficiencies, several alternatives have emerged. Without exception, these terms have both proponents and critics. For example, some prefer—while others, with equal fervor, detest—the terms character (Berkowitz, 2012; Damon, 2010; Peterson & Seligman, 2004; Tough, 2013), character skills (Heckman & Kautz, 2014), or virtue (Kristjansson, 2013; for a review of moral character education, see Lapsley & Yeager, 2012). To speak of character or virtue is, obviously, to speak of admirable and beneficial qualities. This usefully ties contemporary efforts toward the cultivation of such positive qualities to venerated thinkers of the past, from Plato and Aristotle to Benjamin Franklin and Horace Mann to Martin Luther King Jr., who in 1947 declared, “Intelligence plus character—that is the goal of true education.”
Many educators, however, prefer terminology without moral connotations. Some have adopted the term social and emotional learning (SEL) competencies, a phrase which highlights the relevance of emotions and social relationships to any complete view of child development (Durlak et al., 2015; Elias, 1997; Weissberg & Cascarino, 2013). SEL terminology has grown increasingly popular in education, and a search on Google’s Ngram Viewer shows that mentions of the phrase “social and emotional learning” have increased 19-fold in published books since its introduction in 1994 (Merrell & Gueldner, 2010). The SEL moniker may, however, inadvertently suggest a distinction from academic priorities, even though the data show that children perform better in school when SEL competencies are developed (Durlak, Weissberg, Dymnicki, Taylor, & Schellinger, 2011).
Psychologists who study individual differences among children might alternatively suggest the terms personality, dispositions, and temperament. But such “trait” terminology may incorrectly suggest that these attributes cannot be changed by people’s experiences, and the connotation of immutability is at odds with both empirical evidence (Caspi, Roberts, & Shiner, 2005; Roberts & DelVecchio, 2000; Roberts, Walton, & Viechtbauer, 2006) and pedagogical aims (Tough, 2011). Indeed, widespread interest in personal qualities is fueled in large part by the assumption that students can learn, practice, and improve them.[2]
Next, the terms twenty-first century skills, twenty-first century competencies, and new basic skills have made their timely appearance (Murnane & Levy, 1996; Pellegrino & Hilton, 2012; Soland, Hamilton, & Stecher, 2013). Likewise, some authors have used the term soft skills (Heckman & Kautz, 2012). Unlike “trait” terminology, “skill” terminology usefully connotes malleability. However, referring to skills may implicitly exclude beliefs (e.g., growth mindset), values (e.g., prosocial motivation), and other relational attitudes (e.g., trust). The narrowness of “skill” terminology is obvious when considering attributes like gratitude, generosity, and honesty. Yes, these behaviors can be practiced and improved, but an authentic desire to be grateful, generous, or honest is an essential aspect of these dispositions. As far as the descriptor “twenty-first century” or “new” is concerned, it seems fair to question whether attributes like self-control and gratitude—of central concern to every major philosophical and religious tradition since ancient times—are of special relevance to modernity. Indeed, these may be more timeless than timely.
Finally, all of these terms—virtues, traits, competencies, or skills—have the disadvantage of implying that they are consistently demonstrated across all possible life situations. But they are not (Fleeson & Noftle, 2008; Mischel, 1968; Ross & Nisbett, 1991; Ross, Lepper, & Ward, 2010; Wagerman & Funder, 2009). For instance, self-control is undermined when people are laboring under the burden of a negative stereotype (Inzlicht & Kang, 2010) or when authority figures are perceived as unreliable (Kidd, Palmeri, & Aslin, 2013; Mischel, 1961). Learners are grittier when they have been asked to reflect on their purpose in life (Yeager et al., 2014), and organizations can create a fixed mindset climate that undermines employee motivation independently of employees’ own prior mindset beliefs (Murphy & Dweck, 2009).
We believe that all of the above terms refer to the same conceptual space, even if connotations (e.g., morality, mutability, or consistency across settings) differ. Crucially, all of the attributes of interest are (a) conceptually independent from cognitive ability, (b) generally accepted as beneficial to the student and to others in society, (c) relatively rank-order stable over time in the absence of exogenous forces (e.g., intentional intervention, life events, changes in social roles), (d) potentially responsive to intervention, and (e) dependent on situational factors for their expression.
From a scientific perspective, agreement about the optimal terminology for the overarching category of interest may be less important than consensus about the specific attributes in question and, in particular, their definition and measurement. Of course, a community of practice (e.g., a school district, a reform movement, a networked improvement community) benefits from consensual terminology (Bryk, Gomez, Grunow, & LeMahieu, 2015). Marching under the same flag, rather than several different ones, would make more obvious the fact that many researchers and educators are working to measure and improve the same student attributes (Bryk, Gomez, Grunow, & LeMahieu, 2015; Langley et al., 2009). However, because each community of practice has its own previously established concerns and priorities, the choice of a motivating umbrella term is perhaps best left to these groups themselves and not to theoretical psychologists.
Our view is pragmatic, not ideological. We suggest that the potentially interminable debate about what to call this category of student attributes draws attention away from the very urgent question of how to measure them. In this review, we refer to personal qualities as shorthand for “positive personal qualities other than cognitive ability that lead to student success” (see Willingham, 1985). Of course, this terminology is provisional because it, too, has flaws. For instance, attitudes and beliefs are not quite satisfactorily described as “qualities” per se. In any case, we expect that communities of research or practice will adopt more descriptive terms as they see fit.
Advantages and Limitations of Common Measures
No measure is perfect. We attempt here an incomplete sketch of the limitations and advantages of three common approaches to measuring this set of personal qualities other than cognitive ability: (a) self-report questionnaires administered to students, (b) questionnaires administered to teachers about their students, and (c) performance tasks. Throughout, we illustrate our points with one important and well-studied personal quality—self-control. Self-control refers to the regulation of attention, emotion, and behavior when enduringly valued goals conflict with more immediately pleasurable temptations. This is an informative example because research on self-control is burgeoning (Carlson, Zelazo, & Faja, 2013). Moreover, longitudinal research supports earlier speculation (Freud, 1920) that self-control is essential to success in just about every arena of life, including academic achievement (de Ridder et al., 2012; Duckworth & Carlson, 2013; Mischel, 2014; Moffitt et al., 2011). Where appropriate, we draw on other examples, such as grit or growth mindset. With these few brushstrokes, summarized in Table 1 and discussed briefly below, we hope to depict the contemporary landscape of measurement as we see it.
Table 1.
Serious Limitations of Questionnaires and Performance Tasks
| Limitation | Description |
| --- | --- |
| **Self-report and teacher-report questionnaires** | |
| Misinterpretation by participant | Student or teacher may read or interpret the item in a way that differs from researcher intent |
| Lack of insight or information | Student or teacher may not be an astute or accurate reporter of behaviors or internal states (e.g., emotions, motivation) for a variety of reasons |
| Insensitivity to short-term changes | Questionnaire scores may not reflect subtle changes over short periods of time |
| Reference bias | The frame of reference (i.e., implicit standards) used when making judgments may differ across students or teachers |
| Faking and social desirability bias | Students or teachers may provide answers that are desirable but not accurate |
| **Performance tasks** | |
| Misinterpretation by researcher | Researchers may make inaccurate assumptions about underlying reasons for student behavior |
| Insensitivity to typical behavior | Tasks that optimize motivation to perform well (i.e., elicit maximal performance) may not reflect behavior in everyday situations |
| Task impurity | Task performance may be influenced by irrelevant competencies (e.g., hand-eye coordination) |
| Artificial situations | Performance tasks may foist students into situations (e.g., doing academic work with distracting videogames in view) that they might proactively avoid in real life |
| Practice effects | Scores on sequential administrations may be less accurate (e.g., because of increased familiarity with the task or boredom) |
| Extraneous situational influences | Task performance may be influenced by aspects of the environment in which the task is performed or by physiological state (e.g., time of day, noise in classroom, hunger, fatigue) |
| Random error | Scores may be influenced by purely random error (e.g., respondent randomly marking the wrong answer) |
Self-Report and Teacher-Report Questionnaires
For good reason, self-report and teacher-report questionnaires are the most common approaches to assessing personal qualities among both researchers and practitioners. Questionnaires are cheap, quick, reliable, and in many cases remarkably predictive of objectively measured outcomes (Connelly & Ones, 2010; Duckworth, Tsukayama, & May, 2010; Hightower et al., 1986; Jackson, Connolly, Garrison, Levin, & Connolly, 2015; Lucas & Baird, 2006; Roberts, Kuncel, Shiner, Caspi, & Goldberg, 2007). Furthermore, a very large literature in social and cognitive psychology confirms that people are relatively good at using questionnaires to communicate their true opinions—provided that they in fact have answers for the questions asked and feel comfortable reporting accurately on them (see Krosnick & Fabrigar, forthcoming; Krosnick, 1999; Schuman & Presser, 1981). Indeed, self-report questionnaires are arguably better suited than any other measure for assessing internal psychological states like feelings of belonging.
Questionnaires typically ask individuals to integrate numerous observations of thoughts, feelings, or behavior over a specified period of time ranging from “at this moment” to “in general.” For example, the Character Growth Card includes a self-control item that reads, “During the past marking period, I came to class prepared” and provides response options ranging from “almost never” to “almost always” (Park, Tsukayama, Patrick, & Duckworth, 2015).
The process by which students answer this question or any other self-report item is depicted in Figure 1: (1) students must first read and understand the question, then (2) search their memories for relevant information, (3) integrate whatever information comes to mind into a summary judgment, (4) translate this judgment into one of the offered response options, and finally (5) “edit” their response if motivated to do so (Krosnick & Presser, 2010; Schwarz & Oyserman, 2001; Tourangeau, Rips, & Rasinski, 2000). Teacher-report questionnaires work the same way, except that it is the teacher who integrates observations of the student over time and arrives at a judgment with respect to his or her own standards. Individuals can carry out this kind of self-judgment and other-judgment arithmetic with admirable accuracy and precision (Funder, 2012).
Figure 1.
The process by which students and teachers respond to questionnaire items.
A catalogue of threats to validity can be assembled by considering potential failures at each stage. For (1) encoding the meaning of the questionnaire items, literacy is an obvious concern, particularly for younger or lower-achieving students. Beyond vocabulary, it cannot be assumed that students always understand the pragmatic meaning—the intended idea—of questionnaire items. For example, self-control questionnaires aim to assess the self-initiated regulation of conflicting impulses (e.g., wanting to get homework done because it is important but, at the same time, wanting to play videogames because they are more fun). Yet students or teachers may instead interpret items as asking about compliance with authority (e.g., following directions simply because an adult asked).
After encoding the question itself, individuals must (2) search their memories and (3) integrate recalled information into a summary judgment. For both students and teachers, mentally integrating across past observations can reduce sensitivity to how behavior is now as compared to before (for a compelling empirical example of errors in these judgments, see Bowman, 2010). Moreover, individuals tend to see themselves and others as holding consistent beliefs and attitudes over time, and this bias toward consistency can affect what information is retrieved as well as how it is evaluated (Mischel, 1968; Nisbett & Wilson, 1977; Podsakoff, MacKenzie, Lee, & Podsakoff, 2003; Ross & Nisbett, 1991; Sabini, Siepmann, & Stein, 2001).
When (3) coming to a summary judgment, teachers have the benefit of a non-egocentric perspective, as well as experience with many other same-aged students over the course of their careers. Nevertheless, end-of-year teacher reports may be colored by first impressions and therefore underestimate change (see moderation results in Raudenbush, 1984). In addition, many teachers only see their students in the classroom setting. Because behavior can vary across contexts (Mischel, 1968; Ross & Nisbett, 1991; Tsukayama, Duckworth, & Kim, 2013), teacher observations may not agree with those made by parents, who may see their child in every context except school. Not surprisingly, correlations between two teachers’ ratings of the same students tend to be higher than correlations between parent and teacher ratings (Achenbach, McConaughy, & Howell, 1987).
Another limitation of teacher-report questionnaires is the potential for teachers to misinterpret student behavior. People’s inferences about why others act the way they do are not always accurate (e.g., Dodge, 1980). For instance, it might seem reasonable to conclude that students who reliably complete all of their homework assignments on time are highly self-controlled. Alternatively, it is possible that some assiduous students are so intrinsically motivated to do schoolwork that they do not find alternatives like texting and videogames at all tempting. If so, it is incorrect to infer that their conscientious academic behavior represents self-control (Duckworth & Steinberg, 2015).
Teachers’ ratings of students’ specific qualities can also be colored by their top-down, global evaluations. For instance, teachers may think “This is a good kid” and then conclude “This student must be good at delaying gratification” (see Abikoff, Courtney, Pelham, & Koplewicz, 1993; Babad, Inbar, & Rosenthal, 1982; Nisbett & Wilson, 1977).
Both students and teachers must use some frame of reference to arrive at their judgments, and the problem of “reference bias” refers to frames of reference that differ systematically across respondents (Heine, Lehman, Peng, & Greenholtz, 2002). For example, the more competent an individual is in a given domain, the more stringently they tend to judge themselves (Kruger & Dunning, 1999). Frames of reference are also influenced by the norms shared within—but not necessarily across—cultures. Thus, reference bias is most readily evidenced in paradoxical inconsistencies in cross-cultural research.
Reference bias is apparent in the PISA (Programme for International Student Assessment). Within-country analyses of the PISA show the expected positive association between self-reported conscientiousness and academic performance, but between-country analyses suggest that countries with higher conscientiousness ratings actually perform worse on math and reading tests (Kyllonen & Bertling, 2013). Norms for judging behavior can also vary across schools within the same country: students attending middle schools with higher admissions standards and test scores rate themselves lower in self-control (Goldman, 2006; M. West, personal communication, March 17, 2015). Likewise, KIPP charter school students report spending more time on homework each night than students at matched control schools, and they earn higher standardized achievement test scores—but score no higher on self-report questionnaire items such as “Went to all of your classes prepared” (Tuttle et al., 2013). Dobbie and Fryer (2013) report a similar finding for graduates of the Harlem Children’s Zone charter school. There can even be reference bias among students in different grade levels within the same school. Seniors in one study rated themselves higher in grit than did juniors in the same high school, but the exact opposite pattern was obtained in performance tasks of persistence (Egalite, Mills, & Greene, 2014).[3]
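The arithmetic of reference bias can be made concrete with a minimal simulation (our illustration, not drawn from the studies cited above; all parameter values are arbitrary assumptions). If the behavioral standard students judge themselves against rises with school quality, the self-report/achievement correlation can be positive within every school yet negative across schools:

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, n_students = 20, 200

school_reports, school_achievement = [], []
for _ in range(n_schools):
    quality = rng.normal()                  # latent school effectiveness
    standard = 1.5 * quality                # assumption: better schools set stricter norms
    true_sc = rng.normal(size=n_students)   # students' true self-control
    achievement = true_sc + quality + rng.normal(size=n_students)
    # Self-reports are judged relative to the local standard (reference bias).
    reports = true_sc - standard + rng.normal(scale=0.5, size=n_students)
    school_reports.append(reports)
    school_achievement.append(achievement)

# Within-school correlation: center on school means, then pool (comes out positive).
r_within = np.corrcoef(
    np.concatenate([r - r.mean() for r in school_reports]),
    np.concatenate([a - a.mean() for a in school_achievement]))[0, 1]

# Between-school correlation: compare school means (comes out negative).
r_between = np.corrcoef(
    [r.mean() for r in school_reports],
    [a.mean() for a in school_achievement])[0, 1]

print(f"within-school r = {r_within:.2f}, between-school r = {r_between:.2f}")
```

Average true self-control is identical across these simulated schools; the sign reversal at the school level is produced entirely by the school-specific frame of reference.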
In the final stages of responding to questionnaire items, individuals must (4) translate their judgment into one of the offered response options. Reference bias can be a problem here, too, insofar as what one respondent considers “rarely” may be what another respondent considers “often” (Pace & Friedlander, 1982).
Next, individuals may (5) amend their response in accordance with any of a number of motivations other than truth-telling. Potentially biasing reports is “acquiescence bias,” the inclination, particularly among younger students, to agree with statements regardless of their actual content (Saris, Revilla, Krosnick, & Shaeffer, 2010; Soto, John, Gosling, & Potter, 2008). Individuals may also withhold the truth simply because they would be embarrassed to admit it (Jones & Sigall, 1971).
Unfortunately, many methods thought to reduce social desirability response bias instead harm validity. For example, preemptive assurances of confidentiality can backfire if they imply that questionnaires will be about sensitive and potentially embarrassing topics (Schwarz & Oyserman, 2001). Moreover, assuring individuals of their anonymity can decrease response validity by removing accountability to be honest (Lelkes, Krosnick, Marx, Judd, & Park, 2012). And attempting to make adolescents feel comfortable reporting undesirable attitudes or behaviors by suggesting that “some people do X … other people do Y” implies to adolescents that the undesirable behavior is carried out by half of their peers, and so it artificially inflates reports of that behavior through conformity processes (Yeager & Krosnick, 2011). Unfortunately, scales purporting to measure individual differences in social desirability bias do not fulfill their promise (Uziel, 2010).
Finally, there is the problem of faking. The extent to which outright faking actually reduces the validity of questionnaires in real-world situations is hotly debated (Ziegler, MacCann, & Roberts, 2011), but the possibility of deliberately inflating or deflating scores on questionnaires is incontrovertible (Sackett, 2011).
Performance Tasks
As an alternative to asking a student or teacher to report on behavior, it is possible to observe behavior through performance tasks. A performance task is essentially a situation that has been carefully designed to elicit meaningful differences in behavior of a certain kind. Observing students in the identical contrived situation eliminates the possible confound of variation in the base rates of certain types of situations. For example, it is problematic to use “time spent doing homework” as an indicator of self-control if some students are assigned more homework than others (for example, when comparing students whose teachers or schools differ). But if all students are put in a situation where they have the same opportunity to do academic work, with the same opportunity to allocate their attention to entertaining diversions, then differences in time spent on academic work can be used to index self-control (Galla & Duckworth, 2015).
The most influential performance task in the large literature on self-control is the preschool delay of gratification paradigm, colloquially known as the “marshmallow test” (Mischel, 2014). At the start of the task, children are presented with a variety of treats and asked to pick their favorite. Some choose marshmallows, but others choose Oreos, chocolate candies, pretzels, and so on. Next, the less preferred treats are taken away, and the experimenter makes a smaller pile (e.g., one marshmallow) and a larger pile (e.g., two marshmallows). The experimenter asks the child whether he or she would prefer to have the small pile right away or, alternatively, to wait for the larger pile after the experimenter comes back from doing something unrelated in the hallway. The question is not which choice the child makes—in a national study of approximately one thousand preschool children, nearly all chose the larger, delayed treat (NICHD, 1999)—but rather, once the decision has been made, how long the child can endure the wait for the larger treat. Wait time in this standardized situation correlates positively with self-control ratings by parents and caregivers and, over a decade later, predicts higher report card grades and standardized test scores, lower self-reported reckless behavior, and healthier body weight, among other outcomes (Mischel, 2014; Tsukayama et al., 2013).
An advantage of performance tasks is that they do not rely upon the subjective judgments of students or teachers. This feature circumvents reference bias, social desirability bias, acquiescence bias, and faking. Relatedly, by assaying behavior at a moment in time, task measures could be more sensitive than questionnaires to subtle changes in behavior. Not surprisingly, several major studies examining the effects of either self-control interventions or age-related changes in self-control have used performance tasks to do so (Bierman, Nix, Greenberg, Blair, & Domitrovich, 2008; Blair & Raver, 2014; Diamond & Lee, 2011; Raver et al., 2011). Likewise, experiments that attempt to manipulate self-control in the short-term commonly measure change using performance tasks rather than questionnaire measures (e.g., Baumeister, Bratslavsky, Muraven, & Tice, 1998; Hagger, Wood, Stiff, & Chatzisarantis, 2010).
Of course, the advantages of performance tasks must be considered in tandem with their limitations. As is the case with teacher-reported questionnaires, performance tasks require drawing inferences about the internal motivations, emotions, and thoughts of students. For instance, is a child who refrains from playing with toys when instructed to do so exerting autonomous self-control, or does such behavior represent compliance with adult authority (see Aronson & Carlsmith, 1963; Eisenberg et al., 2004; Mischel & Liebert, 1967)? While the task itself is “objective,” interpreting performance is nevertheless “subjective” in the sense that behavior must be interpreted by the researcher.
Relatedly, a one-time performance task may be appropriate for assessing the capacity of a student to perform a certain behavior when maximally motivated to do so but not particularly diagnostic of their everyday behavior in typical life situations (Duckworth, 2009; Sackett, 2007). For many personal qualities (e.g., generosity, kindness, honesty), what matters most is how a child usually behaves, not how they could behave when trying their hardest. In these cases, performance tasks that assess behavior under optimally motivating circumstances miss the mark. Of course, for some personal qualities, assessing capacity may be appropriate, because the construct itself specifies an ability which may or may not be expressed in daily life. For example, performance task measures of emotional intelligence appropriately assess the ability—not the propensity—to perceive, understand, and manage emotions (Brackett & Geher, 2006; Brackett & Mayer, 2003).
Another limitation of performance tasks is their sensitivity to factors irrelevant to the attribute of interest. Miyake and Friedman (2012) call this the “task-impurity problem” (p. 8) and use as an example the Stroop task of executive function. Completing the Stroop task entails looking at the names of colors printed in variously colored ink. When the name of the color is different from the ink in which it is printed (e.g., the word “green” printed in red), then naming the ink color requires executive function. But executive function is not all that is required. Quick and accurate performance also requires color processing, verbal articulation, motivation to pay attention, and so on. Task impurity is thought to be one reason why performance tasks assessing executive function are only weakly correlated with questionnaire measures of self-control (Duckworth & Kern, 2011; Sharma, Markon, & Clark, 2014).
In addition, performance tasks may thrust individuals into situations they might have avoided if left to their own devices (Diener, Larson, & Emmons, 1984). Consider, for example, children faced with the dilemma of one treat now or two treats later in the delay of gratification task. In the test situation, children are not allowed to get up from their chair, occupy themselves with toys or books, or cover the treats with a plate or napkin. Outside of this constrained laboratory situation, any of these tactics might be employed in order to make waiting easier. In fact, more self-controlled adults say they very deliberately avoid temptations in everyday life (Ent, Baumeister, & Tice, 2015; Imhoff, Schmidt, & Gerstenberg, 2013), and as a consequence experience fewer urges to do things they will later regret (Hofmann, Baumeister, Forster, & Vohs, 2012). Thus, performance tasks foist individuals into identical circumstances so that we may assess their ability to navigate such situations, but this comes at the expense of knowing the extent to which they might have the judgment to proactively avoid or modify situations of that kind on their own (Duckworth, Gendler, & Gross, 2014).
To some extent, all performance tasks suffer from practice effects (or test-retest effects), defined broadly as the effect of repeated exposure to the same task. This is even true for the most “pure” measures of general cognitive ability (Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard, 2007; Reeve & Lam, 2005). Familiarity with the procedures of a task can undermine score validity when the task is intended to represent an ambiguous or novel situation (Burgess, 1997; Muller, Kerns, & Konkin, 2012). For example, a first-time experience with the delay of gratification task is not identical to a second-time encounter because expectations of when the experimenter will return to the room are altered (McGuire & Kable, 2013). Experience with a task may also lead to boredom or increased fluency with task procedures irrelevant to the target attribute. At present, almost nothing is known about the feasibility of developing parallel forms of performance tasks assessing personal qualities for repeated administration.
Because performance tasks are standardized situations in which to observe student behavior, they must be administered under carefully controlled conditions. For example, children in the delay of gratification task wait longer if they trust that the experimenter is actually going to deliver on the promise of two marshmallows later (Kidd et al., 2013). Likewise, performance on self-control tasks can suffer when performed in sequence after other effortful tasks (Hagger et al., 2010). Error increases, and precision decreases, the more these situational influences differ across students.
Moreover, situational influences on task performance that vary systematically across groups create bias and potentially misleading conclusions about group differences. For example, a task that assesses diligence on academic work cannot be properly interpreted if administered in a school setting characterized by frequent noisy intrusions (e.g., students walking in and out of the testing room) or especially crowded conditions (e.g., students sitting so closely that they are distracted by each other) (Galla et al., 2014). While questionnaire responses, too, can be influenced by transient situational influences, these effects may be small (see Lucas & Lawless, 2013). In our experience, performance tasks are especially sensitive to differences in administration, such as time of day or presence of ambient distractions.
Even when administered under optimally controlled conditions, performance tasks generate random error—the white noise produced by stochastic influences on behavior. This is especially problematic for performance tasks because most yield a single score (e.g., in the marshmallow test, the number of seconds a child can wait). Questionnaires, in contrast, usually include several different items designed to assess the same latent construct. Using multiple items exploits the principle of aggregation, which states that uncorrelated errors across items cancel out, thus reducing noise and increasing reliability (Clark & Watson, 1995; Rushton, Brainerd, & Pressley, 1983).
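The benefit of aggregation can be quantified with the classic Spearman–Brown formula (a standard psychometric result, not stated explicitly in the sources cited above): if each of $k$ parallel items has reliability $\rho$ and errors are uncorrelated, the composite has reliability

$$\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}.$$

For instance, four parallel items with individual reliability .40 yield a composite reliability of $4(.40)/[1 + 3(.40)] \approx .73$, which is why a multi-item questionnaire can be substantially more reliable than a single-score performance task.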
An obvious solution is to create a suite of different performance tasks to assess the same construct and then to aggregate results into a composite score. There are only a handful of precedents for this multi-task approach to assessing self-control (Hartshorne & May, 1929; White et al., 1994). The rarity of these studies suggests that the time, expense, and effort entailed in administering a battery of performance tasks to the same children is at present prohibitive in most applied settings. A single performance task can take as long as 20 minutes for a trained experimenter to administer; doing so several times across separate sessions (to avoid fatigue) would likely require many hours of testing time.
Valid for What?
As the above exposition demonstrates, perfectly unbiased, unfakeable, and error-free measures are an ideal, not a reality. Instead, researchers and practitioners have at their disposal an array of measures that have distinct advantages and limitations. Accordingly, measurement experts have emphasized that validity is not an inherent feature of a measure itself but rather a characteristic of a measure with respect to a particular end use (AERA, APA, NCME, 1999/2014). Thus, different measures, with their unique advantages and limitations, are differentially valid depending not only on their psychometric properties, but also on their intended application.
One important end use is basic research, and indeed this is the purpose for which most of the measures reviewed here were developed. Given the litany of limitations noted above, it is notable that measures of personal qualities have been shown in basic research studies to be predictive of consequential life outcomes months, years, or decades later (Almlund, Duckworth, Heckman, & Kautz, 2011; Borghans, Duckworth, Heckman, & ter Weel, 2008; Farrington et al., 2012; Moffitt et al., 2011; Naemi, Burrus, Kyllonen, & Roberts, 2012; Roberts et al., 2007). Of course, these research studies have sought to reject the null hypothesis of no relation between personal qualities and later life outcomes, under testing conditions where incentives to distort responses were minimal—a very different project than the applied uses we consider in this final section.
We attempt to explain how the problems with extant measures of these personal qualities can create threats to validity for more applied uses. Four common uses are program evaluation, accountability, individual diagnosis, and practice improvement. We make specific recommendations regarding each.
Program Evaluation
Many educational programs, including charter schools, in-school programming, and afterschool activities, aim to cultivate self-control, grit, emotional intelligence, and other personal qualities. Yet the above review makes it clear that in many cases self-report questionnaires have serious limitations for such evaluations. Reference bias may even produce results opposite of the truth when evaluating within-person program effects (i.e., a change from pre-test to post-test) or assessing between-program differences (i.e., mean-level differences among schools or programs), as noted above (e.g., Tuttle et al., 2013; West et al., 2015).
Teacher-report measures of personal qualities may be valid when program evaluation is occurring within schools (i.e., comparing classes in the same school, where the standard for a given characteristic is presumably held constant). However, when conducting between-school program evaluation—as is common—it seems likely that self-report and teacher-report questionnaires could be biased by a non-shared frame of reference. For example, teachers at schools with more rigorous standards of behavior may rate their students more stringently.
How then should between-school program evaluations be conducted? Performance tasks may be helpful (Blair & Diamond, 2008; Greenberg, 2010). However, they have the limitations noted above, including but not limited to: dependence on carefully controlled settings for proper administration, the need to tailor the task parameters to the age group, practice effects, and respondent burden. At the same time, performance tasks have perhaps the most important quality for program evaluation: objective, quantifiable behaviors that do not suffer from reference bias over time and across sites.
A potentially solvable engineering problem, in the medium term, is to create a suite of brief, scalable, age-specific performance tasks designed for group administration. Pioneers in the assessment of personal qualities foresaw this possibility but judged it prohibitively expensive (Hartshorne & May, 1929); they could not have predicted the proliferation of computers and wireless technology in schools. Imagine, for example, a set of web-based tasks of academic self-control accompanied by easy-to-follow protocols and checklists for administering them (e.g., Galla et al., 2014), as sketched below. Assuming that practice effects (i.e., test-retest effects) could be minimized, such task batteries might allow for meaningful, apples-to-apples comparisons across schools, among individuals within schools, or within individuals over time.
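As a sketch of what scoring such a web-based task might involve (a hypothetical event log and scoring rule of our own devising, not the actual implementation in Galla et al., 2014), imagine a task in which students can toggle between solving math problems and watching entertaining videos, with self-control indexed as the proportion of the session spent on academic work:

```python
from typing import List, Tuple

# Hypothetical log emitted by the task's web client: (seconds_elapsed, pane)
# recorded each time the student switches panes. Session starts at t = 0.
Event = Tuple[float, str]  # pane is "math" or "video"

def diligence_score(events: List[Event], session_end: float) -> float:
    """Return the fraction of session time spent on academic work."""
    time_on_task = 0.0
    boundaries = events + [(session_end, "end")]
    for (start, pane), (nxt, _) in zip(boundaries, boundaries[1:]):
        if pane == "math":
            time_on_task += nxt - start
    return time_on_task / session_end

log = [(0.0, "math"), (90.0, "video"), (150.0, "math"), (300.0, "video")]
print(diligence_score(log, session_end=360.0))  # (90 + 150) / 360 ≈ 0.67
```

Because the score is an objectively logged proportion rather than a judgment against local norms, it is immune to reference bias, although the other task limitations in Table 1 still apply.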
In sum, scalable batteries of performance tasks to assess various personal qualities would be of great value for program evaluation, especially as schools and districts seek to allocate limited funds wisely.
Accountability
Reference bias in questionnaire measures has a pernicious implication for accountability. Current data and theory suggest that schools that promote personal qualities most ably—and thereby raise the standards by which students and teachers at those schools make comparative judgments—may show the lowest scores and be punished, while schools that are least effective may receive the highest scores and be rewarded for ineffectiveness (Dobbie & Fryer, 2013; O’Brien, Yeager, Galla, D’Mello, & Duckworth, 2015; West et al., 2015). Even when accountability does not carry high stakes—for instance, when between-school measures are simply used to pair high- and low-scoring schools to learn from one another—reference bias undermines school improvement: it would spread practices from the least effective schools to the most effective ones. Unfortunately, our experience suggests that re-writing items does not eliminate this type of reference bias.
The reference bias problem alone suggests that questionnaires, as they currently exist, should not be used for between-school accountability. Yet accountability adds at least two additional concerns. First, it is not clear that aggregated student reports can reasonably distinguish among schools throughout the majority of the distribution. Indeed, even value-added measures based on standardized achievement test scores (which do not suffer from reference bias) fail to distinguish more effective from less effective teachers outside of the very high or very low ends of the distribution (Goldhaber & Loeb, 2013; Raudenbush & Jean, 2012).
One exception may be assessing personal qualities for the purpose of comparing teachers within schools. For example, the Tripod measure allows for students’ ratings of different teachers in the same school; ratings of one teacher can be “anchored” using ratings of others in the same school, and these anchored ratings have been shown to correlate with differences in value-added measures among teachers within schools (Ferguson, 2012; Ferguson & Danielson, 2014; Kane & Cantrell, 2013). These measures would be excellent for identifying “positive outlier” teachers within schools—for instance, those who reduce achievement gaps and maintain a strong sense of belonging in students—and then encouraging peer teachers in the same schools to learn from their practices. Unfortunately, these measures, like many others, are not very effective when comparing between schools. This mirrors analyses of state test scores, which have found that value-added measures are better at distinguishing among different teachers in the same schools than different teachers in different schools (Raudenbush, 2013).
There is a second, perhaps more problematic, issue with using measures for the sake of accountability: the potential for faking or unfairly manipulating data. Campbell (1976) observed, “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor” (p. 49). Campbell was making a very general point about how good intentions can lead to unintended perverse outcomes. But this may be especially germane to self-report and teacher-report questionnaire measures, because students can easily be taught to mark the “right” answer, and teachers can likewise rate their students more favorably than they really perceive them to be. Even when outright faking does not occur, accountability pressure on qualities such as growth mindset can encourage superficial parroting of growth mindset ideas, inflating self-reports without producing true, deep changes in students’ mindsets. We should note that accountability pressures can also affect performance tasks insofar as schools could be incentivized to alter testing situations to optimize student performance (for examples from achievement tests, see commentaries by Hollingworth, Dude, & Shepherd, 2010; Ravitch, 2012).
In sum, we have a simple scientific recommendation regarding the use of currently-available personal quality measures for most forms of accountability: not yet.
Individual Diagnosis
Schools may wish to diagnose students’ personal qualities to use in tracking or remediation decisions. This type of measurement involves decisions about the resources given to a child and raises two major concerns: reliability at the level of the individual person and context dependency. First, existing questionnaire measures are likely not sufficiently reliable for making individual diagnoses. Indeed, even well-established clinical protocols for assessing conditions such as depression rely on extensive self-report questionnaires only as a screening tool, and these screens have only modest sensitivity and specificity for determining whether a person is clinically depressed (Kovacs, 1992); they require clinical follow-up. Without highly reliable, multi-method, multi-informant measurement batteries whose validity has been demonstrated for diagnosis, it will be difficult for a practitioner to justify the individual diagnosis of children’s personal qualities such as self-control, grit, or growth mindset.
Our second concern is the context-dependency of some measures. Take the example of self-control. Research finds that individuals have a harder time regulating themselves when they are under stereotype threat, but they show greater self-control when they do not feel under threat (Carr & Steele, 2009; Inzlicht & Kang, 2010; Inzlicht, McKay, & Aronson, 2006). Hence, cues that signal the potential to be stereotyped might impair a stigmatized student’s self-control, inappropriately supporting conclusions about the child’s ability rather than about the bias in the context. This may lead to unintended effects such as victim-blaming, rather than systemic reform.
In sum, a handful of self-report or teacher-report questions cannot (currently) diagnose an individual child’s self-control, growth mindset, grit, purpose, etc. And, even if more extensive protocols were available, it would be essential to consider the possibility of situational and group-specific biases.
Practice Improvement
The ultimate goal of much educational research is to systematically improve personal qualities across contexts—that is, to promote the improvement of practice (Bryk, Gomez, Grunow, & LeMahieu, 2015). Here, too, existing measures have important limitations, but they also have great potential.
As has been argued elsewhere (Bryk, Gomez, Grunow, & LeMahieu, 2015; Langley et al., 2009; Yeager & Bryk, 2014), the improvement of practice requires “practical measurement.” Practical measures are not measures for theory development or accountability. Instead, they can be administered within the web of daily instruction, they can be quickly analyzed and communicated to practitioners, and they relate directly to the causes of student underperformance that are the explicit target of improvement efforts. They allow people to learn rapidly from practice. This means that the measures should be brief, easily collected, and contextually appropriate. Practical measures should be sensitive to short-term changes and provide short-term feedback on progress that has or has not been made in improving personal qualities (Bryk, Gomez, Grunow, & LeMahieu, 2015; Yeager & Bryk, 2014).
Existing questionnaires demonstrate very few of these features. First, questionnaires can be quite long (Atkins-Burnett, Fernandez, Akers, Jacobson, & Smither-Wulsin, 2012). For instance, some measures of self-efficacy—a construct that could in theory be assessed with a single item—are 60 items long (Marat, 2005). Next, questionnaire measures are rarely if ever customized for different settings, and therefore the data they yield may not be relevant to a given teacher working in a given school. That is, issues of language and literacy, cultural norms, or even colloquialisms could compromise a practical measure. A practical measure is useful only if it helps improve practice in a given setting, not merely because it shows reliability and validity on average across settings. Third, conventional questionnaire measures are often not designed to be sensitive to change over time. For example, a teacher who wants to know whether classroom activities encouraged self-control during the prior week may not learn much by asking students to respond repeatedly to very general questions such as “People say that I have iron self-discipline.” At the same time, it may be possible to solve this latter problem by writing optimized questions—ones that use construct-specific verbal labels to avoid acquiescence bias, use the optimal number of response options, balance bipolar choices, and so on (Gehlbach & Brinkworth, 2011; Krosnick, 1999; Schuman & Presser, 1981; Schwarz & Oyserman, 2001)—and many fewer of them (Yeager & Bryk, 2014).
We believe performance tasks can also support practice improvement. For instance, tasks can document within-person changes over the short term. To the extent that performance tasks can be embedded online, they may be used to produce efficient web-based reports, facilitating teachers’ improvement efforts. At the same time, as noted, performance tasks still require that procedures be optimized to reduce systematic and random error. This can make them logistically difficult to embed in the web of daily practice. Still, this may be a solvable engineering problem in the medium term.
In sum, a promising area for future research is to increase our knowledge of the conditions under which questionnaires and performance tasks can support the continuous improvement of educational practice.
Final Recommendations
The major conclusions of this article are summarized in Table 2. We have argued that all measures have limitations as well as advantages. Furthermore, we have observed that the applied uses of assessments are diverse, and design features that make a measurement approach helpful for one use may render it less appropriate for another. As a consequence, it is impossible to hierarchically rank measures from best to worst in any absolute sense. Rather than seek out the “most valid measure,” therefore, we advise practitioners and researchers to seek out the “most valid measure for their intended purpose.” While doing so, policymakers and practitioners in particular should keep in mind that most existing measures were developed for basic scientific research. We urge heightened vigilance regarding the use-specific limitations of any measure, regardless of prior “evidence of validity.”
Table 2.
Summary for Practitioners and Policymakers
(Two-column table pairing the article’s major conclusions with corresponding recommendations; the table body is not preserved in this version.)
Whenever possible, we recommend using a plurality of measurement approaches. While time and money are never as ample as would be ideal, a multi-method approach to measurement can dramatically increase reliability and validity (Eid & Diener, 2006; Rushton, Brainerd, & Pressley, 1983). As just one example, Duckworth and Seligman (2005) aggregated multiple measures of self-control, including a delay of gratification task and self-report, teacher-report, and parent-report questionnaires, finding that a composite score for self-control in the fall predicted final report card grades better than a standard measure of cognitive ability. We also encourage further innovation in measurement development. An incomplete list of promising approaches includes:

- opportunistically mining students’ online learning behavior or written communication in real time (e.g., Twitter feeds, Khan Academy databases) for meaningful patterns of behavior (D’Mello, Duckworth, & Dieterle, 2014; Ireland & Pennebaker, 2010; Kern et al., 2014);
- the “synthetic aperture” method of administering random subsets of questionnaire items to respondents so as to minimize administration time while maximizing content validity (Revelle, Wilt, & Rosenthal, 2010);
- recording and later coding 30-second audio snippets during everyday life (Mehl, Vazire, Holleran, & Clark, 2010);
- presenting hypothetical situations in narrative form and asking students what they would do in that circumstance (Oswald, Schmitt, Kim, Ramsay, & Gillespie, 2004; Ployhart & MacKenzie, 2011);
- asking students to make observations of their peers (Wagerman & Funder, 2007);
- indirectly assessing personal qualities through innovative application of factor analysis to conventionally collected data (e.g., GPA, attendance, achievement test scores) (Jackson, 2012; Kautz & Zanoni, 2014);
- and contacting students throughout the day to assess their momentary actions, thoughts, and feelings (Wong & Csikszentmihalyi, 1991; Zirkel, Garcia, & Murphy, 2015).

In general, efforts to advance measurement of personal qualities would greatly benefit from cross-fertilization with similar efforts in personality psychology, industrial and organizational psychology, neuroscience, and economics (Heckman & Kautz, 2013; Pickering & Gray, 1999; Roberts, Jackson, Duckworth, & Von Culin, 2011; Schmidt, 2013; Schmidt & Hunter, 1998).
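To illustrate the multi-method aggregation strategy used by Duckworth and Seligman (2005), here is a minimal sketch (our own, with invented column names and values, not their analysis code): standardize each measure, then average the z-scores so that uncorrelated errors partially cancel.

```python
import pandas as pd

# Hypothetical data: one row per student, one column per self-control measure.
df = pd.DataFrame({
    "self_report":    [3.2, 4.1, 2.8, 3.9],   # questionnaire mean, 1-5 scale
    "teacher_report": [3.5, 4.4, 2.5, 3.6],
    "parent_report":  [3.0, 4.0, 3.1, 3.8],
    "delay_seconds":  [120, 480, 60, 300],    # delay-of-gratification wait time
})

# Standardize each measure, then average: the principle of aggregation.
z_scores = (df - df.mean()) / df.std(ddof=0)
df["composite"] = z_scores.mean(axis=1)
print(df["composite"])
```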
Relatedly, it has recently been suggested that supplementing questionnaires with anchoring vignettes may help reduce reference bias (King, Murray, Salomon, & Tandon, 2004; Kyllonen & Bertling, 2013). Anchoring vignettes are brief descriptions of hypothetical persons that serve as anchors for calibrating questionnaire responses. Respondents rate each vignette and then their own behavior on the same rating scale. Adjusting self-report questionnaire scores using anchoring vignettes has been shown to resolve paradoxical findings attributed to reference bias. However, adding vignettes to questionnaires can dramatically increase respondent burden. Moreover, it has not yet been possible to verify the extent to which vignettes fully correct for reference bias (Kyllonen & Bertling, 2013).
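The core of the nonparametric adjustment proposed by King et al. (2004) is simple enough to sketch in a few lines: each respondent’s self-rating is recoded relative to his or her own ratings of the shared vignettes, replacing the local frame of reference with one anchored by the same hypothetical people. (This simplified sketch assumes the respondent rates the vignettes in their intended order; the full method also models ties and order violations.)

```python
from typing import List

def recode_with_vignettes(self_rating: int, vignette_ratings: List[int]) -> int:
    """Recode a self-rating against this respondent's own vignette ratings.

    With k vignettes (listed from least to most of the trait), returns a score
    on a 1..(2k + 1) scale: odd values fall between vignettes, even values tie
    a vignette exactly.
    """
    score = 1
    for v in vignette_ratings:
        if self_rating > v:
            score += 2   # self falls above this vignette; keep climbing
        elif self_rating == v:
            score += 1   # self ties this vignette
            break
        else:
            break        # self falls below this vignette
    return score

# Two respondents who use the 1-5 response scale very differently, but who
# both place themselves above the same two vignettes, get the same score.
print(recode_with_vignettes(4, [2, 3]))  # lenient rater -> 5
print(recode_with_vignettes(3, [1, 2]))  # strict rater  -> 5
```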
Finally, measuring personal qualities, although difficult, is only the first step. Scientific inquiry and organizational improvement begin with data collection, but those data must be used to inform action. Too little is known about the question of how to act on data regarding the personal qualities of students in various classrooms or schools (Bryk, Gomez, Grunow, & LeMahieu, 2015). If a classroom is low in grit, what should one do? If a student is known to have a fixed mindset, how can one intervene without stigmatizing the child (and should one intervene at all)? How can multi-dimensional data on personal qualities be visualized and fed to decision makers more clearly? The wise use of data in educational practice is another topic that will be increasingly important—and likely just as fraught with difficulty—as the collection of that data (Bryk, Gomez, Grunow, & LeMahieu, 2015).
Interest in the “other” side of the report card is not at all new. What is new is the expectation that we can measure, with precision and accuracy, the many positive personal qualities other than cognitive ability that contribute to student well-being and achievement. Quantifying, even imperfectly, the extent to which young people express self-control, gratitude, purpose, growth mindset, collaboration, emotional intelligence, and other beneficial personal qualities has dramatically advanced scientific understanding of their development, their impact on life outcomes, and their underlying mechanisms. It is no surprise that policymakers and practitioners have grown increasingly interested in using such measures for diverse purposes other than theory development. Given the advantages, limitations, and medium-term potential of such measures, our hope is that the broader educational community proceeds with both alacrity and caution, and with equal parts optimism and humility.
Acknowledgments
This research was made possible by grants to the first author from the National Institute on Aging (Grants K01-AG033182-02 and R24-AG048081-01), the Character Lab, the Gates Foundation, the Robert Wood Johnson Foundation, the Spencer Foundation, and the Templeton Foundation as well as grants to the second author from the Raikes Foundation, the William T. Grant Foundation, and a fellowship from the Center for Advanced Study in the Behavioral Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
Footnotes
Interestingly, while the notion of “cognitive skills” has gained much wider acceptance than the term “non-cognitive skills,” both are difficult to define with precision, often misinterpreted because of a lack of consensual definitions, hard to measure without influence of the other, and representative of heterogeneous rather than homogeneous categories (Duckworth, Quinn, Lynam, Loeber, & Stouthamer-Loeber, 2011; Gardner, 2004; Heckman & Kautz, 2013; Sternberg, 2008).
We hasten to point out that cognitive ability is also mutable (Nisbett, 2009; Nisbett et al., 2012).
Some have argued that comparisons to peers of higher or lower achievement are not merely a source of systematic measurement error but, in addition, can lead to durable changes in self-concept, motivation, and performance (Huguet et al., 2009).
Contributor Information
Angela L. Duckworth, University of Pennsylvania.
David Scott Yeager, University of Texas at Austin.
References
- Abikoff H, Courtney M, Pelham WE Jr, Koplewicz HS. Teachers' ratings of disruptive behaviors: The influence of halo effects. Journal of Abnormal Child Psychology. 1993;21(5):519–533. doi: 10.1007/BF00916317.
- Achenbach TM, McConaughy SH, Howell CT. Child/adolescent behavioral and emotional problems: Implications of cross-informant correlations for situational specificity. Psychological Bulletin. 1987;101(2):213–232.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education [AERA/APA/NCME]. Standards for educational and psychological testing. Washington, DC: American Educational Research Association; 1999.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education [AERA/APA/NCME]. Standards for educational and psychological testing (Rev. ed.). Washington, DC: American Educational Research Association; 2014.
- Almlund M, Duckworth AL, Heckman JJ, Kautz TD. Personality psychology and economics (NBER Working Paper No. w16822). Cambridge, MA: National Bureau of Economic Research; 2011.
- Aronson E, Carlsmith JM. Effect of the severity of threat on the devaluation of forbidden behavior. The Journal of Abnormal and Social Psychology. 1963;66(6):584–588.
- Atkins-Burnett S, Fernandez C, Akers L, Jacobson J, Smither-Wulsin C. Landscape analysis of non-cognitive measures. Princeton, NJ: Mathematica Policy Research; 2012.
- Babad EY, Inbar J, Rosenthal R. Teachers' judgment of students' potential as a function of teachers' susceptibility to biasing information. Journal of Personality and Social Psychology. 1982;42(3):541–547.
- Baumeister RF, Bratslavsky E, Muraven M, Tice DM. Ego depletion: Is the active self a limited resource? Journal of Personality and Social Psychology. 1998;74(5):1252–1265. doi: 10.1037//0022-3514.74.5.1252.
- Berkowitz MW. Moral and character education. In: Harris KR, Graham S, Urdan T, Royer JM, Zeidner M, editors. APA educational psychology handbook, Vol. 2: Individual differences and cultural and contextual factors. Washington, DC: American Psychological Association; 2012. pp. 247–264.
- Bierman KL, Nix RL, Greenberg MT, Blair C, Domitrovich CE. Executive functions and school readiness intervention: Impact, moderation, and mediation in the Head Start REDI program. Development and Psychopathology. 2008;20:821–843. doi: 10.1017/S0954579408000394.
- Binet A, Simon T. The development of intelligence in children (The Binet-Simon Scale). Baltimore, MD: Williams & Wilkins Co; 1916.
- Blair C, Diamond A. Biological processes in prevention and intervention: The promotion of self-regulation as a means of preventing school failure. Development and Psychopathology. 2008;20(3):899–911. doi: 10.1017/S0954579408000436.
- Blair C, Raver CC. Closing the achievement gap through modification of neurocognitive and neuroendocrine function: Results from a cluster randomized controlled trial of an innovative approach to the education of children in kindergarten. PLoS ONE. 2014;9(11):e112393. doi: 10.1371/journal.pone.0112393.
- Blair C, Raver CC. School readiness and self-regulation: A developmental psychobiological approach. Annual Review of Psychology. 2015;66(1):711–731. doi: 10.1146/annurev-psych-010814-015221.
- Borghans L, Duckworth AL, Heckman JJ, ter Weel B. The economics and psychology of personality traits. Journal of Human Resources. 2008;43(4):972–1059.
- Bowman NA. Can 1st-year college students accurately report their learning and development? American Educational Research Journal. 2010;47(2):466–496.
- Brackett MA, Geher G. Measuring emotional intelligence: Paradigmatic diversity and common ground. In: Ciarrochi J, Forgas J, Mayer JD, editors. Emotional intelligence in everyday life. 2nd ed. New York, NY: Psychology Press; 2006. pp. 27–50.
- Brackett MA, Mayer JD. Convergent, discriminant, and incremental validity of competing measures of emotional intelligence. Personality and Social Psychology Bulletin. 2003;29(9):1147–1158. doi: 10.1177/0146167203254596.
- Bryk AS, Gomez LM, Grunow A, LeMahieu PG. Learning to improve: How America's schools can get better at getting better. Cambridge, MA: Harvard Education Press; 2015.
- Burgess PW. Theory and methodology in executive function research. In: Rabbitt P, editor. Methodology of frontal and executive function. East Sussex, UK: Psychology Press; 1997. pp. 91–116.
- Campbell DT. Assessing the impact of planned social change. Paper presented at the Conference on Social Psychology; Visegrad, Hungary; 1976.
- Carlson SM, Zelazo PD, Faja S. Executive function. In: Zelazo PD, editor. The Oxford handbook of developmental psychology, Vol. 1: Body and mind. New York, NY: Oxford University Press; 2013. pp. 706–742.
- Caspi A, Roberts BW, Shiner RL. Personality development: Stability and change. Annual Review of Psychology. 2005;56:453–484. doi: 10.1146/annurev.psych.55.090902.141913.
- Carr PB, Steele CM. Stereotype threat and inflexible perseverance in problem solving. Journal of Experimental Social Psychology. 2009;45:853–859.
- Clark LA, Watson D. Constructing validity: Basic issues in objective scale development. Psychological Assessment. 1995;7(3):309–319.
- Connelly BS, Ones DS. An other perspective on personality: Meta-analytic integration of observers' accuracy and predictive validity. Psychological Bulletin. 2010;136(6):1092–1122. doi: 10.1037/a0021212.
- Damon W. The bridge to character: To help students become ethical, responsible citizens, schools need to cultivate students' natural moral sense. Educational Leadership. 2010;67(5):36–41.
- de Ridder DTD, Lensvelt-Mulders G, Finkenauer C, Stok FM, Baumeister RF. Taking stock of self-control: A meta-analysis of how trait self-control relates to a wide range of behaviors. Personality and Social Psychology Review. 2012;16(1):76–99. doi: 10.1177/1088868311418749.
- Diamond A, Lee K. Interventions shown to aid executive function development in children 4 to 12 years old. Science. 2011;333(6045):959–964. doi: 10.1126/science.1204529.
- Diamond A. Executive functions. Annual Review of Psychology. 2013;64:135–168. doi: 10.1146/annurev-psych-113011-143750.
- Diener E, Larsen JE, Emmons RA. Person x situation interactions: Choice of situations and congruence response models. Journal of Personality and Social Psychology. 1984;47(3):580–592. doi: 10.1037//0022-3514.47.3.580.
- D'Mello S, Duckworth A, Dieterle E. Advanced, analytic, automated measures of state engagement during learning. 2014. Manuscript under review.
- Dobbie W, Fryer RG Jr. The medium-term impacts of high-achieving charter schools on non-test score outcomes (NBER Working Paper Series). Cambridge, MA: National Bureau of Economic Research; 2013.
- Dodge KA. Social cognition and children's aggressive behavior. Child Development. 1980;51:162–170.
- Drucker PF. Management: Tasks, responsibilities, practices. New York, NY: Routledge; 1974.
- Duckworth AL. (Over and) beyond high-stakes testing. American Psychologist. 2009;64(4):279–280. doi: 10.1037/a0014923.
- Duckworth AL, Carlson SM. Self-regulation and school success. In: Sokol BW, Grouzet FME, Muller U, editors. Self-regulation and autonomy: Social and developmental dimensions of human conduct. New York, NY: Cambridge University Press; 2013. pp. 208–230.
- Duckworth A, Gendler T, Gross J. Situational strategies for self-control. 2014. Manuscript under review.
- Duckworth AL, Kern ML. A meta-analysis of the convergent validity of self-control measures. Journal of Research in Personality. 2011;45(3):259–268. doi: 10.1016/j.jrp.2011.02.004.
- Duckworth AL, Quinn PD, Lynam DR, Loeber R, Stouthamer-Loeber M. Role of test motivation in intelligence testing. Proceedings of the National Academy of Sciences. 2011;108(19):7716–7720. doi: 10.1073/pnas.1018601108.
- Duckworth AL, Seligman MEP. Self-discipline outdoes IQ in predicting academic performance of adolescents. Psychological Science. 2005;16(12):939–944. doi: 10.1111/j.1467-9280.2005.01641.x.
- Duckworth AL, Steinberg L. Unpacking self-control. Child Development Perspectives. 2015;9(1):32–37. doi: 10.1111/cdep.12107.
- Duckworth AL, Tsukayama E, May H. Establishing causality using longitudinal hierarchical linear modeling: An illustration predicting achievement from self-control. Social Psychological and Personality Science. 2010;1(4):311–317. doi: 10.1177/1948550609359707.
- Durlak JA, Domitrovich CE, Weissberg RP, Gullotta TP. Handbook of social and emotional learning: Research and practice. New York, NY: Guilford; 2015.
- Durlak JA, Weissberg RP, Dymnicki AB, Taylor RD, Schellinger KB. The impact of enhancing students' social and emotional learning: A meta-analysis of school-based universal interventions. Child Development. 2011;82(1):405–432. doi: 10.1111/j.1467-8624.2010.01564.x.
- Easton J. Using measurement as leverage between developmental research and educational practice. Paper presented at the Center for Advanced Study of Teaching and Learning Meeting; Charlottesville, VA; 2013. Retrieved from http://ies.ed.gov/director/pdf/Easton062013.pdf.
- Egalite AJ, Mills JN, Greene JP. The softer side of learning: Measuring students' non-cognitive skills (EDRE Working Paper No. 2014–03). Fayetteville, AR: University of Arkansas Department of Education Reform; 2014.
- Eid M, Diener E, editors. Handbook of multimethod measurement in psychology. Washington, DC: American Psychological Association; 2006.
- Eisenberg N, Spinrad TL, Fabes RA, Reiser M, Cumberland A, Shepard SA, Thompson M. The relations of effortful control and impulsivity to children's resiliency and adjustment. Child Development. 2004;75(1):25–46. doi: 10.1111/j.1467-8624.2004.00652.x.
- Elias MJ, editor. Promoting social and emotional learning: Guidelines for educators. Chicago, IL: Association for Supervision and Curriculum Development; 1997.
- Ent MR, Baumeister RF, Tice DM. Trait self-control and the avoidance of temptation. Personality and Individual Differences. 2015;74:12–15.
- Farrington CA, Roderick M, Allensworth E, Nagaoka J, Keyes TS, Johnson DW, Beechum NO. Teaching adolescents to become learners. The role of noncognitive factors in shaping school performance: A critical literature review. Chicago, IL: University of Chicago Consortium on Chicago School Research; 2012.
- Ferguson RF. Can student surveys measure teaching quality? Phi Delta Kappan. 2012;94(3):24–28.
- Ferguson RF, Danielson C. How Framework for Teaching and Tripod 7Cs evidence distinguish key components of effective teaching. In: Kane TJ, Kerr KA, Pianta RC, editors. Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project. Hoboken, NJ: Jossey-Bass; 2014. pp. 98–143.
- Fleeson W, Noftle EE. The end of the person-situation debate: An emerging synthesis in the answer to the consistency question. Social and Personality Psychology Compass. 2008;2(4):1667–1684.
- Freud S. Introductory lectures on psychoanalysis. New York, NY: W. W. Norton & Company; 1920.
- Fujita K. On conceptualizing self-control as more than the effortful inhibition of impulses. Personality and Social Psychology Review. 2011;15(4):352–366. doi: 10.1177/1088868311411165.
- Funder DC. Accurate personality judgment. Current Directions in Psychological Science. 2012;21(3):177–182.
- Galla BM, Duckworth AL. More than resisting temptation: Beneficial habits mediate the relationship between self-control and positive life outcomes. Journal of Personality and Social Psychology. 2015. Advance online publication. doi: 10.1037/pspp0000026.
- Galla BM, Plummer BD, White R, Meketon D, D'Mello SK, Duckworth AL. Development and validation of the Academic Diligence Task. Contemporary Educational Psychology. 2014;39(4):314–325. doi: 10.1016/j.cedpsych.2014.08.001.
- Gardner H. Frames of mind: The theory of multiple intelligences. New York, NY: Basic Books; 2004.
- Gehlbach H, Brinkworth ME. Measure twice, cut down error: A process for enhancing the validity of survey scales. Review of General Psychology. 2011;15(4):380–387.
- Goldhaber D, Loeb S. What do we know about the tradeoffs associated with teacher misclassification in high-stakes personnel decisions? 2013. Retrieved from http://www.carnegieknowledgenetwork.org/briefs/value-added/teacher-misclassifications/
- Goldman S. Self-discipline predicts academic performance among low-achieving adolescents. Res: A Journal of Undergraduate Research. 2006;2(1):84–97.
- Greenberg MT. School-based prevention: Current status and future challenges. Effective Education. 2010;2(1):27–52.
- Hagger MS, Wood C, Stiff C, Chatzisarantis NLD. Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin. 2010;136(4):495–525. doi: 10.1037/a0019486.
- Hartshorne H, May MA. Studies in the nature of character, Vol. 2: Studies in self-control. New York, NY: Macmillan; 1929.
- Hausknecht JP, Halpert JA, Di Paolo NT, Moriarty Gerrard MO. Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology. 2007;92(2):373–385. doi: 10.1037/0021-9010.92.2.373.
- Heckman JJ, Kautz TD. Hard evidence on soft skills. Labour Economics. 2012;19(4):451–464. doi: 10.1016/j.labeco.2012.05.014.
- Heckman J, Kautz TD. Achievement tests and the role of character in American life. In: Heckman J, Humphries JE, Kautz T, editors. The myth of achievement tests: The GED and the role of character in American life. Chicago, IL: University of Chicago Press; 2013. pp. 1–71.
- Heckman JJ, Kautz TD. The myth of achievement tests: The GED and the role of character in American life. Chicago, IL: The University of Chicago Press; 2014.
- Heine SJ, Lehman DR, Peng K, Greenholtz J. What's wrong with cross-cultural comparisons of subjective Likert scales? The reference-group effect. Journal of Personality and Social Psychology. 2002;82(6):903–918.
- Hightower AD, Work WC, Cowen EL, Lotyczewski B, Spinell A, Guare J, Rohrbeck C. The Teacher-Child Rating Scale: A brief objective measure of elementary children's school problem behaviors and competencies. School Psychology Review. 1986;15(3):393–409.
- Hofmann W, Baumeister RF, Förster G, Vohs KD. Everyday temptations: An experience sampling study of desire, conflict, and self-control. Journal of Personality and Social Psychology. 2012;102(6):1318. doi: 10.1037/a0026545.
- Hollingworth L, Dude DJ, Shepherd JK. Pizza parties, pep rallies, and practice tests: Strategies used by high school principals to raise percent proficient. Leadership and Policy in Schools. 2010;9(4):462–478.
- Huguet P, Dumas F, Marsh H, Régner I, Wheeler L, Suls J, Nezlek J. Clarifying the role of social comparison in the big-fish-little-pond effect (BFLPE): An integrative study. Journal of Personality and Social Psychology. 2009;97(1):156–170. doi: 10.1037/a0015558.
- Imhoff R, Schmidt AF, Gerstenberg F. Exploring the interplay of trait self-control and ego depletion: Empirical evidence for ironic effects. European Journal of Personality. 2013;28(5):413–424.
- Inzlicht M, McKay L, Aronson J. Stigma as ego depletion: How being the target of prejudice affects self-control. Psychological Science. 2006;17(3):262–269. doi: 10.1111/j.1467-9280.2006.01695.x.
- Inzlicht M, Kang SK. Stereotype threat spillover: How coping with threats to social identity affects aggression, eating, decision making, and attention. Journal of Personality and Social Psychology. 2010;99(3):467–481. doi: 10.1037/a0018951.
- Ireland ME, Pennebaker JW. Language style matching in writing: Synchrony in essays, correspondence, and poetry. Journal of Personality and Social Psychology. 2010;99(3):549–571. doi: 10.1037/a0020386.
- Jackson CK. Non-cognitive ability, test scores, and teacher quality: Evidence from 9th grade teachers in North Carolina (NBER Working Paper Series). Cambridge, MA: National Bureau of Economic Research; 2012.
- Jackson JJ, Connolly JJ, Garrison M, Levine M, Connolly SL. Your friends know how long you will live: A 75-year study of peer-rated personality traits. Psychological Science. 2015;26(3):335–340. doi: 10.1177/0956797614561800.
- Jones E, Sigall H. The bogus pipeline: A new paradigm for measuring affect and attitude. Psychological Bulletin. 1971;76(5):349–364.
- Kane TJ, Cantrell S. Ensuring fair and reliable measures of effective teaching: Culminating findings from the MET Project's three-year study. Policy & Practice Brief. 2013.
- Kautz TD, Zanoni W. Measuring and fostering non-cognitive skills in adolescence: Evidence from Chicago Public Schools and the OneGoal program. Chicago, IL: Department of Economics, University of Chicago; 2014. Unpublished manuscript.
- Kelvin WT. Popular lectures and addresses. Vol. 1. London, UK: Macmillan and Co; 1883.
- Kern ML, Eichstaedt JC, Schwartz HA, Park G, Ungar LH, Stillwell DJ, Seligman MEP. From "Sooo excited!!!" to "So proud": Using language to study development. Developmental Psychology. 2014;50(1):178–188. doi: 10.1037/a0035048.
- Kidd C, Palmeri H, Aslin RN. Rational snacking: Young children's decision-making on the marshmallow task is moderated by beliefs about environmental reliability. Cognition. 2013;126(1):109–114. doi: 10.1016/j.cognition.2012.08.004.
- King G, Murray CJL, Salomon JA, Tandon A. Enhancing the validity and cross-cultural comparability of measurement in survey research. American Political Science Review. 2004;98(1):191–207.
- King ML Jr. The purpose of education. Maroon Tiger. 1947 Jan–Feb.
- Kovacs MK. Children's Depression Inventory–Short Form (CDI). New York, NY: Multi-Health Systems; 1992.
- Kristjánsson K. Ten myths about character, virtue and virtue education – plus three well-founded misgivings. British Journal of Educational Studies. 2013;61(3):269–287.
- Krosnick JA. Survey research. Annual Review of Psychology. 1999;50:537–567. doi: 10.1146/annurev.psych.50.1.537.
- Krosnick JA, Presser S. Question and questionnaire design. In: Marsden PV, Wright JD, editors. Handbook of survey research. Bingley, UK: Emerald Group Publishing; 2010. pp. 263–314.
- Krosnick JA, Fabrigar LR. The handbook of questionnaire design. New York, NY: Oxford University Press; forthcoming.
- Kruger J, Dunning D. Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology. 1999;77(6):1121–1134. doi: 10.1037//0022-3514.77.6.1121.
- Kuhn TS. The function of measurement in modern physical science. Isis. 1961;52(2):161–193.
- Kyllonen PC, Bertling J. Innovative questionnaire assessment methods to increase cross-country comparability. In: Rutkowski L, von Davier M, Rutkowski D, editors. A handbook of international large-scale assessment data analysis: Background, technical issues, and methods of data analysis. London, UK: Chapman Hall/CRC Press; 2013.
- Langley GJ, Moen R, Nolan KM, Nolan TW, Norman CL, Provost LP. The improvement guide: A practical approach to enhancing organizational performance. San Francisco, CA: Jossey-Bass; 2009.
- Lapsley DK, Yeager DS. Moral-character education. In: Weiner IB, Reynolds WM, Miller GE, editors. Handbook of psychology. 2nd ed. Vol. 7: Educational psychology. New York, NY: Wiley Publishing; 2012. pp. 289–348.
- Lelkes Y, Krosnick JA, Marx DM, Judd CM, Park B. Complete anonymity compromises the accuracy of self-reports. Journal of Experimental Social Psychology. 2012;48(6):1291–1299.
- Levin HM. The utility and need for incorporating noncognitive skills into large-scale educational assessments. In: von Davier M, Gonzalez E, Kirsch I, Yamamoto K, editors. The role of international large-scale assessments: Perspectives from technology, economy, and educational research. New York, NY: Springer Netherlands; 2013. pp. 67–86.
- Lucas RE, Baird BM. Global self-assessment. In: Eid M, Diener E, editors. Handbook of multimethod measurement in psychology. Washington, DC: American Psychological Association; 2006. pp. 29–42.
- Lucas RE, Lawless NM. Does life seem better on a sunny day? Examining the association between daily weather conditions and life satisfaction judgments. Journal of Personality and Social Psychology. 2013;104(5):872–884. doi: 10.1037/a0032124.
- Marat D. Assessing mathematics self-efficacy of diverse students from secondary schools in Auckland: Implications for academic achievement. Issues in Educational Research. 2005;15(1):37–68.
- McGuire JT, Kable JW. Rational temporal predictions can underlie apparent failures to delay gratification. Psychological Review. 2013;120(2):395–410. doi: 10.1037/a0031910.
- Mehl MR, Vazire S, Holleran SE, Clark CS. Eavesdropping on happiness: Well-being is related to having less small talk and more substantive conversations. Psychological Science. 2010;21(4):539–541. doi: 10.1177/0956797610362675.
- Merrell KW, Gueldner BA. Social and emotional learning in the classroom: Promoting mental health and academic success. New York, NY: Guilford Press; 2010.
- Messick S. Potential uses of noncognitive measurement in education. Journal of Educational Psychology. 1979;71(3):281.
- Mischel W. Father-absence and delay of gratification. The Journal of Abnormal and Social Psychology. 1961;63(1):116–124. doi: 10.1037/h0046877.
- Mischel W. Personality and assessment. Hoboken, NJ: John Wiley & Sons; 1968.
- Mischel W. The marshmallow test: Mastering self-control. New York, NY: Little, Brown and Company; 2014.
- Mischel W, Ayduk O, Berman MG, Casey BJ, Gotlib IH, Jonides J, Zayas V. 'Willpower' over the life span: Decomposing self-regulation. Social Cognitive and Affective Neuroscience. 2011;6(2):252–256. doi: 10.1093/scan/nsq081.
- Mischel W, Liebert RM. The role of power in the adoption of self-reward patterns. Child Development. 1967;38(3):673–683. doi: 10.1111/j.1467-8624.1967.tb04588.x.
- Miyake A, Friedman NP. The nature and organization of individual differences in executive functions: Four general conclusions. Current Directions in Psychological Science. 2012;21(1):8–14. doi: 10.1177/0963721411429458.
- Moffitt TE, Arseneault L, Belsky D, Dickson N, Hancox RJ, Harrington HL, Caspi A. A gradient of childhood self-control predicts health, wealth, and public safety. Proceedings of the National Academy of Sciences. 2011;108(7):2693–2698. doi: 10.1073/pnas.1010076108.
- Müller U, Kerns KA, Konkin K. Test-retest reliability and practice effects of executive function tasks in preschool children. The Clinical Neuropsychologist. 2012;26(2):271–287. doi: 10.1080/13854046.2011.645558.
- Murnane RJ, Levy F. Teaching the new basic skills: Principles for educating children to thrive in a changing economy. New York, NY: The Free Press; 1996.
- Murphy MC, Dweck CS. A culture of genius: How an organization's lay theory shapes people's cognition, affect, and behavior. Personality and Social Psychology Bulletin. 2009;36(3):282–296. doi: 10.1177/0146167209347380.
- Naemi B, Burrus J, Kyllonen PC, Roberts RD. Building a case to develop noncognitive assessment products and services targeting workforce readiness at ETS. Princeton, NJ: Educational Testing Service; 2012 Dec.
- National Institute of Child Health and Human Development (NICHD). Child's self-regulation fifty-four month delay of gratification test. Research Triangle Park, NC: National Institute of Child Health and Human Development; 1999.
- Nisbett RE, Wilson TD. Telling more than we can know: Verbal reports on mental processes. Psychological Review. 1977;84(3):231–259.
- Nisbett RE. Intelligence and how to get it: Why schools and cultures count. New York, NY: W. W. Norton & Co; 2009.
- Nisbett RE, Aronson J, Blair C, Dickens W, Flynn J, Halpern DF, Turkheimer E. Intelligence: New findings and theoretical developments. American Psychologist. 2012;67(2):130–159. doi: 10.1037/a0026699.
- O'Brien J, Yeager DS, Galla B, D'Mello S, Duckworth AL. Between-school comparisons in non-cognitive factors: Evidence of reference bias in self-reports and advantages of performance tasks. 2015. Manuscript in preparation.
- Oswald FL, Schmitt N, Kim BH, Ramsay LJ, Gillespie MA. Developing a biodata measure and situational judgment inventory as predictors of college student performance. Journal of Applied Psychology. 2004;89(2):187–207. doi: 10.1037/0021-9010.89.2.187.
- Pace CR, Friedlander J. The meaning of response categories: How often is "occasionally," "often," and "very often"? Research in Higher Education. 1982;17(3):267–281.
- Park A, Tsukayama E, Patrick S, Duckworth AL. A tripartite taxonomy of character. 2015. Manuscript in preparation.
- Pellegrino JW, Hilton ML. Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: National Academy of Sciences; 2012.
- Peterson C, Seligman MEP. Character strengths and virtues: A handbook and classification. Washington, DC: American Psychological Association; 2004.
- Pickering AD, Gray JA. The neuroscience of personality. In: Pervin LA, John OP, editors. Handbook of personality: Theory and research. 2nd ed. New York, NY: Guilford Press; 1999. pp. 277–299.
- Ployhart RE, MacKenzie WI Jr. Situational judgment tests: A critical review and agenda for the future. In: APA handbook of industrial and organizational psychology, Vol. 2: Selecting and developing members for the organization. Washington, DC: American Psychological Association; 2011. pp. 237–252.
- Podsakoff PM, MacKenzie SB, Lee JY, Podsakoff NP. Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology. 2003;88(5):879–903. doi: 10.1037/0021-9010.88.5.879.
- Raudenbush SW. Magnitude of teacher expectancy effects on pupil IQ as a function of the credibility of expectancy induction: A synthesis of findings from 18 experiments. Journal of Educational Psychology. 1984;76(1):85–97.
- Raudenbush SW. What do we know about using value-added to compare teachers who work in different schools? Stanford, CA: Carnegie Knowledge Network; 2013 Aug. Retrieved from http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/08/CKN_Raudenbush-Comparing-Teachers_FINAL_08-19-13.pdf.
- Raudenbush SW, Jean M. How should educators interpret value-added scores? Stanford, CA: Carnegie Knowledge Network; 2012 Oct. Retrieved from http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/
- Raver CC, Jones SM, Li-Grining CP, Zhai F, Bub K, Pressler E. CSRP's impact on low-income preschoolers' preacademic skills: Self-regulation as a mediating mechanism. Child Development. 2011;82(1):362–378. doi: 10.1111/j.1467-8624.2010.01561.x.
- Ravitch D. What is Campbell's Law? 2012. Retrieved from http://dianeravitch.net/2012/05/25/what-is-campbells-law/
- Reeve CL, Lam H. The psychometric paradox of practice effects due to retesting: Measurement invariance and stable ability estimates in the face of observed score changes. Intelligence. 2005;33(5):535–549.
- Revelle W, Wilt J, Rosenthal A. Individual differences in cognition: New methods for examining the personality-cognition link. New York, NY: Springer Science; 2010.
- Roberts BW, DelVecchio WF. The rank-order consistency of personality traits from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin. 2000;126(1):3–25. doi: 10.1037/0033-2909.126.1.3.
- Roberts BW, Jackson JJ, Duckworth AL, Von Culin K. Personality measurement and assessment in large panel surveys. Forum for Health Economics & Policy. 2011;14(3):1–32. doi: 10.2202/1558-9544.1268.
- Roberts BW, Kuncel NR, Shiner R, Caspi A, Goldberg LR. The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science. 2007;2(4):313–345. doi: 10.1111/j.1745-6916.2007.00047.x.
- Roberts BW, Walton KE, Viechtbauer W. Patterns of mean-level change in personality traits across the life course: A meta-analysis of longitudinal studies. Psychological Bulletin. 2006;132(1):1–25. doi: 10.1037/0033-2909.132.1.1.
- Roberts RD, Markham PM, Matthews G, Zeidner M. Assessing intelligence: Past, present, and future. In: Wilhelm O, Engle RW, editors. Handbook of understanding and measuring intelligence. Thousand Oaks, CA: Sage Publications; 2005. pp. 333–360.
- Ross L, Lepper M, Ward A. History of social psychology: Insights, challenges, and contributions to theory and application. In: Fiske ST, Gilbert DT, Lindzey G, editors. Handbook of social psychology. Vol. 2. Hoboken, NJ: John Wiley & Sons; 2010. pp. 3–50.
- Ross L, Nisbett RE. The person and the situation: Perspectives of social psychology. New York, NY: McGraw-Hill Book Company; 1991.
- Rushton JP, Brainerd CJ, Pressley M. Behavioral development and construct validity: The principle of aggregation. Psychological Bulletin. 1983;94(1):18–38.
- Sabini J, Siepmann M, Stein J. The really fundamental attribution error in social psychological research. Psychological Inquiry. 2001;12(1):1–15.
- Sackett PR. Revisiting the origins of the typical-maximum performance distinction. Human Performance. 2007;20(3):179–185.
- Sackett PR. Faking in personality assessment: Where do we stand? In: Ziegler M, MacCann C, Roberts RD, editors. New perspectives on faking in personality assessment. Oxford, UK: Oxford University Press; 2011. pp. 330–344.
- Saris W, Revilla M, Krosnick JA, Shaeffer E. Comparing questions with agree/disagree response options to questions with item-specific response options. Survey Research Methods. 2010;4(1):61–79.
- Schmidt FL. The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 100 years of research. Presentation at the University of Iowa; 2013 Apr.
- Schmidt FL, Hunter JE. The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin. 1998;124(2):262–274.
- Schuman H, Presser S. Questions and answers in attitude surveys: Experiments on question form, wording, and context. New York, NY: Academic Press; 1981.
- Schwarz N, Oyserman D. Asking questions about behavior: Cognition, communication, and questionnaire construction. American Journal of Evaluation. 2001;22(2):127–160.
- Sharma L, Markon KE, Clark LA. Toward a theory of distinct types of "impulsive" behaviors: A meta-analysis of self-report and behavioral measures. Psychological Bulletin. 2014;140(2):374–408. doi: 10.1037/a0034418.
- Soland J, Hamilton LS, Stecher BM. Measuring 21st century competencies: Guidance for educators. Santa Monica, CA: RAND Corporation; 2013.
- Soto CJ, John OP, Gosling SD, Potter J. The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20. Journal of Personality and Social Psychology. 2008;94(4):718. doi: 10.1037/0022-3514.94.4.718.
- Stecher BM, Hamilton LS. Measuring hard-to-measure student competencies: A research and development plan. Santa Monica, CA: RAND Corporation; 2014.
- Sternberg RJ. Using cognitive theory to reconceptualize college admissions testing. In: Gluck MA, Anderson JR, Kosslyn SM, editors. Memory and mind: A festschrift for Gordon H. Bower. New York, NY: Lawrence Erlbaum Associates; 2008. pp. 159–175.
- Tough P. What if the secret to success is failure? New York Times Magazine. 2011 Sep 14:1–14.
- Tough P. How children succeed: Grit, curiosity, and the hidden power of character. New York, NY: Houghton Mifflin Harcourt; 2013.
- Tourangeau R, Rips LJ, Rasinski K. The psychology of survey response. Cambridge, UK: Cambridge University Press; 2000.
- Tsukayama E, Duckworth AL, Kim BE. Domain-specific impulsivity in school-age children. Developmental Science. 2013;16(6):879–893. doi: 10.1111/desc.12067.
- Tuttle CC, Gill B, Gleason P, Knechtel V, Nichols-Barrer I, Resch A. KIPP middle schools: Impacts on achievement and other outcomes. Washington, DC: Mathematica Policy Research; 2013.
- Uziel L. Rethinking social desirability scales: From impression management to interpersonally oriented self-control. Perspectives on Psychological Science. 2010;5(3):243–262. doi: 10.1177/1745691610369465.
- Wagerman SA, Funder DC. Personality psychology of situations. In: Corr PJ, Matthews G, editors. The Cambridge handbook of personality psychology. Cambridge, UK: Cambridge University Press; 2009.
- Wagerman SA, Funder DC. Acquaintance reports of personality and academic achievement: A case for conscientiousness. Journal of Research in Personality. 2007;41(1):221–229.
- Weissberg RP, Cascarino J. Academic learning + social-emotional learning = national priority. Phi Delta Kappan. 2013;95(2):8–13.
- Wechsler D. Non-intellective factors in general intelligence. Journal of Abnormal & Social Psychology. 1943;38(1):101–103.
- West MR, Kraft MA, Finn AS, Martin RE, Duckworth AL, Gabrieli CFO, Gabrieli JDE. Promise and paradox: Measuring students' non-cognitive skills and the impact of schooling. 2015. Manuscript under review.
- White JL, Moffitt TE, Caspi A, Bartusch DJ, Needles DJ, Stouthamer-Loeber M. Measuring impulsivity and examining its relationship to delinquency. Journal of Abnormal Psychology. 1994;103(2):192–205. doi: 10.1037//0021-843x.103.2.192.
- Willingham WW. Success in college: The role of personal qualities and academic ability. New York, NY: College Entrance Examination Board; 1985.
- Wong MM, Csikszentmihalyi M. Affiliation motivation and daily experience: Some issues on gender differences. Journal of Personality and Social Psychology. 1991;60(1):154–164.
- Yeager DS, Bryk AS. Practical measurement. Austin, TX: Department of Psychology, University of Texas at Austin; 2014. Unpublished manuscript.
- Yeager DS, Krosnick JA. Does mentioning "some people" and "other people" in a survey question increase the accuracy of adolescents' self-reports? Developmental Psychology. 2011;47(6):1674–1679. doi: 10.1037/a0025440.
- Yeager DS, Henderson M, Paunesku D, Walton GM, D'Mello S, Spitzer BJ, Duckworth AL. Boring but important: A self-transcendent purpose for learning fosters academic self-regulation. Journal of Personality and Social Psychology. 2014;107(4):559–580. doi: 10.1037/a0037637.
- Yeager DS, Walton GM. Social-psychological interventions in education: They're not magic. Review of Educational Research. 2011;81(2):267–301.
- Ziegler M, MacCann C, Roberts RD, editors. New perspectives on faking in personality assessment. Oxford, UK: Oxford University Press; 2011.
- Zirkel S, Garcia JA, Murphy MC. Experience-sampling research methods and their potential for education research. Educational Researcher. 2015.

