Author manuscript; available in PMC: 2021 Apr 15.
Published in final edited form as: J Affect Disord. 2017 Jan 23;212:38–45. doi: 10.1016/j.jad.2017.01.024

Test-retest reliability and validity of a frustration paradigm and irritability measures

Wan-Ling Tseng 1,*, Elizabeth Moroney 1, Laura Machlin 1, Roxann Roberson-Nay 2, John M Hettema 2, Dever Carney 2, Joel Stoddard 1, Kenneth A Towbin 1, Daniel S Pine 1, Ellen Leibenluft 1, Melissa A Brotman 1
PMCID: PMC8049456  NIHMSID: NIHMS1688643  PMID: 28135689

Abstract

Background:

Data on the reliability and validity of assessments for irritability, particularly behavioral paradigms, are limited. This study examined the test-retest reliability and validity of a frustration paradigm (the Affective Posner 2 task) and two irritability measures [the Affective Reactivity Index (ARI) and Child Behavior Checklist (CBCL) irritability].

Methods:

Participants were 109 youth from a general population sample of twins (aged 9–14 years). Participants completed two visits that were 2–4 weeks apart. At both visits, participants completed the Affective Posner 2 task and self-reported their irritability using the ARI. Parents reported their child’s irritability using the ARI and completed the CBCL.

Results:

The Affective Posner 2 task demonstrated good test-retest reliability, with intraclass correlations (ICCs) ranging from .44 to .78. The task effectively evoked negative affect (frustration and unhappiness) at both test and retest, demonstrating its construct validity. Moreover, self-rated frustration and unhappiness during the frustration components of the task correlated positively with self-reported but not parent-reported irritability, providing modest support for convergent validity. Parent- and child-reports of the ARI and parent-reports of the CBCL irritability measure showed excellent test-retest reliability, with ICCs ranging from .88 to .90.

Limitations:

The sample consisted mostly of twins aged 9–14 years recruited from the community. Thus, results may not generalize to non-twin samples, clinical samples, or youth outside of this age range.

Conclusions:

The Affective Posner 2 paradigm and the ARI and CBCL irritability scales may be useful tools for longitudinal or treatment research on irritability.

Keywords: Test-retest reliability, validity, irritability, frustration, parent-child agreement

Introduction

Recently, childhood irritability has received increased scientific attention (Leibenluft, 2011; Stringaris, 2011), due in part to the addition of disruptive mood dysregulation disorder (DMDD) to the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5; American Psychiatric Association, 2013). Children with irritability often have a low threshold for frustration, defined as the emotional and behavioral responses to blocked goal attainment (Leibenluft, 2011). Paradigms that evoke frustration thus provide an effective means for studying irritability in a laboratory setting (Leibenluft and Stoddard, 2013). Indeed, irritability has been linked with frustration that arises when a goal or reward is blocked (Berkowitz, 1989), and it aligns with the frustrative nonreward construct in the National Institute of Mental Health Research Domain Criteria (RDoC; Insel et al., 2010) matrix. However, data on the reliability and validity of behavioral paradigms that probe frustration are limited. This study is the first to examine test-retest reliability and validity of a behavioral paradigm used to elicit frustration (the Affective Posner 2 task). This study also reports test-retest reliability and validity of two commonly used clinical rating scales for childhood irritability [the Affective Reactivity Index (ARI) and the Child Behavior Checklist (CBCL) irritability].

In the past few years, studies on irritability have employed behavioral paradigms that induce frustration by withholding an expected reward through either increased task difficulty or rigged information (Deveney et al., 2013; Perlman et al., 2015). These studies demonstrate that frustration can be elicited in the laboratory. For example, Deveney et al. (2013) used a previous version of the frustration paradigm described in this study and found that children with severe irritability reported more frustration than healthy children during the frustration condition, suggesting discriminant validity of such a paradigm. However, the test-retest reliability of subjects’ behavior on frustration paradigms is unknown. Establishing the longitudinal reliability (e.g., test-retest reliability) of an assessment technique is important for psychological and psychiatric research. In particular, good test-retest reliability is essential for longitudinal studies or treatment trials, i.e., studies in which the research focus is changes that occur with development or in response to treatment.

In addition to behavioral paradigms, studies are increasingly using scales designed to assess childhood irritability (e.g., Roberson-Nay et al., 2015; Savage et al., 2015; Stoddard et al., 2014; Stringaris et al., 2012a, 2012b; Wiggins et al., 2014). Two commonly used scales are the ARI (Stringaris et al., 2012a) and three items extracted from the CBCL (see the Methods section; Aebi et al., 2013; Roberson-Nay et al., 2015; Savage et al., 2015; Wiggins et al., 2014). Studies report good internal consistency, construct validity, and discriminant validity of the ARI (DeSousa et al., 2013; Mulraney et al., 2014; Stringaris et al., 2012a), as well as good internal consistency of the CBCL irritability score (Aebi et al., 2013; Stringaris et al., 2012b; Wiggins et al., 2014). Although some test-retest reliability data are available for the ARI, they are based on small samples (Ns ≤ 40). For example, Mulraney et al. (2014) reported good test-retest reliability of self-reported ARI over a 1-week period in an adult sample [intraclass correlation (ICC) = .80]. Stringaris et al. (2012a) found good stability of parent-ARI (r = .88), but poor stability of child-ARI (r = .29), over a 1-year interval in a youth sample. No studies have examined the test-retest reliability of the CBCL irritability score.

Therefore, to address these limitations in the literature, we evaluated the test-retest reliability and validity of a frustration paradigm (i.e., the Affective Posner 2 task); that is, whether the task elicits frustration in youth reliably (i.e., twice over a 2–4 week period). We also examined the test-retest reliability and validity of two irritability measures (i.e., the ARI and the CBCL irritability scale) over the same period.

Methods

Participants

The participants were part of a general population sample of twins aged 9–14 years recruited through the Mid-Atlantic Twin Registry (Lilley and Silberg, 2013) for the Virginia Commonwealth University (VCU) Juvenile Anxiety Study (Carney et al., 2016). A total of 109 participants who completed the Affective Posner 2 task at both visits were included. The two visits were two to four weeks apart (M = 23.15 days, SD = 7.11 days). The mean age of the sample was 11.26 years (SD = 1.33); the mean IQ, measured by the Wechsler Abbreviated Scale of Intelligence (WASI; Wechsler, 1999) two-subtest forms, was 110.72 (SD = 13.87); 55% of the sample were female. A correlation table with these demographic variables and all the study variables at both visits is presented in Table 1. The sample included 48 complete twin pairs (28 monozygotic twins and 20 dizygotic twins) and 13 singletons. Participants were excluded from the study if they had an IQ less than 70, serious physical or neurological symptoms, thoughts of suicide or homicide, substance abuse, a diagnosis of autism or a severe pervasive developmental disorder, psychotropic medication use other than stimulants for attention-deficit/hyperactivity disorder (ADHD), or current or past episodes of psychosis. The study was approved by the VCU Institutional Review Board. Participants provided informed assent, and parents provided consent prior to their participation in the study. At both visits, participants completed the Affective Posner 2 task and self-reported their irritability using children’s self-report of the ARI (Stringaris et al., 2012a); parents reported their child’s irritability using parent-reports of the ARI and completed the CBCL (Achenbach, 1991; Achenbach and Rescorla, 2001). Participants received $25 for participation plus task winnings during the non-frustration part of the task (up to $50).

Table 1.

Correlations between study variables

| Variable | NF_ACC | NF_RT | NF_happy | NF_frust | F_ACC | F_RT | F_happy | F_frust | NF_ACC.RC | NF_RT.RC | F_ACC.RC | F_RT.RC | ARI_C | ARI_P | CBCL | Age | Gender | IQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NF_ACC | __ | .14 | .00 | −.07 | .30** | −.14 | .01 | .00 | .82*** | −.16 | .07 | −.25** | −.13 | −.01 | .07 | .15 | −.08 | .18 |
| NF_RT | .13 | __ | .05 | −.01 | .34*** | .38*** | −.23* | −.09 | .36** | .19 | .33*** | .19 | −.10 | .03 | .20* | −.21* | .07 | −.13 |
| NF_happy | −.18 | −.07 | __ | .59*** | .00 | .04 | .22* | .12 | −.07 | .12 | .04 | .14 | −.12 | .01 | .12 | −.03 | .23* | −.07 |
| NF_frust | −.17 | .00 | .70*** | __ | −.13 | −.21* | .30** | .20* | −.13 | .12 | −.16 | −.07 | .08 | .05 | .05 | .12 | .12 | −.01 |
| F_ACC | .27** | .24* | −.03 | .06 | __ | .42*** | −.20* | −.11 | .40*** | .05 | .63*** | .06 | −.09 | −.09 | −.09 | −.10 | .15 | .01 |
| F_RT | −.18 | .20* | −.08 | .05 | .15 | __ | −.32*** | −.28** | .03 | .24* | .61*** | .34*** | −.12 | .15 | .09 | −.53*** | .18 | −.18 |
| F_happy | .16 | −.04 | .46*** | .42*** | −.05 | −.20* | __ | .83*** | −.07 | −.13 | −.21* | −.13 | .32*** | .06 | .08 | .22* | −.09 | .22* |
| F_frust | .18 | .00 | .39*** | .35*** | −.08 | −.25** | .88*** | __ | −.01 | −.14 | −.15 | −.01 | .23* | .04 | −.03 | .24* | −.19 | .19* |
| NF_ACC.RC | .83*** | .35*** | −.14 | −.07 | .25** | .06 | .14 | .15 | __ | −.14 | .22* | −.23* | −.16 | −.02 | .07 | .04 | −.08 | .09 |
| NF_RT.RC | −.22* | −.07 | .04 | .04 | .14 | .24* | −.06 | −.13 | −.18 | __ | .05 | .27** | .06 | .05 | .07 | −.28** | .09 | .00 |
| F_ACC.RC | .13 | .36*** | −.01 | .13 | .57*** | .65*** | .01 | −.04 | .34*** | .02 | __ | −.06 | −.08 | .09 | .15 | −.23* | .00 | −.03 |
| F_RT.RC | −.09 | −.20* | .13 | .09 | −.24* | .15 | .05 | −.08 | −.15 | .17 | −.13 | __ | −.01 | .02 | −.08 | −.30** | .23* | −.10 |
| ARI_C | −.08 | −.08 | .17 | .13 | −.03 | −.11 | .33*** | .26* | −.05 | .19 | −.17 | −.05 | __ | .37*** | .28** | .00 | .02 | .04 |
| ARI_P | −.05 | −.05 | .04 | −.03 | −.03 | .15 | −.12 | −.08 | −.06 | .08 | .07 | .14 | .22* | __ | .64*** | −.21* | −.03 | −.13 |
| CBCL | .01 | .03 | .10 | .17 | −.07 | .09 | .02 | .03 | −.01 | .04 | −.08 | .17 | .29** | .54*** | __ | −.12 | −.03 | −.14 |
| Age | .17 | −.18 | .17 | .03 | −.08 | −.44*** | .15 | .15 | −.01 | −.20* | −.28** | −.09 | .01 | −.21* | −.08 | __ | −.16 | .08 |
| Gender | .05 | .11 | .00 | .02 | .41*** | .10 | −.12 | −.13 | .03 | .17 | .25** | −.02 | .00 | .01 | .01 | −.15 | __ | −.06 |
| IQ | .14 | .10 | .12 | .10 | .01 | −.06 | .10 | .16 | .18 | −.04 | −.03 | −.10 | .01 | −.07 | .03 | .08 | −.06 | __ |

Note. The within-time correlations between Visit 1 variables are above the diagonal; the within-time correlations between Visit 2 variables are below the diagonal. NF_ACC = Non-Frustration Runs Accuracy; NF_RT = Non-Frustration Runs Reaction Time; NF_happy = Non-Frustration Runs Happy/Unhappy Ratings; NF_frust = Non-Frustration Runs Frustration Ratings; F_ACC = Frustration Runs Accuracy; F_RT = Frustration Runs Reaction Time; F_happy = Frustration Runs Happy/Unhappy Ratings; F_frust = Frustration Runs Frustration Ratings; NF_ACC.RC = Non-Frustration Runs Accuracy Response Cost; NF_RT.RC = Non-Frustration Runs Reaction Time Response Cost; F_ACC.RC = Frustration Runs Accuracy Response Cost; F_RT.RC = Frustration Runs Reaction Time Response Cost; ARI_C = Affective Reactivity Index Child Report; ARI_P = Affective Reactivity Index Parent Report; CBCL = Child Behavior Checklist. Gender was coded as 0 for male and 1 for female.

* p < .05; ** p < .01; *** p < .001.

Affective Posner 2 Task

This task was adapted from the Affective Posner task used in previous studies (Deveney et al., 2013; Rich et al., 2005, 2007, 2010). Trials consisted of 1) a fixation cross, 2) two boxes, 3) a blue cue appearing in one of the boxes, 4) a white target appearing in one of the boxes, 5) a jittered inter-stimulus interval (a blank screen), and 6) feedback (Figure 1a). Participants were asked to identify the target’s location (left or right) by button press. For 75% of trials, the white target appeared in the same box as the blue cue (valid trials); for 25% of trials, the white target appeared in the opposite location (invalid trials).
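The cue-target contingency described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' task code: the function name `make_trials`, the dictionary keys, the exact-proportion shuffling, and the random choice of cue side are all assumptions. The source specifies only the 75%/25% split between valid and invalid trials.

```python
import random

def make_trials(n_trials, valid_fraction=0.75, seed=None):
    """Build a shuffled list of cue/target trials (hypothetical helper).

    Assumes the valid/invalid split is exact within a run; the source
    does not say whether the split was exact or probabilistic.
    """
    rng = random.Random(seed)
    n_valid = round(n_trials * valid_fraction)
    validity = [True] * n_valid + [False] * (n_trials - n_valid)
    rng.shuffle(validity)

    trials = []
    for is_valid in validity:
        cue = rng.choice(["left", "right"])
        if is_valid:
            target = cue  # valid trial: target appears in the cued box
        else:
            target = "right" if cue == "left" else "left"  # invalid trial
        trials.append({"cue": cue, "target": target, "valid": is_valid})
    return trials

trials = make_trials(32, seed=1)
print(sum(t["valid"] for t in trials), "valid trials out of", len(trials))
```

Fixing the seed makes a run reproducible, which may help when the same trial order is wanted at test and retest.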

Figure 1a. Trial Structure during the Frustration Condition (Game 3)

Prior to completing the task, participants underwent a 50-trial training during which they received accurate feedback on their performance but did not win or lose money. The task itself consisted of two non-frustration runs, a pilot frustration run, and two frustration runs (Figure 1b). During the two non-frustration runs (Figure 1b), participants received accurate feedback on their performance, earning $0.50 for every correct response and losing $0.50 for every incorrect response. During the pilot frustration run and the two frustration runs, participants were instructed that they needed to respond both correctly and quickly in order to win money. The pilot frustration run of the task (Figure 1b) consisted of 32 trials and was used to acclimate participants to the “real” frustration runs; participants received rigged feedback on 10% of correct trials. Specifically, on trials with rigged feedback, participants were informed that they were “too slow,” and lost money regardless of their actual reaction time. In the two “real” frustration runs (Figure 1b), participants received rigged feedback on 60% of correct trials. After each run of the task, participants rated their feelings of unhappiness and frustration using 9-point Likert scales (i.e., 1 = “happy” or “not at all frustrated”; 9 = “sad” or “extremely frustrated”). Task variables [i.e., accuracy, reaction time (RT) on correct trials] and valence ratings (unhappiness, frustration) from the two non-frustration runs and two frustration runs were used in the reliability and validity analyses.
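The feedback rules in this paragraph can be expressed as a small function. The sketch below is illustrative only. The function name, the per-trial random draw, and the use of the same $0.50 stake in frustration runs are assumptions (the source states the stake only for the non-frustration runs). The handling of responses that really were slow is not described in the source and is omitted.

```python
import random

STAKE = 0.50  # dollars per trial; stated in the source for non-frustration runs only

def feedback(correct, rigged_fraction, rng):
    """Return (message, change in winnings) for one trial (hypothetical helper).

    rigged_fraction is the share of correct trials given "TOO SLOW" feedback
    regardless of actual reaction time: 0.0 in non-frustration runs,
    0.10 in the pilot frustration run, 0.60 in the frustration runs.
    """
    if not correct:
        return "WRONG", -STAKE
    if rng.random() < rigged_fraction:
        return "TOO SLOW", -STAKE  # rigged loss on a correct response
    return "YOU WIN", STAKE

# Simulate 1,000 correct responses in a frustration run.
rng = random.Random(0)
outcomes = [feedback(True, 0.60, rng)[0] for _ in range(1000)]
print(outcomes.count("TOO SLOW") / len(outcomes))  # close to 0.60 by construction
```

A fixed schedule of rigged trials would also satisfy the description; the random draw is used here only because it is the shortest way to show the rule.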

Figure 1b. Task Structure

Note. (1a) ITI = inter-trial interval; ISI = inter-stimulus interval. In the frustration condition, 40% of correct responses were followed by positive feedback (“YOU WIN”), and 60% of correct responses were followed by negative feedback (“TOO SLOW”). All incorrect responses were followed by negative feedback (“WRONG”).

It is important to examine attention shifting in the context of frustration because the ability to orient attention away from frustrating stimuli is a critical component of emotion regulation. When frustrated and confronted with a negative event, irritable children may have difficulty disengaging from the blocked goal and the associated negative affect and shifting their attention to helpful distractors or emotion regulation strategies (Deveney et al., 2013). Indeed, previous research has shown that children with severe irritability responded more slowly to invalid trials vs. valid trials, suggesting difficulties in attention shifting (Deveney et al., 2013). Given this, we also examined response cost (i.e., validity effect) on accuracy and RT, as a measure of attention orienting/shifting. Specifically, we tested the test-retest reliability and validity of the accuracy and RT differences between invalid and valid trials (invalid – valid). Data from the pilot frustration runs were not included in the analyses.
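The response cost described above is a simple difference score. A minimal sketch follows, assuming trial records with hypothetical keys `valid`, `correct`, and `rt_ms`, and assuming that RT is averaged over correct trials only (the source states this for the main RT variable but not explicitly for response cost).

```python
from statistics import mean

def response_cost(trials):
    """Invalid-minus-valid differences in accuracy (%) and RT (ms)."""
    def accuracy(subset):
        return 100.0 * mean(1.0 if t["correct"] else 0.0 for t in subset)

    def mean_rt(subset):
        # Assumption: only correct trials contribute to mean RT.
        return mean(t["rt_ms"] for t in subset if t["correct"])

    valid = [t for t in trials if t["valid"]]
    invalid = [t for t in trials if not t["valid"]]
    return {
        "accuracy_cost": accuracy(invalid) - accuracy(valid),
        "rt_cost": mean_rt(invalid) - mean_rt(valid),
    }

# Made-up trials for illustration only.
example = [
    {"valid": True, "correct": True, "rt_ms": 300.0},
    {"valid": True, "correct": True, "rt_ms": 320.0},
    {"valid": True, "correct": True, "rt_ms": 340.0},
    {"valid": True, "correct": False, "rt_ms": 500.0},
    {"valid": False, "correct": True, "rt_ms": 400.0},
    {"valid": False, "correct": False, "rt_ms": 450.0},
]
cost = response_cost(example)
print(cost)
```

With this sign convention, lower accuracy on invalid trials gives a negative accuracy cost and slower responses on invalid trials give a positive RT cost.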

After completing the task for the second time, participants filled out a self-report questionnaire to assess whether they were deceived. Participants were then debriefed about the use of deception in the task (i.e., they were told that the “too slow” feedback was not based on their reaction time). No participant reported marked distress due to frustration or deception.

Irritability Measures

ARI.

Children’s irritability was assessed using the parent- and child-reports of the ARI (Stringaris et al., 2012a). This is a six-item scale assessing the frequency, duration, and threshold of irritability in the past 6 months (Stringaris et al., 2012a). Each item was rated on a 3-point Likert scale from 0 (not true) to 2 (certainly true). Sample items include “get angry frequently,” “stay angry for a long time,” and “lose temper easily.” Past research has demonstrated good internal consistency (α’s ≥ .80), good validity (i.e., construct validity and discriminant validity), and a single-factor structure in clinical and community samples (DeSousa et al., 2013; Mulraney et al., 2014; Stringaris et al., 2012a). The child-reported and parent-reported ARI in this sample also showed good internal consistency at both visits (α’s = .86 and .89 for child ARI; .89 and .84 for parent ARI). The sum scores of the six items on the child- and parent-report were used in the analyses (a possible score of 0–12).
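For readers who want to check an internal-consistency value such as the α's above, the following sketch computes Cronbach's alpha from item-level ratings using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the sum score). The function name and the ratings are made up for illustration; they are not study data, and the source does not describe how SPSS was configured.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for complete cases.

    rows: one list of item ratings per respondent, all the same length.
    """
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings (0-2) from five respondents on three items.
ratings = [
    [0, 0, 1],
    [1, 1, 1],
    [2, 1, 2],
    [0, 1, 0],
    [2, 2, 2],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```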

CBCL.

Parents completed the CBCL (Achenbach, 1991). Three items extracted from the parent-reported CBCL (“temper tantrums or hot temper,” “stubborn, sullen or irritable,” and “sudden changes in mood or feelings”) were also used to assess childhood irritability in the past 6 months. Past studies have also used scores derived from these items to assess childhood irritability (Aebi et al., 2013; Roberson-Nay et al., 2015; Stringaris et al., 2012b; Wiggins et al., 2014). Each item was rated on a 3-point Likert scale from 0 (not true) to 2 (very true or often true). Good internal consistency (Aebi et al., 2013; Stringaris et al., 2012b; Wiggins et al., 2014) and a single-factor structure (Roberson-Nay et al., 2015; Stringaris et al., 2012b; Wiggins et al., 2014) have been demonstrated in past research. For the current sample, the internal consistency of this measure was good at both visits (α’s = .84 at Visit 1 and .78 at Visit 2). The sum of the three items was used in the analyses (a possible score of 0–6).
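The scoring rule for both scales is a sum of items rated 0–2. A small sketch with basic input checks is shown below; the helper name `sum_score` and the example ratings are hypothetical.

```python
def sum_score(ratings, n_items, max_rating=2):
    """Sum item ratings after checking item count and rating range."""
    if len(ratings) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(ratings)}")
    if any(r not in range(max_rating + 1) for r in ratings):
        raise ValueError(f"ratings must be integers from 0 to {max_rating}")
    return sum(ratings)

ari_total = sum_score([1, 0, 2, 1, 0, 1], n_items=6)  # possible range 0-12
cbcl_total = sum_score([0, 1, 1], n_items=3)          # possible range 0-6
print(ari_total, cbcl_total)
```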

Data analysis

All analyses were performed using SPSS. There were no missing data in the task variables at either visit; irritability measures were missing for some participants (n = 4–9 at Visit 1; n = 10–16 at Visit 2). Missing data were handled with pairwise deletion.
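Pairwise deletion means that each correlation uses every participant who has both of the two variables involved, so the effective n can differ from one correlation to the next. A minimal Python sketch of the idea follows; the function name and the use of `None` for a missing value are assumptions for illustration, and the analyses themselves were run in SPSS.

```python
from math import sqrt

def pearson_pairwise(x, y):
    """Pearson's r over cases where both values are present.

    Returns (r, n), where n is the number of complete pairs used.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n

# Made-up scores; None marks a missing questionnaire.
r, n = pearson_pairwise([1, 2, 3, None, 5], [2, 4, 6, 8, None])
print(r, n)
```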

Test-retest reliability.

We computed Pearson’s correlation (r) and ICC [two-way mixed-effects model, ICC (C, k); McGraw and Wong, 1996] to evaluate the test-retest reliability of the Affective Posner 2 task variables (accuracy, RT) and valence ratings (happy/unhappy, frustration). We also computed Pearson’s correlation and ICC to determine the test-retest reliability of response cost (i.e., invalid vs. valid trials) on accuracy and RT. All of these analyses were conducted separately for non-frustration and frustration runs. To determine the test-retest reliability of the irritability measures (i.e., child ARI, parent ARI, and the CBCL irritability), Pearson’s r and ICC were also calculated. We evaluated reliability using Cicchetti and Sparrow’s (1981) definition for judging the clinical significance of ICC values: < .40 = poor; .40–.59 = fair; .60–.74 = good; > .74 = excellent.
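To make the ICC (C, k) definition concrete, the sketch below computes the consistency ICCs from a participants × visits table using the two-way ANOVA mean squares in McGraw and Wong (1996): ICC (C, 1) = (MSR − MSE) / (MSR + (k − 1) MSE) and ICC (C, k) = (MSR − MSE) / MSR. It also applies the Cicchetti and Sparrow (1981) cutoffs. The function names and the example scores are hypothetical, and rounding to two decimals before applying the cutoffs is an assumption.

```python
def icc_consistency(scores):
    """Return (ICC(C,1), ICC(C,k)) for a list of per-participant score lists."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between visits
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    icc_single = (msr - mse) / (msr + (k - 1) * mse)
    icc_average = (msr - mse) / msr
    return icc_single, icc_average

def cicchetti_label(icc):
    """Classify an ICC using the Cicchetti and Sparrow (1981) cutoffs."""
    value = round(icc, 2)  # assumption: cutoffs applied to two-decimal values
    if value < 0.40:
        return "poor"
    if value < 0.60:
        return "fair"
    if value <= 0.74:
        return "good"
    return "excellent"

# Made-up Visit 1 / Visit 2 scores for five participants.
scores = [[2, 3], [4, 3], [5, 6], [7, 5], [8, 9]]
icc_single, icc_average = icc_consistency(scores)
print(round(icc_average, 2), cicchetti_label(icc_average))
```

With two visits, ICC (C, k) equals 2 × ICC (C, 1) / (1 + ICC (C, 1)), and ICC (C, 1) is close to Pearson’s r when the two visits have similar variances. This is why the average-measures ICCs run higher than the corresponding r values.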

Construct validity and convergent validity.

To test the construct validity of the Affective Posner 2 task, i.e., whether the task elicits negative affect effectively at both visits, we examined changes in valence ratings over the course of the task and across visits. To this end, we conducted two separate two-way Runs (5) × Visit (2) repeated-measures analyses of variance (ANOVA), one for happy/unhappy ratings and the other for frustration ratings. We also evaluated convergent validity of the task by examining correlations between task variables (i.e., accuracy, RT, response cost, ratings) and child- and parent-reported irritability. To determine the convergent validity of the irritability measures (i.e., child ARI, parent ARI, and the CBCL irritability) and the agreement between these measures, Pearson’s r and ICC were also calculated.

Results

Test-retest reliability of task ratings and behavior

Accuracy, RT, happy/unhappy ratings, and frustration ratings during both non-frustration runs and frustration runs were reliable and stable over 2–4 weeks (r’s = .28–.64, p’s < .01; ICCs = .44 to .78, p’s < .001; Table 2). Most measures had good to excellent test-retest reliability (ICC > .60); during non-frustration runs, the happy/unhappy ratings and frustration ratings had fair test-retest reliability (ICCs = .44 and .56). Test-retest reliability for happy/unhappy ratings was better during frustration runs than during non-frustration runs (r = .63 vs. r = .28, p = .001, using Fisher’s r-to-z transformation); a similar finding was evident for frustration ratings, but at a trend level (r = .58 vs. r = .39, p = .07). Of note, the lower reliability of ratings during non-frustration runs, compared with frustration runs, may be due to the restricted range of the ratings.
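The Fisher r-to-z comparison can be sketched as follows. The source does not say which variant of the test was used; this sketch assumes the common independent-samples formula with n = 109 for each correlation. Because both correlations come from the same participants, a test for dependent correlations could also be argued for, so treat this as an illustration rather than a reconstruction of the authors' procedure. The function name is hypothetical.

```python
from math import atanh, erfc, sqrt

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed z test for the difference between two correlations.

    Assumes the correlations come from independent samples.
    Returns (z, p).
    """
    z1, z2 = atanh(r1), atanh(r2)          # Fisher r-to-z transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2))              # two-tailed normal p value
    return z, p

z_happy, p_happy = compare_correlations(0.63, 109, 0.28, 109)
z_frust, p_frust = compare_correlations(0.58, 109, 0.39, 109)
print(round(p_happy, 3), round(p_frust, 2))
```

Under these assumptions the resulting p values are close to those reported in the text.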

Table 2.

Test-retest Reliability of Task Variables

| Variables | Visit 1 Mean (SD) | Visit 2 Mean (SD) | Pearson’s r | Intraclass Correlation (ICC) | Difference Score |
| --- | --- | --- | --- | --- | --- |
| Non-Frustration Runs | | | | | |
| Accuracy (%) | 97.00 (3.08) | 96.15 (3.75) | .54*** | .69*** | −0.85 |
| RT (ms) | 475.04 (108.19) | 456.50 (120.91) | .57*** | .72*** | −18.54 |
| Happy/unhappy rating | 2.61 (1.64) | 2.14 (1.45) | .28** | .44*** | −0.47 |
| Frustration rating | 1.96 (1.44) | 1.95 (1.42) | .39*** | .56*** | −0.01 |
| Frustration Runs | | | | | |
| Accuracy (%) | 82.34 (11.97) | 83.48 (11.77) | .64*** | .78*** | 1.14 |
| RT (ms) | 334.24 (70.09) | 351.96 (89.92) | .64*** | .76*** | 17.72 |
| Happy/unhappy rating | 4.74 (2.53) | 4.00 (2.47) | .63*** | .77*** | −0.74 |
| Frustration rating | 5.49 (2.66) | 4.38 (2.74) | .58*** | .74*** | −1.11 |

Note. RT = reaction time; ms = milliseconds.

** p < .01; *** p < .001.

Response cost in accuracy and RT during both non-frustration runs and frustration runs was reliable and stable over 2–4 weeks (r’s = .28–.55, p’s < .01; ICCs = .44 to .70, p’s < .001; Table 3). Specifically, response cost in accuracy had good test-retest reliability (ICCs = .70 and .64), while response cost in RT had fair test-retest reliability (ICCs = .58 and .44).

Table 3.

Test-retest Reliability of Response Cost

| Variables | Visit 1 Mean (SD) | Visit 2 Mean (SD) | Pearson’s r | Intraclass Correlation (ICC) | Difference Score |
| --- | --- | --- | --- | --- | --- |
| Non-Frustration Runs | | | | | |
| Accuracy (%) | −5.25 (8.73) | −7.80 (10.50) | .55*** | .70*** | −2.55 |
| RT (ms) | 80.80 (53.85) | 85.46 (49.65) | .41** | .58*** | 4.66 |
| Frustration Runs | | | | | |
| Accuracy (%) | −38.04 (21.38) | −35.39 (23.15) | .47*** | .64*** | 2.65 |
| RT (ms) | 119.52 (81.25) | 126.21 (75.60) | .28** | .44*** | 6.69 |

Note. RT = reaction time; ms = milliseconds.

** p < .01; *** p < .001.

Test-retest reliability of, and agreement between, child and parent irritability measures

Child-reported ARI, parent-reported ARI, and CBCL irritability scores were all reliable and stable over 2–4 weeks (r’s = .79–.82, p’s < .001; ICCs = .88, .90, and .90, respectively, p’s < .001; Table 4), suggesting excellent test-retest reliability for these measures. At both visits, the agreement between child-reported and parent-reported ARI was only low to moderate (r’s = .37 and .22, p’s < .05), as was the agreement between child-reported ARI and parent-reported CBCL irritability (r’s = .28 and .29, p’s < .01; Table 4). At both visits, parent-reported ARI was highly correlated with parent-reported CBCL irritability (r’s = .64 and .54, p’s < .001; Table 4), providing support for the convergent validity of these measures.

Table 4.

Reliability of and Correlations between Irritability Measures

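| | Visit 1: Child ARI (n = 104) | Visit 1: Parent ARI (n = 105) | Visit 1: CBCL Irritability (n = 100) | Visit 2: Child ARI (n = 93) | Visit 2: Parent ARI (n = 99) | Visit 2: CBCL Irritability (n = 98) |
| --- | --- | --- | --- | --- | --- | --- |
| Visit 1 | | | | | | |
| Child ARI | __ | .37*** | .28** | .79*** | .26* | .26* |
| Parent ARI | | __ | .64*** | .29** | .82*** | .56*** |
| CBCL Irritability | | | __ | .23* | .55*** | .82*** |
| Visit 2 | | | | | | |
| Child ARI | | | | __ | .22* | .29** |
| Parent ARI | | | | | __ | .54*** |
| CBCL Irritability | | | | | | __ |
| Mean (SD) | 3.43 (3.00) | 1.55 (2.49) | 0.88 (1.44) | 2.94 (3.13) | 1.49 (2.15) | 0.74 (1.26) |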

Note. ARI = Affective Reactivity Index; CBCL = Child Behavior Checklist.

* p < .05; ** p < .01; *** p < .001.

Construct validity and convergent validity of the task

As shown in Figures 2a and 2b, participants felt less happy and more frustrated in frustration runs than in non-frustration runs, demonstrating the construct validity of the task, i.e., the ability of the task to elicit negative affect. This was true for both Visit 1 and Visit 2. Unhappiness and frustration ratings during frustration runs were lower at Visit 2 than at Visit 1; however, at Visit 2, the ratings during frustration runs were still higher than those during non-frustration runs.

Figure 2a. Happy-Unhappy Ratings across Task by Visits

Figure 2b. Frustration Ratings across Task by Visits

Note. NF = Non-Frustration; F = Frustration. * p < .05; ** p < .01; *** p < .001. (a) All the pairwise comparisons between runs were significant at p < .05, except for the one between F Run 1 and F Run 2 at Visit 2; (b) All the pairwise comparisons between runs were significant at p < .05, except for the ones between NF Run 1 and NF Run 2 at both Visit 1 and Visit 2.

At both visits, accuracy, RT, and response cost did not correlate with child- or parent-reported irritability (r’s = −.17 to .19, p’s > .07), with the exception of a small positive correlation between parent-reported CBCL irritability and RT in non-frustration runs at Visit 1 (r = .20, p < .05). The only consistently significant associations were between child-reported ARI and self-reported ratings (unhappiness and frustration) during the frustration runs of the task (r’s = .32 and .33, p’s < .001, at Visit 1 and 2, between child ARI and unhappiness; r’s = .23 and .26, p’s < .05, at Visit 1 and 2, between child ARI and frustration).

Additional analyses

Because this was predominantly a twin sample, we conducted analyses to account for non-independence in our data due to including twins from the same family. Specifically, we conducted mixed models with family unit as the random effect to evaluate the reliability and validity of the task and irritability measures. We also adjusted for gender, days between visits, and IQ in the mixed models, and calculated partial correlations after covarying gender, days between visits, and IQ. All the correlations remained similar and significant, suggesting that reliability and validity were not inflated by nested family effects and were not influenced by the gender of the child, days between visits, or IQ.
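The partial correlations mentioned above remove the linear influence of covariates from a correlation. One way to sketch this is the standard recursion on a correlation matrix, where controlling for one variable k gives r_ij.k = (r_ij − r_ik r_jk) / sqrt((1 − r_ik²)(1 − r_jk²)), and further covariates are handled by applying the same formula to partial correlations. The function name and the correlation values are made up for illustration; this does not reproduce the mixed models, which were fit in SPSS.

```python
from math import sqrt

def partial_r(corr, i, j, controls):
    """Partial correlation between variables i and j given a list of controls.

    corr is a symmetric correlation matrix (list of lists);
    i, j, and the entries of controls are row/column indices.
    """
    if not controls:
        return corr[i][j]
    k, rest = controls[0], controls[1:]
    r_ij = partial_r(corr, i, j, rest)
    r_ik = partial_r(corr, i, k, rest)
    r_jk = partial_r(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Hypothetical correlations among Visit 1 score (0), Visit 2 score (1),
# and one covariate (2).
corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
adjusted = partial_r(corr, 0, 1, [2])
print(round(adjusted, 3))
```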

Discussion

The present study examined test-retest reliability and validity of a frustration paradigm, i.e., the Affective Posner 2 task, and two measures of irritability, i.e., the ARI and the CBCL irritability scale. We found that the accuracy, RT, response cost (validity effect), happy/unhappy ratings, and frustration ratings of the Affective Posner 2 task demonstrated good test-retest reliability over 2–4 weeks. Child-reported ARI, parent-reported ARI, and CBCL irritability scores all showed excellent test-retest reliability over 2–4 weeks. In addition, our frustration paradigm demonstrated validity, i.e., it effectively elicited negative affect in youth. Specifically, participants felt less happy and more frustrated after frustration runs than after non-frustration runs. This was true even when participants performed the task for the second time after 2–4 weeks. Moreover, self-rated unhappiness and frustration during frustration runs correlated positively with self-reported but not parent-reported irritability, providing only modest support for the convergent validity of the task with other irritability measures. Overall, these findings suggest that the Affective Posner 2 task and the irritability measures are reliable and potentially valid and may be useful for future studies on childhood irritability and frustration.

It should be noted that the ICCs and Pearson’s correlations for the variables in the non-frustration runs were lower than those in the frustration runs. This is expected because the variances of the variables in the non-frustration runs were low. Compared to the frustration runs, the non-frustration runs were relatively easy; participants’ mean accuracy was over 96% during the non-frustration runs at both visits, and most participants were happy and not frustrated because they were winning money. As a result, the variables in the non-frustration runs were less reliable due to their restricted range.

Although overall our frustration paradigm demonstrated fair to good test-retest reliability, it was less stable than parent- or child-reports of irritability. This may be because the two types of assessments (i.e., clinical rating scale vs. task paradigm) elicit information about two different aspects of irritability, e.g., trait vs. state. Irritability as a trait has been found to be relatively stable (Caprara et al., 2007; Leibenluft and Stoddard, 2013). It is possible that our frustration paradigm provides information about irritability that is more state-like (e.g., aberrant responses to blocked goals such as anger and frustration). In that case, one would expect its measures to be more context-dependent and less reliable than parent- and child-reports, which also assess trait-like irritability and thus may be more context-independent and hence more reliable.

Our reliability findings have important implications for treatment and longitudinal research. Test-retest reliability sets an upper limit for longitudinal stability because stability is reduced by both test-retest unreliability and true score changes (McCrae et al., 2011). As such, measures with poor test-retest reliability have limited ability to detect true score changes, i.e., changes that reflect meaningful developmental alterations rather than variation due to unreliability. Therefore, a reliable measure is fundamental to detecting meaningful changes associated with development or intervention (Leon, 2008). Similarly, test-retest reliability is important for behavioral genetic research because poor test-retest reliability may attenuate heritability estimates (McCrae et al., 2011). Of note, the fact that participants were still frustrated when they completed the task for the second time after 2–4 weeks indicates that the Affective Posner 2 may be an effective probe to use in clinical trials (i.e., pre- and post-treatment) in which irritability and aberrant responses to frustration are a target of treatment.

At both visits, the correlations between irritability measures were high within informants (i.e., parent ARI and parent CBCL irritability), but low to moderate between informants (i.e., child ARI and parent ARI or child ARI and parent CBCL irritability). This is consistent with previous research on youth’s behavioral and emotional symptoms, which typically finds high agreement within the same informant or similar informants (e.g., pairs of parents) and low agreement between informants (e.g., subjects and parents; see Achenbach et al., 1987 for a meta-analysis; De Los Reyes and Kazdin, 2005; Youngstrom et al., 2000). The low agreement between parent- and child-reports is worth noting because it suggests that parents and children each contribute variance not accounted for by the other; thus, children’s self-report cannot substitute for reports by the parents, and vice versa. At both visits, the correlation between parent ARI and parent CBCL irritability was high (r’s = .64 and .54), suggesting convergent validity of the two measures. However, there is still non-shared variance between these two measures, suggesting that the ARI and CBCL irritability may each provide unique information about irritability that is not measured by the other. Given the importance of a multi-method approach to minimize measurement error from a single source (Cicchetti and Cohen, 1995), future research may benefit from including both measures to assess irritability.

Of note, at both visits, child-reported ARI correlated positively with frustration and unhappiness ratings during frustration runs. However, child-reported ARI was not related to other task variables (i.e., accuracy, RT, response cost), and parent-reported irritability was not related to either performance or ratings during the task. The null associations between parent-reported irritability and the task variables may be attributed to the non-clinical nature of the sample, i.e., low levels of parent-reported irritability in this community sample. Such poor agreement between parent-reported irritability and the task variables is consistent with the low cross-informant agreement in the child development and psychopathology literature mentioned above (Achenbach et al., 1987; De Los Reyes and Kazdin, 2005; Youngstrom et al., 2000). The correlations between child ARI and task ratings, although significant, were moderate and may be due to shared method variance, i.e., both were reported by the same informant (the child). Although modest associations between questionnaire measures and behavioral tasks are to be expected, this finding suggests that the two types of assessments may be assessing different aspects of irritability (i.e., trait vs. state, as described earlier) or even different constructs. Thus, our findings provide only modest support for the convergent validity of the task with other irritability measures and call for further validation in future research.

The present findings should be evaluated in light of the following limitations. First, the sample consists mostly of twins, although we accounted for this effect in the analyses. Also, the age range of the sample is narrow (i.e., age 9–14). Thus, results may not generalize to non-twin samples or youth outside of this age range. Second, test-retest reliability may be decreased in clinical populations due to disease progression or symptom fluctuation (Duff, 2012). Future research with clinical samples is needed to assess whether the current task and measures have acceptable reliability in youth with disorders. Third, children who were on medications for ADHD were included in the study. Medications such as stimulants may have affected performance (e.g., RT) or ratings during the task. Moreover, it is unclear whether parents and children provided irritability ratings based on behaviors on or off medications. Finally, the Affective Posner 2 paradigm does not allow us to parse out different aspects of reward processes (e.g., reward anticipation and outcomes). This paradigm may be measuring frustration in a specific context (i.e., receipt of nonreward). It is unclear whether frustration arose from loss of expected reward (frustrative nonreward), from anticipation of nonreward, or from both.

In conclusion, the Affective Posner 2 task demonstrated good test-retest reliability and successfully evoked frustration in youth at both test and retest, although support for its convergent validity was modest. The parent- and child-reports of the ARI and the parent-report of the CBCL irritability scale also showed good test-retest reliability. These reliability data provide an important basis for future studies to further investigate the Affective Posner 2’s validity, as validity cannot be achieved without adequate reliability. Together, our findings suggest that the Affective Posner 2 paradigm, as well as the ARI and CBCL irritability scales, may be useful tools for research on irritability.

Acknowledgements

This research was supported in part by the Intramural Research Program of the National Institute of Mental Health (NIMH), National Institutes of Health (NIH). This study was also supported by R01MH098055 (JMH) and R01MH101815 (RRN).

Footnotes

Financial Disclosures

The authors disclose no conflicts of interest related to this work.

References

  1. Achenbach TM, 1991. Manual for the Child Behavior Checklist/4–18 and 1991 Profile. Burlington, VT: University of Vermont, Department of Psychiatry.
  2. Achenbach TM, McConaughy SH, Howell CT, 1987. Child/adolescent behavioral and emotional problems: implications of cross-informant correlations for situational specificity. Psychol. Bull 101, 213–232.
  3. Achenbach TM, Rescorla L, 2001. Manual for the ASEBA school-age forms & profiles. Burlington, VT: University of Vermont, Research Center for Children, Youth, and Families.
  4. Aebi M, Plattner B, Metzke CW, Bessler C, Steinhausen HC, 2013. Parent- and self-reported dimensions of oppositionality in youth: construct validity, concurrent validity, and the prediction of criminal outcomes in adulthood. J. Child Psychol. Psychiatry 54, 941–949.
  5. American Psychiatric Association, 2013. Diagnostic and Statistical Manual of Mental Disorders, 5th Edition. Washington, DC: Author.
  6. Berkowitz L, 1989. Frustration-aggression hypothesis: Examination and reformulation. Psychol. Bull 106, 59–73.
  7. Caprara GV, Paciello M, Gerbino M, Cugini C, 2007. Individual differences conducive to aggression and violence: Trajectories and correlates of irritability and hostile rumination through adolescence. Aggress. Behav 33, 359–374.
  8. Carney DM, Moroney E, Machlin L, Hahn S, Savage JE, Lee M, Towbin KA, Brotman MA, Pine DS, Leibenluft E, Roberson-Nay R, Hettema JM, 2016. The Twin Study of Negative Valence Emotional Constructs. Twin Res. Hum. Genet 19, 456–464.
  9. Cicchetti D, Cohen D, 1995. Developmental Psychopathology: Vol. 1. Theory and Methods. New York: Wiley.
  10. Cicchetti DV, Sparrow SA, 1981. Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior. Am. J. Ment. Defic 86, 127–137.
  11. De Los Reyes A, Kazdin AE, 2005. Informant discrepancies in the assessment of childhood psychopathology: a critical review, theoretical framework, and recommendations for further study. Psychol. Bull 131, 483–509. doi: 10.1037/0033-2909.131.4.483
  12. DeSousa DA, Stringaris A, Leibenluft E, Koller SH, Manfro GG, Salum GA, 2013. Cross-cultural adaptation and preliminary psychometric properties of the Affective Reactivity Index in Brazilian Youth: implications for DSM-5 measured irritability. Trends Psychiatry Psychother. 35, 171–180.
  13. Deveney CM, Connolly ME, Haring CT, Bones BL, Reynolds RC, Kim P, Pine DS, Leibenluft E, 2013. Neural mechanisms of frustration in chronically irritable children. Am. J. Psychiatry 170, 1186–1194. doi: 10.1176/appi.ajp.2013.12070917
  14. Duff K, 2012. Evidence-based indicators of neuropsychological change in the individual patient: relevant concepts and methods. Arch. Clin. Neuropsychol 27, 248–261. doi: 10.1093/arclin/acr120
  15. Insel T, Cuthbert B, Garvey M, Heinssen R, Pine DS, Quinn K, Sanislow C, Wang P, 2010. Research domain criteria (RDoC): Toward a new classification framework for research on mental disorders. Am. J. Psychiatry 167, 748–751.
  16. Leibenluft E, 2011. Severe mood dysregulation, irritability, and the diagnostic boundaries of bipolar disorder in youths. Am. J. Psychiatry 168, 129–142. doi: 10.1176/appi.ajp.2010.10050766
  17. Leibenluft E, Stoddard J, 2013. The developmental psychopathology of irritability. Dev. Psychopathol 25, 1473–1487. doi: 10.1017/S0954579413000722
  18. Leon AC, 2008. Implications of clinical trial design on sample size requirements. Schizophr. Bull 34, 664–669. doi: 10.1093/schbul/sbn035
  19. Lilley ECH, Silberg JL, 2013. The Mid-Atlantic Twin Registry, Revisited. Twin Res. Hum. Genet 16, 424–428.
  20. McCrae RR, Kurtz JE, Yamagata S, Terracciano A, 2011. Internal consistency, retest reliability, and their implications for personality scale validity. Pers. Soc. Psychol. Rev 15, 28–50. doi: 10.1177/1088868310366253
  21. McGraw KO, Wong SP, 1996. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1, 30–46.
  22. Mulraney MA, Melvin GA, Tonge BJ, 2014. Psychometric properties of the affective reactivity index in Australian adults and adolescents. Psychol. Assess 26, 148–155. doi: 10.1037/a0034891
  23. Perlman SB, Jones BM, Wakschlag LS, Axelson D, Birmaher B, Phillips ML, 2015. Neural substrates of child irritability in typically developing and psychiatric populations. Dev. Cogn. Neurosci 14, 71–80. doi: 10.1016/j.dcn.2015.07.003
  24. Rich BA, Holroyd T, Carver FW, Onelio LM, Mendoza JK, Cornwell BR, Fox NA, Pine DS, Coppola R, Leibenluft E, 2010. A preliminary study of the neural mechanisms of frustration in pediatric bipolar disorder using magnetoencephalography. Depress. Anxiety 27, 276–286. doi: 10.1002/da.20649
  25. Rich BA, Schmajuk M, Perez-Edgar KE, Fox NA, Pine DS, Leibenluft E, 2007. Different psychophysiological and behavioral responses elicited by frustration in pediatric bipolar disorder and severe mood dysregulation. Am. J. Psychiatry 164, 309–317. doi: 10.1176/ajp.2007.164.2.309
  26. Rich BA, Schmajuk M, Perez-Edgar KE, Pine DS, Fox NA, Leibenluft E, 2005. The impact of reward, punishment, and frustration on attention in pediatric bipolar disorder. Biol. Psychiatry 58, 532–539. doi: 10.1016/j.biopsych.2005.01.006
  27. Roberson-Nay R, Leibenluft E, Brotman MA, Myers J, Larsson H, Lichtenstein P, Kendler KS, 2015. Longitudinal Stability of Genetic and Environmental Influences on Irritability: From Childhood to Young Adulthood. Am. J. Psychiatry 172, 657–664. doi: 10.1176/appi.ajp.2015.14040509
  28. Savage J, Verhulst B, Copeland W, Althoff RR, Lichtenstein P, Roberson-Nay R, 2015. A genetically informed study of the longitudinal relation between irritability and anxious/depressed symptoms. J. Am. Acad. Child Adolesc. Psychiatry 54, 377–384. doi: 10.1016/j.jaac.2015.02.010
  29. Stoddard J, Stringaris A, Brotman MA, Montville D, Pine DS, Leibenluft E, 2014. Irritability in child and adolescent anxiety disorders. Depress. Anxiety 31, 566–573. doi: 10.1002/da.22151
  30. Stringaris A, 2011. Irritability in children and adolescents: a challenge for DSM-5. Eur. Child Adolesc. Psychiatry 20, 61–66. doi: 10.1007/s00787-010-0150-4
  31. Stringaris A, Goodman R, Ferdinando S, Razdan V, Muhrer E, Leibenluft E, Brotman MA, 2012a. The Affective Reactivity Index: a concise irritability scale for clinical and research settings. J. Child Psychol. Psychiatry 53, 1109–1117. doi: 10.1111/j.1469-7610.2012.02561.x
  32. Stringaris A, Zavos H, Leibenluft E, Maughan B, Eley TC, 2012b. Adolescent irritability: phenotypic associations and genetic links with depressed mood. Am. J. Psychiatry 169, 47–54. doi: 10.1176/appi.ajp.2011.10101549
  33. Wechsler D, 1999. Wechsler Abbreviated Scale of Intelligence (WASI). San Antonio, TX: Psychological Corporation.
  34. Wiggins JL, Mitchell C, Stringaris A, Leibenluft E, 2014. Developmental trajectories of irritability and bidirectional associations with maternal depression. J. Am. Acad. Child Adolesc. Psychiatry 53, 1191–1205, 1205 e1191–1194. doi: 10.1016/j.jaac.2014.08.005
  35. Youngstrom E, Loeber R, Stouthamer-Loeber M, 2000. Patterns and correlates of agreement between parent, teacher, and male adolescent ratings of externalizing and internalizing problems. J. Consult. Clin. Psychol 68, 1038–1050.
