PLOS ONE. 2022 Jan 13;17(1):e0262465. doi: 10.1371/journal.pone.0262465

Test-retest reliability of the HEXACO-100—And the value of multiple measurements for assessing reliability

Sam Henry 1,*,#, Isabel Thielmann 2, Tom Booth 1, René Mõttus 1,3,#
Editor: Frantisek Sudzina
PMCID: PMC8757920  PMID: 35025932

Abstract

Despite the widespread use of the HEXACO model as a descriptive taxonomy of personality traits, there remains limited information on the test-retest reliability of its commonly used inventories. Studies typically report internal consistency estimates, such as alpha or omega, but there are good reasons to believe that these do not accurately assess reliability. We report 13-day test-retest correlations for the domains, facets, and items of the 100- and 60-item English HEXACO Personality Inventory-Revised (HEXACO-100 and HEXACO-60). To evaluate the validity of these test-retest estimates, we then compare them to correlations between self- and informant-reports (i.e., cross-rater agreement), a widely used validity criterion. Median estimates of test-retest reliability were .88, .81, and .65 (N = 416) for domains, facets, and items, respectively. Facets’ and items’ test-retest reliabilities were highly correlated with their cross-rater agreement estimates, whereas internal consistencies were not. Overall, the HEXACO Personality Inventory-Revised demonstrates test-retest reliability similar to other contemporary measures. We recommend that short-term retest reliability be routinely calculated to assess reliability.

Introduction

The HEXACO Personality Inventory-Revised (HEXACO-PI-R [1]) is currently one of the most widely used personality questionnaires in psychology and beyond. Several key properties of its domain and facet scales, such as convergent and discriminant validity, gender differences, and measurement invariance across various countries and translations, have been reported in recent large-sample studies [2, 3]. The scales also demonstrate associations with numerous life outcomes [4–6]. Surprisingly, however, one of its most basic psychometric properties, test-retest reliability (or retest reliability; rTT), has rarely been assessed, and never in sufficiently large samples [7–9].

Reliability denotes “the consistency of a measure with itself” [10; p.464] or of two independent assessments of the same inventory [11]. Whereas internal reliability or internal consistency represents the consistency of responses for a trait across non-identical parallel forms, short-term rTT quantifies the extent to which test takers agree with themselves on the exact same items measured at different time points, and hence independently, assuming that test takers do not remember their previous responses and simply copy them at the second testing. For many, rTT is reliability, whereas internal consistency estimates like alpha (α) and omega (ω) are seen as imperfect approximations of reliability, designed for use when rTT is not available [12]. As succinctly noted by McCrae and colleagues [13]: “[i]nternal consistency of scales can be useful as a check on data quality but appears to be of limited utility for evaluating the potential validity of developed scales, and it should not be used as a substitute for retest reliability” (p. 28). In many cases, high internal consistency, reflecting redundancy among test constituents, is even undesirable because it may mean poor coverage of the content of the construct being measured–especially when constructs are broad–or an impractically long test. We note that α does not measure internal consistency of items per se: with hundreds or thousands of items, any scale has a high α but can have practically zero consistency among individual items. Instead, a scale’s α indexes the expected consistency among hypothetical item aggregates containing the same number of items as the scale. Nonetheless, in line with common usage and to clearly distinguish it from rTT, we will also refer to α as a measure of internal consistency.
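This point about α and scale length follows directly from the standardized alpha formula: for k items with mean inter-item correlation r̄, α = k·r̄ / (1 + (k − 1)·r̄). The brief R sketch below (our illustration, not code from the study) shows that even items with near-zero mutual consistency yield a high α once the scale is long enough:

    # Standardized alpha for k items with mean inter-item correlation rbar
    alpha_from_items <- function(k, rbar) {
      k * rbar / (1 + (k - 1) * rbar)
    }

    # With a mean inter-item correlation of only .05, alpha still approaches 1
    # as the number of items grows:
    sapply(c(10, 100, 1000), alpha_from_items, rbar = 0.05)
    # approximately 0.34, 0.84, 0.98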

In addition to its tighter theoretical adherence to reliability, rTT has empirical advantages over internal consistency. Because internal reliability confounds unreliability with unique trait information in items [14], it tends to be lower than rTT [15], and this difference may even be underestimated because occasion-specific state effects inflate internal consistency. Even more importantly, rTT, but not internal consistency, is a good predictor of scales’ cross-rater agreement (rCA) and long-term stability, two of the most straightforward and broadly applicable psychometric validity criteria [13, 16]. These associations with validity criteria follow from rTT being a more accurate reliability estimate than internal consistency, as reliability necessarily provides an upper bound for validity. Unlike internal reliability, rTT can also be estimated for individual items. This provides researchers with an empirical criterion for assessing item quality: items with low rTT either index characteristics on which respondents cannot even agree with themselves or are simply ambiguous. Although some items may assess a specific pattern of thought, feeling, or behavior that genuinely changes over a brief time span, we contend that traits, even those assessed by single items, should be largely stable over the brief time periods typically used to measure rTT, and items with particularly low rTTs ought to be replaced with more reliable ones where possible.

While indices of internal consistency (most prominently α) have been widely reported for HEXACO measures [2], only a handful of studies have reported rTT estimates for them. For example, previous work [9] assessed seven-month rTT of the full 200-item HEXACO-PI-R to compare it with a variety of other item properties and found a mean single-item rTT = .56 (N = 188; SD = .10; range = .31 to .80), with mean domain rTT = .85 (range = .79 to .90). Others [7] found similar results for seven-month rTT of HEXACO domains using the 60-item version of the HEXACO-PI-R (HEXACO-60 [17]), albeit with a very small sample (N = 31; M = .81; range = .72 to .90). More recently, 2-year rTT (N = 214) estimates were reported [8] for only the Honesty-Humility, Agreeableness, and Conscientiousness domains and facets of the 100-item HEXACO-PI-R (HEXACO-100 [18]). Domain reliabilities were rTT = .80, .75, and .74, respectively, and median facet reliability was rTT = .69 (range = .58 to .78).

Despite providing good preliminary evidence for the temporal stability of HEXACO domains, all three studies used a testing interval more appropriate for long-term stability estimates, rather than for quantifying rTT [19]. Longer retest intervals over many months or even years confound unreliability (or “transient error”) with actual personality change, while short-term intervals such as two weeks [20, 21] or less [22] are more appropriate to quantify rTT because true change is unlikely to occur, but test takers are also unlikely to retrieve their previous responses from memory. Cattell et al. [23] recommended one or two months as the ideal interval where no true change should occur and “memory effects” [24] are still unlikely. However, more recent research [20] suggests that even shorter timespans still do not evoke memory effects, finding no discernible difference in Big Five (i.e., NEO) personality domain rTTs measured two months versus two weeks apart. For single items, Kosinski et al. [25] found that median rTTs declined almost linearly from one-day to one-, two-, three-, and four-week retest intervals, though the range was relatively narrow (rTTs = .63 to .70) and did not include a two-month interval for further comparison. Thus, while the exact interval that distinguishes between unreliability and true change is unclear (and perhaps impossible to resolve), we suggest–in line with previous research [e.g., 16, 20, 26]–that two weeks strikes a balance between mitigating both the possibility of true change and the likelihood that participants recall and repeat their responses from the first measurement.

Given the popularity of HEXACO-PI-R scales and that rTT is likely a superior measure of the scales’ most fundamental property, reliability, a comprehensive study of HEXACO-PI-R domain, facet, and item rTTs over a brief time period should be of wide interest to both personality researchers and those applying HEXACO-PI-R scales in other (including practical) settings. Thus, we report those estimates here for the English HEXACO-100 and HEXACO-60 (where the latter uses a subset of items from the former); although the longer 200-item version exists, the HEXACO-100 and HEXACO-60 are much more commonly used (in a meta-analysis [2], the two shorter scales were used in 443 of 489 studies that administered a HEXACO-PI-R inventory). To evaluate the criterion validity of rTT versus internal consistency estimates, we also explore the extent to which these indices correlate with facet-scale and (where possible) single-item rCAs [18]. Thus, similarly to previous work [13], we expected rTT but not internal consistency estimates to correlate with rCAs at all levels of investigation. Finally, we report the rTTs for individual HEXACO-PI-R items, which could suggest those in need of replacement in future iterations of the inventory and inform research on the properties of (un)reliable items.

Material and methods

Participants

Participants were recruited via Prolific Academic from a cohort (N = 639) that forms part of an ongoing project led by the first and last authors of this manuscript. These participants had previously provided survey responses on personality, life outcomes, and item properties. All participants provided informed, written consent online in all previous waves of data collection, which were approved by the University of Edinburgh School of Philosophy, Psychology, and Language Sciences Research Ethics Committee (Refs 123-1920/1 and 123-1920/2). The present study, submitted as a new iteration of the previous one, received approval from the same committee on 11 September 2020 (Ref 400-1920/3).

We invited these participants to complete the English HEXACO-100 [18] twice in two weeks. At the first administration (T1), participants gave written online consent before being directed to the survey. T1 was released at 11:55 GMT on 23 September 2020 and closed at 19:01 GMT on 28 September after achieving the planned sample size (N = 450). The second survey (T2) was released in two waves to account for variance in T1 start dates and restrict the range of retest intervals. Specifically, we published the first T2 survey at 11:15 GMT on 6 October for participants who completed T1 on 23–25 September (n = 421), then published the second wave two days later (12:11 GMT on 8 October) for participants who completed T1 between 26 and 28 September. Ultimately, 423 participants (48.5% female; M age = 26.9, SD age = 7.9, age range = 19 to 69) completed both T1 and T2 assessments. Pending a successful quality control check, all participants received a total compensation of £2.00 for participation at both time points.

Following recommendations from a similar study [27], we excluded participants whose profile consistency (q, calculated as the overall correlation between responses across all items at each measurement occasion) was exceptionally low–they used a cut-off of q = 0.25 for consistency estimates taken from repeated measures in the same session. Given our considerably longer testing interval, we were more lenient and only removed participants whose profile consistency was three or more standard deviations below the median q = 0.66 (i.e., q ≤ 0.12). This excluded seven participants, leaving a final sample of N = 416 (49.0% female; M age = 26.9, SD age = 7.9, age range = 19 to 69). For a hypothesized mean rTT of .65 (based on previous research [16, 25]), this final sample size entailed an average predicted standard error (SE) of .037.
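As a minimal sketch of this screening step (using hypothetical object names t1 and t2 for the data frames of item responses at each occasion; this is not the authors’ published code):

    # Per-participant profile consistency: correlation between a participant's
    # T1 and T2 responses across all 100 items
    q <- sapply(seq_len(nrow(t1)), function(i) {
      cor(as.numeric(t1[i, ]), as.numeric(t2[i, ]))
    })

    # Exclude participants 3 or more SDs below the median q
    # (median q = .66 and cut-off q <= .12 in the reported sample)
    cutoff <- median(q) - 3 * sd(q)
    retained <- q > cutoff
    sum(!retained)  # 7 participants were excluded on this basis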

With respect to other demographic variables, our sample was rather heterogeneous. No one country of birth exceeded 100 participants, with the largest representation coming from Portugal (n = 84), United Kingdom (n = 76), Poland (n = 49), Italy (n = 33), Greece (n = 30), Spain (n = 22), and the United States (n = 15). All other nations had ns < 10. English was underrepresented as a first language, with just under a quarter of the sample being native English speakers (n = 94); first language mapped fairly consistently with country of birth, although there was a greater variety of the latter (50 countries of birth versus 33 first languages).

Finally, about one third of the sample was missing data on student status (n = 137) and employment status (n = 135), but among the remaining subsamples, about half (53.1%) were students and over two thirds (69.4%) reported being employed in some capacity. In summary, although our sample was not representative of any one population, we did capture a relatively wide variety of cultures, ages, and occupational circumstances.

Measure

The HEXACO-100 measures the six HEXACO domains, their respective four facets (see Table 1 for domain and facet names), and an additional interstitial facet—Altruism—for a total of 25 facets with four items each [18]. The measure also contains all 60 items of the shorter form of the inventory, the HEXACO-60 [17], which contains 10 items for each domain. The HEXACO-60 does not include items for the Altruism facet, and is generally not intended to assess HEXACO facets. Full scales and scoring keys are freely available at https://hexaco.org/hexaco-inventory.

Table 1. Empirical properties of HEXACO-100 domains and facets.

Scale Our study (α, rTT, Δ) Lee & Ashton 2018 (rCA, αStudent, αOnline)
α rTT Δ rCA αStudent αOnline
H: Honesty-Humility .88 .89 .01 .46 .82 .89
E: Emotionality .84 .88 .04 .61 .84 .84
X: eXtraversion .89 .92 .03 .56 .85 .86
A: Agreeableness .87 .86 (.01) .47 .84 .86
C: Conscientiousness .87 .88 .01 .52 .84 .82
O: Openness to Experience .83 .88 .05 .56 .81 .82
H1: Sincerity .83 .75 (.08) .20 .66 .78
H2: Fairness .85 .86 .01 .45 .76 .83
H3: Greed Avoidance .83 .84 .01 .47 .81 .83
H4: Modesty .79 .80 .01 .30 .68 .79
E1: Fearfulness .72 .81 .09 .51 .70 .70
E2: Anxiety .75 .81 .06 .40 .64 .73
E3: Dependence .78 .80 .02 .44 .80 .76
E4: Sentimentality .78 .84 .06 .47 .70 .73
X1: Social Self-Esteem .74 .85 .11 .38 .67 .70
X2: Social Boldness .74 .83 .09 .53 .76 .72
X3: Sociability .80 .85 .05 .45 .71 .77
X4: Liveliness .81 .83 .02 .45 .76 .78
A1: Forgivingness .82 .78 (.04) .35 .74 .78
A2: Gentleness .73 .75 .02 .35 .66 .72
A3: Flexibility .70 .76 .06 .35 .61 .64
A4: Patience .82 .82 .00 .43 .79 .80
C1: Organization .74 .81 .07 .52 .74 .73
C2: Diligence .80 .83 .03 .37 .70 .71
C3: Perfectionism .76 .79 .03 .42 .69 .69
C4: Prudence .77 .74 (.03) .33 .69 .70
O1: Aesthetic Appreciation .69 .83 .14 .49 .66 .65
O2: Inquisitiveness .66 .80 .14 .45 .66 .70
O3: Creativity .78 .87 .09 .50 .75 .73
O4: Unconventionality .54 .78 .24 .36 .52 .59
Interstitial: Altruism .66 .75 .09 .36 .59 .66
Domain Median .87 .88 .01 .54 .84 .85
Facet Median .77 .81 .04 .43 .70 .73

α = alpha internal consistency. rTT = 13-day test-retest reliability. rCA = cross-rater agreement. Δ = rTT − α in the present sample; parentheses mark instances where α > rTT.

Analyses

After recoding reverse-keyed items, we calculated domain and facet scale scores for both the HEXACO-100 and HEXACO-60 as the means of their associated items. Though facet scores are not typically calculated for the latter, we deemed it worthwhile to do so for the purpose of comparing αs and rTTs. Using the psych package [29] in R version 4.1.1 [28], we calculated internal consistencies (Cronbach’s α) for facets and domains (using item scores to estimate domains’ αs). Because some researchers [12] recommend omega (ω) as a more appropriate measure of internal reliability than α, we also calculated domain and facet ωs for comparison with αs; the two correlated at .98 and .99, respectively, and ω estimates are available in the Online Supplement (https://osf.io/wz3du/). Test-retest reliabilities for domains, facets, and items were all estimated as the correlation between their scores at T1 and T2.
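The following is a minimal sketch of these estimates for a single facet (with hypothetical object names hexaco_t1 and hexaco_t2 for the recoded item responses at each occasion, and facet_items for one facet’s item columns; this is not the study’s exact script):

    library(psych)

    # Internal consistency (alpha) and, for comparison, omega at T1
    alpha_est <- psych::alpha(hexaco_t1[, facet_items])$total$raw_alpha
    omega_est <- psych::omega(hexaco_t1[, facet_items], nfactors = 1)$omega.tot

    # 13-day test-retest reliability: correlate T1 and T2 scale scores
    score_t1 <- rowMeans(hexaco_t1[, facet_items], na.rm = TRUE)
    score_t2 <- rowMeans(hexaco_t2[, facet_items], na.rm = TRUE)
    r_tt <- cor(score_t1, score_t2, use = "pairwise.complete.obs")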

We report α and rTT estimates from our sample alongside the α and rCA estimates from a previous large-scale psychometric study [18]. That study reported estimates from a large sample comprising self- and informant reports from university students (N = 2,863 pairs), and from a very large sample of participants providing only self-reports via a survey link on the HEXACO website (N = 100,318). Full details on the samples, data collection, and results may be found in the original publication [18].

Finally, we report Spearman’s correlations between the αs, rTTs, and rCAs of the HEXACO-100 facets, as well as the correlation between items’ rTTs and rCAs. Items’ rCAs were not published in the original paper, but the authors kindly made their data available to us, so we calculated these ourselves. We did not conduct these analyses for domains because correlations across only six values would carry little information. We were particularly interested in the extent to which facets’ rTT and α correlated with rCA: in other words, how strongly reliability criteria correlate with a common validity criterion.

Code and data that may be used to reproduce all analyses, as well as a data file containing final outputs, can be found in the Online Supplement at https://osf.io/wz3du/.

Results

Completion time, interval length, and language

Median time to complete the survey at T1 was 11’33” (SD = 8’22”, IQR = 8’55” to 16’3”). For T2, we removed times for two participants with extreme values (approximately 25 and 72 hours). The resulting completion times were slightly shorter than at T1 (median = 10’44”, SD = 9’20”, IQR = 8’13” to 13’20”). On average, participants provided complete data in about 13 days (median T1-T2 interval = 12 days, 23 hours, 37 minutes; SD = 1 day, 6 hours, 41 minutes), with the vast majority completing the survey by the end of two weeks; only n = 56 completed the survey after 14 days, and no participants took more than 19 days to complete it.

To evaluate whether interval length moderated rTT estimates, we calculated mean T1-T2 profile correlations for each participant, then correlated these with the length of time in seconds between T1 and T2 start times. The resulting correlation was r = .07 (p = .14), suggesting no relationship between the delay between administrations and overall reliability. We also examined correlations between T1-T2 profile correlations and the time taken to complete the survey at T1, at T2, and on average, but found no evidence that completion speed affected participants’ overall consistency over time either: rs = -.08 (p = .12), .01 (p = .84), and -.04 (p = .42), respectively.
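A minimal sketch of this moderation check (reusing the hypothetical q vector from the sketch above, plus hypothetical POSIXct vectors t1_start and t2_start for survey start times):

    # Correlate per-participant profile consistency with the T1-T2 interval
    interval_sec <- as.numeric(difftime(t2_start, t1_start, units = "secs"))
    cor.test(q, interval_sec)  # reported here as r = .07, p = .14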

Finally, given that most participants were non-native English speakers, we considered it important to test for potential differences in profile correlations between the T1 and T2 assessments. Indeed, native speakers showed slightly higher average T1-T2 profile correlations (q = .69) than non-natives (q = .62). The rTT estimates provided here may thus underestimate reliability, making this a particularly conservative test. These estimates are slightly lower than the minimum recommended profile correlation of .70, which was based on a same-day retest interval [27] and is therefore expected given our longer interval.

HEXACO-100 domain and facet scales

As summarized in Table 1, domains had a median 13-day rTT of .88 (SD = .02, range = .86 to .92) versus a median α of .87 (SD = .02, range = .83 to .89). These αs, slightly lower than rTTs for all but one domain (i.e., Agreeableness), were comparable to those from the comparison study by Lee and Ashton ([18]; Mdn = .88, SD = .03, range = .82 to .89) that we report in Table 1, as well as a more recent meta-analysis ([2]; Mdn = .84, SD = .02, range = .81 to .86). For facets, median rTT and α were .81 (SD = .04, range = .74 to .87) and .77 (SD = .07, range = .54 to .85), respectively. Our αs tended to be higher than those observed by Lee and Ashton ([18]; Mdns = .70 and .73), although our rankings correlated highly with their student and online samples (ρs = .67 and .88; Table 2). In turn, facet rTTs were higher than αs for all but three facets (difference rTT − α: mean Δ = .04, median = .03, range = -.08 to .24). Though typically small, the disparities were more notable for facets with lower αs, as evidenced by a correlation of ρ = -.72 between α and Δ. For example, in the lowest quintile of αs (range = .54 to .70), the median difference between α and rTT was .14 (range = .06 to .24), whereas disparities were negligible in the highest quintile (α range = .82 to .85; median Δ = .00, range = -.08 to .01).
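A minimal sketch of this α-versus-Δ comparison (with hypothetical vectors alpha and rtt holding the 25 facet estimates; not the study’s exact script):

    delta <- rtt - alpha
    cor(alpha, delta, method = "spearman")  # reported as rho = -.72

    # Median Delta within quintiles of alpha: disparities concentrate
    # where alpha is lowest
    quintile <- cut(alpha, quantile(alpha, probs = seq(0, 1, 0.2)),
                    include.lowest = TRUE, labels = FALSE)
    tapply(delta, quintile, median)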

Table 2. Spearman correlations of rTT, α, and rCA of HEXACO facets.

Estimate α rTT rCA LA αStudent
rTT .34
rCA -.07 .69
LA αStudent .67 .55 .52
LA αOnline .88 .36 .07 .69

α = alpha internal consistency estimates from our sample, averaged across both testing occasions. rTT = 13-day retest reliability. rCA = cross-rater agreement assessed by Lee & Ashton (2018) [18]. LA refers to the two samples in which Lee & Ashton (2018) calculated α, with appropriate subscripts.

Details on inter-item correlations for domains, facets, and all items can be found in the Online Supplement.

Relationships between rTT, α, and the validity criterion rCA for HEXACO-100 facets

Table 2 reports Spearman correlations between rTTs, αs, and rCAs for the HEXACO-100 facets. While rCAs correlated strongly (ρ = .52) with only one of the three α values–αStudent, from the same sample–they demonstrated a stronger association with the rTT values from the present sample (ρ = .69). Because rTT was also correlated with αStudent (ρ = .55), we conducted a post-hoc analysis to determine the partial associations among these three properties. When controlling for rTT, the relationship between αStudent and rCA was substantially attenuated (ρ = .24). In contrast, controlling the correlation between rTT and rCA for αStudent reduced the association only slightly, to ρ = .56.
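A minimal sketch of this post-hoc check (with hypothetical vectors r_tt, r_ca, and alpha_student over the 25 facets): Spearman partial correlations can be obtained by rank-transforming the values and applying the standard first-order partial-correlation formula.

    partial_spearman <- function(x, y, z) {
      # First-order partial correlation on ranks:
      # r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
      x <- rank(x); y <- rank(y); z <- rank(z)
      rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
      (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
    }

    partial_spearman(alpha_student, r_ca, r_tt)  # attenuated (reported rho = .24)
    partial_spearman(r_tt, r_ca, alpha_student)  # largely intact (reported rho = .56)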

HEXACO-100 Items

Median 13-day rTT of the HEXACO-100 items was .65 (M = .65, SD = .08, range = .39 to .84). The average standard error of the correlations was .037 (range = .027 to .045). Estimates of rTTs, standard deviations, and standard errors for all items of the HEXACO-100 can be found in S1 Table, with the five highest and lowest available in Table 3. Similar to the relationship found at the facet level, single-item rTT estimates correlated ρ = .62 with their rCAs. Interestingly, a post-hoc analysis found that item rTT and rCA were also highly correlated with the items’ standard deviation: ρs = .74 and .58, respectively.

Table 3. HEXACO-100 items with highest and lowest test-retest reliabilities.

Item Code rTT
Low
I wouldn’t pretend to like someone just to get that person to do favors for me H1 0.39
I wouldn’t want people to treat me as though I were superior to them * H4 0.46
I don’t allow my impulses to govern my behavior * C4 0.47
I generally accept people’s faults without complaining about them * A2 0.48
I wouldn’t use flattery to get a raise or promotion at work, even if I thought it would succeed H1 0.48
High
I find it boring to discuss philosophy O4 0.84
If I had the opportunity, I would like to attend a classical music concert O1 0.83
I would be quite bored by a visit to an art gallery O1 0.80
If I knew that I could never get caught, I would be willing to steal a million dollars (R) H2 0.79
I sometimes feel that I am a worthless person (R) X1 0.77

Codes correspond to facet labels given in Table 1. rTT = 13-day test-retest reliability.

* indicates items not included in the HEXACO-60. (R) = reverse-keyed.

HEXACO-60

Properties of the HEXACO-60 demonstrated similar patterns to the HEXACO-100, with rTT being slightly higher than α on average. Domains had a median rTT = .86 (SD = .02, range = .82 to .89) and median α = .82 (SD = .03, range = .80 to .87), with all αs ≤ rTT. Median facet rTT was .76 (SD = .05, range = .68 to .87) and median α was .72 (SD = .08, range = .52 to .84). We again found an association between rCA and rTT for facets (ρ = .57). This held for single items as well (ρ = .67), for which the median 13-day rTT was .65 (SD = .09, range = .39 to .84). As noted in Table 3, while three of the five items with the lowest rTTs in the HEXACO-100 do not feature in the HEXACO-60, all five of the items with the highest rTTs are included in the shorter scale. A full breakdown of properties for HEXACO-60 domains, facets, and items can be found in the Online Supplement.

Discussion

The HEXACO-PI-R, and particularly its two shorter versions, the HEXACO-100 and HEXACO-60, are currently among the most widely used personality questionnaires. Surprisingly, however, the test-retest reliability (rTT)–a key psychometric property of any psychological test–of their scales and items has not yet been systematically studied. Although several studies have reported HEXACO-PI-R stability estimates over intervals from a few months up to two years [7–9], we here provided, to our knowledge, the first examination of short-term rTT for the English HEXACO-100 and HEXACO-60. We found that rTTs were slightly higher on average than internal consistencies (α) at both the domain and facet levels, suggesting that the latter can underestimate scales’ reliabilities, particularly at the lower end. rTTs remained high even for those facets with low αs, which indicates that their items are likely reliable, just less intercorrelated than those of scales with a higher α. We also found evidence suggesting that α, but not rTT, confounds reliability with non-random sources of error variance: rTT showed strong associations with cross-rater agreement (rCA), a common validity criterion, even when controlling for α.

The present study indicates that HEXACO-PI-R scales have reliability comparable to other established personality measures (see below). This is good news for researchers and practitioners, as the HEXACO-PI-R scales are freely available, meaning they can be applied without the costs associated with proprietary personality measures. Moreover, our finding that the HEXACO-60 assesses HEXACO-PI-R domains, facets, and items with similar degrees of reliability to its longer 100-item variant supports its use in settings where a shorter version may be favored, such as clinical use or inclusion in a larger research project assessing many other variables. The finding that facet properties in the HEXACO-60 were consistent with the longer HEXACO-100, despite the former being designed only to measure domains, may suggest that researchers could consider interpreting facets when measuring the HEXACO domains with the shorter version, although this would need to be confirmed by predictive validity studies. In sum, our findings contribute further empirical backing for the use of HEXACO scales in research and practice.

Striking similarities to other measures

The HEXACO-PI-R rTTs observed here were remarkably similar to those reported for other popular personality scales, such as the Big Five, at all levels of the trait hierarchy. For example, McCrae and colleagues [13] found that median NEO Personality Inventory-Revised (NEO-PI-R [30]) αs ranged from .71 to .77 in three samples, with an overall range of .50 to .87 (compared to .54 to .85 in the present study). NEO-PI-R facet rTTs also showed highly similar patterns to the present study, with median rTT = .82 and a slightly wider range of .72 to .89 (the median facet rTT of the HEXACO-100 in our study was .81, range = .74 to .87). McCrae and colleagues found consistent associations between rCA and rTT, but none with αs after controlling for third variables, just as we observed in the present study.

Single items assessing the Big Five domains and facets have also demonstrated levels of rTT similar to those in the present study. A recent study [16] reported a median rTT of .64 (M = .64, SD = .09, range = .36 to .87) for the NEO-PI-R, whereas other research found a median rTT of .66 (SD and range not available) for a 100-item IPIP scale [21, 25, 31]. The HEXACO-PI-R showed a similar rTT range to the NEO-PI-R, although the former was slightly narrower (r = .39 to .84 vs. .36 to .87, respectively). The association between rCA and rTT has also been found for NEO-PI-R items (ρ = .57, p < .001 [16]). These results suggest three things: (1) ever-accumulating evidence indicates that the “average” personality item is reliable at about rTT = .65; (2) items within contemporary scales vary substantially in quality, as indicated by several especially unreliable items; and (3) rTT is a good predictor of items’ validity, making it a useful quality criterion in scale development.

Variation in item properties: Interpretations and implications

An inventory’s reliability is a fundamentally desirable psychometric property, and a guiding theoretical claim of this manuscript has been that retest reliability in particular is a better index of reliability than other indices, particularly the more commonly-used Cronbach’s α. We have also shown this empirically to be the case, with rTTs > αs on average, and rTT consistently linked with validity criteria whereas α is not. But why should this be the case?

McCrae [14] offers one explanation, describing how to use the rTT, α, and rCA of a given trait to parse its variance into common trait, method, and (items’) specific variance components. In essence, the model postulates that most items have unique valid variance [21, 32], and that this unique variance is by definition not captured by α (because α removes anything not common to all items) but is assessed by rTT; therefore, a trait scale that aggregates multiple items should have rTT > α. Our results support this model, with only three facet αs exceeding their rTTs. In other words, most facet measures contain both information that is common to the items written to measure the trait (e.g., Sincerity) and unique valid content specific to each item, ostensibly indexing a further personality nuance [14]. However, how items vary in their individual rTTs and in their contributions to higher-order trait αs is as yet a relatively unexplored question.
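To make this logic concrete (a simplified reading of the model sketched above, not McCrae’s [14] full decomposition procedure): if α approximates the variance shared among a scale’s items and rTT approximates all reliable variance in the scale score, then

    reliable item-specific variance ≈ rTT − α.

For Unconventionality, for example, this yields .78 − .54 = .24, suggesting substantial reliable item-specific content, whereas for Patience it yields .82 − .82 = .00, suggesting that its items carry little reliable content beyond what they share.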

The role of item content

As one starting point, we can compare how domains and facets in the HEXACO-100 differ in their rTTs and αs. Of the top 10 most reliable items, Openness to Experience and Extraversion featured with three and four items, respectively. Conversely, Honesty-Humility and Agreeableness demonstrated the opposite pattern, containing three and four of the least reliable items. An examination of the least reliable items suggests that they may be especially dependent on the contextuality of the item, particularly when it describes a hypothetical “other.” For example, responses to the least reliable item “I wouldn’t pretend to like someone just to get that person to do favors for me” (emphasis added) may depend on a variety of circumstantial details the respondent imagines when responding: who is the “someone” and what relation do they have to me? Are they a friend, colleague, boss, or stranger? How important is the favor? The same participant could envision different situations with different individuals, stakes, and emotional investment when responding to the item on different testing occasions, ultimately leading to a low rTT on average. At face value, other low-rTT items seem to demonstrate similar levels of ambiguity (e.g., “I generally accept people’s faults without complaining about them,” “I wouldn’t want people to treat me as though I were superior to them”; emphasis added) that leave the context unclear.

In contrast, the most reliable items are generally less vague in their contextual referents. “I find it boring to discuss philosophy,” “If I had the opportunity, I would like to attend a classical music concert,” and “I would be quite bored by a visit to an art gallery” are the three items with highest rTT. While all three items assess Openness to Experience, more notable is that the content references specific situations that do not evoke the presence of a particular other individual. These Openness to Experience items leave little to the respondent’s imagination.

Conversely, facets of Openness to Experience had the lowest α on average, with α for Unconventionality, Inquisitiveness, and Aesthetic Appreciation all falling below .70. They also had the greatest disparities between α and rTT, with the latter at or near the median, and respective differences of .24, .14, and .14. This demonstrates two things. More generally, it serves as a reminder that a scale’s actual reliability (as assessed by rTT) is not well approximated by α. With regard to the Openness to Experience facets in particular, it may well be that the facet-level traits assessed are more abstract and therefore broader in content than facets of other domains such as, for example, Greed Avoidance from the Honesty-Humility domain–which has, in turn, one of the highest α values.

These speculations can and should be tested empirically. One way to explore questions of item content more generally is by recruiting a small pool of lay or expert raters (say, N = 20–30 [33]) to assess the degree to which items differentially feature properties (including contextuality) that may be relevant for a variety of empirical criteria, such as rTT or rCA. This would shed light on how the way items are written or the trait they are assessing is related to their empirical properties. Item ratings could also be averaged to generate scores for the broader facets they measure, which could then be compared to a rating of the same criterion but for the facet (or domain) alone. For example, how would a breadth rating for “Unconventionality” as a facet compare to the average of breadth ratings across its constituent items? Investigation of the nuanced ways in which measures of personality vary in content and at different levels of specificity could help generate new questions and hypotheses in future scale development.

Some work on these types of questions has already begun. Previous research [9] found that item variance strongly predicted items’ rCA for both the NEO-PI-3 [34] and the full HEXACO-PI-R, as well as rTT for the latter, but not single-item internal reliability. rTT and rCA were also associated with item evaluativeness, observability, position, length, inclusion of a negation, and broad content domain (assigning Big Five and HEXACO domains to either “Engagement” or “Altruism”), although these associations were more modest [9].

These are useful preliminary findings, but more properties may still be investigated to explore as-yet-unexplained mechanisms driving items’ validity. For example, even item variance, the strongest predictor of rTT and rCA among the item properties studied by De Vries and colleagues [9], only achieved R2 values of .06 and .17. Studying some of the aforementioned properties, such as breadth, contextuality, and abstractness, may clarify this issue; others have suggested additional candidates such as item ambiguity, complexity, and (social) importance [33]. Varying domains of content, such as the affective, behavioral, cognitive, and motivational components of items, have also been suggested as possible candidates and thus preliminarily studied [35].

We would thus call for an integration of and extension to these varied lines of research. For example, while De Vries and colleagues [9] assessed rTT using data more appropriate for long-term stability estimates, follow-up work using the present findings could refine the relationships between rTT and item properties. Likewise, researchers could attempt to incorporate analyses of item content into studies that apply the variance decomposition techniques suggested by McCrae [14]. By drawing on the insights and resources from past work and across tests, we can iteratively make progress toward a comprehensive understanding of how items behave and why.

Implications for scale development

Taken together, the findings in the present study and subsequent examination of item properties may also offer personality researchers the potential to revise existing scales in an informed way, with the aim of keeping only those items that best measure the target traits. This, in turn, could result in survey administrations that capture more and better information with the same or even a smaller number of items, helping researchers to maximize the investment of their resources while also having more confidence in observed relationships with outcome variables of interest.

How exactly to assess the “goodness” of an item remains an open question, as evidenced by the previous section, but we would recommend that researchers begin by prioritizing items with higher rTTs, standard deviations, and rCAs, three highly intercorrelated empirical properties [9, 21, the present study]. We caution, though, that comparisons of these properties should only be used when choosing between items for a single trait, as it is conceivable that some traits have reliability “ceilings” such that they are more difficult to assess than others–both for individuals rating themselves and for informants who know them well–but this does not necessarily make them less of a trait. Further, the interpretation of these criteria is not always clear-cut; indeed, at least two of the criteria are logically intertwined: the stability of one’s self-view (rTT) is probably a necessary condition for an informant to agree with it (rCA). Given the complexities of interpreting even apparently straightforward indices of item quality, let alone the complex interplay of factors at the level of both the written item and the nature of the underlying trait itself, we are currently hesitant to offer concrete advice for survey generation. Instead, we encourage researchers to pursue questions–including those we pose above–that begin to paint a clearer picture of how items, traits, and their properties interact.

Limitations and generalizability

A limitation of the present study is that we have only estimated rTT for the HEXACO-100, meaning half of the items that assess the HEXACO domains (as per the full HEXACO-PI-R) remain untested. Furthermore, both this study and much of the research we have cited on rTT report on relatively small samples. Measures for what are arguably the two most popular models of personality (HEXACO and Big Five) have rTT estimates based on samples of N < 500, which rather pale in comparison to cross-sectional samples of (hundreds of) thousands of self- and informant-reports. This likewise limits the generalizability of our findings, particularly given (1) the disproportionate number of non-native English speakers, whose lower overall consistency in responses compared to natives may have downwardly biased our reliability estimates; and (2) that the sample was recruited using a paid online service, perhaps introducing selection bias. The former point may actually be encouraging, suggesting that the rTTs reported here are lower limits for HEXACO-PI-R domain, facet, and item rTT. As to the latter, the αs observed in the present study are consistent with the substantially larger samples in recent studies and meta-analyses [2, 18], suggesting that our data were not irregular.

Finally, though we chose an approximately two-week retest interval primarily for empirical consistency with previous work, we found little robust theoretical rationale in the personality literature for preferring two weeks over, say, one, three, four, six, or eight–aside from vague assumptions about true trait change and memory effects. As the study of single-item properties in particular advances, researchers should investigate more thoroughly the differences in reliability across retest intervals, while also integrating findings from the memory literature, to inform our understanding of appropriate retest intervals.

Conclusions

Given (1) facet α estimates being generally lower than rTT, (2) the unique, robust association between rTTs and rCAs, and (3) the remarkable replicability of these findings across multiple samples, questionnaires, and even models of personality, we reiterate the growing sentiment that rTT is a superior estimate of scale reliability to α (or its cousin ω). This is not a new position [11, 36], but it is one that has been overlooked given the convenience of estimating internal consistency. However, as participants become more readily available via online recruitment platforms, we see little reason to avoid–and much to be gained by–collecting rTT data as a standard procedure in scale development. We thus recommend calculation of rTT as routine practice in psychological research and argue that internal consistency should only be used to screen for data quality [13].

We conclude by advising researchers to pay special attention to item content when designing scales. The average personality item appears to be reliable at approximately rTT = .65. However, as shown in various samples, many items are still quite unreliable, ranging as low as the .30s. We suggest that the wide variability in rTT across personality scales is, in part, due to the inclusion of “poor” items, rather than solely indexing true variation in reliability across traits. We argue that rTT provides one straightforward means of identifying lower-quality items and replacing them with higher-quality ones [33, 37]. We also call for further investigation into the properties that are predictive of rTT and related validity criteria (e.g., rCA, heritability, long-term stability). Ultimately, we encourage researchers to deliberately explore how components of item content relate to item quality, while at the same time considering the traits they intend to measure. We believe that devoting time and resources to these sorts of questions will move personality measurement–and thus our entire field–forward.

Supporting information

S1 Table. HEXACO-100 items and their descriptive statistics.

rTT = 13-day test-retest reliability. SE = standard error for the rTT estimate. SD = Standard deviation of the item. * indicates items not included in the HEXACO-60.

(DOCX)

S1 File. Raw data for HEXACO-100 items, facets, and domains.

(XLSX)

S2 File. Facet and factor omegas for HEXACO-100 and HEXACO-60.

(XLSX)

Acknowledgments

We are very grateful to Kibeom Lee, Ph.D. and Michael C. Ashton, Ph.D. for providing us with item-level self- and informant-reports for the HEXACO-100.

Data Availability

All test-retest data files will be available from the project page on the Open Science Framework (https://osf.io/wz3du/) and are also included in this submission. However, the cross-rater agreement data included in the additional analyses were not collected by the authors of this manuscript. The facet alphas and cross-rater agreement estimates were copied from Table 4 in Lee & Ashton (2018) and can thus be accessed in the published article itself. We confirm that we did not have any special access privileges to the single-item data: although Lee & Ashton (2018) do not indicate in their article how to access these data, we simply sent them an email requesting the raw self-/observer data and explaining how we intended to use them. To our knowledge, others would be able to access these data in the same manner, by contacting the authors of Lee & Ashton (2018).

Funding Statement

The corresponding author (SH) was supported by two Research Support Grants (grant numbers unspecified as the funds come from a common pool used by research students) from the University of Edinburgh School of Philosophy, Psychology, and Language Sciences. Information on these grants may be found at https://www.ed.ac.uk/ppls/linguistics-and-english-language/current/postgraduate/fees-and-funding/funding-phd-research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Lee K, Ashton MC. Psychometric Properties of the HEXACO Personality Inventory. Multivariate Behav Res. 2004;39:329–58. doi: 10.1207/s15327906mbr3902_8
  • 2. Moshagen M, Thielmann I, Hilbig BE, Zettler I. Meta-Analytic Investigations of the HEXACO Personality Inventory(-Revised). Z Psychol. 2019;227:186–94.
  • 3. Thielmann I, Akrami N, Babarović T, Belloch A, Bergh R, Chirumbolo A, et al. The HEXACO–100 Across 16 Languages: A Large-Scale Test of Measurement Invariance. J Pers Assess. 2020;102:714–26. doi: 10.1080/00223891.2019.1614011
  • 4. Anglim J, Horwood S, Smillie LD, Marrero RJ, Wood JK. Predicting Psychological and Subjective Well-Being From Personality: A Meta-Analysis. Psychol Bull. 2020;146:279–323. doi: 10.1037/bul0000226
  • 5. Lee K, Ashton MC, Griep Y, Edmonds M. Personality, Religion, and Politics: An Investigation in 33 Countries. Eur J Pers. 2018;32:100–15.
  • 6. Zettler I, Thielmann I, Hilbig BE, Moshagen M. The Nomological Net of the HEXACO Model of Personality: A Large-Scale Meta-Analytic Investigation. Perspect Psychol Sci. 2020;15:723–60. doi: 10.1177/1745691619895036
  • 7. Moshagen M, Hilbig BE, Zettler I. Faktorenstruktur, psychometrische Eigenschaften und Messinvarianz der deutschsprachigen Version des 60-Item HEXACO Persönlichkeitsinventars [Factor structure, psychometric properties, and measurement invariance of the German-language version of the 60-item HEXACO Personality Inventory]. Diagnostica. 2014;60:86–97.
  • 8. Dunlop PD, Bharadwaj AA, Parker SK. Two-year stability and change among the honesty-humility, agreeableness, and conscientiousness scales of the HEXACO100 in an Australian cohort, aged 24–29 years. Pers Individ Dif. 2021;172:110601. doi: 10.1016/j.paid.2020.110601
  • 9. de Vries RE, Realo A, Allik J. Using Personality Item Characteristics to Predict Single-Item Internal Reliability, Retest Reliability, and Self–Other Agreement. Eur J Pers. 2016;30:618–36.
  • 10. John OP, Soto CJ. The importance of being valid: Reliability and the process of construct validation. In: Robins RW, Fraley RC, Krueger RF, editors. Handbook of research methods in personality psychology. New York: Guilford; 2007. p. 461–94.
  • 11. Guttman L. A basis for analyzing test-retest reliability. Psychometrika. 1945;10:255–82.
  • 12. Revelle W, Condon DM. Reliability from α to ω: A tutorial. Psychol Assess. 2019;31:1395–411.
  • 13. McCrae RR, Kurtz JE, Yamagata S, Terracciano A. Internal consistency, retest reliability, and their implications for personality scale validity. Pers Soc Psychol Rev. 2011;15:28–50.
  • 14. McCrae RR. A More Nuanced View of Reliability: Specificity in the Trait Hierarchy. Pers Soc Psychol Rev. 2015;19:97–112.
  • 15. McCrae RR, Mõttus R. What Personality Scales Measure: A New Psychometrics and Its Implications for Theory and Assessment. Curr Dir Psychol Sci. 2019.
  • 16. Henry S, Mõttus R. Traits and Adaptations: A Theoretical Examination and New Empirical Evidence. Eur J Pers. 2020;34:265–84.
  • 17. Ashton MC, Lee K. The HEXACO-60: A short measure of the major dimensions of personality. J Pers Assess. 2009;91:340–5. doi: 10.1080/00223890902935878
  • 18. Lee K, Ashton MC. Psychometric Properties of the HEXACO-100. Assessment. 2018;25:543–56. doi: 10.1177/1073191116659134
  • 19. Kandler C, Penner A, Richter J, Zapko-Willmes A. The Study of Personality Architecture and Dynamics (SPeADy): A Longitudinal and Extended Twin Family Study. Twin Res Hum Genet. 2019;22:548–53. doi: 10.1017/thg.2019.62
  • 20. Chmielewski M, Watson D. What Is Being Assessed and Why It Matters: The Impact of Transient Error on Trait Research. J Pers Soc Psychol. 2009;97:186–202. doi: 10.1037/a0015618
  • 21. Mõttus R, Sinick J, Terracciano A, Hřebíčková M, Kandler C, Ando J, et al. Personality Characteristics Below Facets: A Replication and Meta-Analysis of Cross-Rater Agreement, Rank-Order Stability, Heritability, and Utility of Personality Nuances. J Pers Soc Psychol. 2019.
  • 22. Lowman GH, Wood D, Armstrong BF, Harms PD, Watson D. Estimating the reliability of emotion measures over very short intervals: The utility of within-session retest correlations. Emotion. 2018;18:896–901. doi: 10.1037/emo0000370
  • 23. Cattell RB, Eber HW, Tatsuoka MM. The psychometric properties of tests: Consistency, validity, and efficiency. In: Handbook for the Sixteen Personality Factor Questionnaire. Champaign, IL: Institute for Personality and Ability Testing; 1970.
  • 24. Heise DR. Separating Reliability and Stability in Test-Retest Correlation. Am Sociol Rev. 1969;34:93–101. Available from: https://www.jstor.org/stable/2092790
  • 25. Kosinski M, Matz SC, Gosling SD, Popov V, Stillwell D. Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines. Am Psychol. 2015;70:543–56. doi: 10.1037/a0039210
  • 26. Watson D. Stability versus change, dependability versus error: Issues in the assessment of personality over time. J Res Pers. 2004;38:319–50.
  • 27. Wood D, Harms PD, Lowman GH, DeSimone JA. Response Speed and Response Consistency as Mutually Validating Indicators of Data Quality in Online Samples. Soc Psychol Personal Sci. 2017;8:454–64.
  • 28. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.
  • 29. Revelle W. psych: Procedures for Personality and Psychological Research. Evanston, IL: Northwestern University; 2021. Available from: https://cran.r-project.org/package=psych
  • 30. Costa PT, McCrae RR. Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources; 1992.
  • 31. Goldberg LR, Johnson JA, Eber HW, Hogan R, Ashton MC, Cloninger CR, et al. The international personality item pool and the future of public-domain personality measures. J Res Pers. 2006;40:84–96.
  • 32. Mõttus R, Kandler C, Bleidorn W, Riemann R, McCrae RR. Personality Traits Below Facets: The Consensual Validity, Longitudinal Stability, Heritability, and Utility of Personality Nuances. J Pers Soc Psychol. 2017;112:474–90.
  • 33. Condon DM, Wood D, Mõttus R, Booth T, Costantini G, Greiff S, et al. Bottom Up Construction of a Personality Taxonomy. 2020.
  • 34. McCrae RR, Costa PT, Martin TA. The NEO-PI-3: A More Readable Revised NEO Personality Inventory. J Pers Assess. 2005;84:261–70. doi: 10.1207/s15327752jpa8403_05
  • 35. Wilt J, Revelle W. Affect, Behaviour, Cognition and Desire in the Big Five: An Analysis of Item Content and Structure. Eur J Pers. 2015;29:478–97. doi: 10.1002/per.2002
  • 36. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
  • 37. Mõttus R, Wood D, Condon DM, Back M, Baumert A, Costantini G, et al. Descriptive, predictive and explanatory personality research: Different goals, different approaches, but a shared need to move beyond the Big Few traits. Eur J Pers. 2020;1–31. Available from: https://osf.io/fn5pw

Decision Letter 0

Frantisek Sudzina

1 Nov 2021

PONE-D-21-30871
Test-retest reliability of the HEXACO-100
PLOS ONE

Dear Dr. Henry,

Thank you for submitting your manuscript to PLOS ONE. We invite you to look at the reviewer's suggestions and think whether they could be used to improve the article. Please submit your revised manuscript by Dec 16 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Frantisek Sudzina

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for the opportunity to review this paper. I see it as being both interesting and very important to researchers using the HEXACO 60 and 100 personality inventories. I also really enjoyed the analysis of and discourse about the item-level data. I believe that this paper is likely to become the ‘go-to’ paper for people wishing to cite evidence of the reliability of these measures. Overall, I have only two relatively minor suggestions/thoughts.

1. If the authors have more information they can share about the sample (e.g., country of origin, education levels), I strongly encourage them to add this information to the Participants sections, if there is space. My thinking here is that these details may be important for future researchers who, for example, conduct a similar study but receive different results and wish to understand why.

2. After reading the third paragraph on page 13, which pondered the effects of items’ contextualisation levels on reliability, I wondered whether contextualisation levels would be predictably positively associated with rTT but negatively associated with α. For example, whether a person likes poetry _today_ is probably very strongly associated with whether they like it tomorrow, next week, next year, and so on (high rTT). But it’s not hard to imagine there would exist plenty of people who love poetry but are indifferent to, say, classical music or ancient ruins, thus the contextualised items contribute negatively to alpha. I concur with the authors’ speculation that more generic/less contextualised items (e.g., a hypothetical item, “I like artistic things”) may undermine rTT, for all the reasons the authors mentioned (e.g., what artistic things are they thinking about in that moment? Have they enjoyed/not enjoyed a recent artistic experience?). And I could see how such an item would positively influence alpha, as the generic item would represent, to some extent, any of the specific/contextualised items in the same scale. Anyway, these are just thoughts; I do _not_ insist the authors should include them in their revision.

Reviewer #2: Review PONE-D-21-30871: “Test-retest reliability of the HEXACO-100”

Many thanks for the opportunity to review this manuscript, which provides - once again - evidence that alpha reliability is a less optimal parameter than test-retest reliability. Based on my reading of the manuscript, I have a few suggestions and comments:

1. One of the open questions for me is what the optimal time period is for establishing test-retest reliability. The authors chose 12 days (please provide mean and SD of the number of days or even hours between the two ratings; and please check whether the individual number of days has an effect on r(tt)!), but I’m not sure whether this is the optimal time period, or what the optimal time period actually is for personality questionnaires. That is, in the introduction and the discussion, I would like the authors to explain a bit more, based maybe on memory research (which, of course, also shows large individual differences) and based on the traitedness of a construct and the possible time frame for changes to occur, what kind of time frame would be optimal for establishing r(tt).

2. With respect to the above time frame, the findings can also be used to comment on McCrae’s (2015) approach to distinguish trait, method, specific, and error variance components. McCrae notes that specific variance is obtained by subtracting alpha from r(tt), but in most cases this would yield a negative specific variance in the current study. As McCrae notes: “By definition, [...] specific variance in an item is not shared by other items in the scale, so it detracts from alpha. However, in retest designs, the same items, with the same specific variance, are readministered, and they may elicit the same response. Item-specific variance could thus account for the fact that retest reliability is greater than alpha, especially if we also assume that method variance is stable over short intervals.” (McCrae, 2015, p. 2)

That is, McCrae’s formula implies that the time period between two measures of the same construct should depend on the specific variance (i.e., if there is more specific variance, the time period should be longer, because otherwise r(tt) is bound to be greater than alpha). I’d love the authors to comment on this. Note: I must admit there are notable problems with McCrae’s approach, something that is long overdue being commented on.

3. I wondered about the criteria to establish whether an item is a ‘good’ item. One could argue that both r(ca) and r(tt) are important, and not just r(ca). But how to weigh these is - to me - an open question. Logically, r(tt) is a necessary, but not sufficient, condition for r(ca) (i.e., a highly temporally stable item may not be observable, and thus have a low r(ca)), whereas r(ca) may be a sufficient condition for r(tt) (if items are really observable and there is high r(ca), by necessity there is a high r(tt)). But the question is whether you only want to have observability criteria (or other criteria aligned with r(ca), e.g., ‘item domain’, see De Vries et al., 2016) in a personality questionnaire. I would love to see the authors make a statement about this in the discussion and maybe even suggest which (24? 48?) items would provide the most suitable short measure of the HEXACO-100 (with coverage of each facet) according to their criteria.

4. Last but not least, I would love the authors to make the title a bit more informative about the implications of the manuscript, especially with respect to the importance of test-retest reliability and the fact that alpha reliability should be less often used as a measure of reliability. As a final note, please refrain from using the term ‘internal consistency’ and/or explain that it is a misnomer, because alpha does not measure internal consistency (with thousands of items, any scale has a high alpha, but can have practically zero internal consistency). See Sijtsma (2009); just call it ‘alpha reliability’ or ‘internal reliability’.

McCrae, R. R. (2015). A more nuanced view of reliability: Specificity in the trait hierarchy. Personality and Social Psychology Review, 19(2), 97-112.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Reinout E. de Vries

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Jan 13;17(1):e0262465. doi: 10.1371/journal.pone.0262465.r002

Author response to Decision Letter 0


22 Dec 2021

Response to Reviewers | PONE-D-21-30871R1 | Test-Retest Reliability of the HEXACO-100

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

We have carefully reviewed PLOS ONE’s style requirements and believe that all files meet these requirements.

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

The minimal underlying dataset for test-retest reliability has now been submitted as Supporting Information (S2_File.xlsx) and can also be found on the project page on the Open Science Framework website, listed in the manuscript (https://osf.io/wz3du/). We also include the Lee & Ashton (2018) article, which provides facet alphas and cross-rater agreement estimates in Table 4, as is now noted in the Data Availability statement.

We now also note in the Data Availability statement how to access the single-item cross-rater agreement data: “Though the authors of Lee & Ashton (2018) do not specifically indicate in their article how to access these data, we simply sent them an email requesting the raw self-/observer data and explaining how we intended to use them. We can thus confirm that, to our knowledge, others can access these data in the same manner as the authors of the present manuscript, by contacting the authors of Lee & Ashton (2018) with a similar request.”

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

See previous response; our data are available at the following repository: https://osf.io/wz3du/.

4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

We now include a full ethics statement in the “Methods” section that includes the full name of the ethics committee at the University of Edinburgh who approved our study. We also note that consent was in written form, given online at the time of completing the survey.

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

We have revised the reference section and can confirm that 1) all references are up to date and correct, and 2) none of our references have been retracted.

-----------

Reviewer #1: Thank you for the opportunity to review this paper. I see it as being both interesting and very important to researchers using the HEXACO 60 and 100 personality inventories. I also really enjoyed the analysis of and discourse about the item-level data. I believe that this paper is likely to become the ‘go-to’ paper for people wishing to cite evidence of the reliability of these measures. Overall, I have only two relatively minor suggestions/thoughts.

Thank you for this very positive evaluation of the manuscript.

1. If the authors have more information they can share about the sample (e.g., country of origin, education levels), I strongly encourage them to add this information to the Participants sections, if there is space. My thinking here is that these details may be important for future researchers who, for example, conduct a similar study but receive different results and wish to understand why.

Though we did not explicitly ask participants about these variables, we were able to access, via Prolific Academic, information on many participants’ first language, country of birth and residence, student status, and occupational status. All of these are now included in the “Participants” subsection of our “Materials and Methods” section (lines 177-189). We did not have full data for every demographic variable because, at the time of extracting it, some participants had removed their information. Where we have a substantial amount of missing data (e.g., for student and occupational status), we mention this specifically.

Reporting these demographic variables also allowed us to examine the effect of having English as a first language on rTT, as nearly three quarters of our sample were non-native speakers. There did indeed appear to be a slight difference between natives and non-natives. We now discuss this and its implications in the Methods, Results, and Limitations sections (lines 177-183, 246-252, 486-493), noting that our reported rTTs may thus be lower-bound estimates.

2. After reading the third paragraph on page 13, which pondered the effects of items’ contextualisation levels on reliability, I wondered whether contextualisation levels would be predictably positively associated with rTT but negatively associated with α. For example, whether a person likes poetry _today_ is probably very strongly associated with whether they like it tomorrow, next week, next year, and so on (high rTT). But it’s not hard to imagine there would exist plenty of people who love poetry but are indifferent to, say, classical music or ancient ruins, thus the contextualised items contribute negatively to alpha. I concur with the authors’ speculation that more generic/less contextualised items (e.g., a hypothetical item, “I like artistic things”) may undermine rTT, for all the reasons the authors mentioned (e.g., what artistic things are they thinking about in that moment? Have they enjoyed/not enjoyed a recent artistic experience?). And I could see how such an item would positively influence alpha, as the generic item would represent, to some extent, any of the specific/contextualised items in the same scale. Anyway, these are just thoughts; I do _not_ insist the authors should include them in their revision.

We think that this is an interesting proposition and spent some time trying to fit it into the Discussion; we even went so far as to attempt to test this hypothesis (see below). Ultimately, we chose not to include it in the final document as we felt it distracted somewhat from the main purpose, which was simply to speculate on the wide variety of causes of variation in item properties. However, we agree that this would be worth exploring further in any work that specifically looks at contextuality.

Excised:

“Specifically, we examined the correlations of item rTT with both the αs and rTTs of the facets and domains the items measure. For facets, we found a strong relation (ρ = .58) between single-item rTT and facet rTT on the one hand (which is reasonable given that each item effectively contributes 25% to its facet’s rTT). On the other hand, single-item rTT and facet αs had a much smaller association, in the opposite direction to that suggested in the previous paragraph (ρ = .11). When expanding these analyses to domains, these relations effectively went to zero: the Spearman’s correlation between item and domain rTT was ρ = .10, and between item rTT and domain α, ρ = -.01. Based on this, there is no strong evidence that, even if more specific, less contextually variable items are more reliable than more general items, this must come at the “cost” of lower α.”
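
For readers who want to reproduce this kind of check on their own data, a minimal sketch follows. It is not the authors’ actual analysis script: the input file and the column names (item_rtt, facet_rtt, facet_alpha) are hypothetical placeholders, and only the Spearman correlations mirror the excised passage.

    # Minimal sketch: correlate single-item rTT with the rTT and alpha of the
    # facet each item belongs to, using Spearman's rho as in the excised text.
    # File and column names are hypothetical placeholders.
    import pandas as pd
    from scipy.stats import spearmanr

    items = pd.read_csv("item_stats.csv")  # one row per item

    rho_rtt, _ = spearmanr(items["item_rtt"], items["facet_rtt"])
    rho_alpha, _ = spearmanr(items["item_rtt"], items["facet_alpha"])

    print(f"item rTT vs facet rTT:   rho = {rho_rtt:.2f}")
    print(f"item rTT vs facet alpha: rho = {rho_alpha:.2f}")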

-----------

Reviewer #2: Review PONE-D-21-30871: “Test-retest reliability of the HEXACO-100”

Many thanks for the opportunity to review this manuscript, which provides - once again - evidence that alpha reliability is a less optimal parameter than test-retest reliability. Based on my reading of the manuscript, I have a few suggestions and comments:

1. One of the open questions for me is what the optimal time period is for establishing test-retest reliability. The authors chose 12 days (please provide mean and SD of the number of days or even hours between the two ratings; and please check whether the individual number of days has an effect on r(tt)!), but I’m not sure whether this is the optimal time period, or what the optimal time period actually is for personality questionnaires. That is, in the introduction and the discussion, I would like the authors to explain a bit more, based maybe on memory research (which, of course, also shows large individual differences) and based on the traitedness of a construct and the possible time frame for changes to occur, what kind of time frame would be optimal for establishing r(tt).

We are happy to elaborate on these issues. We have added a paragraph to the Methods section with a comprehensive description of timing: the days, hours, and minutes between the two survey administrations, as well as the time it took participants to complete the survey at each time point. We then examined whether either of these had an impact on retest reliability by correlating them with participants’ T1-T2 overall profile consistency. In summary, we found no evidence that interval length or overall duration had an effect on reliability, but please see lines 228-245 for more details.
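
As a concrete illustration of the check described above, the sketch below computes each participant’s T1-T2 profile consistency (the within-person correlation between the two item-response profiles) and correlates it with the retest interval. The data layout, file name, and column names are assumptions made for illustration, not the authors’ code.

    # Sketch: does retest interval length predict T1-T2 profile consistency?
    # Assumes wide data with matching item columns for both waves plus an
    # 'interval_days' column; all names are hypothetical.
    import pandas as pd
    from scipy.stats import pearsonr

    df = pd.read_csv("retest_data.csv")
    t1_cols = [c for c in df.columns if c.endswith("_t1")]
    t2_cols = [c.replace("_t1", "_t2") for c in t1_cols]

    # Within-person correlation of the item profile across the two waves.
    df["profile_r"] = [
        pearsonr(row[t1_cols].astype(float), row[t2_cols].astype(float))[0]
        for _, row in df.iterrows()
    ]

    r, p = pearsonr(df["interval_days"], df["profile_r"])
    print(f"interval length vs profile consistency: r = {r:.2f}, p = {p:.3f}")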

Regarding the second point, we were not able to find much information on ideal intervals aside from previous empirical work comparing different intervals, which we now include in the manuscript, both in the Introduction and the Discussion. We now more explicitly note that we chose our interval (actually closer to 13 than 12 days) largely to be consistent with and comparable to previous research. We also dedicate more space to previous research that has discussed the issue of appropriate interval length, and we note that even the typically cited papers offer little theoretical rationale, and none that draws on memory research in particular.

We thus agree that these are interesting and important questions that need to be explored with respect to retest reliability, especially in an era where single-item properties are receiving more and more attention (see lines 119-129 and 495-501 where we comment on this).

2. With respect to the above time frame, the findings can also be used to comment on McCrae’s (2015) approach to distinguish trait, method, specific, and error variance components. McCrae notes that specific variance is obtained by subtracting alpha from r(tt), but in most cases this would yield a negative specific variance in the current study. As McCrae notes: “By definition, [...] specific variance in an item is not shared by other items in the scale, so it detracts from alpha. However, in retest designs, the same items, with the same specific variance, are readministered, and they may elicit the same response. Item-specific variance could thus account for the fact that retest reliability is greater than alpha, especially if we also assume that method variance is stable over short intervals.” (McCrae, 2015, p. 2)

That is, McCrae’s formula implies that the time period between two measures of the same construct should depend on the specific variance (i.e., if there is more specific variance, the time period should be longer, because otherwise r(tt) is bound to be greater than alpha). I’d love the authors to comment on this. Note: I must admit there are notable problems with McCrae’s approach, something that is long overdue being commented on.

Another excellent point. Several of the authors on this manuscript are actually also interested in re-visiting this technique and have plans to do so in upcoming projects, utilizing some of the ideas discussed in this paper. That is to say, we wholeheartedly agree that this needs to be commented on in more detail, although we recognize that a full discussion of McCrae’s method is probably beyond the scope of the present manuscript.

To the first point, we have now included a few paragraphs commenting on this in the Discussion (lines 373-382). Just as a note, we wonder whether the reviewer may have misread Table 1, because α > rTT holds for only 3 facets; for the remaining facets, rTT exceeds α, which is what McCrae’s approach would predict. For example, we now include the following: “Most items have unique valid variance [21,32], and this unique variance is by definition not captured by α but is assessed by rTT (because α removes anything not common to all items); therefore, a trait scale that aggregates multiple items should have rTT > α. Our results support this model, with only three facet αs exceeding their rTTs. In other words, most facet measures contain both information that is common to items written to measure the trait (e.g., Sincerity) and unique valid content specific to each item, ostensibly indexing a further personality nuance [32]” (lines 378-384).
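
To make the arithmetic of McCrae’s subtraction concrete, here is a small sketch in which item-specific variance is estimated as rTT minus α, so a negative value corresponds to α > rTT. The facet names and values are made-up placeholders, not figures from Table 1.

    # Sketch of McCrae's (2015) subtraction: specific variance ~ rTT - alpha.
    # A negative value (alpha > rTT) is what the reviewer expected; the
    # authors report it for only three facets. Values below are placeholders.
    import pandas as pd

    facets = pd.DataFrame({
        "facet": ["Sincerity", "Fairness", "Diligence"],
        "rtt":   [0.83, 0.85, 0.80],
        "alpha": [0.78, 0.81, 0.82],
    })
    facets["specific_var"] = facets["rtt"] - facets["alpha"]

    n_negative = (facets["specific_var"] < 0).sum()
    print(facets)
    print(f"facets with alpha > rTT (negative specific variance): {n_negative}")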

3. I wondered about the criteria to establish whether an item is a ‘good’ item. One could argue that both r(ca) and r(tt) are important, and not just r(ca). But how to weigh these is - to me - an open question. Logically, r(tt) is a necessary, but not sufficient, condition for r(ca) (i.e., a highly temporally stable item may not be observable, and thus have a low r(ca)), whereas r(ca) may be a sufficient condition for r(tt) (if items are really observable and there is high r(ca), by necessity there is a high r(tt)). But the question is whether you only want to have observability criteria (or other criteria aligned with r(ca), e.g., ‘item domain’, see De Vries et al., 2016) in a personality questionnaire. I would love to see the authors make a statement about this in the discussion and maybe even suggest which (24? 48?) items would provide the most suitable short measure of the HEXACO-100 (with coverage of each facet) according to their criteria.

We too have lately been pondering how best to operationalize the “goodness” of an item, and we have yet to reach a clear conclusion. We agree with the speculations here, but (as we now mention in the Discussion) are relatively hesitant to suggest a specific subset of items given how much there is yet to be understood about the interplay of item properties (content, empirical quality, and otherwise).

We offer some tentative starting points for prioritising one item for a given trait over another (e.g., selecting items with high variance, rTT, and rCA) but do not feel we have enough evidence to propose a specific subset of HEXACO items for future research. See the section “Implications for scale development” from line 454 for our full commentary on the topic.
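
The sketch below shows one naive way such a prioritisation could be operationalised: z-score each criterion, average the z-scores, and keep the top-ranked items within each facet. The equal weighting, the file, and the column names are illustrative assumptions only; the authors propose no specific weighting or item subset.

    # Sketch: rank items within each facet by a naive composite of retest
    # reliability (rtt), cross-rater agreement (rca), and item SD.
    # Equal weighting after z-scoring is an arbitrary illustrative choice.
    import pandas as pd

    items = pd.read_csv("item_stats.csv")  # hypothetical: facet, rtt, rca, sd

    for col in ["rtt", "rca", "sd"]:
        items[f"z_{col}"] = (items[col] - items[col].mean()) / items[col].std()

    items["composite"] = items[["z_rtt", "z_rca", "z_sd"]].mean(axis=1)

    # Keep the two highest-scoring items per facet for a short form.
    short_form = (
        items.sort_values("composite", ascending=False)
             .groupby("facet")
             .head(2)
    )
    print(short_form[["facet", "composite"]])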

4. Last but not least, I would love the authors to make the title a bit more informative about the implications of the manuscript, especially with respect to the importance of test-retest reliability and the fact that alpha reliability should be less often used as a measure of reliability. As a final note, please refrain from using the term ‘internal consistency’ and/or explain that it is a misnomer, because alpha does not measure internal consistency (with thousands of items, any scale has a high alpha, but can have practically zero internal consistency). See Sijtsma (2009); just call it ‘alpha reliability’ or ‘internal reliability’.

Thank you for this suggestion. We agree and have now changed the title to “Test-Retest Reliability of the HEXACO-100 – and the Value of Multiple Measurements for Assessing Reliability”.

Regarding the point about alpha, we went back and forth between the two recommendations and have ultimately chosen to continue referring to it as “internal consistency” for ease of differentiating it from rTT throughout the paper and to be consistent with the literature (we, e.g., use a direct quote from McCrae who also referred to internal consistency for alpha). That said, we now include a clarification on the first page that addresses the Reviewer’s point and explains that using “internal consistency” to describe alpha is a misnomer: “We note that alpha does not measure internal consistency of items per se: with hundreds or thousands of items, any scale has high alpha but can have practically zero consistency among individual items. Instead, a scale’s alpha indexes the expected consistency among hypothetical item aggregates containing the same number of items as the scale. Nonetheless, in line with common usage and to clearly distinguish it from rTT, we will also refer to alpha as a measure of internal consistency” (lines 79-84).
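
The claim that a very long scale can pair a high α with near-zero consistency among individual items follows directly from the standardized-alpha formula, α = k·r̄ / (1 + (k - 1)·r̄), where k is the number of items and r̄ the mean inter-item correlation. The quick computation below is only a numerical illustration of that formula, not an analysis from the paper.

    # Standardized alpha for k items with mean inter-item correlation r_bar.
    def standardized_alpha(k, r_bar):
        return (k * r_bar) / (1 + (k - 1) * r_bar)

    # With r_bar = .01, alpha climbs toward 1 as the scale lengthens:
    for k in (10, 100, 500, 2000):
        print(k, round(standardized_alpha(k, 0.01), 3))
    # 10 -> 0.092, 100 -> 0.503, 500 -> 0.835, 2000 -> 0.953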

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Frantisek Sudzina

26 Dec 2021

Test-retest reliability of the HEXACO-100 - and the value of multiple measurements for assessing reliability

PONE-D-21-30871R1

Dear Dr. Henry,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Frantisek Sudzina

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Frantisek Sudzina

4 Jan 2022

PONE-D-21-30871R1

Test-retest reliability of the HEXACO-100 – and the value of multiple measurements for assessing reliability

Dear Dr. Henry:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Frantisek Sudzina

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. HEXACO-100 items and their descriptive statistics.

    rTT = 13-day test-retest reliability. SE = standard error for the rTT estimate. SD = Standard deviation of the item. * indicates items not included in the HEXACO-60.

    (DOCX)

    S1 File. Raw data for HEXACO-100 items, facets, and domains.

    (XLSX)

    S2 File. Facet and factor omegas for HEXACO-100 and HEXACO-60.

    (XLSX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All test-retest data files will be available from the project page on the Open Science Framework (https://osf.io/wz3du/) and are also included in this submission. However, the cross-rater agreement data included in the additional analyses were not collected by the authors of this manuscript. The facet alphas and cross-rater agreement estimates were copied from Table 4 in Lee & Ashton (2018) and can thus be accessed in the published article itself. We can confirm that we did not have any special access privileges to the single-item data. Though the authors of Lee & Ashton (2018) do not specifically indicate in their article how to access these data, we simply sent them an email requesting the raw self-/observer data and explaining how we intended to use them. We can thus confirm that, to our knowledge, others can access these data in the same manner as the authors of the present manuscript, by contacting the authors of Lee & Ashton (2018) with a similar request.

