2023 Dec;117:102818. doi: 10.1016/j.jmp.2023.102818

Experiment-based calibration in psychology: Optimal design considerations

Dominik R Bach
PMCID: PMC11158008  PMID: 38855427

Abstract

Psychological theories are often formulated at the level of latent, not directly observable, variables. Empirical measurement of latent variables ought to be valid. Classical psychometric validity indices can be difficult to apply in experimental contexts. A complementary validity index, termed retrodictive validity, is the correlation of theory-derived predicted scores with actually measured scores, in specifically designed calibration experiments. In the current note, I analyse how calibration experiments can be designed to maximise the information garnered and specifically, how to minimise the sample variance of retrodictive validity estimators. First, I harness asymptotic limits to analytically derive different distribution features that impact on estimator variance. Then, I numerically simulate various distributions with combinations of feature values. This allows deriving recommendations for the distribution of predicted values, and for resource investment, in calibration experiments. Finally, I highlight cases in which a misspecified theory is particularly problematic.

Keywords: Calibration, Retrodictive validity, Measurement uncertainty, Measurement accuracy

Highlights

  • Experiment-based calibration is a method for validation of psychological measurement.

  • Calibration benefits from low variance of sample estimators.

  • Equiprobable standard values reduce estimator variance.

  • Reducing random aberration does not always reduce estimator variance.

  • Systematic aberration is particularly problematic when inverse sigmoid.

1. Introduction

Many areas of psychology deal with latent variables that are not directly observable, such as subjective value, confidence, or arousal. Empirical research mandates that such variables are assessed via a suitable measurement procedure. This raises a question of measurement validity: does the measurement actually relate to the latent variable of interest, and if yes, then how close is that relation? A validity argument is usually based on several sources (AERA, Joint Committee on the Standards for Educational, Psychological Testing of the American Educational Research Association, & the American Psychological Association and the National Council on Measurement in Education, 2014), among them quantitative indices such as convergent and discriminant validity, or reliability. While some of these are commonly used, others are less widespread in practice. For example, discriminant validity, while theoretically indispensable for interpretation of convergent validity (McDonald, 2013), is less frequently assessed (Zumbo & Chan, 2014). This might reflect practical and interpretational challenges in the application of psychometric indices (Bach, Rigdon & Sarstedt, 2023). For a practical example, validity assessment requires demonstrating convergent and divergent validity, which necessitates finding at least two measurement methods for the same, and for a different, latent variable, and this can be challenging. For an interpretational example, when convergent validity is low, then it cannot adjudicate between a “more”, and a “less” valid method (Bach, Rigdon et al., 2023). Furthermore, psychometrics is historically grounded in the study of interindividual difference, and all psychometric indices require incidental between-person variability to be meaningful (Brandmaier et al., 2018).
In contrast, much of experimental psychology is concerned with general or average treatment effects, where the goal of experimental design is often to minimise variability of the treatment effect between persons (Hedge, Powell, & Sumner, 2018). In such cases, psychometric indices can be misleading.

Experiment-based calibration has recently been proposed as a complementary strategy to compare measurement methods, and does not suffer from the aforementioned shortcomings. Inspired by contemporary approaches to the evaluation of physical measurement (Phillips, Estler, Doiron, Eberhardt, & Levenson, 2001), the strategy is to generate well-defined values of a latent variable in a standardised experiment, and evaluate methods by how well they can reproduce these values (Bach, Melinscak, Fleming, & Voelkle, 2020). This is formalised in the concept of retrodictive validity: the correlation of intended scores of the latent variable with the actually measured scores (see Fig. 1) (Bach & Melinscak, 2020). Conceptually, this can be seen as a type of criterion validity, with the external validity criterion replaced by experiment-defined standard scores (Bach et al., 2020).

Fig. 1.


Schematic of the calibration approach. In the analytical derivations, no assumptions are made for the distributions of ω and ɛ (black arrows). In the simulations, aberration ω is decomposed into a non-linearity f and random aberration, and measurement error is assumed to be unbiased i.i.d. for simplicity (grey arrows).

While previous work has formally laid out the concept of retrodictive validity, and detailed the statistical assumptions under which it is informative on measurement accuracy (Bach et al., 2020), it has not addressed how the concept can be put into practice — or in other words, how a calibration experiment should be designed. There are at least two perspectives on calibration design. The first is domain-specific: to define a suitable experimental manipulation that can be used to impact on a particular latent variable. Recent work has suggested an expert consensus procedure to this end, and has provided an exemplary experiment to calibrate the measurement of associative learning (Bach et al., 2023). The second perspective is statistical: how can numerical and distributional design features be used to improve calibration? This is the topic of the current work.

To clearly define the objective, i.e., what does it mean to “improve calibration”, I draw on statistical estimation theory. Development of experimental measurement methods is often incremental, and previous research in the field of psychophysiology has suggested that the achievable improvements might be relatively small (Bach & Melinscak, 2020). In this situation, the challenge is to distinguish between two relatively similar population values for retrodictive validity from sample estimators. Consequently, retrodictive validity estimates themselves should be accurate. Thus, I ask how design features can be used to maximise accuracy of retrodictive validity estimators. Retrodictive validity is expressed as bivariate correlation. Sample estimators for correlation coefficients are approximately unbiased for large samples, such that the objective of maximising accuracy reduces to minimising estimator variance. Estimator variance, in this case, depends not only on sample size but crucially on features of the parent distributions which will often be non-normal. Hence, the approach taken here is to analytically identify relevant distribution features using asymptotic limits, and then to numerically simulate distributions with combinations of these features in finite samples. Note that all simulation results depict the quantity $n\,V(r_{SY})$, which is (approximately) independent of sample size $n$. The variance of the retrodictive validity estimator in a finite sample of a particular size can be derived from this quantity by dividing by the actual sample size used.

2. Methods

2.1. Model objects and notation

2.1.1. Overview

To recapitulate the calibration approach, an experiment is conducted in which the independent variable is chosen to generate predicted standard scores S of the latent variable (see Table 1). However, these predicted values are not accurately achieved: the truly achieved scores T deviate by experimental aberration ω. Measurement operations are then performed to quantify T, but the measured scores Y deviate by some error ɛ. Because psychological variables have no scale, we can disregard linear mappings between S, T, and Y without loss of generality. To illustrate this with an example, consider the calibration of lightness measurement (i.e., the perception of an object’s luminance). Experimenters can vary (physical) luminance and colour of an object in order to manipulate (perceived) lightness. Then the standard scores S are defined by a vector of the physical quantities (luminance and colour) together with a formal theory of lightness (e.g. the CIECAM02 model) that takes these quantities as input. A simple measurement operation might consist in asking people to indicate lightness on a visual analogue scale.

Table 1.

Notation used in the methods and results.

$E$ Expectation

$V$ Variance

$S$ Standard (i.e. intended or predicted) scores of the latent variable

$T$ True (i.e. achieved) scores of the latent variable

$Y$ Measured scores of the latent variable

$\omega$ Experimental aberration: $T=S+\omega$. Experimental aberration can be decomposed into aberration due to non-linearity, and random (imprecision) aberration: $\omega=f(S)-S+\omega_I$, see next row.

$f$, $\omega_I$ Non-linearity and random (imprecision) aberration: $T=f(S)+\omega_I(S)$, where $f:\mathbb{R}\to\mathbb{R}$ is fixed for a given value of $S$, and $\omega_I$ is randomly drawn from a probability distribution that depends on $S$.

$\varepsilon$ Measurement error: $Y=T+\varepsilon$

$\rho_{SY}$ Retrodictive validity (i.e. correlation between standard and measured attribute scores)

$r_{SY}$ Retrodictive validity estimator (i.e. sample correlation between standard and measured attribute scores)

$\zeta_I\in\mathbb{R}^+$, $\omega_0$ Scale and shape of random (imprecision) aberration: $\omega_I=\zeta_I\omega_0$, $V(\omega_0)=1$, $\zeta_I^2=V(\omega_I)$

$\xi\in\mathbb{R}^+$ Scaling constant in various simulations

$\sigma^2\in\mathbb{R}^+$ Variance of normal distributions used in various simulations

Experimental aberration ω has two sources. The first is that the theory that links experimental manipulation with latent variable scores (the CIECAM02 lightness model in our example) might be inaccurate; in other words, S is mis-specified. This is modelled here with a systematic non-linearity $f(S)$. The second source is random variation due to imprecision $\omega_I$ (due to physical and psychological factors), which is modelled by a (scalable) probability distribution that may depend on S (corresponding to non-additive noise). Measurement error ɛ also has two sources: systematic mis-specification in the measurement system, and random variation.

2.1.2. Standard values

Standard values S are the experimenters’ prediction for the effect of their experimental manipulation on the latent variable. Since experimenters have full control over the experimental manipulation, the distribution of S is a key consideration for designing a calibration experiment. In some cases where the theory is well-developed (e.g. CIECAM02 model for lightness perception), S might be chosen from a large range, allowing many different distributions including approximately normal distributions. In many other cases, the theory that links independent variable to latent attribute values might not be numerically specified and only allow predictions for a discrete number of values of the independent variable. For example, attributes like “fear memory” or “confidence” might only allow predicting two standard values, i.e. “low” and “high” (Bach et al., 2023). In the numerical simulations, we consider discrete and continuous distributions, including a standard normal distribution, and a truncated normal distribution, where the latter is more feasible for experimental implementation.

2.1.3. Experimental aberration

The distribution of the experimental aberration ω is not under direct experimental control. First, if the systematic non-linearity $f(S)$ were precisely known then one would be able to include it in the model and thus adjust the predicted standard values S. Secondly, precisely revealing the distribution of random aberration would already require a perfect measurement system. However, it is still useful to consider the impact of these two terms. Regarding the non-linearity, theoretical or experimental considerations might suggest a particular class of non-linearity $f(S)$. For example, the Helmholtz–Kohlrausch effect, which is not included in the CIECAM02 model, would suggest that for certain colours, the model over- or underestimates the true perceived lightness, even if this predicted mis-specification is not known numerically. Regarding random aberration, it may be possible to unspecifically reduce the scale of the aberration with experimental means. For example, conducting a psychophysics study in a dedicated lab instead of as an online experiment is likely to result in smaller aberration. This is likely to include an impact on the shape of aberration distribution, which is unknown. Here, we model this situation with an unspecific impact on the scale of random aberration. This is why we decompose random aberration $\omega_I=\zeta_I\omega_0$ into shape $\omega_0$ and scale $\zeta_I\in\mathbb{R}^+$, with $V(\omega_0)=1$ and $V(\omega_I)=\zeta_I^2$.

2.1.4. Measurement error

Measurement error is not under direct experimental control in a calibration experiment. Crucially, data transformations after the experiment may influence measurement error, which is therefore not known a priori. Hence, a calibration experiment should be designed in a way that is useful under various levels of measurement error. In numerical simulations, we assume throughout that aberration and measurement error are not only uncorrelated but also stochastically independent, and that measurement error follows a normal distribution. We explore several levels of error variance in order to account for measurement instruments of different precision. To simplify the results, and because systematic measurement error is difficult to foresee a priori, we do not consider its effect.

2.1.5. Scaling

Because psychological variables have no natural scale or anchor point, we can assume, without loss of generality: $E(S)=E(T)=E(Y)=E(\omega)=E(\varepsilon)=0$, $V(S)=1$, $\mathrm{Cov}(S,T)=1$.
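As a brief sketch of what this scaling implies, the following lines spell out the resulting covariances. The step for $V(Y)$ additionally assumes that measurement error is uncorrelated with both $S$ and $\omega$, as in the simulations; this is an assumption of the sketch, and the full derivation is in Bach et al. (2020).

$$
\begin{aligned}
\mathrm{Cov}(S,\omega) &= \mathrm{Cov}(S,T)-V(S) = 0,\\
\mathrm{Cov}(S,Y) &= \mathrm{Cov}(S,T)+\mathrm{Cov}(S,\varepsilon) = 1,\\
V(Y) &= V(S)+V(\omega)+V(\varepsilon) = 1+V(\omega)+V(\varepsilon),\\
\rho_{SY} &= \frac{\mathrm{Cov}(S,Y)}{\sqrt{V(S)\,V(Y)}} = \frac{1}{\sqrt{1+V(\omega)+V(\varepsilon)}}.
\end{aligned}
$$

The last line has the same form as Eq. (1) in Section 2.2.4.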

2.2. Numerical simulations

In our simulations, we evaluate the following distributions. These examples are based on the analytical results and represent distributions with extreme values for some relevant features. For simplicity, the simulations consider calibration experiments in which the standard scores are independently randomised for each instantiation of the experiment, rather than being balanced.

2.2.1. Standard values

  1. Two equiprobable values: $S\in\{-1,1\}$.

  2. Four equiprobable values, ensuring $V(S)=1$: $S\in\{-3/\sqrt{5},\,-1/\sqrt{5},\,1/\sqrt{5},\,3/\sqrt{5}\}$.

  3. Continuous uniform distribution, ensuring $V(S)=1$: $S\sim U(-\sqrt{3},\sqrt{3})$.

  4. Standard normal distribution: $S\in\mathbb{R}$, $S\sim N(0,1)$.

  5. Truncated normal distribution with variance $\sigma^2$ numerically determined to ensure $V(S)=1$: $S\sim N(0,\sigma^2)$, $S\in[-2\sigma,2\sigma]$.
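The five standard-value distributions can be sketched in a few lines of Python. This is an illustrative sketch, not the Matlab code used for the reported simulations. The function names are hypothetical, and the closed-form expression for the truncated-normal σ is a stand-in for the numerical determination described above.

```python
import math
import random


def truncated_normal_sigma(k=2.0):
    """Sigma such that N(0, sigma^2) truncated to [-k*sigma, k*sigma] has variance 1.

    Uses the closed-form variance of a symmetric truncated normal,
    sigma^2 * (1 - 2*k*phi(k) / P(|Z| <= k)); an assumption of this sketch.
    """
    pdf_k = math.exp(-0.5 * k * k) / math.sqrt(2.0 * math.pi)  # phi(k)
    mass = math.erf(k / math.sqrt(2.0))                        # P(|Z| <= k)
    return 1.0 / math.sqrt(1.0 - 2.0 * k * pdf_k / mass)


def sample_standard_values(kind, n, rng):
    """Draw n independently randomised standard scores S with E(S)=0, V(S)=1."""
    if kind == "binary":
        return [rng.choice((-1.0, 1.0)) for _ in range(n)]
    if kind == "four":
        a = 1.0 / math.sqrt(5.0)
        return [rng.choice((-3 * a, -a, a, 3 * a)) for _ in range(n)]
    if kind == "uniform":
        b = math.sqrt(3.0)
        return [rng.uniform(-b, b) for _ in range(n)]
    if kind == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if kind == "truncated":
        sigma = truncated_normal_sigma()
        out = []
        while len(out) < n:  # simple rejection sampling
            x = rng.gauss(0.0, sigma)
            if abs(x) <= 2.0 * sigma:
                out.append(x)
        return out
    raise ValueError(f"unknown distribution: {kind}")


rng = random.Random(1)
for kind in ("binary", "four", "uniform", "normal", "truncated"):
    s = sample_standard_values(kind, 20000, rng)
    mean = sum(s) / len(s)
    var = sum((x - mean) ** 2 for x in s) / len(s)
    print(f"{kind:10s} mean={mean:+.3f} var={var:.3f}")
```

Rejection sampling is used for the truncated case because it needs only the standard library; about 95% of draws are accepted for truncation at two standard deviations.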

2.2.2. Systematic non-linearity in the experimental aberration

For the simulated non-linearities $f(S)$ (see Table 1), the parameter $\xi$ is numerically determined to ensure that the linear component of the non-linearity is fixed, i.e. $\mathrm{Cov}(S,f(S))=1$, which simplifies model scaling. For each of the two types of non-linear mappings, we consider a version with smaller and with larger deviation from linearity, i.e. low and high values for $V(f(S))$.

  1. No non-linearity: $f(S)=S$.

  2. Sigmoid non-linearity: $f(S)=\xi S^{1/3}$, and $f(S)=\xi S^{1/5}$. This means that the predicted standard scores are closer to their mean than the true scores; in other words: the standard scores overestimate the true scores when the standard scores are close to the mean, and they underestimate the true scores when the standard scores are extreme. Such non-linearity might be suspected when the latent variable is supposed to be bounded and saturates, while the experimental independent variable is unbounded or has very wide bounds. This might be the case in many areas of perception.

  3. Inverse sigmoid non-linearity: $f(S)=\xi S^{3}$, and $f(S)=\xi S^{5}$. Such non-linearity might be suspected when there is evidence that small changes of the independent variable can have strong effects at extreme values. A classical example for inverse sigmoid non-linearity is subjective probability weighting in value-based decision-making: if a latent variable is manipulated by economic gambles with two fixed potential outcomes and different probabilities of winning, then the predicted and true values of the latent variable are likely to relate by inverse sigmoid mapping. (In this example, of course, this known inverse sigmoid mapping could be taken into account when using prospect theory to predict the standard values.)
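To illustrate how the scaling constant ξ can be obtained, the sketch below computes it exactly for the four-valued standard distribution. It assumes that fractional powers such as $S^{1/3}$ denote the sign-preserving root, which is an interpretation and not stated in the text. The helper names are hypothetical.

```python
import math


def signed_power(s, p):
    """Odd power function sign(s) * |s|**p (assumed reading of S^(1/3), S^3, ...)."""
    return math.copysign(abs(s) ** p, s)


def xi_for_unit_covariance(values, p):
    """xi such that Cov(S, xi * signed_power(S, p)) = 1 for equiprobable values of S."""
    n = len(values)
    g = [signed_power(s, p) for s in values]
    mean_s = sum(values) / n
    mean_g = sum(g) / n
    cov = sum((s - mean_s) * (gi - mean_g) for s, gi in zip(values, g)) / n
    return 1.0 / cov


a = 1.0 / math.sqrt(5.0)
four_values = [-3 * a, -a, a, 3 * a]

for p in (1 / 5, 1 / 3, 1.0, 3.0, 5.0):
    xi = xi_for_unit_covariance(four_values, p)
    # With E(f(S)) = 0 by symmetry, V(f(S)) is the mean of the squared values.
    var_f = sum((xi * signed_power(s, p)) ** 2 for s in four_values) / len(four_values)
    print(f"p={p:.3f}  xi={xi:.4f}  V(f(S))={var_f:.4f}")
```

With only two values of S at ±1, every odd power function returns S itself, so ξ = 1 and the non-linearity has no effect in a binary design.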

2.2.3. Random aberration due to imprecision: shape

The parameter $\xi$ in some of these simulations was numerically determined to ensure that $V(\omega_0)=1$. For all simulations, $\mathrm{Cov}(S,\omega_0)=0$. These distributions were selected as they provide extreme values for some features highlighted in the analytical derivation, and they are illustrated in Fig. 2. Some tentative intuition for where these features might occur in psychology is given after each distribution; these explanations do not refer to the specific example distributions but rather to the feature values being modelled.

Fig. 2.


Illustration of calibration-specific terms that impact on variance of retrodictive validity estimators. A: Kurtosis of the different standard value distributions used for numeric simulations. BC: Two aberration distributions with different kurtosis. B: Continuous uniform distribution (platykurtic example). C: Student’s t-distribution with ν=5 (leptokurtic example). DE: Two random aberration distributions with different covariance of squared standard values and aberration. D: (Conditional) normal distribution with larger variance for smaller absolute values of S. E: (Conditional) normal distribution with larger variance for larger absolute values of S. FG: Sigmoid and inverse sigmoid non-linearity. HI: Two random aberration distributions with different skewness of the aberration. H: Gamma distribution with positive skew for larger S. I: Gamma distribution with negative skew for larger S.

  1. Continuous uniform distribution (platykurtic example): $\omega_0\sim U(-\sqrt{3},\sqrt{3})$. This might for example occur due to physical or conceptual limitations in the resolution of the independent variable, i.e. it is specified only within a range.

  2. Standard normal distribution (mesokurtic example): $\omega_0\sim N(0,1)$. This is a standard assumption in many types of research.

  3. Student’s t-distribution with ν=5 degrees of freedom (leptokurtic example): $\omega_0\sim t_5$. Heavy-tailed distributions are seen throughout psychology.

  4. (Conditional) normal distribution with larger variance for smaller absolute values of S: for fixed S, $\omega_0\sim N(0,\xi\exp(-|S|))$. This might for example occur due to floor and ceiling effects if there are person-independent boundaries on achievable true scores.

  5. (Conditional) normal distribution with larger variance for larger absolute values of S: for fixed S, $\omega_0\sim N(0,\xi|S|)$. This might happen if persons differ in their boundaries on achievable true scores, or if more extreme values are achieved by qualitatively different experimental manipulations.

  6. (Conditional) gamma distribution with positive skew for larger S: for fixed S, $\omega_0\sim S\,[\Gamma(2,1/\sqrt{2})-E(\Gamma(2,1/\sqrt{2}))]$. This could happen due to floor and ceiling effects if there are person-independent boundaries on achievable true scores.

  7. (Conditional) gamma distribution with positive skew for smaller S: for fixed S, $\omega_0\sim -S\,[\Gamma(2,1/\sqrt{2})-E(\Gamma(2,1/\sqrt{2}))]$. This might occur due to a combination of a categorical/qualitative and a quantitative mechanism in generating the true score, e.g. a percept is generated as either positive or negative in relation to a reference value, and then imbued with a specific value on the half-axes, which will have more variability into the extreme direction than towards the reference.
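For the two conditional normal distributions, the scaling constant ξ can be sketched as follows for a discrete, equiprobable S. The sketch assumes that the second argument of N is a variance and relies on the law of total variance with zero conditional mean; the function names are hypothetical. It also evaluates $E(S^2\omega_0^2)$, the moment that distinguishes the two distributions.

```python
import math


def xi_for_unit_variance(values, var_shape):
    """xi such that omega_0 | S ~ N(0, xi * var_shape(S)) has marginal variance 1.

    Assumes S is equiprobable on `values`; with zero conditional mean,
    V(omega_0) = xi * E[var_shape(S)].
    """
    return len(values) / sum(var_shape(s) for s in values)


def weighted_second_moment(values, xi, var_shape):
    """E(S^2 * omega_0^2) = E[S^2 * xi * var_shape(S)] for equiprobable S."""
    return sum(s * s * xi * var_shape(s) for s in values) / len(values)


a = 1.0 / math.sqrt(5.0)
four_values = [-3 * a, -a, a, 3 * a]

shapes = {
    "larger variance for smaller |S|": lambda s: math.exp(-abs(s)),
    "larger variance for larger |S|": lambda s: abs(s),
}

for label, shape in shapes.items():
    xi = xi_for_unit_variance(four_values, shape)
    moment = weighted_second_moment(four_values, xi, shape)
    print(f"{label}: xi={xi:.3f}, E(S^2 omega_0^2)={moment:.3f}")
```

When $\omega_0$ is independent of S, this moment equals 1; the two conditional distributions push it below and above 1, respectively.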

2.2.4. Random aberration due to imprecision: scale

Simulated values for $\zeta_I$ were drawn from a wide range, including typical values encountered in experimental settings. These can be derived from reported retrodictive validity values in calibration experiments. Systematic research using retrodictive validity as a metric has mainly been conducted in the field of human fear conditioning. Here, retrodictive validity has typically been expressed as (Cohen’s) d for a within-subjects difference between two experimental treatments. Published values range between d=0.4 for some skin conductance analysis methods, and d=1.2 for fear-potentiated startle eye-blink (e.g. Bach & Melinscak, 2020). This corresponds to retrodictive validity estimates between (Pearson’s) r=.20 and r=.51. From Bach et al. (2020) (SI Eq. (33)), we have

$$\rho_{SY}=\frac{1}{\sqrt{V(\omega)+V(\varepsilon)+1}}, \tag{1}$$

and so

$$V(\omega)+V(\varepsilon)=\frac{1}{\rho_{SY}^{2}}-1. \tag{2}$$

Thus, the two variances combined will take values between 24 for r=.20, and 2.8 for r=.51. Since these reported estimates were derived from experiments with only two possible values of S, we can assume $f(S)=S$. If aberration and error variance were equal, this would correspond to $\zeta_I=3.46$ and $\zeta_I=1.18$, respectively. Assuming that retrodictive validity might be lower or higher in other areas of psychology, $\zeta_I$ was varied over the interval $[0.1,4]$.
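The arithmetic in this paragraph can be retraced with a short sketch. The conversion from d to r used below, $r=d/\sqrt{d^2+4}$, is a common formula for two equally sized conditions and is an assumption here; the text does not state which conversion was applied. Function names are hypothetical.

```python
import math


def d_to_r(d):
    """Common d-to-r conversion for two equally sized conditions (assumed here)."""
    return d / math.sqrt(d * d + 4.0)


def combined_variance(rho):
    """Eq. (2): V(omega) + V(epsilon) = 1 / rho^2 - 1."""
    return 1.0 / (rho * rho) - 1.0


def zeta_if_equal_split(rho):
    """zeta_I if aberration and error variance are equal and f(S) = S."""
    return math.sqrt(combined_variance(rho) / 2.0)


for d in (0.4, 1.2):
    r = round(d_to_r(d), 2)  # round to two decimals, as reported in the text
    print(f"d={d}: r={r:.2f}, V(omega)+V(eps)={combined_variance(r):.2f}, "
          f"zeta_I={zeta_if_equal_split(r):.2f}")
```

Small deviations from the values quoted in the text can arise depending on where intermediate results are rounded.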

2.2.5. Measurement error

Because measurement error is not usually known a priori, and can depend on the other model quantities in complicated ways, all simulations use a placeholder error distribution that represents i.i.d. Gaussian noise with three levels: $\varepsilon\sim N(0,\sigma^2)$ with $\sigma\in\{0.1,1,4\}$.

2.2.6. Simulation settings

All simulations used Matlab’s default Mersenne twister to generate random numbers. Distributions were simulated by drawing $10^7$ samples. Asymptotic results used all of these samples to approximate the expectations in Eq. (3), from which estimator variance was analytically calculated. For finite-sample results, I randomly selected (with replacement) from these simulated distributions $10^6$ samples of size $n=100$ (based on the sample size suggested in Bach et al. (2023)) and computed the resulting variance of the sample retrodictive validity estimator. All simulation results depict the quantity $n\,V(r_{SY})$, from which estimator variance is computed by dividing by the actual sample size. For small samples, sample size can have an impact on $n\,V(r_{SY})$. Hence, I compared the results to simulations with sample sizes of $n=50$ and $n=1000$. On average, the absolute difference in $n\,V(r_{SY})$ between the differently sized samples was about 1%, and in more than 90% of the simulated scenarios, the absolute difference was below 5%.
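The finite-sample procedure can be illustrated with a minimal Monte Carlo sketch in Python. It is not the Matlab code used for the reported results: it draws fresh samples instead of resampling from a pre-generated pool, uses far fewer replications, and covers only one scenario (binary S, no non-linearity, normal random aberration and normal error). All function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def n_times_var_r(n, reps, zeta, sigma, rng):
    """Monte Carlo estimate of n * V(r_SY) for binary S and f(S) = S.

    Y = S + zeta * omega_0 + epsilon with omega_0 ~ N(0, 1), epsilon ~ N(0, sigma^2).
    """
    rs = []
    for _ in range(reps):
        s = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        y = [si + rng.gauss(0.0, zeta) + rng.gauss(0.0, sigma) for si in s]
        rs.append(pearson_r(s, y))
    mean_r = sum(rs) / reps
    return n * sum((r - mean_r) ** 2 for r in rs) / (reps - 1)


rng = random.Random(0)
estimate = n_times_var_r(n=100, reps=2000, zeta=1.0, sigma=1.0, rng=rng)
print(f"n * V(r_SY) ~ {estimate:.3f}")
print(f"V(r_SY) at n = 100 ~ {estimate / 100:.5f}")
```

Dividing the printed quantity by a planned sample size gives the approximate estimator variance for that design, which is how the figures in this paper are meant to be read.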

3. Results

3.1. Analytical results

The asymptotic limit of the sample estimator variance can be decomposed into the following terms (see Appendix for derivation):

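$$
\begin{aligned}
\lim_{n\to\infty} n\,V(r_{SY}) ={}& E(S^4)\,\tfrac{1}{4}\left(\rho_{SY}^2-2\rho_{SY}^4+\rho_{SY}^6\right)\\
&+E(\omega^4)\,\tfrac{1}{4}\rho_{SY}^6\\
&+E(S^2\omega^2)\left(\rho_{SY}^2-\tfrac{5}{2}\rho_{SY}^4+\tfrac{3}{2}\rho_{SY}^6\right)\\
&+E(S^3\omega)\left(\rho_{SY}^2-2\rho_{SY}^4+\rho_{SY}^6\right)\\
&+E(S\omega^3)\left(-\rho_{SY}^4+\rho_{SY}^6\right)\\
&+h_\varepsilon(\varepsilon),
\end{aligned}
\tag{3}
$$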
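The five terms of Eq. (3) that do not involve measurement error can be evaluated with a small helper. This sketch assumes that the moments are supplied by the user and that measurement error is absent, so that the remaining term $h_\varepsilon$ is taken to vanish; the function name and argument names are hypothetical. As a plausibility check, for normal S and independent normal ω the sum should reproduce the classical asymptotic variance of a correlation under bivariate normality, $(1-\rho^2)^2$.

```python
import math


def asymptotic_terms(m_s4, m_w4, m_s2w2, m_s3w, m_sw3, rho):
    """Terms of Eq. (3) not involving measurement error.

    m_s4 = E(S^4), m_w4 = E(w^4), m_s2w2 = E(S^2 w^2),
    m_s3w = E(S^3 w), m_sw3 = E(S w^3), with w the aberration.
    """
    q = rho * rho
    return {
        "A1": m_s4 * 0.25 * (q - 2 * q**2 + q**3),
        "A2": m_w4 * 0.25 * q**3,
        "A3": m_s2w2 * (q - 2.5 * q**2 + 1.5 * q**3),
        "A4": m_s3w * (q - 2 * q**2 + q**3),
        "A5": m_sw3 * (-(q**2) + q**3),
    }


# Example: S ~ N(0, 1), independent aberration w ~ N(0, v), no measurement error.
v = 2.0
rho = 1.0 / math.sqrt(1.0 + v)
terms = asymptotic_terms(m_s4=3.0, m_w4=3.0 * v**2, m_s2w2=v,
                         m_s3w=0.0, m_sw3=0.0, rho=rho)
total = sum(terms.values())
print(terms)
print(total, (1.0 - rho**2) ** 2)  # the two numbers should agree
```

In practice the moments would be approximated from large simulated samples, as described in Section 2.2.6.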

where $h_\varepsilon$ collapses all terms involving the (a priori unknown and unknowable) measurement error. In the following I discuss the five terms not involving measurement error, keeping in mind the goal of minimising $V(r_{SY})$. Fig. 2 illustrates the different quantities. I note (see Appendix) that although the sum of these terms is non-negative, some of them can be negative (and thus serve to reduce estimator variance).

Weighted kurtosis of the standard value distribution.

\begin{align}
A_1 &:= E(S^4)\,\tfrac{1}{4}\left(\rho_{SY}^2-2\rho_{SY}^4+\rho_{SY}^6\right) \tag{4}\\
&= \tfrac{1}{4}\,\mathrm{Kurt}(S)\left(\rho_{SY}^2-2\rho_{SY}^4+\rho_{SY}^6\right), \tag{5}
\end{align}

because $E(S)=0$ and $V(S)=1$. Both factors in this term are non-negative on $\rho_{SY}\in[0,1]$. Kurtosis of the standard value distribution is under experimental control, and reducing it will always reduce $A_1$. Fig. 2A illustrates different distributions and their kurtosis. The second factor has minima at $\rho_{SY}=0$ and $\rho_{SY}=1$, and a maximum at $\rho_{SY}=1/\sqrt{3}\approx 0.577$ (corresponding to $V(\omega)+V(\varepsilon)=2$). For fixed $V(\varepsilon)$, this means that reducing $V(\omega)$ from a given value can increase or reduce this term. The maximum value of $A_1$ on the interval $\rho_{SY}\in[0,1]$ is $\tfrac{1}{27}\mathrm{Kurt}(S)$. While kurtosis is theoretically unbounded, high-kurtosis distributions of S appear unnecessary in practice, and for a plausible normal distribution of S, the maximum is $A_1\approx 0.11$. Thus, overall this term makes a relatively modest contribution to $V(r_{SY})$.
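The location of this maximum can be verified in a few lines. Writing $q=\rho_{SY}^2$, the second factor factorises, and setting its derivative to zero gives the stated values:

$$
\begin{aligned}
\rho_{SY}^2-2\rho_{SY}^4+\rho_{SY}^6 &= q(1-q)^2,\\
\frac{d}{dq}\,q(1-q)^2 &= (1-q)(1-3q) = 0 \;\Rightarrow\; q=\tfrac{1}{3},\\
\max A_1 &= \tfrac{1}{4}\,\mathrm{Kurt}(S)\cdot\tfrac{1}{3}\left(\tfrac{2}{3}\right)^2 = \tfrac{1}{27}\,\mathrm{Kurt}(S).
\end{aligned}
$$

For a normal distribution, $\mathrm{Kurt}(S)=3$, so the maximum is $3/27\approx 0.11$; from Eq. (2), $q=1/3$ corresponds to $V(\omega)+V(\varepsilon)=2$.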

Weighted kurtosis of the aberration distribution.

\begin{align}
A_2 &:= E(\omega^4)\,\tfrac{1}{4}\rho_{SY}^6 \tag{6}\\
&= \tfrac{1}{4}\,\mathrm{Kurt}(\omega)\,\frac{V^2(\omega)}{\left(1+V(\omega)+V(\varepsilon)\right)^3} \tag{7}
\end{align}

using $E(\omega)=0$ and thus $\mathrm{Kurt}(\omega)=E(\omega^4)/V^2(\omega)$, and Eq. (1). Both factors in this term are non-negative on $\rho_{SY}\in[0,1]$. Reducing aberration kurtosis would thus always reduce $A_2$, although in practice this will usually not be under experimental control (Fig. 2BC). For fixed $V(\varepsilon)$, the second factor has a maximum at $V(\omega)=2V(\varepsilon)+2$. In words, if the aberration is relatively large compared to the measurement error, then decreasing the aberration increases $A_2$. If $V(\omega)$ is below this bound, then decreasing the aberration decreases $A_2$. This is the case for example if $V(\omega)=V(\varepsilon)$. Because $V(\omega)$ also impacts on the other terms, its net effect can only be understood in the numerical simulations.
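As a short check of the stated maximum, write $v=V(\omega)$ and $e=V(\varepsilon)$ (shorthand introduced only for this sketch) and differentiate the second factor with respect to $v$:

$$
\frac{d}{dv}\,\frac{v^2}{(1+v+e)^3}
= \frac{2v(1+v+e)-3v^2}{(1+v+e)^4}
= \frac{v\left(2+2e-v\right)}{(1+v+e)^4}.
$$

The derivative is positive for $0<v<2e+2$ and negative for $v>2e+2$, so the factor peaks at $V(\omega)=2V(\varepsilon)+2$.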

Weighted covariance of squared intended values and squared aberration.

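\begin{align}
A_3 &:= E(S^2\omega^2)\left(\rho_{SY}^2-\tfrac{5}{2}\rho_{SY}^4+\tfrac{3}{2}\rho_{SY}^6\right) \tag{8}\\
&= E(S^2\omega_0^2)\left(\frac{V(\omega)}{1+V(\omega)+V(\varepsilon)}-\frac{5V(\omega)}{2\left(1+V(\omega)+V(\varepsilon)\right)^2}+\frac{3V(\omega)}{2\left(1+V(\omega)+V(\varepsilon)\right)^3}\right). \tag{9}
\end{align}

with Eq. (1) and $\omega=\omega_0\sqrt{V(\omega)}$. The first factor is non-negative. If S and ω are stochastically independent, or if there are just two equiprobable values of S, then the first factor equals 1. In other cases, the first factor is large when ω takes larger absolute values for more extreme values of S (Fig. 2DE). For the second factor, if $V(\varepsilon)>1/2$, then reducing $V(\omega)$ will decrease $A_3$. If $V(\varepsilon)\le 1/2$, then on an interval $[0,V_m(\omega)]$, decreasing $V(\omega)$ increases $A_3$, where $V_m(\omega)$ depends on $V(\varepsilon)$ with $V_m(\omega)<0.3$. The impact of changing $V(\omega)$ is analysed in detail in the numerical simulations below.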
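The role of the threshold $V(\varepsilon)=1/2$ can be made visible by factorising the second factor. With the shorthand $v=V(\omega)$, $e=V(\varepsilon)$ and $w=1+v+e$ (introduced only for this sketch):

$$
\frac{v}{w}-\frac{5v}{2w^2}+\frac{3v}{2w^3}
= \frac{v\,(2w^2-5w+3)}{2w^3}
= \frac{v\,(w-1)(2w-3)}{2w^3}.
$$

Since $w-1=v+e\ge 0$, the sign of the factor is that of $2w-3=2v+2e-1$. The factor is zero at $v=0$ and negative for $0<v<\tfrac{1}{2}-e$, which is only possible when $e<1/2$; in that case the factor must initially decrease as $v$ grows from zero, so that reducing $v$ within this initial range increases the term.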

Weighted systematic aberration.

\begin{align}
A_4 &:= E(S^3\omega)\left(\rho_{SY}^2-2\rho_{SY}^4+\rho_{SY}^6\right) \tag{10}\\
&= E(S^3\omega_0)\left(\frac{\sqrt{V(\omega)}}{1+V(\omega)+V(\varepsilon)}-\frac{2\sqrt{V(\omega)}}{\left(1+V(\omega)+V(\varepsilon)\right)^2}+\frac{\sqrt{V(\omega)}}{\left(1+V(\omega)+V(\varepsilon)\right)^3}\right) \tag{11}
\end{align}

with Eq. (1) and $\omega=\omega_0\sqrt{V(\omega)}$. The first factor is zero if there are two or three equidistant values of S, or if there is no non-linearity, i.e., $f(S)=S$. Otherwise, the first factor is positive when ω takes more extreme values for more extreme values of S (e.g. inverse sigmoid non-linearity f, Fig. 2F), and negative in the opposite case (e.g. sigmoid non-linearity f, Fig. 2G). The second factor is non-negative. On an interval $[0,V_m(\omega)]$, decreasing $V(\omega)$ decreases this factor. The minimum value of $V_m(\omega)$ is achieved when $V(\varepsilon)=0$, which yields $V_m(\omega)=5$. It is plausible to assume that $V(\omega)$ will rarely be larger than this value. The impact of changing $V(\omega)$ then only depends on the first factor and is analysed in detail in the numerical simulations below.
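The bound of 5 can be retraced as follows, using the shorthand $v=V(\omega)$, $e=V(\varepsilon)$ and $w=1+v+e$ (introduced only for this sketch). The second factor simplifies to

$$
\sqrt{v}\left(\frac{1}{w}-\frac{2}{w^2}+\frac{1}{w^3}\right)=\frac{\sqrt{v}\,(w-1)^2}{w^3},
$$

which is non-negative. For $e=0$ this becomes $v^{5/2}/(1+v)^3$, with derivative

$$
\frac{d}{dv}\,\frac{v^{5/2}}{(1+v)^3}=\frac{v^{3/2}\,(5-v)}{2\,(1+v)^4},
$$

so the factor increases in $v$ up to $v=5$ and decreases thereafter.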

Weighted skewness of the aberration.

\begin{align}
A_5 &:= E(S\omega^3)\left(-\rho_{SY}^4+\rho_{SY}^6\right) \tag{12}\\
&= E(S\omega_0^3)\left(-\frac{V^{3/2}(\omega)}{\left(1+V(\omega)+V(\varepsilon)\right)^2}+\frac{V^{3/2}(\omega)}{\left(1+V(\omega)+V(\varepsilon)\right)^3}\right), \tag{13}
\end{align}

with Eq. (1) and $\omega=\omega_0\sqrt{V(\omega)}$. The first factor is positive if ω is positively skewed for large(r) values of S and negatively skewed for small(er) values of S (see Fig. 2HI). For two values of S, this would mean that the distribution of ω has long tails into both directions, but fewer values in the middle. For the opposite situation, the first factor will be negative. If ω has no skew for any value of S, then this term will be zero. The second factor is negative. On an interval $[0,V_m(\omega)]$, decreasing $V(\omega)$ increases this factor. The minimum value of $V_m(\omega)$ is achieved when $V(\varepsilon)=0$, which yields $V_m(\omega)=5$. It is plausible to assume that $V(\omega)$ will rarely be larger than this value. The impact of changing $V(\omega)$ then only depends on the first factor and is analysed in detail in the numerical simulations below.

3.2. Numerical results

Because several of the analytically identified terms depend on aberration in different ways, I numerically simulated various scenarios to quantify their impact on estimator variance. Fig. 3 shows simulated asymptotic results, where rows refer to different distributions of random aberration $\omega_0$, columns to different systematic non-linearities and levels of measurement error $V(\varepsilon)$, and the x-axis to different levels of random aberration $\zeta_I$. Using the same layout, Fig. 4 shows simulated results from finite samples, and Fig. 5 shows the direct comparison.

Fig. 3.


Simulated asymptotic results for 63 scenarios with different standard value distributions (colours), different random aberration distributions (rows), different scale of random aberration (x-axis), different systematic non-linearity (columns) and different levels of measurement error (columns). All panels show the product of asymptotic estimator variance with sample size. Note that under inverse sigmoid aberration, two columns are displayed with a different limit on the y-axis.

Fig. 4.


Simulated finite-sample results for 63 scenarios with different standard value distributions (colours), different random aberration distributions (rows), different scale of random aberration (x-axis), different systematic non-linearity (columns) and different levels of measurement error (columns). All panels show the product of finite-sample estimator variance with sample size. This quantity was averaged across $10^5$ samples of size $n=100$.

Fig. 5.


Direct comparison of asymptotic limits (Fig. 3) and finite-sample results (Fig. 4). Faint colours denote the asymptotic limits.

We first consider estimator variance in the absence of systematic non-linearity (left three columns in Fig. 3, Fig. 4). When $\omega_0$ is independent of the standard values S (top three rows), different distributions of S yield relatively similar estimator variance. This confirms the analytically derived result that although the kurtosis of S does affect $A_1$, its numerical impact is relatively small. Furthermore, reducing the scale of aberration can reduce estimator variance considerably, but only for low to medium levels of measurement error, i.e. when the error variance $V(\varepsilon)$ is on a lower or same order of magnitude as the variance of the aberration. In high-noise scenarios, when measurement error variance is on a larger order, reducing aberration only has a small effect.

Next, we consider conditional distributions of $\omega_0$ that depend on S. Here, different choices of S yield more dissimilar estimator variance. In most situations, binary S yields the smallest, and normally distributed S the largest estimator variance, with a small impact of truncating S. In most circumstances, reducing the scale of $\omega_I$ reduces estimator variance. However, in the specific case that $V(\omega_0)$ inversely scales with the absolute values of S (fourth row), these results are reversed: binary S yields the largest estimator variance, and reducing the scale of $\omega_I$ can increase estimator variance, in particular if measurement error is large. For small and medium levels of measurement noise, this effect is mainly due to term $A_2$, as anticipated in the analytical results. However, $A_2$ has hardly any impact at high measurement noise. Here, the increase in estimator variance is due to an interplay of several factors, including the error variance-dependent term $h_\varepsilon$. At high levels of measurement noise, this term generally increases with decreasing $\zeta_I$. Note that in the absence of prior knowledge on the distribution of measurement error, I assumed i.i.d. normally distributed error for the simulations. Insofar as some of these results depend on $h_\varepsilon$, they may not generalise.

In the presence of sigmoid non-linearity in the aberration, the results look very similar (middle three columns), and there is almost no impact of the variance of the systematic aberration. However, inverse sigmoid aberration changes this picture (right three columns) in several important ways. First, asymptotic estimator variance becomes very large for large systematic non-linearity when S is a normal distribution and the scale of $\omega_I$ is small. This is due to term $A_2$. The implementation of inverse sigmoid aberration used here is a cubic function, and the kurtosis of the cube of a standard normally distributed variable corresponds to the value of its 12th moment which is $11!!=10395$. This excessive kurtosis is reduced by adding random noise to ω; hence the phenomenon is particularly pronounced for small scale of $\omega_I$ and ɛ. However, kurtosis is not as excessive for any truncated S or in finite samples (see Fig. 4), and this phenomenon therefore may be irrelevant for practical purposes. The following three differences apply both for asymptotic and finite-sample cases. Second, estimator variance is generally much larger for (truncated) normal distribution of S than for the discrete and continuous uniform distributions, due to an increase of the term $A_3$. Third, the scale of the non-linearity has a larger impact on estimator variance here than for sigmoid non-linearity, again due to an impact on term $A_3$. Fourth, there are more scenarios in which decreasing the scale of $\omega_I$ increases estimator variance.

4. Discussion

In this note, I investigated quantities that determine the sample variance of retrodictive validity estimators. Of particular interest are design features that an experimenter can control; and it is useful to consider how different settings of these features fare under various combinations of non-controllable features.

The first feature under experimental control is the sample size, to which estimator variance is, asymptotically, inversely proportional. Increasing sample size can therefore reduce estimator variance in any scenario. All other results concern the quantity $n\,V(r_{SY})$, where $n$ is the sample size. In other words, they describe how estimator variance is affected when sample size is constant. The second feature under experimental control is the distribution of S, which the experimenter can freely choose. In the absence of systematic non-linearity in the aberration, and when imprecision aberration is i.i.d., the choice of S has no large impact on estimator variance. However, in many other situations, estimator variance increases from binary to discrete to continuous uniform to normally distributed S. The third feature of interest is the scale of the aberration distribution ζI, which is at least under indirect experimental control. Reducing imprecision aberration reduces estimator variance in many but not all scenarios, and it can sometimes even increase estimator variance. Because it is generally unknown which scenario occurs in a given calibration experiment, the effect of reducing imprecision aberration is difficult to predict.
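
The dependence of estimator variance on sample size can be illustrated with a small Monte Carlo experiment. The sketch below is not taken from the published simulation code. It assumes binary S (values -1 and 1), i.i.d. normal aberration with unit standard deviation, and no further measurement error; all function names, sample sizes, and the seed are hypothetical choices.

```python
import random
import statistics


def pearson_r(s, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(s)
    mean_s, mean_y = sum(s) / n, sum(y) / n
    sxy = sum((a - mean_s) * (b - mean_y) for a, b in zip(s, y))
    sxx = sum((a - mean_s) ** 2 for a in s)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_r_variance(n, reps=1000, noise_sd=1.0, seed=0):
    """Variance of r_SY across `reps` simulated calibration experiments of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        s = [rng.choice((-1.0, 1.0)) for _ in range(n)]   # binary standard values
        y = [a + rng.gauss(0.0, noise_sd) for a in s]     # measured scores
        estimates.append(pearson_r(s, y))
    return statistics.variance(estimates)


var_small = simulate_r_variance(n=40)
var_large = simulate_r_variance(n=160)
print(var_small, var_large, var_small / var_large)
```

Under these assumptions, quadrupling the sample size should reduce the variance of the estimator by a factor of roughly four, up to simulation noise and finite-sample effects.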

From these results, several recommendations for the design of calibration experiments can be derived. First, it may be preferable to use a relatively small number of discrete experimental treatments, as is usually the case anyway in many fields of experimental psychology. Binary treatments fare particularly well, but in advanced experimental fields, a goal may be to calibrate a measurement across a range of values and to include a possible impact of systematic aberration, which would motivate the use of several discrete values. If a continuous distribution is desired, then normal distributions, however desirable they are for simplified statistical computations, should be avoided in favour of a continuous uniform distribution. A second recommendation relates to aberration and the use of resources. Reducing aberration is obviously desirable in its own right, but it may be costly if it involves specific experimental efforts such as shielded testing rooms, prolonged resting periods, monitoring participants over extended periods of time, etc. The present results suggest that reducing imprecision aberration is useful in most scenarios but not when measurement noise is relatively high, and not under some particular aberration distributions. Since these distributions are not a priori known to the calibration planner, limited resources may be better invested in increasing sample size, which decreases estimator variance in every scenario, in inverse proportion to sample size.

Other distributional features had a pronounced impact on estimator variance but many of these will be hard to control or even to know, for example the shape of the aberration distribution. However, as one of these features, systematic non-linearity deserves particular attention as it relates to shortcomings in the substantive theory used to generate the predicted standard values S. While not under experimental control in a particular calibration experiment, substantive theory can be improved over time, and its results can be incorporated into the analysis of a calibration experiment even after the data are acquired. A particularly problematic case, inverse sigmoid aberration, is exemplified by subjective probability weighting in value-based decision making. If a latent variable is manipulated by changing outcome probabilities in an economic gamble with two fixed potential outcomes, then the latent variable is likely to relate to the probabilities in an inverse-sigmoid manner. If, however, predicted values were derived with a linear probability weighting, then these would be misspecified with an inverse-sigmoid aberration. This could then be improved by using cumulative prospect theory to derive more accurate predictions of the true scores. Thus, reducing systematic aberration merits investment and yields long-term returns, as the results can be used retrospectively and prospectively in many calibration experiments. This requires improving substantive theory and thus resonates with recent theoretical proposals that measurement should principally be based on substantive theory (Borgstede & Eggert, 2023).
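
To make the probability-weighting example concrete, the sketch below evaluates one commonly used parametric weighting function from cumulative prospect theory, w(p) = p^γ / (p^γ + (1 − p)^γ)^(1/γ), and compares it with linear weighting. This is an illustration under assumptions: the text above does not specify a functional form, the value γ = 0.61 is the estimate commonly reported for gains, and the function name is hypothetical.

```python
def tk_weight(p, gamma=0.61):
    """One-parameter probability weighting function (assumed form, see lead-in)."""
    numerator = p ** gamma
    return numerator / (numerator + (1.0 - p) ** gamma) ** (1.0 / gamma)


# Aberration of a linear prediction: weighted probability minus stated probability.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(p, round(tk_weight(p) - p, 3))
```

With this form, small probabilities are over-weighted and large probabilities are under-weighted, so predicted values based on linear weighting deviate from the weighted values in the inverse-sigmoid pattern described above. Setting gamma to 1 recovers linear weighting.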

These recommendations are based on the objective of minimising estimator variance. I note that there might be other objectives when designing a calibration experiment. For example, experimental aberration constitutes an important source of measurement uncertainty (BIPM, IEC, IFCC, ILAC, IUPAC, IUPAP, ISO, OIML, 2012; Rigdon & Sarstedt, 2022; Rigdon et al., 2020). Aberration cannot be reduced by better measurement instruments. In the terminology of this article, experimental aberration places an upper limit on achievable values of retrodictive validity. This might be undesirable in itself, and as such, reducing the scale of measurement aberration might be an independent goal of calibration design, beyond reducing estimator variance.

All results are necessarily limited to the choice of scenarios analysed. To motivate these scenarios, I analytically derived distribution features that have a large impact on estimator variance, and then simulated distributions with extreme values of these features. Doing so may have missed out on some other distributions that are possibly more relevant in particular fields of experimental psychology. All simulations are openly available and can easily be expanded to encompass other situations. Furthermore, the present analysis is focused on independently randomised (rather than balanced) experimental treatments. This is a plausible experimental approach for between-subjects studies, but the case of balanced experimental treatments is particularly important for within-subjects designs. Future work will explore this experimental design in a purely numerical approach.

As a general limitation of the calibration approach, and similar to the concept of construct validity (Cronbach & Meehl, 1955), it does not allow empirical separation of trueness (bias) and precision of the measurement (Bach et al., 2020). Trueness of the measurement refers to systematic deviation from the true scores. This can occur, for example, due to unknown non-linearities in the mapping from true scores to measured scores (see for illustration e.g., Dunn & Anderson, 2022), or because one is measuring not the latent variable of interest but a different, somewhat correlated variable. Precision, on the other hand, relates to variability in the measurement under constant true scores. Although these can be distinguished in theory – just as one can distinguish systematic aberration (bias) from imprecision aberration – retrodictive validity collapses both into one accuracy metric. There are two approaches to distinguishing them (Bach et al., 2020). The weaker one is to analyse pairs of values of S, in which case the influence of trueness will likely be reduced compared to a larger set of values of S. The second is to additionally evaluate reliability (Brandmaier et al., 2018), from which precision can be computed under justifiable assumptions (McDonald, 2013). Note, however, that the design of parallel tests in the context of experimental measurement can be non-trivial, as discussed previously (see supplemental information in Bach et al. (2020)). Future work will address this experimental limitation and hopefully contribute to the derivation of uncertainty metrics in experimental measurement (Rigdon & Sarstedt, 2022; Rigdon et al., 2020).

To summarise, estimator variance in calibration experiments can be reduced by using uniformly distributed, preferably discrete, experimental treatments; by reducing inverse sigmoid experimental aberration; and in many but not all circumstances by reducing imprecision aberration. If resources are limited, then the most predictable route to reducing estimator variance, under all experimental circumstances, remains increasing sample size.

Acknowledgements

The author thanks Drs. Jules Brochard, Federico Mancinelli, and Filip Melinscak, for discussion and insightful comments on a first draft of this manuscript. DRB receives funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. ERC-2018 CoG-816564 ActionContraThreat), and from the National Institute for Health Research (NIHR) UCLH Biomedical Research Centre. The Wellcome Centre for Human Neuroimaging is funded by core funding from the Wellcome (203147/Z/16/Z).

Appendix.

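The following theorem is proven in Steiger and Hakstian (1982). Let $\rho$ denote a population (bivariate) correlation coefficient, and $r$ the corresponding sample correlation coefficient.

Theorem 1

For a sample with size $N$, let $n = N - 1$. For two sample correlation coefficients with arbitrary parent distributions, the joint distribution of $\sqrt{n}\,(r_{ij} - \rho_{ij})$ and $\sqrt{n}\,(r_{kh} - \rho_{kh})$ converges to a bivariate normal distribution with covariance matrix $\begin{pmatrix} \gamma_{ij,ij} & \gamma_{ij,kh} \\ \gamma_{ij,kh} & \gamma_{kh,kh} \end{pmatrix}$ and

$$\gamma_{ij,kh} = \rho_{ijkh} + \tfrac{1}{4}\rho_{ij}\rho_{kh}\left(\rho_{iikk} + \rho_{jjkk} + \rho_{iihh} + \rho_{jjhh}\right) - \tfrac{1}{2}\rho_{ij}\left(\rho_{iikh} + \rho_{jjkh}\right) - \tfrac{1}{2}\rho_{kh}\left(\rho_{ijkk} + \rho_{ijhh}\right)$$

with

$$\begin{aligned}
\sigma_{ij} &= \mathrm{Cov}(X_i, X_j) \\
\sigma_{ijkh} &= E\left[(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_h - \mu_h)\right] \\
\rho_{ij} &= \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} \\
\rho_{ijkh} &= \frac{\sigma_{ijkh}}{\sqrt{\sigma_{ii}\sigma_{jj}\sigma_{kk}\sigma_{hh}}}.
\end{aligned}$$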

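For the variance, this theorem reduces to:

$$\begin{aligned}
\gamma_{ij,ij} &= \rho_{ijij} + \tfrac{1}{4}\rho_{ij}^2\left(\rho_{iiii} + 2\rho_{iijj} + \rho_{jjjj}\right) - \rho_{ij}\left(\rho_{iiij} + \rho_{ijjj}\right) && \text{(A.1)} \\
&= \frac{\sigma_{iijj}}{\sigma_{ii}\sigma_{jj}} + \tfrac{1}{4}\rho_{ij}^2\left(\frac{\sigma_{iiii}}{\sigma_{ii}^2} + \frac{2\sigma_{iijj}}{\sigma_{ii}\sigma_{jj}} + \frac{\sigma_{jjjj}}{\sigma_{jj}^2}\right) - \rho_{ij}\left(\frac{\sigma_{iiij}}{\sqrt{\sigma_{ii}^3\sigma_{jj}}} + \frac{\sigma_{ijjj}}{\sqrt{\sigma_{ii}\sigma_{jj}^3}}\right). && \text{(A.2)}
\end{aligned}$$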
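
As an illustration of how this variance expression can be used in practice, the sketch below computes a plug-in estimate of $\gamma_{ij,ij}$ by replacing the population moments with sample moments. This is an assumption-laden example rather than part of the published analysis code: the function name and the toy data are hypothetical, and the approximation Var(r) ≈ γ/n relies on the asymptotic result of Theorem 1.

```python
import math


def asymptotic_r_variance(x, y):
    """Plug-in estimate of gamma_{ij,ij} from (A.1), using sample central moments."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]

    def moment(p, q):
        # Sample central cross-moment E[(X - mu_x)^p (Y - mu_y)^q].
        return sum((a ** p) * (b ** q) for a, b in zip(dx, dy)) / n

    s_xx, s_yy, s_xy = moment(2, 0), moment(0, 2), moment(1, 1)
    rho = s_xy / math.sqrt(s_xx * s_yy)
    rho_iijj = moment(2, 2) / (s_xx * s_yy)
    rho_iiii = moment(4, 0) / s_xx ** 2
    rho_jjjj = moment(0, 4) / s_yy ** 2
    rho_iiij = moment(3, 1) / math.sqrt(s_xx ** 3 * s_yy)
    rho_ijjj = moment(1, 3) / math.sqrt(s_xx * s_yy ** 3)
    return (rho_iijj
            + 0.25 * rho ** 2 * (rho_iiii + 2.0 * rho_iijj + rho_jjjj)
            - rho * (rho_iiij + rho_ijjj))


# Toy data (hypothetical): predicted values and noisy measured values.
predicted = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
measured = [-1.2, -1.4, -0.1, 0.3, 0.2, 1.3, 1.1, 2.4]
gamma_hat = asymptotic_r_variance(predicted, measured)
print(gamma_hat, gamma_hat / (len(predicted) - 1))  # gamma and approximate Var(r)
```

A simple sanity check is the deterministic case: if the measured values are an exact increasing linear function of the predicted values, the correlation is 1 and the expression evaluates to zero.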

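For sample retrodictive validity $r_{SY} = \mathrm{Cor}(s, y)$, we have:

$$\begin{aligned}
\lim_{n\to\infty} n\,V(r_{SY}) = \gamma_{ij,ij} ={}& \frac{E\left[(SY)^2\right]}{V(S)V(Y)} + \tfrac{1}{4}\rho_{SY}^2\left(\frac{E(S^4)}{V^2(S)} + \frac{2E\left[(SY)^2\right]}{V(S)V(Y)} + \frac{E(Y^4)}{V^2(Y)}\right) \\
&- \rho_{SY}\left(\frac{E(S^3Y)}{\sqrt{V^3(S)V(Y)}} + \frac{E(SY^3)}{\sqrt{V(S)V^3(Y)}}\right) && \text{(A.3)} \\
={}& \frac{E\left[(SY)^2\right]}{V(Y)} + \tfrac{1}{4}\rho_{SY}^2\left(E(S^4) + \frac{2E\left[(SY)^2\right]}{V(Y)} + \frac{E(Y^4)}{V^2(Y)}\right) \\
&- \rho_{SY}\left(\frac{E(S^3Y)}{\sqrt{V(Y)}} + \frac{E(SY^3)}{\sqrt{V^3(Y)}}\right) && \text{(A.4)}
\end{aligned}$$

where the second equality used $E(S) = E(Y) = 0$ and $V(S) = 1$.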

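Now we expand the expectations involving $Y = S + \omega + \varepsilon$:

$$\begin{aligned}
E\left[(SY)^2\right] &= E\left[\left(S(S + \omega + \varepsilon)\right)^2\right] && \text{(A.5)} \\
&= E\left[\left(S^2 + S\omega + S\varepsilon\right)^2\right] && \text{(A.6)} \\
&= E\left[S^4 + (S\omega)^2 + (S\varepsilon)^2 + 2S^3\omega + 2S^3\varepsilon + 2S^2\omega\varepsilon\right] && \text{(A.7)} \\
E\left(S^3Y\right) &= E\left[S^3(S + \omega + \varepsilon)\right] && \text{(A.8)} \\
&= E\left[S^4 + S^3\omega + S^3\varepsilon\right] && \text{(A.9)} \\
E\left(SY^3\right) &= E\left[S(S + \omega + \varepsilon)^3\right] && \text{(A.10)} \\
&= E\left[S^4 + S\omega^3 + S\varepsilon^3 + 3S^3\omega + 3S^3\varepsilon + 3S^2\omega^2 + 3S\omega^2\varepsilon + 3S^2\varepsilon^2 + 3S\omega\varepsilon^2 + 6S^2\omega\varepsilon\right] && \text{(A.11)} \\
E\left(Y^4\right) &= E\left[(S + \omega + \varepsilon)^4\right] && \text{(A.12)} \\
&= E\big[S^4 + \omega^4 + \varepsilon^4 + 4S^3\omega + 4S^3\varepsilon + 4S\omega^3 + 4S\varepsilon^3 + 4\omega^3\varepsilon + 4\omega\varepsilon^3 \\
&\qquad + 6S^2\omega^2 + 6S^2\varepsilon^2 + 6\omega^2\varepsilon^2 + 12S^2\omega\varepsilon + 12S\omega^2\varepsilon + 12S\omega\varepsilon^2\big] && \text{(A.13)}
\end{aligned}$$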

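Taking all of this together yields

$$\begin{aligned}
\lim_{n\to\infty} n\,V(r_{SY}) ={}& \frac{E\left[S^4 + (S\omega)^2 + (S\varepsilon)^2 + 2S^3\omega + 2S^3\varepsilon + 2S^2\omega\varepsilon\right]}{V(Y)} \\
&+ \tfrac{1}{4}\rho_{SY}^2\Bigg(E(S^4) + \frac{2E\left[S^4 + (S\omega)^2 + (S\varepsilon)^2 + 2S^3\omega + 2S^3\varepsilon + 2S^2\omega\varepsilon\right]}{V(Y)} \\
&\qquad + \frac{E\left[S^4 + \omega^4 + \varepsilon^4 + 4S^3\omega + 4S^3\varepsilon + 4S\omega^3 + 4S\varepsilon^3 + 4\omega^3\varepsilon + 4\omega\varepsilon^3\right]}{V^2(Y)} \\
&\qquad + \frac{E\left[6S^2\omega^2 + 6S^2\varepsilon^2 + 6\omega^2\varepsilon^2 + 12S^2\omega\varepsilon + 12S\omega^2\varepsilon + 12S\omega\varepsilon^2\right]}{V^2(Y)}\Bigg) \\
&- \rho_{SY}\Bigg(\frac{E\left[S^4 + S^3\omega + S^3\varepsilon\right]}{\sqrt{V(Y)}} \\
&\qquad + \frac{E\left[S^4 + S\omega^3 + S\varepsilon^3 + 3S^3\omega + 3S^3\varepsilon + 3S^2\omega^2\right]}{\sqrt{V^3(Y)}} \\
&\qquad + \frac{E\left[3S\omega^2\varepsilon + 3S^2\varepsilon^2 + 3S\omega\varepsilon^2 + 6S^2\omega\varepsilon\right]}{\sqrt{V^3(Y)}}\Bigg). && \text{(A.14)}
\end{aligned}$$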

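Simplifying, by substituting $1/V(Y) = \rho_{SY}^2$ throughout, yields

$$\begin{aligned}
\lim_{n\to\infty} n\,V(r_{SY}) ={}& \rho_{SY}^2 E\left[S^4 + (S\omega)^2 + (S\varepsilon)^2 + 2S^3\omega + 2S^3\varepsilon + 2S^2\omega\varepsilon\right] \\
&+ \tfrac{1}{4}\rho_{SY}^2\Big(E(S^4) + 2\rho_{SY}^2 E\left[S^4 + (S\omega)^2 + (S\varepsilon)^2 + 2S^3\omega + 2S^3\varepsilon + 2S^2\omega\varepsilon\right] \\
&\qquad + \rho_{SY}^4 E\left[S^4 + \omega^4 + \varepsilon^4 + 4S^3\omega + 4S^3\varepsilon + 4S\omega^3 + 4S\varepsilon^3 + 4\omega^3\varepsilon + 4\omega\varepsilon^3\right] \\
&\qquad + \rho_{SY}^4 E\left[6S^2\omega^2 + 6S^2\varepsilon^2 + 6\omega^2\varepsilon^2 + 12S^2\omega\varepsilon + 12S\omega^2\varepsilon + 12S\omega\varepsilon^2\right]\Big) \\
&- \rho_{SY}\Big(\rho_{SY} E\left[S^4 + S^3\omega + S^3\varepsilon\right] \\
&\qquad + \rho_{SY}^3 E\left[S^4 + S\omega^3 + S\varepsilon^3 + 3S^3\omega + 3S^3\varepsilon + 3S^2\omega^2\right] \\
&\qquad + \rho_{SY}^3 E\left[3S\omega^2\varepsilon + 3S^2\varepsilon^2 + 3S\omega\varepsilon^2 + 6S^2\omega\varepsilon\right]\Big). && \text{(A.15)}
\end{aligned}$$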

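Now we collapse all terms involving $\varepsilon$, as they are outside the experimenter’s control, depend on the particular measurement model and thus cannot be determined a priori. This yields

$$\begin{aligned}
\lim_{n\to\infty} n\,V(r_{SY}) ={}& \rho_{SY}^2 E\left[S^4 + (S\omega)^2 + 2S^3\omega\right] \\
&+ \tfrac{1}{4}\rho_{SY}^2\Big(E(S^4) + 2\rho_{SY}^2 E\left[S^4 + (S\omega)^2 + 2S^3\omega\right] \\
&\qquad + \rho_{SY}^4 E\left[S^4 + \omega^4 + 4S^3\omega + 4S\omega^3\right] \\
&\qquad + \rho_{SY}^4 E\left[6S^2\omega^2\right]\Big) \\
&- \rho_{SY}\Big(\rho_{SY} E\left[S^4 + S^3\omega\right] \\
&\qquad + \rho_{SY}^3 E\left[S^4 + S\omega^3 + 3S^3\omega + 3S^2\omega^2\right]\Big) \\
&+ h_{\varepsilon}(\varepsilon). && \text{(A.16)}
\end{aligned}$$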

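Re-arranging gives:

$$\begin{aligned}
\lim_{n\to\infty} n\,V(r_{SY}) ={}& E\left(S^4\right)\tfrac{1}{4}\left(\rho_{SY}^2 - 2\rho_{SY}^4 + \rho_{SY}^6\right) \\
&+ E\left(\omega^4\right)\tfrac{1}{4}\rho_{SY}^6 \\
&+ E\left[(S\omega)^2\right]\left(\rho_{SY}^2 - \tfrac{5}{2}\rho_{SY}^4 + \tfrac{3}{2}\rho_{SY}^6\right) \\
&+ E\left(S^3\omega\right)\left(\rho_{SY}^2 - 2\rho_{SY}^4 + \rho_{SY}^6\right) \\
&+ E\left(S\omega^3\right)\left(-\rho_{SY}^4 + \rho_{SY}^6\right) \\
&+ h_{\varepsilon}(\varepsilon). && \text{(A.17)}
\end{aligned}$$

We note that if $\varepsilon = 0$ then $h_{\varepsilon}(\varepsilon) = 0$. It follows that the sum of the other terms is non-negative.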

Data availability

All simulations and code are publicly available on OSF: https://osf.io/dfg9e/.

References

  1. AERA, Joint Committee on the Standards for Educational, Psychological Testing of the American Educational Research Association, the American Psychological Association and the National Council on Measurement in Education. American Educational Research Association; Washington, DC: 2014. The standards for educational and psychological testing.
  2. Bach D.R., Melinscak F. Psychophysiological modelling and the measurement of fear conditioning. Behaviour Research and Therapy. 2020;127 doi: 10.1016/j.brat.2020.103576.
  3. Bach D.R., Melinscak F., Fleming S.M., Voelkle M.C. Calibrating the experimental measurement of psychological attributes. Nature Human Behaviour. 2020;4(12):1229–1235. doi: 10.1038/s41562-020-00976-8.
  4. Bach D.R., Rigdon E., Sarstedt M. 2023. Calibration experiments: an experimentalist alternative to multi-method approaches for measurement validation.
  5. Bach D.R., Sporrer J., Abend R., Beckers T., Dunsmoor J.E., Fullana M.A., et al. Consensus design of a calibration experiment for human fear conditioning. Neuroscience & Biobehavioral Reviews. 2023;148 doi: 10.1016/j.neubiorev.2023.105146.
  6. BIPM, IEC, IFCC, ILAC, IUPAC, IUPAP, ISO, OIML D.R. The international vocabulary of metrology—basic and general concepts and associated terms (VIM) JCGM. 2012;200:2012.
  7. Borgstede M., Eggert F. Squaring the circle: From latent variables to theory-based measurement. Theory & Psychology. 2023;33(1):118–137. doi: 10.1177/09593543221127985.
  8. Brandmaier A.M., Wenger E., Bodammer N.C., Kuhn S., Raz N., Lindenberger U. Assessing reliability in neuroimaging research through intra-class effect decomposition (ICED). Elife. 2018;7 doi: 10.7554/eLife.35718.
  9. Cronbach L.J., Meehl P.E. Construct validity in psychological tests. Psychological Bulletin. 1955;52(4):281–302. doi: 10.1037/h0040957.
  10. Dunn J.C., Anderson L.M. 2022. The monotonic linear model: Testing for removable interactions.
  11. Hedge C., Powell G., Sumner P. The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods. 2018;50(3):1166–1186. doi: 10.3758/s13428-017-0935-1.
  12. McDonald R.P. Psychology Press; 2013. Test theory: A unified treatment.
  13. Phillips S.D., Estler W.T., Doiron T., Eberhardt K.R., Levenson M.S. A careful consideration of the calibration concept. Journal of Research of the National Institute of Standards and Technology. 2001;106(2):371–379. doi: 10.6028/jres.106.014.
  14. Rigdon E.E., Sarstedt M. In: Measurement in marketing. Baumgartner H., Weijters B., editors. vol. 19. Emerald Publishing Limited; 2022. Accounting for uncertainty in the measurement of unobservable marketing phenomena; pp. 53–73. (Review of marketing research).
  15. Rigdon E.E., Sarstedt M., Becker J.M. Quantify uncertainty in behavioral research. Nature Human Behaviour. 2020;4(4):329–331. doi: 10.1038/s41562-019-0806-0.
  16. Steiger J.H., Hakstian A.R. The asymptotic distribution of elements of a correlation matrix: Theory and application. British Journal of Mathematical and Statistical Psychology. 1982;35(2):208–215.
  17. Zumbo B.D., Chan E.K. Validity and validation in social, behavioral, and health sciences. Springer; 2014. Reflections on validation practices in the social, behavioral, and health sciences; pp. 321–327.
