Table 1.
| Challenge | Possible solution | Concrete example with the ASRT |
|---|---|---|
| Not all forms of reliability can be meaningfully evaluated in all contexts | Determine the appropriate forms of reliability | Interference and offline consolidation effects make test–retest reliability unfeasible for the ASRT; rely on internal consistency and split-half reliability instead (see the split-half sketch after this table) |
| Multiple performance metrics can be calculated from the same task, and their reliabilities cannot be assumed to be equivalent | Estimate reliability for each metric separately | Accuracy- and RT-based learning scores have distinct reliability profiles, with RT-based learning scores being somewhat more reliable; learning scores calculated using two-stage averaging are generally more reliable; triplet-based learning scores are more reliable than pattern–random trial difference scores |
| Different pre-processing choices regarding splitting can lead to distinct reliability estimates | Investigate the robustness of reliability estimation to splitting choices, e.g., by varying the unit of splitting and carrying out trial resampling (see the random-split sketch below) | Splitting by sequences instead of trials leads to lower reliability with more variance, but possibly less bias |
| Task length influences reliability estimation, with longer tasks yielding higher reliability estimates; this needs to be taken into account when interpreting published reliability estimates and designing studies | Determine how reliability estimates scale with increasing task length (see the Spearman-Brown prophecy sketch below) | The threshold for 'minimally acceptable' reliability of .65 is met at a task length of around 25 blocks |
| Sample size influences reliability estimation, with larger samples yielding more precise reliability estimates; this needs to be taken into account when interpreting published reliability estimates | Determine how the precision of reliability estimates scales with increasing sample size (see the bootstrap sketch below) | Marginal gains in the precision of reliability estimates drop off noticeably around 50 subjects |
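To make the split-half approach concrete, the following is a minimal Python sketch, assuming a hypothetical trial-level data frame with `subject`, `block`, `trial`, `rt`, and `triplet` columns; the column names and the `learning_score`/`split_half_reliability` helpers are illustrative, not the ASRT toolchain. It computes a triplet-based RT learning score with two-stage averaging and an odd/even-trial split-half coefficient with the Spearman-Brown correction.

```python
import numpy as np
import pandas as pd

def learning_score(trials: pd.DataFrame) -> float:
    """Triplet-based RT learning score with two-stage averaging:
    RTs are first averaged within each block and triplet type, then
    the per-block low-minus-high-probability difference is averaged
    across blocks. Assumed columns: 'block', 'rt', and 'triplet'
    (coded 'high' vs 'low' probability)."""
    per_block = trials.groupby(['block', 'triplet'])['rt'].mean().unstack()
    return float((per_block['low'] - per_block['high']).mean())

def split_half_reliability(df: pd.DataFrame) -> float:
    """Odd/even-trial split-half reliability of the learning score,
    with the Spearman-Brown correction for halved test length."""
    halves = []
    for parity in (0, 1):
        half = df[df['trial'] % 2 == parity]
        halves.append(half.groupby('subject').apply(learning_score))
    r = np.corrcoef(halves[0], halves[1])[0, 1]
    return 2 * r / (1 + r)  # Spearman-Brown step-up for two halves
```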
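The robustness check in the third row can be sketched by repeating the split at random many times and at different units. The function below, again a sketch reusing the hypothetical `learning_score` helper from above, returns the distribution of corrected split-half estimates under random splits, with the splitting unit ('trial' vs, e.g., 'sequence') as a parameter, so the trial- and sequence-based distributions can be compared directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

def random_split_half(df: pd.DataFrame, unit: str = 'trial',
                      n_splits: int = 1000) -> np.ndarray:
    """Distribution of Spearman-Brown-corrected split-half estimates
    under repeated random splits, splitting at the level given by
    `unit` (a column of `df`, e.g. 'trial' or 'sequence')."""
    estimates = np.empty(n_splits)
    units = df[unit].unique()
    for i in range(n_splits):
        # draw half of the units at random into one half of the task
        half_a = rng.choice(units, size=len(units) // 2, replace=False)
        in_a = df[unit].isin(half_a)
        s1 = df[in_a].groupby('subject').apply(learning_score)
        s2 = df[~in_a].groupby('subject').apply(learning_score)
        r = np.corrcoef(s1, s2)[0, 1]
        estimates[i] = 2 * r / (1 + r)
    return estimates

# e.g., compare the mean and spread of random_split_half(df, 'trial')
# against random_split_half(df, 'sequence')
```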
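The task-length row follows from the Spearman-Brown prophecy formula, r_k = k·r / (1 + (k − 1)·r), which predicts reliability when a task is lengthened by a factor of k. Below is a minimal sketch; the numbers in the usage line are hypothetical, chosen only to illustrate how an "around 25 blocks" type of estimate can be derived.

```python
def prophecy(r: float, k: float) -> float:
    """Spearman-Brown prophecy: predicted reliability of a task
    lengthened by a factor of k, r_k = k*r / (1 + (k - 1)*r)."""
    return k * r / (1 + (k - 1) * r)

def blocks_needed(r: float, n_blocks: int, target: float = 0.65) -> float:
    """Task length (in blocks) at which the predicted reliability
    reaches `target`, given reliability `r` observed with `n_blocks`."""
    k = (target * (1 - r)) / (r * (1 - target))  # prophecy solved for k
    return k * n_blocks

# hypothetical example: if 20 blocks yielded r = .60, the .65
# threshold would be predicted at roughly 25 blocks
print(round(blocks_needed(0.60, 20)))  # -> 25
```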
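For the sample-size row, one way to gauge precision is to resample subjects at varying n and track the width of the bootstrap confidence interval of the reliability estimate. This is a sketch of that idea, not necessarily the procedure used in the original analyses; it assumes per-subject half scores `s1` and `s2` (as numpy arrays) from one of the splits above.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def bootstrap_ci_width(s1: np.ndarray, s2: np.ndarray, n: int,
                       n_boot: int = 2000) -> float:
    """Width of the 95% bootstrap CI of the Spearman-Brown-corrected
    split-half estimate when `n` subjects are resampled (with
    replacement), as a proxy for precision at that sample size."""
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(s1), size=n)  # resample n subjects
        r = np.corrcoef(s1[idx], s2[idx])[0, 1]
        stats[b] = 2 * r / (1 + r)
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return hi - lo

# e.g., plotting bootstrap_ci_width(s1, s2, n) for n in
# (20, 35, 50, 75, 100) shows where the marginal gains flatten
```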