2022 Sep 13;11:e78717. doi: 10.7554/eLife.78717

Table 1. Definitions of key terms (A) and data specifications applied across analyses (B).

(A)
Term			Definition
Internal consistency			In our study, internal consistency refers to the reliability of conditioned responding within experimental phases at each time point. It provides information on the extent to which items – or, in our case, trials – measure the same construct (e.g., fear acquisition). Odd and even trials were split (i.e., split-half method), averaged per subject, and correlated across the sample.
Longitudinal reliability at the individual level			Longitudinal reliability at the individual level indicates the extent to which responses within the same individuals are stable over time. It takes into account the individual responses of participants, which are then related across time points. Longitudinal reliability at the individual level inherently includes the group level, as it is calculated for the sample as a whole, but the individual responses are central to the calculation.
Intraclass correlation coefficients (ICCs)			‘ICC coefficients quantify the extent to which multiple measurements for each individual (within individuals) are statistically similar enough to discriminate between individuals’ (Aldridge et al., 2017). Here, we calculated two types of ICCs, namely absolute agreement and consistency. To illustrate the difference between absolute agreement and consistency with a short example (Koo and Li, 2016), consider an interrater reliability study with two raters: consistency indicates the extent to which the score of one rater (y) is equal to the score of another rater (x) plus a systematic error (c) (i.e., y = x + c). In contrast, absolute agreement indicates the degree to which y equals x. As ‘two raters’ can be replaced by ‘two time points’ and individual responses were taken into account here, we used ICCs to determine longitudinal reliability at the individual level.
Within- and between-subject similarity			Similarity analyses provide information on the extent to which trial-by-trial responses of one individual at one time point are comparable to responses of the same individual at a later time point (i.e., within-subject similarity) and all other individuals at a later time point (i.e., between-subject similarity). Comparisons of within- and between-subject similarity were used here to determine longitudinal reliability at the individual level.
Overlap at the individual level (applied for BOLD fMRI only)			Overlap at the individual level reflects the degree of overlap of significant voxels between both time points for single subject-level activation patterns.
Longitudinal reliability at the group level			Longitudinal reliability at the group level indicates the degree to which responses within the group as a whole are stable over time. More precisely, longitudinal reliability at the group level relies on first averaging all individuals' responses for each trial (for SCR) or voxel (for fMRI), yielding a group average for each trial/voxel. These are then related across time points; that is, the calculation is carried out using the trial-wise (for SCR) or voxel-wise (for fMRI) group averages.
Overlap at the group level (applied for BOLD fMRI only)			Overlap at the group level reflects the degree of overlap of significant voxels between both time points for aggregated group-level activations.
(B)
	Measure	Internal consistency	Longitudinal reliability at the individual level			Longitudinal reliability at the group level	Cross-phases predictability
			ICCs	Within- and between-subject similarity	Overlap	Overlap (BOLD fMRI) or R squared (SCR)
Included time points	All	T0 and T1 separately	T0 and T1	T0 and T1	T0 and T1	T0 and T1	T0
Included stimuli	SCR	CS+, CS−, CS discrimination, US	CS+, CS−, CS discrimination, US^*	CS+, CS−, CS discrimination, US	–	CS+, CS−, CS discrimination, US	CS+, CS−, CS discrimination
	Fear ratings	–	CS+, CS−, CS discrimination, US^*	–	–	–	CS+, CS−, CS discrimination
	BOLD fMRI	–	CS discrimination^†	CS discrimination^†	CS discrimination^†	CS discrimination^†	CS+, CS−, CS discrimination
Phase operationalizations	SCR	Entire phases (ACQ, EXT, RI-Test; except first trials of ACQ and EXT)	CS+, CS−, and CS discrimination: average ACQ, last two trials ACQ^‡, first trial EXT^§, average EXT, last two trials EXT^‡^¶, first trial RI-Test^§ US: average RI	Average ACQ^**, average EXT	–	Average ACQ, average EXT	Average ACQ, last two trials ACQ^‡, first trial EXT^§, average EXT, last two trials EXT^‡ ^¶, first trial RI-Test^§
	Fear ratings	–	CS+, CS−, and CS discrimination: post–pre ACQ, post ACQ, pre EXT, pre–post EXT, post EXT, first trial RI-Test US: post RI-Test	–	–	–	post–pre ACQ, post ACQ, pre EXT, pre–post EXT, post EXT, first trial RI-Test
	BOLD fMRI^††	–	Average ACQ, average EXT	Average ACQ, average EXT	Average ACQ, average EXT	Average ACQ, average EXT	Average ACQ, average EXT
Transformations ^{‡ ‡}	SCR	None, log-transformation^{§ §}, log-transformation and range correction^{¶ ¶}	None, log-transformation^{§ §}, log-transformation and range correction^{¶ ¶}	None^***	–	None, log-transformation^{§ §}, log-transformation and range correction^{¶ ¶}	None, log-transformation^{§ §}, log-transformation and range correction^{¶ ¶}
	Fear ratings	–	None	–	–	–	None
	BOLD fMRI	–	None	None	None	None	None
Ordinal ranking ^†††	SCR	No ranking	No ranking^{‡ ‡ ‡}	No ranking	–	No ranking	No ranking and ordinal ranking ^{§ § §}
	Fear ratings	–	No ranking^{‡ ‡ ‡}	–	–	–	No ranking and ordinal ranking
	BOLD fMRI	–	No ranking	No ranking	No ranking	No ranking	No ranking

The specifications we used here are illustrative and are not intended to cover all possible data specifications. Note that internal consistency, within- and between-subject similarity, and reliability at the group level could not be calculated for fear ratings due to the limited number of trials. ACQ = acquisition training, EXT = extinction training, RI = reinstatement, RI-Test = reinstatement-test.
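The split-half procedure defined under internal consistency in (A) can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the data layout (one list of trial responses per subject), the function names, and the example values are assumptions, and no further correction is applied to the resulting correlation.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_consistency(trials_by_subject):
    """Correlate per-subject odd-trial means with even-trial means.

    trials_by_subject: one list of trial responses per subject
    (hypothetical layout chosen for this sketch).
    """
    # Trials are counted from 1, so index 0 holds the first (odd) trial.
    odd_means = [mean(t[0::2]) for t in trials_by_subject]
    even_means = [mean(t[1::2]) for t in trials_by_subject]
    return pearson(odd_means, even_means)


# Made-up responses for four subjects with six trials each.
scr = [
    [0.9, 1.0, 0.8, 0.9, 0.7, 0.8],
    [0.2, 0.3, 0.2, 0.1, 0.2, 0.2],
    [0.5, 0.4, 0.6, 0.5, 0.4, 0.5],
    [1.4, 1.2, 1.3, 1.5, 1.1, 1.2],
]
r = split_half_consistency(scr)
```

With only two or three trials per subject, each half-mean rests on a single observation, which is one way to see why internal consistency could not be calculated for fear ratings.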

^*

Non-pre-registered ICCs for SCRs to the USs and US aversiveness ratings were calculated as we considered these informative.

^†

For BOLD fMRI, ICCs were calculated only for CS discrimination and not for CS+ and CS− because the calculations are based on first-level T contrast maps and contrasts against baseline are not optimal.

^‡

In addition to the averaged acquisition and extinction training performance, the last two SCR trials of acquisition (pre-registered) and extinction training (not pre-registered) were separated from the previous trials and averaged as an equivalent of the post-acquisition/-extinction ratings. The first extinction trial was taken into account separately as fear recall.

^§

Fear recall and reinstatement-test were operationalized as the first extinction training trial and the first reinstatement-test trial, respectively (the latter because the reinstatement effect fades relatively quickly; Haaker et al., 2014).

^¶

The operationalization of extinction training as the last two trials was not pre-registered and was included for completeness. In cases where phase operationalizations included more than one SCR trial, trials were averaged.

^**

Note that reliability at a group level for SCRs during reinstatement-test was not calculated as correlations between two SCR data points are not meaningful.

^††

fMRI data for the reinstatement-test were not analyzed in the current study since data from a single trial do not provide sufficient power.

^{‡ ‡}

The pre-registered transformation types were identified as data transformations typically employed in the literature, for example by Sjouwerman et al., 2022, who also pre-registered these transformation types.

^{§ §}

Raw SCR amplitudes were log-transformed by taking the natural logarithm to normalize the distribution (Levine and Dunlap, 1982).

^{¶ ¶}

Log-transformed SCR amplitudes were range corrected by dividing each individual SCR trial by the maximum SCR trial across all CS and US trials. Due to potentially different response ranges, the maximum SCR trial was determined separately for experimental days as recommended by Lonsdorf et al., 2017a. Range correction was recommended to control for interindividual variability (Lykken, 1972; Lykken and Venables, 1971).
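The two SCR transformations described in these footnotes can be sketched as below. This is a hedged illustration, not the study's code: the function names, the per-day dictionary layout, and the example amplitudes are hypothetical, and the `+ 1` offset inside the logarithm is an assumption added here so that zero amplitudes remain defined (the footnote only states that the natural logarithm was taken).

```python
from math import log


def log_transform(amplitudes, offset=1.0):
    """Natural-log transform of raw amplitudes.

    The offset is an assumption of this sketch (keeps log(0) defined).
    """
    return [log(a + offset) for a in amplitudes]


def range_correct(values):
    """Divide every trial by the largest trial of the same subject and day."""
    peak = max(values)
    if peak <= 0:
        # No positive response on this day: leave values unscaled
        # (an illustrative choice, not taken from the source).
        return list(values)
    return [v / peak for v in values]


# Hypothetical raw SCR amplitudes of one subject, CS and US trials pooled,
# stored separately for each experimental day.
raw_by_day = {
    "T0": [0.0, 0.4, 1.2, 2.0],
    "T1": [0.1, 0.3, 0.5, 0.0],
}

# The maximum is determined per day, so each day is corrected on its own.
corrected_by_day = {
    day: range_correct(log_transform(trials))
    for day, trials in raw_by_day.items()
}
```

After range correction, the largest trial of each day equals 1, so responses are expressed relative to that subject's own maximum on that day.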

^***

We also carried out similarity analyses for log-transformed as well as for log-transformed and range corrected data. However, results were almost identical to the results from the raw data. For reasons of space, we only report results for raw data.
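The comparison of within- and between-subject similarity defined in (A) can be sketched as follows. The similarity measure is not specified in this table, so Pearson correlation of trial-by-trial responses is used here purely as an assumption; the function names and the example values are likewise hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def within_between_similarity(t0, t1):
    """Return per-subject within- and between-subject similarity.

    t0, t1: one list of trial-by-trial responses per subject, same order.
    """
    n = len(t0)
    # Same individual at T0 and T1.
    within = [pearson(t0[i], t1[i]) for i in range(n)]
    # One individual at T0 against every other individual at T1, averaged.
    between = [
        mean(pearson(t0[i], t1[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    return within, between


# Made-up trial-by-trial responses for three subjects at both time points.
t0 = [[1.0, 0.8, 0.5, 0.2], [0.2, 0.5, 0.8, 1.0], [0.5, 1.0, 0.4, 0.9]]
t1 = [[0.9, 0.7, 0.6, 0.1], [0.1, 0.6, 0.7, 0.9], [0.6, 0.9, 0.3, 1.0]]
within, between = within_between_similarity(t0, t1)
```

In this toy data set, each subject's T0 pattern resembles their own T1 pattern more than those of the other subjects, which is the pattern that would indicate longitudinal reliability at the individual level.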

^†††

Ranking of the data was included, as pre-registered, to investigate the degree to which individuals occupy the same ranks at both time points or, put differently, whether the quality of predictions changes when the predictions are based not on the absolute values but on a coarser scale.
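As an illustration of ordinal ranking, the sketch below converts the responses of a sample into ranks at each time point. The function name and example values are hypothetical, and the handling of ties (tied values share their average rank) is an assumption, since the footnote does not state how ties were treated.

```python
def ordinal_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Average of the 1-based positions pos + 1 .. end + 1.
        shared = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


# Made-up responses of four individuals at both time points.
t0 = [0.2, 1.5, 0.7, 0.7]
t1 = [0.4, 3.0, 1.0, 0.9]
ranks_t0 = ordinal_ranks(t0)  # [1.0, 4.0, 2.5, 2.5]
ranks_t1 = ordinal_ranks(t1)  # [1.0, 4.0, 3.0, 2.0]
```

The second individual's response doubles from T0 to T1, yet that individual holds the highest rank at both time points, which shows how ranking discards absolute magnitude and keeps only relative position.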

^{‡ ‡ ‡}

Contrary to what was pre-registered, we included only non-ranked data in the ICC analyses: closer inspection of the conceptualization of ICC_con revealed that calculating both ICC_abs and ICC_con with ranked and non-ranked data would be redundant, as ICC_con itself ranks the data.
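The difference between ICC_con and ICC_abs can be made concrete with the y = x + c example from (A). The sketch below uses the single-measure, two-way ANOVA forms that are commonly written as ICC(C,1) and ICC(A,1); whether these are the exact ICC forms used in the study is an assumption, and the function name and data are hypothetical.

```python
from statistics import mean


def icc_two_way(scores):
    """Single-measure ICCs from a subjects x time-points table.

    Returns (consistency, absolute agreement), both computed from
    two-way ANOVA mean squares.
    """
    n = len(scores)       # subjects
    k = len(scores[0])    # time points (or raters)
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    # Absolute agreement additionally penalizes systematic shifts between columns.
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return consistency, agreement


# Made-up example: every T1 score equals the T0 score plus a constant c = 2.
t0 = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [[x, x + 2.0] for x in t0]
icc_con, icc_abs = icc_two_way(scores)
```

For this constant shift, consistency is 1 because the ordering and spacing of individuals are fully preserved, whereas absolute agreement drops to 5/9 because the shift c counts as disagreement.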

^{§ § §}

Ranks of SCRs were built upon raw, log-transformed as well as log-transformed and range corrected values.