Test-retest reliability of the human functional connectome over consecutive days: identifying highly reliable portions and assessing the impact of methodological choices

Leonardo Tozzi; Scott L Fleming; Zachary D Taylor; Cooper D Raterink; Leanne M Williams

doi:10.1162/netn_a_00148

2020 Sep 1;4(3):925–945. doi: 10.1162/netn_a_00148

PMCID: PMC7888485 PMID: 33615097

Abstract

Countless studies have advanced our understanding of the human brain and its organization by using functional magnetic resonance imaging (fMRI) to derive network representations of human brain function. However, we do not know to what extent these “functional connectomes” are reliable over time. In a large public sample of healthy participants (N = 833) scanned on two consecutive days, we assessed the test-retest reliability of fMRI functional connectivity and the consequences on reliability of three common sources of variation in analysis workflows: atlas choice, global signal regression, and thresholding. By adopting the intraclass correlation coefficient as a metric, we demonstrate that only a small portion of the functional connectome is characterized by good (6–8%) to excellent (0.08–0.14%) reliability. Connectivity between prefrontal, parietal, and temporal areas is especially reliable, but also average connectivity within known networks has good reliability. In general, while unreliable edges are weak, reliable edges are not necessarily strong. Methodologically, reliability of edges varies between atlases, global signal regression decreases reliability for networks and most edges (but increases it for some), and thresholding based on connection strength reduces reliability. Focusing on the reliable portion of the connectome could help quantify brain trait-like features and investigate individual differences using functional neuroimaging.

Keywords: fMRI, Functional connectivity, Resting state, Reliability

Author Summary

We quantified the reliability of fMRI functional connectivity in a large sample of healthy participants (N = 833) scanned over two consecutive days. We also assessed the consequences on reliability of atlas choice, global signal regression and thresholding. Only a small portion of the functional connectome has good (6–8%) to excellent (0.08–0.14%) reliability. Connectivity between prefrontal, parietal and temporal areas is especially reliable and average connectivity within known networks has good reliability. While unreliable edges are generally weak, reliable edges are not necessarily strong. Reliability of edges varies between atlases. Global signal regression decreases reliability for networks and most edges (but increases it for some). Thresholding based on connection strength reduces reliability. Focusing on reliable portions of the connectome could help investigate individual differences using functional neuroimaging.

INTRODUCTION

The human brain is an extraordinarily complex network comprising one hundred billion neurons, each connected to an average of 7,000 other neurons. This yields between 100 trillion and 1 quadrillion synapses, depending on a person’s age (Drachman, 2005). Current research in neuroscience suggests that it is the architecture and dynamic interactions of neurons that give rise to complex phenomena, such as cognition and emotion (Bassett & Sporns, 2017; Lindquist et al., 2012; Mill et al., 2017). This has been called the “functional connectome,” and over the past three decades, several studies have characterized it in vivo (see, for example, Van Essen et al., 2013). There is consensus that we need to map the human brain connectome in order to move forward our fundamental understanding of the human brain and its organization. However, we do not yet know to what extent the functional connectome is stable over time for an individual. In the present work, our aim is to explore the short-term reliability (in the order of days) of functional connectomes. We believe this is a fundamental step toward the identification of a persistent representation of brain function, which will facilitate the mapping of cognitive processes in individuals and will be critical for linking connectivity disruptions to brain disorders.

Over the past three decades, functional connectivity (FC) has become a well-established approach to measuring the functional connectome by using functional magnetic resonance imaging (fMRI). The term FC refers to synchronous distributed fluctuations in neuronal activity and is thought to represent a correlate of the dynamic interaction between neurons located in different brain areas (Lowe et al., 2000). In most studies, FC is measured by computing a Pearson correlation of the fMRI-derived blood oxygen level–dependent (BOLD) time series of a set of regions while the participant is awake and not performing any task (Lowe et al., 2016). This is what we mean by FC in the present work, but it is important to note that FC can also be computed while participants are not completely idle and by using a wide array of methods besides correlations, such as independent component analysis, analyses in the frequency domain, Bayesian models, and dynamic approaches (for review, see Gonzalez-Castillo & Bandettini, 2018; Lowe et al., 2016; Preti et al., 2017; Smitha et al., 2017).

In recent years, FC has been used to answer important questions about both healthy and disordered brain function. For example, psychiatry and neurology have turned to FC to develop new diagnostic tests, predict treatment response, and relate brain function to symptoms (Fornito et al., 2015; Sha et al., 2018). This is in answer to an urgent need for quantitative correlates of brain illnesses, and network-based approaches show great promise to this end (Williams, 2017). However, when testing candidate measures for clinical applications, it is important to consider that measures of robust group-level effects, which make up a large portion of the existent literature, might not necessarily be suited to make inferences about individuals (Hedge et al., 2018). For example, a network might show consistent FC values because of its low between-subjects variability, but this same characteristic would make it unsuitable to investigate its correlations with measures that might be highly variable between subjects (symptoms, personality, etc.). Also, one reasonable characteristic of a measure considered for clinical applications is repeatability: the same test on the same individual after a short period of time should return very similar results (Sullivan et al., 2015). In the present work, we define reliability as the combination of high between-subjects variability, coupled with low within-subject variability, a combination that is advantageous when trying to relate biomarkers to individual traits (Fleiss et al., 2003; Zuo et al., 2019b). Unfortunately, several of the measurements commonly collected by researchers to relate brain function to behavior have poor reliability (Hedge et al., 2018), which might explain their weak relationship to trait-like features (see, for example, Eisenberg et al., 2019). 
It follows that for any study investigating individual differences using neuroimaging, ensuring the reliability of measurements is of paramount importance, even outside of the realm of clinical applications (Xing & Zuo, 2018; Zuo et al., 2019a, 2019b).

To date, only a relatively small number of studies have examined in detail the reliability of the functional connectome, as summarized by two recent papers (Noble et al., 2019; Zuo & Xing, 2014). There is consensus that connectomes tend to become more reliable the longer the duration of the scan (Anderson et al., 2011; Birn et al., 2013; Elliott et al., 2019; Noble et al., 2019; Termenon et al., 2016). However, which and how many functional connections can be measured reliably is unclear. Meta-analytical evidence shows that, on average, FC reliability is poor (Noble et al., 2019). Some studies, however, report that functional connections have “fair” reliability on average (as defined by Cicchetti, 1994), and others report “good” or “excellent” reliability for large (>25%) portions of the functional connectome, in particular of well-characterized functional networks such as the default mode, fronto-parietal, and dorsal attention networks (Birn et al., 2013; Elliott et al., 2019; Guo et al., 2012; Zuo & Xing, 2014). Reliability of global and local graph metrics derived from functional connectomes has also been shown to be only in the “fair” range, although it can still be considered statistically significant (Termenon et al., 2016). One reason for these discrepancies might be that most studies used only relatively small samples (N ≤ 50) and had long intervals between scanning sessions (days or months). Sample size has been found to affect reliability estimates (Termenon et al., 2016), and shorter time intervals are better suited for measuring test-retest reliability, since they minimize the variability introduced by ancillary factors (Sullivan et al., 2015). Also, several choices are routinely made when computing functional connectomes, and it is unclear how these might affect their reliability.
Examining all such possible choices is beyond the scope of the current work, but we focus here on three procedures for which a clear guideline in the community is lacking. The first is the set of regions (atlas) used to compute FC. Choosing atlases with a higher number of regions has been shown to increase reliability (Termenon et al., 2016), but it is unclear whether a consensual pattern of reliable connections exists across atlases. Second, global signal regression (GSR) is often used as a denoising procedure but potentially affects FC and has been suggested to decrease reliability (Elliott et al., 2019; Murphy & Fox, 2017; Power et al., 2018). Third, to compute graph metrics based on the functional connectome that require sparsity, weaker connections are usually removed based on either an absolute or a relative threshold, that is, edges below a certain value or edges in the bottom nth percentile. Different graph metrics appear to have highest reliability at different thresholds (Termenon et al., 2016). It is, however, unclear whether edge strength and reliability are related, how thresholding affects the reliability of edges, and even whether the same edges are retained when functional connectomes from the same individual are thresholded independently.

In the present work, we explore short-term test-retest reliability of functional connectomes in a very large sample of healthy individuals scanned on two consecutive days. In particular, we leverage the entire Human Connectome Project (HCP) Healthy Young Adult data release. This dataset makes use of cutting-edge acquisition and preprocessing techniques, has very long resting-state sessions (30 min), is publicly available, and is a widely used gold standard for transparent and reproducible methods testing. In these ideal conditions, first, we assess reliability of the edges and known networks that make up the functional connectome. Then, we examine the impact on reliability of atlas choice, GSR, and thresholding.

MATERIALS AND METHODS

All scripts to reproduce our analyses and plots are available on GitHub at https://github.com/leotozzi88/reliability_study. The data processing flow is shown in Supporting Information Figure S1, along with the name of the scripts in this repository corresponding to each analysis step.

Dataset

Our sample is derived from the HCP Healthy Young Adult release, a large public dataset of 1,200 subjects aged between 22 and 35 years without any psychiatric or neurological disorder (Van Essen et al., 2013). The acquisition parameters and minimal preprocessing of these data are described in Glasser et al. (2013). Briefly, participants underwent a large number of MRI scans, which included T1- and T2-weighted structural imaging, diffusion tensor imaging, and nearly 2 hours of resting-state and task multiband fMRI. For the present study, we used 1 hour of resting-state fMRI collected on each participant during four 15-min scans (1,200 time points each, two runs acquired with RL phase encoding and two with LR) split into two scanning sessions over two days.

To select our sample, we accessed the data at https://db.humanconnectome.org. Using the online filtering options, we selected only participants who had completed the full resting-state scanning protocol and had no known quality issues. This returned a total of 860 subjects, each with four resting-state fMRI runs. For these, we downloaded the framewise relative root-mean-square realignment estimates (RMS) and fMRI data denoised using ICA-FIX (Salimi-Khorshidi et al., 2014). All analyses were conducted in greyordinate space, that is, they were constrained to the gray matter by using files in the CIFTI format, thus taking full advantage of HCP preprocessing and minimizing nonneuronal signal (Glasser et al., 2013).

Connectivity Matrix Construction

Each dense denoised timeseries resting-state file was parcellated using Connectome Workbench (wb_command -cifti-parcellate) to obtain the mean timeseries in each atlas region (see below). All further analyses were conducted in MATLAB R2018a (9.4.0.949201) for Mac (The MathWorks, Inc.). For each subject, parcellated timeseries as well as framewise RMS were loaded. Then, GSR was performed (see below). A high-pass filter (0.008-Hz cut-off) was applied to the timeseries. High frequencies were retained to avoid excessive loss of degrees of freedom due to the very low TR (Bright et al., 2017). Volumes with RMS > 0.30 were flagged as containing motion and were removed from the timeseries (Power et al., 2014). We also excluded any subject for whom the volumes flagged for motion exceeded 15% in any of the four resting-state runs. This step resulted in a final sample size of 833. The two runs within each of the two sessions were then concatenated, and the Pearson correlation between all pairs of timeseries was used to obtain a connectivity matrix for each session. At the end of this procedure, each subject had six matrices (three atlases, with and without GSR) for each of the two sessions, for a total of 9,996 connectivity matrices.
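As an illustration only (this is not the authors' MATLAB code), the motion scrubbing, concatenation, and correlation steps described above could be sketched in Python roughly as follows. The function name session_connectivity and its arguments are hypothetical, the sketch assumes the timeseries have already been parcellated and filtered, and it applies the 15% exclusion rule to the runs of one session rather than to all four runs at once.

```python
import numpy as np

def session_connectivity(runs, rms, rms_threshold=0.30, max_flagged=0.15):
    """Sketch: build one session's connectivity matrix from its runs.

    runs: list of (timepoints x regions) arrays, one per run (hypothetical input).
    rms:  list of 1-D framewise RMS arrays, one per run.
    Returns a (regions x regions) Pearson correlation matrix, or None if
    any run has more than `max_flagged` of its volumes flagged for motion.
    """
    kept = []
    for ts, motion in zip(runs, rms):
        flagged = motion > rms_threshold       # volumes considered to contain motion
        if flagged.mean() > max_flagged:
            return None                        # this subject would be excluded
        kept.append(ts[~flagged])              # drop the flagged volumes
    concatenated = np.vstack(kept)             # concatenate runs along time
    # np.corrcoef treats rows as variables, so pass regions x timepoints.
    return np.corrcoef(concatenated.T)
```

Returning None for an excluded subject is a design choice of this sketch; a real pipeline would more likely record the exclusion explicitly.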

Connectivity of Established Resting-State Networks

From each connectivity matrix, the average FC within each of 12 established resting-state networks was computed using the labels of the Gordon Atlas (Gordon et al., 2016). These resting-state networks were default mode, parieto-occipital, fronto-parietal, salience, cingulo-opercular, medial-parietal, dorsal attention, ventral attention, visual, supplementary motor (hand), supplementary motor (mouth), and auditory. The reliability of these aggregate network statistics was then assessed.
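A minimal sketch of this aggregation step is given below. It assumes that "average FC within a network" means the mean of the correlations between all distinct pairs of regions sharing a network label, which the text does not spell out; the function name and the label format are hypothetical.

```python
import numpy as np

def within_network_fc(fc, labels):
    """Sketch: mean FC among regions that share a network label.

    fc:     (regions x regions) connectivity matrix.
    labels: one network label per region (hypothetical format).
    Networks with fewer than two regions are skipped.
    """
    labels = np.asarray(labels)
    averages = {}
    for net in np.unique(labels):
        idx = np.flatnonzero(labels == net)
        if idx.size < 2:
            continue                                   # no within-network edge exists
        block = fc[np.ix_(idx, idx)]                   # sub-matrix for this network
        upper = block[np.triu_indices(idx.size, k=1)]  # distinct pairs, no diagonal
        averages[str(net)] = float(upper.mean())
    return averages
```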

Test-Retest Reliability

Intraclass correlation coefficient (ICC) as implemented in MATLAB (https://www.mathworks.com/matlabcentral/fileexchange/22099-intraclass-correlation-coefficient-icc) was used to quantify the test-retest reliability of FC. In particular, we assessed the consistency among measurements under the fixed levels of the session factor, in line with previous work (Chen et al., 2018; Elliott et al., 2019). This measure has been named ICC ‘C-1’ in McGraw and Wong (1996) or ICC (3,1) in Shrout and Fleiss (1979).
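For reference, the single-measure consistency ICC of Shrout and Fleiss (1979) is conventionally written in terms of the mean squares of a two-way subjects-by-sessions ANOVA. The symbols below are introduced here for illustration and do not appear in the source:

```latex
\mathrm{ICC}(3,1) = \frac{MS_R - MS_E}{MS_R + (k - 1)\, MS_E}
```

Here MS_R is the between-subjects mean square, MS_E is the residual mean square, and k is the number of sessions (k = 2 in this study), so the expression reduces to (MS_R − MS_E) / (MS_R + MS_E).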

To calculate ICC for all our FC values, first, the FC values in the upper triangle of each subject’s connectivity matrix were entered as rows in two large matrices (one matrix for each session, one row per subject in each matrix). Then, the corresponding columns of these matrices were compared to obtain an ICC value. Since the number of features (and thus ICCs) was very large (from 21,945 to 61,776 depending on atlas), we report the median, minimum, and maximum ICC as well as the portion of functional connections having poor (<0.40), fair (0.40–0.60), good (0.60–0.75), or excellent (>0.75) ICC as defined by Cicchetti (1994).
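The edge-wise procedure can be sketched in Python as follows. This is a hedged re-implementation of the standard consistency ICC from ANOVA mean squares, not the MATLAB File Exchange function used by the authors. The function names are hypothetical, and the handling of values falling exactly on a cutoff is an assumption of this sketch.

```python
import numpy as np

def icc_consistency(session1, session2):
    """Sketch: ICC(3,1) / 'C-1' for every column of two (subjects x edges) arrays."""
    data = np.stack([session1, session2], axis=-1)   # subjects x edges x sessions
    n, _, k = data.shape
    grand = data.mean(axis=(0, 2), keepdims=True)    # per-edge grand mean
    subj_mean = data.mean(axis=2, keepdims=True)     # per-subject mean over sessions
    sess_mean = data.mean(axis=0, keepdims=True)     # per-session mean over subjects
    ms_rows = k * ((subj_mean - grand) ** 2).sum(axis=(0, 2)) / (n - 1)
    resid = data - subj_mean - sess_mean + grand
    ms_err = (resid ** 2).sum(axis=(0, 2)) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

def cicchetti_proportions(icc):
    """Sketch: share of edges in each reliability class (cutoffs 0.40, 0.60, 0.75)."""
    bins = np.digitize(icc, [0.40, 0.60, 0.75])      # boundary values go to the upper class
    names = ["poor", "fair", "good", "excellent"]
    return {name: float(np.mean(bins == i)) for i, name in enumerate(names)}
```

In this sketch, each row of session1 and session2 would hold the upper triangle of one subject's connectivity matrix, for example fc[np.triu_indices(fc.shape[0], k=1)].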

This procedure was conducted for connectivity matrices obtained using all atlases, with and without GSR as well as for the resting-state networks average FC.

Effects of Atlas on Reliability

For each subject and each session, we computed functional connectomes by using three atlases that are widely used within the neuroimaging community and available in CIFTI format. The first is the Brainnetome Atlas, derived from structural connectivity and FC (Fan et al., 2016). The second is the Glasser Atlas, based on the multimodal cortical parcellation of HCP participants (Glasser et al., 2016). The third is the Gordon Atlas, which is based on resting-state FC and provides labels identifying well-established resting-state networks (Gordon et al., 2016). Since none of the three atlases includes subcortical structures, these were derived from the Freesurfer segmentation (Fischl et al., 2002) and added to each CIFTI dense label file using Connectome Workbench (wb_command -cifti-create-dense-from-template). To test for a difference in ICC values across atlases, we used a Kruskal–Wallis test.

Effect of Global Signal Regression on Reliability

Immediately after loading the timeseries data in MATLAB, the mean of the grayordinate timeseries from all regions was regressed from each timeseries to produce a set of GSR-corrected timeseries (Burgess et al., 2016). Analyses then proceeded in the same way separately for GSR-corrected (GSR+) and noncorrected (GSR−) timeseries. To test for a difference in ICC values computed using GSR+ and GSR− timeseries, we used a Wilcoxon signed rank test.
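One way to sketch this regression step in Python is shown below. It assumes the global signal is the mean across the parcellated regional timeseries and that an intercept is included in the regression; the source does not state whether an intercept was used, and the function name is hypothetical.

```python
import numpy as np

def regress_global_signal(ts):
    """Sketch: remove the global signal from a (timepoints x regions) array.

    The global signal is taken as the mean across regions at each time point.
    An intercept column is included, which is an assumption of this sketch.
    """
    global_signal = ts.mean(axis=1)
    design = np.column_stack([np.ones_like(global_signal), global_signal])
    beta, *_ = np.linalg.lstsq(design, ts, rcond=None)  # one fit per region
    return ts - design @ beta                           # residuals = GSR+ timeseries
```

By construction, the residual timeseries are uncorrelated with the global signal, which is why GSR shifts the distribution of correlations and can affect reliability estimates.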

Effect of Edge Strength on Reliability

To get a measure of edge strength, we averaged the FC of all edges across the two sessions and across all subjects. Then, for each atlas, we computed a Spearman correlation between edge strength and ICC of the edge calculated as outlined above.
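A rough sketch of this comparison is given below. The rank correlation here is a simplified version that assumes no tied values; in practice a library routine with proper tie handling would be preferable. Both function names are hypothetical.

```python
import numpy as np

def edge_strength(session1, session2):
    """Sketch: mean FC per edge across subjects and across the two sessions.

    Both inputs are (subjects x edges) arrays with the same shape.
    """
    return (session1.mean(axis=0) + session2.mean(axis=0)) / 2.0

def spearman_no_ties(x, y):
    """Sketch: Spearman correlation as the Pearson correlation of ranks.

    Assumes x and y contain no tied values (ties are not averaged here).
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

With these helpers, the reported statistic would correspond to spearman_no_ties(edge_strength(s1, s2), icc), where icc holds one ICC value per edge.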

Effects of Thresholding on Reliability

To test the effects of thresholding the functional connectome on reliability estimates and if the same edges would be consistently retained across sessions, we proceeded as follows. First, we defined 20 evenly spaced threshold values from 0.05 to 1. For each value and each connectivity matrix, two new matrices were created using functions from the brain connectivity toolbox (Rubinov & Sporns, 2010). In the first matrix, all FC values below the threshold were set to 0 (absolute threshold). In the second, the proportion of strongest FC values corresponding to the threshold was retained (relative threshold). Then, for each absolute and relative threshold, we examined all edges that were retained at least in one session. We computed the ratio between the number of participants in which each edge was retained at both time points versus the ones in which it was retained at least once. This quantity, which we name “ratio of consistent edges,” or “consistency ratio” for short, goes from 0 (each time the edge is retained, it is only retained in one session) to 1 (each time the edge is retained, it is retained in both sessions). We report the median, minimum, and maximum ratio of consistent edges for each threshold and atlas as well as the proportion of poor (<0.40), fair (0.40–0.60), good (0.60–0.75), or excellent (>0.75) edges using the same cutoffs as for ICC for convenience (Cicchetti, 1994). We also tested whether the median ratio of consistent edges was correlated with the threshold value by using a Spearman correlation. Finally, for each absolute and relative threshold, we computed the ICC as described above only in subjects for which the edge was retained in both sessions, so as not to bias our calculation by the edge potentially being set to 0 because of thresholding in one session. We also computed whether median ICC values were correlated with the threshold value by using a Spearman correlation.
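The thresholding and consistency-ratio computations can be sketched as follows. This is an illustrative NumPy version and does not call the Brain Connectivity Toolbox. The function names are hypothetical, and ranking edges by signed FC for the relative threshold is an assumption of this sketch.

```python
import numpy as np

# 20 evenly spaced threshold values from 0.05 to 1, as described in the text.
thresholds = np.linspace(0.05, 1.0, 20)

def absolute_mask(fc, thr):
    """Sketch: True where an edge survives an absolute threshold (FC >= thr)."""
    return fc >= thr

def proportional_mask(fc, proportion):
    """Sketch: True for the strongest `proportion` of edges within each subject.

    fc is a (subjects x edges) array; edges are ranked by signed FC value.
    """
    n_keep = int(round(proportion * fc.shape[1]))
    ranks = np.argsort(np.argsort(-fc, axis=1), axis=1)  # rank 0 = strongest edge
    return ranks < n_keep

def consistency_ratio(kept1, kept2):
    """Sketch: per edge, subjects retaining it in both sessions / in at least one.

    Edges never retained in any subject get NaN rather than a ratio.
    """
    both = (kept1 & kept2).sum(axis=0)
    either = (kept1 | kept2).sum(axis=0)
    ratio = np.full(either.shape, np.nan)
    np.divide(both, either, out=ratio, where=either > 0)
    return ratio
```

The ICC after thresholding would then be computed, for each edge, only over the subjects for which kept1 & kept2 is True.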

Confirmation of Results in Nonrelated Subsample

Since several subjects in the Healthy Young Adult dataset share family membership and a significant portion of variance in FC is explained by genetics (Adhikari et al., 2018; Elliott et al., 2019; Reineberg et al., 2018), we reran our analyses on an unrelated subset of our dataset to confirm our results (N = 397 after accounting for data availability and motion).

Confirmation of Results Using Different ICC Intervals

Since the cutoffs reported in Cicchetti (1994) are just one of the possible ways to classify ICC values, we reran our analyses using a more fine-grained binning scheme with five instead of four classes: slight (<0.20), fair (0.20–0.40), moderate (0.40–0.60), substantial (0.60–0.80), and perfect (>0.80) (Xing & Zuo, 2018).

RESULTS

Test-Retest Reliability of the Functional Connectome

Regardless of atlas and without performing GSR (see below), the majority of edges of the functional connectome were in the “fair” reliability range (Figure 1).

Median ICC ranged from 0.41 to 0.47 depending on atlas (Table 1 and Figure 1). Given the large sample size, estimation of ICC was accurate: the width of confidence intervals for ICC in all three atlases varied between 0.04 and 0.14 and had a median of 0.10 (Supporting Information Figure S2). When examining the average ICC of all edges touching each node, the subgenual anterior cingulate cortex and inferior temporal lobe had the lowest average reliability. Average ICC was also low in areas immediately adjacent to the corpus callosum (cingulate cortex). The most reliable nodes on average were located in the superior parietal and middle temporal lobes (Figure 2).

Table 1. Reliability of edges in the functional connectome.

Atlas	GSR	Median ICC	Min ICC	Max ICC	Poor edges ratio	Fair edges ratio	Good edges ratio	Excellent edges ratio
Brainnetome	GSR−	0.4688	−0.0682	0.8145	0.2884	0.6271	0.0832	0.0014
Brainnetome	GSR+	0.3868	−0.0750	0.8713	0.5294	0.3328	0.1191	0.0187
Glasser	GSR−	0.4587	−0.0627	0.8277	0.3085	0.6124	0.0777	0.0014
Glasser	GSR+	0.3522	−0.0889	0.8643	0.6143	0.3103	0.0714	0.0040
Gordon	GSR−	0.4122	−0.0845	0.8102	0.4663	0.4755	0.0574	0.0008
Gordon	GSR+	0.3113	−0.0965	0.8453	0.6939	0.2361	0.0663	0.0037

Note. We show the median, minimum, and maximum ICC of functional connectomes computed using three different atlases, with or without global signal regression. We also show the proportion of edges having poor (ICC < 0.40), fair (ICC = 0.40–0.60), good (ICC = 0.60–0.75), or excellent (ICC > 0.75) reliability, defined in accordance to Cicchetti (1994). ICC = intraclass correlation coefficient; GSR− = no global signal regression; GSR+ = global signal regression.

Figure 2. — Average ICC of functional connections by node. On an inflated brain we show, for each node of the Brainnetome, Glasser, and Gordon atlases, the average ICC of all functional connections involving the node. ICC = intraclass correlation coefficient.

“Good” edges always represented a relatively small portion compared to the total (6–8%) but still numbered in the thousands. These connections mostly involved the inferior parietal and middle temporal lobes, but were also present in the frontal, superior parietal, and occipital lobes (Figure 3). “Excellent” edges were relatively rare and consistently less than a hundred (0.08–0.14% of total connections). These were mostly intrahemispherical and predominantly connected the prefrontal, parietal, and temporal lobes (Figure 4).

Figure 3. — Number of functional connections with “good” reliability by node. On an inflated brain we show, for each node of the Brainnetome, Glasser, and Gordon atlases, the number of functional connections involving the node with intraclass correlation coefficient = 0.60–0.75.

Figure 4. — Connections with “excellent” reliability. On a transparent brain we show the connections having “excellent” reliability (intraclass correlation coefficient > 0.75) in the Brainnetome, Glasser, and Gordon atlases. These consistently involved the superior parietal and middle temporal lobes as well as the dorsolateral prefrontal cortex. ICC = intraclass correlation coefficient.

Test-Retest Reliability of Resting-State Networks

Without GSR, average FC within established resting-state networks always had “good” reliability. The most reliable network was the parieto-occipital (ICC = 0.73), followed by the medial-parietal (ICC = 0.71) and auditory (ICC = 0.71) networks. The least reliable were the salience (ICC = 0.63), dorsal attention (ICC = 0.64), and supplementary motor (mouth) (ICC = 0.64) networks (Figure 5 and Table 2).

Figure 5. — Reliability of known resting-state networks. We show the ICC and confidence intervals for the average connectivity within known resting-state networks defined in accordance to Gordon et al. (2016). ICC = intraclass correlation coefficient; No GSR = no global signal regression; GSR = global signal regression; CI = confidence interval; DMN = default mode network; PAO = parieto-occipital; FP = fronto-parietal; SAL = salience; COP = cingulo-opercular; MEP = medial-parietal; DAN = dorsal attention network; VAN = ventral attention network; VIS = visual; SMH = supplementary motor (hand); SMM = supplementary motor (mouth); AUD = auditory.

Table 2. Reliability of known resting-state networks.

Network	ICC (GSR−)	ICC (GSR+)	Upper CI (GSR−)	Upper CI (GSR+)	Lower CI (GSR−)	Lower CI (GSR+)
DMN	0.6485	0.5900	0.6862	0.6326	0.6074	0.5439
PAO	0.7256	0.6783	0.7562	0.7133	0.6918	0.6398
FP	0.6631	0.5873	0.6995	0.6301	0.6233	0.5410
SAL	0.6252	0.4953	0.6648	0.5449	0.5820	0.4423
COP	0.6744	0.6661	0.7098	0.7022	0.6356	0.6265
MEP	0.7153	0.6960	0.7469	0.7294	0.6804	0.6593
DAN	0.6426	0.5625	0.6808	0.6072	0.6010	0.5142
VAN	0.6648	0.6454	0.7011	0.6834	0.6251	0.6040
VIS	0.6437	0.5918	0.6818	0.6342	0.6022	0.5458
SMH	0.6413	0.6050	0.6796	0.6464	0.5996	0.5601
SMM	0.6350	0.5640	0.6739	0.6086	0.5927	0.5159
AUD	0.7095	0.6664	0.7417	0.7025	0.6741	0.6269

Note. We show the ICC and confidence intervals for the average connectivity within known resting-state networks defined in accordance to Gordon et al. (2016). ICC = intraclass correlation coefficient; GSR− = no global signal regression; GSR+ = global signal regression; CI = confidence interval; DMN = default mode network; PAO = parieto-occipital; FP = fronto-parietal; SAL= salience; COP = cingulo-opercular; MEP = medial-parietal; DAN = dorsal attention network; VAN = ventral attention network; VIS = visual; SMH = supplementary motor (hand); SMM = supplementary motor (mouth); AUD = auditory.