Published in final edited form as: Nat Neurosci. 2025 Jul 1;28(8):1787–1796. doi: 10.1038/s41593-025-01990-7

Addressing artifactual bias in large, automated MRI analyses of brain development

Safia Elyounssi 1,2,*, Keiko Kunitoki 1,2,3,*, Jacqueline A Clauss 1,2,4, Eline Laurent 1,2, Kristina A Kane 1,2, Dylan E Hughes 1,2,5, Casey E Hopkinson 1,2, Oren Bazer 1,2, Rachel Freed Sussman 1,2, Alysa E Doyle 1,6, Hang Lee 7, Brenden Tervo-Clemmens 1,8, Hamdi Eryilmaz 1,2, Randy L Hirschtick 1,2, Deanna M Barch 9, Theodore D Satterthwaite 10,11,12, Kevin F Dowling 1,13, Joshua L Roffman 1,2
PMCID: PMC12912710  NIHMSID: NIHMS2136587  PMID: 40595447

Abstract

Large, population-based MRI studies of adolescents promise transformational insights into neurodevelopment and mental illness risk. However, youth MRI studies are especially susceptible to motion and other artifacts that introduce non-random noise. Following visual quality control of 11,263 T1 MRI scans obtained at age 9–10 through the Adolescent Brain Cognitive Development℠ (ABCD) Study, we uncovered bias in measurements of cortical thickness and surface area in the 55.1% of the sample with suboptimal image quality. These biases impacted analyses relating structural MRI and clinical measures, resulting in both false positive and false negative associations. Surface hole number, an automated index of topological complexity, reproducibly identified lower-quality scans with good specificity, and its inclusion as a covariate partially mitigated quality-related bias. Closer examination of high-quality scans revealed additional topological errors introduced during image preprocessing. Correction with manual edits reproducibly altered thickness measurements and strengthened age-thickness associations. We demonstrate that inadequate quality control undermines the advantages of large sample size to detect meaningful associations. These biases can be mitigated through additional automated and manual interventions.

Introduction

Magnetic resonance imaging (MRI) is widely used in clinical neuroscience research to study neuroanatomical variation in healthy individuals and those with neuropsychiatric disease1. Structural (T1-weighted) MRI scans (sMRI) provide reliable, individual-level indices of cortical thickness, surface area, and volume, and enable registration of other brain imaging data (such as functional MRI and PET) to anatomical templates that facilitate group-level analyses2. In accordance with neurodevelopmental models of mental illness, large-scale brain MRI studies of children and adolescents offer potential to elaborate neural signatures of emergent psychopathology3,4. Such insights could be harnessed in efforts to develop improved early recognition and treatment5. Accordingly, the U.S. National Institutes of Health and other funding agencies have invested heavily in longitudinal MRI studies of adolescent brain development, such as the ongoing Adolescent Brain Cognitive Development℠ (ABCD) Study6.

Recent work has underscored the need for thousands of participants in such clinical MRI studies, as effect sizes for relationships between psychopathology and MRI indices tend to be small7. Further, MRI scans of children and adolescents are particularly susceptible to artifact due to participant motion8,9. An unanswered question concerns whether large sample size – e.g., in studies involving thousands of participants – sufficiently compensates for errant sMRI measurements arising from inclusion of poorer quality images. Alternatively, smaller studies have suggested that visible motion artifact introduces not only random noise but also bias8,9, which may or may not be offset by inclusion of more participants.

A related question concerns the adequacy of automated quality control (QC) measures, applied during scan acquisition, processing, or analysis, to identify and adjust for poor quality images in large sMRI studies of children. Notably, unlike for functional MRI, head motion is less routinely quantified as part of sMRI analyses, and its effects on sMRI measurements have been less well studied – although some prior work has associated induced or measured motion with bias in sMRI estimates8,10. Image preprocessing software can provide automated QC metrics, such as the overall “pass/fail” rating in the FreeSurfer processing stream11. This metric is used by ABCD in conjunction with raw data screens and clinical (radiology) evaluations to provide an overall recommendation on whether to include images in analyses. However, routine automated QC measures have shown inconsistent sensitivity to detect artifact identified by manual (visual) QC ratings of sMRI scans in smaller studies of youth12,13.

As such, a final consideration – one especially pertinent to large-scale studies such as ABCD, which is collecting 6 sets of MRI scans over 10 years from >10,000 youth participants – is the added value of manual QC of postprocessed sMRI scans, and of the even more time- and resource-intensive process of manual cortical edits14,15. Depending on image quality, manual edits of a single scan can take a skilled technician as little as 30 minutes or as long as several days to complete. While the utility of manual edits in identifying case-control differences in pediatric sMRI studies has been questioned16–18, their importance to accurately detecting subtle, sub-diagnostic neurodevelopmental differences among youth is evident in other studies19,20.

The overarching goals of the present analyses were (1) to uncover effects of latent variation in image quality on sMRI measurements and applied clinical analyses, and (2) to assess mitigating effects of additional automated and manual QC interventions on risk for quality-related errors. These analyses relied principally on in-depth, manual QC assessments of >12,000 sMRI images obtained at Baseline (age 9–10) and Year 2 follow-up (age 11–12) from ABCD study participants, and on comparison of manual with automated QC interventions.

Results

Variable image quality in Baseline scans

Please refer to Methods for overall ABCD Study design, inclusion criteria for the present analyses, structural MRI (sMRI) image pre-processing, and derivation of manual quality control (MQC) ratings. A total of 10,295 T1 scans from Baseline (age 9–10) underwent MQC ratings (Fig. 1a). Overall appearance of the entire T1 volumes, viewed slice-by-slice in multiple planes, was rated as “1” (requiring minimal manual edits from a trained technician, n=4,630, 45.0%), “2” (requiring moderate edits, n=4,063, 39.5%), “3” (requiring substantial edits, n=1,383, 13.4%), or “4” (unusable, n=219, 2.1%) (Fig. 1b,c). We have uploaded these ratings to the NIMH Data Archive (NDA; see Data Availability).

Figure 1. Manual quality control (MQC) protocol.

Figure 1.

(A) Among 11,875 total participants at Baseline, we excluded participants with clinical findings (see Methods), broken or blank T1 images, or repeated failed FreeSurfer preprocessing. After excluding additional images found to have cysts or signal dropout, we rated 10,295 images on a 1–4 MQC scale (1: best, 4: worst). (B) Distribution of MQC ratings, stratified by site and scanner. (C) Representative examples of MQC=1–4 scans and a scan with apparent tissue loss due to segmentation error. Arrows indicate areas where manual edits are needed to correct for errant automated segmentation of the pial surface from the underlying cortex. Distribution of MQC ratings among all scans is displayed at lower right.

The distribution of MQC ratings was stable over the temporal sequence of scan evaluations (see Extended Data Fig. 1a, Table S2), and also when including 228 scans with segmentation errors (Extended Data Fig. 2, Table S1b). All but 325 scans had been designated as recommended for use, based on an automated pass-fail rating available as part of the ABCD NDA; however, these 325 fell disproportionately within higher MQC groups (comprising 0.4% of MQC=1 scans, 1.4% of MQC=2, 10.6% of MQC=3, and 48.9% of MQC=4).

Demographic, clinical, and scanner characteristics of participants stratified by MQC group are described in Table S1a. Individuals with higher quality scans tended to be slightly older and female, and demonstrated less externalizing psychopathology and total symptoms on the Child Behavior Checklist (CBCL). Scan quality also differed by scanner manufacturer; notably, the mean MQC rating for images from Philips magnets (1.34, 95% CI 1.29–1.38), which were not subject to real-time motion correction, was more favorable than those for Siemens (1.71, 95% CI 1.69–1.73) and GE (1.96, 95% CI 1.93–1.99), which did include this feature21,22 (p’s<.0001, after controlling for age, gender, and psychopathology).

Effects of image quality on cortical measurements

Automated measures of cortical thickness, surface area, and volume are commonly used to identify case-control differences or as predictors of dimensional measures (e.g., psychopathology) in psychiatric neuroimaging research2. We next determined the extent to which MQC ratings associate with variance in these measures, as determined by FreeSurfer. MQC ratings associated linearly with reduced thickness across much of the cortical mantle (Fig. 2a), with increased cortical surface area in lateral/superior and reduced surface area in medial/inferior regions (Fig. 2b), and heterogeneous effects on cortical volume (Fig. 2c). Pairwise comparisons of best quality (MQC=1) versus lower quality (MQC=2, 3, and 4) images demonstrated increasingly strong effects on each structural index as MQC ratings worsened, with moderate to strong effect sizes noted in numerous cortical regions (see also Table S3a, b, c for effects of MQC rating differences in each of the 68 cortical regions-of-interest (ROI) defined by the Desikan-Killiany Atlas). For example, comparison of cortical thickness values between MQC=1 versus MQC=2, 3, and 4 yielded a total of 39, 55, and 61 ROI (of 68), respectively, with statistically significant differences (FDR q <.05). Regions demonstrating stronger effects of poor quality control on thickness included, but also extended beyond, those identified as showing similar effects in a previous, smaller study of adolescent and adult participants (n=1,840)13, in consistent directions (e.g., increased thickness in numerous lateral ROI, decreased thickness in medial occipital and posterior cingulate cortices). Subcortical volumes also differed significantly based on MQC rating, with higher ratings generally associated with smaller volumes (Table S4).

Figure 2. Association between MQC ratings and sMRI indices, n=10,261.

Figure 2.

Maps at left show linear associations of MQC rating (1 to 4) with cortical thickness (A), surface area (B), and volume (C). Vertex-wise p-values indicate significance for effect of rating on sMRI measurements using general linear models (2-tailed, uncorrected). Maps at right contrast thickness, surface area, and volume between the highest quality images (MQC=1) and those assigned lower quality ratings. Covariates included age, gender, estimated intracranial volume (fixed effects), site, and scanner manufacturer (random effects). Of the initial 10,295 scans with MQC, 34 were excluded due to missing covariates or FreeSurfer processing errors.

We next compared the performance of other available automated QC measures, including the surface hole number (SHN), to manual (MQC) ratings (see Methods). SHN increased in tandem with MQC ratings (rho=0.59; mean SHN differed between all MQC level pairs, p≤1.02E-121). Linear associations of SHN with differences in cortical thickness, surface area, and volume (Fig. 3a,b,c) closely resembled those of MQC (Fig. 2). SHN also outperformed all other automated QC metrics in predicting MQC ratings, and its relationship to those other metrics paralleled that of MQC (Extended Data Fig. 3). Distribution of SHN values among MQC groups was stable over the temporal sequence of MQC evaluations (Extended Data Fig. 1b).

Figure 3. Effects of surface hole number (SHN) on sMRI indices, and derivation of SHN tiers in conjunction with MQC ratings, n=10,261.

Figure 3.

Associations of SHN (non-transformed) with cortical thickness (A), surface area (B), and volume (C), mapped using general linear models (vertex-wise p-values, 2-tailed, uncorrected), closely resembled those of MQC ratings (compare to Figure 2). Covariates included age, gender, estimated intracranial volume (fixed effects), site, and scanner manufacturer (random effects). Additional adjustment for SHN diminished the effect sizes of pairwise MQC contrasts for thickness (D), surface area (E), and volume (F). Markers represent effect sizes for pairwise MQC contrasts in each of 68 cortical regions-of-interest, and solid lines reflect best-fit across all 68 regions for a given pairwise MQC contrast. Note reduced slopes compared to dashed unity line. (G) Density plot of SHN values, stratified by MQC ratings. Panel (H) illustrates the overall approach for deriving SHN tiers from MQC ratings. The SHN tiers were developed to provide quality control estimates in the absence of manual ratings, and are based on optimized SHN thresholds for parsing higher versus lower manual quality scan groupings. Receiver-operating characteristic (ROC) analyses for various thresholds are shown in panels (I), (J), and (K) along with related specificity, sensitivity, and accuracy indices. For example, with an optimized SHN threshold of 29.5, 81.3% of scans with MQC=2 and higher are eliminated. This threshold was used as a breakpoint for SHN tiers A and B. Blue shaded regions indicate 95% confidence intervals. AUC: area under the curve.

We then examined whether including SHN as an additional covariate mitigated effects of variable scan quality on sMRI indices, as defined by differences in measurements between MQC=1 and MQC=2, 3, and 4 respectively (Table S3d,e,f; Fig. 3d,e,f). Depending on the specific comparison (MQC=1 vs. 2, 3, or 4), controlling for SHN reduced the effect size (Cohen’s d) of MQC-related differences in cortical thickness across ROI by 42 to 59%. Analogous reductions were seen for cortical surface area (39 to 57%), and cortical volume (16 to 62%; Table S5). Further, among 39 ROIs exhibiting differences in cortical thickness between MQC=1 and MQC=2 before covarying for SHN, 17 fell out of significance after covarying for SHN, while 1 ROI became newly significant.
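The covariate adjustment described here can be sketched in R, which the study used for its ROI-level models (see Methods). This is a minimal illustration, not the authors' released code; the data frame `roi_df` and all column names are hypothetical.

```r
# Minimal sketch of SHN adjustment in an ROI-level mixed model (lme4 is the
# package named in Methods); roi_df and all column names are hypothetical.
library(lme4)

m_unadj <- lmer(thickness ~ mqc + age + gender + icv +
                  (1 | site) + (1 | scanner), data = roi_df)
m_adj   <- lmer(thickness ~ mqc + shn + age + gender + icv +
                  (1 | site) + (1 | scanner), data = roi_df)

# Attenuation of the MQC coefficient after adding SHN indexes the
# quality-related variance absorbed by the automated metric.
fixef(m_unadj)["mqc"]
fixef(m_adj)["mqc"]
```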

We then used SHN data in concert with MQC ratings to develop and assess the reliability of a tiered, automated sMRI QC rubric to classify the quality of individual scans. This rubric assigned scans to 4 levels akin to the MQC groups, but based exclusively on automated, SHN-based thresholds, so that these ratings could be applied even in the absence of manual QC. Fig. 3g displays the distribution of SHN among MQC groups. Using receiver operating characteristic curve analyses, we derived 3 optimized SHN thresholds to isolate poorer-quality scans (Fig. 3h). The most conservative threshold eliminated scans with MQC ratings of 2 or higher, based on an SHN cutoff of 29.5 (sensitivity=0.81; Fig. 3i). The next threshold eliminated scans with MQC ratings of 3 or higher, based on a SHN cutoff of 36.5 (sensitivity=0.81; Fig. 3j). The most liberal threshold eliminated scans with MQC ratings of 4, based on an SHN cutoff of 62.5 (sensitivity=0.93, Fig. 3k).

These 3 thresholds defined 4 SHN groups (tiers A–D) that in turn showed linear associations with sMRI indices (Extended Data Fig. 4). The linear effects of SHN tiers closely approximated the linear effects of MQC groupings (Fig. 2), as well as those of continuous SHN values (Fig. 3a,b,c). Still, MQC and SHN each accounted for distinct variance in scan quality, as seen in Extended Data Fig. 5. In a sensitivity analysis, inclusion of scans with FreeSurfer segmentation errors (n=228) did not substantially alter either the distribution of SHN across MQC ratings or optimal boundaries between SHN tiers in ROC analyses (Table S6).

Replication of image quality effects in Year 2 scans

Evaluation of Year 2 scans from ABCD enabled us to test the reliability of SHN tiers derived from Baseline scans. A total of 6,941 minimally processed Year 2 T1 volumes were available through the ABCD Data Archive, after removing those that did not meet inclusion criteria for Baseline analysis; these scans were preprocessed with FreeSurfer and SHN were extracted (Extended Data Fig. 6). Of note, Year 2 scans showed better overall quality than Baseline scans, with 83.9% falling into SHN tier A, compared to 57.3% of Baseline scans. From preprocessed Year 2 scans, a subset of 1,000 scans, balanced for SHN tiers and scanner manufacturers, were semi-randomly selected for MQC ratings (see Table S7 and Methods).

Table S8 describes the performance of SHN tiers in predicting MQC ratings for the Year 2 scans. SHN again increased in tandem with MQC ratings (rho=0.58). The SHN tiers effectively filtered out scans with higher MQC ratings, with sensitivity ranging from 0.87 (for differentiating scans rated 2 and higher from those rated 1) to 1.00 (for differentiating scans rated 4 from those rated lower). Extended Data Fig. 7 shows the distribution of MQC ratings within each SHN tier. Extended Data Fig. 4b indicates the effect of SHN tiers on sMRI indices across all 6,941 Year 2 scans (most of which had not received MQC ratings); comparison to Extended Data Fig. 4a affirms that SHN tiers reproducibly tracked variance in scan quality, especially in regard to cortical thickness and surface area.

Risk for error in applied structural MRI analyses

sMRI measures are frequently explored for associations with clinical and developmental data. A primary goal of the ABCD Study is to relate imaging to clinical measurements within the same individuals over time, and to elaborate related trajectories of mental illness risk. However, given the tendency of poorer quality images to bias sMRI measurements among youth, we next examined the extent to which unaccounted variance in scan quality might affect associations between MRI and clinical indices.

As a positive control, we first considered a well-established relationship between age and cortical thickness. Most of the cortex is known to thin during adolescence, as seen in smaller but well quality-controlled samples23. As points of reference, we compared age-thickness effects in the SHN-corrected, MQC=1 sample (n=4,617, “ground truth”) to those in the full, non-corrected sample (n=10,257, “full non-QC-adjusted sample”). Significant age-thickness relationships were readily observed, even cross-sectionally between ages 9.0 and 10.9, within the full non-QC-adjusted sample (Fig. 4a) – although note the considerably smaller effect size of age, compared to that of quality control ratings, on thickness (Figure 2a). Despite these smaller effects, age-thickness effects were sufficiently robust to be detected within the smaller ground truth sample: among 68 cortical ROIs, significant (FDR q<.05) inverse associations were present in 59 regions, regardless of SHN adjustment (Table S9). Notably, though, several of these ROI did not show significant age-thickness differences in the (larger) full unadjusted sample – but then regained significance in the full sample after SHN adjustment. As such, inclusion of SHN mitigated Type II error (i.e., false negatives) that would have otherwise occurred in the full non-QC-adjusted sample, albeit for only a small number of regions.

Figure 4. Effects of variable quality control on applied analyses of sMRI data.

Figure 4.

(A) Association of age (z-transformed) with cortical thickness, without adjusting for manual quality control (MQC) rating or surface hole number (SHN). Note the substantially smaller effect size scale compared to Figure 2, which shows effects of quality control variance on sMRI measurements. (B) Age-thickness effects stratified by region of interest (ROI) and MQC inclusion threshold. Broken lines indicate best-fit lines across all ROIs for each inclusion threshold group. Note tendency toward diminished effect size (and increased risk for false negatives) with broader inclusion thresholds. (C) Association of externalizing symptoms (CBCL externalizing subscale; z-transformed) with cortical volume, without adjusting for MQC or SHN. Note the even smaller effect size compared to effects of age on thickness. (D) Externalizing symptoms-volume effects stratified by ROI and MQC inclusion threshold. Broken lines indicate best-fit lines for each inclusion threshold group. Note tendency toward inflated effect size (and increased risk for false positives) with broader inclusion thresholds. All analyses covaried for age, gender, estimated intracranial volume (fixed effects), site, and scanner manufacturer (random effects); ROI-based analyses also included family ID as a random effect.

The risk of Type II error arising from non-quality-corrected images can also be appreciated in Fig. 4b, which plots effect sizes for the age-thickness relationship across all 68 ROIs. To facilitate comparisons across MQC levels, ROIs were rank-ordered (left-to-right) by effect size among the 1-rated scans. Effect sizes generally diminished as poorer quality images were iteratively included (2s, then 3s, then 4s) in the analysis. These results echo a prior, smaller analysis (n=1,598, mean age=15.0), wherein poorer quality scans associated with blunted effects of age on cortical thickness12.

Next, we considered a more exploratory relationship between dimensional psychopathology and cortical volume. Several groups have reported inverse associations between CBCL scales and cortical volume, including using ABCD data24,25. In a recent study focused on genetic and neurodevelopmental underpinnings of psychopathology in ABCD26, among the broadband CBCL scales (total, internalizing, externalizing) we identified externalizing symptoms (CBCLext) as most strongly related to cortical volumes at Baseline, after accounting for both MQC ratings and SHN. In the full, non-QC-adjusted sample (n=10,257), CBCLext scores showed a diffuse, inverse relationship with volume across the cortical mantle (Fig. 4c), although effect sizes were smaller than for the age-thickness relationship (Fig. 4a). Within this larger sample, 43 ROIs demonstrated significant (FDR q <.05) relationships between CBCLext and volume (Fig. 4d, Table S10). However, a starkly different pattern emerged in the ground truth (MQC=1) sample (n=4,617), wherein only 3 regions demonstrated significant CBCLext-volume relationships. A sensitivity analysis indicated that exclusion of individual ROI data with cortical volume measurements ≥4 standard deviations from the mean (n=374 participants) had minimal effect on the number of significant regions, in either the full or ground truth samples (Table S11).

To understand the differences in significant ROI between the full and ground truth samples, we again used stepwise analyses that gradually adjusted the stringency of QC. These analyses implicated an interplay of QC and power considerations (Fig. 5). A critical consideration in parsing these two factors is that effect size should not increase with sample size in a well-powered analysis. However, inclusion of lower-quality images resulted in substantial inflation of CBCLext-volume effects in numerous regions. As scan quality worsened, variation in volume measurements (mean coefficient of variation (CV) across 68 ROI) increased from 0.15 at MQC=1 to 0.24 at MQC=4. This increase in variation was >4-fold greater than that of CBCLext (CV 0.22 at MQC=1, 0.24 at MQC=4), suggesting that effect size inflations were driven by scan quality rather than clinical heterogeneity. These effect size inflations were evident in some regions (for example, right middle temporal, bilateral insula, and bilateral superior frontal cortex) even after adding in only MQC=2 images.
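As a worked illustration of the CV comparison, a short R sketch, assuming a hypothetical participants-by-ROI matrix `vol` and a vector `mqc` of ratings:

```r
# Coefficient of variation (CV = SD / mean), averaged across the 68 ROIs
# within each MQC level; inputs are hypothetical.
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

mean_cv_at <- function(vol, mqc, level) {
  mean(apply(vol[mqc == level, , drop = FALSE], 2, cv))
}

# e.g., contrast mean_cv_at(vol, mqc, 1) with mean_cv_at(vol, mqc, 4)
```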

Figure 5. Effects of increasingly stringent quality control on statistical significance and effect size of externalizing symptoms-volume findings.

Figure 5.

(A) Regions were assigned to nested groups, based on whether they retained significance (p<.05, FDR correction for 68 regions) at various QC thresholds. For example, when including only the highest quality (MQC=1) scans in analyses that corrected for SHN (“gold standard” sample, n=4,617) 3 regions showed significant findings. The number of significant associations increased as QC thresholds were relaxed, such that when including all available scans without regard to MQC ratings and without SHN correction (n=10,257), 43 regions were significant. (B) These 43 regions are clustered by QC thresholds, with the most stringent QC group at the top (MQC=1, SHN corrected) and the least stringent (MQC=1 to 4, no SHN correction) at the bottom. For each region, horizontal bars indicate effect sizes based on MQC inclusion thresholds, enabling their direct comparison. Dark blue bars indicate the most conservative effect size estimates, as they only include “gold standard” scans (MQC=1 with SHN correction). Within each QC inclusion level, regions are ordered based on effect size estimates (largest to smallest) when only including these “gold standard” scans. Brackets (C) and (D) indicate several regions that fell short of significance when only using the 4,617 MQC=1 scans, and demonstrate the importance of considering effect size stability when lower quality scans are included. For some regions, as in (C), effect sizes remained stable when including lower quality scans. This pattern suggests true associations, and that using only the highest quality scans resulted in underpowered analyses (type II error). For others, as in (D), effect sizes increased substantially when lower quality scans are included – even when only including those rated as MQC=2. This pattern suggests that inclusion of lower quality scans resulted in false positives (type I error). Regions listed with an asterisk are considered true positives because MQC>1 scans included in the analysis show effect sizes within +/− 1 standard error for the MQC=1 “gold standard” effect size. All analyses covaried for age, gender, estimated intracranial volume (fixed effects), site, scanner manufacturer, and family ID (random effects).

These and other ROI showed statistically significant CBCLext-volume relationships only after MQC=2, 3, and 4 scans were included in the analysis – even after correction with SHN. Further, regions with smaller effect sizes in the ground truth sample were more likely to show inflated effect sizes – and, hence Type I error – in the full, non-adjusted sample (Fig. 4d). This result was counterintuitive, given that large sample size is often invoked to reduce risk of Type II error, through improved power to detect small but true effects.

However, for other regions, such as left superior temporal, left precentral, and bilateral postcentral cortex, effect size remained relatively stable as inclusion thresholds loosened. This pattern suggests that effect sizes in these regions were not inflated by artifact, and that failure to reach significance when using only MQC=1 scans reflected a lack of statistical power (i.e., Type II error) – even with a sample size of >4,500.

To further distinguish true positive, false positive, and false negative results among FDR-significant regions, we examined whether effect sizes from included lower quality scans fell within 1 standard error (SE) of the effect size for the gold standard (MQC=1) sample (Table S12a; see also Extended Protocol). This analysis revealed 9 additional regions (in addition to the 3 in the MQC=1 analysis) that could be considered true positives, including several that reached significance only when the lowest quality scans were included (MQC≥3). We then repeated this analysis but using SHN tiers rather than MQC ratings to stratify included scans (Table S12b). This enabled direct comparison of regions identified as being true positives following manual versus automated QC approaches. To identify true positives from analyses that used lower quality scans (i.e., SHN Tier B and higher), we compared the effect size for the lower quality scans to the SE range for the SHN Tier A analysis. Of the 12 regions identified as true positives in the MQC-based analysis, 9 were identified as such in the SHN tier-based analysis. However, 16 additional regions were also designated as true positives within the SHN tier-based analysis – whereas none of these would have been designated as such in the more rigorous, MQC-based analysis.
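The ±1 SE decision rule lends itself to a one-line check; in R, assuming hypothetical per-ROI vectors of effect sizes from the gold standard model (`beta_gold`, `se_gold`) and from the model including lower-quality scans (`beta_lowq`):

```r
# A region is flagged as a true positive when the effect size estimated with
# lower-quality scans included stays within 1 SE of the MQC=1 estimate.
true_positive <- abs(beta_lowq - beta_gold) <= se_gold
```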

Effects of manual edits on structural MRI measurements

Image reconstruction errors can influence sMRI measurements and can be exacerbated by head motion and other artifacts7,8. These errors include skull strip errors, segmentation errors, intensity normalization errors, pial surface misplacement, and topological defects. Within FreeSurfer these errors can be corrected through manual editing of voxels in brain and white matter masks, watershed thresholds, and the addition of control points15,16. Here, we examined effects of manual edits on sMRI indices among scans with relatively higher image quality, to assess whether this intervention might safely be reserved for those with MQC >2.

A total of 150 Baseline scans with MQC=1 and 30 Baseline scans with MQC=2 were randomly selected for manual edits by a trained coordinator (see Methods). Direction and effect sizes of pre-to-post edit changes across the cortical mantle are displayed in Fig. 6 (MQC=1 and 2 combined, n=180) and Extended Data Fig. 8a,b,c (MQC=1 and 2 separately), while ROI-level changes are described in Table S13a,b,c. Effects of manual edits were most pronounced for cortical thickness and volume, both of which tended to decrease. These changes reached statistical significance (FDR q <.05) for cortical thickness in 40 regions (Cohen’s d range 0.16 to 0.92), and for cortical volume in 28 regions (d range 0.18 to 0.73). Numerous regions demonstrated stronger effects of editing on cortical thickness and volume in MQC=2 scans than MQC=1 scans (e.g., bilateral parahippocampal, caudal middle frontal, and superior parietal cortices). Further, cortical volume maps revealed a strong effect of edits in the area of the superior sagittal sinus, particularly impacting superior parietal cortex (Extended Data Fig. 8d). In an applied analysis, we examined the degree to which cortical edits affected effect size for the relationship between cortical thickness and age. Across all 68 cortical ROIs, effect size slightly strengthened (became more negative) for post-edited images compared to pre-edited images (t=2.31, p=0.024, d=0.10).

Figure 6. Effects of manual edits on sMRI indices (n=180).

Figure 6.

Manual edits (e.g., A, which corrects a gray-white matter boundary segmentation error) were conducted on 150 scans with MQC=1 and 30 scans with MQC=2. Maps reflect effect sizes (Cohen’s d) of pre-to-post edit changes in (B) cortical thickness, (C) cortical surface area, and (D) cortical volume.

To put these findings in context alongside MQC rating effects on sMRI indices, Extended Data Fig. 9 maps all ROIs that showed significant (FDR, q <.05) effects of MQC, surface edits, or both, as well as their direction, among Baseline scans with MQC=1 or 2. Note that even when constrained to the best two scan quality groups, there are diffuse effects of scan quality differences across the cortex for each of the sMRI indices, and that biases related to poorer overall quality control and to subtle topological defects can induce opposing effects on sMRI measurements.

Finally, to assess reproducibility and developmental specificity of cortical edit effects, we compared ABCD results to those of a second, non-overlapping MRI cohort of 292 youths, age 8 to 18, who received MRI scans that were assessed by radiology reports as free of pathology at Massachusetts General Hospital (MGH; Table S14a, b, c). This sample was previously described in an analysis relating prenatal folic acid exposure to cortical development19. This sample differed from ABCD by its inclusion of (1) clinical rather than research participants, (2) all editable images (not just those of relatively high overall quality), (3) a mix of scanner field strengths (1.5 and 3T) as well as manufacturers, and (4) a broader age range. Despite these differences, of the 40 regions demonstrating significant effects of manual edits on thickness in ABCD, 18 again showed nominally significant (15 showed FDR-significant) effects of edits in the same direction within the MGH MRI cohort (Cohen’s d range 0.12 to 0.98). Notably, across these 18 regions, differences in pre-to-post edit mean thickness were greater at age 8–10 compared to other age groups (11–12, 13–14, and 15–17; omnibus F=8.49, p=0.0001, post hoc comparisons p’s≤.0002; Extended Data Fig. 10a). Similarly, the standard error of pre-to-post thickness changes across individuals was also greatest at age 8–10 (omnibus F=64.53, p=2.25E-17, post hoc comparisons of age 8–10 vs. other groups, p’s≤6.53E-10). Finally, the effect of edits on the relationship between age and cortical thickness differed among age groups (F=21.54, p=3.88E-12); specifically, the effect of edits on the age-thickness relationship was stronger at age 8–10 (d=−1.18) than for any other age group (p’s≤7.73E-09; Extended Data Fig. 10b).

Discussion

The present findings identify nuances related to scan quality in large pediatric brain MRI cohorts that are pervasive and complex, and that likely require multi-pronged intervention to avoid error in sMRI analyses. Leveraging one of the largest collections of uniformly collected sMRI data from children and adolescents, we used manual quality control (MQC) to separate high quality scans and contrast them with those showing various degrees of observable artifact. Inclusion of lower-quality scans introduced substantial bias in widely used sMRI metrics, such as cortical thickness and surface area. These effects were partially mitigated by inclusion of surface hole number (SHN), an automated measure of topological complexity that accounted for quality-related variance in sMRI measures. However, inclusion of SHN failed to safeguard against most Type I and II errors when poorer quality scans were included in applied analyses that associated sMRI measures with clinical data. Further, even among the highest quality scans, manual editing associated with significant changes in cortical thickness and surface area – changes that in some regions were oppositely signed to those observed when controlling for SHN or MQC, and that replicated in a non-overlapping clinical cohort. As a whole, these results challenge assumptions that large sample size alone improves sensitivity to detect valid brain-behavior relationships, or mitigates the effects of variable image quality on error risk.

Implications of these findings extend not only to studies that map trajectories of healthy and aberrant brain development, but also to applied analyses that relate structural indices to clinical measures. Comparison of effect sizes for sMRI-clinical relationships (Figure 4) to those of bias related to poor scan quality (d=0.14–2.84) or manual edits (d=0.15–0.92) – which are generally higher by an order of magnitude – demonstrates the susceptibility of these relationships to artifact. Recent analyses illustrate the need to include thousands of individuals in brain-wide association studies27,28, reflecting the small effect sizes intrinsic to these relationships. Here, inclusion of the best quality (MQC=1, n=4,617) scans was inadequate to detect relationships between cortical volume and externalizing psychopathology in several regions, effects that became statistically significant when scans of marginally lower quality (MQC=2, n=4,057) were included. However, inclusion of MQC=2 scans also resulted in some errant associations between volume and externalizing psychopathology, as demonstrated by markedly inflated effect sizes compared to MQC=1 scans in some regions. Inclusion of lower quality scans (MQC=3 or 4) resulted in even more pronounced effect inflation and additional false positives. These results illustrate complex trade-offs between sample size and scan quality that warrant careful consideration in large MRI studies, especially in the setting of small effect sizes.

Large and diverse study samples offer clear advantages such as statistical power and improved generalizability, and in the case of psychology and neuropsychiatry research, such designs help to mitigate well-described problems of publication bias and reproducibility failures27,29. However, several pitfalls within “big data” science have also been described, including inadequate control for multiple comparisons, sampling bias, measurement error, and discrepancies between statistical and clinical significance. These issues have hampered other areas of clinical-translational research, such as electronic health record, epidemiology, and health services studies30. With regard to brain imaging research, a recent study31 used theoretical data to model trade-offs of increasing sample size well into the thousands, demonstrating the risk of latent bias to outweigh the benefit of reduced variance. This concern bears out in the present real-world analysis, which cautions against equating data quantity and quality in youth sMRI studies32. The findings also have important implications for large-scale MRI studies of other populations where head motion occurs more frequently, including those with psychiatric and neurological disorders33, and those at the extremes of age34,35.

Beyond best practices to minimize participant motion36, the present findings suggest that relatively labor-intensive approaches – visual QC and manual editing – conducted in concert with automated measures such as SHN, provide the best protection against errant sMRI findings in youth cohorts. However, manual edits pose their own challenges with regard to feasibility in studies with thousands of participants. The present analyses offer several QC alternatives that may be weighed in the context of available resources, the nature of particular findings, and the characteristics of the study population. The Extended Protocol published with this report (10.5281/zenodo.14872906) offers specific guidance with respect to both time- and labor-intensive approaches (manual ratings and edits), and, for situations where these may be impractical, approaches that are based solely on SHN. In any case, investigators who choose to include lower quality scans in their analyses should give particular attention to the stability of effect sizes (i.e., compared to the subgroup with the best QC metrics), especially when significant results occur in regions that are especially susceptible to error based on their location and measurement type (thickness, area, volume; see Figure 2 and Table S3). That said, the present analyses of older adolescents (ABCD Year 2, MGH) are encouraging and suggest that less intervention may be needed with advancing participant age. As automated methods continue to gain sophistication37,38, they may continue to improve the efficiency of QC and further strengthen causal inference in neurodevelopmental MRI research.

Methods

Sample from ABCD

The ABCD study enrolled 11,875 participants, age 9 or 10 at Baseline, across 22 U.S. sites. Participant race and ethnicity mirrored those of the U.S., and enrollment was enriched for multiple births and siblings from multiple pregnancies39. Primary analyses used baseline data from children aged 9–10 years old. Institutional Review Board (IRB) approval for the ABCD study is described in Auchter et al.40. All parents provided written informed consent and all youth provided assent.

MRI acquisition

Structural MRI (sMRI) scans were obtained from participants on 3T Siemens, Philips, or GE magnets as described by Casey and colleagues36. All MRI images were obtained using harmonized parameters. We used T1-weighted images (256×256 matrix, slices=176–225, TR=6.31–2500 ms, TE=2–2.9 ms, 1×1×1 mm resolution) for our analysis. Images acquired from Siemens and GE scanners included real-time motion detection using volumetric navigators that automatically triggered re-scans21,22. Additional details of MRI sequences are described elsewhere36. Minimally processed T1 volumes were available from the NIMH Data Archive (NDA) for all but 160 participants. We excluded subjects whose baseline MRI scans were flagged for clinical consultation (N=451), and those without available T1 data (N=160) from all analyses.

Image processing

T1 images from the remaining 11,264 participants, and Year 2 follow-up T1 images from 6,941 of these participants, were downloaded from the ABCD Data Archive (release 4.0). Scans underwent N4 bias field correction to correct low frequency intensity non-uniformities41. Subsequently, whole brain processing and analyses were conducted using FreeSurfer version 7.1 (http://surfer.nmr.mgh.harvard.edu/). While several processing streams are available to process and analyze sMRI data, the present analyses used FreeSurfer software for two reasons: first, the existing, tabulated region-of-interest sMRI analyses available through the NIMH Data Archive and widely used in published ABCD analyses were conducted with FreeSurfer; and second, FreeSurfer offers manual cortical edit capabilities. One Baseline scan failed FreeSurfer processing. Using automated segmentation (Desikan-Killiany atlas), cortical thickness, surface area, and volume of 68 regions of interest (ROI) were extracted, as were 20 subcortical volumes.

Development of manual quality control (MQC) ratings

Following preliminary review of 500 randomly selected scans, two expert raters who had previously conducted cortical edits of >300 pediatric MRI scans19 (J.L.R., K.F.D.) and a third who had been extensively trained by these raters (S.E.) reached consensus that four levels of quality (“1”=best, “4”=worst) were optimal to bin scans for analysis. This scale was developed based on (1) previously published methods that describe specific artifacts frequently encountered in structural MRI scans17,18 (see also FreeSurfer wiki documentation at https://surfer.nmr.mgh.harvard.edu/fswiki); and (2) our lab’s previous experience19, which enabled us to further classify scan quality based on the estimated time that would be needed to conduct manual edits. An Extended Protocol posted at 10.5281/zenodo.14872906 provides annotated examples. Briefly, assessment of each scan proceeded as follows:

  1. Visual inspection of the full volume in all 3 planes to identify (a) areas of signal dropout (as in Extended Data Fig. 2), (b) large cysts (>1 cm3), or (c) large defects (ghosting, rings) that would result in systematic measurement error for cortical morphometry. These were coded separately (dropout, cyst, or “4”/unusable rating, respectively).

  2. Slice-by-slice assessment for smaller-scale problems that would require manual edits. These include errant inclusion of meninges or skull tissue, intensity normalization errors, gray-white matter parcellation errors, or other abrupt changes in cortical volume that are not accounted for by slicing through gyral edges (e.g., thickness reduction that appears only in one slice would not be flagged).

  3. Rating scans as follows: “1” for scans with no or only a small number of problems requiring manual edits, such that a trained technician could complete these edits in approximately 30 minutes; “2” for scans with a moderate number of problems, requiring approximately 1 to 2 hours of editing; “3” for scans with a large number of problems, requiring several hours of editing; “4” for scans with a very large number of problems that would be impossible or impractical to edit.

Implementation of MQC ratings

A single, trained rater (S.E.) conducted visual assessment of all processed Baseline scans, reviewing all available slices in multiple (axial, coronal, sagittal) views. The rater was blinded to any potential identifying, clinical, or demographic information regarding participants. The single-rater approach was chosen because previous work12 using 4 or more rating levels – which we deemed necessary to sufficiently capture variance in scan quality in ABCD – had described a ceiling of relatively poor inter-rater reliability, even through multiple attempts to improve it. We weighed the risks/benefits of (1) having a single, highly trained rater whose intra-rater reliability could be established but who could also induce systematic error (e.g., if the ratings were consistently inaccurate), versus (2) the logistical burden of having two or more raters and a tie-breaker, who would likely need to weigh in quite frequently given low inter-rater reliability. As a compromise, we chose to use a single rater for Baseline scans, but with results reinforced by several objective and complementary measures. Specifically, (1) we sought to validate manual ratings with surface hole number (SHN), which based on prior work12 provides an objective (albeit imperfect) automated measure of image quality; and (2) we sought to replicate the entire approach using a new set of scans (from Year 2) and a new set of raters.

Scans originating from N=5,105 participants of European ancestry were prioritized and randomly sequenced in order to have complete QC data for a different analysis published by our group26. However, this initial group also contained 373 randomly interspersed scans from randomly selected non-European participants. Following assessment of this initial set of 5,105 scans, the remainder were evaluated in random order. Of the evaluated scans, 368 had been coded within the ABCD NIMH Data Archive as “inclusion not recommended” based on an automated overall QC measure in the FreeSurfer preprocessing stream and/or corrupted raw data at the time of scan acquisition (imgincl_t1w_include=0); the remainder received the “inclusion recommended” code. During manual review 740 scans were removed from further consideration due to the presence of cysts >1 cm3, and 228 were omitted from the main analyses due to segmentation errors and related signal dropout that persisted after a second round of preprocessing (Fig. 1a).

Characterizing apparent tissue loss due to segmentation errors

To better characterize the 228 scans with localized signal dropout, these scans were separately rated from “1” through “4” (as above) to assess the quality of the remaining volume that was unaffected by segmentation errors. Ratings were performed by the same trained rater who assigned ratings to all Baseline scans. The sagittal, coronal, and axial extents of the dropout region were measured in Freeview version 7.1.1 (https://surfer.nmr.mgh.harvard.edu/fswiki/FreeviewGuide). Approximate volumes of segmentation error-related tissue loss were calculated assuming an ellipsoid shape and the measured x, y, and z dimensions. For purposes of displaying location and overlap of dropout across scans, rectangular cuboids were constructed in MarsBar in SPM 12 (http://www.fil.ion.ucl.ac.uk/spm/) using measured dimensions and coordinates. Rectangular cuboids were combined across subjects and were thresholded by a whole brain mask in MarsBar. Areas of dropout were thresholded at n>10 subjects with dropout in a given region, and dropout was displayed on an exemplar structural image in xjView (https://www.alivelearn.net/xjview).
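The ellipsoid approximation reduces to a single formula: with semi-axes equal to half of each measured extent, V = (4/3)π(x/2)(y/2)(z/2). A minimal R sketch (example values are ours):

```r
# Approximate dropout volume from measured sagittal (x), coronal (y), and
# axial (z) extents, assuming an ellipsoid with semi-axes of half each extent.
ellipsoid_vol <- function(x, y, z) (4 / 3) * pi * (x / 2) * (y / 2) * (z / 2)

ellipsoid_vol(20, 15, 10)  # extents in mm -> ~1570.8 mm^3
```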

Surface hole number and other automated QC metrics

We used surface hole number (SHN) as an automated quality control measure extracted from FreeSurfer aparc tabulated data. SHN reflects the Euler number, which measures the continuity of tessellated images (e.g., those that contain continuous triangular structures, as do FreeSurfer-generated maps of the cortical surface) and is computed as the number of vertices minus the number of edges plus the number of faces. Higher SHN values have predicted worse manual quality control ratings in previous, smaller MRI studies12,15 and have been proposed as an automated quality control index for use in high-throughput neuroimaging studies, outperforming other measures (such as signal-to-noise ratio and motion during functional MRI scans conducted during the same scan session)13,16. We calculated SHN for each available Baseline and Year 2 scan using FreeSurfer 7.1 and have uploaded the data to the NDA (see Data Availability). Here, SHN from Baseline scans were used to determine optimal proxies for MQC, through creation of 4 tiers (A, B, C, D) that approximated the 4 levels of MQC ratings (1, 2, 3, 4).
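A short sketch of these quantities in R, assuming per-hemisphere Euler numbers as produced during FreeSurfer reconstruction (function and variable names here are ours):

```r
# Euler characteristic of a tessellated surface: V - E + F (equals 2 for a
# topologically perfect, sphere-like hemisphere surface).
euler_characteristic <- function(n_vertices, n_edges, n_faces) {
  n_vertices - n_edges + n_faces
}

# Each hemisphere contributes (2 - Euler) / 2 holes; SHN sums the two.
surface_hole_number <- function(lh_euler, rh_euler) {
  (2 - lh_euler) / 2 + (2 - rh_euler) / 2
}

surface_hole_number(2, 2)      # defect-free surfaces -> 0 holes
surface_hole_number(-58, -40)  # -> 30 + 21 = 51 holes
```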

To obtain complementary automated image QC measures, we deployed the MRIQC package42 (version 24.0.2), which provides contrast-to-noise and signal-to-noise ratios, foreground-background energy ratio, background kurtosis, entropy-focus criterion, Mortamet’s quality index 2, and white matter standard deviation. Correlations of MQC ratings and SHN were performed with each of these measures, means of which were also compared across MQC ratings and SHN tiers.

Psychopathology measurement

We used the parent-reported Child Behavior Checklist (CBCL) as a measure of dimensional psychopathology. The CBCL is a frequently used scale comprising eight subscales (anxious/depressed, withdrawn/depressed, somatic, social, thought, attention, rulebreaking, and aggressive symptoms) that can be summarized by total, internalizing, and externalizing scores. Raw scores are converted to t-scores which are normed for age and gender43.

Year 2 T1 replication

We examined all available Year 2 T1-weighted images to assess the reliability of SHN tiers derived from Baseline scans. ABCD data release 4.0 contains Year 2 scans from 7,829 participants. Using the same method as for Baseline scans, we used FreeSurfer to process images from 6,941 individuals whose Baseline image passed the inclusion criteria and received MQC ratings of 1–4. SHN were calculated by FreeSurfer for each of these scans. In addition, 1,000 Year 2 scans were semi-randomly selected for MQC ratings, such that they contained (1) a range of scan quality, operationalized by selecting for an approximately equivalent number of scans that fell into tiers A, B, C, and D; and (2) a distribution of magnet types (Siemens, Philips, GE) that was equivalent to the analyzed Baseline sample. One scan was discarded due to presence of a large cyst. SHN tier D was slightly underrepresented, as only 168 total Year 2 scans fell within this tier; all 168 of these scans were used for the analysis. These scans were then rated for MQC in random sequence by two raters (E.L., K.A.K.) who were blind to SHN and other participant-level information. These raters had previously been trained by the rater of all Baseline scans (S.E.), such that the three raters achieved an intraclass correlation coefficient of >0.75 (two-way mixed effects model for absolute agreement) across a training set of 1,000 Baseline scans. The additional step of testing ICC was implemented because multiple raters were used to evaluate Year 2 scans, and also to further demonstrate generalizability of the protocol.

Manual cortical edits of ABCD scans

Effect sizes for manual edits on cortical thickness can be fairly substantial15. Based on data from a previously published study of pediatric patients at MGH19, we observed that manual edits of 64 MRI scans from a clinical sample of participants aged 8.0 to 11.0 resulted in cortical thickness changes of at least moderate (d≥0.5) effect size in 29 (of 68 possible) ROIs. Based on these prior data, a sample size of 180 scans would provide 92% power to observe statistically significant effects, at the same effect size (d=.5), of manual edits on cortical thickness using ABCD data. As such, a subset of 180 rated Baseline scans was randomly selected for manual editing (N=150 with MQC=1, N=30 with MQC=2), which was completed by a trained technician (S.E.). Each structural scan was loaded into Freeview version 7.1.1 with the following volumes: brainmask, wm, brain.finalsurfs.manedit, T1, and the following surfaces: rh.pial, rh.white, lh.pial, lh.white. The scans were primarily displayed in the coronal view, although sagittal and horizontal views were used as needed. Criteria for editing were primarily based on overestimation and underestimation of the pial and white matter boundaries. Edits to the white matter boundary were made directly on the wm volume using control points and the erasing tool. Edits to the pial surface were made on the brainmask volume. Errors between the pial surface and cerebellum were corrected using the brain.finalsurfs.manedit volume. Edits were considered complete when, after post-edit re-processing in FreeSurfer, only minimal errors remained, such that the generated pial and white matter boundaries more closely matched the actual boundaries on the T1 image.

Manual edits of Massachusetts General Hospital (MGH) scans

The MGH sample was included as a replication set for effects of manual editing on cortical MRI indices and to assess whether such effects change later in adolescence. Study sample, scanner characteristics, and editing methods were previously described by Eryilmaz and colleagues19. Study procedures were approved by Partners Human Research Committee, which granted a waiver of informed consent, since this retrospective study of the medical record involved only deidentified data. Briefly, clinical brain MRI scans from 292 individuals aged 8 to 17, conducted at MGH between 2005 and 2015, were selected based on date of birth, adequate scan quality on visual inspection (i.e., artifacts could reasonably be addressed with manual edits), and absence of pathology as indicated on radiology reports. Scans were edited by a trained technician (KFD) as described above. Pre-to-post edit changes in cortical thickness, volume, and surface area were measured across 68 ROIs using FreeSurfer 5.0.

Statistical analysis

Stability of MQC ratings over time

MQC ratings of Baseline scans that did not show signal dropout or cysts were divided into 10 equally sized time groups, reflecting the sequence in which scans were evaluated. Initial analyses were conducted to assess whether factors known to affect scan quality, including age, gender, scanner manufacturer, and psychopathology (CBCL), differed over time, using time period as either a categorical or continuous variable. Then, ANOVA was used to assess significant linear or quadratic changes in mean MQC rating across time groups, controlling for variation in these other factors and in their interactions with time and time-squared.

Surface-based sMRI analyses

Surface maps for group-based and within-subject analyses were generated using FreeSurfer 7.1. Images from each subject were smoothed with a 22 mm full-width at half-maximum kernel. For between-group analyses, we fit general linear models with the following covariates: age, gender, estimated intracranial volume, study site, and scanner. Continuous predictor variables were z-transformed prior to analysis. Models assessed both linear effects of MQC ratings (i.e., 1 to 4) as well as pairwise contrasts (1 vs. 2, 1 vs. 3, 1 vs. 4) on cortical thickness, surface area, and volume. Sensitivity analyses assessed linear effects of SHN on these indices, as well as effects of MQC after controlling for SHN and vice versa. Results were visualized using uncorrected significance maps (log p-value) and effect size maps (Cohen’s d) as appropriate.

ROI-based sMRI analyses

Following extraction of ROI-based data from FreeSurfer, analyses involving cortical thickness, cortical surface area, and cortical and subcortical volumes were conducted with R version 4.1.2 (https://www.R-project.org/). Mixed-effects linear regression was run with the “lme4” package version 1.1-14 unless specifically mentioned. The covariates included in the analysis were age, gender, and estimated intracranial volume (fixed effects), and site, scanner, and family ID (random effects), the latter accounting for inclusion of sibling groups. Analyses were corrected for multiple comparisons using FDR (q<.05), based on the number of included ROIs.
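A minimal sketch of this model in R; `df` and its column names are hypothetical, and lmerTest (not named in the text) is used here only as one common route to p-values from lme4-style fits:

```r
# ROI-level mixed model with fixed covariates and random intercepts for
# site, scanner, and family ID; one fit per ROI.
library(lmerTest)

fit <- lmer(
  roi_value ~ predictor + age + gender + icv +
    (1 | site) + (1 | scanner) + (1 | family_id),
  data = df
)
p_val <- summary(fit)$coefficients["predictor", "Pr(>|t|)"]

# Across the 68 ROIs: q_vals <- p.adjust(p_vals, method = "fdr")
```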

SHN tiers

We conducted receiver operating characteristic (ROC) analyses to evaluate the sensitivity of SHN to detect poorer quality scans. Analyses were conducted in R using the “pROC” package version 1.18-5. Using Baseline scan data, we contrasted SHN for three breakpoints: MQC=1 vs. 2, 3, and 4; MQC=1 and 2 vs. 3 and 4; and MQC=1, 2, and 3 vs. 4. We used the Youden Index to select an optimal threshold to discriminate higher versus lower quality scans for each of the three breakpoints. These three thresholds were used to define SHN tiers A, B, C, and D, respectively – such that scans in the A tier best represented MQC=1, those in the B tier best represented MQC=2, etc. As a sensitivity analysis, we also included MQC and SHN values for scans with segmentation-related tissue loss in the analysis, and examined whether thresholds were altered by inclusion of these scans. Then, to test reliability, we grouped all available Year 2 scans according to SHN tiers, and conducted MQC on 1,000 of these scans (described above). Sensitivity, specificity, and accuracy of the SHN tiers to distinguish MQC levels were assessed. These metrics could then be compared to those from the Baseline analysis, as well as to those from a new set of ROC analyses that determined optimal thresholds for SHN tiers in the 1,000 Year 2 scans.
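One breakpoint of this procedure can be sketched with the pROC package named above (the vectors `shn` and `mqc` are hypothetical):

```r
# ROC analysis for the MQC=1 vs. MQC>=2 breakpoint, with the Youden-optimal
# SHN threshold; repeated for the other two breakpoints to bound tiers A-D.
library(pROC)

roc_1_vs_234 <- roc(response = as.integer(mqc >= 2), predictor = shn)
coords(roc_1_vs_234, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```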

Applied analyses relating quality control to MRI-clinical associations

Linear mixed models examined associations between cortical thickness and age, and between cortical thickness and externalizing psychopathology, conditioned on the degree to which lower-quality scans were included in the analyses (e.g., inclusion of MQC=1 only, versus MQC 1 and 2; 1, 2, and 3; and 1, 2, 3, and 4). Overall surface-based and ROI methods were similar to those described above, but with age or CBCL externalizing score, rather than MQC, as the predictor of interest. Sensitivity analyses examined effects of including SHN as an additional predictor in the models, and effects of removing individual ROI outlier data for participants with cortical volumes ≥4 standard deviations from the mean.
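
For example, the age-thickness model for one ROI could be refit under progressively inclusive quality cuts along the following lines; this is an illustrative sketch with hypothetical variable names, not the exact analysis code.

    library(lme4)

    # Refit the age-thickness model as progressively lower-quality scans
    # (higher MQC ratings) are added to the sample.
    age_betas <- sapply(1:4, function(max_mqc) {
      sub <- subset(roi_data, mqc <= max_mqc)
      m <- lmer(thickness ~ age + gender + etiv +
                  (1 | site) + (1 | scanner) + (1 | family_id),
                data = sub)
      fixef(m)["age"]   # age-thickness coefficient at this quality cut
    })
    age_betas  # compare estimates across MQC inclusion thresholds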

Effects of manual edits on sMRI indices

For Baseline ABCD scans, within-subject analyses contrasting cortical thickness, surface area, and volume before versus after manual edits were conducted using general linear models in FreeSurfer (for surface maps of effect size) or paired t-tests in R (for ROI analyses). These analyses were conducted without covariates, following sensitivity analyses that demonstrated no significant effects of age, gender, scanner, or CBCL externalizing symptoms on pre-to-post edit changes in sMRI measures. ROI analyses were corrected for multiple comparisons using FDR (q<.05), based on the number of included ROIs. Analyses of MGH scans focused on cortical regions in which manual edits had shown significant effects on cortical thickness in the ABCD cohort, to test replication. Potential changes in the magnitude and variance of pre-to-post edit changes across these regions were assessed as a function of age group (8–10, 11–12, 13–14, 15–17 years) using ANOVA.
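
A minimal sketch of the paired ROI contrast follows, assuming hypothetical subjects-by-ROI matrices pre_mat and post_mat of thickness values (one column per ROI).

    # Paired pre- vs. post-edit t test per ROI, FDR-corrected across ROIs.
    p_vals <- sapply(seq_len(ncol(pre_mat)), function(j) {
      t.test(post_mat[, j], pre_mat[, j], paired = TRUE)$p.value
    })
    p_fdr <- p.adjust(p_vals, method = "fdr")
    sig_rois <- which(p_fdr < 0.05)

    # Paired-samples effect size: mean change / SD of change, per ROI.
    diffs <- post_mat - pre_mat
    d_vals <- colMeans(diffs) / apply(diffs, 2, sd)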

Extended Data

Extended Data Figure 1. Stability of manual quality control (MQC) ratings over time (n=10,295).


Scans were assigned to deciles based on the sequence in which they received MQC ratings by a single trained rater. (A) Box and whisker plots show distribution of MQC ratings for each time period, after adjusting for age, gender, scanner manufacturer, and externalizing psychopathology. Adjacent marks show unadjusted mean ratings for the same period. (B) Box and whisker plots show distribution of the log of surface hole numbers (SHN), stratified by decile and MQC rating. For both plots, box indicates median and interquartile range (IQR), whiskers indicate range of non-outlier data, circles indicate mild outliers (1.5 to 3 × IQR), and asterisks indicate extreme outliers (>3 × IQR).

Extended Data Figure 2. Signal dropout in sMRI processing (n=228).


(A) Examples of dropout regions where FreeSurfer segmentation failed and excluded a substantial portion of cortex. (B) Distribution of the approximate volume of the dropout area, estimated as ellipsoid volume, and distributions of its (C) sagittal, (D) coronal, and (E) axial extent. (F) Distribution of dropout regions overlaid on an exemplar brain, thresholded at n=10 subjects. Heat map represents the number of overlapping subjects.

Extended Data Figure 3. Comparison of manual quality control (MQC) and surface hole number (SHN) to other automated quality control metrics (QCMs) at Baseline (n=10,294) and Year 2 (n=999).


At Baseline, SHN values correlated more strongly with MQC ratings than did any other QCM (A), and SHN tiers closely approximated MQC ratings in detecting variance in other QCMs (B). The same patterns were apparent using Year 2 scans (C, D). FBER: foreground-background energy ratio; BG: background; EFC: entropy-focus criterion; QI2: Mortamet’s quality index 2.

Extended Data Figure 4. Comparison of SHN tier effects on sMRI indices at (A) Baseline (n=10,295) and (B) Year 2 (n=6,941); compare to Figure 2.


Maps at left show linear associations of SHN tier (A to D) with cortical thickness, surface area, and volume. Maps at right contrast thickness, surface area, and volume of the highest quality images (tier A) with those of images assigned to lower quality tiers. Covariates included age, gender, estimated intracranial volume (fixed effects), site, and scanner manufacturer (random effects).

Extended Data Figure 5. Unique contributions of SHN tiers versus MQC to variance in sMRI indices (n=10,295).


(A) Linear association of MQC with cortical indices after controlling for SHN tiers. (B) Linear association of SHN tiers with cortical indices after controlling for MQC. Covariates included age, gender, estimated intracranial volume (fixed effects), site, and scanner manufacturer (random effects).

Extended Data Figure 6. Included Year 2 follow-up scans.


Among 11,875 total participants at baseline, Year 2 T1 scans were available from 7,829; of these, 6,941 were eligible for processing with FreeSurfer, and 1,000 were semi-randomly selected for MQC ratings (see Methods for additional details).

Extended Data Figure 7. Relationship of surface hole number (SHN) to manual quality control (MQC) in selected Year 2 follow-up scans (n=999).


(A) Density plot of SHN values, stratified by MQC ratings. (B) Distribution of MQC ratings as related to SHN for each SHN tier.

Extended Data Figure 8. Effects of manual edits on sMRI indices, stratified by MQC rating.


Edits were conducted on 150 scans with MQC=1 and 30 scans with MQC=2. Maps reflect effect sizes of pre-to-post edit changes in (A) cortical thickness, (B) cortical surface area, and (C) cortical volume. Note the increased effects of edits in MQC=2 scans relative to MQC=1 scans. (D) Post-edit thickness reduction along the superior sagittal sinus, which is frequently misattributed to the pial surface during preprocessing.

Extended Data Figure 9. Composite maps showing location and direction of sMRI measurement errors detected by manual quality control and cortical edits, among MQC=1 and 2 scans only.


Highlighted regions show either significant differences in sMRI indices between MQC=1 and MQC=2 scans, significant effects of cortical edits, or both. Note that, when co-occurring within the same region, errors due to poor scan quality (assessed by MQC) do not necessarily occur in the same direction as errors requiring manual edits.

Extended Data Figure 10. Effects of manual edits on cortical thickness and age-thickness relationships in the MGH sample, stratified by age group (n=292).


(A) Violin plots show effect size and related variance of manual edits on cortical thickness in the MGH sample, stratified by age group. The 18 included ROIs are those that also showed significant effects of edits on cortical thickness in the ABCD cohort, in the same direction. Regions are ordered by effect size in the 8- to 10-year-old group. Means are represented by black circles. Note that effect sizes and variance diminished with age. (B) Effects of edits on the magnitude of age-thickness relationships within the MGH sample across 68 cortical ROIs, stratified by age group. Each marker shows the age-thickness effect size for a given ROI. Edits strengthened age-thickness effects (i.e., effect sizes became more negative, indicated by lower intercept of the best-fit line compared to the dashed unity line) at age 8–10, but not in other age groups.

Supplementary Material

Suppl Tables

Acknowledgments

Supported by NIH (R01MH124694, R01MH120402, T32MH112485 to J.L.R.; R01MH113550, R01MH120482, R01MH112847, R37MH125829, R01EB022572 to T.D.S; K23DA057486 to B.T.-C.), Harvard Medical School Dupont Warren Fellowship (to J.A.C.), Louis V. Gerstner Scholar Award (to J.A.C.), MQ Foundation (to J.L.R.), and the Mass General Hospital Early Brain Development Initiative (to J.L.R.).

The authors are grateful to Drs. Randy L. Buckner and Erin C. Dunn for helpful comments on the manuscript, and to Sofia Perdomo and August Blum for conducting additional statistical analysis and assisting with final manuscript preparation. We thank the investigators and staff at the ABCD sites and coordinating centers, as well as study participants and their families for their essential contributions to this work.

Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.

Footnotes

Presented in part at the Society of Biological Psychiatry 2022 Annual Meeting, New Orleans, and the Society for Neuroscience 2022 Annual Meeting, San Diego. A pre-review version of the manuscript was posted on bioRxiv (https://doi.org/10.1101/2023.02.28.530498) on March 1, 2023.

Competing interests statement

The authors declare no competing interests.

Data availability

Data from all ABCD-related analyses were downloaded from the NIMH Data Archive (NDA), version 4.0 (https://nda.nih.gov/study.html?id=1299). Derived variables, including MQC ratings and SHN, as well as region-of-interest level data for cortical thickness, surface area, and volume processed in FreeSurfer 7.1, have been uploaded to the NDA (https://nda.nih.gov/study.html?id=1944). Data from MGH analyses contain sensitive patient information that was obtained following a waiver of informed consent, and as such has not been uploaded to a publicly available repository. Please contact the corresponding author for additional information.

Code availability

R code is available on Zenodo (https://doi.org/10.5281/zenodo.14872906). Source files are available at the NIMH Data Archive (https://nda.nih.gov/study.html?id=1944).

References

1. Thompson PM et al. ENIGMA and global neuroscience: A decade of large-scale studies of the brain in health and disease across more than 40 countries. Transl Psychiatry 10, 100 (2020).
2. Mills KL & Tamnes CK. Methods and considerations for longitudinal structural brain imaging analysis across development. Dev Cogn Neurosci 9, 172–190 (2014).
3. Becht AI & Mills KL. Modeling Individual Differences in Brain Development. Biol Psychiatry 88, 63–69 (2020).
4. Dick AS et al. Meaningful associations in the adolescent brain cognitive development study. Neuroimage 239, 118262 (2021).
5. Marquand AF et al. Conceptualizing mental disorders as deviations from normative functioning. Mol Psychiatry 24, 1415–1424 (2019).
6. Karcher NR & Barch DM. The ABCD study: understanding the development of risk for mental and physical health outcomes. Neuropsychopharmacology 46, 131–142 (2021).
7. Dick DM et al. Post-GWAS in Psychiatric Genetics: A Developmental Perspective on the “Other” Next Steps. Genes Brain Behav 17, e12447 (2018).
8. Alexander-Bloch A et al. Subtle in-scanner motion biases automated measurement of brain anatomy from in vivo MRI. Hum Brain Mapp 37, 2385–2397 (2016).
9. Blumenthal JD, Zijdenbos A, Molloy E & Giedd JN. Motion artifact in magnetic resonance imaging: implications for automated analysis. Neuroimage 16, 89–92 (2002).
10. Reuter M et al. Head motion during MRI acquisition reduces gray matter volume and thickness estimates. Neuroimage 107, 107–115 (2015).
11. Dale AM, Fischl B & Sereno MI. Cortical surface-based analysis. I. Segmentation and surface reconstruction. Neuroimage 9, 179–194 (1999).
12. Rosen AFG et al. Quantitative assessment of structural image quality. Neuroimage 169, 407–418 (2018).
13. White T et al. Automated quality assessment of structural magnetic resonance images in children: Comparison with visual inspection and surface-based reconstruction. Hum Brain Mapp 39, 1218–1231 (2018).
14. Waters AB, Mace RA, Sawyer KS & Gansler DA. Identifying errors in Freesurfer automated skull stripping and the incremental utility of manual intervention. Brain Imaging Behav 13, 1281–1291 (2019).
15. Monereo-Sanchez J et al. Quality control strategies for brain MRI segmentation and parcellation: Practical approaches and recommendations - insights from the Maastricht study. Neuroimage 237, 118174 (2021).
16. Ross MC et al. Gray matter volume correlates of adolescent posttraumatic stress disorder: A comparison of manual intervention and automated segmentation in FreeSurfer. Psychiatry Res Neuroimaging 313, 111297 (2021).
17. McCarthy CS et al. A comparison of FreeSurfer-generated data with and without manual intervention. Front Neurosci 9, 379 (2015).
18. Beelen C, Phan TV, Wouters J, Ghesquiere P & Vandermosten M. Investigating the Added Value of FreeSurfer’s Manual Editing Procedure for the Study of the Reading Network in a Pediatric Population. Front Hum Neurosci 14, 143 (2020).
19. Eryilmaz H et al. Association of Prenatal Exposure to Population-Wide Folic Acid Fortification With Altered Cerebral Cortex Maturation in Youths. JAMA Psychiatry 75, 918–928 (2018).
20. Pulli EP et al. Feasibility of FreeSurfer Processing for T1-Weighted Brain Images of 5-Year-Olds: Semiautomated Protocol of FinnBrain Neuroimaging Lab. Front Neurosci 16, 874062 (2022).
21. White N et al. PROMO: Real-time prospective motion correction in MRI using image-based tracking. Magn Reson Med 63, 91–105 (2010).
22. Tisdall MD et al. Prospective motion correction with volumetric navigators (vNavs) reduces the bias and variance in brain morphometry induced by subject motion. Neuroimage 127, 11–22 (2016).
23. Ducharme S et al. Trajectories of cortical thickness maturation in normal brain development: The importance of quality control procedures. Neuroimage 125, 267–279 (2016).
24. Wainberg M, Jacobs GR, Voineskos AN & Tripathy SJ. Neurobiological, familial and genetic risk factors for dimensional psychopathology in the Adolescent Brain Cognitive Development study. Mol Psychiatry 27, 2731–2741 (2022).
25. Wang C, Hayes R, Roeder K & Jalbrzikowski M. Neurobiological Clusters Are Associated With Trajectories of Overall Psychopathology in Youth. Biol Psychiatry Cogn Neurosci Neuroimaging 8, 852–863 (2023).
26. Hughes DE et al. Genetic patterning for child psychopathology is distinct from that for adults and implicates fetal cerebellar development. Nat Neurosci 26, 959–969 (2023).
27. Marek S et al. Reproducible brain-wide association studies require thousands of individuals. Nature 603, 654–660 (2022).
28. Szucs D & Ioannidis JP. Sample size evolution in neuroimaging research: An evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. Neuroimage 221, 117164 (2020).
29. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
30. Kaplan RM, Chambers DA & Glasgow RE. Big data and large sample size: a cautionary note on the potential for bias. Clin Transl Sci 7, 342–346 (2014).
31. Bozek J, Griffanti L, Lau S & Jenkinson M. Normative models for neuroimaging markers: Impact of model selection, sample size and evaluation criteria. Neuroimage 268, 119864 (2023).
32. Sonuga-Barke EJS. Editorial: ‘Safety in numbers’? Big data discovery strategies in neurodevelopmental science - contributions and caveats. J Child Psychol Psychiatry 64, 1–3 (2023).
33. Pardoe HR, Kucharsky Hiess R & Kuzniecky R. Motion and morphometry in clinical and nonclinical populations. Neuroimage 135, 177–185 (2016).
34. Smith J et al. Can this data be saved? Techniques for high motion in resting state scans of first grade children. Dev Cogn Neurosci 58, 101178 (2022).
35. Sacca V et al. Aging effect on head motion: A Machine Learning study on resting state fMRI data. J Neurosci Methods 352, 109084 (2021).
36. Casey BJ et al. The Adolescent Brain Cognitive Development (ABCD) study: Imaging acquisition across 21 sites. Dev Cogn Neurosci 32, 43–54 (2018).
37. Backhausen LL, Herting MM, Tamnes CK & Vetter NC. Best Practices in Structural Neuroimaging of Neurodevelopmental Disorders. Neuropsychol Rev 32, 400–418 (2022).
38. Duffy BA et al. Retrospective motion artifact correction of structural MRI images using deep learning improves the quality of cortical surface reconstructions. Neuroimage 230, 117756 (2021).
39. Garavan H et al. Recruiting the ABCD sample: Design considerations and procedures. Dev Cogn Neurosci 32, 16–22 (2018).
40. Auchter AM et al. A description of the ABCD organizational structure and communication framework. Dev Cogn Neurosci 32, 8–15 (2018).
41. Tustison NJ et al. N4ITK: improved N3 bias correction. IEEE Trans Med Imaging 29, 1310–1320 (2010).
42. Esteban O et al. MRIQC: Advancing the automatic prediction of image quality in MRI from unseen sites. PLoS One 12, e0184661 (2017).
43. Achenbach TM. The Achenbach System of Empirically Based Assessment (ASEBA): Development, Findings, Theory, and Applications (University of Vermont Research Center for Children, Youth, and Families, 2009).
