Abstract
Reproducible identification of white matter pathways across subjects is essential for the study of structural connectivity of the human brain. Two key challenges are anatomical differences between subjects and human rater subjectivity in labeling. Labeling white matter regions of interest is difficult because it requires integrating both local and global information. Clearly communicating the manual processes to capture this information is cumbersome, yet essential to lay a solid foundation for comprehensive atlases. Segmentation protocols must be designed so that both the interpretation of the requested tasks and the location of structural landmarks are anatomically accurate, intuitive and reproducible. In this work, we quantified the reproducibility of a first iteration of an open/public multi-bundle segmentation protocol. This allowed us to establish a baseline for its reproducibility as well as to identify the limitations for future iterations. The protocol was evaluated on both typical 3T research acquisitions from the Baltimore Longitudinal Study of Aging (BLSA) and high-quality acquisitions from the Human Connectome Project (HCP). The results show that a rudimentary protocol can produce acceptable intra-rater and inter-rater reproducibility. However, this work highlights the difficulty in generalizing reproducible results and the importance of reaching consensus on anatomical description of white matter pathways. The protocol has been made available in open source to improve generalizability and reliability in collaboration. The goal is to improve upon the first iteration and initiate a discussion on the anatomical validity (or lack thereof) of some bundle definitions and the importance of reproducibility of tractography segmentation.
Keywords: white matter, diffusion MRI, tractography, segmentation, reproducibility
1. Introduction
1.1. Three-dimensional atlases (GM/WM)
Human brain atlases have had a transformative role in the study of neuroscience. Evolving neuroimaging, image processing, and analysis techniques are framing the way that we understand brain anatomy. Modern brain atlases (Amunts et al., 2013; Glasser et al., 2016; Hawrylycz et al., 2012) combine population atlases with multi-modal structural and functional information. These atlases provide a detailed representation of the anatomy and are a fundamental component of the field of image analysis.
However, the most commonly used modern approaches tend to treat white matter (WM) as essentially homogeneous, labeling each region as a single WM structure. Most current atlases are incomplete and do not contain detailed information about the full spatial extent of WM structures. A whole-brain WM atlas, labeled on a series of subjects using equivalent methodologies, tools and protocols, would be a fundamental resource for integrative and comprehensive modern brain atlases.
The effort to create atlases is not new. The importance of white matter atlases arises from the information that can be inferred from them. White matter connectivity stores substantial information about functional neuroanatomy (Greicius et al., 2009; Skudlarski et al., 2008; Sporns et al., 2005; van den Heuvel et al., 2008), longitudinal changes (Resnick et al., 2003), brain morphology (Lawes et al., 2008), abnormalities and their cognitive correlation (Gunning-Dixon & Raz, 2000), critical periods for neural development (Anderson et al., 2011), recovery and plasticity (Dayan & Cohen, 2011; Jiang et al., 2006), and cognitive functioning (Llinás et al., 1998). Nonetheless, existing white matter atlases are either limited to population averages derived from extremely time-consuming manual delineation, or they lack comprehensive coordinate information about white matter structures and specific details about finding anatomical landmarks. They are also often limited to a 3D representation where each position (voxel) represents only one label/class, which is an oversimplification of the true architecture of the brain (Adluru et al., 2016; Besseling et al., 2012; Mori et al., 2008; Wakana et al., 2007).
1.2. Four-dimensional atlases (WM)
We know from ex-vivo dissections that white matter pathways cross each other and merge and/or fan in a complex manner. This challenges the usage of a 3D atlas limited to one label per voxel (Hansen et al., 2020) and greatly increases the complexity of manually labeling the pathways. For this reason, manual labeling of pathways from diffusion MRI tractography using virtual dissection (Catani et al., 2002; Mori et al., 1999) is a useful approach to visualize the spatial relationship between anatomical landmarks. By using inclusion and exclusion regions of interest (ROIs) to isolate streamlines belonging to known WM pathways, a “4D” atlas can be made, where any voxel can be associated with any number of labels (Catani & de Schotten, 2008; Hansen et al., 2020; Thiebaut de Schotten et al., 2020; F.-C. Yeh et al., 2018).
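The difference between the two representations can be illustrated with a minimal sketch. The grid size and the choice of bundles are hypothetical and serve only to show how a one-label-per-voxel volume loses information where pathways cross, while a stack of per-bundle masks does not.

```python
import numpy as np

# Hypothetical toy grid and bundle list, for illustration only.
shape = (4, 4, 4)
bundles = ["CST", "SLF", "BCC"]

# "3D" atlas: a single integer label per voxel (0 = background).
atlas_3d = np.zeros(shape, dtype=np.uint8)

# "4D" atlas: one binary mask per bundle, stacked along a fourth axis.
atlas_4d = np.zeros(shape + (len(bundles),), dtype=bool)

# Two bundles crossing in the same voxel.
atlas_4d[2, 2, 2, bundles.index("CST")] = True
atlas_4d[2, 2, 2, bundles.index("SLF")] = True

# The 3D atlas can keep only one label; here the last one written wins.
atlas_3d[2, 2, 2] = bundles.index("CST") + 1
atlas_3d[2, 2, 2] = bundles.index("SLF") + 1

# Number of labels per voxel, and the names present at the crossing.
labels_per_voxel = atlas_4d.sum(axis=-1)
crossing_labels = [b for b, present in zip(bundles, atlas_4d[2, 2, 2]) if present]
```

In this toy case the 4D representation reports both bundles at the crossing voxel, whereas the 3D volume retains only the last label assigned.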
While there are expert-evaluated white matter atlases (Catani & de Schotten, 2008; Essayed et al., 2017), they lack protocols detailing how to delineate white matter structures (Chenot et al., 2019; Mandonnet et al., 2018). Additionally, most well-known 4D WM atlases were generated using diffusion tensor imaging, which has been shown to have limitations in defining crossing fiber and complex regions (Jones & Cercignani, 2010; Oouchi et al., 2007), whereas a number of more advanced techniques can successfully overcome these limitations (Dell’Acqua & Tournier, 2019; Jeurissen et al., 2014; J. Tournier et al., 2012). More often than not, WM atlases are intended to be used as a tool, not a guide to perform a delineation on a new dataset. Atlases created with the intentions of being used with automated methods often rely on manual delineations or opaque (if existing at all) anatomical definitions (Garyfallidis et al., 2018; O’Donnell & Westin, 2007; Wasserthal et al., 2018; F. C. Yeh et al., 2018; Yendiki et al., 2011; Yoo et al., 2015). The lack of explicit and detailed manual labeling protocols that are robust to human subjectivity must be addressed.
1.3. Reproducibility of protocols
Atlases are crucial tools to investigate WM pathways at the scale of a population. However, an atlas cannot be constructed without delineating subject-specific WM pathways. While WM atlases are a worthwhile end goal, we must first take a step back and investigate the basic building block of any WM atlas: subject-specific segmentation protocols. Subject-specific delineation can be useful in multiple situations: 1) pathologies and lesions can make the mapping of subjects to the atlas difficult (de Schotten et al., 2011), 2) if a specific population is studied and their anatomy is somewhat different from the population used for the atlas (Avants et al., 2010; Schmidt et al., 2018) or 3) simply to provide a framework to classify WM structure (Mandonnet et al., 2018; Panesar & Fernandez-Miranda, 2019). Two of the assumptions of WM labeling are that all labels are anatomically valid and that the same label in two datasets in fact represents the same underlying structure. Despite often being taken for granted, these assumptions can be challenged. If they were false, the usability of the atlas would be diminished, for example in clinical applications.
If anatomical validity is an important criterion in theory, reproducibility of its execution is an important criterion in practice. This work focuses solely on the second assumption: identical labels should represent the same structures. An anatomically valid and well-detailed protocol is valuable, but if it cannot be executed accurately by a rater, if each execution is too variable or if each rater interprets it differently, the utility of the protocol will be diminished or even lost completely.
1.3.1. Intra-rater
The first layer of reproducibility to be analyzed is intra-rater: Can a rater following the same instructions twice on the same datasets achieve similar results? This layer is crucial because we can hypothesize that the interpretation of the instructions should not vary within a single individual. Thus, this quantifies the variability in the execution of the instructions.
1.3.2. Inter-rater
The second layer of reproducibility to be analyzed is inter-rater: Can two (or more) raters following the same instructions on the same datasets achieve similar results? This layer will quantify the variability in interpretation of the instructions. The inter-rater reproducibility score is expected to be lower than the intra-rater score, since it encompasses both the variability within and across raters.
1.3.3. Inter-subject
This last layer is not actually due to the protocols or raters directly. However, as mentioned earlier it encompasses both the intra and inter-rater variability. Analyzing this level will provide information related to biological differences in the underlying anatomical structures. If the protocol’s variability observed is too high at the intra or inter (or both) rater stage, this measurement error will drastically reduce the ability to interpret changes observed across subjects.
This is the crucial point of this work: not only must a consensus be reached on anatomical definition, but the interpretation and execution of this consensus must be quantified in order to assess how useful a segmentation protocol is in practice.
1.4. Framework for protocol creation and evaluation
As shown in (Boccardi et al., 2011; Rheault et al., 2020; Schilling et al., 2020a), simply creating a protocol that can be followed “closely” is a difficult task. Segmentation protocols are more common in 3D medical imaging or even 2D natural images, but tractography is a much more complex 3D representation. A lot of work is necessary to reach a set of do’s and don’ts to clearly design, evaluate and use bundle segmentation protocols in clinical settings. The logic behind our investigation is fourfold: 1) Can instructions from a protocol be followed by raters? 2) How can we quantify how reproducible a protocol is? 3) Can we identify portions of the protocol that increase or decrease reproducibility? 4) What are the next steps to take to improve the quality of a protocol or to create a better protocol in the future? In this work we approach and quantify the first two and provide insights into the last two.
The framework used to obtain bundles of interest (TractEM) is a first iteration of an open/public protocol; the reproducibility analysis presented in this work should be seen as a baseline for future improvements to TractEM. This work should not be perceived as an attempt to demonstrate the superiority or inferiority of the protocol, nor as claiming to be a valid, definitive or even accepted/agreed upon set of anatomical definitions. This work is rather, in our opinion, a necessary step to generate open discussion about the anatomical definitions of white matter pathways, how to segment them using a protocol, and how to evaluate results.
2. Methods
2.1. MRI acquisition
The protocol was developed on datasets obtained from the Baltimore Longitudinal Study of Aging (BLSA) (Ferrucci et al., 2008) and Human Connectome Project (HCP) (van Essen et al., 2013).
The BLSA dataset used in this study contains 10 subjects with ages ranging between 57–77 years old. The data were acquired after written informed consent and institutional review board approval and accessed in de-identified form. Each session included a T1-weighted structural MP-RAGE (number of slices = 170, voxel size = 1.0×1.0×1.2 mm³, reconstruction matrix = 256×256, flip angle = 8 degrees and TR/TE = 6.5/3.1 ms). Diffusion data were acquired with a 3D spin-echo diffusion-weighted EPI sequence (TR/TE = 7454/75 ms). Each acquisition consisted of an initial b0 and 32 diffusion-weighted volumes all with the same b-value of 700 s/mm² (number of slices = 170, voxel size = 0.81×0.81×2.2 mm³).
The HCP dataset used in this study contains 10 healthy subjects with no known history of neuropathological and psychiatric diseases, ages ranging between 26–36 years old. Diffusion data were acquired with a 3D spin-echo diffusion-weighted EPI sequence (TR/TE = 5520/89.5 ms). Each diffusion acquisition consisted of six b0 and 90 diffusion-weighted volumes all with 3 shells of b = 1000, 2000, and 3000 s/mm² interspersed with an approximately equal number of acquisitions on each shell within each run (number of slices = 111, voxel size = 1.25×1.25×1.25 mm³). Data were pre-processed to correct the effects of gradient nonlinearities on the b-values and b-vectors for each voxel (Glasser et al., 2013).
Two different datasets were chosen to provide insight into the protocol’s sensitivity to input data (SNR, spatial and angular resolution, reconstruction methods, etc.) and its robustness to age-related changes in WM structures, and to observe how these variables affect reproducibility.
2.2. MRI preprocessing
Susceptibility correction (Andersson et al., 2003), eddy current correction techniques (Andersson & Sotiropoulos, 2016), and b0 signal normalization were applied to the diffusion data as a preprocessing step.
T1 weighted images were co-registered to b0. DWI and T1w were moved to Talairach space (Jenkinson & Smith, 2001) using an affine transformation from ICBM152-space and the technique from (Lancaster et al., 2007). The affine transformation was applied to the DWI and T1w, and the appropriate corresponding adjustment was made to the b-vectors. DTI metrics such as FA and RGB were computed using DSI-Studio (http://dsi-studio.labsolver.org).
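The text does not detail how the b-vectors were adjusted. One common approach, sketched here as an assumption rather than as the procedure actually used, is to extract the orthogonal (rotational) part of the affine with a polar decomposition and apply only that part to the gradient directions, since scaling and translation should not change a direction. The function name and the example affine are hypothetical.

```python
import numpy as np

def rotate_bvecs(bvecs, affine):
    """Apply the orthogonal part of a 4x4 affine to an (N, 3) array of b-vectors."""
    linear = np.asarray(affine, dtype=float)[:3, :3]
    # Polar decomposition via SVD: linear = R @ P, where R = U @ Vt is the
    # closest orthogonal matrix and P carries the scaling/shearing.
    u, _, vt = np.linalg.svd(linear)
    rotation = u @ vt
    rotated = np.asarray(bvecs, dtype=float) @ rotation.T
    # Re-normalize; b0 volumes (zero vectors) are left as zeros.
    norms = np.linalg.norm(rotated, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return rotated / norms

# Hypothetical affine: 90 degree rotation about z, isotropic scaling of 2,
# and a translation (which does not affect directions).
example_affine = np.array([
    [0.0, -2.0, 0.0, 5.0],
    [2.0, 0.0, 0.0, -3.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])
example_bvecs = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
adjusted = rotate_bvecs(example_bvecs, example_affine)
```

With this example, the x-aligned gradient becomes y-aligned, the z-aligned gradient is unchanged, and the b0 entry stays at zero.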
To estimate the orientation distribution function, two high angular resolution diffusion imaging (HARDI) models were selected for their ability to characterize multiple directions per voxel and to better resolve crossing fibers (Descoteaux et al., 2007; Tuch & others, 2002). For the BLSA dataset, we fit a 4th order spherical harmonic Q-ball imaging (QBI) model. QBI uses a single shell (single b-value) to estimate the diffusion ODF (dODF), which represents the fiber orientation distribution. DSI-Studio adopts the spherical harmonics based QBI reconstruction method. For multi-shell HCP data, generalized Q-sampling imaging (GQI) reconstruction was used (F. C. Yeh et al., 2010). In addition to being capable of resolving crossing fibers, GQI supports multi-shell (multi-b-value) data and works best with fairly dense sampling of directions to similarly estimate the dODF.
2.3. Study design
For the intra-rater analysis, 5 raters were recruited to perform segmentation of all bundles on the same dataset, twice. For association and projection pathways, segmentations were performed only on the left hemisphere. In addition, the inter-rater variability was evaluated with an additional 10 raters and every dataset was traced at least twice by different raters (see Figure 1). Raters had no prior experience with bundle segmentation from tractography data (virtual dissection), which allowed us to assess how well the TractEM protocols can be learned from instruction only. Evaluation of the protocol was performed using 20 subjects (10 HCP, 10 BLSA) with a wide age range to evaluate robustness to heterogeneity in WM structures. The pre-processed data were provided so they could be directly input into DSI-Studio.
Figure 1:
Visual representation of our study design. For intra-rater, all 5 raters had to segment all bundles on the duplicated BLSA dataset that was provided. For inter-rater, 10 raters had to execute only a subset of all combinations of datasets/bundles. The above schematic represents a perfectly balanced workload for all raters and between BLSA and HCP. However, in reality some raters had more segmentation to perform (either on more datasets or more bundles or both). This planned division of labor ultimately leads to each bundle being segmented twice for each dataset.
2.3.1. Protocol description
The TractEM project is a tractography-based whole-brain protocol (https://my.vanderbilt.edu/tractem/); in its current version it is informed by the EVE atlas (see Figure 2). The goal was to create a first release and then iteratively refine the protocol by soliciting feedback on both label quality and protocol definitions.
Figure 2:
Most pathways were defined using the EVE atlas as a reference, with the aim to create a protocol that non-experts in anatomy and tractography can follow intuitively. We again stress that we are not suggesting these are optimal protocols for a given pathway, but rather a rudimentary guide of how one may delineate pathways given knowledge of EVE-based regions.
For example, many pathways utilize seed and/or inclusion ROIs using the orientation and locations visualized directly in the EVE atlas (see Figure 2), with additional inclusion/exclusion regions used to capture the bundle as completely as possible while also removing false positive streamlines, such as those illustrated in Figure 3. A total of 53 unique pathways and 8 subject-specific lobar regions (defined by a multi-atlas segmentation algorithm (Huo et al., 2016)) were selected such that most of the brain was accounted for. Accounting for bilaterality, there are 7 commissural, 11 association, 7 projection pathways and 5 within the brainstem. All pathway descriptions were directly inspired from the EVE atlas.
Figure 3:
Difference between simple and complex bundles as defined in this work. Simple bundles (a) do not use regions of interest to guide their path; only a seed region and a region of avoidance are necessary. Complex bundles (b) require a seeding region and one or more regions of interest. General direction is enforced by the regions of interest.
In this manuscript, protocols are categorized by complexity into simple and complex bundles. If a protocol contains only a single seed region, e.g., genu of the corpus callosum (GCC), we refer to it as a simple bundle. If a bundle’s definition includes multiple seeds or ROIs in combination, it is referred to as a complex bundle, e.g., inferior longitudinal fasciculus (ILF) (see Figure 3).
The current version of the TractEM protocol or the description of some segmentations can be debated. However, the logic behind our investigation remains the same: 1) Can the protocol’s instructions be followed by raters? 2) How can we rigorously assess how reproducible a protocol is? 3) Can we identify specific aspects of the protocol that increase or decrease reproducibility? 4) What are the next steps to take to improve protocol quality or alternatively, to create a better protocol in the future? The current work acts as a baseline, the first iteration, of an open and incremental protocol for multiple bundles’ segmentation.
2.3.2. Task execution in DSI-Studio
DSI-Studio is a tractography software package that reconstructs WM pathways. It provides a bundle-specific, ROI-based deterministic tractography approach (F. C. Yeh, 2020), making it a tool of choice at the time of the study design. However, we acknowledge that this decision limits the scope of the baseline results presented in this work. Tracking parameters were chosen carefully to reduce the number of false positive fibers and to ensure that the tracking does not suffer from premature termination; these parameters apply to all bundles in the TractEM protocol. The termination criterion is normalized quantitative anisotropy with a threshold of 0.1 and is used to ensure the maximum number of fibers are produced. Smoothing was set to 1, seed initialization was limited to 100K for each bundle, and streamline length constraints were 30–300 mm for the simple and complex bundles. Angular and quantitative anisotropy (QA) thresholds are the determining parameters of how the fibers are tracked from the region of interest. The QA threshold was set to default and the angular threshold was set to zero; therefore, the stopping criterion is not determined by this latter parameter. The tracking algorithm is streamline (Euler), seed orientation is primary, and the direction interpolation is set to trilinear. The random seed generator was set to a constant to ensure tracking results are identical and reproducible.
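For reference, the parameters listed above can be gathered in a single structure. The sketch below is only a summary aid: the field names are hypothetical and are not DSI-Studio options, the random seed value is a placeholder because the constant used is not reported, and the length check is a generic illustration of the 30–300 mm constraint.

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class TrackingParameters:
    # Field names are hypothetical; values are those reported in the text.
    termination_nqa_threshold: float = 0.1
    angular_threshold_deg: float = 0.0  # zero: not used as a stopping criterion
    smoothing: float = 1.0
    max_seeds: int = 100_000
    min_length_mm: float = 30.0
    max_length_mm: float = 300.0
    algorithm: str = "streamline (Euler)"
    seed_orientation: str = "primary"
    direction_interpolation: str = "trilinear"
    random_seed: int = 0  # placeholder constant; actual value not reported

def streamline_length_mm(points):
    """Sum of segment lengths of a streamline given as (N, 3) points in mm."""
    points = np.asarray(points, dtype=float)
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

def passes_length_constraint(points, params):
    """True if the streamline length lies within the configured bounds."""
    length = streamline_length_mm(points)
    return params.min_length_mm <= length <= params.max_length_mm

params = TrackingParameters()
long_enough = passes_length_constraint([[0, 0, 0], [0, 0, 25], [0, 0, 50]], params)
too_short = passes_length_constraint([[0, 0, 0], [0, 0, 10]], params)
```

Keeping the parameters frozen mirrors the intent that the same settings apply to every bundle in the protocol.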
2.4. Reproducibility analysis
2.4.1. Data representation
To assess the reproducibility of the TractEM protocol, the Dice coefficient, weighted Dice coefficient and bundle adjacency distance were utilized. Reproducibility analysis is essential to validate robustness. The reader should note that the concept of ground truth does not apply to this study; instead, segmentations of the same bundle were compared pairwise. Streamlines were converted to a binary image representation (each voxel crossed by a streamline is set to 1) or a density image (a voxel crossed by N streamlines is set to N).
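The conversion to binary and density images can be sketched as follows. This is a minimal illustration, not the tool used in the study: it assumes streamline points are already expressed in voxel coordinates with voxel centers at integer positions, and that streamlines are sampled densely enough that every crossed voxel contains at least one point.

```python
import numpy as np

def density_map(streamlines, shape):
    """Count, for each voxel, how many streamlines cross it."""
    density = np.zeros(shape, dtype=np.int32)
    for points in streamlines:
        # Round each point to its nearest voxel index.
        voxels = np.rint(np.asarray(points, dtype=float)).astype(int)
        inside = np.all((voxels >= 0) & (voxels < np.array(shape)), axis=1)
        if not inside.any():
            continue
        # A streamline counts once per voxel, however many points fall in it.
        unique_voxels = np.unique(voxels[inside], axis=0)
        density[tuple(unique_voxels.T)] += 1
    return density

def binary_map(streamlines, shape):
    """Set to 1 every voxel crossed by at least one streamline."""
    return (density_map(streamlines, shape) > 0).astype(np.uint8)

# Two toy streamlines sharing one voxel.
toy_streamlines = [
    [[0, 0, 0], [1, 0, 0], [2, 0, 0]],
    [[2, 0, 0], [2, 1, 0], [2.2, 1.1, 0]],
]
toy_density = density_map(toy_streamlines, (5, 5, 5))
toy_binary = binary_map(toy_streamlines, (5, 5, 5))
```

In the toy example the shared voxel has a density of 2, while the binary image only records that it is occupied.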
2.4.2. Reproducibility score / variability metrics
The Dice coefficient (Dice, 1945) quantifies the overlap between two sets; the resulting value is between 0 and 1. When applied to a binary image, it quantifies morphological similarity and overall volume agreement. The weighted Dice coefficient (Cousineau et al., 2017) is a measure of agreement taking into account streamline density. This way outliers and spurious streamlines will not have a large impact on the measure.
The bundle adjacency (Garyfallidis et al., 2015) is another measure of spatial agreement. Bundle adjacency does not take into account streamline density. To reduce the impact of outliers, spurious streamlines or streamlines offset by a few millimeters, a simple distance (in mm) is used. Between pairs of segmentations, each voxel (occupied by a streamline) in the first segmentation is mapped to its closest voxel in the second segmentation and vice-versa. This way, overlapping voxels are assigned a distance of zero, while non-overlapping voxels are assigned the distance to their closest counterpart (in mm). With this measure, values closer to 0 mm are desired.
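As a sketch of these two overlap measures, the following functions compute a binary Dice coefficient and one common formulation of a density-weighted Dice. Normalizing each density map to sum to one is an assumption made here so that both segmentations contribute equally regardless of their streamline count; the exact weighting used in the study may differ.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Binary Dice: 2 |A and B| / (|A| + |B|)."""
    a = np.asarray(mask_a) > 0
    b = np.asarray(mask_b) > 0
    denominator = a.sum() + b.sum()
    if denominator == 0:
        return 0.0  # convention chosen here for two empty masks
    return float(2.0 * np.logical_and(a, b).sum() / denominator)

def weighted_dice(density_a, density_b):
    """Density-weighted Dice: weight found in the overlap over total weight."""
    a = np.asarray(density_a, dtype=float)
    b = np.asarray(density_b, dtype=float)
    if a.sum() == 0 or b.sum() == 0:
        return 0.0
    # Assumption: normalize each map so its weights sum to one.
    a = a / a.sum()
    b = b / b.sum()
    overlap = (a > 0) & (b > 0)
    return float((a[overlap].sum() + b[overlap].sum()) / (a.sum() + b.sum()))

# Toy densities: the dense "core" voxel is shared, the sparse voxels are not.
toy_a = np.array([3, 1, 0])
toy_b = np.array([2, 0, 1])
toy_dice = dice(toy_a, toy_b)
toy_weighted_dice = weighted_dice(toy_a, toy_b)
```

In the toy example the binary Dice is 0.5 while the weighted Dice is higher, because the disagreement is confined to low-density voxels.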
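The mapping described above can be sketched with a brute-force nearest-voxel search. How the per-voxel distances are aggregated is not specified in the text; averaging within each direction and then averaging the two directions is an assumption of this sketch, as are the function names and the default voxel size. For real bundles a KD-tree or a distance transform would be more appropriate than the pairwise distance matrix used here.

```python
import numpy as np

def _mean_min_distance(src_mm, dst_mm):
    """Mean, over source voxels, of the distance to the closest target voxel."""
    diff = src_mm[:, None, :] - dst_mm[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    # Overlapping voxels contribute a distance of zero.
    return float(distances.min(axis=1).mean())

def bundle_adjacency(mask_a, mask_b, voxel_size_mm=(1.0, 1.0, 1.0)):
    """Symmetric mean closest-voxel distance (mm) between two 3D binary masks."""
    size = np.asarray(voxel_size_mm, dtype=float)
    coords_a = np.argwhere(np.asarray(mask_a) > 0) * size
    coords_b = np.argwhere(np.asarray(mask_b) > 0) * size
    if len(coords_a) == 0 or len(coords_b) == 0:
        return float("nan")
    a_to_b = _mean_min_distance(coords_a, coords_b)
    b_to_a = _mean_min_distance(coords_b, coords_a)
    return 0.5 * (a_to_b + b_to_a)

# Toy masks on a 2 mm isotropic grid.
toy_mask_a = np.zeros((5, 1, 1), dtype=np.uint8)
toy_mask_b = np.zeros((5, 1, 1), dtype=np.uint8)
toy_mask_a[[0, 1], 0, 0] = 1
toy_mask_b[[0, 3], 0, 0] = 1
toy_adjacency = bundle_adjacency(toy_mask_a, toy_mask_b, voxel_size_mm=(2.0, 2.0, 2.0))
```

Here the shared voxel contributes zero in both directions, and the result is 1.5 mm: the average of 1 mm (first to second) and 2 mm (second to first).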
These three similarity metrics are used to quantify intra-rater and inter-rater reproducibility scores for each bundle of the TractEM protocol. While individually useful, it is best to look at the three simultaneously. The easiest scenario to understand occurs when the Dice (and weighted Dice) coefficient are high with a low bundle adjacency. This indicates similar spatial overlap, similar spatial distribution of density with rare outliers. However, a relatively high Dice with a very high bundle adjacency could indicate an overall good volume overlap with a few spurious streamlines extending far beyond the “cores”. Another example would be two bundles with a high Dice and a low weighted Dice, this scenario would indicate a good spatial overlap, but their “cores” (where density is highest) are not overlapping.
3. Results
From the 10 raters, we received close to 2600 manually identified bundles. On average, an entire protocol took 6 hours to complete. This includes learning how to use DSI-Studio, familiarization with the instructions for each bundle of interest and the segmentation itself (drawing the ROIs). Most raters did not have to perform the segmentation of all bundles for a given subject; instead, they performed a few (random) bundle segmentations on a few subjects.
Raters were allowed to use any version of DSI-Studio. After all the data were gathered, to reduce the variability due to DSI-Studio version differences and to ensure that the data were free of human errors (naming convention, saving issues, etc.), we performed the tracking for all subjects using the provided manual labels. If the tracking did not reproduce the submitted streamlines, the task was given to another rater and manually relabeled to replace the missing/inconsistent data. Some errors due to human fallibility were left unfixed in the dataset. A non-exhaustive list of issues: 1) some raters sought placement guidance from other raters, resulting in very similar labels, 2) some raters used the same labels for different subjects, 3) some raters did not closely follow the protocols, 4) some raters delineated the same-subject bundle more than once by accident.
Quality assessment is a crucial step to ensure validity of the submitted bundles. An automated PDF report was generated for each bundle. It includes streamline density visualization overlaid on the subject’s FA images for all three views: sagittal, coronal, axial, and a 3D representation of the streamline. This tool served as a quick visual sanity check to see if the assembled bundle appeared consistent with the expected morphology. Its main use was to check if the rater interpreted the protocol correctly (obvious mistakes, saving issues, etc.).
3.1. Intra-rater reproducibility
Five raters were assigned to test all protocols (except lobar bundles) on a given BLSA subject. The results showed acceptable homogeneity on duplicate datasets. However, while the median is at an ‘acceptable’ level, the interquartile range is high. As seen in Figure 5, the small sample makes the distribution sensitive to extreme scores. No single rater was responsible for very high or very low reproducibility scores.
Figure 5:
Intra-rater reproducibility metrics across all pathways. The pathways are ordered from the highest Dice Coefficient to the lowest. Coloring represents categories (brainstem, commissures, etc.) described earlier. No particular type of bundle obtained a higher or lower score. Due to the small sample size, outliers are difficult to identify and lead to a high interquartile range of the boxplots.
For each tract, medians and standard deviations were plotted for both datasets. HCP has shown Dice > 0.65, Weighted Dice > 0.85 and Bundle Adjacency < 2 mm reproducibility on most bundles. BLSA has shown Dice > 0.6, Weighted Dice > 0.8 and Bundle Adjacency < 2 mm.
We defined the pathways during the TractEM development stage as lobar regions, simple and complex bundles. Lobar regions exhibit very high reproducibility across both datasets, all lobar bundles have Dice above 0.9 (above 0.99 Weighted Dice). Simple bundles do not show increased reproducibility over complex bundles.
Pathways were categorized per type: lobar, brainstem, commissures, associations, projections. It is important to mention that in our nomenclature, the brainstem category encompasses bundles that are within the brainstem regions and the projection bundles encompass all long-ranging bundles with a general up-down orientation. Except for the lobar category, which has a high reproducibility score, there is no association between the pathways’ types and reproducibility.
3.2. Inter-rater reproducibility
The scores were calculated by averaging similarity across all subjects for a corresponding bundle. For each tract, medians and standard deviations were plotted for both datasets. HCP has shown Dice > 0.65, Weighted Dice > 0.85 and Bundle Adjacency < 2 mm reproducibility on most bundles. BLSA has shown Dice > 0.6, Weighted Dice > 0.8 and Bundle Adjacency < 2 mm reproducibility on most bundles. It is important to mention that median scores alone do not accurately represent the performance of the raters: as seen in Figures 6, 7 and 8, the interquartile ranges show that segmentation tasks are prone to major interpretation/execution discrepancies.
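The point that a median can hide a wide spread is easy to illustrate. The sketch below summarizes per-bundle scores by median and interquartile range; the bundle names are taken from the protocol, but the score values are invented for illustration and are not results from this study.

```python
import numpy as np

def summarize_scores(scores_by_bundle):
    """Return the median and interquartile range of the scores of each bundle."""
    summary = {}
    for bundle, scores in scores_by_bundle.items():
        q1, median, q3 = np.percentile(np.asarray(scores, dtype=float), [25, 50, 75])
        summary[bundle] = {"median": float(median), "iqr": float(q3 - q1)}
    return summary

# Hypothetical Dice scores (not study results).
toy_scores = {
    "GCC": [0.80, 0.82, 0.84, 0.86, 0.88],
    "FX": [0.20, 0.40, 0.60, 0.80, 1.00],
}
toy_summary = summarize_scores(toy_scores)
```

Two bundles with medians in a similar range can differ by an order of magnitude in interquartile range, which is why both are reported in the figures.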
Figure 6:
Dice Coefficient across all pathways for both datasets. All bundles with identical left and right descriptions were fused for the analysis. Pathway ordering and coloring are the same as in Figure 5. Low average or large interquartile ranges expose the lack of consistency in results (in terms of overall volume).
Figure 7:
Weighted Dice Coefficient across all pathways for both datasets. Pathway ordering and coloring are the same as in Figure 5. High average and smaller interquartile range show that bundles’ cores (most dense portion of a pathway) are more consistently obtained.
Figure 8:
Bundle adjacency across all pathways for both datasets. Pathway ordering and coloring are the same as in Figure 5. Medium average and smaller interquartile range show that segmentations were producing similar results, but all with various levels of outliers or spurious streamlines.
3.3. Reproducibility score and data quality
The difference in data quality influences pathway reconstruction. Our hypothesis was that better acquisition, such as HCP, would positively impact the reproducibility score. However, the difference between both datasets was not statistically significant for most bundles (see Table 1). 24 out of 35 bundles did not show a statistically significant difference in reproducibility between BLSA and HCP; only 11 bundles did.
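The statistical test used for this comparison is not named in the text. As one possible way to compare the scores of a bundle between two datasets, a two-sided permutation test on the difference of means is sketched below. The choice of test, the function name, the number of permutations and the example scores are all assumptions made for illustration.

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of mean scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    rng = np.random.default_rng(seed)  # fixed seed for repeatable estimates
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        # Small tolerance so floating-point ties count as "at least as extreme".
        if diff >= observed - 1e-12:
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_permutations + 1)

# Hypothetical per-subject scores for one bundle (not study results).
toy_hcp = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95]
toy_blsa = [0.50, 0.51, 0.52, 0.53, 0.54, 0.55]
p_different = permutation_test(toy_hcp, toy_blsa)
p_identical = permutation_test(toy_hcp, toy_hcp)
```

When many bundles are tested in this way, a correction for multiple comparisons would normally be considered as well.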
Table 1:
Complete list of acronyms for the TractEM pathways.
| Complete name | Acronym |
|---|---|
| Anterior Commissure | AC |
| Anterior Corona Radiata | ACR |
| Anterior Limb of the Internal Capsule | AIC |
| Body of Corpus Callosum | BCC |
| Cerebral Peduncle | CP |
| Cingulum (cingulate gyrus part) | CGC |
| Cingulum (hippocampal part) | CGH |
| Corticospinal Tract | CST |
| Fornix | FX |
| Fornix (crus)/ stria terminalis | FXST |
| Genu of the Corpus Callosum | GCC |
| Inferior Cerebellar Peduncle | ICP |
| Inferior Fronto Occipital Fasciculus | IFO |
| Inferior Longitudinal Fasciculus | ILF |
| Medial Lemniscus | ML |
| Midbrain | M |
| Middle Cerebellar Peduncle | MCP |
| Olfactory Radiation | OLFR |
| Optic Tract | OPT |
| Posterior Corona Radiata | PCR |
| Pontine Crossing Tract | PCT |
| Posterior Limb of the Internal Capsule | PIC |
| Posterior Thalamic Radiation | PTR |
| Sagittal Stratum | SS |
| Splenium of the Corpus Callosum | SCC |
| Superior Cerebellar Peduncle | SCP |
| Superior Corona Radiata | SCR |
| Superior Fronto Occipital Fasciculus | SFO |
| Superior Longitudinal Fasciculus | SLF |
| Tapetum of the Corpus Callosum | TAP |
| Uncinate Fasciculus | UNC |
| Frontal Lobe | FL |
| Parietal Lobe | PL |
| Occipital Lobe | OL |
| Temporal Lobe | TL |
4. Discussion
Qualitative and quantitative identification of fiber pathways is crucial for connectivity studies. Reproducible tractography can provide a backbone to the group-wise and longitudinal testing studies for which a comprehensive modern atlas is essential. Despite this importance, there are a limited number of reproducibility studies for manual bundle segmentation, particularly those with models other than DTI. Those that exist often investigate a limited number of fiber pathways due to the time-consuming nature of the tasks (Besseling et al., 2012; Ciccarelli et al., 2003; Heiervang et al., 2006; Rheault et al., 2020; Veenith et al., 2013; Wakana et al., 2007). Using a large number of WM pathways with state-of-the-art local models and tractography, we have shown that most bundles achieve an acceptable level of reproducibility when it comes to the core (most dense regions) of the bundles, but that spurious streamlines and outliers remain problematic. The inclusion of older adults, from the BLSA database, and the similarity of their reproducibility scores to those of younger adults, from the HCP database, demonstrate that the segmentation protocol is robust to age-related heterogeneity in WM structures. The main goal of this study was not only to design and quantify segmentation protocols that are clinically relevant, but to establish a baseline quality assessment. The current project identified strengths and weaknesses of our segmentation protocols and what can be improved in future iterations of the protocols.
4.1. Tasks complexities
By using an intra-rater and inter-rater project design with the TractEM protocol to obtain 35 (61 counting bilateral) individual bundles of interest, we quantified the ability of raters to follow simple/complex instructions. Our raters had no prior experience with tractography, to mimic how neuroscience students or clinicians without expertise in diffusion MRI could learn the TractEM protocols from instruction only. As shown in Rheault et al., 2020, expertise in neuroanatomy or tractography had little impact on the quality/reproducibility of the segmentation executed by raters. The most reproducible bundles, when taking into account all metrics, were the subdivisions of the corpus callosum (CC), pontine crossing tract (PCT), cingulum (CG), middle cerebellar peduncle (MCP), corticospinal tract (CST), superior longitudinal fasciculus (SLF), posterior limb of the internal capsule (PIC), and uncinate fasciculus (UNC) on both datasets. In contrast, the fornix (FX), fornix stria-terminalis (FXST), superior cerebellar peduncle (SCP), and superior fronto-occipital fasciculus (SFO) exhibited low reproducibility. Some of the most reproducible bundles were categorized as simple bundles and some of the least reproducible were categorized as complex bundles. Even though HCP showed a more distinguishable division of reproducibility between simple and complex bundles, the BLSA dataset shows the same trend. However, in most cases this categorization was not a good predictor of scores. In more extreme cases, it is possible that as the complexity of the manual protocol increases, the rater reproducibility decreases. However, the ‘complexity’ of a protocol is hard to define. Variables such as the number of ROIs, size of ROIs, shape of ROIs, number of planes (axial, coronal, sagittal) the ROIs are drawn on and even the presentation itself (figures, text disposition, ordering, etc.) could all increase or decrease complexity.
It is possible that in our protocol, the number of ROIs was a driver of complexity. We hypothesize that the best way to increase the reproducibility score of the protocol itself would be to reduce the number of smaller exclusion ROIs required for complex bundles and to focus on using more clearly defined inclusion ROIs (point of passage, bottleneck, etc.).
4.2. Study design and protocol limitations
4.2.1. Intra-rater
Intra-rater reproducibility was evaluated only on the BLSA dataset due to the workload generated by the large number of pathways to segment. Intra-rater variation will always reduce the observed reproducibility for any dataset (Gwet, 2012; Liao et al., 2010). The exact causes of this variation are still difficult to identify; we initially hypothesized that raters’ interpretations do not change, only the execution. The exact level of reproducibility will vary due to multiple factors such as experience with anatomy or tractography, academic background of raters or familiarity with the software. However, it was observed that for virtually all bundles the intra-rater reproducibility scores were extremely close to their inter-rater counterparts. This indicates that, for this version of the protocol, whether one rater performs two segmentations or two raters perform one segmentation each (on the same dataset), the reproducibility scores will be similar.
4.2.2. Inter-rater
From the results, across metrics and across bundles for BLSA and HCP, it is difficult to reach simple conclusions. Overall, the Dice coefficient for most bundles (in both databases/acquisitions) shows poor agreement. It was initially hypothesized that the HCP database would achieve higher scores due to its quality. We believed that quality of acquisition would lead to better bundle reconstruction and so to easier, more intuitive segmentation. However, only a minority of bundles reached our threshold for statistical significance between the BLSA and HCP datasets; even when significant, many differences were quite small.
The weighted Dice coefficient shows a more interesting pattern: the higher scores indicate that raters were able to segment the regions where density was the highest. While the average scores seem acceptable, the standard deviation is very large. Such a range of interpretation/execution across raters is in itself a problem.
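To illustrate why the weighted Dice can be high while the binary Dice is modest, the following minimal sketch compares the two on toy data. It assumes one common density-weighted formulation (the summed streamline density of both bundles inside the shared voxels, divided by the total density of both bundles), which may differ in detail from the implementation used in this work. The function names, the dictionary representation of density maps and the toy values are all hypothetical.

```python
def dice(density_a, density_b):
    """Binary Dice: every visited voxel counts equally."""
    voxels_a = {v for v, n in density_a.items() if n > 0}
    voxels_b = {v for v, n in density_b.items() if n > 0}
    return 2 * len(voxels_a & voxels_b) / (len(voxels_a) + len(voxels_b))


def weighted_dice(density_a, density_b):
    """Density-weighted Dice (assumed formulation): shared voxels count
    in proportion to the number of streamlines passing through them."""
    shared = [v for v, n in density_a.items() if n > 0 and density_b.get(v, 0) > 0]
    overlap = sum(density_a[v] + density_b[v] for v in shared)
    return overlap / (sum(density_a.values()) + sum(density_b.values()))


# Hypothetical toy data: both raters capture the same dense core
# (50 streamlines per voxel), but each adds four sparse voxels
# (1 streamline each) that the other rater does not have.
core = {(x, 0, 0): 50 for x in range(4)}
rater_a = {**core, **{(x, 1, 0): 1 for x in range(4)}}
rater_b = {**core, **{(x, -1, 0): 1 for x in range(4)}}

dice_score = dice(rater_a, rater_b)
wdice_score = weighted_dice(rater_a, rater_b)
print(f"Dice = {dice_score:.2f}, weighted Dice = {wdice_score:.2f}")
```

In this toy case half of the visited voxels disagree, so the binary Dice is 0.50, yet nearly all of the streamline density lies in the shared core, so the weighted Dice is about 0.98.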
Bundle adjacency distance indicates that most bundles (for both acquisitions/databases) contained similar amounts of outliers and spurious streamlines. It is important to mention that bundles with high Dice and high bundle adjacency represent cases where extreme outliers were present. Distant/isolated streamlines have such an impact on the bundle adjacency metric, and not on the Dice coefficient, because the distance to the closest neighbor (in millimeters) is part of the bundle adjacency computation. This means that a few streamlines extending far beyond the core of the bundle have a low impact on the total volume, but their cumulative distances to the core will rapidly increase. Scenarios with low Dice and high bundle adjacency can represent multiple (more complex) situations that cannot be easily (or intuitively) interpreted. The better reproducibility scores are, the easier they are to interpret and/or disentangle.
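The contrast between the two metrics can be sketched with a toy example. The code below assumes a voxel-based bundle adjacency defined as the mean distance from every voxel of one bundle to the nearest voxel of the other, averaged over both directions, with 1 mm isotropic voxels. This is an illustrative formulation and not necessarily the exact one used in this work; the function names and the toy geometry are hypothetical.

```python
from itertools import product
from math import dist


def dice(voxels_a, voxels_b):
    """Binary Dice between two sets of voxel coordinates."""
    return 2 * len(voxels_a & voxels_b) / (len(voxels_a) + len(voxels_b))


def mean_nearest_distance(source, target):
    """Mean distance from each voxel in `source` to its nearest voxel in `target`."""
    return sum(min(dist(s, t) for t in target) for s in source) / len(source)


def bundle_adjacency(voxels_a, voxels_b):
    """Symmetric mean nearest-neighbor distance (assumed formulation, in voxel units)."""
    return (mean_nearest_distance(voxels_a, voxels_b)
            + mean_nearest_distance(voxels_b, voxels_a)) / 2


# Hypothetical toy bundles: both raters agree on a 5x5x5 core; rater B also
# kept one spurious streamline, a 30-voxel line running away from the core.
core = set(product(range(5), repeat=3))
stray = {(x, 2, 2) for x in range(5, 35)}
rater_a = core
rater_b = core | stray

dice_score = dice(rater_a, rater_b)
ba_score = bundle_adjacency(rater_a, rater_b)
print(f"Dice = {dice_score:.2f}, bundle adjacency = {ba_score:.2f} mm")
```

Here a single stray streamline lowers the Dice only to about 0.89, while the bundle adjacency rises from 0 mm to 1.5 mm because each stray voxel contributes its full distance to the core.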
While intra-rater reproducibility is absolutely crucial for a protocol, one can imagine a protocol that can be easily done again and again but is prone to misinterpretation. This would create a scenario where a wrong interpretation is reproducible for a single rater, but every rater submits entirely different-looking bundles. Lower reproducibility across raters is expected since it merges the variability in execution and in interpretation. The similarity between intra-rater and inter-rater reproducibility scores likely comes from the fact that our protocol is difficult to interpret and difficult to execute.
4.2.3. Intra-subject and inter-subject
Since the reproducibility scores for most bundles of interest have either a low median or a wide range of results (interquartile range), we would not recommend interpreting anatomical variability from these bundles if obtained from the TractEM protocol in its current version. Intra-subject (test-retest) reliability of DSI-Studio was evaluated for another protocol and showed high reproducibility (F. C. Yeh, 2020). The fact that DSI-Studio uses deterministic tractography and predetermined tracking parameters, and that the segmentation protocol used was defined on a template rather than being subject-specific, explains why bundle reproducibility is higher than with other tools/algorithms (Cousineau et al., 2017; Zhang et al., 2019). Since our intra-rater reproducibility was evaluated on identical datasets (rather than test-retest), the intra-subject reliability can only be equal to or lower than our reported results. When a segmentation protocol has high intra-rater reproducibility, scanner effects will dominate the intra-subject reproducibility. However, it was shown that for protocols with low reproducibility, scanner effects will not significantly decrease reproducibility and the segmentation protocol will dominate intra-subject reproducibility (Schilling et al., 2021b). Using the results of the current work, it would be unwise to try to quantify variability at the individual or the population level (i.e., structural differences). Improvements to the TractEM protocol are needed before a more exhaustive quantification of the intra-subject reliability and inter-subject comparison is performed. At the moment, the range of reproducibility scores for most bundles is too large to accurately interpret the intra-subject reliability or the inter-subject comparison. This highlights the importance of not only having a clearly defined and open protocol for bundle segmentation, but also a rigorous quantification of its reproducibility.
4.3. Future work on the protocol
4.3.1. Generalization to other tools
TractEM allows us to obtain bundle segmentation from non-experts. At its initial developmental stage, it has shown encouraging results. Using the protocol, reconstructing a full dataset with 35 bundles with their bilateral counterparts is feasible in less than 6 hours per subject (including the software’s learning curve and tractography itself).
However, we must acknowledge the limitations of the protocol we used in this work. Some decisions were made at the time due to technological limitations or simple lack of familiarity with the cutting-edge processing tools available at the design stage. For example, the TractEM protocol is specific to deterministic tractography and embedded with DSI-Studio, which limits the generalization of anatomical definitions. Future work includes modifications to at least accept any deterministic tractograms from tools such as Dipy or MRtrix (Garyfallidis et al., 2014; J. D. Tournier et al., 2019). Major modifications would be required to create a robust protocol for probabilistic tractography, but this work provides a framework as well as a baseline to quantify modifications to the TractEM protocol. An important consideration, if the TractEM protocols become independent of the tractography reconstruction process, would be to evaluate the impact of the tractography method used. In its current state, the TractEM protocol only applies to deterministic tractography, but not all deterministic tractography algorithms are equal. For the current project, we used the same configuration (tracking parameters), but a parameter exploration approach (F. C. Yeh, 2020) combining randomized parameters and a high streamline count to saturate bundle coverage could, hypothetically, lead to different results. Optimal tracking parameters and the pruning threshold for spurious streamlines (within DSI-Studio) could be investigated separately to determine which approach leads to higher reproducibility. However, it is likely impossible to generalize an existing segmentation protocol to new tracking algorithms or new bundles of interest (Rheault et al., 2020). Small deviations from the framework in which the protocol was evaluated can be acceptable, but the effects of major modifications, such as generating streamlines using MRtrix (J. D. Tournier et al., 2019), which would impact both the local reconstruction and the tracking itself, are difficult to predict. Furthermore, the choice of software in which the segmentation protocol is performed could lead to major differences in reproducibility scores (Rheault et al., 2020). Even a minor/major user-interface update or a change in the underlying delineation behavior in DSI-Studio could be enough to throw off previously quantified protocols. However, improvements to the TractEM protocol are needed before a more exhaustive quantification of the impact of these more minor variables is performed.
4.3.2. Fantastic tracts and how to define them
This first iteration of the protocol was inspired by the EVE atlas; initially, the intent was to define pathways using simple regions described in the original work. However, the original work was a 3D atlas of the WM, when in fact WM pathways often (always) overlap each other. This means that the way bundles are defined in TractEM is not optimal considering the structure of what it attempts to segment/reconstruct. In order to respect the bundles as defined in the EVE atlas, the TractEM protocol used minimal seed/ROI regions with numerous ROAs. Furthermore, the number of ROAs and their size were left to the rater with little to no description; this not only made the segmentation tasks difficult to follow, but likely drove the variability.
From Figure 4, we can observe that some pathways are nearly indistinguishable from one another. This is due to multiple factors: 1) Major pathways are easier to reconstruct than others, and bundles with close trajectories will often simply merge into a single entity. 2) The raters likely did not know what the expected shape was, or the instructions were not clear enough to disentangle close bundles. 3) The EVE atlas does not define pathways globally; rather, it simply defines local regions. This leads to problems when global pathways share local regions, and this disconnection between the EVE WM 3D atlas and tractography highlights the need to challenge classical WM delineation.
Figure 4:
Average shape and position of all bundles that are part of the TractEM protocol (association and projection pathways are shown only for the left hemisphere). The average was computed using segmentation from 4 raters and 4 datasets (BLSA) using a majority vote. This highlights the general agreement on the shape, but as seen in the lower right vignette the variability of each individual segmentation can be quite extreme.
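As an illustration of the majority vote mentioned in the caption, the sketch below keeps a voxel in the consensus bundle only if it appears in more than half of the individual segmentations. Treating ties (exactly half) as excluded is an assumption made here for illustration, as are the function name and the toy coordinates; the original processing may have handled these details differently.

```python
from collections import Counter


def majority_vote(segmentations):
    """Return the voxels present in more than half of the segmentations.

    Each segmentation is a set of (x, y, z) voxel coordinates. A voxel
    present in exactly half of the segmentations is excluded (assumed rule).
    """
    counts = Counter(voxel for seg in segmentations for voxel in seg)
    threshold = len(segmentations) / 2
    return {voxel for voxel, n in counts.items() if n > threshold}


# Hypothetical toy segmentations of one bundle by four raters.
segmentations = [
    {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)},
    {(0, 0, 0), (1, 0, 0), (3, 0, 0), (9, 9, 9)},  # one isolated outlier
    {(0, 0, 0), (1, 0, 0), (2, 0, 0), (5, 5, 5)},  # another outlier
    {(0, 0, 0), (2, 0, 0)},
]

consensus = majority_vote(segmentations)
print(sorted(consensus))
```

The consensus retains the three voxels that at least three of the four raters agree on, while the isolated outliers and the voxel selected by only two raters are discarded. This is why an averaged bundle can look clean even when individual segmentations vary widely.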
Another area of improvement would be to increase the details on anatomical landmarks surrounding ROIs and how to identify the right slice, to identify ROIs’ size and border. Using better nomenclature for the identification of landmarks would also provide more context for participants familiar with neuroanatomy while being a learning tool for those not familiar with it. In DSI-Studio, the data is co-registered to an atlas and in a similar fashion to the EVE atlas the approximate position of ROIs/ROAs can be easily suggested to the raters. However, this iteration of the protocol provided no anatomical landmark, no general description of the ROIs/ROAs and limited figures. This often resulted in the rater picking the exact suggested slices and blindly drawing a region in the general neighborhood of what was expected. Again, this not only makes the protocol difficult to follow, but increases the variability. This also makes the current protocol entirely dependent on DSI-Studio and its internal co-registration routine.
In the current version, all bundles are considered independent; no ROIs/ROAs are reusable. Furthermore, for each bundle the ordering of regions to draw is often arbitrary, switching from axial to coronal to sagittal or going back and forth from inclusion to exclusion regions. This drastically increases software interaction (and so the chance for error/mistake), leading to some suboptimality. It would be more efficient to design a protocol where some regions can be reused. For example, all association bundles required a mid-sagittal plane to cut off commissural pathways. This region of exclusion was redrawn each time, which is time-consuming and generates variability (e.g., choosing a different slice each time). Reusing regions would require the protocol to cross-reference ROIs, but the advantages for the raters and reproducibility are non-negligible.
Another important factor is the consistency of the text. Instructions should be similarly phrased, similarly ordered and have a similar level of detail. This would facilitate the process of developing a routine for raters, which tends to accelerate the tasks, decrease the occurrence of mistakes and increase reproducibility. All seeds/ROIs/ROAs should be accounted for and described; leaving the total number of regions unclear or to the raters’ interpretation is a source of confusion. Optional regions should be clearly identified. Overall, we believe all regions must be justified, their goals/reasons be explicitly described and their impact (if drawn right) on the global pathway trajectory detailed.
Finally, a future iteration would require more visual description. Each region should be defined using at least one image. When surrounding landmarks need to be found, they should be pointed to, zoomed in and shown on more than one contrast; at the moment, 100% of the protocol relies on the RGB (ColorFA). It would also be useful to provide a multi-orientation view of the bundles in 3D or a population average probability map to inform the raters of what is expected and help to detect errors/mistakes more easily. All figures should clearly indicate plane (sagittal, coronal, axial) and orientation (R/L, A/P, S/I), and zoomed figures should show the entire slice in a vignette.
4.3.3. Theoretical anatomical definitions
Finally, a major update would be required to separate the conceptual anatomical definition (theory) from its execution (practice). For example, if the definition of the corticospinal tract requires it to terminate in the brainstem and in the precentral and postcentral gyri as well as passing through the internal capsule (Chenot et al., 2019), that would be an agreed-upon theoretical anatomical definition. However, the execution can vary while respecting these criteria. In a specific protocol these rules could be achieved using planar ROIs limiting the posterior and anterior region at the cortex (Rheault et al., 2020) or in the lower brainstem, but in another version, it could require the usage of an atlas, e.g., Freesurfer (Desikan et al., 2006), to obtain well-delineated 3D ROIs.
These variations in execution would likely give different reproducibility scores for each bundle while nothing changed in the theoretical anatomical definition. Defining, as a community, a set of agreed-upon theoretical anatomical definitions is crucial; defining an agreed-upon execution of each definition is equally crucial, and the variability must be quantified each time the execution is modified (or deviates from the original one). As of right now, TractEM is only an execution protocol that was not created from a consensus in the field, but it still provided a first iteration, a baseline, to identify limitations, pave the way forward for the next iteration and initiate discussion on how to define WM pathways in humans.
4.3.4. Usefulness for automatic algorithms
Different automated protocols are capable of producing highly reproducible (small and large) streamline bundles covering the whole brain (Garyfallidis et al., 2018; Rheault et al., 2018; Wasserthal et al., 2018; Zhang et al., 2020). These methods often rely on manual delineations as their a priori. This highlights the importance of reproducibility studies at the manual level. Subject-specific delineations can provide a backbone to automated methods.
As mentioned earlier, there is a distinction between an anatomical definition in theory and in practice. Automatic methods can be seen as yet one more way to execute a theoretical anatomical definition. Consensus must be obtained; automatic methods would greatly benefit from agreed-upon definitions from the field of neuroanatomy.
5. Conclusion
In this work we investigated the reproducibility of a first iteration of the TractEM protocol, an open and public bundle segmentation protocol based on the EVE WM atlas. We have shown that most bundles achieve good reproducibility in the densest regions (core) of the bundles. Spurious streamlines and outliers remain challenging to avoid and negatively affect the reproducibility scores for most bundles. The complexity of some segmentations or the lack of clarity in the instructions sometimes leads to bundles of questionable anatomical validity. Identifying flaws and limitations of the protocol will lead to improvements. The experience obtained from creating a whole-brain bundle segmentation protocol as well as evaluating its performance provided useful insight for future iterations.
Table 2:
Reported average (standard deviation) values for all bundles for both datasets. Cells in bold show statistical significance (p < 0.01, one-sided) for our hypothesis that HCP would yield higher reproducibility scores than BLSA. When statistical significance is reached, all metrics reach the threshold (2 out of 3 for FX_LR). Bundle names in italics have a Dice coefficient below 0.5; this curtails any potential analysis because the variability in shape is too high. Bundles with the suffix ‘_LR’ represent bilateral pathways (Left/Right).
| Bundle | Dice (BLSA) | Dice (HCP) | Weighted Dice (BLSA) | Weighted Dice (HCP) | Bundle Adjacency, mm (BLSA) | Bundle Adjacency, mm (HCP) |
|---|---|---|---|---|---|---|
| OL_LR | 0.93 (0.01) | 0.94 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.09 (0.02) | 0.07 (0.01) |
| PT_LR | 0.92 (0.02) | 0.92 (0.01) | 0.99 (0.01) | 0.99 (0.00) | 0.12 (0.04) | 0.10 (0.03) |
| TL_LR | 0.91 (0.01) | 0.90 (0.01) | 0.99 (0.00) | 0.99 (0.00) | 0.13 (0.05) | 0.13 (0.05) |
| FL_LR | 0.90 (0.02) | 0.91 (0.01) | 0.99 (0.00) | 0.99 (0.00) | 0.12 (0.04) | 0.10 (0.01) |
| GCC | 0.78 (0.10) | 0.86 (0.05) | 0.96 (0.03) | 0.98 (0.02) | 0.76 (0.73) | 0.25 (0.17) |
| SCC | 0.71 (0.18) | 0.82 (0.16) | 0.89 (0.19) | 0.95 (0.09) | 1.51 (1.85) | 0.64 (1.24) |
| M | 0.63 (0.13) | 0.61 (0.15) | 0.87 (0.13) | 0.89 (0.09) | 1.42 (1.06) | 4.55 (4.20) |
| BCC | 0.71 (0.12) | 0.83 (0.09) | 0.92 (0.05) | 0.96 (0.07) | 1.29 (1.00) | 0.42 (0.42) |
| MCP | 0.77 (0.15) | 0.88 (0.04) | 0.96 (0.06) | 0.99 (0.01) | 1.85 (3.54) | 0.15 (0.08) |
| SCR_LR | 0.65 (0.13) | 0.84 (0.09) | 0.85 (0.12) | 0.97 (0.03) | 1.58 (1.12) | 0.51 (0.64) |
| PTR_LR | 0.60 (0.18) | 0.77 (0.10) | 0.81 (0.21) | 0.95 (0.06) | 1.70 (1.37) | 0.51 (0.31) |
| UNC_LR | 0.65 (0.16) | 0.84 (0.11) | 0.87 (0.10) | 0.96 (0.08) | 1.32 (1.20) | 0.42 (0.64) |
| PIC_LR | 0.68 (0.05) | 0.82 (0.10) | 0.89 (0.05) | 0.97 (0.04) | 1.27 (0.49) | 0.60 (0.71) |
| CST_LR | 0.63 (0.14) | 0.76 (0.11) | 0.90 (0.06) | 0.97 (0.03) | 2.30 (1.62) | 1.28 (1.03) |
| CGC_LR | 0.72 (0.12) | 0.73 (0.08) | 0.96 (0.04) | 0.92 (0.06) | 1.73 (1.75) | 0.85 (0.54) |
| ACR_LR | 0.70 (0.08) | 0.62 (0.17) | 0.94 (0.04) | 0.83 (0.14) | 0.68 (0.31) | 3.87 (3.45) |
| SS_LR | 0.64 (0.17) | 0.72 (0.17) | 0.88 (0.13) | 0.92 (0.09) | 1.38 (1.36) | 1.48 (1.41) |
| AIC_LR | 0.51 (0.14) | 0.79 (0.14) | 0.81 (0.15) | 0.95 (0.08) | 4.74 (3.70) | 0.66 (0.86) |
| PCR_LR | 0.64 (0.12) | 0.79 (0.12) | 0.88 (0.09) | 0.94 (0.07) | 1.91 (1.26) | 0.65 (0.58) |
| SLF_LR | 0.72 (0.07) | 0.71 (0.14) | 0.93 (0.04) | 0.93 (0.07) | 0.96 (0.50) | 0.97 (0.72) |
| ILF_LR | 0.64 (0.09) | 0.69 (0.15) | 0.89 (0.07) | 0.93 (0.08) | 1.71 (0.87) | 1.26 (0.94) |
| PCT | 0.76 (0.07) | 0.87 (0.04) | 0.98 (0.02) | 1.00 (0.01) | 0.45 (0.21) | 0.18 (0.08) |
| IFO_LR | 0.60 (0.20) | 0.59 (0.20) | 0.82 (0.25) | 0.90 (0.11) | 1.58 (1.34) | 2.87 (3.05) |
| ICP_LR | 0.48 (0.18) | 0.51 (0.23) | 0.86 (0.13) | 0.85 (0.19) | 4.27 (2.65) | 3.15 (2.43) |
| CP_LR | 0.45 (0.21) | 0.66 (0.21) | 0.83 (0.12) | 0.92 (0.13) | 4.10 (2.76) | 1.88 (2.07) |
| AC | 0.57 (0.19) | 0.52 (0.29) | 0.88 (0.18) | 0.89 (0.13) | 2.24 (2.70) | 4.78 (5.54) |
| OLFR_LR | 0.59 (0.22) | 0.62 (0.30) | 0.78 (0.26) | 0.86 (0.20) | 2.95 (4.32) | 5.21 (6.67) |
| SFO_LR | 0.46 (0.20) | 0.66 (0.19) | 0.82 (0.20) | 0.89 (0.16) | 3.54 (2.66) | 2.16 (1.92) |
| TAP | 0.53 (0.13) | 0.62 (0.13) | 0.85 (0.14) | 0.88 (0.09) | 2.70 (1.92) | 2.19 (1.40) |
| ML_LR | 0.59 (0.21) | 0.64 (0.17) | 0.82 (0.16) | 0.89 (0.11) | 3.09 (2.80) | 3.97 (3.92) |
| SCP_LR | 0.31 (0.20) | 0.69 (0.17) | 0.78 (0.12) | 0.97 (0.04) | 4.82 (3.11) | 1.54 (1.84) |
| CGH_LR | 0.60 (0.21) | 0.48 (0.23) | 0.92 (0.12) | 0.77 (0.24) | 2.87 (2.77) | 3.29 (2.77) |
| FX_LR | 0.45 (0.22) | 0.56 (0.24) | 0.80 (0.17) | 0.85 (0.17) | 3.04 (2.43) | 3.18 (4.02) |
| OPT | 0.44 (0.15) | 0.72 (0.17) | 0.78 (0.17) | 0.93 (0.12) | 3.56 (1.53) | 1.15 (1.16) |
| FXST_LR | 0.29 (0.25) | 0.18 (0.20) | 0.46 (0.30) | 0.27 (0.29) | 3.90 (2.10) | 6.42 (3.26) |
Highlights.
Quantify reproducibility of WM bundles segmentation from deterministic tractography.
The TractEM project is a tractography-based whole-brain protocol informed by the EVE atlas.
Protocols have been made available in open source to facilitate collaboration.
Acknowledgements
This work was supported by the National Institutes of Health under award numbers R01EB017230, T32EB001628, and in part by ViSE/VICTR VR3029 and the National Center for Research Resources, Grant UL1 RR024975-01. This research was conducted with the support from Intramural Research Program, National Institute on Aging, NIH. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Data Availability Statement
TractEM is an open-source project, in terms of data and labeling guidelines. Comments and discussion are open for each bundle. The frameworks enable versioning of bundle definitions as consensus definitions evolve. By crowd sourcing our results, we aim to obtain expert validation for each bundle. The raw input data, Talairach-aligned data, DSI-Studio-ready input data, and TractEM results are made available on our website, https://my.vanderbilt.edu/tractem.
Bibliography
- Adluru N, Destiche DJ, Tromp DPM, Davidson RJ, Zhang H, & Alexander AL (2016). Evaluating consistency of deterministic streamline tractography in non-linearly warped DTI data. http://arxiv.org/abs/1602.02117
- Amunts K, Lepage C, Borgeat L, Mohlberg H, Dickscheid T, Rousseau MÉ, Bludau S, Bazin PL, Lewis LB, Oros-Peusquens AM, Shah NJ, Lippert T, Zilles K, & Evans AC (2013). BigBrain: An ultrahigh-resolution 3D human brain model. Science, 340(6139), 1472–1475. 10.1126/science.1235381
- Anderson V, Spencer-Smith M, & Wood A (2011). Do children really recover better? Neurobehavioural plasticity after early brain insult. In Brain (Vol. 134, Issue 8, pp. 2197–2221). Oxford University Press. 10.1093/brain/awr103
- Andersson JLR, Skare S, & Ashburner J (2003). How to correct susceptibility distortions in spin-echo echo-planar images: Application to diffusion tensor imaging. NeuroImage, 20(2), 870–888. 10.1016/S1053-8119(03)00336-7
- Andersson JLR, & Sotiropoulos SN (2016). An integrated approach to correction for off-resonance effects and subject movement in diffusion MR imaging. NeuroImage, 125, 1063–1078. 10.1016/j.neuroimage.2015.10.019
- Avants BB, Yushkevich P, Pluta J, Minkoff D, Korczykowski M, Detre J, & Gee JC (2010). The optimal template effect in hippocampus studies of diseased populations. Neuroimage, 49(3), 2457–2466.
- Besseling RMH, Jansen JFA, Overvliet GM, Vaessen MJ, Braakman HMH, Hofman PAM, Aldenkamp AP, & Backes WH (2012). Tract specific reproducibility of tractography based morphology and diffusion metrics. PLoS ONE, 7(4). 10.1371/journal.pone.0034125
- Boccardi M, Ganzola R, Bocchetta M, Pievani M, Redolfi A, Bartzokis G, Camicioli R, Csernansky JG, de Leon MJ, deToledo-Morrell L, & others. (2011). Survey of protocols for the manual segmentation of the hippocampus: preparatory steps towards a joint EADC-ADNI harmonized protocol. Journal of Alzheimer’s Disease, 26(s3), 61–75.
- Catani M, & de Schotten MT (2008). A diffusion tensor imaging tractography atlas for virtual in vivo dissections. Cortex, 44(8), 1105–1132.
- Catani M, Howard RJ, Pajevic S, & Jones DK (2002). Virtual in vivo interactive dissection of white matter fasciculi in the human brain. Neuroimage, 17(1), 77–94.
- Chenot Q, Tzourio-Mazoyer N, Rheault F, Descoteaux M, Crivello F, Zago L, Mellet E, Jobard G, Joliot M, Mazoyer B, & others. (2019). A population-based atlas of the human pyramidal tract in 410 healthy participants. Brain Structure and Function, 224(2), 599–612.
- Ciccarelli O, Parker GJM, Toosy AT, Wheeler-Kingshott CAM, Barker GJ, Boulby PA, Miller DH, & Thompson AJ (2003). From diffusion tractography to quantitative white matter tract measures: A reproducibility study. NeuroImage, 18(2), 348–359. 10.1016/S1053-8119(02)00042-3
- Cousineau M, Jodoin P-M, Garyfallidis E, Côté M-A, Morency FC, Rozanski V, Grand’Maison M, Bedell BJ, & Descoteaux M (2017). A test-retest study on Parkinson’s PPMI dataset yields statistically significant white matter fascicles. NeuroImage: Clinical, 16, 222–233.
- Dayan E, & Cohen LG (2011). Neuroplasticity subserving motor skill learning. In Neuron (Vol. 72, Issue 3, pp. 443–454). Neuron. 10.1016/j.neuron.2011.10.008
- de Schotten MT, Bizzi A, Dell’Acqua F, Allin M, Walshe M, Murray R, Williams SC, Murphy DGM, Catani M, & others. (2011). Atlasing location, asymmetry and inter-subject variability of white matter tracts in the human brain with MR diffusion tractography. Neuroimage, 54(1), 49–59.
- Dell’Acqua F, & Tournier J-D (2019). Modelling white matter with spherical deconvolution: How and why? NMR in Biomedicine, 32(4), e3945.
- Descoteaux M, Angelino E, Fitzgibbons S, & Deriche R (2007). Regularized, fast, and robust analytical Q-ball imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 58(3), 497–510.
- Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, Buckner RL, Dale AM, Maguire RP, Hyman BT, & others. (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage, 31(3), 968–980.
- Dice LR (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297–302.
- Essayed WI, Zhang F, Unadkat P, Cosgrove GR, Golby AJ, & O’Donnell LJ (2017). White matter tractography for neurosurgical planning: A topography-based review of the current state of the art. In NeuroImage: Clinical (Vol. 15, pp. 659–672). Elsevier Inc. 10.1016/j.nicl.2017.06.011
- Ferrucci L, Giallauria F, & Guralnik JM (2008). Epidemiology of Aging. In Radiologic Clinics of North America (Vol. 46, Issue 4, pp. 643–652). Radiol Clin North Am. 10.1016/j.rcl.2008.07.005
- Garyfallidis E, Brett M, Amirbekian B, Rokem A, van der Walt S, Descoteaux M, & Nimmo-Smith I (2014). Dipy, a library for the analysis of diffusion MRI data. Frontiers in Neuroinformatics, 8(FEB), 8. 10.3389/fninf.2014.00008
- Garyfallidis E, Côté M-A, Rheault F, Sidhu J, Hau J, Petit L, Fortin D, Cunanne S, & Descoteaux M (2018). Recognition of white matter bundles using local and global streamline-based registration and clustering. NeuroImage, 170, 283–295.
- Garyfallidis E, Ocegueda O, Wassermann D, & Descoteaux M (2015). Robust and efficient linear registration of white-matter fascicles in the space of streamlines. NeuroImage, 117, 124–140.
- Glasser MF, Coalson TS, Robinson EC, Hacker CD, Harwell J, Yacoub E, Ugurbil K, Andersson J, Beckmann CF, Jenkinson M, & others. (2016). A multi-modal parcellation of human cerebral cortex. Nature, 536(7615), 171.
- Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, Xu J, Jbabdi S, Webster M, Polimeni JR, & others. (2013). The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage, 80, 105–124.
- Greicius MD, Supekar K, Menon V, & Dougherty RF (2009). Resting-state functional connectivity reflects structural connectivity in the default mode network. Cerebral Cortex, 19(1), 72–78. 10.1093/cercor/bhn059
- Gunning-Dixon FM, & Raz N (2000). The cognitive correlates of white matter abnormalities in normal aging: A quantitative review. Neuropsychology, 14(2), 224–232. 10.1037//0894-4105.14.2.224
- Gwet KL (2012). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among multiple raters. Advanced Analytics, LLC.
- Hansen CB, Yang Q, Lyu I, Rheault F, Kerley C, Chandio BQ, Fadnavis S, Williams O, Shafer AT, Resnick SM, Zald DH, Cutting LE, Taylor WD, Boyd B, Garyfallidis E, Anderson AW, Descoteaux M, Landman BA, & Schilling KG (2020). Pandora: 4-D White Matter Bundle Population-Based Atlases Derived from Diffusion MRI Fiber Tractography. Neuroinformatics. 10.1007/s12021-020-09497-1
- Hawrylycz MJ, Lein ES, Guillozet-Bongaarts AL, Shen EH, Ng L, Miller JA, van de Lagemaat LN, Smith KA, Ebbert A, Riley ZL, Abajian C, Beckmann CF, Bernard A, Bertagnolli D, Boe AF, Cartagena PM, Mallar Chakravarty M, Chapin M, Chong J, … Jones AR (2012). An anatomically comprehensive atlas of the adult human brain transcriptome. Nature, 489(7416), 391–399. 10.1038/nature11405
- Heiervang E, Behrens TEJ, Mackay CE, Robson MD, & Johansen-Berg H (2006). Between session reproducibility and between subject variability of diffusion MR and tractography measures. NeuroImage, 33(3), 867–877. 10.1016/j.neuroimage.2006.07.037
- Huo Y, Plassard AJ, Carass A, Resnick SM, Pham DL, Prince JL, & Landman BA (2016). Consistent cortical reconstruction and multi-atlas brain segmentation. NeuroImage, 138, 197–210. 10.1016/j.neuroimage.2016.05.030
- Jenkinson M, & Smith S (2001). A global optimisation method for robust affine registration of brain images. Medical Image Analysis, 5(2), 143–156. 10.1016/S1361-8415(01)00036-6
- Jeurissen B, Tournier J-D, Dhollander T, Connelly A, & Sijbers J (2014). Multi-tissue constrained spherical deconvolution for improved analysis of multi-shell diffusion MRI data. NeuroImage, 103, 411–426.
- Jiang Q, Zhang ZG, Ding GL, Silver B, Zhang L, Meng H, Lu M, Pourabdillah-Nejed-D. S, Wang L, Savant-Bhonsale S, Li L, Bagher-Ebadian H, Hu J, Arbab AS, Vanguri P, Ewing JR, Ledbetter KA, & Chopp M (2006). MRI detects white matter reorganization after neural progenitor cell treatment of stroke. NeuroImage, 32(3), 1080–1089. 10.1016/j.neuroimage.2006.05.025
- Jones DK, & Cercignani M (2010). Twenty-five pitfalls in the analysis of diffusion MRI data. NMR in Biomedicine, 23(7), 803–820.
- Lancaster JL, Tordesillas-Gutiérrez D, Martinez M, Salinas F, Evans A, Zilles K, Mazziotta JC, & Fox PT (2007). Bias between MNI and talairach coordinates analyzed using the ICBM-152 brain template. Human Brain Mapping, 28(11), 1194–1205. 10.1002/hbm.20345
- Lawes INC, Barrick TR, Murugam V, Spierings N, Evans DR, Song M, & Clark CA (2008). Atlas-based segmentation of white matter tracts of the human brain using diffusion tensor tractography and comparison with classical dissection. NeuroImage, 39(1), 62–79. 10.1016/j.neuroimage.2007.06.041
- Liao SC, Hunt EA, & Chen W (2010). Comparison between inter-rater reliability and inter-rater agreement in performance assessment. Annals Academy of Medicine Singapore, 39(8), 613.
- Llinás R, Ribary U, Contreras D, & Pedroarena G (1998). The neuronal basis for consciousness. Philosophical Transactions of the Royal Society B: Biological Sciences, 353(1377), 1841–1849. 10.1098/rstb.1998.0336
- Mandonnet E, Sarubbo S, & Petit L (2018). The nomenclature of human white matter association pathways: Proposal for a systematic taxonomic anatomical classification. Frontiers in Neuroanatomy, 12, 94.
- Mori S, Crain BJ, Chacko VP, & van Zijl PCM (1999). Three-dimensional tracking of axonal projections in the brain by magnetic resonance imaging. Annals of Neurology: Official Journal of the American Neurological Association and the Child Neurology Society, 45(2), 265–269. [DOI] [PubMed] [Google Scholar]
- Mori S, Oishi K, Jiang H, Jiang L, Li X, Akhter K, Hua K, Faria AV, Mahmood A, Woods R, Toga AW, Pike GB, Neto PR, Evans A, Zhang J, Huang H, Miller MI, van Zijl P, & Mazziotta J (2008). Stereotaxic white matter atlas based on diffusion tensor imaging in an ICBM template. NeuroImage, 40(2), 570–582. 10.1016/j.neuroimage.2007.12.035
- O’Donnell LJ, & Westin C-F (2007). Automatic tractography segmentation using a high-dimensional white matter atlas. IEEE Transactions on Medical Imaging, 26(11), 1562–1575.
- Oouchi H, Yamada K, Sakai K, Kizu O, Kubota T, Ito H, & Nishimura T (2007). Diffusion anisotropy measurement of brain white matter is affected by voxel size: Underestimation occurs in areas with crossing fibers. American Journal of Neuroradiology, 28(6), 1102–1106. 10.3174/ajnr.A0488
- Panesar S, & Fernandez-Miranda J (2019). Commentary: The Nomenclature of Human White Matter Association Pathways: Proposal for a Systematic Taxonomic Anatomical Classification. Frontiers in Neuroanatomy, 13, 61.
- Resnick SM, Pham DL, Kraut MA, Zonderman AB, & Davatzikos C (2003). Longitudinal magnetic resonance imaging studies of older adults: A shrinking brain. Journal of Neuroscience, 23(8), 3295–3301. 10.1523/jneurosci.23-08-03295.2003
- Rheault F, de Benedictis A, Daducci A, Maffei C, Tax CMW, Romascano D, Caverzasi E, Morency FC, Corrivetti F, Pestilli F, & others. (2020). Tractostorm: The what, why, and how of tractography dissection reproducibility. Human Brain Mapping.
- Rheault F, St-Onge E, Sidhu J, Chenot Q, Petit L, & Descoteaux M (2018). Bundle-Specific Tractography. In Computational Diffusion MRI (pp. 129–139). Springer.
- Schilling KG, Rheault F, Petit L, Hansen CB, Nath V, Yeh F-C, Girard G, Barakovic M, Rafael-Patino J, Yu T, & others. (2021). Tractography dissection variability: What happens when 42 groups dissect 14 white matter bundles on the same dataset? NeuroImage, 118502.
- Schilling KG, Tax CM, Rheault F, Hansen C, Yang Q, Yeh FC, … & Landman BA (2021). Fiber tractography bundle segmentation depends on scanner effects, vendor effects, acquisition resolution, diffusion sampling scheme, diffusion sensitization, and bundle segmentation workflow. NeuroImage, 118451.
- Schmidt MF, Storrs JM, Freeman KB, Jack CR, Turner ST, Griswold ME, & Mosley TH (2018). A comparison of manual tracing and FreeSurfer for estimating hippocampal volume over the adult lifespan. Human Brain Mapping.
- Skudlarski P, Jagannathan K, Calhoun VD, Hampson M, Skudlarska BA, & Pearlson G (2008). Measuring brain connectivity: Diffusion tensor imaging validates resting state temporal correlations. NeuroImage, 43(3), 554–561. 10.1016/j.neuroimage.2008.07.063
- Sporns O, Tononi G, & Kötter R (2005). The Human Connectome: A Structural Description of the Human Brain. PLoS Computational Biology, 1(4), e42. 10.1371/journal.pcbi.0010042
- Thiebaut de Schotten M, Foulon C, & Nachev P (2020). Brain disconnections link structural connectivity with function and behaviour. Nature Communications, 11(1), 1–8. 10.1038/s41467-020-18920-9
- Tournier J, Calamante F, Connelly A, & others. (2012). MRtrix: diffusion tractography in crossing fiber regions. International Journal of Imaging Systems and Technology, 22(1), 53–66.
- Tournier JD, Smith R, Raffelt D, Tabbara R, Dhollander T, Pietsch M, Christiaens D, Jeurissen B, Yeh CH, & Connelly A (2019). MRtrix3: A fast, flexible and open software framework for medical image processing and visualisation. NeuroImage, 202, 116137. 10.1016/j.neuroimage.2019.116137
- Tuch DS, & others. (2002). Diffusion MRI of complex tissue structure.
- van den Heuvel M, Mandl R, Luigjes J, & Pol HH (2008). Microstructural organization of the cingulum tract and the level of default mode functional connectivity. Journal of Neuroscience, 28(43), 10844–10851. 10.1523/JNEUROSCI.2964-08.2008
- van Essen DC, Smith SM, Barch DM, Behrens TEJ, Yacoub E, Ugurbil K, WU-Minn HCP Consortium, & others. (2013). The WU-Minn human connectome project: An overview. NeuroImage, 80, 62–79.
- Veenith TV, Carter E, Grossac J, Newcombe VFJ, Outtrim JG, Lupson V, Williams GB, Menon DK, & Coles JP (2013). Inter Subject Variability and Reproducibility of Diffusion Tensor Imaging within and between Different Imaging Sessions. PLoS ONE, 8(6). 10.1371/journal.pone.0065941
- Wakana S, Caprihan A, Panzenboeck MM, Fallon JH, Perry M, Gollub RL, Hua K, Zhang J, Jiang H, Dubey P, Blitz A, van Zijl P, & Mori S (2007). Reproducibility of quantitative tractography methods applied to cerebral white matter. NeuroImage, 36(3), 630–644. 10.1016/j.neuroimage.2007.02.049
- Wasserthal J, Neher P, & Maier-Hein KH (2018). TractSeg - Fast and accurate white matter tract segmentation. NeuroImage, 183, 239–253.
- Yeh FC (2020). Shape analysis of the human association pathways. NeuroImage, 223, 117329. 10.1016/j.neuroimage.2020.117329
- Yeh FC, Panesar S, Fernandes D, Meola A, Yoshino M, Fernandez-Miranda JC, Vettel JM, & Verstynen T (2018). Population-averaged atlas of the macroscale human structural connectome and its network topology. NeuroImage, 178, 57–68. 10.1016/j.neuroimage.2018.05.027
- Yeh FC, Wedeen VJ, & Tseng WYI (2010). Generalized q-sampling imaging. IEEE Transactions on Medical Imaging, 29(9), 1626–1635. 10.1109/TMI.2010.2045126
- Yendiki A, Panneck P, Srinivasan P, Stevens A, Zöllei L, Augustinack J, Wang R, Salat D, Ehrlich S, Behrens T, & others. (2011). Automated probabilistic reconstruction of white-matter pathways in health and disease using an atlas of the underlying anatomy. Frontiers in Neuroinformatics, 5, 23.
- Yoo SW, Guevara P, Jeong Y, Yoo K, Shin JS, Mangin J-F, & Seong J-K (2015). An example-based multi-atlas approach to automatic labeling of white matter tracts. PloS One, 10(7), e0133337.
- Zhang F, Cetin Karayumak S, Hoffmann N, Rathi Y, Golby AJ, & O’Donnell LJ (2020). Deep white matter analysis (DeepWMA): Fast and consistent tractography segmentation. Medical Image Analysis, 65, 101761. 10.1016/j.media.2020.101761
- Zhang F, Wu Y, Norton I, Rathi Y, Golby AJ, & O’Donnell LJ (2019). Test–retest reproducibility of white matter parcellation using diffusion MRI tractography fiber clustering. Human Brain Mapping, 40(10), 3041–3057.
Data Availability Statement
TractEM is an open-source project: both the data and the labeling guidelines are publicly available. Comments and discussion are open for each bundle. The framework enables versioning of bundle definitions as consensus definitions evolve. By crowdsourcing our results, we aim to obtain expert validation for each bundle. The raw input data, Talairach-aligned data, DSI-Studio-ready input data, and TractEM results are available on our website, https://my.vanderbilt.edu/tractem.