Comparing fully automated state-of-the-art cerebellum parcellation from magnetic resonance images

Aaron Carass; Jennifer L Cuzzocreo; Shuo Han; Carlos R Hernandez-Castillo; Paul E Rasser; Melanie Ganz; Vincent Beliveau; Jose Dolz; Ismail Ben Ayed; Christian Desrosiers; Benjamin Thyreau; José E Romero; Pierrick Coupé; José V Manjón; Vladimir S Fonov; D Louis Collins; Sarah H Ying; Chiadi U Onyike; Deana Crocetti; Bennett A Landman; Stewart H Mostofsky; Paul M Thompson; Jerry L Prince

doi:10.1016/j.neuroimage.2018.08.003

Author manuscript; available in PMC: 2019 Dec 1.

Published in final edited form as: Neuroimage. 2018 Aug 9;183:150–172. doi: 10.1016/j.neuroimage.2018.08.003


Aaron Carass ^a,^b,^*, Jennifer L Cuzzocreo ^c,^*, Shuo Han ^d,^e,^*, Carlos R Hernandez-Castillo ^f, Paul E Rasser ^g, Melanie Ganz ^h,^i, Vincent Beliveau ^h,^j, Jose Dolz ^k, Ismail Ben Ayed ^k, Christian Desrosiers ^k, Benjamin Thyreau ^l, José E Romero ^m, Pierrick Coupé ^n,^o, José V Manjón ^m, Vladimir S Fonov ^p, D Louis Collins ^p, Sarah H Ying ^q,^*, Chiadi U Onyike ^r,^*, Deana Crocetti ^s,^*, Bennett A Landman ^t,^*, Stewart H Mostofsky ^s,^q,^r,^*, Paul M Thompson ^u,^v,^*, Jerry L Prince ^a,^b,^*

^aDepartment of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218, USA

^bDepartment of Computer Science, The Johns Hopkins University, Baltimore, MD 21218, USA

^cDepartment of Radiology, The Johns Hopkins School of Medicine, Baltimore, MD 21287, USA

^dDepartment of Biomedical Engineering, The Johns Hopkins University, Baltimore, MD 21218, USA

^eLaboratory of Behavioral Neuroscience, National Institute on Aging, National Institutes of Health, Baltimore, MD 20892, USA

^fConsejo Nacional de Ciencia y Tecnología, Instituto de Neuroetología, Universidad Veracruzana, Xalapa, Mexico

^gPriority Research Centre for Brain & Mental Health and Stroke & Brain Injury, University of Newcastle, Callaghan NSW, Australia

^hNeurobiology Research Unit, Rigshospitalet, Copenhagen, Denmark

^iDepartment of Computer Science, University of Copenhagen, Copenhagen, Denmark

^jFaculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

^kLaboratory for Imagery, Vision, and Artificial Intelligence, École de Technologie Supérieure, Montreal, QC, Canada

^lInstitute of Development, Aging and Cancer, Tohoku University, Japan

^mInstituto Universitario de Tecnologías de la Información y Comunicaciones (ITACA), Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, España

^nUniversity of Bordeaux, LaBRI, UMR 5800, PICTURA, Talence, F-33400, France

^oCNRS, LaBRI, UMR 5800, PICTURA, Talence, F-33400, France

^pImage Processing Laboratory, Montreal Neurological Institute, McGill University, Montreal, Quebec, Canada

^qDepartment of Neurology, The Johns Hopkins School of Medicine, Baltimore, MD 21287, USA

^rDepartment of Psychiatry and Behavioral Sciences, The Johns Hopkins School of Medicine, Baltimore, MD 21287, USA

^sCenter for Neurodevelopmental Medicine and Imaging Research, Kennedy Krieger Institute, Baltimore, MD 21205, USA

^tDepartment of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37235, USA

^uImaging Genetics Center, Mark and Mary Stevens Institute for Neuroimaging and Informatics, Keck School of Medicine, University of Southern California, Marina del Rey, CA 90292, USA

^vDepartments of Neurology, Pediatrics, Psychiatry, Radiology, Engineering, and Ophthalmology, University of Southern California, Los Angeles, CA 90033, USA

^*These authors curated the data and organized the comparison; all others contributed results.

^✉Please address correspondence to: Aaron Carass, Department of Electrical and Computer Engineering, The Johns Hopkins University, 105 Barton Hall, 3400 N. Charles St., Baltimore, MD 21218, USA. aaron_carass@jhu.edu

PMCID: PMC6271471 NIHMSID: NIHMS1510954 PMID: 30099076

Abstract

The human cerebellum plays an essential role in motor control, is involved in cognitive function (i.e., attention, working memory, and language), and helps to regulate emotional responses. Quantitative in-vivo assessment of the cerebellum is important in the study of several neurological diseases including cerebellar ataxia, autism, and schizophrenia. Different structural subdivisions of the cerebellum have been shown to correlate with differing pathologies. To further understand these pathologies, it is helpful to automatically parcellate the cerebellum at the highest fidelity possible. In this paper, we coordinated with colleagues around the world to evaluate automated cerebellum parcellation algorithms on two clinical cohorts showing that the cerebellum can be parcellated to a high accuracy by newer methods. We characterize these various methods at four hierarchical levels: coarse (i.e., whole cerebellum and gross structures), lobe, subdivisions of the vermis, and the lobules. Due to the number of labels, the hierarchy of labels, the number of algorithms, and the two cohorts, we have restricted our analyses to the Dice measure of overlap. Under these conditions, machine learning based methods provide a collection of strategies that are efficient and deliver parcellations of a high standard across both cohorts, surpassing previous work in the area. In conjunction with the rank-sum computation, we identified an overall winning method.

Keywords: Magnetic resonance imaging, cerebellar ataxia, attention deficit hyperactivity disorder, autism

1. Introduction

The cerebellum is a structure of great importance in the neuroanatomy of humans. It plays an essential role in motor coordination (Ito, 1984; Manto et al., 2013), contributes to cognitive functions such as attention (Schmahmann, 1991, 2004), working memory (Desmond and Fiez, 1998), and language (Silveri et al., 1994; Desmond and Fiez, 1998), and regulates emotional responses (Schutter and Van Honk, 2005) including fear (Schmahmann and Caplan, 2006); there is also increasing understanding of perceptual processes in the cerebellum (Baumann et al., 2015). Anatomically, the cerebellum is nestled underneath the cerebral hemispheres behind the brainstem in the posterior cranial fossa. It is separated from the cerebrum by the tentorium cerebelli, a dural structure, and is connected to the brainstem at the pons. The cerebellum is divided into two hemispheres, like the cerebrum, and also has a midline zone which is known as the vermis. The cortical surface of the cerebellum is made up of finely spaced branches that radiate outwards from the cerebellar white matter (WM), which is known as the corpus medullare (CM). These WM branches conceal the fact that the volume of the cerebellum is a tightly folded layer of gray matter (GM). Anatomists differentiate regions of the cerebellum hierarchically into groups of folds, known as lobes, and then into individual folds, referred to as lobules. The lobes are the anterior, superior posterior, inferior posterior, and the flocculonodular. The lobules are identified by Roman numerals I through X (Schmahmann et al., 2000); however, Lobules VII and VIII are further differentiated. This nomenclature comes from Schmahmann et al. (2000), derived in part from Larsell (1952); we refer to it as the Schmahmann nomenclature and note the differences between it and the classical nomenclature (Malacarne, 1776; Henle, 1879) in Table 1. Figure 1 shows the anatomical structure of the cerebellum, including the hierarchical breakdown of the lobes and lobules.
Due to the importance of the cerebellum, any pathology can have serious consequences; however, the tightly folded structure of the cerebellum makes identifying specific structures challenging. Below we outline the clinical relevance of understanding the structure of the cerebellum and the various effects of cerebellar pathologies; we then provide an overview of the fully automated parcellation tools that exist in the literature.

Table 1:

A key to convert between the nomenclature of Schmahmann (Schmahmann et al., 2000), derived from Larsell (1952), and the classical nomenclature (Malacarne, 1776; Henle, 1879) of common cerebellar structures.

Vermal Nomenclature		Hemisphere Nomenclature
Schmahmann	Classical	Schmahmann	Classical

Vermis I / II^†	Lingula	L/R Lobule I/II	L/R Lingula (or Lingulae)
Vermis III^†	Centralis	L/R Lobule III	L/R Centralis
Vermis IV^†	Culmen I	L/R Lobule IV	L/R Quadrangularis
Vermis V^†	Culmen II	L/R Lobule V	L/R Quadrangularis
Vermis VI	Declive	L/R Lobule VI	L/R Quadrangularis
Vermis VIIAf	Folium	L/R Lobule VIIAf (Crus I)	L/R Semi-Lunaris Superior
Vermis VIIAt	Tuber I	L/R Lobule VIIAt (Crus II)	L/R Semi-Lunaris Inferior
Vermis VIIB	Tuber II	L/R Lobule VIIB	L/R Semi-Lunaris Inferior
Vermis VIIIA	Pyramis I	L/R Lobule VIIIA	L/R Biventer I
Vermis VIIIB	Pyramis II	L/R Lobule VIIIB	L/R Biventer II
Vermis IX	Uvula	L/R Lobule IX	L/R Tonsilla (or Tonsil)
Vermis X	Nodulus	L/R Lobule X	L/R Flocculus


^† It is widely acknowledged that there is no true vermis for the Anterior Lobe (Lobules I–V). The division in our Pediatric Cohort differentiates the midline portion of the Anterior Lobe from the body of the lobe.

Figure 1: An illustration of a coronal view of one hemisphere of the human cerebellum (with orientation axes inset in **(a)**). Shown are the lobule labels for our **(a)** Adult and **(b)** Pediatric Cohort with their corresponding lobe groupings, based on the Schmahmann nomenclature (Schmahmann et al., 2000). Table 4 has a complete list of the provided labels for both cohorts. It is widely acknowledged that there is no true vermis for the Anterior Lobe (Lobules I–V). Thus the distinction between vermis and body in the Anterior Lobe differentiates the midline portion from the body of the lobe. Our Adult Cohort does not use this differentiation, whereas our Pediatric Cohort does.

Several anatomical studies have elucidated the view that the cerebellum is involved in more than just motor control (Baumann et al., 2015; Schmahmann, 2004; Schmahmann and Caplan, 2006; Strick et al., 2009). As such, the cerebellum projects to a diverse set of cortical areas via the thalamus, which then close the circuitry loop by reciprocating back to the cerebellum. It is now apparent that multiple cortical areas are the target of cerebellar output, including not only the primary motor cortex, but also subdivisions of premotor, oculomotor, prefrontal, and inferotemporal areas of cortex (Middleton and Strick, 2000); each of which form cerebello-thalamo-cortical circuits. Moreover, several clinical studies have shown that dysfunction in individual cerebellar loops with the cerebral cortex may underlie the development of specific neurological and psychiatric symptoms (Allen and Courchesne, 2003; Amaral et al., 2008; Andreasen and Pierson, 2008; Gottwald et al., 2004). Furthermore, clinical studies of cerebellum centric disorders, such as spinocerebellar ataxia (SCA), have been previously shown to have cerebellar shape (Yang et al., 2016a), clinical disability scores (Ying et al., 2006), and functional scores (Yang et al., 2014; Kansal et al., 2016) that correlate with SCA subtype in a region specific manner. More importantly, the cerebellum has been shown to be affected in diseases ranging from attention-deficit and hyperactivity disorder (ADHD) (Mostofsky et al., 1998b), schizophrenia (Nopoulos et al., 1999; Parker et al., 2014), Alzheimer’s disease (Thomann et al., 2008; Colloby et al., 2014), Parkinson’s disease (Lewis et al., 2013), to chronic alcoholism (Victor et al., 1959; Torvik and Torp, 1986; Cavanagh et al., 1997; Baker et al., 1999; Fitzpatrick et al., 2008). 
In patients with schizophrenia, a reduction in the volume of the vermis has been observed in multiple studies (Nopoulos et al., 1999; Okugawa et al., 2002, 2003) based on the manual parcellation of the cerebellum. Moreover, when the vermis has been further subdivided into the anterior and posterior portions, the volume differences are driven by changes in the posterior vermis (Womer et al., 2016) with a significant diagnosis-by-sex interaction. Several types of dementia exhibit correlations with the cerebellum; Alzheimer’s disease (AD) has shown a reduction in the volume of the posterior lobes (Thomann et al., 2008), whereas dementia with Lewy bodies has shown greater GM loss in Lobule VII than AD (Colloby et al., 2014). Several recent voxel based morphometry (VBM) studies have shown regional patterns of atrophy between AD and cerebellar GM and WM (Möller et al., 2013) and correlations between GM loss and the constructional praxis and constructional praxis recall test in the CERAD test battery (Dos Santos et al., 2011). However, older studies (Karas et al., 2003) that relied upon studying large regions—due to the FWHM size used in the VBM—showed no significant GM loss in the cerebellum suggesting that the effects of cerebellum/AD interaction can only be identified when smaller regions of interest are used. We highlight these studies and some of the other work done in improving our understanding, through neuroimaging, of the cerebellum and its role in health and disease in Table 2. However, Table 2 is far from a complete list of such studies, see Stoodley (2014) and Traut et al. (2018) for more comprehensive lists. There are two key points to take from this past work: 1) in-vivo assessment of the cerebellum through magnetic resonance (MR) imaging (MRI) is imperative to further our understanding and 2) manual parcellation or delineation remains a widely used approach for studying the cerebellum.

Table 2:

A summary of some cerebellar focused imaging studies exploring various pathologies. We include whether the study used manual delineations (MD) and the key cerebellar related findings. N (M/F) denotes the number of patients and the male/female ratio. Abbreviations: ADHD - Attention-deficit and hyperactivity disorder; AD - Alzheimer’s disease; AS - Asperger syndrome.

Disease	Citation	N (M/F)	MD	Observations
ADHD	Mostofsky et al. (1998b)	35 (35/0)	Y	Decreased inferior posterior vermis
	Durston et al. (2004)	90 (90/0)	N	Reduced right cerebellar volume
Alcoholism	Torvik and Torp (1986)	65 (65/0)	Y	Decreased vermis segments
	Baker et al. (1999)	19 (14/5)	Y	Non-significant loss in vermis and flocculus
AD	Thomann et al. (2008)	60 (29/31)	Y	Decreased superior and inferior posterior lobes
	Möller et al. (2013)	344 (175/169)	–^†	Reduced GM throughout the cerebellum
	Colloby et al. (2014)	127 (84/43)	–^†	Bilateral reduction of Lobule VI
AS	Catani et al. (2008)	31 (31/0)	Y	Reduced cerebellar fractional anisotropy
Autism	Courchesne et al. (1994)	103 (84/19)	Y	Reduced area in the vermis of Lobule VI and VII
	Cleavinger et al. (2008)	65 (65/0)	Y	No significant differences
	Webb et al. (2009)	71 (56/15)	Y	Reduced area in various vermal labels
	D’Mello et al. (2015)	70 (51/19)	N	Reduced GM in Lobule VII
Dyslexia	Brambati et al. (2004)	21 (10/11)	–^†	Reduced cerebellar GM volume
	Jednoróg et al. (2013)	81 (39/42)	–^†	Reduced GM volume in left Lobule I/II
Fragile X Syndrome	Mostofsky et al. (1998a)	188 (98/90)	Y	Decreased posterior vermis in males and females, though less significant in females.
Schizophrenia	Nopoulos et al. (1999)	130 (130/0)	N^‡	Smaller vermis area and smaller anterior lobe
	Okugawa et al. (2002)	30 (30/0)	N^‡	Reduced posterior superior vermis
	Okugawa et al. (2003)	116 (73/43)	N^‡	Reduced anterior vermis, posterior superior vermis, and posterior inferior vermis volumes
	Womer et al. (2016)	104 (48/56)	Y	Decreased posterior vermis volumes in males


^† The studies did not differentiate regions of the cerebellum and based their assessment on an anatomist's interpretation of the areas of change.

^‡ Automated processing for cerebellar volumes based on registration to a Talairach Atlas, augmented by manual tracings of the vermis.

Despite the continued use of manual delineation of the cerebellum in various studies (Womer et al., 2016) there has been work on both semi-automated (Pierson et al., 2002) and fully automated segmentation and parcellation of the cerebellum. Automating the parcellation, and consequently the volumetric analyses, of the cerebellum has benefits outside of the obvious efficiencies of speed and cost. Automated parcellation has been shown to remove manual bias, thus increasing consistency and comparability across studies and sites (Buckner et al., 2004; Hsu et al., 2002). We are only concerned with those methods that provide at a minimum the lobes of the cerebellum; hence methods like FreeSurfer (Dale et al., 1999; Fischl et al., 2004; Reuter and Fischl, 2011; Fischl, 2012), BrainSuite (Shattuck et al., 2001; Shattuck and Leahy, 2002; Shattuck et al., 2008), TOADS (Bazin and Pham, 2008; Shiee et al., 2010), CRUISE (Tosun et al., 2006; Landman et al., 2013), CRUISE+ (Shiee et al., 2014), MA-CRUISE (Huo et al., 2016a,b), and others (Carass et al., 2017c; Desikan et al., 2006; Guo et al., 2017; Ledig et al., 2015; Liu et al., 2012; Roy et al., 2015; Shao et al., 2018; Shiee et al., 2011; Tomas-Fernandez and Warfield, 2015; Van Leemput et al., 1999; Zhang et al., 2001; Zhao et al., 2017) that only provide tissue classes or coarse parcellations of the cerebellum are not directly relevant unless used in combination with other tools. The first published method that provided a fully automated parcellation of the cerebellar lobules was SUIT (Diedrichsen, 2006); the method used a spatially unbiased template of the human cerebellum that when registered with a subject image provided the parcellation. The method was later updated (Diedrichsen et al., 2009) to include a probabilistic atlas. As powerful as SUIT is in identifying the subdivisions of the cerebellum, it has primarily been used only for identifying cerebellar GM as a normalizing factor in functional MRI analysis. 
Prior to the introduction of the probabilistic version of SUIT, Powell et al. (2008) presented machine learning approaches for cerebellar parcellation that identified the lobes and vermis of the cerebellum. Bogovic et al. (2013a) presented ACCLAIM, a multi-object geometric deformable model (Bogovic et al., 2013c; Carass and Prince, 2016) approach that provides a parcellation of 28 labels of the cerebellum and included a comparison to SUIT. Price et al. (2014) presented the Cerebellar Analysis Toolkit (CATK), which used Bayesian Appearance Modeling (Patenaude et al., 2011) with prior knowledge of shape, image intensity, and inter-shape relationships to provide five cerebellar labels. Weier et al. (2014) described the Rapid Automatic Segmentation of the human Cerebellum And its Lobules (RASCAL), a patch-matching approach that improved on the multi-atlas segmentation fusion technique presented in Coupe et al. (2011). Romero et al. (2017) presented CERES, another patch-matching technique, which uses OPAL (Giraud et al., 2016; Ta et al., 2014) for its label fusion. Yang et al. (2016b) presented a multi-atlas labeling approach that used a graph-cut to help regularize the final segmentation. Several other methods have been reported in the literature (van der Lijn et al., 2009; Park et al., 2014; Plassard et al., 2016). A more detailed description of several of these methods is provided in Appendix A to help describe the approaches presented in this paper. To summarize, the previous work in this area includes: single and multi atlas registration, level sets, graph methods, a Bayesian framework, neural networks, support vector machines, and patch matching. Table 5 presents an overview of the methods presented and evaluated in this paper. Deep learning, an important new class of algorithms in medical imaging, is represented among the methods tested in this paper.

Table 5:

An overview of the methods used in our comparison, with details of each method listed in the remainder of this Section.

Name	Approach
SUIT^*	Default SUIT v3.2
C-SUIT^*	C-SUIT is a customized SUIT, with Correction and Customized Atlas based on the Pediatric Cohort
FS-SUIT	FreeSurfer and SUIT in collaboration
LiviaNET	A thirteen layer fully convolutional network (FCN)
ConvNet	Convolutional neural network
CERES2	Updated version of CERES with improved intensity normalization and a new error correction method based on an ensemble of boosted patch-based neural networks
RASCAL	Updated patch-matching technique with cohort specific templates, improved intensity normalization, and non-linear registration
DeepNet	A U-net based FCN with ten layers


^* Denotes methods that only contributed results for the Pediatric Cohort.

There has been an increasing movement towards Grand Challenges (Styner et al., 2008; Schaap et al., 2009; Heimann et al., 2009; Menze et al., 2015; Mendrik et al., 2015; Maier et al., 2017; Carass et al., 2017a) in the medical imaging community in recent years. These challenges have helped to develop standards for evaluating performance on different categories of medical imaging problems and have helped those on the periphery of the community to understand the state-of-the-art and the general direction in which the technology is moving. In particular, the 2008 MICCAI MS Lesion challenge (Styner et al., 2008) was a significant step forward in the sharing of clinically relevant data. More recently, the 2015 Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) (Menze et al., 2015) has been a disruptive step forward, having allowed groups without access to high-quality data with delineations to contribute innovative new solutions for segmenting brain tumors (Sauwen et al., 2015; Banerjee et al., 2016; Kamnitsas et al., 2017).

Thus, in the spring of 2017, we invited colleagues from around the world to participate in a Cerebellum Parcellation Challenge as part of MICCAI 2017. As only eight groups responded to this call, it was decided that the workshop itself would not go forward due to lack of broad interest. Coming out of the discussions for this Cerebellum Parcellation Challenge (hereafter the Comparison), it was agreed that we would present the performance findings from seven of the research teams who participated in the Comparison (hereafter the Participants). In Section 2, we outline the two cohorts of data that were provided to the Participants and the evaluation used in comparing the submitted results from each of the Participants. One of the Participants submitted two methods; however, two of the Participants contributed no results for our first cohort. Thus, that cohort has results from six algorithms, while our other cohort was processed by eight algorithms. Both cohorts are imaged using standard clinical protocols with an approximately 1 mm isotropic resolution, with complete details of the acquisition in Section 2. In our examination of these data and methods, we restrict our analyses to the Dice overlap; we outline our rationale behind this decision in Section 2.2. Section 3 provides a brief description of the methods contributed by the Participants for the Comparison, with a complete description of the presented methods in Appendix A.
Section 4 includes the Comparison between the manual delineations for our two cohorts and the algorithms; it is broken down into hierarchical levels: 1) Coarse level including the whole cerebellum, whole vermis, and CM (3 labels); 2) Lobe level including the left and right of the four lobes (8 labels); 3) Vermis level including the subdivisions of the vermis (5 labels for our Adult Cohort, 3 labels for our Pediatric Cohort); 4) Lobule level (22 labels for our Adult Cohort, 14 labels for our Pediatric Cohort); and 5) Consolidated level, with further details in Sec. 4. In general the methods show agreement with the manual delineations of the cerebellar structures. However, the size of our cohorts restricted our statistical analyses, with rank-sum computations being used to determine an overall highest ranked method.
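
One way to picture the hierarchical levels listed above is as a pooling of fine labels into coarser ones before an overlap score is computed. The sketch below is illustrative only: the integer label IDs, the `LOBE_GROUPS` mapping, and the `merge_labels` and `dice` helpers are hypothetical names chosen for this example, and the label values in the distributed data may well differ.

```python
import numpy as np

# Hypothetical integer IDs for a few left-hemisphere lobule labels.
# The real label values in the distributed data may differ.
LOBE_GROUPS = {
    "L Anterior": [1, 2, 3],          # e.g. Lobules I-III, IV, V
    "L Superior Posterior": [4, 5],   # e.g. Lobule VI, Crus I
}


def merge_labels(label_map, groups):
    """Return one boolean mask per coarse label by pooling its fine label IDs."""
    label_map = np.asarray(label_map)
    return {name: np.isin(label_map, ids) for name, ids in groups.items()}


def dice(mask_g, mask_a):
    """Dice overlap of two boolean masks (assumes at least one is non-empty)."""
    return 2.0 * np.logical_and(mask_g, mask_a).sum() / (mask_g.sum() + mask_a.sum())


# Toy 2-D "label maps" standing in for a manual and an automated parcellation.
manual = np.array([[1, 1, 2], [3, 4, 5], [0, 4, 5]])
auto = np.array([[1, 2, 2], [3, 3, 5], [0, 4, 4]])

manual_lobes = merge_labels(manual, LOBE_GROUPS)
auto_lobes = merge_labels(auto, LOBE_GROUPS)
lobe_dice = {name: dice(manual_lobes[name], auto_lobes[name]) for name in LOBE_GROUPS}

# Lobule-level score for a single fine label (ID 2), for comparison.
lobule_dice = dice(manual == 2, auto == 2)
```

In this toy case both lobe-level overlaps are higher than the overlap for the single lobule, because disagreements about boundaries between lobules of the same lobe disappear once the labels are pooled.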

2. Materials and Metrics

2.1. Data

The Participants were given data from our Adult and Pediatric Cohorts. Our Adult Cohort is an expertly labeled data set collected by the Image Analysis and Communications Laboratory (IACL) at Johns Hopkins University (PI: J.L. Prince); complete details of the delineation protocol and the inter-rater variability are in Bogovic et al. (2013b). Our Pediatric Cohort comprises data collected at the Center for Neurodevelopmental Medicine and Imaging Research at the Kennedy Krieger Institute (PI: S.H. Mostofsky). The Pediatric Cohort was labeled using a cerebellar atlas developed at the Center for Neurodevelopmental Medicine and Imaging Research. The cerebellar atlas is based on a highly reliable manual parcellation protocol with interclass correlation coefficients ranging from 0.86 to 0.99 across anatomically defined subdivisions as outlined in Table 4. The Participants were also encouraged to take advantage of other available data sets; in particular, they were made aware of data provided by Jörn Diedrichsen of the University of Western Ontario¹. The Diedrichsen data comprises 20 normal adult subjects, each of which has 30 labeled cerebellar components.

Table 4:

The labeled cerebellar structures of both cohorts. For reference, we include a key to convert between the Schmahmann and classical nomenclature in Table 1.

Adult Cohort (Healthy Controls and Ataxia Patients)		Pediatric Cohort (Healthy Controls, ADHD & HFA Patients)	
Major Structure	Cerebellar Sub-components	Major Structure	Cerebellar Sub-components
Corpus Medullare		Corpus Medullare	
Vermis	Vermis of Lobule VI	Vermis	Vermis of Lobule I–V
	Vermis of Lobule VII		Vermis of Lobule VI–VII
	Vermis of Lobule VIII		Vermis of Lobule VIII–X
	Vermis of Lobule IX	L/R Anterior	L/R Lobule I–V
	Vermis of Lobule X		
L/R Anterior	L/R Lobule I / II / III	L/R Superior Posterior	L/R Lobule VI
	L/R Lobule IV		L/R Lobule VIIAf (Crus I)
	L/R Lobule V		L/R Lobule VIIAt (Crus II) & VIIB
L/R Superior Posterior	L/R Lobule VI	L/R Inferior Posterior	L/R Lobule VIII
	L/R Lobule VIIAf (Crus I)		L/R Lobule IX
	L/R Lobule VIIAt (Crus II)	L/R Flocculonodular	L/R Lobule X
	L/R Lobule VIIB		
L/R Inferior Posterior	L/R Lobule VIIIA		
	L/R Lobule VIIIB		
	L/R Lobule IX		
L/R Flocculonodular	L/R Lobule X		


Our Adult Cohort contains 20 subjects, a mix of healthy controls and ataxia patients, each with 28 labeled cerebellar components (complete demographic information is provided in Table 3; see Fig. 2 for an example image and corresponding manual labels). Fifteen training examples were provided to the Participants, and the remaining five data sets were used for testing, with the goal being to label the cerebella of the test subjects to best agree with the expert labels. Magnetization prepared rapid gradient echo (MP-RAGE) images were acquired on a 3.0 T MR scanner (Intera, Philips Medical Systems, Netherlands) with the following parameters: 1.1 mm slice thickness, 8° flip angle, TE = 3.9 ms, TR = 8.43 ms, FOV 21.2 × 21.2 cm, image matrix of 256 × 256. The images were resampled to have a 1.0 mm isotropic voxel; subsequently they were defaced using mri_deface from FreeSurfer (v5.3) (Fischl, 2012), a skull stripping mask was generated using SPECTRE (Carass et al., 2007, 2010), and the skull-stripped image was white matter (WM) peak normalized so that all images have a consistent WM peak intensity (Nyúl and Udupa, 1999a). For the training data, the defaced MR image, the WM peak skull-stripped image, and the expert manual cerebellar parcellations were provided to the Participants. For the test subjects only the defaced MR image and the WM peak skull-stripped image were provided. All images in this cohort were acquired in an axial orientation. Examples of the defaced and WM peak skull-stripped images for one data set are shown in Fig. 2 with the corresponding manual delineation.

Table 3:

Demographic details for the training and test data for both cohorts. The top line is the information of the entire data set, while subsequent lines within a section are specific to the patient diagnoses. N (M/F) denotes the number of patients and the male/female ratio, respectively. The Age column lists the mean, standard deviation, min, and max, in years, at scan time. The codes for the patient groups are: HC – Healthy controls; CB – Symptoms of cerebellar dysfunction without genetic diagnosis; SCA6 – Spinocerebellar ataxia type 6; ADHD – Attention-deficit and hyperactivity disorder; HFA – High-functioning Autism.

Data Set	N (M/F)	Age
		Mean (SD)	[Min, Max]
Adult Cohort
Training	15 (5/10)	54.7(±11.97)	[30.0, 71.0]
HC	6 (2/4)	54.3(±14.69)	[30.0, 71.0]
CB	3 (1/2)	54.3(±8.02)	[46.0, 62.0]
SCA6	6 (2/4)	55.3(±12.60)	[35.0, 70.0]

Testing	5 (5/0)	69.2(±5.81)	[62.0, 78.0]
CB	5 (5/0)	69.2(±5.81)	[62.0, 78.0]

Pediatric Cohort
Training	20 (7/13)	10.1(±1.36)	[8.3, 13.2]
HC	10 (4/6)	10.2(±1.33)	[8.4, 13.2]
ADHD	7 (0/7)	10.4(±1.61)	[8.3, 12.2]
HFA	3 (3/0)	9.2(±0.65)	[8.5, 9.7]

Testing	10 (3/7)	10.1(±1.29)	[8.4, 12.6]
HC	5 (1/4)	9.9(±1.04)	[8.4, 11.2]
ADHD	3 (0/3)	10.2(±1.06)	[9.2, 11.3]
HFA	2 (2/0)	10.6(±2.76)	[8.7, 12.6]


Figure 2: For our Adult Cohort, we show a cropped portion of a typical axial slice of **(a)** the defaced MP-RAGE, **(b)** the skull-stripped MP-RAGE, and **(c)** the manual labels with a corresponding color key for the prominent labels. The images are shown in radiological convention. A complete list of all the labels for the Adult Cohort is provided in Table 4. Results of the methods on the same data are shown in Fig. 4.

Our Pediatric Cohort comprises data collected at the Center for Neurodevelopmental Medicine and Imaging Research at the Kennedy Krieger Institute (PI: S.H. Mostofsky). These 30 expertly labeled data sets, with 18 labeled cerebellar components, are from 8–12 year old boys and girls with a mix of healthy controls, ADHD and high-functioning Autism (HFA) patients (complete demographic information is provided in Table 3). Twenty of these were provided as training data and ten were reserved for testing. The objective was to label these cerebella to best agree with the expert labels. The provided MR images were MP-RAGE, acquired on a 3T Philips Gyroscan NT (Royal Philips Electronics) system with the following parameters: 1 mm slice thickness, 8° flip angle, TE = 3.0 ms, TR = 7.0 ms, image matrix of 256 × 256. The Pediatric Cohort was preprocessed in an identical manner to our Adult Cohort; specifically, the images were defaced using mri_deface, skull-stripped using SPECTRE, and the skull-stripped image was WM peak normalized. For each of the 20 training images, the defaced MR image, the WM peak skull-stripped image, and the expert manual cerebellar parcellation were provided to the Participants. For the test subjects only the defaced MR image and the WM peak skull-stripped image were provided. All images in this cohort were acquired in a coronal orientation. Examples of the defaced and WM peak skull-stripped images for one training data set are shown in Fig. 3 with the corresponding manual delineation. A complete list of the labels provided for the two cohorts is available in Table 4 and a key is provided in Table 1 to convert between the Schmahmann (Schmahmann et al., 2000) and classical (Malacarne, 1776; Henle, 1879) nomenclature.

Figure 3: For our Pediatric Cohort, we show a cropped portion of a typical coronal slice of **(a)** the defaced MP-RAGE, **(b)** the skull-stripped MP-RAGE, and **(c)** the manual labels with a corresponding color key for the prominent labels. The images are shown in radiological convention. A complete list of all the labels for the Pediatric Cohort is provided in Table 4. Results of the methods on the same data are shown in Fig. 9.

2.2. Comparison Metric

To compare the results from the available methods with our expert delineations, we used the Dice overlap (Dice, 1945). The Dice overlap is a commonly used volume metric for comparing label masks. If $M_{G}$ is the gold standard mask of a human rater and $M_{A}$ is the mask generated by a particular algorithm, then the Dice overlap for binary objects is computed as

$$\mathrm{Dice}(M_{G}, M_{A}) = \frac{2\,| M_{G} \cap M_{A} |}{| M_{G} | + | M_{A} |},$$

where | · | is the cardinality (number of voxels). This overlap measure has values in the range [0,1], with 0 indicating no agreement between the two masks, and 1 meaning the two masks are identical. We have chosen to explicitly restrict our analysis to the Dice overlap for two reasons: 1) it is a widely reported and understood measure; 2) given the large number of labels, the hierarchy of labels (from coarse to fine), the number of algorithms, and the two cohorts that we report on, reporting multiple measures would be very lengthy. We note that in two recent challenge papers (Carass et al., 2017b; Maier et al., 2017) the final rankings of the methods, which used multiple metrics, were well correlated with the Dice overlap; see Table 7 in Maier et al. (2017) for example. A benefit of using a single measure in this manner is the clarity that is afforded in declaring a best method. We comment more on the pros and cons of this evaluation in Section 5.
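As a minimal sketch of the definition above, the Dice overlap can be computed directly from the cardinalities of two voxel sets. The function name `dice_overlap`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; they are not taken from the evaluation code used in this Comparison.

```python
def dice_overlap(mask_g, mask_a):
    """Dice overlap of two binary masks, each given as a set of voxel indices."""
    g, a = set(mask_g), set(mask_a)
    if not g and not a:
        # Both masks empty gives 0/0; returning 1.0 is a convention assumed here.
        return 1.0
    return 2.0 * len(g & a) / (len(g) + len(a))


# Hypothetical toy masks: voxel indices (x, y, z) labeled as foreground.
gold = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
auto = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}

print(dice_overlap(gold, auto))  # 2 * 2 / (4 + 3), about 0.571
```

In practice the masks would typically be boolean image arrays rather than Python sets; the set form is used here only because it mirrors the cardinality notation of the formula.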

Table 7:

For the Adult Cohort, we show the p-value for the two-sided Wilcoxon paired signed-rank test comparing the second (LiviaNET) and third (DeepNet) placed teams to the top (CERES2) ranked team across the four hierarchies (Coarse, Lobe, Vermis, Lobule) of labeling and also the combination of all 38 labels (Consolidated). The mean Dice overlap for each method, at the respective hierarchy, is shown next to the method’s name.

Hierarchy	Top method (mean Dice overlap)	Compared method (mean Dice overlap)	p-value
Coarse	CERES2 0.9118	vs. LiviaNET 0.8967	6.9 × 10^{−3} †
Coarse	CERES2 0.9118	vs. DeepNet 0.8908	6.1 × 10^{−5} ‡
Lobe	CERES2 0.8395	vs. LiviaNET 0.8289	2.2 × 10^{−1}
Lobe	CERES2 0.8395	vs. DeepNet 0.8021	1.9 × 10^{−4} †
Vermis	CERES2 0.8302	vs. LiviaNET 0.8012	1.2 × 10^{−2}
Vermis	CERES2 0.8302	vs. DeepNet 0.8003	5.6 × 10^{−4} †
Lobule	CERES2 0.7657	vs. LiviaNET 0.7168	5.5 × 10^{−5} ‡
Lobule	CERES2 0.7657	vs. DeepNet 0.7382	1.2 × 10^{−5} ‡
Consolidated	CERES2 0.8013	vs. LiviaNET 0.7657	3.0 × 10^{−7} ‡
Consolidated	CERES2 0.8013	vs. DeepNet 0.7719	3.1 × 10^{−12} ‡

† Denotes weak statistical significance (p-value < 0.001).

‡ Denotes strong statistical significance (p-value < 0.0001).

3. Methods Overview

We introduce each method with a three-line summary: the first line includes a colored square (that is used in subsequent plots and figures for quick reference) and the name of the method; the second is a one-line summary of the method; and the third gives, in parentheses, the Participant(s) that contributed the method. Following each of the summaries is a brief overview of the respective method; a complete description of each method is available in Appendix A. A brief summary of each of the methods is provided in Table 5.

SUIT

Default SUIT v3.2

(Carlos H. Castillo)

Cerebellar isolation is performed using the unified segmentation (Ashburner and Friston, 2005) of SPM12. Then the cerebellar cortex is normalized into the spatially unbiased atlas template of the human cerebellum (SUIT) toolbox v3.2 (Diedrichsen et al., 2009), using a fast diffeomorphic normalization algorithm (DARTEL) (Ashburner, 2007). The probabilistic atlas included in the SUIT toolbox identifies the cerebellar lobular boundaries. The inverse warp deformation field is then calculated and applied to map the SUIT atlas into a subject’s native space.

C-SUIT

Customized-SUIT (C-SUIT) with Corrections and Customized Atlas based on the Pediatric Cohort

(Paul Rasser)

The Pediatric data is preprocessed using FreeSurfer (v5.3) (FS) (Fischl, 2012) and then mapped into the SUIT space. In this space a customized pediatric atlas is constructed for application to the Pediatric Cohort. The ten test subjects from the Pediatric Cohort are also preprocessed with FreeSurfer and then diffeomorphically transformed into the customized atlas.

FS-SUIT

FreeSurfer and SUIT in collaboration for Cerebellar Segmentation

(Melanie Ganz & Vincent Beliveau)

Whole brain and cerebellar GM and WM segmentation of structural MRI data is performed with FreeSurfer. FS-SUIT augments FreeSurfer with SUIT (v2.7) (Diedrichsen et al., 2009) to identify cerebellar lobules. To unify the results from FreeSurfer and SUIT into a coherent cerebellum lobule parcellation, a final segmentation is created by limiting the SUIT lobule parcellation to its intersection with the FreeSurfer cerebellar GM.
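The intersection step described above can be illustrated with a small sketch. This is not the FS-SUIT implementation: the function `restrict_to_gm`, the dictionary representation of a label image, and the label values are all hypothetical, and background is assumed to be label 0.

```python
def restrict_to_gm(lobule_labels, gm_voxels, background=0):
    """Keep lobule labels only at voxels inside the cerebellar GM mask."""
    return {
        voxel: (label if voxel in gm_voxels else background)
        for voxel, label in lobule_labels.items()
    }


# Hypothetical toy data: voxel index -> lobule label (0 is background).
suit_lobules = {0: 0, 1: 5, 2: 5, 3: 6, 4: 6}
freesurfer_gm = {1, 2, 3}  # voxels marked as cerebellar GM by the GM segmentation

final = restrict_to_gm(suit_lobules, freesurfer_gm)
print(final)  # {0: 0, 1: 5, 2: 5, 3: 6, 4: 0}
```

Voxel 4 carries a lobule label in the atlas-based parcellation but lies outside the GM mask, so it is reset to background in the final segmentation.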

LiviaNET

Cerebellum parcellation from a deep learning perspective

(Jose Dolz, Ismail Ben Ayed, & Christian Desrosiers)

LiviaNET is built on the FCN described in Dolz et al. (2018) which had state-of-the-art performance for subcortical brain segmentation. To ensure that the network contains only convolutional layers, fully-connected layers are converted to a collection of 1 × 1 × 1 convolutions (Kamnitsas et al., 2017). As the structures in the cerebellum are often thinner than subcortical structures, to avoid losing small details when passing the target structures through several convolutional blocks, LiviaNET embeds the feature maps from all layers into the fully-connected layers.

ConvNet

Cerebellum segmentation using convolutional neural networks

(Benjamin Thyreau)

ConvNet learns to segment the MRI using the expert labels as training data. ConvNet was intended to investigate whether whole-image input, as opposed to patch-based, could better capture high-level structure and human-expert variation. ConvNet took advantage of the overlap between the labeling schemes in both cohorts, which provided for data augmentation to train a base network that is then refined for the two cohorts separately.

CERES2

Cerebellum multi-atlas patch-based segmentation with a patch-based boosted neural network error corrector

(José E. Romero, Pierrick Coupé, & José V. Manjón)

CERES2 is a new version of CERES (Romero et al., 2017), a cerebellum lobule segmentation algorithm based on a recent method called Optimized Patch-Match Label fusion (OPAL) (Giraud et al., 2016; Ta et al., 2014). The method consists of a multi-atlas patch-based (Rousseau et al., 2011; Coupe et al., 2011) non-local label fusion technique that produces segmentations using a library of manually annotated cases. CERES2 improves on CERES by using a different intensity normalization method and by adding a systematic error correction step based on an ensemble of patch-based boosted neural networks.

RASCAL

Patch-based label fusion

(Vladimir S. Fonov and D. Louis Collins)

The previously published RASCAL (Rapid Automatic Segmentation of the Human Cerebellum and its Lobules) (Weier et al., 2014) was adapted for use with the two cohorts. The data was preprocessed as follows: 1) linear registration to MNI-ICBM152 2009c stereotaxic space (Fonov et al., 2010); 2) linear intensity normalization based on quantile matching to normalize the intensity range to the MNI-ICBM152 2009c template; 3) extraction of a brain mask by thresholding the provided SPECTRE brainmask; 4) creation of an unbiased population-specific template (Fonov et al., 2010), which was used as the reference template for RASCAL.

DeepNet

U-Net Parcellation of the Cerebellum

(Vladimir S. Fonov and D. Louis Collins)

DeepNet is an exploration of the potential of using an FCN based on U-net (Ronneberger et al., 2015; Çiçek et al., 2016) to parcellate the cerebellum.

4. Results

We present results using the Dice overlap measure to characterize the performance of the methods applied to both cohorts in our Comparison. Each Participating group provided a parcellation of the test data sets into lobules respecting the labeling scheme used in the respective cohort. To better characterize performance, we broke down the analysis using a hierarchical scheme. At the coarsest level we have the gross structures of the whole cerebellum, the whole vermis, and the corpus medullare (CM). We then have the subdivisions of the cerebellum into its left and right lobes; see Table 4 for the definitions of these structures for each cohort. The final two levels are the subdivisions of the vermis and the individual lobules; these differ between the two cohorts, as the delineations differ in their treatment of the vermis and in the granularity with which the cerebellum compartments are identified. Specifically, for the Adult Cohort there are five subdivisions of the vermis and 22 lobule labels (11 per hemisphere), whereas for the Pediatric Cohort there are three vermal subdivisions and 14 lobule labels (seven per hemisphere). These levels are identified and defined as: 1) Coarse level, which includes the whole cerebellum, whole vermis, and CM (3 labels); 2) Lobe level, which includes the left and right of the four lobes (8 labels); 3) Vermis level, which includes the subdivisions of the vermis (5 labels for our Adult Cohort, 3 labels for our Pediatric Cohort); 4) Lobule level (22 labels for our Adult Cohort, 14 labels for our Pediatric Cohort); and a grouping listed as 5) Consolidated, which is a union of all the available labels (38 labels for the Adult Cohort, 28 labels for the Pediatric Cohort). These hierarchies have been generated (where necessary) based on the supplied parcellation of each algorithm by merging the appropriate labels; for example, the whole cerebellum label is given by merging all the labels.
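To make the label merging concrete, the following sketch derives a coarse structure from a fine parcellation and compares the Dice overlap at both levels. The label IDs, the toy label maps, and the helper names are hypothetical and do not correspond to the labels in Table 4; returning 1.0 for two empty masks is an assumed convention.

```python
def voxels_with_labels(label_map, members):
    """Voxel indices whose label is one of `members` (the merged structure)."""
    return {voxel for voxel, label in label_map.items() if label in members}


def dice_overlap(mask_g, mask_a):
    """Dice overlap of two voxel sets."""
    g, a = set(mask_g), set(mask_a)
    if not g and not a:
        return 1.0  # assumed convention for two empty masks
    return 2.0 * len(g & a) / (len(g) + len(a))


# Hypothetical label IDs (0 is background); not the IDs used in Table 4.
CM, LEFT_LOBULE, RIGHT_LOBULE, VERMIS = 1, 2, 3, 4
WHOLE_CEREBELLUM = {CM, LEFT_LOBULE, RIGHT_LOBULE, VERMIS}

# Toy label maps: voxel index -> label.
manual = {0: 1, 1: 2, 2: 2, 3: 3, 4: 4, 5: 0}
auto = {0: 1, 1: 2, 2: 3, 3: 3, 4: 4, 5: 4}

fine = dice_overlap(voxels_with_labels(manual, {LEFT_LOBULE}),
                    voxels_with_labels(auto, {LEFT_LOBULE}))
coarse = dice_overlap(voxels_with_labels(manual, WHOLE_CEREBELLUM),
                      voxels_with_labels(auto, WHOLE_CEREBELLUM))
print(fine, coarse)  # about 0.667 and 0.909
```

In this toy case a voxel assigned to the wrong lobule lowers the lobule-level overlap but not the whole-cerebellum overlap, which is one reason the same parcellation can score differently at different levels of the hierarchy.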
In Subsection 4.3, we summarize the Dice overlap results using the rank-sum to compare the performance of the various methods in a succinct manner. The rank-sum scoring assigns a score of 1 to the method with the highest mean Dice overlap measure, 2 to the second highest mean Dice overlap measure, et cetera, for each label. Table 6 provides a summary of the rank-sums for each of the hierarchies. The supplemental material includes details of the rank-sum calculation.
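The rank-sum scoring can be sketched as follows. The method names and mean Dice values are invented for illustration, and the handling of ties (broken by the order in which methods are listed) is an assumption, since the tie rule is described only in the supplemental material.

```python
def rank_sum(mean_dice):
    """Map {label: {method: mean Dice}} to {method: rank-sum}.

    For each label the method with the highest mean Dice gets rank 1,
    the next gets rank 2, and so on; ranks are summed over labels.
    """
    totals = {}
    for scores in mean_dice.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return totals


# Hypothetical mean Dice overlaps for three methods on three labels.
scores = {
    "label_1": {"A": 0.95, "B": 0.94, "C": 0.90},
    "label_2": {"A": 0.80, "B": 0.85, "C": 0.70},
    "label_3": {"A": 0.75, "B": 0.70, "C": 0.72},
}

print(rank_sum(scores))  # {'A': 4, 'B': 6, 'C': 8}
```

A lower rank-sum indicates a better overall placement, so in this toy example method A would be listed first.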

Table 6:

A summary of the rank-sum calculation for each of the hierarchies. The Coarse hierarchy includes three labels: whole cerebellum, whole vermis, and CM; the Lobe hierarchy includes eight labels: Left/Right Anterior Lobe, Left/Right Superior Posterior, Left/Right Inferior Posterior, and Left/Right Flocculonodular; the Vermis hierarchy is five labels for the Adult Cohort and three labels for the Pediatric Cohort (see Table 4 for details); the Lobule hierarchy contains 22 labels for the Adult Cohort and 14 labels for the Pediatric Cohort (see Table 4 for details). Complete rank-sum calculation is included in the supplemental material.

Cohort	Hierarchy	1st	2nd	3rd	4th	5th	6th	7th	8th
Adult Cohort	Coarse	CERES2	LiviaNET	DeepNet	RASCAL	ConvNet	FS-SUIT
Adult Cohort	Lobe	CERES2	LiviaNET	ConvNet	DeepNet	RASCAL	FS-SUIT
Adult Cohort	Vermis	CERES2	LiviaNET	DeepNet	RASCAL	ConvNet	FS-SUIT
Adult Cohort	Lobule	CERES2	DeepNet	LiviaNET	RASCAL	ConvNet	FS-SUIT
Adult Cohort	Consolidated	CERES2	DeepNet	LiviaNET	RASCAL	ConvNet	FS-SUIT
Pediatric Cohort	Coarse	LiviaNET	CERES2	DeepNet	RASCAL	ConvNet	C-SUIT	FS-SUIT	SUIT
Pediatric Cohort	Lobe	CERES2	LiviaNET	DeepNet	RASCAL	C-SUIT	ConvNet	SUIT	FS-SUIT
Pediatric Cohort	Vermis	CERES2	LiviaNET	DeepNet	RASCAL	ConvNet	C-SUIT	SUIT	FS-SUIT
Pediatric Cohort	Lobule	CERES2	DeepNet	LiviaNET	RASCAL	C-SUIT	ConvNet	SUIT	FS-SUIT
Pediatric Cohort	Consolidated	CERES2	LiviaNET	DeepNet	RASCAL	C-SUIT	ConvNet	SUIT	FS-SUIT

4.1. Adult Cohort

Figure 4 shows the results of the six methods on a typical axial slice from a test data set in the Adult Cohort; Fig. 2 shows the underlying MR data. Figures 5–8 show the Dice overlap for each of the methods across the various hierarchies; these plots show the individual data point for each of the five test data sets as well as showing the mean Dice overlap as a horizontal bar. Specifically, Fig. 5 shows the Dice overlap for the whole cerebellum, the whole vermis, and the CM. The mean Dice overlap of the methods on whole cerebellum was used to order the methods in Fig. 4. We can see that CERES2 has the highest mean Dice overlap for each of the Coarse labels; however, for the whole cerebellum label the difference between CERES2 and LiviaNET is quite small (0.950 vs. 0.949), though this is not the case for the other two Coarse Labels. This result sets the tone for many of the other labels in the Adult Cohort; in general for a given label the mean Dice overlap of CERES2 is the highest of the methods, with LiviaNET typically coming in second and on occasion the difference is negligible. Typical examples of this behavior are the Left and Right Anterior Lobe (Fig. 6), the Left and Right Superior Posterior Lobe (Fig. 6), Vermis of Lobules VIII through X (Fig. 7), and several cases in the Lobule hierarchy shown in Fig. 8. There are of course examples of labels on which CERES2 does not achieve the maximum mean Dice overlap. See the Left and Right Inferior Posterior Lobe in Fig. 6, and the Vermis of Lobule VI in Fig. 7 for examples. In all 38 labels under consideration, there are 11 labels on which CERES2 is not ranked first; these 11 cases are split between LiviaNET (3 times), ConvNet (5 times), and DeepNet (3 times); see the supplemental material for complete details. We also observe in Figs. 6 and 8 that each algorithm has similar performance on both the left and right for each label.
We make the observation that most of the methods have a mean Dice overlap above 0.8 for all the lobes except the Flocculonodular Lobe. For the vermal subdivisions, we see a slight degradation in results (mean Dice overlap in the range 0.7 to 0.9). We see a further drop in performance when considering the lobe subdivisions, particularly for Lobules V, VIIB, and VIIIA. In fact, these lobules appear to be the most difficult to parcellate for all the methods, as each method has a large range of Dice overlap values for these regions.

Figure 4: — Shown for a test data set in the Adult Cohort are the **(a)** manual delineation, and the results for each of the methods: **(b)** **CERES2**; **(c)** **LiviaNET**; **(d)** **DeepNet**; **(e)** **ConvNet**; **(f)** **RASCAL**; and **(g)** **FS-SUIT**, for the same axial slice shown in Fig. 2. The methods are ranked based on their mean whole cerebellum parcellation, see Fig. 5 for details.

Figure 5: — The Dice overlap for the three labels associated with the Coarse hierarchy is shown for the Adult Cohort. Each column includes five data points, for the five test data sets in the Adult Cohort, showing the Dice overlap for a method-label pair (some of the data points are *on top* of one another and are thus occluded from view). The horizontal line in each column shows the mean Dice overlap for that particular method and label. We note that the scale has been zoomed to help appreciate the differences between the algorithms.

Figure 8: — The Dice overlap for the 22 labels (11 per hemisphere) associated with the Lobule hierarchy is shown for the Adult Cohort, see Table 4 for the list of lobule labels. See Fig. 5 for instructions on interpreting the plots.

Figure 6: — The Dice overlap for the eight labels associated with the Lobe hierarchy is shown for the Adult Cohort, see Table 4 for the list of lobe labels. We note that the scale has been zoomed to help appreciate the differences between the algorithms.

Figure 7: — The Dice overlap for the five labels associated with the Vermis hierarchy is shown for the Adult Cohort, see Table 4 for the list of vermis labels. See Fig. 5 for instructions on interpreting the plots. We note that some of the scale has been zoomed to help appreciate the differences between the algorithms.

4.2. Pediatric Cohort

Figure 9 shows the results of the eight methods on a typical coronal slice from a test data set in the Pediatric Cohort; Fig. 3 shows the underlying MR data. Figures 10–13 show the Dice overlap for each of the methods across the various hierarchies; these plots show the individual data point for each of the ten test data sets as well as showing the mean Dice overlap as a horizontal bar. Specifically, Fig. 10 shows the Dice overlap for the whole cerebellum, the whole vermis, and the CM. The mean Dice overlap of the methods on the whole cerebellum was used to order the methods in Fig. 9. We can see that LiviaNET has the highest mean Dice overlap for the whole cerebellum and CM labels with CERES2 in second place; however, for the other coarse label the order of these two methods is reversed. In fact, unlike the Adult Cohort, where CERES2 was on top but definitely not unopposed, in the Pediatric Cohort CERES2 is quite dominant. The only labels for which it is not ranked first are the whole cerebellum and the CM. Similar to the Adult Cohort, we observe in Figs. 11 and 13 for the Pediatric Cohort that each algorithm performs consistently on both the left and right for each label.

Figure 10: — The Dice overlap for the three labels associated with the Coarse hierarchy is shown for the Pediatric Cohort. Each column includes ten data points, for the ten test data sets in the Pediatric Cohort, showing the Dice overlap for a method-label pair (some of the data points are *on top* of one another and are thus occluded from view). The horizontal line in each column shows the mean Dice overlap for that particular method and label. We note that the scale has been zoomed to help appreciate the differences between the algorithms.

Figure 13: — The Dice overlap for the 14 labels (7 per hemisphere) associated with the Lobule hierarchy is shown for the Pediatric Cohort, see Table 4 for the list of lobule labels. See Fig. 10 for instructions on interpreting the plots. We note that the scale has been zoomed to help appreciate the differences between the algorithms.

Figure 11: — The Dice overlap for the eight labels associated with the Lobe hierarchy is shown for the Pediatric Cohort, see Table 4 for the list of lobe labels. See Fig. 10 for instructions on interpreting the plots. We note that the scale has been zoomed to help appreciate the differences between the algorithms.

4.3. Summary and Further Analysis

To create a readily interpretable representation of these results we computed the rank-sum for each method over the various hierarchies and both cohorts. These rank-sum results are presented in Table 6, with the details of the computation included in the supplemental material. Over both cohorts, we can easily discern some patterns in Table 6: clearly CERES2 is the overall winner, with LiviaNET and DeepNet trading back and forth between second and third place. We also see RASCAL is quite consistently fourth in both cohorts. Given the outcome of our rank-sum analysis, we identify the top three methods as CERES2, LiviaNET, and DeepNet. We next want to determine if there is a statistically significant difference between these top three methods. To this end, we use a two-sided Wilcoxon paired signed-rank test (Wilcoxon, 1945) between CERES2 & LiviaNET, and between CERES2 & DeepNet, to establish statistical significance. The Wilcoxon test is a nonparametric test of the null hypothesis that the two samples come from the same population against an alternative hypothesis. We tested using all the available Dice overlap values for a particular hierarchy; thus for the Coarse level on the Adult Cohort there are 15 values for each method (3 labels × 5 data sets). For the statistical comparisons we use an α level of 0.001 to note weak statistical significance and an α level of 0.0001 to denote strong statistical significance; we use these α values as we do not employ any multiple comparison correction techniques. The p-values for the Wilcoxon test and the mean values for the Dice overlap (for our top three methods) are shown in Table 7 for the Adult Cohort and Table 8 for the Pediatric Cohort. For the five hierarchies (Coarse, Lobe, Vermis, Lobule, and Consolidated) on the Adult Cohort CERES2 has the highest mean Dice overlap on all five hierarchies and is statistically significantly different on eight of the ten comparisons (with strong significance in five instances). 
The two cases where there is no statistically significant difference are between CERES2 and LiviaNET for the Lobe and Vermis hierarchies. For the Pediatric Cohort CERES2 has the highest mean Dice overlap on all five hierarchies and is statistically significantly different on nine of the ten comparisons (with strong significance in all nine cases). The single comparison for which there is not significance is between CERES2 and LiviaNET on the Coarse hierarchy.
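For readers who wish to see the mechanics of the test, the following is a minimal sketch of an exact two-sided Wilcoxon paired signed-rank test. It is an illustration only: the source does not state which software or which variant (exact or normal approximation) produced the reported p-values, and the function names, the dropping of zero differences, and the use of average ranks for ties are assumptions.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, exact two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Null distribution of W_plus: each rank is included with probability 1/2.
    # Ranks are doubled so that half-integer (tied) ranks remain integers.
    dist = Counter({0: 1})
    for r in ranks:
        step = int(round(2 * r))
        new = Counter()
        for total, count in dist.items():
            new[total] += count
            new[total + step] += count
        dist = new
    n_outcomes = 2 ** len(ranks)
    observed = int(round(2 * w_plus))
    lower = sum(c for t, c in dist.items() if t <= observed) / n_outcomes
    upper = sum(c for t, c in dist.items() if t >= observed) / n_outcomes
    return w_plus, min(1.0, 2 * min(lower, upper))


# Hypothetical paired scores for two methods on five cases.
w, p = wilcoxon_signed_rank([10, 12, 14, 16, 18], [9, 10, 11, 12, 13])
print(w, p)  # 15.0 0.0625
```

The toy example shows why pooling matters: with only five pairs the smallest attainable exact two-sided p-value is 2/2^5 = 0.0625, whereas with the 15 pooled values of the Coarse level on the Adult Cohort it is 2/2^15, about 6.1 × 10^{−5}.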

Table 8:

For the Pediatric Cohort, we show the p-value for the two-sided Wilcoxon paired signed-rank test comparing the second (LiviaNET) and third (DeepNet) placed teams to the top (CERES2) ranked team across the four hierarchies (Coarse, Lobe, Vermis, Lobule) of labeling and also the combination of all 28 labels (Consolidated). The mean Dice overlap for each method, at the respective hierarchy, is shown next to the method’s name.

Hierarchy	Top method (mean Dice overlap)	Compared method (mean Dice overlap)	p-value
Coarse	CERES2 0.9348	vs. LiviaNET 0.9326	2.1 × 10^{−1}
Coarse	CERES2 0.9348	vs. DeepNet 0.9201	6.0 × 10^{−6} ‡
Lobe	CERES2 0.9033	vs. LiviaNET 0.8859	7.4 × 10^{−6} ‡
Lobe	CERES2 0.9033	vs. DeepNet 0.8827	4.9 × 10^{−7} ‡
Vermis	CERES2 0.8763	vs. LiviaNET 0.8491	2.7 × 10^{−5} ‡
Vermis	CERES2 0.8763	vs. DeepNet 0.8427	7.5 × 10^{−5} ‡
Lobule	CERES2 0.9043	vs. LiviaNET 0.8776	1.6 × 10^{−11} ‡
Lobule	CERES2 0.9043	vs. DeepNet 0.8808	1.4 × 10^{−12} ‡
Consolidated	CERES2 0.9043	vs. LiviaNET 0.8828	2.2 × 10^{−16} ‡
Consolidated	CERES2 0.9043	vs. DeepNet 0.8815	2.2 × 10^{−16} ‡

† Denotes weak statistical significance (p-value < 0.001).

‡ Denotes strong statistical significance (p-value < 0.0001).

To understand the inherent bias of any of the presented methods we have generated two bias plots (similar to Bland-Altman plots (Bland and Altman, 1986)), which are included as Figs. 1 and 2 in the supplemental material. Traditional Bland-Altman plots show the difference of two measurements vs. the mean of the same two measurements. To reflect our higher confidence in the manual delineation, we instead plot the volumetric difference between each method and the manual delineation vs. the volume identified by the manual delineation. In this way, all of the methods can be shown on a single plot and the differences of each method on a particular subject are directly comparable. The presented bias plots are for the whole cerebellum label on the Adult and Pediatric Cohorts. These plots are indicative of the behavior of the methods across all the labels, in that FS-SUIT and C-SUIT have positive biases, while SUIT has a negative bias, and the other methods do not exhibit a consistent bias across labels.
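The coordinates of such a bias plot can be sketched as below. The volumes are invented for illustration (in mm³), and the function name `volume_bias` is hypothetical; the sketch only shows how each point pairs the manual volume with the signed difference between a method and the manual delineation.

```python
def volume_bias(manual_volumes, method_volumes):
    """Pairs (manual volume, method volume - manual volume), one per subject."""
    return [(m, a - m) for m, a in zip(manual_volumes, method_volumes)]


# Hypothetical whole cerebellum volumes in mm^3 for three subjects.
manual_mm3 = [120000, 135000, 128000]
method_mm3 = [123000, 137500, 131000]

points = volume_bias(manual_mm3, method_mm3)
mean_bias = sum(diff for _, diff in points) / len(points)

print(points)     # [(120000, 3000), (135000, 2500), (128000, 3000)]
print(mean_bias)  # positive: this toy method over-estimates the volume
```

Because the horizontal coordinate depends only on the manual delineation, points from different methods for the same subject share an x-value and can be compared vertically.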

5. Discussion and Conclusions

5.1. Ranking the Methods

The primary result of this Comparison is a ranking of the state-of-the-art methods for parcellating the cerebellum, which is summarized in Table 6 for both the Adult and Pediatric Cohorts. The different levels of labeling, which we have referred to as hierarchies, allow for some granularity in understanding the ranking of the various methods on our cohorts. Had all the Participants contributed results for the two cohorts it would have been feasible to merge the rankings; regardless of this, there is an obvious stratification that occurs across both cohorts that is almost independent of the hierarchy. We observe that the order of CERES2, LiviaNET, DeepNet, and RASCAL (as first through fourth) is very stable across both cohorts and the hierarchies. This is quite pleasing, as it points to a stability of both the algorithms and the labeling schemes used on both cohorts, even though the cohorts were labeled independently. We observe that these top methods all used spatial and intensity normalization to the MNI space.

Clearly, improvements in the mean Dice overlap of 0.01 could be considered marginal, possibly even negligible; however, the two-sided Wilcoxon paired signed-rank test establishes the results of CERES2 as being a statistical improvement over the second and third place methods of LiviaNET and DeepNet (see Tables 7 and 8). Other metrics may provide some subtle insight into the differences of these approaches that the Dice overlap cannot distinguish; however, we note that recent work (Maier-Hein et al., 2018) has shown that the median Dice overlap is the most stable manner in which to evaluate challenge winners. The important point of this work is that all three of these methods provide a high level of accuracy in parcellating both the adult and pediatric cerebellum. This provides an opportunity for detailed analyses of the cerebellum on an unprecedented scale.

5.2. Criticisms

The current work has two major shortcomings: 1) flawed cohorts and 2) exclusive use of Dice overlap. The two cohorts are flawed in different ways. Firstly, the Adult Cohort, while having a rich label set (CM label, five vermal labels, and 22 lobule labels), provided only five test data sets, each of which showed signs of cerebellar dysfunction without a genetic diagnosis. In particular, the test data for the Adult Cohort had a mean age of 69.2 years, whereas the training data had a mean age of 54.7 years (see Table 3). A two-sided Wilcoxon signed-rank test (Wilcoxon, 1945) between the ages of the training and testing portions of the Adult Cohort has a p-value of 0.02, not significant but not a satisfactory situation either. The other issues with the Adult Cohort are its gender bias (all male test data versus training data that is only one third male) and the small size of the test data (N = 5). The effects of the gender bias are unknown and the cohort size limits the statistical power of any tests. The cohort size also reduced the organizers’ willingness to report standard deviations for the Dice overlap; with such a small sample any reported standard deviations would be erroneous. In contrast to the Adult Cohort, the Pediatric Cohort has a slightly smaller label set (CM label, three vermal labels, and 14 lobule labels), a larger training pool of 20 data sets and a larger testing pool with 10 data sets. The gender proportions are consistent throughout the training and testing data sets as well as throughout the disease classifications in both the training and testing data. When using a two-sided Wilcoxon signed-rank test to perform a comparison between the ages of the training and test data, we get a more pleasing p-value of 0.95. The unfortunate drawback of the Pediatric Cohort is that it is pediatric data.
The pediatric cerebellum is an area of great potential research and the availability of these automated methods for future work is very promising. However, the pediatric cerebellum remains an understudied portion of the central nervous system. The organizers believe that the pooling of these two cohorts to validate these methods is still a comprehensive test for any cerebellum parcellation method.

Since both our data sets were acquired using Philips scanners, our study cannot be used to assess the robustness and stability across scanners. Additionally, our study did not explore the ability of any of these methods to identify group differences between populations; our cohorts’ size and composition did not permit this. We note that the second ranked method, LiviaNET, was validated on the Autism Brain Imaging Data Exchange (ABIDE) (Martino et al., 2014). LiviaNET used ABIDE I, which included 17 international sites consisting of 1112 individuals (539 with autism spectrum disorder and 573 healthy controls; individuals were between 6 and 64 years of age at scan time), and the authors claim that this demonstrated a robustness “to various acquisition protocols, demographics, and clinical factors” (Dolz et al., 2018). We further note that several of the included methods have been used on other data sources with high quality results (Romero et al., 2017; Weier et al., 2014), and studies based on the submitted methods have previously explored group differences (Bernard et al., 2015; Weier et al., 2016). However, we do not make any claims of robustness to data or efficacy for group comparisons for any of the reported methods.

The remaining concern is the exclusive use of the Dice overlap measure throughout the paper. If we ignore the hierarchical label evaluation we employed, there were 28 labels in the Adult Cohort and 18 labels in the Pediatric Cohort. Given this many labels it seemed impractical to the organizers to report multiple metrics. Moreover, it would have been quite difficult to develop a consensus as to how to combine such metrics in a meaningful and unbiased manner. We also note that the majority of papers comparing multiple algorithms, as this paper does, are focused on a small number of labels. In fact it is typical for there to be only one label under consideration: white matter lesions, for example (Styner et al., 2008). As organizers, we observed in Maier et al. (2017) (from Table 7) that the final ranking correlated with the mean Dice overlap; in fact, the mean Dice overlap correctly predicts the top three methods and only incorrectly ranks three of the fourteen methods under consideration. This occurs despite the fact that the Dice overlap is only one component of a multi-measure evaluation (Maier et al., 2017). Thus, we believe exclusive use of the Dice overlap is acceptable and that our analysis of this Comparison correctly represents the state-of-the-art in fully automated cerebellum parcellation.

There are three more (minor) concerns—image quality, subsequent analyses, and the use of FreeSurfer v5.3—each of which we comment on below. A limitation in both of our cohorts was quality assessment (QA) of the images, which was limited to ensuring images were free from artifacts. It is possible that other more subtle quality issues may have contributed to errors in generating the manual delineations. Our basic review was not a comprehensive QA, like MRI-QC (Esteban et al., 2017), which might identify an image that was not acquired in a manner consistent with the other data in the cohort. A further consideration, that is not covered by MRI-QC, would be a cerebellum specific processing assessment (Li et al., 2016; Zuo et al., 2018) that would highlight other issues due to preprocessing.

This paper has focused on the automated parcellation of the cerebellum to facilitate streamlined regional analyses. Such regional analyses can be used as a normalizing factor in functional MRI (Barrett et al., 2017) and positron emission tomography data (Murphy et al., 2013), and for studying the changing shape of the cerebellum in disease (Abulnaga et al., 2016; Kansal et al., 2016). However, there is continued interest in voxel based morphometry (VBM) analyses of the cerebellum (Colloby et al., 2014). In this regard several of the presented methods are deficient, as there is no convenient interpretation between the provided parcellation and a common atlas space that would lend itself to a VBM style study. This is an area for potential future research. Finally, two of the included methods used FreeSurfer v5.3 as part of their processing pipeline, although a newer version, FreeSurfer v6.0, is available. Based on the ChangeLog available for FreeSurfer, no improvement that would benefit either of these methods was immediately obvious.

5.3. Comment on Inter-rater Performance

A portion of our Adult Cohort, along with other similarly acquired data, was used as part of an inter-rater comparison (Bogovic et al., 2013b). It is reassuring to see that the top methods in this Comparison have Dice overlaps similar to those reported for the inter-rater analysis. In particular, the mean Dice overlap for CERES2, LiviaNET, and DeepNet for the whole vermis is larger than the reported inter-rater values (Fig. 5 in Bogovic et al. (2013b)).

5.4. Impact of this Work

Several of the methods in this Comparison are readily available for use, either through download or through a web interface. In particular, the top two methods are accessible to the community: CERES2 can be used through a web portal² and LiviaNET is available for download³. Identifying the state-of-the-art in cerebellum parcellation is important for improving the robustness and speed with which cerebellum imaging studies can be completed. Although SUIT (Diedrichsen, 2006; Diedrichsen et al., 2009) has been available and widely used for over 10 years, our study clearly reveals that there are emerging methods with significantly better performance (given our performance criteria); we note that the probabilistic lobular segmentation generated by SUIT was meant to be informative and not definitive. As studies begin to emerge relating the volumes of cerebellar lobules to functional brain performance (cf. Kansal et al. (2016)), methods such as CERES2, LiviaNET, and DeepNet may offer a better alternative for identifying these volumes. As well, this study provides a baseline for future work on cerebellar parcellation, both in providing information on the best strategies to date and in providing Dice coefficients for comparison.

Supplementary Material

NIHMS1510954-supplement-1.pdf (171.8 KB, pdf)

Figure 12: The Dice overlap for the three labels associated with the Vermis hierarchy is shown for the Pediatric Cohort; see Table 4 for the list of vermis labels. See Fig. 10 for instructions on interpreting the plots.

Acknowledgments

The data collection and labeling of the cerebellum was supported in part by the NIH/NINDS grant R01 NS056307 (PI: J.L. Prince) and NIH/NIMH grants R01 MH078160 & R01 MH085328 (PI: S.H. Mostofsky). PMT is supported in part by the NIH/NIBIB grant U54 EB020403. CERES2 development was supported by grant UPV2016-0099 from the Universitat Politècnica de València (PI: J.V. Manjón); the French National Research Agency through the Investments for the future Program IdEx Bordeaux (ANR-10-IDEX-03-02, HL-MRI Project; PI: P. Coupé) and Cluster of excellence CPU and TRAIL (HR-DTI ANR-10-LABX-57; PI: P. Coupé). Support for the development of LiviaNET was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), discovery grant program, and by the ETS Research Chair on Artificial Intelligence in Medical Imaging. The authors wish to acknowledge the invaluable contributions offered by Dr. George Fein (Dept. of Medicine and Psychology, University of Hawaii) in preparing this manuscript.

Appendix A. Methods

Here we provide detailed descriptions of all the methods used in the Comparison. We introduce each method with a three-line summary: the first line includes a colored square (that is used in subsequent plots and figures for quick reference) and the name of the method; the second is a one-line summary of the method; and the third gives, in parentheses, the Participant(s) who contributed the method.

SUIT

Default SUIT v3.2

(Carlos H. Castillo)

Data analyses were performed using MATLAB R2015b (The Mathworks Inc. Natick, MA), SPM12 (Ashburner et al., 2000), and the spatially unbiased atlas template of the human cerebellum (SUIT) toolbox v3.2 (Diedrichsen et al., 2009). To achieve the best performance from SUIT, all anatomical images were first reoriented into LPI (Neurological) orientation and then the origin of each T1-w image was set to the manually selected anterior commissure.

To ensure the correct normalization of the cerebellar cortex into the atlas template, SUIT first isolates the infra-tentorial structures from the rest of the brain. This is important because the occipital cortex has an intensity similar to that of the cerebellum and in most cases there is no clearly visible separation between these two structures. SUIT v3.2 achieves this separation by using the unified segmentation (Ashburner and Friston, 2005) of SPM12; this segmentation procedure combines tissue classification and registration by means of both a mixture of Gaussians and tissue probability maps. Using this technique, the brain is segmented into eight tissue types: cerebral GM, cerebral WM, cerebellar GM, cerebellar WM, cerebrospinal fluid (CSF), bone, fat/skin, and air. Finally, a binary cerebellar mask is created by combining the cerebellar GM and WM segmentation maps, including voxels whose probability of belonging to either of those classes is greater than or equal to 90%.
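The masking step can be illustrated with a minimal sketch. This is not the SUIT implementation; it assumes that a voxel enters the mask when either the cerebellar GM map or the cerebellar WM map alone reaches the 90% threshold, which is only one possible reading of the description above. The function name `cerebellar_mask` and the flat-list representation of the probability maps are hypothetical.

```python
def cerebellar_mask(p_gm, p_wm, threshold=0.9):
    """Binary mask from cerebellar GM and WM probability maps.

    p_gm and p_wm are flat sequences of per-voxel probabilities in [0, 1].
    Assumption: a voxel is kept if EITHER map is >= threshold. Thresholding
    the sum p_gm + p_wm instead would be a different, also plausible, rule.
    """
    if len(p_gm) != len(p_wm):
        raise ValueError("probability maps must cover the same voxels")
    return [1 if max(g, w) >= threshold else 0 for g, w in zip(p_gm, p_wm)]


# Toy example with five voxels.
p_gm = [0.95, 0.50, 0.05, 0.90, 0.45]
p_wm = [0.02, 0.45, 0.93, 0.05, 0.45]
mask = cerebellar_mask(p_gm, p_wm)
```

The second voxel shows why the choice of rule matters: neither map reaches 0.9 there, so it is excluded under the rule assumed here, whereas thresholding the summed probability (0.95) would include it.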

After the cerebellar isolation, SUIT uses a fast diffeomorphic normalization algorithm (DARTEL) (Ashburner, 2007). DARTEL uses the probabilistic GM and WM segmentation maps to align the anatomy of the cerebellum of each participant to the SUIT atlas template. To increase the speed of the process, the non-linear registration is solved using a Levenberg-Marquardt strategy and a multigrid method; see Ashburner (2007) for complete details. The result is a non-linearly deformed image coregistered to the SUIT atlas template, together with the corresponding deformation field.

To identify the cerebellar lobular boundaries, the probabilistic atlas of the cerebellum included in the SUIT toolbox was used. The SUIT atlas consists of a set of 34 probabilistic maps that indicate the likelihood that a given voxel in the reference space belongs to each lobule. The SUIT atlas includes the cerebellar left and right lobules (I–IV, V, VI, Crus I, Crus II, VIIb, VIIIa, VIIIb, IX, X), vermis (VI, Crus I, Crus II, VIIb, VIIIa, VIIIb, IX, and X), and deep cerebellar nuclei. For this work, these compartments were combined to yield only 18 labels (I–V, VI, Crus I, Crus II–VIIb, VIII, IX, X, Vermis I–V, Vermis VI–VII, Vermis VIII–X, and corpus medullare). For each subject, the inverse warp deformation field was calculated and then applied to the SUIT atlas using nearest-neighbor interpolation, so that the value of each label was preserved. Each voxel was assigned the single label with the maximum probability in the SUIT atlas, resulting in a lobular segmentation in the subject’s native space.
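The maximum-probability labeling and the merging of atlas compartments can be sketched as follows. This is an illustration under stated assumptions, not the code used for the Comparison: the function `label_voxels`, the `MERGE` table (which lists only a few example compartments), the `background` label for voxels where every map is zero, and the order of operations (pick the most probable compartment first, then relabel it) are all hypothetical.

```python
# Partial, illustrative mapping from fine atlas compartments to merged labels.
MERGE = {
    "Left I-IV": "Left I-V",
    "Left V": "Left I-V",
    "Left VIIIa": "Left VIII",
    "Left VIIIb": "Left VIII",
}


def label_voxels(prob_maps, merge=MERGE, background="background"):
    """Assign each voxel the (merged) label of its most probable compartment.

    prob_maps maps a compartment name to a flat list of per-voxel
    probabilities; all lists must have the same length.
    """
    names = list(prob_maps)
    n_voxels = len(prob_maps[names[0]])
    if any(len(prob_maps[name]) != n_voxels for name in names):
        raise ValueError("all probability maps must cover the same voxels")
    labels = []
    for i in range(n_voxels):
        # Compartment with the highest probability at voxel i.
        best = max(names, key=lambda name: prob_maps[name][i])
        if prob_maps[best][i] <= 0.0:
            labels.append(background)  # no compartment claims this voxel
        else:
            labels.append(merge.get(best, best))
    return labels


# Toy example: three compartments, four voxels.
prob_maps = {
    "Left I-IV": [0.7, 0.1, 0.0, 0.0],
    "Left V": [0.2, 0.6, 0.0, 0.0],
    "Left VI": [0.1, 0.3, 0.8, 0.0],
}
labels = label_voxels(prob_maps)
```

An alternative design would sum the probabilities of the compartments being merged before taking the maximum; the two orders can disagree when a merged label is split across several moderately probable compartments.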