Published in final edited form as: Adv Neural Inf Process Syst. 2018 Dec;31:4093–4103.

Distributed Weight Consolidation: A Brain Segmentation Case Study

Patrick McClure, Jakub R. Kaczmarzyk, Satrajit S. Ghosh, Peter Bandettini, Charles Y. Zheng, John A. Lee, Dylan Nielson, Francisco Pereira

Abstract

Collecting the large datasets needed to train deep neural networks can be very difficult, particularly for the many applications for which sharing and pooling data is complicated by practical, ethical, or legal concerns. However, it may be the case that derivative datasets or predictive models developed within individual sites can be shared and combined with fewer restrictions. Training on distributed data and combining the resulting networks is often viewed as continual learning, but these methods require networks to be trained sequentially. In this paper, we introduce distributed weight consolidation (DWC), a continual learning method to consolidate the weights of separate neural networks, each trained on an independent dataset. We evaluated DWC with a brain segmentation case study, where we consolidated dilated convolutional neural networks trained on independent structural magnetic resonance imaging (sMRI) datasets from different sites. We found that DWC led to increased performance on test sets from the different sites, while maintaining generalization performance for a very large and completely independent multi-site dataset, compared to an ensemble baseline.

1. Introduction

Deep learning methods require large datasets to perform well. Collecting such datasets can be very difficult, particularly for the many applications for which sharing and pooling data is complicated by practical, ethical, or legal concerns. One prominent application is human subjects research, in which researchers may be prevented from sharing data due to privacy concerns or other ethical considerations. These concerns can significantly limit the purposes for which the collected data can be used, even within a particular collection site. If the datasets are collected in a clinical setting, they may be subject to many additional constraints. However, it may be the case that derivative datasets or predictive models developed within individual sites can be shared and combined with fewer restrictions.

In the neuroimaging literature, several platforms have been introduced for combining models trained on different datasets, such as ENIGMA ([29], for meta-analyses) and COINSTAC ([23], for distributed training of models). Both platforms support combining separately trained models by averaging the learned parameters. This works for convex methods (e.g. linear regression), but does not generally work for non-convex methods (e.g. deep neural networks, DNNs). [23] also discussed synchronous stochastic gradient descent training with server-client communication; this assumes that all of the training data is simultaneously available. Also, for large models such as DNNs, the bandwidth required could be problematic, given the need to transmit gradients at every update.

Learning from non-centralized datasets using DNNs is often viewed as continual learning, a sequential process where a given predictive model is updated to perform well on new datasets, while retaining the ability to predict on those previously used for training [32, 4, 21, 16, 18]. Continual learning is particularly applicable to problems with shifting input distributions, where the data collected in the past may not represent data collected now or in the future. This is true for neuroimaging, since the statistics of MRIs may change due to scanner upgrades, new reconstruction algorithms, different sequences, etc. The scenario we envisage is a more complex situation where multiple continual learning processes may take place non-sequentially. For instance, a given organization produces a starting DNN, which different, independent sites will then use with their own data. The sites will then contribute back updated DNNs, which the organization will use to improve the main DNN being shared, with the goal of continuing the sharing and consolidation cycle.

Our application is segmentation of structural magnetic resonance imaging (sMRI) volumes. These segmentations are often generated using the Freesurfer package [8], a process that can take close to a day for each subject. The computational resources for doing this at a scale of hundreds to thousands of subjects are beyond the capabilities of most sites. We use deep neural networks to predict the Freesurfer segmentation directly from the structural volumes, as done previously by other groups [26, 27, 7, 6]. We train several of those networks – each using data from a different site – and then consolidate their weights. We show that this results in a model with improved generalization performance in test data from these sites, as well as a very large, completely independent multi-site dataset.

2. Data and Methods

2.1. Datasets

We use several sMRI datasets collected at different sites. We train networks using 956 sMRI volumes collected by the Human Connectome Project (HCP) [30], 1,136 sMRI volumes collected by the Nathan Kline Institute (NKI) [22], 183 sMRI volumes collected by the Buckner Laboratory [2], and 120 sMRI volumes from the Washington University 120 (WU120) dataset [24]. In order to provide an independent estimate of how well a given network generalizes to any new site, we also test networks on a completely held-out dataset consisting of 893 sMRI volumes collected across several institutions by the ABIDE project [5].

2.2. Architecture

Several deep neural network architectures have been proposed for brain segmentation, such as U-net [26], QuickNAT [27], HighResNet [18] and MeshNet [7, 6]. We chose MeshNet because of its relatively simple structure, its lower number of learned parameters, and its competitive performance.

MeshNet uses dilated convolutional layers [31] due to the 3D structural nature of sMRI data. The output of these discrete volumetric dilated convolutional layers can be expressed as:

$$(w_f *_l h)_{i,j,k} = \sum_{\tilde{i}=-a}^{a} \sum_{\tilde{j}=-b}^{b} \sum_{\tilde{k}=-c}^{c} w_{f,\tilde{i},\tilde{j},\tilde{k}}\, h_{i-l\tilde{i},\, j-l\tilde{j},\, k-l\tilde{k}} = (w_f *_l h)_v = \sum_{t \in \mathcal{W}_{abc}} w_{f,t}\, h_{v-lt}, \qquad (1)$$

where $h$ is the input to the layer; $a$, $b$, and $c$ are the bounds for the $i$, $j$, and $k$ axes of the filter with weights $w_f$; and $(i, j, k)$ is the voxel, $v$, at which the convolution is computed. The set of indices for the elements of $w_f$ is $\mathcal{W}_{abc} = \{-a, \ldots, a\} \times \{-b, \ldots, b\} \times \{-c, \ldots, c\}$. The dilation factor $l$ allows the convolution kernel to operate on every $l$-th voxel, since adjacent voxels are expected to be highly correlated. The dilation factor, number of filters, and other details of the MeshNet-like architecture that we used for all experiments are shown in Table 1.
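As a concrete illustration of Eq. 1, the following is a minimal sketch of a dilated 3D convolution (assuming PyTorch; the input shape, filter count, and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32, 32)          # one single-channel 32^3 sub-volume
w = torch.randn(96, 1, 3, 3, 3)            # 96 filters with 3x3x3 support (a = b = c = 1)
l = 2                                      # dilation factor: the kernel taps every 2nd voxel

# padding = l keeps the spatial size unchanged for a 3x3x3 kernel with dilation l
y = F.conv3d(x, w, padding=l, dilation=l)  # shape (1, 96, 32, 32, 32)
```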

Table 1:

The MeshNet-like dilated convolutional neural network architecture for brain segmentation.

Layer Filter Pad Dilation (l) Function
1 96×3³ 1 1 ReLU
2 96×3³ 1 1 ReLU
3 96×3³ 1 1 ReLU
4 96×3³ 2 2 ReLU
5 96×3³ 4 4 ReLU
6 96×3³ 8 8 ReLU
7 96×3³ 1 1 ReLU
8 50×1³ 0 1 Softmax
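A possible rendering of Table 1 in code is sketched below (assuming PyTorch; the function and variable names are ours, and the softmax of layer 8 is folded into the loss):

```python
import torch.nn as nn

def meshnet(in_ch=1, n_classes=50, n_filters=96):
    """Layers 1-7 of Table 1: dilated 3x3x3 convolutions with ReLU; layer 8: a 1x1x1
    convolution producing the 50 class scores (softmax is applied by the loss)."""
    dilations = [1, 1, 1, 2, 4, 8, 1]
    layers, ch = [], in_ch
    for d in dilations:
        layers += [nn.Conv3d(ch, n_filters, kernel_size=3, padding=d, dilation=d), nn.ReLU()]
        ch = n_filters
    layers.append(nn.Conv3d(ch, n_classes, kernel_size=1))
    return nn.Sequential(*layers)
```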

2.3. Bayesian Inference in Neural Networks

2.3.1. Maximum a Posteriori Estimate

When training a neural network, the weights of the network, $w$, are learned by optimizing $\arg\max_w p(\mathcal{D} \mid w)$, where $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ and $(x_n, y_n)$ is the $n$th input-output example, per maximum likelihood estimation (MLE). However, this often overfits, so we used a prior on the network weights, $p(w)$, to obtain a maximum a posteriori (MAP) estimate, by maximizing:

$$\mathcal{L}_{MAP}(w) = \sum_{n=1}^{N} \log p(y_n \mid x_n, w) + \log p(w). \qquad (2)$$
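A minimal sketch of Eq. 2 as an objective (assuming PyTorch and a zero-mean Gaussian prior; the prior's normalization constant is dropped, and the function name is ours):

```python
import torch

def map_objective(log_likelihood, weights, prior_var=1.0):
    """Eq. 2: data log-likelihood plus the log of a N(0, prior_var) prior on the weights.
    Maximizing this gives the MAP estimate; negate it to use it as a loss."""
    log_prior = sum(-(w ** 2).sum() / (2.0 * prior_var) for w in weights)
    return log_likelihood + log_prior
```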

2.3.2. Approximate Bayesian Inference

In Bayesian inference for neural networks, a distribution of possible weights is learned instead of just a MAP point estimate. Using Bayes’ rule, p(w|D)=p(D|w)p(w)/p(D), where p(w) is the prior over weights. However, directly computing the posterior, p(w|D), is often intractable, particularly for DNNs. As a result, an approximate inference method must be used.

One of the most popular approximate inference methods for neural networks is variational inference, since it scales well to large DNNs. In variational inference, the posterior distribution $p(w \mid \mathcal{D})$ is approximated by a learned variational distribution of weights $q_\theta(w)$, with learnable parameters $\theta$. This approximation is enforced by minimizing the Kullback-Leibler (KL) divergence between $q_\theta(w)$ and the true posterior $p(w \mid \mathcal{D})$, $\mathrm{KL}[q_\theta(w) \,\|\, p(w \mid \mathcal{D})]$. This is equivalent to maximizing the variational lower bound [11, 10, 3, 14, 9, 20, 19], also known as the evidence lower bound (ELBO),

$$\mathcal{L}_{ELBO}(\theta) = \mathcal{L}_{\mathcal{D}}(\theta) - \mathcal{L}_{KL}(\theta), \qquad (3)$$

where $\mathcal{L}_{\mathcal{D}}(\theta)$ is

$$\mathcal{L}_{\mathcal{D}}(\theta) = \sum_{n=1}^{N} \mathbb{E}_{q_\theta(w)}\left[\log p(y_n \mid x_n, w)\right] \qquad (4)$$

and $\mathcal{L}_{KL}(\theta)$ is the KL divergence between the variational distribution of weights and the prior,

$$\mathcal{L}_{KL}(\theta) = \mathrm{KL}\left[q_\theta(w) \,\|\, p(w)\right]. \qquad (5)$$

Maximizing $\mathcal{L}_{\mathcal{D}}$ seeks to learn a $q_\theta(w)$ that explains the training data, while minimizing $\mathcal{L}_{KL}$ (i.e. keeping $q_\theta(w)$ close to $p(w)$) prevents learning a $q_\theta(w)$ that overfits the training data.
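For fully factorized Gaussian variational distributions and priors (introduced in Section 2.3.4), the KL term of Eq. 5 has a closed form; a minimal sketch (assuming PyTorch, with per-weight means and log-variances as the variational parameters; names are ours):

```python
import torch

def kl_ffg(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL[q || p] between two fully factorized Gaussians, summed over weights
    (Eq. 5 when p is the prior). Inputs are tensors of per-weight means and log-variances."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum()
```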

2.3.3. Stochastic Variational Bayes

Optimizing Eq. 3 for deep neural networks is usually impractical, because it is (1) a full-batch objective that (2) requires integrating over $q_\theta(w)$. (1) is often dealt with by using stochastic mini-batch optimization [25], and (2) is often approximated using Monte Carlo sampling. [14] applied these methods to variational inference in deep neural networks. They used the "reparameterization trick" [15], which formulates $q_\theta(w)$ as a deterministic differentiable function $w = f(\theta, \epsilon)$ with $\epsilon \sim \mathcal{N}(0, I)$, to calculate an unbiased estimate of $\nabla_\theta \mathcal{L}_{\mathcal{D}}$ for a mini-batch, $\{(x_1, y_1), \ldots, (x_M, y_M)\}$, and one weight-noise sample, $\epsilon_m$, for each mini-batch example:

$$\mathcal{L}_{ELBO}(\theta) \approx \mathcal{L}_{\mathcal{D}}^{SGVB}(\theta) - \mathcal{L}_{KL}(\theta), \qquad (6)$$

where

$$\mathcal{L}_{\mathcal{D}}(\theta) \approx \mathcal{L}_{\mathcal{D}}^{SGVB}(\theta) = \frac{N}{M} \sum_{m=1}^{M} \log p\left(y_m \mid x_m, f(\theta, \epsilon_m)\right). \qquad (7)$$
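A minimal sketch of the mini-batch estimator in Eq. 7 for the segmentation setting (assuming PyTorch and a cross-entropy log-likelihood over voxels; the model is assumed to draw a new weight sample $w = f(\theta, \epsilon)$ on each forward pass, and names are ours):

```python
import torch.nn.functional as F

def sgvb_data_term(model, x_batch, y_batch, n_train):
    """Eq. 7: mini-batch estimate of L_D. x_batch: (M, 1, 32, 32, 32) sub-volumes;
    y_batch: (M, 32, 32, 32) integer labels; n_train: total number of training examples N."""
    logits = model(x_batch)                            # (M, 50, 32, 32, 32), weights sampled inside
    log_lik = -F.cross_entropy(logits, y_batch, reduction='sum')
    return (n_train / x_batch.shape[0]) * log_lik      # rescale from the mini-batch to all N examples
```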

2.3.4. Variational Continual Learning

In Bayesian neural networks, $p(w)$ is often set to a multivariate Gaussian with diagonal covariance, $\mathcal{N}(0, \sigma_{prior}^2 I)$. (A variational distribution of the same form is called a fully factorized Gaussian (FFG).) However, instead of using a naïve prior, the parameters of a previously trained DNN can be used. Several methods, such as elastic weight consolidation [16] and synaptic intelligence [32], have explored this approach. Recently, these methods have been reinterpreted from a Bayesian perspective [21, 17]. In variational continual learning (VCL) [21] and Bayesian incremental learning [17], the DNNs trained on previously obtained data, $\mathcal{D}_1, \ldots, \mathcal{D}_{T-1}$, are used to regularize the training of a new neural network trained on $\mathcal{D}_T$ per:

$$p(w \mid \mathcal{D}_{1:T}) = \frac{p(\mathcal{D}_{1:T} \mid w)\, p(w)}{p(\mathcal{D}_{1:T})} = \frac{p(\mathcal{D}_{1:T-1} \mid w)\, p(\mathcal{D}_T \mid w)\, p(w)}{p(\mathcal{D}_{1:T-1})\, p(\mathcal{D}_T)} = \frac{p(w \mid \mathcal{D}_{1:T-1})\, p(\mathcal{D}_T \mid w)}{p(\mathcal{D}_T)}, \qquad (8)$$

where $p(w \mid \mathcal{D}_{1:T-1})$ is the network resulting from training on the sequence of datasets $\mathcal{D}_1, \ldots, \mathcal{D}_{T-1}$.

For DNNs, computing $p(w \mid \mathcal{D}_{1:T})$ directly can be intractable, so variational inference is used iteratively to learn an approximation, $q_{\theta_T}(w)$, by minimizing $\mathrm{KL}[q_{\theta_\tau}(w) \,\|\, p(w \mid \mathcal{D}_{1:\tau})]$ for each sequential dataset $\mathcal{D}_\tau$, with $\tau$ ranging over integers from 1 to $T$.

The sequential nature of this approach is a limitation in our setting. In many cases it is not feasible for one site to wait for another site to complete training, which can take days, in order to begin their own training.
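The sequential dependence is explicit in a sketch of the VCL loop (pseudocode-level Python; `fit` and `q_init` are placeholders for the site-level variational training and the initial posterior):

```python
def vcl_train(datasets, q_init, fit):
    """Sequential VCL: each dataset is visited in turn, and the posterior learned so far
    becomes the prior for the next dataset, so site T must wait for sites 1..T-1."""
    prior, posteriors = q_init, []
    for d in datasets:
        q = fit(d, prior)        # minimize -L_D^SGVB + KL[q || prior] on dataset d
        posteriors.append(q)
        prior = q
    return posteriors
```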

2.4. Distributed Weight Consolidation

The main motivation of our method – distributed weight consolidation (DWC) – is to make it possible to train neural networks on different, distributed datasets, independently, and consolidate their weights into a single network.

2.4.1. Bayesian Continual Learning for Distributed Data

In DWC, we seek to consolidate several distributed DNNs trained on $S$ separate, distributed datasets, $\mathcal{D}_T = \{\mathcal{D}_T^1, \ldots, \mathcal{D}_T^S\}$, so that the resulting DNN can then be used to inform the training of a DNN on $\mathcal{D}_{T+1}$. The training on each dataset starts from an existing network, $p(w \mid \mathcal{D}_{1:T-1})$.

Assuming that the S datasets are independent allows Eq. 8 to be rewritten as:

$$p(w \mid \mathcal{D}_{1:T}) = p(w \mid \mathcal{D}_{1:T-1})\, \frac{\prod_{s=1}^{S} p(\mathcal{D}_T^s \mid w)}{\prod_{s=1}^{S} p(\mathcal{D}_T^s)}. \qquad (9)$$

However, training one of the $S$ networks using VCL produces an approximation for $p(w \mid \mathcal{D}_{1:T-1}, \mathcal{D}_T^s)$. Eq. 9 can be written in terms of these learned distributions, since $p(w \mid \mathcal{D}_{1:T-1}, \mathcal{D}_T^s) = p(w \mid \mathcal{D}_{1:T-1})\, p(\mathcal{D}_T^s \mid w) / p(\mathcal{D}_T^s)$ per Eq. 8:

$$p(w \mid \mathcal{D}_{1:T}) = \frac{1}{p(w \mid \mathcal{D}_{1:T-1})^{S-1}} \prod_{s=1}^{S} p(w \mid \mathcal{D}_{1:T-1}, \mathcal{D}_T^s). \qquad (10)$$

$p(w \mid \mathcal{D}_{1:T-1})$ and each $p(w \mid \mathcal{D}_{1:T-1}, \mathcal{D}_T^s)$ can be learned and then used to compute $p(w \mid \mathcal{D}_{1:T})$. This distribution can then be used to learn $p(w \mid \mathcal{D}_{1:T+1})$ per Eq. 8.

2.4.2. Variational Approximation

In DNNs, however, directly calculating these probability distributions can be intractable, so variational inference is used to learn an approximation, $q_{\theta_{T,s}}(w)$, for $p(w \mid \mathcal{D}_{1:T-1}, \mathcal{D}_T^s)$ by minimizing $\mathrm{KL}[q_{\theta_{T,s}}(w) \,\|\, p(w \mid \mathcal{D}_{1:T-1}, \mathcal{D}_T^s)]$. This results in approximating Eq. 10 using:

$$p(w \mid \mathcal{D}_{1:T}) \approx \frac{1}{q_{\theta_{T-1}}(w)^{S-1}} \prod_{s=1}^{S} q_{\theta_{T,s}}(w). \qquad (11)$$

2.4.3. Dilated Convolutions with Fully Factorized Gaussian Filters

Although more complicated variational families have recently been explored in DNNs, the relatively simple FFG variational distribution can do as well as, or better than, more complex methods for continual learning [17]. In this paper, we use dilated convolutions with FFG filters. This assumes that each of the $F$ filters is independent (i.e. $p(w) = \prod_{f=1}^{F} p(w_f)$), that each weight within a filter is also independent (i.e. $p(w_f) = \prod_{t \in \mathcal{W}_{abc}} p(w_{f,t})$), and that each weight is Gaussian (i.e. $w_{f,t} \sim \mathcal{N}(\mu_{f,t}, \sigma_{f,t}^2)$) with learnable parameters $\mu_{f,t}$ and $\sigma_{f,t}$. However, as discussed in [14, 20], randomly sampling each weight for each mini-batch example can be computationally expensive, so the fact that the sum of independent Gaussian variables is also Gaussian is used to move the noise from the weights to the convolution operation. For dilated convolutions, this is described by

$$(w_f *_l h)_v \sim \mathcal{N}\left(\mu_{f,v},\, \sigma_{f,v}^2\right), \qquad (12)$$

where

$$\mu_{f,v} = \sum_{t \in \mathcal{W}_{abc}} \mu_{f,t}\, h_{v-lt} \qquad (13)$$

and

$$\sigma_{f,v}^2 = \sum_{t \in \mathcal{W}_{abc}} \sigma_{f,t}^2\, h_{v-lt}^2. \qquad (14)$$

Eq. 12 can be rewritten using the Gaussian "reparameterization trick":

$$(w_f *_l h)_v = \mu_{f,v} + \sigma_{f,v}\, \epsilon_{f,v}, \quad \text{where } \epsilon_{f,v} \sim \mathcal{N}(0, 1). \qquad (15)$$
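A minimal sketch of such an FFG dilated convolution layer with the noise moved to the outputs, per Eqs. 12-15 (assuming PyTorch; the class name and initialization constants are ours, and the weight normalization used in our training setup is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFGDilatedConv3d(nn.Module):
    """Dilated 3D convolution with a fully factorized Gaussian over its filter weights.
    The noise is applied to the outputs rather than the weights, per Eqs. 12-15."""
    def __init__(self, in_ch, out_ch, k=3, dilation=1, init_logvar=-13.8):  # sigma ~= 0.001
        super().__init__()
        self.mu = nn.Parameter(0.05 * torch.randn(out_ch, in_ch, k, k, k))
        self.logvar = nn.Parameter(init_logvar * torch.ones(out_ch, in_ch, k, k, k))
        self.dilation = dilation

    def forward(self, h):
        pad = self.dilation                                                           # keeps spatial size for k = 3
        mean = F.conv3d(h, self.mu, padding=pad, dilation=self.dilation)              # Eq. 13
        var = F.conv3d(h * h, self.logvar.exp(), padding=pad, dilation=self.dilation) # Eq. 14
        eps = torch.randn_like(mean)
        return mean + var.clamp_min(1e-16).sqrt() * eps                               # Eq. 15
```

Sampling the noise at the outputs requires only one Gaussian draw per output voxel, rather than a full weight sample per mini-batch example.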

2.4.4. Consolidating an Ensemble of Fully Factorized Gaussian Networks

Eq. 11 can be used to consolidate an ensemble of distributed networks in order to allow for training on new datasets. Eq. 11 can be directly calculated if $q_{\theta_{T-1}}(w_{f,t}) = \mathcal{N}(\mu_{f,t}^0, (\sigma_{f,t}^0)^2)$ and $q_{\theta_{T,s}}(w_{f,t}) = \mathcal{N}(\mu_{f,t}^s, (\sigma_{f,t}^s)^2)$ are known, resulting in $p(w \mid \mathcal{D}_{1:T})$ also being an FFG per

$$p(w_{f,t} \mid \mathcal{D}_{1:T}) \propto e^{(S-1)\frac{(w_{f,t} - \mu_{f,t}^0)^2}{2(\sigma_{f,t}^0)^2}} \prod_{s=1}^{S} e^{-\frac{(w_{f,t} - \mu_{f,t}^s)^2}{2(\sigma_{f,t}^s)^2}} \qquad (16)$$

and

$$p(w_{f,t} \mid \mathcal{D}_{1:T}) = \mathcal{N}\!\left(\frac{\sum_{s=1}^{S} \frac{\mu_{f,t}^s}{(\sigma_{f,t}^s)^2} - \sum_{s=1}^{S-1} \frac{\mu_{f,t}^0}{(\sigma_{f,t}^0)^2}}{\sum_{s=1}^{S} \frac{1}{(\sigma_{f,t}^s)^2} - \sum_{s=1}^{S-1} \frac{1}{(\sigma_{f,t}^0)^2}},\; \frac{1}{\sum_{s=1}^{S} \frac{1}{(\sigma_{f,t}^s)^2} - \sum_{s=1}^{S-1} \frac{1}{(\sigma_{f,t}^0)^2}}\right). \qquad (17)$$

Eq. 17 follows from Eq. 16 by completing the square inside the exponent and matching the parameters to the Gaussian density; it is defined when $\sum_{s=1}^{S} \frac{1}{(\sigma_{f,t}^s)^2} - \sum_{s=1}^{S-1} \frac{1}{(\sigma_{f,t}^0)^2} > 0$. To ensure this, we constrained $(\sigma_{f,t}^0)^2 \geq (\sigma_{f,t}^s)^2$. This should be the case if the loss is optimized, since $\mathcal{L}_{\mathcal{D}}$ should pull $(\sigma_{f,t}^s)^2$ towards 0 and $\mathcal{L}_{KL}$ pulls $(\sigma_{f,t}^s)^2$ towards $(\sigma_{f,t}^0)^2$. $p(w_{f,t} \mid \mathcal{D}_{1:T})$ can then be used as a prior for training another variational DNN.
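A minimal sketch of the consolidation in Eq. 17 over per-weight means and variances (assuming PyTorch tensors; function and argument names are ours):

```python
import torch

def consolidate_ffg(mu0, var0, mus, vars_):
    """Eq. 17: consolidate S site posteriors that share the prior q_{theta_{T-1}}.
    mu0, var0: prior means/variances; mus, vars_: lists of S per-site posterior means/variances,
    all tensors of the same shape."""
    S = len(mus)
    precision = sum(1.0 / v for v in vars_) - (S - 1) / var0            # denominator of Eq. 17
    assert bool((precision > 0).all()), "needs site variances no larger than the prior variance"
    mean = (sum(m / v for m, v in zip(mus, vars_)) - (S - 1) * mu0 / var0) / precision
    return mean, 1.0 / precision                                        # consolidated mean and variance
```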

3. Experiments

3.1. Experimental Setup

3.1.1. Data Preprocessing

The only pre-processing that we performed was conforming the input sMRIs with Freesurfer's mri_convert, which resampled all of the volumes to 256×256×256 with 1 mm isotropic voxels. We computed 50-class Freesurfer [8] segmentations, as in [6], for all subjects in each of the datasets described earlier. These were used as the labels for prediction. A 90-10 training-test split was used for the HCP, NKI, Buckner, and WU120 datasets. During training and testing, input volumes were individually z-scored across voxels. We split each input volume into 512 non-overlapping 32×32×32 sub-volumes, as in [7, 6].
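A minimal sketch of the per-volume preparation (assuming NumPy; the function name is ours):

```python
import numpy as np

def to_subvolumes(volume):
    """Z-score a conformed 256^3 volume across voxels and split it into
    512 non-overlapping 32^3 sub-volumes."""
    v = (volume - volume.mean()) / volume.std()
    v = v.reshape(8, 32, 8, 32, 8, 32).transpose(0, 2, 4, 1, 3, 5)  # 8x8x8 grid of 32^3 blocks
    return v.reshape(512, 32, 32, 32)
```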

3.1.2. Training Procedure

All networks were trained with Adam [13] and an initial learning rate of 0.001. The MAP networks were trained until convergence. The subsequent networks were trained until the training loss started to oscillate around a stable value. These networks trained much faster than the MAP networks, since they were initialized with previously trained networks. Specifically, we found that using VCL led to ~3x, ~2x, and ~4x convergence speedups for HCP to NKI, HCP to Buckner, and HCP to WU120, respectively. The batch size was set to 10. Weight normalization [28] was used for the weight means of all networks, and the weight standard deviations were initialized to 0.001, as in [19], for the variational network trained on HCP. For the MAP networks and the variational network trained on HCP, $p(w) = \mathcal{N}(0, 1)$.

3.1.3. Performance Metric

To measure the quality of the produced segmentations, we calculated the Dice coefficient, which is defined by

$$\mathrm{Dice}_c = \frac{2\, |\hat{y}_c \circ y_c|}{\|\hat{y}_c\|^2 + \|y_c\|^2} = \frac{2\, TP_c}{2\, TP_c + FN_c + FP_c}, \qquad (18)$$

where $\hat{y}_c$ is the binary segmentation for class $c$ produced by a network, $y_c$ is the ground truth produced by Freesurfer, $TP_c$ is the number of true positives for class $c$, $FN_c$ is the number of false negatives for class $c$, and $FP_c$ is the number of false positives for class $c$. We calculate the Dice coefficient for each class $c$ and average across classes to compute the overall performance of a network.
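A minimal sketch of the per-class Dice computation in Eq. 18 (assuming NumPy and integer label volumes; the handling of classes absent from both volumes is our choice):

```python
import numpy as np

def dice_per_class(pred, truth, n_classes=50):
    """Eq. 18: Dice coefficient for each class, given integer label volumes pred and truth."""
    scores = []
    for c in range(n_classes):
        p, t = (pred == c), (truth == c)
        denom = p.sum() + t.sum()
        scores.append(2.0 * np.logical_and(p, t).sum() / denom if denom > 0 else np.nan)
    return np.array(scores)   # average (e.g. np.nanmean) across classes for a per-volume score
```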

3.1.4. Baselines

We trained MAP networks on the HCP (HMAP), NKI (NMAP), Buckner (BMAP), and WU120 (WMAP) datasets. We averaged the output probabilities of the HMAP, NMAP, BMAP, and WMAP networks to create an ensemble baseline. We also trained a MAP model on the aggregated HCP, NKI, Buckner, and WU120 training data (HNBWMAP) to estimate the performance ceiling of having the training data from all sites available together.
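A minimal sketch of the ensemble baseline's output averaging (assuming PyTorch; names are ours):

```python
import torch

def ensemble_predict(models, x):
    """Ensemble baseline: average the per-class output probabilities of the MAP networks."""
    probs = [torch.softmax(m(x), dim=1) for m in models]   # each (batch, 50, D, H, W)
    return torch.stack(probs).mean(dim=0)
```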

3.1.5. Variational Continual Learning

We trained an initial FFG variational network on HCP (H), using HMAP to initialize the network. We then used VCL, with the HCP network as the prior, for distributed training of the FFG variational networks on the NKI (HN), Buckner (HB), and WU120 (HW) datasets. Additionally, we trained networks using VCL to transfer from HCP to NKI to Buckner to WU120 (HNBW) and from HCP to WU120 to Buckner to NKI (HWBN). These options test training on NKI, Buckner, and WU120 in decreasing and increasing order of dataset size, since dataset order may matter and may be difficult to control in practice.

3.1.6. Distributed Weight Consolidation

For DWC, our goal was to take distributed networks trained using VCL with an initial network as a prior, consolidate them per Eq. 17, and then use this consolidated model as a prior for fine-tuning on the original dataset. We used DWC to consolidate HN, HB, and HW into HN+B+W per Eq. 17. VCL [21] performance was found to be improved by using coresets [1, 12], a small sample of data from the different training sets. However, if examples cannot be collected from the different datasets, as may be the case when examples from the separate datasets cannot be shared, coresets are not applicable. For this reason, we used HN+B+W as a prior for fine-tuning (FT) by training the network on H (HN+B+W+H), giving $\mathcal{L}_{\mathcal{D}}$ the weight of one example volume.

3.2. Experimental Results

In Table 2 we show the average Dice scores across classes and sMRI volumes for the differently trained networks. The weighted average Dice scores were computed across H, N, B, and W by weighting each test set's Dice score according to the number of volumes in that test set. For the variational networks, 10 MC samples were used at test time to approximate the expected network output. The weighted average Dice scores of DWC were better than those of the ensemble, the baseline method for combining models trained at different sites (p = 1.66e-15, per a two-tailed paired t-test across volumes). The ABIDE Dice scores of DWC were not significantly different from those of the ensemble (p = 0.733, per a two-tailed paired t-test across volumes), showing that DWC does not reduce generalization performance on a very large and completely independent multi-site dataset.
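The test-time MC averaging and the paired comparison can be sketched as follows (assuming PyTorch and SciPy; the variational model is assumed to draw new weight noise on each forward pass, and array names are illustrative):

```python
import torch
from scipy.stats import ttest_rel

def mc_predict(model, x, n_samples=10):
    """Approximate the expected output of a variational network with MC weight samples."""
    probs = [torch.softmax(model(x), dim=1) for _ in range(n_samples)]  # new weight noise per call
    return torch.stack(probs).mean(dim=0)

# Paired, two-tailed comparison of per-volume average Dice scores, e.g. DWC vs. the ensemble:
# t_stat, p_value = ttest_rel(dice_scores_dwc, dice_scores_ensemble)
```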

Table 2:

The average Dice scores across test volumes for the trained networks on HCP (H), NKI (N), Buckner (B), and WU120 (W), along with the weighted average Dice scores across H, N, B, and W and the average Dice scores across volumes on the independent ABIDE (A) dataset.

Network H N B W Avg. A
H MAP 82.25 65.88 67.94 70.88 72.92 55.25
N MAP 71.20 72.19 70.73 73.06 71.66 66.67
B MAP 65.69 50.17 82.02 68.87 59.25 50.23
W MAP 70.18 66.27 72.20 76.38 68.76 62.83

HN 75.40 73.24 71.77 73.17 74.03 64.62
HB 73.85 56.79 79.49 68.53 65.78 49.27
HW 77.07 67.63 76.15 77.26 72.51 62.31

HNBW 77.42 71.46 79.70 79.82 74.86 63.3
HWBN 78.04 78.15 75.79 79.50 77.99 70.79
HN+B+W (DWC w/o FT) 78.28 73.52 78.02 77.37 75.95 65.56

Ensemble 79.13 72.32 80.02 78.84 75.94 66.27
HN+B+W+H (DWC) 80.34 73.64 77.46 78.10 76.82 66.21
HNBW MAP 81.38 77.99 80.64 79.54 79.62 70.76

Training on different datasets sequentially using VCL was very sensitive to dataset order, as seen by the difference in Dice scores when training on NKI, Buckner, and WU120 in order of decreasing and increasing dataset size (H → N → B → W and H → W → B → N, respectively). The performance of DWC was within the range of VCL performance. The weighted average and ABIDE Dice scores of DWC were better than the HNBW Dice scores, but not better than the HWBN Dice scores.

In Figures 1, 2, 3, and 4, we show selected example segmentations for DWC and HNBWMAP, for volumes that have Dice scores similar to the average Dice score across the respective dataset. Visually, the DWC segmentations appear very similar to the ground truth. The errors made appear to occur mainly at region boundaries. Additionally, the DWC errors appear to be similar to the errors made by HNBWMAP.

Figure 1:

The axial and sagittal segmentations produced by DWC and the HNBWMAP baseline on an HCP subject. The subject was selected by matching the subject-specific Dice with the average Dice across HCP. Segmentation errors for all classes are shown in red in the respective plot.

Figure 2:

The axial and sagittal segmentations produced by DWC and the HNBWMAP baseline on an NKI subject. The subject was selected by matching the subject-specific Dice with the average Dice across NKI. Segmentation errors for all classes are shown in red in the respective plot.

Figure 3:

The axial and sagittal segmentations produced by DWC and the HNBWMAP baseline on a Buckner subject. The subject was selected by matching the subject-specific Dice with the average Dice across Buckner. Segmentation errors for all classes are shown in red in the respective plot.

Figure 4:

The axial and sagittal segmentations produced by DWC and the HNBWMAP baseline on a WU120 subject. The subject was selected by matching the subject-specific Dice with the average Dice across WU120. Segmentation errors for all classes are shown in red in the respective plot.

4. Discussion

There are many problems for which accumulating data into one accessible dataset for training can be difficult or impossible, such as for clinical data. It may, however, be feasible to share models derived from such data. A method often proposed for dealing with these independent datasets is continual learning, which trains on each of these datasets sequentially [4]. Several recent continual learning methods use previously trained networks as priors for networks trained on the next dataset [32, 21, 16], albeit with the requirement that training happens sequentially. We developed DWC by modifying these methods to allow for training networks on several new datasets in a distributed way. Using DWC, we consolidated the weights of the distributed neural networks to perform brain segmentation on data from different sites. The resulting weight distributions can then be used as a prior distribution for further training, either for the original site or for novel sites. Compared to an ensemble made from models trained on different sites, DWC increased performance on the held-out test sets from the sites used in training and led to similar ABIDE performance. This demonstrates the feasibility of DWC for combining the knowledge learned by networks trained on different datasets, without either training on the sites sequentially or ensembling many trained models. One important direction for future research is scaling DWC up to allow for consolidating many more separate, distributed networks and repeating this training and consolidation cycle several times. Another area of research is to investigate the use of alternative families of variational distributions within the framework of DWC. Our method has the potential to be applied to many other applications where it is necessary to train specialized networks for specific sites, informed by data from other sites, and where constraints on data sharing necessitate a distributed learning approach, such as disease diagnosis with clinical data.

Acknowledgments

This work was supported by the National Institute of Mental Health Intramural Research Program (ZIC-MH002968, ZIC-MH002960). JK’s and SG’s work was supported by NIH R01 EB020740.

Footnotes

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Contributor Information

Patrick McClure, National Institute of Mental Health.

Jakub R. Kaczmarzyk, Massachusetts Institute of Technology.

Satrajit S. Ghosh, Massachusetts Institute of Technology.

Peter Bandettini, National Institute of Mental Health.

Charles Y. Zheng, National Institute of Mental Health.

John A. Lee, National Institute of Mental Health.

Dylan Nielson, National Institute of Mental Health.

Francisco Pereira, National Institute of Mental Health.

References

• [1] Bachem Olivier, Lucic Mario, and Krause Andreas. Coresets for nonparametric estimation - the case of DP-means. In International Conference on Machine Learning, pages 209–217, 2015.
• [2] Biswal Bharat B, Mennes Maarten, Zuo Xi-Nian, Gohel Suril, Kelly Clare, Smith Steve M, Beckmann Christian F, Adelstein Jonathan S, Buckner Randy L, Colcombe Stan, et al. Toward discovery science of human brain function. Proceedings of the National Academy of Sciences, 107(10):4734–4739, 2010.
• [3] Blundell Charles, Cornebise Julien, Kavukcuoglu Koray, and Wierstra Daan. Weight uncertainty in neural networks. In International Conference on Machine Learning, pages 1613–1622, 2015.
• [4] Chang Ken, Balachandar Niranjan, Lam Carson, Yi Darvin, Brown James, Beers Andrew, Rosen Bruce, Rubin Daniel L, and Kalpathy-Cramer Jayashree. Distributed deep learning networks among institutions for medical imaging. Journal of the American Medical Informatics Association.
• [5] Di Martino Adriana, Yan Chao-Gan, Li Qingyang, Denio Erin, Castellanos Francisco X, Alaerts Kaat, Anderson Jeffrey S, Assaf Michal, Bookheimer Susan Y, Dapretto Mirella, et al. The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry, 19(6):659, 2014.
• [6] Fedorov Alex, Damaraju Eswar, Calhoun Vince, and Plis Sergey. Almost instant brain atlas segmentation for large-scale studies. arXiv preprint arXiv:1711.00457, 2017.
• [7] Fedorov Alex, Johnson Jeremy, Damaraju Eswar, Ozerin Alexei, Calhoun Vince, and Plis Sergey. End-to-end learning of brain tissue segmentation from imperfect labeling. In International Joint Conference on Neural Networks, pages 3785–3792. IEEE, 2017.
• [8] Fischl Bruce. FreeSurfer. NeuroImage, 62(2):774–781, 2012.
• [9] Gal Yarin and Ghahramani Zoubin. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
• [10] Graves Alex. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
• [11] Hinton Geoffrey E and Van Camp Drew. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.
• [12] Huggins Jonathan, Campbell Trevor, and Broderick Tamara. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
• [13] Kingma Diederik P and Ba Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
• [14] Kingma Diederik P, Salimans Tim, and Welling Max. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
• [15] Kingma Diederik P and Welling Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
• [16] Kirkpatrick James, Pascanu Razvan, Rabinowitz Neil, Veness Joel, Desjardins Guillaume, Rusu Andrei A, Milan Kieran, Quan John, Ramalho Tiago, Grabska-Barwinska Agnieszka, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
• [17] Kochurov Max, Garipov Timur, Podoprikhin Dmitry, Molchanov Dmitry, Ashukha Arsenii, and Vetrov Dmitry. Bayesian incremental learning for deep neural networks. ICLR Workshop, 2018.
• [18] Li Wenqi, Wang Guotai, Fidon Lucas, Ourselin Sebastien, Jorge Cardoso M, and Vercauteren Tom. On the compactness, efficiency, and representation of 3D convolutional networks: Brain parcellation as a pretext task. In International Conference on Information Processing in Medical Imaging, pages 348–360. Springer, 2017.
• [19] Louizos Christos and Welling Max. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pages 2218–2227, 2017.
• [20] Molchanov Dmitry, Ashukha Arsenii, and Vetrov Dmitry. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507, 2017.
• [21] Nguyen Cuong V, Li Yingzhen, Bui Thang D, and Turner Richard E. Variational continual learning. In International Conference on Learning Representations, 2018.
• [22] Nooner Kate Brody, Colcombe Stanley, Tobe Russell, Mennes Maarten, Benedict Melissa, Moreno Alexis, Panek Laura, Brown Shaquanna, Zavitz Stephen, Li Qingyang, et al. The NKI-Rockland sample: A model for accelerating the pace of discovery science in psychiatry. Frontiers in Neuroscience, 6:152, 2012.
• [23] Plis Sergey M, Sarwate Anand D, Wood Dylan, Dieringer Christopher, Landis Drew, Reed Cory, Panta Sandeep R, Turner Jessica A, Shoemaker Jody M, Carter Kim W, et al. COINSTAC: A privacy enabled model and prototype for leveraging and processing decentralized brain imaging data. Frontiers in Neuroscience, 10:365, 2016.
• [24] Power Jonathan D, Plitt Mark, Kundu Prantik, Bandettini Peter A, and Martin Alex. Temporal interpolation alters motion in fMRI scans: Magnitudes and consequences for artifact detection. PLoS ONE, 12(9):e0182939, 2017.
• [25] Robbins Herbert and Monro Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
• [26] Ronneberger Olaf, Fischer Philipp, and Brox Thomas. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
• [27] Roy Abhijit Guha, Conjeti Sailesh, Navab Nassir, and Wachinger Christian. QuickNAT: Segmenting MRI neuroanatomy in 20 seconds. arXiv preprint arXiv:1801.04161, 2018.
• [28] Salimans Tim and Kingma Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
• [29] Thompson Paul M, Stein Jason L, Medland Sarah E, Hibar Derrek P, Vasquez Alejandro Arias, Renteria Miguel E, Toro Roberto, Jahanshad Neda, Schumann Gunter, Franke Barbara, et al. The ENIGMA consortium: Large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging and Behavior, 8(2):153–182, 2014.
• [30] Van Essen David C, Smith Stephen M, Barch Deanna M, Behrens Timothy EJ, Yacoub Essa, Ugurbil Kamil, Wu-Minn HCP Consortium, et al. The WU-Minn Human Connectome Project: An overview. NeuroImage, 80:62–79, 2013.
• [31] Yu Fisher and Koltun Vladlen. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2015.
• [32] Zenke Friedemann, Poole Ben, and Ganguli Surya. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995, 2017.
