Bioinformatics
. 2022 Jun 28;38(16):4011–4018. doi: 10.1093/bioinformatics/btac431

Outlier detection for multi-network data

Pritam Dey 1, Zhengwu Zhang 2, David B Dunson 3
Editor: Hanchuan Peng
PMCID: PMC9890313  PMID: 35762974

Abstract

Motivation

It has become routine in neuroscience studies to measure brain networks for different individuals using neuroimaging. These networks are typically expressed as adjacency matrices, with each cell containing a summary of connectivity between a pair of brain regions. There is an emerging statistical literature describing methods for the analysis of such multi-network data in which nodes are common across networks but the edges vary. However, there has been essentially no consideration of the important problem of outlier detection. In particular, for certain subjects, the neuroimaging data are of such poor quality that the network cannot be reliably reconstructed. For such subjects, the resulting adjacency matrix may be mostly zero or exhibit a bizarre pattern not consistent with a functioning brain. These outlying networks may serve as influential points, contaminating subsequent statistical analyses. We propose a simple Outlier DetectIon for Networks (ODIN) method relying on an influence measure under a hierarchical generalized linear model for the adjacency matrices. An efficient computational algorithm is described, and ODIN is illustrated through simulations and an application to data from the UK Biobank.

Results

ODIN was successful in identifying moderate to extreme outliers. Removing such outliers can significantly change inferences in downstream applications.

Availability and implementation

ODIN has been implemented in both Python and R and these implementations along with other code are publicly available at github.com/pritamdey/ODIN-python and github.com/pritamdey/ODIN-r, respectively.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

In recent years there has been substantial interest in brain functional and structural connectomics. Although the proposed methodology is general, we are motivated by structural connectomes, consisting of the collection of white matter fiber bundles connecting various regions of the brain. Recent advances in non-invasive brain imaging have made available large brain imaging datasets including the Human Connectome Project (Van Essen et al., 2013), the Adolescent Brain Cognitive Development Study (Casey et al., 2018) and the UK Biobank (Miller et al., 2016). Relying on the availability of such datasets, there is a large literature on statistical analysis of brain networks developing models for characterizing inter-individual variation (Aliverti and Durante, 2019; Durante et al., 2017; Wang et al., 2019), ANOVA-like hypothesis testing (Ginestet et al., 2017) and network regression (Wang et al., 2017; Zhang et al., 2019, 2022).

An important issue to consider in brain network analysis is reconstruction error from available neuroimaging data. Even if image acquisition is conducted correctly for a subject who remains still in the scanner, there is inevitably some amount of error in conducting the inverse problem of inferring the white matter fiber bundles based on indirect measurements (Fornito et al., 2013). Moreover, due to the long preprocessing pipeline in structural connectome reconstruction, a small measurement error in the raw imaging data can be amplified and yield substantial errors in the inferred structural connectome. In this article, instead of being concerned with small measurement errors that are difficult to distinguish from actual biological variation, we focus on identifying outlying brain networks that are almost certainly attributable to measurement errors in reconstructing the connectome. Such gross errors can potentially arise due to problems during the data collection phase; for example, due to non-negligible movement of the patient in the scanner (Baum et al., 2018) or mistakes in preprocessing large amounts of data using complex structural connectome reconstruction pipelines (Zhang et al., 2018).

Some examples of outlying brain connectomes are shown in Figure 1. A major concern is that such outlying networks may serve as influential observations in statistical analyses of brain connectomes, leading to degradation of the results. For example, suppose we are attempting to represent variation across subjects in their brain connectomes via an embedding, such as the PCA method of Zhang et al. (2019). Including outlying brains can lead to poor quality embeddings, as PCA needs to characterize not just normal biological variation in the brain connectomes but also the outlying brains. Analyses seeking to infer relationships between brain networks and human traits can similarly be contaminated by outliers, obscuring true relationships. Current practice tends to either ignore outliers or to apply informal quality controls, such as visual examination of connectome matrices (Alfaro-Almagro et al., 2018). Such checks are overly subjective and time-consuming for large datasets. Potentially, one can instead apply existing statistical outlier detection methods (Hawkins, 1980) to low-dimensional summary statistics of the connectome (for example, graph metrics such as node degree, average path length and eigenvector centrality). However, such an approach will be highly sensitive to the summaries chosen, and may miss certain outlying networks or remove non-outlying networks.
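To make the limitation concrete, the summary-statistic approach just described is easy to sketch. The helper below is a hypothetical illustration (not part of ODIN): it flags networks whose edge count or degree profile has an extreme robust z-score. Both the choice of summaries and the 3.5 cutoff are arbitrary, which is exactly the sensitivity noted above.

```python
import numpy as np

def summary_stats(A):
    """Crude per-network summaries: edge count, mean and SD of node degree."""
    deg = A.sum(axis=1)
    return np.array([A.sum() / 2.0, deg.mean(), deg.std()])

def flag_by_summaries(networks, z_cut=3.5):
    """Flag networks whose robust (MAD-based) z-score on any summary exceeds z_cut."""
    S = np.array([summary_stats(A) for A in networks])
    med = np.median(S, axis=0)
    mad = np.median(np.abs(S - med), axis=0) + 1e-12
    z = 0.6745 * np.abs(S - med) / mad  # standard MAD-based z-score
    return np.where((z > z_cut).any(axis=1))[0]
```

A network with a grossly abnormal edge count will be flagged, but a network with a bizarre connectivity pattern and a typical edge count can slip through, motivating a model-based measure of influence instead.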

Fig. 1.


Brain fiber streamlines from diffusion MR imaging (top row) with corresponding binary adjacency matrices (bottom row) of some subjects from the UK Biobank dataset. The streamlines are visualized using TrackVis (Wang and Wedeen, 2007) and colored by orientation (i.e. left to right: red, anterior to posterior: green, superior to inferior: blue). In these tractography diagrams, the anterior side of the brain is facing inwards into the page. The matrices in the bottom row are the ones used by ODIN. These adjacency matrices are not directly available from UK Biobank. We preprocessed the raw data using the PSC pipeline (Zhang et al., 2018) to extract these adjacency matrices. In these matrices, black indicates presence of at least one fiber connecting the two corresponding regions and white represents absence of such fibers. The brain network represented by the streamlines and adjacency matrix in (a) is a typical non-outlier. The networks shown in (b)–(e) are outliers of various kinds selected from among the outliers detected by ODIN (A color version of this figure appears in the online version of this article.)

Ideally, we could apply an outlier detection method specifically designed for multi-network data, to identify individual networks that are fundamentally different than the bulk of the networks in a dataset. However, to our knowledge, there are no such methods available in the literature. To bridge this gap, we propose a simple model-based Outlier DetectIon for Networks (ODIN) method relying on a hierarchical logistic regression model and a measure of the statistical influence of each subject’s brain connectome on the parameter estimates in this model. The logistic model includes prior knowledge of the anatomical structure of the brain. Although our initial approach is for binary adjacency matrices conveying information about connectivity/no-connectivity of pairs of ROIs, ODIN can be trivially extended to weighted adjacency matrices by using alternative generalized linear models (GLMs) in place of logistic regression. The model can also include covariate information about the subject, such as age and gender. A key advantage of ODIN is computational scalability to massive datasets containing tens of thousands of connectomes.

We apply ODIN to 18 083 structural connectomes extracted from a large cohort, the UK Biobank data, which has a number of moderate to extreme outliers. Figure 1 shows a non-outlier and four different outliers from this dataset. Our objective is to detect outliers, and not to use the logistic model for inference directly. ODIN can be used in a data cleaning step to remove outliers, or can provide numerical influence scores that can be used for down-weighting overly influential observations in subsequent robust analyses. We will demonstrate the former approach in simulation studies and an application involving the UK Biobank data (henceforth referred to as UKB data).

2 Methods

It is common practice in the brain network analysis literature to partition the brain into small regions based on anatomical considerations (Desikan et al., 2006). These small regions are known as regions of interest (ROIs). The brain structural connectome is thus summarized as a network with these ROIs as nodes and the fibers connecting these ROIs as the edges. We use this framework for our outlier detection method, ODIN. Let $A_i$ denote the binary adjacency matrix of an undirected brain network on $V$ ROIs (assuming no self connections) for subject $i$, $i = 1, 2, \ldots, N$. The symmetry of the $A_i$'s coupled with the lack of self connections allows us to denote our networks in terms of the vectors $a_i = (A_i[2,1], A_i[3,1], A_i[3,2], \ldots, A_i[V,V-1])^T = (a_{i1}, a_{i2}, \ldots, a_{iL})^T$. Here, $L = V(V-1)/2$. Each $l \in \{1, 2, \ldots, L\}$ represents an edge in the network connecting two ROIs, say $u$ and $v$. For each subject $i$ and edge $l$ representing the pair of ROIs $\{u, v\}$, we let $\mathrm{hemi}(u)$ and $\mathrm{lobe}(u)$ represent the hemisphere and the lobe location of the ROI $u$, respectively. Similarly, $\mathrm{hemi}(v)$ and $\mathrm{lobe}(v)$ do the same for $v$. We model:

$a_{il} \sim \mathrm{Bernoulli}(\pi_{il}), \qquad \mathrm{logit}(\pi_{il}) = z_l + \beta_{i,\mathrm{hemi}(u),\mathrm{hemi}(v)} + \beta_{i,\mathrm{lobe}(u),\mathrm{lobe}(v)}. \quad (1)$

Model (1) characterizes variation across subjects in their brain networks in a semi-parametric manner. The baseline model for the connection probabilities, $\pi_{0l} = 1/(1 + e^{-z_l})$, is nonparametric in allowing fully flexible variation in these probabilities across different pairs of ROIs. This avoids imposing a particular model on the population-averaged connection probabilities between different regions of the brain.
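As an aside on notation, the flattening $A_i \mapsto a_i$ described above can be sketched in a few lines (the function name is ours):

```python
import numpy as np

def vectorize_network(A):
    """Stack the strictly lower triangle of a symmetric adjacency matrix
    into the edge vector a_i = (A[2,1], A[3,1], A[3,2], ..., A[V,V-1])."""
    V = A.shape[0]
    rows, cols = np.tril_indices(V, k=-1)  # row-major lower-triangle order
    return A[rows, cols]                   # length L = V*(V-1)/2
```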

We then allow a more restricted type of variation across subjects, with subject-specific deviations from the baseline edge probabilities depending on hemisphere and lobe locations of the two brain ROIs forming the edge. The coefficient $\beta_{i,\mathrm{hemi}(u),\mathrm{hemi}(v)}$ allows connection probabilities between regions $u$ and $v$ in subject $i$'s brain to vary depending on which hemispheres $u$ and $v$ belong to. Similarly, the coefficient $\beta_{i,\mathrm{lobe}(u),\mathrm{lobe}(v)}$ allows the connection probability to vary depending on the lobe locations of $u$ and $v$. Since there are only two hemispheres and the number of lobes is relatively small, the number of subject-specific parameters for each subject is small. Due to the symmetric nature of the networks, we may assume $\beta_{i,h_1,h_2} = \beta_{i,h_2,h_1}$ and $\beta_{i,l_1,l_2} = \beta_{i,l_2,l_1}$ for every choice of hemispheres $h_1$ and $h_2$ and lobes $l_1$ and $l_2$. We collect all these lobe and hemisphere parameters for subject $i$ into a vector $\beta_i$. These parameters $\beta_i$ allow for variation across subjects in terms of lower or higher numbers of connections between specific lobes and hemispheres.

Using the notation introduced above, we can write $\mathrm{logit}(\pi_{il}) = z_l + x_l^T \beta_i$ for a suitable vector $x_l$ of 0's and 1's. We stack these vectors $x_l$ into a matrix $X = (x_1, x_2, \ldots, x_L)^T$. Further, we represent the vector of parameters $(z_1, z_2, \ldots, z_L)^T$ as $Z$ and the vector $(\pi_{i1}, \pi_{i2}, \ldots, \pi_{iL})^T$ as $\pi_i$.
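One plausible way to form the indicator vectors $x_l$ is sketched below: one indicator per unordered hemisphere pair and one per unordered lobe pair. This coding is our own illustration and may differ in detail from the ODIN packages.

```python
import numpy as np
from itertools import combinations_with_replacement

def build_design(hemi, lobe):
    """Build the L x p 0/1 design matrix X from per-ROI hemisphere and
    lobe labels, with one indicator per unordered hemisphere pair and
    one per unordered lobe pair (an illustrative coding)."""
    V = len(hemi)
    h_pairs = list(combinations_with_replacement(sorted(set(hemi)), 2))
    l_pairs = list(combinations_with_replacement(sorted(set(lobe)), 2))
    rows = []
    for u in range(1, V):          # edge order (2,1), (3,1), (3,2), ...
        for v in range(u):
            x = np.zeros(len(h_pairs) + len(l_pairs))
            x[h_pairs.index(tuple(sorted((hemi[u], hemi[v]))))] = 1.0
            x[len(h_pairs) + l_pairs.index(tuple(sorted((lobe[u], lobe[v]))))] = 1.0
            rows.append(x)
    return np.array(rows)
```

Each row of $X$ then has exactly two ones: one marking the hemisphere pair of the edge and one marking its lobe pair.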

Due to the large number of parameters in the baseline component and to allow cases in which certain edges are connected (or disconnected) for all subjects in the sample, we add a ridge (or $\ell_2$) regularization to the $z_l$'s. Estimation of the parameters $Z$ and the $\beta_i$'s then proceeds by maximizing the following penalized log-likelihood:

$\mathcal{L}(Z, \{\beta_j\}_{j=1}^N) = \frac{1}{N} \sum_{i=1}^N \sum_{l=1}^L \left[ a_{il}\eta_{il} - \log(1 + e^{\eta_{il}}) \right] - \frac{\lambda}{2} \|Z\|_2^2, \quad (2)$

where $\eta_{il} = \mathrm{logit}(\pi_{il}) = z_l + x_l^T \beta_i$.
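The objective in (2) is straightforward to evaluate. The sketch below (vectorized over subjects; names are ours) computes it for binary edge data, using a numerically stable form of $\log(1 + e^{\eta})$:

```python
import numpy as np

def penalized_loglik(Z, beta, a, X, lam):
    """Penalized log-likelihood (2).
    Z: (L,) baseline logits; beta: (N, p) subject-level effects;
    a: (N, L) binary edge indicators; X: (L, p) 0/1 design matrix;
    lam: ridge penalty on Z."""
    eta = Z[None, :] + beta @ X.T                       # eta[i, l] = z_l + x_l^T beta_i
    ll = (a * eta - np.logaddexp(0.0, eta)).sum() / a.shape[0]
    return ll - 0.5 * lam * np.dot(Z, Z)
```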

The Hessian matrix of the objective function has a block arrowhead structure, which can be exploited to obtain a fast algorithm for estimation. While the estimation step can be done using a variety of different algorithms, we recommend a slightly modified version of the standard Majorization-Minimization (MM) algorithm for logistic regression to find the parameter values that maximize Equation (2) (technically, we minimize its negative). The algorithm is discussed in detail in Supplementary Section S1.1. This algorithm is chosen as it scales well with the sample size $N$ and the number of edges $L$: each iteration has linear time complexity in both $N$ and $L$. This is verified in the simulation section.

In exploring methodology for outlier detection of brain networks, we considered a wide variety of complex and flexible models for characterizing variation across individuals in their brain structure including a hierarchical latent space model. However, computational time was a substantial barrier to implementation of more elaborate models. Our objective being the identification of outlying networks as a step in the cleaning and preprocessing of the data before subsequent statistical analyses, we wanted our model to be simple and computationally efficient while still being effective in detecting outliers. Our relatively simple logistic model did an excellent job in fulfilling all of these objectives. Indeed, if a model is too flexible in characterizing inter-individual variability, then it can end up not flagging any of the networks as outliers. To identify potentially outlying subjects through our model, we consider appropriate influence measures.

One way of measuring the influence of subject $i$ on the penalized maximum likelihood estimates (pMLEs) for the parameters in Equation (2) is by the change in the pMLE of $Z$ when subject $i$ is dropped from the sample. More concretely, if the pMLE of $Z$ is denoted by $\hat{Z}$ and the pMLE of $Z$ recalculated after dropping subject $i$ from the sample is denoted by $\hat{Z}_{-i}$, then an influence measure may be $\|\hat{Z}_{-i} - \hat{Z}\|$, which is analogous to DFBETA, a popular influence measure in regression analysis. However, since the maximizer of Equation (2) does not have an analytical form and has to be numerically calculated, it is not computationally feasible to compute the pMLE exactly under omission of each subject when the number of subjects is large. Instead, we suggest, for every subject $i$, performing just one step of the Newton-Raphson algorithm for calculation of the pMLE after dropping subject $i$, with the already calculated pMLE (with all subjects included) as the initial value. We do this since dropping subject $i$ leaves the resulting log-likelihood close to the original log-likelihood in (2). We denote this approximation by $Z_{-i}^{(1)}$. Hence, our influence measure is simply the Euclidean distance between $\hat{Z}$ and $Z_{-i}^{(1)}$. The following result gives an explicit formula in terms of the overall pMLEs of $Z$ and the $\beta_i$'s.

Proposition 1.

Suppose the pMLE obtained by maximizing (2) is $(\hat{Z}, \hat\beta_1, \ldots, \hat\beta_N)$. Then the update in the common parameters $Z$ in the first step of the Newton-Raphson algorithm for re-estimation of the pMLE after deletion of the $i$th unit, starting from $(\hat{Z}, \hat\beta_1, \ldots, \hat\beta_{i-1}, \hat\beta_{i+1}, \ldots, \hat\beta_N)$, is

$Z_{-i}^{(1)} - \hat{Z} = \left[ \lambda(N-1)I + \sum_{j \ne i} W_j - \sum_{j \ne i} B_j^T Q_j^{-1} B_j \right]^{-1} \left[ (a_i - \hat\pi_i) - \lambda\hat{Z} \right] \quad (3)$

where $W_j$ is a diagonal matrix with entries $[\pi_{j1}(1-\pi_{j1}), \pi_{j2}(1-\pi_{j2}), \ldots, \pi_{jL}(1-\pi_{jL})]$, $B_j = X^T W_j$, and $Q_j = X^T W_j X$. These are all evaluated at $(\hat{Z}, \hat\beta_1, \ldots, \hat\beta_{i-1}, \hat\beta_{i+1}, \ldots, \hat\beta_N)$.

We show a proof of the above proposition in Supplementary Section S1.2. The formula for $Z_{-i}^{(1)} - \hat{Z}$ given in Proposition 1 needs one more approximation to make it computationally feasible. The matrix $[\lambda(N-1)I + \sum_{j \ne i} W_j - \sum_{j \ne i} B_j^T Q_j^{-1} B_j]$ depends on $i$, and hence we would need to calculate and invert this huge matrix for each subject. We can avoid that by approximating it instead by $\Gamma = \gamma_N [\lambda N I + \sum_{j=1}^N W_j - \sum_{j=1}^N B_j^T Q_j^{-1} B_j]$, where $\gamma_N = (N-1)/N$. This solves the problem, as now we only need to invert $\Gamma$ once. After making these approximations, we obtain the following influence measure for brain networks:

$\mathrm{IM}_1(i) = \left\| \Gamma^{-1} \left[ (a_i - \hat\pi_i) - \lambda\hat{Z} \right] \right\|_2 \quad (4)$
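Under these approximations, $\mathrm{IM}_1$ reduces to dense linear algebra. The sketch below (our own naming; practical only for moderate $L$, since it forms $\Gamma$ explicitly as an $L \times L$ matrix) follows (4) directly:

```python
import numpy as np

def influence_im1(a, X, Z_hat, beta_hat, lam):
    """IM1 from (4): one-step approximation to the shift in the pMLE of Z
    when each subject is deleted, using the shared matrix Gamma."""
    N, L = a.shape
    eta = Z_hat[None, :] + beta_hat @ X.T
    pi = 1.0 / (1.0 + np.exp(-eta))            # fitted probabilities pi[j, l]
    W_sum = np.zeros(L)                        # accumulates sum_j diag(W_j)
    BQB_sum = np.zeros((L, L))                 # accumulates sum_j B_j^T Q_j^{-1} B_j
    for j in range(N):
        w = pi[j] * (1.0 - pi[j])              # diagonal of W_j
        W_sum += w
        B = X.T * w                            # B_j = X^T W_j, shape (p, L)
        Q = B @ X                              # Q_j = X^T W_j X, shape (p, p)
        BQB_sum += B.T @ np.linalg.solve(Q, B)
    gamma_N = (N - 1) / N
    Gamma = gamma_N * (lam * N * np.eye(L) + np.diag(W_sum) - BQB_sum)
    Gamma_inv = np.linalg.inv(Gamma)
    scores = np.empty(N)
    for i in range(N):
        scores[i] = np.linalg.norm(Gamma_inv @ ((a[i] - pi[i]) - lam * Z_hat))
    return scores
```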

A drawback of $\mathrm{IM}_1(i)$ as the sole measure of influence is that it does not take into account the fact that the individual-specific parameters, $\beta_i$, can still be radically different for some networks than others. Those networks may not appear influential to the model, even though they can still be outliers. To flag such networks, we need another influence measure along with $\mathrm{IM}_1(i)$. A suitable candidate that is easy to compute is the relative distance of the estimated $\hat\beta_i$ from the bulk of the estimated $\hat\beta_j$'s for the entire sample. With this in mind, we use the Mahalanobis distance of $\hat\beta_i$ from the mean of the $\hat\beta_j$'s, $\bar\beta = \sum_{i=1}^N \hat\beta_i / N$, as our second influence measure:

$\mathrm{IM}_2(i) = \left[ (\hat\beta_i - \bar\beta)^T S(\hat\beta)^{-1} (\hat\beta_i - \bar\beta) \right]^{1/2} \quad (5)$

where $S(\hat\beta) = \sum_{i=1}^N (\hat\beta_i - \bar\beta)(\hat\beta_i - \bar\beta)^T / N$. We declare the brain network of subject $i$ to be an outlier if either $\mathrm{IM}_1(i)$ or $\mathrm{IM}_2(i)$ is large.
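Since $\mathrm{IM}_2$ is a standard Mahalanobis distance, it can be computed in a few lines (a sketch with our own names):

```python
import numpy as np

def influence_im2(beta_hat):
    """IM2 from (5): Mahalanobis distance of each subject's estimated
    beta from the sample mean over all subjects."""
    N = beta_hat.shape[0]
    centered = beta_hat - beta_hat.mean(axis=0)
    S = centered.T @ centered / N              # sample covariance S(beta_hat)
    S_inv = np.linalg.inv(S)
    # Quadratic form (b_i - bbar)^T S^{-1} (b_i - bbar) for each i
    return np.sqrt(np.einsum('ij,jk,ik->i', centered, S_inv, centered))
```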

Finally, we need to define thresholds for classifying whether the influence measures $\mathrm{IM}_1(i)$ and $\mathrm{IM}_2(i)$ are 'large'. For this, we use a data-based scheme in which we plot the quantiles of the influence measure in question. Since most of the subjects have small influence measures, the graph remains flat for most of the lower quantiles and then starts to sharply increase as we get to the higher quantiles. The 'elbow' of this plot, i.e. the point where the graph starts increasing sharply, can be used as a threshold. We use the kneedle algorithm (Satopaa et al., 2011) to calculate the elbow point. We describe this thresholding method in more detail in Supplementary Section S2.
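The kneedle algorithm itself is described in Satopaa et al. (2011); as a rough stand-in, the sketch below finds the point of the sorted-score curve farthest below the chord joining its endpoints. This is our simplification for illustration, not the exact procedure used in the paper.

```python
import numpy as np

def elbow_threshold(scores):
    """Pick a threshold at the 'elbow' of the sorted influence scores:
    the point of the quantile curve farthest below the straight line
    (chord) joining its first and last points."""
    y = np.sort(np.asarray(scores, dtype=float))
    x = np.linspace(0.0, 1.0, y.size)
    chord = y[0] + (y[-1] - y[0]) * x
    return y[np.argmax(chord - y)]
```

Scores above the returned threshold are then flagged as outliers.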

The time complexity for calculation of the influence measures $\mathrm{IM}_1(i)$ and $\mathrm{IM}_2(i)$ for all networks $i$ is linear in the sample size $N$ and quadratic in the number of edges $L$. We verify this empirically in the simulation section. Our simulation studies also demonstrate good separation of outliers and non-outliers in terms of these influence measures.

3 Simulation study

In this section, we carry out a number of simulation experiments to evaluate the performance of our outlier detection algorithm, ODIN, and to understand the computational complexity of the associated steps.

3.1 Computational complexity

As summarized in Section 2, ODIN has two major steps: (i) estimation of the model parameters $Z$ and the $\beta_i$'s and (ii) calculation of the influence measures $\mathrm{IM}_1(i)$ and $\mathrm{IM}_2(i)$ for each $i$. In this section, we perform simulation experiments to study how the sample size $N$ and the number of edges $L$ impact computational time for these two steps (Fig. 2).

Fig. 2.


Run-time (in seconds) for (a) each iteration of the estimation algorithm with respect to sample size, $N$; (b) calculation of the influence measures with respect to sample size, $N$; (c) each iteration of the estimation algorithm with respect to the number of edges, $L = V(V-1)/2$ and (d) calculation of the influence measures with respect to the number of edges, $L$. The first three are linear and the last one is quadratic

The first step implements an iterative algorithm, so we investigate the effect of the number of edges, $L$, and the sample size, $N$, on the run-time of each iteration. We conduct two sets of simulations, one keeping $L$ fixed and varying $N$ and one keeping $N$ fixed and varying $L$. In these simulations, we generate $Z$ and the $\beta_i$'s in our model from the standard Cauchy and Gaussian distributions, respectively, and use these to generate the adjacency matrices from our model. For each of these cases, we run 200 iterations of our algorithm and plot the average run-time per iteration. As noted in Section 2, each iteration of the model fitting algorithm has linear run-time in both $N$ and $L$.

The next step is calculation of the influence measures. Again we run two sets of simulations as in the previous paragraph. In each simulation we generate the data as before, iterate our estimation algorithm until convergence and calculate the influence measures. This time the run-time is recorded only for the calculation of the influence measures. The plots show that the run-time is linear in the sample size $N$ and quadratic in $L$, as stated in Section 2. This quadratic complexity in $L$ does not make ODIN computationally prohibitive for the most popular atlases, as they rely on a moderate but not large number of ROIs.

3.2 Performance in detecting outliers

The goal in this section is to assess the performance of ODIN in terms of detection of outliers with varying levels of extremeness. We do this by generating synthetic datasets containing outliers and non-outliers in two different ways. For both these approaches we provide sensitivity (true positive rate) and specificity (true negative rate) values.

3.2.1 Simulation from model (1)

We simulate 500 synthetic binary brain networks having 70 ROIs each, with 2 hemispheres and 5 lobes within each hemisphere. As data from the UKB contain 68 ROIs using the Desikan atlas, our simulated data are similar in dimensionality. We use model (1) to simulate data by generating the vectors $Z$ and the $\beta_i$'s from standard Cauchy and Gaussian distributions, respectively. We simulate 'outliers' in the sample by randomly flipping a fixed percentage of the edges in 10% of the generated adjacency matrices. We then apply ODIN to the simulated dataset.
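The corruption step can be sketched directly on the edge vectors (function and argument names are ours):

```python
import numpy as np

def flip_edges(a, prop, rng):
    """Corrupt a binary edge vector by flipping a fixed proportion of its
    entries, as in the simulation that creates synthetic outliers."""
    a = a.copy()
    idx = rng.choice(a.size, size=int(prop * a.size), replace=False)
    a[idx] = 1 - a[idx]
    return a
```

Applying this to a randomly chosen 10% of the simulated networks yields the synthetic outliers.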

Repeating this process many times for various percentages of edges flipped and noting the sensitivity (true positive rate) and specificity (true negative rate) gives us Table 1. Figure 3 shows box plots of the influence measures for non-outliers and outliers for some choices of percentages of edges flipped. As we can see, even for changes in just 7% of the edges, we start to notice that the influence measure tends to separate outliers from non-outliers. Table 1 also shows that sensitivity and specificity values are very close to 100% indicating our method’s efficiency in outlier detection.

Table 1.

Sensitivity and specificity of the outlier detection algorithm ODIN for network data simulated from model (1)

Proportion of flipped edges Average sensitivity (5 repetitions) Average specificity (5 repetitions)
1% 93% 94%
2% 98.2% 95.6%
7% 100% 97.1%
10% 100% 97.1%
15% 100% 98%
Fig. 3.


Box plots of $\mathrm{IM}_1(i)$ for outliers and non-outliers for data generated from model (1), with 'outliers' simulated by flipping a fixed proportion of edges for some of the networks. The three panels are for 1%, 5% and 10% flipped edges, respectively. This demonstrates that as outliers become more severe, ODIN can more easily distinguish outliers from non-outliers

3.2.2 Simulation using tensor network PCA

To assess how well ODIN can detect outliers in practice, it is important to also measure performance under a very different generative model than (1). For this, we choose Tensor Network Principal Component Analysis (TN-PCA) (Zhang et al., 2019) as the data generation scheme. Briefly, TN-PCA is an extension of standard principal component analysis (PCA) to the case of multi-network data. Just as standard PCA reduces dimensionality of vector valued data by embedding the data points in a lower dimensional space, TN-PCA does the same for data in the form of symmetric adjacency matrices. Essentially, it minimizes the quantity $\sum_{i=1}^N \| A_i - \sum_{k=1}^K \lambda_{ik} v_k v_k^T \|_2^2$, where the $v_k$'s are unit vectors and are pairwise orthogonal. The vector $\lambda_i = (\lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{iK})$ can be treated as an embedding of the adjacency matrix $A_i$ in $K$-dimensional Euclidean space. In this description, we used slightly different notation than the original paper.

Since TN-PCA is not explicitly a generative model, but rather an embedding method, we use the following steps to simulate data which resemble the UKB brain networks. First we take a small subsample of size 500 from the UKB data, with this subsample not containing any outliers (based on ODIN). Then we perform TN-PCA on the adjacency matrices from that subsample. This gives us vectors $v_k$ for $1 \le k \le K$, and $\lambda_i$ for $1 \le i \le 500$. We modify these $\lambda_i$ as $\tilde\lambda_i = \lambda_i + \delta_i \epsilon_i$, where $\delta_i = 1$ for a randomly selected 10% of the $i$'s and $\delta_i = 0$ for the rest, and $\epsilon_i \sim \mathrm{Normal}(0, \sigma^2 I)$. We then construct adjacency matrices $\tilde{A}_i$ by first calculating $S_i = \sum_{k=1}^K \tilde\lambda_{ik} v_k v_k^T$ and then setting each entry in $\tilde{A}_i$ to 0 or 1, whichever is closer to the corresponding entry of $S_i$.
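These simulation steps can be sketched as follows (names are ours; a real run would take the $\lambda_i$'s and $v_k$'s from an actual TN-PCA fit of the UKB subsample):

```python
import numpy as np

def perturb_and_rebuild(lams, vs, sigma, out_frac, rng):
    """Given TN-PCA scores lams (N, K) and components vs (K, V), add
    Gaussian noise to a random out_frac of the embeddings and rebuild
    binary adjacency matrices by rounding S_i = sum_k lam_ik v_k v_k^T."""
    N, K = lams.shape
    delta = np.zeros(N, dtype=bool)
    delta[rng.choice(N, size=int(out_frac * N), replace=False)] = True
    lams = lams + delta[:, None] * rng.normal(0.0, sigma, size=lams.shape)
    S = np.einsum('ik,kv,kw->ivw', lams, vs, vs)  # S_i = sum_k lam_ik v_k v_k^T
    return (S > 0.5).astype(int), delta           # round each entry to 0 or 1
```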

In this way of simulating, the 10% of the subjects for which $\delta_i = 1$ can be thought of as 'outliers'. We run ODIN on this sample and repeat this for many choices of $\sigma$, the standard deviation of the Gaussian noise, and note the sensitivity (true positive rate) and specificity (true negative rate) in Table 2. These results suggest that ODIN is quite efficient in detecting outliers and can easily distinguish them from non-outliers.

Table 2.

Sensitivity and specificity of the outlier detection algorithm ODIN for network data simulated using TN-PCA

SD of Gaussian noise Average sensitivity (5 repetitions) Average specificity (5 repetitions)
0.010 71% 91.2%
0.015 97% 92.7%
0.017 98% 92.6%
0.020 100% 93%
0.025 100% 92.4%

4 Application to UK Biobank data

We applied ODIN to a sample of 18 083 brain networks from the UK Biobank (UKB) dataset. We use the Desikan et al. (2006) atlas-based ROI representation of the brain networks. Although this dataset contains information on hundreds of thousands of subjects, we focus on data for the 18 083 individuals having relevant data to construct adjacency matrices for structural brain connection networks. Of these, we detected 1931 (around 10.6%) networks as outliers.

4.1 Exploratory view of outliers versus non-outliers

We have already displayed some of the detected outliers in Figure 1 in Section 1. This figure suggests that one possible difference between outliers and non-outliers is the number of connected edges. Indeed, plotting the distributions of connected edges among outliers and non-outliers in Figure 4 confirms that many of the outliers have far fewer or far more connected edges than non-outliers.

Fig. 4.


An exploratory view of the difference between outliers and non-outliers detected by ODIN. In each of the figures, the blue graphs represent outliers and the orange graphs represent non-outliers. (a) Distribution of the number of edges connecting two ROIs located in different hemispheres. (b) Distribution of the number of edges connecting two ROIs located in the same hemisphere. In both figures, it is clear that these distributions are significantly different between outliers and non-outliers (A color version of this figure appears in the online version of this article.)

4.2 Impact of sample size on outlier detection

ODIN detects outliers based on a sample from a population of networks. This raises the question of the extent to which the sample size impacts performance in outlier detection. To investigate this question, we took stratified subsamples of several different sizes, with each subsample containing 10% outliers (as detected in the full sample) and 90% non-outliers. We applied ODIN separately to each of these subsamples. We were interested in how many of these outliers are also detected as outliers in the subsamples, and in how many non-outliers from the full sample are detected as outliers in a subsample. The results from this experiment are presented in Table 3. It appears that even over a wide range of subsample sizes, ODIN still classifies most subjects flagged as outliers in the full sample as outliers in the subsample, while classifying very few non-outliers from the full sample as outliers in a subsample.

Table 3.

This table demonstrates the dependence of outlier detection on sample size

Subsample size Number of outliers from full sample included in subsample Number of observations in subsample classified as outliers Number of subsample outliers which were outliers in full sample Number of non-outliers in full sample classified as outliers in subsample
200 20 19 14 (70%) 5 (1.04%)
500 50 52 42 (84%) 10 (2.22%)
1000 100 105 88 (88%) 17 (1.89%)
2000 200 200 186 (93%) 14 (0.08%)
5000 500 509 481 (96.2%) 28 (0.06%)

Note: ODIN was applied to stratified subsamples of the full sample to see if subjects detected as outliers in the full sample are also detected as outliers in subsamples of different sizes. The percentage values in the fourth column are the percentages of full-sample outliers classified as outliers in the subsample. The percentage values in the fifth column are the percentages of full-sample non-outliers classified as outliers in the subsample.

4.3 Case study: impact of removing outliers on a subsequent analysis using TN-PCA

In addition to the brain network data, the UKB data also contain measurements of traits of various types including substance use, physical characteristics, cognitive abilities, levels of physical activity and measures of mental health. Some of the trait data are missing for some subjects. In this section, we assess the impact of removing outliers on inferences relating brain networks to traits.

To illustrate how outliers can impact brain network inferences, including assessments of relationships with cognitive traits, we focus on the impact of outliers on the TN-PCA algorithm. We perform TN-PCA on both the full sample and the full sample with all outliers removed. We then reconstruct the networks from the TN-PCA components and study the differences in connectivity of each edge in the reconstructed networks between high and low scoring individuals for several different trait scores.

Doing this provides us with a picture of how different the TN-PCA vector representations of the networks are between high and low scorers with respect to a specific cognitive trait. We do this for three specific traits: numeric memory, symbol digit substitution and fluid intelligence.

4.3.1. Numeric memory

In this test, the participants were shown a 2-digit number to remember. The number then disappeared and, after a short while, they were asked to enter the number on the screen. The number became one digit longer each time they remembered correctly (up to 12 digits). The score is the maximum number of digits the participant remembered correctly. This test assesses short-term memory.

Figure 5 shows the difference between high and low scorers in this test. In this case, the pattern of differences in connectivity between the groups is essentially completely changed with removal of outliers. After removal of outliers, the resulting pattern strongly suggests that high and low scoring individuals differ primarily in the connectivity in the frontal lobe. This is in agreement with previous knowledge (Funahashi et al., 1993; Jacobsen, 1936; Pribram et al., 1952) that the frontal lobe, in particular the prefrontal cortex, plays a major role in short term memory.

Fig. 5.


Changes in brain connectivity with increasing numeric memory scores with (left) and without (right) outliers. The 20 edges having the most change in connectivity are shown to improve visualization. The dots on the circle boundary represent ROIs and are colour coded to indicate which lobe and hemisphere they belong to. For a more complete description of the ROI labels, see Supplementary Section S3

4.3.2. Symbol digit substitution

In this test, participants were presented with a series of grids in which symbols were to be matched to numbers according to a key presented on the screen. We consider the number of symbols correctly matched by each participant. Again, the differences in brain connectivity for the 10% lowest and 10% highest scorers are shown in Figure 6. In this case, removing outliers seems to significantly change the detected differences in connectivity between high and low scorers for this trait.

Fig. 6.


Changes in brain connectivity with increasing symbol digit substitution scores with (left) and without (right) outliers. The 20 edges having the most change in connectivity are shown to improve visualization. The dots on the circle boundary represent ROIs and are colour coded to indicate which lobe and hemisphere they belong to. For a more complete description of the ROI labels, see Supplementary Section S3

4.3.3. Fluid intelligence

Participants were asked 13 questions designed to assess ‘Fluid intelligence’, i.e. the capacity to solve problems that require logic and reasoning, independent of pre-acquired knowledge. Each participant was given 2 min to complete as many questions as possible. Figure 7 shows the differences in brain connectivity, with and without outliers, between the 10% lowest and 10% highest scorers on the fluid intelligence test. In this case, removing outliers has some impact on the inferred differences, but the effect is subtler than in the previous two examples. In both cases, fluid intelligence appears strongly related to connections in the frontal lobe, consistent with the literature on the subject (Geake and Hansen, 2005).

Fig. 7.

Changes in brain connectivity with increasing fluid intelligence scores with (left) and without (right) outliers. The 20 edges with the greatest change in connectivity are shown to aid visualization. The dots on the circle boundary represent ROIs and are colour-coded by the lobe and hemisphere to which each belongs. For a more complete description of the ROI labels, see Supplementary Section S3

The above three case studies demonstrate that outliers can greatly influence inferences on how brain networks relate to traits, to the extent of fundamentally altering scientific conclusions.

5 Discussion

In this article, we propose a fast, simple and effective approach to outlier detection for brain structural connectivity data represented as binary (i.e. unweighted) adjacency matrices. Our method, ODIN, first fits a simple logistic regression model containing both population- and subject-level parameters. Using the fitted model, ODIN then measures outlyingness by: (i) approximating the change in the pMLE of the population parameters when a subject is dropped and (ii) tracking how extreme the subject-specific parameters are. The resulting influence measures are thresholded to classify subjects as outliers or non-outliers. Through simulations and an application to a large dataset, we demonstrated that ODIN is fast and effective in detecting outliers. We also illustrated how removing outliers can fundamentally alter scientific conclusions from brain network studies.
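As a schematic illustration of the final thresholding step only (not the authors' implementation, which is available in the linked repositories), one way to separate large influence scores from the bulk is to find a knee in the sorted score curve, in the spirit of the Kneedle detector of Satopaa et al. (2011): take the point of maximum perpendicular distance from the chord joining the two endpoints of the curve, and flag every subject whose score exceeds it. The function names below are hypothetical.

```python
import numpy as np

def knee_threshold(scores):
    """Threshold at the 'knee' of the sorted score curve: the point of
    maximum perpendicular distance from the chord joining the first and
    last points (the idea behind the Kneedle detector)."""
    y = np.sort(np.asarray(scores, dtype=float))[::-1]  # largest first
    x = np.arange(len(y), dtype=float)
    # unit vector along the chord from the first point to the last
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord = chord / np.linalg.norm(chord)
    # perpendicular distance of each point from the chord (2D cross product)
    d = np.stack([x - x[0], y - y[0]], axis=1)
    dist = np.abs(d[:, 0] * chord[1] - d[:, 1] * chord[0])
    return y[np.argmax(dist)]

def flag_outliers(scores):
    """Flag subjects whose influence score exceeds the knee threshold."""
    thr = knee_threshold(scores)
    return np.asarray(scores, dtype=float) > thr
```

With a bulk of comparable scores and a few extreme ones, only the extreme subjects end up flagged; in practice the threshold would be applied to ODIN's two influence measures separately.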

There are several natural next steps building on the ODIN approach. An important extension is to generalize ODIN to accommodate weighted adjacency matrices. If the weights are modeled parametrically via a generalized linear model, then this extension is very straightforward. For example, if the weights consist of the number of fibers connecting each pair of ROIs, then we could use a negative binomial log-linear model in place of the logistic regression. If the weights are instead continuous, a Gaussian linear model could be used. This modification makes ODIN directly applicable to outlier detection for functional connectivity (FC). For example, FC networks are often expressed in weighted form as correlation matrices. Applying a Fisher transformation to these correlations, one can use the same linear predictor as in (1) but within a Gaussian linear model instead of a logistic regression.
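The FC preprocessing step described above can be sketched as follows; this is an illustrative fragment (the function name is hypothetical), extracting the off-diagonal correlations of an FC matrix and applying the Fisher z-transformation, z = arctanh(r), so that the edge weights are approximately Gaussian and suited to a Gaussian linear model:

```python
import numpy as np

def fisher_z_edges(corr):
    """Vectorize the upper triangle of an FC correlation matrix and
    apply the Fisher z-transformation, z = arctanh(r)."""
    corr = np.asarray(corr, dtype=float)
    iu = np.triu_indices_from(corr, k=1)        # off-diagonal edges only
    r = np.clip(corr[iu], -0.999999, 0.999999)  # guard against |r| = 1
    return np.arctanh(r)
```

The clipping guards against perfect correlations, for which arctanh diverges.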

A more ambitious next step is to detect outliers in SC and FC simultaneously, potentially leveraging a multivariate extension of model (1) that allows for SC–FC dependence. In addition, the distribution of the weights may not be well characterized by simple parametric GLMs, and it would be interesting to develop methods that accommodate flexible distributions for the weights.

Supplementary Material

btac431_Supplementary_Data

Acknowledgement

This research was conducted using the UK Biobank Resource under application number 51659.

Funding

This work was partially supported by United States National Institutes of Health grants R01MH118927 and R21AG066970.

Conflict of Interest: none declared.

Data availability

The data underlying this article were provided by UK Biobank under licence/by permission. Data will be shared on request to the corresponding author with permission of UK Biobank.

Contributor Information

Pritam Dey, Department of Statistical Science, Duke University, Durham, NC 27708, USA.

Zhengwu Zhang, Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.

David B Dunson, Department of Statistical Science, Duke University, Durham, NC 27708, USA.

References

1. Alfaro-Almagro F. et al. (2018) Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage, 166, 400–424.
2. Aliverti E., Durante D. (2019) Spatial modeling of brain connectivity data via latent distance models with nodes clustering. Stat. Anal. Data Min. ASA Data Sci. J., 12, 185–196.
3. Baum G.L. et al. (2018) The impact of in-scanner head motion on structural connectivity derived from diffusion MRI. NeuroImage, 173, 275–286.
4. Casey B. et al.; ABCD Imaging Acquisition Workgroup. (2018) The adolescent brain cognitive development (ABCD) study: imaging acquisition across 21 sites. Dev. Cogn. Neurosci., 32, 43–54.
5. Desikan R.S. et al. (2006) An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31, 968–980.
6. Durante D. et al. (2017) Nonparametric Bayes modeling of populations of networks. J. Am. Stat. Assoc., 112, 1516–1530.
7. Fornito A. et al. (2013) Graph analysis of the human connectome: promise, progress, and pitfalls. NeuroImage, 80, 426–444.
8. Funahashi S. et al. (1993) Dorsolateral prefrontal lesions and oculomotor delayed-response performance: evidence for mnemonic “scotomas”. J. Neurosci., 13, 1479–1497.
9. Geake J.G., Hansen P.C. (2005) Neural correlates of intelligence as revealed by fMRI of fluid analogies. NeuroImage, 26, 555–564.
10. Ginestet C.E. et al. (2017) Hypothesis testing for network data in functional neuroimaging. Ann. Appl. Stat., 11, 725–750.
11. Hawkins D.M. (1980) Identification of Outliers (Vol. 11). Springer, The Netherlands.
12. Jacobsen C. (1936) Studies of cerebral function in primates. I. The functions of the frontal association areas in monkeys. Comp. Psychol. Monogr., 13, 3–60.
13. Miller K.L. et al. (2016) Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat. Neurosci., 19, 1523–1536.
14. Pribram K.H. et al. (1952) Effects on delayed-response performance of lesions of dorsolateral and ventromedial frontal cortex of baboons. J. Comp. Physiol. Psychol., 45, 565–575.
15. Satopaa V. et al. (2011) Finding a “Kneedle” in a haystack: detecting knee points in system behavior. In: 2011 31st International Conference on Distributed Computing Systems Workshops, Minneapolis, MN, USA. IEEE Computer Society, Los Alamitos, CA, USA, pp. 166–171.
16. Van Essen D.C. et al.; WU-Minn HCP Consortium. (2013) The WU-Minn human connectome project: an overview. NeuroImage, 80, 62–79.
17. Wang R., Wedeen V.J. (2007) TrackVis.org. Martinos Center for Biomedical Imaging, Massachusetts General Hospital.
18. Wang L. et al. (2017) Bayesian network–response regression. Bioinformatics, 33, 1859–1866.
19. Wang L. et al. (2019) Common and individual structure of brain networks. Ann. Appl. Stat., 13, 85–112.
20. Zhang Z. et al. (2018) Mapping population-based structural connectomes. NeuroImage, 172, 130–145.
21. Zhang Z. et al. (2019) Tensor network factorizations: relationships between brain structural connectomes and traits. NeuroImage, 197, 330–343.
22. Zhang J. et al. (2022) Generalized connectivity matrix response regression with applications in brain connectivity studies. J. Comput. Graph. Stat., 0, 1–30.



Articles from Bioinformatics are provided here courtesy of Oxford University Press
