Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2014 Mar 3.
Published in final edited form as: Med Image Comput Comput Assist Interv. 2013;16(0 2):625–632. doi: 10.1007/978-3-642-40763-5_77

A New Sparse Simplex Model for Brain Anatomical and Genetic Network Analysis

Heng Huang 1, Jingwen Yan 2, Feiping Nie 1, Jin Huang 1, Weidong Cai 3, Andrew J Saykin 2, Li Shen 2,
PMCID: PMC3939613  NIHMSID: NIHMS502893  PMID: 24579193

Abstract

The Allen Brain Atlas (ABA) database provides comprehensive 3D atlas of gene expression in the adult mouse brain for studying the spatial expression patterns in the mammalian central nervous system. It is computationally challenging to construct the accurate anatomical and genetic networks using the ABA 4D data. In this paper, we propose a novel sparse simplex model to accurately construct the brain anatomical and genetic networks, which are important to reveal the brain spatial expression patterns. Our new approach addresses the shift-invariant and parameter tuning problems, which are notorious in the existing network analysis methods, such that the proposed model is more suitable for solving practical biomedical problems. We validate our new model using the 4D ABA data, and the network construction results show the superior performance of the proposed sparse simplex model.

1 Introduction

In recent research, the large-scale screenings for gene expression profiles across all different brain regions have been done by the Allen Institute for Brain Science, known as Allen Brain Atlas (ABA) project [1]. ABA provides spatially mapped large-scale gene expression database and enables quantitative comparison of data measurements across genes, anatomy, and phenotype. Detection of gene-anatomy association in brain structure is crucial for understanding brain function based on the molecular and genetic/genomic information. Particularly in the mouse or human brain where there are over thousands of genes expressed, systematic and comprehensive quantification of the expression densities in the whole three-dimensional (3D) anatomical context is critical.

The ABA database provides cellular resolution 3D expression patterns for both mouse and human (ongoing project). The image data are generated by in situ hybridization using gene-specific probes, followed by slide scanning and 3D image registration to the Allen Reference Atlas (ARA) [2] and expression segmentation [3]. The resulted mouse brain 4D expression data are in a set of spatially aligned 3D volumes of size 67 × 41 × 58. The genes’ values expressed in each voxel of the mouse brain are recorded.

The ABA contains information about the spatial distribution of genes within the human and mouse brain. Efficient and effective analysis of these high throughput data can shed light on the global function of mammalian central nervous system [4] and provide important information for understanding the connections of human brain anatomy, genome, and transcriptome. However, most previous research works are limited to retrieve correlation values between the spatial patterns of genes [5], or cluster the brain regions into co-expressed groups [6].

Network analysis provides a productive approach to analyze the high throughput biomedical and biological data. Transforming the data into a network framework offers distinct advantages for directly relating specific biomedical and biological interactions or outcome states with the network properties and dynamics. Thus, it is desired to model and analyze the spatial gene expression data of human brain in ABA in network format. Existing approaches to construct biomedical and biological networks usually have three deficiencies: (1) shift variant, i.e. when the data are shifted with a value, the network construction result will be totally different; (2) tedious parameter tuning is needed and not suitable for the practical applications; (3) the network edge weight has no probability interpretation to help the analysis. In this paper, to tackle these problems, we propose a novel sparse simplex learning model and applied it to ABA mouse brain data to create both anatomical and transcriptomic networks, which provide important insights into the global structure of the anatomy and transcriptome.

2 Related Work

The ABA brain microarray data provide the great opportunity to model the neuroanatomical and transcriptomic networks, where each vertex represents a spatial location or a gene and the edges between vertices encode the correlations between locations and genes. In recent related studies, the weighted gene co-expression network analysis (WGCNA) [7] based computational tools were mainly used to construct the co-expression network. More recently, Ji [8] used an approximate formulation for Gaussian graphical modeling [9] to model the mouse brain networks and showed more efficient and stable construction results. Given the input data X = [x1, · · ·, xn] ∈ ℜd×n, this approximation model calculates the edge weight matrix W ∈ ℜn×n (all values on the diagonal are “0”s) by solving a series of sparsity regularized regression problems. In this paper, we write matrices as capital letters and vectors as boldface lowercase letters. Given a matrix W = [wij], its i-th row and j-th column are denoted as wi and wj, respectively.

In [8], the weights of edges from vertex xi’s neighboring vertices to xi are learned by solving the standard sparse representation problem:

minαixi-X-iαi2+λαi1, (1)

where Xi = [x1, · · ·, xi−1, xi+1, · · ·, xn] is the data matrix obtained from X by removing the i-th data point, αi ∈ ℜ(n−1)×1 is a weight vector from wi by removing the i-th weight, which is zero. The network links are constructed by applying the thresholding value 0.5 to the edge weights. The above model constructs the network/graph using Lasso for variables selection [9]. However, this approach has three key deficiencies: 1) This method is not shift invariant, i.e., if data are shifted with an arbitrary value, such as xi = xi + t1, the network construction result will be totally different. 2) The parameter λ has to be tuned to get good results. Although [8] provided a strategy to learn this parameter, the strategy also depends on the link thresholding value. Thus, the network construction results are not robust as expected. 3) The edge weights cannot be interpreted as probabilities. To solve these deficiencies, we propose a new sparse simplex learning model to construct brain networks with non-parameter tuning, shift invariant, and probability interpretation advantages.

3 Methodology

3.1 Sparse Simplex Learning Model

Sparse learning models have been actively applied to solve problems in computational neuroscience [1014]. To effectively construct the brain networks, the sparse representation model can be utilized as in [8]. When we build the neuroanatomical and transcriptomic networks, we hope the edge weight has the probability meaning, which can directly tell us the link strength between two nodes. Thus, we add two constraints on the sparse representation model: αi ≥ 0 and αiT1=1, where 1 ∈ ℜ(n−1)×1 is a vector with all “1” as elements. The new objective will solve:

minαiX-iαi-xi22+λαi1,s.t.αi0,αiT1=1. (2)

After imposing these two constraints, the solutions αi will have the probability interpretations. The αi(j) is the edge weight between nodes i and j. Because jαi(j)=αiT1=1, αi(j) can be interpreted as the probability to have an edge between nodes i and j.

In the network construction, we hope the learning model is shift-invariant, such that the network constructions have small changes when the data have an arbitrary shift value. The shift-invariant property is important for practical biomedical applications, because the data collection processes are often effected by instruments and environment factors and the collected data may include a shifted value caused by these factors. Fortunately, after imposing the above two constraints, the new sparse learning model becomes shift-invariant.

When the data are shifted by a constant t, i.e., xk = xk + t1 for all k = 1, · · ·, n, the computed similarities between the pairs of nodes will be changed. The objective function becomes:

(X-i+t11T)αi-(xi+t1)22+λαi1. (3)

Because αiT1=1, the above objective can be written as:

X-iαi-xi22+λαi1, (4)

which is the original one. Thus, the new objective in (2) is shift-invariant.

More important, the constraints in problem (2) make the second regularization term as a constant. Thus, the problem (2) becomes

minαiX-iαi-xi22,s.t.αi0,αiT1=1. (5)

Thus, the new model has no parameter, such that it is suitable for biomedical and biological applications, in which we usually lack information/data to tune the parameter.

Because the constraints in problem (5) are the simplex formulation, we call the new method as the sparse simplex learning model. Note that the above constraints (1 ball constraints) indeed introduce sparse solution αi.

The ABA 4D data are large-scale with high-dimensionality. Thus, we need to derive the efficient optimization algorithm to solve the new objective in Eq. (5). It is more appropriate to apply the first-order methods, i.e., use function values and their (sub)gradient at each iteration. There are many first-order methods, including gradient descent, subgradient descent, and Nesterov’s optimal method [16]. In this paper, we use the accelerated projected gradient method to optimize Eq. (5).

3.2 Optimization Algorithm

When we use the accelerated projected gradient method to solve this problem, the critical step of the projected gradient method is to solve the following proximal problem:

minαi12αi-v22,s.tαi0,αT1=1. (6)

We write the Lagrangian function of problem (6) as:

12αi-v22-γ(αiT1-1)-λTαi, (7)

where γ is a Lagrangian multiplier and λ is a Lagrangian multiplier vector, both of which are to be determined. Suppose the optimal solution to the proximal problem (6) is α*, the associate Lagrangian coefficients are γ* and λ*. Then according to the KKT condition [17], we have the following equations:

{j,aij-vj-γ-λj=0j,αij0j,λj0j,αijλj=0 (8) (9) (10) (11)

where αij is the j-th scalar element of vector αi. Eq. (8) can be written as αij-vj-γ1-λj=0. According to the constraint 1Tαi=1, we have γ=1-1Tv-1Tλn. Thus, α=(v-11Tnv+1n1-1Tλn1)+λ.

Denoting λ¯=1Tλn and u=v-11Tnv+1n1, we can write α* = u+λ*λ̄*1. Thus, ∀j we have:

αij=uj+λj-λ¯. (12)

According to Eqs. (9)(12) we know uj+λj-λ¯=(uj-λ¯)+, here x+ = max(x, 0). Then we have

αj=(uj-λ¯)+. (13)

Therefore, we can obtain the optimal solution α* if we know λ̄*.

We write Eq. (12) as λj=αij+λ¯-uj. Similarly, according to Eqs.(9)–(11), we know λj=(λ¯-uj)+. Since v is a n − 1-dimensional vector, we have λ¯=1n-1j=1n-1(λ¯-uj)+. Defining a function as

f(λ¯)=1n-1i=1n-1(λ¯-uj)+-λ¯, (14)

such that f(λ̄*) = 0 and we can solve the root finding problem with Newton method to obtain λ̄*.

The convergence rate of our algorithm is O(1t2), where t is the number of iterations. The detailed proof can be found at [15, 16].

4 Experiments and Discussions

4.1 Experimental Results on ABA Data

In our experiment, we use the ABA cellular resolution 3D expression patterns in the male, 56-day-old C57BL mouse brain. The 4D spatial gene data are a 4D tensor 2980 × 67 × 41 × 58, in which the first index corresponds to genes, and the other three indices represent the rostral-caudal, dorsal-ventral and left-right spatial directions, respectively. The newest database provides 2980 genes which are slightly different to the data used in [8].

When we create the genetic network, each node is one gene of 2980 genes. In [8], the tensor factorization method was used to reduce the dimensionality. However, this is an improper process. Although the voxels on the boundary of brain have no gene values, the tensor factorization includes the values of these voxels into calculation. We used the PCA method to reduce the dimensionality. Although the number of voxels is very large, the number of genes is not large. The PCA calculation is still affordable. For each gene, we reduce its 3D tensor data to 25 × 15 × 20 and then concatenate its all values into a feature vector. The resulted 7500×2980 data matrix is used as input of sparse simplex learning model to construct the genetic network.

When we build the anatomical network, we directly use the 2980×67×41×58 tensor data. We have total 67 × 41 × 58 nodes in brain spatial structure and the gene values are features. The sparse simplex model is performed to construct the neuroanatomical network. Because the network is 3D, we cannot visualize the whole 3D network. Thus, in Figure 1, we select three slices of brain data (center slices on three directions) and plot the networks on them. Figure 1(a) shows the 20-th slice on the dorsal-ventral direction. Figure 1(b) plots the 29-th slice on the left-right direction. Figure 1(c) visualizes the 33-rd slice on the rostral-caudal direction. The region with the largest number of connections corresponds to the brain structure dentate gyrus. We don’t plot the genetic network here, because there is no spatial structure in genes. The genetic network cannot show meaningful visualization.

Fig. 1.

Fig. 1

We select and visualize the center slices of three directions of the 3D neuroanatomical network. (a) The 20-th slice on the dorsal-ventral direction. (b) The 29-th slice on the left-right direction. (c) The 33-rd slice on the rostral-caudal direction. The region with the largest number of connections corresponds to the brain structure dentate gyrus.

4.2 Model Evaluation Using Clustering Tasks

In above experimental results, we showed that the proposed sparse simplex model can efficiently construct both genetic and neuroanatomical networks. Because there is no ground-truth results for network constructions, we cannot directly compare the performance of our sparse simplex method. Thus, we use the clustering task results to compare our sparse simplex model to other graph construction methods. We use the sparse representation method [8] to construct graph and then perform Normalized Cut (NCut) and Self-Tuning Spectral Clustering (STSC) methods. After that, we build the graph using the proposed model and then perform the Normalized Cut (SSM+NCut). The clustering accuracy on six public computer vision benchmark image datasets are shown in Table 1. We also show the clustering results of K-means and NMF as baseline results. Although these data are not biomedical image data, we only use them for validation purpose because they have ground truth labels. In all results, our new sparse simplex model shows the promising graph/network construction results.

Table 1.

Clustering accuracy using different graph construction methods.

Datasets K-means NMF NCut STSC SSM+NCut
AR 0.133 0.143 0.158 0.130 0.324
AT&T 0.664 0.678 0.698 0.685 0.763
JAFFE 0.789 0.774 0.795 0.813 0.902
MNIST 0.641 0.636 0.647 0.693 0.796
PIE 0.229 0.241 0.234 0.186 0.325
UMIST 0.475 0.457 0.443 0.394 0.514

5 Conclusion

In this paper, we propose a novel sparse simplex learning model to construct the genetic and neuroanatomical networks using ABA 4D spatial gene patterns. Compared to the existing methods, the new model has three advantages: (1) it is shift-invariant such that the noise in data collection won’t dramatically effect the network construction; (2) it doesn’t require the parameter tuning, thus it is suitable for practical biomedical and biological applications; (3) it has probability interpretations on the resulted network weights, which can help the further data analysis. We validate the proposed model using the ABA mouse brain data and construct both genetic and anatomical networks. Our new model can also be applied to other biomedical network construction and analysis problems.

References

  • 1.Lein ES. Genome-wide atlas of gene expression in the adult mouse brain. Nature. 2007;445:168–176. doi: 10.1038/nature05453. [DOI] [PubMed] [Google Scholar]
  • 2.Dong HW. The Allen Reference Atlas: A Digital Color Brain Atlas of the C57BL/6J Male Mouse. 2009 [Google Scholar]
  • 3.Ng L, Pathak SD, Kuan C, Lau C, Dong H, Sodt A, Dang C, Avants B, Yushkevich P, Gee JC, Haynor D, Lein E, Jones A, Hawrylycz M. Neuroinformatics for genome-wide 3-D gene expression mapping in the mouse brain. IEEE/ACM Trans Comput Biol Bioinformatics. 2007;4:382–393. doi: 10.1109/tcbb.2007.1035. [DOI] [PubMed] [Google Scholar]
  • 4.Jones AR, Overly CC, Sunkin SM. The Allen Brain Atlas: 5 years and beyond. Nat Rev Neurosci. 2009;10:821–828. doi: 10.1038/nrn2722. [DOI] [PubMed] [Google Scholar]
  • 5.Ng L, et al. An anatomic gene expression atlas of the adult mouse brain. Nat Neurosci. 2009;12:356–362. doi: 10.1038/nn.2281. [DOI] [PubMed] [Google Scholar]
  • 6.Bohland JW, Bokil H, Pathak SD, Lee CK, Ng L, Lau C, Kuan C, Hawrylycz M, Mitra PP. Clustering of spatial gene expression patterns in the mouse brain and comparison with classical neuroanatomy. Methods. 2010;50:105–112. doi: 10.1016/j.ymeth.2009.09.001. [DOI] [PubMed] [Google Scholar]
  • 7.Zhang B, Horvath S. A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology. 2005;4(1):17. doi: 10.2202/1544-6115.1128. [DOI] [PubMed] [Google Scholar]
  • 8.Ji SW. Computational network analysis of the anatomical and genetic organizations in the mouse brain. Bioinformatics. 2011;27(23):3293–3299. doi: 10.1093/bioinformatics/btr558. [DOI] [PubMed] [Google Scholar]
  • 9.Meinshausen N, Buhlmann P. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics. 2006;34(3):1436–1462. [Google Scholar]
  • 10.Wang H, Nie FP, Huang H, Risacher S, Saykin AJ, Shen L. Identifying adsensitive and cognition-relevant imaging biomarkers via joint classification and regression. MICCAI. 2011:115–123. doi: 10.1007/978-3-642-23626-6_15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Wang H, Nie FP, Huang H, Kim S, Nho K, Risacher S, Saykin AJ, Shen L. Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the adni cohort. Bioinformatics. 2012;28(2):229–237. doi: 10.1093/bioinformatics/btr649. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Wang H, Nie FP, Huang H, Risacher S, Saykin AJ, Shen L. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics. 2012;28(12):i127–i136. doi: 10.1093/bioinformatics/bts228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Wang H, Nie FP, Huang H, Yan J, Kim S, Nho K, Risacher S, Saykin AJ, Shen L. From phenotype to genotype: an association study of longitudinal phenotypic markers to alzheimer’s disease relevant SNPs. Bioinformatics. 2012;28(18):i619–i625. doi: 10.1093/bioinformatics/bts411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Wang H, Nie FP, Huang H, Yan J, Kim S, Risacher S, Saykin AJ, Shen L. High-order multi-task feature learning to identify longitudinal phenotypic markers for alzheimer’s disease progression prediction. NIPS. 2012:1286–1294. [Google Scholar]
  • 15.Nesterov Y. Method for solving a convex programming problem with convergence rate O(1/k2) Soviet Math Dokl. 1983;1983(2):372–376. [Google Scholar]
  • 16.Nesterov Y. Gradient methods for minimizing composite objective function. 2007 [Google Scholar]
  • 17.Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004. [Google Scholar]

RESOURCES