A multivariate to multivariate approach for voxel-wise genome-wide association analysis

Qiong Wu; Yuan Zhang; Xiaoqi Huang; Tianzhou Ma; L Elliot Hong; Peter Kochunov; Shuo Chen

doi:10.1002/sim.10101

Author manuscript; available in PMC: 2025 Apr 11.

Published in final edited form as: Stat Med. 2024 Jun 24;43(20):3862–3880. doi: 10.1002/sim.10101


PMCID: PMC11986643 NIHMSID: NIHMS2067460 PMID: 38922949

Abstract

The joint analysis of imaging-genetics data facilitates the systematic investigation of genetic effects on brain structures and functions with spatial specificity. We focus on voxel-wise genome-wide association analysis, which may involve trillions of single nucleotide polymorphism (SNP)-voxel pairs. We attempt to identify underlying organized association patterns of SNP-voxel pairs and understand the polygenic and pleiotropic networks on brain imaging traits. We propose a bi-clique graph structure (ie, a set of SNPs highly correlated with a cluster of voxels) for the systematic association pattern. Next, we develop computational strategies to detect latent SNP-voxel bi-cliques and an inference model for statistical testing. We further provide theoretical results to guarantee the accuracy of our computational algorithms and statistical inference. We validate our method by extensive simulation studies, and then apply it to the whole genome genetic and voxel-level white matter integrity data collected from 1052 participants of the human connectome project. The results demonstrate multiple genetic loci influencing white matter integrity measures on splenium and genu of the corpus callosum.

Keywords: bi-clique, imaging-genetics, ultra-high dimensionality, voxel-wise GWAS, white matter integrity

1 |. INTRODUCTION

Imaging-genetics has garnered increased interest in the field of neuropsychiatric research as it provides a viable pathway to understand brain diseases by integrating genetic, brain imaging, and environmental factors. Compared to clinical descriptions of symptoms in psychiatry, brain imaging measurements assess brain structures and functions quantitatively and reproducibly, and these measurements are reported to be associated with psychiatric disorders including schizophrenia,¹ Alzheimer’s disease,² and major depressive disorder.³ More importantly, neuroimaging signals can serve as intermediate phenotypes, resulting in increased power in the detection of genetic loci. Recent studies have focused on the joint analysis of imaging-genetics data, which reveals the genetic effects on spatially specific brain functions and structures.^4–10 Identifying genetic effects on objectively measured high-resolution imaging traits can not only enhance understanding of the complex genetic and neurological mechanisms of neuropsychiatric disorders, but also impact early diagnosis and treatment of psychiatric disorders.

In imaging-genetics studies, both brain imaging data and genome sequence are measured for each participant. The genetic measurements can characterize genetic variations using single nucleotide polymorphism (SNP) and copy number variants (CNVs). The non-invasive brain imaging techniques assess the brain structures by magnetic resonance imaging (MRI), diffusion tensor imaging (DTI), and brain functions by functional magnetic resonance imaging (fMRI). The recent development of neuroimaging technology provides high-resolution imaging data with improved spatial specificity and thus can better assess the genetic effects on brain structures and functions.

The statistical analysis of imaging-genetics data is computationally intensive and methodologically challenging. These challenges mainly arise from the combination of two sets of high-dimensional features: multivariate imaging traits with multivariate genetic variants (Figure 1). Moreover, both imaging traits and genetic variants exhibit complex and organized dependence structure reflecting the underlying neurophysiological mechanisms and linkage disequilibrium patterns.⁶ For example, a typical imaging-genetics study collects up to 10⁷ SNPs and 10⁵ voxels, jointly contributing trillions (10¹²) of SNP-voxel pairs.^11,12 The direct application of classic voxel-wise genome-wide association analysis (vGWAS) could require an enormous sample size (eg, multiple millions of participants) to control the false positive error rate while maintaining adequate statistical power.^13–16

Data structure for vGWAS. For imaging-genetics data, we can perform GWAS analysis on each voxel of 3D brain imaging data for the study cohort. The vGWAS analyses generate billions of association results, which raises challenges of result interpretation and comprehension.

Furthermore, advanced methods have been developed to leverage group sparsity by techniques including regularization, low-rank techniques, and projection of high-dimensional features.^10,17–25 However, while these methods can gain statistical power by jointly modeling genetic variants and imaging traits through a multivariate regression model, the high dimensionality of imaging-genetics data remains challenging due to computational burden and/or overfitting. For instance, the analysis can only be applied to imaging data at a regional level or to genetic data filtered down to thousands of SNP loci. Besides, results based on summarized measures, such as a few latent variables or a coarser scale, are less interpretable or lack spatial specificity.⁵

In this study, we propose a new multivariate to multivariate method to systematically investigate the SNP-(imaging)voxel association patterns with four aims: (i) identify voxel clusters as genetically correlated imaging traits, (ii) detect functionally related SNP sets, (iii) understand the SNP-voxel association patterns as polygenic and pleiotropic relationships, and (iv) test the association patterns while controlling multiplicity. In our study, a polygenic trait refers to a voxel influenced by multiple SNPs, while pleiotropy indicates that one gene can affect multiple voxel traits. Specifically, we consider genetic variants and imaging voxels as two disjoint sets of nodes, and associations between all SNP-voxel pairs as edges in a bipartite graph. We model the polygenic and pleiotropic SNP-voxel association structure as an imaging-genetics dense bi-clique (IGDB). An IGDB is a node-induced subgraph consisting of a subset of SNPs and a subset of voxels, within which the probability that a SNP is associated with a voxel is much higher than in the rest of the graph. Within an IGDB, each voxel can be considered as a polygenic imaging trait, and each SNP as a pleiotropic genetic variant. Therefore, our method contributes a new GWAS tool for voxel-level neuroimaging traits that alleviates the burden of an ultra-stringent threshold (eg, p < 5 × 10⁻¹² in vGWAS) and uncovers the systematic SNP-trait association patterns.

With the specified IGDB structure of polygenic and pleiotropic association pattern, the current study makes several contributions. First, we develop computationally efficient algorithms to identify the IGDB structure with the scalability for analyzing the whole genome-whole brain data. Second, the proposed greedy algorithm is presented with the approximation bounds for the true optimal as well as its asymptotically full recovery of IGDB-based network structure. Last, we formulate the existence of a polygenic and pleiotropic SNP-voxel association structure against a random bipartite graph, which can be evaluated through likelihood-based statistics.

2 |. MOTIVATING DATA EXAMPLE

The human connectome project (HCP) sponsored by National Institutes of Health (NIH) aims to construct the underlying neural pathways of healthy human brain functions. It is an important public resource for structural and functional brain connectivity data, accompanied by demographic, behavioral, genetic and other data. In this study, we focus on the brain imaging and genetics data in the HCP surveyed from 1052 participants (F/M 483/569; age 28.1 ± 3.7), for whom the scans and data were released in June 2014 (https://humanconnectome.org) that passed the HCP and ENIGMA quality control and assurance standards.²⁶ The participants in the HCP study were recruited from a large population-based study named “the Missouri Family and Twin Registry.”²⁷

The fractional anisotropy (FA) measure, derived from diffusion tensor imaging (DTI), is a widely used metric characterizing the localized white matter microstructural integrity.²⁸ Previous studies have investigated its heritability through the variance components method applied to pedigrees.²⁹ They find that 70% to 80% of the total phenotypic variance of trait-wise FA measures can be explained by additive genetic factors.³⁰ The significantly and reliably heritable FA measurements qualify as a set of endophenotypes, which suggests further exploration of the associated genetic variants. Hence, a genetic analysis is desirable to detect the genetic effects of specific loci on imaging traits with statistical inference. Moreover, it is reported that FA measurements at multiple brain locations can be affected by a common set of genetic variants.⁹ FA is a complex trait determined by multiple alleles, which motivates the identification of functionally related genetic variants. This investigation naturally invokes the search for polygenicity and pleiotropy of networks as the focus of this study. Voxel-level association analysis between imaging traits and genetic variants can provide the maximal spatial resolution. Nevertheless, the implementation is challenging because it requires a multivariate to multivariate association analysis to extract SNP-voxel subnetworks with polygenic and pleiotropic structures and further to provide sound statistical inference. To close this gap, we develop an IGDB-based framework to perform voxel-wise GWAS and systematically identify polygenic and pleiotropic structures.

3 |. METHODS

3.1 |. Background and notations

We consider an imaging-genetics data set collected from $L$ independent subjects. We let $V$ be the set of brain imaging voxels with $| V | = n$ and $U$ be the set of genetic variants (ie, SNPs) with $| U | = m$ . For each participant $l \in \{1, \dots, L\}$ , define $x_{l} = {(x_{1, l}, \dots, x_{m, l})}^{T}$ to be the genetic variants for the participant $l$ and $y_{l} = {(y_{1, l}, \dots, y_{n, l})}^{T}$ to be the vector of multivariate imaging traits. Let $z_{l}$ denote a $p$ -dimensional vector of individual-level profiling covariates. We model the associations between multivariate imaging traits and multivariate genetic variants using a generalized linear regression model:

E (y_{l} ∣ x_{l}) = g^{- 1} (B^{T} x_{l} + α^{T} z_{l}),

where $g (\cdot)$ is a known link function with inverse $g^{- 1} (\cdot)$ . The coefficient $B = {\{β_{u v}\}}_{u \in U, v \in V} \in R^{m \times n}$ is called the SNP-voxel association matrix. Without loss of generality, we consider the association matrix based on GWAS analysis (eg, using open-source whole genome association analysis toolset).³¹ The goal of our statistical inference is to accurately identify the subset of significant associations $\{(u, v) : β_{u v} \neq 0\}$ from billions of entries of $B$ by multivariate to multivariate hypothesis testing^32,33:

H_{0}^{(u, v)} : β_{u v} = 0, vs H_{1}^{(u, v)} : β_{u v} \neq 0, for all u \in U, v \in V .

Conventional statistical inference methods (eg, multiple testing correction or regression shrinkage) work by regularizing vectorized $B$ . However, this strategy may only capture individual association pairs $β_{u v}$ without recognizing systematic patterns (eg, the pleiotropic and polygenic structure). A prominent example is that a cluster of SNPs may jointly influence the observations through a cluster of neighboring voxels. To address this challenge, we propose a new multivariate to multivariate inference framework that extracts the joint structure in $B$ , which we call imaging-genetics dense bi-clique (IGDB). Next, we introduce the IGDB structure, based on which, we then formally propose a novel estimation and inference procedure on this structure.

3.2 |. IGDB in a multivariate to multivariate graph structure

We characterize the vGWAS association as a bipartite graph $G = (U, V, E)$ , where $U$ and $V$ are distinct node sets representing SNPs and voxels, respectively. The set of binary edges $E$ describes the locations of significant SNP-voxel associations: $e_{u v} \in E$ if and only if $β_{u v} \neq 0$ in the association matrix $B = {\{β_{u v}\}}_{u \in U, v \in V}$ . In contrast to conventional approaches that treat edges $e_{u v}$ individually, our proposal provides a succinct description of pleiotropic (one SNP to multiple image voxels) and polygenic (multiple SNPs to one voxel) relationships. To this end, we now formally propose IGDB as a subgraph structure of $G$ . Denote an arbitrary subgraph of $G$ by $G [S, T] = (S, T, E [S, T])$ , where $S \subset U$ , $T \subset V$ and $E [S, T] = \{e_{u v} \in E ∣ u \in S, v \in T\}$ . Our proposed IGDB will be defined based on some particular subgraph $G [S_{0}, T_{0}]$ such that most $β_{u v}$ ’s are nonzero for $e_{u v} \in G [S_{0}, T_{0}]$ , while most $β_{u^{'} v^{'}}$ ’s elsewhere are zero. We illustrate the IGDB structure of a bipartite graph in Figure 2.

Illustration of a bipartite graph with IGDB structure $G [S_{0}, T_{0}]$ that reveals underlying patterns of massive SNP-voxel association. In the top-left bipartite graph, each node (square) on the left side represents an SNP, while each node (circle) represents a location-specific voxel. The edges connecting the SNPs and voxels illustrate the associations, with red edges indicating pairs of associated SNPs and voxels in an IGDB structure. The bottom-left 2D figure provides an alternative representation of SNP-voxel associations, where associated pairs are depicted as black dots. The SNP-voxel association patterns in the left figures appear to be random. The bottom-right figure showcases the patterns that can be unveiled through the proposed IGDB method, suggesting systematic associations between imaging features and genetic variants. Note that traditional statistical methods, such as bi-clustering, face limitations in accurately identifying these patterns (see Figure A1 in the Appendix).

Our core intuition can be quantified into the following formulation:

\frac{\sum_{u, v} I (β_{u v} \neq 0 ∣ δ_{u v} = 1)}{\sum_{u, v} I (δ_{u v} = 1)} > \frac{\sum_{u, v} I (β_{u v} \neq 0 ∣ δ_{u v} = 0)}{\sum_{u, v} I (δ_{u v} = 0)},

(1)

where $δ_{u v}$ is a binary variable indicating the IGDB-based network structure, that is,

δ_{u v} \equiv δ_{u v} (S_{0}, T_{0}) = I (e_{u v} \in G [S_{0}, T_{0}]) .

This reflects that imaging features $(T_{0})$ are polygenic traits and the genetic variants $(S_{0})$ are pleiotropic alleles. The genetically correlated imaging features and functionally related SNPs jointly compose a functional biclique $G [S_{0}, T_{0}]$ . In neuroimaging studies, findings are often reported for spatially contiguous brain areas (ie, connected voxels) because of the biological interpretability and inference advantages.³⁴ This is reflected in our proposed IGDB structure by further formulating $S_{0}$ and $T_{0}$ as disjoint vertex neighborhoods, as follows:

S_{0} = 𝒩_{1}^{S_{0}} \cup \dots \cup 𝒩_{K_{1}}^{S_{0}}, and T_{0} = 𝒩_{1}^{T_{0}} \cup \dots \cup 𝒩_{K_{2}}^{T_{0}},

where each $𝒩_{k}^{T_{0}} (k \in \{1, \dots, K_{2}\})$ is a spatially contiguous voxel cluster, and accordingly $𝒩_{k}^{S_{0}} (k \in \{1, \dots, K_{1}\})$ is a set of functionally related SNPs associated with one or multiple spatially-contiguous voxel clusters (eg, $𝒩_{k}^{T_{0}}$ ). In the next subsection, we articulate that the IGDB enjoys several statistical advantages supported by graph and combinatorics theory.
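As a small illustration of the density contrast in (1), the sketch below computes the proportion of nonzero associations inside and outside a candidate block $G [S_{0}, T_{0}]$ . The function name `density_contrast` and the toy edge set are hypothetical and are not part of the original analysis.

```python
def density_contrast(edges, S0, T0, m, n):
    """Edge densities inside and outside the node-induced subgraph G[S0, T0].

    edges: set of (u, v) pairs with a nonzero association (beta_uv != 0),
           with SNP indices in range(m) and voxel indices in range(n).
    Returns the left- and right-hand sides of inequality (1).
    """
    S0, T0 = set(S0), set(T0)
    n_inside = len(S0) * len(T0)      # number of pairs with delta_uv = 1
    n_outside = m * n - n_inside      # number of pairs with delta_uv = 0
    e_inside = sum(1 for u, v in edges if u in S0 and v in T0)
    e_outside = len(edges) - e_inside
    return e_inside / n_inside, e_outside / n_outside


# Toy graph (hypothetical): 4 SNPs, 5 voxels, dense block on SNPs {0, 1} x voxels {0, 1, 2}.
edges = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (3, 4)}
inside, outside = density_contrast(edges, {0, 1}, {0, 1, 2}, m=4, n=5)
print(inside > outside)  # True: 5/6 inside versus 1/14 outside
```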

3.3 |. Graph properties of IGDB

Without loss of generality, we consider the following two cases regarding the underlying network structure of $G$ :

\begin{array}{ll} \text{Case 0}: & G \text{ is observed from a random bipartite graph } G (m, n, μ_{0}), \\ \text{Case 1}: & \text{There exists at least one non-trivial IGDB } G [S_{0}, T_{0}] \text{ such that } G \text{ is observed from} \\ & e_{u v} = I (β_{u v} \neq 0) \sim \begin{cases} \text{Bernoulli} (μ_{1}), & \text{if } u \in S_{0} \text{ and } v \in T_{0} \\ \text{Bernoulli} (μ_{0}), & \text{otherwise} \end{cases} \\ & \text{with } μ_{1} > μ_{0} . \end{array}

(2)

In Case 0 (ie, no polygenic and pleiotropic patterns), we can directly implement the conventional multiple testing corrections and regression shrinkage methods to determine individual associations between genetic variants and imaging traits. If Case 1 holds, our primary goal becomes extracting and testing the underlying IGDB subgraphs as polygenic and pleiotropic subnetworks.

In practice, the estimated IGDB from a sample can be used to distinguish Case 0 from Case 1 because the observed network behaves differently under the two cases with respect to the size of the maximal “dense” subgraph. For convenience, we call a subgraph $G [S, T]$ a $γ$ -quasi biclique if it contains at least $γ \cdot | S | \cdot | T |$ edges. Then, asymptotically, if $|S_{0}|, |T_{0}| \to \infty$ as $m, n \to \infty$ , with high probability, the true IGDB subgraph $G [S_{0}, T_{0}]$ would be a $γ$ -quasi biclique for any fixed $γ \in (μ_{0}, μ_{1})$ . In contrast, under Case 0, a $γ$ -quasi biclique of decent size with high density would rarely exist, as stated in the following lemma.

Lemma 1.

Suppose $G$ is observed from a random bipartite graph $G (m, n, μ_{0})$ as Case 0. $G [S, T]$ is any subgraph with edge density $\frac{| E [S, T] |}{| S | | T |} \geq γ \in (μ_{0}, 1)$ (ie, $γ$ -quasi biclique). Let $m_{0}, n_{0} = Ω (\max \{m^{ϵ}, n^{ϵ}\})$ for some $0 < ϵ < 1$ . Then for sufficiently large $m, n$ with $c (γ, μ_{0}) m_{0} \geq 8 \log n$ and $c (γ, μ_{0}) n_{0} \geq 8 \log m$ , we have

P (| S | \geq m_{0}, | T | \geq n_{0}) \leq 2 m n \cdot \exp (- \frac{1}{4} c (γ, μ_{0}) m_{0} n_{0}),

where $c (a, b) = {\{\frac{1}{(a - b)^{2}} + \frac{1}{3 (a - b)}\}}^{- 1}$ .
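To get a feel for how quickly the bound in Lemma 1 decays, the following sketch evaluates it on the log scale. The function names and the example sizes are hypothetical, and the logarithm in the side conditions is taken to be the natural logarithm, which is an assumption.

```python
import math


def c(a, b):
    """c(a, b) = {1/(a-b)^2 + 1/(3(a-b))}^(-1), as defined in Lemma 1."""
    d = a - b
    return 1.0 / (1.0 / d**2 + 1.0 / (3.0 * d))


def lemma1_log_bound(m, n, m0, n0, gamma, mu0):
    """Natural log of the bound 2*m*n*exp(-c*m0*n0/4); the log scale avoids underflow."""
    cval = c(gamma, mu0)
    # Side conditions of Lemma 1 (natural logarithm assumed).
    if cval * m0 < 8 * math.log(n) or cval * n0 < 8 * math.log(m):
        raise ValueError("side conditions of Lemma 1 are not met")
    return math.log(2 * m * n) - 0.25 * cval * m0 * n0


# Hypothetical sizes: a 1000 x 1000 null graph with mu0 = 0.1 and gamma = 0.5.
log_bound = lemma1_log_bound(1000, 1000, 400, 400, gamma=0.5, mu0=0.1)
print(log_bound)
```

With these illustrative values the log of the bound is strongly negative, so a 0.5-quasi biclique of size 400 by 400 would be extremely unlikely under Case 0.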

4 |. ESTIMATION AND INFERENCE

Let $W_{m \times n}$ denote the inference result matrix (eg, test statistics $w_{u v} = t_{u v}$ or $- \log (p_{u v})$ ) for the regression coefficients ${\hat{B}}_{m \times n}$ . Then, our goal becomes to extract and test the IGDB structure from a weighted bipartite graph $G = (U, V, W)$ . Similar to Reference 33, as a natural consequence of our model set up in Section 3.2, edge weights in $W$ follow a mixture marginal distribution:

w_{u v} \sim \begin{cases} f_{1} (\cdot; θ_{1}), & \text{if } β_{u v} \neq 0 \\ f_{0} (\cdot; θ_{0}), & \text{if } β_{u v} = 0 . \end{cases}

(3)

where $w_{u v} ∣ δ_{u v} = 1 \sim μ_{1} f_{1} + (1 - μ_{1}) f_{0}$ , while $w_{u v} ∣ δ_{u v} = 0 \sim μ_{0} f_{1} + (1 - μ_{0}) f_{0}$ . Empirically, we have the central tendency of $f_{1} (\cdot; θ_{1})$ being greater than $f_{0} (\cdot; θ_{0})$ , in the sense that $E_{θ_{1}} [w_{u v} ∣ β_{u v} \neq 0] > E_{θ_{0}} [w_{u v} ∣ β_{u v} = 0]$ .

4.1 |. IGDB estimation

Motivated by the nature of IGDB as a subgraph of elevated mean edge weights, we estimate it by looking for the maximal subgraph of $G$ with a density constraint. Inspired by Lemma 1, we estimate the IGDB $G [S_{0}, T_{0}]$ based on the edge weight matrix $W$ by optimizing:

\underset{S \subseteq U, T \subseteq V}{\max} | S | | T | \quad \text{subject to} \quad \frac{‖ W [S, T] ‖_{1,1}}{| S | | T |} \geq γ^{'}

(4)

or the Lagrangian form after taking logarithm on both terms:

\underset{S \subseteq U, T \subseteq V}{\max} \log (|S| |T|) + λ \log (\frac{‖ W [S, T] ‖_{1,1}}{|S| |T|}),

(5)

where $‖ \cdot ‖_{1,1}$ refers to the entry-wise $ℓ_{1}$ norm such that $‖ W [S, T] ‖_{1,1} = \sum_{u \in S, v \in T} |w_{u v}|$ , $γ^{'}$ is the density constraint, and $λ \in (1, \infty)$ is the tuning parameter.
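The objective in (5) is straightforward to evaluate for a given pair of node sets. The sketch below does so for a small weight matrix stored as a list of lists; the function name `d_lambda` and the toy matrix are hypothetical, chosen only to show that a compact block of large weights can score higher than the full graph.

```python
import math


def d_lambda(W, S, T, lam):
    """Objective (5): log(|S||T|) + lam * log(||W[S,T]||_{1,1} / (|S||T|))."""
    size = len(S) * len(T)
    l1 = sum(abs(W[u][v]) for u in S for v in T)  # entry-wise l1 norm of W[S, T]
    if size == 0 or l1 == 0:
        return -math.inf  # convention used here for empty or all-zero subgraphs
    return math.log(size) + lam * math.log(l1 / size)


# Toy weights (hypothetical): a 2 x 2 block of strong associations in a 3 x 3 matrix.
W = [[3.0, 3.0, 0.1],
     [3.0, 3.0, 0.1],
     [0.1, 0.1, 0.1]]
block = d_lambda(W, [0, 1], [0, 1], lam=2.0)
full = d_lambda(W, [0, 1, 2], [0, 1, 2], lam=2.0)
print(block > full)
```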

Algorithm 1.

Direct optimization of objective function (5)

1:	Input: $G = (U, V, W)$ , $λ$ , pre-specified ratio $h \in \{h_{1}, h_{2}, \dots, h_{\max}\}$ ; Output: $G [{\tilde{S}}_{λ}, {\tilde{T}}_{λ}]$
2:	procedure Algorithm
3:	for $h \in \{h_{1}, h_{2}, \dots, h_{\max}\}$ do
4:	$S_{1} \leftarrow U, T_{1} \leftarrow V$
5:	for $k = 1$ to $n + m - 1$ do
6:	Let $i \in S_{k}$ be the node with smallest degree: $i = \arg \min_{i^{'} \in S_{k}} \deg_{X} (i^{'}; S_{k}, T_{k})$ ;
7:	Let $j \in T_{k}$ be the node with smallest degree: $j = \arg \min_{j^{'} \in T_{k}} \deg_{Y} (j^{'}; S_{k}, T_{k})$ ;
8:	if $\sqrt{h} \deg_{X} (i; S_{k}, T_{k}) \leq \frac{1}{\sqrt{h}} \deg_{Y} (j; S_{k}, T_{k})$ then
9:	$S_{k + 1} \leftarrow S_{k} \setminus \{i\}$ and $T_{k + 1} \leftarrow T_{k}$ ;
10:	else
11:	$S_{k + 1} \leftarrow S_{k}$ and $T_{k + 1} \leftarrow T_{k} \setminus \{j\}$ ;
12:	end if
13:	end for
14:	Output $G [S^{h}, T^{h}]$ with largest objective function in $G [S_{1}, T_{1}], \dots, G [S_{n + m - 1}, T_{n + m - 1}]$ ;
15:	end for
16:	Output $G [{\tilde{S}}_{λ}, {\tilde{T}}_{λ}]$ with largest objective function in $G [S^{h_{1}}, T^{h_{1}}], \dots, G [S^{h_{\max}}, T^{h_{\max}}]$ ;
17:	end procedure


The direct optimization of the objective function (5) is challenging because it is a nondeterministic polynomial-time (NP) hard problem.^35,36 We therefore propose a computationally efficient greedy algorithm, described in Algorithm 1, to approximately carry out the optimization of (5). In designing it, we extended the greedy algorithms for dense subgraph discovery³⁶ in an adjacency matrix to a large bipartite matrix to extract dense bi-cliques. Algorithm 1 iteratively removes the nodes with the smallest degrees; it is deterministic and does not depend on initial values. The computational complexity of Algorithm 1 is $O (C_{1} m n)$ , where $C_{1}$ is determined by the grid search over $h = | S | / | T |$ , the aspect ratio of a dense subgraph.
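A minimal Python sketch of the greedy peeling idea in Algorithm 1 is given below. It is not the authors' implementation. It assumes that the degree of a node is its weighted degree (the sum of $|w_{u v}|$ over the remaining nodes on the other side), that peeling stops once either side is empty, and that the ratio grid `ratios` is a small hypothetical default.

```python
import math


def greedy_igdb(W, lam, ratios=(0.5, 1.0, 2.0)):
    """Greedy peeling for objective (5); returns (S, T, score) of the best subgraph seen."""
    m, n = len(W), len(W[0])
    A = [[abs(x) for x in row] for row in W]
    best_score, best_S, best_T = -math.inf, set(), set()
    for h in ratios:
        S, T = set(range(m)), set(range(n))
        row_deg = {u: sum(A[u]) for u in S}                       # weighted SNP degrees
        col_deg = {v: sum(A[u][v] for u in S) for v in T}         # weighted voxel degrees
        total = sum(row_deg.values())                             # ||W[S, T]||_{1,1}
        while S and T:
            if total > 0:
                size = len(S) * len(T)
                score = math.log(size) + lam * math.log(total / size)
                if score > best_score:
                    best_score, best_S, best_T = score, set(S), set(T)
            i = min(S, key=row_deg.get)   # SNP with smallest degree
            j = min(T, key=col_deg.get)   # voxel with smallest degree
            if math.sqrt(h) * row_deg[i] <= col_deg[j] / math.sqrt(h):
                S.remove(i)
                total -= row_deg.pop(i)
                for v in T:
                    col_deg[v] -= A[i][v]
            else:
                T.remove(j)
                total -= col_deg.pop(j)
                for u in S:
                    row_deg[u] -= A[u][j]
    return best_S, best_T, best_score


# Toy weights (hypothetical): strong block on SNPs {0, 1} x voxels {0, 1, 2}.
W = [[5.0, 5.0, 5.0, 0.1, 0.1],
     [5.0, 5.0, 5.0, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.1, 0.1]]
S_hat, T_hat, score = greedy_igdb(W, lam=2.0)
print(sorted(S_hat), sorted(T_hat))
```

Degrees are updated incrementally after each removal rather than recomputed, so one pass over a given ratio touches each matrix entry a bounded number of times; the linear scan for the minimum-degree node is kept for brevity and could be replaced by a priority queue.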

Now we establish approximation accuracy results of Algorithm 1 and its estimation of IGDB. Let $S_{λ}^{*}$ and $T_{λ}^{*}$ be the true optimal solution to (5):

(S_{λ}^{*}, T_{λ}^{*}) = \arg \max_{S \subset U, T \subset V} d_{λ} (S, T),

and $({\tilde{S}}_{λ}, {\tilde{T}}_{λ})$ is from Algorithm 1 with

({\tilde{S}}_{λ}, {\tilde{T}}_{λ}) = \arg \max_{h} \arg \max_{(S_{1}, T_{1}), \dots, (S_{m + n - 1}, T_{m + n - 1})} d_{λ} (S, T),

where $d_{λ} (S, T) := \log (| S | | T |) + λ \log (\frac{‖ W [S, T] ‖_{1,1}}{| S | | T |})$ .

The greedy algorithm with average-degree based density (or equivalently $λ = 2$ ) is said to have a 2-approximation guarantee for the true optimal,³⁵ namely, $2 d_{2} ({\tilde{S}}_{2}, {\tilde{T}}_{2}) > d_{2} (S_{2}^{*}, T_{2}^{*})$ . In this article, we present the approximation bounds for the proposed objective function (5) in terms of a parameter $λ$ as the following Theorem 1.

Algorithm 2.

Determine tuning parameter $λ$ by likelihood function

1:	Input: $G = (U, V, W)$ , a grid of tuning parameters: $λ_{1}, λ_{2}, \dots, λ_{J}$ , a sequence of cutoffs $r_{1}, r_{2}, \dots, r_{R}$ and its mass $q (r_{1}), \dots, q (r_{R})$ ; Output: $G [{\tilde{S}}_{\hat{λ}}, {\tilde{T}}_{\hat{λ}}]$ and $\hat{λ}$
2:	procedure Algorithm
3:	for $λ \in \{λ_{1}, \dots, λ_{J}\}$ do
4:	Obtain the IGDB $({\tilde{S}}_{λ}, {\tilde{T}}_{λ})$ of $W$ from Algorithm 1
5:	for $r = r_{1}$ to $r_{R}$ do
6:	calculate the likelihood $ℒ_{λ} (\hat{π}; {\tilde{S}}_{λ}, {\tilde{T}}_{λ}, W (r))$ (see Section 4.2 for the definition of the likelihood function)
7:	end for
8:	integrate w.r.t. $r$ : $ℒ_{λ} (W) = \sum_{i = 1}^{R} ℒ_{λ} (\hat{π}; {\tilde{S}}_{λ}, {\tilde{T}}_{λ}, W (r_{i})) q (r_{i})$
9:	end for
10:	Output $\hat{λ}$ and $({\tilde{S}}_{\hat{λ}}, {\tilde{T}}_{\hat{λ}})$ with maximized $ℒ_{λ} (W)$
11:	end procedure


Theorem 1.

For a given bipartite graph $G = (U, V, E)$ , with $(S_{λ}^{*}, T_{λ}^{*})$ and $({\tilde{S}}_{λ}, {\tilde{T}}_{λ})$ defined in Section 4.1, the greedy Algorithm 1 has a $ρ (λ, m, n)$ -approximation, that is, $d_{λ} (S_{λ}^{*}, T_{λ}^{*}) \leq ρ (λ, m, n) d_{λ} ({\tilde{S}}_{λ}, {\tilde{T}}_{λ})$ with

ρ (λ, m, n) = \begin{cases} 2 (m n)^{\frac{1}{λ} (1 - \frac{2}{λ})} & \text{if } λ \geq 2 \\ 2 (m n)^{(\frac{1}{λ} - \frac{1}{2})} & \text{if } \frac{4}{3} < λ < 2 \\ (m n)^{(1 - \frac{1}{λ})} & \text{if } 1 < λ \leq \frac{4}{3} . \end{cases}
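The approximation factor in Theorem 1 can be tabulated directly from its three branches. The short sketch below (with the hypothetical function name `rho`) does this and confirms that the factor reduces to 2 at $λ = 2$ , matching the 2-approximation guarantee quoted above for the average-degree density.

```python
def rho(lam, m, n):
    """Approximation factor rho(lambda, m, n) from Theorem 1."""
    mn = m * n
    if lam >= 2:
        return 2 * mn ** ((1 / lam) * (1 - 2 / lam))
    if lam > 4 / 3:
        return 2 * mn ** (1 / lam - 0.5)
    if lam > 1:
        return mn ** (1 - 1 / lam)
    raise ValueError("Theorem 1 is stated for lambda > 1")


# Illustrative graph size (hypothetical): 100 SNPs by 100 voxels.
for lam in (1.2, 1.5, 2.0, 3.0):
    print(lam, rho(lam, 100, 100))
```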

In Theorem 2, we state that the optimization of the proposed objective function (5) asymptotically leads to almost full recovery of the IGDB-based network structure.

Theorem 2.

We assume the graph $G = (U, V, E)$ with an IGDB $G [S_{0}, T_{0}] = (S_{0}, T_{0}, E [S_{0}, T_{0}])$ is generated from a mixture of Bernoulli distributions: $e_{u v} \sim δ_{u v} \text{Bernoulli} (π_{1}) + (1 - δ_{u v}) \text{Bernoulli} (π_{0})$ , $δ_{u v} = I (e_{u v} \in G [S_{0}, T_{0}])$ , and $π_{1} > π_{0}$ . For simplicity, we let $m = Θ (n)$ . Assume $|S_{0}| = O (| m |^{1 / 2 + ϵ})$ and $|T_{0}| = O (| n |^{1 / 2 + ϵ})$ as $n \to \infty$ for some $ϵ > 0$ . Denote

e_{S} = (1 - \frac{| {\tilde{S}}_{λ} \cap S_{0} |}{| S_{0} |}) + (1 - \frac{| {\tilde{S}}_{λ}^{c} \cap S_{0}^{c} |}{| S_{0}^{c} |})

and

e_{T} = (1 - \frac{| {\tilde{T}}_{λ} \cap T_{0} |}{| T_{0} |}) + (1 - \frac{| {\tilde{T}}_{λ}^{c} \cap T_{0}^{c} |}{| T_{0}^{c} |})

to be the error rates of node memberships based on $({\tilde{S}}_{λ}, {\tilde{T}}_{λ})$ from Algorithm 1. Then, there exists some $λ$ such that we will get almost full recovery in Algorithm 1, that is, for any fixed $a \in (0,1)$ , as $n \to \infty$ , we have

P (e_{S} + e_{T} \geq a) \to 0 .
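The membership error rates in Theorem 2 can be computed from an estimated node set and the true node set, for example in a simulation where the truth is known. The sketch below reads the ratios in $e_{S}$ and $e_{T}$ as ratios of set sizes, which is an assumption about the intended notation; the function name and the toy sets are hypothetical.

```python
def membership_error(estimated, truth, universe):
    """(1 - |est ∩ truth| / |truth|) + (1 - |est^c ∩ truth^c| / |truth^c|)."""
    estimated, truth, universe = set(estimated), set(truth), set(universe)
    est_c, truth_c = universe - estimated, universe - truth
    missed = 1 - len(estimated & truth) / len(truth)        # true members not recovered
    false_in = 1 - len(est_c & truth_c) / len(truth_c)      # non-members wrongly included
    return missed + false_in


# Toy example (hypothetical): 10 SNPs, true set {0, 1, 2, 3}, estimate {0, 1, 2, 4}.
e_S = membership_error({0, 1, 2, 4}, {0, 1, 2, 3}, range(10))
print(e_S)
```

The same function applies to the voxel side by passing the estimated and true voxel sets, and a perfect estimate gives an error of zero.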

In practice, we select the tuning parameter $λ$ by a grid search based on the likelihood criterion,³⁷ and describe the details in Algorithm 2. Based on each dense subgraph $G [S, T]$ , we further identify spatially-contiguous voxel clusters (ie, ${\tilde{𝒩}}_{k}^{T}$ , $k = 1, \dots, {\tilde{K}}_{2}$ ), and a corresponding set of SNPs (ie, ${\tilde{𝒩}}_{k}^{S}$ , $k = 1, \dots, {\tilde{K}}_{1}$ ) that are functionally associated with voxel clusters (see Supplement A). Last, multiple IGDBs can be extracted by performing algorithms repeatedly with the detected IGDBs masked.³⁸

4.2 |. Statistical inference of the IGDB

Recall that the purpose of this study is to perform statistical inference on the pleiotropic and polygenic association pattern or the IGDB. We investigate the significance of the presence of an IGDB against a random bipartite graph (Case 1 vs Case 0) as illustrated in Section 3.3.

Let $r$ be a sound cutoff that dichotomizes the weighted graph $G$ into a binary graph $G^{r} = (U, V, A)$ using $a_{u v} = I (|w_{u v}| > r)$ . Then, under the IGDB structure indexed by node sets $(S_{0}, T_{0})$ , the edges in $G^{r}$ follow a mixture of two Bernoulli distributions:

a_{u v} ∣ (S_{0}, T_{0}) \sim \text{Bernoulli} (π_{u v}),

(6)

where $π_{u v} = δ_{u v} π_{1} + (1 - δ_{u v}) π_{0}$ , $π_{1} = μ_{1} \int_{r}^{\infty} f_{1} (w, θ_{1}) d w + (1 - μ_{1}) \int_{r}^{\infty} f_{0} (w, θ_{0}) d w$ , $π_{0} = μ_{0} \int_{r}^{\infty} f_{1} (w, θ_{1}) d w + (1 - μ_{0}) \int_{r}^{\infty} f_{0} (w, θ_{0}) d w$ , and $π_{1} > π_{0}$ .³⁹ Then, a hypothesis testing to distinguish Cases 0 and 1 can be proposed:

H_{0} : π_{1} = π_{0} = π vs H_{1} : π_{1} > π_{0},

based on our mixture distribution model (6).

We propose a likelihood-based statistic for the IGDB test. For a binarized graph $G^{r}$ , let

t_{G} = \log \frac{\sup_{H_{0} \cup H_{1}} ℒ (π; S, T, A)}{\sup_{H_{0}} ℒ (π; A)},

with likelihood given by Bernoulli distributions in (6). Specifically,

\begin{array}{l} ℒ (π; S, T, A) = \prod_{u \in S \text{ and } v \in T} π_{1}^{a_{u v}} {(1 - π_{1})}^{1 - a_{u v}} \times \prod_{u \in U \setminus S \text{ or } v \in V \setminus T} π_{0}^{a_{u v}} {(1 - π_{0})}^{1 - a_{u v}} \\ \text{and } ℒ (π; A) = \prod_{u \in U \text{ and } v \in V} π^{a_{u v}} (1 - π)^{1 - a_{u v}} . \end{array}
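For a fixed pair of node sets, the statistic $t_{G}$ can be evaluated in closed form by plugging in the maximum likelihood estimates of the Bernoulli rates. The sketch below is one way to do this under the assumption that the supremum over $H_{0} \cup H_{1}$ is restricted to $π_{1} \geq π_{0}$ , so that the statistic is zero whenever the block is not denser than its complement. The function names and the toy adjacency matrix are hypothetical.

```python
import math


def bern_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k ones among n cells."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def igdb_lr_stat(A, S, T):
    """Log likelihood-ratio statistic t_G for the block S x T in a binary matrix A."""
    m, n = len(A), len(A[0])
    total = sum(map(sum, A))
    n_in = len(S) * len(T)
    k_in = sum(A[u][v] for u in S for v in T)
    n_out, k_out = m * n - n_in, total - k_in
    # Under pi_1 >= pi_0 the constrained MLE collapses to the null when the block is not denser.
    if n_in == 0 or n_out == 0 or k_in / n_in <= k_out / n_out:
        return 0.0
    return bern_loglik(k_in, n_in) + bern_loglik(k_out, n_out) - bern_loglik(total, m * n)


# Toy binary graph (hypothetical): dense block on rows {0, 1} x columns {0, 1, 2}.
A = [[1, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1]]
t_block = igdb_lr_stat(A, [0, 1], [0, 1, 2])
t_other = igdb_lr_stat(A, [2, 3], [3, 4])
print(t_block, t_other)
```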

Then, the asymptotic power is ensured using the likelihood-based statistic through the following Theorem 3.

Theorem 3 (Under IGDB alternative hypothesis $H_{1}$ ).

Assume $m = Θ (n)$ and the underlying IGDB $G [S_{0}, T_{0}]$ with generating probabilities $π_{1} > π_{0}$ satisfies $|S_{0}| = m_{0}, |T_{0}| = n_{0}$ and $m_{0}, n_{0} = Ω (n^{ϵ})$ for some $ϵ > 0$ . Then for any $η > 1$ , as $n \to \infty$ , we have

\Pr (t_{G} > η) \to 1 .

In determining the significance of IGDBs, simultaneous testing across all potential IGDBs needs to be accounted for. Besides, a rejection region ( $η$ ) should be determined based on the distribution of $t_{G}$ under the null model. Hence, we employ the permutation test procedure commonly used in the field of neuroimaging^40,41 to empirically approximate the distribution of the likelihood-based statistic $t_{G}$ under the IGDB null and control the family-wise error rate (FWER).

Let $ϕ (\cdot)$ be the vectorization of a matrix, such that $ϕ (A)$ is the vector of length $m n$ formed from the adjacency matrix $A$ . Denote by $τ$ a permutation of $m n$ elements and by $P_{τ}$ the corresponding permutation matrix. Let $G_{τ} = (U, V, E_{τ})$ be an edge-permuted graph from $G$ . Then, under a random bipartite graph (Case 0), the edge-permuted graph $G_{τ}$ would be a realization from the same null model. We let $τ (1), \dots, τ (M)$ be $M$ random permutations, with corresponding edge-permuted adjacency matrices $A_{τ (1)}, \dots, A_{τ (M)}$ . The test statistics associated with these matrices form a random sample of $t_{G}$ under the null hypothesis, which can be used to obtain the empirical distribution of $t_{G}$ under the null hypothesis. We illustrate the whole procedure of the permutation test in Algorithm 3; the P-values of multiple IGDBs can be obtained by considering each IGDB individually.

To dichotomize the weighted graph $G$ , rather than setting $r$ as a fixed value, which could lead to an arbitrary selection, we consider $r$ as a random variable with a distribution $q (r)$ . This allows us to integrate the likelihood function over $r$ , utilizing the prior distribution $q (r)$ , thereby making our optimization process robust to the specific choice of $r$ . We implement a discrete distribution for $q (r)$ , defined by a set of possible values $r_{1}, \dots, r_{R}$ and their corresponding probabilities $q (r_{1}), \dots, q (r_{R})$ . In practice, our algorithm demonstrates robustness to the choice of the prior distribution, given that a reasonable range for the support of $r$ is selected.
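The averaging over cutoffs described above can be sketched as follows for a fixed pair of node sets. The likelihood for each binarized graph is evaluated at plug-in Bernoulli rates inside and outside the block, which is an assumption about how $\hat{π}$ is obtained; all function names, the toy weights, and the uniform mass on two cutoffs are hypothetical.

```python
import math


def binarize(W, r):
    """a_uv = I(|w_uv| > r)."""
    return [[1 if abs(w) > r else 0 for w in row] for row in W]


def _bern(k, n):
    """Bernoulli log-likelihood of k ones in n cells at the plug-in rate k/n."""
    if k == 0 or k == n:
        return 0.0
    return k * math.log(k / n) + (n - k) * math.log(1 - k / n)


def block_likelihood(A, S, T):
    """Likelihood under model (6) with plug-in rates inside and outside the block S x T."""
    total = sum(map(sum, A))
    k_in = sum(A[u][v] for u in S for v in T)
    n_in = len(S) * len(T)
    n_out = len(A) * len(A[0]) - n_in
    return math.exp(_bern(k_in, n_in) + _bern(total - k_in, n_out))


def integrated_likelihood(W, S, T, cutoffs, masses):
    """Sum over cutoffs r_i of L(S, T, W(r_i)) * q(r_i)."""
    return sum(q * block_likelihood(binarize(W, r), S, T)
               for r, q in zip(cutoffs, masses))


# Toy weights (hypothetical) with a uniform mass q on two cutoffs.
W = [[3.0, 3.0, 0.1],
     [3.0, 0.5, 0.1],
     [2.0, 0.1, 0.1]]
value = integrated_likelihood(W, [0, 1], [0, 1], cutoffs=[0.3, 1.0], masses=[0.5, 0.5])
print(value)
```

For graphs of realistic size the raw likelihood underflows in floating point, so a practical version would work with log-likelihoods and combine them with a log-sum-exp step.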

Algorithm 3.

Implementation of likelihood ratio statistic via permutation tests

1:	Input: $G = (U, V, A), \hat{S}, \hat{T}$ ; Output: p-value
2:	procedure Algorithm
3:	calculate the test statistic on $G$ with subgraph $G [\hat{S}, \hat{T}]$ and denote as: $t_{0}$
4:	for $b = 1$ to $M$ do
5:	generate permutation matrix $P_{b}$ on $m n$ elements
6:	observe adjacency matrix of edge-permuted graph $G_{b} : A_{b} = ϕ^{- 1} (P_{b} ϕ (A))$
7:	calculate the test statistic on $G_{b}$ as: $t_{b}$
8:	end for
9:	end procedure
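The permutation procedure in Algorithm 3 can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it keeps the node sets fixed across permutations, whereas accounting for the search over all potential IGDBs would require re-running the detection step on each permuted graph. The p-value formula (1 + number of exceedances) / (M + 1) and the function names are assumptions.

```python
import math
import random


def bern_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k ones among n cells."""
    if k == 0 or k == n:
        return 0.0
    return k * math.log(k / n) + (n - k) * math.log(1 - k / n)


def lr_stat(A, S, T):
    """Log likelihood-ratio statistic for the block S x T (zero if the block is not denser)."""
    m, n = len(A), len(A[0])
    total = sum(map(sum, A))
    n_in = len(S) * len(T)
    k_in = sum(A[u][v] for u in S for v in T)
    n_out, k_out = m * n - n_in, total - k_in
    if n_in == 0 or n_out == 0 or k_in / n_in <= k_out / n_out:
        return 0.0
    return bern_loglik(k_in, n_in) + bern_loglik(k_out, n_out) - bern_loglik(total, m * n)


def permutation_pvalue(A, S, T, n_perm=200, seed=0):
    """Permute all m*n entries of A, recompute the statistic, and compare with t_0."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    t0 = lr_stat(A, S, T)
    flat = [a for row in A for a in row]          # phi(A): vectorized adjacency matrix
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(flat)                          # random permutation of the m*n entries
        A_b = [flat[i * n:(i + 1) * n] for i in range(m)]   # phi^{-1}: back to a matrix
        if lr_stat(A_b, S, T) >= t0:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Toy graph (hypothetical): an 8 x 8 matrix whose only edges fill a 4 x 4 block.
A = [[1 if u < 4 and v < 4 else 0 for v in range(8)] for u in range(8)]
p_value = permutation_pvalue(A, list(range(4)), list(range(4)))
print(p_value)
```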


5 |. RESULTS

We applied the IGDB approach to the motivating data set. The FA measures of DTI at 117 139 voxels were used in this study to characterize the white matter integrity.^30,42 The image acquisition parameters are described in Supplement A. Regarding genetic variants, 10 595 779 SNPs passed the quality control filters in the HCP data set (MAF < 0.01; HWE < 1e–6; r-squared > 0.03; call rate > 0.95) after imputation on the Michigan Imputation Server Minimac3 (https://imputationserver.sph.umich.edu) using the 1000 Genomes Project (phase 1 v3) reference set.⁴³

We preprocessed the diffusion weighted images following the ENIGMA-DTI workflow (http://enigma.ini.usc.edu/protocols/dti-protocols/). We further applied the Sequential Oligogenic Linkage Analysis Routines (SOLAR)-Eclipse software (https://www.nitrc.org/projects/se_linux) for the heritability analysis, of which imaging voxels were kept with significant heritability, based on the Fast and Powerful Heritability Inference (FPHI) function of SOLAR-Eclipse ( $P < 0.05$ ) in both the HCP and Amish Connectome Project (ACP). For these voxels, we performed vGWAS using PLINK while adjusting covariates including sex, age, BWI, and population characteristics using the first 10 principal components in our application.³¹ We then performed sure independence screening on SNPs with multiple imaging responses through a direct extension of univariate screening procedure.⁴⁴ 13 498 SNPs across 22 chromosomes survive into further analysis. The details are described in the Supplement A.

We tested the imaging-genetic associations between SNPs across 22 chromosomes and voxel-level imaging traits using our proposed method. Based on the procedures described in Sections 4.1 and 4.2, we extracted IGDBs and performed permutation tests to determine their statistical significance while controlling the family-wise error rate ( $q < 0.05$ ). We observe that different brain areas are influenced by distinct genetic loci. A Manhattan plot for all SNPs across 22 chromosomes, with selected imaging-genetic associations highlighted, and tables of SNPs and voxels across all 22 chromosomes are included in Supplement A.

In this section, we focus on SNPs on chromosome 1 to demonstrate their systematic association patterns with voxel traits, and then annotate the genes in the detected IGDB. Based on the matrix of association strength $W_{1178 \times 29627}$ (ie, Figure 3A), we detected an IGDB with 384 SNPs and 3803 voxels (Figure 3B) by maximizing the objective function (5). The maximization was implemented via Algorithm 2, using a grid search for $h$ over $\{1 / 20, 1 / 19, \dots, 1, 2, \dots, 19, 20\}$ and for $λ$ over the interval from 0.5 to 1.2 in increments of 0.02. We further calculated the P-value for the IGDB statistical inference via the permutation test, which indicated a statistically significant IGDB (P < 0.001). Although the IGDB is an irreducible subgraph, it can be further refined based on data-driven algorithms and spatial information of imaging data. We applied an existing community detection algorithm⁴⁵ to similarity matrices obtained from the detected IGDB. The refined pattern in Figure 3C displays 6 distinct SNP-voxel association clusters. Note that the refined structure cannot be identified without first revealing the IGDB by the proposed algorithm.
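The tuning grid described above is easy to enumerate explicitly. The sketch below builds the two grids and wraps a generic search over them; the `objective` argument is a hypothetical placeholder for the objective function (5), which is not reproduced here, and including the endpoint 1.2 in the $λ$ grid is an assumption.

```python
from fractions import Fraction
from itertools import product

# h grid: 1/20, 1/19, ..., 1/2, 1, 2, ..., 19, 20 (exact rationals)
h_grid = [Fraction(1, k) for k in range(20, 0, -1)] + [Fraction(k) for k in range(2, 21)]

# lambda grid: 0.50, 0.52, ..., 1.20; built from integer steps to avoid float drift
lambda_grid = [round(0.5 + 0.02 * i, 2) for i in range(36)]

def grid_search(objective, h_values=h_grid, lambda_values=lambda_grid):
    """Return the (h, lambda) pair that maximizes a user-supplied objective."""
    return max(
        product(h_values, lambda_values),
        key=lambda pair: objective(float(pair[0]), pair[1]),
    )
```

Under these assumptions the grid has 39 values of $h$ and 36 values of $λ$, that is, 1404 candidate pairs, each requiring one run of the extraction step.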

FIGURE 3.

IGDB procedure on chromosome 1: (A) is the input matrix $W$ , derived from vGWAS using *PLINK* while adjusting for covariates including sex, age, BMI, and population characteristics (the first 10 principal components).³¹ Each entry in the matrix is $- \log (p_{i j})$ of the association between an SNP and imaging voxel pair (ie, a hotter entry indicates a higher level of SNP-voxel association). Although $W$ is obtained after screening (eg, by voxel-level heritability analysis), it remains challenging to directly recognize the patterns of imaging-genetics associations; (B) shows the detected IGDB, which reveals *dense* blocks of imaging-genetics associations; (C) displays the refined pattern of the IGDB. In panels (B) and (C), we have reordered the SNPs and voxels to better illustrate their patterns of association.

Algorithm 1 is a greedy algorithm whose computational complexity is linear in the size of the original graph. With the tuning parameters determined through the likelihood function, as outlined in Algorithm 2, the computation remains efficient: detecting the IGDB of the SNP-voxel association graph on chromosome 1 took 20 minutes on a PC with a 3.60 GHz i7 CPU and 64 GB of memory. The time needed to compute the P-value depends on the number of permutations, which can be easily parallelized.

We illustrate the voxel clusters and corresponding SNP sets in Figure 4. For example, voxel cluster 2 (colored cyan) includes voxels mainly from the splenium of the corpus callosum (SCC), part of one of the largest white matter tracts, which connects many parts of the brain and lesions to which often result in varied neurological issues.⁴⁶ To annotate the SNPs in the identified clusters, we queried the SNPs in QTLbase (http://mulinlab.org/qtlbase/index.html)⁴⁷ for potential expression quantitative trait loci (eQTL) and examined the genes regulated by these variants in a tissue-specific pattern. The summary of associated genes related to brain tissues is provided in Supplement A as supporting information. In cluster 1, multiple SNPs are linked with the LEPR gene, a protein-coding gene for the leptin receptor that has been shown to be associated with obesity. White matter integrity is known to be strongly associated with obesity and body mass index.⁴⁸ Therefore, this cluster reveals the marginal association between the (obesity-related) LEPR gene and white matter integrity. In clusters 2 to 5, the associated genes, for example, S100A1, TAF1A, CFH, CFHR3, and DPH5, are associated with immune system functions (http://immunet.princeton.edu/, https://www.innatedb.com/moleculeSearch.do). White matter integrity can be influenced by immune system functions and systemic inflammation. In cluster 6, the NOS1AP gene has been found to be associated with white matter microstructure in previous studies.⁸ In addition, the NOS1AP gene has been identified as a risk factor for schizophrenia,⁴⁹ while the alterations of white matter integrity in patients with schizophrenia were studied in Kubicki et al.⁵⁰ In summary, our findings provide insights into the complex neurogenetic mechanisms by which genetic variants influence imaging traits in a systematic fashion, potentially via regulating gene expression, and generate hypotheses to be confirmed in future multi-omics studies.

FIGURE 4.

An illustration of the association patterns between SNP and voxel clusters on chromosome 1. We demonstrate the systematic imaging-genetics associations in an integrated Manhattan plot based on the results of our IGDB analysis. The highlighted subsets of SNPs are systematically associated with corresponding areas of the white matter tracts. The dual localized association patterns provide a straightforward interpretation of the genetic effects on location-specific brain areas.

6 |. SIMULATION

6.1 |. Synthetic data

We evaluate the finite-sample performance of our proposed method through simulation studies. We generate the input matrix $W_{m \times n}$ based on two sets of multivariate variables representing genetic variants $X_{m \times L}$ and imaging voxels $Y_{n \times L}$ . We let the pattern of $W_{m \times n}$ be determined by a graph $G = (U, V, E)$ . Specifically, we assume there exists an IGDB $G [S_{0}, T_{0}] = (S_{0}, T_{0}, E [S_{0}, T_{0}])$ with a higher proportion of edges representing significant imaging-genetics associations (ie, $μ_{1}$ ) than the rest of the graph (ie, $μ_{0}$ ). Then, we let the entries of $W_{m \times n}$ follow mixture distributions according to $G$ : $w_{u v} \mid δ_{u v} = 1 \sim μ_{1} t_{d f} (ν) + (1 - μ_{1}) t_{d f} (0)$ and $w_{u v} \mid δ_{u v} = 0 \sim μ_{0} t_{d f} (ν) + (1 - μ_{0}) t_{d f} (0)$ , where $δ_{u v}$ is an indicator variable with $δ_{u v} = 1$ for edges in the IGDB and 0 otherwise. $t_{d f} (ν)$ and $t_{d f} (0)$ are the non-null and null distributions of imaging-genetics associations, respectively. $t_{d f} (ν)$ is a $t$ distribution with $L - p$ degrees of freedom ( $p$ being the number of covariates) and noncentrality parameter $ν = \frac{θ}{\sqrt{4 / L}}$ , where $θ$ is the standardized effect size (eg, Cohen’s d). $μ_{1}$ and $μ_{0}$ are the proportions of the non-null distribution within the IGDB and outside it, respectively. We use $m = 200$ , $n = 100$ , and $L = 60$ . We simulate data sets under multiple settings by varying the size of the IGDB (ie, $(|S_{0}|, |T_{0}|) = (50, 40)$ and (30, 20)), the standardized effect size (ie, $θ = 0.8, 1$ , and 1.2), and the proportions of noisy edges (ie, $(μ_{1}, μ_{0}) = (0.8, 0.2)$ and (0.9, 0.1)). Additional simulation settings with larger graph and sample sizes are included in the Appendix.
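As a rough illustration of this generating mechanism, the following sketch draws one synthetic $W$ using only the Python standard library. It is an approximation under stated assumptions rather than the authors' simulation code: the IGDB is placed on the leading rows and columns, the number of covariates `p` defaults to 0 (the text does not give its value), and the noncentral $t$ variate is generated from its definition $(Z + ν) / \sqrt{χ^{2}_{d f} / d f}$ . For $θ = 0.8$ and $L = 60$ the noncentrality parameter is $ν = 0.8 / \sqrt{4 / 60} \approx 3.10$ .

```python
import math
import random

def noncentral_t(rng, df, nc):
    """Draw from a noncentral t distribution: (Z + nc) / sqrt(chi2_df / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return (z + nc) / math.sqrt(chi2 / df)

def simulate_W(m=200, n=100, size=(50, 40), theta=0.8, mu=(0.8, 0.2), L=60, p=0, seed=0):
    """Simulate W (m x n) and the IGDB indicator matrix delta (illustrative)."""
    rng = random.Random(seed)
    df = L - p                        # degrees of freedom; p is an assumed value
    nu = theta / math.sqrt(4.0 / L)   # noncentrality parameter
    S0, T0 = range(size[0]), range(size[1])  # assumption: IGDB on leading rows/columns
    W, delta = [], []
    for u in range(m):
        w_row, d_row = [], []
        for v in range(n):
            d = int(u in S0 and v in T0)
            prop = mu[0] if d else mu[1]             # mu_1 inside the IGDB, mu_0 outside
            nc = nu if rng.random() < prop else 0.0  # non-null vs null mixture component
            w_row.append(noncentral_t(rng, df, nc))
            d_row.append(d)
        W.append(w_row)
        delta.append(d_row)
    return W, delta
```

Calling `simulate_W(size=(30, 20), theta=1.2, mu=(0.9, 0.1))` would correspond to one of the other settings listed above.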

6.2 |. Performance metrics and results

We evaluate the performance of the proposed method at several levels. At the subgraph level, we assess the accuracy of IGDB inference by examining whether we can reject the null hypothesis (ie, no systematic imaging-genetics association). At the edge level, we evaluate the accuracy of the detected IGDB by comparing it with the ground truth in terms of edge differences. We also evaluated the node-assignment accuracy of the proposed method using synthetic data (see Section 1.5 of Supplement A in the Supporting information for details). For dense component extraction, we compared the performance only to Charikar’s algorithm³⁵ rather than to bi-clustering algorithms: because bi-clustering algorithms tend to assign all SNPs and voxels to clusters, they are not well suited to IGDB structure extraction (see the demonstration in the Appendix).

For IGDB inference, we consider a detected IGDB $G [\hat{S}, \hat{T}]$ to be a recovery of the underlying IGDB $G [S_{0}, T_{0}]$ if the null hypothesis is rejected in the proposed likelihood-ratio test and the detected IGDB has high similarity with $G [S_{0}, T_{0}]$ . Specifically, we consider $G [\hat{S}, \hat{T}]$ a true positive detection of $G [S_{0}, T_{0}]$ if $J_{X} \land J_{Y}$ (ie, the smaller of the two Jaccard indices) is no less than the cutoff, with

$$J_{X} = \frac{|S_{0} \cap \hat{S}|}{|S_{0} \cup \hat{S}|} \quad \text{and} \quad J_{Y} = \frac{|T_{0} \cap \hat{T}|}{|T_{0} \cup \hat{T}|},$$

and we reject the IGDB null hypothesis in the permutation test. We display the results with cutoffs of 0.8 and 0.9 on $J_{X} \land J_{Y}$ . Accordingly, the detected IGDB leads to a false negative finding if the P-value in the permutation test is not lower than the significance level (ie, 0.05). In addition, we record a false positive error if $G [\hat{S}, \hat{T}]$ has low similarity to $G [S_{0}, T_{0}]$ even though we reject the IGDB null hypothesis. We report the accuracy of inference by the false positive rate (FPR) and false negative rate (FNR) across replications.
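The subgraph-level criterion can be written compactly in code. The sketch below is a minimal reading of the rule described above; the function names are illustrative, and returning 1.0 for two empty sets is an assumed convention that the text does not address.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two node collections."""
    a, b = set(a), set(b)
    union = a | b
    # Assumed convention: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0

def is_true_positive(S0, T0, S_hat, T_hat, p_value, cutoff=0.8, alpha=0.05):
    """True if the null is rejected and min(J_X, J_Y) reaches the cutoff."""
    j_min = min(jaccard(S0, S_hat), jaccard(T0, T_hat))
    return p_value < alpha and j_min >= cutoff
```

For instance, recovering 48 of 50 true SNPs with no extra SNPs gives $J_{X} = 48 / 50 = 0.96$ , which passes both the 0.8 and 0.9 cutoffs provided the voxel set is recovered equally well and the null is rejected.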

Furthermore, we compare IGDB to commonly used multiple testing methods at the edge level: the positive false discovery rate (pFDR) by Storey⁵¹ and the Bonferroni correction. These correction methods are commonly used in GWAS and vGWAS analysis in practice. We compare the true $Δ = {\{δ_{u v}\}}_{u \in U, v \in V}$ with the estimated $\hat{Δ} = {\{{\hat{δ}}_{u v}\}}_{u \in U, v \in V}$ from the different methods. For the proposed method, we obtain $\hat{Δ}$ based on the extracted IGDB $G [\hat{S}, \hat{T}]$ and the hypothesis testing. In particular, if we reject the IGDB null hypothesis with a detected IGDB $G [\hat{S}, \hat{T}]$ , we let $\hat{Δ} = \{{\hat{δ}}_{u v}\} = \{I (e_{u v} \in G [\hat{S}, \hat{T}])\}$ . If we fail to reject, we treat $\hat{S}$ and $\hat{T}$ as empty sets such that $\hat{Δ} = 0_{m \times n}$ . An FDR threshold of 0.2 and a corrected $α$ level of 0.05 are used for the pFDR and the Bonferroni correction, respectively.

Subsequently, based on the ${\hat{δ}}_{u v}$ observed from different methods, and true parameters $δ_{u v}$ , we calculate true positive rate (TPR) and true negative rate (TNR) as:

$$\mathrm{TPR} = \frac{\sum_{u, v} I (δ_{u v} = {\hat{δ}}_{u v} = 1)}{\sum_{u, v} I (δ_{u v} = 1)}, \quad \mathrm{TNR} = \frac{\sum_{u, v} I (δ_{u v} = {\hat{δ}}_{u v} = 0)}{\sum_{u, v} I (δ_{u v} = 0)} .$$

The associated means and standard deviations are reported based on 100 replications for each simulation scenario.
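The edge-level metrics translate directly into a short function. This sketch assumes the true and estimated indicators are stored as nested lists of 0/1 values with the same shape; returning `nan` when a denominator is zero is an assumption, since the text does not cover that case.

```python
def tpr_tnr(delta, delta_hat):
    """Edge-level true positive rate and true negative rate."""
    tp = fn = tn = fp = 0
    for row, row_hat in zip(delta, delta_hat):
        for d, d_hat in zip(row, row_hat):
            if d == 1 and d_hat == 1:
                tp += 1          # true edge, detected
            elif d == 1:
                fn += 1          # true edge, missed
            elif d_hat == 0:
                tn += 1          # non-edge, correctly left out
            else:
                fp += 1          # non-edge, falsely detected
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr
```

As a small check, with true indicators `[[1, 1, 0], [0, 0, 0]]` and estimates `[[1, 0, 0], [0, 1, 0]]`, one of two true edges is found and three of four non-edges are correctly excluded, so the function returns `(0.5, 0.75)`.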

The results from the IGDB inference are summarized in Table 1. The power of the IGDB inference depends on the size and SNR (determined by the standardized effect size) of the underlying IGDB $G [S_{0}, T_{0}]$ , which concurs with our theoretical results. We fail to reject the IGDB null hypothesis only in one simulation setting, with the smaller IGDB size (30, 20), effect size 0.8, and higher noise level (0.8, 0.2), where the FNR is 0.06.

TABLE 1.

IGDB inference results under varied SNRs and noises.

$(|S_{0}|, |T_{0}|)$	$(μ_{1}, μ_{0})$	Metric	$θ = 0.8$	$θ = 1.0$
(50,40)	(0.9, 0.1)	FPR (0.8)	0 (0)	0 (0)
		FPR (0.9)	0 (0)	0 (0)
		FNR	0 (0)	0 (0)
	(0.8, 0.2)	FPR (0.8)	0 (0)	0 (0)
		FPR (0.9)	0 (0)	0 (0)
		FNR	0 (0)	0 (0)
(30,20)	(0.9, 0.1)	FPR (0.8)	0 (0)	0 (0)
		FPR (0.9)	0 (0)	0 (0)
		FNR	0 (0)	0 (0)
	(0.8, 0.2)	FPR (0.8)	0 (0)	0 (0)
		FPR (0.9)	0.2100 (0.4073)	0.0400 (0.1960)
		FNR	0.0600 (0.2375)	0 (0)


Note: We summarize the FPR (with cutoff of 0.8 and 0.9 on the $J_{X} \land J_{Y}$ ) and FNR to evaluate the estimated IGDB. The results suggest robust and accurate performance of our method at a bi-clique level (ie, revealing patterns).

The comparative edge-level results from the proposed method and competing methods are displayed in Table 2 for different sizes of IGDB. All three methods achieve a higher TPR with higher SNRs and lower noise levels. The proposed method outperforms the pFDR and Bonferroni correction methods in both TPR and TNR across the scenarios. The pFDR and Bonferroni methods both have low TPR relative to their TNR, indicating a stringent cutoff, while the proposed method achieves a higher TPR and maintains a higher TNR than the others. The Bonferroni method is the most stringent: its TPR falls below 10% at the lowest SNR (0.8) in all settings.

TABLE 2.

Edge-wise accuracy under varied IGDB sizes, SNRs and noises.

$(|S_{0}|, |T_{0}|)$	$(μ_{1}, μ_{0})$	Method	Metric	$θ = 0.8$	$θ = 1.0$	$θ = 1.2$
(50,40)	(0.9, 0.1)	IGDB	TPR	0.9879 (0.0184)	0.9942 (0.0124)	0.9968 (0.0097)
		IGDB	TNR	1 (0)	1 (0)	1 (0)
		pFDR	TPR	0.7453 (0.0090)	0.8686 (0.0045)	0.8995 (0.0023)
		pFDR	TNR	0.8858 (0.0020)	0.8667 (0.0018)	0.8619 (0.0018)
		Bonferroni	TPR	0.0520 (0.0048)	0.1739 (0.0092)	0.3941 (0.0096)
		Bonferroni	TNR	0.9942 (0.0005)	0.9806 (0.0008)	0.9562 (0.0012)
	(0.8, 0.2)	IGDB	TPR	0.9938 (0.0126)	0.9982 (0.0064)	0.9984 (0.0061)
		IGDB	TNR	0.9998 (0.0006)	1.0000 (0.0003)	1.0000 (0.0004)
		pFDR	TPR	0.7032 (0.0067)	0.7903 (0.0039)	0.8095 (0.0027)
		pFDR	TNR	0.7842 (0.0021)	0.7577 (0.0019)	0.7517 (0.0018)
		Bonferroni	TPR	0.0458 (0.0043)	0.1557 (0.0084)	0.3506 (0.0097)
		Bonferroni	TNR	0.9884 (0.0007)	0.9612 (0.0014)	0.9125 (0.0020)
(30,20)	(0.9, 0.1)	IGDB	TPR	0.9987 (0.0081)	0.9992 (0.0060)	1 (0)
		IGDB	TNR	1.0000 (0.0001)	1 (0)	1 (0)
		pFDR	TPR	0.7043 (0.0176)	0.8537 (0.0085)	0.8954 (0.0042)
		pFDR	TNR	0.9017 (0.0019)	0.8799 (0.0015)	0.8741 (0.0014)
		Bonferroni	TPR	0.0517 (0.0082)	0.1741 (0.0163)	0.3946 (0.0175)
		Bonferroni	TNR	0.9942 (0.0005)	0.9807 (0.0009)	0.9561 (0.0012)
	(0.8, 0.2)	IGDB	TPR	0.8527 (0.2248)	0.9645 (0.0398)	0.9778 (0.0287)
		IGDB	TNR	0.9996 (0.0009)	0.9995 (0.0009)	0.9997 (0.0005)
		pFDR	TPR	0.6891 (0.0114)	0.7857 (0.0075)	0.8069 (0.0045)
		pFDR	TNR	0.7952 (0.0022)	0.7661 (0.0017)	0.7596 (0.0019)
		Bonferroni	TPR	0.0473 (0.0095)	0.1563 (0.0144)	0.3525 (0.0173)
		Bonferroni	TNR	0.9884 (0.0008)	0.9610 (0.0013)	0.9123 (0.0017)


Note: We compare the performance of IGDB with multiple testing correction methods in terms of the accuracy of individual SNP-voxel pairs. The extracted IGDB patterns dramatically improve the SNP-voxel pair level inference accuracy by allowing pairs to borrow strengths from each other.

7 |. DISCUSSION

Imaging-genetics studies aim to model the predictive mechanism of genetic variants on quantitative imaging measures. However, high dimensionality and complex association patterns between genetic variants and imaging traits pose a considerable challenge for statistical estimation and inference. For example, purely region-level inference erases local voxel heterogeneity and thus may be ineffective in learning the spatial specificity of imaging voxels. In this article, we have developed an IGDB multivariate to multivariate analysis tool to identify systematic associations between multivariate voxel-level imaging features and multivariate genetic variants. Our method focuses on the systematic polygenic and pleiotropic patterns rather than individual pairwise associations, and thus mitigates the challenges of ultra-high dimensionality that arise in multivariate to multivariate association analysis. In addition, our high-resolution voxel-level genome-wide association analysis is not constrained by pre-specified regions of interest; hence, it fully accounts for the variability between voxels and yields data-driven brain regions associated with functionally related genetic loci. As a result, our findings are more biologically interpretable and meaningful.

We develop a new optimization solution to extract the IGDB by leveraging graph properties that we established in our theoretical study. Our IGDB extraction algorithm is computationally efficient and scalable. The input data for our method can be either individual-level data or GWAS summary statistics. The IGDB inference method controls the family-wise error rate for IGDB-level findings. We provide theoretical results to guarantee the numerical performance of IGDB extraction and the accuracy of the inference model. Although initially proposed for analyzing systematic association patterns between SNPs and voxels, this approach is also well suited for analyzing region-level imaging data, where spatial constraints are not necessary.

In the real data application, we applied our method to the HCP data set to study the genetic effects on white matter microstructure integrity. The results revealed a variety of functionally related genetic loci that are associated with sub-regions of white matter tracts in the posterior corpus callosum. These novel findings are consistent with previous findings.³⁰ Our annotation analysis further provides evidence that the selected SNPs are associated with white matter microstructure through gene expression. The overall computational load for imaging-genetics analysis remains heavy despite improved algorithms and computational facilities. Since our initial vGWAS is performed using GWAS analysis tools (eg, PLINK), the analysis is limited to individual SNPs. Nevertheless, the input of our method is a set of vGWAS results, so it is compatible with any vGWAS analysis method. Our IGDB algorithm can also be extended to further constrain the IGDB structure by leveraging the functional annotation of genetic variants.⁵²

In summary, we have developed a new neuroimaging-GWAS tool to identify systematic associations between multivariate imaging features and multivariate genetic variants. Our IGDB method is computationally efficient and improves the accuracy and power through revealing systematic polygenic and pleiotropic patterns.

Supplementary Material

Appendix S2 Positions of SNPs from imaging-genetic association clusters detected in all chromosomes.

NIHMS2067460-supplement-Appendix_S2_Positions_of_SNPs_from_imaging-genetic_association_clusters_detected_in_all_chromosomes_.xlsx^{(55.1KB, xlsx)}

Appendix S3 Coordinates of voxels from imaging-genetic association clusters detected in all chromosomes.

NIHMS2067460-supplement-Appendix_S3_Coordinates_of_voxels_from_imaging-genetic_association_clusters_detected_in_all_chromosomes_.xlsx^{(1MB, xlsx)}

Appendix S4 Summary of genes related to brain tissues from annotation analysis.

NIHMS2067460-supplement-Appendix_S4_Summary_of_genes_related_to_brain_tissues_from_annotation_analysis_.xlsx^{(15.9KB, xlsx)}

Appendix S5 Supplemental Material.

NIHMS2067460-supplement-Appendix_S5_Supplemental_Material_.zip^{(2.8MB, zip)}

Appendix S1 Supplemental Material.

NIHMS2067460-supplement-Appendix_S1_Supplemental_Material_.pdf^{(923.4KB, pdf)}

ACKNOWLEDGEMENTS

This work was partially supported by the National Institute on Drug Abuse of the National Institutes of Health under Award Numbers 1DP1DA048968-01, R01EB015611, and R01MH094520. The second author, Dr. Yuan Zhang, was supported by NSF Grant DMS-2311109.

APPENDIX A. ADDITIONAL NUMERICAL RESULTS

Comparisons with bi-clustering algorithms.

In our simulation analysis, we compared our method only to Charikar’s algorithm rather than to bi-clustering algorithms because the latter are not well suited to dense bi-clique extraction. To demonstrate this, we applied the classic spectral bi-clustering algorithm⁵³ to a simulated data set. Specifically, we generated a bipartite graph with $m = 200$ , $n = 100$ , $L = 60$ , IGDB size $(|S_{0}|, |T_{0}|) = (50, 40)$ , standardized effect size $θ = 0.8$ , and proportions of noisy edges $(μ_{1}, μ_{0}) = (0.8, 0.2)$ . The true structure of the simulated bipartite graph and the subnetworks detected by the competing methods are displayed in Figure A1. Conventional bi-clustering algorithms can miss the dense bi-cliques.

FIGURE A1 — Comparison with another bi-clustering algorithm in a simulated data set. True and detected subnetworks are highlighted in red. (A) displays the true bipartite graph with an IGDB; (B) shows the IGDB structure extracted by Algorithms 1 and 2; (C) shows subnetworks detected by the spectral co-clustering algorithm with $K = 2$ ; (D) highlights several subnetworks detected by the spectral co-clustering algorithm with $K = 10$ .

Simulation results from large graphs.

We extended our simulation studies to larger graphs by setting $m = 800$ , $n = 500$ , and $L = 200$ . The synthetic data were generated with an IGDB of size $(|S_{0}|, |T_{0}|) = (100, 80)$ . The results are displayed in Table A1 with the same settings of standardized effect size (ie, $θ = 0.8, 1$ , and 1.2) and proportions of noisy edges (ie, $(μ_{1}, μ_{0}) = (0.8, 0.2)$ and (0.9, 0.1)) as in the main analysis.

TABLE A1.

Edge-wise accuracy under varied SNRs and noises with $(|S_{0}|, |T_{0}|) = (100, 80)$ .

$(μ_{1}, μ_{0})$	Method	Metric	$θ = 0.8$	$θ = 1.0$	$θ = 1.2$
(0.9, 0.1)	IGDB	TPR	0.9600 (0.0000)	0.9600 (0.0000)	0.9600 (0.0000)
	IGDB	TNR	0.9998 (0.0000)	0.9998 (0.0000)	0.9998 (0.0000)
	pFDR	TPR	0.9025 (0.0005)	0.9029 (0.0006)	0.9029 (0.0006)
	pFDR	TNR	0.8747 (0.0003)	0.8746 (0.0003)	0.8747 (0.0003)
	Bonferroni	TPR	0.6060 (0.0048)	0.8692 (0.0021)	0.8994 (0.0003)
	Bonferroni	TNR	0.9326 (0.0002)	0.9035 (0.0001)	0.9001 (0.0000)
(0.8, 0.2)	IGDB	TPR	0.9545 (0.0101)	0.9598 (0.0024)	0.9598 (0.0024)
	IGDB	TNR	0.9998 (0.0001)	0.9998 (0.0000)	0.9998 (0.0000)
	pFDR	TPR	0.8100 (0.0011)	0.8100 (0.0011)	0.8101 (0.0011)
	pFDR	TNR	0.7598 (0.0004)	0.7597 (0.0004)	0.7597 (0.0004)
	Bonferroni	TPR	0.5385 (0.0044)	0.7724 (0.0018)	0.7994 (0.0001)
	Bonferroni	TNR	0.8652 (0.0003)	0.8069 (0.0001)	0.8001 (0.0001)


Footnotes

CONFLICT OF INTEREST STATEMENT

The authors declare no potential conflict of interest.

SUPPORTING INFORMATION

Additional supporting information can be found online in the Supporting Information section at the end of this article.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions. Software in the form of Matlab code, together with a sample input data set and complete documentation, is available on GitHub at https://github.com/qwu1221/multi2multi.

REFERENCES

1. Meisenzahl E, Koutsouleris N, Bottlender R, et al. Structural brain alterations at different stages of schizophrenia: a voxel-based morphometric study. Schizophr Res. 2008;104(1–3):44–60.
2. Lee S, Viqar F, Zimmerman ME, et al. White matter hyperintensities are a core feature of Alzheimer’s disease: evidence from the dominantly inherited Alzheimer network. Ann Neurol. 2016;79(6):929–939.
3. Savitz JB, Drevets WC. Imaging phenotypes of major depressive disorder: genetic correlates. Neuroscience. 2009;164(1):300–330.
4. Ge T, Schumann G, Feng J. Imaging genetics-towards discovery neuroscience. Quant Biol. 2013;1(4):227–245.
5. Liu J, Calhoun VD. A review of multivariate analyses in imaging genetics. Front Neuroinform. 2014;8:29.
6. Nathoo FS, Kong L, Zhu H, Initiative ADN. A review of statistical methods in imaging genetics. Can J Stat. 2019;47(1):108–131.
7. Smith SM, Douaud G, Chen W, et al. An expanded set of genome-wide association studies of brain imaging phenotypes in UK biobank. Nat Neurosci. 2021;24(5):737–745.
8. Zhao B, Zhang J, Ibrahim JG, et al. Large-scale GWAS reveals genetic architecture of brain white matter microstructure and genetic overlap with cognitive and mental health traits (n= 17,706). Mol Psychiatry. 2019;26:3943–3955.
9. Zhao B, Li T, Yang Y, et al. Common genetic variation influencing human white matter microstructure. Science. 2021;372(6548).
10. Zhu H, Khondker Z, Lu Z, Ibrahim JG. Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. J Am Stat Assoc. 2014;109(507):977–990.
11. Huang M, Nichols T, Huang C, et al. FVGWAS: fast voxelwise genome wide association analysis of large-scale imaging genetic data. Neuroimage. 2015;118:613–627.
12. Huang C, Thompson P, Wang Y, et al. FGWAS: functional genome wide association analysis. Neuroimage. 2017;159:107–121.
13. Ge T, Feng J, Hibar DP, Thompson PM, Nichols TE. Increasing power for voxel-wise genome-wide association studies: the random field theory, least square kernel machines and fast permutation procedures. Neuroimage. 2012;63(2):858–873.
14. Ge T, Nichols TE, Ghosh D, et al. A kernel machine method for detecting effects of interaction between multidimensional variable sets: an imaging genetics application. Neuroimage. 2015;109:505–514.
15. Hibar DP, Stein JL, Kohannim O, et al. Voxelwise gene-wide association study (vGeneWAS): multivariate gene-based association testing in 731 elderly subjects. Neuroimage. 2011;56(4):1875–1891.
16. Stein JL, Hua X, Lee S, et al. Voxelwise genome-wide association study (vGWAS). Neuroimage. 2010;53(3):1160–1174.
17. Chi EC, Allen GI, Zhou H, Kohannim O, Lange K, Thompson PM. Imaging genetics via sparse canonical correlation analysis. 2013 IEEE 10th International Symposium on Biomedical Imaging. New York: IEEE; 2013:740–743.
18. Greenlaw K, Szefer E, Graham J, Lesperance M, Nathoo FS, Initiative ADN. A Bayesian group sparse multi-task regression model for imaging genetics. Bioinformatics. 2017;33(16):2513–2522.
19. Hardoon DR, Ettinger U, Mourão-Miranda J, et al. Correlation-based multivariate analysis of genetic influence on brain volume. Neurosci Lett. 2009;450(3):281–286.
20. Kong D, An B, Zhang J, Zhu H. L2RM: low-rank linear regression models for high-dimensional matrix responses. J Am Stat Assoc. 2020;115(529):403–424.
21. Le Floch É, Guillemot V, Frouin V, et al. Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse partial least squares. Neuroimage. 2012;63(1):11–24.
22. Liu J, Pearlson G, Windemuth A, Ruano G, Perrone-Bizzozero NI, Calhoun V. Combining fMRI and SNP data to investigate connections between brain function and genetics using parallel ICA. Hum Brain Mapp. 2009;30(1):241–255.
23. Wang H, Nie F, Huang H, et al. Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort. Bioinformatics. 2012;28(2):229–237.
24. Vounou M, Nichols TE, Montana G. Discovering genetic associations with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression approach. Neuroimage. 2010;53(3):1147–1159.
25. Vounou M, Janousova E, Wolz R, et al. Sparse reduced-rank regression detects genetic associations with voxel-wise longitudinal phenotypes in Alzheimer’s disease. Neuroimage. 2012;60(1):700–716.
26. Marcus DS, Harms MP, Snyder AZ, et al. Human connectome project informatics: quality control, database services, and data visualization. Neuroimage. 2013;80:202–219.
27. Van Essen DC, Smith SM, Barch DM, et al. The WU-Minn human connectome project: an overview. Neuroimage. 2013;80:62–79.
28. Jahanshad N, Kochunov PV, Sprooten E, et al. Multi-site genetic analysis of diffusion images and voxelwise heritability analysis: a pilot project of the ENIGMA–DTI working group. Neuroimage. 2013;81:455–469.
29. Kochunov P, Jahanshad N, Sprooten E, et al. Multi-site study of additive genetic effects on fractional anisotropy of cerebral white matter: comparing meta and megaanalytical approaches for data pooling. Neuroimage. 2014;95:136–150.
30. Kochunov P, Kochunov P. Heritability of fractional anisotropy in human white matter: a comparison of human connectome project and ENIGMA-DTI data. Neuroimage. 2015;111:300–311.
31. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4(1):7.
32. Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat. 2000;25(1):60–83.
33. Efron B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Vol 1. Cambridge, UK: Cambridge University Press; 2012.
34. Woo CW, Krishnan A, Wager TD. Cluster-extent based thresholding in fMRI analyses: pitfalls and recommendations. Neuroimage. 2014;91:412–419.
35. Charikar M. Greedy approximation algorithms for finding dense components in a graph. International Workshop on Approximation Algorithms for Combinatorial Optimization. Berlin: Springer; 2000:84–95.
36. Khuller S, Saha B. On finding dense subgraphs. International Colloquium on Automata, Languages, and Programming. Berlin: Springer; 2009:597–608.
37. Amini AA, Chen A, Bickel PJ, Levina E. Pseudo-likelihood methods for community detection in large sparse networks. Ann Stat. 2013;41(4):2097–2122.
38. Cheng Y, Church GM. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol. 2000;8:93–103.
39. Xu M, Jog V, Loh PL. Optimal rates for community estimation in the weighted stochastic block model. Ann Stat. 2020;48(1):183–204.
40. Zalesky A, Fornito A, Bullmore ET. Network-based statistic: identifying differences in brain networks. Neuroimage. 2010;53(4):1197–1207.
41. Nichols TE. Multiple testing corrections, nonparametric methods, and random field theory. Neuroimage. 2012;62(2):811–815.
42. Kochunov P, Rowland LM, Fieremans E, et al. Diffusion-weighted imaging uncovers likely sources of processing-speed deficits in schizophrenia. Proc Natl Acad Sci. 2016;113(47):13504–13509.
43. Das S, Forer L, Schönherr S, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–1287.
44. Zou H, He D, Zhou Y. On sure screening with multiple responses. Stat Sin. 2021;31:1749–1777. doi: 10.5705/ss.202018.0462
45. Chen S, Kang J, Xing Y, Zhao Y, Milton DK. Estimating large covariance matrix with network topology for high-dimensional biomedical data. Comput Stat Data Anal. 2018;127:82–95.
46. Park MK, Hwang SH, Jung S, Hong SS, Kwon SB. Lesions in the splenium of the corpus callosum: clinical and radiological implications. Neurol Asia. 2014;19(1):79–88.
47. Zheng Z, Huang D, Wang J, et al. QTLbase: an integrative resource for quantitative trait loci across multiple human molecular phenotypes. Nucleic Acids Res. 2020;48(D1):D983–D991.
48. Verstynen TD, Weinstein AM, Schneider WW, Jakicic JM, Rofey DL, Erickson KI. Increased body mass index is associated with a global and distributed decrease in white matter microstructural integrity. Psychosom Med. 2012;74(7):682.
49. Brzustowicz LM, Simone J, Mohseni P, et al. Linkage disequilibrium mapping of schizophrenia susceptibility to the CAPON region of chromosome 1q22. Am J Hum Genet. 2004;74(5):1057–1063.
50. Kubicki M, Park H, Westin CF, et al. DTI and MTR abnormalities in schizophrenia: analysis of white matter integrity. Neuroimage. 2005;26(4):1109–1118.
51. Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodology. 2002;64(3):479–498.
52. Li X, Li Z, Zhou H, et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet. 2020;52(9):969–983.
53. Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2001:269–274.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Appendix S2. Positions of SNPs from imaging-genetic association clusters detected in all chromosomes.

NIHMS2067460-supplement-Appendix_S2_Positions_of_SNPs_from_imaging-genetic_association_clusters_detected_in_all_chromosomes_.xlsx (55.1 KB, xlsx)

Appendix S3. Coordinates of voxels from imaging-genetic association clusters detected in all chromosomes.

NIHMS2067460-supplement-Appendix_S3_Coordinates_of_voxels_from_imaging-genetic_association_clusters_detected_in_all_chromosomes_.xlsx (1 MB, xlsx)

Appendix S4. Summary of genes related to brain tissues from annotation analysis.

NIHMS2067460-supplement-Appendix_S4_Summary_of_genes_related_to_brain_tissues_from_annotation_analysis_.xlsx (15.9 KB, xlsx)

Appendix S5. Supplemental Material.

NIHMS2067460-supplement-Appendix_S5_Supplemental_Material_.zip (2.8 MB, zip)

Appendix S1. Supplemental Material.

NIHMS2067460-supplement-Appendix_S1_Supplemental_Material_.pdf (923.4 KB, pdf)
