ABSTRACT
Estimating phenotype networks is a growing field in computational biology. It deepens the understanding of disease etiology and is useful in many applications. In this study, we present a method that constructs a phenotype network by assuming a Gaussian linear structure model embedding a directed acyclic graph (DAG). We utilize genetic variants as instrumental variables and show how our method only requires access to summary statistics from a genome-wide association study (GWAS) and a reference panel of genotype data. Besides estimation, a distinct feature of the method is its summary statistics-based likelihood ratio test on directed edges. We applied our method to estimate a causal network of 29 cardiovascular-related proteins and linked the estimated network to Alzheimer’s disease (AD). A simulation study was conducted to demonstrate the effectiveness of this method. An R package sumdag implementing the proposed method, all relevant code, and a Shiny application are available.
Keywords: Alzheimer’s disease (AD), directed acyclic graph (DAG), genome-wide association study (GWAS), likelihood ratio test, proteomics
1. INTRODUCTION
Network analysis has deepened our understanding of biological mechanisms and disease etiologies (Zhang and Itan, 2019). Specifically, protein–protein interaction (PPI) networks that capture the interplay of proteins in the biomolecular systems are vital for normal cell functions (Snider et al., 2015). Disruption of the normal pattern in a PPI network can cause or indicate a disease state. Studies have linked co-regulatory networks of proteins to a variety of complex diseases (Ross and Poirier, 2004; Emilsson et al., 2018). Recently, a network-based method modeling PPI achieved high accuracy in cancer prediction (Id et al., 2021). Cheng et al. (2021) further showed that disease-associated variants were significantly enriched in the sequences coding PPI interfaces compared to variants in healthy individuals. Their work also demonstrated associations of PPIs with drug resistance and overall survival, highlighting the use of protein networks for informing genotype-based therapy. Network-based analyses have shown their potential in advancing precision medicine for complex diseases over traditional approaches, which focus on monogenic mutations and independent assessment of risk factors (Napoli et al., 2020).
Network analyses can be categorized into 2 groups. One utilizes only phenotypic data to construct networks. For example, weighted gene co-expression network analysis estimates an undirected network, which is further characterized using dimension reduction techniques (Zhang and Horvath, 2005). The graphical lasso employs penalized methods to estimate a Gaussian graphical model for a large number of variables (Witten et al., 2012). Bayesian network analysis (Friedman et al., 2000) estimates directed acyclic graphs (DAGs), which are widely accepted in biological systems (Ashburner et al., 2000), and recent computational improvements have substantially shortened its running time (Liu et al., 2016). The other group of methods exploits instrumental variable (IV) techniques to estimate a DAG, assuming a linear structural equation model. Chen et al. (2018) developed a penalized two-stage least squares method to estimate a DAG, assuming known intervention targets. Li et al. (2023) further extended the work to accommodate unknown intervention targets commonly encountered in biological applications.
All the methods above require individual-level data, which can be difficult to obtain, especially for human studies, due to logistic limitations and privacy concerns. On the other hand, many genome-wide association studies (GWAS) have shared their summary statistics publicly, generating a rich and valuable data resource. Thus, we propose adapting the network estimation and inference methods of Li et al. (2023) to rely only on GWAS summary statistics and a genetic reference panel, both much more easily accessible. We will show how a DAG can be estimated for cardiovascular-related proteins using a large-scale proteomic GWAS summary dataset, and then link the protein network to Alzheimer’s disease (AD). The algorithm for the proposed work is packaged in R. Our work represents one of the initial attempts to utilize GWAS summary statistics in the construction of a DAG. We expect that our work can facilitate more comprehensive network analysis in studying biological and medical relationships. In addition to inferring PPI networks, our method is readily applicable to understanding the interplay of many other molecular and non-molecular phenotypes, as long as the corresponding GWAS summary statistics are available.
2. METHODS
2.1. Network modeling and data
2.1.1. Directed phenotype network
Our goal is to use genotypes as external interventions to construct and infer a DAG that describes the directed relationships among a set of phenotypes. In the framework of interventional Gaussian DAG (Li et al., 2023), we assume
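$$\mathbf{Y} = \mathbf{Y}\mathbf{U} + \mathbf{X}\mathbf{W} + \mathbf{E}, \tag{1}$$
where $\mathbf{Y}$ is the $N \times P$ data matrix of $P$ phenotypes, $\mathbf{X}$ is the $N \times Q$ data matrix of $Q$ genotypes serving as IVs, $\mathbf{E}$ is the $N \times P$ error matrix with each row sampled from $N(\mathbf{0}, \boldsymbol{\Sigma})$ with $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_P^2)$, and $N$ is the sample size. Note that Equation (1) lacks an intercept because we assume phenotype and genotype are centered at mean 0, which could be easily done with individual-level GWAS data.
In Equation (1), $\mathbf{U}$ and $\mathbf{W}$ are unknown parameters to be estimated. The $P \times P$ matrix $\mathbf{U} = (u_{kj})$ specifies the network structure such that $u_{kj} \neq 0$ indicates a directed relation from phenotype $k$ to phenotype $j$. The $Q \times P$ matrix $\mathbf{W} = (w_{qp})$ specifies the targets and strengths of interventions in that $w_{qp} \neq 0$ indicates an interventional relation from genotype $q$ to phenotype $p$. Let $\mathcal{E} = \{(k, j): u_{kj} \neq 0\}$ be the set of directed relations, and $\mathcal{I} = \{(q, p): w_{qp} \neq 0\}$ be the set of interventional relations.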
2.1.2. Summary statistics and reference panel
In Equation (1), the data matrices $\mathbf{Y}$ and $\mathbf{X}$ contain individual information, such that each row represents the variables measured on an individual. On the other hand, GWAS summary statistics aggregate $N$ observations into a single measure for each single nucleotide polymorphism (SNP) across the whole genome. This measure is the average effect of having 1 copy of the effect allele of the SNP on the phenotype being studied. It is estimated by $\hat{\beta}_{qp}$, often reported along with accompanying statistics, such as the corresponding standard error $\mathrm{se}(\hat{\beta}_{qp})$, z-score $z_{qp}$, sample size $N$, reference allele (REF), minor allele frequency (MAF), and P-value. The summary statistics of the $Q$ SNPs in $\mathbf{X}$ are included in the GWAS summary data.
As a complement to the summary-level data, a reference panel comprising genotypic data of individuals from a general population provides the correlation structure among the genotypes. Many existing resources can be used for such a reference panel (International HapMap Consortium, 2005; The 1000 Genomes Project Consortium, 2015; Bycroft et al., 2018; Taliun et al., 2021). Given an $N_r \times Q$ (centered) reference panel $\mathbf{X}_r$ of $N_r$ individuals, we follow the conventional suggestion (Mak et al., 2017) to regularize the genetic correlation matrix $\mathbf{R}_r$ computed from $\mathbf{X}_r$, such that $\tilde{\mathbf{R}} = (1 - s_p)\mathbf{R}_r + s_p\mathbf{I}$, where $0 \le s_p \le 1$ is a real number controlling the degree of regularization.
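From the summary statistics and the reference panel, we compute the following quantities that are used for the construction and inference of the directed phenotype network. The subsequent computation also assumes $\mathbf{Y}$ and $\mathbf{X}$ are centered, which does not influence $\hat{\beta}_{qp}$ and the accompanying statistics.
- The covariance matrix of genotypes $N^{-1}\mathbf{X}^\top\mathbf{X}$ is estimated by $\hat{\mathbf{D}}\tilde{\mathbf{R}}\hat{\mathbf{D}}$, where $\hat{\mathbf{D}} = \mathrm{diag}(\hat{\sigma}_{X_1}, \ldots, \hat{\sigma}_{X_Q})$ holds the estimated genotype standard deviations given in the next item, so that $\widehat{\mathbf{X}^\top\mathbf{X}} = N\hat{\mathbf{D}}\tilde{\mathbf{R}}\hat{\mathbf{D}}$.
- Let $\sigma_{X_q}^2 = N^{-1}\mathbf{X}_q^\top\mathbf{X}_q$. Then $\sigma_{X_q}^2$ is estimated by $\hat{\sigma}_{X_q}^2 = 2\,\mathrm{MAF}_q(1 - \mathrm{MAF}_q)$, provided that MAF is reported in the summary statistics, or otherwise estimated by $[N_r^{-1}\mathbf{X}_r^\top\mathbf{X}_r]_{qq}$, the $q$th diagonal element of $N_r^{-1}\mathbf{X}_r^\top\mathbf{X}_r$.
- Given $\hat{\sigma}_{X_q}^2$, $\mathbf{X}_q^\top\mathbf{Y}_p$ is estimated by $\widehat{\mathbf{X}_q^\top\mathbf{Y}_p} = N\hat{\sigma}_{X_q}^2\hat{\beta}_{qp}$.
- For $\mathbf{Y}_p^\top\mathbf{Y}_p$, we use the median estimate $\widehat{\mathbf{Y}_p^\top\mathbf{Y}_p} = \mathrm{median}_{q}\,\{N\hat{\sigma}_{X_q}^2[(N - 1)\,\mathrm{se}(\hat{\beta}_{qp})^2 + \hat{\beta}_{qp}^2]\}$.
- Finally, $\mathbf{Y}_p^\top\mathbf{Y}_{p'}$ ($p \neq p'$) is estimated using the null SNPs from GWAS summary statistics, that is, SNPs not marginally associated with $\mathbf{Y}_p$ or $\mathbf{Y}_{p'}$. Following Kim et al. (2015), $\mathrm{cor}(\mathbf{Y}_p, \mathbf{Y}_{p'}) \approx \mathrm{cor}(\mathbf{z}_p, \mathbf{z}_{p'})$, where $\mathbf{z}_p$ and $\mathbf{z}_{p'}$ are vectors of z-scores for the null SNPs. Thus, we can rearrange the sample correlation formula for (centered) phenotype variables and plug in our approximation to obtain $\widehat{\mathbf{Y}_p^\top\mathbf{Y}_{p'}} = \mathrm{cor}(\mathbf{z}_p, \mathbf{z}_{p'})\,(\widehat{\mathbf{Y}_p^\top\mathbf{Y}_p}\;\widehat{\mathbf{Y}_{p'}^\top\mathbf{Y}_{p'}})^{1/2}$. In practice, the SNPs with P-values >0.05 are considered as null SNPs. An alternative method for estimating $\mathrm{cor}(\mathbf{Y}_p, \mathbf{Y}_{p'})$ from GWAS summary statistics (Bulik-Sullivan et al., 2015) is also feasible, although not used herein.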
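The quantities listed above can be illustrated with a minimal Python sketch. It is not the sumdag implementation: the function names are hypothetical, the standard-error convention $\mathrm{se}^2 = \mathrm{RSS}/\{(N-1)\,\mathbf{X}_q^\top\mathbf{X}_q\}$ is an assumption (other degrees-of-freedom conventions shift the result slightly), and the toy data are invented so that the reconstruction can be checked against individual-level cross-products.

```python
import math
import statistics


def shrink_correlation(r, s):
    """Regularize a correlation matrix toward the identity: (1 - s) R + s I."""
    q = len(r)
    return [[(1 - s) * r[i][j] + (s if i == j else 0.0) for j in range(q)]
            for i in range(q)]


def genotype_variance_from_maf(maf):
    """Variance of an additive 0/1/2 genotype under Hardy-Weinberg equilibrium."""
    return 2.0 * maf * (1.0 - maf)


def estimate_xty(beta, var_x, n):
    """X_q'Y_p from the marginal effect, using beta = X_q'Y_p / X_q'X_q."""
    return n * var_x * beta


def estimate_yty(betas, ses, var_xs, n):
    """Median over SNPs of the per-SNP reconstruction of Y_p'Y_p.

    Assumes se^2 = RSS / ((n - 1) * X_q'X_q) (an assumption of this sketch).
    """
    return statistics.median(
        n * vx * ((n - 1) * se ** 2 + b ** 2)
        for b, se, vx in zip(betas, ses, var_xs)
    )


def estimate_cross_yty(z_cor, yty_a, yty_b):
    """Y_p'Y_p' from the correlation of null-SNP z-scores."""
    return z_cor * math.sqrt(yty_a * yty_b)


# Toy individual-level data (centered), used only to verify the identities.
x = [-1.0, 0.0, 1.0, 0.0]
y = [-2.0, 1.0, 2.0, -1.0]
n = len(x)
xx = sum(a * a for a in x)
xy = sum(a * b for a, b in zip(x, y))
yy = sum(b * b for b in y)
beta = xy / xx
se = math.sqrt((yy - beta ** 2 * xx) / ((n - 1) * xx))
var_x = xx / n  # empirical variance stands in for 2 * MAF * (1 - MAF)

xty_hat = estimate_xty(beta, var_x, n)
yty_hat = estimate_yty([beta], [se], [var_x], n)
```

In this toy case the summary-level reconstruction matches the individual-level cross-products up to rounding; with real GWAS output the match is only approximate, which is why a median over SNPs is taken.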
Next, we extend the framework of interventional Gaussian DAG to leverage large-scale GWAS summary statistics.
2.2. Method for network construction
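The estimation of interventional Gaussian DAG consists of 3 steps:
- (E1) First, we use penalized regressions to estimate the genotype–phenotype association matrix $\mathbf{V} = \mathbf{W}(\mathbf{I} - \mathbf{U})^{-1}$ in the following equation, obtained by solving Equation (1) for $\mathbf{Y}$:
$$\mathbf{Y} = \mathbf{X}\mathbf{V} + \mathbf{E}(\mathbf{I} - \mathbf{U})^{-1}. \tag{2}$$
- (E2) Next, we employ the peeling algorithm (Li et al., 2023) to learn a super-DAG, that is, a directed super-graph without cycles, based on $\hat{\mathbf{V}}$ obtained in Step (E1).
- (E3) Finally, we estimate $\mathbf{U}$ and $\mathbf{W}$ through penalized regressions based on the estimated super-DAG in Step (E2).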
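To make the link between the structural parameters and the genotype–phenotype association matrix concrete, the following Python sketch computes $\mathbf{W}(\mathbf{I}-\mathbf{U})^{-1}$ for a small example. The chain 1 → 2 → 3, the edge weights, and the helper names are all hypothetical. The inverse is obtained from the finite series $\mathbf{I} + \mathbf{U} + \cdots + \mathbf{U}^{P-1}$, which is valid because the adjacency matrix of a DAG is nilpotent.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def inverse_i_minus_u(u):
    """(I - U)^{-1} = I + U + ... + U^{P-1}; holds when U encodes a DAG."""
    p = len(u)
    total = identity(p)
    power = identity(p)
    for _ in range(p - 1):
        power = matmul(power, u)
        total = [[total[i][j] + power[i][j] for j in range(p)]
                 for i in range(p)]
    return total


# Hypothetical chain 1 -> 2 -> 3 (0-based indices 0, 1, 2),
# with one instrument per phenotype.
U = [[0.0, 0.5, 0.0],
     [0.0, 0.0, 2.0],
     [0.0, 0.0, 0.0]]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

V = matmul(W, inverse_i_minus_u(U))
```

In this example the instrument of phenotype 1 has nonzero entries for all three phenotypes, while the instrument of phenotype 3 touches phenotype 3 only; this sparsity pattern is what Step (E2) exploits.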
Now, we elaborate on our extensions to accommodate summary statistics.
2.2.1. Estimation of V by truncated Lasso penalized regressions
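In Equation (2), the matrix $\mathbf{V}$ can be estimated column-wise from
$$\mathbf{Y}_p = \mathbf{X}\mathbf{v}_p + \boldsymbol{\epsilon}_p, \tag{3}$$
where $\mathbf{Y}_p$ is the data vector of phenotype $p$, $\mathbf{X}$ is the data matrix of genotypes, and vector $\mathbf{v}_p$ is the $p$th column of $\mathbf{V}$. Given the summary statistics, we expand the squared error function $\|\mathbf{Y}_p - \mathbf{X}\mathbf{v}_p\|_2^2 = \mathbf{Y}_p^\top\mathbf{Y}_p - 2\mathbf{v}_p^\top\mathbf{X}^\top\mathbf{Y}_p + \mathbf{v}_p^\top\mathbf{X}^\top\mathbf{X}\mathbf{v}_p$, and replace the quantities $\mathbf{Y}_p^\top\mathbf{Y}_p$, $\mathbf{X}^\top\mathbf{Y}_p$, and $\mathbf{X}^\top\mathbf{X}$ in Equation (3) with their estimates $\widehat{\mathbf{Y}_p^\top\mathbf{Y}_p}$, $\widehat{\mathbf{X}^\top\mathbf{Y}_p}$, and $\widehat{\mathbf{X}^\top\mathbf{X}}$, respectively. As a result, we estimate $\mathbf{v}_p$ through regressions with the Truncated Lasso Penalty (TLP) (Shen et al., 2012) to minimize
$$\widehat{\mathbf{Y}_p^\top\mathbf{Y}_p} - 2\mathbf{v}_p^\top\widehat{\mathbf{X}^\top\mathbf{Y}_p} + \mathbf{v}_p^\top\widehat{\mathbf{X}^\top\mathbf{X}}\,\mathbf{v}_p \quad \text{subject to} \quad \sum_{q=1}^{Q} J_{\tau_p}(|v_{qp}|) \le \kappa_p, \tag{4}$$
where $\kappa_p > 0$ is an integer tuning parameter and $J_{\tau_p}(z) = \min(|z|/\tau_p, 1)$ is the TLP function, which does not penalize the parameters over the threshold $\tau_p$. We use the R package “glmtlp” (Li et al., 2022) to fit the summary-level data regression Equation (4).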
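For implementation, we fix $\tau_p$ and choose $\kappa_p \in \{1, \ldots, Q\}$ individually for each of the $P$ penalized regressions by minimizing the pseudo-BIC (Pattee and Pan, 2020), which is defined as $\mathrm{pBIC}(\kappa_p, \tau_p) = \widehat{\mathrm{SSE}}_p/\hat{\sigma}_{\epsilon_p}^2 + \log(N)\,\mathrm{df}_p$, where $\widehat{\mathrm{SSE}}_p$ is the (estimated) sum of squared error of $\hat{\mathbf{v}}_p(\kappa_p, \tau_p)$, $\mathrm{df}_p$ is the number of nonzero coefficients in $\hat{\mathbf{v}}_p(\kappa_p, \tau_p)$, $\hat{\mathbf{v}}_p(\kappa_p, \tau_p)$ is the estimate in Equation (4) with tuning parameters $(\kappa_p, \tau_p)$, $N$ is the sample size (when $N$ differs across SNPs, the median is taken), and $\hat{\sigma}_{\epsilon_p}^2$ is the estimated residual variance for phenotype $p$ in Equation (3). When $Q$ is small compared to $N$ as in our application, a consistent estimate for $\sigma_{\epsilon_p}^2$ can be obtained from the ordinary least squares using all $Q$ genotypes, that is, from the estimated sum of squared error of the estimate in Equation (4) with $\kappa_p = Q$. Letting $\hat{\mathbf{v}}_p$ ($1 \le p \le P$) be the estimates with the optimally chosen tuning parameters, the final estimate of $\mathbf{V}$ is $\hat{\mathbf{V}} = (\hat{\mathbf{v}}_1, \ldots, \hat{\mathbf{v}}_P)$.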
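The summary-level loss and the tuning criterion can be sketched in a few lines of Python. This is an illustration under assumptions rather than the glmtlp routine: the form of the truncated Lasso penalty, $\min(|z|/\tau, 1)$, follows Shen et al. (2012), the pseudo-BIC is taken as SSE divided by the residual variance plus $\log(N)$ times the number of nonzero coefficients, and the function names and numeric inputs are hypothetical.

```python
import math


def tlp(z, tau):
    """Truncated Lasso penalty: min(|z| / tau, 1)."""
    return min(abs(z) / tau, 1.0)


def sse_from_summary(v, xtx, xty, yty):
    """SSE(v) = Y'Y - 2 v'X'Y + v'X'X v, using only cross-product summaries."""
    m = len(v)
    quad = sum(v[i] * xtx[i][j] * v[j] for i in range(m) for j in range(m))
    lin = sum(vi * ci for vi, ci in zip(v, xty))
    return yty - 2.0 * lin + quad


def pseudo_bic(v, xtx, xty, yty, n, sigma2):
    """SSE / sigma^2 + log(n) * (number of nonzero coefficients)."""
    df = sum(1 for vi in v if vi != 0.0)
    return sse_from_summary(v, xtx, xty, yty) / sigma2 + math.log(n) * df


# Hypothetical summaries for one phenotype and two uncorrelated SNPs.
xtx = [[100.0, 0.0], [0.0, 100.0]]
xty = [50.0, 1.0]
yty = 40.0
n = 100
sigma2 = 0.15  # assumed residual variance

v_sparse = [0.5, 0.0]   # keeps only the strong SNP
v_full = [0.5, 0.01]    # also keeps the weak SNP

bic_sparse = pseudo_bic(v_sparse, xtx, xty, yty, n, sigma2)
bic_full = pseudo_bic(v_full, xtx, xty, yty, n, sigma2)
```

Here the weak second SNP lowers the SSE only marginally, so the sparser candidate attains the smaller pseudo-BIC and would be selected.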
2.2.2. Estimation of super-DAG by the peeling algorithm
Given $\hat{\mathbf{V}}$, the peeling algorithm (Li et al., 2023) can be used to construct a super-DAG with phenotype edge set $\mathcal{E}^{+}$ (a superset of $\mathcal{E}$) and interventional edge set $\mathcal{I}^{+}$ (a superset of $\mathcal{I}$). The key idea is that the sparsity pattern of matrix $\mathbf{V}$ characterizes the orientations of the relations among the phenotypes. Specifically, it is demonstrated in Li et al. (2023) that $v_{qp} \neq 0$ implies that genotype $q$ intervenes on phenotype $p$ or an ancestor node of phenotype $p$ in the DAG. Thus, if $v_{qp} \neq 0$ and $v_{qi} = 0$ for $i \neq p$, then phenotype $p$ is a leaf node in the DAG; that is, there is no directed edge from phenotype $p$ to the others. On this basis, we can sequentially identify and remove (ie, peel) the leaf node in the DAG, and construct supersets $\mathcal{E}^{+}$ and $\mathcal{I}^{+}$.
Since the peeling algorithm solely depends on $\hat{\mathbf{V}}$, no modification is needed to extend the existing method to accommodate summary-level data.
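The leaf-identification idea can be sketched as follows. This is a deliberately simplified illustration, not the peeling algorithm of Li et al. (2023): it only recovers the order in which phenotypes are peeled from the sparsity pattern of an association matrix, it does not build the super-DAG edge sets, and it ignores estimation noise. The function name `peel_leaves` and the example matrix are hypothetical.

```python
def peel_leaves(v, tol=0.0):
    """Return phenotype indices in the order they are peeled (leaves first).

    v[q][p] is the (estimated) effect of genotype q on phenotype p.
    A phenotype is declared a leaf when some genotype has a nonzero entry
    for it and for no other phenotype that is still in the graph.
    """
    n_geno, n_pheno = len(v), len(v[0])
    remaining = set(range(n_pheno))
    order = []
    while remaining:
        leaf = None
        for q in range(n_geno):
            hits = [p for p in sorted(remaining) if abs(v[q][p]) > tol]
            if len(hits) == 1:  # genotype q touches one remaining phenotype
                leaf = hits[0]
                break
        if leaf is None:
            raise ValueError("no leaf found: a phenotype lacks a usable instrument")
        order.append(leaf)
        remaining.remove(leaf)
    return order


# Association pattern of a hypothetical chain 0 -> 1 -> 2,
# with genotype q instrumenting phenotype q.
V = [[1.0, 0.5, 1.0],
     [0.0, 1.0, 2.0],
     [0.0, 0.0, 1.0]]

order = peel_leaves(V)
```

Reversing the peeling order gives a topological order of the phenotypes, which restricts the candidate parents of each node.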
2.2.3. Estimation of U and W
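The peeling algorithm yields supersets $\mathcal{E}^{+}$ and $\mathcal{I}^{+}$. To remove the extra edges in $\mathcal{E}^{+}$ and $\mathcal{I}^{+}$, we consider fitting $\mathbf{U}$ and $\mathbf{W}$ within a restricted model defined by $\mathcal{E}^{+}$ and $\mathcal{I}^{+}$.
From Equation (1), for phenotype $p$, we have
$$\mathbf{Y}_p = \mathbf{Y}_{\mathrm{pa}(p)}\mathbf{u}_p + \mathbf{X}_{\mathrm{in}(p)}\mathbf{w}_p + \mathbf{E}_p, \tag{5}$$
where $\mathrm{pa}(p) = \{k: (k, p) \in \mathcal{E}^{+}\}$ and $\mathrm{in}(p) = \{q: (q, p) \in \mathcal{I}^{+}\}$. As in Section 2.2.1, we replace the corresponding quantities with the summary-level data estimates and fit the TLP regression based on Equation (5),
$$\min_{\boldsymbol{\theta}_p}\; \widehat{\mathbf{Y}_p^\top\mathbf{Y}_p} - 2\boldsymbol{\theta}_p^\top\widehat{\mathbf{Z}_p^\top\mathbf{Y}_p} + \boldsymbol{\theta}_p^\top\widehat{\mathbf{Z}_p^\top\mathbf{Z}_p}\,\boldsymbol{\theta}_p \quad \text{subject to} \quad \sum_{k \in \mathrm{pa}(p)} J_{\tau_p}(|u_{kp}|) + \sum_{q \in \mathrm{in}(p)} J_{\tau_p}(|w_{qp}|) \le \kappa_p, \tag{6}$$
where $\boldsymbol{\theta}_p = (\mathbf{u}_p^\top, \mathbf{w}_p^\top)^\top$ is the parameter vector and $\mathbf{Z}_p = (\mathbf{Y}_{\mathrm{pa}(p)}, \mathbf{X}_{\mathrm{in}(p)})$. We fix $\tau_p$, and the tuning parameter $\kappa_p$ is selected by pseudo-BIC as described in Section 2.2.1. The estimated $\hat{\mathbf{u}}_p$ and $\hat{\mathbf{w}}_p$ ($1 \le p \le P$) are aggregated to form the final estimates $\hat{\mathbf{U}}$ and $\hat{\mathbf{W}}$.
Due to penalization, we recommend following the common practice to standardize the variables so that the phenotypes and genotypes are on a comparable scale, which is straightforward to do once $\widehat{\mathbf{Y}^\top\mathbf{Y}}$ and $\widehat{\mathbf{X}^\top\mathbf{X}}$ are obtained. Moreover, if only $\mathbf{U}$ is of interest, penalization of $\mathbf{W}$ is optional.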
2.3. Likelihood-based inference for a DAG
We extend the likelihood ratio inference (Li et al., 2023) to quantify the uncertainty of the network structures. As in Li et al. (2023), we consider 2 types of hypothesis testing.
- Testing of multiple directed relations. The null hypothesis $H_0$: $u_{kj} = 0$ for each $(k, j) \in \mathcal{F}$ and alternative hypothesis $H_a$: $u_{kj} \neq 0$ for some $(k, j) \in \mathcal{F}$, where $\mathcal{F}$ is a prespecified set of hypothesized edges. Rejecting $H_0$ indicates evidence for the presence of some hypothesized relationships in the network.
- Testing of a directed pathway. The null hypothesis $H_0$: $u_{kj} = 0$ for some $(k, j) \in \mathcal{F}$ and alternative hypothesis $H_a$: $u_{kj} \neq 0$ for each $(k, j) \in \mathcal{F}$. Rejecting $H_0$ indicates evidence for the presence of the entire directed pathway in the phenotype network.
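The procedure for testing multiple directed relations comprises 5 steps.
- (T1) Estimate $\hat{\mathbf{V}}$ and use the peeling algorithm to obtain $\mathcal{E}^{+}$ and $\mathcal{I}^{+}$, as in Section 2.2.
- (T2) Identify the set of non-degenerate edges $\mathcal{D} \subseteq \mathcal{F}$ (Li et al., 2023), which contains $\mathcal{D}_p$, the non-degenerate edges pointed to phenotype $p$.
- (T3) Estimate the parameters $\mathbf{U}$ and $\mathbf{W}$ under $H_0$ and $H_a$, respectively. Specifically, denote by $\hat{\mathbf{U}}^{(0)}$ and $\hat{\mathbf{W}}^{(0)}$ the estimates under $H_0$. Then $\hat{\mathbf{U}}^{(0)}$ and $\hat{\mathbf{W}}^{(0)}$ are computed as in the regression (6) with an additional constraint that $u_{kj} = 0$ for $(k, j) \in \mathcal{D}$. Let $\hat{\mathbf{U}}^{(a)}$ and $\hat{\mathbf{W}}^{(a)}$ be the estimates under $H_a$. Then $\hat{\mathbf{U}}^{(a)}$ and $\hat{\mathbf{W}}^{(a)}$ are computed from the restricted models ($1 \le p \le P$) in which the edges in $\mathcal{D}_p$ are added to the candidate parents of phenotype $p$, via regression (6), where the penalties leave the coefficients of the edges in $\mathcal{D}_p$ unpenalized.
- (T4) Compute $\hat{\sigma}_p^2$ ($1 \le p \le P$) from the residual sum of squares of the model fitted under $H_a$.
- (T5) Compute the test statistic $T = 2\{L(\hat{\mathbf{U}}^{(a)}, \hat{\mathbf{W}}^{(a)}, \hat{\boldsymbol{\Sigma}}) - L(\hat{\mathbf{U}}^{(0)}, \hat{\mathbf{W}}^{(0)}, \hat{\boldsymbol{\Sigma}})\}$, where $L$ is the log-likelihood of the model [Equation (1)]. By Li et al. (2023), $T$ is approximately chi-squared distributed with degrees of freedom $|\mathcal{D}|$ when the size $|\mathcal{D}|$ is <50; $(T - |\mathcal{D}|)/\sqrt{2|\mathcal{D}|}$ is approximately standard normal when $|\mathcal{D}|$ >50. Thus, the P-value is calculated as $\mathrm{P}(\chi^2_{|\mathcal{D}|} \ge T)$, when $|\mathcal{D}| < 50$, and $\mathrm{P}(Z \ge (T - |\mathcal{D}|)/\sqrt{2|\mathcal{D}|})$ with $Z \sim N(0, 1)$, when $|\mathcal{D}| > 50$.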
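The final step, turning a likelihood ratio statistic into a P-value, can be sketched in Python using only the standard library. The sketch assumes the chi-squared reference distribution for fewer than 50 tested edges and the centered and scaled normal approximation otherwise; placing exactly 50 edges on the normal side is a choice made here. The chi-squared tail uses the closed-form series for integer degrees of freedom, and the function names are hypothetical.

```python
import math


def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2.0) / i
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: 2*Q(sqrt(x)) + 2*phi(sqrt(x)) * sum_r x^(r-1/2) / (1*3*...*(2r-1))
    root = math.sqrt(x)
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return 2.0 * normal_sf(root) + 2.0 * phi * total


def lrt_p_value(t_stat, n_edges, cutoff=50):
    """P-value of the likelihood ratio statistic for n_edges tested edges."""
    if n_edges < cutoff:
        return chi2_sf(t_stat, n_edges)
    return normal_sf((t_stat - n_edges) / math.sqrt(2.0 * n_edges))


p_single_edge = lrt_p_value(3.841458820694124, 1)   # 95% quantile of chi2(1)
p_many_edges = lrt_p_value(60.0, 60)                # statistic equal to its mean
```

For a single tested edge the statistic 3.84 sits at the conventional 5% boundary, while for 60 edges a statistic equal to its degrees of freedom gives a P-value of one half under the normal approximation.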
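The procedure for testing a directed pathway is similar, with minor modifications.
- Estimate $\hat{\mathbf{V}}$, $\mathcal{E}^{+}$, and $\mathcal{I}^{+}$, as in Step (T1).
- Decompose $H_0$ into sub-hypotheses, one for each non-degenerate edge $(k, j) \in \mathcal{D}$. For each $(k, j) \in \mathcal{D}$, implement Steps (T2)–(T5) above to obtain the corresponding P-value $\mathrm{PV}(k, j)$. The final P-value is computed as the maximum of the P-values for the sub-hypotheses, $\mathrm{PV} = \max_{(k, j) \in \mathcal{D}} \mathrm{PV}(k, j)$.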
Of note, testing a directed pathway concerns a composite (null) hypothesis. Fixing $0 < \alpha < 1$, we have $\limsup_{N \to \infty} \sup_{H_0} \mathrm{P}(\mathrm{PV} < \alpha) = \alpha$ (Li et al., 2023). In other words, the test asymptotically achieves exactly the $\alpha$ significance level for the composite null hypothesis.
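The combination rule for a pathway can be sketched in a few lines of Python: because the null hypothesis holds as soon as one edge is absent, the pathway is declared present only if every edge-level test rejects, which is equivalent to comparing the largest edge-level P-value with the significance level. The function name and the P-values below are hypothetical.

```python
def pathway_p_value(edge_p_values):
    """P-value for H0: at least one edge on the pathway is absent.

    edge_p_values maps each tested edge (k, j) to its single-edge P-value.
    """
    if not edge_p_values:
        raise ValueError("the pathway must contain at least one edge")
    return max(edge_p_values.values())


# Hypothetical single-edge P-values for a two-edge pathway 1 -> 6 -> 12.
edge_p = {(1, 6): 1e-6, (6, 12): 0.003}
p_pathway = pathway_p_value(edge_p)
```

The weakest edge determines the result: the pathway P-value here equals that of the edge from 6 to 12.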
3. INFERRING CARDIOVASCULAR-RELATED PROTEIN–PROTEIN INTERACTION NETWORK
The role of cardiovascular diseases has been recognized as an important etiologic hallmark of AD (de Bruijn and Ikram, 2014). There are different hypotheses on the various mechanisms underlying the association between AD and cardiovascular diseases (Tini et al., 2020). In this real data application, we constructed a directed PPI network of some cardiovascular-related proteins based on a GWAS of 83 plasma protein biomarkers. We further connected the PPI network to AD through MR analyses.
3.1. GWAS summary statistics for cardiovascular-related proteins
We used the GWAS summary statistics on 83 cardiovascular-related proteins from Folkersen et al. (2017), which came from Wald tests for the association between each SNP and the standardized residuals among 3394 European individuals. Five proteins were excluded from the analysis as their corresponding protein-encoding genes are located on the sex chromosomes. The summary statistics were first processed to remove (a) indels; (b) SNPs located within 1 base pair of an indel; (c) SNPs with imputation quality score INFO ≤ 0.8; and (d) SNPs with MAF ≤ 0.05. We then used the following steps to select putative IVs for the proteins:
SNPs were clumped at an r2 value of 0.01 using 3000 uncorrelated individuals (individuals with kinship coefficients <0.084) from UK Biobank of European ancestry as the reference panel, such that SNPs were independent of each other for each protein (Bycroft et al., 2018).
Only the SNPs in the clumped data files located within ±1 MB of each protein-encoding gene were considered. In general, cis-regulatory changes will be less pleiotropic (Signor and Nuzhdin, 2018), and thus, these SNPs located close to the genes are more likely to be valid IVs due to the exclusion assumption (Swerdlow et al., 2016; Hemani et al., 2018; Li et al., 2023) (ie, an IV only directly intervenes on 1 primary variable).
To ensure the relevance assumption (Li et al., 2023) was satisfied (ie, IV intervenes on at least 1 primary variable), we only selected SNPs whose P-values were below the GWAS significance threshold (5 × 10−8). This filtering process led to a total number of 33 SNPs and 23 proteins, with at least 1 putative IV in the final network analysis.
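The selection steps above can be illustrated with a small Python filter. It is a sketch under assumptions: the record fields (`id`, `pos`, `p`, `info`, `maf`, `is_indel`) are hypothetical, the input is assumed to be already clumped (clumping itself was done with PLINK and is not reproduced here), and the removal of SNPs within 1 base pair of an indel is omitted.

```python
def select_instruments(snps, gene_start, gene_end, window=1_000_000,
                       p_threshold=5e-8, min_info=0.8, min_maf=0.05):
    """Keep SNPs that pass QC, lie in the cis window, and are genome-wide significant."""
    kept = []
    for snp in snps:
        # Quality control: drop indels, poorly imputed SNPs, and rare SNPs.
        if snp["is_indel"] or snp["info"] <= min_info or snp["maf"] <= min_maf:
            continue
        # Cis window: within +/- `window` base pairs of the protein-encoding gene.
        if not (gene_start - window <= snp["pos"] <= gene_end + window):
            continue
        # Relevance: genome-wide significant association with the protein.
        if snp["p"] < p_threshold:
            kept.append(snp["id"])
    return kept


# Hypothetical clumped SNPs for one protein whose gene spans 2.00-2.05 Mb.
snps = [
    {"id": "rs1", "pos": 1_500_000, "p": 1e-10, "info": 0.95, "maf": 0.20, "is_indel": False},
    {"id": "rs2", "pos": 1_600_000, "p": 1e-3, "info": 0.95, "maf": 0.20, "is_indel": False},
    {"id": "rs3", "pos": 9_000_000, "p": 1e-12, "info": 0.95, "maf": 0.20, "is_indel": False},
    {"id": "rs4", "pos": 1_550_000, "p": 1e-9, "info": 0.70, "maf": 0.20, "is_indel": False},
]

instruments = select_instruments(snps, gene_start=2_000_000, gene_end=2_050_000)
```

Only the first SNP survives: the second is not genome-wide significant, the third lies outside the cis window, and the fourth fails the imputation quality filter.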
The genetic correlation matrix for the included IVs was estimated based on the same reference panel used in clumping. We calculated the empirical correlation of each pair of proteins as the correlation coefficient of the z-scores of the null SNPs, that is, all autosomal SNPs with MAF ≥ 0.05, INFO ≥ 0.8, and GWAS P-values ≥ 0.05 for both proteins. The number of null SNPs for each pair of proteins ranged from 1 191 204 to 1 223 357. All preparation of the reference panel and GWAS data for both the DAG estimation and MR analysis was done using PLINK version 1.9 (Purcell et al., 2007).
3.2. GWAS summary statistics for AD
We explored the relationship between each of the 23 proteins in Folkersen et al. (2017) and AD. We used the summary statistics of the GWAS for AD from a recent study totaling 111 326 clinically diagnosed or “proxy” AD cases and 677 663 controls (Bellenguez et al., 2022). We removed SNPs with MAF < 0.05 and SNPs not included in the GWAS of Folkersen et al. (2017), and clumped SNPs at r2 = 0.01. Among the remaining SNPs, we selected only those with a GWAS P-value <5 × 10−8 as IVs in the MR analyses.
3.3. Results
We constructed a DAG of the 23 proteins as described in Section 2.2. As Folkersen et al. (2017) shared MAF in the summary statistics, we compared them with those in UK Biobank, the reference panel for clumping and estimating the genetic correlation matrix. The absolute difference of the MAF of all IVs ranged from 0.001 to 0.055 with a mean of 0.02, while the correlation was 0.99 (Supplementary Table S1). We further performed MR analysis on each protein to evaluate their relationship with AD using the TwoSampleMR package (Hemani et al., 2018). We used Egger’s test of intercept for examining the exclusion assumption: if a protein had a P-value of the Egger’s test of intercept >0.05/23, there was no evidence of direct/pleiotropic effects, and we used the more powerful MR-IVW method; otherwise, we used MR-Egger (to allow pleiotropic effects of IVs). In any case, we used the P-value cut-off <0.05/23 to declare statistical significance. Supplementary Table S2 contains a complete list of MR results. The protein IL18, which showed marginal significance in both Egger’s test of intercept (P-value = 0.06) and MR-Egger (P-value = 0.07), was a parent node for several proteins related to AD, including ADM, IL1RL1, CTSD, CXCL6, and CXCL16. We further performed the likelihood ratio test on each edge. Edges with P-value <0.05/(23 × 22 − 56) were considered significant and are drawn as solid lines in Figure 1. The number of tests in the Bonferroni correction is bounded by the number of possible edges among all the nodes minus the total number of edges after the peeling algorithm, which is justified in Supplementary Material S2.1. Each edge from IL18 to the 5 AD-associated proteins was highly significant in the likelihood ratio test, thus suggesting that simultaneous testing of the pathway from IL18 to the 5 proteins would be significant.
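The two significance thresholds and the rule for choosing the MR estimator can be written out as a short Python sketch. The constant and function names are hypothetical; the numbers 23 and 56 are taken from the text above.

```python
N_PROTEINS = 23
N_EDGES_AFTER_PEELING = 56  # from the threshold 0.05/(23 x 22 - 56) in the text

# Bonferroni threshold for the MR analyses (one test per protein).
mr_alpha = 0.05 / N_PROTEINS

# Bonferroni threshold for the edge-wise likelihood ratio tests.
n_edge_tests = N_PROTEINS * (N_PROTEINS - 1) - N_EDGES_AFTER_PEELING
edge_alpha = 0.05 / n_edge_tests


def choose_mr_method(egger_intercept_p, alpha=mr_alpha):
    """Use MR-IVW unless Egger's intercept test signals pleiotropy."""
    return "MR-IVW" if egger_intercept_p > alpha else "MR-Egger"
```

With these inputs the edge-wise correction covers 450 tests, so an edge is declared significant when its P-value is below roughly 1.1 × 10−4, and the MR cut-off is roughly 2.2 × 10−3.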
Previous studies detected increased levels of pro-inflammatory IL18 in both cardiovascular diseases and in brain regions of AD patients (Sutinen et al., 2014). IL18 is known to increase the level of Cdk5 and GSK-3β, which are involved in Tau hyperphosphorylation, and the inhibition of Cdk5 was known to improve AD subjects’ conditions (Calabrò et al., 2021). Our work suggests a possible regulatory role of IL18 on multiple AD-associated proteins. According to OpenTargets.org (Ochoa et al., 2021) for current pharmaceuticals either approved or in development with IL18, this protein is currently a target of an antibody drug to treat diabetes mellitus and a few other conditions; diabetes has long been linked to AD with epidemiological and biological evidence (Barbagallo and Dominguez, 2014). Lastly, we provide a Shiny application that allows users to test any selected proteins in this cardiovascular-related PPI network.
FIGURE 1.
Estimated DAG for 23 proteins based on the GWAS summary statistics of Folkersen et al. (2017). Proteins significantly associated with AD in MR analysis are colored gold. A solid line represents an edge that is statistically significant by the likelihood ratio test whereas a dashed line represents an edge that is not significant.
4. SIMULATION STUDIES
4.1. Simulation settings
We simulated the data assuming a fixed $\mathbf{U}$, $\mathbf{W}$, standardized genotype matrix $\mathbf{X}$, and sampled each row of $\mathbf{E}$ independently from $N(\mathbf{0}, \boldsymbol{\Sigma})$. Then we generated $\mathbf{Y}$ from equation: $\mathbf{Y} = (\mathbf{X}\mathbf{W} + \mathbf{E})(\mathbf{I} - \mathbf{U})^{-1}$. Without loss of generality, no intercept was modeled; that is, $\mathbf{Y}$ was centered at mean 0. The values of $\mathbf{U}$, $\mathbf{W}$, and $\boldsymbol{\Sigma}$ are provided in Supplementary Material S1.1, where the structure of the relationship of $\mathbf{Y}$ followed a DAG of 15 nodes/phenotypes (Figure 2). The effect sizes of the non-zero components of $\mathbf{U}$ ranged from 0.002 to 1.16, with a median of 0.06. All phenotypes had at least 1 valid IV. Twenty-six SNPs were included in the model, with their effect sizes ranging from −2.2 to 2.5 with a median of −0.11. Two SNPs violated the relevance assumption, while the rest were valid IVs. We also varied the effect sizes to be 1/3 and 1/15 of $\mathbf{U}$ while keeping $\mathbf{W}$ fixed.
FIGURE 2.
True DAG for the simulation study with 15 phenotypes.
The standardized genotype matrix $\mathbf{X}$ was obtained from unrelated individuals of European ancestry in UK Biobank (Bycroft et al., 2018). We then calculated the summary statistics using a linear model of each phenotype on each standardized genotype, and inputted the summary statistics into the proposed algorithm. The reference panel was obtained from the UK Biobank European samples, which did not overlap with the simulated samples used to derive the summary statistics. SNPs on chromosome 22 with a Hardy-Weinberg equilibrium test P-value > 0.0001, missing call rate <0.05, and MAF >0.05 were pruned to have r2 < 0.01. We then randomly selected 26 SNPs for $\mathbf{X}$. Missing values of SNPs were imputed by their mean. Null SNPs were directly simulated to be independent of each other and have no relationship with $\mathbf{Y}$.
We evaluated the performance of both the network construction and statistical inference for the proposed method. To evaluate the performance of network construction, we examined the false-positive (FP) and false-negative (FN) rates for $\hat{\mathbf{U}}$ over 200 replications. The sample size of the summary statistics was varied at 3000, 6000, 9000, and 12 000, and the sample size of the reference panel was fixed at 3000. Null SNPs were simulated to have the same sample size as the GWAS summary statistics.
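One way to compute the two error rates for an estimated edge set is sketched below in Python. The definitions are assumptions of this sketch, since the text does not spell them out: the FP rate is taken as the number of estimated edges absent from the truth divided by the number of true non-edges among ordered pairs of distinct nodes, and the FN rate as the number of missed true edges divided by the number of true edges. The function name and the toy graphs are hypothetical.

```python
def edge_error_rates(true_edges, est_edges, n_nodes):
    """Return (FP rate, FN rate) of an estimated directed edge set."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    n_possible = n_nodes * (n_nodes - 1)       # ordered pairs of distinct nodes
    false_pos = len(est_edges - true_edges)    # estimated but not true
    false_neg = len(true_edges - est_edges)    # true but not estimated
    fp_rate = false_pos / (n_possible - len(true_edges))
    fn_rate = false_neg / len(true_edges)
    return fp_rate, fn_rate


# Hypothetical 4-node example: one spurious edge and one missed edge.
truth = {(1, 2), (2, 3), (3, 4)}
estimate = {(1, 2), (2, 3), (1, 4)}
fp_rate, fn_rate = edge_error_rates(truth, estimate, n_nodes=4)
```

Averaging these two rates over replications gives summary curves of the kind reported for the simulation.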
In terms of testing, we examined the empirical Type I error/power of the likelihood ratio tests with increasing sample sizes and varying strength of $\mathbf{U}$ for the following 5 scenarios over 1000 replications for the 2 types of testing.
I. Testing 1 or more directed relations:
A1. Type I error: testing 1 edge when in truth it was null with $H_0: u_{1,14} = 0$ versus $H_1: u_{1,14} \neq 0$.
A2. Power: testing 1 edge when in truth it was not null with $H_0: u_{1,6} = 0$ versus $H_1: u_{1,6} \neq 0$.
A3. Power: testing 2 edges together when in truth both were not null with $H_0: u_{7,15} = u_{1,6} = 0$ versus $H_1: u_{7,15} \neq 0$ or $u_{1,6} \neq 0$.
II. Testing of a directed pathway:
B1. Type I error: testing 2 edges jointly as a pathway when in truth only 1 was not null with $H_0: u_{1,14} = 0$ or $u_{6,12} = 0$ versus $H_1: u_{1,14} \neq 0$ and $u_{6,12} \neq 0$.
B2. Power: testing 2 edges jointly as a pathway when in truth both were not null with $H_0: u_{1,6} = 0$ or $u_{6,12} = 0$ versus $H_1: u_{1,6} \neq 0$ and $u_{6,12} \neq 0$.
The true strengths of the tested edges were: $u_{1,14} = 0$, $u_{1,6} = 0.27$, $u_{7,15} = 0.44$, and $u_{6,12} = 1.36$.
4.2. Simulation results
With the increase of sample size, the constructed networks became closer to the true graph (Figure 3; numeric values in Figure 3a are in Supplementary Tables S3–S6). More specifically, FP of $\hat{\mathbf{U}}$ was ∼0.05, and FN of $\hat{\mathbf{U}}$ decreased with the increase of sample size when 15 000 null SNPs were used to estimate the $15 \times 15$ matrix of $\mathbf{Y}^\top\mathbf{Y}$. Unsurprisingly, the performance of using $\mathbf{Y}^\top\mathbf{Y}$ estimated by 15 000 null SNPs was slightly worse than that of using the true $\mathbf{Y}^\top\mathbf{Y}$ (the two are shown separately in Figure 3). In addition, the FP of $\hat{\mathbf{U}}$ clearly decreased with the increase of effect sizes (from $\mathbf{U}/15$ to $\mathbf{U}/3$ to $\mathbf{U}$). To check the validity of our method, we compared pBIC estimated from summary statistics with BIC estimated from the individual-level data for the same set of penalized regression coefficients. We found the 2 sets of values were highly concordant, and the results from 1 iteration were plotted in Supplementary Figure S2.
FIGURE 3.
Performance of network construction (a) and likelihood ratio tests (b–d) in simulation with varying sample sizes and true effect sizes at $\mathbf{U}/15$, $\mathbf{U}/3$, and $\mathbf{U}$. Figures b, c, and d represent scenarios A2, A3, and B2, respectively.
In terms of testing, we observed well-controlled Type I error rates for Scenarios A1 and B1 (numeric values presented in Figure 3, b–d are in Supplementary Tables S7–S10), using the true $\mathbf{Y}^\top\mathbf{Y}$. We note that the empirical Type I error rates might become conservative when $\mathbf{Y}^\top\mathbf{Y}$ is replaced by its estimate $\widehat{\mathbf{Y}^\top\mathbf{Y}}$ derived from 15 000 null SNPs (Supplementary Table S10). In real data analysis of GWAS summary level data, typically a much larger number of null SNPs and various other strategies can be used to better estimate $\mathbf{Y}^\top\mathbf{Y}$ (Bulik-Sullivan et al., 2015; Kim et al., 2015; Li et al., 2021); however, the investigation in this direction is out of our scope. Furthermore, empirical power was high for scenarios A2, A3, and B2 and increased with sample size and effect size. The empirical power of jointly testing 2 edges $u_{1,6}$ and $u_{7,15}$ (Scenario A3) was larger than that of testing one edge $u_{1,6}$ alone (Scenario A2).
5. DISCUSSION
In this paper, we present a method to estimate an interventional DAG of phenotypes utilizing linear structural equation models, applicable to GWAS summary statistics in the absence of individual-level data. We demonstrated satisfactory performance in terms of the FP and FN rates in network construction, as well as high empirical power and well-controlled Type I error rates of the likelihood ratio tests. We applied this method to a large-scale proteomic GWAS summary dataset to obtain an estimated DAG of 23 cardiovascular-related proteins and further illustrated the effects of these proteins on AD by MR analysis. These results can be useful in understanding the disease etiology, drug repurposing, and other applications for AD.
We note that the choice of a proper reference panel is just as important for our method as for many other summary-statistics-based methods (Deng and Pan, 2018; Chen et al., 2021; Privé et al., 2022). When constructing the cardiovascular-related PPI network, we used an ancestry-matched reference panel of uncorrelated individuals from UK Biobank with a sample size of 3000, which is close to the sample size of the GWAS of cardiovascular proteins. Furthermore, we clumped SNPs around the cis-region of the gene and only used the genome-wide significant SNPs as IVs. This step aimed not only at selecting at least 1 valid IV for each protein, but also at achieving better estimation of the genetic correlation matrix for the IVs, as the number of total IVs became much smaller than the sample size of the reference panel, that is, Q ≪ Nr. Our analysis was constrained to a single ancestry group and unrelated samples. As the collection of multi-ancestry and related samples increases, it will be of significant research interest to establish networks among these populations, a challenge we anticipate addressing in our future work.
In recent years, a large amount of summary-level data has become widely accessible. On the molecular level, many studies published their summary statistics for SNP-molecular phenotype associations. For example, variant-gene associations in 49 tissues can be directly downloaded from GTExPortal (Consortium, 2020). Beyond the molecular phenotypes, UK Biobank alone provides GWAS summary statistics on >7000 traits, including but not limited to cognitive functions, early life factors, health and medical history, and physical measurement (Bycroft et al., 2018). Our proposed method provides a computational and analytical tool to explore the relationships among multiple phenotype variables by taking advantage of rapid advances in GWAS and other association mappings.
Supplementary Material
Web Appendices, Supplementary Tables S1–S10, Figure S1 referenced in Sections 3.3 and 4.2, and data and code (sumDAG algorithm, Shiny app, and the real data analysis code) are available with this paper at the Biometrics website on Oxford Academic. Data and code can also be found on https://github.com/chunlinli/sumdag.
Acknowledgement
Rachel Zilinskas and Chunlin Li have contributed equally to this work. The authors would like to thank the associate editor and the reviewer for their valuable comments. Tianzhong Yang would like to further acknowledge the Children’s Cancer Research Funds and the St. Baldrick’s Foundation Scholar Award.
Contributor Information
Rachel Zilinskas, Statistics and Data Corporation, Tempe, AZ 85288, United States.
Chunlin Li, Department of Statistics, Iowa State University, Ames, IA 50011, United States.
Xiaotong Shen, School of Statistics, University of Minnesota, Minneapolis, MN 55455, United States.
Wei Pan, Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN 55455, United States.
Tianzhong Yang, Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN 55455, United States.
FUNDING
This research was supported by NIH grants U01 AG073079, R01 AG065636, R01 AG069895, RF1 AG067924, R01 HL116720, and R01 GM126002 and by the Minnesota Supercomputing Institute at the University of Minnesota.
CONFLICT OF INTEREST
None declared.
DATA AVAILABILITY
We downloaded the summary-level genome-wide association study data in Section 3.1 from https://zenodo.org/record/264128/. The algorithm for the proposed work is packaged in R, available at https://github.com/chunlinli/sumdag and the Biometrics website, along with code used for the simulation studies and real data application. The processed summary-level genome-wide association study data that were used as input for the algorithm for the real data application are also included on GitHub.
References
- Ashburner M., Ball C. A., Blake J. A., Botstein D., Butler H., Cherry J. M. et al. (2000) The gene ontology consortium gene ontology: tool for the unification of biology. Nature Genetics, 25, 25–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barbagallo M., Dominguez L. J. (2014) Type 2 diabetes mellitus and Alzheimer’s disease. World Journal of Diabetes, 5, 889–893.
- Bellenguez C., Küçükali F., Jansen I. E., Kleineidam L., Moreno-Grau S., Amin N. et al. (2022) New insights into the genetic etiology of Alzheimer’s disease and related dementias. Nature Genetics, 54, 412–436.
- Bulik-Sullivan B., Finucane H. K., Anttila V., Gusev A., Day F. R., Loh P.-R. et al. (2015) An atlas of genetic correlations across human diseases and traits. Nature Genetics, 47, 1236–1241.
- Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K. et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
- Calabrò M., Rinaldi C., Santoro G., Crisafulli C. (2021) The biological pathways of Alzheimer disease: a review. AIMS Neuroscience, 8, 86–132.
- Chen C., Ren M., Zhang M., Zhang D. (2018) A two-stage penalized least squares method for constructing large systems of structural equations. Journal of Machine Learning Research, 19, 1–34.
- Chen W., Wu Y., Zheng Z., Qi T., Visscher P. M., Zhu Z. et al. (2021) Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors. Nature Communications, 12, 7117.
- Cheng F., Zhao J., Wang Y., Lu W., Liu Z., Zhou Y. et al. (2021) Comprehensive characterization of protein–protein interactions perturbed by disease mutations. Nature Genetics, 53, 342–353.
- Consortium G. (2020) The GTEx consortium atlas of genetic regulatory effects across human tissues. Science, 369, 1318–1330.
- de Bruijn R. F., Ikram M. A. (2014) Cardiovascular risk factors and future risk of Alzheimer’s disease. BMC Medicine, 12, 1–9.
- Deng Y., Pan W. (2018) Improved use of small reference panels for conditional and joint analysis with GWAS summary statistics. Genetics, 209, 401–408.
- Emilsson V., Ilkov M., Lamb J. R., Finkel N., Gudmundsson E. F., Pitts R. et al. (2018) Co-regulatory networks of human serum proteins link genetics to disease. Science, 361, 769–773.
- Folkersen L., Fauman E., Sabater-Lleal M., Strawbridge R. J., Frånberg M., Sennblad B. et al. (2017) Mapping of 79 loci for 83 plasma protein biomarkers in cardiovascular disease. PLOS Genetics, 13, e1006706.
- Friedman N., Linial M., Nachman I., Pe’er D. (2000) Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601–620.
- Hemani G., Bowden J., Smith G. D. (2018) Evaluating the potential role of pleiotropy in Mendelian randomization studies. Human Molecular Genetics, 27, 195–208.
- Hemani G., Zheng J., Elsworth B., Wade K. H., Haberland V., Baird D. et al. (2018) The MR-Base platform supports systematic causal inference across the human phenome. eLife, 7, e34408.
- Id J. Q., Chen K., Zhong C., Zhu S., Id X. M. (2021) Network-based protein-protein interaction prediction method maps perturbations of cancer interactome. PLOS Genetics, 17, e1009869.
- International HapMap Consortium. (2005) A haplotype map of the human genome. Nature, 437, 1299–1320.
- Kim J., Bai Y., Pan W. (2015) An adaptive association test for multiple phenotypes with GWAS summary statistics. Genetic Epidemiology, 39, 651–663.
- Li C., Shen X., Pan W. (2023) Inference for a large directed acyclic graph with unspecified interventions. Journal of Machine Learning Research, 24, 1–48.
- Li C., Yang Y., Wu C. (2022) Package “glmtlp”. https://cran.r-project.org/web/packages/glmtlp/glmtlp.pdf.
- Li T., Ning Z., Shen X. (2021) Improved estimation of phenotypic correlations using summary association statistics. Frontiers in Genetics, 12, 665252.
- Liu F., Zhang S.-W., Guo W.-F., Wei Z.-G., Chen L. (2016) Inference of gene regulatory network based on local Bayesian networks. PLoS Computational Biology, 12, e1005024.
- Mak T. S. H., Porsch R. M., Choi S. W., Zhou X., Sham P. C. (2017) Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41, 469–480.
- Napoli C., Benincasa G., Donatelli F., Ambrosio G. (2020) Precision medicine in distinct heart failure phenotypes: Focus on clinical epigenetics. American Heart Journal, 224, 113–128.
- Ochoa D., Hercules A., Carmona M., Suveges D., Gonzalez-Uriarte A., Malangone C. et al. (2021) Open Targets Platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research, 49, D1302–D1310.
- Pattee J., Pan W. (2020) Penalized regression and model selection methods for polygenic scores on summary statistics. PLOS Computational Biology, 16, e1008271.
- Privé F., Arbel J., Aschard H., Vilhjálmsson B. J. (2022) Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. Human Genetics and Genomics Advances, 3, 100136.
- Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A., Bender D. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81, 559–575.
- Ross C. A., Poirier M. A. (2004) Protein aggregation and neurodegenerative disease. Nature Medicine, 10, S10–S17.
- Shen X., Pan W., Zhu Y. (2012) Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107, 223–232.
- Signor S. A., Nuzhdin S. V. (2018) The evolution of gene expression in cis and trans. Trends in Genetics, 34, 532–544.
- Snider J., Kotlyar M., Saraon P., Yao Z., Jurisica I., Stagljar I. (2015) Fundamentals of protein interaction network mapping. Molecular Systems Biology, 11, 848.
- Sutinen E. M., Korolainen M. A., Häyrinen J., Alafuzoff I., Petratos S., Salminen A. et al. (2014) Interleukin-18 alters protein expressions of neurodegenerative diseases-linked proteins in human SH-SY5Y neuron-like cells. Frontiers in Cellular Neuroscience, 8, 214.
- Swerdlow D., Kuchenbaecker K., Shah S., Sofat R., Holmes M., White J. et al. (2016) Selecting instruments for Mendelian randomization in the wake of genome-wide association studies. International Journal of Epidemiology, 45, 1600–1616.
- Taliun D., Harris D. N., Kessler M. D., Carlson J., Szpiech Z. A., Torres R. et al. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature, 590, 290–299.
- The 1000 Genomes Project Consortium. (2015) A global reference for human genetic variation. Nature, 526, 68–74.
- Tini G., Scagliola R., Monacelli F., La Malfa G., Porto I., Brunelli C. et al. (2020) Alzheimer’s disease and cardiovascular disease: a particular association. Cardiology Research and Practice, 2020, 2617970.
- Witten D. M., Friedman J. H., Simon N. (2012) New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20, 892–900.
- Zhang B., Horvath S. (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4, 1128.
- Zhang P., Itan Y. (2019) Biological network approaches and applications in rare disease studies. Genes, 10, 797.
SUPPLEMENTARY MATERIALS
Web Appendices, Supplementary Tables S1–S10, and Figure S1, referenced in Sections 3.3 and 4.2, together with data and code (sumDAG algorithm, Shiny app, and the real data analysis code), are available with this paper at the Biometrics website on Oxford Academic. Data and code can also be found at https://github.com/chunlinli/sumdag.