Biometrics. 2024 Mar 12;80(1):ujad039. doi: 10.1093/biomtc/ujad039

Inferring a directed acyclic graph of phenotypes from GWAS summary statistics

Rachel Zilinskas 1, Chunlin Li 2, Xiaotong Shen 3, Wei Pan 4, Tianzhong Yang 5
PMCID: PMC10928990  PMID: 38470257

ABSTRACT

Estimating phenotype networks is a growing field in computational biology. It deepens the understanding of disease etiology and is useful in many applications. In this study, we present a method that constructs a phenotype network by assuming a Gaussian linear structural model embedding a directed acyclic graph (DAG). We utilize genetic variants as instrumental variables and show how our method only requires access to summary statistics from a genome-wide association study (GWAS) and a reference panel of genotype data. Besides estimation, a distinct feature of the method is its summary statistics-based likelihood ratio test on directed edges. We applied our method to estimate a causal network of 29 cardiovascular-related proteins and linked the estimated network to Alzheimer’s disease (AD). A simulation study was conducted to demonstrate the effectiveness of this method. An R package sumdag implementing the proposed method, all relevant code, and a Shiny application are available.

Keywords: Alzheimer’s disease (AD), directed acyclic graph (DAG), genome-wide association study (GWAS), likelihood ratio test, proteomics

1. INTRODUCTION

Network analysis has deepened our understanding of biological mechanisms and disease etiologies (Zhang and Itan, 2019). Specifically, protein–protein interaction (PPI) networks that capture the interplay of proteins in the biomolecular systems are vital for normal cell functions (Snider et al., 2015). Disruption of the normal pattern in the PPI network can be causative of or indicative of a disease state. Studies have linked co-regulatory networks of proteins to a variety of complex diseases (Ross and Poirier, 2004; Emilsson et al., 2018). Recently, a network-based method modeling PPI reported high accuracy in cancer prediction (Id et al., 2021). Cheng et al. (2021) further showed that disease-associated variants were significantly enriched in the sequences coding PPI interfaces compared to variants in healthy individuals. Their work also demonstrated associations of PPIs with drug resistance and overall survival, highlighting the use of protein networks for informing genotype-based therapy. Network-based analyses have shown their potential in advancing precision medicine for complex diseases over traditional approaches, which focus on monogenic mutations and independent assessment of risk factors (Napoli et al., 2020).

Network analyses can be categorized into 2 groups. One utilizes only phenotypic data to construct networks. For example, weighted gene co-expression network analysis estimates an undirected network, which is further characterized using dimension reduction techniques (Zhang and Horvath, 2005). The graphical lasso employs penalized methods to estimate a Gaussian graphical model for a large number of variables (Witten et al., 2012). Bayesian network analysis (Friedman et al., 2000) estimates directed acyclic graphs (DAGs), which are widely accepted in biological systems (Ashburner et al., 2000), and recent computational improvements have substantially shortened its running time (Liu et al., 2016). The other group of methods uses instrumental variable (IV) techniques to estimate a DAG, assuming a linear structural equation model. Chen et al. (2018) developed a penalized two-stage least squares method to estimate a DAG, assuming known intervention targets. Li et al. (2023) further extended the work to accommodate unknown intervention targets commonly encountered in biological applications.

All the methods above require individual-level data, which can be difficult to obtain, especially for human studies, because of logistic limitations and privacy concerns. On the other hand, many genome-wide association studies (GWAS) have shared their summary statistics publicly, generating a rich and valuable data resource. Thus, we propose adapting the network estimation and inference methods of Li et al. (2023) to rely only on GWAS summary statistics and a genetic reference panel, both much more easily accessible. We will show how a DAG can be estimated for cardiovascular-related proteins using a large-scale proteomic GWAS summary dataset, and then link the protein network to Alzheimer’s disease (AD). The algorithm for the proposed work is packaged in R. Our work represents one of the initial attempts to utilize GWAS summary statistics in the construction of a DAG. We expect that our work can facilitate more comprehensive network analysis in studying biological and medical relationships. In addition to inferring PPI networks, our method is readily applicable to understanding the interplay of many other molecular and non-molecular phenotypes, as long as the corresponding GWAS summary statistics are available.

2. METHODS

2.1. Network modeling and data

2.1.1. Directed phenotype network

Our goal is to use genotypes as external interventions to construct and infer a DAG that describes the directed relationships among a set of phenotypes. In the framework of interventional Gaussian DAG (Li et al., 2023), we assume

$$\mathbf{Y} = \mathbf{Y}\mathbf{U} + \mathbf{X}\mathbf{W} + \mathbf{E}, \tag{1}$$

where $\mathbf{Y}$ is the N × P data matrix of P phenotypes, $\mathbf{X}$ is the N × Q data matrix of Q genotypes serving as IVs, $\mathbf{E}$ is the N × P error matrix with each row sampled from $N(\mathbf{0}, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_P^2))$, and N is the sample size. Note that Equation (1) lacks an intercept because we assume phenotype and genotype are centered at mean 0, which could be easily done with individual-level GWAS data.

In Equation (1), $\mathbf{U}$ and $\mathbf{W}$ are unknown parameters to be estimated. The P × P matrix $\mathbf{U}$ specifies the network structure such that ukj ≠ 0 indicates a directed relation from phenotype k to phenotype j. The Q × P matrix $\mathbf{W}$ specifies the targets and strengths of interventions in that wqp ≠ 0 indicates an interventional relation from genotype q to phenotype p. Let $\mathcal{E} = \{(k, j): u_{kj} \neq 0\}$ be the set of directed relations, and $\mathcal{I} = \{(q, p): w_{qp} \neq 0\}$ be the set of interventional relations.
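To make the roles of the network matrix and the intervention matrix concrete, the following minimal sketch generates phenotypes from the model in Equation (1). It assumes the phenotypes are indexed in a topological order of the DAG (so the network matrix is strictly upper triangular) and uses the hypothetical function name `simulate_phenotypes`; it is an illustration, not part of the sumdag package.

```python
import random


def simulate_phenotypes(X, U, W, sigma, seed=0):
    """Generate Y row by row from Y = YU + XW + E.

    Assumption (for illustration only): phenotypes are listed in a
    topological order, so U[k][j] can be nonzero only when k < j.
    X is N x Q, U is P x P, W is Q x P, sigma holds P error SDs.
    """
    rng = random.Random(seed)
    n_pheno = len(U)
    n_geno = len(W)
    Y = []
    for x_row in X:
        y_row = [0.0] * n_pheno
        for j in range(n_pheno):
            # Contribution of parent phenotypes k -> j (already generated).
            from_parents = sum(y_row[k] * U[k][j] for k in range(j))
            # Contribution of genotypes q that intervene on phenotype j.
            from_genotypes = sum(x_row[q] * W[q][j] for q in range(n_geno))
            y_row[j] = from_parents + from_genotypes + rng.gauss(0.0, sigma[j])
        Y.append(y_row)
    return Y


# Two phenotypes (edge 1 -> 2 with strength 2), one instrument per phenotype.
X = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.0, 2.0], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
Y = simulate_phenotypes(X, U, W, sigma=[0.0, 0.0])
```

With the error standard deviations set to zero, the genotype that intervenes on phenotype 1 also moves phenotype 2 through the directed edge, while the genotype that intervenes on phenotype 2 leaves phenotype 1 unchanged. This asymmetry is what allows the direction of an edge to be identified.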

2.1.2. Summary statistics and reference panel

In Equation (1), the data matrices $\mathbf{Y}$ and $\mathbf{X}$ contain individual information, such that each row represents the variables measured on an individual. On the other hand, GWAS summary statistics aggregate N observations into a single measure for each single nucleotide polymorphism (SNP) across the whole genome. This measure is the average effect of having 1 copy of the effect allele of the SNP on the phenotype being studied. It is estimated by $\hat{\beta}_{qp}$, often reported along with accompanying statistics, such as the corresponding standard error $\mathrm{SE}(\hat{\beta}_{qp})$, z-score zqp, sample size N, reference allele (REF), minor allele frequency (MAF), and P-value. The summary statistics of the Q SNPs in $\mathbf{X}$ are included in the GWAS summary data.
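The reported quantities are linked in the usual way: the z-score is the effect estimate divided by its standard error, and a two-sided P-value follows from the standard normal distribution. A minimal sketch, assuming a Wald-type test and the hypothetical helper name `z_and_p`:

```python
import math


def z_and_p(beta_hat, se):
    """Return the z-score and two-sided normal P-value for one SNP-phenotype pair."""
    z = beta_hat / se
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


z, p = z_and_p(beta_hat=0.098, se=0.05)
```

Such identities are useful when a summary file reports only a subset of the statistics, for example effect sizes and standard errors without z-scores.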

As a complement to the summary-level data, a reference panel comprising genotypic data of individuals from a general population provides the correlation structure among the genotypes. Many existing resources can be used for such a reference panel (International HapMap Consortium, 2005; The 1000 Genomes Project Consortium, 2015; Bycroft et al., 2018; Taliun et al., 2021). Given an Nr × Q (centered) reference panel Inline graphic of Nr individuals, we follow the conventional suggestion (Mak et al., 2017) to regularize the genetic correlation matrix Inline graphic, such that Inline graphic, where 0 ≤ sp ≤ 1 is a real number controlling the degree of regularization.
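The exact regularization formula is not legible in this copy of the text. The sketch below therefore assumes the common shrinkage-toward-identity form, in which the regularized matrix is (1 − s) times the reference-panel correlation matrix plus s times the identity. The function name `shrink_correlation` is hypothetical.

```python
def shrink_correlation(R, s):
    """Shrink a correlation matrix toward the identity.

    Assumed form (not confirmed by the text): (1 - s) * R + s * I,
    with 0 <= s <= 1. s = 0 keeps R unchanged; s = 1 returns the identity.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    size = len(R)
    return [
        [(1.0 - s) * R[i][j] + (s if i == j else 0.0) for j in range(size)]
        for i in range(size)
    ]


R_ref = [[1.0, 0.8], [0.8, 1.0]]
R_reg = shrink_correlation(R_ref, s=0.5)
```

Under this assumed form the diagonal stays at 1 while off-diagonal correlations are damped, which keeps the matrix well conditioned when the reference panel is small relative to the number of SNPs.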

From the summary statistics and the reference panel, we compute the following quantities that are used for the construction and inference of the directed phenotype network. The subsequent computation also assumes $\mathbf{Y}$ and $\mathbf{X}$ are centered, which does not influence $\hat{\beta}_{qp}$ and the accompanying statistics.

  1.  The covariance matrix of genotypes Inline graphic is estimated by Inline graphic.

  2.  Let Inline graphic. Then Inline graphic is estimated by Inline graphic, provided that MAF is reported in the summary statistics, or otherwise estimated by Inline graphic, the q th diagonal element of Inline graphic.

  3.  Given Inline graphic, Inline graphic is estimated by Inline graphic.

  4.  For Inline graphic, we use the median estimate Inline graphic.

  5.  Finally, Inline graphic is estimated using the null SNPs from GWAS summary statistics, that is, SNPs not marginally associated with Inline graphic or Inline graphic. Following Kim et al. (2015), Inline graphic, where Inline graphic and Inline graphic are vectors of z-scores for the null SNPs. Thus, we can rearrange the sample correlation formula for (centered) phenotype variables and plug in our approximation to obtain Inline graphic. In practice, SNPs with P-values >0.05 are considered null SNPs. An alternative approach to estimating Inline graphic from GWAS summary statistics is also available (Bulik-Sullivan et al., 2015), although it is not used here.
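The null-SNP step in item 5 can be sketched as follows: keep the SNPs whose P-values exceed 0.05 for both phenotypes and take the Pearson correlation of their z-scores as the approximation to the phenotype correlation. The function name `null_snp_correlation` and the toy inputs are hypothetical.

```python
import math


def null_snp_correlation(z_i, z_j, p_i, p_j, threshold=0.05):
    """Pearson correlation of z-scores over SNPs that are null for both phenotypes."""
    pairs = [
        (a, b)
        for a, b, pa, pb in zip(z_i, z_j, p_i, p_j)
        if pa > threshold and pb > threshold
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two null SNPs")
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


# The last SNP is strongly associated with both phenotypes and is excluded.
r = null_snp_correlation(
    z_i=[1.0, 2.0, 3.0, 10.0],
    z_j=[2.0, 4.0, 6.0, -5.0],
    p_i=[0.5, 0.3, 0.2, 0.001],
    p_j=[0.5, 0.3, 0.2, 0.001],
)
```

In a real analysis the number of null SNPs is in the millions, so the correlation estimate is far more stable than in this toy input.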

Next, we extend the framework of interventional Gaussian DAG to leverage large-scale GWAS summary statistics.

2.2. Method for network construction

The estimation of interventional Gaussian DAG consists of 3 steps:

  (E1) First, we use penalized regressions to estimate the genotype–phenotype association matrix $\mathbf{V} = \mathbf{W}(\mathbf{I} - \mathbf{U})^{-1}$ in the following equation, obtained by solving Equation (1) for $\mathbf{Y}$:

    $$\mathbf{Y} = \mathbf{X}\mathbf{V} + \mathbf{E}(\mathbf{I} - \mathbf{U})^{-1}. \tag{2}$$

  (E2) Next, we employ the peeling algorithm (Li et al., 2023) to learn a super-DAG, that is, a directed super-graph without cycles, based on $\hat{\mathbf{V}}$ obtained in Step (E1).

  (E3) Finally, we estimate $\mathbf{U}$ and $\mathbf{W}$ through penalized regressions based on the estimated super-DAG in Step (E2).

Now, we elaborate on our extensions to accommodate summary statistics.

2.2.1. Estimation of V by truncated Lasso penalized regressions

In Equation (2), the matrix Inline graphic can be estimated column-wise from

2.2.1. (3)

where Inline graphic is the data vector of phenotype p, Inline graphic is the data matrix of genotypes, and vector Inline graphic is the p th column of Inline graphic. Given the summary statistics, we expand the squared error function Inline graphic, and replace the quantities Inline graphic, Inline graphic, and Inline graphic in Equation (3) with their estimates Inline graphic, Inline graphic, and Inline graphic, respectively. As a result, we estimate Inline graphic through regressions with the Truncated Lasso Penalty (TLP) (Shen et al., 2012) to minimize

2.2.1. (4)

where κp > 0 is an integer tuning parameter and Inline graphic is the TLP function, which does not penalize the parameters over the threshold τp. We use the R package “glmtlp” (Li et al., 2022) to fit the summary-level data regression Equation (4).
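For intuition, the truncated Lasso penalty of Shen et al. (2012) is commonly written as min(|x|/τ, 1): it grows linearly like the Lasso up to the threshold τ and is constant beyond it, so large coefficients incur no additional shrinkage. A minimal sketch of this standard form, with the hypothetical helper names `tlp` and `tlp_total`:

```python
def tlp(x, tau):
    """Truncated Lasso penalty of a single coefficient: min(|x| / tau, 1)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return min(abs(x) / tau, 1.0)


def tlp_total(coefficients, tau):
    """Sum of the penalty over a coefficient vector."""
    return sum(tlp(c, tau) for c in coefficients)


# Coefficients beyond tau each contribute exactly 1, so the total behaves
# like a count of the "large" coefficients plus a Lasso term for small ones.
total = tlp_total([0.0, 0.05, 0.4, -2.0], tau=0.1)
```

Because every coefficient above the threshold contributes exactly 1, a bound on the total penalty acts approximately as a bound on the number of nonzero coefficients, which is consistent with an integer-valued tuning parameter.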

For implementation, we fix Inline graphic and choose κp ∈ {1, …, Q} individually for each of the P penalized regressions by minimizing the pseudo-BIC (Pattee and Pan, 2020), which is defined as Inline graphic, where Inline graphic is the (estimated) sum of squared errors of Inline graphic, Inline graphic is the number of nonzero coefficients in Inline graphic, Inline graphic is the estimate in Equation (4) with tuning parameters (λp, τp), N is the sample size (when N differs, the median is taken), and Inline graphic is the estimated residual variance for phenotype p in Equation (3). When Q is small compared to N as in our application, a consistent estimate for Inline graphic can be obtained from the ordinary least squares using all Q genotypes, Inline graphic, where Inline graphic is the estimate in Equation (4) with κp = Q. Letting Inline graphic (1 ≤ p ≤ P) be the estimates with the optimally chosen tuning parameters, the final estimate of Inline graphic is Inline graphic.
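The sum of squared errors that enters the pseudo-BIC never needs individual-level data: expanding the squared error gives y'y − 2 v'(X'y) + v'(X'X)v, so only the three cross-product quantities are required. The sketch below checks this identity on a toy example; the function name `sse_from_summary` is hypothetical, and in practice the three inputs would be the summary-statistic-based estimates rather than exact cross-products.

```python
def sse_from_summary(yty, xty, xtx, v):
    """Sum of squared errors ||y - Xv||^2 from cross-products only.

    yty: scalar y'y; xty: vector X'y (length Q); xtx: Q x Q matrix X'X;
    v: coefficient vector (length Q).
    """
    size = len(v)
    cross = sum(v[q] * xty[q] for q in range(size))
    quad = sum(v[a] * xtx[a][b] * v[b] for a in range(size) for b in range(size))
    return yty - 2.0 * cross + quad


# Toy individual-level data, used here only to verify the identity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 4.0]
v = [1.0, 2.0]

yty = sum(val * val for val in y)
xty = [sum(X[i][q] * y[i] for i in range(3)) for q in range(2)]
xtx = [[sum(X[i][a] * X[i][b] for i in range(3)) for b in range(2)] for a in range(2)]

sse_summary = sse_from_summary(yty, xty, xtx, v)
sse_direct = sum((y[i] - sum(X[i][q] * v[q] for q in range(2))) ** 2 for i in range(3))
```

Both routes give the same value, which is why candidate models can be compared by pseudo-BIC using summary-level inputs alone.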

2.2.2. Estimation of super-DAG by the peeling algorithm

Given Inline graphic, the peeling algorithm (Li et al., 2023) can be used to construct a super-DAG with phenotype edge set Inline graphic (a superset of Inline graphic) and interventional edge set Inline graphic (a superset of Inline graphic). The key idea is that the sparsity pattern of matrix Inline graphic characterizes the orientations of the relations among the phenotypes. Specifically, it is demonstrated in Li et al. (2023) that vqp ≠ 0 implies that genotype q intervenes on phenotype p or an ancestor node of phenotype p in the DAG. Thus, if vqp ≠ 0 and vqi = 0 for i ≠ p, then phenotype p is a leaf node in the DAG; that is, there is no directed edge from phenotype p to the others. On this basis, we can sequentially identify and remove (ie, peel) the leaf node in the DAG, and construct supersets Inline graphic and Inline graphic.

Since the peeling algorithm solely depends on Inline graphic, no modification is needed to extend the existing method to accommodate summary-level data.
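The leaf-identification idea can be illustrated with a deliberately simplified sketch: repeatedly look for a genotype whose row of the estimated association matrix is nonzero for exactly one of the remaining phenotypes, declare that phenotype a leaf, and remove it. This is not the full peeling algorithm of Li et al. (2023), which also assembles the candidate edge sets; the function name `peel_order` and the tolerance `tol` are hypothetical.

```python
def peel_order(V, tol=1e-8):
    """Return phenotype indices in the order they are peeled (leaves first).

    V is a Q x P matrix (list of rows); entries with |v| <= tol count as zero.
    Simplified illustration of leaf identification only.
    """
    n_pheno = len(V[0])
    remaining = list(range(n_pheno))
    order = []
    while remaining:
        leaf = None
        for row in V:
            support = [p for p in remaining if abs(row[p]) > tol]
            if len(support) == 1:
                # This genotype affects exactly one remaining phenotype,
                # so that phenotype has no remaining descendants.
                leaf = support[0]
                break
        if leaf is None:
            raise ValueError("no leaf found; some phenotype may lack a valid IV")
        order.append(leaf)
        remaining.remove(leaf)
    return order


# Chain 0 -> 1 -> 2 with one instrument per phenotype: the instrument of
# phenotype 0 reaches all three phenotypes, that of phenotype 2 only itself.
V_hat = [
    [1.0, 0.5, 0.25],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
]
order = peel_order(V_hat)
```

The peeled order is the reverse of a topological order of the DAG, which is what fixes the orientation of the candidate edges.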

2.2.3. Estimation of U and W

The peeling algorithm yields supersets Inline graphic and Inline graphic. To remove the extra edges in Inline graphic and Inline graphic, we consider fitting U and W within a restricted model defined by Inline graphic and Inline graphic.

From Equation (1), for phenotype p, we have

2.2.3. (5)

where Inline graphic and Inline graphic. As in Section 2.2.1, we replace the corresponding quantities with the summary-level data estimates and fit the TLP regression based on Equation (5),

2.2.3. (6)

where Inline graphic is the parameter vector and Inline graphic. We fix Inline graphic and select the tuning parameter Inline graphic by pseudo-BIC as described in Section 2.2.1. The estimated Inline graphic and Inline graphic (1 ≤ p ≤ P) are aggregated to form the final estimates Inline graphic and Inline graphic.

Due to penalization, we recommend following the common practice to standardize the variables so that the phenotypes and genotypes are on a comparable scale, which is straightforward to do as Inline graphic and Inline graphic are obtained. Moreover, if only Inline graphic is of interest, penalization of Inline graphic is optional.

2.3. Likelihood-based inference for a DAG

We extend the likelihood ratio inference (Li et al., 2023) to quantify the uncertainty of the network structures. As in Li et al. (2023), we consider 2 types of hypothesis testing.

  1. Testing of multiple directed relations. The null hypothesis H0: ukj = 0 for each Inline graphic and alternative hypothesis Ha: ukj ≠ 0 for some Inline graphic. Rejecting H0 indicates evidence for the presence of some hypothesized relationships in the network.

  2. Testing of a directed pathway. The null hypothesis H0: ukj = 0 for some Inline graphic and alternative hypothesis Ha: ukj ≠ 0 for each Inline graphic. Rejecting H0 indicates evidence for the presence of the entire directed pathway in the phenotype network.

The procedure for testing multiple directed relations comprises 5 steps.

  (T1) Estimate Inline graphic and use the peeling algorithm to obtain Inline graphic and Inline graphic, as in Section 2.2.

  (T2) Identify the set of non-degenerate edges Inline graphic (Li et al., 2023), which contains Inline graphic, non-degenerate edges pointed to phenotype p.

  (T3) Estimate the parameters Inline graphic and Inline graphic under H0 and Ha, respectively. Specifically, denote by Inline graphic and Inline graphic, the estimates under H0. Then Inline graphic and Inline graphic are computed as in the regression (6) with an additional constraint that ukj = 0 for Inline graphic. Let Inline graphic and Inline graphic be the estimates under Ha. Then Inline graphic and Inline graphic are computed from the restricted models (1 ≤ p ≤ P), Inline graphic, Inline graphic, via regression (6), where the penalties become Inline graphic and Inline graphic.

  (T4) Compute Inline graphic (1 ≤ p ≤ P) from the residual sum of squares of Inline graphic.

  (T5) Compute the test statistic Inline graphic, where L is the log-likelihood of the model [Equation (1)]. According to Li et al. (2023), T is approximately chi-squared distributed with degrees of freedom Inline graphic when the size Inline graphic is <50; Inline graphic is approximately standard normal when Inline graphic >50. Thus, the P-value is calculated as Inline graphic, when Inline graphic and Inline graphic, when Inline graphic.

The procedure for testing a directed pathway is similar, with minor modifications.

  1.  Estimate Inline graphic, as in Step (T1).

  2.  First, we decompose H0 into each non-degenerate edge Inline graphic. For each Inline graphic, implement Steps (T2)–(T5) above to obtain the corresponding P-value PV (k, j). The final P-value is computed as the maximum of the P-values for the sub-hypotheses, Inline graphic.

Of note, testing a directed pathway concerns a composite (null) hypothesis. Fixing 0 < α < 1, we have Inline graphic (Li et al., 2023). In other words, the test asymptotically achieves exactly the α significance level for the composite null hypothesis.
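The pathway test combines single-edge tests by taking the largest edge-level P-value, so the pathway is declared present only if every edge on it is individually significant. The sketch below assumes, for illustration, that each single-edge statistic is referred to a chi-squared distribution with 1 degree of freedom, whose tail probability has the closed form erfc(sqrt(T/2)). The function names and the statistic values are hypothetical.

```python
import math


def chi2_sf_df1(t_stat):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(t_stat, 0.0) / 2.0))


def pathway_p_value(edge_p_values):
    """P-value for a directed pathway: the maximum over its edge-level P-values."""
    if not edge_p_values:
        raise ValueError("the pathway must contain at least one edge")
    return max(edge_p_values.values())


# Hypothetical likelihood ratio statistics for the two edges of a pathway.
edge_stats = {(1, 6): 12.0, (6, 12): 3.841458820694124}
edge_p = {edge: chi2_sf_df1(t) for edge, t in edge_stats.items()}
p_path = pathway_p_value(edge_p)
```

Here the weaker edge determines the result: the pathway P-value equals the P-value of the edge with the smaller statistic.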

3. INFERRING CARDIOVASCULAR-RELATED PROTEIN–PROTEIN INTERACTION NETWORK

The role of cardiovascular diseases has been recognized as an important etiologic hallmark of AD (de Bruijn and Ikram, 2014). There are different hypotheses on the various mechanisms underlying the association between AD and cardiovascular diseases (Tini et al., 2020). In this real data application, we constructed a directed PPI network of some cardiovascular-related proteins based on a GWAS of 83 plasma protein biomarkers. We further connected the PPI network to AD through Mendelian randomization (MR) analyses.

3.1. GWAS summary statistics for cardiovascular-related proteins

We used the GWAS summary statistics on 83 cardiovascular-related proteins from Folkersen et al. (2017), which came from Wald tests for the association between each SNP and the standardized residuals among 3394 European individuals. Five proteins were excluded from the analysis because their protein-encoding genes are located on the sex chromosomes. The summary statistics were first processed to remove (a) indels; (b) SNPs located within 1 base pair of an indel; (c) SNPs with imputation quality score INFO ≤ 0.8; and (d) SNPs with MAF ≤ 0.05. We then used the following steps to select putative IVs for the proteins:

  1. SNPs were clumped at an r2 value of 0.01 using 3000 uncorrelated individuals (individuals with kinship coefficients <0.084) from UK Biobank of European ancestry as the reference panel, such that SNPs were independent of each other for each protein (Bycroft et al., 2018).

  2. Only the SNPs in the clumped data files located within ±1 MB of each protein-encoding gene were considered. In general, cis-regulatory changes will be less pleiotropic (Signor and Nuzhdin, 2018), and thus, these SNPs located close to the genes are more likely to be valid IVs due to the exclusion assumption (Swerdlow et al., 2016; Hemani et al., 2018; Li et al., 2023) (ie, an IV only directly intervenes on 1 primary variable).

  3. To ensure the relevance assumption (Li et al., 2023) was satisfied (ie, IV intervenes on at least 1 primary variable), we only selected SNPs whose P-values were below the GWAS significance threshold (5 × 10−8). This filtering process led to a total number of 33 SNPs and 23 proteins, with at least 1 putative IV in the final network analysis.

The genetic correlation matrix for the included IVs was estimated based on the same reference panel used in clumping. We calculated the empirical correlation of each pair of proteins as the correlation coefficient of the z-scores of the null SNPs, that is, all autosomal SNPs with MAF ≥ 0.05, INFO ≥ 0.8, and GWAS P-values ≥ 0.05 for both proteins. The number of null SNPs for each pair of proteins ranged from 1 191 204 to 1 223 357. All preparation of the reference panel and GWAS data for both the DAG estimation and MR analysis was done using PLINK version 1.9 (Purcell et al., 2007).

3.2. GWAS summary statistics for AD

We explored the relationship between each of the 23 proteins in Folkersen et al. (2017) and AD. We used the summary statistics of the GWAS for AD from a recent study totaling 111 326 clinically diagnosed or “proxy” AD cases and 677 663 controls (Bellenguez et al., 2022). We removed SNPs with MAF < 0.05 and SNPs not included in the GWAS of Folkersen et al. (2017), and clumped the remaining SNPs at r2 = 0.01. Among these, only SNPs with a GWAS P-value <5 × 10−8 were selected as IVs in the MR analyses.

3.3. Results

We constructed a DAG of the 23 proteins as described in Section 2.2. As Folkersen et al. (2017) shared MAF in the summary statistics, we compared them with those in UK Biobank, the reference panel for clumping and estimating the genetic correlation matrix. The absolute difference of the MAF of all IVs ranged from 0.001 to 0.055 with a mean of 0.02, while the correlation was 0.99 (Supplementary Table S1). We further performed MR analysis on each protein to evaluate its relationship with AD using the TwoSampleMR package (Hemani et al., 2018). We used Egger’s test of intercept to examine the exclusion assumption: if a protein had a P-value of the Egger’s test of intercept >0.05/23, there was no evidence of direct/pleiotropic effects, and we used the more powerful MR-IVW method; otherwise, we used MR-Egger (to allow pleiotropic effects of IVs). In either case, we used the P-value cut-off <0.05/23 to declare statistical significance. Supplementary Table S2 contains a complete list of MR results. The protein IL18, which showed marginal significance in both Egger’s test of intercept (P-value = 0.06) and MR-Egger (P-value = 0.07), was a parent node for several proteins related to AD, including ADM, IL1RL1, CTSD, CXCL6, and CXCL16. We further performed the likelihood ratio test on each edge. Edges with P-value <0.05/(23 × 22−56) were considered significant and are drawn as solid lines in Figure 1. The number of tests in the Bonferroni correction is bounded by the number of possible edges among all the nodes minus the total number of edges after the peeling algorithm, which is justified in Supplementary Material S2.1. Each edge from IL18 to the 5 AD-associated proteins was highly significant in the likelihood ratio test, thus suggesting that simultaneous testing of the pathway from IL18 to the 5 proteins would be significant.
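The two multiplicity thresholds used above are simple arithmetic and can be reproduced as follows. The sketch takes the node count (23) and the number subtracted in the edge-test correction (56) directly from the text; the function names `edge_test_threshold` and `choose_mr_method` are hypothetical.

```python
def edge_test_threshold(n_nodes, n_peeled_edges, alpha=0.05):
    """Bonferroni threshold for edge tests and the number of tests it assumes.

    The number of tests is bounded by all ordered node pairs minus the
    number of edges remaining after the peeling algorithm.
    """
    n_tests = n_nodes * (n_nodes - 1) - n_peeled_edges
    return alpha / n_tests, n_tests


def choose_mr_method(egger_intercept_p, n_proteins=23, alpha=0.05):
    """Pick MR-IVW unless the Egger intercept test rejects at alpha / n_proteins."""
    return "MR-IVW" if egger_intercept_p > alpha / n_proteins else "MR-Egger"


threshold, n_tests = edge_test_threshold(n_nodes=23, n_peeled_edges=56)
mr_threshold = 0.05 / 23
```

With 23 nodes there are 23 × 22 = 506 ordered pairs, so the correction uses 506 − 56 = 450 tests and an edge-level threshold of about 1.1 × 10−4, while the MR analyses use 0.05/23, about 2.2 × 10−3.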
Previous studies detected increased levels of pro-inflammatory IL18 in both cardiovascular diseases and in brain regions of AD patients (Sutinen et al., 2014). IL18 is known to increase the level of Cdk5 and GSK-3β, which are involved in Tau hyperphosphorylation, and the inhibition of Cdk5 was known to improve AD subjects’ conditions (Calabrò et al., 2021). Our work suggests a possible regulatory role of IL18 on multiple AD-associated proteins. According to OpenTargets.org (Ochoa et al., 2021) for current pharmaceuticals either approved or in development with IL18, this protein is currently a target of an antibody drug to treat diabetes mellitus and a few other conditions; diabetes has long been linked to AD with epidemiological and biological evidence (Barbagallo and Dominguez, 2014). Lastly, we provide a Shiny application that allows users to test any selected proteins in this cardiovascular-related PPI network.

FIGURE 1.

Estimated DAG for 23 proteins based on the GWAS summary statistics of Folkersen et al. (2017). Proteins significantly associated with AD in MR analysis are colored gold. A solid line represents an edge that is statistically significant by the likelihood ratio test whereas a dashed line represents an edge that is not significant.

4. SIMULATION STUDIES

4.1. Simulation settings

We simulated the data assuming a fixed Inline graphic, Inline graphic, standardized genotype matrix Inline graphic, and sampled each row of Inline graphic independently from Inline graphic. Then we generated Inline graphic from equation: Inline graphic. Without loss of generality, no intercept was modeled; that is, Inline graphic was centered at mean 0. The values of Inline graphic, Inline graphic, and Inline graphic are provided in Supplementary Material S1.1, where the structure of the relationship of Inline graphic followed a DAG of 15 nodes/phenotypes (Figure 2). The effect sizes of the non-zero components of Inline graphic ranged from 0.002 to 1.16, with a median of 0.06. All phenotypes had at least 1 valid IV. Twenty-six SNPs were included in the model, with their effect sizes ranging from −2.2 to 2.5 with a median of −0.11. Two SNPs violated the relevance assumption, while the rest were valid IVs. We also varied the effect sizes to be 1/3 and 1/15 of Inline graphic while keeping Inline graphic fixed.

FIGURE 2.

True DAG for the simulation study with 15 phenotypes.

The standardized genotype matrix Inline graphic was obtained from unrelated individuals of European ancestry in UK Biobank (Bycroft et al., 2018). We then calculated the summary statistics using a linear model of each phenotype on each standardized genotype, and supplied the summary statistics to the proposed algorithm. The reference panel was obtained from the UK Biobank European samples, which were not correlated with the simulated samples used to derive the summary statistics. SNPs on chromosome 22 with a Hardy-Weinberg equilibrium test P-value > 0.0001, missing call rate <0.05, and MAF >0.05 were pruned to have r2 < 0.01. We then randomly selected 26 SNPs for Inline graphic. Missing values of SNPs were imputed by their mean. Null SNPs were directly simulated to be independent of each other and have no relationship with Inline graphic.

We evaluated the performance of both the network construction and statistical inference for the proposed method. To evaluate the performance of network construction, we examined the false-positive (FP) and false-negative (FN) rates for Inline graphic over 200 replications. The sample size of the summary statistics was varied at 3000, 6000, 9000, and 12 000, and the sample size of the reference panel was fixed at 3000. Null SNPs were simulated to have the same sample size as the GWAS summary statistics.
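The text does not spell out the denominators of the FP and FN rates. The sketch below assumes one common convention: the FP rate is the share of truly absent directed edges that were estimated as present, and the FN rate is the share of true edges that were missed. The function name `fp_fn_rates` and the toy graphs are hypothetical.

```python
def fp_fn_rates(true_edges, estimated_edges, n_nodes):
    """False-positive and false-negative rates of an estimated directed edge set.

    Assumed convention: FP rate = false edges / truly absent ordered pairs,
    FN rate = missed edges / true edges. Edges are (parent, child) tuples.
    """
    true_set, est_set = set(true_edges), set(estimated_edges)
    n_possible = n_nodes * (n_nodes - 1)  # ordered pairs without self-loops
    n_absent = n_possible - len(true_set)
    false_pos = len(est_set - true_set)
    false_neg = len(true_set - est_set)
    fp_rate = false_pos / n_absent if n_absent else 0.0
    fn_rate = false_neg / len(true_set) if true_set else 0.0
    return fp_rate, fn_rate


# True chain 0 -> 1 -> 2; the estimate recovers 0 -> 1 but reports 0 -> 2
# instead of 1 -> 2.
fp_rate, fn_rate = fp_fn_rates(
    true_edges=[(0, 1), (1, 2)],
    estimated_edges=[(0, 1), (0, 2)],
    n_nodes=3,
)
```

In a simulation these rates would be averaged over replications, for example the 200 replications used for network construction.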

In terms of testing, we examined the empirical Type I error/power of the likelihood ratio tests with increasing sample sizes and varying strength of Inline graphic for the following 5 scenarios over 1000 replications for the 2 types of testing.

I. Testing 1 or more directed relations:

A1. Type I error: testing 1 edge when in truth it was null with H0: u1, 14 = 0 versus H1: u1, 14 ≠ 0.

A2. Power: testing 1 edge when in truth it was not null with H0: u1, 6 = 0 versus H1: u1, 6 ≠ 0.

A3. Power: testing 2 edges together when in truth both were not null with H0: u7, 15 = u1, 6 = 0 versus H1: u7, 15 ≠ 0 or u1, 6 ≠ 0.

II. Testing of a directed pathway:

B1. Type I error: testing whether both edges were not null when in truth only 1 was not null with H0: u1, 14 = 0 or u6, 12 = 0 versus H1: u1, 14 ≠ 0 and u6, 12 ≠ 0.

B2. Power: testing whether both edges were not null when in truth both were not null with H0: u1, 6 = 0 or u6, 12 = 0 versus H1: u1, 6 ≠ 0 and u6, 12 ≠ 0.

The true strengths of the tested edges were: u1, 14 = 0, u1, 6 = 0.27, u7, 15 = 0.44, and u6, 12 = 1.36.

4.2. Simulation results

With the increase of sample size, the constructed networks became closer to the true graph (Figure 3; numeric values in Figure 3a are in Supplementary Tables S3–S6). More specifically, FP of Inline graphic was ∼0.05, and FN of Inline graphic decreased with the increase of sample size when 15 000 null SNPs were used to estimate the 15 × 15 matrix of Inline graphic. Unsurprisingly, the performance of using Inline graphic estimated by 15 000 null SNPs (denoted as Inline graphic in Figure 3) was slightly worse than that of using Inline graphic (denoted as Inline graphic in Figure 3). In addition, the FP of Inline graphic clearly decreased with the increase of effect sizes (from Inline graphic to Inline graphic to Inline graphic). To check the validity of our method, we compared pBIC estimated from summary statistics with BIC estimated from the individual-level data for the same set of penalized regression coefficients. We found the 2 sets of values were highly concordant, and the results from 1 iteration were plotted in Supplementary Figure S2.

FIGURE 3.

Performance of network construction (a) and likelihood ratio tests (b–d) in simulation with varying sample sizes and true effect sizes at Inline graphic, Inline graphic, and Inline graphic. Figures b, c, and d represent scenarios A2, A3, and B2, respectively.

In terms of testing, we observed well-controlled Type I error rates for Scenarios A1 and B1 (numeric values presented in Figure 3b–d are in Supplementary Tables S7–S10), using Inline graphic. We note that the empirical Type I error rates might become conservative when Inline graphic is replaced by its estimate Inline graphic derived from 15 000 null SNPs (Supplementary Table S10). In real data analysis of GWAS summary-level data, typically a much larger number of null SNPs and various other strategies can be used to better estimate Inline graphic (Bulik-Sullivan et al., 2015; Kim et al., 2015; Li et al., 2021); however, the investigation in this direction is out of our scope. Furthermore, empirical power was high for Scenarios A2, A3, and B2 and increased with sample size and effect size. The empirical power of jointly testing 2 edges u1, 6 and u7, 15 (Scenario A3) was larger than that of testing one edge u1, 6 alone (Scenario A2).

5. DISCUSSION

In this paper, we present a method to estimate an interventional DAG of phenotypes utilizing linear structural equation models, applicable to GWAS summary statistics in the absence of individual-level data. We demonstrated satisfactory performance in terms of the FP and FN rates in network construction, as well as high empirical power and well-controlled Type I error rates of the likelihood ratio tests. We applied this method to a large-scale proteomic GWAS summary dataset to obtain an estimated DAG of 23 cardiovascular-related proteins and further illustrated the effects of these proteins on AD by MR analysis. These results can be useful in understanding the disease etiology, drug repurposing, and other applications for AD.

We note that the choice of a proper reference panel is just as important for our method as for many other summary-statistics-based methods (Deng and Pan, 2018; Chen et al., 2021; Privé et al., 2022). When constructing the cardiovascular-related PPI network, we used an ancestry-matched reference panel of uncorrelated individuals from UK Biobank with a sample size of 3000, which is close to the sample size of the GWAS of cardiovascular proteins. Furthermore, we clumped SNPs around the cis-region of the gene and only used the genome-wide significant SNPs as IVs. This step aimed not only at selecting at least 1 valid IV for each protein, but also at achieving better estimation of the genetic correlation matrix for the IVs, as the total number of IVs became much smaller than the sample size of the reference panel, that is, Q ≪ Nr. Our analysis was constrained to a single ancestry group and unrelated samples. As the collection of multi-ancestry and related samples increases, it will be of significant research interest to establish networks among these populations, a challenge we anticipate addressing in our future work.

In recent years, a large amount of summary-level data has become widely accessible. On the molecular level, many studies have published their summary statistics for SNP-molecular phenotype associations. For example, variant-gene associations in 49 tissues can be directly downloaded from the GTEx Portal (GTEx Consortium, 2020). Beyond the molecular phenotypes, UK Biobank alone provides GWAS summary statistics on >7000 traits, including but not limited to cognitive functions, early life factors, health and medical history, and physical measurement (Bycroft et al., 2018). Our proposed method provides a computational and analytical tool to explore the relationships among multiple phenotype variables by taking advantage of rapid advances in GWAS and other association mappings.

Supplementary Material

ujad039_Supplemental_Files

Web Appendices, Supplementary Tables S1–S10, Figure S1 referenced in Sections 3.3 and 4.2, and data and code (sumDAG algorithm, Shiny app, and the real data analysis code) are available with this paper at the Biometrics website on Oxford Academic. Data and code can also be found on https://github.com/chunlinli/sumdag.

Acknowledgement

Rachel Zilinskas and Chunlin Li have contributed equally to this work. The authors would like to thank the associate editor and the reviewer for their valuable comments. Tianzhong Yang would like to further acknowledge the Children’s Cancer Research Funds and the St. Baldrick’s Foundation Scholar Award.

Contributor Information

Rachel Zilinskas, Statistics and Data Corporation, Tempe, AZ 85288, United States.

Chunlin Li, Department of Statistics, Iowa State University, Ames, IA 50011, United States.

Xiaotong Shen, School of Statistics, University of Minnesota, Minneapolis, MN 55455, United States.

Wei Pan, Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN 55455, United States.

Tianzhong Yang, Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN 55455, United States.

FUNDING

This research was supported by NIH grants U01 AG073079, R01 AG065636, R01 AG069895, RF1 AG067924, R01 HL116720, and R01 GM126002 and by the Minnesota Supercomputing Institute at the University of Minnesota.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

We downloaded the summary-level genome-wide association study data in Section 3.1 from https://zenodo.org/record/264128/. The algorithm for the proposed work is packaged in R, available at https://github.com/chunlinli/sumdag and the Biometrics website, along with code used for the simulation studies and real data application. The processed summary-level genome-wide association study data that were used as input for the algorithm for the real data application are also included on GitHub.

References

  1. Ashburner M., Ball C. A., Blake J. A., Botstein D., Butler H., Cherry J. M. et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25, 25–29.
  2. Barbagallo M., Dominguez L. J. (2014) Type 2 diabetes mellitus and Alzheimer’s disease. World Journal of Diabetes, 5, 889–893.
  3. Bellenguez C., Küçükali F., Jansen I. E., Kleineidam L., Moreno-Grau S., Amin N. et al. (2022) New insights into the genetic etiology of Alzheimer’s disease and related dementias. Nature Genetics, 54, 412–436.
  4. Bulik-Sullivan B., Finucane H. K., Anttila V., Gusev A., Day F. R., Loh P.-R. et al. (2015) An atlas of genetic correlations across human diseases and traits. Nature Genetics, 47, 1236–1241.
  5. Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K. et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
  6. Calabrò M., Rinaldi C., Santoro G., Crisafulli C. (2021) The biological pathways of Alzheimer disease: a review. AIMS Neuroscience, 8, 86–132.
  7. Chen C., Ren M., Zhang M., Zhang D. (2018) A two-stage penalized least squares method for constructing large systems of structural equations. Journal of Machine Learning Research, 19, 1–34.
  8. Chen W., Wu Y., Zheng Z., Qi T., Visscher P. M., Zhu Z. et al. (2021) Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors. Nature Communications, 12, 7117.
  9. Cheng F., Zhao J., Wang Y., Lu W., Liu Z., Zhou Y. et al. (2021) Comprehensive characterization of protein–protein interactions perturbed by disease mutations. Nature Genetics, 53, 342–353.
  10. Consortium G. (2020) The GTEx consortium atlas of genetic regulatory effects across human tissues. Science, 369, 1318–1330.
  11. de Bruijn R. F., Ikram M. A. (2014) Cardiovascular risk factors and future risk of Alzheimer’s disease. BMC Medicine, 12, 1–9.
  12. Deng Y., Pan W. (2018) Improved use of small reference panels for conditional and joint analysis with GWAS summary statistics. Genetics, 209, 401–408.
  13. Emilsson V., Ilkov M., Lamb J. R., Finkel N., Gudmundsson E. F., Pitts R. et al. (2018) Co-regulatory networks of human serum proteins link genetics to disease. Science, 361, 769–773.
  14. Folkersen L., Fauman E., Sabater-Lleal M., Strawbridge R. J., Frånberg M., Sennblad B. et al. (2017) Mapping of 79 loci for 83 plasma protein biomarkers in cardiovascular disease. PLOS Genetics, 13, e1006706.
  15. Friedman N., Linial M., Nachman I., Pe’er D. (2000) Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601–620.
  16. Hemani G., Bowden J., Smith G. D. (2018) Evaluating the potential role of pleiotropy in Mendelian randomization studies. Human Molecular Genetics, 27, 195–208.
  17. Hemani G., Zheng J., Elsworth B., Wade K. H., Haberland V., Baird D. et al. (2018) The MR-base platform supports systematic causal inference across the human phenome. eLife, 7, e34408.
  18. Id J. Q., Chen K., Zhong C., Zhu S., Id X. M. (2021) Network-based protein-protein interaction prediction method maps perturbations of cancer interactome. PLOS Genetics, 17, e1009869.
  19. International HapMap Consortium. (2005) A haplotype map of the human genome. Nature, 437, 1299–1320.
  20. Kim J., Bai Y., Pan W. (2015) An adaptive association test for multiple phenotypes with GWAS summary statistics. Genetic Epidemiology, 39, 651–663.
  21. Li C., Shen X., Pan W. (2023) Inference for a large directed acyclic graph with unspecified interventions. Journal of Machine Learning Research, 24, 1–48.
  22. Li C., Yang Y., Wu C. (2022) Package “glmtlp”. https://cran.r-project.org/web/packages/glmtlp/glmtlp.pdf.
  23. Li T., Ning Z., Shen X. (2021) Improved estimation of phenotypic correlations using summary association statistics. Frontiers in Genetics, 12, 665252.
  24. Liu F., Zhang S.-W., Guo W.-F., Wei Z.-G., Chen L. (2016) Inference of gene regulatory network based on local Bayesian networks. PLOS Computational Biology, 12, e1005024.
  25. Mak T. S. H., Porsch R. M., Choi S. W., Zhou X., Sham P. C. (2017) Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41, 469–480.
  26. Napoli C., Benincasa G., Donatelli F., Ambrosio G. (2020) Precision medicine in distinct heart failure phenotypes: focus on clinical epigenetics. American Heart Journal, 224, 113–128.
  27. Ochoa D., Hercules A., Carmona M., Suveges D., Gonzalez-Uriarte A., Malangone C. et al. (2021) Open targets platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research, 49, D1302–D1310.
  28. Pattee J., Pan W. (2020) Penalized regression and model selection methods for polygenic scores on summary statistics. PLOS Computational Biology, 16, e1008271.
  29. Privé F., Arbel J., Aschard H., Vilhjálmsson B. J. (2022) Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. Human Genetics and Genomics Advances, 3, 100136.
  30. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A., Bender D. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81, 559–575.
  31. Ross C. A., Poirier M. A. (2004) Protein aggregation and neurodegenerative disease. Nature Medicine, 10, S10–S17.
  32. Shen X., Pan W., Zhu Y. (2012) Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107, 223–232.
  33. Signor S. A., Nuzhdin S. V. (2018) The evolution of gene expression in cis and trans. Trends in Genetics, 34, 532–544.
  34. Snider J., Kotlyar M., Saraon P., Yao Z., Jurisica I., Stagljar I. (2015) Fundamentals of protein interaction network mapping. Molecular Systems Biology, 11, 848.
  35. Sutinen E. M., Korolainen M. A., Häyrinen J., Alafuzoff I., Petratos S., Salminen A. et al. (2014) Interleukin-18 alters protein expressions of neurodegenerative diseases-linked proteins in human SH-SY5Y neuron-like cells. Frontiers in Cellular Neuroscience, 8, 214.
  36. Swerdlow D., Kuchenbaecker K., Shah S., Sofat R., Holmes M., White J. et al. (2016) Selecting instruments for Mendelian randomization in the wake of genome-wide association studies. International Journal of Epidemiology, 45, 1600–1616.
  37. Taliun D., Harris D. N., Kessler M. D., Carlson J., Szpiech Z. A., Torres R. et al. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature, 590, 290–299.
  38. The 1000 Genomes Project Consortium. (2015) A global reference for human genetic variation. Nature, 526, 68–74.
  39. Tini G., Scagliola R., Monacelli F., La Malfa G., Porto I., Brunelli C. et al. (2020) Alzheimer’s disease and cardiovascular disease: a particular association. Cardiology Research and Practice, 2020, 2617970.
  40. Witten D. M., Friedman J. H., Simon N. (2012) New insights and faster computations for the graphical lasso view. Journal of Computational and Graphical Statistics, 20, 892–900.
  41. Zhang B., Horvath S. (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4, 1128.
  42. Zhang P., Itan Y. (2019) Biological network approaches and applications in rare disease studies. Genes, 10, 797.


Articles from Biometrics are provided here courtesy of Oxford University Press
