PLOS Genetics. 2025 Apr 10;21(4):e1011659. doi: 10.1371/journal.pgen.1011659

Enhancing nonlinear transcriptome- and proteome-wide association studies via trait imputation with applications to Alzheimer’s disease

Ruoyu He 1,2,#, Jingchen Ren 1,2,#, Mykhaylo M Malakhov 2, Wei Pan 2,*
Editor: Hae Kyung Im 3
PMCID: PMC12040266  PMID: 40209152

Abstract

Genome-wide association studies (GWAS) performed on large cohort and biobank datasets have identified many genetic loci associated with Alzheimer’s disease (AD). However, the younger demographic of biobank participants relative to the typical age of late-onset AD has resulted in an insufficient number of AD cases, limiting the statistical power of GWAS and any downstream analyses. To mitigate this limitation, several trait imputation methods have been proposed to impute the expected future AD status of individuals who may not have yet developed the disease. This paper explores the use of imputed AD status in nonlinear transcriptome/proteome-wide association studies (TWAS/PWAS) to identify genes and proteins whose genetically regulated expression is associated with AD risk. In particular, we considered the TWAS/PWAS method DeLIVR, which utilizes deep learning to model the nonlinear effects of expression on disease. We trained transcriptome and proteome imputation models for DeLIVR on data from the Genotype-Tissue Expression (GTEx) Project and the UK Biobank (UKB), respectively, with imputed AD status in UKB participants as the outcome. Next, we performed hypothesis testing for the DeLIVR models using clinically diagnosed AD cases from the Alzheimer’s Disease Sequencing Project (ADSP). Our results demonstrate that nonlinear TWAS/PWAS trained with imputed AD outcomes successfully identifies known and putative AD risk genes and proteins. Notably, we found that training with imputed outcomes can increase statistical power without inflating false positives, enabling the discovery of molecular exposures with potentially nonlinear effects on neurodegeneration.

Author summary

Transcriptome-wide association studies (TWAS) and proteome-wide association studies (PWAS) are useful for identifying causal genes and proteins for complex human traits. However, the power of TWAS/PWAS to identify genes and proteins for late-onset diseases, such as Alzheimer’s disease (AD), is limited by the small numbers of disease cases in some biobanks, largely due to the relatively young age of study participants. Traditional TWAS methods can overcome this limitation by relying on external genome-wide association study (GWAS) summary statistics, but they fail to capture nonlinear associations, which are particularly important in complex diseases such as AD. The main contribution of this paper is to demonstrate that by incorporating a newly proposed trait imputation method, LS-imputation, along with the widely used proxy AD imputation method, we can detect potentially nonlinear associations between molecular exposures and AD status. This approach enhances the power of nonlinear TWAS/PWAS for studying AD and possibly other late-onset diseases. Furthermore, we show that applying LS-imputation within the TWAS/PWAS framework is unlikely to lead to false discoveries, supporting its reliability in genetic studies of AD.

1. Introduction

Alzheimer’s disease (AD), a complex polygenic neurodegenerative disorder and the most prevalent form of dementia, has captured significant attention within the genetics research community. Genetic mutations are recognized as a predominant factor in AD pathogenesis, with heritability estimates for late-onset AD ranging from 60% to 80% [1]. The advent of large biobanks and the development of genome-wide association studies (GWAS) have accelerated the identification of single-nucleotide polymorphisms (SNPs) and genetic loci linked to AD. In the past decade alone, the number of known GWAS loci significantly associated with AD has increased from 20 to over 90 [2–7]. Although these findings have significantly advanced our understanding of the genetic architecture of AD, statistical power for studying neurodegenerative conditions may plateau despite ever-larger sample sizes due to the demographic makeup of most large biobanks. Large studies such as the UK Biobank (UKB) primarily enroll relatively young individuals who are healthier than the general population [8,9], so the number of AD cases among biobank participants does not grow at the same rate as the overall sample size. Thus, methods for imputing the disease phenotypes of participants who are likely to develop AD as they age are necessary in order to refine research outcomes and increase statistical power.

A common practice for addressing this issue is to use the GWAS-by-proxy (GWAX) method [10], which aggregates the family history of biobank participants to impute their expected AD status, known as “proxy AD.” Since its proposal, GWAX has gained rapid popularity, with almost every recent AD GWAS analysis adopting this method to enhance statistical power [3,4,11–14]. Although not commonly applied in this context, polygenic score methods can also be used to impute AD risk. Recently, another phenotype imputation method, LS-imputation, has been proposed [15,16]. Instead of borrowing information from family history, LS-imputation relies on published GWAS summary statistics to impute phenotypes for individuals with known genome-wide genotypes. Importantly, despite relying solely on GWAS summary statistics, LS-imputation has been shown to effectively reconstruct both linear and nonlinear genotype-phenotype associations, unlocking new possibilities for downstream analyses.

Testing for nonlinear genotype-phenotype associations has drawn growing interest in recent years [17,18], but such discoveries can be constrained by the limited sample size available for certain traits such as AD. By leveraging the capabilities of LS-imputation and proxy phenotypes to retain nonlinear genotype-phenotype relationships, the large sample sizes of modern biobanks now enable nonlinear transcriptome-wide association study (TWAS) inference [19–25]. More generally, this approach can be used for nonlinear association studies of any molecular traits [26–28], including proteome-wide association studies (PWAS) of plasma protein concentrations [29–32].

In this study, we evaluated the ability of various AD imputation methods to recapture genotype-phenotype relationships and compared their performance to that of using only the observed trait in TWAS/PWAS inference tasks. In stage 1 of TWAS, we built prediction models using gene expression data from the Genotype-Tissue Expression (GTEx) Project version 8 [33]. Analogously, we also trained prediction models for protein expression using data from the UKB Pharma Proteomics Project [34,35]. Then we imputed gene/protein expression into the UKB data. In stage 2, we used the imputed expression as input and trained the DeLIVR model [36] using the imputed AD status of UKB individuals. Lastly, we performed association testing with diagnosed AD case/control phenotypes from the Alzheimer’s Disease Sequencing Project (ADSP) release 4. Note that hypothesis testing was done on an independent dataset (ADSP) with clinically diagnosed AD cases in order to avoid potential pitfalls of testing with imputed data. Our nonlinear TWAS/PWAS analyses identified known AD risk genes and proteins, as well as new putative targets. Notably, we found that training DeLIVR models using imputed AD outcomes resulted in the identification of more genes and proteins than when training solely using observed outcomes. Moreover, the distinct sets of results obtained from different imputation methods suggest that considering these methods together may provide a more complete picture of the genetic underpinnings of AD.

Additional analyses were also performed using UKB data for high-density lipoprotein (HDL) cholesterol to compare the Type I error rate and power of DeLIVR trained with an LS-imputed trait. Our results showed that DeLIVR trained with LS-imputed HDL cholesterol levels and tested on an independent dataset with observed HDL cholesterol levels could control the Type I error rate at a nominal level. In addition, training DeLIVR with the imputed trait often had higher power compared to training it on a smaller dataset with the observed trait.

2. Materials and methods

2.1. Overview of TWAS/PWAS

We begin by outlining the TWAS/PWAS framework. For conciseness, we will use TWAS as an example, though the same methods are also applicable to proteomics and other types of molecular traits. The causal model for TWAS is illustrated in Fig 1a.

Fig 1. Illustration of the TWAS/PWAS workflow.

(a) Traditional TWAS trains a model to predict an exposure (X) from genetic instruments (Z) and then tests the relationship between the predicted exposure and an outcome trait (Y). (b) The first stage of our TWAS framework is the same as in traditional TWAS, where a model is trained to predict gene or protein expression from local genetic variation. (c) Unlike traditional TWAS, which directly tests for association between predicted exposures and the outcome trait, our framework first trains a stage 2 model to predict the outcome trait. The outcome trait used for training may itself be imputed. (d) Hypothesis testing is performed on an independent test dataset. We use the stage 1 model to predict expression from genotypes, then we use the stage 2 model to predict the outcome trait from the predicted expression, and finally we test for association between the predicted outcome trait and its observed values.

Formally, denote $Y_{n \times 1}$ to be the outcome trait with $n$ samples, $X_{n \times 1}$ to be a gene’s or protein’s expression levels, and $Z_{n \times m}$ to be the genotype matrix with $m$ genetic variants. The TWAS model is as follows,

$$X = Z\beta + \epsilon_1, \tag{1}$$
$$Y = g(X) + \epsilon_2, \tag{2}$$

where $g(X)$ is the target function of expression that we are interested in testing, and $\epsilon_1$ and $\epsilon_2$ are independent error terms. Traditionally, TWAS uses a reference dataset to estimate the stage 1 model (Eq 1) and another independent dataset to estimate the stage 2 model (Eq 2).

Our framework differs from the traditional approach in that it requires three distinct datasets. First, we need a reference dataset containing genotype data and observed gene/protein expression data to build the stage 1 prediction model (Eq 1) for gene/protein expression using genetic variants as predictors (Fig 1b). Second, we require a large biobank dataset with genotype data but without phenotypes. We apply a trait imputation method to impute the outcome trait into this biobank dataset, and use the trained stage 1 model to predict gene/protein expression for those same individuals. Then we train the stage 2 model (Eq 2) using the predicted gene/protein expression as input and the imputed trait as the output (Fig 1c). Lastly, we need an independent test dataset with observed genotype and outcome trait data. We predict gene/protein expression on the test dataset using the stage 1 model and then predict the genetic component of the outcome trait using the stage 2 model. Hypothesis testing is performed to test the association between the predicted outcome trait and its observed values (Fig 1d).

Unlike the traditional TWAS approach, our framework performs hypothesis testing on an independent dataset with observed traits, rather than on the imputed traits themselves. This is crucial as testing with observed traits avoids any potential biases and uncertainties associated with imputed traits. In particular, any biases in the trait imputation process can affect downstream association tests if the imputed values are used as if they were observed (e.g. for hypothesis testing). An independent test dataset with observed traits provides an unbiased dataset against which the predictions can be validated. The association is then evaluated in a context that is free from any imputation biases. Furthermore, imputed traits inherently carry estimation uncertainty, but testing on an independent dataset with observed traits ensures that the variability in the data is accurately captured, leading to valid inference. Theoretical justification for this framework is provided in [36].
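
The three-dataset workflow can be illustrated with a small simulation. The sketch below is not the analysis pipeline used in this paper: the sample sizes, effect sizes, the choice $g(X) = X^2$, the added Gaussian noise standing in for trait imputation, and the quadratic polynomial standing in for the stage 2 neural network are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                   # number of cis-variants (illustrative)
beta = np.full(m, 0.5)                  # true stage 1 effects (simulation only)

def simulate(n):
    """Draw genotypes Z, expression X = Z beta + e1, and trait Y = g(X) + e2."""
    Z = rng.binomial(2, 0.3, size=(n, m)).astype(float)
    X = Z @ beta + rng.normal(size=n)
    Y = X**2 + rng.normal(size=n)       # g(X) = X^2 is an arbitrary nonlinear choice
    return Z, X, Y

# Dataset 1: reference panel with genotypes and measured expression (stage 1).
Z_ref, X_ref, _ = simulate(500)
A_ref = np.column_stack([np.ones(len(Z_ref)), Z_ref])
beta_hat, *_ = np.linalg.lstsq(A_ref, X_ref, rcond=None)

def predict_expression(Z):
    """Apply the fitted stage 1 model (intercept plus variant effects)."""
    return np.column_stack([np.ones(len(Z)), Z]) @ beta_hat

# Dataset 2: biobank with genotypes and an imputed trait (stage 2 training).
Z_bb, _, Y_bb = simulate(5000)
Y_imputed = Y_bb + rng.normal(scale=2.0, size=len(Y_bb))   # stand-in for trait imputation
stage2_coef = np.polyfit(predict_expression(Z_bb), Y_imputed, deg=2)  # stand-in for the network

# Dataset 3: independent test set with the observed trait (hypothesis testing).
Z_test, _, Y_test = simulate(1000)
Y_pred = np.polyval(stage2_coef, predict_expression(Z_test))
corr = np.corrcoef(Y_pred, Y_test)[0, 1]
print(f"correlation between predicted and observed trait: {corr:.3f}")
```

Only the third dataset contributes observed outcomes to the final association, which mirrors the separation between training on imputed traits and testing on observed traits described above.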

2.1.1. DeLIVR

A variety of methods are available to estimate the TWAS model introduced in Eqs 1 and 2. In this study we primarily focus on the recently published DeLIVR method, which uses neural networks to estimate $E(Y \mid Z)$ nonparametrically [36]. Stage 1 of DeLIVR is the same as in standard TWAS, where we regress the gene expression $X$ on the genotypes $Z$ to obtain $\hat{X} = Z\hat{\beta}$. In stage 2, DeLIVR estimates and performs inference on $E(Y \mid Z) = E(g(X) \mid Z)$ without explicitly learning $g(X)$. In [36], the authors showed that under some assumptions, estimating $E(g(X) \mid Z)$ is sufficient for testing the association between the trait and the predicted expression levels. Furthermore, this approach resulted in a much more stable estimate compared to other deep learning methods for instrumental variables regression, such as DeepIV, and consequently yielded higher statistical power in hypothesis testing [36,37].

Let $\hat{h}_\theta$ be a neural network parameterized by $\theta$ for estimating $E(g(X) \mid Z)$ with $\hat{X} = Z\hat{\beta}$. We solve for $\theta$ by minimizing the following loss,

$$L(Y, Z, \theta) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{h}_\theta(\hat{X}_i) \right)^2 + \lambda \|\theta\|_2^2, \tag{3}$$

where the last term is the ridge penalty.
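
To make the structure of Eq 3 concrete, the following minimal sketch evaluates a mean squared error plus ridge penalty for an arbitrary predictor. The function names `penalized_mse` and `toy_predict` are hypothetical, and the two-parameter linear predictor is only a stand-in for the neural network $\hat{h}_\theta$; it is not the DeLIVR implementation.

```python
def penalized_mse(y, x_hat, predict, theta, lam):
    """Mean squared error of predict(x, theta) plus the ridge penalty lam * ||theta||_2^2."""
    n = len(y)
    mse = sum((yi - predict(xi, theta)) ** 2 for yi, xi in zip(y, x_hat)) / n
    ridge = lam * sum(t ** 2 for t in theta)
    return mse + ridge

def toy_predict(x, theta):
    """Hypothetical stand-in for the network: intercept plus slope."""
    return theta[0] + theta[1] * x

y = [1.0, 2.0, 3.0]          # outcome values
x_hat = [0.0, 1.0, 2.0]      # predicted expression from stage 1
loss_fit = penalized_mse(y, x_hat, toy_predict, [1.0, 1.0], lam=0.1)   # perfect fit, penalty only
loss_null = penalized_mse(y, x_hat, toy_predict, [0.0, 0.0], lam=0.1)  # no penalty, error only
print(loss_fit, loss_null)
```

The two evaluations show the trade-off that $\lambda$ controls: the first parameter vector fits the data exactly but pays the penalty, while the second pays no penalty but leaves all of the error.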

2.1.2. TWAS-L and TWAS-LQ

We also consider the standard linear TWAS (TWAS-L) and a parametric nonlinear model proposed in [38] (TWAS-LQ) to compare with DeLIVR. Assuming $Y$, $X$, and $Z$ are standardized to have mean 0 and variance 1, we fit the following two stage 1 models: $X = Z\beta + \epsilon_1$ and $X^2 = Z\beta_2 + \epsilon_2$. Using the fitted models, we obtain $\hat{X} = Z\hat{\beta}$ and $\widehat{X^2} = Z\hat{\beta}_2$. In stage 2 of TWAS-L, we fit $Y = \hat{X}\theta + \epsilon$. In stage 2 of TWAS-LQ, we instead fit $Y = \hat{X}\theta_1 + \widehat{X^2}\theta_2 + \epsilon$.
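
A minimal sketch of the two parametric models on simulated data is given below. For brevity it fits both stages on a single simulated sample, whereas in practice the two stages use different datasets; the effect sizes, the quadratic trait model, and the centering of $X^2$ before regression are assumptions made for illustration.

```python
import numpy as np

def standardize(a):
    """Center to mean 0 and scale to variance 1 (column-wise for matrices)."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def ols_fit(design, response):
    """Least-squares coefficients and fitted values (no intercept; inputs are centered)."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef, design @ coef

rng = np.random.default_rng(1)
n, m = 2000, 5
Z = standardize(rng.binomial(2, 0.3, size=(n, m)).astype(float))
X = standardize(Z @ np.full(m, 0.5) + rng.normal(size=n))
Y = standardize(X**2 + rng.normal(size=n))
X_sq = X**2 - np.mean(X**2)          # centered so that no intercept is needed (assumption)

# Stage 1: predict X and X^2 from the genotypes.
_, X_hat = ols_fit(Z, X)
_, X2_hat = ols_fit(Z, X_sq)

# Stage 2: TWAS-L uses X_hat only; TWAS-LQ adds the predicted squared expression.
theta_l, fit_l = ols_fit(X_hat[:, None], Y)
theta_lq, fit_lq = ols_fit(np.column_stack([X_hat, X2_hat]), Y)

rss_l = float(np.sum((Y - fit_l) ** 2))
rss_lq = float(np.sum((Y - fit_lq) ** 2))
print(f"TWAS-L  theta = {theta_l[0]:.3f}, RSS = {rss_l:.1f}")
print(f"TWAS-LQ theta = ({theta_lq[0]:.3f}, {theta_lq[1]:.3f}), RSS = {rss_lq:.1f}")
```

Because TWAS-L is nested within TWAS-LQ, the in-sample residual sum of squares of TWAS-LQ can never exceed that of TWAS-L; whether the extra term is informative must be judged by a formal test.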

2.1.3. Hypothesis testing

Assuming that the unknown parameters in $\hat{h}_\theta$ have been estimated by DeLIVR with a training dataset, we use an independent test set to calculate $\hat{E}(g(X) \mid Z) = \hat{h}_\theta(\hat{X})$ and perform the following two tests in TWAS.

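  • Global test: We fit
    $$Y = \alpha + \beta \hat{E}(g(X) \mid Z) + \epsilon_G, \tag{4}$$

    where $\epsilon_G \sim N(0, \sigma_G^2 I_n)$ is an independent error term. Then we test the hypothesis $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$.

  • Nonlinearity test: We fit
    $$Y = \alpha + \beta_1 \hat{E}(X \mid Z) + \beta_2 \hat{E}(g(X) \mid Z) + \epsilon_{NL}, \tag{5}$$

    where $\epsilon_{NL} \sim N(0, \sigma_{NL}^2 I_n)$ is an independent error term. Then we test the hypothesis $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$.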
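
Both tests reduce to a Wald-type test of one regression coefficient. The sketch below, with the hypothetical helper `wald_test`, fits the regressions in Eqs 4 and 5 by ordinary least squares on simulated inputs. It uses a normal approximation to the t distribution for the p-value, which is an assumption that is reasonable only for large test sets, and the simulated predictions are stand-ins for the outputs of trained stage 1 and stage 2 models.

```python
import math
import numpy as np

def wald_test(y, predictors, index):
    """Fit y = alpha + predictors @ b + error by OLS and test H0: b[index] = 0.

    Returns (estimate, z_statistic, two_sided_p). The p-value relies on a
    normal approximation, which assumes a large sample.
    """
    y = np.asarray(y, dtype=float)
    P = np.asarray(predictors, dtype=float).reshape(len(y), -1)
    A = np.column_stack([np.ones(len(y)), P])      # add the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = resid @ resid / (len(y) - A.shape[1])  # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)
    j = index + 1                                   # skip the intercept
    z = coef[j] / math.sqrt(cov[j, j])
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return float(coef[j]), float(z), p

rng = np.random.default_rng(2)
n = 2000
x_hat = rng.normal(size=n)            # stand-in for the stage 1 prediction of X
h_hat = x_hat**2 + 0.5 * x_hat        # stand-in for the stage 2 prediction of g(X)
y = h_hat + rng.normal(size=n)        # observed trait in the test set

_, z_global, p_global = wald_test(y, h_hat, index=0)                     # Eq 4
_, z_nl, p_nl = wald_test(y, np.column_stack([x_hat, h_hat]), index=1)   # Eq 5
print(f"global test: z = {z_global:.1f}; nonlinearity test: z = {z_nl:.1f}")
```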

For a given gene, we could aggregate the test results from different methods by simply combining their p-values using the Cauchy combination test [39], which may further improve power over that of a single model.
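
As an illustration, the Cauchy combination test transforms each p-value to a standard Cauchy variate, takes a weighted sum, and converts the sum back to a p-value. The sketch below assumes equal weights by default and p-values strictly between 0 and 1; the function name `cauchy_combination` is hypothetical.

```python
import math

def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Weights must be nonnegative and sum to 1; equal weights are used by default.
    """
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k
    # Each term tan((0.5 - p) * pi) is standard Cauchy under the null hypothesis.
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # Upper-tail probability of the standard Cauchy distribution at stat.
    return 0.5 - math.atan(stat) / math.pi

p_same = cauchy_combination([0.01, 0.01])    # identical inputs are returned unchanged
p_mixed = cauchy_combination([0.001, 0.8])   # dominated by the smaller p-value
print(p_same, p_mixed)
```

The combined p-value is driven mostly by the smallest inputs, which is why aggregating across methods can retain the power of whichever model fits a given gene best.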

2.2. Trait imputation methods

In this section we summarize the methods we considered for imputing AD status. The imputed outcome trait values obtained from these methods were subsequently used to train the stage 2 models for TWAS/PWAS, as described above.

2.2.1. LS-imputation

LS-imputation leverages GWAS summary statistics from an independent study to impute disease phenotypes [15,16]. Suppose we have a GWAS summary (training) dataset $\{(\hat{\beta}_j, \hat{\sigma}_j): j = 1, \ldots, p\}$ with $p$ genetic variants. For a new individual-level (test) dataset in which we only have genotypes $X$ with sample size $n_2$, we can impute the missing trait values $Y$ for these $n_2$ individuals by solving the following optimization problem,

$$\tilde{Y} = \arg\min_{Y} \left\| \hat{\beta} - \frac{1}{n_2 - 1} X^T Y \right\|^2 = (n_2 - 1)(XX^T)^{+} X \hat{\beta} = (n_2 - 1)(X^T)^{+} \hat{\beta}, \tag{6}$$

where $(\cdot)^{+}$ denotes the Moore-Penrose generalized inverse. The objective is to find $\tilde{Y}$ such that its correlation with the genotypes closely matches observed correlations in an independent GWAS from the same population. Due to computational constraints when dealing with large matrices, particularly in cases where both $p$ and $n_2$ are large, LS-imputation is applied in smaller batches of size $m$ [16]. The batch size $m$ is chosen to balance a bias-variance trade-off; smaller values of $m$ typically result in higher bias but lower variance in the imputed $Y$. In practice, we try a few values of $m$ and choose the one that yields marginal association results similar to those obtained from the external GWAS.
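
The closed-form solution in Eq 6 can be checked numerically with a pseudoinverse. The sketch below is a toy verification of the algebra rather than the batched implementation of [16]: it uses continuous random values as a stand-in for genotypes, and it computes the marginal effects from the same individuals whose trait is then reconstructed, whereas in practice the effects come from an external GWAS. The explicit `rcond` cutoff is an assumption chosen to discard the near-zero singular value created by centering.

```python
import numpy as np

def ls_impute(X, beta_hat):
    """Return (n2 - 1) * pinv(X^T) @ beta_hat for an n2-by-p standardized genotype matrix X."""
    n2 = X.shape[0]
    return (n2 - 1) * np.linalg.pinv(X.T, rcond=1e-10) @ beta_hat

rng = np.random.default_rng(3)
n2, p = 20, 50                                     # more variants than individuals
X = rng.normal(size=(n2, p))                       # continuous stand-in for genotypes
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variant
Y = rng.normal(size=n2)
Y = Y - Y.mean()                                   # centered trait

beta_hat = X.T @ Y / (n2 - 1)                      # marginal OLS effects under standardized X
Y_tilde = ls_impute(X, beta_hat)
print("trait recovered from summary statistics:", np.allclose(Y_tilde, Y))
```

With $p \geq n_2$ and noise-free marginal effects the centered trait is recovered up to numerical error; with effects estimated in a different sample, the imputed values only approximate the genetic component of the trait.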

It is important to note that the LS-imputation method is based on ordinary least squares (OLS) estimation and is only directly applicable to traits where the GWAS summary statistics were obtained from a linear regression model for each variant. However, for a binary trait like AD, the GWAS summary statistics are typically obtained from logistic regression. To employ the LS-imputation method in this scenario, we first use the following approximation formulas to convert GLM-based summary statistics to OLS-based summary statistics [40],

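$$\hat{\beta} = \frac{e^{\hat{b}_0}}{(1 + e^{\hat{b}_0})^2} \hat{b}_1, \tag{7}$$
$$SE(\hat{\beta}) \approx \frac{e^{\hat{b}_0}}{(1 + e^{\hat{b}_0})^2} SE(\hat{b}_1), \tag{8}$$

where $\hat{b}_0$ is the estimated log odds of an individual with the non-counted allele being a diseased case and $\hat{b}_1$ is the GLM-based effect size.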
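
Eqs 7 and 8 apply the same scaling factor to the effect size and to its standard error. A minimal sketch, with the hypothetical function name `glm_to_ols` and made-up input values:

```python
import math

def glm_to_ols(b0_hat, b1_hat, se_b1):
    """Approximate OLS effect size and SE from logistic-regression estimates (Eqs 7 and 8)."""
    factor = math.exp(b0_hat) / (1.0 + math.exp(b0_hat)) ** 2
    return factor * b1_hat, factor * se_b1

# With b0_hat = 0 (baseline case probability of 0.5) the factor equals 1/4.
beta_ols, se_ols = glm_to_ols(b0_hat=0.0, b1_hat=0.4, se_b1=0.08)
print(beta_ols, se_ols, beta_ols / se_ols)
```

Because both quantities are multiplied by the same factor, the z-score of each variant is unchanged by the conversion.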

2.2.2. Proxy AD

To counteract the underrepresentation of late-onset diseases such as AD in large biobanks such as the UKB, where cohorts predominantly consist of middle-aged individuals, a novel methodology known as GWAS-by-proxy (GWAX) was introduced in [10]. Its authors constructed proxy AD cases by leveraging the health history information of the biobank participants’ parents. Since its introduction, GWAX has gained increasing popularity. The majority of recently published AD GWAS are meta-analyses that merge associations computed from clinically diagnosed AD cases with associations computed from GWAX proxy cases in order to enhance both sample size and statistical power [3,4,11–13]. The adoption of GWAX has sparked many discussions and led to the development of various related methods for constructing AD proxies [4,41].

In our analyses, we explored two distinct strategies for constructing proxy AD cases from individuals in the UKB. The first strategy, as employed in [13], designates a participant as an AD proxy case if they reported that at least one biological relative (parent or sibling) was diagnosed with AD. Participants who responded “Do not know” or “Prefer not to answer” when asked about their parents’ AD history are excluded. In this strategy no distinction is made between proxy AD cases and clinically diagnosed AD cases during analysis, i.e., both are together considered to be AD cases. We refer to this imputation strategy as “AD Proxy.”

The second strategy, as outlined in [4], involves evaluating whether participants’ biological parents have a history of AD and taking into account the parents’ current ages or ages at death. The proxy phenotype is then constructed as a linear score between 0 and 2 based on these factors. Each parent with a diagnosis of AD contributes 1 unit to the score, while each unaffected parent contributes the fraction $(100 - \text{age})/100$, where “age” is the current age of the parent if they are still alive or the parent’s age at death. The contribution for unaffected parents is capped at 0.32, which corresponds to the maximum population prevalence of AD. Participants with clinically diagnosed AD are given the maximum risk score of 2. Similarly to the first strategy, participants who responded “Do not know” or “Prefer not to answer” regarding their parents’ disease status or age are excluded. We refer to this imputation strategy as “AD Proxy2.”
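
The AD Proxy2 scoring rule can be sketched as follows. Two details are assumptions of this sketch rather than statements from the text: the cap of 0.32 is read as an upper bound on the contribution of an unaffected parent, and the contribution is floored at 0 for parents older than 100. The function names are hypothetical, and exclusion of participants with missing parental information is not shown.

```python
def parent_contribution(affected, age, cap=0.32):
    """Contribution of one parent: 1 if affected, else (100 - age) / 100 limited to [0, cap].

    Treating the cap as an upper bound and flooring at 0 are assumptions of this sketch.
    """
    if affected:
        return 1.0
    return min(max((100.0 - age) / 100.0, 0.0), cap)

def ad_proxy2_score(participant_diagnosed, parents):
    """parents: two (affected, age) tuples, with age = current age or age at death."""
    if participant_diagnosed:
        return 2.0                      # diagnosed participants receive the maximum score
    return sum(parent_contribution(affected, age) for affected, age in parents)

score_one_affected = ad_proxy2_score(False, [(True, 80), (False, 90)])   # 1 + 0.10
score_young_parents = ad_proxy2_score(False, [(False, 60), (False, 60)]) # 0.32 + 0.32 after capping
score_diagnosed = ad_proxy2_score(True, [(False, 70), (False, 75)])
print(score_one_affected, score_young_parents, score_diagnosed)
```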

2.2.3. PRS-CS

PRS-CS builds polygenic risk scores (PRS) by employing a high-dimensional Bayesian regression framework that incorporates a continuous shrinkage prior on genetic variant effect sizes [42]. In this paper, we used the Python implementation of PRS-CS available at https://github.com/getian107/PRScs. In particular, we applied PRS-CS-auto and used the provided European UKB reference panel. The risk scores computed by PRS-CS were directly used as the imputed trait values.

2.2.4. LDpred2

LDpred2 is another widely used method for constructing PRS, which leverages GWAS summary statistics and a linkage disequilibrium (LD) matrix [43]. LDpred2 is an enhanced and more efficient version of the original LDpred method [44]. In our analyses, we utilized LDpred2-auto as implemented in the R package bigsnpr (https://github.com/privefl/bigsnpr). Similarly to our application of PRS-CS, the risk scores computed by LDpred2 were directly used as the imputed trait values.

2.3. Imputation data sources

In this section we detail the datasets we used to facilitate trait imputation and the preprocessing steps we applied to each one. Recall that LS-imputation, PRS-CS, and LDpred2 require summary-level data from an external GWAS in addition to individual-level genotype data on the individuals for whom trait imputation is to be performed. We used summary-level AD GWAS data from either the EADB consortium or the IGAP consortium, and individual-level genotype data from the UKB. Imputation using AD Proxy and AD Proxy2, on the other hand, is only based on family history. For those methods we used individual-level data from the UKB. Note that the GWAS studies described in this section were only used for trait imputation, and they should not be confused with the biobank data used in stage 2 of the TWAS/PWAS workflow.

2.3.1. EADB GWAS summary data

The European Alzheimer & Dementia Biobank (EADB) consortium aggregates data from various European GWAS consortia focused on AD [13]. The EADB GWAS results were meta-analyzed with a GWAS-by-proxy conducted on the UKB dataset, following the AD Proxy strategy described in Sect 2.2.2, with a total of 21,101,114 genetic variants. The meta-analyzed EADB stage 1 GWAS included 85,934 AD cases (39,106 clinically diagnosed AD cases and 46,828 proxy AD cases) and 401,577 controls, for a total of 487,511 individuals.

2.3.2. IGAP GWAS summary data

The International Genomics of Alzheimer’s Project (IGAP) is a large three-stage study based on GWAS analyses of individuals with European ancestry [5]. The IGAP stage 1 meta-analysis included 11,480,632 SNPs in 21,982 AD cases and 41,944 cognitively normal controls from four consortia: the Alzheimer Disease Genetics Consortium (ADGC); the European Alzheimer’s Disease Initiative (EADI); the Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium (CHARGE); and the Genetic and Environmental Risk in Alzheimer’s Disease/Defining Genetic, Polygenic and Environmental Risk for Alzheimer’s Disease Consortium (GERAD/PERADES). In the IGAP stage 2 meta-analysis, 11,632 SNPs were genotyped and tested for association in an independent set of 8,362 AD cases and 10,483 controls. Meta-analysis of variants selected for analysis in stage 3A (n = 11,666) or stage 3B (n = 30,511) brought the final sample size to 35,274 clinical or autopsy-documented AD cases and 59,163 controls.

2.3.3. UKB individual-level data

We imputed the AD status of 367,182 self-identified White British UKB individuals [34] using each of the trait imputation methods described in Sect 2.2. For the GWAS-based trait imputation methods, we utilized imputed UKB genotypic data along with EADB or IGAP GWAS summary data. For each of the two AD GWAS datasets, we first extracted all of the genetic variants in common between the GWAS and UKB data. Subsequently, we filtered out variants with a minor allele frequency (MAF) less than 0.05, missing rates larger than 10%, and those that failed a Hardy-Weinberg equilibrium exact test with p-values less than 0.001. Furthermore, we pruned out variants in high LD with a window size of 50 base pairs (bp), a step size of 1 bp, and an $r^2$ threshold of 0.8. For the EADB GWAS dataset, we ended up with 1,035,821 high-quality SNPs in common with UKB data. A total of 76,417 SNPs had p-values less than 0.05, from which we randomly selected 70,000 SNPs to use in the genotype-based trait imputation methods LS-imputation, PRS-CS, and LDpred2. For the IGAP GWAS dataset, we ended up with 1,053,329 high-quality SNPs in common with UKB data. A total of 63,094 SNPs had p-values less than 0.05, from which we randomly selected 60,000 SNPs to use in the genotype-based trait imputation methods.

Recall that LS-imputation requires splitting the individuals into smaller batches and performing the imputation batch-by-batch. For EADB data, we considered batch sizes of 40,000, 50,000, and 60,000. For IGAP data, we considered batch sizes of 30,000, 40,000, and 50,000.

For the family-based trait imputation methods, we obtained diagnosed AD cases from self-reported ICD10 diagnoses (data field 41270) and ICD10 cause of death (data fields 40001 and 40002). Our AD Proxy analyses included a total of 367,182 UKB individuals, comprising 313,294 controls and 53,888 cases (51,815 AD Proxy and 2,073 diagnosed AD). Due to the absence of parental information for some individuals, AD Proxy2 scores were computed for a subset of 355,325 individuals within the aforementioned group. Finally, we adjusted the AD Proxy and AD Proxy2 outcomes for a set of standard covariates: top 20 genetic principal components (PCs), age, and sex. Note that we did not regress out any covariates from the AD outcomes imputed using LS-imputation, PRS-CS, and LDpred2, since those imputation methods predict AD risk based on external GWAS summary statistics that already incorporate covariate adjustment.

2.4. TWAS/PWAS data sources

2.4.1. Stage 1 data

GTEx gene expression data. To train transcriptome imputation models, we used genotype data and whole blood gene expression data from GTEx v8, which includes expression data for 19,626 genes [33]. To ensure consistency with the UKB and GWAS data, we subset the GTEx data to individuals with genetically-inferred European ancestry (n = 558). We regressed out the effects of 68 covariates provided by GTEx, and used the standardized residuals as the new gene expression levels. Next, we extracted the local expression quantitative trait loci (cis-eQTLs) of each gene from the 100k bp window around its coding region (100k bp upstream from the transcription start site and 100k bp downstream from the transcription end site). In particular, we identified cis-eQTLs as follows: Any variants with MAF ≤ 0.05 or with any missing values were removed. The remaining local variants for each gene were pruned by removing those with an absolute value of pairwise Pearson correlation ≥ 0.8. After pruning, we kept the top 50 cis-eQTLs for each gene that are most strongly correlated with its expression.

We applied linear regression with backward selection using AIC as the criterion to build an expression imputation model for each gene. If the resulting model’s F-statistic was ≤ 10 or the model only had one variant, we excluded that gene from further analysis due to its weak association with the selected cis-eQTLs. A total of 3,880 genes had a stage 1 F-statistic ≥ 10 and were included in the stage 2 TWAS analysis.
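
The exclusion rule can be written as a short check. The sketch below assumes that the F-statistic in question is the overall F-statistic of the fitted linear model, computed from its $R^2$; the text does not specify this, and the function names and example values are hypothetical.

```python
def regression_f_statistic(r_squared, n_samples, n_predictors):
    """Overall F-statistic of a linear model with an intercept, computed from R^2."""
    numerator = r_squared / n_predictors
    denominator = (1.0 - r_squared) / (n_samples - n_predictors - 1)
    return numerator / denominator

def keep_gene(r_squared, n_samples, n_predictors, threshold=10.0):
    """Drop a gene if its model has only one variant or if F <= threshold."""
    if n_predictors <= 1:
        return False
    return regression_f_statistic(r_squared, n_samples, n_predictors) > threshold

f_strong = regression_f_statistic(0.5, 106, 5)    # (0.5 / 5) / (0.5 / 100)
print(f_strong, keep_gene(0.5, 106, 5), keep_gene(0.05, 558, 10))
```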

UKB protein expression data. In addition to TWAS, we also examined associations between the genetic component of protein expression and AD by performing PWAS. To train proteome imputation models, we used imputed genotype data and plasma proteomic data from the UKB, which consists of approximately 36,000 individuals with self-identified White British ancestry (the exact number varies between proteins) and 2,823 proteins coded by autosomal genes [35]. We mapped the proteins to their corresponding genes using UniProt to obtain their cis-regions, and then applied quality control steps and local protein quantitative trait loci (cis-pQTL) selection procedures analogous to those described above for gene expression data, except that the covariates we adjusted for were age, sex, age², sex×age, sex×age², and the top 20 genetic PCs. A total of 1,679 proteins had a stage 1 F-statistic ≥ 10 and were included in the stage 2 PWAS analysis.

We also considered a different QC process and used LASSO [45] and Elastic-Net [46] as the stage 1 model. Additional analyses using brain hippocampus tissue were also performed with the new stage 1 model and QC process. For full details, see S1 Sect 5.

2.4.2. Stage 2 phenotypes

High-density lipoprotein cholesterol. We considered high-density lipoprotein (HDL) cholesterol as an outcome trait to evaluate the reliability of using an LS-imputed trait to train TWAS stage 2 models. In the UK Biobank, missing phenotypes do not pose an issue for HDL. Thus, we imputed HDL trait values only to compare the performance of TWAS methods trained on imputed versus observed traits. By treating the results obtained from training on the observed trait as the ground truth, we could assess the Type I error rate and power of TWAS performed with models trained on the imputed trait. For this analysis, we retained individuals with genetically-inferred White British ancestry and no missing HDL values, resulting in a final dataset of 356,351 individuals.

Alzheimer’s disease. For our application of TWAS and PWAS to AD, we used data both from the UKB and from release 4 of the ADSP. Stage 2 TWAS models were trained on UKB data with imputed AD status, while association testing was only performed on clinically diagnosed AD cases from ADSP. Depending on the TWAS method used (see Sect 3.3.1), either a portion or the entirety of the ADSP data were reserved for association testing. Note that the ADSP cohorts were designed to ensure an enrichment of AD cases and hence have complete diagnostic information for their participants, so we took those clinical case/control phenotypes as the ground truth and did not perform imputation on the ADSP dataset.

2.4.3. Stage 2 datasets

UK Biobank. We used UKB individuals with imputed AD status as the outcome trait when training stage 2 models for TWAS and PWAS. Namely, we separately considered four imputation strategies: AD status imputed from the EADB GWAS using LS-imputation, AD status imputed from the IGAP GWAS using LS-imputation, AD status derived from family history using the AD Proxy method, and AD status derived from family history using the AD Proxy2 method. The PRS methods PRS-CS and LDpred2 were not considered in our main analysis due to their poor performance in recovering the genetic architecture of AD (see Sect 3.1 and the S1 Text). A total of 367,182 self-reported White British individuals were included in our analysis, and the same genotype quality control steps and AD status covariate adjustments were applied as previously described in Sect 2.3.3. We set aside 10% of the data as a validation set for early stopping, while the remaining portion was utilized for training the neural network in DeLIVR.

Alzheimer’s Disease Sequencing Project. For sample-level quality control, we followed ADSP recommendations to only keep one sample from each set of genetically-inferred duplicates according to an identical by descent (IBD) analysis ($\hat{\pi} > 0.98$), prioritizing the sample with better sequence call rate or based on reconciliation of phenotypic data. Next, we again followed ADSP recommendations to only keep variants which passed ADSP quality control steps (VFLAGS_One_subgroup = 0) and which have a plausible allele balance of heterozygous calls (ABHet) across samples (0.25 < ABHet < 0.75). This resulted in a total of 314,809,384 high-quality variants, from which we extracted 1,211,080 with MAF ≥ 0.05 and Hardy-Weinberg equilibrium exact test p-value ≥ 0.001. To create a set of harmonized AD case/control phenotypes across the diverse cohorts included in ADSP, we applied the script provided by the ADSP Phenotype Harmonization Consortium (https://github.com/NIAGADS/ADSPIntegratedPhenotypes) with default settings. This yielded 11,308 cases and 16,685 controls for a total sample size of 27,993.

We selected individuals who self-identified as White and Non-Hispanic to match the genetic ancestry of the UKB population. We then excluded individuals with missing age or sex data, leaving 5,857 cases and 4,597 controls for a final sample size of 10,454 individuals. To compute genetic PCs, we first imputed missing genotypes based on allele frequencies and then performed LD clumping to remove variants with a squared correlation higher than 0.2. The top 20 genetic PCs were computed based on the remaining 138,029 variants. Finally, AD status was regressed on several covariates: age, sex, age², the top 20 genetic PCs, sex×age, and sex×age².

2.5. Simulation

We conducted a simulation study to complement our Type I error control and power analysis. The setup and results are provided in S1 Sect 4.

3. Results

In the following sections, we first evaluate the performance of each AD imputation method on GWAS tasks to justify our selection of imputation methods for TWAS/PWAS. We then detail our TWAS analysis workflow for HDL cholesterol and the results we obtained. Note that the target application of our study is AD, for which the number of UKB individuals with available diagnoses is very low. However, we first applied our approach to HDL cholesterol, for which almost all UKB individuals have available measurements, in order to address some key questions regarding the Type I error rate and power of using imputed traits in TWAS/PWAS. In particular, we (a) assessed whether training on imputed traits and testing on observed traits can control the Type I error rate, and (b) compared the power of our imputation-based approach with a TWAS analysis in which stage 2 models are trained and tested on a fully observed trait with limited sample size. Lastly, we present our TWAS and PWAS workflows and analysis results for AD.

3.1. Marginal GWAS analysis

For the marginal GWAS analysis, we first imputed AD status for UKB individuals using each of the imputation methods introduced in Sect 2.2, and then performed GWAS with the imputed traits. An additional GWAS analysis was also performed with the diagnosed AD cases for comparison. For the binary AD case/control phenotype obtained using AD Proxy, we fitted a logistic regression model for each variant to obtain its GWAS summary statistics. We then adjusted the resulting marginal effect sizes and standard errors (SEs) by multiplying them by a factor of 2, as recommended in [10] and [13]. For the continuous AD phenotypes estimated using the other imputation methods, we instead fitted a linear regression model to obtain GWAS summary statistics. To make these summary statistics comparable, we used the formulas in Eqs 7 and 8 to convert the GLM-based summary statistics to OLS-based summary statistics. Here we only provide a brief summary of the results as they are not the main focus of this paper. Additional details can be found in the S1 Text.
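As a small illustration of the factor-of-2 adjustment for proxy-based summary statistics, the sketch below rescales an effect size and its standard error together. The function name is hypothetical, and the conversion in Eqs 7 and 8 is not reproduced here.

```python
def rescale_proxy_gwas(beta, se, factor=2.0):
    """Multiply a proxy-GWAS effect size and its standard error by the same factor."""
    return factor * beta, factor * se

# Toy numbers, not taken from the paper.
toy_beta, toy_se = rescale_proxy_gwas(0.1, 0.02)
toy_z_before = 0.1 / 0.02
toy_z_after = toy_beta / toy_se
```

Because the effect size and SE are scaled by the same constant, the z-score (and hence the p-value) of each variant is unchanged; only the scale of the effect estimate is affected.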

3.1.1. LS-imputation accurately recovers genetic information in GWAS summary data

We first compared the marginal effect sizes, standard errors (SEs), and log10(p)-values estimated using imputed AD traits to those reported in the EADB (Figs A-C in S1 Text) and IGAP (Figs D-F in S1 Text) GWAS datasets. Regardless of the GWAS dataset considered, all three quantities estimated with LS-imputed AD status exhibited much higher correlations with the external GWAS data compared to those derived using AD Proxy (Figs G-L in S1 Text). Notably, the correlation between the marginal effect sizes estimated with LS-imputed AD status and the original GWAS estimates reached as high as 0.998, indicating that LS-imputation provides nearly unbiased marginal effect sizes. In contrast, the GWAS effect sizes obtained using AD Proxy outcomes exhibited greater bias.

Next, we visually compared the Manhattan plots of the original GWAS data to the GWAS results obtained using AD status imputed by different methods (Figs M-N in S1 Text). The distribution of significant SNPs identified with LS-imputed AD status (using either IGAP or EADB summary statistics) most closely resembled that obtained from the corresponding GWAS results (IGAP or EADB). The second-best results were obtained with AD Proxy and AD Proxy2. We found that LS-imputation can be more informative than AD Proxy and AD Proxy2, but its performance depends on the sample size and quality of the external GWAS used. Conversely, GWAS analyses performed using AD status imputed by PRS-CS and LDpred2 revealed an inflated number of significant SNPs. This issue was previously highlighted by [15], who noted that any variants included in PRS-CS (or any other PRS) models, along with those in linkage disequilibrium (LD) with them, would be deemed significant given a sufficiently large sample size. We also present Venn diagrams of the significant SNPs identified by different GWAS analyses (Fig O in S1 Text). For this comparison, we considered five different models, excluding those that used PRS-imputed traits due to the inflated numbers of false positives. The conclusions align with those from the Manhattan plots. The distribution of significant SNPs identified using LS-imputed AD status closely resembled that from the corresponding observed GWAS results. Furthermore, LS-imputation appeared to be more informative than AD Proxy or AD Proxy2, depending on the training GWAS data.

3.1.2. Comparison of differently imputed AD traits

We explored the similarity of AD traits obtained using different imputation methods (Tables A and B in S1 Text). For every pair of methods, we fitted a linear or a logistic regression model using the AD status from one method as the response and the AD status from the other method as the predictor. Then we calculated the R² (for linear regression models) or Nagelkerke’s R² (for logistic regression models) for each pair of methods. We noticed that the R²/Nagelkerke’s R² values between AD Proxy and AD Proxy2 were significantly higher than between any other pairs, which is not surprising as both AD Proxy and AD Proxy2 were constructed mainly with the participants’ parental information. The R² between the two linear imputation methods, PRS-CS and LDpred2, was also higher than between other pairs. For all remaining pairs, the R²/Nagelkerke’s R² values were fairly low, suggesting that the AD outcomes imputed by different methods are nearly uncorrelated with each other. Thus, it might be possible to aggregate information from multiple imputation methods to gain power.
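For reference, the two similarity measures used above have the following standard definitions. The notation is introduced here for illustration and is not taken from the paper: y_i is the response, ŷ_i the fitted value, ȳ the sample mean, n the number of individuals, L_0 the likelihood of the intercept-only logistic model, and L_1 the likelihood of the fitted logistic model.

```latex
% Linear regression (continuous response)
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

% Logistic regression (binary response): Nagelkerke's rescaled R^2
R^2_{\mathrm{Nagelkerke}} = \frac{1 - (L_0 / L_1)^{2/n}}{1 - L_0^{2/n}}
```

With a single predictor, the linear-regression R² equals the squared Pearson correlation between the two traits, so it does not depend on which trait is treated as the response. Nagelkerke's R² divides the Cox-Snell R² by its maximum attainable value so that it can reach 1.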

In summary, LS-imputation demonstrated the best performance in GWAS analysis, as both the distribution of significant SNPs and the summary statistics most closely resembled those from the IGAP/EADB GWAS data. Due to their linearity and relatively poorer performance in GWAS, we decided to exclude PRS-CS and LDpred2 from our TWAS and PWAS analyses. Although AD Proxy performed worse than LS-imputation, we retained it in our TWAS/PWAS analyses due to its widespread use for imputing AD cases. However, we excluded AD Proxy2 because of its high correlation with AD Proxy.

3.2. Proof-of-concept: TWAS analysis for HDL

3.2.1. Analysis workflow for HDL cholesterol

Fig 2 shows our workflow for the TWAS analysis of HDL cholesterol. We created three sets of data for our analysis: (Z1, X1, Y1), (Z2, X2, Y2), and (Z2, X2, Ỹ2). The first dataset has a sample size of n1 and the latter two datasets each have a sample size of n2, where n1 + n2 = n = 178,176. Y1 and Y2 are observed traits, whereas Ỹ2 is the LS-imputed trait. We will refer to (Z1, X1, Y1) as the observed data, (Z2, X2, Ỹ2) as the imputed data, and (Z1, X1, Y1) ∪ (Z2, X2, Y2) as the complete data. We tested the models on different observed sample sizes, specifically n1 = 20,000, 40,000, and 60,000.

Fig 2. Flow chart illustrating the TWAS analysis steps for high-density lipoprotein (HDL) cholesterol.

Fig 2

DeLIVR Observed: trained and tested on the observed data with sample splitting; DeLIVR Imputed: trained on the imputed data and tested on the observed data; DeLIVR Complete: trained and tested on the complete data with sample splitting.

We developed four distinct approaches to assess the efficacy of the LS-imputation technique based on these three datasets. For the first approach, we divided the observed data (Z1, X1, Y1) into training, validation (for parameter tuning), and testing subsets and used them to train and test the DeLIVR method. The resulting model is denoted as DeLIVR Observed. For the second approach, we trained the DeLIVR method using the imputed data (Z2, X2, Ỹ2) and tested it on the observed data (Z1, X1, Y1). This model is referred to as DeLIVR Imputed. In our third approach, we trained the DeLIVR method using (Z2, X2, Y2) and again tested on the observed data (Z1, X1, Y1). This model is named DeLIVR Complete since it incorporates all available data for both training and testing. Finally, we trained and tested TWAS-LQ on the observed data, which we refer to as TWAS-LQ Observed.

Note that all models were tested on the same set to allow for a direct comparison of the impact of different training data on the association test results. We treated the genes identified by TWAS-LQ Observed, DeLIVR Observed, and DeLIVR Complete as “true positives” and evaluated whether DeLIVR Imputed identified additional genes, which would help assess the Type I error rate of training on imputed traits. We then compared the number of genes discovered by DeLIVR Imputed with those identified by DeLIVR Observed and TWAS-LQ Observed to determine if training with the imputed trait improved power—a key consideration in using imputed traits for TWAS/PWAS.

3.2.2. LS-imputed HDL enhances power while maintaining type I error rates comparable to real data

In this section, we evaluate the set of genes identified by DeLIVR when trained on imputed HDL compared to those identified by models trained on observed and complete data, focusing on Type I error control and potential power improvements. Defining true positives in real-data analyses is inherently challenging; thus, our primary aim is to demonstrate that models trained on imputed data maintain Type I error rates comparable to those trained on complete data.

We used the set of genes identified by models trained on observed and complete data as the gold standard. Genes uniquely identified by the model trained on the imputed trait were considered “false positives” introduced by the imputation process. To assess potential power improvements, we examined genes identified using imputed data but not observed data. If these genes were also identified using complete data, this would suggest that training on the imputed trait effectively simulates an increase in sample size, which is the primary motivation for our work.
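The comparison logic described above amounts to a few set operations on the lists of significant genes. The sketch below is illustrative only: the function name, the dictionary keys, and the gene labels (G1 to G4) are hypothetical placeholders. Following the workflow description in Sect 3.2.1, the reference set is taken as the union of genes found by DeLIVR Observed, DeLIVR Complete, and TWAS-LQ Observed.

```python
def compare_gene_sets(imputed, observed, complete, twas_lq):
    """Summarize how genes found by DeLIVR Imputed relate to the other methods.

    Each argument is a set of significant gene names for one method.
    """
    reference = observed | complete | twas_lq
    gained = imputed - observed
    return {
        # Found only when training on the imputed trait: potential false positives.
        "potential_false_positives": imputed - reference,
        # Found with imputed data but not with the smaller observed data.
        "gained_over_observed": gained,
        # Gained genes that the complete-data model also finds.
        "confirmed_by_complete": gained & complete,
    }

# Toy example with placeholder gene labels.
toy_result = compare_gene_sets(
    imputed={"G1", "G2", "G3", "G4"},
    observed={"G1"},
    complete={"G1", "G2"},
    twas_lq={"G3"},
)
```

In this toy example G4 would count as a potential false positive, while G2 is a gain over the observed-data model that is also recovered with the complete data.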

Fig 3 presents the number of significant genes identified for HDL cholesterol by each TWAS method on a UKB data subset, evaluated across three different sample sizes and two significance cutoffs (Bonferroni and FDR). When the observed data sample size (n1) was 20,000, DeLIVR Imputed discovered two or three more genes than DeLIVR Complete; however, since these genes were also discovered by TWAS-LQ Observed, they are not considered additional false positives introduced by the imputation process. Similar results were observed for n1=40,000 and n1=60,000, with one exception: at n1=40,000, DeLIVR Imputed uniquely identified one or four gene(s) missed by all other methods, which could be potential false positives. In all other cases, DeLIVR Imputed effectively controlled the Type I error rate.

Fig 3. Venn diagrams showing the numbers of significant genes identified for HDL cholesterol using different observed sample sizes and significance cutoffs.

Fig 3

Top row: Bonferroni cutoff 1.3 × 10⁻⁵; Bottom row: FDR cutoff 0.1. Sample sizes: (a), (d): 20k; (b), (e): 40k; (c), (f): 60k. DeLIVR Imputed: trained on imputed data and tested on observed data; DeLIVR Observed: trained and tested on observed data with sample splitting; DeLIVR Complete: trained and tested on complete data with sample splitting; TWAS-LQ Observed: trained and tested on observed data with sample splitting.

DeLIVR Imputed demonstrated higher power than DeLIVR Observed. Specifically, all genes identified by DeLIVR Observed were also identified by DeLIVR Imputed. Furthermore, when using the Bonferroni cutoff of 1.3 × 10⁻⁵ (panels a, b, and c), DeLIVR Imputed identified 6 out of 7 and 13 out of 18 genes discovered by DeLIVR Complete at n1 = 40,000 and n1 = 60,000, respectively. The results using an FDR cutoff of 0.1 were similar. These findings indicate that using the LS-imputed trait for training not only improved power compared to training on smaller datasets with observed traits but also maintained a level of power comparable to using the complete dataset. This suggests that the power increase achieved by LS-imputation is akin to a sample size increase, with additional signals identified using imputed traits potentially being detectable with more observed trait data.
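The two significance cutoffs can be sketched as follows. The Bonferroni cutoff is simply the nominal level divided by the number of tests. The FDR procedure is not specified in this section, so the sketch assumes the Benjamini-Hochberg step-up procedure; that choice, and the function names, are assumptions made for illustration.

```python
import numpy as np

def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff controlling the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(pvals, q=0.1):
    """Boolean mask of rejected hypotheses under the Benjamini-Hochberg procedure.

    Assumption: the paper's FDR cutoff may be implemented differently.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        largest = np.nonzero(below)[0].max()   # largest rank meeting its threshold
        reject[order[:largest + 1]] = True     # reject everything up to that rank
    return reject

toy_cutoff = bonferroni_cutoff(0.05, 3880)
toy_reject = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], q=0.1)
```

With the 3,880 genes reported for Table 1, `bonferroni_cutoff(0.05, 3880)` gives roughly 1.3 × 10⁻⁵, matching the cutoff quoted in the text. In the toy p-value list the first three hypotheses are rejected at an FDR level of 0.1.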

Finally, we conducted simulations to evaluate Type I error control and power increase as a complement to our real-data analysis. The conclusions were consistent with those from the real-data analysis: DeLIVR Imputed controlled the Type I error rate at the nominal level of 0.05 and demonstrated significantly higher power than DeLIVR Observed. Full details are provided in S1 Sect 4.

3.3. TWAS/PWAS analyses for Alzheimer’s disease

3.3.1. Analysis workflow for Alzheimer’s disease

In our primary analysis for AD, we trained the stage 2 DeLIVR models on UKB data with imputed AD outcomes and then performed hypothesis testing on clinically diagnosed cases/controls from the ADSP data (Fig 4). Specifically, we first imputed AD status for the UKB individuals using LS-imputation and AD Proxy. For LS-imputation, we created two sets of imputed traits: one using IGAP GWAS summary data and the other using EADB GWAS summary data. We then trained DeLIVR on each of the three imputed UKB datasets and performed association testing on the ADSP dataset. The same workflow and methods were used for both TWAS and PWAS, the only difference being that TWAS relied on transcriptomic data to train expression prediction models in stage 1 while PWAS utilized proteomic data.

Fig 4. Flow chart illustrating the TWAS and PWAS analysis steps for AD.

Fig 4

In stage 1, gene expression prediction models for TWAS were trained on GTEx data, and protein expression prediction models for PWAS were trained on UKB data. The model weights were then used to predict expression levels for UKB and ADSP individuals. In stage 2, DeLIVR models were trained using predicted expression levels from stage 1 and one of the following outcome traits: AD status imputed in UKB using LS-imputation and the IGAP GWAS, AD status imputed in UKB using LS-imputation and the EADB GWAS, AD status imputed in UKB using the AD Proxy method, or clinically diagnosed AD status in ADSP. Association testing was performed using the differently trained stage 2 models applied to clinically diagnosed AD status in ADSP individuals.

In a separate analysis, we considered using the ADSP data for both training and testing, following the methodology from the original DeLIVR paper [36]. For that analysis the ADSP data were divided into training, validation (for parameter tuning), and test sets, to which we applied the DeLIVR method (Fig 4). Our goal was to compare the performance of training on the imputed trait from a large, independent biobank and testing on the real trait against the performance of both training and testing on the real trait. To compare the DeLIVR results with those from a parametric TWAS method, we also applied TWAS-L and TWAS-LQ, which were trained and tested on ADSP data.

All analyses were conducted three times with varying training parameters, and the resulting p-values were combined using the Cauchy combination test [47]. Note that association testing was always performed on the observed ADSP data. We intentionally refrained from using imputed AD outcomes for hypothesis testing for two reasons. First, using the imputed trait for hypothesis testing in TWAS requires more justification. Second, the LS-imputed trait exhibits correlation among individuals, making it difficult to create an independent test set as required by DeLIVR.
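The Cauchy combination step can be sketched in a few lines. Each p-value is transformed to a standard Cauchy variate, the variates are averaged, and the average is mapped back to a p-value. Equal weights are an assumption here, since the weights used by the authors are not stated in this section, and the function name is hypothetical. This simple version does not include the special handling of extremely small p-values found in production implementations.

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Combine p-values with the Cauchy combination test.

    Assumption: equal weights unless given. Not numerically robust for
    p-values at or extremely close to 0 or 1.
    """
    p = np.asarray(pvals, dtype=float)
    if weights is None:
        w = np.full(p.size, 1.0 / p.size)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    statistic = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(statistic) / np.pi

# Toy usage: three runs of the same analysis with different training parameters.
toy_combined = cauchy_combination([0.03, 0.03, 0.03])
```

When all input p-values are equal, the combined p-value equals that common value, which is a convenient sanity check. The method is attractive here because it remains approximately valid when the combined p-values are correlated, as repeated runs on the same data would be.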

3.3.2. DeLIVR with imputed AD traits identified Alzheimer’s risk genes missed by models trained with observed AD status

Table 1 shows the genes identified by each model. The results focus on the global test (see Sect 2.1.3) since the nonlinear test did not yield any statistically significant findings. The DeLIVR model trained on LS-imputed UKB data using IGAP or EADB summary statistics identified the gene FNBP1L, which aligns with recent literature suggesting its association with AD [48]. Similarly, DeLIVR trained on AD Proxy outcomes identified CR1L, a gene whose variants have been previously linked to AD risk [5]. Furthermore, the gene MS4A6A, well-recognized for its relation to AD, was consistently identified by DeLIVR when trained on LS-imputed data using EADB summary statistics, as well as by both TWAS-L and TWAS-LQ models [49–54]. In contrast, the DeLIVR model trained directly on the observed ADSP data did not identify any of these key genes, possibly due to the much smaller available sample size. As a side note, the well-known AD risk gene APOE was not identified, as it did not pass stage 1 QC, regardless of the tissue type or stage 1 model used. It also exhibited a small R² and a large p-value in the precomputed stage 1 model provided by FUSION [19]. Fig 5a shows an UpSet plot summarizing the overlap and uniqueness of genes identified by each method using a 1 × 10⁻³ significance threshold. The full table of p-values is provided in Table J in S1 Text. Relaxing the threshold revealed 17 genes uniquely identified by the DeLIVR models. The Q-Q plot (Fig 5b) displays the distribution of p-values for the TWAS results across all tested genes. The observed p-value distribution closely aligns with the expected null distribution, indicating that the models did not exhibit systematic inflation of the test statistics. These results underscore the efficacy of using imputed traits to uncover nonlinear genetic associations with AD that might be missed by direct analysis of limited observed data alone.

Table 1. p-values of the three genes identified by at least one method.

Bonferroni cutoff: 0.05/3880 = 1.3 × 10⁻⁵. The top row shows the models evaluated. The second row shows the training sets used. All models were tested on the ADSP data. p-values smaller than the Bonferroni cutoff are highlighted in bold.

| Model         |     | DeLIVR | DeLIVR      | DeLIVR      | DeLIVR   | TWAS-L | TWAS-LQ |
| Training data |     | ADSP   | LS-imp IGAP | LS-imp EADB | AD Proxy | ADSP   | ADSP    |
| Gene          | Chr |        |             |             |          |        |         |
| FNBP1L        | 1   | 8.5e-2 | 1.2e-5      | 8.3e-6      | 2.6e-5   | 4.7e-2 | 7.1e-3  |
| CR1L          | 1   | 2.8e-3 | 6.4e-5      | 1.9e-5      | 2.4e-6   | 3.8e-1 | 3.5e-1  |
| MS4A6A        | 11  | 1.5e-2 | 3.7e-5      | 3.9e-7      | 2.4e-4   | 2.2e-7 | 5.8e-7  |

Fig 5. Side-by-side UpSet plot and Q-Q plot of all models for TWAS.

Fig 5

“ADSP” refers to the DeLIVR model trained on the ADSP data (observed AD status). “TWAS-L” and “TWAS-LQ” refer to the standard TWAS model and the parametric TWAS-LQ model trained on the ADSP data, respectively. “LS-imp EADB” and “LS-imp IGAP” refer to the DeLIVR model trained on the LS-imputed AD status using EADB and IGAP as the GWAS data, respectively. “AD Proxy” refers to the DeLIVR model trained on the proxy AD status. “Combined” refers to the results of the Cauchy combination test, which combines the p-values from “LS-imp EADB,” “LS-imp IGAP,” and “AD Proxy.”

Additional analyses using different stage 1 QC criteria and models were conducted for both whole blood and brain hippocampus tissues, as detailed in S1 Sect 5. The conclusions remained largely unchanged with one notable exception: the Cauchy combination test identified more genes than using any single set of imputed trait values. This result is likely due to the distinctive gene sets identified by each set of imputed trait values.

3.3.3. PWAS performed on imputed Alzheimer’s status identified known and putative risk proteins

Table 2 shows the significant proteins identified by at least one PWAS method. Across all models, the well-established AD risk proteins apolipoprotein E (APOE) and apolipoprotein C-I (APOC1) were consistently identified. CR1 is another protein known to be associated with AD, which was identified by DeLIVR trained on AD status imputed by LS-imputation with EADB GWAS data, as well as by the two parametric models. BCAM was also identified by both parametric models, as well as by DeLIVR when trained on AD outcomes imputed by AD Proxy and by LS-imputation with IGAP summary statistics. RARRES2 was uniquely identified by DeLIVR trained on imputed AD outcomes, namely when we used the imputation methods of AD Proxy and LS-imputation with IGAP GWAS data. Recent research has suggested that this protein may act as a risk factor for AD [55,56], supporting the potential validity of our findings. Interestingly, LCN15 was uniquely identified by DeLIVR trained on the observed ADSP data, despite its smaller sample size. There was also one protein, CA13, that was uniquely identified by TWAS-LQ. Fig 6a presents the UpSet plot using a significance cutoff of 1 × 10⁻³. Unlike the TWAS results, the parametric models using the observed AD status identified the highest numbers of proteins. However, DeLIVR uniquely identified a number of proteins, complementing the models trained on the observed AD status. Fig 6b displays the Q-Q plot for all proteins. The distributions of p-values exhibit a similar pattern across all models.

Table 2. p-values of the significant proteins identified for Alzheimer’s disease by at least one method.

Bonferroni cutoff: 0.05/1679 = 2.98 × 10⁻⁵. The top row shows the models evaluated. The second row shows the training sets used. All models were tested on the ADSP data. p-values smaller than the Bonferroni cutoff are highlighted in bold.

| Model                                      |           | DeLIVR  | DeLIVR      | DeLIVR      | DeLIVR   | TWAS-L  | TWAS-LQ |
| Training data                              |           | ADSP    | LS-imp IGAP | LS-imp EADB | AD Proxy | ADSP    | ADSP    |
| Protein name                               | Gene name |         |             |             |          |         |         |
| Apolipoprotein E                           | APOE      | 2.8e-16 | 0           | 0           | 0        | 1.7e-44 | 2.9e-44 |
| Apolipoprotein C-I                         | APOC1     | 3.2e-6  | 1.5e-10     | 4.3e-8      | 2.1e-8   | 5.0e-7  | 3.3e-6  |
| Complement receptor type 1                 | CR1       | 1.9e-3  | 3.4e-2      | 2.0e-5      | 3.4e-3   | 2.4e-6  | 1.0e-5  |
| Basal cell adhesion molecule               | BCAM      | 1.8e-3  | 1.9e-6      | 3.2e-4      | 3.2e-6   | 1.7e-9  | 1.0e-24 |
| Lipocalin-15                               | LCN15     | 2.1e-5  | 1.7e-3      | 5.2e-4      | 1.4e-3   | 1.8e-4  | 8.8e-4  |
| Carbonic anhydrase 13                      | CA13      | 3.5e-2  | 7.8e-3      | 1.1e-2      | 1.1e-2   | 6.9e-4  | 3.3e-7  |
| Retinoic acid receptor responder protein 2 | RARRES2   | 1.8e-5  | 1.2e-5      | 2.2e-4      | 1.9e-6   | 9.8e-4  | 4.4e-3  |

Fig 6. Side-by-side UpSet plot and Q-Q plot of all models for PWAS.

Fig 6

“ADSP” refers to the DeLIVR model trained with the ADSP data (the observed AD status). “TWAS-L” and “TWAS-LQ” refer to the standard TWAS model and the parametric TWAS-LQ model trained with the ADSP data, respectively. LS-imp EADB and LS-imp IGAP refer to the DeLIVR model trained with the LS-imputed AD status using EADB and IGAP as the GWAS data, respectively. “AD Proxy” refers to the DeLIVR model trained with the proxy AD status. “Combined” refers to the results of the Cauchy combination test combining the p-values of “LS-imp EADB,” “LS-imp IGAP,” and “AD Proxy.”

4. Discussion

The results of this study underscore the potential of proxy phenotypes and LS-imputation to address the challenges posed by the small number of AD cases in large biobank data, which otherwise limits the discovery of genetic signals related to AD. By leveraging the proxy AD and LS-imputation methods, we observed consistent and reliable GWAS results across different datasets. Specifically, the comparison of GWAS summary statistics derived from imputed AD traits using either proxy AD or LS-imputation with those from other popular methods highlighted the robustness of these two approaches in capturing the genetic underpinnings of AD. However, proxy AD and GWAX cannot be applied without parental/sibling information, while LS-imputation can gain information from external large-scale GWAS summary data. Overall, we found that trait imputation, e.g., via proxy AD and LS-imputation, proved effective in overcoming the limitations posed by the sparsity of AD cases in biobank data, thereby broadening the scope of genetic research in AD.

Furthermore, the application of DeLIVR, a nonlinear TWAS method, to LS-imputed AD status provided novel insights into the genetic landscape potentially driving AD pathology. The identification of significant genes and proteins linked to AD, such as the gene FNBP1L and protein CR1, underscores the method’s utility in identifying biologically relevant targets. Although DeLIVR trained on the imputed traits did not always identify the largest number of genes or proteins, it uniquely identified some genes and proteins that would be missed using conventional GWAS or TWAS, highlighting the method’s added value. This is particularly relevant given that nonlinear associations may be preserved by LS-imputation, but not by traditional linear PRS models.

Our findings also suggest that different imputation methods can complement each other, particularly when the methods exhibit significant distinctiveness, revealing a broader spectrum of genetic influences on AD that could be missed when relying on a single method. This point is further supported by additional analyses performed on brain hippocampus tissue, where aggregating results from different imputation methods yielded the largest set of genes discovered. These complementary effects have the potential to enhance the overall power of genetic studies, underscoring the value of employing a multifaceted approach to trait imputation.

We acknowledge several limitations that warrant further investigation. First, the computational complexity of LS-imputation limits the number of genetic variants that can be used. Second, at present, we are unable to perform hypothesis testing on LS-imputed traits for nonparametric TWAS models for two main reasons: 1) the imputed trait values are correlated across batches, which complicates the creation of an independent test set, and 2) testing using the imputed traits could lead to biased results, which is also expected to be a general issue for any trait imputation method. There is always a bias-variance trade-off when using imputed traits; the question of how to best balance this trade-off remains largely open. Third, our analysis was limited to individuals of European ancestry. A multi-ancestry design would be necessary to ensure that the findings are generalizable across diverse populations [57]. Fourth, our analysis focused solely on whole blood tissue, but aggregating data from multiple tissues could provide additional insights and improve the robustness of the findings [58,59].

In conclusion, LS-imputation represents a significant advance in the field of genetic research into AD, providing a robust tool for enhancing the quality of phenotypic data and, consequently, the insights gleaned from genetic association studies. Future research should focus on refining these imputation techniques and exploring their applications across other complex diseases to fully realize their potential in precision medicine and genetics research.

Supporting information

S1 Text. Supplementary file with details of additional quality control procedure, real data analysis results, and simulation results.

(PDF)

Acknowledgments

Access to the GTEx data was approved for dbGaP Project #26511, and the data were obtained from dbGaP accession number phs000424.v8.p2 on 10/13/2021. Access to the UK Biobank (UKB) data was approved through UKB Application #35107. Data from the Alzheimer’s Disease Sequencing Project (ADSP) were prepared, archived, and distributed by the National Institute on Aging Alzheimer’s Disease Data Storage Site (NIAGADS) at the University of Pennsylvania. The authors also acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing high-performance computing resources that contributed to the research results reported within this paper. We thank the International Genomics of Alzheimer’s Project (IGAP) for providing summary results data for these analyses. The investigators within IGAP contributed to the design and implementation of IGAP and/or provided data but did not participate in analysis or writing of this report. IGAP was made possible by the generous participation of the control subjects, the patients, and their families.

Data Availability

EADB GWAS summary statistics data are publicly available from the European Bioinformatics Institute (EBI) GWAS Catalog (https://www.ebi.ac.uk/gwas/) under accession no. GCST90027158. IGAP GWAS summary statistics data are publicly available from the EBI GWAS Catalog under accession no. GCST007511. Individual-level data from the UK Biobank (https://www.ukbiobank.ac.uk/), the Genotype-Tissue Expression (https://gtexportal.org/home/) Project, and the Alzheimer’s Disease Sequencing Project (https://adsp.niagads.org/) are available by application through their respective data access processes. The code used for the analyses presented in this paper is available at https://github.com/RuoyuHe/LS-imputation_TWAS. The code for DeLIVR is available at https://github.com/RuoyuHe/DeLIVR. The code for LS-imputation is available at https://github.com/ren328/LSimputing. The code for PRS-CS is available at https://github.com/getian107/PRScs. LDpred2 is supplied with the R package “bigsnpr” (https://github.com/privefl/bigsnpr).

Funding Statement

This work was supported by the National Institutes of Health (NIH) under grants U01 AG073079 (RH, JR, WP), RF1 AG067924 (MM, WP) and R01 HL116720 (MM, WP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Gatz M, Reynolds CA, Fratiglioni L, Johansson B, Mortimer JA, Berg S, et al. Role of genes and environments for explaining Alzheimer disease. Arch Gen Psychiatry. 2006;63(2):168–74. doi: 10.1001/archpsyc.63.2.168
  • 2. Lambert JC, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, Bellenguez C, et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat Genet. 2013;45(12):1452–8. doi: 10.1038/ng.2802
  • 3. Marioni RE, Harris SE, Zhang Q, McRae AF, Hagenaars SP, Hill WD, et al. GWAS on family history of Alzheimer’s disease. Transl Psychiatry. 2018;8(1):99. doi: 10.1038/s41398-018-0150-6
  • 4. Jansen IE, Savage JE, Watanabe K, Bryois J, Williams DM, Steinberg S, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat Genet. 2019;51(3):404–13. doi: 10.1038/s41588-018-0311-9
  • 5. Kunkle BW, Grenier-Boley B, Sims R, Bis JC, Damotte V, Naj AC, et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat Genet. 2019;51(3):414–30. doi: 10.1038/s41588-019-0358-2
  • 6. Lambert JC, Ramirez A, Grenier-Boley B, Bellenguez C. Step by step: towards a better understanding of the genetic architecture of Alzheimer’s disease. Mol Psychiatry. 2023;28(7):2716–27.
  • 7. Andrews SJ, Renton AE, Fulton-Howard B, Podlesny-Drabiniok A, Marcora E, Goate AM. The complex genetic architecture of Alzheimer’s disease: novel insights and future directions. EBioMedicine. 2023;90:104511. doi: 10.1016/j.ebiom.2023.104511
  • 8. Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am J Epidemiol. 2017;186(9):1026–34. doi: 10.1093/aje/kwx246
  • 9. Schoeler T, Speed D, Porcu E, Pirastu N, Pingault J-B, Kutalik Z. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nat Hum Behav. 2023;7(7):1216–27. doi: 10.1038/s41562-023-01579-9
  • 10. Liu JZ, Erlich Y, Pickrell JK. Case-control association mapping by proxy using family history of disease. Nat Genet. 2017;49(3):325–31. doi: 10.1038/ng.3766
  • 11. Schwartzentruber J, Cooper S, Liu JZ, Barrio-Hernandez I, Bello E, Kumasaka N, et al. Genome-wide meta-analysis, fine-mapping and integrative prioritization implicate new Alzheimer’s disease risk genes. Nat Genet. 2021;53(3):392–402. doi: 10.1038/s41588-020-00776-w
  • 12. Wightman DP, Jansen IE, Savage JE, Shadrin AA, Bahrami S, Holland D, et al. A genome-wide association study with 1,126,563 individuals identifies new risk loci for Alzheimer’s disease. Nat Genet. 2021;53(9):1276–82. doi: 10.1038/s41588-021-00921-z
  • 13. Bellenguez C, Küçükali F, Jansen I, Kleineidam L, Moreno-Grau S, Amin N, et al. New insights into the genetic etiology of Alzheimer’s disease and related dementias. Nat Genet. 2022;54(4):412–36.
  • 14. Sherva R, Zhang R, Sahelijo N, Jun G, Anglin T, Chanfreau C, et al. African ancestry GWAS of dementia in a large military cohort identifies significant risk loci. Mol Psychiatry. 2023;28(3):1293–302. doi: 10.1038/s41380-022-01890-3
  • 15. Ren J, Lin Z, He R, Shen X, Pan W. Using GWAS summary data to impute traits for genotyped individuals. HGG Adv. 2023;4(3):100197. doi: 10.1016/j.xhgg.2023.100197
  • 16. Ren J, Pan W. Statistical inference with large-scale trait imputation. Stat Med. 2024;43(4):625–41.
  • 17. Wang T, Ionita-Laza I, Wei Y. A unified quantile framework reveals nonlinear heterogeneous transcriptome-wide associations. arXiv preprint arXiv:2207.12081. 2022.
  • 18. Sun B, Chen L. Quantile regression for challenging cases of eQTL mapping. Brief Bioinform. 2020;21(5):1756–65. doi: 10.1093/bib/bbz097
  • 19. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016;48(3):245–52. doi: 10.1038/ng.3506
  • 20.Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet 2016;48(5):481–7. doi: 10.1038/ng.3538 [DOI] [PubMed] [Google Scholar]
  • 21.Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. A gene-based association method for mapping traits using reference transcriptome data. Nature genetics. 2015;47(9):1091–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Gao G, Fiorica PN, McClellan J, Barbeira AN, Li JL, Olopade OI, et al. A joint transcriptome-wide association study across multiple tissues identifies candidate breast cancer susceptibility genes. Am J Hum Genet 2023;110(6):950–62. doi: 10.1016/j.ajhg.2023.04.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Yuan Z, Zhu H, Zeng P, Yang S, Sun S, Yang C, et al. Testing and controlling for horizontal pleiotropy with probabilistic Mendelian randomization in transcriptome-wide association studies. Nat Commun 2020;11(1):3861. doi: 10.1038/s41467-020-17668-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Tang S, Buchman A, De Jager P, Bennett D, Epstein M, Yang J. Novel variance-component TWAS method for studying complex human diseases with applications to Alzheimer’s dementia. PLoS Genetics 2021;17(4):e1009482. doi: 10.1371/journal.pgen.1009482 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Cao C, Ding B, Li Q, Kwok D, Wu J, Long Q. Power analysis of transcriptome-wide association study: Implications for practical protocol choice. PLoS Genet 2021;17(2):e1009405. doi: 10.1371/journal.pgen.1009405 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Grishin D, Gusev A. Allelic imbalance of chromatin accessibility in cancer identifies candidate causal risk variants and their mechanisms. Nat Genet 2022;54(6):837–49. doi: 10.1038/s41588-022-01075-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Gao G, McClellan J, Barbeira AN, Fiorica PN, Li JL, Mu Z, et al. A multi-tissue, splicing-based joint transcriptome-wide association study identifies susceptibility genes for breast cancer. Am J Hum Genet 2024;111(6):1100–13. doi: 10.1016/j.ajhg.2024.04.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.He J, Antonyan L, Zhu H, Ardila K, Li Q, Enoma D, et al. A statistical method for image-mediated association studies discovers genes and pathways associated with four brain disorders. Am J Hum Genet 2024;111(1):48–69. doi: 10.1016/j.ajhg.2023.11.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Suhre K, Arnold M, Bhagwat AM, Cotton RJ, Engelke R, Raffler J, et al. Connecting genetic risk to disease end points through the human blood plasma proteome. Nat Commun. 2017;8:14357. doi: 10.1038/ncomms14357 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Zheng J, Haberland V, Baird D, Walker V, Haycock PC, Hurle MR, et al. Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nat Genet 2020;52(10):1122–31. doi: 10.1038/s41588-020-0682-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Zhang J, Dutta D, Köttgen A, Tin A, Schlosser P, Grams ME, et al. Plasma proteome analyses in individuals of European and African ancestry identify cis-pQTLs and models for proteome-wide association studies. Nat Genet 2022;54(5):593–602. doi: 10.1038/s41588-022-01051-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Zhu J, Liu S, Walker KA, Zhong H, Ghoneim DH, Zhang Z, et al. Associations between genetically predicted plasma protein levels and Alzheimer’s disease risk: a study using genetic prediction models. Alzheimers Res Ther 2024;16(1):8. doi: 10.1186/s13195-023-01378-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 2020;369(6509):1318–30. doi: 10.1126/science.aaz1776 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 2015;12(3):e1001779. doi: 10.1371/journal.pmed.1001779 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Sun BB, Chiou J, Traylor M, Benner C, Hsu Y-H, Richardson TG, et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature 2023;622(7982):329–38. doi: 10.1038/s41586-023-06592-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.He R, Liu M, Lin Z, Zhuang Z, Shen X, Pan W. DeLIVR: a deep learning approach to IV regression for testing nonlinear causal effects in transcriptome-wide association studies. Biostatistics 2024;25(2):468–85. doi: 10.1093/biostatistics/kxac051 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Hartford J, Lewis G, Leyton-Brown K, Taddy M. Deep IV: A flexible approach for counterfactual prediction. In: Proceedings of Machine Learning Research. 2017. p. 1414–23.
  • 38.Lin Z, Xue H, Malakhov MM, Knutson KA, Pan W. Accounting for nonlinear effects of gene expression identifies additional associated genes in transcriptome-wide association studies. Hum Mol Genet 2022;31(14):2462–70. doi: 10.1093/hmg/ddac015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc 2020;115(529):393–402. doi: 10.1080/01621459.2018.1554485 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Knutson KA, Deng Y, Pan W. Implicating causal brain imaging endophenotypes in Alzheimer’s disease using multivariable IWAS and GWAS summary data. Neuroimage. 2020;223:117347. doi: 10.1016/j.neuroimage.2020.117347 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Wu Y, Sun Z, Zheng Q, Miao J, Dorn S, Mukherjee S. Pervasive biases in proxy GWAS based on parental history of Alzheimer’s disease. bioRxiv. 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 2019;10(1):1776. doi: 10.1038/s41467-019-09718-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: better, faster, stronger. Bioinformatics. 2020;36(22–23):5424–31. [DOI] [PMC free article] [PubMed]
  • 44.Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am J Hum Genet 2015;97(4):576–92. doi: 10.1016/j.ajhg.2015.09.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Statist Soc Ser B: Statist Methodol 1996;58(1):267–88. doi: 10.1111/j.2517-6161.1996.tb02080.x [DOI] [Google Scholar]
  • 46.Zou H, Hastie T. Regularization and variable selection via the elastic net. J Roy Statist Soc Ser B: Statist Methodol. 2005;67(2):301–20. [Google Scholar]
  • 47.Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc 2020;115(529):393–402. doi: 10.1080/01621459.2018.1554485 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Prokopenko D, Morgan SL, Mullin K, Hofmann O, Chapman B, Kirchner R, et al. Whole-genome sequencing reveals new Alzheimer’s disease-associated rare variants in loci related to synaptic function and neuronal development. Alzheimers Dement 2021;17(9):1509–27. doi: 10.1002/alz.12319 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Martinez FO, Gordon S, Locati M, Mantovani A. Transcriptional profiling of the human monocyte-to-macrophage differentiation and polarization: new molecules and patterns of gene expression. J Immunol 2006;177(10):7303–11. doi: 10.4049/jimmunol.177.10.7303 [DOI] [PubMed] [Google Scholar]
  • 50.Mhatre SD, Tsai CA, Rubin AJ, James ML, Andreasson KI. Microglial malfunction: the third rail in the development of Alzheimer’s disease. Trends Neurosci 2015;38(10):621–36. doi: 10.1016/j.tins.2015.08.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Hollingworth P, Harold D, Sims R, Gerrish A, Lambert J-C, Carrasquillo MM, et al. Common variants at ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are associated with Alzheimer’s disease. Nat Genet 2011;43(5):429–35. doi: 10.1038/ng.803 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Harold D, Abraham R, Hollingworth P, Sims R, Gerrish A, Hamshere ML, et al. Genome-wide association study identifies variants at CLU and PICALM associated with Alzheimer’s disease. Nat Genet 2009;41(10):1088–93. doi: 10.1038/ng.440 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Tan L, Yu JT, Zhang W, Wu ZC, Zhang Q, Liu QY, et al. Association of GWAS-linked loci with late-onset Alzheimer’s disease in a northern Han Chinese population. Alzheimer’s Dementia. 2013;9(5):546–53. [DOI] [PubMed] [Google Scholar]
  • 54.Deng Y-L, Liu L-H, Wang Y, Tang H-D, Ren R-J, Xu W, et al. The prevalence of CD33 and MS4A6A variant in Chinese Han population with Alzheimer’s disease. Hum Genet 2012;131(7):1245–9. doi: 10.1007/s00439-012-1154-6 [DOI] [PubMed] [Google Scholar]
  • 55.Yue G, An Q, Xu X, Jin Z, Ding J, Hu Y, et al. The role of Chemerin in human diseases. Cytokine. 2023;162:156089. doi: 10.1016/j.cyto.2022.156089 [DOI] [PubMed] [Google Scholar]
  • 56.Zhao H, Wang J, Li Z, Wang S, Yu G, Wang L. Identification ferroptosis-related hub genes and diagnostic model in Alzheimer’s disease. Front Mol Neurosci. 2023;16:1280639. doi: 10.3389/fnmol.2023.1280639 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Chen F, Wang X, Jang S-K, Quach BC, Weissenkampen JD, Khunsriraksakul C, et al. Multi-ancestry transcriptome-wide association analyses yield insights into tobacco use biology and drug repurposing. Nat Genet 2023;55(2):291–300. doi: 10.1038/s41588-022-01282-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Gleason KJ, Yang F, Chen LS. A robust two-sample transcriptome-wide Mendelian randomization method integrating GWAS with multi-tissue eQTL summary statistics. Genet Epidemiol 2021;45(4):353–71. doi: 10.1002/gepi.22380 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Zhao B, Shan Y, Yang Y, Yu Z, Li T, Wang X, et al. Transcriptome-wide association analysis of brain structures yields insights into pleiotropy with complex neuropsychiatric traits. Nat Commun 2021;12(1):2878. doi: 10.1038/s41467-021-23130-y [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Hae Kyung Im, Xiaofeng Zhu

7 Oct 2024

Dear Dr Pan,

Thank you very much for submitting your Research Article entitled 'Enhancing nonlinear transcriptome- and proteome-wide association studies via trait imputation with applications to Alzheimer's disease' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, log into your Editorial Manager account and select the option 'Revise Submission' in the 'Submissions Needing Revision' folder.

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Hae Kyung Im

Guest Editor

PLOS Genetics

Xiaofeng Zhu

Section Editor

PLOS Genetics

Please address the reviewer comments.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: He, Ren, et al. present an interesting approach to TWAS and PWAS, asking whether outcome trait imputation improves TWAS/PWAS performance, especially for Alzheimer’s disease (AD), where the number of cases available in large biobanks is limited. They model the omics trait effects on outcome explicitly, focusing on their DeLIVR deep learning model, and compare predicted outcomes to observed outcomes in an independent cohort rather than testing predicted omics traits for association with outcomes as in traditional TWAS. While the authors do successfully demonstrate that by imputing outcomes (HDL or AD) and using DeLIVR, they can detect potentially nonlinear associations between omics traits (exposures) and outcomes, I’m not convinced this method uncovers new gene/protein associations that would not be found in traditional linear TWAS. Clarifications and additions that could improve the paper include:

1. In this paper, you use your HDL analyses to justify using case/control imputation in your AD analyses. As genetic architectures vary among traits, showing similar results with additional traits likely to be present in biobanks, e.g. height, blood pressure, etc., would be more convincing than just HDL. More concerning is that your conclusions are based on a handful of significant observations in the Venn diagrams presented in Fig 3. Just because an association was found with TWAS-LQ does not mean it is a “true positive”. I know defining “true positives” is a challenge, but calculating the proportion of genes/proteins you discover that are also listed in the GWAS Catalog for a particular trait (HDL, AD, etc.) would be helpful. Additionally, you could perform your intersection analyses at a more relaxed threshold, e.g. FDR<0.1 rather than Bonferroni, since your Venns have such small numbers.

2. The results presented in Tables 1 and 2 seem highly correlated across models, even though only the bolded p-values reach Bonferroni adjustment for the numbers of genes or proteins tested. Indeed, many p-values miss this significance threshold, but are still low. The only unique DeLIVR result I see is CR1L in Table 1. Thus, I am not convinced the associations are unique to deep learning methods. Similar to above, if you relax thresholds, how unique are your DeLIVR results?

3. Along these lines, you tested just 3,880 genes (GTEx whole blood) and 1,679 proteins (UKB plasma) in your models. I’m guessing APOE was not tested in your gene expression (eQTL) models? How many known AD genes were tested in TWAS and PWAS? GTEx also has several brain tissues available that may be quite useful for AD, which I suggest including. Rather than building your own gene expression models, you could use models publicly available for PrediXcan or FUSION to test more genes in your TWAS, ~13k per tissue, which may allow better assessment of potential differences between methods.

4. “LD contamination” is often a problem with TWAS results because genes linked to each other may share eQTLs (Barbeira et al 2018, Zhou et al 2020). Including chromosome and positions in your tables listing genes (and including this info for the Fig 3 results) would be helpful. In addition, colocalization or joint analyses would help zero in on the causal gene/protein just like fine mapping for SNPs.

5. I thought Fig 1 was very clear, thank you. Could you add sample size and sample splitting information to Fig 2 to make it clearer? I suggest UpSet plots rather than Venn diagrams for Fig 3, especially if you end up with larger numbers.
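
Reviewer #1's points 1 and 2 contrast a Bonferroni threshold with a relaxed FDR<0.1 threshold. The difference can be illustrated with a minimal sketch, assuming the Benjamini-Hochberg step-up procedure as the FDR method (the review does not name one). The function names and the p-values are hypothetical and are not taken from the manuscript.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Mark tests with p <= alpha / m, where m is the number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, q=0.1):
    """Mark Benjamini-Hochberg discoveries at FDR level q."""
    m = len(pvalues)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every test whose rank is at or below that cutoff.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values for ten genes (illustrative only, not from the paper).
pvals = [1e-6, 0.004, 0.012, 0.03, 0.2, 0.35, 0.5, 0.62, 0.8, 0.95]
bonf = bonferroni_reject(pvals, alpha=0.05)
bh = benjamini_hochberg_reject(pvals, q=0.1)
print(sum(bonf), sum(bh))
```

With these made-up inputs the Bonferroni rule keeps two genes and the FDR rule keeps four, which is the kind of enlarged overlap set the reviewer suggests examining.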

Reviewer #2: Please see the review report.

Reviewer #3: This study evaluates various Alzheimer's disease (AD) trait imputation methods in capturing genotype-phenotype relationships and compares these methods in nonlinear TWAS/PWAS inference. The framework differs from traditional TWAS in that it requires three datasets: 1) a reference dataset to build TWAS/PWAS prediction models; 2) a biobank dataset with genotype data but without phenotypes, into which the outcome is imputed using a trait imputation method, so that a stage 2 model can be trained with the imputed expression from stage 1 as input and the imputed trait as output; and 3) an independent test set with observed genotype and observed outcome trait data, which is used for hypothesis testing of the association between the predicted outcome trait and its observed values. The framework uses a neural network approach, DeLIVR, to nonparametrically estimate E[Y | genotype] in stage 2. The authors analyzed several trait imputation methods: LS-imputation, Proxy-AD (of 2 types), PRS-CS (where the risk score is used as the imputed trait), and LDpred2 (where again the risk score is used as the imputed trait). GTEx and UKB were the sources of the reference transcriptome/proteome data for training models of these molecular traits.

Overall, this study adds to the TWAS literature and offers some important methodological insights. The systematic benchmarking of the various methods (at each stage) will be useful to the community. Overall, the paper is well-written. However, I do have some concerns, which I think should be addressed.

1. Please connect the comparison of marginal effect sizes, standard errors, and p-values reported in Results to the actual tables/figures in Supplementary Information. It was challenging to follow the results.

2. How was the distribution of significant SNPs identified via LS-imputation (and the other trait imputation methods) compared to the results from the corresponding GWAS (IGAP or EADB)? The authors indicate that they "compared the Manhattan plots" of the original GWAS to the results from the trait imputation methods. There's a lot of hand-waving in this section of results and quite a bit of subjectivity.

3. Similarly for the comparison of the various imputed traits (using R2 or Nagelkerke’s R2) in Results, it would be important to link this section of Results to the relevant Supplementary Information.

4. I'm concerned that the TWAS did not identify the well-known Alzheimer's gene APOE, which has been consistently identified by previous TWAS studies (and implicated by previous GWAS). Now, of course, APOE was identified in the PWAS analysis (on imputed AD status). How do the authors interpret these findings?

5. Given that with the exception of APOE and APOC1, no other proteins were identified by DeLIVR across all imputation approaches, how specifically would the authors "combine" the various results? It's unclear how this combination should be done; it would be useful if the authors can make recommendations along these lines given the results presented here.

6. The study was done using GTEx whole blood. The tissue specificity of the approach/results is unclear. (There's a strong argument to be made that whole blood is not the ideal tissue for the analysis, given the availability of brain in GTEx (and other reference resources) and given its well-recognized cell type heterogeneity.)

7. The study started with more than 2,800 proteins, but ended up with fewer than 1,700 proteins for the stage 2 PWAS analysis. It's unclear how many proteins were dropped at each step of the preprocessing and QC. This would be very useful information.

8. It would be important to present the Q-Q plot of the association p-values. It's important for readers to see this distribution given the control for type I error claimed here.

9. The authors should make all results available, including the TWAS/PWAS models used and the association results (not just the code), for ease of reproducibility.
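
The Q-Q plot requested in point 8 compares sorted observed p-values with the quantiles expected under a uniform null. A minimal sketch of how the plotted coordinates could be computed follows. The function name `qq_points` and the midpoint plotting position (i - 0.5)/n are assumptions made for illustration, and the input p-values are hypothetical.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10 p for a Q-Q plot.

    Expected values use the midpoint plotting position (i - 0.5) / n under
    a uniform null. All p-values must be strictly positive.
    """
    n = len(pvalues)
    observed = sorted(pvalues)
    points = []
    for i, p in enumerate(observed, start=1):
        expected_p = (i - 0.5) / n
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


# Hypothetical, perfectly uniform p-values: every point lies on the diagonal.
pvals = [0.9, 0.5, 0.1, 0.7, 0.3]
pts = qq_points(pvals)
for expected, observed in pts:
    print(round(expected, 3), round(observed, 3))
```

Points that rise above the diagonal across the whole range would suggest inflation, whereas departure only in the upper tail is what a small set of true associations would produce.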

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Eric R. Gamazon

Attachment

Submitted filename: ReviewReport_PGENETICS.pdf

pgen.1011659.s002.pdf (39.8KB, pdf)

Decision Letter 1

Hae Kyung Im, Xiaofeng Zhu

26 Jan 2025

PGENETICS-D-24-00954R1

Enhancing nonlinear transcriptome- and proteome-wide association studies via trait imputation with applications to Alzheimer's disease

PLOS Genetics

Dear Dr. Pan,

Thank you for submitting your manuscript to PLOS Genetics. After careful consideration, we feel that it has merit but does not fully meet PLOS Genetics's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Mar 27 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosgenetics@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pgenetics/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to any formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Hae Kyung Im

Guest Editor

PLOS Genetics

Xiaofeng Zhu

Section Editor

PLOS Genetics

Aimée Dudley

Editor-in-Chief

PLOS Genetics

Anne Goriely

Editor-in-Chief

PLOS Genetics

Additional Editor Comments :

Apologies for the delay in the decision. However, we would like to see the issue of type I error raised by Reviewer #2 addressed before making a decision.

Journal Requirements:

1) Thank you for stating that "Individual-level data from the UK Biobank (UKB {https://www.ukbiobank.ac.uk/}), the Genotype-Tissue Expression (GTEx {https://gtexportal.org/home/}) Project, and the Alzheimer's Disease Sequencing Project (ADSP {https://adsp.niagads.org/}) are available by application through their respective data access processes." Please note that these links seem to direct to very general homepages rather than pages with the specific data. Please update your statement to provide further details or direct page links.

2) The file inventory includes files for Figures 3a, 3b, 3c, 3d, 3e, 3f, 5a, 5b, 6a and 6b. We would recommend either combining these into single Figure 3.tiff, Figure 5.tiff and Figure 6.tiff files with separate internal panels, or renumbering them as individual figures, as we are not able to publish multiple components of a single figure as separate files.

3) Please ensure that the affiliations of the authors listed on the manuscript title page exactly match the affiliations provided in the online submission form.

NOTE: Affiliations should include a department (if applicable), an institution, a city, and a country.

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #1: The authors have addressed most of my concerns. I have a few minor suggestions given the additional analyses conducted:

1. The Discussion limitations have not been updated to include your brain hippocampus analysis. This analysis should be mentioned in the main text.

2. Similarly, your response to reviewers about APOE transcript models should be included in the main text or supplement as readers will want to know this information.

3. Thank you for including your code on GitHub. Please be sure to add a README describing which scripts and results files are included.

Reviewer #2: For my previous comments 1&2, 4 (Type I error):

The authors have added additional simulations to address my previous comments. However, the type I error analysis remains unconvincing. It does not seem reasonable to use other existing methods as a gold standard to determine whether a new discovery is a false positive. For instance, the authors state that in one case (n_1 = 4000), DeLIVR Imputed uniquely identified several genes missed by all other methods, suggesting these could be potential false positives. This raises concerns about the claim that DeLIVR Imputed has higher power. As a result, this approach does not appear to provide a valid method for demonstrating control of type I error.
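
Reviewer #2's concern is that agreement with other methods cannot establish type I error control. The conventional alternative is to simulate data under a known null and count how often the test rejects. The sketch below illustrates that idea only. It uses a simple two-sided z-test as a stand-in and is not the authors' simulation design or the DeLIVR test. The function names, sample sizes, and seed are hypothetical.

```python
import math
import random


def two_sided_z_pvalue(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def empirical_type1_error(n_reps=10000, n=20, alpha=0.05, seed=1):
    """Fraction of null replicates rejected at nominal level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        # Null data: mean 0 and known variance 1, so the z-test is calibrated.
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)
        if two_sided_z_pvalue(z) < alpha:
            rejections += 1
    return rejections / n_reps


rate = empirical_type1_error()
print(rate)
```

A well-calibrated test should give a rejection rate close to the nominal alpha, up to Monte Carlo error, and because every replicate is generated under the null, each rejection is a false positive by construction.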

For my previous comments 3&6:

There seems to be a contradiction in the authors’ responses. In response to my comment 3, the authors stated: “First, our main point is not about the combination of methods, but that trait imputation can improve over using small samples of observed traits only.” However, in response to comment 6, they stated: “It is not the main point of the current paper to investigate the performance of DeLIVR, but the combined use of DeLIVR and trait imputation.” These statements appear inconsistent, leaving me unclear about the main purpose of the paper. Please clarify the primary focus and goals in the manuscript to address this confusion.

Then, regarding the authors’ response to my comment 6: The presentation of the results is somewhat confusing and gives the impression that the authors are claiming DeLIVR combined with other methods is superior to TWAS-L and TWAS-LQ. However, based on the data analysis results, particularly Table 2, I still do not see a clear advantage of DeLIVR combined with other imputation methods compared to existing TWAS approaches (e.g., TWAS-L and TWAS-LQ). If this is not the intended message, please clarify this in the manuscript.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No


Decision Letter 2

Hae Kyung Im, Xiaofeng Zhu

18 Mar 2025

Dear Dr Pan,

We are pleased to inform you that your manuscript entitled "Enhancing nonlinear transcriptome- and proteome-wide association studies via trait imputation with applications to Alzheimer's disease" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to "Update My Information" (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Hae Kyung Im

Guest Editor

PLOS Genetics

Xiaofeng Zhu

Section Editor

PLOS Genetics

Aimée Dudley

Editor-in-Chief

PLOS Genetics

Anne Goriely

Editor-in-Chief

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have satisfactorily addressed my concerns.

Reviewer #2: The authors addressed my concerns.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: Yes


----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-24-00954R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Hae Kyung Im, Xiaofeng Zhu

PGENETICS-D-24-00954R2

Enhancing nonlinear transcriptome- and proteome-wide association studies via trait imputation with applications to Alzheimer's disease

Dear Dr Pan,

We are pleased to inform you that your manuscript entitled "Enhancing nonlinear transcriptome- and proteome-wide association studies via trait imputation with applications to Alzheimer's disease" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Lilla Horvath

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


    Supplementary Materials

    S1 Text. Supplementary file with details of additional quality control procedure, real data analysis results, and simulation results. (PDF): pgen.1011659.s001.pdf (5.8MB, pdf)

    Attachment (submitted filename: ReviewReport_PGENETICS.pdf): pgen.1011659.s002.pdf (39.8KB, pdf)

    Attachment (submitted filename: Response1.pdf): pgen.1011659.s003.pdf (108.7KB, pdf)

    Attachment (submitted filename: response_letter2.pdf): pgen.1011659.s004.pdf (82.9KB, pdf)

    Data Availability Statement

    EADB GWAS summary statistics data are publicly available from the European Bioinformatics Institute (EBI) GWAS Catalog (https://www.ebi.ac.uk/gwas/) under accession no. GCST90027158. IGAP GWAS summary statistics data are publicly available from the EBI GWAS Catalog under accession no. GCST007511. Individual-level data from the UK Biobank (https://www.ukbiobank.ac.uk/), the Genotype-Tissue Expression Project (https://gtexportal.org/home/), and the Alzheimer’s Disease Sequencing Project (https://adsp.niagads.org/) are available by application through their respective data access processes. The code used for the analyses presented in this paper is available at https://github.com/RuoyuHe/LS-imputation_TWAS. The code for DeLIVR is available at https://github.com/RuoyuHe/DeLIVR. The code for LS-imputation is available at https://github.com/ren328/LSimputing. The code for PRS-CS is available at https://github.com/getian107/PRScs. LDpred2 is supplied with the R package “bigsnpr” (https://github.com/privefl/bigsnpr).


    Articles from PLOS Genetics are provided here courtesy of PLOS
