PLOS One. 2020 Nov 30;15(11):e0239189. doi: 10.1371/journal.pone.0239189

Mining GWAS and eQTL data for CF lung disease modifiers by gene expression imputation

Hong Dang 1,*, Deepika Polineni 2, Rhonda G Pace 1, Jaclyn R Stonebraker 1, Harriet Corvol 3,4, Garry R Cutting 5,6, Mitchell L Drumm 7, Lisa J Strug 8,9, Wanda K O’Neal 1, Michael R Knowles 1
Editor: Dylan Glubb 10
PMCID: PMC7703903  PMID: 33253230

Abstract

Genome wide association studies (GWAS) have identified several genomic loci with candidate modifiers of cystic fibrosis (CF) lung disease, but only a small proportion of the expected genetic contribution is accounted for at these loci. We leveraged expression data from CF cohorts, and Genotype-Tissue Expression (GTEx) reference data sets from multiple human tissues to generate predictive models, which were used to impute transcriptional regulation from genetic variance in our GWAS population. The imputed gene expression was tested for association with CF lung disease severity. By comparing and combining results from alternative approaches, we identified 379 candidate modifier genes. We delved into 52 modifier candidates that showed consensus between approaches, and 28 of them were near known GWAS loci. A number of these genes are implicated in the pathophysiology of CF lung disease (e.g., immunity, infection, inflammation, HLA pathways, glycosylation, and mucociliary clearance) and the CFTR protein biology (e.g., cytoskeleton, microtubule, mitochondrial function, lipid metabolism, endoplasmic reticulum/Golgi, and ubiquitination). Gene set enrichment results are consistent with current knowledge of CF lung disease pathogenesis. HLA Class II genes on chr6, and CEP72, EXOC3, and TPPP near the GWAS peak on chr5 are most consistently associated with CF lung disease severity across the tissues tested. The results help to prioritize genes in the GWAS regions, predict direction of gene expression regulation, and identify new candidate modifiers throughout the genome for potential therapeutic development.

Introduction

The International Cystic Fibrosis Gene Modifier Consortium identified 5 genome-wide significant genetic loci associated with cystic fibrosis (OMIM: 219700) lung disease severity through GWAS of 6,365 CF patients, with a chr16 locus also showing significance in some analyses [1, 2]. The GWAS signals point to genes in regions that may play a role in CF lung disease pathogenesis. Heritability studies of twins and siblings estimated that at least 50% of lung disease variability is attributable to non-CFTR genetic modifiers [3]. The effect sizes of the identified loci, as extrapolated from the beta-coefficients, range from 2.5% to 4.6% predicted forced expiratory volume in one second (FEV1) [1], with a combined effect that could explain < 25% of FEV1 variation. Therefore, a large proportion of the genetic influence on CF lung disease severity remains undetected, in part reflecting the limited statistical power of GWAS under multiple-testing penalties over millions of single nucleotide polymorphisms (SNPs).

The most common scenario explaining genetic association to phenotype is through the effects of variants on gene expression [4, 5]. Studies of genetic regulation of gene expression, i.e., expression Quantitative Trait Loci (eQTL), are effective strategies and “next steps” for post-GWAS investigations to understand genetic susceptibility/modification of diseases [6, 7]. The availability of reference data sets for more than 40 human tissues from the Genotype-Tissue Expression (GTEx) consortium [5] has greatly facilitated post-GWAS research. In a survey of 44 human tissues, the GTEx consortium found that most genetic regulation of gene expression is common across multiple tissues, acting through cis-SNPs at promoter and enhancer sites [5]. In addition, using the entire set of 44 GTEx tissues, as opposed to limiting analyses to 9 pilot tissues, increased the number of trait-associated variants 5-fold for 18 complex traits [8]. In other words, genetic regulation of gene expression, or eQTL, can be informative regardless of the tissue origin of the training data set [8], and can help overcome technical deficiencies, such as small sample sizes for certain tissues, and potential biological limitations, such as unsampled developmental stages and environmental and pathogenic masking of gene expression through reverse causality.

The study of eQTLs requires gene expression and genetic variation data from the same individuals, typically testing one gene-SNP pair at a time. A recent extension of eQTL analysis is the use of machine learning and predictive modeling techniques that combine multiple genetic variants to predict gene expression [9, 10]. The PrediXcan [9] and Transcriptome-Wide Association Studies (TWAS) [10] methods utilize small training data sets (with both genotype and expression data from the same individuals) to build predictive models, in which genotypes from several cis-SNPs are used to predict the genetically regulated portion of expression for each gene. Once built, these models, regardless of tissue origin, can be used to impute gene expression in large GWAS studies where only genotype data are available. The implicit assumptions of these approaches are that genetic regulation of gene expression is largely preserved among human populations, as shown by cross-cohort heritability correlation [9, 10], and that most cis-eQTLs are conserved across different tissues [8, 9]. The resultant (imputed) gene expression can then be analyzed for association with disease phenotypes to pinpoint the genetic regulation that is relevant to the disease process. These methods can improve statistical power by interrogating only SNPs associated with gene expression regulation, thus reducing the multiple-testing burden. The predictive models can also suggest the direction of gene expression regulation relative to phenotype, informing the mechanism by which SNPs affect the phenotype. In addition, because multiple cis-SNPs are interrogated at the same time, no single SNP is required to be significant, which can uncover combinatorial effects not otherwise identified [10].
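
The imputation step described above can be illustrated with a minimal sketch. It assumes an additive linear model in which predicted expression is a weighted sum of cis-SNP dosages; the SNP identifiers, weights, and the function name `impute_expression` are hypothetical and are not taken from the PrediXcan or TWAS software.

```python
def impute_expression(weights, dosages):
    """Predicted genetically regulated expression for one gene in one person.

    weights: dict mapping SNP id -> model weight (hypothetical values below)
    dosages: dict mapping SNP id -> effect-allele dosage in [0, 2]
    SNPs missing from the genotype data contribute nothing to the sum.
    """
    return sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())


# Hypothetical three-SNP cis model for a single gene.
gene_weights = {"rs_a": 0.20, "rs_b": -0.10, "rs_c": 0.05}
# Hypothetical dosages for one genotyped individual.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

predicted = impute_expression(gene_weights, person)
```

Because only genotypes and stored weights are needed, the same function can be applied to every individual in a GWAS cohort without measuring expression in that cohort.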

Here we report the use of PrediXcan and TWAS methods to mine the CF GWAS data for genetic regulation of gene expression associated with CF lung disease severity. We use a combination of our own CF training data sets [11, 12] and reference GTEx data sets of multiple human tissues [4, 5] to generate a list of genes with evidence of association with CF lung disease severity. Leveraging the strengths of diverse approaches [9, 10], and querying multiple tissues produced 379 potential modifier candidates. From this list, 52 consensus genes met the statistical cutoff from both approaches, and 28 of these were within 1 mega-base (Mb) of significant GWAS loci. We sought indirect validation of some of these candidate CF lung disease modifier genes by examining their known functions in literature and annotation databases, and we highlight potential relevance of some of the findings to CF biology. These genes are candidates for further experimental validation.

Methods

The overall workflow of the study is outlined in Fig 1. The cohort study design and the demographic and clinical characteristics of the CF patients used in this study have been previously described [1]. Briefly, 5 cohorts (total 6,365 CF patients) with >90% European ancestry from the US, Canada, and France were recruited by the International Cystic Fibrosis Gene Modifier Consortium, and their genome-wide genetic variation was assayed using different genotyping platforms over several years. GWAS was performed as a meta-analysis of cohort/platform combinations, using the standardized quantitative lung function score, or KNoRMA (Kulich normal residual mortality adjusted) mean FEV1 percentile, as the phenotype trait [1, 3]. The present study also utilized gene expression data previously interrogated for association with several CF disease phenotypes, including expression data from Affymetrix exon microarrays of 753 EBV-transformed lymphoblastoid cell lines (LCLs) from CF patients [11] and RNA-sequencing of nasal mucosal epithelial biopsies from 132 CF patients [12]. These gene expression data provided training data to build predictive models using the PredictDB_Pipeline (used by PrediXcan, from the Im lab) for the GTEx v7 release. Models for LCL gene expression available from the PredictDB repository (http://predictdb.org/, from the Im lab) were compared to our CF LCL models to assess the quality of our predictive models. Full details of the genetic and transcriptomic datasets utilized in the modeling, and of the modeling procedures, are described in S1 Methods in S4 File. Additionally, GTEx models from 48 human tissues and a large data set from Depression Genes and Networks (DGN) whole blood [13] were downloaded from the PredictDB (PrediXcan) data repository [9], and TWAS [10].

Fig 1. Analysis workflow overview.

Imputed SNP genotypes from the GWAS of CF patients (n = 6,365) were used to impute genetically regulated gene expression, which was then tested for association with CF lung disease severity using either the PrediXcan platform (left arm) or TWAS (right arm). The association results from multiple tissues from each platform were combined through 2 different meta-analyses of the p-values from different tissues. GTEx: Genotype-Tissue Expression RNA-seq (n = 48 tissues); CF: LCL microarray (n = 753 samples), and nasal epithelial biopsy RNA-seq (n = 132 samples); DGN: Depression Genes and Networks RNA-seq from whole blood (n = 922 samples); HMP: harmonic mean p-value; EBM: empirical adaptation of Brown’s method; OMNIBUS: omnibus p-value from TWAS.

Imputed SNP genotypes from the CF GWAS cohorts [14] were used as input for PrediXcan model training [9]. Compared to the imputation reported in the GWAS studies [1], the updated version here utilized a more recent release of the 1000 Genomes Project Phase 3 (v5a) haplotype data and 101 CF whole-genome sequences as reference panels, which improved coverage of the HLA and CFTR regions [14].

To test for association with CF lung disease severity, the quantitative score (KNoRMA) used in the prior GWAS studies was used as a standardized CF lung phenotype trait [13], and the imputed gene expression from each tissue was modeled as the response variable to KNoRMA in a linear model, with sex and 4 genotype principal components (PCs) as covariates. Association testing of imputed gene expression, using the PrediXcan platform [9], from the CF LCLs and CF nasal epithelial biopsies, 48 GTEx tissues, and DGN whole blood (a total of 51 human tissues), was performed using robust regression [15, 16] based on 5,756 unrelated patients. The analyses were done using the Bioconductor LIMMA package, and the robust regression utilized iterated re-weighted least squares via the rlm function from the R package MASS. For disease phenotype association testing using predictive models trained on the CF nasal epithelial biopsy and LCL data sets, the samples used in predictive model training (122 nasal and 753 LCL samples were part of the GWAS) were excluded from the association testing, resulting in final sample sizes of 5,634 and 5,003 for nasal epithelial biopsies and LCLs, respectively.
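
To make the idea of iterated re-weighted least squares concrete, the following is a minimal sketch, not the LIMMA/rlm code used in the study. It assumes a single predictor with no covariates, Huber weights with tuning constant k = 1.345, and a scale estimate of median absolute residual divided by 0.6745; the function names and the toy data are hypothetical.

```python
from statistics import median


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def huber_irls(x, y, k=1.345, n_iter=50, tol=1e-10):
    """Robust line fit by iteratively re-weighted least squares (Huber weights)."""
    a, b = weighted_line(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(n_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = median(abs(r) for r in resid) / 0.6745  # robust residual scale
        if scale == 0:
            break  # at least half of the points are fitted exactly
        # Full weight for small residuals, reduced weight for large ones.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        a_new, b_new = weighted_line(x, y, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


# Toy data: an exact line with slope 0.5 plus one gross outlier.
x = list(range(10))
y = [2.0 + 0.5 * xi for xi in x]
y[9] += 20.0

a_ols, b_ols = weighted_line(x, y, [1.0] * len(x))
a_rob, b_rob = huber_irls(x, y)
```

In this toy example the ordinary least-squares slope is pulled upward by the single outlier, while the re-weighted fit stays close to the slope of the remaining points, which is the behavior that motivates robust regression for association testing.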

Alternatively, summary GWAS statistics were used to test the association of imputed gene expression from 48 GTEx tissues with KNoRMA using Functional Summary-based Imputation (FUSION) software from TWAS [10]. Briefly, summary GWAS statistics for SNP associations with the CF lung disease phenotype (n = 6,365) and reference linkage-disequilibrium (LD) data from the 1000 Genomes Project were used as input for FUSION, with TWAS predictive models from 48 GTEx v7 human tissues downloaded from the FUSION website (http://gusevlab.org/projects/fusion/). The analysis was performed according to the instructions on the FUSION website.

To leverage information from all tested tissues, meta-analyses of multiple p-values were performed. Since these tissue-specific association tests all started from the same CF GWAS data set, meta-analysis methods for dependent/correlated tests were applied to both the PrediXcan and TWAS results. We then adopted a strategy to compare results from the two independently developed approaches. Multi-tissue tests from each result set were combined by two separate meta-analysis methods: a simple harmonic mean p-value (HMP) [17], and a correlation-adjusted method, specifically, the empirical adaptation of Brown’s method (EBM) [18] for PrediXcan, or the omnibus test [10] for TWAS. A gene was considered a significant modifier on a given analysis platform if its p-value was < 0.01 from both the HMP and the correlation-adjusted method (EBM for PrediXcan, or omnibus for TWAS). Consensus between the 2 result sets (with 4 p-value < 0.01 thresholds) yielded the most robust findings, while the union of significant genes from the 2 result sets maximized sensitivity of discovery. For comparison of numeric outcomes, such as performance of predictive models or imputed gene expression between data sets or tissues, the distribution of correlation R2 among multiple genes was compared to R2 values derived from the null distribution using Fisher’s transformation, through a modified R script originally from the Im lab (https://gist.github.com/hakyim/a925fea01b365a8c605e).
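
The selection rule above can be sketched as follows. This is a minimal illustration, assuming the equal-weight form of the harmonic mean p-value (the raw statistic, without any further calibration) and treating the correlation-adjusted p-value as already computed elsewhere; the function names and the per-tissue p-values are hypothetical.

```python
def harmonic_mean_p(pvalues):
    """Equal-weight harmonic mean of per-tissue p-values (raw HMP statistic)."""
    return len(pvalues) / sum(1.0 / p for p in pvalues)


def platform_hit(hmp_p, adjusted_p, cutoff=0.01):
    """A gene is a hit on one platform if both combined p-values pass the cutoff."""
    return hmp_p < cutoff and adjusted_p < cutoff


# Hypothetical association p-values for one gene in four tissues.
tissue_p = [0.002, 0.04, 0.5, 0.01]
hmp = harmonic_mean_p(tissue_p)

# Hypothetical correlation-adjusted p-value (EBM or omnibus) for the same gene.
is_hit = platform_hit(hmp, adjusted_p=0.004)
```

The harmonic mean is dominated by the smallest p-values, so a gene with a strong signal in a few tissues can still pass even when most tissues show nothing; requiring the correlation-adjusted method to agree guards against that signal being an artifact of dependence between tissues.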

Narrow-sense heritability (h2) of phenotype from imputed GWAS data from unrelated patients was calculated using the GREML-LDMS method [19] from the Genome-wide Complex Trait Analysis (GCTA) software [20], v1.93.0beta.

For hierarchical clustering, signed -log10(p-values), with the sign of the association beta coefficient indicating the direction of expression change, were compiled for genes significantly associated with the disease phenotype across the multiple tissue data sets. Clustering heatmaps were generated using the Bioconductor R package ComplexHeatmap [21] (additional details are provided in S1 Methods in S4 File). Manhattan plots of GWAS data and imputed gene expression phenotype associations were generated using the R packages qqman [22] and ggplot2 [23]. GWAS p-values of relevant SNPs were formatted as bedGraph files and visualized on the UCSC genome browser (http://genome.ucsc.edu/) as custom annotation tracks against the appropriate reference genomes.
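
The heatmap input described above can be sketched in a few lines. This is an illustrative helper only, with a hypothetical name and hypothetical values, assuming p-values strictly between 0 and 1.

```python
import math


def signed_neglog10(p_value, beta):
    """-log10(p) carrying the sign of the association beta coefficient."""
    return math.copysign(-math.log10(p_value), beta)


# Hypothetical gene-by-tissue results: (p-value, beta).
protective = signed_neglog10(0.001, 0.02)   # positive beta -> positive score
harmful = signed_neglog10(0.001, -0.02)     # negative beta -> negative score
```

Encoding direction and significance in one number lets a single diverging color scale show both properties for every gene and tissue.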

Pre-ranked Gene Set Enrichment Analysis [24] against several collections of gene sets and pathways was performed for both the PrediXcan and TWAS platforms using the Bioconductor R package fgsea [25]. The ranks were based on the -log10 of the maximal p-value between the 2 meta-analysis methods applied for each platform. In addition, candidate genes were functionally categorized using Gene Ontology (GO) terms [26] and Reactome annotations [27], coupled with expert review of the literature.
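
The ranking statistic can be sketched as below. This only illustrates how the pre-ranked list is ordered, under the assumption that larger scores rank first; it does not reproduce the fgsea enrichment computation, and the gene names and p-values are hypothetical.

```python
import math


def rank_score(p_hmp, p_adjusted):
    """-log10 of the larger (more conservative) of the two meta-analysis p-values."""
    return -math.log10(max(p_hmp, p_adjusted))


# Hypothetical genes with (HMP p-value, correlation-adjusted p-value).
genes = {
    "GENE_A": (1e-6, 1e-4),
    "GENE_B": (0.02, 0.3),
    "GENE_C": (0.001, 0.0005),
}

# Most significant gene first.
ranked = sorted(genes, key=lambda g: rank_score(*genes[g]), reverse=True)
```

Taking the maximal p-value means a gene ranks highly only when both meta-analysis methods support it.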

Results

Predictive models for genetic regulation of gene expression using training data from CF cohorts

To build predictive models of genetic regulation of gene expression with training data from CF patients, we adapted the PredictDB_Pipeline for GTEx_v7 to work with CF genotype and gene expression data from both the LCL [11] and nasal epithelial biopsy [12] data sets. The performance of the predictive models was evaluated by the correlations between predicted and observed gene expression, and genes were filtered at the minimal performance suggested by PredictDB. The number of imputable genes (defined by prediction R2 > 0.01 and p-value < 0.05), including protein-coding, lincRNA, and pseudogenes, was 2,881 from the nasal epithelial biopsy data set of 132 training samples, and 5,299 from the LCL data set of 753 samples. As shown in S1 Fig in S4 File, the predicted vs observed R2 from both data sets is significantly higher than expected from the null distribution, with average R2 of 0.11 and 0.072 for imputable genes from the nasal epithelial biopsy and LCL models, respectively, comparable to reported models based on GTEx data sets [9]. These R2 values suggest the existence of a substantial number of genes whose expression can be partially explained by genetic variants. The degree of R2 deviation from null between the nasal epithelial biopsy (n = 132) and LCL (n = 753) models reflects the sample size difference between them, since sample size and quality of training data are critical factors that determine the performance of the predictive models and the number of predictable genes [10]. Our nasal epithelial biopsy models are comparable to GTEx RNA-seq data sets from PrediXcan, while our LCL microarray data set yielded fewer imputable genes than expected (S2 Fig in S4 File).
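
The imputability filter stated above (prediction R2 > 0.01 and p-value < 0.05) amounts to a simple two-condition test per gene model. The sketch below uses hypothetical gene names and performance values and is not part of the PredictDB pipeline.

```python
def is_imputable(r2, p_value, min_r2=0.01, max_p=0.05):
    """Keep a gene model only if it clears both performance thresholds."""
    return r2 > min_r2 and p_value < max_p


# Hypothetical prediction performance for three gene models: (R2, p-value).
models = {
    "GENE_A": (0.110, 1e-4),   # passes both thresholds
    "GENE_B": (0.008, 0.03),   # R2 too low
    "GENE_C": (0.050, 0.20),   # p-value too high
}

imputable = [gene for gene, (r2, p) in models.items() if is_imputable(r2, p)]
```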

We investigated correlations of our CF LCL model predictions with those of GTEx on the same set of patients. The numbers of imputed genes that passed the respective prediction filters are 5,299 from CF LCL and 3,039 from GTEx Cells_EBV-transformed_lymphocytes (i.e., LCLs), with an overlap of 1,623 genes by ENSEMBL gene_id. The correlations of the 1,623 genes between the 2 data sets were calculated and compared to the expected R2 distribution from null (S3 Fig in S4 File). The mean R2 value among the 1,623 genes is 0.51, i.e., the two imputed gene expression data sets are highly correlated, suggesting similar genetic regulation of gene expression in the same cell type in independent training data sets. Also, as reported, there is significant cross-predictability of the models between different tissues [9], and the correlation between imputed gene expression from CF LCLs and GTEx lung tissue, among 2,552 genes predicted in both data sets, is also significantly above null, with a mean R2 of 0.40 (S3 Fig in S4 File).

Association of genetically regulated gene expression to CF lung disease severity

Association testing of imputed gene expression from a total of 51 tissues (2 CF, 48 GTEx, and DGN whole blood) was performed using robust regression against the quantitative lung function score, KNoRMA, and results from all tissues were used in the meta-analyses as described in Methods (Fig 1). The meta-analyses resulted in 245 candidate modifier genes from PrediXcan by consistent p-value < 0.01 from 2 meta-analyses (HMP.PrediXcan, EBM.PrediXcan) and 186 candidate genes utilizing GWAS summary statistics and TWAS/FUSION meta-analyses (HMP.TWAS, OMNIBUS.TWAS), giving a combined candidate list of 379 unique genes (S1 File). Using a threshold of p-value < 0.01 across all 4 meta-analyses, 52 consensus CF lung disease modifier genes were defined (Figs 2 and 3, Table 1). Several key features of these 52 consensus genes are highlighted in Fig 2. First, there is general agreement between PrediXcan (left panel) and TWAS (right panel) in terms of direction (color) and strength (intensity) of the association of imputed gene expression with lung disease severity. Second, more than half (28 out of 52) of the consensus genes are located within 1 Mb of the 5 autosomal GWAS signals. Third, the direction of the predicted effect of gene expression as it relates to the lung disease phenotype varies across genes (blue versus red) and is relatively consistent across tissues, with rare exceptions (discussed below). Fourth, the association signal is often centered around GWAS loci, with genes imputed across many tissues, although there are exceptions. Many of these genes have relevance to known features of CF pathogenesis (see citations in Table 1), and the direction of imputed gene expression change reflects the direction of alleles and the prediction weights of SNPs in the predictive models.

Among the 52 consensus modifier genes, the correlation coefficient between the average effect sizes across multiple tissues from PrediXcan and TWAS is r = 0.83 (R2 = 0.69, S4B Fig in S4 File), while that between the maximal multi-tissue p-values of PrediXcan and TWAS is r = 0.68 (R2 = 0.46, S4C Fig in S4 File). As shown by the color of the heatmaps in Fig 2, most of the consensus modifier genes show a similar direction of change relative to KNoRMA across multiple tissues, with the strongest signals from the chr5 and chr6 GWAS loci, such as EXOC3 and HLA-DRB1, respectively. However, there are some exceptions, such as TPPP and MET, where genetically regulated expression associates with KNoRMA in different directions in different tissues. For example, TPPP is predicted to be increased in milder patients (higher KNoRMA values) from both GTEx and DGN whole blood, while the opposite is predicted from other tissues.
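
The reported counts and squared correlations can be checked with simple arithmetic. The sketch below assumes that the 52 consensus genes are exactly the overlap of the PrediXcan and TWAS candidate lists, which follows from the all-four-thresholds definition; the variable and function names are illustrative.

```python
# Candidate counts reported for each platform and for the consensus set.
predixcan_hits = 245
twas_hits = 186
consensus = 52

# Inclusion-exclusion: genes in either list, counting the overlap once.
union = predixcan_hits + twas_hits - consensus


def r_squared(r):
    """Squared correlation coefficient, rounded to two decimals."""
    return round(r * r, 2)


effect_size_r2 = r_squared(0.83)   # average effect sizes, PrediXcan vs TWAS
max_p_r2 = r_squared(0.68)         # maximal multi-tissue p-values
```

The union reproduces the combined list of 379 unique genes, and the squared coefficients match the R2 values quoted in the text.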

Fig 2. Hierarchical clustering of genes whose imputed expression are associated with CF lung disease severity.

Consensus modifier genes (n = 52) were defined by p-value < 0.01 from all 4 meta-analyses of multiple tissue association testing described in Methods, and the -log10(p-values) were clustered and represented as a heatmap with a red-grey-blue color scale. The color represents the direction of predicted expression change, with red indicating “protective”, or increased expression with increasing KNoRMA (milder lung disease), and blue indicating “harmful”, or increased expression with decreasing KNoRMA (more severe lung disease); the intensity reflects the significance (p-value) of the association. White cells in the heatmap indicate missing data, where the genes were not well predicted from the relevant tissues. The vertical color columns on the right indicate the type of gene and the chromosome near GWAS loci. The genes were clustered based on results from PrediXcan (left heatmap), and the order of the genes was kept the same for TWAS (right heatmap). Key patterns of negative and positive associations with KNoRMA across multiple tissues in the heatmap are highlighted by the dashed boxes. Arrows on top of the left heatmap identify the additional tissues beyond the 48 GTEx tissues common to both platforms, and arrows in the middle of the heatmaps show the results from whole blood tissues for TPPP.

Fig 3. Manhattan plots of CF lung disease association p-values from gene expression imputation and GWAS.

Maximal p-values between the 2 meta-analyses of imputed gene expression association with KNoRMA by PrediXcan and TWAS were used in the Manhattan plots A and B, respectively. The 28 consensus modifier genes within 1 Mb of the 5 autosomal GWAS signals (red squares), and those not near GWAS signals (blue triangles), are labeled. Panel C represents GWAS p-values from the updated imputation [78] by fixed-effect meta-analysis performed according to the GWAS study [1]. The solid lines correspond to the genome-wide significant p-value of 0.01 (for imputed expression, A and B) or 1.25x10^-08 (for GWAS, C), while the dashed lines represent the suggestive p-value of 0.05 (for imputed expression) or 1x10^-06 (for GWAS).

Table 1. Consensus 52 CF lung disease modifier genes.

Gene | Gene type | chr | p-value (max) | Direction* | CF-related citations
A: Genes in regions of GWAS association ordered by chromosome
MUC20 | protein coding | 3 | 8.1x10^-03 | Protective (0.014;2.44) | Mucus barrier
MUC4 | protein coding | 3 | 5.9x10^-03 | Protective (0.011;2.1) | Epithelial membrane mucin; possible regulation by CFTR [28]
SDHAP1 | pseudogene | 3 | 2.3x10^-04 | Harmful (-0.021;-4.1) |
AC069213.1 | pseudogene | 3 | 4.9x10^-03 | Harmful (-0.012;-2.06) |
AC026740.1 | protein coding | 5 | 3.1x10^-04 | Protective (0.01;2.97) |
AHRR | protein coding | 5 | 3.7x10^-03 | Protective (0.003;0.97) | Aryl hydrocarbon receptor [29, 30]
BRD9 | protein coding | 5 | 1.3x10^-04 | Harmful (-0.002;-3.95) | Lysine-acetylated histone binding, chromatin organization; important in small lung cell cancers
C5orf55 | protein coding | 5 | 4.7x10^-05 | Harmful (-0.02;-3.95) | EXOC3 antisense
CCDC127 | protein coding | 5 | 5.8x10^-03 | Harmful (-0.006;-1.83) | Regulates HSP70 gene expression; HSP70 is involved in CFTR processing [31, 32]
CEP72 | protein coding | 5 | 1.8x10^-09 | Protective (0.019;5.66) | Microtubule-organizing, organelle, centrosome; required for cilia formation; microtubules and cilia important for CF pathophysiology [33-39]
CTD-2083E4.5 | pseudogene | 5 | 6.3x10^-03 | Harmful (-0.007;-1.8) |
CTD-2228K2.5 | protein coding | 5 | 1.6x10^-05 | Harmful (-0.01;-2.99) |
EXOC3 | protein coding | 5 | 3.5x10^-06 | Protective (0.028;4.86) | Exocytosis, epithelial polarity; interaction with actin cytoskeletal remodeling and vesicle transport machinery; components of exocyst complex required for intracellular bacteria clearance from cells; regulates MUC5AC secretion induced by neutrophil elastase in human airway epithelial cells [40]
TPPP | protein coding | 5 | 1.0x10^-07 | Harmful (-0.012;-4.08) | Microtubule bundle; microtubules associated with CFTR-related pathogenic processes (see CEP72 above) [41-47]
ZDHHC11 | protein coding | 5 | 9.4x10^-06 | Protective (0.005;4.41) | Palmitoylation, ER, Golgi protein targeting; mediator of DNA virus response [48]
ZDHHC11B | protein coding | 5 | 1.1x10^-04 | Protective (0.003;4.13) | Palmitoylation, ER, Golgi protein targeting
AGER | protein coding | 6 | 6.5x10^-03 | Harmful (-0.007;-2.39) | Associated with pathogen load, inflammation, and hypoxia in CF [49-51]
CYP21A2 | protein coding | 6 | 2.6x10^-03 | Harmful (-0.01;-2.39) | Steroid hydroxylase, congenital adrenal hyperplasia; Cytochrome P450 superfamily; required for the synthesis of steroid hormones including cortisol and aldosterone.
HLA-DQA1 | protein coding | 6 | 1.0x10^-04 | Protective (0.026;3.84) | Ancestral allele 8.1, CF delayed onset infection; potential CF modifier in pancreas and liver [52, 53]
HLA-DQA2 | protein coding | 6 | 2.5x10^-04 | Harmful (-0.049;-4.76) | Ancestral allele 8.1, CF delayed onset infection; highly conserved in contrast to some other HLA genes [54, 55]
HLA-DQB1 | protein coding | 6 | 3.9x10^-04 | Protective (0.04;3.48) | Ancestral allele 8.1, CF delayed onset infection; potential CF modifier in pancreas and liver [52, 53, 56]
HLA-DRB1 | protein coding | 6 | 5.1x10^-05 | Protective (0.024;3.61) | Ancestral allele 8.1, CF delayed onset infection; associated with allergic and T(H)-1 like responses [52, 56-58]
HLA-DRB6 | pseudogene | 6 | 1.1x10^-05 | Harmful (-0.052;-4.67) | Ancestral allele 8.1, CF delayed onset infection
HLA-DRB9 | pseudogene | 6 | 1.8x10^-03 | Harmful (-0.017;-2.77) | Ancestral allele 8.1, CF delayed onset infection
PRRT1 | protein coding | 6 | 5.3x10^-04 | Harmful (-0.01;-2.39) | Post synaptic membrane
PDHX | protein coding | 11 | 3.1x10^-03 | Harmful (-0.011;-2.01) | Mitochondrial glycolysis, congenital lactic acidosis; pyruvate dehydrogenase, an enzyme complex linking glycolysis with downstream oxidative metabolism, represents a key location where regulation of metabolism occurs; PDHX is a key structural component of this complex and is essential for its function; involved in glucose metabolism so associated with oxidative responses
CHP2 | protein coding | 16 | 1.9x10^-03 | Protective (-0.002;0.74) | Cellular pH regulation, plasma membrane Na+/H+ exchangers required as an obligatory binding partner for ion transport
PRKCB | protein coding | 16 | 9.6x10^-03 | Harmful (-0.002;-0.1) | Adaptive immunity, B cell activation; linked to CFTR mRNA expression; regulation of autophagy via sensing of mitochondrial energy status [59, 60]
B: Genes in regions of no prior association (in this cohort of subjects) ordered by chromosome
MYCL | protein coding | 1 | 5.0x10^-03 | Protective (0.006;2.28) | Dysregulation associated with lung and other cancers [61]
AJ239322.1 | lincRNA | 2 | 8.1x10^-03 | Protective (0.007;2.74) |
PLA2R1 | protein coding | 2 | 8.8x10^-03 | Harmful (-0.008;-2.11) | Potential target in asthma [62, 63]
RP11-496H1.2 | lincRNA | 3 | 8.0x10^-03 | Harmful (-0.004;-2.43) |
OSTN | protein coding | 3 | 9.5x10^-03 | Protective (0.005;1.82) |
SLITRK3 | protein coding | 3 | 8.1x10^-03 | Protective (0.002;2.32) | Synaptic membrane adhesion; involved in GABAergic synapse formation; recent evidence of GABAergic control of mucous cell differentiation in human airway epithelium [64, 65]
TAPT1 | protein coding | 4 | 8.7x10^-03 | Harmful (-0.0004;-0.36) | Cilia basal body, centrosome; associated with lung function decline in smokers
DSE | protein coding | 6 | 9.2x10^-04 | Harmful (-0.006;-1.51) | Dermatan sulfate is part of proteoglycans that are involved in many biological processes, such as cancer and immunity; a defect can cause Ehlers-Danlos syndrome, which may lead to hypoplasia of the lung [66, 67]
CDSN | protein coding | 6 | 6.1x10^-04 | Harmful (-0.015;-3.75) | Cell adhesion, skin morphogenesis; epithelial cell differentiation
HLA-S | pseudogene | 6 | 5.9x10^-03 | Harmful (-0.019;-2.5) |
HEATR2 | protein coding | 7 | 5.8x10^-03 | Protective (0.011;2.21) | DNAAF5 (alias), motile cilia, necessary for assembly of the ciliary motile apparatus [68, 69]
MET | protein coding | 7 | 7.2x10^-03 | Harmful (-0.006;-0.92) | Genetic marker, CFTR mutation [70]
RP11-56A10.1 | pseudogene | 8 | 7.4x10^-03 | Harmful (-0.007;-3.16) |
C9orf16 | protein coding | 9 | 9.6x10^-03 | Protective (-0.0001;0.34) |
SMTNL1 | protein coding | 11 | 8.2x10^-03 | Protective (0.022;3.08) | Muscle contraction
OASL | protein coding | 12 | 4.6x10^-03 | Harmful (-0.004;-1.8) | Antiviral, inhibits RSV [71-73]
TFCP2 | protein coding | 12 | 2.7x10^-03 | Harmful (-0.003;-2.56) | Transcription factor, alpha-globin, inflammatory response
TMEM30B | protein coding | 14 | 9.9x10^-03 | Harmful (-0.002;-0.61) | Phospholipid translocation
MTFMT | protein coding | 15 | 5.6x10^-03 | Harmful (-0.003;-1.61) | Mitochondrial translation, required for mitochondrial function/oxidative phosphorylation
RP11-491F9.8 | lincRNA | 16 | 7.5x10^-03 | Harmful (-0.015;-3.25) |
MYL4 | protein coding | 17 | 8.7x10^-03 | Harmful (-0.005;-2.69) | Actin filament binding, atrial fibrillation
HDHD2 | protein coding | 18 | 3.9x10^-03 | Protective (0.003;1.51) |
DESI1 | protein coding | 22 | 2.6x10^-03 | Harmful (-0.009;-2.84) | Proteolysis; desumoylating isopeptidase; SUMO paralogues determine fate of wild-type and mutant CFTR protein [74]
TMPRSS6 | protein coding | 22 | 3.6x10^-03 | Harmful (-0.0004;-0.36) | AKA matriptase-2, variants associated with iron refractory iron deficiency anemia [75]

*Direction defined as: Harmful (PrediXcan beta coefficient; TWAS z-score): increased expression correlated with worse lung disease (decreased KNoRMA); or Protective (PrediXcan beta coefficient; TWAS z-score): increased expression correlated with milder lung disease (increased KNoRMA).

As expected from published PrediXcan and TWAS applications to other diseases [76, 77], many genes associated with CF lung disease severity are around the reported genome-wide significant loci from GWAS (red squares in Fig 3, and Table 1A), but there are also significant genes elsewhere (blue triangles in Fig 3, and Table 1B), including MET ~700 kb upstream of CFTR on chr7, TAPT1 on chr4, and HEATR2 on chr7 to name a few. This provides evidence for significant association with SNPs outside the GWAS significant loci and/or combinatorial signals from the multiple SNPs used in predictive models. Further, the genome-wide significant signal by fixed-effect meta-analysis p-value on chr16 (Fig 3C, S5 Fig in S4 File), which was not reported in the GWAS publication due to multiple hypothesis testing penalty [1], was brought to attention by gene expression imputation for CHP2 and PRKCB (Fig 3A and 3B).

To globally compare GWAS association with imputed expression association, available SNP GWAS association p-values for the cis-SNPs used as predictive variables, were retrieved for all imputable genes of PrediXcan predictive models of all 48 GTEx tissues. Minimal SNP p-values in predictive models of a gene were compared to the maximal association p-value between HMP.PrediXcan and EBM.PrediXcan for the same gene to CF lung disease severity from imputed expression (Fig 4). The correlation coefficient of the minimal GWAS -log10 p-values with PrediXcan maximal association p-values over the > 25,000 imputable genes is highly significant, with r = 0.19 (R2 = 0.036, Fig 4). Similarly, mean SNP GWAS p-value and imputed expression p-value among these genes are also significantly correlated with r = 0.13 (R2 = 0.017, S6 Fig in S4 File). As indicated above, examples of significant associations from imputed gene expression from regions where no genome-wide significant SNPs were identified from the GWAS include DESI1, HEATR2, OASL, SLITRK3, TAPT1, etc. (Fig 3, and Table 1B).

Fig 4. Correlation of imputed gene expression association from PrediXcan and minimal GWAS association p-values.

Maximal p-values between HMP and EBM meta-analyses of CF lung disease associations from imputed gene expression (PrediXcan) for 26,750 genes from 48 GTEx tissues are plotted against minimal GWAS SNP p-values per gene among all cis-SNPs used in predictive models. The 52 consensus modifier genes are highlighted in red squares (near GWAS loci) and blue triangles (novel), while genes with minimal GWAS SNP p-values < x10-08 (dashed vertical line), but not among the 52, are highlighted in black diamonds. Solid line represents linear regression.

The integration of SNP association to lung disease phenotype (GWAS) and imputed eQTL signals can be illustrated by examining the SNPs utilized in the models to predict expression for the chr11 locus, as shown in Fig 5 (and S7 Fig in S4 File). Combining predictive variables (SNPs) from multiple GTEx tissue models, and among SNPs with significant GWAS p-values of < x10-07 [top annotation track in Fig 5 (zoom-in view), S7 Fig in S4 File (full region)], only 1 SNP (among 50 in all EHF models) was used to impute EHF expression, and only 2 SNPs (among 759 in all APIP models) were used for APIP. In contrast, 20 of the significant SNPs were predictive for PDHX, which in turn translated into significant lung disease associations of imputed gene expression for PDHX (Figs 2 and 3, and Table 1), but not EHF and APIP, even though EHF and APIP are closest to the GWAS signal. Similarly, imputed eQTL data help to point to genes regulated by SNPs at other regions (S8-S12 Figs in S4 File) and suggest the direction of genetically regulated expression change in regard to phenotype trait (Table 1).
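The SNP counting in this paragraph amounts to a set intersection between GWAS-significant SNPs and the union of model SNPs across tissues. The following is a minimal sketch under stated assumptions: the function name, data layout, SNP identifiers, and the cutoff in the toy call are all hypothetical, and the significance threshold is left as a parameter rather than taken from the text.

```python
def count_significant_model_snps(gwas_pvalues, tissue_models, threshold):
    """Count GWAS-significant SNPs among the predictors of each gene.

    gwas_pvalues: {snp_id: GWAS p-value}
    tissue_models: {gene: {tissue: set of snp_ids used in that tissue model}}
    Returns {gene: (significant predictors, all predictors across tissues)}.
    """
    significant = {snp for snp, p in gwas_pvalues.items() if p < threshold}
    counts = {}
    for gene, models in tissue_models.items():
        # Union of predictor SNPs over every tissue model of this gene.
        all_snps = set().union(*models.values())
        counts[gene] = (len(all_snps & significant), len(all_snps))
    return counts


# Hypothetical toy data; identifiers and the cutoff are placeholders.
toy_gwas = {"rs1": 1e-9, "rs2": 0.3, "rs3": 1e-8, "rs4": 0.04}
toy_models = {
    "GENE_A": {"tissue1": {"rs1", "rs2"}, "tissue2": {"rs2", "rs4"}},
    "GENE_B": {"tissue1": {"rs1", "rs3"}},
}
toy_counts = count_significant_model_snps(toy_gwas, toy_models, threshold=1e-6)
print(toy_counts)
```

A gene whose models draw on few of the significant SNPs, as reported for EHF and APIP, would show a small first element relative to a gene such as PDHX.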

Fig 5. Comparison of predictive model SNPs at chromosome 11 GWAS locus.


The -log10 p-values from GWAS analysis were retrieved for cis-SNPs in viable PrediXcan predictive models from 48 GTEx tissues for EHF, APIP, and PDHX. These p-values were formatted as bedGraph files and displayed through the UCSC genome browser (http://genome.ucsc.edu/) as custom annotation tracks, with vertical scales set between 0 and 10. The screenshot of the genome browser shows from top to bottom: GWAS SNP p-values, SNPs used in EHF gene expression imputation model, those for APIP, PDHX, and gene annotation from NCBI RefSeq genes.
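As an illustration of the bedGraph formatting step, the sketch below converts SNP positions and p-values into a custom track with the vertical scale fixed between 0 and 10. It assumes 1-based SNP positions as input (bedGraph intervals are 0-based, half-open) and uses the standard `viewLimits` and `autoScale` track-line attributes; the function name, track name, and coordinates are hypothetical and not taken from the study.

```python
import math


def to_bedgraph_lines(snps, track_name, view_max=10):
    """Format (chrom, 1-based position, p-value) tuples as bedGraph lines."""
    header = (
        f'track type=bedGraph name="{track_name}" '
        f"viewLimits=0:{view_max} autoScale=off"
    )
    lines = [header]
    for chrom, pos, p_value in sorted(snps, key=lambda s: (s[0], s[1])):
        score = -math.log10(p_value)
        # A single-base SNP at 1-based position pos spans [pos - 1, pos).
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{score:.3f}")
    return lines


# Hypothetical coordinates and p-values, for illustration only.
example_lines = to_bedgraph_lines(
    [("chr11", 200, 0.01), ("chr11", 100, 1e-8)], track_name="GWAS"
)
print("\n".join(example_lines))
```

Writing one such file per gene (the predictor SNPs of each model set) alongside one file for all GWAS SNPs would give stacked tracks comparable to the layout described in the caption.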

Gene set enrichment analyses and functional categories of candidate CF lung disease modifier genes

Gene set (pathway) enrichment analyses (GSEA) were performed based on protein-coding genes pre-ranked by the maximal p-value between the 2 multi-tissue meta-analyses for each analysis platform, PrediXcan and TWAS. Since all imputed protein-coding genes of PrediXcan (n = 16,431) and TWAS (n = 13,685) were ranked, GSEA can uncover concerted association of gene set or pathway members with CF lung disease (S1, S2 Tables in S4 File). Apart from the usual suspects of immune and vesicle trafficking processes and pathways reported in previous publications, including a large number of pathways dominated by HLA genes [11, 12, 79, 80], some highly specific, pathogenically relevant processes were also enriched, with examples of “Interferon-gamma-mediated signaling pathway” from GO biological process, “Defective CFTR causes cystic fibrosis” and “Antimicrobial peptides” from Reactome pathway, and “Asthma” from KEGG pathway shown in Fig 6 (and in S1, S2 Tables in S4 File).

Fig 6. Gene set enrichment plots.


Gene set enrichment analyses (GSEA) were performed and enrichment plots were generated for selected gene sets using the Bioconductor R package fgsea. For each enrichment plot, the horizontal black line at the bottom represents the p-value ranks of protein-coding genes, with the most significant p-value rank on the left. The vertical bars represent individual genes in a gene set and their ranks. The green curves represent the cumulative enrichment score (ES), and the red horizontal dashed lines denote minimal (often 0) and maximal scores. Listed genes represent the leading edge with increasing ES, which contributes to the overall enrichment of the gene set. Panels A and C are GSEA results from the PrediXcan platform, while B and D are from TWAS. The gene sets shown are from GO biological process (A), and Biosystems (C–KEGG; B, D–Reactome).
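The cumulative enrichment score traced by the green curve can be illustrated with a simplified running-sum calculation. This is a sketch of the general pre-ranked GSEA statistic, not the fgsea implementation: it assumes genes are ordered from most to least significant with a non-negative ranking statistic such as -log10 p, and it omits permutation-based significance and normalization. Function and variable names are hypothetical.

```python
def running_enrichment(ranked_genes, stats, gene_set, weight=1.0):
    """Return (ES, running-sum curve, leading-edge genes) for one gene set.

    ranked_genes: genes ordered from most to least significant.
    stats: ranking statistic per gene, in the same order.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = hits.count(False)
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(stats, hits)]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, curve = 0.0, []
    for is_hit, w in zip(hits, hit_weights):
        # Step up at set members (weighted), step down at non-members.
        running += w / total_hit_weight if is_hit else -1.0 / n_miss
        curve.append(running)

    es = max(curve, key=abs)  # largest deviation from zero
    peak = curve.index(es)
    if es >= 0:
        edge = [g for g, h in zip(ranked_genes[: peak + 1], hits[: peak + 1]) if h]
    else:
        edge = [g for g, h in zip(ranked_genes[peak + 1:], hits[peak + 1:]) if h]
    return es, curve, edge


# Toy ranking with placeholder gene names and statistics.
toy_ranked = ["A", "B", "C", "D", "E", "F"]
toy_stats = [6, 5, 4, 3, 2, 1]
toy_es, toy_curve, toy_edge = running_enrichment(toy_ranked, toy_stats, {"A", "B", "E"})
print(round(toy_es, 3), toy_edge)
```

In the toy ranking the two set members at the top drive the curve to its maximum, so they form the leading edge, while the third member further down the list contributes after the peak.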

Alternatively, we looked for overlaps between the 379 potential candidate modifiers of CF lung disease (described above) and CF-relevant biological categories, many of which are represented in the GSEA results. Using GO and Reactome annotations, coupled to key functional categories identified with CF relevance (Table 1), we classified 149 of the 379 candidate genes into 11 functional categories (Table 2).

Table 2. Functional categories of significant genes (n = 149 out of 379) relevant to CF pathophysiology*.

Category: Genes
Immunity/infection/inflammation: AGER, AHRR, EXOC3, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRB1, MET, MUC20, MUC4, OASL, PRKCB, TFCP2; ADAM, AMBP, AP1S1, ATP6V0D2, AZU1, BPIFA1, BPIFB1, BTNL2, C2, CEACAM6, CFH, DDX60, EFNB3, FGF20, FRK, GAN, HLA-B, HLA-DQB2, HLA-DRA, IGSF5, JMJD6, LCN2, METTL7A, MEX3C, MME, NDC1, NFAM1, NPY5R, ORMDL3, PIK3R2, PRG2, RAC2, RORC, SLC3A2, SLFN13, SMAD4, SPG21, TFRC, TREX1, UBE2Z, VAV3, YTHDF2, ZFP36L2, ZYX
Mucociliary clearance: C5orf55, CEP72, EXOC3, HEATR2, MUC20, MUC4, SLITRK3, TAPT1, TPPP; AK8, ARL3, CEP120, ICK, IFT74, MYO3B, NUBP1, PROM1
Glycosylation: AGER, MUC20, MUC4; A4GALT, ARFGAP3, GOSR1, NOTCH4, PIGO, PIGW, SERP1, ST3GAL6, TRAPPC2L, XXYLT1
Viral/virus: HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRB1, OASL; AMBP, ATP6V0D2, AZU1, BPIFA1, CFH, DDX39B, DDX60, EFNB3, HLA-B, HLA-DRA, LCN2, NDC1, PIK3R2, RAC2, RPS10, SLFN13, STMN1, TFRC, TREX1, ZYX
Mitochondria: MTFMT, PDHX; BIK, DDAH2, HIGD2A, HRK, MMAA, MTFR1L, MTG1, MYO19, NDUFAF6, NRF1, RAC2, SDHA, TARS2, TDRKH, TIMM10
ER/Golgi: DSE, EXOC3, TAPT1, TMEM30B, ZDHHC11, ZDHHC11B; A4GALT, AKR7A2, AP1S1, ARFGAP3, ARL3, BSCL2, CPD, CUX2, GOSR1, IER3IP1, METTL7A, NOTCH4, ORMDL3, PIK3R2, SERP1, STC2, TFRC, TRAPPC2L, XXYLT1
Ubiquitination: GAN, GNA12, MEX3C, PIAS2, SMAD4, TNK2, UBE2Q2P1, UBE2Z, UFD1L
Lipid: AHRR, CYP21A2, PLA2R1, TMEM30B, ZDHHC11, ZDHHC11B; A4GALT, APOC2, BSCL2, CYP21A2, FADS3, GLTP, GNA12, JAZF1, LDLRAP1, MED19, MMAA, NCOA3, NRF1, NRIP1, ORMDL3, OSBPL10, PIGO, PIGW, PIK3R2, PLA2R1, PNLIPRP3, SERINC1, SOAT1, THRB, TREX1
CFTR interactome: RAC2, SDHA, TARS2, YTHDF2
Transcription factors: AATF, FOXP2, NCOA3, NEAT1, NRF1, NRIP1, PIAS2, RORC, SMAD4, TFCP2, THRB
Cytoskeleton/microtubule: CEP72, MET, SMTNL1, TAPT1, TPPP; ADD3, ARL3, AUNIP, CEP120, GAN, GAS2L3, GNA12, ICK, IFT74, MAST3, MYO19, NUBP1, PACSIN2, PDLIM3, PIK3R2, POC5, RAC2, SMTNL1, SPATC1L, STMN1, TAPT1, TPPP, VILL, ZYX

*Alphabetical listing of 28 (of 52) consensus genes (between TWAS and PrediXcan, Table 1) near (bold) and outside (underlined) GWAS loci; the remaining genes (n = 121, alphabetically listed) are from the other 327 significant candidate modifier genes (S1 File).

Allele bias of gene expression estimation may confound interpretation of hyper-variable genes, such as HLAs

Many HLA genes appear to be strongly regulated genetically, as reflected by the variance explained (R2) of the predictive models (S3, S4 Tables in S4 File), and HLA-dominated pathways were highly significant in our previous gene expression association studies [11, 12]. However, since gene expression quantification relies on mapping of RNA-seq reads to genome/transcriptome sequences, expression levels may be biased towards the reference allele, especially for the hypervariable HLA genes [81, 82]. To assess the influence of allele bias on gene expression quantification and trait association, we compared different strategies of RNA-seq read mapping using our nasal epithelial biopsy RNA-seq data set. In addition to the standard protocol of mapping to the primary reference genome assembly, we adopted an alternative mapping strategy that included additional alternative genome assemblies, as suggested [82], and incorporated common variant information (http://ccb.jhu.edu/hisat-genotype) from dbSNP v150 (S1 Methods in S4 File). As shown in S13 Fig in S4 File, the correlation and spread of expression estimates for selected HLA Class II genes are similar between AltHapAlignR [82] and default gene counts (S13A-S13D Fig in S4 File), and between alternative mapping FPKM (Fragments Per Kilobase per Million) and standard mapping FPKM (S13E-S13H Fig in S4 File). When the bias-corrected alternative gene expression quantification was used in predictive model building, gene expression imputation, and trait association testing, the results were dramatically different for some genes, such as HLA-DQA1 and HLA-DRB1, for which the direction of predicted expression change in regard to lung function is opposite between the mapping strategies (Fig 7A). The number of genes that can be predicted by cis-SNPs in the bias-corrected training set increased by >1,000 to 4,263 (S3 File), compared to 2,881 genes predicted with the standard protocol (S2 File), with only 1,379 genes overlapping between them. These findings suggest that allele bias associated with commonly employed gene expression estimation pipelines can confound phenotype association testing, resulting in misinterpretation of genetic modulation of phenotype apparently via gene expression regulation.

Fig 7. Effect of allele bias on gene expression quantification and disease phenotype association in CF nasal epithelial biopsy RNA-seq data set.


Panel A compares CF lung disease (KNoRMA) association t statistics between the mapping protocols for the 1,379 commonly imputable genes, using the respective predictive models, among 5,634 unrelated CF patients. HLA genes in A are represented as red triangles; the x-axis represents the standard and the y-axis the alternative mapping protocol. Panels B and C show gene expression quantification by the standard (x-axis) and alternative (y-axis) protocols, in FPKM, for the HLA-DQB1 and HLA-DRB1 genes. Each dot represents 1 sample (out of 132 total); the solid line denotes the linear regression line and the dashed line represents equality.
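Genes whose association changes direction between the two mapping protocols, as in panel A, can be flagged by comparing the signs of their t statistics. The sketch below is illustrative only: the function name, the gene labels, the t values, and the minimum absolute t cutoff are hypothetical assumptions and do not come from the study.

```python
def discordant_genes(t_standard, t_alternative, min_abs_t=2.0):
    """List genes with opposite-sign t statistics under the two protocols.

    t_standard, t_alternative: {gene: association t statistic}
    min_abs_t: hypothetical cutoff so near-zero sign flips are ignored.
    """
    flipped = []
    for gene in sorted(set(t_standard) & set(t_alternative)):
        t_std, t_alt = t_standard[gene], t_alternative[gene]
        opposite = t_std * t_alt < 0
        both_strong = min(abs(t_std), abs(t_alt)) >= min_abs_t
        if opposite and both_strong:
            flipped.append((gene, t_std, t_alt))
    return flipped


# Placeholder values for illustration; not results from the paper.
toy_t_standard = {"GENE_A": 3.1, "GENE_B": -2.5, "GENE_C": 0.4, "GENE_D": 2.2}
toy_t_alternative = {"GENE_A": -2.8, "GENE_B": -2.9, "GENE_C": -0.3, "GENE_E": 1.0}
toy_flipped = discordant_genes(toy_t_standard, toy_t_alternative)
print(toy_flipped)
```

Only genes imputable under both protocols are compared, mirroring the restriction to commonly imputable genes in the figure.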

Discussion

We have applied gene expression imputation to mine the CF gene modifier GWAS data set and extracted 379 potential and 52 consensus CF lung disease modifier candidates. The imputation techniques leveraged GTEx integrative training data sets from 48 human tissues [5], a large RNA-seq data set from whole blood (DGN) [13], and our own CF gene expression data sets from nasal epithelial biopsy [12] and LCL [11] samples. Twenty-eight of the 52 consensus genes are within 1 Mb of the 5 autosomal genome-wide significant loci [1], while 24 consensus modifier genes were not identified in the GWAS. Overall, integration of GWAS with eQTL data through gene expression imputation highlighted some candidate modifier genes (Figs 3 and 4, red squares) and diminished the potential roles of others (Fig 4, black diamonds) around GWAS loci, and also uncovered modifiers outside GWAS loci (Figs 3 and 4, blue triangles). Disease phenotype association testing of the imputed gene expression also predicted the direction of genetically regulated gene expression changes relative to CF lung disease severity, which provides guidance on the mechanism of disease modification and potential intervention strategies. By using independently developed, divergent approaches, we sought to balance sensitivity, by combining the results from multiple tissues and platforms, with robustness, by requiring consensus between PrediXcan and TWAS. The consensus and potential CF lung disease modifier genes were then evaluated in biological context through literature review and gene set enrichment analyses.

The usefulness of defining the relationship of SNP association to the imputed gene expression association to phenotype, deduced through independent eQTL data sets, can be illustrated at the chr11 locus (Fig 5, S7 Fig in S4 File). Although EHF and APIP are the nearest genes to the intergenic chr11 GWAS locus with significant lung disease association p-values, PDHX is best predicted to be regulated by SNPs in the region based on current gene expression data. These results do not rule out developmental and other cell/tissue-specific mechanisms not assessed, by which EHF and APIP may modify CF lung disease process. Nevertheless, PDHX is a critical gene in mitochondrial energy metabolism (OMIM: 245349) that should be investigated further, since many additional candidate modifiers related to mitochondrial function were also identified in this study (Table 2).

Examples at other genomic loci are also informative (S8-S12 Figs in S4 File). The strongest GWAS signals on chr5 supported by gene expression imputation (Fig 3) contain 3 genes, CEP72, TPPP, and EXOC3 (Figs 2 and 3, S9 Fig in S4 File, Table 1), which are involved in microtubule organization and exocytosis. MUC4 and MUC20 are significant at chr3 (S8 Fig in S4 File), and CYP21A2 and HLA Class II genes at chr6 (S10 Fig in S4 File). The locus on chr16 (Fig 3, S5 Fig in S4 File) was of borderline genome-wide significance and did not pass the threshold in the published GWAS [1]. However, the chr16 region contains several genes relevant to CF lung disease, including ERN2, which is involved in ER stress response and mucin production [83], and SCNN1B and SCNN1G, subunits of the epithelial sodium channel (ENaC) that have been suggested to be CF disease modifiers [84]. Over-expression of ENaC channels in SCNN1B transgenic mice has been used as a model of CF lung disease [85], and suppression of ENaC subunit expression is being explored as a therapeutic strategy [86]. Nevertheless, only CHP2 and PRKCB in the chr16 region are consistently associated with CF lung disease by expression imputation (Figs 2 and 3, and Table 1).

Relevance to CF pathogenesis for the candidate modifiers is partly referenced in Table 1, and the full list of the 379 candidate genes often represents functional categories that are also represented at the GWAS significant loci, for example PDHX discussed above (Table 2). Thus, both GWAS loci and non-GWAS loci contain genes that mark functions important in the pathogenesis of CF lung disease, such as immunity/infection/inflammation, virus/viral, and mucociliary clearance; and in CFTR biology, such as cytoskeleton, microtubules, mitochondria, lipid, ubiquitination, and ER and Golgi compartments. Several genes not in GWAS loci, e.g., BPIFA1 [87–90], CEACAM6 [91, 92], and ORMDL3 [93–97], have been implicated directly in CF pathogenesis. Additionally, 4 genes (RAC2, SDHA, TARS2, and YTHDF2) have been reported to be part of the core CFTR interactome [98], so their mechanism of disease modification may partly be attributable to CFTR biogenesis. Another 6 genes (AGER, ELAVL2, HLA-DQB1, JAZF1, MET, and RASSF3) have recently been identified near genetic variants associated with lung function in COPD [99]. Interestingly, 11 genes are among the literature-curated transcription factors (Table 2), which are potential targets for intervention. Among them, FOXP2, together with the nucleotide binding protein NUBP1, has been implicated in distal lung development in mice [100, 101], and NKX2-1/FOXP2 positive progenitor cells can be differentiated into distal alveolar cells [102]. These functional categories are also highly represented in the GSEA results, with >60% of all enriched pathways representing these functional categories (S1, S2 Tables in S4 File). Further, highly similar pathways were observed in previous gene expression association studies [11, 12]. Taken together, these gene expression imputation results are congruent with current concepts of the pathophysiology of CF lung disease. All evidence of pathogenic relevance supports the validity of our data mining approach to uncover new genetic modifiers of CF lung disease severity.

Among the 379 potential (and 52 consensus) modifiers, 92 (and 10) are non-protein-coding genes (S1 File and Table 1). There has been a rapid increase in the identification of non-coding genes in recent years, with the current human genome assembly containing 20,433 protein-coding genes, 17,835 non-coding genes, and 15,952 pseudogenes (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/108/#FeatureCountsStats). There is little doubt that non-coding genes play important roles in biological functions, particularly in gene expression regulation [103–105], and evidence for their roles in CF disease processes is also emerging [106, 107]. The number of non-coding CF modifier genes reported here is likely an underestimate relative to protein-coding genes, due to the reference genome and gene annotations associated with some of the gene expression data sets used in predictive model training, and the general lag in functional knowledge of non-coding transcripts [108]. These are expected to improve over time, and new technologies and studies are required to understand mechanisms of CF disease modification by non-coding genes.

Although our efforts uncovered hundreds of potential candidate modifier genes from the CF GWAS data, this is likely not the whole story of genetic modification of CF lung disease severity, due to limitations of the data and necessary simplifications. The GWAS with imputation can only effectively interrogate common variants, mostly SNPs, and gene expression imputation is currently restricted to autosomal genes due to the complexity of X chromosome gene expression between male and female samples and the apparently random selection of X-inactivation in females [109]. Thus, the GWAS signal for lung function on the X chromosome [1] has not been interrogated. Furthermore, only cis-SNPs within 1 Mb (PrediXcan) or 0.5 Mb (TWAS) of a gene were used in predictive models of gene expression, and the genetic regulation of gene expression was modeled as linear additive effects of potential cis-SNPs. Therefore, modifier genes affected by rare variants were not investigated, and trans-regulation of gene expression was not evaluated. Additionally, some cis-regulation of gene expression may not follow a linear combination (e.g., significant interaction between cis-SNPs), and would not be accurately assessed by current predictive models. Furthermore, the number of genes whose expression can be reliably predicted from genetic variants varied among tissues, ranging from ~2,000 to ~10,000, which in large part can be attributed to training sample sizes [10] (S2 Fig in S4 File). With continued accumulation of tissue samples and improved data quality, e.g., from GTEx, as well as improvements in gene expression quantification and machine learning techniques, we expect to discover more candidate modifier genes of CF lung disease and other CF-related traits.
To estimate the proportion of genetic influence on the CF lung disease phenotype captured by GWAS and gene expression imputation, we calculated heritability (h2) from the imputed GWAS data using the GREML-LDMS method [19] in the Genome-wide Complex Trait Analysis (GCTA) software [20]. The h2 of KNoRMA from GWAS imputation of ~8.3 million SNPs among >5,000 unrelated CF patients is 0.41 (SE = 0.072), while that from the ~1.4 million cis-SNPs used in the combined PrediXcan predictive models from 48 GTEx tissues is 0.33 (SE = 0.061). The difference between the two h2 estimates could potentially reflect missing imputable genes due to small training sample sizes, trans-regulation of gene expression from distant genetic variants, and/or other ways in which genetic variants affect gene function.
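As a rough reading of these two estimates, the arithmetic below expresses the cis-SNP heritability as a fraction of the genome-wide SNP heritability. This ratio is not reported in the paper and is shown only for illustration; the standard errors are deliberately not propagated, since both estimates come from the same patients and are therefore not independent.

```python
# Point estimates quoted in the text (GREML-LDMS, KNoRMA phenotype).
h2_all_snps = 0.41  # ~8.3 million imputed SNPs
h2_cis_snps = 0.33  # ~1.4 million cis-SNPs used in PrediXcan models

# Illustrative summary quantities; not values reported by the authors.
fraction_captured = h2_cis_snps / h2_all_snps
h2_gap = h2_all_snps - h2_cis_snps

print(f"fraction of SNP heritability in model cis-SNPs: {fraction_captured:.2f}")
print(f"heritability not captured by model cis-SNPs: {h2_gap:.2f}")
```

On these point estimates, about four fifths of the SNP heritability lies in SNPs used by the predictive models, with the remainder attributable to the sources listed in the paragraph above.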

The prevailing method of gene expression quantification used in published studies [5, 8, 10, 13] involves mapping of RNA-seq reads to the reference genome/transcriptome assembly, which is biased towards the reference sequences or alleles [82, 110]. This bias is more pronounced for hypervariable genes, such as some HLA genes, which have thousands of allotypes in the general population. When alternative mapping strategies that correct for known variants and include multiple genome assemblies were compared to the commonly used method (S13 Fig in S4 File), some genes (HLA-DQA1, HLA-DRB1) changed direction of association with CF lung disease from imputed gene expression, even though overall disease associations are correlated (Fig 7) among the commonly imputable genes, as described [81, 82]. This indicates that reassessment of gene expression estimates based on HLA alleles in a subset of samples can alter the predictive models, and the subsequent association of imputed expression with disease phenotype, in rare instances. However, the impact of allele-bias correction may be far-reaching, in that significantly more genes were imputed by SNP variants when RNA-seq reads from our nasal epithelial biopsy data set were mapped with bias correction (S2, S3 Files). This impact should be investigated with more data sets to understand genetic regulation of true gene expression.

In summary, we applied the technique of gene expression imputation, leveraging the availability of CF and other eQTL data sets, to mine the CF GWAS data, and uncovered 52 consensus modifier genes for CF lung disease, substantially more than were identified by GWAS alone. Further, we identified an additional 327 potential candidate CF lung disease modifier genes. Some modifier candidates are supported by independent studies, and functional annotations are consistent with our current knowledge of CF lung disease pathogenesis. These candidate modifiers provide potential targets for intervention in the disease process in CF, and in other airway diseases as well.

Supporting information

S1 File

(XLSX)

S2 File

(XLSX)

S3 File

(XLSX)

S4 File

(DOCX)

Acknowledgments

We thank Dr. Nancy J. Cox, Vanderbilt University, Division of Genetic Medicine, Dr. Fred Wright, North Carolina State University, Bioinformatics Research Center, and Dr. Ani W. Manichaikul, University of Virginia, Center for Public Health Genomics, for guidance, advice, and discussion. We would also like to thank Dr. Hae Kyung Im and lab, University of Chicago, Department of Human Genetics, Dr. Alexander Gusev and lab, Harvard University, Dana Farber Cancer Institute, and the Genotype-Tissue Expression (GTEx) project, for making their software tools and databases (PrediXcan and TWAS) open source and publicly available.

Data Availability

All predictive models derived from GTEx human reference data set are publicly available. Gene expression data from CF LCL samples are available from GEO (accession code GSE60690). Gene expression data from CF nasal mucosal epithelial RNAseq samples are uploaded to dbGaP for controlled access for researchers who meet the criteria for access to confidential data (https://view.ncbi.nlm.nih.gov/dbgap-controlled). Data dictionaries and variable summaries are available on the dbGaP FTP site (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs002254/phs002254.v1.p1/). The public summary-level phenotype data may be browsed at the dbGaP study report page (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002254.v1.p1). The summary GWAS data from CF Gene Modifier Consortium studies and summary results of phenotype trait association testing are publicly available at GitHub (https://github.com/danghunccf/CF-GWAS-dataMiningPaper).

Funding Statement

H.D. was supported by Cystic Fibrosis Foundation grant, DANG16I0. M.R.K. was supported by Cystic Fibrosis Foundation grant, KNOWLE00A0. CFF URL: https://www.cff.org/Research/Researcher-Resources/Awards-and-Grants/ The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Corvol H, Blackman SM, Boelle PY, Gallins PJ, Pace RG, Stonebraker JR, et al. Genome-wide association meta-analysis identifies five modifier loci of lung disease severity in cystic fibrosis. Nat Commun. 2015;6:8382. doi:10.1038/ncomms9382
  • 2. Wright FA, Strug LJ, Doshi VK, Commander CW, Blackman SM, Sun L, et al. Genome-wide association and linkage identify modifier loci of lung disease severity in cystic fibrosis at 11p13 and 20q13.2. Nat Genet. 2011;43(6):539–46. doi:10.1038/ng.838
  • 3. Taylor C, Commander CW, Collaco JM, Strug LJ, Li W, Wright FA, et al. A novel lung disease phenotype adjusted for mortality attrition for cystic fibrosis genetic modifier studies. Pediatr Pulmonol. 2011;46(9):857–69. doi:10.1002/ppul.21456
  • 4. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45(6):580–5. doi:10.1038/ng.2653
  • 5. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204–13. doi:10.1038/nature24277
  • 6. Croteau-Chonka DC, Rogers AJ, Raj T, McGeachie MJ, Qiu W, Ziniti JP, et al. Expression Quantitative Trait Loci Information Improves Predictive Modeling of Disease Relevance of Non-Coding Genetic Variation. PLoS One. 2015;10(10):e0140758. doi:10.1371/journal.pone.0140758
  • 7. Vicente CT, Revez JA, Ferreira MAR. Lessons from ten years of genome-wide association studies of asthma. Clin Transl Immunology. 2017;6(12):e165. doi:10.1038/cti.2017.54
  • 8. Gamazon ER, Segre AV, van de Bunt M, Wen X, Xi HS, Hormozdiari F, et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat Genet. 2018;50(7):956–67. doi:10.1038/s41588-018-0154-4
  • 9. Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 2015;47(9):1091–8. doi:10.1038/ng.3367
  • 10. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016;48(3):245–52. doi:10.1038/ng.3506
  • 11. O'Neal WK, Gallins P, Pace RG, Dang H, Wolf WE, Jones LC, et al. Gene expression in transformed lymphocytes reveals variation in endomembrane and HLA pathways modifying cystic fibrosis pulmonary phenotypes. Am J Hum Genet. 2015;96(2):318–28. doi:10.1016/j.ajhg.2014.12.022
  • 12. Polineni D, Dang H, Gallins PJ, Jones LC, Pace RG, Stonebraker JR, et al. Airway Mucosal Host Defense Is Key to Genomic Regulation of Cystic Fibrosis Lung Disease Severity. Am J Respir Crit Care Med. 2018;197(1):79–93. doi:10.1164/rccm.201701-0134OC
  • 13. Battle A, Mostafavi S, Zhu X, Potash JB, Weissman MM, McCormick C, et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 2014;24(1):14–24. doi:10.1101/gr.155192.113
  • 14. Panjwani N, Xiao B, Xu L, Gong J, Keenan K, Lin F, et al. Improving imputation in disease-relevant regions: lessons from cystic fibrosis. NPJ Genom Med. 2018;3:8. doi:10.1038/s41525-018-0047-6
  • 15. Marazzi A, Joss J, Randriamiharisoa A. Algorithms, routines, and S functions for robust statistics: the FORTRAN library ROBETH with an interface to S-PLUS. Pacific Grove, Calif.: Wadsworth & Brooks/Cole Advanced Books & Software; 1993. xii, 436 p.
  • 16. Venables WN, Ripley BD. Modern applied statistics with S. 4th ed. New York: Springer; 2002. xi, 495 p.
  • 17. Wilson DJ. The harmonic mean p-value for combining dependent tests. bioRxiv. 2018.
  • 18. Poole W, Gibbs DL, Shmulevich I, Bernard B, Knijnenburg TA. Combining dependent P-values with an empirical adaptation of Brown's method. Bioinformatics. 2016;32(17):i430–i6. doi:10.1093/bioinformatics/btw438
  • 19. Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AA, Lee SH, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet. 2015;47(10):1114–20. doi:10.1038/ng.3390
  • 20. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–9. doi:10.1038/ng.608
  • 21. Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32(18):2847–9. doi:10.1093/bioinformatics/btw313
  • 22. Turner SD. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. bioRxiv. 2014.
  • 23. Wickham H. ggplot2. New York, NY: Springer Science+Business Media, LLC; 2016.
  • 24. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–50. doi:10.1073/pnas.0506580102
  • 25. Sergushichev A. An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation. bioRxiv. 2016:060012.
  • 26. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9. doi:10.1038/75556
  • 27. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 2018;46(D1):D649–D55. doi:10.1093/nar/gkx1132
  • 28. Singh AP, Chauhan SC, Andrianifahanana M, Moniaux N, Meza JL, Copin MC, et al. MUC4 expression is regulated by cystic fibrosis transmembrane conductance regulator in pancreatic adenocarcinoma cells via transcriptional and post-translational mechanisms. Oncogene. 2007;26(1):30–41. doi:10.1038/sj.onc.1209764
  • 29. Kodal JB, Kobylecki CJ, Vedel-Krogh S, Nordestgaard BG, Bojesen SE. AHRR hypomethylation, lung function, lung function decline and respiratory symptoms. Eur Respir J. 2018;51(3). doi:10.1183/13993003.01512-2017
  • 30. Puccetti M, Paolicelli G, Oikonomou V, De Luca A, Renga G, Borghi M, et al. Towards targeting the aryl hydrocarbon receptor in cystic fibrosis. Mediators Inflamm. 2018;2018:1601486. doi:10.1155/2018/1601486
  • 31. Saito Y, Nakagawa T, Kakihana A, Nakamura Y, Nabika T, Kasai M, et al. Yeast two-hybrid and one-hybrid screenings identify regulators of hsp70 gene expression. J Cell Biochem. 2016;117(9):2109–17. doi:10.1002/jcb.25517
  • 32. Young JC. The role of the cytosolic HSP70 chaperone system in diseases caused by misfolding and aberrant trafficking of ion channels. Dis Model Mech. 2014;7(3):319–29. doi:10.1242/dmm.014001
  • 33. Bodas M, Mazur S, Min T, Vij N. Inhibition of histone-deacetylase activity rescues inflammatory cystic fibrosis lung disease by modulating innate and adaptive immune responses. Respir Res. 2018;19(1):2. doi:10.1186/s12931-017-0705-8
  • 34. Edelman A. Cytoskeleton and CFTR. Int J Biochem Cell Biol. 2014;52:68–72. doi:10.1016/j.biocel.2014.03.018
  • 35. Kido J, Shimohata T, Amano S, Hatayama S, Nguyen AQ, Sato Y, et al. Cystic fibrosis transmembrane conductance regulator reduces microtubule-dependent Campylobacter jejuni invasion. Infect Immun. 2017;85(10). doi:10.1128/IAI.00311-17
  • 36. Rymut SM, Harker A, Corey DA, Burgess JD, Sun H, Clancy JP, et al. Reduced microtubule acetylation in cystic fibrosis epithelial cells. Am J Physiol Lung Cell Mol Physiol. 2013;305(6):L419–31. doi:10.1152/ajplung.00411.2012
  • 37. Rymut SM, Ivy T, Corey DA, Cotton CU, Burgess JD, Kelley TJ. Role of exchange protein activated by cAMP 1 in regulating rates of microtubule formation in cystic fibrosis epithelial cells. Am J Respir Cell Mol Biol. 2015;53(6):853–62. doi:10.1165/rcmb.2014-0462OC
  • 38. Stowe TR, Wilkinson CJ, Iqbal A, Stearns T. The centriolar satellite proteins Cep72 and Cep290 interact and are required for recruitment of BBS proteins to the cilium. Mol Biol Cell. 2012;23(17):3322–35. doi:10.1091/mbc.E12-02-0134
  • 39. Rymut SM, Kampman CM, Corey DA, Endres T, Cotton CU, Kelley TJ. Ibuprofen regulation of microtubule dynamics in cystic fibrosis epithelial cells. Am J Physiol Lung Cell Mol Physiol. 2016;311(2):L317–27. doi:10.1152/ajplung.00126.2016
  • 40. Li Q, Li N, Liu CY, Xu R, Kolosov VP, Perelman JM, et al. Ezrin/Exocyst complex regulates mucin 5AC secretion induced by neutrophil elastase in human airway epithelial cells. Cell Physiol Biochem. 2015;35(1):326–38. doi:10.1159/000369699
  • 41. Bodas M, Mazur S, Min T, Vij N. Inhibition of histone-deacetylase activity rescues inflammatory cystic fibrosis lung disease by modulating innate and adaptive immune responses. Respir Res. 2018;19(1):2. doi:10.1186/s12931-017-0705-8
  • 42. Edelman A. Cytoskeleton and CFTR. Int J Biochem Cell Biol. 2014;52:68–72. doi:10.1016/j.biocel.2014.03.018
  • 43. Kido J, Shimohata T, Amano S, Hatayama S, Nguyen AQ, Sato Y, et al. Cystic Fibrosis Transmembrane Conductance Regulator Reduces Microtubule-Dependent Campylobacter jejuni Invasion. Infect Immun. 2017;85(10). doi:10.1128/IAI.00311-17
  • 44. Rymut SM, Harker A, Corey DA, Burgess JD, Sun H, Clancy JP, et al. Reduced microtubule acetylation in cystic fibrosis epithelial cells. Am J Physiol Lung Cell Mol Physiol. 2013;305(6):L419–31. doi:10.1152/ajplung.00411.2012
  • 45.Rymut SM, Ivy T, Corey DA, Cotton CU, Burgess JD, Kelley TJ. Role of Exchange Protein Activated by cAMP 1 in Regulating Rates of Microtubule Formation in Cystic Fibrosis Epithelial Cells. Am J Respir Cell Mol Biol. 2015;53(6):853–62. 10.1165/rcmb.2014-0462OC [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Stowe TR, Wilkinson CJ, Iqbal A, Stearns T. The centriolar satellite proteins Cep72 and Cep290 interact and are required for recruitment of BBS proteins to the cilium. Mol Biol Cell. 2012;23(17):3322–35. 10.1091/mbc.E12-02-0134 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Young JC. The role of the cytosolic HSP70 chaperone system in diseases caused by misfolding and aberrant trafficking of ion channels. Dis Model Mech. 2014;7(3):319–29. 10.1242/dmm.014001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Liu Y, Zhou Q, Zhong L, Lin H, Hu MM, Zhou Y, et al. ZDHHC11 modulates innate immune response to DNA virus by mediating MITA-IRF3 association. Cell Mol Immunol. 2018;15(10):907–16. 10.1038/cmi.2017.146 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Beucher J, Boelle PY, Busson PF, Muselet-Charlier C, Clement A, Corvol H, et al. AGER -429T/C is associated with an increased lung disease severity in cystic fibrosis. PLoS One. 2012;7(7):e41913 10.1371/journal.pone.0041913 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Iannitti RG, Casagrande A, De Luca A, Cunha C, Sorci G, Riuzzi F, et al. Hypoxia promotes danger-mediated inflammation via receptor for advanced glycation end products in cystic fibrosis. Am J Respir Crit Care Med. 2013;188(11):1338–50. 10.1164/rccm.201305-0986OC [DOI] [PubMed] [Google Scholar]
  • 51.Mulrennan S, Baltic S, Aggarwal S, Wood J, Miranda A, Frost F, et al. The role of receptor for advanced glycation end products in airway inflammation in CF and CF related diabetes. Sci Rep. 2015;5:8931 10.1038/srep08931 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Laki J, Laki I, Nemeth K, Ujhelyi R, Bede O, Endreffy E, et al. The 8.1 ancestral MHC haplotype is associated with delayed onset of colonization in cystic fibrosis. Int Immunol. 2006;18(11):1585–90. 10.1093/intimm/dxl091 [DOI] [PubMed] [Google Scholar]
  • 53.Trouve P, Genin E, Ferec C. In silico search for modifier genes associated with pancreatic and liver disease in Cystic Fibrosis. PLoS One. 2017;12(3):e0173822 10.1371/journal.pone.0173822 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Rudy G, Lew AM. Limited polymorphism of the HLA-DQA2 promoter and identification of a variant octamer. Hum Immunol. 1994;39(3):225–9. 10.1016/0198-8859(94)90264-x [DOI] [PubMed] [Google Scholar]
  • 55.Rudy GB, Lew AM. The nonpolymorphic MHC class II isotype, HLA-DQA2, is expressed on the surface of B lymphoblastoid cells. J Immunol. 1997;158(5):2116–25. [PubMed] [Google Scholar]
  • 56.Muro M, Mondejar-Lopez P, Moya-Quiles MR, Salgado G, Pastor-Vivero MD, Lopez-Hernandez R, et al. HLA-DRB1 and HLA-DQB1 genes on susceptibility to and protection from allergic bronchopulmonary aspergillosis in patients with cystic fibrosis. Microbiol Immunol. 2013;57(3):193–7. 10.1111/1348-0421.12020 [DOI] [PubMed] [Google Scholar]
  • 57.Polineni D, Dang H, Gallins PJ, Jones LC, Pace RG, Stonebraker JR, et al. Airway mucosal host defense is key to genomic regulation of cystic fibrosis lung disease severity. Am J Respir Crit Care Med. 2018;197(1):79–93. 10.1164/rccm.201701-0134OC [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Koehm S, Slavin RG, Hutcheson PS, Trejo T, David CS, Bellone CJ. HLA-DRB1 alleles control allergic bronchopulmonary aspergillosis-like pulmonary responses in humanized transgenic mice. J Allergy Clin Immunol. 2007;120(3):570–7. 10.1016/j.jaci.2007.04.037 [DOI] [PubMed] [Google Scholar]
  • 59.Kang-Park S, Dray-Charier N, Munier A, Brahimi-Horn C, Veissiere D, Picard J, et al. Role for PKC alpha and PKC epsilon in down-regulation of CFTR mRNA in a human epithelial liver cell line. J Hepatol. 1998;28(2):250–62. 10.1016/0168-8278(88)80012-6 [DOI] [PubMed] [Google Scholar]
  • 60.Patergnani S, Marchi S, Rimessi A, Bonora M, Giorgi C, Mehta KD, et al. PRKCB/protein kinase C, beta and the mitochondrial axis as key regulators of autophagy. Autophagy. 2013;9(9):1367–85. 10.4161/auto.25239 [DOI] [PubMed] [Google Scholar]
  • 61.Masso-Valles D, Beaulieu ME, Soucek L. MYC, MYCL and MYCN as therapeutic targets in lung cancer. Expert Opin Ther Targets. 2020. 10.1080/14728222.2020.1723548 [DOI] [PubMed] [Google Scholar]
  • 62.Nolin JD, Ogden HL, Lai Y, Altemeier WA, Frevert CW, Bollinger JG, et al. Identification of Epithelial Phospholipase A2 Receptor 1 as a Potential Target in Asthma. Am J Respir Cell Mol Biol. 2016;55(6):825–36. 10.1165/rcmb.2015-0150OC [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Rava M, Ahmed I, Kogevinas M, Le Moual N, Bouzigon E, Curjuric I, et al. Genes Interacting with Occupational Exposures to Low Molecular Weight Agents and Irritants on Adult-Onset Asthma in Three European Studies. Environ Health Perspect. 2017;125(2):207–14. 10.1289/EHP376 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Feldman MB, Wood M, Lapey A, Mou H. SMAD signaling restricts mucous cell differentiation in human airway epithelium. Am J Respir Cell Mol Biol. 2019. 10.1165/rcmb.2018-0326OC [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Xiang YY, Wang S, Liu M, Hirota JA, Li J, Ju W, et al. A GABAergic system in airway epithelium is essential for mucus overproduction in asthma. Nat Med. 2007;13(7):862–7. 10.1038/nm1604 [DOI] [PubMed] [Google Scholar]
  • 66.Pradhan P, Deb J, Deb R, Chakrabarti S. Lung hypoplasia and patellar agenesis in Ehlers-Danlos syndrome. Singapore Med J. 2009;50(12):e415–8. [PubMed] [Google Scholar]
  • 67.Thelin MA, Bartolini B, Axelsson J, Gustafsson R, Tykesson E, Pera E, et al. Biological functions of iduronic acid in chondroitin/dermatan sulfate. FEBS J. 2013;280(10):2431–46. 10.1111/febs.12214 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Diggle CP, Moore DJ, Mali G, zur Lage P, Ait-Lounis A, Schmidts M, et al. HEATR2 plays a conserved role in assembly of the ciliary motile apparatus. PLoS Genet. 2014;10(9):e1004577 10.1371/journal.pgen.1004577 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Szymanski EP, Leung JM, Fowler CJ, Haney C, Hsu AP, Chen F, et al. Pulmonary nontuberculous mycobacterial infection. A multisystem, multigenic disease. Am J Respir Crit Care Med. 2015;192(5):618–28. 10.1164/rccm.201502-0387OC [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.White R, Woodward S, Leppert M, O'Connell P, Hoff M, Herbst J, et al. A closely linked genetic marker for cystic fibrosis. Nature. 1985;318(6044):382–4. 10.1038/318382a0 [DOI] [PubMed] [Google Scholar]
  • 71.Dhar J, Cuevas RA, Goswami R, Zhu J, Sarkar SN, Barik S. 2'-5'-Oligoadenylate synthetase-like protein inhibits respiratory syncytial virus replication and is targeted by the viral nonstructural protein 1. J Virol. 2015;89(19):10115–9. 10.1128/JVI.01076-15 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Leisching G, Wiid I, Baker B. The association of OASL and type I interferons in the pathogenesis and survival of intracellular replicating bacterial species. Front Cell Infect Microbiol. 2017;7:196 10.3389/fcimb.2017.00196 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Zhu J, Ghosh A, Sarkar SN. OASL-a new player in controlling antiviral innate immunity. Curr Opin Virol. 2015;12:15–9. 10.1016/j.coviro.2015.01.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Ahner A, Gong X, Frizzell RA. Divergent signaling via SUMO modification: potential for CFTR modulation. Am J Physiol Cell Physiol. 2016;310(3):C175–80. 10.1152/ajpcell.00124.2015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Bhatia P, Singh A, Hegde A, Jain R, Bansal D. Systematic evaluation of paediatric cohort with iron refractory iron deficiency anaemia (IRIDA) phenotype reveals multiple TMPRSS6 gene variations. Br J Haematol. 2017;177(2):311–8. 10.1111/bjh.14554 [DOI] [PubMed] [Google Scholar]
  • 76.Huckins LM, Dobbyn A, Ruderfer DM, Hoffman G, Wang W, Pardinas AF, et al. Gene expression imputation across multiple brain regions provides insights into schizophrenia risk. Nat Genet. 2019;51(4):659–74. 10.1038/s41588-019-0364-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Petty LE, Highland HM, Gamazon ER, Hu H, Karhade M, Chen HH, et al. Functionally oriented analysis of cardiometabolic traits in a trans-ethnic sample. Hum Mol Genet. 2019;28(7):1212–24. 10.1093/hmg/ddy435 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Gong J, Wang F, Xiao B, Panjwani N, Lin F, Keenan K, et al. Genetic association and transcriptome integration identify contributing genes and tissues at cystic fibrosis modifier loci. PLoS Genet. 2019;15(2):e1008007 10.1371/journal.pgen.1008007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Farinha CM, Matos P, Amaral MD. Control of cystic fibrosis transmembrane conductance regulator membrane trafficking: not just from the endoplasmic reticulum to the Golgi. FEBS J. 2013;280(18):4396–406. 10.1111/febs.12392 [DOI] [PubMed] [Google Scholar]
  • 80.Roesch EA, Nichols DP, Chmiel JF. Inflammation in cystic fibrosis: An update. Pediatr Pulmonol. 2018;53(S3):S30–S50. 10.1002/ppul.24129 [DOI] [PubMed] [Google Scholar]
  • 81.Aguiar VRC, Cesar J, Delaneau O, Dermitzakis ET, Meyer D. Expression estimation and eQTL mapping for HLA genes with a personalized pipeline. PLoS Genet. 2019;15(4):e1008091 10.1371/journal.pgen.1008091 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Lee W, Plant K, Humburg P, Knight JC. AltHapAlignR: improved accuracy of RNA-seq analyses through the use of alternative haplotypes. Bioinformatics. 2018. 10.1093/bioinformatics/bty125 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Martino MB, Jones L, Brighton B, Ehre C, Abdulah L, Davis CW, et al. The ER stress transducer IRE1beta is required for airway epithelial mucin production. Mucosal Immunol. 2013;6(3):639–54. 10.1038/mi.2012.105 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.O'Neal WK, Knowles MR. Cystic Fibrosis Disease Modifiers: Complex Genetics Defines the Phenotypic Diversity in a Monogenic Disease. Annu Rev Genomics Hum Genet. 2018;19:201–22. 10.1146/annurev-genom-083117-021329 [DOI] [PubMed] [Google Scholar]
  • 85.Zhou Z, Duerr J, Johannesson B, Schubert SC, Treis D, Harm M, et al. The ENaC-overexpressing mouse as a model of cystic fibrosis lung disease. J Cyst Fibros. 2011;10 Suppl 2:S172–82. 10.1016/S1569-1993(11)60021-0 [DOI] [PubMed] [Google Scholar]
  • 86.Zhao C, Crosby J, Lv T, Bai D, Monia BP, Guo S. Antisense oligonucleotide targeting of mRNAs encoding ENaC subunits alpha, beta, and gamma improves cystic fibrosis-like disease in mice. J Cyst Fibros. 2019;18(3):334–41. 10.1016/j.jcf.2018.07.006 [DOI] [PubMed] [Google Scholar]
  • 87.Akram KM, Moyo NA, Leeming GH, Bingle L, Jasim S, Hussain S, et al. An innate defense peptide BPIFA1/SPLUNC1 restricts influenza A virus infection. Mucosal Immunol. 2018;11(1):71–81. 10.1038/mi.2017.45 [DOI] [PubMed] [Google Scholar]
  • 88.De Smet EG, Seys LJ, Verhamme FM, Vanaudenaerde BM, Brusselle GG, Bingle CD, et al. Association of innate defense proteins BPIFA1 and BPIFB1 with disease severity in COPD. Int J Chron Obstruct Pulmon Dis. 2018;13:11–27. 10.2147/COPD.S144136 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Saferali A, Obeidat M, Berube JC, Lamontagne M, Bosse Y, Laviolette M, et al. Polymorphisms associated with expression of BPIFA1/BPIFB1 and lung disease severity in cystic fibrosis. Am J Respir Cell Mol Biol. 2015;53(5):607–14. 10.1165/rcmb.2014-0182OC [DOI] [PubMed] [Google Scholar]
  • 90.Wu T, Huang J, Moore PJ, Little MS, Walton WG, Fellner RC, et al. Identification of BPIFA1/SPLUNC1 as an epithelium-derived smooth muscle relaxing factor. Nat Commun. 2017;8:14118 10.1038/ncomms14118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Fagerberg L, Hallstrom BM, Oksvold P, Kampf C, Djureinovic D, Odeberg J, et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol Cell Proteomics. 2014;13(2):397–406. 10.1074/mcp.M113.035600 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Stanke F, Becker T, Hedtfeld S, Tamm S, Wienker TF, Tummler B. Hierarchical fine mapping of the cystic fibrosis modifier locus on 19q13 identifies an association with two elements near the genes CEACAM3 and CEACAM6. Hum Genet. 2010;127(4):383–94. 10.1007/s00439-009-0779-6 [DOI] [PubMed] [Google Scholar]
  • 93.Chen J, Miller M, Unno H, Rosenthal P, Sanderson MJ, Broide DH. Orosomucoid-like 3 (ORMDL3) upregulates airway smooth muscle proliferation, contraction, and Ca(2+) oscillations in asthma. J Allergy Clin Immunol. 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Paulenda T, Draber P. The role of ORMDL proteins, guardians of cellular sphingolipids, in asthma. Allergy. 2016;71(7):918–30. 10.1111/all.12877 [DOI] [PubMed] [Google Scholar]
  • 95.Siow D, Sunkara M, Dunn TM, Morris AJ, Wattenberg B. ORMDL/serine palmitoyltransferase stoichiometry determines effects of ORMDL3 expression on sphingolipid biosynthesis. J Lipid Res. 2015;56(4):898–908. 10.1194/jlr.M057539 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Stein MM, Thompson EE, Schoettler N, Helling BA, Magnaye KM, Stanhope C, et al. A decade of research on the 17q12-21 asthma locus: Piecing together the puzzle. J Allergy Clin Immunol. 2018. 10.1016/j.jaci.2017.12.974 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Toncheva AA, Potaczek DP, Schedel M, Gersting SW, Michel S, Krajnov N, et al. Childhood asthma is associated with mutations and gene expression differences of ORMDL genes that can interact. Allergy. 2015;70(10):1288–99. 10.1111/all.12652 [DOI] [PubMed] [Google Scholar]
  • 98.Pankow S, Bamberger C, Calzolari D, Martinez-Bartolome S, Lavallee-Adam M, Balch WE, et al. ΔF508 CFTR interactome remodelling promotes rescue of cystic fibrosis. Nature. 2015;528(7583):510–6. 10.1038/nature15729 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Shrine N, Guyatt AL, Erzurumluoglu AM, Jackson VE, Hobbs BD, Melbourne CA, et al. New genetic signals for lung function highlight pathways and chronic obstructive pulmonary disease associations across multiple ancestries. Nat Genet. 2019;51(3):481–93. 10.1038/s41588-018-0321-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Schnatwinkel C, Niswander L. Nubp1 is required for lung branching morphogenesis and distal progenitor cell survival in mice. PLoS One. 2012;7(9):e44871 10.1371/journal.pone.0044871 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Yang Z, Hikosaka K, Sharkar MT, Tamakoshi T, Chandra A, Wang B, et al. The mouse forkhead gene Foxp2 modulates expression of the lung genes. Life Sci. 2010;87(1–2):17–25. 10.1016/j.lfs.2010.05.009 [DOI] [PubMed] [Google Scholar]
  • 102.Hannan NR, Sampaziotis F, Segeritz CP, Hanley NA, Vallier L. Generation of Distal Airway Epithelium from Multipotent Human Foregut Stem Cells. Stem Cells Dev. 2015;24(14):1680–90. 10.1089/scd.2014.0512 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Beltran M, Garcia de Herreros A. Antisense non-coding RNAs and regulation of gene transcription. Transcription. 2016;7(2):39–43. 10.1080/21541264.2016.1148804 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Patil VS, Zhou R, Rana TM. Gene regulation by non-coding RNAs. Crit Rev Biochem Mol Biol. 2014;49(1):16–32. 10.3109/10409238.2013.844092 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Salviano-Silva A, Lobo-Alves SC, Almeida RC, Malheiros D, Petzl-Erler ML. Besides Pathology: Long Non-Coding RNA in Cell and Tissue Homeostasis. Noncoding RNA. 2018;4(1). 10.3390/ncrna4010003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Balloy V, Koshy R, Perra L, Corvol H, Chignard M, Guillot L, et al. Bronchial Epithelial Cells from Cystic Fibrosis Patients Express a Specific Long Non-coding RNA Signature upon Pseudomonas aeruginosa Infection. Front Cell Infect Microbiol. 2017;7:218 10.3389/fcimb.2017.00218 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Saayman SM, Ackley A, Burdach J, Clemson M, Gruenert DC, Tachikawa K, et al. Long Non-coding RNA BGas Regulates the Cystic Fibrosis Transmembrane Conductance Regulator. Mol Ther. 2016;24(8):1351–7. 10.1038/mt.2016.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Jarroux J, Morillon A, Pinskaya M. History, Discovery, and Classification of lncRNAs. Adv Exp Med Biol. 2017;1008:1–46. 10.1007/978-981-10-5203-3_1 [DOI] [PubMed] [Google Scholar]
  • 109.Deng X, Berletch JB, Nguyen DK, Disteche CM. X chromosome regulation: diverse patterns in development, tissues and disease. Nat Rev Genet. 2014;15(6):367–78. 10.1038/nrg3687 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Panousis NI, Gutierrez-Arcelus M, Dermitzakis ET, Lappalainen T. Allelic mapping bias in RNA-sequencing is not a major confounder in eQTL studies. Genome Biol. 2014;15(9):467 10.1186/s13059-014-0467-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Dylan Glubb

1 Nov 2019

PONE-D-19-23859

Mining GWAS and eQTL data for CF lung disease modifiers by gene expression imputation

PLOS ONE

Dear Dr. Dang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The manuscript provides a comprehensive transcriptome-wide association study that identifies a number of genes which may modify CF severity, greatly increasing knowledge of CF genetics. The reviewers have raised a number of issues which should be addressed, particularly in the context of the statistical analysis, and additional information is required to improve the clarity and comprehensibility of the study.

We would appreciate receiving your revised manuscript by Dec 16 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Dylan Glubb

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors describe an extension of a previous GWAS concerning cystic fibrosis (CF) by imputing gene expression with PrediXcan and TWAS using GTEx, LCL, and nasal mucosal models. They take the genes at the consensus of the most significant results of both methods and examine how they function within the CF pathway as well as delve into genes not near known loci and how their genetic regulation contributes to CF. These analyses provide valuable insight into the genetic mechanisms underlying CF in addition to finding potential therapeutic targets.

While the reasoning behind your methods was sound, my main concern is the interpretation of the PrediXcan and TWAS results and the decision to take the consensus of the two methods. Additionally, as an overall note, please use color in your grey and black plots where possible; it makes the data much easier to discern.

Thank you for making the GWAS summary statistics publicly available. Are your PrediXcan models from the CF LCL and nasal mucosal epithelial cells freely available as well?

36-37: “Using congruence of findings from the two approaches” - at this point in the abstract, you haven’t compared the two approaches

73-75: “Thus, using eQTLs from all tissues, not just “disease-relevant” tissues, can identify a greater number of candidate gene modifiers” - can you elaborate more on how genes in non-relevant tissues may be contributing to CF?

104-105: “The cohort study design, and demographic and clinical characteristics of the CF patients used in this study have been previously described” - a quick summary of the cohort (sample size, number of SNPs, sex, age, race/ethnicity composition, etc.) would be a welcome addition to the paragraph.

118: “more recent release of 1000 genomes project data” - specify which phase

120-122: “summary GWAS statistics… derive imputed gene expression… using [FUSION]” - an issue with comparing PrediXcan and FUSION output is their inherent differences in design. For more comparable results, if you continue using both methods, you should consider using S-PrediXcan (PMID: 29739930), which is similar to PrediXcan but takes summary statistics rather than genotypes as input. In addition, a description of your FUSION methods (e.g., models used) to the same extent as for PrediXcan would be appreciated.

131: “the samples used in predictive model training were excluded from the association testing” - what’s the before and after n?

137-138: “In our most stringent analyses, we sought consensus between two meta-analysis approaches.” - The reason behind taking the intersection of these two methods, rather than just the results from one model (especially since only GTEx is used in FUSION while there are additional models for PrediXcan), isn’t articulated well through the paper. Either take the results from the more robust method, or fully justify the use of the intersection of the methods’ results.

141: “performed for each gene among the tissues with imputed gene expression” - how does this treat genes fairly relative to one another, e.g., if one gene is present in only two tissues and another is present in thirty related tissues?
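The concern about tissue counts can be made concrete with a small sketch. It assumes, purely for illustration, that per-tissue p-values are combined with Fisher's method; the manuscript's actual meta-analysis procedure is not described in this letter, and the gene inputs below are hypothetical. Fisher's method treats the p-values as independent, so a gene imputed in many related tissues can accumulate a far smaller combined p-value than a gene with the same per-tissue evidence in only two tissues.

```python
import math


def fisher_combined_p(p_values):
    """Combine p-values with Fisher's method, assuming independence.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null (k = number of p-values).
    For even degrees of freedom the survival function has a closed form:
    sf(X; 2k) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)**i / i!
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, len(p_values)):
        term *= half_x / i  # builds (X/2)**i / i! incrementally
        total += term
    return math.exp(-half_x) * total


# Hypothetical genes with identical per-tissue evidence (p = 0.01),
# differing only in how many tissues have an imputation model.
p_two = fisher_combined_p([0.01] * 2)
p_thirty = fisher_combined_p([0.01] * 30)

print(f"2 tissues:  combined p = {p_two:.3g}")
print(f"30 tissues: combined p = {p_thirty:.3g}")
```

If the thirty tissues share most of their eQTL signal, the independence assumption is violated and the second combined p-value overstates the evidence, which is the imbalance this comment asks about.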

183-184: “our LCL microarray data set yielded fewer than expected number of imputable genes” - what is the expected # of imputable genes, based on PrediXcan models produced with microarray data? Additionally, in the following paragraph, how is the difference between microarray and RNA-Seq in CF LCL vs. GTEx LCL accounted for in this comparison?

200-202: “Using a threshold of p-value < 0.01… disease modifier genes were identified.” - Does this method account for the range of tissues used? Does this account for the CF and DGN models not existing in TWAS? Can you delve into how a result in a seemingly irrelevant tissue can still be useful in elucidating CF pathways? These questions especially arise in lines 211-214.

206-208: “average effect sizes… (R2 = 0.36, Fig 3B)” - these calculations should be covered in Methods
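As an indication of the level of detail the Methods could give, the sketch below computes R2 as the squared Pearson correlation between two vectors of per-gene average effect sizes. This is an assumption about how the reported value might have been obtained, not the authors' documented procedure, and the effect sizes are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Made-up average effect sizes for five hypothetical genes,
# one vector per analysis approach.
effects_approach_a = [0.10, -0.25, 0.40, 0.05, -0.30]
effects_approach_b = [0.08, -0.10, 0.35, -0.02, -0.45]

print(f"R2 = {r_squared(effects_approach_a, effects_approach_b):.2f}")
```

Stating whether the averages are taken across tissues, across SNPs, or both, and whether R2 comes from a correlation or a fitted regression, would let readers reproduce the figure.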

240-243: “MET ~700 kb upstream… in predictive models.” - How would this compare to a traditional GWAS-eQTL analysis?

259-261: “The SNP p-values used in this analysis were either the minimal p-value selected per gene (Fig. 5) or represented the average of the unique set of SNPs from all predictive models per genes (S6 Fig).” - Choose a consistent metric across all genes.

282: “among SNPs with significant p-values of < 10e-7” - why did you use this threshold instead of the more traditional 5e-8?
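For reference, a short check of what the two thresholds mean numerically. It assumes the quoted "10e-7" is meant in standard scientific notation, and it uses the common rationale for 5e-8 as a Bonferroni correction for roughly one million independent tests; neither assumption is confirmed by the manuscript.

```python
import math

quoted_threshold = float("10e-7")  # notation as written in the quoted line
conventional_threshold = 5e-8      # conventional genome-wide significance level

# In scientific notation, 10e-7 is 10 x 10^-7, i.e. 1e-6 rather than 1e-7.
assert math.isclose(quoted_threshold, 1e-6)

# 5e-8 corresponds to alpha = 0.05 divided by about one million
# independent tests (the usual approximation, assumed here).
bonferroni = 0.05 / 1_000_000
assert math.isclose(bonferroni, conventional_threshold)

ratio = quoted_threshold / conventional_threshold
print(f"The quoted threshold is {ratio:.0f} times more permissive.")
```

If 1e-7 was intended, the notation in the manuscript should be corrected; either way the choice of cutoff needs a stated rationale.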

304-305: “largely due to environmental influences and/or disease process, rather than genetic regulation.” - do you have an h2 measurement for this (GWAS vs. imputed gene expression)? This section would also flow better in the creation/analysis of the models near the beginning of results.

315: “maximal p-value between the 2 multi-tissue meta-analyses for each analysis platform” - Again, choose a consistent metric from one platform or else results are difficult to interpret.
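The rule quoted above can be sketched as follows, under the assumption that a gene is retained only when it has a result on both platforms and the larger of its two meta-analysis p-values falls below a cutoff. The 0.01 default mirrors the threshold quoted for lines 200-202; the function name, gene labels, and p-values are hypothetical.

```python
def consensus_genes(p_platform_a, p_platform_b, threshold=0.01):
    """Return genes tested on both platforms whose larger p-value passes.

    Each argument maps gene name -> meta-analysis p-value. Genes missing
    from either platform are dropped, so the result depends on which
    genes each platform is able to impute at all.
    """
    shared = p_platform_a.keys() & p_platform_b.keys()
    kept = {}
    for gene in shared:
        worst_p = max(p_platform_a[gene], p_platform_b[gene])
        if worst_p < threshold:
            kept[gene] = worst_p
    # Order by the maximal p-value, strongest consensus first.
    return dict(sorted(kept.items(), key=lambda item: item[1]))


# Hypothetical inputs: GENE_D has no model on platform B, and GENE_B
# passes on platform B only.
platform_a = {"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.004, "GENE_D": 0.0005}
platform_b = {"GENE_A": 0.003, "GENE_B": 0.001, "GENE_C": 0.009}

print(consensus_genes(platform_a, platform_b))
```

The sketch makes the interpretive difficulty visible: GENE_D, the strongest hit on one platform, is excluded only because the other platform has no model for it, and the reported p-value for each retained gene comes from whichever platform performed worse.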

360-361: “the direction of predicted expression changes in regard to lung function are opposite between different mapping strategies” - which method best agrees with the known direction of effect in observed CF gene expression data?

392-394: “Although EHF and APIP… current gene expression data.” - this issue of the genes best predicted in a model vs. the genes with actual biological implications was described in Wainberg et al. (PMID: 30926968). How does this perspective, as well as the differences in the tissues, contribute to PDHX as a candidate in the CF pathway? How would a colocalization analysis change these results? Additionally, which gene currently implicated in the locus may be related to CF?

399-410: Good review of how the most significant predicted genes could work in the context of CF!

433-434: “The overall gene expression association to the CF lung disease severity from our own CF nasal epithelial and LCL data sets is not correlated with imputed gene expression.” - this seems contradictory to lines 175-178, can you clarify this?

444: “The non-coding CF modifier genes are likely underestimated” - in addition, they seem understudied from the fewer references in table 1 compared to the protein coding genes. Can you elaborate on the reasons why this may be?

460-461: “Furthermore, the number of genes whose expression can be reliably predicted from genetic variants varied among tissues” - a noticeable absence throughout most of this paper is the lack of individual tissues being scrutinized. Which tissue are these significant genes in, and how may that also contribute to CF? And if the tissue seems irrelevant, why are the findings still important?

Next, review figures/tables and supplementals

Table 1: add a column of which tissue the P-value was max. Where did the keywords originate from? Can you use coefficients instead of protective/harmful?

Fig. 1: Since the PM, imGE, and pvals for GTEx, DGN, nasal epi, and LCL are similar except for sample sizes, can they be consolidated to look less cluttered? Also, since genotypes were used to make the gene expression models for nasal epi and LCL and not summary statistics, “GWAS imputation” may be misleading.

Fig. 2: There’s a lot going on in this plot that makes it difficult to follow. Is it more legible if you subset to only the most important tissues?

Fig. 3: Use ggrepel with higher force for easier readability. These plots would be more appropriate in the supplemental figures with the r2 of both of them described in the text.

Fig. 4: At a glance, the PrediXcan and TWAS results look similar in architecture. Can these results be consolidated with colors identifying differences between the methods instead, or colors indicating known vs novel CF genes?

Fig. 5: This figure would also benefit from ggrepel. It would be easier to follow as a 2x2 rather than branching off the original scatterplot. Are all 3 of these offshoots necessary? What do each of them represent that the others don’t?

Fig. 6: Very unique and insightful visual. Can you discuss more about how these genes are co-regulated from their intertwined, linked eQTLs?

Fig. 7: Put the r2 in captions. This figure would also be better suited for a supplemental.

Fig. 8: Can you differentiate the known genes from the novel ones with color, bolding, or other formatting? Do you have similar figures for “consensus” genes available?

Fig. 9: Color in the HLA gene dots, they get lost amongst all the grey.

Supporting methods: I enjoy your well-described and well-cited (with links!) descriptions. There are a few very minor typos - “the imputed gene expression data sets had sample size” should be “the imputed gene expression data sets had a sample size”. I have a small concern with the hierarchical clustering analysis. Why did you have “the missing values in the resultant distance matrix were replaced with the largest distance values between any pairs” rather than just leaving the data N/A? It seems misleading. Also, again, for a PrediXcan-like method that also uses summary statistics, S-PrediXcan is up your alley.

S1 Fig.: Do you think these dramatically different slopes are due to RNA quantification collection methods, sample sizes, or other outside factors?

S2 Fig.: I quite enjoy this figure. Can you compare the LCL microarray observed/expected gene count to those found in PrediXcan microarray-based models?

S6 Fig.: This is illegible. Would you be able to give each point a number and then have a side table with both the numbers and gene names?

S11 Fig.: Can you include within the main text as part of the discussion why CFTR isn’t as highly significant as one would initially think in an analysis of cystic fibrosis?

S Tables: Can you bold or italicize the known or novel gene findings to make them easier to differentiate?

Overall, my concerns lie mainly with the comparison between PrediXcan and TWAS, the interpretability of the figures, and the lack of connecting the actual tissue genes were determined significant to the CF phenotype, but I enjoy your analysis and your contribution to determining the genetic architecture of CF. Addressing these concerns as well as clarifying the points I made above and adding an additional colocalization analysis would strengthen this paper.

Reviewer #2: This is a very comprehensive and well-conducted analysis.

I have two questions:

1. The harmonic mean P-value method was recently corrected ( please see: http://blog.danielwilson.me.uk/2019/08/updated-correction-harmonic-mean-p.html ). Do the authors' calculations incorporate the updated (and corrected) harmonic mean P-value method?

2. There is no discussion of LD-contamination ( see for example: https://www.nature.com/articles/s41467-018-03621-1 ). To what extent does this affect the results?

Reviewer #3: On the whole, the writing of the manuscript can and should be improved. As it stands, it is hard to follow. The authors present no motivation for the selected approaches to analyze the data (and why they chose to use two). The methods are not described well. It takes a great effort to match the description of the analyses with Fig. 1 illustrating them.

Description of one type of analysis is interweaved with sentences starting with ‘Alternatively’ (lines 120, 132) describing the other type of analysis. This makes it harder to follow either of them.

The authors pay attention to unimportant details (such as talk about GTEx pilot data) but do not discuss important ones (e.g. choice of just 1 principal component to correct for ancestry).

It causes great concern to see that the authors did not correctly use scientific E-notation for small numbers.

For instance, instead of ‘1e-6’ or ‘10^{-6}’ the authors have 1x10e-06, which would actually be equal to 1e-5 if read correctly (10x1e-6 = 1e-5).

There are a total of seven instances of incorrect use of scientific E-notation.
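The arithmetic behind this point can be checked directly. The following is a minimal Python sketch that relies only on built-in float parsing to show how the string "10e-06" is read under standard E-notation; the variable names are illustrative and are not taken from the manuscript.

```python
# In E-notation, "AeB" means A * 10**B, so a mantissa of 10 is multiplied in.
intended = float("1e-06")      # one in a million
as_written = float("10e-06")   # what "1x10e-06" evaluates to when read literally

# The literal reading is ten times larger than the intended value.
ratio = as_written / intended
print(intended, as_written, round(ratio))
```

Read this way, a threshold written as 1x10e-06 is an order of magnitude more permissive than 1e-6.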

Which samples were whole genome sequenced (WGS)?

Is the sequencing data publicly available?

Were these samples among those being imputed?

Were WGS genotypes mixed with imputed genotypes in the analyses?

Line 127. Why only one PC was used for correction of ancestry? This appears to be grossly insufficient given current knowledge.

Line 129. There are numerous methods for robust regression analysis. Not providing a name for the chosen method, only a citation is a great inconvenience for the reader.

Lines 137-138. What does “in our most stringent analyses” mean?

Line 138. The phrase “we sought consensus” does not exactly read as “we selected genes significant in both analyses”.

Line 173. What is the definition of 'imputable gene'?

Minor comments:

Line 72. Why even mention GTEx pilot data?

Why is every plot black and white?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Nov 30;15(11):e0239189. doi: 10.1371/journal.pone.0239189.r002

Author response to Decision Letter 0


1 Sep 2020

NOTE: Responses to reviewers with proper formatting are attached as a .docx file, and only the text is copied below.

I. General Overview of Responses

We appreciate the effort and time the reviewers have put into the reviews. We have taken their suggestions very seriously, and we believe that responding to their comments has significantly improved the manuscript. The responses to each of the reviewer comments are provided below. By way of introduction, there were four main points that we thought were the thrust of the reviewers’ concerns. We summarize our overall response to these four concerns here, while providing detailed response notes later in the document.

1. The reviewers were concerned that the analyses between PrediXcan and TWAS are not uniform, and so it is not possible for one to “replicate the other”. Thus, they were unsure whether our concept to provide what we previously called our “most robust” list of candidate genes was valid. We understand their confusion and apologize that our overall intent for choosing to report on the two approaches was not clear. It is NOT our intent to compare and contrast the two similar but independently developed platforms of gene expression imputation (PrediXcan and TWAS), or to ask one to replicate the other. Rather we wish to mine the data to obtain maximal coverage. We believe our goal was best achieved by utilizing these two divergent methods, leveraging the strength of each method. Importantly, intersecting gene modifier candidate signals are generated by both methods. To illustrate the value of the analyses for the CF field, we chose to focus on the consensus findings between the two approaches. We believe for the purposes of this paper that this “consensus” list represents the most robust results, as they are stable between the two approaches. Since the candidate modifier genes extracted from the data mining constitute hypotheses that need to be validated by independent experiments, which requires time and resources, we sought to evaluate the main findings by reviewing functional annotations and literature as indirect validation in the context of CF biology. This “consensus” list is a workable list that provided a framework for our discussion. We have modified the discussion of the data mining substantially and hope that our new explanation provides clarity.

2. The reviewers were concerned about the number of tissues that were explored in our analyses. It is true that our method utilized multiple tissue eQTL data, which relies on many tissues that are not known to be directly affected in CF disease pathogenesis. We chose this strategy based on the findings by GTEx that the majority of genetic regulation of gene expression is through cis-SNPs, which tend to be conserved across multiple tissues. Under the concepts developed by GTEx, extracting predictive relationships of SNPs to gene expression from all available tissues helps to overcome technical issues of uneven sample number, data quality among the different tissues, biological influences of environments and development, or reverse causality (disease affecting gene expression). We cite several papers that support this concept.

3. The reviewers were concerned that only one PC was used to control for population stratification. We appreciate this as a legitimate concern and revised the analysis, now including 4 PCs. This did reduce the number of candidate disease modifier genes slightly, and so is a more conservative approach. Please note, however, that the top candidate genes remain, and the overall concepts related to the biology of CF modifier-related pathways were not dramatically altered by this change.

4. The reviewers were also concerned that issues related to LD structure and causal SNPs for disease modification were overinterpreted. We agree with the sentiment that the data mining approach and findings do not constitute proof of causality or identify causal SNPs. We acknowledge that strong LD between SNPs, while not impeding the accuracy of gene expression prediction, presents a considerable challenge for causality determination. Nonetheless, by highlighting genetic regulation of gene expression from available eQTL data, i.e., pointing out genes with or without demonstrable eQTLs, the analysis we present puts us a step closer to the mechanism of disease modification. Future studies to test hypotheses related to the significant findings will be needed to link gene expression to lung phenotype.

II. Specific Reviewer Comments and Responses

Reviewer 1

Overview: The authors describe an extension of a previous GWAS concerning cystic fibrosis (CF) by imputing gene expression with PrediXcan and TWAS using GTEx, LCL, and nasal mucosal models. They take the genes at the consensus of the most significant results of both methods and examine how they function within the CF pathway as well as delve into genes not near known loci and how their genetic regulation contributes to CF. These analyses provide valuable insight into the genetic mechanisms underlying CF in addition to finding potential therapeutic targets.

While your reasoning behind your methods was sound, the interpretation of PrediXcan and TWAS’ results and the decision to take the consensus of the methods is my main concern. Additionally, as an overall note, please use color in your grey and black plots while possible, it makes it much easier to discern the data.

Response to Overview: We have attempted to clarify our line of thinking as described in the overview above. In general, our intent was to conduct the analyses using the divergent approaches (PrediXcan and TWAS), which complement each other and have individual strengths, to maximize discovery potential for hypothesis generation. We then sought to evaluate the significance of the findings through biological context. The consensus of findings between the two different approaches was used as a prioritization tool, allowing us to focus the biological exploration on the most robust findings.

Please note that all findings will be provided from both methods.

As a note, we have incorporated color into the plots where it is helpful.

Reviewer 1, Comment 1. Thank you for making GWAS summary statistics available for public. Are your PrediXcan models from the CF and nasal mucosal epithelial cells freely available as well?

Response: Yes, we will make these models available. We are planning on hosting these files on GitHub, unless the journal has a more preferred option, which we will explore with them. The hosted data will include a GWAS summary, PrediXcan models, individual tissue results, combined meta-analysis results, pathway results, and data values used to generate figures.

Reviewer 1, Comment 2. 36-37: “Using congruence of findings from the two approaches” - at this point in the abstract, you haven’t compared the two approaches

Response: We have modified the text (lines 34-37): “By comparing and combining results from alternative approaches, we identified 379 candidate modifier genes. We delved into 52 modifier candidates that showed consensus between approaches, and 28 of them were near known GWAS loci”

Reviewer 1, Comment 3. 73-75: “Thus, using eQTLs from all tissues, not just “disease-relevant” tissues, can identify a greater number of candidate gene modifiers” - can you elaborate more on how genes in non-relevant tissues may be contributing to CF?

Response: As discussed above, the concept is not that genes showing signal in non-relevant tissues are only active in those non-relevant tissues, but that the signals derived from these tissues add to the power to detect signals. As discussed in the survey of early GTEx data sets (1, 2) (references at the end of this document), most genetic regulations of gene expression are local through cis-SNPs, and many of them are shared across multiple tissues. So by looking at patterns of gene expression regulation by cis-SNPs from all tissues, one can recover more trait modifier genes, presumably by overcoming non-optimal training data, e.g., small sample sizes, and potential biological limitations, e.g., temporal regulation during development and potential effects of disease progression on gene expression in affected tissues.

We hope that we have clarified this in the revised text (lines 68-72): “In other words, genetic regulation of gene expression, or eQTL, can be informative regardless of tissue origin of the training data set (2), and can help overcome technical deficiencies, such as small sample sizes of certain tissue data, and potential biological limitations such as unsampled developmental stage and environmental and pathogenic masking of gene expression through reverse causality.”

Reviewer 1, Comment 4. 104-105: “The cohort study design, and demographic and clinical characteristics of the CF patients used in this study have been previously described” - a quick summary of the cohort (sample size, number of SNPs, sex, age, race/ethnicity composition), etc. would be welcome as an addition to the paragraph.

Response: We have added summary description of the CF GWAS cohorts as follows: “Briefly, 5 cohorts (total 6,365 CF patients) with >90% European ancestry from US, Canada, and France were recruited by the International Cystic Fibrosis Gene Modifier Consortium, and their genome-wide genetic variance were assayed using different genotyping platforms over several years. GWAS was performed as a meta-analysis of cohort/platform combinations, using the standardized lung function score, KNoRMA, as phenotype trait (3, 4).”

Reviewer 1, Comment 5. 118: “more recent release of 1000 genomes project data” - specify which phase

Response: The exact version is “1 KG phase3 v5a haplotype” data (5). Revised text (lines 132-134): “Compared to the imputation reported in the GWAS studies (3), the updated version here utilized a more recent release of 1000 genomes project Phase3 (v5a) haplotype data and 101 CF whole genome sequencing data as reference panels, which improved coverages at HLA and CFTR regions (5).”

Reviewer 1, Comment 6. 120-122: “summary GWAS statistics… derive imputed gene expression… using [FUSION]” - an issue with comparing PrediXcan and FUSION output is their inherent differences in design. For more comparable results if you continue using both methods, you should consider using S-PrediXcan (29739930) which is similar to PrediXcan but takes summary statistics rather than genotypes as input. In addition, description of your methods (ex. Models used) in FUSION to the same extent as PrediXcan would be appreciated.

Response: We have added more description of the FUSION procedure in the main text. In our original version, we made efforts to simplify the description of methods and provided more detail in the Supporting Methods section, but perhaps we went overboard with that approach and too many details were missing. We have sought to remedy this by trying to strike a better balance between what we present in the main text versus what we have detailed in the Supporting Methods. In response to this specific comment by the reviewer, we have added more details to the main text.

The new text added (lines 149-153): “Briefly, summary GWAS statistics for SNP associations to CF lung disease phenotype (n=6,365) and reference linkage-disequilibrium (LD) data from 1000 genome projects were used as input for FUSION, with TWAS predictive models from 48 GTEx v7 human tissues downloaded from FUSION website (http://gusevlab.org/projects/fusion/). The analysis was performed according to instructions on the FUSION website.”

As we hope we have clarified, our goal is not to compare different approaches, but rather to mine the data and to use the consensus of findings between the different approaches to generate a list of consensus results, thus maximizing discovery and hypothesis generation. That being said, we did explore S-PrediXcan (MetaXcan) as a potential analysis option. As shown in the following figure, the meta-analysis p-values by Harmonic Mean P-value (HMP) from 48 GTEx tissues are highly correlated pair-wise between PrediXcan, MetaXcan, and TWAS (the 99 consistent HMP p-value<0.01 significant genes among all 3 sets are highlighted).

As can be appreciated, the majority of the genes detected by S-PrediXcan (MetaXcan) were also detected by our original methods (TWAS and PrediXcan). Since there is no readily available meta-analysis for dependent multiple p-values other than HMP, and there are fewer significant genes (p<0.01) for MetaXcan compared to PrediXcan (346 vs 478) from the same tissues, we chose to stay with our original comparison of PrediXcan and TWAS.
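For readers unfamiliar with the statistic, the raw harmonic mean of a set of p-values can be sketched as follows. This is an illustration only: it computes the plain, equally weighted harmonic mean and does not apply the asymptotically exact adjustment raised by Reviewer 2, the function name is hypothetical, and the per-tissue p-values are invented.

```python
def harmonic_mean_p(pvalues):
    """Raw harmonic mean of p-values with equal weights (no adjustment applied)."""
    if not pvalues or any(p <= 0 or p > 1 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    return len(pvalues) / sum(1.0 / p for p in pvalues)

# Hypothetical per-tissue p-values for one gene (made up for illustration).
one_strong_tissue = [1e-6, 0.5, 0.5, 0.5]
many_weak_tissues = [0.04, 0.05, 0.03, 0.06]

hmp_strong = harmonic_mean_p(one_strong_tissue)
hmp_weak = harmonic_mean_p(many_weak_tissues)
print(hmp_strong, hmp_weak)
```

Because each p-value enters through its reciprocal, the smallest p-values dominate the result, which is why a single strongly associated tissue can drive a small combined value.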

Reviewer 1, Comment 7. 131: “the samples used in predictive model training were excluded from the association testing” - what’s the before and after n?

Response: We have added the relevant text to clarify this issue. The total unrelated sample size is 5,756 (selecting one out of siblings from the same family following rules set out by the GWAS analysis). There are 122 and 753 samples among the 5,756 in the nasal epithelial and LCL training data sets, respectively. Therefore, excluding them from lung disease association tests resulted in 5,634 and 5,003 final sample sizes for nasal epithelial and LCL tissues, respectively.

Revised text (lines 138-141, 143-146): “Association testing of imputed gene expression, using the PrediXcan platform (6), from the CF LCLs and nasal epithelium, 48 GTEx tissues, and DGN whole blood (a total of 51 human tissues), were performed using robust regression (7, 8) based on 5,756 unrelated patients.”… “For disease phenotype association testing using predictive models trained on CF nasal epithelial and LCL data sets, the samples used in predictive model training were excluded from the association testing, resulting in 5,634 and 5,003 final sample size for nasal epithelial and LCLs, respectively.”
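The sample-size bookkeeping in this response can be reproduced with a few lines of Python. The counts come from the response itself; the variable names are illustrative.

```python
total_unrelated = 5756  # unrelated patients available for association testing

# Patients among the 5,756 who were also used to train each CF predictive model.
training_overlap = {"nasal epithelium": 122, "LCL": 753}

# Excluding training samples leaves the final sample size per tissue.
final_n = {tissue: total_unrelated - n for tissue, n in training_overlap.items()}
print(final_n)
```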

Reviewer 1, Comment 8. 137-138: “In our most stringent analyses, we sought consensus between two meta-analysis approaches.” - The reason behind taking the intersection of these two methods, rather than just the results from one model (especially since only GTEx is used in FUSION while there are additional models for PrediXcan), isn’t articulated well through the paper. Either take the results from the more robust method, or fully justify the use of the intersection of the methods’ results.

Response: We hope that we have now fully justified the use of the intersection of the methods in our response. As discussed above, we think the exploratory nature of our analyses justifies our strategy to utilize two independently developed approaches, each with different strengths and weaknesses. Correlation of effect size estimates and consensus of significant genes between the 2 result sets help strengthen the main findings and identify robust candidates for follow-up as modifier genes. The union of results maximizes coverage of potential modifiers, which is important at the current state of rapid development, where many things, such as training data sample sizes and quality of gene expression estimates, are not optimal. In fact, we were comforted by the fact that some genes in the expanded list of candidate modifiers, such as BPIFA1, which has been strongly implicated in the literature as a modifier of CF lung disease, seemed to help justify our nuanced strategy.
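The distinction between the consensus and the union of the two result sets amounts to simple set operations. The sketch below uses placeholder gene names and invented membership, not results from the manuscript.

```python
# Hypothetical significant-gene lists; names and membership are placeholders.
predixcan_hits = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
twas_hits = {"GENE_B", "GENE_C", "GENE_E"}

consensus = predixcan_hits & twas_hits   # significant in both approaches
union = predixcan_hits | twas_hits       # significant in at least one approach
method_specific = union - consensus      # found by only one approach

print(sorted(consensus), sorted(union), sorted(method_specific))
```

In this framing, the consensus set serves as the prioritized list for follow-up, while the union preserves candidates that only one approach can detect.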

Reviewer 1, Comment 9. 141: “performed for each gene among the tissues with imputed gene expression” - how does this account fairly between genes ex. if one of them is present in only two tissues and the other is in thirty related tissues?

Response: We have investigated the relationship between meta-analysis p-value and the number of tissues in which a gene was imputed by examining the combined meta-analysis p-values for the 52 consensus modifier genes. As shown by the intensity patterns in Figure 2 of our manuscript, significant phenotype association of imputed gene expression appears to be determined by at least 2 factors: proximity to GWAS loci and the number of tissues in which the gene is imputed. This makes sense since the imputed gene expression reflects the underlying genetic variances. On the other hand, genes can be significant due to strong association in a few imputable tissues, or consistent weak associations in many imputed tissues through meta-analysis. We believe that, although significant genes imputed from multiple tissues should be given higher priority for follow-up studies, the genes imputed in only a few tissues may still be important, since they can be tissue-specific or insufficiently sampled due to non-optimal sample sizes in the reference tissue data sets.

Reviewer 1, Comment 10. 183-184: “our LCL microarray data set yielded fewer than expected number of imputable genes” - what is the expected # of imputable genes, based on PrediXcan models produced with microarray data? Additionally, in the following paragraph, how is the difference between microarray and RNA-Seq in CF LCL vs. GTEx LCL accounted for in this comparison?

Response: All PrediXcan models used in this report were based on RNA-seq gene expression assays, with the exception of our CF LCL data set. While one data set makes it difficult to estimate the expected number of imputable genes from microarray assays, our LCL data point deviates significantly from the gene count vs sample size relationship established by many tissue training data sets based on RNA-seq (figure below, left panel). Similarly, Gusev et al also demonstrated the reduced number of imputable genes from the Netherlands Twin Registry (NTR) microarray gene expression data (9) (Figure 3 in their paper, reproduced below). It is difficult to compare the number of imputable genes from our LCL (microarray, 5,299 genes) and GTEx LCL (RNA-seq, 3,023 genes), since the sample numbers are quite different: 753 CF LCL vs 117 GTEx LCL, and the correlation of the 1,623 common imputed genes between the 2 sets at imputed gene expression of ~5,000 patients did not take original assay platform into consideration. Our assumption is that gene expression data quality affects predictability at the model training stage, and thus the number of genes that can be predicted from cis-SNPs given the same sample size, so it was accounted for at model training.

Reviewer 1, Comment 11. 200-202: “Using a threshold of p-value < 0.01… disease modifier genes were identified.” - Does this method account for the range of tissues used? Does this account for the CF and DGN models not existing in TWAS? Can you delve into how a result in a seemingly irrelevant tissue can still be useful in elucidating CF pathways? These questions especially arise in lines 211-214.

Response: As stated above in response to Comment 3, there are both technical and biological rationales for leveraging predictive models of genetic regulation of gene expression from all tissues where reference data are available. We also intended to use PrediXcan and TWAS as complementary approaches to mine the CF GWAS data, not to compare the merits between them. Therefore, we feel the fact that DGN and CF models are not available for TWAS analysis does not adversely affect the intended use.

Reviewer 1, Comment 12. 206-208: “average effect sizes… (R2 = 0.36, Fig 3B)” - these calculations should be covered in Methods

Response: We have now added a section to describe the comparison of results by linear regression in Supporting Methods. The revised text (lines 146-148): “To assess correlation between different test results among multiple genes, simple linear regression was performed between 2 sets of test statistics, such as mean effect-sizes, or -log10 p-values from PrediXcan and TWAS meta-analyses, or GWAS.”
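As an illustration of the comparison described here, the sketch below fits a simple linear regression between two sets of -log10 p-values and reports R2. It is a minimal ordinary least squares implementation under stated assumptions, not the authors' code; the function name and the p-values are invented.

```python
from math import log10

def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical per-gene p-values from two analyses (invented for illustration).
p_method_1 = [1e-5, 1e-3, 0.02, 0.2, 0.6]
p_method_2 = [1e-4, 5e-3, 0.01, 0.3, 0.5]

x = [-log10(p) for p in p_method_1]
y = [-log10(p) for p in p_method_2]
slope, intercept, r_squared = simple_linear_regression(x, y)
print(slope, intercept, r_squared)
```

For a regression with a single predictor, R2 equals the squared Pearson correlation between the two sets of statistics, which is what the last line of the function computes.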

Reviewer 1, Comment 13. 240-243: “MET ~700 kb upstream… in predictive models.” - How would this compare to a traditional GWAS-eQTL analysis?

Response: We have not specifically compared the predictive models derived from machine learning (i.e. penalized regressions) with traditional eQTL, but the original PrediXcan and TWAS papers did address the issue generally, and they found that the machine learning approach is more accurate than single variant-gene eQTL analysis (6, 9) for predicting gene expression. Intuitively, predictive models with multiple SNPs as independent variables would include single SNP eQTL as a special case, if the eQTL is strong enough to explain the observed gene expression variance. As shown in Supplemental Figure S11, the MET signal came from multiple sub-threshold GWAS SNPs, which may also show up in a traditional GWAS-eQTL overlap test, depending on the parameters.

Reviewer 1, Comment 14. 259-261: “The SNP p-values used in this analysis were either the minimal p-value selected per gene (Fig. 5) or represented the average of the unique set of SNPs from all predictive models per genes (S6 Fig).” - Choose a consistent metric across all genes.

Response: We are sorry for the confusion. The metrics are consistent across all genes; the different metrics refer to alternative comparisons as plotted in different figures, not among different genes. We hope we have clarified this issue in the main text. We feel it is helpful to provide both minimal and mean p-values when comparing GWAS at the SNP level to imputed expression at the gene level, since typically many SNPs were used to predict gene expression. Therefore, genes can be ranked from GWAS by the most significant SNP (minimal p-value) in the model (Figure 4, formerly Figure 5), or mean p-value among all the SNPs in the model (updated Supp. Figure S6), and both showed significant correlation with imputed gene expression association to phenotype.
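The two gene-level summaries described in this response, the minimal and the mean GWAS p-value over the unique SNPs in a gene's predictive models, can be sketched as follows. The gene names, SNP identifiers, and p-values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (gene, SNP, GWAS p-value) rows; all values are invented.
model_snps = [
    ("GENE_A", "snp1", 1e-6),
    ("GENE_A", "snp2", 0.02),
    ("GENE_A", "snp2", 0.02),   # same SNP reused by a second tissue model
    ("GENE_B", "snp3", 0.4),
    ("GENE_B", "snp4", 0.1),
]

# Keep each SNP once per gene, so repeated use across models is not double counted.
unique_p = defaultdict(dict)
for gene, snp, p in model_snps:
    unique_p[gene][snp] = p

# Both summaries are computed for every gene, so each metric is uniform across genes.
summary = {
    gene: {"min_p": min(ps.values()), "mean_p": sum(ps.values()) / len(ps)}
    for gene, ps in unique_p.items()
}
print(summary)
```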

Reviewer 1, Comment 15. 282: “among SNPs with significant p-values of < 10e-7” - why did you use this threshold instead of the most traditional 5e-8?

Response: Our goal for this part of the manuscript is to illustrate the connection between GWAS SNP associations and gene expression regulation by showing the inclusion of strongly associated SNPs in gene expression predictive models. The threshold is somewhat arbitrary, since we are looking for SNPs associated with CF lung function that may also regulate gene expression, and it is not necessarily constrained by the genome-wide significance threshold. We chose a highly suggestive significance level (< 1e-7) with the assumption that such an association to lung function is more likely to be reflected in the imputed gene expression, since the genome-wide threshold of 5e-8 would result in fewer SNPs to overlap with those chosen in the predictive models.

Reviewer 1, Comment 16. 304-305: “largely due to environmental influences and/or disease process, rather than genetic regulation.” - do you have an h2 measurement for this (GWAS vs. imputed gene expression)? This section would also flow better in the creation/analysis of the models near the beginning of results.

Response: After reconsidering the complex issue of genetic vs environmental influences on phenotype traits, and the potential confusion of “negative” results of no correlations as raised also by Reviewer 1, Comment 21, we have decided to drop this topic and the associated figure from our manuscript.

We did investigate the issue of heritability as suggested by the reviewer, and estimated the h2 of the CF lung function score (KNoRMA) from genome-wide SNPs, using the Genome-wide Complex Trait Analysis (GCTA) approach. The h2 from all imputed SNPs (~8.3 million) is 0.41, and that from cis-SNPs of all PrediXcan predictive models (~1.4 million SNPs) is 0.33. We interpret these preliminary findings as estimates of the narrow-sense heritability of CF lung function attributable to common SNPs, and 0.41 represents the upper limit that gene expression imputation can achieve. The h2 of 0.33 from all cis-SNPs used in current gene expression imputation suggests that there is still room for data mining of the GWAS signals.

We have added relevant text in discussion (lines 478-486): “To estimate proportion of genetic influences on CF lung disease phenotype from GWAS and gene expression imputation, we calculated heritability (h2) from the imputed GWAS data using the GREML-LDMS method (10) from the Genome-wide Complex Trait Analysis (GCTA) software (11). The h2 of KNoRMA from GWAS imputation of ~8.3 million SNPs among ~5,000+ unrelated CF patients, is 0.41 (SE = 0.072), while that from ~1.4 million cis-SNPs used in combined PrediXcan predictive models from 48 GTEx tissues, is 0.33 (SE = 0.061). The difference between the h2 could potentially reflect missing imputable genes due to small training sample sizes, trans-regulation of gene expression from distant genetic variants, and/or other ways of affecting gene function from genetic variants.”

Reviewer 1, Comment 17. 315: “maximal p-value between the 2 multi-tissue meta-analyses for each analysis platform” - Again, choose a consistent metric from one platform or else results are difficult to interpret.

Response: We acknowledge the difference in general strategy we adopted and the desire for uniform metric across all analyses. Our past experiences dealing with high-throughput -omics data sets of 20k+ genes and millions of SNPs suggested that there are often spurious results and edge cases in such analysis, and we therefore sought agreement between different approaches to ensure robustness of the findings. The results tend to be more conservative as evaluated by the distribution of final p-values shown in the QQ plots below, which demonstrate effective control of Type I error. We believe functional evaluations and literature review of the resultant candidate CF lung disease modifier genes supported such a strategy.

Reviewer 1, Comment 18. 360-361: “the direction of predicted expression changes in regard to lung function are opposite between different mapping strategies” - which method best agrees with the known direction of effect in observed CF gene expression data?

Response: This is an interesting question! In our opinion, genetically regulated gene expression need not have the same direction of association to the phenotype trait as the actual observed gene expression, since observed gene expression can also be highly influenced by environmental factors and/or disease processes. Examining the current results for the 2 genes, HLA-DRB1 and HLA-DQA1, it happened to be the case that the alternative mapping strategy of accounting for extra allele polymorphisms at the HLA loci resulted in the same direction of lung disease association between imputed and observed gene expression, while the commonly employed mapping protocol resulted in the opposite direction. However, this is a complex issue that needs more careful investigation, which we feel is beyond the scope of this paper. The ultimate proof is to knock-out a modifier gene, and then examine its effect on the phenotype trait as predicted by this approach.

Reviewer 1, Comment 19. 392-394: “Although EHF and APIP… current gene expression data.” - this issue of best predicted in a model vs. the genes with actual biological implications was described in Weinberg et al. (PMID: 30926968). How does this perspective, as well as the differences in the tissues, contribute to PDHX as a candidate in the CF pathway? How would a colocalization analysis change these results? Additionally, what is the gene currently implicated in the locus in that may be related to CF?

Response: This is a fascinating topic, and it touches on a newly developed method. In writing our discussion, we considered both the Wainberg paper (12) and the online response from the PrediXcan authors: http://hakyimlab.org/post/2017/vulnerabilities/. As users of this method who evaluate the results mainly from known CF biology, we agree with the opinions expressed by both the Wainberg paper and the PrediXcan authors: this is incremental progress that gets us closer to the causal variant(s) and mechanism, but does not get all the way there. These methods have the useful property of framing the analysis at the gene expression level, which improves power to identify modifier genes outside the genome-wide significant loci and narrows the candidate gene list around GWAS loci by ignoring genes without eQTL support. Apart from algorithm improvement and best practices, an important issue is that our reference eQTL training data are far from perfect: small sample sizes (most <300), mixtures of cell types from whole tissues, and limited time representation lacking many developmental stages of tissues, all of which impede mechanistic discovery. Improvement in sample size and quality would probably have more impact, at this time, on the overall performance of these approaches. As a concrete example, we have been studying EHF and APIP at the chr11 GWAS locus for several years without a major breakthrough, and this analysis points to genes further away from the peak SNPs in the predictive models, which, if validated by further investigation, will certainly advance our understanding of genetic modification through gene expression regulation. Meanwhile, EHF and APIP cannot be excluded on the basis of little or no genetic regulation or eQTL signal, since it is possible that the actions of these genes in specific cells or developmental stages relevant to CF lung disease are not captured by current eQTL data sets.

Reviewer 1, Comment 20. 399-410: Good review of how the most sig. predicted genes could work in the context of CF!

Response: Thank you.

Reviewer 1, Comment 21. 433-434: “The overall gene expression association to the CF lung disease severity from our own CF nasal epithelial and LCL data sets is not correlated with imputed gene expression.” - this seems contradictory to lines 175-178, can you clarify this?

Response: Although we have dropped the content comparing observed and imputed gene expression associations with the disease phenotype (Reviewer 1, Comment 16), this is a good question to which we would like to respond. Among the genes whose expression can be imputed from cis-SNPs in the nasal epithelial (2,881) and LCL (5,299) data sets, most (2,309 and 4,633) had also been tested for association between observed gene expression and disease phenotype. Overall, these associations are not correlated with those of imputed gene expression, as shown in the figure below, which we have now deleted from the manuscript. Lines 175-178 in the original main text described genetic regulation of gene expression, i.e., correlations between observed and predicted gene expression. Although correlation between imputed and observed gene expression could translate into correlation between the phenotype associations of the two, it generally does not, since genetic regulation explained, on average, only 12% and 7% of the observed gene expression variance, as judged by the R2 values of the predictive models. We will make all the data used in figures available as supplements.

Reviewer 1, Comment 22. 444: “The non-coding CF modifier genes are likely underestimated” - in addition, they seem understudied from the fewer references in table 1 compared to the protein coding genes. Can you elaborate on the reasons why this may be?

Response: According to one review (PMID:28815535) (13), the functional relevance of non-coding RNAs in general received little attention from the scientific community in the pre-genome sequence era before the 2000s. Publication of the reference human genome sequence in 2001 highlighted how small a portion of the genome codes for proteins, and advances in deep sequencing of RNA in recent years allowed the discovery and cataloging of many genome sequences that are transcribed into non-coding RNAs. These RNAs were identified only recently, and there are relatively few reagents for studying non-coding RNAs in normal biology and disease associations, in contrast to the enzyme assays and antibodies used to study proteins.

The relevant text reads (lines 455-458): “The non-coding CF modifier genes reported here are likely under-estimated compared to protein-coding genes, due to reference genome and gene annotations associated with some of the gene expression data sets used in predictive model training, and general lag of functional knowledge of non-coding transcripts (13).”

Reviewer 1, Comment 23. 460-461: “Furthermore, the number of genes whose expression can be reliably predicted from genetic variants varied among tissues” - a noticeable absence throughout most of this paper is the lack of individual tissues being scrutinized. Which tissue are these significant genes in, and how may that also contribute to CF? And if the tissue seems irrelevant, why are the findings still important?

Response: We will provide association testing results from each tissue, and the 2 meta-analysis results combining all tissues by p-values. As discussed above, all the tissues may contribute to generating predictive models of gene expression from cis-SNPs, since the majority of such regulation is preserved across tissues. The importance of genetic regulation of gene expression to CF lung disease is linked to the fact that SNPs associated with the disease phenotype by GWAS are part of the predictive models of some genes. While some genetic diseases have direct tissue origins, e.g., sickle cell anemia in red blood cells, CF has a complex disease pathogenesis involving a chain of events in multiple tissues, interactions with the environment (reduced clearance of pathogens), and great variation in disease manifestations even among patients with the same genetic defects in CFTR. The time and place at which genetic regulation of CF modifier genes is active and meaningful, in terms of affecting disease outcome, are not clear. We are therefore trying to use all the data available at present to generate leads and hypotheses.

Reviewer 1, General Comment: “Next, review figures/tables and supplementals”

Reviewer 1, Comment 24. Table 1: add a column of which tissue the P-value was max. Where did the keywords originate from? Can you use coefficients instead of protective/harmful?

Response: The maximal p-value was retrieved among 4 meta-analysis p-values, each derived from a varying number of tissues depending on the specific gene. Given the limited space in the Table 1 format, it is difficult to show which tissues contributed to the displayed p-value, but the information will be provided in supporting files. The functional keywords were added by us, through review of the literature and our current expert knowledge of CF disease pathogenesis. They are subjective, and not from any public annotation database. We have added the beta coefficient from linear regression relating PrediXcan imputed gene expression to CF lung disease phenotype, as well as the TWAS signed z-score; both represent mean estimated effect sizes among the tissue models tested.

Reviewer 1, Comment 25. Fig. 1: Since the PM, imGE, and pvals for GTEx, DGN, nasal epi, and LCL are similar except for sample sizes, can they be consolidated to look less cluttered? Also, since genotypes were used to make the gene expression models for nasal epi and LCL and not summary statistics, “GWAS imputation” may be misleading.

Response: We have simplified the workflow into 2 arms: PrediXcan imputed gene expression, and TWAS from summary statistics. To clarify, the starting data of GWAS imputation represent common SNP dosages imputed for each individual patient, which are required for predictive model training. The GWAS (SNP) imputation was analyzed as a traditional GWAS according to the same protocol outlined in the Corvol et al. paper (3), and the summary results were used as input for TWAS/FUSION analysis.

Reviewer 1, Comment 26. Fig. 2: There’s a lot going on in this plot that makes it difficult to follow. Is it more legible if you subset to only the most important tissues?

Response: As expanded upon elsewhere in this document, we feel it is relevant to show results from all the tissues, since the different tissues are a means of deducing expression regulation by genetic variance, and many signals were derived from “non-CF” tissues. However, we understand that there is a lot going on in the plot, and we have made an attempt within the text to further highlight the major patterns in the figure to help the reader navigate it (lines 225-232). These major features include a general agreement between PrediXcan and TWAS, and a direction of imputed expression change relative to CF lung disease phenotype that is generally consistent among different tissues. Additionally, the strongest signals are near GWAS loci on chr5 and chr6, which are imputed in most tissues.

Reviewer 1, Comment 27. Fig. 3: Use ggrepel with higher force for easier readability. These plots would be more appropriate in the supplemental figures with the r2 of both of them described in the text.

Response: We appreciate the pointer, have tried to improve the readability and have moved the figure to the supplements as the reviewer suggested.

Reviewer 1, Comment 28. Fig. 4: At a glance, the PrediXcan and TWAS results look similar in architecture. Can these results be consolidated with colors identifying differences between the methods instead, or colors indicating known vs novel CF genes?

Response: Thank you for the suggestion! We have now consolidated the Manhattan plots to show PrediXcan, TWAS, and GWAS results in the same figure (which is now Figure 3 in the main text). For the PrediXcan (A) and TWAS (B) results, the red squares represent genes near GWAS loci, and the blue triangles represent novel genes outside of 1 MB from the GWAS loci.

Reviewer 1, Comment 29. Fig. 5: This figure would also benefit from ggrepel. It would be easier to follow as a 2x2 rather than branching off the original scatterplot. Are all 3 of these offshoots necessary? What do each of them represent that the others don’t?

Response: Based on this comment, we have simplified the figure (now Figure 4 in the main text) to show just 1 scatterplot without gene name labels. The gene-level information will be available in table format. The colored markers represent the following: red squares = consensus modifier genes near GWAS loci; blue triangles = consensus genes outside GWAS loci; and black diamonds = genes near GWAS loci that are not supported by gene expression imputation as being associated with CF lung disease. The last category illustrates the usefulness of data mining in eliminating certain genes near GWAS loci due to lack of eQTL support from currently available data sets.

Reviewer 1, Comment 30. Fig. 6: Very unique and insightful visual. Can you discuss more about how these genes are co-regulated from their intertwined, linked eQTLs?

Response: Yes, to paraphrase the relationship between GWAS, eQTL, and gene expression imputation: eQTLs from independent training data sets help select SNPs that are correlated with gene expression changes, and if the selected SNPs are collectively associated with the phenotype of interest from GWAS, then the imputed gene expression from these SNPs will be associated with the phenotype. Since genes near a particular GWAS locus may be co-regulated by (or have expression correlated with, i.e., share eQTLs at) the same variant(s), they would also be associated with the phenotype through gene expression imputation, as shown for chr3, chr5, and chr6 (supporting figures S8-S10).
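The relationship described above can be sketched in a few lines. This is a minimal illustration, assuming a predictive model is a set of per-SNP weights and that imputed expression is the weighted sum of a patient's allele dosages; the SNP identifiers, gene names, weights, and dosages are all hypothetical, and treating a missing SNP as dosage 0 is a simplification chosen here, not something taken from the manuscript.

```python
# Hypothetical predictive models: per-gene weights on cis-SNP dosages.
# gene_x and gene_y share rs_a, mimicking two genes with a common eQTL.
models = {
    "gene_x": {"rs_a": 0.30, "rs_b": -0.15, "rs_c": 0.05},
    "gene_y": {"rs_a": 0.20, "rs_d": 0.10},
}

# Hypothetical imputed allele dosages (0 to 2) for three patients.
patients = [
    {"rs_a": 0, "rs_b": 2, "rs_c": 1, "rs_d": 1},
    {"rs_a": 1, "rs_b": 1, "rs_c": 0, "rs_d": 0},
    {"rs_a": 2, "rs_b": 0, "rs_c": 2, "rs_d": 1},
]


def impute_expression(dosages, weights):
    """Weighted sum of dosages over the SNPs in one gene's model.

    SNPs absent from `dosages` are treated as dosage 0 (an assumption).
    """
    return sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())


# One imputed expression value per patient, for each gene.
imputed = {
    gene: [impute_expression(d, weights) for d in patients]
    for gene, weights in models.items()
}
```

Because both hypothetical genes carry weight on `rs_a`, their imputed values rise together across the three patients, which is the sense in which genes sharing eQTLs near a GWAS locus can both show association with the phenotype.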

Reviewer 1, Comment 31. Fig. 7: Put the r2 in captions. This figure would also be better suited for a supplemental.

Response: Based on the previous comment (Reviewer 1, Comment 16), we have decided to remove this figure from the manuscript.

Reviewer 1, Comment 32. Fig. 8: Can you differentiate the known genes from the novel ones with color, bolding, or other formatting? Do you have similar figures for “consensus” genes available?

Response: We have had to update the results of the gene set and pathway enrichment analyses, since we updated the PrediXcan association analysis to use 4 genotype PCs (instead of 1 in the original report) to control for population stratification (as requested by Reviewer 3), but the original feedback regarding the figure is still relevant. To clarify, the Gene Set Enrichment Analysis (GSEA) interrogates the rankings of the genes of a particular set or pathway among all imputed genes, so we are leveraging concerted changes in ranks from both strong and weak signals. As a result, an enriched pathway may not contain top consensus modifier genes (although many of them do, e.g., HLA-containing gene sets). We have added color highlights to the gene hash marks as follows: red = candidate modifiers (379) near GWAS loci; blue = candidate modifiers outside GWAS loci.

Reviewer 1, Comment 33. Fig. 9: Color in the HLA gene dots, they get lost amongst all the grey.

Response: We have marked the HLA genes with red triangles to distinguish them from other genes. Note: this figure is now Figure 7A.

Reviewer 1, Comment 34. Supporting methods: I enjoy your well-described and well-cited (with links!) descriptions. There are a few very minor typos - “the imputed gene expression data sets had sample size” should be “the imputed gene expression data sets had a sample size”.

Response: Thank you. We have significantly revised the Supporting methods document, and the exact phrase is no longer there.

Reviewer 1, Comment 35. I have a small concern with the hierarchical clustering analysis. Why did you have “the missing values in the resultant distance matrix were replaced with the largest distance values between any pairs” rather than just leaving the data N/A? It seems misleading. Also, again, for a PrediXcan-like method that also uses summary statistics, S-PrediXcan is up your alley.

Response: Unfortunately, the hclust clustering function in R does not tolerate missing values (although the distance calculation works fine with NA for missing values). We therefore substituted the largest distance, which is analogous to treating pairs with missing values as the most dissimilar for clustering purposes. Note: this only affects the clustering tree (which appears to have a reasonable shape in Figure 2); the heatmap still uses the original values and colors NA as white. As described above, we did explore S-PrediXcan and obtained results similar, but not identical, to those from PrediXcan, and decided to stay with the current format, since overall statistical power is better with PrediXcan.
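The substitution described above can be illustrated with a small sketch. It is written in Python rather than R and is not the authors' code; the function name and the toy matrix are hypothetical, with `None` standing in for R's NA.

```python
def fill_missing_with_max(dist):
    """Return a copy of a square distance matrix in which missing
    entries (None) are replaced by the largest observed distance."""
    observed = [d for row in dist for d in row if d is not None]
    largest = max(observed)
    return [[largest if d is None else d for d in row] for row in dist]


# Toy symmetric distance matrix; the pair (0, 2) has no computable distance.
dist = [
    [0.0, 0.4, None],
    [0.4, 0.0, 0.9],
    [None, 0.9, 0.0],
]

# Only the copy passed to clustering is changed; `dist` keeps its None
# entries, so a heatmap drawn from it can still show them as missing.
filled = fill_missing_with_max(dist)
```

Filling with the maximum means a pair with no shared data is never merged early in the tree, which matches the stated intent of treating such pairs as the most dissimilar.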

Reviewer 1, Comment 36. S1 Fig.: Do you think these dramatically different slopes are due to RNA quantification collection methods, sample sizes, or other outside factors?

Response: Mainly sample size, since the x-axis is simulated null distribution of r2, which depends on sample size, and the sample sizes for nasal epithelial biopsies and LCLs are 132 and 753.
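The dependence of a null r2 on sample size can be checked with a short simulation. This is a sketch under simplifying assumptions: it uses the plain squared Pearson correlation between two independent normal variables (not the cross-validated model statistic), for which the expected value under the null is 1/(n - 1). The seed and number of replicates are arbitrary.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def mean_null_r2(n, reps, rng):
    """Average r2 between independent standard normal samples of size n."""
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += r_squared(x, y)
    return total / reps


rng = random.Random(0)
sim_nasal = mean_null_r2(132, 1000, rng)  # sample size of the nasal epithelial set
sim_lcl = mean_null_r2(753, 1000, rng)    # sample size of the LCL set
theory_nasal = 1 / (132 - 1)              # about 0.0076
theory_lcl = 1 / (753 - 1)                # about 0.0013
```

With roughly a sixfold difference in sample size, the null r2 values differ by about the same factor, which is enough on its own to change the slope of a plot whose x-axis is the simulated null.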

Reviewer 1, Comment 37. S2 Fig.: I quite enjoy this figure. Can you compare the LCL microarray observed/expected gene count to those found in PrediXcan microarray-based models?

Response: We did not find microarray based models from PrediXcan, but a similar observation was reported by TWAS (9) (Figure 3 from their paper – see below), where NTR (the Netherlands Twins Registry) data set from microarray had lower slope between gene count versus sample size (green in the reproduced figure).

Reviewer 1, Comment 38. S6 Fig.: This is illegible. Would you be able to give each point a number and then have a side table with both the numbers and gene names?

Response: It is not feasible to label all 52 consensus genes, so we chose to label just the top few with p-value < 10^-5. The full data (non-graphed) will be provided in supporting tables.

Reviewer 1, Comment 39. S11 Fig.: Can you include within the main text as part of the discussion why CFTR isn’t as highly significant as one would initially think in an analysis of cystic fibrosis?

Response: Yes. We are testing the severity of lung disease among CF patients, many of whom have the same CFTR mutations, F508del homozygotes being the most common. Because a deleterious CFTR mutation is a criterion for inclusion, we are unlikely to see further association at the locus.

Reviewer 1, Comment 40. S Tables: Can you bold or italicize the known or novel gene findings to make them easier to differentiate?

Response: We have now highlighted the consensus genes in the supporting tables as suggested by the reviewer.

Reviewer 1, Final Comment. Overall, my concerns lie mainly with the comparison between PrediXcan and TWAS, the interpretability of the figures, and the lack of connecting the actual tissue genes were determined significant to the CF phenotype, but I enjoy your analysis and your contribution to determining the genetic architecture of CF. Addressing these concerns as well as clarifying the points I made above and adding an additional colocalization analysis would strengthen this paper.

Response: We thank the reviewer for the very detailed reading of the manuscript and appreciate the time and thoughtful feedback! Hopefully, we have addressed the reviewer’s concerns.

Reviewer 2, General comment: I have two questions:

Reviewer 2, Comment 1. 1. The harmonic mean P-value method was recently corrected ( please see: http://blog.danielwilson.me.uk/2019/08/updated-correction-harmonic-mean-p.html ). Do the authors' calculations incorporate the updated (and corrected) harmonic mean P-value method?

Response: We thank the reviewer for alerting us to the method update, and as a result, we have re-run the relevant analyses. It appears that we were already using version 3, as the re-run results were identical to the earlier ones.
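For readers unfamiliar with the statistic, the raw harmonic mean p-value is simple to compute. The sketch below shows only that raw statistic with optional weights; it does not include any of the calibration or threshold adjustments discussed in the linked correction, and the p-values are hypothetical.

```python
def harmonic_mean_p(pvalues, weights=None):
    """Raw weighted harmonic mean of p-values: sum(w) / sum(w / p).

    With no weights given, all p-values are weighted equally.
    """
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    return sum(weights) / sum(w / p for w, p in zip(weights, pvalues))


# Hypothetical p-values for one gene across three tissues.
tissue_p = [0.01, 0.2, 0.5]
hmp = harmonic_mean_p(tissue_p)  # 3 / (100 + 5 + 2)
```

The harmonic mean is dominated by the smallest p-values, so a gene with a strong signal in one tissue is not diluted by many tissues with no signal.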

Reviewer 2, Comment 2. 2. There is no discussion of LD-contamination ( see for example: https://www.nature.com/articles/s41467-018-03621-1 ). To what extent does this affect the results?

Response: LD-contamination refers to SNPs in strong LD, among which one or more are projected to be causal. Although LD-contamination is a major challenge in determining the causal variant in an explanatory statistical model, due to collinearity of SNPs in LD, it does not affect the accuracy of predictive models, which is the context in which we predict gene expression from SNPs; see https://www.theanalysisfactor.com/differences-in-model-building-explanatory-and-predictive-models/, and https://statisticalhorizons.com/prediction-vs-causation-in-regression-analysis. For a more in-depth discussion of the topic, see https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf. Because we believe LD-contamination does not affect the accuracy of gene expression prediction from SNPs in LD, the results of our data mining at the gene level were not affected. It does present a great challenge to identifying causality at both the SNP and gene levels, which requires further experimental studies.

Reviewer 3.

Reviewer 3, Overall Comments: On the whole, the writing of the manuscript can and should be improved. As it stands, it is hard to follow. The authors present no motivation for the selected approaches to analyze the data (and why they chose to use two). The methods are not described well. It takes a great effort to match the description of the analyses with Fig. 1 illustrating them.

Response: Hopefully, we have improved our manuscript through the revision and resolved the issues raised by the reviewers.

Reviewer 3, Comment 1. Description of one type of analysis is interweaved with sentences starting with ‘Alternatively’ (lines 120, 132) describing the other type of analysis. This makes it harder to follow either of them.

Response: We understand this to be a general criticism that perhaps we had moved too many details of our analyses to the Supporting Methods. We have tried to strike a new balance, and have reorganized the section to describe the 2 different approaches separately.

Reviewer 3, Comment 2. The authors pay attention to unimportant details (such as talk about GTEx pilot data) but not discuss important ones (e.g. choice of just 1 principal component to correct for ancestry).

Response: We thank the reviewer for the constructive comment. We will explain our reason to discuss GTEx pilot data in response to Comment 13.

Regarding the use of 1 PC, since our patient demographics are >90% Caucasian, we originally chose just 1 PC as a covariate to control for population stratification. However, after careful consideration, we agree with the reviewer that this was inadequate, and we now use 4 PCs as covariates in association testing. This reduced the number of consensus and potential candidate modifier genes to 52 and 379 (from 54 and 531 when we were using 1 PC). This approach is more conservative and is based upon variance explained (Supporting Figure S4). We believe it to be the best decision. The results were updated in the current revision.

Reviewer 3, Comment 3. It causes great concern to see that the authors did not correctly use scientific E-notation for small numbers.

For instance, instead of ‘1e-6’ or ‘10^{-6}’ the authors have 1x10e-06, which would actually be equal to 1e-5 if read correctly (10x1e-6 = 1e-5).

There are a total of seven instances of incorrect use of scientific E-notation.

Response: We are very sorry for our carelessness, and we thank the reviewer for pointing out our mix-up. We have corrected all instances.
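The reviewer's arithmetic is easy to confirm, because most programming languages parse E-notation the same way. A minimal check in Python:

```python
import math

# "10e-06" means 10 x 10^-6, so the mistyped form is ten times too large.
mistyped = 1 * float("10e-06")
intended = float("1e-06")
ratio = mistyped / intended
```

Here `mistyped` equals `1e-05`, one order of magnitude above the intended `1e-06`.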

Reviewer 3, Comment 4. Which samples were whole genome sequenced (WGS)?

Response: None were used in this analysis. Gene expression data were obtained from RNA-seq for nasal epithelial samples and from microarray for LCLs (GSE60690). The 101 Canadian WGS samples mentioned were used to form a reference population panel for GWAS imputation of the array-genotyped samples (5), but they were not directly used in this analysis. We have reviewed the method sections and added clarifying language where we thought it would help.

Reviewer 3, Comment 5. Is the sequencing data publicly available?

Response: The RNA-seq data from nasal epithelial samples are not currently available. We have initiated the process of depositing the raw RNA-seq data into the controlled-access database dbGaP, in compliance with patient consent and IRB requirements.

Reviewer 3, Comment 6. Were these samples among those being imputed?

Response: Yes, all samples were used for imputation, but they were excluded in association testing if they contributed to model building.

Reviewer 3, Comment 7. Were WGS genotypes mixed with imputed genotypes in the analyses?

Response: No. The 101 Canadian WGS samples were combined with the 1000 Genomes Project phase 3 (v5a) haplotype data to form a hybrid reference for GWAS imputation (5). They were not used in this analysis for reasons of data provenance, since the WGS samples are not part of the GWAS cohorts.

Reviewer 3, Comment 8. Line 127. Why only one PC was used for correction of ancestry? This appears to be grossly insufficient given current knowledge.

Response: After more careful consideration with the reviewer feedback, we updated the analysis by including 4 PCs (Supporting Figure S4).

Reviewer 3, Comment 9. Line 129. There are numerous methods for robust regression analysis. Not providing a name for the chosen method, only a citation is a great inconvenience for the reader.

Response: We had provided more details in the Supporting methods in our original manuscript, but we have now added additional information in the main text in the revision as suggested by the reviewer. Revised text (142-143): “…the robust regression utilized iterated re-weighted least squares by the rlm function from the R package, MASS.”
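To make "iterated re-weighted least squares" concrete, the sketch below fits a single-predictor line with Huber weights. It is an illustration in Python and not the rlm implementation: the tuning constant k = 1.345 and the MAD-based scale follow what we understand to be rlm's defaults, while the function names, convergence rule, and toy data are assumptions made for this example.

```python
import statistics


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope


def huber_irls(x, y, k=1.345, max_iter=50, tol=1e-10):
    """Start from ordinary least squares, then repeatedly down-weight
    points whose residual exceeds k times a robust scale estimate."""
    a, b = weighted_line(x, y, [1.0] * len(x))
    for _ in range(max_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = statistics.median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break  # at least half the points are fitted exactly
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        a_new, b_new = weighted_line(x, y, w)
        change = abs(a_new - a) + abs(b_new - b)
        a, b = a_new, b_new
        if change < tol:
            break
    return a, b


# Toy data on the line y = 1 + 2x, with one gross outlier at x = 9.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x]
y[9] = 60.0

ols_intercept, ols_slope = weighted_line(x, y, [1.0] * len(x))
robust_intercept, robust_slope = huber_irls(x, y)
```

On this toy data the ordinary least-squares slope is pulled well above 2 by the single outlier, while the re-weighted fit stays close to the slope of the other nine points.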

Reviewer 3, Comment 10. Lines 137-138. What does “in our most stringent analyses” mean?

Response: We meant consensus, which is hopefully explained above in the overview and response to Reviewer 1. We have revised the text (lines 160-164): “For significant modifier genes from each analysis platform, a p-value < 0.01 from both the HMP, and correlation adjusted method (EBM for PrediXcan, or omnibus for TWAS) was chosen. Consensus between the 2 result sets (with 4 p-value < 0.01 thresholds) yielded the most robust findings, while the union of significant genes from 2 result sets maximized sensitivity of discovery.”

Reviewer 3, Comment 11. Line 138. The phrase “we sought consensus” does not exactly read as “we selected genes significant in both analyses”.

Response: Noted. We have tried to improve the precision of our descriptions. The revised text (lines 157-164): “Multi-tissue tests from each result set were combined by two separate meta-analysis methods, a simple harmonic mean p-value (HMP) (19), and a correlation adjusted method, specifically, empirical adaptation of Brown’s method (EBM) (20) for PrediXcan, or omnibus test (10) for TWAS. For significant modifier genes from each analysis platform, a p-value < 0.01 from both the HMP, and correlation adjusted method (EBM for PrediXcan, or omnibus for TWAS) was chosen. Consensus between the 2 result sets (with 4 p-value < 0.01 thresholds) yielded the most robust findings, while the union of significant genes from 2 result sets maximized sensitivity of discovery.”
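The selection rule in the revised text reduces to set operations. The sketch below uses hypothetical gene names and p-values; each gene carries two meta-analysis p-values per platform (HMP and the correlation-adjusted method).

```python
THRESHOLD = 0.01

# Hypothetical per-gene p-values: (HMP, correlation-adjusted method).
predixcan = {
    "GENE_A": (0.001, 0.004),
    "GENE_B": (0.020, 0.003),
    "GENE_C": (0.005, 0.008),
}
twas = {
    "GENE_A": (0.002, 0.009),
    "GENE_B": (0.001, 0.002),
    "GENE_D": (0.003, 0.006),
}


def significant(results, threshold=THRESHOLD):
    """Genes whose two meta-analysis p-values are both below the threshold."""
    return {
        gene
        for gene, (hmp, adjusted) in results.items()
        if hmp < threshold and adjusted < threshold
    }


sig_predixcan = significant(predixcan)
sig_twas = significant(twas)

consensus = sig_predixcan & sig_twas   # passes all 4 thresholds
candidates = sig_predixcan | sig_twas  # significant on either platform
```

Requiring all 4 p-values to fall below 0.01 is the same as requiring the maximum of the 4 to fall below 0.01, which is why the maximal p-value serves as the summary for consensus genes.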

Reviewer 3, Comment 12. Line 173. What is the definition of 'imputable gene'?

Response: An imputable gene is one that showed a significant component of genetic regulation as defined by the PrediXcan and TWAS predictive model builders. For PredictDB models based on GTEx v7 data, cross-validated r > 0.1 and p-value < 0.05 were used, which are the same criteria applied to our CF-derived models; for TWAS, a significant h2 estimate was used (http://gusevlab.org/projects/fusion/).

Reviewer 3, Minor Comments

Reviewer 3, Comment 13. Line 72. Why even mention GTEx pilot data?

Response: Our strategy was designed to consider the general characteristics of genetic regulation of gene expression, or eQTL. These were initially described in publications by GTEx using the pilot data and were not fully reiterated upon later data release.

Reviewer 3, Comment 14. Why every plot is black and white?

Response: Thank you. We have altered the figures to incorporate color where helpful.

References

1. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204-13.

2. Gamazon ER, Segre AV, van de Bunt M, Wen X, Xi HS, Hormozdiari F, et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat Genet. 2018;50(7):956-67.

3. Corvol H, Blackman SM, Boelle PY, Gallins PJ, Pace RG, Stonebraker JR, et al. Genome-wide association meta-analysis identifies five modifier loci of lung disease severity in cystic fibrosis. Nat Commun. 2015;6:8382.

4. Taylor C, Commander CW, Collaco JM, Strug LJ, Li W, Wright FA, et al. A novel lung disease phenotype adjusted for mortality attrition for cystic fibrosis genetic modifier studies. Pediatr Pulmonol. 2011;46(9):857-69.

5. Panjwani N, Xiao B, Xu L, Gong J, Keenan K, Lin F, et al. Improving imputation in disease-relevant regions: lessons from cystic fibrosis. NPJ Genom Med. 2018;3:8.

6. Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 2015;47(9):1091-8.

7. Marazzi A, Joss J, Randriamiharisoa A. Algorithms, routines, and S functions for robust statistics: the FORTRAN library ROBETH with an interface to S-PLUS. Pacific Grove, Calif.: Wadsworth & Brooks/Cole Advanced Books & Software; 1993. xii, 436 p.

8. Venables WN, Ripley BD. Modern applied statistics with S. 4th ed. New York: Springer; 2002. xi, 495 p.

9. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016;48(3):245-52.

10. Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AA, Lee SH, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet. 2015;47(10):1114-20.

11. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565-9.

12. Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet. 2019;51(4):592-9.

13. Jarroux J, Morillon A, Pinskaya M. History, Discovery, and Classification of lncRNAs. Adv Exp Med Biol. 2017;1008:1-46.

Attachment

Submitted filename: response_to_reviewer_comments_clean.docx

Decision Letter 1

Dylan Glubb

2 Sep 2020

Mining GWAS and eQTL data for CF lung disease modifiers by gene expression imputation

PONE-D-19-23859R1

Dear Dr. Dang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Dylan Glubb

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

I congratulate the authors on their considered and well-explained responses to the reviewers, which were a pleasure to read. However, I would suggest that transcription factors are unlikely to provide targets for intervention (see lines 438-9). If the authors are interested in finding druggable targets among their candidate genes, I would recommend using the Open Targets database (https://www.targetvalidation.org/) to assess this.

Reviewers' comments:

Acceptance letter

Dylan Glubb

9 Nov 2020

PONE-D-19-23859R1

Mining GWAS and eQTL data for CF lung disease modifiers by gene expression imputation

Dear Dr. Dang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Dylan Glubb

Academic Editor

PLOS ONE

Associated Data

    Supplementary Materials

    S1 File (XLSX)

    S2 File (XLSX)

    S3 File (XLSX)

    S4 File (DOCX)

    Data Availability Statement

    All predictive models derived from the GTEx human reference data set are publicly available. Gene expression data from CF LCL samples are available from GEO (accession code GSE60690). Gene expression data from CF nasal mucosal epithelial RNAseq samples have been uploaded to dbGaP for controlled access by researchers who meet the criteria for access to confidential data (https://view.ncbi.nlm.nih.gov/dbgap-controlled). Data dictionaries and variable summaries are available on the dbGaP FTP site (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs002254/phs002254.v1.p1/). The public summary-level phenotype data may be browsed at the dbGaP study report page (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002254.v1.p1). The summary GWAS data from the CF Gene Modifier Consortium studies and the summary results of phenotype trait association testing are publicly available on GitHub (https://github.com/danghunccf/CF-GWAS-dataMiningPaper).


    Articles from PLoS ONE are provided here courtesy of PLOS
