Polygenic risk scores for cardiovascular diseases and type 2 diabetes

Chi Kuen Wong; Enes Makalic; Gillian S Dite; Lawrence Whiting; Nicholas M Murphy; John L Hopper; Richard Allman

doi:10.1371/journal.pone.0278764

PLoS One. 2022 Dec 2;17(12):e0278764. doi: 10.1371/journal.pone.0278764

Editor: Gualtiero I Colombo

PMCID: PMC9718402 PMID: 36459520

Abstract

Polygenic risk scores (PRSs) are a promising approach to accurately predict an individual’s risk of developing disease. The areas under the receiver operating characteristic curve (AUCs) of PRSs are often reported only for models that are adjusted for age and sex, which are known risk factors for the disease of interest and confound the association between the PRS and the disease. This makes comparison of PRSs between studies difficult because the genetic effects cannot be disentangled from the effects of age and sex (which have a high AUC without the PRS). In this study, we used data from the UK Biobank and applied the stacked clumping and thresholding method and a variation called the maximum clumping and thresholding method to develop PRSs to predict coronary artery disease, hypertension, atrial fibrillation, stroke and type 2 diabetes. We created case-control training datasets in which age and sex were controlled by design. We also excluded prevalent cases to prevent biased estimation of disease risks. The maximum clumping and thresholding PRSs required many fewer single-nucleotide polymorphisms to achieve almost the same discriminatory ability as the stacked clumping and thresholding PRSs. Using the testing datasets, the AUCs for the maximum clumping and thresholding PRSs were 0.599 (95% confidence interval [CI]: 0.585, 0.613) for atrial fibrillation, 0.572 (95% CI: 0.560, 0.584) for coronary artery disease, 0.585 (95% CI: 0.564, 0.605) for type 2 diabetes, 0.559 (95% CI: 0.550, 0.569) for hypertension and 0.514 (95% CI: 0.494, 0.535) for stroke. By developing a PRS using a dataset in which age and sex are controlled by design, we have obtained true estimates of the discriminatory ability of the PRSs alone rather than estimates that include the effects of age and sex.

Introduction

A polygenic risk score (PRS) is a single quantitative measure to capture the relationship between multiple genetic variants and a phenotype. In practice, it is usually calculated by the sum of risk allele counts of the single-nucleotide polymorphisms (SNPs) weighted by their effect sizes. A PRS can explain the relative risk of getting a particular disease compared to others with a different genotype.
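The weighted sum described above can be written in a few lines. The following is a minimal Python sketch, not the authors' code (their analyses used R); the function name, the effect sizes and the genotypes are all hypothetical values chosen for illustration.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of risk-allele counts (0, 1 or 2 per SNP)."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("one effect size is needed per SNP")
    return sum(g * b for g, b in zip(allele_counts, effect_sizes))


# Hypothetical per-SNP effect sizes (log odds ratios), not taken from any GWAS.
effect_sizes = [0.10, -0.05, 0.20]

# Hypothetical genotypes for two individuals, coded as risk-allele counts.
person_a = [2, 0, 1]
person_b = [0, 2, 0]

score_a = polygenic_risk_score(person_a, effect_sizes)  # 2*0.10 + 1*0.20
score_b = polygenic_risk_score(person_b, effect_sizes)  # 2*(-0.05)
```

Only the difference between two scores is interpretable here: with log odds ratios as weights, a higher score corresponds to higher odds of disease relative to someone with a lower score.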

As the power of polygenic risk scores (PRSs) has substantially increased over the last few years due to more advanced computing technology and better computational algorithms, more studies have suggested that PRSs are capable of identifying clinically meaningful increases in risk [1–4]. For example, Khera et al. [1] developed a PRS for coronary artery disease that identified 8% of individuals with greater than 3-fold increased risk, which is comparable to the increase in risk from monogenic mutations. The discriminatory power of these PRSs, as reported by the area under the receiver operating characteristic curve (AUC), has usually been quite high. For instance, the PRSs developed by Khera et al. [1] have AUCs of 0.81 for coronary artery disease and 0.77 for atrial fibrillation; the metaGRS developed by Inouye et al. [2] has an AUC of 0.79 for coronary artery disease; and Bolli et al. [5] developed a PRS that has an AUC of 0.81 for coronary artery disease.

However, the prediction models used in these studies adjust for age and sex, which are known risk factors for the disease of interest and confound the association between the PRS and the disease. The reference models in these studies, which often include age, sex and a few principal components, already have high AUCs. If these studies do not report AUCs separately for the reference model and the PRS, recognizing how much a PRS actually contributes to disease prediction is impossible.

In addition, comparing covariate-adjusted AUCs between studies can be difficult because of differences in the age and sex distributions of the study samples. A disease with a non-linear association with age will have a different AUC in a study of younger people than in a study of older people. This is because the reference models (i.e., the age and sex models) have different AUCs. Not knowing the separate AUCs for the PRS and the reference model makes comparison difficult.

Another problem with some studies [1] that seek to develop PRSs is the use of prevalent cases. This can lead to biased estimates of disease risks, known as the prevalence–incidence bias [6], because severe cases die before, or are too unwell for, study enrolment, leaving only the mild cases included in the analysis.

In this paper, we aim to develop PRSs for coronary artery disease, hypertension, atrial fibrillation, stroke and type 2 diabetes when the effects of age and sex are controlled by design. We deliberately created a matched case-control study in which we controlled for age and sex by sampling controls from the available data and we excluded prevalent cases to prevent potential mis-estimation of disease risks.

Predicting an individual’s risk of developing disease can provide tremendous value in public health. It allows early intervention and less costly treatment by directing screening or other health resources to the patients who are at high risk.

Materials and methods

Ethics approval

The UK Biobank has Research Tissue Bank approval (REC #11/NW/0382) that covers analysis of data by approved researchers. All participants provided written informed consent to the UK Biobank before data collection began. This research has been conducted using the UK Biobank resource under Application Number 47401.

Participants

We used genotyped data from the UK Biobank Axiom Array [7] to develop PRSs for five common diseases [8, 9]: coronary artery disease, hypertension, atrial fibrillation, stroke and type 2 diabetes. The UK Biobank conducted baseline assessment of over 500,000 participants aged 40–69 years from 2006 to 2010. We used the disease definitions described in the supplements of Said et al. [8, 9]. Prevalent cases were excluded to prevent biased estimation of disease risks. For quality control, we removed variants with minor allele frequency less than 0.001, Hardy–Weinberg equilibrium p-value less than 10⁻⁵, or genotyping rate less than 95%. For each disease, the cases were split into training (70%) and testing (30%) datasets. The training datasets were used to build a PRS for each disease and the predictive performances were evaluated on the testing datasets.
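The quality-control thresholds described above can be sketched as a simple filter. This Python example assumes the usual reading that a variant is kept only when its genotyping rate is at least 95%, and it assumes that the per-variant statistics have already been computed; the dictionary layout and the variant names are hypothetical.

```python
MAF_MIN = 0.001        # minor allele frequency threshold
HWE_P_MIN = 1e-5       # Hardy-Weinberg equilibrium p-value threshold
CALL_RATE_MIN = 0.95   # genotyping rate threshold (assumed to be a minimum)


def passes_qc(variant):
    """True when a variant clears all three quality-control thresholds."""
    return (
        variant["maf"] >= MAF_MIN
        and variant["hwe_p"] >= HWE_P_MIN
        and variant["call_rate"] >= CALL_RATE_MIN
    )


# Hypothetical variants with precomputed statistics.
variants = [
    {"id": "snp_a", "maf": 0.2500, "hwe_p": 0.40, "call_rate": 0.99},  # passes
    {"id": "snp_b", "maf": 0.0005, "hwe_p": 0.40, "call_rate": 0.99},  # too rare
    {"id": "snp_c", "maf": 0.1000, "hwe_p": 1e-8, "call_rate": 0.99},  # fails HWE
    {"id": "snp_d", "maf": 0.3000, "hwe_p": 0.70, "call_rate": 0.90},  # low call rate
]

kept = [v["id"] for v in variants if passes_qc(v)]
```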

Training dataset

To control for the effects of age and sex, we applied the following sampling strategy to create the training datasets. For each of the five diseases, we computed the quintiles of age using the cases and divided individuals into five age groups. For each of the five age groups, and for each sex separately (i.e., the total number of groups is 10 for each disease), we sampled 5 controls for each case. If the number of controls was not enough to draw 5 controls per case for all groups, we drew 4 per case, and so on. We then randomly selected 70% of the case and control sets to form the training dataset. By this sampling strategy, the case-control ratios were approximately the same across all groups, and therefore the individuals were age and sex matched.
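The sampling strategy above can be sketched as follows. This is an illustrative Python reading of the text, not the authors' implementation: the record layout, the function name and the synthetic data are hypothetical, and the sketch lowers the ratio strictly until every group can be filled, whereas the study tolerated a handful of unmatched controls.

```python
import random
import statistics
from collections import defaultdict


def sample_matched_controls(cases, controls, max_ratio=5, seed=0):
    """Draw the same number of controls per case in every age-by-sex group.

    Each person is a dict with "age" and "sex" keys (hypothetical layout).
    Returns the ratio actually used and the list of sampled controls.
    """
    rng = random.Random(seed)
    # Four cut points that split the cases' ages into quintiles.
    cuts = statistics.quantiles([p["age"] for p in cases], n=5)

    def group(person):
        # Age group 0-4 (number of cut points below the age), plus sex.
        return (sum(person["age"] > c for c in cuts), person["sex"])

    n_cases = defaultdict(int)
    pool = defaultdict(list)
    for person in cases:
        n_cases[group(person)] += 1
    for person in controls:
        pool[group(person)].append(person)

    # Start at max_ratio and step down until every group can supply enough
    # controls (a strict reading of "we drew 4 per case, and so on").
    ratio = max_ratio
    while ratio > 0 and any(len(pool[g]) < ratio * n for g, n in n_cases.items()):
        ratio -= 1

    sampled = []
    for g, n in n_cases.items():
        sampled.extend(rng.sample(pool[g], ratio * n))
    return ratio, sampled


# Synthetic example: one case and six candidate controls per age and sex.
cases = [{"age": a, "sex": s} for s in "FM" for a in range(40, 70)]
controls = [{"age": a, "sex": s} for s in "FM" for a in range(40, 70) for _ in range(6)]
ratio, matched = sample_matched_controls(cases, controls)
```

Because every group receives the same number of controls per case, the case-control ratio cannot vary with age group or sex, which is what "controlled by design" means in this setting.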

Testing dataset

For the testing dataset for each of the diseases, we used the remaining 30% of cases and randomly sampled 5 controls per case without matching for age and sex. Controls were drawn from the 30% of unaffected participants identified for the testing dataset. The sizes of the training and testing datasets for each disease are summarized in Table 1.

Table 1. Sizes of training and testing datasets used in our study.

Disease	Training size, total (controls / cases)	Testing size, total (controls / cases)
Coronary artery disease	38,217 (31,847 / 6,370)	16,374 (13,645 / 2,729)
Hypertension	51,273 (41,018 / 10,255)	21,970 (17,576 / 4,394)
Atrial fibrillation	28,725 (23,937 / 4,788)	12,306 (10,255 / 2,051)
Stroke	12,693 (10,577 / 2,116)	5,439 (4,533 / 906)
Type 2 diabetes	12,768 (10,640 / 2,128)	5,472 (4,560 / 912)

Statistical analysis

We created our PRSs using a recently developed method called stacked clumping and thresholding (SCT) [10]. Applying clumping (or pruning) to control linkage disequilibrium followed by marginal p-value thresholding is a standard method for computing PRS [11]. This approach requires users to specify hyperparameters such as the size of clumping windows (kb), the correlation threshold (r²) and the p-value significance threshold for clumped SNPs. In practice, choosing these hyperparameters is not straightforward, and users usually apply default values; for example, the default option in PLINK [12] uses r² = 0.5 for the correlation threshold, 250 kb for the window size and p = 0.01 for the p-value threshold.

SCT is an advanced algorithm that is based on the standard clumping and thresholding method. The user selects a set of values for each of the hyperparameters, runs clumping and thresholding on each combination of those parameters and gives a PRS for each combination. These steps can be efficiently conducted using the R package bigsnpr [13]. The PRSs are then stacked using a penalized regression model. The outcome of this algorithm is a linear combination of PRSs, where each PRS is a linear combination of variants. Therefore, a single vector of variant effect sizes can be obtained in the final prediction model. Instead of stacking these PRSs, we could select the PRS with the best prediction, and this is referred to as the maxCT approach. In general, SCT would identify more genetic variants than maxCT.
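The difference between the two approaches can be illustrated with a toy example. The Python sketch below is not the bigsnpr implementation: the genotypes, candidate weight vectors and stacking weights are hypothetical, and in SCT the stacking weights would be fitted by penalized regression rather than fixed by hand. It shows that maxCT keeps the single candidate with the highest AUC, and that a stacked combination of candidates collapses into one vector of per-variant weights.

```python
def auc(scores, labels):
    """Chance that a random case outscores a random control (ties count half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def score(genotypes, weights):
    """One weighted allele-count sum per individual."""
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]


# Hypothetical training data: six individuals, three SNPs.
genotypes = [[2, 1, 0], [1, 2, 1], [2, 0, 2], [0, 1, 0], [1, 0, 1], [0, 0, 2]]
labels = [1, 1, 1, 0, 0, 0]

# Candidate weight vectors from two clumping-and-thresholding settings.
# A zero weight means the SNP was dropped under that setting.
candidates = {
    "strict": [0.3, 0.0, 0.0],
    "lenient": [0.3, 0.1, -0.05],
}

# maxCT: keep the single candidate with the highest training AUC.
aucs = {name: auc(score(genotypes, w), labels) for name, w in candidates.items()}
best = max(aucs, key=aucs.get)

# SCT: combine the candidates linearly (stacking weights assumed here).
stack = {"strict": 0.6, "lenient": 0.4}
combined = [sum(stack[n] * candidates[n][j] for n in candidates) for j in range(3)]

# Scoring with the combined vector equals stacking the candidate scores.
stacked_scores = [
    sum(stack[n] * s for n, s in zip(candidates, per_person))
    for per_person in zip(*(score(genotypes, w) for w in candidates.values()))
]
```

Any SNP with a non-zero weight in at least one stacked candidate stays in the combined vector, which is consistent with SCT retaining more variants than maxCT.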

We applied SCT and maxCT to create PRSs for each of the five diseases. We used the training datasets to create 1,400 risk scores for each chromosome using the default hyperparameter values provided by the R package bigsnpr (Table 2). For maxCT, we selected the risk score that maximized the AUC on the training datasets as the final PRS. For SCT, we stacked the 30,800 (1,400 × 22) risk scores from all 22 chromosomes using penalized logistic regression; the optimal stack weight was also estimated from the training sets.

Table 2. A grid of hyperparameters used in the SCT algorithm.

Hyperparameters	Values
Correlation threshold (r²)	0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95
Base window sizes (kb)	50, 100, 200, 500
Significance threshold (p)	50 evenly spaced thresholds

These are the default values used in the R package bigsnpr. The algorithm runs clumping and thresholding on each combination of these parameters and combines the risk scores by a penalized regression. The window size is computed as the base window size divided by the correlation threshold. The significance thresholds are evenly spaced on a logarithmic scale.
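The size of the grid follows directly from Table 2: 7 correlation thresholds × 4 base window sizes × 50 significance thresholds gives 1,400 combinations per chromosome. The Python sketch below enumerates such a grid. The range of p-value thresholds is a placeholder chosen for illustration, and truncating the window size to whole kilobases is an assumption; it does agree with the window sizes of 52 kb and 105 kb reported for r² = 0.95 in Table 5.

```python
import math

R2_VALUES = [0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95]
BASE_WINDOWS_KB = [50, 100, 200, 500]
N_THRESHOLDS = 50


def log_spaced(lo, hi, n):
    """n values evenly spaced on a log10 scale from lo to hi inclusive."""
    step = (math.log10(hi) - math.log10(lo)) / (n - 1)
    return [10 ** (math.log10(lo) + i * step) for i in range(n)]


def window_kb(base_kb, r2):
    """Clumping window: base window size divided by r2, in whole kb (assumed)."""
    return int(base_kb / r2)


# Placeholder range of p-value thresholds, not the range used in the study.
P_THRESHOLDS = log_spaced(1e-8, 1.0, N_THRESHOLDS)

grid = [
    (r2, window_kb(base, r2), p)
    for r2 in R2_VALUES
    for base in BASE_WINDOWS_KB
    for p in P_THRESHOLDS
]

scores_per_chromosome = len(grid)
scores_all_autosomes = scores_per_chromosome * 22
```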

To estimate the GWAS effect sizes of SNPs, we obtained summary statistics from large external GWAS. We removed ambiguous SNPs and variants with duplicated positions or refSNP cluster ID numbers, and only kept SNPs that appeared in both the UK Biobank data and the study from which we used summary statistics. These GWAS and the number of SNPs are summarized in Table 3. The second last column gives the number of SNPs in the original studies. The last column shows the number of SNPs that appeared in both the UK Biobank data and the original studies after removing ambiguous SNPs and other quality control.

Table 3. External GWAS summary statistics used in our study.

Disease	GWAS study	# SNPs	# matched SNPs
Coronary artery disease	Nikpay et al. (2015) [14]	9,455,778	506,432
Hypertension	Zhu et al. (2019) [15]	5,265,189	382,924
Atrial fibrillation	Christophersen et al. (2017) [16]	11,792,062	508,687
Stroke	Malik et al. (2018) [17]	8,255,860	513,802
Type 2 diabetes	Scott et al. (2017) [18]	12,056,346	536,788

After PRSs were created for each disease, we quantified their predictive power in the testing data by using the bigstatsr package in R to compute AUCs. No other covariates were included in the calculation of the AUCs. To assess the association of each of the PRS with the disease of interest in the testing data, we used logistic regression to estimate the odds ratio (OR) per standard deviation (SD) of the PRS. The SDs were calculated using the controls in the 30% testing dataset. In addition, we assessed the calibration performance by fitting a logistic regression with the disease status and the logit of the predicted probabilities given by our PRSs. A well-calibrated model should have an intercept close to 0 and a slope close to 1.
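The three evaluation steps above (AUC without covariates, odds ratio per SD of the controls, and calibration intercept and slope) can be sketched in plain Python. This is an illustration only: the study used the bigstatsr package and standard logistic regression in R, the PRS values below are hypothetical, and confidence intervals are omitted. Centering on the control mean is an added assumption that does not change the odds ratio.

```python
import math
import statistics


def auc(scores, labels):
    """Chance that a random case outscores a random control (ties count half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def per_control_sd(scores, labels):
    """Rescale scores so that one unit equals one SD among the controls."""
    controls = [s for s, y in zip(scores, labels) if y == 0]
    mean, sd = statistics.mean(controls), statistics.stdev(controls)
    return [(s - mean) / sd for s in scores]


def fit_logistic(x, y, iterations=25):
    """Intercept and slope of logit P(y = 1) = a + b * x by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Entries of the 2x2 information matrix.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical PRS values: five cases (1) followed by five controls (0).
prs = [0.35, 0.80, 0.25, 0.90, 0.55, 0.10, 0.40, 0.20, 0.60, 0.30]
status = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

prs_auc = auc(prs, status)

z = per_control_sd(prs, status)
intercept, log_or = fit_logistic(z, status)
or_per_sd = math.exp(log_or)

# Calibration: regress status on the logit of the predicted probabilities.
predicted_logit = [intercept + log_or * zi for zi in z]
cal_intercept, cal_slope = fit_logistic(predicted_logit, status)
```

In this sketch the calibration model is fitted on the same data that produced the predictions, so the intercept and slope come out at 0 and 1 by construction. The check is informative only when the predicted probabilities come from a model fitted elsewhere, as with the training and testing datasets used here.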

Results

S1 Table shows, for each disease, the number of participants in the 70% age- and sex-matched training dataset and the number in the 30% unmatched testing dataset. Five controls could be selected for almost all cases; for each of coronary artery disease, atrial fibrillation and stroke, three controls could not be matched. For hypertension, four controls were drawn for each case and two controls could not be matched.

The distributions of the standardized SCT PRSs in the testing datasets are plotted in Fig 1. For all diseases except stroke, the PRSs for the cases had a greater mean and median than the controls. For example, for atrial fibrillation the PRS had a mean of 0.34 for the cases and −0.07 for the controls. Similarly, the PRS for coronary artery disease had a mean of 0.26 for the cases and −0.05 for the controls, and the PRS for type 2 diabetes had a mean of 0.28 for the cases and −0.06 for the controls. The PRS for stroke had similar means for the cases (0.03) and the controls (−0.01), which shows its lack of discriminatory ability compared with the PRSs for the other diseases.

The main results are summarized in Fig 2 and Table 4. The strongest predictive performance was found for the PRSs for atrial fibrillation followed by the PRSs for type 2 diabetes and coronary artery disease and then the PRSs for hypertension. The PRSs for stroke were unable to predict disease.

Table 4. Predictive performance of the developed PRSs and the number of identified SNPs.

Disease	AUC (95% CI), maxCT	AUC (95% CI), SCT	Number of SNPs, maxCT	Number of SNPs, SCT
Coronary artery disease	0.572 (0.560, 0.584)	0.587 (0.576, 0.599)	1,059	390,782
Hypertension	0.559 (0.550, 0.569)	0.566 (0.556, 0.576)	61,669	309,759
Atrial fibrillation	0.599 (0.585, 0.613)	0.613 (0.599, 0.626)	265	216,837
Stroke	0.514 (0.494, 0.535)	0.512 (0.492, 0.533)	17,568	169,186
Type 2 diabetes	0.585 (0.564, 0.605)	0.595 (0.575, 0.615)	46,353	419,209

For each disease except stroke, the SCT PRS had a slightly higher AUC than the maxCT PRS. For every disease, however, the SCT PRS was based on many more SNPs: roughly 820× more for atrial fibrillation, 370× more for coronary artery disease, 9× to 10× more for type 2 diabetes and stroke, and 5× more for hypertension. For example, for atrial fibrillation the AUC increased from 0.599 (95% CI: 0.585, 0.613) for 265 SNPs in the maxCT PRS to 0.613 (95% CI: 0.599, 0.626) for 216,837 SNPs in the SCT PRS. For hypertension, which had the largest number of maxCT SNPs, the AUC increased from 0.559 (95% CI: 0.550, 0.569) for 61,669 SNPs in the maxCT PRS to 0.566 (95% CI: 0.556, 0.576) for 309,759 SNPs in the SCT PRS. The optimal hyperparameters for maxCT, which maximized the AUC in the training sets, are reported in Table 5. These hyperparameters are the size of clumping windows (kb), the correlation threshold (r²) and the p-value significance threshold for the clumped SNPs.
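The multiples quoted in this paragraph are rounded and can be recomputed from the SNP counts in Table 4. The short Python sketch below does this; only the variable names are introduced here, and the counts are copied from the table.

```python
# (maxCT SNPs, SCT SNPs) for each disease, copied from Table 4.
snp_counts = {
    "Coronary artery disease": (1_059, 390_782),
    "Hypertension": (61_669, 309_759),
    "Atrial fibrillation": (265, 216_837),
    "Stroke": (17_568, 169_186),
    "Type 2 diabetes": (46_353, 419_209),
}

# How many times more SNPs the SCT PRS uses than the maxCT PRS.
ratios = {disease: sct / maxct for disease, (maxct, sct) in snp_counts.items()}

for disease, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{disease}: {ratio:.1f}x")
```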

Table 5. Optimal hyperparameters for maxCT.

Disease	r²	kb	p
Coronary artery disease	0.50	1,000	1.23×10⁻³
Hypertension	0.80	625	9.77×10⁻²
Atrial fibrillation	0.95	52	5.75×10⁻⁵
Stroke	0.95	105	2.82×10⁻²
Type 2 diabetes	0.80	125	7.59×10⁻²

The coefficients of our calibration analysis showed that the prediction models were well calibrated for atrial fibrillation, coronary artery disease and type 2 diabetes. For the maxCT PRSs, the intercept and slope were found to be 0.02 (95% CI: −0.17, 0.22) and 1.02 (95% CI: 0.90, 1.14) for atrial fibrillation, −0.11 (95% CI: −0.31, 0.09) and 0.94 (95% CI: 0.81, 1.07) for coronary artery disease, and −0.13 (95% CI: −0.45, 0.17) and 0.91 (95% CI: 0.72, 1.11) for type 2 diabetes. There was no evidence to reject the null hypothesis that the intercept of the calibration curve is zero and the slope is one. However, different results were found for the other two diseases: the calibration was weak for hypertension (slope = 0.80, 95% CI: 0.69, 0.92) and poor for stroke (slope = 0.37, 95% CI: −0.27, 1.01).

The results of the logistic regression to estimate the odds ratio (OR) per standard deviation (SD) are summarized in Table 6. They are similar to what we observed in terms of AUC, with the SCT PRSs having slightly higher associations than the maxCT PRSs. The strongest performance was seen for the PRSs for atrial fibrillation, having an OR per SD of 1.49 (95% CI: 1.42, 1.57) for the SCT PRS and an OR per SD of 1.41 (95% CI: 1.35, 1.48) for the maxCT PRS. The OR per SD for type 2 diabetes and coronary artery disease were similar in magnitude and the OR per SD for hypertension was slightly lower. The PRSs for stroke were not associated with disease. Table 7 shows the comparison of the AUCs for the SCT PRSs with and without the inclusion of age and sex in the testing data.

Table 6. Odds ratio (and 95% confidence interval) per standard deviation for PRSs generated by maxCT and SCT.

Disease	maxCT	SCT
Coronary artery disease	1.29 (1.24, 1.35)	1.36 (1.31, 1.42)
Hypertension	1.23 (1.18, 1.27)	1.26 (1.22, 1.30)
Atrial fibrillation	1.41 (1.35, 1.48)	1.49 (1.42, 1.57)
Stroke	1.05 (0.97, 1.13)	1.04 (0.97, 1.12)
Type 2 diabetes	1.35 (1.25, 1.45)	1.41 (1.31, 1.51)

Table 7. Predictive performance of the SCT PRSs in the testing data, with and without including age and sex.

Disease	AUC (95% CI), PRS only	AUC (95% CI), PRS + sex + age
Coronary artery disease	0.587 (0.576, 0.599)	0.706 (0.696, 0.716)
Hypertension	0.566 (0.556, 0.576)	0.677 (0.669, 0.686)
Atrial fibrillation	0.613 (0.599, 0.626)	0.738 (0.728, 0.750)
Stroke	0.512 (0.492, 0.533)	0.668 (0.650, 0.686)
Type 2 diabetes	0.595 (0.575, 0.615)	0.638 (0.619, 0.657)

For each disease, we selected summary statistics from the GWAS Catalog that are publicly available for download, discovered using mostly Caucasian populations with a large sample size, and not generated using the UK Biobank. Ideally, we would have used summary statistics based only on incident cases (and also satisfying the other criteria mentioned) in order to match our study design, but we could not find such summary statistics. We would expect better performance if the summary statistics were commensurate with our study design. Note that for hypertension, the summary statistics were generated using the UK Biobank, so the performance of the hypertension PRS could be overestimated.

Discussion

In this study, we have addressed two important limitations of some other studies that have attempted to develop PRSs [1, 2, 5]. First, we have ensured that our PRS models do not include the effects of age and sex and represent the genetic effects alone.

To do this, we used a sampling strategy to create training datasets in which age and sex are controlled by design. We ensured that the ratio of the number of cases to the number of controls was the same across all age and sex groups in the training datasets. Therefore, the selection of SNPs and estimation of their ORs in the development stage of the PRSs cannot be affected by age and sex. Importantly, the AUCs for our PRSs in the testing datasets are due solely to genetic effects and are not inflated by the inclusion of age and sex in the model. For example, in the testing dataset for atrial fibrillation, the AUC for a base model with age and sex was 0.711, while the AUC for our PRS alone was 0.613 (see Table 7). If we presented the AUC for the PRS together with age and sex, as other authors [1, 5] have done, it would be 0.738. This age- and sex-adjusted AUC has often been reported without also reporting an AUC for the PRS alone, making it difficult to understand the contribution of the PRS to disease prediction.

Second, in other studies, the inclusion of prevalent cases might lead to biased estimation of disease risks because severe or fatal cases do not have an opportunity to be included in the analysis. By excluding prevalent cases we have ensured that our disease risks are not mis-estimated. The vast size of the UK Biobank has meant that we have achieved large sample sizes using incident cases.

Our results suggest that the PRSs developed in this study have moderate discriminatory power for incident atrial fibrillation (AUC = 0.613), coronary artery disease (AUC = 0.587) and type 2 diabetes (AUC = 0.595). Our PRSs were not able to predict risk for stroke. A previous study [3] pointed out that a PRS for stroke is less predictive than PRSs for other common diseases because stroke is a more heterogeneous disease. Including more variants in a PRS can improve its predictive performance, even if most of the variants have very small effect sizes [5]. While this is consistent with our finding that the SCT PRSs perform better than the maxCT PRSs, the cost and the practicality of implementing these PRSs in clinical practice should also be taken into consideration. Finding the balance between performance and practicality is crucial for a successful implementation. We found that the maxCT PRS could be a good candidate for this purpose because the number of SNPs is much more manageable (e.g., for atrial fibrillation, 265 SNPs for the maxCT PRS vs. 216,837 SNPs for the SCT PRS) without too much sacrifice of prediction performance (e.g., an AUC of 0.599 with maxCT vs. 0.613 with SCT for atrial fibrillation). Simulation and real data analyses [10] have shown that maxCT outperforms the most widely used clumping and thresholding method.

One potential limitation of our study is that we used summary statistics from GWAS that were not matched for age and sex. This approach will potentially reduce the performance of the PRSs. We used external summary statistics rather than obtaining them using a hold-out set from the UK Biobank so that we could maximize the samples available for analysis. While the weights from the summary statistics are used in the maxCT PRSs, we selected SNPs using the training data, which are age- and sex-matched. For the SCT PRSs, the final weights of the SNPs are obtained by fitting a penalized regression using the training data.

The most common cardiovascular disease is coronary artery disease (CAD), which involves the reduction of blood flow to the heart muscle due to build-up of plaque (atherosclerosis) in the arteries of the heart. Clinical risk factors include high blood pressure, smoking, diabetes, lack of exercise, obesity, high blood cholesterol, poor diet, depression and excessive alcohol consumption. Similarly, type 2 diabetes primarily occurs as a result of modifiable risk factors. Thus, accurate risk prediction for the development of these diseases allows early intervention, including educational resources to drive behavior modification, for the patients who are at high risk.

Established clinical risk prediction scores, for example, the Framingham risk scores, are designed for use in people aged over 30 years [19]. Some studies [20] have shown that individuals with a high PRS had a risk similar to that of individuals with familial hypercholesterolemia (a genetic disorder that increases the likelihood of coronary artery disease), although their levels of cholesterol and other traditional risk factors were normal. As a result, individuals at high genetic risk of coronary artery disease might not be receiving timely advice because of the limitations of the clinical risk tools. Because a PRS is based on germline DNA, it can potentially be used much earlier than conventional risk prediction tools. The early identification of individuals at increased genetic risk could lead to prevention strategies at earlier ages and significant reductions in mortality and treatment costs.

Conclusion

We developed PRSs and evaluated their predictive performances for coronary artery disease, hypertension, atrial fibrillation, stroke and type 2 diabetes. Using a sampling strategy, the effects of age and sex were controlled by design and did not affect the development of the PRSs. The predictive performances were reported as their true AUCs, not AUCs that include the effects of age and sex. Our PRSs have moderate power to predict incident coronary artery disease, atrial fibrillation and type 2 diabetes. Further studies are needed to examine the clinical utility of PRSs for improving risk prediction for these diseases.

Supporting information

S1 Table. Number, age quintile and sex of the cases and controls in the 70% training matched dataset for each of the diseases studied.

(DOCX)

Data Availability

Access to the data used in this study can be obtained by applying directly to the UK Biobank at https://www.ukbiobank.ac.uk/register-apply/. The authors did not receive special access privileges to the data that others would not have. Interested researchers will be able to access the data in the same manner by applying directly to the UK Biobank. The successful PRSs arising from this study have been deposited in PGS Catalog (PGS002773 – PGS002780). R code used in the analyses is available from the corresponding author for non-commercial purposes only.

Funding Statement

The authors received no external funding for this work. CKW, GSD, LW, NMM and RA are employed by a commercial company, Genetic Technologies Limited, which provided support in the form of salaries but did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of all authors are articulated in the Author Contributions section.

References

1. Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018;50: 1219–1224. doi: 10.1038/s41588-018-0183-z
2. Inouye M, Abraham G, Nelson CP, Wood AM, Sweeting MJ, Dudbridge F, et al. Genomic risk prediction of coronary artery disease in 480,000 adults: Implications for primary prevention. J Am Coll Cardiol. 2018;72: 1883–1893. doi: 10.1016/j.jacc.2018.07.079
3. Abraham G, Malik R, Yonova-Doing E, Salim A, Wang T, Danesh J, et al. Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke. Nat Commun. 2019;10. doi: 10.1038/s41467-019-13848-1
4. Wand H, Lambert SA, Tamburro C, Iacocca MA, O’Sullivan JW, Sillari C, et al. Improving reporting standards for polygenic scores in risk prediction studies. Nature. Nature Research; 2021. pp. 211–219. doi: 10.1038/s41586-021-03243-6
5. Bolli A, Di Domenico P, Bottà G. Software as a service for the genomic prediction of complex diseases. bioRxiv. 2019. p. 763722. doi: 10.1101/763722
6. Hill G, Connelly J, Hébert R, Lindsay J, Millar W. Neyman’s bias re-visited. J Clin Epidemiol. 2003;56: 293–296. doi: 10.1016/s0895-4356(02)00571-1
7. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562: 203–209. doi: 10.1038/s41586-018-0579-z
8. Said MA, Verweij N, Van Der Harst P. Associations of combined genetic and lifestyle risks with incident cardiovascular disease and diabetes in the UK Biobank study. JAMA Cardiology. 2018;3: 693–702. doi: 10.1001/jamacardio.2018.1717
9. Eastwood SV, Mathur R, Atkinson M, Brophy S, Sudlow C, Flaig R, et al. Algorithms for the capture and adjudication of prevalent and incident diabetes in UK Biobank. PLoS One. 2016;11. doi: 10.1371/journal.pone.0162388
10. Privé F, Vilhjálmsson BJ, Aschard H, Blum MGB. Making the Most of Clumping and Thresholding for Polygenic Scores. Am J Hum Genet. 2019;105: 1213–1221. doi: 10.1016/j.ajhg.2019.11.001
11. Wray NR, Lee SH, Mehta D, Vinkhuyzen AAE, Dudbridge F, Middeldorp CM. Research review: Polygenic methods and their application to psychiatric traits. Journal of Child Psychology and Psychiatry and Allied Disciplines. Blackwell Publishing Ltd; 2014. pp. 1068–1087. doi: 10.1111/jcpp.12295
12. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81: 559–575. doi: 10.1086/519795
13. Privé F, Aschard H, Ziyatdinov A, Blum MGB. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34: 2781–2787. doi: 10.1093/bioinformatics/bty185
14. Nikpay M, Goel A, Won HH, Hall LM, Willenborg C, Kanoni S, et al. A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 2015;47: 1121–1130. doi: 10.1038/ng.3396
15. Zhu Z, Wang X, Li X, Lin Y, Shen S, Liu CL, et al. Genetic overlap of chronic obstructive pulmonary disease and cardiovascular disease-related traits: A large-scale genome-wide cross-trait analysis. Respir Res. 2019;20. doi: 10.1186/s12931-019-1036-8
16. Christophersen IE, Rienstra M, Roselli C, Yin X, Geelhoed B, Barnard J, et al. Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat Genet. 2017;49: 946–952. doi: 10.1038/ng.3843
17. Malik R, Chauhan G, Traylor M, Sargurupremraj M, Okada Y, Mishra A, et al. Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes. Nat Genet. 2018;50: 524–537. doi: 10.1038/s41588-018-0058-3
18. Scott RA, Scott LJ, Mägi R, Marullo L, Gaulton KJ, Kaakinen M, et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes. 2017;66: 2888–2902. doi: 10.2337/db16-1253
19. Berry JD, Lloyd-Jones DM, Garside DB, Greenland P. Framingham risk score and prediction of coronary heart disease death in young men. Am Heart J. 2007;154: 80–86. doi: 10.1016/j.ahj.2007.03.042
20. Rao AS, Knowles JW. Polygenic risk scores in coronary artery disease. Current opinion in cardiology. NLM (Medline); 2019. pp. 435–440. doi: 10.1097/HCO.0000000000000629

PLoS One. doi: 10.1371/journal.pone.0278764.r001

Decision Letter 0

Thomas Tischer

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

7 Sep 2022

PONE-D-22-03370

Polygenic risk scores for cardiovascular diseases and type 2 diabetes

PLOS ONE

Dear Dr. Wong,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The manuscript has been evaluated by two reviewers, both of whom raised major concerns with the current version. Specifically, they are concerned about the reporting and inclusion of the sex and age variables and request clarification of the statistical models that have been used. Their full reviews are attached below; could you please revise your manuscript to carefully address all their concerns?

Please submit your revised manuscript by Oct 22 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Thomas Tischer

Staff Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for providing the following Funding Statement:

“I have read the journal's policy and the authors of this manuscript have the following competing interests: CKW, GSD, LW, NMM and RA are employees of Genetic Technologies Limited. Aspects of this manuscript are covered by Provisional Patent Application AU 2020903793, Methods of assessing risk of developing a disease. Chi Kuen Wong, Gillian Dite, Nicholas Murphy and Richard Allman are named inventors on the patent application, which is assigned to Genetic Technologies Limited.”

We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors.

If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study in the Author Contributions section of the online submission form. Please make any necessary amendments directly within this section of the online submission form. Please also update your Funding Statement to include the following statement: “The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If the funding organization did have an additional role, please state and explain that role within your Funding Statement.

Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this study, Wong et al. derive polygenic risk scores (PRS) for incident cases of several cardiovascular diseases and type 2 diabetes using a clumping and thresholding approach. They use a subset (70%) of the UKBB, which they divide into age- and sex-matched cases and controls, as a training set to optimize the hyperparameters of their clumping and thresholding approach for each phenotype. They then test the predictive performance of their PRS on the remaining 30% test set and state that their sampling strategy controlled for the effects of age and sex in the development of the PRS.

This study has several major issues in the design of the experiment, as well as in the reporting. I think the authors need to address all of these concerns to show that their results reflect the claims they are making. I have listed all of my comments regarding the manuscript below:

General major points

1) The authors mix a couple of different arguments in their abstract and introduction about how age and sex are problematic in the construction of PRS and how this apparently affects AUC reporting of predictive models. I agree with the authors that the predictive performance of PRS is often reported including age and sex as covariates in the logistic regression models, and that authors should report base models without PRS as well, to properly assess how much the PRS adds to the performance. However, the argument that age and sex are included in the PRS models themselves is questionable. The authors should clarify why they think that matching cases and controls for age and sex in the clumping and thresholding step of the PRS construction accounts for age and sex differences for traits, since the clumping and thresholding approach will select candidate variants from pre-calculated GWAS results based on p-value thresholds and LD structure in the training set. If the authors believe that age and sex are associated with the traits that they are investigating, shouldn’t they rather perform GWAS stratified by age and sex to recalibrate effect sizes and association statistics? This way, the variants selected for each stratum during clumping and thresholding would actually reflect the differences the authors want to highlight. An example of sex-stratified PRS generation can be found in PMID: 35873490 (https://pubmed.ncbi.nlm.nih.gov/35873490/).

2) Related to the issue of how the authors attempt to control for age and sex in the construction of their PRS, they fail to show that, without controlling for these covariates, models actually over- or underperform. The authors should generate PRS using the same approach without matching cases and controls, and report the predictive performance of these models to show whether they actually differ across sex and age groups from the age- and sex-matched PRS.

3) The authors removed prevalent disease cases from their analysis. They do highlight that this is to prevent biased estimation of disease risk, which is an important point to make. However, they fail to also highlight that by using prevalent disease based GWAS summary statistics in the selection of their PRS variants, their PRS might also be underperforming, as variants selected for PRS might be biased towards prevalent disease.

4) In the methods section, the authors write that they computed AUC values to assess predictive performance of their PRS, but fail to provide details about which computational packages they used (if they used any) to do this. In the results section, they refer to using logistic regression to calculate the odds ratio (OR) per standard deviation for Table 3. Were the probabilities from the logistic regression models used for calculation of AUC, and did the authors include any covariates in the logistic regression models (PCs, smoking, statin usage, etc.)? The authors need to add the details about the logistic regression models for the OR per SD calculation to the methods section. Currently, what they write in the results section does not match what they report in the methods section for the OR per SD calculation.
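Point 4 asks whether AUC was computed from logistic regression probabilities or from the raw score. As a minimal sketch of why the two coincide for a single predictor: AUC is the share of case-control pairs in which the case scores higher, so any strictly increasing transformation of the score (such as a logistic curve with positive slope) leaves it unchanged. The scores, labels, and coefficients below are invented for illustration and are not taken from the manuscript; the question of additional covariates is not addressed by this sketch.

```python
import math


def auc(scores, labels):
    """Share of case-control pairs in which the case outranks the control.

    Ties count as half a win. This is the Mann-Whitney form of the AUC.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def logistic(x, intercept=-1.0, slope=0.4):
    # Hypothetical coefficients; any positive slope preserves the ranking.
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Toy standardised PRS values and disease status (1 = case, 0 = control).
prs = [-1.2, 0.3, 0.8, -0.5, 1.9, 0.1]
disease = [0, 0, 1, 0, 1, 1]

auc_raw = auc(prs, disease)
auc_prob = auc([logistic(x) for x in prs], disease)
print(auc_raw, auc_prob)
```

In this toy example, 8 of the 9 case-control pairs are ordered correctly, and the value is the same whether the raw score or the fitted probability is ranked.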

Minor points

One controversial point the authors raise is whether variant imputation is too computationally and analytically intense to be generally applied to PRS calculation. They do cite a study that investigated the effect of different imputation algorithms on PRS performance (Ref 8 in the manuscript: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00801-x). However, while this study found that imputation can introduce variability at the individual level, it generally does not cause problems in the interpretation of the PRS. As many easy-to-use public tools such as the Michigan Imputation Server (https://imputationserver.sph.umich.edu/index.html) provide low-cost, fast imputation to large reference panels, and whole-genome sequencing is becoming ever cheaper, the argument that imputation and large PRS panels are unfeasible becomes less valid. The Michigan Imputation Server has recently even incorporated PRS calculation (from the PGS Catalog) as part of the imputation process, making it even easier and faster to obtain PRS for datasets.

Data reporting

The authors fail to mention the availability of their score for the public to reproduce their results in independent cohorts. They don’t provide any reference to a data repository or any other repository where their PRS variants and effect sizes can be found. I suggest the authors submit their PRS to https://www.pgscatalog.org/, a resource that has been created for exactly the purpose of making PRS reporting more reproducible.

Reviewer #2: The authors describe that AUC reflects the predictive accuracy of not only the polygenic score but also important risk factors, namely age and sex. The authors use sex- and age-matched control samples to estimate the AUC of polygenic score alone.

I have some major concerns. Firstly, it is not very clear to me what the aim of this work is and what contribution to the field the authors are trying to make.

(1) If the aim is to show that the commonly used AUC estimates are ‘inflated’ due to the effects of age and sex, what is missing in the manuscript is a proper comparison between AUC estimates using the same polygenic scores in case control samples with and without matching the distributions of sex and age. It was only briefly mentioned in the discussion section (line 139). It would always be helpful to quantify the observations.

(2) If the aim is to develop polygenic scores that improve risk stratification for the tested traits (as mentioned on line 65 in the introduction), the authors did not compare their PRS with previously published scores, some of which used more sophisticated approaches such as Bayesian methods (e.g. LDpred, PRS-CS, and SBayesR). These methods are believed to outperform C+T based methods. Such a comparison could probably also address the issue that the stroke score was not predictive, by using the PRS developed by Neumann et al. (available at the PGS Catalog; https://www.ahajournals.org/doi/10.1161/STROKEAHA.120.033670#d6462236e293) or the one from an older study that the authors themselves cited (reference 3).
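The comparison requested in point (1) presupposes a case-control sample in which age and sex are balanced by design. One common way to build such a sample is frequency matching: within each age band and sex stratum, draw as many controls as there are cases. The sketch below illustrates that idea only; the record keys (`age_band`, `sex`, `case`) and the function name are hypothetical, and the authors' exact matching procedure is not described in this letter.

```python
import random
from collections import defaultdict


def match_controls(people, seed=0):
    """Return cases plus an equal number of controls per (age_band, sex) stratum.

    `people` is a list of dicts with the hypothetical keys 'age_band',
    'sex' and 'case' (1 = case, 0 = control).
    """
    rng = random.Random(seed)
    cases = defaultdict(list)
    controls = defaultdict(list)
    for person in people:
        key = (person["age_band"], person["sex"])
        (cases if person["case"] == 1 else controls)[key].append(person)

    matched = []
    for key, stratum_cases in cases.items():
        pool = controls.get(key, [])
        # If a stratum is short of controls, keep only as many cases
        # as can be matched, so every stratum stays balanced.
        n = min(len(stratum_cases), len(pool))
        matched.extend(rng.sample(stratum_cases, n))
        matched.extend(rng.sample(pool, n))
    return matched


# Toy cohort: one stratum with surplus controls, one with surplus cases.
cohort = (
    [{"age_band": 1, "sex": "F", "case": 1} for _ in range(2)]
    + [{"age_band": 1, "sex": "F", "case": 0} for _ in range(5)]
    + [{"age_band": 2, "sex": "M", "case": 1} for _ in range(3)]
    + [{"age_band": 2, "sex": "M", "case": 0} for _ in range(1)]
)
matched = match_controls(cohort)
n_cases = sum(p["case"] for p in matched)
print(len(matched), n_cases)
```

Because cases and controls then have identical age-band and sex distributions, neither variable can discriminate between them in the matched sample; the cost, as noted in the following comment, is a smaller sample.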

Moreover, I don’t agree with the authors that adjusting for or considering age and sex in the model when reporting the ‘inflated’ AUC of a polygenic score is an issue. We compare AUC basically in the following two scenarios. (1) We compare different polygenic scores, or scores generated using different parameters, in the same testing sample. Age, sex and other baseline characteristics have the same effects on the goodness-of-fit metrics and better scores always have higher AUC. (2) We compare scores that are validated in different cohorts. This is more complicated and there are always other cohort-specific factors (such as different healthcare systems, phenotype definitions, fine-scale population stratification, different socioeconomic status) which may contribute to the estimation of PRS accuracy. Even if the case-control samples within each cohort have matched age and sex distributions, differences in the distributions between cohorts (e.g. one cohort might have many young cases while the other cohort recruits more older cases) could still result in AUC estimates that are not comparable. Also, power is reduced when selecting controls that match with cases due to the smaller sample size. There are other approaches to accounting for differences in age and sex between cases and controls, which have been commonly used already, such as the incremental R2 (AUC or pseudo-R2 for binary traits; https://doi.org/10.1038/s41596-020-0353-1), which quantifies the increase in variance explained with the addition of the PRS to the baseline model.
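The incremental approach mentioned at the end of this comment can be sketched as follows: fit a baseline model on age and sex, fit a second model that adds the PRS, and report the difference in test-set AUC. Everything in the example is an assumption made for illustration: the cohort is simulated (it is not UK Biobank data), the effect sizes are arbitrary, and a small hand-written gradient-descent fit stands in for whatever regression software a real analysis would use.

```python
import math
import random


def auc(scores, labels):
    """Share of case-control pairs in which the case has the higher score."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def fit_logistic(rows, labels, steps=300, lr=0.5):
    """Logistic regression by batch gradient descent: [intercept, coef, ...]."""
    n, p = len(rows), len(rows[0])
    w = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for x, y in zip(rows, labels):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def linear_predictor(w, rows):
    return [w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)) for x in rows]


# Simulated cohort: standardised age, binary sex, standardised PRS.
rng = random.Random(1)
people = []
for _ in range(2000):
    age, sex, prs = rng.gauss(0, 1), float(rng.random() < 0.5), rng.gauss(0, 1)
    risk = 1.0 / (1.0 + math.exp(-(-1.0 + 1.0 * age + 0.5 * sex + 0.8 * prs)))
    people.append((age, sex, prs, int(rng.random() < risk)))

# 70/30 split into training and testing sets.
train, test = people[:1400], people[1400:]
y_train = [p[3] for p in train]
y_test = [p[3] for p in test]

base = fit_logistic([p[:2] for p in train], y_train)  # age + sex
full = fit_logistic([p[:3] for p in train], y_train)  # age + sex + PRS

auc_base = auc(linear_predictor(base, [p[:2] for p in test]), y_test)
auc_full = auc(linear_predictor(full, [p[:3] for p in test]), y_test)
increment = auc_full - auc_base
print(f"baseline {auc_base:.3f}, with PRS {auc_full:.3f}, increment {increment:.3f}")
```

Reporting the baseline AUC alongside the full-model AUC makes the contribution of the PRS visible without requiring a matched design.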

Also, the authors discussed a lot about the disadvantages of using imputed versus genotyped array data, which doesn’t really fit in the manuscript. The authors did not do any analysis comparing the accuracy, individual-level variability, or costs versus benefits between PRS with a small number of genetic variants using array data and PRS with many more genetic variants using imputed data. This is out of the scope of the work and the discussion on this topic probably needs to be shortened. The justification of using array data can be briefly mentioned in the beginning of the results section when the study samples and analysis methods are introduced, or in the methods section.

More specific comments:

Results section on page 5: it would be great to briefly introduce the study cohort first. More specifically, the effort of matching controls with cases in terms of age and sex should be mentioned, as well as the sample sizes. It would be great to move table 4 here.

Line 76: figure 2 shows the distributions, not figure 1.

Page 7: I’m not very sure whether the comparison of SCT and maxCT scores is necessary here. It has been clearly established in the original paper (citation 7) that the SCT scores show better performance than maxCT. It is not surprising either that the SCT scores contain more genetic variants given that they are combinations of multiple C+T scores.

Line 151: it would be more accurate to say that “our results suggest that PRSs that were developed in this study have xxx”. The conclusion may not be generalisable to other PRSs.

Line 168-169: it should be made clear at the beginning of the results section that the genotyped data were used.

Page 10-11: like I said previously, there are a lot of discussions about imputed data, which is not the focus of this manuscript which currently does not have any analysis relevant to imputed vs genotyped data.

The authors should also consider that genotyping only a small number of SNPs that are used in the maxCT scores is not flexible. There will be larger GWAS and more accurate PRS for the tested diseases as well as new phenotypes in the future, and different genetic markers will be needed. In the long term, using commercial array chips + imputation strategy is probably more cost effective, as we can reuse the high coverage data for better scores and scores for many other traits when available.

Line 171-172: what does the ‘validation process’ mean here?

Line 185-186: please provide reference to this claim which I find hard to believe. There are standard ways to perform QC of array data and imputation can be done easily and freely using online servers.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 2;17(12):e0278764. doi: 10.1371/journal.pone.0278764.r002

Author response to Decision Letter 0

3 Oct 2022

Please see the attached file "Response to Reviewers"

Attachment

Submitted filename: Response to Reviewers.docx

(39.2 KB, docx)

PLoS One. doi: 10.1371/journal.pone.0278764.r003

Decision Letter 1

Gualtiero I Colombo

15 Nov 2022

PONE-D-22-03370R1

Polygenic risk scores for cardiovascular diseases and type 2 diabetes

PLOS ONE

Dear Dr. Wong,

Indeed, although the authors responded to most of the comments raised by the reviewers, this editor and the reviewers believe that it is important to show that the PRS constructed using cases and matched controls for covariates such as age and sex is actually better than simply adjusting for them when calculating predictive performance.

Please submit your revised manuscript by Dec 30 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Gualtiero I. Colombo, M.D., Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have adequately addressed all of my comments on their original manuscript submission. The scores submitted to the PGS Catalog need to be made public once the manuscript is accepted for publication.

Reviewer #2: The authors have addressed most of my comments and the revised manuscript has been greatly improved. I still have one last comment on the authors’ responses to 5.2 and 5.7. It is important to establish and quantify the issue of over-estimation, as the authors claimed, of PRS accuracy measured in AUC when including age and sex. I still think that it would be great to analyse all traits in addition to atrial fibrillation (lines 241-244), and add the “over-estimated” AUC somewhere in a main table. I (and perhaps some other readers too) would find it interesting and helpful to know how much the AUC metrics reported in other studies are overestimated due to the effects of covariates such as age and sex.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. 2022 Dec 2;17(12):e0278764. doi: 10.1371/journal.pone.0278764.r004

Author response to Decision Letter 1

21 Nov 2022

We will ensure that the PRSs submitted to the PGS Catalog are made public when the paper is accepted for publication.

In addition to atrial fibrillation, we have now added Table 7 to show the inflation of AUCs when age and sex are included for all five diseases.

Attachment

Submitted filename: Response to Reviewers.docx

(18.6 KB, docx)

PLoS One. doi: 10.1371/journal.pone.0278764.r005

Decision Letter 2

Gualtiero I Colombo

23 Nov 2022

Polygenic risk scores for cardiovascular diseases and type 2 diabetes

PONE-D-22-03370R2

Dear Dr. Wong,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Gualtiero I. Colombo, M.D., Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0278764.r006

Acceptance letter

Gualtiero I Colombo

25 Nov 2022

PONE-D-22-03370R2

Polygenic risk scores for cardiovascular diseases and type 2 diabetes

Dear Dr. Wong:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Gualtiero I. Colombo

Academic Editor

PLOS ONE

Associated Data


Supplementary Materials

S1 Table. Number, age quintile and sex of the cases and controls in the 70% training matched dataset for each of the diseases studied.

(DOCX, 15.2 KB)


Data Availability Statement


[pone.0278764.ref013] 13.Prive F, Aschard H, Ziyatdinov A, Blum MGB. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34: 2781–2787. doi: 10.1093/bioinformatics/bty185 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0278764.ref014] 14.Nikpay M, Goel A, Won HH, Hall LM, Willenborg C, Kanoni S, et al. A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 2015;47: 1121–1130. doi: 10.1038/ng.3396 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0278764.ref015] 15.Zhu Z, Wang X, Li X, Lin Y, Shen S, Liu CL, et al. Genetic overlap of chronic obstructive pulmonary disease and cardiovascular disease-related traits: A large-scale genome-wide cross-trait analysis. Respir Res. 2019;20. doi: 10.1186/s12931-019-1036-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0278764.ref016] 16.Christophersen IE, Rienstra M, Roselli C, Yin X, Geelhoed B, Barnard J, et al. Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat Genet. 2017;49: 946–952. doi: 10.1038/ng.3843 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0278764.ref017] 17.Malik R, Chauhan G, Traylor M, Sargurupremraj M, Okada Y, Mishra A, et al. Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes. Nat Genet. 2018;50: 524–537. doi: 10.1038/s41588-018-0058-3 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0278764.ref018] 18.Scott RA, Scott LJ, Mägi R, Marullo L, Gaulton KJ, Kaakinen M, et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes. 2017;66: 2888–2902. doi: 10.2337/db16-1253 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0278764.ref019] 19.Berry JD, Lloyd-Jones DM, Garside DB, Greenland P. Framingham risk score and prediction of coronary heart disease death in young men. Am Heart J. 2007;154: 80–86. doi: 10.1016/j.ahj.2007.03.042 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0278764.ref020] 20.Rao AS, Knowles JW. Polygenic risk scores in coronary artery disease. Current opinion in cardiology. NLM (Medline); 2019. pp. 435–440. doi: 10.1097/HCO.0000000000000629 [DOI] [PubMed] [Google Scholar]
