Abstract
One of the major challenges in the post-genomic era is elucidating the genetic basis of human diseases. In recent years, studies have shown that polygenic risk scores (PRS), based on aggregated information from millions of variants across the human genome, can estimate individual risk for common diseases. Nevertheless, current medical practice still relies predominantly on physiological and clinical indicators to assess personal disease risk. For example, caregivers mark individuals with high body mass index (BMI) as having an increased risk of developing type 2 diabetes (T2D). An important question is whether combining PRS with clinical metrics can increase the power of disease prediction, in particular from early life. In this work we examined this question, focusing on T2D. We present here a sex-specific integrated approach that combines PRS with additional measurements and age to define a new risk score. We show that such an approach, combining adult BMI and PRS, achieves considerably better prediction than each of the measures alone on unrelated Caucasians in the UK Biobank (UKB, n = 290,584). Likewise, integrating PRS with self-reports on birth weight (n = 172,239) and comparative body size at age ten (n = 287,203) also substantially enhances prediction as compared to each of its components. While the integration of PRS with BMI achieved better results as compared to the other measurements, the latter are early-life measurements that can be integrated already at childhood, to allow preemptive intervention for those at high risk of developing T2D. Our integrated approach can be easily generalized to other diseases, with the relevant early-life measurements.
Keywords: body weight, genetic variations, GWAS, metabolic disease, obesity, sex difference, UK-Biobank
1. Introduction
Predicting the risk of an individual to develop a specific disease is a key challenge in clinical decision making [1]. Based on such predictions, individuals can be identified for early intervention to prevent the disease, delay its onset or better manage it and its outcome. Understanding the genetic component of the disease can highlight individuals at risk based on their genetic profile. Indeed, with more genetic and phenotypic information available for large cohorts, genome-wide association studies (GWAS) have been used to find genetic variants associated with complex diseases and traits [2,3,4]. Nevertheless, in most GWAS studies, variants that are significantly associated with the disease or trait explained only a small fraction of its presumed genetic heritability component. This shortfall in the GWAS contribution to complex disease risk is known as the missing heritability problem, and various explanations have been proposed for it [5,6,7]. One likely explanation is that complex diseases involve complex intracellular interactions, so that the many variants that fall below significance in GWAS nonetheless affect the trait and cumulatively contribute to the phenotype even more than the relatively few statistically significant GWAS variants [8,9]. In light of this possibility, different studies developed polygenic risk scores (PRS) that consider the cumulative effect of millions of genetic markers to predict the probability of an individual to develop a complex disease [1,10,11,12,13]. In some cases, the PRS methodology was able to highlight individuals with the same risk as individuals with rare monogenic mutations linked to a disease. The greater effect on public health reflects the fact that the PRS-based approach covers many more individuals (up to 20-fold) as compared to rare monogenic mutation carriers [14].
In addition, it was shown that the penetrance of rare monogenic high-risk variants in various diseases is also affected by the polygenic risk background as reflected by PRS [15].
The etiology of common complex diseases is presumed to be a combination of both genetic and environmental factors and the interactions between them [16]. Various physical and clinical measures are often taken to highlight individuals with high risk for diseases, and these measures reflect both genetic and non-genetic factors. For example, high body mass index (BMI), which has both genetic and non-genetic components [17], is a major risk factor for type 2 diabetes (T2D) [18,19]. Birth weight is yet another example of a physical measure that combines effects from both genetic and environmental factors [20]. However, the direction of the association between birth weight and T2D (low birth weight being a risk or also high birth weight), its scale and whether it is sex-dependent are still not clear [21,22,23,24,25].
In this work we asked whether a combined approach that utilizes both genetic factors (e.g., PRS) and quantitative measures (that have non-genetic components) can improve disease prediction. We evaluated this approach by using both the PRS and physical measurements associated with T2D prevalence (BMI, birth weight and comparative body size at age ten) to predict disease risk, based on the UK Biobank (UKB) cohort [26].
Our results demonstrate that such a combined risk predictor significantly enhances prediction as compared to PRS or each of the underlying measures alone. Importantly, our analysis includes early-life measurements, meaning that individuals at high risk can be identified early in life, leading to more effective intervention.
2. Methods
2.1. UK Biobank (UKB) Data
The analysis in this work is based on the information available for UKB participants [26] (2019 release). We focused on Caucasians by limiting the analysis to participants who self-reported as White (being White, British, Irish or any other white background [codes 1, 1001, 1002, 1003, respectively, in Ethnic background, UKB data field 21000]) and who were classified as Caucasians based on their genetic ancestry (Genetic ethnic group, UKB data field 22006). We also required the individuals to have both genotyping data and information on T2D disease status. Disease classification was based on clinical information provided for UKB participants and encoded by ICD-10 code for T2D (E11.X) either as a main (UKB data field 41202) or secondary (UKB data field 41204) diagnosis. Additional phenotypes were used for the analysis: BMI (taken at the UKB Assessment Centre, UKB data field 21001), birth weight (based on self-reporting, UKB data field 20022) and comparative body size at age ten (based on self-reporting, UKB data field 1687). In cases where multiple values were reported for an individual (e.g., BMI measures taken at different time points), the maximal value was taken. In each of the analyses we focused on individuals with the relevant phenotypic information. To address possible sex differences, the analysis was done separately for males and females. Following the filtering steps, 332,338, 184,288 and 318,260 participants were included in the analysis for BMI, birth weight and body size at age ten, respectively. Finally, we focused on participants evaluated at age 40–70 and removed genetic relatives, by keeping only one representative of each kinship group of related individuals from the same sex (recall that the analysis was done separately for each sex). This resulted in sets of 290,584, 172,239 and 287,203 participants for BMI, birth weight and body size at age ten, respectively.
2.2. Polygenic Risk Score Calculation
The PRS of an individual is calculated as the weighted sum of his/her allele values over the set of markers. This score is based on the genotype of each individual and does not consider sex or age. Therefore, we refer to it as a “raw” PRS. Let $n$ be the number of markers used for raw PRS calculation, let $x_j$ be the allelic status of marker $j$ in a specific individual ($x_j \in \{0, 1, 2\}$), and let $w_j$ be the weight of marker $j$ (based on the association of the marker with the trait). The raw PRS of that individual is then defined as:

$$\text{raw PRS} = \sum_{j=1}^{n} w_j x_j$$
The weights for PRS calculation for T2D on a set of approximately 6.5 million imputed markers, based on a previous work [14], were downloaded from The Cardiovascular Disease Knowledge Portal (https://cvd.hugeamp.org/downloads.html; accessed on 10 June 2021). We applied these weights, which had been fit on the UKB data, on the markers of UKB participants to obtain raw PRS values for each individual.
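The weighted sum of Section 2.2 can be illustrated with a short sketch. It uses plain Python for self-containment, and the function name `raw_prs`, the toy weights and the toy genotype are illustrative assumptions, not the published T2D weights or the authors' code.

```python
def raw_prs(allele_counts, weights):
    """Return the weighted sum of allele counts over all markers (one individual)."""
    if len(allele_counts) != len(weights):
        raise ValueError("allele_counts and weights must have the same length")
    return sum(w * x for w, x in zip(weights, allele_counts))


# Toy example with three hypothetical markers (assumed values, not real weights).
toy_weights = [0.5, -0.25, 1.0]
toy_genotype = [2, 1, 0]  # copies of the effect allele at each marker
toy_score = raw_prs(toy_genotype, toy_weights)
```

With millions of markers, a vectorized dot product (for example in numpy, which the authors list among their packages) would be the more practical choice; the logic is the same.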
2.3. Composite Risk Score
In this work, we defined a composite risk score (CRS), which is composed of three components: genetic profile (PRS), phenotypic information (i.e., BMI, birth weight or comparative body size at age ten), and age. For each of the components, we estimate an individual’s disease risk based on the disease prevalence observed within the relevant UKB cohort across individuals with similar scores (e.g., similar raw PRS for the genetic component). A weighted sum of the different components is taken to obtain a CRS that reflects an individual’s disease risk. The estimated risk scores and weights for each component are learned in a training set and evaluated on a test set (as described below). The rationale behind transforming the original measures into estimated disease prevalence is to allow incorporation of measures that are not necessarily monotonic with respect to disease prevalence. In addition, transforming the measures into disease prevalence also normalizes the different measures, which often span different ranges (e.g., raw PRS and BMI values). The analysis was done separately for each sex.
Formally, we sorted all the individuals in the training set based on their raw PRS values and divided them into 100 equal-size bins (i.e., raw PRS percentiles of UKB participants). For each bin we calculated T2D prevalence in that bin (i.e., the number of cases divided by the total number of individuals in that bin) and defined it as the genetic risk (GR) of the members of that bin. For example, if in a specific bin, 5% of the individuals were reported as having the disease, the GR of that bin was defined as 0.05. Thus, the GR reflects the actual disease risk in the UKB, based on individuals with similar raw PRS scores, sharing the same bin. Let raw $\mathrm{PRS}_i$ be the raw PRS value of sample $i$. We define $\mathrm{GR}_i$ as the GR of the bin that raw $\mathrm{PRS}_i$ belongs to.
The same procedure was also applied to the phenotypic measure: we sorted all individuals in the training set based on their measures, divided them into 100 equal-size bins and calculated for each bin the phenotypic risk (PR) of members of that bin. In the case of comparative body size at age ten, which included only three values (“Thinner”, “About average” and “Plumper”), people were divided into three bins based on this classification and the PR was calculated for each of these three predefined bins. We denote the PR of individual $i$ by $\mathrm{PR}_i$.
In addition, we also considered age for the composite score. We divided all individuals in the training set according to their age (measured in rounded years) and for each age calculated the age risk (AR) of members with the same age. We denote the AR of individual $i$ by $\mathrm{AR}_i$.
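The conversion of a raw measure into a prevalence-based risk can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `fit_bin_risk` and `bin_risk` are hypothetical, and mapping a new (e.g., test-set) value to a training bin through the bin boundaries is an assumption about a detail the text does not spell out.

```python
from bisect import bisect_right


def fit_bin_risk(values, is_case, n_bins=100):
    """Learn per-bin disease prevalence from training data.

    Returns (cut_points, risks): cut_points[j] is the smallest training value
    in bin j + 1, and risks[b] is the fraction of cases among members of bin b.
    """
    n = len(values)
    if n < n_bins:
        raise ValueError("need at least one individual per bin")
    order = sorted(range(n), key=lambda k: values[k])
    cut_points, risks = [], []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if b > 0:
            cut_points.append(values[members[0]])
        risks.append(sum(is_case[k] for k in members) / len(members))
    return cut_points, risks


def bin_risk(value, cut_points, risks):
    """Look up the risk of the training bin that `value` falls into (assumed rule)."""
    return risks[bisect_right(cut_points, value)]


# Toy data: ten individuals, five bins of two individuals each.
toy_values = list(range(10))
toy_cases = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_cuts, toy_risks = fit_bin_risk(toy_values, toy_cases, n_bins=5)
```

For a categorical measure such as comparative body size at age ten, or for age in rounded years, the same idea reduces to a prevalence lookup per category or per year.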
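The composite risk score (CRS) of sample $i$ was then defined as a weighted sum of the three risk measures (genetic, phenotypic and age risk, denoted $\mathrm{GR}_i$, $\mathrm{PR}_i$ and $\mathrm{AR}_i$):

$$\mathrm{CRS}_i = \alpha \cdot \mathrm{GR}_i + \beta \cdot \mathrm{PR}_i + \gamma \cdot \mathrm{AR}_i$$

where $\alpha$, $\beta$ and $\gamma$ are the weights of the genetic, phenotypic and age components, each taking values in $[0, 1]$.
These parameters are learned on the training set, as described below.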
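In addition to CRS, we also converted each of the measures alone to disease risk estimates and included age, without including the other measure. Formally, the PRS of sample $i$ was defined as follows:

$$\mathrm{PRS}_i = \alpha \cdot \mathrm{GR}_i + \gamma \cdot \mathrm{AR}_i$$

where $\mathrm{GR}_i$ and $\mathrm{AR}_i$ are the genetic risk and age risk of sample $i$, and the weights $\alpha, \gamma \in [0, 1]$ are learned separately for this measure.
Thus, as opposed to the original raw PRS, PRS considers age as well (but does not include the phenotypic measure).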
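Similarly, for the phenotypic measures BMI and birth weight, we defined a measure risk score that combines them with age, but without PRS. Thus, the BMI risk of sample $i$ (as opposed to raw BMI, which included only the original BMI measurement) was defined as:

$$\mathrm{BMI}_i = \beta \cdot \mathrm{PR}_i + \gamma \cdot \mathrm{AR}_i$$

where $\mathrm{PR}_i$ is the phenotypic risk of sample $i$ computed from BMI, $\mathrm{AR}_i$ is its age risk, and the weights $\beta, \gamma \in [0, 1]$ are learned separately for this measure.
This was also done for birth weight risk (as opposed to raw birth weight) but not for comparative body size at age ten, which includes only three distinct values. Finally, we also considered age alone, to examine whether the other measures provide additional predictive information beyond age alone. In that case, the score of sample $i$ was defined as $\mathrm{AR}_i$.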
For each measure (CRS, PRS and the phenotypic measures BMI and birth weight), we trained our model on 70% of the individuals (comprising the training set) to estimate the optimal weights ($\alpha$, $\beta$ and $\gamma$, depending on the specific measure) that maximize the area under the receiver operating characteristic (ROC) curve (AUC) for the specific measure. We sampled all combinations for the values of the weights, in steps of 0.025 in the range [0, 1]. Evaluation of the measures was performed on the remaining 30% of the individuals (comprising the test set), based on odds ratio (OR) analysis, as described below. For the age measure alone there was no weight to learn, but the measure itself (i.e., T2D prevalence per age for each sex) was estimated on the training set and evaluated on the test set.
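The weight search described above can be sketched as an exhaustive grid search that keeps the weight combination with the highest training AUC. This is an illustrative sketch under stated assumptions: the names `auc` and `fit_crs_weights`, the argument names `gr`, `pr` and `ar` (per-individual genetic, phenotypic and age risk) and the toy data are hypothetical, and the AUC is computed from ranks in plain Python rather than with sklearn, which the authors report using.

```python
from itertools import product


def auc(scores, is_case):
    """ROC AUC via average ranks (ties count one half); needs both classes."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in order[i:j + 1]:
            ranks[k] = (i + j) / 2 + 1  # average 1-based rank of the tied block
        i = j + 1
    n_pos = sum(is_case)
    n_neg = len(is_case) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, is_case) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def fit_crs_weights(gr, pr, ar, is_case, step=0.025):
    """Try every (alpha, beta, gamma) on a grid in [0, 1]; keep the best AUC."""
    n_steps = round(1 / step)
    grid = [k / n_steps for k in range(n_steps + 1)]
    best_auc, best_weights = -1.0, None
    for a, b, c in product(grid, repeat=3):
        crs = [a * g + b * p + c * r for g, p, r in zip(gr, pr, ar)]
        current = auc(crs, is_case)
        if current > best_auc:
            best_auc, best_weights = current, (a, b, c)
    return best_weights, best_auc


# Toy training data (assumed values); a coarse step keeps the demo fast.
toy_gr = [0.02, 0.05, 0.03, 0.08, 0.04, 0.09]
toy_pr = [0.01, 0.02, 0.06, 0.03, 0.07, 0.08]
toy_ar = [0.03, 0.03, 0.04, 0.04, 0.05, 0.05]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_weights, toy_auc = fit_crs_weights(toy_gr, toy_pr, toy_ar, toy_labels, step=0.25)
```

Because the grid contains the single-component settings (1, 0, 0), (0, 1, 0) and (0, 0, 1), the combined score can never have a lower training AUC than any individual component; whether the gain carries over to the test set is what the evaluation in Section 2.4 addresses.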
2.4. Evaluation of the Results
We evaluated and compared the different measures (CRS, PRS, BMI, birth weight and age) by examining the resulting T2D OR. For each measure, we divided the participants in the test set into 100 equal-size bins (i.e., percentiles 0–99). We then calculated for each bin its OR. Formally, let $a_p$ be the number of individuals diagnosed with T2D among all individuals in the $p$-th percentile, and let $b_p$ be the number of individuals not diagnosed with T2D among all individuals in the $p$-th percentile. Similarly, let $c_p$ and $d_p$ be the number of individuals diagnosed with T2D among all individuals except those in the $p$-th percentile and the number of individuals not diagnosed with T2D among all individuals except those in the $p$-th percentile, respectively. The OR of percentile $p$ was then defined as:

$$\mathrm{OR}_p = \frac{a_p / b_p}{c_p / d_p}$$
To estimate the robustness of the results (e.g., calculating standard deviations for the OR), we repeated the procedure of randomly dividing the dataset into training and test sets 1000 times and evaluated the OR from the classification results of each repetition.
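A repeated random 70/30 split of this kind can be sketched as below. The helper name `repeated_split_summary` and the `evaluate` callback are hypothetical: the callback stands in for the full pipeline (learning bin risks and weights on the training indices, then computing, for example, the top-percentile OR on the test indices) and is assumed to return one number per repetition.

```python
import random
import statistics


def repeated_split_summary(n_samples, evaluate, n_repeats=1000,
                           train_frac=0.7, seed=0):
    """Return (mean, standard deviation) of `evaluate` over random splits."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    indices = list(range(n_samples))
    n_train = round(train_frac * n_samples)
    results = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        results.append(evaluate(train_idx, test_idx))
    return statistics.mean(results), statistics.stdev(results)


def toy_evaluate(train_idx, test_idx):
    """Placeholder metric (assumed): size of the test set."""
    return len(test_idx)


toy_mean, toy_sd = repeated_split_summary(10, toy_evaluate, n_repeats=5)
```

The per-repetition values collected this way are also what a paired test across repetitions, such as the Wilcoxon signed rank test reported in the tables, would take as input.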
The analysis presented here was performed in Python (version 2.7; using the packages pandas, numpy, sklearn and scipy for data curation and analysis) and in R (version 3.6.0; using the packages stats, tidyverse and cowplot for statistical analysis and plots).
3. Results
3.1. PRS and BMI
In the current study we used the UK Biobank (UKB) cohort [26], focusing on participants whose ethnic background was classified as White, where genotyping information and disease status for T2D were available (see Methods). As there are known sex differences in T2D prevalence and risk factors [27,28], we performed the analysis separately for males and females. Raw PRS (based on [14]), BMI and disease state (case/control) information was available for 290,584 participants, of whom 157,813 (54.31%) were females.
Figure 1A shows the relationship between raw PRS and BMI and T2D disease prevalence. As can be seen, both measures were strongly associated with disease prevalence in both sexes. T2D disease prevalence was higher in males as compared to females. The analysis also showed that raw BMI was a better predictor for the disease risk as compared to raw PRS. This was also demonstrated with respect to OR across the different percentiles (Figure 1B). For example, the OR in the 99th percentile was 8.62 vs. 2.87 and 6.79 vs. 2.84 for raw BMI vs. raw PRS in females and males, respectively. The receiver operating characteristic (ROC) curves also confirmed this. The area under the curve (AUC) of the raw BMI measure was larger than the AUC of the raw PRS measure in both sexes: 0.767 vs. 0.626 and 0.721 vs. 0.629 for raw BMI vs. PRS in females and males, respectively (Figure 1C). These results also indicate that the differences between the two measures were larger in females than in males, and that BMI is a better predictor in females than in males for identifying individuals at high risk to develop T2D.
Next, we examined whether combining PRS and BMI together can increase their prediction power. For that purpose, we defined a new composite risk score (CRS) which combines both the raw PRS and BMI measures, as well as age. For each of these measures (raw PRS, raw BMI, age) we estimated an individual’s risk based on disease prevalence of people with similar values (e.g., people in the same raw PRS percentile) and combined them into a composite score. The AUC of the combined score was significantly higher as compared to the other measures in both sexes (Wilcoxon signed rank test p-value < 10^-16; Supplementary Figure S1). Comparison of OR revealed that for both sexes, BMI exhibited better performance as compared to PRS, but CRS outperformed both measures across all percentiles (Figure 2). All measures (BMI, PRS and CRS) outperformed age alone.
Specifically, the average OR of the top percentile in males was 3.99, 7.84 and 9.38 for PRS, BMI and CRS, respectively. In females, the average OR of the top percentile was 3.94, 9.10 and 10.27 for PRS, BMI and CRS, respectively. Additional results of the top percentiles are summarized in Figure 2C–E and Table 1. Both PRS and BMI measures that included age achieved higher OR values than the raw PRS and BMI measures that did not include age (Figure 1B), demonstrating the importance of adding age into the predictive model.
Table 1.
Sex | Percentile | OR (BMI) a | OR (PRS) | OR (CRS) | p-Value b |
---|---|---|---|---|---|
Females | 90 | 2.92 ± 0.43 | 2.01 ± 0.36 | **3.03 ± 0.44** | 3.63 × 10^-13 |
Females | 95 | 4.44 ± 0.6 | 2.46 ± 0.41 | **4.59 ± 0.57** | <10^-16 |
Females | 97 | 5.54 ± 0.65 | 2.71 ± 0.44 | **6.22 ± 0.75** | <10^-16 |
Females | 99 | 9.10 ± 0.98 | 3.94 ± 0.48 | **10.27 ± 1.16** | <10^-16 |
Males | 90 | 2.48 ± 0.32 | 1.95 ± 0.29 | **3.00 ± 0.36** | <10^-16 |
Males | 95 | 4.21 ± 0.47 | 2.38 ± 0.31 | **4.30 ± 0.48** | 1.67 × 10^-12 |
Males | 97 | 4.69 ± 0.49 | 2.90 ± 0.36 | **5.44 ± 0.57** | <10^-16 |
Males | 99 | 7.84 ± 0.76 | 3.99 ± 0.42 | **9.38 ± 0.91** | <10^-16 |
a Results include the standard deviation of each measure. Measures with the highest OR for each percentile are bolded. b p-value refers to Wilcoxon signed rank test for comparing the OR distributions of the two measures that achieved the average highest OR across 1000 test sets, in each percentile.
These results also demonstrate sex differences with respect to the predictive power of BMI, and therefore of CRS: higher OR values were achieved for females, in accordance with the results reported for the raw measures (Figure 1).
3.2. PRS and Birth Weight
After evaluating BMI, we turned to another physical measure associated with T2D: birth weight. We studied a cohort of 172,239 participants, 105,438 (61.21%) of whom were females, for whom birth weight values, PRS and T2D disease state information were available. Similar to the analysis performed for BMI, we analyzed the association between disease risk and raw birth weight, for males and females separately (Figure 3).
Lower birth weight was associated with higher disease prevalence in both males and females, in accordance with previous studies [24]. High birth weight (mainly in the top percentiles) was also associated with higher T2D risk in both sexes, but to a lesser extent.
Next, and similar to the analysis for BMI, we defined a combined score that reflects both the risk associated with birth weight and PRS, based on disease prevalence for different PRS and birth weight percentiles, while also accounting for age. The predictive power (AUC) of the combined score was significantly higher than the individual measures in both sexes (Wilcoxon signed rank test p-value < 10^-16; Supplementary Figure S1). CRS also achieved higher OR values in the top percentiles (Table 2). Specifically, in females it achieved an average OR of 4.64 for the top percentile, compared to 3.81 and 3.62 for PRS and birth weight, respectively. In males the OR values were even higher: 4.83 vs. 4.54 and 3.08 for PRS and birth weight, respectively. For detailed trends across the full measurement range (in percentiles) see Supplementary Figure S2.
Table 2.
Sex | Percentile | OR (Birth Weight) a | OR (PRS) | OR (CRS) | p-Value b |
---|---|---|---|---|---|
Females | 90 | 1.84 ± 0.42 | 1.94 ± 0.47 | **2.00 ± 0.45** | 2.25 × 10^-6 |
Females | 95 | 1.99 ± 0.44 | 2.55 ± 0.53 | **2.59 ± 0.52** | 0.014 |
Females | 97 | 2.26 ± 0.48 | 2.78 ± 0.54 | **3.11 ± 0.60** | <10^-16 |
Females | 99 | 3.62 ± 0.57 | 3.81 ± 0.63 | **4.64 ± 0.67** | <10^-16 |
Males | 90 | 1.67 ± 0.38 | **1.99 ± 0.44** | 1.97 ± 0.40 | 0.11 |
Males | 95 | 1.83 ± 0.40 | **2.56 ± 0.50** | 2.51 ± 0.50 | 3.39 × 10^-4 |
Males | 97 | 1.94 ± 0.39 | 2.59 ± 0.47 | **2.81 ± 0.51** | <10^-16 |
Males | 99 | 3.08 ± 0.51 | 4.54 ± 0.73 | **4.83 ± 0.72** | <10^-16 |
a Results include the standard deviation of each measure. Measures with the highest OR for each percentile are bolded. b p-value refers to Wilcoxon signed rank test for comparing the OR distributions of the two measures that achieved the average highest OR across 1000 test sets, in each percentile.
While BMI was more predictive of T2D risk than birth weight, the latter also significantly improved prediction power (as part of the combined score) over PRS. Comparing males and females, we observed that males had higher OR values in the higher percentiles, for both PRS and CRS measures (but not for birth weight).
3.3. PRS and Body Size at Age Ten
Studies have shown that childhood obesity increases the risk for adult T2D and coronary artery disease (CAD) [29,30]. Information on childhood BMI was not available for UKB participants but a related childhood measure of a comparative body size at age ten was available for 287,203 participants, among them 156,307 (54.42%) were females. While this measure is subjective and retrospective, and included only three predetermined categorical values (thinner, about average and plumper), it was still associated with T2D risk in adulthood (Figure 4).
People who had described themselves as being plumper at age ten were at higher risk of developing T2D in adulthood compared to people reporting average weight at that age. Similarly, but to a lesser extent, people who described themselves as being thinner at age ten were also at higher risk of developing T2D later in life as compared to people reporting average weight at that age. This was observed in both sexes. These differences in T2D prevalence between the three groups were highly significant (Chi square test p-value < 10^-16).
Next, we defined a combined score that considers PRS, comparative body size at age ten and age. Even with this subjective and simplistic categorical measure, the CRS significantly outperformed PRS with respect to AUC (Wilcoxon signed rank test p-value < 10^-16; Supplementary Figure S1) and OR (Figure 5 and Table 3). Results for males and females were very similar, with slightly higher OR values in males (for both PRS and CRS). Specifically, the average OR in the top CRS percentile was 4.18 vs. 3.83 for PRS in females and 4.24 vs. 3.98 in males.
Table 3.
Sex | Percentile | OR (PRS) a | OR (CRS) | p-Value b |
---|---|---|---|---|
Females | 90 | 1.97 ± 0.35 | **2.02 ± 0.36** | 7.71 × 10^-5 |
Females | 95 | **2.43 ± 0.41** | 2.40 ± 0.38 | 0.031 |
Females | 97 | 2.75 ± 0.43 | **3.00 ± 0.45** | <10^-16 |
Females | 99 | 3.83 ± 0.46 | **4.18 ± 0.49** | <10^-16 |
Males | 90 | 1.96 ± 0.28 | **2.06 ± 0.29** | <10^-16 |
Males | 95 | 2.38 ± 0.32 | **2.48 ± 0.33** | <10^-16 |
Males | 97 | 2.93 ± 0.35 | **3.07 ± 0.37** | <10^-16 |
Males | 99 | 3.98 ± 0.42 | **4.24 ± 0.42** | <10^-16 |
a Results include the standard deviation of each measure. Measures with the highest OR for each percentile are bolded. b p-value refers to Wilcoxon signed rank test for comparing the OR distributions of the two measures that achieved the average highest OR across 1000 test sets, in each percentile.
4. Discussion
In recent years, PRS has attracted increasing attention as a potential tool to estimate disease risk for common conditions and diseases based on the genetics of individuals [1,12]. In the current work we enhanced PRS prediction potential by integrating the raw genetic signal with available physical measures that capture non-genetic (environmental) components of human diseases, focusing on T2D. First, we integrated information on BMI into the PRS model, as high BMI is a well-known risk factor for T2D [18,19]. We found that while both PRS and BMI can highlight individuals with higher risk to develop T2D, a combined approach was superior to each of the measures alone, for both males and females, demonstrating the added value in such an approach.
Recently, several studies used integrated approaches for disease risk estimation by adding PRS information to standard clinical predictors. Conceptually, these studies applied the combined approach from both sides of its components: either to augment standard disease risk predictors with PRS or to augment PRS with disease risk predictors. Studies that focused on coronary artery disease (CAD) showed no [31] or little [32] improvement when adding PRS to clinically accepted risk predictors. These results raised again the question and the ongoing debate regarding the clinical utility of PRS [1,33,34].
A different study on CAD did find significant improvement by adding PRS to the routinely used risk predictors [35]. Another study on CAD, T2D, atrial fibrillation, breast and prostate cancer found that PRS improved the prediction power of such predictors [36]. Similarly, augmenting PRS with additional information such as BMI, and lab results such as HDL and LDL measures improved prediction power for T2D [37]. Similarly, augmenting PRS by traditional measures for cardiovascular disease risk modestly enhanced its prediction power [38]. In addition, a recent study added mortality risk factors to disease PRS to mark individuals with higher mortality risk [39].
Importantly, these studies used measures collected at adulthood, while PRS values can be calculated earlier in life to indicate individuals at risk. Indeed, measurements that are taken at adulthood are likely to have stronger prediction power, as more relevant information on the disease and its risk predictors is revealed. However, interventions at the adult stage may be less effective, as some of the biological processes leading to diseases may have already started. Naturally, a composite score that includes adult BMI measures also suffers from this limitation. Therefore, we examined whether augmenting PRS with early-life measures can increase their predictive utility. While genetic risk itself cannot be modified, additional risk factors that impact long-term health outcomes and are obtained in early life can be addressed through routine healthcare policy. In our study we used two such early-life measures that were available for many UKB participants: birth weight and three categories of body size at age ten. Similar to previous studies, we found an association between low birth weight and high T2D prevalence, with a stronger association in females. This is in accordance with the developmental origins theory, which suggests that low birth weight reflects undernutrition in utero that can lead to permanent changes in body functions, posing higher risk for certain metabolic diseases [40]. A weaker association was also found for high birth weight. Importantly, the number of UKB participants that were included in this analysis was relatively large as compared to many previous studies analyzing the relationship between birth weight and T2D [22]. A combined approach that included birth weight and PRS improved the prediction power of each of its components. We note that the birth weight used in this study is based on self-reporting (and not on medical records) and may be less accurate. A more accurate measure of birth weight is likely to further improve the results.
Turning to comparative body size at age ten, we found that adding this information to PRS improved its prediction power as well. Indeed, BMI had a better predictive power as compared to these early-life measures. However, these measures may only partially reveal the component they intend to reflect. Specifically, the body size categories at age ten measure is retrospective, subjective and included only three categories. Therefore, the labels for the body size at age ten only roughly estimated the actual body size at that age. Despite these limitations, early-life measures significantly improved PRS prediction power. We anticipate that more accurate and relevant measures such as childhood BMI or other relevant measures (that are routinely collected at the clinic), as well as their trajectories (across different ages), will further improve disease risk estimation and may inform early intervention.
This work also introduces a revised approach for integrating age and sex into a predictive risk model. Traditionally, the sex of an individual is considered a covariate that is controlled for when learning raw PRS weights [41]. Therefore, when these weights are used, the resulting PRS is no longer affected by sex, and an individual’s PRS is determined solely based on their genetic background, regardless of their sex. In practice, as in other diseases, there are substantial sex differences in T2D prevalence and pathophysiology [27,28]. In this work we addressed this issue by performing the analysis for each sex separately. Therefore, two people with the same raw PRS value but different sex may be given a completely different risk score. Indeed, we observed differences between the sexes. First, T2D prevalence was much higher in males as compared to females. In addition, T2D risk in the top percentiles for the PRS measure was slightly higher in males. This may perhaps explain why T2D risk in the top percentiles for the CRS measure (which is partially based on the PRS measure) was also higher in males when PRS was integrated with birth weight and comparative body size at age ten. However, when PRS was combined with BMI, the CRS measure achieved higher OR scores in females. This is likely because BMI, which outperforms PRS in its prediction power, is a better predictor in females for highlighting individuals at higher risk for T2D [42], perhaps due to sex differences in fat metabolism and storage [43].
Similar to sex, age is also often considered a covariate that is controlled for when learning PRS weights. The inferred PRS of an individual is constant and does not change with age. However, similar to other diseases, T2D prevalence increases with age [44]. Here we addressed the role of age as a principal risk factor by adding it into the predictive model. As a result, our score reflects an individual’s risk to develop T2D around their age, and it changes throughout life, resulting in risk score trajectories.
We designed our combined risk score to be simple and easy for application and generalization. Thus, the PRS measure was based on raw PRS weights that had been calculated in a previous work [14]. While we focused on T2D, such summary statistics are available for numerous other diseases and traits (e.g., the Polygenic Score Catalog, [45]). Therefore, with additional relevant phenotypes and measures (based on the nature of the disease), our approach can also be applied to other complex diseases. In addition, we converted each of the measures used in the study into disease prevalence measures (based on the average disease prevalence in people with similar values of that measure). This conversion allowed us to easily integrate measures whose relationship with disease prevalence is not monotonic (e.g., birth weight and comparative body size at age ten), and to integrate measures of different scales without explicit normalization. We integrated the different measures through a simple linear model. Taken together, this method can be applied relatively easily to various diseases, using various relevant measurements.
Even with this simplified approach, we achieved significant improvements that highlighted the importance of an integrated approach to estimate disease risk. Future works can further improve this through complementary ways to calculate and integrate risk factors. Below we briefly outline some suggestions for such improvements, mainly in the integration of sex and age into the model. First, our sex-specific approach was applied after the calculation of the raw PRS values, which can also be calculated for each sex alone. Indeed, several recent works used sex-specific PRS values because of the putative role of sex in many diseases and mortality [39,46]. Second, for simplicity of the combined approach, age was taken as an independent measure with a constant effect. However, the role of some T2D risk factors changes throughout life [47]. Specifically, the weight of the genetic component of T2D varies across different ages of onset [48], and this can lead to differential power of PRS to predict disease risk across different age groups, as was demonstrated in other diseases [49,50]. Hence, the integration of age into the model can be done in more sophisticated ways (e.g., nonlinear), reflecting the apparent different weights of each component at different ages.
The PRS included in this work was trained solely based on T2D information, but a PRS that leverages the genetic contributions of additional traits that may be correlated to it can increase its utility [51] and future works can apply such an approach. In addition, T2D is highly heterogenous and can be further classified into different subgroups based on various features, where the subgroups vary in their clinical outcome [52]. In this work we analyzed T2D as a single disease, but future works can examine different models for different T2D subgroups.
Lastly, the PRS used in our study were calculated for Caucasians, the largest ethnic population in the UKB and therefore our analysis also focused on that population. Studies have shown that PRS calculated for one population have reduced prediction power on other populations [53,54]. We hope that future studies will apply our methodology on additional populations such that a composite score and therefore a better early intervention will be available for these populations as well.
In summary, we demonstrated the benefit of adding measures to enhance PRS prediction. Specifically, we integrated PRS with early-life measures to pave the way for early intervention. We hope this will encourage future work on the integration of PRS with additional measures to provide more accurate clinical risk estimates for T2D and other complex diseases.
Acknowledgments
We thank Ido Margaliot for useful discussions. We thank the Center for Interdisciplinary Data Science (CIDR) and the CSE system team for support in data storage. We thank the four anonymous reviewers of this manuscript for their comments.
Supplementary Materials
The following are available online at https://www.mdpi.com/article/10.3390/jpm11060582/s1. Figure S1: Comparison of AUC values in the test sets for the different measures. Figure S2: Odds ratio (OR) for T2D, based on birth weight, PRS, CRS or age percentiles.
Author Contributions
Conceptualization, Y.Y.W., N.B. and M.L.; methodology, A.M., Y.Y.W., N.B. and M.L.; software, A.M. and Y.Y.W.; validation, A.M.; formal analysis, A.M., Y.Y.W., N.B. and M.L.; investigation, A.M. and Y.Y.W.; resources, N.B. and M.L.; writing—original draft preparation, A.M., Y.Y.W., N.B. and M.L.; writing—review and editing, A.M., Y.Y.W., N.B. and M.L.; visualization, A.M. and Y.Y.W.; supervision, Y.Y.W. and M.L.; project administration, Y.Y.W.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.
Funding
This study was supported by ISF grant number 2753/20 (to M.L.).
Institutional Review Board Statement
The study was conducted according to the guidelines of the application of the UK-Biobank (application ID 26664) and approved by the Institutional Review Board (or Ethics Committee) of The Hebrew University of Jerusalem (#13082019, Date of approval August 2019).
Informed Consent Statement
This study relied on the UK Biobank’s participant consent (as of 2018), which is compliant with all relevant legislation.
Data Availability Statement
Not applicable.
Conflicts of Interest
A.M. and Y.Y.W. are employees of NRGene Ltd. The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Torkamani A., Wineinger N.E., Topol E.J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 2018;19:581–590. doi: 10.1038/s41576-018-0018-x.
- 2. Hirschhorn J.N., Daly M.J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 2005;6:95–108. doi: 10.1038/nrg1521.
- 3. Lander E.S. Initial impact of the sequencing of the human genome. Nature. 2011;470:187–197. doi: 10.1038/nature09792.
- 4. Bush W.S., Moore J.H. Chapter 11, Genome-Wide Association Studies. PLoS Comput. Biol. 2012;8:e1002822. doi: 10.1371/journal.pcbi.1002822.
- 5. Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., McCarthy M.I., Ramos E.M., Cardon L.R., Chakravarti A., et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
- 6. Eichler E.E., Flint J., Gibson G., Kong A., Leal S.M., Moore J.H., Nadeau J.H. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 2010;11:446–450. doi: 10.1038/nrg2809.
- 7. Zuk O., Hechter E., Sunyaev S.R., Lander E.S. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. USA. 2012;109:1193–1198. doi: 10.1073/pnas.1119675109.
- 8. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W., et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
- 9. Boyle E.A., Li Y.I., Pritchard J.K. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell. 2017;169:1177–1186. doi: 10.1016/j.cell.2017.05.038.
- 10. Chatterjee N., Shi J., García-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 2016;17:392–406. doi: 10.1038/nrg.2016.27.
- 11. Inouye M., Abraham G., Nelson C.P., Wood A.M., Sweeting M.J., Dudbridge F., Lai F.Y., Kaptoge S., Brozynska M., Wang T., et al. Genomic Risk Prediction of Coronary Artery Disease in 480,000 Adults: Implications for Primary Prevention. J. Am. Coll. Cardiol. 2018;72:1883–1893. doi: 10.1016/j.jacc.2018.07.079.
- 12. Lambert S.A., Abraham G., Inouye M. Towards clinical utility of polygenic risk scores. Hum. Mol. Genet. 2019;28:R133–R142. doi: 10.1093/hmg/ddz187.
- 13. Lewis C.M., Vassos E. Polygenic risk scores: From research tools to clinical instruments. Genome Med. 2020;12:1–11. doi: 10.1186/s13073-020-00742-5.
- 14. Khera A.V., Chaffin M., Aragam K.G., Haas M.E., Roselli C., Choi S.H., Natarajan P., Lander E.S., Lubitz S.A., Ellinor P.T., et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 2018;50:1219–1224. doi: 10.1038/s41588-018-0183-z.
- 15. Fahed A.C., Wang M., Homburger J.R., Patel A.P., Bick A.G., Neben C.L., Lai C., Brockman D., Philippakis A., Ellinor P.T., et al. Polygenic background modifies penetrance of monogenic variants for tier 1 genomic conditions. Nat. Commun. 2020;11:1–9. doi: 10.1038/s41467-020-17374-3.
- 16. Wang W.Y.S., Barratt B.J., Clayton D.G., Todd J.A. Genome-wide association studies: Theoretical and practical concerns. Nat. Rev. Genet. 2005;6:109–118. doi: 10.1038/nrg1522.
- 17. Khera A.V., Chaffin M., Wade K.H., Zahid S., Brancale J., Xia R., Distefano M., Senol-Cosar O., Haas M.E., Bick A., et al. Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood. Cell. 2019;177:587–596. doi: 10.1016/j.cell.2019.03.028.
- 18. Chan J.M., Rimm E.B., Colditz G.A., Stampfer M.J., Willett W.C. Obesity, fat distribution, and weight gain as risk factors for clinical diabetes in men. Diabetes Care. 1994;17:961–969. doi: 10.2337/diacare.17.9.961.
- 19. Tirosh A., Shai I., Afek A., Dubnov-Raz G., Ayalon N., Gordon B., Derazne E., Tzur D., Shamis A., Vinker S., et al. Adolescent BMI Trajectory and Risk of Diabetes versus Coronary Disease. N. Engl. J. Med. 2011;364:1315–1325. doi: 10.1056/NEJMoa1006992.
- 20. Warrington N.M., Beaumont R.N., Horikoshi M., Day F.R., Helgeland Ø., Laurin C., Bacelis J., Peng S., Hao K., Feenstra B., et al. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nat. Genet. 2019;51:804–814. doi: 10.1038/s41588-019-0403-1.
- 21. Whincup P.H., Kaye S.J., Owen C.G., Huxley R., Cook D.G., Anazawa S., Barrett-Connor E., Bhargava S.K., Birgisdottir B.E., Carlsson S., et al. Birth weight and risk of type 2 diabetes: A systematic review. JAMA J. Am. Med. Assoc. 2008;300:2886–2897. doi: 10.1001/jama.2008.886.
- 22. Zhao H., Song A., Zhang Y., Zhen Y., Song G., Ma H. The association between birth weight and the risk of type 2 diabetes mellitus: A systematic review and meta-analysis. Endocr. J. 2018;65:EJ18-0072. doi: 10.1507/endocrj.EJ18-0072.
- 23. Knop M.R., Geng T.T., Gorny A.W., Ding R., Li C., Ley S.H., Huang T. Birth weight and risk of type 2 diabetes mellitus, cardiovascular disease, and hypertension in adults: A meta-analysis of 7 646 267 participants from 135 studies. J. Am. Heart Assoc. 2018;7:e008870. doi: 10.1161/JAHA.118.008870.
- 24. Mi D., Fang H., Zhao Y., Zhong L. Birth weight and type 2 diabetes: A meta-analysis. Exp. Ther. Med. 2017;14:5313–5320. doi: 10.3892/etm.2017.5234.
- 25. Zimmermann E., Gamborg M., Sørensen T.I.A., Baker J.L. Sex differences in the association between birth weight and adult type 2 diabetes. Diabetes. 2015;64:4220–4225. doi: 10.2337/db15-0494.
- 26. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 27. Kautzky-Willer A., Harreiter J., Pacini G. Sex and gender differences in risk, pathophysiology and complications of type 2 diabetes mellitus. Endocr. Rev. 2016;37:278–316. doi: 10.1210/er.2015-1137.
- 28. Huebschmann A.G., Huxley R.R., Kohrt W.M., Zeitler P., Regensteiner J.G., Reusch J.E.B. Sex differences in the burden of type 2 diabetes and cardiovascular risk across the life course. Diabetologia. 2019;62:1761–1772. doi: 10.1007/s00125-019-4939-5.
- 29. Geng T., Smith C.E., Li C., Huang T. Childhood BMI and Adult Type 2 Diabetes, Coronary Artery Diseases, Chronic Kidney Disease, and Cardiometabolic Traits: A Mendelian Randomization Analysis. Diabetes Care. 2018;41:1089–1096. doi: 10.2337/dc17-2141.
- 30. Dong S.S., Zhang K., Guo Y., Ding J.M., Rong Y., Feng J.C., Yao S., Hao R.H., Jiang F., Chen J.B., et al. Phenome-wide investigation of the causal associations between childhood BMI and adult trait outcomes: A two-sample Mendelian randomization study. Genome Med. 2021;13:1–17. doi: 10.1186/s13073-021-00865-3.
- 31. Mosley J.D., Gupta D.K., Tan J., Yao J., Wells Q.S., Shaffer C.M., Kundu S., Robinson-Cohen C., Psaty B.M., Rich S.S., et al. Predictive Accuracy of a Polygenic Risk Score Compared with a Clinical Risk Score for Incident Coronary Heart Disease. JAMA J. Am. Med. Assoc. 2020;323:627–635. doi: 10.1001/jama.2019.21782.
- 32. Elliott J., Bodinier B., Bond T.A., Chadeau-Hyam M., Evangelou E., Moons K.G.M., Dehghan A., Muller D.C., Elliott P., Tzoulaki I. Predictive Accuracy of a Polygenic Risk Score-Enhanced Prediction Model vs a Clinical Risk Score for Coronary Artery Disease. JAMA J. Am. Med. Assoc. 2020;323:636–645. doi: 10.1001/jama.2019.22241.
- 33. Khan S.S., Cooper R., Greenland P. Do Polygenic Risk Scores Improve Patient Selection for Prevention of Coronary Artery Disease? JAMA J. Am. Med. Assoc. 2020;323:614–615. doi: 10.1001/jama.2019.21667.
- 34. Wald N.J., Old R. The illusion of polygenic disease risk prediction. Genet. Med. 2019;21:1705–1707. doi: 10.1038/s41436-018-0418-5.
- 35. Riveros-Mckay F., Weale M.E., Moore R., Selzam S., Krapohl E., Sivley R.M., Tarran W.A., Sørensen P., Lachapelle A.S., Griffiths J.A., et al. Integrated Polygenic Tool Substantially Enhances Coronary Artery Disease Prediction. Circ. Genom. Precis. Med. 2021;14:e003304. doi: 10.1161/CIRCGEN.120.003304.
- 36. Mars N., Koskela J.T., Ripatti P., Kiiskinen T.T.J., Havulinna A.S., Lindbohm J.V., Ahola-Olli A., Kurki M., Karjalainen J., Palta P., et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat. Med. 2020;26:549–557. doi: 10.1038/s41591-020-0800-0.
- 37. Liu W., Zhuang Z., Wang W., Huang T., Liu Z. An Improved Genome-Wide Polygenic Score Model for Predicting the Risk of Type 2 Diabetes. Front. Genet. 2021;12:632385. doi: 10.3389/fgene.2021.632385.
- 38. Sun L., Pennells L., Kaptoge S., Nelson C.P., Ritchie S.C., Abraham G., Arnold M., Bell S., Bolton T., Burgess S., et al. Polygenic risk scores in cardiovascular risk prediction: A cohort study and modelling analyses. PLoS Med. 2021;18:e1003498. doi: 10.1371/journal.pmed.1003498.
- 39. Meisner A., Kundu P., Zhang Y.D., Lan L.V., Kim S., Ghandwani D., Choudhury P.P., Berndt S.I., Freedman N.D., Garcia-Closas M., et al. Combined utility of 25 disease and risk factor polygenic risk scores for stratifying risk of all-cause mortality. medRxiv. 2020;107:418–431. doi: 10.1016/j.ajhg.2020.07.002.
- 40. Barker D.J.P. The origins of the developmental origins theory. Wiley Online Libr. 2007;261:412–417. doi: 10.1111/j.1365-2796.2007.01809.x.
- 41. Choi S.W., Mak T.S.H., O’Reilly P.F. Tutorial: A guide to performing polygenic risk score analyses. Nat. Protoc. 2020;15:2759–2772. doi: 10.1038/s41596-020-0353-1.
- 42. Censin J.C., Peters S.A.E., Bovijn J., Ferreira T., Pulit S.L., Mägi R., Mahajan A., Holmes M.V., Lindgren C.M. Causal relationships between obesity and the leading causes of death in women and men. PLoS Genet. 2019;15:e1008405. doi: 10.1371/journal.pgen.1008405.
- 43. Power M.L., Schulkin J. Sex differences in fat storage, fat metabolism, and the health risks from obesity: Possible evolutionary origins. Br. J. Nutr. 2008;99:931–940. doi: 10.1017/S0007114507853347.
- 44. Halim M., Halim A. The effects of inflammation, aging and oxidative stress on the pathogenesis of diabetes mellitus (type 2 diabetes). Diabetes Metab. Syndr. Clin. Res. Rev. 2019;13:1165–1172. doi: 10.1016/j.dsx.2019.01.040.
- 45. Lambert S.A., Gil L., Jupp S., Ritchie S.C., Xu Y., Buniello A., McMahon A., Abraham G., Chapman M., Parkinson H., et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat. Genet. 2021;53:420–425. doi: 10.1038/s41588-021-00783-5.
- 46. Fan C.C., Banks S.J., Thompson W.K., Chen C.H., McEvoy L.K., Tan C.H., Kukull W., Bennett D.A., Farrer L.A., Mayeux R., et al. Sex-dependent polygenic effects on the clinical progressions of Alzheimer’s disease. bioRxiv. 2019:613893. doi: 10.1093/brain/awaa164.
- 47. Alva M.L., Hoerger T.J., Zhang P., Gregg E.W. Identifying risk for type 2 diabetes in different age cohorts: Does one size fit all? BMJ Open Diabetes Res. Care. 2017;5:e000447. doi: 10.1136/bmjdrc-2017-000447.
- 48. Padilla-Martínez F., Collin F., Kwasniewski M., Kretowski A. Systematic review of polygenic risk scores for type 1 and type 2 diabetes. Int. J. Mol. Sci. 2020;21:1703. doi: 10.3390/ijms21051703.
- 49. Thomas M., Sakoda L.C., Hoffmeister M., Rosenthal E.A., Lee J.K., van Duijnhoven F.J.B., Platz E.A., Wu A.H., Dampier C.H., de la Chapelle A., et al. Response to Li and Hopper. Am. J. Hum. Genet. 2021;108:527–529. doi: 10.1016/j.ajhg.2021.02.003.
- 50. Li S., Hopper J.L. Age dependency of the polygenic risk score for colorectal cancer. Am. J. Hum. Genet. 2021;108:525–526. doi: 10.1016/j.ajhg.2021.02.002.
- 51. Maier R.M., Zhu Z., Lee S.H., Trzaskowski M., Ruderfer D.M., Stahl E.A., Ripke S., Wray N.R., Yang J., Visscher P.M., et al. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat. Commun. 2018;9:1–7. doi: 10.1038/s41467-017-02769-6.
- 52. Ahlqvist E., Storm P., Käräjämäki A., Martinell M., Dorkhan M., Carlsson A., Vikman P., Prasad R.B., Aly D.M., Almgren P., et al. Novel subgroups of adult-onset diabetes and their association with outcomes: A data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018;6:361–369. doi: 10.1016/S2213-8587(18)30051-2.
- 53. Duncan L., Shen H., Gelaye B., Meijsen J., Ressler K., Feldman M., Peterson R., Domingue B. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 2019;10:1–9. doi: 10.1038/s41467-019-11112-0.
- 54. De La Vega F.M., Bustamante C.D. Polygenic risk scores: A biased prediction? Genome Med. 2018;10:1–3. doi: 10.1186/s13073-018-0610-x.