Summary
We develop a closed-form Haseman-Elston estimator for genetic and environmental correlation coefficients between complex phenotypes, which we term HEc, that is as precise as GCTA yet ∼20× faster. We estimate genetic and environmental correlations between over 7,000 phenotype pairs in subgroups from the Trans-Omics in Precision Medicine (TOPMed) program. We demonstrate substantial differences in both heritabilities and genetic correlations for multiple phenotypes and phenotype pairs between individuals of self-reported Black, Hispanic/Latino, and White backgrounds. We similarly observe differences in many of the genetic and environmental correlations between genders. To estimate the contribution of genetics to the observed phenotypic correlation, we introduce “fractional genetic correlation” as the fraction of phenotypic correlation explained by genetics. Finally, we quantify the enrichment of correlations between phenotypic domains, each of which is comprised of multiple phenotypes. Altogether, we demonstrate that the observed correlations between complex human phenotypes depend on the genetic background of the individuals, their gender, and their environment.
Keywords: genetic correlation, household correlation, genetic architecture, heritability, multi-ethnic, admixed population, genetic background, Haseman-Elston regression, Trans-Omics in Precision Medicine, Hispanic Community Health Study/Study of Latinos
Graphical abstract
Highlights
-
•
An estimator of genetic correlation generalizes the Pearson correlation estimator
-
•
Fractional genetic correlation quantifies genetic fraction of a phenotypic correlation
-
•
Genetic correlations between phenotypes vary by genetic and environmental determinants
Elgart et al. develop a closed-form estimator for genetic correlation between phenotypes called HEc, generalizing the Pearson estimator. Fractional correlation estimates that quantify contributions of genetics and environment to the overall phenotypic correlation demonstrate that both genetics and environment contribute to phenotypic correlation differences between groups of individuals.
Introduction
Both genetics and environment determine human phenotypes and the correlations between them.1,2 Correlations can arise due to multiple forms of causal relationships including common genetic and environmental determinants, (bi)directional causal associations and others. The correlations between phenotypes can reveal genetic architecture, help uncover gene functions and disease mechanisms, improve diagnosis, and aid in therapeutic interventions.3 Given appropriate data, phenotypic correlations can be decomposed into genetic and environmental components by estimating corresponding measures, such as genetic correlation.4
Several studies have leveraged the data generated by large studies with genotyped individuals and multiple measured phenotypes (e.g., BioBank Japan5 and UK Biobank6), to estimate genetic correlations between various phenotypes.7,8,9,10 Dozens of pairwise correlations between phenotypes were estimated and reported, mostly based on studies with participants of European or East Asian ancestries9,10 and of mixed genders. Genetic ancestry, as well as biological sex and sociocultural roles of gender, all contribute to differences in phenotypic distributions between different groups of people due to both underlying genetics and different environmental exposures.11,12,13 The genetic background manifests in allele frequencies, effect sizes, and, more generally, genetic architectures.14,15 Sociocultural measures, captured in part by self-reported race/ethnicity, are related to behavioral, environmental, and psychosocial exposures such as smoking, alcohol, nutrition, physical activity, and stress,16,17 modifying the effect of genetic variants. In this work, we refer to the collective characteristics of a studied race and ethnicity-based group as a “background” to reflect the fact that no single genetic or social measure can be used to define the group, although its individuals may self-identify using a specific, pre-defined label, whether by choice or by the set of options presented to them. Notably, these background groups are enriched with patterns of both genetics and sociodemographic (and environmental and cultural) similarity, and all factors may ultimately affect the expression of genetic effects. However, we cannot, using existing data, separate these different influences. While a handful of studies reported differences in genetic correlation across different background groups,18,19,20 these are limited in the number of studied phenotypes and backgrounds. Similarly, while biological sex-specific heritabilities of complex phenotypes have been reported previously,21,22,23 gender-specific correlations between phenotypes have not been comprehensively studied. Genetic correlations between phenotypes can be leveraged for prediction of health conditions24 and differences between these correlations across groups may affect the application of prediction models. An environmental correlation between two phenotypes may suggest the possibility of a lifestyle change to alleviate the burden of disease related to these phenotypes.
The two main computational approaches that are used to estimate genetic correlation are the genetic restricted maximum likelihood analysis (GREML),25,26,27 which requires individual-level genotypes and is computationally challenging when analyzing datasets with thousands of individuals; and the linkage disequilibrium score regression (LDSC),28 which uses genome-wide association study (GWAS) summary statistics. LDSC requires reliable GWASs, and can be inaccurate when there is genetic heterogeneity between the target sample and reference linkage disequilibrium (LD) panel,29,30 and thus cannot be used reliably for admixed or multi-ancestry analyses. Thus, both GREML and LDSC approaches are limited when analyzing datasets that include tens of thousands of genetically diverse individuals.
Here, we estimated the heritabilities, as well as genetic and environmental correlations between phenotypes in joint as well as background- and gender-stratified analysis. First, we derived a closed-form solution for the estimation of genetic and environmental correlation coefficients within the Haseman-Elston regression framework, which we termed HEc (Haseman-Elston closed form). Second, we applied the algorithm to study heritabilities and genetic correlations for 28 phenotypes in the Trans-Omics in Precision Medicine (TOPMed) program31,32 dataset, with a large representation of individuals of White, Black, and Hispanic/Latino backgrounds. We then focused on Hispanics/Latinos from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) cohort,33 and utilized available data on shared household (representing shared environmental exposure) and compared gender-specific genetic and environmental correlations across a larger panel of 61 phenotypes. Finally, we performed domain-level enrichment analysis to identify genetic and environmental correlations between phenotypic domains.
Results
HEc is a fast and precise estimator for genetic correlation
We developed an estimator, HEc, for genetic correlation within the Haseman-Elston regression framework (see STAR Methods for detailed derivation). The HEc estimator generalizes the Pearson correlation coefficient, and it has the closed-form formula:
where are the residuals of traits after regression on covariates, and is a matrix that depends on the genetic relationship between individuals in the dataset and potentially other matrices used to model correlation between individuals (e.g., matrices modeling shared environmental influences). The reader may see that it generalizes the standard Pearson estimator since replacing the matrix by the identity matrix and by assuming that the residuals are those obtained by regressing the traits on intercepts (i.e., subtracting their mean), reduces the estimate to the Pearson correlation coefficient.
We evaluated the performance of HEc in simulations across multiple parameters and background groups and compared it with the GREML algorithm4 implemented in the GCTA (Genome-wide Complex Trait Analysis) software32 (Figures 1A and S2–S5).
Figure S1 visualizes TOPMed and HCHS/SOL individuals in the space generated by the first two genetic principal components (PCs). One can see that the White, Black, and Hispanic/Latino background groups are only partly separated in PC space, and individuals from Hispanic/Latino background, who are admixed with high proportions of European, Amerindian, and, to lower extent, African genetic ancestries, are represented throughout the PC space. In Figure S1B that focuses on HCHS/SOL, we also highlight finer self-reported Hispanic/Latino backgrounds, including Mexican, Puerto Rican, etc., which show some clustering, indicating some genetic structure within Hispanic/Latino individuals (as is known). This figure demonstrates that individuals within the background groups are on average more similar to each other in their genetic patterns compared with individuals in other background groups. However, the separation in the figure is far from perfect, as background groups capture self-identification, which does not precisely correlate with genetic patterns. Self-identified background also captures shared culture and systematic environmental exposures, which modify genetic effects, and which motivated us to use these grouping rather than ones based on genetic ancestry.
To simulate realistic data, we used kinship matrices from participants of the different self-reported backgrounds (Black, Hispanic/Latino, White) and a joint group of same size comprised of people from each group (n = 7,706, which is the also the size of the smallest subgroup from the TOPMed dataset). We evaluated the time required for each algorithm as a function of the number of individuals (Figure 1B). In all cases, HEc matched or exceeded the accuracy of GCTA while improving the speed up to 20-fold and beyond (Figure 1B).
HEc outperforms existing approaches on complex real-world data
We selected eight phenotypes from the TOPMed cohort (height, BMI, diastolic blood pressure, systolic blood pressure, total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides) to compare the performance of HEc with existing algorithms (GCTA and LDSC; Figures 1C and 1D). We estimated the heritabilities of the eight phenotypes using GCTA and LDSC and estimated genetic correlations between all phenotype pairs using HEc and GCTA. Estimates were computed based on the combined TOPMed cohort (Figure 1, “All”) as well as on the different background groups. While we see excellent agreement of HEc and GCTA estimates (Figure 1C; correlation above 0.96 for all groups), there is lower agreement between HEc and LDSC (Figure 1D; same for GCTA versus LDSC). This is especially evident for the self-reported Black group where the correlation is only 0.64 (Figure 1D). We also performed comparisons to a method called cov-LDSC, which was reported to account for population structure in admixed populations when estimating heritability34 (Figure S5). Although the method indeed resulted in a lesser bias for heritability (Figure S5D), when extending the method for estimating genetic correlation there was little improvement (Figure S5B).
A compendium of genetic correlations and heritabilities in the multi-ethnic TOPMed dataset
We next expanded the number of analyzed phenotypes and calculated phenotypic and genetic correlations between 28 harmonized phenotypes in 33,959 TOPMed individuals. Results are provided in Figure 2 and Data S1. There were 378 phenotype pairs in the dataset. Of these, 270 (∼71%) had phenotypic correlations with p < 0.05 (Figure 2A), and 66 (18%) had genetic correlations (p < 0.05; Figure 2B, and inset).
Our results in the multi-ethnic dataset agree well with previous reports. For example, we estimated a genetic correlation of between body mass index (BMI) and triglycerides, while Bulik-Sullivan et al. estimated it at 0.2610 and Zhang et al. at 0.230 in European populations. We also identified potentially clinically relevant genetic correlations, including between white blood cell types and blood pressure, that are consistent with literature implicating a biological association between inflammation and hypertension35,36 (Table S2 in Data S3).
We also estimated heritabilities for all the phenotypes in our combined dataset (Figure 2C; Table S3). The most highly heritable phenotypes are height ( 0.56, SE = 0.02) as well as multiple blood cell measurements, such as neutrophil counts (Figure 2C; 0.4, SE = 0.04) and total white blood cell (WBC) counts (Figure 2C; 0.45, SE = 0.03) similar to previous estimates.37,38
Fractional genetic correlation quantifies the contribution of genetics to phenotypic correlations
We identified multiple instances where the genetic correlation coefficient is larger than the phenotypic correlation (Figure 2D, inset). For example, for lymphocytes and platelets the estimated phenotypic correlation is but the estimated genetic correlation is larger: . This emphasizes that the genetic correlation coefficient is not directly related to the phenotypic correlation.39,40 To address this, we introduce the concept of “fractional genetic correlation” (), which we define as the fraction of the observed phenotypic correlation explained by genetics in the decomposition (see STAR Methods (Equation 34), (Equation 35), (Equation 36)), where is the estimated fractional residual correlation. In the example of lymphocytes and platelets, the estimated fractional genetic correlation is , and we conclude that 56% of the observed phenotypic correlation is due to genetics ( out of 0.23).
We estimated for all the pairs of phenotypes in our dataset (Figure 2D; Table S4 in Data S3). Of phenotype pairs with genetic correlation with p < 0.05, 58% had substantial , defined as , corresponding to 19% of phenotypic correlations with p < 0.05. For most of the phenotype pairs, is much lower than and is lower than the estimated (by construction).
Genetic correlations and heritabilities differ between self-reported background groups
Out dataset includes 8,054 Black, 17,143 White, and 8,762 Hispanic/Latino participants (Table S2). We estimated heritability, phenotypic, and genetic correlations within these groups and compared them with the estimates of the combined group (Figures 3 and S7; Data S1).
While some phenotypes, such as HDL and eosinophil counts, have similar heritabilities across self-reported background groups (Figures 3 and S4), many other heritabilities vary by group. For example, CRP is similarly heritable in the Black and Hispanic/Latino backgrounds ( and 0.38 , respectively), but much less so in White individuals (. In contrast, neutrophil counts are very heritable in Black individuals (, but are less so in White individuals ( and Hispanic/Latino individuals ().
Similarly, multiple phenotype pairs, such as systolic and diastolic blood pressure are genetically correlated across backgrounds (Figures 3 and S6). However, many other differ by background (Figure 3). For example, lymphocyte counts and height have estimated genetic correlation in individuals from a Black background, but an opposite estimate of genetic correlation in individuals from an Hispanic/Latino background (p = 0.02 for correlation coefficients difference), and genetic correlation with p > 0.05 in individuals from a White background (Figures 3 and S6; Data S1). Overall, out of the 378 examined phenotype pairs, in 211 (55%) we detect a difference in genetic correlation values (p < 0.05) for at least two of the background groups in the multi-ethnic TOPMed dataset.
Genetics and shared household differentially affect phenotypic correlations in Hispanic/Latino individuals
We next studied genetic correlations among a larger panel of 61 phenotypes in n = 12,565 self-reported Hispanic/Latino individuals from the HCHS/SOL.33,41 The phenotypes represent 11 phenotypic domains: diabetes, cardiovascular disease, blood pressure, kidney function, lipids, lung function, sleep, anthropometrics, iron, RBC (red blood cell), and WBC (Figures 4 and S8; Table S1). HCHS/SOL also has information about household sharing between participants, allowing for estimation of both genetic and environmental correlations between phenotypes.
We estimated and in conjunction with the corresponding household correlation measures and for all the 1,830 pairs of phenotypes (Figures 4 and S8). Out of the 1,830 phenotype pairs, 1,499 (or ∼81%) have phenotypic correlations with p < 0.05 (Data S2). Of these, 427 (∼28%) also have genetic correlations with p < 0.05 (Figure 3A) and 380 (∼25%) have household correlations with p < 0.05. An interesting contrast between the genetic and household correlations can be seen for multiple phenotype pairs. For example, the diabetes domain (Figures 4A and 4B) has many household correlations with p < 0.05 with blood pressure domain phenotypes (but less so for genetic correlations) and has many genetic correlations with p < 0.05 with lung and lipid domain phenotypes (but not household correlations).
Domain enrichment analysis highlights associations between phenotypic domains
The genetic correlations are distributed non-uniformly with regard to the phenotypic domains. Domain enrichment analysis, in which we measured over-abundance of correlations with p < 0.05 between phenotypes within domain-pairs, showed a strong enrichment of the number of intra-group correlations for all the 11 phenotype domains (Data S2). Figures 4C and 4D visualize the estimated between-domain enrichment (p < 0.05). While some observed correlations, such as the ones between anthropometrics and diabetes or sleep, are driven both by genetics and shared household, many other domain-level correlations due to shared household do not mirror the genetic ones (Figure 4D). For example, shared household affects the correlations between diabetes and the blood-pressure and cardiovascular domains (but not genetic); however, the correlations between diabetes and lung and lipids domains are driven by genetics.
Heritabilities and genetic correlations differ across gender groups
The HCHS/SOL dataset of 12,565 participants has 5,175 males and 7,390 females (both self-identified genders and identified by sex chromosome checks using genetic data). Thus, we estimated heritabilities and genetic correlations stratified by gender. We identify a large number of differences between genders (Figures 5 and S9) in both genetic and environmental correlations. For example, the phenotype pair DiasBP (diastolic blood pressure) and FEV1FVC (forced expiratory volume to forced vital capacity ratio) has a negative estimated in males but a positive genetic correlation in females. Similarly, the household correlation for lymphocyte count and height is in males but in females (Figures 5 and S9; Data S2).
Overall, out of 363 phenotype pairs with either male or female genetic correlation with p < 0.05, there were 128 phenotype pairs (35%) in which the correlations were detected only in one gender groups (Figure S9C). Similarly, out of 349 phenotype pairs with either male or female household correlation with p < 0.05, there were 214 phenotype pairs (61%) in which the correlations were detected only in one gender group (Figure S9F).
These differences were also apparent at the domain level (Figures 5C and 5D). For example, while the correlations between blood pressure and diabetes domains are predominantly environmental in both gender groups (Figures 5C and 5D), the correlations between anthropometrics and sleep domains are enriched for genetic correlations only in the male group (Figure 5C and 5D). Figure S10 and Table S5 further provide similar results from gender-stratified genetic correlation analysis in the TOPMed White background group.
Discussion
We sought to study how observed correlations between complex human phenotypes can vary by socially constructed groups, and their characterization using both genetics and shared environment. To achieve that, we developed and implemented a computationally efficient framework HEc to estimate genetic and environmental correlations between phenotypes. We validated our method in simulations guided by data from multiple TOPMed background groups. HEc showed similar accuracy to GCTA while being up to 20 times faster. The GCTA speed dropped significantly with increased number of people, and it took up to 55 h to calculate a single genetic correlation for the combined TOPMed dataset (∼30k people), while it took ∼2.5 h for HEc.
We also compared HEc with GCTA and LDSC genetic correlation estimates using real data on a number of phenotypes in different background groups. While HEc and GCTA results were highly concordant, HEc and LDSC results differed. This is expected, as LDSC uses summary statistics from GWAS and relies on a reference panel for computing LD, assuming that the LD matches that of the population used for GWAS. Here, we implemented LDSC on summary statistics from the pan-UKBB GWAS, a population of mostly European genetic ancestry, and our target TOPMed background groups for LD. Thus, differences in LDSC-estimated genetic correlations across backgrounds are only due to differences in LD, and differences between HEc (and GCTA) and LDSC for the same TOPMed group are due to mismatch in the underlying genetic associations with the phenotypes. Notably, there are currently no available high-powered GWAS for either the Black or Hispanic/Latino groups that could be used by LDSC in lieu of the pan-UKBB GWAS summary statistics for any of the phenotypes analyzed here. We also adapted and studied a more recent algorithm (cov-LDSC34) that was developed to estimate heritability in admixed populations using summary statistics. While it did improve estimated heritability across backgrounds, its estimated genetic correlations were similar to those from LSDC (Figure S3). Finally, while LDSC was indeed very fast when calculating genetic correlations (which takes only several minutes) after preparation of the LD reference, it is important to note that it required a very long preprocessing: over 5 days for calculating a group-specific LD panel for each background group. In this case, methods that use individual-level data, such as GCTA and HEc, are advantageous, especially given that preparation of the LD panel needs to be adapted for both the summary statistics and the genetic data used by matching on available genotypes and on effect alleles.
We next employed HEc to systematically interrogate factors that may affect the observed correlation between complex human phenotypes, including shared genetics and shared environment. We assessed differences by self-reported background, capturing a combination of genetic patterns and sociocultural and environmental exposures that may modify genetic effects; and gender, capturing both biological sex effects and downstream sociocultural-related modifications; all affecting the underlying determinants of phenotypic distributions. We identified differences in some of the correlations and heritabilities across groups, agreeing with some previous reports of differences in phenotypic distributions and disease prevalence across race/ethnicities11,12,13 (Figure S7). Overall, we identified 26 phenotype pairs that had different genetic correlations (p < 0.05) between the Hispanic/Latino background and other backgrounds, 41 such phenotype pairs for the White background and 39 phenotype pairs with genetic correlation p < 0.05 only in the Black background group (Figure S6A).
We estimated high heritabilities of some of the blood phenotypes specifically in the Black background group. The heritabilities of neutrophil counts ( in the Black background were much higher than in the White background, with . Similarly high heritabilities were previously reported for Black individuals based on 236 African American pedigrees from the GeneSTAR study,42 and are usually attributed to the Duffy antigen receptor for chemokine gene, which accounts for ∼20% of total variation in the blood measures.43,44 The differences in the distribution of the Duffy antigens in population were first reported in 1954, when it was found that the overwhelming majority of people of African descent had the erythrocyte phenotype Fy(a-b-) which is rare in European genetic ancestries, and therefore in individuals of White background, who have predominately European ancestry. This region was shown to confer protection against malaria while inducing benign neutropenia. This genotype likely has high influence on estimated heritabilities and genetic correlations related to blood counts.
Traditionally in the GWAS era, such analyses have been performed for groups with clearly defined genetic ancestry, typically European. An important difference now is that we have been using whole-genome sequencing (WGS) data, with joint estimation of PCs and kinship matrix. Thus, our analysis does not suffer from limitations from focusing on sets of SNPs with differing LD patterns in different genetic ancestries and with different imputation qualities, which may affect analyses using genotyping array and imputed data. In principle, it is therefore appropriate to attempt to use these data to characterize the population in aggregate. However, we still see differences between estimates in the aggregated analysis and in the background-stratified analyses. These differences are likely driven by gene-environment interactions, where individuals from different backgrounds are exposed differently to modifiers of genetic effects. Such modifiers may include lifestyle factors such as smoking, sleep, nutrition,45,46 and although less studied, structural determinants such as the built environment and access to health care. Another potential reason for differences between backgrounds are gene-gene interactions leading to different genetic effects in haplotypes of different genetic ancestries.47 Although group differences reduce our ability to interpret the estimates in the combined group, we think that it was important to demonstrate the places where such differences are observed, as these are likely areas where there are stronger environmental effects on genetic effects and therefore policy or lifestyle interventions may be more useful to improve health.
We used a few measures of correlation throughout. The phenotypic correlation is the correlation between phenotypes without further modeling of contribution of specific factors. The genetic and household correlations measure the similarity between the contribution of genetic factors and household environment, respectively, to the phenotypes. Although genetic correlation could, by some statistical models, be traced to additive effects of a set of genetic variants, the household correlation was not developed under the same modeling assumption. Yet, they are estimated in the same manner, as different parameters corresponding to different matrices, once defined based on a statistical model that relies on measures of similarity of genetics (genetic relationship) and of household environment (household sharing) between individuals. Both of types of correlations are not restricted by the phenotypic correlation, where the phenotypic correlation may be very low while the genetic (or household) correlation can be large. This motivated the concept of “fractional genetic correlation coefficient” that we introduced, defined as the fraction of the observed phenotypic correlation explained by genetics: . The fractional genetic correlation is the genetic correlation normalized by the two traits’ heritabilities ((Equation 34), (Equation 35), (Equation 36)) and is algorithm-agnostic, i.e., it does not depend on which algorithm is used to estimate heritabilities and genetic correlations. It addresses the limitations of the genetic correlations where (1) it is sometimes higher than the phenotypic correlation, and (2) it can have high estimate when the estimated heritabilities are low. The fractional genetic correlation allows for identification of phenotype pairs where genetics is a large contributor to the overall observed correlations. Fractional genetic correlations are typically lower than genetic correlations and are more highly correlated with the observed phenotypic correlations (e.g., correlations of 0.92 between the phenotype and fractional genetic correlations estimated in the TOPMed White background compared with correlation of 0.67 between the estimated phenotypic and genetic correlations; similarly 0.84 versus 0.52 between estimated phenotype and fractional versus genetic correlation in the Black background group). We believe that this measure is useful as it is interpretable with respect to its relationship to phenotypic correlation, and we report it for all correlations estimated in this work.
We next assess the contribution of genetics and environment to the correlations between phenotypes using the rich data collected in the HCHS/SOL cohort, including measurements from a wide range of phenotypes along with genetics and information on shared households. We estimated genetic and environmental (due to shared household) correlations between 61 phenotypes from diabetes, cardiovascular, blood pressure, kidney, lipids, lung, sleep, anthropometrics, iron, RBC, and WBC domains. While shared household does not capture all of the contribution of the environment to the correlations between phenotypes, it does contribute substantially to 22% of all the 1,830 phenotype pairs (p < , Figure 4; Figure S8). Moreover, the contribution of the shared household to phenotypic correlations varies by phenotype pairs. In some cases, estimated genetic and household correlation are in opposite directions. For example, for albumin-creatinine ratio and PR duration (an echocardiogram measure of heart rate) ; and for major ECG abnormalities and BMI ).
We also performed domain-level enrichment analysis. We defined domains as sets of phenotypes that capture similar underlying “latent” phenotypes, with the limitations that groups of phenotypes assigned for the same domain may still capture complex underlying biology, i.e., are not measures of exactly the same latent phenotype (e.g., insomnia and mean oxygen saturation during sleep, while correlated in individuals with obstructive sleep apnea, may capture different pathophysiological disorders). While more study is needed, domain analysis should be less sensitive to individual variation in any particular phenotype. The results of this analysis highlight the domains and their mode of association (genetic and/or environmental) and, in the case of the correlations that are driven by the shared household, present a way to increase or disrupt the correlation via lifestyle changes. Interestingly, it seems that some domains have multiple phenotypes that are correlated with phenotypes in another domain predominantly via genetics or shared household. For example, the interactions between diabetes and blood pressure and cardiovascular domains are strongly influenced by shared household and therefore are a possible target for lifestyle interventions. On the other hand, the diabetes domain has an abundance of genetically correlated phenotypes with the lung and anthropometrics domains.
Finally, we stratified participants from the Hispanic/Latino background by gender and analyzed genetic and household correlations. We found multiple differences between the genders such that there were 35% phenotype pairs with genetic correlations and 61% phenotype pairs with household correlations with p < 0.05 only in one gender group but not the other. For example, eosinophil counts versus PhysHealth (Aggregate Physical Health Scale) had a high (95% CI. 0.19–1) in males but in females. Similarly, the estimated household correlation between lymphocyte counts and height was (95% CI, 0.02–1) in males but was reversed and equal to (95% CI, to ). Multiple correlations between phenotypic domains were also gender specific. Other recent studies considered gender differences in genetic determinants of phenotypes, focusing on UK Biobank participants of European ancestries.48,49 Both studies computed genetic correlations between male and female genetic effects for the same phenotype, which we will denote here by to differentiate it from , and found that often this genetic correlation is different than 1, indicating differences in genetic architecture between the gender groups. Zhu et al. further showed that gender differences are often due to “amplification effects,” where genetic associations in one gender group, e.g., females, on average, are the same as the effects in males, multiplied by a constant, suggesting different regulations, for example, due to hormone levels, in males and females. In contrast, our analysis focused on gender differences in genetic correlation between pairs of phenotypes. While mathematical modeling is needed to study whether the amplification model is consistent with downstream large differences in between a pair of phenotypes in males compared with females, it is possible that differences in the systematic regulation of sets of genes between males and females will lead to such observed differences.
In interpreting such gender-stratified genetic and household correlations, we note that gender is a social construct that is related to sex, and drivers of some of the estimated quantities depend on complex interactions between biological determinants of sex and the gendered environment. Thus, both genetic and environmental correlations may differ between genders due to sociocultural differences between them (differences in environmental exposures may lead to differences in genetic effects via gene-environment interactions). Here, we were not able to assess specific sociocultural contributions related to gender roles to the estimated correlations, but we think that observed differences in household correlations are largely driven by them.
A specific strength of our study is the use of high-quality phenotypic and genotyping data from the diverse multi-ethnic TOPMed program. Furthermore, all participating studies are population-based cohort studies, reducing the likelihood of selection and other biases, which may arise in studies following selected populations, such as case-control studies. Other strengths are the investigation of a large panel of phenotypes, evaluation of both genetic and environmental correlations, stratification by both self-reported background and gender, and the domain-level enrichment analysis.
Nevertheless, this study also has several limitations. For example, the use of self-reported background rather than strata defined by genetic ancestry is also somewhat imprecise, as self-reported background may change over time.50 Still, we chose to proceed with these groupings; first, to reflect on currently used groups in medical research, and second, because individuals of Hispanic/Latino background are highly admixed and there is no natural grouping that is based on genetic ancestry. We also note that socially constructed background groups may be meaningful due to differences in exposures across groups, which may translate to differences in the expression of genetic effects via gene-environment interactions. As the field of genetic medicine grapples with the use of genetic ancestry and social definition of race/ethnicity,51 both potentially leading to the wrong and harmful reification of the biological basis of race,52 it would be important to re-consider models for genetic correlation analyses. Finally, while our results are consistent with other studies14,15 that demonstrated that genetic effects on specific traits vary by age, gender, and other environmental exposures, additional data and mathematical models are needed to untangle how specific factors influence measures of genetic and environmental correlations.
In summary, in this work we establish that multiple factors, including genetics, gender, sociocultural environment, and household environment, all shape the correlations between complex human phenotypes. We demonstrate how stratification by groups that encapsulate these factors, such as background groups capturing both genetic patterns and social race/ethnicity constructs, and gender groups capturing biological sex effects and gendered environment, uncovers differences in heritabilities and genetic correlations between them. We report thousands of genetic and environmental correlations between phenotypes. This work should lay the foundation for additional research in identifying personalized treatment and intervention strategies in understudied populations. Future work includes the application of approaches from the graph analysis field, such as Gaussian graphical models53,54,55 to discover directionality and causality, utilizing genetically correlated phenotypes to improve polygenic risk prediction models,56 and studying genetic correlations by categories of genetic variants to capture the contributions of rare variants.
Limitations of the study
One limitation of this study is the somewhat limited sample sizes. Larger sample sizes, especially within stratified analyses, would enable stronger inferences. Another limitation is imperfect definitions of phenotypic domains, which may not accurately capture underlying pathophysiology. Next, although the use of population-based studies reduces the likelihood of bias in estimation of heritabilities and genetic correlations, additional biases could remain as all studies employed some preferential sampling, e.g., of specific age ranges, geographic regions, etc., and therefore none of the studies, separately or combined, accurately represents a random sample from a specific US population. Lastly, we note that because the p values were derived via a bootstrap procedure (which is quite slow), and there were thousands of estimated genetic correlations in this paper, we are limited by the resulting p values with regard to FDR correction (which we provide for all calculated correlation values in Data S1. Phenotypic and genetic correlations between all harmonized phenotypes in the TOPMed dataset combined and separated by group, Data S2. Phenotypic and genetic correlations between all phenotypes in the HCHS/SOL dataset combined and separated by self-reported gender, Data S3. Supplemental Table 2: TOPMed phenotypes’ characteristics across backgrounds, related to Figures 2 and 3, Data S4. Supplemental Table 4: HCHS/SOL phenotypes’ characteristics across self-reported genders, related to Figures 4 and 5).
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER |
---|---|---|
Deposited data | ||
Summary statistics, PAN-UK BioBank | Pan-UKB team. https://pan.ukbb.broadinstitute.org. 2020 | https://pan.ukbb.broadinstitute.org/ |
TOPMed data, by study | Taliun et al.31 | Amish: phs000956, ARIC: phs001211, CARDIA: phs001612, CHS: phs001368, FHS: phs000974, HCHS_SOL: phs001395, JHS: phs000964, MESA: phs001416 |
Software and algorithms | ||
All original computer code | This paper | https://github.com/tamartsi/HE_Genetic_Correlation |
Principal Components, kinship matrices, unrelated individuals | This paper | TOPMed DCC pipeline https://github.com/UW-GAC/analysis_pipeline |
GCTA | Yang et al.57 | https://yanglab.westlake.edu.cn/software/gcta/#Overview |
R package: igraph | Csardi et al.58 | https://igraph.org/r/ |
R package: qgraph | Epskamp et al.59 | https://cran.r-project.org/web/packages/qgraph/index.html |
R package: corrplot | Friendly60 | https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html |
ldsc package | Bulik-Sullivan28 | https://github.com/bulik/ldsc |
cov-LDSC | Luo34 | https://github.com/immunogenomics/cov-ldsc |
Resource availability
Lead contact
Further information and requests should be directed to the lead contact, Dr. Tamar Sofer (tsofer@bwh.harvard.edu).
Materials availability
This study did not generate new, unique reagents.
Experimental model and subject details
The Trans-Omics for Precision Medicine (TOPMed) program
We used harmonized phenotype data from eight cohort studies participating in TOPMed (9) Freeze 8 [https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-methods-freeze-8], which included 33,959 genotyped individuals from the Amish Study (n = 1,105), JHS (n = 2,807), FHS (n = 3,658), HCHS/SOL (n = 7,693), ARIC (n = 7,479), CHS (n = 3,482), MESA (n = 4,665), and CARDIA (n = 3,070) with available self-reported race/ethnic identification, which we refer to as “background”. Descriptions of each of these studies are provided in the Supplemental information. This dataset included 8,054 Black participants, 17,143 White participants and 8,762 participants of Hispanic/Latino background. All participants provided informed consent and the study was approved by IRBs in each of the participating institutions. For TOPMed WGS data acquisition and QC report see ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?id=phd006969.1. The phenotype harmonization was performed by TOPMed Data Coordinating Center (DCC) as described in.32 The phenotype names and description, exclusion criteria and transformations are described in Table S1. Phenotypes’ characteristics across backgrounds are reported in Table S2 (Data S3). All analyses were adjusted for age, gender, study, and reported race/ethnicity as well as 11 first principal components (PCs) to adjust for population structure. The PCs, kinship matrices, and unrelated individual pools were computed by TOPMed DCC via a robust pipeline [https://github.com/UW-GAC/analysis_pipeline] via a combination of KING,61 PC-AiR,62 and PC-Relate.63
The Hispanic community health study/study of latinos
The HCHS/SOL is a community-based cohort study of Hispanic individuals from four field centers across the US33,41 with almost 13,000 genotyped participants. A two-stage sampling scheme for participant selection was employed, with sampled community block units followed by households. Correlation matrices to model environmental variance due to households and community block units were generated so that the i, j entry of a given matrix was set to 1 if the i and j individuals live in the same household (or community block unit), and 0 otherwise. This study was approved by the institutional review boards at each field center, where all participants gave written informed consent. Genotyping and quality control for HCHS/SOL have been described in detail elsewhere.62 In brief, DNA extracted from blood was genotyped on the HCHS Custom 15,041,502 array (Illumina Omni2.5M + custom content). Genotyping and downstream quality-control procedures yielded 2 232 944 genetic variants for genotyped HCHS/SOL participants. These were used for genotype imputation with the 1000 Genomes Project phase 3 reference panel64 as previously described.65 Variants with at least two copies of the minor allele and present in any of the four 1000 Genomes continental panels were imputed yielding about 50 million imputed variants prior to quality filtering. A subset of 7,693 individuals from HCHS/SOL participated in TOPMed and was used in the first part of this study (with the whole dataset used in the latter parts), with additional phenotypes that were not harmonized across other TOPMed studies. HCHS/SOL phenotype names and description are described in Table S3. The number of participants with non-missing information per phenotype as well as means and standard deviations per phenotype per gender are reported in Table S4 (Data S3). We estimated genetic correlations between phenotypes from the following domains: anthropometric, blood pressure, lipids, blood cell counts, and inflammation markers. All analyses were adjusted for age, gender, sampling weights and 11 first principal components to account for population structure.
Method details
Statistical model when genetic relatedness in the only modeled source of correlation
Consider the linear model
Model 1 |
in which the quantitative outcome is modeled by a regression on covariates and the additive effects of d genetic variants ; and are normally distributed errors across participants. Assuming that the genetic variants are independent random variables, each centered and scaled to have mean 0 and variance 1, the mean and variance of are
(Equation 1) |
(Equation 2) |
Here, are the genetic and error variance components. Accordingly, narrow-sense heritability, which is the proportion of trait variance that is due to additive genetic factors is:
(Equation 3) |
To model genetic correlation, we extend Model 1 into a two-trait model. For person i:
(Equation 4) |
(Equation 5) |
with error terms satisfying for .
Thus, the errors of the same person may be correlated for the two traits, but for different traits the error of person i is independent of the error of person j. Consider the covariance between the two traits, again while making the simplifying assumption of independence between genetic variants:
(Equation 6) |
where is the correlation between the genetic effect of the k-th variant on the two outcomes, and is the correlation between the residual errors of the two outcomes. Note that the transition to using at the final step treats the vectors of causal genetic effects as random variables with mean 0, i.e., .
Noting that for individuals , the probability of the two individuals sharing the same allele identically-by-decent,61,63 an equivalent formulation supposes that the genetic effects can be modeled via K, the kinship matrix, tabulating the measure of genetic relationship between the i and j participants in its i,j entry. Consider the vector form of the model for the l outcome with correlated errors of trait l = 1,2:
Model 2 |
Now the genetic correlation can be estimated using mixed model with two traits. However, this is computationally demanding, especially for very large datasets. Recently,66 discussed the Haseman-Elston regression for variance components estimation, and demonstrated that the genetic variance components estimator corresponding to the kinship matrix, if it is independent of all other correlation matrices in a model with potentially multiple sources of correlation (which holds here, because we only have a single correlation matrix, the kinship matrix) are given by , where is the kinship matrix with all diagonal values set to zero.
An estimator of genetic correlation between two phenotypes
We extend the Haseman-Elston approach for modeling the genetic correlations between two phenotypes. For the errors of persons i and j, and phenotypes 1 and 2, under Model 2 we get:
(Equation 7) |
(Equation 8) |
Suppose for now that and are known. For estimating the genetic correlation between the two phenotypes, we take all pairs of residuals (estimating the error terms) after regression on mean-model covariates for , and regress them against the “covariate” . From properties of linear regression, we get:
(Equation 9) |
Now we can plug-in the estimators of to get:
(Equation 10) |
This estimator resembles that of the Pearson correlation parameter between the variables and , as can be seen if one replaces the matrix by the identify matrix . Interestingly, this estimator does not involve the unknown variance parameters. It does include the estimated kinship parameters, which are treated as fixed.
Extension to multiple correlation matrices and generalization
We can use multiple relatedness matrices with elements indicating the measure of relatedness between the i-th and j-th participants in its i,j entry to model the variance. We can then estimate the variance components obtained from expressions of the form
(Equation 11) |
(corresponding to Model 2 above) via a residual regression, i.e. by taking the vector all pairs of residuals for all . The HE design matrix, now re-defined (compared to66) to include rows corresponding to with , is given by:
(Equation 12) |
Similarly, the design matrix for estimating genetic correlation, obtained from expression of the form
(Equation 13) |
Can be written (if the variance parameters were known) as:
(Equation 14) |
Noting that:
(Equation 15) |
We get that:
(Equation 16) |
(Equation 17) |
We also note the outcome matrices for estimating variance components and correlation parameters are:
(Equation 18) |
Therefore:
(Equation 19) |
and:
(Equation 20) |
(Equation 21) |
(Equation 22) |
Because the lth entry of is we have then for :
(Equation 23) |
(Equation 24) |
To prove that this is a generalized Pearson correlation, we only need to show that is a bilinear form, with the matrix completely defined by the lth row of . This is simple to see, because the entries of are:
(Equation 25) |
Similarly:
(Equation 26) |
and the lth row of determines the weights in the following expression:
(Equation 27) |
Thus for
(Equation 28) |
We get
(Equation 29) |
Deriving confidence intervals for estimated correlation coefficients; We propose two methods to compute confidence intervals for the estimated correlation coefficients. First, using the Fisher’s transformation, which was developed to estimate confidence intervals for the standard Pearson correlation coefficient, and second, using block bootstrap.
Confidence intervals using the Fisher’s transform
Fisher’s transformation converts the distribution of the correlation coefficients to a normal one and thus allows us to calculate confidence intervals (and corresponding p values) for the correlation coefficient using the values of the correlation coefficient and the sample size.67,68 Since we show that calculating genetic correlation is equal to calculating standard correlation for adjusted phenotypes (Equation 29), the Fisher method is equally applicable for genetic correlation coefficient with the modification of plugging-in “effective sample size” to account for the modeled correlation structure between the two traits. Specifically, Fisher’s z-transformation of a correlation coefficient is defined as:
(Equation 30) |
If the two variables for which the correlation is measured have a bivariate normal distribution and are independent and identically distributed, then is approximately normally distributed with mean and SE given by:
(Equation 31) |
(Equation 32) |
and being the effective sample size of our sample equal to:
(Equation 33) |
where is the weighted matrix described in Equation 28. The coverage of this approach was verified in simulations via comparisons to the block Bootstrap method described below (see Figure S1).
Confidence intervals using block bootstrap
Multiple participants in our datasets are genetically related (and thus correlated), which violates the assumption of the standard bootstrap method. We thus performed the block bootstrap procedure to derive the confidence intervals and p values as described in.69 Briefly, related individuals were grouped into blocks (via third degree kinship) and the sampling procedure was at the level of blocks. We applied the Fisher’s transformation on each correlation value estimated in the bootstrap. Standard deviations and consequently confidence intervals and p values were calculated based on the Fisher’s transformed values. Finally, to obtain confidence intervals on the original genetic correlation scale we applied the inverse transformation of the Fisher’s transform (given by ) to the endpoints of the confidence intervals obtained on the Fisher’s transformation scale. In addition, we used the quantiles method to derive 95% confidence interval from the bootstrap results and report whether the corresponding p value is < 0.05 based on the null value being in the confidence interval.
Testing for difference of two correlation coefficients
To compare two correlation coefficients, we transform both correlations using the Fisher-z transformation (as described in Equation 30). Once the correlations have been converted into z values, the normal distribution is used to conduct the test of .
Fractional genetic correlation
For simplicity, focus on the single correlation matrix settings (the derivation here naturally extends to multiple sources of correlation). Recall Model 2 - the phenotypic correlation coefficient is equal to:
(Equation 34) |
We then define Fractional Genetic Correlation () as
(Equation 35) |
and Fractional Residual Correlation as
.
This is a natural decomposition of phenotypic correlation into two components:
(Equation 36) |
Unlike standard genetic correlation, is the genetic correlation adjusted (fractional) for both traits’ heritabilities and variances and crucially it represents a fraction of the phenotypic correlation that is due to genetics. As expected, the fractional correlation terms are never larger than the phenotypic correlation , because they sum to .
Quantification and statistical analysis
Simulation studies
We studied the accuracy of the proposed method for estimating genetic correlations and for calculating confidence intervals in simulations. We used correlation matrices from the HCHS/SOL representing kinship and shared household to generate realistic correlation structures. In all simulations, data were generated by first sampling two uncorrelated error vectors from a standard normal distribution. We next simulated the covariance structure according to our model:
The matrix K represents kinship, and H represents shared household. All simulations were performed 1,000 times with different sample sizes (1000, 4000, and 7706, the latter is sample size of HCHS/SOL individuals in TOPMed freeze 8 which is the smallest subgroup in this study) and values of and ranging from 0 to 1 in increments of 0.1. The variance components reported here were set to typical values for phenotypes from our dataset and equal to (corresponding to HDL, height, fasting glucose levels and eosinophil counts accordingly). The confidence intervals and p values were calculated from the block bootstrap method using the Fisher’s transformation and the percentile method (Figure S1).
We repeated all the simulations for a single kinship matrix on different self-reported background groups using kinship matrices from TOPMed participants from the available background groups (Black, Hispanic/Latino, and White). We also simulated a joint group in which we sampled an equal number of TOPMed individuals from each of the background groups, and used the corresponding kinship matrix. The number of participants was kept at 7706, and we compared the performance of HEc method (1000 repeats per group per ) to GCTA-GREML (with 40 repeats due to the slow computation speeds).
Heritability and genetic/environment correlation estimation via HEc and GCTA-GREML
The relatedness between individuals is modeled via a kinship (K) matrix, and an additional household matrix (H) for modeling environmental effects (available only for the HCHS/SOL cohort). Each phenotype was regressed on age, gender, sampling weights and 11 first principal components (and race/ethnicity and study for TOPMed combined cohorts) and the residuals were rank-normalized. We estimated the correlation coefficients corresponding to the relatedness matrices for all trait pairs by plugging in the normalized residuals to Equation 29. This was implemented via R scripts provided in GitHub repository [https://github.com/tamartsi/HE_Genetic_Correlation]. The genetic and environment variance components as well as the corresponding heritabilities were calculated via the GCTA software.57 Following sensitivity analysis for the presence of related individuals in the cohorts (Figure S5) we removed all individuals related at third degree or more for the calculation of heritabilities, however as we did not see any substantial effects of relatives on the estimated genetic correlations (Figure S2B), the relatives were kept in for genetic and environmental correlation coefficients estimation. We provide the confidence intervals for both the Fisher Method and the two Bootstrap-Based approaches in the supplementary data files. Visualizations were performed via the R packages igraph,58 qgraph,59 ggplot270 and corrplot60 followed by Adobe Illustrator. The figures are based on uncorrected p values. FDR-corrected p values are provided in the Supplementary Data files.
Heritability and genetic correlation estimation via LD-based methods
Summary statistics for 8 selected phenotypes (Table S5) were downloaded from PAN-UK BioBank (https://pan.ukbb.broadinstitute.org/). The GWAS underwent further cleanup and quality control as recommended by the ldsc package (https://github.com/bulik/ldsc).10,28 LD scores were computed for each of the TOPMed self-reported background groups as well as for the joint dataset as recommended by the ldsc software. Alternatively, covariate-adjusted LD scores were calculated via cov-LDSC (https://github.com/immunogenomics/cov-ldsc)34 using 11 principal components (calculated as described in section 2.4). Finally, genetic correlations and heritabilities were estimated using the computed LD scores.
Domain-level enrichment analysis
We calculated the enrichment of inter-domain correlation via a permutation approach. Specifically, for 1000 repeats, we generated random connections between nodes in our correlation graph such that each node will receive a same number of connections as in the real dataset as well as keeping the overall number of connections identical. We then calculated the distribution of number of connections between each pair of domains and used it to obtain a domain enrichment p value as follows:
where is the number of connections between domains 1 and 2, and is the number of connections between domains 1 and 2 in the th permutation. We considered two domains to be enriched if their enrichment p value was <0.05.
Additional resources
No additional resources were used.
Acknowledgments
M.E., T.S., and S.R. were supported by National Heart, Lung, and Blood Institute grants R21HL145425 to T.S. and R35HL135818 to S.R. Acknowledgments for the TOPMed, CCDG, and parent cohorts participating in this study are provided in the supplemental information.
Author contributions
M.E. and T.S. conceived the study. M.E., M.O.G., and T.S. developed statistical methods for correlation analysis. M.E. performed analysis and prepared the figures and tables. M.E. and T.S. drafted the manuscript. M.O.G., C.I., H.C., P.S.d.V., H.X., A.W.M., X.G., N.F., B.M.P., S.S.R., J.I.R., D.M.L.-J., M.F., A.C., N.L.H.-C., R.S.V., R.H., R.C.K., A.C.M., and S.R. critically reviewed and approved the manuscript. C.I., H.X., B.M.P., S.S.R., J.I.R., D.M.L.-J., M.F., A.C., N.L.H.-C., R.S.V., R.C.K., A.C.M., and S.R. were involved in collection of data for studies that were used in this manuscript.
Declaration of interests
B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. R.H. serves as a consultant for Invitae and is on the Scientific Advisory Board for Variant Bio. These roles are unrelated to the work in this paper.
Published: December 12, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xcrm.2022.100844.
Contributor Information
Michael Elgart, Email: melgart@bwh.harvard.edu.
Tamar Sofer, Email: tsofer@bwh.harvard.edu.
Supplemental information
Data and code availability
-
•
The data is publicly available upon request from the TOPMed consortium (https://topmed.nhlbi.nih.gov/), and all original code used in this work is freely available at https://github.com/tamartsi/HE_Genetic_Correlation.
-
•
Any additional information required to reanalyze the data reported in this work paper is available from the lead contact upon request.
References
- 1.Pickrell J.K., Berisa T., Liu J.Z., Ségurel L., Tung J.Y., Hinds D.A. Detection and interpretation of shared genetic influences on 42 human traits. Nat. Genet. 2016;48:709–717. doi: 10.1038/ng.3570. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Shi H., Mancuso N., Spendlove S., Pasaniuc B. Local genetic correlation gives insights into the shared genetic architecture of complex traits. Am. J. Hum. Genet. 2017;101:737–751. doi: 10.1016/j.ajhg.2017.09.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.van Rheenen W., Peyrot W.J., Schork A.J., Lee S.H., Wray N.R. Genetic correlations of polygenic disease traits: from theory to practice. Nat. Rev. Genet. 2019;20:567–581. doi: 10.1038/s41576-019-0137-z. [DOI] [PubMed] [Google Scholar]
- 4.Lee S.H., Yang J., Goddard M.E., Visscher P.M., Wray N.R. Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics. 2012;28:2540–2542. doi: 10.1093/bioinformatics/bts474. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Nagai A., Hirata M., Kamatani Y., Muto K., Matsuda K., Kiyohara Y., Ninomiya T., Tamakoshi A., Yamagata Z., Mushiroda T., et al. Overview of the BioBank Japan project: study design and profile. J. Epidemiol. 2017;27:S2–S8. doi: 10.1016/j.je.2016.12.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O'Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Zheng J., Erzurumluoglu A.M., Elsworth B.L., Kemp J.P., Howe L., Haycock P.C., Hemani G., Tansey K., Laurin C., et al. Early Genetics and Lifecourse Epidemiology EAGLE Eczema Consortium LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 2017;33:272–279. doi: 10.1093/bioinformatics/btw613. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Sakaue S., Kanai M., Karjalainen J., Akiyama M., Kurki M., Matoba N., Takahashi A., Hirata M., Kubo M., Matsuda K., et al. Trans-biobank analysis with 676, 000 individuals elucidates the association of polygenic risk scores of complex traits with human lifespan. Nat. Med. 2020;26:542–548. doi: 10.1038/s41591-020-0785-8. [DOI] [PubMed] [Google Scholar]
- 9.Kanai M., Akiyama M., Takahashi A., Matoba N., Momozawa Y., Ikeda M., Iwata N., Ikegawa S., Hirata M., Matsuda K., et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat. Genet. 2018;50:390–400. doi: 10.1038/s41588-018-0047-6. [DOI] [PubMed] [Google Scholar]
- 10.Bulik-Sullivan B., Finucane H.K., Anttila V., Gusev A., Day F.R., Loh P.R., ReproGen Consortium. Psychiatric Genomics Consortium. Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium 3. Duncan L., et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 2015;47:1236–1241. doi: 10.1038/ng.3406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Chen M.H., Raffield L.M., Mousas A., Sakaue S., Huffman J.E., Moscati A., Trivedi B., Jiang T., Akbari P., Vuckovic D., et al. Trans-ethnic and ancestry-specific blood-cell genetics in 746, 667 individuals from 5 global populations. Cell. 2020;182:1198–1213.e14. doi: 10.1016/j.cell.2020.06.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Mogil L.S., Andaleon A., Badalamenti A., Dickinson S.P., Guo X., Rotter J.I., Johnson W.C., Im H.K., Liu Y., Wheeler H.E. Genetic architecture of gene expression traits across diverse populations. PLoS Genet. 2018;14:e1007586. doi: 10.1371/journal.pgen.1007586. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Burt V.L., Whelton P., Roccella E.J., Brown C., Cutler J.A., Higgins M., Horan M.J., Labarthe D. Prevalence of hypertension in the US adult population: results from the third national health and nutrition examination survey, 1988-1991. Hypertension. 1995;25:305–313. doi: 10.1161/01.hyp.25.3.305. [DOI] [PubMed] [Google Scholar]
- 14.Ge T., Chen C.Y., Neale B.M., Sabuncu M.R., Smoller J.W. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet. 2017;13:e1006711. doi: 10.1371/journal.pgen.1006711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Mostafavi H., Harpak A., Agarwal I., Conley D., Pritchard J.K., Przeworski M. Variable prediction accuracy of polygenic scores within an ancestry group. Elife. 2020;9:e48376. doi: 10.7554/ELIFE.48376. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Rao D.C., Sung Y.J., Winkler T.W., Schwander K., Borecki I., Adrienne Cupples L., James Gauderman W., Rice K., Munroe P.B., Psaty B.M. Multiancestry study of gene-lifestyle interactions for cardiovascular traits in 610 475 individuals from 124 cohorts: design and rationale. Circ Cardiovasc Genet. 2017;10:e001649. doi: 10.1161/CIRCGENETICS.116.001649. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Walters R.K., Polimanti R., Johnson E.C., McClintick J.N., Adams M.J., Adkins A.E., Aliev F., Bacanu S.A., Batzler A., Bertelsen S., et al. Transancestral GWAS of alcohol dependence reveals common genetic underpinnings with psychiatric disorders. Nat. Neurosci. 2018;21:1656–1669. doi: 10.1038/s41593-018-0275-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Brown B.C., Asian Genetic Epidemiology Network Type 2 Diabetes Consortium. Ye C.J., Price A.L., Zaitlen N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 2016;99:76–88. doi: 10.1016/j.ajhg.2016.05.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Galinsky K.J., Reshef Y.A., Finucane H.K., Loh P.R., Zaitlen N., Patterson N.J., Brown B.C., Price A.L. Estimating cross-population genetic correlations of causal effect sizes. Genet. Epidemiol. 2019;43:180–188. doi: 10.1002/gepi.22173. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Wientjes Y.C.J., Bijma P., Vandenplas J., Calus M.P.L. Multi-population genomic relationships for estimating current genetic variances within and genetic correlations between populations. Genetics. 2017;207:503–515. doi: 10.1534/genetics.117.300152. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Schousboe K., Willemsen G., Kyvik K.O., Mortensen J., Boomsma D.I., Cornes B.K., Davis C.J., Fagnani C., Hjelmborg J., Kaprio J., et al. Sex differences in heritability of BMI: a comparative study of results from twin studies in eight countries. Twin Res. 2003;6:409–421. doi: 10.1375/136905203770326411. [DOI] [PubMed] [Google Scholar]
- 22.Weiss L.A., Pan L., Abney M., Ober C. The sex-specific genetic architecture of quantitative traits in humans. Nat. Genet. 2006;38:218–222. doi: 10.1038/ng1726. [DOI] [PubMed] [Google Scholar]
- 23.Ober C., Loisel D.A., Gilad Y. Sex-specific genetic architecture of human disease. Nat. Rev. Genet. 2008;9:911–922. doi: 10.1038/nrg2415. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Maier R.M., Zhu Z., Lee S.H., Trzaskowski M., Ruderfer D.M., Stahl E.A., Ripke S., Wray N.R., Yang J., Visscher P.M., Robinson M.R. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat. Commun. 2018;9:989–1017. doi: 10.1038/s41467-017-02769-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W., et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Speed D., Hemani G., Johnson M.R., Balding D.J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012;91:1011–1021. doi: 10.1016/j.ajhg.2012.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Lee S.H., Wray N.R., Goddard M.E., Visscher P.M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 2011;88:294–305. doi: 10.1016/j.ajhg.2011.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Bulik-Sullivan B.K., Loh P.R., Finucane H.K., Ripke S., Yang J., Schizophrenia Working Group of the Psychiatric Genomics Consortium. Patterson N., Daly M.J., Price A.L., Neale B.M. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Ni G., Moser G., Schizophrenia Working Group of the Psychiatric Genomics Consortium. Wray N.R., Lee S.H. Estimation of genetic correlation via linkage disequilibrium score regression and genomic restricted maximum likelihood. Am. J. Hum. Genet. 2018;102:1185–1194. doi: 10.1016/j.ajhg.2018.03.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Zhang Y., Cheng Y., Jiang W., Ye Y., Lu Q., Zhao H. Comparison of methods for estimating genetic correlation between complex traits using GWAS summary statistics. Briefings Bioinf. 2021;22:bbaa442. doi: 10.1093/bib/bbaa442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M., et al. Sequencing of 53, 831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Stilp A.M., Emery L.S., Broome J.G., Buth E.J., Khan A.T., Laurie C.A., Wang F.F., Wong Q., Chen D., D'Augustine C.M., et al. A system for phenotype harmonization in the NHLBI trans-omics for precision medicine (TOPMed) program. Am. J. Epidemiol. 2021;190:1977–1992. doi: 10.1093/aje/kwab115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Sorlie P.D., Avilés-Santa L.M., Wassertheil-Smoller S., Kaplan R.C., Daviglus M.L., Giachello A.L., Schneiderman N., Raij L., Talavera G., Allison M., et al. Design and implementation of the hispanic community health study/study of Latinos. Ann. Epidemiol. 2010;20:629–641. doi: 10.1016/j.annepidem.2010.03.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Luo Y., Li X., Wang X., Gazal S., Mercader J.M., 23 and Me Research Team. SIGMA Type 2 Diabetes Consortium. Neale B.M., Florez J.C., Auton A., et al. Estimating heritability and its enrichment in tissue-specific gene sets in admixed populations. Hum. Mol. Genet. 2021;30:1521–1534. doi: 10.1093/hmg/ddab130. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Siedlinski M., Jozefczuk E., Xu X., Teumer A., Evangelou E., Schnabel R.B., Welsh P., Maffia P., Erdmann J., Tomaszewski M., et al. White blood cells and blood pressure: a mendelian randomization study. Circulation. 2020;141:1307–1317. doi: 10.1161/CIRCULATIONAHA.119.045102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Schillaci G., Pirro M., Pucci G., Ronti T., Vaudo G., Mannarino M.R., Porcellati C., Mannarino E. Prognostic value of elevated white blood cell count in hypertension. Am. J. Hypertens. 2007;20:364–369. doi: 10.1016/j.amjhyper.2006.10.007. [DOI] [PubMed] [Google Scholar]
- 37.Wainschtein P., Jain D.P., Yengo L., Cupples L.A., Shadyab A.H., McKnight B., Shoemaker B.M., Mitchell B.D., Psaty B.M., Kooperberg C., Liu C.T. Recovery of trait heritability from whole genome sequence data. bioRxiv. 2019 doi: 10.1101/588020. Preprint at. [DOI] [Google Scholar]
- 38.Reiner A.P., Lettre G., Nalls M.A., Ganesh S.K., Mathias R., Austin M.A., Dean E., Arepalli S., Britton A., Chen Z., et al. Genome-Wide association study of white blood cell count in 16, 388 african americans: the continental Origins and Genetic Epidemiology network (COGENT) PLoS Genet. 2011;7:e1002108. doi: 10.1371/journal.pgen.1002108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Sodini S.M., Kemper K.E., Wray N.R., Trzaskowski M. Comparison of genotypic and phenotypic correlations: cheverud’s conjecture in humans. Genetics. 2018;209:941–948. doi: 10.1534/genetics.117.300630. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Searle S.R. Phenotypic, genetic and environmental correlations. Biometrics. 1961;17:474. [Google Scholar]
- 41.LaVange L.M., Kalsbeek W.D., Sorlie P.D., Avilés-Santa L.M., Kaplan R.C., Barnhart J., Liu K., Giachello A., Lee D.J., Ryan J., et al. Sample design and cohort selection in the hispanic community health study/study of Latinos. Ann. Epidemiol. 2010;20:642–649. doi: 10.1016/j.annepidem.2010.05.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Reiner A.P., Lettre G., Nalls M.A., Ganesh S.K., Mathias R., Austin M.A., Dean E., Arepalli S., Britton A., Chen Z., Couper D. Genome-wide association study of white blood cell count in 16,388 African Americans: the continental origins and genetic epidemiology network (COGENT) PLoS Genet. 2011;7:e1002108. doi: 10.1371/JOURNAL.PGEN.1002108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Nalls M.A., Couper D.J., Tanaka T., van Rooij F.J.A., Chen M.H., Smith A.V., Toniolo D., Zakai N.A., Yang Q., Greinacher A., et al. Multiple loci are associated with white blood cell phenotypes. PLoS Genet. 2011;7:e1002113. doi: 10.1371/journal.pgen.1002113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Reich D., Nalls M.A., Kao W.H.L., Akylbekova E.L., Tandon A., Patterson N., Mullikin J., Hsueh W.C., Cheng C.Y., Coresh J., et al. Reduced neutrophil count in people of african descent is due to a regulatory variant in the duffy antigen receptor for chemokines gene. PLoS Genet. 2009;5:e1000360. doi: 10.1371/journal.pgen.1000360. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Wang H., Noordam R., Cade B.E., Schwander K., Winkler T.W., Lee J., Sung Y.J., Bentley A.R., Manning A.K., Aschard H., et al. Multi-ancestry genome-wide gene–sleep interactions identify novel loci for blood pressure. Mol. Psychiatr. 2021;26:6293–6304. doi: 10.1038/s41380-021-01087-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Laville V., Majarian T., Sung Y.J., Schwander K., Feitosa M.F., Chasman D.I., Bentley A.R., Rotimi C.N., Cupples L.A., de Vries P.S., et al. Gene-lifestyle interactions in the genomics of human complex traits. Eur. J. Hum. Genet. 2022;30:730–739. doi: 10.1038/s41431-022-01045-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Patel R.A., Musharoff S.A., Spence J.P., Pimentel H., Tcheandjieu C., Mostafavi H., Sinnott-Armstrong N., Clarke S.L., Smith C.J., et al. V.A. Million Veteran Program Genetic interactions drive heterogeneity in causal variant effect sizes for gene expression and complex traits. Am. J. Hum. Genet. 2022;109:1286–1297. doi: 10.1016/j.ajhg.2022.05.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Bernabeu E., Canela-Xandri O., Rawlik K., Talenti A., Prendergast J., Tenesa A. Sex differences in genetic architecture in the UK Biobank. Nat. Genet. 2021;53:1283–1289. doi: 10.1038/s41588-021-00912-0. [DOI] [PubMed] [Google Scholar]
- 49.Zhu C., Ming M.J., Cole J.M., Kirkpatrick M., Harpak A. Amplification is the primary mode of gene-by-sex interaction in complex human traits. bioRxiv. 2022 doi: 10.1101/2022.05.06.490973. Preprint at. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Hitlin S., Scott Brown J., Elder G.H. Racial self-categorization in adolescence: multiracial development and social pathways. Child Dev. 2006;77:1298–1308. doi: 10.1111/j.1467-8624.2006.00935.x. [DOI] [PubMed] [Google Scholar]
- 51.Hernandez L.M., Blazer D.G., Institute of Medicine (US) Committee on Assessing Interactions Among Social B and GF in H . National Academies Press; 2006. Sex/Gender, Race/Ethnicity, and Health. [PubMed] [Google Scholar]
- 52.Lewis A.C.F., Molina S.J., Appelbaum P.S., Dauda B., Di Rienzo A., Fuentes A., Fullerton S.M., Garrison N.A., Ghosh N., Hammonds E.M., et al. Getting genetic ancestry right for science and society. Science. 2022;376:250–252. doi: 10.1126/science.abm7530. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Talluri R., Shete S. BMC Proc. BioMed Central Ltd.; 2014. Gaussian graphical models for phenotypes using pedigree data and exploratory analysis using networks with genetic and nongenetic factors based on Genetic Analysis Workshop 18 data; p. S99. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Zhao H., Duan Z.H. Cancer genetic network inference using Gaussian graphical models. Bioinf. Biol. Insights. 2019;13 doi: 10.1177/1177932219839402. 1177932219839402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Glymour C., Zhang K., Spirtes P. Review of causal discovery methods based on graphical models. Front. Genet. 2019;10:524. doi: 10.3389/fgene.2019.00524. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Torkamani A., Wineinger N.E., Topol E.J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 2018;19:581–590. doi: 10.1038/s41576-018-0018-x. [DOI] [PubMed] [Google Scholar]
- 57.Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Csardi G., Nepusz T. InterJournal Complex Sy; 2006. The Igraph Software Package for Complex Network Research; p. 1695. [Google Scholar]
- 59.Epskamp S., Cramer A.O.J., Waldorp L.J., Schmittmann V.D., Borsboom D. Qgraph: network visualizations of relationships in psychometric data. J. Stat. Software. 2012;48:1–18. [Google Scholar]
- 60.Wei T., Simko V. corrplot; 2021. R Package “Corrplot”: Visualization of a Correlation Matrix.https://cran.r-project.org/web/packages/corrplot/index.html [Google Scholar]
- 61.Manichaikul A., Mychaleckyj J.C., Rich S.S., Daly K., Sale M., Chen W.M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Conomos M.P., Miller M.B., Thornton T.A. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet. Epidemiol. 2015;39:276–293. doi: 10.1002/gepi.21896. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Conomos M.P., Reiner A.P., Weir B.S., Thornton T.A. Model-free estimation of recent genetic relatedness. Am. J. Hum. Genet. 2016;98:127–148. doi: 10.1016/j.ajhg.2015.11.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.1000 Genomes Project Consortium. Abecasis G.R., Auton A., Brooks L.D., DePristo M.A., Durbin R.M., Handsaker R.E., Kang H.M., Marth G.T., McVean G.A. An integrated map of genetic variation from 1, 092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Conomos M.P., Laurie C.A., Stilp A.M., Gogarten S.M., McHugh C.P., Nelson S.C., Sofer T., Fernández-Rhodes L., Justice A.E., Graff M., et al. Genetic diversity and association studies in US hispanic/latino populations: applications in the hispanic community health study/study of Latinos. Am. J. Hum. Genet. 2016;98:165–184. doi: 10.1016/j.ajhg.2015.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Sofer T. Confidence intervals for heritability via Haseman-Elston regression. Stat. Appl. Genet. Mol. Biol. 2017;16:259–273. doi: 10.1515/sagmb-2016-0076. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Fisher R.A. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika. 1915;10:507. [Google Scholar]
- 68.Bishara A.J., Hittner J.B. Confidence intervals for correlations when data are not normal. Behav. Res. Methods. 2017;49:294–309. doi: 10.3758/s13428-016-0702-8. [DOI] [PubMed] [Google Scholar]
- 69.Kunsch H.R. The jackknife and the bootstrap for general stationary observations. Ann. Stat. 1989;17:1217–1241. [Google Scholar]
- 70.Wickham H. Springer-Verlag, New York; 2016. ggplot2: Elegant Graphics for Data Analysis. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
-
•
The data is publicly available upon request from the TOPMed consortium (https://topmed.nhlbi.nih.gov/), and all original code used in this work is freely available at https://github.com/tamartsi/HE_Genetic_Correlation.
-
•
Any additional information required to reanalyze the data reported in this work paper is available from the lead contact upon request.