Nature Communications. 2023 Jun 14;14:3520. doi: 10.1038/s41467-023-39259-x

Unravelling the genetic architecture of human complex traits through whole genome sequencing

Ozvan Bocher, Cristen J Willer, Eleftheria Zeggini
PMCID: PMC10267118  PMID: 37316478

Abstract

Whole genome sequencing has enabled new insights into the genetic architecture of complex traits, especially through access to low-frequency and rare variation. This Comment highlights the key contributions from this technology and discusses considerations for its use and future perspectives.

Subject terms: Next-generation sequencing, Genome-wide association studies, Rare variants


The field of human complex trait genetics has been enriched by high-throughput whole genome sequencing (WGS) technologies. WGS complements array-based genotyping by offering access to most of the sequence in the genome rather than only to a set of known genetic variants. As sequencing costs gradually drop, an increasing number of study designs involve WGS approaches. Large sequencing projects have been undertaken in the general population that can be used as reference panels for genotype imputation in association studies, such as the Haplotype Reference Consortium (HRC) project [1]. Further initiatives, such as UK Biobank [2] and TOPMed [3], also make use of WGS technologies to sequence thousands of phenotypically-diverse individuals, providing resources of unprecedented scale to study the genetic architecture underlying complex diseases and traits [4]. Here, we comment on the progress and successes in the field heralded through WGS, as well as on future perspectives of this technology.

Advantages of WGS

Access to rare variation

The main advantage of using WGS approaches is direct access to genetic variation across the whole frequency spectrum, without first knowing where such variation occurs, as is required for array-based genotyping. Low-frequency and rare variants are often not imputed accurately from reference panels (which are also frequently not a perfect population match), but can be detected by WGS. WGS offers more accurate and complete information capture of rare variation observed in sequenced individuals, their family members, and others with shared ancestry [5]. WGS-based studies have reported associations with rare variants of large effect size. These associations have been described in case-control studies, for example, in the TOPMed project where Zhao et al. identified a rare variant with a large effect on reduced lung function [6], as well as in quantitative trait studies, for example, in the study by Benonisdottir et al., which identified nine rare variants associated with urinary biomarkers [7]. Detecting single-point rare variant associations requires very large sample sizes, especially if effect sizes are not large. To maximise the chance of detecting rare variants associated with complex diseases and to account for genetic heterogeneity between individuals, rare variant association tests (RVAT) have been developed. These methods have enabled the detection of associations between medically relevant traits and an accumulation of rare variants in a chromosomal region, typically a single gene. For example, Gilly et al. described a cardioprotective rare variant burden in the APOC3 gene, composed of exonic and splice variants, which could not be detected using imputation of genotype data [8]. In a larger study, by applying gene-based burden tests in over 17,000 binary phenotypes, Wang et al. identified over 1,700 significant associations, highlighting the importance of rare genetic variation in complex diseases [9].
When followed by functional investigation, these findings are bringing new insights into the biological mechanisms behind complex diseases. Nevertheless, biological interpretation can be more readily reached for the exome, to which the majority of RVAT have so far been applied.
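The gene-based burden idea described above can be illustrated with a minimal sketch. It assumes the simplest collapsing scheme: each individual is flagged as a carrier if they hold at least one qualifying rare allele in the gene, and carrier counts in cases and controls are compared with a one-sided Fisher exact test. The function names, the toy genotypes and the choice of Fisher's test are illustrative assumptions, not the methods used in the studies cited here, which rely on regression-based tests with covariates and relatedness adjustment.

```python
from math import comb

def carrier_flags(genotypes, qualifying):
    """Collapse a gene to one flag per person: True if the person carries
    at least one alternate allele at any qualifying variant."""
    return [any(person[j] > 0 for j in qualifying) for person in genotypes]

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value, P(X >= a), for the 2x2 table
    [[a, b], [c, d]] = [[case carriers, case non-carriers],
                        [control carriers, control non-carriers]]."""
    n_cases, n_controls, n_carriers = a + b, c + d, a + c
    total = comb(n_cases + n_controls, n_carriers)
    upper = min(n_cases, n_carriers)
    tail = sum(comb(n_cases, k) * comb(n_controls, n_carriers - k)
               for k in range(a, upper + 1))
    return tail / total

def burden_test(case_genotypes, control_genotypes, qualifying):
    """Compare carrier counts between cases and controls for one gene."""
    case_flags = carrier_flags(case_genotypes, qualifying)
    control_flags = carrier_flags(control_genotypes, qualifying)
    a, c = sum(case_flags), sum(control_flags)
    b, d = len(case_flags) - a, len(control_flags) - c
    return {"case_carriers": a, "control_carriers": c,
            "p_value": fisher_one_sided(a, b, c, d)}

# Toy data (hypothetical): rows are individuals, columns are three rare
# variants in one gene, entries are alternate-allele counts (0, 1 or 2).
CASES = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]]
CONTROLS = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]

if __name__ == "__main__":
    # Qualifying variants = all three columns; a real analysis would first
    # filter by frequency and predicted functional consequence.
    print(burden_test(CASES, CONTROLS, qualifying=[0, 1, 2]))
```

Because the test is one-sided, this sketch only looks for an excess of carriers among cases. A protective burden, such as the APOC3 signal mentioned above, would require testing the opposite tail or using a two-sided test.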

Despite the description of several associations with rare variants, it is still unclear how much they contribute to the heritability of human complex traits. This proportion is especially hard to estimate for rare variants because they are observed in only a few individuals, resulting in high standard errors [10]. As common variants are present in more individuals, it is expected that they will contribute more to the phenotypic variance than rare variants. Apart from a few examples such as height [11] or type 2 diabetes [12], several studies have indeed shown that complex trait heritability due to rare variants is expected to be rather low. For example, by looking at 22 common traits, Weiner et al. showed that rare coding variants explain on average only 1.3% of the overall phenotypic variance, ranging from 0.4% for asthma to 3.6% for height [13]. While rare variants are unlikely to explain all the remaining phenotypic variability of complex traits, they can also be useful for prediction [14]. For instance, a study to predict haemoglobin A1C levels showed that the integration of many rare variants into prediction scores could lead to the identification of a substantial number of undiagnosed type 2 diabetes cases [15].
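The reasoning above about frequency, variance explained and standard errors can be made concrete with a short sketch. It assumes the textbook additive model for a single biallelic variant in Hardy-Weinberg equilibrium, with the effect size expressed in phenotypic standard deviation units; the function names and the example frequencies and effect sizes are hypothetical and are not taken from the cited studies.

```python
from math import sqrt

def variance_explained(maf, beta):
    """Fraction of phenotypic variance explained by one biallelic variant
    under an additive model: 2 * p * (1 - p) * beta^2 (beta in SD units)."""
    return 2.0 * maf * (1.0 - maf) * beta ** 2

def approx_effect_se(n, maf):
    """Approximate standard error of the estimated per-allele effect for a
    standardised trait, assuming the variant explains little variance."""
    return 1.0 / sqrt(2.0 * n * maf * (1.0 - maf))

if __name__ == "__main__":
    # Hypothetical variants: a common one with a small effect and a rare
    # one with a tenfold larger effect, both studied in 10,000 people.
    for label, maf, beta in [("common", 0.30, 0.05), ("rare", 0.001, 0.50)]:
        print(label,
              "variance explained:", variance_explained(maf, beta),
              "approx. SE of effect:", approx_effect_se(10_000, maf))
```

Under these assumptions, the rare variant explains less variance than the common one despite its much larger per-allele effect, and its effect is estimated with a far larger standard error at the same sample size.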

Ancestry-diverse studies

A further important benefit of WGS is the investigation of under-represented populations that have not been well characterised by currently available sequencing data, in which rare or population-enriched variation is therefore not accurately described [16]. For example, using genotyping and low-depth sequencing in 6,400 individuals from the Ugandan population, numerous associations were identified with complex traits, including both novel findings and associations at previously reported loci but with different allelic effects [17]. Similarly, sequencing followed by imputation of the Icelandic population has resulted in novel insights, including an association between a splice variant in RPL3L and atrial fibrillation [18]. Considering sub-populations within Europeans is also of interest: in Norwegians, low-depth sequencing followed by a custom genotyping array performed on 70,000 individuals resulted in new associations, e.g., between ZNF529 p.K405X and LDL-C [19]. A GWAS performed in the Finnish population on 1,932 phenotypes found 2,491 significant associations, including newly associated variants that could be identified due to their higher frequency in the Finnish population. For example, an intronic variant of TNRC18 is strongly associated with inflammatory bowel disease (IBD) but is almost absent from other European populations [20]. WGS in diverse populations represents one of the most active areas of research, as getting an overview of the genetic architecture in diverse populations will enable better comprehension of complex diseases as well as of the differences in effect direction and size of the associated variants that are observed across populations [10].

Considerations in WGS-based studies

Challenges in study design

Genotype-based GWAS is an established field where power has been shown to clearly depend on the sample size and the detectable genetic effect [21]. Planning the design of a sequencing-based study, however, can be more challenging. Li et al. showed that power indeed also depends on read depth and distribution [22]. In addition, the power of RVAT is less straightforward to estimate compared to single-point association analysis, because it depends on additional parameters such as the filtering strategy used to select qualifying variants and their directions of effect. Power is a major driver of the success of a study, and multiple software packages are available to estimate the expected power of WGS-based studies, as reviewed in Li et al. [22]. Conducting studies in diverse populations can provide useful insights into the genetics of complex diseases. Furthermore, power to detect associations can be boosted by studying isolated populations, in which variant frequencies and effect sizes may be larger [8,18,20]. As the majority of WGS projects to date have been focused on European populations, conducting sequencing-based studies in under-represented populations is expected to be of benefit [10] and represents an important direction for future WGS applications.
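The dependence of single-point power on sample size, allele frequency and effect size can be sketched as follows. This is a simplified normal approximation for a quantitative trait that assumes perfectly called genotypes, an additive model and Hardy-Weinberg equilibrium. It deliberately ignores the sequencing-specific factors (read depth and distribution) discussed above, and it is not a substitute for the dedicated power software reviewed by Li et al. The function name, the default significance threshold of 5e-8 and the example parameters are assumptions made for illustration.

```python
from statistics import NormalDist

def single_variant_power(n, maf, beta, alpha=5e-8):
    """Approximate power of a two-sided single-variant association test
    for a quantitative trait, with beta in phenotypic SD units."""
    std_normal = NormalDist()
    h2 = 2.0 * maf * (1.0 - maf) * beta ** 2   # variance explained by the variant
    ncp = n * h2 / (1.0 - h2)                  # non-centrality parameter
    shift = ncp ** 0.5                         # expected |z| under the alternative
    z_crit = -std_normal.inv_cdf(alpha / 2.0)  # two-sided significance threshold
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

if __name__ == "__main__":
    # Hypothetical scenario: a variant with 1% frequency and an effect of
    # 0.2 SD per allele, evaluated at several sample sizes.
    for n in (5_000, 20_000, 50_000, 100_000):
        print(n, single_variant_power(n, maf=0.01, beta=0.2))
```

The same function shows why isolated populations can help: if drift has raised the frequency of a variant, the variance it explains increases, and so does power at a fixed sample size.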

Determination of WGS approach

WGS-based studies can take various forms, for which the optimal choice will depend on several parameters including the population under study, the biological hypothesis investigated and the computational and financial resources available, with the cost of WGS being its largest disadvantage. These forms include WGS coupled to imputation in genome-wide association studies, cohort-wide low- or very low-depth WGS, and deep WGS (Box 1). When the interest is in related samples or under-represented populations, low- and very low-depth WGS approaches may be relatively more efficient than when examining well-studied populations. For example, Tran et al. performed low-depth sequencing to genetically describe the Vietnamese population and reported five disease-associated pathogenic variants with higher allelic frequencies than in other populations [23]. Low-depth sequencing has indeed been shown to be more efficient than classical imputation-coupled array designs in detecting GWAS signals, primarily due to a more complete assessment of genomic variation, especially in ancestries with poor coverage in existing imputation panels [24]. This was illustrated in a study from Gilly et al. in an isolated Greek population, where twice as many variants were detected using very low-depth WGS as with classical imputed array genotyping data, the vast majority of which were rare variants, leading to a twofold increase in the number of association signals [25]. Nevertheless, low-depth WGS shows decreased accuracy when studying rare variation (frequency lower than 1%) in the genome compared to low-frequency (frequency between 1 and 5%) and common variants. To identify such variation, medium-depth designs can be applied, or high-depth sequencing for detecting indels and ultra-rare variation, such as singletons, with high accuracy [26].
While high-depth WGS has proven to be useful, for example, in the study by Wessel et al., which described the contribution of rare non-coding variants to type 2 diabetes [12], it remains expensive, especially for large cohorts. A solution to perform high-depth sequencing at a lower cost would be to focus on coding parts of the genome by using whole exome sequencing (WES). The lower cost associated with WES would enable the inclusion of more individuals in the study and therefore an increase in the power to detect genetic variants that reside in genes and are associated with human complex traits, as illustrated in the study by Wang et al. [9]. Nevertheless, using WES instead of WGS misses genetic variation in the non-coding genome (or any gene with poor coverage in whole exome studies), which has been shown to play an important role in complex diseases [27]. Finally, emerging technologies such as long-read sequencing offer the possibility to access genome-wide structural variants, which have been found to have an impact on complex phenotypes, as highlighted by Beyter et al. for LDL cholesterol levels and height [28]. This approach provides additional advantages, such as easier assembly and mapping of genomes, but remains the most expensive sequencing technology, preventing its use in large cohorts.
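The link between sequencing depth and the ability to observe a rare allele in a single individual can be illustrated with a simple coverage model. The sketch below assumes that the number of reads covering a site follows a Poisson distribution with mean equal to the average depth, and that at a heterozygous site each read carries the alternate allele with probability one half. The depths shown, the requirement of at least two supporting reads and the function name are illustrative assumptions. The model ignores sequencing error and mapping bias, and it does not capture the joint calling and imputation-based refinement across samples that cohort-wide low-depth designs rely on.

```python
from math import exp, factorial

def prob_min_alt_reads(mean_depth, min_alt_reads=2):
    """Probability of seeing at least `min_alt_reads` alternate-allele reads
    at a heterozygous site, with alternate reads ~ Poisson(mean_depth / 2)."""
    lam = mean_depth / 2.0
    # Probability of observing fewer alternate reads than required.
    below = sum(exp(-lam) * lam ** i / factorial(i)
                for i in range(min_alt_reads))
    return 1.0 - below

if __name__ == "__main__":
    # Illustrative average depths, from very low to high.
    for depth in (1, 4, 15, 30):
        print(f"{depth}x:", prob_min_alt_reads(depth))
```

Under this model, a singleton is very likely to be missed at very low depth because no other sample carries the allele to support the call, which is consistent with the point above that ultra-rare variation calls for high-depth sequencing.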

Box 1 Overview of the different sequencing techniques currently available.

White boxes correspond to coding exons and thin black lines to sequencing reads. The sequencing depth is represented at the bottom of each graphic by brown shades. Pros and cons of the different depths and genome coverage are highlighted.

Disadvantages compared to genotype-based studies

Despite recent progress, the cost of WGS remains prohibitive for most large-scale studies. Genotyping coupled to imputation can retrieve most common variation in the genome [29]. The use of cost-efficient array-based technologies enables increasing sample sizes, which in turn results in the identification of further associations with common and low-frequency variants. One striking example is the study by Yengo et al. on over 5 million individuals, which has described all of the genetic heritability of height due to common variants [30]. The contribution of common variants to the genetic architecture of human complex traits is still not fully understood and array-based technologies will continue to be useful in filling this gap. In addition, while sequencing will continue to contribute to obtaining the whole picture of the genetic architecture of complex traits, it is likely that translation into the clinic to screen for polygenic risk will focus on array-genotyping approaches rather than on sequencing at first.

Perspectives and conclusion

WGS has made an important contribution to the understanding of genetics underlying complex traits, especially in under-represented populations, and through rare variation. Functional interpretation of association signals arising from WGS remains more challenging in the non-coding genome compared to the exome. Even if single-point associations have been described with rare and common variants in these regions, it is still arduous to biologically characterise these association signals. Combining association results from WGS with functional information at multiple levels, using, for example, other omics data such as transcriptomics, open chromatin, methylation, metabolomics or proteomics, has been shown to help in the interpretation of the associated signals [31]. Similarly, using RVAT in non-coding regions of the genome is not straightforward, despite affording higher power to detect genetic associations with rare variants [32]. Novel statistical methods are therefore needed which, for example, consider functional information across the non-coding genome [33,34]. WGS studies will remain useful in the future as a tool to explore the genetic underpinning of complex diseases, especially in combination with emerging functional data and their integration at multiple levels. As the cost of WGS is dropping and given the exciting prospect of long-read WGS at scale, these technologies will become increasingly accessible and will enable the description of genetic variation in hitherto understudied populations. As our understanding of the non-coding genome continues to improve, and with the further development of powerful methods to integrate functional information in rare variant association testing approaches, WGS will hopefully lead to a better and more accurate comprehension of complex diseases.
In the future, it is anticipated that WGS-informed clinical decisions and interventions will accelerate personalised medicine in the wider field of complex diseases, following recent successes in cancer and rare disease, such as monogenic forms of cardiomyopathy [35,36]. To achieve these goals in a globally equitable fashion, WGS of diverse populations should remain a high priority going forward.

Acknowledgements

The authors would like to acknowledge Kuan-Han Wu (Michigan State University) for his help with the Box 1 graphics. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 101017802 (OPTOMICS).

Author contributions

O.B., C.J.W. and E.Z. wrote and reviewed the manuscript.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work.

Competing interests

C.J.W. currently works at Regeneron Pharmaceuticals. The remaining authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. McCarthy S, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643.
  • 2. Sudlow C, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
  • 3. Taliun D, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y.
  • 4. Halldorsson BV, et al. The sequences of 150,119 genomes in the UK Biobank. Nature. 2022;607:732–740. doi: 10.1038/s41586-022-04965-x.
  • 5. Si Y, Vanderwerff B, Zollner S. Why are rare variants hard to impute? Coalescent models reveal theoretical limits in existing algorithms. Genetics. 2021;217. doi: 10.1093/genetics/iyab011.
  • 6. Zhao X, et al. Whole genome sequence analysis of pulmonary function and COPD in 19,996 multi-ethnic participants. Nat. Commun. 2020;11:5182. doi: 10.1038/s41467-020-18334-7.
  • 7. Benonisdottir S, et al. Sequence variants associating with urinary biomarkers. Hum. Mol. Genet. 2019;28:1199–1211. doi: 10.1093/hmg/ddy409.
  • 8. Gilly A, et al. Very low-depth sequencing in a founder population identifies a cardioprotective APOC3 signal missed by genome-wide imputation. Hum. Mol. Genet. 2016;25:2360–2365. doi: 10.1093/hmg/ddw088.
  • 9. Wang Q, et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature. 2021;597:527–532. doi: 10.1038/s41586-021-03855-y.
  • 10. Selvaraj MS, et al. Whole genome sequence analysis of blood lipid levels in >66,000 individuals. Nat. Commun. 2022;13:5995. doi: 10.1038/s41467-022-33510-7.
  • 11. Wainschtein P, et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat. Genet. 2022. doi: 10.1038/s41588-021-00997-7.
  • 12. Wessel J, et al. Rare non-coding variation identified by large scale whole genome sequencing reveals unexplained heritability of Type 2 diabetes. medRxiv. 2020. doi: 10.1101/2020.11.13.20221812.
  • 13. Weiner DJ, et al. Polygenic architecture of rare coding variation across 394,783 exomes. Nature. 2023;614:492–499. doi: 10.1038/s41586-022-05684-z.
  • 14. Kierczak M, et al. Contribution of rare whole-genome sequencing variants to plasma protein levels and the missing heritability. Nat. Commun. 2022;13:2532. doi: 10.1038/s41467-022-30208-8.
  • 15. Dornbos P, et al. A combined polygenic score of 21,293 rare and 22 common variants improves diabetes diagnosis based on hemoglobin A1C levels. Nat. Genet. 2022;54:1609–1614. doi: 10.1038/s41588-022-01200-1.
  • 16. Martin AR, et al. Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations. Am. J. Hum. Genet. 2021;108:656–668. doi: 10.1016/j.ajhg.2021.03.012.
  • 17. Gurdasani D, et al. Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell. 2019;179:984–1002.e1036. doi: 10.1016/j.cell.2019.10.004.
  • 18. Thorolfsdottir RB, et al. Coding variants in RPL3L and MYZAP increase risk of atrial fibrillation. Commun. Biol. 2018;1:68. doi: 10.1038/s42003-018-0068-9.
  • 19. Nielsen JB, et al. Loss-of-function genomic variants highlight potential therapeutic targets for cardiovascular disease. Nat. Commun. 2020;11:6417. doi: 10.1038/s41467-020-20086-3.
  • 20. Kurki MI, et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature. 2023;613:508–518. doi: 10.1038/s41586-022-05473-8.
  • 21. Purcell S, Cherny SS, Sham PC. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics. 2003;19:149–150. doi: 10.1093/bioinformatics/19.1.149.
  • 22. Li CI, Samuels DC, Zhao YY, Shyr Y, Guo Y. Power and sample size calculations for high-throughput sequencing-based experiments. Brief. Bioinform. 2018;19:1247–1255. doi: 10.1093/bib/bbx061.
  • 23. Tran NH, et al. Genetic profiling of Vietnamese population from large-scale genomic analysis of non-invasive prenatal testing data. Sci. Rep. 2020;10:19142. doi: 10.1038/s41598-020-76245-5.
  • 24. Li JH, Mazur CA, Berisa T, Pickrell JK. Low-pass sequencing increases the power of GWAS and decreases measurement error of polygenic risk scores compared to genotyping arrays. Genome Res. 2021;31:529–537. doi: 10.1101/gr.266486.120.
  • 25. Gilly A, et al. Very low-depth whole-genome sequencing in complex trait association studies. Bioinformatics. 2019;35:2555–2561. doi: 10.1093/bioinformatics/bty1032.
  • 26. Kishikawa T, et al. Empirical evaluation of variant calling accuracy using ultra-deep whole-genome sequencing data. Sci. Rep. 2019;9:1784. doi: 10.1038/s41598-018-38346-0.
  • 27. French JD, Edwards SL. The Role of Noncoding Variants in Heritable Disease. Trends Genet. 2020;36:880–891. doi: 10.1016/j.tig.2020.07.004.
  • 28. Beyter D, et al. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat. Genet. 2021;53:779–786. doi: 10.1038/s41588-021-00865-4.
  • 29. Hanks SC, et al. Extent to which array genotyping and imputation with large reference panels approximate deep whole-genome sequencing. Am. J. Hum. Genet. 2022;109:1653–1666. doi: 10.1016/j.ajhg.2022.07.012.
  • 30. Yengo L, et al. A saturated map of common genetic variants associated with human height. Nature. 2022;610:704–712. doi: 10.1038/s41586-022-05275-y.
  • 31. Steinberg J, et al. A molecular quantitative trait locus map for osteoarthritis. Nat. Commun. 2021;12:1309. doi: 10.1038/s41467-021-21593-7.
  • 32. Bocher O, Genin E. Rare variant association testing in the non-coding genome. Hum. Genet. 2020;139:1345–1362. doi: 10.1007/s00439-020-02190-y.
  • 33. Li X, et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 2020;52:969–983. doi: 10.1038/s41588-020-0676-4.
  • 34. Bocher O, et al. Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score. PLoS Genet. 2022;18:e1009923. doi: 10.1371/journal.pgen.1009923.
  • 35. Bagnall RD, et al. Whole Genome Sequencing Improves Outcomes of Genetic Testing in Patients With Hypertrophic Cardiomyopathy. J. Am. Coll. Cardiol. 2018;72:419–429. doi: 10.1016/j.jacc.2018.04.078.
  • 36. Priestley P, et al. Pan-cancer whole-genome analyses of metastatic solid tumours. Nature. 2019;575:210–216. doi: 10.1038/s41586-019-1689-y.
