Abstract
Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
High-throughput genome and exome sequencing has been foundational to advancing the understanding of disease aetiology, precision medicine and drug development1–7. From beginnings in rare diseases with monogenic or oligogenic architecture, which studied relatively small numbers of individuals or even single families8–10, rare variant discovery research has more recently encompassed common complex diseases, that is, those with polygenic architectures, by using substantial resources to recruit and analyse genomes at scale4,11–13. Genome-wide association studies (GWAS) for dichotomous outcomes use array-based or high-throughput sequencing technology to scan and compare the genomes of cases, recruited to have a certain condition, and controls, gathered for direct comparison. Specifically, by comparing the allele frequencies of a variant or the frequencies of rare alleles in a genetic region of interest between the ascertained cases and controls, GWAS aim to identify genetic variants or regions that are associated with the phenotype of interest. However, advances have not been equitably shared across ancestries, environments and conditions, exacerbating inequities in research, healthcare and drug development and leading to poorer understanding of the complete genomic landscape for all14–16. A multifaceted approach is urgently required to address this gap, including advances in sequencing technology to lower costs and enable the generation of large, high-quality studies of diverse genetic ancestries and environments. In the meantime, to extend the utility of existing and future large-scale sequencing studies, data can be leveraged to serve as a community resource for comparison with sequenced cases, an approach generally known as common controls.
Using common controls, rather than sequencing new controls for every study, can boost power to detect genotype-phenotype associations by increasing the sample size or providing a control set where none existed. One can distinguish between internal common controls, which were ascertained and processed as part of a single study that contains multiple case sets of different phenotypes17, and the more frequent external common controls, which were recruited as part of an unrelated study. Although using external common controls as the sole control group is more susceptible to bias and confounding compared with using internal controls, as described in more detail below, the practice is remarkably frequent18–36. Uses of external common controls as the sole control group include data analysis of cases recruited in a case-only20,22,31,33,37 or family design for rare diseases32,38, and to study germline associations in cancer for cases originally recruited to identify somatic mutations19,29,39–42. Large external common control data have also been used for more common conditions such as coronary artery disease, obesity, osteoporosis and schizophrenia18,34–36,43 and can be especially successful for the detection of rare, strong-acting alleles. For example, studies using common controls have led to the detection of rare loss-of-function variants in SETD1A associated with schizophrenia35 and of missense and protein-truncating variants in ATG4C associated with Crohn’s disease36, among other findings34. Interest in common controls has increased in fields outside genomic research as well, for example, for clinical trials44.
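The power gain from enlarging the control set can be made concrete with a widely used approximation for the effective sample size of a case–control comparison, 4 / (1/n_cases + 1/n_controls). The sketch below is illustrative only: the function name and the sample sizes are hypothetical and are not taken from any study cited here.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Approximate effective sample size of a case-control comparison.

    Uses the common approximation 4 / (1/n_cases + 1/n_controls), which
    equals n_cases + n_controls for a balanced design.
    """
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("n_cases and n_controls must be positive")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


if __name__ == "__main__":
    n_cases = 2_000  # hypothetical number of sequenced cases
    # Hypothetical control set sizes, from a matched set to a large external resource.
    for n_controls in (2_000, 10_000, 80_000):
        n_eff = effective_sample_size(n_cases, n_controls)
        print(f"cases={n_cases:>6,} controls={n_controls:>7,} effective N={n_eff:,.0f}")
```

Under this approximation the effective sample size can never exceed four times the number of cases, so adding controls gives diminishing returns, and the gain in precision does nothing to offset bias introduced by poorly matched controls.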
The use of common controls can free resources for functional work, careful phenotyping of cases and additional recruitment of cases from diverse ancestries and environments, helping to address a persistent lack of representation in genomic research45–48. Lack of representation in sequencing studies is compounded in conditions that are neither rare nor common. Given their polygenic and complex architecture, these uncommon conditions require large sample sizes but their relatively low prevalence (for example, 0.2–1.8% for vitiligo)49 makes mobilizing the necessary resources difficult (FIG. 1). In these situations, even large-scale biobanks and general population cohorts may not be adequately powered for discovery, which can be worsened by challenges in deriving precise phenotypes50–52 and stigma for certain phenotypes53. For example, in the 500,000 UK Biobank population cohort, only 2,000 cases of Crohn’s disease are available for genetic study, with disease phenotyping derived from a mixture of questionnaire and hospital inpatient record data, whereas focused disease case–control sequencing studies can gather more than 30,000 cases and 80,000 shared controls36. Even when feasible, gathering and sequencing controls for all conditions is not an optimal use of resources due to redundancy in control sets and limited recruitment from under-represented populations.
Fig. 1 |. Where to use common controls?

Optimal study design for a particular condition is determined, in part, by genetic architecture (y axis) and prevalence (x axis) of the condition. Small family studies and case groups are useful for rare, Mendelian conditions, whereas very large case–control studies can be used for common, polygenic conditions where sufficient resources can be garnered. Uncommon conditions with low to moderate prevalence are likely to be polygenic, necessitating large sample sizes. However, low prevalence of the condition may make recruiting sufficient sample sizes and resources difficult. Whereas common controls are useful across the genetic architecture and prevalence spectrum, common controls are particularly valuable for polygenic, uncommon conditions where sufficient resources may be lacking. CHD, coronary heart disease; T2DM, type 2 diabetes mellitus.
Whereas common controls hold great potential in both array-based GWAS17 and sequencing studies34–36,54, their robust use can pose challenges, with missteps caused by inadequate harmonization of batch effects30,31,55, mismatched ancestries19,37,42,56, incorrect filtering19,21,22,26,28,29,37,57–59 and insufficient documentation of data and methods for reproducible results20,24,28,32,33,37,40,60. These missteps can result in biased and incorrect results, especially when using external common controls as the sole control set. Careful harmonization, analyses and quality control are essential when employing external common controls, given the great potential for bias and confounding. Here, we discuss the data, study design, infrastructure and methods required to incorporate common controls in rigorous rare variant analyses to advance human genetics research. Throughout, we highlight two studies as exemplars of how to incorporate external common controls in a robust manner34,36 (BOX 1). Although the focus here is on high-throughput sequencing studies, much of the content is also applicable to array-based GWAS. Ultimately, careful use of common control data can be an important tool for improving statistical power, addressing the lack of data for understudied people or conditions and providing a more complete understanding of the entire human genetics landscape.
Box 1 |. Case studies for the exemplary use of common controls.
Throughout this Review, we provide two case studies as examples of thoughtful and robust use of common controls. Of note, both are samples of European ancestries. As discussed comprehensively elsewhere48,120,160, broader representation of ancestries is needed in genetic studies including in the creation of common control data sets.
Marenne et al.34 used the INTERVAL study and gnomAD data (TABLE 1) as common controls for severe childhood-onset obesity cases from the UK10K project. The UK10K project6 is an interesting example of a study where exome sequencing was performed for cases (that is, obesity, neurodevelopment and rare diseases) without internal controls, thus necessitating the use of external common controls34,35. The INTERVAL sample was used as the primary common control set and gnomAD was used as a second common control data set to complete partial replication of the top gene regions identified.
Sazonovs et al.36 used two case–control discovery samples and three case–control follow-up samples to identify rare variant genetic associations with Crohn’s disease. External common control data from the Centers for Common Disease Genomics (CCDG)151 were identified to match the sequencing technology and computational pipeline of the discovery case samples, and INTERVAL83 and the UK Biobank85 were used as external common controls for two of the follow-up case samples.
Data and study design
The use of common controls in genetic studies is unique in that the researcher has already defined a specific research question and ascertained cases. Selection of a common control set is therefore driven by case set characteristics, such as condition prevalence, genetic ancestry, sequencing technology and sample processing. Selecting controls according to these considerations reduces the potential for confounding, both measured and unmeasured. Beyond study characteristics, practical considerations such as accessibility, size and type of data (individual level or summary level) may constrain the utility of a common control data set. Below, we discuss aspects of the genetic data, participant demographics and relevant confounders to support appropriate selection of common controls such as those provided in TABLE 1.
Table 1 |.
Resources for high-throughput sequencing data sets that could be used as external common controls
| Resource^a | Size^b | Ancestries | Data^c | Permissions | Description |
|---|---|---|---|---|---|
| Individual-level data^d | | | | | |
| 1000 Genomes Project114 (AnVIL) | WGS/WES: 2,504 | 5 continental ancestries (African, East Asian, European, South Asian and admixed American) and 26 populations | – | None | Catalogue of human variation and genotype or sequencing data from self-identified ‘healthy’ individuals |
| CCDG151 (AnVIL) | WES: 198,831 WGS: 135,853 | ~25% each of the African, admixed American and European continental ancestries | – | Permissions obtained via dbGaP | Collection of case–control studies of cardiovascular, neuropsychiatric and immune-mediated diseases |
| Estonian Biobank 81 | WES: 3,000 WGS: 2,500 | Estonian (83%), Russian (14%) and other (3%) | Phenotypes | Access form required | Longitudinal, population-based cohort study of subjects recruited by general practitioners and hospital physicians |
| H3Africa 121 | WGS: 581 WES: 314 | African (50 ethnolinguistic groups from 13 African countries) | Phenotypes | Access form required | Population-based studies of common diseases in Africa, such as trypanosomiasis and paediatric HIV |
| HGDP 115 | WGS: 929 (828 online) | 54 populations | – | None | WGS Library of diverse indigenous populations |
| INTERVAL 83 | WES: 4,502 WGS: 5,592 | ~91% white British individuals | Phenotypes | Permissions obtained via EGA | Study of blood donation frequency across England |
| SGDP 152 | WGS: 300 | 142 populations (127 in publicly available data) | – | Access form required for 21 genomes | Deep genome sequences from smaller diverse populations |
| TOPMed84 (AnVIL, BioData Catalyst) | WGS: ~155,000 | 41% European, 31% African, 15% Hispanic/Latino, 9% Asian and 4% other/unknown | Phenotypes, ‘omics | Permissions obtained via dbGaP | More than 80 parent studies (prospective, case–control, family and case-only) focusing on heart, lung, blood and sleep disorders |
| UK Biobank 85 | WES: ~200,000 WGS: ~300,000 | ~84% white British individuals | Phenotypes | Registration, data application and fee required | Prospective, population-based cohort study collected from volunteers across the UK |
| Summary-level data | |||||
| dbGaP ALFA 153 | WES: 29,931 WGS: 25,478 | 12 populations | – | None | Allele frequencies for variants in dbGaP across approved unrestricted studies |
| FinnGen 87 | WGS: 3,775 | Finnish | Phenotypes | Access form required | A large public–private partnership aiming to collect and analyse genome and health data from 500,000 Finnish biobank participants (10% of population) |
| gnomAD v2.1 89 | WES: 125,748 WGS: 15,708 | 7 global and 7 subcontinental ancestries | – | None | Exome and genome sequencing data primarily from case–control studies of common adult-onset diseases |
| gnomAD v3.1 89 | WGS: 76,156 | 9 ancestries | – | None | Genome sequencing data primarily from case–control studies of common adult-onset diseases |
| jMorp 154 | WGS: 8,380 | Japanese | ‘Omics | Access form required for genotype frequency panel | Japanese reference panel of two prospective cohort studies (a population-based adult cohort and a birth and three-generation cohort) |
| SISuv4.1 155 | WES: 10,490 | Finnish | – | None | Finnish database of 13 cohorts with population controls and cases/controls from disease-specific studies |
| Taiwan Biobank 88 | WGS: 1,517 | Han Chinese | Phenotypes | None | Taiwanese database of healthy controls from cohort and case–control studies of local diseases |
| TOPMed BRAVO 84 | WGS: 132,345 | Multiple | – | Google login required | Variant browser for 705 million variants in TOPMed Freeze 8 |
| Summary and individual-level data^d | | | | | |
| All of Us103 (Researcher Workbench) | WGS: 98,640 | 49% European, 23% African, 15% Latino/Admixed American, 9% Other, 2% East Asian, 1% South Asian | Phenotypes | Individual: registration required through the All of Us Research Program | Effort to collect and study health data from a diverse and representative group of people in the USA |
| CSVS 90 | WGS/WES: 2,027 (individual: 267) | Spanish | Phenotypes | Summary level: access form required; individual level: permissions obtained via EGA | Crowdsourcing repository of Spanish genetic variability |
| GenomeAsia 100K 116 | WGS: 1,739 | 7 global regions (64 countries, 219 population groups) with ~80% Asian individuals | – | Individual level: access form required | WGS reference data set |
| Data repositories of individual-level data | |||||
| dbGaP91 (AnVIL, BioData Catalyst) | WES: ~500 studies WGS: ~300 studies | Multiple | Phenotypes | NIH DAC application required | Database of studies that investigate the interaction between genotypes and phenotypes in humans |
| EGA 92 | WES: ~1,200 studies WGS: ~1,050 studies | Multiple | Phenotypes | DAC application required | Archive of genetic and phenotypic data, mostly of various types of cancer |
DAC, data access committee; dbGaP, database of Genotypes and Phenotypes; EGA, European Genome–Phenome Archive; NIH, National Institutes of Health; WES, whole-exome sequencing; WGS, whole-genome sequencing.
^a Table limited to resources that can be widely accessed by researchers globally. Future resources156–158 or other potentially useful data sets that require consortia membership or collaboration159 are not included. AnVIL currently contains genomic data from eight consortia and more than 280,000 participants, and BioData Catalyst from some 155,000 participants in TOPMed along with the parent studies. Resources currently available in AnVIL, BioData Catalyst or Researcher Workbench are noted.
^b The number of individuals with WGS and/or WES is provided unless otherwise noted.
^c Additional data provided including phenotypes or other ‘omics data. All resources provide sex and ancestry information.
^d All individual-level data resources, except for GenomeAsia100K and the Collaborative Spanish Variability Server (CSVS), have FASTQ or BAM/CRAM files available for joint recalling. H3Africa and the UK Biobank also have gVCF files available.
Ascertainment of common controls
Common control data sets typically fall into one of three study designs: ascertained, convenience and population (FIG. 2). The relevant major difference among these designs is the expected prevalence of the outcome of interest or related conditions and its influence on findings, which is expanded upon below.
Fig. 2 |. Types of control samples.

Three regularly used types of controls include ascertained, convenience and population controls. a | Ascertained controls are specifically collected to exclude the condition and conditions related to case status. b | Relative ease of ascertainment or use is the defining factor of convenience samples. In human genetics, convenience samples may have been ascertained for another condition related to case status (blue people) and may also include unidentified cases (orange person). c,d | Population controls are a random sample of controls from the general population that often contain unidentified cases based on prevalence of the condition. Choosing which controls to use as common controls should be based on the study design and research question. For instance, population controls are appropriate for rare conditions (for example, prevalence <1%) (part d) but not for common conditions (for example, prevalence >20%) (part c), as the high proportion of unidentified cases for a common condition would affect the study results including power. Ascertained controls are ideal, but more difficult and expensive to collect compared with convenience sampling and population controls.
Ascertained controls.
Ascertained controls are gathered to exclude individuals with specific condition(s); the identification of a condition of interest occurs using, for example, hospital records, questionnaire responses, disease diagnosis or family history of disease. Additionally, controls should have had the opportunity to develop the disease but did not, especially with respect to age (for late-onset outcomes), sex (for outcomes disproportionately affecting a specific sex)43 and location (such as for geographic region-specific outcomes). For example, in a genetic association study of heart disease, participants ascertained to be ‘controls’ at age 55 years may very well become ‘cases’ by age 65 years. However, in this instance it is understood that differences in the age of onset do not imply differences in genetic exposure but may result in a decrease in power to detect association. A more extreme example is in studying host genetic responses to infectious disease, where it is optimal for cases and controls to have similar exposure to the infectious agent61. Although ascertained controls are the ideal comparison group due to the designed absence of the condition of interest, they are also the least available, especially for uncommon conditions for which deliberate exclusion was not the priority.
Convenience sampling.
Controls gathered through convenience sampling, such as from a biobank or a sample ascertained for a different disease, are much more widely available, but may contain individuals with the outcome of interest or relevant potential confounders. An interesting example of this is the UK Biobank, which contains healthier and wealthier individuals than the general UK population, and thus may differ in related features from case samples not drawn from the UK Biobank62,63. Depending on the ascertained phenotype of the convenience sample, both the cases and the controls from the original study may be suitable as common controls in a new study. The result of using a convenience sample ascertained for a different condition from the cases is often not straightforward, with effect sizes inflated or attenuated depending on allele frequency deviations63,64. As such, identifying a reasonable and unbiased model can be difficult.
Population controls.
Population controls, namely individuals from the general population, are not ascertained with respect to any particular outcome; they are easier and less expensive to recruit compared with ascertained controls but will often contain identified or unidentified cases. The potential for misclassification as a control, when in fact they are an unidentified case, will depend on the prevalence of the condition. At one end of the spectrum, conditions of low prevalence will have a small proportion of unidentified cases in the control set, which is unlikely to greatly affect the results of the study. For example, the Marenne et al. study of severe childhood obesity used INTERVAL, a population-based cohort, as external common controls34 (BOX 1). Although it is likely that the INTERVAL sample contained individuals who, when they were children, met the criteria to be cases, the number is likely to be quite small given an estimated prevalence of severe childhood-onset obesity of 0.15%65. As the prevalence of conditions increases towards common conditions (for example, prevalence >20%)66, the misclassification of cases as population controls will increasingly result in lower power and attenuated effect estimates67. In rare variant association studies, the reduction of power is likely to be more pronounced. Of course, the increase in control sample size could make up for the inefficiency, but not the bias of the effect estimates towards the null. To reduce bias and increase power, using a control sample with fewer unascertained cases or case-related traits is preferable, particularly for common conditions.
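The attenuation described above can be illustrated with a small worked calculation. The sketch assumes that population controls are a random sample of the general population, so that the fraction of unidentified cases among them equals the prevalence of the condition, and it works with a single variant and an allelic odds ratio. The function names and all numerical inputs other than the 0.15% and 20% prevalences mentioned in the text are hypothetical.

```python
def odds(p: float) -> float:
    """Convert a frequency in (0, 1) to odds."""
    return p / (1.0 - p)


def observed_odds_ratio(p_noncase: float, true_or: float, prevalence: float) -> float:
    """Allelic odds ratio observed when population controls contain unidentified cases.

    p_noncase  : allele frequency among true non-cases (assumed value)
    true_or    : allelic odds ratio comparing cases with true non-cases
    prevalence : fraction of the population controls that are unidentified cases
    """
    # Allele frequency in cases implied by the true odds ratio.
    case_odds = true_or * odds(p_noncase)
    p_case = case_odds / (1.0 + case_odds)
    # Population controls are a mixture of true non-cases and unidentified cases.
    p_population = (1.0 - prevalence) * p_noncase + prevalence * p_case
    return odds(p_case) / odds(p_population)


if __name__ == "__main__":
    # Hypothetical rare variant (1% allele frequency) with a true odds ratio of 2.
    for prevalence in (0.0, 0.0015, 0.20):
        or_obs = observed_odds_ratio(0.01, 2.0, prevalence)
        print(f"prevalence={prevalence:.4f} observed OR={or_obs:.3f}")
```

With no unidentified cases the observed odds ratio equals the true value; as prevalence rises the estimate moves towards 1, which a larger control sample cannot correct.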
Comparability of cases and common controls
Differences that are unaccounted for in sample characteristics (for example, demographics such as age and sex, genotyping or sequencing, genetic ancestries) between cases and controls can result in substantial confounding and bias. For example, the Exome Sequencing Project (ESP) used two sequencing centres and found that differences in reagents and local analysis pipelines resulted in unexpected batch effects68. An additional classic example is a study of exceptional longevity, which originally failed to fully correct for differences in genotyping platforms and found spurious associations55,69. Whereas many association methods can adjust for some differences, case attributes must be represented in the common control data to be able to use these methods. This is especially important for genetic ancestries. For instance, although ancestry inference methods can be used to adjust for differences among African American admixed samples, there are limits to their applicability; an African American sample cannot be robustly compared with a sample with only European ancestries. A failure to adequately match and adjust for genetic ancestry can result in population stratification with an increase in both false positives and false negatives due to genetic differences unrelated to disease risk. Classic examples of spurious associations as a result of population stratification include height with European ancestry70, type 2 diabetes mellitus (T2DM) with Native American ancestry71,72 and asthma with Indigenous ancestry in Latinx groups72,73, among others70,72,74,75. Matching by fine-scale ancestry in addition to continental ancestry is especially important for rare variants, which are more likely to segregate on a geographically smaller scale76–78 (FIG. 3a–d). 
Similarly, although rare variant methods have been developed to use common control data sequenced and generated with different technology and computational pipelines (TABLE 2), matching as closely as possible will reduce bias and increase coverage of the genome (FIG. 3e,f). Both Marenne et al. and Sazonovs et al. limit common controls to those with European ancestries and match closely by sequencing technology (BOX 1). Indeed, Sazonovs et al. identify and use several common control data sets, including INTERVAL, the Centers for Common Disease Genomics (CCDG) and the UK Biobank, to match the technology and sequencing centres for four Crohn’s disease case sets.
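The population stratification described above can be sketched with a simple calculation in which a variant has no effect on disease risk, yet the pooled case and control samples differ in their ancestry composition. The allele frequencies, group proportions and function names below are hypothetical and chosen only for illustration.

```python
def pooled_frequency(group_freqs, group_props):
    """Allele frequency in a sample pooled across ancestry groups."""
    if abs(sum(group_props) - 1.0) > 1e-9:
        raise ValueError("group proportions must sum to 1")
    return sum(f * w for f, w in zip(group_freqs, group_props))


def spurious_odds_ratio(group_freqs, case_props, control_props):
    """Allelic odds ratio produced purely by ancestry mismatch.

    The variant is assumed to have no effect on disease: within each ancestry
    group, cases and controls share the same allele frequency.
    """
    p_case = pooled_frequency(group_freqs, case_props)
    p_control = pooled_frequency(group_freqs, control_props)
    return (p_case / (1.0 - p_case)) / (p_control / (1.0 - p_control))


if __name__ == "__main__":
    # Hypothetical rare variant: 2% in ancestry group A, 0.1% in group B.
    freqs = [0.02, 0.001]
    matched = spurious_odds_ratio(freqs, [0.5, 0.5], [0.5, 0.5])
    mismatched = spurious_odds_ratio(freqs, [0.5, 0.5], [0.9, 0.1])
    print(f"matched ancestry proportions:    OR={matched:.3f}")
    print(f"mismatched ancestry proportions: OR={mismatched:.3f}")
```

When the proportions match, the odds ratio is 1 as expected; when external controls over-represent the group in which the allele is more frequent, the variant appears protective even though it is unrelated to disease.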
Fig. 3 |. Types of bias that could affect common control analyses.

Differences between cases and controls not due to case status, such as differences in ancestry, coverage or processing, could result in confounding and lead to inaccurate conclusions. a | Allele frequencies (here for rs17578381) can differ greatly both between continental-level regions and within more fine-scale regions, requiring careful matching or adjustment of ancestry. b | Some of this matching can be conducted using principal components (PCs). However, additional attention must be paid beyond the first PCs to ensure fine-scale substructure is accounted for. c | This substructure can occur within continental-level groupings or self-identified racial categories, such as within Asia and even within East Asia. d | Differences can also occur within a region due to admixture proportion differences between groups, whether two-way or three-way admixture. e | Coverage can differ in which part of the genome is sequenced (for example, genome versus exome) and in number of sequencing reads (for example, high depth or low depth). Type of coverage determines how many and which variants are detected. f | Processing computational pipelines can differ in number and type of steps such as the variant calling algorithm, which can lead to differences in the variants detected in the processed samples. To reduce ancestry, coverage or processing biases, cases and external common controls should be harmonized prior to analysis. AFR, Africa; AME, Americas; EUR, Europe.
Table 2 |.
Case–control association methods or frameworks that incorporate external common controls
| Method | Method type | Internal controls | Data type | Covariates |
|---|---|---|---|---|
| Individual-level data | ||||
| Chen and Lin128^a | Single variant | No | Sequencing | Yes |
| iECAT Score test127 | Single variant | Yes | Array | Yes |
| Summary-level data | ||||
| iECAT-O94^b | Optimal combination of burden and variance component | Yes | Array or sequencing | No |
| ProxECAT95^c | Burden | No | Sequencing | No |
| RV-EXCALIBER98 | Harmonization framework for burden test | No | Sequencing | No |
| TRAPD96 | Filtering/harmonization framework for burden test | No | Sequencing | No |
^a Requires variant depth and quality information.
^b Optimal in the presence of moderate to large single-variant confounding and requires the internal sample minor allele count to be greater than two.
^c Optimal for very rare variants and can use variants with a minor allele count greater than zero.
Curation of phenotypic data
Well-curated and standardized phenotypic information, which can enable optimal choice of common control data, removal of cases and adjustment or assessment of environmental factors, is key to supporting interoperability and usefulness of common control data. Electronic health record data, such as from biobanks, have standards for the translation of International Classification of Disease (ICD) codes into standard codes, such as with the Observational Medical Outcomes Partnership (OMOP)79 and the ICD-9/ICD-10-compatible phecodes80. However, many data harmonization efforts are focused on one or a handful of conditions, whereas harmonization for common control data must be broadly focused to enable interoperability with various conditions. To enable broad utility of common control data, information relating to age, sex, ancestry and chronic conditions should be standard inclusions. In general, there are two main categories of common control data: summary level (such as allele frequencies) and individual level. For individual-level data, information should be included at the subject level, whereas descriptive statistics (for example, five-number summary, mean, standard deviation) can be provided for summary-level data. Importantly, reporting of chronic conditions enables cases or individuals with case-related traits to be removed from the common control data set.
A high degree of phenotypic metadata and study detail is essential to ensure widespread interoperability of common control data, not to mention an appreciation (for example, funding and incentives) for the efforts needed to tailor and maintain common control data sets for broad use (BOX 2). Of note, the Estonian Biobank81, H3Africa82, INTERVAL83, TOPMed84 and the UK Biobank85 are examples of individual-level common control data sets that have well-curated phenotypic data to enable broad use as common controls. To enable easier identification of large-scale sequencing data sets with deep phenotype information, Gutierrez-Sacristan et al. provide a dynamic catalogue from which users can identify data with desired characteristics86. FinnGen87, the Taiwan Biobank88, gnomAD89 and the Collaborative Spanish Variability Server (CSVS)90 are summary-level common control data sets that provide summary statistics or grouping of age, sex, ancestry and common conditions. For instance, gnomAD v2.1 provides a ‘control’ subset where individuals recruited as cases are excluded, and the CSVS enables filtering to provide allele frequency excluding selected conditions.
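As a minimal sketch of how curated phenotypes support the removal of cases and the reporting of descriptive statistics discussed in this section, the example below filters a toy individual-level control set by recorded condition codes and then summarizes age and sex. The record layout, identifiers and function names are hypothetical and do not correspond to the format of any resource in TABLE 1; the codes are used purely as illustrative ICD-10-style labels.

```python
from statistics import mean, stdev

# Hypothetical individual-level phenotype records for an external control set.
controls = [
    {"id": "C1", "age": 54, "sex": "F", "conditions": {"I10"}},
    {"id": "C2", "age": 61, "sex": "M", "conditions": {"K50"}},  # case-related code
    {"id": "C3", "age": 47, "sex": "F", "conditions": set()},
    {"id": "C4", "age": 39, "sex": "M", "conditions": {"E11"}},
]


def remove_case_related(individuals, excluded_codes):
    """Keep only individuals with none of the excluded condition codes."""
    return [ind for ind in individuals if not (ind["conditions"] & excluded_codes)]


def summarise(individuals):
    """Descriptive statistics of the kind suggested for summary-level release."""
    ages = [ind["age"] for ind in individuals]
    return {
        "n": len(individuals),
        "mean_age": mean(ages),
        "sd_age": stdev(ages) if len(ages) > 1 else 0.0,
        "n_female": sum(ind["sex"] == "F" for ind in individuals),
    }


if __name__ == "__main__":
    # Exclude controls recorded with the condition under study (illustrative code).
    kept = remove_case_related(controls, excluded_codes={"K50"})
    print([ind["id"] for ind in kept])
    print(summarise(kept))
```

This kind of filtering is only possible when chronic conditions are recorded per individual; for summary-level resources the equivalent step relies on the provider releasing pre-filtered subsets.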
Box 2 |. Considerations for the generation of new common control resources.
Creation of broadly useful common control data sets and infrastructure to support their use requires careful consideration. Here, we outline some guiding principles to consider when establishing new common control resources.
Metadata. There are resource and time costs in working with a new data set. Detailed descriptions are valuable both in helping researchers decide whether to use a particular data set and when harmonizing the chosen data. Helpful information includes methods and procedures used in creating the data, sequencing technology, coverage, variant calling algorithm, recruitment details, case definition, inclusion criteria and ancestry description, among others.
Variant-level quality control metrics. Variant-level quality control metrics enable consistent quality control decisions across case data sets prior to harmonization with the common control data set.
Individual-level quality control metrics. Similar to variant-level quality control metrics, individual-level quality control metrics also enable consistent quality control decisions between the common control data set and case data sets. For summary data, where individual-level metrics are not available, summary-level information pertaining to individual filters and distributions of quality control metrics should be provided.
Rich phenotype and covariate data. Incorporating available phenotypes and covariate information beyond just a few demographics can enable identification and removal of cases or related conditions from the common control data set as well as controlling for or evaluating the role of other conditions or environments. Age, sex, ancestry and known chronic conditions should be standard variables in common control data sets of individual-level data. For summary data, descriptive statistics of these variables should be provided and data should be grouped when possible (for example, by ancestry or condition) to reduce heterogeneity in allele frequencies.
Broad consent and sharing standards. Studies should seek broad consent, such as no restrictions (NRES), general research use (GRU) or health/medical/biomedical use (HMB), to allow their data to be readily used as common controls. Consents that specify disease-specific research, while not impeding the original investigators, make the subsequent data of little use for common controls. Additional conditions, such as letters of collaboration, add to research friction, particularly when investigators need to combine several studies. Restrictions on who can use the data (for example, only not-for-profit organizations) also reduce the value of the resource161.
FAIR principles. FAIR principles state that data should be Findable, Accessible, Interoperable and Reusable for both humans and machines162. These principles are central to the utility of common control data, which must be useful for other researchers and multiple case data sets.
Intermediately processed data. When storage costs allow for it, inclusion of intermediately processed data, such as gVCF files105, will allow for more efficient joint recalling of cases and controls together, which can improve variant calls and harmonization.
Representation. Common control data sets and the researchers working to create new data should be representative of the worldwide population. Such diversity helps ensure that challenges relating to a broad set of conditions, environments and ancestries are considered early on.
Funding. Many of the recommendations listed above require additional time and resources, often above and beyond a study’s primary mandate. As such, funding and incentives are essential to support the creation and maintenance of high-quality common control data.
Accessibility and usability of common control data
Individual-level data enable better harmonization and analysis between cases and common controls. However, individual-level data often have more barriers to access, in order to maintain participant privacy according to existing consents, and require significantly more resources to use than summary-level data. Indeed, although individual genetic data are often accessible from a central database (for instance, the database of Genotypes and Phenotypes (dbGaP)91 or the European Genome-Phenome Archive (EGA)92), they can often be missing important metadata, be in a less processed state or be divided into sub-data sets that can be difficult to combine. These aspects, combined with harmonization and processing of individual-level data, necessitate a large amount of resources (for example, person-time, computing and data storage/transfer cost) for appropriate use. Numerous common control data sets are hosted on cloud computing environments to minimize these hurdles (see Infrastructure). For example, the UK Biobank Research Analysis Platform enables researchers to access data from a centralized location and although there are costs, they are tiered depending on computational and financial needs, with discounted rates for student researchers or those from low and middle-income countries93.
Conversely, summary-level data are often readily available for download, have undergone extensive quality control and processing, have few to no barriers to access and require fewer resources for storage, transfer and analysis. However, summary statistics, such as allele frequencies, can mask heterogeneity and stratification within and between samples, including population structure, sample recruitment and processing. These differences can cause severe batch effects between cases and external common controls, resulting in biased results. Additionally, adjusting for covariates is either more difficult or impossible. Although some methods, such as iECAT-O94 and ProxECAT95 (TABLE 2), have been developed to incorporate allele frequency data while reducing the potential for bias94–98, more thorough validation and replication of results are necessary than with individual-level data. There are proposed intermediate frameworks where case data are uploaded to a central location and allele frequencies from matched individual-level common control data by ancestry and other covariates are returned; however, these frameworks are not yet widely available99. The CSVS90 is an especially interesting example of how to collect and distribute sequencing data. The CSVS crowdsourcing initiative encourages genomic projects and consortia across Spain to submit whole-exome sequencing (WES) and whole-genome sequencing (WGS) data, resulting in a database of allele frequency information with the ability to interactively exclude chronic disease subgroups90.
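As a worked illustration of how pooled allele frequencies can mask stratification, the minimal sketch below uses made-up numbers: two hypothetical ancestry groups in which a variant has frequencies of 0.10 and 0.30, with no difference between cases and controls within either group. The function name and all values are assumptions chosen for illustration only.

```python
def pooled_af(group_afs, group_weights):
    """Allele frequency of a pooled sample: weighted mean of group-specific frequencies."""
    total = sum(group_weights)
    return sum(af * w for af, w in zip(group_afs, group_weights)) / total


# Hypothetical values: within each ancestry group the variant has the same
# frequency in cases and controls, so there is no true association.
group_afs = [0.10, 0.30]

# Assumed sample composition: cases are 80% group A and 20% group B,
# whereas the external controls are 20% group A and 80% group B.
case_af = pooled_af(group_afs, [0.8, 0.2])
control_af = pooled_af(group_afs, [0.2, 0.8])

print(f"pooled case AF    = {case_af:.2f}")
print(f"pooled control AF = {control_af:.2f}")
```

In this toy setting the pooled frequencies differ (0.14 in cases versus 0.26 in controls) solely because the group proportions differ. A single pooled summary statistic can create this kind of spurious signal, whereas grouped or ancestry-adjusted frequencies are intended to avoid it.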
Lastly, when choosing a common control data set, close attention should be paid to the consent type of each contributing study. Only data labelled as no restrictions (NRES), general research use (GRU) or health/medical/biomedical research use (HMB) can be used as common controls for various phenotypes. (Studies with disease-specific consents can, of course, be used as common controls for the matching disease.) Because of this, studies that obtain broad consents are far more useful sources of common controls than those with narrower consents (BOX 2).
Infrastructure
To ensure the broad, robust and equitable use of common controls, infrastructure must be secure, easy to use and widely accessible. In addition to traditional infrastructure, such as data storage, transfer and computing, infrastructure to support educational training to use common control resources is needed.
Common controls necessitate two primary areas of computational infrastructure: storage and maintenance of common control data sets in broadly accessible locations; and environments and workflows to bring together and analyse common control and case data (FIG. 4). Depending on data permissions, size and workflow, researchers often use a combination of local and cloud computing. Cloud computing is an increasingly widespread alternative to local computing. Options range from a predefined, limited workspace where a specific task is performed (for example, the TOPMed Imputation Server)84,100 to flexible user environments that store data and modifiable analysis pipelines. Examples of the latter include the National Human Genome Research Institute’s (NHGRI’s) AnVIL101, which includes data from the CCDG; the National Heart, Lung, and Blood Institute’s (NHLBI’s) BioData Catalyst102, which includes data from TOPMed; and the National Institutes of Health’s (NIH’s) Researcher Workbench103.
Fig. 4 | Common control analysis workflow and example infrastructure with AnVIL.

a | Case–control analysis begins by identifying a research question and associated condition of interest. Collection and processing of cases and potentially internal controls are then conducted. Processing steps include sequencing, variant calling, imputation and quality control. External common controls are chosen to match cases as closely as possible on potential confounders. Internal and external data are then brought together in a computing environment. If utilizing infrastructure such as AnVIL, accounts need to be created for the Terra cloud computing platform and the Gen3 data platform. A Terra Billing Project is created and linked to a Google Billing Project, or access to a specific billing project is requested. Resources such as the AnVIL Dataset Catalog or Gen3’s Data Explorer can be used to search for common control data. Additionally, open (for example, 1000 Genomes), controlled (for example, database of Genotypes and Phenotypes (dbGaP)) or consortium access data can be requested and used. b | Cases and common controls should be harmonized prior to analysis to reduce bias. The Terra library or Dockstore, which contains the Broad Methods Repository and the GATK Best Practices Toolkit, can be used to find workflows. An analysis method, either single-variant or region-based, can be implemented and post-analysis quality control performed. A Jupyter Notebook can be created for interactive and collaborative analysis using Python3 or R/Bioconductor, or a cloud environment can be created for Galaxy. Harmonization, analysis and post-analysis quality control steps should be iterated and updated until batch effects are no longer evident. c | Results are verified and contextualized within the limitations of a common control study. Reproducibility and transparency are supported by making the code and harmonization pipeline publicly available (for example, on GitHub) and by citing methods and processing steps. Furthermore, data and results (for example, test statistics) should be provided to the research community through a publicly available portal such as the genome-wide association studies (GWAS) catalogue for open access and dbGaP or a consortium website for controlled access. For a more detailed tutorial on AnVIL infrastructure, see https://anvilproject.org/learn. MAF, minor allele frequency; QQ, quantile–quantile. Logos reprinted with permission from AnVIL (https://anvilproject.org/data); Dockstore (https://cancercollaboratory.org/services-dockstore); GATK (https://gatk.broadinstitute.org/hc/en-us/categories/360002310591); Terra (Geraldine Van der Auwera at the Broad Institute); Broad Institute (https://www.broadinstitute.org/journalists/logos-graphics); dbGaP (https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap/); Bioconductor (https://www.bioconductor.org/about/logo/); GWAS Catalogue (https://www.ebi.ac.uk/gwas/). Jupyter is reprinted with permission from Jupyter (https://commons.wikimedia.org/wiki/File:Jupyter_logo.svg), CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
The advantage of the cloud in ‘bringing the user to the data’ can be especially useful for common control studies where large control data sets can be indexed and stored in a central location that is available to authorized users, avoiding redundancy as well as improving accessibility for groups lacking in-house computational resources101,102,104,105. Users can upload their own data to integrate and analyse with common control data using automated analysis workflows optimized for cloud environments and made available in cloud repositories106. An alternative workflow uses the cloud environment for selecting common control data and local computing for analysis.
Harmonization, quality control and association analysis
Following the selection of an appropriate common control data set, the next step is to implement and iteratively calibrate a harmonization, quality control and analysis pipeline (FIG. 4) to the specific case and common control data sets. Although iterative pipeline optimization is useful for any large-scale genomic analysis with batches, it is especially necessary when using external common controls. Without careful harmonization, analysis and quality control, systematic differences (FIG. 3) can result in bias and an increased rate of both false positives and false negatives.
Harmonization and quality control
To ensure that cases and controls are comparable with one another, they must be harmonized prior to analysis with respect to both sample-level and variant-level features. Case sample quality control can be performed before or during harmonization using well-established steps for genetic data64,107 and standard software such as PLINK108,109. The level of harmonization needed to reduce bias will depend on the extent of the differences in sequencing, such as scope (whole exome versus whole genome) or depth of coverage, processing and recruitment between cases and external common controls. If external common control data have been preprocessed, the entire quality control process should be well documented, including relevant variant-level and individual-level quality control metadata, so that users can assess and match the quality control performed (BOX 2). Here, we detail considerations for both sample-level and variant-level quality control, with current best practices for reducing bias.
Sample-level harmonization and quality control.
Identical quality control filters should be applied to common controls and cases to ensure consistency between data sets. This includes standard quality control procedures such as removing individuals with poor sequencing quality, a low proportion of variants with a genotype call (that is, a low sample call rate) or high contamination, among other well-documented filters64,107. Covariates and outcome definitions, where available, should be defined and harmonized to ensure comparability. In addition to choosing a common control data set that contains the genetic ancestries of the case data, special consideration must be given to closely matching genetic ancestry at both the continental and the regional level to address fine-scale substructure (FIG. 3a–d).
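One way to make the requirement for identical filters concrete is to define the thresholds once and run cases and controls through the same function. The sketch below is a minimal illustration under stated assumptions: the record fields, the threshold values (0.95 call rate, 0.03 contamination) and the sample identifiers are all hypothetical, and real pipelines would typically rely on established tools and the documented filters cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleQC:
    sample_id: str
    call_rate: float      # fraction of sites with a genotype call in this sample
    contamination: float  # estimated contamination fraction


# Hypothetical thresholds, defined once so that the filters applied to
# cases and to common controls cannot drift apart.
MIN_CALL_RATE = 0.95
MAX_CONTAMINATION = 0.03


def passes_sample_qc(sample):
    """Return True if the sample meets every shared threshold."""
    return (sample.call_rate >= MIN_CALL_RATE
            and sample.contamination <= MAX_CONTAMINATION)


def apply_sample_qc(samples):
    """Apply the same filter regardless of which data set the samples come from."""
    return [s for s in samples if passes_sample_qc(s)]


# Toy records (made up for illustration).
cases = [SampleQC("case1", 0.99, 0.01), SampleQC("case2", 0.90, 0.01)]
controls = [SampleQC("ctrl1", 0.98, 0.00), SampleQC("ctrl2", 0.99, 0.08)]

kept_cases = apply_sample_qc(cases)
kept_controls = apply_sample_qc(controls)
print([s.sample_id for s in kept_cases], [s.sample_id for s in kept_controls])
```

Keeping the thresholds in a single place is a design choice rather than a requirement; the point is only that any change to a filter propagates to both data sets at once.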
For individual-level data, alignment of genome-wide (global) ancestry for cases and controls can be done with methods such as ADMIXTURE110, which estimates the proportions of genetic ancestry in each individual, or through projection methods111,112, such as principal component (PC) analysis113, with or without diverse ancestry reference data sets such as the 1000 Genomes Project114, the Human Genome Diversity Project (HGDP)115 and GenomeAsia116. As ADMIXTURE estimates the proportion of the genome from a specific discretized ancestry, it is not able to provide insights beyond this often continental-level classification, such as subcontinental or subregional fine-scale structure. Therefore, it may be ideal to closely match cases and controls using classification methods that leverage continuous measures of ancestry, such as the PCs that explain a substantial amount of variation. This strategy can also be used for admixed populations as long as the PCs used for matching are informative for ancestry proportions. One classic example of the importance of matching by ancestry was demonstrated in a study of asthma in Latino participants, in which participants needed to be matched on both a subcontinental level (Mexican and Puerto Rican) and also by ancestry proportions within groups to adequately control for population stratification and false positives73. For analyses that focus on a specific genomic region in admixed populations, local ancestry can be estimated using methods such as RFMix117 or Gnomix118. Estimation of ancestry in summary-level data is less established and often limited to incomplete matching of cases and common controls by reported race, ethnicity and/or genetic ancestry, which can result in removal of non-homogeneous groups from analysis or residual population stratification. 
More recently, Summix119 has enabled genetic ancestry estimation and adjustment of allele frequencies while requiring only summary-level reference data, which begins to address this limitation.
Using individual-level data, both Marenne et al. and Sazonovs et al. (BOX 1) used PCs to assess concordance of genetic ancestry between cases and common controls34,36. Sazonovs et al. further used random forest on PCs generated from 1000 Genomes Phase 3 reference data to classify UK Biobank samples into broad genetic ancestry groups (Europe, Africa, South Asia, East Asia, admixed). Samples classified as European ancestry were retained.
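The idea of matching on continuous ancestry measures can be sketched with a simple greedy nearest-neighbour match in PC space. This is an illustrative stand-in, not the procedure used by Marenne et al. or Sazonovs et al.; the function name, the distance cut-off (`max_dist`) and the PC coordinates are hypothetical, and the sketch assumes that cases and controls have already been projected onto the same PCs.

```python
import math


def match_controls(case_pcs, control_pcs, max_dist):
    """Greedy 1:1 matching of each case to its nearest unused control in PC space.

    case_pcs and control_pcs map sample IDs to tuples of PC coordinates.
    Cases with no unused control within max_dist are left unmatched.
    """
    unused = dict(control_pcs)
    matches = {}
    for case_id, pcs in case_pcs.items():
        if not unused:
            break
        nearest = min(unused, key=lambda c: math.dist(pcs, unused[c]))
        if math.dist(pcs, unused[nearest]) <= max_dist:
            matches[case_id] = nearest
            del unused[nearest]
    return matches


# Toy coordinates on the first two PCs (made up for illustration).
case_pcs = {"case1": (0.01, 0.02), "case2": (0.50, -0.10), "case3": (0.90, 0.90)}
control_pcs = {"ctrlA": (0.012, 0.018), "ctrlB": (0.30, 0.40), "ctrlC": (0.49, -0.11)}

matches = match_controls(case_pcs, control_pcs, max_dist=0.05)
print(matches)
```

In this toy example, `case3` has no control within the cut-off and is left unmatched, which mirrors the trade-off between close ancestry matching and retained sample size. Greedy matching depends on the order of the cases; optimal matching algorithms exist but are beyond this sketch.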
Variant-level harmonization and quality control.
Large differences in sequencing (or genotyping) technology and processing (FIG. 3e,f) require a greater degree of harmonization and likely the use of analysis methods developed explicitly for use with external common controls (TABLE 2). Genetic variants or regions with poor or differential sequencing quality between cases and controls64,96 should be removed. These can be identified using variant quality metrics, such as genotype quality, Hardy–Weinberg equilibrium (such as P_HWE < 10⁻⁴)120, variant quality score log-odds, depth of coverage and others64. Removal of genetic variants or regions can be performed with both individual-level and summary-level data if adequate variant quality control metrics are detailed.
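As an example of one such variant-level metric, the sketch below computes a Hardy–Weinberg P value from genotype counts using a one-degree-of-freedom chi-square goodness-of-fit test and applies the 10⁻⁴ threshold mentioned above. It is a simplified illustration: the function name and genotype counts are hypothetical, and exact tests are generally preferred when genotype counts are small, as is typical for rare variants.

```python
import math

HWE_P_THRESHOLD = 1e-4  # threshold quoted in the text


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) P value for departure from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic site: the test is uninformative
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chisq / 2))


# Toy genotype counts (made up): one site in equilibrium, one with no heterozygotes.
p_ok = hwe_chisq_pvalue(810, 180, 10)
p_bad = hwe_chisq_pvalue(900, 0, 100)

for label, p_value in (("in equilibrium", p_ok), ("no heterozygotes", p_bad)):
    decision = "keep" if p_value >= HWE_P_THRESHOLD else "remove"
    print(f"{label}: {decision}")
```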
A benefit of individual-level data is the possibility to improve variant calls through recalling cases and common controls together34,35. Both Marenne et al.34 and Sazonovs et al.36 performed individual-level intermediate variant calling to produce gVCF files and then performed joint variant calling across case and common control samples (BOX 1). The feasibility of jointly calling cases and common controls together depends on the computational resources required and available. Larger sample sizes and whole-genome sequenced samples will require more resources, as will less processed files. For instance, BAM files require more computational resources for storage and processing compared with gVCF files. All individual-level data resources provided in TABLE 1, except for GenomeAsia100K116 and the CSVS90, have FASTQ or BAM/CRAM files available for joint recalling. H3Africa121 and the UK Biobank85 also have gVCF files available. For common, low-frequency and, increasingly, rare variants, variant calls can be improved further with imputation of individual-level data using resources such as the TOPMed Imputation Server84,100.
Filtering to rare alleles.
Rare variant association tests use filters or weights by minor allele frequency (MAF). Although there are several appropriate filtering designs, it is essential that the same criteria be applied to both cases and controls64. For instance, filtering using only the common control data set, but not the case data set, will remove all variants above the MAF threshold in controls, but not in cases, resulting in spurious associations64. The ideal method for filtering to rare variants is to utilize an independent, well-matched (that is, by ancestry and sequencing technology and processing) external data set. Importantly, some common control data sets contain data that researchers might use to filter. For instance, gnomAD contains 1000 Genomes Project114 and NHLBI ESP122 data. As such, ideally, neither the 1000 Genomes Project nor the ESP should be used for filtering when gnomAD is used as the common control data set. When a separate, well-matched, external data set is not available, filtering can be applied using both case and control data by keeping only variants that are rare in both data sets95. Marenne et al.34 used external data sets (1000 Genomes Project and UK10K cohort reference panel) as well as the case and common control data sets for filtering, whereas Sazonovs et al.36 used an external, ancestry-matched data set (gnomAD Non-Finnish European) (BOX 1). Studies of rare variant associations are also commonly limited to variants with predicted functional effects as a way to improve power34,35. Multiple computational prediction tools (for example, FAVOR123, CADD124, PolyPhen-2 (REF.125) and SIFT126) are available and must be applied identically to cases and controls, as any differences will introduce bias.
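The difference between a one-sided filter and the "rare in both data sets" rule can be shown with a short sketch. The variant identifiers, allele frequencies and the 1% threshold below are hypothetical, and the sketch makes the simplifying assumption that only variants observed in both data sets are considered; variants absent from one data set need an explicit decision (for example, distinguishing "not covered" from "frequency zero").

```python
MAF_THRESHOLD = 0.01  # hypothetical rare-variant threshold


def maf(allele_freq):
    """Minor allele frequency from the frequency of either allele."""
    return min(allele_freq, 1.0 - allele_freq)


def rare_in_both(case_af, control_af, threshold=MAF_THRESHOLD):
    """Keep variants present in both data sets and rare in each of them."""
    shared = case_af.keys() & control_af.keys()
    return sorted(v for v in shared
                  if maf(case_af[v]) < threshold and maf(control_af[v]) < threshold)


def rare_in_controls_only(case_af, control_af, threshold=MAF_THRESHOLD):
    """One-sided filter, shown only to illustrate the problem described in the text."""
    shared = case_af.keys() & control_af.keys()
    return sorted(v for v in shared if maf(control_af[v]) < threshold)


# Toy allele frequencies (made up for illustration).
case_af = {"chr1:1000:A:G": 0.004, "chr1:2000:C:T": 0.020, "chr1:3000:G:A": 0.003}
control_af = {"chr1:1000:A:G": 0.006, "chr1:2000:C:T": 0.005, "chr1:3000:G:A": 0.030}

symmetric = rare_in_both(case_af, control_af)
one_sided = rare_in_controls_only(case_af, control_af)
print("rare in both:      ", symmetric)
print("controls-only rule:", one_sided)
```

Under the controls-only rule, `chr1:2000:C:T` is retained even though it is above the threshold in cases, while the mirror-image variant `chr1:3000:G:A` is dropped. This asymmetry is what pushes a burden-style test towards a spurious excess of alleles in cases.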
Association methods
Association methods have been explicitly developed to robustly incorporate common controls while controlling for technology and processing differences between cases and controls (TABLE 2). The most computationally efficient methods incorporate summary-level common control data. Methods such as iECAT can increase power to detect rare variant genetic associations in an existing case–control study by augmenting with common controls94, whereas ProxECAT95, TRAPD96 and RV-EXCALIBER98 enable the use of common controls as the sole control set. Some common control data, such as gnomAD, contain both exome and genome data. These allele frequencies should be kept separate when using the preceding methods, otherwise the methods will not be able to adequately estimate and adjust for differences in sequencing technology and processing. Other methods incorporate individual common control data using only variant calls127, using variant calls with other data such as read depth128 or using different data, such as raw reads, directly129. These methods can help alleviate residual differences due to technology or computational pipelines especially when jointly recalling cases and controls is not feasible or does not fully alleviate bias. Importantly, these methods do not inherently correct for differences in ancestry and should be used along with ancestry adjustment and matching discussed above.
Post-analysis quality control
Post-analysis quality control helps identify any remaining systematic bias and issues in the harmonization and analysis pipeline, which can then inform a recalibration and improvement of the pipeline. After iteration of this process until there is no longer any evidence of batch effects, the study can progress to assessment and interpretation of results. If there are unremovable batch effects, however, a different common control data set may be needed. Starting with a subset of the data (for example, one chromosome) and performing soundness checks (for example, comparing distributions of quality control metrics between cases and controls) at each stage of the process enable facile assessment and updating of the pipeline.
The most common technique in a genome-wide study to assess a pipeline’s effectiveness is to use quantile–quantile (QQ) plots to compare the observed distribution of test statistics with the expected distribution under the null hypothesis of no association. Systematic bias will manifest in a QQ plot with an inflated distribution of test statistics at the median (that is, λGC > 1)96,107. Although QQ plots can identify inflation of test statistics, they cannot determine whether the inflation is due to population structure or polygenicity130 using the single λGC value. For WES, assessing inflation in rare variant test statistics derived from synonymous variants can start to disentangle bias from signal, as we expect less association signal with synonymous variants96. For WGS, synonymous or a random subset of non-coding variants may be useful to identify residual bias. Additionally, the degree to which a genetic marker tags local variability as estimated using linkage disequilibrium score regression can help distinguish polygenicity from population structure effects131. Of note, careful attention to quality control, filtering and association analysis is especially important for candidate gene studies where bias cannot be assessed through genome-wide metrics, such as QQ plots.
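For reference, the genomic inflation factor λGC is the median observed test statistic divided by the median expected under the null, which for a one-degree-of-freedom chi-square statistic is approximately 0.455. The sketch below computes it from P values using only the Python standard library; the function name and the synthetic P values are assumptions for illustration, and the calculation presumes that the P values come from one-degree-of-freedom tests.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHISQ1_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def lambda_gc(p_values):
    """Genomic inflation factor from P values of 1 d.f. tests (each 0 < p <= 1)."""
    # A two-sided P value p corresponds to the statistic z**2 with z = inv_cdf(p / 2).
    chisq = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chisq) / CHISQ1_MEDIAN


# Synthetic P values: an evenly spaced grid mimics the null (uniform) distribution;
# squaring them shifts the distribution towards small values, mimicking inflation.
n = 1001
null_p = [(i - 0.5) / n for i in range(1, n + 1)]
inflated_p = [p ** 2 for p in null_p]

lam_null = lambda_gc(null_p)
lam_inflated = lambda_gc(inflated_p)
print(f"lambda_GC (null grid)     = {lam_null:.2f}")
print(f"lambda_GC (inflated grid) = {lam_inflated:.2f}")
```

As noted above, a single λGC value cannot by itself distinguish bias from polygenicity, so computing it separately on a subset expected to carry little signal (such as synonymous variants) may be more informative.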
Verification, follow-up and reproducibility of results
When common controls are used, validation and replication of results are at the same time more important and more difficult to implement than for traditional case–control studies. The necessity of common controls often follows from a scenario in which resources are scarce; thus, gathering another independent and sufficiently large replication sample of cases, let alone cases and controls, may be difficult. In such a scenario, in silico validation and partial replication can be performed. In silico validation includes an in-depth confirmation that top hits are not driven by subtle variant quality issues. Partial replication, using other external common control data sets or public databases, can be used to cross-reference genotype or allele frequencies in the original common control data set or to complete another association test with the discovery case sample. Although these strategies do not comprise a traditional external validation, they do provide additional support for or against results from the primary analyses.
Sazonovs et al. and Marenne et al. both performed replication using additional independent case and external common control data34,36 (BOX 1). Sazonovs et al. used three follow-up case–control studies for replication including whole-genome sequenced INTERVAL83 and whole-exome sequenced UK Biobank85 external common controls. Marenne et al. performed replication of nine genes using targeted sequencing of independent severe childhood-onset obesity cases and external common controls from the Fenland Study132. Additionally, Marenne et al. performed partial replication of the original severe childhood-onset obesity samples with gnomAD non-Finnish European controls using the association method ProxECAT95.
To ensure reproducibility of research with common controls, code used for harmonization, quality control and analysis should be publicly available, and common control data used should be clearly identified including version, access location and date accessed. Infrastructure such as Dockstore133 for workflows and GitHub134 for code are useful in supporting reproducibility.
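One lightweight way to record the identifying details listed above is a small machine-readable manifest stored alongside the analysis code. The sketch below is only one possible layout: the field names are assumptions, and every value is a placeholder rather than a real data set, version or URL.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ControlDataRecord:
    name: str
    version: str
    access_location: str
    date_accessed: str   # ISO 8601 date
    consent_group: str


# Placeholder values only; replace with the details of the data actually used.
record = ControlDataRecord(
    name="example-common-control-set",
    version="v0.0-placeholder",
    access_location="https://example.org/placeholder",
    date_accessed="2000-01-01",
    consent_group="GRU",
)

manifest = json.dumps(asdict(record), indent=2, sort_keys=True)
print(manifest)
```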
Sharing of summary-level results with sufficient information for follow-up is especially necessary for common control studies. Standard recommendations exist for GWAS of common or low-frequency variants107,135 and are being developed for rare variant analyses47. In addition to these standards, information resulting from the harmonization process, such as allele counts, frequencies before and after harmonization, and study-specific allele frequencies when multiple control sets are used, should be provided for common control summary statistics. As summary statistics from analyses using common controls become more prevalent, the question of how to complete follow-up analyses from studies that utilize the same control set needs to be addressed. Methods such as the Bayesian Multiple Rare Variant and Phenotype framework harness correlation, scale and/or direction of genetic effects to assess summary statistics from a broad range of rare variant association study designs including multiple diseases and shared controls136.
Concluding perspectives
Foundational to the continued generation and use of common control data is the urgent need to better represent global populations in the data sets, among the researchers charged with generating new data sets and infrastructure, and among users of the resources48,137,138. Deliberate attention to representation at every stage in the process ensures that challenges are addressed prospectively rather than as an afterthought or overlooked entirely139. Importantly, the use of common control data should not preclude the design and execution of large, high-quality studies of diverse genetic ancestries and environments. Indeed, such large, diverse data sets are urgently needed to increase representation of publicly available resources, which may also serve as common controls.
Any development and maintenance of infrastructure for the use of common controls, and data sharing in general, must prioritize accessibility in terms of cost and ease of use to close the resource gap, or else inequities regarding who is able to conduct high-quality research with even ‘publicly available’107,140 data will persist. Structured cloud computing and levelled cost structures, such as in the UK Biobank, can help address disparities in access.
The need for broad representation and equity of access for data resources does not necessarily mean that all data should be publicly available. There is a balance between open data and data sovereignty, especially for historically marginalized groups, cultures and countries141. Although broad consent from research participants is preferred for common control data, as it allows the data to be more widely used, consent must be obtained with respect, accompanied by authentic community engagement and trust building142. Efforts to ensure an equitable inclusion of and partnership with people across environments, ancestries and conditions must become the standard. As such, structures for funding, publishing and rewarding science must include expectations for representation, equity and community engagement.
The generation and maintenance of rich phenotypic data sets as well as ensuring representativeness require funding and incentives. Current funding focuses on the creation and analysis of individual data sets rather than on broad interoperability among shared data sets. Although there have been recent efforts by funding agencies143 and organizations such as the Global Alliance for Genomics and Health (GA4GH)144–147 to support phenotype harmonization and data sharing standards, such as common data elements and passports to enable authentication and access, a commitment to continued long-term funding of these and other resources is needed.
The robust use of common controls requires careful consideration and specialized training in how to access data and complete robust analyses. Training programmes currently exist for access and use of large-scale genetic data sets including the BioData Catalyst Fellowship Program148, AnVIL’s Massive Genome Informatics in the Cloud (MaGIC) Jamboree149 and the GA4GH starter kit150. Further integration of training into existing quantitative human genetics programmes through institutions and professional societies would be beneficial. Funding for time and travel is necessary to support equity in training, especially for researchers who depend on accessible public resources.
Genomic research requires an enormous amount of resources, which can exacerbate disparities in research and, ultimately, health outcomes for under-represented groups. With the increasing availability of genome sequencing and low-cost genotyping arrays, common control data sets are an invaluable resource to the wider research community in addressing these disparities. Although common controls are, and will likely continue to be, widely used, there exist no comprehensive guidelines or summary of the current resources to support their robust use. As such, many studies have used common controls incorrectly (for example, owing to improper filtering or mismatch of ancestry), and quality control and analysis modifications would have resulted in more robust studies. Importantly, the need for appropriate calibration of test statistics when using common control data sets is not eliminated by increasing sample size; although a larger sample reduces the variance of estimates and increases power for discoveries, it can also open the door to additional batch effects and other technical confounders. Thus, regardless of the size, leveraging external data requires careful consideration. Used rigorously, common controls hold the potential to expand the impact of human genetics research across a breadth of environments, conditions and populations.
Acknowledgements
This work was supported by the Genome Sequencing Program (R35HG011293 to A.E.H. and C.R.G.; U01HG009080 to A.E.H., A.G.I., C.R.G. and M.A.R.; and U24HG008956 to S.B.). The Genome Sequencing Program is funded by the National Institutes of Health (NIH) National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI) and the National Eye Institute (NEI). G.L.W. received support for this work from NHGRI (R35HG011944).
Glossary
- Monogenic
A condition influenced by one genetic locus.
- Oligogenic
A condition influenced by a few genetic loci.
- Polygenic
A condition influenced by a large number of genetic loci.
- Allele frequencies
The rates of genetic variant types in a specified population.
- Common controls
Controls used for multiple studies.
- Bias
Systematic error (as opposed to error due to chance processes), whether caused by statistical methods, differences between sampled individuals and the population they nominally represent, differences between cases and controls in ascertainment or sample processing, or other issues.
- Confounding
A spurious association or lack of association caused by a third variable that is related to both the predictor variable (for example, allele frequency) and the outcome (for example, case status).
- Internal controls
Controls that were ascertained, sequenced and processed together with the case sample. By contrast, external common controls were recruited, sequenced and processed separately, often using different technology from the case sample.
- Biobanks
Collections of both biological samples (particularly DNA) and health information from individuals generally assembled from a region or a health system.
- Harmonization
The formation of a single cohesive data set from two or more separate data sets by standardizing scales, definitions, quality control and other processing.
- Batch effects
Differences between groups induced by processing over different times, places or technologies unrelated to biological causes.
- Quality control
A process where low-quality data or observations are identified and improved or removed from further analysis.
- Statistical power
The probability of rejecting the null hypothesis when it is false.
- Ascertained cases
Participants of a study who are recruited to have a known disease, outcome or condition of interest.
- Ascertained controls
Participants of a study who are recruited to not have a known disease, outcome or condition of interest.
- Convenience sample
A sample drawn from an easily accessible, but often not representative, cohort.
- Population controls
A control group sampled from a population but possibly lacking information regarding the condition of interest, with the result that some of the population controls will likely have the condition of interest.
- Admixed
A term to denote the mixture of genetic ancestries from two or more divergent groups.
- Population stratification
The presence of subpopulations with differing allele frequencies in a study; a source of confounding if phenotypes also vary by subpopulation.
- False positives
Test results that are statistically significant even though there is no real association. By contrast, a false negative is a test result that is not statistically significant even though there is a real association.
- Fine-scale ancestry
Genetic differentiation at a regional level (such as subcontinental), as opposed to continental-level ancestry.
- Metadata
A high-level description of a data set, often including details of the cohort and of data generation.
- Local ancestry
The genetic ancestry of a particular chromosomal region on a haplotype level.
- Minor allele frequency (MAF)
For a genetic variant with two alleles, the frequency, in a specified population, of the less frequent allele.
- In silico validation
Secondary quality control analysis of genotype calls that passed the initial harmonization process, often focused on top association results, to ensure that differences in processing do not drive important association signals.
- Partial replication
Repeating association analysis reusing some data from the discovery analysis (for example, discovery cases and new external common controls).
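As a minimal sketch of how the glossary terms minor allele frequency, population stratification and confounding fit together, the Python example below uses invented allele counts for two hypothetical subpopulations, A and B. The function names, the diploid biallelic assumption and all counts are illustrative assumptions rather than anything taken from the studies discussed here. Within each subpopulation the alternate allele is equally frequent in cases and controls, yet pooling the subpopulations yields an odds ratio away from 1 because both allele frequency and the proportion of cases differ between A and B.

```python
from fractions import Fraction


def minor_allele_frequency(alt_allele_count, n_individuals):
    """MAF of a biallelic variant, assuming diploid individuals."""
    alt_freq = Fraction(alt_allele_count, 2 * n_individuals)
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1 - alt_freq)


def allelic_odds_ratio(case_alt, case_n, control_alt, control_n):
    """Odds of the alternate allele in cases relative to controls."""
    case_ref = 2 * case_n - case_alt
    control_ref = 2 * control_n - control_alt
    return Fraction(case_alt * control_ref, case_ref * control_alt)


# Hypothetical counts, invented for illustration only.
# Subpopulation A: alternate allele frequency 0.10, mostly cases.
# Subpopulation B: alternate allele frequency 0.40, mostly controls.
subpopulations = {
    "A": {"case_n": 800, "case_alt": 160, "control_n": 200, "control_alt": 40},
    "B": {"case_n": 200, "case_alt": 160, "control_n": 800, "control_alt": 640},
}

# Analysed separately, neither subpopulation shows an association.
stratum_or = {
    name: allelic_odds_ratio(**counts) for name, counts in subpopulations.items()
}

# Pooling ignores subpopulation membership, which is related to both
# allele frequency and case status, so it acts as a confounder.
pooled = {
    key: sum(counts[key] for counts in subpopulations.values())
    for key in ("case_n", "case_alt", "control_n", "control_alt")
}
pooled_or = allelic_odds_ratio(**pooled)

for name, value in stratum_or.items():
    print(f"Subpopulation {name}: odds ratio = {float(value):.2f}")
print(f"Pooled: odds ratio = {float(pooled_or):.2f}")
```

Exact fractions are used so that the within-subpopulation odds ratios equal 1 without floating-point rounding. In a real analysis, stratification would more typically be handled by adjusting for ancestry (for example, with principal components) rather than by analysing each subpopulation separately.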
Footnotes
Competing interests
C.R.G. owns stock in 23andMe. M.A.R. is a scientific founder of Broadwing Bio, a consultant for MazeTx, and is currently on leave at HiBio. The other authors declare no competing interests.
RELATED LINKS
1000 Genomes Project: https://www.internationalgenome.org
All of Us: https://www.researchallofus.org/
AnVIL: https://anvilproject.org
BioData Catalyst: https://biodatacatalyst.nhlbi.nih.gov
CCDG: https://ccdg.rutgers.edu
CSVS: http://csvs.babelomics.org/
dbGaP: https://www.ncbi.nlm.nih.gov/gap/
dbGaP ALFA: https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/
Estonian Biobank: https://genomics.ut.ee/en/content/estonian-biobank
FinnGen: https://finngen.gitbook.io/documentation/data-download
GenomeAsia 100K: https://browser.genomeasia100k.org
gnomAD v.2.1: https://gnomad.broadinstitute.org/downloads
gnomAD v.3.1: https://gnomad.broadinstitute.org/downloads
H3Africa: https://catalog.h3africa.org
HGDP: https://www.internationalgenome.org/data-portal/data-collection/hgdp
INTERVAL: https://www.intervalstudy.org.uk
jMorp: https://jmorp.megabank.tohoku.ac.jp/202109/downloads/
Researcher Workbench: https://www.researchallofus.org/data-tools/workbench/
sGDP: https://cloud.google.com/life-sciences/docs/resources/public-datasets/simons
SISu v4.1: https://sisuproject.fi
Taiwan Biobank: https://taiwanview.twbiobank.org.tw/browse38
TOPMed: https://topmed.nhlbi.nih.gov
TOPMed Bravo: https://bravo.sph.umich.edu/freeze8/hg38/
UK Biobank: https://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=263
References
- 1.McGuire AL et al. The road ahead in genetics and genomics. Nat. Rev. Genet 21, 581–596 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]; Perspective from a panel of leading genetics experts across the world describing the current state of the field and where genetics should go to ensure that the insights gained by modern genomic research will benefit all.
- 2.Rehm HL et al. ClinGen — the clinical genome resource. N. Engl. J. Med 372, 2235–2242 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Wang Q et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597, 527–532 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Szustakowski JD et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet 53, 942–948 (2021). [DOI] [PubMed] [Google Scholar]
- 5.Gibbs RA The Human Genome Project changed everything. Nat. Rev. Genet 21, 575–576 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.UK10K Consortium et al. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Minikel EV et al. Evaluating drug targets through human loss-of-function genetic variation. Nature 581, 459–464 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Banka S et al. How genetically heterogeneous is Kabuki syndrome?: MLL2 testing in 116 patients, review and analyses of mutation and phenotypic spectrum. Eur. J. Hum. Genet 20, 381–388 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Biesecker LG Exome sequencing makes medical genomics a reality. Nat. Genet 42, 13–14 (2010). [DOI] [PubMed] [Google Scholar]
- 10.Ng SB et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat. Genet 42, 30–35 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Akbari P et al. Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity. Science 373, eabf8683 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Flannick J et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature 570, 71–76 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Backman JD et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]; Initial description of the data and potential provided by exomes for medical and genomic applications across the UK Biobank.
- 14.Martin AR et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet 100, 635–649 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Petrovski S & Goldstein DB Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine. Genome Biol. 17, 157 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Manrai AK et al. Genetic misdiagnoses and the potential for health disparities. N. Engl. J. Med 375, 655–665 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]; Foundational early genome-wide association study leveraging a common set of controls to enhance discovery possibility across seven diseases. The paper includes stringent QC now common to ensure homogeneity across a common control data set.
- 18.Corredor-Orlandelli D et al. Association between paraoxonase-1 p.Q192R polymorphism and coronary artery disease susceptibility in the Colombian population. Vasc. Health Risk Manag 17, 689–699 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Tan M et al. Whole genome sequencing identifies rare germline variants enriched in cancer related genes in first degree relatives of familial pancreatic cancer patients. Clin. Genet 100, 551–562 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Taroc EZM et al. Gli3 regulates vomeronasal neurogenesis, olfactory ensheathing cell formation, and GnRH-1 neuronal migration. J. Neurosci 40, 311–326 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Muskens IS et al. Germline cancer predisposition variants and pediatric glioma: a population-based study in California. Neuro. Oncol 22, 864–874 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Lorenzo-Salazar JM et al. Novel idiopathic pulmonary fibrosis susceptibility variants revealed by deep sequencing. ERJ Open Res 5, 00071 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Georges A et al. Rare loss-of-function mutations of PTGIR are enriched in fibromuscular dysplasia. Cardiovasc. Res 117, 1154–1165 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Li C et al. Mutation analysis of DNAJC family for early-onset Parkinson’s disease in a Chinese cohort. Mov. Disord 35, 2068–2076 (2020). [DOI] [PubMed] [Google Scholar]
- 25.Hillman P et al. Identification of novel candidate risk genes for myelomeningocele within the glucose homeostasis/oxidative stress and folate/one-carbon metabolism networks. Mol. Genet. Genom. Med 8, e1495 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Hebert L et al. Burden of rare deleterious variants in WNT signaling genes among 511 myelomeningocele patients. PLoS ONE 15, e0239083 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Yuan J-H et al. Genomic analysis of 21 patients with corneal neuralgia after refractive surgery. Pain Rep. 5, e826 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Rojas RA et al. Phenotypic continuum between Waardenburg syndrome and idiopathic hypogonadotropic hypogonadism in humans with SOX10 variants. Genet. Med 23, 629–636 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Terradas M et al. TP53, a gene for colorectal cancer predisposition in the absence of Li–Fraumeni-associated phenotypes. Gut 70, 1139–1146 (2021). [DOI] [PubMed] [Google Scholar]
- 30.Li C et al. Mutation analysis of LRP10 in a large Chinese familial Parkinson disease cohort. Neurobiol. Aging 99, 99.e1–99.e6 (2021). [DOI] [PubMed] [Google Scholar]
- 31.Gunadi et al. Effect of semaphorin 3C gene variants in multifactorial Hirschsprung disease. J. Int. Med. Res 49, 300060520987789 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Messina A et al. Neuron-derived neurotrophic factor is mutated in congenital hypogonadotropic hypogonadism. Am. J. Hum. Genet 106, 58–70 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Trimarchi M et al. Gene expression analysis in patients with cocaine-induced midline destructive lesions. Medicina 57, 861 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Marenne G et al. Exome sequencing identifies genes and gene sets contributing to severe childhood obesity, linking PHIP variants to repressed POMC transcription. Cell Metab. 31, 1107–1119.e12 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Singh T et al. Rare loss-of-function variants in SETD1A are associated with schizophrenia and developmental disorders. Nat. Neurosci 19, 571–577 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Sazonovs A et al. Sequencing of over 100,000 individuals identifies multiple genes and rare variants associated with Crohns disease susceptibility. Preprint at bioRxiv 10.1101/2021.06.15.21258641 (2021). [DOI] [Google Scholar]
- 37.Malki L et al. Variant PADI3 in central centrifugal cicatricial alopecia. N. Engl. J. Med 380, 833–841 (2019). [DOI] [PubMed] [Google Scholar]
- 38.Ulirsch JC et al. The genetic landscape of Diamond–Blackfan anemia. Am. J. Hum. Genet 103, 930–947 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Hubert J-N et al. The PI3K/mTOR pathway is targeted by rare germline variants in patients with both melanoma and renal cell carcinoma. Cancers 13, 2243 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Rashid M et al. ALPK1 hotspot mutation as a driver of human spiradenoma and spiradenocarcinoma. Nat. Commun 10, 2213 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Belhadj S et al. Candidate genes for hereditary colorectal cancer: mutational screening and systematic review. Hum. Mutat 41, 1563–1576 (2020). [DOI] [PubMed] [Google Scholar]
- 42.Mosquera Orgueira A et al. Detection of rare germline variants in the genomes of patients with B-cell neoplasms. Cancers 13, 1340 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Li C et al. Targeted next generation sequencing of nine osteoporosis-related genes in the Wnt signaling pathway among Chinese postmenopausal women. Endocrine 68, 669–678 (2020). [DOI] [PubMed] [Google Scholar]
- 44.Thorlund K, Dron L, Park JJH & Mills EJ Synthetic and external controls in clinical trials — a primer for researchers. Clin. Epidemiol 12, 457–467 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Popejoy AB & Fullerton SM Genomics is failing on diversity. Nature 538, 161–164 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Ben-Eghan C et al. Don’t ignore genetic data from minority populations. Nature 585, 184–186 (2020). [DOI] [PubMed] [Google Scholar]
- 47.McMahon A et al. Sequencing-based genome-wide association studies reporting standards. Cell Genomics 1, 100005 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Gurdasani D, Barroso I, Zeggini E & Sandhu MS Genomics of disease risk in globally diverse populations. Nat. Rev. Genet 20, 520–535 (2019). [DOI] [PubMed] [Google Scholar]; This paper provides a summary of the current state of genomic diversity in research and how diversity is key to discovery and translation in genomics.
- 49.Zhang Y et al. The prevalence of vitiligo: a meta-analysis. PLoS ONE 11, e0163806 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Conway M et al. Analyzing the heterogeneity and complexity of electronic health record oriented phenotyping algorithms. AMIA Annu. Symp. Proc 2011, 274–283 (2011). [PMC free article] [PubMed] [Google Scholar]
- 51.Newton KM et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J. Am. Med. Inform. Assoc 20, e147–e154 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Shang N et al. Making work visible for electronic phenotype implementation: lessons learned from the eMERGE network. J. Biomed. Inform 99, 103293 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Davis KAS et al. Indicators of mental disorders in UK Biobank — a comparison of approaches. Int. J. Methods Psychiatr. Res 28, e1796 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Singh T et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604, 509–516 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Ledford H Paper on genetics of longevity retracted. Nature 10.1038/news.2011.429 (2011). [DOI] [Google Scholar]
- 56.Viering DHHM et al. Genetics of renovascular hypertension in children. J. Hypertens 38, 1964–1970 (2020). [DOI] [PubMed] [Google Scholar]
- 57.Mazzarotto F et al. Reevaluating the genetic contribution of monogenic dilated cardiomyopathy. Circulation 141, 387–398 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Steel D et al. Loss-of-function variants in HOPS complex genes VPS16 and VPS41 cause early onset dystonia associated with lysosomal abnormalities. Ann. Neurol 88, 867–877 (2020). [DOI] [PubMed] [Google Scholar]
- 59.Johnson JO et al. Association of variants in the SPTLC1 gene with juvenile amyotrophic lateral sclerosis. JAMA Neurol. 78, 1236–1248 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Gallego-Martinez A, Requena T, Roman-Naranjo P, May P & Lopez-Escamez JA Enrichment of damaging missense variants in genes related with axonal guidance signalling in sporadic Meniere’s disease. J. Med. Genet 57, 82–88 (2020). [DOI] [PubMed] [Google Scholar]
- 61.Kwok AJ, Mentzer A & Knight JC Host genetics and infectious disease: new tools, insights and translational opportunities. Nat. Rev. Genet 22, 137–153 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Fry A et al. Comparison of sociodemographic and health-related characteristics of UK biobank participants with those of the general population. Am. J. Epidemiol 186, 1026–1034 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Wright CF et al. Assessing the pathogenicity, penetrance, and expressivity of putative disease-causing variants in a population setting. Am. J. Hum. Genet 104, 275 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Povysil G et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat. Rev. Genet 20, 747–759 (2019). [DOI] [PubMed] [Google Scholar]; Review describing rare variant aggregation testing, a common method for association in sequencing studies. Beyond describing techniques, the review covers specific filtering and quality control needed to ensure appropriate statistical calibration.
- 65.Riveros-McKay F et al. Genetic architecture of human thinness compared to severe obesity. PLoS Genet. 15, e1007603 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Moskvina V, Holmans P, Schmidt KM & Craddock N Design of case–controls studies with unscreened controls. Ann. Hum. Genet 69, 566–576 (2005). [DOI] [PubMed] [Google Scholar]
- 67.Sham PC & Purcell SM Statistical power and significance testing in large-scale genetic studies. Nat. Rev. Genet 15, 335–346 (2014). [DOI] [PubMed] [Google Scholar]
- 68.Auer PL et al. Guidelines for large-scale sequence-based complex trait association studies: lessons learned from the NHLBI Exome Sequencing Project. Am. J. Hum. Genet 99, 791–801 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Alberts B Editorial expression of concern. Science 330, 912 (2010). [DOI] [PubMed] [Google Scholar]
- 70.Campbell CD et al. Demonstrating stratification in a European American population. Nat. Genet 37, 868–872 (2005). [DOI] [PubMed] [Google Scholar]
- 71.Knowler WC, Williams RC, Pettitt DJ & Steinberg AG Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am. J. Hum. Genet 43, 520–526 (1988). [PMC free article] [PubMed] [Google Scholar]
- 72.Hellwege JN et al. Population stratification in genetic association studies. Curr. Protoc. Hum. Genet 95, 1.22.1–1.22.23 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Choudhry S et al. Population stratification confounds genetic association studies among Latinos. Hum. Genet 118, 652–664 (2006). [DOI] [PubMed] [Google Scholar]
- 74.Helgason A, Yngvadóttir B, Hrafnkelsson B, Gulcher J & Stefánsson K An Icelandic example of the impact of population structure on association studies. Nat. Genet 37, 90–95 (2005). [DOI] [PubMed] [Google Scholar]
- 75.Panarella M & Burkett KM A cautionary note on the effects of population stratification under an extreme phenotype sampling design. Front. Genet 10, 398 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Gravel S et al. Demographic history and rare allele sharing among human populations. Proc. Natl Acad. Sci. USA 108, 11983–11988 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Mathieson I & McVean G Differential confounding of rare and common variants in spatially structured populations. Nat. Genet 44, 243–246 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.O’Connor TD et al. Fine-scale patterns of population stratification confound rare variant association tests. PLoS ONE 8, e65834 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Klann JG, Joss MAH, Embree K & Murphy SN Data model harmonization for the All Of Us Research Program: transforming i2b2 data into the OMOP common data model. PLoS ONE 14, e0212463 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Wei W-Q et al. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS ONE 12, e0175508 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Leitsalu L et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol 44, 1137–1147 (2015). [DOI] [PubMed] [Google Scholar]
- 82.Choudhury A et al. Author correction: High-depth African genomes inform human migration and health. Nature 592, E26 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Di Angelantonio E et al. Efficiency and safety of varying the frequency of whole blood donation (INTERVAL): a randomised trial of 45 000 donors. Lancet 390, 2360–2371 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Taliun D et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Bycroft C et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Gutierrez-Sacristan A et al. GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets. Brief Bioinform 22, 55–65 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.FinnGen. FinnGen documentation of R5 release. FinnGen; https://finngen.gitbook.io/documentation/ (2021). [Google Scholar]
- 88.Wei C-Y et al. Genetic profiles of 103,106 individuals in the Taiwan Biobank provide insights into the health and history of Han Chinese. NPJ Genom. Med 6, 10 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Karczewski KJ, Francioli LC & MacArthur DG The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Peña-Chilet M et al. CSVS, a crowdsourcing database of the Spanish population genetic variability. Nucleic Acids Res. 49, D1130–D1137 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Mailman MD et al. The NCBI dbGaP Database of Genotypes and Phenotypes. Nat. Genet 39, 1181–1186 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Lappalainen I et al. The European Genome–Phenome Archive of human data consented for biomedical research. Nat. Genet 47, 692–695 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.UK Biobank. New costs for 2021. UK Biobank; https://www.ukbiobank.ac.uk/enable-your-research/costs (2021). [Google Scholar]
- 94.Lee S, Kim S & Fuchsberger C Improving power for rare-variant tests by integrating external controls. Genet. Epidemiol 41, 610–619 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95.Hendricks AE et al. ProxECAT: Proxy External Controls Association Test. A new case–control gene region association test using allele frequencies from public controls. PLoS Genet. 14, e1007591 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Guo MH, Plummer L, Chan Y-M, Hirschhorn JN & Lippincott MF Burden testing of rare variants identified through exome sequencing via publicly available control data. Am. J. Hum. Genet 103, 522–534 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.Jiang L et al. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 50, e34 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Lali R et al. Calibrated rare variant genetic risk scores for complex disease prediction using large exome sequence repositories. Nat. Commun 12, 5852 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Bodea CA et al. A method to exploit the structure of genetic ancestry space to enhance case–control studies. Am. J. Hum. Genet 98, 857–868 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Das S et al. Next-generation genotype imputation service and methods. Nat. Genet 48, 1284–1287 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Schatz MC et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom. 2, 100085 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.National Heart, Lung, and Blood Institute, National Institutes of Health, US Department of Health and Human Services. The NHLBI BioData catalyst. Zenodo 10.5281/zenodo.3822858 (2020). [DOI] [Google Scholar]
- 103.All of Us Research Program Investigators et al. The “All of Us” Research Program. N. Engl. J. Med 381, 668–676 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Langmead B & Nellore A Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet 19, 208–219 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]; This paper reviews how the current and future state of cloud computing will be fundamental for large-scale genomics research including for collaboration and reproducibility.
- 105.Van der Auwera GA & O’Connor BD Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020). [Google Scholar]
- 106.Yuen D et al. The Dockstore: enhancing a community platform for sharing reproducible and accessible computational protocols. Nucleic Acids Res. 49, W624–W632 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Uffelmann E et al. Genome-wide association studies. Nat. Rev. Methods Primers 1, 60 (2021). [Google Scholar]
- 108.Purcell S et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet 81, 559–575 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Chang CC et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Alexander DH & Lange K Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Reich D, Price AL & Patterson N Principal component analysis of genetic data. Nat. Genet 40, 491–492 (2008). [DOI] [PubMed] [Google Scholar]
- 112.Wang C, Zhan X, Liang L, Abecasis GR & Lin X Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation. Am. J. Hum. Genet 96, 926–937 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Price AL et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet 38, 904–909 (2006). [DOI] [PubMed] [Google Scholar]
- 114.1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115.Bergström A et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 116.GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117.Maples BK, Gravel S, Kenny EE & Bustamante CD RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet 93, 278–288 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118.Hilmarsson H et al. High resolution ancestry deconvolution for next generation genomic data. Preprint at bioRxiv 10.1101/2021.09.19.460980 (2021). [DOI] [Google Scholar]
- 119.Arriaga-MacKenzie IS et al. Summix: a method for detecting and adjusting for population structure in genetic summary data. Am. J. Hum. Genet 108, 1270–1282 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Wojcik GL et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]; A large, multi-ethnic, multi-trait genome-wide association study paper from the Population Architecture using Genomics and Epidemiology (PAGE) study describing best practices for handling heterogeneous population data, including imputation, filtering and QC steps. The paper also describes the critical importance of genomic diversity in genetic association studies.
- 121.Choudhury A et al. High-depth African genomes inform human migration and health. Nature 586, 741–748 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122.Exome Variant Server. NHLBI Exome Sequencing Project (ESP). EVS http://evs.gs.washington.edu/EVS/ (2013). [Google Scholar]
- 123.Li X et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet 52, 969–983 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.Kircher M et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet 46, 310–315 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125.Adzhubei IA et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126.Sim N-L et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Li Y & Lee S Novel score test to increase power in association test by integrating external controls. Genet. Epidemiol 45, 293–304 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Chen S & Lin X Analysis in case–control sequencing association studies with different sequencing depths. Biostatistics 21, 577–593 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 129.Hu Y-J, Liao P, Johnston HR, Allen AS & Satten GA Testing rare-variant association without calling genotypes allows for systematic differences in sequencing between cases and controls. PLoS Genet. 12, e1006040 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.Boyle EA, Li YI & Pritchard JK An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Bulik-Sullivan BK et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet 47, 291–295 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132.Clifton EAD et al. Associations between body mass index-related genetic variants and adult body composition: the Fenland cohort study. Int. J. Obes 41, 613–619 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133.O’Connor BD et al. The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows. F1000Res. 6, 52 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134.Perkel J Democratic databases: science on GitHub. Nature 538, 127–128 (2016). [DOI] [PubMed] [Google Scholar]
- 135.Buniello A et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136.Venkataraman GR et al. Bayesian model comparison for rare-variant association studies. Am. J. Hum. Genet 108, 2354–2367 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 137.Thomas SP et al. Cultivating diversity as an ethos with an anti-racism approach in the scientific enterprise. HGG Adv. 108, 100052 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 138.Bonham VL & Green ED The genomics workforce must become more diverse: a strategic imperative. Am. J. Hum. Genet 108, 3–7 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 139.Bentley AR, Callier SL & Rotimi CN Evaluating the promise of inclusion of African ancestry populations in genomics. NPJ Genom. Med 5, 5 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140.Bezuidenhout L & Chakauya E Hidden concerns of sharing research data by low/middle-income country scientists. Glob. Bioeth 29, 39–54 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 141.Tsosie KS, Yracheta JM & Dickenson D Overvaluing individual consent ignores risks to tribal participants. Nat. Rev. Genet 20, 497–498 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 142.Tindana P & de Vries J Broad consent for genomic research and biobanking: perspectives from low- and middle-income countries. Annu. Rev. Genomics Hum. Genet 17, 375–393 (2016). A review outlining the key elements to promote global health and equity when completing genomic research, such as through biobanks.
- 143.National Human Genome Research Institute. NOT-HG-21-022: notice announcing the National Human Genome Research Institute’s expectation for sharing quality metadata and phenotypic data. NIH https://grants.nih.gov/grants/guide/notice-files/NOT-HG-21-022.html (2021).
- 144.Fiume M et al. Federated discovery and sharing of genomic data using Beacons. Nat. Biotechnol 37, 220–224 (2019).
- 145.Thorogood A et al. International federation of genomic medicine databases using GA4GH standards. Cell Genom. 1, 100032 (2021).
- 146.Rehm HL et al. GA4GH: international policies and standards for data sharing across genomic research and healthcare. Cell Genom. 1, 100029 (2021).
- 147.Lawson J et al. The Data Use Ontology to streamline responsible access to human biomedical datasets. Cell Genom. 1, 100028 (2021).
- 148.National Heart, Lung, and Blood Institute. Catalyst Fellows Program. NHLBI https://biodatacatalyst.nhlbi.nih.gov/fellows/program/ (2021).
- 149.National Human Genome Research Institute. Massive Genome Informatics in the Cloud (MaGIC) Jamboree. AnVIL https://anvilproject.org/events/magic2020 (2020).
- 150.Global Alliance for Genomics and Health. GA4GH starter kit. GA4GH https://starterkit.ga4gh.org/ (2021).
- 151.Abel HJ et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583, 83–89 (2020).
- 152.Mallick S et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
- 153.Phan L et al. ALFA: Allele Frequency Aggregator. NCBI https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/ (2020).
- 154.Tadaka S et al. jMorp updates in 2020: large enhancement of multi-omics data resources on the general Japanese population. Nucleic Acids Res. 49, D536–D544 (2021).
- 155.Sequencing Initiative Suomi Project. Sequencing Initiative Suomi. SISu http://sisuproject.fi (2021).
- 156.Wam. Dubai to map genome of all its residents. Khaleej Times https://www.khaleejtimes.com/uae/dubai-to-map-genome-of-all-its-residents (2018).
- 157.Geis C A Chinese province is sequencing one million of its residents’ genomes. Futurism https://futurism.com/neoscope/chinese-province-sequencing-1-millionresidents-genomes (2017).
- 158.Health RI. European ‘1+Million Genomes’ initiative (1+MG). Health RI https://www.health-ri.nl/initiatives/european-1million-genomes-initiative-1mg (2020).
- 159.Gaziano JM et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol 70, 214–223 (2016).
- 160.Sirugo G, Williams SM & Tishkoff SA The missing diversity in human genetic studies. Cell 177, 1080 (2019).
- 161.Byrd JB, Greene AC, Prasad DV, Jiang X & Greene CS Responsible, practical genomic data sharing that accelerates research. Nat. Rev. Genet 21, 615–629 (2020).
- 162.Wilkinson MD et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). This foundational manuscript is the first to present the FAIR principles (that is, findable, accessible, interoperable and reusable) for data sharing.
