Summary
Genome sequencing has recently become a viable genotyping technology for use in genome-wide association studies (GWASs), offering the potential to analyze a broader range of genome-wide variation, including rare variants. To survey current standards, we assessed the content and quality of reporting of statistical methods, analyses, results, and datasets in 167 exome- or genome-wide-sequencing-based GWAS publications published from 2014 to 2020; 81% of publications included tests of aggregate association across multiple variants, with multiple test models frequently used. We observed a lack of standardized terms and incomplete reporting of datasets, particularly for variants analyzed in aggregate tests. We also find a lower frequency of sharing of summary statistics compared with array-based GWASs. Reporting standards and increased data sharing are required to ensure sequencing-based association study data are findable, interoperable, accessible, and reusable (FAIR). To support that, we recommend adopting the standard terminology of sequencing-based GWAS (seqGWAS). Further, we recommend that single-variant analyses be reported following the same standards and conventions as standard array-based GWASs and be shared in the GWAS Catalog. We also provide initial recommended standards for aggregate analyses metadata and summary statistics.
Keywords: GWAS, genome-wide association study, seqGWAS, sequencing-based GWAS, rare variant association, WGS, WES, FAIR data, common variant association, standards
Graphical abstract
Highlights
Recommendations for increasing FAIRness of sequencing-based GWASs
To be findable, we recommend standard terminology of sequencing-based GWAS (seqGWAS)
To improve access and standards, the GWAS Catalog will support deposition of seqGWAS
To improve utility, we recommend reporting standards for single and aggregate analyses
McMahon et al. report an analysis of the sequencing-based GWAS literature, finding a lack of standardized language and incomplete reporting, along with less-frequent sharing of summary statistics compared with that of array-based GWASs. We provide recommendations for the reporting and sharing of sequencing-based GWASs to increase FAIRness of these valuable datasets.
Introduction
Huge advances in the field of human genetics can be attributed to the advent of genome-wide association studies (GWASs) more than 15 years ago.1,2 In recent years, decreasing costs and advances in analytic methods have made high-throughput whole-genome sequencing (WGS) and whole-exome sequencing (WES) feasible alternatives to array-based genotyping in GWASs.3,4 Sequencing offers a significant advantage over array-based methods, with the potential to detect and genotype all variants present in a sample, not only those present on an array or imputation reference panel. Most arrays are designed to assay common variants (minor allele frequency [MAF] > 5%), omitting rare (MAF < 1%) and low-frequency (MAF 1%–5%) variants. The analysis of these rarer variants could explain additional disease risk or trait variability and help overcome the problem of “missing heritability.”5,6 In addition, most arrays have historically been biased toward coverage of variation in European populations.7 The fact that sequencing potentially provides an unbiased assessment of variants within the population studied is particularly important for studies of non-European populations.8,9
There are challenges with analyzing many more and rarer variants. Single-variant tests, used as the standard in array-based GWASs, are typically underpowered when applied to low-frequency or rare variants, unless sample sizes or effects are very large. There are also issues with correcting for multiple testing when the number of statistical tests is very large. To address those issues, statistical methods have been designed specifically for rare-variant-association testing, which evaluate aggregate association over multiple variants in a genomic region (referred to here as “aggregate tests”).10 Variants are typically aggregated across biologically functional regions (e.g., a gene) with variants enriched for those likely to have larger effect sizes based on annotated or predicted functional effect (e.g., located in a splice junction or a predicted loss of function). The power of a particular aggregate test to detect an association will depend on how closely the model’s assumptions and contributing variants represent the true disease mechanism at each locus.
Repositories of scientific data have been indispensable in supporting research and in facilitating discoverability and integration across datasets through standard formats. The National Human Genome Research Institute-European Bioinformatics Institute (NHGRI-EBI) GWAS Catalog11 is the preeminent data resource of large-scale genetic-association studies, enabling research to identify causal variants, to understand disease mechanisms, and to establish targets for novel therapies.12 The GWAS Catalog infrastructure, data content, and standard formats have been designed to support array-based GWASs. Attempts to expand the scope of the Catalog to include sequencing-based association studies have been hindered by the need to develop new standards for the differences in methods, the metadata required to represent them, and the format of the results, particularly for aggregate analyses.
Here, we analyze the current landscape of published sequencing-based association studies to determine requirements for hosting and sharing those datasets in the GWAS Catalog and recommend best practices for reporting. First, we comprehensively reviewed publications reporting sequencing-based association studies, assessing the range of experimental designs and statistical methods, as well as the content and quality of reporting for analyses, methods, and datasets included in publications. We hope that this review will form a rallying point for building community consensus on standards. This work has also informed the development of the GWAS Catalog infrastructure and data-representation schema to support inclusion of sequencing-based association studies, which are now accepted for submission at the GWAS Catalog. Our work at the GWAS Catalog is focused on enabling broad data sharing and defining standards to ensure sequencing-based association study data are findable, interoperable, accessible, and reusable (FAIR).13
Results
Finding sequencing-based association studies
In our review of research publications (STAR Methods), we observed that a wide range of terms are used to describe sequencing-based genome or exome-wide association studies. The term “GWAS” is rarely used, and we have not seen an equivalent standard term emerge (Figure S1). Combinations of terminology were used, related to (1) analysis of associations (e.g., rare variant association analysis, rare variant aggregate association analysis, association test, and genome-wide significant associations), (2) the allele frequency of the variants analyzed (e.g., common variant and rare variant), (3) the analysis type, either single variant (e.g., single variant and variant level) or aggregate with multiple variants (e.g., gene-based, region-based, aggregate, gene burden, collapsing analysis, gene-level association, gene-level signal, and collapsed-variant tests).
We identified 167 publications reporting genome-wide sequencing-based association analyses meeting our selection criteria (STAR Methods; Tables S1 and S2). The first study was published in 2014, with the number of publications increasing year after year to 2020 (Figure 1A). Because no standard terminology has been adopted for these studies, we were not able to search discriminately for sequencing-based association studies meeting our criteria, and permissive searches (e.g., for “WGS OR WES association”) yield too many results to feasibly review manually (Figure S2); therefore, we expect this to be an underestimate of publications reporting sequencing-based GWASs (seqGWAS). Most publications analyzed WES data only (68%), approximately one-third analyzed WGS data (30%), and some publications included both coverage types (2%) (Figure 1A). Many publications that used WES and WGS sequencing data limited their analyses to pre-specified regions of interest; those targeted analyses are not the focus of this work and were, therefore, excluded from the analysis.
Association tests and qualifying variants
We surveyed the types of association tests included in these publications. Most frequent was the inclusion of both single-variant and aggregate analyses (48%), followed by aggregate analysis only (33%), and a minority of publications (19%) included single-variant analyses only (Figure 1B). Of the publications including aggregate tests, a wide range of statistical models and tools were used, with publications commonly using multiple models. For example, of publications that used one of the three most-common aggregation methods10 (burden/collapsing, variance-component [SKAT], and combined burden and variance-component [SKAT-O] tests), 40% (n = 65) used at least two of those methods (Figure 2A). The language used to describe those methods is varied; for example, SKAT is referred to variously as kernel based, dispersion based, or variance-component based (Figure S3).
We also examined variant-filtering or "masking" approaches. Minor allele frequency thresholds were reported in 72% of single-variant and 84% of aggregate-analysis publications, with the remainder either not reporting any MAF threshold or using all variants (26% of single variant/16% of aggregate) (Figure S4). “Greater than” thresholds were typically used for single-variant analysis, with 57% of analyses employing a MAF threshold of 0.01 or greater, limiting those analyses to the common variant space (Figure 2B) (n = 30/53 thresholded analyses from 51 publications). In contrast, aggregate analyses typically employed “less than” thresholds, to include only low-frequency (<0.05), rare (<0.005), or ultra-rare variants. Most aggregate analyses used <0.01 or <0.05 thresholds (78%, n = 67/86 thresholded analyses from 77 publications).
Many publications (63%, n = 75/120) also performed analyses on variants with predicted biological effect. Authors filtered for predicted functional effect based on transcript annotation (e.g., using the Variant Effect Predictor14) or protein structure (e.g., using Sorting Intolerant from Tolerant [SIFT],15 Polymorphism Phenotyping v2 [PolyPhen]16 and combined annotation-dependent depletion [CADD]17) or based on measures of evolutionary conservation or variation intolerance.18,19 An analysis of the text used to describe the filtering process highlights that the most commonly used terms were “splice,” “missense,” “protein,” “frameshift,” “stop gain,” “loss of function” (LoF), and “protein-truncating variant” (PTV), but a wide range of terms were used (Figure S5). Variants were often filtered by both annotation/predicted effect and MAF thresholds, with multiple different filtering criteria used per publication (examples are provided in Table S3).
The number of variants analyzed in WES single-variant analyses is considerably less than those typically analyzed in array-based GWASs (median, 158,091; versus 5,554,549), whereas, in WGS single-variant analyses, the number is greater (median, 12,210,410) (Table 1). The median number of statistical tests performed in aggregate analyses was 18,360, approximating the number of protein-coding genes with a consensus CDS (19,033; coding DNA sequence)20 because the most-common unit over which variants are aggregated is the protein-coding gene. The analyses in which the number of tests was greater than the inter-quartile range were those in which the unit of analysis was non-genic. The most-common non-genic aggregation units we observed were regulatory regions18,19,21,22 or agnostic sliding windows.23, 24, 25, 26 Authors also aggregated across evolutionary conserved regions or pathways.19,27
Table 1.
Single-variant array, % (n) | Single-variant sequencing, % (n) | Aggregate sequencing, % (n) | |
---|---|---|---|
Summary statistics available without restriction | 12 (300) | 5 (4) | 7 (7) |
Number of tests (reporting) | |||
Reported | 91 (5,817) | 74 (61) | 81 (84) |
Not reported | 9 (610) | 26 (21) | 19 (20) |
Number of tests (distribution) | overall | overall | overall |
Minimum | 12,033 | 26,011 | 339 |
Q1 | 899,892 | 144,477 | 16,788 |
Median | 5,554,549 | 548,889 | 18,665 |
Q3 | 9,334,585 | 8,752,596 | 20,843 |
Maximum | 90,000,000 | 32,503,121 | 129,820,320 |
WES only | WES only | ||
Minimum | – | 26,011 | 735 |
Q1 | – | 81,843 | 16,751 |
Median | – | 158,091 | 18,360 |
Q3 | – | 235,133 | 20,000 |
Maximum | – | 1,810,198 | 88,183 |
WGS only | WGS only | ||
Minimum | – | 658,234 | 339 |
Q1 | – | 7,666,134 | 19,903 |
Median | – | 12,210,410 | 32,316 |
Q3 | – | 29,880,479 | 1,082,577 |
Maximum | – | 32,503,121 | 129,820,320 |
Publications that state that they share summary statistics openly (not including those provided with restricted access). Reported/not reported refers to whether the number of statistical tests performed was detailed in the publication. The number of statistical tests performed in sequencing-based studies is based on publications that provide one “number of statistical tests” (n = 51 of 79 for single-variant analysis, n = 56 of 101 for aggregate analysis). Publications that provide a range of statistical test numbers performed are included in the “reported” category but are not included in the distribution. The data for array-based GWAS were obtained from 2014–2019 studies in the GWAS Catalog (December 2, 2020 release) (see STAR Methods).
The outcome of the various variant filters or "masks," i.e., a list of the qualifying variants included in each analysis, was not provided in any of the 167 publications we analyzed. However, some publications did specify the number of qualifying variants included per unit of aggregation.28,29
Sample characteristics
We next surveyed the characteristics of samples (sample size, ancestry, and traits) studied in seqGWAS. We compared the sample sizes of the seqGWAS, because that is a key determinant of statistical power. We classified publications into bins based on the number of individuals in the publication (Figure S6). The most-common sample size bin was 300–3,000 individuals (43% of publications), but in the past few years, there has been a near-even distribution across bins from small to large sample sizes. In 2019, both the smallest (<300 individuals) and the largest (>10,000) sample-size bins were used in approximately a quarter of publications each (23% and 26%, respectively; Figure S6). The number of cases is also a component of statistical power, and unbalanced case/control ratios can inflate type 1 errors.30 We observed 10 publications (6%) with unbalanced case/control ratios (cases ≤ 15% of samples), most of those (n = 7, 4%) being highly unbalanced (cases ≤ 4% of samples) (Table S4).31, 32, 33
The inclusion of diverse ancestral backgrounds in genomics studies is recognized as important,34,35 but analysis of array-based GWASs has highlighted the extreme bias toward samples of European origin.36,37 We assessed and compared ancestry in seqGWAS. Following the GWAS Catalog ancestry framework (a standard methodology for representing ancestry),36 we extracted publication-level, broad ancestral categories of samples. Mirroring what has been seen elsewhere with array-based GWASs, 71% of all publications (n = 85/120) included European ancestry individuals, with 40% not including any other ancestry (n = 48/120) (Figure 3A; Table S5). The second most commonly examined ancestral group was African American (28% of publications, n = 33/120), and most of those publications (21%) also included other ancestries (Figures 3B and S7). This profile may, in part, be due to the presence of large, trans-ancestry consortia, such as the Trans-Omics for Precision Medicine (TOPMed) program, which is the most commonly occurring consortium or cohort mentioned (Table S7).
We also examined the number of traits analyzed within the reported association study. Most publications examined one or two traits (76%, n = 89), whereas a few (4%, n = 5) examined 55–75 traits as part of larger-scale studies.18,22,39, 40, 41 More recently (2019–2020), very-large-scale studies using the UK Biobank have included 791–4,262 traits42, 43, 44 (Figure S8). Non-UK-Biobank publications analyzing multiple traits were mostly focused on quantitative biomarker or metabolite-level-type traits,18,21,41,45 such as inflammatory biomarkers, blood metabolite levels, blood protein levels. Studies analyzing fewer traits were more likely to be case/control studies.46, 47, 48, 49 A full list of publication-level trait names (analogous to the GWAS Catalog “reported trait”) and corresponding mapped Experimental Factor Ontology (EFO) terms are provided in Table S4.
Data availability
The public availability of full summary statistics from GWASs has great potential to extend the power of initial studies by enabling the community to re-analyze, meta-analyze, and perform follow-up analyses, with minimal risk to participants.11,50 We assessed whether summary statistics, in addition to individual-level genotyping results, were reported in these publications as available without restriction in a public repository. Sharing of sequencing-based single-variant summary statistics was much lower (5% of publications, n = 4/79, 2014–2019) than the proportion of array-based publications in the GWAS Catalog in the same period (12% of publications, n = 300/2,571, 2014–2019) (Table 1). Sharing of array-GWAS summary statistics is greater in recent years (19% of 2019 GWAS Catalog publications, n = 101/527), but seqGWAS summary statistics still lag (9%, n = 3/32). A further 2.5% of sequencing publications (n = 3/120, 2014–2019) deposited summary statistics in a controlled-access public repository (the Database of Genotypes and Phenotypes [dbGAP]). In contrast, 24% of publications (n = 29/120) deposited individual-level sequencing data in controlled access repositories (dbGAP or European Genome-Phenome Archive [EGA]) (Table S6) and, for some summary-level data, may have been co-submitted or bundled with those data but not specifically stated by the authors.
The data content of single-variant summary statistics for seqGWAS is comparable with that for standard-array GWASs and can conform to emerging standards.11,50 However, summary statistics for aggregate analysis in seqGWAS are commonly composed only of a gene name (or other range specifying chromosomal coordinates), p value, and often the number of contributing variants, sometimes separated by cases/controls. Crucially, we did not observe any publications that reported the list of variants included in each aggregate unit, which is key to interpretation of the data, either in the main text or in accompanying material.
Discussion
Recommended standards
Based on our review and analyses, we recommend standards to improve the reporting and accessibility of seqGWAS. First, to increase transparency when referring to study design and facilitate identification, we recommend that the community adopt the name of “sequencing-based GWAS,” abbreviated as “seqGWAS” (Box 1, recommendation 1). Second, to enable accurate interpretation and comparison of results across studies and loci, it is essential that detailed information describing each association test (including statistical tests and contributing variants) are consistently reported (Box 1, recommendations 2 and 3). These recommendations are based upon, and are designed to address, our observations of the state of the field.
Box 1. Recommendations for sequencing-based GWAS reporting standards.
Our recommendations for the development and adoption of reporting standards to increase the availability, accessibility, and utility of sequencing-based GWASs. The GWAS Catalog will support deposition of these datasets and promote adoption of these standards as well as continued discussions to reach consensus on the reporting of aggregate analyses.
-
1
WGS and WES association studies be referred to as “sequencing-based GWASs” (seqGWAS)
- 2
-
3Aggregate analyses:
-
aMetadata be reported to enable interpretation and aid reproducibility including
-
iSufficient details of the statistical test to allow replication of results
-
iiMinor allele frequency thresholds used
-
iiiDetails of tools used for functional annotation/consequence prediction (e.g., VEP release 103) and ontology terms used to describe the consequence (e.g., Sequence Ontology)
-
i
-
bCommunity reaches consensus for standard content and format for reporting of aggregate seqGWAS summary statistics. This should include
-
iThe full list of qualifying variants contributing to each test
-
iiChromosomal coordinates of aggregation units (including genome assembly builds or gene annotation release version, e.g., GENCODE release 37, GRCh38)
-
iiiA standard identifier for the aggregation unit, e.g., HGNC gene name or symbol (if applicable)
-
ivp value
-
i
-
a
-
4
SeqGWAS studies be conducted in populations that include more diverse ancestries
Observations
The sequencing-based association studies in the publications we analyzed included either single or aggregate multi-variant analyses. The restriction of single-variant analyses to common variants renders those studies largely comparable with array-based GWASs (Figure 2), with similar implications for data content and reporting (Box 1, recommendation 2) and similar utility for re-use, for example, in the derivation of polygenic scores or in Mendelian randomization. In comparison, studies performing tests of aggregate association across multiple variants, which appear in most (81%) publications, focus on ”low-frequency,” “rare,” and “ultra-rare” variants. Multiple statistical models of aggregate association are frequently used in the same publication because the power of each test depends on how closely the assumptions of the model match the true disease etiology at each locus. Therefore, there is no best model (including statistical tests and variant filtering strategies) across loci and traits, and there is no best model necessarily knowable a priori. To enable accurate interpretation and comparison of results across studies and loci, it is, therefore, essential that detailed information describing each association test (including statistical tests and contributing variants) is consistently reported (Box 1, recommendations 2 and 3).
It is in the performance and, therefore, reporting of aggregate association tests that sequencing-based association studies differ most from standard array-based GWASs. We observed that the experimental information provided for aggregate tests was not sufficient to facilitate thorough examination or replication. Variants are filtered (typically by MAF and functional annotation/predicted consequence) and combined in different units of aggregation. Crucially, the list of variants contributing to each test is not provided by these publications. Availability of these data would facilitate attempts at replication and enable further analysis and functional investigation51 (Box 1, recommendation 3b).
Given the rarity of these variants, privacy concerns regarding de-identification may be a barrier to their sharing. We suggest that the community look to the field of rare-variant clinical genomics, in which it is becoming increasingly accepted that the potential benefits of sharing far outweigh the perceived risks.52 This is illustrated by the number of clinical-laboratory-derived variants in ClinVar more than doubling since 2018.53,54 We note that individual genetic variants, even very rare ones, are not uniquely identifying and would require in-depth knowledge of an individual’s genotype to connect an individual to a phenotype.
Theoretically, lists of qualifying variants could be recapitulated, but filtering information provided by authors is again diverse and often vague and, overall, insufficient to independently derive those lists. The community should consider standardized ways to communicate variant filters or masks (for example, using the sequence ontology to describe functional annotation/predicted functional effect filters55). The unit of aggregation, which encompasses the variants included in each test (typically gene), must be clearly defined. This should include the coordinates of the region and the genome assembly or annotation release, along with any additional variant-filtering information (Box 1, recommendation 3a).
We observed that a smaller proportion of full-summary statistics are publicly available from seqGWAS (5%) compared with array-based GWASs (12%). That percentage is low for both types of studies despite guidance and growing community consensus supporting sharing (web resources).50 There are a number of reasons why full and public data sharing may be less for sequencing than array-based studies. There may be additional perceived privacy concerns regarding the rare variants present in sequencing-based summary statistics. It is also possible that summary statistics may be bundled with the individual-level genotyping data that 24% of publications deposited in controlled-access repositories (dbGAP/EGA). Single-variant summary statistics can conform to the proposed array-based standards (Box 1, recommendation 2)11 and can already be submitted to the GWAS Catalog. However, aggregate-analysis summary statistics, when they are shared, are typically only a gene name and a p value (sometimes with the number of qualifying variants included). These files are not large or cumbersome, given that the number of human genes is only approximately 20,000 and are easy to share, for example, as a supplementary table. As described above, we recommend authors supply full lists of qualifying variants that contribute to each test (Box 1, recommendation 3b). We hope that the development and adoption of these standards will simplify and encourage the sharing of seqGWAS summary statistics.
The ability of sequencing to genotype all variants present in the cohort offers a significant opportunity to overcome the biases inherent in array-based genotyping, with the potential to reduce disparities among ancestry groups. Despite that, the bias toward European-ancestry populations observed in array-based GWASs (49% European only and 74% including European) remains in sequencing publications (40% European only and 71% including European). Furthermore, we note that the percentage of European sequencing-based analyses is likely to be greater; publications containing multiple GWASs are more likely to be from large cohorts with deep phenotyping data, which are predominantly European (e.g., UK Biobank). Given the advantages of sequencing in analyzing non-Europeans, we question why it is not being further used. There are many possible reasons for this, including increased cost, the lack of diversity in legacy cohorts, pre-existing consent agreements, privacy concerns associated with rare-variant analysis, and analysis methods being complex. The GWAS Catalog reiterates its stance in encouraging analysis of diverse populations and encourages researchers to take advantage of the opportunities offered by sequencing technologies in enabling unbiased genotyping across ancestries (Box 1, recommendation 4).
Limitations of the study
The lack of standardized terms to refer to seqGWAS creates challenges for the reliable identification of these publications using term-based literature-search methods. The 167 publications we identified are, therefore, certainly an underestimate of the number of publications, and we do not claim that this work is a comprehensive analysis of all published seqGWAS. To maintain consistency and enable comparability across studies, we decided to limit our analysis to publications carrying out an unbiased, genome-wide or exome-wide assessment of loci associated with traits, equivalent to the GWAS Catalog’s inclusion criteria (web resources). Many of the publications we screened and deemed ineligible were targeted analyses based on prior knowledge, for example, to specific loci, genes, or pathways and are scientifically valid studies but are out of the scope of this manuscript. In our recommendation of the term “seqGWAS” (Box 1, recommendation 1), we note that some may feel the use of “GWAS” is inappropriate, primarily because WES-based analyses are necessarily targeted to expressed regions. However, we observe that the term “GWAS” is commonly used to refer to both genome-wide and exome-wide array-based association studies. Our motivation for suggesting a unique nomenclature (sequencing-based GWAS/seqGWAS) is to facilitate the “findability” of these study types (large-scale association studies that analyze variants spread across the genome (e.g., with coverage across all autosomal chromosomes) in the scientific literature.
A necessary limitation of this work is its restriction to a specific time period (2014–2020), and as such, it serves as a snapshot of the state of the field. It is anticipated that the field will grow significantly in the immediate future, and the ratio of WES and WGS studies may change. However, the findings of our work, in terms of how studies are described and reported, are unaffected by whether or not they are WES or WGS or the total number of studies. The recommendations similarly apply to both coverage types. Furthermore, we believe this is an appropriate time to publish a study such as ours so that standards can be established sooner, thus enabling future publications to adhere to the FAIR principles.
Ensuring seqGWAS are FAIR
The maximum benefit of scientific research can only be realized if data are FAIR (findable, accessible, interoperable, and reusable), as described by the FAIR guiding principles for good scientific data management.13 Our analysis highlights several obstacles to implementation of these principles for seqGWAS, including lack of an appropriate resource or repository to store and disseminate the data, consistency of metadata reporting without the use of structured vocabularies, clarity on metadata indexing that needs to support searching, and a community standard for summary statistics. The GWAS Catalog’s primary aim is to provide a comprehensive resource and repository of all large-scale genomic association studies and, as such, has extended its scope to include seqGWAS, initially focusing on single-variant analyses. We will support the community to reach consensus on the reporting of aggregate seqGWAS, including the creation of standards for metadata and summary format and content.50 The development and adoption of reporting standards will increase the availability, accessibility, and utility of seqGWAS. We include a summary of our recommendations (Box 1) and welcome further input from the community.
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER |
---|---|---|
Software and algorithms | ||
Text analysis tool (MonkeyLearn) | https://monkeylearn.com/word-cloud/ | |
GWAS Catalog machine learning-based literature search | Lee et al.56 | N/A |
Literature search engine, EuropePMC | http://europepmc.org | |
PubMed | https://pubmed.ncbi.nlm.nih.gov | |
Other | ||
Literature (primary research journal articles) | Peer reviewed journals | PubMed IDs listed in Table S4 |
Publicly available curated meta-data | NHGRI-EBI GWAS Catalog | https://www.ebi.ac.uk/gwas/ |
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Aoife McMahon (aoifem@ebi.ac.uk).
Materials availability
This study did not generate new unique reagents.
Method details
To enable direct comparability with array-based GWAS we defined sequencing-based association studies as studies that analyze associations between a trait and a genome-wide distribution of genetic variants from either whole-genome or whole-exome sequencing. This does not include targeted sequencing studies that are limited to specific genomic regions or subsets of genes (e.g., publications57, 58, 59). From these, we selected studies with population-based association analyses, and did not include studies that used family structure/linkage (e.g., publications60, 61, 62) or were aimed at diagnostic discovery of pathogenic variants (e.g publications63, 64, 65). We also included family-based association studies, but only if they performed standard association analysis with relatedness accounted for in the model (e.g., publications39,66). Studies that combine array and sequencing-based genotyping, such as partially array-genotyped, or array genotyped with sequencing data used as an imputation panel, were not included in our analyses.
Sequencing-based association publications meeting these inclusion criteria were identified by several routes: Pubmed and EuropePMC literature searches, the GWAS Catalog machine learning-based literature search,56 examination of grants, cohort and project websites, social media, conference talks, references in publications and personal communications (Table S1). The source of initial identification of each sequencing publication was recorded. Publication level metadata relating to study design, sample description, traits examined and data availability were extracted (Tables S2 and S4). Publication triage, eligibility assessment and extraction of metadata were performed by experienced GWAS Catalog curators. Analysis of study eligibility, genomic coverage and analysis type was performed for 2020 publications. More detailed analysis of the sample, trait, data sharing and statistical tests was available to the end of 2019.
Quantification and statistical analysis
For analysis of text related to variant types, curators extracted sentences describing variant selection and relevant terms were identified using the text analysis tool MonkeyLearn (https://monkeylearn.com/word-cloud/). The output was examined by expert curators and non-relevant terms excluded, terms collapsed and missed relevant terms were added and counted.
Acknowledgments
Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award no. U41HG007823 and EMBL-EBI Core Funds. L.A.H. and P.H. are employees of the National Human Genome Research Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. In addition, we acknowledge funding from the European Molecular Biology Laboratory. We thank Kalliope Panoutsopoulou and Aimee Deaton for comments on the manuscript.
Author contributions
Conceptualization, A.M., H.P., L.A.H., and J.A.L.M.; methodology, A.M.; formal analysis, A.M.; investigation, A.M. and J.A.L.M.; data curation (GWAS Catalog), A.M., E.L., A.B., M.C., P.H., E.S., L.W.H., and J.A.L.M.; data curation (sequencing papers), A.M. and E.L.; writing – original draft, A.M. and J.A.L.M.; writing – review & editing, A.M., J.A.L.M., L.H., and L.W.H.; visualization, A.M.; supervision, J.A.L.M. and L.W.H.; project administration, H.P., J.A.L.M., and L.W.H.; funding acquisition, H.P.
Declaration of interests
An immediate family member of J.A.L.M. is an employee and shareholder of Illumina.
Published: October 13, 2021
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2021.100005.
Contributor Information
Aoife McMahon, Email: aoifem@ebi.ac.uk.
Helen Parkinson, Email: parkinson@ebi.ac.uk.
Web resources
GWAS Catalog eligibility criteria, https://www.ebi.ac.uk/gwas/docs/methods/criteria
Update to NIH management of genomic summary results access, https://datascience.nih.gov/foa/update-nih-management-genomic-summary-results-access
Supplemental information
Data and code availability
Data underlying analyses in this paper are curated from the literature and are presented in Table S4.
This paper does not report original code.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
References
- 1.Klein R.J., Xu X., Mukherjee S., Willis J., Hayes J. Successes of genome-wide association studies. Cell. 2010;142:350–351. doi: 10.1016/j.cell.2010.07.026. author reply 353–355. [DOI] [PubMed] [Google Scholar]
- 2.Wellcome Trust Case Control Consortium Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., Hanna M., et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. doi: 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Pasaniuc B., Rohland N., McLaren P.J., Garimella K., Zaitlen N., Li H., Gupta N., Neale B.M., Daly M.J., Sklar P., et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat. Genet. 2012;44:631–635. doi: 10.1038/ng.2283. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Eichler E.E., Flint J., Gibson G., Kong A., Leal S.M., Moore J.H., Nadeau J.H. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 2010;11:446–450. doi: 10.1038/nrg2809. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Zuk O., Hechter E., Sunyaev S.R., Lander E.S. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. USA. 2012;109:1193–1198. doi: 10.1073/pnas.1119675109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Martin A.R., Gignoux C.R., Walters R.K., Wojcik G.L., Neale B.M., Gravel S., Daly M.J., Bustamante C.D., Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Lachance J., Tishkoff S.A. SNP ascertainment bias in population genetic analyses: why it is important, and how to correct it. BioEssays. 2013;35:780–786. doi: 10.1002/bies.201300014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Kim M.S., Patel K.P., Teng A.K., Berens A.J., Lachance J. Genetic disease risks can be misestimated across global populations. Genome Biol. 2018;19:179. doi: 10.1186/s13059-018-1561-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Lee S., Abecasis G.R., Boehnke M., Lin X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 2014;95:5–23. doi: 10.1016/j.ajhg.2014.06.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Buniello A., MacArthur J.A.L., Cerezo M., Harris L.W., Hayhurst J., Malangone C., McMahon A., Morales J., Mountjoy E., Sollis E., et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005–D1012. doi: 10.1093/nar/gky1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Visscher P.M., Wray N.R., Zhang Q., Sklar P., McCarthy M.I., Brown M.A., Yang J. 10 Years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 2017;101:5–22. doi: 10.1016/j.ajhg.2017.06.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Wilkinson M.D., Dumontier M., Aalbersberg I.J., Appleton G., Axton M., Baak A., Blomberg N., Boiten J.W., da Silva Santos L.B., Bourne P.E., et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016;3:160018. doi: 10.1038/sdata.2016.18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.McLaren W., Gil L., Hunt S.E., Riat H.S., Ritchie G.R., Thormann A., Flicek P., Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Sim N.-L., Kumar P., Hu J., Henikoff S., Schneider G., Ng P.C. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 2012;40:W452–W457. doi: 10.1093/nar/gks539. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P., Kondrashov A.S., Sunyaev S.R. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Rentzsch P., Witten D., Cooper G.M., Shendure J., Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019;47(D1):D886–D894. doi: 10.1093/nar/gky1016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Yu B., de Vries P.S., Metcalf G.A., Wang Z., Feofanova E.V., Liu X., Muzny D.M., Wagenknecht L.E., Gibbs R.A., Morrison A.C., Boerwinkle E. Whole genome sequence analysis of serum amino acid levels. Genome Biol. 2016;17:237. doi: 10.1186/s13059-016-1106-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Kim D., Basile A.O., Bang L., Horgusluoglu E., Lee S., Ritchie M.D., Saykin A.J., Nho K. Knowledge-driven binning approach for rare variant association analysis: application to neuroimaging biomarkers in Alzheimer’s disease. BMC Med. Inform. Decis. Mak. 2017;17(suppl 1):61. doi: 10.1186/s12911-017-0454-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Pujar S., O’Leary N.A., Farrell C.M., Loveland J.E., Mudge J.M., Wallin C., Girón C.G., Diekhans M., Barnes I., Bennett R., et al. Consensus coding sequence (CCDS) database: A standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Res. 2018;46(D1):D221–D228. doi: 10.1093/nar/gkx1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.de Vries P.S., Yu B., Feofanova E.V., Metcalf G.A., Brown M.R., Zeighami A.L., Liu X., Muzny D.M., Gibbs R.A., Boerwinkle E., Morrison A.C. Whole-genome sequencing study of serum peptide levels: The Atherosclerosis Risk in Communities study. Hum. Mol. Genet. 2017;26:3442–3450. doi: 10.1093/hmg/ddx266. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Gilly A., Suveges D., Kuchenbaecker K., Pollard M., Southam L., Hatzikotoulas K., Farmaki A.E., Bjornland T., Waples R., Appel E.V.R., et al. Cohort-wide deep whole genome sequencing and the allelic architecture of complex traits. Nat. Commun. 2018;9:4674. doi: 10.1038/s41467-018-07070-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.He Z., Xu B., Buxbaum J., Ionita-Laza I. A genome-wide scan statistic framework for whole-genome sequence data analysis. Nat. Commun. 2019;10:3018. doi: 10.1038/s41467-019-11023-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Sarnowski C., Satizabal C.L., DeCarli C., Pitsillides A.N., Cupples L.A., Vasan R.S., Wilson J.G., Bis J.C., Fornage M., Beiser A.S., et al. NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. TOPMed Neurocognitive Working Group Whole genome sequence analyses of brain imaging measures in the Framingham Study. Neurology. 2018;90:e188–e196. doi: 10.1212/WNL.0000000000004820. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Mak A.C.Y., White M.J., Eckalbar W.L., Szpiech Z.A., Oh S.S., Pino-Yanes M., Hu D., Goddard P., Huntsman S., Galanter J., et al. NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium Whole-genome sequencing of pharmacogenetic drug response in racially diverse children with asthma. Am. J. Respir. Crit. Care Med. 2018;197:1552–1564. doi: 10.1164/rccm.201712-2529OC. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Sapkota Y., Cheung Y.T., Moon W., Shelton K., Wilson C.L., Wang Z., Mulrooney D.A., Zhang J., Armstrong G.T., Hudson M.M., et al. Whole-genome sequencing of childhood cancer survivors treated with cranial radiation therapy identifies 5p15.33 locus for stroke: A report from the St. Jude Lifetime Cohort Study. Clin. Cancer Res. 2019;25:6700–6708. doi: 10.1158/1078-0432.CCR-19-1231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Monson E.T., Pirooznia M., Parla J., Kramer M., Goes F.S., Gaine M.E., Gaynor S.C., de Klerk K., Jancic D., Karchin R., et al. Assessment of whole-exome sequence data in attempted suicide within a bipolar disorder cohort. Mol. Neuropsychiatry. 2017;3:1–11. doi: 10.1159/000454773. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Gratten J., Zhao Q., Benyamin B., Garton F., He J., Leo P.J., Mangelsdorf M., Anderson L., Zhang Z.H., Chen L., et al. Whole-exome sequencing in amyotrophic lateral sclerosis suggests NEK1 is a risk gene in Chinese. Genome Med. 2017;9:97. doi: 10.1186/s13073-017-0487-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Scott W.K., Medie F.M., Ruffin F., Sharma-Kuinkel B.K., Cyr D.D., Guo S., Dykxhoorn D.M., Skov R.L., Bruun N.E., Dahl A., et al. Human genetic variation in GLS2 is associated with development of complicated Staphylococcus aureus bacteremia. PLoS Genet. 2018;14:e1007667. doi: 10.1371/journal.pgen.1007667. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Povysil G., Petrovski S., Hostyk J., Aggarwal V., Allen A.S., Goldstein D.B. Rare-variant collapsing analyses for complex traits: Guidelines and applications. Nat. Rev. Genet. 2019;20:747–759. doi: 10.1038/s41576-019-0177-4. [DOI] [PubMed] [Google Scholar]
- 31.Udagawa C., Horinouchi H., Shiraishi K., Kohno T., Okusaka T., Ueno H., Tamura K., Ohe Y., Zembutsu H. Whole genome sequencing to identify predictive markers for the risk of drug-induced interstitial lung disease. PLoS ONE. 2019;14:e0223371. doi: 10.1371/journal.pone.0223371. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Wolock C.J., Stong N., Ma C.J., Nagasaki T., Lee W., Tsang S.H., Kamalakaran S., Goldstein D.B., Allikmets R. A case-control collapsing analysis identifies retinal dystrophy genes associated with ophthalmic disease in patients with no pathogenic ABCA4 variants. Genet. Med. 2019;21:2336–2344. doi: 10.1038/s41436-019-0495-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Alkelai A., Greenbaum L., Heinzen E.L., Baugh E.H., Teitelbaum A., Zhu X., Strous R.D., Tatarskyy P., Zai C.C., Tiwari A.K., et al. New insights into tardive dyskinesia genetics: Implementation of whole-exome sequencing approach. Prog. Neuropsychopharmacol. Biol. Psychiatry. 2019;94:109659. doi: 10.1016/j.pnpbp.2019.109659. [DOI] [PubMed] [Google Scholar]
- 34.Popejoy A.B., Fullerton S.M. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Peterson R.E., Kuchenbaecker K., Walters R.K., Chen C.Y., Popejoy A.B., Periyasamy S., Lam M., Iyegbe C., Strawbridge R.J., Brick L., et al. Genome-wide association studies in ancestrally diverse populations: Opportunities, methods, pitfalls, and recommendations. Cell. 2019;179:589–603. doi: 10.1016/j.cell.2019.08.051. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Morales J., Welter D., Bowler E.H., Cerezo M., Harris L.W., McMahon A.C., Hall P., Junkins H.A., Milano A., Hastings E., et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 2018;19:21. doi: 10.1186/s13059-018-1396-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Mills M.C., Rahal C. A scientometric review of genome-wide association studies. Commun. Biol. 2019;2:9. doi: 10.1038/s42003-018-0261-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Hulsen T., de Vlieg J., Alkema W. BioVenn—A web application for the comparison and visualization of biological lists using area-proportional Venn diagrams. BMC Genomics. 2008;9:488. doi: 10.1186/1471-2164-9-488. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Sabo A., Mishra P., Dugan-Perez S., Voruganti V.S., Kent J.W., Jr., Kalra D., Cole S.A., Comuzzie A.G., Muzny D.M., Gibbs R.A., Butte N.F. Exome sequencing reveals novel genetic loci influencing obesity-related traits in Hispanic children. Obesity (Silver Spring) 2017;25:1270–1276. doi: 10.1002/oby.21869. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Locke A.E., Steinberg K.M., Chiang C.W.K., Service S.K., Havulinna A.S., Stell L., Pirinen M., Abel H.J., Chiang C.C., Fulton R.S., et al. FinnGen Project Exome sequencing of Finnish isolates enhances rare-variant association power. Nature. 2019;572:323–328. doi: 10.1038/s41586-019-1457-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Höglund J., Rafati N., Rask-Andersen M., Enroth S., Karlsson T., Ek W.E., Johansson Å. Improved power and precision with whole genome sequencing data in genome-wide association studies of inflammatory biomarkers. Sci. Rep. 2019;9:16844. doi: 10.1038/s41598-019-53111-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Jiang L., Zheng Z., Qi T., Kemper K.E., Wray N.R., Visscher P.M., Yang J. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 2019;51:1749–1755. doi: 10.1038/s41588-019-0530-8. [DOI] [PubMed] [Google Scholar]
- 43.Cirulli E.T., White S., Read R.W., Elhanan G., Metcalf W.J., Tanudjaja F., Fath D.M., Sandoval E., Isaksson M., Schlauch K.A., et al. Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts. Nat. Commun. 2020;11:542. doi: 10.1038/s41467-020-14288-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Zhao Z., Bi W., Zhou W., VandeHaar P., Fritsche L.G., Lee S. UK Biobank whole-exome sequence binary phenome analysis with robust region-based rare-variant test. Am. J. Hum. Genet. 2020;106:3–12. doi: 10.1016/j.ajhg.2019.11.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Long T., Hicks M., Yu H.C., Biggs W.H., Kirkness E.F., Menni C., Zierer J., Small K.S., Mangino M., Messier H., et al. Whole-genome sequencing identifies common-to-rare variants associated with human blood metabolites. Nat. Genet. 2017;49:568–578. doi: 10.1038/ng.3809. [DOI] [PubMed] [Google Scholar]
- 46.Hammer M.F., Ishii A., Johnstone L., Tchourbanov A., Lau B., Sprissler R., Hallmark B., Zhang M., Zhou J., Watkins J., Hirose S. Rare variants of small effect size in neuronal excitability genes influence clinical outcome in Japanese cases of SCN1A truncation-positive Dravet syndrome. PLoS ONE. 2017;12:e0180485. doi: 10.1371/journal.pone.0180485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Grant R.C., Denroche R.E., Borgida A., Virtanen C., Cook N., Smith A.L., Connor A.A., Wilson J.M., Peterson G., Roberts N.J., et al. Exome-wide association study of pancreatic cancer risk. Gastroenterology. 2018;154:719–722.e3. doi: 10.1053/j.gastro.2017.10.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Kwak S.H., Chae J., Lee S., Choi S., Koo B.K., Yoon J.W., Park J.H., Cho B., Moon M.K., Lim S., et al. Nonsynonymous variants in PAX4 and GLP1R are associated with type 2 diabetes in an East Asian population. Diabetes. 2018;67:1892–1902. doi: 10.2337/db18-0361. [DOI] [PubMed] [Google Scholar]
- 49.Sveinbjornsson G., Olafsdottir E.F., Thorolfsdottir R.B., Davidsson O.B., Helgadottir A., Jonasdottir A., Jonasdottir A., Bjornsson E., Jensson B.O., Arnadottir G.A., et al. Variants in NKX2-5 and FLNC Cause Dilated Cardiomyopathy and Sudden Cardiac Death. Circ. Genom. Precis. Med. 2018;11:e002151. doi: 10.1161/CIRCGEN.117.002151. [DOI] [PubMed] [Google Scholar]
- 50.MacArthur J.A.L., Buniello A., Harris L.W., Hayhurst J., McMahon A., Sollis E., Cerezo M., Hall P., Lewis E., Whetzel P.L., et al. Workshop proceedings—GWAS summary statistics standards and sharing. Cell Genomics. 2021;18 doi: 10.1016/j.xgen.2021.100004. 100004-1–100004-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Lappalainen T., Scott A.J., Brandt M., Hall I.M. Genomic analysis in the age of human genome sequencing. Cell. 2019;177:70–84. doi: 10.1016/j.cell.2019.02.032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Wright C.F., Ware J.S., Lucassen A.M., Hall A., Middleton A., Rahman N., Ellard S., Firth H.V. Genomic variant sharing: A position statement. Wellcome Open Res. 2019;4:22. doi: 10.12688/wellcomeopenres.15090.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Landrum M.J., Lee J.M., Benson M., Brown G.R., Chao C., Chitipiralla S., Gu B., Hart J., Hoffman D., Jang W., et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062–D1067. doi: 10.1093/nar/gkx1153. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Pérez-Palma E., Gramm M., Nürnberg P., May P., Lal D. Simple ClinVar: An interactive web server to explore and retrieve gene and disease variants aggregated in ClinVar database. Nucleic Acids Res. 2019;47(W1):W99–W105. doi: 10.1093/nar/gkz411. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Eilbeck K., Lewis S.E., Mungall C.J., Yandell M., Stein L., Durbin R., Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6:R44. doi: 10.1186/gb-2005-6-5-r44. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Lee K., Famiglietti M.L., McMahon A., Wei C.H., MacArthur J.A.L., Poux S., Breuza L., Bridge A., Cunningham F., Xenarios I., Lu Z. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput. Biol. 2018;14:e1006390. doi: 10.1371/journal.pcbi.1006390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Asanomi Y., Shigemizu D., Miyashita A., Mitsumori R., Mori T., Hara N., Ito K., Niida S., Ikeuchi T., Ozaki K. A rare functional variant of SHARPIN attenuates the inflammatory response and associates with increased risk of late-onset Alzheimer’s disease. Mol. Med. 2019;25:20. doi: 10.1186/s10020-019-0090-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Moore C., Blumhagen R.Z., Yang I.V., Walts A., Powers J., Walker T., Bishop M., Russell P., Vestal B., Cardwell J., et al. Resequencing study confirms that host defense and cell senescence gene variants contribute to the risk of idiopathic pulmonary fibrosis. Am. J. Respir. Crit. Care Med. 2019;200:199–208. doi: 10.1164/rccm.201810-1891OC. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Miller J.E., Metpally R.P., Person T.N., Krishnamurthy S., Dasari V.R., Shivakumar M., Lavage D.R., Cook A.M., Carey D.J., Ritchie M.D., et al. DiscovEHR collaboration Systematic characterization of germline variants from the DiscovEHR study endometrial carcinoma population. BMC Med. Genomics. 2019;12:59. doi: 10.1186/s12920-019-0504-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Jiang X., Zhang B., Zhao J., Xu Y., Han H., Su K., Tao J., Fan R., Zhao X., Li L., Li M.D. Identification and characterization of SEC24D as a susceptibility gene for hepatitis B virus infection. Sci. Rep. 2019;9:13425. doi: 10.1038/s41598-019-49777-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Lieberman S., Beeri R., Walsh T., Schechter M., Keret D., Half E., Gulsuner S., Tomer A., Jacob H., Cohen S., et al. Variable features of juvenile polyposis syndrome with gastric involvement among patients with a large genomic deletion of BMPR1A. Clin. Transl. Gastroenterol. 2019;10:e00054. doi: 10.14309/ctg.0000000000000054. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Musolf A.M., Ho W.S.C., Long K.A., Zhuang Z., Argersinger D.P., Sun H., Moiz B.A., Simpson C.L., Mendelevich E.G., Bogdanov E.I., et al. Small posterior fossa in Chiari I malformation affected families is significantly linked to 1q43–44 and 12q23–24.11 using whole exome sequencing. Eur. J. Hum. Genet. 2019;27:1599–1610. doi: 10.1038/s41431-019-0457-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Moawia A., Shaheen R., Rasool S., Waseem S.S., Ewida N., Budde B., Kawalia A., Motameny S., Khan K., Fatima A., et al. Mutations of KIF14 cause primary microcephaly by impairing cytokinesis. Ann. Neurol. 2017;82:562–577. doi: 10.1002/ana.25044. [DOI] [PubMed] [Google Scholar]
- 64.Dinckan N., Du R., Petty L.E., Coban-Akdemir Z., Jhangiani S.N., Paine I., Baugh E.H., Erdem A.P., Kayserili H., Doddapaneni H., et al. Whole-exome sequencing identifies novel variants for tooth agenesis. J. Dent. Res. 2018;97:49–59. doi: 10.1177/0022034517724149. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Di Rocco M., Rusmini M., Caroli F., Madeo A., Bertamino M., Marre-Brunenghi G., Ceccherini I. Novel spondyloepimetaphyseal dysplasia due to UFSP2 gene mutation. Clin. Genet. 2018;93:671–674. doi: 10.1111/cge.13134. [DOI] [PubMed] [Google Scholar]
- 66.Dapas M., Sisk R., Legro R.S., Urbanek M., Dunaif A., Hayes M.G. Family-based quantitative trait meta-analysis implicates rare noncoding variants in DENND1A in polycystic ovary syndrome. J. Clin. Endocrinol. Metab. 2019;104:3835–3850. doi: 10.1210/jc.2018-02496. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The public availability of full summary statistics from GWASs has great potential to extend the power of initial studies by enabling the community to re-analyze, meta-analyze, and perform follow-up analyses, with minimal risk to participants.11,50 We assessed whether summary statistics, in addition to individual-level genotyping results, were reported in these publications as available without restriction in a public repository. Sharing of sequencing-based single-variant summary statistics was much lower (5% of publications, n = 4/79, 2014–2019) than the proportion of array-based publications in the GWAS Catalog in the same period (12% of publications, n = 300/2,571, 2014–2019) (Table 1). Sharing of array-GWAS summary statistics is greater in recent years (19% of 2019 GWAS Catalog publications, n = 101/527), but seqGWAS summary statistics still lag (9%, n = 3/32). A further 2.5% of sequencing publications (n = 3/120, 2014–2019) deposited summary statistics in a controlled-access public repository (the Database of Genotypes and Phenotypes [dbGAP]). In contrast, 24% of publications (n = 29/120) deposited individual-level sequencing data in controlled access repositories (dbGAP or European Genome-Phenome Archive [EGA]) (Table S6) and, for some summary-level data, may have been co-submitted or bundled with those data but not specifically stated by the authors.
The data content of single-variant summary statistics for seqGWAS is comparable with that for standard-array GWASs and can conform to emerging standards.11,50 However, summary statistics for aggregate analysis in seqGWAS are commonly composed only of a gene name (or other range specifying chromosomal coordinates), p value, and often the number of contributing variants, sometimes separated by cases/controls. Crucially, we did not observe any publications that reported the list of variants included in each aggregate unit, which is key to interpretation of the data, either in the main text or in accompanying material.
Data underlying analyses in this paper are curated from the literature and are presented in Table S4.
This paper does not report original code.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.