Guidelines for Evaluating the Comparability of Down-Sampled GWAS Summary Statistics

Camille M Williams; Holly Poore; Peter T Tanksley; Hyeokmoon Kweon; Natasia S Courchesne-Krak; Diego Londono-Correa; Travis T Mallard; Peter Barr; Philipp D Koellinger; Irwin D Waldman; Sandra Sanchez-Roige; K Paige Harden; Abraham A Palmer; Danielle M Dick; Richard Karlsson Linnér

doi:10.1007/s10519-023-10152-z

2023 Sep 15;53(5-6):404–415. doi: 10.1007/s10519-023-10152-z

PMCID: PMC10584908 PMID: 37713023

Abstract

Proprietary genetic datasets are valuable for boosting the statistical power of genome-wide association studies (GWASs), but their use can restrict investigators from publicly sharing the resulting summary statistics. Although researchers can resort to sharing down-sampled versions that exclude restricted data, down-sampling reduces power and might change the genetic etiology of the phenotype being studied. These problems are further complicated when using multivariate GWAS methods, such as genomic structural equation modeling (Genomic SEM), that model genetic correlations across multiple traits. Here, we propose a systematic approach to assess the comparability of GWAS summary statistics that include versus exclude restricted data. Illustrating this approach with a multivariate GWAS of an externalizing factor, we assessed the impact of down-sampling on (1) the strength of the genetic signal in univariate GWASs, (2) the factor loadings and model fit in multivariate Genomic SEM, (3) the strength of the genetic signal at the factor level, (4) insights from gene-property analyses, (5) the pattern of genetic correlations with other traits, and (6) polygenic score analyses in independent samples. For the externalizing GWAS, although down-sampling resulted in a loss of genetic signal and fewer genome-wide significant loci, the factor loadings and model fit, gene-property analyses, genetic correlations, and polygenic score analyses were found to be robust. Given the importance of data sharing for the advancement of open science, we recommend that investigators who generate and share down-sampled summary statistics report these analyses as accompanying documentation to support other researchers’ use of the summary statistics.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10519-023-10152-z.

Keywords: Genomic SEM, Summary statistics, Data removal, Down-sample, Leave-one-out, Meta-analysis, Genomics, Genome-wide association study

Introduction

The success of genome-wide association studies (GWASs) depends on sample size (Abdellaoui et al. 2023). Accordingly, genetics researchers increasingly depend on public–private partnerships that pool data collected by academic researchers, national biobanks, and private companies. For example, the company 23andMe Inc. contributed an astonishing 2.5 million observations to a recent GWAS of height (Yengo et al. 2022). However, to protect their interests, private companies place restrictions on the public sharing of GWAS summary statistics and require a potentially lengthy and burdensome application process for researchers to gain access. In some cases, researchers’ institutions are unwilling to agree to the legal terms set by private companies in their material transfer agreements. These restrictions pose a challenge to scientific transparency and slow the pace of genetic discovery.

To address this challenge, researchers can publicly share down-sampled GWAS summary statistics that exclude restricted data (Coleman et al. 2020; Lee et al. 2018; Yengo et al. 2022). This is an imperfect solution, as leaving out a large part of the study sample not only reduces power but can also change the genetic etiology of the trait being studied, potentially leading to substantial differences in downstream analyses (Vlaming et al. 2017). For instance, down-sampling could influence estimates of genetic correlations with other traits, associations in polygenic score analyses, and insights from bioannotation analyses. We are only aware of one study investigating the effects of excluding restricted data from a univariate depression GWAS (Coleman et al. 2020), prior to including them in a meta-analysis of mood disorders. The authors examined the robustness of SNP heritability estimates, genetic correlations, and gene identification. Although they identified fewer variants in the down-sampled analyses, results were otherwise similar, suggesting that excluding data in their study did not markedly change the genetic etiology of their focal phenotype. However, most of the studies providing down-sampled summary statistics have not evaluated the comparability with restricted data counterparts (Lee et al. 2018; Liu et al. 2019; Wray et al. 2018).

There have been few, if any, systematic investigations of how down-sampling affects results from multivariate GWASs. Multivariate GWAS methods, such as genomic structural equation modeling (Genomic SEM; Grotzinger et al. 2019), have become increasingly popular, as there is substantial genetic overlap across psychiatric and behavioral phenotypes. Genomic SEM models the shared genetic architecture among traits with latent factors representing cross-cutting genetic liabilities. Rather than just examining genetic associations with individual phenotypes, Genomic SEM enables the identification of shared genes. As in phenotypic factor analysis, the construct represented by latent factors could be sensitive to the choice of indicator phenotypes used in the factor analysis, or the construct might be fairly robust to this decision (Johnson et al. 2004, 2008). Using down-sampled univariate GWAS summary statistics as inputs in Genomic SEM could, therefore, identify a genetic factor structure that occupies a different position in genetic multivariate space. Yet, no studies to our knowledge have examined how down-sampling affects multivariate GWAS in the context of Genomic SEM.

Here, we present a systematic approach to assess the comparability of down-sampled summary statistics with their full data counterparts and examine their suitability for typical follow-up analyses. We used externalizing, a latent factor representing a cross-cutting liability to behaviors and disorders characterized by problems with self-regulation, as our model phenotype. A previous multivariate GWAS by the Externalizing Consortium identified several hundred genomic loci associated with an externalizing (EXT) factor, reflecting shared genetic liability among seven indicator phenotypes (Karlsson Linnér et al. 2021): (1) attention-deficit/hyperactivity disorder (ADHD; Demontis et al. 2019), (2) problematic alcohol use (ALCP; Sanchez-Roige et al. 2019), (3) lifetime cannabis use (CANN; Pasman et al. 2018), (4) reverse-coded age at first sexual intercourse (FSEX; Karlsson Linnér et al. 2019), (5) number of sexual partners (NSEX; Karlsson Linnér et al. 2019), (6) general risk tolerance (RISK; Karlsson Linnér et al. 2019), and (7) lifetime smoking initiation (SMOK; Liu et al. 2019). However, the univariate GWASs on two of the seven phenotypes, SMOK and CANN, contain restricted data, which limits public sharing of the summary statistics from this multivariate GWAS (hereafter, the original study on externalizing).

Therefore, we developed the following six steps to investigate the robustness of down-sampling and applied them to our scenario of assessing the impact of excluding restricted data from the original study on externalizing (Karlsson Linnér et al. 2021). As an initial check, we recommend that authors who generate and share down-sampled summary statistics report whether the genetic correlation between the full and down-sampled version is less than unity, suggesting an imperfect overlap of GWAS coefficients and genetic etiology. The greater the discrepancy between the genetic correlation of the full and down-sampled GWASs on the same trait, the more important it is to evaluate the comparability of down-sampled analyses.

We recommend that investigators who share down-sampled summary statistics generated with multivariate GWAS methods (e.g., Genomic SEM) report all six steps as supporting documentation, while steps 2–3 can be skipped when generating down-sampled univariate GWAS:

1. What is the loss of genetic signal in down-sampled univariate GWASs (which may later be used as indicator phenotypes in Genomic SEM)?
2. How do the factor loadings and factor model fit differ in multivariate Genomic SEM when the indicator phenotypes are down-sampled univariate GWASs?
3. What is the loss of genetic signal at the factor level of multivariate GWAS when the indicator phenotypes are down-sampled univariate GWASs?
4. How similar are gene-property analyses when using down-sampled GWASs?
5. How similar is the pattern of genetic correlations with other traits when using down-sampled GWASs?
6. How much explanatory power is lost when using polygenic scores (PGSs) constructed from down-sampled GWASs?

Methods

The code is publicly available here: https://github.com/Camzcamz/EXTminus23andMe, and the GWAS summary statistics on externalizing that excluded restricted data from 23andMe (“EXT-minus-23andMe”) are available here: https://externalizing.rutgers.edu/request-data/.

1. What is the loss of genetic signal in down-sampled univariate GWASs?

The following five key indicators are useful for evaluating the loss of genetic signal in down-sampled univariate GWASs: (1) effective sample size (EffN), (2) heritability, (3) mean χ², (4) genomic inflation factor, and (5) the LD Score regression attenuation/stratification bias ratio (see formula in Table 1). EffN is a transformation relevant for GWAS on binary traits that transforms an unbalanced number of cases and controls to effectively reflect the sample size of a balanced analysis (i.e., 50% cases). For a meta-analysis of k cohort-level univariate summary statistics, it is the sum of ${EffN}_{k} = 4 \times V_{k} (1 - V_{k}) N_{k}$, where $V_{k}$ is the cohort-specific proportion of cases, and $N_{k}$ is the total number of cases and controls. For GWAS on continuous traits, EffN can be replaced by the total sample size (N). The remaining four key indicators are standard estimates of LD Score regression (version 1.0.1; Bulik-Sullivan et al. 2015).
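The EffN formula above can be illustrated with a minimal sketch. The function name `effective_n` and the cohort counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def effective_n(cohorts):
    """Sum of cohort-level effective sample sizes.

    `cohorts` is an iterable of (n_cases, n_controls) pairs, one per cohort.
    """
    total = 0.0
    for n_cases, n_controls in cohorts:
        n = n_cases + n_controls          # N_k: cases plus controls
        v = n_cases / n                   # V_k: cohort-specific proportion of cases
        total += 4.0 * v * (1.0 - v) * n  # EffN_k = 4 * V_k * (1 - V_k) * N_k
    return total


# Hypothetical cohorts of equal total size (illustrative counts only).
balanced = effective_n([(5_000, 5_000)])    # 50% cases: EffN equals N
unbalanced = effective_n([(1_000, 9_000)])  # 10% cases: EffN is well below N
```

In this toy example the balanced cohort keeps its full size, whereas the unbalanced cohort of the same total size contributes only 36% of it.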

Table 1.

Summary of GWAS summary statistics with and without 23andMe data for seven externalizing-related disorders and behaviors

Phenotype	EXT (Karlsson Linnér et al. 2021)						Down-sampled EXT (EXT-minus-23andMe)
Phenotype	Max N (EffN)	h² (SE)	λ_GC	Mean χ2	Intercept	Ratio	Max N (EffN)	h² (SE)	λ_GC	Mean χ2	Intercept	Ratio
ADHD	53,293 (49,017)	0.235 (0.015)	1.25	1.297	1.034	0.113	53,293 (49,017)	0.260 (0.017)	1.25	1.297	1.034	0.113
ALCP	164,684 (150,640)	0.055 (0.004)	1.15	1.174	1.013	0.073	164,684 (150,640)	0.055 (0.004)	1.15	1.174	1.013	0.073
CANN	186,875 (179,534)	0.066 (0.004)	1.23	1.267	1.026	0.098	164,192 (157,230)	0.068 (0.004)	1.22	1.245	1.028	0.113
FSEX*	357,187	0.115 (0.004)	1.62	1.869	1.036	0.041	357,187	0.115 (0.004)	1.63	1.868	1.036	0.041
NSEX	336,121	0.097 (0.004)	1.49	1.682	1.027	0.041	336,121	0.099 (0.004)	1.49	1.674	1.027	0.041
RISK	426,379	0.053 (0.002)	1.37	1.461	1.019	0.041	426,379	0.053 (0.002)	1.37	1.461	1.019	0.041
SMOK	1,251,809 (1,232,397)	0.078 (0.002)	2.33	3.152	1.126	0.058	652,520 (652,518)	0.079 (0.003)	1.73	2.062	1.037	0.035

EXT: Externalizing. The down-sampled rows are CANN and SMOK (bolded in the original table). EffN = sum of cohort-level effective sample sizes. The statistics reported in this table were all estimated with LD Score regression (v1.0.1) (Bulik-Sullivan et al. 2015): Heritability (h²) is on the observed scale. The genomic inflation factor, λ_GC, is the median χ² statistic divided by the expected median of the χ² distribution with 1 degree of freedom. Mean χ² is the average χ² statistic. Intercept is the estimated LD Score regression intercept. The ratio measures stratification bias, defined as (intercept − 1)/(mean χ² − 1).

ADHD attention-deficit/hyperactivity disorder; ALCP problematic alcohol use; CANN lifetime cannabis use; FSEX age at first sexual intercourse (reverse coded*); NSEX number of sexual partners; RISK risk tolerance; SMOK lifetime tobacco initiation

*Age at first sex was reverse coded so as to expect a positive relationship with EXT and EXT-minus-23andMe
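The two definitions given in the table notes (the genomic inflation factor and the ratio) can be written out as a short sketch. The function names are hypothetical, and in practice these quantities are reported directly by LD Score regression rather than computed by hand.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom, obtained as the
# square of the 0.75 quantile of the standard normal (approximately 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def lambda_gc(chi2_stats):
    """Genomic inflation factor: median chi-square over its expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def attenuation_ratio(intercept, mean_chi2):
    """Ratio as defined in the table notes: (intercept - 1) / (mean chi-square - 1)."""
    return (intercept - 1.0) / (mean_chi2 - 1.0)


# Down-sampled SMOK row of Table 1: intercept 1.037 and mean chi-square 2.062.
smok_ratio = attenuation_ratio(1.037, 2.062)
```

Because the table reports rounded inputs, recomputing the ratio from other rows can differ from the tabulated value in the third decimal.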

In addition to evaluating the loss of genetic signal, we recommend three checks to examine concordance in GWAS coefficients ($β$), which should preferably be applied to near-independent SNPs (Step 3 explains a standard pruning procedure to find near-independence). If correlated SNPs are included, larger LD blocks will be given more weight. Depending on the power of the full-data GWAS, the checks could be applied only to genome-wide significant hits or could be expanded to a less stringent threshold (say, P < 1 × 10^–5). The three checks are to (1) test for sign concordance, (2) inspect for outliers, and (3) run a regression of the down-sampled GWAS coefficients on their full-data counterparts (as absolute values).

Sign concordance can be evaluated by reporting the proportion of SNPs that have concordant direction of effect or by performing a binomial test. The binomial test requires an assumed null hypothesis of the true probability of success, which we set to 99% to make the test sensitive enough to detect minor deviations from near-perfect concordance (100% is too sensitive as a single discordant observation will reject the null). Power calculations show that 150 independent SNPs provide ≥ 80% power to reject this null even if the true, imperfect concordance is as high as 95%. To detect outliers, we suggest evaluating whether the down-sampled GWAS coefficients fall outside the 95% confidence intervals of their full-data counterparts. If outliers are detected, then we recommend adding an extra indicator column to the down-sampled summary statistics to allow users to filter out SNPs with deviating down-sampled GWAS coefficients. The regression analysis of the down-sampled coefficients on the full-data coefficients should investigate whether (a) the intercept is zero, (b) the regression coefficient is unity (i.e., the fitted line follows the diagonal), and (c) the adjusted coefficient of determination (adj. R²) is high. These checks are applicable to both univariate and multivariate GWAS (thus, also in Step 3). Here, because we did not generate any down-sampled univariate summary statistics to be disseminated, we report these checks only for the down-sampled multivariate GWAS on externalizing.
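The binomial sign-concordance test and its power calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the alternative is one-sided (concordance below the 99% null), the significance level is 0.05, and all function names are hypothetical. The text does not specify these details, so the sketch should not be read as the exact procedure used.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(min(k, n) + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def sign_concordance_p(n_concordant, n_snps, null_rate=0.99):
    """One-sided P value for concordance below the null rate (assumed design)."""
    return binom_cdf(n_concordant, n_snps, null_rate)


def power_to_reject(n_snps, true_rate, null_rate=0.99, alpha=0.05):
    """Probability of rejecting the null when concordance equals `true_rate`."""
    critical = -1  # largest concordant count that still rejects at level alpha
    for k in range(n_snps + 1):
        if binom_cdf(k, n_snps, null_rate) <= alpha:
            critical = k
        else:
            break
    return binom_cdf(critical, n_snps, true_rate)


# All SNPs concordant: the lower-tail P value is 1, so the null is not rejected.
p_all_concordant = sign_concordance_p(842, 842)
# Power with 150 independent SNPs when the true concordance is 95%.
power_150 = power_to_reject(150, 0.95)
```

With a lower-tail test, perfect concordance always returns P = 1, since no outcome can be more concordant than the one observed.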

To generate a down-sampled version of the multivariate GWAS on externalizing, we first down-sampled the univariate GWASs of SMOK and CANN by mirroring the meta-analysis protocol of the original study (Karlsson Linnér et al. 2021) while excluding restricted 23andMe data. We then used these five key indicators to assess the loss of genetic signal in the down-sampled univariate GWASs. Finally, we estimated genetic correlations among the seven indicator phenotypes in the down-sampled analysis using LD Score regression (Bulik-Sullivan et al. 2015) and compared them to genetic correlations among the indicator phenotypes in the original study.

Stable heritability estimates and attenuation ratios across the original and down-sampled indicators should yield comparable factor loadings in the down-sampled Genomic SEM factor analysis (Step 2), whereas loss of genetic signal, indicated by a decrease in mean χ², should yield larger standard errors in the factor analysis and loss of statistical power to detect SNP effects in the multivariate GWAS (Step 3).

2. How do the factor loadings and factor model fit differ in Genomic SEM when the indicator phenotypes are down-sampled univariate GWASs?

Genomic SEM is a flexible modeling approach that (1) estimates an empirical genetic covariance matrix and sampling covariance matrix from input GWAS summary statistics, and (2) evaluates a set of conventional parameters for structural equation modeling, such as factor loadings and residual variances, to minimize the discrepancy between the model-implied and empirical genetic covariance matrices (Grotzinger et al. 2019). Typically, several alternative models are compared (e.g., a single-factor model versus a two-factor model) followed by multivariate GWAS to estimate SNP effects on each of the factors in the preferred factor solution (Step 3).

To assess the impact of down-sampling on the factor loadings and model fit, we suggest forcing the best-fitting factor solution from the Genomic SEM analysis of the full dataset (that includes restricted data) onto the empirical genetic covariance matrix of the down-sampled summary statistics, and then evaluating the stability of the factor loadings and factor model fit indicators (e.g., the comparative fit index or the root mean square residual). We do not suggest searching for a better factor solution with the down-sampled indicators because the aim is to evaluate whether down-sampled analyses are representative of their corresponding versions with restricted data.

Thus, we ran the best-fitting Genomic SEM factor model of the original study (Karlsson Linnér et al. 2021): a single-factor model with seven indicator phenotypes (ADHD, ALCP, CANN, FSEX, NSEX, RISK, and SMOK), using unit variance identification of the factor model without SNP effects. However, in the analysis reported here, the input summary statistics for SMOK and CANN were replaced by down-sampled versions (see Step 1). We refer to the original factor model based on analyses with 23andMe data as the EXT factor and the down-sampled version as the EXT-minus-23andMe factor.

3. What is the loss of genetic signal at the factor level of down-sampled multivariate GWAS when the indicator phenotypes are down-sampled univariate GWASs?

After conducting a multivariate GWAS on the latent factors in down-sampled analyses with Genomic SEM, the loss of genetic signal at the factor level can be assessed by (i) examining the genetic correlation between the respective latent factors of the full and down-sampled summary statistics using bivariate LD Score regression (Bulik-Sullivan et al. 2015) and by (ii) estimating the decrease in genetic signal with key indicators (1), (3), and (4) from Step 1. Please note that key indicators (2) and (5) are not used to evaluate the genetic signal of the latent factor because they are not clearly defined (e.g., heritability is defined as a ratio with phenotypic variance as the denominator, which is absent in latent genetic factors).

To generalize the loss of statistical power to identify individual SNP effects, we need to make assumptions about their magnitude. One approach is to compute the squared standardized coefficients, approximated as $r^{2} = Z^{2} / N$, and then evaluate the median among the subset of genome-wide significant SNPs (P < 5 × 10^–8) in the down-sampled GWAS. Given that statistical power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true, it can be computed as $1 - {CDF}_{λ} [χ_{1}^{2}(c)]$, where ${CDF}_{λ}$ is the cumulative distribution function for a χ² distribution with 1 degree of freedom and the non-centrality parameter $λ = N r^{2}$. The sample size, $N$, is set to the EffN of the summary statistics being evaluated. The term $χ_{1}^{2}(c)$ is the critical value (~29.7) at the threshold of genome-wide significance (P < 5 × 10^–8) for a χ²-test with 1 degree of freedom. As a complement, we suggest evaluating the power to detect arbitrary effect-size magnitudes, for which we selected three magnitudes representative of effects reaching genome-wide significance in recent large-scale GWAS ($r^{2} =$ 0.003%, 0.004%, or 0.005%). Because power loss is more noticeable at the level of individual SNPs compared to methods that aggregate genetic signal among sets of SNPs or genome-wide, we recommend researchers interested in following up on individual SNPs use the original and not the down-sampled summary statistics for best precision.
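The power formula above can be evaluated without a dedicated non-central χ² routine, because a non-central χ² variable with 1 degree of freedom is the square of a standard normal variable shifted by the square root of the non-centrality parameter. The sketch below relies on that identity. The function name is hypothetical, and the full-data sample size is back-calculated from the statement that the down-sampled EffN is about 70.1% of the full one, so it is approximate.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()


def gwas_power(r2, n, alpha=5e-8):
    """Power of a 1-df chi-square test with non-centrality parameter n * r2."""
    z_crit = -_norm.inv_cdf(alpha / 2.0)  # square root of the critical value
    shift = sqrt(n * r2)                  # square root of the non-centrality
    # P[(Z + shift)^2 > z_crit^2] for a standard normal Z: upper plus lower tail.
    return (1.0 - _norm.cdf(z_crit - shift)) + _norm.cdf(-z_crit - shift)


# Critical value at genome-wide significance (approximately 29.7).
critical_value = _norm.inv_cdf(5e-8 / 2.0) ** 2

n_down = 1_045_957       # EffN of the down-sampled multivariate GWAS
n_full = n_down / 0.701  # ASSUMPTION: back-calculated from "about 70.1%"
median_r2 = 0.000038     # median r^2 of 0.0038% among significant SNPs

power_loss = gwas_power(median_r2, n_full) - gwas_power(median_r2, n_down)
```

With these inputs, the difference in power at the median effect size comes out close to the 17.8 percentage points reported for this step.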

As in the original study (Karlsson Linnér et al. 2021), we estimated individual SNP effects on the latent EXT-minus-23andMe factor with Genomic SEM, which we refer to as the EXT-minus-23andMe summary statistics. We then evaluated the loss of signal at the factor level. We expect the loss of power to be more noticeable at the level of individual loci compared to the follow-up analyses presented below, which aggregate genetic signal across larger sets of SNPs or genome wide. Lastly, we examined the concordance of GWAS coefficients on the latent factor per the three-check procedure outlined in Step 1.
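The outlier check and the regression check from the three-check procedure can be sketched with plain least squares. The function names and the toy coefficients are hypothetical; the sketch assumes a 95% confidence interval of ±1.96 standard errors and reports the point estimates only, not the tests of whether the intercept equals zero or the slope equals unity.

```python
def flag_outliers(beta_down, beta_full, se_full, z=1.96):
    """True where the down-sampled estimate lies outside the full-data 95% CI."""
    return [abs(bd - bf) > z * se
            for bd, bf, se in zip(beta_down, beta_full, se_full)]


def regress_abs(beta_down, beta_full):
    """OLS of |beta_down| on |beta_full|: (intercept, slope, adjusted R^2)."""
    y = [abs(b) for b in beta_down]
    x = [abs(b) for b in beta_full]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return intercept, slope, adj_r2


# Toy coefficients (illustrative only): down-sampled effects shrunk by 10%.
toy_full = [0.010, -0.020, 0.030, 0.040, -0.050]
toy_down = [0.9 * b for b in toy_full]
toy_intercept, toy_slope, toy_adj_r2 = regress_abs(toy_down, toy_full)
toy_flags = flag_outliers([0.010, 0.050], [0.010, 0.010], [0.005, 0.005])
```

The list returned by `flag_outliers` is the kind of indicator column that could accompany disseminated summary statistics so that users can filter deviating SNPs.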

Because of its accessibility and ease of use, we recommend using FUMA to find near-independent genome-wide significant “lead SNPs”. FUMA conducts conventional linkage-disequilibrium (LD) informed pruning (“clumping”). The default settings are sensible to use in most cases. FUMA computes LD with the publicly available European subsample of the 1000 Genomes Phase 3 reference panel as the default setting (though, researchers should depart from this default to match the genetic ancestry of the summary statistics being evaluated). The default settings largely overlap with those of the original study on EXT (importantly, the LD r² threshold of 0.1 to define lead SNPs is identical), though the original study used a larger restricted-access reference panel that combined the 1000 Genomes Phase 3 reference panel with other reference data.
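The idea behind LD-informed pruning can be illustrated with a generic greedy sketch. This is not FUMA's implementation: FUMA applies additional parameters (such as a second LD threshold and distance-based merging of loci), and the function name, SNP identifiers, and LD lookup below are all hypothetical.

```python
def greedy_clump(p_values, ld_r2, p_threshold=5e-8, r2_threshold=0.1):
    """Return lead SNP IDs chosen greedily by ascending P value.

    `p_values` maps SNP ID to P value; `ld_r2(a, b)` is a caller-supplied
    function returning the LD r^2 between two SNPs from a reference panel.
    """
    candidates = sorted((p, snp) for snp, p in p_values.items() if p < p_threshold)
    leads = []
    for _, snp in candidates:
        # Keep a SNP only if it is near-independent of every lead chosen so far.
        if all(ld_r2(snp, lead) < r2_threshold for lead in leads):
            leads.append(snp)
    return leads


# Toy example with hypothetical SNP IDs: snpA and snpB are in strong LD.
toy_p = {"snpA": 1e-20, "snpB": 1e-12, "snpC": 1e-9, "snpD": 1e-3}
toy_ld = {frozenset(("snpA", "snpB")): 0.8}


def toy_ld_r2(a, b):
    return toy_ld.get(frozenset((a, b)), 0.0)


toy_leads = greedy_clump(toy_p, toy_ld_r2)
```

In the toy example, snpB is dropped because it is in LD with the stronger snpA, and snpD is dropped because it does not reach genome-wide significance.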

4. How similar are gene-property analyses when using down-sampled GWASs?

The biological correspondence of down-sampled univariate or multivariate GWAS can be evaluated by comparing the results from the Multi-marker analysis of genomic annotation (MAGMA) gene-property analyses in the SNP2GENE function of the Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA; Watanabe et al. 2017; version 1.5.0e) software using Spearman rank correlations of point estimates.

As done in the original paper, we ran gene-property analyses on the EXT-minus-23andMe summary statistics to (1) test 54 tissue-specific gene expression profiles, and (2) test gene expression profiles across 11 brain tissues and developmental stages with reference data from BrainSpan (Allen Institute for Brain Science 2022). We used the default settings of SNP2GENE, which match those used to conduct the gene-based analyses reported in the original study (Tables S3, 4).

5. How similar is the pattern of genetic correlations with other traits when using down-sampled GWASs?

To assess the convergent and discriminant validity of down-sampled multivariate GWAS on latent factors, we can examine potential changes in the pattern of genetic correlation with other traits. If the down-sampled analysis tags the same genetic etiology, the confidence intervals of the point estimates should display considerable overlap. The overall pattern can be examined by estimating the rank correlation of the point estimates across traits, whereas significance of changes to individual genetic correlations can be assessed using a t-test.
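The rank correlation of point estimates and the confidence-interval overlap check can be sketched as follows. The function names and the toy genetic correlations are hypothetical, and a 95% interval of ±1.96 standard errors is assumed.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ci_overlap(est_a, se_a, est_b, se_b, z=1.96):
    """True if the two 95% confidence intervals overlap."""
    return (est_a - z * se_a) <= (est_b + z * se_b) and \
           (est_b - z * se_b) <= (est_a + z * se_a)


# Toy genetic correlations with four traits (illustrative values only).
rg_full = [0.10, 0.50, -0.30, 0.80]
rg_down = [0.12, 0.45, -0.25, 0.90]
toy_rho = spearman(rg_full, rg_down)
```

The same rank correlation applies to the point estimates from the gene-property analyses in Step 4 and the polygenic score associations in Step 6.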

The original study estimated genetic correlations between EXT and 91 other traits (Karlsson Linnér et al. 2021). Here, we performed the same analysis for EXT-minus-23andMe and then examined whether the pattern of genetic overlap was preserved after removing restricted data. Since the summary statistics of some of the 91 traits in the original study include restricted data, we conducted these analyses on the 79 traits with publicly available data.

6. How much explanatory power is lost when using polygenic scores (PGSs) constructed from down-sampled GWASs?

Generally, the loss of genetic signal from down-sampling will only exacerbate the problem of measurement error in PGSs constructed with finite-sample estimates as weights (Becker et al. 2021). As one of the most common third-party applications of publicly available GWAS summary statistics, we strongly encourage researchers to evaluate the loss of explanatory power in their main PGS analysis before they share down-sampled summary statistics with other users. This loss can be evaluated (i) across traits, as indicated by the overall reduction in variance explained (R²/pseudo-R²) and (ii) with the rank correlation of point estimates to evaluate the comparability of the overall pattern of polygenic score associations.

Following the original study protocol (Karlsson Linnér et al. 2021), we constructed PGSs in two hold-out samples: the Collaborative Study on the Genetics of Alcoholism (COGA; Begleiter 1995; Bucholz et al. 2017; Edenberg 2002; N = 7594) and the National Longitudinal Study of Adolescent to Adult Health (Add Health; Harris et al. 2013; McQueen et al. 2015; N = 5107). We constructed the PGSs from the EXT-minus-23andMe summary statistics (EXT-minus-23andMe PGS), adjusted for LD with PRS-CS (version 20 October 2019; Ge et al. 2019), which restricts the PGS to ~1 million HapMap3 SNPs. The default settings are sensible for most standard uses (Bayesian gamma-gamma prior of 1 and 0.5, and 1000 Monte Carlo iterations with 500 burn-in iterations).

We compared the explanatory power of the EXT-minus-23andMe PGSs with the one reported in the original study from analyses of a phenotypic externalizing factor, followed by a set of outcomes related to, or affected by, externalizing behaviors and disorders (e.g., smoking initiation, substance-use disorders, or childhood developmental disorders) (Table S6). Linear regression was applied to continuous outcomes and logistic regression to dichotomous outcomes. We evaluated the incremental R²/pseudo-R² by subtracting the variance explained by a baseline model with only covariates (age, sex, and the first ten genetic principal components) from the variance explained by a model with the covariates and PGS. Confidence intervals were estimated with the percentile bootstrap method (1000 iterations). We then evaluated whether the coefficient estimates of the down-sampled EXT-minus-23andMe PGSs were comparable to the estimates of the PGS of EXT from the original paper.
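The incremental R² and its percentile bootstrap interval can be sketched for a continuous outcome as below. This is a minimal standard-library illustration with hypothetical function names; it covers linear regression only (not the pseudo-R² used for dichotomous outcomes), and a real analysis would normally rely on a statistics package.

```python
import random


def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(rows, y):
    """R^2 of an OLS fit of y on the predictors in `rows` plus an intercept."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = _solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(b * di for b, di in zip(beta, d))) ** 2
                 for d, yi in zip(design, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def incremental_r2(covariates, pgs, y):
    """R^2 of covariates plus PGS minus R^2 of the covariate-only baseline."""
    full = [list(c) + [g] for c, g in zip(covariates, pgs)]
    return r_squared(full, y) - r_squared(covariates, y)


def bootstrap_ci(covariates, pgs, y, n_boot=1000, seed=0):
    """95% percentile bootstrap interval, resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(incremental_r2([covariates[i] for i in idx],
                                    [pgs[i] for i in idx],
                                    [y[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]
```

Running the same functions once with the full-data PGS and once with the down-sampled PGS gives the pair of estimates whose difference quantifies the loss of explanatory power.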

We are aware of recent suggestions to evaluate the squared (semi-)partial correlation in favor of the incremental R²/pseudo-R², but the results of these two alternative approaches are often highly similar (except for certain phenotypes, e.g., height). For comparability with the original study, we retained the incremental R²/pseudo-R² measure.

Results

1. What is the loss of genetic signal in down-sampled univariate GWASs?

In the initial check of genetic overlap between the full and down-sampled summary statistics of the same trait, we found genetic correlations close to, but still significantly less than unity: 0.966 (SE = 0.007) for SMOK and 0.953 (SE = 0.012) for CANN, which motivated us to apply our approach to evaluate the comparability of the down-sampled summary statistics to those from the original paper.
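One simple way to check whether a genetic correlation is below unity is a one-sided Wald test using the reported standard error. The text does not state which test was used, so the sketch below is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist


def rg_below_unity(rg, se):
    """One-sided Wald test of H0: rg = 1 against rg < 1 (assumed test)."""
    z = (1.0 - rg) / se
    p = 1.0 - NormalDist().cdf(z)
    return z, p


# Estimates reported above for the two down-sampled indicator phenotypes.
z_smok, p_smok = rg_below_unity(0.966, 0.007)
z_cann, p_cann = rg_below_unity(0.953, 0.012)
```

Under this assumed test, both estimates lie several standard errors below 1, in line with the statement that they are significantly less than unity.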

The loss of genetic signal was evaluated using the five key indicators. First, down-sampling reduced the EffN of the two univariate GWASs on SMOK and CANN by about 47% and 12%, respectively (Table 1), which is a marked reduction with potential downstream consequences. However, down-sampling did not meaningfully impact either the heritability estimates or the attenuation/stratification bias ratio, which is important for expecting a comparable factor structure in the multivariate analysis below. Similarly, down-sampling did not meaningfully influence the genetic correlations among the seven indicator phenotypes (Fig. 1, Table S1), which increases the likelihood of obtaining a similar factor structure.
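The EffN reductions quoted above follow directly from Table 1, as this short worked example shows. The helper name `percent_change` is hypothetical.

```python
def percent_change(full, down):
    """Relative change from the full-data value to the down-sampled value, in %."""
    return 100.0 * (down - full) / full


# EffN values from Table 1 (full data, then down-sampled).
smok_effn_change = percent_change(1_232_397, 652_518)
cann_effn_change = percent_change(179_534, 157_230)
```

The same helper can be applied to the mean χ² and genomic inflation factor columns to express the loss of genetic signal in relative terms.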

Fig. 1 — LD Score genetic correlations and heritability estimates for the seven indicator phenotypes of the single-factor models of EXT and EXT-minus-23andMe (see Step 1). The left panel displays the analysis of the original study with 23andMe data, the middle panel displays the down-sampled analysis excluding 23andMe data, and the right panel displays the difference in estimates computed by subtracting the values in the middle panel from those in the left panel. The lower and upper triangles display pairwise genetic correlation (r_g) estimates and standard errors, respectively. The diagonals display the observed-scale heritability (h²; see Table 1 for standard errors). These results are also reported in Table S1. *ADHD* attention-deficit/hyperactivity disorder; *ALCP* problematic alcohol use; *CANN* lifetime cannabis use; *FSEX* age at first sexual intercourse (reverse coded); *NSEX* number of sexual partners; *RISK* risk tolerance; *SMOK* lifetime tobacco initiation

Nevertheless, there was a noticeable loss of genetic signal as measured by mean χ² and the genomic inflation factor. The greatest decrease was observed for the down-sampled GWAS on SMOK (Δ mean χ² = 2.06 − 3.15 = −1.09; −34.6%), while the decrease for CANN was less pronounced (−1.3%). Similar decreases were observed for the genomic inflation factor: −25.9% and −1.0% for SMOK and CANN, respectively. The overall stability we observed for the heritability estimates and attenuation ratios suggests that the factor loadings in the down-sampled Genomic SEM factor analysis will resemble those of the original paper (Step 2). The decrease in genetic signal in SMOK and CANN should translate into larger standard errors in the factor analysis and loss of statistical power to detect SNP effects in the multivariate GWAS of EXT-minus-23andMe (Step 3).

2. How do the factor loadings and factor model fit differ in multivariate Genomic SEM when the indicator phenotypes are down-sampled univariate GWASs?

The factor loadings, residual variances, and model fit statistics were comparable in the down-sampled single factor solution (Fig. 2; Table S2). Neither the factor loadings nor residual variances were statistically different from the original estimates (a path diagram of the original estimates was therefore omitted). The largest non-significant difference was observed for the factor loading of the indicator phenotype RISK, which increased from 0.54 (SE = 0.03) to 0.56 (SE = 0.03). A similar-sized, non-significant decrease was observed for CANN: from 0.77 (SE = 0.03) to 0.75 (SE = 0.03). Furthermore, the comparative fit index (CFI) and standardized root mean square residual (SRMR) were similar between the down-sampled and original factor models and were within the preregistered thresholds for “good fit” (i.e., CFI > 0.9, and SRMR < 0.08) of the original study. In our example, we obtain close to identical factor loadings and model fit when applying the best-fitting factor solution of the original study to the empirical genetic covariance matrix of the down-sampled summary statistics.
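A rough way to gauge whether two loadings differ is a z-test on their difference. The sketch below treats the two estimates as independent, which is an assumption: the original and down-sampled models share five of their seven inputs, so the estimates are correlated and this is only an approximate check, not necessarily the test used in the study. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_z_test(est_a, se_a, est_b, se_b):
    """Two-sided z-test for est_b - est_a, assuming independent estimates."""
    z = (est_b - est_a) / sqrt(se_a ** 2 + se_b ** 2)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# RISK loading: 0.54 (SE = 0.03) in the original model, 0.56 (SE = 0.03) after down-sampling.
z_risk, p_risk = difference_z_test(0.54, 0.03, 0.56, 0.03)
```

For the RISK loading, the difference is less than half of its approximate standard error, consistent with the description of the change as non-significant.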

3. What is the loss of genetic signal at the factor level of multivariate GWAS when the indicator phenotypes are down-sampled univariate GWASs?

We estimated a multivariate GWAS of the EXT-minus-23andMe factor (see Step 2) (Figure S1). The genetic correlation between the summary statistics from the multivariate GWAS of EXT and EXT-minus-23andMe was strong but significantly less than unity (r_g = 0.978, SE = 0.001), which motivated Steps 4–6. The EffN of the multivariate GWAS of EXT-minus-23andMe was 1,045,957 (about 70.1% of that on EXT). The mean χ² values of the EXT and EXT-minus-23andMe factors were 3.12 and 2.37, respectively, corresponding to a 24% decrease. The reduction in the genomic inflation factor was similar (−18%). Thus, there was an appreciable loss of genetic signal in the down-sampled GWAS of EXT-minus-23andMe.

The reduction in mean χ² and genomic inflation factor suggested some loss of power to detect SNP effects. Down-sampling reduced the power to detect the median squared standardized coefficient among the genome-wide significant SNPs (i.e., median r² = 0.0038%) by 17.8 pp, and reduced the power to detect the three assumed effect-size magnitudes ($r^{2}$ = 0.003%, 0.004%, or 0.005%) by about 5–45 pp (Figures S2 and S3). Therefore, we recommend that users interested in following up on individual genome-wide significant SNPs associated with externalizing prioritize the version with 23andMe data.
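
A power calculation of this kind can be approximated as below. This is a minimal sketch under stated assumptions, not necessarily the procedure used for Figures S2 and S3: it assumes a two-sided 1-df test at the conventional genome-wide threshold of 5 × 10^–8, a non-centrality parameter of N × r² / (1 − r²), and a full-sample $EffN$ back-calculated from the "about 70.1%" reported above. The function name `gwas_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 5e-8  # assumed genome-wide significance threshold


def gwas_power(n_eff, r2, alpha=ALPHA):
    """Power of a two-sided 1-df test for a SNP explaining a fraction r2 of the variance."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = sqrt(n_eff * r2 / (1 - r2))  # square root of the non-centrality parameter
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


n_down = 1_045_957        # EffN of EXT-minus-23andMe, from the text
n_full = n_down / 0.701   # assumption: implied by "about 70.1% of that of EXT"
r2_median = 0.0038 / 100  # median r2 of 0.0038%, expressed as a proportion

power_down = gwas_power(n_down, r2_median)
power_full = gwas_power(n_full, r2_median)

print(f"Power, down-sampled: {power_down:.3f}")
print(f"Power, full sample:  {power_full:.3f}")
print(f"Loss: {100 * (power_full - power_down):.1f} pp")
```

Under these assumptions, the computed loss of power for the median effect size is of the same order as the 17.8 pp reported above.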

Pruning of the summary statistics to find near-independent lead SNPs (using the FUMA default settings) identified 358 lead SNPs for the down-sampled EXT-minus-23andMe, as compared to 842 in the full-sample version. (Note that the number of lead SNPs reported here for EXT differs from the 855 reported in the original study because that study used a restricted-access genetic reference panel and somewhat different settings for the pruning parameters.) Thus, down-sampling reduced the number of lead SNPs by 57.5%, which could appear problematic. However, the results of the following three checks of the concordance in coefficients (see Step 1) suggested no strong reason for concern (Figure S4). First, all 842 lead SNPs identified in the full-data version had a consistent direction of effect, meaning the null hypothesis of near-perfect sign concordance (99%) could not be rejected (P = 1). Moreover, there was 100% sign concordance among all 130,176 SNPs with P < 1 × 10^–5 (in the full-data GWAS). Second, we identified only 21 lead SNPs (out of the 842; 2.5%) for which the down-sampled coefficient fell outside the 95% confidence interval of the full-data estimate. Among the 130,176 SNPs, we found 2,202 such outliers (1.7%). We marked these SNPs in the disseminated summary statistics, but otherwise interpret their small number as unproblematic for the comparability of the down-sampled multivariate GWAS. Third, regression analysis of the down-sampled coefficients on the full-data estimates with the 842 lead SNPs found an intercept close to zero (~0.0005, P = 0.045), a regression coefficient statistically different from but still near unity (~0.898, P = 5.24 × 10^–5), and a high adjusted R² of 0.86. We found similar results for the 130,176 SNPs (reported in Figure S4). The regression results suggest the down-sampling induced some, but not marked, attenuation of the coefficients.
Overall, these results demonstrate satisfactory concordance for the down-sampled multivariate coefficients.
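
The three concordance checks described above can be sketched on simulated data as follows. The coefficients are hypothetical toy values, not the study's estimates, and the function names `sign_concordance_p` and `ols` are introduced for illustration only. The sign test is written as a one-sided binomial test against a 99% concordance rate, which is one plausible reading of the check described in the text.

```python
import random
from math import comb


def sign_concordance_p(n_concordant, n_total, null_rate=0.99):
    """One-sided binomial P value: P(X <= n_concordant) under the null concordance rate."""
    return sum(
        comb(n_total, k) * null_rate**k * (1 - null_rate) ** (n_total - k)
        for k in range(n_concordant + 1)
    )


def ols(x, y):
    """Simple linear regression of y on x; returns intercept, slope, and adjusted R2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return intercept, slope, adj_r2


# Hypothetical toy data: full-data coefficients and mildly attenuated, noisy
# down-sampled counterparts (these are NOT the study's estimates).
random.seed(1)
n_snps = 842
full_beta = [random.gauss(0, 0.01) for _ in range(n_snps)]
full_se = [0.001] * n_snps
down_beta = [0.9 * b + random.gauss(0, 0.0005) for b in full_beta]

# Check 1: sign concordance, tested against a null concordance rate of 99%.
n_same_sign = sum((f > 0) == (d > 0) for f, d in zip(full_beta, down_beta))
p_sign = sign_concordance_p(n_same_sign, n_snps)

# Check 2: down-sampled coefficients outside the 95% CI of the full-data estimate.
n_outliers = sum(
    abs(d - f) > 1.96 * se for f, d, se in zip(full_beta, down_beta, full_se)
)

# Check 3: regression of the down-sampled coefficients on the full-data estimates.
intercept, slope, adj_r2 = ols(full_beta, down_beta)

print(f"Sign concordance: {n_same_sign}/{n_snps} (P = {p_sign:.3g})")
print(f"Outside 95% CI: {n_outliers} ({100 * n_outliers / n_snps:.1f}%)")
print(f"Intercept = {intercept:.5f}, slope = {slope:.3f}, adjusted R2 = {adj_r2:.3f}")
```

When every SNP is concordant, the binomial sum runs over all possible outcomes and the P value equals 1, which matches the P = 1 reported for the 842 lead SNPs.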

4.
How similar are the gene-property analyses when using down-sampled GWASs?

We ran gene-property analyses using MAGMA on the EXT-minus-23andMe summary statistics. The Spearman rank correlation of the point estimates from the MAGMA analysis of 54 tissue-specific gene expression profiles on the down-sampled and restricted data multivariate GWAS summary statistics was 0.98, suggesting a comparable pattern of gene-tissue expression (Table S3 and Figure S5). The Spearman rank correlation of the point estimates from the MAGMA gene expression profiles across 11 brain tissues and developmental stages also suggested great similarity (r = 0.98) (Table S4 and Figure S6). Furthermore, the same 14 tissues and three developmental stages remained significant after Bonferroni correction in the down-sampled analysis (Tables S3–S4). This evaluation showed that, in the case of EXT-minus-23andMe, the down-sampled gene-property analyses led to similar biological insights as those from the original paper (Karlsson Linnér et al. 2021).
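
The Spearman rank correlation used here (and again in Steps 5 and 6) is the Pearson correlation of the ranked point estimates. A minimal standard-library sketch is given below; the function names and the five toy estimates are hypothetical and only illustrate the calculation.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical point estimates for five tissues (not values from Table S3).
full_estimates = [0.12, 0.05, 0.30, 0.22, 0.01]
down_estimates = [0.10, 0.06, 0.27, 0.20, 0.07]

rho = spearman(full_estimates, down_estimates)
print(f"Spearman rank correlation: {rho:.2f}")
```

Because the statistic depends only on ranks, it captures whether the ordering of tissues is preserved after down-sampling, regardless of any uniform attenuation in the size of the estimates.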

5.
How similar is the pattern of genetic correlations with other traits when using down-sampled GWASs?

We assessed the pattern of genetic correlations of EXT-minus-23andMe with other traits and found this pattern to be nearly identical to that of the original study (Spearman r ~ 1) (Fig. 3, Table S5). Furthermore, none of the point estimates were statistically different. Thus, in our scenario, down-sampling did not meaningfully impact the genetic correlations with other traits, meaning that researchers interested in such analyses can safely proceed with using the down-sampled summary statistics.

Fig. 3 — Scatterplot of genetic correlations (r_g) and marginal density plots between EXT (y-axis) or EXT-minus-23andMe (x-axis) with 77 other phenotypes. Each point corresponds to the genetic correlation coefficient with its 95% confidence interval ($r_{g} \pm 1.96 \times SE$) estimated with bivariate LD Score regression. Table S5 reports the estimates, their standard errors, and confidence intervals. The Spearman rank correlation reported in the figure is rounded from r = 0.9995. No particular shape, such as a normal distribution, is expected for the marginal density because the figure displays an arbitrary selection of traits.

6.
How much explanatory power is lost when using polygenic scores (PGSs) constructed from down-sampled GWASs?

The down-sampled PGS for EXT-minus-23andMe explained 8.4% and 8.5% of the variance of a phenotypic externalizing factor in Add Health and COGA, respectively, which is 1.9 pp and 0.5 pp less compared to the same analysis in the original study (Table S6). The overall reduction in explanatory power across other outcomes was less pronounced, on average 0.35 pp in Add Health, and 0.23 pp in COGA. The largest decrease was observed for lifetime smoking initiation with 2.1 pp and 1.7 pp, followed by lifetime cannabis use with 1.1 pp in Add Health (but only 0.55 pp in COGA), which may be explained by these two indicator phenotypes being most affected by the down-sampling. For most other traits, the variance explained by the down-sampled PGS was comparable to the original study.

In addition, the Spearman rank correlation of the regression coefficients was 0.996, suggesting great similarity in point estimates (Fig. 4). All the coefficients of the down-sampled PGS fell within the confidence intervals of their original study counterparts (Table S6), except those for the phenotypic externalizing factor (in Add Health), lifetime smoking initiation, and lifetime cannabis use (in Add Health). Overall, our down-sampled polygenic score results were comparable to those from the original study, meaning that researchers interested in using the down-sampled summary statistics to construct PGS for EXT-minus-23andMe can generally expect similar results. However, we recommend that users be aware of the weaker explanatory power for certain outcomes.

Discussion

Unrestricted access to data and results is the cardinal tenet of open science. Here, we propose a systematic approach for researchers disseminating GWAS summary statistics with restricted data removed (i) to evaluate the comparability of down-sampled GWAS summary statistics with their restricted data counterparts, and (ii) to assess the impact of using down-sampled univariate summary statistics in multivariate GWAS with Genomic SEM. We examined the loss of genetic signal in down-sampled univariate GWAS (Step 1), the change in the factor model loadings and fit (Step 2), and the loss of genetic signal at the factor level of down-sampled multivariate GWAS (Step 3). We also examined potential changes to gene-property analyses (Step 4), the pattern of genetic correlations with other traits (Step 5), and the explanatory power of polygenic score analyses in independent samples (Step 6).

We applied these steps to the largest available multivariate GWAS of externalizing to evaluate the quality and predictive performance of the results following restricted data removal. We found nearly identical model fit and parameter estimates, genetic correlations with other phenotypes, and polygenic score analyses of externalizing phenotypes in independent samples. As expected, we observed a decrease in power and genetic signal in the down-sampled univariate and multivariate summary statistics. Although fewer lead SNPs were identified for EXT-minus-23andMe compared to EXT, the genes associated with EXT and EXT-minus-23andMe were similar in terms of region and developmental timing of expression. In the PGS context, EXT and EXT-minus-23andMe performed similarly well. Therefore, while we suggest that the down-sampled summary statistics may be used in analyses related to gene enrichment, genetic correlations, or polygenic scores, the summary statistics with restricted data should be prioritized for gene identification or to follow up on genome-wide significant hits. Prioritizing the restricted data when following up on individual GWAS hits is less of a problem because results for significant SNPs are more likely to be reported in full in the original study.

In our example, removing restricted data did not change the construct that was identified by genetic factor analysis: The genetic correlation between the factor identified without 23andMe data and the factor identified with 23andMe data was near unity, and the factors had highly similar associations with external variables. But this outcome is not guaranteed. Removing restricted data may be more impactful for univariate GWASs prior to their inclusion in meta-analyses and multivariate GWAS with different indicator phenotypes and model structures. The consistency we observed between EXT and EXT-minus-23andMe is likely explained by the inclusion of restricted data in only a subset of indicators, with just one of seven summary statistics experiencing a substantive reduction in genetic signal (i.e., 35% decrease in the mean χ² of SMOK). In the circumstance that more indicators had included 23andMe data, we could have expected greater discrepancies between EXT and EXT-minus-23andMe.

The issues raised here are also relevant in the context of GWAS meta-analyses. Removing a restricted set of cohort-level summary statistics from a single-phenotype GWAS meta-analysis should mainly affect power if the genetic correlation between the cohort-level summary statistics is close to unity. However, considering that genetic correlations between cohort-level GWASs of the same trait can be substantially less than unity (Levey et al. 2021), removing a large cohort from the meta-analysis can change the genetic etiology of the trait being studied (de Vlaming et al. 2017). Researchers should thus use the approach presented here to examine potential changes in a phenotype’s genetic etiology alongside the expected power reduction after removing a sample from their GWAS meta-analysis. To our knowledge, this has only been done by one meta-analysis (Coleman et al. 2020), where the authors conducted a subset of the steps described in the present study (e.g., changes in heritability, genetic correlations with external variables, and gene enrichment analyses). Therefore, the utility of our systematic approach goes beyond the Genomic SEM context, as some of these steps may apply to other multivariate GWAS implementations.

Providing public summary statistics to the wider research community is crucial to facilitating open science and advancing behavioral and biomedical research. The first step in this process should be to evaluate the comparability of down-sampled summary statistics and their restricted data counterparts. Herein, we provide a systematic approach to investigators who resort to sharing down-sampled GWAS summary statistics and recommend they report these analyses as accompanying documentation to facilitate open science and data sharing.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 2503 KB)

Supplementary file2 (XLSX 257 KB)

Acknowledgements

This research was conducted by the Externalizing Consortium. The Externalizing Consortium has been supported by the National Institute on Alcohol Abuse and Alcoholism (R01AA015416 – administrative supplement to DMD), and the National Institute on Drug Abuse (R01DA050721 to DMD). Additional funding for investigator effort has been provided by K02AA018755, U10AA008401, P50AA022537 to DMD, R01AA029688, and 28IR-0070 to AAP and T29KT0526 and T32IR5226 to NCK and SSR from the Tobacco-Related Disease Research Program (TRDRP), NIDA DP1DA054394 to SSR, R25MH081482-16 to NCK, R01HD092548 to KPH, as well as a European Research Council Consolidator Grant (647648 EdGe) to PDK. The content is solely the responsibility of the authors and does not necessarily represent the official views of the above funding bodies. The Externalizing Consortium would like to thank the following groups for making the research possible: 23andMe, Add Health, Vanderbilt University Medical Center’s BioVU, Collaborative Study on the Genetics of Alcoholism (COGA), the Psychiatric Genomics Consortium’s Substance Use Disorders working group, UK10K Consortium, UK Biobank, and Philadelphia Neurodevelopmental Cohort.

Author Contributions

CMW: Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing. HP: Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review and editing. PTT: Conceptualization, Writing – original draft, Writing – review and editing, Visualization. HK: Data curation, Software, Writing – original draft. NSC-K: Formal analysis, Writing – original draft, Writing – review and editing. DL-C: Validation, Writing – original draft. TTM: Conceptualization, Data curation, Methodology, Supervision. PB: Formal analysis, Writing – review and editing. PDK: Conceptualization, Writing – review and editing. IDW: Conceptualization, Writing – review and editing. SS-R: Conceptualization, Writing – review and editing. KPH: Conceptualization, Writing – review and editing. AAP: Conceptualization, Writing – review and editing. DMD: Conceptualization, Writing – review and editing. RKL: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Supervision, Writing – review and editing.

Funding

Tobacco-Related Disease Research Program, T29KT0526, T29KT0526, R01AA029688, K02AA018755, National Institute on Drug Abuse, R25MH081482-16, DP1DA054394, R01HD092548, R01DA050721, European Research Council Consolidator Grant, 647648 EdGe, National Institute on Alcohol Abuse and Alcoholism, R01AA015416

Data Availability

The code for EXT-minus-23andMe is available on the wiki (https://github.com/Camzcamz/EXTminus23andMe/wiki) and the EXT-minus-23andMe summary statistics are available on the externalizing website (https://externalizing.rutgers.edu/ext-23andme-summary-statistics-now-available/).

Declarations

Conflict of interest

Camille M. Williams, Holly Poore, Peter T. Tanksley, Hyeokmoon Kweon, Natasia S. Courchesne-Krak, Diego Londono-Correa, Travis T. Mallard, Peter Barr, Philipp D. Koellinger, Irwin D. Waldman, Sandra Sanchez-Roige, K. Paige Harden, Abraham A Palmer, Danielle M. Dick and Richard Karlsson Linnér declare that they have no conflict of interest.

Ethical Approval

This study included only secondary data analysis of de-identified data and was approved as “Exempt Human Subjects Research” by the institutional review board (IRB) of Rutgers University (#Pro2022000138). All participants provided written informed consent in the original studies from which these data were drawn. In addition, data collection of each cohort was approved by a review board at each respective institution.

Footnotes

An approximate measure of variance explained (R²), standardized with respect to the outcome.

Estimated with the chi-square cut-off set to 30, i.e., the default cut-off applied by bivariate LD Score regression when estimating the heritability. To our knowledge, there is no consensus on the best cut-off to use.

Edited by Sarah Medland.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Camille M. Williams, Email: williams.m.camille@gmail.com

Danielle M. Dick, Email: danielle.m.dick@rutgers.edu

Richard Karlsson Linnér, Email: r.karlsson.linner@law.leidenuniv.nl.

References

Abdellaoui A, Yengo L, Verweij KJH, Visscher PM. 15 years of GWAS discovery: realizing the promise. Am J Hum Genet. 2023. doi: 10.1016/j.ajhg.2022.12.011.
Allen Institute for Brain Science. (2022). BrainSpan atlas of the developing human brain. http://www.brainspan.org/. Accessed 22 Dec 2022
Becker J, Burik CAP, Goldman G, Wang N, Jayashankar H, Bennett M, Belsky DW, Karlsson Linnér R, Ahlskog R, Kleinman A, Hinds DA, Caspi A, Corcoran DL, Moffitt TE, Poulton R, Sugden K, Williams BS, Harris KM, Steptoe A, et al. Resource profile and user guide of the polygenic index repository. Nat Hum Behav. 2021;5(12):12. doi: 10.1038/s41562-021-01119-3.
Begleiter H. The collaborative study on the genetics of alcoholism. Alcohol Health Res World. 1995;19(3):228–236.
Bucholz KK, McCutcheon VV, Agrawal A, Dick DM, Hesselbrock VM, Kramer JR, Kuperman S, Nurnberger JI, Salvatore JE, Schuckit MA, Bierut LJ, Foroud TM, Chan G, Hesselbrock M, Meyers JL, Edenberg HJ, Porjesz B. Comparison of parent, peer, psychiatric, and cannabis use influences across stages of offspring alcohol involvement: evidence from the COGA prospective study. Alcohol Clin Exp Res. 2017;41(2):359–368. doi: 10.1111/acer.13293.
Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, Patterson N, Daly MJ, Price AL, Neale BM. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):3. doi: 10.1038/ng.3211.
Coleman JRI, Gaspar HA, Bryois J, Breen G, Bipolar Disorder Working Group of the Psychiatric Genomics Consortium, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium. The genetics of the mood disorder spectrum: genome-wide association analyses of more than 185,000 cases and 439,000 controls. Biol Psychiatry. 2020;88(2):169–184. doi: 10.1016/j.biopsych.2019.10.015.
de Vlaming R, Okbay A, Rietveld CA, Johannesson M, Magnusson PKE, Uitterlinden AG, van Rooij FJA, Hofman A, Groenen PJF, Thurik AR, Koellinger PD. Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genetics. 2017;13(1):e1006495. doi: 10.1371/journal.pgen.1006495.
Demontis D, Walters RK, Martin J, Mattheisen M, Als TD, Agerbo E, Baldursson G, Belliveau R, Bybjerg-Grauholm J, Bækvad-Hansen M, Cerrato F, Chambert K, Churchhouse C, Dumont A, Eriksson N, Gandal M, Goldstein JI, Grasby KL, Grove J, et al. Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nat Genet. 2019;51(1):63–75. doi: 10.1038/s41588-018-0269-7.
Edenberg HJ. The collaborative study on the genetics of alcoholism: an update. Alcohol Res Health. 2002;26:214–218.
Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019;10(1):1. doi: 10.1038/s41467-019-09718-5.
Grotzinger AD, Rhemtulla M, de Vlaming R, Ritchie SJ, Mallard TT, Hill WD, Ip HF, Marioni RE, McIntosh AM, Deary IJ, Koellinger PD, Harden KP, Nivard MG, Tucker-Drob EM. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat Hum Behav. 2019;3(5):513–525. doi: 10.1038/s41562-019-0566-x.
Harris KM, Halpern CT, Haberstick BC, Smolen A. The national longitudinal study of adolescent health (add health) sibling pairs data. Twin Res Hum Genet. 2013;16(1):391–398. doi: 10.1017/thg.2012.137.
Johnson W, Bouchard TJ, Krueger RF, McGue M, Gottesman II. Just one g: consistent results from three test batteries. Intelligence. 2004;32(1):95–107. doi: 10.1016/S0160-2896(03)00062-X.
Johnson W, te Nijenhuis J, Bouchard TJ. Still just 1 g: consistent results from five test batteries. Intelligence. 2008;36(1):81–95. doi: 10.1016/j.intell.2007.06.001.
Karlsson Linnér R, Biroli P, Kong E, Meddens SFW, Wedow R, Fontana MA, Lebreton M, Tino SP, Abdellaoui A, Hammerschlag AR, Nivard MG, Okbay A, Rietveld CA, Timshel PN, Trzaskowski M, de Vlaming R, Zünd CL, Bao Y, Buzdugan L, et al. Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences. Nat Genet. 2019;51(2):245–257. doi: 10.1038/s41588-018-0309-3.
Karlsson Linnér R, Mallard TT, Barr PB, Sanchez-Roige S, Madole JW, Driver MN, Poore HE, de Vlaming R, Grotzinger AD, Tielbeek JJ, Johnson EC, Liu M, Rosenthal SB, Ideker T, Zhou H, Kember RL, Pasman JA, Verweij KJH, Liu DJ, et al. Multivariate analysis of 1.5 million people identifies genetic associations with traits related to self-regulation and addiction. Nat Neurosci. 2021;24(10):10. doi: 10.1038/s41593-021-00908-3.
Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK, Fontana MA, Kundu T, Lee C, Li H, Li R, Royer R, Timshel PN, Walters RK, Willoughby EA, et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat Genet. 2018;50(8):1112–1121. doi: 10.1038/s41588-018-0147-3.
Levey DF, Stein MB, Wendt FR, Pathak GA, Zhou H, Aslan M, Quaden R, Harrington KM, Nuñez YZ, Overstreet C, Radhakrishnan K, Sanacora G, McIntosh AM, Shi J, Shringarpure SS, Concato J, Polimanti R, Gelernter J. Bi-ancestral depression GWAS in the million veteran program and meta-analysis in >1.2 million individuals highlight new therapeutic directions. Nat Neurosci. 2021;24(7):7. doi: 10.1038/s41593-021-00860-2.
Liu M, Jiang Y, Wedow R, Li Y, Brazel DM, Chen F, Datta G, Davila-Velderrain J, McGuire D, Tian C, Zhan X, Choquet H, Docherty AR, Faul JD, Foerster JR, Fritsche LG, Gabrielsen ME, Gordon SD, Haessler J, et al. Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use. Nat Genet. 2019;51(2):237–244. doi: 10.1038/s41588-018-0307-5.
McQueen MB, Boardman JD, Domingue BW, Smolen A, Tabor J, Killeya-Jones L, Halpern CT, Whitsel EA, Harris KM. The national longitudinal study of adolescent to adult health (add health) sibling pairs genome-wide data. Behav Genet. 2015;45(1):12–23. doi: 10.1007/s10519-014-9692-4.
Pasman JA, Verweij KJH, Gerring Z, Stringer S, Sanchez-Roige S, Treur JL, Abdellaoui A, Nivard MG, Baselmans BML, Ong J-S, Ip HF, van der Zee MD, Bartels M, Day FR, Fontanillas P, Elson SL, de Wit H, Davis LK, MacKillop J, et al. GWAS of lifetime cannabis use reveals new risk loci, genetic overlap with psychiatric traits, and a causal influence of schizophrenia. Nat Neurosci. 2018;21(9):1161–1170. doi: 10.1038/s41593-018-0206-1.
Sanchez-Roige S, Palmer AA, Fontanillas P, Elson SL, Adams MJ, Howard DM, Edenberg HJ, Davies G, Crist RC, Deary IJ, McIntosh AM, Clarke T-K. Genome-Wide Association Study Meta-Analysis of the Alcohol Use Disorders Identification Test (AUDIT) in Two Population-Based Cohorts. Am J Psychiatry. 2019;176(2):107–118. doi: 10.1176/appi.ajp.2018.18040369.
Watanabe K, Taskesen E, van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8(1):1826. doi: 10.1038/s41467-017-01261-5.
Wray NR, Ripke S, Mattheisen M, Trzaskowski M, Byrne EM, Abdellaoui A, Adams MJ, Agerbo E, Air TM, Andlauer TMF, Bacanu S-A, Bækvad-Hansen M, Beekman AFT, Bigdeli TB, Binder EB, Blackwood DRH, Bryois J, Buttenschøn HN, Bybjerg-Grauholm J, et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet. 2018;50(5):668–681. doi: 10.1038/s41588-018-0090-3.
Yengo L, Vedantam S, Marouli E, Sidorenko J, Bartell E, Sakaue S, Graff M, Eliasen AU, Jiang Y, Raghavan S, Miao J, Arias JD, Graham SE, Mukamel RE, Spracklen CN, Yin X, Chen S-H, Ferreira T, Highland HH, et al. A saturated map of common genetic variants associated with human height. Nature. 2022;610(7933):7933. doi: 10.1038/s41586-022-05275-y.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary file1 (DOCX 2503 KB)^{(2.4MB, docx)}

Supplementary file2 (XLSX 257 KB)^{(256.8KB, xlsx)}

Data Availability Statement

[CR1] Abdellaoui A, Yengo L, Verweij KJH, Visscher PM. 15 years of GWAS discovery: realizing the promise. Am J Hum Genet. 2023 doi: 10.1016/j.ajhg.2022.12.011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] Allen Institute for Brain Science. (2022). BrainSpan atlas of the developing human brain. http://www.brainspan.org/. Accessed 22 Dec 2022

[CR3] Becker J, Burik CAP, Goldman G, Wang N, Jayashankar H, Bennett M, Belsky DW, Karlsson Linnér R, Ahlskog R, Kleinman A, Hinds DA, Caspi A, Corcoran DL, Moffitt TE, Poulton R, Sugden K, Williams BS, Harris KM, Steptoe A, et al. Resource profile and user guide of the polygenic index repository. Nat Hum Behaviour. 2021;5(12):12. doi: 10.1038/s41562-021-01119-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] Begleiter H. The collaborative study on the genetics of alcoholism. Alcohol Health Res World. 1995;19(3):228–236. [PMC free article] [PubMed] [Google Scholar]

[CR5] Bucholz KK, McCutcheon VV, Agrawal A, Dick DM, Hesselbrock VM, Kramer JR, Kuperman S, Nurnberger JI, Salvatore JE, Schuckit MA, Bierut LJ, Foroud TM, Chan G, Hesselbrock M, Meyers JL, Edenberg HJ, Porjesz B. Comparison of parent, peer, psychiatric, and cannabis use influences across stages of offspring alcohol involvement: evidence from the COGA prospective study. Alcohol Clin Exp Res. 2017;41(2):359–368. doi: 10.1111/acer.13293. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, Patterson N, Daly MJ, Price AL, Neale BM. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):3. doi: 10.1038/ng.3211. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] Coleman JRI, Gaspar HA, Bryois J, Breen G, Disorder Working Group of the Psychiatric Genomics Consortium, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium The genetics of the mood disorder spectrum: genome-wide association analyses of more than 185,000 cases and 439,000 controls. Biol Psychiatry. 2020;88(2):169–184. doi: 10.1016/j.biopsych.2019.10.015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] de Vlaming R, Okbay A, Rietveld CA, Johannesson M, Magnusson PKE, Uitterlinden AG, van Rooij FJA, Hofman A, Groenen PJF, Thurik AR, Koellinger PD. Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genetics. 2017;13(1):e1006495. doi: 10.1371/journal.pgen.1006495. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] Demontis D, Walters RK, Martin J, Mattheisen M, Als TD, Agerbo E, Baldursson G, Belliveau R, Bybjerg-Grauholm J, Bækvad-Hansen M, Cerrato F, Chambert K, Churchhouse C, Dumont A, Eriksson N, Gandal M, Goldstein JI, Grasby KL, Grove J, et al. Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nat Genet. 2019;51(1):63–75. doi: 10.1038/s41588-018-0269-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] Edenberg HJ. The collaborative study on the genetics of alcoholism: an update. Alcohol Res Health. 2002;26:214–218. [PMC free article] [PubMed] [Google Scholar]

[CR11] Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019;10(1):1. doi: 10.1038/s41467-019-09718-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] Grotzinger AD, Rhemtulla M, de Vlaming R, Ritchie SJ, Mallard TT, Hill WD, Ip HF, Marioni RE, McIntosh AM, Deary IJ, Koellinger PD, Harden KP, Nivard MG, Tucker-Drob EM. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat Hum Behav. 2019;3(5):513–525. doi: 10.1038/s41562-019-0566-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] Harris KM, Halpern CT, Haberstick BC, Smolen A. The national longitudinal study of adolescent health (add health) sibling pairs data. Twin Res Hum Genet. 2013;16(1):391–398. doi: 10.1017/thg.2012.137. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] Johnson W, Bouchard TJ, Krueger RF, McGue M, Gottesman II. Just one g: consistent results from three test batteries. Intelligence. 2004;32(1):95–107. doi: 10.1016/S0160-2896(03)00062-X. [DOI] [Google Scholar]

[CR15] Johnson W, te Nijenhuis J, Bouchard TJ. Still just 1 g: consistent results from five test batteries. Intelligence. 2008;36(1):81–95. doi: 10.1016/j.intell.2007.06.001. [DOI] [Google Scholar]

[CR16] Karlsson Linnér R, Biroli P, Kong E, Meddens SFW, Wedow R, Fontana MA, Lebreton M, Tino SP, Abdellaoui A, Hammerschlag AR, Nivard MG, Okbay A, Rietveld CA, Timshel PN, Trzaskowski M, de Vlaming R, Zünd CL, Bao Y, Buzdugan L, et al. Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences. Nat Genet. 2019;51(2):245–257. doi: 10.1038/s41588-018-0309-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] Karlsson Linnér R, Mallard TT, Barr PB, Sanchez-Roige S, Madole JW, Driver MN, Poore HE, de Vlaming R, Grotzinger AD, Tielbeek JJ, Johnson EC, Liu M, Rosenthal SB, Ideker T, Zhou H, Kember RL, Pasman JA, Verweij KJH, Liu DJ, et al. Multivariate analysis of 1.5 million people identifies genetic associations with traits related to self-regulation and addiction. Nat Neurosci. 2021;24(10):10. doi: 10.1038/s41593-021-00908-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK, Fontana MA, Kundu T, Lee C, Li H, Li R, Royer R, Timshel PN, Walters RK, Willoughby EA, et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat Gene. 2018;50(8):1112–1121. doi: 10.1038/s41588-018-0147-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR19] Levey DF, Stein MB, Wendt FR, Pathak GA, Zhou H, Aslan M, Quaden R, Harrington KM, Nuñez YZ, Overstreet C, Radhakrishnan K, Sanacora G, McIntosh AM, Shi J, Shringarpure SS, Concato J, Polimanti R, Gelernter J. Bi-ancestral depression GWAS in the million veteran program and meta-analysis in >1.2 million individuals highlight new therapeutic directions. Nat Neurosci. 2021;24(7):7. doi: 10.1038/s41593-021-00860-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] Liu M, Jiang Y, Wedow R, Li Y, Brazel DM, Chen F, Datta G, Davila-Velderrain J, McGuire D, Tian C, Zhan X, Choquet H, Docherty AR, Faul JD, Foerster JR, Fritsche LG, Gabrielsen ME, Gordon SD, Haessler J, et al. Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use. Nat Genet. 2019;51(2):237–244. doi: 10.1038/s41588-018-0307-5.

[CR21] McQueen MB, Boardman JD, Domingue BW, Smolen A, Tabor J, Killeya-Jones L, Halpern CT, Whitsel EA, Harris KM. The National Longitudinal Study of Adolescent to Adult Health (Add Health) sibling pairs genome-wide data. Behav Genet. 2015;45(1):12–23. doi: 10.1007/s10519-014-9692-4.

[CR22] Pasman JA, Verweij KJH, Gerring Z, Stringer S, Sanchez-Roige S, Treur JL, Abdellaoui A, Nivard MG, Baselmans BML, Ong J-S, Ip HF, van der Zee MD, Bartels M, Day FR, Fontanillas P, Elson SL, de Wit H, Davis LK, MacKillop J, et al. GWAS of lifetime cannabis use reveals new risk loci, genetic overlap with psychiatric traits, and a causal influence of schizophrenia. Nat Neurosci. 2018;21(9):1161–1170. doi: 10.1038/s41593-018-0206-1.

[CR23] Sanchez-Roige S, Palmer AA, Fontanillas P, Elson SL, Adams MJ, Howard DM, Edenberg HJ, Davies G, Crist RC, Deary IJ, McIntosh AM, Clarke T-K. Genome-wide association study meta-analysis of the Alcohol Use Disorders Identification Test (AUDIT) in two population-based cohorts. Am J Psychiatry. 2019;176(2):107–118. doi: 10.1176/appi.ajp.2018.18040369.

[CR24] Watanabe K, Taskesen E, van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8(1):1826. doi: 10.1038/s41467-017-01261-5.

[CR25] Wray NR, Ripke S, Mattheisen M, Trzaskowski M, Byrne EM, Abdellaoui A, Adams MJ, Agerbo E, Air TM, Andlauer TMF, Bacanu S-A, Bækvad-Hansen M, Beekman AFT, Bigdeli TB, Binder EB, Blackwood DRH, Bryois J, Buttenschøn HN, Bybjerg-Grauholm J, et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet. 2018;50(5):668–681. doi: 10.1038/s41588-018-0090-3.

[CR26] Yengo L, Vedantam S, Marouli E, Sidorenko J, Bartell E, Sakaue S, Graff M, Eliasen AU, Jiang Y, Raghavan S, Miao J, Arias JD, Graham SE, Mukamel RE, Spracklen CN, Yin X, Chen S-H, Ferreira T, Highland HH, et al. A saturated map of common genetic variants associated with human height. Nature. 2022;610(7933):704–712. doi: 10.1038/s41586-022-05275-y.
