Author manuscript; available in PMC: 2021 Jan 1.
Published in final edited form as: Hum Genet. 2019 Mar 26;139(1):61–71. doi: 10.1007/s00439-019-02001-z

OpenMendel: A Cooperative Programming Project for Statistical Genetics

Hua Zhou 1,*,+, Janet S Sinsheimer 2,*,+, Douglas M Bates 3, Benjamin B Chu 4, Christopher A German 5, Sarah S Ji 6, Kevin L Keys 7, Juhyun Kim 8, Seyoon Ko 9, Gordon D Mosher 10, Jeanette C Papp 11, Eric M Sobel 12, Jing Zhai 13, Jin J Zhou 14, Kenneth Lange 15,+
PMCID: PMC6763373  NIHMSID: NIHMS1525513  PMID: 30915546

Abstract

Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OpenMendel project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OpenMendel project.

Keywords: Statistical Genomics, GWAS, Computational Statistics, Open Source, Collaborative Programming

1. Introduction

Genomewide association studies (GWAS) query the entire genome to identify genetic variants associated with a trait of interest. GWAS have enjoyed many successes [77] and have uncovered many clues to the genetic etiology of common diseases [23]. Case-control tests of association between markers and traits predate GWAS by more than 50 years [1]. However, association studies were rarely undertaken in the pre-GWAS era unless there were candidate genes with strong prior evidence. The situation changed at the turn of the millennium when dense SNP (single nucleotide polymorphism) maps became available and SNP genotyping costs plummeted. Suddenly, it became possible to exploit linkage disequilibrium (LD) and survey hundreds of thousands to millions of genomewide SNPs. In the subsequent decade, hundreds of associations found by GWAS were published [76,77]. Genomics is in the midst of a second technological evolution driven by high-throughput sequencing [55,60, 40,73]. Geneticists can now survey both rare and common variants.

This sudden expansion of data leads to enormous challenges in statistical genetics. Many current algorithms and programs are ill adapted to handle modern data sets with $10^5$ cases and $10^7$ markers. Ever more types of genetic variation are being observed and catalogued [77]. These changes demand more complex data structures and data integration across multiple biological scales. Precision health and predictive medicine raise the stakes even further [40]. Concurrently, the nature of computing is rapidly changing. In addition to new hardware, new programming paradigms and new algorithms must be brought online as quickly as possible to sustain progress in statistical genetics.

The following three studies exemplify the variety and magnitude of genomic data sets being collected today: a) The Million Veteran Program contains GWAS data (657,459 SNPs) on 359,964 veterans [29]. Simply storing the genotypes in compressed format requires > 100 GB. Obviously, this data set and others like it [70,11] will continue to grow. b) A recent study [72] obtained whole genome sequence (WGS) data on 10,545 humans at 30–40x coverage for < $2000 per genome. These researchers identified > 150 million variants, the majority of which are rare or de novo. c) The iPOP (integrative personal omics profile) study [16] followed a single individual for 401 days and collected transcriptome, proteome, metabolome, microbiome, epigenome, exposome, and phenome data at 20 time points, along with an extremely high coverage WGS. This type of omics profiling yields a dynamic picture of the heteroallelic changes between healthy and diseased states.

Current analysis pipelines juggle a multitude of computer programs that are implemented in different languages, run on different platforms, and require different input/output formats. This heterogeneity unintentionally creates barriers to communication, data exchange, data visualization, and scientific replication. End-users treat the entire pipeline as a black box and often fail to use their biological insight to inform statistical analysis. Students, post-docs, and researchers spend inordinate amounts of time coding and debugging the low-level languages instead of thinking about the science. In addition to these disadvantages, current software packages are straining under the volume, velocity, variety, and veracity of modern genomics data. Many programs do not even run on multiple threads. Distributed computing across different machines is largely ignored. In our view, the time is ripe to put in place a better paradigm for statistical genetics.

In this review, we first explain why the new Julia language [6] is an ideal choice for the OpenMendel analysis platform. We then present what we see as some of the most pressing needs in gene mapping and our efforts to advance them through the cooperative OpenMendel effort. Owing to page and time constraints, we do not offer encyclopedic coverage of recent advances in GWAS or sequence analysis. Many promising methods are left unmentioned, for example meta-analysis based on summary statistics [15, 19, 26, 41, 52, 83] or estimation of fine-scale population structure [58]. Instead we focus on topics related to projects already underway in OpenMendel. These projects include methods for handling SNP data, genotype imputation, rapid GWAS, iterative hard thresholding, kinship comparison, variance component modeling, and SNP-set analyses.

We want to point out that other groups have made notable strides in making genetic analyses accessible to researchers who work with big data but lack the support available at large genomic centers [7,63]. Some of these projects are further along than OpenMendel. A particularly interesting example is the PLATO software project ([30] and https://ritchielab.org/software/plato-download), which is designed to provide a single platform for a variety of association analyses. A major difference between our approach and PLATO is language choice. This difference might seem minor but, as we outline below, we believe it is fundamental and important for our goal of attracting user-initiated modules and modifications. Another example is the Ark software [7], which focuses on data management and is complementary to rather than competitive with OpenMendel.

2. The Importance of the Julia Computing Language

Many compelling features make Julia [6] an ideal vehicle for implementing methods for modern statistical genetics. First, it is free, open source, and easy to install. Second, its clear, powerful syntax lends itself to compact, readable code and quick algorithm mock-ups. As needed, it can easily call Fortran, C, R, and Python functions. Third, because Julia incorporates an excellent just-in-time (JIT) compiler, it achieves the efficiency of low-level languages with minimal programming effort. Fourth, Julia is built for parallelism at the multicore, graphics processing unit (GPU), and cluster levels. Fifth, Julia employs a modern, easy to use package management system. Of particular relevance to statistical genetics, Julia has many statistical and numerical analysis packages ready to use. Finally, end-users can run their analyses via the interactive Jupyter (Julia, Python, R) Notebook, an attractive interface for data visualization and reproducible research. Together these tools constitute an integrated environment for rapid prototyping of new applications and, with the same code, the analysis of large-scale genetic data.

Traditional high-level languages such as R, Matlab, and Python face the notorious two-language problem. In this scenario, one high-level language is used for prototyping, but a second low-level language is later needed for producing fast code for real world, large data sets. The high-level code is typically more compact, readable, and amenable to change, but much slower to execute. Most of the popular statistical genetics analysis tools or their most demanding subroutines are implemented purely in low-level languages, greatly restricting the community that feels comfortable exploring the code. Most tools are also restricted to certain computer platforms and input formats.

Today, a typical analysis pipeline requires a glue language such as Bash, Perl or Python to chain packages together. Data plotting and display require additional software, typically R or Matlab. Current analysis pipelines are cumbersome, opaque, and error-prone, creating barriers to the development of new statistical methods. Researchers wade through a swamp of low-level code and reinvent statistical genetics wheels instead of focusing on their unique contributions. This can be avoided as Julia has solved the two-language problem through careful design of the programming language itself. Julia is both easy to code and scales to peta-flop computing levels [21]. We can now use Julia in all phases of our methods development, from prototyping to production software. OpenMendel includes many leading-edge statistical genetics methods written in this fast, high-level language that invites easy contributions from scientists. Using Julia, OpenMendel can become the first highly efficient, open source statistical genetics software that can scale to million-subject studies and is both user- and developer-friendly.

3. Handling SNP Data

The SnpArrays.jl module of OpenMendel provides a convenient bridge between binary SNP data and downstream statistical analysis. The VCFTools.jl module achieves the same end for the richer genetic information distributed in VCF and BCF file formats. In SnpArrays.jl, biallelic genotype data are held in BitArrays, which store four genotypes per byte. As much as possible, compressed storage is also maintained during computation. Julia allows operators such as matrix multiplication to be defined directly on BitArrays without decompression. The design features of Julia make it easy to build high-performance statistical genetics software that is scalable to data sets with millions of subjects and tens of millions of SNPs.
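The idea of storing four genotypes per byte can be illustrated with a minimal sketch. It is written in Python rather than Julia and is not the SnpArrays.jl implementation. The 2-bit codes are an assumption that follows the PLINK BED convention, and the function names `pack`, `get`, and `allele_count` are hypothetical.

```python
# Assumed 2-bit codes (PLINK BED convention): 00 = 0 copies of the counted
# allele, 10 = 1 copy, 11 = 2 copies, 01 = missing genotype.
CODE = {0: 0b00, None: 0b01, 1: 0b10, 2: 0b11}
DOSAGE = {code: dosage for dosage, code in CODE.items()}

def pack(genotypes):
    """Pack genotypes (0, 1, 2, or None) into a bytearray, four per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        packed[i // 4] |= CODE[g] << (2 * (i % 4))
    return packed

def get(packed, i):
    """Read genotype i directly from the packed bytes."""
    return DOSAGE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]

def allele_count(packed, n):
    """Sum the dosages of n packed genotypes without building an unpacked copy."""
    total = called = 0
    for i in range(n):
        g = get(packed, i)
        if g is not None:
            total += g
            called += 1
    return total, called

genotypes = [0, 1, 2, None, 2, 2, 0, 1, 1]
packed = pack(genotypes)          # 9 genotypes occupy 3 bytes
total, called = allele_count(packed, len(genotypes))
```

Summary statistics such as allele counts and genotyping success rates can be read off the packed bytes in this way, which is the sense in which compressed storage can be maintained during computation.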

The functionality of SnpArrays.jl includes: (1) reading and writing compressed SNP files, (2) computing summary statistics, (3) filtering data by genotyping success rates and other criteria, (4) copying compressed data into numerically oriented vectors and matrices, (5) computing genetic relationship matrices, (6) computing principal components, and (7) extending matrix and vector operations to compressed SNP data. SnpArrays.jl serves as a data interface to other OpenMendel modules.

4. Genotype Imputation

Genotype imputation involves the inference of unobserved genotypes from observed genotypes. It is possible to base inference on the observed genotypes of surrounding pedigree members [68], but pedigree data are now viewed as poor substitutes for linkage disequilibrium. In particular, pedigree data are incapable of imputing genotypes at completely untyped SNPs in a study. Recent versions of genotype imputation rely on panels of reference genotypes and employ hidden Markov models, with the hidden states being underlying haplotype pairs [34, 48, 54]. These programs are computationally intensive and operate by haplotyping individuals on the typed SNPs in the sample. These partial haplotypes are then compared to the reference panel to impute the full set of genotypes [33, 74]. We have taken an alternative approach based on the generic data mining technique of matrix completion [14, 18].

Matrix completion fills in the missing entries of an $m \times n$ matrix $X = (x_{ij})$ whose observed entries are indexed by a subset $\Omega$ of $\{1, \ldots, m\} \times \{1, \ldots, n\}$. Imputation involves finding a low rank matrix $Y = (y_{ij})$ consistent with the observed entries of $X$. This is done by minimizing the loss function

$$f(Y) = \sum_{(i,j) \in \Omega} (x_{ij} - y_{ij})^2 \tag{1}$$

over the set of matrices Y of rank r or less. Taking r small is a form of parsimony capturing the hidden structure of the data. In genotype imputation, X records the observed genotype dosages (0, 1, or 2 counts of the reference allele), with rows corresponding to people and columns to SNPs. Imputation is performed over a narrow genomic window of a few hundred SNPs where linkage disequilibrium prevails. Including reference individuals typed on out-of-sample SNPs is a key part of the strategy.

Because every rank $r$ matrix $Y$ of dimension $m \times n$ can be expressed as a matrix product $UV$, where $U$ is $m \times r$ and $V$ is $r \times n$, matrix completion can be phrased as updating the factors $U$ and $V$ of $Y$ in the loss function (1). Imputation is iterative. To restore symmetry at iteration $m$ (the subscript $m$ counts iterations and is unrelated to the row dimension), each missing entry $x_{ij}$ is imputed by its current best guess $(U_m V_m)_{ij}$. New values $U_{m+1}$ and $V_{m+1}$ can be recovered by taking the singular value decomposition (SVD) of $Z_m$, the current completed version of $X$. The MM (majorization/minimization) principle of optimization shows that this procedure drives the loss downhill [44].

Chi et al. [18] compared the matrix completion program Mendel Impute to several popular model-based imputation programs, including MaCH and IMPUTE2, using a number of simulated and real datasets. The accuracy of imputation depends on the specific scenario, so no program was universally most accurate. The least favorable scenario for Mendel Impute in terms of accuracy occurred when imputing genotypes between high density microarray platforms, with accuracy measured by the mean $r^2$ between the imputed values and the true genotypes at masked loci. In this case Mendel Impute was slightly worse than MaCH, which was slightly worse than IMPUTE2 (Table 1). In other scenarios Mendel Impute was more accurate than MaCH and IMPUTE2, and in still others all three were roughly the same. However, in all the scenarios presented Mendel Impute was at least an order of magnitude faster than MaCH or IMPUTE2.

Table 1. Comparison of imputation methods.

                 Mendel Impute    MaCH     IMPUTE2
$r^2$            0.683            0.751    0.802
relative time    1.00             13.10    7.41

Alternating least squares provides an alternative to SVD that is potentially much faster [31]. The alternating updates

$$V_{m+1} = (U_m^t U_m)^{-1} U_m^t Z_m$$

and

$$U_{m+1} = Z_m V_{m+1}^t (V_{m+1} V_{m+1}^t)^{-1}$$

can achieve extremely high numerical throughput on modern computer architecture such as multicore CPUs and multiple GPUs. Because alternating least squares offers no guarantee of finding the global minimum of the loss, initial values for U and V should be as accurate as possible. Application of a randomized SVD to supply initial values is one possibility [49]. In practice we divide the current window into equal thirds and construct a hold-out-set by masking entries in the outer two thirds. We then choose the best rank r based on performance on the hold-out-set. Once we impute missing entries in the middle third, we shift the window to the right and begin again.
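The alternating updates can be sketched in a few lines. The example below is a deliberately simplified illustration and not the Mendel Impute code. It assumes rank $r = 1$, so that the matrix inverses reduce to scalar divisions, and it omits windowing, rank selection, and the randomized SVD start. The function name `complete_rank_one` and the toy matrix are hypothetical.

```python
def complete_rank_one(X, iterations=100):
    """Fill the None entries of X by alternating least squares with rank r = 1."""
    m, n = len(X), len(X[0])
    missing = [(i, j) for i in range(m) for j in range(n) if X[i][j] is None]
    # Completed matrix Z: start by imputing 0 for every missing entry.
    Z = [[0.0 if x is None else float(x) for x in row] for row in X]
    u = [1.0] * m   # left factor U (m x 1), crude initial value
    v = [0.0] * n   # right factor V (1 x n)
    for _ in range(iterations):
        # V update: (U^t U)^{-1} U^t Z, a scalar division when r = 1.
        uu = sum(ui * ui for ui in u)
        v = [sum(u[i] * Z[i][j] for i in range(m)) / uu for j in range(n)]
        # U update: Z V^t (V V^t)^{-1}.
        vv = sum(vj * vj for vj in v)
        u = [sum(Z[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Re-impute each missing entry by its current best guess (UV)_{ij}.
        for i, j in missing:
            Z[i][j] = u[i] * v[j]
    return Z

# Toy rank-one matrix (outer product of (1, 2, 3) with itself), one entry masked.
X = [[None, 2, 3],
     [2, 4, 6],
     [3, 6, 9]]
Z = complete_rank_one(X)   # Z[0][0] should approach the masked value 1
```

Observed entries are never altered. Only the masked positions are overwritten at each pass, which mirrors the hold-out strategy described above for choosing the rank.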

5. Enhancements to Ordinary GWAS

MendelGWAS.jl performs ordinary SNP-by-SNP association testing. To maximize speed in linear, logistic, and Poisson regression, MendelGWAS.jl employs score tests [3,17,20,89]. For the most significant SNPs, the score test is supplemented by the slower but more accurate likelihood ratio test (LRT). Principal components can be included as predictors, SNPs and subjects can be filtered by success rates, and a Manhattan plot is provided.

In addition to these standard approaches to GWAS, we are in the process of implementing score tests for generalized linear models (GLMs) [66]. GLMs permit trait-genotype relations to be modeled with more exotic response distributions. We are also planning to develop an efficient score test for the challenging Cox survival model [37, 56, 69]. Multinomial regression models for complex categorical phenotypes would be a valuable extension of logistic regression [57]. Finally, efficient GWAS for ordered discrete phenotypes is becoming increasingly important for the study of complex diseases and traits derived from electronic health records.

6. Iterative Hard Thresholding

To avoid the computational complexity of multiple regression and the identifiability issues caused by having more predictors p than sample individuals n [12], GWAS has traditionally focused on the marginal effects of single SNPs. Previously we introduced lasso penalized regression to GWAS to perform subset selection [81,91]. Our recent paper [38] implements a better heuristic, iterative hard thresholding (IHT), to solve this inherently combinatorial problem. We showed that IHT is better for GWAS than lasso or MCP penalties in controlling for false positive and false negative rates, in reducing parameter shrinkage, and in capturing heritability. It achieves these goals with little sacrifice in computational speed.

We now sketch how IHT iterates toward good local optima. To keep the discussion simple, consider the setting of linear regression with design matrix $X$, response vector $y$, and parameter vector $\beta$. The goal is to minimize the loss function $f(\beta) = \frac{1}{2}\|y - X\beta\|^2$ subject to the sparsity condition $\|\beta\|_0 \le k$. The notation $\|\beta\|_0$ is shorthand for the number of nonzero entries of $\beta$. In GWAS the entry $x_{ij}$ of $X$ denotes the number (0, 1, or 2) of reference alleles carried by individual $i$ at SNP $j$ or the imputed dosage value. The entry $y_i$ of $y$ corresponding to individual $i$ encodes a continuous trait such as height, blood pressure, or an expression level.

At iteration $n$, the IHT algorithm [8] moves in the steepest descent direction $-\nabla f(\beta_n)$ modified by the sparsity constraint. Here the gradient $\nabla f(\beta)$ of the objective equals $-X^t(y - X\beta)$. The IHT update is explicitly

$$\beta_{n+1} = P_{S_k}\left(\beta_n - s \nabla f(\beta_n)\right), \tag{2}$$

where $s$ is the steplength and $P_{S_k}(\beta)$ denotes projection onto the sparsity set $S_k = \{\beta : \|\beta\|_0 \le k\}$. The projection operator $P_{S_k}(\beta)$ sends to 0 all but the $k$ largest entries of $\beta$ in magnitude. The preferred entries of $\beta$ are untouched. The steplength $s$ is chosen to minimize $f(\beta)$ along the ray $s \mapsto \beta_n - s \nabla f(\beta_n)$ prior to projection. This is achieved by taking

$$s = \frac{\|\nabla f(\beta_n)\|^2}{\|X \nabla f(\beta_n)\|^2}.$$

The best value of k can be chosen by cross-validation.
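A minimal sketch of the IHT iteration for linear regression follows. It is an illustration in Python and not the MendelIHT.jl code. The helper names are hypothetical, and the toy design matrix has orthogonal columns and a noise-free response, an assumption chosen so that the iteration converges immediately. Real GWAS data are correlated and noisy, and cross-validation over $k$ is omitted.

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def project(beta, k):
    """Projection onto S_k: keep the k largest entries in magnitude, zero the rest."""
    keep = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)[:k]
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]

def iht(X, y, k, iterations=50, tol=1e-12):
    Xt = transpose(X)
    beta = [0.0] * len(Xt)
    for _ in range(iterations):
        residual = [yi - fi for yi, fi in zip(y, matvec(X, beta))]
        grad = [-g for g in matvec(Xt, residual)]   # -X^t (y - X beta)
        gnorm2 = sum(g * g for g in grad)
        if gnorm2 < tol:                            # stationary point reached
            break
        Xg = matvec(X, grad)
        s = gnorm2 / sum(v * v for v in Xg)         # exact line-search steplength
        beta = project([b - s * g for b, g in zip(beta, grad)], k)
    return beta

# Toy design with orthogonal columns; the true coefficients are (0, 2, 0, -1).
X = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
y = [1, -1, 3, -3]
beta = iht(X, y, k=2)
```

The projection is the only step that distinguishes IHT from ordinary steepest descent, which is why the per-iteration cost stays close to that of a gradient evaluation.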

The theory and practice of IHT continue to advance. Shen and Li [67] show how to relax the restricted isometry property originally invoked to prove convergence [9]. Yang et al [82] suggest group-sparse IHT to promote sparsity on a group-level. Khanna and Kyrillidis [39] validate the application of momentum acceleration to IHT. Yuan et al [86] and Bahmani [5] adapt IHT to logistic regression. Further extension to generalized linear models is a natural target. MendelIHT.jl brings IHT under the OpenMendel umbrella. Integration of IHT with SnpArrays.jl unifies data handling and leads to faster code with a smaller memory footprint. Finally, we are investigating weighting predictors to accommodate candidate genes and candidate SNPs [88].

7. Kinship Comparison

Kinship coefficients quantify the degree of relationship between two relatives. Two genes are identical by descent (IBD) if one is a copy of the other or they are both copies of the same ancestral gene. The theoretical kinship coefficient $\phi_{ij}$ is the probability that a randomly sampled gene at some arbitrary locus from individual $i$ is IBD to a randomly sampled gene at the same locus from individual $j$. For example, if we assume no inbreeding, $\phi_{ij} = \frac{1}{2}$ if $i = j$, and $\phi_{ij} = \frac{1}{4}$ if $i$ and $j$ are first degree relatives. In the former case, the two genes are sampled with replacement. In an accurately constructed pedigree, the full matrix $\Phi$ of kinship coefficients $\phi_{ij}$ can be calculated from a simple recurrence [43]. Jacquard’s more complex kinship coefficients [35] are less useful in practice and harder to calculate [46]. In the MendelKinship.jl module of OpenMendel, Jacquard’s coefficients are approximated by the Monte Carlo method of gene dropping.
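The recurrence for theoretical kinship coefficients can be sketched as follows. This is the standard textbook recursion rather than the MendelKinship.jl code. It assumes that parents are listed before their children and that each person has either two known parents or none. The function name `kinship_matrix` and the toy pedigree are hypothetical.

```python
def kinship_matrix(pedigree):
    """pedigree: list of (person, father, mother) with parents listed first.
    Founders have father = mother = None. Returns a dict of dicts phi[i][j]."""
    phi = {}
    seen = []
    for person, father, mother in pedigree:
        phi[person] = {}
        for other in seen:
            if father is None:
                value = 0.0   # founders are unrelated to everyone listed earlier
            else:
                # Average the kinship of each parent with the other person.
                value = 0.5 * (phi[father][other] + phi[mother][other])
            phi[person][other] = phi[other][person] = value
        if father is None:
            phi[person][person] = 0.5
        else:
            # Self-kinship rises above 1/2 when the parents are related.
            phi[person][person] = 0.5 + 0.5 * phi[father][mother]
        seen.append(person)
    return phi

pedigree = [("dad", None, None), ("mom", None, None),
            ("kid1", "dad", "mom"), ("kid2", "dad", "mom"),
            ("inbred", "kid1", "kid2")]   # offspring of a sibling mating
phi = kinship_matrix(pedigree)
```

In this toy pedigree, parent-offspring and full-sibling pairs both receive kinship 1/4, and the inbred child has self-kinship 1/2 + 1/8.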

When pedigrees are unknown or suspect, SNP markers can be used to estimate the kinship matrix $\Phi$ empirically. One popular estimate is the genetic relationship matrix (GRM), represented here by $S = (s_{ij})$. If $p_k$ denotes the reference allele frequency of SNP $k$, $x_{ik}$ counts the number of reference alleles carried by individual $i$, and $K$ is the number of SNPs, then the elements of $S$ are calculated as

$$s_{ij} = \frac{1}{K} \sum_{k=1}^{K} \frac{(x_{ik} - 2p_k)(x_{jk} - 2p_k)}{4 p_k (1 - p_k)}.$$

Alternatives to the GRM include a method of moments (MoM) estimator [24] and a robust GRM [53,75]. The latter is

$$\hat{\phi}_{ij} = \frac{1}{\sum_{k=1}^{K} 4 p_k (1 - p_k)} \sum_{k=1}^{K} (x_{ik} - 2p_k)(x_{jk} - 2p_k).$$

This unbiased estimator generally has smaller variance than the standard estimator $S$, which is sensitive to low minor allele frequencies [78]. The MendelKinship.jl module calculates the GRM, the robust GRM, and the MoM estimators. All three of these estimators are special cases of general kinship estimators that are unbiased under ideal conditions [78]. When there is ethnic inhomogeneity and spread in the degrees of relationships, $S$ can exhibit bias because it confounds close relatedness and ancestry differences [22]. Ethnic admixture can be accommodated by replacing the allele frequency $p_k$ by an ethnic specific estimate for each individual $i$ [22].
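The two SNP-based estimators differ only in where the normalization by $4p_k(1 - p_k)$ takes place, as the following sketch shows. It is an illustration on toy data and not the MendelKinship.jl code. The function names are hypothetical, and the allele frequencies are supplied by the caller rather than estimated from the sample, which is an assumption made for simplicity.

```python
def grm(X, p):
    """Standard GRM: each SNP is scaled by its own variance term, then averaged."""
    n, K = len(X), len(p)
    return [[sum((X[i][k] - 2 * p[k]) * (X[j][k] - 2 * p[k])
                 / (4 * p[k] * (1 - p[k])) for k in range(K)) / K
             for j in range(n)] for i in range(n)]

def robust_grm(X, p):
    """Robust GRM: sum the cross products first, then divide by the summed variances."""
    n, K = len(X), len(p)
    denom = sum(4 * pk * (1 - pk) for pk in p)
    return [[sum((X[i][k] - 2 * p[k]) * (X[j][k] - 2 * p[k]) for k in range(K)) / denom
             for j in range(n)] for i in range(n)]

# Toy data: 3 individuals by 2 SNPs, with assumed reference allele frequencies.
X = [[2, 0],
     [0, 2],
     [1, 1]]
p = [0.5, 0.25]
S = grm(X, p)          # S[0][0] = 2/3
R = robust_grm(X, p)   # R[0][0] = 5/7
```

Because the standard GRM divides each SNP's contribution by $4p_k(1 - p_k)$ separately, a SNP with a very low minor allele frequency can dominate the average, whereas the robust version pools the denominators.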

Finding the variances of these estimators has been impossible without simplifying assumptions [78]. Our own unpublished approximation to the variance

$$E\left(\|S - E(S)\|_F^2\right) \approx \frac{1}{K^2} \|R\|_F^2 \left[\|\Phi\|_F^2 + \mathrm{tr}(\Phi)^2\right] \tag{3}$$

of the GRM matrix $S$ allows for inbreeding, linkage disequilibrium, and closely related relatives. It relies on the simplifying assumption that the fourth moments of the SNP counts coincide with the fourth moments of similarly distributed Gaussian random variables. In formula (3), $\mathrm{tr}(A)$ is the trace of $A$, $\|A\|_F$ is the Frobenius norm of $A$, and $R$ is the correlation matrix of the SNPs (LD matrix).

To check suspect pedigrees for hidden relatedness, one can compare theoretical kinships $\phi_{ij}$ and empiric kinships $\hat{\phi}_{ij}$. It is convenient to put these on a common scale by subjecting them to an approximate variance-stabilizing transformation. RA Fisher considered the simpler problem of comparing an ordinary covariance matrix $\Sigma = (\sigma_{ij})$ to a sample covariance matrix $S = (s_{ij})$. Under an assumption of normality, he argued [27,28] that the quantity

$$\tanh^{-1}\left(\frac{s_{ij}}{\sqrt{s_{ii} s_{jj}}}\right) - \tanh^{-1}\left(\frac{\sigma_{ij}}{\sqrt{\sigma_{ii} \sigma_{jj}}}\right)$$

is approximately normal with mean 0 and variance $(K - 3)^{-1}$, where $K$ is the sample size, and $\tanh^{-1}$ is the inverse hyperbolic tangent function. By analogy, we subject the GRM matrix or one of its variants to Fisher’s transformation and order the discrepancies from least to greatest in absolute value. The OpenMendel MendelKinship.jl tutorial explains in a concrete example how this transformation identifies outlier pairs.
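The comparison of theoretical and empirical kinships on Fisher's scale might be coded along the following lines. This is a sketch on made-up matrices and not the MendelKinship.jl implementation. The function name `fisher_discrepancies` is hypothetical, and the code assumes every standardized kinship lies strictly between -1 and 1, so that the inverse hyperbolic tangent is finite.

```python
import math

def fisher_discrepancies(empirical, theoretical):
    """Return (|difference|, i, j) for all pairs i < j, ordered from least to greatest."""
    n = len(empirical)
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            r_emp = empirical[i][j] / math.sqrt(empirical[i][i] * empirical[j][j])
            r_theo = theoretical[i][j] / math.sqrt(theoretical[i][i] * theoretical[j][j])
            out.append((abs(math.atanh(r_emp) - math.atanh(r_theo)), i, j))
    return sorted(out)

# Pedigree records: individuals 0 and 1 are unrelated parents of individual 2.
theoretical = [[0.50, 0.00, 0.25],
               [0.00, 0.50, 0.25],
               [0.25, 0.25, 0.50]]
# Hypothetical empirical estimates suggesting that 0 and 1 are in fact close relatives.
empirical = [[0.50, 0.24, 0.26],
             [0.24, 0.50, 0.25],
             [0.26, 0.25, 0.50]]
ranked = fisher_discrepancies(empirical, theoretical)
worst = ranked[-1]   # the pair with the largest discrepancy
```

In this toy example the pair (0, 1) appears last in the ordering, flagging a relationship that the recorded pedigree does not explain.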

8. Variance Component Models

Association studies are subject to the effects of unmeasured confounding. The most common confounder is ethnic ancestry [4, 32, 42], which arises when both trait values and marker allele frequencies differ by region of origin. Ancestry informative markers are particularly prone to show up as false positives in a naïve GWAS [64]. Currently there are two general adjustments for ethnic ancestry. The first approach uses either a few principal components of the GRM matrix [59, 61, 95] or estimated ancestry proportions [2, 62] as fixed effects. The second approach explicitly accounts for the correlation between subjects by including an estimate of the kinship matrix, e.g. the GRM matrix, as a random effect in a variance components model. When reliable pedigrees are available, the second approach is analogous to positing the theoretical kinship matrix as a random effect [10]. Because the theoretical kinship matrix does not capture hidden correlations, inclusion of one of the SNP-based estimates of the kinship matrix is usually preferred.

In any event, the variance components model $y \sim N\left(X\beta, \sum_{j=1}^{k} \sigma_j^2 V_j\right)$ figures prominently in genomewide association testing [25, 43]. In this model $\beta$ are the fixed effects of covariates $X$ and $\sigma_j^2$ is the variance of the $j$th random effect. Estimation of the parameter vectors $\beta$ and $\sigma^2 = (\sigma_1^2, \ldots, \sigma_k^2)^t$ has been the subject of intense study for decades. Most statisticians opt for maximum likelihood or restricted maximum likelihood. In the linear mixed model, the covariance matrices $V_j$ factor as $U_j U_j^t$. The factored form is advantageous if $U_j$ is $n \times r_j$ with $r_j$ small. In the absence of low rank structure, one can take $U_j$ to be the Cholesky factor of $V_j$.

The covariance model $W = 2\sigma_a^2 \Phi + \sigma_e^2 I$ corresponds to polygenic background ($\sigma_a^2$) plus random noise ($\sigma_e^2$). The kinship matrix here can be theoretical or empirical. The model is overly simplistic but widely applied due to its computational tractability. It omits dominance effects, shared environment, and parent of origin effects, among other things. Calculation of the inverse and determinant of $W$ is the rate limiting step in estimation. In the simple polygenic model, a good tactic is to first calculate the spectral decomposition $O D O^t$ of $\Phi$, where $O$ is orthogonal and $D$ is a diagonal matrix. One can then exploit the formulas $\det W = \det(2\sigma_a^2 D + \sigma_e^2 I)$ and $W^{-1} = O(2\sigma_a^2 D + \sigma_e^2 I)^{-1} O^t$. The indicated determinant and inverse of the diagonal matrix are trivial to compute [36, 50, 71].
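For readers who want the intermediate steps, the spectral tactic can be written out as follows. This is a short derivation under the simple polygenic model, using only the orthogonality $O O^t = I$. The symbols $d_i$ (the diagonal entries of $D$) and $\tilde{y} = O^t (y - X\beta)$ (the rotated residual) are notation introduced here for illustration.

```latex
\begin{align*}
W &= 2\sigma_a^2\, O D O^t + \sigma_e^2\, O O^t
   = O\,(2\sigma_a^2 D + \sigma_e^2 I)\,O^t, \\
\det W &= \prod_{i=1}^{n} (2\sigma_a^2 d_i + \sigma_e^2), \qquad
W^{-1} = O\,\mathrm{diag}\!\left(\frac{1}{2\sigma_a^2 d_i + \sigma_e^2}\right) O^t, \\
\ln L(\beta, \sigma_a^2, \sigma_e^2)
  &= -\frac{n}{2}\ln(2\pi)
     - \frac{1}{2}\sum_{i=1}^{n}\ln(2\sigma_a^2 d_i + \sigma_e^2)
     - \frac{1}{2}\sum_{i=1}^{n}\frac{\tilde{y}_i^2}{2\sigma_a^2 d_i + \sigma_e^2}.
\end{align*}
```

After the one-time decomposition, each evaluation of the log-likelihood at new values of $\sigma_a^2$ and $\sigma_e^2$ costs only $O(n)$ operations, apart from the rotation of the residual.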

Our program VarianceComponentModels.jl incorporates this spectral decomposition tactic. It also treats more realistic models with multiple variance components and multivariate traits. For estimation we have compared Fisher scoring and the EM algorithms long familiar to computational statisticians. We have also explored a new MM algorithm that alternates updates of $\beta$ and $\sigma^2$ [90]. The normal equation update of $\beta$ in our algorithm is

$$\beta_{n+1} = (X^t W_n^{-1} X)^{-1} X^t W_n^{-1} y,$$

where $W_n$ is the value of $\sum_{j=1}^{k} \sigma_j^2 V_j$ at the current estimate of $\sigma^2$. The variance component updates are

$$\sigma_{n+1,i}^2 = \sigma_{n,i}^2 \sqrt{\frac{(y - X\beta_{n+1})^t W_n^{-1} V_i W_n^{-1} (y - X\beta_{n+1})}{\mathrm{tr}(W_n^{-1} V_i)}}. \tag{4}$$

The MM algorithm converges faster than the standard EM algorithm. Fisher scoring requires fewer iterations to converge but substantially more effort per iteration, particularly in high dimensions. Both the EM and MM algorithms can be accelerated by quasi-Newton extrapolation [87].
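The flavor of the MM variance component update can be conveyed by a stripped-down sketch. It is not the VarianceComponentModels.jl code, and it makes several simplifying assumptions: there are no fixed effects, there are only two components, and the data have already been rotated by the eigenvectors of the kernel so that the covariance is diagonal, $W = \sigma_a^2 D + \sigma_e^2 I$. Each variance is multiplied by the square root of a quadratic form over a trace, following the MM derivation in [90]. The data values and function names are hypothetical.

```python
import math

def loglik(y, d, sa2, se2):
    """Gaussian log-likelihood of y ~ N(0, sa2 * D + se2 * I) with diagonal D."""
    total = 0.0
    for yi, di in zip(y, d):
        w = sa2 * di + se2
        total += math.log(2 * math.pi * w) + yi * yi / w
    return -0.5 * total

def mm_step(y, d, sa2, se2):
    """One multiplicative MM update of both variance components."""
    w = [sa2 * di + se2 for di in d]
    # Quadratic forms y^t W^{-1} V_i W^{-1} y for V_a = D and V_e = I.
    quad_a = sum(di * yi * yi / (wi * wi) for yi, di, wi in zip(y, d, w))
    quad_e = sum(yi * yi / (wi * wi) for yi, wi in zip(y, w))
    # Traces tr(W^{-1} V_i).
    tr_a = sum(di / wi for di, wi in zip(d, w))
    tr_e = sum(1.0 / wi for wi in w)
    return sa2 * math.sqrt(quad_a / tr_a), se2 * math.sqrt(quad_e / tr_e)

d = [4.0, 2.0, 1.0, 0.5, 0.25, 0.1]      # hypothetical kernel eigenvalues
y = [3.1, -1.9, 1.2, -0.8, 0.9, -0.4]    # hypothetical rotated trait values
sa2, se2 = 1.0, 1.0
history = [loglik(y, d, sa2, se2)]
for _ in range(50):
    sa2, se2 = mm_step(y, d, sa2, se2)
    history.append(loglik(y, d, sa2, se2))
```

Because the update is multiplicative, positive starting values stay positive, and the MM principle guarantees that the recorded log-likelihoods never decrease.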

The VarianceComponentModels.jl module serves as a convenient vehicle for other genetic applications. One example is Mendelian randomization (MR). Observational studies often find an association between a biomarker or expression (or methylation) level at a particular locus and a quantitative trait. The goal of MR is to assess the statistical support for this “exposure” as a cause of the trait, as opposed to reverse causality or confounding [13]. Our Mendelian randomization tutorial for continuous traits demonstrates the value of modularized genetic software such as VarianceComponentModels.jl.

When there are many loci to test, we [20,89] and others [3,17,36,50] have employed score tests or their equivalents in variance component models. Score tests are much faster than likelihood ratio tests (LRTs) because score tests require the likelihood to be maximized only under the null hypothesis. In contrast, LRTs require the likelihood to be maximized under both the null and alternative hypotheses. When the null hypothesis is the same for all loci tested, this can amount to substantial savings. These score tests are easily extended to include maternal genetic effects and maternal-offspring genetic interaction as fixed effects [20]. Although most software programs implementing the score test adopt the simple covariance model $2\sigma_a^2 \Phi + \sigma_e^2 I$, in principle other variance components such as household effects can be included.
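The computational advantage of the score test is easiest to see in the simplest possible setting. The sketch below assumes ordinary linear regression with an intercept-only null model and independent subjects, not the variance component setting and not the MendelGWAS.jl code. The null model is fitted once, and each SNP then costs a few inner products. The function names and data are hypothetical.

```python
import math

def fit_null(y):
    """Fit the intercept-only null model once: residuals and the MLE of the variance."""
    n = len(y)
    ybar = sum(y) / n
    residuals = [yi - ybar for yi in y]
    sigma2 = sum(r * r for r in residuals) / n
    return residuals, sigma2

def score_test(x, residuals, sigma2):
    """Score statistic and 1-df chi-square p-value for one SNP (x must not be constant)."""
    xbar = sum(x) / len(x)
    centered = [xi - xbar for xi in x]
    score = sum(c * r for c, r in zip(centered, residuals))
    information = sigma2 * sum(c * c for c in centered)
    stat = score * score / information
    pvalue = math.erfc(math.sqrt(stat / 2))   # upper tail of chi-square with 1 df
    return stat, pvalue

y = [1, 3, 5, 1, 3, 5]
snps = [[0, 1, 2, 0, 1, 2],    # dosage perfectly correlated with the trait
        [1, 0, 1, 1, 2, 1]]    # dosage uncorrelated with the trait
residuals, sigma2 = fit_null(y)              # done once for all SNPs
results = [score_test(x, residuals, sigma2) for x in snps]
```

No model is refitted inside the loop over SNPs, in contrast to an LRT, which would require a separate maximization under the alternative for every SNP.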

Our recent analysis of the GWAS data from the COPDGene study (http://www.copdgene.org) exemplifies the vast performance gain and yet ease of use of a typical OpenMendel workflow in genetic heritability analysis of a realistically large data set [93]. The data are available from NIH dbGap under phs000179.v5.p2. The steps are: (1) load the binary genotypes of 6,670 individuals at 630,860 SNPs, (2) compute summary statistics on the SNPs, (3) impute missing genotypes, (4) calculate the empirical kinship matrix, (5) load 13 phenotypes, (6) estimate the heritability of each phenotype, (7) estimate the coheritability of each pair of phenotypes, and (8) fit a joint model to all 13 phenotypes. All these steps are performed in a single interactive Julia environment on a common laptop computer. Typically such an analysis pipeline would require running at least five separate programs on a Linux machine.

In our experience, a pure Julia computation is often faster than the corresponding computation in a low-level language such as C, and much faster than the same computation in other high-level languages such as R or Python. Figure 1 compares the speed of fitting large-scale variance component models in our VarianceComponentModels.jl module to the two cutting edge programs GCTA [84] and GEMMA [94], both implemented in C++. In this example, there are two variance components, one for additive genetic effects and one for environmental effects. To make a fair comparison, the genetic relationship matrix $S$ was pre-computed using the GCTA software. There are 13 continuous phenotypes. For both univariate (top panel) and bivariate (bottom panel) models, we observe between 5 and 100 fold speedup over GEMMA and even more over GCTA. In all cases, the final log-likelihoods by Julia match those by GCTA and GEMMA to the third digit.

Fig. 1. Comparison of the OpenMendel VarianceComponentModels.jl implementation with GCTA (C++) and GEMMA (C++) for fitting a univariate variance component model $Y \sim N(0, \sigma_a^2 S + \sigma_e^2 I)$ (top panel) and a bivariate variance component model $Y \sim N(0, \Sigma_a \otimes S + \Sigma_e \otimes I)$ (bottom panel). GEMMA and OpenMendel runtimes exclude the eigen-decomposition of $S$, which is pre-computed.

The current versions of GCTA and GEMMA are only available for the x86 64-bit Linux operating system, while Julia, and thus OpenMendel, are available on all common systems. It is remarkable that a cross-platform, interactive, high-level language such as Julia can achieve such excellent computational efficiency.

9. SNP-Set Analysis

SNP-set analysis, or pathway-based analysis [79], is a powerful, widely-used strategy in sequencing studies. SNPs are grouped into sets to be examined for association with a certain phenotype. This analysis has been shown to have increased power over individual SNP analysis, especially for identifying rare variant associations [47].

Two types of SNP-set analyses are under active development within OpenMendel. The VarianceComponentTests.jl module implements different approaches for testing a set of markers as random effects. Notably the sequence kernel association test (SKAT) [80] is the first method to incorporate the generalized linear mixed model in testing the effect of a set of variants on a quantitative or dichotomous trait. Our recent work on exact tests [92] boosts the power of SKAT on small samples.

In contrast to marginal SNP-set analysis, an alternative approach is subset selection in a joint model $y \sim N\left(X\beta, \sum_{j=1}^{m} \sigma_j^2 V_j + \sigma_0^2 I\right)$, where $V_j$ is the kernel matrix for the $j$th SNP-set and the $\sigma_j^2$ for $j \ge 1$ are the variance components subject to selection. Variance component selection is achieved by minimizing the penalized negative log-likelihood

$$-L(\beta, \sigma^2) + \sum_{j=1}^{m} P_\lambda(\sigma_j),$$

where $L(\beta, \sigma^2)$ is the log-likelihood function and $P_\lambda(\sigma_j)$ is a penalty function. Several penalties, including the ridge, the lasso, the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP), are implemented. The MM update (4) generalizes to penalized estimation because the variance components $\sigma_j^2$ are nicely separated in the surrogate function [90].

10. Simulation Utilities

Simulation is vital in demonstrating the accuracy and power of new statistical methods. It is also important in designing genetic studies, where overly simplistic assumptions can lead to low power. Although a number of simulators are already available [51, 65, 85], there is plenty of room for improvement. The unified nature of the OpenMendel environment makes it easy to craft code for simulating traits conditional on genotypes under any generalized linear model (GLM) or generalized linear mixed model (GLMM).
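As a rough illustration of simulating traits conditional on genotypes under a GLM, the Python sketch below draws minor-allele counts under Hardy-Weinberg equilibrium and then generates a normal, Bernoulli, or Poisson trait whose mean depends on the genotype through the identity, logit, or log link. It does not use any OpenMendel package. The function names, effect sizes, sample size, and allele frequency are all hypothetical values chosen for the example.

```python
import math
import random

def simulate_genotypes(n, maf, rng):
    """Draw n minor-allele counts (0, 1, or 2) under Hardy-Weinberg equilibrium."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

def poisson_draw(mu, rng):
    """Knuth's multiplication method; adequate for small means."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_trait(genotypes, intercept, beta, family, rng, sd=1.0):
    """Simulate one trait per individual with linear predictor intercept + beta * genotype."""
    traits = []
    for g in genotypes:
        eta = intercept + beta * g
        if family == "normal":       # identity link
            traits.append(rng.gauss(eta, sd))
        elif family == "bernoulli":  # logit link
            prob = 1.0 / (1.0 + math.exp(-eta))
            traits.append(1 if rng.random() < prob else 0)
        elif family == "poisson":    # log link
            traits.append(poisson_draw(math.exp(eta), rng))
        else:
            raise ValueError(f"unknown family: {family}")
    return traits

rng = random.Random(2019)  # fixed seed for reproducibility
geno = simulate_genotypes(200, 0.3, rng)
quant = simulate_trait(geno, 0.0, 0.5, "normal", rng)
case = simulate_trait(geno, -1.0, 0.7, "bernoulli", rng)
count = simulate_trait(geno, 0.2, 0.3, "poisson", rng)
print(sum(case), "cases among", len(case), "simulated individuals")
```

Additional covariates would enter the linear predictor in the same way as the genotype term.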

At the time this article was written, the MendelTraitSimulate.jl option was under development. Its current version accommodates study designs involving both unrelated individuals and multigenerational families. The user can specify both fixed and random effects for simulated univariate or multivariate traits. The simulated traits can be based on arbitrary functions of the provided covariates. By default, the program will use the PLINK format and make appropriate calls to SnpArrays.jl and VCFTools.jl.
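For family data, a random effect is typically introduced through the kinship matrix $\Phi$, giving a trait covariance of $2\sigma_a^2 \Phi + \sigma_e^2 I$. The Python sketch below simulates one univariate trait for a nuclear family under this variance component model by multiplying independent standard normal deviates by a Cholesky factor of the covariance. It is a minimal sketch and not the MendelTraitSimulate.jl interface: the function names, the variance values, and the zero mean vector are assumptions.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive definite matrix A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def simulate_family_trait(kinship, mean, var_additive, var_environment, rng):
    """Draw y ~ N(mean, 2 * var_additive * kinship + var_environment * I)."""
    n = len(kinship)
    omega = [[2.0 * var_additive * kinship[i][j] + (var_environment if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    L = cholesky(omega)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent standard normal deviates
    y = [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    return y, omega

# Kinship matrix of a nuclear family: two unrelated parents followed by two children.
PHI = [[0.50, 0.00, 0.25, 0.25],
       [0.00, 0.50, 0.25, 0.25],
       [0.25, 0.25, 0.50, 0.25],
       [0.25, 0.25, 0.25, 0.50]]

rng = random.Random(1)  # fixed seed for reproducibility
trait, omega = simulate_family_trait(PHI, [0.0] * 4, 0.6, 0.4, rng)
print(trait)
```

Fixed effects, such as a SNP or covariate contribution, would enter through the mean vector.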

11. Tutorials

To accompany this article, we have prepared a collection of tutorials in Jupyter Notebooks that demonstrate interactive genetic analysis using OpenMendel packages (https://github.com/OpenMendel/Tutorials). These include (1) PLINK binary data input, summary statistics, filtering, and visualization, (2) kinship calculation and comparison, (3) population GWAS, (4) iterative hard thresholding for GWAS, (5) heritability estimation, (6) Mendelian randomization, (7) GWAS based on linear mixed models, (8) SNP-set analysis, and, soon to come, (9) trait simulation. These tutorials will adapt to and grow with the expanding OpenMendel ecosystem.

12. Discussion

Readers may be familiar with our existing statistical package Mendel [45]. Although Mendel possesses many advantages, our goal going forward is not to modernize it, but to create an entirely new open source platform. Although Mendel is free, it is not open source. The Fortran language underlying it is also antiquated. Fortran lacks the supporting libraries of R and Matlab, its graphics functionality is nil, it neglects crucial statistical and linear algebra tools, and its code is needlessly verbose.

For the sake of brevity, we have not discussed many OpenMendel modules. Omitted modules include: (1) discovery of ancestry informative markers, (2) estimation of allele frequencies from pedigree data, (3) testing for transmission disequilibrium by the gamete competition model, (4) random genotype generation by gene dropping, (5) genetic counseling, (6) two-point linkage analysis, (7) location scores for linkage analysis, and (8) function optimization by recursive quadratic programming. Table 2 lists the currently available OpenMendel analysis options and utility packages as well as those soon to be released as part of the OpenMendel project.

Table 2.

Julia packages in the OpenMendel project.

OpenMendel Option             Description
MendelAimSelection.jl         Selects the most informative SNPs for predicting ancestry
MendelEstimateFrequencies.jl  Estimates allele frequencies from pedigree data
MendelGameteCompetition.jl    Tests for association under the gamete competition model
MendelGeneticCounseling.jl    Computes risks in genetic counseling problems
MendelGWAS.jl                 Tests for association in genome-wide data
MendelIHT.jl                  GWAS using Iterative Hard Thresholding (forthcoming)
MendelImpute.jl               Genotype imputation (forthcoming)
MendelKinship.jl              Computes kinship and other identity coefficients
MendelLocationScores.jl       Maps a trait via the method of location scores
MendelOrdinalGWAS.jl          Implements GWAS for ordinal categorical phenotypes
MendelTwoPointLinkage.jl      Implements two-point linkage analysis
MendelBase.jl                 Base functions for OpenMendel
MendelGeneDropping.jl         Simulates genotypes based on pedigrees
MendelSearch.jl               Optimization routines
MendelTraitSimulate.jl        Trait simulation using GLM and GLMM (forthcoming)
SnpArrays.jl                  Utilities for handling compressed storage of biallelic SNP data
VCFTools.jl                   Utilities for handling compressed storage of sequence data
VarianceComponentModels.jl    Utilities for fitting and testing variance components models

OpenMendel is inspired by a vision of genomic analysis that extracts the maximum benefit from the worldwide increase in genetic data and exploits the promise of collaborative, parallel, and distributed computing. We are not alone in this vision. For example, notable strides have been made by HAIL (https://hail.is/) and TOPMed (https://www.nhlbiwgs.org/awards) in enabling large-scale sequence analysis, and by the Ark data management system for health and biomedical research [7,63]. In our opinion, however, the barriers need to be lowered further to encourage more statistical geneticists and genetic epidemiologists to take part. The Julia language provides an ideal vehicle for this purpose. It is our hope that the OpenMendel project will spark a global effort to build a computing platform equal to the challenges of 21st-century genetic research.

Contributor Information

Hua Zhou, Department of Biostatistics, UCLA Fielding School of Public Health, Tel.: +1-310-794-7835, huazhou@ucla.edu.

Janet S Sinsheimer, Department of Human Genetics, David Geffen School of Medicine at UCLA, Tel.: +1-310-825-8002, jsinshei@g.ucla.edu.

Douglas M Bates, Department of Statistics, University of Wisconsin, Madison.

Benjamin B Chu, Department of Biomathematics, David Geffen School of Medicine at UCLA.

Christopher A German, Department of Biostatistics, UCLA Fielding School of Public Health.

Sarah S Ji, Department of Biostatistics, UCLA Fielding School of Public Health.

Kevin L Keys, Department of Medicine, University of California, San Francisco.

Juhyun Kim, Department of Biostatistics, UCLA Fielding School of Public Health.

Seyoon Ko, Department of Statistics, Seoul National University.

Gordon D Mosher, Departments of Statistics and Computer Science, University of California, Riverside.

Jeanette C Papp, Department of Human Genetics, David Geffen School of Medicine at UCLA.

Eric M Sobel, Department of Human Genetics, David Geffen School of Medicine at UCLA.

Jing Zhai, Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona.

Jin J Zhou, Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona.

Kenneth Lange, Department of Biomathematics, David Geffen School of Medicine at UCLA, Tel.: +1-310-206-8076, klange@ucla.edu.

References

  • 1. Aird I, Bentall HH, Roberts JF: Relationship between cancer of stomach and the ABO blood groups. British Medical Journal 1(4814), 799 (1953)
  • 2. Alexander DH, Novembre J, Lange K: Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19(9), 1655–1664 (2009). DOI 10.1101/gr.094052.109
  • 3. Amin N, Van Duijn CM, Aulchenko YS: A genomic background based method for association analysis in related individuals. PLoS ONE 2(12), e1274 (2007)
  • 4. Astle W, Balding DJ, et al.: Population structure and cryptic relatedness in genetic association studies. Statistical Science 24(4), 451–471 (2009)
  • 5. Bahmani S, Raj B, Boufounos PT: Greedy sparsity-constrained optimization. Journal of Machine Learning Research 14(March), 807–841 (2013)
  • 6. Bezanson J, Edelman A, Karpinski S, Shah VB: Julia: A fresh approach to numerical computing. SIAM Review 59(1), 65–98 (2017). DOI 10.1137/141000671
  • 7. Bickerstaffe A, Ranaweera T, Endersby T, Ellis C, Maddumarachchi S, Gooden GE, White P, Moses EK, Hewitt AW, Hopper JL: The Ark: a customizable web-based data management tool for health and medical research. Bioinformatics 33(4), 624–626 (2017). DOI 10.1093/bioinformatics/btw675
  • 8. Blumensath T, Davies ME: Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications 14(5–6), 629–654 (2008)
  • 9. Blumensath T, Davies ME: Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis 27(3), 265–274 (2009)
  • 10. Boerwinkle E, Sing C: The use of measured genotype information in the analysis of quantitative phenotypes in man. Annals of Human Genetics 51(3), 211–226 (1987)
  • 11. Brody JA, Morrison AC, Bis JC, O’Connell JR, Brown MR, Huffman JE, Ames DC, Carroll A, Conomos MP, Gabriel S, et al.: Analysis commons, a team approach to discovery in a big-data environment for genetic epidemiology. Nature Genetics 49(11), 1560 (2017)
  • 12. Bühlmann P, Van De Geer S: Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media (2011)
  • 13. Burgess S, Thompson SG: Mendelian randomization: methods for using genetic variants in causal estimation. Chapman and Hall/CRC (2015)
  • 14. Candès EJ, Recht B: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717–772 (2009). DOI 10.1007/s10208-009-9045-5
  • 15. Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: a review of statistical methods and recommendations for their application. The American Journal of Human Genetics 86(1), 6–22 (2010)
  • 16. Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HYK, Chen R, Miriami E, Karczewski KJ, Hariharan M, Dewey FE, Cheng Y, Clark MJ, Im H, Habegger L, Balasubramanian S, O’Huallachain M, Dudley JT, Hillenmeyer S, Haraksingh R, Sharon D, Euskirchen G, Lacroute P, Bettinger K, Boyle AP, Kasowski M, Grubert F, Seki S, Garcia M, Whirl-Carrillo M, Gallardo M, Blasco MA, Greenberg PL, Snyder P, Klein TE, Altman RB, Butte AJ, Ashley EA, Gerstein M, Nadeau KC, Tang H, Snyder M: Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 148(6), 1293–1307 (2012). DOI 10.1016/j.cell.2012.02.009
  • 17. Chen WM, Abecasis GR: Family-based association tests for genomewide association scans. The American Journal of Human Genetics 81(5), 913–926 (2007)
  • 18. Chi EC, Zhou H, Chen GK, Del Vecchyo DO, Lange K: Genotype imputation via matrix completion. Genome Research 23(3), 509–518 (2013). DOI 10.1101/gr.145821.112
  • 19. Chiu CY, Jung J, Chen W, Weeks DE, Ren H, Boehnke M, Amos CI, Liu A, Mills JL, Ting Lee ML, Xiong M, Fan R: Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models. European Journal of Human Genetics 25, 350 (2016). DOI 10.1038/ejhg.2016.170
  • 20. Clark MM, Blangero J, Dyer TD, Sobel EM, Sinsheimer JS: The quantitative-MFG test: A linear mixed effect model to detect maternal-offspring gene interactions. Annals of Human Genetics 80(1), 63–80 (2016). DOI 10.1111/ahg.12137
  • 21. Claster A: Julia joins petaflop club (2017). URL https://juliacomputing.com/press/2017/09/12/julia-joins-petaflop-club.html
  • 22. Conomos MP, Reiner AP, Weir BS, Thornton TA: Model-free estimation of recent genetic relatedness. The American Journal of Human Genetics 98(1), 127–148 (2016)
  • 23. Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M: Mapping complex disease traits with global gene expression. Nature Reviews Genetics 10(3), 184 (2009)
  • 24. Day-Williams AG, Blangero J, Dyer TD, Lange K, Sobel EM: Linkage analysis without defined pedigrees. Genetic Epidemiology 35(5), 360–370 (2011). DOI 10.1002/gepi.20584
  • 25. Falconer D, Mackay TFC: Introduction to Quantitative Genetics, pp. 82–86 (1996)
  • 26. Fan R, Wang Y, Chiu CY, Chen W, Ren H, Li Y, Boehnke M, Amos CI, Moore JH, Xiong M: Meta-analysis of complex diseases at gene level with generalized functional linear models. Genetics 202(2), 457–470 (2016). DOI 10.1534/genetics.115.180869. URL http://www.genetics.org/content/202/2/457
  • 27. Fisher RA: Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10(4), 507–521 (1915)
  • 28. Fisher RA: On the probable error of a coefficient of correlation deduced from a small sample. Metron 1, 3–32 (1921)
  • 29. Gaziano JM, Concato J, Brophy M, Fiore L, Pyarajan S, Breeling J, Whitbourne S, Deen J, Shannon C, Humphries D, Guarino P, Aslan M, Anderson D, LaFleur R, Hammond T, Schaa K, Moser J, Huang G, Muralidhar S, Przygodzki R, O’Leary TJ: Million Veteran Program: A mega-biobank to study genetic influences on health and disease. Journal of Clinical Epidemiology 70, 214–223 (2016). DOI 10.1016/j.jclinepi.2015.09.016
  • 30. Hall MA, Wallace J, Lucas A, Kim D, Basile AO, Verma SS, McCarty CA, Brilliant MH, Peissig PL, Kitchner TE, et al.: Plato software provides analytic framework for investigating complexity beyond genome-wide association studies. Nature Communications 8(1), 1167 (2017)
  • 31. Hastie T, Mazumder R, Lee JD, Zadeh R: Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research 16(1), 3367–3402 (2015). URL http://dl.acm.org/citation.cfm?id=2789272.2912106
  • 32. Helgason A, Yngvadóttir B, Hrafnkelsson B, Gulcher J, Stefánsson K: An Icelandic example of the impact of population structure on association studies. Nature Genetics 37(1), 90 (2005)
  • 33. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR: Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 44(8), 955 (2012)
  • 34. Howie BN, Donnelly P, Marchini J: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6), e1000529 (2009). DOI 10.1371/journal.pgen.1000529
  • 35. Jacquard A: The Genetic Structure of Populations, vol. 5. Springer Science & Business Media (1974)
  • 36. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E: Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42(4), 348–354 (2010)
  • 37. Kawaguchi ES, Suchard MA, Liu Z, Li G: Scalable sparse Cox regression for large-scale survival data via broken adaptive ridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (in press) (2018). URL https://arxiv.org/abs/1712.00561
  • 38. Keys KL, Chen GK, Lange K: Iterative hard thresholding for model selection in genome-wide association studies. Genetic Epidemiology 41(8), 756–768 (2017)
  • 39. Khanna R, Kyrillidis A: IHT dies hard: Provable accelerated iterative hard thresholding. In: International Conference on Artificial Intelligence and Statistics, pp. 188–198 (2018)
  • 40. Kilpinen H, Barrett JC: How next-generation sequencing is transforming complex disease genetics. Trends in Genetics 29(1), 23–30 (2013)
  • 41. Kim J, Bai Y, Pan W: An adaptive association test for multiple phenotypes with GWAS summary statistics. Genetic Epidemiology 39(8), 651–663 (2015)
  • 42. Knowler WC, Williams R, Pettitt D, Steinberg AG: Gm3; 5, 13, 14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. American Journal of Human Genetics 43(4), 520 (1988)
  • 43. Lange K: Mathematical and Statistical Methods for Genetic Analysis. Springer Science & Business Media (2003)
  • 44. Lange K: MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA (2016). DOI 10.1137/1.9781611974409.ch1
  • 45. Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM: Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics 29(12), 1568–1570 (2013)
  • 46. Lange K, Sinsheimer J: Calculation of genetic identity coefficients. Annals of Human Genetics 56(4), 339–346 (1992)
  • 47. Lee S, Abecasis GR, Boehnke M, Lin X: Rare-variant association analysis: Study designs and statistical tests. American Journal of Human Genetics 95(1), 5–23 (2014). DOI 10.1016/j.ajhg.2014.06.009
  • 48. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology 34(8), 816–834 (2010). DOI 10.1002/gepi.20533
  • 49. Liberty E, Woolfe F, Martinsson PG, Rokhlin V, Tygert M: Randomized algorithms for the low-rank approximation of matrices. Proc. Natl. Acad. Sci. USA 104(51), 20167–20172 (2007). DOI 10.1073/pnas.0709640104
  • 50. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D: FaST Linear Mixed Models for genome-wide association studies. Nature Methods 8(10), 833–835 (2011)
  • 51. Liu Y, Athanasiadis G, Weale ME: A survey of genetic simulation software for population and epidemiological studies. Human Genomics 3(1), 79 (2008)
  • 52. Mancuso N, Shi H, Goddard P, Kichaev G, Gusev A, Pasaniuc B: Integrating gene expression with summary association statistics to identify genes associated with 30 complex traits. The American Journal of Human Genetics 100(3), 473–487 (2017)
  • 53. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM: Robust relationship inference in genome-wide association studies. Bioinformatics 26(22), 2867–2873 (2010)
  • 54. Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics 39(7), 906–913 (2007). DOI 10.1038/ng2088
  • 55. Metzker ML: Sequencing technologies - the next generation. Nature Reviews Genetics 11(1), 31 (2010)
  • 56. Mittal S, Madigan D, Burd RS, Suchard MA: High-dimensional, massive sample-size Cox proportional hazards regression for survival analysis. Biostatistics 15(2), 207–221 (2014). DOI 10.1093/biostatistics/kxt043
  • 57. Morris AP, Lindgren CM, Zeggini E, Timpson NJ, Frayling TM, Hattersley AT, McCarthy MI: A powerful approach to sub-phenotype analysis in population-based genetic association studies. Genetic Epidemiology 34(4), 335–343 (2010)
  • 58. Novembre J, Peter BM: Recent advances in the study of fine-scale population structure in humans. Current Opinion in Genetics & Development 41, 98–105 (2016)
  • 59. Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genetics 2(12), e190 (2006)
  • 60. Pickrell WO, Rees MI, Chung SK: Next generation sequencing methodologies - an overview. In: Advances in Protein Chemistry and Structural Biology, vol. 89, pp. 1–26. Elsevier (2012)
  • 61. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909 (2006). DOI 10.1038/ng1847
  • 62. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959 (2000)
  • 63. Ranaweera T, Makalic E, Hopper JL, Bickerstaffe A: An open-source, integrated pedigree data management and visualization tool for genetic epidemiology. International Journal of Epidemiology 47(4), 1034–1039 (2018). DOI 10.1093/ije/dyy049
  • 64. Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. The American Journal of Human Genetics 73(6), 1402–1422 (2003)
  • 65. Schäffer AA, Lemire M, Ott J, Lathrop GM, Weeks DE: Coordinated conditional simulation with SLINK and SUP of many markers linked or associated to a trait in large pedigrees. Human Heredity 71(2), 126–134 (2011)
  • 66. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. The American Journal of Human Genetics 70(2), 425–434 (2002)
  • 67. Shen J, Li P: A tight bound of hard thresholding. The Journal of Machine Learning Research 18(1), 7650–7691 (2017)
  • 68. Sobel E, Lange K, O’Connell JR, Weeks DE: Haplotyping algorithms. In: Genetic mapping and DNA sequencing, pp. 89–110. Springer (1996)
  • 69. Suchard MA, Simpson SE, Zorych I, Ryan P, Madigan D: Massive parallelization of serial inference algorithms for a complex generalized linear model. ACM Transactions on Modeling and Computer Simulation (TOMACS) 23(1), article 10:1–17 (2013). DOI 10.1145/2414416.2414791
  • 70. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, et al.: UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine 12(3), e1001779 (2015)
  • 71. Svishcheva GR, Axenovich TI, Belonogova NM, van Duijn CM, Aulchenko YS: Rapid variance components–based method for whole-genome association analysis. Nature Genetics 44(10), 1166 (2012)
  • 72. Telenti A, Pierce LCT, Biggs WH, di Iulio J, Wong EHM, Fabani MM, Kirkness EF, Moustafa A, Shah N, Xie C, Brewerton SC, Bulsara N, Garner C, Metzker G, Sandoval E, Perkins BA, Och FJ, Turpaz Y, Venter JC: Deep sequencing of 10,000 human genomes. Proceedings of the National Academy of Sciences 113(42), 11901–11906 (2016). DOI 10.1073/pnas.1613365113
  • 73. Van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C: Ten years of next-generation sequencing technology. Trends in Genetics 30(9), 418–426 (2014)
  • 74. Van Leeuwen EM, Kanterakis A, Deelen P, Kattenberg MV, Abdellaoui A, Hofman A, Schönhuth A, Menelaou A, de Craen AJ, van Schaik BD, et al.: Population-specific genotype imputations using minimac or impute2. Nature Protocols 10(9), 1285 (2015)
  • 75. VanRaden PM: Efficient methods to compute genomic predictions. Journal of Dairy Science 91(11), 4414–4423 (2008)
  • 76. Visscher PM, Brown MA, McCarthy MI, Yang J: Five years of GWAS discovery. The American Journal of Human Genetics 90(1), 7–24 (2012)
  • 77. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J: 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics 101(1), 5–22 (2017)
  • 78. Wang B, Sverdlov S, Thompson E: Efficient estimation of realized kinship from SNP genotypes. Genetics 210(2) (2017)
  • 79. Wang K, Li M, Bucan M: Pathway-based approaches for analysis of genomewide association studies. The American Journal of Human Genetics 81(6), 1278–1283 (2007)
  • 80. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics 89(1), 82–93 (2011)
  • 81. Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25(6), 714–721 (2009)
  • 82. Yang F, Barber RF, Jain P, Lafferty J: Selective inference for group-sparse linear models. In: Advances in Neural Information Processing Systems, pp. 2469–2477 (2016)
  • 83. Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, Heath AC, Martin NG, Montgomery GW, Weedon MN, Loos RJ, et al.: Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics 44(4), 369 (2012)
  • 84. Yang J, Lee SH, Goddard ME, Visscher PM: GCTA: A tool for genome-wide complex trait analysis. American Journal of Human Genetics 88(1), 76–82 (2011). DOI 10.1016/j.ajhg.2010.11.011
  • 85. Yuan X, Miller DJ, Zhang J, Herrington D, Wang Y: An overview of population genetic data simulation. Journal of Computational Biology 19(1), 42–54 (2012)
  • 86. Yuan XT, Li P, Zhang T: Gradient hard thresholding pursuit. Journal of Machine Learning Research 18, 166–1 (2017)
  • 87. Zhou H, Alexander D, Lange K: A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing 21(2), 261–273 (2011)
  • 88. Zhou H, Alexander DH, Sehl ME, Sinsheimer JS, Sobel E, Lange K: Penalized regression for genome-wide association screening of sequence data. In: Biocomputing 2011, pp. 106–117. World Scientific (2011)
  • 89. Zhou H, Blangero J, Dyer TD, Chan KHK, Lange K, Sobel EM: Fast genome-wide QTL association mapping on pedigree and population data. Genetic Epidemiology 41(3), 174–186 (2017). DOI 10.1002/gepi.21988
  • 90. Zhou H, Hu L, Zhou J, Lange K: MM algorithms for variance components models. Journal of Computational and Graphical Statistics (accepted) (2018). DOI 10.1080/10618600.2018.1529601
  • 91. Zhou H, Sehl ME, Sinsheimer JS, Lange K: Association screening of common and rare genetic variants by penalized regression. Bioinformatics 26(19), 2375–2382 (2010)
  • 92. Zhou JJ, Hu T, Qiao D, Cho MH, Zhou H: Boosting gene mapping power and efficiency with efficient exact variance component tests of SNP sets. Genetics 204(3), 921–931 (2016)
  • 93. Zhou JJ, Sinsheimer JS, Cho MH, Castaldi P, Zhou H: MMVC: An efficient MM algorithm to quantify genetic correlations across large number of phenotypes in giant datasets. Manuscript in preparation (2018)
  • 94. Zhou X, Stephens M: Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods 11(4), 407–409 (2014). DOI 10.1038/nmeth.2848
  • 95. Zhu X, Zhang S, Zhao H, Cooper RS: Association mapping, using a mixture model for complex traits. Genetic Epidemiology 23(2), 181–196 (2002)
