\documentclass[10pt]{article}
\usepackage[authoryear,square]{natbib}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{rotating}
\usepackage{listings}
\usepackage{setspace}
\usepackage[bottom=1in,right=1in,top=1.5in,left=1in]{geometry}
\usepackage{indentfirst}
\usepackage{url}
\title{Supplement}
\doublespacing
\renewcommand{\familydefault}{\sfdefault}

\begin{document}

\section{Methods}

\subsection{Coordinate Descent}

In case-control studies of disease, affection status is related to a set of genetic and/or environmental covariates through a vector of regression coefficients, denoted $\beta$. Suppose $y$ stores the vector of binary affection statuses and $x$ is an $n \times p$ matrix of covariates comprising discrete genotypes (coded as the number of risk alleles) and/or environmental covariates (treated as continuous values), defined across $n$ subjects and $p$ variables. The probability of affection for subject $i$ is
\begin{eqnarray}
p_i = \frac{1}{1+\exp(-\beta^T x_i)}.
\label{eqn_prob_affection}
\end{eqnarray}
Each element of $\beta$ is interpreted as the log odds ratio of disease risk per unit increase of its respective covariate. The log-likelihood across all observations is then
\begin{eqnarray}
L(\beta) = \displaystyle\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right].
\label{eqn_logistic_loglike}
\end{eqnarray}
We adopt the penalized regression model described by \citet{zhou10}, which combines an L1 penalty on individual variables with an L2 penalty on variable groups (e.g.\ genes, pathways). The L1 parameter (denoted $\lambda_L$) enforces sparsity, while the L2 parameter (denoted $\lambda_E$) encourages variables mildly correlated with a strong predictor in the same group to enter the model. Our goal is to minimize the objective function
\begin{eqnarray}
f(\beta) = -L(\beta) + \lambda_L||\beta||_1 + \lambda_E \displaystyle\sum_G ||\beta_G||_2.
\label{eqn_penalized_like}
\end{eqnarray}
This parameterization generalizes several well-known methods. For example, setting $\lambda_E=0$ reduces Eq \ref{eqn_penalized_like} to the original LASSO \citep{tibshirani96}, whereas placing all variables in a single group $G$ reduces Eq \ref{eqn_penalized_like} to the elastic net \citep{zou2005}. Because the L1 penalty in Eq \ref{eqn_penalized_like} is non-differentiable at zero, special care must be taken during optimization whenever an update to an element of $\beta$ crosses zero.
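To make the model concrete, the short host-side routine below evaluates Eqs \ref{eqn_prob_affection}--\ref{eqn_penalized_like} for a candidate $\beta$. It is a minimal serial sketch rather than the released GPU code, and its names (e.g.\ \textbf{penalized\_objective}, the \textbf{groups} index lists) are illustrative only.
\singlespacing
\begin{lstlisting}
// Illustrative serial evaluation of the penalized logistic objective.
#include <cmath>
#include <cstddef>
#include <vector>

double penalized_objective(const std::vector<std::vector<double> > &x, // n x p design matrix
                           const std::vector<int> &y,                  // 0/1 affection status
                           const std::vector<double> &beta,            // candidate coefficients
                           const std::vector<std::vector<int> > &groups, // variable indices per group
                           double lambdaL, double lambdaE) {
  const std::size_t n = x.size(), p = beta.size();
  double logL = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    double eta = 0.0;
    for (std::size_t j = 0; j < p; ++j) eta += beta[j] * x[i][j];
    double pi = 1.0 / (1.0 + std::exp(-eta));                       // probability of affection
    logL += y[i] * std::log(pi) + (1 - y[i]) * std::log(1.0 - pi);  // log-likelihood contribution
  }
  double l1 = 0.0;
  for (std::size_t j = 0; j < p; ++j) l1 += std::fabs(beta[j]);     // L1 norm
  double group_l2 = 0.0;
  for (std::size_t g = 0; g < groups.size(); ++g) {                 // sum of group L2 norms
    double ss = 0.0;
    for (std::size_t k = 0; k < groups[g].size(); ++k)
      ss += beta[groups[g][k]] * beta[groups[g][k]];
    group_l2 += std::sqrt(ss);
  }
  return -logL + lambdaL * l1 + lambdaE * group_l2;                 // penalized objective
}
\end{lstlisting}
\doublespacing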
Zero crossings of this kind are simple to handle with one-dimensional optimization schemes such as coordinate descent. The goal of coordinate descent is to solve, for each covariate $j$, for the value $z_j$ that minimizes the objective function with all other coordinates held fixed. One-dimensional Newton-Raphson, which is well suited to finding $z_j$ quickly, approximates this minimizer using a second-order Taylor expansion of the objective around the current value $\beta_j$. Regularized regression is usually solved using cyclic coordinate descent (CCD) \citep{zhang01}. At each variable $j$, CCD updates $\beta_j$ as
\begin{eqnarray}
\beta_j^{new} = z_j = \beta_j-\frac{f'(\beta_j)}{f''(\beta_j)},
\label{eqn_beta_new}
\end{eqnarray}
where the first derivative is
\begin{eqnarray}
f'(\beta_j) &=& \left. \frac{df(z_j)}{dz} \right |_{z=\beta_j} \nonumber\\
&=& \displaystyle\sum_{i=1}^{n} x_{ij}(p_i - y_i) + \lambda_L\,\textrm{sgn}(\beta_j)
+\lambda_E \left\{ \begin{array}{ll}
\frac{\beta_j}{||\beta_G||_2}, & ||\beta_G||_2 > 0\\
\textrm{sgn}(\beta_j), & ||\beta_G||_2 = 0
\end{array} \right.
\label{eqn_gradient}
\end{eqnarray}
and the second derivative is
\begin{eqnarray}
f''(\beta_j) &=& \left. \frac{d^2f(z_j)}{dz^2} \right |_{z=\beta_j} \nonumber\\
&=& \displaystyle\sum_{i=1}^{n} x_{ij}^2\, p_i(1-p_i)
+\lambda_E \left\{ \begin{array}{ll}
\frac{1}{||\beta_G||_2}\left(1-\frac{\beta_j^2}{||\beta_G||^2_2}\right), & ||\beta_G||_2 > 0\\
0, & ||\beta_G||_2 = 0,
\end{array} \right.
\label{eqn_hessian}
\end{eqnarray}
where $p_i$ is evaluated at the current $\beta$ as in Eq \ref{eqn_prob_affection} and $\beta_G$ denotes the sub-vector of coefficients for the group containing variable $j$. Each update of an element of $\beta$ also updates the length-$n$ vector of fitted values $x\beta$. CCD traverses the set of variables over multiple cycles until a convergence criterion is reached (e.g.\ the likelihood improvement between cycles is sufficiently small).

In a review article, Wu and Lange described a slight variation on CCD called greedy coordinate descent (GCD) \citep{wu08}. Rather than updating $\beta_j$ immediately, GCD records the likelihood that would result from applying Eq \ref{eqn_beta_new} at index $j$. After evaluating all $p$ variables, $\beta$ is updated only at the index yielding the largest increase in the likelihood. In contrast to CCD, which exposes parallelism only across subjects, GCD offers the opportunity to carry out calculations in parallel across both subjects and variables. A serial sketch of one GCD pass is given below.
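The following routine sketches one GCD pass based on Eqs \ref{eqn_beta_new}--\ref{eqn_hessian}. For brevity it omits the group (L2) penalty terms (i.e.\ it assumes $\lambda_E=0$) and uses a simple rule for updates that cross zero; the released implementation evaluates the same quantities in parallel on the GPU, and the names used here are illustrative.
\singlespacing
\begin{lstlisting}
// Serial sketch of one greedy coordinate descent pass (L1 penalty only).
#include <cmath>
#include <cstddef>
#include <vector>

struct Proposal {
  std::size_t j;        // index of the best variable
  double beta_new;      // its proposed new value
  double delta_logL;    // resulting likelihood improvement
};

Proposal gcd_step(const std::vector<std::vector<double> > &x,  // n x p design matrix
                  const std::vector<int> &y,                   // 0/1 affection status
                  const std::vector<double> &beta,             // current coefficients
                  const std::vector<double> &xb,               // current fitted values x*beta
                  double lambdaL) {
  const std::size_t n = x.size(), p = beta.size();
  Proposal best = {0, 0.0, -1e300};
  for (std::size_t j = 0; j < p; ++j) {
    double g = 0.0, h = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
      double pi = 1.0 / (1.0 + std::exp(-xb[i]));
      g += x[i][j] * (pi - y[i]);                 // data term of the gradient f'(beta_j)
      h += x[i][j] * x[i][j] * pi * (1.0 - pi);   // data term of the second derivative f''(beta_j)
    }
    g += lambdaL * ((beta[j] > 0) - (beta[j] < 0));          // L1 subgradient term
    if (h <= 0.0) continue;
    double bnew = beta[j] - g / h;                           // one-dimensional Newton step
    if (beta[j] != 0.0 && bnew * beta[j] < 0.0) bnew = 0.0;  // stop at a zero crossing
    // likelihood change if only beta_j moves to bnew
    const double d = bnew - beta[j];
    double dl = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
      double pi_old = 1.0 / (1.0 + std::exp(-xb[i]));
      double pi_new = 1.0 / (1.0 + std::exp(-(xb[i] + d * x[i][j])));
      dl += y[i] * (std::log(pi_new) - std::log(pi_old))
          + (1 - y[i]) * (std::log(1.0 - pi_new) - std::log(1.0 - pi_old));
    }
    if (dl > best.delta_logL) { best.j = j; best.beta_new = bnew; best.delta_logL = dl; }
  }
  return best;  // the caller updates beta[best.j] and refreshes the fitted values
}
\end{lstlisting}
\doublespacing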
\subsection{OpenCL kernels}

One feature that distinguishes our software from the majority of available GPU software releases, most of which are implemented in nVidia's proprietary CUDA library \citep{cuda}, is our decision to develop parallel code using OpenCL \citep{khronos}. OpenCL's support across the key hardware manufacturers ensures that OpenCL programs require minimal configuration changes when targeting a wide range of hardware, including GPU devices from ATI or nVidia, as well as the massively parallel CPUs currently being developed by Intel and AMD.

OpenCL programs follow a heterogeneous computing model. That is, components of an algorithm that can execute in parallel on an SIMD device such as a GPU are implemented in routines, called kernels, that are invoked by serial host code. Kernels execute data-parallel instructions: at each clock cycle, calculations proceed in lock-step across distinct \textit{work-items}, an abstraction that maps computing cores to distinct memory addresses (e.g.\ registers or elements of an array). OpenCL defines an explicit memory model, so programmers must move data explicitly between layers of the memory hierarchy. On GPU devices, \textit{global memory} is scoped across all work-items, \textit{local memory} is scoped only within the \textit{work-group} that a work-item is assigned to, and each \textit{register} is scoped only to a specific work-item. Global memory is the most abundant memory space, while local memory and register space are far scarcer (on the order of 64kB on high-end devices). Global memory also has up to two orders of magnitude greater latency than the latter two memory spaces, so good programming practice includes moving data into fast memory as early as possible during kernel routines.

It is vital that large datasets be stored efficiently in global memory, and compacted if possible, since even the most advanced GPU devices at the time of writing contain less than 2.5 GB of memory. Fortunately, SNP genotype data compress easily, since each genotype takes one of only three values (0, 1, or 2 risk alleles) plus a possible ``no-call''. One strategy is to declare genotypes in memory as a large array of character primitives, where each byte stores four genotypes. However, this strategy utilizes only 25\% of the available bandwidth between global and local memory, since work-items fetch/store a minimum of 4 bytes when accessing global memory. Our approach is instead to declare an array of custom containers, each storing a cluster of four packed genotype bytes (i.e.\ 16 genotypes):
\singlespacing
\begin{lstlisting}
typedef struct {
  char geno[4];
}__attribute__((aligned(4))) packedgeno_t;
\end{lstlisting}
\doublespacing
\begin{figure}[!tpb]
\centerline{\includegraphics[width=8cm,height=5cm]{genotype_load.eps}}
\caption{Decompressing genotypes from a single memory load transaction}\label{fig_genotype_load}
\end{figure}
After transferring elements from a global array to a local array, data can quickly be decompressed using up to three bit-shift operations and a single mask operation. The decompression step executes in parallel, storing results in a second local array of floats as illustrated in Figure \ref{fig_genotype_load}. Hence, each set of 32 packed genotype elements stored in global memory decompresses into 512 floating point genotypes in local memory. An illustrative kernel fragment follows.
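The OpenCL C fragment below illustrates how a work-group of 512 work-items might cooperatively unpack a block of 32 \textbf{packedgeno\_t} structs into 512 floats in local memory. It is a simplified sketch rather than the shipped kernel; the kernel and argument names, and the particular 2-bit coding of no-calls, are illustrative.
\singlespacing
\begin{lstlisting}
// Illustrative OpenCL C kernel: unpack 2-bit genotypes into local memory.
typedef struct {
  char geno[4];
}__attribute__((aligned(4))) packedgeno_t;

__kernel void unpack_block(__global const packedgeno_t * packed,  // compressed genotypes
                           __global float * out,                  // decompressed output
                           const unsigned int n) {                // total genotype count
  __local packedgeno_t local_packed[32];
  __local float geno[512];
  const unsigned int lid = get_local_id(0);   // 0..511
  const unsigned int grp = get_group_id(0);

  // 1) coalesced load: the first 32 work-items copy the packed block
  if (lid < 32) local_packed[lid] = packed[grp * 32 + lid];
  barrier(CLK_LOCAL_MEM_FENCE);

  // 2) each work-item extracts its own genotype with a shift and a mask
  const unsigned int chunk = lid / 16;        // which packedgeno_t
  const unsigned int byte  = (lid % 16) / 4;  // which char within it
  const unsigned int slot  = lid % 4;         // which 2-bit field within the char
  const char c = local_packed[chunk].geno[byte];
  geno[lid] = (float)((c >> (2 * slot)) & 0x3);  // 0,1,2 risk alleles (no-call coding illustrative)

  // 3) write back; a real kernel would instead consume geno[] in place
  const unsigned int g = grp * 512 + lid;
  if (g < n) out[g] = geno[lid];
}
\end{lstlisting}
\doublespacing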
As alluded to above, work-items are organized into collections called \textit{work-groups}: the work-items of a group are arranged in a three-dimensional lattice, and the work-groups themselves are arranged in a second three-dimensional lattice. The dimensions of both lattices are chosen by the programmer to make the most effective use of the available cores and memory resources. On GPUs, work-groups execute asynchronously and in any order. This yields large gains in efficiency as the level of parallelization increases, since memory latency in some work-groups is masked by calculations being carried out in other work-groups.
\begin{figure}[!tpb]
\centerline{\includegraphics[width=8cm,height=5cm]{workgroup_greedy.eps}}
\caption{Work-group layout for GCD}\label{fig_workgroup_greedy}
\end{figure}
For parallel GCD, the covariate values at each variable are divided into blocks of 512 elements and stored in local memory so that each work-group can operate on a particular block. Work-groups compute contributions to statistics such as the likelihood (Eq \ref{eqn_logistic_loglike}) and the first (Eq \ref{eqn_gradient}) and second (Eq \ref{eqn_hessian}) derivatives. Figure \ref{fig_workgroup_greedy} illustrates how we organize work-groups and work-items for GCD. Work-items have an x-dimension of 512 and y- and z-dimensions of 1. Work-groups have an x-dimension of 1024 (representing blocks of the $p$ variables), a y-dimension of $n$/512+1, and a z-dimension of $p$/1024.
\begin{figure}[!tpb]
\centerline{\includegraphics[width=8cm,height=5cm]{reduction.eps}}
\caption{log$_2$ reduction for summing array elements}\label{fig_reduction}
\end{figure}
Calculations that aggregate across elements of an array, such as summations or max operators, can also take advantage of moderate parallelization. A common practice in parallel programming is to perform a log$_2$ reduction. For any array whose size is a power of 2, elements in the second half of the array can be ``collapsed'' into the first half; the procedure is repeated recursively on the first half until the final value is stored in the first element. Figure \ref{fig_reduction} illustrates this concept, and a kernel fragment implementing it is sketched below.
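The OpenCL C fragment below sketches the log$_2$ reduction of Figure \ref{fig_reduction} for a work-group-level sum; the kernel and argument names are illustrative rather than those of the released code.
\singlespacing
\begin{lstlisting}
// Illustrative OpenCL C kernel: work-group log2 reduction for a sum.
__kernel void block_sum(__global const float * in,
                        __global float * partial,   // one partial sum per work-group
                        __local float * scratch,    // local buffer, one float per work-item
                        const unsigned int n) {
  const unsigned int lid = get_local_id(0);
  const unsigned int gid = get_global_id(0);
  const unsigned int lsz = get_local_size(0);       // must be a power of 2, e.g. 512

  scratch[lid] = (gid < n) ? in[gid] : 0.0f;        // pad the tail with zeros
  barrier(CLK_LOCAL_MEM_FENCE);

  // repeatedly collapse the second half of the buffer into the first half
  for (unsigned int s = lsz / 2; s > 0; s >>= 1) {
    if (lid < s) scratch[lid] += scratch[lid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (lid == 0) partial[get_group_id(0)] = scratch[0];
}
\end{lstlisting}
\doublespacing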
\subsection{Distributed MPI algorithm}

For especially large problems, such as emerging GWAS sequencing data, it may be necessary to pool computing and memory resources from multiple GPU devices. OpenCL exposes an API that allows host code to access multiple devices on a single host; implementing MPI (Message Passing Interface) routines achieves the same purpose while scaling seamlessly across multiple hosts. MPI is an open standard that enables multiple processes, locally or across hosts, to communicate with one another. Our MPI implementation is conceptually straightforward. The MPI master node is the entry point of the algorithm. After loading configuration and study data into memory, it partitions the design matrix of covariate values into submatrices, which are then transferred to all available slave nodes for initialization. Each slave node carries out parallel GCD on its own subproblem, followed by a max reduction step at the MPI master. The following pseudocode describes our algorithm more explicitly:\\
\singlespacing
\noindent \textbf{while} not converged \textbf{do}\\
\indent Broadcast $j^*$: index of the best variable from the previous iteration\\
\indent \textbf{for} host = 1 $\to$ slavehosts, in parallel \textbf{do}\\
\indent \indent $ x\beta = x\beta + x_{\cdot j^*} \Delta \beta_{j^*}$ \\
\indent \indent \textbf{for all} $j \in varblock[host]$, in parallel \textbf{do}\\
\indent \indent \indent \textbf{for all} $k \in subjectblocks$, in parallel \textbf{do}\\
\indent \indent \indent \indent \textbf{for all} $i \in subjectblocks[k]$, in parallel \textbf{do} \\
\indent \indent \indent \indent \indent compute subject $i$'s contribution to the sub-gradient and sub-Hessian at variable $j$ using Eq \ref{eqn_gradient} and Eq \ref{eqn_hessian} \\
\indent \indent \indent \indent \textbf{end for} \\
\indent \indent \indent \textbf{end for} \\
\indent \indent \indent Sum the subject-specific gradient and Hessian contributions via log$_2$ reduction \\
\indent \indent \textbf{end for} \\
\indent \indent \textbf{for all} $j \in varblock[host]$, in parallel \textbf{do}\\
\indent \indent \indent \textbf{for all} $k \in subjectblocks$, in parallel \textbf{do}\\
\indent \indent \indent \indent \textbf{for all} $i \in subjectblocks[k]$, in parallel \textbf{do} \\
\indent \indent \indent \indent \indent compute subject $i$'s contribution to the likelihood at variable $j$ using Eq \ref{eqn_logistic_loglike}\\
\indent \indent \indent \indent \textbf{end for} \\
\indent \indent \indent \textbf{end for} \\
\indent \indent \indent Sum the subject-specific likelihood contributions via log$_2$ reduction\\
\indent \indent \textbf{end for} \\
\indent \indent Apply a max operator to select the variable with the highest likelihood increase; store it as the best variable $j^*$ \\
\indent \textbf{end for}\\
\indent Update $\beta_{j^*}$ on the master host\\
\textbf{end while}\\
\\
\doublespacing
\begin{figure}[!tpb]
\centerline{\includegraphics[width=8cm,height=5cm]{mpi.eps}}
\caption{Using MPI to coordinate the LASSO optimization}\label{fig_mpi}
\end{figure}
Our MPI framework enables data and computations to be load balanced across multiple cores on a single desktop or across processors on a large cluster, such as the heterogeneous cluster depicted in Figure \ref{fig_mpi}. In this hypothetical example, the last 20 nodes, each assigned to fit 100,000 variables, are OpenCL enabled and make use of massively parallel GPU resources; each of the first 1000 nodes fits far fewer variables, but could be moderately accelerated using parallel constructs optimized for multi-core CPUs such as OpenMP \citep{openmp}.
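The skeleton below illustrates the master/slave communication pattern of the pseudocode using standard MPI calls (\textbf{MPI\_Bcast} and an \textbf{MPI\_MAXLOC} reduction). It omits the GPU kernels and data partitioning, and its structure and names are illustrative rather than taken from the released code.
\singlespacing
\begin{lstlisting}
// Sketch of the MPI coordination loop for distributed GCD.
#include <mpi.h>

struct BestVar { double delta_logL; int index; };  // payload for the MPI_MAXLOC reduction

void distributed_gcd_loop(int rank, int first_var_global) {
  int jstar = -1;            // best variable from the previous iteration
  double delta_beta = 0.0;   // its accepted coefficient change
  int converged = 0;
  while (!converged) {
    // master broadcasts the accepted update so every node can refresh x*beta
    MPI_Bcast(&jstar, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&delta_beta, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    // ... each slave updates its fitted values, then evaluates the gradient,
    //     Hessian and likelihood change for its own block of variables ...
    BestVar local = { 0.0, first_var_global };  // best proposal on this node (computation omitted)
    BestVar global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
    if (rank == 0) {
      jstar = global.index;   // master accepts the winning variable and updates beta
      // converged = (global.delta_logL < tolerance);
    }
    MPI_Bcast(&converged, 1, MPI_INT, 0, MPI_COMM_WORLD);
  }
}
\end{lstlisting}
\doublespacing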
\subsection{Rare variant simulation}

\begin{figure}[!tpb]
\centerline{\includegraphics[width=7cm,height=10cm,angle=270]{maf.eps}}
\caption{Distribution of MAFs across 100 causal SNPs}\label{fig_maf}
\end{figure}
The Genetic Analysis Workshop 17 \citep{ziegler11} provided participants with genotypes across 24,487 SNPs on 697 individuals, taken from Pilot 3 of the 1000 Genomes Project. The genotypes were called using the Unified Genotyper method of the Genome Analysis Toolkit (GATK) package \citep{depristo11}. For our simulations, we chose 10 genes at random, and within each of them 10 SNPs at random, for a total of 100 causal SNPs, each assigned a relative risk of 2.0. Figure \ref{fig_maf} shows the distribution of minor allele frequencies among these causal SNPs. The modest relative risk ensured that rare causal SNPs would be difficult to detect unless prior biological knowledge (i.e.\ gene annotations) could facilitate their inclusion into the model through their group L2 norm. The receiver operating characteristic (ROC) curve was plotted by estimating the true and false discovery rates for both a mixed penalty (i.e.\ $\lambda_L=\lambda_E$) and a pure L1 penalty, taken across 100 simulation replicates. We did not extensively explore a full range of L1 to L2 ratios in this simulation. Rather, we chose parameter values based on the ROCs from simulations in \citet{zhou10}, which suggest that a ratio of $\lambda_L$ to the full penalty (i.e.\ $\lambda_L+\lambda_E$) of 0.5 yields superior power compared to pure L1- or L2-based penalties at low FDR levels.

\subsection{Stability Selection for GWAS}

Stability selection was developed by Meinshausen and B\"{u}hlmann to address the issue of model overfitting in the setting of L1 penalized regression \citep{meinshausen09}. A model is fit to each random subsample (generally half of the subjects) of the entire dataset, over a set $\Lambda$ of penalty values $\lambda_L$. Variables are declared ``stable'' if the proportion of subsamples selecting the variable is at least as large as a threshold $\pi_{thr}$; formally,
\begin{eqnarray}
\hat{S}^{stable} = \{k:\displaystyle\max_{\lambda_L \in \Lambda} \hat{\Pi}_k^{\lambda_L} \geq \pi_{thr} \}.
\label{eqn_stability}
\end{eqnarray}
One important feature of stability selection is the ability to provide error control. The expected number $V$ of falsely selected variables satisfies
\begin{eqnarray}
E(V) \leq \frac{1}{2 \pi_{thr} - 1} \frac{q^2_\Lambda}{p},
\end{eqnarray}
where $q_\Lambda$ is the average number of selected variables across all replicates, taken over the set $\Lambda$, and $p$ is the total number of candidate variables. Determining the set of stable variables according to Eq \ref{eqn_stability} can pose a heavy computational burden since, within each replicate, a model must be fit at every value of $\lambda_L$ in $\Lambda$; this leads to impractical run times even with parallel implementations. The authors point out, however, that for methods such as the LASSO, where the probability of inclusion increases monotonically as $\lambda_L$ decreases, it is sufficient to choose a small enough value $\lambda_L'$ such that the set of variables $\hat{S}^{\lambda_L}$ selected under the intended value of $\lambda_L$ is contained within the larger set $\hat{S}^{\lambda_L'}$ with high probability. In other words, by allowing moderate overfitting, we can apply the method as defined in Eq \ref{eqn_stability} with the set $\Lambda$ containing only the single value $\lambda_L'$.
\begin{figure}[!tpb]
\centerline{\includegraphics[width=8cm,height=5cm]{fdr.eps}}
\caption{False discovery rate as a function of stability selection threshold}\label{fig_fdr}
\end{figure}
For the African American GWAS data, each replicate's value of $\lambda_L'$ was set to 100, which resulted in an average of 273 selected variables (i.e.\ $q_\Lambda=273$). Note that because we did not include L2 penalties, $\lambda_E$ was set to zero. Based on $\lambda_L'=100$, one can get a sense of the error control achieved as a function of the threshold $\pi_{thr}$, as shown in Figure \ref{fig_fdr}; the short program below evaluates the same bound.
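As an illustration of the bound above, the program below tabulates the upper limit on $E(V)$ as a function of $\pi_{thr}$ for $q_\Lambda=273$; the total number of variables $p$ is a placeholder to be replaced with the size of the analyzed SNP panel.
\singlespacing
\begin{lstlisting}
// Quick calculator for the stability selection error bound
// E[V] <= q^2 / ((2*pi_thr - 1) * p), valid for pi_thr in (0.5, 1].
#include <cstdio>

double expected_false_selections(double q, double p, double pi_thr) {
  return (q * q) / ((2.0 * pi_thr - 1.0) * p);
}

int main() {
  const double q = 273.0;   // average model size reported for lambda_L' = 100
  const double p = 1e6;     // total number of variables (placeholder value)
  for (int k = 0; k <= 7; ++k) {
    double thr = 0.60 + 0.05 * k;
    std::printf("pi_thr = %.2f  E[V] <= %.3f\n", thr, expected_false_selections(q, p, thr));
  }
  return 0;
}
\end{lstlisting}
\doublespacing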
\subsection{Framework Use}

We now walk the reader through source code for a program that makes use of our C++ framework. The \textbf{Stability} class, which implements Stability Selection, is a sub-class of the \textbf{MpiLasso2} class, which itself handles details such as loading XML configuration files, parsing data files in standard formats (e.g.\ PLINK files \citep{purcell07}), communicating among MPI nodes, and executing OpenCL kernels on GPUs. The code for \textbf{Power.cpp}, which implements the rare variant analysis, is available at the Google Code site. The header file below defines a container, \textbf{stability\_settings\_t}, that extends the XML settings container defined in \textbf{MpiLasso2.hpp}. Configuration settings relevant to the analysis are declared here, each mapping to a corresponding XML tag in the configuration file.

\subsubsection*{Stability.hpp}
\singlespacing
\begin{lstlisting}
class stability_settings_t: public lasso2_settings_t{
public:
  stability_settings_t(const ptree & pt, int mpi_rank);
  int replicates;
  string mask_basepath;
};

class Stability: public MpiLasso2{
public:
  Stability();
  ~Stability();
  void init(const ptree & pt);
  void run();
private:
  int * varcounts;
  int totaltasks;
  vector<string> varnames;
  int * mask;
};
\end{lstlisting}
\doublespacing
The C++ class listed below initializes the analysis by loading configuration settings, PLINK-formatted data files, and other ancillary files. The initialization routine also performs the important tasks of setting up all MPI data structures and GPU memory buffers and of transferring the partitioned datasets to the computing nodes. The run routine then loops over subsamples, reading at each iteration a mask file that tells the optimizer which subjects to include, and concludes by writing each variable's selection probability for the given value of $\lambda_L$ to the output file.

\subsubsection*{Stability.cpp}
\singlespacing
\begin{lstlisting}
#include<iostream>
#include<fstream>
#include<sstream>
#include<string>
#include<vector>
#include<cstdlib>
#ifdef USE_GPU
#include<CL/cl.h>   // OpenCL headers used when GPU support is enabled
#include"clsafe.h"
#endif
#include"main.hpp"
#include"analyzer.hpp"
#include"io.hpp"
#include"dimension2.h"
#include"utility.hpp"
#include"lasso_mpi2.hpp"
#include"stability.hpp"

using namespace std;
typedef unsigned int uint;

// extends the XML configuration file container for LASSO settings
// specific to Stability Selection
stability_settings_t::stability_settings_t(const ptree & pt, int mpi_rank):
  lasso2_settings_t(pt, mpi_rank){
  replicates = pt.get<int>("subsamples");
  mask_basepath = pt.get<string>("inputdata.mask_basepath");
}

Stability::Stability():MpiLasso2(){}

Stability::~Stability(){
  delete settings;
}

void Stability::init(const ptree & pt){
  // load configuration settings
  settings = new stability_settings_t(pt, this->get_rank());
  ofs<<"Platform ID: "<<settings->platform_id<<endl;
  ofs<<"Device ID: "<<settings->device_id<<endl;
  ofs<<"Kernel path: "<<settings->kernel_path<<endl;
  // load the PLINK-formatted study data and the variable (task) list
  read_data(settings->snpfile.data(), settings->pedfile.data(),
            settings->genofile.data(), settings->covariatedatafile.data(),
            settings->covariateselectionfile.data(), NULL);
  read_tasks(settings->tasklist.data(), settings->annotationfile.data());
  varnames = get_tasknames();
  totaltasks = varnames.size();
  varcounts = new int[totaltasks];
  ofs<<"There are "<<totaltasks<<" total variables."<<endl;
}

void Stability::run(){
  stability_settings_t * settings =
    static_cast<stability_settings_t *>(this->settings);
  for(int i=0;i<totaltasks;++i) varcounts[i] = 0;
  for(int replicate=0;replicate<settings->replicates;++replicate){
    // the mask file for this replicate defines the random subsample of subjects
    ostringstream oss;
    oss<<settings->mask_basepath<<"."<<replicate;
    // (the mask is read here and passed to the optimizer)
    double logL;
    vector<int> modelvariables;
    fitLassoGreedy(replicate, logL, modelvariables);
    if (is_master){
      ofs<<"REPLICATE:\t"<<replicate<<"\tlogL:\t"<<logL
         <<"\tlambda:\t"<<settings->lambda<<endl;
      // tally how often each variable enters the model across subsamples
      for(uint i=0;i<modelvariables.size();++i) ++varcounts[modelvariables[i]];
    }
  }
  if (is_master){
    // selection probability = inclusion count / number of subsamples
    for(int i=0;i<totaltasks;++i){
      if (varcounts[i]>0){
        ofs<<varnames[i]<<"\t"<<1.*varcounts[i]/settings->replicates<<endl;
      }
    }
  }
}
\end{lstlisting}
\doublespacing
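Finally, the hypothetical driver below shows how the \textbf{Stability} class might be invoked; the project's actual \textbf{main.cpp} may differ, for example in how MPI is initialized and how the XML configuration file is located.
\singlespacing
\begin{lstlisting}
// Hypothetical driver for a Stability Selection run (names illustrative).
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
#include <mpi.h>
#include "stability.hpp"

int main(int argc, char * argv[]) {
  MPI_Init(&argc, &argv);
  boost::property_tree::ptree pt;
  boost::property_tree::read_xml(argv[1], pt);  // e.g. an XML configuration file
  Stability analysis;
  analysis.init(pt);   // loads data, partitions it, and initializes the GPUs
  analysis.run();      // subsamples, fits, and writes selection probabilities
  MPI_Finalize();
  return 0;
}
\end{lstlisting}
\doublespacing

\end{document}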