\documentclass{article} \usepackage{fullpage, natbib, amsmath, amssymb,graphicx} \usepackage[FIGTOPCAP]{subfigure} \usepackage[american]{babel} \usepackage{amsmath} \begin{document} \title{Supplemental Material for iBMQ: a R/Bioconductor package for Integrated Bayesian Modeling of eQTL data} \author{Greg C Imholte, Marie Pier Scott-Boyer, Aur\'elie Labbe, \\Christian F Deschepper, and Raphael Gottardo} \maketitle \section{iBMQ Implementation} iBMQ is a hierarchical Bayesian model for the detection of cis and trans-acting expression quantitative trait loci (eQTL). The iBMQ model simultaneously assesses interactions between thousands of genes and SNPs via the linear model \begin{equation} y_{ig} = \mu_g + \sum_{j = 1}^{S} x_{ij}\gamma_{jg}\beta_{jg} + \epsilon_{ig}, \end{equation} where $i$ indexes subjects, $g = 1, \ldots, G$ indexes genes, and $j = 1, \ldots, S$ indexes SNPs. Thus, gene expression $y_{ig}$ is modelled as a grand mean plus the additive effects of all SNPs $x_{ij}$. Not all SNPs affect the expression of every gene, so the $\gamma_{jg}$ parameters are 0/1 indicators for the presence of an interaction between SNP $j$ and gene $g$. One unique aspect of iBMQ is that each $\gamma_{jg}$ is controlled by its own inclusion probability parameter $\omega_{jg}$. Hierarchical parameters help induce appropriate shrinkage while sharing information across genes and SNPs; further details about the model may be found in \citet{scottboyer}. Inference is based on the marginal posterior probabilities of association (PPA), which are computed separately for each ``gene $\times$ SNP'' combination as the posterior mean of $\gamma_{jg}$ given the data, where $\gamma_{jg}$ is an indicator variable equal to one if gene $g$ is associated with SNP $j$ as defined in the model above and in \citet{scottboyer}. Our model is more flexible than other eQTL mapping methods, but this flexibility comes at a cost. 
For several thousand genes and several thousand SNPs, the number of parameters in the model can easily reach the tens of millions. The model's posterior distribution cannot be computed directly, so we employ Markov chain Monte Carlo (MCMC) to sample from the posterior. Our MCMC algorithm is written in C for improved speed, and interfaces with R for convenient data management. Efficient programming allows us to overcome the otherwise tremendous computational burden that comes with updating so many parameters. Below, we outline the techniques used to improve the stability, quality, and speed of our MCMC algorithm.

\subsection{OpenMP\textregistered\, parallelization}
The MCMC algorithm proceeds via Gibbs sampling of full conditional distributions. The iBMQ model admits conditional independence among many parameters that index across genes or across SNPs, hence a great number of Gibbs updates can be performed concurrently within each iteration. To take advantage of this model structure, the iBMQ package employs the OpenMP\textregistered\, API to parallelize parameter updates. OpenMP\textregistered\, is a shared-memory parallel computing platform, and our algorithm scales well with the number of available processors. In a parallel computing environment, care must be taken with pseudorandom number generation. We use Pierre L'Ecuyer's \textit{RngStream} package to ensure that the streams of random numbers generated in different threads are independent \citep{lecuyer}.

\subsection{Parameter updates}
For most model parameters, conditionally conjugate distributions within our model allow for Gibbs updates using well-known probability distributions. Full conditional distributions for most model parameters can be found in \citet{scottboyer}. The full conditional distributions for parameters $a_j$ and $b_j$ (related to the distribution of parameters $\omega_{jg}$) are not from a well-known family of probability distributions.
A more general Metropolis-Hastings update \citep{hastings} is often a suitable substitute in such cases, but the thousands of $a_j$ and $b_j$ parameters assume a great variety of posterior distributions. Because of this variety, we were unable to find a Metropolis-Hastings proposal mechanism that allowed our Markov chain to mix efficiently across all $a_j$ and $b_j$. We instead implement the adaptive rejection sampling (ARS) algorithm presented in \citet{gilks}. ARS applies to log-concave densities: it constructs an upper hull over the log of the target (possibly un-normalized) density, and samples are drawn from the resulting envelope. Rejected samples are incorporated into the hull, refining it and improving the acceptance rate of subsequent samples. Parameters $a_j$ and $b_j$ have log-concave full conditional densities and are updated via a Gibbs sample generated by ARS.

Parameters $\omega_{jg} \in [0,1)$ are converted to the \textit{logit} scale for numerical stability (a separate indicator variable notes whether $\omega_{jg}$ equals 0 exactly). Care is needed when sampling the full conditional distribution of $\omega_{jg}$, which is $\omega_{jg}|\ldots =_d \text{Beta}(a_j + \gamma_{jg}, b_j + 1 - \gamma_{jg})$. Because $a_j$ and $b_j$ can become extremely small, random number generators can occasionally generate values of exactly zero or one, even though these values have probability zero in theory. Values of exactly zero or one create problems in further computations. Sampling $\omega_{jg}$ on the \textit{logit} scale eliminates this issue. For independent $X \sim \text{Gamma}(\alpha, 1)$ and $Y \sim \text{Gamma}(\beta, 1)$, the ratio $X/(X+Y)$ follows a $\text{Beta}(\alpha, \beta)$ distribution and its \textit{logit} equals $\log X - \log Y$. The \textit{logit} transformation of $\omega_{jg}$ therefore has a full conditional distribution that is a difference of the logs of independent gamma random variables:
\begin{equation}
\text{logit}(\omega_{jg})|\ldots =_d \log \left[\text{Gamma}(a_j + \gamma_{jg}, 1) \right] - \log \left[\text{Gamma}(b_j + 1 - \gamma_{jg}, 1) \right].
\end{equation}

\subsection{Sparse matrix $\boldsymbol{\beta}$ representation}
Our model allows coefficients $\beta_{jg}$ to be exactly zero. Out of thousands of possible SNPs, relatively few are expected to interact with a given gene $g$. In practice, the matrix $\boldsymbol{\beta}$ of model coefficients $\beta_{jg}$ is sparsely populated at any given MCMC iteration. The matrix $\boldsymbol{\beta}$ also changes frequently between iterations. Because matrix elements are frequently added and deleted (i.e., set to zero), we use a linked-list sparse representation for $\boldsymbol{\beta}$. To prevent heap fragmentation caused by frequent allocation and deallocation of memory, and to improve speed, the elements of the linked list are `drawn' from thread-safe memory pools.

\section{R code}
To execute the code below, we first prepared the ``snp'' object, which is a \textit{SnpSet} object containing the 977 informative SNPs for the 24 RIS AXB-BXA population. Genotypes are coded as 0 and 1. The ``gene'' object is an \textit{ExpressionSet} object containing the normalized and pre-processed gene expression of 8725 genes for the 24 RIS AXB-BXA population. The ``snppos'' and ``genepos'' objects are data frames containing the SNP and gene positions, respectively.
All objects are available at https://github.com/raphg/iBMQ in the data\_application\_note folder.\\
\begin{verbatim}
library(iBMQ)
library(ggplot2)  # ggplot(), theme(), facet_grid()
library(grid)     # unit()

load("data_application_note.R")

# Run the MCMC sampler and compute the PPA matrix
PPA <- eqtlMcmc(snp, gene, n.iter=1000000, burn.in=50000, n.sweep=20,
                mc.cores=6, RIS=TRUE, write.output=FALSE)

# PPA cutoff, then the eQTL whose PPA exceeds it
cutoff <- calculateThreshold(PPA, 0.1)
eqtl <- eqtlFinder(PPA, cutoff)

# Classify each eQTL using SNP and gene positions (1,000,000 bp cutoff)
eqtltype <- eqtlClassifier(eqtl, snppos, genepos, 1000000)

# Identify hotspots, using 10 genes as the size threshold
hotspot <- hotspotFinder(eqtltype, 10)

# Plot gene position against marker position, one panel per chromosome pair
p <- ggplot(eqtltype, aes(y=GeneStart, x=MarkerPosition)) +
  geom_point(aes(y=GeneStart, x=MarkerPosition, color = PPA), size = 1.5) +
  facet_grid(GeneChrm~MarkerChrm) +
  theme_bw(base_size = 12, base_family = "") +
  theme(text=element_text(size=16), panel.margin = unit(0.01, "lines")) +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  scale_x_reverse()
p + scale_colour_gradientn(colours=c("grey", "black"))
\end{verbatim}
Trans-eQTL ``hotspots'' are defined as instances where many genes across the genome all have their expression levels associated with one common genetic locus. They are also sometimes referred to as ``bands'', because all genes in the hotspot align on a vertical band in the plot visualizing the results. In the context of the current package, the user sets a threshold for the size of the hotspot, i.e., specifies the minimal number $x$ of genes that must associate significantly with the common SNP (the second argument of \texttt{hotspotFinder}, set to 10 above).

\section{Additional information about the GO term enrichment analysis}
Gene Ontology (GO) term enrichment was tested using the DAVID Bioinformatics Resources \citep{Huang}. Table \ref{tabS:01} provides additional information for the trans-eQTL hotspots that were detected by both R/QTL and iBMQ and whose corresponding genes showed enrichment for the same GO term. For each GO term, we indicate its full descriptive term, its corresponding head term, and the level of the child term it corresponds to (considering that the head term is at level 1).
In each case, a child term belonging to a sub-term 3--6 levels below the head term was found, indicating that the child term was fairly specific.

\begin{table}[h!]
\caption{{\bf Additional information concerning GO terms showing enrichment in trans-eQTL hotspots.}}
\begin{tabular}{cccc}
\hline
GO term \# & GO term designation & Head term & Level of child term \\
\hline
GO:0012505 & Endomembrane system & Cellular component & 4\\
GO:0007167 & Enzyme-linked receptor protein & Biological process & 7\\
 & signalling pathway & & \\
GO:0006955 & Immune response & Biological process & 3\\
GO:0017076 & Purine nucleotide binding & Molecular function & 7\\
\hline
\end{tabular}
\label{tabS:01}
\end{table}

The trans-eQTL on chr17 (which is the hotspot where iBMQ detected the largest number of trans-eQTL genes) is of particular interest. Others, using genome-wide co-expression analysis of genes from 4 tissues from a panel of rat RIS, have previously reported a network of 305 inflammatory genes \citep{Heinig}. In the chr17 trans-eQTL that we detected in hearts from our panel of mouse RIS, 41/192 genes (i.e., 21.4\%) had official gene symbols that were identical to those found in the rat inflammatory gene network. When considering genes from the same families, the similarity between the 2 groups of genes was even higher. This showed that genes detected by iBMQ within a trans-eQTL hotspot may correspond to functionally important groups of genes. Among the genes detected by iBMQ in the trans-eQTL on mouse chr17, ``immune response'' was the GO term showing the highest level of enrichment. However, other GO terms (including some more specific ones) also showed highly significant enrichment, as listed below in Table \ref{tabS:02}. In all 3 groups of genes, similar percentages of genes corresponding to the ``immune response'', ``defense response'' and ``antigen processing'' categories were found.
However, the number of genes (as indicated in the ``count'' sub-column) was higher with iBMQ than with R/QTL, and thus yielded higher enrichment significance levels. The similarities in percentages further illustrate (beyond genes sharing identical gene symbols) the extent to which genes in the chr17 trans-eQTL were similar to those reported as belonging to the rat inflammatory gene network.

\begin{table}[h!]
\caption{{\bf Comparisons of GO term enrichment in trans-eQTLs detected by iBMQ and R/QTL on chr17 in mouse RIS to that found in the network of inflammatory genes detected by co-expression analysis in tissues from rat RIS.}}
\small{
\begin{tabular}{c|ccc|ccc|ccc}
\hline
Term & \multicolumn{3}{c|}{iBMQ (mouse)} & \multicolumn{3}{c|}{R/QTL (mouse)} & \multicolumn{3}{c}{Rat (Heinig et al.)}\\
\hline
 & count & \% & P-value & count & \% & P-value & count & \% & P-value\\
\hline
GO:0006955 & 26 & 15.3 & 2.70E-13 & 6 & 14.0 & 0.002 & 50 & 16.3 & 7.47E-29\\
immune response &&&&&&&&&\\
GO:0017076 & 40 & 23.5 & 3.37E-07 & 10 & 23.3 & 0.015 & - & - & -\\
purine nucleotide &&&&&&&&&\\
binding &&&&&&&&&\\
GO:0006952 & 15 & 8.8 & 3.67E-05 & 4 & 9.3 & 0.063 & 37 & 8.2 & 1.03E-16\\
defense response &&&&&&&&&\\
GO:0019884 & 5 & 2.9 & 9.73E-05 & - & - & - & 7 & 2.3 & 2.56E-09\\
antigen processing &&&&&&&&&\\
\hline
\end{tabular}
}
\label{tabS:02}
\end{table}

\vspace{-0.4cm}
\section{Time scales in terms of the numbers of markers and CPUs}
We compared elapsed processing time for 1000, 5000, or 10000 SNPs, using 1, 2, 4, 8, or 12 processing cores. In each case we used the same 100 individuals and 50 genes, with four replications. Figure \ref{figS:01} shows that the processing time of our algorithm grows approximately linearly with the number of SNPs and is approximately inversely proportional to the number of cores. The algorithm scales slightly better when there are more SNPs, which is not surprising because parallelization overhead is then small relative to the work performed by each thread.
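
The scaling behaviour described above can be summarized by a simple timing model. This is only an illustrative sketch under stated assumptions: the per-SNP cost $c$ and the parallelization overhead $h(n)$ are hypothetical quantities introduced here for exposition, $h$ is assumed not to depend on the number of SNPs $S$, and $h(1) = 0$. Neither quantity was estimated from the timing data. Writing $T(S, n)$ for the elapsed time with $S$ SNPs and $n$ cores,
\begin{equation*}
T(S, n) \approx \frac{cS}{n} + h(n),
\qquad
\frac{T(S, 1)}{T(S, n)} \approx \frac{cS}{cS/n + h(n)} = \frac{n}{1 + n\,h(n)/(cS)}.
\end{equation*}
Under these assumptions, the time is linear in $S$ for a fixed $n$, and the speedup over a single core approaches the ideal value $n$ as $S$ grows, because the overhead term $n\,h(n)/(cS)$ shrinks. This would be consistent with the slightly better scaling observed for larger numbers of SNPs.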
\begin{figure}
\centering
\subfigure[]{
\includegraphics[scale=0.24]{timing_plots_1.pdf}
\label{figS:01a}
}
\subfigure[]{
\includegraphics[scale=0.24]{timing_plots_2.pdf}
\label{figS:01b}
}
\subfigure[]{
\includegraphics[scale=0.24]{timing_plots_3.pdf}
\label{figS:01c}
}
\caption{Timing as a function of the number of SNPs and cores. a) For a fixed number of cores, we plot the relative timing of 1000 (red), 5000 (green) and 10000 (blue) SNPs. We see that for a small number of cores, a 5- or 10-fold increase in SNPs leads to a 5- or 10-fold increase in processing time; this indicates that our algorithm's complexity is linear in the number of SNPs (everything else being fixed). b and c) These two plots indicate that our algorithm scales well with the number of processing cores. For a fixed number of SNPs, we see that processing time with $n$ cores is approximately $\frac{1}{n}$ of the single-core time. This relationship holds more strongly with a larger number of SNPs, where parallelization overhead is small relative to the workload performed in each parallel thread.}
\label{figS:01}
\end{figure}
\vspace{-0.4cm}
\bibliography{iBMQ_bibliography}
\bibliographystyle{plainnat}
\end{document}