Summary
Parallel multifactor dimensionality reduction is a tool for large-scale analysis of gene-gene and gene-environment interactions. The MDR algorithm was redesigned to allow an unlimited number of study subjects, total variables, and variable states, and to remove restrictions on the order of interactions analyzed. In addition, the algorithm is markedly more efficient, with an approximately 150-fold decrease in runtime for equivalent analyses. To facilitate the processing of large datasets, the algorithm was parallelized.
1 INTRODUCTION
The development of high-throughput genotyping technologies has dramatically increased the number of polymorphisms considered in a typical genetic association study. Detecting and characterizing interactions between these polymorphisms may be important in discovering genetic contributions to common, complex diseases. Multifactor dimensionality reduction (MDR) has previously been introduced as a non-parametric, model-free method for detecting gene-gene and gene-environment interactions (Ritchie et al. 2001; Hahn et al. 2003). MDR was designed for small candidate-gene studies of fewer than 500 variables and is quickly overwhelmed by the large-scale datasets, containing thousands of variables, generated by current technologies. In response to these trends, we have redesigned the MDR software and devised a new algorithm that scales to extremely large datasets.
MDR is implemented within a cross-validation framework (typically 10-fold, although any number of intervals, including 1-fold, can be used). In step 1 of the process, the data are divided into a training set (9/10 of the data) and a testing set (1/10 of the data). Next, a set of n genetic and/or environmental factors is selected from the set of all possible factors, N. Each individual in the training set is grouped according to its state at each of the n factors (e.g. for 2 loci with 3 possible genotypes each, there are 9 possible combinations; for 3 loci, 27; and so on). Then, each genotype combination is classified as high or low risk depending on the ratio of cases to controls with that genotype combination. If the ratio is <1 (there are fewer cases than controls), the genotype combination is classified as low risk; if the ratio is ≥1, it is classified as high risk. This collection of high- and low-risk groups forms the MDR model for that combination of factors, and the fitness of that model is determined by its classification error (the number of training individuals incorrectly labeled by the model). To evaluate the predictive ability of a model, its prediction error is calculated as the proportion of individuals in the testing set that the model labels incorrectly. Every combination of variables (from 1 up to k variables, as specified by the user) is evaluated in this fashion, and the entire procedure is performed 10 times, using each 1/10 of the data once as the testing set and the remaining 9/10 as the training set. Although several modifications to MDR have been made in this implementation, this procedure forms the basis of the MDR algorithm and has not changed.
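To make the risk-labelling step concrete, the following minimal C++ sketch groups training individuals by multilocus genotype and scores the resulting model by the fraction of training individuals it misclassifies. The names used here (Individual, classification_error) are illustrative only and are not taken from the pMDR source.

```cpp
// Minimal sketch of the MDR risk-labelling step for one combination of loci.
// All names are illustrative; this is not the actual pMDR implementation.
#include <cstddef>
#include <map>
#include <vector>

struct Individual {
    std::vector<int> genotypes; // state at each locus (e.g. 0, 1 or 2)
    bool affected;              // case (true) or control (false)
};

// Label each multilocus genotype cell high risk (case/control ratio >= 1) or
// low risk, then report the fraction of training individuals the labels misclassify.
double classification_error(const std::vector<Individual>& training,
                            const std::vector<std::size_t>& loci)
{
    struct Counts { int cases = 0; int controls = 0; };
    std::map<std::vector<int>, Counts> cells;

    // Tally cases and controls within every multilocus genotype cell.
    for (const Individual& ind : training) {
        std::vector<int> key;
        for (std::size_t locus : loci) key.push_back(ind.genotypes[locus]);
        if (ind.affected) ++cells[key].cases; else ++cells[key].controls;
    }

    // A cell is high risk when cases >= controls; an individual whose status
    // disagrees with its cell's label counts as a misclassification.
    int errors = 0;
    for (const Individual& ind : training) {
        std::vector<int> key;
        for (std::size_t locus : loci) key.push_back(ind.genotypes[locus]);
        const Counts& c = cells[key];
        const bool high_risk = (c.cases >= c.controls);
        if (high_risk != ind.affected) ++errors;
    }
    return static_cast<double>(errors) / static_cast<double>(training.size());
}
```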
2 METHODS
The original MDR software (Hahn et al. 2003) relied on the successive use of the Standard Template Library quicksort function to subset individuals into multilocus genotype groups. This required multiple sorts to retrieve all genotype combinations for a single interaction model. To increase efficiency, we implemented a recursive binning function which combines the properties of a merge sort and a bin sort. This function is tree-based, placing each individual in a bin according to genotype at locus 1 of the model, then recursively sorting each bin according to genotype at locus 2, and so on (Fig. 1). All sorting operations are applied to pointers to the genotype data stored in a static array, reducing memory writes. This single sorting operation simultaneously groups all genotype combinations of the desired interaction model. In addition to the performance increase, the redesigned sorting mechanism removes hard-coded limits on the number of study subjects, total variables, and variable states, as well as restrictions on the order of interactions being analyzed. The number of genotype states (AA, Aa, aa) corresponds to the number of bins/branches of the tree, limited only by machine memory. The depth of the tree and the number of recursions of the binning function correspond to the number of loci in the interaction model: an interaction of two loci with three states each produces 9 multilocus genotype bins, a three-way interaction produces 27, and so on.
Fig. 1. Recursive Binning Method for MDR Model Generation.
In this example, an MDR model is generated from the interaction of polymorphisms 3 and 6. Individuals are binned according to their status (0, 1 or 2) at polymorphism 3. The individuals in each of these bins are then grouped according to status at polymorphism 6. The ratio of affected to unaffected individuals is computed for each of the 9 resulting state combinations. Each combination is classified as high risk (if the ratio is ≥1) or low risk (if the ratio is <1). A classification error is generated based on the number of individuals misclassified by the model. This process is repeated for each variable combination and is performed within a cross-validation framework.
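The recursive binning described above can be sketched as follows; the function and type names are illustrative rather than taken from the pMDR source, and genotype states are assumed to be coded 0..(num_states-1).

```cpp
// Sketch of the tree-based recursive binning (Fig. 1). Pointers to the genotype
// rows, rather than the rows themselves, are moved between bins.
#include <cstddef>
#include <vector>

typedef std::vector<const std::vector<int>*> Bin; // pointers into the static genotype array

// Split 'bin' by the state at loci[depth]; at the bottom of the recursion,
// 'leaves' receives one bin per multilocus genotype combination.
void recursive_bin(const Bin& bin,
                   const std::vector<std::size_t>& loci,
                   std::size_t depth,
                   int num_states,
                   std::vector<Bin>& leaves)
{
    if (depth == loci.size()) {   // leaf reached: one multilocus genotype combination
        leaves.push_back(bin);
        return;
    }
    std::vector<Bin> branches(num_states);
    for (const std::vector<int>* ind : bin)            // place each pointer in the branch
        branches[(*ind)[loci[depth]]].push_back(ind);  // matching its state at this locus
    for (const Bin& branch : branches)
        recursive_bin(branch, loci, depth + 1, num_states, leaves);
}
```

With three states per locus, a two-locus model yields 9 leaf bins (some possibly empty) and a three-locus model yields 27, matching the tree in Fig. 1.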
The framework of this MDR implementation is divided into a data-handling class, a model-generation and processing class, and a result-storage class. The modular class design allows for parallelization, and future modifications of individual functions can be easily integrated into the class framework. The data-handling class contains functions for loading the dataset into memory and splitting the data for cross-validation. Model generation consists of an optimized combination generator that produces a set of integer vectors containing the loci to be evaluated. This generator, based on an algorithm by Knuth (Knuth, 2003), was parameterized to allow model generation and evaluation to be partitioned across multiple nodes of a cluster environment. Each node generates and processes models using the recursive binning function until all specified combinations have been evaluated. Nodes then transmit their results to the master node, where they are aggregated into the final set of results. Parallel MDR is an efficient implementation in that each node of the cluster processes a subset of the problem, there is little inter-process communication, and the results of each node are collected upon completion.
Parallel MDR was written in C++, uses the Message Passing Interface (MPI) for interprocess communication, and was compiled with the GNU Compiler Collection (gcc).
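The division of labour and the final aggregation step might be sketched as follows; the round-robin split over combination indices and the evaluate_model() placeholder are simplifying assumptions standing in for the parameterized Knuth generator and the recursive-binning evaluation.

```cpp
// Hedged sketch of the master/worker pattern: each node evaluates a share of
// the models, and a single collective call returns the best result to rank 0.
#include <mpi.h>
#include <cstdio>

// Placeholder for the recursive-binning evaluation of one combination;
// the returned value is illustrative only.
double evaluate_model(long combo_index)
{
    return 0.50 - 1e-10 * static_cast<double>(combo_index % 1000);
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long total_combinations = 1000000; // e.g. all pairs of loci under study
    struct { double error; int combo; } local = { 1.0, -1 }, best = { 1.0, -1 };

    // Each node processes every size-th combination, starting at its own rank.
    for (long c = rank; c < total_combinations; c += size) {
        const double err = evaluate_model(c);
        if (err < local.error) { local.error = err; local.combo = static_cast<int>(c); }
    }

    // Little inter-process communication: one reduction delivers the globally
    // best model (lowest classification error) to the master node.
    MPI_Reduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("best model %d, classification error %.4f\n", best.combo, best.error);

    MPI_Finalize();
    return 0;
}
```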
3 DISCUSSION
To accommodate large-scale case-control association studies, several improvements to the original MDR software have been made. The efficiency of the algorithm has been markedly improved, decreasing processing time approximately 150-fold. The new algorithm design places no limits on the number of genotype variable states, study subjects, or total variables. The increased scalability and performance are essential for the analysis of large-scale datasets. Parallel MDR can accept any discretely valued data type, including allele, genotype, or haplotype encodings (note that it cannot account for haplotype uncertainty estimates). In addition, parallel MDR can accept datasets consisting of hundreds of thousands of variables. In smaller datasets, the performance increase allows the exhaustive evaluation of previously infeasible high-order interaction models. Statistical significance is assigned by permutation testing, in which the MDR process is repeated on randomized data to generate an empirical distribution. This process is computationally intensive, and the performance improvements expedite it as well.
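The label-shuffling idea behind permutation testing can be sketched as below; run_mdr_search() is a placeholder for a complete re-analysis of the permuted data, and none of the names come from the pMDR source.

```cpp
// Illustrative permutation test: shuffle case/control labels to break any true
// genotype-phenotype association, re-run the search, and compare the observed
// best error against the resulting empirical null distribution.
#include <algorithm>
#include <random>
#include <vector>

// Placeholder for a full MDR search over the permuted labels; the returned
// value is illustrative only.
double run_mdr_search(const std::vector<bool>& labels)
{
    return 0.40 + 0.0001 * static_cast<double>(labels.size() % 100);
}

double permutation_p_value(std::vector<bool> labels, double observed_error, int permutations)
{
    std::mt19937 rng(2006);
    int as_good_or_better = 0;
    for (int i = 0; i < permutations; ++i) {
        std::shuffle(labels.begin(), labels.end(), rng); // randomize the phenotype labels
        if (run_mdr_search(labels) <= observed_error)
            ++as_good_or_better;
    }
    // Empirical p-value with the conventional +1 correction.
    return static_cast<double>(as_good_or_better + 1) / static_cast<double>(permutations + 1);
}
```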
Empirical permutation testing will not be feasible for extremely large datasets, but in those cases parallel MDR still serves as a useful exploratory tool. In this vein, a user can now specify the number of models to be reported, ranked by classification error. These models can be thought of as data-driven hypotheses that may be analyzed using other tools, such as logistic regression, or pursued in subsequent studies.
To evaluate computational feasibility, we ran parallel MDR on 24 Pentium 4 processors of a Beowulf cluster (ACCRE, 2005). Evaluation of the 1.25 billion pairwise interactions among 50,000 simulated single nucleotide polymorphisms (SNPs) completed in roughly 40 hours. Because each node of the cluster processes a subset of the overall problem and inter-process communication is minimized, the performance improvement due to parallelization approaches a linear decrease in runtime as processors are added. As such, we estimate that the 5 billion pairwise interactions among 100,000 SNPs could be analyzed in roughly 160 hours on 24 processors, or in roughly 40 hours on 96 processors.
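For reference, the model counts and runtime projections quoted above follow from the pairwise combination formula together with the assumed linear scaling in both model count and processor count:

```latex
% Pairwise model counts and the linear-scaling runtime estimates used above.
\[
\binom{50{,}000}{2} \approx 1.25 \times 10^{9}, \qquad
\binom{100{,}000}{2} \approx 5.0 \times 10^{9}
\]
\[
T_{100\mathrm{k},\,24\ \mathrm{proc}} \approx 40\ \mathrm{h} \times \frac{5.0 \times 10^{9}}{1.25 \times 10^{9}} = 160\ \mathrm{h},
\qquad
T_{100\mathrm{k},\,96\ \mathrm{proc}} \approx 160\ \mathrm{h} \times \frac{24}{96} = 40\ \mathrm{h}
\]
```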
Large-scale studies of 50,000 to 100,000 SNPs may call for changes to the basic MDR methodology. In particular, cross-validation is especially costly, requiring multiple levels of processing, and reducing the number of cross-validation intervals from 10 to 5 or fewer may prove an effective strategy. Also, retaining multiple models, rather than a single best model, as the result of an analysis may increase the statistical power of the method but raises the possibility of a higher false-positive rate. These analysis considerations are currently being evaluated.
Parallel MDR has the capacity to analyze high-order interactions in small datasets and can feasibly handle the lower end of genome-wide-scale analyses (100,000 to 200,000 variables).
Acknowledgments
This work was conducted in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University. We thank Nate Barney, Lance Hahn, Jason Moore and Bill White for insightful discussions. We would also like to thank Bill White for bringing the Knuth algorithm to our attention. Finally, this work was supported by National Institutes of Health grants AG20135 and HL65962.
Footnotes
Availability: Parallel MDR is freely available to non-commercial research institutions. For full details, see http://chgr.mc.vanderbilt.edu/ritchielab/pMDR. An open-source version of the MDR software is available at http://www.epistasis.org.
REFERENCES
- ACCRE (2005) Advanced Computing Center for Research and Education at Vanderbilt University, Nashville, TN. http://www.accre.vanderbilt.edu
- Hahn LW, Ritchie MD, Moore JH (2003) Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 19(3):376–382. doi:10.1093/bioinformatics/btf869
- Knuth DE (2003) The Art of Computer Programming, Vol. 4 (in preparation). Available: http://www-cs-faculty.stanford.edu/~knuth/taocp.html
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH (2001) Multifactor dimensionality reduction reveals high-order interaction among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69(1):138–147. doi:10.1086/321276

