Abstract
Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.
Introduction
Researchers increasingly rely on crowdsourcing to address particular problems through the collective efforts of large communities of individuals. A wide variety of crowdsourcing mechanisms are used in practice, such as citizen science, gamification of scientific problems, and gear to labor-intense tasks (such as large-scale data annotation problems [1] or folding protein structures [2]). Open innovation competitions, a different crowdsourcing mechanism, is less understood but has become increasingly popular in computational biology research.
This mechanism opens a competition to a large crowd of people who must solve a given problem for prizes. The main difference between open innovation competitions and other crowdsourcing mechanisms is that in a contest, extreme-value solutions (best submissions) are rewarded and conventional approaches, even if very effective, may not win. Given this incentive to be creative and diversify submissions, researchers typically deploy open innovation competitions to benchmark their solutions to a particular computational problem or generalize existing methodologies to unsolved instances of the problem.
Past examples of open innovation competitions have been very successful in solving a wide range of biology problems [3–5] but, for the most part, they were intended for participants from “inside the field.” That is to say, researchers with a direct connection to the scientific problem at hand. An exception is a contest [6] in which a computationally complex problem (local alignment of DNA sequences) was translated into generic computer-science terms, stripping the problem description of all the jargon, and then posted on a commercial crowdsourcing platform (Topcoder). This challenged the members of that community (mostly computer scientists with little or no biology background) to improve upon the state-of-the-art solution for cash rewards. The community responded with numerous submissions fourteen of which achieved significant improvements over the benchmark solution, a tool widely used by the academic community (MegaBLAST). Another example is Meet-U [7], an educational initiative that challenged teams of non-life science students to solve hard computational biology problems. Despite this very successful case, the potential of leveraging communities of “non-experts” through open innovation competitions remains unclear, although there have been promising examples in other domains [8, 9].
In this study, we focus on trying to understand how to engage a crowd of “non-life science experts” through open innovation competitions. Knowledge of the mechanisms that drive participation of an external crowd may enable a broader use of contests in biology. For example, it could enable scientists to use crowdsourcing even when the community of experts is small or nonexistent or when it lacks the necessary knowledge to solve the problem, such as in rapidly evolving or emerging fields that depend heavily on large amounts of data and computational resources but are deficient in experts in scientific computing, data science, or machine learning.
The design of open innovation competitions for a non-expert audience, however, can be challenging. Because people from another field are not directly connected to the problem at hand, it can be difficult to motivate this audience to participate in the contest. Rather than the prospect of advancing research in the field or achieving a scientific publication, researchers have to incentivize participation through other means. Mechanisms for participation include opportunities to win cash prizes, although other incentives (the chance to learn new insights, signal skills to prospective employers, or the enjoyment of the competition itself) may be effective as well. The presence of heterogeneous (monetary and non-monetary) incentives promotes a wide and varied participation in these challenges, which could make open innovation competitions more cost-effective than other traditional solutions, such as direct hiring. On this topic, Lakhani et al. [6] offers a rich discussion on the costs and benefits of open innovation competitions relative to direct hiring.
Additionally, because non-experts need a constant guidance to progress in their work, researchers must articulate the problem in clear and easily-digestible language. They also must construct metrics to evaluate solutions that provide live feedback to participants. The need to evaluate and provide automatic feedback to contestants can be viewed as a limitation to the types of problems that open innovation competitions can address. Providing automatic feedback to the contestants requires an agreed-upon metric of how good a solution is, something which is often not available in many computational biology problems.
Despite these difficulties, we show evidence that open innovation competitions of this nature can be very effective in addressing biology problems. These competitions allow members of the public to contribute meaningfully to academic fields by providing incentives and easy access to problems and data that are otherwise inaccessible to them. We describe the design and outcomes of three contests in computational biology and bioinformatics which had solutions significantly improved. These contests sought to improve algorithms for:
clustering antibody sequences (hereinafter “Antibody”);
imputing gene expression measurements (hereinafter “Gene inference”);
performing fast queries on the Connectivity Map dataset (hereinafter “Query speedup”).
The improvements achieved by these competitions focused on both optimization (reduction in speed or memory footprint), production of better models (higher accuracy in statistical inference), and highlighted the versatility of crowdsourcing as an effective tool for researchers.
Results
The competitions discussed here were hosted on the platform Topcoder.com (Wipro, Bengaluru, India), providing access to a community of approximately one million software engineers and data scientists. They were held for a duration of two to four weeks, and they were structured so that each competitor could make multiple submissions and receive continuous feedback on the performance of their solution (based on objective metrics) through a live public leaderboard. The purpose of such feedback was essential to drive competition among the participants. Incentives consisted of cash prizes, ranging from $8,500 to $20,000. Table 1 reports specific payouts and contest duration.
Table 1. Competition payouts and duration.
Contest | Duration (days) | Prize Pool ($1000) (Distribution) |
---|---|---|
Antibody Clustering | 10 | 8.5 (5/2/1/0.5) |
Gene Inference | 21 | 20 (10/5/2.5/1.5/1) |
Query Speedup | 21 | 20 (9/5/3/2/1) |
Prize pools were distributed to the top-placing solutions, rank-ordered by performance based on objective evaluation metrics. For these contests, we used a constant prize pool over duration ratio (Prize Pool/Duration ∼ 1000 USD/day) and the prize distributions follow: P(N) ∼ P(1)e−2/3(N−1) for places N = 1, 2, … and P(1) = Prize Pool/2.
We evaluated each submission using predetermined performance metrics. These are real-valued functions of one or more characteristics, such as computation time, memory use, or accuracy, which targeted specific contest objectives. A detailed description of the metrics used for each challenge is available in the Supporting Information (S1, S2 and S3 Appendices).
The evaluation was based on three kinds of data sets: a training set, a validation set, and a test set. The training set and associated ground truth were provided to participants for the purpose of understanding the properties of the data and for developing or training their solutions. The validation data set (excluding ground truth) was used to provide real-time feedback on the performance of their solution via submission of either source code or predictions produced by their algorithms. The test set was reserved for the end of the competition and was used for final evaluation of submissions; this data set was withheld from participants at all times. Note that for data science competitions, the holdout test set is particularly important for determining the true performance of solutions, since continuous feedback of performance in the form of a real-time leaderboard raises the prospect of model overfitting on the validation set. All data sets and the source code of most of the submissions are publicly available on the internet (S1, S2 and S3 Datasets).
We measured the degree of coders’ engagement and participation by looking at the number of code submissions that they made during the contest. Remarkably, the Gene Inference competition attracted more than twice the participants (85) than the other two competitions, Antibody (34) and Query Speedup (33). It also had a twice as high median number of submissions per participant (Fig 1). Because all three challenges offered about the same payout per day (Table 1) and were executed under similar external circumstances (same platform), other factors must explain the gap. One possible explanation relates to the different nature of the computational problems these competitions aimed to solve. According to this explanation, the community members may be more eager to solve prediction problems (Gene inference) than speedup problems (Antibody and Query speedup).
Antibody clustering challenge (Scripps)
Motivation and objective
In the process of immunotherapy and vaccine development, next-generation sequencing of antibody repertoires allows researchers to profile the large pools of antibodies that comprise the immune memory formed following pathogen exposure or immunization. During the formation of immune memory, antibodies mature in an iterative process of clonal expansion, accumulation of somatic mutations, and selection of mutations that improve binding affinity. Thus, the memory response to a given pathogen or vaccine consists of “lineages” of antibodies which can trace each of their origins to a single precursor antibody sequence. Detailed study of such lineages can provide insight to the development of protective antibodies and the efficacy of vaccination strategies. After the initial alignment and assembly of reads, the antibody sequences can be clustered based on their similarity to the known gene segments encoding heavy or light antibody chains. Clustering these antibody sequences allows researchers to understand the lineage structure of all antibodies produced in individual B cells [10–12].
The number of antibody sequences from a single sample can easily reach to the millions; posing a major computational challenge for clustering at such a large scale. The bottleneck lies at computing a pairwise distance matrix and subsequently performing hierarchical clustering of the sequences. The former task scales as O(N2) in both computational and space complexity, where N is the total number of input sequences. The latter task, assuming a typical agglomerative hierarchical clustering algorithm, has a computational complexity that scales as O(N2 log N). By comparison, the file I/O is expected to scale as O(N).
Benchmarks and methodologies prior to the competition
A Python implementation of the clustering algorithm was developed utilizing numpy, fastcluster, and scipy.cluster.hierarchy.fcluster modules. This Python implementation is publicly available on Github (S1 Dataset). The clustering was performed using the average linkage criterion with a maximum threshold distance imposed. The implementation utilized the Python built-in multi-processing module to parallelize the computations on multiple cores and required approximately 54.4 core-hours and 80GB of memory (computations were performed on a 32 core machine with 250GB RAM) for a dataset containing 100K input antibody sequences. Empirically, the primary bottlenecks for this implementation and dataset were computation time and storage of the distance matrix (note that the embarrassingly parallel nature of this task implies a trivial cost conversion between single and multi-core computations). Full-hierarchical clustering scales comparably in terms of computational complexity; with the threshold imposed, the relative cost is approximately a factor of 50 less than the cost of constructing the distance matrix, however. Although I/O was included in the timing estimates, its contribution in this case was negligible.
Extrapolating the computation to a typical antibody profiling sample containing one million input sequences, the required computational time and storage is expected to reach around 5440 core-hours and 8TB, respectively. Given its poor scalability and efficiency limitations, this implementation is inadequate for large-scale profiling, which for a small clinical vaccine evaluation may consist of dozens of subjects with several longitudinal samples per subject. The goal of the challenge was to optimize and improve the algorithm and its implementation, so that routine data analysis of a large-scale antibody profiling study would become feasible given modest computational resources.
Problem abstraction and available data sets
A common optimization practice is to convert all or just the computationally-intensive parts of Python codes to languages such as C/C++. We implemented a C++ version of our algorithm (directly translated) and used it in our post-challenge benchmarking analysis (S1 Appendix). Although the memory footprint for this implementation was comparable to that of the Python implementation, it yielded approximately an order of magnitude reduction in computation time. This computational cost reduction was primarily attributed to the construction of the distance matrix, whereas the computational cost of the clustering itself was comparable in both implementations (this is not surprising, given the underlying implementation of the scipy.cluster.hierarchy.fcluster is written in C). All contest submissions were evaluated relative to the Python (A1) and C++ (A2) benchmarks described above, and the code for both benchmark implementations was provided to the contestants.
We generated datasets of 1K, 5K, 10K and 100K input sequences sampled from true sequences derived from a healthy adult subject and computed their corresponding clustering outputs using the A1 baseline. These data sets are publicly available on Github (S1 Dataset). The output results served as the “gold standards” for evaluating the accuracy of the solutions produced in the competition (S1 Appendix). The training datasets comprised one 1K and one 100K input sequences, the validation datasets comprised two 5K and three 10K input sequences, whereas the testing datasets comprised four 10K and six 100K input sequences. For the given threshold distance, the maximum cluster size for each dataset ranged from 1.6% to 4% of the total number of antibody sequences in the set. The most probable cluster size for each set was unity, however.
Competition outcome
The competition lasted for 10 days and involved 34 active participants with an average of 7 submissions per participant. All contest submissions were evaluated on a 32 core server with 64GB memory, although the top four solutions utilized only a single core. For the given dataset, a majority of participants had submitted solutions that were significantly more computationally efficient than the A2 benchmark, with the winning solutions being orders of magnitude more efficient.
Fig 2 illustrates the computational cost performance of the A1 and A2 benchmarks, along with the performance envelope for the top four performing solutions as a function of the number of input antibody sequences for N up to 1M input sequences. As shown in the figure, all solutions exceeded the 99% accuracy threshold to be considered for prizes (S1 Appendix) with little or no difference in accuracy among the top solutions. In terms of speed improvement, the winning solution was able to perform clustering of 100K sequences in approximately 1.8s on a single core implying a total computational cost reduction (speed-up) of 108,800 when compared to the A1 benchmark. Interestingly, the primary bottleneck for this implementation was no longer clustering but rather file I/O. Neglecting file I/O, which accounted for 85% of the computation time, the effective speedup achieved for clustering over the A1 benchmark was approximately 777,000 for 100K sequences. Further improvement of the winning algorithm was achieved subsequent to the competition, in part by aggressively optimizing the file I/O. The improvements resulted in another 30-fold reduction in the total computational cost of the algorithm (including I/O) on the 100K dataset. For N up to 1M input sequences, the four solutions required less than approximately 17GB memory, with the winning solution requiring only 0.7GB memory.
The winning solution achieved significant improvements in computational efficiency by abandoning generic implementations of hierarchical clustering, which demand computing the full (or half) the distance matrix as input in favor of a problem-specific strategy. The solution exploited several important properties of the data set to dramatically reduce the number of Levenshtein distances computed by noting that:
Some families of antibodies are never clustered together implying a partitioning of the dataset.
Some antibodies are always clustered together, forming subclusters.
Coarse “superclusters” can be quickly identified and formed from such subclusters.
Once coarse clusters were formed, a final exact clustering was performed for each of the superclusters. In addition to these innovations, the solution exploited SIMD instructions to speed up the calculation, improved the management of memory, and simplified the parsing of input data to minimize I/O operations.
The Connectivity Map Inference Challenge (Broad Institute)
Motivation and objective
The discovery of functional relationships among small molecules, genes, and diseases is a key challenge in biology with numerous therapeutic applications. The Connectivity Map (CMap) allows researchers to discover relationships between cellular states through the analysis of a large collection of perturbational gene expression signatures [13, 14]. These signatures arise from the systematic perturbation of human cell lines with a wide array of perturbagens, such as small molecules, gene knockdown, and over-expression reagents; together, they constitute a massive dataset that can be “queried” by powerful machine learning algorithms to identify biological relationships. Researchers can then use the outcomes of those queries to form hypotheses to guide experimentation [15].
Despite the recent advancements in high-throughput technology, developing massive gene expression data repositories, such as the Connectivity Map, is still considerably challenging due to the cost of doing systematic large-scale gene expression profiling experiments. These experiments typically involve thousands of genes with complex cellular dynamics and nonlinear relationships, thereby usually demanding large sample size. They also involve perturbations that can be performed with a wide array of molecules, at different doses, and in different cellular contexts, thus producing an enormous space of experimentation.
To address these difficulties, the CMap group at the Broad Institute has developed a novel high-throughput profiling technology, called L1000, that makes the data generation process for The Connectivity Map cheaper and faster [14]. This new technology is based on a reduced representation of the transcriptome that combines direct measurement of a subset of approximately 1000 genes, called landmarks, with an imputation approach to obtain the gene expression levels for the remaining (non-landmark) genes.
The resulting combination of imputed and directly-measured genes enables the L1000 platform to report on approximately 12,000 genes with accuracy comparable to traditional profiling technologies (e.g., RNA-Seq) but at a fraction of the cost [14]. This improvement has in a few years enabled the Connectivity Map to grow from its initial 500 (Affymetrix) profiles to over one million (L1000) profiles.
While L1000 is a powerful and cost-effective tool, the comparability with externally-generated gene sets remains an open issue. The practice of directly measuring only 1,000 genes (a small fraction of the entire transcriptome) reduces the fidelity of comparisons with externally-generated gene sets, whose overlap with the 1,000 landmarks may be minimal. Improving the imputation accuracy for non-landmark genes is, thus, expected to facilitate comparisons between L1000 and external gene expression data and gene sets, leading to higher-confidence hypotheses.
In addition, while the connectivity analysis of CMap signatures can be restricted to the landmarks only, sometimes this restriction is unfeasible (queries containing no landmarks) or undesirable (leading to poor outcomes, [14]). Improving the imputation accuracy is, therefore, expected to bear additional benefits downstream, increasing the ability of CMap to detect biologically-meaningful connections.
With these goals in mind (improving comparability and the downstream connectivity analysis), the “CMap Inference Challenge” solicited code submissions for imputation algorithms more accurate than the present approach.
Problem abstraction and available data sets
The following describes the basic imputation approach currently in use by the Connectivity Map:
Learn the structural correlations between non-landmark and landmark genes using Multiple Linear Regression (MLR) on a reference dataset D with complete data on both landmarks and non-landmarks;
Perform out-of-sample predictions of the expression values of the non-landmarks in a different target dataset T (i.e., apply the regression function as estimated from D to compute the fitted values of the transcriptional levels for the non-landmark genes in T), using these predictions for imputation.
It has been shown that a MLR approach using L1000 data for imputation achieves considerable accuracy when compared to RNA-Sequencing (RNA-Seq) on the same samples [14]. Nevertheless, alternative imputation approaches could yield performance improvements (e.g., by exploiting possible non-linearities in the data or a larger training dataset).
Contest participants were tasked to explore alternative approaches using an eight-times-larger dataset for their inferential models with the goal to predict 11,350 non-landmarks from 970 L1000 landmarks (S1 Table). Their models were then evaluated using as ground truth RNA-Seq data profiled on the same samples (S2 Appendix).
Competition outcome
Eighty-eight competitors submitted their predictions for evaluation; fifty-five (62%) achieved an average performance higher than the MLR benchmark, with improvements as high as 50%, and with little difference between provisional and system evaluations (S1 Fig), indicating no overfitting.
For the top five submissions, performance improvements were in absolute and relative accuracy (Fig 3). The median gene-level rank correlation between the predicted values and the ground truth was 70% higher than the benchmark. The gene-level relative accuracy (i.e., the rank of self-correlation compared to correlations with other genes) was distributed with a lower dispersion (an interquartile range 40-90% lower depending on the submission), resulting in a higher recall. All the above differences were highly statistically significant (p < 0.001), according to a pairwise paired Wilcoxon Signed Rank Test with Bonferroni correction.
Inspection of the methods used by the top five submissions (Table 2) reveals that 4 out of 5 of the top-performing submissions converged to a K-Nearest Neighbors (KNN) regression, and only one used a Neural Network approach. KNN regression is a non-parametric regression approach that makes predictions for each case by aggregating the responses of its K most similar training examples. Compared to MLR, KNN regression makes weaker assumptions about the shape of the “true” regression function thereby allowing a more flexible representation of the relationship between landmark and non-landmark genes.
Table 2. Methods of top five submissions for the Gene inference challenge.
Rank | Algorithm description |
---|---|
1 | k-nearest neighbors (average over multiple normalizations) |
2 | Neural network with batch normalization |
3 | k-nearest neighbors (average over random subsets) |
5 | k-nearest neighbors (average over random subsets) |
5 | k-nearest neighbors (average over random subsets) |
Methods used for estimating gene expression values by the top-five ranked submissions in the Gene inference challenge.
Compared to a standard KNN-based imputation model, such as [16], the winning submission presents a few innovations. First, it combines multiple predictions obtained by iteratively applying the same regression function to different data normalization, such as scale, quantile, rank, and ComBat batch normalization [17]. Second, within each iteration, it identifies and combines multiple sets of K nearest neighbors using different similarity measures. These modifications present some potential advantages over a standard KNN method. One potential advantage is to alleviate batch effects thereby controlling for a source of non-biological variation in the data. Another advantage is the identification of the training samples that are most biologically relevant for predicting a given test sample (e.g., same tissue). The net outcome is that the winning approach achieves more than 5% improvement in absolute accuracy and more than 2.6% improvement in the recall over the other top KNN approaches.
We further examined key differences between the winning method and the MLR benchmark by comparing performance on large (100,000) and small (12,000) sample size training datasets. We found that the winning KNN approach outperforms the MLR, although the performance is sensitive to the sample size. When trained on the smaller dataset, the overall performance of the winning KNN approach was higher than the benchmark (10% higher absolute accuracy) but the recall was (5%) lower.
Clustering analysis of the gene-level scores (Fig 4) suggested potential complementarities among the top submissions. We tested this hypothesis by combining the predictions of the top five submissions into an ensemble approach. We used the training dataset to select automatically the best predicting method for each particular gene (the one with the highest combined score). By doing so, we found a strong complementarity between the winning KNN and the Neural Network approach, which were equally selected for over two-thirds of the genes (Fig 3). We then evaluated the resulting predictions on the test set, showing a 2% improved performance over the top performing submission.
We then examined to what extent a better inference model affects the ability to recover expected connections (as defined in [14]) using CMap data. To test this hypothesis, the winning approach was used to impute non-landmark genes on a dataset of L1000 profiles from about 46,000 samples of multiple perturbagens (S2 Appendix). We then processed these KNN-imputed data through the standard CMap pre-processing pipeline [14] and queried the resulting signatures, and their MLR equivalents, with a collection of annotated pathway gene sets. Based on the literature, each gene set was expected to connect to at least one of the perturbagens. We compared the distribution of the connectivity scores and corresponding ranks generated using the predictions made by the KNN approach and those made by the benchmark MLR approach. Results show no significant difference in the distributions of these connectivity measures (according to Kolmogorov-Smirnov test).
In conclusion, the top submission achieved substantial improvements in performance over the MLR benchmark, thus it succeeded in achieving a better comparability of CMap inferred data with external data. To facilitate the use of the winning submission for this purpose, the winning code was deployed in the R package (“cmapR”) and is currently available at github.com/cmap. At the same time, we found no evidence that the better inference would translate into a more accurate downstream connectivity analysis (higher ability to recover the expected connections), contrary to the initial hypothesis. However, limitations in the contest configuration (scores were not directly based on connectivity) preclude a conclusive statement at the moment, and further investigation is ongoing.
CMap Query Speedup Challenge (Broad Institute)
Motivation and objective
Researchers often use the Connectivity Map to compare specific patterns of up- and down-regulated genes for similarity to expression signatures of multiple perturbagens (e.g., compounds and genetic reagents) in order to develop functional hypotheses about cell states [14]. These hypotheses can be used to inform areas of research ranging from elucidating gene function to identifying the protein target of a small molecule to nominating candidate therapies for disease.
Given the high potential of this approach, many algorithms that assess transcriptional signatures for similarity have been developed over the years. These algorithms are generally computationally expensive, which may limit their use to relatively small-sized data. The principal problem is that computationally efficient methods, such as cosine similarity, may be inadequate for interpreting gene expression data in general. Moreover, more powerful methods, such as the Gene Set Enrichment Analysis [18], need to perform computationally-expensive tasks, such as ranking genes in each signature by their expression levels to compare gene rank positions individually across signatures. Given the Connectivity Map has recently expanded to over one million profiles [14], this limitation is particularly problematic.
To address this problem, the CMap group developed a fast query-processing algorithm, called SigQuery. This tool was implemented in MATLAB incorporating a range of optimization techniques to speed up queries on the Connectivity Map, and was available on the online portal CLUE.IO. Overall, the algorithm achieves a good level of performance (in a preliminary analysis, it took about 120 minutes to process 1000 queries with gene sets of size 100 against a signature matrix of 470,000 signatures). Even so, execution time and memory requirements are still a potential barrier to adoption for the Connectivity Map.
To further the development of query algorithms for CMap data, the “CMap Query Speedup Challenge” solicited code submissions for fast implementations of the present CMap query methodology.
Problem abstraction and available data sets
Following CMap query methodology (S4 Appendix), one bottleneck lies at rank-ordering all signature values and subsequently walking down the entire list of G genes in a signature to compute the running sum statistics over all possible pairs of S signatures and queries. Using the quick-sort algorithm, this task has a computational complexity of O(S × G log(G)) in the average case. To save time, however, results can be stored on disk, as in the current implementation, with adjustments to try to minimize the cost of access to disk memory. The other very burdensome task involves computing the running sums for all genes in each signature, which has a computational complexity upper bound that scales as O(S × G) per query.
For this contest, participants had to address these problems and the performance of their methods was evaluated on 1000 queries to be run on the whole CMap signature matrix, which has expression values for > 470k signatures and > 10k genes (S3 Appendix).
Competition outcome
The competition resulted in 33 participants making 168 code submissions. All final submissions were evaluated on the holdout query dataset on a server with 16 cores. Results showed significant speed improvements over the benchmark: the median speedup was of 23× with at least two submissions achieving speedups beyond 60×.
Comparison of performance between 16-core and single-core evaluations showed that multithreading alone accounted for a large fraction of the gains in performance over the benchmark (Fig 5): the median speedup difference between single and multi core was 18×, accounting for 70% of the final median speedup over the benchmark.
Beyond multithreading, analysis of the scaling properties of the winning submission showed a computational time complexity that scales as the square-root of the number of queries (Fig 5), which represents a major improvement compared to the benchmark’s linear scaling.
The winning submission also showed substantial performance improvements in the reading time—the time to load in memory the absolute values and rank-ordered positions of genes in the CMap signature matrix: the overall reading time of the winning submission was just below 10 seconds, which represents a 50× speedup compared to the 500 seconds used by the benchmark (S2 Fig).
Close examination of the codes of the top submission revealed multiple optimization adjustments that can account for all these improvements. These adjustments often overlap with those of the other submissions, thus making hard any clear-cut categorization. Consequently, we report instead the areas of optimization that we believe are responsible for most of the performance improvement over the benchmark (beyond multithreading and the recourse to low-level programming languages, such as C++, instead of MATLAB). Essentially, these are:
Efficient data storage techniques in order to maximize the available cache memory for each thread.
Streaming SIMD Extensions (SSE) technology to execute multiple identical operations simultaneously for each thread.
One of the ways by which the winning submission achieved its exceptional performance was by loading the entire signature matrix in the cache memory. Cache memory is indeed the fastest memory in the computer, albeit of very limited capacity. To minimize memory usage, the winning contestant stored the CMap signature matrix at a lower precision than the benchmark (32-bit single precision floating for the scores and 16-bit integers for the ranks), with essentially no loss in accuracy. Precision reduction alone, however, was insufficient to fit level 1 cache memory (the fastest cache memory available) due to the large extent of queries and gene sets to be processed per signature. So, it developed a clever system of matrices to efficiently store the indexes and partial sums for each gene in a query. The resulting algorithm made a much more efficient use of memory compared to the benchmark.
The other major improvement is related to SSE, which is a set of instructions that allows the processor to compute the same operation on multiple data points simultaneously [19]. The winning submission used SSE to form batches of 4 genes and simultaneously compute the rank positions of these genes, thus reducing by approximately a factor of one-fourth the time of each query.
The winning code submission for the CMap Query Speedup Challenge was deployed in the online portal CLUE.io and is now currently available as an option to users in the Query App (S3 Fig). This improved algorithm has also enabled CLUE to support batch queries, allowing users to execute multiple queries in a single job, all via the CLUE user interface.
Discussion
This work demonstrates how researchers in computational biology and bioinformatics have utilized open innovation competitions to make inroads on a variety of computational roadblocks in their work. While participants in these competitions may not possess the domain knowledge to solve every research problem faced by scientists, their unique skills can be leveraged to attack well-defined tasks (e.g., algorithm improvement to maximize computational speed) resolving specific issues or bottlenecks to the research process.
We highlight a few key advantages over traditional approaches. First, competitions enable rapid yet broad exploration of the solution space of the problem. This broad exploration opens the possibility of discovering high-performing solutions, which may lead to breakthroughs in the field. Second, competitions give researchers access to multiple complementary solutions. Thus, they create opportunities to boost performance even further with ensemble techniques whereby different approaches are combined based upon their strengths in different regimes or for different subsets of data.
Our results indicate that these gains arise from the efforts of out-of-the-field participants instead of the community of practitioners and domain experts. Thus, on the one hand, we know open innovation competitions are a great tool to solicit community-based effort (e.g., Dream Challenges); on the other, we show that the potential for their use in biology and other life sciences goes beyond the size and availability of the community of researchers connected to the problem.
By defining appropriate objective functions to guide the competitors’ efforts, researchers have indeed the flexibility to pursue vastly different problems or even decide to tackle multiple aspects of the same problem concurrently via a composite objective function. This kind of problem definition, however, can be difficult. It requires knowledge of the aspects that are critical to the problem and for which improvements are quantifiable and achievable, understanding of ways to trade-off improvements in one dimension for another, and the ability to abstract the original problem from its domain to attract broad participation (knowing that domain-specific information hurts participation but also offers competitors insights on how to solve the problem).
We have shown some of the challenges encountered in addressing these issues for two very general types of computational problems: “code development” and “machine learning.”
Code development problems typically boil down to improving existing computational algorithms that have well-defined inputs and outputs under a variety of constraints (e.g., speeding up the computation, without exceeding memory limits). Here, although performance improvements are quantifiable and can be checked by test cases, other relevant aspects (e.g., robustness) are less so. This restriction forces researchers to take additional steps after the contest to validate methodologies and integrity of solutions beyond the limited test cases considered during the competition. These steps typically include ensuring security (understanding what the code does and ensuring it is not malicious) and legality of the produced codes (that they are original or properly licensed) before integration and deployment.
Machine learning problems focus on more exploratory questions, such as producing predictive models that describe an existing data set, yet are generalizable to new data. These problems are typically multi-dimensional, given the wide range of potential applications, and are sometimes hard to quantify (e.g., measurements that offer only a partial picture of biological states to model). As a result, competitions addressing these problems typically involve interpretation and further evaluation of methodologies, exploration of possible complementarities, and understanding strengths and pitfalls of solutions in comparison to known methodologies along dimensions not considered within the competition.
Our study suggests that although competitions offer the flexibility to address both kinds of problems, the associated post-competition efforts can be quite different. In both cases, handling these additional tasks requires expertise that itself may not be available to the end-user, although they may be addressed in part by subsequent competitions; thus, keeping a modular design strategy appears beneficial. For ML problems, further efforts (comparable to those devoted to replicating the methods used in another study) are often needed to evaluate and assimilate the new knowledge produced. Further research to examine the outer limits of scientific problem solving through contests is necessary.
We conclude by mentioning a few additional empirical contexts in which open innovation competitions seem promising. One is the assignment of functional attributes of small molecules and genes, such as predicting the mechanism of action of a compound. While many algorithms have been independently developed for this purpose, a broad exploration is often out of reach to individual laboratories. On the development side, research in biology is often impeded by bottlenecks in the memory storage and transmission of genetic data, which are critical and quantifiable, thus an open innovation competition seems an effective way to overcome these bottlenecks.
Supporting information
Data Availability
Data (including training, testing and validation) and codes (including the benchmark clustering algorithms and the top solutions of the contestants) are available within Harvard Dataverse (https://doi.org/10.7910/DVN/5PNPKJ).
Funding Statement
This work was funded by the Eric and Wendy Schmidt Foundation, NASA Center of Collaborative Excellence, and the Kraft Precision Medicine Accelerator & Division of Research and Faculty Development at the Harvard Business School. The Scripps competition was funded by N.I.H. Grant Number 5UM1 AI100663 to AB. The CMap competitions were supported in part by the NIH Common Funds Library of Integrated Network-based Cellular Signatures (LINCS) program U54HG008699 to AB and NIH Big Data to Knowledge (BD2K) program 5U01HG008699 to AB. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Hughes AJ, Mornin JD, Biswas SK, Beck LE, Bauer DP, Raj A, et al. Quanti.us: a tool for rapid, flexible, crowd-based annotation of images. Nature methods. 2018;15(8):587–590. 10.1038/s41592-018-0069-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Cooper S, Khatib F, Treuille A, Barbero J, Lee J, Beenen M, et al. Predicting protein structures with a multiplayer online game. Nature. 2010;466(7307):756 10.1038/nature09304 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Saez-Rodriguez J, Costello JC, Friend SH, Kellen MR, Mangravite L, Meyer P, et al. Crowdsourcing biomedical research: leveraging communities as innovation engines. Nature reviews Genetics. 2016;17(8):470–486. 10.1038/nrg.2016.69 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Hill SM, et al. Inferring causal molecular networks: empirical assessment through a community-based effort. Nature methods. 2016;13(4):310–318. 10.1038/nmeth.3773 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Costello JC, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology. 2014;32(12):1202–1212. 10.1038/nbt.2877 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Lakhani KR, Boudreau KJ, Loh PR, Backstrom L, Baldwin C, Lonstein E, et al. Prize-based contests can provide solutions to computational biology problems. Nature biotechnology. 2013;31(2):108 10.1038/nbt.2495 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Abdollahi N, Albani A, Anthony E, Baud A, Cardon M, Clerc R, et al. Meet-U: educating through research immersion. PLoS computational biology. 2018;14(3):e1005992 10.1371/journal.pcbi.1005992 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Jeppesen LB, Lakhani KR. Marginality and problem-solving effectiveness in broadcast search. Organization science. 2010;21(5):1016–1033. 10.1287/orsc.1090.0491 [DOI] [Google Scholar]
- 9. Mak RH, Endres MG, Paik JH, Sergeev RA, Aerts H, Williams CL, et al. Use of Crowd Innovation to Develop an Artificial Intelligence–Based Solution for Radiation Therapy Targeting. JAMA oncology. 2019;5(5):654–661. 10.1001/jamaoncol.2019.0159 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Lees WD, Shepherd AJ. In: Keith JM, editor. Studying Antibody Repertoires with Next-Generation Sequencing. New York, NY: Springer; New York; 2017. p. 257–270. Available from: 10.1007/978-1-4939-6613-4_15. [DOI] [PubMed] [Google Scholar]
- 11. Mathonet P, Ullman C. The Application of Next Generation Sequencing to the Understanding of Antibody Repertoires. Frontiers in Immunology. 2013;4:265 10.3389/fimmu.2013.00265 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Burton DR, Mascola JR. Antibody responses to envelope glycoproteins in HIV-1 infection. Nat Immunol. 2015;16(6):571–576. 10.1038/ni.3158 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Lamb J, et al. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science. 2006;313(5795):1929–1935. 10.1126/science.1132939 [DOI] [PubMed] [Google Scholar]
- 14. Subramanian A, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell. 2018;171(6):1437–1452.e17. 10.1016/j.cell.2017.10.049 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Qu XA, Rajpal DK. Applications of Connectivity Map in drug discovery and development. Drug Discovery Today. 2012;17(23-24):1289–1298. 10.1016/j.drudis.2012.07.017 [DOI] [PubMed] [Google Scholar]
- 16.Hastie T, Tibshirani R, Sherlock G, Eisen M, Brown P, Botstein D. Imputing Missing Data for Gene Expression Arrays; 1999.
- 17. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics (Oxford, England). 2012;28(6):882–883. 10.1093/bioinformatics/bts034 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. 10.1073/pnas.0506580102 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Raman SK, Pentkovski V, Keshava J. Implementing streaming SIMD extensions on the Pentium III processor. IEEE Micro. 2000;20(4):47–57. 10.1109/40.865866 [DOI] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Data (including training, testing and validation) and codes (including the benchmark clustering algorithms and the top solutions of the contestants) are available within Harvard Dataverse (https://doi.org/10.7910/DVN/5PNPKJ).