Author manuscript; available in PMC: 2018 Sep 6.
Published in final edited form as: Biomed Data Manag Graph Online Querying (2015). 2016 Jun 24;9579:134–146. doi: 10.1007/978-3-319-41576-5_10

SparkGIS: Efficient Comparison and Evaluation of Algorithm Results in Tissue Image Analysis Studies

Furqan Baig 1,, Mudit Mehrotra 1, Hoang Vo 1, Fusheng Wang 1,2, Joel Saltz 1,2, Tahsin Kurc 2
PMCID: PMC6126541  NIHMSID: NIHMS980882  PMID: 30198025

Abstract

Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison of multiple results, and facilitate algorithm sensitivity studies. The sizes of images and analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present SparkGIS, a distributed, in-memory spatial data processing framework to query, retrieve, and compare large volumes of analytical image result data for algorithm evaluation. Our approach combines the in-memory distributed processing capabilities of Apache Spark and the efficient spatial query processing of Hadoop-GIS. The experimental evaluation of SparkGIS for heatmap computations used to compare nucleus segmentation results from multiple images and analysis runs shows that SparkGIS is efficient and scalable, enabling algorithm evaluation and algorithm sensitivity studies on large datasets.

1 Introduction

Tissue specimens obtained from patients contain rich and biologically meaningful morphologic information that can be linked to molecular alterations and clinical outcome, providing a complementary methodology to genomic data analysis for clinical investigations [3,4,12,13,16]. Manual histopathology using high-power microscopes has been the de facto standard in clinical settings for health care delivery. However, this process corresponds to a qualitative analysis of the tissue, is labor-intensive, and is not feasible in research studies involving thousands of tissue specimens. Advances in tissue imaging technologies have made it possible for investigators and research teams to collect high-resolution images of whole slide tissue specimens. Hence, quantitative analysis of tissue specimens is increasingly becoming a key component of clinical research in Pathology and in many clinical imaging studies targeting complex diseases such as cancer [9,13].

While a quantitative analysis of tissue data can provide new insights into disease mechanisms and facilitate greater reproducibility, image analysis pipelines are not immune to inter-method and inter-analysis variability. Most image analysis methods are sensitive to input parameters and input data. It is not uncommon that an analysis pipeline optimized for a particular set of images will not perform as well when applied to another set of images. Consider object segmentation, which is a common step in image analysis. A nucleus segmentation pipeline for tissue image analysis will detect and delineate the boundaries of nuclei in an image. Input parameters, such as intensity thresholds and the choice of algorithms for seed detection and for separation of clumped nuclei, will impact the results (the number and locations of detected nuclei, the shape of the boundaries of a nucleus, etc.). Figure 1 shows nuclear segmentation results from two analysis runs. As seen in the figure, the two analysis pipelines have good agreement in some regions (i.e., the boundaries of the polygons overlap closely) and large disagreement in other regions, where either one algorithm has segmented nuclei that the other has not, or there are large differences between the boundaries of a nucleus segmented by the two algorithms. Both algorithm developers and biomedical researchers require methods and tools for detecting, studying, and quantifying variability in results from multiple analyses. We refer to this process as uncertainty and sensitivity quantification. A systematic approach for algorithm sensitivity studies can facilitate the development of refined algorithms. It can also substantially help the development of large repositories of curated analysis results.

Fig. 1.

Fig. 1

Nucleus segmentation results from two analysis runs. Each segmented nucleus is represented by a polygon. The color of a polygon overlaid on the original image indicates whether it was generated by the first run (dark blue) or the second (lime). (Color figure online)

The uncertainty/sensitivity quantification process is a data intensive and computationally expensive process when thousands of whole slide tissue images are analyzed by multiple analysis runs. State-of-the-art scanners are capable of capturing tissue images at very high resolutions, typically in the range of 50,000 × 50,000 to 100,000 × 100,000 pixels. A nuclear segmentation pipeline may segment hundreds of thousands to millions of nuclei in an image. A common metric for comparing results from two analysis runs is the Jaccard index [10]. Computation of the Jaccard index involves spatial joins between the two sets of results and calculations of how much segmented nuclei overlap. This is an expensive operation that can take hours on a single machine for a single image and a pair of result sets.
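The Jaccard index over a pair of segmented objects can be illustrated with a minimal Python sketch. The actual pipeline operates on arbitrary polygons; for simplicity this illustrative version uses axis-aligned rectangles, where intersection and union areas have closed forms (all function names here are our own, not from SparkGIS):

```python
def rect_intersection_area(a, b):
    # a, b are (xmin, ymin, xmax, ymax) axis-aligned rectangles
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def rect_area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def jaccard(a, b):
    # Jaccard index = |A intersect B| / |A union B|
    inter = rect_intersection_area(a, b)
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union > 0 else 0.0
```

Two identical rectangles yield an index of 1.0; disjoint rectangles yield 0.0. Computing this for every pair of intersecting polygons across millions of segmented nuclei is what makes the comparison expensive.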

In this work, we propose, develop, and evaluate SparkGIS, a high-performance, in-memory, distributed framework to support the comparison of image analysis algorithms targeting high-resolution microscopy images. The important features of SparkGIS are: (1) It combines the in-memory distributed processing capabilities of Apache Spark with the high-performance spatial query capabilities of Hadoop-GIS; (2) It provides an I/O abstraction layer to support non-HDFS data sources (e.g., databases) and parallelizes I/O operations for such sources; (3) It employs data pipelining and multi-task execution optimizations for algorithm comparison jobs involving many images and algorithms. We implement heatmap computations with this framework and evaluate the implementation using real data generated by analyses of whole slide tissue images. The heatmap implementation is used to compare results from nucleus segmentation pipelines. We describe how SparkGIS combines the capabilities of Spark and Hadoop-GIS to efficiently support spatial data operations and execute heatmap computations.

2 Related Work

Spatial data processing systems built on cloud computing frameworks have been the focus of recent research [2,7,8,14,15]. SpatialHadoop [7] is an extension to Apache Hadoop for spatial data processing on the MapReduce framework [5]. It extends core Hadoop classes to support spatial data types and functions. Hadoop-GIS [2] presents a scalable MapReduce framework for spatial query processing with a specialized pathology image analysis add-on. It implements optimizations for spatial partitioning, partition-based parallel processing over MapReduce using the Real-time Spatial Query Engine (RESQUE), and multi-level spatial indexing. Hadoop-GIS supports a spatial bucketing algorithm that utilizes R*-Tree based global and on-demand local indexing for efficient spatial query processing. SparkGIS's query processing model is extended from Hadoop-GIS's MapReduce based spatial query work flow. MD-HBase [14] leverages a multi-dimensional K-d and Quad-Tree based index over a key-value store to efficiently execute range and nearest neighbor queries in real time. It is built on HBase, a column-oriented NoSQL distributed database that runs on top of Hadoop. Although all of these systems provide comprehensive distributed functionality, they inherently have high inter-job data movement costs. Hadoop requires disk reads and writes for any data passed between interdependent jobs. This can be a major performance bottleneck for spatial processing, which heavily relies on iterating over data through multiple map-reduce jobs.

Distributed in-memory data processing systems aim to keep data in memory to facilitate multiple iterations over it by multiple dependent jobs. Apache Spark [17,18] presents a Directed Acyclic Graph (DAG) execution engine that supports in-memory map-reduce style processing. Spark's architecture is built around an in-memory data abstraction termed the "Resilient Distributed Dataset" (RDD). An RDD represents an immutable, distributed collection of data elements that can be processed in parallel. Spark allows multiple operations to be executed on the same dataset; examples of operations include parallelize, map, reduce, filter, and groupByKey. GeoSpark [11] extends Spark with a spatial processing layer that provides support for spatial data types and functionality. Although GeoSpark uses a query processing approach similar to ours, it relies solely on Apache Spark as its data access layer. SparkGIS decouples the data access layer from query processing, allowing it to be extended to other types of data sources. It employs the spatial query and data processing operations implemented by Hadoop-GIS, along with data pipelining and concurrent multi-task execution optimizations, to reduce execution time.

3 SparkGIS Framework

The main goal of SparkGIS is to provide a high-performance, scalable, distributed spatial query processing framework tailored to the processing of large volumes of results generated by tissue image analysis runs. To this end, the design of SparkGIS combines the advantages of Apache Spark's in-memory distributed computing capability [17,18] with Hadoop-GIS's spatial query processing capability [2]. Figure 2 illustrates the high-level SparkGIS framework. The framework consists of two main components: an I/O abstraction layer and an execution engine. The main purpose of this design is to decouple I/O from execution, which allows an arbitrary I/O subsystem to be seamlessly integrated with SparkGIS. In this section we present the two main components.

Fig. 2.

Fig. 2

SparkGIS architecture. The I/O abstraction layer is decoupled from the query processing engine. Data is read into the distributed workers' memory. The SparkGIS execution engine spawns multiple jobs on this data to efficiently compute spatial query results, which are returned to the I/O abstraction layer for storage.

3.1 SparkGIS I/O Abstraction Layer

Hadoop-GIS requires that input datasets be stored in HDFS or copied to HDFS for spatial query processing. SparkGIS relaxes this requirement and provides an abstract I/O interface that can read from or write to HDFS and non-HDFS data sources and execute I/O operations in parallel even if the data source is not distributed. A storage system to be used as an input source or output destination only needs to implement two basic I/O functions, getDataAsRDD() and writeRDD(). The basic data unit of SparkGIS is RDD&lt;Polygon&gt;, which is an in-memory distributed collection of polygons extended from Apache Spark's generalized RDD. getDataAsRDD() returns an RDD&lt;Polygon&gt; instance populated with data from an input source; writeRDD() writes an output RDD to a destination.
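The two-function contract can be sketched as a small abstract interface. This is a hypothetical Python rendering for illustration only (the actual implementation is in Java on Spark, and the toy in-memory backend below is ours, not part of SparkGIS):

```python
from abc import ABC, abstractmethod

class SparkGISIO(ABC):
    """Sketch of the I/O contract: a storage backend only needs
    these two operations to plug into the framework."""

    @abstractmethod
    def get_data_as_rdd(self, image_id, analysis_id):
        """Return a distributed collection of polygons for one
        (image, analysis run) pair."""

    @abstractmethod
    def write_rdd(self, rdd, destination):
        """Persist an output collection to the backend."""

class InMemorySource(SparkGISIO):
    # Toy backend: a dict keyed by (image_id, analysis_id),
    # standing in for HDFS or a database.
    def __init__(self, store):
        self.store = store

    def get_data_as_rdd(self, image_id, analysis_id):
        return list(self.store.get((image_id, analysis_id), []))

    def write_rdd(self, rdd, destination):
        self.store[destination] = list(rdd)
```

Any backend satisfying this interface can be dropped in without changes to the execution engine, which only ever sees the returned collection.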

SparkGIS inherits all data sources supported by Apache Spark. These include the local file system, HDFS, and any storage source supported by Hadoop. Distributed I/O for all such storage sources is internal to Apache Spark, and its details are beyond the scope of this study. In addition to the inherited data sources, the current implementation of SparkGIS supports MongoDB as a data source and destination. There is an open-source MongoDB connector for Apache Spark, referred to here as the Mongo-Hadoop connector [1], but it has some drawbacks. The Mongo-Hadoop connector distributes queries on the basis of MongoDB collections. To process a query on a given collection, it first copies the whole collection into memory as an RDD and then computes the query on it. This copying makes it inefficient, and in some cases infeasible, to process queries on large collections; in our case, data from a collection may not fit in memory. SparkGIS's getDataAsRDD() implementation for MongoDB executes the following steps:

  1. Query the MongoDB server for the total count of documents matching given criteria, e.g. image ID and algorithm name in our case.

  2. Create an appropriate number of splits of documents. The split size is based on several variables, including the number of nodes in the cluster and the total number of documents.

  3. Distribute the splits among cluster nodes. Each cluster node reads its own range of documents from the MongoDB server and appends them to an RDD for processing.
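Step 2 can be sketched as a small split-planning function. This is an illustrative Python version under our own assumptions (the real split-size heuristic is not specified in detail; the minimum split size and ceiling division here are our choices):

```python
def make_splits(total_docs, num_workers, min_split_size=1000):
    """Partition [0, total_docs) into contiguous (skip, limit) ranges
    so each worker can issue its own skip/limit query against the
    MongoDB server independently."""
    if total_docs == 0:
        return []
    # Ceiling division spreads documents evenly; the floor on split
    # size avoids many tiny queries for small collections.
    split_size = max(min_split_size, -(-total_docs // num_workers))
    splits = []
    start = 0
    while start < total_docs:
        limit = min(split_size, total_docs - start)
        splits.append((start, limit))
        start += limit
    return splits
```

Because every range is independent, the reads in step 3 can proceed fully in parallel even though MongoDB itself is a single (non-distributed) source.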

3.2 SparkGIS Execution Engine

The execution engine of SparkGIS uses the Apache Spark runtime environment as the underlying distributed execution platform and Hadoop-GIS spatial query functions for queries. SparkGIS leverages the in-memory processing of Apache Spark. Keeping data in memory for iterative processing greatly reduces the processing time. In addition, and more specifically for spatial query processing in our case, in-memory data also removes inter-job I/O costs. SparkGIS's query execution model is based on the Hadoop-GIS RESQUE engine and uses the optimized spatial operators and indexing methods implemented in RESQUE. SparkGIS uses RESQUE functions as a shared library which can easily be shipped to cluster nodes and whose methods can be invoked from SparkGIS to process distributed in-memory spatial data. SparkGIS supports several common spatial query types essential to the comparison of analysis results in tissue image analysis. The first type of query, which is our main focus in this work, is the spatial join query, i.e., spatial operations used to combine two or more datasets with respect to a spatial relationship. Four steps are performed for this query type: (1) combine the input datasets into a single collection; (2) partition the combined spatial data space into tiles; (3) map all polygons to the appropriate tiles; and (4) process the tiles in parallel. Other query types include spatial containment, finding objects contained in subregions, and computing the density of those objects. These query types can leverage a similar work flow with little or no modification.
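The tile-mapping step of the spatial join can be illustrated with a minimal Python sketch. This is not RESQUE's implementation; it assumes a regular grid (SparkGIS's partitioning algorithm may differ) and represents each polygon by its MBR plus a run identifier:

```python
def tile_id(x, y, tile_size):
    # Grid coordinates of the tile containing point (x, y)
    return (int(x // tile_size), int(y // tile_size))

def map_to_tiles(polygons, tile_size):
    """Map each polygon, given as (xmin, ymin, xmax, ymax, run_id),
    to every tile its MBR overlaps. A polygon crossing a tile
    boundary is replicated into all overlapping tiles, so each tile
    can later be joined independently."""
    tiles = {}
    for p in polygons:
        xmin, ymin, xmax, ymax, run = p
        tx0, ty0 = tile_id(xmin, ymin, tile_size)
        tx1, ty1 = tile_id(xmax, ymax, tile_size)
        for tx in range(tx0, tx1 + 1):
            for ty in range(ty0, ty1 + 1):
                tiles.setdefault((tx, ty), []).append(p)
    return tiles
```

After this mapping, each tile holds polygons from both runs and can be processed in parallel with no cross-tile communication, which is what makes the per-tile join embarrassingly parallel.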

In the next section, we present how the analysis results evaluation work flow is supported and how the heatmap computations are implemented in SparkGIS.

3.3 SparkGIS Based Analysis Results Evaluation Workflow and Heatmap Computations

The SparkGIS data processing work flow for comparison of result sets from two analysis runs consists of the following steps: (1) Data retrieval for a given image and analysis runs A and B; (2) Data preparation for both result sets; and (3) Execution of spatial operations and computation of output. Figure 3 shows the core Spark functions used in implementing these three stages. We describe in this section the work flow stages for computation of heatmaps to compare and evaluate results from two nucleus segmentation runs.

Fig. 3.

Fig. 3

This figure illustrates the stages involved in a SparkGIS spatial join query in terms of Apache Spark's functions.

For a given image and a pair of analysis runs, the implementation of the heatmap computations partitions the image regularly into K×K-pixel tiles, where K is a user-defined value. It then computes a Jaccard index [10] or a Dice coefficient [6] for each tile. The coefficient computation for a tile involves (1) finding all the segmented nuclei (represented as polygons) that intersect or are contained in the tile, (2) performing a spatial join between the sets of polygons from the two analysis runs, and (3) computing the coefficients based on the amount of overlap between intersecting polygons.
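Step 3 can be sketched in Python once the join has produced, for each matched polygon pair, the two areas and their intersection area. This is an illustrative version under our own assumptions (averaging over matched pairs is one plausible way to aggregate a per-tile score; the paper does not fix the exact aggregation):

```python
def dice(area_a, area_b, inter):
    # Dice coefficient = 2|A intersect B| / (|A| + |B|)
    denom = area_a + area_b
    return 2.0 * inter / denom if denom > 0 else 0.0

def tile_coefficient(matched_pairs):
    """Aggregate coefficient for one tile; each matched pair is
    (area_a, area_b, intersection_area) for a polygon from run A
    joined with a polygon from run B."""
    if not matched_pairs:
        return 0.0
    return sum(dice(a, b, i) for a, b, i in matched_pairs) / len(matched_pairs)
```

A tile score near 1.0 indicates that the two runs segmented nearly identical nuclei in that region; a score near 0.0 indicates strong disagreement.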

Data Retrieval

Segmented objects in a tissue image are typically represented by polygons; each polygon represents the boundaries of a segmented nucleus, for example. The data retrieval step is responsible for retrieving these polygons from storage sources. Each image can have hundreds of thousands to millions of polygons from a single analysis run. Therefore, data retrieval for multiple analysis runs can easily scale up to tens of millions of polygons for a single image. If results from an analysis run are stored in files in HDFS, each row of a data file stores a polygon representing a segmented nucleus. The result set may be stored by the analysis application in one or more files for each (image id, analysis run name/id) pair. When HDFS is a data source, the SparkGIS I/O layer calls Spark I/O functions to retrieve the data. If input data is, for example, stored in MongoDB, the SparkGIS I/O layer composes a MongoDB query to retrieve polygon data and executes the steps described in Sect. 3.1. A master node creates splits and distributes them among worker nodes by invoking the parallelize() function. Each worker node implements a flatMap function which reads its own split of data from the data source and returns a list of polygons. This list is appended to the RDD returned to the master node. The RDD is cached in memory for further processing. Upon return of the getDataAsRDD() function, an RDD is ready for processing. Figure 3 illustrates the use of Spark functions and the execution engine in the SparkGIS I/O layer.

Data Preparation

Data preparation separates tasks that are specific to a single dataset and can be executed independently for multiple algorithm result sets. The data preparation stage implements several preprocessing steps on data retrieved from sources for efficient execution of algorithm comparison operations in the spatial query execution stage: (1) Minimum Bounding Rectangles (MBRs) are computed for all of the polygons in the dataset; (2) The minimum bounds of the space encompassing all of the MBRs are computed; (3) The MBRs are normalized with respect to the bounds computed in step 2; that is, each dimension of an MBR is mapped to a value in [0.0, 1.0); (4) The space of MBRs is partitioned into tiles using a spatial partitioning algorithm; (5) A spatial index (an R-tree index in our current implementation) is created on the set of tiles; (6) All the polygons are mapped to the tiles using the spatial index.
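Steps 1 to 3 can be sketched in a few lines of Python. This is an illustrative, simplified version (it normalizes against the closed unit square rather than the half-open interval used in the paper, and the function names are ours):

```python
def mbr(polygon):
    """Minimum bounding rectangle of a polygon given as (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

def normalize_mbrs(mbrs):
    """Map every MBR into the unit square with respect to the bounds
    of the whole dataset (steps 2-3 of the preparation stage)."""
    gx0 = min(m[0] for m in mbrs)
    gy0 = min(m[1] for m in mbrs)
    gx1 = max(m[2] for m in mbrs)
    gy1 = max(m[3] for m in mbrs)
    w, h = (gx1 - gx0) or 1.0, (gy1 - gy0) or 1.0  # guard degenerate extents
    return [((m[0] - gx0) / w, (m[1] - gy0) / h,
             (m[2] - gx0) / w, (m[3] - gy0) / h) for m in mbrs]
```

Normalization lets tile boundaries and partitioning parameters be expressed independently of the pixel resolution of any particular image.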

An advantage of separating this stage from the data processing stage is the opportunity for further parallelism. For multiple analysis algorithms, multiple data preparation steps can be executed in parallel as concurrent SparkGIS jobs (see Fig. 4). Consequently, this improves overall system performance and resource utilization.

Fig. 4.

Fig. 4

Multiple concurrent jobs can be submitted to the SparkGIS framework. The stages of these jobs, as described in Sect. 3.3, are pipelined for efficient batch processing.

Figure 3 shows the Spark functions that are executed to prepare the retrieved data for processing. The map function is executed in parallel on the RDD prepared by the I/O layer. Each worker extracts the Minimum Bounding Rectangle (MBR) of each polygon in its data split. The next step filters out any invalid MBRs. The last step in the data preparation stage implements a reduce phase, which calculates the space dimensions for the whole dataset from the MBRs. The data preparation step generates a data configuration object containing the space dimensions along with a pointer to the cached polygon RDD.

Query and Data Processing

Once the data preparation step is finished, the data processing stage is invoked. SparkGIS ships the RESQUE shared library to all workers and invokes its functions. To combine all the cached RDDs from the generated configurations, SparkGIS uses Apache Spark's union operator. The flatMap function partitions the combined data into tiles covering the whole combined spatial space. Once all polygons are mapped to the appropriate tiles, they are grouped by tile ID by calling the groupByKey function. groupByKey is an expensive operation involving a shuffle, which requires substantial data movement among cluster nodes. Prior to this step, all processing is done on local distributed in-memory data partitions with minimal cross-node data transfer. Once grouped by tile ID, all tiles are processed in parallel using RESQUE to generate the spatial join query results.

The final result of comparing multiple algorithm analysis results is a heatmap based on the spatial join query. The heatmap indicates the regions of a given image where the compared analyses agree most. From the spatial join results, heatmaps are generated by calculating an average similarity index per tile. SparkGIS currently supports heatmaps based on the Jaccard and Dice metrics.
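The per-tile averaging that produces the heatmap can be sketched as a simple reduction in Python. This is an illustrative single-machine version of what groupByKey plus a per-tile reduce accomplish in the distributed setting:

```python
def heatmap(join_results):
    """join_results: iterable of (tile_id, similarity) pairs emitted
    by the spatial join. Returns a dict mapping each tile_id to the
    average similarity over all polygon pairs in that tile."""
    sums, counts = {}, {}
    for tid, sim in join_results:
        sums[tid] = sums.get(tid, 0.0) + sim
        counts[tid] = counts.get(tid, 0) + 1
    return {tid: sums[tid] / counts[tid] for tid in sums}
```

Rendering the resulting per-tile averages over the image grid yields the heatmaps used to compare the two segmentation runs.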

Multiple Concurrent Task/Job Execution

Each job in SparkGIS goes through these stages to process data from a pair of analysis result sets. Executing a single job at a time when a large number of datasets need to be processed results in low overall performance and underutilized cluster resources. Multiple stages of the work flow can be overlapped in a batch for further performance gains when processing multiple images. Each work flow, as described in Sect. 3.3, has its own context, which allows multiple contexts to be active simultaneously in the system. Figure 4 illustrates the execution strategy for processing results from multiple algorithms and multiple datasets by pipelining the stages of each work flow.
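The concurrent submission of jobs can be sketched in Python with a thread pool, mirroring SparkGIS's use of Java's ExecutorService (Sect. 4.1). This is an illustrative stand-in: each "job" here is just a callable representing the retrieve/prepare/process pipeline for one image and pair of runs:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(jobs, batch_factor=8):
    """Submit up to batch_factor comparison jobs concurrently and
    collect their results in submission order. The batch factor
    bounds how many pipelines are in flight at once."""
    with ThreadPoolExecutor(max_workers=batch_factor) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [f.result() for f in futures]
```

Bounding the pool size matters: submitting more concurrent jobs than the cluster can schedule yields no additional overlap, which is consistent with the diminishing returns observed beyond a batch factor of 8 in Sect. 4.1.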

4 Experimental Performance Evaluation

We used a CentOS 6.5 cluster with 5 nodes and 120 total cores. Each node has 24 logical cores with hyper-threading (Intel(R) Xeon(R) CPU E5-2660 v2 at 2.20 GHz) and 128 GB of memory. We employed Apache Spark 1.4.1 as our cluster computing framework. For Hadoop-GIS we used the same environment with default configurations. For a fair comparison between Hadoop-GIS and SparkGIS, datasets were uploaded to HDFS with the replication factor set to 3.

The datasets are the segmentation results from nucleus segmentation pipelines executed on sets of whole slide tissue images of cancer tumors. The images were obtained from the Cancer Genome Atlas (TCGA) repository1. Each image corresponds to a separate TCGA case ID; we use the terms case ID and image interchangeably in this section. For each image, two result sets were generated by two different analysis pipelines. The spatial boundaries of the segmented nuclei were converted into polygons and normalized. The number of polygons in the datasets used in the evaluation is shown in Table 1.

Table 1.

Number of segmented nuclei for different sets of images (caseids).

# Images    # Segmented nuclei (approximate)
100         70 million
200         90 million
300         125 million
400         150 million

4.1 Batch Factor

The batch factor is the number of concurrent tasks submitted to SparkGIS. We experimented with several batch sizes to determine a good value for efficient execution. SparkGIS uses Java's default ExecutorService to submit multiple concurrent jobs to the Spark cluster. Our results on 120 CPU cores show that SparkGIS can handle up to 8 algorithm evaluation jobs simultaneously. Figure 5 shows that increasing the batch factor beyond 8 leads to diminishing returns in our setup. This is mainly due to the limit on the number of jobs that can be scheduled on the available nodes and CPU cores.

Fig. 5.

Fig. 5

SparkGIS batch factor. Our experimental evaluation indicates that 8 is the optimal batch factor for our setup.

4.2 Execution Performance and Scalability

In this evaluation we fixed the batch factor at 8 (the optimal value from Sect. 4.1) and varied the total number of CPU cores across different data sizes. Data size was determined by the number of images. Each case ID corresponds to a pathology image with two separate algorithm analysis result sets. Table 1 summarizes the total number of spatial objects to process for each data size. Figure 6 shows that execution time decreases as more nodes and CPU cores are added, for all data sizes. SparkGIS achieves very good, almost linear speedup on our cluster; the execution time is halved when the number of worker cores is doubled from 24 to 48. Figure 7 presents a breakdown of the execution time into the stages described in Sect. 3.3. Most of the execution time is spent in the query and data processing stage for heatmap generation. As more nodes and CPU cores are added, all stages scale well, reducing the overall execution time.

Fig. 6.

Fig. 6

SparkGIS scalability with varying number of images. (Color figure online)

Fig. 7.

Fig. 7

Breakdown of execution time into the I/O, data preparation, and data processing stages (Sect. 3.3). (Color figure online)

4.3 SparkGIS Versus Hadoop-GIS

We compared SparkGIS with Hadoop-GIS on the generation of heatmaps for results from two analysis runs. The total number of images processed was varied across several experiments. Figure 8 shows the performance comparison of the two distributed spatial query processing frameworks. By mitigating I/O costs and processing data in memory, SparkGIS produces algorithm comparison results at least 40 times faster than Hadoop-GIS.

Fig. 8.

Fig. 8

A comparison of HadoopGIS and SparkGIS for heatmap computation. SparkGIS outperforms HadoopGIS due to lower I/O overheads through efficient in-memory processing. (Color figure online)

5 Conclusions

Pathology image algorithm validation and comparison are essential to iterative algorithm development and refinement. A critical component for this is support for efficient spatial queries and computation of comparison metrics. In this work, we develop a Spark-based distributed, in-memory algorithm comparison framework to normalize, manage, and compare large amounts of image analysis result data. Our approach is based on the spatial query processing principles of the Hadoop-GIS framework but takes advantage of in-memory and pipelined data processing. Our experiments on real datasets show that SparkGIS is efficient and scalable. The experiments demonstrate that, by reducing the I/O costs associated with data staging and inter-job data movement, spatial query processing performance can be improved by orders of magnitude. The query processing can be decomposed into multiple stages to leverage further performance gains through parallelization and concurrent task processing. In future work, we plan to focus on further evaluation of the spatial query pipeline in SparkGIS using more datasets and on extending SparkGIS to support a larger set of analytical operations, in addition to the current heatmap generation functions.

Acknowledgments

This work was funded in part by HHSN261200800001E from the NCI, 1U24CA180924-01A1 from the NCI, 5R01LM011119-05, 5R01LM009239-07 from the NLM and ACI 1443054 and IIS 1350885 from the National Science Foundation.

References

  • 1.Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop. [Google Scholar]
  • 2.Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz J. Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proc VLDB Endow. 2013;6(11):1009–1020. [PMC free article] [PubMed] [Google Scholar]
  • 3.Beck AH, Sangoi AR, Leung S, Marinelli RJ, Nielsen TO, van de Vijver MJ, West RB, van de Rijn M, Koller D. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci Transl Med. 2011;3(108):108ra113. doi: 10.1126/scitranslmed.3002564. [DOI] [PubMed] [Google Scholar]
  • 4.Cooper LAD, Kong J, Gutman DA, Wang F, Gao J, Appin C, Cholleti SR, Pan T, Sharma A, Scarpace L, Mikkelsen T, Kur TM, Moreno CS, Brat DJ, Saltz JH. Integrated morphologic analysis for the identification and characterization of disease subtypes. JAMIA. 2012;19(2):317–323. doi: 10.1136/amiajnl-2011-000700. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–113. [Google Scholar]
  • 6.Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297–302. [Google Scholar]
  • 7.Eldawy A. Proceedings of the 2014 SIGMOD PhD Symposium. ACM; New York: 2014. Spatialhadoop: towards flexible and scalable spatial processing using mapreduce; pp. 46–50. [Google Scholar]
  • 8.Frye R, McKenney M. Information Granularity, Big Data, and Computational Intelligence. Springer; Switzerland: 2015. Big data storage techniques for spatial databases: implications of big data architecture on spatial query processing; pp. 297–323. [Google Scholar]
  • 9.Fuchs TJ, Buhmann JM. Computational pathology: challenges and promises for tissue analysis. Comput Med Imaging Graph. 2011;35(7):515–530. doi: 10.1016/j.compmedimag.2011.02.006. [DOI] [PubMed] [Google Scholar]
  • 10.Jaccard P. Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr Corbaz. 1901 [Google Scholar]
  • 11.Yu J, Wu J, Sarwat M. Geospark: a cluster computing framework for processing large-scale spatial data. Proceedings of the 2015 International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL. 2015. [Google Scholar]
  • 12.Kong J, Cooper LAD, Wang F, Chisolm C, Moreno CS, Kur TM, Widener PM, Brat DJ, Saltz JH. ISBI. IEEE; 2011. A comprehensive framework for classification of nuclei in digital microscopy imaging: an application to diffuse gliomas; pp. 2128–2131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Louis DN, Feldman M, Carter AB, Dighe AS, Pfeifer JD, Bry L, Almeida JS, Saltz J, Braun J, Tomaszewski JE, et al. Computational pathology: a path ahead. Archives of Pathology and Laboratory Medicine. 2015 doi: 10.5858/arpa.2015-0093-SA. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Nishimura S, Das S, Agrawal D, Abbadim AE. Proceedings of the 2011 IEEE 12th International Conference on Mobile Data Management, MDM 2011. 01. IEEE Computer Society; Washington, DC: 2011. Md-hbase: a scalable multidimensional data infrastructure for location aware services; pp. 7–16. [Google Scholar]
  • 15.You S, Zhang J, Gruenwald L. Large-scale spatial join query processing in cloud. IEEE CloudDM Workshop. 2015 to appear. http://www-cs.ccny.cuny.edu/~jzhang/papers/spatial_cc_tr.pdf.
  • 16.Yuan Y, Failmezger H, Rueda OM, Ali HR, Gräf S, Chin SF, Schwarz RF, Curtis C, Dunning MJ, Bardwell H, Johnson N, Doyle S, Turashvili G, Provenzano E, Aparicio S, Caldas C, Markowetz F. Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci Transl Med. 2012;4(157):157ra143. doi: 10.1126/scitranslmed.3004330. [DOI] [PubMed] [Google Scholar]
  • 17.Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI 2012. USENIX Association; Berkeley: 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing; p. 2. [Google Scholar]
  • 18.Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud 2010. USENIX Association; Berkeley: 2010. Spark: cluster computing with working sets; p. 10. [Google Scholar]
