Abstract
Much effort has been devoted to supporting high performance spatial queries on large volumes of spatial data in distributed spatial computing systems, especially in the MapReduce paradigm. Recent works have focused on extending spatial MapReduce frameworks to leverage the high performance in-memory distributed processing capabilities of systems such as Spark. However, the performance advantage comes with the requirement of having enough memory and comprehensive configuration. When this requirement is not met, the system falls back to disk I/O, defeating its purpose, or in the worst case runs out of memory and fails the job. The problem is aggravated further for spatial processing, since the underlying in-memory systems are oblivious to spatial data features and characteristics. In this paper we present SparkGIS, an in-memory oriented spatial data querying system that adapts Apache Spark's distributed processing capabilities for high throughput and low latency spatial query handling. It supports basic spatial queries including containment, spatial join and k-nearest neighbor, and allows extending these into complex query pipelines. SparkGIS mitigates skew in distributed processing by supporting several dynamic partitioning algorithms suitable for a rich set of contemporary application scenarios. Multilevel global and local, pre-generated and on-demand in-memory indexes allow SparkGIS to prune input data and apply compute intensive operations to a subset of relevant spatial objects only. Finally, SparkGIS employs dynamic query rewriting to gracefully manage large spatial query workflows that exceed available distributed resources. Our comparative evaluation shows that the performance of SparkGIS is on par with contemporary Spark based platforms for relatively small queries, and that it outperforms them for larger data and memory intensive workflows through dynamic query rewriting and efficient spatial data management.
Keywords: Spatial processing, MapReduce, Spark, In-Memory processing
CCS CONCEPTS: Information systems → MapReduce-based systems, Spatial-temporal systems, Theory of computation → MapReduce algorithms, Computing methodologies → MapReduce algorithms
1 INTRODUCTION
Over recent years, the proliferation of mobile phones, the Internet of Things (IoT), collaborative data collection projects and ubiquitous sensory measurement technologies has contributed to generating multidimensional spatial data at an unprecedented rate and scale. The need for low latency, data intensive spatial frameworks has become increasingly important to businesses, everyday users and scientific applications ranging from geo-marketing and social engineering to biomedical research and clinical diagnosis.
The compute-intensive nature of spatial queries coupled with large volumes of data requires a scalable and efficient solution. In the past, we have developed and deployed a Hadoop based spatial warehousing system, Hadoop-GIS [1]. Other systems have been developed using a similar Hadoop based approach [1, 4, 5, 11]. Despite being able to address many big spatial data challenges, Hadoop based systems lack support for iterative data reuse, a common scenario in distributed spatial processing. Furthermore, most of these systems do not make efficient use of available distributed memory and thus suffer from considerably low throughput and high latency.
Recently, in-memory distributed computing has emerged as a popular choice for scalable and cost effective big data processing over a cluster of commodity machines. Computing platforms such as Apache Spark [12, 18] have a documented 10x to 100x performance gain over Hadoop based systems for distributed processing, provided enough distributed resources are available. Spark extends the basic map and reduce functions of Hadoop to more general purpose Directed Acyclic Graph (DAG) style transformations and closures. While vanilla Spark is a general purpose distributed computing framework lacking spatial query processing capabilities, several recent systems [7, 13–15] have extended its functionality for large scale spatial data processing. However, most of these systems either lack comprehensive support for spatial objects, have limited spatial query functionality or lack effective indexing mechanisms for large scale processing. In addition, for datasets considerably larger than available resources, all of these systems require extensive tuning to avoid the dreaded Out Of Memory (OOM) exceptions. Even with comprehensive tuning, excess data may be spilled to disk and reloaded by Spark on demand. This results in a huge performance hit and is potentially very inefficient, since Spark is oblivious to spatial data structure and formatting.
Inspired by these observations, we have developed SparkGIS, an in-memory spatial data processing framework which combines the in-memory data handling capabilities of Apache Spark with efficient spatial query processing techniques. The goal of the system is to deliver a high performance, scalable and efficient spatial querying system that provides low latency analytical query results on relatively small spatial working sets, irrespective of overall data size. At the same time, for large scale datasets that do not fit in available memory, SparkGIS employs a novel query pipeline that efficiently rewrites queries for optimal performance.
Our main contributions are summarized as follows:
- We implement a dynamic query re-writer that optimizes the query pipeline, accelerating query performance for in-memory working sets as well as gracefully handling data intensive spatial queries with limited distributed resources.
- We support several efficient spatial data partitioning mechanisms that provide a trade-off between low skew and higher performance for a wide variety of application scenarios.
- We utilize multilevel in-memory spatial indexes to filter the query space and apply compute intensive spatial logic to relevant data only.
- We develop an on-demand in-memory query processing engine that takes advantage of readily available libraries and multilevel indexing, and can potentially leverage external distributed resources for additional performance gain.
- Our initial comparative evaluation on real world datasets demonstrates the benefits of SparkGIS both in terms of performance for in-memory working sets and efficient memory utilization for larger datasets with limited distributed resources.
The rest of the paper is organized as follows. We first present necessary background and related work in Section 2 and provide an architectural overview of SparkGIS in Section 3. SparkGIS's dynamic query rewriting is explained in Section 4. Section 5 covers an overview of the spatial partitioning strategies employed by SparkGIS. Sections 6 and 7 discuss the implementation details of the query work flow in SparkGIS. Extensive experimental evaluation on real world datasets is presented in Section 8, followed by the conclusion.
2 BACKGROUND & RELATED WORK
2.1 Distributed Spatial Processing
Spatial data processing systems built on cloud computing frameworks have been the focus of recent research works [1, 4, 5, 11, 16]. SpatialHadoop [4] is an extension to Apache Hadoop for spatial data processing on the MapReduce framework [3]. It extends core Hadoop classes to support spatial data types and functions. Hadoop-GIS [1] presents a scalable MapReduce framework for spatial query processing with a specialized pathology image analysis add-on. It implements optimizations for spatial partitioning, partition-based parallel processing over MapReduce using the Real-time Spatial Query Engine (RESQUE), and multi-level spatial indexing. Hadoop-GIS supports a spatial bucketing algorithm that utilizes R*-Tree based global and on-demand local indexing for efficient spatial query processing. MD-HBase [11] leverages a multi-dimensional K-d and Quad-Tree based index over a key-value store to efficiently execute range and nearest neighbor queries in real time. It is built on HBase, a column oriented NoSQL distributed database that runs on top of Hadoop.
Although all of these systems exhibit comprehensive distributed functionality, they inherently have high inter-job data movement cost. For instance, Hadoop requires disk reads and writes for any data passing between interdependent jobs. This can prove to be a major performance bottleneck for spatial processing, which heavily relies on iterating over data through multiple map-reduce jobs.
2.2 Distributed In-Memory Spatial Processing
Distributed in-memory data processing systems aim to keep data in memory to facilitate multiple iterations over it by multiple dependent jobs. Apache Spark [17, 18] presents a Directed Acyclic Graph (DAG) execution engine that supports in-memory map-reduce style processing. Spark's architecture is built around an in-memory data abstraction termed a "Resilient Distributed Dataset" (RDD). An RDD represents an immutable set of distributed data elements that can be processed in parallel. Spark allows a rich set of transformations and actions that can be applied to RDDs, e.g. map, reduce, filter, sort, count and groupBy. GeoSpark [7] extends Spark with a spatial processing layer that provides support for spatial data types and functionality. It extends Spark's RDD to the Spatial Resilient Distributed Dataset (SRDD), which supports spatial operations such as range query, k-nearest neighbor (kNN) and spatial joins. Although GeoSpark uses a similar query processing approach to ours, its performance is mainly limited by the lack of any global index. SpatialSpark [15] uses a partition-and-conquer approach similar to Hadoop-GIS [1]. It partitions input data into tiles and distributes them among workers. Each worker processes its own set of tiles. The master node aggregates the results and performs post processing to handle boundary objects. The basic query processing work flow is similar to ours. However, SpatialSpark supports only a very small subset of spatial queries, is no longer maintained, and is optimized only for spatial join queries on input datasets. SIMBA [14] provides a spatial query engine and query optimizer built on SparkSQL to support in-memory distributed spatial queries. Although SIMBA supports both SQL and DataFrame style processing, it is fairly limited in its support for spatial objects. The spatial queries supported by SIMBA are kNN, distance join and kNN join which, at the time of writing, are optimized for point objects only.
LocationSpark [13] introduces several layers over Spark to mitigate skew, create spatial indexes, manage memory and execute queries on spatial data. In addition, it introduces a spatial bloom filter and embeds it in indexes to avoid the extra overhead associated with overlapping spatial data.
Despite being able to process spatial data distributively in memory, all of these frameworks share a common inherent limitation: they depend solely on Apache Spark for memory management. For datasets larger than the available main memory, Apache Spark requires extensive manual configuration to avoid running out of memory. Even so, excess data may be spilled to disk, resulting in lower performance which defeats the purpose of in-memory processing.
SparkGIS, in addition to mitigating skew through customizable partitioning algorithms, employs efficient query rewriting to process spatial data while keeping as much of it in memory as possible for best performance.
3 OVERVIEW
3.1 Common Query Datasets
To elaborate SparkGIS usability, we categorize spatial query datasets into working and faulting sets with respect to available distributed resources.
3.1.1 Working Set
The working set can be described as a dataset that fits easily in the available distributed resources, in particular distributed memory, since our target is efficient in-memory processing. This is the generally assumed case for most in-memory processing systems, and the performance advantages reported for such systems are usually associated with this category of datasets. Most contemporary in-memory spatial processing frameworks are optimized only for datasets that can be categorized as working sets. SparkGIS is designed to optimize the processing of spatial working sets by utilizing on-demand multilevel indexes and an efficient distributed query processing engine.
3.1.2 Faulting Set
The second, and mostly ignored, category is the faulting set: a dataset considerably larger than the total available distributed memory. While Spark based systems provide mechanisms to handle such datasets, these require deep understanding and expertise of both the system and the relevant features of the dataset. Even with expert configuration, the system's performance deteriorates dramatically or, in the worst case, it runs out of memory and fails the job. SparkGIS gracefully handles spatial faulting sets by dynamically rewriting the query pipeline to process only relevant partitions of spatial data in parallel.
3.2 System Overview
Our major goal in developing SparkGIS is to take maximum advantage of distributed memory to store and process spatial data, minimize I/O cost, mitigate skew through highly effective partitioning mechanisms, and achieve high performance by employing a combination of pre-generated and on-demand indexes. Additionally, we target a modular architecture which allows easy extensibility to incorporate additional distributed computational resources, as well as granting the user the ability to add arbitrary functionality on top of core spatial operations. Figure 1 outlines the basic architectural components of SparkGIS. The coordinator interface handles user sessions and exposes appropriate interfaces for data retrieval, partitioning mechanisms, indexes and spatial functionality. The data is preprocessed to generate reusable partitions and a global index to evenly distribute the query space among workers. In case of limited resources, the query optimizer generates a pipeline to load partitions that can be processed efficiently with the available resources. The partitioned space and index are then passed on to the distributed framework, Spark in our case, which reads data into memory on distributed workers. Query executor instances running in parallel on each worker create on-demand local indexes and apply appropriate spatial operations to the input data. These architectural layers can be classified into three major categories: Interface, Query Scheduler and Query Executor.
Figure 1. SparkGIS architecture
3.2.1 SparkGIS Interface
The coordinator manages user sessions and exposes spatial data types and functionality to users. It also lets the user plug in extra application specific functionality into the system.
Spatial Data Types & Functions
To operate on spatial data, SparkGIS supports basic spatial functions. Currently these include range, spatial join and k-nearest neighbor (kNN) join queries. More complex spatial queries can be created by pipelining one or more of these base queries. Additionally, non-spatial analysis functionality can be added to SparkGIS independently of core operations using SparkGIS plugins.
Plugins
The layered architecture allows SparkGIS to have pluggable extensions. These plugins can either extend SparkGIS functions or add further computation to the results of any of the spatial functions. For example, the Pathology Image Analytics plugin extends the spatial join function to generate heatmaps by computing per-tile similarity coefficient statistics for algorithm result analysis studies on medical image data.
3.2.2 Distributed Query Scheduler
SparkGIS relies on Apache Spark to cache and process spatial data in a distributed fashion. Unlike other Spark-based spatial processing frameworks [7, 13], no new distributed data structures are introduced. Instead, SparkGIS operates on and returns native Spark RDDs, minimizing the learning curve.
In addition, SparkGIS also provides a batch processor which can be configured to execute multiple concurrent spatial queries; this may significantly improve overall system performance especially for large scale analysis involving a substantial number of small data intensive queries.
3.2.3 Query Executor
SparkGIS’s layered architecture separates the actual spatial processing core from the distributed mechanism. Native C/C++ code executes locally on Spark workers to perform spatial functionality on input data.
On Demand Local In-Memory Index
Each worker creates a local in-memory index on demand for the spatial objects in its partition, if the index accelerates the processing. The local index can filter out spatial objects in the partition which do not satisfy the user specified query predicate. The same index is then reused to apply the spatial predicate to the filtered spatial objects.
Legacy Code Reuse
An additional benefit of multi-tiered design is the ability to reuse readily available implementations of efficient spatial processing algorithms in native languages. For instance, GPU enhanced spatial processing algorithms are mostly developed in native languages. Similarly, many efficient implementations for 3D spatial processing algorithms are available in C/C++. Currently, we are working on integrating spatial functionality for 3-dimensional data, as well as GPU-enhanced spatial processing algorithms in SparkGIS.
4 DYNAMIC QUERY RE-WRITING
SparkGIS tries to maximize the utilization of Apache Spark's in-memory distributed processing capability for spatial data. While Spark can provide a 10x to 100x performance gain over Hadoop-like systems, it relies on sufficient distributed resources to achieve it. The input dataset is partitioned and read into memory on worker nodes as a Spark RDD. All partitions of the RDD can be processed in parallel, giving optimal performance, if the total number of partitions is less than or equal to the total number of available executors. However, with larger datasets, Spark requires manual configuration to achieve the full benefits of in-memory distributed processing.
There are many situations where the input dataset is considerably larger than the total available distributed memory. In such cases, Spark parameters such as parallelism, executor memory, memory fraction and executor cores need to be carefully tuned by the end user. Furthermore, important mechanics such as data shuffle behavior and persistence still need to be handled for each job. Even with these configurations in place, Spark may still run out of memory if a partition grows larger than the memory available on a particular executor. A general solution to avoid such situations is to keep a portion of the RDD in memory and spill the remainder to disk. While this ensures that the job will succeed, performance generally degrades. Furthermore, since Spark is not designed to handle spatial data, the data is spilled without regard to any spatial properties, further deteriorating overall spatial query performance.
To this end, SparkGIS provides a novel query pipeline which helps alleviate such situations. It takes into consideration the available resources as well as the estimated dataset size. Instead of loading everything into memory as an RDD, SparkGIS pre-generates a set of reusable parameters for the dataset according to user-defined options such as the partitioning scheme and number of partitions. These parameters include (1) the total number of spatial objects, (2) the number of partitions, (3) a global partition index, and (4) per-partition local spatial indexes. Using these parameters, the average load per partition L is estimated as

L = (α × T) / P

where
α = estimated spatial object size
T = total number of spatial objects
P = number of partitions
The following simple heuristic estimates the memory available to a single Spark task:

E = (M × F × S) / C

where
E = estimated memory per Spark task
M = spark.executor.memory
F = spark.shuffle.memoryFraction
S = spark.shuffle.safetyFraction
C = spark.executor.cores
Using L and E, the number of concurrent spatial partitions N to load and process in parallel is estimated such that

N × L ≤ E
At a given time, a sliding window of N partitions is processed, until results from all partitions have been successfully computed. This approach ensures that as much relevant spatial data as possible resides in memory for processing at any given time.
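Assuming L = (α × T) / P and E = (M × F × S) / C as defined above, the window-size estimation can be sketched in a few lines. The function name and the flooring of E / L are our illustrative choices, not SparkGIS's API:

```python
def estimate_concurrent_partitions(obj_size, total_objects, num_partitions,
                                   executor_memory, memory_fraction,
                                   safety_fraction, executor_cores):
    """Estimate N, the number of partitions processed per sliding window."""
    # L: average in-memory load per partition
    load_per_partition = (obj_size * total_objects) / num_partitions
    # E: memory a single Spark task can safely use
    memory_per_task = (executor_memory * memory_fraction
                       * safety_fraction) / executor_cores
    # Largest N with N * L <= E; always process at least one partition
    return max(1, int(memory_per_task // load_per_partition))
```

For example, with 1 KB objects, 10 million objects, 1000 partitions, 4 GB executors, F = 0.2, S = 0.8 and 4 cores per executor, each task gets roughly 164 MB and each partition holds roughly 10 MB, giving a window of 16 partitions.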
5 DISTRIBUTED SPATIAL PARTITIONING
One of the major performance bottlenecks of real world distributed applications is non-uniform data distribution, usually termed data skew. Considerable research [2, 8, 9] has gone into mitigating skew issues in the traditional MapReduce paradigm. To handle data skew, SparkGIS employs partitioning optimizations to evenly distribute data among workers.
SparkGIS supports several partitioning algorithms for distributed query processing. Fixed Grid (FG) partitions the query space into a grid of equal sized tiles and generally performs well for uniformly distributed datasets. Binary Space Partitioning (BSP) recursively divides the space until each partition has at most t spatial objects, where the threshold t is user-specified. Quad Tree (QT) is a similar recursive strategy; however, unlike BSP it divides the space into 4 equal tiles in each iteration and thus can potentially create empty or sparse partitions. Strip Partitioning (SLC) slices rectangular regions from the space one by one such that each slice contains approximately t objects. Boundary Optimized Strip (BOS) partitioning is a boundary object-aware extension of SLC that minimizes the number of cross-boundary objects. Both SLC and BOS have loglinear complexity with respect to the number of spatial objects. Unlike the rest of the algorithms, the Hilbert Curve (HC) and Sort Tile Recursive (STR) mechanisms generate overlapping partitions, which can potentially result in a large replication factor. However, this may be justified if it results in good partitioning and thus good overall query performance. For example, in our experiments HC partitioning resulted in the minimum number of partitions to be scanned for range queries.
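As an illustration of the simplest of these schemes, a Fixed Grid partitioner over normalized bounding boxes can be sketched as follows. This is a hypothetical stand-alone function, not SparkGIS code; note how objects crossing tile borders are replicated into every tile they overlap:

```python
def fixed_grid_partition(mbbs, rows, cols):
    """Assign each normalized MBB (xmin, ymin, xmax, ymax) in [0, 1) to every
    grid tile it overlaps; boundary objects are replicated across tiles."""
    tiles = {}
    for obj_id, (xmin, ymin, xmax, ymax) in enumerate(mbbs):
        # Range of tile columns/rows the MBB spans
        c0, c1 = int(xmin * cols), min(int(xmax * cols), cols - 1)
        r0, r1 = int(ymin * rows), min(int(ymax * rows), rows - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                tiles.setdefault((r, c), []).append(obj_id)
    return tiles
```

On a skewed dataset most objects land in a few tiles, which is exactly the behavior the adaptive schemes (BSP, QT, SLC, BOS) are designed to avoid.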
Our preliminary experiments on Pathology Imaging (PI) and Open Street Map (OSM) datasets show BSP to be the optimal partitioning algorithm for general spatial join and nearest neighbor queries whereas STR and QT also perform reasonably well.
6 SPARKGIS SPATIAL QUERY WORKFLOW
A typical query in SparkGIS is split into two major stages: loading and query execution, where loading is a prerequisite for query execution. However, loading is a one-time process whose results can be reused by multiple queries in the same session or, if optionally persisted, by queries across different sessions. Figure 2 illustrates the stages involved in the spatial query work flow.
Figure 2.
A typical spatial query work flow in SparkGIS
6.1 Load Stage
The load stage is a one-time preprocessing step for input spatial data. While this stage involves I/O overheads associated with reading data into memory and shuffling to group spatial objects with respect to partitions, SparkGIS optimizes it by pipelining multiple independent jobs. To do so, this stage can further be divided into data retrieval and pre-processing steps. Algorithm 1 presents the logical flow for both of these steps.
6.1.1 Data Retrieval
The data retrieval step is responsible for retrieving spatial objects from a user-specified data source. SparkGIS supports all storage sources supported by Apache Spark, including the local file system, HDFS, Amazon S3 and Cassandra. Similarly, third party storage sources that natively support Spark integration, such as MongoDB, can also be used to query spatial data using SparkGIS. SparkGIS natively supports reading spatial data in well-known text (WKT) and well-known binary (WKB) formats. Alternatively, for arbitrary data formats, SparkGIS provides a simple interface to extract spatial information from input data.
To further enhance performance and facilitate better memory utilization, SparkGIS keeps all spatial data in serialized form throughout the query pipeline. Spatial objects are deserialized on demand during query execution, and only when their bounding boxes satisfy the user-specified query predicate. Furthermore, serialized data can optionally be compressed to further reduce the overall memory footprint. By default, compression is not enabled in SparkGIS as it introduces extra decompression overheads.
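The filter-before-deserialize idea can be illustrated with a small sketch. The record layout (a stored bounding box alongside the serialized geometry bytes) and the `parse_geometry` callback are our assumptions for illustration, not SparkGIS's actual storage format:

```python
def query_serialized(records, query_box, parse_geometry):
    """Deserialize a record's full geometry only when its stored bounding box
    intersects the query box (cheap filter step precedes costly parsing)."""
    qx1, qy1, qx2, qy2 = query_box
    results = []
    for mbb, geom_bytes in records:      # records stay serialized in memory
        x1, y1, x2, y2 = mbb
        if x1 <= qx2 and qx1 <= x2 and y1 <= qy2 and qy1 <= y2:
            results.append(parse_geometry(geom_bytes))  # deserialize on demand
    return results
```

Records whose bounding boxes fall outside the query box never pay the deserialization cost, which is where the memory and CPU savings come from.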
6.1.2 Data Preparation
After retrieving spatial data from the storage source, the load stage prepares the spatial metadata required for actual spatial query processing. This preparation mainly consists of dataset-specific tasks that can be performed independently of each other. Despite being on the critical path of query execution, the main reason for separating this step from query processing is to avoid redundant computation and further exploit available parallelism in the spatial query work flow. Consequently, this improves overall system performance along with better distributed resource utilization.
For efficient spatial query execution in SparkGIS, the data preparation stage implements several preprocessing steps on the data retrieved from sources: (1) Minimum Bounding Rectangles (MBRs) are computed for all spatial objects in the dataset; (2) the minimum bounds of the space encompassing all of the MBRs are computed; (3) the MBRs are normalized with respect to the bounds computed in step 2, that is, each dimension of an MBR is mapped to a value in [0.0, 1.0); (4) the space of MBRs is partitioned into tiles using the user-specified spatial partitioning algorithm; (5) a spatial index (an R-tree in our current implementation) is created on the set of tiles; (6) all spatial objects are mapped to the tiles using the spatial index.
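Steps 2 and 3 (computing the global bounds and normalizing each MBR into [0.0, 1.0)) can be sketched as follows; this is an illustrative stand-alone function, not SparkGIS's implementation:

```python
def prepare(mbrs):
    """Compute global space bounds (step 2) and normalize every MBR
    (xmin, ymin, xmax, ymax) with respect to those bounds (step 3)."""
    xs1, ys1, xs2, ys2 = zip(*mbrs)
    bounds = (min(xs1), min(ys1), max(xs2), max(ys2))   # step 2
    bx1, by1, bx2, by2 = bounds
    w, h = bx2 - bx1, by2 - by1
    normalized = [((x1 - bx1) / w, (y1 - by1) / h,       # step 3
                   (x2 - bx1) / w, (y2 - by1) / h)
                  for x1, y1, x2, y2 in mbrs]
    return bounds, normalized
```

Normalization makes the subsequent partitioning step (step 4) independent of the dataset's coordinate system and extent.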
7 DISTRIBUTED SPATIAL QUERY PROCESSING
The last step in the work flow is to apply actual spatial query logic on the prepared data. Algorithm 2 illustrates a generic spatial query work flow in SparkGIS. After estimating N as described in Section 4, a distributed sliding window of N partitions is maintained in memory. The partitions in the window are distributed among Spark worker nodes. Each node has a native shared library which implements the spatial functions. Appropriate spatial functions are applied in parallel to all in memory data partitions.
Algorithm 1.
SparkGIS Load And Prepare Data
procedure AsyncPrepareData(dataset)
    binData = inputSrc.getDataAsBinaryRDD(dataset)
    Initialize List mbbs
    Initialize dataConfig.space
    for each spatialObject in binData do
        mbb = extractMBB(spatialObject)
        /* compute space dimensions and properties */
        dataConfig.space.append(mbb)
        mbbs.add(mbb)
    end for
    dataConfig.localIndex = mbbs.createLocalIndex()
    /* estimate parameters for load balancing */
    dataConfig.estimateParams()
    return dataConfig
end procedure

procedure Load(datasets, partitionMethod)
    /* parallel for loop */
    Initialize List configs
    for each dataset in datasets do
        configs.add(AsyncPrepareData(dataset))
    end for
    Partition dataspace into tiles using partitionMethod
    Build global index
    Use global index to map spatial objects to tiles
end procedure
Algorithm 2.
SparkGIS Spatial Query
procedure ExecuteQuery(p, preparedData)
    Load a window w of p partitions from preparedData
    while preparedData has partitions do
        Execute native function on all partitions in w
        Keep sliding window to add unprocessed partitions
    end while
end procedure

procedure SpatialQuery(datasets, partitioner)
    if dataset not prepared then
        preparedData = Load(datasets, partitioner)
    else
        preparedData = LoadFromMemory(datasets, partitioner)
    end if
    /* load balancing logic */
    N = estimatePartitions()
    ExecuteQuery(N, preparedData)
end procedure
7.1 Spatial Containment Query
The logical implementation of the spatial containment query in SparkGIS is presented in Algorithm 3. The native code expects a tile/partition with all of its objects, and a spatial containment region. The global index is used to determine whether the tile (1) does not intersect the query region at all, (2) partially intersects it, or (3) is fully contained in it. For case (1), no further computation is required and the worker simply exits. For case (2), the worker determines the objects in the tile which are fully contained in the query region and appends them to the result. If the query region is large enough to contain the whole tile, as in case (3), no further spatial processing is required and all objects are appended to and returned as the query result.
Algorithm 3.
SparkGIS Spatial Containment
procedure NativeSpatialContainment(tile, region)
    /* use global index */
    if region contains tile then
        append all objects in tile to result
    else
        if region intersects tile then
            for each qualifying object in tile do
                if region contains object then
                    append object to result
                end if
            end for
        end if
    end if
end procedure
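The three global-index cases of the containment query can be illustrated on axis-aligned boxes; this is a hedged stand-alone sketch of Algorithm 3's logic, not the native implementation:

```python
def containment_on_tile(tile_box, objects, region):
    """Three cases: disjoint -> no work; tile inside region -> all objects
    qualify; partial overlap -> test each object individually."""
    def contains(outer, inner):
        return (outer[0] <= inner[0] and outer[1] <= inner[1] and
                outer[2] >= inner[2] and outer[3] >= inner[3])

    def intersects(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    if not intersects(region, tile_box):          # case 1: worker exits
        return []
    if contains(region, tile_box):                # case 3: whole tile qualifies
        return list(objects)
    return [o for o in objects if contains(region, o)]   # case 2: per-object test
```

Cases 1 and 3 never touch individual objects, which is where the global index saves computation.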
Algorithm 4.
SparkGIS Spatial Join
procedure NativeSpatialJoin(tile, predicate)
    dataset1  /* all objects from dataset1 in tile */
    dataset2  /* all objects from dataset2 in tile */
    /* local R*-tree index */
    localIndex.build(dataset2)
    for each object in dataset1 do
        /* use local index to get qualifying objects */
        for each qualifying object from dataset2 do
            if MBB and predicate satisfied then
                append objects to result
            end if
        end for
    end for
end procedure
7.2 Spatial Join Query
The native spatial join query processing logic in SparkGIS is listed in Algorithm 4. All data from both datasets is distributed among the available worker nodes with respect to their tiles/partitions. The native functions take these tiles and their respective data, along with a user-specified predicate, as input to process them distributively and generate query results. At each worker, every object in the tile is categorized as belonging to either dataset-1 or dataset-2. Instead of joining all objects from both datasets, a local R*-tree spatial index is created on the partition boundaries of dataset-2. This local index is then used to get the set of qualifying objects from dataset-2 whose bounding boxes satisfy the predicate relation with each object from dataset-1. The actual spatial predicate is only applied to this subset of qualifying objects, reducing the overall computational cost of the spatial join query. Supported predicates for the spatial join query are intersects, touches, contains, within, equals and overlaps.
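The filter-and-refine structure of the join can be sketched as below. For brevity a linear scan stands in for the local R*-tree, so only the MBB filter followed by predicate refinement is shown; this is an illustrative sketch, not the native implementation:

```python
def spatial_join_tile(dataset1, dataset2, predicate):
    """Join objects of one tile: a bounding-box filter (standing in for the
    local R*-tree lookup) followed by refinement with the exact predicate."""
    def mbb_overlap(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    results = []
    for id1, mbb1 in dataset1:
        for id2, mbb2 in dataset2:       # an R*-tree would prune this scan
            if mbb_overlap(mbb1, mbb2) and predicate(mbb1, mbb2):
                results.append((id1, id2))
    return results
```

The expensive geometric predicate runs only on pairs that survive the cheap MBB test, which is the point of the filter-and-refine design.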
Algorithm 5.
SparkGIS Spatial kNN
procedure NativeSpatialKNN(tile, k, searchRadius)
    dataset1  /* all objects from dataset1 in tile */
    dataset2  /* all objects from dataset2 in tile */
    /* local R*-tree index */
    localIndex.build(dataset2)
    getObjectDensity()
    for each object in dataset1 do
        while searchRadius < space dimensions do
            /* use local index to get qualifying objects */
            if qualifying objects > (k * 1.5) then
                break
            else
                searchRadius *= 2
            end if
        end while
    end for
    for each qualifying object from dataset2 do
        Compute distance
    end for
    Sort(distance, neighbors)
    return first k results
end procedure
7.3 Spatial kNN Query
Similar to the spatial join query, the native kNN query implementation in SparkGIS also expects the set of spatial objects belonging to a particular tile/partition, which can be processed distributively. Per-tile configuration objects from the preloaded data are passed on to each worker, where the kNN logic presented in Algorithm 5 is applied after classifying the objects into datasets 1 and 2. The initial global index is used to filter out tiles which are not in the vicinity of the tile containing the starting object. The local R*-tree index is then used to retrieve qualifying objects in the search radius. The search radius is iteratively expanded until enough objects are found to satisfy the query. Once found, the resulting objects are sorted and the first k are returned to the master.
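The radius-doubling search can be sketched for 2D points as follows. The 1.5 × k candidate threshold follows Algorithm 5, while the termination bound (here the space diagonal) is our stand-in for the "space dimensions" check; a linear scan again stands in for the R*-tree lookup:

```python
import math

def knn(query_pt, points, k, search_radius, space_diag):
    """Double the search radius until roughly 1.5*k candidates fall inside,
    then sort candidates by distance and keep the first k."""
    def candidates_within(r):
        return [p for p in points if math.dist(query_pt, p) <= r]

    candidates = candidates_within(search_radius)
    while len(candidates) < k * 1.5 and search_radius < space_diag:
        search_radius *= 2                      # expand and retry
        candidates = candidates_within(search_radius)
    return sorted(candidates, key=lambda p: math.dist(query_pt, p))[:k]
```

Gathering slightly more than k candidates before sorting guards against missing a true neighbor that lies just outside the first radius that contains k points.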
7.4 Batch Processing
Executing a single job at a time when a large number of datasets need to be processed results in low overall throughput and underutilized cluster resources. To better support such use cases, SparkGIS natively allows multiple concurrent stages of the workflow to be overlapped in batches for throughput-sensitive workloads. Each workflow, comprising several stages as described in Section 6, is treated as a separate context. An internal, configurable thread pool keeps multiple spatial query contexts alive concurrently in the system, and each context further parallelizes its query processing stages. This execution strategy, which pipelines independent stages of each workflow when processing multiple datasets, yields better throughput and more efficient resource utilization for batch-style jobs.
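The thread-pool batching described above might look like the following sketch. The stage names and functions are illustrative placeholders, not SparkGIS's API; the point is that a bounded pool keeps several query contexts in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query_context(dataset_id):
    """One spatial query context: runs its workflow stages in order.
    Stage names here are illustrative stand-ins for the stages of
    Section 6; each would submit distributed work in a real system."""
    results = []
    for stage in ("load", "partition", "index", "query"):
        results.append(f"{dataset_id}:{stage}")
    return results

def run_batch(dataset_ids, pool_size=4):
    """Keep up to pool_size query contexts alive concurrently, so the
    independent stages of different workflows overlap in time."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(run_query_context, dataset_ids))
```

Because the contexts are independent, one dataset's compute-heavy query stage can overlap another dataset's IO-bound load stage, which is the source of the throughput gain for batch jobs.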
8 EXPERIMENTAL EVALUATION
We evaluate SparkGIS with respect to spatial query performance in terms of job completion time 1) under normal resources and 2) under limited resources. We also share insights on SparkGIS's 3) memory footprint for a spatial query job with respect to data size, and 4) scalability with respect to data size and number of workers.
8.1 Experimental Setup
We used a CentOS 6.5 cluster with 10 nodes and 400 total cores. Each node had 40 logical cores with hyper-threading (Intel(R) Xeon(R) CPU E5-2660 v2 at 2.20GHz) and 512 GB of memory. A 10 Gb InfiniBand interconnect was used for inter-node communication.
8.1.1 Spark Configuration
We employed Apache Spark version 2.1.0 as our cluster computing framework. We configured two separate environments to analyze SparkGIS’s performance under general and limited resource settings.
General Setting
For the general setting we used all the resources available in the cluster described in Section 8.1. Important parameters such as spark.memory.fraction, spark.memory.storageFraction, spark.executor.memory, etc. were set to their default values. To create fewer intermediate files, spark.shuffle.consolidateFiles was set to true during transformations. We also configured the Kryo serializer, recommended by Apache Spark, to serialize/deserialize objects more efficiently during the job workflow.
Limited Resource Setting
In order to limit the available resources, we configured Spark to use a limited number of executors with less memory available to execute each task. Several Spark parameters control this behavior; they were set as follows:
| spark.executor.memory | 1g |
| spark.memory.fraction | 0.2 |
| spark.memory.storageFraction | 0.8 |
| spark.executor.cores | 2 |
| spark.cores.max | 2 |
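These settings could be applied at submission time as follows; this is a hypothetical invocation (the application jar name is a placeholder), shown only to illustrate how the parameters above map onto spark-submit flags.

```shell
# Hypothetical spark-submit invocation for the limited-resource setting.
spark-submit \
  --conf spark.executor.memory=1g \
  --conf spark.memory.fraction=0.2 \
  --conf spark.memory.storageFraction=0.8 \
  --conf spark.executor.cores=2 \
  --conf spark.cores.max=2 \
  sparkgis-app.jar
```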
8.2 Dataset Description
Our experimentation was based on two real-world datasets: pathology imaging (PI) and open street map (OSM).
8.2.1 Pathology Imaging
This dataset was composed of boundary segmentations of micro-anatomic objects, such as tumor regions and nuclei, obtained from high-resolution pathology images of whole slide tissue specimens. The images were provided by Stony Brook University Hospital. Spatial boundaries had been validated, normalized, preprocessed and stored on the Hadoop Distributed File System (HDFS) in Well-Known Text (WKT) format. The datasets varied from 100 images (approx. 7.5 GB, approx. 70 million spatial objects) to 400 images (approx. 28 GB, approx. 150 million spatial objects). HDFS was configured over 10 nodes, each having 24 cores with hyper-threading (Intel(R) Xeon(R) CPU E5-2660 v2 at 2.20GHz) and 62 GB of memory, connected through a 10 Gb InfiniBand interconnect. Most HDFS configuration parameters were set to default values, with the replication factor set to 3.
8.2.2 Open Street Map
OSM [10] is a large-scale map project built through extensive collaborative contributions from a large number of community users. It contains spatial representations of geometric features such as lakes, forests, buildings and roads. Spatial objects are represented by specific types such as points, lines and polygons. We downloaded the dataset from the official website and loaded it into HDFS in text format for spatial processing.
8.3 Query Cases
We used three typical queries for the benchmark: spatial containment, spatial join and nearest neighbor query. Many other complex queries can be decomposed into these basic queries.
In order to benchmark the spatial join query, we employed the real-world use case of heatmap generation by computing the similarity coefficient [6] for the results of multiple tissue analysis algorithms on the PI dataset, using multiple sets of algorithms on image datasets of varying size. Another, similar spatial join query was constructed on the OSM dataset to find changes in spatial objects between two snapshots.
Similarly, we constructed a spatial containment query to retrieve all objects (segmented nuclei) within a given region, where the query region covered a large area of the image space.
For the kNN query, we computed the k nearest neighbor nuclei of a randomly selected central nucleus. The value of k was varied across experiment runs.
8.4 Working Set Query Benchmarking
We configured Apache Spark to use all available cluster resources as described in Section 8.1. We also made sure that the available memory, storage and number of executors far exceeded the minimum requirement for optimal query processing on the evaluation datasets. By doing so, we avoided any extensive memory pressure, data spilling or resource contention for Spark jobs, so all spatial query platforms were able to take full advantage of distributed in-memory processing. The major purpose of this set of experiments was to confirm that SparkGIS performs as well as or better than contemporary spatial processing frameworks on typical spatial queries.
It is also worth noting that GeoSpark, LocationSpark and SpatialSpark implement each spatial query as a single Spark job; executing the same query on the same dataset requires the same query steps to be executed again. SparkGIS separates the load stage from actual query processing, which allows preprocessed data to be reused across different query sessions. For fairness, we started recording job execution time after the data was loaded in memory.
8.4.1 Spatial Join Query
Figure 3a shows that all three systems exhibit good performance and scalability for the spatial join query. The results show that SpatialSpark slightly outperforms SparkGIS in terms of query runtime, mainly because SpatialSpark is optimized for such query types. Additionally, it provides two flavors of spatial join: broadcast spatial join, for joining a large dataset with a relatively smaller one, and partitioned spatial join, for general-purpose spatial join queries. We defined a threshold ratio, based on the sizes of the joined datasets, to invoke the appropriate spatial join strategy for SpatialSpark.
Figure 3.
Performance comparison of SparkGIS with existing spatial processing frameworks for PI datasets
SparkGIS's performance advantage over GeoSpark stems mainly from the fact that SparkGIS utilizes both local and global indexes to select the subset of relevant spatial objects and applies the predicate only to them, reducing the overall computational cost of the spatial query. Even though GeoSpark is able to use the local index embedded in each SRDD, it cannot take advantage of a global index to accelerate the spatial query.
8.4.2 Spatial Containment Query
As described in Section 7.1, the spatial containment query is more IO bound than computationally heavy: most of the spatial processing is done only for tiles that partially intersect the query region. This makes the containment query's performance largely dependent on the ability to iteratively process data in memory. Figure 3b shows the comparison results for the spatial containment query on PI datasets. All three systems perform equally well in terms of query performance.
8.4.3 K Nearest Neighbor Query
For the kNN query, SparkGIS outperforms GeoSpark on the PI dataset; SpatialSpark does not support spatial kNN queries. Figures 3c and 3d show these results. The lack of a global index in GeoSpark limits its kNN performance: unlike SparkGIS, which can prune a large number of spatial objects using the global index before scanning for nearest neighbors, GeoSpark has to go through all spatial objects irrespective of their relevance to the query. Figure 3c shows kNN query performance with respect to varying data size. Since the query finds the k nearest neighbors within a single tile, SparkGIS's execution time stays the same across data sizes, whereas GeoSpark has to scan all the data due to the lack of a global index. Similarly, in Figure 3d, for constant data size and varying k, GeoSpark exhibits much higher, constant execution times, whereas SparkGIS's execution time increases with the value of k. The lack of a global index therefore severely limits GeoSpark's suitability for big spatial data processing involving kNN queries or their variants.
8.5 Faulting Set Query Benchmarking
In order to evaluate SparkGIS's performance with limited resources, we configured Spark workers to spawn only a limited number of executors with variable total memory. We then compared SparkGIS with the other systems in terms of whether jobs completed successfully or ran out of memory and failed. Table 1 lists the outcomes of this experiment. GeoSpark and LocationSpark operate on objects of Spatial RDDs (SRDDs) instead of primitive types, bloating memory usage. GeoSpark additionally embeds a spatial R-tree index in each SRDD, which further increases the overall job's memory footprint; it thus runs out of memory sooner than the other frameworks under limited resources. SpatialSpark has a workflow similar to that of SparkGIS and correspondingly lower memory requirements. Unlike these systems, SparkGIS keeps spatial data in serialized form throughout the query pipeline, deserializing spatial objects on demand only when required by the user-specified spatial query predicate. Figure 5 depicts the memory usage trends of SparkGIS, GeoSpark and SpatialSpark.
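The serialized-form strategy can be illustrated with a small sketch: keep the WKT string around and parse it at most once, on first access. This is a hypothetical illustration, not SparkGIS code, and the parser deliberately handles only POINT geometries to stay minimal.

```python
# Keep spatial data serialized (here: WKT strings) and deserialize only
# when a predicate actually needs the geometry. Parsed objects are far
# larger than their serialized form, so lazy parsing lowers the job's
# steady-state memory footprint.

def parse_wkt_point(wkt):
    """'POINT (x y)' -> (x, y); deliberately minimal for illustration."""
    coords = wkt[wkt.index("(") + 1 : wkt.index(")")].split()
    return float(coords[0]), float(coords[1])

class LazyObject:
    """Holds the serialized form; parses it at most once, on first access."""
    def __init__(self, wkt):
        self.wkt = wkt        # compact serialized representation
        self._geom = None     # deserialized geometry, filled on demand

    @property
    def geom(self):
        if self._geom is None:
            self._geom = parse_wkt_point(self.wkt)
        return self._geom
```

Objects that are filtered out by the global or local index never trigger parsing at all, which is where most of the memory saving comes from.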
Table 1.
Job completion or failure under limited resource settings for spatial join query for PI dataset without spilling to disk.
| Frameworks | Dataset (Number of Images) | |||
|---|---|---|---|---|
| 100 | 200 | 300 | 400 | |
| SpatialSpark | Success | Success | Fail | Fail |
| GeoSpark | Success | Fail | Fail | Fail |
| LocationSpark | Success | Success | Fail | Fail |
| SparkGIS | Success | Success | Success | Success |
Figure 5.

Memory footprint for spatial query job with respect to data size.
For the ideal case, we configured Spark to keep everything in memory only. Section 8.4 presents in-depth results for this scenario with working sets. Repeating the same experiments with faulting sets, however, yielded out-of-memory errors in the other systems. To avoid this, Spark-based systems usually allow such datasets to persist in memory as much as possible and spill to disk whenever necessary; however, this degrades overall system performance by orders of magnitude. Nonetheless, to study this, we set the persistence of Spark RDDs (and GeoSpark's Spatial RDDs) to MEMORY_AND_DISK.
8.5.1 Spatial Join Query
Figure 4a shows the performance comparison of SparkGIS with the other systems under limited available resources. For memory-only settings, as the dataset size increases, the other systems' memory pressure grows until they run out of memory, resulting in job failure (Table 1). SparkGIS's performance is also affected by increasing dataset size, since with limited resources only N partitions can be processed in parallel.
Figure 4.

SparkGIS performance under limited resource setting
For configurations where data can be spilled to disk, the performance of the other platforms deteriorates. SparkGIS does not need to aggressively spill to disk since it only loads data that fits in memory. None of the platforms can process the whole dataset in parallel; however, SparkGIS still outperforms the others because vanilla Spark spills data to disk without considering any spatial features, potentially causing relevant spatial data to be spilled.
8.5.2 Spatial Containment Query
A performance pattern similar to that of the spatial join query can be observed for the containment query. Using the global index, SparkGIS loads only N partitions into memory, discards irrelevant partitions and keeps only relevant data in memory, which accelerates overall query execution.
8.5.3 k Nearest Neighbor Query
The major performance trend is similar, for the same reasons described above; however, kNN queries are further affected by data spilling, since kNN involves additional iterations over the same data to determine the k results. Because SparkGIS only loads relevant partitions into memory, its performance is minimally affected, unlike the other systems, whose relevant data gets spilled to disk and must be read again when not enough results have been computed. Figure 4b illustrates these results.
8.6 Scalability
8.6.1 PI Dataset
Figure 6a shows the scalability of SparkGIS for different datasets with a varying number of parallel processing units. A continuous decline in execution time can be observed as the number of workers increases, with nearly linear speedup; e.g., execution time is halved when the number of workers is increased from 24 to 48. The average query time per image is about 3 seconds for 100 images with 120 parallel processing units. The figure also shows very good scale-up behavior: processing time increases linearly with data size, and the time for processing the 400-image dataset is roughly 4 times that for the 100-image dataset with 120 distributed workers.
Figure 6.

SparkGIS scalability evaluation results
8.6.2 OSM Dataset
A spatial join query was constructed over two snapshots of the OSM dataset to find the objects that changed between them. Multiple invocations of the query were executed with a varying number of parallel processing units to evaluate the scalability of SparkGIS on the OSM dataset. Figure 6b shows that SparkGIS exhibits very good scalability for the spatial join query: with an increasing number of processing units, execution time decreases continuously with almost linear speedup.
9 CONCLUSION
Effective support of spatial queries on massive-scale spatial data demands not only high scalability to parallelize query processing, but also low latency to achieve fast query response. While MapReduce provides a simple vehicle for developing scalable spatial computing systems, it intrinsically takes an IO-centric approach to data communication and iterative data processing, which creates a significant barrier to fast query processing. Alternatively, when processing large datasets with limited resources, Spark-based distributed in-memory spatial frameworks either fail jobs or fall back to disk IO due to high memory pressure. We developed a resource-aware, in-memory spatial querying system, SparkGIS, with a dynamic query rewriter to gracefully query and analyze large spatial datasets with limited distributed resources. SparkGIS supports several partitioning algorithms that mitigate skew in spatial datasets, and implements a parallel on-demand query engine which, with the help of global and local indexes, provides high scalability and minimizes query response time. We have shown that SparkGIS gracefully handles data-intensive spatial queries under limited resources. Additionally, with enough available resources, SparkGIS's performance is comparable to other Spark-based in-memory spatial processing frameworks.
Acknowledgments
This work was funded in part by HHSN261200800001E from the NCI, 1U24CA180924-01A1 from the NCI, 5R01LM011119-05, 5R01LM009239-07 from the NLM, and NSF ACI 1443054 and IIS 1350885.
Contributor Information
Furqan Baig, Stony Brook University.
Hoang Vo, Stony Brook University.
Tahsin Kurc, Stony Brook University.
Joel Saltz, Stony Brook University.
Fusheng Wang, Stony Brook University.
References
- 1. Aji Ablimit, Wang Fusheng, Vo Hoang, Lee Rubao, Liu Qiaoling, Zhang Xiaodong, Saltz Joel. Hadoop GIS: A High Performance Spatial Data Warehousing System over MapReduce. Proc VLDB Endow. 2013 Aug;6(11):1009–1020. doi: 10.14778/2536222.2536227.
- 2. Ananthanarayanan Ganesh, Kandula Srikanth, Greenberg Albert G, Stoica Ion, Lu Yi, Saha Bikas, Harris Edward. Reining in the Outliers in Map-Reduce Clusters using Mantri.
- 3. Dean Jeffrey, Ghemawat Sanjay. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–113.
- 4. Eldawy Ahmed. SpatialHadoop: Towards Flexible and Scalable Spatial Processing Using MapReduce. Proceedings of the 2014 SIGMOD PhD Symposium (SIGMOD'14 PhD Symposium); New York, NY, USA: ACM; 2014. pp. 46–50.
- 5. Frye Roger, McKenney Mark. Big Data Storage Techniques for Spatial Databases: Implications of Big Data Architecture on Spatial Query Processing. In: Information Granularity, Big Data, and Computational Intelligence. Springer; 2015. pp. 297–323.
- 6. Jaccard Paul. Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr Corbaz; 1901.
- 7. Wu Jinxuan, Yu Jia, Sarwat Mohamed. GeoSpark: A Cluster Computing Framework for Processing Large-Scale Spatial Data. Proceedings of the 2015 International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2015); 2015.
- 8. Kwon YongChul, Balazinska Magdalena, Howe Bill, Rolia Jerome. SkewTune: mitigating skew in MapReduce applications. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data; 2012.
- 9. Kwon YongChul, Balazinska Magdalena, Howe Bill, Rolia Jerome. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. Proceedings of the 1st ACM Symposium on Cloud Computing; 2010.
- 10. Open Street Map. OSM; 2017. http://www.openstreetmap.org.
- 11. Nishimura Shoji, Das Sudipto, Agrawal Divyakant, El Abbadi Amr. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Proceedings of the 2011 IEEE 12th International Conference on Mobile Data Management - Volume 01 (MDM '11); Washington, DC, USA: IEEE Computer Society; 2011. pp. 7–16.
- 12. Apache Spark. Spark Web. 2017. http://spark.apache.org.
- 13. Tang Mingjie, Yu Yongyang, Malluhi Qutaibah M, Ouzzani Mourad, Aref Walid G. LocationSpark: a distributed in-memory data management system for big spatial data. Proceedings of the VLDB Endowment. 2016;9(13):1565–1568.
- 14. Xie Dong, Li Feifei, Yao Bin, Li Gefei, Zhou Liang, Guo Minyi. Simba: Efficient In-Memory Spatial Analytics. Proceedings of the 35th ACM SIGMOD International Conference on Management of Data (SIGMOD'16); 2016.
- 15. You Simin, Zhang Jianting. Large-Scale Spatial Join Query Processing in Cloud. Technical Report. City University of New York; 2015.
- 16. You Simin, Zhang Jianting, Gruenwald L. Large-scale spatial join query processing in cloud. IEEE CloudDM Workshop; 2015. http://www-cs.ccny.cuny.edu/~jzhang/papers/spatial_cc_tr.pdf.
- 17. Zaharia Matei, Chowdhury Mosharaf, Das Tathagata, Dave Ankur, Ma Justin, McCauley Murphy, Franklin Michael J, Shenker Scott, Stoica Ion. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12); Berkeley, CA, USA: USENIX Association; 2012. pp. 2–2. http://dl.acm.org/citation.cfm?id=2228298.2228301.
- 18. Zaharia Matei, Chowdhury Mosharaf, Franklin Michael J, Shenker Scott, Stoica Ion. Spark: Cluster Computing with Working Sets. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10); Berkeley, CA, USA: USENIX Association; 2010. pp. 10–10. http://dl.acm.org/citation.cfm?id=1863103.1863113.


