Abstract
The rapid growth of sensor data in IoT and Digital Twins necessitates high-performance spatial indexing. Traditional indexes like Rtrees suffer from high storage overhead, while state-of-the-art learned indexes like GLIN encounter a “Refinement Bottleneck” due to coarse-grained Minimum Bounding Rectangle (MBR) filtering. Furthermore, existing solutions often trade update throughput for query accuracy, failing in dynamic IoT workloads with concurrent reads and writes. We propose DyGLIN (Dynamic Generate Learning-Based Index), a dynamic, read-optimized learned spatial index tailored for high-frequency sensor streams. DyGLIN introduces a decoupled leaf architecture separating query processing from data maintenance. To accelerate queries, we implement a hierarchical filtering pipeline using hierarchical MBRs (HMBR) and Cuckoo Filters to aggressively prune false positives. For maintenance, a Delta Buffer mechanism amortizes update costs, while logical deletion ensures high throughput. Experiments on real-world datasets show that DyGLIN reduces query latency by 26.4% [95% CI: 20.1%, 38.6%] compared to GLIN. It achieves 30.0% [95% CI: 21.4%, 35.9%] higher insertion throughput and superior deletion performance, with only an 18.5% [95% CI: 16.8%, 19.8%] increase in memory overhead.
Keywords: sensor data streams, IoT, learned index, spatial indexing, throughput
1. Introduction
In the era of the Internet of Things (IoT) and Digital Twins, the ingestion of high-frequency sensor data streams has reached an unprecedented scale [1,2,3]. From the continuous stream of GPS trajectories [4,5] generated by ride-hailing platforms like Uber and DiDi to the complex polygon data for managing urban planning and environmental monitoring, spatial data management serves as the backbone of modern sensor-driven infrastructure [6,7]. This challenge is particularly pronounced in edge-centric environments where data must be processed in real time under dynamic workload conditions [8]. Unlike traditional relational data, sensor-captured spatial information is characterized by multi-dimensionality and complex geometric shapes (e.g., linestrings, polygons), which pose significant challenges for efficient real-time storage and retrieval.
For decades, spatial indexing in sensor systems has been dominated by tree-based structures such as Rtrees [9], Quad-Trees [10], and their variants (e.g., R*-Tree). These structures organize spatial objects into a hierarchy of Minimum Bounding Rectangles (MBRs) to facilitate range queries and k-nearest neighbor (kNN) searches. While robust, traditional indexes face inherent limitations in the context of “Big Spatial Data.” They typically rely on heuristic splitting algorithms that do not adapt to the underlying data distribution, leading to deep tree structures, excessive pointer chasing, and significant storage overhead [11]. As sensor data volume grows, these inefficiencies result in high I/O latency, creating a performance bottleneck for real-time sensor analytics.
To overcome the limitations of traditional structures, the database community has recently turned to learned indexes. Pioneered by Kraska et al. [12], this paradigm treats indexing as a machine learning problem. By training a model to learn the Cumulative Distribution Function (CDF) of the keys, learned indexes can predict record positions, significantly reducing storage and lookup time. Recent surveys highlight the rapid evolution of this field [13]. Significant progress has been made in extending these concepts to multi-dimensional data. For instance, Flood [14] and Tsunami [15] optimize data layout and grid structures for multi-dimensional query workloads, resulting in superior performance.
However, extending these advances to spatial data with complex geometries remains non-trivial. Early spatial learned indexes, such as the ZM-Index [16], LISA [17], and the recent RSMI [18] and WaZI [19], focus primarily on point data. They typically linearize spatial points using Space-Filling Curves (SFCs) or rank-space transformations. While effective for points, these approaches struggle with spatially extended objects like roads and lakes. The state-of-the-art solution, GLIN [20], addresses this by indexing Z-address intervals of MBRs. Despite its success, GLIN suffers from a critical “Refinement Bottleneck.” Its reliance on coarse-grained MBR approximation generates excessive “false positives”, which are objects that fall within the search range but do not spatially intersect the query window. Our analysis shows that in selective queries, the subsequent geometric refinement phase consumes over 70% of the total query time, which is unacceptable for latency-sensitive sensor feedback loops.
Furthermore, supporting efficient dynamic updates for continuous sensor streams remains a formidable challenge. While recent works like LIPP [21], FlexFlood [22], and USLI-DR [23] have begun to explore updatability in multi-dimensional spaces, applying these dynamic capabilities to complex geometries without compromising filtering precision remains an open problem. Existing solutions often trade ingestion throughput for query accuracy, making them unsuitable for hybrid IoT workloads where sensor data is continuously ingested while being queried.
To address these challenges, we propose DyGLIN, a dynamic and read-optimized learned spatial index specifically designed for sensor data management. DyGLIN adopts a “filter-heavy, write-optimized” philosophy and introduces a decoupled architecture at the leaf level. Our contributions are as follows:
Decoupled Architecture for Sensor Streams: We propose a novel leaf node structure integrating a hierarchical MBR (HMBR) and a Cuckoo Filter. This pipeline provides tighter spatial bounds, reducing the candidate set for expensive refinement by up to 71.8% [95% CI: 69.2%, 74.4%] and effectively solving the refinement bottleneck.
Buffered Maintenance and Logical Deletion: To handle high-frequency sensor updates, we introduce a Delta Buffer mechanism. Incoming writes are logged into a lightweight buffer to amortize the high structural update costs of HMBR. Additionally, we employ a logical deletion strategy using Cuckoo Filters, enabling O(1) deletion handling without immediate structural reorganization.
Comprehensive Evaluation: We evaluate DyGLIN against robust baselines on diverse real-world datasets. The results show that DyGLIN reduces query latency by 26.4% [95% CI: 20.1%, 38.6%] compared to GLIN by eliminating the refinement bottleneck. In terms of maintenance, DyGLIN achieves 30.0% [95% CI: 21.4%, 35.9%] higher insertion throughput than GLIN and superior deletion performance compared to all baselines, with only a modest 18.5% [95% CI: 16.8%, 19.8%] increase in memory overhead, proving its efficacy for high-performance hybrid workloads.
2. Related Work
Spatial indexing has been a cornerstone of database research for decades. The most prevalent structures are tree-based hierarchies, typified by the Rtree [9] and its variants (e.g., R*-Tree [24], R+-Tree [25]). Rtrees group spatial objects using MBRs and organize them into a balanced tree. While effective for general range queries, Rtrees suffer from performance degradation on massive datasets due to high MBR overlap and deep tree traversals [11]. Space-partitioning methods like Quad-Trees [10] and Grid Files recursively partition space but often struggle with data skew and high dimensionality. Recent work in edge computing has emphasized the importance of optimizing data flow pipelines for dynamic sensor environments [8]. However, these frameworks primarily focus on task scheduling and resource allocation at the system level, without addressing the fundamental challenge of efficiently indexing complex spatial geometries that arise in sensor-captured data.
The concept of learned indexes [12] proposes replacing internal tree navigation with machine learning models. Early works like the Recursive Model Index (RMI) demonstrated significant lookup speedups for 1D keys. Subsequent works have focused on supporting dynamic operations. ALEX [26] introduces gapped arrays to handle insertions, while the PGM index [27] provides worst-case bounds. More recently, RUSLI [28] uses spline-based models to support real-time updates, and LIPP [21] optimizes index structure for precise position prediction, further pushing the boundaries of updatable learned indexes.
Extending learned indexes to multi-dimensional data has attracted significant attention, as summarized in recent surveys [13]. The ZM-Index [16] and LISA [17] combine Space-Filling Curves (SFCs) with learned models to linearize spatial points. RSMI [18] introduces rank space-based ordering to learn spatial distributions directly without SFCs. ML-Index [29] utilizes iDistance to project points for learning. However, a common limitation of these approaches is their focus on point data. They cannot naturally handle complex geometries (e.g., polygons, linestrings) without reducing them to centroids, which leads to correctness issues in spatial range queries. WaZI [19] further optimizes the Z-index by making it workload-aware. However, these methods treat spatial objects as dimensionless points and cannot natively handle spatial extent.
Regarding multidimensional layout optimization, Flood [14] optimizes the index structure and data storage layout in conjunction with multidimensional range scans. Tsunami [15] improves upon Flood by handling correlated data and skewed query workloads. While they are highly efficient for tabular data, they lack support for geometric predicates (e.g., intersection, containment) required for complex spatial objects.
For non-point objects such as complex geometries, GLIN [20] indexes the Z-address range of the MBR. LSIR [30] attempts to improve efficiency by reusing the model, but it focuses primarily on static construction. As mentioned earlier, GLIN suffers from refinement bottlenecks due to the approximation of the MBR.
Most early works assume static data. Recent attempts like FlexFlood [22] and USLI-DR [23] have begun to explore dynamic maintenance for multi-dimensional learned indexes. However, applying these dynamic capabilities to complex geometries while maintaining high filtering precision remains an open problem. DyGLIN bridges this gap by integrating a decoupled architecture with Cuckoo Filters [31] and a buffered write mechanism, inspired by Write-Optimized Data Structures (WODS) like LSM-trees [32] and Buffer Trees [33].
3. Preliminaries and Problem Definition
3.1. Overview of Real-Time Spatial Queries for Sensors
We begin by formally defining the sensor-captured spatial dataset and the corresponding query operations:
Spatial Dataset and Objects: Let $D = \{o_1, o_2, \ldots, o_N\}$ be a dataset containing N complex spatial observations captured by sensors (e.g., environmental boundaries, vehicle trajectories, or obstacle zones). Each spatial observation $o_i$ is encapsulated by a Minimum Bounding Rectangle (MBR), denoted as $o_i.\mathrm{MBR}$. While MBRs are efficient for coarse-grained indexing, they introduce approximation errors, especially for non-rectangular (e.g., diagonal or L-shaped) geometries.
Spatial Range Query (SRQ): Given a query window Q representing a specific monitoring area or a safety perimeter, the goal of an SRQ is to retrieve the set R of all spatial objects that intersect with Q to enable immediate decision-making:
| $R = \{\, o \in D \mid o.\mathrm{Geometry} \cap Q \neq \emptyset \,\}$ | (1) |
To facilitate efficient two-stage query processing, a Spatial Range Query Q is defined by a geometric region r (which can be an arbitrary polygon such as a monitoring area or safety perimeter). Query processing utilizes two representations of Q:
Q.Geometry: The exact geometric shape r used for precise intersection tests in the refinement phase.
Q.MBR: The Minimum Bounding Rectangle of r, used for index traversal and pruning in the filtering phase.
Formally, the result set R consists of all spatial objects $o \in D$ whose geometry intersects with Q.Geometry:
| $R = \{\, o \in D \mid o.\mathrm{Geometry} \cap Q.\mathrm{Geometry} \neq \emptyset \,\}$ | (2) |
The indexing phase, however, retrieves a superset candidate set C using the MBR approximation:
| $C = \{\, o \in D \mid o.\mathrm{MBR} \cap Q.\mathrm{MBR} \neq \emptyset \,\}$ | (3) |
where $R \subseteq C$. This classical filter-and-refine strategy ensures that R can be efficiently computed by first filtering to a small candidate set C using MBR-based indexing and then refining C through exact geometric verification.
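The filter-and-refine strategy above can be sketched in a few lines of C++. This is a minimal illustration, not the paper's implementation: the geometry representation is a simplifying assumption in which each object carries a coarse MBR plus a tighter "exact" box standing in for its true geometry, whereas a real system such as GLIN would run full polygon intersection tests via GEOS in the refine phase.

```cpp
#include <vector>

// Axis-aligned bounding box with the standard MBR overlap test.
struct MBR {
    double xmin, ymin, xmax, ymax;
    bool intersects(const MBR& o) const {
        return xmin <= o.xmax && o.xmin <= xmax &&
               ymin <= o.ymax && o.ymin <= ymax;
    }
};

struct SpatialObject {
    int id;
    MBR mbr;       // coarse bounding box used by the index (filter phase)
    MBR tightGeom; // stand-in for the exact geometry (refine phase)
};

// Filter phase (Eq. (3)): cheap MBR tests yield a candidate superset C.
std::vector<const SpatialObject*>
filterPhase(const std::vector<SpatialObject>& data, const MBR& qmbr) {
    std::vector<const SpatialObject*> C;
    for (const auto& o : data)
        if (o.mbr.intersects(qmbr)) C.push_back(&o);
    return C;
}

// Refine phase (Eq. (2)): exact tests drop false positives, yielding R.
std::vector<int>
refinePhase(const std::vector<const SpatialObject*>& C, const MBR& q) {
    std::vector<int> R;
    for (const SpatialObject* o : C)
        if (o->tightGeom.intersects(q)) R.push_back(o->id);
    return R;
}
```

An object whose MBR overlaps the query window but whose exact geometry does not (e.g., a diagonal linestring) survives the filter phase and is only discarded during refinement; this is precisely the false-positive population that drives the refinement bottleneck.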
3.2. The GLIN Data Retrieval Response Latency Model
The total response latency $T_{\text{total}}$ for a spatial query is a critical metric. This latency can be decomposed into two main stages: the fast signal localization (probe) time $T_{\text{probe}}$ and the precise verification (refine) time $T_{\text{refine}}$:
| $T_{\text{total}} = T_{\text{probe}} + T_{\text{refine}}$ | (4) |
Probe Stage: The learned model is used to predict the index location (a range on the Z-order curve) of the query MBR $Q.\mathrm{MBR}$. The time $T_{\text{probe}}$ for this stage is typically very low, benefiting from the model's predictive power and the efficiency of the underlying B-Tree or ALEX structure.
Refine Stage: The located leaf node contains a candidate set C. The refine stage must perform expensive geometric intersection tests on each object in C to filter out false positives (FPs) and identify true positives (TPs).
3.3. Formalizing the Refinement Bottleneck
The refinement bottleneck is a critical latency barrier in learned-index-based sensor systems. It stems from the reliance on coarse-grained MBR approximations for initial filtering during the signal localization stage, which fails to provide sufficient precision for complex sensor-captured geometries.
In the context of sensor data management, the candidate set C is defined as the set of sensor observations whose MBRs intersect with the monitoring window Q:
| $C = \{\, o \in D \mid o.\mathrm{MBR} \cap Q.\mathrm{MBR} \neq \emptyset \,\}$ | (5) |
The precise verification latency $T_{\text{refine}}$ depends directly on the cardinality $|C|$ of the candidate set and the average computational cost $c_{\text{refine}}$ required for the exact geometric verification of each observation $o \in C$:
| $T_{\text{refine}} = |C| \cdot c_{\text{refine}}$ | (6) |
Here, $c_{\text{refine}}$ represents the average number of CPU cycles required to perform a single precise boundary intersection test. For complex sensor observations, this cost is significantly higher than that of a simple MBR intersection check. Since the original GLIN model primarily optimizes for mapping efficiency to minimize $T_{\text{probe}}$, it often overlooks the minimization of $T_{\text{refine}}$.
A false positive (FP) occurs when a sensor signal’s MBR intersects with Q, but its actual geometry does not spatially intersect the monitoring area. The false positive rate (FPR) is a key metric used to quantify the severity of the bottleneck within the sensor feedback loop:
| $\mathrm{FPR} = \dfrac{|C| - |\mathrm{TP}|}{|C|}$ | (7) |
In typical IoT monitoring scenarios, the query selectivity is often very low. Due to the coarse-grained nature of MBR-based filtering, the system generates a high FPR, resulting in an oversized candidate set with $|C| \gg |R|$. Consequently, $T_{\text{refine}}$ constitutes the vast majority of the total response time $T_{\text{total}}$, creating a significant processing delay. Figure 1 illustrates this phenomenon, showing how a candidate set saturated with false positives forces the verification phase to dominate the total processing time, thereby hindering the real-time capabilities of the sensor system.
| $T_{\text{total}} \approx T_{\text{refine}} = |C| \cdot c_{\text{refine}} \quad (\mathrm{FPR} \to 1)$ | (8) |
Figure 1.
Impact of the refinement bottleneck on real-time sensor feedback loops. The probe stage selects a large candidate set C based only on MBR overlap. Due to the high false positive rate (FPR), the expensive geometric refinement tests consume the vast majority of the total query time $T_{\text{total}}$.
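The latency model in Eqs. (4)–(7) can be checked numerically. The sketch below uses illustrative cost values (a probe time and a per-candidate refinement cost) that are assumptions for this example, not measurements from the paper; it shows how a high-FPR candidate set makes the refine term dominate the total.

```cpp
#include <cstddef>

// T_total decomposed into probe and refine components (Eq. (4)).
struct QueryCost {
    double probe;   // T_probe: model prediction + leaf localization
    double refine;  // T_refine = |C| * c_refine (Eq. (6))
    double total() const { return probe + refine; }
};

// FPR = (|C| - |TP|) / |C| (Eq. (7)).
double falsePositiveRate(std::size_t candidates, std::size_t truePositives) {
    return candidates == 0 ? 0.0
        : static_cast<double>(candidates - truePositives) / candidates;
}

// Estimate query cost from candidate-set size and unit costs.
QueryCost estimateCost(std::size_t candidates, double tProbe, double cRefine) {
    return {tProbe, static_cast<double>(candidates) * cRefine};
}
```

With 1000 candidates, 50 of which truly intersect, the FPR is 0.95 and (under the assumed unit costs) refinement accounts for well over 90% of the total latency, matching the bottleneck the paper describes.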
3.4. Ingestion Throughput and Analysis Immediacy
When the FPR approaches 1, the total processing time is dominated by the verification phase $T_{\text{refine}}$. The most direct method to mitigate the refinement bottleneck and minimize $|C|$ is to introduce high-precision auxiliary filtering structures, such as Rtree-like hierarchies or Cuckoo Filters. However, in dynamic IoT environments, these structures can cause a sharp decline in data ingestion performance, creating a critical performance trade-off. Assume that we introduce a high-precision filtering structure F (e.g., an HMBR tree) in a GLIN leaf node L. The total ingestion cost $C_{\text{insert}}$ for an insertion operation becomes
| $C_{\text{insert}} = C_{\text{model}} + C_{\text{filter}}$ | (9) |
where $C_{\text{model}}$ is the cost of updating the underlying learned index structure (e.g., ALEX), which is typically low; $C_{\text{filter}}$ is the cost of updating the auxiliary filtering structure F. If F is a tree-like structure (e.g., Rtree/HMBR), then $C_{\text{filter}} = O(\log n)$, because complex structural adjustments (such as node splits and rebalancing) are required. For high-performance sensor systems, the optimization goal is to simultaneously minimize the retrieval latency $T_{\text{total}}$ and the ingestion cost $C_{\text{insert}}$:
| $\min \left( T_{\text{total}},\; C_{\text{insert}} \right)$ | (10) |
Introducing a high-precision filter F can effectively reduce the query response time under the refinement bottleneck to $T'_{\text{total}}$:
| $T'_{\text{total}} = T_{\text{probe}} + |C'| \cdot c_{\text{refine}}, \qquad |C'| \ll |C|$ | (11) |
But this comes at the cost of
| $C_{\text{insert}} = C_{\text{model}} + O(\log n)$ | (12) |
Such a high ingestion overhead is unacceptable under the heavy, continuous data streams typical of IoT sensor networks. The core design principle of DyGLIN is to achieve the optimization of $T_{\text{refine}}$ without sacrificing the ingestion throughput, which is realized through decoupled architectures and buffering strategies inspired by write-optimized data structures.
4. DyGLIN Methodology
4.1. Core Architecture: Edge-Oriented Sensor Stream Decoupling
To address the read–write trade-off in high-frequency IoT environments, we design DyGLIN, a novel architecture that integrates aggressive query filtering with efficient, buffered maintenance. The core design concept of DyGLIN is to introduce a decoupled read/write optimization layer at the leaf node level of the model, physically separating the high-latency precise query path from the high-throughput write path.
As shown in Figure 2, a DyGLIN leaf node L is defined as a 4-tuple:
| $L = \langle \mathrm{MDS}, \mathrm{HMBR}, \mathrm{DB}, \mathrm{CF} \rangle$ | (13) |
where MDS (Main Data Store) is a gapped array based on ALEX, storing static historical sensor data and supporting efficient binary search; HMBR (Hierarchical MBR Filter) is a lightweight in-memory Rtree built on top of the MDS, providing finer-grained spatial filtering than the leaf node's MBR; DB (Delta Buffer) is a linear buffer with fixed capacity B, which is used to capture bursty sensor sampling intake with O(1) response times; and CF (Deletion Filter) is a Cuckoo Filter storing the IDs of logically deleted objects, supporting O(1) existence checks and deletion marking.
Figure 2.
Edge-oriented decoupled architecture for real-time sensor stream processing. The DyGLIN leaf node decouples read and write operations. Inserts are rapidly staged in the Delta Buffer (DB). Deletions are marked in the Cuckoo Filter (CF). Queries (reads) are accelerated by the hierarchical MBR (HMBR) and validated against the CF. A background merge process moves data from the DB to the Main Data Store (MDS) and rebuilds the HMBR.
4.2. Read Path: Real-Time Retrieval and Boundary Verification
To ensure the immediacy of the sensor feedback loop, DyGLIN implements a hierarchical filtering pipeline to eliminate the “Refinement Bottleneck”. Our goal is to demonstrate that DyGLIN's candidate set $C_{\text{DyGLIN}}$ is much smaller than the original GLIN's candidate set $C_{\text{GLIN}}$.
For a given query Q, DyGLIN’s query process is defined as a sequence of set reduction operations:
| $C_{\text{GLIN}} \xrightarrow{\;\mathrm{HMBR}\;} C_{H} \xrightarrow{\;\mathrm{CF}\;} C_{\text{valid}} \xrightarrow{\;\mathrm{Refine}\;} R$ | (14) |
In standard GLIN, the candidate set $C_{\text{GLIN}}$ contains all objects intersecting the leaf node's MBR. DyGLIN introduces HMBR, which partitions the leaf space into k more compact micro-MBRs. Let $\rho$ be the Spatial Pruning Ratio of HMBR. The size of the set after HMBR filtering is approximately
| $|C_{H}| \approx (1 - \rho) \cdot |C_{\text{GLIN}}|$ | (15) |
Because HMBR tightly follows the geometry distribution, $\rho$ is typically high (meaning that very few candidates are retained) for data with non-rectangular distributions (e.g., roads).
In dynamic scenarios, deleted objects that are not physically removed become “ghost data,” leading to FPs. By checking candidates against the CF (Cuckoo Filter), we further filter out invalid objects. Let $v$ be the data validity ratio:
| $v = \dfrac{|C_{\text{valid}}|}{|C_{H}|}$ | (16) |
The ratio of DyGLIN’s total refinement cost to GLIN’s cost is
| $\dfrac{T_{\text{refine}}^{\mathrm{DyGLIN}}}{T_{\text{refine}}^{\mathrm{GLIN}}} = \dfrac{|C_{\text{valid}}|}{|C_{\text{GLIN}}|} = (1 - \rho) \cdot v$ | (17) |
Since $\rho$ is significantly greater than 0, DyGLIN substantially reduces the refinement bottleneck, thereby validating the optimization goal in Equation (4).
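The cost-ratio analysis of Eqs. (15)–(17) can be illustrated numerically. The pruning ratio and validity ratio used below are assumed values for the example, not figures reported by the paper.

```cpp
// Ratio of DyGLIN's refinement cost to GLIN's, per Eq. (17):
// T_refine(DyGLIN) / T_refine(GLIN) = (1 - rho) * v,
// where rho is the HMBR spatial pruning ratio and v the data validity ratio.
double refinementCostRatio(double rho, double v) {
    return (1.0 - rho) * v;
}
```

For instance, if the HMBR prunes 70% of the GLIN candidates (rho = 0.7) and 90% of the survivors are still valid (v = 0.9), DyGLIN pays roughly 27% of GLIN's refinement cost.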
To explicitly address the refinement bottleneck, we implement a multi-stage filtering pipeline. The detailed execution flow is formalized in Algorithm 1. As shown in the algorithm, the query process aggregates candidates from both the hierarchical MBR and the Delta Buffer, while simultaneously filtering out logically deleted items using the Cuckoo Filter before entering the final geometric refinement phase.
| Algorithm 1: Real-Time Spatial Retrieval |
| Input: Query Q (MBR), DyGLIN Index |
| Output: Result Set R |
| 1: R ← ∅ |
| 2: /* Index probe (top-level RMI probing) */ |
| 3: Leaves ← ProbeIndex(Q.MBR) |
| 4: for each L ∈ Leaves do |
| 5: C ← ∅ |
| 6: /* 1. HMBR filter */ |
| 7: C_H ← L.HMBR.Search(Q.MBR) |
| 8: C ← C ∪ C_H |
| 9: /* 2. Incremental buffer filter */ |
| 10: C_B ← {e ∈ L.DB : e.MBR ∩ Q.MBR ≠ ∅} |
| 11: C ← C ∪ C_B |
| 12: /* 3. CF filter */ |
| 13: C_valid ← ∅ |
| 14: for each e ∈ C do |
| 15: if e.id ∉ L.CF then |
| 16: C_valid ← C_valid ∪ {e} |
| 17: end if |
| 18: end for |
| 19: /* 4. Final refinement */ |
| 20: R_L ← Refine(C_valid, Q.Geometry) |
| 21: R ← R ∪ R_L |
| 22: end for |
| 23: return R |
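The per-leaf portion of Algorithm 1 can be sketched in C++ as below. This is an illustrative simplification: a hash set stands in for the Cuckoo Filter, the HMBR is modeled as a flat list of micro-MBR partitions rather than a tree, and the final exact-geometry refinement step is approximated by the MBR test.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct MBR {
    double xmin, ymin, xmax, ymax;
    bool intersects(const MBR& o) const {
        return xmin <= o.xmax && o.xmin <= xmax &&
               ymin <= o.ymax && o.ymin <= ymax;
    }
};
struct Entry { std::uint64_t id; MBR mbr; };

struct Leaf {
    std::vector<std::vector<Entry>> partitions; // objects grouped under micro-MBRs
    std::vector<MBR> microMbrs;                 // HMBR: one tight MBR per partition
    std::vector<Entry> deltaBuffer;             // unmerged recent inserts
    std::unordered_set<std::uint64_t> deleted;  // deletion filter (CF stand-in)
};

std::vector<std::uint64_t> rangeQuery(const Leaf& L, const MBR& q) {
    std::vector<Entry> cand;
    // 1. HMBR filter: visit only partitions whose micro-MBR overlaps Q.
    for (std::size_t i = 0; i < L.microMbrs.size(); ++i)
        if (L.microMbrs[i].intersects(q))
            for (const auto& e : L.partitions[i])
                if (e.mbr.intersects(q)) cand.push_back(e);
    // 2. Delta Buffer filter: recent inserts are scanned linearly.
    for (const auto& e : L.deltaBuffer)
        if (e.mbr.intersects(q)) cand.push_back(e);
    // 3. CF filter: drop logically deleted objects ("ghost data").
    // 4. Refinement: exact geometry tests would run here; MBR hit is the stand-in.
    std::vector<std::uint64_t> result;
    for (const auto& e : cand)
        if (L.deleted.count(e.id) == 0) result.push_back(e.id);
    return result;
}
```

Note how candidates flow from both the HMBR partitions and the Delta Buffer, so freshly ingested observations remain queryable before any merge occurs.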
4.3. Write Path: Amortized Analysis of Sensor Stream Ingestion
In many IoT scenarios, sensors generate data at sampling peaks that can overwhelm traditional indexing structures. To address the read–write trade-off, we must demonstrate why write performance does not degrade despite introducing a complex structure like HMBR. We formalize this using amortized analysis. The cost function Cost(insert) for a single insertion operation depends on the current state of the Delta Buffer DB. Let the buffer capacity be B and its current size be b. The background merge process, illustrated in Figure 3, is triggered when the Delta Buffer reaches capacity; it efficiently bulk-migrates the buffered “hot” data into the “cold” main data store and reconstructs the HMBR, thereby amortizing the structural update costs:
| $\mathrm{Cost}(\mathrm{insert}) = \begin{cases} C_{\text{append}} = O(1), & b < B \\ C_{\text{append}} + C_{\text{merge}}, & b = B \end{cases}$ | (18) |
where $C_{\text{append}}$ is an in-memory append operation with O(1) cost, and $C_{\text{merge}}$ is the cost of merging buffer data into the MDS and rebuilding the HMBR. For a sequence of B consecutive insertion operations, the total cost is $B \cdot O(1) + C_{\text{merge}}$. Therefore, the amortized cost per insertion is
| $\overline{C}_{\text{insert}} = \dfrac{B \cdot O(1) + C_{\text{merge}}}{B} = O(1) + \dfrac{C_{\text{merge}}}{B}$ | (19) |
Figure 3.
Merge mechanism for persistent sensor archiving and index reconstruction. When the Delta Buffer is full, data is bulk-inserted into the MDS, and the HMBR is rebuilt.
If we were to maintain an Rtree in real time, the cost per insertion would be $O(\log n)$. In DyGLIN, since $C_{\text{merge}}$ is a batch rebuild for local leaf node data and is amortized over B operations (e.g., B = 64), DyGLIN offers a write advantage as long as the following condition is met:
| $\dfrac{C_{\text{merge}}}{B} < O(\log n)$ | (20) |
By adjusting the buffer size B, we can control the trade-off between write latency and query freshness. Similarly, for logical deletion, the cost of inserting a fingerprint into the Cuckoo Filter is constant O(1), completely avoiding the overhead of physically moving data.
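The amortization arithmetic of Eqs. (18)–(20) is easy to verify with concrete numbers. The unit costs below are assumptions chosen for illustration: each buffer append costs 1 unit, and the periodic merge (including the HMBR rebuild) costs a fixed number of units that is spread over B insertions.

```cpp
// Amortized per-insertion cost per Eq. (19):
// (B * appendCost + mergeCost) / B = O(1) + C_merge / B.
double amortizedInsertCost(int B, double appendCost, double mergeCost) {
    return (B * appendCost + mergeCost) / B;
}
```

With B = 64, unit appends, and a merge costing 128 units, the amortized cost is 3 units per insertion, comfortably below a hypothetical per-insertion Rtree update cost, which is the condition Eq. (20) requires.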
DyGLIN decouples data ingestion from structural maintenance to ensure high write throughput. The insertion procedure is outlined in Algorithm 2. Incoming data is appended to the Delta Buffer in O(1) time. The expensive merge operation is triggered only when the buffer reaches its capacity limit, effectively amortizing the write cost. If an element were physically removed from the MDS, the corresponding entry in the HMBR and the MBR of its parent node would also need to be updated, and this update cost is very high. We therefore adopt a logical deletion strategy. As shown in Algorithm 3, the deletion operation does not touch the HMBR; instead, the ID of the deleted element is added to the CF, which is an O(1) operation. ALEX's gapped array supports efficient deletion of data from the MDS.
| Algorithm 2: High-Frequency Stream Ingestion |
| Input: Geometry G, DyGLIN Index |
| Output: Success/Failure |
| 1: /* 1. Find the corresponding leaf node */ |
| 2: LeafNode L ← ProbeIndex(G.MBR) |
| 3: if L.DB.IsFull() then |
| 4: Merge(L) |
| 5: end if |
| 6: /* O(1) operation: Only append to the buffer */ |
| 7: L.DB.Append(G) |
| 8: return Success |
| Algorithm 3: Invalid Signal Removal |
| Input: Geometry G, DyGLIN Index |
| Output: Success/Failure |
| 1: LeafNode L ← ProbeIndex(G.MBR) |
| 2: /* O(1) tag: Mark as “Deleted” in CF */ |
| 3: L.CF.Insert(G.id) |
| 4: /* Mark in MDS if consistent with HMBR, otherwise defer physical removal to merge */ |
| 5: return Success |
Background maintenance is handled by the merge operation, as detailed in Algorithm 4. When triggered, this process bulk-inserts buffered data into the Main Data Store and reconstructs the HMBR to restore query optimality. Additionally, the Deletion Filter is monitored and rebuilt if its load factor exceeds the threshold.
| Algorithm 4: Background Sensor Data Archiving |
| Input: LeafNode L (DB is full) |
| Output: void |
| 1: /* Batch insertion of data */ |
| 2: L.MDS.BulkInsert(L.DB) |
| 3: /* Reconstruct the hierarchical MBR */ |
| 4: L.HMBR ← BuildHMBR(L.MDS) |
| 5: /* Clear the buffer */ |
| 6: L.DB.Clear() |
| 7: /* Clear CF: If the CF is too full, rebuild it here */ |
| 8: if L.CF.LoadFactor() > threshold then |
| 9: RebuildCF(L) |
| 10: end if |
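Algorithms 2–4 together form the write path, which can be sketched compactly in C++. This is an illustrative simplification under stated assumptions: plain STL containers stand in for the ALEX gapped array and the Cuckoo Filter, the HMBR rebuild is omitted from the merge, and a small buffer capacity is used so the merge trigger is easy to see.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

struct Leaf {
    std::vector<std::uint64_t> mds;            // Main Data Store
    std::vector<std::uint64_t> deltaBuffer;    // staging area for inserts
    std::unordered_set<std::uint64_t> deleted; // deletion filter (CF stand-in)
    std::size_t capacity = 4;                  // small B for illustration (paper uses e.g. 64)
};

// Algorithm 4 (sketch): bulk-migrate buffered data and purge logically
// deleted entries. A real implementation would also rebuild the HMBR here.
void merge(Leaf& L) {
    for (auto id : L.deltaBuffer) L.mds.push_back(id);
    L.deltaBuffer.clear();
    std::vector<std::uint64_t> kept;
    for (auto id : L.mds)
        if (L.deleted.count(id) == 0) kept.push_back(id);
    L.mds = std::move(kept);
    L.deleted.clear();
}

// Algorithm 2: append to the buffer; merge only when it reaches capacity.
void insert(Leaf& L, std::uint64_t id) {
    if (L.deltaBuffer.size() >= L.capacity) merge(L);
    L.deltaBuffer.push_back(id);
}

// Algorithm 3: logical deletion marks the ID; no structural change occurs.
void erase(Leaf& L, std::uint64_t id) { L.deleted.insert(id); }
```

Note that physical removal of a deleted element happens lazily: the ID sits in the deletion filter until the next merge sweeps it out, which is exactly why deletion stays O(1).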
5. Experiments
5.1. Experiment Setup
All experiments are conducted on a high-performance workstation equipped with an AMD Ryzen 5 9600X CPU and 32 GB of DDR5 RAM. DyGLIN is implemented in C++ and utilizes the GEOS 3.6.8 library for exact geometric verification. We compare DyGLIN against three baselines: (1) Boost R*-Tree: a standard dynamic spatial index from the Boost C++ library; (2) Quad-Tree: a space-partitioning index from the GEOS library; (3) Original GLIN: the baseline without the Delta Buffer, Cuckoo Filter, or HMBR components.
To ensure fair comparisons, all methods follow a unified two-stage query protocol: (1) Index Search Phase: the index generates candidates using Minimum Bounding Rectangles (MBRs); (2) Geometry Refinement Phase: all candidates undergo exact geometric verification via the GEOS library. Query Response Time (QRT) is measured as wall-clock time, including both candidate generation and refinement costs. We have verified that all methods return identical result sets, confirming fair comparison. Deletion semantics follow each method’s standard implementation: physical removal with structural rebalancing for R-Tree/Quad-Tree, physical shifting for GLIN, and O(1) logical deletion via the Cuckoo Filter for DyGLIN.
5.1.1. Datasets for Sensor-Based Spatial Analysis
We utilize standard complex geometry datasets from the geospatial community, which are known to stress spatial indexes due to their size, data skew, and high object complexity. Table 1 summarizes the characteristics of the three datasets we employ, which represent typical spatial sensor observations. AREAWATER consists of complex hydro-sensor monitoring zones; LINEWATER simulates linear sensor-captured data, such as maritime trajectories or river flow sensors; PARKS represents large-scale environmental sensor clusters with significant spatial overlap.
Table 1.
Description of spatial sensor datasets.
| Name | Type | Cardinality (M) | Size (GB) | Width (deg) | Height (deg) |
|---|---|---|---|---|---|
| AREAWATER | Polygon | 2.28 | 1.52 | 2.56862 | 1.463470 |
| LINEWATER | LineString | 5.8 | 4.56 | 1.52892 | 0.981663 |
| PARKS | Polygon | 9.96 | 5.76 | 155.842 | 82 |
5.1.2. Sensor-Driven Workloads and Performance Metrics
To evaluate DyGLIN in realistic IoT and sensor network environments, we define a hybrid workload that captures both real-time spatial analytics and dynamic stream updates.
Spatial Monitoring Queries: The query workload consists of Spatial Range Queries (SRQs) generated at three selectivity levels (0.1%, 1%, and 10%), representing different densities of sensor monitoring areas. To ensure statistical significance in reporting real-time responsiveness, 2000 queries are executed for each dataset and selectivity level, with the average query response time (QRT) recorded as the primary performance metric.
Dynamic Stream Ingestion: The maintenance workload simulates high-frequency sensor data updates by randomly inserting 50% of the dataset records and deleting another 50% on a record-by-record basis. Throughput (operations per second) serves as the key metric to assess the system’s capacity for continuous sensor stream ingestion.
Filtering and Resource Efficiency: To quantify the effectiveness in mitigating the Refinement Bottleneck, we measure the candidate set size and calculate the candidate set reduction rate. Furthermore, to evaluate the index’s suitability for resource-constrained edge nodes, we report the index memory cost, which denotes the total memory footprint of the index structure, encompassing the HMBR, Delta Buffer, and Cuckoo Filter components.
5.2. Evaluation of Real-Time Query Efficiency in Sensor Networks
We evaluate the query efficiency of DyGLIN against baselines across three datasets to simulate diverse real-time sensor monitoring scenarios. Figure 4 reports the average query response time, confirming that DyGLIN consistently delivers the lowest latency for fetching critical spatial information. Specifically, compared to the original GLIN, DyGLIN reduces sensor data retrieval latency by 20–30% across all evaluated workloads. For instance, the PARKS dataset represents environmental sensor clusters with extensive spatial overlap. In this complex scenario, GLIN suffers from the Refinement Bottleneck due to coarse-grained MBR approximations, which delays the immediate feedback loop. In contrast, DyGLIN’s hierarchical MBR (HMBR) acts as a high-precision spatial filter at the leaf level, effectively pruning false positives before the expensive geometric verification stage. Compared to the traditional Rtree, DyGLIN achieves 2 to 3 times faster query speeds, which is essential for latency-sensitive applications such as autonomous navigation. While Rtrees provide precise filtering, their deep hierarchical traversals incur non-trivial CPU overhead and processing delays. By combining the rapid navigation of learned models with the precise local signal filtering of HMBR, DyGLIN achieves the “best of both worlds,” ensuring high-speed spatial analytics for dynamic IoT environments.
Figure 4.
Real-time retrieval latency evaluation for sensor monitoring tasks.
Table 2 provides comprehensive tail latency statistics (median, P95, and P99) with 95% confidence intervals across different query selectivities (0.1%, 1%, and 10%). The results demonstrate that DyGLIN not only achieves lower mean query latency but also exhibits more stable and predictable performance, with significantly lower tail latencies compared to baselines. For instance, at 1% selectivity on the AREAWATER dataset, DyGLIN achieves a P95 latency of 8.95 ms compared to GLIN’s 14.50 ms and R-Tree’s 42.32 ms, representing improvements of 38.3% and 78.9%, respectively. This predictability is crucial for real-time IoT systems where tail latency directly impacts service-level objectives.
Table 2.
Query response time comparison across different selectivities (with 95% confidence intervals).
| Method | Selectivity | Mean (ms) | 95% CI | Median | P95 | P99 |
|---|---|---|---|---|---|---|
| DyGLIN | 0.1% | 0.86 | [0.81, 0.92] | 0.78 | 1.12 | 1.45 |
| DyGLIN | 1.0% | 6.04 | [6.85, 7.45] | 6.52 | 8.95 | 11.20 |
| DyGLIN | 10.0% | 68.87 | [68.20, 74.80] | 66.80 | 92.50 | 118.40 |
| GLIN | 0.1% | 1.18 | [1.12, 1.25] | 1.05 | 1.85 | 2.62 |
| GLIN | 1.0% | 9.83 | [9.25, 10.41] | 9.12 | 14.50 | 19.80 |
| GLIN | 10.0% | 96.96 | [92.13, 104.17] | 90.77 | 144.02 | 210.95 |
| R-Tree | 0.1% | 2.41 | [2.25, 2.28] | 2.15 | 4.85 | 7.99 |
| R-Tree | 1.0% | 20.52 | [19.35, 20.80] | 18.61 | 42.32 | 75.44 |
| R-Tree | 10.0% | 201.38 | [188.58, 211.93] | 182.42 | 455.60 | 820.25 |
| Quad-Tree | 0.1% | 2.85 | [2.65, 3.08] | 2.52 | 5.60 | 9.43 |
| Quad-Tree | 1.0% | 22.02 | [20.95, 23.09] | 20.18 | 48.69 | 85.19 |
| Quad-Tree | 10.0% | 238.57 | [210.76, 230.48] | 211.50 | 495.75 | 887.58 |
To evaluate performance under realistic sensor workloads where read and write operations are interleaved, we conducted experiments with three different read/write ratios: 95/5 (read-heavy), 50/50 (balanced), and 5/95 (write-heavy). Table 3 shows that DyGLIN maintains superior performance across all workload patterns. In read-heavy scenarios (95/5), DyGLIN achieves 1.34 ops/s total throughput with a P95 query latency of 9.14 ms. In terms of average query latency, DyGLIN outperforms GLIN by 28.3% and R-tree by 66.6%. Even in write-heavy workloads (5/95), DyGLIN sustains 1.21 ops/s throughput, demonstrating that the Delta Buffer effectively handles high-frequency sensor ingestion without compromising query responsiveness. The consistent performance advantage across diverse read/write patterns validates DyGLIN’s suitability for dynamic IoT gateways where traffic patterns vary unpredictably.
Table 3.
Performance comparison under different read/write ratios.
| Workload (R/W Ratio) | Method | Total Throughput (ops/s) | Avg Query Latency (ms) | P95 Query Latency (ms) |
|---|---|---|---|---|
| 95/5 (Read-Heavy) | DyGLIN | 1.34 | 7.26 | 9.14 |
| 95/5 (Read-Heavy) | GLIN | 1.04 | 10.13 | 25.31 |
| 95/5 (Read-Heavy) | R-tree | 0.83 | 21.75 | 145.75 |
| 95/5 (Read-Heavy) | Quad-Tree | 0.89 | 23.95 | 183.70 |
| 50/50 (Balanced) | DyGLIN | 1.28 | 7.80 | 9.84 |
| 50/50 (Balanced) | GLIN | 0.93 | 13.65 | 38.88 |
| 50/50 (Balanced) | R-tree | 0.53 | 55.00 | 295.30 |
| 50/50 (Balanced) | Quad-Tree | 0.57 | 62.70 | 365.20 |
| 5/95 (Write-Heavy) | DyGLIN | 1.21 | 8.34 | 10.60 |
| 5/95 (Write-Heavy) | GLIN | 0.81 | 18.17 | 48.44 |
| 5/95 (Write-Heavy) | R-tree | 0.32 | 92.25 | 438.25 |
| 5/95 (Write-Heavy) | Quad-Tree | 0.34 | 105.45 | 580.70 |
5.3. Evaluation of Sensor Stream Ingestion and Resource Efficiency
High-speed data ingestion is a pivotal advantage of DyGLIN, particularly in scenarios requiring the continuous capture of high-frequency sensor signals. Figure 5 illustrates the insertion throughput comparison: DyGLIN sustains sampling peaks 1.5 to 2.0 times higher than the traditional R-tree and improves on the original GLIN by 15–30%. This efficiency is fundamentally attributed to the Delta Buffer (DB) mechanism, which converts random sensor updates into sequential memory append operations, effectively amortizing the computational cost of structural maintenance.
Figure 5.
Evaluation of high-frequency sensor data ingestion throughput. DyGLIN maintains high throughput comparable to GLIN and significantly higher than R-tree due to the Delta Buffer mechanism.
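The amortization idea behind the Delta Buffer can be sketched as follows. This is an illustrative Python sketch, not the paper’s implementation: the class and method names are hypothetical, and a sorted list stands in for the learned main data store.

```python
# Illustrative sketch of Delta-Buffer-style amortization (hypothetical names):
# inserts are O(1) sequential appends; one bulk merge per B inserts replaces
# B random structural updates to the main index.
class DeltaBufferedIndex:
    def __init__(self, capacity=256):
        self.capacity = capacity  # buffer capacity B (default from Section 5.6.2)
        self.buffer = []          # sequential append area for fresh sensor objects
        self.main = []            # stand-in for the learned main data store
        self.merges = 0

    def insert(self, obj):
        self.buffer.append(obj)   # O(1) append, no tree rebalancing per insert
        if len(self.buffer) >= self.capacity:
            self._merge()

    def _merge(self):
        # One bulk reorganization amortizes maintenance over B inserts.
        self.main.extend(self.buffer)
        self.main.sort(key=lambda o: o[0])  # placeholder for index rebuild/retrain
        self.buffer.clear()
        self.merges += 1

    def query(self, lo, hi):
        # Queries also scan the buffer, so freshly ingested data stays visible.
        return [o for o in self.main + self.buffer if lo <= o[0] <= hi]
```

With B = 256, a burst of 1000 inserts triggers only three merges; the trade-off between buffer size, throughput, and freshness is quantified in Section 5.6.2.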
Figure 6 presents the deletion throughput, representing the system’s ability to handle transient data management. By employing a logical signal-masking strategy via Cuckoo Filters, DyGLIN facilitates the rapid removal of expired or invalid sensor readings without the expensive tree rebalancing required by R-trees. Together with the high ingestion throughput, this confirms that the Delta Buffer effectively mitigates the write amplification typically associated with complex hierarchical structures, ensuring that the system remains responsive during intensive sensor data intake.
Figure 6.
Performance of transient data removal and signal masking via Cuckoo Filters.
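The logical-deletion idea can be illustrated with a minimal sketch. Note the hedges: a plain Python set stands in for the Cuckoo Filter of [31], and the class and method names are assumptions for illustration, not DyGLIN’s actual API.

```python
# Sketch of logical deletion via signal masking (a set stands in for the
# Cuckoo Filter; in the real system membership tests are probabilistic).
class LogicalDeletionIndex:
    def __init__(self, objects):
        self.objects = dict(objects)  # id -> geometry, the main data store
        self.deleted = set()          # stand-in for the deletion Cuckoo Filter

    def delete(self, obj_id):
        # O(1) masking: no tree rebalancing, no physical data relocation.
        self.deleted.add(obj_id)

    def query(self, predicate):
        # Masked objects are pruned from the candidate set before the
        # expensive geometric verification stage.
        return [oid for oid, g in self.objects.items()
                if oid not in self.deleted and predicate(g)]
```

A deleted identifier is simply skipped at query time; physical reclamation can then be deferred to a background merge, which is what keeps deletion throughput high.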
Beyond real-time performance, memory efficiency is a critical factor for deployment on resource-constrained edge nodes. Table 4 details the memory consumption for each method. As expected, pointer-heavy structures like the R-tree and Quad-Tree consume the most memory, which can be prohibitive for embedded IoT hardware. By leveraging learned models to approximate spatial positions, GLIN achieves a significantly smaller footprint, and DyGLIN maintains this advantage with only a modest 15–20% overhead due to its auxiliary filtering structures. Despite this slight increase, DyGLIN remains approximately 70% more space-efficient than traditional R-trees. These results demonstrate that DyGLIN successfully trades a marginal memory increase for substantial gains in retrieval immediacy and ingestion throughput, making it an optimized solution for large-scale sensor-driven workloads.
Table 4.
Comparative and breakdown analysis of memory footprint.
| Dataset | R-tree (MB) | Quad-Tree (MB) | GLIN (MB) | +HMBR (MB, vs. GLIN) | +Cuckoo Filter (MB, vs. GLIN) | DyGLIN Total (MB) |
|---|---|---|---|---|---|---|
| AREAWATER | 285.4 | 341.2 | 75.6 | 8.2 (10.8%) | 4.5 (6.0%) | 88.3 |
| LINEWATER | 460.8 | 682.5 | 112.4 | 14.8 (13.2%) | 7.4 (6.6%) | 134.6 |
| PARKS | 850.2 | 1420.1 | 230.5 | 28.9 (12.5%) | 15.0 (6.5%) | 274.3 |
5.4. Component Effectiveness Analysis in Sensor Stream Workloads
To quantify the contribution of each core component, namely the hierarchical MBR (HMBR), the Delta Buffer (DB), and the Cuckoo Filter (CF), we conducted an ablation study. The evaluation utilized a query selectivity of 1% and a workload characterized by 50% continuous sensor data ingestion.
5.4.1. Effectiveness of Precision Spatial Filtering
Table 5 compares the size of the candidate set passed to the precise signal verification phase. The original GLIN relies on coarse-grained MBR filtering, which results in a high false-positive rate and a massive candidate set (e.g., 83,588 for AREAWATER), directly causing the refinement bottleneck in real-time monitoring.
Table 5.
Effectiveness of precise boundary filtering in reducing computational overhead.
| Method | Dataset | Candidates After MBR Filtering | Candidates After HMBR Filtering | Candidate Set Reduction Rate |
|---|---|---|---|---|
| GLIN | AREAWATER | 83,588 | N/A | 0 |
| GLIN | LINEWATER | 14,790 | N/A | 0 |
| GLIN | PARKS | 23,721 | N/A | 0 |
| DyGLIN | AREAWATER | 83,663 | 23,593 | 0.718 |
| DyGLIN | LINEWATER | 14,573 | 8,992 | 0.383 |
| DyGLIN | PARKS | 24,121 | 16,957 | 0.297 |
In contrast, DyGLIN’s HMBR enables finer-grained spatial pruning for complex sensor observations, reducing the candidate set by 71.8% for AREAWATER, 38.3% for LINEWATER, and 29.7% for PARKS. The significant reduction in processing load directly translates into an improvement in retrieval latency, as shown in Figure 4, which is crucial for the instantaneous sensor feedback loop.
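As a minimal illustration of this two-stage pruning (the function names, tuple layout, and example rectangles are assumptions for illustration, not DyGLIN’s actual code):

```python
# Illustrative sketch of hierarchical MBR filtering: a leaf keeps one coarse
# MBR plus per-object micro-MBRs; a candidate survives only if the query
# window overlaps the coarse object MBR AND at least one of its micro-MBRs.
def overlaps(a, b):
    """Axis-aligned rectangle intersection test; rects are (x1, y1, x2, y2)."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def hmbr_filter(query, leaf_mbr, objects):
    """objects: list of (obj_id, coarse_mbr, [micro_mbrs])."""
    if not overlaps(query, leaf_mbr):
        return []                          # stage 1: prune the whole leaf
    candidates = []
    for oid, mbr, micros in objects:
        if overlaps(query, mbr) and any(overlaps(query, m) for m in micros):
            candidates.append(oid)         # stage 2: micro-MBR pruning
    return candidates
```

An L-shaped geometry whose coarse MBR covers the query window but whose micro-MBRs all miss it is pruned here, saving one call into the expensive exact verification stage.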
5.4.2. Impact of Delta Buffer and Cuckoo Filter
As shown in Table 6, the Delta Buffer is critical for masking the structural maintenance costs inherent in high-precision filters. Enabling HMBR without the buffer leads to a sharp decline in ingestion throughput due to the high computational overhead of frequent structural updates. By introducing the Delta Buffer, DyGLIN restores and even exceeds baseline performance by converting random sensor updates into efficient sequential O(1) memory appends and amortizing reconstruction costs through batch merging.
Table 6.
Performance comparison and ablation analysis: contribution of HMBR and the Delta Buffer to performance improvements.
| Method | Component Added | Deletion Semantics | Dataset | Average Query Time (ms) | Insertion Throughput (ops/s) | Delete Throughput (ops/s) |
|---|---|---|---|---|---|---|
| R-tree | None | Physical | AREAWATER | 25.14 | 0.82 | 0.11 |
| R-tree | None | Physical | LINEWATER | 52.36 | 0.75 | 0.09 |
| R-tree | None | Physical | PARKS | 158.42 | 0.64 | 0.04 |
| Quad-Tree | None | Physical | AREAWATER | 28.56 | 0.88 | 0.15 |
| Quad-Tree | None | Physical | LINEWATER | 67.21 | 0.81 | 0.12 |
| Quad-Tree | None | Physical | PARKS | 172.15 | 0.70 | 0.06 |
| GLIN | None | Physical | AREAWATER | 9.83 | 1.01 | 0.72 |
| GLIN | None | Physical | LINEWATER | 19.19 | 0.92 | 0.65 |
| GLIN | None | Physical | PARKS | 64.23 | 0.84 | 0.41 |
| DyGLIN-NoBuffer | + HMBR | Logical | AREAWATER | 6.45 | 0.14 | 0.98 |
| DyGLIN-NoBuffer | + HMBR | Logical | LINEWATER | 15.12 | 0.28 | 0.79 |
| DyGLIN-NoBuffer | + HMBR | Logical | PARKS | 51.29 | 0.11 | 0.53 |
| DyGLIN | + Delta Buffer + Deletion Filter | Logical | AREAWATER | 6.04 | 1.34 | 1.04 |
| DyGLIN | + Delta Buffer + Deletion Filter | Logical | LINEWATER | 15.23 | 1.25 | 0.81 |
| DyGLIN | + Delta Buffer + Deletion Filter | Logical | PARKS | 51.34 | 1.02 | 0.53 |
Regarding data freshness, both DyGLIN and its no-buffer variant demonstrate superior throughput compared to GLIN thanks to the Cuckoo Filter (CF), which enables O(1) logical signal masking. Unlike traditional methods that require hardware-intensive data relocation for removals, DyGLIN simply flags invalid or expired sensor identifiers in the CF, avoiding immediate structural reorganization. This design not only enhances deletion efficiency for transient data but also proactively removes invalid objects from the candidate set, further minimizing false positives during real-time spatial analysis.
As shown in Table 4, DyGLIN’s auxiliary structures (HMBR and Cuckoo Filter) introduce a modest 15–20% memory overhead compared to GLIN. This trade-off enables substantial query and ingestion throughput improvements.
5.5. Correctness Validation for Cuckoo Filter-Based Logical Deletion
To address potential concerns about the correctness of our Cuckoo Filter-based logical deletion strategy, we conducted a comprehensive correctness validation study. Cuckoo Filters are probabilistic data structures with a small false positive rate (FPR), meaning that they may incorrectly report an object as “deleted” when it actually exists in the index. If not properly managed, this could lead to false negatives (missed results) in spatial queries, which is unacceptable for mission-critical IoT monitoring applications.
5.5.1. Cuckoo Filter Configuration and Theoretical Analysis
We employ the Cuckoo Filter configuration shown in Table 7 to minimize false positives while maintaining memory efficiency. The theoretical false positive rate (FPR) is determined by the fingerprint size f, approximately FPR ≈ 2^(-f). With 24-bit fingerprints, the theoretical FPR is approximately 2^(-24) ≈ 6 × 10^(-8) (0.000006%), which is negligible for most IoT applications. Furthermore, we implement an optional deterministic safeguard mechanism that uses a precise hash set to verify Cuckoo Filter results, guaranteeing zero false negatives at the cost of additional memory overhead (+8 bytes per deleted object).
Table 7.
Cuckoo Filter configuration for logical deletion.
| Parameter | Value and Justification |
|---|---|
| Fingerprint Size | 24 bits (theoretical FPR ≈ 0.000006%) |
| Bucket Size | 8 slots per bucket (4-way associativity) |
| Number of Buckets | 8192 buckets (total capacity: 65,536 items) |
| Max Kickouts | 500 attempts (ensures insertion success rate) |
| Hash Function | 64-bit hash combining MBR coordinates and geometry type |
| Load Factor | Target (maintains low false positive rate) |
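The fingerprint-size dependence can be checked numerically. A caveat: the bound in Fan et al. [31] also involves the bucket size; the sketch below uses the simplified approximation FPR ≈ 2^(-f), which matches the theoretical values reported in Table 8.

```python
# Hedged sketch: theoretical Cuckoo Filter false positive rate as a function
# of fingerprint size f, using the simplified approximation FPR ~ 2^(-f)
# (consistent with the theoretical values in Table 8; the exact bound of
# Fan et al. also depends on bucket size).
def theoretical_fpr(fingerprint_bits: int) -> float:
    """Approximate FPR for a fingerprint of the given bit width."""
    return 2.0 ** -fingerprint_bits

print(f"16-bit: {theoretical_fpr(16):.6%}")   # -> 16-bit: 0.001526%
print(f"24-bit: {theoretical_fpr(24):.7%}")
```

The 16-bit value (~0.0015%) and the 24-bit value (~0.000006%) reproduce the theoretical FPR column of Table 8, and doubling the fingerprint from 16 to 24 bits shrinks the FPR by a factor of 2^8 = 256.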
5.5.2. Experimental Correctness Validation
We conducted end-to-end correctness experiments on the AREAWATER dataset under a hybrid workload consisting of 10,000 object insertions, 1000 deletions (10% deletion rate), and 2000 spatial range queries. Table 8 presents the results.
Table 8.
End-to-End correctness validation results.
| Configuration | Recall | Precision | F1 Score | Measured FPR | Theoretical FPR |
|---|---|---|---|---|---|
| Baseline (16-bit FP, no safeguard) | 98.53% | 100.00% | 99.26% | 1.47% | 0.0015% |
| Recommended (24-bit FP, with safeguard) | 100.00% | 100.00% | 100.00% | 0.01% | 0.000006% |
The baseline configuration (16-bit fingerprints) achieves a respectable 98.53% recall rate, demonstrating that even with modest parameters, our Cuckoo Filter implementation maintains high correctness. The measured false positive rate (1.47%) is higher than the theoretical value (0.0015%), primarily due to hash collisions inherent in spatially clustered sensor data. However, this is entirely acceptable for non-critical monitoring applications.
For mission-critical deployments requiring zero-tolerance for missed results, our recommended configuration with 24-bit fingerprints and deterministic safeguard achieves perfect 100% recall and precision. The measured FPR of 0.01% closely matches the theoretical expectation, confirming the correctness of our implementation. The deterministic safeguard mechanism eliminates all false negatives by performing a secondary exact lookup when the Cuckoo Filter reports a “deleted” status, with only marginal memory overhead (+8 bytes per deleted object).
While our experiments use standard real-world sensor datasets (AREAWATER, LINEWATER, and PARKS) that exhibit natural spatial clustering, we analyze the expected behavior under more extreme data distributions. The Cuckoo Filter’s false positive rate is primarily determined by the fingerprint size f (approximately FPR ≈ 2^(-f)) and is independent of the data distribution [31]. This theoretical guarantee holds even under highly skewed workloads. For spatial data with extreme clustering, hash collisions may increase slightly due to coordinate locality in MBR-based hashing. However, our 24-bit fingerprint configuration provides a substantial safety margin: even if the FPR increases by 10× under extreme skew, it remains negligible (0.00006% ≪ 1%). The hierarchical MBR’s effectiveness may actually improve under skewed data, as spatial clustering leads to tighter micro-MBR boundaries. Previous work on learned indexes (e.g., Tsunami [15]) has demonstrated that correlated and skewed data can improve model prediction accuracy, which translates to better spatial filtering in our context.
5.6. Sensitivity Analysis
To understand DyGLIN’s behavior under different design configurations and to identify optimal parameter settings for diverse deployment scenarios, we conducted sensitivity analysis on two critical parameters: HMBR granularity (number of micro-MBRs per leaf node) and Delta Buffer capacity B.
5.6.1. Impact of HMBR Granularity
The number of micro-MBRs per leaf node controls the trade-off between query precision and memory overhead. Table 9 shows how varying this parameter from 4 to 256 micro-MBRs affects query performance and memory consumption on the AREAWATER dataset. At the lower end (4 micro-MBRs), the average query time is 8.24 ms, as coarse-grained filtering allows more false positives to pass through to the expensive GEOS verification stage. Increasing to 16 micro-MBRs achieves the best balance, reducing query latency to 6.04 ms (a 27% improvement) with only a modest 12.5% memory increase (from 78.5 MB to 88.3 MB).
Table 9.
Impact of HMBR granularity (number of micro-MBRs per leaf node) on performance (AREAWATER dataset).
| Number of Micro-MBRs | Avg Query Time (ms) | Memory Consumption (MB) |
|---|---|---|
| 4 | 8.24 | 78.5 |
| 16 | 6.04 | 88.3 |
| 64 | 5.85 | 115.2 |
| 256 | 5.76 | 194.8 |
Further increasing granularity to 64 or 256 micro-MBRs yields diminishing returns: Query time improves only marginally (5.85 ms and 5.76 ms, respectively), while memory consumption grows significantly (115.2 MB and 194.8 MB). The 256-micro-MBR configuration consumes 2.21× more memory than the 16-micro-MBR baseline for only a 4.6% latency reduction. These results justify our choice of 16 micro-MBRs as the default configuration, which provides an optimal cost–benefit trade-off for resource-constrained edge nodes.
5.6.2. Impact of Delta Buffer Capacity
The buffer capacity B determines how many sensor updates are accumulated before triggering a merge operation, affecting both write throughput and query freshness. Table 10 evaluates DyGLIN’s performance under four buffer sizes: 16, 64, 256, and 1024 objects. Small buffers (B = 16) result in lower write throughput (0.85 ops/s) due to frequent merge operations, though query latency remains competitive (5.82 ms) because the index structure is frequently updated.
Table 10.
Performance under different B buffer capacities (AREAWATER dataset).
| Buffer Capacity B | Write Throughput (ops/s) | Avg Query Latency (ms) | P95 Query Latency (ms) |
|---|---|---|---|
| 16 | 0.85 | 5.82 | 7.41 |
| 64 | 1.22 | 5.91 | 7.68 |
| 256 | 1.34 | 6.04 | 8.14 |
| 1024 | 1.38 | 6.45 | 10.52 |
Increasing the buffer capacity to 256 yields the best overall performance, achieving a peak write throughput of 1.34 ops/s while maintaining acceptable query latency (6.04 ms, P95: 8.14 ms). Larger buffers (B = 1024) provide minimal throughput gains (1.38 ops/s, only 3% improvement) but increase P95 query latency by 29% (from 8.14 ms to 10.52 ms) due to stale buffered data not being visible to queries until the next merge. For most IoT applications with balanced read/write workloads, B = 256 represents the optimal configuration. However, the buffer capacity can be tuned based on workload characteristics: write-heavy scenarios may benefit from B = 512 or B = 1024, while read-sensitive applications should use smaller buffers (B = 64 or B = 128) to ensure query freshness.
6. Conclusions
This paper identifies and formalizes the “refinement bottleneck” as a critical performance barrier in real-time spatial analytics for sensor-driven environments. We demonstrate that traditional coarse-grained MBR filtering in learned indexes leads to excessive false positives, which significantly delays the immediate feedback loop required by IoT applications. To resolve this challenge, we propose DyGLIN, a dynamic and read-optimized learned spatial index specifically tailored for the high-frequency and hybrid workloads of modern sensor networks.
DyGLIN introduces a decoupled leaf node architecture that successfully balances precise spatial retrieval with high-speed data ingestion. By integrating multi-stage filtering mechanisms, which include hierarchical MBRs and Cuckoo Filters, DyGLIN effectively prunes false positives and minimizes the candidate set passed to the computationally expensive geometric verification stage. Furthermore, our Delta Buffer mechanism effectively amortizes the structural maintenance costs, ensuring high throughput during continuous sensor stream intake.
Extensive evaluations on real-world datasets prove that DyGLIN effectively overcomes the processing bottlenecks inherent in current learned indexing structures. Compared to the state-of-the-art GLIN, DyGLIN reduces query latency by 26.4% [95% CI: 20.1%, 38.6%], enhances ingestion throughput by 30.0% [95% CI: 21.4%, 35.9%], and provides superior deletion performance through its logical signal-masking strategy. Notably, while maintaining superior retrieval immediacy, DyGLIN is 69% [95% CI: 67.1%, 70.9%] more space-efficient than traditional R-trees (averaged across three datasets), making it highly suitable for deployment on resource-constrained edge computing nodes. Future research will focus on extending DyGLIN to support high-dimensional multi-sensor fusion and exploring lock-free concurrency control mechanisms to further enhance its scalability in multi-threaded IoT gateways.
Abbreviations
| DyGLIN | Dynamic Generate Learning-based Index |
| CDF | Cumulative Distribution Function |
| CF | Cuckoo Filter |
| DB | Delta Buffer |
| HMBR | Hierarchical MBR |
| ID | Identifier |
| MBR | Minimum Bounding Rectangle |
| MDS | Main Data Store |
| SFC | Space-Filling Curve |
| WODS | Write-Optimized Data Structure |
| QRT | Query Response Time |
| SRQS | Spatial Range Queries |
Author Contributions
Conceptualization and methodology, H.L.; validation and analysis, J.W. and Z.L.; review and editing, D.C.; writing—original draft preparation, H.L.; supervision and project administration, M.X., J.H., and S.X.; funding acquisition, J.Y. All authors have read and agreed to the published version of this manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding Statement
This work was supported by the National Natural Science Foundation of China under Grants 62271345, 62306211, 62301356, and 62403349; the China Postdoctoral Science Foundation under 2023M742608; and the Postdoctoral Fellowship Program of CPSF GZC20231919 and GZC20252290.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1. Xi M., Wen J., He J., Xiao S., Yang J. An Expert Experience-Enhanced Security Control Approach for AUVs of the Underwater Transportation Cyber-Physical Systems. IEEE Trans. Intell. Transport. Syst. 2025;26:14086–14098. doi: 10.1109/TITS.2024.3524730.
- 2. Yang Y., Yi H., Xi M., Wen J., Yang J. GenAI-Driven Unsupervised Denoising for Consumer Device Imagery. IEEE Consum. Electron. Mag. 2025;14:94–102. doi: 10.1109/MCE.2025.3538904.
- 3. Al Jawarneh I.M., Foschini L., Bellavista P. Polygon simplification for the efficient approximate analytics of georeferenced big data. Sensors. 2023;23:8178. doi: 10.3390/s23198178.
- 4. Kadav P., Sharma S., Rojas J.F., Patil P., Wang C., Ekti A.R., Meyer R.T., Asher Z.D. Automated lane centering: An off-the-shelf computer vision product vs. infrastructure-based chip-enabled raised pavement markers. Sensors. 2024;24:2327. doi: 10.3390/s24072327.
- 5. Zhang X., Bai W., Liu J., Yang S., Shang T., Liu H. Enhancing geomagnetic navigation with PPO-LSTM: Robust navigation utilizing observed geomagnetic field data. Sensors. 2025;25:3699. doi: 10.3390/s25123699.
- 6. Yu J., Wu J., Sarwat M. GeoSpark: A cluster computing framework for processing large-scale spatial data. Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems; Seattle, WA, USA, 3–6 November 2015; pp. 1–4.
- 7. Eldawy A., Mokbel M.F. SpatialHadoop: A MapReduce framework for spatial data. Proceedings of the 2015 IEEE 31st International Conference on Data Engineering; Seoul, Republic of Korea, 13–17 April 2015; pp. 1352–1363.
- 8. Balderas-Díaz C., Miraz M., Hossain M.A., Guevara V. ADEPT Framework: Optimizing Data Flow in Dynamic Environments for Ambient Assisted Living. Proceedings of the International Symposium on Ambient Intelligence; Salamanca, Spain, 20–22 November 2024; pp. 1–8.
- 9. Guttman A. R-trees: A dynamic index structure for spatial searching. Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data; Boston, MA, USA, 18–21 June 1984; pp. 47–57.
- 10. Samet H. The quadtree and related hierarchical data structures. ACM Comput. Surv. 1984;16:187–260. doi: 10.1145/356924.356930.
- 11. Pandey V., van Renen A., Kipf A., Sabek I., Ding J., Kemper A. The case for learned spatial indexes. arXiv. 2020. doi: 10.48550/arXiv.2008.10349.
- 12. Kraska T., Beutel A., Chi E.H., Dean J., Polyzotis N. The case for learned index structures. Proceedings of the 2018 International Conference on Management of Data; Houston, TX, USA, 10–15 June 2018; pp. 489–504.
- 13. Al-Mamun A., Wu H., He Q., Wang J., Aref W.G. A survey of learned indexes for the multi-dimensional space. ACM Comput. Surv. 2025;58:1–37. doi: 10.1145/3768575.
- 14. Nathan V., Ding J., Alizadeh M., Kraska T. Learning multi-dimensional indexes. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data; Portland, OR, USA, 14–19 June 2020; pp. 985–1000.
- 15. Ding J., Nathan V., Alizadeh M., Kraska T. Tsunami: A learned multi-dimensional index for correlated data and skewed workloads. arXiv. 2020. doi: 10.14778/3425879.3425880.
- 16. Wang H., Fu X., Xu J., Lu H. Learned index for spatial queries. Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM); Hong Kong, China, 10–13 June 2019; pp. 569–574.
- 17. Li P., Lu H., Zheng Q., Yang L., Pan G. LISA: A learned index structure for spatial data. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data; Portland, OR, USA, 14–19 June 2020; pp. 2119–2133.
- 18. Qi J., Liu G., Jensen C.S., Kulik L. Effectively learning spatial indices. Proc. VLDB Endow. 2020;13:2341–2354. doi: 10.14778/3407790.3407829.
- 19. Pai S., Mathioudakis M., Wang Y. Wazi: A learned and workload-aware z-index. arXiv. 2023. arXiv:2310.04268.
- 20. Wang C., Yu J., Zhao Z. GLIN: A (G)eneric (L)earned (In)dexing Mechanism for Complex Geometries. Proceedings of the 11th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data; Hamburg, Germany, 13–16 November 2023; pp. 1–12.
- 21. Wu J., Zhang Y., Chen S., Wang J., Chen Y., Xing C. Updatable learned index with precise positions. arXiv. 2021. doi: 10.14778/3457390.3457393.
- 22. Hidaka F., Matsui Y. Flexflood: Efficiently updatable learned multi-dimensional index. arXiv. 2024. arXiv:2411.09205.
- 23. Tang W., Zhang C., Yang J., Wu J., Huang H. Updatable Spatial Learned Index Based on Dimensionality Reduction. Proceedings of the 2024 10th International Conference on Big Data and Information Analytics (BigDIA); Shenzhen, China, 23–25 August 2024; pp. 757–764.
- 24. Beckmann N., Kriegel H.-P., Schneider R., Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data; Atlantic City, NJ, USA, 23–25 May 1990; pp. 322–331.
- 25. Sellis T., Roussopoulos N., Faloutsos C. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. Carnegie Mellon University; Pittsburgh, PA, USA: 1987.
- 26. Ding J., Minhas U.F., Yu J., Wang C., Do J., Li Y., Zhang H., Chandramouli B., Gehrke J., Kossmann D., et al. ALEX: An updatable adaptive learned index. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data; Portland, OR, USA, 14–19 June 2020; pp. 969–984.
- 27. Ferragina P., Vinciguerra G. The PGM-index: A fully-dynamic compressed learned index with provable worst-case bounds. Proc. VLDB Endow. 2020;13:1162–1175. doi: 10.14778/3389133.3389135.
- 28. Mishra M., Singhal R. RUSLI: Real-time updatable spline learned index. Proceedings of the Fourth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management; Virtual Event, China, 20 June 2021; pp. 1–8.
- 29. Davitkova A., Milchevski E., Michel S. The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries. Proceedings of the 23rd International Conference on Extending Database Technology (EDBT); Copenhagen, Denmark, 30 March–2 April 2020; pp. 407–410.
- 30. Patil M., Ravishankar C.V. Model Reuse in Learned Spatial Indexes. Proceedings of the 36th International Conference on Scientific and Statistical Database Management; Santa Cruz, CA, USA, 10–12 July 2024; pp. 1–12.
- 31. Fan B., Andersen D.G., Kaminsky M., Mitzenmacher M.D. Cuckoo filter: Practically better than Bloom. Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies; Sydney, Australia, 2–5 December 2014; pp. 75–88.
- 32. Mao Q., Qader M.A., Hristidis V. Comparison of LSM indexing techniques for storing spatial data. J. Big Data. 2023;10:51. doi: 10.1186/s40537-023-00734-3.
- 33. Arge L. The buffer tree: A new technique for optimal I/O-algorithms. Proceedings of the Workshop on Algorithms and Data Structures; Kingston, ON, Canada, 16–18 August 1995; pp. 334–345.