Fast continuous streaming sort in big streaming data environment under fixed-size single storage

Suluk Chaikhan; Suphakant Phimoltares; Chidchanok Lursinsap

doi:10.1371/journal.pone.0266295

. 2022 Apr 5;17(4):e0266295. doi: 10.1371/journal.pone.0266295

Fast continuous streaming sort in big streaming data environment under fixed-size single storage

Suluk Chaikhan ¹, Suphakant Phimoltares ^1,^*, Chidchanok Lursinsap ¹

Editor: Mohamed Hammad²

PMCID: PMC8982863 PMID: 35381032

Abstract

Big streaming data environment concerns a complicated scenario where data to be processed continuously flow into a processing unit and certainly cause a memory overflow problem. This obstructs the adaptation of deploying all existing classic sorting algorithms because the data to be sorted must be entirely stored inside the fixed-size storage including the space in internal and external storage devices. Generally, it is always assumed that the size of each data chunk is not larger than the size of storage (M) but in fact the size of the entire stream (n) is usually much larger than M. In this paper, a new fast continuous streaming sorting is proposed to cope with the constraint of storage overflow. The algorithm was tested with various real data sets consisting of 10,000 to 17,000,000 numbers and different storage sizes ranging from 0.01n to 0.50n. It was found that the feasible lower bound of storage size is 0.35n with 100% sorting accuracy. The sorting time outperforms bubble sort, quick sort, insertion sort, and merge sort when data size is greater than 1,000,000 numbers. Remarkably, the sorting time of the proposed algorithm is 1,452 times less than the sorting time of external merge sort and 28.1767 times less than the sorting time of streaming data sort. The time complexity of proposed algorithm is O(n) while the space complexity is O(M).

Introduction

Sorting is one of the fundamental algorithms of various applications such as fast searching, computing quartiles, and finding duplicates of a given number set. Many previously proposed classic sorting methods, e.g. quick sort, heap sort, and merge sort, are very efficient in terms of time complexity. However, these methods are based on the strong conventional assumption that the entire set of data must be stored in a processing memory, including internal and external memory, during the sorting process. Storing the entire data set prior to the sorting process is feasible if the data size does not overflow the physical capacity of memory. But due to the present disruption of big data, it is impossible to retain the entire data set in the memory during the sorting process. Big data are continuously generated in various applications such as internet communication, sensor networks, and pattern cognition in artificial intelligence applications. Using massive data is a major issue in various branches of the computational sciences [1–3]. As the growth of data size exponentially escalating [4–6], data computing and analysis are becoming extremely difficult, resulting in the need for huge memory resources for the sorting process and massive storage for storing the results [7, 8]. Current VLSI technological development to increase the capacity of memory and storage falls behind the explosion of data consumption demand [3, 9, 10]. The telecommunication industry, business intelligence, data mining, cloud computing, mobile technology, and machine learning are confronted with the issue of memory overflow crisis when processing huge amounts of data [11–14]. This obstruction causes a severe data loss and induces the incorrect results for solving a problem. Hence, a new sorting algorithm capable of handling memory overflow must be developed. The time complexity of this algorithm must not exceed the worst time complexity of classic sorting algorithms.

Although classic sorting algorithms such as bubble sort, insertion sort, merge sort, and quick sort are popular, they have been improved to work faster. The time complexity is always the main focus of improvement. Singh et al. [15] proposed a new algorithm, namely, RVA sorting based on bubble sort and quick sort. The number of comparisons and the execution time are less than those of the existing algorithms. Wild et al. [16] analyzed the speed of Yaroslavsky’s quick sort in terms of the number of comparisons, swaps, and bytecodes. The cost of Yaroslavsky’s algorithm can be converged by a fixed-point equation. Agrawal and Sriram [17] proposed a new technique called the Concom sorting algorithm based on a comparison and counting method. The results of the execution time and comparison are better than bubble sort and insertion sort. Osama et al. [18] proposed a mapping sorting algorithm based on mapping technique. The mapping is not related to comparison or swapping, but it approximates temporary order by a linear equation. The wrong values at any positions are corrected through swapping or rearranging steps. Vignesh and Pradhan [19] improved merge sort by using multiple pivots to sort data in the array. The time complexities of the best and worst cases are O(n) and O(n log n), respectively. Idrizi et al. [20] modified the counting sort algorithm by reducing the number of comparisons and the execution time. The sorting result surpasses quick sort, bubble sort, and merge sort. Mohammed et al. [21] proposed bidirectional conditional insertion sort that improves insertion sort and reduces the number of comparisons by using two pivots as left-comparator and right-comparator. Goel and Kumar [22] proposed two new sorting algorithms called Brownian motus insertion sort and Clustered binary insertion sort to reduce the number of comparisons. The time complexity of first algorithms is $O (\sqrt[0.54]{n})$ while the second algorithm takes O(n log n). Omar et al. [23] generated a new algorithm called double hashing sort based on mapping sort algorithm and hashing technique. The proposed algorithm can reduce both the number of comparisons and the execution time. Recently, new sorting algorithms have been proposed to handle big data with small primary memory size by using new auxiliary space as secondary memory or hard disks to merge all sorted parts of big data. Zushi and Goswami [24] proposed a new quadratic sorting algorithm by considering an unsorted sequence as a set of disjoint sorted sequences. The execution time of the proposed algorithm is more efficient than those of insertion sort, bubble sort, selection sort, and quick sort. Gugale [25] proposed super-sort sorting algorithm by picking sorted sequence in unsorted sequence. This concept can reduce the number of steps in sorting process. The time complexity of worst case is O(n log n). Lee et al. [26] proposed ActiveSort for external sorting using SSDs. The concept involves using ActiveSort to reduce the number of read/write operations. This can reduce the amount of write operations to 40.4% of the original value. Laga et al. [27] proposed MONTRES, external sorting algorithm on SSDs to reduce the run time of sorting process by decreasing read and write costs of input and output operations and the size of temporary data. Liang et al. [28] designed B*-sort, a new sorting algorithm, which can reduce the amount of write traffic on Nonvolatile random access memory (NVRAM). Moreover, the structure used in this algorithm is similar to the binary search tree, which takes the time complexity of O(n log n). Arge and Thorup [29] studied I/O and RAM models to design I/O-efficient as well as RAM-efficient priority queues, and they also analyzed the lower bounds of the models. Kanza and Yaari [30] proposed a new external sorting method based on flash storage that avoids writing to the hard disk. In addition, Elder and Goh [31] designed a new sorting algorithm by a finite stack and an infinite stack on permutation sequence. Although this algorithm can sort streaming data, the sorting result may be not correct in terms of order and value.

It is noticeable that all classic sorting algorithms are designed by storing the entire data set prior to the sorting process. This implies that the data size must be at most as large as the memory size. On the contrary, streaming data continuously enter into the memory as a chunk whose size cannot exceed the memory size at any time. Obviously, any classic sorting algorithm cannot achieve the accurate sort because the algorithm sorts only the data presently stored in the memory. The rest of data not entering the memory are not sorted. One potential solution to resolve the problem of classic sorting algorithms is to use external sorting. However, all external sorting algorithm must rely on the size of external storage. With current data storage technology, the increase of storage size cannot keep pace with the increase of generated data, such as internet data [32, 33].

Recently, Chaikhan et al. [34] proposed an algorithm called streaming data sort to continuously sort incoming big streaming data by using a uniprocessor and only one internal fixed-size working memory. The memory size is much smaller than the total streaming data size. No external storage for storing overflowed data is required. The data flow into the working memory as a streaming chunk. After sorting the incoming data, these data are discarded and represented in forms of compact and single groups. The sorting time complexity is O(n), where n is the size of streaming data and the space complexity is O(M), where M is the size of working memory. Moreover, unlike the external sort, streaming data sort does not require any external space for the merge process. Streaming data sort merges sorting results and current incoming data based on a limited memory that exists at the beginning of the process, whereas the merging space, which is always much greater than the size of huge data, is strongly required for the external sort. Although streaming data sort can efficiently solve the sorting problem with limited memory size, the algorithm is limited to some specific distances between any two consecutive numbers and remains incapable of handling duplicate numbers. Let n be the size of streaming data, M ≪ n be the size of working memory, k ≥ 1, and ϵ > 0. The comparison of existing sorting algorithms and our fast streaming data sort are shown in Table 1, which contains eight columns: (1) compatibility with duplicate data, (2) stability, (3) compatibility with streaming data, (4) extra storage needed, (5) working space, (6) time complexity, (7) correct orders, and (8) correct values.

Table 1. Comparison of sorting algorithms on streaming data.

Sorting algorithms	Compatibility with duplicate data	Stability	Compatibility with streaming data	Extra storage needed	Working space	Time complexity	Correct orders	Correct values
Bubble sort	Yes	Yes	No	No	n + ϵ	O(n²)	Yes	Yes
Selection sort	Yes	No	No	No	n + ϵ	O(n²)	Yes	Yes
Insertion sort	Yes	Yes	Yes	No	n + ϵ	O(n²)	Yes	Yes
Quick sort	Yes	No	No	No	n + ϵ	O(n²)	Yes	Yes
Merge sort	Yes	Yes	No	Yes	n + ϵ	O(n lg n)	Yes	Yes
Heap sort	Yes	No	No	No	n + ϵ	O(n lg n)	Yes	Yes
External sorting	Yes	Yes	Yes	Yes	kn + ϵ	N/A	Yes	Yes
Permutation sort [31]	Yes	Yes	Yes	Yes	kn + ϵ	N/A	No	No
Streaming data sort [34]	No	Yes	Yes	No	≤ 0.35n	O(n)	Yes	Yes
Fast streaming data sort	Yes	Yes	Yes	No	≤ 0.35n	O(n)	Yes	Yes

Open in a new tab

n is the total streaming numbers to be sorted, M ≪ n is the limited size of working storage, k ≥ 0, and ϵ > 0.

Data lifecycle [35–39] is another essential issue in data analysis and data mining. Each datum passes various stages during its lifetime, starting from creation, analysis, and so on until expiration, possibly due to being obsolete. In practice, any expired data must be removed from the relevant database to gain available space and to keep the data up-to-date. Streaming data also possess this lifecycle trait without exception. The method of streaming data sort [34] is incapable of managing expired data after being sorted and captured in a compact representation. However, this issue is resolved in this study.

The rest of this paper is organized into the following sections. The “Studied Problems and Constraints” section discusses the problem studied in this paper and the concerned constraints. The “Background” section summarizes the related works. The “Definitions and Notations” section defines the notations used in this paper. The “Concepts” section explains the concepts of the proposed approach. The “Proposed fast streaming data sort Algorithms” section provides the details of proposed algorithms. The “Experimental Results and Discussion” section provides the experimental results for the proposed algorithms as well as the discussion of the results. The “Conclusion” section concludes the paper.

Studied problems and constraints

The basic studied problem in this paper focuses on the sorting of streaming data under the constraints of limited working memory size and a uniprocessor similar to the problem studied in [17]. Although the sorting algorithm in [34] can efficiently sort the streaming data under the imposed constraints, it assumes that there are no duplicate data in the streaming input. This assumption is impractical for many real-world data sets. Furthermore, the condition of data lifecycle was not taken into account. On the contrary, this study concerns the issues of duplicate data and data lifecycle as new constraints. Only integer data are considered, and any sorted number with its order in the sorted sequence must be correctly recalled from its compact group.

Background

Streaming data sort [34] proposed a sorting algorithm for large streaming data by representing sorted sub-sequences in forms of compact groups so that the sorting process could be executed under a limited working memory size on a uniprocessor. The compacting process is a major concept of streaming data sort to efficiently represent the sorted data set to avoid memory overflow. Four particular sub-sequences of type-1, type-2, type-3, and type-4 are compacted to type-1 compact groups (u, v)⁽¹⁾, type-2 compact groups (u, v)⁽²⁾, type-3 compact groups (u, v)⁽³⁾, and type-4 compact groups (u, v)⁽⁴⁾, respectively. Type-1 is a sub-sequence of data where all differences between any two consecutive data are one. Type-2 is a sub-sequence of data where the first difference is one and all differences of two consecutive data alternate between one and two. Type-3 is a sub-sequence of data wherein the first difference is two and all differences of two consecutive data alternate between one and two. Finally, type-4 is a sub-sequence of data wherein all differences of any two consecutive data are two. For examples of the four types compacted to compact groups: (1) a type-1 sub-sequence (14, 15, 16, 17, 18) is compacted to a type-1 compact group (14, 18)⁽¹⁾ in which (15, 16, 17) are removed from memory and called compressed data; (2) a type-2 sub-sequence (14, 15, 17, 18, 20) is compacted to a type-2 compact group (14, 20)⁽²⁾ and (15, 17, 18) are removed from memory; (3) a type-3 sub-sequence (14, 16, 17, 19, 20) is compacted to a type-3 compact group (14, 20)⁽³⁾ and (16, 17, 19) are removed from memory; (4) a type-4 sub-sequence (14, 16, 18, 20, 22) is compacted to a type-4 compact group (14, 22)⁽⁴⁾ and (16, 18, 20) are removed from memory. A compact group of any type is invertible to the sub-sequence of type p where p is 1, 2, 3 and 4. For example, (14, 20)⁽³⁾ = (14, 16, 17, 19, 20). A result of compressing all data in memory usage at iteration t is called the compact set Q^(t), which contains two member types: compact group and single data. Note that type-p of any sub-sequence (w_i, w_i+1, w_i+2, …, w_i+l) can be computed by

\begin{matrix} p = w_{i + 2} + w_{i + 1} - 2 (w_{i} + 1) where i \in N . \end{matrix}

(1)

To create Q^(t+1), the new incoming data in the (t + 1)^th iteration are investigated together with single data and inversion of the compact group in Q^(t). The inversion of the compact group can be generated by the recalling process. Inserting incoming data into Q^(t) affects the sorting pattern of single data and compact groups in Q^(t+1). Fig 1 illustrates an example of 20 simple data points sorted by the streaming data sort algorithm, where the memory usage size is 10, which is half of the data size. Note that type-1 and type-4 have higher compression precedence than type-2 and type-3.

Definitions and notations

Since this paper concerns the improvement of the algorithm in [34], the definitions of introduced sequence types must be recapped here as follows.

Definition 1 Streaming data sequence D = (d₁, d₂, d₃, …, d_n), for 1 ≤ n ≤ ∞, is a finite or infinite sequence of integers to be sorted.

The streaming data enter the sorting process in the form of a consecutive chunk of data. There are four types of relations among the data in each chunk, which are defined as follows.

Definition 2 [34] Type-1 sub-sequence T₁ = (d_i, …, d_i+l) ⊆ D is a sequence such that ∀d_i, d_i+1 ∈ T₁: |d_i − d_i+1| = 1.

Definition 3 [34] Type-2 sub-sequence T₂ = (d_i, …, d_i+l) ⊆ D is a sequence such that ∀d_i+a, d_i+a+1 ∈ T₂, 0 ≤ a ≤ l − 1: |d_i+a − d_i+a+1| = 1 when a is even and |d_i+a − d_i+a+1| = 2 when a is odd.

Definition 4 [34] Type-3 sub-sequence T₃ = (d_i, …, d_i+l) ⊆ D is a sequence such that ∀d_i+a, d_i+a+1 ∈ T₂, 0 ≤ a ≤ l − 1: |d_i+a − d_i+a+1| = 2 when a is even and |d_i+a − d_i+a+1| = 1 when a is odd.

Definition 5 [34] Type-4 sub-sequence T₄ = (d_i, …, d_i+l) ⊆ D is a sequence such that ∀d_i, d_i+1 ∈ T₁: |d_i − d_i+1| = 2.

Definition 6 Insert position ins(⋅) of the i^th datum d_i into type-p compact group (u, v)^(p) is the order of location of d_i with respect to all integers in the compact group (u, v)^(p). This position is computed by the following equations:

\begin{matrix} i n s (d_{i}) = {\begin{matrix} 2 ⌊ \frac{d_{i} - u}{3} ⌋ + (d_{i} - u) mod 3 + 1 & f o r p = 2, 3 \\ ⌈ \frac{d_{i} - u}{3} ⌉ + 1 & f o r p = 4 . \end{matrix} \end{matrix}

(2)

Definition 7 Size of compact group (u, v)^(p) denoted by |(u, v)^(p)| is the number of all possible existent integers inside the interval (u, v)^(p) computed by the following equations:

\begin{matrix} | {(u, v)}^{(p)} | = {\begin{matrix} v - u + 1 & f o r p = 1 \\ 2 ⌊ \frac{v - u}{3} ⌋ + (v - u) mod 3 & f o r p = 2 \\ - \frac{1}{2} ⌊ \frac{(v - u) mod 3}{2} ⌋ + 1 \\ 2 ⌊ \frac{v - u}{3} ⌋ + \frac{(v - u) mod 3}{2} + 1 & f o r p = 3 . \\ \frac{v - u}{2} + 1 & f o r p = 4 . \end{matrix} \end{matrix}

(3)

Notably, if d_i is in (u, v)^(p), then |(u, d_i)^(p)| is the position of d_i in (u, v)^(p).

Definition 8 A new incoming datum d_i is a duplicate of the already existent d_i ∈ (u, v)^(p) if d_i complies with one of the following conditions.

If u ≤ d_i ≤ v for the type-1 compact group.
If u ≤ d_i ≤ v and (d_i − u)mod3 = 0 or 1 for the type-2 compact group.
If u ≤ d_i ≤ v and (d_i − u)mod3 = 0 or 2 for the type-3 compact group.
If u ≤ d_i ≤ v and (d_i − u)mod2 = 0 for the type-4 compact group.

Each condition is based on the fact that the distance between two consecutive integers in each compact group must comply with the definitions 2, 3, 4, and 5. Thus, if a compact group already contains d_i, then another new incoming d_i must be obviously congruent with the distance defined in either 2, 3, 4, or 5. To denote a datum d_i with no duplicate and a datum with some duplicates, a superscript (f) is attached to d_i as $d_{i}^{(f)}$ , for f ≥ 1, to count the duplicates of d_i in the corresponding compact group. Obviously, $d_{i}^{(1)}$ implies that d_i has no duplicate.

Concepts

Streaming data sort [34] updates the present set of compact groups when a new incoming datum d_α enters the sorting process by unfolding the relevant compact groups and rearranging the data with d_α. After unfolding and rearranging data of the relevant compact group, a new set of compact groups is updated next, possibly by merging or decomposing the present compact groups. Obviously, these steps lengthen the sorting time. Another issue not considered by Streaming data sort is data duplication. Unlike Streaming data sort, the steps of unfolding and rearranging compact groups are eliminated. New steps are proposed by deploying Definition 6 with Definition 8 to speed up the sorting process as well as to handle data duplication.

Suppose that the working memory size is set to m and that there is only one incoming datum d_α at a time. Datum d_α is either inserted into one of the present compact groups or becomes a single number by itself if it cannot be inserted into any compact group. At any time t, there are two sets of interest: a set of compact groups called the compact set denoted by Q^(t) and a set of duplicates R^(t). At any time t, there exists a new incoming datum d_α. The conceptual steps to handle d_α and its duplicates are the followings.

Instead of unfolding the concerned compact group to obtain all numbers in the compact group, the insert position of d_α is directly computed by applying Definition 6.
Suppose d_α is inserted into compact group (u, v)^(p). Check whether (u, v)^(p) is of types 2, 3, or 4, then recursively split (u, v)^(p) at position of d_α into smaller compact groups.
Test the duplicate condition by using Definition 8. If the condition is satisfied, then update R^(t).
Merge any adjacent compact groups into one compact group if they have the same sub-sequence type.

By splitting the current compact group into several smaller compact groups with reference to the position of d_α, the sorting algorithm can achieve faster speed than the speed of Streaming data sort. Fig 2 illustrates an example of how the concept of fast streaming data sort works. There are 21 numbers to be sorted. The size of working memory is set to m = 12. The incoming data whose size is at most 12 flow into the working memory one chunk at a time.

The first incoming chunk includes 10 numbers with two duplicates: 2, 7, 2, 10, 6, 9, 8, 4, 6, 12. These data are sorted and compacted into two compact groups of types 1 and 4, and one single number is assigned to the compact set Q⁽¹⁾ = ((2, 6)⁽⁴⁾, (7, 10)⁽¹⁾, 12) with the duplicate set R⁽¹⁾ = (2⁽²⁾, 6⁽²⁾) as shown in Fig 2(a). The incoming numbers are circled. The less opaque numbers are not currently incoming numbers, but only the numbers in the natural sequence. Notice that only 7 out of 12 entries of memory are used. The second incoming chunk has only one number, 7, which already exists in compact group (7, 10)⁽¹⁾. Datum 7 is a duplicate according to Definition 8. Thus, the present compact set is still the same as before, i.e. Q⁽²⁾ = ((2, 6)⁽⁴⁾, (7, 10)⁽¹⁾, 12), but the set of duplicates is updated as R⁽²⁾ = (2⁽²⁾, 6⁽²⁾, 7⁽²⁾), as illustrated in Fig 2(b).

The third incoming chunk includes only number 14, which can be neither combined with any previous single number to form a new compact group nor assigned to any existing compact groups. This number is inserted behind datum 12. Therefore, Q⁽³⁾ = ((2, 6)⁽⁴⁾, (7, 10)⁽¹⁾, 12, 14) and R⁽³⁾ = (2⁽²⁾, 6⁽²⁾, 7⁽²⁾), as shown in Fig 2(c). The fourth incoming number 1 is inserted in front of data 2. The compact set and duplicate set are updated as Q⁽⁴⁾ = (1, (2, 6)⁽⁴⁾, (7, 10)⁽¹⁾, 12, 14) and R⁽⁴⁾ = (2⁽²⁾, 6⁽²⁾, 7⁽²⁾), as shown in Fig 2(d). The fifth incoming number 11 is inserted in front of datum 12 to form a new compact group of type 2, (11, 14)⁽²⁾. Therefore, Q⁽⁵⁾ = (1, (2, 6)⁽⁴⁾, (7, 10)⁽¹⁾, (11, 14)⁽²⁾) and R⁽⁵⁾ = (2⁽²⁾, 6⁽²⁾, 7⁽²⁾), as shown in Fig 2(e). The rest of the incoming numbers shown in Fig 2(f)–2(k) are handled by the same conceptual procedure as previously described. It is remarkable that the memory space used by this sorting process is much less than the available space.

Proposed fast streaming data sort algorithm

The steps to construct each compact group are adopted from the algorithm proposed in [34]. The algorithms to handle duplicate data, merge compact groups, and remove expired data based on the supportive Theorems are the following.

Main algorithm: Fast streaming data sort algorithm

Input: 1) a chunk of incoming numbers.

Output: (1) Q^(t) and (2) R^(t).

1. Set time t = 1.

2. Obtain the first incoming chunk of numbers

3. Create first compact set Q⁽¹⁾.

4. Create duplicate set R⁽¹⁾ using Algorithm 1.

5. t = t + 1.

6. While there exists a new data chunk do.

7. Insert each d_i into Q^(t−1) using Algorithm 2.

8. Merge some compact groups if possible.

9. Update duplicate set R^(t−1) using Algorithm 1.

10. t = t + 1.

11. EndWhile

The details of Algorithm 1 and Algorithm 2 are presented in the following sections.

Handling data duplication

To determine whether d_i entering the sorting algorithm at time t is a duplicate of any previously sorted d_j, the following two possible cases are analyzed. The first case is when d_j already exists in the compact set Q^(t−1) as a single number. In this case, determining the duplication of d_i is simply performed by comparing d_i with each single number in Q^(t−1). The second case occurs when d_j is already covered in a compact group. In this case, the conditions stated in Definition 8 are tested. For example, at time t − 1, the following compact set {(6, 8)⁽¹⁾, (9, 12)⁽²⁾, (16, 19)⁽³⁾, (22, 26)⁽⁴⁾} is formed and this chunk of incoming data 7, 10, 10, 18, 18, 18, 24, 24, 24, 24 enters the sorting algorithm at time t. Number 7 is declared as a duplicate of compact group (6, 8)⁽¹⁾ by condition 1 of Definition 8 and number 7 is represented by 7⁽²⁾. Number 10 is the duplicate of number 10 in compact group (9, 12)⁽²⁾ because the second condition (10 − 9)mod 3 = 1 is satisfied. Thus, number 10 is represented by 10⁽³⁾. Number 18 is a duplicate of (16, 19)⁽³⁾ because (18 − 16)mod 3 = 2 by condition 3, and number 24 is a duplicate of (22, 26)⁽⁴⁾ because (24 − 22)mod 2 = 0 by condition 4. Both numbers are represented by 18⁽⁴⁾ and 24⁽⁴⁾, respectively. The duplicate set R^(t) becomes R^(t) = {7⁽²⁾, 10⁽³⁾, 18⁽⁴⁾, 24⁽⁵⁾}.

Let $d_{i}^{(*)}$ be a duplicate of d_i kept in duplicate set R^(t−1).

Algorithm 1: Updating duplicate set

Input: (1) d_i. (2) Q^(t−1), where t ⩾ 2.

Output: (1) R^(t).

1. If d_i is a duplicate in Q^(t−1) then */ by definition 4 /*

2. If $d_{i}^{(*)}$ is in R^(t−1) then

3. Update duplicate count of d_i by f = f + 1.

4. Else

5. Insert d_i in compact set Q^(t).

6. EndIf

7. EndIf

Inserting incoming numbers into compact groups

When a new incoming number is taken into the framework, this number may possibly be one of the numbers in the number sequence of an existing compact group. If the incoming number has type 1, it can be simply inserted at the front or at the rear of the corresponding compact group. Otherwise, some prior process must be performed to form a new compact group or a new set of compact groups by splitting them. Only type-2, type-3, and type-4 require this prior process. When forming a new compact group of types 2, 3, and 4, it does not matter whether the incoming number is a duplicate or not because each compact group just keeps track of the existence of all numbers which are already sorted, and the number of duplicates is instead separately recorded in the duplicate set.

The ways to split compact groups of type-2, type-3, and type-4 compact groups into smaller compact groups after inserting a new number are completely different due to the unequal distances between any two consecutive numbers of each type. Furthermore, the size of each compact group measured in terms of the total numbers being compacted in the compact group is also important for splitting the compact group. There are two trivial cases for which compact groups cannot be split. The first case is when the size of the type-2 compact group is either 3 or 4. The second case is when the size of the type-3 compact group is 3. For other non-trivial cases, the type of compact group can be changed to another type after the insertion as the result of distance change between two consecutive numbers in the compact group. Notice that both type-2 and type-3 compact groups become type-1 compact groups after insertion. For example, suppose that an incoming number 16 is inserted into type-2 compact group (14, 18)⁽²⁾ of size |(14, 18)⁽²⁾| = 4. After insertion of 16 into (14, 18)⁽²⁾, the compact group becomes (14, 18)⁽¹⁾ because the distance between two consecutive numbers is 1. If incoming datum 15 is inserted into type-3 compact group (14, 17)⁽³⁾, then (14, 17)⁽³⁾ becomes (14, 17)⁽¹⁾. Obviously, the type-2 and type-3 compact groups are always transformed to type-1 compact groups after inserting incoming d_i

In the general case, the results of inserting an incoming datum into type-2, type-3, and type-4 compact groups can be split into smaller compact groups which are classified into two categories: (1) the type-1 compact group and (2) the non-type-1 part which may be either a single number or a mixture of compact group of type-2, type-3, and type-4. For example, an incoming 16 is inserted into (11, 17)⁽²⁾. If the size of |(11, 17)⁽²⁾| = 5 > 4, which is not a trivial case, and ins(16) = 5, then the type-1 part is (14, 17)⁽¹⁾ and the non-type-1 part consists of two single data: 11 and 12. Therefore, (11, 17)⁽²⁾ transforms to 11, 12, (14, 17)⁽¹⁾. For another example, an incoming 12 is inserted into (11, 19)⁽³⁾. If |(11, 17)⁽³⁾| = 5 > 4, not a trivial case, and ins(12) = 2, then the type-1 part is (11, 14)⁽¹⁾ and the non-type-1 part is (16, 19)⁽³⁾. Therefore, (11, 17)⁽³⁾ transforms to (11, 14)⁽¹⁾, (16, 19)⁽³⁾.

Theorem 1 A non-type-1 part containing single data is generated after inserting an incoming number d_j into a type-2 compact group (u, v)⁽²⁾, if ins(d_i) is equal to one of the following positions {5, |(u, v)⁽²⁾| − 3, |(u, v)⁽²⁾| − 2}.

Proof: There are three cases to be considered, depending on ins(d_i) of the type-2 compact group (u, v)⁽²⁾, as follows.

Case 1: ins(d_i) = 5. Let (u, v)⁽²⁾ = d_i − 5, d_i − 4, d_i − 2, d_i − 1, d_i + 1, d_i + 2, d_i + 4, …, v where u = d_i − 5. d_i is inserted between d_i − 1 and d_i + 1. Then, sub-sequence d_i − 2, d_i − 1, d_i + 1, d_i + 2 is transformed to type-1 compact group (d_i − 2, d_i + 2)⁽¹⁾. Since u = d_i − 5 and u + 1 = d_i − 4, u and u − 1 are single data in the non-type-1 part and sub-sequence d_i + 4, …, v is transformed to type-2 compact group (d_i + 4, v)⁽²⁾. After insertion, (u, v)⁽²⁾ is transformed to u, u + 1, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾.

Case 2: ins(d_i) = |(u, v)⁽²⁾| − 3. Let (u, v)⁽²⁾ = u, …, d_i − 4, d_i − 2, d_i − 1, d_i + 1, d_i + 2, d_i + 4, d_i + 5, where v = d_i + 5. d_i is inserted between d_i − 1 and d_i + 1. Then, sub-sequence d_i − 2, d_i − 1, d_i, d_i + 1, d_i + 2 is transformed to type-1 compact group (d_i − 2, d_i + 2)⁽¹⁾. Since v = d_i + 5 and v − 1 = d_i + 4, v − 1 and v are single data in the non-type-1 part and sub-sequence u, …, d_i − 4 is transformed to type-2 compact group (u, d_i − 4)⁽²⁾. After insertion, (u, v)⁽²⁾ is transformed to (u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, v − 1, v.

Case 3: ins(d_i) = |(u, v)⁽²⁾| − 2. Let (u, v)⁽²⁾ = u, …, d_i − 4, d_i − 2, d_i − 1, d_i + 1, d_i + 2, d_i + 4, where v = d_i + 4. d_i is inserted between d_i − 1 and d_i + 1. Then, sub-sequence d_i − 2, d_i − 1, d_i, d_i + 1, d_i + 2 is transformed to type-1 compact group (d_i − 2, d_i + 2)⁽¹⁾. Since v = d_i + 4, v is a single datum in the non-type-1 part and sub-sequence u, …, d_i − 4 is transformed to type-2 compact group (u, d_i − 4)⁽²⁾. After insertion, (u, v)⁽²⁾ is transformed to (u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, v.

Theorem 2 The non-type-1 part is generated after inserting an incoming number d_j into a type-3 compact group (u, v)⁽³⁾, if ins(d_i) is equal to one of the following positions {4, |(u, v)⁽³⁾| − 3, |(u, v)⁽³⁾| − 2}.

Proof: This proof is similar to Theorem 1.

Case 1: When ins(d_i) = 4, then (u, v)⁽³⁾ = d_i − 4, d_i − 2, d_i − 1, d_i + 1, d_i + 2, d_i + 4, …, v where u = d_i − 4. Sub-sequences d_i − 2, d_i − 1, d_i, d_i + 1, d_i + 2 and d_i + 4, …, v are transformed to (d_i − 2, d_i + 2)⁽¹⁾ and (d_i + 4, v)⁽²⁾, respectively. u is a single datum in the non-type-1 part. After insertion, (u, v)⁽³⁾ is transformed to u, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾.

Case 2: ins(d_i) = |(u, v)⁽³⁾| − 3. Then, (u, v)⁽³⁾ = u, …, d_i − 4, d_i − 2, d_i − 1, d_i + 1, d_i + 2, d_i + 4, d_i + 5, where v = d_i + 5. Sub-sequences d_i − 2, d_i − 1, d_i, d_i + 1, d_i + 2 and u, …, d_i − 4 are transformed to (d_i − 2, d_i + 2)⁽¹⁾ and (u, d_i − 4)⁽³⁾, respectively. v − 1 and v are single data in the non-type-1 part. After insertion, (u, v)⁽³⁾ is transformed to (u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, v − 1, v.

Case 3: ins(d_i) = |(u, v)⁽³⁾| − 2. Then, (u, v)⁽³⁾ = u, …, d_i − 4, d_i − 2, d_i − 1, d_i + 1, d_i + 2, d_i + 4, where v = d_i + 4. Sub-sequences d_i − 2, d_i − 1, d_i, d_i + 1, d_i + 2 and u, …, d_i − 4 are transformed to (d_i − 2, d_i + 2)⁽¹⁾ and (u, d_i − 4)⁽³⁾, respectively. v is a single datum in the non-type-1 part. After insertion, (u, v)⁽³⁾ is transformed to (u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, v.

Theorem 3 The non-type-1 part is generated after inserting an incoming datum d_j into a type-4 compact group (u, v)⁽⁴⁾, if ins(d_i) is equal to one of the following positions {3, |(u, v)⁽⁴⁾| − 2, |(u, v)⁽⁴⁾| − 1}.

Proof: This proof is similar to Theorem 1.

Case 1: ins(d_i) = 3, and then (u, v)⁽⁴⁾ = d_i − 3, d_i − 1, d_i + 1, d_i + 3, …, v where u = d_i − 3. Sub-sequences d_i − 1, d_i, d_i + 1 and d_i + 3, …, v are transformed to (d_i − 1, d_i + 1)⁽¹⁾ and (d_i + 3, v)⁽⁴⁾, respectively. u is a single datum in the non-type-1 part. After insertion, (u, v)⁽⁴⁾ is transformed to u, (d_i − 1, d_i + 1)⁽¹⁾, (d_i + 3, v)⁽⁴⁾.

Case 2: ins(d_i) = |(u, v)⁽⁴⁾| − 2, then (u, v)⁽⁴⁾ = u, …, d_i − 3, d_i − 1, d_i + 1, d_i + 3, d_i + 5, where v = d_i + 5. Sub-sequences d_i − 1, d_i, d_i + 1 and u, …, d_i − 3 are transformed to (d_i − 1, d_i + 1)⁽¹⁾ and (u, d_i − 3)⁽⁴⁾, respectively. v − 2 and v are single data in the non-type-1 part. After insertion, (u, v)⁽⁴⁾ is transformed to (u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, v − 2, v.

Case 3: ins(d_i) = |(u, v)⁽⁴⁾| − 1, then (u, v)⁽⁴⁾ = u, …, d_i − 3, d_i − 1, d_i + 1, d_i + 3 where v = d_i + 3. Sub-sequence d_i − 1, d_i, d_i + 1 and u, …, d_i − 3 are transformed to (d_i − 1, d_i + 1)⁽¹⁾ and (u, d_i − 3)⁽⁴⁾, respectively. v is a single datum in the non-type-1 part. After insertion, (u, v)⁽⁴⁾ is transformed to (u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽(1)), v.

If ins(d_i) does not conform with Theorems 1, 2, and 3, then a set of compact groups is generated without the existence of any single number as follows. Type-2 compact group (u, v)⁽²⁾ is transformed to (u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾. Type-3 compact group (u, v)⁽³⁾ is transformed to (u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾. Type-4 compact group (u, v)⁽⁴⁾ is transformed to (u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, (d_i + 3, v)⁽⁴⁾. Table 2 summarizes the splitting of type-2, type-3 and type-4 compact groups as a result of inserting d_i. The steps to generate a new compact set after insertion are provided in the following Algorithm 2. Let (u, v)^(p) denote the corresponding compact group of type-p for inserting the incoming d_i.

Table 2. Summary of insertion position and the sequence of insertion in forms of split compact groups.

Type-p	Insert position ins(d_i)	new resultant split compact groups (u, v)^(p)
2 (Trivial¹)	3	(u, v)⁽¹⁾
2	5	u, u + 1, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾
2	\|(u, v)⁽²⁾\| − 3	(u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, v − 1, v
2	\|(u, v)⁽²⁾\| − 2	(u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, v
2	others	(u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾
3 (Trivial²)	2	(u, v)⁽¹⁾
3	4	u, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾
3	\|(u, v)⁽³⁾\| − 3	(u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, v − 1, v
3	\|(u, v)⁽³⁾\| − 2	(u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, v
3	others	(u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾
4	3	u, (d_i − 1, d_i + 1)⁽¹⁾, (d_i + 3, v)⁽⁴⁾
4	\|(u, v)⁽⁴⁾\| − 2	(u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, v − 2, v
4	\|(u, v)⁽⁴⁾\| − 1	(u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, v
4	others	(u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, (d_i + 3, v)⁽⁴⁾

Open in a new tab

Trivial¹ refers to the case in which the size of the type-2 compact group is either 3 or 4. Trivial² refers to the case in which the size of the type-3 compact group is 3.

Algorithm 2: Generating a new compact set after insertion

Input: (1) d_i and (2) compact set Q^(t−1), where t ⩾ 2.

Output: (1) a new set of compact groups Q^(t).

1. Find the corresponding compact group (u, v)^(p) where p ∈ {2, 3, 4} for inserting d_i.

2. Case:

3. 1: d_i is in (u, v)⁽²⁾

4. If |(u, v)⁽²⁾| = 3 then /*trivial case*/

5. Split (u, v)⁽²⁾ into (u, v)⁽¹⁾.

6. ElseIf ins(d_i) = 5

7. Split (u, v)⁽²⁾ into u, u + 1, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾.

8. ElseIf ins(d_i) = |(u, v)⁽²⁾| − 3

9. Split (u, v)⁽²⁾ into (u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, v − 1, v.

10. ElseIf ins(d_i) = |(u, v)⁽²⁾| − 2

11. Split (u, v)⁽²⁾ into (u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, v.

12. Else

13. Split (u, v)⁽²⁾ into (u, d_i − 4)⁽²⁾, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾.

14. End

15. 2: d_i is in (u, v)⁽³⁾.

16. If |(u, v)⁽³⁾| = 3 then /*trivial case*/

17. Split (u, v)⁽³⁾ into (u, v)⁽¹⁾.

18. ElseIf ins(d_i) = 4

19. Split (u, v)⁽³⁾ into u, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾.

20. ElseIf ins(d_i) = |(u, v)⁽³⁾| − 3

21. Split (u, v)⁽³⁾ into (u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, v − 1, v.

22. ElseIf ins(d_i) = |(u, v)⁽³⁾| − 2

23. Split (u, v)⁽³⁾ into (u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, v.

24. Else

25. Split (u, v)⁽³⁾ into (u, d_i − 4)⁽³⁾, (d_i − 2, d_i + 2)⁽¹⁾, (d_i + 4, v)⁽²⁾.

26. EndIf

27. 3: d_i is in (u, v)⁽⁴⁾.

28. If |(u, v)⁽⁴⁾| = 3 then

29. Split (u, v)⁽⁴⁾ into u, (d_i − 1, d_i + 1)⁽¹⁾, (d_i + 3, v)⁽⁴⁾.

30. ElseIf ins(d_i) = |(u, v)⁽⁴⁾| − 2

31. Split (u, v)⁽⁴⁾ into (u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, v − 2, v.

32. ElseIf ins(d_i) = |(u, v)⁽⁴⁾| − 1

33. Split (u, v)⁽⁴⁾ into (u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, v.

34. Else

35. Split (u, v)⁽⁴⁾ into (u, d_i − 3)⁽⁴⁾, (d_i − 1, d_i + 1)⁽¹⁾, (d_i + 3, v)⁽⁴⁾.

36. EndIf

37. EndCase

Merging compact groups

After compact group splitting, it is possible that some compact group (u, v)^(p) can be merged with a single number s to reduce the consumed memory space. The merging conditions of all types of compact groups are the following.

If u − s = 1 and p = 1, then s and (u, v)^(p) are merged into (s, v)⁽¹⁾.
If u − s = 2 and p = 2, then s and (u, v)^(p) are merged into (s, v)⁽³⁾.
If u − s = 1 and p = 3, then s and (u, v)^(p) are merged into (s, v)⁽²⁾.
If u − s = 2 and p = 4, then s and (u, v)^(p) are merged into (s, v)⁽⁴⁾.
If s − v = 1 and p = 1, then (u, v)^(p) and s are merged into (u, s)⁽¹⁾.
If s − v = (v − u)mod3 + 1 ⩽ 2, and p = 2, then (u, v)^(p) and s are merged into (u, s)⁽²⁾.
If s − v = (v − u + 2)mod3 ⩽ 2, and p = 3, then (u, v)^(p) and s are merged into (u, s)⁽³⁾.
If s − v = 2 and p = 4, then (u, v)^(p) and s are merged into (u, s)⁽⁴⁾.

Another possibility is that two compact groups can be merged into one compact group. The conditions for merging any two compact groups (u₁, v₁)^(p₁) and (u₂, v₂)^(p₂), where v₁ < u₂, are defined as follows.

If u₂ − v₁ = 1 and p₁ = p₂ = 1, then (u₁, v₁)^(p₁) and (u₂, v₂)^(p₂) are merged into (u₁, v₂)⁽¹⁾.
If u₂ − v₁ = 2 = (v₁ − u₁ + 1)mod 3 and p₁ = p₂ = 2, then (u₁, v₁)^(p₁) and (u₂, v₂)^(p₂) are merged into (u₁, v₂)⁽²⁾.
If u₂ − v₁ = 1 = (v₁ − u₁ + 1)mod 3, p₁ = 2, and p₂ = 3, then (u₁, v₁)^(p₁) and (u₂, v₂)^(p₂) are merged into (u₁, v₂)⁽²⁾.
If u₂ − v₁ = 2 = (v₁ − u₁ + 2)mod 3, p₁ = 3, and p₂ = 2, then (u₁, v₁)^(p₁) and (u₂, v₂)^(p₂) are merged into (u₁, v₂)⁽³⁾.
If u₂ − v₁ = 1 = (v₁ − u₁ + 2)mod 3 and p₁ = p₂ = 3, then (u₁, v₁)^(p₁) and (u₂, v₂)^(p₂) are merged into (u₁, v₂)⁽³⁾.
If u₂ − v₁ = 2 and p₁ = p₂ = 4, then (u₁, v₁)^(p₁) and (u₂, v₂)^(p₂) are merged into (u₁, v₂)⁽⁴⁾.

For example, the result of merging (4, 8)⁽²⁾ and (10, 12)⁽²⁾ is (4, 12)⁽²⁾ because 10 − 8 = 2 = (8 − 4 + 1)mod 3. Note that merging two compact groups depends on the order of compact groups in a compact set.

Removing expired single numbers from the compact set

Various applications in streaming data mining and data life cycle management must occasionally remove some expired data from the storage or database. Actually, removing any data from a sorted sequence when the whole sequence is kept in storage is rather simple. However, in cases of sequences sorted with limited storage size, removing any data from the sorted sequence is not as simple because the whole sorted sequence is transformed into a set of compact groups and single numbers. Streaming data sort [34] does not support data removal from a compact set during the sorting process.

The process of data removal from a compact set is proposed in this fast streaming data sort. A removed datum is denoted by $\tilde{d_{i}}$ , while a set of deleted data is denoted by $\tilde{D}$ . There are two removal cases. The first case is that $\tilde{d_{i}}$ is a single number in a compact set. In this case, $\tilde{d_{i}}$ can be directly removed from the compact set without affecting the components of other compact groups. For the second case, $\tilde{d_{i}}$ is a member of a compact group (u, v)^(p). Removing $\tilde{d_{i}}$ from (u, v)^(p) splits (u, v)^(p) into two parts: the left part L and right part R with respect to $\tilde{d_{i}}$ . Depending upon the type of compact group, the set of numbers in parts L and R can be expressed by the following Theorems.

Theorem 4 Let L be a set of a left part for removing a datum $\tilde{d_{i}}$ from a compact group (u, v)^(p). If $| {(u, \tilde{d_{i}})}^{(p)} | \in Z$ and $| {(u, \tilde{d_{i}})}^{(p)} | \leq 3$ , then

\begin{matrix} L = {{l \in {u, u + 1, u + 2} | | (u, l)}^{(p)} | \in Z} . \end{matrix}

(4)

Theorem 5 Let R be a set of a right part for removing a datum $\tilde{d_{i}}$ from a compact group (u, v)^(p). If $| {(u, \tilde{d_{i}})}^{(p)} | \in Z$ and ${| (u, v)}^{(p)} | - | {(u, \tilde{d_{i}})}^{(p)} | \leq 2$ , then

\begin{matrix} R = {{r \in {v - 2, v - 1, v} | | (u, r)}^{(p)} | \in Z} . \end{matrix}

(5)

Theorem 6 Let L be the left part for removing a datum $\tilde{d_{i}}$ from a compact group (u, v)^(p). If $| {(u, \tilde{d_{i}})}^{(p)} | \in Z$ and $| {(u, \tilde{d_{i}})}^{(p)} | > 3$ , then L has one of the following compact group forms:

\begin{matrix} L = {\begin{matrix} {(u, \tilde{d_{i}} - 1)}^{(1)} & for p = 1 \\ {(u, \tilde{d_{i}} - a)}^{(2)} & for p = 2 \\ {(u, \tilde{d_{i}} - b)}^{(3)} & for p = 3 \\ {(u, \tilde{d_{i}} - 2)}^{(4)} & for p = 4 \end{matrix} \end{matrix}

(6)

where $a = 1 + | {(u, \tilde{d_{i}})}^{(p)} | mod 2$ and $b = 1 + (| {(u, \tilde{d_{i}})}^{(p)} | + 1) mod 2$ .

Theorem 7 Let R be a right part for removing a datum $\tilde{d_{i}}$ from a compact group (u, v)^(p). If $| {(u, \tilde{d_{i}})}^{(p)} | \in Z$ and ${| (u, v)}^{(p)} | - | {(u, \tilde{d_{i}})}^{(p)} | \leq 2$ , then R has one of the following compact group forms:

\begin{matrix} R = {\begin{matrix} {(\tilde{d_{i}} + 1, v)}^{(1)} & for p = 1 \\ {(\tilde{d_{i}} + a, v)}^{(4 - a)} & for p = 2 \\ {(\tilde{d_{i}} + b, v)}^{(4 - b)} & for p = 3 \\ {(\tilde{d_{i}} + 2, v)}^{(4)} & for p = 4 \end{matrix} \end{matrix}

(7)

where $a = 2 - | {(u, \tilde{d_{i}})}^{(p)} | mod 2$ and $b = 2 - (| {(u, \tilde{d_{i}})}^{(p)} | + 1) mod 2$ .

For example, number 25 is removed from (23, 31)⁽¹⁾. By Theorem 4, |(23, 25)⁽¹⁾| ⩽ 3 and the numbers in L are {23, 24}. For R, |(23, 31)⁽¹⁾| − |(23, 25)⁽¹⁾| > 2 and R = (26, 31)⁽¹⁾ by Theorem 7. Therefore, removing 25 from (23, 31)⁽¹⁾ splits the compact group into L = {23, 24} and R = (26, 31)⁽¹⁾. Another example is removing number 26 from (23, 31)⁽³⁾. Since |(23, 26)⁽³⁾| ⩽ 3 and |(23, 31)⁽³⁾| − |(23, 25)⁽³⁾| > 2, the numbers in L are {23, 25} by Theorem 4 and R = (28, 31)⁽²⁾ by Theorem 7. Therefore, the compact group (23, 31)⁽³⁾ is split by number 26 into L = {23, 25} and R = (28, 31)⁽²⁾. Notice that, with the terms of 4 − a and 4 − b in Theorem 7, the type of compact group after removing $\tilde{d_{i}}$ will be changed from type 2 to type 3 and vice versa.

Experimental results and discussion

Three important issues concerning the performance of the proposed algorithm were evaluated. The first issue regards the execution time of the proposed sorting algorithm versus the working memory size in terms of percentage of total experimented data size. The second issue focuses on the fluctuation of working memory space with respect to the given number of numbers to be sorted. The fluctuation indicates the bound of the actual size of working memory required to achieve the sorting process. The third issue concerns the execution time when there are duplicates. The performance of the fast streaming data sorting algorithm was compared with streaming data sort, classic sorting algorithms, and external merge sort. The proposed algorithm was implemented by MATLAB R2021a and run on 3.2 GHz Intel Core i5 6500 and 8 GB of 2133 MHz RAM with the 64 bit Windows 10 platform. The details of each issue are the following.

Sorting time versus data size and working memory size

One million random numbers were sorted by the proposed algorithm and the sorting time was evaluated by varying the size of working memory in terms of percentage of total experimented data size. Different sorting time versus different working memory size is illustrated in Fig 3. There are two experiments conducted in this section. The first experiment was conducted to observe whether or not the sorting time increases almost linearly according to the accumulated amount of incoming numbers when the size of working memory is fixed. Each line in Fig 3 denotes the sorting time versus the size of the incoming chunk based on a fixed working memory size. The second experiment was conducted to observe the effect of working memory size on the sorting time. It is remarkable that when the size of working memory increases, the sorting is reduced. For example, the sorting time when the size of working memory is 50% of the total data size is less than the sorting time when the size of working memory is 30% of the total data size. This is because more incoming numbers can enter the sorting process, thus causing the sorting time to decrease.

The sorting speed of the fast streaming data sort was compared with the sorting speed of the streaming data sort, which is capable of sorting streaming data. Table 3 shows the comparison of execution time between fast streaming data sort and streaming data sort. Each percentage number denotes the size of working memory in terms of the percentage of the total amount of data to be sorted. Each bold number in each gray cell is the sorting time of fast streaming data sort, which is obviously 28.1767 times faster than the sorting time of streaming data sort on average. The time complexity of fast streaming data sort is still O(n).

Table 3. Comparison of sorting time between fast streaming data sort and streaming data sort when varying different sizes of working memory in terms of the percentages of total amount of data.

Data size	Time (sec) of fast streaming data sort				Time (sec) of streaming data sort
Data size	50%	45%	40%	35%	50%	45%	40%	35%
100,000	4.3338	4.9884	5.5894	6.9066	124.8196	137.7921	160.3303	190.7497
250,000	11.4705	12.3407	13.9405	17.1341	324.145	356.7402	422.1132	480.4968
400,000	17.1907	19.8813	22.3061	27.3360	498.5432	547.7929	645.2144	747.6058
550,000	24.3549	26.8283	30.4382	37.4500	680.8660	753.8266	901.9769	1027.8510
700,000	30.9219	34.1581	38.9551	49.1797	861.3885	958.9641	1139.8098	1312.4261
850,000	37.0885	41.9728	47.9061	59.0234	1051.8819	1169.8349	1390.2639	1612.7994
1,000,000	45.0412	50.5456	56.9898	70.2029	1238.8804	1379.0245	1643.9551	1873.0513

Open in a new tab

Furthermore, the sorting speed of fast streaming data sort was also compared with the speeds of many classic sort algorithms. Although this comparison is inappropriate because of the distinguished constraints imposed by fast streaming data sort and classic sorting algorithms, this comparison can be fairly justified when the size of working memory approaches the size of the data set. For classic sorting algorithms, the entire data set must be stored inside the working memory, which may refer to only internal storage or the combination of internal and external storage, during the sorting process. Table 4 shows the sorting time of the proposed algorithm and four other classic sorting algorithms, i.e. bubble sort, quick sort, insertion sort, and merge sort. The working memory size of the proposed algorithm is restricted to 35%, 50%, 75%, and 98.5% of data size. Notice that when the size of working memory is increased, it implies that a larger incoming data chunk size can be stored in the working memory, thus speeding the sorting as illustrated in Fig 3. Each of the underlined numbers in each gray cell of Table 4 represents the execution times of four classic sorting algorithms that are slower than fast streaming data sort for some restricted working memory. Regardless of any memory size, the speed of fast streaming data sort exceeds those of bubble sort and insertion sort when data size is 1 × 10⁵ or greater. N/A denotes excessively long processing time. The sorting time of fast streaming data sort, when the data size is at least 1 × 10⁶ and the size of working memory is set to at least 75% of data size, is less than the sorting time of quick sort. However, when the data size is at least 1 × 10⁶ and the size of working memory is set to 98.5% of the data set, its speed is faster than the speed of merge sort.

Table 4. Comparison of sorting time between fast streaming data sort and four classic sorting algorithms.

Data size	The execution time (sec)
	Fast streaming data sort				Bubble sort	Quick sort	Insertion sort	Merge sort
	35%	50%	75%	98.5%	Bubble sort	Quick sort	Insertion sort	Merge sort
10³	0.3738	0.3193	0.2345	0.1524	0.0144	0.0222	0.0085	0.0123
10⁴	0.9601	0.6843	0.3690	0.1756	1.3513	0.1434	0.4739	0.0272
10⁵	6.2055	3.7647	1.9997	0.2913	126.3189	1.4645	50.1042	0.1494
10⁶	58.1738	34.4192	17.5550	1.4446	N/A	19.4219	N/A	1.4890
10⁷	593.6722	330.6568	181.5882	13.1480	N/A	218.5080	N/A	16.1969

Open in a new tab

Four classic sorting algorithms do not require any extra storage. This means that they are not suitable for coping with large data sets. External sorting resolves this problem by deploying a very large external storage in addition to the internal working memory to store the overflowed data. Thus, it is interesting to compare the sorting speed of fast streaming data sort with the speed of external sorting restricted by different sizes of working memory defined in terms of percentage of the entire data size. Table 5 shows the sorting times of fast streaming data sort and external merge sort when varying the size of working memory to 35%, 50%, 75%, and 98.5% of the data size. It is obvious that fast streaming data sort is faster than external merge sort in all circumstances. In particular, when the memory size is set to 98.5% of data size and the data set includes 1 × 10⁵ numbers, the sorting time of fast streaming data sort is 1,452 times less than the sorting time of external merge sort.

Table 5. Comparison of sorting times of fast streaming data sort and external merge sort.

Data size	Time (sec) of fast streaming data sort				Time (sec) of external merge sort
Data size	50%	45%	40%	35%	50%	45%	40%	35%
10³	0.3738	0.3193	0.2345	0.1524	2.6021	2.2978	2.2204	2.1301
10⁴	0.9601	0.6843	0.3690	0.1756	38.0198	37.4343	35.6365	35.525
10⁵	6.2055	3.7647	1.9997	0.2913	298.8189	392.2904	417.2926	422.9222

Open in a new tab

Memory usage

The fluctuation of memory usage can be separated into three stages for analysis. In the beginning stage, the size of memory usage depends on the random order of incoming data. The trend of fluctuation is positive. In the second stage, some single data may appear during the sorting process. When new incoming data enter, they may be inserted into some previous compact groups or combined with the other single data to generate new compact groups. Some of these compact groups can be merged if they are in consecutive order. After merging some compact groups, the available memory space increases. The trend of fluctuation thus possibly changes in this stage. In the last stage, almost all early incoming data form partially complete sequences. The new incoming data fulfill the existing compact groups or concatenate with some single data to form new compact groups. Some of these compact groups can be further merged into new compact groups. Thus, the actual size of memory usage is rapidly reduced. Fig 4 illustrates the memory usage for compact groups and single data sets at each time step of the proposed algorithm with different sizes of working memory defined in terms of percentage of data size. When the size of working memory increases, the chance to form only one compact group is higher than having small working memory size because of the higher probability of obtaining consecutive numbers. The shape of each fluctuation curve is similar to a capsized hook. This reason for this shape phenomenon was discussed in [34]. There are four curves as a result of setting different sizes of working memory (35%, 40%, 45%, and 50%). It is remarkable that a larger size of working memory requires fewer time steps to accomplish the sorting result. The rationale for this behavior can be analyzed in the following Theorem.

Theorem 8 When the size of working memory increases, the time steps of the fluctuation curve decrease.

Proof: There are two scenarios in Fig 5 of different sizes of working memory, i.e. n₁ < n₂, to be discussed. The same set of incoming n numbers is sorted in both scenarios. The first scenario includes one currently existing compact group containing m numbers. After compacting these m numbers into a compact group, only 3 entries of memory space are required to represent the compact group. Since 3 ≪ n₁, the rest of working memory space is n₁ − 3 ≈ n₁. Next, a set of consecutive a ≤ n₁ numbers enters the sorting process. These a numbers can be combined with the current compact group to form a new compact group. The rest of the working memory can be filled by selecting n₁ − a from n − m − a numbers. The number of ways to select n₁−a is equal to $(\begin{matrix} n - m - a \\ n_{1} - a \end{matrix})$ . All n₁ numbers can be permuted in n₁! ways. Thus, the probability of having a numbers in the working memory in the first scenario is as follows.

\begin{matrix} p r o b_{1} & = & \frac{n_{1}! (\begin{matrix} n - m - a \\ n_{1} - a \end{matrix})}{(n - m)!} \end{matrix}

(8)

\begin{matrix} = & \frac{n_{1}! (n - m - a)!}{(n_{1} - a)! (n - m - n_{1})! (n - m)!} \end{matrix}

(9)

For the second scenario, the probability of having a numbers in the working memory in the second scenario is as follows.

\begin{matrix} p r o b_{2} & = & \frac{n_{2}! (\begin{matrix} n - m - a \\ n_{2} - a \end{matrix})}{(n - m)!} \end{matrix}

(10)

\begin{matrix} = & \frac{n_{2}! (n - m - a)!}{(n_{2} - a)! (n - m - n_{2})! (n - m)!} \end{matrix}

(11)

Since n₁ < n₂, prob₁ < prob₂.

Fig 5 — The working memory sizes are n₁ and n₂.

Data duplication

This section consists of two experiments to observe the gradient of sorting time when there are duplicates. The first experiment was conducted by using one synthetic data set. The data set consists of 100 non-duplicate numbers and 100 duplicate numbers. The size of working memory was set to 50 units. All non-duplicate numbers were equally divided into 4 chunks entering the sorting process one chunk at a time from time step 1 to time step 4. After that, a chunk of numbers with size as large as the available working memory was fed into the sorting process. The snapshot of each time step is shown in Fig 6. The first four steps show the run times of 100 non-duplicate data. Steps 5-104 illustrate the run time when only one duplicate number enters. The rate of run time change of non-duplicate data shown in the red area is greater than the rate of run time change of duplicate data shown in the blue area. After sorting all non-duplicate numbers, the run time change of duplicate numbers is almost linear and constant.

The second experiment concerning duplicate numbers was conducted by using real data sets containing duplicates. Three real-world data sets from kaggle.com, i.e. Delhi Weather Data [40], Artificial Lunar Landscape Dataset [41], and Continue Playing Game [42], were included in the experiment. The properties of attributes in these data sets are shown in Table 6. To relate the types of sub-sequences as defined previously, the distance average (Delta average) between two consecutively sorted input numbers is reported in Table 6. Four different working memory sizes were set to 0.1%, 1%, 35%, and 50% of total input data size. The sorting time in seconds of each data set is reported in Table 7. When the working memory size was set to 0.1% of total input data, none of the data sets except the Artificial Lunar Landscape data set could be sorted, as indicated by the word Overflow, because all spaces in the working memory were occupied by the compact groups and a set of single numbers after incoming data chunks. The final size of working memory used after sorting all data is also reported in column 6. Notice that the final size of working memory used is much smaller than the actual size of the sorted data set because the sorted data were compacted in the form of a compact set. For example, data set Dew point contains 100,369 numbers. After sorting, the size of final used memory is only 49 entries. The last column summarizes the ratio of data size to the final used memory size. In the case of Dew point, this ratio is 100, 369/49 ≈ 2, 048.

Table 6. Description of three real-world data sets.

Data sets		Number of data	Delta average	Minimum	Maximum	Area
Delhi Weather Data
	Dew point	100,369	0.98	-24	75	Weather
	Humidity	100,233	7.18	4	243	Weather
	Pressure	99,843	0.74	929	1033	Weather
	Temperature	100,317	1.63	1	90	Weather
Artificial Lunar Landscape Dataset
	Length	18,867	50.74	21	720	Image
	Height	18,867	34.61	21	480	Image
Continue Playing Game
	Hero_id	17,485,730	23.93	21	720	Game

Open in a new tab

Table 7. The time consumption, memory size for storing sorted data, and the ratio of data size to memory size for storing sorted data for three real-world data sets.

Data sets		Sorting time (sec) for different working memory sizes				Final memory size used after sorting all data	Ratio of data size to memory size used
Data sets		50% of data	35% of data	1% of data	0.1% of data	Final memory size used after sorting all data	Ratio of data size to memory size used
Delhi
	Dew point	0.18	0.19	0.78	5.65	49	2,048
	Humidity	0.28	0.34	3.31	15.23	103	973
	Pressure	0.28	0.33	1.29	11.88	171	589
	Temperature	0.24	0.27	1.43	8.12	52	1,929
Artificial Lunar Landscape Dataset
	Length	0.48	0.58	5.72	Overflow	420	283
	Height	0.45	0.52	3.73	Overflow	210	566
Continue Playing Game
	Hero_id	1.52	1.46	3.37	14.95	65	269,011

Open in a new tab

Time complexity analysis

There are four main parts included in this analysis. Let M be the working memory size, and H be the remaining size of working memory after the sorting process for each incoming chunk of h numbers. Generally, the remaining size is very small in comparison with the size of working memory and the total number of streaming data n, i.e., H ≪ M < n, as shown in Fig 7.

The first part is the process of forming a compact set Q^(t), where t is the number of iterations. The time complexity of this part is O(M). The second part is to update duplicate data. h data must be analyzed to find their appropriate positions for insertion. The time complexity for updating duplicate data is O(h log M). The third part is to insert h data into a compact set. The time complexity is O(h log M). If a datum d_i is inserted into a compact group, d_i and the compact group are transformed into a new compact group with time complexity of O(1). Therefore, the time complexity of the third main part is max{O(h log M), O(h log M) + O(1)} = O(h log (M)). The last part is to merge members in compact set Q^(t). The merging algorithm is based on linear searching, and its time complexity is O(M). In total, the time complexity of n data is O(M) + O(nh log M)+ O(nh log M))+ O(nM). Because h and M are constant, the time complexity of fast streaming data sort is O(n).

Working space analysis of duplicate data

In streaming data sort [34], the lower bound of working memory size for correctly sorting non-duplicated n numbers is 0.35n. In the case of duplication, suppose that each number has only one duplicate, which indicates that there are 2n numbers to be sorted. Obviously, if streaming data sort is applied, then the size of working memory must be set to 0.35 × 2n = 0.7n. However, the size of working memory used by the proposed fast streaming data sort is less than 0.7n. This is because all non-duplicate n numbers can be sorted first by setting the size of working memory to 0.35n. All duplicates do not need to be sorted, but only stored in a duplicate set R^(t) as discussed in Algorithm 1. Each duplicate requires one entry in set R^(t). Since there are n duplicates, the size of set R^(t) is only n and the minimum size of working memory needed by fast streaming data sort is at least 0.35n + n. This amount is equal to $\frac{0.35 n + n}{2 n} = 0.67$ or 67% of the data size. After the sorting process is finished, there must be only one compact group, which uses only two entries of memory space to represent the compact group, and n entries for storing all n duplicates.

Discussion

Even though fast streaming data sort effectively sorts streaming data under the working memory limitation, there are two concerns about its limitation. First, the working memory size is 35% of the input data size. This means the proposed algorithm can reduce the working memory size up to 65%. On the other hand, the number of incoming data that can be handled by the proposed algorithm is up to $\frac{1}{0.35} = 2.86$ times the working memory size without upgrading the memory space. Second, the proposed algorithm was designed for integer data, but it can be applied to sort floating-point numbers by multiplying each floating number with 10^a, for an integer constant a. The value of a must be predetermined from the number of digits in decimal part.

Conclusion

A new sort concept is presented in order to greatly hasten the process of streaming data sort with limited working memory size and to efficiently cope with duplicate data and data lifecycles without increasing the time complexity as reported in [34]. Instead of unfolding each compact group to insert new incoming numbers in their correct orders, the computational concepts of insert position, size of compact group, and detecting duplicates are developed to eliminate those unnecessary steps proposed in [34]. Several relevant theorems were established to support the proposed concepts of new streaming data sort under new constraints. The new sorting algorithm named fast streaming data sort can achieve the time complexity of O(n), where n is the total amount of input numbers, and the space complexity is equal to 35% of all incoming non-duplicate numbers, which is the possible lower bound. With duplicates, however, the space complexity can be at most 67% of all incoming non-duplicate and duplicate numbers.

The results were compared with several classic sorting algorithms and external merge sort on streaming data sort and analyzed with respect to large synthetic and real-world data sets. The sorting time of the proposed algorithm is 28.18 times less than streaming data sort. The proposed algorithm is faster than bubble sort and insertion sort when the size of input data is greater than 10,000 and faster than quick sort and merge sort when the size of input data is greater than 1,000,000. Moreover, the proposed algorithm is 1,452 times faster than external merge sort. Therefore, the proposed algorithm can achieve a remarkably faster sorting speed in comparison with the compared algorithms, even though memory overflow may occur in those classic sorting algorithms.

Theoretically, most of existing sorting algorithms cannot be applied to streaming data when working memory size is less than the size of incoming data. Although some algorithms are designed to cope with this situation, the sorting result is obtained with approximate orders and values. However, fast streaming data sort can sort streaming data and generate the sorting result with the exact orders and values on the entire streaming sequence by using only 35% of data size as the size of available working memory. In the other hand, if the memory capacity is limited, fast streaming data sort is robust to overflow issue caused by increasing incoming data rather than other existing algorithms. In practical point of view, fast streaming data sort can be applied to several real-world applications as follows.

Reducing the searching time of enormous input data by using search algorithm on the sorting result in compact group form.
Finding order statistics, quartile, decile, and percentile of huge streaming data in social network applications on mobile phones, data cleaning, or data querying.
Examining duplicate data to reduce cost of storing data, to find error detection of systems, or to delete data in the data cleaning process.

Although fast streaming data sort can cope with the duplicate streaming data under the limitation of working memory less than the size of input data, there are two issues to be considered in the future work as follows.

Adding or improving new patterns of compact sequence to generate new compact group types that can be adopted to all different consecutive numbers.
Developing a compression procedure for compact groups with several data types, for instance, floating point number or text string, resulting in compatibility of data in many areas such as business, forecasting, logistics, and science.

Data Availability

All Delhi Weather Data files are available from the kaggle.com database (https://www.kaggle.com/mahirkukreja/delhi-weather-data). All Artificial Lunar Landscape Dataset files are available from the kaggle.com database (https://www.kaggle.com/romainpessia/artificial-lunar-rocky-landscape-dataset). All Continue playing game files are available from the kaggle.com database (https://www.kaggle.com/qichenghu/continue-playing-game). The authors confirm that anyone can access these third-party data in the same manner. The authors confirm that they did not have any special access privilege.

Funding Statement

Suluk Chaikhan received a grant from Development and Promotion of Science and Technology Talents Project (DPST). Chidchanok Lursinsap received a grant from Thailand Research Fund under grant number RTA6080013. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Xu LD, Duan L. Big data for cyber physical systems in industry 4.0: a survey. Enterprise Information Systems. 2019;13(2):148–169. doi: 10.1080/17517575.2018.1442934 [DOI] [Google Scholar]
2. Rao TR, Mitra P, Bhatt R, Goswami A. The big data system, components, tools and technologies: a survey. Knowledge and Information Systems. 2019;60(3):1165–1245. doi: 10.1007/s10115-018-1248-0 [DOI] [Google Scholar]
3. Ghorbanian M, Dolatabadi SH, Siano P. Big data issues in smart grids: a survey. IEEE Systems Journal. 2019;13(4):4158–4168. doi: 10.1109/JSYST.2019.2931879 [DOI] [Google Scholar]
4. Nadikattu RR. Research on data science, data analytics and big data. INTERNATIONAL JOURNAL OF ENGINEERING, SCIENCE AND. 2020;9(5):99–105. [Google Scholar]
5. Seng JKP, Ang KL. Multimodal emotion and sentiment modeling from unstructured big data: challenges, architecture, techniques. IEEE Access. 2019;7:90982–90998. doi: 10.1109/ACCESS.2019.2926751 [DOI] [Google Scholar]
6. Liu Y, Bi S, Shi Z, Hanzo L. When machine learning meets big data: a wireless communication perspective. IEEE Vehicular Technology Magazine. 2020;15(1):63–72. doi: 10.1109/MVT.2019.2953857 [DOI] [Google Scholar]
7. Lv Z, Li X, Lv H, Xiu W. BIM big data storage in WebVRGIS. IEEE Transactions on Industrial Informatics. 2020;16(4):2566–2573. doi: 10.1109/TII.2019.2916689 [DOI] [Google Scholar]
8. Francis J, Bian L. Deep learning for distortion prediction in laser-based additive manufacturing using big data. Manufacturing Letters. 2019;20:10–14. doi: 10.1016/j.mfglet.2019.02.001 [DOI] [Google Scholar]
9. Olivera P, Danese S, Jay N, Natoli G, Peyrin-Biroulet L. Big data in IBD: a look into the future. Nature Reviews Gastroenterology & Hepatology. 2019;16(5):312–321. doi: 10.1038/s41575-019-0102-5 [DOI] [PubMed] [Google Scholar]
10. Fernandez-Basso C, Francisco-Agra AJ, Martin-Bautista MJ, Dolores Ruiz M. Finding tendencies in streaming data using big data frequent itemset mining. Knowledge-Based Systems. 2019;163:666–674. doi: 10.1016/j.knosys.2018.09.026 [DOI] [Google Scholar]
11.Xu L, Shao G, Cao Y, Yang H, Sun C, Zhang T, et al. Research on telecom big data platform of LTE/5G mobile networks. In: 2019 IEEE International Conferences on Ubiquitous Computing & Communications (IUCC) and Data Science and Computational Intelligence (DSCI) and Smart Computing, Networking and Services (SmartCNS). IEEE; 2019. p. 756–761.
12.Salih BA, Wongthongtham P, Zajabbari B, et al. Towards a methodology for social business intelligence in the era of big social data incorporating trust and semantic analysis. In: Proceedings of the International Conference on Data Engineering 2015 (DaEng-2015). Springer; 2019. p. 519–527.
13. Rjoub G, Bentahar J, Wahab OA. BigTrustScheduling: Trust-aware big data task scheduling approach in cloud computing environments. Future Generation Computer Systems. 2020;110:1079–1097. doi: 10.1016/j.future.2019.11.019 [DOI] [Google Scholar]
14. Rutledge RB, Chekroud AM, Huys QJ. Machine learning and big data in psychiatry: toward clinical applications. Current opinion in neurobiology. 2019;55:152–159. doi: 10.1016/j.conb.2019.02.006 [DOI] [PubMed] [Google Scholar]
15.Singh R, Kumar V, Shrivastava A, Kumar S, Tiwari A. RVA sorting based on bubble & quick sort technique. In: Proceedings of the 2014 International Conference on Information and Communication Technology for Competitive Strategies; 2014. p. 1–6.
16. Wild S, Nebel ME, Neininger R. Average case and distributional analysis of dual-pivot quicksort. ACM Transactions on Algorithms (TALG). 2015;11(3):1–42. doi: 10.1145/2629340 [DOI] [Google Scholar]
17.Agrawal A, Sriram B. Concom sorting algorithm. In: 2015 4th International Conference on Computer Science and Network Technology (ICCSNT). vol. 01. IEEE; 2015. p. 229–233.
18.Osama H, Omar Y, Badr A. Mapping sorting algorithm. In: 2016 SAI Computing Conference (SAI). IEEE; 2016. p. 488–491.
19.Vignesh R, Pradhan T. Merge sort enhanced in place sorting algorithm. In: 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT). IEEE; 2016. p. 698–704.
20.Idrizi F, Rustemi A, Dalipi F. A new modified sorting algorithm: a comparison with state of the art. In: 2017 6th Mediterranean Conference on Embedded Computing (MECO). IEEE; 2017. p. 1–6.
21. Mohammed AS, S¸ahin Emrah Amrahov, C¸ elebi FV. Bidirectional conditional insertion sort algorithm; an efficient progress on the classical insertion sort. Future Generation Computer Systems. 2017;71:102–112. doi: 10.1016/j.future.2017.01.034 [DOI] [Google Scholar]
22. Goel S, Kumar R. Brownian motus and clustered binary insertion sort methods: an efficient progress over traditional methods. Future Generation Computer Systems. 2018;86:266–280. doi: 10.1016/j.future.2018.04.038 [DOI] [Google Scholar]
23. Omar YK, Osama H, Badr A. Double hashing sort algorithm. Computing in Science & Engineering. 2017;19(02):63–69. doi: 10.1109/MCSE.2017.26 [DOI] [Google Scholar]
24. Zutshi A, Goswami D. Systematic review and exploration of new avenues for sorting algorithm. International Journal of Information Management Data Insights. 2021;1(2):100042. doi: 10.1016/j.jjimei.2021.100042 [DOI] [Google Scholar]
25.Gugale Y. Super sort sorting algorithm. In: 2018 3rd International Conference for Convergence in Technology (I2CT). IEEE; 2018. p. 1–5.
26. Lee YS, Quero LC, Kim S, Kim J, Maeng SR. ActiveSort: efficient external sorting using active SSDs in the MapReduce framework. Future Generation Computer Systems. 2016;65:76–89. doi: 10.1016/j.future.2016.03.003 [DOI] [Google Scholar]
27. Laga A, Boukhobza J, Singhoff F, Koskas M. Montres: merge on-the-run external sorting algorithm for large data volumes on ssd based storage systems. IEEE Transactions on Computers. 2017;66(10):1689–1702. doi: 10.1109/TC.2017.2706678 [DOI] [Google Scholar]
28. Liang Y, Chen T, Chang Y, Chen S, Wei H, Shih W. B*-sort: enabling write-once sorting for non-volatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2020;39(12):4549–4562. doi: 10.1109/TCAD.2020.2979819 [DOI] [Google Scholar]
29.Arge L, Thorup M. Ram-efficient external memory sorting. In: International Symposium on Algorithms and Computation. Springer; 2013. p. 491-501.
30. Kanza Y, Yaari H. External sorting on flash storage: reducing cell wearing and increasing efficiency by avoiding intermediate writes. The VLDB Journal. 2016;25(4):495–518. doi: 10.1007/s00778-016-0426-5 [DOI] [Google Scholar]
31.Elder M, Goh YK. Permutations sorted by a finite and an infinite stack in series. In: International Conference on Language and Automata Theory and Applications. Springer; 2018. p. 220–231.
32. Begenau J, Farboodi M, Veldkamp L. Big data in finance and the growth of large firms. Journal of Monetary Economics. 2018;97:71–87. doi: 10.1016/j.jmoneco.2018.05.013 [DOI] [Google Scholar]
33. Tiwari S, Wee HM, Daryanto Y. Big data analytics in supply chain management between 2010 and 2016: insights to industries. Computers & Industrial Engineering. 2018;115:319–330. doi: 10.1016/j.cie.2017.11.017 [DOI] [Google Scholar]
34. Chaikhan S, Phimoltares S, Lursinsap C. Correct and stable sorting for overflow streaming data with a limited storage size and a uniprocessor. PeerJ Computer Science. 2021;7:e355. doi: 10.7717/peerj-cs.355 [DOI] [PMC free article] [PubMed] [Google Scholar]
35. Polyzotis N, Roy S, Whang SE, Zinkevich M. Data lifecycle challenges in production machine learning: a survey. ACM SIGMOD Record. 2018;47(2):17–28. doi: 10.1145/3299887.3299891 [DOI] [Google Scholar]
36.Sinaeepourfard A, Garcia J, Masip-Bruin X, Marin-Tordera E, Yin X, Wang C. A data lifeCycle model for smart cities. In: 2016 International Conference on Information and Communication Technology Convergence (ICTC). IEEE; 2016. p. 400–405.
37.Ginart A, Guan MY, Valiant G, Zou J. Making ai forget you: data deletion in machine learning. arXiv preprint arXiv:190705012. 2019.
38. Krempl G, Zliobaite I, Brzeziński D, Hüllermeier E, Last M, Lemaire V, et al. Open challenges for data stream mining research. ACM SIGKDD explorations newsletter. 2014;16(1):1–10. doi: 10.1145/2674026.2674028 [DOI] [Google Scholar]
39.Lauinger T, Chaabane A, Buyukkayhan AS, Onarlioglu K, Robertson W. Game of registrars: an empirical analysis of post-expiration domain name takeovers. In: 26th USENIX Security Symposium (USENIX Security 17); 2017. p. 865–880.
40.Delhi weather data;. https://www.kaggle.com/mahirkukreja/delhi-weather-data.
41.Artificial lunar landscape dataset;. https://www.kaggle.com/romainpessia/artificial-lunar-rocky-landscape-dataset.
42.Continue playing game;. https://www.kaggle.com/qichenghu/continue-playing-game

PLoS One. doi: 10.1371/journal.pone.0266295.r001

Decision Letter 0

Mohamed Hammad

5 Jan 2022

PONE-D-21-39238Fast Continuous Streaming Sort in Big Streaming Data Environment under Fixed-Size Single StoragePLOS ONE

Dear Dr. Phimoltares,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR:

Many recent papers are missing
The limitations and drawbacks of the method should be highlighted

==============================

Please submit your revised manuscript by Feb 19 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Mohamed Hammad, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please list all data sources used in your Data availability statement. Please also state whether or not sharing the Kaggle data with the manuscript complies with the terms and conditions of Kaggle.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"This work was supported by Development and Promotion of Science and Technology Talents Project (DPST) and Thailand Research Fund under grant number RTA6080013."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"Suluk Chaikhan received a grant from Development and Promotion of Science and Technology Talents Project (DPST).

Chidchanok Lursinsap received a grant from Thailand Research Fund under grant number RTA6080013."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper proposes a new fast continuous streaming sorting in a big streaming data environment is proposed to cope with the constraint of storage overflow.

This is an interesting paper that deserve to be improved before any publication in journal:

• The abstract must summarize the performance evaluation results.

• The related work papers are not up to date in the domain where there are a lot of research papers are recently. In addition the review for most of the cited work is nebulous. The review failed to summarize critical details about each of the cited research study and its relevant pros and cons.

• Text is overlapped with image in some figures.

• Equations should be aligned, and their fonts must be matched in the manuscript. All the equations should be revised and realigned.

• The conclusions section should conclude that you have achieved from the study, contributions of the study to academics and practices, and recommendations of future works.

• The list of references should be reformatted and checked again to be matched with the journal requirement where a different styles and types are used. Please check some spells and typos.

• The authors should make their manuscript proofread by a native English speaker.

Reviewer #2: The manuscript, in its present form, contains several weaknesses. Adequate revisions to the following points should be undertaken to justify the recommendation for publication.

1. The contribution is not stated clearly.

2. The abstract must be rewritten, please delete unnecessary sentences.

3. It is recommended to re-examine and design the Figures.

4. A deep and detailed comparison with other methods is mandatory.

5. Please add the system specifications used for the evaluation as well as the programming language.

6. Specify the limitations and drawbacks of the proposed method.

7: Please add future work to the conclusion section and discuss it briefly.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Apr 5;17(4):e0266295. doi: 10.1371/journal.pone.0266295.r002

Author response to Decision Letter 0

5 Mar 2022

We would like to resubmit the revised manuscript entitled “Fast Continuous Streaming Sort in Big Streaming Data Environment under Fixed-Size Single Storage” (PONE-D-21-39238) to be considered for publication as an original article in the PLOS ONE. We appreciate your concerns and have improved the manuscript according to the advice of the academic editor and two reviewers. We detail the revision in 'cover-letter' and 'response to reviewers' files. We hope you find our manuscript suitable for publication and look forward to hearing from you.

Attachment

Submitted filename: Response to Reviewers.docx

Click here for additional data file.^{(1.4MB, docx)}

PLoS One. doi: 10.1371/journal.pone.0266295.r003

Decision Letter 1

Mohamed Hammad

18 Mar 2022

Fast continuous streaming sort in big streaming data environment under fixed-size single storage

PONE-D-21-39238R1

Dear Dr. Phimoltares,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Mohamed Hammad, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: I have reviewed the author's response to concerns and feel that they have adequately responded to them. The changes in paper text also sufficiently address the concerns - in my view, the paper is ready for acceptance.

Reviewer #2: The paper titled (Fast continuous streaming sort in big streaming data environment under fixed-size single storage) accepted from my side.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. doi: 10.1371/journal.pone.0266295.r004

Acceptance letter

Mohamed Hammad

25 Mar 2022

PONE-D-21-39238R1

Fast continuous streaming sort in big streaming data environment under fixed-size single storage

Dear Dr. Phimoltares:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Mohamed Hammad

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Attachment

Submitted filename: Response to Reviewers.docx

Click here for additional data file.^{(1.4MB, docx)}

Data Availability Statement

[pone.0266295.ref001] 1. Xu LD, Duan L. Big data for cyber physical systems in industry 4.0: a survey. Enterprise Information Systems. 2019;13(2):148–169. doi: 10.1080/17517575.2018.1442934 [DOI] [Google Scholar]

[pone.0266295.ref002] 2. Rao TR, Mitra P, Bhatt R, Goswami A. The big data system, components, tools and technologies: a survey. Knowledge and Information Systems. 2019;60(3):1165–1245. doi: 10.1007/s10115-018-1248-0 [DOI] [Google Scholar]

[pone.0266295.ref003] 3. Ghorbanian M, Dolatabadi SH, Siano P. Big data issues in smart grids: a survey. IEEE Systems Journal. 2019;13(4):4158–4168. doi: 10.1109/JSYST.2019.2931879 [DOI] [Google Scholar]

[pone.0266295.ref004] 4. Nadikattu RR. Research on data science, data analytics and big data. INTERNATIONAL JOURNAL OF ENGINEERING, SCIENCE AND. 2020;9(5):99–105. [Google Scholar]

[pone.0266295.ref005] 5. Seng JKP, Ang KL. Multimodal emotion and sentiment modeling from unstructured big data: challenges, architecture, techniques. IEEE Access. 2019;7:90982–90998. doi: 10.1109/ACCESS.2019.2926751 [DOI] [Google Scholar]

[pone.0266295.ref006] 6. Liu Y, Bi S, Shi Z, Hanzo L. When machine learning meets big data: a wireless communication perspective. IEEE Vehicular Technology Magazine. 2020;15(1):63–72. doi: 10.1109/MVT.2019.2953857 [DOI] [Google Scholar]

[pone.0266295.ref007] 7. Lv Z, Li X, Lv H, Xiu W. BIM big data storage in WebVRGIS. IEEE Transactions on Industrial Informatics. 2020;16(4):2566–2573. doi: 10.1109/TII.2019.2916689 [DOI] [Google Scholar]

[pone.0266295.ref008] 8. Francis J, Bian L. Deep learning for distortion prediction in laser-based additive manufacturing using big data. Manufacturing Letters. 2019;20:10–14. doi: 10.1016/j.mfglet.2019.02.001 [DOI] [Google Scholar]

[pone.0266295.ref009] 9. Olivera P, Danese S, Jay N, Natoli G, Peyrin-Biroulet L. Big data in IBD: a look into the future. Nature Reviews Gastroenterology & Hepatology. 2019;16(5):312–321. doi: 10.1038/s41575-019-0102-5 [DOI] [PubMed] [Google Scholar]

[pone.0266295.ref010] 10. Fernandez-Basso C, Francisco-Agra AJ, Martin-Bautista MJ, Dolores Ruiz M. Finding tendencies in streaming data using big data frequent itemset mining. Knowledge-Based Systems. 2019;163:666–674. doi: 10.1016/j.knosys.2018.09.026 [DOI] [Google Scholar]

[pone.0266295.ref011] 11.Xu L, Shao G, Cao Y, Yang H, Sun C, Zhang T, et al. Research on telecom big data platform of LTE/5G mobile networks. In: 2019 IEEE International Conferences on Ubiquitous Computing & Communications (IUCC) and Data Science and Computational Intelligence (DSCI) and Smart Computing, Networking and Services (SmartCNS). IEEE; 2019. p. 756–761.

[pone.0266295.ref012] 12.Salih BA, Wongthongtham P, Zajabbari B, et al. Towards a methodology for social business intelligence in the era of big social data incorporating trust and semantic analysis. In: Proceedings of the International Conference on Data Engineering 2015 (DaEng-2015). Springer; 2019. p. 519–527.

[pone.0266295.ref013] 13. Rjoub G, Bentahar J, Wahab OA. BigTrustScheduling: Trust-aware big data task scheduling approach in cloud computing environments. Future Generation Computer Systems. 2020;110:1079–1097. doi: 10.1016/j.future.2019.11.019 [DOI] [Google Scholar]

[pone.0266295.ref014] 14. Rutledge RB, Chekroud AM, Huys QJ. Machine learning and big data in psychiatry: toward clinical applications. Current opinion in neurobiology. 2019;55:152–159. doi: 10.1016/j.conb.2019.02.006 [DOI] [PubMed] [Google Scholar]

[pone.0266295.ref015] 15.Singh R, Kumar V, Shrivastava A, Kumar S, Tiwari A. RVA sorting based on bubble & quick sort technique. In: Proceedings of the 2014 International Conference on Information and Communication Technology for Competitive Strategies; 2014. p. 1–6.

[pone.0266295.ref016] 16. Wild S, Nebel ME, Neininger R. Average case and distributional analysis of dual-pivot quicksort. ACM Transactions on Algorithms (TALG). 2015;11(3):1–42. doi: 10.1145/2629340 [DOI] [Google Scholar]

[pone.0266295.ref017] 17.Agrawal A, Sriram B. Concom sorting algorithm. In: 2015 4th International Conference on Computer Science and Network Technology (ICCSNT). vol. 01. IEEE; 2015. p. 229–233.

[pone.0266295.ref018] 18.Osama H, Omar Y, Badr A. Mapping sorting algorithm. In: 2016 SAI Computing Conference (SAI). IEEE; 2016. p. 488–491.

[pone.0266295.ref019] 19.Vignesh R, Pradhan T. Merge sort enhanced in place sorting algorithm. In: 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT). IEEE; 2016. p. 698–704.

[pone.0266295.ref020] 20.Idrizi F, Rustemi A, Dalipi F. A new modified sorting algorithm: a comparison with state of the art. In: 2017 6th Mediterranean Conference on Embedded Computing (MECO). IEEE; 2017. p. 1–6.

[pone.0266295.ref021] 21. Mohammed AS, S¸ahin Emrah Amrahov, C¸ elebi FV. Bidirectional conditional insertion sort algorithm; an efficient progress on the classical insertion sort. Future Generation Computer Systems. 2017;71:102–112. doi: 10.1016/j.future.2017.01.034 [DOI] [Google Scholar]

[pone.0266295.ref022] 22. Goel S, Kumar R. Brownian motus and clustered binary insertion sort methods: an efficient progress over traditional methods. Future Generation Computer Systems. 2018;86:266–280. doi: 10.1016/j.future.2018.04.038 [DOI] [Google Scholar]

[pone.0266295.ref023] 23. Omar YK, Osama H, Badr A. Double hashing sort algorithm. Computing in Science & Engineering. 2017;19(02):63–69. doi: 10.1109/MCSE.2017.26 [DOI] [Google Scholar]

[pone.0266295.ref024] 24. Zutshi A, Goswami D. Systematic review and exploration of new avenues for sorting algorithm. International Journal of Information Management Data Insights. 2021;1(2):100042. doi: 10.1016/j.jjimei.2021.100042 [DOI] [Google Scholar]

[pone.0266295.ref025] 25.Gugale Y. Super sort sorting algorithm. In: 2018 3rd International Conference for Convergence in Technology (I2CT). IEEE; 2018. p. 1–5.

[pone.0266295.ref026] 26. Lee YS, Quero LC, Kim S, Kim J, Maeng SR. ActiveSort: efficient external sorting using active SSDs in the MapReduce framework. Future Generation Computer Systems. 2016;65:76–89. doi: 10.1016/j.future.2016.03.003 [DOI] [Google Scholar]

[pone.0266295.ref027] 27. Laga A, Boukhobza J, Singhoff F, Koskas M. Montres: merge on-the-run external sorting algorithm for large data volumes on ssd based storage systems. IEEE Transactions on Computers. 2017;66(10):1689–1702. doi: 10.1109/TC.2017.2706678 [DOI] [Google Scholar]

[pone.0266295.ref028] 28. Liang Y, Chen T, Chang Y, Chen S, Wei H, Shih W. B*-sort: enabling write-once sorting for non-volatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2020;39(12):4549–4562. doi: 10.1109/TCAD.2020.2979819 [DOI] [Google Scholar]

[pone.0266295.ref029] 29.Arge L, Thorup M. Ram-efficient external memory sorting. In: International Symposium on Algorithms and Computation. Springer; 2013. p. 491-501.

[pone.0266295.ref030] 30. Kanza Y, Yaari H. External sorting on flash storage: reducing cell wearing and increasing efficiency by avoiding intermediate writes. The VLDB Journal. 2016;25(4):495–518. doi: 10.1007/s00778-016-0426-5 [DOI] [Google Scholar]

[pone.0266295.ref031] 31.Elder M, Goh YK. Permutations sorted by a finite and an infinite stack in series. In: International Conference on Language and Automata Theory and Applications. Springer; 2018. p. 220–231.

[pone.0266295.ref032] 32. Begenau J, Farboodi M, Veldkamp L. Big data in finance and the growth of large firms. Journal of Monetary Economics. 2018;97:71–87. doi: 10.1016/j.jmoneco.2018.05.013 [DOI] [Google Scholar]

[pone.0266295.ref033] 33. Tiwari S, Wee HM, Daryanto Y. Big data analytics in supply chain management between 2010 and 2016: insights to industries. Computers & Industrial Engineering. 2018;115:319–330. doi: 10.1016/j.cie.2017.11.017 [DOI] [Google Scholar]

[pone.0266295.ref034] 34. Chaikhan S, Phimoltares S, Lursinsap C. Correct and stable sorting for overflow streaming data with a limited storage size and a uniprocessor. PeerJ Computer Science. 2021;7:e355. doi: 10.7717/peerj-cs.355 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0266295.ref035] 35. Polyzotis N, Roy S, Whang SE, Zinkevich M. Data lifecycle challenges in production machine learning: a survey. ACM SIGMOD Record. 2018;47(2):17–28. doi: 10.1145/3299887.3299891 [DOI] [Google Scholar]

[pone.0266295.ref036] 36.Sinaeepourfard A, Garcia J, Masip-Bruin X, Marin-Tordera E, Yin X, Wang C. A data lifeCycle model for smart cities. In: 2016 International Conference on Information and Communication Technology Convergence (ICTC). IEEE; 2016. p. 400–405.

[pone.0266295.ref037] 37.Ginart A, Guan MY, Valiant G, Zou J. Making ai forget you: data deletion in machine learning. arXiv preprint arXiv:190705012. 2019.

[pone.0266295.ref038] 38. Krempl G, Zliobaite I, Brzeziński D, Hüllermeier E, Last M, Lemaire V, et al. Open challenges for data stream mining research. ACM SIGKDD explorations newsletter. 2014;16(1):1–10. doi: 10.1145/2674026.2674028 [DOI] [Google Scholar]

[pone.0266295.ref039] 39.Lauinger T, Chaabane A, Buyukkayhan AS, Onarlioglu K, Robertson W. Game of registrars: an empirical analysis of post-expiration domain name takeovers. In: 26th USENIX Security Symposium (USENIX Security 17); 2017. p. 865–880.

[pone.0266295.ref040] 40.Delhi weather data;. https://www.kaggle.com/mahirkukreja/delhi-weather-data.

[pone.0266295.ref041] 41.Artificial lunar landscape dataset;. https://www.kaggle.com/romainpessia/artificial-lunar-rocky-landscape-dataset.

[pone.0266295.ref042] 42.Continue playing game;. https://www.kaggle.com/qichenghu/continue-playing-game

PERMALINK

Fast continuous streaming sort in big streaming data environment under fixed-size single storage

Suluk Chaikhan

Suphakant Phimoltares

Chidchanok Lursinsap

Roles

Abstract

Introduction

Table 1. Comparison of sorting algorithms on streaming data.

Studied problems and constraints

Background

Fig 1. A streaming data sort example.

Definitions and notations

Concepts

Fig 2. An example fast streaming data sort with 11 iterations from (a) to (k).

Proposed fast streaming data sort algorithm

Handling data duplication

Inserting incoming numbers into compact groups

Table 2. Summary of insertion position and the sequence of insertion in forms of split compact groups.

Merging compact groups

Removing expired single numbers from the compact set

Experimental results and discussion

Sorting time versus data size and working memory size

Fig 3. The execution time of fast streaming data sort with respect to the setting of working memory size in terms of percentage of total data size (1,000,000 numbers).

Table 3. Comparison of sorting time between fast streaming data sort and streaming data sort when varying different sizes of working memory in terms of the percentages of total amount of data.

Table 4. Comparison of sorting time between fast streaming data sort and four classic sorting algorithms.

Table 5. Comparison of sorting times of fast streaming data sort and external merge sort.

Memory usage

Fig 4. Fluctuation in number of compact groups and single data for the proposed algorithm at several memory sizes for one million data points.

Fig 5. Scenarios of Theorem 8.

Data duplication

Fig 6. Snapshot of sorting time change when there exists a mixture of 100 non-duplicate numbers and 100 duplicate numbers.

Table 6. Description of three real-world data sets.

Table 7. The time consumption, memory size for storing sorted data, and the ratio of data size to memory size for storing sorted data for three real-world data sets.

Time complexity analysis

Fig 7. Constraint on the limitation of working memory size, H ≪ M < n.

Working space analysis of duplicate data

Discussion

Conclusion

Data Availability

Funding Statement

References

Decision Letter 0

Mohamed Hammad

Roles

Author response to Decision Letter 0

Decision Letter 1

Mohamed Hammad

Roles

Acceptance letter

Mohamed Hammad

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases