. 2022 Mar 21;27(1):605–614. doi: 10.1007/s00500-022-06938-8

RETRACTED ARTICLE: Dynamic multi-variant relational scheme-based intelligent ETL framework for healthcare management

Vijayalakshmi Manickam 1,, Minu Rajasekaran Indra 1
PMCID: PMC8935255  PMID: 35340776

Abstract

The growth of information technology has enabled organizations to maintain their data in many forms and at large volumes, increasing both the size and the dimensionality of the data they hold, whether on their own data servers or in the cloud. Such data are used to generate intelligence for a range of problems, and big data comes into play in this part. Optimizing the ETL process can greatly help real-time data analysis; the simplest optimization is to increase the frequency of the process, while others rely on different architectures, programming models, intelligence in transformation, and security. To improve ETL performance, this article presents an efficient dynamic multi-variant relational intelligent ETL framework. The distributed approach maintains ontologies and data dictionaries that are dynamically updated by different threads of the ETL process. The process starts with extraction, which pulls data from different sources and identifies the set of dimensions and their characteristics. The extracted data are verified against the data dictionary, and a relational score is measured for each data source against the existing ones. Similarly, the method computes a multi-variant relational similarity (MVRS) value for the data obtained from each source; this is performed by different ETL threads. According to the MVRS value, the method performs map reduce, selects the data node, and merges the data into the data warehouse.
The ETL threads detect changes in the data dictionaries and ontologies and iterate the transformation and loading steps accordingly. The method improves ETL performance with low time complexity.

Keywords: Big Data, ETL, Intelligence generation, MVRS, Relational model, Ontology, Cloud

Introduction

Modern organizations maintain a variety of data about their employees, customers, and business sectors. Such data span many dimensions, and the volumes maintained by different organizations are huge. This forms big data, which places no restriction on the size or volume of the data; in reality, the growing size and volume of data genuinely challenge organizations. The ultimate aim of maintaining such big data is to generate intelligence from it. For example, when the data concern the purchase histories (Nagarajan et al. 2015) of an e-commerce system, they can be used to generate intelligence about business growth or the sales of different products. Similarly, when the big data belong to the medical domain, they can be used to generate intelligence and decisions in disease prediction systems, and likewise for a variety of other problems.

Organizations keep their data in cloud data servers that can be accessed by their employees and users. However, the volume of data grows at every instant, which challenges data analysts in generating intelligence from the available big data. Extract, transform, load (ETL) has been identified as a key concept in modern big data environments where the data are heterogeneous. It moves source data from transactional databases to an analytical database for analysis. Extraction accesses the various sources to pull the data selected for analysis and converts them into a meaningful form. The extracted data are then standardized by a large number of transformation (T) tasks (cleansing, filtering, converting, merging, splitting, conforming, aggregation, etc.). Finally, the loading (L) tasks load the prepared data into the data warehouse (DW).
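As a concrete illustration, the three stages can be sketched as plain functions. This is a minimal sketch, not the framework proposed in this article; the function names, the list-backed warehouse, and the use of an "id" field to mark incomplete rows are all illustrative assumptions.

```python
def extract(sources):
    """E: pull raw records from every source into one list of dicts."""
    records = []
    for source in sources:
        records.extend(source())              # each source callable yields dicts
    return records

def transform(records):
    """T: conform key names, then filter out incomplete rows."""
    cleaned = []
    for r in records:
        r = {k.lower(): v for k, v in r.items()}   # conforming step
        if r.get("id") is None:                    # filtering step
            continue
        cleaned.append(r)
    return cleaned

def load(records, warehouse):
    """L: append the prepared records to the warehouse (here, a list)."""
    warehouse.extend(records)

warehouse = []
sources = [lambda: [{"ID": 1, "Name": "a"}, {"ID": None, "Name": "b"}]]
load(transform(extract(sources)), warehouse)
print(warehouse)   # [{'id': 1, 'name': 'a'}]
```

In a real pipeline each stage would read from and write to external systems, but the staged structure stays the same.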

The challenge behind the process is the varying volume as well as dimensionality (Mondal 2020); that work discusses data-warehousing issues for traditional data in terms of data quality and availability. For example, consider medical data comprising patient details, diagnostic details, scan images, cardiac graphs, and so on. Different medical organizations maintain the same information in different forms, each keeping additional data and storing them in various structures. This increases the complexity of the ETL process and complicates the entire problem of knowledge transformation for decision support systems. Similarly, the problem of channel publication over real-time data streams and join operations is well studied in Mehmood and Anees (2020), which notes the requirements of the ETL process. Likewise, Adnan and Akbar (2019) identify and address many limitations of existing ETL techniques with respect to the type and variety of big data.

Considering all this, the ETL problem is approached here with a multi-variant relational model that improves the performance of data learning and intelligence generation. The data maintained by different organizations or data servers carry meta-data in different forms (Simpson and Nagarajan 2021). By treating such meta-data as an ontology, the relations between data sources and data can be used to generate intelligence. The proposed multi-variant relational scheme reads the entire data and ontology and measures the multi-variant relational similarity (MVRS) to perform transformation and learning in support of intelligence generation. The proposed model is detailed in the following sections.

Related works

A number of approaches to ETL for big data have been discussed earlier; this section details a set of methods around the problem. A meta-data-based scheme is designed in Wang (2020): a logical model for ETL that works by customizing the data cleaning process and has been evaluated with different experiments. The application of machine learning in ETL to meet real-time user requirements despite data diversity and poor availability is discussed in Mondal (2020). Considering the diversity of data volume, different approaches are analysed in Souibgui (2019), which examines the characteristics of ETL quality and the impact of data volume and quality on the process. In Dakrory et al. (2015), the authors analysed data quality by designing a testing model that measures quality under varying volumes in ETL.

Data volume has a great impact on the ETL process, and Raj and Souza (2020) discuss how data can be loaded into the Pig environment to support various predictions. The Pig model fetches the data, converts them into homogeneous forms, and transforms them for prediction; it is designed to access HDFS. Similarly, to support the data analysts of power centres, the Informatica PowerCenter ETL tool supports such predictions (Gupta 2019).

To improve query optimization, different data quality techniques are enforced in the data warehouse in Gupta (2020), and to support the generation of business intelligence, the Talend Open Studio tool has been used to transform heterogeneous data into a homogeneous form; business intelligence is then generated from the integrated, transformed data (Shreemathi and Infant 2020).

Many journals across different sectors are published every year, and the ETL process has been used to analyse data in business journals. Accordingly, a visualization tool is designed in Grover and Kar (2017) that performs text mining to identify the theme of a journal and categorizes journals on that basis. Similarly, warehouse data have been used in the ETL problem: with social media data and data present in the warehouse, a behaviour analysis model is designed that works on two classes to perform sentiment analysis (Moalla et al. 2017).

Data streaming has great importance in ETL. An efficient DS-Join scheme is presented in Jeon et al. (2019), which operates at the micro-batch level: stream processing is performed in a distributed way, and stream processing engines are designed to perform such joins. To handle unstructured and structured streams, a MongoDB model is discussed in Mehmood and Anees (2019), which performs ETL by joining semi-stream data from various sources with disk-based master data according to keys. The problem of distributed streaming ETL is handled with the Striim model, which supports rapid development and deployment of streaming applications and whose architecture serves both developers and business users (Pareek et al. 2018). To handle real-time streaming conditions, an efficient ETL model performs streaming and joining on demand (Biswas et al. 2019). To address cache inequality in semi-stream many-to-many equijoins, a semi-stream balanced join (SSBJ) handles the join operations according to the service rate and reduces memory usage evenly (Naeem et al. 2019).

Data partitioning is an important stage of the ETL process. To support it, an efficient two-level data partitioning approach for real-time data warehouses, named 2LPA-RTDW, is designed in Hamdi et al. (2015); it handles unbalanced data in each partition according to user requirements. Spatial data are useful in many problems, and to integrate them into a single form, the GeoKettle tool integrates spatial data collected from different hotspots of Indonesia (Astriani and Trisminingsih 2016); the tool provides various interfaces for simplified and adjustable modelling. Similarly, to simplify data integration, a rewrite/merge scheme is presented in Cuzzocrea and Furtado (2018), a novel data warehousing model with two modes: in the static phase it performs rewriting, and in the dynamic phase it performs merging to support real-time conditions.

To handle synchronization in live streams, a live synchronization model is presented in Ma and Yang (2017), which uses the cache of the physical RDBMS via stream processing. The method constructs a Column Access-aware In-stream Data Cache (CAIDC) to support streaming over relational databases. Similarly, to support real-time streaming, an incremental model named Stream Cube handles real-time data acquisition, processing, and analysis to support decision making (Zheng et al. 2019).

To handle smart transportation big data in the IoT environment, a three-phase model supports the management of big data and processes the data with higher service management (Babar and Arif 2019). To improve data warehousing performance, a conceptual model defines different dimensionalities and stereotypes. A Data on Demand ETL (DOD-ETL) model is proposed in Machado et al. (2019), which combines an on-demand data stream and pipelines with distributed, parallel in-memory caches to support effective partitioning of data. An event-driven architecture is presented in Rieke et al. (2018) to manage spatial data obtained from different capturing devices and maintain geospatial information according to real-time data (Bouali et al. 2019). All these approaches struggle to achieve high performance in extract, transform, and load processes in high-dimensional data environments.

Dynamic multi-variant relational scheme-based intelligent ETL framework

The proposed dynamic multi-variant relational scheme-based extraction, transformation, and learning framework works according to the data dictionary as well as the ontology. First, the method reads the data locations and the meta-data of the given big data. Using the information available in the ontology, each data source is read and assigned to a separate thread. Each thread is responsible for reading the ontology and data dictionary of its data source. Using them, the relational score and similarity values are measured to perform map reduce and merging iteratively. The detailed working of the proposed model is presented in this section.

The architecture of the proposed dynamic multi-variant relational model for the ETL process is presented in Fig. 1. The functional components are detailed in this section.

Fig. 1.

Fig. 1

Architecture of proposed DMVR-ETL model

DMVR-ETL module

The big data given for the ETL problem are read, and the module fetches the set of available data dictionaries and their indexed ontologies. From the available data source, the method first identifies the set of available features or dimensions. According to the identified features, the data points of the given data set are verified, and any incomplete data are eliminated from the data set. The method then spawns multiple threads, each responsible for measuring the different similarity measures over the data present in each data node. The threads monitor updates to the ontology and data dictionary; whenever the dictionaries are updated, the process is iterated again.

Algorithm 1: DMVR-ETL process

The above-discussed algorithm represents the working of the DMVR-ETL model, which monitors changes in the data dictionary and receives the given big data. Using the big data and the data dictionary with its ontology, the method estimates the multi-variant relational similarity measure against the data dictionaries of the different data nodes. According to the MVRS value, the method decides on merging and performs map reduce to support efficient intelligence generation.
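The per-node threading described in this module can be sketched as follows. This is a hedged illustration, not the article's implementation: dict-shaped data points, the sample feature list, and the use of simple feature-set collection in place of the similarity computations are all assumptions.

```python
import threading

def complete_rows(rows, features):
    """Drop incomplete data points: keep rows with a value for every feature."""
    return [r for r in rows if all(r.get(f) is not None for f in features)]

def node_worker(node_id, rows, features, results, lock):
    """One thread per data node. Here it only records which features the
    node's data actually carry; the article's version would also compute
    similarity scores and watch the dictionary/ontology for updates."""
    present = {f for r in rows for f in r if f in features}
    with lock:
        results[node_id] = sorted(present)

features = ["age", "sex", "temp"]
nodes = {
    "n1": [{"age": 40, "sex": "F", "temp": 37.2},
           {"age": 51, "sex": None, "temp": 36.8}],   # incomplete: dropped
    "n2": [{"age": 23, "sex": "M", "temp": 38.1}],
}
results, lock, threads = {}, threading.Lock(), []
for nid, rows in nodes.items():
    t = threading.Thread(target=node_worker,
                         args=(nid, complete_rows(rows, features), features, results, lock))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
print(sorted(results.items()))
# [('n1', ['age', 'sex', 'temp']), ('n2', ['age', 'sex', 'temp'])]
```

The lock-protected results dictionary mirrors the shared dictionaries the threads update in the proposed framework.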

MVRS measurement

The multi-variant relational similarity (MVRS) value is measured by the proposed model from the feature relevancy score (FRS), relational score (RS), and type orient relevancy (ToR) values. First, the method estimates the FRS against the data dictionaries of the different data nodes. Any data dictionary or data node contains X data points with Y features, and the input big data Bd contains a subset of y features belonging to the data node. Based on the appearance and containment of these features, the feature relevancy measure is computed.

Consider the data dictionary of node p, represented as DDp, which contains a feature set Y, while the input big data Bd contains the feature set T. The relevancy between the data points according to the feature values is then measured as follows:

FRS = \frac{\sum_{i=1}^{|T|} \left[ T_i \in Y \,\wedge\, T_i \in DD_p \right]}{|T|}    (3)

Similarly, the feature relevancy of any given data can be measured against other data sets. In addition, relevancy can be measured by counting semantic relations using the available ontology; the semantic relation is computed from the number of distinct meanings a feature has in the ontology. Each data set has its own dictionary and ontology, which represent the relations between features and data. The method also computes the type orient relevancy (ToR) between the given data and the data present in the data nodes. By considering such relations, the transformation and learning problem can be improved.

Algorithm 2: MVRS computation

The above-discussed algorithm shows how the multi-variant relational similarity measure is computed. The method computes the ToR (type orient relevancy) on features, the FRS (feature relevancy score), and the relational score (RS) between the given ontology and the data dictionary sets, and from these values computes the MVRS.
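The FRS of Eq. (3) can be read as the fraction of the input features that occur in both the node feature set Y and the dictionary DDp. A direct sketch follows; the variable names mirror the article's symbols, while the sample values are invented for illustration.

```python
def frs(T, Y, DDp):
    """Feature relevancy score, Eq. (3): the fraction of input features T_i
    that appear both in the node feature set Y and in its dictionary DDp."""
    if not T:
        return 0.0
    hits = sum(1 for f in T if f in Y and f in DDp)
    return hits / len(T)

T   = ["age", "sex", "temp", "bmi"]          # features of the incoming big data Bd
Y   = {"age", "sex", "temp", "hr"}           # feature set of data node p
DDp = {"age", "sex", "temp", "bp", "hr"}     # data dictionary of node p
print(frs(T, Y, DDp))   # 0.75 -- 3 of the 4 input features match
```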

Map reduce/merging

The proposed DMVRS approach performs the map reduce operation over the feature dimensions rather than the features themselves. As the data sets have a variety of dimensions and features, the method computes the feature-centric overlap measure (FCOM) and the dimension-centric overlap measure (DCOM). According to these two values, the method performs either map reduce alone or both map reduce and merging. FCOM is measured from the number of feature values correlated between two data sets, whereas DCOM is measured from the similarity of the dimensions of the two data sets. Once merging is performed, the concerned data dictionary and ontology are updated.

Algorithm 3: Merging and map reduce

The proposed merging and map reduce algorithm estimates the feature-centric and dimension-centric overlap measures to perform merging and to update the ontology and data dictionary. The ontology file is updated with the new terms and relations identified.
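Since FCOM and DCOM are described only qualitatively here, the sketch below substitutes a Jaccard-style overlap and an assumed 0.5 threshold to show the shape of the decision rule; both choices are assumptions, not the article's formulas.

```python
def overlap(a, b):
    """Jaccard-style overlap, used as a stand-in for FCOM and DCOM."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def decide(features_a, features_b, dims_a, dims_b, threshold=0.5):
    """Assumed decision rule: map reduce alone when only the dimensions
    overlap strongly; map reduce plus merging when the features overlap too."""
    fcom = overlap(features_a, features_b)    # feature-centric overlap
    dcom = overlap(dims_a, dims_b)            # dimension-centric overlap
    if dcom >= threshold and fcom >= threshold:
        return "map-reduce + merge"
    if dcom >= threshold:
        return "map-reduce"
    return "keep separate"

print(decide(["age", "sex"], ["age", "bmi"],
             ["patient", "visit"], ["patient", "visit"]))   # map-reduce
```

Merging would then be followed by writing the newly observed features back into the data dictionary and ontology, as the algorithm describes.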

Results and discussion

The proposed dynamic multi-variant relational model-based ETL framework has been implemented, and its performance has been evaluated under various constraints: varying numbers of data nodes and varying numbers of ontologies and data dictionaries. This section presents the results obtained by evaluating the different approaches.

The data used for the performance evaluation of the proposed dynamic multi-variant relational model are summarized in Table 1. The Covid-19 data set provided by the WHO (World Health Organization) has been combined with other data sets, such as diabetic, cardiac, clinical, and lung data sets. These data were collected from 20 sources via https://data.world/datasets/covid-19 and fed to the proposed model to generate intelligence. The data sets contain different numbers of records, totalling up to one million. They come from sources such as the American health departments, which publish the full details of Covid-19 patients and their medical records; the other sources are used similarly. The features considered for this evaluation cover the clinical, diabetic, cardiac, Covid-19 (https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset), and lung data sets. The clinical data set includes features obtained from the CBC (complete blood count), such as white and red blood cell counts, haemoglobin, haematocrit, and platelets, together with temperature, height, weight, age, sex, and so on. Each data set covers various features, so the analysis examines how the method performs with 100, 200, and 300 features. The analysis is performed on the Covid-19 data set, which is a collection of diabetic, cardiac, clinical, and other features. For example, the diabetic data set includes name, age, sex, weight, BMI, fasting sugar, post-meal sugar, burning, and giddiness; the cardiac data set includes BP, heart rate, QRS value, P wave value, R wave value, etc.; and the clinical data set includes the results of the CRC and CBR tests, which comprise hundreds of features.
The Covid-19 features include temperature, pressure, body pain, diarrhoea, vomiting, tiredness, contacts, transport, cough, headache, clinical visits, and so on. The number of features used in the performance analysis varies between 100 and 300.
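The kind of multi-source record combination described above can be illustrated with a short sketch; the join key "patient_id" and the sample rows are invented for illustration, not taken from the evaluation data sets.

```python
def merge_on(key, *tables):
    """Fold rows from several source tables into one record per key value."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

covid   = [{"patient_id": 1, "temp": 38.5, "cough": True}]
cardiac = [{"patient_id": 1, "bp": "140/90", "heart_rate": 92}]
print(merge_on("patient_id", covid, cardiac))
# [{'patient_id': 1, 'temp': 38.5, 'cough': True, 'bp': '140/90', 'heart_rate': 92}]
```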

Table 1.

Evaluation details

Parameter         Value
Tool used         Microsoft Azure
Data sources      20
No. of features   300
No. of relations  500
Data size         One million records
Data set          Covid-19 data set

The ETL performance of the different approaches is presented in Table 2; the proposed method achieves higher ETL performance than the other approaches, reaching 87%, 92%, and 97% in the 100-, 200-, and 300-feature cases, respectively.

Table 2.

Analysis on ETL performance

Performance on ETL (%) vs. no. of features

Method      100 features  200 features  300 features
2LPA-RTDW        72            76            82
STRIIM           75            79            84
DOD-ETL          77            81            86
DMVR-ETL         87            92            97

The ETL performance of the methods is compared in Fig. 2. The proposed DMVR-ETL produces the highest performance in extraction, transformation, and learning in all the test cases considered.

Fig. 2.

Fig. 2

Analysis on ETL performance

The overlap ratio produced by the different methods as the number of features varies is presented in Table 3. The proposed DMVR-ETL approach produces a lower overlap ratio than the other approaches.

Table 3.

Analysis on overlap vs no of features

Ratio of overlap on ETL (%) vs. no. of features

Method      100 features  200 features  300 features
2LPA-RTDW        28            24            18
STRIIM           25            21            16
DOD-ETL          23            19            14
DMVR-ETL         13             8             3

The overlap ratio produced by the different methods is shown in Fig. 3. The proposed DMVR-ETL approach produces the lowest overlap ratio in all the test cases, reducing overlap compared to the 2LPA-RTDW, STRIIM, and DOD-ETL approaches by considering the multi-variant relational value between different data and features.

Fig. 3.

Fig. 3

Analysis on Overlap

The time complexity has been measured and is presented in Table 4. The proposed DMVR-ETL approach incurs the lowest time complexity in all the cases.

Table 4.

Analysis on time complexity

Time complexity vs. no. of features

Method      100 features  200 features  300 features
2LPA-RTDW        46            56            72
STRIIM           43            52            68
DOD-ETL          37            44            62
DMVR-ETL         27            32            37

The time complexity introduced by the different methods in the ETL process is shown in Fig. 4. The proposed DMVR-ETL algorithm incurs less time complexity than the other methods.

Fig. 4.

Fig. 4

Analysis on time complexity

Conclusion

In this paper, an efficient dynamic multi-variant relational similarity-based ETL framework is sketched. The proposed approach reads the data sets and the dictionary with the ontology to perform the ETL process. The method first identifies the features and dimensions of the given data set and estimates the multi-variant relational similarity (MVRS) against the various data dictionaries and data nodes, based on the feature relevancy score, relational score, and type orient relevancy score. Based on the MVRS value, the method performs merging and map reduce. The proposed method improves ETL performance across the different test cases.

Abbreviations

Fel

Feature list

ETL

Extract, transform, load

Bd

Big data

MVRS

Multi-variant relational Similarity

DD

Data dictionary

FRS

Feature relevancy score

Fs

Feature set

Os

Ontology

ToR

Type orient relevancy

FCOM

Feature-centric overlap measure

DCOM

Dimension-centric overlap measure

Funding

We did not receive financial support from any organization for the submitted work.

Data availability

Not applicable.

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Consent for publication

Authors give consent to Wireless Networks: The Journal of Mobile Communication, Computation and Information to publish their article.

Footnotes

This article has been retracted. Please see the retraction notice for more detail: https://doi.org/10.1007/s00500-024-09972-w

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

7/15/2024

This article has been retracted. Please see the Retraction Notice for more detail: 10.1007/s00500-024-09972-w

References

  1. Gupta A, Arun S (2020) Proposed techniques to optimize the DW and ETL query for enhancing data warehouse efficiency. In: international conference on computing, communication and security (ICCCS)
  2. Gupta A (2019) A complete reference for informatica power center ETL tool. Int J Trend Sci Res Develop 3(2):1063–1070 [Google Scholar]
  3. Adnan K, Akbar R (2019) An analytical study of information extraction from unstructured and multidimensional big data. J Big Data 6(1):91 [Google Scholar]
  4. Astriani W, Trisminingsih R (2016) Extraction transformation and loading (ETL) module for hotspot spatial data warehouse using Geokettle. Procedia Environ Sci 33:626–634 [Google Scholar]
  5. Babar M, Arif F (2019) Real-time data processing scheme using big data analytics in Internet of Things based smart transportation environment. J Ambient Intell Hum Comput 10(10):4167–4177 [Google Scholar]
  6. Biswas N, Sarkar A, Mondal KC (2019) Efficient incremental loading in ETL processing for real-time data integration. Innov Syst Softw Eng 16:53–61 [Google Scholar]
  7. Bouali H, Akaichi J, Gaaloul A (2019) Real-time data warehouse loading methodology and architecture: a healthcare use case. Int J Data Anal Techn Strategies 11(4):310–327 [Google Scholar]
  8. Cuzzocrea NF, Furtado P (2018) A rewrite/merge approach for supporting real-time data warehousing via lightweight data integration. J Supercomput 76:3898–3922 [Google Scholar]
  9. Dakrory SB, Mahmoud TM, Ali AA (2015) Automated ETL testing on the data quality of a data warehouse. Int J Comput Appl 131(16):9–16 [Google Scholar]
  10. Grover P, Kar AK (2017) Big data analytics: a review on theoretical contributions and tools used in literature. Global J Flexible Syst Manage 18:203–229 [Google Scholar]
  11. Hamdi E, Bouazizi S, Alshomrani JF (2015) 2LPA-RTDW: a two-level data partitioning approach for real-time data warehouse. In: Procedings of the IEEE/ACIS 14th Int. Conf. Comput. Inf. Sci. (ICIS), pp 632–638, Jun. 2015
  12. https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset.
  13. Jeon Y-H, Lee K-H, Kim H-J (2019) Distributed join processing between streaming and stored big data under the micro-batch model. IEEE Access 7:34583–34598 [Google Scholar]
  14. Mondal KC (2020) Role of machine learning in ETL automation. In: international conference on distributed computing and networking, Jan 2020, Article No. 57, pp 1–6
  15. Ma K, Yang B (2017) Column access-aware in-stream data cache with stream processing framework. J Signal Process Syst 86(2):191–205 [Google Scholar]
  16. Machado GV, Cunha Ì, Pereira ACM, Oliveira LB (2019) DOD-ETL: Distributed on-demand ETL for near real-time business intelligence. J Internet Services Appl 10(1):21 [Google Scholar]
  17. Souibgui M (2019) Data quality in ETL process: a preliminary study. Procedia Comput Sci 159:676–687 [Google Scholar]
  18. Mehmood E, Anees T (2019) Performance analysis of not only SQL semi-stream join using MongoDB for real-time data warehousing. IEEE Access 7:134215–134225 [Google Scholar]
  19. Mehmood E, Anees T (2020) Challenges and solutions for processing real-time big data stream: a systematic literature review. IEEE Access 8:119123–119143. 10.1109/ACCESS.2020.3005268 [Google Scholar]
  20. Moalla A, Nabli L. B, Hammami M (2017) Data warehouse design approaches from social media: review and comparison. Social Netw Anal Mining 7(1):5 [Google Scholar]
  21. Naeem MA, Weber G, Lutteroth C (2019) A memory-optimal many-to-many semi-stream join. Distrib Parallel Databases 37(4):623–649 [Google Scholar]
  22. Nagarajan G, Minu RI, Vedanarayanan V, Sundersingh Jebaseelan SD, Vasanth K (2015) CIMTEL-mining algorithm for big data in telecommunication. Int J Eng Technol (IJET) 7(5):1709–1715 [Google Scholar]
  23. Pareek B, Khaladkar R, Sen B, Onat VN, Lakshminarayanan M (2018) Real-time ETL in Striim. In: Proceedings of the Int. Workshop Real-Time Bus. Intell. Anal. (BIRTE), p 3
  24. Raj A, Souza RD (2020) Implementation of ETL Process using Pig and Hadoop, Int J Recent Technol Eng (IJRTE)
  25. Rieke M, Bigagli L, Herle S, Jirka S, Kotsev A, Liebig T et al (2018) Geospatial IoT—The need for event-driven architectures in contemporary spatial data infrastructures. ISPRS Int J Geo-Inf 7(10):385 [Google Scholar]
  26. Shreemathi J, Infant Jv (2020) Data integration in ETL using TALEND. In: international conference on advanced computing and communication systems (ICACCS)
  27. Simpson SV, Nagarajan G (2021) An edge based trustworthy environment establishment for internet of things: an approach for smart cities. Wirel Netw. 10.1007/s11276-021-02667-2 [Google Scholar]
  28. Wang J, Bao L (2020) Design of ETL tool for structured data based on data warehouse. In: international conference on computer science and application engineering, Oct 2020, Article No. 119, pp 1–5
  29. Zheng T, Chen G, Wang X, Chen C, Wang X, Luo S (2019) Real-time intelligent big data processing: technology platform and applications. Sci China Inf Sci 62(8):82101 [Google Scholar]



Articles from Soft Computing are provided here courtesy of Nature Publishing Group
