A multidimensional data warehouse design to combat the health pandemics

Gizem Turcan; Serhat Peker

doi:10.1007/s42488-022-00082-6

2022 Dec 20;4(3-4):371–386. doi: 10.1007/s42488-022-00082-6

A multidimensional data warehouse design to combat the health pandemics

Gizem Turcan ^1,^✉, Serhat Peker ¹

PMCID: PMC9763078

Abstract

The Covid-19 pandemic has brought about a new lifestyle for across the globe. Throughout this period, the use of holistic methods has become indispensable to deal with the enormous amount of data in this regard. It appears that the simplest way to tackle this issue is to spread the digitalization efforts concerning all data-based applications. Given the significance of pandemic data management, it is essential to have a data warehouse that collects, associates, and communicates these data. Containing a significant volume of structured data, warehousing can provide the necessary foundation for data mining and the development of analytical tools. To this end, the present paper proposes a data warehouse for combatting and managing pandemics, with the possibility to be enhanced for other personal or public health-related initiatives. In this research, the bottom-up data warehouse building methodology is used to construct a warehouse. A fact constellation schema model is utilized to accommodate the information ranging from citizen demographics to physician-prescribed drugs and laboratory tests. Sample queries are executed based on the proposed data warehouse for different purposes, and desired query results are obtained within proper response times. The proposed data warehouse contributes to countrywide implementation of pandemic practices and illuminates research on faster, less expensive, and safer management of citywide, nationwide, or worldwide health emergencies within a robust technical framework by governments.

Keywords: Data warehouse, Covid-19, Data management, Multidimensional data model, Data mining

Introduction

At the beginning of 2020, the world started to tackle the Covid-19 pandemic. Yet, due to its many varieties, the rapid spread of the disease gained momentum and continues to do so even now. The world has awoken to take precautions against viruses by different means. These efforts cover a wide range of rules and practices, such as the obligation to wear masks and keep physical distances in public, not to mention the changes in international travel norms and practices. Moreover, there are several changes in the work and educational settings, as well as an increase in the workload intensity of hospitals and pharmacies and in the tendency toward online activities (shopping, education, work, consultancy, etc.) (Ros et al. 2021). In this way, the world could not resist the change since it was simply a matter of life and death for its citizens.

In the digitalizing world, all movements constitute a goal and lead to result-representing data, which feeds meaningful information and knowledge. People in the 21^st Century societies are being digitalized more due to ever-increasing data amounts (Pappas et al. 2018). In fact, the change is so rapid that the amount of data produced from the beginning of humanity until 2003 can now be generated in a matter of minutes (Ienca and Vayena 2020; Pappas et al. 2018) and, for the first time in human history, the volume of digital data has surpassed that of the analog data (Johnson et al. 2017). Considering this hard-to-manage amount of data, organizing societies and the interactions among people has become inevitable (Pappas et al. 2018). Concerning public health, the management of pandemics requires such a systematic approach as well. As to the Covid-19 pandemic, holistic approaches are required, and the easiest way to do so appears to be more digitalization of data-based applications. In this respect, governmental initiatives are a facilitator and mover for citizens to control the virus and to reduce the damage to the country’s economy (Alamo et al. 2020).

Considering all the sudden changes the public has to get used to, it can be seen that the amount of data has rapidly increased indirectly proportional to the pandemic. Countries such as the Unites States, Australia, Korea, Singapore, and Germany use digital technologies to manage the pandemic efficiently (Whitelaw et al. 2020). The pandemic management process (Whitelaw et al. 2020) considers technology for tracking, screening for infection, contact tracing, self-isolation, and clinical care. Apart from this, health data management enables people to maintain and share their data (Dimitrov 2019). Pandemics have shown that we need efficient data management, as the personal data of individuals or patients are constituted differently (Blomberg and Lauer 2020) in terms of the contracted diseases, demographic information, personal habits, health status, and tests. In short, this kind of data consists of a variety of sub-branches, implying that the number of different types of data is enormous and calls for storage in a classified way. Therefore, data management is an imperative considering the amount and variety of data.

Naturally, the total amount of data related to the pandemic – otherwise referred to as “Big Data” in this context—and their type have become unmanageable. The digital world already knows how to handle Big Data in various contexts, but the need for new storage spaces, concepts, applications, and information systems has now emerged as a consequence of the Covid-19 pandemic. The data that cannot be managed is devoid of meaning and sense for health institutions and organizations. Obtained data need to be stored, organized, processed, transferred, and analyzed (Sheng et al. 2021; Tavakoli et al. 2006). These stages should be planned systematically to avoid any complexity and, thus, inaccurate analyses. Against this trend, data warehouses help manage the data systematically for acquisition, processing, and dissemination (Gharaibeh et al. 2017). In other words, a data warehouse is a system intended to provide long-term storage and easy access to improve operational value (Mattingly 2020). Building a pandemic data warehouse as a part of its management gives us the ability to aggregate and view the available daily pandemic data. Considering back-to-front data visualization, such as presenting statistical charts and graphics and accurate analysis, a qualified data warehouse establishes a solid infrastructure for such efforts (Agapito et al. 2020; Sheng et al. 2021). For pandemic data management, which is critical for countries, a well-designed data warehouse makes room for better analyses and improved performance in terms of pandemic management. Moreover, a good data warehouse makes the data integrity phase very important for dynamic pandemic data circulation (Agapito et al. 2020).

Given the importance of the pandemic data management, a warehouse that gathers, associates, and shares the data is vital for useful mining and developing analytical tools. In this manner, the present research is an attempt to develop a pandemic data warehouse for use in time of pandemics. The fact constellation schema model, consisting of seven fact tables and several dimension tables, is used in order to obtain meaningful information regarding medical feature associations through data querying and analysis. This schema provides a versatile format for accessing data from the warehouse using intricate queries. It can also serve as a reference and guide for countries to manage pandemics, leading the way toward further research on faster, less costly, and safer management of health emergencies by governments within a technological framework and for use in a variety of ranges. The remainder of this paper is organized as follows. Section II focuses on the literature review within the scope of the study. Section III addresses the proposed data warehouse design with details of the methodology and sample queries. Finally, Section IV presents concluding remarks and future work directions.

Background

Data warehousing

A data warehouse aims to collect data directly related to a given subject and helps to act upon decision-making. The nature of the subject or issue has immense importance in designing the data warehouse as it determines its various features (Rob and Srubar 2016). Data clearance, easy access to correct data, and system flexibility support better decision-making through analysis. There are two common data warehouse design methodologies in the literature (Breslin 2004). One of them is Inmon (Inmon 2005)’s top-down approach, following a path that allows designers to build individual departmental databases sourced by enterprise data stores (Breslin 2004). Contrary to the top-down approach, Kimball et al. (2008) follows the bottom-up approach, which finds constructing a database for each critical business process beneficial instead of having a single enterprise database (Breslin 2004). The bottom-up approach consists of four phases, namely, select the business process; declare the grain; choose the dimensions; and identify the facts.

As an advantage of the Kimball technique (bottom-up), the data warehouse is transformed into a number of logically self-contained and consistent data marts instead of a large and frequently complex centralized model (Lawyer and Chowdhury 2004). As the data marts are constructed initially, reports may be generated rapidly. Analysts can accommodate a more significant number of data marts here, hence expanding the data warehouse. In addition, the cost and effort required to build this model are minimal (Sen and Sinha 2005). Additionally, ensuring the dimensional perspective of data marts is conformable is a significant task, because lack of it may be a disadvantage (Breslin 2004). An excessive amount of input might slow progress and cause disorganization.

Since data marts are built from the data warehouse, the top-down method provides a uniform dimensional representation of data marts. Additionally, this model is regarded as the most effective paradigm for company change (Milanovic et al. 2009). Large organizations choose this technique as a result. Once committed, the data in the data warehouse is static, read-only, and stored for future reporting. A data warehouse with a top-down architecture may store data from the majority or even all an organization's operating systems, making these data consistent. Since all data marts are imported from a single source, the top-down design process also produces remarkably consistent dimensional representations of data across data marts. Thus, it is relatively easy to generate new dimensional data models from the data contained in the data warehouse (Breslin 2004; Milanovic et al. 2009). Nevertheless, the primary disadvantage of the top-down approach is that it results in a huge project with an extensive scope. Implementing a data warehouse utilizing the top-down paradigm incurs high up-front costs, and it might take a considerable amount of time before end users see early advantages (Milanovic et al. 2009). During implementation, the top-down technique can often be rigid and insensitive to changing departmental demands. The cost, time required for design, and maintenance are all expensive (Milanovic et al. 2009).

A systematic data warehouse consists of fact and dimension tables. Fact tables refer to critical processes and hold the data to be analyzed (Alviana and Kurniawan 2018; Parmanto et al. 2005). These are central tables and store quantitative data. Fact tables, on the other hand, work with dimension tables which use the same data structure with a basic entity-relationship (ER) but have a higher performance (Parmanto et al. 2005). The facts are measures, and the dimensions are the context for dimensional modeling (Ramachandran et al. 2012). Dimension tables contain descriptions and explain the fact tables. Hence, the primary keys of dimension tables are the foreign keys of the central fact tables (Ramachandran et al. 2012).

The combination of the fact tables and dimension tables generates schemas. The star schema, the snowflake schema, and the galaxy or fact constellation schema are created by dimensional data warehouses (Moody and Kortink 2000; Rob and Srubar 2016). The star schema consists of one central fact table and several dimension tables. The shape of the star schema—as the name implies—looks like a star and is simple to use. The snowflake schemas are the branched form of star schemas, and the logic behind them is the same as that of the star schema. When a star schema dimension has sub-dimensions, the model becomes more normalized by the hierarchical design of entities (Moody and Kortink 2000; Ramachandran et al. 2012). It also allows users to execute more queries (Ramachandran et al. 2012). The fact constellation schemas, also known as another type of galaxy schemas, have more than one fact table (Warnars and Randriatoamanana 2017). Distinct from the star and snowflake schemas, the fact constellation/galaxy schemas are implemented for more complex and functional data warehouses (Garani and Adam 2020; Warnars and Randriatoamanana 2017) as they are a combination of two or more star schemas (Saxena and Agarwal 2014); henceforth, more flexible and agile.

Related works

Collecting, processing, and sharing health data in a pandemic environment is challenging. Data are vast and dynamic, in a word, complex. Several studies have explained the importance and benefits of building data warehouses for such purposes, and they have mentioned certain applications as well. Complex medical data are time-consuming, inclined to error-making, and defective. Quick and correct access to properly stored data, on the other hand, provides for improved data quality and cost reduction (Roelofs et al. 2013). In a pandemic environment, data are required from different resources. Roelofs et al. (Roelofs et al. 2013) state that combining the tools and different data sources in a data warehouse also improves the data quality and makes the collection times effective.

Garani and Adam (Garani and Adam 2020) develop a data warehouse to improve the efficiency of nursing activities. The main metrics of the design are data of different sizes collected from multiple sources. In that attempt, the data defined as raw are converted into valuable information that can be utilized for decision-making purposes in many applications, not to mention its benefits and importance for resource management.

With the Covid-19 on the rise, the intensive care units worldwide also faced an unexpected burden. With the increasing amount of patient information, the need for research to keep abreast of all the latest developments of Covid-19, describe the potential treatment strategies, and plot a route for resource utilization all came to the fore and became evident more than ever (Fleuren et al. 2021). It is obvious that developing an effective data warehouse should include data from as many sources as possible, but it is challenging. Time has to be spent standardizing the datasets that make up such a warehouse. In a poorly designed relational database with tens of thousands of records and multiple lab measurements per record, data queries can take days instead of seconds, contrary to expectations. For this reason, building an open data warehouse for Covid-19 is crucial, and it is worth the effort to be ready for future pandemics (Whitelaw et al. 2020). It has to be remembered, though, that multicenter data about patients are more significant than single-center data for the pandemic process (Fleuren et al. 2021).

In Fleuren et al. (Fleuren et al. 2021)'s study, a data-sharing collaboration with the Dutch Data Warehouse (DDW), a multicenter database, is conducted in the Netherlands. In this study, more than 200 million data concerning 3463 patients' demographics, clinical observations, medications, laboratory findings, and life support devices are added to the DDW. The built data warehouse is open to clinicians and researchers within certain ethical and legal limits. This study encourages researchers to share the electronic health record (EHR) data to advance the field of medical data science (Fleuren et al. 2021).

Managing the pandemic data also implies managing the pandemic itself since they include medical, biological, demographical, and social information (Agapito et al. 2020). Building data warehouses is one of the most well-known technologies used to process and analyze structured data (Salem et al. 2020). Agapito et al. (2020) conduct a study with data from Italy's Lombardia and Puglia regions. They develop a Covid-19 data warehouse called “Covid-Warehouse” which allows for data collection, harmonization, and integration issues of Covid-19. It models, integrates, and stores the Covid-19 data provided by the Italian Protezione Civile Department and several pollutions and climate data provided for different regions in Italy. The decision-making authorities in charge of public affairs can also utilize this data warehouse to take action in order to reduce pollution and climate conditions for public health (Agapito et al. 2020).

Proposed data warehouse design

The pandemic environment requires data to be followed up systematically. Reaching the statistics of real-time, daily, monthly, and annual data by the government is the key to managing the pandemic process. The amount of country-based pandemic data is vast to analyze. At this point, a data warehouse can be utilized to make them meaningful.

Kimball et al. (Kimball et al. 2008)’s data warehouse development methodology is taken into account to develop a warehouse in this study. Considering the complexities associated with the Covid-19 pandemic, the bottom-up approach and its steps are preferable. Three of the steps of the bottom-up approach are employed for pandemic data. Initially, designing the data warehouse begins with identifying the critical issues of the pandemic, such as contact with healthy individuals by patients, testing and vaccination processes, disease detection, and quarantine. After that, the dimensions and their associated roles are determined. In the third step, the facts are discussed (contact locations of patients, vaccination, disease tests, quarantining process, symptoms, and medication), and fact tables are composed.

Step 1: Identify the key issues
Step 2: Determine the dimensions and their associated roles
Step 3: Compose the facts and fact tables

The pandemic data warehouse schema resulting from completing the steps is a multi-dimensional fact constellation schema. In total, there are seven fact tables for the proposed pandemic data warehouse. The fact tables have several dimension tables which, in turn, have sub-dimensions as well. This multi-dimensional system represents a fact constellation schema that can overcome the complexity of the system. Figure 1 is the fact constellation schema of the pandemic data warehouse.

Fig. 1 — The fact constellation schema of the pandemic data warehouse

The Fact Constellation Schema is one in which the key processes represented by fact tables are associated, and which also shows the sub-dimensions of the main dimension tables. This schema allows more data to be stored about the pandemic or disease in question and, thus, more detailed queries. The validity and reliability of the analyses concerning a pandemic or disease are considerably crucial for governments. Briefly put, the more multidimensional and flexible the data warehouse, the greater the utility of analytics. Furthermore, several different query examples are given, depending on the date, city, person, etc., to prove the usefulness of the fact constellation schema and that it is a valuable warehouse design to be used during pandemics.

Fact tables

The fact tables refer to the key processes followed up during pandemics in a country. Accordingly, the proposed data warehouse design consists of seven fact tables as Fact_CitizenDiseaseTests, Fact_CitizenQuarantine, Fact_CitizenVaccines, Fact_CitizenVenues, Fact_Patient, Fact_PatientDrug and Fact_PatientSignSymptom. The fact table entities are the integer identifications of the dimension tables. The fact tables are fed from the dimension tables’ primary keys, and the primary keys of the dimension tables are the foreign keys of fact tables. In this way, the Fact_CitizenDiseaseTests table references to the Dim_Citizen, Dim_DiseaseTest, Dim_Time, and Dim_HealthUnit tables (Fig. 2). The Fact_CitizenQuarantine table’s entities come from Dim_Citizen, Dim_Time and Dim_Quarantine tables (Fig. 3). The Fact_CitizenVaccines table consists of the foreign keys of the Dim_Citizen, Dim_Time, Dim_Vaccine and Dim_HealthUnit tables (Fig. 4). The Fact_CitizenVenues table references to the Dim_Citizen, Dim_Time and Dim_Venue tables (Fig. 5). The Fact_Patient table’s entities come from the Dim_Citizen, Dim_Time, Dim_PatientStatus and Dim_DiseaseVariant tables (Fig. 6). The Fact_PatientDrug table consists of the Dim_Citizen, Dim_Time and Dim_Drug tables (Fig. 7). Lastly, the Fact_PatientSignSymptom table’s entities come from the Dim_Citizen, Dim_Time and Dim_SignSypmtom tables (Fig. 8). The Fact Constellation Schema of the Pandemic Data Warehouse is represented in Fig. 8.

Fig. 2 — The Fact_CitizenDiseaseTests table and dimensions

Fig. 3 — The Fact_CitizenQuarantine table and dimensions

Fig. 4 — The Fact_CitizenVaccines table and dimensions

Fig. 5 — The Fact_CitizenVenues table and dimensions

Fig. 6 — The Fact_Patient table and dimensions

Fig. 7 — The Fact_PatientDrug table and dimensions

Fig. 8 — The Fact_PatientSignSymptom table and dimensions

Disease test fact table

The central focus of the Fact_CitizenDiseaseTests table is identifying the individuals’ testing for disease; it depicts which patient has which test, when, and in which health unit. When examining the keys of this table, “citizenId” represents the individual’s personal information. “diseaseTestId” identifies the type/name of the test that citizens took to have the information about their health status during the pandemic. To have the date information about disease tests, the attribute “dateId” is used; whereas “healthUnitId” indicates the unit where the citizen had the service carried out. The Fact_CitizenDiseaseTests table allows a country or city-based statistical analysis of the total number of tests for a certain period of time.

Quarantine fact table

The Fact_CitizenQuarantine table focuses on the quarantine period required for citizens or patients; it indicates who is in which quarantine and when, and is powered by citizen data, time data, and quarantine data. The quarantine types differ according to the number of isolation days and the type of disease. To clarify, “citizenId” represents the individual’s personal information. “quarantineId” identifies the quarantine type/name and its duration according to the disease type. “startDateId” and “endDateId” refer to the start and end dates of quarantine. Thus, the Fact_CitizenQuarantine table allows for the analyses of individuals who are in quarantine or who have completed their quarantine according to dates and durations.

Vaccination fact table

The Fact_CitizenVaccines table represents the vaccination process against a disease or pandemic; it explains which individual has how many doses of which vaccine, in which health unit, and when. As one of the foreign keys of the table, “citizenId” represents the individual’s personal information. “vaccineId” refers to the type of vaccine, and “healthUnitId” represents the unit where citizens are vaccinated in. Vaccination date and time data are related to “dateId”. The Fact_CitizenVaccines table provides statistics on vaccination rates according to date by country and city. In this way, it is also possible to identify the vaccinated and unvaccinated individuals upon further easy analysis.

Venue fact table

The Fact_CitizenVenues table contains the location information of people in the same environment as disease-positive cases and/or with a high potential to have the disease. As in other tables, “citizenId” represents the individual’s personal information. “venueId” indicates the last location of the individual, and “dateId” shows the latest date they were there. The Fact_CitizenVenues table is utilized for contact tracing and allows the user to analyze the contact individual’s identification.

Patient fact table

The Fact_Patient table permits one to track the citizens infected by the disease; it shows the disease detection dates of individuals by “dateId”. “patientId” is equivalent to citizenId, which represents the personal information about the citizens in point. To learn about the variant of the disease, the key “diseaseVariantId” can be utilized. The course of disease information is provided by “patientStatusId”. The Fact_Patient table allows users to analyze the spread of various variants of the disease. It is also possible to quantify the current status of patients (new case, critical, recovered, death) based on time.

Medication fact table

The Fact_PatientDrug table makes it possible to store the records of drugs used by disease-positive cases; it represents which drug(s) (drugId) a patient is using and for how long. “patientId” equals to citizenId, which represents the personal information about citizens as in the Fact_Patient table. “drugId” is the primary key of the table, which stores the data about drug’s name, and the type of disease that the drug affects. Fact_PatientDrug table can be taken as a reference in the analysis of the frequency of drug use depending on time and drug types by disease.

Sign/symptom fact table

The Fact_PatientSignSymptom table indicates the signs or symptoms associated with a disease with reference to each individual and within specified time ranges. “patientId” refers to citizenId, which represents the personal information about citizens. “startDateId” and “endDateId” point out the date the symptom began and ended, respectively. Additionally, to reach the name of the sign or symptom, the key “signSymptomId” is utilized, allowing for the analysis of the frequency and duration of sign/symptoms according to the course of the disease.

Dimension tables

The dimension tables, which include all the specific records about the disease, are the main sources of information with reference to the critical processes carried out during the pandemic (fact tables). These tables contain primary keys that connect with fact tables; they are also more detailed than the fact tables. The dimension tables in this study are as follows:

Citizen

The Dim_Citizen dimension table stores the personal data about all citizens who are sick or not during the pandemic process. “citizenId” is the primary key/identifier of this table. Therefore, the Dim_Citizen table is related to many pandemic processes that have to include citizen data. The table stores the name, surname, gender, date of birth, weight, height, phone number, address, and district information of each individual. “districtId” is the foreign key of this table for situations that require analyzing the cases on the basis of the city districts. The other dimensions and fact tables can reach all these personal data by means of “citizenId” as a foreign key.

The Dim_Citizen table branches out to Dim_CitizenLivingHabits, Dim_CitizenComorbidities, Dim_Contact, and Dim_CitizenDrug. The Dim_CitizenLivingHabits table represents the individuals’ specific unhealthy habits such as smoking, malnutrition, and excessive drinking; it has two foreign keys, citizenId and livingHabitId, which come from the Dim_LivingHabits table that stores the name of those habits and its primary key. The Dim_CitizenComorbidities tables are utilized to store information regarding citizen comorbidities. This table also has a foreign key as comorbidityId, which comes from Dim_Comorbidities, excluding citizenId. The names of the comorbidities are stored in the Dim_Comobidities table. Another branch of the Dim_Citizen table is Dim_Contact, which represents the sick people having been in contact with healthy people, or vice versa. The table includes citizenId as a foreign key, contactId, which is also a foreign key and refers to citizenId and the level of direct contact of people. The last table linked to Dim_Citizen is Dim_CitizenDrug table. This table has citizenId and drugId, which comes from the Dim_Drug table and means the medication and its dosage related to the pandemic, or not related to the pandemic but used by the person routinely.

Health unit

The Dim_HealthUnit dimension table depicts the place where vaccination and testing services are taken. Hence, its attributes are the health unit name, the unit’s address, the district it belongs to, and “healthunitId” as the primary key. Accordingly, it has three branches and is linked to FactCitizenVaccine, Fact_CitizenDiseaseTest, and Dim_District tables.

District

The Dim_District table provides data for the Dim_HealthUnit and Dim_Citizen tables by its primary key, “districtId”. Distinctly, it is fed from the Dim_City table by the foreign key, cityId, which is the primary key of the Dim_City table. It is essential to determine the city to which the district is affiliated due to the importance of the Dim_District table for gathering the statistics on a district, city, or country basis.

Vaccine

The Dim_Vaccine table stores the data about vaccination and feeds the Fact_CitizenVaccine table. It has the name of each vaccine and the primary key, “vaccineId”, per vaccine. The Dim_Vaccine table allows users to obtain statistics about the number, type, and date of vaccinations, health units where they take place, and vaccinated citizens as per the Fact_CitizenVaccine table. The Dim_Vaccine table also helps analyze the distribution of these activities in terms of location.

Disease test

The data about the disease tests are stored in the Dim_Disease dimension table, which covers all types of pandemic tests. Therefore, the data provided to the Fact_CitizenDiseaseTests table can be analyzed from the aspects of disease tests that citizens have had in different health units where the tests take place, and the test dates.

Venue

The venue information appears in the Dim_Venue table with the primary key “venueId”. There are location names and types in this table. Determining contact with the people at risk of being disease-positive is crucial in preventing the spread of the disease and taking precautions. The Dim_Venue table provides data to the Fact_CitizenVenues table, accommodating for numerical analyses such as the number of sick people in a particular region and the number of people at high risk of contracting the disease; thus, playing a vital role in determining the risk-prone zones.

Disease variant

The Dim_DiseaseVariant table depicts the variants of a pandemic disease which may be countless and whose treatments can differ according to the variant types. The table contains the variant names and disease data by diseaseId, which comes from the Dim_Disease table. diseaseId is the primary key of the Dim_Disease dimension table, enabling it to connect with the Dim_DiseaseVariant dimension table. With the relationship between these two-dimensional tables, numerical analyses such as the number of variants of diseases and the distribution of the variants among the citizens can be made.

Patient status

The patient status shows the stage of the disease in a person. For example, he or she may have never contracted the disease, may have just contracted a new variant, or may have passed away. Since it is a parameter directly related to the patient, it is linked to the Fact_Patient table with the primary key, patientStatusId. The Dim_PatientStatus table allows users to determine how many people are in which condition.

Signs/symptoms

The Dim_SignSymptom dimension table stores the signs or symptoms of the disease. Obviously, symptoms vary greatly; yet those that emerge from the beginning of the outbreak or contracting should be followed up rigorously until they disappear. For this reason, the start and end dates of the symptoms can be followed with the Dim_SignSymptom table, which is connected to the Fact_PatientSignSymptom table by signSymptomId.

Drug

The Dim_Drug table stores the information related to all kinds of medication, even if it is irrelevant to the disease. For example, there might be items that people take routinely and not for the disease in question. Doctors should have enough information about such medication while prescribing new ones for any new disease. Thus, the Dim_Drug dimension table is related to both the Dim_CitizenDrug dimension table and the Fact_PatientDrug fact table. It provides data for them by drugId. The Dim_Drug table is utilized when keeping statistics of the medication used against a given disease.

Quarantine

Actual patients and those in contact with the infected ones have to be in quarantine if they have the disease or are at risk. The types of quarantine can differ according to the disease or its variants. Thus, the Dim_Quarantine dimension table stores the number of days according to the type of quarantine. In addition, it provides data to the Fact_CitizenQuarantine table. Analyses such as the number of citizens in quarantine and its degree of effectiveness in reducing the number of cases can be made with these tables.

Time

The Dim_Time dimension table serves many tables, as the pandemic period is a time-spanned process where each parameter is time-dependent. The timetable provides data directly to 10 different tables and takes data from 4 separate tables. These are Dim_DayOfWeek, Dim_Week, Dim_Month, and Dim_Year. The Dim_Time table provides data to tables that need start–end dates.

Queries

In the proposed multidimensional data warehouse, each datum to be reported is stored in different tables, and the number of tables is relatively high. Therefore, queries are made to gather the related data in different tables. Such queries are multifunctional and can answer basic questions, perform calculations, combine data from different tables, and add, modify, or delete data. In the pandemic data warehouse, many queries can be made as a result of combinations of tables. For example, new patients can be added to the health system; disease processes can be followed, case analyses can be made on a city basis, numerical data can be obtained about disease tests and vaccines, and so on.

Seven different queries are conducted for the proposed pandemic data warehouse. Each query has a different purpose, and users can create queries based on the results they wish to obtain. Queries, which are also very useful in decision-making processes, present concrete, and meaningful data to the end-user.

Query-1

Question: What is the number of new cases in a certain city and on a specific date?

Purpose: The purpose of this query is to extract the number of new cases in a city on a given date from the data warehouse. The query is carried out to obtain data related to the case numbers of statistics and to decide on new measures to be taken. Knowing the number of new cases on a given date contributes to the management of the pandemic process by comparing the number of cases on previous dates. It also helps predict risks by cities. In this case, the date is 13.10.2021, and the city is Izmir.

Query-2

Question: What is the number of different vaccinations administered in a certain city and on a specific date?

Purpose: This query indicates the distribution of different types of vaccines administered in a city. It is a useful query to understand which type of vaccine is preferred by patients. At the same time, the data on the number of vaccinations made in a day can also be used for vaccine inventory analysis. In this way, the number of doses that the government should provide can be predicted. Once again, the date for the query is 13.10.2021 and the city is Izmir.

graphic file with name 42488_2022_82_Figb_HTML.jpg

Query-3

Question: What is the total number of negative results from the viral tests conducted in a certain city and on a given date?

Purpose: Since the use of viral tests is the most common way to determine whether someone has the diseases, it is decisive for knowing the number of cases. However, negative test results are just as important as positive ones for pandemic management. With this query, the required data can be easily retrieved. The query asks for the number of negative results of the viral tests. For example, the test type is determined as viral, the city is Izmir, and the date is 13.10.2021.

Query-4

Question: How many people are in different quarantines in a certain city and on a certain date?

Purpose: This query is run to obtain the quarantine statistics. The query counts citizens, adhering to city, date, and quarantine type constraints. In this way, the number of individuals in quarantine is obtained. Location and time-based analyses can be made. For instance, the date for the query is 13.10.2021 and the city is Izmir.

Query-5

Question: Who are the individuals in a certain venue on a specific date?

Purpose: The purpose of this query is to identify those who are in the same environment with disease-positive individuals or those with high risk of having the disease. The query prints the names of individuals in a venue based on the location and date constraints. In this case, the date is determined as 13.10.2021, and the venueType is determined as venueType1 for the query.

graphic file with name 42488_2022_82_Fige_HTML.jpg

Query-6

Question: Which individuals have taken a certain medication within a given date range?

Purpose: This query determines the frequency of using medication. The data are collected concerning those who take a certain medication to treat the disease in question. It shows who uses what medication, both by name.

graphic file with name 42488_2022_82_Figf_HTML.jpg

Query-7

Question: What are the symptoms and their frequency in a certain date range?

Purpose: This query specifies the frequency of symptoms as one of the parameters that gives some of the most detrimental information about the course of the disease. Having symptom information that occurs within a given date range also helps in making sense of the progression a disease is taking. The query provides the symptom names and the number of cases showing them.

Conclusion

Pandemics are hard-to-manage processes that affect the entire world. Since their emergence, there have been significant rise in the amount of data related to healthy and sick citizens. The main factor that complicates the management of a pandemic is data volume, which simply means complexity. The proposed data warehouse design in this study is a major solution to this impediment. This design, which enables the appropriate use of data and accurate analysis, is especially beneficial for countrywide applications. The data flow is high-speed during pandemics and, as such, governments should make quick decisions to protect the public. An advanced and multi-dimensional data warehouse 0is an urgent and fundamental need. This study is carried out to meet this requirement to manage pandemics and similar other health-related processes.

The proposed data warehouse contains all the process data of the Covid-19 pandemic as a case study, from the last location of the contacts to the medication used. The warehouse enables users to make informed decisions. What is more, the implementation of the decision support system can enhance pandemic research significantly. This data warehouse also provides a foundation for carrying out predictive analytics on relational, temporal, and/or spatial data (medical encounter, hospital admission, social media, mobile apps, etc.) and for utilizing the resulting insights to monitor and analyze a pandemic or disease. With this warehouse, it is possible to access data related to the number of cases throughout the city and the country, test results, medications, and quarantine processes. At the same time, the data can be stored related to those who are not sick but have chronic diseases, routinely take certain medication, or have special conditions that need attention during the pandemic. In this way, the proposed system helps in maintain public health to a great extent. The data warehouse is also suitable for many other types of queries and users can access accurate and complete data by creating queries in many other dimensions regarding the pandemic. Governments can increase and improve the necessary measures within the scope of a pandemic or choose to start normalization processes accordingly.

This study provides a detailed example of constructing a multi-dimensional data warehouse and the fact constellation schema. The data warehouse, which was created for the Covid-19 pandemic as one of the most significant health concerns today across the world, can be enhanced in subsequent studies. With the discovery of new vaccines and the emergence of new variants, the number of data added to the warehouse is increasing daily. New tables can be added, and detailed queries can be made for advanced analysis. The proposed data warehouse can even be adjusted for use in the form of a personal mobile application or expanded to be applied for other health-related initiatives.

There is no doubt that this research has the potential for helping the developers in designing and building a data warehouse for health pandemics, however it also brings a big concern regarding individual privacy and security. In addition to standard clinical data such as common clinical symptoms and test findings, such a data warehouse also contains personal data such as name, address, phone number, and medical record number. If these confidential and sensitive patient data are disclosed to the public or malevolent individuals, it might severely affect the patients. Considering the above-mentioned concerns, it is essential to integrate certain mechanisms and practices for protecting privacy of individuals during the implementation of a health data warehouse. For this purpose, our proposed data warehouse is designed to enforce access control which limit users’ access to confidential and sensitive data. In other words, only users having authorization can access data. In addition to this, innovative privacy-aware data analysis procedures together with the cryptographic key management system should be employed to ensure the privacy of individuals in the implementation of such a health data warehouse.

Data Availability

All relevant data and material are presented in the main paper.

Declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Agapito G, Zucco C, Cannataro M (2020) COVID-warehouse: A data warehouse of Italian COVID-19, pollution, and climate data. Int J Environ Res Public Health 17(15):1–22. 10.3390/ijerph17155596 [DOI] [PMC free article] [PubMed]
Alamo T, Reina DG, Millán P (2020) Data-driven methods to monitor, model, forecast and control covid-19 pandemic: Leveraging data science, epidemiology and control theory, pp 1–65. arXiv preprint arXiv:2006.01731
Alviana S, Kurniawan B. Design and analysis on data warehouse of personnel adMinistration system using time series algorithm. IOP Conf Ser Mater Sci Eng. 2018;407:1. doi: 10.1088/1757-899X/407/1/012092. [DOI] [Google Scholar]
Blomberg N, Lauer KB. Connecting data, tools and people across Europe: ELIXIR’s response to the COVID-19 pandemic. Eur J Hum Genet. 2020;28(6):719–723. doi: 10.1038/s41431-020-0637-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Breslin M (2004) Data warehousing battle of the giants: comparing the basics of the and inmon models. Bus Intell J 7:6–20
Dimitrov DV. Blockchain applications for healthcare data management. Healthc Inform Res. 2019;25(1):51–56. doi: 10.4258/hir.2019.25.1.51. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fleuren LM, et al. The Dutch Data Warehouse, a multicenter and full-admission electronic health records database for critically ill COVID-19 patients. Crit Care. 2021;25(1):1–12. doi: 10.1186/s13054-021-03733-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Garani G, Adam GK. A semantic trajectory data warehouse for improving nursing productivity. Heal Inf Sci Syst. 2020;8(1):1–13. doi: 10.1007/s13755-020-00117-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gharaibeh A, Salahuddin MA, Hussini SJ, Khreishah A, Khalil I, Guizani M, Al-Fuqaha A (2017) Smart cities: a survey on data management, security, and enabling technologies. IEEE Commun Surv Tutorials 19(4):2456–2501. 10.1109/COMST.2017.2736886
Ienca M, Vayena E. On the responsible use of digital data to tackle the COVID-19 pandemic. Nat Med. 2020;26(4):458. doi: 10.1038/s41591-020-0823-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
Inmon WH (2005) Building the data warehouse. John Wiley & Sons, New York
Johnson J, Denning P, Sousa-Rodrigues D, Delic KA (2017) Big data, digitization, and social change: big data (Ubiquity symposium). In: Ubiquity, pp 1–8
Kimball R, Ross M, Thornthwaite W, Mundy J, Becker B (2008) The data warehouse lifecycle toolkit. John Wiley & Sons, New York
Lawyer J, Chowdhury S. Best practices in Data Warehousing to support business initiatives and needs. Proc Hawaii Int Conf Syst Sci. 2004;37:3499–3507. doi: 10.1109/hicss.2004.1265515. [DOI] [Google Scholar]
Mattingly W. Considerations for a COVID-19 research data warehouse in the time of COVID. J Respir Infect. 2020;4(1):1–3. doi: 10.18297/jri/vol4/iss1/64. [DOI] [Google Scholar]
Milanovic N, Soskic G, Petkovic A (2009) Data warehouse design for croatian students’ nourishment information system. Proc Int Conf Inf Technol Interfaces ITI:193–198. 10.1109/ITI.2009.5196078
Moody D, Kortink MA. From enterprise models to dimensional models: A methodology for data warehouse and data mart design. Proc Int Work Des Manag Data Warehouses. 2000;2000:1–12. [Google Scholar]
Pappas IO, Mikalef P, Giannakos MN, Krogstie J, Lekakos G (2018) Big data and business analytics ecosystems: paving the way towards digital transformation and sustainable societies. Inf Syst E-Bus Manag 16(3):479–491. 10.1007/s10257-018-0377-z
Parmanto B, Scotch M, Ahmad S (2005) A framework for designing a healthcare outcome data warehouse. Perspect Heal Inf Manag 2:3 [PMC free article] [PubMed]
Ramachandran S, Rajeswari S, Murty SS (2012) Dimensional modeling of Indian materials database. Int J Comput Appl 37(7):1–8. 10.5120/4617-4834
Rob MA, Srubar FJ. Information gems from criminal mines: A data warehouse case study focusing on big-city criminal activity. Transform Gov People Process Policy. 2016;10(2):297–314. doi: 10.1108/TG-03-2015-0016. [DOI] [Google Scholar]
Roelofs E, Persoon L, Nijsten S, Wiessler W, Dekker A, Lambin P (2013) Benefits of a clinical data warehouse with data mining tools to collect data for a radiotherapy trial. Radiother Oncol 108(1):174–179. 10.1016/j.radonc.2012.09.019 [DOI] [PMC free article] [PubMed]
Ros F, Kush R, Friedman C, Gil Zorzo E, Rivero Corte P, Rubin JC, ... Van Houweling D (2021) Addressing the Covid-19 pandemic and future public health challenges through global collaboration and a data-driven systems approach. Learn Heal Syst 5(1):1–12. 10.1002/lrh2.10253 [DOI] [PMC free article] [PubMed]
Salem SB, Naouali S, Chtourou Z (2020) Scoring a data warehouse model for homeland security applications
Saxena G, Agarwal BB. Data warehouse designing: dimensional modelling and E-R \nModelling. Int J Eng Invent. 2014;3(9):28–34. [Google Scholar]
Sen A, Sinha AP. A comparison of data warehousing methodologies. Commun ACM. 2005;48(3):79–84. doi: 10.1145/1047671.1047673. [DOI] [Google Scholar]
Sheng J, Amankwah‐Amoah J, Khan Z, Wang X (2021) COVID-19 pandemic in the new Era of big data analytics: methodological innovations and future research directions. Br J Manag 32(4):1164–1183. 10.1111/1467-8551.12441
Tavakoli AS, Jackson K, Moneyham L, Phillips KD, Murdaugh C, Meding G (2006) Data management plans: stages, components, and activities. Appl Appl Math 1(2):141–151
Warnars HLHS, Randriatoamanana R. Datawarehouser: A data warehouse artist who have ability to understand data warehouse schema pictures. IEEE Reg 10 Annu Int Conf Proceedings/TENCON. 2017;0:2205–2208. doi: 10.1109/TENCON.2016.7848419. [DOI] [Google Scholar]
Whitelaw S, Mamas MA, Topol E, Van Spall HG (2020) Applications of digital technology in COVID-19 pandemic planning and response. Lancet Digit Heal 2(8):e435–e440. 10.1016/S2589-7500(20)30142-4 [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

All relevant data and material are presented in the main paper.

[CR1] Agapito G, Zucco C, Cannataro M (2020) COVID-warehouse: A data warehouse of Italian COVID-19, pollution, and climate data. Int J Environ Res Public Health 17(15):1–22. 10.3390/ijerph17155596 [DOI] [PMC free article] [PubMed]

[CR2] Alamo T, Reina DG, Millán P (2020) Data-driven methods to monitor, model, forecast and control covid-19 pandemic: Leveraging data science, epidemiology and control theory, pp 1–65. arXiv preprint arXiv:2006.01731

[CR3] Alviana S, Kurniawan B. Design and analysis on data warehouse of personnel adMinistration system using time series algorithm. IOP Conf Ser Mater Sci Eng. 2018;407:1. doi: 10.1088/1757-899X/407/1/012092. [DOI] [Google Scholar]

[CR4] Blomberg N, Lauer KB. Connecting data, tools and people across Europe: ELIXIR’s response to the COVID-19 pandemic. Eur J Hum Genet. 2020;28(6):719–723. doi: 10.1038/s41431-020-0637-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] Breslin M (2004) Data warehousing battle of the giants: comparing the basics of the and inmon models. Bus Intell J 7:6–20

[CR6] Dimitrov DV. Blockchain applications for healthcare data management. Healthc Inform Res. 2019;25(1):51–56. doi: 10.4258/hir.2019.25.1.51. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] Fleuren LM, et al. The Dutch Data Warehouse, a multicenter and full-admission electronic health records database for critically ill COVID-19 patients. Crit Care. 2021;25(1):1–12. doi: 10.1186/s13054-021-03733-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] Garani G, Adam GK. A semantic trajectory data warehouse for improving nursing productivity. Heal Inf Sci Syst. 2020;8(1):1–13. doi: 10.1007/s13755-020-00117-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] Gharaibeh A, Salahuddin MA, Hussini SJ, Khreishah A, Khalil I, Guizani M, Al-Fuqaha A (2017) Smart cities: a survey on data management, security, and enabling technologies. IEEE Commun Surv Tutorials 19(4):2456–2501. 10.1109/COMST.2017.2736886

[CR10] Ienca M, Vayena E. On the responsible use of digital data to tackle the COVID-19 pandemic. Nat Med. 2020;26(4):458. doi: 10.1038/s41591-020-0823-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] Inmon WH (2005) Building the data warehouse. John Wiley & Sons, New York

[CR12] Johnson J, Denning P, Sousa-Rodrigues D, Delic KA (2017) Big data, digitization, and social change: big data (Ubiquity symposium). In: Ubiquity, pp 1–8

[CR13] Kimball R, Ross M, Thornthwaite W, Mundy J, Becker B (2008) The data warehouse lifecycle toolkit. John Wiley & Sons, New York

[CR14] Lawyer J, Chowdhury S. Best practices in Data Warehousing to support business initiatives and needs. Proc Hawaii Int Conf Syst Sci. 2004;37:3499–3507. doi: 10.1109/hicss.2004.1265515. [DOI] [Google Scholar]

[CR15] Mattingly W. Considerations for a COVID-19 research data warehouse in the time of COVID. J Respir Infect. 2020;4(1):1–3. doi: 10.18297/jri/vol4/iss1/64. [DOI] [Google Scholar]

[CR16] Milanovic N, Soskic G, Petkovic A (2009) Data warehouse design for croatian students’ nourishment information system. Proc Int Conf Inf Technol Interfaces ITI:193–198. 10.1109/ITI.2009.5196078

[CR17] Moody D, Kortink MA. From enterprise models to dimensional models: A methodology for data warehouse and data mart design. Proc Int Work Des Manag Data Warehouses. 2000;2000:1–12. [Google Scholar]

[CR18] Pappas IO, Mikalef P, Giannakos MN, Krogstie J, Lekakos G (2018) Big data and business analytics ecosystems: paving the way towards digital transformation and sustainable societies. Inf Syst E-Bus Manag 16(3):479–491. 10.1007/s10257-018-0377-z

[CR19] Parmanto B, Scotch M, Ahmad S (2005) A framework for designing a healthcare outcome data warehouse. Perspect Heal Inf Manag 2:3 [PMC free article] [PubMed]

[CR20] Ramachandran S, Rajeswari S, Murty SS (2012) Dimensional modeling of Indian materials database. Int J Comput Appl 37(7):1–8. 10.5120/4617-4834

[CR21] Rob MA, Srubar FJ. Information gems from criminal mines: A data warehouse case study focusing on big-city criminal activity. Transform Gov People Process Policy. 2016;10(2):297–314. doi: 10.1108/TG-03-2015-0016. [DOI] [Google Scholar]

[CR22] Roelofs E, Persoon L, Nijsten S, Wiessler W, Dekker A, Lambin P (2013) Benefits of a clinical data warehouse with data mining tools to collect data for a radiotherapy trial. Radiother Oncol 108(1):174–179. 10.1016/j.radonc.2012.09.019 [DOI] [PMC free article] [PubMed]

[CR23] Ros F, Kush R, Friedman C, Gil Zorzo E, Rivero Corte P, Rubin JC, ... Van Houweling D (2021) Addressing the Covid-19 pandemic and future public health challenges through global collaboration and a data-driven systems approach. Learn Heal Syst 5(1):1–12. 10.1002/lrh2.10253 [DOI] [PMC free article] [PubMed]

[CR24] Salem SB, Naouali S, Chtourou Z (2020) Scoring a data warehouse model for homeland security applications

[CR25] Saxena G, Agarwal BB. Data warehouse designing: dimensional modelling and E-R \nModelling. Int J Eng Invent. 2014;3(9):28–34. [Google Scholar]

[CR26] Sen A, Sinha AP. A comparison of data warehousing methodologies. Commun ACM. 2005;48(3):79–84. doi: 10.1145/1047671.1047673. [DOI] [Google Scholar]

[CR27] Sheng J, Amankwah‐Amoah J, Khan Z, Wang X (2021) COVID-19 pandemic in the new Era of big data analytics: methodological innovations and future research directions. Br J Manag 32(4):1164–1183. 10.1111/1467-8551.12441

[CR28] Tavakoli AS, Jackson K, Moneyham L, Phillips KD, Murdaugh C, Meding G (2006) Data management plans: stages, components, and activities. Appl Appl Math 1(2):141–151

[CR29] Warnars HLHS, Randriatoamanana R. Datawarehouser: A data warehouse artist who have ability to understand data warehouse schema pictures. IEEE Reg 10 Annu Int Conf Proceedings/TENCON. 2017;0:2205–2208. doi: 10.1109/TENCON.2016.7848419. [DOI] [Google Scholar]

[CR30] Whitelaw S, Mamas MA, Topol E, Van Spall HG (2020) Applications of digital technology in COVID-19 pandemic planning and response. Lancet Digit Heal 2(8):e435–e440. 10.1016/S2589-7500(20)30142-4 [DOI] [PMC free article] [PubMed]

PERMALINK

A multidimensional data warehouse design to combat the health pandemics

Gizem Turcan

Serhat Peker

Abstract

Introduction

Background

Data warehousing

Related works

Proposed data warehouse design

Fig. 1.

Fact tables

Fig. 2.

Fig. 3.

Fig. 4.

Fig. 5.

Fig. 6.

Fig. 7.

Fig. 8.

Disease test fact table

Quarantine fact table

Vaccination fact table

Venue fact table

Patient fact table

Medication fact table

Sign/symptom fact table

Dimension tables

Citizen

Health unit

District

Vaccine

Disease test

Venue

Disease variant

Patient status

Signs/symptoms

Drug

Quarantine

Time

Queries

Query-1

Query-2

Query-3

Query-4

Query-5

Query-6

Query-7

Conclusion

Data Availability

Declarations

Competing interests

Footnotes

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases