Briefings in Bioinformatics. 2017 Aug 7;20(1):156–167. doi: 10.1093/bib/bbx086

Big data management challenges in health research—a literature review

Xiaoming Wang 1, Carolyn Williams 1, Zhen Hua Liu 2, Joe Croghan 1
PMCID: PMC6488939  PMID: 28968677

Abstract

Big data management for information centralization (i.e. making data of interest findable) and integration (i.e. making related data connectable) in health research is a defining challenge in biomedical informatics. While such management is essential to creating a foundation for knowledge discovery, optimized solutions for delivering high-quality and easy-to-use information resources have not been thoroughly explored. In this review, we identify the gaps between current data management approaches and the need for new capacity to manage the big data generated in advanced health research. Focusing on these unmet needs and well-recognized problems, we introduce state-of-the-art concepts, approaches and technologies for data management from computing academia and industry to explore improvement solutions. We explain the potential and significance of these advances for biomedical informatics. In addition, we discuss specific issues that have a great impact on technical solutions for developing the next generation of digital products (tools and data) to facilitate the raw-data-to-knowledge process in health research.

Keywords: big data management, system performance, data quality, machine learning, SQL and NoSQL

Introduction

Big data management is a critical challenge across health research disciplines. Data from clinical studies, omics research and observations of individuals’ lifestyle and environmental exposure are all of importance in advanced health research [1, 2]. This data reality has raised pressing demands for an enhanced data management capacity beyond what traditional approaches can deliver [3, 4]. Big data originating from disparate sources are inherently heterogeneous and disconnected. Knowledge extraction from these data relies on a raw-data-to-information transformation process that selects, cleanses, aligns, annotates and organizes data to deliver a comprehensive and reusable information resource that can be used effectively for analysis [5, 6]. Since the new millennium, significant funds have been invested in building such information resources; however, fully Findable, Accessible, Interoperable and Reusable (FAIR) data resources are still rare [7]. The lack of data access isolates valuable information, and significant amounts of associated bio-specimens, from resource seekers [8]. Optimizing the use of these data and biorepositories, on which considerable funds have already been spent, will release untapped resources to the health research community.

While data preparation and management are essential in facilitating the raw-data-to-knowledge process [9], research on leveraging cutting-edge methods and technologies to maximize the value of big data in health research remains underdeveloped [4, 10]. Many data management solutions developed in the past are facing the challenge of big data’s volume, velocity, variety and veracity (the four Vs), with the last ‘V’ being especially critical to health research [11]. Inadequacy of analyzable data, barriers to data access and slowness in obtaining information have stymied scientific advances [12, 13] and added operational complexity to science administration [14, 15].

Several factors contribute to this situation. Recent publications have noted cultural barriers in scientific communities [16, 17], regulations surrounding human subject data [18, 19], the need for biomedical data integration [20–22] and available tools for data integration [23, 24], data issues in biomedical analysis and analytics [25, 26] and security requirements in using human-centered health data [27]. In this review, we introduce state-of-the-art methods and technologies from the computing field to bioinformatics, with the intent to leverage big data management efforts for information centralization (i.e. needed information is findable) and integration (i.e. related data are connectable) in health research.

To delineate the problems and solutions more specifically, we arrange the rest of the article as follows. In the ‘Current data management practice in health research’ section, we first identify the data usage pain points arising in current data archiving practice, and we analyze the technical issues underlying the underperformance of some widely used data management approaches and the underutilization of the data resources managed by these approaches. In the ‘Demands for data management reform from health research’ section, we present the need to enhance data management capacity to meet the challenges posed by the big health research data reality. Focusing on these unmet needs and well-recognized problems, in the ‘What is new in the data management landscape?’ section we systematically introduce the most up-to-date concepts, methods and technologies for big data management from computing academia and industry, to explore potential improvements. In the ‘Key improvement perspectives’ section, based on the unique data requirements in health research, we present our views on the issues that have a great impact on technical solutions for advancing health research digital products (data and tools).

Current data management practice in health research

Independently derived data archiving approaches

Data management for information integration comprises two different but related practices: data organization and data preparation. The former is implemented through a database system and the latter is implemented through a data processing workflow. Both efforts will create and maintain a data archiving resource to deliver accessible and usable data products.

In the following discussion, we do not comment individually on the software products or data repositories that we studied, as we had neither the in-depth access nor the testing environment to evaluate the systems equally and rigorously. Instead, we analyze the general pattern and characteristics of the approaches and mechanisms underlying the digital products, and their impacts on information delivery, based on literature review, Web site navigation and interviews with the resource management teams. The criteria that we used for these studies include the following: (1) the purpose and scope of a data resource; (2) data organization architecture and data preparation methods; (3) application functionality (e.g. the ability to support granular-level data search and cross-disease/project data navigation if data are derived from the same individual) and system performance (i.e. query speed, accuracy and throughput); (4) data maintenance and governance practices and rules; and (5) data readiness for use and reuse (e.g. integrity, consistency, normalization, standardization and availability of metadata). We chose these criteria as concrete measures of the FAIR qualities of a data source. The corresponding findings are summarized in this section (‘Current data management practice in health research’ section).

Over the past 15 years of data integration efforts, several database systems have been created to archive or integrate individualized clinical and basic (translational) research data. Such systems include i2b2 [24], BTRIS [28], STRIDE [29], dbGaP [30], NDAR [31], The Cancer Genome Atlas (TCGA) [32], TCIA [33], TRAM [34], IMMPORT [35], DASH [36] and Enterprise Data Trust [37], among others [23]. Notably, the i2b2 system has been widely adopted across data archiving communities [24, 38], and the BTRIS, dbGaP, NDAR, TCGA, TCIA, DASH and IMMPORT systems are used to maintain publicly accessible (i.e. available to a wide range of authorized researchers) data repositories.

Understandably, the database systems created in the past were built for specific data challenges defined at the time, and they carry the characteristics of the data era and the technology then available. For example, despite managing similar domain data (i.e. clinical and genomics data) and serving similar purposes (i.e. translational or clinical genomics research), these databases vary greatly in their data organization mechanisms. Two data management paradigms, the relational database approach [39] and NoSQL-based approaches [40], have been used to archive these data. Some of them have different disease focuses [41]. Within the relational database paradigm, requirements for a data attribute vary from those requiring a defined common data element (CDE) [32], which specifies the permissible data values and value format [42], to no restrictions at all. As a result, CDE-compliant or classical relational database approaches often deliver a schema with a few hundred data attributes, while strategies that allow ad hoc attribute creation, such as entity–attribute–value (EAV) [43] data organization, can produce up to millions of ‘unique’ data elements in a database schema.

To accommodate these diverse data organization structures and data element features, the data preparation practices and rules for feeding data into the databases also differ greatly. CDE-compliant or typical relational databases require more rigorously controlled per-attribute data normalization, standardization and curation, while those built on open-schema mechanisms, which allow ad hoc ‘attribute’ creation without discrete property specification, expect little prior data cleansing, standardization and validation in data archiving.
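To make the contrast concrete, the following sketch stores the same laboratory result under both regimes: a CDE-style relational table with a typed, vocabulary-constrained attribute per data element, and an open-schema EAV table that accepts ad hoc attribute names as plain text. The table, column and code names are hypothetical, and SQLite is used purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CDE-compliant style: one typed column per defined data element,
# with a controlled vocabulary enforced at load time.
conn.execute("""
    CREATE TABLE lab_result (
        subject_id TEXT NOT NULL,
        visit_date TEXT NOT NULL,   -- ISO 8601 date
        test_code  TEXT NOT NULL CHECK (test_code IN ('HGB', 'WBC', 'PLT')),
        value      REAL NOT NULL,
        unit       TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO lab_result VALUES ('P001', '2017-03-01', 'HGB', 13.5, 'g/dL')")

# Open-schema EAV style: attribute names are created ad hoc at load time,
# with no per-attribute type, unit or vocabulary control.
conn.execute("CREATE TABLE eav (entity_id TEXT, attribute TEXT, value TEXT)")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    ("P001", "Hemoglobin (g/dL), visit 1", "13.5"),
    ("P001", "hgb_gdl_v1", "13,5"),  # same fact, different spelling and format
])
```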

Unsolved database system performance and data usability issues

As a result, these distinctly different data management practices have produced data repositories with a range of ‘FAIR’ qualities [7], from those with little online data access (for authorized users) to those with a good level of online data search-ability. In current implementations, most data repositories built on NoSQL and open-schema approaches support neither penetrative granular-level data search-ability nor cross-disease or cross-project data navigation, even when the data are derived from the same individuals and stored in the same repository.

With a few exceptions [29, 44, 45], database performance (i.e. query speed, accuracy and throughput) and data usability (i.e. integrity, consistency and standardization) are not reported. Slow or even ‘hung’ queries are commonly observed in data archives dominated by the EAV data structure and implemented in relational Database Management System (DBMS) environments (more analysis in ‘The impact and the root cause of data pain points’ section). In these situations, the maintenance staff of the repositories have to deliver requested data manually, on a case-by-case basis. In instances where the data element dictionary and metadata about the data are not available, investigators have to find their own way to understand the data and thoroughly reprocess them before they can be used. These ‘offline’ processes are arduous and error prone, and they usually do not produce reusable procedures or accessible data.

The impact and the root cause of data pain points

We collectively refer to all types of data inconvenience as ‘data pain points’. The most common pain points stem from low system performance and low data usability. Data repositories created without enforcing data comparability and communicability within and between repositories have become another layer of data heterogeneity and fragmentation. Reprocessing aggregated data in established repositories can be considerably difficult and costly, as the required metadata and provenance information might have been lost in the first attempt at ‘data integration’.

The technical causes of these problems are multifactorial and intertwined. Here, we try to delineate a couple of major ones by first explaining the basic features of two fundamentally different data management approaches: the relational database (i.e. SQL-based) approach [39, 46] and the non-relational (i.e. NoSQL) approaches [40, 47–49]. The former is often used for information integration and the latter for information centralization. Each has its unique strengths, limitations and suitable applications (Table 1; more details in ‘Data management paradigms’ section).

Table 1. Comparison of SQL- and NoSQL-based data management approaches

Fundamentals
  SQL-based: Schema flexibility relies on a clean separation between data structure and data values. SQL functions are derived from relational algebra, which performs best on semantically normalized data.
  NoSQL-based: Schemaless (i.e. schema-flexible, schema-open). Data organization consistency varies, and data query and operation strategies vary accordingly, from ad hoc to well-defined procedures, algorithms and mechanisms.

Database architecture
  SQL-based: Basic concepts: relation, tuple, attribute. Building blocks: entities and relationships are predefined.
  NoSQL-based: Basic concepts: non-relational, semi-/unstructured data. Data organization: key-value, attribute-value, graph store and BigTable approaches.

Basic data elements
  SQL-based: Clean separation between data values and data structure; semantically normalized attributes.
  NoSQL-based: Rigorous separation of data structure and values is not required; an attribute may have a range of semantic atomicity and consistency.

Built for
  SQL-based: Databases requiring ACIDity (e.g. transactional databases), annotation and curation (e.g. knowledge bases) or integrity and continuity (e.g. warehouse databases).
  NoSQL-based: Databases that balance data availability and consistency and that do not require highly structured, decomposed and complex hierarchies and relationships.

Used for
  SQL-based: Transactional operations, ad hoc analysis and reporting, multi-parameter and longitudinal data tracking, information integration.
  NoSQL-based: Target finding, data mining, special types of data transactions, recognition and exploration of unseen profiles, information centralization.

Targeted users
  SQL-based: Life science researchers, statisticians, data scientists and users needing dashboards.
  NoSQL-based: Data scientists and computing-lay life scientists; each group may use different application interfaces.

Challenge to build
  SQL-based: Complex; database modeling is required, and database quality varies greatly with human (designer) factors.
  NoSQL-based: Optimizing data organization and search algorithms for accurate and fast target finding is also challenging.

Challenge to feed
  SQL-based: Exhaustive and expensive to transform data from heterogeneous sources into a predefined schema.
  NoSQL-based: Data processing requirements vary with database specifications; generally more error- and failure-tolerant, with looser normalization stringency and higher throughput.

Limitations
  SQL-based: Schema scalability and data processing complexity.
  NoSQL-based: Joining related data; lack of data standardization and integrity.

Security
  SQL-based: Mature at various levels of data access control, including fine-grained security enforcement.
  NoSQL-based: Maturing; typically run in a trusted infrastructure environment, and fine-grained security enforcement is still challenging.

In principle, and if done right, the SQL approach can develop data structures with Atomicity, Consistency, Isolation and Durability (ACID), a data property derived from the data normalization principles [39, 50, 51]. While a classical concept, the latest research on big data systems is rediscovering that these principles are still valid metrics in data management [52], and this data feature is fully supported by relational DBMSs [39, 51]. The database flexibility and scalability of the SQL approach rely on a clean separation between data structure and data values, as well as correct identification and construction of data relations and relationships [46, 51]. The final product of this process is an executable database schema, in which all the atomic data elements (i.e. attributes) are meant to be highly reusable, so that the schema can be relatively stable and generalizable across studies. In contrast, the range of NoSQL solutions arose to address different data perspectives (e.g. prioritizing data availability), so they generally neither require rigorous data attribute atomicity (i.e. normalization) nor demand predefined data relationships. Therefore, NoSQL databases are often referred to as ‘open schema’, ‘schemaless’ or ‘schema-flexible’ [40, 48, 52]. Levels and types of data ‘ACIDity’ have been discussed and explored in the NoSQL world, and more details can be found in [47, 49, 53–57]. The balance between data consistency and availability in big data applications, in particular, has been studied extensively by NoSQL researchers [48, 49, 58]. Accordingly, the algorithms and mechanisms used to manage and query SQL and NoSQL data matrixes also differ [47, 48, 51, 55, 59]. Industry has developed two lines of technology, the relational database management system (RDBMS) and the NoSQL DBMS, to support these diverse needs. Placing either type of data structure into an ill-suited DBMS will inevitably lead to query performance and data usability problems. Using samples from a real-world clinical data repository, we illustrate the attribute-level difference between approaches that require data attribute normalization (SQL) and those that do not (often seen in schemaless or NoSQL solutions), and how semantically non-normalized data elements can be reprocessed to separate data structure from data values and form a typical SQL data table (Figure 1).

Figure 1. Attribute-level difference between schemaless and classical relational database practices. Panel (A) shows the differences in attribute-level atomicity and abstraction between the two approaches; the left-side samples are collected from a real-world data repository. Panel (B) shows a standard relational data table derived from Panel (A) through data wrangling.
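The wrangling step sketched in Figure 1 can be approximated as follows: attribute names that mix structure with context (the analyte, its unit and the time point) are parsed so that structure lands in columns and values in rows. The naming pattern and field names below are assumptions for illustration, not samples from the cited repository.

```python
import re

# Hypothetical, semantically non-normalized attribute-value pairs.
raw = {
    "hemoglobin_g_dL_visit1": "13.5",
    "hemoglobin_g_dL_visit2": "12.9",
    "wbc_10e3_uL_visit1": "4.8",
}

# Assumed naming convention: <analyte>_<unit>_visit<N>
pattern = re.compile(r"(?P<analyte>[a-z]+)_(?P<unit>.+)_visit(?P<visit>\d+)")

rows = []
for attr, value in raw.items():
    m = pattern.fullmatch(attr)
    if m is None:
        continue  # leftovers like this need manual curation in practice
    rows.append({
        "analyte": m.group("analyte"),
        "unit": m.group("unit").replace("_", "/"),
        "visit": int(m.group("visit")),
        "value": float(value),
    })

# 'rows' now fits a normalized relational table (analyte, unit, visit, value).
for r in rows:
    print(r)
```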

In biomedical data capture and archiving practice, both RDBMSs and NoSQL DBMSs have been used to build database tools [28, 34, 38, 60–62]. In the RDBMS environment, a particular data structure, EAV, has been a popular strategy to ‘boost’ database flexibility [24, 28, 29, 45, 61]. However, researchers and users soon noticed system performance and data usability problems [29, 63–65]. In our opinion, EAV can be viewed as a variant of the ‘attribute–value’ pair data structure that is schemaless and typically used in the NoSQL paradigm [47, 55, 56]. The definitions of the ‘entity’ and the ‘attribute’ in EAV [43] are more analogous to the ‘object’ and ‘attribute’ described in a JavaScript Object Notation (JSON) document (Figure 2) [47, 56] than to the ‘entity’ and ‘attribute’ specified by relational database principles (Figure 1) [46, 51].

Figure 2. Data organization comparison between JSON document and EAV table.

When the EAV data structure is implemented in an RDBMS that was built to manage classical relational database tables, its data structure is uninterpretable to the system. Therefore, EAV data structures cannot take advantage of several unique functions supported by an RDBMS: (1) utilities to develop a centralized schema property dictionary and to enable schema property control; (2) utilities to define and enforce data uniqueness (i.e. redundancy control), which consequently ensures data integrity and usability; and (3) utilities to index data for high query performance. More importantly, the EAV data structure does not respond to the built-in functions of SQL, which are derived from relational algebra and require semantically normalized data in each attribute [51, 59]. To make an EAV table SQL-workable, researchers have tried to insert a ‘pivoting’ procedure to reorganize the data matrix [63, 64, 66]. This strategy is costly in CPU time and inefficient for SQL queries if the ‘attributes’ in the EAV tables lack semantic consistency [63]. Therefore, if EAV tables dominate a database in an RDBMS, query performance and data usability will inevitably suffer [29, 63, 64, 67].
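For readers unfamiliar with pivoting, the minimal sketch below shows the common conditional-aggregation pattern for turning EAV rows into a relational matrix; it is illustrative only and is not the implementation used in the cited systems. Every wanted attribute must be enumerated in advance, which is part of why ad hoc EAV attributes are hard to query at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity_id TEXT, attribute TEXT, value TEXT)")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    ("P001", "sex", "F"),
    ("P001", "age", "54"),
    ("P001", "hgb", "13.5"),
    ("P002", "sex", "M"),
    ("P002", "hgb", "14.1"),
])

# One CASE expression per wanted attribute; missing attributes surface as NULL.
pivot_sql = """
    SELECT entity_id,
           MAX(CASE WHEN attribute = 'sex' THEN value END) AS sex,
           MAX(CASE WHEN attribute = 'age' THEN value END) AS age,
           MAX(CASE WHEN attribute = 'hgb' THEN value END) AS hgb
    FROM eav
    GROUP BY entity_id
"""
for row in conn.execute(pivot_sql):
    print(row)  # ('P001', 'F', '54', '13.5') and ('P002', 'M', None, '14.1')
```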

Similarly, if data sets in the NoSQL paradigm are not properly processed, organized and indexed as described in [48, 49, 68, 69], the databases will not support penetrative granular-level data search-ability. The inability to query granular-level semi-structured or unstructured data is a common problem with NoSQL repositories in the biomedical data space, where inverted indexes [48, 69], which work like the index at the end of a textbook and allow key words to be looked up to enable granular-level search-ability in less structured text files, are rarely reported to have been implemented. In that situation, users have to download the entire data set and figure out the details on their own.
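As a rough illustration of the inverted-index idea, the sketch below maps each token to the documents that contain it, so keyword search does not scan every record; the documents and the tokenization are invented and far simpler than production text indexes.

```python
from collections import defaultdict

docs = {
    "note_001": "patient reports fatigue and elevated fasting glucose",
    "note_002": "no fatigue; glucose within normal range",
    "note_003": "family history of type 2 diabetes",
}

# Build the inverted index: token -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().replace(";", " ").split():
        index[token].add(doc_id)

def search(*terms):
    """Return the documents containing all query terms."""
    hits = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*hits) if hits else set()

print(search("glucose", "fatigue"))  # {'note_001', 'note_002'}
```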

Nevertheless, the NoSQL approach has proven to be a potent solution for information centralization and target finding, especially in big data applications [48, 49, 68]. In many cases, target finding in health research requires neither rigorous separation between data structure and data values nor highly normalized data attributes (more discussion can be found in ‘A different view to see and use data’ section and Figure 4). The attempt to combine EAV with relational data tables in an RDBMS environment for biomedical data applications reflects a clear desire and demand for the coexistence of SQL and NoSQL mechanisms in a single technology platform. Such technology has only recently become available to address this need (further discussion can be found in ‘Unification of SQL and NoSQL platforms’ section).

Figure 4. Projected framework for enterprise health research data management to support information centralization and integration, through adoption of the new concepts, methods and technologies available in database research academia and industry.

Demands for data management reform from health research

Biomedical informatics researchers are encountering even bigger challenges because of the unprecedented flood of health research data. First, more electronic health records (EHRs) have become available, in part owing to the rapid implementation of EHR systems by health-care providers. One of the goals of these investments is to foster research and nurture innovation [70, 71]. Second, advances in high-throughput technologies have made more individualized omics [72–74] and systems biology data [75, 76] available, and researchers want to combine these data with individuals’ clinical data for health research [4, 21, 77]. Third, to address individual variability in disease pathogenesis and responses to treatments, health-related genomic cohort studies are substantially larger than ever before [2, 78, 79]. For example, studies of type 2 diabetes [80–82], obesity [83, 84], hypertension [85, 86] and body metrics [87] have each enrolled hundreds of thousands of participants and collected millions of bio-specimens, whole-genome sequences and deep sequencing data to understand the genetic factors associated with the phenotypes and disorders. Going forward, a precision medicine initiative program (the ‘All of Us’ cohort study) plans to enroll at least a million participants [79]. In addition to clinical, genetics and genomics data, this program will include contextual health data of individuals from mobile devices. Fourth, to maximize the utility of clinical research data and collected biorepositories and to validate the results of studies, funding agencies [88–91], research organizations [41, 92] and health science publishers [93] have strengthened data sharing requirements [10, 16, 79, 94]. Finally, on top of this unprecedented data volume, velocity and variety, research communities and journals still expect professionally curated data and individualized data connectivity [1, 3, 4, 95, 96]. These requirements further exacerbate the difficulties of managing big health research data and intensify the need for solutions that substantially outperform the existing ones.

What is new in the data management landscape?

Data management paradigms

It is widely accepted that there is no one-size-fits-all solution for the inherent demands of big data. SQL and NoSQL approaches are the major paradigms in the landscape [9]. Understanding the strengths and limitations of each is crucial to achieving the intended goals of a data repository. The main differences between the two pure lineages are summarized in Table 1 [39, 40, 46, 48, 49, 54, 68, 97, 98].

A recently proposed data storage concept is the ‘Data Lake’. A Data Lake holds masses of data in their various native settings, a situation that practically exists in many enterprise data centers [58, 99, 100]. Effectively fishing out a specific target or unveiling a pattern from a Data Lake is challenging. One approach is nonintrusive: the data in the ‘Lake’ remain in their native forms, and only the paths to their physical locations and the metadata about the data sets (e.g. size, type and provenance) are systematically identified, indexed, categorized and organized in a searchable way, so that a target data set becomes discoverable when needed [58]. The other approach involves data wrangling, in which the methods of processing source data remain to be systematically developed and categorized [101]. In addition to well-defined data processing methods, e.g. ETL [102] and MapReduce [68], a ‘Data Lake’ that hosts a myriad of original forms of data sets relies on versatile processes to deliver applicable master data and metadata. Examples include Resource Description Framework (RDF) triplet cleansing and semantic alignment [99]; the extraction and organization of metadata from heterogeneous source data sets to facilitate Lake data management and query [103]; and the processes of selecting, vetting, sorting, cleansing, normalizing, integrating and provisioning to transform raw data into consumable and reusable forms [101]. In fact, the nonintrusive and intrusive (i.e. data wrangling) approaches are consecutive logical steps in a raw-data-to-information process. More innovative approaches, such as machine learning mechanisms, are likely to be introduced to enhance Lake data processing capacity. There is no ‘silver bullet’ for reaching a target in a Data Lake without going through levels of data cleansing and processing, depending on the purpose and requirements of data usage, in either or both SQL and NoSQL environments [52, 58].
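A minimal sketch of the nonintrusive approach might look like the following: files stay in place, and only their paths and lightweight metadata are recorded in a searchable catalog. The directory layout and metadata fields are assumptions for illustration.

```python
import os
import time

def catalog_directory(root):
    """Walk a directory tree and record path-level metadata without reading file contents."""
    catalog = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            catalog.append({
                "path": path,
                "format": os.path.splitext(name)[1].lstrip(".").lower() or "unknown",
                "size_bytes": stat.st_size,
                "modified": time.strftime("%Y-%m-%d", time.gmtime(stat.st_mtime)),
            })
    return catalog

# Example: locate candidate VCF files by catalog metadata alone,
# without touching their contents.
entries = catalog_directory(".")
vcf_like = [e for e in entries if e["format"] == "vcf"]
```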

Machine learning and big data preparation

Data preparation, of which data wrangling is a major component, is essential for creating any searchable and reusable data resource with either SQL or NoSQL solutions [48, 68, 102, 104–108]. The process is more exhaustive with SQL solutions, as data from various sources need to be transformed in configuration and expression to meet the specifications of a unified schema [105, 109]. Low efficiency in accurate and automated data synthesis has been a long-standing obstacle to information integration [104], as many steps in this process (e.g. semantic mapping, data element re-conceptualization, semantic alignment, data value standardization and curation, as shown in Figure 1) are cognitive efforts [110]. To break this bottleneck, researchers have been exploring machine learning mechanisms for the required intelligence and throughput capacity since the new millennium [111], and recent advances in machine learning indicate an imminent breakthrough [112–114]. Compared with earlier text processing efforts [115–117], machine learning that combines statistical inference and self-error correction has significantly enhanced data understanding accuracy and data processing capacity [118, 119]. Machine learning frameworks used for data processing, and data preparation systems integrated with machine learning mechanisms, include TensorFlow [120], DeepDive [121], Mindtagger [122], Magellan [121] and MLBCD [123]. Data scientists in health research have started using machine learning approaches for biological data curation [124] and EHR data summarization [125, 126]. Associated studies have emerged to assist health domain researchers in selecting algorithms, suggesting hyperparameter values, aggregating clinical data attributes and interpreting outcomes when using machine learning tools [127, 128].
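As a toy illustration (not one of the cited frameworks), the sketch below uses character n-gram TF-IDF similarity to suggest which target schema attribute each source column name may correspond to. All attribute names are invented, and a real pipeline would add statistical inference over the data values themselves plus human review of low-confidence suggestions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical target schema attributes and incoming source column names.
target_attributes = ["hemoglobin", "white_blood_cell_count", "systolic_blood_pressure"]
source_columns = ["hemoglobin_gdl", "wbc_count_10e3", "sysBP_mmHg"]

# Character n-grams tolerate abbreviations and formatting differences.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
target_vecs = vectorizer.fit_transform(target_attributes)
source_vecs = vectorizer.transform(source_columns)

# Suggest the closest target attribute for each source column.
similarity = cosine_similarity(source_vecs, target_vecs)
for i, col in enumerate(source_columns):
    best = similarity[i].argmax()
    print(f"{col!r} -> {target_attributes[best]!r} (score {similarity[i][best]:.2f})")
```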

Machine learning and knowledge base construction

Historically, SQL-based data preparation and schema design are separate processes that can be described simply as ‘schema first, data later’ [39, 55]. A poorly designed schema can make data preparation meaningless or even worsen data quality [108, 129]. However, creating a sophisticated database schema for health science, even for a well-trained and experienced database architect, is startlingly hard, as the job not only requires data structure abstraction skills (i.e. separating data structure from data values, as shown in Figure 1) but also demands a level of domain knowledge that takes years of training to acquire [1, 5, 12, 130, 131]. On the flip side, a domain expert is not trained to ‘cleanly’ separate data structure from data values for data characteristics abstraction. This problem is greatly alleviated by groundbreaking work led by Christopher Ré at Stanford [132–134]. Adhering to relational database principles and using a machine learning approach that combines statistical inference and self-error correction mechanisms, researchers in Ré’s group have created a knowledge base construction (KBC) framework called DeepDive [134, 135]. DeepDive and its associated data feeding tool MindTagger can not only facilitate the formation and optimization of a relational database schema with semantically normalized data structures but also cleanse and prepare data for this schema. This end-to-end raw-data-to-insight workflow has shown impressive feasibility and efficacy for KBC; to some extent, its KBC ability outperforms human efforts [136]. DeepDive has been used in the real world to manage human trafficking data and to create a knowledge base for paleobiologists [118, 122, 135]. This approach demonstrated for the first time that machine learning technology can help researchers build quality knowledge bases in highly specialized (e.g. paleobiologic) domains, without needing to understand complex algorithms and coding details or to worry about system performance. Although still at an early stage, the significance and long-term impact of this approach in (big) data management has excited the database research and science communities [118, 131, 136].

Unification of SQL and NoSQL platforms

In ‘The impact and the root cause of data pain points’ section, we point out the intuitive appeal of, and the dilemma facing, the biomedical informatics community in using both structured SQL data and semi-structured schemaless NoSQL data in a single technology platform. Recently, researchers in the DBMS industry and open-source communities have made significant advances to address this need. Instead of the heavy tasks of physically decomposing data into EAV data tables and then pivoting the EAV data into SQL-workable data matrixes, a computationally inefficient process [29, 64, 65], SQL and NoSQL unified DBMS platforms support a much lighter-weight strategy: a built-in index utility anchors the attributes of interest in JSON documents (Figure 2) and projects the data of interest virtually, as in a relational database table, so that they can be queried effectively by an SQL-like declarative query language (Figure 3) [55, 137, 138]. This mechanism allows ‘ultra-performance’ in query execution without physically altering the JSON documents during the query process [55, 139]. In addition, an inverted index applied to the schemaless data can still enable NoSQL-style queries for fast target finding [57, 69]. By now, open-source products such as MySQL 5.x and PostgreSQL 9.x have started supporting queries over JSON documents [140–142]. Industrial products, such as IBM DB2 [143, 144], Microsoft Azure [137], Oracle 12cR1 [145], Snowflake [139], Sinew [146] and Teradata [147], have also developed this capacity.
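The sketch below mimics this pattern using SQLite's JSON functions (assuming a build with the JSON1 extension enabled); the cited commercial and open-source systems each use their own syntax and index utilities. The JSON documents stay intact: an expression index anchors the attribute of interest, and a view projects it as a virtual relational column that plain SQL can query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (doc TEXT)")  # schemaless JSON documents
conn.executemany("INSERT INTO visits VALUES (?)", [
    ('{"subject": "P001", "labs": {"hgb": 13.5}, "site": "A"}',),
    ('{"subject": "P002", "labs": {"hgb": 10.2}}',),
])

# Anchor an attribute inside the documents with an expression index ...
conn.execute("CREATE INDEX idx_hgb ON visits (json_extract(doc, '$.labs.hgb'))")

# ... and project it virtually as relational columns, without altering the JSON.
conn.execute("""
    CREATE VIEW visit_labs AS
    SELECT json_extract(doc, '$.subject') AS subject,
           json_extract(doc, '$.labs.hgb') AS hgb
    FROM visits
""")

for row in conn.execute("SELECT subject, hgb FROM visit_labs WHERE hgb < 12"):
    print(row)  # ('P002', 10.2)
```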

Figure 3. Mechanism underlying SQL and NoSQL unified capacity in a DBMS (modified from [55]).

A different view to see and use data

Generally speaking, most, if not all, data analytical tools prefer or require decomposed and semantically normalized data strings in each data point for quantitative analysis [63, 112, 148]. Health science researchers typically conduct data analysis and analytics using multiple parameters across domains, with rigorously controlled concurrent and longitudinal time stamps. Delivering normalized data with multiple data elements from complex temporal, spatial and scientifically distinct domains simultaneously is typically supported by a relational database [34, 51, 52]. However, it is exhaustive and expensive to process massive amounts and wide spectra of health-related data into a relational database schema [44, 116, 149]. After years of observing information integration and data usage, data management researchers have concluded that not all data generated at the data capture stage are equally needed for integration or need to have ACID quality [9]. Some EHR data that are important in hospital operations may not be critical for identifying disease causal factors [79, 150]. Data of importance to the purpose of an integrated data resource are referred to as ‘Core Data’ [9, 79]. The core data, although not ‘all data’, are neither ‘small’ nor ‘static’; they often have a longer life span and will be used repeatedly. They are therefore worth wrangling with care for sustainable usage and sharing, as few raw data can fully deliver their embedded meanings and be consistent and reusable without being processed [101, 109] (more discussion in ‘Research on core data and core data elements’ section). The rest of the data can be treated differently, for example kept in less structured NoSQL data stores [9] or cataloged nonintrusively until needed [58]. This practical view of triaging data management efforts, together with the advanced methods and technologies described in this review, will be necessary to realize raw-data-to-information transformation in the big data era.

Next generation of data management solutions

Computing mechanisms and algorithms will only make a large impact when they are implemented as convenient technologies that a broad user base can use without expert-level knowledge of the underlying theoretical, mathematical and coding complexities. Today, we are in a better position to conquer big data problems, as many revolutionary ideas and theories for operating on, studying and using data have been, or are being, developed into powerful tools. In Figure 4, using health research data as an example, we summarize the trending concepts, approaches and technologies described in ‘What is new in the data management landscape?’ section for their potential application in an information centralization and integration enterprise.

Key improvement perspectives

Rethink the role of data preparation and management

Data preparation facilitates every step in a raw-data-to-insight process. Data management, inclusive of both human and machine operations, has a significant impact on the efficiency of health research and is itself a part of data science. Without optimized data preparation workflows and a well-defined data organization architecture and framework, data standardization, annotation and curation activities will not have appropriate conduits to facilitate their processes, and the data products will not have a suitable home for storage, sharing and application. Consequently, data analysis and analytics tools will starve for analysis-ready data. However, as data management activities and products usually do not directly result in scientific discoveries in health research, much needed research in this field is lagging and metrics of success are undefined. As a result, the inadequacy of data and information services has become an evident choke point for a broad range of research activities. It is time to relearn from past experience and to invest in research on optimizing health research data management.

Learn to adopt standards and techniques from other disciplines for innovation

In biomedical data management practice, it is not uncommon to see ‘convenient tricks’ played out as ‘novel’ solutions without performance and sustainability testing. In fact, the underperformance of many database tools, in either the SQL or the NoSQL paradigm, can hardly be attributed to the limitations of the technology itself, but rather to misalignment between approaches and goals, or between the techniques used and the technology chosen.

As with innovations in life science (e.g. new drug discovery), transforming an idea into a handy, potent and sustainable technology in computer science is also a challenging journey (e.g. machine learning technology’s lengthy path to maturity). Along the way, computing researchers have distilled a set of standard procedures (i.e. best practices) to safeguard and measure success (i.e. software function sophistication, accuracy and performance testing). Moving forward, we envision that machine learning techniques and technologies will be increasingly adopted and integrated into data management activities. Therefore, it will be more cost-effective for health science data researchers to learn, adopt and leverage state-of-the-art methods from computing academia and industry, wisely choose technologies, correctly apply techniques, adhere to good practice in tool development and data management and work with multidisciplinary talent to achieve interdisciplinary success.

Close the gap between CDE creation and data attribute specification

Although definitions of a CDE vary [42, 151–154], the common goal of defining CDEs in the health research community is to produce data with standardized vocabulary and format to facilitate data sharing within the biomedical domain. In our view, a CDE is a data structure (not a data value) equivalent to a semantically normalized and consistent ‘attribute’ specified with biomedical domain knowledge. Each CDE/attribute therefore has a unique semantic definition that specifies its permissible data values (the chosen taxonomy), value type and format and other features [42]. After years of investment, several CDE stores have been created by different communities, and efforts are still ongoing [42, 155, 156]. With only a couple of exceptions [32, 33], however, CDE adoption in database construction is limited. The major reasons these efforts are underutilized are twofold: (1) the CDEs created by researchers are not curated and validated to be computable and (2) ‘open-schema’ practices in data management do not require CDEs. It is evident that a CDE will not realize its value unless it is computable in a database. Therefore, it is time to emphasize the importance of semantic mapping between qualified CDEs and attributes in a database setting. We need to be aware that the attribute consistency and normalization required by CDEs, even in complex data structures, are inherently supported by SQL and the RDBMS, whereas under NoSQL they depend on the condition of the data and the software solution chosen for implementation [48, 49, 53, 58, 157].
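As a simple illustration of making a CDE computable, the sketch below turns a CDE-style specification (permissible values, value type and range) into declarative column constraints, so conformance is checked at load time; the CDE itself is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE participant (
        subject_id TEXT PRIMARY KEY,
        -- CDE-style attribute: required value from a controlled vocabulary
        smoking_status TEXT NOT NULL
            CHECK (smoking_status IN ('never', 'former', 'current', 'unknown')),
        -- CDE-style attribute: numeric type with a permissible range
        age_years INTEGER NOT NULL CHECK (age_years BETWEEN 0 AND 120)
    )
""")

conn.execute("INSERT INTO participant VALUES ('P001', 'former', 54)")  # conforms
try:
    conn.execute("INSERT INTO participant VALUES ('P002', 'sometimes', 54)")
except sqlite3.IntegrityError as err:
    print("rejected non-conforming value:", err)
```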

In theory, it is feasible to create a backbone common data model with core data attributes that meet CDE requirements; TCGA and the Study Data Tabulation Model (SDTM) used for reporting to the FDA’s Center for Drug Evaluation and Research (CDER) have demonstrated this feasibility [32, 158]. Attribute-normalization-qualified and CDE-enhanced data modeling will result in a more generic, scalable, flexible and durable, yet structurally succinct, database schema (as in the scenario indicated in Figure 1). However, planning a universal schema to structure and standardize all health-related data is largely unnecessary and unrealistic. Such an effort would disrupt the focus on core data management and make the information integration process unmanageable.

Research on core data and core data elements

Core data identification and core data structure abstraction play a critical role in data integration. For example, TCGA adopted only around 200 CDEs from the caDSR CDE store [156] to be its database attributes. Of these roughly 200 built-in attributes, only about half had been filled with data by November 2016, after >6 years of data accumulation. Despite this seemingly small set of data attributes, this core schema and its data have proven reusable, and this data resource is heading toward a transformation from a raw database to an oncology knowledge base [32, 159, 160]. In contrast, data in repositories that contain large numbers (up to millions) of ‘distinct’ data elements in EAV tables with little attribute normalization have been used far less because of dysfunctional query performance and data usability issues.

These real-life examples deliver an important message: we need to conduct research on the scope of core data and the optimal core database structure, which will produce simplified schemas with substantially fewer data elements that remain powerful enough to cover the most-needed data and deliver high-quality information. In fact, the distinction between the core and the rest of the data set is not static (Figure 4). How to define the initial core data, and how to manage changes in the core scope, is critical to maintaining the relevance of a database to knowledge discovery.

Work with versatile human roles in big data management

The ability to distinguish core data from the rest of the data requires both expertise in the data domain and vision about data usage, which mandates multidisciplinary insights from scientists. These domain scientists will play a crucial role in differentiating the core data from the rest, not only for their own research interests but also for their peers within and beyond their disciplines. On the other hand, validating the core data scope to minimize bias and structuring core data for computation demand knowledge of statistics and state-of-the-art database approaches. The persons who serve these key roles must be able to transform domain researchers’ specific needs into executable designs for computation. Furthermore, developing enterprise-scale data management products with high performance and sophisticated functions requires substantial experience in software engineering; and finally, assembling and orchestrating a data preparation workflow needs versatile knowledge and techniques around tools, data structures, data meanings, data usages and process automation. Being an expert on all fronts is almost impossible. Therefore, it is important to recognize these needs and to nurture and establish a multidisciplinary coalition, through scientific collaboration and the inclusion of technical expertise from industry.

Pay attention to data usability and system performance

Data usability and system performance are the basic elements for evaluating the success of a data resource. They are often the determining factors in whether users choose or abandon a data source. The problems of system dysfunction and low usability of archived data must be effectively addressed in the big data era. Data resource initiatives must clearly define their goals and measures of success, with system performance and data usability as basic requirements.

Summary

We are facing unprecedented data management challenges in health research, and we are challenged by a lack of trained experts to address highly dynamic and complex data problems. Synthesizing and managing data for information integration and/or centralization in health research is a long-term challenge, and we are still at the beginning of this journey. The concepts described in this review will require validation through pilot studies. The multidisciplinary nature of managing health research data cannot be emphasized enough. Despite the anticipated and unknown challenges ahead, given the profound advances available in data management approaches, we remain optimistic that high-performance, reusable and ready-to-use data resources are technically achievable and will have a transformative impact on health knowledge discovery.

Key Points

  • We introduced the most advanced data management concepts, approaches and technologies available from the computing domain to the biomedical informatics community, with the intent to leverage knowledge and skills to meet big data challenges, in particular for information centralization and integration in health research.

  • We analyzed the capacity and performance issues of current data management practice in health research data archiving, and discussed their impacts on data availability, quality and usability in the context of the big data reality.

  • We explained the principal mechanisms of several trending data management solutions, and pointed out the significance of these advances in tackling the most intriguing technical problems in health science data management.

  • We discussed key aspects that have a significant impact on technical solutions in health research data management, to emphasize the importance of filling the gaps among disciplines, skill sets, human roles and relevant activities for developing the next generation of digital products that substantially outperform the existing ones.

Xiaoming Wang, PhD, is a Biomedical Informaticist at the National Institute of Allergy and Infectious Diseases, NIH. Her research interest is in information centralization and integration for health research.

Carolyn Williams, PhD, MPH, is an observational epidemiologist at the National Institute of Allergy and Infectious Diseases, NIH. Her research interest is in maximizing knowledge from clinical data.

Zhen Hua Liu is a DBMS researcher and developer at Oracle Corporation. He has >35 peer-reviewed publications in computer science journals, and his current research interest is in SQL and NoSQL integration.

Joe Croghan is the Chief of Software Engineering at the National Institute of Allergy and Infectious Diseases. He has >30 years of experience in software development, artificial intelligence and database technologies.

Funding

The internal research fund of the National Institute of Allergy and Infectious Diseases, NIH.

References

  • 1. Auffray C, Balling R, Barroso I. Making sense of big data in health research: towards an EU action plan. Genome Med 2016;8:71. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Collins FS, Varmus H.. A new initiative on precision medicine. N Engl J Med 2015;372:793–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Howe D, Costanzo M, Fey P. Big data: the future of biocuration. Nature 2008;455:47–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Duffy DJ. Problems, challenges and promises: perspectives on precision medicine. Brief Bioinform 2016;17:494–504. [DOI] [PubMed] [Google Scholar]
  • 5. Bernstam EV, Smith JW, Johnson TR.. What is biomedical informatics? J Biomed Inform 2010;43:104–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Bellinger G, Castro D, Mills A.. Data, information, knowledge, and wisdom. Mental Model Musings 2004; 1–3. [Google Scholar]
  • 7. Wilkinson MD, Dumontier M, Aalbersberg IJ. The FAIR guiding principles for scientific data management and stewardship. Sci Data 2016;3:160018.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Campbell D. Don't forget people and specimens that make the database. Nature 2008;455:590. [DOI] [PubMed] [Google Scholar]
  • 9. Abadi D, Agrawal R, Ailamaki A. The Beckman report on database research. SIGMOD Rec 2014;43:61–70. [Google Scholar]
  • 10. Frey LJ, Bernstam EV, Denny JC.. Precision medicine informatics. J Am Med Inform Assoc 2016;23:668–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Raghupathi W, Raghupathi V.. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2014;2:3.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Alyass A, Turcotte M, Meyre D.. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med Genomics 2015;8:33.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. van Panhuis WG, Paul P, Emerson C. A systematic review of barriers to data sharing in public health. BMC Public Health 2014;14:1144.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Doyle M. How to avoid the 8 most common pain points in becoming a data driven healthcare organization. https://www.healthcatalyst.com/8-common-pain-points-to-avoid-in-data-driven-healthcare (31 October 2016, date last accessed).
  • 15. Strom BL, Buyse ME, Hughes J. Data sharing—is the juice worth the squeeze? N Engl J Med 2016;375:1608–9. [DOI] [PubMed] [Google Scholar]
  • 16. Hudson KL, Collins FS.. Sharing and reporting the results of clinical trials. JAMA 2015;313:355–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Piwowar HA, Becich MJ, Bilofsky H. Towards a data sharing culture: recommendations for leadership from academic health centers. PLoS Med 2008;5:e183.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Balas EA, Vernon M, Magrabi F. Big data clinical research: validity, ethics, and regulation. Stud Health Technol Inform 2015;216:448–52. [PubMed] [Google Scholar]
  • 19. Malin B, Sweeney L.. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J Biomed Inform 2004;37:179–92. [DOI] [PubMed] [Google Scholar]
  • 20. Cambiaghi A, Ferrario M, Masseroli M.. Analysis of metabolomic data: tools, current strategies and future challenges for omics data integration. Brief Bioinform 2017;18:498–510. [DOI] [PubMed] [Google Scholar]
  • 21. Manzoni C, Kia DA, Vandrovcova J. Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief Bioinform 2017, https://doi.org/10.1093/bib/bbw114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Bernstam EV, Tenenbaum JD, Kuperman GJ.. Preserving an integrated view of informatics. J Am Med Inform Assoc 2014;21:e178–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Canuel V, Rance B, Avillach P. Translational research platforms integrating clinical and omics data: a review of publicly available solutions. Brief Bioinform 2015;16:280–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Murphy SN, Weber G, Mendis M. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc 2010;17:124–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Luo J, Wu M, Gopukumar D. Big data application in biomedical research and health care: a literature review. Biomed Inform Insights 2016;8:1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Gligorijevic V, Malod-Dognin N, Przulj N.. Integrative methods for analysing big data in precision medicine. Proteomics 2016;16:741–58. [DOI] [PubMed] [Google Scholar]
  • 27. Claerhout B, DeMoor GJ.. Privacy protection for clinical and genomic data. The use of privacy-enhancing techniques in medicine. Int J Med Inform 2005;74:257–65. [DOI] [PubMed] [Google Scholar]
  • 28. Cimino JJ, Ayres EJ, Remennik L. The National Institutes of Health's Biomedical Translational Research Information System (BTRIS): design, contents, functionality and experience to date. J Biomed Inform 2014;52:11–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Lowe HJ, Ferris TA, Hernandez PM. STRIDE–an integrated standards-based translational research informatics platform. AMIA Annu Symp Proc 2009;2009:391–5. [PMC free article] [PubMed] [Google Scholar]
  • 30. Tryka KA, Hao L, Sturcke A. NCBI's database of genotypes and phenotypes: dbGaP. Nucleic Acids Res 2014;42:D975–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Payakachat N, Tilford JM, Ungar WJ.. National Database for Autism Research (NDAR): big data opportunities for health services research and health technology assessment. Pharmacoeconomics 2016;34:127–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Wang Z, Jensen MA, Zenklusen JC.. A Practical Guide to The Cancer Genome Atlas (TCGA). Methods Mol Biol 2016;1418:111–41. [DOI] [PubMed] [Google Scholar]
  • 33. Clark K, Vendt B, Smith K. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 2013;26:1045–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Wang X, Liu L, Fackenthal J. Translational integrity and continuity: personalized biomedical data integration. J Biomed Inform 2009;42:100–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Bhattacharya S, Andorf S, Gomes L. ImmPort: disseminating data to the public for the future of immunology. Immunol Res 2014;58:234–9. [DOI] [PubMed] [Google Scholar]
  • 36. NIH. NICHD Data and Specimen Hub (DASH). https://dash.nichd.nih.gov/ (20 May 2017, date last accessed).
  • 37. Chute CG, Beck SA, Fisk TB. The enterprise data trust at Mayo clinic: a semantically integrated warehouse of biomedical data. J Am Med Inform Assoc 2010;17:131–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Gabetta M, Limongelli I, Rizzo E. BigQ: a NoSQL based framework to handle genomic variants in i2b2. BMC Bioinformatics 2015;16:415.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Codd EF. A relational model of data for large shared data banks. Commun ACM 1970;13:377–87. [PubMed] [Google Scholar]
  • 40. Stonebraker M. SQL databases v. NoSQL databases. Commun ACM 2010;53:10–11. [Google Scholar]
  • 41. The Global Alliance for Genomics and Health. A federated ecosystem for sharing genomic, clinical data. Science 2016;352(6291):1278–1280. [DOI] [PubMed] [Google Scholar]
  • 42. CDISC. Clinical Data Interchange Standards Consortium. https://www.cdisc.org/standards (30 January 2017, date last accessed).
  • 43. Nadkarni PM, Brandt C. Data extraction and ad hoc query of an entity-attribute-value database. J Am Med Inform Assoc 1998;5:511–27.
  • 44. Wang X, Liu L, Fackenthal J. Towards an oncology database (ONCOD) using a warehousing approach. AMIA Summits Transl Sci Proc 2012;2012:105–15.
  • 45. Loper D, Klettke M, Bruder I. Enabling flexible integration of healthcare information using the entity-attribute-value storage model. Health Inf Sci Syst 2013;1:9.
  • 46. Chen PP-S. The entity-relationship model—toward a unified view of data. ACM Trans Database Syst 1976;1:9–36.
  • 47. Parker Z, Poe S, Vrbsky SV. Comparing NoSQL MongoDB to an SQL DB. In: Proceedings of the 51st ACM Southeast Conference. ACM, Savannah, GA, 2013, 1–6.
  • 48. Chang F, Dean J, Ghemawat S. Bigtable: a distributed storage system for structured data. In: OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, 2006, 205–18.
  • 49. DeCandia G, Hastorun D, Jampani M. Dynamo: Amazon's highly available key-value store. In: Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles. ACM, Stevenson, WA, 2007, 205–20.
  • 50. Haerder T, Reuter A. Principles of transaction-oriented database recovery. ACM Comput Surv 1983;15:287–317.
  • 51. Codd EF. Normalized data base structure: a brief tutorial. In: Proceedings of the 1971 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control. ACM, San Diego, CA, 1971.
  • 52. Abadi D, Agrawal R, Ailamaki A. The Beckman report on database research. Commun ACM 2016;59:92–9.
  • 53. Pokorny J. NoSQL databases: a step to database scalability in web environment. In: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services. ACM, Ho Chi Minh City, Vietnam, 2011, 278–83.
  • 54. Klein J, Gorton I, Ernst N. Performance evaluation of NoSQL databases: a case study. In: Proceedings of the 1st Workshop on Performance Analysis of Big Data Systems. ACM, Austin, TX, 2015, 5–10.
  • 55. Liu ZH, Gawlick D. Management of flexible schema data in RDBMSs—opportunities and limitations for NoSQL. In: Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR ’15), January 4-7, 2015, Asilomar, CA.
  • 56. Pezoa F, Reutter JL, Suarez F. Foundations of JSON schema. In: Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, Montréal, Québec, Canada, 2016, 263–73.
  • 57. Liu ZH, Hammerschmidt B, McMahon D. JSON data management: supporting schema-less development in RDBMS. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, Snowbird, UT, 2014, 1247–58.
  • 58. Halevy A, Korn F, Noy NF. Goods: organizing Google's datasets. In: Proceedings of the 2016 International Conference on Management of Data. ACM, San Francisco, CA, 2016, 795–806.
  • 59. Codd EF. A data base sublanguage founded on the relational calculus. In: Proceedings of the 1971 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control. ACM, San Diego, CA, 1971.
  • 60. Wade T, Hum R, Murphy J. A Dimensional Bus model for integrating clinical and research data. J Am Med Inform Assoc 2011;18(Suppl 1):96–102.
  • 61. Harris PA, Taylor R, Thielke R. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 2009;42:377–81.
  • 62. Ohno-Machado L, Sansone SA, Alter G. Finding useful data across multiple biomedical data repositories using DataMed. Nat Genet 2017;49:816–19.
  • 63. Luo G, Frey LJ. Efficient execution methods of pivoting for bulk extraction of entity-attribute-value-modeled data. IEEE J Biomed Health Inform 2016;20:644–54.
  • 64. Chen RS, Nadkarni P, Marenco L. Exploring performance issues for a clinical database organized using an entity-attribute-value representation. J Am Med Inform Assoc 2000;7:475–87.
  • 65. Wang S, Pandis I, Wu C. High dimensional biological data retrieval optimization with NoSQL technology. BMC Genomics 2014;15(Suppl 8):S3.
  • 66. Dinu V, Nadkarni P, Brandt C. Pivoting approaches for bulk extraction of entity-attribute-value data. Comput Methods Programs Biomed 2006;82:38–43.
  • 67. Duftschmid G, Wrba T, Rinner C. Extraction of standardized archetyped data from electronic health record systems based on the entity-attribute-value model. Int J Med Inform 2010;79:585–97.
  • 68. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. In: OSDI '04 Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, 2004, 1–13.
  • 69. Zobel J, Moffat A. Inverted files for text search engines. ACM Comput Surv 2006;38:6.
  • 70. HHS. Department of Health and Human Services (HHS) Information Technology (IT) Strategic Plan, 2017-2020. http://www.hhs.gov/sites/default/files/itstrategicplan2017.pdf (30 January 2017, date last accessed).
  • 71. Casey JA, Schwartz BS, Stewart WF. Using electronic health records for population health research: a review of methods and applications. Annu Rev Public Health 2016;37:61–81.
  • 72. Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004;83:349–60.
  • 73. Jones AR, Miller M, Aebersold R. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol 2007;25:1127–33.
  • 74. Taylor CF, Paton NW, Lilley KS. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol 2007;25:887–93.
  • 75. Wang T, Wei JJ, Sabatini DM. Genetic screens in human cells using the CRISPR-Cas9 system. Science 2014;343:80–4.
  • 76. Korkmaz G, Lopes R, Ugalde AP. Functional genetic screens for enhancer elements in the human genome using CRISPR-Cas9. Nat Biotechnol 2016;34:192–8.
  • 77. Barbieri R, Guryev V, Brandsma CA. Proteogenomics: key driver for clinical discovery and personalized medicine. Adv Exp Med Biol 2016;926:21–47.
  • 78. Gaziano JM, Concato J, Brophy M. Million veteran program: a mega-biobank to study genetic influences on health and disease. J Clin Epidemiol 2016;70:214–23.
  • 79. PMI Working Group. The Precision Medicine Initiative Cohort Program (NIH). https://www.nih.gov/sites/default/files/research-training/initiatives/pmi/pmi-working-group-report-20150917-2.pdf (30 January 2017, date last accessed).
  • 80. Mahajan A, Go MJ, Zhang W. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat Genet 2014;46:234–44.
  • 81. Scott LJ, Erdos MR, Huyghe JR. The genetic regulatory signature of type 2 diabetes in human skeletal muscle. Nat Commun 2016;7:11764.
  • 82. Fuchsberger C, Flannick J, Teslovich TM. The genetic architecture of type 2 diabetes. Nature 2016;536:41–7.
  • 83. Locke AE, Kahali B, Berndt SI. Genetic studies of body mass index yield new insights for obesity biology. Nature 2015;518:197–206.
  • 84. Shungin D, Winkler TW, Croteau-Chonka DC. New genetic loci link adipose and insulin biology to body fat distribution. Nature 2015;518:187–96.
  • 85. Surendran P, Drenos F, Young R. Trans-ancestry meta-analyses identify rare and common variants associated with blood pressure and hypertension. Nat Genet 2016;48:1151–61.
  • 86. Ehret GB, Ferreira T, Chasman DI. The genetics of blood pressure regulation and its target organs from association studies in 342,415 individuals. Nat Genet 2016;48:1171–84.
  • 87. Wood AR, Esko T, Yang J. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet 2014;46:1173–86.
  • 88. NIH. NIH data sharing policy. https://grants.nih.gov/grants/policy/data_sharing/ (30 January 2017, date last accessed).
  • 89. Bill & Melinda Gates Foundation. Information sharing approach. http://www.gatesfoundation.org/How-We-Work/General-Information/Information-Sharing-Approach
  • 90. European Commission. Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020. https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf (30 January 2017, date last accessed).
  • 91. European Research Council. Guidelines on the Implementation of Open Access to Scientific Publications and Research Data. https://erc.europa.eu/sites/default/files/document/file/ERC_Guidelines_Implementation_Open_Access.pdf (30 January 2017, date last accessed).
  • 92. Cancer Research UK. Data sharing guidelines. http://www.cancerresearchuk.org/funding-for-researchers/applying-for-funding/policies-that-affect-your-grant/submission-of-a-data-sharing-and-preservation-strategy/data-sharing-guidelines (30 January 2017, date last accessed).
  • 93. Taichman DB, Backus J, Baethge C. Sharing clinical trial data–a proposal from the International Committee of Medical Journal Editors. N Engl J Med 2016;374:384–6.
  • 94. Duffy DJ. Problems, challenges and promises: perspectives on precision medicine. Brief Bioinform 2016;17:494–504.
  • 95. Cochrane GR, Galperin MY. The 2010 nucleic acids research database issue and online database collection: a community of data resources. Nucleic Acids Res 2010;38:D1–4.
  • 96. Goble C, Stevens R, Hull D. Data curation + process curation = data integration + science. Brief Bioinform 2008;9:506–17.
  • 97. Inmon WH. The Data Warehouse and Data Models. In: Building the Data Warehouse, 4th edn. Wiley Publishing, Inc., Indianapolis, IN, 2005, 79–99.
  • 98. Simitsis A. Mapping conceptual to logical models for ETL processes. In: Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP. ACM, Bremen, Germany, 2005, 67–76.
  • 99. Farid M, Roatis A, Ilyas IF. CLAMS: bringing quality to Data Lakes. In: Proceedings of the 2016 International Conference on Management of Data. ACM, San Francisco, CA, 2016, 2089–92.
  • 100. Madera C, Laurent A. The next information architecture evolution: the data lake wave. In: Proceedings of the 8th International Conference on Management of Digital EcoSystems. ACM, Biarritz, France, 2016, 174–80.
  • 101. Terrizzano I, Schwarz P, Roth M. Data wrangling: the challenging journey from the wild to the lake. In: Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR ’15), January 4-7, 2015, Asilomar, CA.
  • 102. Vassiliadis P, Simitsis A, Skiadopoulos S. Conceptual modeling for ETL processes. In: Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP. ACM, McLean, VA, 2002, 14–21.
  • 103. Hai R, Geisler S, Quix C. Constance: an intelligent Data Lake system. In: Proceedings of the 2016 International Conference on Management of Data. ACM, San Francisco, CA, 2016, 2097–100.
  • 104. Doan A, Halevy AY. Semantic-integration research in the database community. AI Magazine 2005;26:83–94.
  • 105. Halevy A. Technical perspective: schema mappings: rules for mixing data. Commun ACM 2010;53:100.
  • 106. Atikoglu B, Xu Y, Frachtenberg E. Workload analysis of a large-scale key-value store. In: Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems. ACM, London, England, UK, 2012, 53–64.
  • 107. Haas LM, Hentschel M, Kossmann D. Schema AND data: a holistic approach to mapping, resolution and fusion in information integration. In: Proceedings of the 28th International Conference on Conceptual Modeling. Springer-Verlag, Gramado, Brazil, 2009, 28–40.
  • 108. Halevy A. Why your data won't mix. Queue 2005;3:50–8.
  • 109. Halevy A, Rajaraman A, Ordille J. Data integration: the teenage years. In: Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, Seoul, Korea, 2006, 9–16.
  • 110. Doan A, Halevy AY. Semantic-integration research in the database community. AI Magazine 2005;26:83–94.
  • 111. Doan A, Domingos P, Halevy AY. Reconciling schemas of disparate data sources: a machine-learning approach. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. ACM, Santa Barbara, CA, 2001, 509–20.
  • 112. Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science 2015;349:255–60.
  • 113. Rampasek L, Goldenberg A. TensorFlow: biology's gateway to deep learning? Cell Syst 2016;2:12–14.
  • 114. Ghahramani Z. Probabilistic machine learning and artificial intelligence. Nature 2015;521:452–9.
  • 115. Alex B, Grover C, Haddow B. Assisted curation: does text mining really help? Pac Symp Biocomput 2008;13:556–67.
  • 116. Winnenburg R, Wachter T, Plake C. Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform 2008;9:466–78.
  • 117. Krallinger M, Valencia A. Text-mining and information-retrieval services for molecular biology. Genome Biol 2005;6:224.
  • 118. Peters SE, Zhang C, Livny M. A machine reading system for assembling synthetic paleontological databases. PLoS One 2014;9:e113523.
  • 119. Zhang C, Kumar A, Ré C. Materialization optimizations for feature selection workloads. ACM Trans Database Syst 2016;41:1–32.
  • 120. Dean J. Building machine learning systems that understand. In: Proceedings of the 2016 International Conference on Management of Data. ACM, San Francisco, CA, 2016, 1.
  • 121. Konda P, Das S, Suganthan P. Magellan: toward building entity matching management systems over data science stacks. Proc VLDB Endow 2016;9:1581–4.
  • 122. Shin J, Ré C, Cafarella M. Mindtagger: a demonstration of data labeling in knowledge base construction. Proc VLDB Endow 2015;8:1920–3.
  • 123. Luo G. MLBCD: a machine learning tool for big clinical data. Health Inf Sci Syst 2015;3:3.
  • 124. Miotto O, Tan TW, Brusic V. Supporting the curation of biological databases with reusable text mining. Genome Inform 2005;16:32–44.
  • 125. Pivovarov R, Elhadad N. Automated methods for the summarization of electronic health records. J Am Med Inform Assoc 2015;22:938–47.
  • 126. Mishra R, Bian J, Fiszman M. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform 2014;52:457–67.
  • 127. Luo G. PredicT-ML: a tool for automating machine learning model building with big clinical data. Health Inf Sci Syst 2016;4:5.
  • 128. Luo G. Automatically explaining machine learning prediction results: a demonstration on type 2 diabetes risk prediction. Health Inf Sci Syst 2016;4:2.
  • 129. Wang RY, Kon HB, Madnick SE. Data quality requirements analysis and modeling. In: Proceedings of the 9th International Conference on Data Engineering. IEEE Computer Society, Vienna, Austria, 1993, 670–77.
  • 130. Donovan S. Big data: teaching must evolve to keep up with advances. Nature 2008;455:461.
  • 131. Halevy A. Technical perspective: incremental knowledge base construction using DeepDive. SIGMOD Rec 2016;45:59.
  • 132. Zhang C, Kumar A, Ré C. Materialization optimizations for feature selection workloads. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, Snowbird, UT, 2014, 265–76.
  • 133. Zhang C, Niu F, Ré C. Big data versus the crowd: looking for relationships in all the right places. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, Jeju Island, Korea, 2012, 825–34.
  • 134. Shin J, Wu S, Wang F. Incremental knowledge base construction using DeepDive. Proc VLDB Endow 2015;8:1310–21.
  • 135. Sa CD, Ratner A, Ré C. DeepDive: declarative knowledge base construction. SIGMOD Rec 2016;45:60–7.
  • 136. Callaway E. Computers read the fossil record. Nature 2015;523:115–16.
  • 137. Popovic J. JSON Functionalities in Azure SQL Database (Public Preview). https://azure.microsoft.com/en-us/blog/json-functionalities-in-azure-sql-database-public-preview (30 January 2017, date last accessed).
  • 138. Betts R. How JSON Sparked NoSQL—and will Return to the RDBMS Fold. http://www.infoworld.com/article/2608293/nosql/how-json-sparked-nosql—-and-will-return-to-the-rdbms-fold.html (30 January 2017, date last accessed).
  • 139. Dageville B, Cruanes T, Zukowski M. The snowflake elastic data warehouse. In: Proceedings of the 2016 International Conference on Management of Data. ACM, San Francisco, CA, 2016, 215–26.
  • 140. MySQL. Functions That Search JSON Values. https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html (20 May 2017, date last accessed).
  • 141. PostgreSQL. JSON Functions and Operators. https://www.postgresql.org/docs/current/static/functions-json.html (20 May 2017, date last accessed).
  • 142. Levy E. Postgres vs. MongoDB for Storing JSON Data—Which Should You Choose. https://www.sisense.com/blog/postgres-vs-mongodb-for-storing-json-data/ (20 May 2017, date last accessed).
  • 143. Eberhard J. IBM DB2 for i: JSON Store Technology Preview. http://www.ibm.com/developerworks/ibmi/library/i-json-store-technology/ (11 January 2017, date last accessed).
  • 144. Tian Y, Ozcan F, Zou T. Building a hybrid warehouse: efficient joins between data stored in HDFS and enterprise warehouse. ACM Trans Database Syst 2016;41:1–38.
  • 145. Liu ZH, Hammerschmidt B, McMahon D. Closing the functional and performance gap between SQL and NoSQL. In: Proceedings of the 2016 International Conference on Management of Data. ACM, San Francisco, CA, 2016, 227–38.
  • 146. Tahara D, Diamond T, Abadi DJ. Sinew: a SQL system for multi-structured data. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, Snowbird, UT, 2014, 815–26.
  • 147. Teradata. Teradata JSON - Teradata Database. http://www.info.teradata.com/download.cfm?ItemID=1001873 (20 March 2017, date last accessed).
  • 148. Dhar V. Data science and prediction. Commun ACM 2013;56:64–73.
  • 149. Halevy AY, Ashish N, Bitton D. Enterprise information integration: successes, challenges and controversies. In: SIGMOD '05 Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, 2005, 778–87.
  • 150. Ingwersen P, Chavan V. Indicators for the Data Usage Index (DUI): an incentive for publishing primary biodiversity data through global information infrastructure. BMC Bioinformatics 2011;12(Suppl 15):S3.
  • 151. Sheehan J, Hirschfeld S, Foster E. Improving the value of clinical research through the use of common data elements. Clin Trials 2016;13:671–6.
  • 152. Warzel DB, Andonaydis C, McCurry B. Common data element (CDE) management and deployment in clinical trials. AMIA Annu Symp Proc 2003;1048.
  • 153. Covitz PA, Hartel F, Schaefer C. caCORE: a common infrastructure for cancer informatics. Bioinformatics 2003;19:2404–12.
  • 154. Nadkarni PM, Brandt CA. The common data elements for cancer research: remarks on functions and structure. Methods Inf Med 2006;45:594–601.
  • 155. NLM/NIH. NIH Common Data Element (CDE) portal. https://cde.nlm.nih.gov/home (30 January 2017, date last accessed).
  • 156. NCI/NIH. Cancer Data Standard Repository (caDSR). https://wiki.nci.nih.gov/display/caDSR/caDSR+Wiki (30 January 2017, date last accessed).
  • 157. Hecht R, Jablonski S. NoSQL evaluation: a use case oriented survey. In: Proceedings of the International Conference on Cloud and Service Computing 2011. IEEE, 2011, 336–41.
  • 158. FDA. CDER common data standards issues document. FDA, 2011. https://www.fda.gov/downloads/drugs/developmentapprovalprocess/formssubmissionrequirements/electronicsubmissions/ucm254113.pdf
  • 159. Tomczak K, Czerwinska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol 2015;19:A68–77.
  • 160. Huo D, Hu H, Rhie SK. Comparison of breast cancer molecular features and survival by African and European ancestry in The Cancer Genome Atlas. JAMA Oncol 2017, doi: 10.1001/jamaoncol.2017.0595.
