The Information Age has been promoted as a period in which access to information and data as the result of computerization is changing daily living. This period follows the Industrial Age in which the tools of manufacturing advanced the standard of living of humans but led to unanticipated and harmful impacts on the earth. The concept of an Information Age fails to address the real needs of society for a Knowledge Age, in which data and information lead to discovery and advancement.
The Information Age has brought about tremendous change in our daily lives. Even relatively minor but once essential tasks, such as memorizing phone numbers and learning to read a paper map, are now all but irrelevant in a world of smart phones and GPS. Telephone books are nearly extinct and maps are mostly collectibles. The library card catalog that older investigators had to master as a means of identifying information sources and references is a relic that none of us mourn. The personal computer, the smart phone, and the internet facilitate the exchange of massive troves of data through wireless data networks ubiquitous in most populated areas.
The examples of telephone books, maps, and card catalogs embody data. These data are of limited-to-no value—absent a question or query. The data these tools contain might be memorized by someone up to a point, but the capacity to provide comprehensive information on demand is the real value of those tools.
These data become information when a person uses them. A user might memorize a small portion of the information, such as the telephone numbers of family, friends, and work associates, or specific routes, but the amount of information most people retain is limited to what they use most. In the absence of context (i.e., routine necessity) or an inquiry, mostly unused phone numbers and untraveled routes are generally not important enough for someone to recall them from memory.
Data, when placed into a relationship with other information, can be synthesized into knowledge. Although knowledge is an element of most human endeavors, it is the central goal of science. In this model, data are the raw materials, much like grain or metal ores. Although data have potential value, they must be refined, gaining value in combination with other data, just as grains become bread and metal ores become steel. An example in science is The Cancer Genome Atlas (TCGA) (see https://cancergenome.nih.gov/), a multimodality catalog of genomic, RNA expression, diagnostic, and outcomes data, which can be downloaded and analyzed. The success of TCGA lies in its coordinate data, clinical classification, and outcomes. The individual data types cannot be interpreted without additional data. When data types are combined, interpretation is possible, but knowledge requires utility.
There is little doubt that data are enabling, but when they cannot be contextualized and transformed into information, their potential is unfulfilled. Increasingly, scientists are burdened with too much data, without the necessary linkage to other coordinate data. Furthermore, scientists often lack the tools to refine the data into information, let alone knowledge. This is a problem because current trends in science are data-intensive. Examples include whole genome sequencing and high-multiplex immunohistochemistry. In both examples, multitudes of data points are collected, but their impact is (currently) limited. The data are never evaluated at the object level but are stratified and segregated based on often a priori criteria of analyze/not analyze. Data points selected for analysis have passed some contextual criterion indicating that they contain information. Next, an evaluation is carried out to generate knowledge, with the eventual goal that knowledge will beget further knowledge.
This data glut is being facilitated by the development of data repositories. Funders and other organizations are vacuuming up raw data from studies into large databases. These databases often lack the critical tools to identify all of their relationships and link coordinated data, preventing the transformation into information. Unlinked and discoordinate data are siloed, aging and frequently losing value over time. Journals are following this trend, demanding data deposition in these repositories, with a few journals actually seeing submission of the entirety of the raw data from the studies in the manuscript. Although these requirements are laudable in the name of reproducibility and data sharing, it is unclear whether they are advancing the goals of these publications or of science as a whole.
As a result, many data repositories are functionally data museums. This analogy is appropriate, as the data have substantial value at the time of their collection but lose relevance over time, and only a relatively small fraction of the data become functional information. Our museums face this challenge on a daily basis: which objects should be maintained in collections and which items should be excluded or deaccessioned. Concurrently, museums struggle to select which artifacts (data) to present to visitors. The cost of maintaining and archiving these physical assets, huge data sets, or digital data is unappreciated and chronically underfunded. Curators are faced with the challenge of determining the future value of an archive, despite knowing it has not been analyzed at the time of collection and may lack the coordinate data to become useful information or lead to knowledge.
Despite current limitations, the tools to convert data to information are improving. The transformation of data into information has been enabled by robust relational database architecture. In the past, mining these databases required diligent people, but the databases are increasingly being probed by machine learning algorithms that can identify patterns and relationships that humans may miss. Although powerful and fast, these tools can only disclose existing relationships, as they lack the capacity to generate hypotheses, at least for now.
Today, the bottleneck has moved from collecting data to creating tools for transforming information into knowledge. I have the opportunity to review projects and manuscripts driven by molecular biologists, genomicists, and informaticists. They share a common disconnect: a lack of connection to pathophysiology. The consequences are missed opportunities and wasted resources, neither of which is good. The current paradigm used by these and other scientists is much like using an excavator for an archeological dig and only collecting those objects larger than 10 cm³. The objects may be well cataloged, studied, and valuable, but given their size limitations, they may not accurately or adequately convey the nuance needed to best understand a society or people.
This is not to argue that a constrained approach, insistent on analysis of every data point, is appropriate. Science would become stagnant due to the loss of opportunity that raw data and information afford. It is critical that scientists be aware of this dichotomy and strive to develop approaches that fully use the potential of all the data they collect. Data, information, and knowledge all have value, and this value changes over time. It is essential for the scientific community to develop approaches that fulfill the potential of the data it collects, such that we can realize the full potential of living in a “Knowledge Age.”
