Abstract
Data science, bioinformatics, and machine learning mark the advent and progression of the fourth paradigm of exploratory science. The need for human-supported algorithms to capture patterns in big data is at the center of personalized healthcare and directly related to translational research. This paper argues that hypothesis-driven and data-driven research work together to inform the research process. At the core of these approaches are theoretical underpinnings that drive progress in the field. Here, we present several exemplars of research on the gut-brain axis that outline the innate values and challenges of these approaches. Because nurses are trained to integrate multiple body systems to inform holistic human health promotion and disease prevention, nurses and nurse scientists serve an important role as mediators between this advancing technology and patients. At the center of person-knowing, nurses need to be aware of the data revolution and use their unique skills to supplement the data science cycle from data to knowledge to insight.
Keywords: Hypothesis-driven, data-driven, maternal-child health, informatics, microbiome, gut-brain axis
INTRODUCTION
Data science represents an interdisciplinary field that uses scientific methods, processes, and coded algorithms to extract knowledge and insight from data in various structured and unstructured forms.1 Bioinformatics, a subfield of data science, allows researchers to apply computational software to analyze large and/or multimodal biologic and physiologic datasets to gain mechanistic understanding of processes underlying health and disease.2 Skills in statistics, computer science, and domain knowledge are necessary for big data and bioinformatics application, which highlights both the velocity, volume, variety and veracity of data that cannot be processed with average computing power and the untapped potential of data that can be analyzed with the appropriate resources.3
Analysis, storage, and visualization of big data are just a few of the computational challenges that continue to increase as databases grow and analysis techniques become more advanced. Furthermore, as the number of data points per participant expands while analyses remain costly, studies increasingly need to handle many more features than participants. Some types of traditional statistical methods, often driven by p-value metrics and correlations, may be inadequate to detect the underlying truth as the size and heterogeneity of the data increase. The advent and evolution of precision medicine has highlighted that human responses to disease and interventions are highly personalized, and nurse scientists are especially poised to capitalize on holistic data integration and biomarker discovery.4 Research problems now regularly require multi-feature integration (e.g., human genetic data with microbial genetic data or magnetic resonance imaging [MRI] features) and quantification of nonlinear relationships to analyze the increasingly complicated research questions necessary to conceptualize biological understanding (see Figure 1 for examples of microbial and MRI features).
Figure 1. Examples of Biomarker and Patient Phenotype Features in Gut-Brain Axis Research.
MRI: Magnetic Resonance Imaging. Figure created with Biorender.com.
With the advent of machine learning and the proliferation of cutting-edge bioinformatics methodologies, new approaches in research design are being used to outline the future of prediction and exploratory science. High-quality science, including studies that produce valid and trustworthy results, incorporates consideration of these approaches as a significant component of the research design. Advanced algorithms include those at the center of machine learning, or the use of statistical techniques to give computer systems the ability to learn based on training data. Here, learning means progressively improving performance on a specific task with data, without being explicitly programmed to do so.5 Some researchers have argued that big data, bioinformatics, and machine learning are the advancement of the fourth paradigm of science, adding to the first three paradigms including exploratory, data-intensive exploration, and data mining.6 A list of definitions and key terminology is presented in Table 1.
Table 1.
Relevant Terms and Definitions
| Term | Definition |
|---|---|
| Data Science and Big Data Terms | |
| Black boxes | Models that are created directly from data by an algorithm without human input. Therefore, one cannot deconstruct how variables are being combined to make predictions.31 |
| Bioinformatics | Computational methods from various sub-fields of data science have come together to analyze biological data to gain mechanistic understanding of processes underlying health and disease.32 |
| Biomarker | A quantifiable characteristic as an indicator of normal biological/physiologic processes, pathologic processes, or responses to an exposure or intervention.33 |
| Data Science | An interdisciplinary field of research or inquiry where quantitative and analytical processes or pipelines are developed to extract knowledge and insight from large and/or complex data. |
| Decision Trees | A machine learning classification model that is formatted in a tree-like structure to make predictions based on a series of tests or decisions. Nodes on the tree structure represent a test of a specific feature and branches represent each of the possible test outcomes.34 |
| Feature | A measurable property of an input variable, also known as a dimension or attribute.35 |
| Machine Learning | A set of approaches that enable computers to learn from historical or input data without being explicitly programmed or controlled by humans.36 |
| Metadata | Background information about input data that provides information such as data acquisition methods, content, context, and structure.36 |
| Pipeline | Optimized sequence of data processing and analysis steps that can be applied to different datasets in a reproducible manner. |
| Random Forest Classifier | Machine learning technique that is used in regression and classification analyses. The random forest algorithm is composed of several decision trees, and the prediction is made by aggregating the outputs of the included decision trees (averaging for regression, majority vote for classification).37 |
| Structured Data | Data that is systematically organized and formatted, usually within fixed fields and columns, so it is easily searchable and manipulated in relational databases.38 |
| Supervised Machine Learning | Type of statistical learning method in which a labeled training dataset, with a known value for the target dependent variable, is used to allow the computer to learn how input feature patterns relate to dependent variable outcomes. The goal of supervised learning is to predict an output of interest.39 |
| Unstructured Data | Data that does not have a predefined organization format and is usually categorized as qualitative data (e.g., video files, text files, audio files). Unstructured data is more difficult to collect, label, and organize than structured data.38 |
| Unsupervised Machine Learning | Type of statistical learning method in which response labels are not available for observations so the computer infers patterns from a dataset. The goal of unsupervised learning is to define previously unknown underlying structures and patterns in a dataset.39 |
| Microbiome Science-Specific Terms | |
| 16S rRNA Gene | A gene specific to bacteria that is commonly used in microbiome analysis. This gene has (on average) 1541 base pairs and contains nine hypervariable regions, used to categorize bacteria, that are flanked by highly conserved regions that are similar across many bacteria and therefore ideal targets for polymerase chain reaction primers.40 |
| 16S rRNA Gene Amplicon Sequencing | A microbiome analysis technique that uses polymerase chain reaction to target, amplify and sequence portions of the hypervariable regions of the 16S rRNA gene to categorize bacterial taxa in a sample. |
| Alpha Diversity | A measure of microbial diversity within an individual sample, where one value is generated per sample. Metrics used to calculate alpha diversity include Chao1, Shannon index, Simpson index, and Faith’s phylogenetic diversity, among others. |
| Beta Diversity | A measure of differences in the overall microbial communities between groups of samples, usually grouped by a metadata variable (e.g., body site, control versus intervention). The multivariate microbiome data is reduced to one point per sample, and distances between samples are calculated using metrics such as Bray-Curtis dissimilarity, Euclidean distance, and weighted UniFrac. |
| Differential Abundance | An analysis method that aims to determine which microbial taxa significantly differ in relative abundance between samples grouped by a metadata variable. Tools used to calculate differential abundance include ALDEx2, ANCOM, and DESeq2, among others. |
| Functional Profiling | An analysis method that aims to describe the metabolic potential of a microbial community by characterizing the presence or absence and abundance of microbial-associated metabolic pathways either directly using shotgun metagenomics sequencing41, or indirectly by identifying pathways associated with annotated bacteria using 16S rRNA sequencing.42 |
| Human Microbiome | The human microbiome is the collection of bacteria that live in and on, and coexist with, the human body, and their genes.43 |
| Metabolomics | Metabolomics generates a profile of small molecules from a biological sample based on a substance’s mass-to-charge ratio using methods such as mass spectrometry or liquid chromatography.44 |
| Relative Abundance | The proportion of the microbial taxa that exists in a microbiome sample relative to the total number of identified organisms (expressed as a percentage). |
| Shotgun Metagenomics Sequencing | Sequencing of all genomic deoxyribonucleic acid present in a sample which provides sequencing information about metabolic genes in addition to bacterial genes, providing more information about the functional and metabolic potential of a microbiome community. |
| Neuroscience-Specific Terms | |
| Brain Oxygen Level-Dependent Imaging | A method of functional MRI imaging to observe different areas of brain activation in response to prompts, tasks or stimuli by measuring blood flow changes in brain regions of interest. Brain oxygen level-dependent imaging uses magnetic fields to measure changes in oxygenation in order to infer neuronal activity. |
| Brain Region of Interest | In neuroimaging, a brain region of interest is a selected cluster of voxels or a specific anatomical brain area that is targeted when performing analyses or investigating the effects of a stimulus. These regions of interest are often selected a priori based on previous literature and study hypotheses. |
| Functional MRI | A category of MRI that aims to measure and map brain activity by detecting changes associated with blood flow in response to a prompt or task (brain oxygen level-dependent imaging) or measuring brain connectivity within and across brain regions of interest while not performing any specific activity (resting-state functional MRI). |
| Brain Morphometry | Neuroimaging analysis that aims to measure the size and shape of brain structures and how they change during aging/development and disease.45 |
| Structural MRI | A neuroimaging method that quantifies the size, shape and integrity of gray and white matter structures in the brain. Different pulse sequences and protocols can be used to target various characteristics of normal and abnormal brain tissue.46 |
Abbreviations. ALDEx2: Analysis of variance-Like Differential Expression tool version 2; ANCOM: ANalysis of Composition of Microbiomes; MRI: magnetic resonance imaging; rRNA: ribosomal ribonucleic acid.
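To make the microbiome terms in Table 1 more concrete, the following is a minimal sketch of how relative abundance, one alpha diversity metric (Shannon index), and one beta diversity metric (Bray-Curtis dissimilarity) can be computed from raw taxon counts. The sample names and read counts are hypothetical and chosen only for illustration; published analyses typically rely on dedicated microbiome software rather than hand-written functions.

```python
import math


def relative_abundance(counts):
    """Convert raw taxon counts for one sample into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(counts):
    """Alpha diversity: Shannon index H = -sum(p * ln p) over taxa with p > 0."""
    proportions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in proportions if p > 0)


def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples.

    0 means identical counts; 1 means the samples share no taxa.
    """
    taxa = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(t, 0), counts_b.get(t, 0)) for t in taxa)
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 1 - 2 * shared / total


# Hypothetical read counts per genus for two stool samples (illustrative only).
sample_a = {"Bacteroides": 50, "Lactobacillus": 30, "Bifidobacterium": 20}
sample_b = {"Bacteroides": 10, "Lactobacillus": 10, "Escherichia": 80}

print(relative_abundance(sample_a))
print(round(shannon_index(sample_a), 3), round(shannon_index(sample_b), 3))
print(round(bray_curtis(sample_a, sample_b), 3))
```

Each alpha diversity call returns one value per sample, while the Bray-Curtis function returns one value per pair of samples, which mirrors the distinction drawn in the table.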
Two predominant views for designing a research study in pre-clinical and translational research are data-driven (DD) and hypothesis-driven (HD) approaches. HD research questions stem from the traditional scientific method, where prior literature informs the study aims and hypotheses. The study hypotheses are the foundation of the research design and analysis methods, and the concretely defined hypotheses are tested and then accepted or rejected. On the other hand, DD research questions are generally exploratory and are used when previous research is limited, relationships are unknown, and well-defined hypotheses cannot be formed. In DD research, scientists examine data in an unbiased way, and patterns or signatures from the data are captured to identify potential associations between variables of interest or biomarker signatures that are predictive of or associated with disease. DD research allows the data “to speak for itself” and results can be used to generate or inform future HD research (and vice versa). While the detailed methodologies within each approach vary, this paper argues that DD and HD research work together in the data science lifecycle to span data management, cleaning, transformation, modeling, and validation of results.
The gut-brain axis, or the gut-microbiome-brain axis, is the bidirectional manner in which the gut microbiome, or the collection of the organisms that live in the intestine and their genes, communicates with the brain.7 Current understanding of the mechanisms by which the human gut microbiome communicates with the brain to maintain homeostasis and health is still evolving, but advancements in neuroimaging and gut microbiome bioinformatics have allowed researchers to demonstrate negative associations that alterations in gut microbiome community have with neurologic function and mental health (please see Table 1 for a list of relevant definitions in gut-brain axis and data science research).8 Modulation of this gut-brain communication by the microbiome occurs through a variety of mechanisms centered around vagal interoceptive processes, including metabolite-based signaling and regulation of immune cells.9,10 To explore these mechanisms, scientists use multi-modal technology methods, such as MRI paired with omics methods (amplicon sequencing, shotgun metagenomics sequencing, and/or metabolomics), to understand how changes in microbiome community structure and function can influence the brain. Despite advances and increased accessibility in both fields of MRI and microbiome science, the output that results from these analysis methodologies can span from hundreds to thousands of features. Therefore, careful planning regarding the use of HD versus DD research strategies (or a combination of the two) based on understanding of the research conceptual framework and underlying questions/aims for high-data problems is essential. Importantly, planning for possible HD and DD research strategies is applicable to research aims incorporating big data or multifactorial data integration, and should be considered regardless of the patient phenotype, analysis methods or model organism of interest used in the research.
Nurses routinely integrate multiple types of data and information in their assessment and management of patients in clinical practice. For example, appreciation of the interconnected nature of multiple body systems is the foundation of holistic nursing care and nurses regularly integrate multi-modal data from various sources including vital signs, mental health and home environment/social support in patient assessment for disease management and health promotion. Furthermore, the combination of advanced research training with clinical perspective and experience makes nurse scientists important contributors to translational research incorporating complex multi-system problems like microbiome science.4 It is critical for nurse researchers and clinicians to understand that bioinformatics has the potential to explain nursing phenomena through data sources such as genomics, imaging, mobile applications, electronic health records, and biosensors. In complement, nursing science can bring theory-driven science, clinical content expertise, and concern for ethical, legal, and social implications. This paper will discuss a comparison of DD and HD approaches, evaluation of their respective value, limitations and applications, and further explore each approach in the context of examining the relationship between the gut-brain axis and health outcomes during and after pregnancy. Importantly, the unique contribution that nurses bring to the advancement of science will be included.
Methods and exemplars: hypothesis-driven approaches
Traditional research methods are those driven by the scientific method. A typical application of the scientific method includes the development of a hypothesis based on theoretical knowledge and previous research. HD research is underlined by deductive reasoning, defined as a top-down approach in which a conclusion is derived from multiple premises that are assumed to be true. The resulting conclusions are formed from general theoretical principles.
Researchers can employ HD approaches to frame the endpoints or outcomes in a study. For example, many high-quality studies have been conducted that identify postpartum depression as a common and distressing occurrence post-pregnancy.11 While scientists have defined associations between breastfeeding and depression, there is a need for mechanistic studies to understand the biological and psychological factors underlying this possible connection.12,13 Breastfeeding has been hypothesized to modulate the infant and maternal gut-brain axis through hormonal, immune and stress pathways.14,15 Therefore, the HD knowledge identifies postpartum depression and breastfeeding as an outcome of interest and modulating factor, respectively, and can guide study design and the conceptual framework of the research aims.
In gut-brain axis research, HD approaches can also inform study design and analysis decisions to localize neuroimaging brain regions of interest. Gao et al. studied the gut-brain axis in infants, using resting state functional MRI to quantify functional connectivity and compare brain connectivity values with gut microbiome changes.16 Informed by their prior research17, these investigators targeted MRI analyses on the amygdala and thalamus as their brain regions of interest to compare functional changes to the microbiome in a targeted manner. The team focused on correlations between global microbiome diversity and amygdala/thalamic functional connectivity measures in their analysis. HD research can also inform other ways to test focused gut-brain associations through bioinformatics pipelines, such as an automated comparison of the relative abundance levels of specific gut microbiome bacteria with amygdala/thalamic connectivity values, or the use of machine learning (e.g., decision trees) to evaluate which microbiome or connectivity features are associated with differences in infant phenotype variables of interest like cognitive function or anxiety-like behaviors. Because it is not infallible, HD work requires further testing and validation in generalizable populations to refine the underlying concept and contribute to the published literature.
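As an illustration of the decision tree idea mentioned above, the sketch below finds the single best test ("feature value <= threshold") for separating two phenotype groups, which is the operation performed at each node of a decision tree. The Gini impurity criterion, the function names, and the relative abundance values are assumptions made for this example; they are not taken from the cited studies.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels; 0 means a pure group."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test 'value <= threshold'.

    Candidate thresholds are midpoints between consecutive distinct values.
    """
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical relative abundance of one taxon in six infants, with a binary
# phenotype label (1 = anxiety-like behavior present). Illustrative only.
abundance = [0.05, 0.10, 0.12, 0.30, 0.35, 0.40]
phenotype = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(abundance, phenotype)
print(threshold, impurity)
```

A full decision tree repeats this search recursively within each resulting group and across all candidate features, which is why the selected features can be read directly from the tree.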
Hypothesis-driven approach considerations
The benefits of HD research approaches include theoretically driven research questions and well-defined analysis strategies and workflows. Most important to translational scientists and clinicians are the readily interpretable results of traditional statistical approaches and well-understood methodologies. For example, the R² value for a fit of a linear model is defined as the proportion of the variance in the dependent variable that is predicted from the independent variable. HD approaches are well represented in the maternal-child literature. In a pre-clinical gut-brain axis study, Bravo et al. hypothesized that administration of L. rhamnosus would modulate stress behaviors in mice.18 A two-way ANOVA test showed that there was a significant interaction between acute stress (stress versus control groups) and L. rhamnosus treatment (F(1, 28) = 7.425; P = 0.011), and stressed mice given L. rhamnosus treatment had significantly lower plasma corticosterone levels compared to untreated stressed counterparts (P < 0.001). While this study is pre-clinical and translational implications are unknown, the rigorous hypothesis testing provides insight on plausible gut microbiome-mediated signaling mechanisms to the brain that can be tested or confirmed in future research.
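The interpretability of the coefficient of determination described above can be shown with a short worked example. The sketch below fits a simple least-squares line and reports the proportion of variance in the dependent variable explained by the independent variable. The function name and the toy data are assumptions for illustration and are unrelated to the Bravo et al. data.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: variation left unexplained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation of y around its own mean.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data (illustrative only): a noisy, roughly linear relationship.
dose = [1, 2, 3]
response = [1, 3, 2]
print(r_squared(dose, response))  # 0.25: the line explains 25% of the variance
```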
Challenges of HD approaches include general issues with poorly conducted research. Traditional statistical approaches are limited by the types of data that are able to be processed. For example, the assumptions of linear regression include normally distributed residuals and a linear relationship between the predictors and the outcome variable(s). If these assumptions are not met, transformation of the data is needed in order to proceed. While transformation itself is not a limitation and is often a necessary component of thorough statistical analyses, it is important to note that real world, dynamic problems are often not normally distributed or linearly defined. The narrowed focus of HD research and reliance on p-values to accept or reject hypotheses may overlook nonlinear relationships or complex mechanisms that may be present in multi-system processes such as gut-brain axis communication. Additionally, randomized controlled trials, though the gold standard for determining an effect of an intervention, can be difficult to implement, and participants have been shown to not always represent the true variance in a larger population.19
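The risk of overlooking nonlinear relationships can be demonstrated with a small example. In the sketch below, an outcome depends perfectly on a predictor through a quadratic relationship, yet the Pearson correlation, a linear measure, is zero. The data are synthetic and the function name is an assumption made for this illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, which measures linear association only."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Synthetic predictor centered on zero, with two outcomes that are both
# fully determined by it.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]    # linear dependence
y_quadratic = [v ** 2 for v in x]    # U-shaped (nonlinear) dependence

print(pearson_r(x, y_linear))
print(pearson_r(x, y_quadratic))
```

The second correlation is zero even though the outcome is completely determined by the predictor, so an analysis that relies only on linear correlation would report no relationship.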
Methods and exemplars: data-driven approaches
DD approaches use inductive reasoning to drive the production of a hypothesis and the development of theory, and DD approaches are commonly employed in big data, bioinformatics, and machine learning analysis pipelines. Inductive reasoning is the act of generalization based on many individual observations.20 Classic approaches in inductive reasoning come from the qualitative paradigm. Qualitative, inductive work is exploratory in nature and allows for greater understanding of the experiences of individuals without prior knowledge or premise.
Combined neuroimaging and microbiome studies of the gut-brain axis commonly use quantitative DD approaches due to the limited mechanistic understanding and multi-modal/multi-feature nature of microbiome and MRI data. For example, a quantitative DD study conducted by Dong et al. (2020) used machine learning techniques to evaluate relationships between functional MRI, gut microbiome, and fecal metabolome features in overweight subjects with and without food addiction.21 The research team utilized a DD multilevel sparse partial least squares linear discriminant analysis to identify bacterial taxa (operational taxonomic units) that discriminated subjects with and without food addiction. The researchers subsequently used a random forest classifier to predict food addiction phenotype, using as input the gut microbiome bacteria, metabolite, and brain imaging features that were significantly different between groups. The area under the curve for the prediction accuracy of the random forest classifier for food addiction phenotype was 0.81, and the variables with the highest predictive scores were brain imaging features and the fecal metabolite indolepropionate.21 Similarly, Peter et al. used decision trees and a random forest classifier to predict psychological distress using microbiome community patterns.22 Many of the gut microbiome bacteria that were the top features used in psychological distress prediction were bacteria from the phylum Firmicutes, and future work can be performed to determine the functional significance of the bacterial features identified. This combination of DD approaches illustrates the range of methodologies that contribute to insight and knowledge generation.
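To help interpret a reported area under the curve, the sketch below computes it directly from a classifier's predicted scores: the value equals the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. The labels and scores are hypothetical, and this pairwise formulation is one standard way to compute the metric; it is not presented as the procedure used in the cited studies.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1 = case) and predicted scores.

    Computed as the fraction of (case, non-case) pairs in which the case has
    the higher score; tied scores count as half a win.
    """
    case_scores = [s for lab, s in zip(labels, scores) if lab == 1]
    control_scores = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical phenotype labels and classifier scores (illustrative only).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 0.75: 3 of the 4 case/non-case pairs are ranked correctly
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation of the two groups.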
Diverse data is available for research on the influence of the gut microbiome on neurologic symptoms and outcomes; Table 2 lists examples of relevant data. Relevant features related to the study of the gut microbiome environment that may be included in a bioinformatics pipeline include survey-level data (e.g., dietary recalls, census statistics), electronic health record data (e.g., family history, clinic weight trends, patient-reported symptoms, narrative notes), sensor-derived data (e.g., Fitbit, actigraphy), and biological data (e.g., fecal or blood samples, lab values). For understanding of the brain, pertinent data could include physiological data (e.g., EKG, EEG, blood pressure, functional MRI activation), anthropometric data (e.g., weight, length), and biological data. These data types each have their own considerations for analysis (e.g., categorical vs. continuous, text vs. numeric, privacy, access) and long-term storage. Rigorous tracking of metadata, or background information about the input data, is also necessary for quality control. Some examples of metadata in microbiome input data are sequencing platform, sequencing run, and deoxyribonucleic acid extraction kit, while metadata related to MRI input data may be MRI scanner model, field strength (1.0 Tesla versus 3.0 Tesla), and acquisition parameters.
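One way metadata tracking can support quality control is sketched below. Each sample carries a small metadata record, and a helper flags sequencing runs that contain samples from only one study group, a situation in which a technical batch effect could not be distinguished from a true group difference. The record fields, the function name, and the specific check are assumptions chosen for illustration, not a prescribed quality-control procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMetadata:
    """Hypothetical metadata record for one microbiome sample."""
    sample_id: str
    study_group: str       # e.g. "control" or "intervention"
    sequencing_run: str
    extraction_kit: str


def single_group_runs(records):
    """Return sequencing runs whose samples all come from one study group."""
    groups_per_run = {}
    for record in records:
        groups_per_run.setdefault(record.sequencing_run, set()).add(record.study_group)
    return sorted(run for run, groups in groups_per_run.items() if len(groups) == 1)


# Illustrative records only; identifiers and kit names are made up.
records = [
    SampleMetadata("S1", "control", "run_A", "kit_1"),
    SampleMetadata("S2", "intervention", "run_A", "kit_1"),
    SampleMetadata("S3", "intervention", "run_B", "kit_1"),
    SampleMetadata("S4", "intervention", "run_B", "kit_1"),
]

print(single_group_runs(records))  # ['run_B']
```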
Table 2:
Values and challenges in data-driven and hypothesis-driven approaches.
| Data type | Microbiome/Gut | Brain |
|---|---|---|
| Examples of relevant data | | |
| Value of DD* | | |
| Challenges of DD* | | |
| Value of HD# | | |
| Challenges of HD# | | |
*DD = data-driven
#HD = hypothesis-driven
RCT = randomized controlled trial, EEG = electroencephalogram, EKG = electrocardiography. This table is adapted from previously published work.27
It is important to note that these data may be in structured or unstructured formats. Structured data includes data that is organized in an established database with strict rules for how the data is categorized. Unstructured data is loosely organized or only in computer-readable language. Both of these data types have associated challenges that make drawing further insight more difficult. A significant component of the data science process includes the cleaning and transformation of data as it relates to the analysis.
Data-driven approach considerations
In the assessment of DD analysis approaches to the gut-brain axis and health outcomes, there are several value-added benefits, although little has been done in this area because HD research has predominantly been used. First, it is important to know that DD does not mean void of a research question. More often, a DD investigation arises from real-world concerns in which researchers suspect that the pathways underlying the phenomena of interest are multifactorial and require rigorous but exploratory analysis workflows. Second, DD work helps to fill a unique gap in the published literature. Further, the exploratory nature of DD and inductive research helps to generate hypotheses that can then be tested in other rigorous approaches. This creation of a hypothesis contributes to the production of theory over time. Third, DD is usually viewed as an innovative approach because of a lack of previous work. Due to the lack of published premise, exploratory research is a necessary step. Fourth, DD approaches can result in multiple competing hypotheses that can provide a more robust perspective of the problem by addressing the multi-dimensional nature of the world.
The significant challenges of DD approaches in this research should be considered. First, there is a lack of well-known validation approaches for many DD methodologies. Because the results of a DD study are exploratory or may highlight patterns not otherwise published, it can be difficult to assess their true validity. Second, many data science algorithms, like neural networks, have a “black box” nature and often are not interpretable for science implemented at the bedside. Black boxes are systems, including algorithms or devices, in which little is known about the core mechanisms that lead to a result despite the insertion of defined inputs and the identification of desired output.23 To the observer, how the output is created is unknown. Some recent theories have tried to deconstruct the “decision-making” of machines, yet there is still little movement forward. Due to the poor mathematical understanding and interpretation of the latent representations within black box systems, DD approaches are decidedly susceptible to low quality science and interpretation. Further, the output is subject to misuse and ethically compromised decision-making.
Using data-driven analysis approaches to inform hypothesis-driven research
In genomic research on the gut-brain axis and health, HD and DD research both contribute to the scientific literature, but for different purposes. Some argue that genomic research, particularly with microbiome samples, is still in an exploratory stage requiring DD workflows in the development of biological theory and reference genomes for validation testing.24,25 Many high impact big-data studies with a DD design rest on the understanding that the process of model building is supported by a saturation of data and is less dependent on hypotheses.26 Genomic data are by nature high-dimensional, and therefore DD studies can help to generate hypotheses. For example, exploratory DD bioinformatics pipelines might identify important bacteria from the gut microbiome, and researchers can then test the functional potential of these bacterial strains in HD research. Similarly, important microbiome features identified from DD research can be tested in clinical trials to evaluate the impact on patient symptoms or disease processes in HD research models. The merit of DD research comes in complement to HD research with theory as a connecting piece. There are limitations to both approaches, including the need to weigh data governance, methodological rigor, and ethical implications, which should be evaluated appropriately and from different perspectives.
Using hypothesis-driven approaches to inform data-driven research
HD research methods can inform DD research. The ability to control and standardize external variables that confound human translational research is why pre-clinical research is often a starting point when performing mechanistic research. Unfortunately, there is not always concordance between pre-clinical and translational research results when studies are replicated in humans. Therefore, if features or biomarkers identified by pre-clinical research are strictly tested using HD research, relevant features may be missed in different models. Alternatively, information from HD pre-clinical research can inform study design and the overarching aims, but allowing an exploratory DD-based aim will ensure important relationships that may happen in a more complex system are not overlooked.
Both DD and HD approaches can be applied to examine relationships between microbiome features and neurologic symptoms to inform patient outcomes. Table 2 outlines some major considerations for assessing the value and challenges related to the various approaches. Of note, the table outlines DD approaches outside of the qualitative paradigm. Knowing that a theory is a rigorously studied, well-substantiated explanation of a problem, it is difficult to separate theory from both DD and HD work. Both approaches contribute to the development and application of theoretical concepts. Figure 2 illustrates both the inductive and deductive approaches within a researcher’s epistemological viewpoint. As highlighted in black, theory is a significant component of quality science in both HD and DD approaches and reproducible bioinformatics analysis pipelines.
Figure 2. Comparison of data-driven and hypothesis-driven approaches.
Nursing contribution and impact
Nurse scientists are uniquely positioned to contribute to both DD and HD research and data science, as the most impactful science is informed by real-world problems and observations. Dreisbach & Koleck have already expounded on the key definitions of data science methods, such as machine learning and artificial intelligence, in their commentary on data science as it relates to genomic nursing.27 Further, they offer pathways in which nurses and nurse scientists can gain skills and technical expertise at this unique scientific intersection.27 The emergence of a team science or team-based approach for DD research teams includes computer scientists, biologists, and clinicians as team members and is timely for the interdisciplinary investigation of these advances.28 All team members undertaking a DD research problem bring unique perspectives and contributions to the team. For example, computer scientists or bioinformaticians are important to develop reproducible and efficient analysis workflows, while translational scientists or clinicians can help uncover the clinical significance or the “so what” of research results stemming from DD analysis. In addition, cross-training that allows the interdisciplinary team members to speak the same language when defining and troubleshooting analysis challenges improves communication and productivity, and nurse scientists are increasingly seeking formal training in data science or bioinformatics in addition to graduate-level research training. Guidelines have been developed to support the infusion of data science into doctoral education for nurse scientists, including the Data Science Curriculum Organizing Model introduced by Shea et al.29
Nurses and nurse scientists are well poised to synthesize the multi-modal data that comes from studying the role of the gut-brain axis in human health as they are trained to synthesize multiple data sources in clinical practice. Central to the translation of bioinformatics tools and insights learned from genomic analyses is the application of healthcare technology by practicing nurses and nursing informaticists. Translating new evidence, even at the biological level, requires core tenets of informatics to promote the health and wellbeing of our patients. Several high-impact nursing organizations, such as Johnson & Johnson and the National Institute of Nursing Research, have advocated for nursing’s inclusion in the advancement and application of data science and data-driven research.30
CONCLUSION
DD and HD analysis approaches reflect specific values and challenges in relation to the research goal. We argue that both approaches work together, each informed by theory, to move the science forward. With increasing access to large and expansive datasets, it is important to consider more advanced algorithms to explore patterns unseen by the human eye. Notably, for questions that address a smaller set of data, traditional statistical applications and advanced computing may perform very similarly. However, it is hard to ignore that more data require increasingly advanced algorithms to overcome issues with type I error, or false positive findings. Regardless of method or approach, the academic community will benefit from encouraging the production of quality science rather than a specific paradigm. By conducting quality research and using bioinformatics analysis pipelines that facilitate person-knowing, nurse scientists can leverage key opportunities to expand the role of nursing knowledge.
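The link between larger datasets and false positive findings can be illustrated with a short calculation. The sketch below assumes independent tests, each run at a significance level of 0.05, and uses a Bonferroni adjustment only as a simple, familiar example of a correction; it is not the method of any study cited here, and the function names are hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / n_tests


# Compare a single pre-specified test with screens of growing size.
for n_tests in (1, 10, 100, 1000):
    uncorrected = familywise_error_rate(n_tests)
    per_test = bonferroni_threshold(n_tests)
    corrected = familywise_error_rate(n_tests, alpha=per_test)
    print(
        f"{n_tests:>5} tests: uncorrected rate = {uncorrected:.3f}, "
        f"Bonferroni per-test alpha = {per_test:.5f}, "
        f"corrected rate = {corrected:.3f}"
    )
```

Under these assumptions, a single test carries a 5% chance of a false positive, while an uncorrected screen of many features makes at least one false positive increasingly likely. This is one reason analyses with many more features than participants call for methods that account for the number of comparisons.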
Funding:
This research is supported by intramural research funds from the National Institutes of Health Clinical Center (KAM).
ABBREVIATIONS
- DD: Data-Driven
- HD: Hypothesis-Driven
- MRI: Magnetic Resonance Imaging
Footnotes
Conflict of Interest: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
REFERENCES
- 1. Dhar V. Data science and prediction. Commun ACM. 2013;56(12):64–73. doi: 10.1145/2500499
- 2. Diniz WJS, Canduri F. REVIEW-ARTICLE Bioinformatics: an overview and its applications. Genet Mol Res. 2017;16(1). doi: 10.4238/gmr16019645
- 3. Kitchin R, McArdle G. What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data Soc. 2016;3(1). Accessed October 13, 2021. https://journals.sagepub.com/doi/10.1177/2053951716631130
- 4. Maki KA, Joseph, Ames NJ, Wallen GR. Leveraging Microbiome Science From the Bedside to Bench and Back: A Nursing Perspective. Nurs Res. 2021;70(1):3–5. doi: 10.1097/NNR.0000000000000475
- 5. Koza JR, Bennett FH, Andre D, Keane MA. Automated Design of Both the Topology and Sizing of Analog Electrical Circuits Using Genetic Programming. In: Gero JS, Sudweeks F, eds. Artificial Intelligence in Design ’96. Springer Netherlands; 1996:151–170. doi: 10.1007/978-94-009-0279-4_9
- 6. Kitchin R. Big Data, new epistemologies and paradigm shifts. Big Data Soc. 2014;1(1). Accessed October 13, 2021. https://journals.sagepub.com/doi/10.1177/2053951714528481
- 7. Cryan J, O’Riordan K, Cowan C, Sandu K, Bastiaanssen T. The Microbiota-Gut-Brain Axis. Physiol Rev. 2019;99(4):1877–2013.
- 8. Dinan TG, Cryan JF. The Microbiome-Gut-Brain Axis in Health and Disease. Gastroenterol Clin North Am. 2017;46(1):77–89. doi: 10.1016/j.gtc.2016.09.007
- 9. Fitzpatrick Z, Frazer G, Ferro A, et al. Gut-educated IgA plasma cells defend the meningeal venous sinuses. Nature. 2020;587(7834):472–476. doi: 10.1038/s41586-020-2886-4
- 10. Silva YP, Bernardi A, Frozza RL. The Role of Short-Chain Fatty Acids From Gut Microbiota in Gut-Brain Communication. Front Endocrinol. 2020;11:25. doi: 10.3389/fendo.2020.00025
- 11. Becker M, Weinberger T, Chandy A, Schmukler S. Depression During Pregnancy and Postpartum. Curr Psychiatry Rep. 2016;18(3):32. doi: 10.1007/s11920-016-0664-7
- 12. Figueiredo B, Dias CC, Brandão S, Canário C, Nunes-Costa R. Breastfeeding and postpartum depression: state of the art review. J Pediatr (Rio J). 2013;89(4):332–338. doi: 10.1016/j.jped.2012.12.002
- 13. Lara-Cinisomo S, McKenney K, Di Florio A, Meltzer-Brody S. Associations Between Postpartum Depression, Breastfeeding, and Oxytocin Levels in Latina Mothers. Breastfeed Med Off J Acad Breastfeed Med. 2017;12(7):436–442. doi: 10.1089/bfm.2016.0213
- 14. Anderson G, Vaillancourt C, Maes M, Reiter R. Breastfeeding and the gut-brain axis: is there a role for melatonin? Biomol Concepts. 2017;8(3):185–195. doi: 10.1515/bmc-2017-0009
- 15. Brennan PA, Dunlop AL, Smith AK, Kramer M, Mulle J, Corwin EJ. Protocol for the Emory University African American maternal stress and infant gut microbiome cohort study. BMC Pediatr. 2019;19(1):246. doi: 10.1186/s12887-019-1630-4
- 16. Gao W, Salzwedel AP, Carlson AL, et al. Gut microbiome and brain functional connectivity in infants-a preliminary study focusing on the amygdala. Psychopharmacology (Berl). 2019;236(5):1641–1651. doi: 10.1007/s00213-018-5161-8
- 17. Salzwedel AP, Stephens RL, Goldman BD, Lin W, Gilmore JH, Gao W. Development of Amygdala Functional Connectivity During Infancy and Its Relationship With 4-Year Behavioral Outcomes. Biol Psychiatry Cogn Neurosci Neuroimaging. 2019;4(1):62–71. doi: 10.1016/j.bpsc.2018.08.010
- 18. Bravo J, Forsythe P, Chew M, et al. Ingestion of Lactobacillus strain regulates emotional behavior and central GABA receptor expression in a mouse via the vagus nerve. PNAS. 2011;108(38):16050–16055.
- 19. Kennedy-Martin T, Curtis S, Faries D, Robinson S, Johnston J. A literature review on the representativeness of randomized controlled trial samples and implications for the external validity of trial results. Trials. 2015;16(1):495. doi: 10.1186/s13063-015-1023-4
- 20. Burk M. The Scientific Method. Science. 1986;231(4739):659.
- 21. Dong TS, Mayer EA, Osadchiy V, et al. A Distinct Brain-Gut-Microbiome Profile Exists for Females with Obesity and Food Addiction. Obes Silver Spring Md. 2020;28(8):1477–1486. doi: 10.1002/oby.22870
- 22. Peter J, Fournier C, Durdevic M, et al. A Microbial Signature of Psychological Distress in Irritable Bowel Syndrome. Psychosom Med. 2018;80(8):698–709. doi: 10.1097/PSY.0000000000000630
- 23. Brouillette M. Deep Learning Is a Black Box, but Health Care Won’t Mind. MIT Technology Review. Published 2017. Accessed October 13, 2021. https://www.technologyreview.com/2017/04/27/242905/deep-learning-is-a-black-box-but-health-care-wont-mind/
- 24. Kyrpides NC, Eloe-Fadrosh EA, Ivanova NN. Microbiome Data Science: Understanding Our Microbial Planet. Trends Microbiol. 2016;24(6):425–427. doi: 10.1016/j.tim.2016.02.011
- 25. Tripathi A, Marotz C, Gonzalez A, et al. Are microbiome studies ready for hypothesis-driven research? Curr Opin Microbiol. 2018;44:61–69. doi: 10.1016/j.mib.2018.07.002
- 26. Mazzocchi F. Could Big Data be the end of theory in science?: A few remarks on the epistemology of data-driven science. EMBO Rep. 2015;16(10):1250–1255.
- 27. Dreisbach C, Koleck TA. The State of Data Science in Genomic Nursing. Biol Res Nurs. 2020;22(3):309–318. doi: 10.1177/1099800420915991
- 28. Stulberg E, Fravel D, Proctor LM, et al. An assessment of US microbiome research. Nat Microbiol. 2016;1(1):1–7. doi: 10.1038/nmicrobiol.2015.15
- 29. Shea KD, Brewer BB, Carrington JM, Davis M, Gephart S, Rosenfeld A. A model to evaluate data science in nursing doctoral curricula. Nurs Outlook. 2019;67(1):39–48. doi: 10.1016/j.outlook.2018.10.007
- 30. Welton J. The Future of Data-Driven, Value-Based Care. Discover Nursing. Published November 15, 2016. Accessed October 13, 2021. https://nursing.jnj.com/nurses-leading-innovation/the-future-of-data-driven-value-based-care
- 31. Rudin C, Radin J. Why Are We Using Black Box Models in AI When We Don’t Need To? A Lesson From An Explainable AI Competition. Harv Data Sci Rev. 2019;1(2). doi: 10.1162/99608f92.5a8a3a3d
- 32. van Iterson M, van Haagen HHHBM, Goeman JJ. Resolving confusion of tongues in statistics and machine learning: A primer for biologists and bioinformaticians. PROTEOMICS. 2012;12(4-5):543–549. doi: 10.1002/pmic.201100395
- 33. Group FNBW. Contents of a Biomarker Description. Food and Drug Administration (US); 2020. Accessed December 20, 2021. http://www.ncbi.nlm.nih.gov/books/NBK566059/
- 34. Che D, Liu Q, Rasheed K, Tao X. Decision tree and ensemble learning algorithms with their applications in bioinformatics. Adv Exp Med Biol. 2011;696:191–199. doi: 10.1007/978-1-4419-7046-6_19
- 35. Schmarzo B. Features Part 1: Are Features the New Data? Published 2021. Accessed December 20, 2021. https://www.datasciencecentral.com/profiles/blogs/features-are-the-new-data
- 36. NIH Strategic Plan for Data Science | Data Science at NIH. Accessed December 20, 2021. https://datascience.nih.gov/strategicplan
- 37. Introduction to Random Forest in Machine Learning. Engineering Education (EngEd) Program | Section. Accessed December 20, 2021. https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
- 38. Structured vs Unstructured Data – What’s the Difference? G2. Accessed December 20, 2021. https://www.g2.com/articles/structured-vs-unstructured-data
- 39. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. 1st ed. 2013, Corr. 7th printing 2017 edition. Springer; 2013.
- 40. Brosius J, Palmer ML, Kennedy PJ, Noller HF. Complete nucleotide sequence of a 16S ribosomal RNA gene from Escherichia coli. Proc Natl Acad Sci U S A. 1978;75(10):4801–4805. doi: 10.1073/pnas.75.10.4801
- 41. Beghini F, McIver LJ, Blanco-Míguez A, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife. 2021;10:e65088. doi: 10.7554/eLife.65088
- 42. Douglas GM, Maffei VJ, Zaneveld JR, et al. PICRUSt2 for prediction of metagenome functions. Nat Biotechnol. 2020;38(6):685–688. doi: 10.1038/s41587-020-0548-6
- 43. Blum HE. The human microbiome. Adv Med Sci. 2017;62(2):414–420. doi: 10.1016/j.advms.2017.04.005
- 44. Liu X, Locasale JW. Metabolomics: A Primer. Trends Biochem Sci. 2017;42(4):274–284. doi: 10.1016/j.tibs.2017.01.004
- 45. Heimann T, Meinzer HP. Statistical shape models for 3D medical image segmentation: A review. Med Image Anal. 2009;13(4):543–563. doi: 10.1016/j.media.2009.05.004
- 46. Symms M, Jager H, Schmierer K, Yousry T. A review of structural magnetic resonance neuroimaging. J Neurol Neurosurg Psychiatry. 2004;75(9):1235–1244. doi: 10.1136/jnnp.2003.032714

