Abstract
Nurse scientists are generating, acquiring, distributing, processing, storing, and analyzing greater volumes of complex omics data than ever before. To take full advantage of big omics data, to address core biological questions, and to enhance patient care, however, genomic nurse scientists must embrace data science. Intended for readership with limited but expanding data science knowledge and skills, this article aims to provide a brief overview of the state of data science in genomic nursing. Our goal is to introduce key data science concepts to genomic nurses who participate at any stage of the data science lifecycle, from research patient recruitment to data wrangling, preprocessing, and analysis to implementation in clinical practice to policy creation. We address three major components in this review: (1) fundamental terminology for the field of genomic nursing data science, (2) current genomic nursing data science research exemplars, and (3) the spectrum of genomic nursing data science roles as well as education pathways and training opportunities. Links to helpful resources are included throughout the article.
Keywords: data science, genomics, machine learning, research, team science
Genomic nursing data science lies at the intersection of biology, statistics, computer science, and nursing domain expertise. Data science is an interdisciplinary field that uses analytical processes to extract knowledge and insight from large, varied, and/or complex data sets (Office of Data Science Strategy, National Institutes of Health, 2018). The ultimate goal of genomic nursing data science is to inform nursing phenomena (e.g., symptom science, wellness, self-management, end-of-life, and palliative care) and promote and improve the health of individuals, families, and communities by analyzing, interpreting, and extracting value from omics data (National Institute of Nursing Research, n.d.). Omics methods—including genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics—are used to create comprehensive, global molecular profiles of organisms. Due to their inherent size, complexity, and number of variables, omics data are “big data.” In fact, next-generation whole-genome sequencing with 30-fold coverage for a single human being produces approximately 100 gigabytes of nucleotide base data (Zhao et al., 2017). This amount of data is equivalent to downloading 75 million document pages of information (SDS Discovery, 2019). Big data, however, represent more than just data size or volume, also known as the first V of big data. In addition to volume, big data are often described using three additional V’s: variety (i.e., diversity of the data types), velocity (i.e., speed at which data are generated and received), and veracity (i.e., trustworthiness of the data).
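As a rough check on the scale of the sequencing figure cited above, the arithmetic can be sketched in a few lines. The genome length (about 3.1 billion bases) and the storage cost of one byte per base are assumptions made for illustration; this is not necessarily how the cited source derived its estimate, and real sequencing files also carry quality scores and metadata.

```python
# Back-of-the-envelope estimate of raw base data from one whole-genome run.
GENOME_SIZE_BASES = 3.1e9  # assumption: approximate length of the human genome
COVERAGE = 30              # 30-fold coverage, as stated in the text
BYTES_PER_BASE = 1         # assumption: one byte per base, no quality scores

total_bases = GENOME_SIZE_BASES * COVERAGE
total_gigabytes = total_bases * BYTES_PER_BASE / 1e9

print(f"{total_bases:.2e} bases sequenced")
print(f"roughly {total_gigabytes:.0f} GB of base calls")
```

Under these assumptions the result lands close to the approximately 100 gigabytes reported in the text.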
Big omics data allow for data-driven nursing inquiry, but also present unique challenges related to the method of scientific investigation and treatment of the data. The data science life cycle starts like any other, with a research question (Lau, 2019). The objective of data-driven research, however, is often different from that of traditional, hypothesis-driven work. Whereas hypothesis-driven methods use deductive reasoning to test and validate findings, data-driven methods use inductive reasoning to generate new hypotheses and theories. After generation of a research question, the steps of the data science life cycle are as follows: (1) collecting or obtaining omics and phenotype data, (2) wrangling and preprocessing data, (3) analyzing and interpreting data, and (4) translating and implementing findings from the data (Figure 1). Nurses, both researchers and clinicians, play important roles at each step of the data science life cycle and as members of the genomic data science team.
Figure 1.
The genomic nursing data science life cycle and the spectrum of roles for nurse scientists, both researchers and clinicians. Note. The data science life cycle starts with a research question that informs all other subsequent steps: Collecting or obtaining omics and phenotype data, wrangling and preprocessing data, analyzing and interpreting data, and translating and implementing findings from the data. The rectangle at the bottom of the figure represents the spectrum of genomic nursing data science levels of training/roles ranging from bachelor’s prepared nurses, who are more likely to participate in the data collection and clinical implementation steps of the life cycle, to nurse data scientists, who are likely to be experts in and perform data wrangling, preprocessing, and analysis. The spectrum emphasizes that a nurse scientist may participate in several steps or have multiple roles.
This specialized domain knowledge of nursing—including clinical content expertise, theory-driven science, and concern for ethical, legal, and social implications (i.e., ELSI)—will benefit omics data science. In exchange, incorporation of data science will lead to data-informed genomic nursing practice. As Brennan and Bakken (2015) state in their seminal paper, “Nursing Needs Big Data and Big Data Needs Nursing,”
nursing’s understanding of the whole person complements the reductionist approach taken by many data scientists. It is useful to remind our data science colleagues that those…alleles…are attached to a person, and their interpretation in light of that individual’s life goals and priorities is something that nursing can guide. (p. 482)
The purpose of the present article is to provide an overview of the state of data science in genomic nursing to the novice nurse researcher or clinician desiring to build their data science tool box. We will cover three major components in this review: An introduction to the relevant terminology in the field of genomic nursing data science, exemplars of data science in genomic nursing research, and genomic nursing data science roles and education pathways for nurses in the data science field.
Terminology for Genomic Nursing Data Science
The infusion of data science into genomic and nursing research and practice has created the need to clearly define major concepts in the field, especially related to the “wrangling and preprocessing data” and “analyzing and interpreting data” steps of the data science life cycle. Consequently, we discuss key terms (which have been bolded) in detail in this section and throughout the review. We have also organized definitions of key terminology in Table 1 for reader convenience.
Table 1.
Definitions of Data Science Terms.
| Term | Definition |
|---|---|
| Algorithm | A set of rules or operations that is arranged for calculations or other problem-solving processes (National Institutes of Health, 2018). |
| Big data | Refers to the concepts of volume, variety, velocity, and veracity in the analysis of data (Kitchin & McArdle, 2016). |
| Computer programming | The use of code to instruct a set of executable commands by a computer (McCandless, 2018). |
| Data science | An interdisciplinary field that uses large, varied, and/or complex data sets to extract knowledge and insight using the development of quantitative and analytical processes and systems (National Institutes of Health, 2018). |
| Data wrangling and preprocessing | Converting raw data into a form that can be used for analysis (Tomar, 2016). |
| Data-driven methods | The use of inductive reasoning to generate new hypotheses and theories (Deming, 2015). |
| Feature | An input variable, also known as a dimension or attribute (Bhattacharjee, 2017). |
| Feature engineering | The process of transforming data, with relevant domain knowledge, into new variables in a data set (Shekhar, 2018). |
| High-dimensional data | A data set that has more features (columns) than observations (rows; James et al., 2013). |
| Hyperparameter | A mechanism within an algorithm to optimize a model that can be set and tuned by the research team (Bhattacharjee, 2017). |
| Hypothesis-driven methods | The use of deductive reasoning to test and validate findings (Fox, 2015). |
| Labels | Established outcomes (Bhattacharjee, 2017). |
| Machine learning | A set of approaches that enable computers to learn without being explicitly programmed (National Institutes of Health, 2018). |
| Multicollinearity | When a predictor variable is correlated with one or more other predictor variables (James et al., 2013). |
| Overfitting | When a statistical function is modeled too closely to a particular set of data points (James et al., 2013). |
| Parameter | Variable estimated from the data that is used to build an accurate model (Bhattacharjee, 2017). |
| Pipeline | Optimized sequence of data processing and analysis steps (Stanford Computer Science, n.d.). |
| Supervised learning | Type of statistical learning method in which response labels are available for observations. The goal of supervised learning is to predict an output of interest (James et al., 2013). |
| Testing data | A set of observations used to assess the accuracy of the predictions. Testing data do not include observations that were used in the training or validation stage (James et al., 2013). |
| Training data | A set of observations used to train, or teach, an algorithm how to estimate the underlying truth (James et al., 2013). |
| Unsupervised learning | Type of statistical learning method in which response labels are not available for observations. The goal of unsupervised learning is to find a pattern in the data (James et al., 2013). |
| Validation data | A set of observations used to tune the parameters of a model (Ripley, 1995). |
Data Wrangling and Preprocessing
The goal of data wrangling and preprocessing is to convert raw data into a form that can be used for analysis. Data wrangling and preprocessing involves obtaining, unifying, and cleaning the data. Examples of omics data wrangling and preprocessing tasks include genome alignment to reference data sets, detecting and correcting missing data and batch effects, sample filtering, probe filtering, normalization, and visualization. While similar data wrangling and preprocessing tasks should be completed for all big omics data, the specific techniques and tools used are unique to the particular data source(s). Techniques and tools continue to evolve with omics technologies. As an example, the recent explosion of interest in the microbiome has necessitated new data wrangling and preprocessing techniques and tools to prepare 16S rRNA sequencing data for cluster and classification analyses. Essential data wrangling and preprocessing steps for microbiome data include alignment of raw sequencing reads, quality control to remove low-quality samples, and normalization to remove systematic variation. Mothur and QIIME are the most popular versions of this pipeline for research use (Caporaso et al., 2010; Schloss, 2016).
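Two of the tasks named above, sample filtering and normalization, can be illustrated on a toy data set. The following is a minimal sketch, not the procedure of any particular omics pipeline: the sample names, the values, and the 25% missingness threshold are all hypothetical, and real workflows would rely on dedicated tools such as those cited in this section.

```python
from statistics import mean, pstdev

# Hypothetical toy matrix: each sample has one value per feature (e.g., gene).
# None marks a missing measurement.
raw = {
    "sample_1": [5.0, 2.0, 9.0, 1.0],
    "sample_2": [7.0, None, None, None],  # mostly missing: low quality
    "sample_3": [9.0, 4.0, 3.0, 1.0],
    "sample_4": [7.0, 6.0, 6.0, 1.0],
}


def filter_samples(data, max_missing_fraction=0.25):
    """Keep samples whose fraction of missing values is at or below the threshold."""
    kept = {}
    for sample_id, values in data.items():
        missing = sum(v is None for v in values) / len(values)
        if missing <= max_missing_fraction:
            kept[sample_id] = values
    return kept


def zscore_features(data):
    """Center and scale each feature across samples.

    Missing values stay None; constant features are set to 0.0.
    """
    sample_ids = list(data)
    n_features = len(next(iter(data.values())))
    out = {s: [] for s in sample_ids}
    for j in range(n_features):
        observed = [data[s][j] for s in sample_ids if data[s][j] is not None]
        mu = mean(observed) if observed else None
        sd = pstdev(observed) if observed else None
        for s in sample_ids:
            v = data[s][j]
            if v is None:
                out[s].append(None)
            elif sd == 0:
                out[s].append(0.0)
            else:
                out[s].append((v - mu) / sd)
    return out


clean = zscore_features(filter_samples(raw))
for sample_id, values in clean.items():
    print(sample_id, [None if v is None else round(v, 2) for v in values])
```

Filtering before normalizing matters here: a low-quality sample left in the data would shift the mean and standard deviation used to scale every other sample.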
One characteristic of big omics data that warrants special consideration during data wrangling and preprocessing, as compared to traditional data, is dimensionality. Big omics data are high dimensional, meaning that they have a significantly greater number of features, or input variables (e.g., nucleotide bases, genes, transcription factors, sites of methylation, microbiota), than observations (e.g., humans, mice, cells; James et al., 2013). Features can also be called dimensions or attributes. Typically, each column of a data set represents one feature. High dimensionality can lead to challenges, namely overfitting and multicollinearity. Conceptually, overfit models are “too complicated” for the data; instead of reflecting the overall population, the overfit model reflects the random noise and nuances of the sample data (Minitab Blog Editor, 2015). Models in high-dimensional spaces are much more likely to experience overfitting due to a large number of input variables. As such, overfitting can cause a lack of generalizability to other data sets. Considering the highly interconnected nature of omics data (e.g., genes and gene expression), multicollinearity (i.e., high intercorrelations among predictor variables) should be evaluated and mitigated. Multicollinearity can cause overfitting and threaten model performance. Exploratory visualization of cleaned omics data may help to reveal redundancy and noise in the data. Creative and visually appealing illustrations help researchers to understand the data at hand and to communicate new science. Computational data visualization packages, which we will discuss in further detail below in “Implementation of Data Wrangling, Preprocessing, and Analysis,” are available in R (e.g., ggplot, ggmap, RColorBrewer), Python (e.g., Matplotlib, Seaborn, Bokeh), and Tableau.
Dimension-reduction techniques, such as principal component analysis or nonnegative matrix factorization, can be attempted before analysis to reduce the number of data parameters (i.e., the variables estimated from the data that are used to build an accurate model) while retaining the true signal.
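One simple way to screen for the multicollinearity described above is to compute pairwise correlations between features and flag highly correlated pairs. The sketch below assumes a Pearson correlation and an arbitrary cutoff of 0.9; the gene names and values are hypothetical, and it is a screening aid rather than a substitute for the dimension-reduction techniques mentioned in the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return (name_1, name_2, r) for feature pairs with |r| >= threshold."""
    flagged = []
    for (name_1, x), (name_2, y) in combinations(features.items(), 2):
        r = pearson(x, y)
        if abs(r) >= threshold:
            flagged.append((name_1, name_2, r))
    return flagged


# Hypothetical expression values for three genes measured in five samples.
features = {
    "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_B": [2.1, 3.9, 6.2, 8.0, 9.8],  # close to 2 x gene_A
    "gene_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

for name_1, name_2, r in correlated_pairs(features):
    print(f"{name_1} and {name_2} are highly correlated (r = {r:.3f})")
```

In this toy data only the gene_A and gene_B pair is flagged, which suggests that one of the two carries little additional information for a model.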
Data wrangling and preprocessing can be a tedious and time-consuming process. It is estimated that data scientists spend approximately 80% of their time preparing data for analysis (Press, 2016). However, thoughtful and comprehensive data wrangling and preprocessing is vital to ensure quality results from the analysis—remember, “garbage in, garbage out.”
Data Analysis
Machine learning is frequently used to analyze big data. It differs from, and can offer additional value over, traditional statistical approaches. Machine learning encompasses a set of approaches that enable computers to learn without being explicitly programmed. Machine learning methods can be divided into two main categories, supervised and unsupervised (Figure 2; Office of Data Science Strategy, National Institutes of Health, 2018). The selection of the method depends on the goal of the project. Supervised learning is used when labels (i.e., established outcomes) are known for each of the observations (James et al., 2013). The goal of supervised learning is to create a model that is able to predict the outcome of interest. Common supervised learning approaches include classification (for categorical output variables) and regression algorithms (for continuous output variables). Alternatively, unsupervised learning is used when there are no labels for the observations (James et al., 2013). The goal of unsupervised learning is to find some structure or pattern that characterizes the data. Common unsupervised learning approaches include clustering (i.e., discovering inherent groups) and association rule learning (i.e., discovering rules that govern relationships in the data). After a method (i.e., supervised or unsupervised) and approach (e.g., logistic regression, support vector machines, clustering) are selected, implementation of machine learning usually involves splitting the data into three data sets: A training set, a validation set, and a testing set. Alternatively, data may be split into two data sets: A training set and a validation set. In the second case, an additional testing set may be obtained from another source. The training data set is used to create the model(s) and fit model parameters. The validation data set is used to evaluate the performance of and further tune the trained model(s) that was fit from the training data.
The validation data help to fill in the gaps from the testing set, stretch the capacity of the trained model on new data, and compare model performance to choose the “best” model. Finally, the testing data set is used to evaluate the performance of the “best” model. Because the model has not “seen” the testing data set before, the testing data set allows for a more reliable estimate of model performance over the validation set (Draelos, 2019).
Figure 2.
Conceptualization of (Panel A) supervised versus (Panel B) unsupervised machine learning. Note. The goal of supervised learning (Panel A) is to create a model that is able to predict an outcome of interest. In the first line of Panel A, the shape-classifier model is built by providing labeled input–output pair examples. Each shape image is labeled as a rectangle, heart, or circle. Distinct characteristics of each shape are learned by the model. In the second line of Panel A, the classifier model predicts that the inputted unlabeled shape image is a circle. The goal of unsupervised learning (Panel B) is to find some structure or pattern that characterizes the data. The input shape images do not have output labels. Shape images are organized into groups with similar characteristics.
To make the concept of data splits more tangible, let’s imagine a binary classification scenario in which a nurse scientist’s goal is to predict the occurrence of symptom X using peripheral-blood gene expression data from 5,000 distinct patients at a single medical center. All patient samples have been labeled as “positive” or “negative” occurrence of symptom X by the research team. After data wrangling and preprocessing, the scientist randomly divides the data into three subsets using a 60 training:20 validation:20 testing split. While a 60:20:20 ratio is common, training:validation:testing splits are subjective. Other common ratios include 70:15:15 and 80:10:10 (Draelos, 2019). The scientist starts by creating three different supervised learning models (e.g., logistic regression, random forest, convolutional neural net) using the 3,000 training samples. The models learn the features (i.e., gene expression levels) that are most important for positive or negative occurrence of symptom X. The scientist then evaluates each model’s performance and tunes hyperparameters (e.g., number of layers, use of normalization) using the validation set of 1,000 samples and computed performance metrics including accuracy, precision, and recall. Once the scientist has selected and optimized the best model, they evaluate its performance using the testing set of 1,000 samples. If the performance metric is adequate, the gene expression model for symptom X can be further tested and refined using additional samples.
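The split and the performance metrics in this scenario can be sketched in code. This is a minimal illustration using only the Python standard library: the patient identifiers, the random seed, and the small set of predicted labels are hypothetical, and no actual model is trained.

```python
import random


def split_samples(sample_ids, fractions=(0.6, 0.2, 0.2), seed=42):
    """Randomly split sample IDs into training, validation, and testing sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    n_train = round(len(ids) * fractions[0])
    n_val = round(len(ids) * fractions[1])
    train = ids[:n_train]
    validation = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains goes to testing
    return train, validation, test


def classification_metrics(true_labels, predicted_labels, positive="positive"):
    """Compute accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# 5,000 hypothetical patient IDs, split 60:20:20 as in the scenario above.
patient_ids = [f"patient_{i:04d}" for i in range(5000)]
train, validation, test = split_samples(patient_ids)
print(len(train), len(validation), len(test))

# Hypothetical true and predicted labels for eight validation samples.
truth = ["positive", "positive", "negative", "negative",
         "positive", "negative", "negative", "negative"]
predicted = ["positive", "negative", "negative", "positive",
             "positive", "negative", "negative", "negative"]
print(classification_metrics(truth, predicted))
```

Shuffling before slicing keeps the three subsets random, and because each ID lands in exactly one subset, the testing samples remain unseen until the final evaluation.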
Representative training, validation, and testing data are critical to confirm that model performance is not isolated (James et al., 2013). The robust use of appropriate training, validation, and testing sets allows machine learning to achieve unparalleled model-building ability and accuracy across new data. Additional methods such as cross-validation (e.g., K-fold, leave-one-out) can help to further decipher how models will perform or find models with the most compelling estimator quality (e.g., mean squared error). Overall, machine learning provides an alternative model selection and performance-evaluation approach to a traditional statistical stepwise approach or a choice based on p values for parameter coefficients.
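The K-fold cross-validation mentioned above can be sketched as follows. The function name, the number of folds, and the seed are illustrative assumptions; the sketch only generates the index partitions and leaves model fitting and scoring to the reader's chosen method.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training_indices, held_out_indices) for each of k folds.

    Every sample is held out exactly once; setting k equal to n_samples
    gives leave-one-out cross-validation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


for fold_number, (training, held_out) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold_number}: train on {len(training)}, evaluate on {sorted(held_out)}")
```

Averaging a performance metric such as mean squared error over the folds gives an estimate that depends less on any single split of the data.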
Implementation of Data Wrangling, Preprocessing, and Analysis
The implementation of data wrangling, preprocessing, and analysis is completed with computer programming. Computer programming is the use of code to create a set of commands that a computer can execute. We outline common programming languages, editors, biological packages, advantages, disadvantages, and helpful links in Table 2. Coding scripts are combined for the development of analytic workflows or pipelines. A pipeline is an optimized sequence of data processing and analysis steps. The choice of programming language and editor should be project-driven such that the most appropriate methods are optimized (e.g., parallelized workflows, memory considerations, proprietary licensure). Fortunately, several omics-specific guides have been published, including nurse scientist–authored “Establishing an Analytic Pipeline for Genome-Wide DNA Methylation” (Wright et al., 2016) and “NuRsing Research in the 21st Century: R You Ready?” (Wright et al., 2019), that summarize major data preprocessing and analysis steps and recommend computational resources to accomplish these tasks. Other resources that provide pipeline examples include “A Tutorial on Conducting Genome-Wide Association Studies: Quality Control and Statistical Analysis” by Marees et al. (2018), “RNA-seq Workflow: Gene-Level Exploratory Analysis and Differential Expression” by Love et al. (2019), “A Cross-Package Bioconductor Workflow for Analyzing Methylation Array Data” by Maksimovic et al. (2019), and “Workflow for Microbiome Data Analysis: From Raw Reads to Community Analyses” by Callahan et al. (2017).
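The idea of a pipeline as an ordered sequence of steps can be made concrete with a small sketch. The step names and the toy read counts below are hypothetical and are not drawn from any of the published workflows cited above; each step is an ordinary function, and the pipeline applies them in order.

```python
from functools import reduce
from math import log2


def drop_missing(values):
    """Remove missing measurements (None)."""
    return [v for v in values if v is not None]


def log2_transform(values):
    """Apply log2(x + 1), a common variance-stabilizing transform for counts."""
    return [log2(v + 1) for v in values]


def center(values):
    """Subtract the mean so the values are centered on zero."""
    mu = sum(values) / len(values)
    return [v - mu for v in values]


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, data)


PIPELINE = [drop_missing, log2_transform, center]

raw_counts = [0, 3, None, 7, 15]  # hypothetical read counts for one gene
processed = run_pipeline(raw_counts, PIPELINE)
print(processed)
```

Keeping each step as a separate function makes the order explicit and lets a team swap or reorder steps without rewriting the rest of the workflow.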
Table 2.
Comparison of Major Data Science and Machine-Learning Languages.
| Language | Common Editors | Popular Biological Packages | Advantages | Disadvantages | Helpful Links |
|---|---|---|---|---|---|
| R (R Core Team, 2013) | R, RStudio | Bioconductor suite | Open source, popular for academic research, significant number of package implementations | Memory allocation, speed, security concerns | www.r-project.org, rstudio.com, www.bioconductor.org |
| Python (van Rossum, 1995) | Jupyter Notebook, Spyder, PyCharm | Biopython, QIIME | Flexible and dynamic programming, software development, extensive support libraries | Speed, weak for other computing platforms | python.org, www.sololearn.com, realpython.com |
| MATLAB (MathWorks Inc., 2019) | MATLAB | SimBiology, Deep Learning Toolbox | Versatile, rich library of machine learning and engineering libraries | Cost of license, difficulties in converting code, difficult deployment of code into an application | www.mathworks.com |
| JavaScript (Javascript, 2019) | WebStorm, Atom Editor | BioJS, Bionode | Speed, interoperability, popular | Technical support, security | www.javascript.com |
| C++ (ISO/IEC, 2015) | Bluefish, Code::Blocks, NetBeans | Bio++, SeqAn, NCBI toolkit | Object-oriented language, application for databases | Not well supported on non-Windows platforms | www.cplusplus.com |
| Scala (Odersky, 2004) | ENSIME, Eclipse, Vim | BioJava | Use for data analytics, reduces risk of threads | Difficult recursive optimization | www.scala-lang.org |
| Julia (Julia, 2019) | Juno | BioJulia | Support for parallelism, speed, math-friendly syntax | Lacking outside packages, concern about one index compared to other languages | www.julialang.org |
Data Science in Genomic Nursing
Big omics data have aided the development, implementation, and adoption of clinical tools related to pharmaceutical efficacy and safety, oncology therapeutics, evaluation of rare disorders, and prenatal screening (Krier et al., 2016; Monte et al., 2012). However, much of the big omics data work related to nursing phenomena is in the preclinical research stage.
Genomic Nursing Data Science Exemplars
The current focus of genomic nursing data science is on harnessing big omics data to further our understanding of disease biology. Nurse scientists are using a variety of omics approaches, including transcriptomics, epigenomics, and microbiomics, to study the biological underpinnings of a wide variety of phenotypes. Exemplar studies of data science genomic nursing research are summarized below.
Joseph and colleagues (2019) analyzed gene expression patterns of whole-blood samples collected from 90 healthy participants to identify patterns that were associated with body mass index. Using weighted gene co-expression network analysis (WGCNA), they identified two gene modules, enriched for catabolic and muscle system processes, respectively, that were associated with body mass index (Joseph et al., 2019). The research team used several key packages in R and Bioconductor including, but not limited to, the WGCNA package, the Limma package for differential expression, and the GOStats package for functional enrichment. Because obesity is a potentially complex genetic dysregulation, network analysis in this study used noise-reduction techniques such as selecting only the top expressed genes to construct the co-expression networks. Their results illustrate the use of systematic and comprehensive methods to address underlying gene expression patterns.
Dorsey et al. (2019) characterized the transcriptomes associated with chronic low back pain and the transition from acute to chronic low back pain. The team carefully phenotyped participants with an acute, nonspecific episode of low back pain with study measures such as demographic background, perceived pain (i.e., Brief Pain Inventory–Short Form, McGill Pain Questionnaire–Short Form), quantitative sensory testing, and mechanical pain sensitivity (i.e., windup ratio, dynamic mechanical allodynia). These clinical measurements, in combination with blood samples for RNA sequencing on the Illumina HiSeq platform, were analyzed using Euclidean clustering in R. The clustering analysis on the differentially expressed gene sets showed distinct genetic differences between the healthy, pain-free, and active-pain groups and the chronic-pain groups. Their results align with previous animal models and human research on the role of the major histocompatibility complex in pain symptomology.
Flowers and team (2018) evaluated relationships between gene expression and perturbation in biological pathways and severity of evening fatigue in oncology patients who were receiving chemotherapy. They measured 47,214 ribonucleic acid transcripts from whole blood using the Illumina HumanHT-12 Version 4.0 Expression BeadChip. The team followed well-established protocols and used the R statistical computing environment and packages from Bioconductor to perform data preparation. After quality control procedures, 43,923 probes in 16,980 genes were available for analysis. They then used R and Bioconductor packages (e.g., differential gene expression, Limma package; differential pathway perturbation, Generally Applicable Gene Set Enrichment package) and a variety of databases (e.g., TRANSFAC database, Kyoto Encyclopedia of Genes, BioCarta, Gene Ontology, Gene Expression Omnibus) to analyze and interpret data. The team identified differential gene expression and perturbed biological pathways that support both previously identified mechanisms (e.g., inflammation, energy metabolism) and a new mechanism (i.e., renal function) of fatigue.
Anderson et al. (2014) explored epigenome-wide DNA methylation patterns in maternal peripheral blood cells and placental tissue associated with the development of preeclampsia. Genome-wide DNA methylation data were generated using the Illumina Infinium DNA methylation 450 K bead-based array platform. The investigators used both Illumina’s GenomeStudio DNA methylation module and the Numerical Identification of Methylation Biomarker Lists Infinium analysis package for MATLAB to perform quality control procedures, normalization, and differential methylation analysis. They identified 207 differentially methylated CpG dinucleotides in the maternal blood cells in women who developed preeclampsia as compared to normotensive women. They also found evidence to support the transmission of DNA methylation biomarkers of preeclampsia from mother to offspring during pregnancy.
Wright et al. (2017) examined the relationship between maternal parenting stress and epigenome-wide DNA methylation among African-ancestry mother–child dyads. Saliva samples and the Illumina Infinium Methylation EPIC (850 K) BeadChip were used for epigenome-wide data collection. A total of 847,155 autosomal CpG sites were analyzed. The authors followed established analytic pipelines to perform data quality control checks, data preprocessing, and data analysis (i.e., linear regression) using the R statistical computing environment and packages from Bioconductor. They also applied a genomic-control adjustment to mitigate confounding and inflation of Type I error and false discovery rate to correct for multiple testing. The researchers found that higher levels of parenting stress are associated with differential maternal DNA methylation.
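To show why a multiple-testing correction matters when hundreds of thousands of CpG sites are tested, the sketch below implements the Benjamini-Hochberg procedure, one widely used way to control the false discovery rate. The text does not state which procedure the study used, so this choice, along with the p values and the 0.05 level, is an assumption made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the test is declared significant.

    Step-up rule: find the largest rank k (1-based, p values sorted ascending)
    with p_(k) <= (k / m) * alpha, then reject the k smallest p values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p values from six association tests.
p_values = [0.041, 0.001, 0.60, 0.039, 0.008, 0.040]
print(benjamini_hochberg(p_values))
```

With these toy values, five of the six tests pass at a false discovery rate of 0.05, whereas a Bonferroni cutoff of 0.05 / 6 would retain only the two smallest p values.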
Fourie and colleagues (2016) explored differences in the oral buccal mucosal microbiome of participants with irritable bowel syndrome compared to healthy controls. They also explored whether the oral microbiome varied by the severity of visceral pain, a symptom of irritable bowel syndrome. They used a PhyloChip microarray to characterize the microbiome. Microbiome profiling, data processing, and data analysis were performed in collaboration with Second Genome, Inc., a microbiome drug discovery platform. They used several visualizations to interpret and communicate study findings including a bar chart, scatterplot, richness profile, and cladogram. The researchers found that overweight participants with irritable bowel syndrome have the most distinct oral microbiome with microbial characteristics of both obesity and gastrointestinal diseases. They further reported that the severity of visceral pain was correlated to the abundance of many microbial taxa.
Data science techniques are useful once potential biomarkers are identified as well. Findings from many hypothesis-generating big omics data studies require extensive mechanistic follow-up to pinpoint causal variants, genes, or pathways before consideration as a clinically actionable biomarker. Additionally, common variation does not fully explain the heterogeneity in many nursing phenotypes (e.g., chronic condition, symptom) of interest. By accounting for heritable variation through advanced computation (e.g., high performance computing, integration of diverse data types), genomic nursing can become more directed for applications such as drug targeting, clinical predictions, and risk prediction. For example, Taylor et al. (2019) used data science methods to feature engineer (i.e., the process of transforming data into new variables in a data set) genetic burden scores for studies with both genotype and DNA methylation data. Their goal was to evaluate associations with chronic conditions such as metabolic syndrome in a sample of 739 African American men and women from a large cohort study (Taylor et al., 2019). They found that genetic burden scores were not significant for the gene-by-methylation interaction after the appropriate corrections (Taylor et al., 2019).
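As a generic illustration of feature engineering, a burden score can be built by combining many variant-level measurements into a single new variable. The sketch below computes a weighted sum of risk-allele counts; the variant names, weights, and the treatment of missing genotypes as zero are all hypothetical, and this is not a description of the method used by Taylor et al. (2019).

```python
def burden_score(genotype, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2 copies per variant).

    Variants missing from the genotype are counted as 0, which is a
    simplifying assumption for this sketch.
    """
    return sum(weight * genotype.get(variant, 0) for variant, weight in weights.items())


# Hypothetical per-variant weights (e.g., effect sizes from an earlier study).
weights = {"variant_1": 0.5, "variant_2": 0.25, "variant_3": 1.0}

# Hypothetical risk-allele counts for two participants.
participants = {
    "participant_A": {"variant_1": 2, "variant_2": 0, "variant_3": 1},
    "participant_B": {"variant_1": 0, "variant_2": 1},
}

scores = {pid: burden_score(g, weights) for pid, g in participants.items()}
print(scores)
```

The engineered score replaces three separate genotype columns with one feature per participant, which is the kind of dimensionality reduction that domain knowledge can guide.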
Data Sources
Nurse scientists do not need to collect samples and generate their own big omics data to make discoveries. They can use preexisting data sets. Large cohort studies increase access to big omics data and make the opportunities for machine-learning insights more available. Two common big-data genomic resources are the UK Biobank (www.ukbiobank.ac.uk) and the Electronic Medical Records and Genomics (eMERGE) Network (emerge-network.org/). The UK Biobank is an unparalleled resource for deconvolving the genetic determinants of multifactorial disorders. The eMERGE Network, organized and funded by the National Institutes of Health, combines biorepositories with electronic medical record systems for genomic discovery and genomic medicine implementation research. To support research advancement through training, tool development and sharing, and data accessibility, the National Institutes of Health created the Big Data to Knowledge (BD2K) program (commonfund.nih.gov/bd2k; NIH Common Fund, n.d.). Another notable resource, currently in development, is the National Institutes of Health All of Us Research Program (allofus.nih.gov). The All of Us Research Hub (researchallofus.org) is intended to store health data, including electronic health records, surveys, physical measurements, and biospecimens, from 1 million or more diverse participants in the All of Us Research Program. Approved researchers will be able to use All of Us Research Hub data and tools to conduct studies.
There are two important National Institute of Nursing Research data, education, and collaborative resources specific to nursing science as well. First and foremost, the Omics Nursing Science and Education Network or ONSEN (omicsnursingnetwork.net) was created as a centralized location for omics data, research collaborations and training, and information about use of common data elements. The second resource is the Common Data Repository for Nursing Science (cdrns.nih.gov), which outlines core data elements and biomarker identification for common diagnoses and symptoms such as pain, fatigue, and sleep disturbances.
Phenotype-specific resources are also available. For example, the Trans-Omics for Precision Medicine program (www.nhlbiwgs.org) and the Multi-Ethnic Study of Atherosclerosis, both available through the National Heart, Lung, and Blood Institute, focus directly on understanding the biological underpinnings and mechanisms of cardiovascular disorders. Similarly, the Federal Interagency Traumatic Brain Injury Research Informatics System (fitbir.nih.gov) provides a platform for traumatic brain injury–specific data and collaboration. The increasing access to data and the significant resources invested in cohort studies indicate that the field is moving toward greater integration of diverse data and the synthesis of this knowledge into clinical discovery.
Data Science Genomic Nursing Roles and Education
Given that genomic nursing data science encompasses a number of specialties (i.e., biology, computer science, statistics, and nursing domain expertise), an interdisciplinary team is critical for successful implementation. Interdisciplinary team science involves collaboration between individuals and laboratories with complementary and relevant skills and experiences. A nurse scientist on the genomic nursing science team may or may not be directly involved in the data wrangling, preprocessing, and analysis steps of the data science life cycle (Figure 1). Key roles for non-data-intensive genomic nurses include patient recruitment, biological sample collection, phenotype collection, mechanistic follow-up of study findings, genomic data generation, evaluation and interpretation of study findings, translation of study findings to clinical practice, and influence on data policy, governance, and ethical considerations.
Genomic nurse scientists interested in taking on a data scientist role can obtain education in applying data science methodologies to the nursing omics setting through both formal and informal pathways (Table 3). Brennan and Bakken (2015) outlined a spectrum of training, roles, and activities of data-intensive nurses in practice and research across educational backgrounds. This outline can be used to guide data science education for omics-focused nurses as well. Training should be tailored to the goals of the individual. Formal training can range from courses in data science within the context of evidence-based nursing practice (i.e., bachelor’s prepared nurse) to courses in data science methods (i.e., advanced practice nurse or omics-intensive nursing PhD) to a data-intensive postdoctoral research fellowship (i.e., data-intensive nursing PhD) to a formal degree program in data science (i.e., nurse data scientist). Formal training could also include data science certifications or workshops such as the National Institute of Nursing Research’s data-intensive boot camps (www.ninr.nih.gov/training/trainingopportunitiesintramural) or the University of Washington’s Biostatistics Summer Institutes in Statistical Genomics or Big Data (www.biostat.washington.edu/suminst). Informal methods for adding data skills to a nurse’s genomics data science toolbox could include massive open online courses (MOOCs) and online training modules. Additionally, nurses can use social networking to build connections with other burgeoning genomic nurse data scientists, engage with relevant articles and viral content, and participate in hackathons (i.e., gatherings of individuals who code together to achieve and present a common goal).
Table 3.
Educational Pathways for Data Science and Genomics.
| Pathway | Definition | Example a |
|---|---|---|
| Formal | | |
| Informal | | |
| Social | Social networking and engagement | |
a The examples provided for meeting the goals of data science training are not exhaustive.
Conclusion
Data science approaches have the potential to enhance genomic nursing practice, research, and education at multiple levels, from wrangling, preprocessing, and analyzing big omics data to using omics data–enabled applications to improve care. Our central objective in the present article has been to describe the juncture of data science, omics, and nursing. Nurse scientists are well positioned to take advantage of the data-driven omics movement to address core biological questions and enhance the direct care of patients. Nurses across the entire spectrum of the genomic nursing data science team can be champions for the use of data to improve patient outcomes and optimize the art and science of nursing care and research.
Footnotes
Author Contributions: Caitlin Dreisbach contributed to conception, design, acquisition, analysis, and interpretation; drafted the manuscript; critically revised the manuscript; gave final approval; and agreed to be accountable for all aspects of work ensuring integrity and accuracy. Theresa A. Koleck contributed to conception, design, acquisition, analysis, and interpretation; critically revised the manuscript; gave final approval; and agreed to be accountable for all aspects of work ensuring integrity and accuracy.
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research received financial support from the National Institute of Nursing Research (F31NR017821, K99NR017651, and P30NR016587).
ORCID iD: Caitlin Dreisbach
https://orcid.org/0000-0003-3964-3161
References
- Anderson C. M., Ralph J. L., Wright M. L., Linggi B., Ohm J. E. (2014). DNA methylation as a biomarker for preeclampsia. Biological Research for Nursing, 16(4), 409–420. 10.1177/1099800413508645
- Bhattacharjee J. (2017). Some key machine learning definitions. https://medium.com/technology-nineleaps/some-key-machine-learning-definitions-b524eb6cb48
- Brennan P. F., Bakken S. (2015). Nursing needs big data and big data needs nursing. Journal of Nursing Scholarship, 47(5), 477–484. 10.1111/jnu.12159
- Callahan B., Sankaran K., Fukuyama J., McMurdie P., Holmes S. (2017, July 25). Workflow for Microbiome Data Analysis: From raw reads to community analyses. https://bioconductor.org/help/course-materials/2017/BioC2017/Day1/Workshops/Microbiome/MicrobiomeWorkflowII.html
- Caporaso J. G., Kuczynski J., Stombaugh J., Bittinger K., Bushman F. D., Costello E. K., Fierer N., Peña A. G., Goodrich J. K., Gordon J. I., Huttley G. A., Kelley S. T., Knights D., Koenig J. E., Ley R. E., Lozupone C. A., McDonald D., Muegge B. D., Pirrung M.…Knight R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335–336. 10.1038/nmeth.f.303
- Deming W. E. (2015). What do we mean by data-driven? In Anderson C. (Ed.), Creating a data-driven organization (pp. 1–19). O’Reilly Media.
- Dorsey S. G., Renn C. L., Griffioen M., Lassiter C. B., Zhu S., Huot-Creasy H., McCracken C., Mahurkar A., Shetty A. C., Jackson-Cook C. K., Kim H., Henderson W. A., Saligan L., Gill J., Colloca L., Lyon D. E., Starkweather A. R. (2019). Whole blood transcriptomic profiles can differentiate vulnerability to chronic low back pain. PLoS One, 14(5), e0216539. 10.1371/journal.pone.0216539
- Draelos R. L. B. (2019, September 15). Best use of train/val/test splits, with tips for medical data. Glass Box: Artificial Intelligence + Medicine. https://glassboxmedicine.com/2019/09/15/best-use-of-train-val-test-splits-with-tips-for-medical-data/
- Flowers E., Miaskowski C., Conley Y., Hammer M. J., Levine J., Mastick J., Paul S., Wright F., Kober K. (2018). Differential expression of genes and differentially perturbed pathways associated with very high evening fatigue in oncology patients receiving chemotherapy. Supportive Care in Cancer, 26(3), 739–750. 10.1007/s00520-017-3883-5
- Fourie N. H., Wang D., Abey S. K., Sherwin L. B., Joseph P. V., Rahim-Williams B., Ferguson E. G., Henderson W. A. (2016). The microbiome of the oral mucosa in irritable bowel syndrome. Gut Microbes, 7(4), 286–301. 10.1080/19490976.2016.1162363
- Fox M. (2015). Using a hypothesis-driven approach in analyzing (and making sense) of your website traffic data. https://digital.gov/2015/04/16/using-a-hypothesis-driven-approach-in-analyzing-and-making-sense-of-your-website-traffic-data/
- International Organization for Standardization/International Electrotechnical Commission. (2015). ISO International Standard ISO/IEC 14882:2014(E)-Programming Language C++ (Version C++17) [Computer software]. International Organization for Standardization (ISO).
- James G., Witten D., Hastie T., Tibshirani R. (2013). An introduction to statistical learning: With applications in R. Springer.
- Javascript. (2019). https://JavaScript.com. Retrieved February 22, 2019, from https://www.javascript.com/
- Joseph P. V., Jaime-Lara R. B., Wang Y., Xiang L., Henderson W. A. (2019). Comprehensive and systematic analysis of gene expression patterns associated with body mass index. Scientific Reports, 9(1), 7447. 10.1038/s41598-019-43881-5
- Julia. (2019). The Julia Language. Retrieved February 22, 2019, from https://julialang.org/
- Kitchin R., McArdle G. (2016). What makes big data, big data? Exploring the ontological characteristics of 26 datasets. Big Data & Society, 3(1). 10.1177/2053951716631130
- Krier J. B., Kalia S. S., Green R. C. (2016). Genomic sequencing in clinical practice: Applications, challenges, and opportunities. Dialogues in Clinical Neuroscience, 18(3), 299–312.
- Lau C. H. (2019, January 3). 5 steps of a data science project lifecycle. https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492
- Love M., Anders S., Kim V., Huber W. (2019, October 16). RNA-seq workflow: Gene-level exploratory analysis and differential expression. https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
- Maksimovic J., Phipson B., Oshlack A. (2019, October 30). A cross-package Bioconductor workflow for analysing methylation array data. Retrieved January 14, 2020, from https://bioconductor.org/packages/release/workflows/vignettes/methylationArrayAnalysis/inst/doc/methylationArrayAnalysis.html
- Marees A. T., de Kluiver H., Stringer S., Vorspan F., Curis E., Marie-Claire C., Derks E. M. (2018). A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. International Journal of Methods in Psychiatric Research, 27(2), e1608. 10.1002/mpr.1608
- MathWorks Inc. (2019). MATLAB. Retrieved February 23, 2019, from https://www.mathworks.com/products/matlab.html
- McCandless K. (2018, June 13). What is computer programming? https://news.codecademy.com/what-is-computer-programming/
- Minitab Blog Editor. (2015, September 3). The danger of overfitting regression models. https://blog.minitab.com/blog/adventures-in-statistics-2/the-danger-of-overfitting-regression-models
- Monte A. A., Vasiliou V., Heard K. J. (2012). Omics screening for pharmaceutical efficacy and safety in clinical practice. Journal of Pharmacogenomics & Pharmacoproteomics, S5.
- National Institute of Nursing Research. (n.d.). Advancing nursing research through data science. https://www.ninr.nih.gov/researchandfunding/datascience
- NIH Common Fund. (n.d.). Big data to knowledge. Retrieved December 9, 2019, from https://commonfund.nih.gov/bd2k
- Odersky M. (2004). An overview of the Scala Programming Language (Version IC/2004/64) [Computer software]. EPFL.
- Office of Data Science Strategy, National Institutes of Health. (2018). NIH strategic plan for data science. National Institutes of Health. https://datascience.nih.gov/strategicplan
- Press G. (2016, March 23). Cleaning big data: Most time-consuming, least enjoyable data science task, survey says. Forbes. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
- R Core Team. (2013). R: A language and environment for statistical computing (Version 3.5.2) [Computer software]. R Foundation for Statistical Computing.
- Ripley B. D. (1995). Pattern recognition and neural networks. Cambridge University Press.
- Schloss P. D. (2016, January 12). Mothur and QIIME. The Mothur Blog. http://blog.mothur.org/2016/01/12/mothur-and-qiime/
- SDS Discovery. (2019). Data volume estimates and conversions. Superior Document Services. https://www.sdsdiscovery.com/resources/data-conversions/
- Shekhar A. (2018). What is feature engineering for machine learning? Mindorks. https://medium.com/mindorks/what-is-feature-engineering-for-machine-learning-d8ba3158d97a
- Stanford Computer Science. (n.d.). Pipelining. https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html
- Taylor J. Y., Ware E. B., Wright M. L., Smith J. A., Kardia S. L. R. (2019). Using genetic burden scores for gene-by-methylation interaction analysis on metabolic syndrome in African Americans. Biological Research for Nursing, 21, 279–285. 10.1177/1099800419828486
- Tomar S. S. (2016, December 22). A comprehensive introduction to data wrangling. Springboard Blog. https://www.springboard.com/blog/data-wrangling/
- van Rossum G. (1995). Python tutorial, Technical Report CS-R9526 (Version 2.7) [Computer software]. Centrum voor Wiskunde en Informatica (CWI).
- Wright M. L., Dozmorov M. G., Wolen A. R., Jackson-Cook C., Starkweather A. R., Lyon D. E., York T. P. (2016). Establishing an analytic pipeline for genome-wide DNA methylation. Clinical Epigenetics, 8, 45. 10.1186/s13148-016-0212-7
- Wright M. L., Higgins M., Taylor J. Y., Hertzberg V. S. (2019). NuRsing research in the 21st century: R you ready? Biological Research for Nursing, 21, 114–120. 10.1177/1099800418810514
- Wright M. L., Huang Y., Hui Q., Newhall K., Crusto C., Sun Y. V., Taylor J. Y. (2017). Parenting stress and DNA methylation among African Americans in the InterGEN study. Journal of Clinical and Translational Science, 1(6), 328–333. 10.1017/cts.2018.3
- Zhao S., Watrous K., Zhang C., Zhang B. (2017). Cloud computing for next-generation sequencing data analysis. In Sen J. (Ed.), Cloud computing—architecture and applications. Intech Open. 10.5772/66732


