Abstract
This poster discusses Automated Research Workflows (ARWs) in the context of a FAIR data ecosystem for science of science research. We offer a conceptual discussion from the point of view of information science and technology, using several cases of “data problems” in science of science research to illustrate the characteristics of a FAIR data ecosystem and the expectations placed on its designers and developers. Drawing from a 10-year data science project developing GenBank metadata workflows, we incorporate the ideas of ARWs into the FAIR data ecosystem discussion to set a broader context and increase generalizability. Researchers can use this discussion as a guide for data science projects that automate research workflows in the science of science domain and beyond.
Keywords: Data ecosystem, Science of science research, Knowledge graphs, FAIR principles
INTRODUCTION
The fast growth of data and advances in computational tools have been loudly “revolutionizing” science (Atkins, 2011; Fortunato et al., 2018), while at the same time quietly changing the way research is conducted by expanding expectations of systems that serve the needs for data findability, accessibility, interoperability, and reusability (FAIR) (Wilkinson et al., 2016). The term Automated Research Workflow (ARW) refers to the tools and techniques developed to support scientific investigations in meeting the demands of a FAIR data ecosystem and ensuring research reproducibility, replicability, and trustworthiness (NAS, 2022). ARWs “integrate computation, laboratory automation, and tools from AI in the performance of tasks that make up the research process such as designing experiments, observations, and simulations; collecting and analyzing data and learning from the results to inform further experiments, observations, and simulations” (NAS, 2022, p. 1). Although ARWs have accelerated scientific discoveries and yielded benefits to society (e.g., the rapid development of COVID-19 vaccines), the wider application and adoption of ARWs remain riddled with technical, social, cultural, educational, and policy-related challenges.
Science of science research has become increasingly computational, enabled in part by the explosion of trace data from the scientific enterprise and the growing sophistication and accessibility of AI/machine learning tools (Wang & Barabási, 2021). However, before science of science data can be analyzed and run through computational workflows, considerable time and effort must be spent acquiring and processing them: data from different publication and data repositories arrive in varied formats and structures and are rarely linked to one another (Wilder-James, 2016; Qin et al., 2023). While the issues and questions raised in the NAS (2022) report apply to all disciplines at a general level, this poster discusses ARWs in the context of a FAIR data ecosystem for science of science research.
The concept of a FAIR data ecosystem has been gaining attention from government funding agencies and research communities over the last decade (Wilkinson et al., 2016). The National Institutes of Health (NIH) strategic plan for data science defines a data ecosystem as “a distributed, adaptive, open system with properties of self-organization, scalability, and sustainability inspired by natural ecosystems” (NIH, 2018, p. 29). In this environment, “data and resources become seamlessly integrated such that different data types and information about different organisms or diseases can be used easily together rather than existing in separate data ‘silos’ with only local utility” (NIH, 2018, p. 12). The National Science Foundation (NSF) has also strengthened funding for building research data ecosystems in recent years (NSF, 2022). While funding agencies provide a wish list of essential characteristics of data ecosystems, many questions remain to be answered in specific disciplinary fields: What should a data ecosystem look like? What barriers exist to its implementation and use? What social and cultural impacts may this transformed data environment have on researchers? How do we (the information science and technology community) prepare workforces and reform educational programs for the ARW-driven data ecosystem? Clearly, different disciplinary fields have their own sets of questions to address in meeting the ARW and data ecosystem challenge.
This poster offers a conceptual discussion from the point of view of information science and technology, using science of science research as a case to illustrate the characteristics and expectations of a FAIR data ecosystem. We incorporate the ideas of ARWs into the FAIR data ecosystem discussion to set a broader context and increase generalizability. In this poster, we present the conceptual architecture of FAIR data ecosystems, introducing the concepts of research entities and artifacts and a preliminary definition of a data ecosystem in the context of science of science research. We then describe a proposed ARW prototype for science of science research in the biomedical context using GenBank metadata. We identify several pressing social, cultural, and educational challenges to realizing the vision of automated research workflows that implement a FAIR data ecosystem for science of science research.
THE CONCEPTUAL ARCHITECTURE OF FAIR DATA ECOSYSTEMS
Three concepts are central to the architecture of FAIR data ecosystems:
Research entities:
refer to authors and their affiliations, publications, datasets, patents, and grant awards – the primary objects of interest in science of science research. Research entities are relatively stable and must be represented consistently and identified by globally unique identifiers. The metadata for these research entities are recorded as structured data and identified by a standard or local identifier. For example, authors can be globally and uniquely identified by the Open Researcher and Contributor ID (ORCID), publications and datasets by the Digital Object Identifier (DOI) and/or PMID (used for publications in PubMed), and NIH grants by project number. This entity representation promotes linking by relations, as in the OpenAlex entity graph (Priem et al., 2022).
Artifacts:
refer to the inputs, computational code, workflows, models, pipelines, and outputs that are used in or generated by a research lifecycle. Unlike research entities, artifacts are much more dynamic, as they are constantly revised and tuned to achieve optimal performance or outcomes. Metadata about these artifacts, including version control, is critical for provenance, verification, and reusability, which are significant properties of trustworthy computational analysis (Wing, 2021).
Data ecosystem:
refers to an architecture for acquiring, organizing, linking, and sharing research entities and associated artifacts in databases using automated methods. The data ecosystem is built on cloud/virtual infrastructure in which research entities and artifacts are represented as heterogeneous graphs to support data discovery, selection, extraction, and other operations that yield analysis-friendly datasets for researchers. The entity and artifact graphs are developed through an ontology that represents the knowledge networks of collaborations, communities, and innovations; a minimal sketch of such a graph follows below.
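To make the distinction between research entities and artifacts concrete, the sketch below represents both as nodes in a small heterogeneous graph. It is a minimal illustration only: the identifiers, relation names, attribute fields, and the choice of the Python networkx library are our assumptions for this poster, not the project’s actual data model.

```python
# Minimal sketch: research entities and artifacts as a heterogeneous graph.
# All identifiers and relation names below are illustrative placeholders.
import networkx as nx

G = nx.MultiDiGraph()

# Research entities: relatively stable nodes with globally unique identifiers.
G.add_node("orcid:0000-0000-0000-0000", kind="author", name="A. Researcher")
G.add_node("pmid:12345678", kind="publication", doi="10.1000/example")
G.add_node("nih:R01GM000000", kind="grant", agency="NIH")
G.add_node("genbank:AB000001", kind="dataset", repository="GenBank")

# Artifacts: dynamic nodes carrying version and provenance metadata.
G.add_node("workflow:metadata-pipeline@v1.3", kind="workflow",
           version="1.3", code_repo="https://example.org/pipeline")

# Relations linking entities and artifacts.
G.add_edge("orcid:0000-0000-0000-0000", "pmid:12345678", relation="authored")
G.add_edge("pmid:12345678", "nih:R01GM000000", relation="funded_by")
G.add_edge("genbank:AB000001", "pmid:12345678", relation="described_in")
G.add_edge("workflow:metadata-pipeline@v1.3", "genbank:AB000001",
           relation="processed")

# Example operation: list all publication nodes in the graph.
pubs = [n for n, attrs in G.nodes(data=True) if attrs.get("kind") == "publication"]
print(pubs)
```

In a working data ecosystem, a graph of this kind would be populated and refreshed automatically from the distributed sources named above rather than constructed by hand.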
A FAIR DATA ECOSYSTEM PROTOTYPE USING GENBANK
The GenBank metadata analytics project (Bratt et al., 2017; Qin et al., 2022) has accumulated diverse trace data sources for molecular sequence submissions and associated publications, grants, and patents. Using this data collection, we aim to seamlessly integrate and link metadata for research entities and artifacts to create graphs, not only for more effective discovery and use of data and knowledge but also for tracking the data and workflows used in research to ensure reproducibility and transparency.
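As one illustration of how such links can be acquired programmatically, the sketch below retrieves the PubMed articles cross-linked to a GenBank record through NCBI’s public E-utilities. This is our illustration, not the project’s actual ingestion code: the accession number is a placeholder, and the parsing assumes the JSON structure that esearch and elink return in retmode=json.

```python
# Sketch: follow cross-links from a GenBank (nuccore) record to PubMed.
# The accession number is a placeholder; error handling and rate limiting
# (needed for sustained use of E-utilities) are omitted for brevity.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
accession = "AB000001"  # placeholder GenBank accession

# Step 1: resolve the accession to a nuccore UID.
search = requests.get(f"{EUTILS}/esearch.fcgi",
                      params={"db": "nuccore", "term": accession,
                              "retmode": "json"}).json()
uids = search["esearchresult"]["idlist"]

# Step 2: ask elink for PubMed records linked to those UIDs.
links = requests.get(f"{EUTILS}/elink.fcgi",
                     params={"dbfrom": "nuccore", "db": "pubmed",
                             "id": ",".join(uids), "retmode": "json"}).json()

pmids = [pmid
         for linkset in links.get("linksets", [])
         for linksetdb in linkset.get("linksetdbs", [])
         for pmid in linksetdb.get("links", [])]
print(pmids)
```

Comparable link tables between submissions, publications, grants, and patents supply the raw material for the entity and artifact graphs described above.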
The prototype will follow the FAIR principles and emphasize the integration and linking of key research entities for use in metadata analytics. Because the data sources already exist in distributed systems (e.g., authors in ORCID, publications in PubMed and the Microsoft Academic Graph), this prototype will focus on linking mechanisms that connect research entities and resources in the data and knowledge spaces of the data ecosystem. Within the data ecosystem, algorithms and tools will be important for automating data ingestion, processing, transformation, and representation so that the data in the system stay up to date. A requirement for a FAIR data ecosystem is that not only the data but also the artifacts generated during data design and creation be FAIR. Standardized notes, annotations, workflows, and code files would need to be integrated for data and code provenance purposes. Ontology models will represent the conceptual architecture of this prototype for research entities and artifacts, as well as their relations. This area of work will be guided by three core principles: knowledge is expressed in a sufficiently precise notation; the knowledge representation scheme meets the criteria of adequacy and expressiveness; and reasoning and problem solving are based on the facts represented by the scheme (Qin, 2020). The prototype data ecosystem will include not only the baseline functions of findability, accessibility, interoperability, and reusability for data and artifacts but also advanced functions such as knowledge representation using “a formal, accessible, shared, and broadly applicable language” (Wilkinson et al., 2016, p. 4).
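To suggest how the ontology models could express research entities, artifacts, and their relations in such a formal, shared language, the sketch below encodes a tiny example in RDF with the rdflib library, the W3C PROV-O vocabulary, and Dublin Core terms. The vocabularies are real, but the identifiers, class assignments, property choices, and project namespace are our assumptions for illustration rather than the prototype’s actual ontology.

```python
# Sketch: entities and artifacts as RDF statements using PROV-O and DCTERMS.
# All resource identifiers below are placeholders.
from rdflib import Graph, Namespace, URIRef, Literal, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("https://example.org/sos/")  # hypothetical project namespace

g = Graph()
g.bind("prov", PROV)
g.bind("dct", DCT)
g.bind("ex", EX)

author = URIRef("https://orcid.org/0000-0000-0000-0000")   # placeholder ORCID
paper = URIRef("https://doi.org/10.1000/example")          # placeholder DOI
dataset = EX["dataset/genbank-AB000001"]                   # illustrative ID
workflow = EX["workflow/metadata-pipeline/v1.3"]           # versioned artifact
run = EX["run/2024-01-15"]                                 # hypothetical execution

# Research entities and agents.
g.add((author, RDF.type, PROV.Agent))
g.add((paper, RDF.type, PROV.Entity))
g.add((dataset, RDF.type, PROV.Entity))

# Artifacts: the workflow (with a version) and one execution of it.
g.add((workflow, RDF.type, PROV.Entity))
g.add((workflow, DCT.hasVersion, Literal("1.3")))
g.add((run, RDF.type, PROV.Activity))
g.add((run, PROV.used, workflow))

# Provenance relations linking entities and artifacts.
g.add((dataset, PROV.wasGeneratedBy, run))
g.add((dataset, PROV.wasAttributedTo, author))
g.add((paper, PROV.wasDerivedFrom, dataset))

print(g.serialize(format="turtle"))
```

Serialized in a shared notation such as Turtle, these statements remain machine-queryable (e.g., with SPARQL) and can be linked across the distributed sources that feed the prototype.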
CONCLUSION
A data ecosystem is a new way to view vast digital data and has increasingly become a new wave in the organization, management, and provision of services for digital data. Many lessons were learned from the digital library initiatives of almost 30 years ago. In this round of revolutionary change, some of those lessons remain valid, e.g., community building and outreach, but many issues are new and differ from those of 30 years ago, as discussed above. We hope this poster will stir up discussion and rethinking of the implications of data ecosystems among ASIS&T community members and beyond.
ACKNOWLEDGMENTS
The work in this poster proposal is based on the GenBank metadata analytics project funded by the National Science Foundation under Award Number 1561348 and the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM137409.
Contributor Information
Jian Qin, Syracuse University, USA.
Sarah Bratt, University of Arizona, USA.
Jeff Hemsley, Syracuse University, USA.
Alexander Smith, Syracuse University, USA.
Qiaoyi Liu, Syracuse University, USA.
REFERENCES
- Bratt, S., Hemsley, J., Qin, J., & Costa, M. (2017). Big data, big metadata and quantitative study of science: A workflow model for big scientometrics. Proceedings of the Association for Information Science and Technology, 54(1), 36–45. https://doi.org/10.1002/pra2.2017.14505401005
- Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., ... & Barabási, A. L. (2018). Science of science. Science, 359(6379). https://doi.org/10.1126/science.aao0185
- NAS (National Academies of Sciences, Engineering, and Medicine). (2022). Automated Research Workflows for Accelerated Discovery: Closing the Knowledge Discovery Loop. Washington, DC: The National Academies Press. https://doi.org/10.17226/26532
- NIH. (2018). NIH Strategic Plan for Data Science. https://datascience.nih.gov/sites/default/files/NIH_Strategic_Plan_for_Data_Science_Final_508.pdf
- NSF. (2022). New data infrastructure initiative will accelerate the advancement and impacts of social and behavioral research. https://www.nsf.gov/news/special_reports/announcements/020422.jsp
- Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv. https://arxiv.org/abs/2205.01833
- Polley, K. L., Tompkins, V. T., Honick, B. J., & Qin, J. (2021). Named entity disambiguation for archival collections: Metadata, Wikidata, and Linked Data. Proceedings of the Association for Information Science and Technology, 58(1), 520–524.
- Qin, J. (2020). Knowledge organization and representation under the AI lens. Journal of Data and Information Science, 5(1), 3–17. https://doi.org/10.2478/jdis-2020-0002
- Qin, J., Bratt, S., Hemsley, J., & Smith, A. O. (2023). Metadata analytics: A methodological discussion. In Proceedings of the International Society of Scientometrics and Informetrics (ISSI) 2023 Conference, July 3–5, 2023, Bloomington, IN.
- Qin, J., Hemsley, J., & Bratt, S. (2022). The structural shift and collaboration capacity in GenBank networks: A longitudinal study. Quantitative Science Studies, 1–20. https://doi.org/10.1162/qss_a_00181
- Qin, J., Costa, M., & Wang, J. (2015). Methodological and technical challenges in big scientometric data analytics. In iConference 2015 Proceedings. https://core.ac.uk/download/pdf/158299077.pdf
- Wang, D., & Barabási, A. L. (2021). The science of science. Cambridge University Press.
- Wilder-James, E. (2016). Breaking down data silos. Harvard Business Review, December 16. https://hbr.org/2016/12/breaking-down-data-silos
- Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3. https://doi.org/10.1038/sdata.2016.18
- Wing, J. M. (2021). Trustworthy AI. Communications of the ACM, 64(10), 64–71. https://doi.org/10.1145/3448248
