Abstract
Objective:
To use software, datasets, and data formats in the domain of Infectious Disease Epidemiology as a test collection to evaluate a novel M1 use case, which we introduce in this paper. M1 is a machine that upon receipt of a new digital object of research exhaustively finds all valid compositions of it with existing objects.
Method:
We implemented a data-format-matching-only M1 using exhaustive search, which we refer to as M1DFM. We then ran M1DFM on the test collection and used error analysis to identify needed semantic constraints.
Results:
Precision of M1DFM search was 61.7%. Error analysis identified needed semantic constraints and needed changes in handling of data services. Most semantic constraints were simple, but one data format was sufficiently complex that representing semantic constraints over it was practically impossible. From this we conclude limitatively that software developers will have to meet the machines halfway by engineering software whose inputs are sufficiently simple that their semantic constraints can be represented, akin to the simple APIs of services. We summarize these insights as M1-FAIR guiding principles for composability and suggest a roadmap for progressively capable devices in the service of reuse and accelerated scientific discovery.
Conclusion:
Algorithmic search of digital repositories for valid workflow compositions has potential to accelerate scientific discovery but requires a scalable solution to the problem of knowledge acquisition about semantic constraints on software inputs. Additionally, practical limitations on the logical complexity of semantic constraints must be respected, which has implications for the design of software.
Keywords: Reusing digital research objects, Machine-FAIR principles, Scientific discovery, Scientific workflows, Automatic workflow composition
1. Introduction
Statement of significance
| Problem | The potential of digital research object reuse to accelerate scientific discovery cannot be fully realized until machines exhaustively test every new digital object for inclusion in scientific workflows. |
| What is known | Prior work on automatic service composition and workflow generation has not solved the core inference problem of determining the validity of a composition. |
| What this paper adds | The novel M1 use case, a thorough analysis of it as an AI problem, an evaluated implementation, a limitative result about complexity of semantic constraints, M1-FAIR principles and suggested properties, and a future research roadmap that addresses open problems of knowledge acquisition. |
The reuse of data and mathematical models is fundamental to scientific discovery, as exemplified by Kepler’s reuse of Tycho Brahe’s astronomical data to prove heliocentrism [1,2]. But reuse can be significantly time-lagged, as evinced by the 34-year delay between the publication of Gregor Mendel’s mathematical model of inheritance and its reuse in 1900 [3].
Systematic efforts to facilitate reuse in biomedicine date to at least 1879 when the US National Library of Medicine (NLM) created Index Medicus and subsequently its computerized descendants (MEDLARS, MEDLINE, PubMed, etc.). These tools enable scientists to search for and access publications, which may lead them to datasets and mathematical models. Fast forwarding to the present, Goal One in NLM’s strategic plan for 2017–2027 is creating, collecting, curating, and organizing digital research objects in a way that they could be coupled in novel modes to accelerate [scientific] discovery in a broad array of disciplinary explorations [4].
Ideally, opportunities for reuse would be exhaustively identified whenever new objects appear. This requirement necessitates a fully automatic technical approach.
2. Objective
Our objective was to develop representations and automated methods to satisfy the following novel machine use case:
Use case M1.
Whenever a new digital object of research becomes available, a machine searches for existing objects that can be validly composed with the new object.
Fig. 1 shows the real-world biomedical problem that motivates our M1 use case. On the right are mathematical models of natural phenomena, arguably the most important digital objects for scientific discovery. The rate at which such models can be evaluated and used to test hypotheses depends on the rate at which existing and new datasets and software can be assembled into workflows that culminate in model inputs (as well as new models). The M1 use case simply expresses a requirement that machines immediately find the “scientific workflow closure” of new digital objects, which will over time transition processes from data wrangling clouds into machine-processed workflows.
Fig. 1.

Digital objects of research in the context of scientific discovery. On the left are real world biological phenomena of interest. On the right are mathematical models of those phenomena, expressed in software. In between are scientific workflows that transform data about the phenomena into model inputs for model verification and hypothesis testing. Legend: Arrow – direction of data flow; hexagon – software with name and type; italics – dataset type and format. Adapted from [5].
We can use Fig. 1 to pinpoint the key machine inference required by the M1 use case, which is whether a candidate set of input objects constitutes valid input for a software application. In the case of the pFRED disease transmission model, the set of candidate input objects can be datasets or outputs of software. These candidate objects must satisfy (1) the syntactic and semantic constraints on their respective pFRED input ports, and (2) any interdependency constraints across the ports, e.g., the geographic location of input 1 (synthetic ecosystem) and input 2 (model parameters) must match.
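The two kinds of constraint just described can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Candidate` and `Port` classes, the format identifiers, and the location strings are hypothetical, and the per-port check covers only the syntactic (data format) constraint, with location equality standing in for an interdependency constraint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """A hypothetical dataset (or software output) offered to an input port."""
    name: str
    data_format: str
    location: str

@dataclass(frozen=True)
class Port:
    """A hypothetical input port with a syntactic (data format) constraint."""
    number: int
    accepted_formats: frozenset

def is_valid_input(ports, candidates):
    """Return True if each candidate fits its port and all locations agree."""
    if len(ports) != len(candidates):
        return False
    # (1) Per-port constraint: the data format must be accepted by the port.
    for port, cand in zip(ports, candidates):
        if cand.data_format not in port.accepted_formats:
            return False
    # (2) Inter-port constraint: every input must refer to the same location.
    return len({c.location for c in candidates}) == 1

# Illustrative ports and candidates (all identifiers are made up).
ports = [Port(1, frozenset({"fmt-ecosystem"})), Port(2, frozenset({"fmt-params"}))]
eco_pa = Candidate("ecosystem-PA", "fmt-ecosystem", "Pennsylvania")
par_pa = Candidate("params-PA", "fmt-params", "Pennsylvania")
par_oh = Candidate("params-OH", "fmt-params", "Ohio")
```

Under these assumptions, `is_valid_input(ports, [eco_pa, par_pa])` holds, whereas pairing `eco_pa` with `par_oh` fails the inter-port check even though both data formats match.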
3. Background
In 2016, the FAIR Guiding Principles first recognized that proper representation of digital research objects was central to solving the problem of their reuse [6]. Those representations, which the Principles refer to as metadata, must make objects Findable, Accessible, Interoperable, and Reusable by both humans and machines.
Accordingly, we use both human and machine use cases shown in Table 1 to organize relevant prior work on representations. We labelled the three machine use cases M1, M2, and M3 because they—like SDI (Software Discovery Index) and DDI (Data Discovery Index)—refer to devices intended to facilitate reuse of digital research objects and scientific discovery.
Table 1.
Human and machine use cases that aim to accelerate scientific discovery through reuse of digital research objects.
| Use case | Description | Selected prior work |
|---|---|---|
| DDI | A person searches for datasets using a Data Discovery Index (DDI) | Representation: DATS [7]; NIAID Systems Biology Dataset schema [9] DataMed DDI [10,11] |
| SDI | A person searches for software using a Software Discovery Index (SDI). | Representation: CodeMeta crosswalk of software repository metadata schemas for SDIs [8,12]; NIAID Systems Biology Computational Tool schema [9] SDI workshop report [13,14] |
| M1 | Given a new object, a machine searches for other objects that can be validly combined with it | |
| M2 | Given a computational workflow, a machine automatically enacts it. | Representation: FAIR Computational Workflows [15] Pegasus WMS [16] |
| M3 | Given access to digital research objects and computational workflows, a machine develops and tests hypotheses. | Representation: Ontology of hypotheses [17], Ontology of experiments [18] DISK system for automated hypothesis testing [19] |
The DDI use case motivated the BioCADDIE project, which developed a JSON-LD schema called DATS (Data Tag Suite) for representing datasets for use in a prototype DDI spanning all of biomedicine [7].
The SDI use case motivated the CodeMeta project, which analyzes software metadata schemas used by existing software repositories [8].
Of the three machine use cases, M1 and its representational and inferential requirements are the subject of this paper. To satisfy M1, a machine must infer, based solely on representations of objects, that a candidate set of objects can interoperate. The M2 use case (automatic enactment) also has substantial representational requirements, but they are for properties such as the required operating system. So does the M3 use case, which is fully automated digital research.
Unlike the human use cases, which only require a representation language with which to represent the objects, M1 also requires an inference algorithm that can infer validity of combinations from their representations. In other words, it requires a formal system, not just a formal representation language. Examples of formal systems include propositional logic, first order logic, and probability theory.
Formal systems were first studied by the field of Mathematical Logic and later AI. The key insights about formal systems for our purposes are:
The importance of four desiderata when selecting a formal system: representational adequacy, inferential adequacy, inferential efficiency, and acquisitional efficiency [20]. Representational adequacy is also known as expressiveness. In the M1 application, it means the language must be able to express the properties of objects that would enable a machine to infer whether compositions of them are valid. Acquisitional efficiency is the ability to acquire new information easily. It includes the process of acquiring the information and the process of representing it using the constructs of the language. We refer to the latter as ease of use.
A key limitative result is the tradeoff between expressiveness and the desirable inferential properties of adequacy and efficiency; that is, as the expressiveness of a representation language increases, inference becomes more computationally expensive or intractable, or soundness or completeness is lost [21].
We chose OWL 2 [22] as the formal system for this work because of its inferential fit to the problem of inferring validity of compositions of objects. Specifically, an OWL reasoner can derive a valid Dataset → Software X composition by inferring that the dataset belongs to the class of datasets that are valid input to Software X (OWL 2 classification inference) and a valid Software A → Software B composition by inferring that the class representing Software A’s output is subsumed by the input class of Software B (OWL 2 subsumption inference). Additionally, we had domain-specific requirements for taxonomic and partonomic inference (e.g., inferring that H1N1 is a type of influenza virus) and existing OWL 2 ontologies with the requisite taxonomic and partonomic information were available. Note that we chose OWL 2 despite the NP-hard result for OWL 2 inference [23] because OWL 2 has a polynomial-time EL subset, which sacrifices expressiveness for tractability.
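The subsumption inference described above can be illustrated with a toy Python sketch. It is only an analogy: a real OWL 2 reasoner handles far richer class expressions, and the class names and the single-parent hierarchy below are illustrative assumptions rather than Apollo-SV identifiers.

```python
# Toy is-a hierarchy (child -> parent); names are illustrative only.
PARENT = {
    "H1N1": "Influenza A virus",
    "Influenza A virus": "influenza virus",
    "influenza virus": "virus",
}

def subsumes(general, specific):
    """True if `specific` equals `general` or is a descendant of it."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)  # None once the root has been passed
    return False

def composable(output_class, input_class):
    """Software A -> Software B: B's input class must subsume A's output class."""
    return subsumes(input_class, output_class)
```

In this sketch an output typed as `"H1N1"` can feed an input that requires `"influenza virus"`, but not the reverse, which mirrors the direction of the subsumption test in the text.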
In prior work, we created an RDF/SPARQL representation (OWL 2) of many of the digital research objects used in the present work and used it as the basis for a simple data-format matching search [5]. We conclude from that experience that RDF/SPARQL is inferentially inadequate for exhaustive search for data-format matching compositions, but for the above-mentioned reason of inferential fit, is a good candidate for the semantic constraint (validity) component of the M1 design described in Section 4.4.
The problem of inferring from their representations that a candidate set of digital objects can be composed into a computational workflow is also central to the problems of automatic service composition (ASC) and automatic workflow generation (AWG). Gamha [24] exemplifies the state of the art in ASC, especially for representations of semantic service descriptions. The state of the art in scientific AWG is exemplified by [25–28]. In this paper, we will refer to representations of digital objects capable of supporting machine composition as being M1-FAIR.
4. Materials and methods
This section describes the test collection of digital research objects and the elements of their representations that we used in the research. It then describes a data-format-only implementation of M1 search, which we refer to as M1DFM, and the methods we used to evaluate its precision at composing valid graphs of objects in the test collection. Note that this research stopped short of curating and representing semantic constraints on inputs and outputs; rather, we used our evaluation to identify semantic constraints and understand their representational requirements.
4.1. The test collection
The test collection comprises 779 digital objects used in infectious disease epidemiology that we had previously curated and represented for the MIDAS Digital Commons (MDC) [29]. The test collection includes 632 datasets, 69 software objects, and the 78 data formats they use. Of the 632 datasets, 406 (64%) are disease surveillance datasets, 127 (20%) synthetic ecosystems, 88 (14%) epidemic datasets, and 11 (1.7%) other. The datasets and data formats are represented by instances of the DATS 2.2 JSON-LD schema types Dataset and DataStandard, respectively.
We represented the 69 software objects as instances of a Software type or one of its 13 subtypes: Disease Transmission Model (17), Disease Forecaster (16), Pathogen Evolution Model (4), Population Dynamics Model (2), Synthetic Ecosystem Constructor (1), Disease Transmission Tree Estimator (2), Phylogenetic Tree Constructor (2), Data Format Validator (1), Data Visualizer (6), Data Format Converter (2), Data Service (8), Modeling Platform (4), and Metagenomic Analysis (4). The types are defined in the MDC Software Metadata XML Schema Definition (henceforth Software XSD) [30].
There are also OWL 2 representations of the objects, created by semiautomatic translation from the instances of the types to instances of OWL 2 classes in Apollo-SV. Appendix A.1 describes the process and provides links to files and translation programs.
4.2. The domain ontology: Apollo-SV
We use the existing Apollo-SV ontology in the schema-based (DATS and Software XSD) and OWL 2 representations [31]. Apollo-SV is an OWL 2 ontology of the domain of infectious disease epidemiology, which at the start of this project already contained classes representing phenomena in infectious disease epidemiology and population biology (e.g., epidemics, disease control). Apollo-SV follows best practices in ontology development as set forth by the Open Biological and Biomedical Ontologies (OBO) Foundry. One principle is reuse of pre-existing ontologies to the extent possible. Apollo-SV therefore reuses classes and properties from several ontologies.
In the schema-based representations, we use Apollo-SV solely as the primary controlled vocabulary for pathogens, host species, disease control measures, and other things. For example, we set the value of the isAbout DATS 2.2 property of datasets to Apollo-SV classes representing pathogens.
4.3. Properties of data formats, datasets, and software used in M1DFM search
The set of properties of digital research objects in the test collection’s schemas is broader than the set of properties that we use in this research because our goal in the prior work was to create a FAIR commons that supported a set of use cases described in the paper, including DDI and SDI functionality.
For clarity, this section states the properties of data formats, datasets, and software that we used in this research. These properties can be thought of as the minimal information required to make a set of digital research objects M1DFM-FAIR.
Data formats.
M1DFM search requires a unique identifier corresponding to a specific version of a data format, as specified in the instance of DATS 2.2 DataStandard type representing the data format.
Datasets.
M1DFM search similarly requires a unique identifier corresponding to a specific version of a dataset and the unique identifier of its data format. In the analysis of errors in the evaluation, we also used the isAbout and spatialCoverage properties in the DATS 2.2 Dataset representations.
Software.
M1DFM search similarly requires a unique identifier corresponding to a specific version of a software object, and for each input and output, its port number (1 to n) in the inputs or outputs lists, and its dataFormats. In the current implementation, M1DFM ignores the three ‘isComplete’ properties (isListOfDataFormatsComplete, isListOfOutputsComplete, and isListOfInputsComplete) because they are incompletely curated in the MDC collection as described in Section 5.2. Fig. 2A shows these and other properties in the XSD Software types. Fig. 2B shows an instance of that type for the pFRED disease transmission model.
Fig. 2.

(A) The Software, DiseaseTransmissionModel, and DataInputOrOutput XSD types. (B) the pFRED disease transmission model represented as an instance of the DiseaseTransmissionModel subtype. As shown in Fig. 2A, an instance of Software may have zero to n inputs and outputs. Each known software input or output must be numbered, denoted as being optional or not, and can have multiple data formats. The XSD allows the inputs and outputs of Software to be incomplete (is list complete?) and the list of data formats for each port to also be incomplete to accommodate incompletely documented software. It also defines the cardinality of many properties as [0:1] or [0:*] to support curation of poorly documented software. Software inputs and outputs only have syntactic (data format) constraints represented at this time. The 10 properties that comprise the minimal information required to make software M1DFM-FAIR are: identifier, inputs, isListOfInputsComplete, outputs, isListOfOutputsComplete, outputNumber, inputNumber, is[port]Optional, dataFormats, and isListOfDataFormatsComplete. Bold blue – property. Red – value. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
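The minimal properties listed above can be pictured as a small data structure. The Python sketch below is an assumption-laden paraphrase of the Software XSD, not a translation of it: the class and field names are snake_case renderings chosen for illustration, `None` stands for a property that was not curated, and `format_matches` is a hypothetical helper showing what a data-format match between two ports amounts to.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataInputOrOutput:
    number: int                          # inputNumber or outputNumber (1 to n)
    data_formats: List[str]              # identifiers of allowable data formats
    is_optional: Optional[bool] = None   # None means the property was not curated
    is_list_of_data_formats_complete: Optional[bool] = None

@dataclass
class Software:
    identifier: str                      # unique per software version
    inputs: List[DataInputOrOutput] = field(default_factory=list)
    outputs: List[DataInputOrOutput] = field(default_factory=list)
    is_list_of_inputs_complete: Optional[bool] = None
    is_list_of_outputs_complete: Optional[bool] = None

def format_matches(out_port, in_port):
    """Data-format match: the two ports share at least one format identifier."""
    return bool(set(out_port.data_formats) & set(in_port.data_formats))
```

Keeping the three completeness flags as optional values preserves the distinction between "No" and "not curated", even though the current M1DFM ignores them.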
4.4. Design and implementation of M1DFM
We designed M1DFM to handle all objects in the MDC test collection. Thus, the implementation we describe here can compose data-format-matching DAGs that include software with multiple inports and outports, some of which have multiple alternative data formats, and can include data services.
Importantly, we also had a broader requirement in mind, which was the need to design a way for semantic constraints to be applied to graph composition.
The result is a two-phase generate-and-test design in which the generate phase exhaustively composes all possible data-format-matching graphs of objects in the collection, given a new seed object. In the design, the test phase applies a semantic validator to each generated data-format-matching composition. This design is shown graphically in Appendix A.3 and the point of connection between the two phases is shown in the pseudocode of the generate-phase-only M1DFM (Fig. 3).
Fig. 3.

Pseudocode for the GenerateAndTest implementation of M1DFM. The test(g1) validity check is a no-op that always returns True in a data-format-matching-only implementation of M1.
As shown in Fig. 3, we implemented the generate phase as a bidirectional search in which, for example, a new software object s is first formed into a single-node graph g, which is then used to initiate the bidirectional search with a call to backSearch(g). The backSearch method exhaustively searches for datasets or software that can data-format satisfy the unbound inputs in the back-growing graph, which initially were just the inputs to the new software object s. Once backSearch finds a set of datasets that satisfy all the inputs of the back-growing graph, call it g1, it outputs g1 and invokes forwardSearch(g1). The forwardSearch method similarly exhaustively tries to extend the composition (the graph g1) in the forward direction. It does so by adding all software in turn from the collection that have an input that data-format matches one of the unbound outports in g1.
Generate, as currently implemented, i.e., M1DFM, requires that a graph’s root inputs terminate in either datasets or data services before it considers the graph to be a new composition and adds it to the list of compositions. The rationale is that the MDC collection contains the datasets and services needed to run the composition. At the end of each run triggered by a new object, Generate deduplicates the list and outputs it.
The source code repository for M1DFM is at [32].
4.5. Evaluation of M1DFM
The goals of the evaluation were both summative and formative. Summatively, it measures the precision with which data-format-matching alone can compose valid graphs of MDC’s digital research objects. Formatively, it uses error analysis to identify semantic constraints and other solutions that could improve the precision achieved by M1DFM.
4.5.1. The evaluation steps
Run M1DFM on each MDC software object.
A program called TestM1 created a list of software for M1DFM to run as “new” from the list of 69 software objects in the MDC JSON file [33]. TestM1 excluded 13 software objects that had neither inputs nor outputs documented, since composition is impossible in their absence. Note that TestM1 did not treat datasets as new objects because M1 back searching from software will compose every possible combination with datasets in the collection. TestM1 then initiated M1 search for each of the 56 software objects on the list, accumulating the composed graphs from each run, then deduplicating the accumulated graphs into an output set for judging.
Judge validity.
Two judges (authors MMW and WRH) independently judged the validity of each composition in the output set, recording their judgements in a scoring spreadsheet. The judges then met to adjudicate any inter-rater differences.
Calculate precision.
From the validity judgements, we calculated overall search precision and precision for each software that appeared in more than 15 M1DFM composed graphs.
Error analysis.
We then analyzed the retrieved invalid compositions to establish cause and possible solutions.
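The first evaluation step (building the run list, running the search once per "new" software object, and deduplicating) can be sketched as below. The function names and the dictionary layout of a software record are hypothetical, not taken from TestM1, and `m1_search` stands for any callable that returns hashable graphs.

```python
def build_run_list(software):
    """Keep only software with at least one documented input or output."""
    return [s for s in software if s.get("inputs") or s.get("outputs")]

def run_all(run_list, m1_search):
    """Run the search once per 'new' software and deduplicate the graphs."""
    unique = set()  # graphs must be hashable, e.g. frozensets of edges
    for s in run_list:
        unique.update(m1_search(s))
    return unique
```

Applied to the collection described here, a filter of this kind would drop the 13 objects lacking both lists and leave 56 objects to run.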
4.5.2. Precision and recall
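We measure and report M1DFM precision as:

$$\text{Precision} = \frac{\text{number of retrieved compositions judged valid}}{\text{number of compositions retrieved}}$$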
We did not measure recall because of the effort to obtain the denominator, which in the general case requires judging validity of all possible compositions of objects in a collection.
Importantly, M1 composition has 100% recall by construction if the following conditions are met:
M1 composition is based solely on data-format matching
Objects in the collection are accurately represented for all 10 minimal M1DFM-FAIR properties
All software objects require data-format matching of all their inputs as a necessary condition for input validity.
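If the above conditions are met, M1DFM will compose all graphs that satisfy the necessary (but not sufficient) criteria for validity. Thus, the count of valid compositions retrieved will equal the count of all valid compositions in the collection, and recall becomes by construction:

$$\text{Recall} = \frac{\text{number of valid compositions retrieved}}{\text{number of valid compositions in the collection}} = 100\%$$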
100% recall is a key property of an M1 that exhaustively finds uses for new digital research objects; thus, complete representation of the 10 minimal properties of software is important.
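As a worked illustration of the precision measure, the sketch below computes overall precision and precision per software object for software appearing in more than 15 composed graphs. The function names and the `(members, is_valid)` input layout are assumptions made for this example, not the format of the scoring spreadsheet.

```python
from collections import defaultdict

def precision(judgements):
    """judgements: list of bools, True when a composition was judged valid."""
    return sum(judgements) / len(judgements) if judgements else float("nan")

def precision_by_software(compositions, min_graphs=15):
    """compositions: list of (set_of_software_ids, is_valid) pairs."""
    per = defaultdict(list)
    for members, valid in compositions:
        for sw in members:
            per[sw].append(valid)
    # Report only software that appears in more than min_graphs graphs.
    return {sw: precision(v) for sw, v in per.items() if len(v) > min_graphs}
```

With 214 valid judgements out of 347 retrieved compositions, `precision` returns 214/347, or about 0.617.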
5. Results
In this section, we discuss several kinds of software that did not fit the graph composition framework well, then the completeness of software curation in the MDC collection, and then the results of the evaluation of M1DFM.
5.1. Software that did not fit the graph composition framework well
Three types of software presented representational challenges for the M1 use case:
R packages (e.g., HyPhy, spaero, Seedy) are collections of methods, each of which has inputs and outputs, thus the package itself does not have inputs and outputs in the conventional sense that inputs and outputs belong to the same processing element. Despite our concern, we represented R packages as instances of Software only to encounter errors in the data-format-matching search results described in Section 5.4. Our opinion is that R packages are not properly instances of our Software type (or Bioschemas’ ComputationalTool or schema.org’s SoftwareApplication) and should not be included in an M1 search universe. (Note: Packages are not to be confused with apps that use methods in R packages, which are properly instances of Software.)
Web services (WS) required some thought because they typically get their inputs from and send their outputs to the same software object, which graphically is a tight cycle between the software and the service, but they can also participate in pipeline combinations. We elected to represent the eight data services in the collection as instances of a DataService subtype of Software. DataService includes a dataServiceDescription property, which takes as value a complex type with three elements: accessPointType[1:1] ENUM = REST/SOAP/Custom; accessPointDescription [1:1] token; and accessPointURL[1:1] anyURI. Our design, however, did not accommodate software that relies on services in a tight loop. We have corrected that omission in Appendix A.2, which provides guidance about properties needed in conventional software schemas to support M1.
Another issue was that RESTful web services can be problematic for automatic composition due to polymorphism, so to speak, in return data formats. A simple example of the problem is a service that returns tabular data in which the number of columns returned depends on one or more parameters in the service call (input). Software engineers should avoid output data formats whose syntax changes based on software input parameters if they wish to support the M1 use case.
Unix-style software can have the same issue with polymorphic output data formats due to input parameter options and flags that can affect the format of their outputs in complex ways, so in the future engineers wishing to support M1 should avoid creating that behavior.
5.2. Completeness of software curation
This section describes the completeness of curation for the 69 software objects. Fig. 2 lists what we are referring to as the 10 M1DFM-FAIR properties and shows where they fit within the Software XSD. Note that DataInput and DataOutput are not included in the list because they are types, not properties.
All 69 software objects had the identifier property set to a unique identifier. 42 (60.9%) had both inputs and outputs lists, nine (13.0%) had only an inputs list, five (7.2%) had only an outputs list, and 13 (18.8%) had neither an inputs list nor an outputs list. Data visualizers were the most common type of software having inputs only, data services were the most common type represented as having outputs only, and the main reasons for software having no curated inputs or outputs were that they were R packages or scantily documented.
Regarding completeness of curation of inputs for the 51 software objects with inputs lists, in 20 (39.2%) the isListOfInputsComplete property was set to “Yes”. One (2.0%) had it set to “No”, 10 (19.6%) to “Unknown” and for the remaining 20 (39.2%) the isListOfInputsComplete property was omitted altogether.
Of the 86 DataInputs in the inputs lists of the 51 software, 81 (94.2%) had a dataFormats list, and of those, 45 (52.3%) had an isListOfDataFormatsComplete property set to “Yes”. Three (3.5%) had it set to “No,” and 14 (16.3%) “Unknown”. All 86 DataInputs had an inputNumber, 38 (44.2%) had an isOptional property set to “Yes” or “No” (4 and 34, respectively). Zero were set to “Unknown” and 48 (55.8%) were missing the isOptional property.
Regarding completeness of outputs curation for the 47 software objects with outputs lists, in 20 (42.6%) the isListOfOutputsComplete was set to “Yes”. None had it set to “No”, 11 (23.4%) to “Unknown” and for the remaining 16 (34.0%) the isListOfOutputsComplete property was omitted altogether.
Of the 55 DataOutputs that were in the outputs lists, 52 (94.5%) had a dataFormats list, and of those, 26 (47.3%) had an isListOfDataFormatsComplete property set to “Yes”. One (1.8%) had it set to “No”, and 12 (21.8%) “Unknown”. All 55 DataOutputs had an outputNumber, and 21 (38.2%) had an isOptional property set to “Yes” or “No” (4 and 17, respectively). Zero were set to “Unknown” and 34 (61.8%) were missing the isOptional property.
The root cause of much of the incomplete curation was inadequate documentation by software developers, especially of inputs and outputs and their required data formats. This problem explains why curators set the completeness flags for software inputs, outputs, and their data formats to “Yes” less than half the time. The schema may also have confused curators because it provided two ways to represent ignorance about whether a list was complete: setting the value to “Unknown” or not including the completeness property itself. Our guidance for achieving M1-FAIR in Appendix A.2 addresses the latter problem.
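The ambiguity between an explicit “Unknown” and an omitted completeness property can be removed at analysis time by collapsing the two. The sketch below shows one possible normalization; it is an illustration only and is not claimed to be what Appendix A.2 prescribes. The function names and the dictionary layout of a record are hypothetical.

```python
from collections import Counter

def completeness(record, prop):
    """Treat an omitted completeness property like an explicit 'Unknown'."""
    return record.get(prop, "Unknown")

def tally(records, prop):
    """Count 'Yes', 'No', and 'Unknown' values after normalization."""
    return Counter(completeness(r, prop) for r in records)
```

Applied to the 51 software objects with inputs lists reported above, this normalization would merge the 10 explicit “Unknown” values and the 20 omissions into a single count of 30.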
5.3. Precision of search for valid combinations
The number of unique graphs composed by running M1DFM for each software on the “new” list was 347, of which 214 were judged valid by the 2-judge process. Inter-rater reliability, as indicated by weighted and unweighted Cohen's kappa statistics, was 0.86 (0.81, 0.90) and 0.77 (0.71, 0.83), respectively [34]. Details are available at [35].
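For readers unfamiliar with the statistic, the unweighted Cohen's kappa for two raters can be computed as in the sketch below. This is a generic textbook formulation, not the analysis script behind the reported values; it does not compute the weighted variant or confidence intervals, and it assumes the raters do not agree purely by construction (expected agreement below 1).

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance.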
Table 2 shows the precision of the M1DFM search, by software (if it participates in more than 15 composed graphs) and overall. Precision overall was 61.7% (214 judged valid/347 retrieved). Only 20 software objects of the 56 in the run list had data-format matches with other software or datasets in the collection. In addition to the 12 software objects shown in Table 2, the other eight were: Multiple Outbreak Detection System (MODS), Pitt-Anthrax model, Pitt-SEIR model, EpiVis, Apollo FluTE, epidata2ilinet, Synthetic Microdata Household Viewer, and spew.
Table 2.
Precision of M1DFM search, by software and overall. Legend: Inport[#]:#DFs – [input port number]: count of that inport’s allowable data formats.
| Software | Inport [#]: #DFs | Outport [#]: #DFs | Composed | Valid | Precision |
|---|---|---|---|---|---|
| Disease Forecasters | | | | | |
| Stat Ensemble | [1]:2 [2]:1 | [1]:1 | 18 | 0 | 0% |
| Sarima1 | [1]:2 | [1]:1 | 21 | 21 | 100% |
| Sarima2 | [1]:2 | [1]:1 | 21 | 21 | 100% |
| KCDE | [1]:2 | [1]:1 | 21 | 21 | 100% |
| Dynamic Bayes | [1]:2 | [1]:1 | 19 | 19 | 100% |
| CMODS | [1]:2 | [1]:1 | 18 | 18 | 100% |
| pFRED | [1]:1 [2]:1 | [1]:1 | 108 | 1 | 0.9% |
| Data services | | | | | |
| Epi-Data API | [1]:a | [1]:2 | 47 | 29 | 61.7% |
| Apollo Library Viewer | | [1]:6 | 59 | 0 | 0% |
| Other software | | | | | |
| Spew2synthia | [1]:1 | [1]:1 | 208 | 104 | 50% |
| FluSight Validator | [1]:3 | b | 40 | 34 | 85% |
| FluSight viewer | [1]:1 | b | 40 | 34 | 85% |
| Overall | | | 347 | 214 | 61.7% |
a Undocumented data format
b Outputs not documented
5.4. Error analysis
We used error analysis of the 133 invalid compositions to identify semantic constraints that, when coupled with data-format matching, could improve the precision of machine determination of validity.
Table 3 summarizes the kinds of errors and the possible solutions, which we now discuss.
Table 3.
Error analysis.
| Error | Instances | Potential Solutions |
|---|---|---|
| Inter-inport constraints on location and temporal range not guaranteed to be satisfied by data service. Constraints must be managed by the invoking software. | 18 (13.4%) | Add requiredDataServices property to Software type. A software and data service can then be treated as a unit by M1. For handling newly minted data services the solution might be adding a data service discovery method to M1. To include data services in the M1 search space, try what ASC uses: type subsumption between software input and service endpoint. |
| Inter-inport mismatch on location | 2 (1.5%) | Inport 1 represents that it hasLocation some location. Inport 2 represents that it hasLocation some location. Each dataset represents (when relevant) that it hasLocation locationA. The software represents an inter-input constraint that the location of Inport 1 must equal location of Inport 2. Test checks the pair of datasets [Inport1 match, Inport 2 match] against the inter-input constraint. If yes, valid. |
| Inter-inport mismatch on location granularity (state vs. county) | 1 (0.7%) | Inport 1 represents that it hasLocationGranularity some granularity. Inport 2 represents that it hasLocationGranularity some granularity. The software represents an inter-input constraint that the granularity of Inport 1 must equal granularity of Inport 2. Each dataset represents the list of all the granularity levels (e.g. admin 1, admin 2) it contains. Test checks to see if the intersection of the two datasets’ lists of granularity levels is nonempty. If yes, valid. |
| Pathogen constraint on inport not satisfied by dataset | 1 (0.7%) | Inport represents that it hasPathogen pathogenA. Dataset represents that it hasPathogen pathogenB. Test checks if pathogenA subsumes pathogenB. If yes, valid. |
| Pathogen constraint on inport not guaranteed to be satisfied by data service | 2 (1.5%) | The data service implements an API method to return a list of all pathogens for which it has datasets. The software inport represents that it hasPathogen pathogenA. Test retrieves the list of pathogens from the service and checks whether each one is subsumed by pathogenA. If one or more in the list are subsumed, then valid. |
| Location constraint on inport not guaranteed to be satisfied by data service | 5 (3.7%) | The data service implements an API method to return a list of all locations for which it has datasets. The software inport represents that it hasLocation locationA. Test determines whether locationA is in the list. If yes, valid. |
| A compound constraint isn’t satisfied. 1. Location constraint on Inport[1] of pFRED = synthetic ecosystem that hasLocation some location. 2. Location constraint on Inport[2] of pFRED = infectious disease scenario that hasLocation some location. 3. The location for Inport[1] = the location for Inport[2]. 4. Semantic constraint #1 above must propagate back via Outport[1] of spew2synthia to Inport[1] of spew2synthia. | 104 (77.6%) | In addition to the solution for inter-inport mismatch: The outport of the software bound to Inport1 represents that it hasLocation some location. The inport of the software bound to Inport 1 represents that it hasLocation some location. That software represents an inter-port constraint stating that the location that applies to its inport also applies to its outport. We say that LocationA applies to an inport if either (1) it is bound to a dataset that hasLocation LocationA or (2) it is bound to an outport of a software to which LocationA applies. Test propagates location applications to Inport1, then checks the inter-input constraint. If satisfied, valid. Note: the inter-input constraint fails when no location applies to Inport1. |
| Total errors | 133 | |
5.4.1. Solutions related to data services
Three of the seven categories of error in Table 3 involved data services. The underlying problem is that unless a software is designed to work with a data service, there is no guarantee that it can process every possible output of a desired service’s endpoint. An easy partial solution to the problem is simply adding a requiredDataServices property to the Software type so that M1 can treat a software object and data service pair as a single unit, and therefore not have to backsearch from a non-data service software to a data service.
A second solution also seeks to combine a software object and a data service into a unit, but it does not assume that the combinations are known in advance. Rather, a service discovery algorithm [36] would search repositories of service descriptions for services that can be combined with software in new ways. A third potential solution is inspired by prior work on service composition: require the data service endpoint to return a type that is subsumed by the type of the software’s inport.
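The data-service checks proposed in Table 3 can be illustrated with a minimal sketch. It assumes, hypothetically, that a data service exposes API methods listing the pathogens and locations it covers (modelled here as plain lists). The toy taxonomy and every class and function name below are illustrative assumptions and are not part of M1DFM.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy taxonomy (child -> parent); an assumption for illustration only.
PARENT = {
    "Influenza A H1N1": "Influenza A",
    "Influenza A": "Influenza",
}

def subsumes(general: str, specific: Optional[str]) -> bool:
    """True if `specific` equals `general` or is one of its descendants."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False

@dataclass
class DataService:
    name: str
    pathogens: List[str]  # result of a hypothetical "list pathogens" method
    locations: List[str]  # result of a hypothetical "list locations" method

@dataclass
class Inport:
    has_pathogen: Optional[str] = None
    has_location: Optional[str] = None

def service_may_satisfy(inport: Inport, service: DataService) -> bool:
    """Apply the Table 3 checks: at least one served pathogen must be
    subsumed by the inport's pathogen, and the inport's location must
    appear in the service's list of locations."""
    if inport.has_pathogen is not None:
        if not any(subsumes(inport.has_pathogen, p) for p in service.pathogens):
            return False
    if inport.has_location is not None:
        if inport.has_location not in service.locations:
            return False
    return True

flu_service = DataService("flu-service", ["Influenza A H1N1"], ["Allegheny County, PA"])
print(service_may_satisfy(Inport(has_pathogen="Influenza"), flu_service))
print(service_may_satisfy(Inport(has_pathogen="B. anthracis"), flu_service))
```

A passing check only shows that the service could return a valid dataset. As the text notes, it does not guarantee that every output of the endpoint is valid for the inport.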
One error involving data services revealed a fundamental limitation to semantic matching: The IDS data format was designed to represent infectious disease scenarios involving almost any pathogen and almost any disease control measure, whereas no disease transmission model is that general. Expressing just the semantic constraints defining what is a valid school closure disease control measure for input to a particular disease transmission model would have been impossible in a schema-based representation or in OWL 2. And that is just one kind of control measure in the IDS data format, which also includes disease transmission parameters and population immunity statistics.
5.4.2. Errors unrelated to data services
For the remaining four types of error, we propose a set of graph-checking procedures. For clarity, we explain first how test() would validate the simplest possible graph and subsequently consider more complex graph structures.
M1 calls test(g) with a single-software node graph.
The simplest case is a graph with one single-inport software node that is bound to a dataset. Checking the validity of that graph requires classification inference to test whether the dataset is an instance of the class of valid datasets for that inport. The graph with the Pitt-Anthrax model and the influenza infectious disease scenario dataset is an example of this situation. Here, we must represent on the model’s lone inport that its class of valid datasets has a property hasPathogen = B. anthracis. Any dataset that hasPathogen = B. anthracis or one of its strains will classify as an instance of this class and is valid. The influenza scenario will not classify, and thus test(g) would reject this graph as invalid.
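The single-node case can be sketched in a few lines. This is a simplified stand-in for OWL 2 classification inference, not the authors’ implementation: the taxonomy is a toy child-to-parent map, and the names `IS_A`, `subsumes`, and `check_single_node` are hypothetical.

```python
from typing import Optional

# Toy pathogen taxonomy (child -> parent); assumed for illustration only.
IS_A = {
    "B. anthracis Ames strain": "B. anthracis",
    "Influenza A": "Influenza virus",
}

def subsumes(general: str, specific: Optional[str]) -> bool:
    """Return True if `specific` is `general` or a descendant of it."""
    while specific is not None:
        if specific == general:
            return True
        specific = IS_A.get(specific)
    return False

def check_single_node(inport_has_pathogen: str, dataset: dict) -> bool:
    """Classify the dataset against the inport's class of valid datasets."""
    return subsumes(inport_has_pathogen, dataset.get("hasPathogen"))

anthrax_scenario = {"name": "anthrax scenario", "hasPathogen": "B. anthracis Ames strain"}
influenza_scenario = {"name": "influenza scenario", "hasPathogen": "Influenza A"}

print(check_single_node("B. anthracis", anthrax_scenario))    # True
print(check_single_node("B. anthracis", influenza_scenario))  # False
```

A dataset with no hasPathogen value is rejected in this sketch, which mirrors the requirement that the dataset must positively classify as an instance of the inport’s class.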
For this solution to work, the dataset and software metadata must use the same property names. In this example, they would both need to use hasPathogen. One limitation of MDC metadata is that the DATS dataset uses the isAbout property whereas the software XSD uses pathogenCoverage. Our OWL 2 transformation mapped these properties to the single property hasPathogen, as a prerequisite for enabling the classification inference.
If the software has more than one port bound to datasets, then in addition to checking the validity of each dataset–inport combination, test(g) must also check any inter-inport constraints. For example, pFRED has two inports—an infectious disease scenario and a synthetic ecosystem—that must be about the same location. Thus, test(g) must check whether both datasets have the same value of hasLocation. The graph is valid only if they do. Like the situation with hasPathogen, our OWL 2 transformation mapped the DATS spatialCoverage property and software XSD locationCoverage property to hasLocation.
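As a rough illustration of the property mapping and the inter-inport check, the sketch below normalizes the source property names mentioned above to hasPathogen and hasLocation and then compares locations. It uses plain dictionaries rather than OWL 2, treats isAbout as referring only to pathogens (a simplification), and the function names are hypothetical.

```python
# Source property names from the text, mapped to the shared names.
PROPERTY_MAP = {
    "isAbout": "hasPathogen",           # DATS dataset metadata
    "pathogenCoverage": "hasPathogen",  # software XSD
    "spatialCoverage": "hasLocation",   # DATS dataset metadata
    "locationCoverage": "hasLocation",  # software XSD
}

def normalize(metadata: dict) -> dict:
    """Rename known source properties to the shared property names."""
    return {PROPERTY_MAP.get(key, key): value for key, value in metadata.items()}

def same_location(dataset_a: dict, dataset_b: dict) -> bool:
    """Inter-inport constraint: both datasets must have the same hasLocation."""
    loc_a = normalize(dataset_a).get("hasLocation")
    loc_b = normalize(dataset_b).get("hasLocation")
    return loc_a is not None and loc_a == loc_b

scenario = {"isAbout": "Influenza A", "spatialCoverage": "Allegheny County, PA"}
ecosystem = {"spatialCoverage": "Allegheny County, PA"}
other_ecosystem = {"spatialCoverage": "Philadelphia County, PA"}

print(same_location(scenario, ecosystem))        # True
print(same_location(scenario, other_ecosystem))  # False
```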
M1 calls test(g) with a two-node graph.
In addition to checking all dataset–inport combinations and inter-input validity constraints across them, a 2-node graph requires two additional steps. First, for every software outport–software inport binding, test(g) must ensure that any constraint on the outport is compatible with the constraints on the inport. This entails the outport representing the class of datasets that it can output and the inport representing its class of valid datasets. Then, test(g) checks validity via subsumption testing: the inport class must subsume the outport class.
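The outport-to-inport subsumption test might look like the following sketch. It assumes that each port’s class can be summarized as a dictionary of property values and that values are ordered by a toy child-to-parent hierarchy. Both are simplifying assumptions standing in for OWL 2 class subsumption, and all names are hypothetical.

```python
from typing import Dict, Optional

# Toy value hierarchy (child -> parent); assumed for illustration only.
IS_A = {
    "Allegheny County, PA": "Pennsylvania",
    "Pennsylvania": "United States",
}

def value_subsumes(general: str, specific: Optional[str]) -> bool:
    """True if `specific` equals `general` or is one of its descendants."""
    while specific is not None:
        if specific == general:
            return True
        specific = IS_A.get(specific)
    return False

def inport_subsumes_outport(inport_class: Dict[str, str],
                            outport_class: Dict[str, str]) -> bool:
    """Every property the inport constrains must be guaranteed by the
    outport with a value that the inport's value subsumes."""
    for prop, required in inport_class.items():
        if not value_subsumes(required, outport_class.get(prop)):
            return False
    return True

inport = {"hasLocation": "Pennsylvania"}
print(inport_subsumes_outport(inport, {"hasLocation": "Allegheny County, PA"}))  # True
print(inport_subsumes_outport(inport, {"hasLocation": "United States"}))         # False
```

An outport that makes no statement about a constrained property fails the check in this sketch, because nothing guarantees that its outputs fall within the inport’s class.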
The second additional step in 2-node graphs is constraint propagation. In the invalid spew2synthia–pFRED graphs, pFRED was bound to spew2synthia via its synthetic ecosystem inport and to an infectious disease scenario dataset for Allegheny County, PA via its other inport. The location of the dataset output by spew2synthia must also be for Allegheny County for the graph to be valid.
The spew2synthia data-format conversion software preserves location: the location of its input dataset is also the location of its output dataset. Thus, the hasLocation value on the input dataset determines the value of hasLocation on spew2synthia’s output. Checking the inter-input constraint on pFRED’s two inports thus requires propagating the hasLocation value of the dataset forward to spew2synthia’s outport.
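Forward propagation of hasLocation through a location-preserving software node can be sketched as a short recursion. The `Dataset` and `Software` classes, the `preserves_location` flag, and the inport names are hypothetical modelling choices, not the representation used in M1DFM.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

@dataclass
class Dataset:
    name: str
    has_location: Optional[str] = None

@dataclass
class Software:
    name: str
    preserves_location: bool = False  # inter-port constraint: inport location = outport location
    inputs: Dict[str, Union["Dataset", "Software"]] = field(default_factory=dict)

def applied_location(source: Union[Dataset, Software]) -> Optional[str]:
    """Location that applies to whatever is bound to an inport."""
    if isinstance(source, Dataset):
        return source.has_location
    if source.preserves_location:
        locations = {applied_location(s) for s in source.inputs.values()}
        if len(locations) == 1:
            return locations.pop()
    return None  # no location can be propagated

def same_location(node: Software, inport_a: str, inport_b: str) -> bool:
    """Inter-input constraint; fails when no location applies to an inport."""
    loc_a = applied_location(node.inputs[inport_a])
    loc_b = applied_location(node.inputs[inport_b])
    return loc_a is not None and loc_a == loc_b

spew2synthia = Software("spew2synthia", preserves_location=True,
                        inputs={"in1": Dataset("SPEW ecosystem", "Allegheny County, PA")})
pfred = Software("pFRED", inputs={
    "ecosystem": spew2synthia,
    "scenario": Dataset("scenario", "Allegheny County, PA"),
})
print(same_location(pfred, "ecosystem", "scenario"))  # True
```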
5.5. Representational requirements of the M1 use case
Fig. 4 presents our guidance for achieving M1-FAIR Composability. Appendix A.2 specifies properties that conventional representations (“metadata schemas”) for data formats, datasets, and software must include to support the M1 use case.
Fig. 4. M1-FAIR Composability. A machine subtype of FAIR interoperability.
6. Discussion
In this research, we took a first-principles approach to the problem of automatically searching a collection of datasets and software for valid workflow compositions, with particular attention to representation. In short, we were working out how to build an M1, which ultimately entails solving the problem of knowledge acquisition about semantic constraints. The ambition of that goal is why we end this discussion with a roadmap.
Our key methodological contribution, using data-format matching as the foundation of M1 composition, builds on what software developers already do to ensure that inputs to their software are valid. However, we identified significant gaps between current documentation practice and M1DFM’s requirements, suggesting that software schemas being promulgated for use in FAIR representations should include M1-FAIR properties and that funders should incentivize software developers to use them. A practical constraint identified by Tsueng et al. [9] is that contributors to repositories only represent a small number of requested properties, which influenced their decision to limit the number of required elements in their ComputationalTool schema to six and supports our similar decision to limit the number of M1-FAIR properties.
Our evaluation methodology is innovative in its use of error analysis to identify missing semantic constraints. It further identifies conditions under which M1DFM recall of valid compositions becomes 100% by construction. This desirable property reinforces the need for complete documentation.
The results of the evaluation suggest that M1DFM’s precision of 61.7% could be improved by better handling of data services and by representing and checking semantic constraints on individual inputs and across multiple inputs. However, 77% of errors would require propagation of semantic constraints from inputs to outputs. Additionally, one error was caused by a data format so complex that it would be practically impossible to represent constraints over it, supporting the limitative conclusion about semantic constraint complexity that we articulated as M1-FAIR composability guidance in Fig. 4.
A limitation of the evaluation was the two-judge assessment of composition validity. The method was appropriate for this stage of M1 development, but rigor of evaluation should progress towards judging the validity of the outputs of enacted compositions. A second limitation was inclusion of software that was not fully curated for the 10 M1-FAIR properties, which meant that we could not measure recall.
The M1 machine use case is our main conceptual contribution. It led us to additionally conceptualize FAIR Interoperability as M1-machine composability, which aligns with work on AWG and ASC, areas that share the goal of automatic composition of digital objects. Both represent inputs, outputs, and their constraints, but they differ in how they approach the hard problem that semantic constraints pose for automatic determination of composition validity. ASC’s primary strategy has been keeping input and output parameters so simple that something like a single Semantic Annotations for WSDL (SAWSDL) ontology annotation might suffice to express the constraint. But since the most common evaluation is a case study to demonstrate feasibility, we do not know if the approach generalizes. The well-known Semantic Web Challenges unfortunately used synthetic test collections with semantic constraints so trivial that contestants routinely achieved 100% correctness and completeness [37]. That said, the service paradigm checks many M1-FAIR boxes, including explicit representation of inputs and outputs, data formats, and semantic annotation [38]. And web services categorically check M2-FAIR (enactment) boxes. Thus, developers of scientific software and models should consider the paradigm. AWG avoids some of the challenge of representing semantic constraints by seeking only to automate the instantiation of manually designed workflow templates [39]. Still, to our knowledge, evaluation has been limited to proof-of-concept case studies, so we also do not know how well AWG approaches generalize.
A discussion of future work must begin with knowledge acquisition of semantic constraints. Our limited solution, identifying semantic constraints through error analysis of the results of data-format-matching-only search, would work well within individual domains but would likely not scale because it depends on judges capable of determining the validity of combinations. But it may be a sufficient basis for domains to develop test collections of semantically represented objects and gold standard compositions, which have been lacking in recent ASC and AWG research. Some scalable approaches to consider include mining constraints from computational workflows or eliciting the semantic constraint knowledge that is mainly in the heads of software developers but possibly also encoded in their software’s internal validity checkers. So, in a scientific domain that produces its own models, it might be possible for software developers to articulate the constraints or externalize their procedural validity checkers, including source code.
A roadmap that we think increases the chance of finding scientific communities willing and able to contribute to the knowledge acquisition effort starts with making M1DFM available, which we have done [32]. M1DFM may find use by repositories that wish to add composition to existing tools intended to facilitate reuse (e.g., DDI and SDI), or by researchers wishing to replicate our results or benchmark other approaches to composition, such as the M1DFM,semantic approaches described below.
The roadmap then progresses through the following anticipated versions that differ on the method for testing semantic validity.
M1DFM,semantic checking by procedural validators.
This version would use externalized software input validators as its test for validity, thus its feasibility depends on the willingness of software developers to externalize software input validators or create them de novo. A key advantage of this admittedly scruffy approach is that once you have the procedural input validators, you have the basis for (1) creating a test collection for a semantic composition challenge in which the gold standard set of compositions is formed from the set of composable concrete workflows in the collection, and (2) creating training data for supervised learning of constraints via auto-labelling by each software’s validator.
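To make the auto-labelling idea concrete, the sketch below treats each externalized input validator as a callable and uses it to label software and dataset pairs as valid or invalid, yielding rows that could serve as training data. The validator shown is a hypothetical stand-in; no real software’s validator is reproduced here, and the function and variable names are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Validator = Callable[[dict], bool]

def auto_label(validators: Dict[str, Validator],
               datasets: List[dict]) -> List[Tuple[str, str, bool]]:
    """Label every (software, dataset) pair with the validator's verdict."""
    labelled = []
    for software, validate in validators.items():
        for dataset in datasets:
            labelled.append((software, dataset["name"], bool(validate(dataset))))
    return labelled

# Hypothetical validator for an anthrax-only model.
validators = {
    "anthrax_model": lambda d: d.get("hasPathogen") == "B. anthracis",
}
datasets = [
    {"name": "anthrax scenario", "hasPathogen": "B. anthracis"},
    {"name": "influenza scenario", "hasPathogen": "Influenza A"},
]

for row in auto_label(validators, datasets):
    print(row)
```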
M1DFM,semantic checking by declarative validators.
This version would use logical inference, e.g., an OWL 2 validator, as its test of validity. Accordingly, a prerequisite for its construction is acquiring and representing knowledge about the semantic constraints on software inputs and the semantic properties of outputs and datasets in the collection according to the M1-FAIR principles articulated in Fig. 4, which include the re-factoring of software and even data formats to reduce the complexity of semantic constraints [40].
In summary, this research on making objects M1-FAIR is one step on the path to Machine-FAIR. The next step would be making them M2-FAIR, which adds the representational requirements of a machine that enacts valid compositions. The step after that is M3-FAIR, which adds further requirements for a machine that chooses a specific workflow to execute in the service of testing a hypothesis or validating a model. Together, these steps have the potential to accelerate scientific discovery to a degree sufficient to warrant considering the goal itself a Grand [41] or even Nobel Turing Challenge [42].
7. Conclusion
Algorithmic search of digital repositories for valid workflow compositions has potential to accelerate scientific discovery, but it requires scalable solutions to the problems of knowledge acquisition and representation of semantic constraints on software inputs and of semantic features of datasets and software outputs. Our evaluation method for identifying semantic constraints in a domain collection could be used to create test collections that would advance research. Additionally, practical and theoretical limits on the logical complexity of semantic constraints must be respected, which has implications for the design of new software and the refactoring of existing software.
Acknowledgments
This work was supported by the National Institute of General Medical Sciences (NIGMS) award numbers U24GM110707 and R01GM101151. This paper does not represent the view of NIGMS.
Footnotes
CRediT authorship contribution statement
Michael M. Wagner: Writing – review & editing, Writing – original draft, Supervision, Software, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. William R. Hogan: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. John D. Levander: Writing – original draft, Software, Data curation, Conceptualization. Matthew Diller: Writing – review & editing, Writing – original draft, Software, Data curation.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jbi.2024.104647.
We use “software object” because “software” is an uncountable noun and alternatives like “software application” were too narrow.
M1DFM uses integer identifiers for efficiency but maintains a map to the DOIs of the objects in the repository.
We loaded the MDC objects into HashMaps before the first software object on the “new” list was run, rather than adding new objects to the collection one at a time and then initiating search with them as would happen in M1 operation. Our rationale was that the same set of compositions would be generated.
In principle, if one were to represent each method in an R package as a first-class digital object with a DOI and inputs and outputs, each having data formats, then those method-objects would be amenable to M1 search.
References
- [1] Ferguson K, Tycho & Kepler: The unlikely partnership that forever changed our understanding of the heavens, Bloomsbury Publishing USA, 2002.
- [2] Riebeck H. Planetary Motion: The history of an idea that launched the scientific revolution. 2009. https://earthobservatory.nasa.gov/features/OrbitsHistory.
- [3] Roberts HF. Chapter 11. The discovery of Mendel’s papers. Plant Hybridization Before Mendel. Princeton University Press, Princeton, New Jersey: Humphrey Milford, Oxford University Press; 1929. p. 320–58.
- [4] National Library of Medicine (U.S.). Board of Regents. A platform for biomedical discovery and data-powered health: National Library of Medicine strategic plan 2017–2027 / report of the NLM Board of Regents. NIH publication. National Institutes of Health, National Library of Medicine; 2017. https://www.nlm.nih.gov/pubs/plan/lrp17/NLM_StrategicReport2017_2027.pdf.
- [5] Wagner MM, Hogan WR, Levander J, Darr A, Diller M, Sibilla M, et al. Creating a discipline-specific commons for infectious disease epidemiology. arXiv:2311.06989.
- [6] Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al., Comment: The FAIR Guiding Principles for scientific data management and stewardship, Sci Data. 3 (2016).
- [7] Sansone SA, Gonzalez-Beltran A, Rocca-Serra P, Alter G, Grethe JS, Xu H, et al., DATS, the data tag suite to enable discoverability of datasets, Sci Data. 4 (2017) 170059.
- [8] Jones MB, Boettiger C, Mayes AC, Smith A, Slaughter P, Niemeyer K, et al. CodeMeta github. 2023. https://github.com/codemeta/codemeta.
- [9] Tsueng G, Cano MAA, Bento J, Czech C, Kang M, Pache L, et al., Developing a standardized but extendable framework to increase the findability of infectious disease datasets, Sci Data. 10 (2023) 99.
- [10] Ohno-Machado L, Sansone SA, Alter G, Fore I, Grethe J, Xu H, et al., Finding useful data across multiple biomedical data repositories using DataMed, Nat Genet 49 (2017) 816–819.
- [11] Chen X, Gururaj AE, Ozyurt B, Liu R, Soysal E, Cohen T, et al., DataMed - an open source discovery index for finding biomedical datasets, J Am Med Inform Assoc 25 (2018) 300–308.
- [12] Jones MB. CodeMeta crosswalk.csv. 2023. https://github.com/codemeta/codemeta/blob/master/crosswalk.csv.
- [13] Bonazzi V, Bourne P, Brenner S, Brown R, Chandramouliswaran I, Couch J, et al. Software Discovery Index Workshop Report. 2015. https://nciphub.org/resources/885.
- [14] Bonazzi V. Software Discovery Index Workshop Report. 2015. https://www.softwarediscoveryindex.org/.
- [15] Goble C, Cohen-Boulakia S, Soiland-Reyes S, Garijo D, Gil Y, Crusoe MR, et al., FAIR computational workflows, Data Intelligence. 2 (2020) 108–121.
- [16] Deelman E, Vahi K, Rynge M, Mayani R, da Silva R, Papadimitriou G, et al., The evolution of the Pegasus workflow management software, Comput. Sci. Eng 21 (2019) 22–36.
- [17] Garijo D, Gil Y, Ratnakar V. The DISK Hypothesis Ontology: Capturing hypothesis evolution for automated discovery. K-CAP ‘17 SciKnow. Austin, TX: 2017.
- [18] Soldatova LN, King RD, An ontology of scientific experiments, J R Soc Interface. 3 (2006) 795–803.
- [19] Gil Y, Garijo D, Ratnakar V, Rajiv M, Adusumilli R, Boyce H, et al., Towards continuous scientific data analysis and hypothesis evolution, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Press, San Francisco, California, USA, 2017, pp. 4406–4414.
- [20] Rich E, Knight K, Nair SB, Artificial intelligence, 3rd ed., McGraw-Hill, New York, 2009.
- [21] Brachman RJ, Levesque HJ, The tradeoff between expressiveness and tractability. Knowledge Representation and Reasoning, Morgan Kaufmann, Amsterdam; Boston, 2004, pp. 327–348.
- [22] Hitzler P, Krötzsch M, Parsia B, Patel-Schneider PF, Rudolph S. OWL 2 Web Ontology Language Primer. W3C; 2009. https://www.w3.org/TR/owl2-primer/.
- [23] W3C. OWL 2 Web Ontology Language Profiles (Second Edition). Section 5 Computational Properties. 2012. https://www.w3.org/TR/owl2-profiles/.
- [24] Gamha Y, A framework for REST services discovery and composition, SOCA 17 (2023) 259–275.
- [25] Gil Y, Ratnakar V, Kim J, Gonzalez-Calero P, Groth P, Moody J, et al., Wings: Intelligent workflow-based design of computational experiments, IEEE Intell. Syst 26 (2011) 62–72.
- [26] Gil Y, González-Calero PA, Kim J, Moody J, Ratnakar V, A semantic framework for automatic generation of computational workflows using distributed data and component catalogues, J. Exp. Theor. Artif. Intell 23 (2011) 389–467.
- [27] Atkinson M, Gesing S, Montagnat J, Taylor I, Scientific workflows: Past, present and future, Futur. Gener. Comput. Syst 75 (2017) 216–227.
- [28] Lamprecht AL, Palmblad M, Ison J, Schwämmle V, Al Manir MS, Altintas I, et al., Perspectives on automated composition of workflows in the life sciences, F1000Res 10 (2021) 897.
- [29] Gonzalez-Beltran A. DATS github. 2017. https://github.com/biocaddie/DATS.
- [30] Levander J, Darr A. MDC Software XSD. 2019. GitHub repository. https://github.com/midas-isg/mdc-xsd-and-types/blob/master/src/main/resources/software.xsd.
- [31] Hogan WR, Wagner MM, Brochhausen M, Levander J, Brown ST, Millett N, et al., The Apollo Structured Vocabulary: an OWL2 ontology of phenomena in infectious disease epidemiology and population biology for use in epidemic simulation, J Biomed Semantics. 7 (2016) 50.
- [32] m1-df-only source code release v2024-02-20. GitHub; 2024. https://github.com/mcwdsi/m1-df-only/releases/tag/v2024-02-20.
- [33] Corrected MDC JSON file. https://github.com/mcwdsi/m1-df-only/blob/master/src/main/resources/all_mdc_contents_from_api_2019-05-03-curated.json.
- [34] McHugh ML, Interrater reliability: the kappa statistic, Biochem Med (Zagreb). 22 (2012) 276–282.
- [35] Hogan WR, Wagner MM. M1-data-format-only compositions and two-rater judgments of composition validity. 2024. 10.5281/zenodo.10981456.
- [36] Rodriguez-Mier P, Pedrinaci C, Lama M, Mucientes M, An integrated semantic web service discovery and composition framework, IEEE Trans Serv Comput 9 (2015).
- [37] Blake B, Cabral L, König-Ries B, Küster U, Martin D, Semantic Web Services: Advancement through Evaluation, Springer, Berlin, Heidelberg, 2012.
- [38] Verborgh R, Harth A, Maleshkova M, Stadtmüller S, Steiner T, Taheriyan M, et al., Survey of semantic description of REST APIs, in: Pautasso C, Wilde E, Alarcon R (Eds.), REST: Advanced Research Topics and Practical Applications, Springer, New York, 2014, pp. 69–89.
- [39] Gil Y, Workflow composition: Semantic representations for flexible automation, in: Taylor IJ, Deelman E, Gannon DB, Shields M (Eds.), Workflows for e-Science: Scientific Workflows for Grids, Springer, London, 2007, pp. 244–257.
- [40] Hogan WR, Wagner MM, Demonstration of semantic and inter-input constraints on software in OWL 2 and SPARQL for fulfilling the M1 Machine FAIR Use Case, Zenodo (2023), 10.5281/zenodo.10116497.
- [41] Gil Y, Will AI write scientific papers in the future? AI Mag 42 (2022) 3–15.
- [42] Kitano H, Nobel turing challenge: creating the engine for scientific discovery, npj Syst. Biol. Appl 7 (2021) 29.