An Automated Workflow Composition System for Liquid Chromatography–Mass Spectrometry Metabolomics Data Processing

Xinsong Du; Farhad Dastmalchi; Matthew A Diller; Mathias Brochhausen; Timothy J Garrett; William R Hogan; Dominick J Lemas

doi:10.1021/jasms.3c00248

Author manuscript; available in PMC: 2025 Jun 4.

Published in final edited form as: J Am Soc Mass Spectrom. 2023 Oct 24;34(12):2857–2863. doi: 10.1021/jasms.3c00248

PMCID: PMC12135213 NIHMSID: NIHMS2069779 PMID: 37874901

Abstract

Liquid chromatography–mass spectrometry (LC–MS) metabolomics studies produce high-dimensional data that must be processed by a complex network of informatics tools to generate analysis-ready data sets. Data processing is the first computational step in metabolomics, and developing customized computational workflows for it is an increasing challenge for researchers. Ontology-based automated workflow composition (AWC) systems provide a feasible approach for developing computational workflows that consume high-dimensional molecular data. We used the Automated Pipeline Explorer (APE) to create an AWC system for LC–MS metabolomics data processing across three use cases. APE predicted 145 data processing workflows across the three use cases. We identified six traditional workflows and six novel workflows. Through manual review, we found that one-third of the novel workflows were executable, meaning that the data processing function could be completed without an error. When selecting the top six workflows from each use case, the computationally viable rate of our predicted workflows reached 45%. Collectively, our study demonstrates the feasibility of developing an AWC system for LC–MS metabolomics data processing.

1. INTRODUCTION

Clinical metabolomics has become a novel approach for biomarker discovery with the translational potential to guide next-generation therapeutics and precision health interventions.¹,² LC–MS has emerged as a popular data acquisition technique because of its high sensitivity and specificity in metabolite identification.³ A typical workflow for LC–MS metabolomics includes sample preparation, data acquisition, data processing, and data interpretation, among which data processing is the first computational step and critical for downstream computational analysis.² LC–MS metabolomics data processing consists of signal processing techniques such as noise filtering⁴ and peak deconvolution.⁵ To support these analyses, freely available software has been developed,⁶ including XCMS,⁷ MZmine,⁸ and MS-DIAL.⁹ A primary challenge to reproducibility of LC–MS metabolomics data processing workflows is workflow decay,¹⁰,¹¹ which is defined as the computational failure or reduced ability to execute or repeat a computational procedure over time.¹² Recent years have seen a massive increase in the volume of clinical metabolomics data to be analyzed, as well as the complexity of that analysis.¹³ Consequently, the importance of defining workflows as abstract representations of data flow between multiple analytic tools has become a core part of computational scientific analysis.¹⁴

With continuous use of workflows in scientific projects and by user communities, workflows are becoming more complex and require more sophisticated workflow management capabilities.¹⁵ Automated workflow composition (AWC) has been demonstrated to be a feasible approach to designing novel and similar workflows. Automated Pipeline Explorer (APE) is a command-line tool for AWC.¹⁶ APE can compose workflows according to domain models that include semantically annotated software tools and workflow specifications. During domain modeling, the input, output, and operations of software tools are annotated using the controlled vocabulary underlying the EDAM ontology.¹⁷ EDAM includes topics, operations, types of data and data identifiers, and data formats that are relevant to data analysis and data management in the life sciences. Notably, the EDAM ontology has been used for software annotation by the bio.tools registry, which is the main catalog of computational tools in the life sciences. Workflows are likewise specified with the controlled vocabulary underlying the EDAM ontology, which is used to annotate the input, output, and required operations. An advantage of this approach is that domain modeling and workflow specification capture information about workflow provenance, which ultimately improves reproducibility.¹⁰ Specific to data processing, APE has been used to create AWC systems for several high-dimensional data types, including graphical,¹⁸ DNA sequencing,¹⁹ and proteomics data.¹²,²⁰ Despite these applications, the use of AWC in LC–MS metabolomics has not yet been investigated.

In this study, we developed an ontology-based AWC system for LC–MS metabolomics data processing. The system comprises four parts: (1) domain modeling using the EDAM ontology, (2) workflow specification with the EDAM ontology, (3) workflow synthesis with the APE software, and (4) workflow filtering. We manually annotated the synthesized workflows and identified those worth assessing. We used three use cases to demonstrate our proposed method and found that our system could identify multiple usable workflows for each use case.

2. METHODS

As shown in Figure 1, the steps for the AWC include domain modeling, workflow specification, workflow synthesis, and workflow filtering. To illustrate whether workflows produced by the proposed approach could successfully perform common LC–MS metabolomics data processing tasks, we considered an exemplary set of tools and three metabolomics use cases. The code for reproducing the process of automated workflow composition in this study can be found at: https://github.com/lemaslab/awc_lcms_metabolomics. We used the data collected for our previous milk metabolomics study to illustrate a traditional workflow for the following three use cases.²¹ Example data acquisition and analysis results for the use cases can be found in the Supporting Information.

2.1. Automated Workflow Composition.

2.1.1. Domain Modeling.

A domain model captures the common knowledge and the possible variability allowed among applications in a domain.²² Our domain model captured knowledge from semantic annotations of LC–MS metabolomics software with the EDAM ontology.¹⁷ Four software tools were included in our study’s use cases: ProteoWizard-msConvert (v3), MS-DIAL (v4.46), MZmine (v2.53), and XCMS (v3.12.0). Because each of these tools may have many functions, we split each tool into functions that can import input files and export output files; these functions were considered potential components of workflows. We did not split out functions that could not accept external inputs or produce output files, since they cannot interoperate with other tools or functions. We semantically annotated the software and their functions by reading publicly available documentation and by hands-on testing. All potential components were annotated with the input data type, input data format, output data type, output data format, and operations, using the controlled vocabulary underlying the EDAM ontology. For simplicity, we limited the annotations to one input and one output per component in this study, although the synthesis framework can manage several inputs and outputs.
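A component annotation of this kind can be pictured as a small record. The sketch below is illustrative only: the `Component` class, its field names, and the `can_follow` helper are hypothetical and do not reproduce APE's tool annotation file format, and the input/output pairings of the two example entries are assumptions rather than values copied from Table S1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Component:
    """One annotated function of a software tool (hypothetical structure)."""
    tool: str
    operations: Tuple[str, ...]  # EDAM operation labels
    input_type: str
    input_format: str
    output_type: str
    output_format: str


# Illustrative entries; the exact pairings are assumptions, not Table S1.
COMPONENTS = [
    Component("msConvert", ("formatting",),
              "mass spectrometry data", "Thermo RAW",
              "mass spectrometry data", "mzXML"),
    Component("MZmine", ("peak detection",),
              "mass spectrometry data", "mzXML",
              "mass spectrometry data", "mzTab"),
]


def can_follow(first: Component, second: Component) -> bool:
    """True when the output annotation of `first` matches the input of `second`."""
    return (first.output_type == second.input_type
            and first.output_format == second.input_format)


print(can_follow(COMPONENTS[0], COMPONENTS[1]))
```

Restricting each record to one input and one output mirrors the simplification described above and keeps the compatibility check to a single comparison.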

2.1.2. Workflow Specification.

Workflow specification consists of the input data formats/types, the output data formats/types, and possible additional constraints that the produced workflow must fulfill per use case. Data types, formats, and workflow constraints were represented with the controlled vocabulary underlying the data [data:0006], format [format:1915], and operation [operation:0004] taxonomies of the EDAM ontology.
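As a rough sketch, a specification can be written as a mapping of input, output, and constraints. The dictionary layout and the `satisfies_constraints` helper below are hypothetical and are not APE's configuration schema; the EDAM identifiers are the ones reported for use case No. 1 in the Results.

```python
# Hypothetical layout; identifiers are those reported for use case No. 1.
USE_CASE_1 = {
    "input": {"type": "data:0943", "format": "format:3654"},  # mass spectrum, mzXML
    "output": {"type": "data:3914"},                          # QC report
    "must_use": {"operation:3215", "operation:3960", "operation:2428"},
    "must_not_use": {"operation:3803"},
}


def satisfies_constraints(workflow_operations, spec):
    """Check that a workflow uses every required operation and no forbidden one."""
    used = set(workflow_operations)
    return spec["must_use"] <= used and not (spec["must_not_use"] & used)


candidate = ["operation:3215", "operation:3960", "operation:2428"]
print(satisfies_constraints(candidate, USE_CASE_1))
```

Keeping the constraints as sets makes the "must use" and "must not use" checks simple subset and intersection tests.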

2.1.3. Workflow Synthesis.

Many platforms supporting automated workflow composition have been developed, including APE,¹⁶ jORCA,²³ Magallanes,²⁴ jABC,²⁵ PROPHETS,²⁶ and WINGS.²⁷ We selected APE for our study since (1) jORCA and Magallanes are no longer accessible; (2) WINGS has a complex installation process, and it has not been updated for over a year; and (3) jABC and PROPHETS are outdated and not being maintained. Notably, APE has been used for the AWC of MS-based proteomics research.²⁰ Therefore, APE was used in this study for workflow synthesis.

For our use cases, the length of workflows was set to four since we observed that workflows longer than four can simply extend already considered shorter solutions and introduce redundant steps. APE recommends workflows by matching tools’ input and output, but this does not guarantee the rationality of designed workflows, and many proposed workflows may not make sense or be executable.¹² Therefore, we manually reviewed recommended workflows and categorized them into four categories:²⁰

Traditional workflow: where the workflow is recognized as one that will work.
Novel workflow: an interesting suggestion that seems viable and is worth trying.
Redundant workflow: where the workflow may work but does not seem useful or has unnecessary steps.
Unreasonable workflow: where the workflow is recognized as one that will not work.
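The input/output matching that precedes this categorization can be illustrated with a naive enumeration. This is a brute-force sketch for intuition only, not APE's actual synthesis algorithm; the tool names and format pairings in `TOOLS` are hypothetical.

```python
from itertools import product

# Hypothetical toy components: name -> (input format, output format).
TOOLS = {
    "convert": ("RAW", "mzXML"),
    "detect_peaks": ("mzXML", "mzTab"),
    "identify": ("mzTab", "DSV"),
    "reformat": ("mzXML", "mzXML"),
}


def compose(tools, start_format, goal_format, max_length=4):
    """Enumerate tool chains whose formats line up from start to goal."""
    found = []
    for length in range(1, max_length + 1):
        for chain in product(tools, repeat=length):
            current = start_format
            for name in chain:
                tool_in, tool_out = tools[name]
                if tool_in != current:
                    break  # formats do not match, discard this chain
                current = tool_out
            else:
                # Every step matched; keep the chain if it reaches the goal.
                if current == goal_format:
                    found.append(chain)
    return found


for chain in compose(TOOLS, "mzXML", "DSV"):
    print(" -> ".join(chain))
```

Even in this toy setting, the longer chains only prepend repeated `reformat` steps to the shortest solution, which is the kind of result that manual review would label redundant.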

2.1.4. Workflow Filtering.

The workflow synthesis process may produce redundant or unreasonable workflows, and assessing them wastes time. A strategy for automatically selecting a subset of recommended workflows that are worth trying (i.e., workflows categorized as traditional or novel) therefore saves time. We selected the first six workflows recommended by APE and calculated whether this selection yielded a higher percentage of worthwhile workflows than the full set of recommendations.
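The calculation behind this filtering step can be sketched as follows. The function name and the example ranking are hypothetical; only the definition of a worthwhile workflow (traditional or novel) comes from the text.

```python
WORTH_TRYING = {"traditional", "novel"}


def worth_trying_rate(labels, top_n):
    """Share of the first `top_n` ranked workflows labelled traditional or novel."""
    head = labels[:top_n]
    if not head:
        return 0.0
    return sum(label in WORTH_TRYING for label in head) / len(head)


# Hypothetical manual annotations, in the order the workflows were recommended.
ranked = ["traditional", "novel", "redundant", "traditional",
          "redundant", "redundant", "unreasonable", "redundant"]

print(worth_trying_rate(ranked, 6))
print(worth_trying_rate(ranked, len(ranked)))
```

Comparing the rate for the first six workflows with the rate for the full list shows whether truncating the ranking concentrates the worthwhile candidates.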

2.2. LC–MS Metabolomics Use Cases.

2.2.1. Use Case No. 1: Data Quality Control.

The initial step is checking whether the data files produced by LC–MS analysis have an acceptable quality for downstream computational analysis, e.g., do the quality control (QC) samples produce similar mass spectrum signals? To assess this, the data must be processed to obtain a peak table that includes intensity information. Principal component analysis can then be used to visualize the similarity of intensities generated by the QC samples.

2.2.2. Use Case No. 2: Metabolite Identification with Mass-to-Charge Ratio and Retention Time.

Once the quality of data has been accepted for downstream computational analysis, the metabolites in the samples can be identified by matching the mass-to-charge ratio (m/z) and retention time (RT) with the laboratory’s internal library. For this, the data must be processed to obtain m/z and RT information on peaks. Metabolites can then be identified whose m/z and RT values are within the predefined tolerance during library matching.
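A minimal sketch of tolerance-based library matching is given below. The function name, the tolerance values (5 ppm for m/z and 0.2 min for RT), and the peak and library entries are all assumptions chosen for illustration; the source does not state the tolerances used.

```python
def match_library(peaks, library, mz_tol_ppm=5.0, rt_tol_min=0.2):
    """Return (m/z, RT, name) for peaks within both tolerances of a library entry.

    The default tolerances are illustrative assumptions, not values from the study.
    """
    matches = []
    for peak_mz, peak_rt in peaks:
        for name, lib_mz, lib_rt in library:
            ppm_error = abs(peak_mz - lib_mz) / lib_mz * 1e6
            if ppm_error <= mz_tol_ppm and abs(peak_rt - lib_rt) <= rt_tol_min:
                matches.append((peak_mz, peak_rt, name))
    return matches


# Hypothetical internal library: (name, m/z, RT in minutes).
LIBRARY = [("compound_A", 180.0634, 1.50), ("compound_B", 200.1050, 1.50)]
# Hypothetical detected peaks: (m/z, RT in minutes).
PEAKS = [(180.0636, 1.55), (180.0634, 3.00), (200.1000, 1.50)]

print(match_library(PEAKS, LIBRARY))
```

Only the first peak is within both tolerances here; the second fails on RT and the third on m/z, which shows why both values are needed from the data processing step.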

2.2.3. Use Case No. 3: Metabolite Identification with Tandem Mass Spectrum.

Although library matching with m/z and RT information can identify many metabolites, numerous peaks with unknown identities will be present. Matching tandem mass spectrum (MS2) of precursor peaks with publicly available MS2 libraries allows the identification of more metabolites and increases the confidence level of metabolites identified through m/z and RT matching. For MS2 library matching, data files containing MS2 information must first be processed and then this MS2 information matched with one or more publicly available MS2 libraries.
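MS2 library matching is usually based on a spectral similarity score. The sketch below uses cosine similarity on binned spectra, which is one common choice and an assumption here; the source does not specify how the tools score matches, and the function name, bin width, and spectra are hypothetical.

```python
import math


def cosine_similarity(spectrum_a, spectrum_b, bin_width=0.01):
    """Cosine similarity between two MS2 spectra given as (m/z, intensity) pairs."""

    def binned(spectrum):
        # Sum intensities that fall into the same m/z bin.
        bins = {}
        for mz, intensity in spectrum:
            key = round(mz / bin_width)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = binned(spectrum_a), binned(spectrum_b)
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical query and library spectra.
query = [(85.03, 100.0), (127.04, 50.0)]
library_hit = [(85.03, 100.0), (127.04, 50.0)]
library_miss = [(60.00, 10.0)]

print(cosine_similarity(query, library_hit))
print(cosine_similarity(query, library_miss))
```

A score near 1 indicates matching fragment patterns, and a score of 0 indicates no shared fragment bins; a real application would also apply a precursor m/z filter and a score threshold.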

3. RESULTS

The domain modeling results are shown in Figure 2 and Table S1. Every component included a software tool and one or more EDAM operations that the software can perform. Figure 2 illustrates the EDAM taxonomy used for our annotation. Three root classes were included: operation [operation:0004], data [data:0006], and format [format:1915]. Six operations were involved in the semantic annotation: filtering, formatting, QC, metabolite identification, peak detection, and principal component analysis. Three data types were used: mass spectrometry data, QC report, and compound identifier. Seven format types were involved: XML, mzTab, DSV, MGF, Thermo RAW, mzXML, and mzML. As shown in Table S1, 13 components were identified in total: one from msConvert, three from MS-DIAL, six from MZmine, and three from XCMS. Workflow specification results are shown in Table S2. We set the workflow input of all three use cases as mzXML files; workflow outputs and constraints differed among use cases. Workflow synthesis statistics are summarized in Table S3 and Figure 3 and explained in detail below. Briefly, the system recommended 145 distinct workflows in total for all three use cases: 57 could be used for QC (use case No. 1), 88 for MS1 annotation (use case No. 2), and 88 for MS2 annotation (use case No. 3). Synthesized workflows and their annotations are included in Table S4. As expected, our system recommended traditional workflows for all use cases. However, based on Figure 3, a large portion of the recommended workflows for use cases No. 1, No. 2, and No. 3 were redundant or unreasonable (89.5%, 94.3%, and 97.7%, respectively). We also found that 33.3% of novel workflows were executable, meaning that data could be transferred across steps without reporting an error.
Additionally, we considered workflows annotated as “traditional” or “novel” to be “worth trying” workflows. According to Figure 3, selecting the first six workflows produced by APE resulted in an average of approximately 45% “worth trying” workflows.

Figure 3. Percentage of workflows worth assessing when selecting the first N workflows as recommended by APE software.

3.1. Use Case No. 1.

This use case translates into a specification with mass spectrum [data:0943] in mzXML [format:3654] format as the input and the QC report [data:3914] as the output. Constraints were employed to enforce the use of peak detection [operation:3215], principal component analysis [operation:3960], and QC [operation:2428] operations and to avoid using the metabolite identification operation [operation:3803]. APE synthesized 57 (100%) workflows, of which 3 (5.3%) were traditional workflows, 3 (5.3%) were novel workflows, 51 (89.5%) were redundant workflows, and none were unreasonable workflows. We tested the novel workflows and found one executable workflow: MZmine (peak detection) → XCMS (chromatogram visualization + chromatographic alignment + principal component analysis + QC). The other novel workflows were not executable because the mzTab output from XCMS and MS-DIAL could not be used by MZmine. Workflows were annotated as redundant when they contained a “redundant format conversion step” or when “two same steps connect.”

3.2. Use Case No. 2.

This use case has a workflow input of mass spectrum [data:0943] in mzXML [format:3654] and a workflow output of compound identifier [data:1086]. Constraints include using the operations of peak detection [operation:3215] and metabolite identification [operation:2421]. APE produced 88 (100%) workflows, of which 3 (3.4%) were traditional workflows, 2 (2.3%) were novel workflows, 73 (83.0%) were redundant workflows, and 10 (11.4%) were unreasonable workflows. However, neither of the two novel workflows was executable because the mzTab output from XCMS and MS-DIAL could not be used by MZmine. Workflows were annotated as redundant when they contained a “redundant format conversion step” or when “two same steps connect,” and as unreasonable because “XCMS cannot perform MS1 annotation.”

3.3. Use Case No. 3.

This use case implies a specification with mass spectrum [data:0943] in mzXML [format:3654] as the workflow input and compound identifier [data:1086] as the workflow output. Constraints included using the operations of peak detection [operation:3215], metabolite identification [operation:2421], and spectral library search [operation:2421], and not using database search [operation:3801]. APE generated 88 (100%) workflows, of which 1 (1.1%) was a traditional workflow, 1 (1.1%) was a novel workflow, 22 (25.0%) were redundant workflows, and 64 (72.7%) were unreasonable workflows. The identified novel workflow was executable: MZmine (peak detection) → XCMS (chromatogram visualization + chromatographic alignment + metabolite identification). Workflows were annotated as redundant when they contained a “redundant format conversion step” or when “two same steps connect,” and as unreasonable because “MZmine cannot perform MS2 annotation.”
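The category shares for the three use cases can be recomputed directly from the reported counts. The sketch below uses the counts stated in Sections 3.1 to 3.3; the dictionary layout and function name are illustrative.

```python
# Category counts as reported for each use case in Sections 3.1 to 3.3.
COUNTS = {
    "use case 1": {"traditional": 3, "novel": 3, "redundant": 51, "unreasonable": 0},
    "use case 2": {"traditional": 3, "novel": 2, "redundant": 73, "unreasonable": 10},
    "use case 3": {"traditional": 1, "novel": 1, "redundant": 22, "unreasonable": 64},
}


def not_worth_trying_percent(counts):
    """Percentage of workflows that were redundant or unreasonable."""
    total = sum(counts.values())
    rejected = counts["redundant"] + counts["unreasonable"]
    return round(100 * rejected / total, 1)


for name, counts in COUNTS.items():
    print(name, sum(counts.values()), not_worth_trying_percent(counts))
```

Running this reproduces the workflow totals per use case and gives the combined redundant-plus-unreasonable share from the underlying counts rather than from the already rounded per-category percentages.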

4. DISCUSSION

AWC can accelerate computational workflow design,²⁰ promote workflow optimization,²⁸,²⁹ facilitate research reproducibility,³⁰ and ensure the methodological quality of workflows.³⁰ Several methods have been proposed for workflow discovery, including searching a workflow repository (e.g., MyExperiment),³¹ using machine learning to create a recommendation system based on existing workflows,³² and leveraging domain ontologies to facilitate workflow construction.¹²,¹⁶,²⁰ Notably, the ontology-based approach not only composes workflows that already exist but can also create novel workflows, offering more opportunity to find better-suited workflows.³⁰ To date, ontology-based AWC has been used in areas such as geographic data manipulation¹⁸ and MS-based proteomics data analysis.¹² In this study, we extended previous work¹²,¹⁶,¹⁸ by applying the ontology-based AWC method to LC–MS metabolomics research. We used the EDAM ontology for domain modeling and workflow specifications and APE software for workflow synthesis, and we added a workflow filtering step to reduce the share of redundant and unreasonable workflows that must be assessed.

A previous MS-based proteomics study²⁰ that used APE and all relevant tools registered in bio.tools found that ontology-based AWC could explore the space of possible proteomics workflows. Following Kasalica et al., we categorized the workflows obtained in our study using the same four categories (traditional, novel, redundant, and unreasonable). Kasalica et al. also found that the quality of obtained workflows largely depends on the semantic annotation of software tools.²⁰ Therefore, while creating the domain model, we not only relied on the annotation from the bio.tools registry but also revised the annotation according to our own experience with the included software tools. Our results are consistent with those of Kasalica et al. in demonstrating that the top six workflows predicted by APE were associated with a higher percentage of computationally viable workflows. When considering the first 20 workflows, we obtained 4.33 (21.65%) worth trying workflows on average across all use cases, whereas the previous study obtained 4.25 workflows (21.25%).²⁰ When selecting the first six workflows, we obtained 2.67 (44.50%) worth trying workflows, compared with 2.5 workflows (41.67%) previously obtained.²⁰ This demonstrates that the workflow filtering strategy is effective for both MS-based proteomics research and LC–MS metabolomics research.

Community-wide adoption of our proposed method for LC–MS metabolomics workflow construction can enhance reproducibility. A recent review of the FAIRness (Findability, Accessibility, Interoperability, Reusability) of LC–HRMS metabolomics software indicates that no software for LC–MS metabolomics data processing includes semantic annotation of input, output, or operations in its documentation, which considerably diminishes FAIRness.⁶ Using our proposed strategy for workflow composition will encourage research software developers to annotate their software semantically so that it can be discovered by the AWC system and used in research workflows. Additionally, workflow reproducibility and reliability can be improved by increasing workflow completeness and stability.³³ Workflow completeness refers to the richness of semantic annotation for workflows using the research object (RO) template,³⁴ and stability refers to how well that completeness is maintained over time. Notably, the RO template includes semantic annotation of input and output, and workflows produced with our proposed strategy would have automatic semantic annotations of input and output. Therefore, our proposed AWC method facilitates the improvement of research reproducibility by enhancing the FAIRness of LC–MS metabolomics software and enhancing workflow completeness.

Our study has several strengths. We divided the included software into functions that can take external input files and export output files, providing the workflow synthesis software with more space for novel workflow discovery. We proposed a filtering step that selected only the first six recommended workflows and could consequently guide researchers quickly to workflows that were worth trying. Notably, when current tools are limited, bioinformaticians can easily add their self-developed tools to the AWC system by modifying the domain modeling config file in the repository and then use the revised AWC system to automatically compose workflows that include their tools. Furthermore, combining multiple workflows for data processing can improve the true positive rate,³⁵ and our AWC system can make the discovery of multiple workflows easier. However, the study also had some limitations. We identified that the ontology we used (EDAM) has limitations that caused the generation of unreasonable workflows, but we did not test or evaluate other related ontologies such as OntoSoft or SADI.³⁰ We used only one data set to test our workflows and did not investigate whether different sample types or data acquisition approaches would affect the success of workflow execution. Additionally, we only discussed how the AWC system could improve research reproducibility and did not propose evaluation metrics to quantify the improvement. Although combining results from multiple workflows may enhance the quality and reproducibility of results, adding AWC as a workflow design step before the actual data processing steps may increase the time of analysis. Future studies that rank produced workflows based on factors such as the consumption of computational resources and the biological meaningfulness of results will be helpful in this regard.
Proposing new metrics, or using existing metrics such as the coefficient of variation (CV%),³⁶ to evaluate how the use of AWC could improve metabolomics reproducibility across laboratories warrants future work. Furthermore, we found that APE software does not take into consideration the compatibility among software tools when assembling workflows.

5. CONCLUSIONS

AWC is a technique that uses algorithms to perform the often tedious, time-consuming, limited, and error-prone workflow development process. Notably, using an ontology for AWC not only saves time in workflow design but also improves reproducibility by providing provenance information on software tools and workflows. However, the implementation of ontology-based AWC in LC–MS metabolomics had not been investigated. In this study, we proposed an ontology-based AWC method for LC–MS metabolomics workflows. Our strategy had four steps: (1) domain modeling with the EDAM ontology, (2) workflow specification with the EDAM ontology, (3) workflow synthesis with APE software, and (4) workflow filtering by selecting the first six recommended workflows. We demonstrated our strategy with three use cases and found that it effectively identified existing, widely used workflows and discovered executable novel workflows. To our knowledge, this is the first study to investigate AWC for LC–MS metabolomics analysis.

Supplementary Material

Main: NIHMS2069779-supplement-Main.pdf (249.6 KB, PDF)

Table S5: NIHMS2069779-supplement-Table_S5.xlsx (308.1 KB, XLSX)

Acknowledgments

Research reported in this publication was supported by the University of Florida Informatics Institute Fellowship Program, the Southeast Center for Integrated Metabolomics at the University of Florida, the National Institute of Diabetes and Digestive and Kidney Diseases (K01DK115632), and the University of Florida Clinical and Translational Science Institute (UL1TR001427). The content is solely the responsibility of the authors and does not necessarily represent the official views of the University of Florida Informatics Institute, Southeast Center for Integrated Metabolomics at the University of Florida, University of Florida Clinical and Translational Science Institute, or the National Institutes of Health.

Footnotes

Notes

The authors declare no competing financial interest.

Complete contact information is available at: https://pubs.acs.org/10.1021/jasms.3c00248

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/jasms.3c00248.

Example analysis (sample collection and handling, metabolite extraction, analytical instrumentation, example data processing pipeline); domain modeling, workflow specifications, and synthesis statistics; analysis result of an example workflow (PDF)

Distinct workflows recommended by the AWC system and use cases (XLSX)

Contributor Information

Xinsong Du, Division of General Internal Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115, United States; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, United States.

Farhad Dastmalchi, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States.

Matthew A. Diller, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States.

Mathias Brochhausen, Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205, United States.

Timothy J. Garrett, Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, Florida 32610, United States.

William R. Hogan, Data Science Institute, Medical College of Wisconsin, Milwaukee, Wisconsin 53226, United States.

Dominick J. Lemas, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States; Department of Obstetrics and Gynecology, College of Medicine and Center for Perinatal Outcomes Research, College of Medicine, University of Florida, Gainesville, Florida 32610, United States.

REFERENCES

(1) Dastmalchi F; Xu K; Jones HN; Lemas DJ. Assessment of Human Milk in the Era of Precision Health. Curr. Opin. Clin. Nutr. Metab. Care 2022, 25 (5), 292–297.
(2) Du X; Aristizabal-Henao JJ; Garrett TJ; Brochhausen M; Hogan WR; Lemas DJ. A Checklist for Reproducible Computational Analysis in Clinical Metabolomics Research. Metabolites 2022, 12 (1), 87.
(3) Pang Z; Zhou G; Ewald J; Chang L; Hacariz O; Basu N; Xia J. Using MetaboAnalyst 5.0 for LC-HRMS Spectra Processing, Multi-Omics Integration and Covariate Adjustment of Global Metabolomics Data. Nat. Protoc. 2022, 17 (8), 1735–1761.
(4) Wang S-C; Huang C-M; Chiang S-M. Improving Signal-to-Noise Ratios of Liquid Chromatography-Tandem Mass Spectrometry Peaks Using Noise Frequency Spectrum Modification between Two Consecutive Matched-Filtering Procedures. J. Chromatogr. A 2007, 1161 (1), 192–197.
(5) Vemula H; Kitase Y; Ayon NJ; Bonewald L; Gutheil WG. Gaussian and Linear Deconvolution of LC-MS/MS Chromatograms of the Eight Aminobutyric Acid Isomers. Anal. Biochem. 2017, 516, 75–85.
(6) Du X; Dastmalchi F; Ye H; Garrett TJ; Diller MA; Liu M; Hogan WR; Brochhausen M; Lemas DJ. Evaluating LC-HRMS Metabolomics Data Processing Software Using FAIR Principles for Research Software. Metabolomics 2023, 19 (2), 11.
(7) Smith CA; Want EJ; O’Maille G; Abagyan R; Siuzdak G. XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification. Anal. Chem. 2006, 78 (3), 779–787.
(8) Schmid R; Heuckeroth S; Korf A; Smirnov A; Myers O; Dyrlund TS; Bushuiev R; Murray KJ; Hoffmann N; Lu M; Sarvepalli A; Zhang Z; Fleischauer M; Dührkop K; Wesner M; Hoogstra SJ; Rudt E; Mokshyna O; Brungs C; Ponomarov K; Mutabdžija L; Damiani T; Pudney CJ; Earll M; Helmer PO; Fallon TR; Schulze T; Rivas-Ubach A; Bilbao A; Richter H; Nothias L-F; Wang M; Orešič M; Weng J-K; Böcker S; Jeibmann A; Hayen H; Karst U; Dorrestein PC; Petras D; Du X; Pluskal T. Integrative Analysis of Multimodal Mass Spectrometry Data in MZmine 3. Nat. Biotechnol. 2023, 447.
(9) Tsugawa H; Cajka T; Kind T; Ma Y; Higgins B; Ikeda K; Kanazawa M; VanderGheynst J; Fiehn O; Arita M. MS-DIAL: Data-Independent MS/MS Deconvolution for Comprehensive Metabolome Analysis. Nat. Methods 2015, 12 (6), 523–526.
(10) Kanwal S; Khan FZ; Lonie A; Sinnott RO. Investigating Reproducibility and Tracking Provenance - A Genomic Workflow Case Study. BMC Bioinformatics 2017, 18 (1), 337.
(11) Roure DD; Goble C; Klyne G; Roos M; Hettne K; Ruiz JE; Palma R; Gómez-Pérez JM; Missier P; Belhajjame K. Towards the Preservation of Scientific Workflows; 2011.
(12) Palmblad M; Lamprecht A-L; Ison J; Schwämmle V. Automated Workflow Composition in Mass Spectrometry-Based Proteomics. Bioinformatics 2019, 35 (4), 656–664.
(13) Ding J; Feng Y-Q. Mass Spectrometry-Based Metabolomics for Clinical Study: Recent Progresses and Applications. TrAC Trends Anal. Chem. 2023, 158, 116896.
(14) Atkinson M; Gesing S; Montagnat J; Taylor I. Scientific Workflows: Past, Present and Future. Future Gener. Comput. Syst. 2017, 75, 216–227.
(15) Coleman T; Casanova H; Pottier L; Kaushik M; Deelman E; Ferreira da Silva R. WfCommons: A Framework for Enabling Scientific Workflow Research and Development. Future Gener. Comput. Syst. 2022, 128, 16–27.
(16) Kasalica V; Lamprecht A-L. APE: A Command-Line Tool and API for Automated Workflow Composition. In Computational Science - ICCS 2020; Krzhizhanovskaya VV., Závodszky G., Lees MH., Dongarra JJ., Sloot PMA., Brissos S., Teixeira J., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, 2020; pp 464–476. DOI: 10.1007/978-3-030-50436-6_34.
(17) Ison J; Kalaš M; Jonassen I; Bolser D; Uludag M; McWilliam H; Malone J; Lopez R; Pettifer S; Rice P. EDAM: An Ontology of Bioinformatics Operations, Types of Data and Identifiers, Topics and Formats. Bioinformatics 2013, 29 (10), 1325–1332.
(18) Kasalica V; Lamprecht A-L. Automated Composition of Scientific Workflows: A Case Study on Geographic Data Manipulation. In 2018 IEEE 14th International Conference on e-Science (e-Science); 2018; pp 362–363. DOI: 10.1109/eScience.2018.00099.
(19) Zheng CL; Ratnakar V; Gil Y; McWeeney SK. Use of Semantic Workflows to Enhance Transparency and Reproducibility in Clinical Omics. Genome Med. 2015, 7 (1), 73.
(20) Kasalica V; Schwämmle V; Palmblad M; Ison J; Lamprecht A-L. APE in the Wild: Automated Exploration of Proteomics Workflows in the Bio.Tools Registry. J. Proteome Res. 2021, 20 (4), 2157–2165.
(21) Lemas DJ; Du X; Dado-Senn B; Xu K; Dobrowolski A; Magalhães M; Aristizabal-Henao JJ; Young BE; Francois M; Thompson LA; Parker LA; Neu J; Laporta J; Misra BB; Wane I; Samaan S; Garrett TJ. Untargeted Metabolomic Analysis of Lactation-Stage-Matched Human and Bovine Milk Samples at 2 Weeks Postnatal. Nutrients 2023, 15 (17), 3768.
(22) Reinhartz-Berger I. Towards Automatization of Domain Modeling. Data Knowl. Eng. 2010, 69 (5), 491–515.
(23) Karlsson J; Martín-Requena V; Ríos J; Trelles O. Workflow Composition and Enactment Using jORCA. In Leveraging Applications of Formal Methods, Verification, and Validation; Margaria T., Steffen B., Eds.; Springer: Berlin, Heidelberg, 2010; pp 328–339.
(24) Ríos J; Karlsson J; Trelles O. Magallanes: A Web Services Discovery and Automatic Workflow Composition Tool. BMC Bioinformatics 2009, 10 (1), 334.
(25) Steffen B; Margaria T; Nagel R; Jörges S; Kubczak C. Model-Driven Development with the jABC. In Hardware and Software, Verification and Testing; Bin E, Ziv A., Ur S., Eds.; Lecture Notes in Computer Science; Springer: Berlin, Heidelberg, 2007; pp 92–108. DOI: 10.1007/978-3-540-70889-6_7.
(26) Naujokat S; Lamprecht A-L; Steffen B. Loose Programming with PROPHETS. In Fundamental Approaches to Software Engineering; de Lara J., Zisman A., Eds.; Lecture Notes in Computer Science; Springer: Berlin, Heidelberg, 2012; pp 94–98. DOI: 10.1007/978-3-642-28872-2_7.
(27) Gil Y; Ratnakar V; Kim J; Moody J; Deelman E; Gonzalez-Calero P; Groth P. Wings: Intelligent Workflow-Based Design of Computational Experiments. Intell. Syst. IEEE 2011, 26, 62–72.
(28).Reiner B; Siegel E; Carrino JA Workflow Optimization: Current Trends and Future Directions. J. Digit. Imaging 2002, 15 (3), 141–152. [DOI] [PMC free article] [PubMed] [Google Scholar]
(29).Kougka G; Gounaris A; Simitsis A The Many Faces of Data-Centric Workflow Optimization: A Survey. Int. J. Data Sci. Anal. 2018, 6 (2), 81–107. [Google Scholar]
(30).Lamprecht A-L; Palmblad M; Ison J; Schwämmle V; Manir MSA; Altintas I; Baker CJO; Amor ABH; Capella-Gutierrez S; Charonyktakis P; Crusoe MR; Gil Y; Goble C; Griffin TJ; Groth P; Ienasescu H; Jagtap P; Kalaš M; Kasalica V; Khanteymoori A; Kuhn T; Mei H; Ménager H; Möller S; Richardson RA; Robert V; Soiland-Reyes S; Stevens R; Szaniszlo S; Verberne S; Verhoeven A; Wolstencroft K Perspectives on Automated Composition of Workflows in the Life Sciences. F1000Research 2021, DOI: 10.12688/f1000research.54159.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
(31).Goble CA; Bhagat J; Aleksejevs S; Cruickshank D; Michaelides D; Newman D; Borkum M; Bechhofer S; Roos M; Li P; De Roure D myExperiment: A Repository and Social Network for the Sharing of Bioinformatics Workflows. Nucleic Acids Res. 2010, 38, W677–W682. [DOI] [PMC free article] [PubMed] [Google Scholar]
(32).Deelman E; Mandal A; Jiang M; Sakellariou R The Role of Machine Learning in Scientific Workflows. Int. J. High Perform. Comput. Appl. 2019, 33 (6), 1128–1139. [Google Scholar]
(33).Gómez-Pérez JM; García-Cuesta E; Zhao J; Garrido A; Ruiz JE How Reliable Is Your Workflow: Monitoring Decay in Scholarly Publications; 2013. [Google Scholar]
(34).Soiland-Reyes S; Sefton P; Crosas M; Castro LJ; Coppens F; Fernández JM; Garijo D; Grüning B; La Rosa M; Leo S; ÓCarragáin E; Portier M; Trisovic A; et al. Packaging Research Artefacts with RO-Crate. Data Sci. 2022, 5 (2), 97–138. [Google Scholar]
(35).Myers OD; Sumner SJ; Li S; Barnes S; Du X One Step Forward for Reducing False Positive and False Negative Compound Identifications from Mass Spectrometry Metabolomics Data: New Algorithms for Constructing Extracted Ion Chromatograms and Detecting Chromatographic Peaks. Anal. Chem. 2017, 89 (17), 8696–8703. [DOI] [PubMed] [Google Scholar]
(36).Siskos AP; Jain P; Römisch-Margl W; Bennett M; Achaintre D; Asad Y; Marney L; Richardson L; Koulman A; Griffin JL; Raynaud F; Scalbert A; Adamski J; Prehn C; Keun HC Interlaboratory Reproducibility of a Targeted Metabolomics Platform for Analysis of Human Serum and Plasma. Anal. Chem. 2017, 89 (1), 656–665. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Main

NIHMS2069779-supplement-Main.pdf (249.6KB, pdf)

Table S5

NIHMS2069779-supplement-Table_S5.xlsx (308.1KB, xlsx)
