uafR: An R package that automates mass spectrometry data processing

Chase A Stratton; Yvonne Thompson; Konilo Zio; William R Morrison, III; Ebony G Murrell

PLoS One. 2024 Jul 5;19(7):e0306202. doi: 10.1371/journal.pone.0306202

Editor: Shailender Kumar Verma

PMCID: PMC11226021 PMID: 38968199

Abstract

Chemical information has become increasingly ubiquitous and has outstripped the pace of analysis and interpretation. We have developed an R package, uafR, that automates a grueling retrieval process for gas chromatography coupled mass spectrometry (GC-MS) data and allows anyone interested in chemical comparisons to quickly perform advanced structural similarity matches. Our streamlined cheminformatics workflows allow anyone with basic experience in R to pull out component areas for tentative compound identifications using the best published understanding of molecules across samples (pubchem.gov). Interpretations can now be done in a fraction of the time, cost, and effort it would typically take using a standard chemical ecology data analysis pipeline. The package was tested in two experimental contexts: (1) a dataset of purified internal standards, which showed our algorithms correctly identified the known compounds with R² values ranging from 0.827–0.999 along concentrations ranging from 1 × 10⁻⁵ to 1 × 10³ ng/μl; and (2) a large, previously published dataset, where the number and types of compounds identified were comparable (or identical) to those identified with the traditional manual peak annotation process, and NMDS analysis of the compounds produced the same pattern of significance as in the original study. Both the speed and accuracy of GC-MS data processing are drastically improved with uafR because it allows users to fluidly interact with their experiment following tentative library identifications [i.e. after the m/z spectra have been matched against an installed chemical fragmentation database (e.g. NIST)]. Use of uafR will allow larger datasets to be collected and systematically interpreted quickly. Furthermore, the functions of uafR could allow backlogs of previously collected and annotated data to be processed by new personnel or students as they are being trained. This is critical as we enter the era of exposomics, metabolomics, volatilomes, and landscape level, high-throughput chemotyping. This package was developed to advance collective understanding of chemical data and is applicable to any research that benefits from GC-MS analysis. It can be downloaded for free along with sample datasets from GitHub at github.com/castratton/uafR or installed directly from R or RStudio using the developer tools: devtools::install_github("castratton/uafR").

Introduction

Chemistry has a profound influence on every physical system in the human environment [1–5], hence the need for biochemical research is of utmost importance. Gas chromatography coupled with mass spectrometry (GC-MS), used to identify the chemical composition of samples, is a commonly used technology across many disciplines of research [6–9]. While the accuracy and efficiency of instruments continue to improve [10], preparing the library-matched output [i.e. top hit(s) for each set of molecular fragments streamed across the machine’s m/z detector] for analysis and interpretation remains antiquated. The traditional methods involve manually selecting, integrating, and identifying peaks based on a reference library and comparison to commercial standards across every sample in an experiment [11, 12]. Software packages that quickly and accurately identify top library matches for every tentative compound in an entire batch of experimental samples thankfully exist (e.g. Agilent’s MassHunter, Thermo Fisher Scientific’s Compound Discoverer, Shimadzu’s GCMSsolution); however, the output remains uninterpretable without additional processing. In even simple experiments, the process of quantifying tentatively “identified” compounds across replicates can take weeks or months and is a significant impediment to collecting and analyzing many, large, and/or complex GC-MS datasets. Furthermore, focusing the interpretation on specific chemicals or chemistries that are meaningful would require looking up each molecule for published information and/or important associations. This additional bottleneck in chemical experimentation can lead to backlogs in collections, delays in chemical data being analyzed and published, and may even create a significant deterrent to collecting GC-MS data in studies (e.g. non-targeted and/or suspect screening analysis) where these data could be highly informative.

Another concern with manually selecting component areas for the same tentative molecule across different samples is the inherent subjectivity and inconsistency at many decision points. Every additional keystroke or choice about threshold provides an opportunity for unintended error. Technology exists that could help automate this process, converting the identified compounds into a digitally comparable structure in an instant [4, 8]; however, using such technology requires advanced computer programming experience. Any functional interpretation of a chemical benefits from structural comparisons with compounds of known functions, yet the ability to do so has historically been reserved for private industry or hyper-specialized professionals. A package that automates the sorting and collection of component areas across samples in an experiment while simultaneously storing critical information about every tentative molecule could propel every field of science forward by not only removing the bottlenecks and subjectivity in chemical analysis but also removing the need for hours of paid or untrained manual labor before even simple chemical interpretations can occur.

To address these barriers in the use of GC-MS data, we developed an R package that takes as input the raw, aggregated chemical identifications generated by a user-selected peak detection software. In this study, we used Agilent’s Unknowns Analysis software to identify peaks with their deconvolution algorithm and match m/z spectra to a locally installed NIST library, but any mass spectrometry software that produces the same information is equally viable. The package here comes after the initial processing of samples and communicates with public chemistry utilities (including PubChem and the National Cancer Institute) to sort and process the aggregated set of all tentatively identified molecules using underlying m/z (mass/charge of chemical fragments) ratio data and automatically interpret close matches across samples. In addition to precisely (but flexibly) grabbing tentative compounds from samples they could theoretically exist in and preparing the component areas for statistical summary and analysis (including principal component analyses, non-metric multidimensional scaling (NMDS), and/or machine learning algorithms), uafR also interacts with structural data [in SDF (Structure-Data format)] for all published compounds in the dataset. These data allow detailed summaries of the chemical constituents for each sample to be generated based on the user’s chemical(s) of interest. Thus, while a chemical ecologist may be more interested in the relative proportions of alkaloids to polyphenols in a sample [13, 14], a biochemist may only be interested in steroids [6, 15–17]. These groups (or others) can now be selectively pulled from one’s dataset to perform follow-up analyses. In addition, researchers (e.g. those performing targeted analysis) that have advanced knowledge of the molecule(s) or functional group(s) of interest can use our functions to isolate these chemistries from experimental data and focus their analysis/interpretation on specified chemicals or chemical groups more generally.

Users may also load personal chemical libraries, again as an easily formatted “.CSV” file using long or wide orientation, to compare any list of chemicals against the set(s) of classifier compounds in their “.CSV” input library. For the chemical structure processing, our package utilizes Tanimoto similarity, a commonly used and rigorously tested metric for physicochemical comparisons [5, 18, 19]. While there is a broad range of diversity in the chemistry of any system [20], there exist common structural subunits that can categorize molecules into their potential function(s), and the Tanimoto index provides efficient functional sorting of even diverse chemistries. As an example, these comparisons could be used in agricultural research to rapidly screen plant molecules for insecticidal or repellent properties [21–24]. More specifically, the pharmaceutical industry uses the Tanimoto similarity metric to discover compounds that will bind known ligands or share biological activity with known drugs [25–27]. This metric is underapplied, however, because to date, it requires multiple complex steps to generate or acquire data in the appropriate format [28, 29]. Our R package harnesses direct connections between PubChem and R to stream published information on every known (i.e., published and vetted by peer-review for merit) chemical in the dataset. This bypasses the need for other computer programs or coding environments to perform physicochemical comparisons and allows our algorithm to outperform any comparable utility for this stage of mass spectrometry data processing. If the user can install a package and read a “.CSV” file into R, they will have access to the entirety of PubChem and more.

Data science and informatics can circumvent analytical bottlenecks [30]. Automating the tedious portions of GC-MS data processing can not only turn weeks or months of work into a few keyboard strokes within a day, but also take human error and subjectivity out of the equation. An efficient and user-friendly tool for interpreting these chemical data is long overdue.

Here, we present two examples to demonstrate the accuracy and efficiency of uafR. The first is the identification and analysis of a GC/MS dataset containing samples of a series of four known internal standards at different concentrations. The second is a re-identification of GC/MS samples from an already published dataset by Ponce et al. [31]. For this dataset, we compare the same statistical tests for the standardized areas of compounds identified with four methods by the uafR package and those from the published, manual identifications. We also briefly describe how the package can improve chemical workflows in non-GC-MS datasets or meta-analyses.

Materials and methods

Software description and workflow

The current build of uafR is optimized for raw output from Agilent’s Unknowns Analysis Software (Santa Clara, CA, USA, 95051); however, the only aspect of the workflow that is specific to their software is the set of column names for the input data frame. To briefly describe the output, after setting up the analysis environment [i.e., directing Unknowns Analysis to the sample directory where a “.UAF” file (hence, uafR) is created], running the deconvolution algorithm to identify peaks, and searching the peaks against the installed library (blank subtraction and target matching are also options and will not affect the input for uafR), a single “.CSV” file containing basic GC-MS output [i.e. retention times, peak area, captured mass-to-charge ratios (m/z), compound name, match quality] and a sample origin identifier (i.e. sample name or file name) for tentative compounds across all samples can be exported and read into R using “read.csv().” After reading the data into R and loading the package, uafR can use published information to sort and precisely select portions of the data that the user may be interested in.

A diagram of the workflow can be seen in Fig 1. The first function for GC-MS data is “spreadOut().” Running this function on properly formatted GC-MS input will automatically prepare the data for the next steps in the processing pipeline. Briefly, the function takes every recorded data point for every treatment and expands it in large database formats with unique identifiers assigned for each data point. These unique identifiers (unique IDs) are automatically created from the input data and are used to extract specific area values from the raw data. In addition to setting up large databases containing component area, tentative compound identities, match factors, captured m/z values, retention time indices, sample identities, and the unique IDs, the function also communicates with online databases to download relevant information about every tentative compound. To collect these data, the function converts the chemical names into PubChem compound identifiers (CIDs) using the “get_cid()” function from the R package webchem [32]. For published chemicals, this information includes exact mass, m/z histograms, and every name the chemical has. Instances where the chemical cannot be identified by name on PubChem (i.e. compounds for which a CID is unavailable) are redirected to CADD Group Chemoinformatics Tools and User Services (CACTUS, https://cactus.nci.nih.gov/), from which a canonical Simplified Molecular Input Line Entry System (SMILES) string can be generated using that server and algorithm. This SMILES notation is then used to simulate the mass and structure data for chemicals not yet published on PubChem. All this information, including the large databases, is stored as a list in a user-defined object. Subsequent functions are designed to seamlessly interact with the list and will automatically use relevant information collected during “spreadOut().”
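The name-resolution fallback described above (try PubChem first, fall back to CACTUS for a SMILES string) can be sketched in a few lines. This is a minimal Python illustration, not uafR's R implementation: `resolve_identifier`, `lookup_cid`, and `lookup_smiles` are hypothetical names, the lookups are passed in as plain functions so the sketch makes no claims about the PubChem or CACTUS web interfaces, and the table entries are placeholders rather than real identifiers.

```python
from typing import Callable, Optional


def resolve_identifier(
    name: str,
    lookup_cid: Callable[[str], Optional[int]],
    lookup_smiles: Callable[[str], Optional[str]],
) -> dict:
    """Resolve a chemical name: CID lookup first, SMILES lookup as fallback."""
    cid = lookup_cid(name)
    if cid is not None:
        return {"name": name, "source": "pubchem", "cid": cid, "smiles": None}
    smiles = lookup_smiles(name)
    if smiles is not None:
        return {"name": name, "source": "cactus", "cid": None, "smiles": smiles}
    # Neither service knows the name; the caller decides whether to drop it.
    return {"name": name, "source": None, "cid": None, "smiles": None}


# Placeholder lookup tables standing in for the two online services.
PLACEHOLDER_CIDS = {"compound_a": 111}
PLACEHOLDER_SMILES = {"compound_b": "CCO"}

results = [
    resolve_identifier(n, PLACEHOLDER_CIDS.get, PLACEHOLDER_SMILES.get)
    for n in ["compound_a", "compound_b", "compound_c"]
]
for entry in results:
    print(entry["name"], entry["source"])
```

Names that neither lookup resolves are returned with an empty source so that a later step can decide what to do with them.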

The next step in the GC-MS workflow will depend on the type of analysis the user is performing. If the chemicals of interest are already known, they can be extracted by name with a single function, “mzExacto().” However, for complex datasets or analyses that involve more unknowns, the user may want to cast a broader, but still accurate, net. There are multiple steps that can be taken to home in on the most relevant chemicals in a dataset using the features of uafR. A simple and effective approach is to subset the search chemicals by setting a minimal match factor on the raw output of Unknowns Analysis (or other GC-MS software). This can be done with R code described in the vignette published with the package (https://castratton.github.io/uafR/). Another approach could include subsetting with output from the function “categorate().” This function also uses PubChem to communicate with online databases and generate categorically, structurally, and chemically identifying information for every published chemical in the dataset. The categorical data include whether the chemical is biologically derived [Natural Products Online database (LOTUS; https://lotus.naturalproducts.net/)], has flavor or smell [Flavor and Extract Manufacturers Association (FEMA; https://www.femaflavor.org/)], has varied biological activities [Kyoto Encyclopedia of Genes and Genomes (KEGG; https://www.genome.jp/kegg/)], has medical subject headings (MeSH; https://www.nlm.nih.gov/mesh/), or has other information about its reactivity [Food and Drug Administration Structured Product Labeling (FDA/SPL; https://www.fda.gov/) and Reactive Groups from PubChem (https://pubchem.ncbi.nlm.nih.gov/)].

After the categorical information is collected, the function generates substructure data so that the chemicals can also be subsetted by common functional groups. This information is generated using the “read.SDFset()” function from another R package called ChemmineR [33]. This package is a dependency that is installed with uafR and is core to the cheminformatics methods deployed. The substructure information generated using ChemmineR includes the number of rings, all subgroups (e.g., R-COH, R-COOH, etc.) and their counts, all atoms (e.g., C, N, S, As, etc.) and their counts, and the number of charges for every chemical with published structural data (or canonical SMILES from CACTUS) on PubChem. The final steps in “categorate()” will not only assist in subsetting compounds of interest for extraction from GC-MS datasets, but could also be used to perform meta-analyses on published chemistries.

In order to run “categorate(),” users are required to include an input library that contains columns with labeled chemicals. The labels are customizable, but the most useful approach is to label a set of chemicals by a common feature or biological activity. For example, if a researcher has a set of plant chemicals of interest to test against active ingredients in pharmaceuticals, the input library could contain n columns whose headings are the biological activity (e.g., diuretic, blood pressure, etc.) and the contents (rows under the heading) are the active chemicals used in products that are approved for those medical outcomes. The “categorate()” function will then take the input library (saved as a “.CSV”) and compare every chemical of interest to the chemicals in each user-defined “chemical category,” returning two additional data frames—(1) whether it has a strong (Tanimoto similarity greater than 0.95) or moderate (greater than 0.85) structural match with any of the chemicals in each group; and, (2) for strong matches, the name of a chemical that it was most similar to. It performs these comparisons using the “fmcsBatch()” function from the R package fmcsR [34].
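The strong and moderate thresholds above can be illustrated with a short sketch. It uses the general definition of the Tanimoto coefficient on two feature sets (shared features divided by the features present in either set); the feature sets, the function names `tanimoto` and `match_strength`, and the example inputs are hypothetical, and the sketch is written in Python rather than reproducing what “fmcsBatch()” computes internally.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto coefficient: |A and B| / (|A| + |B| - |A and B|)."""
    if not features_a and not features_b:
        return 0.0  # convention chosen here for two empty feature sets
    shared = len(features_a & features_b)
    return shared / (len(features_a) + len(features_b) - shared)


def match_strength(score: float, strong: float = 0.95, moderate: float = 0.85) -> str:
    """Label a similarity score using the thresholds stated in the text."""
    if score > strong:
        return "strong"
    if score > moderate:
        return "moderate"
    return "none"


# Hypothetical structural features encoded as integers.
query = set(range(20))
reference_same = set(range(20))      # identical features
reference_close = set(range(19))     # one feature missing
reference_far = set(range(5))        # few features in common

for label, ref in [("same", reference_same), ("close", reference_close), ("far", reference_far)]:
    score = tanimoto(query, ref)
    print(label, round(score, 3), match_strength(score))
```

Because both thresholds are strict inequalities (“greater than”), a score of exactly 0.95 is labeled moderate rather than strong in this sketch.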

The utility of this information and approach cannot be overstated. For chemistry, structure defines function, so identifying structural matches is effectively identifying chemicals with the same function. This not only provides a powerful tool for novel chemical activity discoveries and/or natural backups to synthesized chemistries, but can also allow researchers to subset GC-MS data by general chemical structures or activities they are interested in. The possibilities are limited only by the maximum file size a user can create in the specified “.CSV” format and whether structural data could be generated from PubChem for the chemical(s). Subsetting of information generated with “categorate()” is easily done using the function “exactoThese().” Users can specify which set of information they would like to subset and indicate desired criteria the chemicals should meet.

The next step in the GC-MS workflow is to put the published information to use and aggregate every occurrence of the user-specified chemicals across every GC-MS sample. “mzExacto()” takes the output from “spreadOut()” along with the list of chemical names, and returns a single data frame containing each chemical’s optimal retention time, exact mass, best identified match factor, and aggregated component area across the samples in which it occurs (0 when absent). Additional technical details for this algorithm are available with the package (github.com/castratton/uafR). Briefly, after collecting mass and m/z information for the input chemicals of interest, the algorithm orders them by exact mass so likely retention time windows can be determined based on the general structure of the input data and the information stored from “spreadOut().” After identifying perfect matches (i.e., those with high match factors and the same chemical names), the algorithm looks again through each sample for instances where the top 2 published m/z values for the tentative identity are the same as those of the query chemicals of interest. These matches are based on standard manual approaches to resolve uncertainties in any complex GC/MS workflow. The m/z values within retention time windows generated by the input data must be similar enough that the chemical fragments are practically and theoretically identical. A sub-argument, “decontaminate,” is on by default and removes any chemicals that did not have a strong match across samples, were unable to be found in public databases (i.e. PubChem), and/or were unable to have a canonical SMILES generated on the NCI server. This sub-argument can be turned off by adding “decontaminate = F” to the end of the items in “mzExacto().”
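The second-pass rule described above (same top 2 m/z values, inside a retention time window) can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the “mzExacto()” algorithm itself: the function names, the rounding of m/z values to unit resolution, the fixed window bounds, and the example spectra are all hypothetical.

```python
def top_mz(peaks, n=2):
    """Return the n most intense m/z values (rounded to unit mass), sorted."""
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    return tuple(sorted(round(mz) for mz, _intensity in ranked[:n]))


def is_candidate_match(candidate_peaks, candidate_rt, query_peaks, rt_window, n=2):
    """True when the candidate elutes inside the window and shares the top-n m/z values."""
    low, high = rt_window
    if not (low <= candidate_rt <= high):
        return False
    return top_mz(candidate_peaks, n) == top_mz(query_peaks, n)


# Hypothetical spectra given as (m/z, relative intensity) pairs.
query_peaks = [(57.0, 100.0), (41.0, 80.0), (29.0, 30.0)]
candidate_peaks = [(57.0, 95.0), (41.1, 85.0), (70.0, 20.0)]
other_peaks = [(91.0, 100.0), (57.0, 40.0), (41.0, 10.0)]

print(is_candidate_match(candidate_peaks, 8.37, query_peaks, (8.30, 8.45)))
print(is_candidate_match(candidate_peaks, 9.10, query_peaks, (8.30, 8.45)))
print(is_candidate_match(other_peaks, 8.37, query_peaks, (8.30, 8.45)))
```

Checking the retention time first keeps the comparison cheap, since a component outside the window can be rejected without ranking its fragments.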

At this point in the GC/MS workflow, the most common step is to standardize component areas for tentatively identified chemicals by quantifying their values relative to known internal or external standard(s). “standardifyIt()” takes the output from “mzExacto()” and either a user-specified internal standard (e.g., tetradecane, or another user-defined internal standard) or calibration curves (raw values) from one or more external standards, along with sub-arguments that allow the standardization to be tuned to the experimental methods. “standardifyIt()” returns a data frame that is standardized relative to the known chemical quantifications and formatted for subsequent statistical analyses. Common statistical protocols for GC-MS data include ordination analyses (e.g., PCA, NMDS, etc.), multivariate statistical tests (e.g., ANOSIM, MANOVA, PERMANOVA, etc.) and/or deep learning (neural networks or machine learning). Each of the required formats for running these statistics on GC-MS data is achievable with the final output of “mzExacto()” and “standardifyIt().”
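Internal-standard normalization of the kind described above is often done by scaling each component area by the area of the standard and multiplying by the known amount of standard added. The Python sketch below shows that common convention only; it assumes equal detector response for all compounds, uses hypothetical names and area values, and is not confirmed to be the calculation “standardifyIt()” performs.

```python
def standardize_areas(areas: dict, standard_name: str, standard_amount_ng: float) -> dict:
    """Convert component areas to amounts relative to an internal standard.

    amount = (component area / standard area) * known amount of standard
    Assumes every compound has the same detector response as the standard.
    """
    standard_area = areas[standard_name]
    if standard_area <= 0:
        raise ValueError("internal standard area must be positive")
    return {
        name: area / standard_area * standard_amount_ng
        for name, area in areas.items()
        if name != standard_name
    }


# Hypothetical component areas for one sample; 0 marks an absent compound.
sample_areas = {"tetradecane": 2000.0, "linalool": 1000.0, "nonanal": 500.0, "limonene": 0.0}

standardized = standardize_areas(sample_areas, "tetradecane", standard_amount_ng=100.0)
print(standardized)
```

Absent compounds stay at 0 after scaling, which matches the convention in the text of recording 0 when a chemical does not occur in a sample.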

Beyond automating a process that can require hours of work per sample, with potentially hundreds of samples per study, uafR makes cheminformatics a possibility for anyone working with GC-MS or chemical identity data. Furthermore, the public databases our package accesses will only improve in data quality/quantity with time and increased use. To showcase the utility and validity of our package for GC-MS workflows, we analyzed two datasets: one containing a set of known standards pipetted in known quantities across three samples (low, medium, and high concentrations), and the other consisting of a recently published set of 35 samples.

Testing the accuracy of the package

The first dataset on which uafR functions were tested was a series of standards including ethyl hexanoate (Prod#14896, CAS#123-66-0, Millipore Sigma, Burlington, MA, USA), methyl salicylate (Prod#M6752, CAS#119-36-8, Millipore Sigma), octanal (Prod#S7303001712, CAS#124-13-0, Merck, Darmstadt, Germany), and undecane (Prod#S7466429734, CAS#1120-21-4, Merck, Darmstadt, Germany) collected on an Agilent 7890b gas chromatograph (GC) equipped with an Agilent Durabond HP-5 column (30 m length, 0.250 mm diameter, and 0.25 μm film thickness) using He as the carrier gas at a constant 5 ml/min flow and 39 cm/s velocity, coupled with an Agilent 5997B mass spectrometer (MS) single quadrupole detector. These were prepared in a serial dilution using 1 mL of each neat compound diluted in 10 mL of dichloromethane, and subsequently moving 1 mL of the dilution to a new container with 10 mL of fresh dichloromethane until the following amounts of ethyl hexanoate, methyl salicylate, octanal, and undecane were achieved: low (0.000087 ng, 0.0001179 ng, 0.000082 ng, and 0.000072 ng, respectively), medium (434.5 ng, 589.5 ng, 410 ng, and 370 ng, respectively), or high (2172.5 ng, 2947.5 ng, 2050 ng, and 1850 ng, respectively) relative quantities.

The second dataset was GC/MS data collected on the same GC-MS as the previous test chemicals that had been manually processed and published in August, 2022. Briefly, the samples were collected from grain samples that were a) UV sterilized (negative control), b) clean grain from storage (positive control), c) inoculated with asexual fungal spores, d) inoculated with sexual fungal spores (see Ponce et al. 2022 [31] for extended description of methods).

After analyzing the samples on the GC-MS, the raw output was saved to a local directory and loaded into Unknowns Analysis following default protocols. For a detailed overview of running this software, Agilent provides a user manual. After loading the samples and applying the methods file to every sample, the deconvolution algorithm identified the most accurate peaks for every chromatogram. Each peak was then searched against the NIST 20 database. The aggregated data frame was exported as a “.CSV” file. This data frame included columns for the compound names (“Compound.Name”), file name the tentative identity is from (“File.Name”), top m/z peaks captured by the GC/MS (“Base.Peak.MZ”), match factors for tentative identities (“Match.Factor”), and retention times (“Component.RT”).
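Before handing an export like this to any downstream tool, it can help to confirm that the expected columns are present and to apply a minimum match factor. The Python sketch below does both with the standard library. It assumes the “.CSV” header uses the column names exactly as listed above; the function names, file names, and all values in the sample rows are placeholders rather than data from this study.

```python
import csv
import io

REQUIRED_COLUMNS = ["Compound.Name", "File.Name", "Base.Peak.MZ", "Match.Factor", "Component.RT"]


def load_rows(handle):
    """Read the export and fail early if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)


def filter_by_match_factor(rows, minimum):
    """Keep rows whose match factor is at least the given minimum."""
    return [row for row in rows if float(row["Match.Factor"]) >= minimum]


# Placeholder export; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """Compound.Name,File.Name,Base.Peak.MZ,Match.Factor,Component.RT
Linalool,sample_01.D,71,91.2,8.315
Nonanal,sample_01.D,57,78.4,8.373
Decane,sample_02.D,57,62.0,6.823
"""

rows = load_rows(io.StringIO(SAMPLE_CSV))
kept = filter_by_match_factor(rows, minimum=75.0)
print([row["Compound.Name"] for row in kept])
```

Raising an error on missing columns surfaces an export problem immediately instead of letting it appear later as an unexplained empty result.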

Results

Peak areas calculated by uafR for the set of standards correlated with the volume of the standards injected, with R² values ranging from 0.8273 to 0.9998 (Fig 2). Importantly, the single standard (i.e., octanal) with a lower correlation coefficient was likely misread by the MS or had volatilized prior to being run on the GC-MS. It is known that octanal volatilizes very easily and is used by plants as an anti-fungal compound to protect fruit [35].

Fig 2. Correlations between volume of standards tested via GC/MS and the peak area estimates generated by uafR, for (A) Ethyl hexanoate, (B) Methyl salicylate, (C) Octanal, (D) Undecane. Points represent raw data while the line represents a natural log fit (A) or linear fit (B-D) to the raw data.
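For readers who want to reproduce this kind of fit, the coefficient of determination for a linear or natural log fit can be computed as sketched below. This is a generic least-squares illustration in Python; the function names and the toy numbers are hypothetical and are not the measurements behind Fig 2.

```python
import math


def linear_r_squared(x, y):
    """R^2 of an ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def log_r_squared(x, y):
    """R^2 of a natural log fit y = intercept + slope * ln(x); x must be positive."""
    return linear_r_squared([math.log(xi) for xi in x], y)


# Toy values only, chosen to make the arithmetic easy to follow.
amounts = [1.0, 2.0, 3.0]
areas = [1.0, 2.0, 2.0]
print(linear_r_squared(amounts, areas))

log_amounts = [1.0, math.e, math.e ** 2]
log_areas = [0.0, 1.0, 2.0]
print(log_r_squared(log_amounts, log_areas))
```

A log fit is just a linear fit after transforming the predictor, which is why the second function reuses the first.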

After confirming that uafR can precisely identify chemicals that are known to be in a sample, the next step was to assess its accuracy in a more complex experiment with unknowns. Using raw GC-MS data from a recently published experiment allowed the workflow to be tested against a peer-reviewed study. We found that uafR was able to identify the manually selected compounds with accurate matches to manually identified retention times (Table 2) and yielded the same overall pattern of significance in ANOSIM analysis (Table 1). The true benefit of using uafR is not merely its accuracy, but also its speed. For context, the original manual identifications required months of labor. Using uafR, we re-analyzed this entire experiment in 150 minutes of automated computation using a standard desktop computer with a 3.30 GHz processor and 16 GB RAM. While the speed and accuracy for this experiment are apparent, additional trials on larger datasets are warranted.

Table 1. Summary of NMDS and ANOSIM calculation for models processed with uafR.

Model	NMDS Stress	ANOSIM R	ANOSIM P	# Compounds in Final Table
Original Ponce et al. 2022	0.10	0.20	0.001	33
uafR Ponce et al. 2022	0.11	0.185	0.009	33^a
>65% Match Factor	0.17	0.068	0.17	427
>75% Match Factor	0.11	0.016	0.33	116
>88.9% Match Factor	0.15	0.034	0.29	30
>97.2% Match Factor	0.04	-0.034	0.58	3

^a Only 30 compounds used for analysis, since three compounds were not present in enough experimental replicates

Table 2. Chemicals identified in Ponce et al. 2022 using manual identification, versus compounds identified by the uafR package using the same selection criteria: >75% match of the chemical ID, and present in more than one sample.

Compounds shared between identification techniques are in bold print.

Ponce et al. 2022		uafR
Chemical ID	RT	Chemical ID	RT
Pivalaldehyde, semicarbazone	4.735
2-Butenal	4.788
2,4-Dimethyl-1-heptene	4.792	2,4-Dimethyl-1-heptene	4.796
2-Pentanone, 4-hydroxy-4-methyl	4.802	2-Pentanone, 4-hydroxy-4-methyl-	4.804
		Cyclopentanone, 2-methyl-	5.513
Benzene, propyl	6.292
		Benzene, 1-ethyl-3-methyl-	6.404
1-Octen-3-ol	6.516
Butanal	6.525
4,5-Dichloro-1,3-dioxolan-2-one	6.628	Benzene, 1-ethyl-4-methyl-	6.633
3-Octanone	6.648	3-Octanone	6.645
Decane	6.823
Mesitylene	6.869	Mesitylene	6.872
Benzene, 1,2,4-trimethyl-	7.319	Benzene, 1,2,4-trimethyl-	7.017
D-Limonene	7.364
Benzeneethanol, beta-ethyl	7.541
Benzene, 1,4-diethyl	7.659	Benzene, 1,2-diethyl-	7.418
Benzene, 1,2-diethyl	7.759	Benzene, 1,4-diethyl-	7.738
1,3,8-p-Menthatriene	7.86	Limonene	7.780
		p-Cymene	8.200
		Benzene, 2-ethyl-1,4-dimethyl-	8.203
4-Dichloromethyl-2[[2-[1-methyl-2-pyrrolidinyl]ethyl]amino-6-Trichloromethylpyrimidine	8.271
Benzene, (2-methyl-1-propenyl)-	8.279
1-Phenyl-1-butene	8.282
Linalool	8.314	Linalool	8.315
		Undecane	8.329
Nonanal	8.37	Nonanal	8.373
2-Thiophenecarboxylic acid, 5-nonyl-	9.044	Cyclopentasiloxane, decamethyl-	8.950
Dichloroacetaldehyde	9.735
		Cyclopentanecarboxylic acid, pentyl ester	10.185
Linalyl acetate	10.558	1,6-Octadien-3-ol, 3,7-dimethyl-, formate	10.559
Beta-Ocimene	10.561
2-Thiophenecarboxylic acid	10.59	2-Thiophenecarboxylic acid, 3-methylbutyl ester	10.590
1-Pent-3-ynylcyclopenta-1,3-diene	10.653
		Ethanone, 1-(2,5-dimethylphenyl)-	10.816
1,5,6,7-Tetramethylbicyclo[3.2.0]hepta-2,6-diene	10.822	Ethanone, 1-(3,4-dimethylphenyl)-	10.821
		Ethanone, 1-(2,4-dimethylphenyl)-	10.916
Ethanone, 1-(4-ethylphenyl)	11.108	Ethanone, 1-(4-ethylphenyl)-	10.973
		Cyclotetrasiloxane, octamethyl-	13.749
Butyl citrate	21.812
1-Methyl-4-phenyl-5-Thioxo-1,2,4-triazolidin-3-one	23.957
9-Octadecenamide, (Z-)	26.311

The possible applications of a direct connection between R and PubChem are diverse. Beyond statistical tests and advanced computational pipelines, the graphical framework can provide publication quality visuals with minimal code. This package harnesses the most advanced open-source chemical dataset and makes it accessible to anyone with basic experience working in R.

Conclusion

Our described workflow and package utilities bring GC-MS data processing up to par with the advanced technology that generates the data. Though technically uafR should apply in the same manner to other mass spectrometers (e.g. Q-TOF, Orbitrap, TimsTOF HT, and/or Astral) and even liquid chromatography coupled MS data, it has not yet been tested in those contexts. In addition, since uafR depends on published data, our algorithms do not yet apply to MS2, MS3 or ion mobility + MS2 data. This functionality will be added to the software once these spectra are published for most chemicals in PubChem’s database. The difficult portion of chemical identifications should not occur on the computer. Anyone with the ability to install packages and load a “.CSV” file into R now has access to a suite of functions that streamline a complex workflow so more effort can be spent interpreting rather than preparing data. It is important to mention that while uafR accurately processed the GC/MS data tested here, researchers should still validate that the compound areas identified by the algorithm make chemical and/or biological sense in their study system. Thankfully, the output from “categorate()” can help in these assessments by collecting relevant information for every molecule in an easily interpretable structure.

Chemical knowledge has grown increasingly advanced and accessible in recent years. The precision of GC-MS instruments and, consequently, their output, allows published information to be accessed with 100% accuracy. While previous algorithms have focused on using statistics to separate likely aggregates of compound areas, their accuracy fails in complex contexts because too many distinct chemicals “behave” the same (i.e., have the same mass and/or retention indices) and so cannot be teased apart statistically without additional knowledge.

Our approach is the first and, to date, only R package that uses published data to extract compound areas for the most likely compound identifications. By automating this component of the GC-MS workflow, we anticipate our package will greatly increase the speed at which chemistry datasets are published, the size of chemical studies that can be conducted, and the accessibility of chemical analyses to scientists in related fields.

Acknowledgments

The use of trade names is for the purposes of providing scientific information only and does not constitute endorsement by the United States Department of Agriculture. The USDA is an equal opportunity employer.

Data Availability

Datasets used in this analysis are available on GitHub: github.com/castratton/uafR.

Funding Statement

This project was funded by USDA-NIFA projects: #2021-67034-35135 and #2018-67013-27402; and by generous private donations to The Land Institute. In addition, this work was funded, in part, by a United States Department of Agriculture, National Institute of Food and Agriculture, Crop Protection and Pest Management Grant (#2020-70006-33000), the NIH Health Research Centers for Minority Serving Institutions Grant (U54MD015959), and USDA Agricultural Research Service through Congress-appropriated funds. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Bishop RC, Ellis GFR. Contextual Emergence of Physical Properties. Found Phys. 2020;50: 481–510. doi: 10.1007/s10701-020-00333-9
2. Spitzer J, Pielak GJ, Poolman B. Emergence of life: Physical chemistry changes the paradigm. Biol Direct. 2015;10: 33. doi: 10.1186/s13062-015-0060-y
3. Seifert VA. Open questions on emergence in chemistry. Commun Chem. 2022;5: 49. doi: 10.1038/s42004-022-00667-7
4. Sneddon J, Masuram S, Richert JC. Gas Chromatography‐Mass Spectrometry‐Basic Principles, Instrumentation and Selected Applications for Detection of Organic Compounds. Anal Lett. 2007;40: 1003–1012. doi: 10.1080/00032710701300648
5. Baldi P, Nasr R. When is Chemical Similarity Significant? The Statistical Distribution of Chemical Similarity Scores and Its Extreme Values. J Chem Inf Model. 2010;50: 1205–1222. doi: 10.1021/ci100010v
6. Krone N, Hughes BA, Lavery GG, Stewart PM, Arlt W, Shackleton CHL. Gas chromatography/mass spectrometry (GC/MS) remains a pre-eminent discovery tool in clinical steroid investigations even in the era of fast liquid chromatography tandem mass spectrometry (LC/MS/MS). J Steroid Biochem Mol Biol. 2010;121: 496–504. doi: 10.1016/j.jsbmb.2010.04.010
7. Tedone L, Costa R, De Grazia S, Ragusa S, Mondello L. Monodimensional (GC–FID and GC–MS) and Comprehensive Two‐dimensional Gas Chromatography for the Assessment of Volatiles and Fatty Acids from Ruta chalepensis Aerial Parts. Phytochem Anal. 2014;25: 468–475. doi: 10.1002/pca.2518
8. Mondello L, Tranchida PQ, Dugo P, Dugo G. Comprehensive two‐dimensional gas chromatography‐mass spectrometry: A review. Mass Spectrom Rev. 2008;27: 101–124. doi: 10.1002/mas.20158
9. Beale DJ, Pinu FR, Kouremenos KA, Poojary MM, Narayana VK, Boughton BA, et al. Review of recent developments in GC–MS approaches to metabolomics-based research. Metabolomics. 2018;14: 152. doi: 10.1007/s11306-018-1449-2
10. Misra BB. Advances in high resolution GC-MS technology: a focus on the application of GC-Orbitrap-MS in metabolomics and exposomics for FAIR practices. Anal Methods. 2021;13: 2265–2282. doi: 10.1039/d1ay00173f
11. Morrison WR, Ingrao A, Ali J, Szendrei Z. Identification of plant semiochemicals and evaluation of their interactions with early spring insect pests of asparagus. J Plant Interact. 2016;11: 11–19. doi: 10.1080/17429145.2015.1133848
12. Barbosa-Cornelio R, Cantor F, Coy-Barrera E, Rodríguez D. Tools in the Investigation of Volatile Semiochemicals on Insects: From Sampling to Statistical Analysis. Insects. 2019;10: 241. doi: 10.3390/insects10080241
13. Dimcheva V, Kaloyanov N, Karsheva M. The polyphenol composition of Cistus incanus L., Trachystemon orientalis L. and Melissa officinalis L. infusions by HPLC-DAD method. Open J Anal Bioanal Chem. 2019;3: 031–038. doi: 10.17352/ojabc.000008
14. Glassmire AE, Zehr LN, Wetzel WC. Disentangling dimensions of phytochemical diversity: alpha and beta have contrasting effects on an insect herbivore. Ecology. 2020;101. doi: 10.1002/ecy.3158
15. Chung B, Choo H-YP, Kim T, Eom K, Kwon O, Suh J, et al. Analysis of Anabolic Steroids Using GC/MS with Selected Ion Monitoring. J Anal Toxicol. 1990;14: 91–95. doi: 10.1093/jat/14.2.91
16. Shackleton C, Pozo OJ, Marcos J. GC/MS in Recent Years Has Defined the Normal and Clinically Disordered Steroidome: Will It Soon Be Surpassed by LC/Tandem MS in This Role? J Endocr Soc. 2018;2: 974–996. doi: 10.1210/js.2018-00135
17. McDonald JG, Matthew S, Auchus RJ. Steroid Profiling by Gas Chromatography–Mass Spectrometry and High Performance Liquid Chromatography–Mass Spectrometry for Adrenal Diseases. Horm Cancer. 2011;2: 324–332. doi: 10.1007/s12672-011-0099-x
18. Bajusz D, Rácz A, Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 2015;7: 20. doi: 10.1186/s13321-015-0069-3
19. Holliday JD, Salim N, Whittle M, Willett P. Analysis and Display of the Size Dependence of Chemical Similarity Coefficients. J Chem Inf Comput Sci. 2003;43: 819–828. doi: 10.1021/ci034001x
20. Firn RD, Jones CG. Natural products – a simple model to explain chemical diversity. Nat Prod Rep. 2003;20: 382. doi: 10.1039/b208815k
21. Foster SP, Harris MO. Behavioral Manipulation Methods for Insect Pest-Management. Annu Rev Entomol. 1997;42: 123–146. doi: 10.1146/annurev.ento.42.1.123
22. Hansen IA, Rodriguez SD, Drake LL, Price DP, Blakely BN, Hammond JI, et al. The Odorant Receptor Co-Receptor from the Bed Bug, Cimex lectularius L. Reisert J, editor. PLoS One. 2014;9: e113692. doi: 10.1371/journal.pone.0113692
23. Regnault-Roger C, Vincent C, Arnason JT. Essential Oils in Insect Control: Low-Risk Products in a High-Stakes World. Annu Rev Entomol. 2012;57: 405–424. doi: 10.1146/annurev-ento-120710-100554
24. Pickett JA, Bruce TJA, Chamberlain K, Hassanali A, Khan ZR, Matthes MC, et al. Plant Volatiles Yielding New Ways to Exploit Plant Defence. Chemical Ecology. Dordrecht: Springer Netherlands; pp. 161–173. doi: 10.1007/978-1-4020-5369-6_11
25. Lionta E, Spyrou G, Vassilatis D, Cournia Z. Structure-Based Virtual Screening for Drug Discovery: Principles, Applications and Recent Advances. Curr Top Med Chem. 2014;14: 1923–1938. doi: 10.2174/1568026614666140929124445
26. Mason J, Good A, Martin E. 3-D Pharmacophores in Drug Discovery. Curr Pharm Des. 2001;7: 567–597. doi: 10.2174/1381612013397843
27. Kumar A, Zhang KYJ. Advances in the Development of Shape Similarity Methods and Their Application in Drug Discovery. Front Chem. 2018;6. doi: 10.3389/fchem.2018.00315
28. Xie X-QS. Exploiting PubChem for virtual screening. Expert Opin Drug Discov. 2010;5: 1205–1220. doi: 10.1517/17460441.2010.524924
29. Cheng T, Pan Y, Hao M, Wang Y, Bryant SH. PubChem applications in drug discovery: a bibliometric analysis. Drug Discov Today. 2014;19: 1751–1756. doi: 10.1016/j.drudis.2014.08.008
30. Lowndes JSS, Best BD, Scarborough C, Afflerbach JC, Frazier MR, O’Hara CC, et al. Our path to better science in less time using open data science tools. Nat Ecol Evol. 2017;1: 0160. doi: 10.1038/s41559-017-0160
31. Ponce MA, Lizarraga S, Bruce A, Kim TN, Morrison WR. Grain Inoculated with Different Growth Stages of the Fungus, Aspergillus flavus, Affect the Close-Range Foraging Behavior by a Primary Stored Product Pest, Sitophilus oryzae (Coleoptera: Curculionidae). Stelinski L, editor. Environ Entomol. 2022;51: 927–939. doi: 10.1093/ee/nvac061
32. Szöcs E, Stirling T, Scott ER, Scharmüller A, Schäfer RB. webchem: An R Package to Retrieve Chemical Information from the Web. J Stat Softw. 2020;93: 1–17. doi: 10.18637/jss.v093.i13
33. Cao Y, Charisi A, Cheng L-C, Jiang T, Girke T. ChemmineR: a compound mining framework for R. Bioinformatics. 2008;24: 1733–1734. doi: 10.1093/bioinformatics/btn307
34. Wang Y, Backman TWH, Horan K, Girke T. fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics. 2013;29: 2792–2794. doi: 10.1093/bioinformatics/btt475
35. Hpoo MK, Mishyna M, Prokhorov V, Arie T, Takano A, Oikawa Y, et al. Potential of Octanol and Octanal from Heracleum sosnowskyi Fruits for the Control of Fusarium oxysporum f. sp. lycopersici. Sustainability. 2020;12: 9334. doi: 10.3390/su12229334

PLoS One. doi: 10.1371/journal.pone.0306202.r001

Decision Letter 0

Shailender Kumar Verma

11 Apr 2024

PONE-D-24-10543

uafR: An R package that automates mass spectrometry data processing

PLOS ONE

Dear Dr. Murrell,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by May 26 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Shailender Kumar Verma, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following financial disclosure:

[This project was funded by USDA-NIFA projects: #2021-67034-35135 and #2018-67013-27402; and by generous private donations to The Land Institute. In addition, this work was funded, in part, by a United States Department of Agriculture, National Institute of Food and Agriculture, Crop Protection and Pest Management Grant (#2020-70006-33000) and USDA Agricultural Research Service through Congress-appropriated funds.].

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

[The use of trade names is for the purposes of providing scientific information only and does not constitute endorsement by the United States Department of Agriculture. The USDA is an equal opportunity employer. This project was funded by USDA-NIFA projects: #2021-67034-35135 and #2018-67013-27402; and by generous private donations to The Land Institute. In addition, this work was funded, in part, by a United States Department of Agriculture, National Institute of Food and Agriculture, Crop Protection and Pest Management Grant (#2020-70006-33000) and USDA Agricultural Research Service through Congress-appropriated funds.]

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors presented a manuscript that describes the creation of an R package to process raw GC-MS data. They also claim applicability to LC-MS data; however, they failed to present any supporting data and did not address the fundamental differences between GC-MS and LC-MS data that would make the workflow a challenge. The authors compare their process to “manually selecting, integrating and identifying peaks” but fail to compare it to any open source or commercial solution. For a non-targeted experiment, it would be rare for a lab not to use some level of processing software. The manuscript would be improved greatly by describing the current GC-MS solutions and how the authors' code improves on their shortcomings. The manuscript was not clear on the process of annotation. Were the EI spectra searched against the NIST library? Was accurate mass searching used? Or both? If the latter, how was a parent mass obtained? What were the resolution and mass accuracy of the instrument used to collect the dataset described?

The authors claim “the most accurate compound areas”; however, there is little evidence to support this claim. The approach would need to be compared to other peak detection algorithms and not to the manual process, which the authors stated is full of bias. In addition, any claim about speed needs comparison to other algorithms instead of the manual process. The size of the datasets used in the examples was relatively small and does not support any claim of an ability to handle large datasets.

The authors presented a process that was built from what appears to be existing code to perform tasks that numerous software packages can do. There is no evidence that this process would be applicable to LC-MS data.

Consistently use GC-MS, and m/z should be italicized.

Reviewer #2: In this manuscript, entitled “uafR: An R package that automates mass spectrometry data processing”, the authors aim to demonstrate that their workflow and R package use published data to extract the most accurate compound areas for the most likely compound identifications, and that the package will greatly increase the speed at which chemistry datasets are published, the size of chemical studies that can be conducted, and the accessibility of chemical analyses. This reviewer believes this manuscript will be beneficial for readers of PLOS ONE to some extent. However, the authors need to address the following concerns before the manuscript can be considered for publication in PLOS ONE.

Comments and Concerns:

1. In the manuscript, the authors did not mention which mass spectrometer was used for the first dataset; they only describe the second dataset, GC-MS data from an Agilent 5997B mass spectrometer, which has a single quadrupole detector. The authors should provide more examples or demonstrations on mass spectrometers from different vendors or on other types of mass spectrometers, like Q-TOF or Orbitrap, or even the most cutting-edge ones like TimsTOF HT or Astral.

2. Compared with commercial software, like Compound Discoverer from Thermo, what are the advantages of uafR?

3. It would be great if the authors could address whether uafR can handle MS2 or MS3 data, and additionally whether it can handle ion mobility + MS2 data.

Reviewer #3: The authors have developed a new tool for processing large volumes of GC-LC/MS data in a minimum of time. They validated this tool using 2 different data sets. In addition, they allow everyone to test this tool with other data sets. In addition, the methods applied are well described and rigorous.

Reviewer #4: I want to congratulate you on your work. As an LC-MS user, although not a direct part of the target audience, I still appreciate your work very much. I wrote some suggestions in the attached document, which I think would improve the user and reader experience of your target audience. I did find a few small misspellings and typos; please do a careful read and check/correct these.

Reviewer #5: Review for UafR: An open-source R package that automates mass spectrometry data processing

I have previously reviewed this manuscript for a different journal and during the multiple revisions reviewed concluded that the manuscript was ready for publication. Only minor changes have been made since, so I can still recommend this for publication as is. For the sake of transparency and thoroughness, I've copied my prior review pertaining to this manuscript below so that the editor can see what was pointed out then and how edits were made. The only comment I'll make here is that the authors could tone down a bit of language used for more succinct and detail-oriented descriptions of implementation and results (for example, lines 235-237). Though I'll acknowledge this is personal preference.

Review from April 2023, Journal of Cheminformatics

The authors describe ‘uafR’, an R package to support mass spectrometry data processing. I believe the package the authors have created can provide utility for mass spec users, specifically for those interested in compiling metadata to support annotations. However, I believe the manuscript could be improved with more context and perspective. As is, it is not abundantly clear why a user would elect to use this tool over other similar tools or web resources available. Additionally, it’s difficult to determine what the primary goals of the package are- standard curve generation and metadata/substructure calculation are quite different pieces of a workflow, so describing the primary benefit would make the manuscript clearer as well. I’ve provided section specific comments below.

Introduction

Lines 34-39: here you outline the first impediment. When I see how the rest of the paper is outlined, it appears that this step is achieved with the Unknowns Analysis tool, right? So first identifications occur in a tool of choice via library search, etc., and then the ‘uafR’ tool is used for enhanced metadata? I think providing some connection point between the impediments outlined in the introduction and how the implementation and results solve them will be helpful for readers.

Line 56-60: related to the above comment and this section, perhaps what is missing to me is exactly what is included in the outputs of Unknowns Analysis and then what is really being enhanced by the ‘uafR’ tool. Can you provide example outputs? Or snippets via screenshot to help orient the reader?

Implementation

Lines 120-123: So all identifications come from Unknowns Analysis? What if a user doesn’t use Unknowns Analysis?

Line 153-155: This function seems incredibly helpful. I’d recommend highlighting this even further

Line 206-212: How is the standard curve generated here different (or better?) than using the standard curve from elsewhere? Is the standard curve part of Unknowns Analysis too or it needs to be generated here because it is not in Unknowns Analysis? A little more clarity here would help.

Line 247-250: I think I missed a step in here. Are the peak areas calculated in Unknowns Analysis and then compared in ‘uafR’ or are they re-generated in ‘uafR’? Suggest adding a bit more detail around the steps followed.

Reviewer #6: I read the submitted manuscript to try providing a correct report on it.

However, I realized that the paper is presenting an original software able to analyze and report GC-MS and LC-MS data…

I am not able to judge how efficient this software is, since I am not an expert in computer script coding…

I can only state that there is a need for such informatic tools but I cannot state if this one is efficient or not.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Reviewer #5: No

Reviewer #6: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Review PONE-D-24-10543.docx

pone.0306202.s001.docx (13.9KB, docx)

PLoS One. 2024 Jul 5;19(7):e0306202. doi: 10.1371/journal.pone.0306202.r002

Author response to Decision Letter 0

24 May 2024

Please see the attached file, "Rebuttal Letter_uafR PLoS One 2024" for detailed responses to reviewer comments.

Attachment

Submitted filename: Rebuttal Letter_uafR PLoS One 2024.docx

pone.0306202.s002.docx (25.3KB, docx)

PLoS One. doi: 10.1371/journal.pone.0306202.r003

Decision Letter 1

Shailender Kumar Verma

13 Jun 2024

uafR: An R package that automates mass spectrometry data processing

PONE-D-24-10543R1

Dear Dr. Murrell,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Shailender Kumar Verma, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: N/A

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #2: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Reviewer #2: (No Response)

Reviewer #4: Congratulations on your hard work! The revised manuscript added some crucial data, and although more could have been said/written in the discussions section of the manuscript, I think your manuscript can be published in its current form. I hope the software package you developed will be freeware, as it would help many analysts who are beginners and/or are struggling with mass spectra interpretations. Best regards,

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Guanghui Han

Reviewer #4: No

**********

PLoS One. doi: 10.1371/journal.pone.0306202.r004

Acceptance letter

Shailender Kumar Verma

25 Jun 2024

PONE-D-24-10543R1

PLOS ONE

Dear Dr. Murrell,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Shailender Kumar Verma

Academic Editor

PLOS ONE

Associated Data

Supplementary Materials

Attachment

Submitted filename: Review PONE-D-24-10543.docx

pone.0306202.s001.docx (13.9 KB, DOCX)

Attachment

Submitted filename: Rebuttal Letter_uafR PLoS One 2024.docx

pone.0306202.s002.docx (25.3 KB, DOCX)

Data Availability Statement

Datasets used in this analysis are available on GitHub: github.org/castratton/uafR.

1. Bishop RC, Ellis GFR. Contextual Emergence of Physical Properties. Found Phys. 2020;50: 481–510. doi: 10.1007/s10701-020-00333-9

2. Spitzer J, Pielak GJ, Poolman B. Emergence of life: Physical chemistry changes the paradigm. Biol Direct. 2015;10: 33. doi: 10.1186/s13062-015-0060-y

3. Seifert VA. Open questions on emergence in chemistry. Commun Chem. 2022;5: 49. doi: 10.1038/s42004-022-00667-7

4. Sneddon J, Masuram S, Richert JC. Gas Chromatography‐Mass Spectrometry‐Basic Principles, Instrumentation and Selected Applications for Detection of Organic Compounds. Anal Lett. 2007;40: 1003–1012. doi: 10.1080/00032710701300648

5. Baldi P, Nasr R. When is Chemical Similarity Significant? The Statistical Distribution of Chemical Similarity Scores and Its Extreme Values. J Chem Inf Model. 2010;50: 1205–1222. doi: 10.1021/ci100010v

6. Krone N, Hughes BA, Lavery GG, Stewart PM, Arlt W, Shackleton CHL. Gas chromatography/mass spectrometry (GC/MS) remains a pre-eminent discovery tool in clinical steroid investigations even in the era of fast liquid chromatography tandem mass spectrometry (LC/MS/MS). J Steroid Biochem Mol Biol. 2010;121: 496–504. doi: 10.1016/j.jsbmb.2010.04.010

7. Tedone L, Costa R, De Grazia S, Ragusa S, Mondello L. Monodimensional (GC–FID and GC–MS) and Comprehensive Two‐dimensional Gas Chromatography for the Assessment of Volatiles and Fatty Acids from Ruta chalepensis Aerial Parts. Phytochem Anal. 2014;25: 468–475. doi: 10.1002/pca.2518

8. Mondello L, Tranchida PQ, Dugo P, Dugo G. Comprehensive two‐dimensional gas chromatography‐mass spectrometry: A review. Mass Spectrom Rev. 2008;27: 101–124. doi: 10.1002/mas.20158

9. Beale DJ, Pinu FR, Kouremenos KA, Poojary MM, Narayana VK, Boughton BA, et al. Review of recent developments in GC–MS approaches to metabolomics-based research. Metabolomics. 2018;14: 152. doi: 10.1007/s11306-018-1449-2

10. Misra BB. Advances in high resolution GC-MS technology: a focus on the application of GC-Orbitrap-MS in metabolomics and exposomics for FAIR practices. Anal Methods. 2021;13: 2265–2282. doi: 10.1039/d1ay00173f

11. Morrison WR, Ingrao A, Ali J, Szendrei Z. Identification of plant semiochemicals and evaluation of their interactions with early spring insect pests of asparagus. J Plant Interact. 2016;11: 11–19. doi: 10.1080/17429145.2015.1133848

12. Barbosa-Cornelio R, Cantor F, Coy-Barrera E, Rodríguez D. Tools in the Investigation of Volatile Semiochemicals on Insects: From Sampling to Statistical Analysis. Insects. 2019;10: 241. doi: 10.3390/insects10080241

13. Dimcheva V, Kaloyanov N, Karsheva M. The polyphenol composition of Cistus incanus L., Trachystemon orientalis L. and Melissa officinalis L. infusions by HPLC-DAD method. Open J Anal Bioanal Chem. 2019;3: 031–038. doi: 10.17352/ojabc.000008

14. Glassmire AE, Zehr LN, Wetzel WC. Disentangling dimensions of phytochemical diversity: alpha and beta have contrasting effects on an insect herbivore. Ecology. 2020;101. doi: 10.1002/ecy.3158

15. Chung B, Choo H-YP, Kim T, Eom K, Kwon O, Suh J, et al. Analysis of Anabolic Steroids Using GC/MS with Selected Ion Monitoring. J Anal Toxicol. 1990;14: 91–95. doi: 10.1093/jat/14.2.91

16. Shackleton C, Pozo OJ, Marcos J. GC/MS in Recent Years Has Defined the Normal and Clinically Disordered Steroidome: Will It Soon Be Surpassed by LC/Tandem MS in This Role? J Endocr Soc. 2018;2: 974–996. doi: 10.1210/js.2018-00135

17. McDonald JG, Matthew S, Auchus RJ. Steroid Profiling by Gas Chromatography–Mass Spectrometry and High Performance Liquid Chromatography–Mass Spectrometry for Adrenal Diseases. Horm Cancer. 2011;2: 324–332. doi: 10.1007/s12672-011-0099-x

18. Bajusz D, Rácz A, Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 2015;7: 20. doi: 10.1186/s13321-015-0069-3

19. Holliday JD, Salim N, Whittle M, Willett P. Analysis and Display of the Size Dependence of Chemical Similarity Coefficients. J Chem Inf Comput Sci. 2003;43: 819–828. doi: 10.1021/ci034001x

20. Firn RD, Jones CG. Natural products – a simple model to explain chemical diversity. Nat Prod Rep. 2003;20: 382. doi: 10.1039/b208815k

21. Foster SP, Harris MO. Behavioral Manipulation Methods for Insect Pest-Management. Annu Rev Entomol. 1997;42: 123–146. doi: 10.1146/annurev.ento.42.1.123

22. Hansen IA, Rodriguez SD, Drake LL, Price DP, Blakely BN, Hammond JI, et al. The Odorant Receptor Co-Receptor from the Bed Bug, Cimex lectularius L. Reisert J, editor. PLoS One. 2014;9: e113692. doi: 10.1371/journal.pone.0113692

23. Regnault-Roger C, Vincent C, Arnason JT. Essential Oils in Insect Control: Low-Risk Products in a High-Stakes World. Annu Rev Entomol. 2012;57: 405–424. doi: 10.1146/annurev-ento-120710-100554

24. Pickett JA, Bruce TJA, Chamberlain K, Hassanali A, Khan ZR, Matthes MC, et al. Plant Volatiles Yielding New Ways to Exploit Plant Defence. Chemical Ecology. Dordrecht: Springer Netherlands; pp. 161–173. doi: 10.1007/978-1-4020-5369-6_11

25. Lionta E, Spyrou G, Vassilatis D, Cournia Z. Structure-Based Virtual Screening for Drug Discovery: Principles, Applications and Recent Advances. Curr Top Med Chem. 2014;14: 1923–1938. doi: 10.2174/1568026614666140929124445

26. Mason J, Good A, Martin E. 3-D Pharmacophores in Drug Discovery. Curr Pharm Des. 2001;7: 567–597. doi: 10.2174/1381612013397843

27. Kumar A, Zhang KYJ. Advances in the Development of Shape Similarity Methods and Their Application in Drug Discovery. Front Chem. 2018;6. doi: 10.3389/fchem.2018.00315

28. Xie X-QS. Exploiting PubChem for virtual screening. Expert Opin Drug Discov. 2010;5: 1205–1220. doi: 10.1517/17460441.2010.524924

29. Cheng T, Pan Y, Hao M, Wang Y, Bryant SH. PubChem applications in drug discovery: a bibliometric analysis. Drug Discov Today. 2014;19: 1751–1756. doi: 10.1016/j.drudis.2014.08.008

30. Lowndes JSS, Best BD, Scarborough C, Afflerbach JC, Frazier MR, O’Hara CC, et al. Our path to better science in less time using open data science tools. Nat Ecol Evol. 2017;1: 0160. doi: 10.1038/s41559-017-0160

31. Ponce MA, Lizarraga S, Bruce A, Kim TN, Morrison WR. Grain Inoculated with Different Growth Stages of the Fungus, Aspergillus flavus, Affect the Close-Range Foraging Behavior by a Primary Stored Product Pest, Sitophilus oryzae (Coleoptera: Curculionidae). Stelinski L, editor. Environ Entomol. 2022;51: 927–939. doi: 10.1093/ee/nvac061

32. Szöcs E, Stirling T, Scott ER, Scharmüller A, Schäfer RB. webchem: An R Package to Retrieve Chemical Information from the Web. J Stat Softw. 2020;93: 1–17. doi: 10.18637/jss.v093.i13

33. Cao Y, Charisi A, Cheng L-C, Jiang T, Girke T. ChemmineR: a compound mining framework for R. Bioinformatics. 2008;24: 1733–1734. doi: 10.1093/bioinformatics/btn307

34. Wang Y, Backman TWH, Horan K, Girke T. fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics. 2013;29: 2792–2794. doi: 10.1093/bioinformatics/btt475

35. Hpoo MK, Mishyna M, Prokhorov V, Arie T, Takano A, Oikawa Y, et al. Potential of Octanol and Octanal from Heracleum sosnowskyi Fruits for the Control of Fusarium oxysporum f. sp. lycopersici. Sustainability. 2020;12: 9334. doi: 10.3390/su12229334
