Abstract
Untargeted metabolomics is a powerful approach for identifying small molecules from highly complex mixtures, such as biological tissues or environmental samples. This technology enables the relatively fast and inexpensive identification of metabolites in situations where many or most of the chemical species are unknown before the experiment begins. This situation often arises in biomedical and environmental research, as well as in the case described here, the discovery of metabolites from plants. The objective of this paper is to provide practical and technical knowledge about untargeted metabolomics using mass spectrometry as the detection method. Specifically, we focus on liquid chromatography tandem mass spectrometry (LC‐MS/MS). We provide a consolidated protocol for new users, serving as a starting point for experimental design, data collection, and data analysis. We explain the terminology and technical details in the context of real experiments and samples. In addition to general background information, step‐by‐step protocols are provided for sample preparation, liquid chromatography‐tandem mass spectrometry data collection, and data analysis, utilizing readily available and widely used software. The chosen example data set is based on plant metabolites with varying chemical properties; however, the approach is applicable to essentially any complex biological sample. © 2025 The Author(s). Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: Sample preparation for LC‐MS/MS
Support Protocol 1: Preparing a ‘master mix’ sample for assessment of liquid chromatography and sensitivity
Basic Protocol 2: LC‐MS/MS data collection
Basic Protocol 3: Data analysis using the software MSConvert, MZMine, and SIRIUS
Support Protocol 2: Using the MZMine batch file
Keywords: electrospray ionization, LC‐MS/MS, mass spectrometry, plant metabolites, untargeted metabolomics
INTRODUCTION
Untargeted metabolomics is a strategy for discovering unknown small molecules (approx. ≤2000 Da) from complex biological samples. Although it is often compared to nucleic acid or protein sequencing, metabolomics is in many ways a more challenging problem because small molecules are so structurally diverse. Due to the lack of common building blocks, confident identification of unknown molecules typically requires correlating fragmentation data with retention time (Schrimpe‐Rutledge et al., 2016). Commercially available reference compounds are lacking for many metabolites, especially those found in plants; however, several publicly accessible metabolite databases exist (Kind et al., 2018), and their coverage and molecular diversity continue to increase with every data set deposited by the community. Some small molecule metabolites are conserved across species, making comparative studies possible (Hamberger & Bak, 2013; Wen et al., 2020) whereas others, including the precursors of some of our most important drugs, are limited to particular families or even specific plant species (Cech & Oberlies, 2023; Dhyani et al., 2022; Patocka et al., 2020; Tron et al., 2006). Plant metabolomics studies are therefore important for diverse fields including medicinal chemistry, chemical biology, ecology, and agriculture.
There are four main types of mass spectrometry instrumentation (paired with compatible chromatography systems) commonly used for this type of analysis: 1) liquid chromatography (LC)‐MS, or LC‐MS/MS; 2) matrix‐assisted laser‐desorption ionization (MALDI) MS; 3) gas chromatography (GC) MS; and 4) LC with tandem quadrupole mass filters, or ‘triple‐quad’ (QQQ) MS. The choice of what technique or combination of techniques to use depends on the amount of sample available, the desired accuracy, and the resources available (financial and researcher time) for the project. Here, we aim to provide a general overview of LC‐MS/MS, as well as guidance on sample preparation, data collection, and analysis of LC‐MS/MS data. These guidelines can then be extrapolated to lower‐resolution techniques and different ionization methods, such as GC‐EI‐MS, MALDI‐TOF‐MS, and QQQ‐MS, among others. We specifically focus on electrospray ionization (ESI), a technique where a high voltage is applied to a liquid to generate an aerosol containing ions derived from analyte molecules. ESI was developed independently by Yamashita and Fenn (Yamashita & Fenn, 1984) and Gall and co‐workers (Alexandrov et al., 1984). It is particularly useful for biological samples and complex mixtures because it is relatively gentle, ionizing many molecules without fragmenting them, and because it can produce multiply charged species, thus effectively extending its mass range to include macromolecules. LC‐ESI‐MS, or liquid chromatography electrospray ionization mass spectrometry, is one of the most widely used high‐resolution variants of mass spectrometry, largely due to its low limit of detection and wide range of detectable masses (Laaniste et al., 2019).
Here, we focus on liquid chromatography systems (ultra‐high‐performance liquid chromatography) paired with high‐resolution mass spectrometers, specifically quadrupole time‐of‐flight instruments (Q‐TOFs). However, the principles covered here generally apply to lower‐resolution and/or lower‐accuracy mass spectrometers and/or high‐pressure liquid chromatography (HPLC), with the understanding that the relevant tolerance parameters must be adjusted accordingly. The abbreviation LC‐MS is used in this section as a general term to describe all forms of the relevant instrumentation. We use 0.1% formic acid (FA) in H2O as our aqueous solvent A, and acetonitrile (ACN) as our solvent B. At the beginning of the experiment, the analyte(s) should be dissolved in solvent A, and the concentration should be high enough to be easily detected (maximizing signal), but not so high as to overload the column or the mass spectrometer. This can be experimentally determined as described in Support Protocol 1, but will generally be in the range of 1–10 micromolar (micromoles per liter, µM). Chromatography gradients may immediately start ramping up the concentration of the organic solvent, or a short isocratic wash period may be used. The latter can be used in conjunction with a waste divert to send any material that does not bind to the column (e.g., salts and very small, polar metabolites) away from the mass spectrometer. This is useful because bulk salts are usually best kept away from the source, as they are non‐volatile and can dirty the source over time, reducing sensitivity or causing the formation of unwanted adducts. Metal ion adducts may or may not be useful depending on the context, but are best introduced deliberately under controlled conditions rather than due to source contamination.
In an LC‐ESI‐MS experiment, a sample containing analyte(s) dissolved in at least one solvent is injected onto the instrument, where a continuous flow of solvent carries the mixture of analytes to the front of a column: a tube packed with (almost universally) silica functionalized with a chemical coating designed to bind a wide range of molecules with varying affinities (Waters). After analytes bind to the column bed (the stationary phase), they can be eluted by changing the composition of the solvent (the mobile phase) to make it a more favorable environment for the analyte than the column bed; this is usually accomplished by making the mobile phase more non‐polar, for example by switching from a high concentration of water to a high concentration of acetonitrile. The solvent flow exiting the column (potentially containing analytes of interest as they elute) is aerosolized out of a high‐voltage capillary into a stream of hot desolvation gas. The gas removes solvent and concentrates each droplet until the droplet reaches the Rayleigh limit for how much charge it can carry and disperses into an aerosol containing gas‐phase ions (Prabhu et al., 2023; Wilm, 2011), as depicted in Figure 1. These ions are directed into the source, where they are steered via the electric fields of a series of ion guides into the flight tube, the component of the instrument responsible for measuring each ion's time‐of‐flight (TOF), which depends on its mass‐to‐charge (m/z) ratio.
The TOF measurements are compared against a calibration curve of known m/z values and their observed TOFs, allowing the accurate determination of unknown m/z values to within less than 0.001 Da; extra‐careful measurements using an internal calibrant can bring accuracy to better than 0.0001 Da. (Note: in the context of small‐molecule metabolomics, m/z and Da are often used interchangeably, based on the assumption that small molecules usually bear only a single unit charge, +1 or −1.) These measurements occur on a timescale of milliseconds to seconds; we consider ‘reasonable’ scan times to be between 0.1 and 0.5 seconds. Each scan is produced by collecting signals for a set period of time across the mass range designated by the user (e.g., 250‐ms scans across a mass range of 50–2000 Da). Each scan is associated with a retention time, a measure of where in the chromatography gradient the data were collected. Analytes elute on the scale of seconds; the bulk of a chromatographic peak is expected to elute over approximately 0.1 min, or 6 seconds.
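The calibration idea can be illustrated with a minimal sketch. This assumes an idealized two‐parameter model in which flight time scales with the square root of m/z (real vendor calibration functions are more elaborate); `fit_tof_calibration` and `mz_from_tof` are hypothetical names used only for illustration.

```python
import math

def fit_tof_calibration(known_mz, observed_tof):
    """Least-squares fit of t = a*sqrt(m/z) + b to calibrant measurements.

    Simplified sketch: to first order, flight time scales with the square
    root of m/z at fixed accelerating voltage; instrument software uses
    richer calibration functions than this two-parameter model.
    """
    xs = [math.sqrt(mz) for mz in known_mz]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(observed_tof) / n
    a = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, observed_tof)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_t - a * mean_x
    return a, b

def mz_from_tof(tof, a, b):
    """Invert the calibration to recover m/z from a measured flight time."""
    return ((tof - b) / a) ** 2
```

On real data, the residuals between the fitted curve and the calibrant peaks give a sense of the achievable mass accuracy.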
Figure 1.

A diagram illustrating electrospray ionization and the quadrupole time‐of‐flight (QTOF) ion path (not drawn to scale). The sample is directed through a high‐voltage capillary into the source enclosure, where it is desolvated, resulting in the rapid concentration of positive charge and yielding an aerosolized spray of positively charged ions. These ions are directed into a mass analyzer via ion optics, which steer the charged particles perpendicular to the direction of the spray, while uncharged molecules travel to the edge of the source enclosure and are removed. The quadrupole can act as a mass filter, isolating a narrow m/z window (typically ~1.5 Da wide) rather than allowing all ions through, thereby enabling targeted MS/MS experiments. In a targeted MS/MS experiment, ions are fragmented inside the collision cell to generate clean MS/MS spectra for a particular molecule in a complex mixture. In untargeted data‐independent acquisitions, such as MSe, all ions are fragmented simultaneously, and the resulting fragments are assigned to specific precursors based on peak shape and retention time. All ions have their m/z values measured via the flight tube, which measures how long it takes ions to reach the detector after traveling the flight path. Exact mass measurements are enabled by calibration with a series of known standards, such as clusters of sodium formate, sodium iodide, or cesium iodide.
This means that for any one analyte, with a 250‐ms scan repeated as the peak elutes, you could generate a chromatogram following the intensity of that peak with as many as 6000 ms/250 ms = 24 scans across the bulk of the peak: an extremely high sampling density, considering that LC gradients are usually 15‐30 min long but can span 1–2 hr for complex proteolytic digests. As long as each analyte in a complex sample can be resolved either chromatographically (differing retention times) or by m/z ratio, there is, in principle, no limit to sample complexity if you are willing to devote the instrument time to resolving the components. Accurate measurement of analyte abundances relies on consistent and accurate chromatography across all samples in an experiment, as well as consistent and accurate mass measurements both within and across samples.
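The scan‐count arithmetic above can be captured in a small helper; `scans_across_peak` is a hypothetical name, and the numbers mirror the example in the text (a 0.1‐min peak sampled with 250‐ms scans).

```python
def scans_across_peak(peak_width_min: float, scan_time_s: float) -> int:
    """Number of scans collected across a chromatographic peak.

    peak_width_min: width of the bulk of the peak in minutes (e.g., 0.1 min).
    scan_time_s: duration of one scan in seconds (e.g., 0.25 s).
    """
    return int((peak_width_min * 60.0) / scan_time_s)

# The example from the text: a 0.1-min (6000-ms) peak with 250-ms scans.
print(scans_across_peak(0.1, 0.25))  # 24
```

Longer MS2 scan times or additional MS2 channels reduce this number, which is one reason scan-time budgeting matters in crowded chromatograms.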
The other major technique used for untargeted metabolomics is NMR spectroscopy, which provides detailed structural information, but at the expense of reduced sensitivity. NMR is therefore the tool of choice for studies where structure elucidation of unknown molecules is required and sufficient sample is available (Bergo et al., 2024; Kim et al., 2010; Mahrous & Farag, 2015; Pontes et al., 2017). The chemical specificity of NMR makes it highly useful for studies where unexpected products are formed or where the objective is to discover novel molecules. NMR is quantitative and reproducible, and can be used without internal standards. NMR metabolomics experiments require minimal sample preparation, especially for water‐soluble samples. The extraction step can be eliminated by using high‐resolution magic angle spinning (HR‐MAS) on intact or minimally processed tissues (Augustijn et al., 2021). Although less sensitive than MS, NMR is also a non‐destructive technique. Therefore, the relatively large samples required for NMR measurements can also be reused for other studies. Further, the same molecules can be investigated using different pulse sequences, allowing confirmation of molecular species identification. NMR is used to detect disease states using human blood or urine samples (Gowda & Raftery, 2023), often via comparison to known compounds in the Human Metabolite Database (Wishart et al., 2007, 2022), which contains 1H and 13C chemical shifts for >200,000 metabolites, many of which are linked to data about enzymes and transporters. Other methods have been developed to identify unknown compounds from complex biological mixtures. For example, natural‐abundance 1H‐13C pure shift HSQCs can be used to perform untargeted metabolomics (Timári et al., 2019), and multiple experiments on the same sample can be correlated automatically (Lefort et al., 2021).
More complex workflows involving pre‐concentration, LC‐NMR, and more time‐intensive data collection are available to identify completely unknown molecules in complex biofluids (Garcia‐Perez et al., 2020). Studies using both NMR and mass spectrometry combine the advantages of both techniques; this type of tandem approach has been used to investigate adaptations to drought in roots (Honeker et al., 2022), compare the amount of primary and secondary metabolites produced in different organs throughout development (Noleto‐Dias et al., 2024), and discover new anthocyanins and related compounds (Denish et al., 2021; Yang et al., 2014).
Here we provide three protocols covering sample preparation, data acquisition, and data analysis. Basic Protocol 1 describes how to prepare samples from plant tissue. Support Protocol 1 outlines the procedure for creating a pooled sample to assess chromatographic performance. Basic Protocol 2 covers experimental data acquisition. Basic Protocol 3 details how to analyze the data with commonly used free (for academic labs) software packages. Support Protocol 2 provides detailed instructions on creating and using a custom MZMine batch file for your project.
NOTE: All protocols involving animals must be reviewed and approved by the appropriate Animal Care and Use Committee and must comply with regulations governing the care and use of laboratory animals. Appropriate informed consent is necessary for the acquisition and use of human study material.
Basic Protocol 1. SAMPLE PREPARATION FOR LC‐MS/MS
This protocol focuses on the preparation of plant samples for analysis by LC‐ESI‐MS. The main priorities when preparing such samples are: 1) effective extraction of the desired class(es) of molecules into the final sample; 2) effective removal of protein and other large biomolecules that may clog the LC, most concerningly through the formation of insoluble aggregates that can damage the stationary phase; and 3) avoiding excessive dilution of analytes of interest during sample preparation. Under ideal conditions, the samples being compared are all prepared by the same individual, using the same set of equipment (micropipettes, balances, etc.), with solvents and standards all prepared in a single batch. Samples should ideally be analyzed in a single continuous run on the LC‐MS. Blanks and quality control samples should be used to ensure that the chromatographic parameters do not result in carryover between samples and that retention time alignment between samples is of high quality (ideally within ±0.1 min) prior to any post‐processing corrections. The LC‐MS should be calibrated to ensure mass accuracy. For instruments with the capability, a lock mass should be used to correct for temperature‐ and pressure‐related changes over long sample queues.
The amount of biological material to use in the preparation of a single sample depends heavily on the source of the material, the material complexity, and the classes of molecules in question. For example, it is reasonable to use 100 mg of plant tissue containing a wide range of metabolites, but 100 mg of pure caffeine would overload the column with an overly concentrated sample. Differences in electrospray efficiency will also impact performance: an amine will produce a much higher signal than an alcohol at the same concentration. When possible, use previous literature to estimate appropriate starting quantities, then modify the published procedure based on the signal quality of test samples before running a full experiment. In the example protocol shown below, which is based on a project in our group, tissue from the carnivorous plant Drosera capensis is used; however, leaf tissue from any plant species should give comparable results. For context, D. capensis produces a highly sticky, carbohydrate‐based mucilage that traps prey insects when they land on the leaf. When examining the small‐molecule metabolites this plant produces in response to feeding, it is necessary to remove this mucilage prior to sample preparation to avoid overloading the column or the instrument with polysaccharides. A more minimal wash step can be performed for plants that lack mucilage on the leaf surface. Another, more general, concern is the removal of proteins from the sample, since they can aggregate and/or otherwise clog the column. Deproteination is accomplished through precipitation using a high concentration of organic solvent (MeOH, ≥ 90 % v/v).
Materials
50–100 mg freshly cut leaf tissue (here, Drosera capensis)
Purified water (ideally Nanopure, but distilled or reverse osmosis is acceptable)
Extraction solvent (see recipe)
1% formic acid (see recipe)
0.1% formic acid (see recipe)
5 µM N‐lauroylsarcosine (solid from Sigma‐Aldrich) (see recipe)
Low‐lint paper towels (Kimwipes or equivalent)
50‐ml conical tubes
Vortex mixer
Analytical balance
Mortar and pestle (agate or ceramic)
Liquid nitrogen (handle with eye protection, insulating gloves, and lab coat)
Pipettes (1000, 200, 20, and 2 µl)
1.5‐ml microcentrifuge tubes (Eppendorf or equivalent)
Centrifuge (sized for microcentrifuge tubes)
−80°C freezer
LC‐MS vials (Agilent or equivalent, Thermo Fisher Scientific, cat. no. 6PSV9‐03FIVP)
Prepare plant tissue samples for LC‐MS/MS
1. After cutting the leaf sample from the plant, wipe mucilage and debris from the leaf surface using Kimwipes. Wash the leaf tissue in a 50‐ml conical tube using 3–5 exchanges of deionized water, mixing vigorously by shaking or vortexing during each wash. Decant the last wash and pat the leaf dry.
2. Tare an analytical balance to 0.0000 g with a 2‐ml microcentrifuge tube on it, then transfer the washed and dried leaf tissue to the tube and record the wet mass of the tissue to the nearest 0.1 mg. This value will be used later for data normalization.
3. Transfer the leaf tissue to a mortar (pre‐cooled if possible) and homogenize the tissue to a fine powder by grinding with a pestle under liquid nitrogen. Specifically, pour 3–5 ml of liquid nitrogen into the mortar on top of the tissue, gently break the tissue into small pieces, then grind aggressively.
4. As the sample thaws and the powder turns to a fine slurry, add 1–2 ml extraction solvent (90% methanol, 9.9% H2O, 0.1% formic acid (v/v), containing a small molecule internal standard (e.g., N‐lauroylsarcosine) at 50 µM). Continue grinding vigorously until no visible chunks remain and a pipettable solution is formed. Transfer as much of this solution as possible to the original microcentrifuge tube.
5. Centrifuge the sample for 2 min at 14,100 × g and room temperature (RT) to pellet cell debris and proteins. Transfer the supernatant to a new, labeled microcentrifuge tube. This can be stored at −80°C or prepared for immediate use in the next step.
The sample extract in 90% methanol will not freeze at −80°C.
Immediately before data collection
6. Dilute the methanolic extract 5‐ to 100‐fold (depending on desired concentrations/signal; start with 10‐fold) using 0.1% or 1.0% formic acid (v/v). Vortex to mix, centrifuge for 2 min at 14,100 × g and RT, then transfer the supernatant to LC‐MS vials for data collection (Basic Protocol 2). This dilution brings the final concentration of internal standard down to 5 µM (assuming a 10‐fold dilution).
A visual overview of the sample preparation process is shown in Figure 2.
Figure 2.

A visual overview of the sample preparation protocol. (A) A freshly collected Drosera capensis leaf, held using tweezers. (B) The leaf is wiped with a KimWipe, washed with deionized water, and patted dry. (C) The clean leaf in a pool of liquid nitrogen in a mortar and pestle. (D) Leaf residue after grinding using the mortar and pestle. (E) Leaf slurry after grinding the powdery residue with extraction solvent. (F) Leaf slurry in a 2‐ml microcentrifuge tube. (G) Leaf slurry supernatant after high‐speed centrifugation. (H) Leaf extract diluted with aqueous solvent. (I) Final LC‐MS sample, the supernatant from the high‐speed centrifugation of the sample in H.
Support Protocol 1. PREPARING A ‘MASTER MIX’ SAMPLE FOR ASSESSMENT OF LIQUID CHROMATOGRAPHY AND SENSITIVITY
When beginning a metabolomics project, we find it useful to prepare a set of calibration samples by pooling aliquots of all experimental samples in a specific manner, as described here. You can use this sample to optimize chromatographic methods before running the full sample set. You can also use it during data acquisition as a ‘master’ sample from which to efficiently collect MS2 data on most high‐abundance molecules present in the dataset. Furthermore, because this sample contains all molecules present across all samples, it can also serve as a high‐quality reference for retention time alignment during data processing.
The samples begin as methanolic extracts, which are diluted with (primarily aqueous) solvent just prior to analysis by LC‐MS (Basic Protocol 1); the pooled samples should be created from the undiluted extracts. Using the methanolic extracts for each replicate in a group, make a pooled sample using equal parts of each replicate from that group. Repeat this step for each group in the experiment, as applicable. Then, combine the pooled extracts to create a ‘master mix’ containing all the molecules present in the dataset. Serial dilution of this mixture with methanolic solvent containing the internal standard produces a set of calibration samples, which are then diluted with aqueous solvent in the same way as the experimental samples. Their internal standard concentration will match the real replicate samples at the same dilution; however, the analytes of interest will appear in these samples at progressively lower levels, enabling you to estimate limits of detection and choose the right dilution level for your samples of interest. Creating pooled samples after dilution with aqueous solvent (as in step 6 of Basic Protocol 1) is not recommended, because the resulting samples would freeze well above −80°C, hastening degradation.
Materials
Methanolic extracts prepared in Basic Protocol 1, step 5
Purified water (ideally Nanopure, but distilled or reverse osmosis is acceptable)
Pipettes (1000, 200, 20, and 2 µl)
1.5‐ml microcentrifuge tubes (Eppendorf or equivalent)
−80°C freezer
LC‐MS vials (Agilent or equivalent, Thermo Fisher Scientific cat. no. 6PSV9‐03FIVP)
Prepare ‘master mix’ sample for assessment of liquid chromatography and sensitivity
1. Using the methanolic extracts for all replicates in each experimental group, make a pooled sample using equal volumes from each replicate in the group.
For example, if group A had five replicates, 50 µL of each would be pooled to make 250 µL of the group A pooled sample.
2. Repeat step 1 for all experimental groups you will include in the analysis.
3. Combine all the pooled samples from the individual groups. You have now prepared a master mix containing all molecules that exist in your dataset.
Testing liquid chromatography columns, solvents, gradients, and flow rates using this sample will ensure that all molecules of interest can be separated.
Serial dilution
4. Use the methanolic extraction solvent (see recipe) or other solvent containing internal standards (e.g., N‐lauroylsarcosine) to serially dilute this master mix.
This dilution maintains the concentration of internal standards while diluting all other molecules in the sample. These samples can be used to evaluate the ideal concentration range for your data collection, but more importantly, they provide a realistic measurement for the limit of detection for each molecule in the samples.
5. We recommend a three‐ or four‐fold serial dilution. For example, for a four‐fold series: take 100 µl of methanolic master mix and dilute it with 300 µl of methanolic solvent containing the internal standard(s). Repeat this process using each newly diluted extract to create 16×, 64×, 256×, and 1024× dilutions.
These samples can be stored at −80°C with the other methanolic extracts.
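The cumulative dilution factors from the serial dilution in step 5 can be sketched as follows; `serial_dilution_factors` is a hypothetical helper used only to make the arithmetic explicit.

```python
def serial_dilution_factors(fold: int, steps: int) -> list:
    """Cumulative dilution factors for a serial dilution.

    fold: dilution per step (e.g., 4 for 100 µl sample + 300 µl diluent).
    steps: number of sequential dilutions performed.
    """
    return [fold ** i for i in range(1, steps + 1)]

# Four-fold series from the step above: five sequential 1:3 dilutions.
print(serial_dilution_factors(4, 5))  # [4, 16, 64, 256, 1024]
```

Because the diluent contains the internal standard at its working concentration, only the analytes of interest are reduced by these factors.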
Immediately before data collection
6. Dilute the methanolic extract 5‐ to 100‐fold (depending on desired concentrations/signal; start with 10‐fold) using 0.1% or 1.0% formic acid (v/v). Vortex to mix, centrifuge for 2 min at 14,100 × g and RT, and transfer the supernatant to LC‐MS vials for data collection (Basic Protocol 2).
A visual overview of the sample preparation process is shown in Figure 2.
Basic Protocol 2. LC‐MS/MS DATA COLLECTION
This protocol provides step‐by‐step instructions for interpreting the initial results obtained from an LC‐MS/MS experiment and optimizing conditions to achieve the best results possible for your samples. In LC‐MS/MS experiments, the MS1 channel provides information about all the observable molecules that elute from the column: which molecules are present (observed as their adducts, e.g., M+H, and their corresponding m/z ratios), when they elute (retention time), and their intensities (measured in counts, or counts * min for area). The MS2 channel(s), at least for data‐directed MS2 experiments, are where we target specific precursors (the m/z value corresponding to a specific adduct of a specific molecule) for isolation and fragmentation. These fragments serve as a ‘fingerprint’ allowing us to distinguish and often identify particular molecules. Fragmentation is achieved by applying a potential difference (collision energy) that accelerates the ions into a wall of collision gas. Waters instruments using collision‐induced dissociation can apply a static potential, a stepwise potential gradient, or a voltage ramp over the course of the scan, depending on the m/z of the precursor.
We have found voltage ramps to be particularly effective for untargeted experiments analyzing plant extracts, which can contain a very large array of chemical species, some of which are particularly resistant to fragmentation at lower voltages. A ramp over the course of the scan helps ensure that the widest and most diverse array of fragments is generated, whereas a static voltage can result in fewer but more intense fragments. Applying too large a potential difference can lead to over‐fragmentation, although in practice this usually occurs only at very high static potentials. We typically use ramps of 10–100 V with MS2 scan durations of 0.25–0.75 s. With a longer scan time, any one point of the ramp remains active for longer, which generally results in more signal; however, only one precursor is being fragmented at a time. This means we need to strike a balance between MS1 and MS2 scan times, especially for samples or gradients with chromatographically crowded regions containing many overlapping peaks. This situation is best avoided by optimizing the chromatography when possible (see Support Protocol 1).
Each sample needs to be run at least once. We require MS1 information about all the molecules in the sample set, which involves collecting scans across the range of retention times to identify where parent/precursor molecules elute. This gives us exact mass and retention time information; the former can also be used to determine molecular formula options. MS1 information can also be used in combination with ion identity networking to further narrow the list of molecular formula options for an unknown compound by using the pattern of adduct masses to determine the neutral mass of a compound (M), from putative adduct masses, e.g., M+H, M+Na, M‐H, and M+Cl. Further discussion of the type of data obtained and specific examples using a standard sample mixture can be found in the subsection “Essentials for obtaining high‐quality MS1 data”.
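As a rough illustration of the ion identity idea, the sketch below checks one common rule: two co‐eluting MS1 peaks separated by the Na/H mass difference (~21.9819 Da) can be interpreted as [M+H]+ and [M+Na]+ of the same neutral molecule M. The helper name `neutral_mass_candidates` and the tolerance are assumptions; real ion identity networking tools test many adduct and charge‐state combinations with peak‐shape correlation.

```python
PROTON = 1.007276   # mass of a proton, Da (approximate)
SODIUM = 22.989218  # mass of Na+ (Na minus one electron), Da (approximate)

def neutral_mass_candidates(mz_a, mz_b, tol=0.005):
    """If two co-eluting peaks differ by the Na/H mass difference,
    interpret them as [M+H]+ and [M+Na]+ and return the implied
    neutral mass M; otherwise return None.
    """
    lo, hi = sorted((mz_a, mz_b))
    if abs((hi - lo) - (SODIUM - PROTON)) <= tol:
        return lo - PROTON  # lo is [M+H]+, so M = m/z minus a proton
    return None
```

Agreement between the neutral masses implied by several adduct pairs is strong evidence for the assignment and sharply narrows the molecular formula search.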
Materials
LC‐MS vials (Agilent or equivalent, Thermo Fisher Scientific, cat. no. 6PSV9‐03FIVP)
Chromatography Solvent A (see recipe)
Chromatography Solvent B (see recipe)
Waters Xevo G2‐XS QTOF mass spectrometry instrument
Appropriate column (e.g., ACQUITY Premier CSH Phenyl‐Hexyl 1.7 µm 2.1 × 50 mm (Waters Corporation, SKU 186005406))
Collect LC‐MS/MS data
MS1
1. After an initial run on the instrument, identify the most intense point (the highest‐intensity scan) in the total ion chromatogram (TIC) of the sample. Within that scan, identify the base peak (the highest‐intensity peak) and its m/z value.
In an untargeted analysis, we are unlikely to know anything about the peak in question, but we do know the characteristics of our instrumentation.
We can look in adjacent scans around the most intense scan for this m/z value and determine our scan‐to‐scan m/z tolerance (in practice, this is best done using known standard compounds spiked into all your samples). Let's say we observe an m/z value of 323.0190, with an approximate scan‐to‐scan tolerance of 0.0025 Da.
2. To determine the signal for any one analyte in a sample, integrate the peak: add all the signals for the m/z value 323.0190 ± 0.0025 Da, for all scans corresponding to the retention time of the peak. To determine the bounds of the peak, we first need to generate an extracted ion chromatogram (EIC), showing the signal for m/z value 323.0190 ± 0.0025 Da over the entire chromatographic range, and then define the bounds for (at least) the peak of interest.
There can be multiple peaks in a single EIC. This usually indicates isomeric compounds sharing an exact mass.
3. Once we've defined the boundaries for our peak, we can integrate the signal from the selected scans, for our selected mass of interest, within the scan‐to‐scan m/z tolerance we've determined. Once we do this, we have ‘a peak’ with an associated m/z value, retention time, and both height and area (integrated signal) measurements.
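EIC generation and peak integration can be sketched in a few lines of Python. The data model (a list of scans, each holding a retention time and its centroided (m/z, intensity) pairs) and both function names are assumptions made for illustration; software such as MZMine implements far more robust versions of this logic.

```python
def extracted_ion_chromatogram(scans, target_mz, tol):
    """Build an EIC: for each scan, sum intensities within target_mz ± tol.

    scans: list of (retention_time_min, [(mz, intensity), ...]) tuples.
    Returns a list of (retention_time_min, summed_intensity) points.
    """
    eic = []
    for rt, peaks in scans:
        signal = sum(i for mz, i in peaks if abs(mz - target_mz) <= tol)
        eic.append((rt, signal))
    return eic

def integrate_peak(eic, rt_start, rt_end):
    """Trapezoidal area (counts * min) of the EIC between the peak bounds."""
    pts = [(rt, s) for rt, s in eic if rt_start <= rt <= rt_end]
    area = 0.0
    for (rt0, s0), (rt1, s1) in zip(pts, pts[1:]):
        area += 0.5 * (s0 + s1) * (rt1 - rt0)
    return area
```

Peak height is simply the maximum EIC intensity inside the same bounds; area is generally the more robust abundance measure because it averages over scan‐to‐scan noise.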
4. Repeat this process for every m/z value across every scan for the entire experiment. On completion, you should have thousands of peaks.
(Don't actually do this process manually. Use software like MZMine, as described in Basic Protocol 3 and Support Protocol 2.)
5. Any one analyte necessarily presents as at least one adduct (e.g., [M+H]+) and ideally has at least two higher‐mass isotopologue peaks (M+1 and M+2). Based on the exact mass of a putative monoisotopic peak, assign the associated isotopologue peaks to each adduct and search for other common potential adducts (e.g., [M+Na]+).
6. Depending on the analyte, it may also present as a multiply charged species (e.g., [M+2H]2+). A doubly charged species would present at approximately half the expected m/z value, with a spacing of approximately 0.5 Da between isotopologue peaks instead of 1 Da.
7.
Depending on the analyte, it may also present as a multimer species ([2M+H]+) in which case it would appear (for a dimer) at approximately double the expected m/z value, with normal spacing between isotopologue peaks.
When trying to associate peaks corresponding to different adducts, charge states, and multimers, we have the benefit of knowing that all related peaks must necessarily share a retention time. The apex of each associated peak is expected to be at the same retention time, all displaying maximum signal within the scan where the highest concentration of analyte elutes.
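The adduct, charge-state, and multimer relationships in steps 5 through 7 reduce to simple arithmetic on the neutral monoisotopic mass. The sketch below uses standard mass constants; the analyte mass is a made-up example, not a value from this protocol.

```python
# Sketch of steps 5-7: predict the m/z values at which one neutral analyte
# (monoisotopic mass M) may appear as different adducts, charge states,
# and multimers.

PROTON = 1.007276   # proton mass, Da
NA = 22.989218      # Na+ (sodium minus one electron), Da

def adduct_mz(M):
    return {
        "[M+H]+":   M + PROTON,
        "[M+Na]+":  M + NA,
        "[M+2H]2+": (M + 2 * PROTON) / 2,   # roughly half the [M+H]+ value
        "[2M+H]+":  2 * M + PROTON,         # dimer, roughly double
    }

M = 322.0117  # hypothetical neutral monoisotopic mass, for illustration
for name, mz in adduct_mz(M).items():
    print(f"{name}: {mz:.4f}")

# Isotopologue spacing also reports the charge state: ~1.0034 Da (one 13C)
# for z = 1, ~0.5017 Da for z = 2.
C13_SHIFT = 1.003355
for z in (1, 2):
    print(f"z={z}: isotopologue spacing {C13_SHIFT / z:.4f} Da")
```

Because all of these species derive from the same eluting analyte, their peaks must share a retention time, which is the key constraint used when grouping them.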
MS2
8.
After we've grouped peaks and assigned each to a putative analyte, we can begin associating MS2 spectra with each peak. For data‐dependent acquisition (DDA) experiments, this is straightforward because each MS2 scan is associated with a precursor mass at a specific retention time, allowing us to assign MS2 scans to analytes based on these tolerances.
9.
For data‐independent acquisition (DIA) data, we must again generate EICs, pick peaks, and correlate their peak shapes and retention times with features found in the MS1 scans.
After all of this, we only have information about the peaks in a single sample. To relate multiple samples to each other, as in an untargeted metabolomics experiment, we must then somehow relate peaks among samples.
10.
Align the ‘feature lists’ from each sample to one another; the list for each sample contains information about each peak in the sample, such as m/z value, retention time, isotope pattern, and MS2 fragmentation spectra, all of which can be used to match up and pair peaks across samples.
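The core of alignment is matching features that agree within m/z and retention-time tolerances. Real aligners (such as MZMine's Join Aligner) use scored matching across many samples; the greedy two-sample version below, with made-up tolerances and feature values, is only meant to illustrate the idea.

```python
# Sketch of step 10: align two feature lists, where each feature is a
# (m/z, retention time in min) tuple. Tolerances are illustrative.

MZ_TOL = 0.005   # Da
RT_TOL = 0.1     # min

sample_a = [(158.0367, 1.09), (195.0877, 3.05), (272.2226, 5.30)]
sample_b = [(158.0369, 1.12), (195.0875, 3.01), (301.1410, 6.20)]

def align(a, b, mz_tol=MZ_TOL, rt_tol=RT_TOL):
    aligned, unmatched_b = [], list(b)
    for mz_a, rt_a in a:
        match = next(((mz_b, rt_b) for mz_b, rt_b in unmatched_b
                      if abs(mz_b - mz_a) <= mz_tol
                      and abs(rt_b - rt_a) <= rt_tol), None)
        if match:
            unmatched_b.remove(match)
        aligned.append(((mz_a, rt_a), match))
    # Features present only in sample B become new rows in the table.
    aligned.extend((None, f) for f in unmatched_b)
    return aligned

for row in align(sample_a, sample_b):
    print(row)
```

Rows with a `None` entry correspond to features detected in only one sample; these are the rows that gap-filling steps later revisit in the raw data.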
11.
The end result should be a large feature table showing measured intensities for the molecules found in your samples. To the extent possible, we will match these to structures using software that is freely available to academic labs (Basic Protocol 3).
Basic Protocol 3. DATA ANALYSIS USING THE SOFTWARE MSConvert, MZMine, AND SIRIUS
After completing Basic Protocol 2, most of the work is complete; however, we are left with a large feature table showing measured intensities for potentially thousands of molecules across any number of samples. How do we assign structures to these molecules? We could manually search databases, such as HMDB, by inputting exact mass and MS2 fragment information, and hope to find matching hits. Instead of doing this manually, we recommend relying on specialized programs to consistently and automatically perform these processing steps. In this section, we provide a start‐to‐finish tutorial on the analysis of high‐resolution untargeted LC‐MS/MS data using a combination of three open‐access (for academic labs) software packages: MSConvert, MZMine, and SIRIUS. Different instrument manufacturers and generations of instruments may refer to a specific mass spectrometry experiment by different names. This protocol is written in reference to the experiment names corresponding to a Waters Xevo G2‐XS QTOF instrument. The result of this protocol should be a list of molecular structures identified at varying levels of confidence.
Materials
LC‐MS/MS data collected as in Basic Protocol 2
Computer running Windows, Linux, or an appropriate emulator
Software: MSConvert, MZMine, and SIRIUS (see Internet Resources)
Internet access
Analyze LC‐MS data to identify molecular species
1.
Begin by locating the raw data files for the samples in question. If you are unsure where to locate these files on the instrument workstation, ask your facility director.
2.
Move these raw data files to a processing computer with modern hardware.
MSConvert, as of this writing, appears limited to Windows computers. MZMine and SIRIUS function on Windows and Linux machines. mzML files are an open‐access file type developed by the mass spectrometry community for improved data sharing and processing.
3.
Open MSConvert, which is available as part of the ProteoWizard software package. An example of the GUI is shown in Figure 3. Use the Browse button near the top of the menu to locate your RAW files as input to the program. Use the lower browse button to select an output location (we suggest a dedicated mzML folder for organization). The ‘Options’ panel should not require any changes.
Figure 3.

The MSConvert GUI, shown with options selected appropriate for the conversion of RAW data files to mzML files, for DIA/MSe experiments with a low and high energy channel and excluding LockMass scans, or for FDDA experiments where a single MS2 scan is collected for every MS1 survey scan.
4.
In the ‘Filters’ section, select parameters based on your mass spec experiments. Using the ‘Subset’ filter, select for specific MS levels (MS1, MS2, etc.) and scan events/channels corresponding to your experimental parameters. For MS1‐only experiments, where the first channel consists of a series of MS1 scans throughout the chromatography experiment, and the second channel is the LockMass channel, we can set the target MS level to 1–1 and the scan event filter to 1–1. For FDDA experiments, containing an MS1 channel, an MS2 channel, and a LockMass channel, select for MS level 1–2, and scan event 1–2, again excluding the LockMass channel.
FDDA experiments can easily have multiple MS2 channels, one for each precursor of interest selected from an MS1 scan; adjust the scan event filter accordingly. For DIA experiments, select the low and high energy channels, where the high energy channel is designated MS2, so you would again select for MS level 1–2 and scan event 1–2.
Any adjustments to the filter must be followed by the ‘add’ button, indicated by the box in the lower right portion of the GUI.
5.
After adjusting and updating the final filter/processing queue and adding the input RAW files, click ‘Start’ to begin the conversion of all selected RAW files to mzML files based on the selected filtering parameters.
Before pressing ‘Start’, you may wish to adjust the number of files converted in parallel, if the computer has the cores available.
6.
Open MZMine. When using version 4.6.1, you will see something that resembles Figure 4, although no files will be loaded yet.
Before continuing, you may wish to rename your mzML files to remove unnecessary information, such as collection date and user ID information, to make the file names shorter and more comprehensible during processing.
Figure 4.

MZMine version 4.6.1, initial GUI view in dark mode (accessible in the settings). Loaded mzML files are shown on the left, which is also where you can click and drag mzML files to load them into the software.
7.
To load your mzML files into MZMine, select and drag them into the left side of the GUI. You can use the ‘Task’ menu to monitor their progress. After an mzML file finishes loading, it should appear in a list on the left side of the GUI, where you dragged the mzML file.
8.
Once the files are loaded, you have two main routes to process your data: 1) A previously prepared batch file, which runs a series of processing commands in a specified order, using previously saved parameters (Support Protocol 2); and 2) Manually running all of the processing commands individually.
You can use the Processing Wizard to guide you through the creation of your own batch file. This is available under ’mzWizard’ at the top. The second option is only useful if you are trying to diagnose problems with a specific step in your batch file or are checking something in a specific processing step.
9.
Before processing using a batch file, we recommend assigning appropriate metadata values to your samples under ‘Project’ and ‘Sample Metadata’. Remove the automatically generated columns and add a column with an appropriate name (e.g., Group, Time, etc.) in the ‘Text’ field, as shown in Figure 5.
You can export the metadata file to reuse later.
Figure 5.

Metadata window in MZMine, showing mzML files labeled by their sample group (A, B, or C).
10.
To process samples using a batch file, access ‘Project’ and ‘Batch mode’. This will open a window where you can load a previously used batch file or create a new one. Figure 6 illustrates an example batch file, comprising several processing steps.
The settings for each step are summarized in Support Protocol 2 and further detailed in the Commentary subsection, “Using the MZMine batch file.”
Figure 6.

MZMine batch file with processing commands shown.
11.
After the batch file settings are set, save the batch file. You may wish to keep or remove all intermediate feature lists as desired.
12.
Click ‘OK’. The batch file will start, using all of the selected mzML files. The outcome at this point is a processed data set that is ready for statistical analysis and/or identification of molecular structures.
After the batch file is complete, your outputs will be exported to the file paths indicated by the last two steps of the batch file.
Support Protocol 2. USING THE MZMine BATCH FILE
Make a Custom MZMine Batch File for Your Project
1.
To process samples using a batch file, access ‘Project’ and ‘Batch mode’. This will open a window where you can load a previously used batch file or create a new one.
Using a batch file is very useful for speeding up repetitive processing tasks and for ensuring reproducibility among samples.
2.
Set the settings corresponding to how you want the data to be processed for this project. Figure 6 illustrates an example batch file, comprising several processing steps.
The purpose of each step is described briefly here. More detailed suggestions are given in the Commentary subsection “Using the MZMine batch file.”
Mass Detection: Used for the detection of m/z values and intensities in the MS1 channel.
Chromatogram Builder: This step builds extracted ion chromatograms from all the peaks detected in Mass Detection.
Smoothing: This step smooths the chromatograms generated by the Chromatogram Builder.
Local minimum feature resolver: This step picks peaks out of the smoothed extracted ion chromatograms.
13C isotope filter: This step finds isotopologue peaks explainable by 13C.
Isotopic peaks filter: This step searches for isotopologue signals from non‐13C isotopes.
Group MS2 scans with features: This step assigns MS2 scans to particular features in the chromatogram.
Retention time correction: This corrects for small variations in sample‐to‐sample retention time.
Join Aligner: This step combines feature lists from separate samples into a single big list.
Feature finder (multithreaded): In this step, the raw data is double‐checked for ‘missed’ peaks based on the new aligned feature list produced by the join aligner.
Duplicate peak filter: This step combines features that appear to be better described as a single feature.
Feature list blank subtraction: This step can be used when you have mzML files for blanks run using the same conditions for both chromatography and mass spectrometry as the rest of your samples.
Feature list rows filter: This step filters out noise or spurious peaks.
Export for statistics (Metaboanalyst): This option exports a CSV file that can be easily uploaded to Metaboanalyst, an online open‐access software for statistical analyses.
Export for SIRIUS: This exports a .mgf file, which can be uploaded to SIRIUS to identify the structures of our analytes.
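As one concrete example from the list above, the blank subtraction step can be pictured as a simple fold-change filter. The 3× threshold below is a common default chosen for illustration, not a value mandated by MZMine, and the intensities are made up.

```python
# Sketch of "Feature list blank subtraction": drop features whose sample
# intensity is not sufficiently above the corresponding blank intensity.

FOLD = 3.0  # illustrative sample-to-blank fold-change threshold

features = {
    # feature id: (mean sample intensity, mean blank intensity)
    "f1": (90000.0, 500.0),    # strong in samples, trace in blanks: keep
    "f2": (1200.0, 1100.0),    # similar in blanks: likely background, drop
    "f3": (4000.0, 0.0),       # absent from blanks: keep
}

kept = {fid for fid, (samp, blank) in features.items()
        if blank == 0.0 or samp / blank >= FOLD}
print(sorted(kept))  # ['f1', 'f3']
```

This only works when blanks are run under the same chromatography and mass spectrometry conditions as the samples, as noted above.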
3.
After the batch file settings are set, save the batch file. You may wish to keep or remove all intermediate feature lists as desired.
4.
If you click ‘OK’, the batch file will start, using all of the mzML files loaded and selected based on the file use settings for each batch file step. After the batch file is complete, your outputs will be exported to the file paths indicated by the last two steps of the batch file.
5.
At this point, the outcome is one or more uniformly processed data sets that can be used with downstream analysis software (Basic Protocol 3).
Reagents and Solutions
Prepare all solutions using LCMS‐grade water, unless otherwise indicated.
Chromatography Solvent A
4 L bottle of LCMS‐grade water (Fisher Scientific, cat. no. W6‐4)
4 ml high‐purity formic acid (99%, Thermo Scientific, cat. no. 270480010)
Add formic acid to water and mix well. This will produce a 0.1% solution (v/v).
Store up to 1 year at room temperature.
Chromatography Solvent B
LCMS‐grade acetonitrile (Sigma‐Aldrich, cat. no. A955‐4)
Store up to 1 year at room temperature.
Extraction Solvent
For 1 L extraction solvent, mix:
900 ml LCMS‐grade MeOH (Fisher Scientific, cat. no. 047192‐K2)
94 ml LCMS‐grade water (Fisher Scientific, cat. no. W6‐4)
1 ml high‐purity formic acid (99%, Thermo Scientific, cat. no. 270480010)
5 ml of 10 mM N‐Lauroylsarcosine stock solution
This results in a solution of 90% (v/v) MeOH, 9.9% water, 0.1% formic acid, and 50 µM N‐Lauroylsarcosine.
Use immediately. Do not store.
Formic Acid, 1%
1 L of LCMS‐grade water (Fisher Scientific, cat. no. W6‐4)
10 ml high‐purity formic acid (99%, Thermo Scientific, cat. no. 270480010)
Add formic acid to water and mix well. This will produce a 1% solution (v/v).
Store up to 1 year at room temperature.
Formic acid, 0.1%
See Chromatography Solvent A
N‐Lauroylsarcosine Stock, 10 mM
29.3 mg high‐purity N‐Lauroylsarcosine (sodium salt is fine; Millipore Sigma, cat. no. 61743)
10 ml LCMS‐grade water (Fisher Scientific, cat. no. W6‐4)
Mix N‐Lauroylsarcosine with water to form 10 mM stock solution. Dilute as desired into solvent, e.g., 1:1000 (v/v) for 10 µM, 1:200 for 50 µM.
Use immediately. Do not store.
Commentary
Critical Parameters
1. All of the tools referenced in this publication were used on computers running Windows operating systems. Many of the tools are compatible with Linux; however, some are not compatible with MacOS. It is recommended that anyone collecting and processing this type of mass spectrometry data have access to a single processing computer capable of running MSConvert, MZMine, SIRIUS, and R, as well as stable internet access for online use of MetaboAnalyst.
2. Run at least one test sample prior to beginning large sample queues to verify required levels of chromatographic separation, mass accuracy, and sensitivity. Compare internal standard peaks across test samples and run technical replicates of a pooled sample to confirm consistency between injections.
3. When working with data processing software, work with local copies of data rather than remote files (such as online‐only cloud storage files), and ensure copious amounts of spare storage capacity on the drives you are using to store and process your data.
4. When using MSConvert to transform raw data files to mzML files, be aware that this paper is written in the context of Waters instruments. Other manufacturers seem to share similar labeling/assignment of scan events. However, if the resulting mzML files produce errors or odd‐looking data when loaded into MZMine or SIRIUS, we suggest first examining the raw files and their assignment to different channels or scan events, as shown in MZMine.
5. When using MSConvert to convert Waters MSe experiment RAW files into mzML files, the absence of a third (LockMass) channel (channels one and two being the low and high energy data collection channels) causes the conversion to ‘fail’, in that the two channels are merged into a single MS1 scan channel when loaded into MZMine. Collecting data with LockMass, which is recommended for mass accuracy anyway, avoids this issue.
6. If you have limited processing power or are only interested in identifying compounds with known, reported structures, you can use the Database Search feature in SIRIUS, rather than De Novo or Bottom Up molecular formula generation strategies.
The pool of reported molecular formulas is much smaller than the pool of theoretical formulas, and any reported structure should necessarily have a known molecular formula. The database search takes significantly less time to complete, especially for large datasets containing many complex molecules.
Troubleshooting Table
Common problems with the protocols, their causes, and potential solutions are summarized in Table 1.
Table 1.
Troubleshooting Guide for LC‐MS/MS
| Problem | Possible cause | Solution |
|---|---|---|
| LC system clogs or overpressurizes | Protein was not removed and is aggregating | Precipitate proteins using a denaturant such as methanol, and filter the solution |
| Weak signal | Sample is not concentrated enough | Extract for longer or concentrate sample using centrivap (a centrifuge with an applied vacuum to rapidly remove solvent from many samples at once) |
| Poor LC separation | Wrong stationary phase | Try a different column |
| Poor LC separation | Flow is too fast or solvent gradient is too steep | Use a longer, shallower gradient |
| Poor LC separation | Column is overloaded | Use less sample volume |
| Poor mass accuracy | LockMass not used | Check that the LockMass standard is being introduced to the sample |
Statistical Analysis
It is difficult to perform a formal power analysis at the beginning of an untargeted metabolomics study because a major goal is the discovery of unknown metabolites. Biological samples are potentially highly heterogeneous and can depend on variables that are not intentionally changed as part of the study. Therefore, we recommend using as many biological replicates as practical (at least 5) and minimizing changes to variables not under investigation. For plants, this could include sampling at the same time of day and when the plants have similar light and hydration levels.
Significant differences in metabolite levels can be identified using a t‐test (Miller & Miller, 2018), with or without the Bonferroni correction (Haynes, 2013), which accounts for the increased false positive rate when attempting to observe rare events. For multiple groups of samples, one‐way analysis of variance (ANOVA) can be used to determine whether the means of different groups are significantly different from each other, testing the null hypothesis that all the samples are drawn from a group with the same mean value with respect to a particular measurement. In the context of metabolomics, it is often desirable to include corrections for the large number of variables (Hassall & Mead, 2018; Pérez‐Cova et al., 2022). Making sense of metabolomics data often requires the use of principal component analysis or hierarchical clustering to visualize relationships among different sample groups (Ren et al., 2015). Clustering the variables prior to statistical analysis of smaller groups can also be used to control the false positive rate (Blaise et al., 2009). These results can be put in context with biochemical knowledge using software such as MetaboAnalyst (metaboanalyst.ca, Pang et al., 2024) or the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al., 2025).
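The Bonferroni correction described above amounts to a single division: with many features tested at a nominal alpha of 0.05, each individual p-value is compared against alpha divided by the number of tests. The p-values below are made up for illustration.

```python
# Sketch of the Bonferroni correction for multiple comparisons across a
# feature table. With thousands of features, uncorrected testing at
# alpha = 0.05 would yield many false positives.

ALPHA = 0.05
p_values = {"feature_1": 0.00001, "feature_2": 0.004, "feature_3": 0.03}
n_tests = len(p_values)

# Bonferroni: compare each p-value to alpha / n (equivalently, multiply
# each p-value by n and compare to alpha).
significant = {f for f, p in p_values.items() if p <= ALPHA / n_tests}
print(sorted(significant))  # ['feature_1', 'feature_2']
```

Note that with realistic feature counts (thousands of tests) Bonferroni becomes very conservative, which is one reason the clustering-based strategies cited above are attractive.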
Understanding Results
Optimizing the chromatography and ionization steps
For this demonstration, we use three analyte molecules throughout the examples, with structures shown in Figure 7: 3‐chloro‐4‐methoxyaniline (CMA), caffeine, and N‐lauroylsarcosine (N‐Laur). These are representative small molecules that might be found in an LC‐MS experiment on plant tissues. The three molecules chosen here vary significantly in structure, with CMA being the most polar of the three and the only molecule containing a free amine group. Caffeine sits in the middle, being slightly larger than CMA, with many amide bonds and heteroatoms. N‐Laur is the largest of the three, and while it bears both amide and carboxylic acid moieties, it also has a long aliphatic hydrocarbon chain, making it overall the least polar of the three molecules. Under these conditions, using a gradient of increasingly high concentration of organic solvent, we would quickly elute CMA, followed by caffeine, and finally N‐Laur. However, if we ran the sample again, but held the gradient constant (isocratic) at 0.1% FA/H2O (our solvent ‘A’), although we would again elute CMA very quickly, caffeine and N‐Laur are unlikely to elute from the column on any practical timescale, regardless of how much of solvent A we flow through. If we knew at what %B caffeine begins to elute from the column on a practical timescale, we could even run isocratically near that target percentage to ensure that we elute caffeine but not N‐Laur. However, this is generally impractical; rapid but effective separation is much more easily accomplished by using a time‐dependent gradient of increasing %B, where a gradient can be optimized to provide the desired level of separation among a large number of analytes. Figure 8 shows the total ion chromatogram (TIC) for a 10‐min gradient used to separate the three molecules.
Figure 7.

The structures of the three molecules used as examples in this paper, from left to right: 1) 3‐chloro‐4‐methoxyaniline (CMA), 2) caffeine, and 3) N‐lauroylsarcosine (N‐Laur). Each shows its name and exact mass, as well as the m/z value for its corresponding [M+H]+ adduct.
Figure 8.

The total ion chromatogram, or total ion current, of a 10‐min gradient used to separate a mixture of three standards: CMA, caffeine, and N‐lauroylsarcosine. The TIC is arguably the simplest representation of an LC‐MS sample, as it plots the total intensity of all ions combined from a single scan. That intensity is plotted against the time associated with the scan. We can see the peaks corresponding to the [M+H]+ adducts for (from left to right, highlighted by blue rectangles) CMA (m/z = 158.0367, RT = 1.09 min), caffeine (m/z = 195.0877, RT = 3.05 min), and N‐Laur (m/z = 272.2226, RT = 5.30 min) quite clearly in the TIC. For more realistic samples, especially when the gradient is long and we are separating multiple analytes, it is challenging to distinguish useful details about the analytes by examining the TIC alone.
Essentials for obtaining high‐quality MS1 data
Although chromatographic separation of analytes in LC‐MS is a crucial factor, it is also essential to optimize mass accuracy and resolution in the accompanying mass spectrometry experiment. The mass spectrometry component is in‐line with the LC system, so everything that flows through and elutes from the column passes into the mass spectrometer. Electrospray ionization is the most common form of ionization used with LC‐MS. This ionization method utilizes a high voltage applied to the sample, resulting in the desolvation and ionization of analytes, typically without significant fragmentation. This can be performed in positive or negative mode, effectively filtering for either protonated or deprotonated ions. The accuracy (how close the measured value is to the true value) and precision (how much the value varies when the experiment is repeated) of this measurement are what differentiate low‐resolution from high‐resolution mass spectrometry. Quadrupole mass analyzers can be used as detectors for low‐resolution experiments, but they are also commonly employed as low‐resolution mass filters at the front of a high‐resolution mass analyzer, such as a time‐of‐flight detector. When functioning as a low‐resolution detector, they scan through a range of m/z values, allowing analytes with a target m/z value through at any one moment while excluding all other m/z ratios, with unit‐mass resolution (target value of 0.5 Da). This resolution can be increased to a point by tightening the window. However, to cover the same mass range, a quadrupole operating with double the resolution (i.e., a smaller isolation window) would require either twice as long to collect the same amount of signal for a target ion or collect half as much signal in the same amount of time, decreasing sensitivity in order to gain higher resolution. This is where it becomes important to remember that these mass spectrometry experiments must be performed quickly relative to the liquid chromatography.
We can determine the current chromatographic and mass performance of our instrumentation by injecting our three standards multiple consecutive times, using the same mass spectrometry experiment, the same column, and the same LC gradient, and examining a few key properties. 1) How much does the retention time of our compounds change between injections? 2) Within an injection, what is the observed m/z ratio of each analyte, and how much does it change from scan to scan? 3) How much does the average observed m/z value for a standard compare to the expected value? If, from sample to sample, the retention times of each of the three analytes vary by less than 0.1 min, then the chromatography is considered precise. For high‐accuracy mass spectrometry, we can generally accept accuracy within 10 ppm, which for the mass range of 200–2000 Da (covering a range encompassing all but the smallest of organic molecules, up to large plant pigment molecules and lipids) translates to 0.002 to 0.02 Da, although we strive to stick to the lower end of that range regardless of the analyte in question. We generally observe a relatively constant mass error of within 0.0025 Da or less across a range of 50–2000 Da, which is acceptable for untargeted unknown quantitation and identification.
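The ppm arithmetic above is worth making explicit, since a fixed ppm tolerance corresponds to a different absolute (Da) window at each m/z. The caffeine values below echo the text; the observed mass is a made-up example.

```python
# Sketch of the mass-accuracy arithmetic: converting between ppm error
# and absolute mass error in Da.

def ppm_error(observed, expected):
    """Signed mass error in parts per million."""
    return (observed - expected) / expected * 1e6

def ppm_to_da(mz, ppm=10.0):
    """Absolute tolerance window (Da) for a given ppm at a given m/z."""
    return mz * ppm / 1e6

# 10 ppm across the 200-2000 Da range spans 0.002-0.02 Da, as in the text.
print(ppm_to_da(200.0))    # 0.002
print(ppm_to_da(2000.0))   # 0.02

# Hypothetical example: caffeine [M+H]+ observed at 195.0880 vs the
# expected 195.0877 corresponds to roughly 1.5 ppm error.
print(round(ppm_error(195.0880, 195.0877), 2))
```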
Most chromatography for LC‐MS is carried out using reverse‐phase column chemistry (Boyes & Dong, 2018), where the stationary phase is silica functionalized to bear an abundance of hydrophobic functional groups. The primary and most abundant example of this is a C18 column, where silica is functionalized to bear linear, aliphatic chains of eighteen methylene units. C18 column chemistry is popular because it is generally usable for the separation of many different molecule classes. Lipids, which bear their own long carbon chains, bind especially well to this column chemistry, aiding greatly in the separation of lipids from one another. The greater the separation required among highly similar molecules, the more the analytes must interact (differentially) with the column chemistry. The separation resulting from interaction with the column itself is typically combined with a gradient of mobile phase solvents. Elution of analytes from the column usually begins with a high concentration of highly polar solvent, usually water, with a low concentration of weak organic acid, such as formic or acetic acid, added, and is mixed with an increasing amount of less‐polar organic solvent (which must be miscible with water) as the sample flows through the column. Figure 9 illustrates the specific gradient used to separate the three example molecules over 10 min. As the composition of the mobile phase shifts from primarily aqueous to increasingly nonpolar, analytes under the flow of this solvent mixture begin to have differentially more favorable interactions with the mobile phase, depending on their own physicochemical parameters. As a given analyte begins to have more interactions with the mobile phase, it spends less time interacting with the stationary phase and begins moving through the column at a higher rate. As each chromatographic peak of interest elutes from the column, we have a relatively small window of time to collect information about that peak.
Figure 9.

An example gradient (top) of the percentage of acetonitrile (%B) outlined as it changes over the course of a 10‐min chromatographic run, separating CMA, caffeine, and N‐Laur. We observe that CMA elutes very early, at 1.07 min (0%B), followed by caffeine at 3.05 min (40%B), and N‐Laur at 5.30 min (86%B). CMA elutes during the isocratic wash period, where only 0.1% FA is flowing through the column with no ACN. Caffeine elutes when the gradient is ramping from 0% to 100% ACN, at approximately 40% ACN. Depending on the flow rate, this means caffeine requires at most 40% ACN to begin moving through the column, although a faster flow rate or a longer duration gradient would provide a more accurate measurement. This is because caffeine may begin moving through the column at a lower %B (such as 35%), but we only begin collecting evidence for its existence after it leaves the column. The time lag between an analyte beginning to move through a column and its detection by the mass spectrometer after elution can be an important consideration when using fast chromatography methods.
‘Normal’ peaks in reverse‐phase UPLC, assuming flow rates of 0.1‐0.5 ml/min, can be expected to have a chromatographic width‐at‐half‐height, or ‘full width half max’ (FWHM), of approximately 0.1 min, or 6 s. This can be much larger, but under typical experimental conditions, that would usually indicate column overload or a chromatographic aberration. During these 6 s, we expect an optimally shaped peak to increase in intensity for the first 3 s after initial detection, followed by a mirrored decrease in intensity over the remaining 3 s. The resolution of the retention time (the time associated with the apex of a chromatographic peak) domain is approximately defined by the length of each scan, where the intensity of the molecule of interest is measured in sequential scans, within the error of the mass measurement characteristic for the instrument. Using caffeine as an example and running an injection under the conditions specified in 4 (namely, a 10‐min gradient), we observe a retention time of 3.05 min. In this experiment, we are taking a 0.25‐s scan every 0.5 s. The scan associated with this retention time is scan number 330 in the experiment, which ranges from 1 to 1085 scans; multiplying 1085 by the length of each scan (0.25 s, or 250 ms) and then by 2 (because one 0.25‐s scan is collected every 0.5 s) gives the approximate experiment duration (10 min).
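The scan-timing arithmetic above can be checked directly. The simple model below assumes a uniform 0.5-s scan interval and ignores any inter-scan overhead, which is why the computed times are approximate.

```python
# Sketch of the scan-timing arithmetic: one 0.25-s scan is collected
# every 0.5 s, for 1085 scans total.

SCAN_INTERVAL_S = 0.5    # time between scan starts (0.25-s scan every 0.5 s)
N_SCANS = 1085

run_length_min = N_SCANS * SCAN_INTERVAL_S / 60
print(round(run_length_min, 2))  # ~9.04 min, i.e., roughly the 10-min gradient

def scan_to_rt_min(scan_number, interval_s=SCAN_INTERVAL_S):
    """Approximate retention time (min) for a given scan index."""
    return scan_number * interval_s / 60

# For the caffeine apex at scan 330 this gives ~2.75 min; the difference
# from the observed 3.05 min reflects timing details (e.g., inter-scan
# overhead) not captured by this uniform-interval model.
print(round(scan_to_rt_min(330), 2))
```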
This experiment was performed on a high‐resolution Q‐TOF instrument, using electrospray ionization, and the mass measurements are based on both a base instrument calibration and a ‘Lock Mass’ correction. The instrument calibration is defined by measuring the observed m/z values for several known standard analytes across the mass range of interest (50–2000 Da here) and fitting a curve to minimize the error of each observed mass from its expected/theoretical/calculated value. Organic ion clusters, such as sodium formate, or inorganic ion clusters, such as (cesium or potassium) iodide, can be used to generate a large number of peaks over a wide mass range (Fig. 10), making their use as calibrant standards very convenient.
Figure 10.

A calibration curve of observed (top spectrum) vs expected (middle spectrum) mass error (bottom plot) for a range of m/z values, corresponding to gas‐phase clusters of sodium formate ([NaCO2H]nNa+). The peak in the top spectrum at m/z 523.2 is an unrelated contaminant peak. As the cluster size and corresponding exact mass increase, the signal intensity decreases. This places a practical limit on the mass range that can be accurately calibrated using sodium formate clusters—approximately 2000 Da. Alternative calibrants such as cesium or potassium iodide can be used for larger mass ranges, but these trade an expanded mass range for lower peak density, resulting in fewer points across a given range. Polymer series, such as polyethylene glycol (PEG), can also be used; however, they tend to quickly lose signal, primarily due to the formation of multiply charged species rather than increasingly large singly charged peaks. For practical use, 50‐2000 Da covers the overwhelming majority of singly and multiply charged species a user is likely to observe in this type of metabolomics experiment. Larger mass ranges are usually only required for very high molecular weight, multiply charged biomolecules, such as large proteins or protein complexes.
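The calibration idea, fitting a curve that maps observed masses onto expected masses of a known cluster series, can be sketched as follows. The sodium formate constants are approximate monoisotopic values, the "observed" masses are simulated with an artificial linear error, and real instruments use vendor calibration routines with higher-order fits; this is illustration only.

```python
# Sketch of calibration: generate theoretical sodium formate cluster
# masses [(HCO2Na)n + Na]+, simulate a small systematic error, and fit a
# linear correction by ordinary least squares.

NAFORMATE = 67.98744   # HCO2Na monoisotopic mass, Da (approximate)
NA_CATION = 22.98922   # Na+ (Na minus an electron), Da (approximate)

expected = [n * NAFORMATE + NA_CATION for n in range(1, 30)]
# Simulated observation: a constant offset plus a mass-proportional error.
observed = [m * 1.000002 + 0.001 for m in expected]

def linear_fit(xs, ys):
    """Ordinary least squares fit y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

a, b = linear_fit(observed, expected)   # maps observed -> corrected mass
corrected = [a * m + b for m in observed]
max_err = max(abs(c - e) for c, e in zip(corrected, expected))
print(max_err)  # essentially zero for this idealized simulated error
```

Because the simulated error here is perfectly linear, the fit recovers it exactly; real residuals are what Figure 10's bottom plot displays.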
The observed m/z values are determined by at least one scan, but can be an average of many scans, and are expressed as single values (e.g., 250.2345 m/z). Mass spectrometers don't initially generate centroided data scans, where spectra have peaks defined by a single intensity‐m/z value pair. In contrast, profile or continuum data is generated first, where each scan/spectrum contains peaks defined by multiple points across the peak, like in chromatography. When centroiding profile‐mode data, the process is essentially the same as defining chromatographic peaks: find the apex and the bounds of the peak of interest and determine a weighted average, which, for ideal peaks, is just the value of the apex of the peak in the m/z domain. When centroiding data, we lose information about the shape of each peak in each scan, but we consider this a worthwhile tradeoff. Centroided data are far more practical to work with because the spectral information is condensed into a more usable and searchable format, and the peak shape is not essential for most practical purposes. When fitting a curve to the data for calibration, multiple scans can be collected to better sample the scan‐to‐scan consistency of each point in the calibration curve. Scan‐to‐scan variation must necessarily be minimized in the final reported data for a high‐accuracy instrument. Although each instrument has an inherent resolution limit, we can correct for fairly large errors using a ‘Lock Mass’ calibrant.
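Centroiding, as described above, reduces to an intensity-weighted average across the profile-mode points of a peak. The profile values below are synthetic, shaped around the caffeine [M+H]+ m/z for familiarity.

```python
# Sketch of centroiding: collapse a profile-mode peak (several
# (m/z, intensity) points across the peak) into one m/z value.

profile_peak = [
    (195.0869, 1000.0),
    (195.0873, 6000.0),
    (195.0877, 9500.0),   # apex
    (195.0881, 5800.0),
    (195.0885, 900.0),
]

def centroid(points):
    """Intensity-weighted mean m/z of a profile peak."""
    total = sum(i for _, i in points)
    return sum(mz * i for mz, i in points) / total

print(round(centroid(profile_peak), 4))  # very close to the 195.0877 apex
```

For a symmetric peak the centroid coincides with the apex; asymmetry shifts it slightly, which is one reason peak-shape information is discarded only after centroiding is done carefully.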
For Waters instruments, the LockMass calibrant is usually leucine enkephalin (LeuEnk), a highly stable peptide with an exact mass of 555.2693 Da. It is used as an internal standard, infused slowly and constantly into the source throughout the experiment via a dedicated LockMass capillary line. A single LockMass scan is collected only every 10–60 s; the LockMass spray is otherwise blocked by a rotating metal baffle, which intercepts either the LockMass or the main LC flow at any given moment, allowing the mass of LeuEnk to be measured at consistent points during the experiment. The LockMass scans are centroided (the weighted average of each profile‐mode peak is determined and assigned a single m/z value), and the measured m/z value for LeuEnk is compared to the expected value (556.2766 for the [M+H]+ adduct in positive mode or 554.2620 for the [M‐H]– adduct in negative mode). This is illustrated in Figure 11. Waters uses 556.2771 and 554.2615, the difference being the mass of an electron (0.0005 Da), a discrepancy that is also present in their calibration series values for sodium formate clusters. The difference between the observed and expected mass is calculated and typically averaged over several scans (three scans would cover 3 × 10–60 s of time), and a rolling‐average correction is determined for the experiment. This correction is applied to all scans, shifting the spectra up or down in m/z across the board. LockMass correction, combined with a high‐quality calibration curve, enables Q‐TOF instruments to measure m/z ratios accurately and precisely, with errors of 0.0001 Da or lower under optimal conditions. A freshly calibrated instrument may collect accurate measurements without a LockMass for a short while if ambient conditions remain stable; however, shifting conditions will quickly produce noticeable deviations in accuracy.
For this reason, we recommend that LockMass data be collected and applied to all measurements to ensure scan‐to‐scan and sample‐to‐sample accuracy, even for samples collected days or weeks apart.
Figure 11.

Use of the LockMass standard and centroiding of the data. Centroiding takes the weighted average of a peak (based on the points defining the peak in the m/z and intensity domains), shown here as the left vertical line, which is also the 'observed' LockMass m/z value. The right vertical line is the 'expected' LockMass m/z value; the horizontal distance between them is the correction.
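The rolling‐average LockMass correction can be illustrated with a short Python sketch. The values below are hypothetical, and the sketch applies a constant m/z shift as described above; real acquisition software includes further refinements.

```python
LEU_ENK_POS = 556.2766  # expected [M+H]+ m/z of leucine enkephalin


def rolling_corrections(observed, window=3):
    """Rolling-average error (observed - expected) over recent LockMass scans."""
    errors = [obs - LEU_ENK_POS for obs in observed]
    out = []
    for i in range(len(errors)):
        recent = errors[max(0, i - window + 1):i + 1]
        out.append(sum(recent) / len(recent))
    return out


def correct_scan(mzs, correction):
    """Shift every m/z in a scan by the current LockMass correction."""
    return [m - correction for m in mzs]


# Three LockMass scans each reading 0.0024 Da high imply a -0.0024 Da shift:
corrections = rolling_corrections([556.2790, 556.2790, 556.2790])
fixed = correct_scan([200.0024, 556.2790], corrections[-1])
```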
Returning to the example data, as caffeine elutes from the column, the solvent flow is directed into the source. The solvent exits through a high‐voltage capillary into the source enclosure, forming what is referred to as a Taylor cone, from which an aerosol of charged droplets is emitted (Boyes & Dong, 2018; Gañán‐Calvo et al., 2018). As these droplets travel a short distance, the solvent is removed by the applied vacuum of the source enclosure and a flow of hot desolvation gas. As each droplet is desolvated, the concentration of acid increases, as does the charge density of the droplet. Eventually, the high concentration of positive charge results in the release of gas‐phase sample ions. The direction of the Taylor cone is usually perpendicular to the source cone: the applied electric field, caused by the potential difference, draws charged ions perpendicularly into the source cone, while uncharged species continue moving straight and are removed by the vacuum. This ionization step is an essential part of detection because we measure a mass‐to‐charge ratio, which requires a non‐zero charge on our analytes. For electrospray, the ionization mechanism is easy to visualize: the primarily observed adducts in positive mode are protonated versions of the parent molecule, whereas in negative mode they are deprotonated. Caffeine, with an exact mass of 194.0804, is expected to form a protonated [M+H]+ adduct with an m/z ratio of 195.0877. This corresponds to the monoisotopic mass, or the mass calculated using only the most abundant isotope of each element (12C, 1H, 16O, etc.). We also expect to observe an isotopologue peak at 196.0910, with an intensity of approximately 8.7% that of the monoisotopic peak. This is the m/z for a caffeine molecule containing one atom with an extra neutron, based on the probability of one atom being a higher‐mass isotope, such as 13C or 15N.
The ability to reliably measure the intensity ratios of the different isotopologues of each analyte provides us with yet another way to characterize molecules. We can use the combination of exact mass measurements and the ratio of isotope peaks to narrow down a list of molecular formulas for unknown compounds. Figure 12 shows a zoomed‐in MS1 spectrum of caffeine, revealing the monoisotopic peak at m/z 195.0888 (expected m/z 195.0877), as well as two higher mass isotopologue peaks at m/z 196.0904 and 197.0930. The first isotopologue peak has an intensity of approximately 8% of the monoisotopic peak, which agrees with the expected value of approximately 8.7% based on natural abundances and the molecular formula of caffeine.
Figure 12.

The observed isotope distribution of caffeine, seen as the [M+H]+ adduct. The peak at 195.0888 Da corresponds to the monoisotopic peak, which is also the base peak (most abundant peak). Caffeine, like many small molecules, exhibits a monotonic isotope pattern, where each subsequent isotope peak (here, the peaks at 196.0904 and 197.0930 Da) is lower than the preceding one.
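As a back‐of‐the‐envelope check on such patterns, the dominant 13C contribution to the M+1 peak can be estimated from the carbon count alone. This is a simplified sketch; minor contributions from 15N, 2H, and 17O raise the true M+1 intensity somewhat above the carbon‐only estimate.

```python
P13C = 0.0107  # approximate natural abundance of 13C


def c13_m1_fraction(n_carbons):
    """Intensity of the M+1 peak relative to the monoisotopic peak,
    counting only the chance that exactly one carbon is 13C."""
    return n_carbons * P13C / (1 - P13C)


# Caffeine (C8H10N4O2) has eight carbons:
print(round(c13_m1_fraction(8) * 100, 1))  # prints 8.7 (percent)
```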
Collecting and using fragmentation spectra
We can further expand our ability to characterize both known and unknown molecules by generating a fragmentation spectrum. This entails having the instrument fragment our molecules into smaller pieces and measuring the m/z value of each piece. To generate such spectra using collision‐induced dissociation (CID) fragmentation, a potential difference is applied to analytes after ionization but before they reach the flight tube. This potential difference is measured in volts, and we typically use a ramp of 10–40 V or 10–100 V, depending on the molecule in question and the desired degree of fragmentation. Figure 13 shows fragmentation spectra collected for our three standards: CMA, caffeine, and N‐lauroylsarcosine. Notice how each structure produces distinct fragments, each of which provides structural information about the molecule. A collision energy ramp, rather than a static value, is used to generate a broad range of differently sized fragments while also leaving behind some signal corresponding to the unfragmented precursor. A high static collision voltage can generate fragments, but it may also eliminate useful higher molecular weight fragments; higher molecular‐weight molecules require higher voltages to generate smaller fragments.
Figure 13.

Fragmentation spectra of CMA (top), caffeine (middle), and N‐lauroylsarcosine (bottom), collected using a collision energy ramp of 10‐100 V.
Molecules can also produce fewer fragments than desired if they contain particularly labile bonds. For example, the plant metabolite quercetin glucoside fragments at the glycosidic bond joining the flavonoid and the sugar, producing a highly intense quercetin fragment even at relatively low applied potentials, so the presence of the parent molecule may be missed. Higher potentials can produce further fragmentation, but particularly labile bonds can make it difficult to efficiently produce fragments stemming from all parts of a large molecule. Generally, it is better to produce more fragments rather than fewer in a spectrum that will be used for identifying an unknown or characterizing a known compound. In this way, fragmentation spectra serve as a molecular fingerprint, at least when comparing data across instruments with the same fragmentation schemes and similar mass accuracies.
We separate fragmentation experiments into two categories: data‐directed (often called data‐dependent, DDA) and data‐independent (DIA). Data‐directed experiments can take the form of dedicated MS/MS experiments targeting a specific m/z value. The quadrupole (Q) is used as a mass filter to select a specific precursor mass known to the user, which is then fragmented, and the spectra are collected. Quantitative experiments can be performed by measuring fragment intensity rather than precursor intensity. The benefit of these experiments is that the instrument dedicates its analytical bandwidth to monitoring one or only a few select compounds, maximizing signal. When an untargeted approach is necessary, such as when screening samples to identify as many molecules as possible in a complex mixture, we can use a ‘survey’ experiment, or a ‘Fast Data Directed Experiment’; Figure 13 shows spectra collected in this manner. This approach can be described as alternating MS1 (precursor analysis) and MS2 (targeted fragmentation) scans. As an analyte elutes, we first identify its presence by an increase (from background) in the intensity of its m/z ratio relative to previous scans. Once a precursor mass of interest is identified, the next scan is a targeted MS/MS experiment using the quadrupole to filter for that precursor mass. Using this alternating series of MS1 and MS2 scans throughout the chromatography, we can effectively sample every second or so of the entire gradient, collecting targeted spectra on at least one molecule with every iteration. By utilizing exclusion lists, or lists of precursors to ignore for additional fragmentation, we can generate high‐quality fragmentation spectra for thousands of molecules in a single experiment if their retention times are well dispersed throughout the gradient.
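The survey logic described above (pick an intense precursor that is not on the exclusion list, fragment it, then move on) can be sketched as follows. This is a simplified illustration with hypothetical thresholds, not vendor acquisition code.

```python
def pick_precursor(ms1_peaks, exclusion, min_intensity=1000.0, mz_tol=0.01):
    """Return the most intense MS1 m/z worth fragmenting next, or None.

    ms1_peaks: list of (mz, intensity) pairs from the latest survey scan.
    exclusion: m/z values already fragmented, to be skipped.
    """
    for mz, intensity in sorted(ms1_peaks, key=lambda p: -p[1]):
        if intensity < min_intensity:
            return None  # nothing left above the trigger threshold
        if all(abs(mz - ex) > mz_tol for ex in exclusion):
            return mz
    return None


# The most intense peak is excluded, so the next-best precursor is chosen:
peaks = [(195.0877, 5e4), (301.0712, 8e4), (150.1000, 500.0)]
target = pick_precursor(peaks, exclusion=[301.0710])
```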
Data‐independent methods, or data‐independent acquisitions (DIA), are the other main category of fragmentation experiments. Waters uses the term ‘MSe’ for such experiments, which are carried out using a two‐experiment approach: 1) an MS1‐only scan is conducted first to ionize and detect the precursor m/z values of as many molecules as possible eluting at the time of the scan; 2) a follow‐up scan is performed in which all ions are subjected to fragmentation via a collision energy (a ramp or static value(s)), resulting in the untargeted fragmentation of all species. These two events alternate over the course of the chromatography run, effectively creating a ‘low energy’ and a ‘high energy’ channel. The low‐energy channel is used to identify precursor molecule peaks, just like any other MS1 experiment. In our opinion, the second channel is best thought of as something similar to GC‐EI‐MS data, where you're looking at scans filled with fragments of whole molecules. To associate fragments from the high‐energy channel with the precursor molecules found in the low‐energy channel, the peak shape and retention time of each fragment mass are compared to nearby precursor peaks; if a molecule is fragmented, its fragments must necessarily elute at the same time as the precursor molecule. Fragment peaks whose elution profile matches a precursor molecule are assumed to come from that precursor.
In an experiment where we can ensure that every molecule of interest is chromatographically separated from every other molecule, such DIA/MSe experiments are highly valuable. Any molecule may appear as multiple precursor masses, as each molecule can form different adducts (e.g., [M+Na]+, [M+K]+) or different charge states, but all of these must necessarily share a retention time and peak shape (even if the intensity of those peaks differs). Thus, fragments from all adducts and charge states can easily be summed together, whereas a DDA experiment is necessarily restricted to collecting fragmentation data on a single adduct or charge state at a time (within its isolation window of 0.5 Da). Further, because DIA/MSe experiments do not use the quadrupole to isolate a specific m/z ratio within 0.5 Da, they subject all isotopic peaks of a molecule to fragmentation, across all adducts and charge states, rather than collecting signal only on the monoisotopic peak of a single adduct or charge state. This means fragmentation spectra from DIA experiments can have much higher signal and more diverse fragments. However, this approach also requires each analyte to be chromatographically separated from every other molecule to prevent the overlap of fragmentation spectra, which is not always realistic.
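The elution‐profile matching used to pair high‐energy fragments with low‐energy precursors can be sketched as a simple correlation test over extracted traces. The traces and the 0.9 threshold below are hypothetical; real software also compares peak boundaries and apex times.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length intensity traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def assign_fragment(frag_trace, precursor_traces, threshold=0.9):
    """Assign a fragment trace to the best-correlated co-eluting precursor."""
    best, best_r = None, threshold
    for name, trace in precursor_traces.items():
        r = pearson(frag_trace, trace)
        if r > best_r:
            best, best_r = name, r
    return best


# The fragment rises and falls in lockstep with precursor "A", not "B":
frag = [0, 5, 20, 5, 0]
precursors = {"A": [0, 10, 40, 10, 0], "B": [10, 40, 10, 0, 0]}
owner = assign_fragment(frag, precursors)
```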
If two molecules elute at the same time, even with vastly different precursor masses and each appearing only as a single adduct (e.g., both as [M+H]+), we would be unable to determine which fragments belong to which precursor. However, if the two analytes can be separated by even a few scans' worth of time, the two peaks should be reliably identifiable, and the fragmentation data can be assigned. By using a calibrated and well‐maintained high‐resolution, high‐accuracy LC‐MS, we can obtain accurate retention times, exact precursor masses, and exact fragment masses for any analyte the instrument is capable of ionizing. We can use these characteristics both to describe known compounds and to identify unknown compounds based on easily calculable characteristics, such as the exact mass derived from the molecular formula. This can be accomplished through either database searching (based mainly on exact mass and fragmentation spectra) or in silico calculations based on databases of known structures. By determining the precursor and fragment masses of an unknown molecule, we can then search a database of spectra published by other members of the community.
We can also use tools such as SIRIUS (Dührkop et al., 2019), which accepts exact mass information about a precursor and its fragments, determines a set of molecular formulas that can explain the exact mass of the precursor, and then builds fragmentation trees from the provided fragmentation spectra. These trees are constructed by using the exact mass of each fragment to determine its molecular formula and whether that formula can be explained by the composition of the precursor molecule; the fragments must originate from some part of the precursor. The fragmentation trees are then scored based on several parameters that aim to identify the tree that best explains the provided fragmentation spectra. The putative molecular formula hits and fragmentation trees are then searched against databases of chemical structures (not spectra) with pre‐computed fragmentation trees, to identify which structure (if any) matches the experimental data. SIRIUS has advantages over traditional spectral library searching because the fragmentation scheme is calculated from chemical principles, independent of the specific instrument used to collect the data. In our experience, this methodology is most helpful when the experimental fragmentation data are much richer than published spectra, such as when using a higher collision energy with longer scan durations and high analyte concentrations. Published spectra may have as few as three peaks, whereas more aggressive fragmentation can easily provide tens or hundreds of peaks, each of which can provide information useful for narrowing down a list of unknown molecules. Regardless of the method used to putatively identify an unknown molecule, there remains no better method of identification than direct comparison to a clean standard. Unfortunately, this is not always possible, because standards may not be available or may be prohibitively expensive.
Using the MZMine batch file
Mass Detection (1): Used for the detection of m/z values and intensities in the MS1 channel. Select all raw data files, and use the scan filter to limit the selection to ‘MS1, level = 1’; set the Mass Detector type to ‘Centroid’, and set the noise level to something appropriate for the background scan‐to‐scan noise level of your instrument. We use between 50 and 250 counts, depending on the experiment. Lower thresholds leave more noise, but you reduce the chance of missing good peaks.
Mass Detection (2): As above, but for MS2 channel(s). Select all raw data files and use the scan filter to limit the selection to ‘MS2, level = 2’. The remaining settings should be identical to Mass Detection (1), except that the MS2 noise tends to be lower; we use 50 or 100 counts.
Chromatogram Builder: This step builds extracted ion chromatograms from all the peaks detected in Mass Detection, based on several m/z and intensity tolerances/thresholds. Select all raw data files and set the Scan Filter to MS1 only; we do not need to build chromatograms from MS2 data and, depending on the experiment, may not be able to anyway. We set the minimum number of consecutive scans to 5 (so at least five points per peak), the consecutive scan intensity equal to the MS1 noise threshold from Mass Detection (1), e.g., 250 counts, and the minimum absolute height to at least twice that value, e.g., 500 counts. You can set it higher to reduce the number of low signal‐to‐noise peaks, or try to filter them out later using other methods. We set the m/z scan‐to‐scan tolerance to either 0.0025 Da/10 ppm or 0.0015 Da/5 ppm, depending on the actual tolerance observed when examining multiple internal standard peaks across samples in the dataset. This is the first processing step that generates a ‘Feature List’, which appears in the ‘Feature Lists’ tab in the main section of the GUI. A suffix is appended to each feature list name, making it easier to tell which list comes from which step.
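The paired Da/ppm tolerances used throughout these steps can be pictured with a small sketch, written under the assumption that a match within either window (i.e., the wider of the two at a given m/z) is accepted.

```python
def within_tolerance(mz_ref, mz_obs, da=0.0025, ppm=10.0):
    """True if two m/z values agree within the absolute OR the ppm window."""
    delta = abs(mz_ref - mz_obs)
    return delta <= da or delta <= mz_ref * ppm / 1e6


# At low m/z the absolute window dominates; at high m/z the ppm window does:
low_ok = within_tolerance(195.0877, 195.0890)   # 0.0013 Da apart
high_ok = within_tolerance(800.0000, 800.0060)  # 7.5 ppm apart
```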
Smoothing: This step smooths the chromatograms generated by the Chromatogram Builder, which helps with peak picking by reducing data spikiness and making the peaks appear more Gaussian/normal. This is the first step in the Batch file that works on generated feature lists rather than the raw data files. In the Feature List selection dropdown, select ‘Those created by previous batch step’; use this setting for every step that follows. For the smoothing algorithm, we prefer Loess smoothing. Based on our usual scan rate of 0.25 to 1 s, we smooth over 5 scans in the retention time domain. This protocol does not cover ion mobility data, so ignore the mobility setting and leave it unchecked. If you go through the processing steps manually, this is also the first step that offers a ‘preview’ via a checkbox. This is helpful when first setting up a batch file, as you can see what the command will produce before running it on all samples.
Local minimum feature resolver: This step picks peaks out of the smoothed extracted ion chromatograms. Setting each step to ‘Those created by previous batch step’ eliminates the need to worry about this step processing both the unsmoothed and smoothed chromatograms. For settings, leave the MS/MS scan pairing options as they are for now, use the retention time domain, set the chromatographic threshold to 0.0%, minimum search range to 0.04 min, minimum relative height to 0.0%, minimum absolute height to your Chromatogram Builder minimum absolute height (e.g., 500 counts), minimum ratio of peak top/edge to 1.1, peak duration range to 0.01‐1.5 min, and minimum scans to 5. These settings were chosen to pick even very small, low signal‐to‐noise peaks, with the understanding that many noise peaks will also be included: we prefer to filter out low‐quality features later in processing. You can use stricter filtering thresholds if desired.
13C isotope filter (formerly: isotope grouper): This step examines all the peaks and identifies isotopologue peaks that can be explained by 13C. Again, set file use to ‘Those created by previous batch step’; do this for all remaining steps unless otherwise specified. Set the intra‐sample m/z tolerance to the scan‐to‐scan tolerance set in Chromatogram Builder, e.g., 0.0025 Da/10 ppm. Set the retention time tolerance to anything between the equivalent of one scan (0.25 s) and 1 s; one second is close enough to one scan to work reliably but wide enough to accommodate small errors. Leave the mobility tolerance unchecked, and likewise for monotonic shape: the monotonic shape option becomes troublesome with halogenated internal standards, which have prominent non‐monotonic isotope patterns, as do some large plant pigment molecules. (You can use it if you have no reason to believe it would erroneously exclude peaks.) We set the maximum charge to 1 because, in our experience, we do not expect multiply charged small molecules at useful abundances in our datasets. If you are using this protocol to process data from peptides or other molecules that easily pick up multiple charges, you should change this to a larger value. We select the ‘use most intense isotope’ representative isotope option because we have had issues with ‘use lowest m/z value’ mis‐assigning precursor m/z values; you should try both.
Isotopic peaks filter: This step searches for isotopologue signals from non‐13C isotopes. We usually select H, C, N, O, F, S, Cl, B, P, and Br. All of these elements except fluorine and phosphorus have multiple stable isotopes at measurable natural abundance. The m/z tolerance is set to an appropriate value, e.g., 0.0025 Da/10 ppm, and the maximum charge to 1. Again, if you expect multiply charged molecules, set this higher.
Group MS2 scans with features: Our Q‐TOF uses a standard isolation window of 0.5 Da, so we set the MS1‐to‐MS2 tolerance to 0.25 Da / 0 ppm. This only matters because the Xevo does not automatically adjust the ‘precursor mass’ to the exact mass of the precursor after LockMass correction; it uses the pre‐LockMass value. The retention time filter is set to feature edges, so any MS2 scan with a retention time within the edges of the peak will count. We do not enable a relative feature height, but we do require at least three fragment signals, each above the noise level, for a scan to be included. We also enable the option to merge spectra, with no other advanced features enabled. You do not have to merge spectra, but this option merges them at the per‐sample level for each feature, for every sample in the dataset. We find this a good middle ground between not merging MS2 spectra at all (keeping them all separate) and merging spectra across all samples. SIRIUS does the latter, which is why we choose the middle ground; this is discussed in the notes for Basic Protocol 3.
There is an alternative approach, which requires both high ‘un‐aligned’ sample‐to‐sample chromatographic consistency (0.05 min or less) and analyzing each sample twice. First, run a dedicated MS1‐only experiment. The benefit of this run is two‐fold: you can achieve better mass accuracy by using a dynamic range extension option, which halves the number of scans while maintaining the same scan duration, e.g., taking a 0.25‐s scan every 0.5 s. This option also produces MS1 data with drastically reduced TOF‐artifact peaks, which appear when single‐ion peak intensities are too high, resulting in (multiple) lower intensity peaks 0.1–0.5 m/z above the real peak. A dedicated MS1‐only experiment also maximizes the number of points per peak across the entire experiment, increasing the maximum chromatographic resolution as well as quantification capabilities. Second, analyze the same sample using an experiment designed to collect MS2 scans, such as MS/MS, FDDA, or MRM. You can then use the file merge tool in mzMine (Raw data methods ‐ merge files) to combine the two runs, taking the high‐accuracy, high‐chromatographic‐resolution MS1 data from the MS1‐only experiment, along with the MS2 scans from any number of other experiments collected on the same sample.
Retention time correction: This is a very useful step in cases where the sample‐to‐sample retention time variance exceeds 0.05 min but is less than approximately 0.5 min (assuming UPLC tolerances and typical chromatography lengths). A variance larger than that may indicate problems with the instrument that warrant further investigation. The larger the variance, the more features you want to have in common among all your samples for correction purposes. This can mean using solvent impurity peaks and column bleed, or you can add small molecule standards (or small amounts of PEG) to your solvent. As long as your chromatography shifts consistently within a sample, this correction works well, based on our experience. It identifies common features among all samples, calculates the average retention time for each of these features, and then applies corrections to each sample based on deviations from the average. Again, this assumes there are features common across all your samples, including blanks, which should always be included. We set a minimum standard intensity of 1000 counts, which is twice our minimum peak height. You can adjust this as necessary. The retention time tolerance is dependent on your full experiment sample‐to‐sample retention time tolerance for features. This is yet another determination where known internal standards are useful. We usually aim for 0.1 min or less; outside of this, we recommend trying to improve the chromatography and re‐collecting the data.
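In spirit, the correction works like the following sketch, which computes a per‐sample shift from anchor features shared across samples. This is a simplified, constant‐shift illustration with hypothetical retention times; the actual implementation interpolates between anchors along the gradient.

```python
def rt_shift(sample_rts, consensus_rts):
    """Average shift needed to move one sample's anchor features onto the
    consensus (cross-sample average) retention times. Inputs are the RTs of
    the same anchor features, in the same order, in minutes."""
    diffs = [c - s for s, c in zip(sample_rts, consensus_rts)]
    return sum(diffs) / len(diffs)


def correct(rt, shift):
    """Apply the per-sample shift to any retention time."""
    return rt + shift


# This sample runs 0.10 min early relative to the consensus:
shift = rt_shift([1.00, 5.02, 9.01], [1.10, 5.12, 9.11])
corrected = correct(4.00, shift)
```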
Join Aligner: This is where we consolidate all our feature lists from separate samples into a single comprehensive list. The sample‐to‐sample m/z tolerance is as previously mentioned. The retention time tolerance depends highly on the quality of your chromatography and whether you used the retention time correction. We suggest estimating the real variance by examining the last feature list for each sample (e.g., the retention time‐corrected ones or those immediately preceding them) and determining the maximum range of variances across multiple features for all samples. With consistent chromatography aided by a retention time alignment, you can reasonably approach values of less than 0.05 min. Without a retention time alignment and with inconsistent chromatography, you may set it as high as 0.2 min. We generally set it to 0.05 or 0.1 min, depending on the data. We set the m/z and RT weights equally at 1. We do not require the same charge state (mostly because we only check for singly charged species in previous steps) or the same identification (because we have not identified anything yet anyway). We do not require a spectral alignment (since not every sample has MS2 scans for every feature). However, you may find benefit in enabling the isotope similarity requirement (with the m/z tolerances mentioned previously, no minimum intensity, and an alignment score of ≥70%).
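The weighted matching can be pictured as a simple score, a hypothetical sketch of the idea rather than the exact scoring function used internally:

```python
def align_score(mz_a, rt_a, mz_b, rt_b,
                mz_tol=0.0025, rt_tol=0.05, mz_weight=1.0, rt_weight=1.0):
    """Score a candidate match between features from two samples.

    Returns None if either difference exceeds its tolerance; otherwise a
    weighted score between 0 and 1, where 1 means an identical match.
    """
    dmz, drt = abs(mz_a - mz_b), abs(rt_a - rt_b)
    if dmz > mz_tol or drt > rt_tol:
        return None
    total = mz_weight + rt_weight
    return (mz_weight * (1 - dmz / mz_tol)
            + rt_weight * (1 - drt / rt_tol)) / total


# A close (but not perfect) match scores between 0 and 1:
score = align_score(195.0877, 5.00, 195.0887, 5.02)
```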
Feature finder (multithreaded): This step takes the new, aligned feature list and examines the raw data to identify ‘missed’ peaks for any feature/sample pair that lacks an associated intensity value. It fills in the gaps by using a targeted search in the context of what was observed in other samples. We set the intensity tolerance to 100%, because we would rather have it fill in a very low intensity value that is equivalent to estimating the noise, rather than keep it empty, as this makes statistical calculations easier later on.
Duplicate peak filter: This step combines features that look like they should be a single feature, based on overlapping tolerances. Whether this step is needed in practice usually depends on how consistently realistic tolerances were applied in the previous steps.
Feature list blank subtraction: This step assumes you have mzML files for blanks run under the same conditions for both chromatography and mass spectrometry as the rest of your samples, ideally with at least three of them. We usually require a ‘blank‐associated’ signal to appear in at least half. We use quantification based on peak area, and if a feature is detected at abundances at least 200% above the average blank value, it is kept rather than being removed. You must select the specific blank mzML files using the drop‐down menu.
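The decision rule can be sketched as follows, interpreting the thresholds above as requiring detection in at least half the blanks and a two‐fold excess over the average blank; these are exposed as module parameters and can be adjusted.

```python
def keep_after_blank_filter(sample_areas, blank_areas,
                            fold=2.0, min_blank_fraction=0.5):
    """Keep a feature unless it appears to be blank-derived.

    A feature counts as blank-associated if detected (area > 0) in at least
    half the blanks; it is then kept only if the mean sample area is at
    least `fold` times the mean blank area.
    """
    detected = sum(1 for a in blank_areas if a > 0)
    if detected < min_blank_fraction * len(blank_areas):
        return True  # not reliably present in blanks
    mean_blank = sum(blank_areas) / len(blank_areas)
    mean_sample = sum(sample_areas) / len(sample_areas)
    return mean_sample >= fold * mean_blank


# A strong sample signal survives; one barely above the blanks does not:
kept = keep_after_blank_filter([500.0, 600.0], [100.0, 100.0, 100.0])
```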
Feature list rows filter: In our opinion, if you want to filter out noise or garbage peaks, this is where you should do it. We generally aim for a high number of replicates. If we have 7 replicates per group across five groups, we would likely require that a feature be detected in at least five samples to be considered valid. This means that even if a feature is detectable in only one of the five groups, it must be present in at least five of the seven samples in that group; as long as a feature is reliably detectable, it will be kept, even if it is low in abundance in that group. We also require at least two peaks in an isotope pattern for any feature to be considered valid, and we enable the removal of redundant isotope feature rows (which may have been missed in previous steps). The only other option we enable is resetting the feature ID number, so the resulting feature list is continuous, with no ‘missing’ features in the series.
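The replicate‐based validity rule reads naturally as code. This is a sketch of the logic using hypothetical areas, with detection taken to mean a nonzero area:

```python
def keep_feature(areas_by_group, min_samples=5):
    """A feature is valid if detected (area > 0) in at least `min_samples`
    replicates of at least one group."""
    return any(sum(1 for a in areas if a > 0) >= min_samples
               for areas in areas_by_group.values())


# Detected in 5 of 7 treated replicates and no controls: still kept.
groups = {"treated": [1e4, 2e4, 1e4, 3e4, 1e4, 0.0, 0.0],
          "control": [0.0] * 7}
valid = keep_feature(groups)
```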
Export for statistics (MetaboAnalyst): This option exports a well‐organized CSV file that can be easily uploaded to MetaboAnalyst, an open‐access online software package for statistical analyses. It requires that metadata be assigned to your samples, as mentioned previously, with at least three replicates per group; if you lack the appropriate replicates, MZMine flags an error but still exports the list. Assign your metadata grouping parameter, e.g., ‘Group’. We recommend exporting feature areas rather than heights to account for imperfections in peak shape and wide elution profiles.
Export for SIRIUS: This exports a .mgf file, which can be uploaded to SIRIUS (we're almost there) to identify the structures of our analytes based on the provided spectral data. We do not perform intensity normalization (SIRIUS does this automatically), and we select the use of all input scans (no merging, beyond what was done in the MS2 grouping step). The m/z tolerance is set as noted previously. We also exclude any features that are multiply charged (which shouldn't exist based on our processing parameters) or multimeric (we should see the monomeric species as a different feature).
Using the SIRIUS GUI to identify structures from fragmentation data
With the .mgf file produced, you can launch the SIRIUS GUI executable (there is a command line version, as well as an API, but using those is beyond the scope of this protocol). You will need to create a profile, be logged in, and have internet access. When using SIRIUS 6.1.1, use the ‘Import’ button to select the .mgf file from MZMine. SIRIUS can also be used to process mzML files directly: as with MZMine, simply click and drag mzML files into the GUI, and it will prompt you to align the runs (select ‘yes’). After processing is complete, proceed with the following steps. The main advantages of having SIRIUS process the mzML files are speed and the lack of required user input; the major disadvantage is also the lack of user input. Leaving exploration of the GUI to the reader, we want to highlight the basics of processing, which will take anywhere from minutes to days, depending on the settings and hardware. These settings are shown in Figure 14. First, click ‘Compute All’ to process all features at once. The menu contains many options. On the top right, enable advanced options. Enable all major tasks (SIRIUS, ZODIAC, Fingerprint Prediction/CANOPUS/Compound Class Prediction, and Structure Database Search; MSNovelist is optional). For Q‐TOF data, enable the isotope pattern filter, set the MS2 mass accuracy as appropriate (10 ppm), and disable MS/MS isotope scoring (for quadrupole‐filtered MS2 experiments). Set the number of candidates stored to 25, with 5 per adduct. Disable the fix for detected lipids, and both enable and enforce the adducts commonly observed in your dataset (e.g., M+H, M+Na, M+K), avoiding neutral losses and less common adducts if possible. Enable de novo + bottom‐up processing, performing de novo below the maximum m/z value of your collected range (e.g., 2000 Da); this ensures both de novo and bottom‐up processing are always performed. We recommend applying the element filter to both de novo and bottom‐up results and selecting only reasonable elements based on the context of the dataset.
For fragmentation tree computation, set the tree and compound timeouts to a value between 300 and 1200 s, and set both heuristic-use options to a value above your maximum m/z (e.g., above 2000) so that the heuristic method is never used. For ZODIAC, set the low-mass number of candidates to 25. Leave all other settings at their defaults.
Figure 14.

SIRIUS GUI processing settings.
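The settings described above can be captured as a simple checklist before you click ‘Compute All’. The sketch below is illustrative only: the dictionary keys paraphrase the GUI option names and are not SIRIUS API identifiers, and the sanity-check function merely encodes the ranges recommended in the text.

```python
# Paraphrased checklist of the recommended 'Compute All' settings.
# Keys are descriptive names, NOT SIRIUS API identifiers; they simply
# mirror the GUI options discussed in the text.
RECOMMENDED_SETTINGS = {
    "advanced_options": True,
    "tasks": ["SIRIUS", "ZODIAC", "CANOPUS", "StructureDatabaseSearch"],
    "filter_isotope_pattern": True,      # QTOF data
    "ms2_accuracy_ppm": 10,
    "ms2_isotope_scoring": False,        # quad-filtered MS2 experiment
    "candidates_stored": 25,
    "candidates_per_adduct": 5,
    "lipid_fix": False,
    "enforced_adducts": ["[M+H]+", "[M+Na]+", "[M+K]+"],
    "denovo_below_mz": 2000,             # run de novo + bottom-up everywhere
    "tree_timeout_s": 600,               # 300-1200 s is reasonable
    "compound_timeout_s": 600,
    "heuristic_mz_threshold": 2001,      # above max m/z: heuristics never used
    "zodiac_low_mass_candidates": 25,
}

def check_settings(s):
    """Sanity-check a settings dict against the ranges recommended in the text."""
    problems = []
    if not 300 <= s["tree_timeout_s"] <= 1200:
        problems.append("tree timeout outside 300-1200 s")
    if s["heuristic_mz_threshold"] <= s["denovo_below_mz"]:
        problems.append("heuristic threshold should exceed the collected m/z range")
    if s["ms2_accuracy_ppm"] > 10:
        problems.append("MS2 accuracy looser than 10 ppm for QTOF data")
    return problems  # empty list means no problems found
```

Keeping a record like this alongside your data makes it easy to confirm that every run in a study was processed with identical settings.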
If you're only interested in molecular formula hits that have associated structures, you can use the more limited ‘Database’ search; this ensures that any molecular formula match has at least one potential structure match available to check against. The resulting processing is, in most cases, much faster than de novo molecular formula determination. However, any feature representing a molecule with a novel molecular formula won't be matched to any formula or structure. After clicking ‘Compute’, processing will begin, and progress can be monitored by clicking the ‘Jobs’ button. The feature number, m/z value, and retention time of each feature on the left side match the designators in the MetaboAnalyst export CSV from MZMine. Of particular importance is the Substructure Annotations tab in SIRIUS, which sorts the list of structure matches based on a number of confidence values. We recommend thresholds of Tanimoto similarity ≥70% and CSI:FingerID scores between 0 and −100, where scores closer to 0 are better. For larger molecules, such as lipids, we have also observed more negative CSI:FingerID scores (approx. −200) paired with high Tanimoto similarities. Determining the proper thresholds for high-confidence hits in your data is best done using several clean standards as a reference.
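The threshold logic above is easy to apply programmatically to an exported annotation table. The following is a minimal sketch, assuming a tab-separated export; the column names (`TanimotoSimilarity`, `CSI:FingerIDScore`, `name`) are illustrative placeholders, so check them against the headers of your actual SIRIUS export before use.

```python
import csv
import io

def filter_structure_hits(tsv_text, min_tanimoto=0.70, min_csi_score=-100.0):
    """Keep structure annotations meeting the thresholds discussed above:
    Tanimoto similarity >= 70% and CSI:FingerID score between 0 and -100
    (closer to 0 is better).

    Column names ('name', 'TanimotoSimilarity', 'CSI:FingerIDScore') are
    assumptions for illustration; verify them in your own export.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        tanimoto = float(row["TanimotoSimilarity"])
        score = float(row["CSI:FingerIDScore"])
        if tanimoto >= min_tanimoto and min_csi_score <= score <= 0:
            hits.append(row["name"])
    return hits

# Toy example with made-up values:
demo = (
    "name\tTanimotoSimilarity\tCSI:FingerIDScore\n"
    "quercetin\t0.85\t-42.1\n"
    "unknown_A\t0.40\t-310.5\n"
)
print(filter_structure_hits(demo))  # -> ['quercetin']
```

For lipid-sized molecules, you could relax `min_csi_score` (e.g., to −200) while keeping the Tanimoto cutoff, consistent with the observation above.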
Concluding Remarks
Planning your first untargeted LC‐MS experiment can be daunting, especially without local experienced users to show you how to get started. Every step, from experimental design to final data representation, is undertaken with a specific purpose, and understanding the purpose of each step requires time and practice. Start small by running a basic experiment using a handful of samples in triplicate with a known, well‐characterized outcome, and practice the basic workflow and techniques. Once you have a handle on the basics and have validated your workflow, then you can move on to real experiments. Consistent treatment of samples is one of the most important details to master when first carrying out these experiments, and this comes most easily through repetition. Start simple, start small, and move up from there. However, a simple experimental design need not negatively impact utility. Often, even incomplete identification of metabolites can provide valuable insights into complex biological samples. We hope this protocol encourages more novice users to add untargeted metabolomics to their experimental toolboxes.
Author Contributions
Zane G. Long: Conceptualization; data curation; funding acquisition; formal analysis; investigation; methodology; validation; visualization; writing—original draft; writing—review and editing. Rachel W. Martin: Conceptualization; funding acquisition; project administration; resources; supervision; visualization; writing—original draft; writing—review and editing.
Conflict of Interest
The authors declare no conflict of interest.
Acknowledgments
This work was funded by NIH grant R21EYEY035792 to R.W.M., a UCI UROP grant to Z.G.L., and the UCI Office of Research. The authors gratefully acknowledge Felix Grün and Ben Katz at the UCI Mass Spectrometry Facilities for access to instrumentation and expert assistance with mass spectrometry.
Long, Z., & Martin, R. W. (2025). Design and analysis of untargeted metabolomics experiments. Current Protocols, 5, e70232. doi: 10.1002/cpz1.70232
Published in the Chemical Biology section
Contributor Information
Zane G. Long, Email: zglong@uci.edu.
Rachel W. Martin, Email: rwmartin@uci.edu.
Data Availability Statement
There are no data associated with this manuscript. All materials are contained in the main text.
Literature Cited
- Alexandrov, M. L., Gall, L. N., Krasnov, N. V., Nikolaev, V. I., Pavlenko, V. A., & Shkurov, V. A. (1984). Ion extraction from solutions at atmospheric pressure: A method of mass‐spectrometric analysis of bioorganic substances. Doklady Akademii Nauk SSSR, 277(2), 379–383.
- Augustijn, D., de Groot, H. J. M., & Alia, A. (2021). HR‐MAS NMR applications in plant metabolomics. Molecules, 26(4), 931. 10.3390/molecules26040931
- Bergo, A. M., Leiss, K., & Havlik, J. (2024). Twenty years of 1H NMR plant metabolomics: A way forward toward assessment of plant metabolites for constitutive and inducible defenses to biotic stress. Journal of Agricultural and Food Chemistry, 72(15), 8332–8346. 10.1021/acs.jafc.3c09362
- Blaise, B. J., Shintu, L., Elena, B., Emsley, L., Dumas, M.‐E., & Toulhoat, P. (2009). Statistical recoupling prior to significance testing in nuclear magnetic resonance based metabolomics. Analytical Chemistry, 81(15), 6242–6251. 10.1021/ac9007754
- Boyes, B., & Dong, M. (2018). Modern trends and best practices in mobile‐phase selection in reversed‐phase chromatography. LCGC North America, 36(10), 752–768.
- Cech, N. B., & Oberlies, N. H. (2023). From plant to cancer drug: Lessons learned from the discovery of taxol. Natural Product Reports, 40, 1153–1157. 10.1039/D3NP00017F
- Denish, P. R., Fenger, J. A., Powers, R., Sigurdson, G. T., Grisanti, L., Guggenheim, K. G., Laporte, S., Li, J., Kondo, T., Magistrato, A., Moloney, M. P., Riley, M., Rusishvili, M., Ahmadiani, N., Baroni, S., Dangles, O., Giusti, M., Collins, T. M., Didzbalis, J., … Robbins, R. J. (2021). Discovery of a natural cyan blue: A unique food‐sourced anthocyanin could replace synthetic brilliant blue. Science Advances, 7(15), eabe7871. 10.1126/sciadv.abe7871
- Dhyani, P., Quispe, C., Sharma, E., Bahukhandi, A., Sati, P., Attri, D. C., Szopa, A., Sharifi‐Rad, J., Docea, A. O., Mardare, I., Calina, D., & Cho, W. C. (2022). Anticancer potential of alkaloids: A key emphasis to colchicine, vinblastine, vincristine, vindesine, vinorelbine and vincamine. Cancer Cell International, 22, 206. 10.1186/s12935-022-02624-9
- Dührkop, K., Fleischauer, M., Ludwig, M., Aksenov, A. A., Melnik, A. V., Meusel, M., Dorrestein, P. C., Rousu, J., & Böcker, S. (2019). SIRIUS 4: Turning tandem mass spectra into metabolite structure information. Nature Methods, 16, 299–302. 10.1038/s41592-019-0344-8
- Ganán‐Calvo, A. M., López‐Herrera, J. M., Herrada, M. A., Ramos, A., & Montanero, J. M. (2018). Review on the physics of electrospray: From electrokinetics to the operating conditions of single and coaxial Taylor cone‐jets, and AC electrospray. Journal of Aerosol Science, 125, 32–56. 10.1016/j.jaerosci.2018.05.002
- Garcia‐Perez, I., Posma, J. M., Serrano‐Contreras, J. I., Boulangé, C. L., Chan, Q., Frost, G., Stamler, J., Elliott, P., Lindon, J. C., Holmes, E., & Nicholson, J. K. (2020). Identifying unknown metabolites using NMR‐based metabolic profiling techniques. Nature Protocols, 15, 2538–2567. 10.1038/s41596-020-0343-3
- Gowda, G. A. N., & Raftery, D. (2023). NMR metabolomics methods for investigating disease. Analytical Chemistry, 95(1), 83–99. 10.1021/acs.analchem.2c04606
- Hamberger, B., & Bak, S. (2013). Plant P450s as versatile drivers for evolution of species‐specific chemical diversity. Philosophical Transactions of the Royal Society of London B, 368, 20120426. 10.1098/rstb.2012.0426
- Hassall, K., & Mead, A. (2018). Beyond the one‐way ANOVA for ’omics data. BMC Bioinformatics, 19(Suppl 7), 199. 10.1186/s12859-018-2173-7
- Haynes, W. (2013). Bonferroni correction. In Dubitzky, W., Wolkenhauer, O., Cho, K. H., & Yokota, H. (Eds.), Encyclopedia of Systems Biology. Springer. 10.1007/978-1-4419-9863-7
- Honeker, L. K., Hildebrand, G. A., Fudyma, J. D., Daber, L. E., Hoyt, D., Flowers, S. E., Gil‐Loaiza, J., Kübert, A., Bamberger, I., Anderton, C. R., Cliff, J., Leighty, S., AminiTabrizi, R., Kreuzwieser, J., Shi, L., Bai, X., Velickovic, D., Dippold, M., Ladd, S. N., … Tfaily, M. M. (2022). Elucidating drought‐tolerance mechanisms in plant roots through 1H NMR metabolomics in parallel with MALDI‐MS, and NanoSIMS imaging techniques. Environmental Science and Technology, 56(3), 2021–2032. 10.1021/acs.est.1c06772
- Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y., & Ishiguro‐Watanabe, M. (2025). KEGG: Biological systems database as a model of the real world. Nucleic Acids Research, 53(D1), D672–D677. 10.1093/nar/gkae909
- Kim, H. K., Choi, Y. H., & Verpoorte, R. (2010). NMR‐based metabolomic analysis of plants. Nature Protocols, 5, 536–549. 10.1038/nprot.2009.237
- Kind, T., Tsugawa, H., Cajka, T., Ma, Y., Lai, Z., Mehta, S. S., Wohlgemuth, G., Barupal, D. K., Showalter, M. R., Arita, M., & Fiehn, O. (2018). Identification of small molecules using accurate mass MS/MS search. Mass Spectrometry Reviews, 37(4), 513–532. 10.1002/mas.21535
- Laaniste, A., Leito, I., & Kruve, A. (2019). ESI outcompetes other ion sources in LC/MS trace analysis. Analytical and Bioanalytical Chemistry, 411, 3533–3542. 10.1007/s00216-019-01832-z
- Lefort, G., Liaubet, L., Marty‐Gasset, N., Canlet, C., Vialaneix, N., & Servien, R. (2021). Joint automatic metabolite identification and quantification of a set of 1H NMR spectra. Analytical Chemistry, 93(5), 2861–2870. 10.1021/acs.analchem.0c04232
- Mahrous, E. A., & Farag, M. A. (2015). Two‐dimensional NMR spectroscopic approaches for exploring plant metabolome: A review. Journal of Advanced Research, 6(1), 3–15. 10.1016/j.jare.2014.10.003
- Miller, J., & Miller, J. C. (2018). Statistics and chemometrics for analytical chemistry. Pearson Education.
- Noleto‐Dias, C., Farag, M. A., Porzel, A., Tavares, J. F., & Wessjohann, L. A. (2024). A multiplex approach of MS, 1D‐, and 2D‐NMR metabolomics in plant ontogeny: A case study on Clusia minor L. organs (leaf, flower, fruit, and seed). Phytochemical Analysis, 35(3), 445–468. 10.1002/pca.3300
- Pang, Z., Lu, Y., Zhou, G., Hui, F., Xu, L., Viau, C., Spigelman, A. F., MacDonald, P. E., Wishart, D. S., Li, S., & Xia, J. (2024). MetaboAnalyst 6.0: Towards a unified platform for metabolomics data processing, analysis and interpretation. Nucleic Acids Research, 52(W1), W398–W406. 10.1093/nar/gkae253
- Patocka, J., Nepovimova, E., Wu, W., & Kuca, K. (2020). Digoxin: Pharmacology and toxicology—A review. Environmental Toxicology and Pharmacology, 79, 103400. 10.1016/j.etap.2020.103400
- Pérez‐Cova, M., Platikanov, S., Stoll, D. R., Tauler, R., & Jaumot, J. (2022). Comparison of multivariate ANOVA‐based approaches for the determination of relevant variables in experimentally designed metabolomic studies. Molecules, 27(10), 3304. 10.3390/molecules27103304
- Pontes, J. G. M., Brasil, A. J. M., Cruz, G. C. F., Souza, R. N., & Tasic, L. (2017). NMR‐based metabolomics strategies: Plants, animals and humans. Analytical Methods, 9(7), 1078–1096. 10.1039/C6AY03102A
- Prabhu, G. R. D., Williams, E. R., Wilm, M., & Urban, P. L. (2023). Mass spectrometry using electrospray ionization. Nature Reviews Methods Primers, 3(1), 23. 10.1038/s43586-023-00203-4
- Ren, S., Hinzman, A. A., Kang, E. L., Szczesniak, R. D., & Lu, L. J. (2015). Computational and statistical analysis of metabolomics data. Metabolomics, 11, 1492–1513. 10.1007/s11306-015-0823-6
- Schrimpe‐Rutledge, A. C., Codreanu, S. G., Sherrod, S. D., & McLean, J. A. (2016). Untargeted metabolomics strategies – Challenges and emerging directions. Journal of the American Society for Mass Spectrometry, 27(12), 1897–1905. 10.1007/s13361-016-1469-y
- Timári, I., Wang, C., Hansen, A. L., Costa Dos Santos, G., Yoon, S. O., Bruschweiler‐Li, L., & Brüschweiler, R. (2019). Real‐time pure shift HSQC NMR for untargeted metabolomics. Analytical Chemistry, 91(3), 2304–2311. 10.1021/acs.analchem.8b04928
- Tron, G. C., Pirali, T., Sorba, G., Pagliai, F., Busacca, S., & Genazzani, A. A. (2006). Medicinal chemistry of combretastatin A4: Present and future directions. Journal of Medicinal Chemistry, 49(11), 3033–3044. 10.1021/jm0512903
- Wen, W., Alseekh, S., & Fernie, A. R. (2020). Conservation and diversification of flavonoid metabolism in the plant kingdom. Current Opinion in Plant Biology, 55, 100–108. 10.1016/j.pbi.2020.04.004
- Wilm, M. (2011). Principles of electrospray ionization. Molecular & Cellular Proteomics, 10(7), M111.009407. 10.1074/mcp.M111.009407
- Wishart, D. S., Tzur, D., Knox, C., Eisner, R., Guo, A. C., Young, N., Cheng, D., Jewell, K., Arndt, D., Sawhney, S., Fung, C., Nikolai, L., Lewis, M., Coutouly, M. A., Forsythe, I., Tang, P., Shrivastava, S., Jeroncic, K., Stothard, P., … Querengesser, L. (2007). HMDB: The Human Metabolome Database. Nucleic Acids Research, 35, D521–D526. 10.1093/nar/gkl923
- Wishart, D. S., Guo, A., Oler, E., Wang, F., Anjum, A., Peters, H., Dizon, R., Sayeeda, Z., Tian, S., Lee, B. L., Berjanskii, M., Mah, R., Yamamoto, M., Jovel, J., Torres‐Calzada, C., Hiebert‐Giesbrecht, M., Lui, V. W., … Gautam, V. (2022). HMDB 5.0: The Human Metabolome Database for 2022. Nucleic Acids Research, 50(D1), D622–D631. 10.1093/nar/gkab1062
- Yamashita, M., & Fenn, J. B. (1984). Electrospray ion source. Another variation on the free‐jet theme. Journal of Physical Chemistry, 88, 4451–4459. 10.1021/j150664a002
- Yang, Z., Nakabayashi, R., Okazaki, Y., Mori, T., Takamatsu, S., Kitanaka, S., Kikuchi, J., & Saito, K. (2014). Toward better annotation in plant metabolomics: Isolation and structure elucidation of 36 specialized metabolites from Oryza sativa (rice) by using MS/MS and NMR analyses. Metabolomics, 10, 543–555. 10.1007/s11306-013-0619-5
Internet Resources
- MSConvert: Maintained by the Mallick Lab at Stanford University, the MacCoss Lab at the University of Washington, and the ProteoWizard Software Foundation. Command‐line tool for converting among different mass spectrometry data formats.
- https://bio.tools/msconvert
- MZMine: Maintained by MZIO GmbH. A software platform for analyzing LC‐MS data and other types of mass spectrometry data. Free version available for academic labs.
- https://mzio.io/mzmine-news/
- SIRIUS: Maintained by the Böcker Group at FSU Jena and Bright Giant GmbH. Java software for untargeted metabolomics using LC‐MS/MS. The primary focus is on determining the structures of novel molecules.
- https://github.com/sirius-ms/sirius
- MetaboAnalyst: A web‐based platform for statistical analysis of mass spectrometry datasets.
- https://www.metaboanalyst.ca/home.xhtml
- KEGG: Kyoto Encyclopedia of Genes and Genomes. Operated by the Kanehisa Lab at Kyoto University Bioinformatics Center. The KEGG pathway database links enzymes to each other and to small‐molecule metabolites, providing context for metabolomics studies.
- https://www.genome.jp/kegg/pathway.html