2024 Mar 16;20(2):42. doi: 10.1007/s11306-024-02104-3

Table 3.

Potential data pre-processing and pre-treatment steps for untargeted direct mass spectrometry and commonly used methods

| Step | Description | Commonly used methods* | Papers which specified a method† (count (%)) | Papers which did not mention the step† (count (%)) |
|---|---|---|---|---|
| Data selection | Selecting only the parts of a run which contain relevant data, often by detecting a change in the intensity of a particular ion, or total ion count. | Breath tracker (n = 12) | 16 (14.5%) | 94 (85.5%) |
| Baseline correction | Correcting for instrumental drifts. | Polynomial fit baseline subtracted (n = 6) | 8 (7.3%) | 100 (90.9%) |
| Deconvolution | Resolving overlapping peaks. | Modified Gaussian functions (n = 8) | 8 (7.3%) | 102 (92.7%) |
| Peak picking/alignment | Peaks can be defined and assigned exact m/z values, with only values which appear above a baseline selected for further analysis, or data can be binned into windows. | Centroided (n = 8) · Binned into windows (n = 4) | 19 (17.3%) | 91 (82.7%) |
| Value calculation | A typical run includes multiple scans. These scans are then usually summarised by a single value. | Average over multiple scans (n = 32) · Integration of area under the curve (n = 3) | 37 (33.6%) | 73 (66.4%) |
| Background subtraction | Removing the intensity measurements that are due to the background profile, not the sample. | Subtraction of blank sample (n = 29) | 30 (27.3%) | 80 (72.7%) |
| Noise reduction | Data may contain random instrumental noise which can be corrected for. | Replicate averaging (n = 20) · Resampling (n = 7) · Smoothing (n = 6) · Wavelet denoising (n = 3) | 30 (27.3%) | 79 (71.8%) |
| Methods-based normalisation | Correcting for systematic variation by scaling each sample to an additional measurement specific to that sample or batch which is expected to change proportionally to the ion intensity count. | Primary ion (n = 19) · Sample quantity (n = 7) · Instrument parameter (n = 6) | 31 (28.2%) | 78 (70.9%) |
| Data-based normalisation | Presenting data from each sample as a ratio, to compare proportional differences and reduce the impact of systematic variation. | Control condition (n = 9) · Maximum value (n = 4) · Total ion count (n = 3) | 20 (18.2%) | 87 (79.1%) |
| Transformation | Change the distribution of the data, typically to remove heteroscedastic noise. | Log transform (n = 22) | 23 (20.9%) | 87 (79.1%) |
| Centring | Performed on each column, makes the average value the same across columns, to reduce the influence of variables with high abundance on multivariate modelling. | Mean centred (n = 12) | 12 (10.9%) | 94 (85.5%) |
| Scaling | Performed on each column to make the variance across columns similar, to reduce the influence of variables with high fold changes on multivariate modelling. | Autoscaling (n = 9)‡ | 9 (8.2%) | 98 (89.1%) |
| Missing value replacement | Replacing missing values possibly due to detector dead time or low concentrations to improve statistical performance. | Poisson correction (n = 10) | 14 (12.7%) | 96 (87.3%) |
| Outliers | Data points which deviate from the distribution of the majority of the data and can have an undue effect on data analysis. | No method was used more than once. | 2 (1.8%) | 108 (98.2%) |

* Methods used by three or more papers are listed. Details of other methods are available in the supplementary information.

† Studies which specified that a step was not performed were not included in this count. Percentage is out of the total number of papers (110). Papers which did specify a method did not necessarily provide enough information to replicate the method.

‡ Autoscaling was assumed to refer only to the scaling method, although it is commonly used to refer to mean centring with unit (standard deviation) scaling (Goodacre et al., 2007). Two papers performed autoscaling but did not specify whether they also mean centred.
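
To make some of the steps in Table 3 concrete, the following is a minimal Python sketch of a few of the most commonly reported methods: averaging over scans, blank subtraction, total ion count normalisation, log transformation, mean centring and unit-variance scaling. It is illustrative only and is not taken from any of the reviewed papers. All function names, the example intensities, the `offset` added before the log transform and the clipping of negative values to zero are assumptions made for this example. The sketch keeps mean centring and unit-variance scaling as separate functions, since footnote ‡ notes that "autoscaling" is often used to mean both together.

```python
import math
from statistics import mean, stdev


def average_scans(scans):
    """Value calculation: summarise the scans of one run (scans x features)
    as one mean intensity per feature."""
    return [mean(col) for col in zip(*scans)]


def subtract_blank(sample, blank):
    """Background subtraction: remove the blank profile from a sample.
    Clipping negative results to zero is an assumption of this sketch."""
    return [max(s - b, 0.0) for s, b in zip(sample, blank)]


def normalise_tic(sample):
    """Data-based normalisation: express each intensity as a fraction of
    the sample's total ion count."""
    total = sum(sample)
    return [v / total for v in sample]


def log_transform(sample, offset=1.0):
    """Transformation: log10(x + offset). The offset is a hypothetical
    choice that avoids taking the log of zero."""
    return [math.log10(v + offset) for v in sample]


def _by_column(table, fn):
    """Apply fn to every column of a samples x features table and return
    the result in the same row-wise layout."""
    cols = [fn(list(col)) for col in zip(*table)]
    return [list(row) for row in zip(*cols)]


def mean_centre(table):
    """Centring: subtract each column's mean so every column averages zero."""
    def centre(col):
        m = mean(col)
        return [v - m for v in col]
    return _by_column(table, centre)


def unit_variance_scale(table):
    """Scaling only: divide each column by its sample standard deviation.
    Fails for constant columns (standard deviation of zero)."""
    def scale(col):
        s = stdev(col)
        return [v / s for v in col]
    return _by_column(table, scale)


def autoscale(table):
    """Autoscaling in the common sense: mean centring followed by
    unit-variance scaling."""
    return unit_variance_scale(mean_centre(table))


# Hypothetical intensity table: 3 samples (rows) x 3 features (columns).
X = [[10.0, 200.0, 5.0],
     [20.0, 100.0, 15.0],
     [30.0, 300.0, 10.0]]

if __name__ == "__main__":
    print(autoscale(X))
```

After `autoscale`, every column has a mean of zero and a sample standard deviation of one, whereas `unit_variance_scale` alone equalises the spread but leaves the column means non-zero. This difference is why a methods section that reports only "autoscaling" can be ambiguous.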