2024 Mar 16;20(2):42. doi: 10.1007/s11306-024-02104-3

Table 3.

Potential data pre-processing and pre-treatment steps for untargeted direct mass spectrometry and commonly used methods

Step	Description	Commonly used methods*	Papers which specified a method† (count (%))	Papers which did not mention the step† (count (%))
Data selection	Selecting only the parts of a run which contain relevant data, often by detecting a change in the intensity of a particular ion, or total ion count.	Breath tracker (n = 12)	16 (14.5%)	94 (85.5%)
Baseline correction	Correcting for instrumental drifts.	Polynomial fit baseline subtracted (n = 6)	8 (7.3%)	100 (90.9%)
Deconvolution	Resolving overlapping peaks.	Modified Gaussian functions (n = 8)	8 (7.3%)	102 (92.7%)
Peak picking/alignment	Peaks can be defined and assigned exact m/z values, with only values which appear above a baseline selected for further analysis, or data can be binned into windows.	Centroided (n = 8) · Binned into windows (n = 4)	19 (17.3%)	91 (82.7%)
Value calculation	A typical run includes multiple scans. These scans are then usually summarised by a single value.	Average over multiple scans (n = 32) · Integration of area under the curve (n = 3)	37 (33.6%)	73 (66.4%)
Background subtraction	Removing the intensity measurements that are due to the background profile, not the sample.	Subtraction of blank sample (n = 29)	30 (27.3%)	80 (72.7%)
Noise reduction	Data may contain random instrumental noise which can be corrected for.	Replicate averaging (n = 20) · Resampling (n = 7) · Smoothing (n = 6) · Wavelet denoising (n = 3)	30 (27.3%)	79 (71.8%)
Methods-based normalisation	Correcting for systematic variation by scaling each sample to an additional measurement specific to that sample or batch which is expected to change proportionally to the ion intensity count.	Primary ion (n = 19) · Sample quantity (n = 7) · Instrument parameter (n = 6)	31 (28.2%)	78 (70.9%)
Data-based normalisation	Presenting data from each sample as a ratio, to compare proportional differences and reduce the impact of systematic variation.	Control condition (n = 9) · Maximum value (n = 4) · Total ion count (n = 3)	20 (18.2%)	87 (79.1%)
Transformation	Changing the distribution of the data, typically to remove heteroscedastic noise.	Log transform (n = 22)	23 (20.9%)	87 (79.1%)
Centring	Performed on each column to make the average value the same across columns, reducing the influence of variables with high abundance on multivariate modelling.	Mean centred (n = 12)	12 (10.9%)	94 (85.5%)
Scaling	Performed on each column to make the variance across columns similar, to reduce the influence of variables with high fold changes on multivariate modelling.	Autoscaling (n = 9)‡	9 (8.2%)	98 (89.1%)
Missing value replacement	Replacing missing values, possibly due to detector dead time or low concentrations, to improve statistical performance.	Poisson correction (n = 10)	14 (12.7%)	96 (87.3%)
Outliers	Data points which deviate from the distribution of the majority of the data and can have an undue effect on data analysis.	No method was used more than once.	2 (1.8%)	108 (98.2%)

* Methods used by three or more papers are listed. Details of other methods are available in the supplementary information.

† Studies which specified that a step was not performed were not included in this count. Percentages are out of the total number of papers (110). Papers which did specify a method did not necessarily provide enough information to replicate it.

‡ Autoscaling was assumed to refer only to the scaling method, although it is commonly used to refer to mean centring with unit (standard deviation) scaling (Goodacre et al., 2007). Two papers performed autoscaling but did not specify whether they also mean centred.
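
The distinction drawn in the table between transformation, centring and scaling, and the ambiguity around "autoscaling" noted in the last footnote, can be illustrated with a minimal sketch. It assumes NumPy and a data matrix with samples in rows and m/z features in columns. The function names, the log offset and the toy intensities are illustrative assumptions and are not taken from any of the reviewed papers.

```python
import numpy as np

def log_transform(X, offset=1.0):
    """Log10-transform intensities; `offset` is an assumed constant to avoid log(0)."""
    return np.log10(X + offset)

def mean_centre(X):
    """Subtract each column mean so every variable averages zero."""
    return X - X.mean(axis=0)

def unit_variance_scale(X):
    """Divide each column by its sample standard deviation, without centring."""
    return X / X.std(axis=0, ddof=1)

def autoscale(X):
    """Mean centre, then divide by the column sample standard deviation."""
    return mean_centre(X) / X.std(axis=0, ddof=1)

# Rows are samples, columns are features (made-up intensities).
X = np.array([[100.0, 2.0],
              [200.0, 4.0],
              [300.0, 9.0]])

centred = mean_centre(X)          # column means become 0, variances unchanged
scaled = unit_variance_scale(X)   # column variances become 1, means stay non-zero
auto = autoscale(X)               # column means 0 and variances 1
```

Under these assumptions, `scaled` and `auto` differ only by a per-column offset, which is why a report of "autoscaling" with no mention of centring leaves the processed data ambiguous.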