• Is machine learning to be applied to mass spectral peak data or to identified proteins? The former does not require the use of quantification methods, but further analysis is required after application of machine learning to identify the proteins related to peaks of interest. |
• What is required from machine learning: biomarker identification or unknown sample classification? Whilst all methods can be used for classification, not all can be used for biomarker identification; the most suitable are those, such as rule-based methods, that report the proteins used in the rules that classify samples. |
• What limits the number of samples that can be produced, and what is therefore the most suitable and realistic number of samples? ML tends to perform better with larger numbers of samples, so more samples are usually better. Limitations can come from the number of samples that can be generated for MS analysis as well as from time and financial constraints. The most suitable number of samples balances all these factors whilst maximizing the sample size. |
• Can labeled quantification be included in the protocol, or is label-free more suitable? Labeled quantitation may not be compatible with the MS technology available, and the purchase of reagents and software is usually required, so these methods are not always suitable. Label-free techniques become the only option when the quantitation of proteins and application of ML are not considered until after MS analysis. Many label-free methods are also open source, giving them a financial advantage. |
• How large is the dataset? This can affect the choice of evaluation method, how the training and test sets are generated, and the choice of machine learning techniques applied. Having multiple samples within each class is essential, rather than a few samples across many classes. Cross-validation is frequently used to evaluate classification on datasets that are not large. |
• Is machine learning likely to over-fit the data? Over-fitting can be caused by classifying on small datasets. Some machine learning techniques are less prone to over-fitting and others have associated methods to reduce it. |
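The points above on dataset size, cross-validation and over-fitting can be illustrated with a minimal sketch of stratified k-fold cross-validation. It assumes a small table of protein abundances per sample. The function names, the toy values and the nearest-centroid classifier are hypothetical and were chosen only to keep the example self-contained; they are not methods prescribed by the text. Stratifying the folds keeps several samples of each class in every training set, and scoring only on held-out samples gives an estimate that is less inflated by over-fitting than accuracy measured on the training data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives a share of it.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the class whose training centroid is closest (squared Euclidean)."""
    by_class = defaultdict(list)
    for x, y in zip(train_x, train_y):
        by_class[y].append(x)
    best_label, best_dist = None, float("inf")
    for label in sorted(by_class):
        c = centroid(by_class[label])
        dist = sum((a - b) ** 2 for a, b in zip(sample, c))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_validate(features, labels, k=3, seed=0):
    """Return the fraction of samples classified correctly when held out."""
    correct = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(labels)) if i not in held_out]
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test_idx:
            if nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]:
                correct += 1
    return correct / len(labels)


# Toy data (invented for illustration): two protein abundances per sample.
features = [[1.0, 0.2], [1.1, 0.1], [0.9, 0.3],
            [3.0, 2.1], [3.2, 1.9], [2.9, 2.0]]
labels = ["control"] * 3 + ["disease"] * 3

accuracy = cross_validate(features, labels, k=3)
print(f"Cross-validated accuracy: {accuracy:.2f}")
```

With only three samples per class, each fold here holds out one sample of each class. On real proteomics data the number of folds would have to be chosen in light of the smallest class size.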