2013 Dec;17(12):595–610. doi: 10.1089/omi.2013.0017

Table 4.

Summary Table of Considerations for the Stages of a Proteomics Machine Learning Experiment

Considerations for experimental design
• Is machine learning to be applied to mass spectral peak data or to identified proteins? The former does not require the use of quantification methods, but further analysis is required after application of machine learning to identify the proteins related to peaks of interest.
• What is required from machine learning: biomarker identification or unknown sample classification? Whilst all methods can be used for classification, not all can be used for biomarker identification; the most suitable are rule-based methods and similar approaches, which report the proteins used in the rules that classify samples.
• What are the limitations on the number of samples produced, and therefore what is the most suitable/realistic number of samples? Large numbers of samples tend to be more suitable for the application of ML, so in general the more samples the better. The limitations can come from the number of samples generated for MS analysis as well as from time and financial constraints. The most suitable number of samples is a balance between all these factors, whilst trying to maximize the sample size.
• Can labeled quantification be included in the protocol, or is label-free more suitable? Labeled quantitation may not be compatible with the MS technology available, and the purchase of reagents and software is usually required, so these methods are not always suitable. Label-free techniques become the only option when the quantitation of proteins and application of ML are not considered until after MS analysis. Many label-free methods are also open source, giving them a financial advantage.
• How large is the dataset? This can affect the choice of evaluation method, how the training and test sets are generated, and which machine learning techniques are applied. Multiple samples within classes are essential, rather than few samples across many classes. Cross-validation is frequently used to evaluate classification on datasets that are not large.
• Is machine learning likely to over-fit the data? Over-fitting can be caused by classifying on small datasets. Some machine learning techniques are less prone to over-fitting and others have associated methods to reduce it.
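
As a rough illustration of cross-validation on a small dataset, the sketch below deals the samples of each class into k folds so that every test fold keeps multiple samples per class. The function names, the sample labels, and the choice of three folds are illustrative assumptions, not part of the original table.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def train_test_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)

# Hypothetical labels: 6 disease and 6 control samples.
labels = ["disease"] * 6 + ["control"] * 6
splits = list(train_test_splits(labels, k=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once, which is why this scheme is attractive when there are too few samples to hold out a separate test set.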

Steps required for application of machine learning
1. Quantification of proteins, either by a labeled or label-free method.
2. Generate training and test sets: either by cross-validation or, if a large dataset, by splitting it up to train on the majority of the dataset and test on a small subsection.
3. Pre-processing: feature selection methods. Feature selection is not essential, but can improve the classification accuracy of learners.
4. Application of machine learning methods. Models are built using the training sets, and classification accuracy is determined through application of the models to the test set. Software such as WEKA (Witten et al., 2011) can be used, or methods can be implemented in R (R, http://www.r-project.org/).
5. Comparison of machine learning methods to identify the best method for the dataset.
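
Steps 2 to 5 can be sketched roughly as follows. This is not the WEKA or R workflow mentioned above; it is a small pure-Python stand-in, and the synthetic abundances, the nearest-centroid classifier, the majority-class baseline, and the feature score are all assumptions chosen for illustration. Feature selection is repeated inside each fold on the training samples only, so the test fold does not influence which proteins are chosen.

```python
import random
import statistics

def make_synthetic_data(n_per_class=10, n_proteins=20, seed=1):
    """Synthetic abundances: proteins 0 and 1 differ between classes, the rest are noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, shift in (("control", 0.0), ("disease", 4.0)):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
            row[0] += shift
            row[1] += shift
            rows.append(row)
            labels.append(label)
    return rows, labels

def interleaved_folds(labels, k):
    """Deal the samples of each class round-robin into k folds (no shuffling)."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        members = [i for i, l in enumerate(labels) if l == label]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def score_features(rows, labels):
    """Score each protein by |difference of class means| / spread (two classes assumed)."""
    classes = sorted(set(labels))
    a = [r for r, l in zip(rows, labels) if l == classes[0]]
    b = [r for r, l in zip(rows, labels) if l == classes[1]]
    scores = []
    for j in range(len(rows[0])):
        xa = [r[j] for r in a]
        xb = [r[j] for r in b]
        spread = (statistics.pvariance(xa) + statistics.pvariance(xb)) ** 0.5 or 1.0
        scores.append(abs(statistics.mean(xa) - statistics.mean(xb)) / spread)
    return scores

def nearest_centroid_predict(train_rows, train_labels, test_rows, features):
    """Predict the class whose mean profile (over the selected features) is closest."""
    centroids = {}
    for label in sorted(set(train_labels)):
        members = [r for r, l in zip(train_rows, train_labels) if l == label]
        centroids[label] = [statistics.mean(r[j] for r in members) for j in features]
    preds = []
    for row in test_rows:
        dist = {label: sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
                for label, centroid in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds

def majority_predict(train_rows, train_labels, test_rows, features):
    """Baseline: always predict the most common training class (ties go to the first name)."""
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return [majority] * len(test_rows)

def cross_validate(rows, labels, predict, k=5, n_features=2):
    """Return overall accuracy across k folds for the given prediction function."""
    correct = 0
    for test_idx in interleaved_folds(labels, k):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in test_set]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Feature selection uses the training fold only, so the test fold stays unseen.
        scores = score_features(train_rows, train_labels)
        features = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:n_features]
        preds = predict(train_rows, train_labels, [rows[i] for i in test_idx], features)
        correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
    return correct / len(rows)

rows, labels = make_synthetic_data()
results = {
    "nearest centroid": cross_validate(rows, labels, nearest_centroid_predict),
    "majority baseline": cross_validate(rows, labels, majority_predict),
}
best_method = max(results, key=results.get)
print(results, "best:", best_method)
```

Including a trivial baseline in the comparison makes it easier to judge whether a learner's accuracy reflects real signal in the data.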

Post machine learning analysis
Further analysis to be included if information can be extracted from results of machine learning methods (for biomarker identification), by identifying proteins that were essential for the classification:
• Literature mining
• Pathway analysis
• Generation of interaction networks
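
To show how proteins essential for classification might be read off a rule-based model, here is a minimal sketch of a single-protein threshold rule. The protein names, abundance values, and function name are hypothetical, the search assumes exactly two classes, and the count it reports is training accuracy only, not an evaluation on held-out samples.

```python
def best_single_protein_rule(abundance, labels):
    """Search every protein and midpoint threshold for the rule with the most correct calls.

    Returns (protein, threshold, label_if_above, label_otherwise, n_correct).
    Assumes exactly two classes.
    """
    classes = sorted(set(labels))
    best = None
    for protein, values in abundance.items():
        ordered = sorted(set(values))
        # Candidate thresholds are midpoints between neighbouring observed values.
        for low, high in zip(ordered, ordered[1:]):
            threshold = (low + high) / 2
            # Try both orientations: either class may be the high-abundance one.
            for above, below in (classes, classes[::-1]):
                n_correct = sum(
                    (above if v > threshold else below) == label
                    for v, label in zip(values, labels)
                )
                if best is None or n_correct > best[-1]:
                    best = (protein, threshold, above, below, n_correct)
    return best

# Hypothetical abundances for three proteins across six samples.
labels = ["control", "control", "control", "disease", "disease", "disease"]
abundance = {
    "protein_A": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3],
    "protein_B": [2.0, 2.2, 1.8, 5.0, 5.4, 4.8],
    "protein_C": [3.0, 0.5, 2.0, 1.0, 2.5, 0.7],
}
rule = best_single_protein_rule(abundance, labels)
protein, threshold, above, below, n_correct = rule
print(f"IF {protein} > {threshold} THEN {above} ELSE {below} ({n_correct}/{len(labels)} correct)")
```

The protein named in the rule is the kind of candidate that would then be taken forward to literature mining, pathway analysis, or interaction networks.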