Machine learning-based plasma lipid discovery for breast cancer detection. Matched plasma from the same sample set as in Figure 2A (n = 256) was analyzed with the same machine learning biomarker discovery pipeline as in Figure 3B to identify a plasma signature panel and develop predictive models. (A) Average prediction of each model for individual samples across 2000 runs of LGOCV. (B) Lipids consistently selected as important by the Boruta algorithm across all runs. The dotted line marks the cutoff between the top 20 lipids and the remaining 10. Lipids from the EV23 panel are shown in red text and bars. (C–F) Results obtained with the top 20 lipids from (B) as variables, using (C) the indicated models or (D–F) the ensemble model, trained using LGOCV (20% test, 80% train) repeated 2000 times. (C) Test performance summary of the three models with the highest sensitivity. (D) Boxplots showing the distribution of performance metrics, with the interquartile range indicated. (E) Average ROC curve and AUC. (F) Certainty level of predictions. High: complete model agreement; medium: greater than 80% model agreement; low: less than 80% model agreement. The proportions (%) of high-, medium-, and low-certainty predictions are indicated. (G) Sensitivity analysis of the plasma ensemble model with varying numbers of lipids. Violin plots show the distribution of ensemble model accuracy when the top 14 to 30 lipids from (B) were used. Horizontal lines within each violin mark the 0.05, 0.5, and 0.95 quantiles of prediction accuracy. The signature size with the best accuracy and the fewest lipids is indicated by a pink density curve. LID, lipid identifier.
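The certainty levels in (F) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "model agreement" means the fraction of models voting for the most common class label, and that a sample with exactly 80% agreement, which the caption leaves unassigned, counts as low. The function name `certainty_level` and the vote lists are hypothetical.

```python
from collections import Counter


def certainty_level(votes):
    """Classify one sample's prediction certainty from per-model class votes.

    Assumption: agreement is the share of models that vote for the most
    common label. Exactly 80% agreement is treated as "low" here because
    the caption does not say where that boundary case belongs.
    """
    if not votes:
        raise ValueError("at least one model vote is required")
    # Count of the most frequent label among the model votes.
    top_count = Counter(votes).most_common(1)[0][1]
    agreement = top_count / len(votes)
    if agreement == 1.0:
        return "high"      # complete model agreement
    if agreement > 0.8:
        return "medium"    # greater than 80% agreement
    return "low"           # 80% agreement or less (boundary is an assumption)


def certainty_proportions(all_votes):
    """Return the percentage of samples at each certainty level."""
    levels = Counter(certainty_level(v) for v in all_votes)
    total = len(all_votes)
    return {k: 100.0 * levels.get(k, 0) / total for k in ("high", "medium", "low")}
```

For example, ten models that all predict the same class give "high", nine of ten give "medium", and six of ten give "low".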