2019 Dec 23;17:107. doi: 10.1186/s12915-019-0730-9

Fig. 3.

Prediction performance using different input features. a, b Pearson’s correlation between predictions and observations across patients in the a breast and b ovary. The x-axis represents the different methods. Specifically, RNA and CNV use the mRNA and DNA copy number variation values, respectively, directly as approximations of the proteomic values. RF is the random forest model trained across all available proteins using two features, the corresponding RNA and CNV values of a protein. RF+cross1 and RF+cross2 are random forest models that transfer information across breast and ovarian cancers. In RF+cross1, we trained two RF models on the breast and ovary data separately and combined their predictions, whereas in RF+cross2, we trained a single RF model on the combined breast and ovary data. c, d Prediction performance using protein sequence and class information in the c breast and d ovary. In addition to RNA and CNV, RF+aa adds 20 features, each representing the number of occurrences of one amino acid in a protein. RF+aaKR adds only the counts of two amino acids, lysine (K) and arginine (R), which are the cleavage targets of trypsin in proteomics mass spectrometry. RF+class adds four binary features representing the four protein classes defined by the CATH protein structure classification database. RF+aaKR+class adds both the amino acid count features and the protein class features.
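The feature sets named in the caption can be illustrated with a minimal sketch. It is not the authors' code: the function names, the ordering of features within a vector, and the ordering of the four class flags are assumptions made here for illustration, and RF+aaKR+class is taken to mean the K and R counts plus the class flags. The sketch only shows how the per-protein input vector might be assembled for each variant and how a Pearson correlation could be computed; the random forest itself is omitted.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids, one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_counts(sequence):
    """Return 20 counts, one per standard amino acid (the RF+aa features)."""
    counts = Counter(sequence.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def kr_counts(sequence):
    """Return the lysine (K) and arginine (R) counts (the RF+aaKR features)."""
    counts = Counter(sequence.upper())
    return [counts.get("K", 0), counts.get("R", 0)]


def build_feature_vector(rna, cnv, sequence="", class_flags=(), variant="RF"):
    """Assemble one protein's input features for a named variant.

    class_flags is a hypothetical 4-element 0/1 sequence standing in for
    the four protein classes; how they are ordered is an assumption.
    """
    known = {"RF", "RF+aa", "RF+aaKR", "RF+class", "RF+aaKR+class"}
    if variant not in known:
        raise ValueError(f"unknown variant: {variant}")

    features = [rna, cnv]  # every RF variant starts from RNA and CNV
    if variant == "RF+aa":
        features += aa_counts(sequence)
    if variant in ("RF+aaKR", "RF+aaKR+class"):
        features += kr_counts(sequence)
    if variant in ("RF+class", "RF+aaKR+class"):
        if len(class_flags) != 4:
            raise ValueError("expected four binary class flags")
        features += list(class_flags)
    return features


def pearson(xs, ys):
    """Pearson's correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

For a toy sequence such as `"MKRK"`, `kr_counts` returns `[2, 1]`, so the RF+aaKR vector has four entries (RNA, CNV, K count, R count), while the RF+aa vector has 22.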