Table 2. Properties of the selected protein descriptor sets and representations used in our benchmarks.
| Name | Approach | Description | Dimension |
|---|---|---|---|
| apaac | Model-driven (physico-chemistry) | Amino acid composition combined with sequence-order correlation factors computed from hydrophobicity and hydrophilicity indices of a.aᵃ | 80 |
| ctdd | Model-driven (physico-chemistry) | Chain length-based distribution of a.a for selected physicochemical properties | 195 |
| ctriad | Model-driven (physico-chemistry) | Triad frequency of residues classified by the dipoles and volumes of a.a side chains | 343 |
| dde | Model-driven (sequence comp.ᵇ) | Dipeptide composition deviation | 400 |
| geary | Model-driven (physico-chemistry) | Autocorrelation of the distribution of physicochemical properties of a.a | 240 |
| k-sep_pssm | Model-driven (sequence homology) | Column transformation-based position-specific scoring matrix (PSSM) profiles | 400 |
| pfam | Model-driven (functional properties) | Protein domain profiles | 38–294ᶜ |
| qso | Model-driven (physico-chemistry) | Sequence order effect based on physicochemical distances between coupled residues | 100 |
| spmap | Model-driven (sequence comp.) | Subsequence-based feature map | 544 |
| taap | Model-driven (physico-chemistry) | Summation of corresponding residue values for selected physicochemical properties | 10 |
| random 200 | – | Randomly generated continuous numbers between 0 and 1 with uniform distribution | 200 |
| protvec | Data-driven (learned embedding) | Sequence embedding utilizing the skip-gram modelling approach | 100 |
| seqvec | Data-driven (learned embedding) | Sequence embedding based on bi-directional language model architecture “ELMo” | 1024 |
| transformer | Data-driven (learned embedding) | Transformer-architecture-based embedding method that utilizes an attention mechanism | 768 |
| unirep | Data-driven (learned embedding) | Sequence embedding based on the mLSTM architecture, a variant of recurrent neural networks | 1900 and 5700 |
ᵃ Amino acids
ᵇ Composition
ᶜ Size varies depending on the dataset, since pfam vectors only include the domains present in the given protein dataset
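
Several of the dimensions in Table 2 follow directly from how the feature space is enumerated. The sketch below illustrates this for three rows and is not the implementation used in the benchmarks. It assumes the 20 standard amino acids, so the 400 dipeptide features correspond to 20 × 20 ordered pairs. It also assumes the seven-class residue grouping commonly cited for the conjoint triad method, so the 343 triad features correspond to 7 × 7 × 7 class triples. The function names, the class grouping, and the simple sum-to-one normalization are illustrative assumptions. In particular, `dipeptide_composition` computes only the plain dipeptide composition that defines the dde feature space, not the deviation step itself, and published ctriad implementations may normalize counts differently.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Assumed seven-class grouping by side-chain dipole and volume
# (illustrative; check the grouping used by your descriptor library).
TRIAD_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, group in enumerate(TRIAD_CLASSES) for aa in group}


def dipeptide_composition(seq):
    """Relative frequency of each of the 20 x 20 = 400 ordered dipeptides."""
    keys = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[k] / total for k in keys]


def triad_frequencies(seq):
    """Relative frequency of each of the 7 x 7 x 7 = 343 class triads."""
    keys = list(product(range(len(TRIAD_CLASSES)), repeat=3))
    counts = dict.fromkeys(keys, 0)
    classes = [CLASS_OF.get(aa) for aa in seq]
    for triple in zip(classes, classes[1:], classes[2:]):
        if None not in triple:  # skip triads with non-standard residues
            counts[triple] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in keys]


def random_baseline(dim=200, seed=0):
    """Uniform random values in [0, 1), as in the 'random 200' row."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(dim)]


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    print(len(dipeptide_composition(example)))  # 400
    print(len(triad_frequencies(example)))      # 343
    print(len(random_baseline()))               # 200
```

Fixing the enumeration order of the keys keeps vectors comparable across proteins, which is what allows descriptors of a given type to be stacked into one feature matrix.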