Table 2.
Name | Approach | Description | Dimension |
---|---|---|---|
apaac | Model-driven (physico-chemistry) | Amino acid composition with sequence-order-correlated factors computed from hydrophobicity and hydrophilicity indices of a.aᵃ | 80 |
ctdd | Model-driven (physico-chemistry) | Chain length-based distribution of a.a for selected physicochemical properties | 195 |
ctriad | Model-driven (physico-chemistry) | Triad frequency of residues classified on dipoles and volumes of a.a side chains | 343 |
dde | Model-driven (sequence comp.ᵇ) | Dipeptide composition deviation | 400 |
geary | Model-driven (physico-chemistry) | Autocorrelation regarding the distribution of physicochemical properties of a.a | 240 |
k-sep_pssm | Model-driven (sequence homology) | Column transformation-based position specific scoring matrix (pssm) profiles | 400 |
pfam | Model-driven (functional properties) | Protein domain profiles | 38–294ᶜ |
qso | Model-driven (physico-chemistry) | Sequence order effect based on physicochemical distances between coupled residues | 100 |
spmap | Model-driven (sequence comp.) | Subsequence-based feature map | 544 |
taap | Model-driven (physico-chemistry) | Summation of corresponding residue values for selected physicochemical properties | 10 |
random 200 | – | Randomly generated continuous numbers between 0 and 1 with uniform distribution | 200 |
protvec | Data-driven (learned embedding) | Sequence embedding utilizing skip-gram modelling approach | 100 |
seqvec | Data-driven (learned embedding) | Sequence embedding based on bi-directional language model architecture “ELMo” | 1024 |
transformer | Data-driven (learned embedding) | Transformer-architecture based embedding method that utilizes attention mechanism | 768 |
unirep | Data-driven (learned embedding) | Sequence embedding based on mLSTM architecture as a variation of recurrent neural networks | 1900 and 5700 |
ᵃ Amino acids
ᵇ Composition
ᶜ Size varies with the dataset, since pfam vectors include only the domains present in the given protein dataset
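
To make two of the dimensions in the table concrete, the following is a minimal sketch, not the implementation used for the benchmark. It generates a 200-dimensional uniform random baseline (as described for random 200) and a plain 400-dimensional dipeptide composition vector, which shows where the 20 × 20 = 400 dimensions of dde come from. It does not compute the deviation step of dde itself. The function names `random_baseline` and `dipeptide_composition` are hypothetical, and skipping non-standard residues is an assumption made here for simplicity.

```python
import itertools
import random

# The 20 standard amino acids; every ordered pair gives 20 * 20 = 400 dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in itertools.product(AMINO_ACIDS, repeat=2)]


def random_baseline(dim=200, seed=None):
    """Return `dim` numbers drawn uniformly from [0, 1) (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(dim)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide frequency vector of a sequence.

    Plain dipeptide composition only; the 'deviation' normalisation of dde
    is not applied. Pairs containing non-standard residues are skipped
    (an assumption of this sketch).
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    print(len(random_baseline(seed=0)))         # length of the random baseline
    print(len(dipeptide_composition("ACDA")))   # length of the dipeptide vector
```

Both functions return a fixed-length vector regardless of sequence length, which is the property that lets the representations in the table be fed to a standard classifier.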