2023 Feb 6;15:16. doi: 10.1186/s13321-023-00689-w

Table 2.

Properties of the selected protein descriptor sets and representations used in our benchmarks

| Name | Approach | Description | Dimension |
|---|---|---|---|
| apaac | Model-driven (physico-chemistry) | Amino acid composition regarding the sequence order correlated factors computed from hydrophobicity and hydrophilicity indices of a.a.[a] | 80 |
| ctdd | Model-driven (physico-chemistry) | Chain length-based distribution of a.a. for selected physicochemical properties | 195 |
| ctriad | Model-driven (physico-chemistry) | Triad frequency of residues classified on dipoles and volumes of a.a. side chains | 343 |
| dde | Model-driven (sequence comp.[b]) | Dipeptide composition deviation | 400 |
| geary | Model-driven (physico-chemistry) | Autocorrelation regarding the distribution of physicochemical properties of a.a. | 240 |
| k-sep_pssm | Model-driven (sequence homology) | Column transformation-based position-specific scoring matrix (pssm) profiles | 400 |
| pfam | Model-driven (functional properties) | Protein domain profiles | 38–294[c] |
| qso | Model-driven (physico-chemistry) | Sequence order effect based on physicochemical distances between coupled residues | 100 |
| spmap | Model-driven (sequence comp.) | Subsequence-based feature map | 544 |
| taap | Model-driven (physico-chemistry) | Summation of corresponding residue values for selected physicochemical properties | 10 |
| random 200 | | Randomly generated continuous numbers between 0 and 1 with uniform distribution | 200 |
| protvec | Data-driven (learned embedding) | Sequence embedding utilizing skip-gram modelling approach | 100 |
| seqvec | Data-driven (learned embedding) | Sequence embedding based on bi-directional language model architecture “ELMo” | 1024 |
| transformer | Data-driven (learned embedding) | Transformer-architecture based embedding method that utilizes attention mechanism | 768 |
| unirep | Data-driven (learned embedding) | Sequence embedding based on mLSTM architecture as a variation of recurrent neural networks | 1900 and 5700 |

[a] Amino acids

[b] Composition

[c] Size varies depending on the dataset, since pfam vectors only include the domains present in the given protein dataset
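
As a rough illustration of where two of the listed dimensions come from, the sketch below builds a 200-dimensional uniform random vector (matching the description of the random baseline) and a plain dipeptide composition vector, whose 400 entries correspond to the 20 × 20 ordered pairs of standard amino acids. This is a minimal sketch under stated assumptions, not the benchmark's implementation: the function names `random_baseline` and `dipeptide_composition` are hypothetical, and the dde descriptor in the table is a deviation-based variant of dipeptide composition whose exact normalisation is not given here, so only the underlying pair counting is shown.

```python
import itertools
import random

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_baseline(dim=200, seed=None):
    """Hypothetical helper: a vector of uniform random numbers in [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(dim)]


def dipeptide_composition(sequence):
    """Hypothetical helper: relative frequency of each ordered residue pair.

    Returns a list of 20 * 20 = 400 values in a fixed (alphabetical) order.
    This is plain dipeptide composition, not the deviation-based dde variant.
    """
    pairs = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]


if __name__ == "__main__":
    print(len(random_baseline(seed=0)))          # vector length of the random baseline
    print(len(dipeptide_composition("ACDA")))    # vector length of the dipeptide features
```

Fixing the pair order up front keeps the feature positions comparable across proteins, which any fixed-length descriptor of this kind needs before it can be fed to a downstream model.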