2023 Feb 6;15:16. doi: 10.1186/s13321-023-00689-w

Table 2.

Properties of the selected protein descriptor sets and representations used in our benchmarks

Name	Approach	Description	Dimension
apaac	Model-driven (physico-chemistry)	Amino acid composition combined with sequence-order correlation factors computed from the hydrophobicity and hydrophilicity indices of a.a^a	80
ctdd	Model-driven (physico-chemistry)	Chain length-based distribution of a.a for selected physicochemical properties	195
ctriad	Model-driven (physico-chemistry)	Frequency of residue triads, with residues classified by the dipoles and volumes of a.a side chains	343
dde	Model-driven (sequence comp.^b)	Dipeptide composition deviation	400
geary	Model-driven (physico-chemistry)	Autocorrelation of the distribution of physicochemical properties of a.a	240
k-sep_pssm	Model-driven (sequence homology)	Column transformation-based position-specific scoring matrix (PSSM) profiles	400
pfam	Model-driven (functional properties)	Protein domain profiles	38–294^c
qso	Model-driven (physico-chemistry)	Sequence order effect based on physicochemical distances between coupled residues	100
spmap	Model-driven (sequence comp.)	Subsequence-based feature map	544
taap	Model-driven (physico-chemistry)	Summation of corresponding residue values for selected physicochemical properties	10
random 200	–	Randomly generated continuous numbers between 0 and 1 with uniform distribution	200
protvec	Data-driven (learned embedding)	Sequence embedding utilizing the skip-gram modelling approach	100
seqvec	Data-driven (learned embedding)	Sequence embedding based on the bi-directional language model architecture “ELMo”	1024
transformer	Data-driven (learned embedding)	Transformer architecture-based embedding method that utilizes an attention mechanism	768
unirep	Data-driven (learned embedding)	Sequence embedding based on the mLSTM architecture, a variant of recurrent neural networks	1900 and 5700

^aAmino acids

^bComposition

^cSize varies depending on the dataset, since pfam vectors only include the domains present in the given protein dataset
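
Some of the dimensions in Table 2 follow directly from how the vectors are built. The Python sketch below is illustrative only and is not the code used in the benchmarks. It shows (i) why a dipeptide-based descriptor has 20 × 20 = 400 entries, using plain dipeptide composition rather than the deviation-corrected dde values, (ii) a random baseline drawn uniformly from 0 to 1, and (iii) why the length of a domain vector depends on the dataset. All function names are hypothetical, and the binary presence/absence encoding of domains is an assumption made for the example.

```python
from itertools import product
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All ordered residue pairs: 20 * 20 = 400 dipeptides.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def random_baseline(dim=200, seed=None):
    """Return `dim` uniformly distributed numbers in [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(dim)]


def domain_vectors(protein_domains):
    """Encode each protein as a binary vector over the domains seen in the dataset.

    Assumption: presence/absence encoding. The vector length equals the
    number of distinct domains in `protein_domains`, so it varies by dataset.
    """
    vocabulary = sorted({d for domains in protein_domains.values() for d in domains})
    vectors = {
        name: [1 if d in domains else 0 for d in vocabulary]
        for name, domains in protein_domains.items()
    }
    return vocabulary, vectors
```

For example, `domain_vectors({"p1": ["domain_a", "domain_b"], "p2": ["domain_b"]})` yields vectors of length 2, while a dataset covering more distinct domains would yield longer vectors. This dataset dependence is the reason the pfam row gives a range instead of a single dimension.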