Skip to main content
Biology Methods & Protocols logoLink to Biology Methods & Protocols
. 2022 Mar 30;7(1):bpac008. doi: 10.1093/biomethods/bpac008

PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles

Alireza Mohammadi 1,#, Javad Zahiri 2,3,✉,#, Saber Mohammadi 1, Mohsen Khodarahmi 4,5,6, Seyed Shahriar Arab 7
PMCID: PMC8977839  PMID: 35388370

Abstract

Position-specific scoring matrix (PSSM), also called profile, is broadly used for representing the evolutionary history of a given protein sequence. Several investigations reported that the PSSM-based feature descriptors can improve the prediction of various protein attributes such as interaction, function, subcellular localization, secondary structure, disorder regions, and accessible surface area. While plenty of algorithms have been suggested for extracting evolutionary features from PSSM in recent years, there is not any integrated standalone tool for providing these descriptors. Here, we introduce PSSMCOOL, a flexible comprehensive R package that generates 38 PSSM-based feature vectors. To our best knowledge, PSSMCOOL is the first PSSM-based feature extraction tool implemented in R. With the growing demand for exploiting machine-learning algorithms in computational biology, this package would be a practical tool for machine-learning predictions.

Keywords: R package, PSSM, machine learning, feature extraction

Introduction

Position-specific scoring matrix (PSSM) is defined as a matrix that involves information about the probability of amino acids or nucleotides occurrence in each position, which is derived from a multiple sequence alignment. This matrix is similar to the substitution matrix but it is more intricate due to including the alignment position information. In such a matrix, the rows represent the position of residues in an alignment and the columns specify the name of residues. This representation can be reversed so that the rows and columns would determine the name of residues and their corresponding positions in the alignment, respectively. The values of this matrix are the residues’ binary logarithm derived from multiple alignment scores. Briefly speaking, the procedure of building PSSM can be summarized as three main steps (Fig. 1A).

Figure 1:

Figure 1:

(A) The process used to build a PSSM. To build a PSSM, protein sequences are given to sequence databases such as NCBI as FASTA files for performing BLAST search. Having multiple alignments performed, a PSSM file can be obtained. The obtained PSSM can be used as a new query against the dataset. (B) Schematic presentation of row and column transformation. The feature vector specified as blue is obtained by summing the rows and columns highlighted in pink.

In these matrices, the positive numbers indicate that identical or similar sequences have been aligned and the negative numbers are indicators of a non-conserved alignment. This matrix, which can be considered a summary of the ensemble of corresponding sequences, is a quantified description for the conservation degree in each position of the alignment.

As far as the significance of PSSM is concerned, we investigated the studies that used the PSSM-based feature for predicting a protein attribute. By a thorough search on the literature in PubMed using PSSM and Prediction as keywords, we obtained 306 articles.

Moreover, the information conveyed through PSSMs is widely used in predicting various attributes of proteins ranging from the prediction of secondary and tertiary structures [1], protein–protein interactions [2], accessible surface area [3], flexibility [4], binding sites domains [5, 6], post-translational modification [7], protein localization [8], identifying the binding regions of protein–RNA [9], and protein–DNA [10] to the prediction of drug–target interaction [11]. Figure 2 shows the categorized papers based on their subjects that utilized PSSM-based features.

Figure 2:

Figure 2:

The frequency of categorized articles that employed PSSM-derived information in the years 1999–2021.

Feature extraction or feature encoding is a fundamental step in the construction of high-quality machine learning-based models. Specifically, this is a key step in various prediction problems in bioinformatics and computational biology [2, 12–14]. In the last two decades, a variety of new features have been proposed to train models for predicting several protein attributes. Such schemes are mostly based on sequence information or representation of diverse physicochemical properties. Although the features derived directly from protein sequences (such as amino acid compositions, conjoint triad, and k-mer composition) are regarded as essential factors for training efficient models, an increasing number of studies have shown that evolutionary information, which can be extracted from PSSMs, is much more informative than sequence-based information alone [2, 14, 15].

Several servers and software packages have been introduced to extract some specific descriptors from biological molecules, namely repRNA [16], repDNA [17], Pse-in-One [18, 19], Pse-Analysis [20], PseAAC [21], propy [22], PROFEAT [23], protr/ProtrWeb [24], and POSSUM [25]. All these tools, except for POSSUM and a recent version of protr, generate sequence-based and physiochemical features and commonly lack PSSM profile information. Moreover, more than 38 different PSSM-based feature types have been proposed in recent years. The POSSUM web server is the only tool devoted to the feature extraction from PSSMs. This tool provides 21 out of 38 PSSM-based feature types (unfortunately, the POSSUM tool was not accessible at the time of writing this manuscript). To the best of our knowledge, PSSMCOOL is the most comprehensive tool for extracting sequence-based evolutionary-related features from PSSMs.

Materials and methods

Various PSSM-derived features have been implemented as a comprehensive R package named PSSMCOOL. This R package includes 31 functions that extract 38 different PSSM-based features; that is, some of them are capable of generating more than one feature vector. These functions take a PSSM file for the protein of interest, as the input and output of the corresponding feature vector. In some functions, depending on the desired feature types, parameters are adjustable by users.

The implemented feature extraction algorithms are based on matrix transformations from the original PSSM profiles, which can be categorized into three types: Row transformations, column transformations (see Fig. 1B), and a mixture of row and column transformations (Table 1). For obtaining features derived from row transformation, we performed the following procedure: Two rows of PSSM were summed or subtracted or one or more rows were multiplied by a number. Similarly, by adding or reducing two or multiple columns, the features that were formed based on column transformation were obtained.

Table 1:

Implemented feature extraction algorithms and their application for predicting various problems in PSSMCOOL and a comparison between our package and POSSUM tool

Descriptor name Dimension PSSMCOOOL POSSUM Reference First usage
Row transformation AAC-PSSM 20 [26] Protein structural class
AATP 420 [27] Protein structural class
AB-PSSM 400 [28] Protein function
CS_PSe_PSSM 700 [29] Protein structural class
D-FPSSM 20 [2] Protein–protein interaction
DISSULFID a [30] Cysteine reactivity
Kiderafactor a [31] Ligand-binding site
MEDP 420 [32] Protein structural class
PSSM-composition 400 [33] Secreted effector proteins
RPM-PSSM 400 [28] Protein function
S-FPSSM 400 [2] Protein–protein interaction
Smoothed-PSSM b [34] RNA-binding sites
Column transformations DMACA-PSSM 210 [35] Protein types in Gram-negative bacteria
DPC-PSSM 400 [36] Protein fold recognition
DWTPSSM 80 [37] Protein crystallization prediction
EEDP 400 [38] Protein structural class
k-separated-bigrams PSSM 400 [36] Protein fold recognition
LPC_PSSM 280 [39] Protein structural class
MBMGACPSSM 560 [32] Protein structural class
SCSH2 b [14] Protein–protein interaction
SOMA_PSSM 160 [40] Protein structural class
TPC 400 [27] Protein structural class
tri-gram-PSSM 8000 [41] Protein fold recognition
Combination of row and column transformations AADP-PSSM 420 [26] Protein structural class
Average_Block 400 [42] Protein classification
Discrete cosin transform 400 [43] Protein–protein interaction
DP-PSSM 120 [44] Subcellular localizations
EDP 20 [38] Protein structural class
Gray_PSSM_PseAAC 100 [45] Antifreeze proteins
Pse-PSSM b [46, 47] Membrane proteins
PSSM400 400 [42] Protein classification
PSSM-AC 200 [33] Secreted effector proteins
PSSM_BLOCK b [48] Protein self-interactions
PSSM-CC b [33] Secreted effector proteins
PSSM_SEG 100 [49] Protein fold recognition
PSSM_SD 80 [49] Protein fold recognition
RPSSM 110 [50] Protein structural classes
Single_Average 400 [42] Protein classification
SVD_PSSM 20 [42] Protein classification
a

These features produce a matrix of features whose dimension varies based on the choice of the parameters.

b

Feature vector dimension varies based on the choice of the parameters.

The 10 important features implemented within the PSSMCOOL package are summarized below. More details and formulas are provided on the online documentation.

PSSM-AC

This feature, which stands for auto-covariance transformation [33], calculates the j-th column average and subtracts this from the i-th and i + g-th rows of this column and finally, these numbers are multiplied (Fig. 3). The values of j vary between 1 and 20. By changing the i variable from 1 to L−g, the acquired numbers are summed where L represents the length of the protein. The formula for generating this feature is provided in Equation (1).

PSSM-ACi,j=1L-gi=1L-gSi,j-1Li=1LSi,jSi+g,j-1Li=1LSi,j. (1)

Figure 3:

Figure 3:

Extraction of PSSM_AC feature from PSSM. Here, the average of column j is −0.8, which is subtracted from −1 corresponding to the i, jth element, and −1 corresponding to the i + g, jth element. This results in −0.2 for both subtractions. The obtained numbers are multiplied to gain 0.04. This must be repeated in the range of i = 1 to Lg and the resulting numbers must be summed and finally divided by Lg.

DPC–PSSM

This feature is related to dipeptide composition (DPC) [26] and originally was proposed for protein structural class prediction. For calculating this descriptor, the elements of two successive rows and two different columns are multiplied (see Fig. 4). This operation is performed on different rows and columns. Then, the computed values are summed and for every two successive rows, this sum is divided by L−1, where L is the protein length.

Figure 4:

Figure 4:

Extraction of DPC_PSSM feature from PSSM. As shown in the figure, the values of two consecutive rows from different columns are multiplied (−12) and summed for the range of k = 1 to k = L−1. The finally obtained number must be divided by L−1.

Trigram-PSSM

This feature is a feature vector with a length of 8000, which is extracted from PSSM [41]. If the elements of every three successive rows and three different columns of PSSM are multiplied and this operation is done for all three possible consecutive rows and eventually the acquired numbers are summed, we will have one of the elements of the final feature vector that corresponds to a specific combination of three amino acids out of 8000 possible combinations (Fig. 5). Equation (2) indicates how this feature can be generated.

Tm,n,r=i=1L-2Pi,mPi+1,nPi+2,r. (2)

Figure 5:

Figure 5:

Extraction of Trigram_PSSM feature from PSSM. The extraction of this feature is similar to DPC–PSSM extraction but instead of using two consecutive rows, the values of three consecutive rows in three different columns must be multiplied and summed. For the example provided here, the result of the multiplication is −12. This multiplication should be done for the range of i = 1 to L−2 for each combination of three columns and the obtained values must be summed.

PSe-PSSM

This feature originally was used to predict the membrane proteins and their types [47]. The PSe-PSSM feature vector is a vector with a length of 320 in which the 20 first numbers are the averages of 20 rows of PSSM [46]. The rest numbers of the final feature vector are computed as follows: For each column, the mean square of differences between the i-th and (i+lag)-th elements is computed for each column where lag can be any integer number between 1 and 15. Therefore, the length of the final feature vector will be 20 × 15 + 20. Figure 6 and Equation (3) show how this feature is generated.

p(k)=1(Llag)i=1Llag(pi,jpi+lag,j)2j=1,2,, 20,lag=1,2,, 15k=20+j+20(lag1). (3)

Figure 6:

Figure 6:

Extraction of PSe_PSSM feature from PSSM. The first 20 values in this feature vector are the averages of 20 columns of PSSM. The remaining 300 values are computed by the mean square of differences between the i-th and i+lag-th rows for each column (lag values vary between 1 and 15). For i = 3 and i + lag = 9, the squared difference would be (3−(−1))2 = 4. If lag = 6, this will be calculated for the range of i = 1 to i = L−lag, and the resulting values must be summed and divided by L−lag.

K-separated-bigram-PSSM

This feature is almost identical to the DPC feature; in fact, the DPC feature is part of this feature (for K = 1). As shown in Fig. 7, for every two different columns, it considers rows that have distance k [36].

Figure 7:

Figure 7:

Extraction of K-separated-bigram-PSSM feature from PSSM. This feature can be considered as an extension of the DPC feature. For each combination of two columns, the sum of multiplication of the i-th row corresponding to one column and the i + k-th row corresponding to the other column is computed where i varies between 1 and LK. Here, for i = 3 and K = 6, the multiplication would be 0.

AB-PSSM

The AB-PSSM feature was used for protein function prediction [28]. This feature consists of two types of feature vectors. At first, each protein sequence is divided into 20 equal parts, each of which is called a block. In each block, the row vectors of the PSSM related to that block are added together and the resulting final vector is divided by the length of that block, which is equal to 5% of protein length (see Fig. 8). Finally, concatenating these 20 vectors, the first feature vector of length 400 is obtained. For the second feature, in each block, the average of the positive numbers is computed for all 20 columns. Finally, these 400 averages will be used as the second feature vector.

Figure 8:

Figure 8:

Extraction of AB-PSSM feature from PSSM. The first feature vector is obtained by placing 20 vectors corresponding to each block next to each other. For having these vectors, the row vectors (with length 20) related to each block are added together and the resulting vector is divided by the length of that block. For computing the second feature vector, the average of positive numbers in each column related to each block is calculated. Then, 20 values corresponding to 20 blocks are placed next to each other. By performing this procedure for each individual column, a feature vector with a length of 400 could be obtained.

CS-PSe-PSSM

This feature consists of a combination of several types of features; in general, the obtained feature vector would be of length 700 [29]. The sub-features that have been integrated as the single feature vector (CS-PSe-PSSM) are CSAAC, CSCM, segmented PsePSSM features, and segmented ACTPSSM.

SCSH2

This feature has been utilized for protein–protein interactions prediction [14]. To produce this feature vector, we need to extract the consensus sequence corresponding to the protein sequence based on the PSSM scores. Having placed these two sequences next to each other, a matrix with a dimension of 2×L will be created. In the next step, each entry in this matrix is considered a node and connected to the two entries, which are immediately below it (except for the two entries in the last row). Finally, we will have a graph similar to a bipartite graph called the SCSH graph. Now in this graph, each path of length k specifies a (k + 1)-mer. Finally, a k-mer composition feature vector can be obtained using this graph. k is equal to 2 in SCSH2.

PSSM400

This feature was employed in protein classification and protein–protein interaction prediction [42]. To generate this feature vector, for each of the standard amino acids, the corresponding rows in the PSSM are extracted and considered a sub-matrix (see Fig. 9). Now, for this sub-matrix, the column-wise average is considered the feature vector (a 20-dimensional vector). Finally, by putting together these feature vectors for all 20 amino acids, a feature vector of length 400 for each protein can be acquired.

Figure 9:

Figure 9:

Extraction of PSSM400 feature from PSSM. To calculate this feature, a sub-matrix representing the conservation of each standard amino acid will be computed. To obtain this sub-matrix, for each standard amino acid (here, the serine amino acid), all the corresponding columns are extracted. By calculating the average of columns in the extracted sub-matrix, a vector of length 20 will be acquired for each standard amino acid type. By putting the vectors (with the length of 20) for all 20 amino acids, the final feature vector with a length of 400 could be obtained.

SVD–PSSM

Singular value decomposition (SVD) is a general-purpose matrix factorization approach that has many useful applications in signal processing and statistics, as well as computational biology [42]. To compute this feature, SVD is applied to the PSSM representation of a protein for reducing its dimensionality. The final feature vector would be a 20-dimensional vector for all protein and peptide sequences with length ≥20.

Case study

We presented a case study and procedure that can be followed in order to use the PSSMCOOL for extracting features and building models for a prediction problem. For this case study, the interactions between presynaptic proteins were extracted from the IntAct database [51]. As the first step, proteins with non-unique Uniprot accession numbers were discarded. For the positive set (protein–protein interactions), interactions from spoke expanded co-complexes and negative interactions were filtered out. Negative data set was constructed according to random pairing method as described in Refs. [2, 14, 52]. The final data set contained 1730 interactions (positive and negative) between 631 unique proteins. In this case study, “FPSSM2” function was used for feature extraction. Also, Bagged CART (treebag), and Single C5.0 Tree from caret package were used for classification. These two classifiers achieved 0.996 and 0.998 accuracy, respectively (R scripts corresponding to this case study are available at: https://github.com/BioCool-Lab/PSSMCOOL).

Run-time analysis

A set of human proteins was used to compare the time required for extracting features with PSSMCOOL and POSSUM. The human proteome was partitioned into 100 bins for assembling this set based on the protein lengths. Then, one random protein was selected from each bin and finally, a set comprised of 100 proteins was constructed. Figure 10 illustrates the performance of each tool for feature extraction in terms of run time. POSSUM can be run in two modes. In the slower mode, it writes header for each extracted feature in the output files (POSSUM_h) and in the faster mode, it writes features to the output file without headers (POSSUM_wh). Twenty-one features were used for making this comparison. Run time per 100 residues was calculated for each protein and these times were averaged across all these 21 features afterward. As Fig. 10 shows, the run times corresponding to PSSMCOOL are significantly lower than both POSSUM_h and POSSUM_wh. On average, PSSMCOOL only needs 0.14 s per 100 residues for feature extraction. It is worth mentioning that using POSSUM for several proteins requires writing command-line scripts, which does not seem to be very convenient for researchers who lack prior experience in Unix-based operating systems.

Figure 10:

Figure 10:

Comparison of run time between PSSMCOOL and POSSUM. POSSUM can be run in two modes. In the slower mode, it writes header for each extracted feature in the output files (POSSUM_h) and in the faster mode POSSUM writes features to the output file without headers (POSSUM_wh). Each point shows the average run time (in seconds) per 100 residues for each protein across all features.

For almost all features implemented in PSSMCOOL, the corresponding run time is proportional to the protein length. However, this does not apply to three features, including disulfide, PSSMSEG, and PSSMSD, which are not dependent on the protein length. Their run time depends on the frequency of specific amino acids within the input proteins. Figure 11 shows the details of the run time corresponding to 29 different feature types in PSSMCOOL for 631 proteins used in the case study using a laptop with the following configuration; operating system: Windows 10 ×64; CPU: corei7 7700 HQ; RAM: 16 GB; R version 4.1.2. Evidently, trigram and DFMCA are the two most computationally intensive features (P < 2.2E−22; t-test). Nevertheless, the average run time for these two feature extractions was 2.05 and 0.19 s per 100 residues in the protein, respectively. Regarding the mean length of the human proteome, which is 553, on average feature extraction takes 1 s for each protein using a non-high-performing personal laptop.

Figure 11:

Figure 11:

Feature extraction run time for features implemented in the PSSMCOOL. Trigram and DFMCA were the most computationally intensive features. However, the maximum run time corresponding to DFMCA did not exceed 23 min for a protein with >34 000 residues as the worst scenario. In addition, the average run time per 100 residues is 2.05 s for trigram and is <0.19 s for all other features. The configurations of the machine that was used for extracting features are as follows Windows 10 ×64; CPU: corei7 7700 HQ; RAM : 16 GB; and R version 4.1.2.

Results and discussion

In this work, we present PSSMCOOL, a comprehensive, practical, and publicly accessible R package, developed to make the feature extraction of PSSMs feasible for researchers. Since it supplies 18 additional features, compared with the preceding available toolkit (POSSUM), it can greatly help for extracting features and developing new methods for the prediction of various protein attributes. The PSSMCOOL is freely accessible at: https://cran.r-project.org/web/packages/PSSMCOOL/index.html. Soaring data production has opened the door to the new applications of machine-learning methods in biology. One of the most significant steps toward the development of an efficient predictive model is feature extraction. The extraction of features by PSSMCOOL would be of great help for bioinformaticians who are interested in building predictive models for protein attribute prediction.

Availability of data and materials

Project name: PSSMCOOL;

Project home page: https://cran.r-project.org/web/packages/PSSMCOOL/index.html;

Operating systems: Windows, Linux, Mac;

Programming language: R;

Other requirements: R;

License: Not applicable;

Any restrictions to use by non-academics: No restrictions;

GitHub page: https://github.com/BioCool-Lab/PSSMCOOL.

Author contribution

The main idea of this work was represented by J.Z. A.M., J.Z., and S.M. implemented the package. M.K. and A.M. prepared the online documentation. S.S.A., M.K., and J.Z. reviewed and optimized the written R codes. Drafting and writing of the manuscript was carried out by all the authors. J.Z. supervised the work.

Conflict of interest statement. None declared.

References

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Project name: PSSMCOOL;

Project home page: https://cran.r-project.org/web/packages/PSSMCOOL/index.html;

Operating systems: Windows, Linux, Mac;

Programming language: R;

Other requirements: R;

License: Not applicable;

Any restrictions to use by non-academics: No restrictions;

GitHub page: https://github.com/BioCool-Lab/PSSMCOOL.


Articles from Biology Methods & Protocols are provided here courtesy of Oxford University Press

RESOURCES