Biomolecules. 2023 May 31;13(6):923. doi: 10.3390/biom13060923

Mathematical and Machine Learning Approaches for Classification of Protein Secondary Structure Elements from Coordinates

Ali Sekmen 1, Kamal Al Nasr 1,*, Bahadir Bilgin 1,2, Ahmet Bugra Koku 2,3, Christopher Jones 1
Editors: Jose M Guisan, Yung-Chuan Liu, Antonio Zuorro
PMCID: PMC10296594  PMID: 37371503

Abstract

Determining Secondary Structure Elements (SSEs) for any protein is crucial as an intermediate step for experimental tertiary structure determination. SSEs are identified using popular tools such as DSSP and STRIDE, which use atomic information to locate hydrogen bonds. When some spatial atomic details are missing, locating SSEs becomes difficult. To address this problem, three approaches for classifying SSE types using only Cα atoms in protein chains were developed: (1) a mathematical approach, (2) a deep learning approach, and (3) an ensemble of five machine learning models. The proposed methods were compared against each other and against a state-of-the-art approach, PCASSO.

Keywords: protein structure modeling, protein secondary structure, secondary structure identification, machine learning, protein trace, mathematical modeling

1. Introduction

Proteins form 3D structures, via atomic and molecular interactions, that determine their functions, such as material or signal transport, cell adhesion, and cell cycle [1,2]. Primary structures (sequences of amino acids in polypeptide chains) are known for a large set of proteins. However, only a small portion of them (<0.1%) have experimentally determined tertiary structures (folding of a polypeptide chain into a 3D shape) and quaternary structures (spatial 3D arrangements of all polypeptide chains of a protein). Secondary structures (repeated folding patterns of the protein backbone) are important for analyzing the relationship between primary and tertiary structures. Once the structure of a protein is determined, it is uploaded into a publicly available database such as the Protein Data Bank (PDB) [3,4], which contained 205 K proteins as of May 2023.

There are three experimental techniques used for determining 3D structures of proteins: X-ray crystallography [5,6,7], Nuclear Magnetic Resonance (NMR) spectroscopy [8,9], and Cryo-electron microscopy (Cryo-EM) [5,10,11].

  • In a crystal, atoms and molecules arrange themselves in regular arrays, and X-ray crystallography, in use since the 1950s, exploits this fact to generate the atomic and molecular structure of the crystal. To determine the atomic structure of a protein, it first needs to be crystallized. However, protein crystallization is a difficult process and is not possible for all proteins. For example, outer membrane proteins of Gram-negative bacteria, mostly β-barrel architectures, are largely rigid and stable, and therefore X-ray crystallography can be applied relatively easily to determine their molecular structures. In contrast, high-resolution diffracting crystals of plasma membrane proteins and large molecules are hard to obtain, owing to the difficulty of obtaining homogeneous protein samples.

  • NMR spectroscopy employs the properties of nuclear spin in an applied magnetic field to analyze the alignment of atoms’ nuclei, and it also provides information about dynamic molecular interactions. NMR spectroscopy requires a large amount of pure sample and, as with X-ray crystallography, has difficulty analyzing molecules with large molecular weight.

  • Cryo-EM provides a lower-resolution view of a protein compared to X-ray crystallography. However, it does not require crystallization, so many proteins that are difficult to crystallize, as well as large protein assemblies, can be imaged with Cryo-EM. It creates a 3D image from thousands of 2D projections. Cryo-EM provides different levels of detail at near-atomic (<5 Å), subnanometer (5–10 Å), and nanometer (>10 Å) resolutions. Only near-atomic resolution can be used to identify the locations of Cα and other atoms in the backbone of a protein chain.

It is known that the primary amino acid sequence of a protein chain includes all of the information needed to determine the tertiary 3D structure of that chain. Computational modeling comprises several techniques for predicting tertiary structure from primary structure [12,13,14,15,16]. Since it is computationally very demanding, it has mainly been limited to smaller proteins (100–150 amino acids). Alphabet/Google DeepMind recently developed the AlphaFold 2 AI system, which predicts tertiary structures with near-experimental accuracy [17]. Another impactful machine learning approach for tertiary structure prediction is RoseTTAFold, described in [18]. A review of several deep learning-based approaches can be found in [19]. In comparative or template-based modeling, the 3D structure of at least one protein is determined experimentally; this structure is then used to model other members of the same protein family based on the alignment of their amino acid sequences [20,21].

Determining Secondary Structure Elements (SSEs) for any protein is crucial as an intermediate step in experimental tertiary structure determination. SSEs are sub-conformational regions that form when a polypeptide chain folds, driven by factors including hydrogen bonds between amino acid residues. SSEs are commonly divided into helices (formed by hydrogen bonding of N-H and C=O groups four residues apart) and sheets (formed by hydrogen bonding of the N-H group of one strand with the C=O group of an adjacent strand). Any amino acid that belongs to neither a helix nor a sheet is categorized as a loop or coil. Experimentally, SSEs are located using optical measurements such as circular dichroism spectroscopy [22,23], infrared spectroscopy [24,25], and Raman spectroscopy, or using NMR chemical shifts [26,27].

A previous study [28] showed that approximately 40% of the protein structures deposited into the database suffer from at least one missing backbone atom, particularly when a higher-resolution structure of the protein is not available. Furthermore, the number of coarse-grained proteins constructed or simulated with the Cα trace only is increasing. Therefore, assigning SSEs using Cα atoms only, to tackle the problem of missing backbone atoms, becomes a crucial step. Several approaches have been developed to determine the SSEs of a protein using only the Cα atom locations. The first method used a sliding window covering four consecutive residues to find the distances and dihedral angles of Cα atoms [29]. DEFINE relies on Cα coordinates only and compares Cα distances with distances in idealized secondary structure segments [30]. P-SEA assigns SSEs using a short Cα distance mask and two Cα dihedral angle criteria [31]. KAKSI uses Cα distances and backbone dihedral angles [32]. SACF identifies SSEs based on the alignment of Cα backbone fragments with central poses derived by clustering known SSE fragments [33]. Other methods assign SSEs by approximating the backbone trace with a set of straight lines, such as STICK [34] and PMML [35]. We previously proposed a geometry-based approach using the Cα trace that reached 90% accuracy [36]. Recently, many machine learning approaches have been developed. One example is a neural network-based classifier called HECA [37]. HECA has two hidden layers, each with 128 neurons, and receives a set of rotation-invariant geometric features extracted from the raw coordinates of Cα atoms. In [33], a classification algorithm called SACF (secondary structure assignment based on Cα fragments) is presented. In [38], a random forest classifier called RaFoSa is described for determining SSEs using a set of geometric features. We previously developed an ensembled machine learning approach using a support vector machine (SVM), random forest (RF), Multilayer Perceptron (MLP), and XGBoost based on 20 geometric features [39]. In this paper, we use five different machine learning models with stacking and increase the number of geometric features, which improves the accuracy. In addition, a mathematical model and a deep learning model were previously developed based on 27 geometric features to tackle the problem [40,41]. Here, we use a larger number of geometric features and extend and recast those methods to improve the accuracy. Finally, the performance of the proposed mathematical models, deep learning model, and ensemble model is compared with a state-of-the-art model on a large dataset.

This paper presents three approaches for classifying SSE types using only the Cα atoms in protein chains, which is beneficial when atomic information is missing. A novel set of features is generated using the locations and relative positioning of neighboring Cα atoms in a chain. The first approach is a mathematical one that models each SSE as a subspace and the entire protein chain as a union of three subspaces. In this approach, a subspace is computed for each of the SSE types: α-helices, β-sheets, and loops. Unknown amino acids are classified with two methods. In the first method, the distance from the amino acid’s feature vector to each subspace is computed. In the second method, a local subspace is matched to each amino acid and the subspace distances on the Grassmannian manifold are computed. The second approach (Deep Learning) uses some categorical features in addition to the geometric features and employs two Network Architecture Search algorithms for selecting deep neural network architectures, layer connectivity, and regularization parameters. The third approach (Ensemble of Machine Learning) stacks five models: Random Forest, Logistic Regression, k-Nearest Neighbor, Multilayer Perceptron, and eXtreme Gradient Boosting.

2. Materials and Methods

2.1. Feature Generation

2.1.1. Geometric Features

Our mathematical and machine learning models are based on geometric features collected from the backbone of the protein structure, specifically the Cα trace (i.e., the Cα coordinates). These features describe the geometry of each Cα atom and its surrounding neighborhood. For each Cα atom, we calculate a vector of geometric features, Fα, that consists of 39 features. Fα can be divided into seven categories, each of which describes one aspect of the geometry around the Cα atom of interest. Therefore, Fα = (Rα, Eα, Dα, Vα, Tα, Mα, Nα).

Angle Features, Rα. This category describes the geometric arrangement of Cα atoms around the Cα of interest, Cαi. It contains three triangular angle values calculated around Cαi: angle(i−1,i,i+1), angle(i−2,i−1,i), and angle(i,i+1,i+2). angle(i−1,i,i+1) is the interior angle centered at the Cαi atom of the triangle formed by the three atoms (Cαi−1, Cαi, Cαi+1). Similarly, angle(i−2,i−1,i) is the interior angle centered at the Cαi−1 atom of the triangle formed by the three atoms (Cαi−2, Cαi−1, Cαi). The same idea applies to angle(i,i+1,i+2). Figure 1a shows an example of the three angles calculated around one Cαi.
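As an illustration, a minimal NumPy sketch of the Rα computation is given below; the function names and array layout are ours and not taken from the authors’ implementation, and chain-end handling is omitted.

    import numpy as np

    def interior_angle(a, b, c):
        # Interior angle (in degrees) at vertex b of the triangle formed by Ca atoms a, b, c.
        u, v = a - b, c - b
        cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

    def angle_features(ca, i):
        # R_alpha for residue i: angle(i-1,i,i+1), angle(i-2,i-1,i), angle(i,i+1,i+2).
        # `ca` is an (N, 3) array of Ca coordinates.
        return [interior_angle(ca[i - 1], ca[i], ca[i + 1]),
                interior_angle(ca[i - 2], ca[i - 1], ca[i]),
                interior_angle(ca[i], ca[i + 1], ca[i + 2])]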

Figure 1. Geometric features calculated for a given Cαi.

Euclidean Distance Features, Eα. This group of features is calculated by finding the Euclidean distance between the Cα atom of interest, Cαi, and other Cα atoms in its region. It consists of four Euclidean distances: dist(i−3,i), dist(i−2,i), dist(i,i+2), and dist(i,i+3). Figure 1b shows the four calculated distances around one Cαi in red dashed lines.

Axis Distance Features, Dα. This group of features is calculated by finding distances between Cα atoms and projections of nearby Cα atoms onto virtual axes constructed in the surrounding region. It consists of eight values: axisDist(i−2,i−1), axisDist(i,i+1), axisDist2(i,i+1), axisDist(i−1,i), axisDist2(i−1,i), axisDist(i+1,i+2), axisDist(i−3,i−2), and axisDist(i+2,i+3). For instance, to calculate axisDist(i−2,i−1), we create a virtual axis connecting Cαi−2 and Cαi+1, and the distance is measured between Cαi−2 and the projection of Cαi−1 onto this axis. Using the same virtual axis, we calculate axisDist(i,i+1) as the distance between the projection of Cαi and the coordinate of Cαi+1. The idea generalizes to all other axis distances: each time, a virtual axis is constructed and a distance is calculated between a Cα coordinate and the projection of another Cα coordinate, or between the projections of two Cα coordinates, as in axisDist2(i,i+1). axisDist2(i,i+1) is calculated between the projections of Cαi and Cαi+1 onto the virtual axis constructed between Cαi−2 and Cαi+2. Figure 1c shows the axis distances axisDist(i−2,i−1) and axisDist(i,i+1) on the axis between Cαi−2 and Cαi+1.

Vector Angle Features, Vα. This group of features is calculated by finding the angles between 3D vectors constructed around Cαi. It contains four values: vAngle(i−2→i, i−1→i+1), vAngle(i−1→i+1, i→i+2), vAngle(i−3→i−1, i−2→i), and vAngle(i→i+2, i+1→i+3). For instance, vAngle(i−2→i, i−1→i+1) is the angle between the vector constructed from the coordinates of Cαi−2 and Cαi and the vector constructed from the coordinates of Cαi−1 and Cαi+1. The same idea applies to the other values in this category. Figure 1d shows the vector angle vAngle(i−2→i, i−1→i+1), with the angle between the two vectors illustrated at the bottom.

Torsion Angle Features, Tα. The torsion angle is a dihedral angle: it describes the geometric conformation and relationship of two parts of a molecule connected by a bond and is the angle between two intersecting planes. Each plane is defined by three Cα coordinates, so the torsion angle can be calculated from four Cα coordinates: the first three define the first plane and the last three define the second plane. This category consists of four torsion angle values: torsion(i−2,i−1,i,i+1), torsion(i−1,i,i+1,i+2), torsion(i−3,i−2,i−1,i), and torsion(i,i+1,i+2,i+3). Each torsion angle is calculated from the coordinates of the four Cα atoms given. Figure 1e shows the torsion angle torsion(i−2,i−1,i,i+1).
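A minimal sketch of the torsion computation from four Cα coordinates follows; it uses a standard signed-dihedral formula rather than the authors’ own code.

    import numpy as np

    def torsion(p0, p1, p2, p3):
        # Torsion (dihedral) angle in degrees defined by four consecutive Ca coordinates.
        b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
        n1, n2 = np.cross(b1, b2), np.cross(b2, b3)        # normals of the two planes
        m1 = np.cross(n1, b2 / np.linalg.norm(b2))         # frame vector for a signed angle
        return np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))

    # T_alpha for residue i (chain-end handling omitted):
    # torsion(ca[i-2], ca[i-1], ca[i], ca[i+1]), torsion(ca[i-1], ca[i], ca[i+1], ca[i+2]), ...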

Miscellaneous Features, Mα. This group contains additional features for each Cαi. It contains five values: the amino acid type of residue i, and the four sums of corresponding values of Vα and Tα: vAngle(i−2→i, i−1→i+1) is added to torsion(i−2,i−1,i,i+1), vAngle(i−1→i+1, i→i+2) is added to torsion(i−1,i,i+1,i+2), vAngle(i−3→i−1, i−2→i) is added to torsion(i−3,i−2,i−1,i), and vAngle(i→i+2, i+1→i+3) is added to torsion(i,i+1,i+2,i+3).

Neighborhood Features, Nα. This category of features focuses on the Cα coordinate of interest, Cαi, and the shape and orientation of its surroundings. It is the largest category in terms of the number of values calculated, consisting of 11 values: four Euclidean distances, six scalar values, and one angular value for residue i. To calculate this group of features, we first find a set of candidate neighbors of residue i. The surroundings are scanned, and atoms around Cαi are added to a candidates list. Residue k is added to the candidates list if it is at least three residues apart from residue i (i.e., k < i−2 or k > i+2), the distance between Cαi and Cαk is less than 6.31 Å, and there is another residue k′ in the candidates list that is adjacent to k, i.e., |seqNumk − seqNumk′| = 1. After the initial candidates list of neighbor residues is created, we keep only strong candidates in a final list. Residue k is added to the final list if the distance between Cαk and the line segment formed between residues i−1 and i+1 (i.e., Cαi−1 and Cαi+1) is less than 5.81 Å and its projection falls inside that line segment. These features are mainly used to describe the geometry surrounding a residue on β-strands.

The features in Nαi are calculated using the Cα atoms in the final list. After the final list of neighbors is created, six scalar values are calculated: the number of neighbors in the list; the lengths of the three eigenvectors of the point cloud formed by the Cα atoms in the list; the Euclidean distance between residue i, Cαi, and residue j, Cαj, where residue j is the closest residue to residue i in the list; and the number of amino acids by which residues i and j are separated, seqDiff = seqNumi − seqNumj. Note that seqDiff can be negative if residue j comes after residue i in the sequence. Further, we calculate four Euclidean distances between the surroundings of i and j, where j is the closest residue in the neighbors list to residue i. These are the pairwise distances Cαi−1–Cαj−1, Cαi−1–Cαj+1, Cαi+1–Cαj−1, and Cαi+1–Cαj+1. Finally, Nαi contains one angular value: the angle between the vector constructed from the coordinates Cαi−1 and Cαi+1 and the vector constructed from the coordinates Cαj−1 and Cαj+1. Figure 1f shows an example of neighborhood features. In this example, the initial candidate list is found and then filtered down to a final list. The Cα coordinates in red are atoms that were in the initial candidate list and then removed from the final list, whereas the Cα coordinates in green are examples of atoms that made it to the final list. Cαj is the closest atom in the list to Cαi, and the figure shows, in dashed lines, the four distances calculated between the atoms Cαj−1, Cαj+1, Cαi−1, and Cαi+1. The two calculated vectors are shown, and the angle between them is illustrated in the top right corner of Figure 1f.
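A sketch of the neighbor selection is given below. The thresholds (6.31 Å and 5.81 Å) and the adjacency requirement come from the text; the function layout and the simplified adjacency check within the candidate set are our assumptions, and chain-end handling is omitted.

    import numpy as np

    def point_segment_distance(p, a, b):
        # Distance from point p to segment ab, and whether p projects inside the segment.
        ab = b - a
        t = np.dot(p - a, ab) / np.dot(ab, ab)
        closest = a + np.clip(t, 0.0, 1.0) * ab
        return np.linalg.norm(p - closest), 0.0 <= t <= 1.0

    def neighbor_list(ca, i, d_cand=6.31, d_final=5.81):
        # Candidate neighbors: at least three residues away and closer than d_cand to Ca_i.
        cand = [k for k in range(len(ca))
                if abs(k - i) > 2 and np.linalg.norm(ca[k] - ca[i]) < d_cand]
        # Keep only candidates that have a sequence-adjacent residue in the candidate list.
        cand = [k for k in cand if (k - 1 in cand) or (k + 1 in cand)]
        # Final list: close to, and projecting inside, the segment Ca_{i-1}..Ca_{i+1}.
        final = []
        for k in cand:
            d, inside = point_segment_distance(ca[k], ca[i - 1], ca[i + 1])
            if d < d_final and inside:
                final.append(k)
        return final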

2.1.2. Determining Relevant Features

Given a feature matrix, the reduction of features can be cast as Problem 1, whose solution is provided in Algorithm 1. In this algorithm, a new rank estimation technique that was initially introduced in [40] is used; some earlier techniques, such as those in [42,43], were too sensitive and not very effective.

Problem 1.

Let F be a d×N feature matrix whose columns represent Cα atoms, where each atom has d features, i.e., N atoms in ℝ^d.

  • 1. Determine k ≤ d, the number of most relevant features.

  • 2. Determine those k features.

  Algorithm 1: Reduction of features.
   Require: d × N feature matrix F.
   1: Estimate the effective rank k of F (using the rank estimation technique in [40]).
   2: Find a sub-matrix with k rows and call it Fk.
   3: while effective-rank(Fk) ≠ k do
   4:     Find another sub-matrix with k rows and call it Fk.
   5: end while
   6: The k features corresponding to the rows of Fk are the most relevant features.
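A minimal sketch of Algorithm 1 follows. The rank estimation rule of [40] is not reproduced here; the energy-threshold estimate standing in for it, and the random search over row subsets, are our assumptions.

    import numpy as np

    def effective_rank(M, energy=0.99):
        # Stand-in for the rank estimation in [40]: smallest k whose leading singular
        # values capture the given fraction of the total spectral energy.
        s = np.linalg.svd(M, compute_uv=False)
        c = np.cumsum(s ** 2) / np.sum(s ** 2)
        return int(np.searchsorted(c, energy) + 1)

    def reduce_features(F, max_tries=1000, seed=0):
        # Algorithm 1 sketch: F is d x N (features x atoms); returns indices of k relevant features.
        rng = np.random.default_rng(seed)
        k = effective_rank(F)
        for _ in range(max_tries):
            rows = rng.choice(F.shape[0], size=k, replace=False)
            if effective_rank(F[rows, :]) == k:      # a k-row submatrix of full effective rank
                return np.sort(rows)
        raise RuntimeError("no k-row submatrix with effective rank k found")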

2.2. Mathematical Approach

According to the manifold hypothesis, high-dimensional data in real-world problems tend to lie on lower-dimensional manifolds (or subspaces) [44]. For example, a set of face images (30 by 30 pixels) of a person with different facial expressions lives in ℝ^900 (the ambient space) but lies on a much lower-dimensional manifold. It has been experimentally demonstrated that the face images of a person with the same facial expression under different illumination conditions lie on a 9-dimensional subspace of a very high-dimensional ambient space [45]. It was also shown mathematically that trajectories of rigid body motions lie on 4-dimensional subspaces of a high-dimensional ambient space [46,47,48]. In other words, real-world data may live in a high-dimensional space ℝ^n but typically come from a union of M lower-dimensional subspaces Si, i.e., U = S1 ∪ S2 ∪ ⋯ ∪ SM, where di < n is the dimension of subspace Si. In this research, Cα traces are grouped with a sliding window so that each group represents a data point in a high-dimensional ambient space. Then, a lower-dimensional subspace is matched to the data points of each SSE type. When a group of unknown Cα traces is presented, a neighborhood of each Cα is determined and a local subspace is matched to the Cα traces inside each neighborhood. Then, the separation of this local subspace from each SSE subspace is computed using the geodesic distance on the Grassmannian manifold of subspaces [49]. A simpler subspace projection approach is also developed for computing the distance between a data point in the ambient space and each SSE subspace.

2.2.1. SSE Subspace Modeling

Each SSE and intersecting region is represented with a subspace. A sliding-window approach is developed for representing each Cα in the training set as a high-dimensional data point (Figure 2). The problem can be cast as in Problem 2 and a solution is provided in Algorithm 2.

Figure 2. Subspace modeling with local subspace matching and separation.

Problem 2.

Let each Cα atom have d geometric features, i.e., Cα(i) ∈ ℝ^d for all i ≤ Na, where Na is the number of amino acids in the training protein. Assume that the window size is q; each window then includes q atoms with Cα(i) at the center. Determine a subspace for each SSE type: SH, SB, and SL, representing the subspaces for helices, sheets, and loops, respectively.

First, a data matrix for each SSE and intersecting region is constructed by concatenating the d features of each of the q atoms in a window into a single column vector. Then, the window is slid by a certain number of atoms and the second column is constructed. If the new window falls in another SSE, the column is placed in the corresponding data matrix instead. Finally, Singular Value Decomposition (SVD) is used to match a suitable subspace to each data matrix.

  Algorithm 2: SSE subspace matching.
   Require: q: window size, z: window-sliding size, Na: the number of amino acids.
   1: Create empty data matrices WH, WB, and WL.
   2: Form the first possible window.
   3: for all i ≤ Na do
   4:     Form the (qd)×1 column vector wi = [Cα(i−q/2+1) ⋯ Cα(i) ⋯ Cα(i+q/2)]^T.
   5:     Expand the corresponding data matrix by adding wi as a new column.
   6:     Slide the window by z.
   7: end for
   8: for all data matrices do
   9:     Compute the SVD. For example, WH = UH ΣH VH^T.
   10:    Estimate the rank (using the rank estimation technique in [40]). For example, rank(WH) = rH.
   11:    Compute a subspace. For example, SH = span(uH,1, …, uH,rH), i.e., the span of the first rH columns of UH.
   12: end for
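A minimal NumPy sketch of Algorithm 2 is shown below, assuming per-residue feature vectors and per-residue SSE labels ('H', 'B', 'L') are already available; the symmetric window centering, the default window size, and the stand-in rank estimator are our assumptions.

    import numpy as np

    def effective_rank_sv(s, energy=0.99):
        # Assumed stand-in for the rank estimation technique in [40].
        c = np.cumsum(s ** 2) / np.sum(s ** 2)
        return int(np.searchsorted(c, energy) + 1)

    def sse_subspaces(features, labels, q=7, z=1):
        # `features` is an N_a x d array of per-residue feature vectors; `labels` holds the
        # SSE type ('H', 'B', or 'L') of each residue; q is the window size, z the slide size.
        half = q // 2
        columns = {"H": [], "B": [], "L": []}
        for i in range(half, len(features) - half, z):
            w = features[i - half:i + half + 1].reshape(-1)   # (q*d,) stacked window vector
            columns[labels[i]].append(w)                      # window assigned by its center residue
        subspaces = {}
        for sse, cols in columns.items():
            W = np.column_stack(cols)                         # (q*d) x (#windows) data matrix
            U, s, _ = np.linalg.svd(W, full_matrices=False)
            subspaces[sse] = U[:, :effective_rank_sv(s)]      # basis of the SSE subspace
        return subspaces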

2.2.2. Projections on SSE Subspaces and Classification

In order to classify a Cα atom in a test protein, a group of other Cα atoms in its neighborhood is identified, and then two approaches are adopted. A subspace for each SSE has already been determined in the previous section. For example, WH = UH ΣH VH^T, and SH is the subspace spanned by the first rH columns of UH, where rH is the effective rank of WH. In this case,

  • Let ŨH = UH(:, 1:rH) be UH truncated after the first rH columns.

  • Let wi ∈ ℝ^(qd) be the data vector for the Cα(i) being classified. Note that the same window size as in training is used.

The distance between wi and SH is computed simply by projecting wi onto SH:

di = ‖(I − ŨH ŨH^T) wi‖2  (1)

Then, Cα(i) is classified according to its shortest distance to the SSE subspaces.
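A sketch of this projection-based classification, assuming the truncated bases from Algorithm 2 are stored in a dictionary keyed by SSE type:

    import numpy as np

    def subspace_distance(w, U_r):
        # Eq. (1): distance from window vector w to the subspace spanned by the columns of U_r,
        # i.e., the norm of the residual (I - U U^T) w.
        return np.linalg.norm(w - U_r @ (U_r.T @ w))

    def classify_by_projection(w, subspaces):
        # Assign the SSE type whose subspace is closest to w.
        return min(subspaces, key=lambda sse: subspace_distance(w, subspaces[sse]))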

2.2.3. Local Subspaces and Classification

In this approach, a local subspace is generated to represent the Cα(i) atom that is being classified. In order to do this,

  • A group of neighbors of Cα(i) is identified as illustrated in Figure 2. Each Cα atom in the neighborhood is represented in ℝ^(qd) using the same window size as before.

  • Let Nl be the number of atoms in each neighborhood.

  • Construct the (qd)×Nl matrix L whose columns are the representations of the Cα atoms in the neighborhood.

  • Compute the SVD L = Ul Σl Vl^T.

  • Let Slocal be the local subspace spanned by the columns of Ul.

Cα(i) is classified based on the separation of Slocal from the SSE subspaces. There are different measures of separation between subspaces. Each subspace can be represented as a point on a Grassmannian manifold [49], and various distances such as the geodesic arc length, chordal distance, or projection distance can be considered. In this work, the chordal distance is used as follows:

d = ∑_{j=1}^{p} sin²θj  (2)

where θ1, θ2, …, θp are the principal angles between the two subspaces. To find the principal angles between Slocal and SH, orthonormal bases Ql and QH are first computed using SVD. Then, a new matrix Q = Ql^T QH is formed. Let 1 ≥ σ1 ≥ σ2 ≥ ⋯ ≥ σp ≥ 0 be the singular values of Q; the principal angles are then given by

θk = arccos(σk),  k = 1, …, p.  (3)

All distances between the local subspace and each SSE subspace are calculated, and Cα(i) is classified according to the shortest distance.
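A sketch of the local-subspace classification using Equations (2) and (3); the QR orthonormalization and the dictionary of SSE bases carry over as assumptions from the earlier sketches.

    import numpy as np

    def principal_angles(A, B):
        # Principal angles between the column spans of A and B (Eq. (3)).
        Qa, _ = np.linalg.qr(A)
        Qb, _ = np.linalg.qr(B)
        s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
        return np.arccos(np.clip(s, -1.0, 1.0))

    def chordal_distance(A, B):
        # Eq. (2): sum of squared sines of the principal angles.
        return np.sum(np.sin(principal_angles(A, B)) ** 2)

    def classify_local_subspace(L, subspaces):
        # Fit a local subspace to the neighborhood matrix L and pick the closest SSE subspace.
        Ul, _, _ = np.linalg.svd(L, full_matrices=False)
        return min(subspaces, key=lambda sse: chordal_distance(Ul, subspaces[sse]))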

2.2.4. Post Processing

To improve the accuracy of the classifier for some residues, we apply a post-processing step. This step is important for classifying residues whose Fα vector, or a portion of it, is missing due to a missing Cα coordinate. Further, this step is used for residues that are classified as one of the three SSE types but are isolated, with the entire surrounding region being of a different type; for example, a single residue classified as helix while all of the residues before and after it are classified as loop. This can occur when the distances from the residue to the helix subspace and to the loop subspace are very close to each other.

We first correct erroneous helix classifications and change them to either sheet or loop based on the Nα features. If a residue is classified as helix but has more than two neighbors, the classification is likely wrong. Therefore, we check the number of distance values in Nα: if fewer than three of them are less than 6, the residue is changed to loop; otherwise, it is changed to sheet, because the high compactness of the region indicates a strand.

Similarly, we change the classification of some residues from loop to sheet if the number of neighbors in Nα is greater than four and more than two of the distance values in Nα are less than 6. On the other hand, we change the classification to helix if the number of neighbors in Nα is less than three and exactly five of the distance values are less than 6. Finally, if a residue is isolated inside a group of residues of another type, we change its classification to match the type of the group.
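The rules above can be condensed into a small sketch; how the Nα quantities are passed in (neighbor count and number of distances below 6) is our assumption, and the final isolated-residue smoothing rule is omitted.

    def post_process(label, n_neighbors, n_close):
        # `label` is the initial prediction ('H', 'B', or 'L'); `n_neighbors` is the neighbor
        # count in N_alpha; `n_close` is how many N_alpha distance values are below 6.
        if label == "H" and n_neighbors > 2:
            return "L" if n_close < 3 else "B"     # implausible helix -> loop or sheet
        if label == "L" and n_neighbors > 4 and n_close > 2:
            return "B"                             # compact neighborhood -> sheet
        if label == "L" and n_neighbors < 3 and n_close == 5:
            return "H"
        return label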

2.3. Deep Learning Approach

2.3.1. Dataset

In this study, a dataset of 3946 proteins consisting of 904,081 amino acids and their 39 features is used. Out of the 39 features, one corresponds to the name of the amino acid. To convert such a qualitative feature into a quantitative one for training, a one-hot-encoded representation is used. Given that there are 20 amino acids, the name element in the feature vector is replaced with a one-hot-encoded vector of length 20, resulting in a feature vector of size 58. The dataset contains the ground truth SSE types as labels, i.e., α-helices, β-sheets, and loops, as explained in Section 1. The SSE type labels are similarly one-hot-encoded into a label vector of size 3. The dataset is divided into training, validation, and test sets containing 70%, 20%, and 10% of the protein chains, respectively. The 37 angle- and distance-based features in each set are standardized (μ=0, σ=1) using the mean and standard deviation obtained from the training set.

Protein chains are represented as n×57 feature matrices for n amino acids. Each protein chain is padded with 3×57 empty matrices on both ends. A rolling window of size 7 is shifted along each protein chain, resulting in n input feature matrices of size 7×57. Each 7×57 input matrix is checked for empty (non-existing) feature values. One 7×57 matrix with ones for non-empty features and zeros for empty features is created; similarly, another 7×57 matrix with zeros for non-empty features and ones for empty features is created. Then, the empty features inside the input matrix are replaced with zeros. Finally, all three matrices are stacked to obtain a tensor of size 7×57×3. Each input tensor is used to predict a single SSE element; the amino acid of interest sits in the fourth row of this tensor. To summarize, when generating the input, not only the amino acid of interest but also the neighboring amino acids are considered, and empty features in the input are taken into account. As a result, accuracy is improved with respect to previous work [41].
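A sketch of this tensor construction, assuming the per-residue features are held in an n×57 array with missing entries marked as NaN (the NaN convention is our assumption):

    import numpy as np

    def build_input_tensors(features, window=7):
        # `features` is an (n, 57) array; pad 3 empty rows on each end, slide a 7-row window,
        # and stack the data with presence/absence masks into (n, 7, 57, 3).
        pad = np.full((window // 2, features.shape[1]), np.nan)
        padded = np.vstack([pad, features, pad])
        tensors = []
        for i in range(len(features)):
            win = padded[i:i + window]                       # 7 x 57 window, residue i in row 4
            present = (~np.isnan(win)).astype(np.float32)    # ones where a feature exists
            absent = 1.0 - present                           # ones where a feature is missing
            data = np.nan_to_num(win, nan=0.0)               # missing features replaced with zeros
            tensors.append(np.stack([data, present, absent], axis=-1))
        return np.asarray(tensors, dtype=np.float32)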

2.3.2. Network Architecture and Training Parameters

A deep neural network architecture is used in this work. The network takes the flattened 7×57×3 tensor as 855 input neurons and passes it through a series of fully connected layers. In our previous work [41], a neural architecture search (NAS) algorithm was utilized to select hyper-parameters such as the batch size and the number of hidden layers. The NAS algorithm creates and trains neural networks by selecting hyper-parameters from a search space; the hyper-parameters yielding the best neural network performance are then selected. In this work, the hyper-parameters selected by the NAS algorithm in the previous work are used. The network uses categorical cross-entropy as the loss function and ADAM as the optimization method for training. The learning rate is reduced automatically during training whenever the evaluation measures stop improving for multiple epochs, until a minimum threshold is reached.
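A minimal training-setup sketch in TensorFlow/Keras is given below; the layer widths are placeholders rather than the NAS-selected hyper-parameters of [41], and the plateau-based learning-rate schedule mirrors the description above.

    import tensorflow as tf

    def build_network(hidden=(512, 256, 128)):
        # Flatten the 7 x 57 x 3 input tensor and pass it through fully connected layers.
        inputs = tf.keras.Input(shape=(7, 57, 3))
        x = tf.keras.layers.Flatten()(inputs)
        for units in hidden:
            x = tf.keras.layers.Dense(units, activation="relu")(x)
        outputs = tf.keras.layers.Dense(3, activation="softmax")(x)   # helix / sheet / loop
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["categorical_accuracy"])
        return model

    # Reduce the learning rate when the validation loss stops improving for several epochs.
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                                     patience=3, min_lr=1e-6)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[reduce_lr], ...)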

2.3.3. Evaluation Measures

After the SSE output vectors are obtained for the amino acids, the SSE type with the highest probability is used to label each amino acid. Accordingly, categorical accuracy is used as the evaluation measure.

2.3.4. Joint Prediction of Multiple Architectures

Deep neural networks are typically applied to non-convex problems such as SSE classification. Because of the random initialization of the network weights, each training run results in a different network, possibly converging to a different local minimum. If the network hyper-parameters are properly tuned for a sufficiently complex architecture, a smoother loss surface is expected for optimization; in other words, each training run is expected to have similar performance. However, this does not necessarily mean that each network correctly labels the same set of samples. For this reason, the predictions of multiple neural networks can be joined (ensembled): their prediction probabilities are added together and the class with the maximum summed probability is selected as the label. In most cases this results in higher accuracy and robustness compared to using a single trained network, as shown in Table 1. The multiple neural networks trained for this method are identical in all aspects except the number of neurons, which is selected from a small range around the values obtained by the NAS algorithm in previous work [41]. The accuracy of the individual neural networks and the accuracy obtained by joint prediction are reported in Section 3.

Table 1.

Neural Network accuracy on validation set.

Network ID Validation Accuracy
#1 94.12%
#2 94.14%
#3 94.17%
#4 94.13%
#5 94.24%
#6 94.24%
#7 94.11%
#8 94.15%
#9 94.27%
#10 93.88%
Joint Prediction 95.06%
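The joint prediction itself reduces to summing the per-class probabilities of the trained networks; a minimal sketch, assuming each model exposes a Keras-style predict method returning an (n, 3) probability array:

    import numpy as np

    def joint_predict(models, X):
        # Sum the per-class probabilities of all trained networks and take the arg-max class.
        summed = sum(model.predict(X) for model in models)
        return np.argmax(summed, axis=1)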

2.4. Ensemble of Machine Learning Models Approach (EML)

In this approach, five machine learning (ML) models with stacking are used to assign SSEs from the geometry of the Cα trace. The method assigns one of the standard SSE types (helix, sheet, loop) to each residue of a given protein. The ML models used initially are Random Forest (RF), Logistic Regression (LR), K-Nearest Neighbor (KNN), Multilayer Perceptron (MLP), and eXtreme Gradient Boosting (XGBoost). The approach uses the set of geometric features collected for each Cα atom, Fα. The ensemble model goes through the following steps: dataset determination, data preparation and cleaning, training, fine tuning, and stacking. The result is a 3-state classifier.

Data set and cleaning: We used the list of protein chains in cullpdb_pc20_res1.8_R0.25_d200528_chains5510 from the PISCES server [50] to build our dataset. The list contains 5510 PDB chains in total, with a maximum R-factor of 0.25, a resolution of 1.8 Å or better, and a sequence identity of 20% or less. We excluded any protein with the following: PDB chains missing SSE information, PDB chains that include an insertion code, PDB chains that include unknown amino acids, and PDB chains missing any Cα coordinates. After cleaning, a total of 3946 PDB files/chains remained in the candidate list, which we call set I. The total number of residues in set I is approximately 868 K. Note that the total number of cleaned residues in the dataset differs from the total number of residues in the Deep Learning approach, since the two approaches use different methods to clean the data. Set I was divided into two sets, T and S. Set S consists of 300 proteins (i.e., 69,491 residues) for testing, and the remaining proteins were kept in set T, which contains approximately 799 K residues. A set of 600 K residues (i.e., 200 K from each SSE type) was chosen randomly from set T to train our ML ensemble model. This 600 K set was divided into two parts: set Tr for training, with 480 K residues (i.e., 80%), and set Ts, with 120 K residues (i.e., 20%), for individual model evaluation and testing.

Preparation: When the distributions of the geometric features Fα for Tr were plotted, nine of the 39 features had skewed distributions (data not shown). Therefore, the geometric features were preprocessed by standardization with a quantile transformation [51] and selected with a best-k method. The features were standardized with a quantile transformation with 200 bins, which makes each transformed feature approximately Gaussian. After standardization, we applied a best-k feature selection method based on the ANOVA F-value for k between 1 and 39. Selecting the best k features was conducted in two stages. In stage one, the best combination of k features was selected, yielding 39 feature combinations, one for each k value from 1 to 39. In stage two, each of the 39 combinations was evaluated by 10-fold cross-validation, with accuracy as the evaluation metric. The feature combination with the best performance was used to train a model. For our SSE identifier model, we chose the entire set of 39 features, and therefore k=39.
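A sketch of this preparation step with scikit-learn; the classifier used to score each k (a default Random Forest) is our assumption, as the text does not state which model scored the combinations.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import QuantileTransformer

    def best_k_features(X, y, n_features=39):
        # Quantile standardization with 200 bins, then an ANOVA-F best-k search scored by
        # 10-fold cross-validated accuracy; returns the k with the best mean accuracy.
        scores = {}
        for k in range(1, n_features + 1):
            pipe = make_pipeline(
                QuantileTransformer(n_quantiles=200, output_distribution="normal"),
                SelectKBest(f_classif, k=k),
                RandomForestClassifier())
            scores[k] = cross_val_score(pipe, X, y, cv=10, scoring="accuracy").mean()
        return max(scores, key=scores.get)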

Training: Five classification models were selected with default parameters: Random Forest (RF), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Multilayer Perceptron (MLP) were implemented with the scikit-learn 0.23 machine learning package, and the XGBoost model was implemented with the XGBoost Python module. Both accuracy and F1 scores were used as the evaluation metrics, and each ML model was evaluated with 10-fold cross-validation. The RF and XGBoost models had better accuracy and F1 scores than the other three models. The accuracies were: RF (95.6%), LR (86.4%), KNN (90.2%), MLP (91.5%), and XGBoost (93.2%). The F1 scores were: RF (95.6%), LR (86.4%), KNN (90.1%), MLP (91.4%), and XGBoost (93.2%). We chose to proceed with the four highest-scoring models for the following steps, fine tuning and stacking.

Fine Tuning and Stacking: The four ML models were further fine-tuned, searching for the best parameter combinations using grid search with 10-fold cross-validation, and the best parameters for each model were selected. For example, an RF model with 1500 trees and five maximum features, an MLP model with one 50-neuron hidden layer, and an XGBoost model with learning rate eta = 0.2, maximum depth = 6, and 300 trees were generated as the fine-tuned models; the other parameters use the default values. Finally, the four fine-tuned models were ensembled with a stacking approach in which the outputs of the four models were used as the inputs to a logistic regression model with default parameters [52]. The models trained from the stacking were delivered as the final models.
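A sketch of the final stacked ensemble with scikit-learn and XGBoost; only the parameter values quoted above are set, and everything else is left at library defaults.

    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from xgboost import XGBClassifier

    # Four fine-tuned base models feed a logistic regression meta-learner.
    base_models = [
        ("rf", RandomForestClassifier(n_estimators=1500, max_features=5)),
        ("knn", KNeighborsClassifier()),
        ("mlp", MLPClassifier(hidden_layer_sizes=(50,))),
        ("xgb", XGBClassifier(learning_rate=0.2, max_depth=6, n_estimators=300)),
    ]
    stacked = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
    # stacked.fit(X_train, y_train); predictions = stacked.predict(X_test)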

3. Results

To evaluate the performance of our models, we validate each one individually. The three models (i.e., Mathematical Subspace, Deep Learning, and EML), which all use the same dataset (i.e., the PISCES dataset), are trained and validated individually. In this section, we report the evaluation of each model separately and, at the end of the section, test all models against one of the existing ML approaches in the literature (i.e., PCASSO) using the same benchmark.

3.1. Results: Subspace Segmentation Approach

The performance of these two models was evaluated using set S (i.e., 300 proteins randomly chosen). This test is used to compare the performance of these models individually and with other models.

For a detailed analysis of the performance of these classifiers on the S dataset, we include the performance tables and confusion matrices in Table 2, Table 3, Table 4 and Table 5. The total number of residues tested is 68,572. For Model-1 (i.e., the distance-based subspace model), Table 3 shows that the true positives are 26,681, 12,022, and 18,809 for helices, sheets, and loops, respectively. Our classifier was able to assign 26,681 out of 31,521 residues to the helix class correctly. Relative to the total number of residues in each class, the helix class had the highest true positive rate (i.e., 84.6%) and the loop class had the lowest (i.e., 82.9%). This is expected since the geometric shape of a loop is flexible and irregular. Further, the helix class has the fewest false positive cases (i.e., 1009 cases), whereas the sheet class has the fewest false negative cases (i.e., 2351). For Model-2 (i.e., the local-subspace model), Table 5 shows that the true positives were 26,082 out of 31,521 for the helix class, 11,043 out of 14,373 for the sheet class, and 16,803 out of 22,678 for the loop class. Again, the helix class is the best-performing class and the loop class is the worst-performing. For false positive cases, the helix class is again the best performing, with only 1236 cases. Comparing the two, Model-1 performs better than Model-2: the accuracy of Model-1 is 83.9% and that of Model-2 is 78.6%, as shown in Table 2 and Table 4.

Table 2.

The performance of Model-1 approach on S data.

Precision Recall F1-Score Accuracy
Helix 96.4% 84.6% 90.1% 83.9%
Sheet 79.2% 83.6% 81.3%
Loop 73.2% 82.9% 77.7%

Table 3.

The confusion matrix for Subspace Model-1 for S data.

Observed/Predicted Helix Sheet Loop Total
Helix 26,681 (84.65%) 286 (0.90%) 4554 (14.45%) 31,521
Sheet 8 (0.06%) 12,022 (83.64%) 2343 (16.30%) 14,373
Loop 1001 (4.41%) 2868 (12.65%) 18,809 (82.94%) 22,678
Total 27,690 15,176 25,706 68,572

Table 4.

The performance of Model-2 approach on S data.

Precision Recall F1-Score Accuracy
Helix 95.4% 82.7% 88.6% 78.6%
Sheet 69.2% 76.8% 72.8%
Loop 66.4% 74.1% 70.0%

Table 5.

The confusion matrix for Subspace Model-2 for S data.

Observed/Predicted Helix Sheet Loop Total
Helix 26,082 (82.74%) 248 (0.79%) 5191 (16.47%) 31,521
Sheet 26 (0.18%) 11,043 (76.83%) 3304 (22.99%) 14,373
Loop 1210 (5.34%) 4665 (20.57%) 16,803 (74.09%) 22,678
Total 27,318 15,956 25,298 68,572

3.2. Results: Deep Learning Approach

The performance of this model is evaluated using two sets of test data. The first set is the 10% test partition of the dataset, which consists of 410 protein chains; for this set, the joint prediction accuracy is 95.08%. The other set (set S) consists of 300 proteins selected from these 410 protein chains and is used as a common test set by all approaches in this paper. The precision, recall, F1 score, and accuracy of the joint prediction model on the S data are given in Table 6, and the corresponding confusion matrix is shown in Table 7.

Table 6.

The performance of deep learning approach on S data.

Precision Recall F1-Score Accuracy
Helix 98.09% 97.33% 97.70% 95.12%
Sheet 92.35% 93.88% 93.11%
Loop 92.74% 92.79% 92.77%

Table 7.

The confusion matrix for deep learning approach for S data.

Observed/Predicted Helix Sheet Loop Total
Helix 30,918 (98.09%) 24 (0.07%) 579 (1.84%) 31,521
Sheet 44 (0.31%) 13,274 (92.35%) 1055 (7.34%) 14,373
Loop 804 (3.55%) 842 (3.71%) 21,032 (92.74%) 22,678
Total 31,766 14,140 22,666 68,572

3.3. Results: Ensemble of Machine Learning Models

The performance of this model was evaluated using two sets of test data. The first set is Ts (i.e., 20% of the training data), as explained above; Ts consists of 120 K randomly selected Cα atoms (i.e., 40 K from each type, at residue level). The other set, S, is our common test set, which is used to compare the performance of this model with the other models. Ts is used to evaluate the performance of this 3-state classifier individually.

We used four metrics to report the performance of this model on the Ts dataset in Table 8: precision, recall, F1 score, and accuracy. Since the model is a 3-state classifier, its precision, recall, and F1 each contain three numbers representing the scores for helix, sheet, and loop. The F1 scores are 97%, 97%, and 95% for helix, sheet, and loop, respectively, showing that our classifier has similar performance on helix and sheet SSEs and a slightly worse ability to assign loop residues. The total accuracy of the system (i.e., 96.3%) shows that this classifier classifies the three secondary structure elements well.

Table 8.

The performance of EML classifier on Ts data.

Precision Recall F1-Score Accuracy
Helix 97% 97% 97% 96.3%
Sheet 97% 97% 97%
Loop 95% 95% 95%

For a detailed analysis of the performance of our classifier on the Ts dataset, we include the confusion matrix in Table 9. The total number of residues in the test is 120 K (i.e., 40 K from each class). The table shows that the sheet class had the highest number of true positives: our classifier assigned 38,895 out of 40,000 residues to the sheet class correctly. The helix class shows a similar level of true positives, and the loop class is classified with slightly lower accuracy, which is expected given the flexible shape of loops. The helix class has the fewest false positive cases (i.e., 1087 cases), whereas the sheet class has the fewest false negative cases (i.e., 1105); helix class performance is very similar, with 1111 false negative cases. From Table 9, we conclude that the performance on the helix and sheet classes is comparable and the loop class comes last. In addition, the classifier confuses loop and sheet residues the most: for instance, 1088 sheet residues were predicted as loops and 1129 loop residues were predicted as sheets. The classifier is much more successful at differentiating between the helix and sheet classes.

Table 9.

The confusion matrix for EML classifier for Ts data.

Observed/Predicted Helix Sheet Loop Total
Helix 38,889 (97.22%) 24 (0.06%) 1087 (2.72%) 40,000
Sheet 17 (0.04%) 38,895 (97.24%) 1088 (2.72%) 40,000
Loop 1070 (2.67%) 1129 (2.82%) 37,801 (94.50%) 40,000
Total 39,976 40,048 39,976 120,000

We used the same four metrics to report the performance of this model on the S dataset (i.e., our 300-protein common test set) in Table 10. The F1 scores are 96%, 93%, and 90% for helix, sheet, and loop, respectively, showing that our classifier performed better on the helix class than on the other two and slightly worse on sheet and loop residues. The total accuracy of the system (i.e., 93.51%) shows that this classifier performed slightly worse on the S dataset than on the Ts dataset.

Table 10.

The performance of EML classifier on S data.

Precision Recall F1-Score Accuracy
Helix 97% 96% 96% 93.51%
Sheet 92% 93% 93%
Loop 89% 90% 90%

For a detailed analysis of the performance of our classifier on the S dataset, we include the confusion matrix in Table 11. The total number of residues in the test is 65,996. The table supports the results obtained with the Ts data: the performance of the system on helix and sheet data is better than on loop data. This is expected since the geometry of a helix is easier to detect; further, the neighborhood features helped to distinguish many sheet residues. Note that the number of residues processed by each of the models developed in this paper differs slightly because each model processes and cleans the data differently.

Table 11.

The confusion matrix for EML classifier for S data.

Observed/Predicted Helix Sheet Loop Total
Helix 29,654 (95.85%) 26 (0.08%) 1259 (4.07%) 30,939
Sheet 15 (0.11%) 13,064 (92.93%) 979 (6.96%) 14,058
Loop 935 (4.45%) 1066 (5.08%) 18,998 (90.47%) 20,999
Total 30,604 14,156 21,236 65,996

3.4. Results: Existing Approach

As a comparison of accuracy, we include in Table 12 and Table 13 the results generated by PCASSO [28] on the same set S of 300 proteins. To generate this comparison, PCASSO was used to produce SSE predictions for each of the proteins in dataset S; those predictions were then compared with the SSEs documented in the PDB entry of each protein.

Table 12.

The performance of PCASSO approach on S data.

Precision Recall F1-Score Accuracy
Helix 98.3% 77.7% 86.8% 84.3%
Sheet 87.1% 86.4% 86.7%
Loop 69.8% 92.8% 79.7%

Table 13.

The confusion matrix for PCASSO.

Observed/Predicted Helix Sheet Loop Total
Helix 24,702 (77.67%) 349 (1.10%) 6751 (21.23%) 31,802
Sheet 12 (0.09%) 12,233 (86.35%) 1921 (13.56%) 14,166
Loop 413 (1.92%) 1145 (5.31%) 20,001 (92.77%) 21,559
Total 25,127 14,036 28,673 67,526

PCASSO is a well-established, widely used tool for predicting SSEs based on Cα traces. PCASSO uses a set of geometric features derived from the Cα locations and from calculated pseudocenter locations based on the geometric center of Cαi and Cαi+1. PCASSO computes 43 features per Cα, and by combining these features it has 258 features available per location; it uses 16 of these available features and an RF with 50 trees to make its predictions. A confusion matrix for PCASSO is presented in Table 13.

3.5. Results: Summary

To compare the performance of the classifiers developed in this paper with each other and with one of the existing methods in the literature (i.e., PCASSO), we evaluated all of them on a unified dataset, S, consisting of 300 randomly chosen proteins. Each protein has a sequence identity of 20% or less with any protein in the set used to train our classifiers (i.e., set T). The dataset is composed of high-resolution protein structures (resolution range: 0.92 Å to 1.75 Å) and contains a variety of protein sizes: the smallest protein has 12 residues (1T7M chain B), the largest has 1053 residues (6DT6 chain A), and the average size is 228.57 residues. The output of each classifier is compared with the SSE assignment from the PDB file of each protein. If a residue assignment matches the assignment in the PDB file, it is a hit; otherwise, it is a miss. The accuracy is the total number of hits over the total number of residues in the protein. Table 14 reports the accuracy of all classifiers on these proteins (i.e., the S dataset). From the table, we conclude that the DL approach is the best performer on this dataset, followed by EML; the worst performer is Subspace Model-2, and PCASSO ranks third.

Table 14.

The accuracy of all models on S dataset.

Dataset EML Model-1 Model-2 DL PCASSO
S 93.51 83.9 78.6 95.12 84.3

To show the performance of the models on some of the proteins in dataset S, we chose 30 random proteins from S: 10 proteins on which the models do not perform well, 10 on which the models perform well, and 10 on which the models perform on average. Their performance is shown in Table 15. The average size of the selected proteins is 131.13 residues; the largest protein is 6DWD (PDB ID) chain D with 481 residues and the smallest is 3SSB (PDB ID) chain I with 30 residues.

Table 15.

The performance of all models on test data.

Num Protein ID a Chain ID b #AA c EML% d Subspace I% e Subspace II% f DL% g PCASSO% h
1 3SSB I 30 75.0 83.9 78.6 80.0 26.7
2 2END A 137 75.6 83.8 78.7 76.6 81.8
3 1ZUU A 56 80.0 84.0 78.8 78.6 80.4
4 4UE8 B 37 63.3 83.7 78.5 67.6 48.6
5 5W82 A 100 77.4 84.0 78.8 77.0 94.0
6 3QR7 A 115 80.7 84.1 78.8 73.9 95.2
7 3NGG A 46 82.5 83.9 78.8 87.0 84.4
8 3X34 A 87 81.3 84.1 78.8 83.9 63.3
9 1KVE A 63 75.5 83.9 78.7 73.0 90.5
10 5DBL A 130 86.3 84.0 78.7 94.6 66.7
11 4KK7 A 385 93.7 83.7 78.5 94.0 88.8
12 3MAO A 105 91.9 84.0 78.8 95.2 87.6
13 5QS9 A 171 95.2 83.8 78.7 96.0 90.6
14 6DWD D 481 94.2 83.5 78.4 96.7 85.2
15 4G9S B 111 98.1 83.6 78.7 98.2 30.1
16 3ZVS A 158 94.7 83.9 78.7 96.2 96.6
17 3QL9 A 125 93.3 84.0 78.7 93.6 87.9
18 4ZFL A 229 94.1 83.8 78.7 93.4 86.7
19 5OBY A 365 94.1 83.5 78.5 96.7 86.1
20 6B1K A 114 93.3 84.2 78.8 90.4 85.1
21 4ONR A 147 100 84.0 78.8 99.3 93.8
22 2IC6 A 71 100 84.1 78.8 100 94.4
23 3HE5 B 48 100 84.2 78.9 97.9 95.7
24 3LDC A 82 100 84.0 78.8 100 91.5
25 4ABM A 79 100 83.8 78.8 100 94.9
26 4I6R A 77 100 84.0 78.7 100 62.0
27 4WZX A 87 100 84.0 78.8 98.9 87.2
28 5OI7 A 88 100 83.9 78.5 98.9 95.4
29 3D3B A 139 100 84.2 78.8 100 84.2
30 5XAU B 71 100 84.1 78.8 98.6 97.2
Average 131.13 90.7 83.9 78.7 91.2 81.8

a: Protein ID (4 letters/digits) as in the PDB; b: The ID of the chain used in the experiment; c: The total number of amino acids in the chain; d: The percentage accuracy of the EML approach; e: The percentage accuracy of Subspace Model I; f: The percentage accuracy of Subspace Model II; g: The percentage accuracy of the DL approach; h: The percentage accuracy of the PCASSO tool.

4. Discussion

The three-dimensional structure is key to understanding the biological function of a protein. Therefore, several experimental (i.e., X-ray crystallography and Cryo-EM) and computational (i.e., ab initio and comparative) techniques are used to determine the tertiary structure of a protein. One crucial step in determining the structure of a protein is determining the secondary structure elements (SSEs). SSEs are sub-conformational regions that form when a polypeptide chain folds, driven by factors including hydrogen bonds between amino acids. SSEs can be categorized into three types: helices, β-sheets, and loops/coils. Computationally, SSEs are determined by recognizing patterns of hydrogen bonds and geometric features extracted from full-atom protein coordinates; the most popular methods are DSSP and STRIDE. When a group of atoms is missing structural data (i.e., coordinates), conventional methods such as DSSP will not perform as intended.

In this research, we present a multi-model approach for identifying SSEs using the Cα trace only. This mimics the scenario of proteins with missing atomic information. Our approach consists of two mathematical models (Model-1 and Model-2), one deep learning model (DL), and one ensemble of machine learning models (EML). All models use a set of 39 geometric features collected for each amino acid, describing its neighborhood geometrically using Cα coordinates only. A set of 5510 predetermined proteins (i.e., 868 K amino acids after cleaning) was used to extract these features and train our models. A large set, set S, consisting of 300 proteins and 69 K amino acids, was used to validate our models and compare them with a state-of-the-art approach, PCASSO. The experimental comparison shows that the DL model has the best performance on set S, with accuracy reaching 95.12%; the EML model ranks second with accuracy reaching 93.51%. On the other hand, PCASSO and Model-1 are ranked near the bottom of the list.

Author Contributions

A.S. led the development of mathematical theory for subspace segmentation-based approaches. K.A.N. led development of geometrical features and ensemble learning. B.B. and A.B.K. led development of deep learning-based approach. C.J. did algorithm performance comparison to an existing approach. A.S., K.A.N. and B.B. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Protein chains in cullpdb_pc20_res1.8_R0.25_d200528_chains5510 from PISCES server [50] are used to build our data set. Further details are provided in Section 2.3.1 and Section 2.4.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

Ali Sekmen’s research is funded by DOD grant W911NF-20-100284. Kamal Al Nasr’s research is funded by NSF CBET Award: 2153807.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1.Ridley M. Genome. 1st ed. Harper Perennial; New York, NY, USA: 2000. p. 352. [Google Scholar]
  • 2.Murray R.K., Granner D.K., Mayes P.A., Rodwell V.W. Harper’s Illustrated Biochemistry. McGraw-Hill Medical; Irvine, CA, USA: 2006. [Google Scholar]
  • 3.Burley S.K., Berman H.M., Bhikadiya C., Bi C., Chen L., Di Costanzo L., Christie C., Dalenberg K., Duarte J.M., Dutta S., et al. RCSB Protein Data Bank: Biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Res. 2018;47:D464–D474. doi: 10.1093/nar/gky1004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Sussman J.L., Lin D., Jiang J., Manning N.O., Prilusky J., Ritter O., Abola E.E. Protein Data Bank (PDB): Database of three-dimensional structural information of biological macromolecules. Acta Crystallogr. Sect. D Biol. Crystallogr. 1998;54:1078–1084. doi: 10.1107/S0907444998009378. [DOI] [PubMed] [Google Scholar]
  • 5.Tarry M.J., Haque A.S., Bui K.H., Schmeing T.M. X-Ray Crystallography and Electron Microscopy of Cross- and Multi-Module Nonribosomal Peptide Synthetase Proteins Reveal a Flexible Architecture. Structure. 2017;25:783–793.e4. doi: 10.1016/j.str.2017.03.014. [DOI] [PubMed] [Google Scholar]
  • 6.Tsai C., Schertler G.F.X. Membrane Protein Crystallization. John Wiley and Sons, Inc.; Hoboken, NJ, USA: 2020. pp. 187–210. [Google Scholar]
  • 7.Maveyraud L., Mourey L. Protein X-ray Crystallography and Drug Discovery. Molecules. 2020;25:1030. doi: 10.3390/molecules25051030. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Hatzakis E. Nuclear Magnetic Resonance (NMR) Spectroscopy in Food Science: A Comprehensive Review. Compr. Rev. Food Sci. Food Saf. 2019;18:189–220. doi: 10.1111/1541-4337.12408. [DOI] [PubMed] [Google Scholar]
  • 9.Li W., Zhang Y., Skolnick J. Application of sparse NMR restraints to large-scale protein structure prediction. Biophys J. 2004;87:1241–1248. doi: 10.1529/biophysj.104.044750. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Danev R., Yanagisawa H., Kikkawa M. Cryo-Electron Microscopy Methodology: Current Aspects and Future Directions. Trends Biochem. Sci. 2019;44:837–848. doi: 10.1016/j.tibs.2019.04.008. [DOI] [PubMed] [Google Scholar]
  • 11.Wrapp D., Wang N., Corbett K.S., Goldsmith J.A., Hsieh C.L., Abiona O., Graham B.S., McLellan J.S. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science. 2020;367:1260. doi: 10.1126/science.abb2507. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Terashi G., Kihara D. De novo main-chain modeling for EM maps using MAINMAST. Nat. Commun. 2018;9:1618. doi: 10.1038/s41467-018-04053-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Chen M., Baldwin P.R., Ludtke S.J., Baker M.L. De Novo modeling in cryo-EM density maps with Pathwalking. J. Struct. Biol. 2016;196:289–298. doi: 10.1016/j.jsb.2016.06.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Al Nasr K., Chen L., Si D., Ranjan D., Zubair M., He J. Building the Initial Chain of the Proteins through de Novo Modeling of the Cryo-Electron Microscopy Volume Data at the Medium Resolutions; Proceedings of the BCB ’12 ACM Conference on Bioinformatics, Computational Biology and Biomedicine; New York, NY, USA. 7–10 October 2012; pp. 490–497. [DOI] [Google Scholar]
  • 15.Al Nasr K. Ph.D. Dissertation. Old Dominion University; Norfolk, VA, USA: 2012. De Novo Protein Structure Modeling from Cryoem Data through a Dynamic Programming Algorithm in the Secondary Structure Topology Graph. [Google Scholar]
  • 16.Al Nasr K., He J. Constrained cyclic coordinate descent for cryo-EM images at medium resolutions: Beyond the protein loop closure problem. Robotica. 2016;34:1777–1790. doi: 10.1017/S0263574716000242. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Senior A.W., Evans R., Jumper J., Kirkpatrick J., Sifre L., Green T., Qin C., Zidek A., Nelson A.W.R., Hassabis D., et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577:706–710. doi: 10.1038/s41586-019-1923-7. [DOI] [PubMed] [Google Scholar]
  • 18.Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G.R., Wang J., Cong Q., Kinch L.N., Schaeffer R.D., et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Pakhrin S.C., Shrestha B., Adhikari B., Kc D.B. Deep Learning-Based Advances in Protein Structure Prediction. Int. J. Mol. Sci. 2021;22:5553. doi: 10.3390/ijms22115553. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Lam S.D., Das S., Sillitoe I., Orengo C. An overview of comparative modelling and resources dedicated to large-scale modelling of genome sequences. Acta Crystallogr. Sect. D. 2017;73:628–640. doi: 10.1107/S2059798317008920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Pandit S.B., Zhang Y., Skolnick J. TASSER-Lite: An automated tool for protein comparative modeling. Biophys. J. 2006;91:4180–4190. doi: 10.1529/biophysj.106.084293. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Greenfield N.J. Methods to Estimate the Conformation of Proteins and Polypeptides from Circular Dichroism Data. Anal. Biochem. 1996;235:1–10. doi: 10.1006/abio.1996.0084. [DOI] [PubMed] [Google Scholar]
  • 23.Provencher S.W., Gloeckner J. Estimation of globular protein secondary structure from circular dichroism. Biochemistry. 1981;20:33–37. doi: 10.1021/bi00504a006. [DOI] [PubMed] [Google Scholar]
  • 24.Dousseau F., Pezolet M. Determination of the secondary structure content of proteins in aqueous solutions from their amide I and amide II infrared bands. Comparison between classical and partial least-squares methods. Biochemistry. 1990;29:8771–8779. doi: 10.1021/bi00489a038. [DOI] [PubMed] [Google Scholar]
  • 25.Byler D.M., Susi H. Examination of the secondary structure of proteins by deconvolved FTIR spectra. Biopolymers. 1986;25:469–487. doi: 10.1002/bip.360250307. [DOI] [PubMed] [Google Scholar]
  • 26.Wishart D.S., Sykes B.D., Richards F.M. The chemical shift index: A fast and simple method for the assignment of protein secondary structure through NMR spectroscopy. Biochemistry. 1992;31:1647–1651. doi: 10.1021/bi00121a010. [DOI] [PubMed] [Google Scholar]
  • 27.Pastore A., Saudek V. The relationship between chemical shift and secondary structure in proteins. J. Magn. Reson. 1990;90:165–176. doi: 10.1016/0022-2364(90)90375-J. [DOI] [Google Scholar]
  • 28.Law S.M., Frank A.T., Brooks C.L., III. PCASSO: A fast and efficient Cα-based method for accurately assigning protein secondary structure elements. J. Comput. Chem. 2014;35:1757–1761. doi: 10.1002/jcc.23683. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Levitt M., Greer J. Automatic identification of secondary structure in globular proteins. J. Mol. Biol. 1977;114:181–239. doi: 10.1016/0022-2836(77)90207-8. [DOI] [PubMed] [Google Scholar]
  • 30.Richards F.M., Kundrot C.E. Identification of structural motifs from protein coordinate data: Secondary structure and first-level supersecondary structure. Proteins Struct. Funct. Bioinform. 1988;3:71–84. doi: 10.1002/prot.340030202. [DOI] [PubMed] [Google Scholar]
  • 31.Labesse G., Colloc’h N., Pothier J., Mornon J.P. P-SEA: A new efficient assignment of secondary structure from Cα trace of proteins. Bioinformatics. 1997;13:291–295. doi: 10.1093/bioinformatics/13.3.291. [DOI] [PubMed] [Google Scholar]
  • 32.Martin J., Letellier G., Marin A., Taly J.F., de Brevern A.G., Gibrat J.F. Protein secondary structure assignment revisited: A detailed analysis of different assignment methods. BMC Struct. Biol. 2005;5:17. doi: 10.1186/1472-6807-5-17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Cao C., Wang G., Liu A., Xu S., Wang L., Zou S. A New Secondary Structure Assignment Algorithm Using Cα Backbone Fragments. Int. J. Mol. Sci. 2016;17:333. doi: 10.3390/ijms17030333. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Taylor W.R. Defining linear segments in protein structure. J. Mol. Biol. 2001;310:1135–1150. doi: 10.1006/jmbi.2001.4817. [DOI] [PubMed] [Google Scholar]
  • 35.Konagurthu A.S., Allison L., Stuckey P.J., Lesk A.M. Piecewise linear approximation of protein structures using the principle of minimum message length. Bioinformatics. 2011;27:i43–i51. doi: 10.1093/bioinformatics/btr240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Si D., Ji S., Al Nasr K., He J. A machine learning approach for the identification of protein secondary structure elements from cryoEM density maps. Biopolymers. 2012;97:698–708. doi: 10.1002/bip.22063. [DOI] [PubMed] [Google Scholar]
  • 37.Saqib M.N., Kryś J.D., Gront D. Automated Protein Secondary Structure Assignment from Cα Positions Using Neural Networks. Biomolecules. 2022;12:841. doi: 10.3390/biom12060841. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Salawu E.O. RaFoSA: Random forests secondary structure assignment for coarse-grained and all-atom protein systems. Cogent Biol. 2016;2:1214061. doi: 10.1080/23312025.2016.1214061. [DOI] [Google Scholar]
  • 39.Sallal M.A., Chen W., Al Nasr K. Machine Learning Approach to Assign Protein Secondary Structure Elements from Cα Trace; Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Seoul, Republic of Korea. 16–19 December 2020; pp. 35–41. [Google Scholar]
  • 40.Sekmen A., Al Nasr K., Jones C. Subspace Modeling for Classification of Protein Secondary Structure Elements from Cα Trace; Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Houston, TX, USA. 9–12 December 2021. [Google Scholar]
  • 41.Al Nasr K., Sekmen A., Bilgin B., Jones C., Koku A.B. Deep Learning for Assignment of Protein Secondary Structure Elements from Cα Coordinates; Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Houston, TX, USA. 9–12 December 2021. [Google Scholar]
  • 42.Vidal R., Ma Y., Sastry S. Generalized Principal Component Analysis (GPCA) IEEE Trans. Pattern Anal. Mach. Intell. 2005;27:1945–1959. doi: 10.1109/TPAMI.2005.244. [DOI] [PubMed] [Google Scholar]
  • 43.Roy O., Vetterli M. The effective rank: A measure of effective dimensionality; Proceedings of the 2007 15th European Signal Processing Conference; Poznan, Poland. 3–7 September 2007; pp. 606–610. [Google Scholar]
  • 44.Berner J., Grohs P., Kutyniok G., Petersen P. The modern mathematics of deep learning. arXiv. 2021. arXiv:2105.04026. [Google Scholar]
  • 45.Ho J., Yang M., Lim J., Kriegman D. Clustering appearances of objects under varying illumination conditions; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Madison, WI, USA. 18–20 June 2003; pp. 11–18. [Google Scholar]
  • 46.Aldroubi A., Sekmen A. Nearness to local subspace algorithm for subspace and motion segmentation. IEEE Signal Process. Lett. 2012;19:704–707. doi: 10.1109/LSP.2012.2214211. [DOI] [Google Scholar]
  • 47.Vidal R. A tutorial on subspace clustering. IEEE Signal Process. Mag. 2010;28:52–68. doi: 10.1109/MSP.2010.939739. [DOI] [Google Scholar]
  • 48.Georghiades A.S., Belhumeur P.N., Kriegman D.J. From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Trans. Pattern Anal. Mach. Intell. 2001;23:643–660. doi: 10.1109/34.927464. [DOI] [Google Scholar]
  • 49.Zhang J., Zhu G., Heath R.W., Jr., Huang K. Grassmannian Learning: Embedding Geometry Awareness in Shallow and Deep Learning. arXiv. 2018. arXiv:1808.02229. [Google Scholar]
  • 50.Wang G., Dunbrack R.L., Jr. PISCES: A protein sequence culling server. Bioinformatics. 2003;19:1589–1591. doi: 10.1093/bioinformatics/btg224. [DOI] [PubMed] [Google Scholar]
  • 51.Bolstad B., Irizarry R., Åstrand M., Speed T. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185. [DOI] [PubMed] [Google Scholar]
  • 52.Wolpert D.H. Stacked generalization. Neural Netw. 1992;5:241–259. doi: 10.1016/S0893-6080(05)80023-1. [DOI] [Google Scholar]

Associated Data


Data Availability Statement

The protein chains listed in cullpdb_pc20_res1.8_R0.25_d200528_chains5510 from the PISCES server [50] were used to build our data set. Further details are provided in Section 2.3.1 and Section 2.4.
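
For readers who wish to assemble a comparable chain set, the Python sketch below shows one plausible way to parse a locally downloaded copy of the PISCES culled list and fetch the corresponding PDB entries from the RCSB download service. The file name, the single-header-line column layout, the output directory, and the download URL pattern are assumptions made for illustration; they do not describe the authors' actual pipeline.

    # Hypothetical sketch: read a local PISCES culled list and download the
    # referenced PDB entries. File name, list format, and output folder are
    # assumptions for illustration only.
    import os
    import urllib.request

    PISCES_LIST = "cullpdb_pc20_res1.8_R0.25_d200528_chains5510"  # assumed local copy
    OUT_DIR = "pdb_files"                                          # hypothetical output folder
    os.makedirs(OUT_DIR, exist_ok=True)

    chains = []
    with open(PISCES_LIST) as fh:
        next(fh)                        # assumed: first line is a column header
        for line in fh:
            if not line.strip():
                continue
            entry = line.split()[0]     # e.g., "1ABCA": 4-character PDB ID + chain ID
            pdb_id, chain_id = entry[:4].lower(), entry[4:]
            chains.append((pdb_id, chain_id))

    # Download each PDB entry once; selecting the listed chain would happen
    # later, when the coordinate files are parsed.
    for pdb_id in sorted({p for p, _ in chains}):
        target = os.path.join(OUT_DIR, f"{pdb_id}.pdb")
        if not os.path.exists(target):
            url = f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"
            urllib.request.urlretrieve(url, target)

Such a script only gathers coordinate files; extracting Cα traces and assigning reference SSE labels would follow the procedures described in Section 2.3.1 and Section 2.4.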

