Abstract
Background
Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features.
Methodology
We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences, which state-of-the-art work in machine learning shows to be challenging and to involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select the most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not.
Results
To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on the retention or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
Introduction
The wealth of biological sequences made possible by high-throughput sequencing technologies is in turn increasing the need for computational techniques to automate sequence analysis. In particular, as the community at large is focusing on elucidating the sequence-function relationship in biological macromolecules, a primary sequence analysis problem involves unraveling the rich architecture of DNA and mapping underlying functional components in a DNA sequence [1]. A combination of valuable biological insight gathered from wet-laboratory experiments and increasingly powerful computational tools has resulted in significant progress being made in important sequence analysis tasks, such as gene finding [2], [3]. Despite this progress, challenges remain [4], [5]. For instance, accuracy in gene finding ultimately depends on addressing various subproblems, one of which is the correct detection of splice sites that mark the beginning and end of a gene. The splice site prediction problem is now considered a primary subtask in gene finding and is thus the subject of many machine learning methods [6]–[16]. Other prominent DNA analysis problems involve the identification of regulatory regions [17], [18] through detection of binding sites of transcription factors [19]–[21] or detection of hypersensitive sites as reliable markers of regulatory regions [15], [22]–[28], identification of ALU sites [29]–[33] to understand human evolution and inherited disease [34], [35], and more.
From a computational point of view, detecting specific functional regions in a DNA sequence poses the interesting and challenging task of searching for signals hidden in sequence data. Determining whether a sequence contains a particular signal, and where, is a difficult computational task, particularly in the ab initio setting, in which little or no a priori information is available on what local or distal interactions among the building blocks of investigated sequences constitute the sought signal. Yet, automating this process is central to our quest to understand the biology of organisms and characterize the role of macromolecules in the inner workings of a healthy and diseased cell. This quest is not limited to nucleic acids. Important sequence analysis problems include predicting protein solubility, crystallizability, subcellular localization, enzymatic activity, antimicrobial activity, secondary structure, and more [36]–[46]. The focus in this paper on DNA is due to a growing body of work in machine learning pointing to the fact that many important functional signals consist of a complex combination of local and distal information at the nucleotide level.
Sequence analysis problems in which the objective is to find what constitutes a functional signal or property at the sequence level naturally lend themselves to formulation as classification problems in machine learning. The effectiveness of these algorithms largely depends on the feature sets used. In some settings, the construction of effective features can be facilitated by a priori insight from biologists or other domain experts. For instance, biophysical insights have been instrumental in developing effective features for predicting protein subcellular localization and folding rates, CG islands in DNA sequences, and more [37], [38], [42], [43], [47].
However, it is becoming increasingly clear that there are problems for which domain-specific insight is either incomplete or hard to translate into effective features. As a consequence, there is considerable interest in automating the process of constructing effective features. A prominent example is the automated detection of splice sites in DNA sequences [12]–[16]. The key issue here is how to define a space of potential features that is sufficiently rich to allow the generation of effective features while maintaining computational feasibility.
In recent work, we have shown how one can explore large spaces of potential features in a computationally-viable manner by employing evolutionary algorithms (EAs) [15], [16]. The success of this "proof of principle" effort has prompted us to propose and investigate a more general EA-based framework (EFFECT) for efficient automated feature construction for classification of biological sequences. In this paper we describe the generalizations and then demonstrate the broad applicability of the framework on three DNA sequence analysis problems: the detection of splice sites, HS sites, and ALU sites. The algorithmic realizations of EFFECT for each of the selected problems are sufficiently detailed to allow one to adapt the framework for other sequence classification problems of interest. Indeed, one of the contributions of this work is in providing a roadmap for how one can do so in different application settings. To further facilitate this, all data, code, and documentation are provided to the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
The rest of this article is organized as follows. We first provide a brief review of related research, covering machine learning methods for classification of biological sequences and EAs for feature construction in the context of classification. The EFFECT framework is detailed in Methodology. A comprehensive analysis of results from applying this framework to the three chosen problems is presented in Results. The paper concludes in Discussion, where we provide a short summary of the main features of EFFECT, its availability to the research community, and its use for other classification problems of interest.
Related Work
Methods for Classification of Sequence Data
We focus here on supervised learning methods for classification of sequences. In this scenario, a model is trained to find features that separate labeled training sequence data. Typically, these are binary classification problems in which a positive label is assigned to sequences known to contain a particular functional signal or property, and a negative label to sequences that do not. The learned model is then applied to novel sequences to make label predictions and thus detect or recognize the presence of the sought functional signal.
Our review below categorizes classification methods into statistical-based and feature-based, though many methods are a combination of the two approaches. Typically, the process involves first transforming sequence data into vectors over which an underlying classifier operates. In statistical-based approaches, the focus is on the underlying statistical model for the classification. In feature-based approaches, the primary focus is on constructing effective features that allow transforming sequence data into (feature) vectors for standard classifiers. What follows below is not a comprehensive review of literature on each of these two approaches, but rather a summary of representative methods in each category to facilitate the discussion of results in the comparison of our framework to state-of-the-art methods.
Statistical Learning Methods
Statistical learning methods can be broadly classified by the models that they employ, which can be generative or discriminative. Generative models learn the joint probability of inputs and labels; Bayes' rule is then used to calculate the posterior and predict the most likely label for an unlabeled input. Discriminative models learn the posterior directly, but this also limits them to a supervised setting that demands labeled training data (as opposed to the ability of generative models to additionally exploit unlabeled data). Nonetheless, discriminative models are preferred in many classification settings, as they model the posterior directly without first addressing the more general problem of modeling the joint probability [48]. The transformation of input sequence data into numeric data for these models is conducted a priori, through either a kernel function or a feature-based method that explicitly extracts features of relevance for the transformation.
Heuristic procedures have been proposed to combine discriminative and generative models [49] as a way to address the issue that generative methods lose their ability to exploit unlabeled data when trained discriminatively [50]. The resulting hybrid methods have been shown to yield superior performance on recognition of transcription factor-binding sites on DNA [51]. Representative methods include the position-specific scoring matrix (PSSM), also known as the position-weight matrix (PWM), which assumes nucleotides at all positions are drawn independently [52], [53]; the weight array model (WAM), which relaxes the independence assumption by additionally modeling dependencies on a previous position [54]; higher-order Markov models, which model more dependencies and outperform PSSMs [55], [56]; and even more complex models like Bayesian networks [57], [58] and Markov Random Fields (MRFs) [59], [60]. A mixture of Bayesian trees and PSSMs [61], as well as smooth interpolations between PSSMs and empirical distributions [62], have also been proposed to model arbitrary dependencies.
Kernel-based Methods
SVMs are probably the most widespread discriminative learning method in bioinformatics due to their ease of implementation and solid grounding in statistical theory [63], [64]. They have been applied to many sequence classification problems, including prediction of transcription start sites on DNA [65], translation initiation sites [66], gene finding [67], transcription factor-binding sites [68], and DNA regulatory regions [69]. The predictive power of SVMs greatly depends on the chosen kernel function. This function maps input (here, sequence) data onto a usually higher-dimensional feature space where the provided samples of the two classes can be linearly separated by a hyper-plane. Many kernels have been designed for sequence classification, of which the most relevant state-of-the-art examples are the weighted degree positional kernel and the weighted degree positional kernel with shift, devised for recognition of DNA splice sites [12]. In these kernels, limited-range dependencies between neighboring nucleotides are considered to encode features for the SVM. Concepts from evolutionary computation have lately been proposed to learn effective, possibly more complex, kernels for a particular sequence classification problem at hand [27], [28].
Feature-based Methods
Feature-based methods make the process of feature construction transparent and so can offer constructed features for inspection and further analysis to biologists. Constructing effective features, however, is non-trivial. The straightforward approach is to use enumeration to list all considered features. When no domain-specific expertise is available to guide feature construction towards certain feature types, the predominant approach has been to limit the focus to features that are strings of symbols over the alphabet of building blocks in considered biological sequences (nucleotides in DNA/RNA and amino acids in proteins). These k-mers are also known as spectrum features [70].
The essential idea is to transform given sequences into numeric vectors recording frequency or occurrence of k-mers and then employ supervised learning techniques, such as SVMs, to separate training data in the resulting vector space [25]. Spectrum features have been shown useful in various classification problems, such as prediction of DNA promoter regions, cis sites, HS sites, splice sites, and more [70]–[73]. However, work has shown that the majority of spectrum features are seldom useful and can be removed by effective feature selection algorithms [74].
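To make this transformation concrete, the following minimal Java sketch maps a DNA sequence to its k-mer counts; the class and method names are our own illustration and not part of any cited tool.

```java
// Minimal sketch of spectrum (k-mer) feature extraction: a sequence is
// mapped to a vector of k-mer counts over which a standard classifier can
// operate. All names here are illustrative.
import java.util.LinkedHashMap;
import java.util.Map;

public class SpectrumFeatures {

    // Count the occurrences of every k-mer present in the sequence.
    static Map<String, Integer> kmerCounts(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {ACG=2, CGA=1, GAC=1, CGT=1}
        System.out.println(kmerCounts("ACGACGT", 3));
    }
}
```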
In many classification problems on biological sequences, research has shown that simple spectrum (composition-based) features are not sufficient. Problems such as predicting protein enzymatic activity, DNA hypersensitive sites, or RNA/DNA splice sites seem to necessitate complex local and distal features [11], [13], [14], [25]–[28], [38]. In particular, taking into account dependencies through features that encode correlations or simultaneous occurrences of particular k-mers at different positions in a biological sequence has been shown to be important for accurate detection of splice sites [14]–[16]. Work in [8], [14] introduced the idea of explicitly considering various feature types in the context of splice site detection but limited the number of types and the number of enumerated features per type to control the size of the feature space and the computational cost demanded by enumeration. The feature types considered were position-based, region-based, and composition-based [14].
In general, enumeration-based approaches introduce artificial limits on the length and the complexity of features in order to achieve reasonable computation times. Moreover, insight into a particular problem domain is difficult to translate into meaningful features when a combination of local and distal features is needed. Ideally, a general feature construction approach would be able to operate ab initio; that is, explore the space of possible local and distal features and guide itself towards discriminating features. When the types or number of features are not limited, one is invariably confronted with a feature construction problem that is NP-hard due to the combinatorial explosion in the size of the feature space [75]. Yet, a variety of general-purpose search techniques have been shown effective for NP-hard problems. In particular, EAs, which we summarize next, provide a viable alternative for exploration of complex feature spaces in automated feature construction and are the backbone of the framework proposed here for classification of biological sequences.
EAs for Exploration of Feature Spaces
The ability of EAs to efficiently explore large search spaces with complex fitness landscapes makes them appealing for feature construction [76]. EAs mimic biological evolution in their search for solutions to a given optimization problem. Typically, a population of candidate solutions, also referred to as individuals, is evolved towards better solutions through a process that generates new candidates and retains only those deemed promising according to some fitness function.
Recognized early for their promise in addressing difficult optimization problems [77], EAs have gained popularity for feature construction in different application settings [16], [28], [38], [78]–[83]. In particular, recent work has shown improved classification accuracies when using genetic algorithms (GAs), a class of EAs, to replace feature enumeration techniques in predicting promoter regions, HS sites, and splice sites in DNA, and even enzymatic activity in proteins [15], [16], [26]–[28], [38].
In standard GAs, individuals are fixed-length strings of symbols. In another class of EAs, genetic programming (GP) algorithms, an individual is a variable-length tree composed of functions and variables. The functions are represented as non-terminal nodes, and the variables represented as terminal (leaf) nodes. GPs were originally introduced to evolve computer programs and complex functions [84]–[87]. Today, GP-based algorithms are being used for a variety of applications, including feature construction in the context of classification of biological sequences [38], [88]–[92]. Our recent work introduced a GP-based method for feature construction in the context of DNA splice site recognition [16]. In this paper, we present a more general EA-based approach that makes use of a GP algorithm to explore complex feature spaces and generate predictive features from sequence data.
Methods for Feature Selection
EAs can be used to construct a large set of discriminating features, but selecting a non-redundant subset that retains its predictive power remains a difficult and open problem, particularly when the set of features is large [93]. Finding an optimal set of features is generally intractable [94] and has been shown to be NP-hard in various settings [95], [96]. This is due in part to the fact that a feature by itself may not be predictive of a particular class but may be informative in combination with other features. Conversely, features that are informative by themselves may be redundant when grouped with others. In general, even finding a small subset of discriminating features is a challenging search problem [93], [97], [98].
Feature selection methods generally follow one of two approaches, subset search and subset evaluation [99]. Univariate feature selection criteria, such as information gain or the chi-square statistic, are not very useful when applied to a set of features that already have high discriminatory power, as is the case with features found by the GP algorithm employed in the first stage of our EFFECT framework. For already-discriminating features, a more relevant criterion is to reduce redundancy while retaining predictive power in the selected subset. In this paper we present an EA-based approach to feature selection that achieves that goal.
Methodology
The proposed EFFECT framework consists of two stages, each comprised of an EA. In the first stage, the Evolutionary Feature Construction (EFC) algorithm is used to search a given space of complex features and identify a set of features estimated to be effective in the context of a given classification problem. These features are then fed to the second stage, where a second algorithm, Evolutionary Feature Selection (EFS), reduces the set of constructed features by selecting a subset deemed most informative, without sacrificing performance. A schematic of the framework, showing the interplay between these two algorithms, is shown in Figure 1.
Constructing Complex Features with EFC
Since EFC is a generalization of the feature generation algorithm presented in [16], our description here focuses primarily on the novel components, providing a brief summary of the common elements where needed and directing the reader to Text S1 for further details.
Central to the power of EFC is its generalized representation of sequence-based features as GP trees. These feature "trees" are maintained in a population that evolves over generations using standard GP reproductive mechanisms of mutation and crossover. Mimicking the process of natural selection, features that are deemed more discriminative for classification have a higher probability of surviving into the next generation, steering the probabilistic search in EFC towards more effective features. The discriminative power of a feature is estimated through an empirical or surrogate fitness function. The best features (those with highest fitness) found by EFC are collected in a set referred to as a hall of fame. It is this set that is fed to the subsequent EFS algorithm for feature subset selection.
Feature Representation in EFC
As standard in GP, the individuals (features) evolved by the EFC algorithm are represented as parse trees [87]. In EFC, the leaf nodes of a feature tree are known building blocks of given biological sequences. In the case of DNA sequences, for instance, these blocks are the four nucleotides in the DNA alphabet. To improve on generality and effectiveness, EFC supports additional building blocks that represent groups of nucleotides based on similar chemical properties. In this paper, this capability is illustrated by the use of the IUPAC code [100], resulting in symbols listed in Table 1. If the sequences of interest are proteins, the building blocks can either be amino-acid identities, types, or other categorizations based on physico-chemical properties.
Table 1. The IUPAC code, adapted from [100].
Symbol | Meaning | Mnemonic |
G | G | Guanine |
A | A | Adenine |
T | T | Thymine |
C | C | Cytosine |
R | G or A | puRine |
Y | T or C | pYrimidine |
M | A or C | aMino |
K | G or T | Ketone |
S | G or C | Strong interaction |
W | A or T | Weak interaction |
H | A or C or T | H follows G in alphabet |
B | G or T or C | B follows A in alphabet |
V | G or C or A | V follows U in alphabet |
D | G or A or T | D follows C in alphabet |
N | G or A or T or C | aNy |
Alternatively, building blocks can be short subsequences or motifs of symbols. Information may be available from domain experts to determine the length of these motifs. For instance, work in splice sites shows that motifs of length are not useful [13], [14]. In other applications, there may be lower bounds on the length of effective motifs. Such bounds may be available and specified a priori to EFC or tuned interactively after analysis of constructed features. In the selected applications of the EFFECT framework in this paper, the leaf nodes of feature trees are motifs, and we limited the length of these motifs between and .
As illustrated in Figures 2 and 3, EFC uses the standard boolean operators (and, or, not) to combine basic building blocks into more complex features. In addition to boolean operators, the EFC algorithm uses application-specific functional nodes to assist in constructing meaningful features for biological sequences. These are listed in Table 2. An important functional generalization in EFC is the ability to specify the matching of a motif anywhere within some region (upstream or downstream) or around some expected position. This allows for the construction of features that are more robust to possible sequence variations. In Text S1 we provide more detail regarding the types of features that one can construct with these operators and provide illustrations for them.
Table 2. A table of non-terminals and terminals employed in feature construction.
Name | Args | Return Type |
AND | 2 non-terminal boolean | Boolean |
OR | 2 non-terminal boolean | Boolean |
NOT | 1 non-terminal boolean | Boolean |
Correlational | 2 non-terminal boolean, Shift | Boolean |
Matches | Motif | Boolean |
MatchesAtPosition | Matches, Position | Boolean |
MatchesAtPositionWithShift | Motif, Position, Shift | Boolean |
MatchesAtRegion | Matches, Region | Boolean |
Motif-* | ERC-chars | Motif |
Shift | ERC-int | Integer |
Region | ERC-int | Integer |
Length | ERC-int | Integer |
ERC-char | | Character |
ERC-int | | Integer |
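As a hypothetical illustration of how such a feature tree might be evaluated, the sketch below combines a positional match with shift (Table 2) and an IUPAC-aware motif match (a subset of Table 1) under an AND node; the class and method names are our own and do not reflect the actual EFFECT API.

```java
// Hypothetical sketch of evaluating a GP feature tree against a sequence:
// an AND node over MatchesAtPositionWithShift and a motif matched anywhere,
// with IUPAC symbols (a subset of Table 1) matched per character.
import java.util.Map;

public class FeatureTreeDemo {
    static final Map<Character, String> IUPAC = Map.of(
        'A', "A", 'C', "C", 'G', "G", 'T', "T",
        'R', "GA", 'Y', "TC", 'S', "GC", 'W', "AT", 'N', "GATC");

    // Does the motif (possibly with ambiguous symbols) match seq at pos?
    static boolean matchesAt(String seq, String motif, int pos) {
        if (pos < 0 || pos + motif.length() > seq.length()) return false;
        for (int i = 0; i < motif.length(); i++)
            if (IUPAC.get(motif.charAt(i)).indexOf(seq.charAt(pos + i)) < 0)
                return false;
        return true;
    }

    // MatchesAtPositionWithShift: allow the motif within +/- shift of pos.
    static boolean matchesNear(String seq, String motif, int pos, int shift) {
        for (int p = pos - shift; p <= pos + shift; p++)
            if (matchesAt(seq, motif, p)) return true;
        return false;
    }

    public static void main(String[] args) {
        String seq = "TTGCGCAGGTAAGT";
        // AND(MatchesAtPositionWithShift("GGTAAG", 8, 2),
        //     Matches("SS" anywhere)) evaluated as a boolean feature.
        boolean anywhereSS = false;
        for (int p = 0; p + 2 <= seq.length() && !anywhereSS; p++)
            anywhereSS = matchesAt(seq, "SS", p);
        boolean feature = matchesNear(seq, "GGTAAG", 8, 2) && anywhereSS;
        System.out.println(feature); // true
    }
}
```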
Population and Generation Mechanism
As detailed in Text S1, the initial population of features is carefully constructed to contain a variety of tree shapes with maximum depth . In contrast to EAs with fixed population sizes, EFC employs an implosion mechanism that reduces the size of the population by % over the previous one, in order to avoid known convergence pitfalls of GPs. The population of features evolves for a pre-specified number of generations . Each population contributes its top features to a hall of fame. In turn, the hall of fame is used to provide a randomly selected initial set of features for the next generation, with the rest of the features in the next generation obtained through reproductive operators.
In the experiments reported in this paper, , , , , , and .
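To show how these mechanisms fit together, here is a schematic sketch of the generation loop; the values below (population size, implosion rate, and so on) are placeholders of our own choosing rather than the settings used in our experiments, and individuals are reduced to bare fitness values.

```java
// Schematic sketch of EFC's generation loop: an imploding population that
// contributes its best individuals to a hall of fame, which in turn seeds
// part of the next generation. Parameter values are illustrative only.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class EfcLoopSketch {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int popSize = 200;        // placeholder initial population size
        double implosion = 0.05;  // placeholder per-generation shrink rate
        int generations = 30;     // placeholder number of generations
        int topK = 5;             // individuals sent to the hall of fame
        int seedK = 10;           // hall-of-fame individuals seeding the next generation
        List<Double> hallOfFame = new ArrayList<>();

        List<Double> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) pop.add(rng.nextDouble()); // stand-in fitness

        for (int g = 0; g < generations; g++) {
            pop.sort(Collections.reverseOrder());
            hallOfFame.addAll(pop.subList(0, Math.min(topK, pop.size())));

            popSize = (int) Math.round(popSize * (1 - implosion)); // implosion
            List<Double> next = new ArrayList<>();
            for (int i = 0; i < seedK && !hallOfFame.isEmpty(); i++)
                next.add(hallOfFame.get(rng.nextInt(hallOfFame.size())));
            while (next.size() < popSize) // stand-in for mutation/crossover
                next.add(Math.min(1.0, pop.get(rng.nextInt(pop.size())) + 0.01));
            pop = next;
        }
        System.out.println("best-of-run fitness: " + Collections.max(hallOfFame));
    }
}
```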
Reproductive Operators
Based on studies that show robust EAs incorporate both asexual (mutation) and sexual (crossover) breeding operators [101], EFC employs both operators. These operators are executed until the goal population size for the next generation is reached. Each of the operators has a certain probability with which it is performed. Given the additional functional nodes in EFC over our prior work in [16], four new mutation operators are employed depending on the type of tree node being modified. Each of the variants has equal probability of being performed once the mutation operator is selected. Additional details and illustrations on the mutation and crossover operators are provided in Text S1.
Bloat Control
A common problem with tree-based individuals in EAs is that, as generations progress, individuals become more complex without any improvement in fitness. This is known as bloat. It is important to control bloat, particularly when the goal is to have features that are easily interpretable by humans. As such, bloat control is an important element in EFC, the details of which are given in Text S1.
Fitness Function
EFC employs a surrogate fitness function, a "filter" approach, which is considered to be more effective than wrapper approaches for feature evaluation [102]. Since most sequence classification datasets are imbalanced, in the sense of having very few positives compared to a large number of negatives, the objective of the filter approach is to improve precision while managing the discriminative power of features. For this purpose, we use the following fitness function: fitness(f) = (P_f - N_f)/P. In this equation, f refers to a particular feature, P_f and N_f are the numbers of positive and negative training sequences that contain feature f, respectively, and P is the total number of positive training sequences. This fitness function tracks the occurrence of a feature in positive sequences, as negative sequences may not have any common features or signals. It additionally penalizes non-discriminating features; that is, features that are found equally often in positive and negative training sequences.
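In code, this fitness can be evaluated as in the sketch below, assuming the form (P_f - N_f)/P given above; the Predicate stands in for an evolved feature tree, and all names and data are our own illustration.

```java
// Sketch of the surrogate fitness, assuming fitness(f) = (P_f - N_f) / P:
// occurrence in positives is rewarded, occurrence in negatives penalized.
import java.util.List;
import java.util.function.Predicate;

public class SurrogateFitness {
    static double fitness(Predicate<String> feature,
                          List<String> positives, List<String> negatives) {
        long pf = positives.stream().filter(feature).count(); // P_f
        long nf = negatives.stream().filter(feature).count(); // N_f
        return (pf - nf) / (double) positives.size();         // P = |positives|
    }

    public static void main(String[] args) {
        Predicate<String> containsGGTAAG = s -> s.contains("GGTAAG");
        List<String> pos = List.of("AGGTAAGT", "TGGTAAGC", "CCGGTAAG", "ATATATAT");
        List<String> neg = List.of("ACACACAC", "GGTTAAGG", "TTTTTTTT", "CGCGCGCG");
        System.out.println(fitness(containsGGTAAG, pos, neg)); // 0.75
    }
}
```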
Hall of Fame
Previous research on EAs has noted that if parents die after producing offspring, there can be genetic drift or convergence to some local optimum [76]. This can result in the loss of some of the best individuals. The EFC algorithm addresses this issue by using an external store of features known as a hall of fame. As noted above, the best individuals in every generation are added to the hall of fame, and the hall of fame in return helps seed the population in each generation with randomly selected features. It should be noted that the values of these parameters should depend on the problem at hand. In general, keeping the fittest individuals in a hall of fame improves overall performance [103]. After execution of EFC, the features in the hall of fame are those submitted to the ensuing EFS algorithm.
Effective Feature Selection with EFS
The hall of fame features generated by EFC were selected on the basis of their individual performance. What is required for effective and efficient classification is to identify a relevant and non-redundant subset of features. EFS, a novel GA-based algorithm, is employed for this purpose and described below.
Feature Subset Representation in EFS
EFS evolves feature subsets by having individuals in the population correspond to feature subsets represented as binary strings. The length of each string is equal to the number of individuals in the hall of fame. A string of all '1's would correspond to the maximum subset, the hall of fame itself, and a string of all '0's would correspond to the empty subset. In addition to being a suitable representation for our purposes, binary representations in GAs are the standard ones and include a well-studied set of mutation and crossover operators.
Population and Generation Mechanism
The initial population consists of randomly generated binary strings, each of length equal to the hall-of-fame size and each representing a subset of selected features from the hall of fame. The GA implementation in EFS is generational; that is, after the offspring are created using mutation and crossover, the parents die. The population size remains constant throughout the generations in EFS, and the number of generations is fixed in advance. The best individual (feature subset) is tracked over the generations and constitutes the feature subset presented to a classifier for labeling new, unlabeled (testing) sequences.
Reproductive Operators
EFS uses a standard bit-flip mutation operator and standard uniform crossover, in which each bit is considered a crossover point with a fixed probability. It has been shown that employing uniform crossover along with bit-flip mutation is effective at balancing exploration and exploitation of search landscapes [101]. Parents for the reproductive operators are selected using standard fitness-proportional selection. Details are provided in Text S1.
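A minimal sketch of these two operators on binary subset strings follows; the operator rates shown are illustrative rather than the values used in EFS.

```java
// Sketch of EFS's subset operators: bit-flip mutation and uniform
// crossover on bit strings whose length equals the hall-of-fame size.
import java.util.Arrays;
import java.util.Random;

public class EfsOperators {
    static final Random RNG = new Random(7);

    // Flip each bit independently with the given rate.
    static boolean[] bitFlip(boolean[] parent, double rate) {
        boolean[] child = parent.clone();
        for (int i = 0; i < child.length; i++)
            if (RNG.nextDouble() < rate) child[i] = !child[i];
        return child;
    }

    // Uniform crossover: each position independently inherits from either parent.
    static boolean[] uniformCrossover(boolean[] a, boolean[] b, double swapProb) {
        boolean[] child = a.clone();
        for (int i = 0; i < child.length; i++)
            if (RNG.nextDouble() < swapProb) child[i] = b[i];
        return child;
    }

    public static void main(String[] args) {
        boolean[] a = {true, true, false, false, true};
        boolean[] b = {false, true, true, true, false};
        boolean[] c = uniformCrossover(a, b, 0.5);
        boolean[] m = bitFlip(c, 1.0 / a.length); // illustrative rate of 1/L
        System.out.println(Arrays.toString(m));
    }
}
```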
Fitness Function
Recall that the objective of EFS is to find a subset of features with high feature-class correlation, to retain discriminating power, but low feature-feature correlation, to reduce redundancy. EFS achieves this by employing a correlation-based fitness function [104]. Using a measure of feature correlation $r$ based on Pearson's correlation, a feature subset $S$ of $k$ features drawn from the constructed set, and a to-be-predicted class $c$, let the average feature-class correlation be

$$\bar{r}_{fc} = \frac{1}{k} \sum_{f_i \in S} r(f_i, c). \quad (1)$$

The average feature-feature correlation is given by

$$\bar{r}_{ff} = \frac{2}{k(k-1)} \sum_{f_i, f_j \in S,\, i < j} r(f_i, f_j). \quad (2)$$

Combining the two so as to maximize class-feature correlation while minimizing feature-feature correlation, and weighing by the number of features $k$, results in the following fitness function:

$$fitness(S) = \frac{k\,\bar{r}_{fc}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}. \quad (3)$$
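Numerically, Eqs. (1)-(3) can be computed as in the following sketch, which assumes absolute values of Pearson's correlation, as in the correlation-based feature selection of [104]; the toy data are our own.

```java
// Numerical form of Eqs. (1)-(3): average feature-class correlation over
// average feature-feature correlation, using absolute Pearson correlations.
public class CfsMerit {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // features[i][s] is the value of feature i on sample s; cls is the label.
    static double merit(double[][] features, double[] cls) {
        int k = features.length;
        double rcf = 0, rff = 0;
        for (int i = 0; i < k; i++) {
            rcf += Math.abs(pearson(features[i], cls));       // Eq. (1) numerator
            for (int j = i + 1; j < k; j++)
                rff += Math.abs(pearson(features[i], features[j])); // Eq. (2) numerator
        }
        rcf /= k;
        rff = k > 1 ? rff / (k * (k - 1) / 2.0) : 0;
        return k * rcf / Math.sqrt(k + k * (k - 1) * rff);    // Eq. (3)
    }

    public static void main(String[] args) {
        double[] cls = {1, 1, 1, 0, 0, 0};
        double[][] feats = {{1, 1, 0, 0, 0, 0},   // fairly predictive
                            {1, 0, 1, 1, 0, 0}};  // weakly predictive
        System.out.printf("merit = %.3f%n", merit(feats, cls));
    }
}
```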
Classifiers
The best (highest-fitness) individual obtained from EFS defines the feature subset to be used by a machine learning classifier. Generally, any classifier can be used, and our experimentation shows no significant differences among standard ones. Since the Naive Bayes (NB) classifier is the simplest and fastest, and is most effective when features have low correlation among themselves but high correlation with the class [105], [106], we employ NB as our classifier of choice. We use the Weka implementation of NB with a kernel density estimator.
Experimental Setting, Implementation Details, and Performance Measurements for Analysis
Experimental Setting
The experimental setting has been designed to support two forms of analysis. First, the features generated by EFFECT are made available for visual inspection and detailed analysis. Second, the experimental setting allows for a detailed analysis of the classification performance of a Naive Bayes classifier using EFFECT-generated features in comparison to a representative set of alternative machine learning approaches (as described in the Related Work section). As a baseline, a feature-based method is defined that uses spectrum (compositional) features and over-represented motifs as reported from alignments using Gibbs sampling. These features are fed to the same Naive Bayes classifier used to evaluate EFFECT-obtained features, for a direct comparison of features in the context of classification. A comprehensive comparison is also conducted with state-of-the-art statistical methods: PSSM, WAM, Bayes Tree Network with PWM, Markov Chain (MC), and Maximum Supervised Posterior (MSP). Their implementation is made possible through the Jstacs software package [107]. MSP is configured with PWM and homogeneous HMM classifiers as a generative mixture classifier. EFFECT is also compared to kernel methods. We focus on the two most successful kernel methods to date (as shown on the splice site prediction problem [12]), the weighted degree positional kernel (WD) and the weighted degree positional kernel with shift (WDS); the underlying classifier is an SVM.
To the extent possible, the methods selected for comparison have been tuned in order to obtain their best performance on each of the data sets considered in this paper, often in communication with the original developers. Details on our tuning protocols and resulting parameter values are posted on the http://www.cs.gmu.edu/~ashehu/?q=OurTools site we provide that lists the EFFECT code, documentation, and data sets.
Implementation Details
All experiments are performed on an Intel 2X 4-core machine with 3.2 GHz CPUs and 8 GB of RAM. The code for the EFC algorithm in EFFECT is written in Java, using the publicly-available ECJ toolkit [108] and BioJava [109] software packages. The code for the EFS algorithm in EFFECT is also in Java, using the GeneticSearch and CFSSubset techniques of the publicly-available WEKA package for machine learning. The implementation of the statistical methods employed for comparison is in Java, based on the publicly-available Jstacs package [107]. The kernel-based methods are implemented using the publicly-available Shogun toolkit [110] with the standard SVM implementation provided in the publicly-available LibSVM package [111]. The feature-based methods employed for baseline validation are implemented in Java. The resulting open-source software that we provide to the community for academic purposes includes not only the EFFECT framework but also our implementations of all the methods employed for comparison with their tuned parameters, along with datasets, features, and complete models.
Performance Measurements
Standard datasets used by other researchers are used in each of the three application settings to show the generality and power of the EFFECT framework. Since most of these datasets have an imbalance between the sizes of the positive and negative classes, classification accuracy is a largely meaningless performance measurement. For this reason, the analysis in this paper employs other evaluation criteria, such as the area under the Receiver Operating Characteristic curve (auROC) and the area under the Precision-Recall Curve (auPRC). All these are based on the basic notions of TP, FP, TN, and FN, which correspond to the numbers of true positives, false positives, true negatives, and false negatives. Details on common performance measurements for classification can be found in [112]. To briefly summarize what these measures capture, consider that predicted instances (sequences assigned a label by the classification model) can be ordered from most to least confident. Given a particular confidence threshold, the instances above the threshold are predicted as positive. The true positive and false positive rates can then be computed as one varies this threshold over its full range. In an ROC curve, one typically plots the true positive rate (TPR = TP/(TP+FN)) as a function of the false positive rate (FPR = FP/(FP+TN)). The auROC is a summary measure that indicates whether prediction performance is close to random (0.5) or perfect (1.0). Further details can be found in [112].
For unbalanced datasets, the auROC can be a misleading indicator of prediction performance; since this measure is independent of class size ratios, large auROC values may not necessarily indicate good performance. The auPRC is a better measure of performance when the class distribution is heavily unbalanced [113]. The PRC is sensitive to the fraction of negatives misclassified as positives, plotting precision (TP/(TP+FP)) vs. recall (this is TPR, sometimes referred to as sensitivity). Again, as one varies the threshold, precision is calculated at the threshold that achieves each recall value. The auPRC is a less forgiving measure, and a high value indicates that a classification model makes very few mistakes. Thus, the higher the auPRC value, the better.
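As an illustration of these definitions, the sketch below sweeps a confidence threshold over scored predictions and emits the ROC and precision-recall points from which auROC and auPRC are computed; the scores and labels are toy data of our own.

```java
// Illustrative computation of ROC and PR points by sweeping a confidence
// threshold, using TPR = TP/(TP+FN), FPR = FP/(FP+TN),
// precision = TP/(TP+FP).
import java.util.Arrays;
import java.util.Comparator;

public class RocPrPoints {
    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.55, 0.4};
        boolean[] labels = {true, true, false, true, false, false};

        Integer[] order = {0, 1, 2, 3, 4, 5};
        Arrays.sort(order, Comparator.comparingDouble(i -> -scores[i]));

        int p = 0, n = 0;
        for (boolean l : labels) { if (l) p++; else n++; }

        int tp = 0, fp = 0;
        for (int idx : order) {  // threshold just below each score in turn
            if (labels[idx]) tp++; else fp++;
            double tpr = tp / (double) p;        // recall / sensitivity
            double fpr = fp / (double) n;
            double precision = tp / (double) (tp + fp);
            System.out.printf("TPR=%.2f FPR=%.2f precision=%.2f%n",
                              tpr, fpr, precision);
        }
    }
}
```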
Performance is measured on both training and testing datasets for EFFECT and for all methods used for comparison. When testing datasets are not available for a particular application, 10-fold cross-validation is conducted instead. We used 1% of the training data, with an equal mix of both classes, as the evaluation set for tuning every method employed for comparison, reserving the remaining 99% of the training data for cross-validation. The idea is to train on a randomly-selected nine tenths of the data and then test on the rest; this is repeated ten times, and average performance is reported in terms of the evaluation criteria described above. Moreover, since the EFFECT framework employs stochastic search algorithms (EFC and EFS), it is run several independent times, resulting in several sets of features. Each set is evaluated in the context of classification performance (using NB). The reported performance measurements are averages over the values obtained for each set of features from a run of EFFECT. Paired t-tests are used to measure statistical significance at 95% confidence intervals.
It should be noted that many of the statistical learning (and kernel) methods used for comparison in this paper demand that all input sequences be of fixed length, whereas some of the available datasets consist of sequences of variable length. Typically, in such a setting, one can either use random symbols to "fill" shorter sequences up to a fixed maximum length or discard the shorter sequences. Since shorter sequences make up only a small percentage of the datasets in each application in this paper, we decided to discard them (additionally, our analysis indicates that doing so results in better performance than filling sequences with random symbols). We also point out that the parameters of each of the methods used for comparison have been tuned to achieve maximum performance for each method. Various classifier parameters (e.g., the cost parameter C in SVM) have also been tuned for this purpose. All tuned parameters are listed at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
Results
We summarize the performance of EFFECT on each of the three selected DNA sequence analysis applications: recognition of HS, splice, and ALU sites. The training (and, where available, testing) datasets employed are detailed first, followed by an empirical analysis of the results on each application.
Datasets
Benchmark data sets are selected for each of the three application settings in order to allow comparison with as many methods as possible.
Datasets for Recognition of HS sites
The dataset employed for evaluating the features constructed in EFC and for training the NB classifier is the one provided at noble.gs.washington.edu/proj/hs. This dataset consists of experimentally-determined sequences extracted from the human genome, comprising HS and non-HS ones. The HS sequences were identified employing cloning and in-vivo activity in K562 erythroid cells [114], whereas the non-HS sequences were collected and distributed proportionally throughout the human genome but found not to be hypersensitive when tested in the same cell type.
Datasets for Recognition of Splice Sites
A distinction is made between acceptor and donor splice sites. An acceptor splice site marks the start of an exon, whereas a donor splice site marks the end. These sites have different consensus sequences, and machine learning research has additionally shown they have different features of relevance for classification [13], [16]. For the purpose of feature construction and classification performance, splice site datasets are split into a donor subset and an acceptor subset, and evaluation is done separately on each subset.
The splice site recognition problem is well-studied in machine learning, and so many datasets have been accumulated over the years. We report performance on three datasets used as benchmarks in the recent literature. The first dataset, known as NN269, is extracted from human genes [115]. It consists of confirmed acceptor sequences, confirmed donor sequences, false acceptor sequences, and false donor sequences (acceptor sequences are 90 nucleotides long, whereas donor sequences are 15 nucleotides long). Further details on these sequences can be found in [115]. We split this dataset into a training and a testing dataset, each containing true acceptor, true donor, false acceptor, and false donor sequences.
Performance is reported on another dataset, extracted from the C. elegans (worm) genome and prepared as in [12], on which statistically-significant differences are observed in the comparative analysis between EFFECT and other methods. Briefly, the genome is aligned through blat with all known cDNA sequences available at http://www.wormbase.org and all known EST sequences in [116] to reveal splice sites. Equal-length donor and acceptor sequences are then extracted from the alignment, centered at the identified splice sites. Equal-length negative training sequences are centered around non-splice sites (selected in intronic regions), with negative acceptor and negative donor sequences constructed as in [12]. This dataset is too big to feasibly conduct a thorough comparative analysis with other methods and gather summary statistics over many runs. For this reason, we sample a smaller training set from the entire (positive and negative) dataset, preserving the ratio of positive to negative sequences in the original dataset.
Datasets for Recognition of ALU Sites
319 known ALU sequences were obtained from the NCBI website. This small set of sequences is considered to be representative of all ALU sequences in GenBank [117]. The average length is approximately 300 nucleotides. A negative training dataset was constructed at random, sampling sequences of similar length and with a nucleotide distribution similar to that found over the ALU sequences.
Comparative Analysis
Empirical Analysis on Recognition of HS Sites
Given the availability of only a training set in this setting, 10-fold cross-validation is used to measure the classification performance of the NB classifier on features obtained through EFFECT and to compare it to the other methods summarized above. We recall that EFFECT is run several independent times, and the auROC and auPRC measurements reported are averages over these runs as well. Table 3 compares EFFECT to all the methods employed for comparison in terms of auROC and auPRC values. As Table 3 shows, EFFECT achieves the highest performance both in terms of auROC (89.7%) and auPRC (89.2%). For comparison, MSP achieves the second-highest auROC (85.5%), and K-mer (feature-based, with spectrum features in an SVM) achieves the second-highest auPRC (82.6%). Paired t-tests at 95% confidence intervals indicate that the reported values for EFFECT are statistically significant (data not shown). Taken together, this comparative analysis demonstrates that the quality of the features found by EFFECT is such that even a simple classifier, such as NB, achieves classification performance comparable to sophisticated methods for HS site recognition.
Table 3. auROC and auPRC comparison analysis for recognition of HS sites.
Algorithm | auROC | auPRC |
Feature-based | ||
K-mer | 82.20 | 82.6 |
Gibbs Sampling | 79.3 | 50.3 |
EFFECT | 89.7 | 89.2 |
Statistical-based | ||
PWM-HMM | 70.8 | 47.8 |
BayesNetwork | 72.5 | 49.5 |
HomogenousHMM | 82.02 | 71.5 |
WAM-HMM | 80.05 | 70.0 |
MSP | 85.5 | 72.9 |
Kernel-based | ||
WeightedPosition | 80.01 | 62.3 |
WeightedPositionShift | 80.93 | 64.9 |
Empirical Analysis on Recognition of Splice Sites
Our analysis first proceeds on the NN269 dataset. We recall that the analysis (as well as construction and selection of features and training of classifiers) is conducted separately for the acceptor and donor datasets. Table 4 compares auROC and auPRC values obtained on the testing sequences in each dataset. EFFECT and the kernel-based methods have the highest performance in both auROC and auPRC on the acceptor dataset (EFFECT is second best), and all methods are comparable on the donor dataset (with the exception of the homogeneous HMM and K-mer methods, which perform worst). However, the t-test analysis indicates that none of the differences in performance among the methods is statistically significant on the NN269 splice site dataset.
Table 4. auROC and auPRC comparison analysis for recognition of splice sites on NN269 dataset.
ACCEPTOR | DONOR | |||
Algorithm | auROC | auPRC | auROC | AuPRC |
Feature | ||||
K-mer | 63.3 | 75.5 | 90.8 | 90.1 |
Gibbs Sampling | 62.8 | 72.4 | 88.8 | 90.5 |
EFFECT | 97.7 | 94.3 | 98.2 | 92.81 |
Statistical | ||||
PWM | 97.1 | 90.6 | 97.7 | 91.9 |
BayesNetwork | 97.25 | 90.6 | 97.7 | 90.9 |
HomogenousHMM | 59.2 | 26.3 | 86.3 | 71.5 |
InHomogenousHMM | 96.78 | 88.41 | 98.18 | 92.42 |
MSP | ||||
Kernel | ||||
WeightedDegreePosition | 98.16 | 92.53 | 98.5 | 92.86 |
WeightedDegreePositionShift | 98.65 | 94.36 | 98.13 | 92.47 |
In a second analysis, the C. elegans splice site dataset is employed as a training dataset. 10-fold cross-validation on highly unbalanced positive over negative datasets (the positive dataset in both the donor and acceptor setting is a small fraction of the entire dataset) clearly separates performance among the different methods. Table 5 shows that EFFECT and the kernel-based methods achieve the highest performance in terms of auROC on the acceptor dataset (around 99% for kernel-based methods and 98% for EFFECT). The two are top performers in terms of auROC on the donor dataset as well (close to 100% for kernel-based methods and 96.7% for EFFECT). However, the imbalance between the positive and negative datasets in each setting results in EFFECT obtaining a higher auPRC value on both the acceptor and donor datasets. On the acceptor dataset, EFFECT obtains an auPRC of 90.2%, followed by kernel-based methods with a value of 89.1% (6 of the 9 methods used for comparison obtain auPRCs below 16%). On the donor dataset, EFFECT obtains an auPRC of 91.3%, followed by kernel-based methods with a value of 90.1% (5 of the 9 methods used for comparison obtain auPRCs below 14%). The robust performance of EFFECT even on a highly unbalanced dataset suggests that the bias introduced in the fitness function of the EFC algorithm to improve precision while managing the discriminative power of features gives the algorithm an edge in terms of auPRC.
Table 5. auROC and auPRC comparison analysis for recognition of splice sites on C. elegans dataset.
ACCEPTOR | DONOR | |||
Algorithm | auROC | auPRC | auROC | auPRC |
Feature | ||||
K-mer | 88.2 | 15.8 | 83.1 | 6.2 |
Gibbs Sampling | 84.2 | 80.4 | 79.1 | 80.3 |
EFFECT | 97.9 | 90.2 | 96.7 | 91.3 |
Statistical | ||||
PWM | 63.6 | 7.02 | 62.5 | 4.8 |
BayesNetwork | 64.2 | 6.9 | ||
HomogenousHMM | 75.03 | 12.62 | 78.3 | 13.9 |
InHomogenousHMM | 75.71 | 11.3 | 77.9 | 12.3 |
MSP | 76.8 | 13.9 | 78.21 | 13.5 |
Kernel | ||||
WeightedDegreePosition | 99.36 | 86.7 | 99.5 | 88.2 |
WeightedDegreePositionShift | 99.2 | 89.1 | 99.8 | 90.1 |
Empirical Analysis on Recognition of ALU Sites
As in the HS site setting, the availability of only a training dataset for the ALU recognition problem limits us to 10-fold cross-validation. The comparative analysis is conducted only in terms of auROC, as the ALU training dataset has balanced positive and negative subsets. Table 6 shows that EFFECT achieves the highest performance over the other methods, with a mean auROC of 98.9%. For comparison, the second-best value is obtained by one of the kernel-based methods (auROC of 97.8%). Again, the values reported for EFFECT are statistically significant, as indicated by a t-test at a 95% confidence interval. It is worth noting, additionally, that in the strict context of feature-based methods, EFFECT does not risk overfitting the NB classifier. Recall that the dataset for ALU sites is small, consisting of 319 sequences; the number of features should not exceed the size of the dataset. Yet, the number of spectrum features used by a k-mer-based method (with SVM) far exceeds this, and limiting the number of features with Gibbs sampling still results in a large number of motif features.
Table 6. auROC comparison analysis for recognition of ALU sites.
Algorithm | auROC |
Feature | |
K-mer | 94.20 |
Gibbs Sampling | 95.2 |
EFFECT | 98.9 |
Statistical | |
PWM-HMM | 77.45 |
BayesNetwork | 86.82 |
HomogenousHMM | 93.6 |
WAM-HMM | 94.59 |
MSP | 93.54 |
Kernel | |
WeightedPosition | 96.9 |
WeightedPositionShift | 97.8 |
Detailed Analysis of Features Obtained By the EFFECT Framework
We now further analyze the features found by EFFECT in each of the three application settings. While ambiguous symbols were used in the alphabet for feature construction in the HS and ALU site detection problems, the basic set {A, C, G, T} was used for the splice site recognition problem. During evaluation on the 1% of training data on which we tuned the methods employed for comparison to EFFECT in this paper, we found that symbol ambiguity was not useful for splice site recognition due to the presence of decoys.
HS Site Features
The entire HS dataset described above is used to obtain features through the EFFECT framework. Features reported are found to contain many compositional motifs, such as CGCG, CGCGAGA, and (A/G)GG(T/G). Positional features with slight shifts recorded the presence of short 2-mers, such as CG, and long 8-mers, such as CTTCCGCC. Correlational features recorded the simultaneous presence of GAT and ATCT, and that of CATTT and (G/T)GGC. Interestingly, these last two features have been reported by other researchers as having important biological significance for maturation, silencer, and enhancer effects [118], [119]. Lastly, various features recorded the presence of CG patterns, such as CGMS, CGMSN, and CGSBN, which confirms current knowledge that HS sites are rich in CG nucleotides [25].
Splice Site Features
On the NN269 dataset, EFFECT reported many positional features, such as (C/A)AGGTAAG and (T/C)(T/C)CCAGGT. Note that these features match the donor and acceptor consensus sequences exactly. An interesting complex conjunction feature was reported, containing three positional features, CG, GA, and AG, around positions 10 to 17 nt in the acceptor region. This is in good agreement with known acceptor region signals reported by other studies [120]. On the C. elegans dataset, EFFECT reported many regional features, such as the 7-mer motifs GGTAAGT, AGGTAAG, and GGTAGGT around position -43 nt, matching the donor consensus sequence AGGTAAGT. Another important positional feature, containing the TAAT motif in the region -18 to -14 nt, was also reported. We note that this motif is a well-known branch site signal [120]. Shift-positional features around position -3 nt recorded the presence of motifs such as TTTCAGG and TTTCAGA, matching the known acceptor consensus sequence TTTCAG(A/G) exactly.
ALU Site Features
On the ALU dataset, EFFECT reported many compositional features, such as the motifs AAAAAA, AAAAT, AGCCT, CCCAG, and CCTGT. These are well-known signals in ALU repeats [121]. An interesting disjunctive feature was also reported, consisting of two correlational sub-features (CCTR, AAT, shift 3) and (CA, GY, shift 3) and a compositional feature TGG. This feature is shown in Figure 4. We additionally performed a clustal alignment on the whole ALU dataset, shown in Figure 5, and found the three sub-features of the disjunctive feature found by EFFECT and shown in Figure 4 to be indeed over-represented in the ALU dataset. This finding further highlights the importance of using ambiguous symbols in the representation for matching pyrimidines. Finally, additional disjunctive features recorded the presence of motifs such as CCTGG, CTGGGG, and GAGGC, further showcasing the ability of EFFECT to combine the presence of lower-level signals into interesting higher-order features.
Statistical Analysis of Obtained Features
Our detailed feature analysis concludes with measuring the information gain (IG) of each feature in the set reported by EFFECT. For a dataset $D$ with classes ranging from $1$ to $k$, the information-theoretic entropy $H(D)$ is given by:

$$H(D) = -\sum_{i=1}^{k} p_i \log_2 p_i, \quad (4)$$

where $p_i$ is the proportion of $D$ belonging to class $i$. For a feature $F$ taking on $v$ different values, the weighted sum of its expected information (over the splits of the dataset according to the different values of $F$ into subsets $D_j$, with $j$ ranging from $1$ to $v$) is given by:

$$H_F(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, H(D_j). \quad (5)$$

The information gain IG for a feature $F$ over a dataset $D$ is then given by:

$$IG(F, D) = H(D) - H_F(D). \quad (6)$$
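The sketch below evaluates Eqs. (4)-(6) for a binary feature on a toy two-class dataset; the counts are our own illustration.

```java
// Eqs. (4)-(6) in code: entropy of the class labels, expected entropy
// after splitting on a binary feature, and their difference (IG).
public class InfoGain {
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0;
        for (int c : classCounts)
            if (c > 0) {
                double pi = c / (double) total;
                h -= pi * (Math.log(pi) / Math.log(2));
            }
        return h;
    }

    public static void main(String[] args) {
        // Binary feature F over a two-class dataset D of 100 sequences:
        // F present: 40 positives, 5 negatives; F absent: 10 positives, 45 negatives.
        int[][] split = {{40, 5}, {10, 45}};
        int[] all = {50, 50};
        double hD = entropy(all);                       // Eq. (4)
        double hF = 0;
        for (int[] dj : split) {                        // Eq. (5)
            int nj = dj[0] + dj[1];
            hF += (nj / 100.0) * entropy(dj);
        }
        System.out.printf("IG = %.3f%n", hD - hF);      // Eq. (6)
    }
}
```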
Figure 6 shows that, for HS sequences, the mean information gain of the features reported by EFFECT is several times higher than that of the Gibbs sampling and k-mer features. We note that the number of features reported by EFFECT is also much smaller than the numbers of Gibbs sampling and k-mer features.
Figure 7 shows the same pattern for acceptor splice sites: the mean information gain of the EFFECT features is several times higher than that of the Gibbs sampling and k-mer features, while the number of features generated by EFFECT is much smaller. Figure 8 shows the analogous result for donor splice sites, with EFFECT features again exhibiting a several-fold higher mean information gain with far fewer features.
Figure 9 shows that the same holds for ALU sequences: the mean information gain of the EFFECT features is several times higher than that of the Gibbs sampling and k-mer features, and the number of features generated by EFFECT is much smaller.
Taken together, this analysis demonstrates that the EFFECT framework generates fewer but statistically more discriminating features, which is one of the most desired qualities of feature construction algorithms.
Discussion
In this paper we describe and evaluate EFFECT, a computational framework to automate the process of extracting discriminatory features for determining functional properties of biological sequences. Using basic domain knowledge to identify the fundamental building blocks of potential features, EFFECT constructs complex discriminatory features from these building blocks in a two-stage process. First, an evolutionary algorithm, EFC, constructs a set of potentially useful complex features. A second evolutionary algorithm, EFS, reduces the size of this feature set to a collectively effective subset.
The key to this approach is the use of a GP-based EA capable of efficiently constructing complex features from an appropriate set of basic building blocks. The generality of the approach is obtained by allowing more general building blocks than the basic sequence elements and by providing a flexible way of describing positional information. The effectiveness of the approach is enhanced by a novel feature selection phase. The power and the versatility of this approach are demonstrated by its application to three important problem areas: the recognition of hypersensitive, splice, and ALU sites in DNA sequences.
An important observation is the precision with which the constructed features characterize complex discriminatory patterns. Figure 5 illustrates some of the sequence patterns matched by the feature shown in Figure 4. If we imagine using basic spectrum k-mers for the same dataset, it would take a significant number of k-mers to capture the same information. More importantly, the positional, correlational, and compositional context would not be captured. This would not only result in lower information gain at the cost of a higher number of features, as clearly seen in the earlier analysis of information gain, but would also generate a large number of false positives. Markov models and positional matrix-based algorithms would capture more of the patterns outlined in the example, but not the complex combinations that EFC does.
In addition, the complex features constructed by EFFECT can frequently be interpreted in meaningful ways by domain experts, providing additional insights into the determination of functional properties. Our web site lists the top features constructed by EFFECT for the three application settings presented in this paper. We encourage interested researchers to study them directly for further insights.
Finally, we hope that the source code we provide will give the research community a powerful tool to support further investigations in other application settings. For example, interesting problems involving amino-acid sequences can be pursued with the EFFECT framework. In such settings, simple approaches based on enumeration of features are impractical unless the amino-acid alphabet is drastically simplified. The proposed framework allows large feature spaces to be explored while retaining more of the characteristics of the amino acids. While further problem-specific details can be explored, the investigation can begin simply by replacing the DNA alphabet employed in this paper.
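As a minimal illustration of that starting point, and under the assumption that motif-generating GP leaves sample from a configurable alphabet, the snippet below shows the one ingredient that necessarily changes when moving from DNA to protein sequences; the names are ours.

```java
import java.util.Random;

// Sketch of the alphabet swap suggested above: the building blocks are
// drawn from a character set, so DNA can be replaced by the standard
// 20-letter amino-acid alphabet. Names are illustrative placeholders.
final class Alphabets {
    static final String DNA = "ACGT";
    static final String PROTEIN = "ACDEFGHIKLMNPQRSTVWY";

    // GP leaf initialization: sample a random motif over the chosen alphabet.
    static String randomMotif(String alphabet, int length, Random rng) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
            sb.append(alphabet.charAt(rng.nextInt(alphabet.length())));
        return sb.toString();
    }
}
```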
Supporting Information
Acknowledgments
The authors thank members of the De Jong and Shehu labs for valuable comments on this work.
Funding Statement
The authors have no funding or support to report.
References
- 1. ENCODE Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489: 457–474.
- 2. Majoros WH, Pertea M, Antonescu C, Salzberg SL (2003) GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders. Nucl Acids Res 31: 3601–3604.
- 3. Bajic V, Brent M, Brown R, Frankish A, Harrow J, et al. (2006) Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment. Genome Biology 7(Suppl 1): S3.
- 4. Mathé C, Sagot MF, Schiex T, Rouzé P (2002) Current methods of gene prediction, their strengths and weaknesses. Nucl Acids Res 30: 4103–4117.
- 5. Stamatoyannopoulos JA (2012) What does our genome encode? Genome Res 22: 1602–1611.
- 6. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268: 78–94.
- 7. Pertea M, Lin X, Salzberg SL (2001) GeneSplicer: a new computational method for splice site prediction. Nucl Acids Res 29: 1185–1190.
- 8. Kim W, Wilbur WJ (2002) DNA splice site detection: a comparison of specific and general methods. In: AMIA Symp. pp. 390–394.
- 9. Sonnenburg S, Rätsch G, Jagota A, Müller K (2002) New methods for splice-site recognition. In: Proc Intl Conf on Artificial Neural Networks. Springer-Verlag, pp. 329–336.
- 10. Rätsch G, Sonnenburg S (2004) Accurate splice site detection for Caenorhabditis elegans. In: Kernel Methods in Computational Biology. MIT Press, pp. 277–298.
- 11. Yeo G (2004) Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comp Biol 11: 377–394.
- 12. Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G (2007) Accurate splice site prediction using support vector machines. BMC Bioinformatics 8: S7.
- 13. Islamaj-Dogan R, Getoor L, Wilbur WJ (2006) A feature generation algorithm for sequences with application to splice-site prediction. In: Lecture Notes in Computer Science: Knowledge Discovery in Databases, Springer, volume 4213, pp. 553–560.
- 14. Islamaj-Dogan R, Getoor L, Wilbur WJ, Mount SM (2007) Features generated for computational splice-site prediction correspond to functional elements. BMC Bioinformatics 8: 410–416.
- 15. Kamath U, Shehu A, De Jong KA (2010) Using evolutionary computation to improve SVM classification. In: WCCI: IEEE World Conf. Comp. Intel. Barcelona, Spain: IEEE, pp. 1–8.
- 16. Kamath U, Compton J, Islamaj Dogan R, De Jong KA, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice-site prediction. IEEE Trans Comp Biol and Bioinf 9: 1387–1398.
- 17. Maston GA, Evans SK, Green MR (2006) Transcriptional regulatory elements in the human genome. Annu Rev Genom Human Genet 7: 29–59.
- 18. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, et al. (2006) Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res 16: 656–668.
- 19. Sinha S, Tompa M (2002) Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucl Acids Res 30: 5549–5560.
- 20. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, et al. (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA 99: 757–762.
- 21. Tompa M, Li N, Bailey TL, Church GM, De Moor B, et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23: 137–144.
- 22. Wu C (1980) The 5′ ends of Drosophila heat shock genes in chromatin are hypersensitive to DNase I. Nature 286: 854–860.
- 23. Gross DS, Garrard WT (1988) Nuclear hypersensitive sites in chromatin. Annu Rev Biochem 57: 159–197.
- 24. Lowrey CH, Bodine DM, Nienhuis AW (1992) Mechanism of DNase I hypersensitive site formation within the human globin locus control region. Proc Natl Acad Sci USA 89: 1143–1147.
- 25. Noble WS, Kuehn S, Thurman R, Yu M, Stamatoyannopoulos JA (2005) Predicting the in vivo signature of human gene regulatory sequences. Bioinformatics 21: i338–i343.
- 26. Kamath U, De Jong KA, Shehu A (2010) Selecting predictive features for recognition of hypersensitive sites of regulatory genomic sequences with an evolutionary algorithm. In: GECCO: Gen. Evol. Comp. Conf. New York, NY, USA: ACM, pp. 179–186.
- 27. Kamath U, Shehu A, De Jong KA (2010) Feature and kernel evolution for recognition of hypersensitive sites in DNA sequences. In: BIONETICS: Intl. Conf. on Bio-inspired Models of Network, Information, and Computing Systems. Boston, MA: Springer, pp. 213–238.
- 28. Kamath U, Shehu A, De Jong KA (2011) A two-stage evolutionary approach for effective classification of hypersensitive DNA sequences. J Bioinf & Comp Biol 9: 399–413.
- 29. Jurka J (1993) A new subfamily of recently retroposed human Alu repeats. Nucl Acids Res 21: 2252.
- 30. Claverie JM, Makalowski W (1994) Alu alert. Nature 371: 752.
- 31. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.
- 32. Deininger P (2011) Alu elements: know the SINEs. Genome Biol 12: 236–249.
- 33. Dharmasaroja P (2011) Artificial neural networks and support vector machine identify Alu elements as being associated with human housekeeping genes. In: Intl. Conf. on Biomedical Engineering and Informatics (BMEI), volume 3, pp. 1664–1668. doi:10.1109/BMEI.2011.6098522.
- 34. Cui F, Sirotin MV, Zhurkin VB (2011) Impact of Alu repeats on the evolution of human p53 binding sites. Biology Direct 6: 1–20.
- 35. Zhang W, Edwards A, Fan W, Deininger P, Zhang K (2011) Alu distribution and mutation types of cancer genes. BMC Genomics 12: 157–172.
- 36. Smialowski P, Schmidt T, Cox J, Kirschner A, Frishman D (2006) Will my protein crystallize? A sequence-based predictor. Proteins: Struct Funct Bioinf 62: 343–355.
- 37. Habib T, Zhang C, Yang JY, Yang MQ, Deng Y (2008) Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition. BMC Genom 9: S1–S16.
- 38. Kernytsky A, Rost B (2009) Using genetic algorithms to select most predictive protein features. Proteins: Struct Funct Bioinf 75: 75–88.
- 39. Veltri D, Shehu A (2013) Physicochemical determinants of antimicrobial activity. In: Intl Conf on Bioinf and Comp Biol (BICoB). Honolulu, Hawaii, pp. 1–6.
- 40. Randou EG, Veltri D, Shehu A (2013) Systematic analysis of global features and model building for recognition of antimicrobial peptides. In: ICCABS: IEEE Intl Conf on Comput Adv in Bio and Med Sciences. New Orleans, LA, pp. 1–6.
- 41. Karchin R, Cline M, Mandel-Gutfreund Y, Karplus K (2003) Hidden Markov Models that use predicted local structures for fold recognition: alphabets of backbone geometry. Proteins 51: 504–514.
- 42. Ivankov DN, Finkelstein AV (2004) Prediction of protein folding rates from the amino-acid sequence-predicted secondary structure. Proc Natl Acad Sci USA 101: 8942–8944.
- 43. Bystroff C, Krogh A (2008) Hidden Markov Models for prediction of protein features. Methods Mol Biol 413: 173–198.
- 44. Cole C, Barber JD, Barton GJ (2008) The Jpred 3 secondary structure prediction server. Nucl Acids Res 36: W197–W201.
- 45. Whisstock JC, Lesk AM (2003) Prediction of protein function from protein sequence and structure. Q Rev Biophys 36: 307–340.
- 46. Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Mol Sys Biol 3: 1–13.
- 47. Luque-Escamilla PL, Martinez-Aroza J, Oliver JL (2005) Compositional searching of CpG islands in the human genome. Phys Rev E 71: 061925.
- 48. Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Neural Information Processing Systems, pp. 1–8.
- 49. Bishop CM, Lasserre J (2007) Generative or discriminative? Getting the best of both worlds. Bayesian Statistics 8: 3–24.
- 50. Bishop CM (2006) Pattern Recognition and Machine Learning. Singapore: Springer.
- 51. Keilwagen J, Grau J, Posch S, Strickert M, Grosse I (2010) Unifying generative and discriminative learning principles. BMC Bioinformatics 11: 1–9.
- 52. Staden R (1984) Methods to locate signals in nucleic acid sequences. Nucl Acids Res 12: 505–519.
- 53. Gershenzon NI, Stormo GD, Ioshikhes IP (2005) Computational technique for improvement of the position-weight matrices for the DNA/protein binding sites. Nucl Acids Res 33: 2290–2301.
- 54. Taher L, Meinicke P, Morgenstern B (2007) On splice site prediction using weight array models: a comparison of smoothing techniques. J of Physics: Conference Series 90: 012004.
- 55. Xing EP, Jordan MI, Karp RM, Russell S (2002) A hierarchical Bayesian Markovian model for motifs in biopolymer sequences. In: Advances in Neural Information Processing Systems, pp. 200–207.
- 56. Keilwagen J, Grau J, Paponov IA, Posch S, Strickert M, et al. (2011) De-novo discovery of differentially abundant transcription factor binding sites including their positional preference. PLoS Comp Biol 7: e1001070.
- 57. Cai D, Delcher A, Kao B, Kasif S (2000) Modeling splice sites with Bayes networks. Bioinformatics 16: 152–158.
- 58. Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, et al. (2005) Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 21: 2657–2666.
- 59. Yakhnenko O, Silvescu A, Honavar V (2005) Discriminatively trained Markov model for sequence classification. In: ICDM: IEEE Intl Conf on Data Mining, pp. 1–8.
- 60. Bernal A, Crammer K, Hatzigeorgiou A, Pereira F (2007) Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comp Biol 3: e54.
- 61. Barash Y, Elidan G, Friedman N, Kaplan T (2003) Modeling dependencies in protein-DNA binding sites. In: RECOMB: Intl Conf on Res in Comput Mol Biol. ACM Press, pp. 1–8.
- 62. King OD, Roth FP (2003) A non-parametric model for transcription factor binding sites. Nucl Acids Res 31: e116.
- 63. Vapnik VN (1995) The Nature of Statistical Learning Theory. New York, NY: Springer.
- 64. Noble WS (2004) Support vector machine applications in computational biology. In: Schölkopf B, Tsuda K, Vert JP, editors, Kernel Methods in Computational Biology. Cambridge, MA: MIT Press, pp. 71–92.
- 65. Sonnenburg S, Zien A, Rätsch G (2006) ARTS: accurate recognition of transcription starts in human. Bioinformatics 22: e472–e480.
- 66. Tech M, Pfeifer N, Morgenstern B, Meinicke P (2005) TICO: a tool for improving predictions of prokaryotic translation initiation sites. Bioinformatics 21: 3568–3569.
- 67. Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, et al. (2009) mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res 19: 2133–2143.
- 68. Jiang B, Zhang MQ, Zhang X (2007) OSCAR: one-class SVM for accurate recognition of cis-elements. Bioinformatics 23: 2823–2838.
- 69. Schultheiss SJ (2010) Kernel-based identification of regulatory modules. In: Computational Biology of Transcription Factor Binding Sites, Springer, volume 674 of Methods Mol Biol, pp. 213–223.
- 70. Leslie C, Eskin E, Noble WS (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Pacific Symposium on Biocomputing, volume 7, pp. 564–575.
- 71. Zhou X, Ruan J, Wang G, Zhang W (2007) Characterization and identification of microRNA core promoters in four model species. PLoS Comp Biol 3: e37.
- 72. Anwar F, Baker SM, Jabid T, Hasan MM, Shoyaib M, et al. (2008) Pol II promoter prediction using characteristic 4-mer motifs: a machine learning approach. BMC Bioinformatics 9: 414–421.
- 73. Fletez-Brant C, Lee D, McCallion AS, Beer MA (2013) kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucl Acids Res 41: W544–W556.
- 74. Chuzhanova NA, Jones AJ, Margetts S (1998) Feature selection for genetic sequence classification. Bioinformatics 14: 139–143.
- 75. Riviere R, Barth D, Cohen J, Denise A (2007) Shuffling biological sequences with motif constraints. J Discrete Algo 6: 192–204.
- 76. De Jong KA (2001) Evolutionary Computation: A Unified Approach. Cambridge, MA: MIT Press.
- 77. Siedlecki W, Sklansky J (1989) A note on genetic algorithms for large-scale feature selection. Pattern Recogn Lett 10: 335–347.
- 78. Brill FA, Brown DE, Martin WN (1992) Fast genetic selection of features for neural networks. IEEE Trans on Neural Networks 3: 324–328.
- 79. Kuncheva LI, Jain LC (1999) Nearest neighbor classifier: simultaneous editing and feature selection. Pattern Recogn Lett 20: 1149–1156.
- 80. Raymer ML, Punch WF, Goodman ED, Kuhn LA, Jain AK (2000) Dimensionality reduction using genetic algorithms. IEEE Trans Evol Comput 4: 164–171.
- 81. Oh IS, Lee JS, Moon BR (2004) Hybrid genetic algorithms for feature selection. IEEE Trans Pattern Anal Mach Intell 26: 1424–1437.
- 82. Huang J, Cai Y, Xu X (2007) A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recogn Lett 28: 1825–1844.
- 83. Leardi R, Boggia R, Terrile M (1992) Genetic algorithms as a strategy for feature selection. J Chemometrics 6: 267–281.
- 84. Smith SF (1980) A Learning System Based on Genetic Adaptive Algorithms. Ph.D. thesis, University of Pittsburgh.
- 85. Cramer NL (1985) A representation for the adaptive generation of simple sequential programs. In: Intl. Conf. on Genet. Algo. and the Applications. Pittsburgh, PA, pp. 183–187.
- 86. Schmidhuber J (1987) Evolutionary principles in self-referential learning. Ph.D. thesis, Tech. Univ. Munich.
- 87. Koza JR (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.
- 88. Venkatraman V, Dalby AR, Yang ZR (2004) Evaluation of mutual information and genetic programming for feature selection in QSAR. J Chem Inf Comput Sci 44: 1686–1692.
- 89. Muni DP, Pal NR, Das J (2006) Genetic programming for simultaneous feature selection and classifier design. IEEE Trans Syst Man Cybern B 36: 106–117.
- 90. Yu J, Yu J, Almal AA, Dhanasekaran SM, Ghosh D, et al. (2007) Feature selection and molecular classification of cancer using genetic programming. Neoplasia 9: 292–303.
- 91. Davis RA, Charlton AJ, Oehlschlager S, Wilson JC (2005) Novel feature selection method for genetic programming using metabolomic 1H NMR data. Chemometrics and Intell Laboratory Sys 81: 50–59.
- 92. Ramirez R, Puiggros M (2007) A genetic programming approach to feature selection and classification of instantaneous cognitive states. Lecture Notes in Computer Science: Applications of Evolutionary Computing 4448: 311–319.
- 93. Bins J (2000) Feature selection of huge feature sets in the context of computer vision. Ph.D. thesis, Colorado State University.
- 94. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artificial Intelligence 97: 273–324.
- 95. Blum AL, Rivest RL (1992) Training a 3-node neural network is NP-complete. Neural Networks 5: 117–127.
- 96. Hyafil L, Rivest RL (1976) Constructing optimal binary decision trees is NP-complete. Information Processing Letters 5: 15–17.
- 97. Kittler J (1978) Feature set search algorithms. In: Pattern Recognition and Signal Processing. Alphen aan den Rijn, The Netherlands: Sijthoff & Noordhoff, pp. 41–60.
- 98. Siedlecki W, Sklansky J (1988) On automatic feature selection. Intl J of Pattern Recognition and Artificial Intelligence 2: 197–220.
- 99. Liu H (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans on Knowledge and Data Engineering 17: 491–502.
- 100. Nomenclature Committee of the International Union of Biochemistry (NC-IUB) (1985) Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Biochemistry 229: 75–88.
- 101. Spears WM (1992) Crossover or mutation? Foundations of Genetic Algorithms 2: 221–237.
- 102. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artificial Intelligence 97: 273–324.
- 103. Rosin CD, Belew RK (1997) New methods for competitive coevolution. Evol Comput 5: 1–29.
- 104. Hall MA (1999) Correlation-based Feature Selection for Machine Learning. Ph.D. thesis, University of Waikato, Hamilton, New Zealand.
- 105. Rish I, Hellerstein J, Thathachar J (2001) An analysis of data characteristics that affect naive Bayes performance. Technical report, IBM T.J. Watson Research Center.
- 106. Rish I (2001) An empirical study of the naive Bayes classifier. In: IJCAI Workshop on Empirical Methods in AI, pp. 41–46.
- 107. Grau J, Keilwagen J, Gohr A, Haldemann B, Posch S, et al. (2012) Jstacs: a Java framework for statistical analysis and classification of biological sequences. J Mach Learn Res 13: 1967–1971.
- 108. Luke S, Panait L, Balan G, Paus S, Skolicki Z, et al. (2010) ECJ: a Java-based evolutionary computation research system. URL http://cs.gmu.edu/~eclab/projects/ecj/.
- 109. Holland RC, Down TA, Pocock M, Prlic A, Huen D, et al. (2008) BioJava: an open-source framework for bioinformatics. Bioinformatics 24: 2096–2097.
- 110. Sonnenburg S, Rätsch G, Henschel S, Widmer C, Behr J, et al. (2010) The SHOGUN machine learning toolbox. J Mach Learn Res 11: 1799–1802.
- 111. Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Online.
- 112. Mitchell TM (1997) Machine Learning. Boston, MA: McGraw-Hill, 1st edition, 414 pp.
- 113. Davis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. In: Intl Conf on Mach. Learn. (ICML). New York, NY, USA: ACM, pp. 233–240.
- 114. Sabo PJ, Humbert R, Hawrylycz M, Wallace JC, Dorschner MO, et al. (2004) Genome-wide identification of DNase I hypersensitive sites using active chromatin sequence libraries. Proc Natl Acad Sci USA 101: 4537–4542.
- 115. Reese MG, Eeckman F, Kulp D, Haussler D (1997) Improved splice site detection in Genie. J Comput Biol 4: 311–323.
- 116. Boguski MS, Lowe TM, Tolstoshev CM (1993) dbEST: database for "expressed sequence tags". Nature Genetics 4: 332–333.
- 117. Claverie JM, Makalowski W (1994) Alu alert. Nature 371: 752.
- 118. Iwamoto S, Suganuma H, Kamesaki T, Omi T, Okuda H, et al. (2000) Cloning and characterization of erythroid-specific DNase I-hypersensitive site in human Rhesus-associated glycoprotein gene. J Biol Chem 275: 27324–27331.
- 119. Tuan D, London IM (1984) Mapping of DNase I-hypersensitive sites in the upstream DNA of human embryonic epsilon-globin gene in K562 leukemia cells. Proc Natl Acad Sci USA 81: 2718–2722.
- 120. Sonnenburg S, Zien A, Philips P, Rätsch G (2008) POIMs: positional oligomer importance matrices — understanding support vector machine based signal detectors. Bioinformatics 24: i6–i14.
- 121. Ma Q, Wang JT, Wu CH (1998) Detection of Alu sequences in DNA: a neural network approach. In: Proceedings of the Fourth Joint Conference on Information Sciences, volume 1, pp. 392–395.