Protein Science : A Publication of the Protein Society. 2018 Oct 23;27(10):1857–1870. doi: 10.1002/pro.3494

The CaspBase: a curated database for evolutionary biochemical studies of caspase functional divergence and ancestral sequence inference

Robert D Grinshpon 2, Anna Williford 1, James Titus‐McQuillan 1, A Clay Clark 1
PMCID: PMC6199153  PMID: 30076665

Abstract

Sequence databases are powerful tools for the contemporary scientists’ toolkit. However, most functional annotations in public databases are determined computationally and are not verified by a human expert. While hypotheses generated from computational studies are now amenable to experimentation, the quality of the results relies on the quality of input data. We developed the CaspBase to expedite high‐quality dataset compilation of annotated caspase sequences, to maximize phylogenetic signal, and to reduce the noise contributed from public databanks. We describe our methods of curation for the CaspBase and how researchers can acquire sequences from http://caspbase.org. Our immediate goal for developing the CaspBase was to optimize the ancestral protein reconstruction (APR) of caspases, and we demonstrate the utility of the CaspBase in APR studies. We also developed the Common Position (CP) system for comparing human caspase family paralogs and suggest the CP system as an update to current reporting methods of caspase amino acid positions. We present a standardized multiple sequence alignment (MSA) for the CP system and show the advantage of using large databases such as the CaspBase in defining structural positions in proteins. Although the results described here pertain to caspase evolution and structure–function studies, the methods can be adapted to any gene family.

Keywords: caspase, computational biology, protein evolution, ancestral protein reconstruction, database curation, sequence analysis

Introduction

The advent of next‐generation sequencing paved the way for a new field of study, called evolutionary biochemistry, where researchers attempt to understand how random chance, selection pressure, and physical laws determine the structure and function of protein families.1 Computational concepts of evolutionary biochemistry, like ancestral protein reconstruction (APR), were conceived over 50 years ago.2, 3, 4 In practice, however, experiments were limited due to the lack of sufficient sequence data. In APR, one uses extant genomic data to predict amino acid usage in common ancestral genes based on phylogenetic relationships of a protein family across various organisms.5 Only in recent years has sufficient genomic data become available for large‐scale multi‐group APR analyses. The reconstructed proteins of an APR analysis provide testable hypotheses; however, the certainty of such computational predictions depends on the quality of the initial input dataset as well as the accuracy of the multiple sequence alignment (MSA).

Three main factors contribute to the poor quality of public databases: the lack of experimental evidence, the inclusion of multiple isoforms, and the lack of pseudogene prediction. First, the addition of RNA‐seq data in lieu of cDNA reference genomes has led to discrepancies in the number of expressed human genes reported in various public databanks.6 As few as 1.1% of the annotations within available genomes are supported by experimental evidence,7 and current genome annotation methods correctly predict only 40–50% of genes that are actually expressed.8, 9 Second, annotated genomes that contain multiple isoforms of the same gene generally lack experimental evidence of gene expression, relying instead on computationally predicted open reading frames (ORFs). The inclusion of multiple isoforms and/or duplicate sequences in a dataset contributes to “phylogenetic noise” and lowers the phylogenetic resolution relative to the sample size.10, 11 The “phylogenetic signal” refers to the tendency of genes in closely related organisms to resemble each other more than they resemble genes chosen at random from a dataset.12 Thus, sequences with poor phylogenetic representation of a group are more likely to introduce phylogenetic noise into subsequent analyses. Finally, pseudogenization results in gene loss from the population when unfavorable mutations reduce or even nullify the functionality of the protein or when the gene has no effect on fitness.13 In the caspase family, for example, Caspase‐12 is considered a gene in the process of pseudogenization due to introduction of a null allele.14 Even among closely related organisms, the number of expressed caspase genes is constantly evolving.15 Computational algorithms are likely to include genes undergoing pseudogenization;6, 16 however, the inclusion of pseudogenes in a dataset contributes to phylogenetic noise and diminishes the certainty of downstream analyses.
One way to improve the confidence in an annotation is to use conventional computational gene prediction methods followed by inspection by a human expert.8 By incorporating confident annotations, manually curated sequence databases address some of the shortcomings of public databases and improve computational biology by reducing phylogenetic background bias and noise.17, 18

We manually curated a sequence database, called the CaspBase, to help understand how members of the caspase family evolved discrete cellular functions while retaining high structural similarity. Here, we introduce http://caspbase.org, a web application that provides convenient access to our databases of caspase proteins from annotated animal genomes. Our immediate goal in developing the CaspBase was to expedite large‐scale computational analyses by facilitating the dataset compilation process, while simultaneously maximizing phylogenetic signal for the purpose of inferring ancestral states.

Caspases, or the clan CD 14 peptidases, are named for their most prominent functional feature, that is, the presence of a catalytic cysteine–histidine dyad for cleaving peptides after aspartate residues.19 Human cells contain multiple caspases that are broadly separated into three subfamilies, the inflammatory caspases (‐1, ‐4, ‐5), the apoptotic initiator caspases (‐2, ‐8, ‐9, ‐10), and the apoptotic effector caspases (‐3, ‐6, ‐7). Caspase‐14 has been implicated in cell differentiation,20 and Caspases‐11 and ‐13 are misnomer duplicates of Caspase‐4 present in rodents and bovine, respectively.15

With an expanding database of non‐human genome sequences, the nomenclature system used to organize human caspase proteins, based on the human Caspase‐1 amino acid positions,21, 22 is not sufficient to organize all caspase homologs. Many annotations group together caspases within gene groups that share the same name based on similarity and do not take into consideration the present evolutionary track these proteins are on, for example pseudogenization. We describe the CP system as an improved naming convention for amino acid positions that addresses the complications of the current Caspase‐1 naming convention. The CP system is based on structurally informed sequence alignment, and its adoption would facilitate sequence analysis of homologous proteins. To this end, we provide an assessment of position‐based sequence conservation for human caspases among all chordate sequences available in the CaspBase.

Extant proteins are snapshots in evolutionary time, so reconstructing ancestral states is useful to understand the current configuration of intramolecular interactions.1 In APR experiments, low confidence, or ambiguity, at a given amino acid position is typically attributed to insufficient sequence data, uncertain gap placement in the MSA, poor estimation of tree topology (phylogenetic noise), or the extent of sequence divergence relative to tree articulation (phylogenetic bias).23 Using data from the CaspBase, we describe the utility of the database by analyzing levels of sequence conservation among caspase homologs, and we discuss how the CaspBase lends itself to evolutionary biochemistry and APR analysis of common caspase ancestral proteins.

Results

The development of CaspBase (http://caspbase.org) was motivated by a lack of high‐quality, curated sequence resources for caspase proteins. Here we describe the construction of the caspase database and demonstrate its utility for multiple sequence alignment and ancestral sequence reconstruction. We also provide an overview of sequence conservation based on our improved alignment of caspase proteins from chordate taxa. In addition, we propose a new position numbering system to ease the comparisons between human caspase family paralogs.

The CaspBase

http://caspbase.org provides access to both uncurated and curated databases of caspase protein sequences. The uncurated database was constructed by extracting all protein sequences annotated as caspases from all animal RefSeq genomes available at NCBI (see Methods section). The uncurated database contains 6676 caspase sequences from 361 taxa (Table S1).

The curated database was constructed by applying a set of filtering steps to sequences present in the uncurated database (see Methods section and Fig. 1). Briefly, to be included in the curated database the following criteria had to be met: (1) the presence of the catalytic cysteine–histidine dyad; (2) the presence of the active‐site residues that specifically coordinate aspartate found in the P1 position of the substrate; and (3) the presence of data for the dimer interface (Fig. 2). The standard workflow of the curation process performed for caspase sequences from every genome is illustrated in Figures 1 and S2. As a final check for sequence quality, we examined the placement of caspase sequences within the phylogeny. Some sequences were not included into the curated database, for example metacaspases, nonfunctional proteins and other decoys, as described in Methods section. The list of all excluded sequences (3797) along with the reasons for exclusion is provided in Table S1.
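The three inclusion criteria can be pictured as a column‐wise check on an aligned sequence. The following is a minimal sketch only, assuming the sequences have already been aligned; the column indices, the expected residues, and the function name passes_curation are hypothetical placeholders chosen for illustration, and the actual curation also relied on manual inspection and phylogenetic placement (Fig. 1 and Methods section).

```python
# Hypothetical alignment columns (0-based). Real positions would come from
# the structure-guided alignment, not from this sketch.
DYAD_COLUMNS = {3: "H", 7: "C"}     # catalytic histidine and cysteine
P1_COLUMNS = {1: "R", 9: "R"}       # residues coordinating the P1 aspartate
DIMER_INTERFACE = range(11, 15)     # columns that must hold residues, not gaps


def passes_curation(aligned_seq: str) -> tuple[bool, str]:
    """Return (keep, reason) for one aligned sequence."""
    # Criterion 1: the catalytic cysteine-histidine dyad must be present.
    for col, residue in DYAD_COLUMNS.items():
        if aligned_seq[col] != residue:
            return False, "missing catalytic dyad"
    # Criterion 2: residues that coordinate the P1 aspartate must be present.
    for col, residue in P1_COLUMNS.items():
        if aligned_seq[col] != residue:
            return False, "missing P1-coordinating residue"
    # Criterion 3: the dimer interface must be covered by sequence data.
    if any(aligned_seq[col] == "-" for col in DIMER_INTERFACE):
        return False, "no data for dimer interface"
    return True, "kept"
```

Returning a reason alongside the decision mirrors the way excluded sequences are listed with their reasons for exclusion in Table S1.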

Figure 1.

Steps in curation of the CaspBase. The examples show steps for including sequences in the database as well as criteria for excluding sequences.

Figure 2.

CaspBase curation criteria. The top right panel shows the β‐strands that form the dimer interface. The bottom right panel shows the conserved residues that confer “caspase” activity – CP‐018, CP‐020, CP‐024, CP‐075, and CP‐117. Sequences missing any of the criteria were excluded from the database.

The curated database of caspase proteins currently contains 2880 caspase sequences. These sequences represent 32 caspase types from 11 phyla, 26 classes and 353 species [Figs. 3(A) and (B) and S5]. CaspBase provides access to a searchable database of curated caspase proteins (http://caspbase.org). Users can download sequences in FASTA format by selecting species names, phylum_class, caspase type or any combination of these search terms [Fig. 3(C)].

Figure 3.

Summary of sequence data in the CaspBase. (A) The pie charts (top panel) show the diversity of organisms, and the bar graph (middle panel) shows the distribution of caspase types. (B) Example of the search feature from http://caspbase.org.

CaspBase users should be aware that some caspases in the database are annotated as “N‐like” (where “N” is any human caspase‐type) because we kept the original annotation from the NCBI database. N‐like sequences may be the result of gene duplication, or they may be in the process of pseudogenization.15 Many N‐like sequences were omitted from the CaspBase because they did not meet our curation criteria, but the CaspBase does include some sequences that might not be ideal for every analysis.

Multiple sequence alignment

The goal of a multiple sequence alignment in the context of evolutionary analyses is to identify homologous positions, and the difficulty of this task grows as the divergence between sequences increases. There are at least two ways to improve MSA in this situation: one can provide structural information to guide the alignment24 and one can include additional sequences.25

We used both of these improvement strategies to construct an MSA of 1481 chordate sequences using PROMALS3D (see Methods section). The human portion of the alignment is referred to as Human_MSA (Supplemental File 2: Human_MSA.fasta) and the full alignment is referred to as All_Chordate_MSA (Supplemental File 3: All_Chordate_MSA.fasta).

CP numbering system

Currently, the comparison between human caspase paralogs is based on human caspase‐1 amino acid positions. For example, active site Loop 4 (L4) varies in length among human caspase subfamilies, with the largest L4 found in the effector Caspase‐3 subfamily. In 2004, Fuentes‐Prior and Salvesen suggested that insertions in the loop utilize the Caspase‐1 numbering convention, where the loop begins with F381.22 In this case, the insertions in human Caspases‐3, ‐6, and ‐7 were named using the amino acid code, loop starting position of human Caspase‐1, and lowercase alphabetical marker beginning with “a.” For example, phenylalanine 250 in Caspase‐3 would be identified as Phe‐381c using the Caspase‐1 numbering convention. At the time of the earlier proposal, very few caspases had been identified outside of human and a few model systems such as mouse, C. elegans and Drosophila. Now, with several thousand caspase sequences available from hundreds of species, the numbering system based on human Caspase‐1 is insufficient to describe all caspases in all species, primarily due to the varying lengths of insertions and deletions in numerous regions of the proteins. A naming convention should describe the conserved positions in multiple species as well as the less conserved regions of insertions/deletions and pro‐domain. To this end, we propose a new naming scheme, called the CP system, based on the structurally validated CP_MSA alignment. In this system, each position in the alignment is classified as either pro‐domain (PD), CP, or gap position (GP) (see Methods section, Fig. 4, and Supplemental File 4: CP_MSA.xlsx). Overall, there are 218 CPs and 11 gapped regions that range from 24 residues in the inter‐subunit linker (GP7), to single amino acid insertions (GP10) and deletions (GP3) [Fig. 5(A) and (B) and Table 1].

Figure 4.

Common positions naming system for human caspase proteins. Partial alignment with every position labeled according to CP system is shown. Omitted parts of the alignment are indicated with “...” Full alignment is available as Supplementary File 4: CP_MSA.xlsx.

Figure 5.

Summary of gap locations in caspases. (A) Locations of the gapped positions (GPs) mapped on to caspase‐3 (PDB ID 2J30). L1 and L4 refer to active site Loops 1 and 4, respectively. IL refers to the intersubunit linker. (B) Comparison of caspase organization. Gapped positions are shown in orange, and features of the GPs are summarized in Table 1. CARD refers to caspase activation and recruitment domain, and DED refers to death effector domain. (C) Structural evidence for GP1 (N‐terminus) and GP10 (dimer interface) in caspase‐1 (cyan).

Table 1.

Gapped positions (GP) identified in the common position system for defining amino acid positions in caspases

GP# | Max length in humans | Location | Description
GP1 | 2 | N‐terminal to e1 | Caspases‐1, ‐4 insertion (Fig. 5(C))
GP2 | 17 | L1 | 3 variable gap lengths
GP3 | 1 | between h2 and e3 | Deleted in Caspases‐3, ‐6, ‐7, ‐8, and ‐9; present in Caspases‐1, ‐2, ‐4, ‐10, and ‐14
GP4 | 2 | between h2 and e3 | Unique Caspase‐14 insertion
GP5 | 7 | loop between e1` and e2` | Unique Caspase‐9 insertion (Fig. 5(A))
GP6 | 6 | loop between e2` and e3` | Caspases‐1, ‐4 insertion (Fig. 5(A))
GP7 | 24 | Inter‐subunit linker | Variable in almost every caspase. Caspase‐9 is the longest, Caspase‐3 is the shortest
GP8 | 3 | turn between h4 and h5 | Caspases‐8 and ‐10 single amino acid insertion, and Caspase‐14 deletion
GP9 | 10 | L4 | 3 variable gap lengths
GP10 | 1 | dimer interface | Caspases‐1, ‐4, ‐5 insertion
GP11 | 8 | C‐terminus | 6 variable gap lengths

Length refers to the number of amino acids in the gap; “e” refers to β‐strand and “h” refers to α‐helix; L1 and L4 refer to active site Loops 1 and 4.

The proposed CP system offers a unified way of referring to amino acid positions in all caspases because it is not based on the positions of a single caspase. With the CP system, one can more easily describe loop positions because the system is not based on a caspase with short active site loops as found in Caspase‐1. The structure‐based MSA is more reliable regarding gap placements, so the CP system is based on CPs observed in 10 human caspases. The CP system also accounts for the variable lengths of the pro‐domains. Using the example described above, in the CP system, Caspase‐3 F250 is GP9‐03 (Fig. 4). The position is also present in Caspases‐6, ‐7, ‐8, ‐10, and ‐2, but it is absent in Caspases‐9, ‐1, and ‐4. Finally, insertions in Caspase‐1 such as R391 in the dimer interface [Fig. 5(C)] would not have corresponding numbers in other caspases. In the CP system, the insertion is placed into a gap (GP10) that is unique to Caspases‐1 and ‐4, rather than skipping the position number 391 in all other caspases, as would occur in the caspase‐1 numbering system.
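The labeling logic of the CP system can be illustrated with a short sketch. It assumes that the pro‐domain has already been trimmed, that a column counts as a CP when every aligned human sequence has a residue there, and that consecutive gapped columns form one numbered GP region; this rule is one reading of the description above and not necessarily the authors' exact procedure (see Methods section). The function name label_columns and the toy sequences are hypothetical.

```python
def label_columns(alignment: list[str]) -> list[str]:
    """Label each alignment column as a common position (CP) or gap position (GP)."""
    labels = []
    cp_count = 0     # running index of common positions
    gp_region = 0    # running index of gapped regions
    gp_offset = 0    # position inside the current gapped region
    in_gap = False
    for column in zip(*alignment):
        if "-" in column:
            if not in_gap:
                # A new gapped region starts at this column.
                gp_region += 1
                gp_offset = 0
                in_gap = True
            gp_offset += 1
            labels.append(f"GP{gp_region}-{gp_offset:02d}")
        else:
            # Every sequence has a residue here, so this is a common position.
            cp_count += 1
            in_gap = False
            labels.append(f"CP-{cp_count:03d}")
    return labels


if __name__ == "__main__":
    # Toy alignment: the second sequence carries a one-residue insertion.
    print(label_columns(["AC-DE", "ACWDE"]))
```

With labels of this form, an insertion found in only some paralogs receives a GP label and does not shift the CP numbers of the residues that follow it.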

Conservation of caspase proteins

We used the All_Chordate_MSA (Supplemental File 3: All_Chordate_MSA.fasta) to calculate the percent identity at each position of alignment for all chordate sequences as well as for every caspase type (see Methods section). This information was used to evaluate the extent of conservation among caspase proteins. Table 2 records the percent of conservation for each caspase type, calculated by dividing the number of residues that are over 80% conserved in a given set of analyzed sequences by the number of residues with 100% occupancy in the caspase domain. Caspase‐7 sequences show the highest level of conservation (71%), while Caspase‐14 sequences are the least conserved among all caspase types analyzed (31%). When comparing all caspases, there are only 33 conserved residues among 218 amino acids of the protease domain (15%). The majority of these conserved residues are clustered in the active site and at the base of the enzyme [Fig. 6(A)]. The highly conserved regions likely result from the similarity in substrate binding (active site), particularly the requirement for a P1 aspartate in the substrate, and the conservation of an allosteric site (base of protein), which is phosphorylated in most caspases.26
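As an illustration of the calculation, the sketch below computes a per‐column identity and the resulting percent conservation for a toy alignment. It rests on assumptions that are not stated in the text: identity is taken as the fraction of sequences sharing the most common residue, “over 80%” is read as strictly greater than 0.8, and only columns with 100% occupancy are eligible to be counted as conserved. The function names are hypothetical.

```python
from collections import Counter


def column_identity(column: str) -> float:
    """Fraction of sequences that share the most common residue in a column."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(column)


def percent_conservation(alignment: list[str], threshold: float = 0.8) -> float:
    """Percent of fully occupied columns whose identity exceeds the threshold."""
    columns = ["".join(col) for col in zip(*alignment)]
    # Denominator: columns with 100% occupancy (no gaps).
    full = [col for col in columns if "-" not in col]
    if not full:
        return 0.0
    # Numerator: fully occupied columns that are conserved above the threshold.
    conserved = [col for col in full if column_identity(col) > threshold]
    return 100.0 * len(conserved) / len(full)


if __name__ == "__main__":
    toy = ["ACDE", "ACDF", "ACEG", "ACD-", "ACDH"]
    print(round(percent_conservation(toy), 1))
    # The same ratio applied to the Caspase-7 row of Table 2:
    print(round(100 * 172 / 243))
```

For Caspase‐7, for example, 172 conserved positions out of 243 fully occupied positions gives 172/243 ≈ 0.708, which is reported as 71%.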

Table 2.

Conservation levels of caspase proteins. The caspase‐N row includes all sequences used, the number of common positions in the caspase domain, and the number of amino acids conserved in all 1481 chordate caspases

Caspase‐N | No. sequences analyzed | No. amino acids in caspase domain (a) | No. amino acids conserved (b) | % conservation
Caspase‐1 | 75 | 251 | 98 | 39
Caspase‐2 | 198 | 260 | 141 | 54
Caspase‐3 | 202 | 240 | 129 | 54
Caspase‐4 | 33 | 251 | 140 | 56
Caspase‐6 | 201 | 256 | 146 | 57
Caspase‐7 | 211 | 243 | 172 | 71
Caspase‐8 | 156 | 253 | 83 | 33
Caspase‐9 | 178 | 263 | 121 | 46
Caspase‐10 | 117 | 246 | 112 | 45
Caspase‐14 | 110 | 230 | 71 | 30
Caspase‐N | 1481 | 218 | 33 | 15

(a) Positions with 100% occupancy.

(b) For Caspase‐1 through Caspase‐14: number of amino acids 90% conserved; for Caspase‐N (all chordate caspases): number of amino acids 80% conserved.

Figure 6.

Conserved positions in chordate caspases. (A) The red spheres represent residues that are >80% conserved in the analysis of 1481 human caspase homolog sequences, mapped on to the protomer of human Caspase‐3 (PDB ID 2J30). Active site Loops L1, L2, and L4 are labeled. (B) Conservation levels within Caspase‐3, ‐6, and ‐7 subfamily. Maroon residues are highly conserved and cyan residues are the least conserved. The images were constructed using the ConSurf server at http://consurf.tau.ac.il/.

The percent identity for each position from All_Chordate_MSA was also integrated with the CP_MSA to generate a table that records how well every position in the human caspase alignment is conserved among all chordate sequences, both between and within each caspase type (Supplemental File 5: Conservation_CP_MSA.xlsx). For example, we compared 614 sequences for effector Caspases‐3, ‐6, and ‐7 from chordates, which represent a subset of the 1481 sequences in the All_Chordate_MSA. When considering the conservation levels in the effector caspases, one observes that the hydrophobic core and substrate‐binding pocket show the highest levels of conservation, while two active site loops (called L1 and L4) as well as α‐Helices 2 and 3, on the protein surface, show the lowest levels of conservation [Fig. 6(B)]. The data are also presented as a sequence logo for the 614 chordate effector caspases, with secondary structure, active site loops, and CP noted above the sequences (Fig. 7). The results are consistent with well‐conserved β‐strands in the core of the protein with more variable α‐helices on the protein surface. The data also show that the intersubunit linker (GP7), which is cleaved during maturation, is the least conserved region of the protease domain. In addition, the length of L4 is conserved within caspase subfamilies, but the sequences are less conserved. For example, the longest L4 loops are found in the effector caspase subfamily, and a shorter L4 is observed in the inflammatory subfamily including Caspase‐1. The shorter L4 in Caspase‐1, for example, allows for selection of the larger tryptophan residue at the P4 position of substrates.27, 28 While the length of L4 is conserved in effector caspases, five positions in the loop sequence are not well conserved (Fig. 7, GP9). Together, the data suggest that the length and the precise sequences of L4 are important for determining enzyme specificity through selection of the P4 residue of the substrate. 
Finally, while the β‐strands in the protein core are conserved, the three short surface β‐strands (β1–β3, Fig. 7) are not conserved. The region is important for connecting the catalytic histidine in the active site to helix 3 on the protein surface. Together, strands β1–β3 and α‐Helices 2 and 3 are part of an allosteric mechanism that inactivates Caspase‐6 through a coil‐to‐helix transition.29, 30 In addition, Helix 3 is part of an allosteric network in Caspase‐3 that connects a conserved phosphorylation site at the base of the protein [Fig. 6(A)] to the active site,26 through strands β1–β3. Altogether, the data show that the caspase‐hemoglobinase fold can be characterized as a well‐conserved core and substrate‐binding pocket, with lower conservation in two active site loops and allosteric sites. The lower conservation in allosteric networks connecting a conserved allosteric site to the active site may provide mechanisms for the evolution of species‐specific allosteric regulation and fine‐tuning of activity.

Figure 7.

Sequence logo for 614 chordate effector Caspases‐3, ‐6, and ‐7. Larger letters reflect higher conservation. Secondary structure, active site loops, and common position numbers are indicated above the sequence. The image was constructed at http://weblogo.berkeley.edu.

Ancestral protein reconstruction

APR is a powerful tool in protein evolution studies and can be used to infer residues that are most likely to contribute to changes in protein function. Caspases are an attractive model to examine protein evolution because both subfunctionalization and neofunctionalization are observed in the evolution of the dimeric state of the effector caspases, in variations of enzyme specificity among the three caspase subfamilies, and in unique allosteric sites that result in fine‐tuning caspase activity. In order to demonstrate the utility of the CaspBase in ancestral reconstruction, we studied the evolution of Caspase‐3a and Caspase‐3b proteins in zebrafish.

Among caspase proteins, there are three features that are highly conserved: the hemoglobinase fold, the orientation of the catalytic dyad (cysteine and histidine), and the specificity for aspartate in the S1 pocket.31 While retaining these structural features, caspase substrate specificity is determined primarily by the amino acid in the P2–P4 positions of the substrate. The caspases are also classified by their preference of amino acid at P4,28 where Group I caspases prefer bulky residues (W/H/Y), Group II caspases prefer charged residues (D/E), and Group III caspases prefer aliphatic residues (I/L/V) at the P4 site of the substrate. The substrates are also classified as Group I, II, or III based on the corresponding caspase recognition.
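The P4‐based grouping can be written as a small lookup. This is a minimal sketch, assuming the substrate recognition motif is written from P4 to P1 as a four‐letter string; the function name p4_group and the example tetrapeptides are illustrative choices and are not taken from the text above.

```python
from typing import Optional

# P4 residue preferences for each specificity group, as listed in the text.
P4_GROUPS = {
    "I": set("WHY"),    # bulky residues
    "II": set("DE"),    # charged residues
    "III": set("ILV"),  # aliphatic residues
}


def p4_group(substrate: str) -> Optional[str]:
    """Classify a P4-P1 tetrapeptide (for example 'DEVD') by its P4 residue."""
    p4 = substrate[0].upper()
    for group, residues in P4_GROUPS.items():
        if p4 in residues:
            return group
    # P4 residue falls outside the three groups described in the text.
    return None


if __name__ == "__main__":
    for motif in ("WEHD", "DEVD", "IETD"):
        print(motif, p4_group(motif))
```

A relaxed specificity, such as that described below for D. rerio Caspase‐3a, would correspond to an enzyme that accepts motifs from more than one group.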

Danio rerio (zebrafish) is an excellent model system to examine differentiation and development.32, 33, 34 The genome of teleost fish was duplicated approximately 250 million years ago,35 although many of the duplicated caspases appear to be lost as evidenced by their absence from NCBI genome annotations. However, several species maintain duplicates that appear to be fixed, including D. rerio. The genome duplication resulted in two Caspase‐3 genes, called Caspase‐3a and ‐3b, in the genome of zebrafish. We showed previously that Caspase‐3a from D. rerio exhibits relaxed specificity for Group II and III substrates, where substrates with either aspartate or valine in the P4 position are acceptable.36 In contrast, much less is known about the substrate specificity of D. rerio Caspase‐3b. Thus, at least one Caspase‐3 gene in zebrafish exhibits properties that differ from those of human Caspase‐3, and the zebrafish system provides an opportunity to examine protein evolution within the same organism as well as between fish and humans.

We used the CaspBase to generate a dataset for reconstructing the common ancestor of D. rerio Caspases‐3a and ‐3b (see Methods section). The results for the APR are summarized in Figures 8 and S6. The experiment was designed to purposefully maximize the number of nodes between D. rerio Caspases‐3a and ‐3b on the phylogenetic tree. We obtained the “tree in ancestor format” output from the FastML server, and the results were used to identify the nodes between D. rerio Caspases‐3a and ‐3b [blue and green trajectories in Figs. 8(A) and S6(A)], and nodes 68, 69, 70, 73, 74, 75, 82, and 83 were pulled from “sequences of the joint reconstruction” file, also from the FastML server. The nodes were then organized so that the common ancestor (CA, Node 68) was centered in the MSA [Figs. 8(A) and S6(B)]. Although there are 90 differences out of 243 positions between D. rerio Caspases‐3a and ‐3b, only 40 residues diverged from the CA to Caspase‐3b. An analysis of the zebrafish Caspase‐3 proteins suggests that nine amino acids are functionally divergent [Fig. 8(B), spheres]. Three of the nine positions are in the active site, and may be important for substrate selection [Fig. 8(B), spheres in L3 and L4], and the remaining sites are in allosteric sites that have been shown to be part of allosteric networks in human Caspase‐3,26, 37 and regions of phosphorylation (N‐ and C‐termini).38, 39 Of the three positions in the active site, two of the positions are in the substrate binding loop, while one position is located in active site Loop 4 and makes contacts with several loops that stabilize the active site, the so‐called “loop bundle”.40, 41, 42 Together, the data show that rather than analyzing the 90 sites that differ between the two proteins, a smaller number of sites appear functionally divergent and may affect enzyme specificity or allosteric regulation.
Thus, a more rational and targeted approach may be used to examine the sites through the reconstruction of ancestral nodes between the zebrafish Caspase‐3 proteins. Finally, prior to synthesizing the inferred sequences at each ancestral node, the large dataset from the CaspBase is useful to demonstrate the robustness of the inferred ancestral sequences to statistical uncertainty through generating multiple sequences at ambiguous sites or for constructing the “worst‐plausible” sequence, which examines all ambiguous sites simultaneously.43
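The bookkeeping behind comparisons such as “90 differences out of 243 positions” can be sketched as follows. The example assumes that the extant and reconstructed sequences have been taken from the same alignment and are held as equal‐length strings; the function names and the toy sequences are hypothetical, and the sketch does not reproduce the FastML analysis itself.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def divergent_sites(ancestor: str, desc_a: str, desc_b: str) -> dict[str, list[int]]:
    """Sort sites that differ between two descendants by the lineage that changed."""
    sites: dict[str, list[int]] = {"a_only": [], "b_only": [], "both": []}
    for i, (anc, a, b) in enumerate(zip(ancestor, desc_a, desc_b)):
        if a == b:
            continue                      # descendants agree at this site
        if a == anc:
            sites["b_only"].append(i)     # only lineage b left the ancestral state
        elif b == anc:
            sites["a_only"].append(i)     # only lineage a left the ancestral state
        else:
            sites["both"].append(i)       # both lineages differ from the ancestor
    return sites


if __name__ == "__main__":
    ca, casp3a, casp3b = "ACDEF", "ACDKF", "GCDEW"   # toy sequences
    print(count_differences(casp3a, casp3b))
    print(divergent_sites(ca, casp3a, casp3b))
```

Applying the same comparison node by node along each trajectory would show at which ancestral step a given substitution first appears.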

Figure 8.

Example of an ancestral protein reconstruction of the common ancestor of zebrafish Caspase‐3a and ‐3b. (A) The phylogenetic tree was constructed as described in the text; only the Caspase‐3 clade is shown in the left panel. (Right) The path for Caspase‐3b is colored in green, and the path for Caspase‐3a is colored in blue. The nodes along the path are aligned in the order they evolved, and a small segment of the MSA is shown. (B) Analysis of zebrafish Caspase‐3a and ‐3b sequences suggests nine sites of functional divergence (spheres), which include the active site and allosteric sites near Helices 1 and 4 as well as the N‐ and C‐termini. The data show sites that are conserved in both clusters, but as different residues.

Discussion

We developed the CaspBase as a tool to rapidly disseminate organized caspase sequence data. The CaspBase includes all animal species with currently available annotated genomes in the NCBI genome database (as of December 2017), and it will be updated as new sequence data are made available. The CaspBase offers researchers the ability to rapidly compile large informative datasets, which serves to improve all downstream computational analyses, and both the curated and uncurated sequences are available to download from http://caspbase.org.

In order to ease the comparisons of homologous positions between members of the multigene caspase protein family, we developed the alignment‐based CP numbering system. The suggested CP numbering convention is particularly useful for comparing amino acid sequences with additional data for evolutionary constraints. A standardized MSA will serve to improve the reproducibility of experiments conducted by evolutionary biochemists, and we present structural evidence for correct gap placement in several caspases. Defining the caspase GPs is an important new concept in characterizing caspases from all species, and it is likely that other protein families could benefit from their own CP system. The CaspBase lends itself to compiling large datasets that are easily parsed and modified for any project related to the caspase gene family, and the methods described here can be adapted to any protein family.

Evolutionary biologists seek to understand how genes came to be, and biochemists seek to understand how genes or gene products function as they are. Databases such as the CaspBase serve as important conduits between computational studies that generate hypotheses and experimental studies that test the hypotheses. Experiments aimed at determining the structure–function relationship between functionally diverse protein families have historically used “horizontal mutations,” defined as swapping putative functionally important residues in closely related proteins and assessing the effect on structure and/or function.1 In 1970, however, John Maynard Smith described protein sequence space, and evolution, as a walk from one functional protein to another in the space of all possible protein sequences.3 Horizontal mutations do not abide by Maynard Smith's description, and studies that investigate such mutations may not address the historical context of the site of interest. That is, single point mutations are unlikely to thoroughly investigate the complexity of functional fitness landscapes because the optimal amino acid at any given position is context dependent.44, 45 In other words, the within‐protein epistatic contribution of free energy from a single mutation may alter biological activity through unforeseen perturbations to the free energy landscape that do not reflect reality.

The changes in amino acids from one ancestral node to the next are considered “vertical mutations” and represent functional states along an evolutionary trajectory.23 Resurrecting and characterizing each node along two diverging trajectories, such as described here for D. rerio Caspases‐3a and ‐3b, would reduce the number of residues that are potentially involved in the functional divergence of their relative substrate specificities. This approach is akin to “reverse engineering” protein evolution, and it is complementary to directed protein evolution methods.46

Because APR is a heuristic science, the computational predictions will improve with increased experimental validation. High‐quality databases such as the CaspBase are important tools to maximize the confidence of an APR analysis. Nonetheless, ambiguities in the reconstruction are most likely to occur in regions of the protein that are solvent exposed or otherwise lack functional evolutionary constraints; thus, ambiguity at such locations is less likely to cause the reconstruction to deviate from biological reality. Repetition of analysis reduces ambiguity by increasing confidence through corroboration, and high‐quality, annotated databases, such as the CaspBase, ultimately serve to enhance our understanding of the sequence‐to‐function relationship.

Methods

Curation of the database

Two databases of caspase proteins, uncurated and curated, are made available through http://caspbase.org. The uncurated database contains 6676 protein sequences with caspase annotation from 361 animal genomes available at NCBI on 12 December 2017 (Supplemental File 1: CaspBaseDB_full.txt). The following steps were used to construct the database. First, we downloaded GFF3 files from NCBI Assembly (https://www.ncbi.nlm.nih.gov/assembly) for all RefSeq animal genomes. Second, we extracted protein accession numbers (protein_id) from the lines of the GFF3 files in which the value of the “product” tag in the “attributes” column contains the word “caspase” but not “activation inhibitor,” “activity and apoptosis inhibitor,” or “recruitment domain‐containing.” Third, we used the list of resulting accession numbers to obtain protein sequences in FASTA format. Fourth, we mapped the taxon ID from each GFF3 file to taxonomy using the taxdmp file available at: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy. The resulting database contains unique protein accession numbers associated with product annotation from GFF3 files, taxonomy, and FASTA sequences. Additional details, including the code used to generate the uncurated database, are provided in Supplementary Material (Fig. S1).
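The second step above can be illustrated with a minimal Python sketch. This is not the authors' code (which is provided in Fig. S1); the function names, the case-insensitive matching, and the choice to return a dictionary are assumptions made for illustration only.

```python
from urllib.parse import unquote

# Exclusion phrases quoted in the text; case-insensitive matching is an assumption.
EXCLUDED_PHRASES = (
    "activation inhibitor",
    "activity and apoptosis inhibitor",
    "recruitment domain-containing",
)


def parse_attributes(column9):
    """Turn a GFF3 attributes column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attributes


def caspase_protein_ids(gff3_lines):
    """Return {protein_id: product} for rows whose product mentions 'caspase'
    and none of the excluded phrases (hypothetical helper)."""
    hits = {}
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed GFF3 feature row
        attributes = parse_attributes(columns[8])
        protein_id = attributes.get("protein_id")
        if protein_id is None:
            continue
        product = attributes.get("product", "")
        lowered = product.lower()
        if "caspase" not in lowered:
            continue
        if any(phrase in lowered for phrase in EXCLUDED_PHRASES):
            continue
        hits[protein_id] = product
    return hits
```

Keying the result on protein_id keeps each accession once even when a protein spans several CDS rows, which matches the statement that the database contains unique accession numbers.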

The curated database contains 2879 protein sequences from 353 taxa. The FASTA file for each genome was curated using the MEGA 7.0 alignment editor.47 Sequences were selected from the uncurated database and identified as functional caspases based on the following methods and criteria. First, sequences annotated as “partial” or “LOW QUALITY,” and annotations that are not caspases, such as “metacaspases,” were omitted. Sequences shorter than 240 amino acids, duplicate sequences, and sequences that include ambiguous characters (X, J, or B) were also omitted. Protein sequences with accession numbers that start with XP_ are computationally predicted from genomic data. Protein sequences with accession numbers that start with NP_ are experimentally supported and curated, and thus were preferentially included in the CaspBase if there were duplicate copies of a gene (Figs. 1 and S2).
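The automatable part of these criteria can be sketched as follows. The curation described here was performed manually in MEGA, so this Python sketch is only an approximation: the record layout, the function names, and the use of identical sequences as the definition of "duplicate" are assumptions.

```python
MIN_LENGTH = 240                      # minimum length stated in the text
AMBIGUOUS = set("XJB")                # ambiguous characters stated in the text
REJECTED_TERMS = ("partial", "low quality", "metacaspase")


def passes_basic_filters(description, sequence):
    """Apply the annotation, length, and ambiguous-character criteria."""
    lowered = description.lower()
    if any(term in lowered for term in REJECTED_TERMS):
        return False
    if len(sequence) < MIN_LENGTH:
        return False
    if AMBIGUOUS & set(sequence.upper()):
        return False
    return True


def curate(records):
    """records: iterable of (accession, description, sequence) tuples.

    Keeps one record per identical sequence (an assumed definition of
    'duplicate'), preferring NP_ accessions over computationally
    predicted ones.
    """
    kept = {}
    for accession, description, sequence in records:
        if not passes_basic_filters(description, sequence):
            continue
        current = kept.get(sequence)
        is_np = accession.startswith("NP_")
        if current is None or (is_np and not current[0].startswith("NP_")):
            kept[sequence] = (accession, description, sequence)
    return list(kept.values())
```

A filter of this kind only narrows the candidate set; the structural checks and isoform choices described in the following paragraphs still require inspection of the alignment.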

The resulting sequences were aligned with MUSCLE using a gap opening penalty of 6.9 and other parameters kept at default.47 The dimer interface containing β‐strand 6 of the caspase central core is typically within 20 amino acids of the C‐terminus. Sequences were omitted if data were missing for β‐strand 6 in the sequence alignment (Fig. 1). The catalytic residues, histidine (CP‐H075) and cysteine (CP‐C117), were checked for their presence. Finally, sequences were included in the database if they contained two out of three amino acids known to coordinate the P4 aspartate residue in the active site (CP‐R018, CP‐G020, and CP‐D024) (Figs. 1 and 2).
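As an illustration of the active-site checks, the sketch below tests one aligned sequence for the catalytic residues and for the two-out-of-three P4 criterion. It assumes a lookup table, here called cp_columns, that maps each CP label to a 0-based column of the alignment; that table and the function name are hypothetical, and the text does not state that these checks were scripted.

```python
# Expected residues at the positions named in the text.
CATALYTIC = {"CP-075": "H", "CP-117": "C"}
P4_SITES = {"CP-018": "R", "CP-020": "G", "CP-024": "D"}


def check_active_site(aligned_sequence, cp_columns):
    """Report catalytic-residue presence and the number of matching P4 sites.

    cp_columns maps a CP label to a 0-based alignment column
    (a hypothetical lookup built from the alignment in use).
    """
    def residue(label):
        return aligned_sequence[cp_columns[label]].upper()

    catalytic_ok = all(residue(label) == aa for label, aa in CATALYTIC.items())
    p4_matches = sum(residue(label) == aa for label, aa in P4_SITES.items())
    return {
        "catalytic_residues": catalytic_ok,
        "p4_matches": p4_matches,
        "p4_ok": p4_matches >= 2,  # two out of three, as stated in the text
    }
```

Returning the individual flags rather than a single pass/fail value leaves the inclusion decision to the curator.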

Genes with multiple predicted isoforms were assessed for the optimal isoform to maximize phylogenetic signal. For example, XP_ isoforms with similar length to that of the human NP_ caspases were typically selected for inclusion in the CaspBase. Otherwise, the longest isoform was chosen by default if no evidence was observed to the contrary. In some cases, longer isoforms opened a gap in the caspase domain, so the shorter isoform was preferentially chosen (Fig. 1). To further check for errors and problematic sequences, a phylogenetic tree for the entire CaspBase was generated to reveal any errors not easily observed by eye. The tree was generated with the maximum‐likelihood (ML) framework using IQTREE (http://www.iqtree.org/)48 (Fig. S3). The putative placement of genes is known a priori, and ML methods take branch length into account. The phylogenies generated in this manner were used to identify sequences with incorrect placement within the tree given the specific caspase type. All sequence accession numbers that were removed from the CaspBase were recorded in a separate Excel file along with the reason for sequence exclusion (Table S1).

Caspase alignments and the CP system

The CP system is based on the structure‐guided multiple sequence alignment of 1481 sequences from chordate species available from the CaspBase (caspase ‐1, ‐2, ‐3, ‐4, ‐6, ‐7, ‐8, ‐9, ‐10, and ‐14). The FASTA file with the 1481 sequences from our database was superficially aligned in MEGA using MUSCLE.47 All but the human caspase sequences were then removed, and the resulting FASTA file was aligned by structure in the PROMALS3D online server24 (http://prodata.swmed.edu/promals3d/promals3d.php) to generate a structurally informed multiple sequence alignment (MSA) with PDB IDs: 1IBC (Caspase‐1), 3R5J (Caspase‐2), 2J30 (Caspase‐3), 3OD5 (Caspase‐6), 3KJN (Caspase‐8), and 1JXQ (Caspase‐9). The remaining alignment was vetted and manually adjusted to match the PDB structures as necessary. The resulting alignment is referred to as the Human_MSA (Supplemental File 2: Human_MSA.fasta). We then used the Human_MSA as “user‐defined” constraints and aligned all chordate sequences in PROMALS3D again to confirm that the first residue in the proposed CP numbering system, CP‐Y001, as well as the gap residues, were properly aligned. We refer to this alignment as All_Chordate_MSA (Supplemental File 3: All_Chordate_MSA.fasta).

The CP system is based on Human_MSA. First, we removed all gaps within the alignment that we introduced by the addition of chordate sequences. Second, we removed all gaps in the prodomain sequences, effectively unaligning the prodomain portion of the alignment because it is extremely variable and best left unaligned for the purposes of the CP naming system. The resulting alignment is referred to as CP_MSA (Supplemental File 4: CP_MSA.xlsx). Each position in CP_MSA is assigned to one of the three position types: positions in the prodomain (PD), GP, and CPs. CPs are defined as positions with 100% occupancy within the caspase domain of human caspases, and GPs are positions with <100% occupancy. The first CP is a highly conserved tyrosine residue, referred to as CP‐Y001. Amino acids preceding CP‐Y001 make up the prodomain and are labeled sequentially PD‐MXXX through PD‐001, independent of the alignment, where PD‐001 represents the amino acid in the prodomain closest to CP‐Y001, and XXX is the length of the prodomain. For example, using the CP system, we refer to the first residue (methionine) in human Caspase‐3 as PD‐M036. GPs are numbered sequentially in the order they appear within the caspase domain and are identified by the gap number and two‐digit GP number within each gap (Table 1). We note that the single letter amino acid code is used only when the site is 100% conserved.
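The CP and GP assignment within the caspase domain can be sketched in Python as follows. The input is assumed to be the aligned caspase-domain portion of the human sequences, beginning at the column of the first common position; the function name and the exact label strings (for example, "GP1-01") are illustrative assumptions, and the authoritative labels are those in CP_MSA and Table 1.

```python
def label_columns(aligned_domains):
    """Label each alignment column as a common position (CP) or gapped position (GP).

    aligned_domains: equal-length strings with '-' for gaps, starting at the
    column of the first common position. Label formats are illustrative.
    """
    length = len(aligned_domains[0])
    if any(len(seq) != length for seq in aligned_domains):
        raise ValueError("sequences must be aligned to the same length")

    labels = []
    cp_number = 0        # running count of common positions
    gap_number = 0       # running count of gap regions
    within_gap = 0       # position inside the current gap region
    previous_was_gp = False
    for column in zip(*aligned_domains):
        if "-" not in column:
            # 100% occupancy: a common position
            cp_number += 1
            residues = set(column)
            # the one-letter code is shown only when the site is fully conserved
            letter = residues.pop() if len(residues) == 1 else ""
            labels.append(f"CP-{letter}{cp_number:03d}")
            previous_was_gp = False
        else:
            # <100% occupancy: a gapped position
            if not previous_was_gp:
                gap_number += 1
                within_gap = 0
            within_gap += 1
            labels.append(f"GP{gap_number}-{within_gap:02d}")
            previous_was_gp = True
    return labels
```

For a toy alignment of "YAC--DK", "YSCGHDK", and "YTC-HDR", the first column is labeled CP-Y001, the two gapped columns form the first gap region, and numbering of common positions resumes after the gap.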

Levels of sequence conservation among caspases

One can quickly assess how well every position in human caspases is conserved across chordates by analyzing All_Chordate_MSA (Supplemental File 3: All_Chordate_MSA.fasta). An R script (Fig. S4) was written to calculate the percent identity and percent occupancy (gapped or not gapped) for each position of the All_Chordate_MSA. We also ran the script separately on each caspase type. The results of this analysis are integrated with CP_MSA (Supplemental File 5: Conservation_CP_MSA.xlsx) to generate a table that can be used to view how well every position in the human caspase alignment is conserved among all chordate sequences, both between and within each caspase type.
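A per-column calculation of this kind can be sketched in Python. This is not the R script of Fig. S4. In particular, percent identity is defined here as the frequency of the most common residue among all sequences, with gaps counting against identity, which is an assumption about the intended definition.

```python
from collections import Counter


def column_statistics(alignment):
    """Return a list of (percent_occupancy, percent_identity), one tuple per column.

    alignment: equal-length aligned strings with '-' for gaps.
    Identity is the share of all sequences carrying the most common
    residue in the column (an assumed definition).
    """
    n_sequences = len(alignment)
    stats = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        occupancy = 100.0 * len(residues) / n_sequences
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            identity = 100.0 * top_count / n_sequences
        else:
            identity = 0.0
        stats.append((occupancy, identity))
    return stats
```

Running the same function on the subset of sequences belonging to a single caspase type gives the within-type values, analogous to running the script separately on each caspase type.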

Ancestral protein reconstruction

The sequences used for the APR between D. rerio Caspases‐3a and ‐3b were acquired from the CaspBase by selecting Chordata_Actinopteri, Chordata_Amphibia, and Chordata_Chondrichthyes from the “Phylum_Class” menu. In addition, a few selected taxa from the “individual species” menu were included to help articulate the tree, such as C. intestinalis, as were sequences from Aves and mammals. We then selected caspases‐2, ‐3, ‐6, ‐7, and ‐8, and the CaspBase returned 217 sequences that matched all of the criteria (Supplemental File 6: APR_MSA.fasta). We removed the prodomains because they are not informative for our purposes and because they open many gaps in the MSA, contributing phylogenetic noise that is likely to yield ambiguous results. The resulting file was uploaded to the PROMALS3D server. The CP_MSA was used again for constraints.

We determined the best model of evolution to construct a phylogenetic tree from our dataset with ProtTest 3 (https://github.com/ddarriba/prottest3).49 The phylogenetic tree was computed with the maximum likelihood method in IQTREE,48 using the Jones–Taylor–Thornton (JTT) model with a gamma distribution. No test for confidence of phylogeny was used for this example APR; however, 100–1000 bootstraps are recommended. The MSA and tree file were uploaded to the FastML server (http://fastml.tau.ac.il/),50 and a joint reconstruction was calculated with the default settings for amino acid sequences. The nodes of interest between D. rerio Caspases‐3a and ‐3b were identified with the reconstructed tree, and the tree was visualized in FigTree v1.4.2.
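For readers who run IQ-TREE locally rather than through a server, the tree-building step might be assembled as below. The text does not give the command that was used, so this Python sketch is an assumption throughout: the helper name, the executable name (which differs between IQ-TREE versions), and the file name are hypothetical. The options shown are the IQ-TREE alignment (-s), model (-m), and standard bootstrap (-b) options.

```python
def iqtree_command(alignment_path, model="JTT+G", bootstraps=None, executable="iqtree"):
    """Build an IQ-TREE argument list (hypothetical helper; does not run anything).

    model "JTT+G" corresponds to the JTT model with a gamma distribution.
    bootstraps, if given, requests that many standard bootstrap replicates.
    """
    command = [executable, "-s", alignment_path, "-m", model]
    if bootstraps is not None:
        command += ["-b", str(bootstraps)]
    return command


# Example: the tree described in the text (no bootstrap), and a variant
# with 100 replicates, the lower end of the recommended range.
plain = iqtree_command("APR_MSA_aligned.fasta")
with_support = iqtree_command("APR_MSA_aligned.fasta", bootstraps=100)
```

The resulting list can be passed to subprocess.run once the executable name has been checked against the installed version.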

Supporting information


References

  • 1. Harms MJ, Thornton JW (2013) Evolutionary biochemistry: Revealing the historical and physical causes of protein properties. Nat Rev Genet 14:559–571.
  • 2. Zuckerkandl E, Pauling L (1965) Evolutionary divergence and convergence in proteins. In: Bryson V, Vogel HJ, editors. Evolving Genes and Proteins. Vol. 97. New York: Academic Press, pp. 97–166. Available from: http://www.yanaiweb.com/genome/Clocks/Zuckerkandl_1965.pdf
  • 3. Smith JM (1970) Natural selection and the concept of a protein space. Nature 225:563–564.
  • 4. Yon Rhee S, Wood V, Dolinski K, Draghici S, Mudge JM, Harrow J, Bastian FB, Chibucos MC, Gaudet P, Giglio M, et al. (2016) Caspase allostery and conformational selection. Mol Biol Evol [Internet] 17:6666–6706. Available from: http://cshperspectives.cshlp.org/content/5/4/a008656.short
  • 5. Rastogi S, Reuter N, Liberles DA (2006) Evaluation of models for the evolution of protein sequences and functions under structural constraint. Biophys Chem 124:134–144.
  • 6. Mudge JM, Harrow J (2016) The state of play in higher eukaryote gene annotation. Nat Rev Genet 17:758–772.
  • 7. Yon Rhee S, Wood V, Dolinski K, Draghici S (2008) Use and misuse of the gene ontology annotations. Nat Rev Genet 9:509–515.
  • 8. Fawal N, Li Q, Mathé C, Dunand C (2014) Automatic multigenic family annotation: Risks and solutions. Trends Genet 30:323–325.
  • 9. Holliday GL, Davidson R, Akiva E, Babbitt PC (2017) Evaluating functional annotations of enzymes using the gene ontology. Methods Mol Biol 1446:111–132.
  • 10. Ashkenazy H, Kliger Y (2010) Reducing phylogenetic bias in correlated mutation analysis. Protein Eng Des Sel 23:321–326.
  • 11. Jäckel C, Bloom JD, Kast P, Arnold FH, Hilvert D (2010) Consensus protein design without phylogenetic bias. J Mol Biol 399:541–546.
  • 12. Münkemüller T, Lavergne S, Bzeznik B, Dray S, Jombart T, Schiffers K, Thuiller W (2012) How to measure and test phylogenetic signal. Methods Ecol Evol 3:743–756.
  • 13. Albalat R, Cañestro C (2016) Evolution by gene loss. Nat Rev Genet 17:379–391.
  • 14. Wang X, Grus WE, Zhang J (2006) Gene losses during human origins. PLoS Biol 4:0366–0377.
  • 15. Eckhart L, Ballaun C, Hermann M, VandeBerg JL, Sipos W, Uthman A, Fischer H, Tschachler E (2008) Identification of novel mammalian caspases reveals an important role of gene loss in shaping the human caspase repertoire. Mol Biol Evol 25:831–841.
  • 16. Van Baren MJ, Brent MR (2006) Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res 16:678–685.
  • 17. Stein LD (2003) Integrating biological databases. Nat Rev Genet 4:337–345.
  • 18. Buneman P (2003) Curated databases. Proc – 4th Int Conf Web Inf Syst Eng 2003:1–13.
  • 19. Earnshaw WC, Martins LM, Kaufmann SH (1999) Mammalian caspases: structure, activation, substrates, and functions during apoptosis. Annu Rev Biochem 68:383–424.
  • 20. Denecker G, Ovaere P, Vandenabeele P, Declercq W (2008) Caspase‐14 reveals its secrets. J Cell Biol 180:451–458.
  • 21. Alnemri ES, Livingston DJ, Nicholson DW, Salvesen G, Thornberry NA, Wong WW, Yuan J (1996) Human ICE/CED‐3 protease nomenclature. Cell 87:171.
  • 22. Fuentes‐Prior P, Salvesen GS (2004) The protein structures that shape caspase activity, specificity activation and inhibition. Biochem J 384:201–232.
  • 23. Harms MJ, Thornton JW (2010) Analyzing protein structure and function using ancestral gene reconstruction. Curr Opin Struct Biol 20:360–366.
  • 24. Pei J, Grishin NV (2014) PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and 3‐dimensional structural information. Methods Mol Biol 1079:263–271.
  • 25. Katoh K, Kuma KI, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511–518.
  • 26. Thomas ME, Grinshpon R, Swartz P, Clark AC (2018) Modifications to a common phosphorylation network provide individualized control in caspases. J Biol Chem 293:5447–5461.
  • 27. Walker NPC, Talanian RV, Brady KD, Dang LC, Bump NJ, Ferenz CR, Franklin S, Ghayur T, Hackett MC, Hammill LD, et al. (1994) Crystal structure of the cysteine protease interleukin‐1b‐converting enzyme: a (p20/p10)2 homodimer. Cell 78:343–352.
  • 28. Stennicke HR, Salvesen GS (1999) Catalytic properties of the caspases. Cell Death Differ 6:1054–1059.
  • 29. Dagbay KB, Bolik‐Coulon N, Savinov SN, Hardy JA (2017) Caspase‐6 undergoes a distinct helix‐strand interconversion upon substrate binding. J Biol Chem 292:4885–4897.
  • 30. Vaidya S, Hardy JA (2011) Caspase‐6 latent state stability relies on helical propensity. Biochemistry 50:3282–3287.
  • 31. Aravind L, Koonin EV (2002) Classification of the caspase‐hemoglobinase fold: detection of new families and implications for the origin of the eukaryotic separins. Proteins 46:355–367.
  • 32. Inohara N, Nuñez G (2000) Genes with homology to mammalian apoptosis regulators identified in zebrafish. Cell Death Differ 7:509–510.
  • 33. Lieschke GJ, Currie PD (2007) Animal models of human disease: zebrafish swim into view. Nat Rev Genet 8:353–367.
  • 34. Greiling TMS, Clark JI (2009) Early lens development in the zebrafish: a three‐dimensional time‐lapse analysis. Dev Dyn 238:2254–2265.
  • 35. Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, et al. (2013) The zebrafish reference genome and its relationship to the human genome. Nature 496:498–503.
  • 36. Tucker MB, Mackenzie SH, Maciag JJ, Ackerman HD, Swartz P, Yoder JA, Hamilton PT, Clark AC (2016) Phage display and structural studies reveal plasticity in substrate specificity of caspase‐3a from zebrafish. Protein Sci 25:2076–2088.
  • 37. Maciag JJ, Mackenzie SH, Tucker MB, Schipper JL, Swartz P, Clark AC (2016) Tunable allosteric library of caspase‐3 identifies coupling between conserved water molecules and conformational selection. Proc Natl Acad Sci USA 113:6080–6088.
  • 38. Parrish A, Freel CD, Kornbluth S (2013) Cellular mechanisms controlling caspase activation and function. Cold Spring Harb Perspect Biol 2013:1–24.
  • 39. Dagbay K, Eron SJ, Serrano BP, Velázquez‐Delgado EM, Zhao Y, Lin D, Vaidya S, Hardy JA (2014) A multipronged approach for compiling a global map of allosteric regulation in the apoptotic caspases. Methods Enzymol 544:215–249.
  • 40. Feeney B, Pop C, Swartz P, Mattos C, Clark AC (2006) Role of loop bundle hydrogen bonds in the maturation and activity of (pro)caspase‐3. Biochemistry 45:13249–13263.
  • 41. Witkowski WA, Hardy JA (2009) L2’ loop is critical for caspase‐7 active site formation. Protein Sci 18:1459–1468.
  • 42. Hill ME, Macpherson DJ, Wu P, Julien O, Wells JA, Hardy JA (2016) Reprogramming caspase‐7 specificity by regio‐specific mutations and selection provides alternate solutions for substrate recognition. ACS Chem Biol 11:1603–1612.
  • 43. Eick GN, Bridgham JT, Anderson DP, Harms MJ, Thornton JW (2017) Robustness of reconstructed ancestral protein functions to statistical uncertainty. Mol Biol Evol 34:247–261.
  • 44. Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M (2016) Coevolutionary landscape inference and the context‐dependence of mutations in beta‐lactamase tem‐1. Mol Biol Evol 33:268–280.
  • 45. Pandya C, Farelli JD, Dunaway‐Mariano D, Allen KN (2014) Enzyme promiscuity: Engine of evolutionary innovation. J Biol Chem 289:30229–30236.
  • 46. Packer MS, Liu DR (2015) Methods for the directed evolution of proteins. Nat Rev Genet 16:379–394.
  • 47. Kumar S, Stecher G, Tamura K (2016) MEGA7: molecular evolutionary genetics analysis Version 7.0 for bigger datasets. Mol Biol Evol 33:1870–1874.
  • 48. Nguyen LT, Schmidt HA, Von Haeseler A, Minh BQ (2015) IQ‐TREE: a fast and effective stochastic algorithm for estimating maximum‐likelihood phylogenies. Mol Biol Evol 32:268–274.
  • 49. Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best‐fit models of protein evolution. Bioinformatics 21:2104–2105.
  • 50. Ashkenazy H, Penn O, Doron‐Faigenboim A, Cohen O, Cannarozzi G, Zomer O, Pupko T (2012) FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res 40:580–584.
