Abstract
Motivation
We explored how explainable artificial intelligence (XAI) can help to shed light on the inner workings of neural networks for protein function prediction. To this end, we extended the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too.
Results
The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g. transmembrane regions, active sites) across many proteins.
Availability and Implementation
Source code can be accessed at https://github.com/markuswenzel/xai-proteins.
1 Introduction
1.1 Protein function prediction
1.1.1 Proteins—constituents of life
Proteins are versatile molecular machines, performing various tasks in basically all cells of every organism, and are modularly constructed from chains of amino acids. Inferring the function of a given protein merely from its amino acid sequence is a particularly interesting problem in bioinformatics research.
Function prediction can help to rapidly provide valuable pointers in the face of so far unfamiliar proteins of understudied species, such as of emerging pathogens. Moreover, it makes the analysis of large, unlabeled protein datasets possible, which becomes increasingly relevant against the backdrop of the massive and ever-growing databases of unlabeled nucleic acid sequences, which in turn can be translated into amino acid sequences. Next-generation DNA sequencers can read the nucleic acid sequences present in a sample or specimen at decreasing costs (Mardis 2017, Shendure et al. 2017), much faster than experimenters can determine the function of the genes and corresponding proteins. Therefore, databases with genes and corresponding amino acid sequences grow much more rapidly than those of respective experimental gene and protein labels or annotations. Besides, gaining knowledge about the mapping between amino acid sequence and protein function can help to engineer proteins for dedicated purposes too (Alley et al. 2019, Yang et al. 2019, Ferruz et al. 2022, Hie and Yang 2022, Madani et al. 2023).
1.1.2 Machine learning approaches
Machine learning approaches to protein function prediction can include inferring enzymatic function (Dalkiran et al. 2018, Li et al. 2018, Zou et al. 2019, Yu et al. 2023), Gene Ontology (GO) terms (Kulmanov et al. 2017, You et al. 2018a,b, 2021, Kulmanov and Hoehndorf 2019, 2022, Strodthoff et al. 2020, Littmann et al. 2021), protein–protein/–drug interaction, remote homology, stability, sub-cellular location, and other properties (Rao et al. 2019, Bepler and Berger 2021). For structure prediction, the objective is to infer how the amino acid sequence folds into the secondary (Zhang et al. 2018, Rives et al. 2021) and tertiary protein structure (Torrisi et al. 2020, AlQuraishi 2021, Jumper et al. 2021, Weissenow et al. 2022). Several of the prediction tasks can also be approached by transferring labels from similar sequences obtained via multiple sequence alignment (MSA) (Buchfink et al. 2014, Gong et al. 2016). Protein prediction models are compared by the scientific community in systematic performance benchmarks, e.g. for function annotation (CAFA, Radivojac et al. 2013, Jiang et al. 2016, Zhou et al. 2019), for structure prediction (CASP, Kryshtafovych et al. 2019, 2021), or for several semi-supervised tasks (Rao et al. 2019, Fenoy et al. 2022). Machine learning methods continue to gain ground on MSA techniques in terms of performance, offer short inference times, and can also process sequences from the so-called “dark proteome,” where alignments are not possible (Perdigão et al. 2015, Rao et al. 2019, Lin et al. 2023).
1.2 Protein language modeling and transfer learning
1.2.1 Relations to NLP
Amino acid sequences share some similarities with the sequences of letters and words occurring in written language, in particular with respect to the complex interrelationships between distant elements, which are arranged in one-dimensional chains. Thus, recent progress in research on natural language processing (NLP) employing language modeling in a transfer learning scheme (Howard and Ruder 2018) has driven forward protein function prediction too (e.g. Strodthoff et al. 2020).
1.2.2 Self-supervised pretraining
Typically, a language model is first pretrained on large numbers of unlabeled sequences in an unsupervised fashion, e.g. by learning to predict masked tokens (cloze task) or the respective next token in the sequences (which is why this unsupervised approach is also dubbed self-supervised learning). In this way, the model learns useful representations of the sequence statistics (i.e. language). These statistics possibly arise because the amino acid chains need to be stable under physiological conditions and are subject to evolutionary pressure. The learned representations can be transferred to separate downstream tasks, where the pretrained model can be further finetuned in a supervised fashion on labeled data, which are usually available in smaller amounts, considering that sequence labeling by experimenters is costly and lengthy.
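To make the cloze objective concrete, the following minimal PyTorch sketch masks a random subset of residue tokens and computes the loss only at the masked positions. The 15% masking rate and the assumption that the model maps token ids directly to per-position logits are illustrative choices, not details of the specific pretrained models discussed here.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(model, token_ids, mask_token_id, mask_prob=0.15):
    """Cloze-style pretraining step (sketch): hide random residues and predict them back.
    Assumption: `model` maps (batch, L) token ids to (batch, L, vocab_size) logits."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100                                # compute the loss on masked positions only
    corrupted = token_ids.masked_fill(mask, mask_token_id)
    logits = model(corrupted)
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
```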
1.2.3 Model architectures
Transformer models (Vaswani et al. 2017) making use of the attention mechanism (Niu et al. 2021), such as bidirectional encoder representations from transformers (BERT, Devlin et al. 2018), are the currently prevailing architectures in NLP. Transformers have recently been applied to the study of amino acid sequences too, pushing the state of the art in the field of proteomics as well (Rao et al. 2019, 2021, Nambiar et al. 2020, Bepler and Berger 2021, Littmann et al. 2021, Rives et al. 2021, Brandes et al. 2022, Elnaggar et al. 2022, Fenoy et al. 2022, Unsal et al. 2022, Lin et al. 2023, Olenyi et al. 2023). Recurrent neural networks (RNNs) using long short-term memory (LSTM) cells are another model architecture that is particularly suited to process sequential data. RNNs have been successfully employed for protein (Strodthoff et al. 2020) and peptide (Vielhaben et al. 2020) property prediction as well, within the scheme of language modeling combined with transfer learning, as sketched out above.
1.3 Explainable machine learning
1.3.1 Need for explainability
Transformers and other modern deep learning models are notorious for often having millions and sometimes billions of trainable parameters, and it can be very difficult to interpret the decision-making logic or strategy of such complex models. The research field of explainable machine learning (Lundberg and Lee 2017, Montavon et al. 2018, Arrieta et al. 2020, Tjoa and Guan 2020, Covert et al. 2021, Samek et al. 2021) aims at developing methods that enable humans to better interpret—or to a limited degree: understand—such “opaque,” complex models. In certain cases, it was demonstrated that the methods can even help to uncover flaws and unintended biases of the models, such as being misled by spurious correlations in the data (Lapuschkin et al. 2019).
1.3.2 Attribution methods
Attribution methods, such as integrated gradients (IG) (Sundararajan et al. 2017), layer-wise relevance propagation (Bach et al. 2015, Binder et al. 2016) or gradient-weighted class activation mapping (Selvaraju et al. 2017), make it possible to identify those features in the input space that the model apparently focuses on, because these features turn out to be particularly relevant for the final classification decision of the model. Further examples of model explainability methods include probing classifiers (Belinkov 2022), testing with concept activation vectors (Kim et al. 2018), and studying the attention mechanism (Jain and Wallace 2019, Serrano and Smith 2019, Bai et al. 2021, Niu et al. 2021). Explainability methods have been employed in NLP too (Arras et al. 2019, Manning et al. 2020, Chefer et al. 2021, Pascual et al. 2021). Moreover, researchers have started to explore using explainability methods in the area of protein function prediction (Upmeier zu Belzen et al. 2019, Taujale et al. 2021, Vig et al. 2021, Hou et al. 2023, Vu et al. 2023, Zhou et al. 2023).
1.4 Contributions of the article
1.4.1 Goal of the study
Building upon this previous research on the interpretation of protein classification models, we aimed at exploring how explainability methods can further help to gain insights into the inner workings of the now often huge neural networks, and proceeded as follows.
1.4.2 Specific contributions
First, we finetuned pretrained transformers on selected prediction tasks and could push or reach the state of the art (see Supplementary Appendix E). Then, we quantified the relevance of each amino acid of a protein for the function prediction model. Subsequently, we investigated whether these relevant sequence regions match expectations informed by knowledge from biology or chemistry, by correlating the relevance attributions with annotations from sequence databases (see Fig. 1). For instance, we addressed the question of whether a classification model that is able to infer if a protein is situated in the cell membrane does indeed focus systematically on transmembrane regions or not. We conducted this analysis on the embedding level and “inside” of the model with a novel adaptation of IG. In this way, we identified transformer heads with a statistically significant correspondence of the attribution maps with ground truth annotations, across many proteins and thus going beyond anecdotes from a few selected cases.
Figure 1.
Illustration of the experimental design. Top: From the amino acid sequence, the finetuned transformer model infers the applicable Gene Ontology (GO) terms (represented as multi-label class membership vector). (The depicted exemplary “catalase-3” should be labeled with the GO terms “catalase activity” as “molecular function,” “response to hydrogen peroxide” as “biological process,” “cytoplasm” as “cellular component,” etc.; about 5K of about 45K GO terms were considered.) Center: Relevance indicative for a selected GO term was attributed to the amino acids per protein and correlated with corresponding annotations per amino acid. This correlation between relevance attributions and annotations was then statistically assessed across the test dataset proteins. The analysis was conducted for the embedding layer and “inside” of the model, for each head in each layer, and was repeated for different GO terms (see Section 2.1). Bottom: Specific amino acids of a protein are annotated in sequence databases like UniProt, because they serve as binding or active sites or are located in the cell membrane etc. Active sites can, e.g. be found at the histidine (“H” at position 65) and asparagine (“N” at position 138) of “catalase-3” (protein structure prediction created by AlphaFold—“AlphaFold Data Copyright (2022) DeepMind Technologies Limited”—under the CC-BY 4.0 licence; Jumper et al. 2021, Varadi et al. 2021).
2 System and methods
2.1 Revealing insights into function prediction models
2.1.1 Prediction tasks
The prediction tasks of inferring, from the amino acid sequence, the GO terms and Enzyme Commission (EC) numbers that the proteins are labeled with are detailed in Supplementary Appendix B. This Supplementary material also explains the finetuning of the transformers “ProtBert-BFD” and “ProtT5-XL-UniRef50” (Elnaggar et al. 2022) and “ESM-2” (Lin et al. 2023) on the GO and EC tasks, and contains statements about data availability and composition.
2.1.2 Overall approach
We investigated whether specific positions or areas on the amino acid sequence that had been annotated in sequence databases are particularly relevant for the classification decision of the model (see Fig. 1). Annotations included UniProtKB/Swiss-Prot “active” and “binding sites,” “transmembrane regions,” “short sequence motifs,” and PROSITE patterns related to a GO term and its children terms in the ontology. Definitions of the aforementioned UniProt annotations (per amino acid) and matching GO terms (class labels of proteins) are compiled in Supplementary Table A.1 (tables/figures with prefix letters are shown in the Supplementary material). First, we attributed relevance indicative for a given class (either a selected GO term or EC number) to each amino acid of a protein. Then, we correlated the relevance heat map obtained for the amino acid chain of a protein with corresponding binary sequence annotations. To study the information representation within the model, the explainability analysis was conducted at the embedding layer and repeated “inside” of the model, separately for its different heads and layers, using a novel method building upon IG, described below in Section 3.
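As an illustration of the annotation side of this comparison, the sketch below turns residue-level feature spans (e.g. a transmembrane region covering a certain stretch of positions) into the binary per-amino-acid vector that is correlated with the relevance heat map; the simple (start, end) span format is an assumption for illustration, not the actual UniProt parsing used in the study.

```python
import numpy as np

def annotation_mask(seq_len, spans):
    """Binary per-residue annotation vector: 1 inside annotated spans, 0 elsewhere.
    Assumption: `spans` is a list of 1-based, inclusive (start, end) positions."""
    mask = np.zeros(seq_len, dtype=int)
    for start, end in spans:
        mask[start - 1:end] = 1
    return mask

# e.g. a hypothetical protein of length 200 with two annotated transmembrane regions
transmembrane = annotation_mask(200, [(12, 34), (101, 123)])
```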
2.1.3 Experimental setup
For the experimental evaluation, we focus on the pretrained ProtBert model that was finetuned either to the multi-label GO classification on the GO “2016” dataset, or to the multi-class EC number classification on the “EC50 level L1” dataset. We consider the comparatively narrow EC task in addition to the much more comprehensive GO prediction, because the test split of the EC dataset contains a larger number of samples that are both labeled per protein and annotated per amino acid, which is beneficial for the conducted explainability analysis. We observed that larger models tend to perform numerically better than smaller models (see Supplementary Appendix E). Given our focus on methodological matters of model interpretation, we deliberately studied ProtBert (420M parameters), because it is easier to handle, due to its considerably smaller memory footprint, than the larger ProtT5 (1.2B parameters).
3 Algorithm
3.1 Integrated gradients
Integrated gradients (Sundararajan et al. 2017) represents a model-agnostic attribution method, which can be characterized as the unique attribution method satisfying a set of four axioms (Invariance, Sensitivity, Linearity, and Completeness). In this formalism, the attribution for feature $i$ is defined via the line integral (along a path $\gamma(\alpha)$, parameterized with $\alpha \in [0, 1]$, between some chosen baseline $x' = \gamma(0)$ and the sample to be explained $x = \gamma(1)$),

$$A_i(x) = \int_0^1 \frac{\partial F(\gamma(\alpha))}{\partial \gamma_i(\alpha)}\,\frac{\partial \gamma_i(\alpha)}{\partial \alpha}\,\mathrm{d}\alpha, \tag{1}$$

where $F$ is the function we aim to explain. Choosing $\gamma$ as the straight line connecting $x'$ and $x$, i.e. $\gamma(\alpha) = x' + \alpha\,(x - x')$, makes IG the unique method satisfying the four axioms from above and an additional symmetry axiom. This straight-line path is the typical choice in applications where IG is applied directly to the input layer for computer vision or to the embedding layer for NLP. The approach can be generalized to arbitrary layers if one replaces $x$ and $x'$ by the hidden feature representation of the network up to this layer (referred to as “layer IG” in the popular “Captum” library (Kokhlikyan et al. 2020)).
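For the standard embedding-layer setting with the straight-line path, attributions can be computed with Captum's layer IG roughly as sketched below; the forward wrapper, the choice of baseline token ids, and the reduction over the embedding dimension are assumptions about the concrete implementation, not the authors' exact code.

```python
import torch
from captum.attr import LayerIntegratedGradients

def embedding_layer_attributions(model, embedding_module, input_ids, baseline_ids,
                                 target_class, n_steps=50):
    """IG at the embedding layer (sketch): one relevance value per amino acid token.
    Assumption: calling `model(input_ids)` directly returns the class logits."""
    lig = LayerIntegratedGradients(lambda ids: model(ids), embedding_module)
    attributions = lig.attribute(inputs=input_ids, baselines=baseline_ids,
                                 target=target_class, n_steps=n_steps)
    # sum over the embedding dimension to obtain a single relevance value per token
    return attributions.sum(dim=-1)
```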
3.2 Head-specific attribution maps
To obtain attributions for individual heads, we have to target the output of the multi-head self-attention (MHSA) block of a particular layer; see Fig. 2 for a visualization of the transformer architecture. Properly separating the attributions of the individual heads from the attribution contribution obtained from the skip connection necessitates targeting the output of the MHSA directly. However, one cannot simply choose an integration path that connects baseline and sample as encoded by the MHSA block, because the input for the skip connection has to be varied consistently. To keep an identical path in all cases, we fix the integration path as a straight line in the embedding layer, which then gets encoded into a, in general, curvilinear path seen as input for some intermediate layer. Choosing a path that is not straight only leads to a violation of the symmetry axiom, which is not of paramount practical importance in this application; see Ward et al. (2020) and Kapishnikov et al. (2021) for other applications of IG along general paths. For every sample, this application of IG yields a relevance map of shape $L \times d_{\text{model}}$ (sequence length times model dimension), where the first $d_{\text{model}}/n_{\text{heads}}$ entries in the last dimension correspond to the first head, followed by the second head etc. By summing over these entries in the last dimension, we can reduce the relevance map to an $L \times n_{\text{heads}}$ attribution map, i.e. one relevance sequence per head.
Figure 2.
Visualization of the explainability method based on IG that can attribute relevance to sequence tokens (here: amino acids) separately for each head and layer of the transformer (adapted from Vaswani et al. 2017).
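A minimal sketch of the head-specific variant is given below. It assumes access to a forward function that accepts embeddings directly and to the MHSA module (returning its output as a single tensor) of the layer under inspection; both are assumptions about how the model is wrapped. The line integral along the induced, generally curvilinear path at the MHSA output is approximated with a simple Riemann sum.

```python
import torch

def head_specific_ig(forward_from_emb, mhsa_module, emb_sample, emb_baseline,
                     target_class, n_steps=50, n_heads=16):
    """Head-specific IG (sketch): straight-line path in the embedding layer, attributions
    taken at the output of one multi-head self-attention (MHSA) block.
    Assumptions: `forward_from_emb` maps embeddings (1, L, d_model) to class logits,
    and `mhsa_module` returns its output as a tensor of shape (1, L, d_model)."""
    captured = {}

    def hook(_module, _inputs, output):
        output.retain_grad()          # make .grad available on this non-leaf tensor
        captured["h"] = output

    handle = mhsa_module.register_forward_hook(hook)
    hs, grads = [], []
    for alpha in torch.linspace(0.0, 1.0, n_steps + 1):
        # straight line in the embedding layer between baseline and sample
        emb = (emb_baseline + alpha * (emb_sample - emb_baseline)).detach().requires_grad_(True)
        logits = forward_from_emb(emb)
        logits[0, target_class].backward()
        hs.append(captured["h"].detach().clone())
        grads.append(captured["h"].grad.detach().clone())
    handle.remove()

    # Riemann sum for the line integral along the induced path at the MHSA output
    relevance = torch.zeros_like(hs[0])
    for k in range(n_steps):
        relevance = relevance + 0.5 * (grads[k] + grads[k + 1]) * (hs[k + 1] - hs[k])

    # split the model dimension into contiguous per-head slices and sum within each slice
    _, L, d_model = relevance.shape
    per_head = relevance.view(1, L, n_heads, d_model // n_heads).sum(dim=-1)
    return per_head.squeeze(0)        # shape (L, n_heads): one relevance sequence per head
```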
3.3 Correlation coefficients and statistical significance
Each sequence of relevance attributions can then be correlated with sequence annotations to find out if the model focuses on the annotated amino acids. Coefficients of point biserial correlation (Kornbrot 2005), which is equivalent to Pearson correlation, were calculated between the continuous relevance values and the corresponding binary annotations per amino acid. This correlation analysis was conducted separately for each head in each transformer layer. The resulting correlation coefficients were then assembled into a matrix per protein, which entered the subsequent statistical analysis across proteins. Summary statistics over all proteins (which belong to the respective GO or EC class, and, which are part of the respective test dataset split) were obtained by computing Wilcoxon signed-rank tests across the correlation coefficients. The resulting P-values were corrected for the multiple tests per condition (16 heads times 30 layers equals 480 hypothesis tests) by controlling the false discovery rate (Benjamini and Hochberg 1995).
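The correlation and testing pipeline can be sketched with SciPy and statsmodels as follows; the array layout and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import pointbiserialr, wilcoxon
from statsmodels.stats.multitest import multipletests

def head_annotation_significance(relevance, annotation, n_layers=30, n_heads=16, alpha=0.05):
    """Statistical pipeline (sketch). Assumptions: `relevance[p]` is an
    (n_layers, n_heads, L_p) attribution array for protein p, and `annotation[p]` is the
    binary per-residue annotation vector of the same length L_p."""
    r = np.full((len(relevance), n_layers, n_heads), np.nan)
    for p, (rel, ann) in enumerate(zip(relevance, annotation)):
        for l in range(n_layers):
            for h in range(n_heads):
                r[p, l, h] = pointbiserialr(ann, rel[l, h])[0]   # point biserial = Pearson here

    # one-sided Wilcoxon signed-rank test per head, across proteins
    pvals = np.empty((n_layers, n_heads))
    for l in range(n_layers):
        for h in range(n_heads):
            vals = r[:, l, h]
            pvals[l, h] = wilcoxon(vals[~np.isnan(vals)], alternative="greater").pvalue

    # Benjamini/Hochberg correction over the 30 x 16 = 480 hypothesis tests
    reject, p_adj, _, _ = multipletests(pvals.ravel(), alpha=alpha, method="fdr_bh")
    return p_adj.reshape(n_layers, n_heads), reject.reshape(n_layers, n_heads)
```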
3.4 Summed attribution maps
In parallel to the correlation analysis, we furthermore sum the aforementioned $L \times n_{\text{heads}}$ attribution map along the sequence dimension, and obtain $n_{\text{heads}}$ entries that specify the relevance distribution over the different heads. We can carry out the same procedure for every transformer layer and combine all results into an $n_{\text{layers}} \times n_{\text{heads}}$ relevance map of summed attributions. This map makes it possible to identify heads with a positive relevance with respect to the selected class. One map was obtained per protein. Heads with a significantly positive relevance were singled out by calculating a summary statistic across proteins with the Wilcoxon signed-rank test. Finally, the two parallel analysis tracks were combined by identifying transformer heads that feature both a significantly positive (A) relevance-annotation-correlation and (B) relevance (this overlay is displayed in the figures by masking A with B).
4 Implementation
Supplementary Appendix D shows implementation details.
5 Results and discussion
5.1 Predictive performance
The performance results for the ProtT5, ProtBert, and ESM-2 transformers finetuned to the GO and EC protein function tasks are presented in Supplementary Tables E.1 to E.3 in Supplementary Appendix E. In summary, we show that finetuning pretrained large transformer models leads to competitive results, in particular in the most relevant comparison in the single-model category, often on par with MSA-approaches. Larger models lead the rankings, with ProtT5 competing with ESM-2. Finetuning the entire model including the encoder shows its particular strength in the “CAFA3” benchmark.
5.2 Explainability analysis: embedding layer
5.2.1 Research question
Starting with embedding layer attribution maps, as the most widely considered type of attribution, we investigate whether there are significant correlations between attribution maps and sequence annotations from external sources (see Section 2.1). We aim to answer this question in a statistical fashion going beyond anecdotal evidence based on single examples, which can sometimes be encountered in the literature.
5.2.2 GO prediction: GO “membrane” attributions correlate in particular with UniProt “transmembrane regions”
Figure 3 shows the results of the explainability analysis for the embedding layer of ProtBert finetuned to GO classification. The relevance of each amino acid indicative for selected GO terms was computed with IG, and then correlated with UniProt and PROSITE sequence annotations. Subsequently, it was tested whether the correlation coefficients across all annotated proteins from the test set were significantly positive (see Section 2.1). A significant correlation was observed when relevance attributions indicative for the GO label “membrane” were correlated with UniProt “transmembrane regions” (p < .05). Correlation was not observed in the GO “catalytic activity” and “binding” cases.
Figure 3.
Attribution maps for the embedding layer of ProtBert finetuned to GO term classification were correlated with sequence annotations. Relevance attributions indicative for the GO label “membrane” correlate significantly with UniProt annotations as “transmembrane regions” (p < .05, i.e. above blue line). Attribution-annotation-correlation was not observed for GO “catalytic activity” and “binding.” Numbers of test split samples both labeled with the GO term and annotated per amino acid are listed below the x-axis.
The pretrained model is expected to contain substantial information already prior to finetuning; e.g. Bernhofer and Rost (2022) had identified transmembrane regions using the pretrained ProtT5. Therefore, we inspected the GO “membrane” case in more detail. The pretrained but not finetuned ProtBert (combined with a classification head trained for the same number of epochs) also yielded a significantly positive correlation of embedding-level attributions for the GO term “membrane” with transmembrane regions, and with these annotations only. Thus, common patterns emerge between the pretrained and the finetuned ProtBert.
5.2.3 EC prediction: attributions correlate significantly with several types of sequence annotations
Figure 4 shows the results of the explainability analysis for the embedding layer of ProtBert finetuned to EC number classification (“EC50 level L1” dataset; i.e. the differentiation between the six main enzyme classes). Relevance per amino acid for each of the six EC classes was correlated with the UniProt annotations as “active sites,” “binding sites,” “transmembrane regions,” and “short sequence motifs.” It can be observed that the relevance attributions correlated significantly (p < .05) with “active site” and “binding site” annotations for five out of six EC classes, and with “transmembrane regions” and “short sequence motifs” for two and three EC classes, respectively. (Supplementary Figure E.2 shows that positive relevance-annotation-correlation was observed for all annotation types for “EC40” and “EC50” on both levels “L1” and “L2” for several enzyme (sub-) classes.)
Figure 4.
Attribution maps calculated for the embedding layer of ProtBert finetuned to “EC50 level L1” classification were correlated with UniProt sequence annotations. Left: Relevance attributions correlated significantly (p < .05, i.e. above blue line) with “active sites” and “binding sites” for five out of six EC classes, and with “transmembrane regions” and “short sequence motifs” for two and three EC classes, respectively. Right: Numbers of annotated samples in the test split per annotation type and EC class.
5.2.4 Discussion
Attribution maps obtained for the embedding layer correlated with UniProt annotations on the amino acid level, in particular, in the EC case, but also for the GO term “membrane.” To summarize, across two tasks, we provide first quantitative evidence for the meaningfulness and specificity of attribution maps beyond anecdotal evidence. Note that the EC case has the benefit of often several hundred annotated samples contained in the test split (except for “transmembrane regions” and “motifs”; see right panel of Fig. 4). In comparison, the GO case provides fewer samples in the test split of the dataset that were also annotated on the amino acid level (see numbers in brackets below the x-axis in Fig. 3).
5.3 Explainability analysis: peeking inside the transformer
5.3.1 Research question
Given the encouraging results presented in Section 5.2, we aim to go one step further and try to answer the more specific question of whether there are specialized heads inside of the model architecture for specific prediction tasks, using our IG variant that calculates relevance on the amino acid level per transformer head and layer (see Section 3).
5.3.2 GO-prediction: membrane
Figure 5 shows the results of the explainability analysis inspecting the latent representations inside of the ProtBert model focusing on the selected class of the GO term “membrane” (GO:0016020). Relevance attributions indicative for GO “membrane” per amino acid were correlated with the UniProt annotations as “transmembrane regions” separately for each transformer head and layer (matrix plot pixels in Fig. 5). In parallel, ProtBert heads were singled out with a significantly positive relevance (sum along the sequence) indicative for “membrane” (see also Section 2.1 and Section 3). Both parallel analysis streams were combined by identifying ProtBert heads with both a significantly positive attribution-annotation-correlation and relevance. Several ProtBert heads in different layers feature a significantly positive correlation of relevance attributions per amino acid with the UniProt annotations as “transmembrane regions,” going along with a significantly positive relevance for the GO class “membrane.” In contrast, correlation of relevance attributions with UniProt “active” or “binding sites” or “motifs” or PROSITE patterns accompanied by a positive relevance was not observed (hence these cases were not included in Fig. 5).
Figure 5.
Inside ProtBert; GO “membrane” (GO:0016020). Left: The relevance attribution (along the sequences) indicative for the GO term “membrane” was correlated with UniProt annotations as “transmembrane regions,” for each transformer head and layer. Point biserial correlation coefficients (r), obtained for each attribution-annotation-pair, were aggregated in population statistics with Wilcoxon signed-rank tests. The resulting P-values of the tests were adjusted with the Benjamini/Hochberg method for the multiple hypothesis tests conducted in order to limit the false discovery rate. A significance threshold of 0.05 was applied to the adjusted p-values. The negative logarithm of the corrected and thresholded p-values is displayed. All colored pixels indicate statistically significant results. Center: ProtBert heads with a sig. positive relevance (sum along the sequence; indicative for the GO term “membrane”) were singled out with the Wilcoxon signed-rank test. The matrix plots show the negative logarithm of the resulting p-values (adjusted with Benjamini/Hochberg and a threshold). Right: ProtBert heads with a sig. positive attribution-annotation-correlation (p-values from Wilcoxon signed-rank tests plotted) that are also characterized by a sig. positive relevance (the latter overlaid as mask). Only results for UniProt “transmembrane regions” are shown, omitting the results for “active/binding sites,” “motifs,” and PROSITE patterns, which did not feature heads with both a sig. positive relevance and attribution-annotation-correlation.
5.3.3 GO prediction: catalytic activity
Supplementary Figure E.3 (in Supplementary Appendix E) shows the results of the explainability analysis for the case where the GO term “catalytic activity” was selected (GO:0003824). Different ProtBert heads stand out characterized by a positive relevance accompanied by a positive correlation of attributions with PROSITE patterns and with UniProt “active sites” and “transmembrane regions” (but neither with “binding sites” nor “motifs”).
5.3.4 GO-prediction: binding
Supplementary Figure E.4 (in Supplementary Appendix E) repeats the explainability analysis inside ProtBert for the GO term “binding” (GO:0005488). For several transformer heads and layers, a positive relevance went along with a correlation of relevance attributions with corresponding PROSITE patterns, and with UniProt “transmembrane regions” (but neither with UniProt “active” nor “binding sites” nor “motifs”).
5.3.5 EC-prediction
Subsequently, we conducted the explainability analysis for the case where ProtBert had been finetuned to EC number classification on EC50 level L1. Here, the model had learned to differentiate between the six main enzyme classes. Supplementary Figure E.5 (in Supplementary Appendix E) identifies ProtBert heads characterized both by a positive relevance (sum along the sequence) with respect to the EC class, and by a positive attribution-annotation-correlation (on the amino acid level). The analysis was conducted separately for UniProt annotations as “active”/“binding sites,” “transmembrane regions,” and “motifs.” (The absence of identified heads for EC4, EC5, and EC6 in the “transmembrane regions” rows and for EC1 and EC5 in the “motif” rows of Supplementary Figure E.5 goes along with the availability of relatively few “transmembrane” and “motif” annotations for these EC classes; see histogram in Fig. 4.)
5.3.6 Discussion
In summary, we propose a constructive method suited to identify heads inside of the transformer architecture that are specialized for specific protein function or property prediction tasks. The proposed method comprises a novel adaptation of the explainable artificial intelligence (XAI) method of IG combined with a subsequent statistical analysis. We first attributed relevance to the single amino acids per protein (per GO term or EC class), separately for each transformer head and layer. Then, we inspected the correlation between relevance attributions and annotations, in a statistical analysis across the annotated proteins from the test split of the respective dataset. Apparently, different transformer heads are sensitive to different annotated and thus biologically or chemically “meaningful” sites, regions, or patterns on the amino acid sequence.
We discuss the benefits of finetuning a pretrained model end-to-end, and evaluate the XAI method with a residue substitution experiment in Supplementary Appendix E. There, we also discuss the relation of XAI to homology, to the hydrophobicity and charge of residues in transmembrane regions, and to probing and in-silico mutagenesis.
5.4 Uncovering collective dynamics
Finally, we studied collective dynamics potentially emerging among the transformer heads (ProtBert, EC50, level L1) by visualizing the originally high-dimensional, summed attribution maps in two dimensions, taking their similarities into account. For this purpose, the attribution maps that were summed along the amino acid sequence and represented as matrices (see Section 3) were flattened, resulting in one vector per protein. The dimensionality of these vectors was then reduced with principal component analysis to 50 dimensions, and subsequently to two dimensions with t-distributed stochastic neighbor embedding (t-SNE; van der Maaten and Hinton 2008), using the default t-SNE parameters. The resulting 2D points were visualized as a scatter plot and colored according to the corresponding six main enzyme classes (Fig. 6).
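The dimensionality reduction can be sketched, e.g. with scikit-learn, as follows; the array layout is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_summed_maps(summed_maps, random_state=0):
    """2D visualization of summed attribution maps (sketch). Assumption: `summed_maps` is an
    (n_proteins, n_layers, n_heads) array of attributions summed along the sequence."""
    flat = summed_maps.reshape(len(summed_maps), -1)        # one flattened vector per protein
    reduced = PCA(n_components=50).fit_transform(flat)      # reduce to 50 dimensions with PCA
    return TSNE(n_components=2, random_state=random_state).fit_transform(reduced)  # then to 2D
```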
Figure 6.

PCA and t-SNE visualization of summed attribution maps (ProtBert, EC50, L1).
The points form distinctive clusters matching the EC labels. Apparently, a structure emerges in the attribution maps that seems to indicate class-specific collective dynamics among several ProtBert heads. It is important to stress that the attribution map underlying the clustering no longer contains any reference to specific positions in the sequence but relies on the relevance distribution over the different heads across all layers of the model. The emergence of class-specific structures therefore indicates that there are specific combinations of heads that are relevant for a specific classification decision.
6 Conclusion
This work provides additional evidence for the effectiveness of the currently predominant paradigm in deep-learning-based protein analysis, namely the end-to-end finetuning of large protein language models (which brings additional benefits; see Supplementary Appendix E). For different protein function prediction tasks, this approach leads to best-performing models according to single-model performance. The performance level is in many cases on par with MSA-approaches. The proposed models can even be effectively combined with the latter through the formation of ensembles.
Considering the ever-increasing model complexity, XAI has started to gain traction in the field of protein analysis too (Upmeier zu Belzen et al. 2019, Taujale et al. 2021, Vig et al. 2021, Hou et al. 2023, Vu et al. 2023, Zhou et al. 2023), but quantitative evidence for its applicability beyond single examples was lacking up to now. We provide statistical evidence for the alignment of attribution maps with corresponding sequence annotations, both on the embedding level as well as for specific heads inside of the model architecture, which led to the identification of specialized heads for specific protein function prediction tasks. Emerging class-specific structures suggest that these specialized transformer heads act jointly and decide together in specific combinations. A further detailed analysis of the identified heads could be an interesting next step in future research, potentially based on the query/key/value (QKV) matrices. Inside the multi-layered model, however, a direct correspondence between rows/columns of the QKV matrices and individual residues in the sequence no longer exists. This limitation makes it difficult, e.g., to infer relations between residues from the QKV matrices.
In summary, XAI promises to tap into the presumably substantial knowledge contained in large models pretrained on massive datasets of amino and/or nucleic acid sequences (Ji et al. 2021). Therefore, we expect that XAI will play an increasingly important role in the future of bioinformatics research. We see potential applications of XAI for model validation and for scientific discovery (e.g. of novel discriminative sequence patterns or motifs that have not been identified by experiments or MSA so far). Identifying specialized heads might also help to prune overly large models, making them smaller and more efficient.
Supplementary Material
Contributor Information
Markus Wenzel, Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany.
Erik Grüner, Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany.
Nils Strodthoff, School VI - Medicine and Health Services, Carl von Ossietzky University of Oldenburg, Ammerländer Heerstr. 114-118, 26129 Oldenburg, Germany.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
This work was supported by the Bundesministerium für Bildung und Forschung through the BIFOLD—Berlin Institute for the Foundations of Learning and Data [grant numbers 01IS18025A, 01IS18037A].
References
- Adebayo J., Gilmer J., Muelly M., et al. (2018). Sanity checks for saliency maps. Adv. neural inf. process. syst., 31. [Google Scholar]
- Alley E. C., Khimulya G., Biswas S., et al. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods, 16(12), 1315–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- AlQuraishi M. (2021). Machine learning in protein structure prediction. Curr. Opin. Chem. Biol., 65, 1–8. [DOI] [PubMed] [Google Scholar]
- Arras L., Osman A., Müller K.-R., et al. (2019). Evaluating Recurrent Neural Network Explanations. In Proc. ‘19 ACL Workshop BlackboxNLP. ACL. [Google Scholar]
- Arrieta A. B., Díaz-Rodríguez N., Del Ser J., et al. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information fusion, 58, 82–115. [Google Scholar]
- Ashburner M., Ball C. A., Blake J. A., et al. (2000). Gene Ontology: tool for the unification of biology. Nat. genet., 25(1), 25–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bach S., Binder A., Montavon G., et al. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7). [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bai B., Liang J., Zhang G., et al. (2021). Why Attentions May Not Be Interpretable? In Proc. 27th ACM SIGKDD, KDD ’21, page 25–34, NY, USA. ACM.
- Baker J. A., Wong W.-C., Eisenhaber B., et al. (2017). Charged residues next to transmembrane regions revisited: “Positive-inside rule” is complemented by the “negative inside depletion/outside enrichment rule”. BMC biology, 15(1), 1–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Belinkov Y. (2022). Probing classifiers: Promises, shortcomings, and advances. Comput. Linguist., 48(1), 207–219. [Google Scholar]
- Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc.: series B (Methodol.), 57(1), 289–300. [Google Scholar]
- Bepler T., Berger B. (2021). Learning the protein language: Evolution, structure, and function. Cell Systems, 12(6), 654–669.e3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bernhofer M., Rost B. (2022). TMbed: transmembrane proteins predicted through language model embeddings. BMC Bioinform., 23(1), 326. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bernhofer M., Dallago C., Karl T., et al. (2021). PredictProtein - Predicting Protein Structure and Function for 29 Years. Nucleic Acids Res., 49(W1), W535–W540. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Binder A., Bach S., Montavon G., et al. (2016). Layer-wise relevance propagation for deep neural network architectures. In ICISA ‘16, pages 913–922. Springer.
- Binder A., Weber L., Lapuschkin S., et al. (2023). Shortcomings of top-down randomization-based sanity checks for evaluations of deep neural network explanations. In Proc. IEEE/CVF CVPR, pages 16143–16152.
- Blücher S., Vielhaben J., Strodthoff N. (2022). PredDiff: Explanations and interactions from conditional expectations. Artificial Intelligence, 312, 103774. [Google Scholar]
- Brandes N., Ofer D., Peleg Y., et al. (2022). ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics, 38(8), 2102–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bromberg Y., Rost B. (2008). Comprehensive in silico mutagenesis highlights functionally important residues in proteins. Bioinformatics, 24(16), i207–i212. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Buchfink B., Xie C., Huson D. H. (2014). Fast and sensitive protein alignment using DIAMOND. Nat. Methods, 12(1), 59–60. [DOI] [PubMed] [Google Scholar]
- Chefer H., Gur S., Wolf L. (2021). Transformer interpretability beyond attention visualization. In 2021 IEEE/CVF CVPR, pages 782–791.
- Clark W. T., Radivojac P. (2013). Information-theoretic evaluation of predicted ontological annotations. Bioinformatics, 29(13), i53–i61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Consortium, G. O. (2020a). The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res., 49(D1), D325–D334. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Consortium, U. (2020b). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res., 49(D1), D480–D489. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Covert I., Lundberg S., Lee S.-I. (2021). Explaining by Removing: A Unified Framework for Model Explanation. J. Mach. Learn. Res., 22(209), 1–90. [Google Scholar]
- Cunningham B. C., Wells J. A. (1989). High-resolution epitope mapping of hGH-receptor interactions by alanine-scanning mutagenesis. Science, 244(4908), 1081–1085. [DOI] [PubMed] [Google Scholar]
- Dalkiran A., Rifaioglu A. S., Martin M. J., et al. (2018). ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinform., 19(1), 334. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Devlin J., Chang M.-W., Lee K., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- Elazar A., Weinstein J. J., Prilusky J., et al. (2016). Interplay between hydrophobicity and the positive-inside rule in determining membrane-protein topology. PNAS, 113(37), 10340–10345. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Elnaggar A., Heinzinger M., Dallago C., et al. (2022). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell., 44(10), 7112–27. [DOI] [PubMed] [Google Scholar]
- Fenoy E., Edera A. A., Stegmayer G. (2022). Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks. Brief. Bioinform., 23(4), bbac232. [DOI] [PubMed] [Google Scholar]
- Ferruz N., Schmidt S., Höcker B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun., 13(1), 4348. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gong Q., Ning W., Tian W. (2016). GoFDR: a sequence alignment based method for predicting protein functions. Methods, 93, 3–14. [DOI] [PubMed] [Google Scholar]
- Hie B. L., Yang K. K. (2022). Adaptive machine learning for protein engineering. Curr. Opin. Struct. Biol., 72, 145–152. [DOI] [PubMed] [Google Scholar]
- Hou Z., Yang Y., Ma Z., et al. (2023). Learning the protein language of proteome-wide protein-protein binding sites via explainable ensemble deep learning. Commun. Biol., 6(1), 73. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Howard J., Ruder S. (2018). Universal Language Model Fine-tuning for Text Classification. In Proc. 56th Annu. Meet. Assoc. Comput. Linguist. (Vol. 1: Long Papers), pages 328–339, Melbourne. ACL.
- Jain S., Wallace B. C. (2019). Attention is not explanation. In Proc. 2019 Conf. N. Amer. Chapter Assoc. Comput. Linguist.: Hum. Lang. Tech., Vol. 1, pages 3543–56, Minneapolis, MN. ACL.
- Ji Y., Zhou Z., Liu H., et al. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15), 2112–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang Y., Oron T. R., Clark W. T., et al. (2016). An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome biol., 17(1), 1–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jumper J., Evans R., Pritzel A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kapishnikov A., Venugopalan S., Avci B., et al. (2021). Guided integrated gradients: An adaptive path method for removing noise. In 2021 CVPR, pages 5048–56.
- Kim B., Wattenberg M., Gilmer J., et al. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, pages 2668–77. PMLR.
- Kingma D. P., Ba J. (2015). Adam: A method for stochastic optimization. In 3rd ICLR, San Diego.
- Koehler Leman J., Szczerbiak P., Renfrew P. D., et al. (2023). Sequence-structure-function relationships in the microbial protein universe. Nature communications, 14(1), 2351. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kokhlikyan N., Miglani V., Martin M., et al. (2020). Captum: A unified and generic model interpretability library for PyTorch. arXiv:2009.07896.
- Kornbrot D. (2005). Point Biserial Correlation. John Wiley & Sons. [Google Scholar]
- Kryshtafovych A., Schwede T., Topf M., et al. (2019). Critical assessment of methods of protein structure prediction (CASP) – Round XIII. Proteins: Structure, Function, and Bioinformatics, 87(12), 1011–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kryshtafovych A., Schwede T., Topf M., et al. (2021). Critical assessment of methods of protein structure prediction (CASP) – Round XIV. Proteins: Structure, Function, and Bioinformatics, 89(12), 1607–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kulmanov M., Hoehndorf R. (2019). DeepGOPlus: improved protein function prediction from sequence. Bioinformatics. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kulmanov M., Hoehndorf R. (2022). DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms. Bioinformatics, 38(Supp. 1), i238–i245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kulmanov M., Khan M. A., Hoehndorf R. (2017). DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics, 34(4), 660–668. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lapuschkin S., Wäldchen S., Binder A., et al. (2019). Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun., 10, 1096. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li Y., Wang S., Umarov R., et al. (2018). DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics, 34(5), 760–769. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin Z., Akin H., Rao R., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–30. [DOI] [PubMed] [Google Scholar]
- Littmann M., Heinzinger M., Dallago C., et al. (2021). Embeddings from deep learning transfer GO annotations beyond homology. Sci. rep., 11(1), 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lundberg S. M., Lee S.-I. (2017). A Unified Approach to Interpreting Model Predictions. In Guyon I., Luxburg U. V., Bengio S., Wallach H., Fergus R., Vishwanathan S., Garnett R., editors, Adv. NeurIPS, volume 30. Curran Assoc. [Google Scholar]
- Madani A., Krause B., Greene E. R., et al. (2023). Large language models generate functional protein sequences across diverse families. Nat. Biotechnol., pages 1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manning C. D., Clark K., Hewitt J., et al. (2020). Emergent linguistic structure in artificial neural networks trained by self-supervision. PNAS, 117(48), 30046–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mardis E. R. (2017). DNA sequencing technologies: 2006–2016. Nat. Protoc., 12(2), 213–218. [DOI] [PubMed] [Google Scholar]
- McDonald A. G., Boyce S., Tipton K. F. (2008). ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Res., 37(Suppl. 1), D593–D597. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Montavon G., Samek W., Müller K.-R. (2018). Methods for interpreting and understanding deep neural networks. Digit. Signal Process, 73, 1–15. [Google Scholar]
- Nambiar A., Heflin M., Liu S., et al. (2020). Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks. In Proc. 11th ACM BCB, BCB ’20, New York City. ACM.
- Niu Z., Zhong G., Yu H. (2021). A review on the attention mechanism of deep learning. Neurocomputing, 452, 48–62. [Google Scholar]
- Olenyi T., Marquet C., Heinzinger M., et al. (2023). LambdaPP: Fast and accessible protein-specific phenotype predictions. Protein Science, 32(1), e4524. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pascual D., Brunner G., Wattenhofer R. (2021). Telling BERT’s Full Story: from Local Attention to Global Aggregation. In Proc. 16th Conf. Eur. Ch. Assoc. Comput. Linguist.: Main Vol., pages 105–124, Online. ACL.
- Paszke A., Gross S., Massa F., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach H., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E., Garnett R., editors, Adv. NeurIPS 32, pages 8024–35. Curran Assoc. [Google Scholar]
- Pearson W. R. (2013). An Introduction to Sequence Similarity (“Homology”) Searching. CP Bioinformatics, 42(1), 3.1.1–3.1.8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Perdigão N., Heinrich J., Stolte C., et al. (2015). Unexpected features of the dark proteome. PNAS, 112(52), 15898–903. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Radivojac P., Clark W. T., Oron T. R., et al. (2013). A large-scale evaluation of computational protein function prediction. Nat. Methods, 10(3), 221–227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raffel C., Shazeer N., Roberts A., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 21(140), 1–67. [Google Scholar]
- Raimondi D., Orlando G., Tabaro F., et al. (2018). Large-scale in-silico statistical mutagenesis analysis sheds light on the deleteriousness landscape of the human proteome. Sci. Rep., 8(1), 16980. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rao R., Bhattacharya N., Thomas N., et al. (2019). Evaluating Protein Transfer Learning with TAPE. In Wallach H., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E., Garnett R., editors, Adv. Neural Inf. Process. Syst., volume 32. Curran Assoc. [PMC free article] [PubMed] [Google Scholar]
- Rao R. M., Liu J., Verkuil R., et al. (2021). MSA Transformer. In Meila M., Zhang T., editors, Proc. 38th ICML, volume 139 of PMLR, pages 8844–56. PMLR. [Google Scholar]
- Reimers N., Gurevych I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proc. ‘19 EMNLP. ACL.
- Ribeiro M. T., Singh S., Guestrin C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proc. 22nd ACM SIGKDD, KDD ’16, page 1135–1144, New York, NY, USA. ACM.
- Rives A., Meier J., Sercu T., et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 118(15). [DOI] [PMC free article] [PubMed] [Google Scholar]
- Samek W., Montavon G., Lapuschkin S., et al. (2021). Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications. Proc. IEEE, 109(3), 247–278. [Google Scholar]
- Seabold S., Perktold J. (2010). Statsmodels: Econometric and statistical modeling with Python. In Proc. SciPy 2010, volume 57, pages 10–25080. Austin, Texas.
- Selvaraju R. R., Cogswell M., Das A., et al. (2017). Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In ‘17 IEEE ICCV, pages 618–626.
- Serrano S., Smith N. A. (2019). Is Attention Interpretable? In Proc. 57th Annu. Meet. Assoc. Comput. Linguist., pages 2931–51, Florence, Italy. ACL.
- Shendure J., Balasubramanian S., Church G. M., et al. (2017). DNA sequencing at 40: past, present and future. Nature, 550(7676), 345–353. [DOI] [PubMed] [Google Scholar]
- Sigrist C. J., De Castro E., Cerutti L., et al. (2012). New and continuing developments at PROSITE. Nucleic Acids Res., 41(D1), D344–D347. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Steinegger M., Söding J. (2018). Clustering huge protein sequence sets in linear time. Nat. Commun., 9(1), 1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Steinegger M., Mirdita M., Söding J. (2019). Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods, 16(7), 603–606. [DOI] [PubMed] [Google Scholar]
- Strodthoff N., Wagner P., Wenzel M., et al. (2020). UDSMProt: universal deep sequence models for protein classification. Bioinformatics, 36(8), 2401–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sundararajan M., Taly A., Yan Q. (2017). Axiomatic Attribution for Deep Networks. In Precup D., Teh Y. W., editors, Proc. 34th ICML, volume 70 of Proc. Mach. Learn. Res., pages 3319–28. PMLR. [Google Scholar]
- Suzek B. E., Wang Y., Huang H., et al. (2014). UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6), 926–932. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Taujale R., Zhou Z., Yeung W., et al. (2021). Mapping the glycosyltransferase fold landscape using interpretable deep learning. Nat. Commun., 12(1), 5656. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tjoa E., Guan C. (2020). A survey on explainable artificial intelligence (XAI): Toward medical XAI. IEEE Trans. Neural Netw. Learn. Syst., 32(11), 4793–4813. [DOI] [PubMed] [Google Scholar]
- Torrisi M., Pollastri G., Le Q. (2020). Deep learning methods in protein structure prediction. Comput. Struct. Biotechnol. J., 18, 1301–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Unsal S., Atas H., Albayrak M., et al. (2022). Learning functional properties of proteins with language models. Nat. Mach. Intell., 4(3), 227–245. [Google Scholar]
- Upmeier zu Belzen J., Bürgel T., Holderbach S., et al. (2019). Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins. Nat. Mach. Intell., 1(5), 225–235. [Google Scholar]
- van der Maaten L., Hinton G. (2008). Visualizing Data using t-SNE. J. Mach. Learn. Res., 9(86), 2579–2605. [Google Scholar]
- Varadi M., Anyango S., Deshpande M., et al. (2021). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1), D439–D444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vaswani A., Shazeer N., Parmar N., et al. (2017). Attention is All You Need. In Proc. 31st NIPS, page 6000–6010, NY, USA. Curran Assoc. [Google Scholar]
- Vielhaben J., Wenzel M., Samek W., et al. (2020). USMPep: universal sequence models for major histocompatibility complex binding affinity prediction. BMC Bioinform., 21(1), 1–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vig J., Madani A., Varshney L. R., et al. (2021). BERTology Meets Biology: Interpreting Attention in Protein Language Models. In ICLR 2021. [Google Scholar]
- Virtanen P., Gommers R., Oliphant T. E., et al. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods, 17, 261–272. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vonheijne G. (1989). Control of topology and mode of assembly of a polytopic membrane protein by positively charged residues. Nature, 341(6241), 456–458. [DOI] [PubMed] [Google Scholar]
- Vu M. H., Akbar R., Robert P. A., et al. (2023). Linguistically inspired roadmap for building biologically reliable protein language models. Nat. Mach. Intell., 5(5), 485–496. [Google Scholar]
- Ward G., Kamkar S., Budzik J. (2020). An exploration of the influence of path choice in game-theoretic attribution algorithms. arXiv:2007.04169.
- Webb E. C. et al. (1992). Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the IUBMB on the Nomenclature and Classification of Enzymes. 6th ed. Academic Press.
- Weissenow K., Heinzinger M., Rost B. (2022). Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure, 30(8), 1169–1177.e4. [DOI] [PubMed] [Google Scholar]
- Yang K. K., Wu Z., Arnold F. H. (2019). Machine-learning-guided directed evolution for protein engineering. Nat. Methods, 16(8), 687–694. [DOI] [PubMed] [Google Scholar]
- You R., Huang X., Zhu S. (2018a). DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation. Methods, 145, 82–90. [DOI] [PubMed] [Google Scholar]
- You R., Zhang Z., Xiong Y., et al. (2018b). GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics, 34(14), 2465–73. [DOI] [PubMed] [Google Scholar]
- You R., Yao S., Mamitsuka H., et al. (2021). DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction. Bioinformatics, 37(1), i262–71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yu T., Cui H., Li J. C., et al. (2023). Enzyme function prediction using contrastive learning. Science, 379(6639), 1358–63. [DOI] [PubMed] [Google Scholar]
- Zhang B., Li J., Lü Q. (2018). Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinform., 19(1), 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhou N., Jiang Y., Bergquist T. R., et al. (2019). The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome biol., 20(1), 1–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhou Z., Yeung W., Gravel N., et al. (2023). Phosformer: an explainable transformer model for protein kinase-specific phosphorylation predictions. Bioinformatics, 39(2). [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zou Z., Tian S., Gao X., et al. (2019). mlDEEPre: Multi-Functional Enzyme Function Prediction With Hierarchical Multi-Label Deep Learning. Front. Genet., 9, 714. [DOI] [PMC free article] [PubMed] [Google Scholar]