The Innovation. 2023 Jul 27;4(5):100487. doi: 10.1016/j.xinn.2023.100487

Deciphering “the language of nature”: A transformer-based language model for deleterious mutations in proteins

Theodore T Jiang 1,2,3, Li Fang 1,4,∗, Kai Wang 1,5,∗∗
PMCID: PMC10448337  PMID: 37636282

Abstract

Various machine-learning models, including deep neural network models, have already been developed to predict the deleteriousness of missense (non-synonymous) mutations. Potential improvements to the current state of the art, however, may still benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine-learning approaches. Recent advances in the field of natural language processing show that transformer models, a type of deep neural network, are particularly powerful at modeling sequence information with context dependence. In this study, we introduce MutFormer, a transformer-based model for the prediction of deleterious missense mutations, which uses reference and mutated protein sequences from the human genome as the primary features. MutFormer takes advantage of a combination of self-attention layers and convolutional layers to learn both long-range and short-range dependencies between amino acids in a protein sequence. We first pre-trained MutFormer on reference protein sequences and mutated protein sequences resulting from common genetic variants observed in human populations. We next examined different fine-tuning methods to successfully apply the model to deleteriousness prediction of missense mutations. Finally, we evaluated MutFormer’s performance on multiple testing datasets. We found that MutFormer showed similar or improved performance over a variety of existing tools, including those that used conventional machine-learning approaches. In conclusion, MutFormer considers sequence features that are not explored in previous studies and can complement existing computational predictions or empirically generated functional scores to improve our understanding of disease variants.

Graphical abstract


Public summary

  • Prediction of deleterious mutations using a model inspired by natural language processing.

  • A transformer-based model to learn long- and short-range dependencies between amino acids.

  • Learning from both common genetic variations and known disease mutations.

  • Complements existing predictions or empirically generated functional scores.

Introduction

Whole-exome and whole-genome sequencing technologies are powerful tools for the detection of genetic mutations. A typical human genome has 4.1 million to 5.0 million variants when compared with the reference genome sequence,1 while the average exome captures genomic regions that account for 1%–2% of the human genome. Therefore, distinguishing or prioritizing a small number of disease-related variants from such a large number of background variants becomes a key challenge in understanding genome and exome sequencing data. In particular, the interpretation of non-synonymous single nucleotide variants (SNVs) is of major interest, because missense mutations in proteins account for more than one-half of the currently known variants responsible for inherited human disorders, especially Mendelian diseases, where the mutations have high penetrance.2 Unlike frameshift indels or splicing mutations in canonical splice sites that have a high likelihood to alter protein function, missense mutations change only a single amino acid, so most of them may not have significant impacts on protein function. To this end, population-specific allele frequencies, such as those inferred from the ExAC3,4 and gnomAD5 databases, can be useful to filter out common missense variants that are likely to be neutral, and mutation databases such as ClinVar6,7,8 and the Human Gene Mutation Database (HGMD)2 can be valuable resources to find previously reported mutations that may be deleterious. Still, a large number of missense variants from exome sequencing are not yet documented; therefore, the functional interpretation of such variants remains a crucial task.

Numerous computational tools have been developed to predict the deleteriousness or pathogenicity of missense mutations9,10,11,12,13,14,15,16; however, as shown by multiple recent publications, the accuracy of predictive algorithms still has room for improvement. Databases such as dbNSFP9,10,11 now document whole-exome prediction scores from different prediction algorithms in an effort to facilitate the development of improved functional assessment algorithms. However, depending on the evaluation datasets that were used, most algorithms for missense variant prediction are 65%–80% accurate when examining known disease variants, and only approximately 43.4% of pairwise prediction correlations between different predictive algorithms are greater than 0.5.9 Many conflicting predictions can be made by different algorithms, which motivated the development of several ensemble-based scoring systems that combine multiple prediction algorithms, such as MetaSVM,17 REVEL,18 and CADD.19 In fact, predictions combined from different algorithms are considered a single piece of evidence according to the American College of Medical Genetics and Genomics-Association for Molecular Pathology 2015 guidelines.20 In addition, most existing computational algorithms are based on similar or related information (e.g., evolutionary conservation scores, mutation tolerance scores); potential improvements to the current state of the art could benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine-learning approaches that examine additional types of information.

Among previously published prediction algorithms, deep learning-based, sequence-focused models have been shown to be effective in modeling variant function. These existing methods primarily used convolutional neural networks (CNNs) to model sequences.21,22,23 Recently, however, advances in deep learning have shown transformer models to be particularly powerful for modeling sequential data. Transformer models, such as the Bidirectional Encoder Representations from Transformers (BERT),24,25 rely on a central mechanism, self-attention. Self-attention gives the transformer model an unprecedented ability to model relationships between tokens in a sequence, which is crucial for the comprehension of linear sequences. In the past three years, transformers have achieved state-of-the-art performance on a broad range of natural language processing (NLP) tasks,24,26,27,28 and they are competitive with more traditional CNN-based models on image recognition tasks.29 More recently, transformers have also been successfully applied to modeling protein structure in AlphaFold2,30 and to DNA interpretation in works such as Enformer.31 Part of the reason for the success of transformers may be their increased ability to handle subtle context dependency through a multi-head attention mechanism, as well as their ability to compute attention in parallel, which greatly speeds up computation compared with typical recurrent neural network-based algorithms.

In biological contexts, each amino acid in a given protein sequence exerts its function in a context-dependent manner, including both local dependency (such as forming a short signal peptide that is recognized by cellular machinery) and long-distance dependency (such as being close to another amino acid in three-dimensional structure to form a binding site for ligands). Therefore, we hypothesize that transformer models would be capable of more effective modeling of protein sequences, somewhat similar to how transformers have transformed the field of NLP and language translation over the past few years.

In this study, we propose MutFormer, a transformer-based model, to assess the deleteriousness of missense mutations. MutFormer is an adaptation of the BERT architecture24 to protein contexts, with appropriate modifications to incorporate protein-specific characteristics. MutFormer can analyze protein sequences directly, with or without any homology information or additional data. Our experiments show that MutFormer is capable of matching or outperforming current methods in the deleteriousness prediction of missense variants.

MutFormer is based on the BERT architecture.24 A central component of the classical BERT model is its bidirectional self-attention. This mechanism uses a two-dimensional matrix to model the context between all positions in a given sequence, enabling efficient learning of long-range dependencies between residues. Convolution is another mechanism capable of learning dependencies, but it is better suited to short-range dependencies: convolution filters readily prioritize localized patterns, whereas relating residues that are far apart in a sequence requires repeated application of convolution filters, which often weakens long-range dependencies. MutFormer takes advantage of both self-attention layers and convolutional layers to effectively learn both long-range and short-range dependencies.

In language modeling tasks, words or sub-words are short-range features of the sequence. The original BERT uses a fixed WordPiece vocabulary, which contains common words or sub-words in the training corpus.32 This vocabulary cannot be tuned during the pre-training and fine-tuning process; therefore, a spelling error may introduce an out-of-vocabulary word that hinders the model’s ability to interpret a given sequence. In protein sequences, “words” correspond to key subsequences or patterns of amino acids. These words can be altered by mutations and, furthermore, are not known in advance. Recent studies showed that vocabulary-free models (e.g., byte-level models) are more robust to noise and perform better on tasks that are sensitive to spelling.33 Therefore, instead of using a fixed vocabulary, MutFormer places convolutional layers between the embedding layers and the transformer body. MutFormer uses these convolutions to learn its own vocabulary over the course of training, incorporating nonlinear patterns via the convolution filters. The weights of the convolutional layers are tuned during both the pre-training and fine-tuning processes.
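
To make this concrete, the sketch below is a minimal PyTorch illustration, not the authors’ TPU-trained implementation; the number of convolution layers, the activation function, and the layer sizes are assumptions based on the description above. It places trainable convolutions between an amino acid embedding layer and a standard transformer encoder, so that the convolution filters can act as a learnable “vocabulary” of short motifs.

```python
import torch
import torch.nn as nn

class ConvVocabEncoder(nn.Module):
    """Convolutions between the embedding layer and the transformer body act as a
    learnable 'vocabulary' of short amino acid motifs (illustrative sketch only)."""

    def __init__(self, vocab_size=30, hidden=768, n_layers=8, n_heads=12, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)          # one token per amino acid
        # Kernel-size-3 convolutions with `hidden` filters; their weights are tuned
        # during both pre-training and fine-tuning (layer count is a simplification).
        self.convs = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.GELU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=3072, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, token_ids):                               # (batch, seq_len)
        x = self.embed(token_ids)                               # (batch, seq_len, hidden)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)       # convolve along the sequence
        return self.transformer(x)                              # contextual output embeddings
```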

Materials and methods

Pre-processing of input sequence

The input to the MutFormer model is an amino acid sequence that can be either a single protein (with a missense mutation) or the concatenation of a pair of proteins (a mutated protein and its corresponding reference protein). Each amino acid is treated as a token. The maximum input length of MutFormer was set to 1,024; protein sequences longer than 1,024 residues are cut into segments (see Supplemental Methods 1.1 for details). In the pre-processing step, several special tokens are added to the sequence (Figure 1A). Because the input sequence may be a cropped sequence, we add a “B” token at the true start position of the protein and a “J” token at the true end of the protein, so that a cropped start/end (without B or J) can be distinguished from a real start/end (with B or J). The letters B and J were chosen because they are not included in the current amino acid code table. In the original BERT model, the first token of every sequence is always a [CLS] token, and the final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks; a [SEP] token is used to separate different sentences and is also placed at the very end of the input.24 We followed this practice in MutFormer (Figure 1A).
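
The following Python sketch illustrates this pre-processing step; the exact cropping rule and the ordering of the two sequences in the paired case are assumptions (see Supplemental Methods 1.1 and Figure 1A for the authors’ description).

```python
CLS, SEP, START, END = "[CLS]", "[SEP]", "B", "J"   # B/J mark the true protein start/end

def preprocess(mutated, reference=None, max_len=1024, mut_pos=None):
    """Build a MutFormer-style token list from one protein (or a reference/mutated pair).

    Each amino acid is one token. Cropping keeps the mutation near the center; the
    exact cropping rule is an assumption (see Supplemental Methods 1.1).
    """
    def mark_and_crop(seq, budget):
        tokens = [START] + list(seq) + [END]          # mark the real start and end
        if len(tokens) > budget and mut_pos is not None:
            half = budget // 2
            lo = max(0, min(mut_pos + 1 - half, len(tokens) - budget))
            tokens = tokens[lo:lo + budget]           # cropped ends lose their B/J marker
        return tokens[:budget]

    if reference is None:                             # single-sequence input (Figure 1A)
        return [CLS] + mark_and_crop(mutated, max_len - 2) + [SEP]
    budget = (max_len - 3) // 2                       # paired input: [CLS] plus two [SEP]
    return ([CLS] + mark_and_crop(reference, budget) + [SEP]
                  + mark_and_crop(mutated, budget) + [SEP])
```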

Figure 1.


The MutFormer model architecture

(A) Pre-processing procedure of input. The input may be a single protein (with a missense mutation) or a pair of protein sequences (a reference protein and a mutated protein). In the input sequences, the red color indicates the mutated amino acid and the blue color indicates the original amino acid in the reference protein sequence. For the single-sequence case, “B” and “J” tokens are attached to the true start and end of the sequence, respectively. The sequence is then cropped to the maximum sequence length. A [CLS] token and a [SEP] token are then added to the start and end of the whole sequence, respectively. For the paired-sequence case, the B and J tokens are added as normal to both sequences, except now a [CLS] token is placed at the beginning of the first sequence; a [SEP] token is placed between the two; and another [SEP] token is placed at the very end.

(B) MutFormer model structure. A system of positional, label, and token embeddings is used to first vectorize the input tokens. Two convolutional operations (four convolution layers; kernel size 3, stride 1, 768 filters) are used to process the embedding representation (for the integrated convolutions model, skip connections are used before both convolution operations). A bidirectional transformer body with self-attention applies a sequence of attention layers to the resulting embedding representation to obtain the output embeddings (MutFormerNL/MutBERTNL indicates N layers in the transformer body).

(C) Pipeline for deleteriousness prediction without external resources. MutFormer output embeddings are obtained by feeding the input protein sequence (single or paired) to the MutFormer model. A fully connected layer is attached after this to obtain the resulting deleteriousness prediction.

(D) Pipeline for deleteriousness prediction with external resources. MutFormer’s output embedding and the sequence representation of all external predictions are concatenated, then a fully connected layer is used to obtain the final prediction.

The MutFormer model

The MutFormer architecture was implemented on top of the classic BERT architecture. The MutFormer model consists of three primary parts: embeddings, convolutions, and the transformer body (Figure 1B).

The embedding layers create positional, label, and token embeddings for the input sequence. We denote the input sequence length as S and the hidden embedding size as H. Positional embeddings are calculated from a learnable parameter matrix (size: S × H). Label embeddings are one-hot encodings that indicate whether an amino acid belongs to the reference or the mutated sequence; they are used only for paired-sequence input and are not used when the input is a single protein. Token embeddings map each amino acid into a high-dimensional space (dimension = H); these embeddings are real-valued, learnable parameters, and the embedded sequence has size S × H. The final embedding output is the sum of these three embedding representations, with layer normalization applied after the sum.
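
A minimal PyTorch sketch of these embedding layers is shown below. The label embeddings, described above as one-hot indicators, are represented here as a learnable two-row embedding table (a common BERT-style simplification), which is an assumption rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class MutFormerEmbeddings(nn.Module):
    """Token + positional + label embeddings, summed and layer-normalized.

    S = maximum sequence length, H = hidden size. The label embedding is sketched
    here as a learnable 2-row table rather than a literal one-hot encoding."""

    def __init__(self, vocab_size=30, max_len=1024, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)   # maps each amino acid to H dims
        self.position = nn.Embedding(max_len, hidden)   # learnable S x H position matrix
        self.label = nn.Embedding(2, hidden)            # 0 = reference, 1 = mutated
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, label_ids=None):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        out = self.token(token_ids) + self.position(positions)
        if label_ids is not None:                       # label embeddings: paired input only
            out = out + self.label(label_ids)
        return self.norm(out)                           # layer normalization after the sum
```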

For the convolutions of MutFormer, two different ways of integrating them into the model were tested: (1) four convolution layers, where the convolution outputs are the only input to the following transformer body, and (2) an integrated approach with skip connections, in which the original embedding outputs and convolution outputs are combined and fed into the transformer body (the input of the transformer body is the sum of the embedding output and the result of applying two or four convolutions). Each convolution used a kernel size of 3, and no pooling or concatenation was applied (only the convolution operation itself was needed to take advantage of the pattern recognition ability of convolutions).
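
The difference between the two integration strategies can be sketched as follows (an illustrative PyTorch-style function, with `conv_block` standing in for the stack of kernel-size-3 convolutions; not the authors’ code).

```python
def transformer_body_input(embeddings, conv_block, integrated=False):
    """Two ways of feeding the convolutions into the transformer body (sketch).

    embeddings: (batch, seq_len, hidden) output of the embedding layers
    conv_block: stack of kernel-size-3 Conv1d layers with `hidden` filters
    """
    conv_out = conv_block(embeddings.transpose(1, 2)).transpose(1, 2)
    if integrated:
        # Integrated convolutions: a skip connection lets the transformer see both
        # the raw embeddings and the convolution-filtered representation.
        return embeddings + conv_out
    # Classic MutFormer: the convolution output alone is fed to the transformer.
    return conv_out
```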

The transformer body of MutFormer is taken from the original BERT model, along with its self-attention modules. The time and space complexity of self-attention is quadratic in the length of the input.34 Because of the constraint of computational resources, a maximum input sequence length needs to be set, and sequences exceeding it are trimmed. Regardless of the trim position, the position embedding corresponding to the first position is always assigned to the first residue of the trimmed sequence rather than of the full sequence. For this reason, the position embeddings represent a relative position rather than the true position within the full protein sequence.

Pre-training of MutFormer on human protein sequences

MutFormer was pre-trained on a database obtained by combining human reference protein sequences (all isoforms) and protein sequences caused by non-synonymous SNVs with more than 1% population frequency in the gnomAD database.5 For a full description of the pre-training data preparation, see Supplemental Methods 1.1.

The original BERT model uses a self-supervised pre-training objective of recovering an original sequence from corrupted (masked) input, from which high-dimensional representations of the sequence are learned. Similar to BERT, the pre-training objective of MutFormer was to predict corrupted amino acid residues from altered sequences. For the corrupted/masked residue prediction task, a number of residues were randomly selected for corruption in each sequence. The selected residues were corrupted either by (1) replacing them with a [MASK] token or by (2) replacing them with another random amino acid. This was done to encourage the learning of context not only around explicitly masked residues but over the entire sequence. To ensure that enough context was present for the model, a maximum of 20 amino acids was masked per sequence. This cap was chosen through testing: training with a fixed masking percentage of 15% resulted in non-convergence during pre-training, so the number of residues masked was determined by the smaller of the 20-residue cap and the 15% fraction. To facilitate the learning of more dependency knowledge and minimize overfitting, we used dynamic masking throughout training: the corrupted residues were changed randomly for each epoch of data trained on. Note that, for MutFormer, the “next sentence prediction” objective used by the original BERT was removed, because protein sequences in aggregate, unlike their natural language counterpart, do not form ‘paragraphs’ with logical connections between sentences. The pre-training was performed on a single cloud machine instance with one tensor processing unit (TPU) hardware accelerator (TPU v2-8) on the Google Cloud Platform. Depending on the model, pre-training took approximately 100–200 h.
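
The masking scheme described above can be sketched as follows; the 80/20 split between [MASK] tokens and random amino acid substitutions is an assumption borrowed from the original BERT recipe rather than a value stated in the text.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "[MASK]"

def corrupt(tokens, max_masked=20, frac=0.15, p_mask=0.8):
    """Dynamically corrupt a tokenized sequence for masked-residue pre-training.

    The number of corrupted residues is the smaller of `max_masked` and 15% of the
    sequence; each corrupted residue becomes [MASK] or a random amino acid. Positions
    are re-drawn on every call, i.e., for every epoch (dynamic masking)."""
    candidates = [i for i, t in enumerate(tokens) if t in AMINO_ACIDS]
    n_corrupt = min(max_masked, int(len(candidates) * frac))
    corrupted, targets = list(tokens), {}
    for i in random.sample(candidates, n_corrupt):
        targets[i] = tokens[i]                          # label = original residue
        if random.random() < p_mask:
            corrupted[i] = MASK                         # explicit mask token
        else:
            corrupted[i] = random.choice(AMINO_ACIDS)   # random substitution
    return corrupted, targets
```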

We pre-trained MutFormer with three different model sizes (Table 1), as well as a MutFormer model with integrated convolutions. For comparison purposes, we also pre-trained two models without convolutions, which were designated MutBERT to indicate the use of the original BERT architecture (Table 1). The hyperparameters used for pre-training as well as training time, for all models, are displayed in Table S1, and the loss and accuracy for the pre-training task are listed in Table S2.

Table 1.

Model sizes of the pre-trained models (subscripts in the model names denote the number of self-attention layers)

Model name Hidden layers Hidden size No. of parameters
MutBERT8L 8 768 58M
MutBERT10L 10 770 72M
MutFormer8L 8 768 62M
MutFormer10L 10 770 76M
MutFormer12L 12 768 86M
MutFormer8L(with integrated convs) 8 768 64M

MutFormer12L has the same size as BERTBase. Hyperparameters constant for all models: intermediate size = 3072, maximum input sequence length: 1024.

Fine-tuning MutFormer for the prediction of deleterious mutations

MutFormer was fine-tuned on a dataset built from 84K manually annotated pathogenic missense SNVs from the HGMD (version 2016)2 and SNPs from the gnomAD database5 with an allele frequency of more than 0.1%. We generated a training set and an independent validation set from these data. Mutated protein sequences were generated using ANNOVAR,35 with each sequence containing exactly one mutation. The fine-tuning was performed on a single cloud machine instance with one TPU hardware accelerator (TPU v2-8) on the Google Cloud Platform. While the specific training time varies from model to model, training proceeded at approximately 200 steps per minute, making the general time range for fine-tuning approximately 1 h. To obtain the best possible results in deleteriousness prediction, we tested three different fine-tuning methods (described below).

Per residue classification

The single mutated protein sequence is used as input. The model is tasked with classifying each amino acid in the protein sequence as benign or deleterious. Amino acids that are identical to the reference sequence are labeled as benign, and the mutated residue is labeled as benign if the overall mutation is benign, or deleterious if the overall mutation is deleterious, depending on the true classification of the sequence (loss and metrics for classification are calculated on only the mutation site). This corresponds to the token classification task or named entity recognition task in NLP (Figure 2A).

Figure 2.


Different fine-tuning methods tested in this study

(A) Per residue classification. The input is a protein sequence that contains exactly one variant. Each residue (amino acid) is given a label of benign/deleterious. Benign variants and residues that are identical to the reference sequence are labeled as benign. The fine-tuning task is to predict the label of each amino acid. This is similar to token classification problems (e.g., named entity recognition) in NLP.

(B) Single sequence classification. The input is a protein sequence that contains exactly one variant with unknown significance. The embedding of the [CLS] token in the last layer is used to predict whether the sequence contains a deleterious variant. This is similar to sentence classification problems (e.g., sentiment analysis) in NLP.

(C) Sequence pair classification. The input is a pair of two sequences: a reference protein sequence and a mutated protein sequence (with a benign or deleterious variant in the center). The embedding of the [CLS] token in the last layer is used to predict whether the mutated sequence contains a deleterious variant. This is similar to sentence pair classification problems (e.g., sentence similarity) in NLP.

Single sequence classification

As in per residue classification, the input is a single mutated protein sequence. The model is tasked with classifying the entire sequence as deleterious or benign (via the [CLS] token). This is similar to the sentence classification task (e.g., sentiment analysis) in NLP (Figure 2B).

Paired sequence classification

The input is a pair of sequences: the mutated protein sequence and its corresponding reference sequence. The model classifies the aggregate of the two sequences as deleterious or benign through a comparison of the two. This approach was inspired by the sentence similarity problem (e.g., the MRPC36 task) in NLP (Figure 2C).
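
For the single-sequence and paired-sequence methods, the classification head can be sketched as below: the last-layer embedding of the [CLS] token feeds a fully connected layer that outputs a benign/deleterious score (a minimal PyTorch illustration; layer sizes are assumptions).

```python
import torch.nn as nn

class DeleteriousnessHead(nn.Module):
    """Sequence-level classification head: the last-layer [CLS] embedding is passed
    through a fully connected layer to produce a benign/deleterious score."""

    def __init__(self, hidden=768, n_classes=2):
        super().__init__()
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, sequence_output):                 # (batch, seq_len, hidden)
        cls_embedding = sequence_output[:, 0, :]        # [CLS] is always the first token
        return self.classifier(cls_embedding)           # logits over benign/deleterious
```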

Exploration of optimal fine-tuning methods

To find the best fine-tuning method, model, and hyperparameters, we performed two different internal comparison tests using our independent validation set. Test 1 compared the MutFormer architecture with the classical BERT architecture, as well as the three different fine-tuning methods, using different hyperparameters. Test 2 compared the integrated convolution implementation against the classic MutFormer architecture. Before both tests, some initial testing was done to establish a set of hyperparameter values that worked well with all combinations of methods/models included in the test. For test 1, the initial set of hyperparameters was established based on the three fine-tuning methods (per residue, single sequence, paired sequence) and three different models (MutBERT8L, MutBERT10L, and MutFormer8L). For test 2, the initial set of hyperparameters was established based on the four models (MutFormer8L, MutFormer10L, MutFormer12L, and MutFormer8L (with integrated convolutions)) being tested. A full list of hyperparameters used is detailed in Table 3; results of test 1 and test 2 are displayed in Figure 3 and Figure S1, respectively.

Table 3.

MutFormer Fine-tuning hyperparameter specifications

Test/model Model architecture Fine-tuning method Initial/end learning rate Training steps
Internal comparison test 1 (MutFormer8L, MutBERT8L, MutBERT10L) (per residue, single sequence, paired sequence) 1e-5/(2e-7 to 1e-6) (4k [fine-tuning methods 1 and 2], 10k [fine-tuning method 3])
Internal comparison test 2 (MutFormer8L, MutFormer10L, MutFormer12L, MutFormer8L(with integrated convs)) paired sequence 1e-5/1.4e-6 12k
MutFormer comparison with others MutFormer8L(with integrated convs) paired sequence 1e-5/3e-9 14k (evaluated on 6k, 8k, 11k, and 12k)
MutFormer final model (with external predictions) MutFormer8L(with integrated convs) paired sequence 1e-5/3e-9 12k
MutFormer final model (no external predictions) MutFormer8L(with integrated convs) paired sequence 1e-5/3e-9 8k
Test/model Max input sequence length Batch size Weight decay Freezing layers External predictions
Internal comparison test 1 (64, 128, 256, 512) 16 0.01 0 no
Internal comparison test 2 (256, 512) (16, 32, 64) 0.01 0 no
MutFormer comparison with others 512 (16, 32, 64) 0.01 (0, 5, 6, 8) yes
MutFormer final model (with external predictions) 512 32 0.01 0 yes
MutFormer final model (no external predictions) 512 32 0 0 no

Additional hyperparameters constant for all runs: gradient clipping was not used during fine-tuning (it was used during pre-training).

Figure 3.


Exploration of optimal fine-tuning methods

Performance comparison of different fine-tuning methods and MutFormer architecture/MutBERT architecture (internal comparison test 1).

(A–C) ROC curves of performances on our independent validation set (8,427 data points total) for two different model architectures (MutFormer (without integrated convolutions) and MutBERT) and three fine-tuning methods (per residue, single sequence, paired sequence). Labels are in the following format: “[model short name]_[max input sequence length]: [AUC score]”. Note that in (A–C), “MB” indicates MutBERT architecture and “MF” indicates MutFormer architecture.

(D) Performance comparison of three different fine-tuning methods, using AUC scores shown in (A–C). Whiskers indicate minimum and maximum values.

(E) Performance comparison of three pre-trained models: MutBERT8L, MutBERT10L, and MutFormer8L.

(F) Performance of different max input sequence lengths. (E, F) The results are mean ± SEM.

Figure 4.


ROC curves for performance comparison with existing methods

ROC curves and performance metrics of MutFormer and different existing methods of deleteriousness prediction evaluated on five different databases. The model that represents MutFormer here is MutFormer8L (with integrated convolutions), fine-tuned with a batch size of 32 and 0 freezing layers. Note that in this figure, “MutFormer” represents MutFormer’s performance with the incorporation of external predictions, while “MutFormer (no ext)” represents MutFormer’s performance without the use of external predictions. Labels are formatted in the following way: “[Method]: [Performance Metric].”

(A) Meta_SVM_LR_set_1 – dataset compiled by a previous paper that originally outlined the MetaSVM and MetaLR methods, containing 56 negative examples and 35 positive examples.

(B) MetaSVM_LR_set_2 – same source as Meta_SVM_LR_set_1, containing 5,866 negative examples and 4,115 positive examples.

(C) Varibench_PPARG – dataset from Varibench for the peroxisome proliferator-activated receptor (gamma) gene, containing 4,671 negative examples and 3,428 positive examples.

(D) Varibench_TP53 – dataset from Varibench for the TP53 gene, which codes for the tumor suppressor P53 protein, containing 3,444 negative examples and 4,505 positive examples.

(E) MetaSVM_LR_set_3 – same source as Meta_SVM_LR_set_1 and set_2, containing 2,422 negative examples. Because only negative examples are present, ROC is invalid in this case; instead, specificity is used for comparison.

Figure 5.


PRG curves for performance comparison with existing methods

Precision-recall-gain (PRG) curves/AUC scores of MutFormer and different existing methods of deleteriousness prediction for four of our five testing datasets (MetaSVM_LR_set_3 is omitted because PRG is invalid for data with only one class of true labels). The model that represents MutFormer here is MutFormer8L (with integrated convolutions), fine-tuned with a batch size of 32 and 0 freezing layers. Note that “MutFormer” represents MutFormer with the incorporation of external predictions, while “MutFormer (no ext)” represents MutFormer without the use of external predictions. Labels are formatted in the following way: [Method]: [AUC score].

(A) Meta_SVM_LR_set_1.

(B) MetaSVM_LR_set_2.

(C) Varibench_PPARG.

(D) Varibench_TP53.

Incorporation of external predictions

To achieve the best possible performance in deleteriousness prediction, when training our final models (used in our final testing; Figure 4), MutFormer also incorporated prediction values from previously published methods in addition to protein sequence analysis. Computational predictions from previously published methods were given to MutFormer as input in the following way. First, using ANNOVAR, predicted scores for all mutations within a newly generated test set were obtained from the dbNSFPv3 database.11 These scores were standardized between 1 and 2, and all missing predictions were assigned values of 0. A fully connected dense layer was connected to these inputs, and the output of this dense layer was concatenated with the original model output. Another dense layer after this concatenated result was then connected to the output node to produce the final prediction (Figure 1D). This incorporation strategy prevents the model from becoming reliant on external predictions, limiting the weighting of sequence analysis vs. external predictions to about 1:1 in MutFormer’s prediction (since the concatenated outputs of the two sources of information are of the same length). To confirm this ratio, upon completing fine-tuning of the various models, we analyzed the weights of both the combining dense layer and the final output layer. When calculating the weighting sum over both layers, we found, for all models, an approximate weighting of 0.5 for MutFormer’s sequence analysis and 0.5 for external predictions. In particular, the fine-tuned model that represents MutFormer in Figures 4, 5, S3, and S6 has a weighting of 0.485 for sequence analysis and 0.515 for external predictions.
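
A minimal PyTorch sketch of this fusion step is shown below. The number of external scores (`n_external`) and the use of a ReLU activation are assumptions, but the structure follows Figure 1D: a dense layer maps the standardized external scores to a vector of the same length as MutFormer’s output embedding, the two are concatenated, and a final dense layer produces the prediction.

```python
import torch
import torch.nn as nn

class ExternalScoreFusion(nn.Module):
    """Fuse MutFormer's [CLS] embedding with standardized external prediction scores
    (scores scaled to [1, 2], missing scores set to 0), following Figure 1D."""

    def __init__(self, hidden=768, n_external=20, n_classes=2):
        super().__init__()
        # The dense layer output has the same length as the [CLS] embedding, so the
        # concatenation weights sequence analysis and external scores roughly 1:1.
        self.external_dense = nn.Linear(n_external, hidden)
        self.output = nn.Linear(2 * hidden, n_classes)

    def forward(self, cls_embedding, external_scores):
        ext = torch.relu(self.external_dense(external_scores))
        fused = torch.cat([cls_embedding, ext], dim=-1)  # concatenate the two sources
        return self.output(fused)                        # final deleteriousness prediction
```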

Testing MutFormer against existing methods of deleteriousness prediction

To assess the performance of MutFormer against existing methods of deleteriousness prediction, a total of five testing datasets were used. Out of these five datasets, three are non-gene-specific and non-disease-specific datasets, and two are gene-specific mutation datasets (details for each testing dataset used are outlined in Table 2). For each dataset, filtering was performed using the reference sequences to ensure that no bias was present: all mutations that shared reference sequences with any mutation present in the pre-training data were deleted from the testing sets, and all mutants with identical reference sequences present in any of the independent test sets were removed from the fine-tuning training data before model training.

Table 2.

Details for each testing dataset

Testing dataset Testing dataset composition Compilation year
Set 1: Meta_SVM_LR_set_1 Dataset compiled by a previous method, Meta_SVM/Meta_LR. Used to assess Meta_SVM and Meta_LR’s performance against other methods.
Composition after filtering:
  • 56 pathogenic examples compiled from recent Nature Genetics publications at the time.

  • 35 benign examples from the CHARGE (Cohorts for Heart and Aging Research in Genetic Epidemiology) database, which focuses on identifying genes underlying heart, lung, and blood diseases.

2015
Set 2: Meta_SVM_LR_set_2 Dataset from the same source as set 1 (Meta_SVM_LR_set_1).
Composition after filtering:
  • 4,135 pathogenic examples from Varibench testing dataset II for missense mutations37 (Varibench is a dataset designed specifically for the testing of prediction methods for pathogenicity).

  • 5,884 benign examples also from Varibench testing dataset II.

2015
Set 3: Meta_SVM_LR_set_3 Dataset from the same source as set 1 (Meta_SVM_LR_set_1) and set 2 (Meta_SVM_LR_set_2).
Composition after filtering:
  • 2,422 benign examples from the CHARGE database.

2015
Set 4: Varibench_PPARG Dataset from Varibench.37 Compiled by a study that specifically aimed to create datasets for assessing computational models’ performance in pathogenicity prediction of missense mutations.38 Focused on the PPARG gene which codes for the gamma member of the PPAR (Peroxisome Proliferator-activated Receptor) family of nuclear receptors, which can be linked to the pathology of diseases including diabetes, atherosclerosis, and cancer.
Composition after filtering:
  • 145 pathogenic variants from the experimentally validated Missense InTerpretation by Experimental Response (MITER) database.

  • 2,207 benign variants from the same source.

2018
Set 5: Varibench_TP53 Dataset from the same source as set 4 (Varibench_PPARG). Focused on the TP53 gene, which codes for tumor protein p53, a tumor suppressor.
Composition after filtering:
  • 608 pathogenic examples from the IARC database (database specific for TP53), labeled for significantly changing the gene expression level of the TP53 gene.

  • 531 benign examples also from IARC which did not change expression level significantly.

2018

To allow for a more comprehensive evaluation of the performance of MutFormer with different levels of “fit” on a wide range of data (models with a higher fit will perform better on more similar data, but worse on more dissimilar data; models with a lower fit will have the opposite tendencies), different MutFormer models with varying hyperparameters that affected a model’s level of fit were trained (an analysis of MutFormer’s performance with varying levels of fit is analogous to an analysis of a receiver operating characteristic (ROC)-type curve, where the performances of MutFormer on similar vs. dissimilar data are compared for different fit levels). In this test, the number of freezing layers and batch size were varied, while all other hyperparameters were set to the best ones found during our hyperparameter test 2 (see above). From these results, testing sets 3–5 showed more variation with different fit parameters than sets 1–2 did (results from all test runs are summarized in Figure S2). All models were then tested on all testing datasets, and the overall best-performing model across all testing datasets was used to represent MutFormer in our comparison. Batch sizes of 16, 32, and 64 were tested in conjunction with freezing layer numbers of 0, 5, 6, and 8 (full hyperparameter description in Table 3). Freezing layer numbers are defined as the number of transformer body layers that were frozen, starting from the first layer (for MutFormer8L with integrated convolutions, our current best-performing model, eight layers is the total number of transformer body layers). For any freezing layer number greater than 0, the embedding layers were frozen as well (through testing on our validation set, we found that leaving the embedding layers trainable while freezing the transformer layers significantly decreased performance). Each model was trained for 14k steps, and checkpoint steps 6k, 8k, 11k, 12k, and 14k were evaluated.
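
The freezing scheme can be sketched as follows; the attribute names (`embeddings`, `convs`, `transformer_layers`) are hypothetical, and the rule follows the description above: freeze the first k transformer layers and also freeze the embedding and convolution layers whenever k > 0.

```python
def freeze_for_finetuning(model, n_frozen_layers):
    """Freeze the first `n_frozen_layers` transformer layers; whenever any transformer
    layer is frozen, freeze the embedding (and convolution) layers as well.
    Attribute names (embeddings, convs, transformer_layers) are hypothetical."""
    if n_frozen_layers > 0:
        for module in (model.embeddings, model.convs):
            for param in module.parameters():
                param.requires_grad = False              # embeddings frozen alongside layers
    for layer in model.transformer_layers[:n_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False                  # freeze from the first layer upward
```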

For each dataset, MutFormer is shown twice: once without the incorporation of other scores and only relying on sequence data alone (labeled as “MutFormer (no ext)” in Figures 4, 5, S3, and S6), and another with the use of external predictions as part of its input as described in the Incorporation of external predictions section (labeled as “MutFormer” in Figures 4, 5, S3, and S6). For MutFormer without the incorporation of other scores, like MutFormer with the incorporation of external scores, we tested various models with varying levels of fit, with an initial set of hyperparameters found through testing based on our independent validation set. The full set of hyperparameters for both of these MutFormer models is displayed in Table 3.

When fine-tuning our final models (used in our comparison of MutFormer vs. existing methods), to increase MutFormer’s overall generalization ability and limit overfitting, data augmentation was applied to the fine-tuning training data: in every epoch, each data point had a 50% chance of being altered. Sequences selected for alteration were trimmed to between 50% and 100% of their original length. Trimming was done around the mutation site to ensure that, on average, the mutation site stayed in the same location (the middle) of the sequence before and after trimming. Epochs were also shuffled independently of each other.
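
A sketch of this augmentation step is shown below (plain Python; the exact trimming rule around the mutation site is an assumption consistent with the description above).

```python
import random

def augment(tokens, mut_index, p_alter=0.5):
    """With 50% probability, trim a training sequence to 50%-100% of its original
    length, keeping the mutation site roughly centered; returns the (possibly)
    trimmed tokens and the new mutation index."""
    if random.random() >= p_alter:
        return tokens, mut_index                         # leave the sequence unchanged
    new_len = random.randint(len(tokens) // 2, len(tokens))
    start = max(0, min(mut_index - new_len // 2, len(tokens) - new_len))
    return tokens[start:start + new_len], mut_index - start
```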

Results

Effect of MutFormer’s use of convolutions on the pre-training task

During pre-training, we trained models of different sizes for both the MutFormer architecture and MutBERT (MutFormer model without convolutions) architecture (Table 1). The loss and accuracy on the pre-training task are shown in Table S2. According to our results, the accuracy of MutFormer8L on the pre-training task test set was 54.5% higher than MutBERT8L, indicating the advantage of the convolutions. In addition, the accuracy of MutFormer8L was 27.7% higher than that of MutBERT10L, despite the latter having two more transformer layers and 10M more parameters (the subscript of each model name indicates only the number of transformer layers but does not consider the two convolutional layers). This outperformance despite its smaller size verifies that the improved performance of the model with convolutions was not simply due to the additional number of parameters or additional layers.

Performance of different fine-tuning methods and hyperparameters

As a part of our internal comparison test 1, we fine-tuned MutFormer and MutBERT using three methods: per residue classification, single sequence classification, and paired sequence classification (Figure 2; see Materials and methods for details). The ROC curves and corresponding area under the curve (AUC) for deleteriousness prediction are shown in Figures 3A–3C. Figure 3D shows a summary of the performance comparison of the three methods; paired sequence classification performed best, followed by per residue classification, and the optimal results were achieved by using a maximal input sequence length of 512 (use of the paired sequence method means an aggregate sequence length of 1,024) (Figure 3F). Upon examination of the performance of different model architectures, as shown in Figure 3E, MutFormer8L outperformed MutBERT8L and MutBERT10L for each fine-tuning method, indicating the advantage of the MutFormer architecture.

MutFormer’s use of integrated convolutions

In our internal comparison test 2, two different strategies for implementing the convolutions were tested: classic MutFormer, and MutFormer (with integrated convolutions). Justification for the second implementation strategy is as follows: while the convolution mechanism should in theory be able to create a representation that will enable the model to best interpret the protein sequence, some information that is present in the original raw embedded sequence may be lost in practice through the convolutions. To solve this, the integrated convolutions model, instead of feeding the embeddings through the convolutions linearly, uses skip connections that result in the convolutions acting as an integrated part of the original embedding layers, allowing the transformer model to access both the convolution filtered representation of the sequence as well as the original embedded representation. In our internal comparison test 2, we compared the performance of MutFormer (with integrated convolutions) in paired sequence classification of our independent validation set to that of the other three original MutFormer architecture models (Figure S1). ROC curves are shown in Figure S1A, and a summary comparison histogram of the four different models tested is shown in Figure S1B. Overall, the margins of difference are small but, based on the results, the performance of the MutFormer model with integrated convolutions is higher than that of the original MutFormer model; even with only eight transformer layers, the integrated convolutions model outperformed MutFormer12L, which had more layers and generally a better prediction ability than MutFormer8L.

Comparison with existing variant prediction methods

As paired sequence classification performed best, this fine-tuning method was used for the comparison of MutFormer vs. other methods. MutFormer’s best overall performance was achieved by training our best model, MutFormer8L (with integrated convolutions), with a batch size of 32, 0 freezing layers, and access to external predictions (for full hyperparameter descriptions, see Table 3). MutFormer’s performance was compared against a variety of existing methods of deleteriousness prediction, including sequence alignment/homology-based scores (SIFT,13 PolyPhen-2,14 LRT,39 MutationTaster,40,41 MutationAssessor,42 FATHMM,43 PROVEAN,44 phastCons,45 and SiPhy46), ensemble-based scores (CADD,16,19 MetaSVM,17 MetaLR,17 VEST3,47 and DANN48), conservation-based scores (GERP++,15 PhyloP,49 and fitCons50), as well as some other deep learning-based approaches (FATHMM-MKL,51 VARITY,52 and MVP22), of which MVP and VARITY are recently developed methods. In our comparison, existing methods’ predictions were processed in the following way (note that this differs from the strategy for incorporating external scores into MutFormer as input): each method’s predictions were standardized from 0 to 1, based on the prediction values of all possible missense mutations present in the dbNSFPv3 database.11 Missing predictions were automatically assigned a prediction value of 0. Both non-inverted and inverted prediction identities (1 = deleterious, 0 = benign and 0 = deleterious, 1 = benign) were tested on our fine-tuning training data, and scores were inverted accordingly for each algorithm being compared. We generated both an ROC curve and, because of the unbalanced nature of some of our datasets, a precision-recall-gain curve for each dataset. For dataset 3, which only included rare benign examples, a threshold for each existing method, chosen by taking the point closest to the upper left corner of an ROC curve based on MutFormer’s fine-tuning training data, was used to calculate the specificity of each method. Upon analyzing the performances on the different testing datasets, we found that the best overall performing MutFormer model outperforms previous methods of deleteriousness prediction on the non-gene-specific and non-disease-specific datasets (more similar to MutFormer’s fine-tuning training dataset: sets 1–3). On the two gene-specific databases (sets 4–5), which contain data less similar to MutFormer’s fine-tuning data, MutFormer’s performance in comparison with other methods expectedly drops, while still matching the performance of various existing methods. For MutFormer without external predictions, its performance is first among non-MutFormer methods for sets 2 and 3, among the top for set 1, and drops further than MutFormer with incorporated external predictions for sets 4 and 5. ROC curves for all datasets are displayed in Figure 4, and precision-recall-gain curves of the same testing results for datasets 1, 2, 4, and 5 are displayed in Figure 5. A bar plot of all ROC and PRG AUC values for datasets 1, 2, 4, and 5 is also presented in Figure S6. Only three methods ever have PRG AUC values below 0 (these values were clipped to 0 for the best display quality in the bar plot). All three of these instances occurred for dataset 4: fitCons had a PRG AUC of −0.312, phyloP7way_vertebrate −0.633, and phyloP20way_mammalian −0.293. Because of the large number of existing methods compared, we also used the Delong test of ROC to statistically assess the pairwise probability that the given methods’ ROC curves were significantly different. Delong test results for datasets 1, 2, 4, and 5 are displayed in Figure S3.
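
The score standardization and threshold selection described above can be sketched as follows (a minimal Python/scikit-learn illustration of the evaluation procedure, not the authors’ code).

```python
import numpy as np
from sklearn.metrics import roc_curve

def standardize(scores, all_db_scores):
    """Scale one method's scores to [0, 1] using the range of all possible missense
    predictions in the database; missing predictions (NaN) become 0."""
    lo, hi = np.nanmin(all_db_scores), np.nanmax(all_db_scores)
    return np.nan_to_num((scores - lo) / (hi - lo), nan=0.0)

def upper_left_threshold(y_true, y_score):
    """Choose the threshold whose ROC point is closest to the upper-left corner
    (FPR = 0, TPR = 1); it is then used to compute specificity on benign-only data."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = np.argmin(fpr ** 2 + (1.0 - tpr) ** 2)
    return thresholds[best]
```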

Auxiliary tests

In addition to the results of the above tests, to further assess MutFormer’s strengths and weaknesses in various other areas, we performed five additional studies, each detailed in the Supplemental Methods: a bias analysis of MutFormer’s fine-tuning data, an ablation study on MutFormer’s use of external predictions, an assessment of MutFormer’s correlation with existing evolutionary methods, an evaluation of MutFormer’s performance on the ProteinGym dataset, and an analysis of MutFormer’s weights for biological significance.

Precomputed deleteriousness scores for all missense mutations

To facilitate future use by other studies, we precomputed deleteriousness scores for all missense mutations using the best-performing MutFormer model. The inference was done on a cloud TPU device (v2-8), which took approximately 11.5 h. These scores can be directly used in the ANNOVAR software to annotate missense variants identified from genome or exome sequencing, and are organized in a flat-file format, allowing for easy use in other functional annotation software tools to complement the dbNSFP database, which has a variety of other prediction scores for missense mutations in the human genome.

Discussion

In the current study, we present MutFormer, a transformer-based machine-learning model to predict the deleteriousness of non-synonymous SNVs using protein sequence as the primary feature. We pre-trained MutFormer on reference protein sequences and alternative protein sequences resulting from common genetic variants in the human genome and tested different fine-tuning methods for deleteriousness prediction. During our evaluation, MutFormer outperformed multiple commonly used methods and had performance comparable with other methods even when tested on datasets that were less similar to MutFormer’s training data (gene-specific data, sets 4 and 5). Below we discuss several advantages and limitations of the MutFormer method and its computational package.

Although a large number of computational tools have been developed over the years for predicting the deleteriousness of non-synonymous mutations, to the best of our knowledge, MutFormer is among the first batch of tools that use transformer models to recast the biological sequence analysis problem as a language analysis problem. A similar model of note is ProtBERT,53 a previous application of the BERT architecture to protein contexts. Despite both using the transformer architecture, MutFormer differs from ProtBERT in several key aspects: (1) MutFormer makes use of convolutions to learn its own vocabulary, while ProtBERT uses a fixed vocabulary; (2) MutFormer was pre-trained on human protein sequences and common variants, while ProtBERT was trained on reference sequences of all species with sequence information; and (3) MutFormer’s primary goal was deleteriousness prediction, while ProtBERT focused on subcellular localization of proteins and secondary structure prediction. ConvBERT54 is another method that, like MutFormer, uses a combination of CNNs and the transformer architecture. However, MutFormer and ConvBERT differ both in their architectural incorporation of convolutions and in their motivation for using convolutional layers. The primary motivation for the use of convolutions in MutFormer is that tokenizing a protein sequence into individual residues does not necessarily break the sequence into its natural functional units (the true “words” of proteins are more accurately amino acid motifs within the protein’s 3D structure). For MutFormer, the convolutions are placed as part of the embedding layers to create more useful and efficient representations of a given sequence. In contrast, regular NLP tasks, such as those that ConvBERT addresses, do not have this problem (words in natural language already serve, more or less, as meaningful units). Instead, ConvBERT’s architecture uses convolutions within the attention module, with the motivation of using convolutions as a way to efficiently capture short-range dependencies.

In addition, the training process of MutFormer is simple and straightforward. MutFormer uses a self-supervised pre-training strategy and therefore does not require any labeled data. For this reason, a large model with hundreds of millions of parameters can be trained on a large amount of non-curated data. On this note, while the current study focused on the human genome exclusively, it is conceivable to include other well-annotated genomes from other species in future studies to see whether increased complexity in the sequence space during pre-training can further improve performance. In the fine-tuning stage, MutFormer learns the deleteriousness of mutations based on the labeled training data as well as its understanding of protein sequence already learned in the pre-training stage, allowing a small amount of fine-tuning data to be effectively used to achieve an accurate result. Furthermore, transformers consider attention, which is not only useful for understanding context in language processing problems, but could also give important insights into the deleteriousness effects that amino acids can have under different sequence contexts. Even though direct feature attribution is not possible with a model containing millions of intertwined model parameters such as MutFormer, our analyses of MutFormer’s model weights show that MutFormer is able to prioritize and model these relationships through the use of convolutions and attention.

There are also several limitations of the current study. First, the training data and testing datasets are still of limited size, and testing on large-scale experimentally or clinically supported datasets would allow a more effective evaluation of usability. In the future, MutFormer can be evaluated on large-scale genome sequencing data followed by manual review, to determine whether it helps to prioritize deleterious variants in clinical sequencing settings. Second, because of computational limitations, we did not fully test all parameters during training. As a result, it is likely that our results are not completely optimized; larger models using longer maximum sequence lengths would likely be able to outperform the current MutFormer models (e.g., a 12-layer MutFormer with integrated convolutions should perform noticeably better than the best current MutFormer model with only eight attention layers). Third, in the deleteriousness prediction of missense mutations, it is likely impossible for a given model to obtain all required evidence from sequence data alone, so incorporation of other features, such as three-dimensional (3D) structure (e.g., analyzing 3D structure to scale attention with 3D distance, or labeling sections as belonging to beta sheets or alpha helices for better prediction of deleteriousness), methylation, clinical phenotypic information (e.g., using this knowledge to prioritize certain genes), and other features that could significantly affect proteins’ behavior, could reasonably improve overall understanding and thus performance.

In summary, MutFormer is a novel transformer-based method to predict the functional effects of missense mutations. We hope that MutFormer can bring new insights to the bioinformatics community as a language model capable of improving our understanding of the language of proteins. Given that MutFormer uses information complementary to that used by other bioinformatics tools developed for deleteriousness prediction, we also envision that they could be combined to reach consensus predictions, which may be useful for implementation into current clinical guidelines.

Data and code availability

The source code to run MutFormer, all six pre-trained models, a reproducible workflow, and the best-performing fine-tuned models are available at the GitHub repository: https://github.com/WGLab/MutFormer.

Acknowledgments

We acknowledge the TPU Research Cloud (TRC) program by Google, which provided us with TPUs for the majority of the project. The study is in part supported by NIH grant GM132713 (K.W.), the CHOP Research Institute and the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (No. 23ptpy119, to L.F).

Author contributions

L.F. and K.W. conceived and guided the study. T.J. performed the study, developed the software tools, and wrote the manuscript. All authors have given final approval for the manuscript to be published and have agreed to be responsible for all aspects of the manuscript.

Declaration of interests

The authors declare no competing interests.

Published Online: July 27, 2023

Footnotes

Contributor Information

Li Fang, Email: fangli9@mail.sysu.edu.cn.

Kai Wang, Email: wangk@chop.edu.

Lead contact website

Kai Wang: https://wglab.org.

Supplemental information

Document S1. Supplemental methods, Figures S1–S7, and Tables S1–S3 and S5
mmc1.pdf (11.7MB, pdf)
Table S4. The ROC (receiver operating characteristic) and PRG (precision-recall-gain) scores of various methods on the testing datasets

This table is provided as a separate CSV file due to the presence of many columns.

mmc2.csv (3.1KB, csv)
Document S2. Article plus supplemental information
mmc3.pdf (17.3MB, pdf)

References

1. 1000 Genomes Project Consortium. Auton A., Brooks L.D., Durbin R.M., et al. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
2. Stenson P.D., Mort M., Ball E.V., et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum. Genet. 2020;139:1197–1207. doi: 10.1007/s00439-020-02199-3.
3. Lek M., Karczewski K.J., Minikel E.V., et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
4. Karczewski K.J., Weisburd B., Thomas B., et al. The ExAC browser: displaying reference data information from over 60 000 exomes. Nucleic Acids Res. 2017;45:D840–D845. doi: 10.1093/nar/gkw971.
5. Karczewski K.J., Francioli L.C., Tiao G., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
6. Landrum M.J., Lee J.M., Riley G.R., et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42:D980–D985. doi: 10.1093/nar/gkt1113.
7. Landrum M.J., Lee J.M., Benson M., et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46:D1062–D1067. doi: 10.1093/nar/gkx1153.
8. Landrum M.J., Chitipiralla S., Brown G.R., et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 2020;48:D835–D844. doi: 10.1093/nar/gkz972.
9. Liu X., Li C., Mou C., et al. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med. 2020;12:103. doi: 10.1186/s13073-020-00803-9.
10. Liu X., Jian X., Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum. Mutat. 2011;32:894–899. doi: 10.1002/humu.21517.
11. Liu X., Wu C., Li C., et al. dbNSFP v3.0: a one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum. Mutat. 2016;37:235–241. doi: 10.1002/humu.22932.
12. Thusberg J., Olatubosun A., Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum. Mutat. 2011;32:358–368. doi: 10.1002/humu.21445.
13. Ng P.C., Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
14. Adzhubei I.A., Schmidt S., Peshkin L., et al. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
15. Davydov E.V., Goode D.L., Sirota M., et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 2010;6:e1001025. doi: 10.1371/journal.pcbi.1001025.
16. Kircher M., Witten D.M., Jain P., et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
17. Dong C., Wei P., Jian X., et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 2015;24:2125–2137. doi: 10.1093/hmg/ddu733.
18. Ioannidis N.M., Rothstein J.H., Pejaver V., et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 2016;99:877–885. doi: 10.1016/j.ajhg.2016.08.016.
19. Rentzsch P., Witten D., Cooper G.M., et al. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019;47:D886–D894. doi: 10.1093/nar/gky1016.
20. Richards S., Aziz N., Bale S., et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 2015;17:405–424. doi: 10.1038/gim.2015.30.
21. Zhou J., Troyanskaya O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods. 2015;12:931–934. doi: 10.1038/nmeth.3547.
22. Qi H., Zhang H., Zhao Y., et al. MVP predicts the pathogenicity of missense variants by deep learning. Nat. Commun. 2021;12:510. doi: 10.1038/s41467-020-20847-0.
23. Dunham A.S., Beltrao P., AlQuraishi M. High-throughput deep learning variant effect prediction with Sequence UNET. Genome Biol. 2023;24:110. doi: 10.1186/s13059-023-02948-3.
24. Devlin J., Chang M.-W., Lee K., et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint at arXiv. 2019. https://arxiv.org/abs/1810.04805.
25. Vaswani A., Shazeer N.M., Parmar N., et al. Attention Is All You Need. Preprint at arXiv. 2017. https://arxiv.org/abs/1706.03762.
26. Liu Y., Ott M., Goyal N., et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint at arXiv. 2019. https://arxiv.org/abs/1907.11692.
27. Lan Z., Chen M., Goodman S., et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. Preprint at arXiv. 2019. https://arxiv.org/abs/1909.11942.
28. Yang Z., Dai Z., Yang Y., et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Preprint at arXiv. 2019. https://arxiv.org/abs/1906.08237.
29. Dosovitskiy A., Beyer L., Kolesnikov A., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Preprint at arXiv. 2020. https://arxiv.org/abs/2010.11929.
30. Jumper J., Evans R., Pritzel A., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
31. Avsec Ž., Agarwal V., Visentin D., et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods. 2021;18:1196–1203. doi: 10.1038/s41592-021-01252-x.
32. Wu Y., Schuster M., Chen Z., et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint at arXiv. 2016. https://arxiv.org/abs/1609.08144.
33. Xue L., Barua A., Constant N., et al. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Preprint at arXiv. 2021. https://arxiv.org/abs/2105.13626.
34. Duman Keles F., Mahesakya Wijewardena P., Hegde C. On The Computational Complexity of Self-Attention. Preprint at arXiv. 2022. https://arxiv.org/abs/2209.04881.
35. Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. doi: 10.1093/nar/gkq603.
36. Dolan W.B., Brockett C. Automatically Constructing a Corpus of Sentential Paraphrases. Proceedings of the Third International Workshop on Paraphrasing (IWP2005). 2005. https://aclanthology.org/I05-5002.
37. Sasidharan Nair P., Vihinen M. VariBench: a benchmark database for variations. Hum. Mutat. 2013;34:42–49. doi: 10.1002/humu.22204.
38. Li J., Zhao T., Zhang Y., et al. Performance evaluation of pathogenicity-computation methods for missense variants. Nucleic Acids Res. 2018;46:7793–7804. doi: 10.1093/nar/gky678.
39. Chun S., Fay J.C. Identification of deleterious mutations within three human genomes. Genome Res. 2009;19:1553–1561. doi: 10.1101/gr.092619.109.
40. Schwarz J.M., Cooper D.N., Schuelke M., et al. MutationTaster2: mutation prediction for the deep-sequencing age. Nat. Methods. 2014;11:361–362. doi: 10.1038/nmeth.2890.
41. Schwarz J.M., Rödelsperger C., Schuelke M., et al. MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods. 2010;7:575–576. doi: 10.1038/nmeth0810-575.
42. Reva B., Antipin Y., Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 2011;39:e118. doi: 10.1093/nar/gkr407.
43. Shihab H.A., Gough J., Cooper D.N., et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 2013;34:57–65. doi: 10.1002/humu.22225.
44. Choi Y., Sims G.E., Murphy S., et al. Predicting the functional effect of amino acid substitutions and indels. PLoS One. 2012;7:e46688. doi: 10.1371/journal.pone.0046688.
45. Siepel A., Bejerano G., Pedersen J.S., et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050. doi: 10.1101/gr.3715005.
46. Garber M., Guttman M., Clamp M., et al. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics. 2009;25:i54–i62. doi: 10.1093/bioinformatics/btp190.
47. Carter H., Douville C., Stenson P.D., et al. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genom. 2013;14(Suppl 3):S3. doi: 10.1186/1471-2164-14-S3-S3.
48. Quang D., Chen Y., Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31:761–763. doi: 10.1093/bioinformatics/btu703.
49. Pollard K.S., Hubisz M.J., Rosenbloom K.R., et al. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–121. doi: 10.1101/gr.097857.109.
50. Gulko B., Hubisz M.J., Gronau I., et al. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 2015;47:276–283. doi: 10.1038/ng.3196.
51. Shihab H.A., Rogers M.F., Gough J., et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics. 2015;31:1536–1543. doi: 10.1093/bioinformatics/btv009.
52. Wu Y., Liu H., Li R., et al. Improved pathogenicity prediction for rare human missense variants. Am. J. Hum. Genet. 2021;108:2389. doi: 10.1016/j.ajhg.2021.11.010.
53. Elnaggar A., Heinzinger M., Dallago C., et al. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Trans. Pattern Anal. Mach. Intell. 2022;44:7112–7127. doi: 10.1109/TPAMI.2021.3095381.
54. Jiang Z.-H., Yu W., Zhou D., et al. ConvBERT: Improving BERT with Span-based Dynamic Convolution. Adv. Neural Inf. Process. Syst. 2020;33:12837–12848.



Data Availability Statement

The source code to run MutFormer, all six pre-trained models, a reproducible workflow, and the best-performing fine-tuned models are available at the GitHub repository: https://github.com/WGLab/MutFormer.

