Abstract
Protein Language Models (pLMs) have revolutionized the computational modeling of protein systems, building numerical embeddings that are centered around structural features. To enhance the breadth of biochemically relevant properties available in protein embeddings, we engineered the Annotation Vocabulary, a transformer-readable language of protein properties defined by structured ontologies. We trained Annotation Transformers (AT) from the ground up to recover masked protein property inputs without reference to amino acid sequences, building a new numerical feature space on protein descriptions alone. We leverage AT representations in various model architectures, for both protein representation and generation. To showcase the merit of Annotation Vocabulary integration, we performed 515 diverse downstream experiments. Using a novel loss function and only $3 in commercial compute, our premier representation model CAMP produces state-of-the-art embeddings for five out of 15 common datasets with competitive performance on the rest, highlighting the computational efficiency of latent space curation with Annotation Vocabulary. To standardize the comparison of de novo generated protein sequences, we suggest a new sequence alignment-based score that is more flexible and biologically relevant than traditional language modeling metrics. Our generative model, GSM, produces high alignment scores from annotation-only prompts with a BERT-like generation scheme. Of particular note, many GSM hallucinations return statistically significant BLAST hits, where enrichment analysis shows properties matching the annotation prompt, even when the ground truth has low sequence identity to the entire training set. Overall, the Annotation Vocabulary toolbox presents a promising pathway to replace traditional tokens with members of ontologies and knowledge graphs, enhancing transformer models in specific domains.
The concise, accurate, and efficient descriptions of proteins by the Annotation Vocabulary offer a novel way to build numerical representations of proteins for protein annotation and design.
Keywords: Protein annotation, Protein design, Contrastive learning, Language Modeling, Annotation Transformer, Contrastive Annotation Model for Proteins, Annotation Sequence Model, Generative Sequence Model
Introduction
The evolutionary optimization of proteins is achieved through incremental, seemingly random changes to a genetic code that are mostly detrimental - implying that the natural protein landscape is hardly exhaustive (1, 2). Even within the space of natural proteins, high-throughput sequencing technologies have far outpaced our ability to characterize genetic constructs (3). Far less than 1% of documented protein sequences have ever been synthesized, let alone annotated (4-6). The immense challenge of exploring biological sequences with extremely sparse data places protein design and annotation as ubiquitous problems in the life sciences. Understanding this vast landscape of proteins is important for studying and treating diseases, as well as elucidating fundamental biology (7). Beyond biological systems, protein design harbors potential in generating sequences capable of valuable tasks, including plastics degradation and recycling, carbon capture and storage, and the generation of novel materials (8-10). While potential applications are numerous and significant, experimental characterization is time-intensive and expensive, heavily limiting the rate of progress as well as training data availability for computational methods. Therefore, there is a vital need for reliable computational methodologies that can translate between sequence and function based on sparse labeled data.
Both protein annotation and design have been a primary focus of the Protein Language Model (pLM) community, where protein sequences are modeled as a semantic language by amino acids, codons, nucleotides, or atoms (11-16). By leveraging large-scale semi-supervised denoising and transfer learning, transformer neural networks have showcased adept numerical representations that correlate to downstream tasks without any labels at all (11, 17). Of interest in biomedical communities, tasks such as Protein-Protein Interaction (PPI) and function prediction were improved with this approach (12, 17, 18). Generating natural-seeming sequences from noise has also been possible with pLMs (19-21). However, a more recent study of pLM pretraining strategies suggests that Masked Language Modeling (MLM) is particularly effective for structure-based modeling, injecting many structurally correlated patterns into the pLM latent space (22). Whereas this gives insight into the success of protein folding models (21, 23-29), we assume that the optimal latent space for annotation should more closely correlate with more abstract concepts like “protein function” and “biological process.” We also surmise that generating proteins for specific properties can be actualized by a closer relationship between a “property” latent space and sequence latent space.
Others have overcome the pretraining pitfalls of MLM over amino acids by applying additional labeled contrastive learning to pretrained models. By identifying similar protein pairs, dissimilar pairs, or building triplet datasets, projects like ProteinVec have greatly increased the functional relevance of downstream pLM fixed-length vector or full-token matrix representations (30-32). This has led to excellent protein representation qualities, which enable protein annotation through supervised learning or vector search (30-32). However, approaches that contrast sequences directly require similarity heuristics that impose human bias by defining which sequences or characteristics are inherently similar. One way around this is to assume sequences and their descriptions should inhabit the same embedding space. We observe this in projects such as ProteinDT, which correlate the pLM latent space directly with researcher-deposited natural language embeddings using contrastive learning (33). While this is a promising avenue for de novo protein design from prompts, we suspect that natural language is not an optimal interface to the protein language.
We postulate that much of the challenge involved in enabling effective protein annotation and design lies within the inadequacies in our descriptions of proteins, which are highly complex molecules operating under multiscale constraints. For example, natural proteins are optimized around countless considerations including cellular economics (expression energy and pathway efficiency), regulatory mechanisms (allosteric sites, feedback loops, post-transcriptional/translational modifications), and protein lifecycles (chaperone folding, complex formation, proteolysis) (34-40). Most of these qualities are rarely or never mentioned in deposited natural language descriptions. While it is possible to use Large Language Models (LLMs) to format ontology-based annotations to natural language, it requires nontrivial compute and runs the risk of hallucination (41). Despite these challenges, approaches like Mol-instructions have made great strides toward descriptive, machine-readable prompts for molecular design and annotation (42). Here, we ask: why not use the annotations themselves as a direct input, through a separate vocabulary of annotations?
To work toward descriptive protein property representations that enable the bidirectional translation of sequence and function, we engineered a new tool called the Annotation Vocabulary. The Annotation Vocabulary is a collection of human-labeled protein-related ontologies that concisely and accurately describe protein properties. By mapping Enzyme Commission numbers (EC), Gene Ontologies (GO), Interpro domains, and Gene3D domains to a set of unique integers, the Annotation Vocabulary could be modeled with transformer neural networks through token embedding. This eliminated the need for similarity heuristics for comparison between sequences by assuming a fundamental relationship between a sequence and its own annotations. Additionally, unlike natural language descriptions posited by researchers, specific properties were described in a consistent way. A transition away from natural language also removed artifacts like filler words, which saved on computation and increased interpretability. Using this vocabulary, we trained various model architectures to leverage protein annotation representations, including:
Annotation Transformer (AT): A transformer network that uses the Annotation Vocabulary to build functionally relevant representations of annotations,
Contrastive Annotation Model for Proteins (CAMP): Leverages AT to curate sequence representations with contrastive learning using a novel loss,
Annotation Sequence Model (ASM): Utilizes a dual vocabulary of sequences and annotations to curate sequence representations with self-attention,
Generative Sequence Model (GSM): Leverages AT to generate sequences from annotation prompts with cross-attention.
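As an illustration of the mapping described above, the following is a minimal sketch of how ontology terms could be assigned unique integers and encoded for token embedding. The special tokens, the function names `build_annotation_vocab` and `encode`, and the example identifiers are assumptions made for illustration only; they are not the vocabulary layout or tokenizer used in this work.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the actual set used by the models is not stated here.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<mask>"]


def build_annotation_vocab(annotated_proteins: Iterable[List[str]]) -> Dict[str, int]:
    """Assign a unique integer ID to every ontology term seen in the data."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for terms in annotated_proteins:
        for term in terms:
            if term not in vocab:
                vocab[term] = len(vocab)  # next free integer
    return vocab


def encode(terms: List[str], vocab: Dict[str, int]) -> List[int]:
    """Turn one protein's annotation list into token IDs wrapped in <cls>/<eos>."""
    return [vocab["<cls>"]] + [vocab[t] for t in terms] + [vocab["<eos>"]]


# Illustrative annotation lists mixing EC, GO, InterPro, and Gene3D style identifiers.
proteins = [
    ["EC:3.1.1.3", "GO:0004806", "IPR000734"],
    ["GO:0004806", "GO:0016042", "G3DSA:3.40.50.1820"],
]
vocab = build_annotation_vocab(proteins)
ids = encode(proteins[1], vocab)
```

Because every term is a single token, a protein description stays short and shared terms (here `GO:0004806`) always map to the same ID regardless of which protein they describe.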
CAMP and ASM were evaluated on protein annotation tasks with downstream supervised learning and vector search, demonstrating a high correlation with valuable tasks. CAMP produced SOTA embeddings for five out of 15 standardized datasets with competitive performance on the rest, significantly outperforming the newest foundation model ESM3. To compare the Annotation Vocabulary to other strategies, we compiled a dataset of natural language descriptions and proteins that were applied to CAMP, replacing the AT with SciBERT, which also outperformed pretrained pLMs. Notably, training our premier representation model cost a total of $3 in commercial compute (3 hrs on an A6000), highlighting the computational efficiency of latent space curation with Annotation Vocabulary. We conducted protein annotation and sequence reconstruction tasks on AT and ASM using mask filling, both with and without reference amino acid sequences. F1 scores and loss values for sequence reconstruction show that ASM35 can outperform ESM2-150, underscoring added value in incorporating the Annotation Vocabulary into standard pLM pretraining practices.
However, standard metrics like accuracy or F1 scores between reconstructions and labels, as well as loss or perplexity, require the indices of correct tokens to exactly match, which is less meaningful for de novo protein generation. Within the context of biological sequences, many conserved domains may function correctly if slightly out of frame - meaning a high-quality generation result similar to the ground truth sequence may present poor metrics. For a more standardized comparison of generated biological sequences, we propose a novel normalized sequence alignment score based on the Needleman-Wunsch algorithm (43). Using this metric, we explored how well GSM can generate sequences at various mask percentages with annotation prompts, including from pure noise. Importantly, GSM generated realistic protein sequences with high sequence alignment scores to ground truth. Following Basic Local Alignment Search Tool (BLAST) queries, we show statistically significant hits with sequences annotated similar to the prompted annotations - even when the ground truth has a low sequence identity to the training set. Overall, our work offers a new way to build numerical descriptions of proteins through the Annotation Vocabulary. When utilizing our strategies, the functional relevance of amino acid embeddings is enhanced, hinting at broader improvements in both protein annotation and design.
Results
We used the Annotation Vocabulary to curate the latent space of various transformer architectures. To evaluate the effectiveness of Annotation Vocabulary integration, we performed 515 diverse performance evaluations. This included protein annotation using supervised learning with model probes, vector search, mask filling, and protein design using annotation vocabulary prompts. Throughout, we will refer to sequence reconstruction, where we are measuring the capabilities of models to exactly replicate ground truth sequences, and sequence generation, where we measure performance with less strict, but more biologically relevant, alignment-based metrics to explore the generation of plausible protein domains.
Annotation Vocabulary enhances the value of protein embeddings
First, we set out to improve representation learning schemes with the Annotation Vocabulary. We compiled the EXP (UniProt sequences and experimentally validated redundant annotations, 70,000 total), RED (UniRef90 sequences and nonredundant annotations, 500,000 total), and NAT (UniRef50 sequences and nonredundant natural language descriptions, 1.4 million total) datasets. To conduct representation learning over pure annotations, we trained the Annotation Transformer (AT) (Figure 1A), a BERT-like (44) transformer, on the EXP and RED datasets separately, yielding one AT variant per dataset. Then, CAMP models were trained with the corresponding AT components and ESM2-650 (45-50) to curate the ESM2 latent space with annotations through contrastive learning (Figure 1B), again yielding an EXP and a RED variant. For comparison against natural language descriptions, AT was replaced with SciBERT (51) on the NAT dataset, producing a NAT-trained CAMP variant. Lastly, ASM (EXP and RED) (Figure 1C) has joint representation and reconstruction capabilities, so ASM was evaluated for sequence-only representation as well.
Fig. 1.
A: Annotation Transformer, a BERT-like network trained through MLM on protein annotation tokens. B: CAMP model schema, where ESM2-650 and AT are frozen to produce consistent representations after pretraining. Linear layers and ConvBERTs project these representations to a common hidden dimension. Vector outputs are contrasted at the mini-batch level to curate the protein latent space for annotation tasks. C: ASM, an ESM model with an extended dual sequence-annotation vocabulary, trained through MLM on both vocabularies. D: GSM, where AT hidden states are attended with an ESM2 model through cross-attention to enable protein sequence generation from annotation prompts. E: Example of model probe pipeline, where only the sequence track is used and frozen to produce embeddings, then used to train a probe. Created with BioRender.com
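To make the idea of contrasting vector outputs at the mini-batch level concrete, here is a minimal sketch of a standard symmetric, CLIP-style contrastive objective between pooled sequence vectors and pooled annotation vectors. This is explicitly a stand-in: the novel loss used for CAMP is not specified in this section, and the function name `symmetric_contrastive_loss` and the temperature value are assumptions.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(seq_vecs: List[Vector],
                               ann_vecs: List[Vector],
                               temperature: float = 0.07) -> float:
    """Pull each sequence vector toward its own annotation vector and away
    from the other annotation vectors in the mini-batch (and vice versa)."""
    s = [_normalize(v) for v in seq_vecs]
    a = [_normalize(v) for v in ann_vecs]
    logits = [[sum(x * y for x, y in zip(si, aj)) / temperature for aj in a]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # annotation-to-sequence view
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


seq_batch = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_contrastive_loss(seq_batch, [[2.0, 0.0], [0.0, 3.0]])
loss_swapped = symmetric_contrastive_loss(seq_batch, [[0.0, 3.0], [2.0, 0.0]])
```

In this toy batch the matched pairing gives a loss near zero while the swapped pairing is heavily penalized, which is the behavior any mini-batch contrastive objective relies on; no similarity heuristic between sequences is needed, only the pairing of each sequence with its own annotations.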
We evaluated sequence embeddings after contrastive learning on various tasks, split into in-distribution and out-of-distribution, which were either discretely defined within the Annotation Vocabulary (in) or not (out). Fixed-length vector embeddings from frozen models were fed to a linear probe (Figure 1E). As expected, consistent performance increases versus CAMP’s base model ESM2-650 were seen on in-distribution tasks (Table 1). CAMP variants exhibited the best overall performance of the tested models, with embeddings resulting in the only average F1 score above 0.6. scores were 2.6% higher than ESM3 (52) and 9.5% higher than ProteinVec, two large models that were trained with functional information on top of amino-acid based pretraining. Individually, embeddings produced the highest DL10 and second-highest EC and CC F1 scores, with generating the best CC and BP F1 scores. ASM embeddings also performed well, with RED and EXP embeddings 2.2% and 3.3% higher than their base model ESM2-35 F1 score on average. Interestingly, many smaller models trained by semi-supervised denoising had embeddings that correlated better with downstream tasks compared to larger counterparts. ESM2-150 is particularly good at DL2 prediction, and the best overall non-CAMP model was (17); matching in average performance with an F1 average of 0.589.
Table 1.
(multi-label) and F1 scores shown for in-distribution downstream tasks, which we classify as well aligned with the properties represented in the Annotation Vocabulary. Our models have the total model size in parenthesis referencing the total training schema including Annotation Vocabulary components which are not referenced during this embedding process. CAMP model embeddings outperform their closest equivalent counterpart in terms of methodology: ProteinVec, and also the newest frontier pLM ESM3. ASM35 outperforms its base model ESM2-35 in sequence only inference. * approximation of 1 AspectVec on top of ProtT5 encoder (11). EC, CC, MF, BP, CC, and CC AspectVecs refers to the order used in the table (30).
| Model name | Model size (1e6) | EC ↑ | CC ↑ | MF ↑ | BP ↑ | DL2 ↑ | DL10 ↑ | Avg ↑ |
|---|---|---|---|---|---|---|---|---|
| (Ours) | 658 (674) | 0.753 | 0.430 | 0.512 | 0.239 | 0.892 | 0.781 | 0.601 |
| (Ours) | 664 (681) | 0.738 | 0.433 | 0.495 | 0.242 | 0.892 | 0.742 | 0.590 |
| (Ours) | 664 (774) | 0.744 | 0.428 | 0.492 | 0.238 | 0.880 | 0.752 | 0.589 |
| | 453 | 0.735 | 0.408 | 0.496 | 0.231 | 0.898 | 0.763 | 0.589 |
| ESM3 | 1430 | 0.759 | 0.405 | 0.519 | 0.229 | 0.884 | 0.718 | 0.586 |
| ESM2-650 | 652 | 0.699 | 0.418 | 0.471 | 0.224 | 0.905 | 0.754 | 0.579 |
| | 1150 | 0.716 | 0.392 | 0.456 | 0.226 | 0.896 | 0.773 | 0.577 |
| ESM2-150 | 149 | 0.690 | 0.394 | 0.467 | 0.226 | 0.912 | 0.764 | 0.576 |
| SAProt | 656 | 0.691 | 0.411 | 0.475 | 0.221 | 0.892 | 0.758 | 0.575 |
| AspectVec | *1200 | 0.748 | 0.400 | 0.517 | 0.240 | 0.884 | 0.656 | 0.574 |
| (Ours) | 34 (50) | 0.700 | 0.388 | 0.463 | 0.220 | 0.894 | 0.742 | 0.568 |
| (Ours) | 34 (53) | 0.703 | 0.387 | 0.463 | 0.219 | 0.888 | 0.711 | 0.562 |
| ESM2-35 | 34 | 0.667 | 0.383 | 0.435 | 0.214 | 0.896 | 0.703 | 0.550 |
| ProteinVec | 1410 | 0.714 | 0.391 | 0.496 | 0.237 | 0.810 | 0.582 | 0.538 |
| ESM2-8 | 8 | 0.597 | 0.356 | 0.402 | 0.197 | 0.891 | 0.673 | 0.519 |
| Random weights | 90 | 0.339 | 0.323 | 0.276 | 0.137 | 0.772 | 0.492 | 0.390 |
| Random vectors | 0 | 0.070 | 0.272 | 0.156 | 0.065 | 0.420 | 0.131 | 0.186 |
Our out-of-distribution evaluation using vector embeddings and a linear probe portrayed a similar story (Table 2). Here, we did not necessarily expect increased performance compared to the base model. CAMP models individually excelled at PPI tasks, scoring the first and second highest F1 for each. However, it is clear that ANKH and ESM variants outperformed CAMP and ASM on MB. The lack of cofactor annotations, including metal cofactors, for RED and EXP is made clear: MB performance is lower than its base ESM2-650, and performed worse than even though the EXP variant has sparse cofactor information and RED does not.
Table 2.
(multi-label) and F1 scores shown for out-of-distribution downstream tasks, which we classify as outside the scope of the properties represented in the Annotation Vocabulary. While EXP and NAT have sparse CO annotations, RED has none, which is why we classify MB as out-of-distribution. CAMP embeddings showcase adept performance in PPI despite not being trained for it, vastly outperforming other SOTA pLMs with competitive performance in the MB task.
| Model | MB↑ | YPPI↑ | HPPI↑ | Avg↑ |
|---|---|---|---|---|
| ESM2-650 | 0.705 | 0.773 | 0.762 | 0.747 |
| | 0.669 | 0.794 | 0.763 | 0.742 |
| | 0.646 | 0.782 | 0.767 | 0.732 |
| | 0.723 | 0.757 | 0.715 | 0.732 |
| | 0.748 | 0.755 | 0.689 | 0.731 |
| ESM3 | 0.715 | 0.756 | 0.722 | 0.731 |
| ESM2-150 | 0.690 | 0.754 | 0.750 | 0.731 |
| | 0.697 | 0.787 | 0.703 | 0.729 |
| ESM2-35 | 0.691 | 0.745 | 0.732 | 0.723 |
| ESM2-8 | 0.670 | 0.745 | 0.677 | 0.697 |
| | 0.642 | 0.737 | 0.671 | 0.683 |
| | 0.661 | 0.702 | 0.661 | 0.675 |
| ProteinVec | 0.635 | 0.550 | 0.539 | 0.575 |
Supervised learning is not the only avenue for protein annotation; embedding labeled datasets and conducting vector search via vector similarity has also shown promise (30, 32). As such, we evaluated EC annotation using vector search and a SwissProt reference database with maximum separation techniques (32) (Table 3, full metrics in Supplemental Table 1). For the three benchmark datasets introduced by CLEAN (New, Price, Halogenase) (32), CAMP embeddings performed competitively, with the achieving an average AUC of 0.785. Notably, the scores for New and Price were within 0.001 AUC of the highest performers, ProteinVec and CLEAN, respectively. ASM underperformed compared to ESM2-35 but still outperformed ESM2-650 on average; there was no consistent size-to-performance trend with the CLEAN benchmark.
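A rough sketch of annotation by vector search follows: a query embedding is compared to labeled reference embeddings, and the labels that stand apart from the rest are returned. The largest-gap cut used here is only loosely inspired by the maximum separation idea and is an assumption for illustration, not the procedure of CLEAN (32); the function names and the toy three-dimensional embeddings are likewise hypothetical.

```python
import math
from typing import Dict, List


def cosine_distance(u: List[float], v: List[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def annotate_by_vector_search(query: List[float],
                              reference: Dict[str, List[float]]) -> List[str]:
    """Rank reference labels by distance to the query and keep every label
    that comes before the largest gap in the sorted distances."""
    ranked = sorted((cosine_distance(query, vec), label)
                    for label, vec in reference.items())
    if len(ranked) < 2:
        return [label for _, label in ranked]
    gaps = [ranked[i + 1][0] - ranked[i][0] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))  # position of the widest separation
    return [label for _, label in ranked[:cut + 1]]


# Toy reference database: one embedding per EC label (hypothetical values).
reference_db = {
    "EC:1.1.1.1": [1.0, 0.0, 0.0],
    "EC:1.1.1.2": [0.9, 0.1, 0.0],
    "EC:3.1.1.3": [0.0, 1.0, 0.0],
    "EC:2.7.11.1": [0.0, 0.0, 1.0],
}
predicted = annotate_by_vector_search([1.0, 0.05, 0.0], reference_db)
```

Cutting at the largest gap allows a variable number of labels per query, which suits multi-label EC annotation better than a fixed top-k.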
Table 3.
AUC for the CLEAN datasets using the maximum separation method with reference to a Split100 (SwissProt) vector database (32). CAMP and ASM models do not outperform their counterparts, but is within 0.001 AUC of SOTA on the New and Price dataset. Unsurprisingly, ProteinVec and CLEAN still perform excellently around their designed purpose of annotation by vector search (30, 32). * Reported (32)
| Model | New↑ | Price↑ | Halogenase↑ | Avg↑ |
|---|---|---|---|---|
| ProteinVec | 0.761 | 0.709 | 0.926 | 0.799 |
| *CLEAN | 0.740 | 0.733 | 0.907 | 0.793 |
| | 0.744 | 0.699 | 0.921 | 0.788 |
| | 0.760 | 0.732 | 0.863 | 0.785 |
| | 0.747 | 0.728 | 0.862 | 0.779 |
| ESM2-35 | 0.705 | 0.675 | 0.851 | 0.744 |
| ESM3 | 0.708 | 0.697 | 0.807 | 0.737 |
| | 0.706 | 0.703 | 0.783 | 0.731 |
| | 0.698 | 0.680 | 0.797 | 0.725 |
| | 0.708 | 0.677 | 0.777 | 0.721 |
| ESM2-150 | 0.672 | 0.683 | 0.782 | 0.712 |
| | 0.689 | 0.631 | 0.781 | 0.700 |
| ESM2-650 | 0.686 | 0.589 | 0.810 | 0.695 |
| ESM2-8 | 0.688 | 0.637 | 0.713 | 0.679 |
We also evaluated the residue-wise matrix embeddings for CAMP (Table 4), even though we used a loss that was based on vector representations. The goal of this experiment was to determine if a pooled vector-based loss inhibits residue-wise tasks. By far, and ESM3 embeddings exhibited the best correlation with the SS tasks; however, they struggled with TS comparatively. Despite not achieving the top or second best performance for these residue-wise tasks, it still attained the best overall average at 0.652 F1 with ESM2-150 and ESM2-650 slightly lower. ASM models were close to their base ESM2-35 model on SS but significantly underperformed on TS. Similarly to performance with the CLEAN benchmark, TS results were not correlated with model size, as smaller ESM2 models performed the best. Notably, ESM2-8 outperformed comparatively massive models ESM3 and on average. Additional recorded metrics for protein annotation via model probes can be seen in Supplemental Table 2.
Table 4.
Spearman (TS) and F1 (SS3, SS8) shown for annotation tasks using frozen residue-wise embeddings. All Spearman values are highly statistically significant . CAMP was trained with vector embeddings in mind but is still the best on average. ASM does not outperform ESM here but has competitive metrics.
| Model | TS↑ | SS3↑ | SS8↑ | Avg↑ |
|---|---|---|---|---|
| | 0.632 | 0.724 | 0.601 | 0.652 |
| ESM2-150 | 0.656 | 0.710 | 0.586 | 0.651 |
| ESM2-650 | 0.603 | 0.731 | 0.608 | 0.647 |
| | 0.597 | 0.730 | 0.604 | 0.644 |
| | 0.566 | 0.725 | 0.607 | 0.633 |
| ESM2-35 | 0.663 | 0.678 | 0.543 | 0.628 |
| | 0.531 | 0.723 | 0.592 | 0.615 |
| ESM2-8 | 0.632 | 0.670 | 0.521 | 0.608 |
| | 0.620 | 0.669 | 0.529 | 0.606 |
| | 0.581 | 0.668 | 0.527 | 0.592 |
| | 0.337 | 0.749 | 0.649 | 0.578 |
| ESM3 | 0.217 | 0.733 | 0.628 | 0.526 |
Protein classification is tractable through bidirectional mask filling
We evaluated the performance of AT and ASM35 models in predicting masked annotations by modeling protein annotation as mask filling. We masked one annotation category completely (e.g. EC, CC, MF, etc.), and the models used the remaining annotations to predict the missing ones. For the ASM models, experiments were performed both with annotation information as an input and with annotation and full sequence inputs (denoted -seqs) for additional context. The AT models demonstrated superior performance compared to the ASM35 models across five of six downstream tasks (Table 5). had stronger performance in EC, Interpro, and Gene3D predictions, while excelled in MF and BP predictions. ASM35 models underperformed versus AT models across most tasks, with only marginally outperforming on the BP task by 0.002 F1 score. However, the addition of sequence information to ASM did improve its F1 scores on average.
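The masking scheme described above can be sketched as follows: every token belonging to one aspect is replaced by a mask token, and the hidden tokens become the prediction targets. The `ASPECT:term` prefix convention, the `<mask>` string, and the function names are hypothetical choices for illustration, not the tokenization used by AT or ASM.

```python
from typing import List, Tuple

MASK = "<mask>"  # hypothetical mask token string


def aspect_of(token: str) -> str:
    """Read the aspect from a hypothetical 'ASPECT:term' naming convention."""
    return token.split(":", 1)[0]


def mask_aspect(tokens: List[str], aspect: str) -> Tuple[List[str], List[str]]:
    """Hide every token of one aspect; return (model input, hidden targets)."""
    masked = [MASK if aspect_of(t) == aspect else t for t in tokens]
    targets = [t for t in tokens if aspect_of(t) == aspect]
    return masked, targets


annotation_tokens = ["EC:3.1.1.3", "MF:GO:0004806", "BP:GO:0016042", "EC:3.1.1.13"]
model_input, targets = mask_aspect(annotation_tokens, "EC")
```

Note that the number of mask tokens in `model_input` equals the number of hidden terms, so a model filling these masks implicitly knows how many labels to predict for the masked aspect.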
Table 5.
Protein annotation as mask-filling with Annotation Vocabulary. F1 scores are shown for each downstream task. Models were evaluated on their respective validation datasets to fill in a missing aspect given the other. ASM models with “-Seqs” also had the full amino acid sequence as context.
| Model | EC↑ | MF↑ | BP↑ | CC↑ | Pfam↑ | Gene3D↑ | Avg↑ |
|---|---|---|---|---|---|---|---|
| | 0.633 | 0.498 | 0.543 | 0.392 | 0.352 | 0.527 | 0.491 |
| | 0.716 | 0.391 | 0.179 | 0.319 | 0.552 | 0.672 | 0.472 |
| | 0.532 | 0.452 | 0.540 | 0.367 | 0.326 | 0.482 | 0.450 |
| | 0.515 | 0.457 | 0.545 | 0.366 | 0.319 | 0.468 | 0.445 |
| | 0.153 | 0.143 | 0.032 | 0.165 | 0.140 | 0.327 | 0.160 |
| | 0.135 | 0.145 | 0.031 | 0.151 | 0.132 | 0.293 | 0.148 |
Sequence reconstruction is improved with annotation context
Next, we set out to establish protein reconstruction schemes leveraging the Annotation Vocabulary. After training, ASM exhibited higher sequence recovery rates by leveraging annotations, highlighting its ability to fill masked regions given annotation context. Examining throughout training (early, mid, late, RED), we observed a gradual increase in reconstruction proficiency (Figure 2A) as compared to its base model ESM2-35. While this is true for all mask percentages compared to ESM2-35, was much better at high percentage mask recovery even relative to ESM2-150 with 0.23 vs. 0.21 F1 for 50% masking and 0.08 vs. 0.03 F1 for 70% masking. For , this difference is even more pronounced, with ASM35 overtaking ESM2-150 on all percentages (Figure 2C), including 0.24 compared to 0.23 for 50% mask and 0.08 compared to 0.03 for 70% masking. Even at low corruption rates, outperformed ESM2-35 for sequence recovery (0.39 versus 0.30 F1 at 5% ). The loss exhibited similar trends for both ASM35 models (Figure 2B,D), revealing that the improvement exists at the logit level and is not only performance improved by exact match recovery. With ASM35 exhibiting significant improvement in reconstruction by leveraging annotation context (even surpassing the much larger ESM2-150), the possibility may exist to extend the vocabulary of existing SOTA pLMs with this Annotation Vocabulary to train further for sequence reconstruction. This may be particularly valuable for tasks such as active site optimization and mutagenesis study (48, 53).
Fig. 2.
Average performance of sequence reconstruction with three standard deviation error bars. models are colored light to dark based on training progress - highlighting improvements throughout training. A: Sequence reconstruction F1(↑) of sampled throughout training vs. ESM2-35 and ESM2-150. B: Sequence reconstruction loss(↓) of sampled throughout training vs. ESM2-35 and ESM2-150. C: Sequence reconstruction F1 of vs. ESM2-35 and ESM2-150. D: Sequence reconstruction loss of vs. ESM2-35 and ESM2-150.
Annotation Vocabulary prompts generate sequences de novo that align well with ground truth
We engineered transformer components to create GSM, which leverages AT to generate sequences from annotation prompts. To evaluate the BERT-like generation performance of GSM against ESM2, we used a novel sequence alignment metric based on the Needleman-Wunsch algorithm and BLOSUM62 to compare how well a generated output matches the ground truth based on evolutionary log-odds. Our metric has advantages in the context of de novo design compared to traditional metrics like perplexity. Firstly, the metric is scaled from zero (extremely poor alignment) to one (perfect one-to-one amino acid match), which is convenient for interpretation, although it requires more and more similarity to get closer and closer to one. Secondly, it does not penalize models when they move conserved domains out of frame, even though they resemble ground truth almost perfectly. Traditional language modeling metrics look for the exact match in indices, whereas alignment-based methods can introduce gaps to align sequences optimally. Using this metric, randomly paired or generated sequences have a mean score close to 0.15, whereas above 0.5 implies an extremely high degree of similarity. Additional plots to understand possible alignment score distributions are shown in Supplemental Figure 1.
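The difference between index-matching metrics and an alignment-based score can be illustrated with a small sketch. It uses Needleman-Wunsch global alignment with a simple match/mismatch/gap scheme in place of BLOSUM62, and normalizes by the self-alignment score of the ground truth. Both the scoring scheme and the normalization are assumptions chosen to keep the example self-contained; they are not the exact definition of the score proposed here.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with linear gap penalties."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def normalized_alignment_score(generated: str, truth: str) -> float:
    """Alignment score divided by the truth's self-alignment score, clipped to [0, 1]."""
    best = nw_score(truth, truth)
    if best <= 0:
        return 0.0
    return max(0.0, min(1.0, nw_score(generated, truth) / best))


def positional_identity(generated: str, truth: str) -> float:
    """Fraction of indices where the residues match exactly (no gaps allowed)."""
    n = min(len(generated), len(truth))
    return sum(g == t for g, t in zip(generated, truth)) / n if n else 0.0


truth = "MKTAYIAKQR"
shifted = "AMKTAYIAKQ"  # same region, moved one position out of frame
identity = positional_identity(shifted, truth)
alignment = normalized_alignment_score(shifted, truth)
```

For this frame-shifted example the index-wise identity is zero, while the alignment-based score stays high because a single gap at each end recovers the shared region.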
In addition to our novel alignment score, we used BLAST to query generated sequences against a nonredundant version of SwissProt (54). When sequences return statistically significant results it indicates that a sequence has conserved domains that, at least partially, resemble natural sequences. By running Blast2GO on BLAST hits, we were able to get high-quality GO annotations through enrichment analysis backed by the manual annotation of SwissProt. This approach allowed us to look for matching properties between the annotation prompt and the Blast2GO consensus.
Before evaluation, we tuned generation hyperparameters for nucleus or top- sampling , as well as the number of tokens to generate each forward pass and the temperature . We found that greedy denoising with leads to the best score on average; however, higher values (3, 5, 10) and lower reduced the tendency of self-reinforcing repetitions. We observed that this repetition prevention improved the qualitative properties of generated sequences without improving quantitative metrics on average. We also found that led to favorable results in top- or nucleus sampling, but below 0.1 was detrimental. In an effort to maximize performance and reduce computational time, we chose to generate 10 tokens per forward pass with greedy denoising for reported metrics. Full results of our hyperparameter search are shown in Supplemental Tables 3 and 4.
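The role of the tokens-per-forward-pass setting can be sketched with a generic BERT-like iterative denoising loop. The rule used here, committing the most confident masked positions first, is an assumption about how positions might be chosen; `toy_predict` is a hypothetical stand-in for a model forward pass, and none of these names come from the GSM implementation.

```python
from typing import Callable, Dict, List, Tuple

MASK = "<mask>"  # hypothetical mask token string
Predictor = Callable[[List[str]], Dict[int, Tuple[str, float]]]


def iterative_unmask(tokens: List[str], predict: Predictor,
                     tokens_per_pass: int = 10) -> List[str]:
    """Fill masks over several passes, committing only the most confident
    predictions each pass so that later passes can condition on them."""
    tokens = list(tokens)
    while MASK in tokens:
        preds = predict(tokens)  # masked position -> (greedy token, confidence)
        if not preds:
            break  # predictor offered nothing; stop instead of looping forever
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:tokens_per_pass]:
            tokens[pos] = tok
    return tokens


def toy_predict(tokens: List[str]) -> Dict[int, Tuple[str, float]]:
    """Hypothetical predictor: always proposes alanine and is more confident
    about positions nearer the start of the sequence."""
    return {i: ("A", 1.0 / (i + 1)) for i, t in enumerate(tokens) if t == MASK}


prompt = ["M"] + [MASK] * 5  # fully masked except the start methionine
generated = iterative_unmask(prompt, toy_predict, tokens_per_pass=2)
```

A smaller `tokens_per_pass` costs more forward passes but lets each new token see more committed context, which is one plausible reason this setting interacts with self-reinforcing repetitions.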
Following hyperparameter tuning, we fed 1000 random sequences from the GSM train and test set with various masking percentages to GSM and ESM2-150. GSM also received a full annotation prompt. Due to the presence of self-reinforcing repetitions common to these models, we "filtered" results by employing a χ² (Chi-square) test over amino acid distributions to reject poorly generated sequences with repetitive regions (55). Below 50% masking, both models performed markedly better than random mask filling (Figure 3, train results in Supplemental Figure 2). In fact, every pairwise comparison between unfiltered results was highly statistically significant (p < 0.001) using a two-tailed t-test. However, at low percentages, ESM2-150 was noticeably better than GSM at mask filling on the train and test sequences, although we note that the ESM2-150 training set (Uniref50) overlaps considerably. At 70% masking and above, GSM was far superior in generating sequences by leveraging the annotation prompts. By examining the number of filtered sequences, we saw that the prevalence of low-quality sequences with highly repeated amino acids was similar at low masking percentages for GSM and ESM2-150, although in general more GSM sequences were filtered out. This has been observed in many transformer generation schemes, and our current generation hyperparameter search has not solved it (56). Nevertheless, annotation prompts were better utilized at 70% masking and higher, yielding a statistically significant improvement over ESM2-150 and random filling. From analysis of multiple sequence alignments, we note that GSM results were fairly bimodal: generated sequences either resembled natural sequences and reconstructed obvious conserved domains, or became stuck in self-reinforcing repetition loops where a few tokens were repeated.
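A composition-based filter of this kind can be sketched as a chi-square goodness-of-fit test between the amino acid counts of a generated sequence and a background distribution. The choice of background (here estimated from a reference set with a pseudocount), the significance threshold, and the function names are assumptions for illustration and may differ from the filter applied in this work.

```python
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Chi-square critical value for 19 degrees of freedom at alpha = 0.001.
CHI2_CRITICAL = 43.820


def aa_frequencies(seqs: Iterable[str], pseudocount: float = 1.0) -> Dict[str, float]:
    """Estimate background amino acid frequencies from reference sequences."""
    counts = Counter(aa for s in seqs for aa in s if aa in AMINO_ACIDS)
    total = sum(counts[aa] + pseudocount for aa in AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def chi_square_stat(seq: str, background: Dict[str, float]) -> float:
    """Goodness-of-fit statistic of a sequence's composition vs. the background."""
    counts = Counter(aa for aa in seq if aa in AMINO_ACIDS)
    n = sum(counts.values())
    if n == 0:
        return float("inf")  # nothing to test; treat as failing
    return sum((counts[aa] - n * background[aa]) ** 2 / (n * background[aa])
               for aa in AMINO_ACIDS)


def passes_filter(seq: str, background: Dict[str, float],
                  critical: float = CHI2_CRITICAL) -> bool:
    """Keep a sequence only if its composition is not significantly skewed."""
    return chi_square_stat(seq, background) <= critical


# Toy reference set; in practice the background would come from natural proteins.
background = aa_frequencies(["ACDEFGHIKLMNPQRSTVWY" * 3])
balanced = "ACDEFGHIKLMNPQRSTVWY" * 3
repetitive = "M" + "A" * 59  # a self-reinforcing repetition loop
```

With this toy background, `balanced` passes and `repetitive` is rejected; the chi-square approximation is less reliable for very short sequences, where expected counts per residue are small.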
Fig. 3.
Violin plots of protein sequence generation performance over various mask parameters for 1000 random test sequences. GSM, ESM2-150, and a random mask-filling scheme receive the same masks and then fill amino acids using greedy denoising, 10 tokens at a time. However, GSM also receives an Annotation Vocabulary prompt. Both sets also have low-quality results filtered out according to amino acid distribution using a χ² test, labeled as “filtered.”
Surprisingly, we did not observe a concrete relationship between GSM performance and training set similarity (Figure 4). Some test sequences with high training set sequence similarity had near random performance (Figure 4E), and some with low sequence identity to the entire training set appeared to be valid proteins, exhibiting high alignment scores to their ground truth and returning relevant BLAST hits (Figure 4B). GO annotations from all recorded BLAST hits tended to overlap with the input annotation prompt. We suspect that GSM-like models may be trained to condition average results better and that additional schemes such as repetition penalties and MCTS decoding may help (57, 58). Regardless, GSM’s capability of hallucinating natural-like sequences, verified with alignment metrics, BLAST, and enrichment analysis, suggests immense promise in developing generative systems with the Annotation Vocabulary.
Fig. 4.
Various GSM sequence generation examples using 100% mask tokens (except a given start methionine) and Annotation Vocabulary prompts (translated to natural language for easier interpretability). Each example reports the novel alignment score, percent positive alignment indices, max sequence similarity in the training set (Max train sim), the average similarity of the top 100 most similar training sequences (Top100), the average sequence similarity of the BLAST hits, the number of BLAST hits, and the BLAST E-value. If a sequence resulted in statistically significant BLAST hits, GO enrichment analysis is shown from Blast2GO (54). Matching words are highlighted in green between the annotation prompt and enrichment terms. A: High sequence alignment score (for 100% mask), high training set similarity, BLAST hits, matching prompt and GO terms from BLAST hits. B: Medium sequence alignment score, low training set similarity, BLAST hits, matching prompt and GO terms from BLAST hits. C: High sequence alignment score, high training set similarity, no BLAST hits. D: Medium sequence alignment score, high training set similarity, no BLAST hits. This example exhibits highly repetitive regions but also exhibits clearly generated and potentially important domains. E: Medium sequence alignment score, low training set similarity, no BLAST hits. This example exhibits highly repetitive regions but does not exhibit obvious domains.
Discussion
Recent work suggests that MLM over amino acid sequences instills a representation centered around structural information, where structure-based task performance is disproportionately increased (22). With the goal of annotating sequence repositories on the scale of UniProt, we aimed to curate protein latent spaces that more highly correlate with functional characteristics. To more effectively describe protein annotations at the embedding level, we created the Annotation Vocabulary, a compilation of EC, GO, Interpro, and Gene3D ontologies that map to unique integers. For simplicity, we will refer to unique sets of categorizations (EC, GO, etc.) as aspects, in line with the terminology usage in ProteinVec (30). By assigning token embeddings to each ontology, we hypothesized that we could use annotation tokens to build semantic protein function representations. Our annotation transformer (AT) accomplishes this task, providing a high annotation recovery rate through an MLM objective (Supplemental Table 5). When masking out the entirety of specific aspects, AT was able to leverage the other aspects to annotate proteins without reference to sequence information. Interestingly, had EC prediction F1 score of 0.716 on its validation set without sequence information. This EC annotation task is technically an easier objective than the multi-label task shown via model probes, as the models know how many EC numbers each example should have due to the number of tokens present. Also, some MF tokens or other labels possess a large correlation with EC. However, because this is a normal F1 score and not , implying this is actually an exceedingly high metric compared to probe-based reported metrics. Sequence annotation as mask filling opens the door to labeling many of the partial annotations in UniProt as our techniques mature, with and without amino acid sequence context.
Once we engineered the Annotation Vocabulary toolbox, we identified four main mechanisms by which to add functional information to a standard hidden state from a transformer-like network: 1) by contrasting, constraining, or regularizing a hidden state with functional embeddings, 2) along the sequence dimension, with new function tokens, 3) along the hidden dimension, by elongating it with functional embeddings, and 4) within the hidden state itself, by concatenating it with functional embeddings.
The first strategy is inclusive of conventional fine-tuning techniques, where contrastive learning is applied with natural language descriptions or between sequences based on similarity heuristics to curate the latent space after pretraining. In initial attempts at this problem, we designed a protein annotation model inspired by ProteinVec and Mixture-of-Experts frameworks, which we call MOESM. While this system had compelling results for EC prediction (second to only ESM3 by 0.002 ) it did not perform well on average (Supplemental Figure 3 and Table 6). Importantly, we concluded from this experiment that small models on top of larger pretrained and frozen pLMs could drastically alter the final functional relevance of the output embeddings, implying an effective and computationally efficient shortcut to full model fine-tuning. With this in mind, we used versions of the AT with ESM2-650 to create our CAMP models, which contrast semantic protein and annotation representations. Our novel loss focused on matching the distribution of sequences compared to other sequences with the distribution of annotations compared to other annotations. A sequence and its corresponding annotation representation were never directly compared, as would be done with a typical cosine similarity or MNR loss (59, 60). Despite this, CAMP model sequence-only inference resulted in fixed-length vector representations that outperform all tested popular pLMs on downstream probes. In particular, on in-distribution annotation tasks, had the only average F1 score above 0.6, higher than premier (and much larger) models ProteinVec and ESM3. CAMP versions performed with much higher metrics on PPI tasks and do not fall short on residue-wise annotation tasks, despite not being trained for either. We hypothesize that CAMP models may be adept at PPI prediction due to their high performance on BP, as many interacting proteins likely fall in the same BP categories. 
Additionally, CAMP models also produced the best average F1 score on residue-wise downstream tasks, even though they were not trained with a residue-wise objective. This suggests that pooled representations may be sufficient to curate the entire hidden state.
Strategy two offers a theoretically sound way to directly move residue embeddings into more functional clusters. The self-attention mechanism is a built-in vector similarity heuristic in the transformer neural network, which models multi-scale relationships between input tokens through projections and dot products. Therefore, if a model can learn to effectively attend to discrete sequence and function tokens, their projections must be moved closer within the embedding space. This strategy was prototyped by ASM, where the ESM2-35 vocabulary was extended with the Annotation Vocabulary to model sequences and annotations in a bidirectional manner. After training, the pooled vector embeddings had an increased correlation to in-distribution annotations, with 3.3% higher average F1 compared to ESM2-35. Additionally, sequence reconstruction was greatly improved by referencing annotations, outperforming ESM2-35 and ESM2-150 in mask-filling tasks. Of course, one of the main disadvantages of this approach is that the attention mechanism scales quadratically with the combined protein and annotation length, ultimately posing a significant computational expense.
To work against the problematic attention scaling, and to further prototype protein generation using the Annotation Vocabulary, we designed GSM: an Encoder-Decoder schema using AT to produce rich representations of annotation prompts and generate sequences via a cross-attention mechanism. Notably, we used a BERT-like ESM2 model as the decoder, highlighting the newfound potential of using BERT models for sequential generation similar to diffusion models. By removing mask tokens sequentially and strategically, one bridges the gap between representation and generative modeling, spearheaded by ESM3 (52). We evaluated various generation schemes while comparing GSM and ESM2-150 to random mask filling to assess whether Annotation Vocabulary prompts can give an edge over well-trained models such as ESM2. Thus far the most significant approach for improved generation quality is greedy denoising one token at a time. This can be prohibitively expensive for long sequences, but we have found that filling up to 10 tokens at a time reduces this cost without sacrificing much performance. Importantly, the cost here scales primarily with the amino acid sequence length, as sequences are much longer than annotation prompts on average and annotation information is mixed through a cross-attention mechanism instead of self-attention.
While GSM underperforms compared to ESM2-150 when generating sequences with mask percentages at or below 50%, we see the significant value of annotation prompts in GSM’s ability to design sequences at 70% masking or from complete noise. Through manual experimentation by prompting from the test set, we observed that GSM generation was fairly “bi-modal,” either designing a sequence that aligns with some domain of the ground truth or getting stuck in self-reinforcing repetition. We removed poor-quality generations using a χ² test as a filter. Some GSM-generated outputs returned BLAST hits, and further enrichment analysis found GO annotations matching the prompt. In particular, Figure 4B showcases an example where the ground truth sequence has low sequence identity with the entire training set. This hallucinated protein with many BLAST hits and matching enrichment terms implies that the GSM scheme and Annotation Vocabulary are promising avenues for protein design.
Strategy three would be compelling if there were adequate residue-wise protein function ontologies. Whereas there are typically Interpro annotations for every sequence in our dataset (90+%), we are reluctant to rely on mappings that correlate sequence motifs directly to protein function, as conceptually, we would rather include Interpro and GO annotations independently to allow the model to learn an (approximately) optimal relationship. That being said, domain-level correlation to function through sequence homology is a remarkably powerful predictor, and clearly Interpro2Go and the excellent tools that have used it to predict GO terms (61, 62) have significant value. Our experiments evaluating ESM2 models with random weights support sequence homology as a significant driver of protein function similarity. While this seems trivial, vector embeddings from randomized ESM2 weights performed much better with probes than random vectors of the same size alone, as shown in our baseline for in-distribution tasks (Table 1). We hypothesize this is because similar proteins by homology will be embedded similarly through the token embedding process, even with random weights. Therefore, the downstream probe was still able to recognize functional clusters within sequences clustered by homology. In addition to the conceptual problem with strategy three, there is the less discussed computational scaling of the MLP sections of transformers, which scale quadratically with the hidden dimension, implying that adding function regions along the hidden dimension would add considerable computational cost.
The fourth strategy has recently been tested with ESM3, whereby function embeddings were added directly to sequence (and other modalities) embeddings similar to token type or position embeddings. Computationally, this does not add much cost to the forward pass as the hidden state size is not augmented. Additionally, this has advantages in any-to-any generation due to its ability to represent diverse prompts across modalities as reported in detail in the recent ESM3 paper (52). We hypothesize that ESM3’s functional integration may limit the range of sequence-wise functional ontologies that can be effectively used, as they are still applied at the residue level. However, the strategies we employ may limit the ability of the model to correlate residues with specific functional characteristics because we do not assign them to residues directly. This speculation points to the optimal strategy as potentially being some combination of these approaches.
Importantly, there are some limitations to the evaluation approaches used and some surprising results. For example, small linear or BERT-like probes assessed how directly model embeddings correlated with a downstream task, but not the propensity for a model to be fine-tuned for a specific task. Because we were analyzing training strategies to incorporate functional information inherently, this was ideal. However, this approach is less ideal for scaling to a production-ready model. This evaluation strategy produced some results that did not follow a conventional size-to-performance ratio. We hypothesize that this is particularly common for models trained through only semi-supervised denoising, where the local minimum the model has found to minimize language modeling cross-entropy just happens to place downstream embeddings in a way that benefits one task over another.
Another surprising result was that ASM performed worse than AT on annotation mask filling, even when ASM had the context of the entire sequence. In the limit, it is clear that sequence information should not hinder a model’s ability to annotate based on other annotations, as it is additional information. In this case, this could be because ASM was under-trained, or perhaps starting from a pretrained ESM2 checkpoint is not ideal for a dual vocabulary scenario. While perhaps less likely, there could also be some percentage of incorrect annotations (5, 6). We see a similar trend with GSM, which performs worse than ESM2 at lower mask percentages for design tasks. In theory, the additional information from annotations should always improve performance; however, this information may constrain the generation in a harmful way. The highly repetitive nature of GSM during inference could also be due to under-training for that specific task. Of course, we cannot confirm that any generated result from ASM, ESM, or GSM is “wrong” without experimental validation, but ground truth comparisons seem to be the best computational equivalent. Lastly, it was surprising that the ESM3 embeddings performed worse on average compared to CAMP and ANKH in spite of its impressive training schema. However, it is important to note that ESM3 was not trained solely for representation learning but for generation as well, and thus, probing its embeddings is not necessarily indicative of its value as a whole.
Overall, the Annotation Vocabulary toolbox presents a promising pathway to replace traditional tokens with members of ontologies and knowledge graphs, enhancing transformer models in specific domains. We use these strategies to build a language around protein properties, which we feed to various transformer neural network schemes to enhance computational protein design and annotation.
Methods
Annotation vocabulary
The Annotation Vocabulary uses EC, GO Cellular Component (CC), GO Molecular Function (MF), GO Biological Process (BP), Interpro, and Gene3D ontologies to describe protein sequences. For each property / aspect / ontology, a minimum and maximum range of integer values was determined based on the number of possible options within the ontology (for that dataset). Each ontology member was assigned a unique integer in ascending value from EC to Gene3D in the order mentioned above. This resulted in a vocabulary of 30,000 - 60,000 unique integers and annotations depending on the base dataset used for this mapping. Once mapped to integers, annotations were fed to transformer neural networks to build numerical representations after token embedding. Importantly, we always fed annotations to transformer models sorted by their tokenized integer value. This introduced annotation “grammar” and enabled training through semi-supervised denoising.
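The mapping described above can be illustrated with a minimal sketch. The function names (`build_vocabulary`, `encode`), the `ASPECT:member` key format, and the toy ontology members are assumptions made for illustration and are not taken from our codebase; special tokens (mask, padding) are omitted and could be reserved through the hypothetical `offset` argument.

```python
from typing import Dict, List

# Aspect order follows the text: EC, CC, MF, BP, Interpro, Gene3D.
ASPECT_ORDER = ["EC", "CC", "MF", "BP", "INTERPRO", "GENE3D"]


def build_vocabulary(ontologies: Dict[str, List[str]], offset: int = 0) -> Dict[str, int]:
    """Assign every ontology member a unique integer, ascending by aspect."""
    vocab: Dict[str, int] = {}
    next_id = offset  # hypothetical: room for special tokens below `offset`
    for aspect in ASPECT_ORDER:
        for member in sorted(set(ontologies.get(aspect, []))):
            vocab[f"{aspect}:{member}"] = next_id
            next_id += 1
    return vocab


def encode(annotations: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map annotations to integers and sort them by tokenized value."""
    return sorted(vocab[a] for a in annotations)


# Toy ontologies (illustrative members only).
toy = {"EC": ["1.1.1.1", "2.7.11.1"], "MF": ["GO:0005524"], "CC": ["GO:0005829"]}
vocab = build_vocabulary(toy)
tokens = encode(["MF:GO:0005524", "EC:1.1.1.1", "CC:GO:0005829"], vocab)
```

Because each aspect occupies a contiguous integer range and inputs are always sorted, the same set of annotations always yields the same token order, which is the "grammar" the MLM objective can exploit.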
Data compilation
We compiled three datasets of protein and annotation pairs called EXP, RED, and FINAL for short. Technically, Pfam annotation from UniProt (4) was used instead of Interpro for EXP and RED. The first dataset focused on experimentally validated annotations (EXP) and was gathered from a UniProt query on 5/20/24. We searched for sequences with experimentally validated (manual) GO annotations that also had at least one EC annotation, and Interpro or Gene3D. Whereas cofactor (CO) annotation was sparse, within this query of experimentally validated and high UniProt annotation-score entries, we also recorded CO information for the EXP Annotation Vocabulary. We removed duplicate sequences primarily by prioritizing the most annotations and secondarily by prioritizing the length of the sequence. We removed sequences of less than 50 amino acids or greater than 2048 for computational efficiency. This resulted in a total of 70,395 sequence annotation pairs. 1000 pairs were randomly withheld for validation.
The second dataset was designed to maximize nonredundancy and size (RED), compiled from a UniProt query on 5/29/24 for sequences with any EC annotation, totaling over 41 million. We kept sequences that were representative sequences for a Uniref90 cluster (63). This struck a balance between accurate annotations (representative sequences are chosen based on UniProt annotation-score) and nonredundancy (maximum 90% pairwise sequence identity) and resulted in a total of 17 million sequences. We used 90% clustering instead of 50 or 30 to maximize the size of the final processed dataset. We saved a full set with all 17 million sequences and annotations , and a set where duplicate annotation entries are removed (RED) consisting of 516,184 pairs. Duplicates were removed by prioritizing sequence length. 1000 pairs were withheld randomly for validation.
The FINAL dataset was simply a combination of the best characteristics we observed from RED and EXP through experimentation. We constructed 700k sequence annotation pairs that were Uniref50 representative sequences (nonredundant) with a maximum length of 512, 157k experimentally validated sequence annotation pairs (redundant) with a maximum length of 512, and 104k experimentally validated sequence annotation pairs (redundant) with length between 512 and 2,048. We also created a set of nonredundant annotation only inputs (no matches), which comprised 212k total entries that were used to train .
To compare our approach versus more common natural language representations, we used a previously compiled dataset from our lab of protein sequences and natural language descriptions using UniProt, called NAT for short. It was compiled from Uniref50 representative sequence and property pairs by adding corresponding headers for each unique annotation type, followed by new lines. For example, “EC: 1.1.1.1 \n Localization: cytosol, etc.” Representative sequences were recorded when they exhibited at least three of the annotations in Figure 5. Because we used Uniref50, sequences have a maximum sequence identity of 50%; this is not necessarily true of the descriptions, which can match exactly. Therefore, we removed duplicates, which resulted in 1,435,224 examples. Sequence overlap between dataset splits can be found in Supplemental Figure 4.
Fig. 5.
Annotations that defined inclusion criterion for sequence and natural language dataset NAT.
Annotation Transformer (AT)
We trained two independent versions of AT, on EXP and RED, respectively. The EXP version is a single BERT-like transformer block with a hidden size of 384, an intermediate dimension of 2,048, and an Annotation Vocabulary of 33,328 (Figure 1A). The RED version is the same model with an adjusted vocabulary size of 38,953. We also used rotary embeddings instead of absolute position embeddings due to the larger vocabulary size (64). The EXP and RED versions were subject to 15% masked language modeling (MLM) objectives for 100 and 10 epochs, respectively, and evaluated periodically based on MLM accuracy on validation sets, with early stopping once a patience of three was reached.
Annotation Sequence Model (ASM)
The Annotation Sequence Models (ASM) were designed to mix information between sequences and annotations through the self-attention mechanism. These models were actualized by a combination of ESM2 and AT through vocabulary extension - where the token embedding matrix of ESM2 was extended with our Annotation Vocabulary (Figure 1C). Like AT, we trained two versions on EXP and RED, respectively. For both experiments, we used ESM2-35 to strike a balance between functional correlation and computational efficiency. The two models were therefore named after the dataset they were trained on. Whereas both models were closer to 50 million parameters with the large token embedding matrix and language modeling heads, the weights used during sequence-only inference were exactly equivalent in size to ESM2-35. Both models were subject to 15% MLM objectives with varied training schemes and allowed maximum lengths. When a sequence annotation pair exceeded the combined maximum length, the annotations were shuffled and discarded as the primary truncation strategy to prevent feeding the model protein fragments. However, if this would remove over 75% of the annotations, the sequence was truncated as well. The RED model was first trained on the full 17-million-pair set for approximately 0.25 epochs (4.25 million sequence annotation pairs) with a maximum length of 768. It was then trained for two epochs on RED with a maximum length of 1,536. The EXP model was trained on EXP with a maximum length of 2,048 for 23 epochs total, with the stopping point determined by the onset of over-training, observed as decreased MLM recovery accuracy on the validation set.
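The truncation strategy above can be sketched as follows. This is one possible reading, not our exact implementation: we assume that at most 75% of the annotations are dropped before any sequence residues are removed, that the sequence is truncated from its C-terminal end, and that special tokens are ignored in the length budget. The function name `truncate_pair` and its parameters are hypothetical.

```python
import random
from typing import List, Tuple


def truncate_pair(seq: str, annotations: List[int], max_len: int,
                  max_drop_frac: float = 0.75, seed: int = 0) -> Tuple[str, List[int]]:
    """Fit a sequence annotation pair into `max_len` tokens.

    Annotations are shuffled and discarded first; the sequence is only
    truncated if dropping annotations alone would exceed `max_drop_frac`.
    """
    total = len(seq) + len(annotations)
    if total <= max_len:
        return seq, sorted(annotations)

    rng = random.Random(seed)
    shuffled = list(annotations)
    rng.shuffle(shuffled)  # random choice of which annotations to discard

    overflow = total - max_len
    max_drop = int(max_drop_frac * len(annotations))  # assumed cap on dropped annotations
    n_drop = min(overflow, max_drop)
    kept = shuffled[: len(shuffled) - n_drop]

    remaining = len(seq) + len(kept) - max_len
    if remaining > 0:
        seq = seq[: len(seq) - remaining]  # assumed: trim the C-terminal end
    return seq, sorted(kept)  # re-sort to preserve annotation "grammar"


short_seq, short_ann = truncate_pair("A" * 10, [5, 3, 1, 4], max_len=12)
long_seq, long_ann = truncate_pair("A" * 20, [1, 2, 3, 4], max_len=12)
```

In the first call only annotations are discarded and the sequence is left intact; in the second, the annotation cap is reached and the sequence is shortened to make up the difference.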
Novel contrastive loss
While directly minimizing the difference between multiple modalities makes the translation between them conceptually convenient, we see no reason to assume that the best latent representation for a protein must be near the representation for its annotation. Therefore, we designed a novel contrastive loss that instead seeks to match the intra-latent relationships among proteins with those of the corresponding annotations. The first term of the loss is the MSE between the pairwise cosine similarities of the protein vector representations and the pairwise cosine similarities of the corresponding annotation representations. This term can be trivially minimized by any solution which projects all outputs to a single point in space, e.g., by zeroing out weights. Hence, we regularize the loss by adding the average intra-latent cosine similarities for each modality to encourage intra-modality embedding diversity. Formally, the loss is defined as follows:
where the rows of are the CAMP outputs for modality , and
is the corresponding matrix of pairwise cosine similarities. We chose and to place emphasis on the diversity of protein representations.
As usual, computing this loss (and its gradients) over the full data is computationally prohibitive, so we instead work in each iteration with stochastic mini-batches of samples chosen uniformly at random (without replacement) from the full data. Formally, where denotes the randomly selected sample indices (the same indices are used for all modalities). Notably, the resulting stochastic gradients are technically biased since
but the resulting scaling can be straightforwardly corrected or even simply ignored. See full details and the derivations on the loss analysis in Supplemental Loss Analysis section.
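To make the verbal description of the loss concrete, the following is a minimal pure-Python sketch of a loss with the two ingredients described above: an MSE term between the pairwise cosine similarity matrices of the two modalities, and a weighted average intra-modality similarity as a diversity regularizer. The function names, the inclusion of diagonal entries in the averages, and the default weights of 1.0 are assumptions for illustration; they are not the hyperparameters or exact normalization we used.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine_matrix(rows: Matrix) -> Matrix:
    """Pairwise cosine similarities between the rows of one modality."""
    norms = [math.sqrt(sum(v * v for v in r)) or 1.0 for r in rows]
    return [[sum(a * b for a, b in zip(ri, rj)) / (ni * nj)
             for rj, nj in zip(rows, norms)]
            for ri, ni in zip(rows, norms)]


def mean(m: Matrix) -> float:
    """Average over all entries of a matrix."""
    return sum(sum(r) for r in m) / (len(m) * len(m[0]))


def relational_loss(protein: Matrix, annotation: Matrix,
                    w_protein: float = 1.0, w_annotation: float = 1.0) -> float:
    """MSE between similarity matrices plus weighted similarity penalties."""
    s_p, s_a = cosine_matrix(protein), cosine_matrix(annotation)
    mse = mean([[(a - b) ** 2 for a, b in zip(rp, ra)] for rp, ra in zip(s_p, s_a)])
    # Penalizing average similarity discourages collapse to a single point.
    return mse + w_protein * mean(s_p) + w_annotation * mean(s_a)


diverse = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 1.0], [1.0, 1.0]]
loss_diverse = relational_loss(diverse, diverse)
loss_collapsed = relational_loss(collapsed, collapsed)
```

Both toy inputs make the MSE term vanish, but the collapsed batch pays the full similarity penalty. Note that a protein vector is never compared directly with its own annotation vector, so the two modalities may have different embedding dimensions.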
Contrastive Annotation Modeling of Proteins (CAMP)
The CAMP models are designed to contrast sequence and annotation representations from independent frozen models to further curate the sequence latent space (Figure 1B). ESM2-650 and AT were frozen, and a small one-block ConvBERT was added to the end of each model alongside many additional linear layers. Sequences were fed to ESM2-650 and corresponding annotations to the AT, which produced full token matrix embeddings. We applied mean pooling to each modality and used our contrastive loss to train the model. and delineate which dataset and AT were used. Importantly, the vocabularies are different sizes, so EXP cannot be used to fine-tune and vice-versa. For the natural language comparison, we trained CAMP with a frozen SciBERT instead of AT on the NAT dataset. All versions were trained for one epoch over their respective dataset.
Model benchmarks
To benchmark CAMP and ASM against popular pLMs we designed a rigorous evaluation scheme based on freezing the respective pLM, embedding an entire dataset, and either training a downstream probe or conducting vector search (Figure 1E). The datasets used for downstream analysis are shown in Figure 6. We used datasets as is except for SS and PPI tasks. For SS3 and SS8 we used the Proteinea training set for training (17), CB513 and TS115 for validation (65, 66), leaving CASP12, CASP13, and CASP14 for testing (67). Instead of hiding intrinsically disordered residue labels from the loss function, we created a new label for those residues. Therefore, there were four and nine options per residue for SS3 and SS8, respectively. The PPI sets were the human and yeast splits from PiNUI (68). We generated new validation and test sets using 2500 randomly selected pairs each.
Fig. 6.
Standard datasets used to probe model performance. EC, CC, MF, BP, DL2, DL10, MB, and TS are from the SaProt project (69). YPPI and HPPI are sampled from the PiNUI project (68). SS3 and SS8 are modified from Proteinea (ANKH) (17). New, Price, Halogenase, and Split100 are from the CLEAN project (32).
EC, CC, MF, BP, DL2, DL10, and MB were downloaded from SaProt (69). We note that we fed the amino acid and 3Di sequences to SaProt in the in-distribution benchmark, and just amino acid sequences to the rest (Table 1). These datasets, along with YPPI and HPPI, depict sequence-wise categories and were thus evaluated with a linear probe on fixed-length vector embeddings from mean pooling of the last hidden state. For the PPI tasks, we stacked two vector embeddings into a pair representation for each interaction pair. The paired vectors had their order switched with a 50% chance during training. TS is a sequence-wise measurement but is not modelable via a linear probe from a frozen model (all evaluated pLMs perform poorly and inconsistently from pooled states). Therefore, we evaluated TS with a ConvBERT and max pooling, which has been shown to be effective (17, 70). SS3 and SS8 are residue-wise tasks, so they were modeled with a BERT probe. The linear probe was a three-layer MLP with two hidden layers of size 8,192. We tried a large variety of shallower and deeper probes with smaller or larger hidden dimensions and chose this final size based on average performance. For the BERT-like probes, we used an initial linear layer to project the pLM embeddings being evaluated to a standard hidden dimension of 384. We used an intermediate dimension of 1,024 and only one transformer block for each task. All probes had GELU activation functions.
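The PPI pair construction described above can be sketched in a few lines. We assume here that "stacking" means concatenating the two pooled vectors into one input for the probe; the function name `pair_representation` and the `swap_prob` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def pair_representation(emb_a: Sequence[float], emb_b: Sequence[float],
                        rng: random.Random, swap_prob: float = 0.5) -> List[float]:
    """Concatenate two pooled embeddings, randomly switching their order."""
    if rng.random() < swap_prob:
        emb_a, emb_b = emb_b, emb_a
    return list(emb_a) + list(emb_b)


rng = random.Random(0)
never_swapped = pair_representation([1.0, 2.0], [3.0, 4.0], rng, swap_prob=0.0)
always_swapped = pair_representation([1.0, 2.0], [3.0, 4.0], rng, swap_prob=1.0)
```

Switching the order during training encourages the probe to treat an interaction pair symmetrically, since the label does not depend on which protein comes first.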
Probes were trained up to a maximum of 200 epochs to force early stopping, which was triggered by a patience on validation loss improvement, and then evaluated on the test set with the best set of weights. A patience of 10 was used for everything except for SS3 and SS8, which had a more stable performance convergence, and thus we used a patience of five to save time. We validated every epoch except for PPI tasks due to the dataset size, which instead were validated every 1000 batches. A learning rate of was used for all probes with 100 warm-up steps, a cosine learning rate scheduler, and batch size 64.
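The patience-based stopping rule can be summarized with a small sketch that operates on a stream of validation losses. The function name `stop_with_patience` is hypothetical, and the actual training loop (optimizer, scheduler, checkpointing) is omitted.

```python
from typing import Iterable, Tuple


def stop_with_patience(val_losses: Iterable[float], patience: int) -> Tuple[int, float]:
    """Return (best_step, best_loss), stopping after `patience` evaluations
    in a row without improvement in validation loss."""
    best_loss, best_step, stale = float("inf"), -1, 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step, stale = loss, step, 0  # keep these weights
        else:
            stale += 1
            if stale >= patience:
                break  # early stop; evaluate the test set with best weights
    return best_step, best_loss


history = [1.0, 0.8, 0.9, 0.85, 0.7]
early = stop_with_patience(history, patience=2)
late = stop_with_patience(history, patience=3)
```

With a patience of two, training halts before the later improvement is seen, which illustrates why a larger patience was used for tasks with less stable convergence.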
CLEAN benchmarks New, Price, and Halogenase were evaluated using the CLEAN maximum separation scheme against an embedding dataset of Split100 from mean pooling (32).
Protein annotation as mask filling
To evaluate the annotation capabilities of the AT and ASM35 models, we conducted a series of mask-filling experiments for each annotation aspect independently, using the 1000 withheld sequences from our EXP and RED datasets. We filtered the withheld sequences, retaining only those that possessed at least one annotation for the aspect under evaluation. For each sequence, we masked all annotations of the target aspect while providing the remaining annotations as context. We then assessed the models’ ability to accurately predict the masked annotations vs. ground truth with standard metrics. We evaluated four models: , , , and , each on their respective validation datasets. For the ASM models, we performed evaluations both with and without the corresponding protein sequences to assess the impact of sequence information on annotation prediction.
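A minimal sketch of the aspect-level masking and scoring follows. It assumes each aspect occupies a contiguous, inclusive integer range in the Annotation Vocabulary and uses a set-based F1 as one example of a standard metric; the function names, the toy token values, and the mask id of -1 are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def mask_aspect(tokens: Sequence[int], aspect_range: Tuple[int, int],
                mask_id: int) -> Tuple[List[int], List[int]]:
    """Mask every token in one aspect's integer range; return (input, targets)."""
    low, high = aspect_range
    masked = [mask_id if low <= t <= high else t for t in tokens]
    targets = [t for t in tokens if low <= t <= high]
    return masked, targets


def set_f1(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """F1 between predicted and ground truth annotation sets."""
    p, t = set(predicted), set(truth)
    tp = len(p & t)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)


# Toy example: tokens 0-99 stand in for the EC aspect.
masked, targets = mask_aspect([3, 7, 120, 121, 400], aspect_range=(0, 99), mask_id=-1)
score = set_f1([3, 8], targets)
```

Because one mask token replaces each removed annotation, the model implicitly knows how many annotations of the target aspect to predict.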
Generation Sequence Model (GSM)
A 12 transformer block AT variant was trained on the nonredundant FINAL dataset for two epochs . Then, was combined with ESM2-150 in a transformer “Encoder-Decoder” scheme, where the AT last hidden state was fed to the “Decoder” through cross attention (Figure 1D). We modified the ESM2 model with new layer norms on the query and keys of self-attention layers to increase stability and switched the activation function to SiLU. The resulting GSM model had a protein annotation and protein sequence track, which trained both the AT and ESM2 further with MLM and cross-entropy loss, concurrently. The annotation track received annotation sequences at a set 15% masking rate. The sequence track received masked protein sequences from a noise scheduler. For the first stage of training, the masking rate was sampled from a normal distribution with a mean of 0.5, standard deviation of 0.1, and clipped at 0.15 and 1.0. The first stage received sequences with a max length of 512 for eight epochs. The annotation and sequence tracks had cross-entropy hyperparameters of 1.0 and 2.0, respectively. The first stage utilized a learning rate of , batch size of 32, and cosine learning rate scheduler with 1000 warm-up steps. In the second stage, the mean and standard deviation were set at 0.3 and the annotation track had a mask rate of 0%. We still used a cross-entropy loss on the language modeling head output of the AT to enforce identity on the embeddings. The hyperparameters were shifted to 0.01 and 1.0 for the annotation and sequence track, respectively. During the second stage, we trained up to a max length of 2000, learning rate of , batch size of two, gradient accumulation for an effective batch size of at least 50,000 sequence tokens, and the same learning rate scheduler and warm up. The second stage consisted of two epochs. 
For each epoch, the data was sampled from the Uniref50 section followed by the experimentally validated section upon completion, simulating an epoch of RED and then EXP sequentially (high and low-value tokens). The first stage used the EXP section with a maximum length of 512 and the second with the larger maximum length. For both stages, a maximum length of 256 was used for the annotation track, which encompassed all annotation examples.
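The sequence-track noise scheduler described above can be sketched as a clipped normal sampler. The defaults mirror the first-stage values stated in the text (mean 0.5, standard deviation 0.1, clipped to 0.15 and 1.0); the function names, the `<mask>` token string, and the choice to mask an exact rounded count of positions (rather than masking each position independently) are assumptions.

```python
import random
from typing import List


def sample_mask_rate(rng: random.Random, mean: float = 0.5, std: float = 0.1,
                     low: float = 0.15, high: float = 1.0) -> float:
    """Draw a masking rate from a normal distribution, clipped to [low, high]."""
    return min(high, max(low, rng.gauss(mean, std)))


def mask_sequence(seq: str, rate: float, rng: random.Random,
                  mask_token: str = "<mask>") -> List[str]:
    """Replace a `rate` fraction of residues with the mask token."""
    n_mask = max(1, round(rate * len(seq)))
    positions = set(rng.sample(range(len(seq)), n_mask))
    return [mask_token if i in positions else aa for i, aa in enumerate(seq)]


rng = random.Random(0)
rates = [sample_mask_rate(rng) for _ in range(1000)]
masked = mask_sequence("MKTAYIAKQR", 0.5, rng)
```

For the second stage, the same sampler would be called with `mean=0.3` and `std=0.3`, which spreads the masking rate more widely and pushes more samples onto the clipping bounds.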
BERT-like generation
We accomplished BERT-like sequence generation by employing various popular sampling techniques, including top- (considering the best options per token) and nucleus sampling (thresholding and sampling options above ) (71). Generation from mask tokens during inference were actualized by choosing the top- number of mask tokens to fill each forward pass, chosen based on the maximum logit or entropy value (before or after softmax). For instances of and that prevent greedy denoising, logits or entropy values were sampled from a multinomial distribution: introducing randomness into the generation process. Temperature was used before softmax to push intra-token probabilities closer together, which heavily affects the multinomial sampling (72).
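The greedy variant of this scheme can be sketched as an iterative loop that fills the most confident mask positions on each forward pass. The sketch below ranks positions by maximum probability only; entropy-based ranking, temperature, and multinomial sampling are omitted. The `predict` callable stands in for a model forward pass, and `toy_predict`, `greedy_denoise`, and `per_step` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

MASK = "<mask>"
Predictor = Callable[[List[str]], List[Dict[str, float]]]


def greedy_denoise(tokens: List[str], predict: Predictor,
                   per_step: int = 10) -> Tuple[List[str], int]:
    """Fill masks greedily, `per_step` positions per forward pass."""
    if per_step < 1:
        raise ValueError("per_step must be at least 1")
    tokens = list(tokens)
    passes = 0
    while MASK in tokens:
        probs = predict(tokens)  # one distribution per position
        passes += 1
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Most confident masked positions first.
        masked.sort(key=lambda i: max(probs[i].values()), reverse=True)
        for i in masked[:per_step]:
            tokens[i] = max(probs[i], key=probs[i].get)  # greedy token choice
    return tokens, passes


def toy_predict(tokens: List[str]) -> List[Dict[str, float]]:
    """Stand-in model: confident 'A' at even positions, weaker 'G' at odd ones."""
    return [{"A": 0.9, "G": 0.1} if i % 2 == 0 else {"G": 0.6, "A": 0.4}
            for i in range(len(tokens))]


filled, n_passes = greedy_denoise(["M"] + [MASK] * 4, toy_predict, per_step=2)
```

The number of forward passes is the number of masked positions divided by `per_step`, rounded up, which is why filling several tokens per pass substantially reduces generation cost for long sequences.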
Sequence reconstruction referencing annotations
To assess the sequence reconstruction potential of dual vocabulary systems like ASM, we designed a scheme to compare mask filling by ASM with full annotation context against base ESM2 models. For ease of comparison, we used standard greedy BERT mask filling. We conducted mask filling at five percentages (5, 15, 30, 50, 70%) with five replicate experiments each, using different random seeds. The same sequence sections were masked and fed to ASM or ESM2. This ensured that each model received the same sequence and mask positions, but ASM also received annotation tokens at the end. The logits were recorded, and we tracked various metrics, including F1 for tokenwise classification and cross-entropy loss.
Sequence generation with annotation prompts
In local experiments, we observed that sequence generation with high or full masking probabilities required careful hyperparameter selection. We assessed ESM2-150 and GSM over five masking percentages (15, 30, 60, 70, 100%) over a diverse selection of , or , and for 100 sequences in the GSM test set.
GSM also received the full annotations as a prompt. Of particular note, we included uncommonly low temperatures, inspired by (72), in an effort to increase performance. For 100% masking, we placed methionine at the start of the sequence and tiled mask tokens up to the length of the ground truth sequence.
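For the fully masked case, the sequence track can be built as shown below. This is a small sketch that assumes the total length, including the leading methionine, equals the ground truth length; the `<mask>` token string and the function name are hypothetical.

```python
MASK = "<mask>"  # assumed mask token string

def fully_masked_sequence(ground_truth_length):
    """Start with methionine, then tile mask tokens up to the ground truth length."""
    if ground_truth_length < 1:
        raise ValueError("ground truth length must be at least 1")
    return ["M"] + [MASK] * (ground_truth_length - 1)
```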
Multiple sequence alignment and Novel alignment score
Multiple sequence alignment was calculated with standard global alignment settings using Biopython or Biotite Python packages (73, 74). This included BLOSUM62, a gap score of −10, and gap extension of −0.5. For BLAST services, we employed SequenceServer or Blast2GO for BLASTP, for which we used the default settings and a nonredundant SwissProt reference database (54, 75, 76).
We placed multiple sequence alignment scores between zero and one by constructing an error-based metric scaled by the sequence length,
where is the multiple sequence alignment score using Needleman–Wunsch and BLOSUM62 between protein sequence strings (ground truth) and (generated sequence), and is the length of string . The result is an error term in the denominator that reduces the score upon poor alignment. Although the score ranges from zero to one, we observe that the range is non-linear: it becomes increasingly difficult to approach one. Calculated distributions of the scores can be found in Supplemental Figure 1.
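The equation itself is not reproduced in this text, so the following is only a sketch of one form consistent with the description: the ground truth length divided by that length plus an absolute alignment error. Both the formula and the function names are assumptions rather than the published definition, and `simple_global_score` is a stand-in Needleman–Wunsch scorer with match/mismatch values and a linear gap penalty, not BLOSUM62 with the gap settings given above.

```python
def simple_global_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Stand-in scorer (assumption): the text uses BLOSUM62 with separate
    gap-open and gap-extension scores.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def normalized_alignment_score(truth, generated, score_fn=simple_global_score):
    """Assumed form: len(truth) / (len(truth) + |A(truth, truth) - A(truth, generated)|)."""
    if not truth:
        raise ValueError("ground truth sequence must be non-empty")
    error = abs(score_fn(truth, truth) - score_fn(truth, generated))
    return len(truth) / (len(truth) + error)
```

Under this assumed form, a perfect reconstruction scores one and the score decays as the alignment error grows relative to sequence length. With Biopython installed, the `score` method of a `Bio.Align.PairwiseAligner` in global mode, using `substitution_matrices.load("BLOSUM62")` and the gap scores stated above, could be passed as `score_fn`.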
Low sequence identity mining
Sequence identity between test set and training set examples was computed via pairwise sequence alignment, with exact match accuracy giving the sequence identity percentage. We filtered out low sequence identities produced by the generation repetition problem by conducting a test between the amino acid counts of a sequence versus a reference database. The reference database reported on was the unique set of sequences from every dataset split of the GSM dataset. The null hypothesis that a sequence belonged to the reference distribution was rejected at an arbitrary $p$-value threshold found through manual experimentation.
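The composition filter can be illustrated as follows. The statistical test is not named in this text, so the sketch assumes a Pearson chi-square goodness-of-fit statistic comparing a sequence's amino acid counts with counts expected from the reference frequencies; the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def reference_frequencies(reference_sequences):
    """Pooled amino acid frequencies over the reference database."""
    counts = Counter()
    for seq in reference_sequences:
        counts.update(c for c in seq if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def chi_square_statistic(sequence, ref_freqs):
    """Pearson chi-square statistic of observed vs. expected amino acid counts.

    Residues with zero expected count are skipped, a simplification for
    this sketch.
    """
    observed = Counter(c for c in sequence if c in AMINO_ACIDS)
    n = sum(observed.values())
    stat = 0.0
    for aa in AMINO_ACIDS:
        expected = n * ref_freqs[aa]
        if expected > 0:
            stat += (observed[aa] - expected) ** 2 / expected
    return stat
```

A highly repetitive generation produces a large statistic. Converting it to a $p$-value requires the chi-square distribution with 19 degrees of freedom (for example, `scipy.stats.chi2.sf(stat, df=19)`), after which the sequence is filtered if the value falls below the chosen threshold.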
Supplementary Material
ACKNOWLEDGEMENTS
The authors thank Katherine M. Nelson, Ph.D., for reviewing and commenting on drafts of the manuscript. This work was partly supported by the University of Delaware Graduate College through the Unidel Distinguished Graduate Scholar Award (LH), and the National Institutes of Health through R01HL133163 (JPG) and R01HL145147 (JPG).
Footnotes
DATA AND CODE AVAILABILITY
Selected datasets, code, and model weights can be found at github.com/Gleghorn-Lab/AnnotationVocabulary.
References
- 1.Shen Xukang, Song Siliang, Li Chuan, and Zhang Jianzhi. Synonymous mutations in representative yeast genes are mostly strongly non-neutral. Nature, 606(7915):725–731, June 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-04823-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Loewe Laurence and Hill William G.. The population genetics of mutations: good, bad and indifferent. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1544):1153–1167, April 2010. ISSN 0962-8436. doi: 10.1098/rstb.2009.0317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Satam Heena, Joshi Kandarp, Mangrolia Upasana, Waghoo Sanober, Zaidi Gulnaz, Rawool Shravani, Thakare Ritesh P., Banday Shahid, Mishra Alok K., Das Gautam, and Malonia Sunil K.. Next-generation sequencing technology: Current trends and advancements. Biology, 12(77):997, July 2023. ISSN 2079-7737. doi: 10.3390/biology12070997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Research. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Jones Craig E, Brown Alfred L, and Baumann Ute. Estimating the annotation error rate of curated GO database sequence annotations. 8:170. ISSN 1471-2105. doi: 10.1186/1471-2105-8-170. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Silveira Sabrina de Azevedo, de Melo-Minardi Raquel Cardoso, da Silveira Carlos Henrique, Santoro Marcelo Matos, and Meira Wagner Jr. ENZYMAP: Exploiting protein annotation for modeling and predicting EC number changes in UniProt/swiss-prot. 9(2):e89162. ISSN 1932-6203. doi: 10.1371/journal.pone.0089162. Publisher: Public Library of Science. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Markus Braun, Gruber Christian C, Andreas Krassnigg, Arkadij Kummer, Stefan Lutz, Gustav Oberdorfer, Elina Siirola, and Radka Snajdrova. Accelerating biocatalysis discovery with machine learning: A paradigm shift in enzyme engineering, discovery, and design. 13(21):14454–14469. doi: 10.1021/acscatal.3c03417. Publisher: American Chemical Society. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Acero Enrique Herrero, Ribitsch Doris, Dellacher Anita, Zitzenbacher Sabine, Marold Annemarie, Steinkellner Georg, Gruber Karl, Schwab Helmut, and Guebitz Georg M.. Surface engineering of a cutinase from thermobifida cellulosilytica for improved polyester hydrolysis. 110(10):2581–2590. ISSN 1097-0290. doi: 10.1002/bit.24930. [DOI] [PubMed] [Google Scholar]
- 9.Pang Ju-Jiun, Shin Jong-Shik, and Li Si-Yu. The catalytic role of RuBisCO for in situ CO2 recycling in escherichia coli. 8. ISSN 2296-4185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Miserez Ali, Yu Jing, and Mohammadi Pezhman. Protein-based biological materials: Molecular design and artificial production. 123(5):2049–2111. ISSN 0009-2665. doi: 10.1021/acs.chemrev.2c00621. Publisher: American Chemical Society. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Elnaggar Ahmed, Heinzinger Michael, Dallago Christian, Rehawi Ghalia, Wang Yu, Jones Llion, Gibbs Tom, Feher Tamas, Angerer Christoph, Steinegger Martin, Bhowmik Debsindhu, and Rost Burkhard. ProtTrans: Toward understanding the language of life through self-supervised learning. 44(10):7112–7127, . ISSN 1939-3539. doi: 10.1109/TPAMI.2021.3095381. [DOI] [PubMed] [Google Scholar]
- 12.Hallee Logan, Rafailidis Nikolaos, and Gleghorn Jason P.. cdsBERT - extending protein language models with codon awareness. doi: 10.1101/2023.09.15.558027. Pages: 2023.09.15.558027 Section: New Results. [DOI] [Google Scholar]
- 13.Li Sizhen, Moayedpour Saeed, Li Ruijiang, Bailey Michael, Riahi Saleh, Miladi Milad, Miner Jacob, Zheng Dinghai, Wang Jun, Balsubramani Akshay, Tran Khang, Zacharia Minnie, Wu Monica, Gu Xiaobo, Clinton Ryan, Asquith Carla, Skalesk Joseph, Boeglin Lianne, Chivukula Sudha, Dias Anusha, Fernando Ulloa Montoya Vikram Agarwal, Bar-Joseph Ziv, and Jager Sven. CodonBERT: Large language models for mRNA design and optimization. . doi: 10.1101/2023.09.09.556981. Pages: 2023.09.09.556981 Section: New Results. [DOI] [Google Scholar]
- 14.Ren Zilin, Jiang Lili, Di Yaxin, Zhang Dufei, Gong Jianli, Gong Jianting, Jiang Qiwei, Fu Zhiguo, Sun Pingping, Zhou Bo, and Ni Ming. CodonBERT: a BERT-based architecture tailored for codon optimization using the cross-attention mechanism. page btae330. ISSN 1367-4811. doi: 10.1093/bioinformatics/btae330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Nguyen Eric, Poli Michael, Durrant Matthew G., Thomas Armin W., Kang Brian, Sullivan Jeremy, Ng Madelena Y., Lewis Ashley, Patel Aman, Lou Aaron, Ermon Stefano, Baccus Stephen A., Hernandez-Boussard Tina, Ré Christopher, Hsu Patrick D., and Hie Brian L.. Sequence modeling and design from molecular to genome scale with evo. doi: 10.1101/2024.02.27.582234. Publisher: Cold Spring Harbor Laboratory_eprint: https://www.biorxiv.org/content/early/2024/02/27/2024.02.27.582234.full.pdf. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Zheng Kangjie, Long Siyu, Lu Tianyu, Yang Junwei, Dai Xinyu, Zhang Ming, Nie Zaiqing, Ma Wei-Ying, and Zhou Hao. ESM all-atom: Multi-scale protein language model for unified molecular modeling. [Google Scholar]
- 17.Elnaggar Ahmed, Essam Hazem, Wafaa Salah-Eldin Walid Moustafa, Elkerdawy Mohamed, Rochereau Charlotte, and Rost Burkhard. Ankh: Optimized protein language model unlocks general-purpose modelling. (arXiv:2301.06568), . doi: 10.48550/arXiv.2301.06568. [DOI] [Google Scholar]
- 18.Hallee Logan and Gleghorn Jason P.. Protein-protein interaction prediction is achievable with large language models. doi: 10.1101/2023.06.07.544109. Pages: 2023.06.07.544109 Section: New Results. [DOI] [Google Scholar]
- 19.Ferruz Noelia, Schmidt Steffen, and Höcker Birte. ProtGPT2 is a deep unsupervised language model for protein design. 13(1):4348. ISSN 2041-1723. doi: 10.1038/s41467-022-32007-7. Number: 1 Publisher: Nature Publishing Group. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Munsamy Geraldene, Lindner Sebastian, Lorenz Philipp, and Ferruz Noelia. ZymCTRL: a conditional language model for the controllable generation of artificial enzymes. [Google Scholar]
- 21.Krishna Rohith, Wang Jue, Ahern Woody, Sturmfels Pascal, Venkatesh Preetham, Kalvet Indrek, Lee Gyu Rie, Morey-Burrows Felix S., Anishchenko Ivan, Humphreys Ian R., McHugh Ryan, Vafeados Dionne, Li Xinting, Sutherland George A., Hitchcock Andrew, Hunter C. Neil, Baek Minkyung, DiMaio Frank, and Baker David. Generalized biomolecular modeling and design with RoseTTAFold all-atom. doi: 10.1101/2023.10.09.561603. Pages: 2023.10.09.561603 Section: New Results. [DOI] [PubMed] [Google Scholar]
- 22.Li Francesca-Zhoufan, Amini Ava P., Yue Yisong, Yang Kevin K., and Lu Alex X.. Feature reuse and scaling: Understanding transfer learning with protein language models. . doi: 10.1101/2024.02.05.578959. [DOI] [Google Scholar]
- 23.Abramson Josh, Adler Jonas, Dunger Jack, Evans Richard, Green Tim, Pritzel Alexander, Ronneberger Olaf, Willmore Lindsay, Ballard Andrew J., Bambrick Joshua, Bodenstein Sebastian W., Evans David A., Hung Chia-Chun, O’Neill Michael, Reiman David, Tunyasuvunakool Kathryn, Wu Zachary, Žemgulytė Akvile, Arvaniti Eirini, Beattie Charles, Bertolli Ottavia, Bridgland Alex, Cherepanov Alexey, Congreve Miles, Cowen-Rivers Alexander I., Cowie Andrew, Figurnov Michael, Fuchs Fabian B., Gladman Hannah, Jain Rishub, Khan Yousuf A., Low Caroline M. R., Perlin Kuba, Potapenko Anna, Savy Pascal, Singh Sukhdeep, Stecula Adrian, Thillaisundaram Ashok, Tong Catherine, Yakneen Sergei, Zhong Ellen D., Zielinski Michal, Žídek Augustin, Bapst Victor, Kohli Pushmeet, Jaderberg Max, Hassabis Demis and Jumper John M.. Accurate structure prediction of biomolecular interactions with AlphaFold 3. pages 1–3. ISSN 1476-4687. doi: 10.1038/s41586-024-07487-w. Publisher: Nature Publishing Group. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Jumper John, Evans Richard, Pritzel Alexander, Green Tim, Figurnov Michael, Ronneberger Olaf, Tunyasuvunakool Kathryn, Bates Russ, Žídek Augustin, Potapenko Anna, Bridgland Alex, Meyer Clemens, Kohl Simon A. A., Ballard Andrew J., Cowie Andrew, Bernardino Romera-Paredes Stanislav Nikolov, Jain Rishub, Adler Jonas, Back Trevor, Petersen Stig, Reiman David, Clancy Ellen, Zielinski Michal, Steinegger Martin, Pacholska Michalina, Berghammer Tamas, Bodenstein Sebastian, Silver David, Vinyals Oriol, Senior Andrew W., Kavukcuoglu Koray, Kohli Pushmeet, and Hassabis Demis. Highly accurate protein structure prediction with AlphaFold. 596(7873):583–589. ISSN 1476-4687. doi: 10.1038/s41586-021-03819-2. Number: 7873 Publisher: Nature Publishing Group. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Lin Zeming, Akin Halil, Rao Roshan, Hie Brian, Zhu Zhongkai, Lu Wenting, Smetanin Nikita, Verkuil Robert, Kabeli Ori, Shmueli Yaniv, Costa Allan dos Santos, Fazel-Zarandi Maryam, Sercu Tom, Candido Salvatore, and Rives Alexander. Evolutionary-scale prediction of atomic-level protein structure with a language model. 379(6637):1123–1130. doi: 10.1126/science.ade2574. Publisher: American Association for the Advancement of Science. [DOI] [PubMed] [Google Scholar]
- 26.Baek Minkyung, Anishchenko Ivan, Humphreys Ian R., Cong Qian, Baker David, and DiMaio Frank. Efficient and accurate prediction of protein structure using RoseTTAFold2. doi: 10.1101/2023.05.24.542179. Pages: 2023.05.24.542179 Section: New Results. [DOI] [Google Scholar]
- 27.Wu Ruidong, Ding Fan, Wang Rui, Shen Rui, Zhang Xiwen, Luo Shitong, Su Chenpeng, Wu Zuofan, Xie Qi, Berger Bonnie, Ma Jianzhu, and Peng Jian. High-resolution de novo structure prediction from primary sequence. doi: 10.1101/2022.07.21.500999. Pages: 2022.07.21.500999 Section: New Results. [DOI] [Google Scholar]
- 28.Wang Yining, Gong Xumeng, Li Shaochuan, Yang Bing, Sun YiWu, Shi Chuan, Li Hui, Wang Yangang, Yang Cheng, and Song Le. xTrimoABFold: Improving antibody structure prediction without multiple sequence alignments. (arXiv:2212.00735). [Google Scholar]
- 29.Chen Bo, Cheng Xingyi, Geng Yangli-ao, Li Shen, Zeng Xin, Wang Boyan, Gong Jing, Liu Chiming, Zeng Aohan, Dong Yuxiao, Tang Jie, and Song Le. xTrimoPGLM: Unified 100b-scale pre-trained transformer for deciphering the language of protein. doi: 10.1101/2023.07.05.547496. Pages: 2023.07.05.547496 Section: New Results. [DOI] [PubMed] [Google Scholar]
- 30.Hamamsy Tymor, Barot Meet, Morton James T., Steinegger Martin, Bonneau Richard, and Cho Kyunghyun. Learning sequence, structure, and function representations of proteins with language models. doi: 10.1101/2023.11.26.568742. Publisher: Cold Spring Harbor Laboratory_eprint: https://www.biorxiv.org/content/early/2023/11/26/2023.11.26.568742.full.pdf. [DOI] [Google Scholar]
- 31.Su Jin, Zhou Xibin, Zhang Xuting, and Yuan Fajie. ProTrek: Navigating the protein universe through tri-modal contrastive learning. . doi: 10.1101/2024.05.30.596740. Pages: 2024.05.30.596740 Section: New Results. [DOI] [Google Scholar]
- 32.Yu Tianhao, Cui Haiyang, Li Jianan Canal, Luo Yunan, Jiang Guangde, and Zhao Huimin. Enzyme function prediction using contrastive learning. 379(6639):1358–1363. doi: 10.1126/science.adf2465. Publisher: American Association for the Advancement of Science. [DOI] [PubMed] [Google Scholar]
- 33.Liu Shengchao, Zhu Yutao, Lu Jiarui, Xu Zhao, Nie Weili, Gitter Anthony, Xiao Chaowei, Tang Jian, Guo Hongyu, and Anandkumar Anima. A text-guided protein design framework. (arXiv:2302.04611). [Google Scholar]
- 34.Gyorgy Andras, Jiménez José I., Yazbek John, Huang Hsin-Ho, Chung Hattie, Weiss Ron, and Del Vecchio Domitilla. Isocost lines describe the cellular economy of genetic circuits. Biophysical Journal, 109(3):639–646, August 2015. ISSN 0006-3495. doi: 10.1016/j.bpj.2015.06.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Roberts Richard J., Hallee Logan, and Lam Chi Keung. The potential of hsp90 in targeting pathological pathways in cardiac diseases. 11(12):1373. ISSN 2075-4426. doi: 10.3390/jpm11121373. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Sen Sujoita, Hallee Logan, and Lam Chi Keung. The potential of gamma secretase as a therapeutic target for cardiac diseases. 11(12):1294. doi: 10.3390/jpm11121294. Number: 12 Publisher: Multidisciplinary Digital Publishing Institute. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Morris Rhiannon, Black Katrina A., and Stollar Elliott J.. Uncovering protein function: from classification to complexes. Essays in Biochemistry, 66(3):255–285, August 2022. ISSN 0071-1365. doi: 10.1042/EBC20200108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Nachtergaele Sigrid and He Chuan. The emerging biology of rna post-transcriptional modifications. RNA Biology, 14(2):156–163, February 2017. ISSN 1547-6286. doi: 10.1080/15476286.2016.1267096. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Narayanan Vani, Schappell Laurel E., Mayer Carl R., Duke Ashley A., Armiger Travis J., Arsenovic Paul T., Mohan Abhinav, Dahl Kris N., Gleghorn Jason P., and Conway Daniel E.. Osmotic gradients in epithelial acini increase mechanical tension across e-cadherin, drive morphogenesis, and maintain homeostasis. Current Biology, 30(4):624–633.e4, February 2020. ISSN 0960-9822. doi: 10.1016/j.cub.2019.12.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.DeLong John P. Al-Sammak Maitham A. Al-Ameeli Zeina T. Dunigan David D. Edwards Kyle F. Fuhrmann Jeffry J. Gleghorn Jason P. Li Hanqun Haramoto Kona Harrison Amelia O. Marston Marcia F. Moore Ryan M. Polson Shawn W. Ferrell Barbra D. Salsbery Miranda E. Schvarcz Christopher R. Shirazi Jasmine Steward Grieg F. Van Etten James L. and Wommack K. Eric. Towards an integrative view of virus phenotypes. Nature Reviews Microbiology, 20(2):83–94, February 2022. ISSN 1740-1534. doi: 10.1038/s41579-021-00612-w. [DOI] [PubMed] [Google Scholar]
- 41.Towhidul Islam Tonmoy S. M., Mehedi Zaman S. M., Jain Vinija Rani Anku Rawte Vipula Chadha Aman and Das Amitava. A comprehensive survey of hallucination mitigation techniques in large language models. (arXiv:2401.01313), January 2024. doi: 10.48550/arXiv.2401.01313. arXiv:2401.01313 [cs]. [DOI] [Google Scholar]
- 42.Fang Yin, Liang Xiaozhuan, Zhang Ningyu, Liu Kangwei, Huang Rui, Chen Zhuo, Fan Xiaohui, and Chen Huajun. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. (arXiv:2306.08018). doi: 10.48550/arXiv.2306.08018. [DOI] [Google Scholar]
- 43.Needleman Saul B. and Wunsch Christian D.. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, March 1970. ISSN 0022-2836. doi: 10.1016/0022-2836(70)90057-4. [DOI] [PubMed] [Google Scholar]
- 44.Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. BERT: Pre-training of deep bidirectional transformers for language understanding. (arXiv:1810.04805). doi: 10.48550/arXiv.1810.04805. [DOI] [Google Scholar]
- 45.Rives Alexander, Meier Joshua, Sercu Tom, Goyal Siddharth, Lin Zeming, Liu Jason, Guo Demi, Ott Myle, Zitnick C. Lawrence Ma Jerry and Fergus Rob. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. 118(15):e2016239118. doi: 10.1073/pnas.2016239118. Publisher: Proceedings of the National Academy of Sciences. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Rao Roshan M Meier Joshua, Sercu Tom, Ovchinnikov Sergey, and Rives Alexander. Transformer protein language models are unsupervised structure learners. bioRxiv, 2020. doi: 10.1101/2020.12.15.422761. [DOI] [Google Scholar]
- 47.Rao Roshan, Liu Jason, Verkuil Robert, Meier Joshua, Canny John F., Abbeel Pieter, Sercu Tom, and Rives Alexander. Msa transformer. bioRxiv, 2021. doi: 10.1101/2021.02.12.430858. [DOI] [Google Scholar]
- 48.Meier Joshua, Rao Roshan, Verkuil Robert, Liu Jason, Sercu Tom, and Rives Alexander. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv, 2021. doi: 10.1101/2021.07.09.450648. [DOI] [Google Scholar]
- 49.Hsu Chloe, Verkuil Robert, Liu Jason, Lin Zeming, Hie Brian, Sercu Tom, Lerer Adam, and Rives Alexander. Learning inverse folding from millions of predicted structures. ICML, 2022. doi: 10.1101/2022.04.10.487779. [DOI] [Google Scholar]
- 50.Lin Zeming, Akin Halil, Rao Roshan, Hie Brian, Zhu Zhongkai, Lu Wenting, Smetanin Nikita, Costa Allan dos Santos Fazel-Zarandi Maryam Sercu Tom Candido Sal, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022. [Google Scholar]
- 51.Beltagy Iz, Lo Kyle, and Cohan Arman. SciBERT: A pretrained language model for scientific text. (arXiv:1903.10676). doi: 10.48550/arXiv.1903.10676. [DOI] [Google Scholar]
- 52.Hayes Thomas, Rao Roshan, Akin Halil, Sofroniew Nicholas J., Oktay Deniz, Lin Zeming, Verkuil Robert, Tran Vincent Q., Deaton Jonathan, Wiggert Marius, Badkundri Rohil, Shafkat Irhum, Gong Jun, Derry Alexander, Molina Raul S., Thomas Neil, Khan Yousuf, Mishra Chetan, Kim Carolyn, Bartie Liam J., Nemeth Matthew, Hsu Patrick D., Sercu Tom, Candido Salvatore, and Rives Alexander. Simulating 500 million years of evolution with a language model. page 2024.07.01.600583, July 2024. doi: 10.1101/2024.07.01.600583. [DOI] [PubMed] [Google Scholar]
- 53.Teukam Yves Gaetan Nana Dassi Loïc Kwate Manica Matteo Probst Daniel Schwaller Philippe, and Laino Teodoro. Language models can identify enzymatic active sites in protein sequences. doi: 10.26434/chemrxiv-2021-m20gg-v3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Conesa Ana, Götz Stefan García-Gómez Juan Miguel, Terol Javier, Talón Manuel, and Robles Montserrat. Blast2go: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21(18):3674–3676, September 2005. ISSN 1367-4803. doi: 10.1093/bioinformatics/bti610. [DOI] [PubMed] [Google Scholar]
- 55.McHugh Mary L.. The chi-square test of independence. Biochemia Medica, 23(2):143–149, June 2013. ISSN 1330-0962. doi: 10.11613/BM.2013.018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Fu Zihao, Lam Wai, Man-Cho So Anthony, and Shi Bei. A theoretical analysis of the repetition problem in text generation. arXiv.org, December 2020. [Google Scholar]
- 57.Chaffin Antoine, Claveau Vincent, and Kijak Ewa. Ppl-mcts: Constrained textual generation through discriminator-guided mcts decoding. arXiv.org, September 2021. [Google Scholar]
- 58.Yaari Gur, Rokach Lior, Puzis Rami, and Katz Gilad. Mctransformer: Combining transformers and monte-carlo tree search for offline reinforcement learning. September 2022. [Google Scholar]
- 59.Radford Alec, Kim Jong Wook Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, Krueger Gretchen, and Sutskever Ilya. Learning transferable visual models from natural language supervision. (arXiv:2103.00020). doi: 10.48550/arXiv.2103.00020. [DOI] [Google Scholar]
- 60.Henderson Matthew, Al-Rfou Rami Strope Brian, Sung Yun-hsuan, Lukacs Laszlo, Guo Ruiqi, Kumar Sanjiv, Miklos Balint, and Kurzweil Ray. Efficient natural language response suggestion for smart reply. (arXiv:1705.00652), May 2017. doi: 10.48550/arXiv.1705.00652. arXiv:1705.00652 [cs]. [DOI] [Google Scholar]
- 61.Camon Evelyn B., Barrell Daniel G., Dimmer Emily C., Lee Vivian, Magrane Michele, Maslen John, Binns David, and Apweiler Rolf. An evaluation of go annotation retrieval for biocreative and goa. BMC bioinformatics, 6 Suppl 1(Suppl 1):S17, 2005. ISSN 1471-2105. doi: 10.1186/1471-2105-6-S1-S17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Ibtehaz Nabil, Kagaya Yuki, and Kihara Daisuke. Domain-pfp allows protein function prediction using function-aware domain embedding representations . Communications Biology, 6(1):1–14, October 2023. ISSN 2399-3642. doi: 10.1038/s42003-023-05476-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Steinegger Martin and Söding Johannes. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. 35(11):1026–1028. ISSN 1546-1696. doi: 10.1038/nbt.3988. Number: 11 Publisher: Nature Publishing Group. [DOI] [PubMed] [Google Scholar]
- 64.Su Jianlin, Lu Yu, Pan Shengfeng, Murtadha Ahmed, Wen Bo, and Liu Yunfeng. Roformer: Enhanced transformer with rotary position embedding. (arXiv:2104.09864), November 2023. arXiv:2104.09864 [cs]. [Google Scholar]
- 65.Cuff J. A. and Barton G. J.. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 34(4):508–519, March 1999. ISSN 0887-3585. doi: 10.1002/(sici)1097-0134(19990301)34:4<508::aid-prot10>3.0.co;2-4. [DOI] [PubMed] [Google Scholar]
- 66.Yang Yuedong, Gao Jianzhao, Wang Jihua, Heffernan Rhys, Hanson Jack, Paliwal Kuldip, and Zhou Yaoqi. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in Bioinformatics, 19(3):482–494, May 2018. ISSN 1477-4054. doi: 10.1093/bib/bbw129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Protein structure prediction center. (Accessed on 07/23/2024).
- 68.Dubourg-Felonneau Geoffroy Wesego Daniel Mitiku, Akiva Eyal, and Varadan Ranjani. PiNUI: A dataset of protein–protein interactions for machine learning. doi: 10.1101/2023.12.12.571298. Pages: 2023.12.12.571298 Section: New Results. [DOI] [Google Scholar]
- 69.Su Jin, Han Chenchen, Zhou Yuyang, Shan Junjie, Zhou Xibin, and Yuan Fajie. SaProt: Protein language modeling with structure-aware vocabulary. . doi: 10.1101/2023.10.01.560349. Pages: 2023.10.01.560349 Section: New Results. [DOI] [Google Scholar]
- 70.Jiang Zihang, Yu Weihao, Zhou Daquan, Chen Yunpeng, Feng Jiashi, and Yan Shuicheng. Convbert: Improving bert with span-based dynamic convolution. (arXiv:2008.02496), February 2021. doi: 10.48550/arXiv.2008.02496. arXiv:2008.02496 [cs]. [DOI] [Google Scholar]
- 71.Holtzman Ari, Buys Jan, Du Li, Forbes Maxwell, and Choi Yejin. The curious case of neural text degeneration. (arXiv:1904.09751), February 2020. doi: 10.48550/arXiv.1904.09751. arXiv:1904.09751 [cs]. [DOI] [Google Scholar]
- 72.Zhang Edwin, Zhu Vincent, Saphra Naomi, Kleiman Anat, Edelman Benjamin L., Tambe Milind, Kakade Sham M., and Malach Eran. Transcendence: Generative models can outperform the experts that train them. arXiv.org, June 2024. [Google Scholar]
- 73.Cock Peter J. A., Antao Tiago, Chang Jeffrey T., Chapman Brad A., Cox Cymon J., Dalke Andrew, Friedberg Iddo, Hamelryck Thomas, Kauff Frank, Wilczynski Bartek, and de Hoon Michiel J. L.. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, June 2009. ISSN 1367-4803. doi: 10.1093/bioinformatics/btp163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Kunzmann Patrick and Hamacher Kay. Biotite: a unifying open source computational biology framework in python. BMC Bioinformatics, 19(1):346, October 2018. ISSN 1471-2105. doi: 10.1186/s12859-018-2367-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Camacho Christiam, Boratyn Grzegorz M., Joukov Victor, Alvarez Roberto Vera, and Madden Thomas L.. Elasticblast: accelerating sequence search via cloud computing. BMC Bioinformatics, 24(1):117, March 2023. ISSN 1471-2105. doi: 10.1186/s12859-023-05245-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Priyam Anurag, Ben J Woodcroft Vivek Rai, Moghul Ismail, Munagala Alekhya, Ter Filip, Chowdhary Hiten, Pieniak Iwo, Maynard Lawrence J Gibbins Mark Anthony, Moon HongKee, Davis-Richardson Austin Uludag Mahmut, Watson-Haigh Nathan S Challis Richard, Nakamura Hiroyuki, Favreau Emeline, Gómez Esteban A Pluskal Tomás, Leonard Guy, Rumpf Wolfgang, and Wurm Yannick. Sequenceserver: A modern graphical user interface for custom blast databases. Molecular Biology and Evolution, 36(12):2922–2924, December 2019. ISSN 0737-4038. doi: 10.1093/molbev/msz185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Hallee Logan, Kapur Rohan, Patel Arjun, Gleghorn Jason P., and Khomtchouk Bohdan. Contrastive learning and mixture of experts enables precise vector embeddings. (arXiv:2401.15713), May 2024. doi: 10.48550/arXiv.2401.15713. arXiv:2401.15713 [cs]. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Shazeer Noam, Mirhoseini Azalia, Maziarz Krzysztof, Davis Andy, Le Quoc, Hinton Geoffrey, and Dean Jeff. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. (arXiv:1701.06538). doi: 10.48550/arXiv.1701.06538. [DOI] [Google Scholar]