Abstract
The human immune response relies on the unique ability of T-cell receptors (TCRs) to specifically bind to peptides, a process essential for immune surveillance and response. Although deep learning methods for prediction of TCR–peptide binding have proliferated, many encoder-based approaches learn dataset biases, greatly overestimating model performance and ignoring the biochemical mechanisms and spatial properties that affect binding. Through our analysis, we found that interaction pairs generated by cross-mapping the amino acid properties between TCR and peptide implicitly simulate spatial structure, enabling machine learning models to capture information more effectively. Based on this insight, we developed T-cell receptor cross (TCRoss), a transformer-based model for large-scale learning. In addition, we observed that incorporating environmental information into the dataset not only mitigates learning biases but also improves performance. Experiments show that TCRoss consistently outperforms existing models in both observed contexts and de novo peptide scenarios. Wet-lab validation using T-cell activation assays confirmed the model's predictions for nonbinding peptides and provided critical experimental evidence for model assessment. Biophysical validation confirms that high-attention residue pairs correspond to crystallographically observed binding interfaces.
Keywords: T-cell receptors (TCRs), peptide binding, transformer, cross-mapping, spatial structure
Introduction
The adaptive immune system is based on the ability of T cells to recognize and respond to a wide range of pathogen-derived peptides. This recognition is facilitated by T-cell receptors (TCRs), which specifically bind to peptides presented by major histocompatibility complex (MHC) molecules on the surface of antigen-presenting cells [1]. The unique binding capability between TCRs and peptides forms the foundation of immune surveillance and response, enabling the immune system to detect and react to pathogens, cancer cells, and other foreign entities [2]. The discovery of binding TCR–peptide pairs is critical for designing effective vaccines that target infectious diseases and cancers. Predictive models can facilitate the discovery of promising epitopes that generate strong T-cell responses [3–5]. Furthermore, predictive models can support personalized immunotherapy and help identify pathogenic T-cell responses that provide information on disease mechanisms. Such predictions facilitate the discovery of biomarkers, improving patient stratification [6, 7]. Identifying binding TCR–peptide pairs remains challenging due to the diversity of TCR genes and peptide sequences and the polyspecificity of TCRs: a single receptor can recognize multiple distinct peptide–MHC ligands, each with its own degree of specificity [8].
Traditional methods for studying TCR–peptide interactions often rely on wet-laboratory settings, which are costly, time-consuming, and resource-intensive [9–11]. Computational methods, including deep learning (DL), hold promise for revolutionizing this process [12–14]. However, despite substantial progress achieved by sequence-based DL models, these approaches often face challenges in generalization and interpretability, particularly in zero-shot scenarios [15, 16]. Zero-shot learning, which involves predicting the binding of unseen peptides, remains a critical hurdle in immunoinformatics [17].
From a data perspective, while binding data are abundant, unbiased negative (nonbinding) data is scarce. The quality and type of negative data significantly impact outcomes, as models trained on biased negative samples tend to overfit observed patterns or memorize specific instances, instead of capturing the biochemical essence of TCR–peptide interactions or learning robust binding paradigms [15, 16].
As illustrated in Fig. 1, TCR–peptide binding is largely determined by the interaction between the complementarity determining region 3 (CDR3) of the TCR-β chain and the peptide sequence (epitope), both of which are composed of amino acid sequences with complex tertiary structures. While some models either completely ignore structural considerations or attempt to predict entire TCR structures, such approaches may be unnecessary, as binding is predominantly driven by the CDR3 region. In this study, we implicitly model the CDR3 binding structure in prediction.
Figure 1.

Both TCRs and peptides are composed of amino acid sequences with complex tertiary structures (Protein Data Bank (PDB) ID: 2CKB [18]). The focus of this study is to model the interaction between the CDR3 of the TCR-β chain and the peptide sequence, the so-called epitope. This pattern recognition task can predict TCR–peptide binding by identifying specific sequences.
Many studies have primarily focused on paired TCR-αβ chains, which may introduce dataset biases and potentially lead to overestimated prediction accuracy. Although both TCR-α and TCR-β chains are essential for TCR–peptide recognition, this study focuses on the β chain for two main reasons. First, the available dataset of TCR-β sequences is much larger and more diverse than that of TCR-α sequences, providing a more robust foundation for model training. Second, many existing TCR-αβ–peptide datasets are prone to biases because the triplets are correlated, which may lead to overestimated prediction accuracy and reduced generalization. By prioritizing TCR-β, we aim to mitigate these issues and develop a more reliable and interpretable prediction framework by learning the binding paradigm.
To address these challenges, we leverage the largest-scale dataset of TCR-β sequences and peptides to learn robust binding paradigms, while implicitly modeling the CDR3 binding structure in our predictions. This targeted approach aims to improve generalization and mitigate the limitations of existing methods.
Most existing models show unsatisfactory performance when tested on de novo peptides using unbiased generated nonbinding data [15], which can be attributed both to nonbinding data limitations and to a lack of attention to biochemical properties and spatial structure. This poses challenges to model generalization and interpretability, both essential for clinical and research applications [19]. This paper addresses these issues by proposing solutions to enhance both the data and the model in the prediction of TCR–peptide binding.
The properties of amino acids and the structure of proteins play a critical role in protein function, directly affecting the affinity and specificity of TCR–peptide binding, which in turn influences the effectiveness of immune responses [20, 21]. Many previous DL models, however, fail to fully account for the tertiary structure involved in binding. This creates an opportunity to combine amino acid properties with implicit modeling of protein structure to better predict TCR–peptide binding mechanisms.
For the prediction of TCR–peptide binding, a model must be able to generalize across various types of alleles and peptides while learning the core mechanisms of interaction that extend to unseen samples. Models trained on small datasets have shown unsatisfactory performance, motivating effective large-scale dataset construction, particularly the curation of unbiased nonbinding samples.
In this paper, we introduce the following.
TCRoss (T-cell receptor cross), a Transformer-based multimodal AI model designed to interpret TCR–peptide binding interactions using biochemical properties of amino acids and cross-chain mapping between TCR and peptide sequences.
TCRC-200k, a curated large-scale dataset featuring robust nonbinding data generated by two different schemes, which enables model pretraining that improves the performance of most published models.
EES (Environmental Enhancement Strategy), a training strategy applicable to any published model that facilitates the learning of effective binding information to improve their performance on previously unseen (de novo) peptides.
Materials and methods
TCRC-200k
TCRC-200k is a rigorously curated dataset that includes a comprehensive binding set and a test set with de novo peptides. Supporting diverse experimental setups, TCRC-200k is currently the largest cleaned TCR–peptide dataset. The common TCR–peptide datasets used for machine learning and their key characteristics are summarized in Table 1. Given the scarcity of reliable, unbiased nonbinding (negative) data, we generated nonbinding samples and corresponding test sets using two methods: random negative (RN) and environment negative (EN). This approach allows for adjustable positive-to-negative ratios.
Table 1.
Comparison of common TCR–peptide datasets for machine learning and their key characteristics. TCR sequence data need to be cleaned for reasonable use [22].
| Dataset | Description | Source databases | Total data | Positive count | Neg-to-Pos ratio | Strictly cleaned | Zero-shot testing | Nonbinding data (RN/EN) |
|---|---|---|---|---|---|---|---|---|
| TChard [15] | Released in 2022. | 4 | 500K | 142K | 3:1 | | | / |
| TEP-merge [23] | Released in 2023 from TEPCAM. | 3 | 128K | 64K | 1:1 | | | / |
| ImmRep22 [24] | Benchmark with 17 unique peptides. | 1 | 5K | 1K | 5:1 | | | / |
| PanPep [25] | Released in 2023 from PanPep. | 4 | 64K | 32K | 1:1 | | | / |
| TCRC-200k | Ours, DL-ready dataset. | 5 | 218K* | 110K | Adjustable | | | / |

*Originally contains 218K entries but is adjustable to more data; the ratio can be adjusted to a maximum of 1:20.
Our curated dataset draws positive samples from three widely recognized databases: VDJdb [26], McPAS-TCR [27], and IEDB [28]. Specifically, McPAS contributes
entries, IEDB adds
, and VDJdb provides
samples. We conducted extensive preprocessing on these data sources to ensure quality. Entries from VDJdb with a confidence score of zero were excluded, only MHC Class I data were selected, and entries deemed excessively long, noisy, erroneous, or irrelevant were manually removed. Additionally, we realigned some nonaligned data entries. This rigorous filtering resulted in a total of
high-confidence binding samples. Our dataset was curated to maximize diversity while ensuring data quality. The dataset contains a total of
unique peptide sequences. Most peptide sequences are between 9 and 11 amino acids long, while the CDR3 sequence length is more uniformly distributed. The key characteristics of the TCRC-200k dataset are summarized in Fig. 2.
Figure 2.
Characteristics of the TCRC-200k dataset, and visualization of EN and RN data. (A) Visualization of TCR sequences from the EN data using t-SNE and PCA. Different colors and shapes represent positive and negative data. While the visualization is generally consistent with the RN data, differences in the distribution of positive and negative TCR-β CDR3 sequences can be observed. (B) Visualization of TCR sequences from the RN data using t-SNE and PCA. Different colors and shapes represent positive and negative data. (C) Distribution of data entries across different CDR3 and peptide lengths. The size of each point represents the number of data entries, while the color indicates the total amino acid count of the data. (D) Distribution of data entries across different peptide sequences. Blue, yellow, and red represent peptide sequences with >100, >10, and <10 data entries, respectively. The y-axis is on a logarithmic scale, highlighting the highly long-tailed nature of the dataset. (E) Comparison of TCRC-200k and other datasets.
Unlike positive data labeled as binding, identifying nonbinding pairs is challenging due to their scarcity. Some databases, such as 10x Genomics [29], offer tested but limited nonbinding samples. However, studies indicate that using these tested nonbinding samples can introduce bias, as models can disproportionately memorize the distribution of samples, potentially inflating performance [15].
Negative data generation strategies
We employ two negative data generation strategies: RN and EN.
\[
D_{\mathrm{RN}} = \{\, (t_i, p_j) \mid t_i, p_j \sim \mathrm{Rand}(D^{+}),\ (t_i, p_j) \notin D^{+} \,\} \tag{1}
\]

\[
D_{\mathrm{EN}} = \{\, (t_k, p_j) \mid t_k \sim \mathrm{Rand}(T_{\mathrm{env}}),\ p_j \sim \mathrm{Rand}(D^{+}) \,\} \tag{2}
\]

Here, \(t\) and \(p\) represent the TCR sequence and peptide sequence, respectively. \(\mathrm{Rand}(\cdot)\) denotes randomly sampling an element from a unique set. \(D_{\mathrm{RN}}\), \(D_{\mathrm{EN}}\), and \(D^{+}\) are sets of pairs \((t, p)\), where \(T_{\mathrm{env}}\) is a set consisting only of TCR sequences, and the indices \(i\), \(j\), and \(k\) refer to specific elements chosen from these sets. The RN dataset involves randomly reshuffling peptides \(p_j\) and TCRs \(t_i\) within an existing positive dataset \(D^{+}\). This approach is biologically reasonable given the specificity of TCR–peptide binding, with the likelihood of an incorrect pairing being negligibly small [8, 30]. We believe that this error rate is acceptable in the context of machine learning. In contrast, the EN dataset selects a random TCR sequence from a dataset of millions of TCRs derived from human peripheral blood [31] and pairs it with an existing peptide to generate a nonbinding negative sample. The relative merits of these two methods are discussed in Section "Environmental enhancement strategy." In TCRC-200k, the positive-to-negative ratio can be flexibly adjusted; our generation script allows ratios up to 1:20 in EN mode, enabling customization based on model requirements.
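As a concrete illustration, the two generation schemes can be sketched as follows (a minimal sketch; the function names and rejection-sampling details are ours, not from the paper):

```python
import random

def generate_rn(positives, n):
    """Random negatives (RN): reshuffle TCR/peptide pairs drawn from the
    positive set, rejecting any pair that is a known binder."""
    pos = set(positives)
    tcrs = [t for t, _ in positives]
    peps = [p for _, p in positives]
    negatives = set()
    while len(negatives) < n:
        pair = (random.choice(tcrs), random.choice(peps))
        if pair not in pos:
            negatives.add(pair)
    return list(negatives)

def generate_en(positives, background_tcrs, n):
    """Environment negatives (EN): pair TCRs sampled from a background
    repertoire with peptides from the positive set."""
    pos = set(positives)
    peps = [p for _, p in positives]
    negatives = set()
    while len(negatives) < n:
        pair = (random.choice(background_tcrs), random.choice(peps))
        if pair not in pos:
            negatives.add(pair)
    return list(negatives)
```

In both schemes the rejection step keeps known binders out of the negative set; the EN scheme additionally guarantees that the negative TCRs come from outside the positive dataset.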
Dataset splits
We provide two distinct split methodologies in TCRC-200k. In the majority setting, peptides with few associated entries are assigned to the zero-shot test set, while peptides with more entries form the majority set, allowing evaluation of model performance on well-represented peptides and comparison with other models. The second split, the zero-shot setting, allocates peptides unseen during training to the test set, effectively evaluating the model's generalization capability.
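A minimal sketch of the peptide-count-based split (the actual entry-count threshold is a dataset parameter; `threshold` here is illustrative):

```python
from collections import Counter

def split_by_peptide(pairs, threshold):
    """Assign peptides with fewer than `threshold` entries to a zero-shot
    test set; well-represented peptides form the majority set."""
    counts = Counter(p for _, p in pairs)
    majority = [(t, p) for t, p in pairs if counts[p] >= threshold]
    zero_shot = [(t, p) for t, p in pairs if counts[p] < threshold]
    return majority, zero_shot
```

Because the split is keyed on the peptide, every peptide in the zero-shot set is guaranteed to be absent from the majority (training) set.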
T-cell receptor cross
A major limitation of current models is their inability to capture the underlying mechanisms of TCR–peptide binding, including the tertiary structure of the protein and the biochemical nuances of binding interactions. Most existing models adopt generic DL architectures, but few have been specifically adapted to address the unique challenges presented by TCR–peptide interactions.
TCRoss is a multimodal model that integrates amino acid biochemical properties through a transformer-based architecture, as shown in Fig. 3. It generates binding tokens that traverse combinations of potential binding sites within the spatial structure. This approach implicitly captures the biological mechanisms underlying TCR–peptide binding while mitigating the effects of spatial conformation. Importantly, the interaction tokens do not depend on the raw input sequence; instead, they incorporate physicochemical properties, ensuring that the model avoids memorizing training data, a significant advantage in reducing bias in datasets with known imbalances. Furthermore, TCRoss is highly interpretable, allowing us to analyze how each interaction token in the binding attention layers influences the final binding prediction.
Figure 3.
Overview of the TCRoss architecture. (A) Token generation process in TCRoss. Input consists of amino acid sequences and properties. A property mapping is performed to generate a cross-relation map, followed by data replication to facilitate sliding-window sampling. Each interaction pair represents a potential spatial binding interaction. Finally, feature selection and order embedding are applied to produce the binding tokens across properties. (B) The 14 types of amino acid properties used. In addition to standard physicochemical descriptors such as charge, polarity, and hydropathy, we included structural and context-derived features commonly used in protein bioinformatics. (C) Feature embedding. Diverging from Vision Transformers, we separate the tokens by feature type instead of concatenating input channels. Each feature is assigned a learnable embedding, enabling the model to better capture interactions between features.
We selected 14 amino acid properties that are informative for the prediction of binding. We included both standard physicochemical descriptors and context-derived features. These features capture information about residue positioning, steric effects, secondary structure preferences, and flexibility. Values were obtained from established scales or predicted using widely used structural bioinformatics tools. Together, these descriptors provide a multidimensional view of amino acid behavior that complements sequence-based features for machine learning models. We avoid using the Atchley factors, which are obtained through principal component analysis, to reduce learning bias and to use more comprehensive data [32, 33]. For data input, the model incorporates the CDR3 region of the TCR-β chain, the amino acid sequence of the peptide, and the biochemical properties of each amino acid. Next, we perform cross-mapping on the two amino acid sequences, computing pairwise mappings across these properties. Moris et al. [16] demonstrated the superiority of a limited version of this scheme over dual input. By mapping the properties of corresponding amino acids between sequences, we form a cross-relation map in tensor format. We introduce multiple property cross-mapping strategies, including absolute difference, product, and exponential relationships, to assess property similarity or interaction potential.
\[
M_f(i, j) = g\big(P_f(t_i),\, P_f(p_j)\big), \qquad g(a, b) \in \{\, |a - b|,\ a \cdot b,\ e^{-|a - b|} \,\}, \tag{3}
\]

where \(f\) indexes the selected properties, and \(t_i\) and \(p_j\) represent the \(i\)th TCR and \(j\)th peptide amino acids in their sequences. Next, order embedding is introduced to capture amino acid order in TCR and peptide sequences by assigning each amino acid a unique identifier based on its type, which is then incorporated using the same cross-mapping strategy as the properties. Additionally, for gaps in sequences, we apply a negative padding strategy aligned with the cross-mapping to ensure that the model handles varying sequence lengths effectively.
Generation of interaction tokens
Generating the property cross-relation tensor is followed by transforming it into interaction pairs, which are potential binding sites between positions in the CDR3 and peptide chains. Given the spatial structures of peptides and proteins, any location within a peptide chain can theoretically interact with any CDR3 position. Consequently, interaction pairs within the cross-relation tensor may overlap. To simulate spatial configurations, we perform a flip augmentation on the tensor and sample interaction points across it with a sliding convolution window. Through linear projection, each interaction pair is assigned an interaction token. This process consolidates amino acid properties and sequences into a comprehensive set of interaction tokens. From a spatial perspective, these tokens encompass all possible binding sites independent of any inherent structural constraints.
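A sketch of the sliding-window sampling that turns the cross-relation tensor into interaction tokens (the patch size and stride values are illustrative; the real model follows this with a learned linear projection):

```python
import numpy as np

def interaction_patches(cross_map, patch=2, stride=1):
    """Sample patch x patch windows from a (F, L_tcr, L_pep) cross-relation
    tensor with a sliding window; each flattened window becomes one token."""
    F, Lt, Lp = cross_map.shape
    tokens = []
    for i in range(0, Lt - patch + 1, stride):
        for j in range(0, Lp - patch + 1, stride):
            window = cross_map[:, i:i + patch, j:j + patch]
            tokens.append(window.reshape(F, -1))  # flatten spatial dims
    return np.stack(tokens)  # (num_tokens, F, patch * patch)
```

With stride 1 the windows overlap, which matches the observation above that interaction pairs within the tensor may overlap.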
Feature embedding
Feature embedding of interaction tokens occurs along the dimension representing amino acid properties; the uniqueness of each interaction pair comes from the properties used to generate it. We adopt two embedding strategies: concatenating different properties within a single token, and generating a set of tokens equal to the number of properties. For the latter, tokens representing the same property are embedded with a shared set of learnable parameters, allowing the model to internalize property-specific information; we call this feature embedding. After embedding, we employ a Transformer encoder and a classifier to generate binding scores.
Here is a detailed mathematical description of the process. Let the input consist of \(G\) distinct feature groups. For the \(g\)th feature group (\(1 \le g \le G\)), denote its \(N\) patches as \(x_1^{(g)}, \ldots, x_N^{(g)} \in \mathbb{R}^{P}\), where \(P\) is the raw feature dimension. A shared linear projection maps all patches to a latent space:

\[
z_i^{(g)} = W x_i^{(g)} + e^{(g)}, \tag{4}
\]

where \(W \in \mathbb{R}^{D \times P}\) is the projection matrix, and \(e^{(g)} \in \mathbb{R}^{D}\) is the learnable embedding vector for the \(g\)th feature group. Next comes the sequence construction. A learnable classification token \(z_{\mathrm{cls}}\) is prepended to the projected patches:

\[
Z_0 = \big[\, z_{\mathrm{cls}};\ z_1^{(1)};\ \ldots;\ z_N^{(1)};\ \ldots;\ z_N^{(G)} \,\big]. \tag{5}
\]

The sequence \(Z_0\) is processed by \(L\) standard Transformer blocks:

\[
Z_\ell = \mathrm{Block}(Z_{\ell-1}), \qquad \ell = 1, \ldots, L, \tag{6}
\]

where each block comprises multihead self-attention (MSA) and feed-forward networks (FFN):

\[
Z'_\ell = \mathrm{MSA}\big(\mathrm{LN}(Z_{\ell-1})\big) + Z_{\ell-1}, \qquad Z_\ell = \mathrm{FFN}\big(\mathrm{LN}(Z'_\ell)\big) + Z'_\ell. \tag{7}
\]
In our model, the feature embedding process introduces additional computational costs. Here we evaluate both the overhead resulting from feature embedding and the improvements achieved through Linformer optimization. The standard self-attention mechanism, which is widely adopted in Transformer-based architectures, has a time complexity that scales quadratically with the sequence length. This is expressed as \(O(N^2 \cdot d)\), where \(N\) is the number of tokens and \(d\) is the latent dimension.
In the context of TCR–peptide interaction modeling, the inputs are encoded as 2D feature maps. Let \(L_t\) and \(L_p\) denote the sequence lengths of the TCR and peptide, respectively, and let \(s\) denote the patch size. Accounting for bidirectional encoding or symmetric padding, the total token count \(N\) is given by:

\[
N = 2 \left\lceil \frac{L_t}{s} \right\rceil \left\lceil \frac{L_p}{s} \right\rceil. \tag{8}
\]

This level of complexity is comparable to standard Vision Transformer variants (e.g. ViT-Base with \(N = 196\) tokens), and it remains tractable on modern GPUs.
To further address the quadratic scaling issue, we implement the Linformer optimization method. In this approach, the key and value matrices are projected to a low-rank space of rank \(k\), where \(k \ll N\). This reduces the self-attention complexity to \(O(N \cdot k \cdot d)\).
Although feature embedding increases the number of tokens—thereby raising the computational cost—the resulting complexity remains within acceptable limits for our application. Furthermore, the Linformer optimization efficiently reduces the complexity from a quadratic to a near-linear relationship, helping to balance both local feature extraction and global interaction modeling while ensuring computational efficiency.
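The token-count and attention-cost trade-off can be checked with a small calculation (the token-count formula reflects our reading of Eq. (8); the sequence lengths and rank below are illustrative):

```python
import math

def num_tokens(L_t, L_p, s):
    """Token count for a bidirectionally encoded (flip-augmented)
    cross-relation map with patch size s."""
    return 2 * math.ceil(L_t / s) * math.ceil(L_p / s)

def attn_cost(N, d, k=None):
    """Self-attention cost: O(N^2 * d) for standard attention,
    O(N * k * d) with a Linformer low-rank projection of rank k."""
    return N * (k if k is not None else N) * d

# Example: a 15-residue CDR3 against a 9-mer peptide, patch size 2.
N = num_tokens(15, 9, 2)          # 2 * ceil(15/2) * ceil(9/2) = 80 tokens
standard = attn_cost(N, 64)       # quadratic in N
linformer = attn_cost(N, 64, k=16)  # near-linear in N
```

For typical CDR3 and peptide lengths the token count stays in the low hundreds, which is why the quadratic cost is already tractable and the Linformer projection gives a further constant-factor reduction.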
Unbiased learning and interpretability
It is worth noting that our model does not directly use the original input sequences at any point during training. During the generation of interaction tokens, the model cannot memorize specific information from the input sequences. All learning and inference are performed at the TCR–peptide binding level, with information processed through the cross-mapping strategy, effectively preventing sequence memorization. In addition to robustness against bias, our model also exhibits strong interpretability. The output attention layer can be directly related to the TCR–peptide binding process, as visualized in Fig. 5, where we can analyze the contribution of each interaction token to the final binding result.
Figure 5.
Analysis of TCRoss attention using structure visualization. (A) Attention weight for each patch from TCRoss. (B) Visualization of top-attended interaction patches for four TCR–peptide complexes. For each complex, the three patches with the highest cross-attention weights are shown. Colored squares represent local interaction patches (TCR residue versus peptide residue), with attention weights indicated by color intensity. (C) The 3D structures from the PDB [41]. The PDB IDs from left to right are 2ESV [42], 1OGA [43], 8WUL [44], and 3E3Q [45]. The amino acid pairs of the high-attention patches correspond to residues engaged in noncovalent interactions.
Environmental enhancement strategy
We propose a training strategy called the EES, which is compatible with most existing models. EES improves training by leveraging environmental information in a configurable ratio. This addresses the limitations of current models for generalizing to de novo peptides. Although RN is the standard negative data type, previous studies show that existing models fail to learn meaningful binding information from RN data [15]. EES effectively prevents overfitting and encourages the model to learn essential binding information. This approach ensures a genuine and unbiased evaluation of test sets and demonstrates that RN data can provide valuable, generalizable insights for the prediction of TCR–peptide binding.
Currently, no published model achieves better-than-random performance on RN test sets. It is widely held that using only TCR-β and peptide sequences cannot generalize to de novo peptides on RN data. However, we attribute the issue to training strategies rather than chain limitations: the failure to generalize results from insufficient learning opportunities in the training data. Although RN data are accurate, as stated in Section "TCRC-200k," they may lack the diversity needed for models to learn effectively.
EN data introduce bias when used for testing. High performance on EN may be misleading, since the model may memorize TCR sequences from the training set, identifying common sequences as binding and all others as nonbinding. Consequently, EN test sets should not be used to evaluate generalization.
EES addresses these challenges by incorporating environmental information from a background TCR dataset into training. These environmental sequences are derived from the TCR repertoires of healthy individuals. The strategy introduces a tunable hyperparameter, denoted λ, which controls the proportion of environmental sequences used in training. Adjusting λ allows EES to adapt to different model architectures and dataset characteristics.
Hyperparameter λ
The hyperparameter λ functions as a soft-label regulator, dynamically balancing the training process. Early in training, a smaller λ reduces the risk of overfitting, as the model has not yet learned sufficient features. As training progresses, increasing λ strengthens the model's ability to learn binding patterns and distinguish between cases, improving generalization to de novo peptides.
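One possible realization of the dynamic schedule described above (the text specifies that the EES hyperparameter, denoted λ here, grows during training; this linear form and its bounds are illustrative):

```python
def lam_schedule(epoch, total_epochs, lam_min=0.1, lam_max=0.5):
    """Linearly ramp the environmental-sample proportion from lam_min to
    lam_max over training: small early to avoid overfitting, larger later
    to strengthen binding-pattern discrimination."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return lam_min + frac * (lam_max - lam_min)
```

A warm-up of this kind is a common way to phase in an auxiliary supervision signal without destabilizing early training.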
EES converts unsupervised tasks into self-supervised learning without requiring true negative labels. Traditional unsupervised methods rely on clustering, dimensionality reduction, or self-supervised techniques to explore data structure. EES instead introduces negative supervision by incorporating environmental information, leveraging the polyspecificity of TCR–peptide interactions to improve model discrimination.
Unbiased evaluation
Through adopting EES, we can perform unbiased evaluations of model performance using RN test sets. By training on blended EN/RN data but validating exclusively on RN, which contains no environmental sequences, EES decouples the learning signal from evaluation integrity. Adding a generated RN validation set as an early-stopping criterion ensures that models do not overfit patterns or memorize training data, as the introduced environmental sequences are entirely absent from the test set. This approach also allows models to learn generalizable knowledge rather than dataset-specific artifacts. Our results in Table 2 demonstrate that, contrary to concerns about EN bias, environmental information provides valuable insights for the prediction of TCR–peptide binding, enabling models to learn and generalize effectively beyond the training data. In TCR–peptide research, zero-shot evaluation is crucial because it provides a more forward-looking reference for immunology research. Underlined results indicate anomalously lower-than-random performance, showing that these models may have learned dataset biases.
Table 2.
Majority and zero-shot setting results for different models under different settings. The majority setting means that the test set contains previously seen epitopes; in the zero-shot setting all test epitopes are de novo. Best and second-best results are highlighted. EN may contain bias (see Section "Environmental enhancement strategy"); all models are tested on RN unless otherwise specified. Δ represents the average increment of testing on EN compared with the baseline, and the average improvement of using EES compared with not using EES. Seq and Prop denote sequence and property.
| Model | Backbone | Modality | Majority AUROC | Majority AUPR | Δ | Zero-shot AUROC | Zero-shot AUPR | Δ |
|---|---|---|---|---|---|---|---|---|
| Baseline (w/o EES) | | | | | | | | |
| ERGO-II [34] | AutoEncoder | Seq | | | – | | | – |
| ERGO-II [34] | LSTM | Seq | | | – | | | – |
| ImRex [16] | CNN | Seq+Prop | | | – | | | – |
| ATM-TCR [35] | Attention | Seq | | | – | | | – |
| TEINet [36] | Encoder+NN | Seq | | | – | | | – |
| TCRoss | Transformer | Seq+Prop | | | – | | | – |
| EN dataset (may contain bias) / w/ EES | | | | | | | | |
| ERGO-II [34] | AutoEncoder | Seq | | | | | | |
| ERGO-II [34] | LSTM | Seq | | | | | | |
| ImRex [16] | CNN | Seq+Prop | | | | | | |
| ATM-TCR [35] | Attention | Seq | | | | | | |
| TEINet [36] | Encoder+NN | Seq | | | | | | |
| TCRoss | Transformer | Seq+Prop | | | | | | |
Experimental procedures
Animals and reagents
C57BL/6 (H-2Kb) and OVA257-264-specific (OT-I) TCR transgenic mice were obtained from The Jackson Laboratory. All mice were housed in Ohio State University ULAR facilities under pathogen-free conditions. Experiments were performed in compliance with Institutional Animal Care and Use Committee (IACUC) guidelines under approved protocol OSU IACUC 2022A00000061. Splenic dendritic cell (DC) isolation was performed as described previously [37]. Liberase TM (Roche) and DNase I (Roche) were used for tissue digestion. Cells were maintained in sterile complete RPMI 1640 medium (Thermo Fisher Scientific) supplemented with 10% heat-inactivated FBS, 50 U/ml penicillin, 50 μM streptomycin, 1 mM sodium pyruvate, 2 mM L-glutamine, 0.1 mM nonessential amino acids, 50 μM 2-mercaptoethanol, and 10 mM HEPES. MACS buffer (PBS supplemented with 0.5% BSA and 2 mM EDTA) was used for magnetic cell separations. All peptides, including canonical SIINFEKL and in silico-generated mutant variants, were synthesized by GenScript. Antibodies used for flow cytometry included anti-CD69-PE (clone H1.2F3, BioLegend), anti-CD25-AF647 (clone PC61, BioLegend), and Live/Dead Fixable Near-IR viability dye (Thermo Fisher Scientific).
Dendritic cell–T-cell cocultures
DCs were pulsed with 10 μM peptide for 30 min at 37°C in complete RPMI, washed three times, and plated in flat-bottom 96-well plates. Naive CD8+ OT-I T cells were isolated from spleens by negative selection using the CD8a+ T Cell Isolation Kit (Miltenyi Biotec) and added at a 1:1 ratio. Cocultures were incubated for 18 h at 37°C, with no-peptide and no-antigen controls included. Following coculture, T-cell activation was assessed by flow cytometry after staining in FACS buffer. Data were acquired on a BD FACSymphony A3 (BD Biosciences) and analyzed with FlowJo (Tree Star).
Results
We compared the performance of TCRoss with several well-established and state-of-the-art models. We also analyzed how various components and features impact performance. Models pretrained on the dataset TCRC-200k are used for interpretability experiments. Parameter experiments, ablation studies, and detailed experimental setup are provided in the Supplementary data.
Overall performance assessment
Through a range of comparative tests, TCRoss routinely surpassed leading models, whether dealing with observed peptides or zero-shot scenarios, highlighting its superior resistance to bias. In the de novo epitope prediction scenario, it improves performance over all competing models, from ImRex to ERGO-II LSTM (paired permutation test).
We evaluated performance using AUROC (Area Under the Receiver Operating Characteristics curve) and AUPR (Area Under the Precision-Recall curve) as metrics, as these are widely applicable in binary classification tasks and are more suitable for our problem than accuracy [38].
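For reference, AUROC can be computed without external dependencies via the rank-sum (Mann-Whitney U) formulation, with ties handled by midranks (this is a standard identity, not code from the paper; it matches scikit-learn's `roc_auc_score`):

```python
def auroc(labels, scores):
    """AUROC via the rank-sum formulation: (sum of positive ranks - offset)
    normalized by the number of positive-negative pairs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                      # assign midranks to tied scores
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1                  # 1-based midrank
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    pos = [ranks[i] for i, y in enumerate(labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

AUROC equals the probability that a randomly chosen binding pair is scored above a randomly chosen nonbinding pair, which is why it is preferred over accuracy on the imbalanced ratios used here.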
To minimize the impact of dataset size on performance, we used data sizes commonly reported in previous models' papers; for fairness, we restricted the dataset to the same 60k samples used in PanPep [25]. We trained the models and repeated each experiment five times, reporting average metrics with confidence intervals.
TCRoss achieves the best results on all datasets except the EN dataset, which may be biased. ATM-TCR and ImRex are both nonexplicit encoder architectures, and their performance on the zero-shot and RN datasets is second only to TCRoss.
In the majority setting, we used five-fold cross-validation and tested models on the RN and EN datasets (see Section “TCRC-200k”), with RN serving as the baseline. TCRoss achieved
and
(95% CI) in the majority setting on the RN dataset, which is higher than the other compared models. In the zero-shot setting with de novo epitopes, TCRoss achieved
and
.
In the zero-shot setting, we tested the models on the RN dataset, comparing performance with and without the use of EES. Notably, some methods showed an AUROC
on the RN test set, statistically distinguishable from chance (
). This indicates that the models may have learned inverse information, highlighting potential biases or mismatches in the distribution of samples between the training and test data. Using EES significantly improved zero-shot performance on the RN test set.
Validation with functional T-cell assays
To experimentally validate the binding predictions of our DL model, TCRoss, for the OT-I TCR recognizing the SIINFEKL (OVA257-264) epitope and its virtual mutants, we performed ex vivo T-cell activation assays. Virtual mutants are in silico-generated peptide variants used to computationally probe TCR specificity prior to experimental validation. Splenic DCs from C57BL/6 (H-2Kb) mice were pulsed with individual peptide mutants and co-cultured at a 1:1 ratio with naive CD8+ OT-I T cells for 18 h. Controls included cultures without peptide. T-cell activation was specifically quantified by flow cytometric analysis of CD69 and CD25 co-expression, robust markers of early T-cell activation. This experimental setup directly assesses the ability of the peptide-MHC complex to trigger functional TCR signaling in primary T cells relevant to the OT-I system.
A panel of virtual peptide mutants was selected for testing based on TCRoss predictions and IEDB annotations. Crucially, all training data pertaining to the SIINFEKL peptide and the OT-I TCR were excluded during model training to ensure an unbiased validation. The test set comprised: (i) peptides with high predicted binding scores; (ii) peptides with low predicted binding scores; and (iii) all peptides within the virtual mutant set that had existing experimental annotations in the IEDB database.
The results, summarized in Fig. 4, revealed several key findings. First, most results agreed with IEDB annotations: for the nine peptides with pre-existing IEDB annotations, our functional assay largely corroborated the database records. Seven (7/9) IEDB-annotated agonists induced significant T-cell activation, aligning with their known behavior. One (1/9) IEDB-annotated agonist peptide, however, failed to elicit activation above the no-peptide control level in our assay.
Figure 4.
Validation of TCR signaling for SIINFEKL virtual mutants. C57BL/6 (H-2Kb) splenic DCs pulsed with peptide mutants were co-cultured (1:1) with naive CD8+ OT-I T cells for 18 h, with no-peptide controls. T-cell activation was assessed by CD69+CD25+ co-expression via flow cytometry. The bar graph shows activation percentages per peptide, with plots for canonical agonist, weak agonist, and no-antigen conditions. Data represent three experiments with three replicates. Error bars show SD. Significance was determined by one-way ANOVA with Dunnett’s test versus no-peptide control (****
). Heatmaps compare experimental results with TCRoss predictions and IEDB annotations: green for activation, red for nonactivation, and white for data unavailable because the peptide was generated by the virtual mutant process.
All four (4/4) peptides predicted by TCRoss to have low binding affinity consistently failed to activate OT-I T cells, confirming the model’s accuracy in identifying nonbinders for this specific set.
Notably, none of the five (0/5) peptides predicted by TCRoss to have high binding affinity induced detectable T-cell activation in our co-culture assay.
Dataset bias and environmental enhancement strategy analysis
To investigate the bias in the EN dataset, we conducted a simple experiment where we selected only the TCR-beta sequences, excluding the peptide and all other information, and performed dimensionality reduction for visualization. Specifically, we first converted the raw sequence data into a matrix of TF-IDF features, then reduced the dimensionality to 50 using PCA, and finally applied t-SNE to map the data to 2D. This allowed us to visualize the distribution of TCR sequences in the dataset. Fig. 2 shows the t-SNE visualizations for the EN and RN datasets.
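The pipeline above can be sketched as follows; the character 3-mer tokenization, perplexity value, and toy CDR3 sequences are illustrative assumptions rather than the exact settings used in the study:

```python
# Sketch of the visualization pipeline: TF-IDF features -> PCA(50) -> t-SNE(2D).
# Toy CDR3 sequences and 3-mer tokenization are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_tcr_sequences(sequences, n_pca=50, seed=0):
    # Treat each CDR3 sequence as a "document" of overlapping character 3-mers.
    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    X = vec.fit_transform(sequences).toarray()
    # PCA to at most 50 dimensions (capped by sample/feature counts).
    n_comp = min(n_pca, X.shape[0], X.shape[1])
    X_red = PCA(n_components=n_comp, random_state=seed).fit_transform(X)
    # t-SNE down to 2D for plotting; perplexity must stay below n_samples.
    perplexity = min(30, len(sequences) - 1)
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=seed).fit_transform(X_red)

coords = embed_tcr_sequences(["CASSLGQAYEQYF", "CASSPGTGGYEQYF", "CASSQDRGYGYTF",
                              "CASSLAPGATNEKLFF", "CSARDGTGNGYTF", "CASSIRSSYEQYF"])
print(coords.shape)  # (6, 2)
```

Coloring the resulting 2D points by their positive/negative labels then reveals whether the negatives occupy a distinct region of sequence space, which would indicate dataset bias.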
From the visualizations, we observe that the distribution of TCR sequences in the RN data is highly consistent, with positive and negative sequences evenly spread and no evidence of bias. In contrast, the EN data show some variation in the distribution but remain relatively consistent overall. Although the EN data may introduce some bias, the environmental information they provide can still aid in training the model.
EES helps to utilize all available information while avoiding data bias. When applied to established TCR-prediction models, EES achieved an average improvement (mean ΔAUROC = 0.077, mean ΔAUPR = 0.073). Notably, EES transformed the performance of ERGO-II (LSTM-based) from subrandom to clinically significant levels. Models employing direct epitope encoding showed the maximum benefit; direct epitope encoding is currently widely used in mainstream models [12].
Interpretable attention patterns
We present an interpretability experiment analyzing the attention and binding interactions of TCRoss using 3D models. We selected four representative MHC-I complexes from the TCR3D database [39] as examples. For each, the corresponding 3D structure was retrieved from the PDB. Images were created using Mol* [40] (see Fig. 5). The figure shows examples of interaction tokens generated via flip augmentation. For each PDB structure, the top and bottom rows show shifted windows (stride = 1) over the TCR–peptide interaction tensor: the top row starts at TCR position 0 (including the conserved “C”), whereas the bottom row starts at position 1. Amino acid pairs represent local interactions within each patch. The conserved “C” is structurally invariant and noninformative for binding.
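To make the interaction-tensor and shifted-window idea concrete, here is a minimal sketch; the hydropathy scale as the cross-mapped property, the outer-product combination, and the window size are illustrative assumptions, not the paper’s exact implementation:

```python
# Hypothetical sketch of cross-mapping one amino-acid property between a TCR
# CDR3 and a peptide into a 2D interaction tensor, then taking stride-1 windows.
import numpy as np

# Kyte-Doolittle hydropathy values for the residues used below (illustrative).
HYDROPATHY = {"A": 1.8, "C": 2.5, "S": -0.8, "I": 4.5, "N": -3.5, "F": 2.8,
              "E": -3.5, "K": -3.9, "L": 3.8, "G": -0.4, "Q": -3.5}

def interaction_tensor(tcr, peptide, table):
    t = np.array([table[a] for a in tcr])
    p = np.array([table[a] for a in peptide])
    return np.outer(t, p)  # (len(tcr), len(peptide)) pairwise property map

def shifted_windows(mat, size=3):
    # Stride-1 windows along the TCR axis, mimicking the shifted patches above.
    return [mat[i:i + size] for i in range(mat.shape[0] - size + 1)]

M = interaction_tensor("CASSLGQ", "SIINFEKL", HYDROPATHY)
print(M.shape)                   # (7, 8)
print(len(shifted_windows(M)))   # 5
```

Each window then plays the role of a patch token, so that attention over windows corresponds to attention over local TCR–peptide residue neighborhoods.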
TCRoss was pretrained on the TCRC-200k dataset with the corresponding data removed. We identified key amino acid pairs in the 3D structure that are involved in binding or chemical interactions and compared them with the final attention layer of TCRoss, identifying the top three most influential amino acid pairs.
The results indicate that the three pairs with the highest attention are strongly associated with amino acids located near the TCR–peptide binding site. In many cases, these pairs correspond to residues involved in chemical bonding. Notably, the attention mechanism highlights critical amino acids within the TCR sequence, suggesting that TCRoss has, to some extent, captured the underlying binding interactions.
Key physicochemical properties
To evaluate the impact of specific amino acid properties on model performance and robustness, we performed an ablation study comprising 100 independent trials. In each trial, half of the input features were stochastically masked in a zero-shot setting on the reduced dataset.
We applied t-tests to assess statistical significance. Properties showing significant differences were as follows: Core, Volume, Turn, Disorder, Strength, and pH.
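The trial-and-test loop above can be sketched as follows; the scoring function, effect sizes, and the Core-driven signal are synthetic placeholders purely to illustrate the masking-plus-t-test procedure, not the study’s data:

```python
# Illustrative sketch of the masking ablation: each trial randomly masks half
# the property channels; a property's effect is assessed by comparing scores
# with vs. without it. Scores here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
PROPERTIES = ["Core", "Volume", "Turn", "Disorder", "Strength", "pH"]

def run_trials(n_trials=100):
    records = []
    for _ in range(n_trials):
        mask = rng.permutation([True] * 3 + [False] * 3)  # keep half the properties
        kept = {p for p, m in zip(PROPERTIES, mask) if m}
        # Placeholder score: pretend "Core" adds +0.05 AUROC plus noise.
        score = 0.70 + 0.05 * ("Core" in kept) + rng.normal(0, 0.02)
        records.append((kept, score))
    return records

def welch_t(a, b):
    # Welch's t statistic for two independent samples.
    a, b = np.asarray(a), np.asarray(b)
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                           + b.var(ddof=1) / len(b))

records = run_trials()
with_core = [s for k, s in records if "Core" in k]
without_core = [s for k, s in records if "Core" not in k]
t = welch_t(with_core, without_core)
print(t > 2.0)  # |t| above ~2 flags a property as significant
```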
We conclude that Core, pH, and Turn are essential properties, while Volume, Disorder, and Strength can be dropped to enhance model efficiency. We used violin plots to visualize the results, as shown in Fig. 6, which illustrates the relative importance of different properties in binding prediction by comparing model performance when specific properties are included or removed. This provides useful insights into the selection of features for binding prediction.
Figure 6.

Impact of individual properties on binding prediction. Experimental results highlight that Core, pH, and Turn are crucial for accurate binding prediction.
Property feature embedding
Our model employs multimodal input that incorporates paired amino acid sequences along with their physicochemical properties. Conventional transformer architectures such as vit-tcr [46, 47] typically concatenate different amino acid attributes within single tokens, an approach that may compromise the model’s ability to distinguish between distinct biochemical features. We instead introduce a feature embedding strategy. As illustrated in Fig. 3C, this method generates parallel token representations corresponding to each molecular property type. Each property-specific token subset undergoes dedicated embedding through a series of learnable parameter groups, enabling the model to explicitly internalize attribute-specific information during feature transformation.
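The per-property embedding idea can be sketched as follows; the discretization into bins, the embedding dimensions, and the sum-combination rule are illustrative assumptions rather than the model’s exact design:

```python
# Minimal numpy sketch of per-property "feature embedding": each physicochemical
# property gets its own embedding table, preserving attribute identity instead
# of concatenating all attributes into one fused token.
import numpy as np

rng = np.random.default_rng(0)
N_BINS, D_MODEL = 16, 32
PROPERTIES = ["hydropathy", "volume", "charge"]

# One table per property (random init stands in for learned parameters).
tables = {p: rng.normal(0, 0.02, size=(N_BINS, D_MODEL)) for p in PROPERTIES}

def embed(token_bins):
    # token_bins maps property name -> discretized value per sequence position.
    # Each property is looked up in its own table; results are summed per token.
    return sum(tables[p][idx] for p, idx in token_bins.items())

seq_len = 5
token_bins = {p: rng.integers(0, N_BINS, size=seq_len) for p in PROPERTIES}
out = embed(token_bins)
print(out.shape)  # (5, 32)
```

Because each property has dedicated parameters, gradients for one attribute cannot overwrite the representation learned for another, which is the intuition behind keeping the embedding spaces separate.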
Comparative analysis demonstrates the superiority of our feature embedding approach over conventional methods. Relative to the baseline model with positional embedding, removing positional embedding improved performance to 0.792 ± 0.006 AUROC and 0.810 ± 0.007 AUPR. Our implementation of feature embedding yielded better predictive accuracy (0.815 ± 0.008 AUROC; 0.831 ± 0.009 AUPR, see Table 3), indicating that explicit property-specific representation improves the model’s discrimination capability for TCR binding prediction. This improvement suggests that preserving attribute identity through dedicated embedding spaces allows more effective utilization of multimodal input information than fused representation approaches.
Table 3.
Effect of different backbones for TCRoss. Feature embedding achieved the best results, proving the effectiveness of combining embedding with amino acid properties. Best and second-best results are marked.

| Backbone | AUROC | AUPR |  | Time |
|---|---|---|---|---|
| Baseline |  |  | – |  |
| w/o Positional embedding |  |  |  |  |
| Simple |  |  |  |  |
| w/o Positional embedding |  |  |  |  |
| Cross |  |  |  |  |
| Efficient |  |  |  |  |
| w/o Positional embedding |  |  |  |  |
| w/ Feature embedding |  |  |  |  |
Ablation study on architecture
To explore the impact of different backbones on TCRoss, we conducted experiments on a reduced 30k dataset in the majority setting with several transformer architectures, comparing their performance and run time per epoch while controlling variables such as the inclusion of positional embedding and feature embedding.
We selected four backbones based on their adaptability to our problem. Baseline is the original transformer architecture [48]. Simple refers to SimpleViT [49], a streamlined version of ViT that features a linear classifier and 2D sinusoidal positional embeddings. Efficient uses Linformer from Wang et al. [50], which approximates self-attention with low-rank matrices to achieve linear complexity. Cross incorporates CrossViT [51], which leverages multiscale feature learning by using two different patch sizes to simulate various binding patterns.
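The low-rank attention idea behind the Efficient backbone can be sketched as follows (single head, no learned weights; shapes and initialization are illustrative assumptions):

```python
# Minimal numpy sketch of the Linformer idea: project length-n keys/values down
# to k rows before attention, reducing cost from O(n^2) to O(n*k).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, k=8):
    n, d = K.shape
    E = rng.normal(0, 1 / np.sqrt(n), size=(k, n))  # projection (learned in practice)
    K_proj, V_proj = E @ K, E @ V                   # (k, d): compressed keys/values
    scores = Q @ K_proj.T / np.sqrt(d)              # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj                 # (n, d) output

n, d = 64, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linformer_attention(Q, K, V)
print(out.shape)  # (64, 16)
```

The attention matrix here is n × k rather than n × n, which is why run time per epoch for this backbone scales linearly in sequence length.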
As shown in Table 3, the proposed feature embedding paired with the Efficient backbone yielded the highest performance. The Simple backbone achieved the most balanced results, suggesting that learnable positional embeddings may not outperform fixed ones. Notably, Efficient without positional embeddings was the fastest and offered the best cost–performance ratio.
Discussion
Different modalities should be taken into consideration
Bonetta and Valentino [52] reviewed the application of machine learning techniques in protein function prediction, emphasizing the role of amino acid properties in improving prediction accuracy. ImRex [16] introduced the concept of mapping amino acid properties and leveraging CNNs [53] to achieve promising results. This approach demonstrated the effectiveness of amino acid properties in such predictions. Other methods based on Atchley factors were also proposed [46]. However, these approaches did not address the structural aspects of TCR–peptide binding.
AlphaFold [54] highlighted the importance of amino acid properties in protein structure prediction and demonstrated that machine learning could reveal information about protein binding forces and interactions [55]. This work won the 2024 Nobel Prize in Chemistry “for protein structure prediction.” Similarly, progress has been made in predicting peptide–MHC interactions based on structural characteristics, further highlighting the need to consider structural factors in TCR–peptide binding [56, 57]. However, the limited availability of high-quality structural data presents a challenge, making it difficult to apply these methods directly [39, 58].
We conjecture that existing methods are limited by focusing solely on single-modality sequence analysis, thereby overlooking crucial biological factors that influence TCR recognition. Although mechanistic modeling introduces implementation complexity, this paradigm is both physiologically justified and methodologically imperative. Conventional encoder architectures risk becoming counterproductive by disregarding biophysical interaction principles. Rigorously integrating amino acid physicochemical properties with conformational dynamics modeling may reveal more biologically meaningful results.
Our ablation study (100 independent trials) revealed that specific physicochemical properties exhibit divergent impacts on AUROC and AUPR metrics. Notably, properties such as Turn propensity significantly enhanced AUROC but marginally reduced AUPR. AUROC evaluates the model’s ability to rank binders above nonbinders across all classification thresholds, making it sensitive to generalizable patterns. A property like Turn propensity may improve AUROC by contributing to general binding patterns but harm AUPR if it introduces false positives in nonbinding contexts. Our ablation study also confirmed that the effects of Turn propensity are statistically significant for both AUROC and AUPR (paired t-test,
), despite visual overlap in the violin plots (Fig. 6). However, other properties, such as Hydropathy, Beta, and Polarity, may be affected by trial size. Incorporating structural context, such as solvent accessibility, to disambiguate property effects may enhance both AUROC and AUPR in imbalanced settings.
Redefining T-cell receptor binding prediction through multidimensional attention
Attention mechanisms [48] have proven to be valuable in enhancing both the generalizability and the interpretability of the models. Many sequence-based approaches incorporate attention mechanisms to improve model performance and interpretability [59–64]. However, these models often fail to capture the biological mechanisms underlying binding, as they primarily focus on 1D data patterns rather than the biological context. Although these approaches have achieved some success, they struggle to generalize to de novo peptides, indicating that purely data-driven methods may fail to capture the true biological dynamics of TCR–peptide interactions.
Foundation models such as Transformers [48] and Vision Transformers [65] show promising potential due to their strong generalization capabilities. The attention mechanisms in these architectures can capture complex relational patterns, enabling effective learning even from limited or zero-shot samples [66]. Transformers excel at modeling dependencies between tokens, leading to the development of variants designed for multilayer, multidimensional, and multiscale data [51, 67, 68]. The Transformer is particularly suitable for studying multifeatured and spatially dependent TCR–peptide binding processes, promising to overcome barriers to AI applications in immunology [69].
Data challenges are becoming more serious
Some models [34, 70, 71] incorporate additional data sources beyond the TCR α and β chains, such as MHC sequences and V/J genes. Although this added complexity can improve prediction accuracy, it often reduces practical applicability because these data are significantly less available than β chain data alone [28]. Moreover, these additional data can introduce biases, causing models to learn superficial patterns rather than capture the underlying mechanisms of TCR–peptide interactions [34].
The two main negative data generation methods have been shown to introduce biases. RN, used in [34, 72], generates negative data by randomly re-pairing TCRs and peptides, which introduces bias in majority settings. EN, used by Luu et al. [73], draws from background TCR chains collected in [31, 74] and has also shown bias, particularly with de novo peptides [15, 61]. Better performance has been observed for models tested on EN in the zero-shot setting [75, 76]. PanPep [25] used EN data, but our re-runs of baseline models [34, 35] exceeded its performance. TEINet [36] focused on negative data strategies, but its sequence input still makes it easy to learn biases. Myronov et al. [61] highlighted these biases and used an ML-based filter for EN, although problems remained. Other studies attempt to use unsupervised learning [77]. Many approaches still focus exclusively on previously observed or specific peptides [9, 62, 78, 79].
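The two sampling schemes can be sketched as follows, with toy placeholder sequences standing in for real CDR3s and epitopes:

```python
# Illustrative sketch of the two negative-sampling schemes: RN (re-pair observed
# positives) vs. EN (pair observed peptides with background TCRs).
import random

random.seed(0)
positives = [("CASSLGQ", "SIINFEKL"), ("CASSPGT", "GILGFVFTL"),
             ("CSARDGT", "NLVPMVATV")]
background_tcrs = ["CASSIRS", "CASSQDR", "CASSLAP"]

def rn_negatives(pairs):
    # Random negatives: re-pair each TCR with a peptide it was not observed with.
    peptides = [p for _, p in pairs]
    out = []
    for tcr, pep in pairs:
        alt = random.choice([q for q in peptides if q != pep])
        out.append((tcr, alt))
    return out

def en_negatives(pairs, background):
    # External negatives: pair each observed peptide with a background TCR.
    return [(random.choice(background), pep) for _, pep in pairs]

print(len(rn_negatives(positives)), len(en_negatives(positives, background_tcrs)))
```

The bias risk differs by scheme: RN negatives reuse the positive TCR repertoire, while EN negatives import a background repertoire whose distribution may differ systematically from the positives.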
Existing explicit encoders may have learned biases
We observed systematic performance disparities across evaluation frameworks, with models demonstrating marked improvement on EN datasets (mean ΔAUROC = +0.133, see Table 2) compared with RN benchmarks. This discrepancy likely originates from inherent biases in EN sampling protocols, particularly the nonuniform TCR β repertoire distribution in background reference populations. Sequence-embedding architectures (e.g. ERGO-II, TEINet) exhibited the largest EN–RN performance gaps (mean ΔAUROC = +0.185, see Table 2), suggesting overfitting to dataset-specific covariance patterns rather than generalizable binding principles.
Strikingly, multiple models showed an inverted predictive capability on RN validation sets (AUROC = 0.435 ± 0.005 versus chance = 0.5, permutation test). This systematic prediction inversion implies learned anticorrelations between specific CDR3β physicochemical profiles and epitope binding likelihood, a critical warning against uncritical negative sampling practices.
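A label-permutation test of this kind can be sketched as follows; the synthetic anticorrelated scores merely illustrate how a below-chance AUROC is distinguished from chance, and are not the study’s data:

```python
# Sketch of a label-permutation test for whether an AUROC differs from chance:
# shuffle labels to build a null AUROC distribution and compare the deviation.
import numpy as np

rng = np.random.default_rng(0)

def auroc(y, s):
    # Probability that a random positive outranks a random negative (ties = 0.5).
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

y = rng.integers(0, 2, 500)
scores = rng.normal(size=500) - 0.5 * y   # anticorrelated scores, AUROC below 0.5
obs = auroc(y, scores)

# Null distribution: AUROC under randomly permuted labels.
null = np.array([auroc(rng.permutation(y), scores) for _ in range(1000)])
p = (np.abs(null - 0.5) >= abs(obs - 0.5)).mean()  # two-sided p-value
print(obs < 0.5, p)
```

A significantly below-chance AUROC under this test is what signals that a model has learned inverted (anticorrelated) information rather than merely being uninformative.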
Notably, the superior performance of ATM–TCR and ImRex supports this interpretation: both models avoid explicit sequence encoding, and their architectural designs appear less susceptible to learning such dataset-specific biases.
These findings demonstrate that even minimal-input models (TCR β CDR3 + epitope sequences) exhibit measurable prediction bias. Incorporating additional covariates (MHC genotype, T-cell activation status) without proper confounding control may further distort performance metrics [80]. Our results question the reliability of standard evaluation measures (AUROC/accuracy) for TCR specificity prediction, instead advocating for clinical endpoint-aligned metrics such as positive/negative predictive value in therapeutic contexts.
Limitations
In our interpretability experiment, despite using a leave-one-out approach, potential biases may arise due to the sparse nature of the TCR3D dataset and the selection of representative complexes. Applying our interpretable method to new, unseen TCR–peptide complexes may yield unreliable or incomplete interpretations. Furthermore, our current approach does not incorporate information about the MHC molecule or the TCR α chain, both of which are important for TCR–peptide binding. The omission of these components may limit the biological completeness of our findings, as they are essential for capturing the full complexity of TCR–peptide interactions.
From the perspective of the dataset, some binding data may contain false positives. In addition, because the theoretical peptide sequence space is astronomically large, our data cover only part of it, and we rely on zero-shot learning to achieve generalization.
To address these limitations, we are planning to develop and integrate unbiased methods that include other context information into our framework. This extension will enable more comprehensive and biologically informed learning, ultimately improving the robustness and generalizability of our approach. Further validation on a more diverse and representative dataset will also be conducted to ensure the reliability of our interpretability framework.
Future work
Our future work will focus on three interconnected directions to address current limitations and advance clinical translation. First, we will expand the framework to incorporate paired TCR α/β chain sequences from public repositories through systematic wet-lab validation, including binding measurements for TCR–peptide pairs, functional T-cell activation assays across HLA alleles, and dose–response profiling to quantify half-maximal effective concentrations. This will be coupled with developing multitask prediction models that jointly estimate binding probabilities, functional avidity, and HLA-specific activation thresholds. Reinforcement learning approaches will be implemented to optimize personalized vaccine design through tumor mutation-guided epitope selection, safety-constrained reward functions, and iterative candidate refinement. We will improve the robustness of the model through improved negative sampling strategies, including autoimmune TCR sequence filtering and tissue-resident TCR background integration.
As computational power and TCR repertoire data continue to grow, we hope that such DL frameworks will become tools for deciphering the fundamental biophysical principles governing TCR recognition, ultimately bridging the gap between computational prediction and mechanistic understanding of immune specificity.
Conclusion
In this study, we introduce a robust and generalizable model that leverages amino acid properties and implicitly models spatial structure during TCR–peptide interactions. TCRoss is particularly effective in classifying the binding of de novo peptides. The introduction of EES for data curation improves performance across all published models, partially mitigating the bias inherent in EN. Although EES and TCRoss address the issue of de novo peptides to some extent, data bias persists, especially in negative samples. We identified this bias as one of the primary challenges in TCR–peptide binding prediction. Looking ahead, we envision that future models will incorporate biological principles and binding mechanisms more directly, particularly multisequence information, while handling data with greater objectivity.
Key Points
T-Cell Receptor CRoss combines physicochemical properties of amino acids and implicit structural modeling to predict T-cell receptor–peptide binding through interaction pairs, improving the prediction accuracy of de novo peptides in zero-shot scenarios.
Reveals that dataset bias severely overestimates model performance, calling for the adoption of unbiased benchmarks to reflect true accuracy. Avoiding direct sequence encoding can effectively suppress bias learning.
Proposes the environmental enhancement strategy, which mitigates dataset bias by introducing environmental information without bias and can be seamlessly applied to any existing model.
High-attention amino acid interaction pairs coincide closely with actual binding sites in PDB structures, validated through leave-one-out tests across TCR3D complexes.
A novel embedding framework explicitly models physicochemical properties, outperforming fused representations and identifying critical attributes for binding prediction.
Supplementary Material
Contributor Information
Luming Yang, Photogrammetric Computer Vision Lab., The Ohio State University, 2070 Neil Ave, Columbus, OH 43210, United States.
Haoxian Liu, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Kowloon, Hong Kong SAR, China.
Alec Calanche, The Ohio State University Wexner Medical Center, 460 W 12th Avenue, Columbus, OH 43210, United States.
Sohret M Gokcek, The Ohio State University Wexner Medical Center, 460 W 12th Avenue, Columbus, OH 43210, United States.
Vishal Singh, The Ohio State University Wexner Medical Center, 460 W 12th Avenue, Columbus, OH 43210, United States.
Nicholas Sansoterra, Photogrammetric Computer Vision Lab., The Ohio State University, 2070 Neil Ave, Columbus, OH 43210, United States.
Munir Akkaya, The Ohio State University Wexner Medical Center, 460 W 12th Avenue, Columbus, OH 43210, United States.
Billur Akkaya, The Ohio State University Wexner Medical Center, 460 W 12th Avenue, Columbus, OH 43210, United States.
Alper Yilmaz, Photogrammetric Computer Vision Lab., The Ohio State University, 2070 Neil Ave, Columbus, OH 43210, United States.
Author contributions
L.Y. designed the methodology, implemented the model, and wrote the original manuscript. H.L. performed data curation and validation experiments. A.C., S.M.G., and V.S. conducted wet-lab validation assays and data analysis. N.S. performed benchmarking studies. M.A. contributed methodology, supervision, and validation. B.A. and A.Y. jointly conceived the study, supervised the project, secured funding, and critically revised the manuscript. All authors reviewed and approved the final manuscript.
Conflict of interest: None declared.
Funding
None declared.
Data availability
The TCR–peptide interaction data analyzed in this study are publicly available in the IEDB (https://www.iedb.org/) and VDJdb (https://vdjdb.cdr3.net/). The dataset TCRC-200k and all other augmented data generated during this study and preprocessing scripts have been deposited in the Anonymous GitHub repository at TCRoss and will be permanently available through Zenodo upon publication. The source code and trained models for TCRoss are available in the GitHub repository at https://github.com/skylynf/TCRoss under an academic use license. Custom code for feature embedding implementation can be found in the “backbone/” directory of the repository.
References
- 1. Klein L, Kyewski B, Allen PM. et al. Positive and negative selection of the T cell repertoire: what thymocytes see (and don’t see). Nat Rev Immunol 2014;14:377–91. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Shah K, Al-Haidari A, Sun J. et al. T cell receptor (TCR) signaling in health and disease. Signal Transduct Target Ther 2021;6:412. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Schumacher TN, Schreiber RD. Neoantigens in cancer immunotherapy. Science 2015;348:69–74. [DOI] [PubMed] [Google Scholar]
- 4. Linette GP, Carreno BM. Neoantigen vaccines pass the immunogenicity test. Trends Mol Med 2017;23:869–71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Ott PA, Zhuting H, Keskin DB. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 2017;547:217–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Kiyotani K, Toyoshima Y, Nakamura Y. Personalized immunotherapy in cancer precision medicine. Cancer Biol Med 2021;18:955. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Michael T Bethune and Alok V Joglekar. Personalized T cell-mediated cancer immunotherapy: progress and challenges. Curr Opin Biotechnol 2017;48:142–52. [DOI] [PubMed] [Google Scholar]
- 8. Wucherpfennig KW, Allen PM, Celada F. et al. Polyspecificity of T cell and B cell receptor recognition. In: Seminars in Immunology, Vol. 19, Elsevier, Amsterdam, 2007, 216–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Tianshi L, Zhang Z, Zhu J. et al. Deep learning-based prediction of the T cell receptor–antigen binding specificity. Nat Mach Intell 2021;3:864–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Zhang S-Q, Ma K-Y, Schonnesen AA. et al. High-throughput determination of the antigen specificities of T cell receptors in single cells. Nat Biotechnol 2018;36:1156–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Kula T, Dezfulian MH, Wang CI. et al. T-scan: a genome-wide method for the systematic discovery of T cell epitopes. Cell 2019;178:1016–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Weber A, Pélissier A, Martínez MR. T-cell receptor binding prediction: a machine learning revolution. ImmunoInformatics 2024;100040. [Google Scholar]
- 13. Zhang Y, Yang X, Zhang Y. et al. Tools for fundamental analysis functions of TCR repertoires: a systematic comparison. Brief Bioinform 2020;21:1706–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Feng X, Huo M, Li H. et al. A comprehensive benchmarking for evaluating TCR embeddings in modeling TCR-epitope interactions. Brief Bioinform 2025;26:bbaf030. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Grazioli F, Mösch A, Machart P. et al. On TCR binding predictors failing to generalize to unseen peptides. Front Immunol 2022;13:1014256. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Moris P, De Pauw J, Postovskaya A. et al. Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification. Brief Bioinform 2021;22:bbaa318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Reddy ST. The patterns of T-cell target recognition. Nature 2017;547:36–8. [DOI] [PubMed] [Google Scholar]
- 18. Christopher Garcia K, Degano M, Pease LR. et al. Structural basis of plasticity in T cell receptor recognition of a self peptide-MHC antigen. Science 1998;279:1166–72. [DOI] [PubMed] [Google Scholar]
- 19. Heather JM, Ismail M, Oakes T. et al. High-throughput sequencing of the T-cell receptor repertoire: pitfalls and opportunities. Brief Bioinform 2018;19:554–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Huseby ES, Crawford F, White J. et al. Interface-disrupting amino acids establish specificity between T cell receptors and complexes of major histocompatibility complex and peptide. Nat Immunol 2006;7:1191–9. [DOI] [PubMed] [Google Scholar]
- 21. Marrack P, Scott-Browne JP, Dai S. et al. Evolutionarily conserved amino acids that control TCR-MHC interaction. Annu Rev Immunol 2008;26:171–203.
- 22. Li R, Altan M, Reuben A. et al. A novel statistical method for decontaminating T-cell receptor sequencing data. Brief Bioinform 2023;24:bbad230.
- 23. Chen J, Zhao B, Lin S. et al. TEPCAM: prediction of T-cell receptor–epitope binding specificity via interpretable deep learning. Protein Sci 2024;33:e4841.
- 24. Meysman P, Barton J, Bravi B. et al. Benchmarking solutions to the T-cell receptor epitope prediction problem: IMMREP22 workshop report. ImmunoInformatics 2023;9:100024.
- 25. Gao Y, Gao Y, Fan Y. et al. Pan-peptide meta learning for T-cell receptor–antigen binding recognition. Nat Mach Intell 2023;5:236–49.
- 26. Shugay M, Bagaev DV, Zvyagin IV. et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res 2018;46:D419–27.
- 27. Tickotsky N, Sagiv T, Prilusky J. et al. McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics 2017;33:2924–9.
- 28. Vita R, Mahajan S, Overton JA. et al. The immune epitope database (IEDB): 2018 update. Nucleic Acids Res 2019;47:D339–43.
- 29. Zheng GXY, Terry JM, Belgrader P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8:14049.
- 30. Mason D. A very high level of crossreactivity is an essential feature of the T-cell receptor. Immunol Today 1998;19:395–404.
- 31. Dean J, Emerson RO, Vignali M. et al. Annotation of pseudogenic gene segments by massively parallel sequencing of rearranged lymphocyte receptor loci. Genome Med 2015;7:1–8.
- 32. Ostmeyer J, Christley S, Toby IT. et al. Biophysicochemical motifs in T-cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocyte and adjacent healthy tissue. Cancer Res 2019;79:1671–80.
- 33. Jolliffe IT. Generalizations and adaptations of principal component analysis. In: Principal Component Analysis, Springer, New York, 2002, 373–405.
- 34. Springer I, Tickotsky N, Louzoun Y. Contribution of T cell receptor alpha and beta CDR3, MHC typing, V and J genes to peptide binding prediction. Front Immunol 2021;12:664514.
- 35. Cai M, Bang S, Zhang P. et al. ATM-TCR: TCR-epitope binding affinity prediction using a multi-head self-attention model. Front Immunol 2022;13:893247.
- 36. Jiang Y, Huo M, Li SC. TEINet: a deep learning framework for prediction of TCR–epitope binding specificity. Brief Bioinform 2023;24:bbad086.
- 37. Akkaya M, Al Souz J, Williams D. et al. Illuminating T cell-dendritic cell interactions in vivo by flashing antigens. Elife 2024;12:RP91809.
- 38. McDermott M, Hansen LH, Zhang H. et al. A closer look at AUROC and AUPRC under class imbalance. arXiv preprint arXiv:2401.06091, 2024. https://arxiv.org/pdf/2401.06091 (accessed November 14, 2025).
- 39. Gowthaman R, Pierce BG. TCR3d: the T cell receptor structural repertoire database. Bioinformatics 2019;35:5323–5.
- 40. Sehnal D, Bittrich S, Deshpande M. et al. Mol* viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res 2021;49:W431–7.
- 41. Berman HM, Westbrook J, Feng Z. et al. The protein data bank. Nucleic Acids Res 2000;28:235–42.
- 42. Hoare HL, Sullivan LC, Pietra G. et al. Structural basis for a major histocompatibility complex class Ib-restricted T cell response. Nat Immunol 2006;7:256–64.
- 43. Stewart-Jones GBE, McMichael AJ, Bell JI. et al. A structural basis for immunodominant human T cell receptor recognition. Nat Immunol 2003;4:657–63.
- 44. Zhang M, Wei X, Luo L. et al. Identification and affinity enhancement of T-cell receptor targeting a KRASG12V cancer neoantigen. Commun Biol 2024;7:512.
- 45. Jones LL, Colf LA, Stone JD. et al. Distinct CDR3 conformations in TCRs determine the level of cross-reactivity for diverse antigens, but not the docking orientation. J Immunol 2008;181:6255–64.
- 46. Jiang M, Yu Z, Lan X. VitTCR: a deep learning method for peptide recognition prediction. iScience 2024;27:109770.
- 47. Ji H, Wang X-X, Zhang Q. et al. Predicting TCR sequences for unseen antigen epitopes using structural and sequence features. Brief Bioinform 2024;25:bbae210.
- 48. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. In: Guyon I, von Luxburg U, Bengio S. et al. (eds.), Advances in Neural Information Processing Systems 30 (NIPS 2017), Curran Associates, Inc., Red Hook, NY, USA, 2017.
- 49. Beyer L, Zhai X, Kolesnikov A. Better plain ViT baselines for ImageNet-1k. arXiv preprint arXiv:2205.01580, 2022. https://arxiv.org/pdf/2205.01580 (accessed November 14, 2025).
- 50. Wang S, Li BZ, Khabsa M. et al. Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020. https://arxiv.org/pdf/2006.04768 (accessed November 14, 2025).
- 51. Chen C-FR, Fan Q, Panda R. CrossViT: cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 357–66.
- 52. Bonetta R, Valentino G. Machine learning techniques for protein function prediction. Proteins: Struct Funct Bioinf 2020;88:397–413.
- 53. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.
- 54. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
- 55. Gligorijević V, Renfrew PD, Kosciolek T. et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021;12:3168.
- 56. Mariuzza RA, Agnihotri P, Orban J. The structural basis of T-cell receptor (TCR) activation: an enduring enigma. J Biol Chem 2020;295:914–25.
- 57. Bradley P. Structure-based prediction of T cell receptor:peptide-MHC interactions. Elife 2023;12:e82813.
- 58. Lin V, Cheung M, Gowthaman R. et al. TCR3d 2.0: expanding the T cell receptor structure database with new structures, tools and interactions. Nucleic Acids Res 2024;gkae840.
- 59. Wu KE, Yost K, Daniel B. et al. TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses. In: Machine Learning in Computational Biology, PMLR, 2024, 194–229.
- 60. Korpela D, Jokinen E, Dumitrescu A. et al. EPIC-TRACE: predicting TCR binding to unseen epitopes using attention and contextualized embeddings. Bioinformatics 2023;39:btad743.
- 61. Myronov A, Mazzocco G, Król P. et al. BERTrand—peptide:TCR binding prediction using bidirectional encoder representations from transformers augmented with random TCR pairing. Bioinformatics 2023;39:btad468.
- 62. Weber A, Born J, Martínez MR. TITAN: T-cell receptor specificity prediction with bimodal attention networks. Bioinformatics 2021;37:i237–44.
- 63. Montemurro A, Schuster V, Povlsen HR. et al. NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRα and β sequence data. Commun Biol 2021;4:1060.
- 64. Zhang J, Ma W, Yao H. Accurate TCR-pMHC interaction prediction using a BERT-based transfer learning method. Brief Bioinform 2024;25:bbad436.
- 65. Dosovitskiy A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. https://arxiv.org/pdf/2010.11929 (accessed November 14, 2025).
- 66. Han K, Wang Y, Chen H. et al. A survey on vision transformer. IEEE Trans Pattern Anal Mach Intell 2022;45:87–110.
- 67. Perera S, Navard P, Yilmaz A. SegFormer3D: an efficient transformer for 3D medical image segmentation. arXiv preprint arXiv:2402.12345, 2024. https://arxiv.org/pdf/2402.12345 (accessed November 14, 2025).
- 68. Bao Y, Sivanandan S, Karaletsos T. Channel vision transformers: an image is worth C x 16 x 16 words. arXiv preprint arXiv:2309.16108, 2023. https://arxiv.org/pdf/2309.16108 (accessed November 14, 2025).
- 69. Moor M, Banerjee O, Abad ZSH. et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259–65.
- 70. Sidhom J-W, Larman HB, Pardoll DM. et al. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nat Commun 2021;12:1605.
- 71. Jensen MF, Nielsen M. NetTCR 2.2: improved TCR specificity predictions by combining pan- and peptide-specific training strategies, loss-scaling and integration of sequence similarity. bioRxiv, 2023.
- 72. Montemurro A, Jessen LE, Nielsen M. NetTCR-2.1: lessons and guidance on how to develop models for TCR specificity predictions. Front Immunol 2022;13:1055151.
- 73. Luu AM, Leistico JR, Miller T. et al. Predicting TCR-epitope binding specificity using deep metric learning and multimodal learning. Genes 2021;12:572.
- 74. Oakes T, Heather JM, Best K. et al. Quantitative characterization of the T cell receptor repertoire of naïve and memory subsets using an integrated experimental and computational pipeline which is robust, economical, and versatile. Front Immunol 2017;8:1267.
- 75. Pham M-DN, Nguyen T-N, Tran LS. et al. epiTCR: a highly sensitive predictor for TCR–peptide binding. Bioinformatics 2023;39:btad284.
- 76. Pham M-DN, Su CT-T, Nguyen T-N. et al. epiTCR-KDA: knowledge distillation model on dihedral angles for TCR-peptide prediction. bioRxiv, 2024.
- 77. Pertseva M, Follonier O, Scarcella D. et al. TCR clustering by contrastive learning on antigen specificity. Brief Bioinform 2024;25:bbae375.
- 78. Fast E, Dhar M, Chen B. Tapir: a T-cell receptor language model for predicting rare and novel targets. bioRxiv, 2023. https://www.biorxiv.org/content/10.1101/2023.09.12.557285v1 (accessed November 14, 2025).
- 79. Gielis S, Moris P, De Neuter N. et al. TCRex: a webtool for the prediction of T-cell receptor sequence epitope specificity. bioRxiv, 2018;373472. https://www.biorxiv.org/content/10.1101/373472v1 (accessed November 14, 2025).
- 80. Knapp B, Demharter S, Esmaielbeiki R. et al. Current status and future challenges in T-cell receptor/peptide/MHC molecular dynamics simulations. Brief Bioinform 2015;16:1035–44.
Associated Data
Supplementary Materials
Data Availability Statement
The TCR–peptide interaction data analyzed in this study are publicly available in the IEDB (https://www.iedb.org/) and VDJdb (https://vdjdb.cdr3.net/). The TCRC-200k dataset, all other augmented data generated during this study, and the preprocessing scripts have been deposited in the Anonymous GitHub repository at TCRoss and will be made permanently available through Zenodo upon publication. The source code and trained models for TCRoss are available in the GitHub repository at https://github.com/skylynf/TCRoss under an academic-use license. Custom code for the feature-embedding implementation can be found in the "backbone/" directory of the repository.