Abstract
Paired and single chain T cell receptor (TCR) sequencing are now commonly used techniques for interrogating adaptive immune responses. TCRs targeting the same epitope frequently share motifs consisting of critical contact residues. Here we illustrate the key features of tcrdist3, a new Python package for distance-based TCR analysis, through a series of three interactive examples. In the first example, we illustrate how tcrdist3 can integrate sequence similarity networks, gene-usage plots, and background-adjusted CDR3 logos to identify TCR receptor sequence features conferring antigen specificity among sets of peptide-MHC-multimer sorted receptors. In the second example, we show how the TCRjoin feature in tcrdist3 can be used to flexibly query receptor sequences of interest against bulk repertoires or libraries of previously annotated TCRs based on matching of similar sequences. In the third example, we show how the TCRdist metric can be leveraged to identify candidate polyclonal receptors under antigenic selection in bulk repertoires based on sequence neighbor-enrichment testing, a statistical approach similar to TCRNET and ALICE algorithms, but with added flexibility in how the neighborhood can be defined.
Keywords: T Cell Receptors, TCR, distance-based learning, epitope-specificity, bioinformatics, python
1. Introduction
The complementarity determining regions (CDRs) of a T cell receptor (TCR) determine its binding affinity with peptide-MHC (pMHC) complexes [1]. Previously, Dash et al. (2017) introduced TCRdist, a weighted multi-CDR distance metric that enabled grouping of paired αβ TCRs by antigen specificity based on their sequence similarity [2]. Differences in the amino-acid sequences of the aligned CDRs are totaled based on the number of gaps (−4) and their BLOSUM62 substitution penalties (ranging from 0 to −4), with an additional 3-fold weighting on CDR3 substitutions. For instance, with the 3-fold weighting on the CDR3, two amino-acid mismatches in the CDR3 between two TCRs with identical V-genes result in a maximal distance of 24 TCRdist units (tdus); however, TCRdist can vary considerably among TCRs with identical edit distance, depending on the nature of the non-aligned residues. The TCRdist metric can be computed between a pair of single-chain receptors or combined into a paired-chain distance.
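The arithmetic behind the 24-unit figure can be checked with a short sketch. It assumes only what the paragraph above states (a maximum substitution penalty of 4 and a 3-fold CDR3 weight); the constant and function names are illustrative and are not part of the tcrdist3 API.

```python
# Illustrative constants taken from the text; these names are hypothetical
# and do not come from the tcrdist3 package.
MAX_SUBSTITUTION_PENALTY = 4  # largest penalty for a single substitution
CDR3_WEIGHT = 3               # 3-fold weighting applied to the CDR3


def max_cdr3_distance(n_mismatches):
    """Upper bound on the CDR3 contribution for a number of substitutions."""
    return n_mismatches * MAX_SUBSTITUTION_PENALTY * CDR3_WEIGHT


# Two mismatches at the maximum penalty: 2 * 4 * 3
print(max_cdr3_distance(2))
```

Because most substitutions carry a penalty below the maximum, two receptors that differ at two CDR3 positions will often be closer than this upper bound.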
In this chapter, we describe some of the features of tcrdist3, a new open-source Python software package [3]. This package provides an interface to compute the TCRdist metric, and it brings new flexibility to distance-based repertoire analysis, allowing customization of the distance metric and at-scale computation with sparse data representations and parallelized, byte-compiled code.
Because tcrdist3 can be used for pairwise distance computation between single-chain or paired-chain TCRs, we anticipate users may find the tcrdist3 Python package useful for examining TCR sequence features conferring antigen specificity. It can also be useful for characterizing the potential specificities of an unannotated TCR repertoire with a library of previously annotated TCR sequences; with a distance-based approach, matching of TCRs can be based on sequence similarity as opposed to requiring exact sequence matching (Table 1).
Table 1:
Previously annotated TCR epitopes neighboring (TCRdist <= 24) a TCR in the bulk repertoire with a statistically enriched number of network edges.
| Epitope Species | HLA | MHC | N |
|---|---|---|---|
| CMV | HLA-A*02:01 | MHCI | 6 |
| CMV | HLA-B*44:03:08 | MHCI | 6 |
| EBV | HLA-A*02:01 | MHCI | 5 |
| EBV | HLA-B*08:01 | MHCI | 5 |
| HCV | HLA-A*01:01 | MHCI | 2 |
| HCV | HLA-A*02 | MHCI | 2 |
| HCV | HLA-A*02:01 | MHCI | 1 |
| HCV | HLA-B*08:01 | MHCI | 1 |
| HIV-1 | HLA-A*02:01 | MHCI | 1 |
| HIV-1 | HLA-A*03 | MHCI | 2 |
| HIV-1 | HLA-B*07:02 | MHCI | 1 |
| HIV-1 | HLA-B*08 | MHCI | 1 |
| HIV-1 | HLA-B*27:05 | MHCI | 2 |
| HIV-1 | HLA-B*51:01 | MHCI | 5 |
| HIV-1 | HLA-B*51:193 | MHCI | 1 |
| HomoSapiens | HLA-A*02:01 | MHCI | 4 |
| HomoSapiens | HLA-DRA*01:01 | MHCII | 3 |
| InfluenzaA | HLA-A*02 | MHCI | 28 |
| InfluenzaA | HLA-A*02:01 | MHCI | 30 |
| InfluenzaA | HLA-A*02:01:48 | MHCI | 3 |
| InfluenzaA | HLA-A*68:01 | MHCI | 4 |
| InfluenzaA | HLA-DRA*01:02:03 | MHCII | 2 |
| M.tuberculosis | HLA-DQA1*01:02 | MHCII | 1 |
| YFV | HLA-A*02 | MHCI | 4 |
This chapter includes practical applications that illustrate the flexibility of the tcrdist3 package. Because tcrdist3 is a collection of documented Python functions, objects, and methods, it is designed to be written into Python scripts or run interactively with IPython or Jupyter notebooks. Thus, tcrdist3 is best suited for users with some prior knowledge of Python. The package makes use of common Python data formats (i.e., standard dictionaries, numpy.ndarray, pandas.DataFrame, and scipy.sparse.csr_matrix) that allow the user to build workflows seamlessly with tcrdist3 and other common Python packages such as matplotlib, networkx, and scikit-learn. Primary results can also be output as tab-delimited files (.tsv), which can be evaluated in the user’s preferred software.
The structure of this chapter is as follows. First, we show how tcrdist3 can be used to examine shared sequence features of antigen-specific TCRs in a paired-chain SARS-CoV-2 epitope-specific dataset recently generated by Minervina and Pogorelyy et al. 2021 [4]. Second, we consider how tcrdist3 can be applied to find potentially interesting clusters of similar TCRs within an unannotated set of repertoires from COVID-19 patients. Next, we consider how tcrdist3 can be applied directly to bulk repertoires to identify the clones most likely to be under antigenic selection. Appendices provide additional technical details.
2. Materials
2.1. Installation of tcrdist3
Tcrdist3 is a cross-platform Python package that can be used in Linux and OSX environments, with dependencies on other packages commonly used in scientific analysis. On Windows, it can be used through remote Jupyter notebooks, for example Google’s Colab environment (https://colab.research.google.com/), which is a fast way to explore and use tcrdist3 interactively without a local installation. Tcrdist3 is pip installable either from the Python Package Index (PyPI) or directly from GitHub. Tcrdist3’s dependencies require users to have an appropriate C compiler. More practical guidance on installation is provided on the documentation page (https://tcrdist3.readthedocs.io/). The examples in this chapter were run using version 0.2.2 of the tcrdist3 package. To get the most out of this chapter, we recommend running the scripts in this tutorial as you review the examples in the Methods. Code snippets may be copied and pasted into an interactive IPython session. The tcrdist3 package has been tested in a Python 3.8 environment installed on systems running OSX and Ubuntu.
The 0.2.2 version of tcrdist3 can be installed within a Python environment with a single pip install command.
pip install tcrdist3==0.2.2
Alternatively, specific versions of tcrdist3 can be installed directly from the package’s GitHub repository.
pip install git+https://github.com/kmayerb/tcrdist3.git@0.2.2
2.2. Sub-packages
Upon installation using pip, multiple sub-packages written by tcrdist3’s authors will automatically be installed. These include palmotif for CDR3 logo generation, tcrsampler for generating V-J gene matched background receptor sets, and pwseqdist for efficient and parallelizable pairwise distance computation.
2.3. Package Data
The package tcrsampler makes use of background TCR repertoires from human and mouse selected from previous studies [5–7] or synthesized using probabilistic models trained on non-productive sequences [8, 9]. After the package tcrsampler is installed, which occurs automatically when pip installing tcrdist3, pre-formatted background sequence files can be subsequently downloaded and added to the package source files with the following commands. This needs only be done once.
from tcrsampler.setup_db import install_all_next_gen
install_all_next_gen(dry_run = False)
3. Methods
We illustrate approaches for using tcrdist3 through a series of three examples.
3.1. Example 1: Examining Epitope-Specific Paired αβ TCR Data
3.1.1. TCR Data
Rep-Seq data is produced using several technologies (e.g., RACE-5, Adaptive Biotechnologies, and 10X) and gene-calling tools (e.g., MIXCR, MIGEC, ImmunoSeq, Cell Ranger). As a result of the heterogeneity in RepSeq formats, the first step towards using the tcrdist3 package successfully is to correctly identify and rename the required input columns. This is illustrated below by example. (Appendix C provides more specific guidance for preprocessing files when working with Adaptive ImmunoSeq and AIRR-formatted files). The data used here is taken directly from a recent preprint manuscript of Minervina and Pogorelyy et al. 2021 SI Table 6: Unique epitope-specific CD8+ αβ TCR clonotypes [4]. We focus on TCR chains found in 18 subjects recovered from 6 of the most frequently identified DNA-labeled peptide:MHC dextramers.
3.1.2. Data Preprocessing
The data generated by Minervina and Pogorelyy et al., 2021 contains the following columns: 'cdr3b', 'cdr3b_nt', 'vb', 'jb', 'cdr3a', 'cdr3a_nt', 'va', 'ja', 'clonotype_id', 'Degree', 'cl_120', 'cl_120_members', 'donor', 'epitope', 'specificity', and 'category'. After loading the data into a Pandas DataFrame, we rename the critical columns to match the column names expected by tcrdist3.
Tcrdist3 requires only 3 input columns for single-chain analysis (i.e., cdr3_b_aa, v_b_gene, and j_b_gene for the beta chain) and 6 columns for paired-chain analysis (i.e., cdr3_b_aa, v_b_gene, j_b_gene, cdr3_a_aa, v_a_gene, and j_a_gene). More columns can be included depending on the application.
An optional count column tracks the abundance of each clone. If no count column is provided, all clones are assigned a count of 1. Additional columns can be included if the user intends to distinguish input rows with donor information or epitope specificity annotations. The names of these columns are not pre-defined (e.g., subject, cell_type, visit, epitope).
By default (deduplicate = True), tcrdist3 uses all columns of the DataFrame passed to the cell_df argument to look for duplicated rows. The initialization of a TCRrep instance automatically aggregates counts over duplicated rows.
This has practical consequences. For instance, if no subject column is included, identical clones from two or more individuals will be combined into a single row.
If any columns have missing values, the corresponding row containing the missing value is excluded. Thus, do not include columns that have missing values. If you wish to retain every clonotype, adding an index column or the nucleotide sequence will prevent rows with identical amino acid sequences from being merged.
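The consequences of deduplication can be illustrated with a standard-library sketch. This is not tcrdist3's implementation; it only mimics the behavior described above (rows with a missing value are dropped, and counts are summed over rows that are identical in every supplied column). The sequences and the helper name aggregate_clones are made up for illustration.

```python
from collections import defaultdict

# Toy input rows; None marks a missing value. Sequences are illustrative only.
rows = [
    {"cdr3_b_aa": "CASSLGQAYEQYF", "v_b_gene": "TRBV7-9*01", "subject": "s1", "count": 1},
    {"cdr3_b_aa": "CASSLGQAYEQYF", "v_b_gene": "TRBV7-9*01", "subject": "s1", "count": 2},
    {"cdr3_b_aa": "CASSLGQAYEQYF", "v_b_gene": "TRBV7-9*01", "subject": "s2", "count": 1},
    {"cdr3_b_aa": "CASSIRSSYEQYF", "v_b_gene": None, "subject": "s2", "count": 1},
]


def aggregate_clones(rows, key_columns):
    """Drop rows with a missing key value, then sum counts over identical keys."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if any(value is None for value in key):
            continue  # mimics the exclusion of rows with missing values
        totals[key] += row["count"]
    return dict(totals)


# Keeping the subject column preserves one clone per individual.
with_subject = aggregate_clones(rows, ["cdr3_b_aa", "v_b_gene", "subject"])
# Omitting it merges the identical clone found in subjects s1 and s2.
without_subject = aggregate_clones(rows, ["cdr3_b_aa", "v_b_gene"])

print(len(with_subject), len(without_subject))
```

In this toy input, the row with the missing V gene is excluded in both cases, which is why only the relevant, fully populated columns should be supplied.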
# Check that file for example 1 is available. If not, download it directly from GitHub
import os
f = 'clonotypes_minervina.tsv'
if not os.path.isfile(f):
    os.system('wget https://raw.githubusercontent.com/kmayerb/tcrdist3_book_chapter/main/data/clonotypes_minervina.tsv')
# Make a folder 'data' where some outputs will be written
path = 'data'
if not os.path.isdir(path):
    os.mkdir(path)
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
# Load Pandas DataFrame from SI Table 6 (Minervina and Pogorelyy et al., 2021)
f = 'clonotypes_minervina.tsv'
df = pd.read_csv(f, sep = "\t")
# Subset to top 6 epitopes.
list_of_epitopes = ['A01_TTD', 'A02_YLQ', 'A01_LTD',
                    'B15_NQK', 'A01_FTS', 'A24_NYN']
df = df[df['epitope'].isin(list_of_epitopes)].reset_index(drop = True)
# Rename columns.
df = df.rename(columns = {
    'cdr3b':'cdr3_b_aa',
    'vb'   :'v_b_gene',
    'jb'   :'j_b_gene',
    'cdr3a':'cdr3_a_aa',
    'va'   :'v_a_gene',
    'ja'   :'j_a_gene',
    'donor':'subject'} )
# Preview the first 2 lines of DataFrame.
df.head(2)
Minervina and Pogorelyy et al., 2021 did not provide allele-level resolution in the V and J gene calls (e.g., TRBV5-1 vs. TRBV5-1*01). Because tcrdist3 infers CDR1, CDR2, and CDR2.5 from the full gene and allele name, we must append an allele designation when it is unresolved in the input data. We will also assign a 'count' of 1 to each clonotype in the table.
# Add *01 allele level designation.
df['v_a_gene'] = df['v_a_gene'].apply(lambda x: f"{x}*01")
df['v_b_gene'] = df['v_b_gene'].apply(lambda x: f"{x}*01")
df['j_a_gene'] = df['j_a_gene'].apply(lambda x: f"{x}*01")
df['j_b_gene'] = df['j_b_gene'].apply(lambda x: f"{x}*01")
df['count'] = 1
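When an input file mixes gene calls with and without allele information, appending *01 unconditionally would produce malformed names such as TRBV5-1*02*01. A minimal guard is sketched below; the helper add_default_allele is hypothetical and not part of tcrdist3.

```python
def add_default_allele(gene, default="*01"):
    """Append a default allele only when the gene name lacks one.

    Hypothetical helper for illustration; not a tcrdist3 function.
    """
    return gene if "*" in gene else f"{gene}{default}"


# Unresolved calls gain the default allele; resolved calls are left as-is.
print(add_default_allele("TRBV5-1"))
print(add_default_allele("TRBV5-1*02"))
```

Such a function could be passed to the same .apply() calls used above, for example df['v_b_gene'].apply(add_default_allele).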
3.1.3. Loading Data into a TCRrep Instance
Once the data is properly formatted, the next step is to connect the data to an instance of the TCRrep class. The header of almost all scripts working with tcrdist3 includes the import statement from tcrdist.repertoire import TCRrep. When a TCRrep instance is initialized, the user must specify some key information along with the input data:
organism specifies the appropriate organism. Either the character string ‘human’ or ‘mouse’ must be specified.
chains specifies whether the TCRrep instance will evaluate single-chain or paired-chain data. Provide ['alpha'] or ['beta'] to the chains argument for single-chain analysis. For paired-chain analysis, supply ['alpha', 'beta']. Tcrdist3 also supports ['gamma'], ['delta'], or ['gamma', 'delta'].
The organism and chains arguments ensure the correct lookup when appending CDR1, CDR2, and CDR2.5 sequences to the input cell_df DataFrame. To append these germline-encoded CDR sequences, tcrdist3 must recognize the user-supplied V gene names. The package uses IMGT nomenclature and a library of allele-specific reference genes.
cell_df contains the input TCR data. Only the relevant columns should be passed in the DataFrame to the cell_df argument. This is critical because a NaN (missing value) in any column will result in the corresponding row being removed from the analysis.
If the user wishes to retain clones identical at the amino acid level but with distinct CDR3 nucleotide junctions, the nucleotide sequence or another unique-valued column should be provided in the DataFrame passed to the cell_df argument.
Finally, remember that any row of cell_df with an unrecognized V gene name will be removed from the final clone_df. The lines of cell_df that were not integrated into clone_df can be viewed by calling tr.show_incomplete() after initialization. (Note: Advanced users who wish to add new genes not currently in the tcrdist3 library can do so by modifying the content of the 'alphabeta_gammadelta_db.tsv' file in the package source code (python3.8/site-packages/tcrdist/db/alphabeta_gammadelta_db.tsv).)
Before proceeding, it is also helpful to understand that each TCRrep instance contains two Pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT aligned CDR1, CDR2, and CDR2.5 inferred from the V-gene name. The clone_df is a derivative Pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, the rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows. Also, it is helpful to know that the order of the rows in the clone_df will not match the order in cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure no missing values and no unrecognized V-gene names, may use the deduplicate = False option which will allow the cell_df row order to be directly transferred to the clone_df without any row removal.)
3.1.4. Computation of Pairwise Distance Matrices
We now initialize a TCRrep instance with the following arguments:
from tcrdist.repertoire import TCRrep
tr = TCRrep(cell_df = df[['subject','epitope','cdr3_a_aa','v_a_gene',
                          'j_a_gene','cdr3_b_aa','v_b_gene','j_b_gene',
                          'category','count','cdr3a_nt','cdr3b_nt']],
            organism = 'human',
            chains = ['alpha','beta'],
            deduplicate = True,
            compute_distances = True)
For datasets containing fewer than 10,000 unique clones, the default initialization of a TCRrep instance automatically computes pairwise distances across all CDRs. This typically completes in less than 60 seconds, depending on the size of the input. Automatic computation of pairwise distances can be disabled by setting compute_distances = False. (More details on methods for customizing the TCRdist computation can be found on the TCR Distances subpage of https://tcrdist3.readthedocs.io/).
After initialization, with the default compute_distances argument set to True, the user can immediately access the following pairwise distances as Numpy arrays:
tr.pw_cdr1_a_aa
tr.pw_cdr2_a_aa
tr.pw_pmhc_a_aa (CDR2.5, the pMHC-facing loop between CDR2 and CDR3, which is referred to in tcrdist3 as pmhc_a)
tr.pw_cdr3_a_aa
tr.pw_cdr1_b_aa
tr.pw_cdr2_b_aa
tr.pw_pmhc_b_aa (CDR2.5, the pMHC-facing loop between CDR2 and CDR3, which is referred to in tcrdist3 as pmhc_b)
tr.pw_cdr3_b_aa
And the weighted multi-CDR distance matrices:
tr.pw_alpha
tr.pw_beta
The curious reader can confirm the following to see how the four individual CDR matrices are combined to arrive at a weighted multi-CDR distance.
import numpy as np
np.all(tr.pw_beta == (tr.weights_b['cdr1_b_aa']*tr.pw_cdr1_b_aa +
                      tr.weights_b['cdr2_b_aa']*tr.pw_cdr2_b_aa +
                      tr.weights_b['pmhc_b_aa']*tr.pw_pmhc_b_aa +
                      tr.weights_b['cdr3_b_aa']*tr.pw_cdr3_b_aa))
Since tr.pw_beta and tr.pw_alpha have identical dimensions and a consistent row order (matching the row order of the TCRrep.clone_df), a paired tcrdist can be generated by simply adding the matrices together.
tr.pw_alpha_beta = tr.pw_beta + tr.pw_alpha
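The weighted sum and the paired-chain addition can be traced by hand on a toy example. The sketch below uses plain nested lists with made-up distances for two receptors; the weights (3 for CDR3, 1 for the other loops) are assumed from the 3-fold CDR3 weighting described in the Introduction, and none of the names are tcrdist3 attributes.

```python
# Made-up per-CDR distance matrices for two beta chains (toy values).
pw_cdrs = {
    "cdr1": [[0, 4], [4, 0]],
    "cdr2": [[0, 0], [0, 0]],
    "pmhc": [[0, 2], [2, 0]],
    "cdr3": [[0, 8], [8, 0]],
}
# Assumed weights: 3-fold on the CDR3, 1 on the germline-encoded loops.
weights = {"cdr1": 1, "cdr2": 1, "pmhc": 1, "cdr3": 3}


def weighted_sum(pw_cdrs, weights):
    """Combine per-CDR matrices into one weighted multi-CDR matrix."""
    n = len(next(iter(pw_cdrs.values())))
    return [[sum(weights[k] * pw_cdrs[k][i][j] for k in pw_cdrs)
             for j in range(n)] for i in range(n)]


def add_matrices(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


pw_beta = weighted_sum(pw_cdrs, weights)   # off-diagonal: 4 + 0 + 2 + 3*8
pw_alpha = [[0, 12], [12, 0]]              # made-up alpha-chain distances
pw_alpha_beta = add_matrices(pw_alpha, pw_beta)
print(pw_beta[0][1], pw_alpha_beta[0][1])
```

The element-wise addition is only meaningful because both matrices share the same row order, which is the property of tr.pw_alpha and tr.pw_beta noted above.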
There are additional options that allow customization of the pairwise distance computation or use of alternative metrics such as Smith-Waterman alignment score, aligned Hamming distance, or edit distance. The tcrdist3 project’s documentation page (https://tcrdist3.readthedocs.io/) discusses these options in more detail and Appendix B of this chapter illustrates how to compute pairwise CDR3-only edit distance instead of multi-CDR TCRdist.
3.1.5. Distance Matrices to Networks
One can construct a weighted network with edges connecting similar TCRs from these pairwise distance matrices. The threshold at which to construct edges between clones is flexible, but for paired-chain TCRdist with the default 3-fold CDR3-weight, Minervina and Pogorelyy et al. selected a threshold of 120 TCRdist units to define meaningful edges between clones; we will use this threshold as well. Note that any column in tr.clone_df can also be used to describe a node in a TCR network. It is easy to look up these values since the row index of tr.clone_df matches the row and column order of the pairwise distance matrices:
from tcrdist.public import _neighbors_fixed_radius
# <edge_threshold> is the maximum distance allowed for a network edge.
edge_threshold = 120
# <tr.pw_alpha_beta> is paired chain TCRdist.
tr.pw_alpha_beta = tr.pw_beta + tr.pw_alpha
# <network> initialize a list to populate with edges between TCRs.
network = list()
for i,n in enumerate(_neighbors_fixed_radius(tr.pw_alpha_beta, edge_threshold)):
    for j in n:
        if i != j:
            network.append((
                i,                                 # 'node_1' - row index
                j,                                 # 'node_2' - column index
                (tr.pw_alpha_beta)[i,j],           # 'dist' - gets the distance between TCR(i,j)
                tr.clone_df['v_b_gene'].iloc[i],   # 'v_b_gene_1' - v beta gene of clone i
                tr.clone_df['v_b_gene'].iloc[j],   # 'v_b_gene_2' - v beta gene of clone j
                tr.clone_df['cdr3_b_aa'].iloc[i],  # 'cdr3_b_aa_1' - cdr3 beta of clone i
                tr.clone_df['cdr3_b_aa'].iloc[j],  # 'cdr3_b_aa_2' - cdr3 beta of clone j
                tr.clone_df['subject'].iloc[i],    # 'subject_1' - subject of clone i
                tr.clone_df['subject'].iloc[j],    # 'subject_2' - subject of clone j
                tr.clone_df['epitope'].iloc[i],    # 'epitope_1' - epitope associated with clone i
                tr.clone_df['epitope'].iloc[j],    # 'epitope_2' - epitope associated with clone j
                len(n)-1))                         # 'K_neighbors' - number of neighbors
cols = ['node_1', 'node_2', 'dist', 'v_b_gene_1', 'v_b_gene_2',
        'cdr3_b_aa_1','cdr3_b_aa_2', 'subject_1','subject_2',
        'epitope_1','epitope_2', 'K_neighbors']
# Store the <network> edge list as a DataFrame.
df_net = pd.DataFrame(network, columns = cols)
# Option to write the edge list to a file for use in Gephi, Cytoscape, R, etc.
outfile = os.path.join(path, f"{f}_paired_TCRdist_{edge_threshold}_network.csv")
df_net.to_csv(outfile, sep = ",", index = False)
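For readers unfamiliar with fixed-radius neighbor search, the idea can be sketched without tcrdist3. The function below is a conceptual stand-in written for this illustration; it assumes that a neighbor is any clone at a distance less than or equal to the radius and that each clone therefore counts as its own neighbor, which is consistent with the i != j check and the len(n)-1 neighbor count in the code above.

```python
def neighbors_within_radius(pw, radius):
    """For each row, list the column indices whose distance is <= radius.

    Conceptual stand-in for illustration; not the tcrdist3 implementation.
    """
    return [[j for j, dist in enumerate(row) if dist <= radius] for row in pw]


# Made-up paired-chain distances among three clones.
pw_toy = [
    [0, 100, 200],
    [100, 0, 130],
    [200, 130, 0],
]

neighbor_lists = neighbors_within_radius(pw_toy, radius=120)

# Build the edge list, skipping self-matches on the diagonal.
edges = [(i, j, pw_toy[i][j])
         for i, neighbors in enumerate(neighbor_lists)
         for j in neighbors if i != j]
print(neighbor_lists)
print(edges)
```

Each undirected edge appears twice, once as (i, j) and once as (j, i), which is also true of the edge list built from the real data.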
Once all the network edges are represented as unique rows in the df_net DataFrame, we can add columns to designate edges that unite nodes that come from different donors (i.e., public). We can further identify those edges that link TCRs recognizing the same epitope (i.e., consistent edges). A large fraction of inconsistent edges may indicate an overly permissive edge threshold. We can further assign a weight proportional to similarity by subtracting the pairwise distance from the edge threshold and dividing by the edge threshold. More similar TCRs thus receive higher edge weights, with weights bounded between 0 and 1.
df_net['public'] = df_net.apply(lambda x : x['subject_1'] != x['subject_2'], axis = 1)
df_net['consistent'] = df_net.apply(lambda x : x['epitope_1'] == x['epitope_2'], axis = 1)
df_net['weight'] = (edge_threshold - df_net['dist'])/edge_threshold
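The edge-weight formula is easy to check at its extremes. The short sketch below restates it as a standalone function (the name edge_weight is introduced here for illustration only).

```python
def edge_weight(dist, edge_threshold):
    """Similarity weight for an edge: 1 for identical TCRs, 0 at the threshold."""
    return (edge_threshold - dist) / edge_threshold


# Identical receptors, a moderately similar pair, and a pair at the threshold.
for dist in (0, 30, 120):
    print(dist, edge_weight(dist, edge_threshold=120))
```

Because only pairs within the threshold become edges, every weight in the edge list falls between 0 and 1.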
3.1.6. Visualization of TCRdist Networks
Using the Python networkx package, one can visualize the resulting network within an interactive session. (Note that df_net was also written to a .csv file so it could also be loaded into a GUI network visualization software like Gephi or Cytoscape). Below we have colored the nodes by their associated epitope annotations. We can see that with a threshold of 120 TCRdist units most of the edges are consistent and join two TCRs binding the same peptide:MHC (pMHC) dextramer.
import networkx as nx
from tcrdist.html_colors import get_html_colors
# Optionally, one can limit network edges to those formed only
# between TCRs found in two distinct individuals.
df_net = df_net.query('public == True')
# <G> Initialize a networkx Graph instance from the columns of df_net.
G = nx.from_pandas_edgelist(pd.DataFrame({'source' : df_net['node_1'],
                                          'target' : df_net['node_2'],
                                          'weight' : df_net['weight']}))
# Assign each node a color based on its epitope annotation.
epitopes = ['A01_TTD', 'A02_YLQ', 'A01_LTD', 'B15_NQK', 'A01_FTS', 'A24_NYN']
# Get the same number of colors as unique epitopes.
colors = get_html_colors(len(epitopes))
# Construct a dictionary to look up color by epitope.
color_by_epitope = {epitope: color for epitope, color in zip(epitopes, colors)}
# Assign colors to each node based on its epitope annotation.
node_colors = {node: color_by_epitope.get(epitope) for node, epitope in
               zip(df_net['node_1'], df_net['epitope_1'])}
# Positions for all nodes according to a spring layout.
pos = nx.spring_layout(G, seed=2, k = .15)
# Define aesthetic options.
options = {"edgecolors": "tab:gray", "node_size": 30, "alpha": 0.5}
nx.draw(G,
        nodelist = G.nodes,
        pos = pos,
        node_color = [node_colors[node] for node in G.nodes],
        **options)
The network in Fig. 1 recapitulates a well-recognized feature of TCR biology: biochemically similar TCRs often recognize the same epitope. In the following step, we more closely examine clusters of TCRs annotated to bind the A*02 YLQPRTFLL pMHC multimer (shown in Fig. 1 as the cluster of green nodes).
Fig. 1.

Network formed by epitope-specific TCRs. The edge threshold is 120 TCRdist units. TCRs come from a set annotated with 1 of 6 immunodominant SARS-CoV-2 epitopes identified in Minervina and Pogorelyy et al. (2021). The node colors correspond to A*01 FTSDYYQLY (magenta), A*01 LTDEMIAQY (blue), A*01 TTDPSFLGRY (red), A*02 YLQPRTFLL (green), A*24 NYNYLYRLF (black), and B*15 NQKLIANQF (cyan). The network edges shown are those formed between TCRs in different participants, emphasizing the public nature of the sequence similarity.
3.1.7. Networks Among A*02 YLQ Annotated TCRs
To keep things tidy, we repeat the data import and cleanup steps that we previously introduced above. However, now we subset to focus on TCRs annotated as recognizing the HLA-A*02 YLQPRTFLL (A02_YLQ) epitope in the SARS-CoV-2 spike protein. Note the line of code below containing df = df[ df[‘epitope’].isin(epitopes)].reset_index(drop = True) will select only those TCRs specified in the epitopes list, which we have populated with ‘A02_YLQ’. We leave it as an exercise for the reader to investigate other clusters of TCRs by changing this line in the code.
import os
import pandas as pd
import networkx as nx
import community.community_louvain as community_louvain
from tcrdist.repertoire import TCRrep
from tcrdist.public import _neighbors_fixed_radius
path = 'data'
f = 'clonotypes_minervina.tsv'
# Only the A*02 YLQ epitope will be considered
epitopes = ["A02_YLQ"]
edge_threshold = 120
df = pd.read_csv(f, sep = "\t")
df = df[df['epitope'].isin(epitopes)].reset_index(drop = True)
# Rename columns.
df = df.rename(columns = {
    'cdr3b':'cdr3_b_aa',
    'vb'   :'v_b_gene',
    'jb'   :'j_b_gene',
    'cdr3a':'cdr3_a_aa',
    'va'   :'v_a_gene',
    'ja'   :'j_a_gene',
    'donor':'subject'} )
# Add *01 allele level designation.
df['v_a_gene'] = df['v_a_gene'].apply(lambda x: f"{x}*01")
df['v_b_gene'] = df['v_b_gene'].apply(lambda x: f"{x}*01")
df['j_a_gene'] = df['j_a_gene'].apply(lambda x: f"{x}*01")
df['j_b_gene'] = df['j_b_gene'].apply(lambda x: f"{x}*01")
df['count'] = 1
# <tr> Initialize TCRrep instance.
tr = TCRrep(cell_df = df[['subject','epitope','cdr3_a_aa',
                          'v_a_gene','j_a_gene','cdr3_b_aa',
                          'v_b_gene','j_b_gene','category',
                          'count','cdr3a_nt','cdr3b_nt']],
            organism = 'human',
            chains = ['alpha','beta'],
            deduplicate = True,
            compute_distances = True)
# Identify edges among A*02 YLQ annotated TCRs.
network = list()
for i,n in enumerate(_neighbors_fixed_radius(tr.pw_beta + tr.pw_alpha, edge_threshold)):
    for j in n:
        if i != j:
            network.append((
                i,                                # 'node_1' - row index
                j,                                # 'node_2' - column index
                (tr.pw_beta + tr.pw_alpha)[i,j]   # 'dist' - gets the distance between TCR(i,j)
            ))
cols = ['node_1', 'node_2', 'dist']
df_net = pd.DataFrame(network, columns = cols)
df_net['weight'] = edge_threshold - df_net['dist']
By representing the network as a networkx graph object, we can then apply the Louvain community identification algorithm [10] from the Python community package to identify groupings of highly networked TCRs within the set of A*02 YLQ-specific TCRs.
import networkx as nx
G = nx.from_pandas_edgelist(
    pd.DataFrame({'source' : df_net['node_1'],
                  'target' : df_net['node_2'],
                  'weight' : df_net['weight']}))
partition = community_louvain.best_partition(G)
A shortcoming of the output returned from the community_louvain function is that the cluster identification numbers don’t correspond to the size of the clusters. Before proceeding, we re-assign cluster numbers such that 0 corresponds with the largest cluster, 1 with the second largest cluster, and so on.
# Change partition such that cluster id is in descending order based on community size
partitions_by_cluster_size = list(pd.Series(partition.values()).value_counts().index)
partition_reorder = {cluster_id: rank for rank, cluster_id in
                     enumerate(partitions_by_cluster_size)}
partition = {k: partition_reorder.get(v) for k, v in partition.items()}
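The renumbering step can be verified on a small, made-up partition. The sketch below uses collections.Counter from the standard library in place of pandas; it is an equivalent illustration, not the code used to produce the figures, and the node and cluster labels are invented.

```python
from collections import Counter

# Made-up Louvain output: node -> arbitrary cluster label.
toy_partition = {0: 5, 1: 5, 2: 5, 3: 2, 4: 2, 5: 9}

# Rank cluster labels by size; most_common() sorts from largest to smallest.
sizes = Counter(toy_partition.values())
rank_by_label = {label: rank for rank, (label, _) in enumerate(sizes.most_common())}

# Relabel so that 0 is the largest cluster, 1 the second largest, and so on.
relabeled = {node: rank_by_label[label] for node, label in toy_partition.items()}
print(relabeled)
```

When two clusters have the same size, their relative order depends on the tie-breaking rule of the sorting routine, so cluster numbers for equally sized clusters should not be over-interpreted.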
We visualize the TCR network. We specify the node colors according to community membership identified by the Louvain algorithm.
from tcrdist.html_colors import get_html_colors
clusters = [i for i in pd.Series(partition.values()).value_counts().index]
colors = get_html_colors(len(clusters))
cluster_to_color = {cluster: color for cluster, color in zip(clusters, colors)}
options = {"edgecolors": "tab:gray", "node_size": 50}
pos = nx.spring_layout(G, seed=2, k = .3)
nx.draw(G,
        nodelist = G.nodes,
        pos = pos,
        node_color = [cluster_to_color.get(partition.get(i)) for i in G.nodes],
        **options)
3.1.8. A*02 YLQ Clusters Gene Usage
Tcrdist3 contains a visualization tool to generate custom Sankey SVG diagrams showing the frequency of gene usage and gene pairings. This enables the user to identify different Va, Ja, Vb, and Jb gene pairings within any user-specified TCR set.
import IPython
from tcrdist import plotting
clone_df_ylq = tr.clone_df.copy()
clone_df_ylq['cluster_alpha_beta'] = [str(partition.get(i)) if partition.get(i) is not None else None
                                      for i in clone_df_ylq.index]
# Note that not all clones were part of the network so subset to those
# that are .notna()
clone_df_ylq_clustered = clone_df_ylq[clone_df_ylq.cluster_alpha_beta.notna()]
svg = plotting.plot_pairings(cell_df = clone_df_ylq_clustered,
                             cols = ['j_a_gene','v_a_gene', 'cluster_alpha_beta',
                                     'v_b_gene','j_b_gene'],
                             count_col = 'count')
IPython.display.SVG(data=svg)
Visualizing the gene-usage plot (Fig. 3A) alongside the network representation (Fig. 2, Fig. 3B) can be helpful. Colors (red, green, and blue) represent the first, second, and third largest clusters, respectively. The alpha-beta TCR cluster column (i.e., 'cluster_alpha_beta' in Fig. 3A) is color-matched to the network diagram (Fig. 3B). By inspection, we notice that the largest cluster (i.e., cluster 0 shown in red) is defined by the exclusive use of TRAV12-1 with TRAJ43 or TRAJ30 genes. We can also see from the Sankey diagram that TRAV12-1 alpha-chains within this cluster pair with beta-chains formed from VDJ junctions involving many possible TRBV genes (including TRBV2, TRBV19, TRBV30, TRBV5-1, TRBV12-3, and others). By contrast, the second most abundant cluster (cluster 1 shown in green) comprises TCRs nearly exclusively formed from TRBV7-9/TRBJ1-1 beta-recombination, with an alpha chain using either TRAV12-2 or TRAV12-1.
Fig. 3.

(A) Sankey plot showing Ja-Va-Vb-Jb gene pairings in all TCRs comprising the multiple identified clusters among the A02_YLQ annotated TCRs. (B) Network diagram of A02_YLQ annotated TCRs joined by edges less than 120 TCRdist units. The colors of the clusters in the network diagram match the 'cluster_alpha_beta' column in the Sankey plot.
Fig. 2.

Network formed by TCRs annotated as recognizing the SARS-CoV-2 epitope A*02 YLQ by Minervina and Pogorelyy et al., 2021. The edge threshold is 120 TCRdist units. The colors red, green, blue, lime, black, magenta, and cyan map to clusters in descending order of cluster size.
The observed gene usage pattern suggests that the modes of binding achieved by TCRs in these two common clusters differ. In cluster 0 (red), the high diversity of TRBV genes implies that some degree of specificity is likely conferred by contacts in the alpha chain and portions of the CDR3 contributed by the conserved TRBJ-gene. Thus, in cluster 0 (red), interactions between the TRBV-encoded CDR1, CDR2, and CDR2.5 and the A*02 MHC presenting molecule are likely less important than for TCRs found in cluster 1 (green), where the TRBV-encoded CDRs might have a larger role in stabilizing the molecular interaction. A third cluster (cluster 2 shown in blue), although having less TRBV gene diversity than cluster 0 (red), may share a similar binding conformation with cluster 0, given that all members of this cluster also use TRAV12-1 to form a functional receptor.
3.1.9. Gene Usage with Sankey Diagrams
With tcrdist3, it is straightforward to view each cluster’s gene usage in isolation by querying only the portion of the clone_df corresponding to a particular cluster of interest (as shown in Fig. 4 for cluster 0 and Fig. 5 for cluster 1).
# Save the multi-cluster Sankey diagram generated in the previous step.
with open("YLQ_clusters.svg", 'w') as oh:
    oh.write(svg)
import IPython
from tcrdist import plotting
# Gene usage in the largest cluster (cluster 0).
cluster_df = clone_df_ylq.query("cluster_alpha_beta == '0'")
svg = plotting.plot_pairings(cell_df = cluster_df,
                             cols = ['j_a_gene','v_a_gene','cluster_alpha_beta',
                                     'v_b_gene','j_b_gene'],
                             count_col = 'count')
IPython.display.SVG(data=svg)
# Gene usage in the second largest cluster (cluster 1).
cluster_df = clone_df_ylq.query("cluster_alpha_beta == '1'")
svg = plotting.plot_pairings(cell_df = cluster_df,
                             cols = ['j_a_gene','v_a_gene','cluster_alpha_beta',
                                     'v_b_gene','j_b_gene'],
                             count_col = 'count')
IPython.display.SVG(data=svg)
Fig. 4.

Sankey plot showing Ja-Va-Vb-Jb gene pairings in all TCRs comprising the largest cluster (cluster 0) of A*02_YLQ annotated TCRs
Fig. 5.

Sankey plot showing Ja-Va-Vb-Jb gene pairings in all TCRs comprising the second largest cluster (cluster 1) of A*02_YLQ annotated TCRs
3.1.10. Background Subtracted CDR3 Logos
Through the process of VDJ recombination, CDR3s contain germline-encoded as well as randomly generated amino acids. Since germline-encoded residues are common, non-distinct features of TCRs, it is often the non-germline residues that define a receptor’s antigen specificity. To visually identify those portions of the motif that are unexpected beyond the residues contributed by germline V and J gene segments, tcrdist3 allows users to generate “background-subtracted” logos. The Python package TCRsampler, by the authors of tcrdist3, rapidly generates a background set of TCRs that share the same V and J gene usage as a foreground set of TCRs, facilitating their comparison.
3.1.11. Downloading TCRsampler backgrounds
If you have not run TCRsampler previously, you will need to download the default background data sets (as described in the materials section 2.1). Running the following command in your terminal adds the downloaded files to the package’s source directory automatically (the leading ! runs the command from a notebook cell). You only need to do this once.
!python -c "from tcrsampler.setup_db import install_all_next_gen; install_all_next_gen(dry_run = False)"
Alternatively, a background can be downloaded in an interactive session via the following commands:
from tcrsampler.sampler import TCRsampler
# Downloads human TRBV cord blood background from 8 individuals
TCRsampler.download_background_file('britanova_human_beta_t_cb.tsv.sampler.tsv')
# Downloads human alpha and beta synthetic background using OLGA (Sethna et al 2019)
TCRsampler.download_background_file('olga_sampler.zip')
3.1.12. Initializing TCRsamplers
With the default backgrounds downloaded (see section 3.1.11), we initialize two instances of the TCRsampler object, one for the beta chain and one for the alpha chain. The beta-chain sampler uses cord blood from 8 donors [6]. For the alpha chain, we use pre-computed synthetic TCRs generated with the software OLGA [8] from a generation probability model trained on non-productive TCRs [9]. We combine an OLGA-synthesized (alpha-chain) background with a natural umbilical cord blood (beta-chain) background for demonstration purposes. An ‘olga_human_beta_t.sampler.tsv’ background is also available.
from palmotif import compute_pal_motif, svg_logo
from tcrsampler.sampler import TCRsampler
from tcrdist.regex import _matrix_to_regex
# This step can take up to 1 minute, so be patient
ts_beta = TCRsampler(default_background = 'britanova_human_beta_t_cb.tsv.sampler.tsv')
ts_beta.build_background(max_rows = 50, stratify_by_subject = True)
ts_alpha = TCRsampler(default_background = 'olga_human_alpha_t.sampler.tsv')
ts_alpha.build_background(max_rows = 50, stratify_by_subject = True)
How does a TCRsampler work? With a large set of TCRs loaded into memory for most V-J pairings, the method TCRsampler.sample quickly returns randomly sampled CDR3s based on a set of user-specified V-J pairings. The method returns a list of CDR3s with the same ratio of gene usages as the input. For instance, suppose you have a set of 3 TCRs (1 with 'TRBV10-2*01' + 'TRBJ2-1*01' and 2 with 'TRBV15*01' + 'TRBJ2-1*01'). TCRsampler.sample returns a set of 3 random CDR3s with the same gene usage, as follows:
ts_beta.sample([['TRBV10-2*01','TRBJ2-1*01',1], ['TRBV15*01','TRBJ2-1*01',2]])
OUTPUTS:
[['CASSSGLAIGNEQFF'], ['CATSRDSYEQFF', 'CATSRDLDYAFNEQFF']]
In practice, a larger background with many V- and J-matched random CDR3 sequences is required. By setting the depth = x argument, more CDR3s are returned, where x is the multiple by which the background will be deeper than the input list. For example:
ts_beta.sample([['TRBV10-2*01','TRBJ2-1*01',1], ['TRBV15*01','TRBJ2-1*01',2]], depth = 2)
OUTPUTS:
[['CASSSGLAIGNEQFF', 'CARDRGQFF'], ['CATSRDSYEQFF', 'CATSRDLDYAFNEQFF', 'CATSRATSGRSSEQFF', 'CATSRDRSYEQFF']]
Notice that the ratio of gene usage stays the same, but the size of the background increases to double that of the input. To get all the CDR3s as a single list, set the argument flatten = True. When TCRsampler cannot find any TCRs matching a particular V- and J-gene pairing, it will return the Python value None.
ts_beta.sample([['TRBV10-2*01','TRBJ2-1*01',1], ['TRBV15*01','TRBJ2-1*01',2]], depth = 3, flatten = True)
OUTPUTS:
['CASSSGLAIGNEQFF', 'CARDRGQFF', 'CASSLLYNEQFF', 'CATSRDSYEQFF', 'CATSRDLDYAFNEQFF', 'CATSRATSGRSSEQFF', 'CATSRDRSYEQFF', 'CATSRGTGGNEQFF', 'CATSRDVPSLRFF']
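The sampling behavior described above (usage ratios preserved, depth as a multiplier, flatten, and None for unknown pairings) can be illustrated with a small standalone sketch. This is not TCRsampler's implementation: the function sample_vj_background, the toy pool, and the choice to sample with replacement are all assumptions made only for illustration.

```python
import random

def sample_vj_background(pool, gene_usage, depth=1, flatten=False, seed=0):
    """Hypothetical stand-in for V-J matched sampling.

    pool maps (v_gene, j_gene) -> list of background CDR3s.
    gene_usage is a list of [v_gene, j_gene, count] triples.
    """
    rng = random.Random(seed)
    sampled = []
    for v_gene, j_gene, count in gene_usage:
        candidates = pool.get((v_gene, j_gene))
        if not candidates:
            # No background CDR3s for this pairing: use a None placeholder
            sampled.append(None)
            continue
        # Assumption: draw with replacement so that small pools still work
        sampled.append(rng.choices(candidates, k=count * depth))
    if not flatten:
        return sampled
    flat = []
    for group in sampled:
        if group is None:
            flat.append(None)
        else:
            flat.extend(group)
    return flat

# Toy background pool (CDR3s borrowed from the example outputs above)
toy_pool = {
    ("TRBV10-2*01", "TRBJ2-1*01"): ["CASSSGLAIGNEQFF", "CARDRGQFF", "CASSLLYNEQFF"],
    ("TRBV15*01", "TRBJ2-1*01"): ["CATSRDSYEQFF", "CATSRDLDYAFNEQFF", "CATSRDRSYEQFF"],
}
usage = [
    ["TRBV10-2*01", "TRBJ2-1*01", 1],
    ["TRBV15*01", "TRBJ2-1*01", 2],
    ["TRBV99*01", "TRBJ2-1*01", 1],  # made-up gene name, absent from the pool
]
nested = sample_vj_background(toy_pool, usage, depth=2)
flat = sample_vj_background(toy_pool, usage, depth=2, flatten=True)
# Drop placeholders before using the background, as done later in this section
flat_clean = [x for x in flat if x is not None]
```

With depth = 2, the 1:2 usage ratio becomes 2:4 sampled CDR3s, and the unknown pairing contributes a single None that is filtered out before the background is used.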
Suppose we want to generate a deeper background for a specific cluster. We first create a DataFrame that contains only the clones in that cluster (i.e., where the cluster_alpha_beta variable is '0').
cluster_id = '0'
cluster_df = clone_df_ylq[clone_df_ylq['cluster_alpha_beta'] == cluster_id]
Next, format the gene usage information in the form that TCRsampler recognizes, i.e., a list of lists, each with three parts specifying the (i) V-gene, (ii) J-gene, and (iii) count. We do this once for the alpha-receptors and once for the beta-receptors.
gene_usage_alpha = cluster_df.groupby(['v_a_gene','j_a_gene']).size().reset_index().to_dict('split')['data']
gene_usage_alpha
OUTPUTS:
[['TRAV12-1*01', 'TRAJ10*01', 1],
 ['TRAV12-1*01', 'TRAJ17*01', 1],
 ['TRAV12-1*01', 'TRAJ30*01', 3],
 ['TRAV12-1*01', 'TRAJ31*01', 3],
 ['TRAV12-1*01', 'TRAJ34*01', 4],
 ['TRAV12-1*01', 'TRAJ36*01', 1],
 ['TRAV12-1*01', 'TRAJ39*01', 1],
 ['TRAV12-1*01', 'TRAJ43*01', 17],
 ['TRAV12-1*01', 'TRAJ47*01', 4]]
gene_usage_beta = cluster_df.groupby(['v_b_gene','j_b_gene']).size().reset_index().to_dict('split')['data']
gene_usage_beta
OUTPUTS:
[['TRBV10-2*01', 'TRBJ2-1*01', 1],
 ['TRBV12-3*01', 'TRBJ2-2*01', 3],
 ['TRBV15*01', 'TRBJ2-2*01', 2],
 ['TRBV19*01', 'TRBJ2-2*01', 4],
 ['TRBV2*01', 'TRBJ2-2*01', 8],
 ['TRBV25-1*01', 'TRBJ1-1*01', 1],
 ['TRBV25-1*01', 'TRBJ2-2*01', 1],
 ['TRBV3-1*01', 'TRBJ2-2*01', 1],
 ['TRBV30*01', 'TRBJ2-2*01', 3],
 ['TRBV5-1*01', 'TRBJ2-2*01', 5],
 ['TRBV5-6*01', 'TRBJ1-1*01', 1],
 ['TRBV6-1*01', 'TRBJ2-2*01', 1],
 ['TRBV6-5*01', 'TRBJ2-2*01', 1],
 ['TRBV7-8*01', 'TRBJ2-2*01', 1],
 ['TRBV7-9*01', 'TRBJ2-2*01', 1],
 ['TRBV9*01', 'TRBJ2-2*01', 1]]
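For readers who want to see what the groupby expression produces without a DataFrame, the same list-of-lists structure can be sketched with the standard library. The helper vj_usage and the toy clones are hypothetical; like .size(), the sketch counts one per row and ignores any clone count column.

```python
from collections import Counter

def vj_usage(clones, v_key, j_key):
    """Hypothetical helper: return sorted [v_gene, j_gene, n_rows] triples."""
    counts = Counter((clone[v_key], clone[j_key]) for clone in clones)
    # Sorting the (v, j) keys gives a lexicographic order, as groupby does by default
    return [[v, j, n] for (v, j), n in sorted(counts.items())]

# Toy clones, one dict per row of a clone table
toy_clones = [
    {"v_b_gene": "TRBV2*01", "j_b_gene": "TRBJ2-2*01"},
    {"v_b_gene": "TRBV2*01", "j_b_gene": "TRBJ2-2*01"},
    {"v_b_gene": "TRBV10-2*01", "j_b_gene": "TRBJ2-1*01"},
]
toy_usage = vj_usage(toy_clones, "v_b_gene", "j_b_gene")
print(toy_usage)
```

Each inner list has exactly the three parts that TCRsampler.sample expects: V-gene, J-gene, and the number of rows using that pairing.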
Use the gene usage objects to generate backgrounds of CDR3s for both the alpha and beta chain.
sampled_rep_alpha = ts_alpha.sample(gene_usage_alpha, flatten = True, depth = 100)
# Remove any None values that could have been generated by unknown pairs
sampled_rep_alpha = [x for x in sampled_rep_alpha if x is not None]
sampled_rep_beta = ts_beta.sample(gene_usage_beta, flatten = True, depth = 100)
# Remove any None values that could have been generated by unknown pairs
sampled_rep_beta = [x for x in sampled_rep_beta if x is not None]
3.1.13. SVG CDR3 αβ Logos
With a relevant V-J gene matched background, we can generate aligned background-subtracted logo plots with the functions compute_pal_motif and svg_logo in the palmotif package. The palmotif package is a dependency of tcrdist3 and will already be installed. By aligning both foreground and background CDR3s to a centroid sequence, we obtain a simple global alignment that can be used to identify the distinctive residues of an antigen-specific TCR set. To identify critical residues, we constructed a logo plot of all TCR CDR3 amino acid sequences from cluster 0, together with a “background-adjusted” logo plot. The background-adjusted plot shows the position-specific Kullback-Leibler divergence from the aligned background CDR3s, which were sampled from cord blood (in the case of TRBV and TRBJ) or synthesized from a probabilistic model (in the case of TRAV and TRAJ) and constrained to use the same V and J genes. The background-adjusted plot emphasizes the uncommon amino acid residues in the cluster of antigen-specific TCRs, reducing in particular the size of residues encoded by the germline V and J genes. By specifying refs = None, a raw logo is generated instead. Comparing the background-adjusted logo to the raw logo beneath it allows simultaneous appreciation of the most common residues and those that are under the strongest selection. For example, in the alpha logos below (Fig. 6), the first ‘N’, in the fourth position from the left, is both highly conserved across the receptor pool and unlikely given the background receptor distribution.
# ALPHA CHAIN
motif, stat = compute_pal_motif(
    seqs = cluster_df['cdr3_a_aa'].to_list(),
    refs = sampled_rep_alpha,
    centroid = cluster_df['cdr3_a_aa'].value_counts().index[0])
background_subtracted_svg_alpha = svg_logo(motif, return_str= True).\
    replace('height="100%"', 'height="20%"').\
    replace('width="100%"', 'width="20%"')
motif_raw, stat_raw = compute_pal_motif(
    seqs = cluster_df['cdr3_a_aa'].to_list(),
    refs = None,
    centroid = cluster_df['cdr3_a_aa'].value_counts().index[0])
raw_svg_alpha = svg_logo(motif_raw, return_str= True).\
    replace('height="100%"', 'height="20%"').\
    replace('width="100%"', 'width="20%"')
regex_alpha = _matrix_to_regex(motif_raw, max_ambiguity=5, ntrim=0, ctrim=0)
# BETA CHAIN
motif, stat = compute_pal_motif(
    seqs = cluster_df['cdr3_b_aa'].to_list(),
    refs = sampled_rep_beta,
    centroid = cluster_df['cdr3_b_aa'].value_counts().index[0])
background_subtracted_svg_beta = svg_logo(motif, return_str= True).\
    replace('height="100%"', 'height="20%"').\
    replace('width="100%"', 'width="20%"')
motif_raw, stat_raw = compute_pal_motif(
    seqs = cluster_df['cdr3_b_aa'].to_list(),
    refs = None,
    centroid = cluster_df['cdr3_b_aa'].value_counts().index[0])
raw_svg_beta = svg_logo(motif_raw, return_str= True).\
    replace('height="100%"', 'height="20%"').\
    replace('width="100%"', 'width="20%"')
regex_beta = _matrix_to_regex(motif_raw, max_ambiguity=5, ntrim=0, ctrim=0)
Fig. 6.

Background-subtracted CDR3 αβ logo plots of A*02_YLQ cluster 0. The “TGEL” portion of the beta-motif is contributed primarily by the germline-encoded TRBJ2–2 gene.
Here we have displayed the four generated SVG graphics by wrapping them in a small amount of HTML code. If the svg_logo function is run with return_str = False and filename="output_name.svg", it can be used to write SVG vector graphics directly to a file for use in publication graphics.
# display and HTML are imported from IPython.display (the IPython.core.display path is deprecated)
from IPython.display import display, HTML
no_wrap_div = '<div><h3>Cluster {}</h3><h5>_________alpha CDR3 Logo {}_________'\
    'beta CDR3 Logo {}________________________</h5></div>'\
    '<div style="white-space: nowrap">{}{}<br></br>{}{}</div>'
display(HTML(no_wrap_div.format(cluster_id, regex_alpha, regex_beta,
    background_subtracted_svg_alpha, background_subtracted_svg_beta,
    raw_svg_alpha, raw_svg_beta)))
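To make the idea of a position-specific Kullback-Leibler divergence concrete, the following minimal sketch scores each column of a set of already aligned, equal-length sequences against a background. It is not the palmotif implementation, which also handles alignment to a centroid and may use different pseudocounts or scaling; the function names, the pseudocount of 1, and the toy sequences are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(seqs, position, pseudocount=1.0):
    """Smoothed amino acid frequencies at one aligned position."""
    counts = Counter(s[position] for s in seqs)
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}

def positional_kl(foreground, background):
    """KL divergence (in bits) of foreground from background, per position."""
    length = len(foreground[0])
    if any(len(s) != length for s in foreground + background):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    scores = []
    for pos in range(length):
        p = column_frequencies(foreground, pos)
        q = column_frequencies(background, pos)
        scores.append(sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS))
    return scores

# Toy data: position 3 is a conserved 'N' in the foreground only
toy_foreground = ["CASNGEL", "CASNGEL", "CASNTEL", "CASNAEL"]
toy_background = ["CASSGEL", "CASRTEL", "CASQAEL", "CASTGEL"]
kl_scores = positional_kl(toy_foreground, toy_background)
most_distinctive = kl_scores.index(max(kl_scores))
```

Columns that look the same in foreground and background (for example, the germline-like "CAS" prefix) score zero, while the conserved foreground-only residue receives the largest score, which is the behavior the background-adjusted logo is designed to highlight.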
Similarly, we can follow the same procedure to visualize any cluster in the data, simply by changing the cluster_id variable in the code (see the cluster 1 logos, dominated by a ‘PDI’ motif in the beta-chain, in Fig. 7). Users may wrap their custom routine in a function to generate many logos, as we have done with the function custom_logo_routine in the code block below.
from IPython.display import display, HTML
from palmotif import compute_pal_motif, svg_logo
from tcrsampler.sampler import TCRsampler
from tcrdist.regex import _matrix_to_regex
# Select the cluster to subset, selecting `1` instead of `0` in the
# previous example.
cluster_id = '1'
cluster_df = clone_df_ylq[clone_df_ylq.cluster_alpha_beta == cluster_id]
def custom_logo_routine(cluster_df, label):
    # Identify the gene usage pattern for the alpha and beta chains
    gene_usage_alpha = cluster_df.groupby(['v_a_gene','j_a_gene']).\
        size().\
        reset_index().\
        to_dict('split')['data']
    gene_usage_beta = cluster_df.groupby(['v_b_gene','j_b_gene']).\
        size().\
        reset_index().\
        to_dict('split')['data']
    # ALPHA
    # Sample a background for the alpha chain
    sampled_rep_alpha = ts_alpha.sample(gene_usage_alpha, flatten = True, depth = 100)
    # Remove any None values that could have been generated by unknown pairs
    sampled_rep_alpha = [x for x in sampled_rep_alpha if x is not None]
    # Compute background-subtracted positionally aligned motif logo
    motif, stat = compute_pal_motif(
        seqs = cluster_df['cdr3_a_aa'].to_list(),
        refs = sampled_rep_alpha,
        centroid = cluster_df['cdr3_a_aa'].value_counts().index[0])
    # Render the background-subtracted logo as a scaled SVG string
    background_subtracted_svg_alpha = svg_logo(motif, return_str= True).\
        replace('height="100%"', 'height="20%"').\
        replace('width="100%"', 'width="20%"')
    # Compute raw positionally aligned motif logo
    motif_raw, stat_raw = compute_pal_motif(
        seqs = cluster_df['cdr3_a_aa'].to_list(),
        refs = None,
        centroid = cluster_df['cdr3_a_aa'].value_counts().index[0])
    raw_svg_alpha = svg_logo(motif_raw, return_str= True).\
        replace('height="100%"', 'height="20%"').\
        replace('width="100%"', 'width="20%"')
    # Generate a regular expression from the pal motif matrix
    regex_alpha = _matrix_to_regex(motif_raw, max_ambiguity=5, ntrim=0, ctrim=0)
    # Repeat for the BETA chain
    # Sample a background for the beta chain
    sampled_rep_beta = ts_beta.sample(gene_usage_beta, flatten = True, depth = 100)
    # Remove any None values that could have been generated by unknown pairs
    sampled_rep_beta = [x for x in sampled_rep_beta if x is not None]
    # Compute background-subtracted positionally aligned motif logo
    motif, stat = compute_pal_motif(
        seqs = cluster_df['cdr3_b_aa'].to_list(),
        refs = sampled_rep_beta,
        centroid = cluster_df['cdr3_b_aa'].value_counts().index[0])
    background_subtracted_svg_beta = svg_logo(motif, return_str= True).\
        replace('height="100%"', 'height="20%"').\
        replace('width="100%"', 'width="20%"')
    # Compute raw positionally aligned motif logo
    motif_raw, stat_raw = compute_pal_motif(
        seqs = cluster_df['cdr3_b_aa'].to_list(),
        refs = None,
        centroid = cluster_df['cdr3_b_aa'].value_counts().index[0])
    raw_svg_beta = svg_logo(motif_raw, return_str= True).\
        replace('height="100%"', 'height="20%"').\
        replace('width="100%"', 'width="20%"')
    regex_beta = _matrix_to_regex(motif_raw, max_ambiguity=5, ntrim=0, ctrim=0)
    # Output the results as tidy html
    no_wrap_div = '<div><h3>{}</h3><h5>_________alpha CDR3 Logo {}________'\
        'beta CDR3 Logo {}________________________</h5></div>'\
        '<div style="white-space: nowrap">{}{}<br></br>{}{}</div>'
    display(HTML(no_wrap_div.format(label,
        regex_alpha, regex_beta,
        background_subtracted_svg_alpha,
        background_subtracted_svg_beta,
        raw_svg_alpha, raw_svg_beta)))
custom_logo_routine(cluster_df = cluster_df, label = "Cluster 1")
Fig. 7.

Background-subtracted CDR3 αβ logo plots of A*02_YLQ cluster 1. The ‘cluster 1’ beta chain logos identify the “PDI” motif as highly conserved across the cluster and non-germline encoded.
3.1.14. Compute Positionally Aligned Motif Logos
Tcrdist3 permits the generation of logos from any set of TCRs or clustering method, and its modular nature gives the user the flexibility to pass whichever set of background CDR3s is most appropriate to the refs argument. For instance, a tissue-specific, HLA-matched, or strain-specific background could be used. Note that TCRsampler will issue a warning when a particular V-J gene pair is not available in the reference set. In most cases, logos can be generated using the sample of CDR3s from the remaining pairings, but users should consider the potential for bias if a particularly common V-J pair in the foreground receptor pool is missing from the background. One might also wish to construct a background by selecting only those CDR3s within the same length range as the foreground receptor pool being evaluated, which can be accomplished as follows:
max_cdr3_len = cluster_df['cdr3_b_aa'].str.len().max()
min_cdr3_len = cluster_df['cdr3_b_aa'].str.len().min()
sampled_rep_beta = [x for x in sampled_rep_beta if len(x) >= min_cdr3_len and len(x) <= max_cdr3_len]
3.2. Example 2: Finding Biochemically Similar TCRs in a Larger Set of Unannotated TCRs
Thus far, we have shown how tcrdist3 can be used to interrogate a relatively small number of antigen-specific TCRs that were already annotated using DNA-barcoded pMHC dextramers. Tcrdist3 is also useful for finding similar TCRs in larger unannotated sets of TCRs. For instance, we might be curious whether the TCRs found in cluster 0 (A*02:01:YLQ, Example 3.1) are commonly detectable in unenriched repertoires from COVID-19 patients in the context of natural infection. Here we analyze published single-cell TCR data generated from COVID-19 patient samples by Ren and colleagues [11] and attempt to identify TCRs with sequences similar to those that recognized A*02:01:YLQ in the previous example.
3.2.1. Data Preprocessing
We first preprocess the data to ensure proper column names and correctly formatted gene names.
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.join import join_by_dist
# Check that the file for example 2 is available. If not, download it.
f = 'GSE158055_covid19_tcr_vdjnt_pclone.tsv.gz'
if not os.path.isfile(f):
    os.system('wget https://github.com/kmayerb/tcrdist3_book_chapter/raw/main/data/GSE158055_covid19_tcr_vdjnt_pclone.tsv.gz')
# Make a folder 'data' where some outputs will be written
path = 'data'
if not os.path.isdir(path):
    os.mkdir(path)
# Preprocess the data from Ren et al. (2021) COVID-19 immune features revealed
# by a large-scale single-cell transcriptome atlas. Cell 184:1895-1913.e19
columns = {'PatientID' : 'subject',
    'TCRA_vgene' : 'v_a_gene',
    'TCRA_jgene' : 'j_a_gene',
    'TCRA_cdr3aa' : 'cdr3_a_aa',
    'TCRB_vgene' : 'v_b_gene',
    'TCRB_jgene' : 'j_b_gene',
    'TCRB_cdr3aa' : 'cdr3_b_aa',
    'TCR_pclone.freq' : 'count'}
# Read the file from the working directory (/content/ in Google Colab)
ren = pd.read_csv(f, sep = '\t')
ren = ren[list(columns.keys())].\
    rename(columns = columns)
# Append the *01 allele to each gene name
ren['v_b_gene'] = ren['v_b_gene'].apply(lambda x: f"{x}*01")
ren['v_a_gene'] = ren['v_a_gene'].apply(lambda x: f"{x}*01")
ren['j_b_gene'] = ren['j_b_gene'].apply(lambda x: f"{x}*01")
ren['j_a_gene'] = ren['j_a_gene'].apply(lambda x: f"{x}*01")
ren.head(5)
The resulting DataFrame contains the column names recognized by tcrdist3.
  subject     v_a_gene   j_a_gene  ...    j_b_gene           cdr3_b_aa  count
0  P-M044     TRAV4*01  TRAJ34*01  ...  TRBJ2-7*01    CSATWRRRESPYEQYF      1
1  P-M044  TRAV13-2*01   TRAJ8*01  ...  TRBJ2-2*01  CASSQLSPGPRNTGELFF    145
2  P-M044  TRAV13-2*01   TRAJ8*01  ...  TRBJ2-2*01  CASSQLSPGPRNTGELFF    145
3  P-M044   TRAV1-2*01  TRAJ20*01  ...  TRBJ2-1*01     CSARPGLASYNEQFF      1
4  P-M044  TRAV13-2*01   TRAJ8*01  ...  TRBJ2-2*01  CASSQLSPGPRNTGELFF    145
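The preprocessing above appends the *01 allele to every gene name unconditionally. A slightly more defensive variant, sketched below, leaves names that already carry an allele untouched and passes missing values through instead of turning them into strings such as "nan*01". The helper add_default_allele is hypothetical and is not part of tcrdist3.

```python
def add_default_allele(gene, allele="*01"):
    """Hypothetical helper: append a default allele unless one is present."""
    # Pass through missing values; NaN is the only value not equal to itself
    if gene is None or gene != gene:
        return gene
    gene = str(gene).strip()
    if not gene:
        return gene
    return gene if "*" in gene else f"{gene}{allele}"

examples = ["TRBV2", "TRAV13-2", "TRBV2*02", None]
formatted = [add_default_allele(g) for g in examples]
print(formatted)
```

With pandas, such a helper could be applied column by column in place of the lambda expressions, for example ren['v_b_gene'].apply(add_default_allele).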
3.2.2. Search for Exact CDR3αβ Matches
We might initially ask whether any CDR3 alpha and CDR3 beta amino acid sequences found in the Ren et al. dataset match those of any of the ‘cluster 0’ TCRs annotated as A*02 YLQ-specific in the Minervina et al. dataset from Example 3.1. We find that there are none: an inner merge on identical CDR3 beta amino acid sequences alone already returns an empty DataFrame, so no clone can match on both chains.
cluster_id = '0'
cluster_df = clone_df_ylq[clone_df_ylq.cluster_alpha_beta == cluster_id].reset_index(drop = True)
cluster_df[['v_a_gene','cdr3_a_aa', 'v_b_gene','cdr3_b_aa']].\
    merge(ren, how = "inner", on = ['cdr3_b_aa'])
OUTPUTS:
Empty DataFrame
Columns: [v_a_gene_x, cdr3_a_aa_x, v_b_gene_x, cdr3_b_aa, subject, v_a_gene_y, j_a_gene, cdr3_a_aa_y, v_b_gene_y, j_b_gene, count]
Index: []
3.2.3. Finding Similar, Non-Identical TCRs in Bulk Repertoires
The lack of identical sequence matches at the amino acid level is not unexpected. Our CDR3 logo motifs clearly indicate considerable sequence diversity at several positions in both the CDR3 alpha and CDR3 beta of ‘cluster 0’ A*02 YLQ TCRs. Thus, we might want to search for close, but non-identical, matches. Tcrdist3 is particularly well suited to this task. Before proceeding, we need to introduce an important feature of the software suite. Up to this point, we have computed all-vs-all “square” distance matrices, which contain pairwise distances for all the clones in a TCRrep.clone_df DataFrame. Often, an all-vs-all comparison is unnecessary. Instead, a common objective is to compute distances between TCRs in one, often small, ‘search’ set and a second, often much larger, ‘bulk’ set of TCRs. In these cases, a large “rectangular” distance matrix is desired. Tcrdist3 makes this easy with the compute_rect_distances() method.
The first argument df specifies the primary DataFrame of clones to correspond to rows.
The second df2 argument specifies the secondary DataFrame that will correspond to the columns of the resulting “rectangular” distance matrix.
TCRrep.compute_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df)
When compute_rect_distances is called, the following NumPy array attributes are generated, with rows aligning to the rows of the primary DataFrame df and columns aligning to the rows of the secondary DataFrame df2:
TCRrep.rw_beta
TCRrep.rw_alpha,
TCRrep.rw_cdr3_b_aa, and
TCRrep.rw_cdr3_a_aa
However, as the number of pairwise comparisons exceeds 100 million (10^8) (e.g., what one might generate by comparing 100 search clones (10^2) against 1 million (10^6) bulk clones), the memory demand becomes substantial. In practice, only the shortest distances are relevant for subsequent analyses, which can greatly reduce the number of distances that must be stored in memory. Thus, tcrdist3 takes advantage of the scipy.sparse.csr_matrix format to store only those distances that are less than or equal to a user-specified threshold distance (the radius argument of the compute_sparse_rect_distances method):
TCRrep.compute_sparse_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df, radius = 100, chunk_size = 100)
When storing distances in a sparse format, the user should also specify the argument chunk_size, which is the maximum number of rows to compute at a time on each CPU before the desired results are stored in a sparse format. As the ideal chunk size depends primarily on the size of the secondary DataFrame (df2), we provide a convenience function get_safe_chunk(tr_search.clone_df.shape[0], tr_bulk.clone_df.shape[0], target = 10**7) that aims to keep the number of distances held in memory at any one time below 10 million. In our experience, this chunk size permits concurrent use of 6 CPUs on a 16 GB laptop without exceeding available memory. When the sparse computation is complete, the attributes rw_beta and rw_alpha will be populated with the resulting distance matrices in SciPy’s Compressed Sparse Row matrix format (scipy.sparse.csr_matrix). In this format, distances above the threshold are encoded as sparse zeros and are no longer stored in memory; true zero distances are retained in the sparse format, but they are represented as −1. Just as we formed networks from full pairwise matrices in examples 1 and 2, we can do so directly from scipy.sparse.csr_matrix matrices (this is shown via a full example in Appendix A, in section 4.1).
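The storage convention just described (process the search rows in chunks, keep only distances at or below the radius, and encode true zeros as −1) can be sketched in plain Python. This is only an illustration: toy_distance is a made-up stand-in for TCRdist, a dictionary keyed by (row, column) stands in for the CSR matrix, and rows_per_chunk is a hypothetical helper that is not claimed to reproduce the logic of get_safe_chunk.

```python
def toy_distance(a, b):
    """Made-up stand-in for TCRdist: mismatches plus a length penalty."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return 12 * mismatches + 12 * abs(len(a) - len(b))

def rows_per_chunk(n_search, n_bulk, target=10**7):
    """Hypothetical helper: rows per chunk so that rows * n_bulk <= target."""
    return min(n_search, max(1, target // n_bulk))

def sparse_rect_distances(search, bulk, radius, chunk_size):
    """Store only distances <= radius; true zeros are stored as -1."""
    stored = {}
    for start in range(0, len(search), chunk_size):
        # Only one chunk of rows is processed (and held densely) at a time
        chunk = search[start:start + chunk_size]
        for i, s in enumerate(chunk, start=start):
            for j, b in enumerate(bulk):
                d = toy_distance(s, b)
                if d <= radius:
                    stored[(i, j)] = -1 if d == 0 else d
    return stored

toy_search = ["CASSLGF", "CASSPDIF"]
toy_bulk = ["CASSLGF", "CASSLAF", "CATQWERTYF", "CASSPDIF"]
toy_sparse = sparse_rect_distances(toy_search, toy_bulk, radius=24, chunk_size=1)
print(toy_sparse)
```

Only three of the eight pairwise distances are retained here. For scale, a dense matrix of 10^8 distances stored as 16-bit integers would occupy roughly 200 MB per matrix, which is why discarding distances above the radius matters for large bulk repertoires.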
With that procedural explanation out of the way, we can proceed to use the compute_sparse_rect_distances method for the biological task at hand. Recall that we wanted to test whether there were any similar, but non-identical, TCRs in the Ren et al. 2021 dataset that resembled those TCRs comprising cluster 0 of A*02-YLQ specific TCRs annotated in Minervina et al. 2021. To do so we:
Instantiate a new TCRrep instance, which we name tr_search, passing the cluster 0 clones to the cell_df argument
Set the number of CPUs to 6 or the number available, whichever is less (i.e., min(6, multiprocessing.cpu_count()))
Identify an appropriate chunk size
Call the compute_sparse_rect_distances() method.
import multiprocessing
cluster_id = '0'
cluster_df = clone_df_ylq[clone_df_ylq.cluster_alpha_beta == cluster_id].reset_index(drop = True)
tr_search = TCRrep(cell_df = cluster_df,
    organism = "human",
    chains = ['alpha','beta'],
    deduplicate = False,
    compute_distances = False,
    cpus = min(6, multiprocessing.cpu_count()))
tr_bulk = TCRrep(cell_df = ren,
    organism = "human",
    chains = ['alpha','beta'],
    deduplicate = False,
    compute_distances = False)
chunk_size = get_safe_chunk(tr_search.clone_df.shape[0], tr_bulk.clone_df.shape[0])
tr_search.compute_sparse_rect_distances(
    df = tr_search.clone_df,
    df2 = tr_bulk.clone_df,
    radius = 100,
    chunk_size = chunk_size)
Note that for the beta chains, we stored approximately 330,000 pairwise distances out of the roughly 7.7 million (35 × 220,968) distances that were actually computed. For the alpha chains, we stored approximately 765,000 distances.
tr_search.rw_beta
<35x220968 sparse matrix of type '<class 'numpy.int16'>' with 328984 stored elements in Compressed Sparse Row format>
tr_search.rw_alpha
<35x220968 sparse matrix of type '<class 'numpy.int16'>' with 764378 stored elements in Compressed Sparse Row format>
To aid in our understanding of these data objects, we now check that tr_search.rw_alpha has the expected dimensions:
assert tr_search.rw_alpha.shape[0] == tr_search.clone_df.shape[0]
assert tr_search.rw_alpha.shape[1] == tr_bulk.clone_df.shape[0]
However, the numbers of entries stored in the alpha and beta matrices are not identical:
assert tr_search.rw_alpha.size != tr_search.rw_beta.size
Thus, we need to use the function tcrdist.sparse.add_sparse_pwd if we wish to obtain a combined alpha-beta TCRdist from the two sparse CSR matrices. Only those (i, j) coordinates with entries in both sparse matrices will be present in the resulting matrix.
from tcrdist.sparse import add_sparse_pwd
tr_search.rw_alpha_beta = add_sparse_pwd(tr_search.rw_beta, tr_search.rw_alpha)
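The reason a plain element-wise sum is not enough can be seen in a small sketch, in which dictionaries keyed by (i, j) stand in for the sparse matrices. A coordinate missing from either chain means "distance above the radius", so it must be dropped rather than treated as zero, and the −1 placeholders must be decoded before adding. The function add_sparse_dicts is hypothetical and is not the add_sparse_pwd implementation; re-encoding a combined zero as −1 is an assumption that follows the storage convention described earlier.

```python
def decode(value):
    """Translate the stored -1 placeholder back into a true zero distance."""
    return 0 if value == -1 else value

def add_sparse_dicts(a, b):
    """Hypothetical sketch: sum two sparse distance maps on shared coordinates."""
    combined = {}
    for key in a.keys() & b.keys():
        total = decode(a[key]) + decode(b[key])
        # Assumption: keep the -1 convention for a combined distance of zero
        combined[key] = -1 if total == 0 else total
    return combined

toy_beta = {(0, 0): -1, (0, 1): 12, (1, 2): 30}
toy_alpha = {(0, 0): -1, (0, 1): -1, (2, 2): 5}
toy_alpha_beta = add_sparse_dicts(toy_beta, toy_alpha)
print(toy_alpha_beta)
```

Coordinates (1, 2) and (2, 2) disappear because each is within the radius for only one chain, so nothing can be said about their paired distance.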
3.2.4. TCRjoin: Table Joins by Distance
Once the sparse matrix has been computed, the function tcrdist.join.join_by_dist performs a database-style join; however, rather than using identical index keys, it joins rows with TCRs that share a pairwise distance less than the threshold specified by the radius argument. For each TCR in the ‘left’ DataFrame, the function returns up to max_n nearest neighbors in the right DataFrame. See the join_by_dist docstrings for more details about all possible options.
df_join = join_by_dist(
    how = 'inner',
    csrmat = tr_search.rw_alpha_beta,
    left_df = tr_search.clone_df,
    right_df = tr_bulk.clone_df,
    left_cols = tr_search.clone_df.columns.to_list(),
    right_cols = tr_bulk.clone_df.columns.to_list(),
    left_suffix = '_search',
    right_suffix = '_bulk',
    max_n = 1000,
    radius = 120)
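Conceptually, a join by distance replaces the key-equality test of a database join with a distance threshold. The following sketch of an inner join works on plain dictionaries; join_by_distance is a hypothetical function, not the tcrdist3 implementation, and treating the radius as inclusive (distance <= radius) is an assumption.

```python
def join_by_distance(pairs, left_rows, right_rows, radius, max_n):
    """Hypothetical inner join: pairs maps (i, j) -> distance, -1 meaning zero."""
    neighbors = {}
    for (i, j), d in pairs.items():
        d = 0 if d == -1 else d
        if d <= radius:
            neighbors.setdefault(i, []).append((d, j))
    joined = []
    for i in sorted(neighbors):
        # Keep at most max_n nearest neighbors for each left-hand row
        for d, j in sorted(neighbors[i])[:max_n]:
            row = {f"{k}_search": v for k, v in left_rows[i].items()}
            row.update({f"{k}_bulk": v for k, v in right_rows[j].items()})
            row["dist"] = d
            joined.append(row)
    return joined

toy_left = [{"cdr3_b_aa": "CASSLGF"}, {"cdr3_b_aa": "CASSPDIF"}]
toy_right = [
    {"cdr3_b_aa": "CASSLGF", "subject": "S1"},
    {"cdr3_b_aa": "CASSLAF", "subject": "S2"},
    {"cdr3_b_aa": "CASSPDIF", "subject": "S3"},
]
toy_pairs = {(0, 0): -1, (0, 1): 12, (1, 1): 48, (1, 2): -1}
toy_join = join_by_distance(toy_pairs, toy_left, toy_right, radius=24, max_n=1)
for row in toy_join:
    print(row)
```

With max_n = 1, each search TCR is matched only to its single nearest neighbor within the radius, and the suffixes keep the search and bulk columns distinguishable in the joined rows.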
If we inspect the resulting DataFrame df_join, we can view the closest non-identical matches.
df_join[['cdr3_a_aa_search','v_a_gene_search', 'j_a_gene_search',
    'cdr3_b_aa_search','v_b_gene_search', 'j_b_gene_search',
    'cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk',
    'cdr3_b_aa_bulk','v_b_gene_bulk', 'j_b_gene_bulk','dist','subject_bulk']].\
    query('dist <= 100').\
    sort_values('dist').\
    groupby(['cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk','cdr3_b_aa_bulk',
        'v_b_gene_bulk', 'j_b_gene_bulk','subject_bulk']).\
    head(1).reset_index(drop = True).head(10)
Among the 220,968 clones in the Ren et al. 2021 dataset, none of which matches any cluster 0 A*02-YLQ TCR identically at the amino acid level, there are 49 clones (in 33 of 136 unique subjects) within 100 TCRdist units of a TCR in cluster 0.
df_join[['cdr3_a_aa_search','v_a_gene_search', 'j_a_gene_search',
    'cdr3_b_aa_search','v_b_gene_search', 'j_b_gene_search',
    'cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk',
    'cdr3_b_aa_bulk','v_b_gene_bulk', 'j_b_gene_bulk','dist','subject_bulk']].\
    query('dist <= 100').sort_values('dist').\
    groupby(['subject_bulk']).\
    count().\
    shape
OUTPUTS: (33, 13)
Grouping by subject, we see that 33 subjects have a TCR within 100 TCRdist units of one of the annotated A*02 YLQ cluster 0 sequences.
df_join[['cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk',
    'cdr3_b_aa_bulk','v_b_gene_bulk', 'j_b_gene_bulk',
    'dist','subject_bulk']].\
    query('dist <= 100').\
    sort_values('dist').\
    groupby(['cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk',
        'cdr3_b_aa_bulk','v_b_gene_bulk', 'j_b_gene_bulk','subject_bulk']).\
    head(1).shape
OUTPUTS: (49, 8)
Grouping by subject, V genes, J genes, and CDR3s, we see that 49 subject-unique TCRs were within 100 TCRdist units of one of the annotated A*02 YLQ cluster 0 sequences.
3.2.5. CDR3 Logo Motifs from Neighboring Sequences
Visualizing these TCRs may help us get a picture of the receptor diversity found among sequences similar to TCRs annotated as recognizing the YLQ epitope. We first query the unique neighboring sequences in the bulk dataset and save the results to a new DataFrame, cluster_df_neighbors. Next, we pass that DataFrame to the custom_logo_routine function we defined above in section 3.1.13 to produce new αβ logos (Fig. 8).
cluster_df_neighbors = df_join[['cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk',
    'cdr3_b_aa_bulk','v_b_gene_bulk', 'j_b_gene_bulk',
    'dist','subject_bulk']].\
    query('dist <= 100').\
    sort_values('dist').\
    groupby(['cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk','cdr3_b_aa_bulk',
        'v_b_gene_bulk', 'j_b_gene_bulk','subject_bulk']).\
    head(1).\
    reset_index().\
    sort_values('dist').\
    rename(columns = {k:k.replace("_bulk","") for k in
        ['cdr3_a_aa_bulk','v_a_gene_bulk', 'j_a_gene_bulk',
         'cdr3_b_aa_bulk','v_b_gene_bulk', 'j_b_gene_bulk',
         'subject_bulk']})
# Note that we defined the function custom_logo_routine in a previous code block.
# We can reuse it. Here we substitute in `cluster_df_neighbors`.
custom_logo_routine(cluster_df = cluster_df_neighbors, label = "Neighbors of Cluster 0")
Fig. 8.

Background-subtracted CDR3 αβ logo plots of sequences neighboring a TCR in the annotated A*02_YLQ cluster 0.
We can review the logo motif for the original annotated sequences with the following code, which produces the logo shown previously in Fig. 6.
cluster_id = '0'
cluster_df = clone_df_ylq[clone_df_ylq.cluster_alpha_beta == cluster_id].\
    reset_index(drop = True)
# Note that we defined the function custom_logo_routine in a previous code block.
# We can reuse it.
custom_logo_routine(cluster_df = cluster_df, label = "TCRs Annotated A*02 YLQ, Cluster 0")
Note the conservation in the key residues of the alpha chain. Although outside the scope of this tutorial, this methodology could be used to check for the frequency of similar TCRs in non-COVID-19 patients. Moreover, a permutation of the alpha and beta receptor pairings could be used to test whether this many close matches would be expected by chance alone.
3.3. Example 3: Comprehensive Analysis of Bulk Unannotated Data
Identifying important clones within deeply sequenced bulk TCR repertoires is a challenge. Algorithms such as TCRNET [12] and ALICE [13, 14] have been proposed to identify TCRs sharing antigen specificity within bulk repertoires when no in vitro binding or activation data are available. These methods were developed to identify TCR nodes in a network that have more edges than would be expected in an antigen-naïve repertoire, where the expectation is estimated from a background (in the case of TCRNET) or derived from a probability model (in the case of ALICE). In a similar vein, the algorithms GLIPH [15] and GLIPH2 [16] allow users to analyze bulk repertoires for groups of TCRs with shared CDR3 k-mer motifs that are more frequent within the repertoire than in a naïve reference TCR set. Recently, the GIANA algorithm [17] introduced a novel embedding that enables approximate nearest-neighbor identification of TCRs, allowing a two-phase approach that limits the total number of pairwise alignments required in the second step. These algorithms do not require all-vs-all comparison and thus avoid the scaling problem of O(n^2) algorithms. They can be useful first-pass algorithms for identifying key TCRs to examine in more detail with more computationally intensive algorithms. Users interested in a two-stage approach, where TCRdist is applied to a subset of a repertoire identified by another algorithm, should consult Appendix D in Section 4.4.
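The general logic of neighbor-enrichment testing can be sketched with a one-sided hypergeometric (Fisher's exact) test: given the number of neighbors a TCR has in the repertoire of interest and in a background, how surprising is the repertoire count? This is only an illustration of the idea; TCRNET and ALICE use their own statistical models, and the function names and toy counts below are assumptions.

```python
from math import comb

def hypergeom_upper_tail(k, n_drawn, n_success, n_total):
    """P(X >= k) for a hypergeometric draw of n_drawn items from n_total."""
    upper = min(n_drawn, n_success)
    numerator = sum(
        comb(n_success, x) * comb(n_total - n_success, n_drawn - x)
        for x in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_drawn)

def neighbor_enrichment(neighbors_fg, n_fg, neighbors_bg, n_bg):
    """Hypothetical one-sided test for excess neighbors in the foreground.

    neighbors_fg / neighbors_bg: neighbors within the radius in each set.
    n_fg / n_bg: total clones in the repertoire and in the background.
    """
    return hypergeom_upper_tail(
        k=neighbors_fg,
        n_drawn=n_fg,
        n_success=neighbors_fg + neighbors_bg,
        n_total=n_fg + n_bg,
    )

# Toy counts: 5 of 10 repertoire clones are neighbors vs 1 of 10 background clones
p_value = neighbor_enrichment(5, 10, 1, 10)
print(round(p_value, 4))
```

Exact binomial coefficients keep the sketch dependency-free, but they become slow for realistic repertoire sizes, where a statistics library would be the practical choice.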
Because of advances in computational speed introduced in tcrdist3, it is also computationally feasible to evaluate all-vs-all TCR distances (despite this being an O(n^2) process) within a repertoire of 50,000 to 200,000 unique clones. To do so without consuming an enormous amount of memory, tcrdist3 must be run with the sparse data storage option, which retains only the subset of pairwise distances beneath a desired threshold. For example, on a reasonably powerful computing node (e.g., 12 CPUs, 48 GB memory), an all-vs-all distance matrix of 53,290 clones (1.25 billion comparisons) was completed in 3 minutes 36 seconds. The same task on a 6-CPU, 16 GB Intel-based OSX laptop took 12 minutes.
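To make the memory argument concrete, the following minimal sketch shows the idea of thresholded sparse storage in plain Python. It is not tcrdist3's implementation: `toy_distance` is a hypothetical stand-in for TCRdist (a Hamming-style count), and `sparse_pairwise` is a name introduced here purely for illustration.

```python
from itertools import combinations

def toy_distance(a: str, b: str) -> int:
    """Toy stand-in for TCRdist: positional mismatches plus a length penalty."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))

def sparse_pairwise(seqs, radius):
    """Keep only pairs with distance <= radius (dictionary-of-keys storage)."""
    kept = {}
    for i, j in combinations(range(len(seqs)), 2):
        d = toy_distance(seqs[i], seqs[j])
        if d <= radius:
            kept[(i, j)] = d
    return kept

seqs = ["CASSLGYEQYF", "CASSLGGEQYF", "CASSFSNYGYTF", "CSARTGTDTQYF"]
kept = sparse_pairwise(seqs, radius=1)
n_pairs = len(seqs) * (len(seqs) - 1) // 2
print(f"stored {len(kept)} of {n_pairs} pairwise distances")
```

Every pair is still compared, so the run time remains quadratic; only the storage shrinks, because distant pairs are discarded as soon as they are computed.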
The following methods in Example 3 are most appropriate for users with access to more substantial computing resources than are present on a personal computer or in the Google Colab environment. For these examples we recommend at least 12 CPUs and 64 GB of memory. Because Google’s Colab environment, where we recommend running the code presented in this tutorial, limits users to 2 CPUs per session, we will demonstrate subsequent steps by loading precomputed sparse matrices.
3.3.1. Loading Precomputed Data
The data used in this example come from the ImmuneRACE effort led by Adaptive Biotechnologies [18] and Microsoft. We use a bulk PBMC repertoire from a convalescent COVID-19 patient. The input file has already been formatted for use with tcrdist3. Because this example is more computationally intensive, we also download precomputed distance matrices, which are explained in more detail as we move through the example. The code block below downloads these files.
# Check that the files needed for this example have been downloaded
# to Google Colab's content/ folder.
# If not already present, this block will download and unzip them.
import os
url = 'https://github.com/kmayerb/tcrdist3_book_chapter/raw/main/data'
base = '1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv'
if not os.path.isfile(base):
    os.system(f'wget {url}/{base}.zip')
    os.system(f'unzip {base}.zip')
# Precomputed sparse distance matrices
for suffix in ['tr_bulk.rw_beta_csrmat.npz', 'tr_nn_v_cord.rw_beta_csrmat.npz']:
    f = f'{base}.{suffix}'
    if not os.path.isfile(f):
        os.system(f'wget {url}/{f}')
Note that in the next code block we have commented out the line containing tr_bulk.compute_sparse_rect_distances(radius = 72, chunk_size = chunk_size) and instead use the scipy.sparse.load_npz function to load the precomputed data.
import multiprocessing
import numpy as np
import os
import pandas as pd
import scipy.sparse
import re
from tcrdist.breadth import get_safe_chunk
from tcrdist.repertoire import TCRrep
from scipy.stats import poisson
# The number of CPUs to use depends on your system
CPUS = min(12, multiprocessing.cpu_count())
file = '1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv'
# Load repertoire to Pandas DataFrame
df_bulk = pd.read_csv(file, sep = "\t")
# Sort values based on 'count' column
df_bulk = df_bulk.sort_values('count', ascending = False).reset_index(drop = True)
# Similar to ALICE, this allows for consideration of each distinct nucleotide
# sequence, each assumed to have been generated from a distinct T-cell clone.
# To retain all clones, even those identical at the AA level,
# we define a rank which will be used later to track individual clones
df_bulk['rank'] = df_bulk.index.to_list()
# Initialize a TCRrep instance, passing < df_bulk > to the 1st argument
tr_bulk = TCRrep(
    cell_df = df_bulk,
    organism = "human",
    chains = ['beta'],
    cpus = CPUS,
    compute_distances = False) # <- Note we don't auto compute
# Optionally, rearrange clone_df before computing any distances
tr_bulk.clone_df = tr_bulk.clone_df.\
    sort_values('rank', ascending = True).\
    reset_index(drop = True)
# Set custom weighting with 6X weight on CDR3 relative to other CDRs
tr_bulk.weights_b = {'cdr3_b_aa': 6, 'pmhc_b_aa': 1, 'cdr2_b_aa': 1, 'cdr1_b_aa': 1}
chunk_size = get_safe_chunk(tr_bulk.clone_df.shape[0], tr_bulk.clone_df.shape[0])
# tr_bulk.compute_sparse_rect_distances(radius = 72, chunk_size = chunk_size)
# 12 CPU, [03:36, 1.33it/s]
# scipy.sparse.save_npz(f'{file}.tr_bulk.rw_beta_csrmat.npz', tr_bulk.rw_beta)
tr_bulk.rw_beta = scipy.sparse.load_npz(f'{file}.tr_bulk.rw_beta_csrmat.npz')
Of a potential 1.25 billion comparisons, approximately 730,000 pairwise distances were within 72 TCRdist units (CDR3 weighted 6:1:1:1); only these distances are stored in the sparse format. Because this example uses single-chain data, we opted to use a 6x weighting on the CDR3 and a relatively wide search radius. This allows finding neighboring TCRs as either (i) TCRs with identical V-gene-encoded CDRs and CDR3s with 0–3 substitutions or (ii) TCRs with slightly different V-gene-encoded CDRs and CDR3s with 0–1 substitutions.
tr_bulk.rw_beta
OUTPUTS:
<53290x53290 sparse matrix of type '<class 'numpy.int16'>'
    with 733562 stored elements in Compressed Sparse Row format>
Based on the assumption that antigen exposure increases the detection probability of clones, we examine which clones have close neighbors within the repertoire. For the purposes of statistical testing, we count as close neighbors only those within 48 TCRdist units (6:1:1:1 CDR3 weighted), which accommodates (i) up to 2 major substitutions or gaps in the CDR3 or (ii) an identical CDR3 with slightly differing V-gene-encoded CDRs. This threshold is flexible, and users may choose to experiment with more or less stringent neighbor definitions. (Although outside the scope of this tutorial, a _neighbors_sparse_variable_radius function is also available with a radius_list argument that can accommodate a list of different neighbor distances for each clone.)
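The relationship between radius, CDR3 weight, and tolerated sequence differences can be checked with a little arithmetic. The sketch below assumes only what is stated in the Introduction: the largest per-position penalty is 4 (a gap, or the most dissimilar BLOSUM62-based substitution), and that penalty is multiplied by the CDR3 weight. The function name is hypothetical.

```python
MAX_PENALTY = 4  # largest per-position penalty: a gap or the most dissimilar substitution

def worst_case_cdr3_changes(radius: int, cdr3_weight: int) -> int:
    """Number of maximally penalized CDR3 positions that fit within `radius`,
    assuming all V-gene-encoded CDRs are identical."""
    return radius // (MAX_PENALTY * cdr3_weight)

for weight, radius in [(3, 24), (6, 48), (6, 72)]:
    n = worst_case_cdr3_changes(radius, weight)
    print(f"CDR3 weight {weight}x, radius {radius}: up to {n} maximally penalized CDR3 changes")
```

Because conservative substitutions are penalized less than 4, a given radius can admit more than this number of mild substitutions; the figures above are the worst case.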
from tcrdist.public import _neighbors_sparse_fixed_radius
# look up neighbors
tr_bulk.clone_df['nn'] = _neighbors_sparse_fixed_radius(
    csrmat = tr_bulk.rw_beta,
    radius = 48)
# count how many neighbors (subtract 1 to exclude the self-match)
tr_bulk.clone_df['k_nn'] = [len(x)-1 for x in tr_bulk.clone_df['nn']]
# < df_nn > is a dataframe of clones with > 4 neighbors within tcrdist of 48
df_nn = tr_bulk.clone_df.iloc[tr_bulk.clone_df.query('k_nn > 4').index,].\
    reset_index(drop = True)
At TCRdist 48 (6:1:1:1 CDR3 weighted), we find that there are 6408 clones with 5 or more neighbors. (The len(x)-1 step avoids counting a clone's self-match as a neighbor.)
df_nn.shape
OUTPUTS:
(6408, 18)
Next, we load 960,000 sequences from 8 cord blood samples to estimate the frequency of neighbors in an antigen-naïve background. The cord blood samples were originally sequenced by Britanova and colleagues [6].
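from tcrdist.background import sample_britanova
df_cord = sample_britanova(960000, random_state=1)
# NOTE: the call below is left commented out because < ts > is not defined
# in this example and its result is not used in the steps that follow.
# ts = get_stratified_gene_usage_frequency(ts = ts, replace = True)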
Because we are comparing 6.4 × 10^3 clones against approximately 10^6 clones, this will involve roughly 6.4 billion comparisons, which would be too slow on Google’s Colab but can be completed in 10 minutes on a node with 12 CPUs and 64 GB of memory. We will load the precomputed output to illustrate the subsequent analysis steps. The code to compute the distances with compute_sparse_rect_distances has been commented out.
tr_nn = TCRrep(
    cell_df = df_nn,
    organism = "human",
    chains = ['beta'],
    deduplicate = False,
    compute_distances = False)
tr_nn.cpus = CPUS
tr_nn.weights_b = {'cdr3_b_aa': 6, 'pmhc_b_aa': 1, 'cdr2_b_aa': 1, 'cdr1_b_aa': 1}
tr_cord = TCRrep(
    cell_df = df_cord,
    organism = "human",
    chains = ['beta'],
    cpus = CPUS,
    compute_distances = False)
chunk_size = get_safe_chunk(tr_nn.clone_df.shape[0], tr_cord.clone_df.shape[0])
# Now we can compare the number of neighbors in cord blood to the number of internal neighbors
#tr_nn.compute_sparse_rect_distances(df = tr_nn.clone_df, df2 = tr_cord.clone_df, radius = 72, chunk_size = chunk_size)
#scipy.sparse.save_npz(f'{file}.tr_nn_v_cord.rw_beta_csrmat.npz', tr_nn.rw_beta)
# 12 CPU [10:10, 1.13s/it]
# Here we load the output
tr_nn.rw_beta = scipy.sparse.load_npz(f'{file}.tr_nn_v_cord.rw_beta_csrmat.npz')
# No self-match is possible against the cord blood background, so we do not subtract 1
tr_nn.clone_df['nn_cord'] = _neighbors_sparse_fixed_radius(csrmat = tr_nn.rw_beta, radius = 48)
tr_nn.clone_df['k_nn_cord'] = [len(x) for x in tr_nn.clone_df['nn_cord']]
tr_nn.rw_beta
OUTPUTS:
<6408x901985 sparse matrix of type '<class 'numpy.int16'>'
    with 5271537 stored elements in Compressed Sparse Row format>
After the brute force computation of all vs. all distances, we formulate a model for a null expectation based on the Poisson distribution, where d is the expected number of neighbors, similar to the ALICE algorithm described by Pogorelyy et al., 2019.
However, where ALICE estimates the rate parameter λ based on the sum of Pgen of all edit-distance-1 neighbors of the sequence σ, we estimate λ empirically from the background neighbor frequency. Because these T cells are sampled after thymic selection, we set Q to 1 (Q is a rescaling factor accounting for thymic selection used in ALICE [14]).
from statsmodels.stats.multitest import fdrcorrection
from statsmodels.stats.multitest import multipletests
bulk_n = tr_bulk.clone_df.shape[0]
Q = 1
# Empirical rate: background neighbor frequency (with a pseudocount of 1)
tr_nn.clone_df['lambda'] = (tr_nn.clone_df['k_nn_cord']+1)/tr_cord.clone_df.shape[0]
# Upper-tail Poisson probability of the observed within-repertoire neighbor count
tr_nn.clone_df['poisson'] = tr_nn.clone_df.apply(
    lambda x : 1 - poisson.cdf(x['k_nn'], Q*x['lambda']*bulk_n, loc = 0), axis = 1)
# Multiple-testing adjustments
tr_nn.clone_df['poisson_fdr'] = multipletests(tr_nn.clone_df['poisson'], method = "fdr_bh")[1]
tr_nn.clone_df['poisson_holm'] = multipletests(tr_nn.clone_df['poisson'], method = "holm")[1]
tr_nn.clone_df['poisson_bonferroni'] = multipletests(tr_nn.clone_df['poisson'], method = "bonferroni")[1]
np.sum(tr_nn.clone_df['poisson_fdr'] < 0.001)
OUTPUTS: 1234
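For readers unfamiliar with the "fdr_bh" option, the Benjamini-Hochberg adjustment can be sketched in a few lines of plain Python. This is an illustrative re-implementation written for this tutorial, not the statsmodels code, and the p-values are made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q)
```

Each raw p-value is scaled by the number of tests divided by its rank, so the smallest p-values receive the largest multipliers, while the running minimum keeps the adjusted values ordered consistently with the raw ones.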
tr_nn.clone_df[['v_b_gene','cdr3_b_aa','k_nn','k_nn_cord','poisson_fdr']].sort_values('poisson_fdr')
OUTPUTS:
      v_b_gene       cdr3_b_aa        k_nn  k_nn_cord  poisson_fdr
2916  TRBV12-1*01    CASSFHNYGYTF       21          1          0.0
715   TRBV12-1*01    CASSFSNYGYTF       22          3          0.0
2986  TRBV12-1*01    CASSLVSNQPQHF      14          4          0.0
2826  TRBV12-1*01    CASSLGYEQYF        34         21          0.0
2989  TRBV12-1*01    CASSLSSNYGYTF      22          2          0.0
...   ...            ...               ...        ...          ...
1671  TRBV5-1*01     CASSLGGEETQYF      24        589          1.0
5920  TRBV27*01      CASSLRGGYEQYF      13        462          1.0
5921  TRBV29-1*01    CSVEQETQYF          5        272          1.0
1064  TRBV5-1*01     CASSTTGGTDTQYF      6        235          1.0
2534  TRBV20-1*01    CSARTGTDTQYF       23        743          1.0
[6408 rows x 5 columns]
This output shows the top and bottom rows of the resulting analysis sorted by statistical significance (i.e., lowest FDR-adjusted p-value first). For example, clone 2916 has 21 neighbors within its own repertoire at the 48 tdu threshold used for neighbor counting, whereas only one neighbor was identified in the background cord blood repertoires. In contrast, clone 2534 has 23 neighbors within its own repertoire but 743 in the cord blood, indicating a high likelihood that this cluster arises by chance due to an overall high probability of generation of TCRs in the neighborhood.
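The contrast between clones 2916 and 2534 can be reproduced by hand. The sketch below re-implements the upper-tail Poisson probability in plain Python (mirroring the 1 - cdf form used above rather than calling scipy) and plugs in the repertoire sizes reported in this example: 53,290 bulk clones and 901,985 cord blood clones. The helper names are hypothetical.

```python
import math

BULK_N = 53290    # clones in the bulk repertoire
CORD_N = 901985   # clones in the cord blood background

def poisson_upper_tail(k: int, mu: float) -> float:
    """P(X > k) for X ~ Poisson(mu), computed as 1 - CDF(k)."""
    term = math.exp(-mu)          # P(X = 0)
    cdf = term
    for i in range(1, k + 1):
        term *= mu / i            # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def neighbor_enrichment_p(k_nn: int, k_nn_cord: int, q: float = 1.0) -> float:
    """Unadjusted p-value for observing more than k_nn neighbors in the bulk repertoire."""
    lam = (k_nn_cord + 1) / CORD_N        # background neighbor frequency with pseudocount
    return poisson_upper_tail(k_nn, q * lam * BULK_N)

p_2916 = neighbor_enrichment_p(k_nn=21, k_nn_cord=1)    # expected count is about 0.12
p_2534 = neighbor_enrichment_p(k_nn=23, k_nn_cord=743)  # expected count is about 44
print(p_2916, p_2534)
```

With roughly 0.12 neighbors expected, observing 21 is essentially impossible under the null model, whereas 23 neighbors is fewer than the roughly 44 expected for clone 2534, so its p-value is close to 1.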
3.3.2. Probability of Generation
Next, we estimate the probability of generation (Pgen) for each CDR3. This can be done with the software OLGA [8]. For convenience, we integrated some of the open-source OLGA software’s functionality within tcrdist3. We now compute Pgen estimates for each of the 6408 clones with more than 4 neighbors in the bulk repertoire. Without parallelization, this step takes approximately 3 minutes in the Colab environment.
from tcrdist.pgen import OlgaModel
olga_beta = OlgaModel(chain_folder = "human_T_beta", recomb_type = "VDJ")
tr_nn.clone_df['pgen_cdr3_b_aa'] = olga_beta.compute_aa_cdr3_pgens(CDR3_seq = tr_nn.clone_df.cdr3_b_aa.to_list())
# Creates a file for plotting (see Appendix 5)
tr_nn.clone_df.to_csv(f"{file}.poisson_pgen_productive_frequency.tsv", sep = "\t", index = False)
We can now visualize some interesting features of the bulk repertoire. Fig. 9 considers the approximately 6400 clones with more than 4 neighbors. The size of each circle is scaled to each clone’s productive frequency, so that large circles emphasize the expanded clones in this beta-chain receptor set.
Fig. 9.

Sequence features in an illustrative bulk repertoire. Size of points scaled to the productive frequency of each unique TCRb. Identification of TCRs -- with greater than the expected number of neighbors -- may be due to antigenic selection. (A) Recombined V and J segments with high probability of generation tend to have more neighbors, thus in (B) the y-axis compares the likelihood of the observed number of neighbors based on Poisson model with lambda set to background neighbor frequency.
Fig. 9A clearly shows a correlation between the probability of generation (Pgen) and the number of neighbors of each TCR in the cord blood background. That is, higher Pgen CDR3s tend to have more neighbors because they are closer to germline-encoded sequences. These receptors are more common and public due to shared molecular recombination biases. By contrast, the low Pgen receptors (at the bottom right of Fig. 9A) tend to be longer, with a greater number of randomly inserted nucleotides within the V(D)J junction. They have few neighbors in the cord blood background. The main take-away message from Fig. 9A is that the total number of neighbors a clone has is not, on its own, a reliable metric for differentiating TCRs that may or may not be under antigen selection. Hence, we consider the conditional likelihood that the number of neighbors is greater than the expected number of neighbors, based on the frequency of neighbors in a relevant antigen-naïve background. We refer to this conditional likelihood as the “unexpectedness”, following the terminology used by the authors of the ALICE algorithm [13]. Fig. 9B shows the probability of generation of each clone on the x-axis and the unexpectedness of the within-repertoire neighbor count on the y-axis. Clones of interest are the unexpected clones: those with a higher-than-expected number of neighbors. Moreover, the size of each point is scaled to each clone’s productive frequency, where larger dots indicate clonally expanded TCRs detected in multiple copies. Plotting the data in this way allows for identification of low Pgen, unexpected, and expanded clones, which are in the upper right quadrant of Fig. 9B. The code to make these plots is provided in Appendix 5.
Although this sample came from a convalescent COVID-19 patient, it is reasonable to assume that many of the clones that manifest evidence of antigenic selection may be due to exposure to common human pathogens, such as CMV, EBV, or Influenza. Thus, to the extent possible, one might want to exclude clones with a higher-than-expected number of neighbors if they closely resemble TCRs with existing annotations to recognize these common viral epitopes. In the next code block, we will load a table of TCRs with epitope annotations from VDJdb [19].
f = 'VDJdb-human-mhc1-mhc2-2021-08-20.tsv'
if not os.path.isfile(f):
    os.system(f'wget https://github.com/kmayerb/tcrdist3_book_chapter/raw/main/data/{f}.zip')
    os.system(f'unzip {f}.zip')
fp_vdjdb = f
vdjdb = pd.read_csv(fp_vdjdb, sep = "\t")
# Map VDJdb column names to the names tcrdist3 expects
vdjdb_to_tcrdist = {
    'CDR3':'cdr3_b_aa',
    'V':'v_b_gene',
    'J':'j_b_gene',
    'Score':'score',
    'Species':'species',
    'MHC A':'mhc_a',
    'MHC B':'mhc_b',
    'MHC class':'mhc_class',
    'Epitope':'epitope',
    'Epitope species':'epitope_species'}
vdjdb = vdjdb.rename(columns = vdjdb_to_tcrdist)[list(vdjdb_to_tcrdist.values())]
# Here, we only use TCRs with a quality score greater than 0
vdjdb = vdjdb.query('score > 0').reset_index(drop = True)
tr_vdjdb = TCRrep(
    cell_df = vdjdb,
    organism = 'human',
    chains = ['beta'],
    deduplicate = False,
    compute_distances = False)
Next, we compute distances between the clones in the bulk repertoire with more than 4 neighbors and all of the TCRs in the VDJdb table with quality score > 0.
tr_nn.cpus = 2
chunk_size = get_safe_chunk(tr_nn.clone_df.shape[0], tr_vdjdb.clone_df.shape[0])
tr_nn.compute_sparse_rect_distances(
    df2 = tr_vdjdb.clone_df,
    radius = 72,
    chunk_size = chunk_size)
#100%|██████████| 4/4 [00:37<00:00,9.43s/it]
3.3.3. TCRjoin to VDJdb CDR3 Epitope Annotations
After computing distances, we can implement a join-by-distance as first described in Section 3.2.4.
from tcrdist.join import join_by_dist
df_join_nn = join_by_dist(
    how = 'left',
    csrmat = tr_nn.rw_beta,
    left_df = tr_nn.clone_df,
    right_df = tr_vdjdb.clone_df,
    left_cols = tr_nn.clone_df.columns.to_list(),
    right_cols = tr_vdjdb.clone_df.columns.to_list(),
    left_suffix = '',
    right_suffix = '_vdjdb',
    max_n = 5,
    radius = 24)
# A clone is annotated if at least one VDJdb TCR was joined to it
df_join_nn['annotation'] = df_join_nn['dist'].notna()
df_join_nn[['k_nn', 'k_nn_cord', 'poisson_fdr', 'rank',
            'v_b_gene', 'v_b_gene_vdjdb', 'cdr3_b_aa', 'cdr3_b_aa_vdjdb',
            'mhc_a_vdjdb', 'mhc_b_vdjdb', 'mhc_class_vdjdb',
            'epitope_vdjdb', 'epitope_species_vdjdb', 'dist',
            'annotation', 'valid_cdr3']].\
    query('poisson_fdr < 0.01').\
    groupby(['rank']).\
    head(1)[['annotation','valid_cdr3']].\
    groupby(['annotation']).\
    count()
#             valid_cdr3
# annotation
# False             1932
# True                47
Within the set of TCRs that have a statistically anomalous number of neighbors, we find that 47 of the TCRs have at least 1 prior annotation in VDJdb (among annotations with quality score > 0) within 24 TCRdistance units. Some TCRs may have more than one annotation.
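The logic of a left join by distance can be illustrated without tcrdist3. The following is a minimal sketch under stated assumptions: `join_by_radius` is a hypothetical helper (not the library's join_by_dist), the stored distances are given as a dictionary keyed by (left index, right index), the closest matches are kept first, and the epitope labels are placeholders.

```python
def join_by_radius(left, right, dist, radius, max_n):
    """Left join: pair each left record with up to max_n right records within radius.
    `dist` maps (left_index, right_index) -> distance for stored pairs only."""
    rows = []
    for i, left_row in enumerate(left):
        hits = sorted((d, j) for (li, j), d in dist.items() if li == i and d <= radius)
        if not hits:
            # Unmatched left records are kept, with missing values on the right side
            rows.append({**left_row, "epitope_vdjdb": None, "dist": None})
        for d, j in hits[:max_n]:
            rows.append({**left_row, "epitope_vdjdb": right[j]["epitope"], "dist": d})
    return rows

left = [{"cdr3_b_aa": "CASSLGYEQYF"}, {"cdr3_b_aa": "CSARTGTDTQYF"}]
right = [{"epitope": "epitope_A"}, {"epitope": "epitope_B"}]   # placeholder annotations
dist = {(0, 0): 12, (0, 1): 36}                               # toy distances
rows = join_by_radius(left, right, dist, radius=24, max_n=5)
annotated = [row["dist"] is not None for row in rows]
print(rows, annotated)
```

Because one left record can match several right records, the joined table may contain multiple rows per clone, which is why the analysis above groups by rank and keeps one row per clone before counting.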
df_join_nn[['k_nn', 'k_nn_cord', 'poisson_fdr', 'rank',
            'v_b_gene', 'v_b_gene_vdjdb', 'cdr3_b_aa', 'cdr3_b_aa_vdjdb',
            'mhc_a_vdjdb', 'mhc_b_vdjdb', 'mhc_class_vdjdb',
            'epitope_vdjdb', 'epitope_species_vdjdb', 'dist',
            'annotation', 'valid_cdr3']].\
    query('poisson_fdr < 0.01')\
    [['annotation', 'valid_cdr3', 'epitope_species_vdjdb',
      'mhc_a_vdjdb', 'mhc_b_vdjdb', 'mhc_class_vdjdb']].\
    groupby(['annotation', 'epitope_species_vdjdb',
             'mhc_a_vdjdb', 'mhc_b_vdjdb', 'mhc_class_vdjdb']).\
    count()
The sample came from an individual with the following genotype.
A*02:01:01, A*23:01:01
B*07:02:01, B*18:01:01
C*04:01:01, C*07:02:01
DPA1*01:03:01, DPA1*02:01:02
DPB1*01:01:01, DPB1*04:01:01
DQA1*01:02:01, DQA1*01:04:01
DQB1*05:03:01, DQB1*06:02:01
DRB1*14:54:01, DRB1*15:01:01
DRB3*02:02:01, DRB5*01:01:01
One could remove only those TCRs with a pMHC annotation matching one or more alleles in the donor’s genotype, or be more conservative and exclude any TCR highly similar to an already annotated TCR. After removing the TCRs with a pre-existing annotation to a common viral epitope, we are left with approximately 2000 TCRs which, based on neighbor enrichment statistics, could represent TCRs under antigenic selection following the patient’s SARS-CoV-2 infection.
df_join_nn.query('annotation == False').\
    query('poisson_fdr < 0.01').\
    sort_values('poisson_fdr')\
    [['k_nn', 'k_nn_cord', 'poisson_fdr', 'rank', 'v_b_gene', 'j_b_gene', 'cdr3_b_aa']]
OUTPUTS:
      k_nn  k_nn_cord  poisson_fdr  ...  v_b_gene      j_b_gene     cdr3_b_aa
2778    17          9     0.000000  ...  TRBV12-1*01   TRBJ1-2*01   CASSPDSNYGYTF
2849    14          7     0.000000  ...  TRBV12-1*01   TRBJ2-7*01   CASSLAGFSYEQYF
2852    13          1     0.000000  ...  TRBV12-1*01   TRBJ2-7*01   CASSLAGVEYEQYF
2857    10          1     0.000000  ...  TRBV12-1*01   TRBJ1-2*01   CASSLNRNYGYTF
2747    17          2     0.000000  ...  TRBV12-1*01   TRBJ2-7*01   CASSLQNEQYF
...    ...        ...          ...  ...  ...           ...          ...
6200     8         48     0.009805  ...  TRBV27*01     TRBJ1-1*01   CASSWGRNTEAFF
4004     8         48     0.009805  ...  TRBV7-9*01    TRBJ1-4*01   CASSLGDSNEKLFF
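The more selective option mentioned above, removing only TCRs whose annotated restricting allele is carried by the donor, could be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, only the donor's class I alleles are listed, annotations are assumed to look like "HLA-A*02:01", and alleles are compared at two-field resolution.

```python
DONOR_CLASS_I = ["A*02:01:01", "A*23:01:01", "B*07:02:01",
                 "B*18:01:01", "C*04:01:01", "C*07:02:01"]

def two_field(allele: str) -> str:
    """Reduce an allele name to two fields, e.g. 'HLA-A*02:01:01' -> 'A*02:01'."""
    name = allele.strip()
    if name.startswith("HLA-"):
        name = name[len("HLA-"):]
    gene, _, fields = name.partition("*")
    return gene + "*" + ":".join(fields.split(":")[:2])

DONOR_TWO_FIELD = {two_field(a) for a in DONOR_CLASS_I}

def restricted_by_donor_allele(mhc_annotation) -> bool:
    """True if the annotated MHC allele matches a donor allele at two-field resolution."""
    if not mhc_annotation:
        return False
    return two_field(mhc_annotation) in DONOR_TWO_FIELD

for annotation in ["HLA-A*02:01", "HLA-B*08:01", None]:
    print(annotation, restricted_by_donor_allele(annotation))
```

Annotations recorded at lower resolution (for example "HLA-A*02") would not match under this rule, so a real analysis would need an explicit policy for such cases.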
3.3.4. MIRA-Matched Clones
This dataset is unique in that the donor also contributed samples for a Multiplex Identification of Antigen-Specific T Cell Receptors Assay (MIRA) experiment (experimental identifier: eLH42). This provides an opportunity to examine whether the procedure above succeeded in identifying SARS-CoV-2-specific TCRs that were also annotated as recognizing a SARS-CoV-2 peptide by the in vitro activation marker assays.
We load the MIRA MHC-I in vitro predictions generated from this donor and convert gene names from ImmunoSEQ to IMGT nomenclature.
from tcrdist.swap_gene_name import adaptive_to_imgt
if not os.path.isfile('eLH42_mch_i_mira_tcrs.tsv'):
    os.system('wget https://github.com/kmayerb/tcrdist3_book_chapter/raw/main/data/eLH42_mch_i_mira_tcrs.tsv')
df_eLH42 = pd.read_csv('eLH42_mch_i_mira_tcrs.tsv', sep = "\t")
# Convert ImmunoSEQ gene names to IMGT nomenclature
df_eLH42['v_b_gene'] = df_eLH42['v'].apply(lambda x : adaptive_to_imgt['human'].get(x))
df_eLH42['j_b_gene'] = df_eLH42['j'].apply(lambda x : adaptive_to_imgt['human'].get(x))
We use an “inner” join to find clones in the bulk repertoire (among those with > 4 neighbors, i.e., tr_nn.clone_df) matching the MIRA-annotated clones. An “inner” join drops all clones that lack a match to a MIRA clone. Because a clone could match multiple MIRA clones, we remove duplicates by grouping on the uniquely identifying columns (i.e., groupby([‘v_b_gene’,’j_b_gene’,’cdr3_b_aa’,’rank’])) and taking the first entry (i.e., head(1)). The shape[0] attribute gives the number of rows. To find unmatched clones, we use a “left” join instead of an “inner” join and keep only those rows that did not have a match (i.e., we set indicator = True and then keep only rows where _merge == “left_only”).
from scipy.stats import fisher_exact
# CLONES WITH MIRA CONFIRMATION
# Compute number of MIRA matched clones matching a clone with > 4 within repertoire neighbors
N_matched_clones = \
    tr_nn.clone_df.merge(df_eLH42, how = "inner", on = ['v_b_gene','j_b_gene','cdr3_b_aa']).\
    groupby(['v_b_gene','j_b_gene','cdr3_b_aa','rank']).head(1).\
    shape[0]
# MIRA matched clones matching a clone with > 4 within repertoire neighbors AND FDR < 0.01
N_matched_clones_fdr_lt_01 = \
    tr_nn.clone_df.query('poisson_fdr < 0.01').\
    merge(df_eLH42, how = "inner", on = ['v_b_gene','j_b_gene','cdr3_b_aa']).\
    groupby(['v_b_gene','j_b_gene','cdr3_b_aa','rank']).head(1).\
    shape[0]
# MIRA matched clones matching a clone with > 4 within repertoire neighbors AND FDR >= 0.01
N_matched_clones_fdr_gte_01 = \
    tr_nn.clone_df.query('poisson_fdr >= 0.01').\
    merge(df_eLH42, how = "inner", on = ['v_b_gene','j_b_gene','cdr3_b_aa']).\
    groupby(['v_b_gene','j_b_gene','cdr3_b_aa','rank']).head(1).\
    shape[0]
# CLONES WITHOUT MIRA CONFIRMATION
# MIRA unmatched clones with > 4 within repertoire neighbors AND FDR < 0.01
N_unmatched_clones_fdr_lt_01 = \
    tr_nn.clone_df.query('poisson_fdr < 0.01').\
    merge(df_eLH42, how = "left", on = ['v_b_gene','j_b_gene','cdr3_b_aa'], indicator = True).\
    query("_merge == 'left_only'").\
    groupby(['v_b_gene','j_b_gene','cdr3_b_aa','rank']).head(1).\
    shape[0]
# MIRA unmatched clones with > 4 within repertoire neighbors AND FDR >= 0.01
N_unmatched_clones_fdr_gte_01 = \
    tr_nn.clone_df.query('poisson_fdr >= 0.01').\
    merge(df_eLH42, how = "left", on = ['v_b_gene','j_b_gene','cdr3_b_aa'], indicator = True).\
    query("_merge == 'left_only'").\
    groupby(['v_b_gene','j_b_gene','cdr3_b_aa','rank']).head(1).\
    shape[0]
print(f"-- Considering tr_nn (k > 4) -- ")
print(f"\tnumber_of_mira_matched_clones : {N_matched_clones}")
print(f"\tnumber_of_mira_matched_clones, fdr < 0.01 : {N_matched_clones_fdr_lt_01}")
print(f"\tnumber_of_mira_matched_clones, fdr >= 0.01 : {N_matched_clones_fdr_gte_01}")
print(f"\tnumber_of_unmatched_clones, fdr < 0.01: {N_unmatched_clones_fdr_lt_01}")
print(f"\tnumber_of_unmatched_clones, fdr > 0.01: {N_unmatched_clones_fdr_gte_01}")
# 2x2 table: rows are MIRA matched / unmatched, columns are FDR >= 0.01 / < 0.01
table = np.array([[N_matched_clones_fdr_gte_01, N_matched_clones_fdr_lt_01],
                  [N_unmatched_clones_fdr_gte_01, N_unmatched_clones_fdr_lt_01]])
#table = np.array([[19, 38], [4410, 1941]])
oddsr, p = fisher_exact(table, alternative = 'two-sided')
print("FDR:\n>=0.01\t<0.01")
print([N_matched_clones_fdr_gte_01, N_matched_clones_fdr_lt_01], "\tMIRA MATCHED")
print([N_unmatched_clones_fdr_gte_01, N_unmatched_clones_fdr_lt_01], "\tMIRA UNMATCHED")
print(f"FISHER'S p-value: {p}")
OUTPUTS:
-- Considering tr_nn (k > 4) --
    number_of_mira_matched_clones : 57
    number_of_mira_matched_clones, fdr < 0.01 : 38
    number_of_mira_matched_clones, fdr >= 0.01 : 19
    number_of_unmatched_clones, fdr < 0.01: 1941
    number_of_unmatched_clones, fdr > 0.01: 4410
FDR:
>=0.01  <0.01
[19, 38]        MIRA MATCHED
[4410, 1941]    MIRA UNMATCHED
FISHER'S p-value: 4.258871221083108e-08
The presence of SARS-CoV-2 MIRA-matched clones that also have a higher-than-expected neighbor frequency suggests that statistical degree-enrichment methods such as ALICE, TCRNET, and the TCRdist-based method above can be useful for identifying clones that are under antigenic selection from a recent natural infection. Are receptors with a statistically unexpected number of neighbors (FDR < 0.01) more likely to also show evidence of activation by a SARS-CoV-2 peptide in the MIRA assays? Yes. We find some evidence that this is the case.
The significant result of the Fisher’s Exact Test indicates that there is a non-random association between clones predicted to be under antigenic selection by the neighbor enrichment method and clones identified as antigen-specific in vitro by MIRA (Table 2). However, most clones (1941 out of 1979) with a statistically significant signal of neighbor enrichment had no confirmatory MIRA result. This could be due, in part, to the lack of MIRA identification of MHC-II-restricted CD4 TCRs, which were not captured in the MIRA data considered here. Also, antigenic selection need not lead to a highly polyclonal response. Thus, MIRA may correctly identify a single expanded clone that has too few neighbors to exceed the detection limit in this bulk repertoire.
Table 2:
Fisher’s Exact Test with clones with network degree enrichment and identified by MIRA
| For clones with > 4 neighbors (n = 6408) | Poisson FDR q-value >= 0.01 (clones with fewer neighbors) | Poisson FDR q-value < 0.01 (clones with more neighbors than expected) |
|---|---|---|
| Clone identified by CD8+ MIRA | 19 | 38 |
| Not MIRA-identified | 4410 | 1941 |
| Fisher’s Test p-value | 4e-8 | |
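For readers who want to see what the test computes, the sketch below evaluates the 2x2 table from Table 2 with a plain-Python two-sided Fisher's exact test. It is an illustrative re-implementation (hypergeometric probabilities summed over all tables no more likely than the observed one), not the scipy code, and the function names are introduced here for the example.

```python
import math

def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c

    def log_prob(x):
        # Hypergeometric probability that the top-left cell equals x, margins fixed
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(row1 + row2, col1)

    log_observed = log_prob(a)
    total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(x)
        if lp <= log_observed + 1e-7:   # tables at least as extreme as the observed one
            total += math.exp(lp)
    return min(1.0, total)

table = [[19, 38], [4410, 1941]]   # rows: MIRA matched / unmatched; columns: FDR >= 0.01 / < 0.01
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
p_value = fisher_two_sided(table)
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.3g}")
```

An odds ratio of about 0.22 for falling in the FDR >= 0.01 column means that MIRA-matched clones are several times more likely than unmatched clones to fall in the FDR < 0.01 column.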
However, the intersection of the clones identified by the two methods is likely to yield higher-confidence assignment of antigen-specific clones than either method alone, which may be useful for prioritizing TCRs to clone in further experimental validation studies, since (i) MIRA antigen assignment is not yet a widely used method and (ii) network-based enrichment statistics can produce false positives based on discrepancies in gene-calling or other sample-specific idiosyncrasies not captured by available background repertoires.
In Example 3, we illustrated how tcrdist3 could be used to customize an evaluation of a bulk unannotated TCR repertoire consisting of approximately 50,000 clones. We found that about 1941 clones (3%) of the bulk repertoire had modest evidence of antigenic selection by neighbor enrichment. Let us summarize the multiple steps we took in Example 3.
Customized the distance metric for single-chain analysis. We changed the weighting on the CDR3 from 3X to 6X. This allows finding neighboring TCRs as either (i) TCRs with identical V-gene-encoded CDRs and CDR3s with 0–3 substitutions or (ii) TCRs with slightly different V-gene-encoded CDRs and CDR3s with 0–1 substitutions.
Identified the number of neighbors (highly similar TCRs with different nucleotide sequences, TCRdist <= 48) for each clone.
Made a subset of the bulk repertoire containing clones with more than 4 neighbors (TCRdist <= 48).
Used a Poisson model to assess whether the number of neighbors within the repertoire was statistically unexpected based on the number of neighbors each TCR had in a 1 million clone umbilical cord background.
Identified all clones within TCRdist 24 of a TCR beta chain previously annotated in VDJdb (quality score > 0) and likely associated with a common pathogen (e.g., CMV, EBV, HIV-1, and Flu epitopes), represented in Fig. 10 as black circles.
Cross-referenced clones with experimental MIRA annotations from a sample taken from the same donor, represented in Fig. 10 as red triangles.
Fig. 10:

TCR beta chains with at least 3 neighbors (TCRdist (6:1:1:1) <= 48 TCRdist units) compared with prior MIRA and VDJdb annotations. Red triangles indicate clones that were also identified by a MIRA assay to be activated by a SARS-CoV-2 MHC-I peptide. Black circles indicate clones within TCRdist (6:1:1:1) <= 24 of previously annotated non-SARS-CoV-2 TCRs for other common antigens in CMV, EBV, HIV-1, or Influenza in VDJdb (Score > 0, August 2021).
Fig. 10 visualizes the results of this multi-part analysis. It is encouraging that many of the most expanded, lower Pgen TCRs with a higher-than-expected number of neighboring TCRs were also identified as antigen-specific by the MIRA methodology (as indicated by the red triangles in the upper left portion of Fig. 10 and in Fig. 11). Combining methodologies provides a further validation of MIRA-based binding predictions and allowed for the prioritization of approximately 2% of clones within the bulk repertoire as possible candidates for clones under antigen selection.
Fig 11.

Network based on edges formed between clones with TCR beta-chain edit distance less than or equal to 1.
There are multiple reasons why a TCR motif under antigen selection (i.e., one with a higher-than-expected number of within-repertoire neighbors) may not have been identified by the MIRA assay. Some may be SARS-CoV-2-specific TCRs that recognize MHC-I-presented peptides that were not included in the MIRA stimulations. Moreover, some may be CD4 cells recognizing MHC-II-presented SARS-CoV-2 epitopes, which would not be associated with the CD8 activation markers used in the MIRA experiments. It is also likely that many of the clones with a statistically anomalous neighbor frequency are due to antigenic selection from exposures to antigens other than SARS-CoV-2, as we showed for some cases (e.g., EBV- and CMV-identified clones) in Section 3.3.3. Nonetheless, the clones identified as under antigenic selection can be thought of as forming a useful hypothesis set that could be evaluated in longitudinal samples pre- and post-exposure or compared with clones from other donors sharing (i) one or more of the relevant HLA alleles and (ii) recent exposures to the pathogen of interest.
This example illustrates the flexibility of tcrdist3. It enables integration of computational and experimental methods for interrogating bulk TCR RepSeq data. As multiplex methods for mapping TCRs to antigen specificity are refined, integrated analysis of bulk repertoires with activation marker enriched sub-repertoires may lead to higher confidence annotations.
4. Appendices
4.1. Appendix A - Using Sparse Format
Appendix A illustrates the same analysis as in section 3.1 but here we use sparse rather than full matrices.
Just as we formed networks from full pairwise matrices (i.e., tr.pw_beta+tr.pw_alpha):
from tcrdist.public import _neighbors_fixed_radius
network = list()
edge_threshold = 120
for i,n in enumerate(_neighbors_fixed_radius(tr.pw_beta+tr.pw_alpha, edge_threshold)):
    for j in n:
        if i != j:
            network.append((
                i, # 'node_1' - row index
                j, # 'node_2' - column index
                (tr.pw_beta+tr.pw_alpha)[i,j]))
We can do the same with a scipy.sparse CSR matrix, which can also be used to form a network. Note that the only differences are (i) the need to use the add_sparse_pwd function to add two sparse matrices together, (ii) the use of the function _neighbors_sparse_fixed_radius in place of _neighbors_fixed_radius, and (iii) the resetting of −1 to 0 when specifying distances. Recall that in sparse format, tcrdist3 represents pairs of receptors with TCRdist 0 as −1.
from tcrdist.public import _neighbors_sparse_fixed_radius
from tcrdist.sparse import add_sparse_pwd
tr.rw_alpha_beta = add_sparse_pwd(tr.rw_beta, tr.rw_alpha) # <- NOTE (i)
network = list()
edge_threshold = 120
for i,n in enumerate(_neighbors_sparse_fixed_radius(tr.rw_alpha_beta, edge_threshold)): # <- NOTE (ii)
    for j in n:
        if i != j:
            network.append((
                i,
                j,
                (tr.rw_alpha_beta)[i,j]))
cols = ['node_1', 'node_2', 'dist']
df_net = pd.DataFrame(network, columns = cols)
# <- NOTE (iii) WE MUST CONVERT BACK FROM -1 TO 0
df_net['dist'] = df_net['dist'].apply(lambda x: 0 if x == -1 else x)
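The reason for the −1 convention is that a sparse matrix cannot distinguish a stored zero from an absent entry. The toy sketch below illustrates the idea with plain dictionaries. It is not the add_sparse_pwd implementation: the helper names are hypothetical, and the choice to keep only pairs stored in both inputs is an assumption made for this illustration.

```python
def encode(distance: int) -> int:
    """Store a true zero distance as -1 so it is not confused with 'absent'."""
    return -1 if distance == 0 else distance

def decode(stored: int) -> int:
    """Recover the true distance from its stored value."""
    return 0 if stored == -1 else stored

def add_sparse(a: dict, b: dict) -> dict:
    """Sum two sparse distance dictionaries, keeping only pairs stored in both."""
    return {key: encode(decode(a[key]) + decode(b[key])) for key in a.keys() & b.keys()}

# Toy alpha- and beta-chain distances keyed by (row, column)
alpha = {(0, 0): -1, (0, 1): 24, (1, 1): -1}
beta = {(0, 0): -1, (0, 1): -1, (1, 0): 36, (1, 1): -1}
combined = add_sparse(alpha, beta)
edges = {key: decode(value) for key, value in combined.items()}
print(edges)
```

Pair (1, 0) is dropped because its alpha-chain distance was never stored, so the paired-chain distance cannot be computed from the sparse inputs alone.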
In the following full example, we repeat the analysis from section 3.1 of this chapter using sparse matrices, which can be useful when constructing a network from a larger repertoire. The example otherwise follows as before.
import os
import pandas as pd
import networkx as nx
import community.community_louvain as community_louvain
from tcrdist.repertoire import TCRrep
from tcrdist.public import _neighbors_sparse_fixed_radius
from tcrdist.sparse import add_sparse_pwd
path = ''
f = 'clonotypes_minervina.tsv'
epitopes = ["A02_YLQ"] # Note only this epitope will be considered
edge_threshold = 120
fp = os.path.join(path,f)
df = pd.read_csv(fp, sep = "\t")
df = df[ df.epitope.isin(epitopes)].reset_index(drop = True)
df = df.rename(columns = {
    'cdr3b':'cdr3_b_aa',
    'vb':'v_b_gene',
    'jb':'j_b_gene',
    'cdr3a':'cdr3_a_aa',
    'va':'v_a_gene',
    'ja':'j_a_gene',
    'donor':'subject'} )
df['v_a_gene'] = df['v_a_gene'].apply(lambda x: f"{x}*01")
df['v_b_gene'] = df['v_b_gene'].apply(lambda x: f"{x}*01")
df['j_a_gene'] = df['j_a_gene'].apply(lambda x: f"{x}*01")
df['j_b_gene'] = df['j_b_gene'].apply(lambda x: f"{x}*01")
df['count'] = 1
tr = TCRrep(cell_df = df[['subject','epitope','cdr3_a_aa',
                          'v_a_gene','j_a_gene','cdr3_b_aa',
                          'v_b_gene','j_b_gene','category',
                          'count','cdr3a_nt','cdr3b_nt']],
            organism = 'human',
            chains = ['alpha','beta'],
            deduplicate = True,
            compute_distances = False)
# APPENDIX A: NOTE THAT WE CAN COMPUTE ALL PAIRWISE DISTANCES IN SPARSE FORMAT
tr.compute_sparse_rect_distances(df = tr.clone_df, df2 = tr.clone_df, radius = 1000)
# APPENDIX A: NOTE THAT WE MUST USE THE FUNCTION add_sparse_pwd TO COMBINE TWO SPARSE MATRICES
tr.rw_alpha_beta = add_sparse_pwd(tr.rw_beta, tr.rw_alpha)
edge_threshold = 120
network = list()
for i,n in enumerate(_neighbors_sparse_fixed_radius(tr.rw_alpha_beta, edge_threshold)):
    for j in n:
        if i != j:
            network.append((
                i, # 'node_1' - row index
                j, # 'node_2' - column index
                (tr.rw_alpha_beta)[i,j] # 'dist' - gets the distance between TCR(i,j)
            ))
cols = ['node_1', 'node_2', 'dist']
df_net = pd.DataFrame(network, columns = cols)
df_net['dist'] = df_net['dist'].apply(lambda x: 0 if x == -1 else x) # <- NOTE WE MUST CONVERT BACK FROM -1 to 0
df_net['weight'] = edge_threshold - df_net['dist']
G = nx.from_pandas_edgelist(pd.DataFrame({'source' : df_net['node_1'],
                                          'target' : df_net['node_2'],
                                          'weight' : df_net['weight']}))
partition = community_louvain.best_partition(G)
# Relabel clusters so that cluster 0 is the largest, cluster 1 the next largest, etc.
partitions_by_cluster_size = list(pd.Series(partition.values()).value_counts().index)
partition_reorder = {id:rank for id,rank in zip(partitions_by_cluster_size, range(len(partitions_by_cluster_size)))}
partition = {k:partition_reorder.get(v) for k,v in partition.items()}
from tcrdist.html_colors import get_html_colors
clusters = [i for i in pd.Series(partition.values()).value_counts().index]
colors = get_html_colors(len(clusters))
cluster_to_color = {cluster:color for cluster,color in zip(clusters,colors)}
options = {"edgecolors": "tab:gray", "node_size": 50}
pos = nx.spring_layout(G, seed=2, k = .3)
nx.draw(G, nodelist = G.nodes, pos = pos, node_color=[cluster_to_color.get(partition.get(i)) for i in G.nodes], **options)
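The cluster-relabeling step near the end of the listing above can be hard to follow on first reading. A minimal sketch of the same idea, using only the standard library and a toy partition dictionary in place of the Louvain output (the function name rank_clusters_by_size is hypothetical):

```python
from collections import Counter

def rank_clusters_by_size(partition):
    """Relabel cluster ids so that 0 is the largest cluster, 1 the next, etc.

    `partition` maps node -> arbitrary cluster id, as returned by a
    community-detection routine.
    """
    sizes = Counter(partition.values())
    ranked_ids = [cluster_id for cluster_id, _ in sizes.most_common()]
    relabel = {cluster_id: rank for rank, cluster_id in enumerate(ranked_ids)}
    return {node: relabel[cluster_id] for node, cluster_id in partition.items()}

# Toy partition: cluster 7 has three nodes, cluster 3 has two, cluster 9 has one
toy_partition = {0: 7, 1: 7, 2: 3, 3: 7, 4: 3, 5: 9}
ranked = rank_clusters_by_size(toy_partition)
print(ranked)
```

Ranking clusters this way makes the color assignment stable in the sense that the largest cluster always receives the first color in the palette.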
4.2. Appendix B - Edit Distance
In Appendix B we illustrate how to compute CDR3 edit-distance networks instead of using TCRdistances. We start with the preprocessing steps described in detail in section 3.1.2. Note that we have set the compute_distances argument in the TCRrep initialization to False.
import os
import pandas as pd
import networkx as nx
import community.community_louvain as community_louvain
from tcrdist.repertoire import TCRrep
from tcrdist.public import _neighbors_sparse_fixed_radius
from tcrdist.sparse import add_sparse_pwd
import pwseqdist as pw
path = ''
f = 'clonotypes_minervina.tsv'
epitopes = ["A02_YLQ"] # Note only this epitope will be considered
edge_threshold = 120
fp = os.path.join(path,f)
df = pd.read_csv(fp, sep = "\t")
df = df[ df.epitope.isin(epitopes)].reset_index(drop = True)
df = df.rename(columns = {
    'cdr3b':'cdr3_b_aa',
    'vb':'v_b_gene',
    'jb':'j_b_gene',
    'cdr3a':'cdr3_a_aa',
    'va':'v_a_gene',
    'ja':'j_a_gene',
    'donor':'subject'} )
df['v_a_gene'] = df['v_a_gene'].apply(lambda x: f"{x}*01")
df['v_b_gene'] = df['v_b_gene'].apply(lambda x: f"{x}*01")
df['j_a_gene'] = df['j_a_gene'].apply(lambda x: f"{x}*01")
df['j_b_gene'] = df['j_b_gene'].apply(lambda x: f"{x}*01")
df['count'] = 1
tr = TCRrep(cell_df = df[['subject','epitope','cdr3_a_aa','v_a_gene','j_a_gene','cdr3_b_aa','v_b_gene','j_b_gene','category','count','cdr3a_nt','cdr3b_nt']],
            organism = 'human',
            chains = ['alpha','beta'],
            deduplicate = True,
            compute_distances = False)
Before computing distances, we further configure the TCRrep instance. To compute a custom distance, we configure the keyword arguments (kargs_a and kargs_b), weights, and metrics for each chain. More details about customizing the TCRdist metric are described on the tcrdist3 package documentation page (https://tcrdist3.readthedocs.io).
# To compute edit distances instead of TCRdists we must configure the correct keyword arguments, weights, and metrics.
kargs_a = {
    'cdr3_a_aa' : {'use_numba': True},
    'pmhc_a_aa' : {'use_numba': True},
    'cdr2_a_aa' : {'use_numba': True},
    'cdr1_a_aa' : {'use_numba': True}
}
# Only the CDR3 contributes to the distance; all other CDRs get weight 0
weights_a = {
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 0,
    "cdr2_a_aa" : 0,
    "cdr1_a_aa" : 0}
metrics_a = {
    "cdr3_a_aa" : pw.metrics.nb_vector_editdistance,
    "pmhc_a_aa" : pw.metrics.nb_vector_editdistance,
    "cdr2_a_aa" : pw.metrics.nb_vector_editdistance,
    "cdr1_a_aa" : pw.metrics.nb_vector_editdistance }
kargs_b = {
    'cdr3_b_aa' : {'use_numba': True},
    'pmhc_b_aa' : {'use_numba': True},
    'cdr2_b_aa' : {'use_numba': True},
    'cdr1_b_aa' : {'use_numba': True}
}
weights_b = {
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 0,
    "cdr2_b_aa" : 0,
    "cdr1_b_aa" : 0}
metrics_b = {
    "cdr3_b_aa" : pw.metrics.nb_vector_editdistance,
    "pmhc_b_aa" : pw.metrics.nb_vector_editdistance,
    "cdr2_b_aa" : pw.metrics.nb_vector_editdistance,
    "cdr1_b_aa" : pw.metrics.nb_vector_editdistance }
tr.weights_a = weights_a
tr.metrics_a = metrics_a
tr.kargs_a = kargs_a
tr.weights_b = weights_b
tr.metrics_b = metrics_b
tr.kargs_b = kargs_b
tr.compute_sparse_rect_distances(df = tr.clone_df, df2 = tr.clone_df, radius = 10)
tr.rw_alpha_beta = add_sparse_pwd(tr.rw_beta, tr.rw_alpha)
We can now form a network based on beta-chain or alpha-chain edit distance instead of TCRdist as is illustrated in the code block below.
edge_threshold = 1
network = list()
for i,n in enumerate(_neighbors_sparse_fixed_radius(tr.rw_beta, edge_threshold)):
    for j in n:
        if i != j:
            network.append((
                i, # 'node_1' - row index
                j, # 'node_2' - column index
                tr.rw_beta[i,j], # 'dist' - beta-chain edit distance between TCR(i,j)
                tr.rw_alpha[i,j] # 'dist_alpha' - alpha-chain edit distance between TCR(i,j)
            ))
cols = ['node_1', 'node_2', 'dist', 'dist_alpha']
df_net = pd.DataFrame(network, columns = cols)
df_net['dist'] = df_net['dist'].apply(lambda x: 0 if x == -1 else x) # <- NOTE WE MUST CONVERT BACK FROM -1 to 0
df_net['weight'] = edge_threshold - df_net['dist']
G = nx.from_pandas_edgelist(pd.DataFrame(
    {'source' : df_net['node_1'],
     'target' : df_net['node_2'],
     'weight' : df_net['weight']}))
partition = community_louvain.best_partition(G)
partitions_by_cluster_size = list(pd.Series(partition.values()).value_counts().index)
partition_reorder = {id:rank for id,rank in zip(partitions_by_cluster_size, range(len(partitions_by_cluster_size)))}
partition = {k:partition_reorder.get(v) for k,v in partition.items()}
from tcrdist.html_colors import get_html_colors
clusters = [i for i in pd.Series(partition.values()).value_counts().index]
colors = get_html_colors(len(clusters))
cluster_to_color = {cluster:color for cluster,color in zip(clusters,colors)}
options = {"edgecolors": "tab:gray", "node_size": 50}
pos = nx.spring_layout(G, seed=2, k = .3)
nx.draw(G, nodelist = G.nodes, pos = pos, node_color=[cluster_to_color.get(partition.get(i)) for i in G.nodes], **options)
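To make the quantity used in this appendix concrete, the following is a plain-Python sketch of the Levenshtein edit distance applied to a few CDR3 strings. It is a stand-in for illustration only and is not the numba implementation in pwseqdist; the CDR3 sequences are made-up examples and the function name edit_distance is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance computed by dynamic programming, row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution (or match)
        prev = curr
    return prev[-1]

# Illustrative CDR3 beta sequences (not taken from the chapter's data)
cdr3s = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSLGGAYEQYF", "CASSPTGNTEAFF"]
edge_threshold = 1

# Keep pairs whose CDR3 edit distance is within the threshold
edges = []
for i in range(len(cdr3s)):
    for j in range(i + 1, len(cdr3s)):
        d = edit_distance(cdr3s[i], cdr3s[j])
        if d <= edge_threshold:
            edges.append((i, j, d))
print(edges)
```

With an edge threshold of 1, the first sequence is linked to the second and third (one substitution each), whereas the second and third differ from each other at two positions and are not directly linked.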
4.3. Appendix C - Adaptive and AIRR to IMGT Gene Names
Adaptive Biotechnologies uses a gene naming convention that differs from the IMGT nomenclature. This poses a formatting challenge when using ImmunoSEQ files as inputs to tcrdist3. The following script provides an example of how to prepare an ImmunoSEQ-formatted file for tcrdist3.
# Check that the file for example 1 is available. If not, download it.
import os
f = 'Adaptive2020.tsv'
url = 'https://raw.githubusercontent.com/kmayerb/tcrdist3/master/Adaptive2020.tsv'
if not os.path.isfile(f):
    os.system(f'wget {url}')
# Example of converting an Adaptive ImmunoSEQ file containing
# a bulk repertoire into a tcrdist3-ready DataFrame
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.swap_gene_name import adaptive_to_imgt
# An example Adaptive ImmunoSEQ beta TCRseq input file
filename = 'Adaptive2020.tsv'
chain = 'beta' # specify the relevant chain
organism = 'human' # specify the relevant organism
bulk_df = pd.read_csv(filename, sep = "\t")
# Correct names will depend on TCR chain
item_names = {'alpha' : ["cdr3_a_aa","v_a_gene","j_a_gene","cdr3_a_nucseq"],
              'beta'  : ["cdr3_b_aa","v_b_gene","j_b_gene","cdr3_b_nucseq"],
              'gamma' : ["cdr3_g_aa","v_g_gene","j_g_gene","cdr3_g_nucseq"],
              'delta' : ["cdr3_d_aa","v_d_gene","j_d_gene","cdr3_d_nucseq"]}[chain]
# Parse bio-identity
ns = {0:"cdr3_aa", 1:"v_gene", 2:"j_gene"}
# Expand bio_identity column to 3-column cdr3,v,j
cdr_v_j = bulk_df['bio_identity'].str.split("+", expand = True).\
    rename(columns = lambda x: ns[x])
bulk_df[[item_names[0], 'v_gene', 'j_gene']] = cdr_v_j
# Convert names from Adaptive to IMGT
bulk_df[item_names[1]] = bulk_df['v_gene'].\
    apply(lambda x : adaptive_to_imgt[organism].get(x))
bulk_df[item_names[2]] = bulk_df['j_gene'].\
    apply(lambda x : adaptive_to_imgt[organism].get(x))
# Validate CDR3
def _valid_cdr3(cdr3):
    """ Return True if all amino acids are part of standard amino acid list"""
    if not isinstance(cdr3, str):
        return False
    else:
        amino_acids = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
        valid = np.all([aa in amino_acids for aa in cdr3])
        return valid
bulk_df['valid_cdr3'] = bulk_df[item_names[0]].apply(lambda cdr3: _valid_cdr3(cdr3))
bulk_df = bulk_df[bulk_df['valid_cdr3']]
bulk_df['productive_frequency'] = pd.to_numeric(bulk_df['productive_frequency'], errors='coerce')
bulk_df['count'] = pd.to_numeric(bulk_df['templates'], errors='coerce')
bulk_df['subject'] = filename
bulk_df[item_names[3]] = bulk_df['rearrangement']
bulk_df = bulk_df[[ item_names[0], item_names[1], item_names[2], item_names[3], 'productive_frequency', 'count', 'subject']]
# Test that we can initialize a TCRrep instance with the bulk DataFrame
tr = TCRrep(cell_df = bulk_df,
            organism = 'human',
            chains = ['beta'],
            compute_distances = False)
tr.clone_df.head()
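The core of the conversion above is (i) splitting each bio_identity string on "+" into CDR3, V gene, and J gene, (ii) looking up the IMGT name, and (iii) discarding CDR3s with non-standard characters. The sketch below isolates those three steps on single records using only the standard library. The records and the two-entry lookup table toy_adaptive_to_imgt are illustrative assumptions standing in for real input and for the adaptive_to_imgt dictionary shipped with tcrdist3; the helper names are hypothetical.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical stand-in for tcrdist3's adaptive_to_imgt['human'] lookup
toy_adaptive_to_imgt = {
    "TCRBV07-09": "TRBV7-9*01",
    "TCRBJ02-07": "TRBJ2-7*01",
}

def valid_cdr3(cdr3):
    """True if cdr3 is a string made only of the 20 standard amino acids."""
    return isinstance(cdr3, str) and all(aa in AMINO_ACIDS for aa in cdr3)

def parse_bio_identity(bio_identity, lookup):
    """Split 'CDR3+V+J' and translate gene names; unknown genes become None."""
    cdr3, v_gene, j_gene = bio_identity.split("+")
    return {
        "cdr3_b_aa": cdr3,
        "v_b_gene": lookup.get(v_gene),
        "j_b_gene": lookup.get(j_gene),
    }

# Illustrative records: a clean one, one with a stop codon, one unknown V gene
records = [
    "CASSLGQAYEQYF+TCRBV07-09+TCRBJ02-07",
    "CASS*GQAYEQYF+TCRBV07-09+TCRBJ02-07",
    "CASSLGQAYEQYF+TCRBVXX-XX+TCRBJ02-07",
]
parsed = [parse_bio_identity(r, toy_adaptive_to_imgt) for r in records]
kept = [p for p in parsed if valid_cdr3(p["cdr3_b_aa"])]
print(kept)
```

Note that, as in the script above, a gene name missing from the lookup yields None rather than an error, so it is worth checking for missing V or J genes before computing distances.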
The Adaptive Immune Receptor Repertoire (AIRR-seq) data format is also common. Conversion of AIRR-formatted files for use in tcrdist3 is also relatively straightforward.
import os
import numpy as np # needed by _valid_cdr3 below
import pandas as pd
from tcrdist.repertoire import TCRrep
f = 'airr2020.tsv'
url = 'https://raw.githubusercontent.com/kmayerb/tcrdist3_book_chapter/main/data/airr2020.tsv'
if not os.path.isfile(f):
    os.system(f'wget {url}')
bulk_df = pd.read_csv('airr2020.tsv', sep = "\t")
# specify any columns from the input that are necessary for the analysis
additionally_relevant_cols = ['sequence_id']
# specify the relevant chain
chain = 'beta'
item_names = {'alpha' : ["cdr3_a_aa","v_a_gene","j_a_gene","cdr3_a_nucseq"],
              'beta'  : ["cdr3_b_aa","v_b_gene","j_b_gene","cdr3_b_nucseq"],
              'gamma' : ["cdr3_g_aa","v_g_gene","j_g_gene","cdr3_g_nucseq"],
              'delta' : ["cdr3_d_aa","v_d_gene","j_d_gene","cdr3_d_nucseq"]}[chain]
bulk_df = bulk_df.rename(columns = {'junction_aa' : item_names[0],
                                    'v_call' : item_names[1],
                                    'j_call' : item_names[2],
                                    'junction' : item_names[3]})
bulk_df[item_names[1]] = bulk_df[item_names[1]].\
    apply(lambda x: x.replace("*00", "*01"))
bulk_df[item_names[2]] = bulk_df[item_names[2]].\
    apply(lambda x: x.replace("*00", "*01"))
# Validate CDR3
def _valid_cdr3(cdr3):
    """ Return True if all amino acids are part of standard amino acid list"""
    if not isinstance(cdr3, str):
        return False
    else:
        amino_acids = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
        valid = np.all([aa in amino_acids for aa in cdr3])
        return valid
bulk_df['valid_cdr3'] = bulk_df[item_names[0]].apply(lambda cdr3: _valid_cdr3(cdr3))
bulk_df = bulk_df[bulk_df['valid_cdr3']]
# subset to only necessary columns
item_names = item_names + additionally_relevant_cols
bulk_df = bulk_df[item_names]
# Test that we can initialize a TCRrep instance with the bulk DataFrame
tr = TCRrep(cell_df = bulk_df,
            organism = 'human',
            chains = ['beta'],
            compute_distances = False)
tr.clone_df.head()
4.4. Appendix D - pwseqdist for Python Power Users
Most of this tutorial has illustrated how to use the tcrdist3 interface. However, its core dependency, pwseqdist, can also be used directly. It enables fast and flexible computation of pairwise sequence-based distances using either the numba-enabled TCRdist and edit-distance metrics or any other user-coded Python 3 metric for relating TCRs. It can also accommodate computation of “rectangular” pairwise matrices: distances between a relatively small set of TCRs and all TCRs in a much larger set (e.g., a bulk repertoire or library of annotated receptors). On a modern laptop, distances can be computed at a rate of ~70M per minute, per CPU. To illustrate direct use of the pwseqdist package, we load the same bulk repertoire that was discussed in section 3.3.1.
import numpy as np
import os
import pandas as pd
import itertools
import pwseqdist as pw
from scipy.sparse import csr_matrix
# It might be impractical to do a brute-force comparison of all TCRs.
# It may be sufficient to compare all CDR3s from TCRs with the same V gene and
# CDR3 length.
file = '1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv'
df_bulk = pd.read_csv(file, sep = "\t")
df_bulk = df_bulk.sort_values('count', ascending = False).reset_index(drop = True)
df_bulk['rank'] = df_bulk.index.to_list()
4.4.1. First-Stage: Find Pairs with the Same V Gene and Identical CDR3 Length
We can reduce the total number of potential comparisons among all the receptors in the repertoire from 1.25 billion to 1 million by only comparing receptors with CDR3s of identical length generated with the same TRBV gene. This may be an attractive option when repertoires contain hundreds of thousands of clones. The pairings derived from a first-stage algorithm can be input to pwseqdist, and the relevant TCRdist or edit distances between CDR3s computed. In the first stage, we identify the pairs of sequences with the same V gene and identical CDR3 length using a standard pandas groupby command.
# find all the pairs with equal V gene and CDR3 length.
v_gene_index = dict()
df_bulk['length'] = df_bulk['cdr3_b_aa'].apply(lambda x : len(x))
pairs = list()
for i,g in df_bulk.groupby(['v_b_gene','length']):
    # materialize the iterator once so it can be both stored and appended
    combinations = list(itertools.combinations(g['rank'].to_list(), 2))
    v_gene_index[i] = combinations
    pairs.extend(combinations)
Next, we can pass the list of specific pairings (stored above in the list object named pairs) to the function pw.pairwise.apply_pairwise_sparse.
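The saving from the first stage comes from replacing the n(n−1)/2 all-against-all comparisons with comparisons inside each (V gene, CDR3 length) group. A small standard-library sketch of the same grouping on a toy repertoire (the clones below are made-up examples, not data from the chapter):

```python
from collections import defaultdict
from itertools import combinations

# Toy repertoire: (V gene, CDR3) per clone; list position plays the role of 'rank'
clones = [
    ("TRBV19*01", "CASSIRSSYEQYF"),
    ("TRBV19*01", "CASSIRSAYEQYF"),
    ("TRBV19*01", "CASSGRSTDTQYF"),
    ("TRBV19*01", "CASSIRSSSYEQYF"),   # same V gene, but one residue longer
    ("TRBV7-9*01", "CASSLGQAYEQYF"),
    ("TRBV7-9*01", "CASSLGQGYEQYF"),
]

# First stage: group clone indices by (V gene, CDR3 length)
groups = defaultdict(list)
for rank, (v_gene, cdr3) in enumerate(clones):
    groups[(v_gene, len(cdr3))].append(rank)

# Only pairs within the same group are passed on to the second stage
pairs = []
for members in groups.values():
    pairs.extend(combinations(members, 2))

n = len(clones)
all_pairs = n * (n - 1) // 2
print(f"{len(pairs)} within-group pairs instead of {all_pairs} all-against-all pairs")
```

The trade-off is that receptors whose CDR3s differ in length by an insertion or deletion, or that use closely related but different V genes, are never compared under this scheme.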
4.4.2. Second-Stage Edit Distance on CDR3
In the second stage, we compute the beta-chain edit distance among all TCRs with matching V genes and identical CDR3 length identified in the first stage. We then repackage the result in a sparse matrix as shown below.
data = pw.pairwise.apply_pairwise_sparse(pw.metrics.nb_vector_editdistance,
                                         seqs = df_bulk['cdr3_b_aa'],
                                         pairs = pairs,
                                         ncpus = 1,
                                         use_numba = True)
# repackage as a csr_mat
row = [x[0] for x in pairs]
col = [x[1] for x in pairs]
n = df_bulk['rank'].max() + 1
csr_mat_editdistance = csr_matrix((data, (row, col)), shape=(n,n))
4.4.3. Second-Stage TCRdist on CDR3
We next compute the beta-chain TCRdist among all TCRs with matching V genes and identical CDR3 length. We then repackage the result in a sparse matrix as shown below.
# Compute TCRdistances
data = pw.pairwise.apply_pairwise_sparse(pw.metrics.nb_vector_tcrdist,
                                         seqs = df_bulk['cdr3_b_aa'],
                                         pairs = pairs,
                                         ncpus = 1,
                                         use_numba = True)
# true zero distances are stored as -1 in sparse format
data = [-1 if x == 0 else x for x in data]
# package as a csr_mat
row = [x[0] for x in pairs]
col = [x[1] for x in pairs]
n = df_bulk['rank'].max() + 1
csr_mat_tcrdist = csr_matrix((data, (row, col)), shape=(n,n))
4.4.4. Construct a Network from a Sparse Matrix
We can construct a TCR network directly from the sparse matrix generated above.
# As before we can quickly find neighbors, and construct a network
from tcrdist.public import _neighbors_sparse_fixed_radius
import networkx as nx
df_bulk['nn'] = _neighbors_sparse_fixed_radius(csr_mat_tcrdist, 12)
df_bulk['k_nn'] = df_bulk['nn'].apply(lambda x: len(x))
network = list()
edge_threshold = 4
for i,n in enumerate(_neighbors_sparse_fixed_radius(csr_mat_tcrdist, edge_threshold)):
    for j in n:
        if i != j:
            network.append((
                i, # 'node_1' - row index
                j, # 'node_2' - column index
                csr_mat_tcrdist[i,j] # 'dist' - gets the distance between TCR(i,j)
            ))
cols = ['node_1', 'node_2', 'dist']
df_net = pd.DataFrame(network, columns = cols)
# zero distances were stored as -1 above, so convert them back before weighting
df_net['dist'] = df_net['dist'].apply(lambda x: 0 if x == -1 else x)
df_net['weight'] = edge_threshold - df_net['dist']
G = nx.from_pandas_edgelist(pd.DataFrame({'source' : df_net['node_1'],
                                          'target' : df_net['node_2'],
                                          'weight' : df_net['weight']}))
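The edge-building logic above can be sketched independently of scipy and networkx. Below, a dictionary keyed by (row, column) stands in for the sparse matrix, and a small union-find routine stands in for the graph library in order to recover connected groups of receptors. The distances are illustrative values, and the helper name connected_components is hypothetical.

```python
def connected_components(n_nodes, edges):
    """Group nodes linked by any chain of edges (union-find with path halving)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, _weight in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    components = {}
    for node in range(n_nodes):
        components.setdefault(find(node), []).append(node)
    return sorted(components.values(), key=len, reverse=True)

edge_threshold = 4
# Sparse stand-in: stored -1 means a true distance of 0; (0, 4) is too far apart
sparse_dist = {(0, 1): 3, (1, 2): -1, (3, 4): 3, (0, 4): 12}

edges = []
for (i, j), d in sparse_dist.items():
    d = 0 if d == -1 else d                           # decode before thresholding
    if i != j and d <= edge_threshold:
        edges.append((i, j, edge_threshold - d))      # closer pairs get larger weights

components = connected_components(6, edges)
print(edges)
print(components)
```

Decoding −1 before computing the weight matters: without it, an identical pair would receive weight edge_threshold + 1 rather than edge_threshold.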
4.5. Appendix E - Plotting Code for Fig. 9
Code to make Fig. 9 using matplotlib:
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Download output file with TCR-beta's: 1. Productive Frequency, 2. Pgen (OLGA estimate), and 3. Poisson Probability of Within Repertoire Neighbors
file = '1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv.poisson_pgen_productive_frequency.tsv'
cmd = f'wget -O {file} https://www.dropbox.com/s/58dxmer1ixzm5o8/1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv.poisson_pgen_productive_frequency.tsv?dl=1'
os.system(cmd)
df = pd.read_csv(file, sep = "\t")
df['color_pf'] = df['productive_frequency'].apply(lambda x : np.log10(x))
# replace zeros so that they can be drawn on a log-scaled axis
df['poisson_fdr'] = df['poisson_fdr'].apply(lambda x: 1E-15 if x == 0 else x)
# set a marker-size variable proportional to productive frequency
df['s'] = df['productive_frequency'].apply(lambda x : 100000 * x)
# make a second dataframe with subset of expanded clones
df2 = df.query('productive_frequency > 0.0001')
# Fig. 9A
fig = plt.figure(figsize=(4, 4), dpi=300)
ax = fig.add_subplot()
ax.set_xscale('log')
ax.set_yscale('log')
ax.scatter(x = df['pgen_cdr3_b_aa'],
           y = df['k_nn_cord'],
           color = 'blue',
           alpha = .2,
           s = df['s'])
# Overplot black circles over clones with freq > 0.0001
ax.scatter(x = df2['pgen_cdr3_b_aa'],
           y = df2['k_nn_cord'],
           color = 'black',
           alpha = .6, facecolors = 'none',
           s = df2['s'])
plt.xlim((1e-5, 1e-12))
plt.ylim((1, 1e3))
plt.ylabel('Neighbors in Cord Blood Background')
plt.xlabel('Probability of Generation')
plt.savefig("Fig9A.pdf", bbox_inches='tight')
# Fig. 9B
fig = plt.figure(figsize=(4, 4), dpi=300)
ax = fig.add_subplot()
ax.set_xscale('log')
ax.set_yscale('log')
ax.scatter(x = df['pgen_cdr3_b_aa'],
           y = df['poisson_fdr'],
           color = 'blue',
           alpha = .1,
           s = df['s'])
# Overplot black circles over clones with freq > 0.0001
ax.scatter(x = df2['pgen_cdr3_b_aa'],
           y = df2['poisson_fdr'],
           color = 'black',
           facecolors = 'none',
           alpha = .6,
           s = df2['s'])
ax.hlines(0.01, 1e-5, 1e-12, colors="black", linestyles='dashed', lw = 1)
plt.xlim((1e-5, 1e-12))
plt.ylim((1, 1e-16))
plt.ylabel('P(Within-Repertoire Neighbors >= Observed)\nFDR q-value')
plt.xlabel('Probability of Generation')
plt.savefig("Fig9B.pdf", bbox_inches='tight')
For users who prefer to produce graphics using ggplot2, we provide code to make Fig. 9 in R:
require(dplyr)
require(ggplot2)
require(ggforce)
d = readr::read_tsv('1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv.poisson_pgen_productive_frequency.tsv')
# Inspect clones with a Poisson p-value of 0 (the result is printed, not stored)
d %>% filter(poisson == 0) %>%
  select(k_nn, k_nn_cord, poisson, poisson_fdr, lambda) %>%
  mutate(poisson_fdr = ifelse(poisson_fdr == 0, 1E-15, poisson_fdr))
d_focus = d %>% filter(productive_frequency > 0.0001)
gg0 = ggplot(d, aes(x = pgen_cdr3_b_aa, y = k_nn_cord, size = log10(productive_frequency))) +
  geom_point(pch = 19, alpha = .05, col = "black") +
  geom_point(data = d_focus, col = "black", pch = 1) +
  geom_point(data = d_focus, pch = 19, alpha = .1) +
  theme_classic() +
  scale_x_continuous(trans = trans_reverser('log10')) +
  scale_y_log10() +
  annotation_logticks(sides = 'b') +
  coord_cartesian(xlim = c(1E-5, 1E-13)) +
  theme(legend.position = "bottom") +
  xlab('Probability of Generation') +
  ylab('No. of Neighbors TCRdist <= 48, CDR3 Weighted 6:1\n1M Cord Blood TCRs')
gg2 = ggplot(d, aes(y = poisson_fdr, x = pgen_cdr3_b_aa, col = log10(productive_frequency), size = log10(productive_frequency))) +
  geom_point(pch = 19, alpha = .3) +
  geom_point(data = d_focus, col = "black", pch = 1) +
  geom_point(data = d_focus, pch = 19, alpha = .3) +
  theme_classic() +
  scale_color_viridis_c(option = "B", begin = 0, end = .9, direction = -1) +
  scale_x_continuous(trans = ggforce::trans_reverser('log10')) +
  scale_y_continuous(trans = ggforce::trans_reverser('log10')) +
  annotation_logticks(sides = 'lb') +
  coord_cartesian(xlim = c(0.00001, 1E-13), ylim = c(1, 1E-15)) +
  theme(legend.position = "bottom") +
  ylab('P(Within-Repertoire Neighbors >= Observed)\nFDR q-value') +
  xlab('Probability of Generation')
gridExtra::grid.arrange(gg0 + theme(legend.position = "none"),
                        gg2 + theme(legend.position = "none"),
                        ncol = 2)
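The y-axis of Fig. 9B is labeled as the probability of observing at least as many within-repertoire neighbors as were actually found. For readers who want to see what such a quantity looks like numerically, the sketch below computes a Poisson upper-tail probability with the standard library. It assumes that the expected neighbor count plays the role of the Poisson rate; how the poisson and poisson_fdr columns in the downloaded file were actually computed is not shown in this appendix, so this should be read as an illustration rather than a reproduction. The function names are hypothetical.

```python
import math

def poisson_upper_tail(k, lam):
    """P(K >= k) for K ~ Poisson(lam), computed as 1 - P(K <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(K = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(K = i) from P(K = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def floor_for_log(p, floor=1e-15):
    """Replace exact zeros so the value can be drawn on a log-scaled axis."""
    return floor if p == 0 else p

# Illustrative numbers: 6 neighbors observed where 0.5 were expected
observed, expected = 6, 0.5
p_value = poisson_upper_tail(observed, expected)
print(p_value, floor_for_log(p_value))
```

Because the tail is obtained by subtraction from 1, very small probabilities can round to exactly 0 in floating point. A value of 0 cannot be placed on a log axis, which is why the plotting code above substitutes 1E-15 for zeros before drawing.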
5. References
- 1. Rudolph MG, Wilson IA (2002) The specificity of TCR/pMHC interaction. Curr Opin Immunol 14:52–65
- 2. Dash P, Fiore-Gartland AJ, Hertz T, et al. (2017) Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547:89–93
- 3. Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, et al. (2020) TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection
- 4. Minervina AA, Pogorelyy MV, Kirk AM, et al. (2021) Convergent epitope-specific T cell responses after SARS-CoV-2 infection and vaccination. medRxiv. 10.1101/2021.07.12.21260227
- 5. Wirasinha RC, Singh M, Archer SK, et al. (2018) αβ T-cell receptors with a central CDR3 cysteine are enriched in CD8αα intraepithelial lymphocytes and their thymic precursors. Immunol Cell Biol 96:553–561
- 6. Britanova OV, Shugay M, Merzlyak EM, et al. (2016) Dynamics of individual T Cell repertoires: from cord blood to centenarians. J Immunol 196:5005–5013
- 7. Ruggiero E, Nicolay JP, Fronza R, et al. (2015) High-resolution analysis of the human T-cell receptor repertoire. Nat Commun 6:8081
- 8. Sethna Z, Elhanati Y, Callan CG, et al. (2019) OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs. Bioinformatics 35:2974–2981
- 9. Marcou Q, Mora T, Walczak AM (2018) High-throughput immune repertoire analysis with IGoR. Nat Commun 9:561
- 10. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech 2008:P10008
- 11. Ren X, Wen W, Fan X, et al. (2021) COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184:1895–1913.e19
- 12. Ritvo P-G, Saadawi A, Barennes P, et al. (2018) High-resolution repertoire analysis reveals a major bystander activation of Tfh and Tfr cells. Proc Natl Acad Sci U S A 115:9604–9609
- 13. Pogorelyy MV, Shugay M (2019) A framework for annotation of antigen specificities in high-throughput T-Cell repertoire sequencing studies. Front Immunol 10:2159
- 14. Pogorelyy MV, Minervina AA, Shugay M, et al. (2019) Detecting T cell receptors involved in immune responses from single repertoire snapshots. PLOS Biology 17:e3000314
- 15. Glanville J, Huang H, Nau A, et al. (2017) Identifying specificity groups in the T cell receptor repertoire. Nature 547:94–98
- 16. Huang H, Wang C, Rubelt F, et al. (2020) Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat Biotechnol. 10.1038/s41587-020-0505-4
- 17. Zhang H, Zhan X, Li B (2021) Publisher Correction: GIANA allows computationally-efficient TCR clustering and multi-disease repertoire classification by isometric transformation. Nat Commun 12:5334
- 18. Nolan S, Vignali M, Klinger M, et al. (2020) A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Res Sq. 10.21203/rs.3.rs-51964/v1
- 19. Shugay M, Bagaev DV, Zvyagin IV, et al. (2018) VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res 46:D419–D427
