Abstract
Most current analysis tools for antibody next-generation sequencing data work with primary sequence descriptors, leaving accompanying structural information unharnessed. We have used novel rapid methods to structurally characterize the complementary-determining regions (CDRs) of more than 180 million human and mouse B-cell receptor (BCR) repertoire sequences. These structurally annotated CDRs provide unprecedented insights into both the structural predetermination and dynamics of the adaptive immune response. We show that B-cell types can be distinguished based solely on these structural properties. Antigen-unexperienced BCR repertoires use the highest number and diversity of CDR structures and these patterns of naïve repertoire paratope usage are highly conserved across subjects. In contrast, more differentiated B-cells are more personalized in terms of CDR structure usage. Our results establish the CDR structure differences in BCR repertoires and have applications for many fields including immunodiagnostics, phage display library generation, and “humanness” assessment of BCR repertoires from transgenic animals. The software tool for structural annotation of BCR repertoires, SAAB+, is available at https://github.com/oxpig/saab_plus.
Author summary
B-cell receptors (BCR) are the major components of the adaptive immune system. These are immunoglobulin molecules that bind to foreign substances known as antigens. Each individual has a huge BCR repertoire, where each individual BCR has a specific binding site composed of the complementary-determining regions (CDRs) capable of recognising a specific antigen. Drug discovery and immunodiagnostics inspired by the adaptive immune system rely on our ability to accurately interrogate the structural diversity of the binding sites of the BCR repertoire. Here we report our novel rapid pipeline, SAAB+, which has enabled us to interrogate how the structure of the CDR changes in BCR repertoires along the B-cell differentiation axis. By analysing human and mouse BCR repertoires at an unprecedented scale, we observed species-specific structural predetermination and detected CDR dynamics across multiple stages of B-cell differentiation. We showed that naïve repertoires share the highest number and diversity of CDR structures, a pattern which was highly conserved in all B-cell donors. Our results suggest that increased B-cell differentiation is associated with a personalization of CDR structure usages. Finally, we established the differences in CDR usages between humans and mice, analysis with immediate relevance for BCR repertoire “humanness” assessment and rational immunotherapeutic engineering.
Introduction
B-cells are essential components of the adaptive immune system in jawed vertebrates. They play a key role in recognizing foreign molecules (antigens) via membrane-bound B-cell receptors (BCR), and antibodies (secreted BCRs). Successful recognition of a broad array of structural motifs (epitopes) on antigens relies on the enormous sequence and structural diversity of BCR repertoires, generated by the rearrangement of V(D)J gene segments in the two variable domain chains (heavy and light), each consisting of four framework and three complementary-determining (CDR) loop regions [1,2]. Upon antigen stimulation, somatic hypermutation (SHM) recursively introduces changes to the variable (Fv) domain of naïve BCR repertoires. These occur primarily in the antibody binding interface (paratope, which consists mostly of CDR residues)[3], leading to structural changes. Those B-cells whose paratopes are epitope-complementary are clonally expanded, and further diversified and selected to enhance antigen binding properties. BCR diversification also happens outside the Fv domain, where immunoglobulin class switching changes the constant region of the heavy chain [4]. There are five main heavy constant regions (isotypes), each with a unique profile of effector functions and antigen binding avidity.
Next-generation sequencing of immunoglobulin genes (Ig-seq) has become an essential technique in immunology [5,6]. For example, Ig-seq has revealed the dynamics of BCR sequence diversification across different B-cell types in healthy and antigen-stimulated B-cell donors [7–10], advanced our understanding of the adaptive immune response, and contributed to vaccine development [11] and immunodiagnostics [12].
Most Ig-seq analysis tools work within the remit of BCR primary sequence information [6,13]. These rapid methods of measuring BCR diversity are highly scalable, an important property as Ig-seq datasets become ever larger and more numerous [2]. However, the decision to avoid paratope structural descriptors could lead to inaccuracies [14–16], as it is known that similar sequences can have markedly different epitope complementarity and vice versa [13]. Therefore, a computationally-efficient structure-based BCR repertoire method should augment current Ig-seq analysis pipelines to deliver a clearer understanding of the process of BCR development.
One of the first structural analyses of Ig-seq data was that of DeKosky et al., [14]. They demonstrated that antibody models from paired-chain naïve and memory BCR repertoires displayed different physicochemical properties. However, their analysis was limited to 2,000 antibody models from three B-cell donors [14]. Most publicly-available BCR repertoires are unpaired, only covering either the heavy or light variable domain [17] precluding the generation of refined antibody models. Krawczyk et al., [15] showed that it was possible to annotate unpaired BCR repertoires with structural information by mapping loop sequences individually onto crystallographically-solved antibody structures.
Using a similar approach, we have investigated structural diversity along the B-cell differentiation axis in humans and mice. We show that structurally annotating BCR repertoires yields unprecedented insights into both the structural predetermination and dynamics of the adaptive immune response. By approximating BCR repertoire structures with rapid homology modelling techniques, we find that different B-cell types can be distinguished by their usage of CDR loop structures. Our analysis reveals that BCR repertoires of naïve B-cells tend to contain conserved “public” CDR structure profile, whilst those of more differentiated B-cell types become more personalized. These results provide crucial information about the structural changes in antibody CDRs during B-cell differentiation, with a plethora of prospective applications in immunodiagnostics and rational immunotherapeutic engineering.
Results
Structural annotation of Ig-seq data
We searched the Observed Antibody Space (OAS) resource [17] for heavy chain Ig-seq studies that contained at least three different B-cell types, had sequences with defined isotype information and consisted of at least 50 BCR repertoires, and identified two studies: Galson et al., (“human”) [7] and Greiff et al., (“mouse”) [9].
Annotating the antibody CDR sequences in these human and mouse Ig-seq studies with structural information allows us to investigate how the three-dimensional shape of CDR-H1, CDR-H2 and CDR-H3 loops vary across BCR repertoires (Fig 1). To achieve this, we developed the SAAB+ pipeline.
SAAB+ is our analysis pipeline that performs annotation of BCR repertoire amino acid data with structural information. The SAAB+ pipeline was built on the previously developed tool, SAAB [15]. The major differences/improvements of SAAB+ over SAAB are the following. First, SAAB+ scrutinises each antibody sequence for structural viability. In this step, sequences are passed through multiple structural filters (alignment to Hidden Markov Model (HMM) profiles, indel and conserved residue identification, chimeric sequence removal, presence of all 3 CDR loops according to the IMGT numbering [19] definition) [17,20]. SAAB+ utilizes SCALOP [21] for rapid canonical class identification, which gives a significant boost to the analysis rate without compromising accuracy. In a similar fashion to SAAB, FREAD is employed to predict whether CDR-H3s from the Ig-seq data share a similar structure to a crystallographically-solved CDR-H3 structure [22]. CDR-H3 predictions made by FREAD are annotated with the PDB code of the crystallographically-solved CDR-H3 structure (template). The SAAB+ pipeline also performs structural clustering of CDR-H3 templates. To find structural templates with similar CDR-H3 loop shapes (analogous to canonical loop shapes), SAAB+ structurally clusters them based on their backbone atom RMSD values (see Methods). SAAB+ outputs a tab delimited text file that can easily be merged with the current AIRR-seq standard format file [23].
Structural CDR-H3 coverage and template usage
We investigated the structures of CDR-H3s used across BCR repertoires of different B-cell types in the human and mouse data. Predicting approximate shapes of CDRs rather than creating refined three-dimensional antibody models allows the SAAB+ pipeline to structurally annotate BCR repertoires rapidly. SAAB+ can analyse ~4.5 million BCR sequences a day on a 40 core computing cluster (Intel Xeon E5-2699 v4 @ 2.20GHz). SAAB+ structurally annotated the human data within two days and the mouse data in 4–5 weeks. Table 1 shows the coverage achieved by FREAD for each species.
Table 1. FREAD coverage of Ig-seq data.
Data | Total sequences | CDR-H3 template predicted | Mean coverage with std |
---|---|---|---|
Human | 5,712,939 | 2,750,469 (48.1%) | 47.2±11% |
Mouse | 206,680,496 | 182,309,575 (88%) | 88.1±4% |
CDR-H3 structural coverages of BCR repertoires were similar across different B-cell types in the human data (Kruskal-Wallis test, p = 0.37), but varied in the mouse data (Kruskal-Wallis test, p< 0.001). In both species, the variance of coverage was lower in the BCR repertoires of antigen-unexperienced B-cells (S2 Fig). The mean structural coverage was higher for mouse CDR-H3s than for human CDR-H3s (Table 1). Differences in length distributions could be a major cause of this discrepancy, as CDR-H3 structures are harder to predict for longer lengths, and the most common lengths were 11 and 12 residues in the mouse data, compared to 15 residues in the human data (S3 Fig).
Human and mouse BCR repertoires are the effector products of two different sets of germline genes. We therefore investigated whether species germline genes might also translate into preferred CDR-H3 structure usage. We used reported species origin information from SAbDab [24] to calculate the usages of different species CDR-H3 templates across our BCR repertoires (S4 Fig). As expected, human and mouse data used different frequencies of species CDR-H3 templates. The human BCR repertoires tended to use more human CDR-H3 templates as compared to uniform CDR-H3 template sampling, with mouse CDR-H3 templates appearing about as often as would be expected at random. In the mouse data, usage of mouse CDR-H3 templates was enriched, whilst usage of human CDR-H3 templates was reduced. These usages were roughly similar across B-cell types in both human and mouse data, suggesting a species bias towards CDR-H3 structural sampling largely independent of B-cell maturation. Interestingly, 109 (or ~4%) of all FREAD templates were never used in neither the human nor mouse data. Eighty eight of these templates were derived from nanobodies (S1 Appendix).
Together, these results suggest that different species may engage different epitopes on the same antigen through inherent structural biases.
CDR-H3 cluster profiles along the B-cell differentiation axis
The adaptive immune system responds to antigen exposure by selecting and optimizing the most efficacious BCRs. Therefore, B-cells at different maturation stages may possess discrete paratope structural properties.
Galson et al., [7] demonstrated that different B-cell types could be separated using three heterogeneous sequence descriptors (clonality, average CDR-H3 loop length and percentage of V gene mutations) in a principal component analysis (PCA). We repeated their experiment on our human and mouse data (Fig 2A and 2B). In the human data, their sequence descriptors distinguished B-cell types. In the mouse data, pre, naïve, and plasma IGHM BCR repertoires clustered together, whilst plasma IGHG were clearly distinguishable from other B-cell types.
We investigated whether the structural annotation of CDR-H3s on its own could distinguish the BCR repertoires of different B-cell types, by performing PCA on CDR-H3 cluster usages across BCR repertoires. We found a clear separation of B-cell types in both the human and mouse data (Fig 2C and 2D), with a sequential pattern of B-cell differentiation in the human data (Naïve → Marginal → Memory → Plasma). Mouse IGHM and IGHG plasma BCR repertoires can be distinguished by CDR-H3 cluster usages, whereas neither we nor Galson et al., [7] observe the same separation in the human plasma BCR repertoires. The variance of CDR-H3 cluster usages in plasma IGHM were, in fact, more similar to antigen-unexperienced than to plasma IGHG BCR repertoires in the mouse data. Inaccuracies arising during B-cell sorting could cause improper B-cell labelling, adding noise to the B-cell type separation seen in Fig 2. In laboratory mice, the range and degree of antigen exposure is limited by pathogen-free housing conditions and low organism ages. This “purity” could account for the finer separation of B-cell types.
To quantify the behaviour seen in Fig 2, we employed the DBSCAN clustering algorithm with increasing maximum distance to closest neighbours (ε) to interrogate the densities of CDR-H3 cluster usages across BCR repertoires. Clustering at lower ε distances indicates a more similar distribution of CDR-H3 cluster usages. In the human data, all naïve BCR repertoires clustered at low ε distances along with one marginal zone BCR repertoire. As the value of ε was increased, all marginal zone BCR repertoires merged with the naïve BCR repertoire cluster, followed by memory and finally plasma BCR repertoires (S5 Fig). In the mouse data, pre and naïve BCR repertoires initially formed two separate clusters at low ε distances. As ε was increased, antigen-unexperienced (pre and naïve) BCR repertoire merged into a single cluster, followed by plasma IGHM and plasma IGHG repertoires respectively (S6 Fig).
BCR repertoires of different B-cell types are known to have their own characteristic distributions of CDR-H3 lengths [7,25]. To see whether this alone was driving the separation, we repeated our PCA experiment at specific lengths of CDR-H3, again employing DBSCAN to interrogate the densities of CDR-H3 cluster usages. For each length, we observed the same patterns, confirming that our separation of BCR repertoires was not solely an artefact of CDR-H3 loop length (S7 Fig).
These findings give structural confirmation to our understanding of B-cell development from antigen-unexperienced to terminally-differentiated plasma B-cells. The collection of CDR-H3s in a terminally-differentiated BCR repertoire should be reflective of individual’s complex history of antigenic stimulations yielding highly specialized, high-affinity antibodies [2]. These results demonstrate a mode of structural BCR repertoire ontogeny, where antigen-unexperienced BCR repertoires have the most conserved “public” frequencies of CDR-H3 structural clusters across individuals. Upon antigenic stimulation, the somatic hypermutation (SHM) machinery of B-cells recursively introduces point mutations, primarily to the antibody CDR regions [3,26]. Our DBSCAN analysis shows that BCR repertoires of different B-cell types do not use equal frequencies of CDR-H3 clusters, suggesting that affinity maturation leads to discernible structural changes in the paratope. As B-cells differentiate to the next developmental stage, their repertoires become more personalized; a fine-tuning of antibody CDR structures along the differentiation axis.
Next, we checked whether above results were caused by varying numbers of utilized CDR-H3 clusters. We evaluated the total number of CDR-H3 clusters represented across different B-cell types in the human and mouse data (Fig 3A and 3B). None of the BCR repertoires used the maximum number of CDR-H3 clusters (1,169), and the numbers varied between BCR repertoires, with antigen-unexperienced repertoires using the most. The average number of CDR-H3 clusters in plasma IGHG BCR repertoires was 3–4 times smaller than in naïve repertoires.
This difference could potentially be explained by a smaller number of isolated plasma B-cells. To account for the varying sizes of BCR repertoires, we subsampled 10,000 sequences from each of them 100 times and recorded the average number of CDR-H3 clusters. The subsampling gave a similar pattern to the complete data, with the average number of CDR-H3 clusters being highest in antigen-unexperienced BCR repertoires (Fig 3C and 3D), and total numbers of represented clusters decreasing along the B-cell differentiation axis. This drop in the number of CDR-H3 clusters is not caused by poorer structural coverage of more differentiated BCR repertoires, as we have already shown that the coverage is not significantly different across B-cell types in the human data, and increases for more differentiated cells in the mouse data (S2 Fig). Therefore, we suspect that this decrease in the number of represented CDR-H3 clusters along the differentiation axis was the result of only specific CDR-H3 structures transitioning to the next development stage.
To confirm this hypothesis, we investigated whether the decreased numbers of CDR-H3 clusters in antigen-experienced BCR repertoires are also accompanied by structural specialization i.e. personalized CDR-H3 cluster usage. We employed Shannon entropy to investigate the structural diversity of CDR-H3s across our BCR repertoires. High entropy demonstrates a high diversity of CDR-H3 structures, whilst low entropy indicates the over-representation of one or more CDR-H3s. To account for the decreasing number of represented CDR-H3 structures, we calculated the proportion of theoretical maximum entropy for each BCR repertoire to yield a normalized estimate of the diversity of CDR-H3 clusters used (S8 Fig). This confirmed that the structural diversity of CDR-H3 gradually decreased along the B-cell differentiation axis. Antigen-unexperienced BCR repertoires had the highest structural diversity of CDR-H3s, as well as the lowest variance in entropy across B-cell types. Marginal and memory IGHM BCR repertoires utilized the same number of CDR-H3 structures (p = 0.66, Mann-Whitney U-Test), whilst the structural diversity was significantly lower in memory B-cells (p = 0.005, Mann-Whitney U-Test). Our results again give structural confirmation of the affinity maturation process, where only CDR structures that are specific to cognate antigens are retained.
Overall, the above results demonstrate that B-cell types can be distinguished based on the profile of CDR-H3 structural descriptors alone and that antigen-unexperienced BCR repertoires utilized the highest number and the highest entropy of CDR-H3 clusters. Cluster frequencies in naive BCR repertoires were conserved across different B-cell donors. As B-cells differentiate, their CDR-H3 cluster usage becomes narrower and more distinct between individuals, which is reflective of both affinity maturation and a personalized history of B-cell selection. These results provide us with the first structural insight into fundamental processes that govern BCR repertoire differentiation across B-cell donors.
Canonical class characterization
Our analysis so far has focused on CDR-H3, but CDR-H1 and CDR-H2 also play a key role in shaping the antibody paratope [27]. Most CDR-H1 and CDR-H2 loops are found in a small set of structures known as canonical classes. This allows prediction of their structure from sequence with high confidence [21].
A single V gene encodes for both CDR-H1 and CDR-H2 loops and it is known that SHMs preferentially take place in these loops during B-cell differentiation [3,26]. As the level of SHMs increases with B-cells differentiation, the number of mutations in the V gene has often been used as a proxy to study B-cell development [7,28].
Here, we investigated whether SHMs in the V gene lead to structural changes in CDR-H1 and CDR-H2 in humans and mice. We calculated the percentage of sequences across BCR repertoires where either the CDR-H1 or CDR-H2 canonical class diverged from its parent germline. Sequences with unassigned canonical class information were retained in the analysis as their number was low (S2 Table), and SHMs can still change loop conformation to a yet uncharacterized canonical class. As of June 2019, only one human and six mouse V genes contained either a CDR-H1 or a CDR-H2 shape that did not fall into a SCALOP canonical classes [21].
Canonical class divergence from germline occurred in all B-cell types, but was observed to increase along the B-cell differentiation axis in the human data (S9 Fig). This was less clear in the mouse data. Pre and naïve B-cells had less canonical class divergence from the germline, whereas memory and plasma B-cells had a higher divergence. These results place structural information on the knowledge that the percentage of V gene mutations increases with B-cell differentiation [7]. The average percentage of canonical class divergence across B-cell types were consistently higher in human than mouse data. This is in agreement with previously-reported results showing that human V genes tend to accumulate a larger number of SHMs than mouse [29].
CDR-H1 and CDR-H2 loops had different levels of canonical class divergence in both human and mouse data, with CDR-H1s changing their germline loop shapes more often than CDR-H2s (S10 Fig). This can probably be directly attributed to the different number of canonical classes accessible to CDR-H1 and CDR-H2 (7 versus 4), which implies CDR-H1 loops have a greater degree of structural freedom.
Both Galson et al., [7] and Greiff et al., [9] studies showed that the V gene usages varied across B-cell types. Here, we investigated whether canonical class usages could provide a structural explanation for the observed alterations in V gene utilization during B-cell differentiation. As with CDR-H3, we performed PCA on combinations of canonical class usages across BCR repertoires (S11 Fig). In the human data, we found a separation between naïve and more differentiated B-cell types, with naïve BCR repertoires utilizing more similar canonical class usages. In the mouse data, BCR repertoires were separated into different B-cell types with the sequential pattern of B-cell differentiation.
Our results demonstrate that canonical class usages are not static during B-cell differentiation, with more mature B-cells exhibiting a higher level of canonical class divergence from the parent germline. CDR-H1 and CDR-H2 structures are clearly modulated to help refine the antibody paratope configuration against the cognate antigen.
Patterns of CDR-H3 cluster usage
Biased usage of CDR-H3 clusters is observed in different BCR repertoires along the differentiation axis. For instance, antigen-unexperienced B-cells share the closest frequencies of CDR-H3 clusters (Fig 2). A detailed understanding of biased CDR-H3 structure usage would significantly advance our knowledge of the adaptive immune system development and maturation.
To investigate patterns of biased CDR-H3 cluster usage, we split CDR-H3 clusters into three groups for each B-cell type based on frequencies of CDR-H3 clusters used across these BCR repertoires. “Structural Stems”, which were defined as CDR-H3 clusters, whose frequencies were significantly over-represented across the BCR repertoires of a given B-cell type, “Under-Represented” which describes under-represented CDR-H3 clusters. And CDR-H3 clusters, whose frequencies were not significantly different from random uniform sampling -“Random-Usage” (Fig 4).
First, we looked at the average number of CDR-H3 clusters found in our three groups (Structural Stems, Random-Usage and Under-Represented) across the different B-cell types. In all BCR repertoires, Under-Represented always contained the largest number of CDR-H3 clusters (S12 Fig), however, this does not translate to dominance in terms of coverage (Fig 5). This is because, in most cases, Under-Represented CDR-H3 clusters tend to have only a few sequences in a repertoire that share that shape, whereas Structural Stems will have far higher numbers.
In the human data, the number of Structural Stems was largest in naïve BCR repertoires and gradually decreased along the B-cell differentiation axis. The number of Random-Usage CDR-H3 clusters was lowest in the naïve repertoires. This number increased in marginal BCR repertoires followed by a gradual decline along the B-cell differentiation axis. Similar to the human data, the number of Structural Stems was the highest in antigen-unexperienced BCR repertoires in the mouse data. The number of Structural Stems declined in plasma IGHM and were completely absent in plasma IGHG repertoires.
Next, we investigated the proportional composition of BCR repertoires across B-cell types with Structural Stem, Random-Usage and Under-Represented CDR-H3 clusters. The distribution of repertoire coverages differed between B-cell types in both human and mouse data (Fig 5). Structural Stems cover ~70–80% of antigen-unexperienced BCR repertoires, with coverage declining along the B-cell differentiation axis. In contrast, coverage with Under-Represented clusters gradually increased as B-cells matured. Pre and naïve BCR repertoires were least covered with Random-Usage CDR-H3 clusters (only 5–10%). In the human data, coverage with Random-Usage CDR-H3 clusters showed a transient increase in memory BCR repertoires followed by a decline in plasma repertoires, though this trend was less evident in the mouse data. The same CDR-H3 clusters are preferentially over-represented across different B-cell types with the number of these over-represented CDR-H3 clusters diminishing to none along the B-cell development axis (S13 Fig) (S2 Appendix).
These results demonstrate that antigen-unexperienced BCR repertoires display CDR-H3 structural conservatism. Naive BCR repertoires are largely composed of “public” sets of over-represented CDR-H3 clusters. During B-cell selection, CDR-H3 cluster usages become less conserved across individuals as the coverage with Random-Usage and Under-Represented CDR-H3 clusters rise. In terminally-matured plasma IGHG BCR repertoires, none of CDR-H3 clusters was significantly over-represented across individuals. This reflects how the history of antigenic stimulations structurally shapes BCR repertoires, which become increasingly specialized as B-cells differentiate.
Discussion
We have carried out the first systematic study of structural diversity in the BCR repertoires of multiple donors and species along the B-cell differentiation axis. By mapping sequences to solved antibody structures, we show the structural transformation occurring as BCR repertoires develop in humans and mice.
Our data show that B-cell types can be distinguished based solely on the structural diversity of CDR-H3 loops. Antigen-unexperienced (pre and naïve) BCR repertoires contain conserved “public” CDR-H3 cluster frequencies across individuals. As B-cells differentiate, their structural repertoires become increasingly personalized, as a reflection of each individual’s history of antigen exposure. Antigenic stimulation induces marked changes in the pattern of CDR-H3 cluster usage in BCR repertoires. The repertoires utilize a smaller number of available CDR-H3 configurations, CDR-H3 structural diversity is reduced, and CDR-H3 cluster usage becomes increasingly divergent from naïve BCR repertoires. Structural changes also take place in non-CDR-H3 loops, highlighting the importance of canonical loops in paratope shaping. This shows how structure changes as B-cells, whose CDRs are complementary to cognate antigens, are positively selected.
The human and mouse data used in this work came from two studies that interrogated BCR repertoire sequence diversities following antigenic stimulation [7,9]. In their original publications, the sequence diversity was shown to decrease along the B-cell differentiation axis, with the most marked decline in the plasma cell repertoires in response to antigen exposure [7,9]. In our work, we find evidence to support these diversity changes on the structural level, which can be indicative of affinity selection of antigen specific BCRs. The SAAB+ pipeline was able to capture both these sequence/structural convergence paradigms, where similar CDR-H3 templates are more likely to be selected in less sequence diverse BCR repertoires as well as identification of sequence dissimilar CDR-H3 loops located in the same structural space.
Most publicly available Ig-seq studies interrogate BCR repertoires either in disease or antigenic stimulation settings [17]. This could potentially lead to sequence/structural diversity measurements that are different from the resting BCR repertoire state. Recently, two Ig-seq studies released cumulatively more than 7 billion BCR reads of unsorted B-cells from healthy individuals [2,30]. These studies provided crucial insights into the BCR sequence diversity and dynamics amongst healthy individuals. However, analogous large-scale studies have not yet been performed on sorted BCR repertoires due to high costs concomitant with experimental setups, and labour requirements. In order to develop a robust immunodiagnostics pipeline, it is important to understand sequence/structural diversity on the level of the whole repertoire as well as individual B-cell types.
In this work we considered only the three CDRs encoded by heavy chain genes. However, the light chain also plays an important role in shaping the BCR paratope [31]. Therefore, it is anticipated that the diversity of light chain CDR configurations would also change along the B-cell differentiation axis. To advance our understanding of the role of the light chain in paratope shaping, further investigations are required to study joined structural diversities of heavy and light chains within individual B-cell donors.
Inclusion of cognate light chain pairing information into our analysis would also facilitate generation of refined antibody models. Increased availability of paired heavy/light BCR data [32] and improvements in antibody modelling speed [13] will pave the way to a new frontier of antibody structure usage analysis at the scale of an entire BCR repertoire. Structural descriptors harvested from these models will push forward the resolution of our current work, enabling calculations of paratope charge and hydrophobicity, as well as antibody developability profiles [33].
Structural coverage of CDR-H3s of the human and mouse data was ∼48% and ∼88% respectively. In this analysis, we did not consider BCR sequences with CDR-H3 loop lengths greater than 16 amino acids to ensure high prediction accuracies. This modelling quality filter removed 3.9% (± 2 s.t.d) of the mouse and 29% (±11 s.t.d) of the human BCR datasets from the analysis, meaning that the structural diversity information of longer CDR-H3 loops unexplored (S3 Fig). Despite these marked differences in BCR repertoire coverages, both human and mouse data showed similar patterns of structural diversities along the differentiation axis. These findings agree with and provide structural reasons for the sequence diversity measurements calculated across all lengths in the original studies of the human and mouse data. Therefore, it is unlikely that the unexplored portions of human BCR repertoires would have significantly different structural diversity properties.
More than 400 antibody structures have been solved in 2019 [24], which constitutes more than 10% of our CDR-H3 template library. With this steadily increasing rate of antibody structure availability and continuous improvements in homology modelling technologies, further studies will be soon necessary to investigate trade-offs between BCR repertoire coverage and prediction accuracies.
Structural characterization of Ig-seq data can augment existing analysis pipelines [13]. Current Ig-seq data clustering approaches work on the premise that CDR-H3 sequence identity alone can capture structural features of the paratope [6]. However, sequences with low CDR-H3 sequence identity can adopt close shapes and vice versa [13]. Hence, the development of structure-aware clustering methods such as SAAB+ allows for the direct grouping of structurally/functionally related BCR sequences [34], as well as enables structural changes to be traced within individual B-cell linages.
A set of CDR-H3 clusters was consistently over-represented across all B-cell donors (“Structural Stems”) within the specific B-cell types. These clusters encompassed 70–80% of all sequences in antigen-unexperienced BCR repertoires. This shows that humans and mice largely rely on a conserved “public” set of CDR-H3 clusters to initiate antigen recognition. This knowledge could be leveraged to study immune system disorders, including immunosenescence, where distortions in the conserved public pattern of CDR-H3 cluster usage in antigen-unexperienced BCR repertoires could signal disease states. Furthermore, the knowledge of over-represented CDR-H3 clusters in naïve BCR repertoires could be applied in rational phage display library engineering, with Structural Stem cluster sequences used as starting points for library diversity generation.
Recently, transgenic mouse models with human adaptive immune system have been created to raise “naturally human” antibodies in non-human systems [35]. However, their BCR repertoires are shaped inside the murine environment, which could potentially select for BCR paratopes non-native to the human body. Hence, our structural diversity analysis could also be employed in the paratope “humanness” assessment of BCR repertoires derived from transgenic animals.
Methods
Data selection
Human Ig-seq data from Galson et al., [7] and mouse (C57BL/6 inbred strain) Ig-seq data from Greiff et al., [9] were used. Galson et al., (“human”) is a longitudinal vaccination study across nine healthy human donors, in which the heavy chain of naïve, marginal zone, memory, and plasma B-cell types were interrogated [7]. Greiff et al., (“mouse”) is a high depth sequencing study of the murine adaptive immune system in response to antigenic stimulation, containing heavy chain BCR repertoires from pre, naïve and plasma B-cells [36]. Both studies used FACS to sort B-cells into subpopulations according to their differentiation stages.
The Ig-seq amino acid sequences were downloaded from the Observed Antibody Space (OAS) [17] resource, retaining their Data Unit information. Each Data Unit is a sequencing sample from a single B-cell donor with a defined combination of B-cell type and isotype information, and contains sequences that are IMGT-numbered [19] and filtered for antibody structural viability. Henceforth, OAS Data Units will be referred to as B-cell receptor (BCR) repertoires.
To investigate structural changes along the B-cell differentiation axis, BCR repertoires with defined B-cell type and isotype information were downloaded. Only IGHG and IGHM sequences were considered as these were the most abundant. The total number of BCR repertoires in the human and mouse data were 85 and 82 respectively.
Structural annotation
To annotate the human and mouse data with structural information, we developed a customized version of our SAAB pipeline [15], SAAB+ that predicts the structural shape of the IMGT-defined CDRs. CDR-H1 and CDR-H2 adopt a limited number of structural configurations, known as canonical classes [16,37,38], which can be predicted accurately and rapidly from sequence [21]. SAAB+ uses SCALOP [21] to annotate non-CDR-H3 loop canonical classes. Canonical class annotation should be highly accurate, with SCALOP predictions estimated to be within 1.5 Å of the true structure 90% of the time [21]. The June 2019 SCALOP database was used in this study.
SAAB+ uses FREAD to predict CDR-H3 structural templates [22,39,40]. Accurately modelling all the CDR-H3s in an Ig-seq dataset is challenging, owing to the vastness of structural space accessible to these loops [41–43], relative to the small number of publicly-available crystallographically-solved antibodies (many of which are highly sequence redundant) [24]. In addition, structurally-solved antibodies have a CDR-H3 length distribution and sequence diversity that is different from natural Ig-seq data (S3 Fig). We tested the performance of FREAD on the Ig-seq data and, at the parameters used, the expected average RMSD of FREAD CDR-H3 template predictions for both human and mouse data is 2.5 Å (see S3 Appendix). This is in line with current state-of-the-art CDR-H3 modelling software tools (mean RMSD of 2.8 Å) [44]. In a similar manner to DeKosky et al., [14], we limited our CDR-H3 analysis to loop lengths of 16 amino acids or shorter, as far fewer structures with longer CDR-H3 loops are available and longer loops have increased structural freedom. We also excluded CDR-H3 loops shorter than five amino acids from our analysis, as only three CDR-H3 templates covered these lengths. FREAD templates were downloaded from SAbDab (14th November 2018) [24], and consisted of all X-ray crystal structures of antibodies with a resolution better than 2.9 Å.
CDR-H3 clustering
To identify similar CDR-H3 loop structures, we used the DTW algorithm [16] to cluster FREAD templates by backbone RMSD. Those within 0.6 Å were placed in the same cluster, reducing our 2,943 FREAD CDR-H3 templates to 1,169 CDR-H3 clusters.
Filtering BCR repertoires
As PCR sequencing can lead to variable amplicon amplification, we removed any BCR repertoire if its two most redundant CDR-H3 clusters contained more than 80% of all repertoire sequences (S1 Fig). We also discarded any BCR repertoire that contained fewer than 10,000 sequences with predicted CDR-H3 structures—this cut-off was selected to allow for adequate sampling of CDR-H3 template usages, whilst retaining as many BCR repertories as possible (S3 Table). This reduced the number of repertoires for all subsequent structural analysis to 81 (human) and 73 (mouse). CDR-H1 and CDR-H2 loops were not taken into account in determining BCR repertoire quality, since canonical class coverage was ~95% and ~99% for the human and mouse data respectively (S2 Table).
Patterns of CDR-H3 cluster usage
We analysed the pattern of CDR-H3 cluster frequencies in the human and mouse data, to identify clusters whose usages were over-represented (Structural Stems), random (Randomly-Used) and under-represented (Under-Represented) within a given B-cell type.
The structurally-annotated human and mouse data was split into individual groups based on unique B-cell type and isotype combinations. Within these groups, we calculated the CDR-H3 length distributions and the proportion modellable by FREAD for each CDR-H3 length. Next, we randomly selected CDR-H3 templates from our FREAD library (with replacement) according to these distributions, to generate a randomized dataset for each BCR repertoire. Sampling was performed across the set of FREAD templates already present in each BCR repertoire. The randomized dataset sizes were set to one million sequences and the total number of randomized datasets was matched to the number of the BCR repertoires within the corresponding groups (S3 Table).
A one-sided Mann-Whitney rank test (p = 0.05) was performed on the relative usage of each CDR-H3 cluster in the grouped BCR repertoires and the corresponding randomized datasets, to categorize them as Structural Stem, Random-Usage or Under-Represented CDR-H3 clusters.
Statistical analysis
Statistical analyses were performed in Python using the scikit-learn[45] and scipy packages. Detailed information on statistical tests is outlined in the figure legends. Data visualization was performed with the seaborn package.
Supporting information
Data Availability
All data are within the manuscript and supporting information files, except for our software tool, which is available at https://github.com/oxpig/saab_plus under BSD 3-Clause license.
Funding Statement
AK is supported by funding from Biotechnology and Biological Sciences Research Council (BBSRC) [BB/M011224/1], UCB Pharma Ltd and Royal Commission for the Exhibition of 1851 Industrial Fellowship awarded to AK. MIJR and WKW are supported by Engineering and Physical Sciences Research Council and Medical Research Council Grant EP/L016044/1, GlaxoSmithKline plc, AstraZeneca plc, UCB Celltech, and F. Hoffmann-La Roche AG. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302(5909):575–81. 10.1038/302575a0 [DOI] [PubMed] [Google Scholar]
- 2.Briney B, Inderbitzin A, Joyce C, Burton DR. Commonality despite exceptional diversity in the baseline human antibody repertoire. Vol. 566, Nature. Nature Publishing Group; 2019. p. 393–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Yaari G, Vander Heiden JA, Uduman M, Gadala-Maria D, Gupta N, Joel JN, et al. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front Immunol. 2013; [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Stavnezer J, Guikema JEJ, Schrader CE. Mechanism and Regulation of Class Switch Recombination. Annu Rev Immunol. 2008. April;26(1):261–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. The promise and challenge of high-throughput sequencing of the antibody repertoire. Vol. 32, Nature Biotechnology. 2014. p. 158–68. 10.1038/nbt.2782 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Miho E, Yermanos A, Weber CR, Berger CT, Reddy ST, Greiff V. Computational Strategies for Dissecting the High-Dimensional Complexity of Adaptive Immune Repertoires. Front Immunol [Internet]. 2018 Feb 21 [cited 2018 Jul 5];9:224 Available from: 10.3389/fimmu.2018.00224 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Galson JD, Clutterbuck EA, Trück J, Ramasamy MN, Münz M, Fowler A, et al. BCR repertoire sequencing: Different patterns of B-cell activation after two Meningococcal vaccines. Immunol Cell Biol. 2015. November 1;93(10):885–95. 10.1038/icb.2015.57 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Galson JD, Trück J, Clutterbuck EA, Fowler A, Cerundolo V, Pollard AJ, et al. B-cell repertoire dynamics after sequential hepatitis B vaccination and evidence for cross-reactive B-cell activation. Genome Med [Internet]. 2016. December 16 [cited 2017 Oct 16];8(1):68 Available from: http://genomemedicine.biomedcentral.com/articles/10.1186/s13073-016-0322-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Greiff V, Menzel U, Miho E, Weber C, Riedel R, Cook S, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Rep. 2017;19(7):1467–78. 10.1016/j.celrep.2017.04.054 [DOI] [PubMed] [Google Scholar]
- 10.Ellebedy AH, Jackson KJL, Kissick HT, Nakaya HI, Davis CW, Roskin KM, et al. Defining antigen-specific plasmablast and memory B cell subsets in human blood after viral infection or vaccination. Nat Immunol. 2016. September 20;17(10):1226–34. 10.1038/ni.3533 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Haynes BF, Kelsoe G, Harrison SC, Kepler TB. B-cell-lineage immunogen design in vaccine development with HIV-1 as a case study. Vol. 30, Nature Biotechnology. 2012. p. 423–33. 10.1038/nbt.2197 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Greiff V, Bhat P, Cook SC, Menzel U, Kang W, Reddy ST. A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Med. 2015. May 28;7(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Kovaltsuk A, Krawczyk K, Galson JD, Kelly DF, Deane CM, Trück J. How B-Cell Receptor Repertoire Sequencing Can Be Enriched with Structural Antibody Data. Front Immunol [Internet]. 2017. Dec 8 [cited 2018 Jul 5];8:1753 Available from: http://journal.frontiersin.org/article/10.3389/fimmu.2017.01753/full [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.DeKosky BJ, Lungu OI, Park D, Johnson EL, Charab W, Chrysostomou C, et al. Large-scale sequence and structural comparisons of human naive and antigen-experienced antibody repertoires. Proc Natl Acad Sci U S A. 2016. May 10;113(19):E2636–45. 10.1073/pnas.1525510113 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Krawczyk K, Kelm S, Kovaltsuk A, Galson JD, Kelly D, Trück J, et al. Structurally mapping antibody repertoires. Front Immunol. 2018. July 23;9(JUL). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Nowak J, Baker T, Georges G, Kelm S, Klostermann S, Shi J, et al. Length-independent structural similarities enrich the antibody CDR canonical class model. MAbs. 2016;8(4):751–60. 10.1080/19420862.2016.1158370 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018. October 15;201(8):2502–9. 10.4049/jimmunol.1800708 [DOI] [PubMed] [Google Scholar]
- 18.Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007;35(SUPPL. 1). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Lefranc M-P, Pommié C, Ruiz M, Giuducelli V, Foulquier E, Truong L, et al. IMGT unique numbering fro immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev Comp Immunol. 2003;27(1):55–77. 10.1016/s0145-305x(02)00039-3 [DOI] [PubMed] [Google Scholar]
- 20.Kovaltsuk A, Krawczyk K, Kelm S, Snowden J, Deane CM. Filtering Next-Generation Sequencing of the Ig Gene Repertoire Data Using Antibody Structural Information. J Immunol. 2018; [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Wong WK, Georges G, Ros F, Kelm S, Lewis AP, Taddese B, et al. SCALOP: Sequence-based antibody canonical loop structure annotation. Bioinformatics. 2019. May 15;35(10):1774–6. 10.1093/bioinformatics/bty877 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Choi Y, Deane CM. FREAD revisited: Accurate loop structure prediction using a database search algorithm. Proteins Struct Funct Bioinforma. 2010;78(6):1431–40. [DOI] [PubMed] [Google Scholar]
- 23.Breden F, Luning Prak ET, Peters B, Rubelt F, Schramm CA, Busse CE, et al. Reproducibility and Reuse of Adaptive Immune Receptor Repertoire Data. Front Immunol [Internet]. 2017. November 1 [cited 2018 Jul 5];8:1418 Available from: http://www.ncbi.nlm.nih.gov/pubmed/29163494 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, et al. SAbDab: The structural antibody database. Nucleic Acids Res. 2014;42(D1). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Zemlin M, Klinger M, Link J, Zemlin C, Bauer K, Engler JA, et al. Expressed murine and human CDR-H3 intervals of equal length exhibit distinct repertoires that differ in their amino acid composition and predicted range of structures. J Mol Biol. 2003;334(4):733–49. 10.1016/j.jmb.2003.10.007 [DOI] [PubMed] [Google Scholar]
- 26.Sheng Z, Schramm CA, Kong R, Mullikin JC, Mascola JR, Kwong PD, et al. Gene-specific substitution profiles describe the types and frequencies of amino acid changes during antibody somatic hypermutation. Front Immunol. 2017;8(MAY). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Olimpieri PP, Chailyan A, Tramontano A, Marcatili P. Prediction of site-specific interactions in antibody-antigen complexes: The proABC method and server. Bioinformatics. 2013;29(18):2285–91. 10.1093/bioinformatics/btt369 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Davis CW, Jackson KJL, McElroy AK, Halfmann P, Huang J, Chennareddy C, et al. Longitudinal Analysis of the Human B Cell Response to Ebola Virus Infection. Cell. 2019. May 30;177(6):1566–1582.e17. 10.1016/j.cell.2019.04.036 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Shi B, Ma L, He X, Wang X, Wang P, Zhou L, et al. Comparative analysis of human and mouse immunoglobulin variable heavy regions from IMGT/LIGM-DB with IMGT/HighV-QUEST. Theor Biol Med Model. 2014. July 3;11(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Soto C, Bombardi RG, Branchizio A, Kose N, Matta P, Sevy AM, et al. High frequency of shared clonotypes in human B cell receptor repertoires. Nature [Internet]. 2019. February 13 [cited 2019 Mar 4];566(7744):398 Available from: http://www.nature.com/articles/s41586-019-0934-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Goldstein LD, Chen YJJ, Wu J, Chaudhuri S, Hsiao YC, Schneider K, et al. Massively parallel single-cell B-cell receptor sequencing enables rapid discovery of diverse antigen-reactive antibodies. Commun Biol. 2019. December 1;2(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.DeKosky BJ, Kojima T, Rodin A, Charab W, Ippolito GC, Ellington AD, et al. In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nat Med. 2015. January 1;21(1):86–91. 10.1038/nm.3743 [DOI] [PubMed] [Google Scholar]
- 33.Raybould MIJ, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, et al. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci U S A. 2019;116(10):4025–30. 10.1073/pnas.1810576116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Raybould MIJ, Wong WK, Deane CM. Antibody–antigen complex modelling in the era of immunoglobulin repertoire sequencing. Mol Syst Des Eng. 2019;4(4):679–88. [Google Scholar]
- 35.Lee EC, Liang Q, Ali H, Bayliss L, Beasley A, Bloomfield-Gerdes T, et al. Complete humanization of the mouse immunoglobulin loci enables efficient therapeutic antibody discovery. Nat Biotechnol. 2014;32(4):356–63. 10.1038/nbt.2825 [DOI] [PubMed] [Google Scholar]
- 36.Greiff V, Weber CR, Palme J, Bodenhofer U, Miho E, Menzel U, et al. Learning the High-Dimensional Immunogenomic Features That Predict Public and Private Antibody Repertoires. J Immunol. 2017. October 15;199(8):2985–97. 10.4049/jimmunol.1700594 [DOI] [PubMed] [Google Scholar]
- 37.North B, Lehmann A, Dunbrack RL. A new clustering of antibody CDR loop conformations. J Mol Biol. 2011;406(2):228–56. 10.1016/j.jmb.2010.10.030 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Al-Lazikani B, Lesk AM, Chothia C. Standard conformations for the canonical structures of immunoglobulins. J Mol Biol. 1997. November 7;273(4):927–48. 10.1006/jmbi.1997.1354 [DOI] [PubMed] [Google Scholar]
- 39.Deane CM, Blundell TL. CODA: a combined algorithm for predicting the structurally variable regions of protein models. Protein Sci. 2001;10(3):599–612. 10.1110/ps.37601 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Hill JR, Kelm S, Shi J, Deane CM. Environment specific substitution tables improve membrane protein alignment. Bioinformatics. 2011. July;27(13). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Regep C, Georges G, Shi J, Popovic B, Deane CM. The H3 loop of antibodies shows unique structural characteristics. Proteins Struct Funct Bioinforma. 2017;85(7):1311–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Marks C, Deane CM. Antibody H3 Structure Prediction. Vol. 15, Computational and Structural Biotechnology Journal. 2017. p. 222–31. 10.1016/j.csbj.2017.01.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Weitzner BD, Dunbrack RL, Gray JJ. The origin of CDR H3 structural diversity. Structure. 2015;23(2):302–11. 10.1016/j.str.2014.11.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Almagro JC, Teplyakov A, Luo J, Sweet RW, Kodangattil S, Hernandez-Guzman F, et al. Second Antibody Modeling Assessment (AMA-II). Proteins: Structure, Function and Bioinformatics. 2014;82(8):1553–62. [DOI] [PubMed] [Google Scholar]
- 45.Pedregosa FABIANPEDREGOSA F, Michel V, Grisel OLIVIERGRISEL O, Blondel M, Prettenhofer P, Weiss R, et al. Scikit-learn: Machine Learning in Python Gaël Varoquaux Bertrand Thirion Vincent Dubourg Alexandre Passos PEDREGOSA, VAROQUAUX, GRAMFORT ET AL. Matthieu Perrot [Internet]. Vol. 12, Journal of Machine Learning Research. 2011. [cited 2019 Aug 1]. Available from: http://scikit-learn.sourceforge.net. [Google Scholar]