Abstract
Somatic hypermutation (SHM) generates much of the antibody diversity necessary for affinity maturation and effective humoral immunity. The AID-induced DNA lesions and error-prone repair that underlie SHM are known to exhibit intrinsic biases when targeting the immunoglobulin (Ig) sequences. Computational models for SHM targeting often model the targeting probability of a nucleotide in a motif-based fashion, assuming that the same DNA motif is equally likely to be targeted regardless of its position along the Ig sequence. The validity of this assumption, however, has not been rigorously studied in vivo. Here, by analyzing a large collection of 956,157 human Ig sequences while controlling for the confounding influence of selection, we show that the likelihood of a DNA 5-mer motif being targeted by SHM is not the same at different positions in the same Ig sequence. We found position-dependent differential SHM targeting for about three quarters of the 38 and 269 unique motifs from more than half of the 292 and 1,912 motif-allele pairs analyzed using productive and non-productive Ig sequences respectively. The direction of the differential SHM targeting was largely conserved across individuals, with no allele-specific effect within an Ig heavy chain variable (IGHV) gene family, but was not consistent with general decay of SHM targeting with increasing distance from the transcription start site. However, SHM targeting did correlate positively with the mutability of the wider sequence neighborhood surrounding the motif. These findings provide insights and future directions for computational efforts towards modeling SHM.
Introduction
Effective humoral immunity is driven by B cells that can produce antibodies with high binding affinity towards their target antigens. Somatic hypermutation (SHM) generates the antibody diversity necessary for affinity maturation. During SHM, the enzyme AID deaminates a cytidine into uracil. The canonical and mutagenic versions of various repair pathways, such as base excision repair and mismatch repair, recognize and compete to resolve this lesion (1). As a result, point mutations are introduced into the immunoglobulin (Ig) variable region at a rate of approximately $10^{-3}$ per base pair per division (2), not only at but also around the initial lesion site. B cells with mutations that increase their binding affinity are then positively selected. While largely stochastic, SHM has intrinsic biases in terms of AID targeting preferences and error-prone repair. Computational models exist that attempt to capture such biases and predict which positions are more or less likely to mutate as a result of this SHM process. Understanding and modeling these biases are important for many aspects of computational analysis of B cell receptor (BCR) sequences, such as building phylogenetic models that take such B cell-specific mutation biases into account (3), understanding the lineage development of broadly neutralizing anti-HIV antibodies that require extensive SHM (4), quantifying the selection pressure on Ig sequences (5), and simulating Ig sequences with realistic mutation profiles (6).
Computational efforts to model SHM have long focused on the immediate DNA context of the mutations. These models attempt to estimate the relative likelihood of SHM targeting based on the local microsequence context. Studies from before the high-throughput sequencing era identified classical hot or cold spots made up of specific DNA motifs. Hot spots canonically include 5’WRC/GYW3’ and 5’WA/TW3’ motifs, where W={A, T}, R={G, A}, Y={C, T}, and the position mutating is underlined (7, 8). Cold spots include motifs of the pattern 5’SYC/GRS3’, where S={C, G} (9). It was observed that hot spots tended to be targeted by AID for SHM more often, whereas cold spots tended to be targeted less. Beyond these categorical designations, a hierarchy of mutabilities exists that is highly dependent on the surrounding nucleotides (10, 11). More recent works have largely reaffirmed the overall hierarchy amongst these 3-mer motifs, but have also shown significant differences among classical hot and cold spots by extending this microsequence context-based approach into models involving 5-mer (12-14) or even 7-mer motifs (15, 16). One widely used 5-mer model (17-22) was developed by (12). This synonymous, 5-mer, functional (S5F) model was able to estimate SHM targeting without the confounding influence of selection by focusing on synonymous (silent) mutations that occurred at central positions of DNA 5-mer motifs on productively rearranged, or “functional”, sequences (12). An alternative approach to leverage information from both non-synonymous (replacement) and synonymous mutations relies on the use of non-productively rearranged, or “non-functional”, sequences to avoid the confounding influence of selection. This model is referred to as the replacement and synonymous, 5-mer, non-functional (RS5NF) model (13). 
Both models describe the targeting preference of SHM by assigning to each 5-mer motif the relative probability of its central nucleotide being targeted given the composition of the adjacent four nucleotides (12, 13).
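The classical hot and cold spot definitions quoted above can be written as a small classifier over 5-mers. The following is a minimal sketch, not code from any of the cited models. It assumes that the mutating base is the C of WRC and SYC, the G of GYW and GRS, the A of WA, and the T of TW; the function name `classify_5mer` is hypothetical.

```python
# IUPAC groups used in the classical motif definitions quoted above.
W, R, Y, S = set("AT"), set("AG"), set("CT"), set("CG")

def classify_5mer(motif: str) -> str:
    """Classify the central nucleotide of a 5-mer as hot, cold, or neutral.

    Assumption: the mutating base is the C of WRC/SYC, the G of GYW/GRS,
    the A of WA, and the T of TW.
    """
    motif = motif.upper()
    if len(motif) != 5 or set(motif) - set("ACGT"):
        raise ValueError("expected a 5-mer over A, C, G, T")
    up2, up1, center, down1, down2 = motif
    if center == "C":
        if up2 in W and up1 in R:
            return "hot WRC/GYW"
        if up2 in S and up1 in Y:
            return "cold SYC/GRS"
    elif center == "G":
        if down1 in Y and down2 in W:
            return "hot WRC/GYW"
        if down1 in R and down2 in S:
            return "cold SYC/GRS"
    elif center == "A" and up1 in W:
        return "hot WA/TW"
    elif center == "T" and down1 in W:
        return "hot WA/TW"
    return "neutral"
```

A 5-mer model refines this three-way labeling by giving every one of the 1,024 possible 5-mers its own relative mutability, so two motifs with the same categorical label can still differ in targeting probability.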
Beyond the microsequence context, it has been suggested that SHM targeting could vary due to positional effects within the Ig sequence (23). (24) reported an exponential decay in mutation frequency at the 3’ end due to the distance from the transcription start site, predicting a fall to <1% of the maximum mutation frequency at ~2100 bp downstream of the 5’ boundary. (25) reported a sharp reduction in mutations separated by around 50 nucleotides for the nontranscribed strand, raising the possibility that AID might consistently skip a fixed distance during targeting. Past studies have also reported differences between the framework regions (FWRs) and the complementarity-determining regions (CDRs) beyond hot and cold spots, though they tended to be either low-throughput with analysis of fewer than 1,000 mutations from human heavy chain variable region sequences (8, 26) or did not fully account for selection (27). Generally, these proposed positional effects are not part of the microsequence context-based motif models for SHM.
A key assumption of the microsequence context-based models, regardless of whether they rely on 3-mer, 5-mer, or 7-mer motifs, is that the targeting probability of a motif is constant regardless of the position of the motif along the Ig sequence. However, the validity of this assumption has not been rigorously investigated. Here, we investigated whether the 5-mer microsequence context was sufficient to capture variations in mutability by examining the assumption that the probability of a 5-mer motif being targeted by SHM is the same regardless of its position along the Ig sequence. Specifically, we identified 5-mer DNA motifs that occur at two different positions in the same Ig heavy chain variable (IGHV) gene allele, and tested whether these motifs were targeted at the same rate. To control for selection, we either analyzed only motifs where mutation at the central nucleotide is always synonymous when using productively rearranged sequences, or performed the analysis using non-productively rearranged sequences. Based on close to one million BCR sequences, we found position-dependent differential SHM targeting for about three quarters of the unique motifs analyzed. The direction of the differential SHM targeting was largely conserved across individuals with no allele-specific effect within an IGHV family, and correlated positively with the mutability of the wider sequence neighborhood surrounding the motif.
Materials & Methods
Motif-allele pairs (MAPs)
A motif-allele pair (MAP) captures exactly two instances of the same DNA 5-mer motif in an IGHV allele with the central nucleotides located at position i (5’ site) and position j (3’ site), where i<j. If a motif appears m times in an IGHV allele, where m>2, this is captured by $\binom{m}{2} = m(m-1)/2$ MAPs. Under the S5F analysis, which used productively rearranged sequences, the MAPs analyzed were restricted to motifs whose central nucleotide can mutate synonymously only. Under the RS5NF analysis, which used non-productively rearranged sequences, the MAPs analyzed included motifs whose central nucleotide can mutate both synonymously and non-synonymously.
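The MAP definition can be illustrated with a short enumeration routine. This is a minimal sketch under stated assumptions: the input is an ungapped nucleotide string, so the reported positions are 1-based string offsets rather than IMGT-numbered positions, and the synonymous-only restriction used in the S5F analysis is not applied. The function name `find_maps` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_maps(allele_seq: str, k: int = 5):
    """Return (motif, i, j) for every pair of sites sharing the same k-mer.

    i and j are 1-based positions of the central nucleotide, with i < j.
    Assumption: allele_seq is an ungapped nucleotide string, so positions
    here are string offsets rather than IMGT-numbered positions.
    """
    half = k // 2
    sites = defaultdict(list)
    for start in range(len(allele_seq) - k + 1):
        motif = allele_seq[start:start + k].upper()
        if set(motif) <= set("ACGT"):          # skip gaps and ambiguous bases
            sites[motif].append(start + half + 1)
    pairs = []
    for motif, positions in sites.items():
        # a motif seen m times contributes every pair of its m sites
        pairs.extend((motif, i, j) for i, j in combinations(positions, 2))
    return sorted(pairs, key=lambda p: (p[1], p[2]))
```

For the toy sequence `ACGTAACGTAACGTA`, the motif ACGTA is centered at positions 3, 8, and 13 and therefore yields three pairs, while each of the four motifs seen twice yields one.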
Reference IGHV alleles
An initial set of 372 reference IGHV alleles for Homo sapiens was downloaded from IMGT/GENE-DB release 201918-4 under the “F+ORF+in-frame P” specification (28). Only alleles labelled functional and non-partial were kept, leaving 218 alleles, each of which fully covered IMGT-numbered nucleotide positions 1 to 312. Additionally, one allele from each of 6 pairs of alleles with identical nucleotide sequences was removed. The removed alleles were IGHV1-69D*01, IGHV3-23D*01, IGHV3-30-5*02, IGHV3-30-3*03, IGHV3-30-5*01, and IGHV2-70D*04, which were identical sequence-wise to, respectively, IGHV1-69*01, IGHV3-23*01, IGHV3-30*02, IGHV3-30*04, IGHV3-30*18, and IGHV2-70*04. The curated set of reference IGHV alleles contained 212 unique alleles.
Overview of datasets
Five previously reported BCR repertoire datasets comprising 30 human subjects were re-processed in a consistent manner.
The dataset from (20) originated from a study (PRJNA338795) examining the BCR repertoire of myasthenia gravis patients (20). From this study, we started with bulk-sequenced heavy chains from FACS-sorted CD19+CD27− naïve B cells and CD19+CD27+ memory B cells from 4 healthy controls. For analysis, we used only sequences from CD19+CD27+ memory B cells.
The dataset from (29) originated from a study (PRJNA429427) comparing IgD−CD27− double negative (DN) B cells and IgD−CD27+ non-DN B cells (29). From this study, we analyzed bulk-sequenced heavy chains from FACS-sorted CD19+IgD−CD27+ class-switched memory B cells from 3 healthy controls.
The dataset from (30) originated from a study (PRJNA300878) on the influence of heritability on the BCR repertoires of twins (30). From this study, we started with bulk-sequenced heavy chains from FACS-sorted CD20+CD27− naïve B cells and CD20+CD27+CD38low memory B cells from 10 nominally healthy subjects belonging to five monozygotic twin pairs. For analysis, we used only sequences from CD20+CD27+CD38low memory B cells.
The dataset from (31) originated from a study (PRJNA349143) on the antibody response induced by flu vaccine (31), samples from which were later re-sequenced (32). From the re-sequencing (32), we started with bulk-sequenced heavy chains from unsorted PBMCs from 3 healthy adults. For analysis, we used only class-switched sequences.
The dataset from (33) originated from one of the largest studies (PRJNA406949) to date examining the circulating BCR repertoire (33). From this study, we started with bulk-sequenced heavy chains from unsorted PBMCs from 10 healthy adults. For analysis, we used only class-switched sequences.
Re-processing of datasets
For the datasets from (20), (29), (30), and (31), we obtained sequences pre-processed via pRESTO as described in (20), (29), (30), and (32) respectively.
For the dataset from (33), we downloaded assembled reads made publicly available by (33) at https://github.com/briney/grp_paper (2019-08-27) and performed further pre-processing in pRESTO v0.5.11 (34). For each subject, there were 6 biological samples, each with 3 technical replicates. For each biological sample, duplicate reads in its technical replicates were collapsed into unique sequences. Biological samples for the same subject were then combined. Isotypes were assigned by local alignment of the 3’ end of each sequence to IgM- and IgG-specific internal constant region sequences (IgM: GGGCGGATGCACTCCC; IgG: GGCCCTTGGTGGARGC). Sequences with inconsistent isotype assignment and IgM or IgG primer alignment were removed.
For each dataset, initial germline V(D)J gene annotation was performed using IgBLAST v1.14.0 (35) with IMGT/GENE-DB release 201918-4 (28). Additional quality control was performed, requiring sequences to align exclusively to heavy chain V and J genes; have a minimum V segment coverage from nucleotide positions 1 to 312 under the IMGT unique numbering scheme (36); and have Ns at fewer than 10% of V segment positions. Productively rearranged sequences without a junction length that was a multiple of 3 were removed, where junction was defined to be from IMGT codon 104 encoding the conserved cysteine to codon 118 encoding phenylalanine or tryptophan. Individual genotypes were computationally inferred using TIgGER v0.3.1 (37) and used to finalize V(D)J annotations. To remove potential chimeric sequences, sequences containing more than 5 mismatches from the germline in any 10-bp window were identified using SHazaM v0.1.11 (38) and removed.
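The chimera filter described above (more than 5 mismatches from the germline in any 10-bp window) can be sketched with a sliding count. This is an illustrative stand-alone function, not the SHazaM implementation. It assumes the observed and germline sequences are already aligned to equal length, and that positions holding N or gap characters are not counted as mismatches; the function name is hypothetical.

```python
def has_mutation_cluster(observed: str, germline: str,
                         window: int = 10, max_mismatches: int = 5) -> bool:
    """True if any `window`-bp stretch has more than `max_mismatches` mismatches.

    Assumptions: the two sequences are aligned and of equal length, and
    positions where either sequence has N, '.', or '-' are not counted.
    """
    if len(observed) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    skip = set("N.-")
    mismatch = [int(o != g and o not in skip and g not in skip)
                for o, g in zip(observed.upper(), germline.upper())]
    count = sum(mismatch[:window])             # mismatches in the first window
    if count > max_mismatches:
        return True
    for end in range(window, len(mismatch)):
        # slide by one: add the entering position, drop the leaving one
        count += mismatch[end] - mismatch[end - window]
        if count > max_mismatches:
            return True
    return False
```

Sequences for which the function returns `True` would be the ones removed as potential chimeras.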
Productively rearranged sequences were then used for inferring B cell clones via hierarchical clustering (32) on a by-subject basis. Using Change-O v0.4.5 (38), productively rearranged sequences were first partitioned based on common V and J gene annotations and junction region lengths. Within each partition, sequences whose junction regions were within a specified normalized Hamming distance from each other were clustered as clones. This distance threshold was determined by manual inspection in conjunction with kernel density estimates, in order to identify the local minimum between the two modes of the bimodal distance-to-nearest distribution representing, respectively, clonally related and unrelated sequences. The thresholds used for the datasets from (20), (29), (30), (31), and (33) were, respectively, 0.1, 0.12, 0.08, 0.1, and 0.05 units of normalized Hamming distance. Following clonal clustering, clonal consensus germline sequences were reconstructed for each clone. A single clonal representative sequence was then chosen for each clone. For the sorted datasets ((20), (29), and (30)), the most highly mutated memory B cell sequence, if any, was used. For the unsorted datasets ((31) and (33)), the most highly mutated class-switched sequence, if any, was used.
Non-productively rearranged sequences which contained out-of-frame rearrangements in the junction region, and which contained no indel in the V or J segment or stop mutation (30), were aggregated across subjects and across datasets.
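The within-partition clustering step can be sketched as threshold-based grouping on normalized Hamming distance. This is not Change-O code. It assumes single linkage (the linkage criterion is not stated above), treats a distance equal to the threshold as "within", and expects junctions that already share V gene, J gene, and junction length. Both function names are hypothetical.

```python
def normalized_hamming(a: str, b: str) -> float:
    """Fraction of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("junctions within a partition must share a length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def single_linkage_clones(junctions, threshold):
    """Assign a clone label to each junction using single linkage.

    Assumptions: single linkage, distance <= threshold counts as linked,
    and all junctions come from one V gene / J gene / length partition.
    """
    parent = list(range(len(junctions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    for i in range(len(junctions)):
        for j in range(i + 1, len(junctions)):
            if normalized_hamming(junctions[i], junctions[j]) <= threshold:
                parent[find(i)] = find(j)      # merge the two groups
    roots = [find(i) for i in range(len(junctions))]
    relabel = {root: n for n, root in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```

With single linkage, two junctions can fall in the same clone without being within the threshold of each other, provided a chain of intermediate junctions connects them.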
For the dataset from (33), nucleotide positions 1 through 50 were excluded from the analysis to account for the fact that the V gene primers annealed to FWR1 (33).
Statistical tests
For a given MAP, to test whether any observed difference in the mutation frequency of the central nucleotide of the 5-mer motif at the 5’ site and at the 3’ site was statistically significant, a binary sign test, or McNemar’s test, was performed (39). First, a contingency table was constructed, in which n11, n12, n21, and n22 denote the number of sequences in which, respectively, there was no mutation at either site; there was no mutation at the 5’ site but there was at the 3’ site; there was mutation at the 5’ site but there was not at the 3’ site; and there were mutations at both sites. A test was performed if there were at least 35 sequences containing the MAP (n11+n12+n21+n22 ≥ 35), and if there were at least 5 sequences in which there was mutation at one site but not at the other (n12+n21 ≥ 5). The test statistic, $z = (n_{21} - n_{12}) / \sqrt{n_{12} + n_{21}}$, follows a standard Normal distribution. A two-sided test was performed, and the direction of the observed difference was noted. Correction for multiple testing was performed by controlling for false discovery rate (40).
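The test and the multiple-testing correction can be sketched as follows. This is a minimal illustration under stated assumptions: the statistic is the Normal-approximation form of McNemar’s test on the discordant counts, with the sign chosen so that a positive value means the 5’ site mutates more often, and the false discovery rate procedure is assumed to be Benjamini-Hochberg. Both function names are hypothetical.

```python
import math

def mcnemar_test(n11: int, n12: int, n21: int, n22: int,
                 min_total: int = 35, min_discordant: int = 5):
    """Return (z, two_sided_p), or None when the MAP fails the data thresholds.

    n12: mutated at the 3' site only; n21: mutated at the 5' site only.
    Sign convention (an assumption): z > 0 means the 5' site mutates more often.
    """
    total = n11 + n12 + n21 + n22
    discordant = n12 + n21
    if total < min_total or discordant < min_discordant:
        return None
    z = (n21 - n12) / math.sqrt(discordant)
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided Normal tail area
    return z, p

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top               # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Only the discordant counts n12 and n21 enter the statistic; n11 and n22 matter only for the minimum-data threshold.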
Under the meta-analysis, for a given MAP tested in more than one subject, a combined one-sided p-value was calculated for each direction using Stouffer’s method (41). The combined p-value of the direction with the lower value was doubled and controlled for false discovery rate.
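The meta-analysis step can be sketched with the standard library. This sketch assumes unweighted Stouffer’s method and assumes that, for each subject, the one-sided p-value for the opposite direction is one minus the p-value for the first direction. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def stouffer(p_values):
    """Combine one-sided p-values with unweighted Stouffer's method."""
    # Clip so that inv_cdf always receives a value strictly inside (0, 1).
    clipped = [min(max(p, 1e-300), 1 - 1e-16) for p in p_values]
    z = sum(-_STD_NORMAL.inv_cdf(p) for p in clipped) / sqrt(len(clipped))
    return _STD_NORMAL.cdf(-z)

def combine_directions(p_five_prime_higher):
    """Two-sided meta-analysis p-value and favored direction for one MAP.

    Input: per-subject one-sided p-values for "5' site mutates more often".
    Assumption: the opposite one-sided p-value is 1 - p for each subject.
    """
    p_up = stouffer(p_five_prime_higher)
    p_down = stouffer([1 - p for p in p_five_prime_higher])
    direction = "5' higher" if p_up <= p_down else "3' higher"
    return min(1.0, 2 * min(p_up, p_down)), direction
```

The doubled p-values returned for all MAPs would then be passed together to the false discovery rate correction.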
Targeting models
An updated version of the S5F SHM targeting model was built with SHazaM v0.1.11 (38) using the most highly mutated, memory or class-switched, and productively rearranged sequences as clonal representatives (same as in “Re-processing of datasets”) from all subjects from all datasets. Default parameters, “minNumMutations=50” and “minNumSeqMutations=500”, were used for the “createSubstitutionMatrix” and “createMutabilityMatrix” functions, respectively.
Simulation
Simulated sequences were generated using the “shmulateSeq” function from SHazaM v0.1.11 (38). In each simulation run, for each observed sequence, the number of nucleotide mutations in the V segment was first calculated using the “observedMutations” function. An equal number of mutations were then introduced into the corresponding germline sequence with probabilities specified by an S5F SHM targeting model.
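The idea of introducing a fixed number of mutations according to motif-based targeting probabilities can be sketched as below. This is a simplified stand-in and does not reproduce the behavior of “shmulateSeq”. It assumes that sequence ends are padded with N, that targeting weights are recomputed after every mutation, that each position mutates at most once, and that the replacement base is drawn uniformly. The `mutability` callable is a hypothetical placeholder for a fitted targeting model.

```python
import random

def simulate_mutations(germline: str, n_mutations: int, mutability, rng=None):
    """Introduce n_mutations point mutations, each at a distinct position.

    mutability: callable mapping a 5-mer to a non-negative targeting weight
    (a hypothetical stand-in for a fitted S5F model).
    Assumptions: flanks are padded with N, weights are recomputed after each
    mutation, and the substituted base is drawn uniformly at random.
    """
    rng = rng or random.Random()
    seq = list(germline.upper())
    available = set(range(len(seq)))
    for _ in range(n_mutations):
        padded = ["N", "N"] + seq + ["N", "N"]
        positions = sorted(available)
        # padded[p:p + 5] is the 5-mer centered on seq[p]
        weights = [mutability("".join(padded[p:p + 5])) for p in positions]
        if not positions or sum(weights) <= 0:
            break                              # nothing left that can mutate
        target = rng.choices(positions, weights=weights, k=1)[0]
        seq[target] = rng.choice([b for b in "ACGT" if b != seq[target]])
        available.discard(target)
    return "".join(seq)
```

Because each simulated sequence starts from the germline of an observed sequence and receives the same number of mutations, the simulated dataset keeps the allele usage and mutational load of the observed one while its positional pattern is driven only by the motif weights.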
Motif Analysis
Enrichment for particular nucleotides at each position in the motifs from MAPs was assessed using pLogo v1.2.0 (https://plogo.uconn.edu/) (42). For comparing MAPs that showed higher mutation frequency at the 5’ site with those that showed lower mutation frequency at the 5’ site, the set of motifs from MAPs that showed significant differential SHM targeting was used as the background.
Correlation between observed mutation frequency and surrounding germline mutability
We calculated the correlation between the observed mutation frequency of the central nucleotide of the MAPs and the germline mutability of neighborhoods of varying sizes surrounding the MAPs. The germline mutability of a neighborhood surrounding the central nucleotide of a given motif was calculated by averaging the individual mutabilities estimated by the updated S5F targeting model of the 5-mer motifs within that neighborhood. Neighborhoods extended up to 28 5-mer motifs (30 nucleotides) in the 5’ direction and up to 43 5-mer motifs (45 nucleotides) in the 3’ direction. Depending on the positions of a MAP and the neighborhood size, if the neighborhood covered more nucleotides than there were in the germline allele, truncation was performed and only the 5-mer motifs present were considered. For each neighborhood size, the percentage of MAPs whose observed mutation frequency correlated positively with the surrounding germline mutability was calculated.
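The neighborhood averaging with truncation can be sketched as follows. This is an illustration only: it assumes a precomputed list giving the model mutability of the 5-mer centered at each germline position, and it assumes that the focal motif itself is left out of the average unless requested, which is not specified above. The function name is hypothetical.

```python
def neighborhood_mutability(mutability_by_position, center,
                            n_upstream, n_downstream, include_center=False):
    """Mean 5-mer mutability in a window around `center` (0-based index).

    mutability_by_position[p] is the model mutability of the 5-mer centered
    at p, or None where no complete 5-mer exists.
    Assumption: the focal motif itself is excluded unless include_center=True.
    """
    lo = max(0, center - n_upstream)           # truncate at the allele ends
    hi = min(len(mutability_by_position) - 1, center + n_downstream)
    values = [mutability_by_position[p] for p in range(lo, hi + 1)
              if (include_center or p != center)
              and mutability_by_position[p] is not None]
    return sum(values) / len(values) if values else float("nan")
```

Calling the function for both sites of a MAP over a grid of `n_upstream` and `n_downstream` values gives, for each neighborhood size, a pair of neighborhood mutabilities that can be compared with the pair of observed mutation frequencies.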
Results
Germline landscape of motif-allele pairs (MAPs)
To examine whether the local nucleotide context is sufficient to capture the relative likelihood of SHM targeting, we proposed to test whether two instances of the same DNA 5-mer motif at different positions in a single IGHV allele would be targeted at equal rates. As a first step, we surveyed the germline landscape for DNA 5-mer motifs that appear at more than one site along the same IGHV allele. We define a motif-allele pair (MAP) as two instances of the same 5-mer motif in an IGHV allele (Materials & Methods). We collected 212 unique, functional, and non-partial human germline IGHV reference alleles from IMGT/GENE-DB (28), and searched for MAPs along this curated set of germline IGHV alleles. In doing so, we divided the MAPs into two categories, depending on the types of mutations possible at the central nucleotide of a 5-mer motif. The first category of MAPs involved motifs whose central nucleotide only mutates synonymously. In total, we found 876 such MAPs involving 210 unique alleles and 44 unique motifs (Figure 1A; Supplemental Figure 1A). Alleles contained 4.2 MAPs on average, with most alleles harboring between 2 and 6 MAPs (Figure 1B). Around 93% of the MAPs occurred at classical SYC/GRS cold spots or neutral spots, with the remaining 1% and 5% of the MAPs occurring at classical WA/TW and WRC/GYW hot spots, respectively (Figure 1C). The median distance between two instances of the same motif was 39, 54, 51, and 9 nts for MAPs occurring at cold, neutral, WA/TW hot, and WRC/GYW hot spots, respectively (Figure 1C). The second category of MAPs involved motifs whose central nucleotide could also mutate non-synonymously. We found a total of 12,704 such MAPs involving 212 unique alleles and 389 unique motifs (Supplemental Figures 1B and 2A). Alleles contained 59.9 MAPs on average, with most alleles harboring between 50 and 65 MAPs (Supplemental Figure 2B). 
Within MAPs, there was a median distance of 85, 79, 69, and 89 nts between instances for those occurring at cold, neutral, WA/TW hot, and WRC/GYW hot spots, respectively (Supplemental Figure 2C). Overall, the widespread presence of MAPs across germline IGHV alleles supports a systematic investigation of the MAPs observed in high-throughput BCR repertoire sequencing data.
Figure 1.

Motif-allele pairs (MAPs) in which the central nucleotide of the motif can only mutate synonymously. (A) MAPs found in an example set of germline IGHV alleles. The list of MAPs in IGHV2-5*01 is enumerated in the box. (B) Distribution of the number of MAPs harbored in a germline IGHV allele. (C) Distribution of the nucleotide distance between the central nucleotides at the 5’ and 3’ sites of a MAP.
MAPs are targeted by SHM at different rates
To test whether MAPs are targeted for SHM at the same rate, we gathered curated bulk-sequenced BCR heavy chain sequences from 30 human subjects from five published studies (Materials & Methods). These sequences either came from FACS-sorted memory B cells or, if derived from unsorted PBMCs, were filtered to include only class-switched, and hence putatively antigen-experienced, sequences. In total, we collected 936,714 productively rearranged BCR sequences and 19,443 non-productively rearranged BCR sequences (Table 1).
Table 1.
Heavy chain immunoglobulin sequences from bulk sequencing from five B cell receptor (BCR) repertoire datasets used for analysis
| Dataset | Source | Isotype filter | Subject | Number of productively rearranged sequences a | Number of non-productively rearranged sequences | Reference & Accession |
|---|---|---|---|---|---|---|
| (20) | FACS-sorted CD19+CD27+ memory B cells b | - | HD07 | 4401 | 29 | (20); PRJNA338795 |
| | | | HD09 | 3240 | 16 | |
| | | | HD10 | 3974 | 21 | |
| | | | HD13 | 4249 | 29 | |
| (29) | FACS-sorted CD19+IgD−CD27+ memory B cells | - | HC024 | 5221 | 37 | (29); PRJNA429427 |
| | | | HC038 | 4404 | 11 | |
| | | | HC263 | 11199 | 55 | |
| (30) | FACS-sorted CD20+CD27+CD38low memory B cells b | - | TW01A | 1502 | 267 | (30); PRJNA300878 |
| | | | TW01B | 1056 | 20 | |
| | | | TW02A | 5950 | 887 | |
| | | | TW02B | 4336 | 1094 | |
| | | | TW03A | 3309 | 896 | |
| | | | TW03B | 1866 | 402 | |
| | | | TW04A | 5506 | 1348 | |
| | | | TW04B | 6974 | 1991 | |
| | | | TW05A | 3196 | 890 | |
| | | | TW05B | 1315 | 266 | |
| (31) | Unsorted PBMCs | Class-switched (IgA, IgE, IgG) | hu420139 | 25700 | 718 | (31-32); PRJNA349143 |
| | | | 420IV | 35702 | 593 | |
| | | | PGP1 | 12825 | 249 | |
| (33) | Unsorted PBMCs | Class-switched (IgG) | 316188 | 90724 | 836 | (33); PRJNA406949 |
| | | | 326650 | 58993 | 406 | |
| | | | 326651 | 215683 | 3261 | |
| | | | 326713 | 102876 | 1243 | |
| | | | 326737 | 30366 | 213 | |
| | | | 326780 | 51748 | 540 | |
| | | | 326797 | 95961 | 1119 | |
| | | | 326907 | 33148 | 162 | |
| | | | 327059 | 83602 | 1503 | |
| | | | D103 | 27688 | 341 | |

a: One clonal representative sequence per clone.
b: Additional cell subsets used for genotyping and clonal inference but not for analysis (see Materials & Methods).
We first analyzed the SHM frequency of MAPs among the productively rearranged BCR sequences. To avoid the possibility that selection pressures could influence the observed mutation frequency at the different sites, we focused this analysis (hereafter the “S5F analysis”) on the set of MAPs where all mutations at the central nucleotide were synonymous. For each MAP, within each subject, we compared the mutation frequency at the two sites of the MAP (Figure 2A). Specifically, for a MAP observed in at least 35 sequences, in which at least 5 sequences saw mutation at one site but not at the other, we performed a McNemar’s test to determine whether the likelihoods of observing a mutation at one site but not at the other were significantly different (Materials & Methods). As one example, consider the MAP comprising the motif CTGAG occurring at IMGT-numbered nucleotide positions 57 and 282 along IGHV3-74*02 (Figure 2A, left). This MAP was observed in a total of 1,388 sequences in Subject 316188, and accumulated a mutation at one site but not the other in 74 of these sequences. Ten sequences contained mutation at position 57 but not at position 282 (frequency = 0.14), while 64 sequences contained mutation at position 282 but not at position 57 (frequency = 0.86). In other words, the 5’ site was significantly less likely than the 3’ site to mutate with a difference in likelihood of 0.73 (False Discovery Rate (FDR) = $1.7 \times 10^{-9}$). As a second example, consider the MAP comprising ACCAG occurring at positions 105 and 237 along IGHV1-8*01 (Figure 2A, right). This MAP was observed in a total of 270 sequences in Subject HC263, and accumulated a mutation at one site but not the other in 53 sequences. Forty-four sequences contained a mutation at position 105 but not at position 237 (frequency = 0.83), while 9 sequences contained mutation at position 237 but not at position 105 (frequency = 0.17). 
In other words, the 5’ site was significantly more likely than the 3’ site to mutate with a difference in likelihood of 0.66 (FDR = $1.6 \times 10^{-5}$). Using the abovementioned thresholds, there was sufficient data to test for 12-218 MAPs involving 5-57 unique alleles and 9-27 unique motifs across the 30 subjects (Table 2; Supplemental Figure 3A). In 28 of the 30 subjects, we found MAPs in which there was a statistically significant difference (FDR < 0.05) between the two sites in the likelihood of being targeted by SHM (Figure 2B). Amongst these MAPs, the median difference in the likelihood of being targeted by SHM between the two sites ranged from 0.46 to 1.00, with 1.00 being the scenario where all of the mutations were observed to accumulate at one site and never the other site (Figure 2B). Thus, there exist MAPs that are targeted by SHM at different rates.
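The frequencies and differences quoted in the two examples above can be recomputed from the discordant counts. The sketch assumes that the “difference in likelihood” is the absolute difference between the two frequencies among discordant sequences, an interpretation that reproduces the reported values. The function name is hypothetical.

```python
def discordant_summary(n_5p_only: int, n_3p_only: int):
    """Frequencies among discordant sequences and their absolute difference."""
    discordant = n_5p_only + n_3p_only
    f5 = n_5p_only / discordant
    f3 = n_3p_only / discordant
    return round(f5, 2), round(f3, 2), round(abs(f5 - f3), 2)

# CTGAG in IGHV3-74*02 (positions 57 and 282), Subject 316188
print(discordant_summary(10, 64))   # (0.14, 0.86, 0.73)
# ACCAG in IGHV1-8*01 (positions 105 and 237), Subject HC263
print(discordant_summary(44, 9))    # (0.83, 0.17, 0.66)
```

A difference of 1.00 corresponds to every discordant sequence carrying its mutation at the same site.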
Figure 2.
Position-dependent differential SHM targeting within MAPs. (A) Examples of MAPs being tested for differential SHM targeting in individual-subject S5F analysis and aggregate RS5NF analysis pooling all non-productively rearranged sequences. Gray shaded regions correspond to CDR1 and CDR2. (B) Difference in likelihood of mutating at one site but not at the other between the 5’ and 3’ sites of MAPs showing significant differential SHM targeting in each individual-subject S5F analysis. Each point represents a MAP and is colored by the classical categorization of the motif in the MAP. Numbers at the top indicate the number of MAPs found significant. (C) Percentage of MAPs showing significant differential SHM targeting in observed sequences (solid) versus sequences from 500 simulated datasets (hollow) based on the S5F model.
Table 2.
Number of motif-allele pairs (MAPs) detected, tested, and found significant for differential somatic hypermutation (SHM) targeting
| Analysis | Dataset | Subject | MAPs Detected a | MAPs Tested b: Total | MAPs Tested b: Cold SYC/GRS | MAPs Tested b: Neutral | MAPs Tested b: Hot WA/TW | MAPs Tested b: Hot WRC/GYW | Unique Alleles Tested | Unique Motifs Tested | MAPs Significant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S5F | (20) | HD07 | 189 | 70 | 17 | 46 | 1 | 6 | 25 | 24 | 1 |
| | | HD09 | 180 | 57 | 13 | 37 | 1 | 6 | 22 | 20 | 0 |
| | | HD10 | 216 | 50 | 12 | 33 | 1 | 4 | 22 | 20 | 5 |
| | | HD13 | 214 | 81 | 17 | 54 | 2 | 8 | 26 | 27 | 0 |
| | (29) | HC024 | 70 | 62 | 16 | 40 | 0 | 6 | 16 | 26 | 20 |
| | | HC038 | 79 | 69 | 18 | 46 | 2 | 3 | 21 | 20 | 10 |
| | | HC263 | 102 | 94 | 19 | 67 | 3 | 5 | 23 | 26 | 28 |
| | (30) | TW01A | 199 | 21 | 9 | 10 | 1 | 1 | 11 | 11 | 5 |
| | | TW01B | 190 | 12 | 5 | 6 | 1 | 0 | 5 | 9 | 3 |
| | | TW02A | 232 | 112 | 35 | 68 | 3 | 6 | 33 | 29 | 30 |
| | | TW02B | 198 | 90 | 30 | 53 | 2 | 5 | 29 | 23 | 17 |
| | | TW03A | 225 | 56 | 17 | 35 | 2 | 2 | 21 | 19 | 12 |
| | | TW03B | 202 | 29 | 12 | 14 | 3 | 0 | 14 | 14 | 10 |
| | | TW04A | 207 | 103 | 31 | 60 | 4 | 8 | 34 | 28 | 26 |
| | | TW04B | 219 | 117 | 39 | 67 | 4 | 7 | 37 | 29 | 41 |
| | | TW05A | 213 | 64 | 19 | 41 | 1 | 3 | 22 | 23 | 15 |
| | | TW05B | 204 | 20 | 10 | 7 | 1 | 2 | 10 | 13 | 3 |
| | (31) | hu420139 | 234 | 203 | 43 | 143 | 2 | 15 | 51 | 34 | 54 |
| | | 420IV | 251 | 218 | 53 | 149 | 3 | 13 | 57 | 37 | 61 |
| | | PGP1 | 245 | 179 | 42 | 123 | 2 | 12 | 50 | 32 | 21 |
| | (33) | 316188 | 209 | 80 | 24 | 53 | 3 | 0 | 37 | 22 | 52 |
| | | 326650 | 176 | 69 | 22 | 45 | 2 | 0 | 33 | 19 | 47 |
| | | 326651 | 202 | 81 | 27 | 51 | 3 | 0 | 37 | 22 | 63 |
| | | 326713 | 246 | 96 | 29 | 64 | 3 | 0 | 43 | 22 | 65 |
| | | 326737 | 210 | 64 | 20 | 41 | 3 | 0 | 32 | 19 | 33 |
| | | 326780 | 222 | 75 | 21 | 52 | 2 | 0 | 37 | 20 | 45 |
| | | 326797 | 207 | 74 | 24 | 48 | 2 | 0 | 35 | 20 | 51 |
| | | 326907 | 202 | 57 | 16 | 39 | 2 | 0 | 31 | 17 | 30 |
| | | 327059 | 208 | 81 | 27 | 51 | 3 | 0 | 39 | 20 | 56 |
| | | D103 | 253 | 77 | 23 | 51 | 3 | 0 | 37 | 19 | 35 |
| RS5NF | All | All | 6579 | 1912 | 180 | 368 | 396 | 968 | 68 | 269 | 813 |

a: Number of MAPs present in at least one sequence.
b: Number of MAPs with sufficient data for testing (see Materials & Methods).
Next, we analyzed the SHM frequency of MAPs among the non-productively rearranged BCR sequences. As these sequences contained out-of-frame junctions and were presumably not subjected to selection (30), we were able to explore in this analysis (hereafter the “RS5NF” analysis) a larger number of MAPs, some of which gave rise to non-synonymous mutations when the central nucleotide mutated, and which were therefore inaccessible previously due to the restriction to MAPs giving rise to only synonymous mutations. Given the smaller number of non-productively rearranged sequences in each subject, we performed the analysis by combining all such sequences across subjects. There was sufficient data to test for 1,912 MAPs involving 68 unique alleles and 269 unique motifs (Table 2; Supplemental Figure 3B). In 813 MAPs, the likelihoods of one site mutating and the other not mutating were found to be statistically different (FDR < 0.05), with the median difference between the two sites being 0.80 (Supplemental Figure 4). These 813 MAPs involved 199 unique motifs, including 19 SYC/GRS cold spots, 34 WA/TW hot spots, and 35 WRC/GYW hot spots.
Since SHM is a stochastic process, to test the possibility that the observed differences in the likelihood of one site of a MAP mutating and the other not mutating arose by chance, we stochastically simulated sequences to see whether the same extent of differences could be detected. We preserved the MAP abundance and the mutational load of the observed data by generating an equivalent number of sequences with the same underlying germline alleles and introducing an equivalent number of mutations per sequence as the observed data. The mutations were introduced according to targeting probabilities specified by an updated S5F model of SHM targeting that we constructed as part of this study. The original S5F model estimated the SHM targeting probabilities based on 806,860 synonymous mutations found in 446,027 sequences (12). We built an updated S5F model based on 4,074,051 synonymous mutations found in our curated set of 936,714 productively rearranged sequences (Supplemental Figure 5A). The two models correlate highly in terms of 5-mer motif mutability (Spearman correlation coefficient = 0.86) (Supplemental Figure 5B). In the updated S5F model, we continued to see a distinct hierarchy of 5-mer motif mutability (Supplemental Figure 5C) and a strong bias for nucleotide transitions over transversions (Supplemental Figure 5D), as observed in the original model (12). Upon analyzing the MAPs from 500 simulated datasets, we found on average 0.03%, 0.06%, 0.02%, 0.05%, and 0.03% MAPs per dataset with significant differential SHM targeting for, respectively, Subject HD10, HC263, TW04B, 420IV, and the RS5NF analysis pooling non-productively rearranged sequences, compared to significantly higher (Bonferroni empirical p < 0.05) percentages of 10.00%, 29.79%, 35.04%, 27.98%, and 42.52% in the observed data (Figure 2C). Hence, it was unlikely that the observed differential SHM targeting on MAPs was due to chance. 
Taken together, these results suggest that there is differential SHM targeting on the same DNA 5-mer motif occurring at different positions along the same IGHV allele.
Direction of differential SHM targeting consistent across subjects and across alleles within an IGHV family
To investigate whether the observed differential SHM targeting was likely to be an effect intrinsic to the MAPs, we examined the consistency in the direction of the differences observed across individual subjects. We reasoned that a lack of consistency in directionality would imply that the observed differences were due to factors extrinsic to the MAPs themselves. Of the 134 MAPs that tested significant in more than one individual-subject S5F analysis, 118 (88%; p = 1.1 × 10^−20, one-sided binomial test) showed consistency in directionality across all the subjects in which the MAP tested significant (Figure 3; Supplemental Figure 6). These 118 MAPs involved 24 unique motifs, including 6 SYC/GRS cold spots, 1 WA/TW hot spot, and 2 WRC/GYW hot spots. Thus, it is likely that two instances of the same motif experience differential SHM targeting because of some characteristic intrinsic to the MAP, such as the specific allele in which the MAP occurs, the positions of the motifs, or the broader sequence context surrounding the motifs.
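The one-sided binomial test quoted above can be reproduced with a short Python sketch. It assumes a null probability of 0.5 for a MAP showing a consistent direction; the function name is illustrative and not taken from the study's code.

```python
from math import comb


def binom_test_greater(successes, trials, p_null=0.5):
    """Return P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 118 of 134 MAPs consistent in direction, null proportion assumed to be 0.5.
p_value = binom_test_greater(118, 134)  # about 1.1e-20
```

The exact sum is adequate for counts of this size; for much larger numbers of trials, a library routine that works in log space would be more robust.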
Figure 3.
Direction of differential SHM targeting is largely conserved across individuals. A meta-analysis identified 190 MAPs with significant differential SHM targeting, for which the directions observed in individual-subject S5F analyses, aggregate S5F analysis pooling all productively rearranged sequences, and aggregate RS5NF analysis pooling all non-productively rearranged sequences are also shown. Each column identifies a MAP. The identities of the MAPs are detailed in Supplemental Figure 6.
In the S5F analysis, we analyzed exclusively MAPs that gave rise to only synonymous mutations in an effort to avoid being skewed by the selection bias exerted upon the productively rearranged sequences. To assess the possibility that the observed differential SHM targeting on these MAPs was still driven by selection, despite the fact that their central nucleotides could only mutate synonymously, we compared the direction of the differences observed in the productively rearranged sequences in the S5F analysis to that observed in the non-productively rearranged sequences, which are thought not to be subject to selection (30), in the RS5NF analysis. To facilitate comparison, we first synthesized the results from the individual-subject S5F analyses by performing a meta-analysis across subjects (Materials & Methods). Of the 292 unique MAPs involving 38 unique motifs that were tested using productively rearranged sequences in more than one subject and therefore included in the meta-analysis, 190 showed significant differential SHM targeting (FDR < 0.05). These 190 MAPs involved 29 unique motifs, including 6 SYC/GRS cold spots, 1 WA/TW hot spot, and 3 WRC/GYW hot spots. The meta-analysis result was consistent with the result from the RS5NF analysis pooling non-productively rearranged sequences from all subjects. Of the 39 MAPs that tested significant in both the meta-analysis and the RS5NF analysis, 34 (87%; p = 1.2 × 10^−6, one-sided binomial test) agreed in terms of directionality (Figure 3; Supplemental Figure 6). Hence, it is unlikely that the differences observed in the MAPs in the S5F analysis were due to selection.
Next, to investigate whether there is an allele-specific effect to the observed differential SHM targeting on a motif, we investigated the consistency in directionality across different alleles. From the MAPs that tested significant in the meta-analysis, we examined the direction of differential SHM targeting on motifs occurring at the same IMGT-numbered nucleotide positions across different alleles. In all but four cases, the alleles of concern were all found within the same IGHV family. These involved 23, 6, 55, and 48 MAPs composed of the same motifs occurring at the same positions in different alleles of, respectively, the IGHV1, IGHV2, IGHV3, and IGHV4 family (Figures 4A-D). All but three of these MAPs showed consistency in directionality across alleles within their respective IGHV family. One exception was the motif CTGAG at positions 57 and 282, which was targeted more frequently at the 5’ site in IGHV3-23*04, but less frequently in the other IGHV3 alleles (Figure 4C). The remaining two exceptions involved the motif GGGGG at positions 24 and 48, which was targeted more frequently at the 5’ site in IGHV3-48*02 and IGHV3-66*02, but less frequently in IGHV3-15*01, IGHV3-15*07, IGHV3-21*01, and IGHV3-23*01 (Figure 4C). The four cases in which the alleles of concern spanned different IGHV families involved 41 MAPs composed of the same motifs occurring at the same positions in alleles across one or more of the IGHV1, IGHV3, IGHV5, and IGHV7 families. Of the four unique motifs involved in these 41 MAPs, CCTGG and CTGAA showed consistency in directionality across alleles from different IGHV families, while GTGCA and TCTGG displayed inconsistency (Figure 4E). Interestingly, GTGCA showed inconsistency amongst IGHV1 alleles, whereas TCTGG showed consistency amongst IGHV1 alleles but inconsistency between IGHV1 and IGHV3 alleles (Figure 4E).
Overall, the direction of differential SHM targeting appears to be largely conserved across MAPs involving the same motif at the same positions in different alleles, at least within the same IGHV family.
Figure 4.
Direction of significant differential SHM targeting for MAPs found in multiple (A) IGHV1 alleles; (B) IGHV2 alleles; (C) IGHV3 alleles; (D) IGHV4 alleles; and (E) alleles across different IGHV families. Each point represents a MAP, with the x-axis indicating the allele in which the MAP is found and the y-axis indicating the 5-mer motif and the IMGT-numbered positions of the central nucleotides at the 5’ and 3’ sites.
Differential SHM targeting is correlated with surrounding mutability
To investigate the potential mechanisms underlying the differential SHM targeting observed in MAPs, we first examined characteristics intrinsic to the MAPs that showed significant differential SHM targeting in the meta-analysis (Figure 5A). We compared the motifs from the MAPs that showed higher mutation frequency at the 5’ site with those that showed lower mutation frequency at the 5’ site. We found enrichment for A at the first nucleotide and for T at the central nucleotide in the motifs from the former group of MAPs, and enrichment for C at the first nucleotide in the motifs from the latter group of MAPs (Bonferroni p < 0.05) (Materials & Methods; Supplemental Figures 7A-B). We found no significant difference in the abundance of different types of motif (cold/neutral/hot), or of alleles from different IGHV families (Supplemental Figures 7C-D), with the Spearman correlation coefficients for abundance rank being 1 and 0.9 respectively for the types of motif and the IGHV families. Overall, in terms of intrinsic characteristics, we found different patterns of nucleotide enrichment in MAPs showing differential SHM targeting.
Figure 5.

Potential factors related to the direction of differential SHM targeting. (A) Position-dependent differential SHM targeting in 190 MAPs that tested significant in the meta-analysis. Gray shaded regions correspond to CDR1 and CDR2. (B) Fold change in mutation frequency at 5’ site over that at 3’ site versus distance between the 5’ and 3’ sites of a MAP, with the linear regression line shown in red. Relationship between the mutation frequency of each MAP and (C) the percentage of classical hot spots, or (D, E) the average mutability, in a neighborhood extending 16 (C, D) or 23 (E) 5-mer motifs in the 5’ direction and 27 5-mer motifs in the 3’ direction. Each line represents a MAP. MAPs for which the slope was undefined due to 5’ and 3’ sites having equal percentages of surrounding hot spots or average mutability were excluded. One-sided binomial tests were performed for the proportions of positive slopes being >0.5.
It has been suggested that SHM targeting decreases with distance from the transcription start site (24). Although we did not find that MAPs with an increased mutation frequency at the 5’ site were over-represented in our data (p > 0.05), the effect of distance may only be apparent when considering MAPs that are further apart. Thus, we examined whether there was an incremental relationship between the distance from the 5’ site of a MAP to its 3’ site, and the fold change in mutation frequency at the 5’ site over that at the 3’ site. We performed a linear regression and found that an x nt increase in the distance between the 5’ and 3’ sites of a MAP tended to correlate with a fold change in 5’ over 3’ site mutation frequency of 2^(−0.00985x), or 0.99^x times (p = 1.3 × 10^−10, t-test) (Figure 5B). The fact that 0.99^x is smaller than 1 for any positive x implies that MAPs whose motifs are farther apart tend to be targeted less frequently at the 5’ site compared to at the 3’ site, consistent with the observation in Figure 5A. In addition, the fact that 0.99^x decreases as x increases implies that there is a decremental relationship where a greater distance between the sites is associated with a smaller fold change in 5’ over 3’ site mutation frequency. Hence, we did not find evidence that the 5’ site of MAPs was targeted more frequently, or that greater distance between the 5’ and 3’ sites correlated with greater difference in SHM targeting.
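The arithmetic behind this regression result can be checked with a few lines of Python. The sketch assumes only the slope of −0.00985 per nucleotide on the log2 scale reported above; the constant and function names are illustrative, and any intercept of the fitted line is deliberately left out because the quantity of interest is the relative change associated with an increase in distance.

```python
SLOPE_LOG2_PER_NT = -0.00985  # regression slope reported in the text (log2 scale)


def relative_fold_change(distance_increase_nt):
    """Multiplicative change in the 5'/3' fold change for a given increase in distance."""
    return 2.0 ** (SLOPE_LOG2_PER_NT * distance_increase_nt)


per_nt = relative_fold_change(1)       # roughly 0.99, the base of the 0.99^x form
per_100_nt = relative_fold_change(100)  # roughly 0.5
```

Under this fit, two sites that are 100 nt farther apart are associated with about half the 5’ over 3’ fold change, which is the opposite of what a general decay of targeting with distance from the transcription start site would predict.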
The CDRs tend to have a higher density of classical hot spot motifs compared to the framework regions (FWRs) (p = 7.8 × 10^−37, one-sided paired Mann-Whitney test; Supplemental Figure 8A) and are known to carry higher mutational loads (43). We hypothesized that proximity to these regions might lead to higher SHM targeting that is not fully captured by the 5-mer motif context. Thus, we tested whether the MAP site closer to a CDR would have a higher mutation frequency. For each MAP, we compared the distance to the nearest CDR for the 5’ site and for the 3’ site, with the corresponding mutation frequency at each site. We did not find a significant relationship between the distance to the nearest CDR and the mutation frequency (p = 0.77, one-sided binomial test; Supplemental Figure 8B).
Even if not driven by proximity to a CDR, it is possible that increased mutability of the surrounding sequence leads to a higher mutation frequency by recruiting the SHM machinery to the local vicinity. Thus, we tested whether the more mutated site in each MAP tended to be found in a neighborhood with higher overall mutability, where mutability was either based crudely on the abundance of classical hot spots, or estimated more precisely by the updated S5F model. We first tested a neighborhood extending 16 5-mer motifs 5’ and 27 5-mer motifs 3’ of the motif. This neighborhood size is based on the finding that error-prone repair can introduce mutations extending at least 29 nt 5’ and 18 nt 3’ of the cytosine deaminated by AID (44). With respect to a mutation that is part of the mutational spread, the spread could extend 18 nt 5’ and 29 nt 3’ of itself, or, equivalently, 16 5-mer motifs and 27 5-mer motifs in the 5’ and 3’ direction respectively. With this neighborhood size, we found a significant enrichment (p = 2.6 × 10^−10, one-sided binomial test) amongst 72% of the MAPs for positive correlation between the mutation frequency in each MAP and the abundance of classical hot spots (either WA/TW or WRC/GYW or both), with no strand specificity (Figure 5C; Supplemental Figure 9). Similarly, we found a significant enrichment (p = 7.0 × 10^−9, one-sided binomial test) amongst 71% of the MAPs for positive correlation with the average mutability of the neighborhood estimated by the S5F model (Figure 5D). In addition, we also investigated whether other neighborhood sizes could give a greater enrichment for MAPs with positive correlation between the observed differential SHM targeting pattern and S5F-based mutability of the neighborhood. We performed a grid search for a neighborhood size for which we saw the most MAPs with positive correlation (Materials & Methods).
We found the optimal neighborhood size to be 23 5-mer motifs (25 nt) in the 5’ and 27 5-mer motifs (29 nt) in the 3’ direction. With this neighborhood size, we found a significant enrichment (p = 5.2 × 10^−14, one-sided binomial test) amongst 77% of the MAPs for positive correlation (Figure 5E). Taken together, these results suggest that differences in the mutation frequency of the same motif at different positions are at least partly driven by the overall mutability of the wider neighborhood surrounding the motif.
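The two neighborhood summaries used above (percentage of classical hot spots and average mutability) can be sketched in Python as follows. Several points are assumptions rather than details confirmed by the text: a neighborhood of n 5-mer motifs is read as the 5-mers centered on the n positions immediately 5’ or 3’ of the motif's central nucleotide (so that 16 motifs span 18 nt); hot spots are classified by the position of the mutating base at the center of the 5-mer; positions are 0-based; and the mutability table passed in is a hypothetical stand-in for S5F values.

```python
W, R, Y = set("AT"), set("AG"), set("CT")  # IUPAC ambiguity codes


def is_classical_hot_spot(kmer):
    """True if the central base of a 5-mer lies in a WRC/GYW or WA/TW motif."""
    a, b, c, d, e = kmer
    wrc = c == "C" and a in W and b in R
    gyw = c == "G" and d in Y and e in W
    wa = c == "A" and b in W
    tw = c == "T" and d in W
    return wrc or gyw or wa or tw


def neighborhood_5mers(seq, center, n_5prime, n_3prime):
    """5-mers centered on positions around `center`, excluding the motif itself."""
    kmers = []
    for pos in range(center - n_5prime, center + n_3prime + 1):
        if pos == center or pos < 2 or pos > len(seq) - 3:
            continue  # skip the motif and positions lacking a full 5-mer
        kmers.append(seq[pos - 2:pos + 3])
    return kmers


def hot_spot_percentage(seq, center, n_5prime=16, n_3prime=27):
    """Percentage of neighborhood 5-mers that are classical hot spots."""
    kmers = neighborhood_5mers(seq, center, n_5prime, n_3prime)
    return 100.0 * sum(map(is_classical_hot_spot, kmers)) / len(kmers) if kmers else 0.0


def average_mutability(seq, center, mutability, n_5prime=23, n_3prime=27, default=1.0):
    """Mean mutability of neighborhood 5-mers; `default` is an assumed fallback."""
    kmers = neighborhood_5mers(seq, center, n_5prime, n_3prime)
    return sum(mutability.get(k, default) for k in kmers) / len(kmers) if kmers else 0.0
```

For a MAP, either summary would be computed at the 5’ site and at the 3’ site, and the sign of the difference compared with the sign of the difference in observed mutation frequency.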
Discussion
In this study, we examined a key assumption of the microsequence context-based SHM targeting model. Using curated BCR sequences from five published studies and under a MAP framework, we investigated whether it was equally likely for a DNA 5-mer motif to be targeted by SHM regardless of its position along the same IGHV allele. To avoid being skewed by the selection pressure exerted on B cells as part of affinity maturation, we analyzed either MAPs found on productively rearranged sequences in which the motif could only mutate synonymously, or MAPs found on non-productively rearranged sequences which presumably were not subject to selection.
We found statistically significant, position-dependent, differential SHM targeting for about two thirds (190) of the 292 MAPs tested based on productively rearranged sequences and for about 43% (813) of the 1,912 MAPs tested based on non-productively rearranged sequences. In both cases, the significant MAPs involved about three quarters of the unique motifs tested (29 out of 38 and 199 out of 269, respectively). Importantly, some of these MAPs involved classical WA/TW and WRC/GYW hot spots, suggesting that the “hotness” of a hot spot could also vary depending on its position. For a motif that occurs twice on the same allele, the difference in the likelihood that one site has a mutation while the other does not ranged from 0.092 to 1. The magnitude of difference correlated loosely and negatively with the number of sequences (Pearson correlation coefficients −0.43 and −0.40; Supplemental Figures 10A-B), suggesting either that the effect size decreased as the amount of data increased, or that more data was able to reveal more subtle differences. Either way, a difference that is at the minimum around 0.1, or 10%, should not be negligible. Indeed, when we performed the same analysis on simulated sequences into which the same number of mutations as observed were introduced according to SHM targeting probabilities specified by the current S5F model, we did not detect the same differential SHM targeting, suggesting that the observed effect was not due to chance. Hence, the next generation of motif-based SHM targeting models would be improved by taking such position dependency into account.
Upon examining the direction of the observed differential SHM targeting, we found that it was largely consistent across individuals and across different IGHV alleles within the same gene family. Due to the small number of motifs analyzed that occurred at the same positions across IGHV alleles from different gene families (Figure 4E), it remains unclear whether there is a gene-specific effect to the differential SHM targeting observed. Furthermore, we found positive correlation with the mutability of the wider sequence neighborhood surrounding the motif. One mechanism that could underlie this correlation is spreading of mutations from the initial AID-induced lesion to the surrounding sequence neighborhood due to error-prone repair pathways (44). Under this model, 5-mers that are surrounded by AID hot spots are more likely to be in the range of mutation spreading events, resulting in a higher targeting probability. Consistent with this hypothesis, we found that the correlation between 5-mer motif targeting and the surrounding neighborhood was strongest when the neighborhood extended 25 nucleotides in the 5’ direction and 29 nucleotides in the 3’ direction from the central nucleotide of the motif. This is highly similar to the observation that SHM could spread mutations in a neighborhood extending 29 nucleotides in the 5’ direction and 18 nucleotides in the 3’ direction from the initial AID lesion site (44). Further studies are necessary in order to quantify the potential influence of mutational spread on position-dependent differences in SHM frequency. One possibility is to study SHM patterns in Ig sequences from antigen-experienced B cells that are deficient in MutSα-mediated mismatch repair (1). While knockout mice targeting MutSα have been reported (45-48), mutations in MutSα, a heterodimer made up of MSH2 and MSH6, are rare in humans. However, high-throughput repertoire sequencing data from individuals with such mutations were recently reported by (49).
Our analysis focused on pairs of 5-mers that occurred at different positions along a germline IGHV allele. While we often visualized these MAPs as lines connecting two points, where the two points represented the two positions of occurrence (e.g., Figure 5A), these lines should not be interpreted as a trend. It is also important to point out that our conclusions do not require that a motif occurring more than twice along an allele follow a monotonically increasing or decreasing pattern amongst its multiple occurrences. Among motifs eligible for the S5F analysis, the maximum number of occurrences for a single motif within an allele was three. These motifs were captured by three separate MAPs: one for positions i and j; one for positions i and k; and one for positions j and k, where i<j<k. Within the 190 MAPs that tested significant for differential SHM targeting (Figure 5A), there were only 14 cases in which the motif occurred three times along an allele (Supplemental Figure 10C). In 9 of these 14 cases, the mutation frequency of the motif did not monotonically increase or decrease from 5’ to 3’. Instead, for these cases, we found high correlation between the mutability of the local sequence neighborhood surrounding the motif and its mutation frequency (Supplemental Figure 10D), consistent with the broader results for MAPs.
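The way a repeated motif gives rise to one MAP per pair of occurrences can be made concrete with a small Python sketch. The function name and the toy sequence are hypothetical, positions are 0-based indices of the central nucleotide rather than IMGT-numbered positions, and no eligibility filtering (for example, restriction to synonymous-only sites) is applied.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_motif_pairs(germline):
    """Map each repeated 5-mer to all (5' site, 3' site) pairs of central positions."""
    positions = defaultdict(list)
    for center in range(2, len(germline) - 2):
        positions[germline[center - 2:center + 3]].append(center)
    return {
        motif: list(combinations(sites, 2))
        for motif, sites in positions.items()
        if len(sites) > 1
    }


# Toy sequence in which AACGT occurs three times (central positions 2, 9, and 16).
pairs = enumerate_motif_pairs("AACGTTTAACGTGGAACGT")
# → {"AACGT": [(2, 9), (2, 16), (9, 16)]}
```

A motif with three occurrences at positions i<j<k therefore contributes the three pairs (i, j), (i, k), and (j, k), each of which is tested independently of the others.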
A common view in the field holds that SHM targeting decreases with distance from transcription start (50-52). Although this finding is primarily attributed to a study by (24), (53) correctly points out that this study only established that SHM frequency exponentially decays downstream of the J segment, and did not investigate the V(D)J exon itself. The fact that we observed MAPs in the V segment with both positive and negative slopes of differential SHM targeting (Figure 5A), with many MAPs accumulating more mutations at the 3’ site, argues against a model where SHM frequency generally decreases with distance from transcription start. Thus, our results challenge the assumption that the exponential decay observed by (24) applies to the V(D)J exon.
Instead of modeling mutational spread around 5-mer motifs, an alternative possibility would be to expand the number of surrounding nucleotides in the original motifs and use 7-mer, 9-mer, or even 11-mer motifs. As the length of the motif increases, an increasing challenge that such an approach faces is that a possible motif that exists in theory may not exist in reality given the composition of the Ig germlines, thereby greatly reducing the number of motifs eligible for study. In addition, it has been suggested that even extending the microsequence context to 7-mer motifs only accounts for 70%-80% of the variation in mutability (23). Indeed, based on their 7-mer motif model, (16) noted that in order to determine SHM hot spots, one also needs to consider the co-occurrence of nearby mutations, in addition to the 7-mer microsequence context.
While we found significant differential SHM targeting for MAPs involving alleles from all seven IGHV families, an inherent limitation of the allele-specific MAP framework is that it requires more data compared to a motif-based approach. Whereas a motif-based approach aggregates all sequences in which a motif is found in an allele-blind manner, the MAP framework requires sequences from a specific allele for the analysis of a particular MAP. Consequently, only 292 unique MAPs out of the theoretical set of 876 were tested in more than one subject in the S5F analysis; and only 1,912 unique MAPs out of the theoretical set of 12,704 were tested in the RS5NF analysis. Also, for subjects in the dataset from (33), we did not test for any MAP involving a WRC/GYW hot spot (Table 2), because the 5’ site of such MAPs all fell within the first 50 IMGT-numbered nucleotide positions (Supplemental Figure 1A), which were excluded from analysis as “burn-in” for the sequencing (Materials & Methods). In addition, while we used a version of the IMGT reference alleles that was most up-to-date at the time of analysis, and which contained computationally identified and experimentally validated novel alleles (54) from the datasets from (20), (30), and (31), we did not computationally infer novel alleles for the datasets from (29) and (33). Should novel polymorphisms exist in those two datasets, that could be a potential source of error as undetected polymorphisms would be mistaken for mutations. Nevertheless, the likelihood that a novel polymorphism happens to fall within one of the MAPs analyzed should be small. Indeed, it has been estimated that less than 0.1% of the mutations used for the S5F analysis should be affected by unknown polymorphism (12).
In summary, we conclude that the SHM targeting probability of a DNA 5-mer motif is not constant along the Ig sequence and depends on the position of the motif. This finding has significant implications for SHM analysis, as current analytical frameworks generally assume that mutation targeting can be accurately captured by a 5-mer sequence context or even more minimally (for example, the WRCY/RGYW hot spot). Incorporating position dependence will be an important challenge for future SHM targeting models, perhaps by explicitly modeling the initial AID-induced lesion, followed by mutational spread.
Supplementary Material
Key Points.
Somatic hypermutation targeting for a DNA 5-mer motif can differ by position.
Direction of differential targeting conserved across subjects and alleles.
Differential targeting correlates with mutability of wider neighborhood of motifs.
Acknowledgments
We thank the Yale Center for Research Computing for use of the research computing infrastructure.
This work was supported in part by the National Institutes of Health (NIH) under award number R01AI104739.
Footnotes
Conflict of Interest Statement
S.H.K. receives consulting fees from Northrop Grumman.
References
- 1.Methot SP, and Di Noia JM. 2017. Molecular Mechanisms of Somatic Hypermutation and Class Switch Recombination. Adv. Immunol 133: 37–87. [DOI] [PubMed] [Google Scholar]
- 2.McKean D, Huppi K, Bell M, Staudt L, Gerhard W, and Weigert M. 1984. Generation of antibody diversity in the immune response of BALB/c mice to influenza virus hemagglutinin. Proc. Natl. Acad. Sci. U. S. A 81: 3180–3184. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Hoehn KB, Lunter G, and Pybus OG. 2017. A Phylogenetic Codon Substitution Model for Antibody Lineages. Genetics 206: 417–427. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Haynes BF, Kelsoe G, Harrison SC, and Kepler TB. 2012. B-cell–lineage immunogen design in vaccine development with HIV-1 as a case study. Nat. Biotechnol 30: 423–433. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Yaari G, Uduman M, and Kleinstein SH. 2012. Quantifying selection in high-throughput Immunoglobulin sequencing data sets. Nucleic Acids Res. 40: e134–e134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Yermanos A, Greiff V, Krautler NJ, Menzel U, Dounas A, Miho E, Oxenius A, Stadler T, and Reddy ST. 2017. Comparison of methods for phylogenetic B-cell lineage inference using time-resolved antibody repertoire simulations (AbSim). Bioinforma. Oxf. Engl 33: 3938–3946. [DOI] [PubMed] [Google Scholar]
- 7.Betz AG, Rada C, Pannell R, Milstein C, and Neuberger MS. 1993. Passenger transgenes reveal intrinsic specificity of the antibody hypermutation mechanism: clustering, polarity, and specific hot spots. Proc. Natl. Acad. Sci 90: 2385–2388. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Shapiro GS, Aviszus K, Ikle D, and Wysocki LJ. 1999. Predicting Regional Mutability in Antibody V Genes Based Solely on Di- and Trinucleotide Sequence Composition. J. Immunol 163: 259–268. [PubMed] [Google Scholar]
- 9.Bransteitter R, Pham P, Calabrese P, and Goodman MF. 2004. Biochemical Analysis of Hypermutational Targeting by Wild Type and Mutant Activation-induced Cytidine Deaminase. J. Biol. Chem 279: 51612–51621. [DOI] [PubMed] [Google Scholar]
- 10.Smith DS, Creadon G, Jena PK, Portanova JP, Kotzin BL, and Wysocki LJ. 1996. Di- and trinucleotide target preferences of somatic mutagenesis in normal and autoreactive B cells. J. Immunol 156: 2642–2652. [PubMed] [Google Scholar]
- 11.Shapiro GS, Ellison MC, and Wysocki LJ. 2003. Sequence-specific targeting of two bases on both DNA strands by the somatic hypermutation mechanism. Mol. Immunol 40: 287–295. [DOI] [PubMed] [Google Scholar]
- 12.Yaari G, Vander Heiden J, Uduman M, Gadala-Maria D, Gupta N, Stern JNH, O’Connor K, Hafler D, Laserson U, Vigneault F, and Kleinstein S. 2013. Models of Somatic Hypermutation Targeting and Substitution Based on Synonymous Mutations from High-Throughput Immunoglobulin Sequencing Data. Front. Immunol 4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Cui A, Niro RD, Vander Heiden JA, Briggs AW, Adams K, Gilbert T, O’Connor KC, Vigneault F, Shlomchik MJ, and Kleinstein SH. 2016. A Model of Somatic Hypermutation Targeting in Mice Based on High-Throughput Ig Sequencing Data. J. Immunol 197: 3566–3574. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Feng J, Shaw DA, Minin VN, Simon N, and Matsen FAI. 2019. Survival analysis of DNA mutation motifs with penalized proportional hazards. Ann. Appl. Stat 13: 1268–1294. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Elhanati Y, Sethna Z, Marcou Q, Callan CG, Mora T, and Walczak AM. 2015. Inferring processes underlying B-cell repertoire diversity. Philos. Trans. R. Soc. Lond. B. Biol. Sci 370. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Marcou Q, Mora T, and Walczak AM. 2018. High-throughput immune repertoire analysis with IGoR. Nat. Commun 9: 561. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Bohannon C, Powers R, Satyabhama L, Cui A, Tipton C, Michaeli M, Skountzou I, Mittler RS, Kleinstein SH, Mehr R, Lee FE-Y, Sanz I, and Jacob J. 2016. Long-lived antigen-induced IgM plasma cells demonstrate somatic mutations and contribute to long-term protection. Nat. Commun 7: 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Bonsignori M, Kreider EF, Fera D, Meyerhoff RR, Bradley T, Wiehe K, Alam SM, Aussedat B, Walkowicz WE, Hwang K-K, Saunders KO, Zhang R, Gladden MA, Monroe A, Kumar A, Xia S-M, Cooper M, Louder MK, McKee K, Bailer RT, Pier BW, Jette CA, Kelsoe G, Williams WB, Morris L, Kappes J, Wagh K, Kamanga G, Cohen MS, Hraber PT, Montefiori DC, Trama A, Liao H-X, Kepler TB, Moody MA, Gao F, Danishefsky SJ, Mascola JR, Shaw GM, Hahn BH, Harrison SC, Korber BT, and Haynes BF. 2017. Staged induction of HIV-1 glycan–dependent broadly neutralizing antibodies. Sci. Transl. Med 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Supek F, and Lehner B. 2017. Clustered Mutation Signatures Reveal that Error-Prone DNA Repair Targets Mutations to Active Genes. Cell 170: 534–547.e23. [DOI] [PubMed] [Google Scholar]
- 20.Vander Heiden JA, Stathopoulos P, Zhou JQ, Chen L, Gilbert TJ, Bolen CR, Barohn RJ, Dimachkie MM, Ciafaloni E, Broering TJ, Vigneault F, Nowak RJ, Kleinstein SH, and O’Connor KC. 2017. Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing. J. Immunol 198: 1460–1473. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Ohm-Laursen L, Meng H, Chen J, Zhou JQ, Corrigan CJ, Gould HJ, and Kleinstein SH. 2018. Local Clonal Diversification and Dissemination of B Lymphocytes in the Human Bronchial Mucosa. Front. Immunol 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Wiehe K, Bradley T, Meyerhoff RR, Hart C, Williams WB, Easterhoff D, Faison WJ, Kepler TB, Saunders KO, Alam SM, Bonsignori M, and Haynes BF. 2018. Functional Relevance of Improbable Antibody Mutations for HIV Broadly Neutralizing Antibody Development. Cell Host Microbe 23: 759–765.e6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Schramm CA, and Douek DC. 2018. Beyond Hot Spots: Biases in Antibody Somatic Hypermutation and Implications for Vaccine Design. Front. Immunol 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Rada C, and Milstein C. 2001. The intrinsic hypermutability of antibody heavy and light chain genes decays exponentially. EMBO J. 20: 4570–4576. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.MacCarthy T, Kalis SL, Roa S, Pham P, Goodman MF, Scharff MD, and Bergman A. 2009. V-region mutation in vitro, in vivo, and in silico reveal the importance of the enzymatic properties of AID and the sequence environment. Proc. Natl. Acad. Sci 106: 8629–8634. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Shapiro GS, Aviszus K, Murphy J, and Wysocki LJ. 2002. Evolution of Ig DNA Sequence to Target Specific Base Positions Within Codons for Somatic Hypermutation. J. Immunol 168: 2302–2306. [DOI] [PubMed] [Google Scholar]
- 27.Cohen RM, Kleinstein SH, and Louzoun Y. 2011. Somatic hypermutation targeting is influenced by location within the immunoglobulin V region. Mol. Immunol 48: 1477–1483. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Giudicelli V, Chaume D, and Lefranc M-P. 2005. IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res. 33: D256–D261. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Fraussen J, Marquez S, Takata K, Beckers L, Diaz GM, Zografou C, Van Wijmeersch B, Villar LM, O’Connor KC, Kleinstein SH, and Somers V. 2019. Phenotypic and Ig Repertoire Analyses Indicate a Common Origin of IgD−CD27− Double Negative B Cells in Healthy Individuals and Multiple Sclerosis Patients. J. Immunol 203: 1650–1664. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Rubelt F, Bolen CR, McGuire HM, Vander Heiden JA, Gadala-Maria D, Levin M, Euskirchen GM, Mamedov MR, Swan GE, Dekker CL, Cowell LG, Kleinstein SH, and Davis MM. 2016. Individual heritable differences result in unique cell lymphocyte receptor repertoires of naïve and antigen-experienced cells. Nat. Commun 7: ncomms11112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Laserson U, Vigneault F, Gadala-Maria D, Yaari G, Uduman M, Heiden JAV, Kelton W, Jung ST, Liu Y, Laserson J, Chari R, Lee J-H, Bachelet I, Hickey B, Lieberman-Aiden E, Hanczaruk B, Simen BB, Egholm M, Koller D, Georgiou G, Kleinstein SH, and Church GM. 2014. High-resolution antibody dynamics of vaccine-induced immune responses. Proc. Natl. Acad. Sci 111: 4928–4933. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Gupta NT, Adams KD, Briggs AW, Timberlake SC, Vigneault F, and Kleinstein SH. 2017. Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data. J. Immunol. 198: 2489–2499.
- 33. Briney B, Inderbitzin A, Joyce C, and Burton DR. 2019. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566: 393–397.
- 34. Vander Heiden JA, Yaari G, Uduman M, Stern JNH, O’Connor KC, Hafler DA, Vigneault F, and Kleinstein SH. 2014. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30: 1930–1932.
- 35. Ye J, Ma N, Madden TL, and Ostell JM. 2013. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 41: W34–W40.
- 36. Lefranc M-P, Pommié C, Ruiz M, Giudicelli V, Foulquier E, Truong L, Thouvenin-Contet V, and Lefranc G. 2003. IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev. Comp. Immunol. 27: 55–77.
- 37. Gadala-Maria D, Yaari G, Uduman M, and Kleinstein SH. 2015. Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. Proc. Natl. Acad. Sci. 112: E862–E870.
- 38. Gupta NT, Vander Heiden JA, Uduman M, Gadala-Maria D, Yaari G, and Kleinstein SH. 2015. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31: 3356–3358.
- 39. Agresti A. 2013. Categorical Data Analysis, 3rd Edition. John Wiley & Sons, Inc., Hoboken, New Jersey.
- 40. Benjamini Y, and Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol. 57: 289–300.
- 41. Whitlock MC. 2005. Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach. J. Evol. Biol. 18: 1368–1373.
- 42. O’Shea JP, Chou MF, Quader SA, Ryan JK, Church GM, and Schwartz D. 2013. pLogo: a probabilistic approach to visualizing sequence motifs. Nat. Methods 10: 1211–1212.
- 43. Murphy K, and Weaver C. 2017. Janeway’s Immunobiology, 9th Edition. Garland Science, New York, NY.
- 44. Unniraman S, and Schatz DG. 2007. Strand-Biased Spreading of Mutations During Somatic Hypermutation. Science 317: 1227–1230.
- 45. Rada C, Ehrenstein MR, Neuberger MS, and Milstein C. 1998. Hot Spot Focusing of Somatic Hypermutation in MSH2-Deficient Mice Suggests Two Stages of Mutational Targeting. Immunity 9: 135–141.
- 46. Rada C, Di Noia JM, and Neuberger MS. 2004. Mismatch Recognition and Uracil Excision Provide Complementary Paths to Both Ig Switching and the A/T-Focused Phase of Somatic Mutation. Mol. Cell 16: 163–171.
- 47. Shen HM, Tanaka A, Bozek G, Nicolae D, and Storb U. 2006. Somatic Hypermutation and Class Switch Recombination in Msh6−/− Ung−/− Double-Knockout Mice. J. Immunol. 177: 5386–5392.
- 48. Wiesendanger M, Kneitz B, Edelmann W, and Scharff MD. 2000. Somatic Hypermutation in Muts Homologue (Msh)3-, Msh6-, and Msh3/Msh6-Deficient Mice Reveals a Role for the Msh2-Msh6 Heterodimer in Modulating the Base Substitution Pattern. J. Exp. Med. 191: 579–584.
- 49. IJspeert H, van Schouwenburg PA, Pico-Knijnenburg I, Loeffen J, Brugieres L, Driessen GJ, Blattmann C, Suerink M, Januszkiewicz-Lewandowska D, Azizi AA, Seidel MG, Jacobs H, and van der Burg M. 2019. Repertoire Sequencing of B Cells Elucidates the Role of UNG and Mismatch Repair Proteins in Somatic Hypermutation in Humans. Front. Immunol. 10.
- 50. Odegard VH, and Schatz DG. 2006. Targeting of somatic hypermutation. Nat. Rev. Immunol. 6: 573–583.
- 51. Teng G, and Papavasiliou FN. 2007. Immunoglobulin Somatic Hypermutation. Annu. Rev. Genet. 41: 107–120.
- 52. Hwang JK, Alt FW, and Yeap L-S. 2015. Related Mechanisms of Antibody Somatic Hypermutation and Class Switch Recombination. Microbiol. Spectr. 3.
- 53. Peled JU, Kuang FL, Iglesias-Ussel MD, Roa S, Kalis SL, Goodman MF, and Scharff MD. 2008. The Biochemistry of Somatic Hypermutation. Annu. Rev. Immunol. 26: 481–511.
- 54. Gadala-Maria D, Gidoni M, Marquez S, Vander Heiden JA, Kos JT, Watson CT, O’Connor KC, Yaari G, and Kleinstein SH. 2019. Identification of Subject-Specific Immunoglobulin Alleles From Expressed Repertoire Sequencing Data. Front. Immunol. 10.