Abstract
Synthetic biology and deep learning are synergistically revolutionizing our ability to decode and recode DNA regulatory grammar. B-cell-specific transcriptional regulation is intricate, and unlocking the potential of B-cell-specific promoters as synthetic elements is important for B-cell engineering. Here, we designed and pool-synthesized 23 640 B-cell-specific promoters that span a larger sequence space, exhibit B-cell-specific expression, and enable diverse transcriptional patterns in B cells. Using massively parallel reporter assays (MPRA), we deciphered the sequence features that regulate promoter transcription, including motifs and motif syntax (their combination and distance). Finally, we built and trained a deep learning model capable of predicting the transcriptional strength of immunoglobulin V gene promoters directly from sequence. Prediction for thousands of promoter variants identified in the global human population shows that polymorphisms in promoters influence the transcription of immunoglobulin V genes, which may contribute to individual differences in adaptive humoral immune responses. Our work helps to decipher the transcription mechanism of immunoglobulin genes and offers thousands of non-similar promoters for B-cell engineering.
Graphical Abstract
Introduction
B-cell-specific immunoglobulin gene transcription not only determines antibody production but also influences various processes throughout B-cell development, such as V(D)J recombination (which assembles immunoglobulin receptor genes from preexisting variable (V), diversity (D) and joining (J) gene segments by a cut-and-paste mechanism (1)), allelic exclusion, and differentiation (2–5). Engineered B cells that express antibodies, cytokines, and other genes provide an alternative to traditional vaccination, passive immunization, or other protein replacement therapies owing to their unique longevity and capacity to secrete high protein levels (6). Compared with non-specific promoters (such as CMV and SFFV), using B-cell-specific promoters such as CD19 to drive gene expression in engineered B cells ensures a higher safety profile by avoiding leaky expression in other tissues (7,8). The study of B-cell-specific promoters is challenging because the human genome contains hundreds of copies of immunoglobulin V genes with highly similar sequences, which makes the determination of their transcription by sequencing inaccurate (9,10).
The synergistic integration of synthetic biology and deep learning has revolutionized our ability to decode DNA regulatory grammar and to accurately predict gene expression across multiple species. Synthetic biology, which uses a bottom-up approach to design synthetic regulatory elements, offers a broader array of transcription factor combinations and an expanded sequence space compared to natural sequences (11–13). The massively parallel reporter assay (MPRA) is a powerful technique that enables high-throughput, unbiased quantitative measurement of the activity of thousands of regulatory elements, providing valuable datasets for deciphering the regulatory code and studying its evolution (14–16). Applying deep learning to regulatory sequence space and the corresponding MPRA data has enabled the prediction and generation of constitutive promoters in Escherichia coli and the yeast Saccharomyces cerevisiae with activities comparable to or exceeding those of benchmark promoters (12,17–20). Despite these advances, designing specific regulatory elements, such as tissue-specific promoters with complex regulatory mechanisms, remains challenging (21).
Here, we designed and synthesized a diverse set of B-cell-specific promoters using a strategy that discovers motifs in natural immunoglobulin gene promoters and shuffles them into different background sequences. Through high-throughput MPRA analysis of these promoters, we observed diverse transcription patterns in B cells, which were influenced by specific motifs and motif syntax (their combination and distance). Next, we built and trained a deep learning model to predict promoter transcription strength by learning from the diverse sequence space of the synthetic promoters and their transcription strength measured by MPRA. Using this model, we predicted the immunoglobulin V gene transcription landscape in the global human population and found that polymorphisms in immunoglobulin V gene promoters influence transcriptional strength. These variations may contribute to individual differences in gene usage and adaptive humoral immune responses. Our study enhances our understanding of the transcription mechanism of immunoglobulin genes and provides a diverse collection of non-similar promoters for B-cell engineering.
Materials and methods
Synthetic promoter design
(i) Motifs were detected in the 250 bp and 1000 bp upstream regions of functional V genes in human and mouse immunoglobulin gene loci (IGH, IGK, IGL, Igh, Igk, Igl) using MEME and BaMMmotif2, with the motif search length set to 10–15 bp and both the positive and negative strands searched. The identified motifs were subsequently mapped to the JASPAR and HOCOMOCO databases using the TOMTOM tool to determine transcription factor binding sites. Candidate motifs were classified as Core and Accessory motifs based on their frequency. Core motifs include Oct, Pd, and the Core promoter region (CPR) (Supplementary Table 1). Accessory motifs include Runx1, E-BOX, CArG-box, PU.1, Arid3a, EBF1, motifH02, 03 and 04, motifM02, 03 and 04, and the CCCT element (Supplementary Table 2). (ii) To create background sequences, BiasAway was used to shuffle 20 natural promoter sequences based on k-mers (k = 1, 2 or 3). (iii) Selected candidate motifs were embedded into the generated background sequences using simdna (https://github.com/kundajelab/simdna). One copy of each Core motif was placed at a fixed position within the background sequences, and Accessory motifs were randomly embedded into the background sequences using conserved sequences (scores > 8) from the JASPAR position probability matrix (PPM). For motifs such as Runx1, E-BOX, CArG-box, PU.1, Arid3a, and EBF1, the insertion orientation was forward, in agreement with both our motif discovery results and previous reports. The newly identified motifs, namely motifH02, 03, 04, and motifM02, 03, 04, were also inserted in the forward direction, consistent with our motif discovery findings. (iv) 12-bp barcodes were generated using VFOS (22), considering factors such as GC content (GCC), single-base repeats (HP), dinucleotide repeats (SR), edit distance (HD), and the formation of cross-dimers (CP). 
The barcode set was further filtered to exclude termination codons (TAG, TGA, TAA), trinucleotide repeats (NNN), and BsaI and BbsI restriction enzyme sites. The ATG start codon and a 12-bp random barcode were joined to the synthetic promoter sequence, forming a 248-bp synthetic promoter sequence. Additionally, each end included a 26-bp homology arm sequence for vector integration, resulting in a library with a length of 300 bp per sequence. A complete list of the synthetic promoter sequences can be found in Supplementary Tables 3 and 4.
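The barcode filtering step can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (barcodes were generated with VFOS), and several choices are assumptions made for illustration: stop codons are rejected at any position rather than only in frame, "trinucleotide repeats (NNN)" is read as three identical consecutive bases, and the BsaI (GGTCTC) and BbsI (GAAGAC) recognition sites are checked on both strands. The function name `barcode_passes` is hypothetical.

```python
import re

# Termination codons named in the text.
STOP_CODONS = ("TAG", "TGA", "TAA")
# BsaI (GGTCTC) and BbsI (GAAGAC) recognition sites plus their reverse complements.
RESTRICTION_SITES = ("GGTCTC", "GAGACC", "GAAGAC", "GTCTTC")
# Assumption: "NNN" means a run of three identical bases.
HOMOTRIMER = re.compile(r"([ACGT])\1\1")


def barcode_passes(barcode: str, length: int = 12) -> bool:
    """Return True if a barcode survives all three exclusion filters."""
    seq = barcode.upper()
    if len(seq) != length or set(seq) - set("ACGT"):
        return False
    # Assumption: stop codons are excluded in every reading frame.
    if any(codon in seq for codon in STOP_CODONS):
        return False
    if HOMOTRIMER.search(seq):
        return False
    return not any(site in seq for site in RESTRICTION_SITES)


candidates = ["ACGTACGTACGT", "ACGTAGCTACGT", "ACGAAGACGTCA"]
kept = [bc for bc in candidates if barcode_passes(bc)]
```

The sketch inspects the barcode alone; sites created at the junction between a barcode and its flanking sequence would need a check on the assembled 300-bp oligonucleotide.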
Library construction
Library construction for this study was based on the MPRA (15). A pool of 300-nt oligonucleotides was synthesized (Twist), which included 23 640 synthetic promoter sequences and 245 natural sequences. The oligonucleotides were PCR-amplified for 20 cycles using Q5 High-Fidelity DNA polymerase (New England Biolabs, #M0543S) and primers PCR_F and PCR_R (Supplementary Table 5). A p-Rep vector (Supplementary Table 6) was constructed based on PLL3.7 (Addgene, #11795). In this modified vector, the CMV promoter was removed, and the Eμ enhancer was inserted downstream of EGFP. Additionally, the EF1A promoter was used to drive the expression of Puro-2A-mCherry. The p-Rep vector was digested with AscI-HF (NEB, #R0558L). The PCR and digestion products were purified and recovered (Qiagen, #28506). Subsequently, the NEBuilder HiFi DNA Assembly Master Mix (NEB, E2621) was used for assembly. The resulting plasmid was transformed into Stbl3 electrocompetent cells. The cells were incubated at 30°C for 20 h, and the pLib plasmid was obtained by maxiprep (Qiagen, #12362).
Cell culture and transfection
Ramos (RA 1) (ATCC, CRL-1596) and Jurkat, Clone E6-1 (ATCC, TIB-152) cells were maintained at 37°C under 5% CO2 in RPMI-1640 medium (Gibco, 11875093) supplemented with 10% FBS (Gibco, 10091148). The Neon transfection system and 100 μl kit (ThermoFisher, MPK10096) were used according to the manufacturer's instructions. For Ramos cells, electroporation was performed with one 1350 V pulse for 30 ms, with a total of three replicates, each consisting of ∼8 million cells transfected with 10 μg of library plasmid. For Jurkat cells, electroporation was performed with three 1350 V pulses, each lasting 10 ms. Following transfection, each replicate was grown in RPMI + 10% FBS without Pen-Strep for 24 h.
293T (ATCC, CRL-3216), HeLa (ATCC, CCL-2), NIH/3T3 (ATCC, CRL-1658), and Hepa1-6 (ATCC, CRL-1830) cells were maintained at 37°C under 5% CO2 in DMEM medium (Gibco, 11965092) supplemented with 10% FBS (Gibco, 10091148). For adherent cells, plasmid transfection was carried out according to the Lipo3000 (Invitrogen, L3000015) protocol. Lentivirus was produced by co-transfecting the promoter plasmid (based on PLL3.7 (Addgene, #11795)) and helper plasmids (pCMV-VSV-G (#8454) and psPAX2 (#12260)) into 293T cells in a 6-well plate using Lipo3000 (Invitrogen, L3000015). At 48 h after transfection, cell culture supernatants were harvested and filtered through a 0.22 μm PES filter unit to remove cell debris. One milliliter of supernatant was then added to each well of Ramos cells in a 6-well plate, and cells were spinoculated at 2000 rpm at 30°C for 1 h (Thermo ST16R, Rotor M10) in the presence of 5 μg/ml polybrene (Sigma, #TR-1003). The medium was replaced with polybrene-free medium the next day. Two days after lentivirus infection, the cells were selected with 200 ng/ml puromycin (Gibco, A1113803). GFP and mCherry fluorescence intensities were measured every two weeks using flow cytometry.
Flow cytometry
Twenty-four hours after transfection, cells were washed twice with Flow Cytometry Staining Buffer (eBioscience). Flow cytometry was performed using a BD Aria III. The data were analyzed with FlowJo (Treestar).
MPRA (massively parallel reporter assay)
Twenty-four hours post-transfection, cells were pelleted and washed with DPBS. Total RNA was extracted using Trizol (Invitrogen, 12183555). From 75 μg of total RNA per sample, mRNAs were isolated using the Dynabeads mRNA Purification Kit (Invitrogen, 61006). Residual DNA was removed from 1 μg of mRNA using ezDNase (Invitrogen, 11766051). All mRNAs were subjected to a reverse-transcription reaction in 8-strip PCR tubes, using custom primers (R-amp1-UMI12-rd2, FS-TSO) and the SuperScript IV First-Strand Synthesis Kit (Invitrogen, 18091050). The resulting cDNA was purified using the QIAquick PCR Purification Kit (Qiagen, 28506). The purified cDNA was split into 8 PCR tubes and amplified for three cycles using NEBNext Ultra II Q5 Master Mix (NEB, M0544S) with the F-TSO (ISPCR)-rd1 and P7-i7 (index)-rd2 primers. After the PCR reaction, all eight tubes for each sample were combined in a DNA LoBind tube. PCR products were purified using AMPure XP (Beckman Coulter, A63881) and then further amplified for seven PCR cycles using P5-i5-rd1 and P7 primers to add P5, P7, and unique 8 bp Illumina index sequences. Similarly, plasmid DNA was amplified in 2 × 50 μl reactions using NGS-F and NGS-R primers and then further amplified for seven PCR cycles using NGS-P5 and NGS-P7 primers to add the P5 and P7 sequences. Following SPRI (Beckman Coulter, B23319) cleanup, amplicons were sequenced using a NovaSeq 6000 SP flow cell with a 15% PhiX spike-in. A comprehensive list of all primers used in this paper can be found in Supplementary Table 7.
PCR duplicates were removed using starcode-umi (23) based on the UMI, and the abundance of the 12 bp barcode for each synthetic promoter was quantified using MAGeCK (24). Barcodes were filtered to ensure a minimum of 3 counts in three replicate experiments, resulting in 5546 synthetic promoters available for subsequent analysis and modeling. Promoter strength was determined by calculating the RNA/DNA ratio for each barcode and then normalizing it to that of the IGKV1-5 promoter. A complete list of the synthetic promoter strengths can be found in Supplementary Table 8.
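The strength calculation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the analysis code of this study: the function name `relative_strengths` is hypothetical, counts are assumed to be already deduplicated and depth-normalized, the count filter is assumed to apply to every RNA replicate and to the plasmid DNA count, and replicates are combined by a simple mean.

```python
from statistics import mean


def relative_strengths(rna, dna, reference="IGKV1-5", min_count=3):
    """Compute RNA/DNA ratios per promoter, scaled to a reference promoter.

    rna: dict mapping promoter name -> list of per-replicate RNA barcode counts
    dna: dict mapping promoter name -> plasmid DNA barcode count
    """
    ratios = {}
    for name, replicate_counts in rna.items():
        dna_count = dna.get(name, 0)
        # Assumption: drop barcodes with fewer than min_count reads
        # in any RNA replicate or in the plasmid library.
        if dna_count < min_count or min(replicate_counts) < min_count:
            continue
        ratios[name] = mean(replicate_counts) / dna_count
    if reference not in ratios:
        raise ValueError(f"reference promoter {reference!r} did not pass filtering")
    reference_ratio = ratios[reference]
    return {name: ratio / reference_ratio for name, ratio in ratios.items()}


# Toy counts, invented purely to show the calculation.
rna_counts = {"IGKV1-5": [10, 10, 10], "synA": [40, 40, 40], "synB": [1, 50, 50]}
dna_counts = {"IGKV1-5": 10, "synA": 10, "synB": 10}
strengths = relative_strengths(rna_counts, dna_counts)
```

In this toy input, synB is removed by the count filter, and synA comes out four times as strong as the IGKV1-5 reference.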
Motif syntax-based model
This model comprises two groups of features. Basic features (97 total) include the normalized GC content of the promoter sequence, the core motif type (20 CPR, 63 Pd and 1 Oct core motifs, with presence encoded as 0/1), and 12 accessory motifs (normalized motif scores). Motif syntax-related features (744 total) are divided into three subcategories: (i) motif combination features (483 total), for which we considered combinations of 2–8 accessory motifs, with the feature value being the product of the individual normalized motif scores; (ii) distances between core motifs (3 total), for which, given the relatively fixed positions of the core motifs, we used the normalized distances between the three core motifs; and (iii) distances between accessory and core motifs and between pairs of accessory motifs (258 total), which are categorized into three intervals: close (<25 bp), intermediate (≥25 bp and ≤50 bp), and distal (>50 bp).
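Two of the syntax features lend themselves to a short Python sketch: the three-interval distance category and the combination feature computed as a product of normalized motif scores. The function names, the feature naming scheme, and the rule that a motif counts as present when its score is above zero are assumptions made for illustration, not the implementation used here.

```python
from itertools import combinations
from math import prod


def distance_bin(distance_bp):
    """Map a motif-to-motif distance onto the three intervals from the text."""
    if distance_bp < 25:
        return "close"
    if distance_bp <= 50:
        return "intermediate"
    return "distal"


def combination_features(motif_scores, min_size=2, max_size=8):
    """Product of normalized scores for each combination of present motifs.

    motif_scores: dict mapping accessory motif name -> normalized score.
    Assumption: a motif is treated as present when its score is > 0.
    """
    present = sorted(name for name, score in motif_scores.items() if score > 0)
    features = {}
    for size in range(min_size, min(max_size, len(present)) + 1):
        for combo in combinations(present, size):
            features["-".join(combo)] = prod(motif_scores[m] for m in combo)
    return features


example = combination_features({"Arid3a": 0.5, "Runx1": 0.4, "PU.1": 0.0})
```

For this example, only the Arid3a-Runx1 pair yields a feature because PU.1 is absent.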
CNN model architecture and training
To predict synthetic promoter expression levels, a deep neural network architecture was employed, consisting of a variable number of CNN layers (ranging from 3 to 7) and 2 dense (FC) layers. The dataset was divided into training, validation, and test sets at a ratio of 0.63, 0.27 and 0.10, respectively. One-hot-encoded 249-bp DNA sequences (A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], T = [0,0,0,1]) were used as input to the first CNN layer, with subsequent layers built upon the previous ones. Batch normalization and dropout were applied after all layers, and max-pooling was applied after the CNN layers. The Adam optimizer with a mean squared error (MSE) loss function, the ReLU activation function, and uniform weight initialization were employed. A total of 13 hyperparameters were optimized using the Bayesian Optimization approach with Keras Tuner for 20 iterations. These included batch size (128–256), kernel sizes (3–7), learning rate (1e-3, 1e-4, 1e-5), filter numbers (32–256), convolutional layers (3–7), additional layers (1–3), dropout probability (0.1–0.5), and dense layer neuron numbers (32–256).
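The input encoding can be illustrated with a dependency-free Python sketch that uses the base-to-vector mapping given above. The function name `one_hot_encode` is hypothetical, and two behaviours are assumptions because they are not specified in the text: sequences shorter than the input length are right-padded with all-zero rows, and bases other than A, C, G and T are encoded as all zeros.

```python
# Base-to-vector mapping as stated in the text.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence, length=249):
    """Encode a DNA sequence as a length x 4 nested list of 0/1 values."""
    seq = sequence.upper()
    if len(seq) > length:
        raise ValueError(f"sequence is longer than {length} bp")
    # Assumption: ambiguous bases (e.g. N) become all-zero rows.
    rows = [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in seq]
    # Assumption: shorter sequences are right-padded with zero rows.
    rows.extend([0, 0, 0, 0] for _ in range(length - len(seq)))
    return rows


encoded = one_hot_encode("ACGT", length=6)
```

The nested list can be converted to an array with shape (length, 4) before being passed to a convolutional layer.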
The best models were chosen based on the minimal MSE on the validation set, with the least spread between training and validation sets. The Spearman correlation coefficient served as an additional evaluation metric during the training process. The TensorFlow and Keras software packages, accessed through the Python interface, were used for training deep models and data collection. The model was optimized with multiple trials and executions per trial, resulting in the best model and corresponding hyperparameters.
Immunoglobulin V gene promoter variants calling
We developed a pipeline for IG promoter variant discovery from the Genome Aggregation Database (gnomAD v3.1.2). The code can be found on Zenodo (https://doi.org/10.5281/zenodo.8008545). In summary, the code retrieves variants in the 1-kb upstream region of the IG genes, located on chromosomes 14 (IGH), 2 (IGK) and 22 (IGL), from the VCF files. The variant sequences are then created based on the reference genome sequence (GRCh38). A complete list of the variants can be found in Supplementary Tables 9 and 10.
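The step of creating a variant sequence from the reference can be sketched as follows. This is a simplified illustration and not the deposited pipeline: the function name `apply_variant` and its arguments are hypothetical, it handles a single VCF-style record (1-based position, REF and ALT alleles) on the forward strand, and it ignores gene orientation and multi-variant haplotypes.

```python
def apply_variant(reference, region_start, pos, ref_allele, alt_allele):
    """Return the region sequence carrying one variant.

    reference: forward-strand reference sequence of the promoter region
    region_start: 1-based genomic coordinate of reference[0]
    pos: 1-based genomic coordinate of the variant (VCF POS)
    """
    offset = pos - region_start
    end = offset + len(ref_allele)
    if offset < 0 or end > len(reference):
        raise ValueError("variant lies outside the promoter region")
    # Guard against coordinate or genome-build mismatches.
    if reference[offset:end].upper() != ref_allele.upper():
        raise ValueError("REF allele does not match the reference sequence")
    return reference[:offset] + alt_allele + reference[end:]


# Toy region starting at coordinate 100; the sequences are invented.
snv = apply_variant("ACGTACGT", 100, 102, "G", "T")
deletion = apply_variant("ACGTACGT", 100, 101, "CG", "C")
```

Checking the REF allele against the reference before substitution catches off-by-one errors in the region coordinates.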
Data analysis
Python v3.11.3 (www.python.org) and R v4.3.0 (www.r-project.org) were used for computations. Sequence distances were calculated as the Levenshtein edit distance (python-Levenshtein 0.21.0). Linear model (lm) and ridge regression models were built using scikit-learn (v1.2) with 10-fold cross-validation. For data analysis, NumPy (v1.24.2), pandas (v2.0.1) and scikit-learn (v1.2) were used. The Pearson correlation coefficient (PCC) between two datasets was calculated using the stats.pearsonr(x, y) function from the SciPy (v1.10.1) library. UMAP was performed using the available software (v0.5.3, github.com/lmcinnes/umap). WebLogo analysis was also conducted using available software (weblogo.berkeley.edu/logo.cgi). Biopython (v1.81) was used for calculating phylogenetic trees, and pycircos (v1.0.2) was employed for creating circos plots. All visualizations were prepared using seaborn (v0.12.2), matplotlib (v3.7.0), and plotly (v5.14.1).
Results
Design of synthetic B-cell-specific promoters
Human V gene promoters, including truncated (250 bp) promoters, can drive V gene expression in mouse B cells and produce functional antibodies (25–28), suggesting that regulatory elements are conserved between the two species and are likely to lie within the 250 bp region. To explore the motifs within V gene promoter regions, we conducted de novo motif discovery, concentrating on the 1000 bp upstream regions of functional V genes in human and mouse immunoglobulin gene loci (IGH, IGK, IGL, Igh, Igk, Igl). Motifs were detected using expectation-maximization (MEME (29)) and Gibbs sampling (BaMMmotif2 (30)) techniques, respectively. The identified motifs were subsequently mapped to JASPAR (31) and HOCOMOCO (32) using TOMTOM (33) to determine transcription factor binding sites. We identified Oct (Octamer) (34), Pd (Pentadecamer) (9), E-box (35), CArG-box (36), PU.1 (37), Arid3a (38), CCCT (9), and EBF (36) motifs in the 0–250 bp region upstream of the ATG, all of which have been previously reported (39). Furthermore, we discovered a motif enriched in the Kappa V promoter region. By comparing this motif to the databases, we found that it matches the binding site of Runx1, a transcription factor closely associated with B cell development (40). Intriguingly, within the promoter regions of human and mouse V genes, we identified three regions containing motifs that are not included in known transcription factor binding site databases; we designated these motifs from human as motifH02, 03, and 04, and from mouse as motifM02, 03, and 04, respectively (Figure 1A). The 16 candidate motifs were classified as Core and Accessory motifs based on their frequency. Core motifs were defined as motifs that are conserved in all natural promoter sequences, including Oct, Pd, and the Core promoter region (CPR). 
Accessory motifs were defined as conserved sequences present in only some of the promoter sequences, including Runx1, E-BOX, CArG-box, PU.1, Arid3a, EBF1, motifH02, motifH03, motifH04, motifM02, motifM03, motifM04, and the CCCT element.
Figure 1.
Design of synthetic B-cell-specific promoters. (A) Motif discovery in the 250 bp and 1000 bp upstream regions of functional V genes in human and mouse IGH, IGK and IGL loci. (B) Pipeline for designing synthetic promoters: identification of candidate motifs; generation of nucleotide composition-matched background sequences; and embedding of selected motifs into the generated background sequences with varying numbers and positions. (C) UMAP-based spectral clustering of accessory motif combinations: dimensionality reduction on the accessory motif score matrix for synthetic promoters (blue) and theoretical motif combinations (red). (D) Levenshtein edit distance matrix and distribution between every pair of nonrepetitive promoters (n = 23 640). (E) UMAP-based spectral clustering of promoter sequences: dimensionality reduction on the sequence identity distance matrix for synthetic promoters (blue, n = 23 640) and natural promoters (yellow–green, n = 245).
To investigate the influence of background sequences on promoter strength, we designed and recombined three sets of background sequences that maintain the mononucleotide (GC content), dinucleotide, or trinucleotide composition of the natural promoter sequences, by employing BiasAway (41) to shuffle natural promoter sequences based on k-mers (k = 1, 2 or 3), respectively. We generated a set of 23 460 background sequences based on 20 natural promoter sequences. To examine the impact of motifs and their arrangements on promoter strength, we embedded selected candidate motifs into the generated background sequences. The Core motifs in the V region are position-conserved, exhibiting relatively fixed positions in natural sequences (9). Counting from the 5′ end, the OCT motif was inserted at position 126, 129 or 133, while the Pd element was placed upstream of OCT at a relative distance of 100, 70, 40 or 30 bp, consistent with their natural positions. In contrast, the positions of accessory motifs in natural promoters are not notably conserved. We therefore embedded them randomly into the background sequences, using conserved sequences (with scores > 8) from the JASPAR position probability matrix (PPM) (Supplementary Figure 1).
Following the design steps outlined above, we generated a set of 23 640 synthetic promoter sequences for synthesis, each 248 base pairs (bp) in length (Figure 1B). The library contains 16 candidate motifs, including 3 core motifs and 13 accessory motifs, with a total of 496 distinct combinations of accessory motifs (Figure 1C, Supplementary Figure 2). To investigate the similarity between sequences, we calculated the Levenshtein edit distance between them, constructed a sequence identity distance matrix, and employed UMAP-based spectral clustering (42) to perform dimensionality reduction on this matrix. At least 70 mutations separate any two promoters (Figure 1D, Supplementary Figure 3), and the nucleotide composition of these synthetic promoters exhibited substantial variability and dissimilarity compared to natural sequences (Figure 1E, Supplementary Figure 4).
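The pairwise comparison above rests on the Levenshtein edit distance, that is, the minimum number of single-base substitutions, insertions and deletions needed to turn one sequence into another. The study used the python-Levenshtein package; the dependency-free sketch below shows the same metric with the standard dynamic-programming recurrence, together with a hypothetical helper, `min_pairwise_distance`, that returns the smallest distance within a set of sequences.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (substitution, insertion, deletion)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def min_pairwise_distance(sequences):
    """Smallest edit distance over all unordered pairs of sequences."""
    return min(levenshtein(x, y) for x, y in combinations(sequences, 2))


# Toy sequences, far shorter than the 248-bp promoters.
closest = min_pairwise_distance(["AAAA", "AAAT", "TTTT"])
```

This pure-Python version costs time proportional to the product of the two sequence lengths per pair, so for a library of 23 640 promoters a compiled implementation such as python-Levenshtein is the practical choice.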
Synthetic B-cell-specific promoters enable diverse transcriptional patterns
We synthesized the 23 640 synthetic promoter sequences designed in the previous section, along with 245 natural sequences, using oligo pool synthesis technology. These sequences were cloned into a reporter system to quantify their transcription strength in B cells by fluorescence-activated cell sorting (FACS) or high-throughput sequencing (Figure 2A). We applied massively parallel reporter assays (MPRA) (15) to systematically evaluate the strength of thousands of synthetic promoters simultaneously in a high-throughput manner. The synthetic promoters were designed to drive the expression of a barcoded green fluorescent protein (GFP) reporter gene. Each synthetic promoter sequence is directly linked to a specific DNA barcode sequence. Upon transfection of the library into B cells, each synthetic promoter drives transcription of its corresponding DNA barcode sequence into RNA. The RNA levels are subsequently measured through RNA sequencing. Because the abundance of individual synthetic promoters varies within the library, the initial quantity of each synthetic promoter must be taken into account. Dividing the RNA level of each barcode by its corresponding level in the plasmid library (DNA sequencing) provides the normalized strength of each synthetic promoter. This one-to-one correspondence between synthetic promoters and barcode sequences enables accurate quantification of the strength of each individual synthetic promoter. Moreover, mCherry expression is driven by the constitutive promoter PEF1a, which serves to validate transfection efficiency and standardize GFP expression. The EGFP/mCherry mean fluorescence intensity (MFI) ratio was employed to measure promoters’ transcription strength. The reporter system also includes the Eμ enhancer, which regulates immunoglobulin heavy chain (IGH) gene expression in its native gene locus.
Figure 2.
Synthetic B-cell-specific promoters enable diverse transcriptional patterns. (A) Schematic of the MPRA for analyzing synthetic promoters’ transcription strength: (1) synthesis of promoter sequences using the oligo-pool method; (2) cloning of promoters into a reporter construct to drive the expression of a barcoded GFP gene; (3) transfection of the library into cells; (4) extraction and sequencing of mRNA; (5) comparison of mRNA barcode counts with plasmid DNA counts to determine the promoters’ activities. (B) Flow cytometry analysis of GFP expression in successfully transfected (mCherry+) Ramos cells with control plasmid (left) and synthetic promoter library (right). The promoter information can be found in Supplementary Table 11. (C) Relative promoter strength (EGFP/mCherry) of 13 synthetic promoters and PCD19 in B cells, HeLa, NIH-3T3 and Hepa1-6 cells. (D) Correlation of log2 promoters’ strength (RNA/DNA) between technical replicates (Rep1 versus Rep2: PCC = 0.87, r2 = 0.75), normalized to the IGKV1-5 promoter's transcription rate. (E) Correlation of promoter strength for 32 promoters measured via MPRA-seq and flow cytometry (MPRA versus FACS: PCC = 0.85); the red dashed line represents the strength of PCD19 measured by FACS. The promoter information can be found in Supplementary Table 12. (F) Promoters’ strength (RNA/DNA) in natural human (green), mouse (yellow), and synthetic promoters (blue). (G) Distribution and counts of synthetic promoters within each transcription strength interval; red lines mark the IGKV1-5 promoter's transcription rate. (H) Levenshtein edit distance of synthetic promoter pairs in each transcription strength interval. (I) Top 180 strongest synthetic promoters. (J) Levenshtein edit distance matrix and distribution among pairs of the top 180 synthetic promoters.
We transfected the library into Ramos (RA 1: human B lymphocyte cell line) and Jurkat (Clone E6-1: human T lymphocyte cell line) cells. As controls, we individually transfected four natural promoters (Ighv2-9, Ighv5-15, Ighv14-4, Ig5-48). At 24 h post-transfection, flow cytometry analysis of transfected (mCherry+) cells revealed GFP+ cells in Ramos cells transfected with either the synthetic promoter library or the natural promoters. In contrast, no GFP+ cells were detected in Jurkat cells (Figure 2B, Supplementary Figure 5). To further validate the cell specificity of our designed synthetic B-cell promoters, we randomly selected 16 synthetic promoters along with PCD19 (the mouse CD19 promoter, a commonly used B-cell-specific promoter). We then assessed their promoter strength (EGFP/mCherry) in B cells (Ramos) and three non-B cell lines: HeLa, NIH-3T3, and Hepa1-6. All 16 promoters exhibited activity in B cells. In non-B cells, the activity (EGFP/mCherry) of the 16 synthetic promoters and PCD19 showed no discernible difference from the control (a vector without a promoter sequence) (Figure 2C). This suggests that the synthetic promoters had almost no activity in these non-B cells.
Further, we determined the promoters’ transcription strength by taking the ratio between their RNA-seq and DNA-seq barcode read counts. The transcriptional strength of the promoters was highly correlated across the three replicates of the experiment (Rep1 versus Rep2: Pearson correlation coefficient (PCC) = 0.87, r2 = 0.75) (Figure 2D, Supplementary Figures 6 and 7). To validate the accuracy of our MPRA data, we independently constructed plasmids for 32 promoters (including 27 synthetic promoters and 5 natural promoters) from our library, as well as PCD19. The promoter strength of these constructs was individually assessed through the EGFP/mCherry mean fluorescence intensity (MFI) ratio using flow cytometry. The flow cytometry results correlated highly with the MPRA data (PCC = 0.85) (Figure 2E). In our MPRA data, the 5546 synthetic promoters exhibit a vast range of transcription strength, spanning an approximately 10 000-fold difference. Among these, 3976 (71.69%) promoters are stronger than the IGKV1-5 promoter, with the strongest having approximately 1000-fold higher transcription strength than the IGKV1-5 promoter. The remaining 1570 (28.31%) promoters are weaker than the IGKV1-5 promoter, with the weakest having approximately 16-fold lower transcription strength (Figure 2F). Additionally, there are at least 200 promoters within each 2-fold increment in transcription strength relative to the IGKV1-5 promoter, ranging from 1/16 to 16 times its transcription strength (Figure 2G). The 5546 synthetic promoters are also remarkably distant in sequence space, with the majority of promoter pairs separated by mutations spanning more than half (125 bp) of the promoter length (Figure 2H). Among the 5546 synthetic promoters tested, 180 were stronger than the strongest natural promoter, Igkv5-43 (Figure 2I). These promoters are also exceptionally far apart in sequence space (Figure 2J). 
Overall, we obtained thousands of distinct B-cell-specific promoters that are more distant in sequence space and span a broader expression range than natural V gene promoters.
To assess the long-term stability of the synthetic promoters, we established 25 distinct B cell lines (Ramos) by lentiviral transduction, encompassing 20 synthetic promoters and 5 natural promoters. We monitored their expression (EGFP/mCherry) over one month via flow cytometry. Both synthetic and natural promoters sustained their activity throughout the month, with no discernible decrease in expression intensity (Supplementary Figure 8).
Core motif and motif syntax shaping B-cell-specific promoter transcriptional pattern
We then investigated how the composition and presence of motif elements influence promoter strength. Using UMAP spectral clustering (42), we identified multiple clusters of promoters that share similar core promoter features and transcription strength (Figure 3A, Supplementary Figure 9). Among the synthetic promoters, we observed variations in strength among different Core Promoter Regions (CPRs). By classifying CPRs based on their species of origin, we found that synthetic promoters with human-derived CPRs generally exhibited lower strength than synthetic promoters with mouse-derived CPRs (Figure 3B, Supplementary Figure 10). Next, we examined motif enrichment in the top 50 strongest and weakest promoters. The core promoter elements exhibit significant clustering tendencies. The top three CPRs (Igk19-93, Igk6-15, Igk6-23) accounted for over 50% of the total frequency in the top 50 (Figure 3C). To compare the importance of individual CPRs, we constructed a linear model of synthetic promoter strength on the subset of the dataset that contains only core motifs (n = 213). The importance of each CPR estimated by the linear model was consistent with the results obtained from the classification analysis (Supplementary Figure 11A, B). We next tested how the GC content, TATA-box, and Inr motifs of the CPR affect promoter strength. For CPRs of similar length (∼70 bp), there is a negative correlation between the importance of the core promoter and its GC content (Supplementary Figure 11C). Core promoters containing a TATA-box with the consensus sequence have higher importance than those whose TATA-box carries a mutation (deletion, insertion, or replacement with a G/C) (Supplementary Figure 11D).
Figure 3.
Core motif and motif syntax shaping B-cell-specific promoter transcriptional pattern. (A) UMAP-based spectral clustering of motif combinations and their transcription rates. (B) Promoters’ strength with varying Core promoter Region (accessory motifs = 0). (C) Frequency of Core promoter Region type in the top 50 (red), bottom 50 (blue), and all (gray) synthetic promoters. (D) Promoters’ strength variability with increasing GC count of the promoter sequence. (E) Schematic of motif syntax rules. (F) Scatter plot of synthetic promoters’ activity predicted by the motif syntax-based ridge model versus observed activity (PCC = 0.58, using 10-fold cross-validation). (G) Feature importance in the motif syntax-based ridge model. (H) Synthetic promoter strength when the distance between the OCT motif and the core promoter region (CPR) is greater than or less than 45 bp.
Transcriptional differences were observed not only between synthetic promoters with different CPRs but also among those sharing the same CPR. For example, within the subset sharing the IGKV3-15 CPR, the strongest promoter was 30-fold stronger than the weakest. This may be due to differences in the background sequences. We speculate that these differences in synthetic promoter strength may be attributed to sequence features and motif syntax. The GC content of the promoter sequence is one factor that influences promoter strength (43,44). In the set of 5546 synthetic promoters measured by MPRA, overall promoter strength increased as the GC content of the promoter sequence decreased, suggesting a negative correlation between promoter GC content and strength (Figure 3D). To further investigate the impact of motif syntax on promoter strength, we built a syntax-based model to assess how motifs and their cooperative interactions affect promoter strength. The model considered not only 97 simple features (such as GC content of the promoter sequence, motif type and motif score) but also 744 motif syntax-related features (including motif position, motif-motif combinations and distances) (Figure 3E). The strength predicted by the syntax-based model correlated with the MPRA-measured synthetic promoter strength (PCC = 0.58) (Figure 3F, Supplementary Figure 12). Notably, the distance between the CPR and Oct motifs affected promoter strength (imp = 0.85) more than CPR-Pd (imp = 0.31) and Oct-Pd (imp = 0.02) did (Figure 3G). To validate the relationship between CPR-Oct distance and synthetic promoter strength, we divided the promoters into two groups using a distance cutoff of 45 bp. Promoter activity was lower when the distance between the CPR and Oct motifs exceeded 45 bp (Figure 3H). 
Interestingly, among accessory motif combinations, the three features with the highest importance for synthetic promoter strength were CArGbox-Runx1-Arid3a-motifM03 (imp = 0.72), CArGbox-Runx1-Arid3a-motifM02-motifM03 (imp = 0.68) and Arid3a-motifM02-motifM04 (imp = 0.65), suggesting an enrichment of combinations involving Arid3a, motifM02, motifM03 and motifM04 (Supplementary Figure 13). These results suggest that core motifs set a baseline for B-cell-specific promoter transcription strength, and that the synergistic interaction of other factors (GC content, motif syntax) further diversifies transcription strength.
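The kinds of motif syntax features mentioned above (motif position, motif-motif combinations and distances) and the 45-bp grouping can be sketched as follows. This is a simplified illustration, not the feature set used in the study: it assumes each promoter is described by a list of (motif name, start position) hits, measures distance start-to-start and uses only the first occurrence of each motif. The function names and all numbers are hypothetical.

```python
from itertools import combinations


def syntax_features(hits):
    """Build simple syntax features from (motif_name, start) hits of one promoter."""
    features = {}
    first_pos = {}
    for name, start in sorted(hits, key=lambda hit: hit[1]):
        features[f"has_{name}"] = 1          # motif presence
        first_pos.setdefault(name, start)    # keep the most upstream occurrence
    for name, pos in first_pos.items():
        features[f"pos_{name}"] = pos        # motif position
    for a, b in combinations(sorted(first_pos), 2):
        features[f"pair_{a}-{b}"] = 1        # motif-motif combination
        features[f"dist_{a}-{b}"] = abs(first_pos[a] - first_pos[b])
    return features


def split_by_distance(promoters, motif_a, motif_b, cutoff=45):
    """Split promoter strengths by whether the motif distance exceeds the cutoff."""
    key = "dist_" + "-".join(sorted([motif_a, motif_b]))
    near, far = [], []
    for hits, strength in promoters:
        distance = syntax_features(hits).get(key)
        if distance is None:
            continue  # promoter lacks one of the two motifs
        (far if distance > cutoff else near).append(strength)
    return near, far


# Hypothetical promoters: (motif hits, measured strength), illustration only.
promoters = [
    ([("Oct", 100), ("CPR", 170)], 1.2),  # Oct-CPR distance 70 bp
    ([("Oct", 150), ("CPR", 170)], 2.5),  # Oct-CPR distance 20 bp
    ([("CPR", 170)], 0.8),                # no Oct motif, skipped
]
near, far = split_by_distance(promoters, "Oct", "CPR")
```

The resulting feature dictionaries could be fed to any regularized linear model; the two strength lists correspond to the two groups compared at the 45-bp cutoff.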
CNN model predicts promoter transcription strength at base resolution
To predict promoter strength at base resolution, we built a convolutional neural network (CNN) model using the MPRA-measured strength of synthetic promoters and their DNA sequences. The synthetic promoter DNA sequence is first one-hot encoded. It then passes through the first convolutional layer, which identifies local sequence features, including transcription factor (TF) motifs such as Oct. Subsequent convolutional layers identify more intricate patterns within the sequence, including motif syntax such as Oct-CPR. Finally, the motif and motif syntax features extracted by the convolutional layers are consolidated and weighted by fully connected layers to predict the strength of the synthetic promoter (Figure 4A). We trained the CNN model on 90% of the synthetic promoters (n = 5546 in total) and used the remaining 10% to test the model. In the test dataset, the promoter strength predicted by the CNN model correlated well with the promoter strength measured by MPRA (PCC = 0.63) (Figure 4B).
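The input preparation described above (one-hot encoding followed by a 90%/10% split) can be sketched in plain Python. This is an illustrative assumption about the preprocessing, not the authors' code: the 248-bp input length comes from the Figure 4 legend, whereas right-padding with zeros, right-truncation, the all-zero encoding of ambiguous bases, the random seed and the function names are hypothetical choices.

```python
import random

BASES = "ACGT"


def one_hot(seq, length=248):
    """Encode a DNA sequence as a length x 4 matrix (rows follow A, C, G, T).

    Ambiguous bases (e.g. N) and right-padding are encoded as all-zero rows.
    """
    seq = seq.upper()[:length]  # truncate on the right if too long (assumption)
    matrix = []
    for base in seq:
        row = [0, 0, 0, 0]
        if base in BASES:
            row[BASES.index(base)] = 1
        matrix.append(row)
    matrix.extend([0, 0, 0, 0] for _ in range(length - len(seq)))
    return matrix


def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle items reproducibly and hold out a fraction for testing."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]


encoded = one_hot("ACGTN", length=8)
train_ids, test_ids = train_test_split(range(5546))
```

With 5546 promoters and a 10% hold-out rounded down, this split yields 554 test sequences, which matches the test-set size given in the Figure 4 legend.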
Figure 4.
CNN model predicts promoter transcription strength at base resolution. (A) Architecture of the multitask convolutional neural network designed to predict IG promoter activities from 248-bp DNA sequences. (B) Correlation between CNN model predictions and MPRA-seq measurements of promoter strength in the test dataset (n = 554) (PCC = 0.59). (C) Correlation between CNN model predictions and MPRA-seq measurements of promoter strength for natural promoters (n = 22) (PCC = 0.59). (D) Comparison of natural promoter strength (predicted by the CNN model) for all functional mouse and human Igkv (n = 88), Ighv (n = 88), IGHV (n = 44) and IGKV (n = 25) genes.
We applied the trained model to predict the activity of natural promoters not included in the training set. The activities predicted by the CNN model correlated with the MPRA-measured activities (PCC = 0.59) (Figure 4C). Additionally, we predicted the promoter activity of all natural functional IGHV and IGKV gene promoters from mouse and human. Our predictions indicate that the strength of natural immunoglobulin V gene promoters spans a dynamic range of roughly 32-fold, and that mouse promoters are generally stronger than human promoters (Figure 4D). Thus, we have developed a CNN model that predicts the activity of synthetic B-cell-specific promoters and natural immunoglobulin V gene promoters at base resolution.
Polymorphisms in IgV gene promoters influence gene transcription
Finally, we applied the CNN model trained on our synthetic promoter dataset, which predicts promoter strength from sequence, to estimate the influence of genomic variation in immunoglobulin V gene promoter regions on transcriptional activity across global human populations. We developed a pipeline for IG promoter variant discovery from the Genome Aggregation Database (gnomAD v3.1.2) (45), which encompasses 76 156 whole genome sequences and includes genomes from the Human Genome Diversity Project (HGDP) and the 1000 Genomes Project (1KG). We extracted promoter variants of functional genes and pseudogenes (genes whose coding region has stop codon(s) and/or frameshift mutation(s), and/or a mutation affecting the initiation codon) (9) from all three loci (IGH, IGK and IGL), creating an IG promoter variant database totaling 26 412 variants, including single nucleotide polymorphisms (SNPs) and insertion/deletion (indel) events (Figure 5A, Supplementary Figures 14 and 15). We predicted the strength of these 26 412 immunoglobulin gene promoter variants with the CNN model. To validate the accuracy of the predictions, we independently constructed plasmids for 24 promoters: 12 variants and their corresponding reference sequences. Each variant was selected because its predicted activity differed from that of its reference by more than 2-fold. We measured the strength of these promoters (EGFP/mCherry) by flow cytometry. The predictions correlated strongly with the experimental measurements (PCC = 0.91) (Figure 5B). Analysis of functional promoter strength and sequence similarity revealed that promoters from the same subgroup (highly similar in their coding sequence (9)) exhibit similar predicted transcriptional activities. Interestingly, promoters of pseudogenes, not only those of functional genes, also have predicted transcriptional activity (Figure 5C, Supplementary Figure 16), possibly because they contain the same core motifs (such as CPR and Oct) as functional gene promoters.
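The selection of variants whose predicted activity differs from the reference by more than 2-fold can be sketched as below. The sketch assumes that predicted strengths are on a log2 scale, so that the log2 fold-change is a simple difference and a 2-fold change corresponds to an absolute difference greater than 1; the variant identifiers, values and function name are hypothetical.

```python
import math


def select_large_effect(predictions, min_fold=2.0):
    """Return variants whose predicted change exceeds min_fold in either direction.

    predictions maps a variant ID to (variant_log2, reference_log2) strengths.
    The returned dict maps the selected IDs to their log2 fold-change.
    """
    threshold = math.log2(min_fold)  # 2-fold corresponds to |log2FC| > 1
    selected = {}
    for variant_id, (variant_log2, reference_log2) in predictions.items():
        log2_fc = variant_log2 - reference_log2
        if abs(log2_fc) > threshold:
            selected[variant_id] = log2_fc
    return selected


# Hypothetical predicted strengths (log2 scale), illustration only.
toy_predictions = {
    "var_1": (3.5, 2.0),   # stronger than its reference
    "var_2": (1.2, 1.0),   # nearly unchanged
    "var_3": (-0.5, 1.0),  # weaker than its reference
}
selected = select_large_effect(toy_predictions)
```

Candidates chosen this way could then be paired with their reference sequences for experimental validation, as was done for the 12 variants measured by flow cytometry.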
Figure 5.
Polymorphisms in IgV gene promoters influence gene transcription. (A) Distribution of 26 412 immunoglobulin V gene promoter variants by function and locus. (B) Correlation between the relative promoter strength (variant/natural reference) of 12 variants as predicted by the CNN model and as measured by FACS (PCC = 0.91). The sequence information can be found in Supplementary Table 13. (C) Transcription strength predicted by the CNN model for 7259 functional IGHV gene promoter variants. Circos inner: phylogenetic tree of 42 IGHV gene promoters, built from the 248-bp sequences upstream of the ATG site; middle: promoter region corresponding to the IGHV gene family (lines) and name (boxes); outer: scatterplot of promoter region variant types and CNN model-predicted promoter strength, with the horizontal axis showing the length and the vertical axis the predicted promoter strength. The gray band indicates the strength range of 0–2. (D) CNN model-predicted log2 fold-change in activity of functional IGV promoter variants relative to the natural promoter, plotted relative to the position (E) and position intervals (F). (G) CNN model-predicted log2 fold-change in activity of functional IGV promoter variants relative to the natural promoter, plotted relative to the original strength (>2: strong, 0–2: medium, <0: weak). (H) Variants whose activity changes by more than 2-fold compared with the original promoter. (I) Predicted activity of IGHV3-23 and IGHV1-69 promoter variants relative to position and mutation type.
Promoter mutations (SNPs, indels) influence gene transcription levels (46,47). In our predictions, some promoter variants exhibited changes in predicted transcription strength compared with their reference promoter (the sequence from the human reference genome, GRCh38). These changes could be attributed to mutations that alter transcription factor binding affinity and cooperative interactions in the promoter sequence. Because indels change more of the sequence than SNPs do, we anticipated that SNP variants would have a smaller impact on promoter strength than indel variants. As expected, no SNP variant changed transcription strength by more than 2-fold relative to the reference promoter sequence, whereas indel variants exhibited a broader range of effects, with some changing transcription strength by more than 2-fold (Figure 5D). We speculate that the broader range of effects of indel variants may depend on their position and on their reference promoter. To investigate the relationship between the position of indel variants and the change in transcription strength, we examined the distribution of their distances from the start codon (ATG) of the immunoglobulin V gene. Indels located further from the ATG site had a smaller impact on transcriptional strength than those closer to the ATG site (Figure 5E, F). Finally, we investigated the relationship between the variant-induced change in transcription strength and the baseline strength of the reference promoter. After dividing the variants into three groups by reference promoter strength (log2; strong: >2, medium: 0–2, weak: <0), we observed that indel variants tend to increase expression strength in weak reference promoters and, conversely, tend to decrease it in naturally strong promoters (Figure 5G). 
Variants whose strength changed by more than 2-fold compared with the reference promoter followed the same trend (Figure 5H). For instance, the promoters of IGHV3-23 and IGHV1-69 show divergent tendencies. In the weaker IGHV3-23 promoter (predicted strength (log2) < 0), 3 indels are predicted to enhance its strength, whereas in the stronger IGHV1-69 promoter (predicted strength (log2) > 2), 10 indels are predicted to weaken it (Figure 5I). Overall, we used the CNN model to predict the immunoglobulin gene transcription landscape in the global human population. The predictions suggest that polymorphisms in IgV gene promoters may influence transcriptional strength, and that this influence depends on the type of variant, its position and the strength of the reference promoter.
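The grouping of variants by reference promoter strength can be sketched as follows. The thresholds (strong: >2, medium: 0–2, weak: <0, on a log2 scale) are those stated in the text; assigning the boundary values 0 and 2 to the medium group, the function names and the toy values are assumptions made for illustration.

```python
def strength_class(reference_log2):
    """Classify a reference promoter by its predicted log2 strength."""
    if reference_log2 > 2:
        return "strong"
    if reference_log2 < 0:
        return "weak"
    return "medium"  # 0 to 2, boundaries included by assumption


def direction_by_class(variants):
    """Count variants that raise or lower strength within each reference class.

    variants is an iterable of (reference_log2, variant_log2) pairs.
    """
    counts = {
        name: {"up": 0, "down": 0} for name in ("strong", "medium", "weak")
    }
    for reference_log2, variant_log2 in variants:
        delta = variant_log2 - reference_log2
        if delta == 0:
            continue  # unchanged variants are not counted
        direction = "up" if delta > 0 else "down"
        counts[strength_class(reference_log2)][direction] += 1
    return counts


# Hypothetical (reference, variant) log2 strengths, illustration only.
toy_variants = [(-0.5, 0.8), (-1.0, 0.2), (2.5, 1.0), (3.0, 1.5), (1.0, 1.1)]
counts = direction_by_class(toy_variants)
```

Tabulating the direction of change per class in this way makes a trend such as "indels raise weak promoters and lower strong ones" directly visible as a contingency table.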
Discussion
Based on MPRA testing of rationally designed synthetic promoters and deep learning, we built a CNN model that predicts the activity of synthetic B-cell-specific promoters and natural immunoglobulin V gene promoters at base resolution. Here, we used a plasmid-based episomal MPRA to assess the expression strength of synthetic promoters. Although the results of episomal MPRA are highly correlated with those of genomically integrated MPRA (11,48,49), they may not fully reflect the situation at the genomic locus. Integrating chromatin accessibility, epigenetic features, distal interactions and other factors into deep learning models of regulatory elements may yield more accurate predictions (50,51). We observed that B-cell promoter activity inversely correlates with GC content. Such a trend may be influenced by the interplay between DNA's physical properties and GC-associated biological factors such as CpGs (52–54). Deep learning models offer potential for discovering the intricate relationships among these features and provide a tool to decipher the emergence and evolution of transcription in eukaryotic organisms. Additionally, we anticipate that the development of explainable artificial intelligence will provide new insights into cell-specific regulatory mechanisms in different cell types.
Mutations of coupled cis-regulatory and coding regions synergistically shape gene evolvability (55–57). Our predictions show that mutations in the immunoglobulin gene promoter region can influence transcriptional activity. This may reflect a balancing selection mechanism in the immunoglobulin gene loci that potentially prevents severe preferential usage of specific genes, similar to that observed in MHC and other gene loci (58). The unique evolution of the immunoglobulin gene loci is intricately tied to inherent processes such as V(D)J recombination and somatic hypermutation (SHM) (59), which are intimately linked with transcription (3,60–62). Incorporating the transcriptional activity of immunoglobulin genes could provide a fresh angle on the causes and outcomes of evolution in the immunoglobulin gene loci.
Explainable artificial intelligence helps guide the rational design of regulatory sequences with specific features, such as inducible promoters (18,21). In the future, our deep learning model has the potential to automatically generate custom B-cell-specific promoters that are shorter and exhibit sequence and transcriptional diversity. These will offer specific and flexible elements for engineering B cells, unlocking their potential as protein factories and accelerating progress in B-cell therapeutics. Furthermore, an array of synthetic promoters enables de novo design and reconstruction of immunoglobulin gene loci, which encompass nearly a hundred V gene promoters. Replacing the highly similar natural promoter sequences with synthetic ones would modularize the gene loci, reducing repetitiveness in the locus, lowering the difficulty of large-scale DNA synthesis and assembly, and improving stability in vivo (52,63). This will pave the way for the development of synthetic humanized antibody animal models.
Contributor Information
Zong-Heng Fu, Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China.
Si-Zhe He, Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China.
Yi Wu, Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China.
Guang-Rong Zhao, Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China.
Data availability
The data underlying this article are available in the Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/, and can be accessed under GSE232161. All code underlying this article is available in Zenodo at https://doi.org/10.5281/zenodo.8008545.
Supplementary data
Supplementary Data are available at NAR Online.
Funding
National Key R&D Program of China, Synthetic Biology Research [2019YFA0903800]. Funding for open access charge: National Key R&D Program of China, Synthetic Biology Research [2019YFA0903800].
Conflict of interest statement. None declared.
References
- 1. Schatz D.G., Ji Y. Recombination centres and the orchestration of V(D)J recombination. Nat. Rev. Immunol. 2011; 11:251–263.
- 2. Pelanda R., Torres R.M. Central B-cell tolerance: where selection begins. Cold Spring Harb. Perspect. Biol. 2012; 4:a007146.
- 3. Bevington S., Boyes J. Transcription-coupled eviction of histones H2A/H2B governs V(D)J recombination. EMBO J. 2013; 32:1381–1392.
- 4. Schram B.R., Tze L.E., Ramsey L.B., Liu J., Najera L., Vegoe A.L., Hardy R.R., Hippen K.L., Farrar M.A., Behrens T.W. B cell receptor basal signaling regulates antigen-induced Ig light chain rearrangements. J. Immunol. 2008; 180:4728–4741.
- 5. Rowland S.L., DePersis C.L., Torres R.M., Pelanda R. Ras activation of Erk restores impaired tonic BCR signaling and rescues immature B cell differentiation. J. Exp. Med. 2010; 207:607–621.
- 6. Cheng R.Y.-H., Hung K.L., Zhang T., Stoffers C.M., Ott A.R., Suchland E.R., Camp N.D., Khan I.F., Singh S., Yang Y.-J. et al. Ex vivo engineered human plasma cells exhibit robust protein secretion and long-term engraftment in vivo. Nat. Commun. 2022; 13:6110.
- 7. Huang D., Tran J.T., Olson A., Vollbrecht T., Tenuta M., Guryleva M.V., Fuller R.P., Schiffner T., Abadejos J.R., Couvrette L. et al. Vaccine elicitation of HIV broadly neutralizing antibodies from engineered B cells. Nat. Commun. 2020; 11:5850.
- 8. Nahmad A.D., Lazzarotto C.R., Zelikson N., Kustin T., Tenuta M., Huang D., Reuveni I., Nataf D., Raviv Y., Horovitz-Fried M. et al. In vivo engineered B cells secrete high titers of broadly neutralizing anti-HIV antibodies in mice. Nat. Biotechnol. 2022; 40:1241–1249.
- 9. Manso T., Folch G., Giudicelli V., Jabado-Michaloud J., Kushwaha A., Nguefack Ngoune V., Georga M., Papadaki A., Debbagh C., Pégorier P. et al. IMGT® databases, related tools and web resources through three main axes of research and development. Nucleic Acids Res. 2022; 50:D1262–D1272.
- 10. Peng K., Safonova Y., Shugay M., Popejoy A.B., Rodriguez O.L., Breden F., Brodin P., Burkhardt A.M., Bustamante C., Cao-Lormeau V.-M. et al. Diversity in immunogenomics: the value and the challenge. Nat. Methods. 2021; 18:588–591.
- 11. Davis J.E., Insigne K.D., Jones E.M., Hastings Q.A., Boldridge W.C., Kosuri S. Dissection of c-AMP response element architecture by using genomic and episomal massively parallel reporter assays. Cell Syst. 2020; 11:75–85.
- 12. Zrimec J., Fu X., Muhammad A.S., Skrekas C., Jauniskis V., Speicher N.K., Börlin C.S., Verendel V., Chehreghani M.H., Dubhashi D. et al. Controlling gene expression with deep generative design of regulatory DNA. Nat. Commun. 2022; 13:5099.
- 13. de Boer C.G., Vaishnav E.D., Sadeh R., Abeyta E.L., Friedman N., Regev A. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 2020; 38:56–65.
- 14. Zhao S., Hong C.K.Y., Myers C.A., Granas D.M., White M.A., Corbo J.C., Cohen B.A. A single-cell massively parallel reporter assay detects cell-type-specific gene regulation. Nat. Genet. 2023; 55:346–354.
- 15. Gordon M.G., Inoue F., Martin B., Schubach M., Agarwal V., Whalen S., Feng S., Zhao J., Ashuach T., Ziffra R. et al. lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements. Nat. Protoc. 2020; 15:2387–2412.
- 16. Gallego Romero I., Lea A.J. Leveraging massively parallel reporter assays for evolutionary questions. Genome Biol. 2023; 24:26.
- 17. Kotopka B.J., Smolke C.D. Model-driven generation of artificial yeast promoters. Nat. Commun. 2020; 11:2113.
- 18. Yu T.C., Liu W.L., Brinck M.S., Davis J.E., Shek J., Bower G., Einav T., Insigne K.D., Phillips R., Kosuri S. et al. Multiplexed characterization of rationally designed promoter architectures deconstructs combinatorial logic for IPTG-inducible systems. Nat. Commun. 2021; 12:325.
- 19. LaFleur T.L., Hossain A., Salis H.M. Automated model-predictive design of synthetic promoters to control transcriptional profiles in bacteria. Nat. Commun. 2022; 13:5159.
- 20. Cai Y.-M., Kallam K., Tidd H., Gendarini G., Salzman A., Patron N.J. Rational design of minimal synthetic promoters for plants. Nucleic Acids Res. 2020; 48:11845–11856.
- 21. Zhang P., Wang H., Xu H., Wei L., Hu Z., Wang X. Deep flanking sequence engineering for efficient promoter design. Nat. Commun. 2023; 14:6309.
- 22. Yang I.S., Bae S.W., Park B., Kim S. Development of a program for in silico optimized selection of oligonucleotide-based molecular barcodes. PLoS One. 2021; 16:e0246354.
- 23. Zorita E., Cuscó P., Filion G.J. Starcode: sequence clustering based on all-pairs search. Bioinformatics. 2015; 31:1913–1919.
- 24. Li W., Xu H., Xiao T., Cong L., Love M.I., Zhang F., Irizarry R.A., Liu J.S., Brown M., Liu X.S. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol. 2014; 15:554.
- 25. Lee E.-C., Liang Q., Ali H., Bayliss L., Beasley A., Bloomfield-Gerdes T., Bonoli L., Brown R., Campbell J., Carpenter A. et al. Complete humanization of the mouse immunoglobulin loci enables efficient therapeutic antibody discovery. Nat. Biotechnol. 2014; 32:356–363.
- 26. Murphy A.J., Macdonald L.E., Stevens S., Karow M., Dore A.T., Pobursky K., Huang T.T., Poueymirou W.T., Esau L., Meola M. et al. Mice with megabase humanization of their immunoglobulin genes generate antibodies as efficiently as normal mice. Proc. Natl. Acad. Sci. U.S.A. 2014; 111:5153–5158.
- 27. Macdonald L.E., Karow M., Stevens S., Auerbach W., Poueymirou W.T., Yasenchak J., Frendewey D., Valenzuela D.M., Giallourakis C.C., Alt F.W. et al. Precise and in situ genetic humanization of 6 Mb of mouse immunoglobulin genes. Proc. Natl. Acad. Sci. U.S.A. 2014; 111:5147–5152.
- 28. Xu J., Xu K., Jung S., Conte A., Lieberman J., Muecksch F., Lorenzi J.C.C., Park S., Schmidt F., Wang Z. et al. Nanobodies from camelid mice and llamas neutralize SARS-CoV-2 variants. Nature. 2021; 595:278–282.
- 29. Bailey T.L., Johnson J., Grant C.E., Noble W.S. The MEME Suite. Nucleic Acids Res. 2015; 43:W39.
- 30. Ge W., Meier M., Roth C., Söding J. Bayesian Markov models improve the prediction of binding motifs beyond first order. NAR Genom. Bioinform. 2021; 3:lqab026.
- 31. Castro-Mondragon J.A., Riudavets-Puig R., Rauluseviciute I., Berhanu Lemma R., Turchi L., Blanc-Mathieu R., Lucas J., Boddie P., Khan A., Manosalva Pérez N. et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2022; 50:D165–D173.
- 32. Kulakovskiy I.V., Vorontsov I.E., Yevshin I.S., Sharipov R.N., Fedorova A.D., Rumynskiy E.I., Medvedeva Y.A., Magana-Mora A., Bajic V.B., Papatsenko D.A. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 2018; 46:D252–D259.
- 33. Gupta S., Stamatoyannopoulos J.A., Bailey T.L., Noble W.S. Quantifying similarity between motifs. Genome Biol. 2007; 8:R24.
- 34. Wirth T., Staudt L., Baltimore D. An octamer oligonucleotide upstream of a TATA motif is sufficient for lymphoid-specific promoter activity. Nature. 1987; 329:174–178.
- 35. Aranburu A., Carlsson R., Persson C., Leanderson T. Transcription factor AP-4 is a ligand for immunoglobulin-kappa promoter E-box elements. Biochem. J. 2001; 354:431–438.
- 36. Aranburu A., Bennett M., Leanderson T. The κ promoter penta-decamer binding protein CBF-A interacts specifically with nucleophosmin in the nucleus only. Mol. Immunol. 2006; 43:690–701.
- 37. Bemark M., Leanderson T. Diverse transcription factors are involved in the quantitative regulation of transcriptional activation of χ promoters. Eur. J. Immunol. 1997; 27:1308–1318.
- 38. Kim D., Schmidt C., Brown M.A., Tucker H. Competitive promoter-associated matrix attachment region binding of the Arid3a and Cux1 transcription factors. Diseases. 2017; 5:34.
- 39. Roy A.L., Sen R., Roeder R.G. Enhancer-promoter communication and transcriptional regulation of Igh. Trends Immunol. 2011; 32:532–539.
- 40. Nutt S.L., Kee B.L. The transcriptional regulation of B cell lineage commitment. Immunity. 2007; 26:715–725.
- 41. Khan A., Riudavets Puig R., Boddie P., Mathelier A. BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences. Bioinformatics. 2021; 37:1607–1609.
- 42. McInnes L., Healy J., Saul N., Großberger L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Software. 2018; 3:861.
- 43. Weingarten-Gabbay S., Nir R., Lubliner S., Sharon E., Kalma Y., Weinberger A., Segal E. Systematic interrogation of human promoters. Genome Res. 2019; 29:171–183.
- 44. Jores T., Tonnies J., Wrightsman T., Buckler E.S., Cuperus J.T., Fields S., Queitsch C. Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters. Nat. Plants. 2021; 7:842–855.
- 45. Chen S., Francioli L.C., Goodrich J.K., Collins R.L., Kanai M., Wang Q., Alföldi J., Watts N.A., Vittal C., Gauthier L.D. et al. A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. bioRxiv doi: 10.1101/2022.03.20.485034, 10 October 2022, preprint: not peer reviewed.
- 46. Cheung V.G., Spielman R.S. Genetics of human gene expression: mapping DNA variants that influence gene expression. Nat. Rev. Genet. 2009; 10:595–604.
- 47. Agarwal A., Zhao F., Jiang Y., Chen L. TIVAN-indel: a computational framework for annotating and predicting non-coding regulatory small insertions and deletions. Bioinformatics. 2023; 39:btad060.
- 48. Schofield J.A., Hahn S. Broad compatibility between yeast UAS elements and core promoters and identification of promoter elements that determine cofactor specificity. Cell Rep. 2023; 42:112387.
- 49. Griesemer D., Xue J.R., Reilly S.K., Ulirsch J.C., Kukreja K., Davis J.R., Kanai M., Yang D.K., Butts J.C., Guney M.H. et al. Genome-wide functional screen of 3′UTR variants uncovers causal variants for human disease and evolution. Cell. 2021; 184:5247–5260.
- 50. Avsec Ž., Agarwal V., Visentin D., Ledsam J.R., Grabska-Barwinska A., Taylor K.R., Assael Y., Jumper J., Kohli P., Kelley D.R. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods. 2021; 18:1196–1203.
- 51. Karollus A., Mauermeier T., Gagneur J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 2023; 24:56.
- 52. Hossain A., Lopez E., Halper S.M., Cetnar D.P., Reis A.C., Strickland D., Klavins E., Salis H.M. Automated design of thousands of nonrepetitive parts for engineering stable genetic systems. Nat. Biotechnol. 2020; 38:1466–1475.
- 53. Khuu P., Sandor M., DeYoung J., Ho P.S. Phylogenomic analysis of the emergence of GC-rich transcription elements. Proc. Natl. Acad. Sci. U.S.A. 2007; 104:16528–16533.
- 54. Johns N.I., Gomes A.L.C., Yim S.S., Yang A., Blazejewski T., Smillie C.S., Smith M.B., Alm E.J., Kosuri S., Wang H.H. Metagenomic mining of regulatory elements enables programmable species-selective gene expression. Nat. Methods. 2018; 15:323–329.
- 55. Cisneros A.F., Gagnon-Arsenault I., Dubé A.K., Després P.C., Kumar P., Lafontaine K., Pelletier J.N., Landry C.R. Epistasis between promoter activity and coding mutations shapes gene evolvability. Sci. Adv. 2023; 9:eadd9109.
- 56. Vuolo F., Mentink R.A., Hajheidari M., Bailey C.D., Filatov D.A., Tsiantis M. Coupled enhancer and coding sequence evolution of a homeobox gene shaped leaf diversity. Genes Dev. 2016; 30:2370–2375.
- 57. Li X., Lalić J., Baeza-Centurion P., Dhar R., Lehner B. Changes in gene expression predictably shift and switch genetic interactions. Nat. Commun. 2019; 10:3886.
- 58. Meyer D., Aguiar V.R.C., Bitarello B.D., Brandt D.Y.C., Nunes K. A genomic perspective on HLA evolution. Immunogenetics. 2018; 70:5–27.
- 59. Pennell M., Rodriguez O.L., Watson C.T., Greiff V. The evolutionary and functional significance of germline immunoglobulin gene variation. Trends Immunol. 2023; 44:7–21.
- 60. Espinoza C.R., Feeney A.J. The extent of histone acetylation correlates with the differential rearrangement frequency of individual VH genes in Pro-B cells. J. Immunol. 2005; 175:6668–6675.
- 61. Storb U. Chapter seven - Why does somatic hypermutation by AID require transcription of its target genes. In: Alt F.W. (ed). Advances in Immunology. 2014; 122:Academic Press; 253–277.
- 62. Liu M., Schatz D.G. Balancing AID and DNA repair during somatic hypermutation. Trends Immunol. 2009; 30:173–181.
- 63. Reis A.C., Halper S.M., Vezeau G.E., Cetnar D.P., Hossain A., Clauer P.R., Salis H.M. Simultaneous repression of multiple bacterial genes using nonrepetitive extra-long sgRNA arrays. Nat. Biotechnol. 2019; 37:1294–1301.